soundex

Difference between version 29 and 30 - Previous - Next
Soundex is an algorithm designed by Knuth.  It is a kind of [hash] for words, where words that ''sound'' similar will have hash to similar values.
It can be used in a pattern matching algorithm for similar words which could be misspellings of one another, or mistaken for each other in speech (especially the output from speech recognition).

** Official Documentation **

There is an implementation of Knuth's soundex in [tcllib]:
   https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/soundex/soundex.md:   

** See also **

   * [similarity] includes several implementation of Levenshtein (edit) Distance, and links to related algorithms
   * Lawrence Philips' Metaphone and Double Metaphone algorithms can be found here: http://aspell.sourceforge.net/metaphone/
   * Another implementation of Knuth's soundex in Tcl is at [Evan Rempel]'s page, [http://web.uvic.ca/~erempel/tcl/Soundex/Soundex.html].  [AK] consulted this implementation when developing the [tcllib] module.
   * [Tclsnowball] contains a binding to some stemmers
   * [SQLite] includes a https://sqlite.org/lang_corefunc.html#soundex%|%soundex() function%|%, although it is not enabled by default at compile time.

----

** Discussion **

[AK]:
Feel free to add more algorithms here, polish existing one, and/or give
references to papers, implementations, etc. I.e. make this a staging are
for things which can go into the soundex module of [tcllib].

I got myself essentially two soundexes, the one from this page and the
[Knuth] one, implemented by [Evan Rempel]. When I found out that the
algorithm here differs just in a minor detail from the Knuth (see later
on this page) I decided to use the Knuth one as seed for the module.
The other algorithm here, metaphone is a completely unknown to me and
thus I am hesitant to just add it. Especially given the comments about
first pass / draft and the unusual equivalences.

----
[Michael Schlenker]:Should the tcllib soundex module be reserved for such sounds-like pattern matching algorithms 
or could more linguistic tools be added (like stemmers, see for example [Tclsnowball] 
for a Tcl binding to some stemmers.)?

[AK]: Hm. ... This could be a question for the tcllib-devel mailing list in general.

Question: What does a stemmer do ?
Answer: A stemmer tries to find the stem of words. This is useful for mapping plural and 
singular forms of words, and other forms of basically the same word to a common stem. 

----[DKF]: This code has been greatly tightened and should be clearer too.  
Some idioms work better in Tcl than they do in other languages, so transcribing 
an algorithm from C is not always straight-forward...
======
## Be nice and friendly with namespaces
namespace eval ::soundex {namespace export soundex}

## Set up some static data only once
array set ::soundex::soundexcode {
    a 0   b 1   c 2   d 3   e 0   f 1   g 2
    h 0   i 0   j 2   k 2   l 4   m 5   n 5
    o 0   p 1   q 2   r 6   s 2   t 3   u 0
    v 1   w 0   x 2   y 0   z 2
}

proc ::soundex::soundex {string} {
    variable soundexcode

    ## force lowercase and strip out all non-alphabetic characters
    regsub -all {[^a-z]} [string tolower $string] {} letters

    ## the null string is code Z000
    if {![string length $letters]} { return Z000 }

    set last -1
    set key  {}

    ## scan until end of string or until the key is built
    foreach char [split $letters {}] {
        set code $soundexcode($char)
        ## Fold together adjacent letters with the same code
        if {$last != $code} {
            set last $code
            ## Ignore code==0 letters except as separators
            if {$last} {
                append key $last
                ## Only need the first four
                if {[string length $key] >= 4} {break}
            }
        }
    }

    ## normalise by adding zeros to get four characters
    string range "[string toupper $key]0000" 0 3
}
======[DKF]: Are soundex codes all numeric except for the one for the empty string?  Or should the append really be an append of ''$char'' instead?

[AK]: Donal, the algorithm above is very very near to the soundex algorithm by Knuth. The Z000 is possibly a remnant of that. The Knuth algorithm keeps the the first letter of the word (in uppercase) whereas the algorithm here converts this letter to a soundex code too. I noticed when I ran this one over the examples provided for the Knuth soundex and it came out identical for all the examples, except for the first position of the result.
[LV]: There are some alternative algorithms to soundex that attempt to achieve similar functionality.  
I recall seeing several coded for Perl.
The benefit was that they achieved varying degrees of better matches.Anyone familiar with alternatives?

----
[MG] Apr 29th 2004 - I wrote a soundex package myself about a year ago, shortly before 
I found out Tcllib had one included. Just found it again, and thought I'd share it. 
It uses the same algorithm as Knuth, I believe, though is a fair bit slower. 
The only 'advantage' is the code is smaller, in terms of numbers of characters.
======
proc soundex-mg {word} {

    if {![string is alpha -strict $word]} {
        error "argument must contain only Unicode alphabet characters";
    }

    set word [string toupper $word]
        if {[string match "PH*" $word]} {
            set first "F"
            set rest [string range $word 2 end]
        } else {
            set first [string range $word 0 0]
            set rest [string range $word 1 end]
        }

    set map {A "" E "" I "" O "" U "" H "" W "" Y "" B 1 P 1 F 1 V 1 C 2 G 2 J 2 K 2 Q 2 S 2 X 2 Z 2 D 3 T 3 L 4 M 5 N 5 R 6}

    regsub -all {(.)\1+} [string map $map $rest] {\1} rest

    return [string range "$first${rest}000" 0 3]; 
};# soundex-mg
======----
'''[TWu]''' 2025-07-16 09:55: For the German language there is a similar algorithm named "Kölner Phonetik" (Cologne phonetics).<<br>>
It was created for searching in phone-books to find names with similar sound.<<br>>
You can find on https://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik / https://en.wikipedia.org/wiki/Cologne_phonetics an explanation.

<<categories>> Command | Tcllib | String Processing | Human Language