Soundex is an algorithm designed by Knuth. It is a kind of [hash] for words, where words that ''sound'' similar will have hash to similar values.
It can be used in a pattern matching algorithm for similar words which could be misspellings of one another, or mistaken for each other in speech (especially the output from speech recognition).
** Official Documentation **
There is an implementation of Knuth's soundex in [tcllib]:
https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/soundex/soundex.md:
** See also **
* [similarity] includes several implementation of Levenshtein (edit) Distance, and links to related algorithms
* Lawrence Philips' Metaphone and Double Metaphone algorithms can be found here: http://aspell.sourceforge.net/metaphone/
* Another implementation of Knuth's soundex in Tcl is at [Evan Rempel]'s page, [http://web.uvic.ca/~erempel/tcl/Soundex/Soundex.html]. [AK] consulted this implementation when developing the [tcllib] module.
* [Tclsnowball] contains a binding to some stemmers
* [SQLite] includes a https://sqlite.org/lang_corefunc.html#soundex%|%soundex() function%|%, although it is not enabled by default at compile time.
----
** Discussion **
[AK]:
Feel free to add more algorithms here, polish existing one, and/or give
references to papers, implementations, etc. I.e. make this a staging are
for things which can go into the soundex module of [tcllib].
I got myself essentially two soundexes, the one from this page and the
[Knuth] one, implemented by [Evan Rempel]. When I found out that the
algorithm here differs just in a minor detail from the Knuth (see later
on this page) I decided to use the Knuth one as seed for the module.
The other algorithm here, metaphone is a completely unknown to me and
thus I am hesitant to just add it. Especially given the comments about
first pass / draft and the unusual equivalences.
----
[Michael Schlenker]:Should the tcllib soundex module be reserved for such sounds-like pattern matching algorithms
or could more linguistic tools be added (like stemmers, see for example [Tclsnowball]
for a Tcl binding to some stemmers.)?
[AK]: Hm. ... This could be a question for the tcllib-devel mailing list in general.
Question: What does a stemmer do ?
Answer: A stemmer tries to find the stem of words. This is useful for mapping plural and
singular forms of words, and other forms of basically the same word to a common stem.
----[DKF]: This code has been greatly tightened and should be clearer too.
Some idioms work better in Tcl than they do in other languages, so transcribing
an algorithm from C is not always straight-forward...
======
## Be nice and friendly with namespaces
namespace eval ::soundex {namespace export soundex}
## Set up some static data only once
array set ::soundex::soundexcode {
a 0 b 1 c 2 d 3 e 0 f 1 g 2
h 0 i 0 j 2 k 2 l 4 m 5 n 5
o 0 p 1 q 2 r 6 s 2 t 3 u 0
v 1 w 0 x 2 y 0 z 2
}
proc ::soundex::soundex {string} {
variable soundexcode
## force lowercase and strip out all non-alphabetic characters
regsub -all {[^a-z]} [string tolower $string] {} letters
## the null string is code Z000
if {![string length $letters]} { return Z000 }
set last -1
set key {}
## scan until end of string or until the key is built
foreach char [split $letters {}] {
set code $soundexcode($char)
## Fold together adjacent letters with the same code
if {$last != $code} {
set last $code
## Ignore code==0 letters except as separators
if {$last} {
append key $last
## Only need the first four
if {[string length $key] >= 4} {break}
}
}
}
## normalise by adding zeros to get four characters
string range "[string toupper $key]0000" 0 3
}
======[DKF]: Are soundex codes all numeric except for the one for the empty string? Or should the append really be an append of ''$char'' instead?
[AK]: Donal, the algorithm above is very very near to the soundex algorithm by Knuth. The Z000 is possibly a remnant of that. The Knuth algorithm keeps the the first letter of the word (in uppercase) whereas the algorithm here converts this letter to a soundex code too. I noticed when I ran this one over the examples provided for the Knuth soundex and it came out identical for all the examples, except for the first position of the result.
[LV]: There are some alternative algorithms to soundex that attempt to achieve similar functionality.
I recall seeing several coded for Perl.
The benefit was that they achieved varying degrees of better matches.Anyone familiar with alternatives?
----
[MG] Apr 29th 2004 - I wrote a soundex package myself about a year ago, shortly before
I found out Tcllib had one included. Just found it again, and thought I'd share it.
It uses the same algorithm as Knuth, I believe, though is a fair bit slower.
The only 'advantage' is the code is smaller, in terms of numbers of characters.
======
proc soundex-mg {word} {
if {![string is alpha -strict $word]} {
error "argument must contain only Unicode alphabet characters";
}
set word [string toupper $word]
if {[string match "PH*" $word]} {
set first "F"
set rest [string range $word 2 end]
} else {
set first [string range $word 0 0]
set rest [string range $word 1 end]
}
set map {A "" E "" I "" O "" U "" H "" W "" Y "" B 1 P 1 F 1 V 1 C 2 G 2 J 2 K 2 Q 2 S 2 X 2 Z 2 D 3 T 3 L 4 M 5 N 5 R 6}
regsub -all {(.)\1+} [string map $map $rest] {\1} rest
return [string range "$first${rest}000" 0 3];
};# soundex-mg
======----
'''[TWu]''' 2025-07-16 09:55: For the German language there is a similar algorithm named "Kölner Phonetik" (Cologne phonetics).<<br>>
It was created for searching in phone-books to find names with similar sound.<<br>>
You can find on https://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik / https://en.wikipedia.org/wiki/Cologne_phonetics an explanation.
<<categories>> Command | Tcllib | String Processing | Human Language