Why Are We Limited to Soundex?

3 Oct 2024 8:38 AM | Anonymous

Genealogists love Soundex, a method of matching names that have similar sounds but may be spelled differently. In fact, Soundex became popular amongst genealogists almost as soon as it was invented in 1918. Soundex was patented by Robert C. Russell of Pittsburgh, Pennsylvania, and is sometimes called the “Russell Code.” The U.S. Census Bureau immediately adopted Soundex for indexing census records. Since then, others have used the Soundex code to sort similar-sounding names for telephone books, work records, drivers' licenses, and many other purposes. I noticed that the first four characters of my driver's license number are “E235,” the Soundex code for my last name.

Genealogists use Soundex to find variant spellings of ancestors' names. Almost all modern genealogy databases have a "search by Soundex" capability. 

Soundex is a form of "phonetic encoding" or "sound-alike" codes. A Soundex code consists of one letter followed by three digits. For instance, Smith and Smythe both are coded as S530, Eastman is E235, and Williams is W452. 
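As a rough illustration of how such a code is derived, here is a minimal Python sketch of the commonly published American Soundex rules: keep the first letter, map the remaining consonants to digits, ignore vowels, collapse adjacent letters that share a digit, and pad or truncate to three digits. The function name `soundex` and the table name `CODES` are my own, and the letter groupings come from the widely published rules rather than from anything spelled out in this article. Individual databases may differ in detail.

```python
# Digit assigned to each consonant group in the commonly published rules.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return a four-character Soundex code (hypothetical helper)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # digit of the previous coded letter
    for ch in letters[1:]:
        code = CODES.get(ch, "")  # vowels, H, W and Y have no digit
        if code and code != prev:
            digits.append(code)
        # H and W do not separate letters that share a digit; vowels do.
        if ch not in "HW":
            prev = code
    # Pad with zeros, then keep one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


for surname in ("Smith", "Smythe", "Eastman", "Williams"):
    print(surname, soundex(surname))
```

Under these assumptions the loop prints S530 for both Smith and Smythe, E235 for Eastman, and W452 for Williams.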

If you search many records of interest to genealogists, sooner or later you will need to use Soundex codes. Why? Well, you can often find a person's entry by his or her Soundex code, even when the names have been misspelled. This becomes important when you realize that many census takers did not speak the language of the people being enumerated. In fact, in the first 150 years of U.S. census records, the majority of Americans were illiterate and did not know how to write their own last names. Spellings on census and other public records varied widely. The spelling of many family names also has changed over the years, but often the Soundex code remains the same. Soundex can be a big help in finding the same family in different databases that have different spellings. 

As good as Soundex is, it suffers from numerous shortcomings. For example, Korbin and Corbin have two different Soundex codes, even though they sound exactly alike. The same is true for Kramer/Cramer, Kreighton/Creiton, Leighton/Layton, Phifer/Pheiffer/Fifer, Coghburn/Coburn and many others. At the same time, the names "Robert" and "Rupert" are pronounced differently, yet both have the same Soundex code, R163.

Of course, such shortcomings in Soundex create problems for genealogists. Sometimes Soundex can find similar-sounding names, but often it does not. You may be searching a database that contains information about your ancestors, but you will never know that because you cannot find them, either by exact spelling or by the inexact Soundex system. Fortunately, better solutions are available. 

Soundex was "state of the art" technology in 1918, but numerous improved methods have been invented since then. Each one is more accurate than the original Soundex system. Yet none of the new and improved systems has ever achieved much popularity in the genealogy world. Admittedly, the Daitch-Mokotoff Soundex System has achieved some popularity because it handles the distinctive sounds of surnames found in Jewish genealogy; however, it has seen little use elsewhere. For more information about the Daitch-Mokotoff Soundex System, see https://www.avotaynu.com/soundex.htm

Several newer phonetic matching methods have been invented over the years. Steve Morse published an excellent article describing many of them in the March 2010 issue of the Association of Professional Genealogists Quarterly. However, this great explanation hasn't received much publicity. In the article, Steve provides information not only about the Russell Soundex system of 1918, but also about the following methods:

American Soundex – 1930

Daitch-Mokotoff Soundex – 1985 

Metaphone – 1990

Double Metaphone – 2000

Beider-Morse Phonetic Matching – 2008

Steve also provides examples of the strengths and shortcomings of the various methods. If you have an interest in improved Soundex methods, I suggest you read Steve Morse's article at http://stevemorse.org/phonetics/bmpm2.htm.

One newer method, called the Double Metaphone Search Algorithm, promises to perform far more accurate name matching than anything available before. Double Metaphone’s inventor is Lawrence Philips, a Software Engineer at Verity, Inc. Philips has donated the algorithm to the public domain so that it can easily be used in any application, genealogy-related or not.

Double Metaphone provides much more accurate matches to the surnames typically found in North America, including most of those that originated in various European countries. Unlike Soundex, Double Metaphone handles different pronunciations of the same letters. Typical examples would include the letters "gh" that are pronounced differently in "light" and "rough" or the letters "ch" that are pronounced differently in "children" and "orchestra." It even handles silent letters properly, such as the "k" in "knight" and the letter "b" in "dumb" and "plumb."

Double Metaphone handles pronunciations of names from Italian, Spanish, and French, and from various Germanic and Slavic languages.

The Double Metaphone codes can be as short as one letter (for the name "Lee") or can extend to eight or possibly more letters. However, the code seems to be highly accurate, even when limited to four characters.

Here are examples of Double Metaphone codes for a number of surnames:

Ashcraft - code: AXKR

Ashcroft - code: AXKR

Eastman - code: ESTM

Jansen - code: JNSN

Jansson - code: JNSN

Jensen - code: JNSN

Johnson - code: JNSN

Johnsson - code: JNSN

Law - code: L

Lea - code: L

Leah - code: L

Lee - code: L

Leigh - code: L

Lew - code: L

Li - code: L

Lopes - code: LPS 

Lopez - code: LPS 

Mallory - code: MLR

Malorie - code: MLR

Malory - code: MLR

Mellar - code: MLR

Millar - code: MLR

Miller - code: MLR

Millur - code: MLR

Mueller - code: MLR

Muller - code: MLR

Williams - code: WLMS

Williamsen - code: WLMS

Williamson - code: WLMS

Here are the Double Metaphone codes for the "problem names" that I mentioned earlier as not being handled properly in Soundex:

Kramer - code: KRMR

Cramer - code: KRMR

Kreighton - code: KRTN

Creiton - code: KRTN

Creighton - code: KRTN

Leighton - code: LTN 

Layton - code: LTN

Phifer - code: FFR

Pheiffer - code: FFR 

Fifer - code: FFR

Coghburn - code: KBRN 

Coburn - code: KBRN

As you can see, Double Metaphone handles all of these properly. To be sure, this new system still isn't perfect. If you search long enough, you can find a few non-matches. For instance, my last name of Eastman produces a Double Metaphone code of ESTM, and yet my early ancestors often had the name spelled Easman (without the letter “t”), which produces a Double Metaphone code of ESMN. The two names sound almost the same, but the Double Metaphone codes are different. However, the number of non-matches is far lower with Double Metaphone than with Soundex.

The algorithms used in Double Metaphone are complex. Inventor Lawrence Philips assumes that a computer will always be used to create the codes. Algorithms in BASIC, C++, C#, Perl, PHP, Java, and a number of other programming languages are available if you start at http://goo.gl/IgYra. 

Here are the Metaphone Rules, explained in English:

Metaphone reduces the alphabet to 16 consonant sounds:

B X S K J T F H L M N P R 0 W Y

That isn't an O but a zero - representing the 'th' sound.

Transformations

Metaphone uses the following transformation rules: 

Doubled letters except "c" -> drop 2nd letter.

Vowels are only kept when they are the first letter.

B -> B unless at the end of a word after "m" as in "dumb"

C -> X (sh) if -cia- or -ch-
     S if -ci-, -ce- or -cy-
     K otherwise, including -sch-

D -> J if in -dge-, -dgy- or -dgi-
     T otherwise

F -> F

G -> silent if in -gh- and not at end or before a vowel
     silent if in -gn- or -gned- (also see dge etc. above)
     J if before i or e or y if not double gg
     K otherwise

H -> silent if after vowel and no vowel follows
     H otherwise

J -> J

K -> silent if after "c"
     K otherwise

L -> L

M -> M

N -> N

P -> F if before "h"
     P otherwise

Q -> K

R -> R

S -> X (sh) if before "h" or in -sio- or -sia-
     S otherwise

T -> X (sh) if -tia- or -tio-
     0 (th) if before "h"
     silent if in -tch-
     T otherwise

V -> F

W -> silent if not followed by a vowel
     W if followed by a vowel

X -> KS

Y -> silent if not followed by a vowel
     Y if followed by a vowel

Z -> S

Initial Letter Exceptions 

Initial kn-, gn-, pn-, ae- or wr- -> drop first letter

Initial x- -> change to "s"

Initial wh- -> change to "w"

The code is truncated at 4 characters in this example, but more could be used.
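To show how rules like these turn into code, here is a small Python sketch that applies a subset of them. It is not a complete Metaphone implementation and certainly not Double Metaphone. The function name `metaphone_subset` is my own. The -gh- and double-g rules for G are left out, only the kn-, gn-, pn-, wr-, wh- and x- initial exceptions are handled, and treating the "h" in "ch", "ph", "sh" and "th" as consumed by the preceding letter is an assumption that the rule list does not state.

```python
VOWELS = set("AEIOU")
SOFTENERS = set("IEY")  # letters that soften a preceding C or G
SAME = set("FJLMNR")    # letters that map to themselves


def metaphone_subset(name, max_len=4):
    """Encode a name with a subset of the rules above (hypothetical helper)."""
    word = "".join(ch for ch in name.upper() if ch.isalpha())
    # Initial-letter exceptions (subset only).
    if word[:2] in ("KN", "GN", "PN", "WR"):
        word = word[1:]
    elif word[:2] == "WH":
        word = "W" + word[2:]
    elif word[:1] == "X":
        word = "S" + word[1:]
    # Doubled letters except C: drop the second letter.
    collapsed = []
    for ch in word:
        if not (collapsed and ch == collapsed[-1] and ch != "C"):
            collapsed.append(ch)
    word = "".join(collapsed)

    code = []
    i = 0
    while i < len(word):
        ch = word[i]
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1:i + 2]    # next letter, "" at the end of the word
        after = word[i + 1:i + 3]  # next two letters
        step = 1                   # letters consumed by the current rule
        if ch in VOWELS:
            if i == 0:
                code.append(ch)
        elif ch in SAME:
            code.append(ch)
        elif ch == "B":
            if not (prev == "M" and nxt == ""):
                code.append("B")
        elif ch == "C":
            if nxt == "H":
                code.append("K" if prev == "S" else "X")
                step = 2           # assumption: the H belongs to the digraph
            elif after == "IA":
                code.append("X")
            else:
                code.append("S" if nxt in SOFTENERS else "K")
        elif ch == "D":
            if nxt == "G" and word[i + 2:i + 3] in SOFTENERS:
                code.append("J")
                step = 2           # the G is silent in -dge-, -dgy-, -dgi-
            else:
                code.append("T")
        elif ch == "G":
            # Simplified: the -gh- and double-g rules are not handled.
            if nxt != "N":
                code.append("J" if nxt in SOFTENERS else "K")
        elif ch == "H":
            if not (prev in VOWELS and nxt not in VOWELS):
                code.append("H")
        elif ch == "K":
            if prev != "C":
                code.append("K")
        elif ch == "P":
            code.append("F" if nxt == "H" else "P")
            step = 2 if nxt == "H" else 1
        elif ch == "Q":
            code.append("K")
        elif ch == "S":
            if nxt == "H":
                code.append("X")
                step = 2
            else:
                code.append("X" if after in ("IO", "IA") else "S")
        elif ch == "T":
            if after in ("IA", "IO"):
                code.append("X")
            elif nxt == "H":
                code.append("0")
                step = 2
            elif after != "CH":
                code.append("T")
        elif ch == "V":
            code.append("F")
        elif ch in "WY":
            if nxt in VOWELS:
                code.append(ch)
        elif ch == "X":
            code.append("KS")
        elif ch == "Z":
            code.append("S")
        i += step
    return "".join(code)[:max_len]


for surname in ("Miller", "Mueller", "Kramer", "Cramer", "Phifer", "Fifer"):
    print(surname, metaphone_subset(surname))
```

For these simple names the sketch prints MLR, MLR, KRMR, KRMR, FFR and FFR, which agree with the Double Metaphone codes listed earlier. Names that rely on the omitted rules, such as those containing "gh", will not be coded correctly by this subset.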

Programmers may find more information, including sample Double Metaphone programming code, at a number of web sites, including: http://aspell.sourceforge.net/metaphone/

Indeed, it appears that Double Metaphone codes are far more accurate at identifying sound-alike names that are spelled differently. So why aren't we using this improved method in genealogy applications?

My guess is that the only thing stopping us – and the programmers – is inertia: we are so used to Soundex that we don't want to change, even if a far better solution is available right now.

If all genealogy databases used Double Metaphone codes, thousands of genealogists could find ancestors already documented that have previously eluded them due to spelling and Soundex differences. I am not advocating the abandonment of Soundex. However, it should be easy with today's technology to have both Soundex and Double Metaphone codes displayed simultaneously on the screen. More choices for genealogists means more ancestors found!

Does your favorite genealogy program use Double Metaphone codes alongside Soundex codes?


Eastman's Online Genealogy Newsletter