Dec 22, 2017
A phonetic text-retrieval algorithm better than Soundex. Sample source code in Basic.

File Name | File Size | Zip Size | Zip Type |
---|---|---|---|
METAFONE.BAS | 5396 | 1200 | deflated |
METAFONE.TXT | 7390 | 3125 | deflated |
TPCREAD.ME | 199 | 165 | deflated |
Download File METAFONE.ZIP Here
Contents of the METAFONE.TXT file
***** Computer Select, July 1991 : Doc #36047 *****
Journal: Computer Language Dec 1990 v7 n12 p38(6)
* Full Text COPYRIGHT Miller Freeman Publications 1990.
-----------------------------------------------------------------------------
Title: Hanging on the Metaphone. (a phonetic text-retrieval algorithm
better than Soundex) (tutorial)
Author: Philips, Lawrence.
Summary: Soundex is the most popular algorithm used for data bases when
users need to be able to search on a name even if they are unsure
of its spelling. The program attempts to retrieve names that
sound like the one a user enters. Soundex maps groups of
consonants to single digits. It saves the first letter and
eliminates any vowels after the first letter. Soundex, however,
returns some choices that sound nothing like what is entered and
sometimes does not locate names that sound similar such as Stephen
and Steven. There is also no reason to map sets of letters to
numbers. The Metaphone algorithm constructs a phoneticized and
coded spelling. It also ignores vowels after the first letter but
tries to apply common English pronunciation rules. Zero is used
to represent the 'th' sound and the letter X is used for the 'sh'
sound. Usually only the first four letters of phonetic spellings
are used.
-----------------------------------------------------------------------------
Descriptors..
Topic: Algorithms
Data Bases
Search Strategy
Phonetics
Type-In Programs
Tutorial.
Feature: illustration
table
program.
Caption: The 16 consonant sounds, exceptions and transformations. (table)
Subroutine Metaphone. (program)
Record#: 09 620 419.
-----------------------------------------------------------------------------
Full Text:
Hanging on the Metaphone
Your customers are not satisfied. Despite the fact
that your perfectly good database application lets them search on a company
name as well as a numeric key (if they happen to remember it), somehow they
still can't find the company they're looking for. It seems they can't always
remember how to spell the name. They want you to make the program somehow
retrieve items that sound like the ones they entered.
Soundex is by far the most popular algorithm used to address this problem.
I find this strange since Soundex ignores the basic rules of English
pronunciation, equating "g" and "j" no matter what the situation, as well as
running together "c," "q," "s," "z," and "k."
Often the basic algorithm is kludged to account for some obvious
equivalences, but a kludge is still a kludge. English is a bear in this
regard anyway, on account of its notoriously strange and inconsistent
spelling, but that doesn't mean we have to give up. If you look at the
problem more closely, it turns out that you can predict the pronunciation of
an English word algorithmically about 95% of the time.
But we must deal with more difficult issues that lead us out of clear-cut,
quantifiable problems and into the realm hackers dread: personal, subjective
opinion. What is the real purpose of Soundex? When someone is looking for a
name but can't remember how it's spelled, what would be a reasonable set of
possibilities given the approximation just entered? Sounds like a fine
graduate-thesis topic to me: "Perceptions of Phonetic and Orthographic
Proximity in Computational String Searches."
First, a distinction must be made between sounding alike and looking alike.
I had to face this question at work when I was asked to "improve" Soundex. I
asked my colleague how the people on our floor would use it; she told me that
they would hear names over the phone and attempt to find them in our database.
They demanded a lot: they must be able to find the right name, no matter how
fuzzy their recollection. After negotiations, we agreed that sounding alike
was a sufficient criterion.
Another complication I was fortunate to be able to ignore is the large number
of foreign words and names one encounters in the U.S. One of our programmers
is named DiBisceglie. If you know any Italian, you might be tempted to
pronounce it dee-bishell-yeh. (Since I can't speak Italian, I'm not too sure
about that, either!) In America, however, it's pronounced dee-biasig-lee. A
program would have to be pretty sophisticated to recognize the language a
word came from as well as how it's pronounced!
Soundex maps groups of consonants to single digits after saving the first
letter (whatever it is) and throwing away any vowels after the first letter,
often including "w," "y," and "h" in the set of vowels. You end up with
something like "P202" or "H345." This looks very technical and mysterious,
but there is no reason to map sets of letters to numbers. Our users
complained that some of the choices Soundex returned sounded nothing like
what they entered, and that Soundex didn't find things that did sound
similar. For example, a search on "Cajun" returned "Cigna" among the
choices, and "Faker" returned "Fisher." "Stephen" was not matched to
"Steven."
I decided to throw Soundex out and start from scratch. Instead of the weird
alphanumeric key Soundex gives you, my program constructs a phoneticized and
somewhat coded spelling. This has the incidental advantage of making it
easier for the programmer to determine what the key represents and whether
it's really correct. I call my algorithm Metaphone.
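For contrast, the alphanumeric Soundex key being replaced here can be sketched in a few lines of Python. This is a minimal sketch of a common textbook variant, not the in-house version the article describes: the letter groups in SOUNDEX_CODES, the treatment of "h" and "w", and the zero padding are assumptions, and the author's variant may have differed in details.

```python
# Letter groups of a common textbook Soundex variant (an assumption; the
# article does not list the table its in-house version used).
SOUNDEX_CODES = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str, length: int = 4) -> str:
    """Return the first letter followed by digits for the later consonants."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0]                      # the first letter is saved as-is
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")  # vowels, H, W and Y have no code
        if code and code != previous:
            key += code                   # skip runs of the same digit
        if ch not in "HW":                # H and W do not break such a run
            previous = code
    return (key + "000")[:length]         # pad with zeros, then truncate


for a, b in [("Cajun", "Cigna"), ("Faker", "Fisher")]:
    print(a, soundex(a), b, soundex(b))
```

Under this variant "Cajun" and "Cigna" share one key, as do "Faker" and "Fisher", which is the kind of false match the users complained about.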
Metaphone is similar to Soundex in that it ignores vowels after the first
letter and simplifies thereafter, by equating "d" and "t," and so on.
However, Metaphone attempts to apply commonplace rules of English
pronunciation (for example, "'c' before 'i' or 'e' is pronounced like 's'").
Metaphone reduces the alphabet to 16 consonant sounds, although vowels are
kept when they are the first letter. Zero is used to represent the "th"
sound (it looks a lot like the Greek theta when it has that line through it
that programmers like so much). "X" is used for the "sh" sound (the Chinese now
use it that way when spelling Chinese words for westerners, as in "Deng
Xiaoping"). Figure 1 shows the transformation. Listing 1 shows the encoding
routine.
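The rules quoted in this paragraph can be sketched in Python as follows. This is a deliberately partial illustration, not the routine in Listing 1: the function name metaphone_subset and its max_len parameter are hypothetical, only the rules stated in the prose are applied (initial vowel kept, later vowels dropped, "d" equated with "t", "c" before "i" or "e" coded as "S", "th" as "0", "sh" as "X"), and the full table in Figure 1 covers many more cases.

```python
VOWELS = set("AEIOU")


def metaphone_subset(word: str, max_len: int = 4) -> str:
    """Apply only the handful of Metaphone rules quoted in the prose."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    key = []
    i = 0
    while i < len(letters) and len(key) < max_len:
        ch = letters[i]
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if ch in VOWELS:
            if i == 0:
                key.append(ch)       # vowels survive only as the first letter
            i += 1
        elif ch == "T" and nxt == "H":
            key.append("0")          # zero stands for the "th" sound
            i += 2
        elif ch == "S" and nxt == "H":
            key.append("X")          # "X" stands for the "sh" sound
            i += 2
        elif ch == "C" and nxt in ("I", "E"):
            key.append("S")          # "c" before "i" or "e" sounds like "s"
            i += 1
        elif ch == "D":
            key.append("T")          # "d" and "t" are equated
            i += 1
        else:
            key.append(ch)           # every other consonant is left alone here
            i += 1
    return "".join(key)


for word in ("Shubert", "Smith", "Cedar", "Eden"):
    print(word, metaphone_subset(word))
```

Because only a subset of the rules is present, words that depend on the remaining transformations (silent letters, "ch", "ph", and so on) will not get their proper keys from this sketch.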
For practical purposes, usually only the first four (if there are that many)
letters of the phonetic spelling are used, giving, for example, SKL for
"school" and XBRT for "Shubert." Metaphone can be modified to account for
the "sh" sound in "casual," and so on, according to your intended domain, but
English spelling is so weird that after a certain point you run into the
contradictory cases ("-sua-" in "casual" is "sh," but in "persuade" the "s"
sounds like an "s"). Other inconsistencies include common words like
"chemical," "technical," and "mechanic," where "ch" is pronounced like "k"
since the words are derived from Greek. I advise checking a first run of
this algorithm against your domain, then accounting for the mistakes and
inconsistencies you are sure to find!
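One way to account for such mistakes, sketched here in Python under stated assumptions, is to consult a hand-checked exception table before falling back on the algorithm. Everything named below is hypothetical: naive_key is only a stand-in encoder (first letter kept, later vowels dropped), with_exceptions is an illustrative wrapper, and the keys in DOMAIN_EXCEPTIONS were written by hand for the Greek-derived "ch" words mentioned above rather than taken from the article.

```python
from typing import Callable, Dict


def naive_key(word: str, max_len: int = 4) -> str:
    """Hypothetical stand-in encoder: keep the first letter, drop later vowels."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    kept = letters[:1] + [ch for ch in letters[1:] if ch not in "AEIOU"]
    return "".join(kept)[:max_len]


def with_exceptions(encode: Callable[[str], str],
                    exceptions: Dict[str, str]) -> Callable[[str], str]:
    """Wrap an encoder so that hand-checked keys win over the algorithm."""
    table = {word.upper(): key for word, key in exceptions.items()}

    def wrapped(word: str) -> str:
        override = table.get(word.strip().upper())
        return override if override is not None else encode(word)

    return wrapped


# Hand-written keys (illustrative only) for words where "ch" sounds like "k".
DOMAIN_EXCEPTIONS = {"chemical": "KMKL", "technical": "TKNK", "mechanic": "MKNK"}

encode = with_exceptions(naive_key, DOMAIN_EXCEPTIONS)

for word in ("Chemical", "mechanic", "school"):
    print(word, encode(word))
```

The table grows as a first run against real data turns up mismatches, while the encoder itself stays untouched.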
Lawrence Philips is an artificial intelligence specialist at NAC Reinsurance
in Greenwich, Conn.