How NameSearch® Works - Phonetic Coding Algorithms, Soundex and NYSIIS
Phonetic coding, generically referred to as "Soundex",
is often used to enable retrieval of information from data processing
systems. R. C. Russell developed the Soundex algorithm to process data
collected from the 1890 census. The original is known as the Russell Soundex
algorithm, and numerous variants have been employed for genealogy studies and
retrieval systems.
New York State Identification and Intelligence Algorithm (NYSIIS)
In 1970 the New York State Identification and Intelligence project, headed by
Robert L. Taft, published the paper "Name Search Techniques". In this
paper he compared Soundex with a new phonetic routine (NYSIIS) that was designed
through rigorous empirical analysis.
The NYSIIS project concluded that:
NYSIIS is 98.72% accurate with a selectivity factor of 0.164% per name inquiry.
Soundex is 95.99% accurate with a selectivity factor of 0.213% per name inquiry.
Selectivity is defined as the number of records returned divided by the size of
the data set.
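The selectivity figures translate directly into the number of candidate records a searcher has to review per inquiry. The following is a minimal sketch in Python that applies the definition of selectivity given here. The function names and the one-million-record data set are hypothetical and chosen only for illustration; they do not come from the NYSIIS study.

```python
def selectivity_percent(records_returned: int, data_set_size: int) -> float:
    """Selectivity: records returned divided by data set size, as a percentage."""
    if data_set_size <= 0:
        raise ValueError("data_set_size must be positive")
    return 100.0 * records_returned / data_set_size


def expected_candidates(selectivity_pct: float, data_set_size: int) -> int:
    """Approximate number of records one inquiry returns at a given selectivity."""
    return round(selectivity_pct / 100.0 * data_set_size)


# Hypothetical data set of one million names (not a figure from the study).
DATA_SET_SIZE = 1_000_000
nysiis_hits = expected_candidates(0.164, DATA_SET_SIZE)   # about 1,640 records
soundex_hits = expected_candidates(0.213, DATA_SET_SIZE)  # about 2,130 records
print(nysiis_hits, soundex_hits)
```

A lower selectivity means fewer records returned per inquiry, so under these assumptions the NYSIIS figure corresponds to roughly 500 fewer candidates per search than Soundex.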
* In 1998 the New York State
Division of Criminal Justice, the agency responsible for the NYSIIS
project, replaced the NYSIIS engine with NameSearch®.
NameSearch's intelligent phonetic routines
have been proven to increase accuracy while decreasing
selectivity compared with NYSIIS.
Traditional solutions for name variations, such as Soundex and NYSIIS,
deal only with phonetic errors.
These solutions involved the standardization of easily confused
sounds. For example, PH's would be treated as F's. Linguistic rules
were generated to phonetically tokenize a name. These phonetically
tokenized words served as the basis for name retrieval. In some
instances these rules helped find names that were hard to spell.
Unfortunately, the distribution pattern of common names became
even more skewed. For example, inquiries on John also returned
Joan, Jim, Jane, Jimmy, Jenn and other names that fell in the "JAN" phonetic
pattern. By aggravating the skew in the distribution of names, both
quality and performance were sacrificed.
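The skew is easy to reproduce with a traditional phonetic code. The sketch below implements the commonly documented American Soundex variant in Python. It is an illustration only: it is not NYSIIS and not NameSearch's own routines, and which Soundex variant the comparison above used is an assumption. It groups a handful of names by their phonetic key.

```python
from collections import defaultdict

# Consonant groups of American Soundex; vowels, H, W and Y carry no digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
SOUNDEX_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex key, or "" if no letters A-Z."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same digit
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated digit is kept
    return (key + "000")[:4]  # pad with zeros, truncate to four characters


names = ["John", "Joan", "Jim", "Jane", "Jimmy", "Jenn", "Robert", "Rupert"]
buckets = defaultdict(list)
for n in names:
    buckets[soundex(n)].append(n)

for code, members in buckets.items():
    print(code, members)
# J500 ['John', 'Joan', 'Jim', 'Jane', 'Jimmy', 'Jenn']
# R163 ['Robert', 'Rupert']
```

Here the shared key is J500 rather than the "JAN" pattern quoted above, because the key format depends on the coding routine. The effect is the same: an inquiry on any one of these common names retrieves all of the others.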
Discrepancies caused by phonetic errors account for
twenty to twenty-five percent of all name variations. Intelligent
Search Technology addresses problems due to phonetics by employing
analysis routines to determine the extent of phonetic tokenization.
This enables NameSearch to overcome problems due to phonetics without
the negative consequences incurred with all other methods of name
search.