Spelling Errors
Spelling and keyboard errors account for many of the duplicates found in
a database. Through intelligent key building
and advanced comparison routines, MerlinMerge® SpeedPro overcomes
spelling errors, including multiple typos, letter transpositions and
incomplete words.
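
The comparison routines themselves are not described here. As a rough illustration of how a comparison can tolerate typos and transpositions, the sketch below computes an optimal string alignment distance, a common edit-distance variant that counts a swap of two adjacent letters as one error. The function name `osa_distance` is an assumption for this example, and this is not presented as the routine MerlinMerge® SpeedPro uses.

```python
def osa_distance(a: str, b: str) -> int:
    """Count the edits (insert, delete, substitute, adjacent swap) between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent letters swapped, e.g. "IA" typed as "AI".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("WILLIAM", "WILLAIM"))   # transposition
print(osa_distance("PHILLIPS", "PHILIPS"))  # missing letter
```

Two names whose distance is small relative to their length could then be flagged as likely duplicates; the threshold is a tuning choice that this page does not specify.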
Rulebase Expertise
The rulebase expert system is used to identify nicknames. Names
such as Bill and William, or Bob and Robert, are often used
interchangeably to identify individuals.
The rulebase is also used to identify noise words. Noise
words are elements in a name that do not help in the identification
of a candidate. Examples of noise words are: Incorporated,
Corporation, Limited, Junior, Senior, Avenue and Street. Some
elements in a name contribute to the identity but should be
treated as less important. In these cases, the rulebase does
not treat them as noise words but recognizes that they are less
significant. Some examples are: associate, board, international
and services.
Other variations are caused by the use of common prefixes. Names
like McDonnell are confused with MacDonnell. Prefix recognition
provides the facility for handling these classes of problems.
The rulebase can also recognize diminutives. Frequently there are
names that end in a diminutive such as "ie" or "y". In these
cases, it is useful to identify the root and apply the rule. For
example, you would want Bill, Billie and Billy to find William
or Willie.
| BILL YARA | WILLIAM YARA |
| BOBBY KENNEDY | ROBERT KENNEDY |
| JIM P PHILLIPS SR | JAMES P PHILLIPS |
| SMITH AND ASSOCIATES | SMITH |
| MCDONELL CORPORATION | MCDONELL |
| MR MATT J THOMAS | MATTHEW J THOMAS |
| MARINA DELSOLE | MARINA DEL SOLE |
| DR LEONARD MACCOY MD | LEONARD MCCOY |
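
The rulebase behaviour described above can be sketched in a few lines. Everything in this example is an assumption made for illustration: the function names, the tiny lookup tables (filled only with entries from the examples in this section), the decision to drop less significant words outright instead of down-weighting them, and the crude MAC-to-MC prefix rule. It does not reproduce the product's rulebase and it does not handle the DELSOLE / DEL SOLE row.

```python
# Hypothetical rulebase entries, limited to the examples in this section.
NICKNAMES = {"BILL": "WILLIAM", "WILL": "WILLIAM", "BOB": "ROBERT",
             "JIM": "JAMES", "MATT": "MATTHEW"}
NOISE_WORDS = {"INCORPORATED", "CORPORATION", "LIMITED", "JUNIOR", "SENIOR",
               "JR", "SR", "MR", "DR", "MD"}
# Simplification: less significant words are dropped here, not down-weighted.
LESS_SIGNIFICANT = {"AND", "ASSOCIATES"}
DIMINUTIVE_SUFFIXES = ("IE", "Y")


def resolve_nickname(word: str) -> str:
    """Map a nickname or diminutive to its formal name when a rule exists."""
    candidates = [word]
    for suffix in DIMINUTIVE_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            root = word[: -len(suffix)]
            candidates.append(root)
            if len(root) > 2 and root[-1] == root[-2]:
                candidates.append(root[:-1])  # BOBBY -> BOBB -> BOB
    for candidate in candidates:
        if candidate in NICKNAMES:
            return NICKNAMES[candidate]
    return word


def normalize_prefix(word: str) -> str:
    """Treat a leading MAC as MC on longer words (a crude, assumed rule)."""
    if word.startswith("MAC") and len(word) > 5:
        return "MC" + word[3:]
    return word


def apply_rulebase(name: str) -> str:
    kept = []
    for word in name.upper().split():
        if word in NOISE_WORDS or word in LESS_SIGNIFICANT:
            continue
        kept.append(normalize_prefix(resolve_nickname(word)))
    return " ".join(kept)


print(apply_rulebase("BOBBY KENNEDY"))         # ROBERT KENNEDY
print(apply_rulebase("DR LEONARD MACCOY MD"))  # LEONARD MCCOY
```

A real rulebase would need far larger tables and would have to know which word is a given name, since this sketch would also rewrite a surname such as Bill.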
Phonetic Errors
Discrepancies caused by phonetic
errors account for 20-25% of all name variations.
Traditional solutions to name variation, such as Soundex
and NYSIIS, deal only with
phonetic errors. These solutions involve the standardization
of easily confused sounds. For example, PH would be treated
as F. Linguistic rules are generated to phonetically
tokenize a name. These phonetically tokenized words serve
as the basis for name retrieval. In some instances these
rules help to find names that are difficult to spell. Unfortunately,
the distribution pattern of common names becomes even more
skewed. For example, inquiries on John also return Joan,
Jim, Jane, Jimmy, Jenn and other names which fall in the "JAN" phonetic
pattern. Aggravating the skew in the distribution of names
sacrifices both quality and performance.
MerlinMerge® SpeedPro addresses problems due to phonetics
by employing analysis routines to determine the extent of phonetic tokenization. This
enables MerlinMerge® SpeedPro to overcome problems due to phonetics without
the negative
consequences incurred with all other methods of name search.
Examples of phonetic tokenization (taken directly from Robert
L. Taft, "Name Search Techniques", New York State Identification
and Intelligence System):
1) Translate first characters of name
     MAC => MCC
     PH => FF
     KN => NN
     K => C
     SCH => SSS
2) Translate last characters of name
     EE => Y
     IE => Y
     DT,RT,RD,NT,ND => D
3) First character of key = first character of name
4) Translate remaining characters by following rules, incrementing
   by one character each time
     EV => AF else A,E,I,O,U => A
     Q => G
     Z => S
     M => N
     KN => N else K => C
     SCH => SSS
     PH => FF
     H => If previous or next is non vowel, previous
     W => If previous is vowel, previous
5) Translate last characters of name
     If last character is S, remove it
     If last characters are AY, replace with Y
     If last character is A, remove it
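
The rules listed above can be sketched as a short function. Several details are assumptions and not taken from the list: the name `phonetic_key`, the order in which prefix rules are tried, treating the end of the name as a non-vowel for the H rule, and the step that skips a character when it repeats the last character already in the key (this step appears in commonly published forms of the algorithm but is not spelled out above).

```python
VOWELS = set("AEIOU")
PREFIX_RULES = (("MAC", "MCC"), ("KN", "NN"), ("K", "C"), ("PH", "FF"), ("SCH", "SSS"))
SUFFIX_RULES = (("EE", "Y"), ("IE", "Y"), ("DT", "D"), ("RT", "D"),
                ("RD", "D"), ("NT", "D"), ("ND", "D"))
SINGLE_LETTER = {"Q": "G", "Z": "S", "M": "N"}


def phonetic_key(name: str) -> str:
    word = "".join(ch for ch in name.upper() if ch.isalpha())
    if not word:
        return ""
    for old, new in PREFIX_RULES:            # step 1: first characters
        if word.startswith(old):
            word = new + word[len(old):]
            break
    for old, new in SUFFIX_RULES:            # step 2: last characters
        if word.endswith(old):
            word = word[: -len(old)] + new
            break
    chars = list(word)
    key = chars[0]                           # step 3: first character of key
    i = 1
    while i < len(chars):                    # step 4: one character at a time
        cur, prev = chars[i], chars[i - 1]
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        two, three = "".join(chars[i:i + 2]), "".join(chars[i:i + 3])
        if two == "EV":
            chars[i:i + 2] = ["A", "F"]
        elif cur in VOWELS:
            chars[i] = "A"
        elif cur in SINGLE_LETTER:
            chars[i] = SINGLE_LETTER[cur]
        elif two == "KN":
            chars[i] = "N"
        elif cur == "K":
            chars[i] = "C"
        elif three == "SCH":
            chars[i:i + 3] = ["S", "S", "S"]
        elif two == "PH":
            chars[i:i + 2] = ["F", "F"]
        elif cur == "H" and (prev not in VOWELS or nxt not in VOWELS):
            chars[i] = prev
        elif cur == "W" and prev in VOWELS:
            chars[i] = prev
        if chars[i] != key[-1]:              # assumed: skip repeated characters
            key += chars[i]
        i += 1
    if len(key) > 1 and key.endswith("S"):   # step 5: last characters of key
        key = key[:-1]
    if key.endswith("AY"):
        key = key[:-2] + "Y"
    if len(key) > 1 and key.endswith("A"):
        key = key[:-1]
    return key


for name in ("John", "Joan", "Jim", "Jane", "Jenn"):
    print(name, phonetic_key(name))  # each prints the key JAN
```

Under these assumptions John, Joan, Jim, Jane and Jenn all collapse to the same key, which is exactly the skew in name distribution described earlier in this section.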
Missing, Extra, Noise Words
The rulebase is used to
identify noise words. Noise words are elements in
a name that do not help in the identification of
a candidate. Examples of noise words are: Incorporated,
Corporation, Limited, Junior, Senior, Avenue and
Street.
While processing the data, MerlinMerge® SpeedPro
goes through a process called sanitization that removes
noise characters, extra spaces and control
characters, and converts lower case letters to upper case.
Examples of noise characters are: @, #, $, %, ^, &,
*, (, ), }, {, [, ]. The following characters are
handled separately and have special meanings: commas,
hyphens and quotes. Commas usually indicate that a last
name has been placed first. Sanitization places words followed
by commas at the end of the string. Quotes are deleted
and no space is left in their place. Hyphens are replaced
by a space.
| Before Sanitization | After Sanitization |
| Scott Lions | SCOTT LIONS |
| Smith, John F. | JOHN F SMITH |
| Rose Stone-Shield | ROSE STONE SHIELD |
| James O'Tool | JAMES OTOOL |
| James O. Tool | JAMES OTOOL |
| Owen, Tool, James | JAMES OWEN TOOL |
| # Williams , $Richard | RICHARD WILLIAMS |
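
A minimal sketch of the sanitization steps described above, without any rulebase. The function name `sanitize` is an assumption. So is the handling of periods and backslashes as noise characters, which is inferred from the example rows and is not in the list of noise characters. The sketch replaces noise characters with a space and then collapses spaces, which gives the same visible result as removing them. It does not reproduce the "James O. Tool" row.

```python
import re

# Noise characters listed in the text; the trailing backslash and period
# are assumptions inferred from the examples.
NOISE_CHARS = set("@#$%^&*(){}[]\\.")


def sanitize(name: str) -> str:
    text = name.upper()
    text = re.sub(r"['\"]", "", text)   # quotes are deleted, no space left behind
    text = text.replace("-", " ")       # hyphens become a space
    # Noise and control characters become spaces; extra spaces collapse later.
    text = "".join(" " if ch in NOISE_CHARS or not ch.isprintable() else ch
                   for ch in text)
    text = re.sub(r"\s+,", ",", text)   # attach a stray comma to the word before it
    text = text.replace(",", ", ")
    kept, moved = [], []
    for token in text.split():
        if token.endswith(","):
            word = token.rstrip(",")
            if word:
                moved.append(word)      # words followed by a comma go to the end
        else:
            kept.append(token)
    return " ".join(kept + moved)


print(sanitize("Smith, John F."))         # JOHN F SMITH
print(sanitize("# Williams , $Richard"))  # RICHARD WILLIAMS
```

Because only the single word in front of each comma is moved, "Mc Donald, Old" comes out as MC OLD DONALD, which is why a rulebase pass is useful before this step.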
The sanitization process also uses a small rulebase.
The rulebase is applied after all the alpha characters have been
converted to upper case letters and extra blanks are removed. This
rulebase is used to recognize words that contain noise characters
or prefixes that could be affected by the sanitization process.
| Before Sanitization | After Sanitization | Sanitization (without rulebase expertise) |
| c\o | CARE OF | C O |
| Mc Donald, Old | OLD MCDONALD | MC OLD DONALD |
| % | CARE OF | |
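
The small sanitization rulebase can be pictured as a pass that runs on upper-cased, blank-trimmed tokens before noise characters are stripped. The sketch below covers only the three rows of the table above. The names `TOKEN_RULES`, `JOINED_PREFIXES` and `apply_sanitization_rules` are hypothetical, and the output still needs the general comma and noise-character handling applied afterwards.

```python
# Hypothetical rule tables, limited to the examples in the table above.
TOKEN_RULES = {"C\\O": "CARE OF", "%": "CARE OF"}
JOINED_PREFIXES = ("MC",)


def apply_sanitization_rules(name: str) -> str:
    # Upper-case and drop extra blanks first, as the text describes.
    tokens = name.upper().split()
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in TOKEN_RULES:
            out.append(TOKEN_RULES[token])      # whole token has a special meaning
        elif token in JOINED_PREFIXES and i + 1 < len(tokens):
            out.append(token + tokens[i + 1])   # glue the prefix to the next word
            i += 1
        else:
            out.append(token)
        i += 1
    return " ".join(out)


print(apply_sanitization_rules("c\\o"))             # CARE OF
print(apply_sanitization_rules("Mc Donald, Old"))   # MCDONALD, OLD
```

Once "MC" and "DONALD," are joined, moving the word in front of the comma to the end of the string yields OLD MCDONALD instead of MC OLD DONALD.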
Word Sequence Variations
Many search problems are caused by sequence variations. The inability
to determine the order of words for a particular entity occurs
at both data entry and inquiry time. The name Frank Lee, for example,
could have been entered as Lee Frank. This problem is particularly pervasive
in company names. Names such as International Business Machines,
Andersen Consulting and Kemper Insurance Company are examples
where the left-most word is most significant. Conversely, Edward
S. Gordan Real Estate Company and Paul Mitchell Hair Products
are examples where the left-most word is less significant. The
inability to predict the significant name with respect to word
position causes many searches to fail.
Merging foreign database files causes other sequence variations.
This frequently occurs when external lists are purchased or
companies consolidate information. Inconsistent methodologies
for data capture make the standardization of name fields impossible.
Aggravating the sequence problem are those instances in which
company names are intermixed with personal names. All of these
factors, in addition to human error, contribute to identification
problems caused by sequence variations. MerlinMerge® SpeedPro
provides a facility for handling these problems.
To understand this better we will draw
an analogy between a telephone book and a database system.
When we look
for Frank Lee we search the "L" section. If the name
is not there, we continue the search by looking in the "F" section.
In order to find Frank Lee we had to search two separate sections
of the phone book. Suppose we were looking for Frank Lee Ray.
To ensure success we must search all the permutations. This
is an extremely arduous and time-consuming process for both
people and computers. If Frank Lee were listed in both the L and
F sections, only one section would need to be searched,
regardless of word order.
Using this approach, MerlinMerge® SpeedPro is
able to overcome word sequence variations without sacrificing
performance.
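
The telephone book analogy corresponds to indexing each name under every word it contains. The sketch below is one way to picture that, assuming exact word matches and hypothetical function names `build_index` and `search`. It is an illustration of the idea only and says nothing about how MerlinMerge® SpeedPro stores its keys.

```python
from collections import defaultdict


def build_index(names):
    """List every name under each of its words, like entries in a phone book."""
    index = defaultdict(set)
    for record_id, name in enumerate(names):
        for word in name.upper().split():
            index[word].add(record_id)
    return index


def search(index, names, query):
    """Look in one section only, then keep names with the same words in any order."""
    words = query.upper().split()
    if not words:
        return []
    candidates = index.get(words[0], set())  # a single section of the "book"
    wanted = sorted(words)
    return [names[i] for i in sorted(candidates)
            if sorted(names[i].upper().split()) == wanted]


names = ["Frank Lee", "Lee Frank", "Frank Lee Ray", "Paul Mitchell Hair Products"]
index = build_index(names)
print(search(index, names, "Lee Frank"))      # ['Frank Lee', 'Lee Frank']
print(search(index, names, "Ray Frank Lee"))  # ['Frank Lee Ray']
```

The cost is paid once at load time, when each name is written under several words, so an inquiry never has to try every permutation of its words.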
Acronym Recognition
Corporate name searching concretely illustrates the pragmatic
difficulties in developing solutions that find correct
information without missing likely
candidates. People readily understand the similarity between "Triple
A towing" and "AAA towing", yet a computerized system would
need to employ a knowledge-based algorithm to recognize the relationship
between Triple A and AAA.
The deployment of intelligence through knowledge-based systems greatly benefits
search and matching algorithms by identifying nicknames, shortened forms,
noise words and other circumstances that require experience to return a more
comprehensive result set. However, knowledge-based systems are limited by
the breadth and depth of their lexicon. Unlike well-known names such as IBM and
AT&T, the vast majority of acronyms lie outside the scope of knowledge-base
processing. For example, our clients often used the IST acronym interchangeably
with Intelligent Search Technology, yet it would be unreasonable to expect
the inclusion of IST in a knowledge-based system.
The MerlinMerge® SpeedPro software with its corporate search algorithms
and acronym recognition functionality significantly advances the
ability
to seek and match corporate name data.
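
One way to recognize an acronym without a lexicon entry is to compare it with the initials of the longer name. The sketch below does only that. The function names and the small list of connector words are assumptions, and it is not presented as the product's acronym recognition.

```python
import re

# Hypothetical connector words that are skipped when forming initials.
CONNECTORS = {"AND", "OF", "THE"}


def compact(name: str) -> str:
    """Keep letters and digits only: 'AT&T' becomes 'ATT'."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def initials(name: str) -> str:
    words = re.findall(r"[A-Z0-9]+", name.upper())
    return "".join(word[0] for word in words if word not in CONNECTORS)


def acronym_match(a: str, b: str) -> bool:
    """True when one name equals the initials of the other, multi-word, name."""
    return ((len(b.split()) > 1 and compact(a) == initials(b)) or
            (len(a.split()) > 1 and compact(b) == initials(a)))


print(acronym_match("IST", "Intelligent Search Technology"))    # True
print(acronym_match("International Business Machines", "IBM"))  # True
print(acronym_match("Triple A", "AAA"))                         # False
```

The last call shows where initials alone fall short: relating Triple A to AAA still needs a knowledge-based rule, as described at the start of this section.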