[Bitpim-devel] String matching

Roger Binns

2004-12-28 00:11:40 UTC

I have also updated the string matching routines. The main problem
is dealing with new incoming entries and trying to match them with
existing ones. This requires comparing all fields and coming up
with match scores.

We were using a routine from difflib. That routine worked out
how many changes you would have to make to convert one string
to the other and the string similarity was based on the inverse
of the number of edits.

It is in pure Python, and fairly slow but worked on any data type
that is a sequence.

There has been a lot of worldwide research on string matching. This
has been due to the need to match up various databases such as medical
records from multiple sources and census information.

The leading algorithm is known as Jaro after its inventor which
finds a similarity by looking for the same characters within a sliding
window and measuring how different those lists are. It is a simple
fast heuristic and works better on real world data than one would
expect.

A fellow named Winkler from the US Census Bureau then came up with
a tweak that gives a bonus if the leading characters are the same
in both strings since misspellings of the same name are more likely
to have the misspellings later on (remember we want those to match!)

A final tweak is training on a representative data sample. This is
to take into account things like 'i' and 'y' are very similar in English.
(eg you pronounce Smith and Smyth the same). But they aren't the same
in all languages, and there are domain specific things as well (eg which
bits of disease names are more important than others, the letters www
at the begining of a url matter less).

Adit originally wanted to use a Jaro-Winkler routine from the febrl
project. Unfortunately they had it under an open source license,
but it was one of those that was hand wavy and assumed it was the
only component in existence, and was definitely not GPL compatible.

So on Saturday I thought I would spend a few hours and have a look
at this again (the difflib routines show up big time in the profile
runs I do). A few hours later I had an implementation of Jaro-Winkler.

It was done first in Python and much to my astonishment worked
perfectly first time. I re-implemented in C where it didn't work
first time (a long story of macros, compiler warnings, signed and
unsigned mismatches etc)

My implementation also fixed varies breakages that most other
implementations suffer from (every other implementation makes
copies of the strings and scribble asterisks all over the
characters that have already been matched which would serious
breakage if any string contained an asterisk already).

If you do nothing the Python implementation will be used. This
will be slower, but the whole process doesn't take long enough
to matter with small data sets.

If you want to do a build or are importing large amounts of
data them you need to use the C version. Go into the
native/strings directory and run 'python setup.py build'
(add --compile=mingw32 on Windows if needed). The resulting
.so/.pyd will end up in a subdirectory. Copy it into the
native/strings directory.

There is another Python Jaro Winkler implementation at
http://trific.ath.cx/resources/python/levenshtein/api/
that actually wraps C code. It too suffers from the double
memory allocations, overwriting with asterisk and won't
cope with the parameters not both being the same type
(string/unicode)

If anyone is curious about what is out there, and the
performance and accuracy metrics, have a look at
http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf
I did deliberately do my implementation so that
it can fairly easily be adapted to a token based
sceheme as well as give differing weights.

Roger