Algorithms for Matching Strings

Last year, I had database table full of authors’ names I’d entered in from various sources. Many of these names were simply variants of one another: eg J. Smith and John Smith. I had already associated many of these names with biographical information in another table. My problem now was to match up the unidentified name variants with the identified names.

I was using PHP for the project, and browsing through the docs, I came across some functions for matching strings. soundex, metaphone, and levenshtein.

soundex and metaphone are phonetic algorithms that output a code that will be the same for similar sounding words (though differently perhaps spelt): eg: Smith and Smyth. I discounted these because my name variations would sound quite differently depending on how much of the name was given – eg: Smith, J. Smith, Mr. John H. Smith.


Levenshtein
looked more promising; you give it two strings, and it gives you the number of changes you would have to make to change one string into the other. eg:
levenshtein('Smith', 'Smyth'); // returns 1

Ok, great for variants, but what about abbreviations? So I subtract the difference between the two strings from the levenshtein distance between them. Ok, better, but still not great: I’ve got an integer that might be a crucial difference to short strings, and negligible to long strings. So I need a ratio: my integer divided by the length of the smallest string.

$levenshtein = levenshtein($known_name, $unknown_name);
$lengthdifference = max($known_name, $unknown_name) - min($known_name, $unknown_name);	
$levenshtein-=$lengthdifference;	
$similarity = $levenshtein/strlen(min($known_name, $unknown_name));

So I experimented with this a bit, and found that a similarity of < 0.4
would get me (almost) all of my variants, and not too many false matches. (One weakness is that, for example, The Reverend J. H. would not match J. Hall.)

I used this to generate a form with each of the unknown names (and the facts that were known about them), presented along-side radio selects for the possible matching known names (and the biographical details).

I could then go through each of the unknown names, and relatively easily match it with the right person (from a managebly small, relatively plausible selection). – It should be noted that I was never trying to eliminate human decision altogether – human research was often necessary to determine if this John Smith really was the same as that John Smith.

I’ve posted this solution here because other people may have a similar problem, but also (more importantly) because my solution is stupid, and I’m hoping other people will post suggestions for better solutions. (Actually, some searching reveals that quite a lot of time and money has been spent on solving this, probably in very sophisticated ways – though the solutions may not be readily accessible to the average humanities hacker. I just found a website offering licenses for an algorithm called NameX – interestingly they explain how it works in reasonable detail.)

My solution (although it worked well enough for me) is stupid because it does not take into consideration many of the things that a human reader would in making a judgement.
A human reader knows, for example, about titles like Mr. and Reverend, and knows that they are not integral to the name (for these purposes at least). A human reader would also give more weight to a surname, perhaps than a forename. A skilled human reader would know from the context which orthographical differences were significant: for example, in the Renaissance era, I might replace J, VV replace W, etc, and names might well be latinised ('Peter' == 'Petrus').

A cleverer approach might have been to use a phonetic algorithm for the surname (assuming a surname can be separated from the rest of the string, perhaps with regular expressions), and if it passes that test, use my levenshtein-based approach with some other rules from human knowledge mixed in (eg: I == J). And if it was really clever, it might be able to look at the whole corpus to get an idea of context (eg: Is there a consistency to capitalisation or punctuation?).

Or perhaps even an AI approach – a program that could be trained to recognise good matches, much as you can train your email software to recognise junk mail?

(NB: Although this approach is ignorant about personal names, it is equally ignorant about other types of strings, such as place names and book titles; which at least means it is broadly applicable.)

Suggestions are most welcome.

References

http://en.wikipedia.org/wiki/Hamming_distance
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Soundex
On Identifying Name Equivalences in Digital Libraries

The Identification of Authors in the Mathematical Reviews Database

Names: A New Frontier in Text Mining

About these ads

5 Comments »

  1. Pierre said

    Resolving names can be an headache, for example a author named “Chen C” matches more than 2400 references on pubmed.

    The coauthorship could be used to discriminate one author from another ?

    Pierre

  2. Yes, I can imagine where co-authorship is quite common, like in scientific articles, you could do pretty interesting things with networks of co-authorship. Perhaps a handy OPAC feature would be an interface that visualised this and let you navigate through it (see http://www.openacademia.org ). Or a search that let you limit a search on one author by the nth degrees of separation from another.

  3. Got said

    This kind of research is very interesting. I am trying it in order to request corpus in medieval latin or in ancient french and it can represent a solution if the corpus are not lemmatized. Levenstein distance, in particular, is the base algorithm for fuzzy searche and is implemented in exist, an XML database : http://exist-db.org/ (and the fuzzy search : http://demo.exist-db.org/xquery/functions.xq#text:fuzzy-index-terms). Maybe this implementation can give you some idea.

  4. Gene T said

    There’s a lot of string/sequence matching algo’s out there:

    http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
    http://www-igm.univ-mlv.fr/~lecroq/string/index.html
    (i can send you dozens mroe refs is you’re interested)
    http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf

    If you want to dig into Computational Linguistics (Info Extraction and Named ENtity Recognition), the state of the art are techniques like max entropy, conditional random fields, e.g.

    http://www.cs.umass.edu/~mccallum/papers/collseg04icmlws.pdf

  5. AAA said

    It would be great if you publish all your refs. Thanks.

RSS feed for comments on this post · TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: