Archive for August, 2006

Algorithms for Matching Strings

Last year, I had a database table full of authors’ names I’d entered from various sources. Many of these names were simply variants of one another: e.g. J. Smith and John Smith. I had already associated many of these names with biographical information in another table. My problem now was to match up the unidentified name variants with the identified names.

I was using PHP for the project, and browsing through the docs, I came across some functions for matching strings: soundex, metaphone, and levenshtein.

soundex and metaphone are phonetic algorithms that output a code that will be the same for similar-sounding words, even if they are spelt differently: e.g. Smith and Smyth. I discounted these because my name variations would sound quite different depending on how much of the name was given – e.g. Smith, J. Smith, Mr. John H. Smith.
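
For instance, using PHP’s built-in functions (a quick illustration of how the phonetic codes collapse variant spellings, not code from the original project):

// both spellings reduce to the same phonetic code,
// so a simple equality test treats them as a match
var_dump(soundex('Smith') === soundex('Smyth'));     // bool(true)
var_dump(metaphone('Smith') === metaphone('Smyth')); // bool(true)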


Levenshtein looked more promising: you give it two strings, and it gives you the number of single-character changes (insertions, deletions, or substitutions) you would have to make to turn one string into the other. e.g.:
levenshtein('Smith', 'Smyth'); // returns 1

Ok, great for variants, but what about abbreviations? So I subtract the difference in length between the two strings from the levenshtein distance between them. Ok, better, but still not great: I’ve got an integer that might be a crucial difference for short strings, and negligible for long strings. So I need a ratio: my integer divided by the length of the shorter string.

$levenshtein = levenshtein($known_name, $unknown_name);
// difference in length between the two strings
$lengthdifference = abs(strlen($known_name) - strlen($unknown_name));
// discount the edits that are down to one string simply being longer
$levenshtein -= $lengthdifference;
// normalise by the length of the shorter string
$similarity = $levenshtein / min(strlen($known_name), strlen($unknown_name));

So I experimented with this a bit, and found that a similarity of < 0.4
would get me (almost) all of my variants, and not too many false matches. (One weakness is that, for example, The Reverend J. H. would not match J. Hall.)
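
In practice this amounted to something like the following sketch (the function and variable names are mine, reconstructed for illustration rather than copied from the project):

// similarity measure from above: length-adjusted levenshtein distance,
// normalised by the length of the shorter string
function name_similarity($a, $b)
{
    $distance = levenshtein($a, $b);
    $distance -= abs(strlen($a) - strlen($b)); // ignore edits due to length alone
    return $distance / min(strlen($a), strlen($b));
}

// collect the plausible known names for one unknown name, using the 0.4 cut-off
function plausible_matches($unknown_name, $known_names)
{
    $candidates = array();
    foreach ($known_names as $known_name) {
        if (name_similarity($known_name, $unknown_name) < 0.4) {
            $candidates[] = $known_name;
        }
    }
    return $candidates;
}

// e.g. plausible_matches('J. Smith', array('John Smith', 'Jane Hall', 'John Smyth'));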

I used this to generate a form with each of the unknown names (and the facts that were known about them), presented alongside radio selects for the possible matching known names (and the biographical details).

I could then go through each of the unknown names, and relatively easily match it with the right person (from a manageably small, relatively plausible selection). It should be noted that I was never trying to eliminate human decision altogether – human research was often necessary to determine whether this John Smith really was the same as that John Smith.

I’ve posted this solution here because other people may have a similar problem, but also (more importantly) because my solution is stupid, and I’m hoping other people will post suggestions for better solutions. (Actually, some searching reveals that quite a lot of time and money has been spent on solving this, probably in very sophisticated ways – though the solutions may not be readily accessible to the average humanities hacker. I just found a website offering licenses for an algorithm called NameX – interestingly they explain how it works in reasonable detail.)

My solution (although it worked well enough for me) is stupid because it does not take into consideration many of the things that a human reader would in making a judgement.
A human reader knows, for example, about titles like Mr. and Reverend, and knows that they are not integral to the name (for these purposes at least). A human reader would also give more weight to a surname, perhaps, than to a forename. A skilled human reader would know from the context which orthographical differences were significant: for example, in the Renaissance era, the letter I might replace J, VV might replace W, and so on, and names might well be latinised ('Peter' == 'Petrus').

A cleverer approach might have been to use a phonetic algorithm for the surname (assuming a surname can be separated from the rest of the string, perhaps with regular expressions), and, if it passes that test, use my levenshtein-based approach with some other rules from human knowledge mixed in (e.g. I == J). And if it was really clever, it might be able to look at the whole corpus to get an idea of context (e.g. is there a consistency to capitalisation or punctuation?).
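
As a rough sketch of what that might look like (the surname extraction and the substitution rules here are illustrative guesses rather than a tested recipe; name_similarity() is the helper from the sketch above):

// sketch only: assume the surname is the last word in the string,
// which will often be wrong for real-world names
function extract_surname($name)
{
    $name = preg_replace('/[^A-Za-z\s]/', ' ', $name); // drop punctuation
    $words = preg_split('/\s+/', trim($name));
    return strtolower(end($words));
}

// a couple of the 'human knowledge' substitutions mentioned above
function normalise($name)
{
    $name = str_ireplace('vv', 'w', $name);
    $name = str_ireplace('j', 'i', $name);
    return strtolower($name);
}

function could_match($known_name, $unknown_name)
{
    // first gate: do the surnames sound alike?
    if (metaphone(extract_surname($known_name)) !== metaphone(extract_surname($unknown_name))) {
        return false;
    }
    // second gate: the length-adjusted levenshtein ratio on the normalised strings
    return name_similarity(normalise($known_name), normalise($unknown_name)) < 0.4;
}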

Or perhaps even an AI approach – a program that could be trained to recognise good matches, much as you can train your email software to recognise junk mail?

(NB: Although this approach is ignorant about personal names, it is equally ignorant about other types of strings, such as place names and book titles, which at least means it is broadly applicable.)

Suggestions are most welcome.

References

http://en.wikipedia.org/wiki/Hamming_distance
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Soundex
On Identifying Name Equivalences in Digital Libraries
The Identification of Authors in the Mathematical Reviews Database
Names: A New Frontier in Text Mining


When is semantic html not important?

For many web developers semantic markup is very important. It makes your javascript and css code more maintainable, increases accessibility, helps search engines rank you better, and especially if you use microformats or RDFa, makes your data machine-readable.

What if you are transforming to html from a more semantic markup language like TEI? In this circumstance, isn’t HTML your presentation layer? How concerned with semantics should you be at this stage, when you are trying to prepare the material for human usage? Well, it is still easier to maintain CSS files than to change inline styles, font tags, and misused tags in your XSL files. And you still get the accessibility benefits of using semantic html.

But should you care about machine readability if you are also publishing the more-semantic TEI source? Should you bother about putting RDFa into your html? Well, arguably you should. As I see it, TEI is for describing the elements of a document; transforming to HTML with RDFa can add a meaningful interpretative layer, stating what the document is saying about the external world. In addition, the semantics of RDF and HTML microformats are more widely understood by machines than the semantics of TEI.

But what if you are providing the information in a (linked) RDF document as well? Is it still beneficial to add these extra semantics to your html document? After all, why give machines the trouble of extracting RDF from your html if you are giving it to them pure and free anyway? Are there, or will there be, user agents (for example, browser plugins, screenreaders) that will find these inline semantic statements more useful than pure RDF in a separate document?


Web 2.0 and the Digital Humanities

What Digital Humanities tools could take from Web 2.0:

Give users tools to visualise and network their own data. And make it easy.

A good example is Last.FM. You run a program they give you that uploads the data about the songs you listen to, as you are listening to them. You can then see stats about your listening habits, and are linked with people with similar listening habits. The key thing is that you don’t have to do extra work.

Another example is LibraryThing, which makes it easy to visualise and network data about your book collection. It can’t be as automatic as last.fm, but it does let you import any file you might happen to have with ISBNs in it.

Compare this to a Digital Humanities project: The Reading Experience Database, which aims to accumulate records of reading experiences. They ask that if you come across any reading experiences in your research, you note them down, and submit them to the database with their online form (there are two – a 4 page form and a shorter one page form if you can’t be bothered with 4 pages of forms).
I’m not out to disparage the RED here – in many ways it is a fine endeavour. But I do want to criticise the conceptual model of how it accumulates data:
It requires that you, as a researcher, do your normal work, and then go and fill in (ideally) 4 pages of web forms for every reading experience that you have found (and possibly already documented elsewhere). Do you like filling out forms? I don’t. Worst of all, you don’t get any kind of access to the data – yours, or anyone else’s (you just have to trust they will eventually get around to coding a search page).
This doesn’t help you to do your work now.

Which brings me to my next point…

Harness the self-interest of your users

You need them to use you, so make it worth their while. Don’t ask for their help, help them!

One problem, I think, is when projects start from a research interest. They want to gather data on that topic, so they ask other researchers to help them by filling out web forms.

A better approach to gathering data, I suggest, is to help the user with their own research interests as a first priority. The guy that built del.icio.us, interestingly, said that he primarily wanted his users to tag bookmarks with the keywords that suited them best personally, to tag out of pure self-interest. The network effect of their tagging is a huge side benefit, but it doesn’t need to be the reason that people use del.icio.us. The end result is something more anarchic, more used, and more useful than something like dmoz.org.

del.icio.us doesn’t say ‘I’m interested in French Renaissance Poetry, please fill in these forms’. It gives you a tool to keep track of your bookmarks. It lets you import bookmarks you already have, and it lets you export your data too.

Have an API

You don’t know what you’ve got until you give it away.

SOAP is good, but it doesn’t even need to be that complicated. Make sure that search results are retrievable through a url and presented as semantic xhtml, and your data is already much more sharable (listen to Tom Coates’ presentation on the Web of Data). (Base4 has an interesting post arguing that the approach to URLs is the defining characteristic of Web 2.0.)

Sharing data in a machine-readable and retrievable format is the most important feature. It lets other people build features for you.
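
As an illustration of how little code a consumer of such a service needs, here is a hypothetical sketch (the URL, query parameter, and class name are invented for the example):

// fetch search results from a clean, bookmarkable url
$url = 'http://example.org/search?q=' . urlencode('french renaissance poetry');
$xhtml = file_get_contents($url);

// well-formed xhtml can be parsed with PHP's built-in SimpleXML
$results = new SimpleXMLElement($xhtml);
$results->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');

// pull out whatever the markup identifies as a result item
foreach ($results->xpath('//x:li[@class="result"]') as $item) {
    echo trim((string) $item), "\n";
}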

Back in March, Dan Cohen lamented the lack of non-commercial APIs
suitable for the humanities hacker. And it’s odd – humanities scholarship is a community that you would think would want to facilitate access to and reuse of its data – and the only useful APIs Dan Cohen could find (from programmableweb.org) were from the Library of Congress and the BBC. (It’s not quite as bad as that: commercial APIs are potentially useful too, and there’s also COPAC for querying UK research libraries, and of course wikipedia.)
There are a ton of digital projects stored away in repositories, such as those provided by the AHDS, but few are much more accessible or usable in their digital form than in print.
I read that the ESTC is going to be made freely available through the British Library’s website later this year – imagine the historical mashups that could be done – the information that could be mined and visualised – if they would provide a developers’ API.

Embrace the chaos of knowledge

The exciting thing about the folksonomy approach of tagging, and the user creation and maintenance of knowledge of Wikipedia, is that they have shown that a bottom-up method of knowledge representation can be more powerful and more accurate than traditional top-down methods.
It’s a messy, flawed, pragmatic, flexible, useful, and realistic system for representing knowledge.

What do you think?

Some projects already do, and have done, some of these things for quite some time (please comment with examples!).

Perhaps it is wrong to try to apply lessons from commercial/mainstream web apps too closely to digital humanities projects, which after all, have different aims and priorities?
There are also different types of projects (some more like resources, others more like tools?), some of which might find these points inappropriate.

What other principles (and web trends) do you think digital humanities projects should be thinking about?

Further Reading

Reading Lists


Text Analysis – say it with flowers


Opera’s Semantic Web Widgets

Lately, there have been two semantic web widgets for Opera:

Both of these seem to work much better than their webpage equivalents.

So there is a use for these widget things after all?


Social Network of AJAX Books

Dietrich Kappe has done another of those social network studies of book consumption, using Amazon’s Customers Also Bought data.

(The original (at least, the first one I heard about) being Valdis Krebb’s study of polarised political book buying.)

The graph isn’t as pretty as Krebb’s, but it is more interesting in that it shows a more complex picture than the rather-to-be-expected left/right political divide of American politics.

Choice of programming language is also political of course, and it’s interesting that Kappe’s study shows that related books on server-side languages break up into subnets, whilst client-side technologies like CSS and javascript form a common ground (as you’d expect really).

If you’ve written or read similar studies, I’d appreciate it if you’d link to ’em in the comments.


Accessibility vs. Semantic Markup?

I came across a post about semantic markup and accessibility citing a remark I had made about how, for all the talk about semantic markup in the web-dev community, HTML isn’t a very semantic markup language.

The post goes so far as to say:

[…] when you mark up a page in HTML you shouldn’t get too hung up on the semantic meaning of the elements.[…] What you should be concerned about […] is describing your page elements in such a way as to make them easier to use by screen readers, keyboard-based browsers etc. For example, don’t ask ‘is this set of elements really an unordered list?’ but do ask ‘if I mark up this set of elements as an unordered list, does that make my page more accessible and easier to use?’

However, I feel this has got things backwards – accessibility should be, and will be, a consequence of good semantic markup.

Ideally, accessibility is a game for two: you provide the document in as semantic a form as you can, and the user agent interprets that document as intelligently as it can. And if the user agent isn’t smart enough to handle all the semantics of your document today, then it will be tomorrow. Admittedly, in practice, a lot of things have to be dumbed down for Internet Explorer – though these tend to be of the bells-and-whistles rather than the semantic variety – but it is usually better to aim at solid principles than at the moving target of particular user agents.

The post does make a valuable point about how HTML, besides having to describe a document’s structure, also has to be used as an application interface markup language – which, aside from a rather limited set of form widgets, it isn’t really equipped to do, semantically at least. So we have to make do with the semantically bland div tag spiced up with plenty of javascript.

In theory, there are lots of ways we can mark up user interfaces – XUL, XBL, XForms, ZAML – but all of these are hugely inaccessible compared to HTML (even HTML with javascript), because cross-browser support just isn’t there for anything else.

But the div doesn’t have to be bland anymore.

The Role Attribute

Yes, the role attribute is going to save the day.

Not only can we use it to add semantics to html with RDFa, but this mozilla tutorial shows how we can use that added semantic power to make javascripted widgets accessible as well.

You can read more about how wonderful the role attribute is at Mark Birbeck’s blog.

