Semantic Web podcasts

Comments (2)

Tutorial: Key Word In Context with JavaScript

The Digital History Hacks blog has been running a nice series on using Python for digital humanities-type tasks. One of the tutorials was on creating a Key Word In Context list. It made me want to write a KWIC script too, so I wrote one in JavaScript:

// get the word you want to find
var word = prompt('Enter the word you wish to find');
//make a regular expression to find the word
var re = new RegExp('(\\w+?\\s+?)?(\\w+?\\s+?)?'+word+'(\\s+?\\w+?)?(\\s+?\\w+?)? ','gim');
// search the page text for the word, with up to two words of context either side
// (fall back to an empty array if there are no matches)
var matches = this.document.getElementsByTagName('body')[0].textContent.match(re) || [];
// open a new window and write the matches into an ordered list
var report = window.open();
report.document.write('<ol>');
for (var i = 0; i < matches.length; i++)
{
 report.document.write('<li>'+matches[i]+'</li>');
}
report.document.write('</ol>');

And now you can strip the white space, bung a ‘javascript:’ protocol in front of it, stick it in an href of an anchor and you’ve got a bookmarklet you can drag to your bookmarks toolbar:

KWIC

WordPress doesn’t seem to like bookmarklets, so I’m afraid you’ll have to turn it into a link yourself.
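
Something along these lines, for example (split over several lines here just for readability – in practice you'd strip the whitespace so it all sits on one line inside the href; wrapping the script in an anonymous function keeps its variables out of the page's own scope):

<a href="javascript:(function(){
var word=prompt('Enter the word you wish to find');
var re=new RegExp('(\\w+?\\s+?)?(\\w+?\\s+?)?'+word+'(\\s+?\\w+?)?(\\s+?\\w+?)? ','gim');
var matches=document.getElementsByTagName('body')[0].textContent.match(re)||[];
var report=window.open();
report.document.write('<ol>');
for(var i=0;i<matches.length;i++){report.document.write('<li>'+matches[i]+'</li>');}
report.document.write('</ol>');
})();">KWIC</a>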

Click it whilst on a web page, and you’ll get a list of all the occurrences of the word in context.
It works with TEI documents, and even plain text too (in Firefox, at least).

Comments (3)

Tutorial: Networks (of Folksonomy) with Ruby, del.icio.us and Graphviz

I was idly thinking about my del.icio.us bookmarks, how the tags are connected to each other when they are used to describe the same bookmarks, and wondering what they would look like as a graph.

Instead of simply searching the web and finding this del.icio.us tag grapher, I decided that I wanted to try playing with Graphviz (open source graph visualisation software), so I wrote a ruby script to write the .dot file from my bookmarks.

I really liked Graphviz. It’s a great tool, and .dot is a nice format, as it lets you abstract away all the positioning and presentation, whereas if I had been generating an SVG file (for example), I would have had to do lots of calculations for the positioning of all the nodes and everything.

Anyway, this is how I did it:

#open the bookmarks file (after running it through HTML Tidy
# first, to transform it into XML)
require "rexml/document"
file = File.new( "delicious.xhtml" )
doc = REXML::Document.new file

#create a 2D array: an array of an array 
# of the tags used for each bookmark.
tag_sets = Array.new()
# (skip any anchors that have no tags attribute)
doc.elements.each('//a') {|e| tag_sets.push(e.attributes['tags'].split(',')) if e.attributes['tags'] } 

# I added this following line because I had too many bookmarks, 
# making the graph too big and complicated: ->
#      tag_sets = tag_sets.slice(0..10)

# now flatten the 2D array, and get a 1D array
# of all the tags used - .uniq gets rid of duplicates
tag_list = tag_sets.flatten.uniq         


#get the relationships
relationships = Array.new()

# now iterate through the tag list, 
# and for each tag, look for that in each of the bookmarks.
# If it's found, record a relationship with the other tags of
# that bookmark

tag_list.each do |tag|
 
 tag_sets.each do |tag_set|
   
   if tag_set.include? tag
     tag_set.each do |related_tag|
     relationships.push([tag, related_tag]) if tag!=related_tag 
     end
   end
   
 end
  
end

# relationships is now a 2D array of arrays each
# containing two tags

# put it into the .dot syntax

graph = "digraph x { \r\n"+relationships.uniq.collect{|r|'"'+r.join('" -> "')+'";'}.join("\r")+"}"

# now  write it all into the .dot file


file = File.new("delicious_graph.dot", "w")
file.write(graph)
file.close()
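
For the curious, the generated file just looks something like this (tag names invented for illustration):

digraph x {
"history" -> "digital";
"digital" -> "history";
"digital" -> "tools";
}

Graphviz’s dot program can then do all the layout and export for you, with something like: dot -Tsvg delicious_graph.dot -o delicious_graph.svg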

Links to the Results

I don’t expect the results will be of much interest to anyone, but here they are for completeness’ sake.

the .dot file
an SVG export of the graph (you may need a plugin, or a recent version of Firefox, Safari or Opera)

Comments (1)

Digital Humanities Start-up funding

Leave a Comment

Algorithms for Matching Strings

Last year, I had a database table full of authors’ names I’d entered in from various sources. Many of these names were simply variants of one another: eg J. Smith and John Smith. I had already associated many of these names with biographical information in another table. My problem now was to match up the unidentified name variants with the identified names.

I was using PHP for the project, and browsing through the docs, I came across some functions for matching strings: soundex, metaphone, and levenshtein.

soundex and metaphone are phonetic algorithms that output a code that will be the same for similar-sounding words (though perhaps spelt differently): eg: Smith and Smyth. I discounted these because my name variations would sound quite different depending on how much of the name was given – eg: Smith, J. Smith, Mr. John H. Smith.
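
For instance (the exact codes matter less than whether or not they match):

soundex('Smith') == soundex('Smyth');         // true  – both 'S530'
metaphone('Smith') == metaphone('Smyth');     // true  – both 'SM0'
soundex('J. Smith') == soundex('John Smith'); // false – the abbreviated form produces a different code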


Levenshtein looked more promising: you give it two strings, and it gives you the number of single-character changes you would have to make to turn one string into the other, eg:
levenshtein('Smith', 'Smyth'); // returns 1

Ok, great for variants, but what about abbreviations? So I subtract the difference in length between the two strings from the levenshtein distance between them. Ok, better, but still not great: I’ve got an integer that might be a crucial difference for short strings, and negligible for long strings. So I need a ratio: my integer divided by the length of the shorter string.

$levenshtein = levenshtein($known_name, $unknown_name);
// subtract the difference in length, so abbreviations aren't penalised
$lengthdifference = abs(strlen($known_name) - strlen($unknown_name));
$levenshtein -= $lengthdifference;
// turn it into a ratio of the shorter string's length
$similarity = $levenshtein / min(strlen($known_name), strlen($unknown_name));

So I experimented with this a bit, and found that a similarity of < 0.4
would get me (almost) all of my variants, and not too many false matches. (One weakness is that, for example, The Reverend J. H. would not match J. Hall.)
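
To make the arithmetic concrete, a worked example (with made-up names):

$known_name = 'John Smith';
$unknown_name = 'J. Smith';
levenshtein($known_name, $unknown_name);   // 3 (o -> ., plus the missing h and n)
// length difference: 10 - 8 = 2, so the adjusted distance is 3 - 2 = 1
// similarity: 1 / strlen('J. Smith') = 1 / 8 = 0.125, comfortably under 0.4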

I used this to generate a form with each of the unknown names (and the facts that were known about them), presented alongside radio selects for the possible matching known names (and the biographical details).

I could then go through each of the unknown names, and relatively easily match it with the right person (from a manageably small, relatively plausible selection). It should be noted that I was never trying to eliminate human decision altogether – human research was often necessary to determine if this John Smith really was the same as that John Smith.

I’ve posted this solution here because other people may have a similar problem, but also (more importantly) because my solution is stupid, and I’m hoping other people will post suggestions for better solutions. (Actually, some searching reveals that quite a lot of time and money has been spent on solving this, probably in very sophisticated ways – though the solutions may not be readily accessible to the average humanities hacker. I just found a website offering licenses for an algorithm called NameX – interestingly they explain how it works in reasonable detail.)

My solution (although it worked well enough for me) is stupid because it does not take into consideration many of the things that a human reader would in making a judgement.
A human reader knows, for example, about titles like Mr. and Reverend, and knows that they are not integral to the name (for these purposes at least). A human reader would also give more weight to a surname, perhaps, than to a forename. A skilled human reader would know from the context which orthographical differences were significant: for example, in the Renaissance era, the letter I might replace J, VV might replace W, and so on, and names might well be Latinised ('Peter' == 'Petrus').

A cleverer approach might have been to use a phonetic algorithm for the surname (assuming a surname can be separated from the rest of the string, perhaps with regular expressions), and if it passes that test, use my levenshtein-based approach with some other rules from human knowledge mixed in (eg: I == J). And if it was really clever, it might be able to look at the whole corpus to get an idea of context (eg: Is there a consistency to capitalisation or punctuation?).
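
For what it’s worth, a rough sketch of what that cleverer approach might look like (purely illustrative – the list of titles, the surname-is-the-last-word assumption and the VV/W substitution are all placeholders):

function normalise_name($name) {
    // strip a few common titles (an incomplete, illustrative list)
    $name = preg_replace('/\b(Mr|Mrs|Dr|Rev(erend)?)\.?\s+/i', '', $name);
    // one Renaissance-era orthographic convention, as an example
    $name = str_replace('VV', 'W', $name);
    return trim($name);
}

function surname($name) {
    // naively assume the surname is the last word
    $words = preg_split('/\s+/', trim($name));
    return end($words);
}

function names_probably_match($known, $unknown, $threshold = 0.4) {
    $known = normalise_name($known);
    $unknown = normalise_name($unknown);
    // first hurdle: the surnames should at least sound alike
    if (metaphone(surname($known)) != metaphone(surname($unknown))) {
        return false;
    }
    // then fall back on the length-adjusted levenshtein ratio from above
    $distance = levenshtein($known, $unknown) - abs(strlen($known) - strlen($unknown));
    return ($distance / min(strlen($known), strlen($unknown))) < $threshold;
}

names_probably_match('John Smith', 'Mr. J. Smyth');  // true
names_probably_match('John Smith', 'J. Hall');       // false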

Or perhaps even an AI approach – a program that could be trained to recognise good matches, much as you can train your email software to recognise junk mail?

(NB: Although this approach is ignorant about personal names, it is equally ignorant about other types of strings, such as place names and book titles – which at least means it is broadly applicable.)

Suggestions are most welcome.

References

http://en.wikipedia.org/wiki/Hamming_distance
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Soundex
On Identifying Name Equivalences in Digital Libraries

The Identification of Authors in the Mathematical Reviews Database

Names: A New Frontier in Text Mining

Comments (5)

When is semantic html not important?

For many web developers semantic markup is very important. It makes your javascript and css code more maintainable, increases accessibility, helps search engines rank you better, and especially if you use microformats or RDFa, makes your data machine-readable.

What if you are transforming to html from a more semantic markup language like TEI? In this circumstance, isn’t HTML your presentation layer? How concerned with semantics should you be at this stage, when you are trying to prepare the material for human readers? Well, it is still easier to maintain CSS files than to change inline styles, font tags and misused tags in your XSL files. And you still get the accessibility benefits of using semantic html.

But should you care about machine readability if you are also publishing the more-semantic TEI source? Should you bother about putting RDFa into your html? Well, arguably you should. As I see it, TEI is for describing the elements of a document; transforming to HTML with RDFa can add a meaningful interpretative layer, stating what the document is saying about the external world. In addition, the semantics of RDF and HTML microformats are more widely understood by machines than the semantics of TEI.
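
For instance (an invented fragment – the URI and vocabulary are just placeholders): where the TEI source might only mark a string as a persName, the HTML output could also state who that name refers to:

<p xmlns:foaf="http://xmlns.com/foaf/0.1/"
   about="http://example.org/people/john-smith" typeof="foaf:Person">
  <span property="foaf:name">John Smith</span> wrote the pamphlet.
</p>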

But what about if you are providing the information in a (linked) RDF document as well? Is it still a beneficial thing to add these extra semantics to your html document? After all, why give machines the trouble of extracting RDF from your html if you are giving it to them pure and free anyway? Are, or will there be, user agents (for example, browser plugins, screenreaders) that will find these inline semantic statements more useful than pure RDF in a separate document?

Comments (7)

Web 2.0 and the Digital Humanities

What Digital Humanities tools could take from Web 2.0:

Give users tools to visualise and network their own data. And make it easy.

A good example is Last.FM. You run a program they give you that uploads the data about the songs you listen to, as you are listening to them. You can then see stats about your listening habits, and are linked with people with similar listening habits. The key thing is that you don’t have to do extra work.

Another example is LibraryThing, which makes it easy to visualise and network data about your book collection. It can’t be as automatic as Last.FM, but it does let you import any file you might happen to have with ISBNs in it.

Compare this to a Digital Humanities project: The Reading Experience Database, which aims to accumulate records of reading experiences. They ask that if you come across any reading experiences in your research, you note them down and submit them to the database with their online form (there are two – a four-page form, and a shorter one-page form if you can’t be bothered with four pages of forms).
I’m not out to disparage the RED here – in many ways it is a fine endeavour. But I do want to criticise the conceptual model of how it accumulates data:
It requires that you, as a researcher, do your normal work, and then go and fill in (ideally) four pages of web forms for every reading experience that you have found (and possibly already documented elsewhere). Do you like filling out forms? I don’t. Worst of all, you don’t get any kind of access to the data – yours, or anyone else’s (you just have to trust that they will eventually get around to coding a search page).
This doesn’t help you to do your work now.

Which brings me to my next point…

Harness the self-interest of your users

You need them to use you, so make it worth their while. Don’t ask for their help, help them!

One problem, I think, is when projects start from a research interest. They want to gather data on that topic, so they ask other researchers to help them by filling out web forms.

A better approach to gathering data, I suggest, is to help the user with their own research interests as a first priority. The guy who built del.icio.us, interestingly, said that he primarily wanted his users to tag bookmarks with the keywords that suited them best personally – to tag out of pure self-interest. The network effect of their tagging is a huge side benefit, but it doesn’t need to be the reason that people use del.icio.us. The end result is something more anarchic, more used, and more useful than something like dmoz.org.

del.icio.us doesn’t say I’m interested in French Renaissance Poetry, please fill in these forms. It gives you a tool to keep track of your bookmarks. It lets you import bookmarks you already have, and it lets you export your data too.

Have an API

You don’t know what you’ve got until you give it away.

SOAP is good, but it doesn’t even need to be that complicated. Make sure that search results are retrievable through a URL, and presented as semantic xhtml, and your data is already much more sharable (listen to Tom Coates’ presentation on the Web of Data). (Base4 has an interesting post arguing that the approach to URLs is the defining characteristic of Web 2.0.)

Sharing data in a machine-readable and retrievable format is the most important feature. It lets other people build features for you.

Back in March, Dan Cohen lamented the lack of non-commercial APIs suitable for the humanities hacker. And it’s odd – humanities scholarship is a community that you would think would want to facilitate access to and reuse of its data – yet the only useful APIs Dan Cohen could find (from programmableweb.org) were from the Library of Congress and the BBC. (It’s not quite as bad as that: commercial APIs are potentially useful too, and there’s also COPAC for querying UK research libraries, and of course Wikipedia.)
There are a ton of digital projects stored away in repositories, such as those provided by the AHDS, but few are much more accessible or usable in their digital form than in print.
I read that the ESTC is going to be made freely available through the British Library’s website later this year – imagine the historical mashups that could be done – the information that could be mined and visualised – if they would provide a developers’ API.

Embrace the chaos of knowledge

The exciting thing about the folksonomy approach of tagging, and about the user creation and maintenance of knowledge on Wikipedia, is that they have shown that a bottom-up method of knowledge representation can be more powerful and more accurate than traditional top-down methods.
It’s a messy, flawed, pragmatic, flexible, useful, and realistic system for representing knowledge.

What do you think?

Some projects already do, and have done, some of these things for quite some time (please comment with examples!).

Perhaps it is wrong to try to apply lessons from commercial/mainstream web apps too closely to digital humanities projects, which, after all, have different aims and priorities?
There are also different types of projects (some more like resources, others more like tools?), some of which might find these points inappropriate.

What other principles (and web trends) do you think digital humanities projects should be thinking about?

Further Reading

Reading Lists

Comments (7)
