Archive for Scripting

Tutorial: Networks (of Folksonomy) with Ruby, del.icio.us and Graphviz

I was idly thinking about my del.icio.us bookmarks – about how tags are connected to each other when they are used to describe the same bookmark – and wondering what those connections would look like as a graph.

Instead of simply searching the web and finding this del.icio.us tag grapher, I decided that I wanted to try playing with Graphviz (open-source graph-visualisation software), so I wrote a Ruby script to generate the .dot file from my bookmarks.

I really liked Graphviz. It’s a great tool, and .dot is a nice format: it abstracts away all the positioning and presentation, whereas if I had been generating an SVG file (for example), I would have had to calculate the position of every node myself.

Anyway, this is how I did it:

# open the bookmarks file (after running it through HTML Tidy
# first, to transform it into XML)
require "rexml/document"
file = File.new("delicious.xhtml")
doc = REXML::Document.new(file)

# create a 2D array: an array of arrays
# of the tags used for each bookmark
tag_sets = Array.new
doc.elements.each('//a') do |e|
  # skip any anchors that have no tags attribute
  tag_sets.push(e.attributes['tags'].split(',')) if e.attributes['tags']
end

# I added the following line because I had too many bookmarks,
# which made the graph too big and complicated:
#   tag_sets = tag_sets.slice(0..10)

# now flatten the 2D array, and get a 1D array
# of all the tags used - .uniq gets rid of duplicates
tag_list = tag_sets.flatten.uniq         


# get the relationships
relationships = Array.new

# now iterate through the tag list, 
# and for each tag, look for that in each of the bookmarks.
# If it's found, record a relationship with the other tags of
# that bookmark

tag_list.each do |tag|

  tag_sets.each do |tag_set|

    if tag_set.include? tag
      tag_set.each do |related_tag|
        relationships.push([tag, related_tag]) if tag != related_tag
      end
    end

  end

end

# relationships is now a 2D array of arrays each
# containing two tags

# put it into the .dot syntax

graph = "digraph x {\n" + relationships.uniq.collect { |r| '"' + r.join('" -> "') + '";' }.join("\n") + "\n}"

# now write it all into the .dot file


file = File.new("delicious_graph.dot", "w")
file.write(graph)
file.close()
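
For illustration, the generated file will look something like this (these tag names are invented – yours will come from your own bookmarks; note that each relationship appears in both directions, because the script records it once for each tag):

digraph x {
"ruby" -> "scripting";
"scripting" -> "ruby";
"ruby" -> "tutorial";
"tutorial" -> "ruby";
}

You can then render it from the command line with Graphviz, for example:

dot -Tsvg delicious_graph.dot -o delicious_graph.svg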

Links to the Results

I don’t expect the results will be of much interest to anyone, but here they are for completeness’ sake.

the .dot file
an SVG export of the graph (you may need a plugin, or a recent version of Firefox, Safari or Opera)


Comments (1)

Algorithms for Matching Strings

Last year, I had a database table full of authors’ names I’d entered in from various sources. Many of these names were simply variants of one another: eg J. Smith and John Smith. I had already associated many of these names with biographical information in another table. My problem now was to match up the unidentified name variants with the identified names.

I was using PHP for the project, and browsing through the docs, I came across some functions for matching strings: soundex, metaphone, and levenshtein.

soundex and metaphone are phonetic algorithms that output a code that will be the same for similar-sounding words (though perhaps spelt differently): eg Smith and Smyth. I discounted these because my name variations would sound quite different depending on how much of the name was given – eg: Smith, J. Smith, Mr. John H. Smith.


Levenshtein looked more promising: you give it two strings, and it gives you the number of single-character changes you would have to make to turn one string into the other. eg:
levenshtein('Smith', 'Smyth'); // returns 1

Ok, great for variants, but what about abbreviations? So I subtract the difference in length between the two strings from the levenshtein distance between them. Ok, better, but still not great: I’ve got an integer that might represent a crucial difference between short strings, and a negligible one between long strings. So I need a ratio: my integer divided by the length of the shorter string.

$levenshtein = levenshtein($known_name, $unknown_name);
// compare the lengths of the strings, not the strings themselves
$lengthdifference = abs(strlen($known_name) - strlen($unknown_name));
$levenshtein -= $lengthdifference;
$similarity = $levenshtein / min(strlen($known_name), strlen($unknown_name));

So I experimented with this a bit, and found that a similarity of < 0.4
would get me (almost) all of my variants, and not too many false matches. (One weakness is that, for example, The Reverend J. H. would not match J. Hall.)
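
To illustrate with hypothetical values: comparing ‘J. Smith’ with ‘John Smith’ gives a levenshtein distance of 3; the length difference is 2, so the adjusted distance is 1; divided by the length of the shorter string (8 characters), that gives a similarity of 0.125 – comfortably under 0.4, so it is offered as a possible match.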

I used this to generate a form with each of the unknown names (and the facts that were known about them), presented along-side radio selects for the possible matching known names (and the biographical details).

I could then go through each of the unknown names, and relatively easily match it with the right person (from a manageably small, relatively plausible selection). It should be noted that I was never trying to eliminate human decision altogether – human research was often necessary to determine if this John Smith really was the same as that John Smith.

I’ve posted this solution here because other people may have a similar problem, but also (more importantly) because my solution is stupid, and I’m hoping other people will post suggestions for better solutions. (Actually, some searching reveals that quite a lot of time and money has been spent on solving this, probably in very sophisticated ways – though the solutions may not be readily accessible to the average humanities hacker. I just found a website offering licenses for an algorithm called NameX – interestingly they explain how it works in reasonable detail.)

My solution (although it worked well enough for me) is stupid because it does not take into consideration many of the things that a human reader would in making a judgement.
A human reader knows, for example, about titles like Mr. and Reverend, and knows that they are not integral to the name (for these purposes at least). A human reader would also, perhaps, give more weight to a surname than to a forename. A skilled human reader would know from the context which orthographical differences were significant: in the Renaissance era, for example, I might replace J, VV might replace W, and so on, and names might well be latinised ('Peter' == 'Petrus').

A cleverer approach might have been to use a phonetic algorithm for the surname (assuming a surname can be separated from the rest of the string, perhaps with regular expressions), and if it passes that test, use my levenshtein-based approach with some other rules from human knowledge mixed in (eg: I == J). And if it was really clever, it might be able to look at the whole corpus to get an idea of context (eg: Is there a consistency to capitalisation or punctuation?).
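
As a very rough Ruby sketch of that two-stage idea (nothing here is authoritative: the surname regex and the I == J rule are simplistic assumptions, and Text::Soundex and Text::Levenshtein come from the third-party ‘text’ gem):

require 'text'   # gem install text

# crude assumption: the surname is the last alphabetic word in the string
def surname(name)
  name.scan(/[A-Za-z]+/).last
end

def possible_match?(known, unknown, threshold = 0.4)
  # stage one: do the surnames sound alike?
  return false unless Text::Soundex.soundex(surname(known)) ==
                      Text::Soundex.soundex(surname(unknown))
  # stage two: the levenshtein ratio from above, with a crude I == J rule mixed in
  a, b = [known, unknown].map { |n| n.upcase.tr('J', 'I') }
  distance = Text::Levenshtein.distance(a, b) - (a.length - b.length).abs
  distance.to_f / [a.length, b.length].min < threshold
end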

Or perhaps even an AI approach – a program that could be trained to recognise good matches, much as you can train your email software to recognise junk mail?

(NB: Although this approach is ignorant about personal names, it is equally ignorant about other types of strings, such as place names and book titles – which at least means it is broadly applicable.)

Suggestions are most welcome.

References

http://en.wikipedia.org/wiki/Hamming_distance
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Soundex
On Identifying Name Equivalences in Digital Libraries
The Identification of Authors in the Mathematical Reviews Database
Names: A New Frontier in Text Mining

Comments (5)

Accessibility vs. Semantic Markup?

I came across a post about semantic markup and accessibility citing a remark I had made about how, for all the talk about semantic markup in the web-dev community, HTML isn’t a very semantic markup language.

The post goes so far as to say:

[…] when you mark up a page in HTML you shouldn’t get too hung up on the semantic meaning of the elements.[…] What you should be concerned about […] is describing your page elements in such a way as to make them easier to use by screen readers, keyboard-based browsers etc. For example, don’t ask ‘is this set of elements really an unordered list?’ but do ask ‘if I mark up this set of elements as an unordered list, does that make my page more accessible and easier to use?’

However, I feel this has got things backwards – accessibility should be, and will be, a consequence of good semantic markup.

Ideally, accessibility is a game for two: you provide the document in as semantic a form as you can, and the user agent interprets that document as intelligently as it can. And if the user agent isn’t smart enough to handle all the semantics of your document today, then it will be tomorrow. Admittedly, in practice, a lot of things have to be dumbed down for Internet Explorer – though these tend to be of the bells-and-whistles rather than the semantic variety – but it is usually better to aim at solid principles than at the moving target of particular user agents.

The post does make a valuable point about how HTML, besides having to describe a document’s structure, also has to serve as an application-interface markup language – which, aside from a rather limited set of form widgets, it isn’t really equipped to do, semantically at least. So we have to make do with the semantically bland div tag, spiced up with plenty of JavaScript.

In theory, there are lots of ways we can mark up user interfaces – XUL, XBL, XForms, XAML – but all of these are hugely inaccessible compared to HTML (even HTML and JavaScript), because cross-browser support just isn’t there for anything else.

But the div doesn’t have to be bland anymore.

The Role Attribute

Yes, the role attribute is going to save the day.

Not only can we use it to add semantics to HTML with RDFa, but this Mozilla tutorial shows how we can use that added semantic power to make JavaScript widgets accessible as well.
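
For instance, a widget built out of divs can declare what it is. Here is a hypothetical sketch along the lines of the Mozilla tutorial (the exact attribute names vary between the early drafts and the eventual ARIA specification, so treat this as illustrative):

<!-- a div-based slider that announces its role and state to assistive technology -->
<div role="slider" aria-valuemin="0" aria-valuemax="100" aria-valuenow="50" tabindex="0"></div>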

You can read more about how wonderful the role attribute is at Mark Birbeck’s blog.

Comments (3)

Timeline API

Simile at MIT have recently come up with a timeline widget API – Google Maps for timelines. It lets you plot events and time ranges (taken from either XML or JSON) onto timelines, and it’s all pretty darn good.
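
The event XML is straightforward – something along these lines (a sketch from memory of the Simile examples, so check their documentation for the exact attribute names):

<data>
  <event start="Nov 22 1963 12:30:00" title="Shots fired in Dealey Plaza">
    The presidential motorcade passes through Dealey Plaza.
  </event>
</data>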

It should facilitate all kinds of interesting visualisations. There are a few impressive examples up already – my favourite is the timeline of the shooting of JFK.

Comments (4)

Tutorial: Writing a Word Frequencies Script in Ruby

There are plenty of ready-made programs that do the same thing and more, but I hope that
this basic example can serve as a useful jumping-off point for your own more
ingenious scripts.

The basic steps are as follows:

  1. Read the text file into a string
  2. Split the text into an array of words
  3. Count the number of times each word occurs, storing it in a hash
  4. Display the word frequency list

Ok, install Ruby if you don’t already have it on your machine, boot up a text editor (preferably one with Ruby syntax highlighting), and on we go with the code.

Read the text-file into a string

First, we want to get the name of the text file we’re analysing, and we’ll let the
user enter it at the prompt:

puts 'What is the name and path of the file?'
filename = gets.chomp

“puts” writes the string that follows it to the screen. “gets” reads a string typed
by the user at the prompt. “chomp” removes the carriage return from the end of that
string: after the user has typed in the filename, s/he presses Return to signal that
s/he has finished typing. We need to remove that carriage return, so that all we have
is the filename, which we store in a variable we are calling ‘filename’.
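
For example, if the user types “moby.txt” and presses Return, this is what chomp does:

"moby.txt\n".chomp    # => "moby.txt"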

We now create a new string variable that we are calling ‘text’.

text = String.new

‘text’ is where we will put the contents of our file.

File.open(filename) { |f|  text = f.read } 

Here, we are opening the file, and reading it into the ‘text’ variable. The syntax is
quite rubyish. In the first part, ‘File.open(filename) ‘, a file object is being
created, and passed to the block that follows it. The block is delimited by the curly
braces, and receives the file object through the variable ‘f’, which is specified
between the two pipe characters: |f|.
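
A nice side-effect of using the block form is that Ruby closes the file automatically
when the block finishes, so we don’t have to remember to close it ourselves.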

Split the text into an array of words

Onto step two: creating an array of all the words in the text. This is easy.

words = text.split(/[^a-zA-Z]/)

‘words’ is the name of our new array. We are ‘splitting’ our big string of text
(which we have called ‘text’) into chunks, using a regular expression ‘/[^a-zA-Z]/’.
Regular Expressions (reg exes) are a way of pattern matching text using wildcards.
They can be extraordinarily useful if you are working with electronic text, and
reading up on them will definitely reap rewards at some point (regular-expressions.info has a fairly comprehensive amount of information). Suffice to say here
that ‘[^a-zA-Z]’ matches anything that isn’t an alphabetic character; so our
‘words’ are all the chunks of text between non-alphabetic characters. This may
not be a precise enough definition of a word for your purposes, but we’ll assume it is
for now and push on.
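
One wrinkle worth knowing about: wherever two non-alphabetic characters occur in a
row, split produces an empty string, which will later be counted as if it were a word.
For example:

"one, two!".split(/[^a-zA-Z]/)    # => ["one", "", "two"]

If that bothers you, scanning for runs of letters instead of splitting between them
avoids it: words = text.scan(/[a-zA-Z]+/).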

Count the number of times each word occurs, storing it in a hash

freqs = Hash.new(0)

We create a new Hash to store the words and their frequencies in. A basic Hash
consists of pairs of ‘keys’ and ‘values’. You access a value by referring to its key.
In our case, the key will be a (unique) word, and its ‘value’ is the number of times
it occurs in the text.
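
The 0 we pass to Hash.new is a default value: looking up a key that isn’t in the hash
yet returns 0 rather than nil, which is what lets the counting line below stay so
short. For example:

freqs = Hash.new(0)
freqs['whale']    # => 0, even though we never stored anything under 'whale'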

words.each { |word| freqs[word] += 1 }

‘words.each’ takes each word one at a time from the array ‘words’, and passes it to
the block after it. Because we created the hash with a default value of 0,
‘freqs[word]’ starts at 0 the first time we encounter a word; each time the word
occurs, we add one to its count.

freqs = freqs.sort_by {|x,y| y }

This line sorts our hash by the frequency number, from least to most frequent. (Note
that sort_by returns an array of [word, frequency] pairs rather than a hash.)

freqs.reverse!

This line reverses the list, putting the greatest frequencies first. (The exclamation
mark after the method ‘reverse’ means that ‘freqs’ is to be reset to the outcome of
‘reverse’; it is the same as: ‘freqs = freqs.reverse’.)

Display the word frequency list

freqs.each {|word, freq| puts word+' '+freq.to_s}

Finally, we write our results to the screen. Note that the frequency number must be
converted to a string (‘freq.to_s’) to be used with ‘puts’.

And for those who want to cut and paste


puts 'What is the name and path of the file?'
filename = gets.chomp
text = String.new
File.open(filename) { |f| text = f.read }
words = text.split(/[^a-zA-Z]/)
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs = freqs.sort_by {|x,y| y }
freqs.reverse!
freqs.each {|word, freq| puts word+' '+freq.to_s}

Or, inspired by the concision of William Turkel’s Python word frequency code, you could do it like this:

#replace 'filename.txt' with the file you want to process
words = File.open('filename.txt') {|f| f.read }.split
freqs=Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs.sort_by {|x,y| y }.reverse.each {|w, f| puts w+' '+f.to_s}

Further Enhancements

And there we have it. There are definitely some improvements you might want to make.

You’ll probably want to convert your ‘text’ string to all lowercase or all uppercase
so that ‘Ruby’ and ‘ruby’ don’t get counted separately.

You may want to strip the text of sgml/xml tags before you split it into words.

You may want to convert plural nouns to singular, or normalise verb endings, or remove
any words that also occur in a stop-list (a list of very frequent common words that
you want to ignore).
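
As a sketch of the first and last of those suggestions (assuming ‘text’ already holds
the file’s contents, as above; the stop words here are just a sample – use whatever
list suits your texts):

stop_words = %w[the and of to a in that it is was]
words = text.downcase.scan(/[a-z]+/)
words = words.reject { |word| stop_words.include?(word) }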

You might make it work through a web browser instead of the
command line.

Best of all is if you make a customisation that is interesting to you
and your text, but isn’t already covered by the text analysis software currently
available. Please share your ideas for text analysis innovations in the Comments.

Comments (28)

Text Analysis with PHP tutorial

The English Department at Stanford run a Literature and Technology course that goes through some basic exercises with XML and PHP for humanities scholars. You can see the index at http://english.stanford.edu/resources/Exercises/exercises.php
If you already know something about XML and PHP, the meaty part is http://english.stanford.edu/resources/Exercises/html/exercise7.2.html
which walks you through writing a word-frequency script.


Leave a Comment

Ruby Quiz

Inspired by phpflashcards.com, I wrote an Ajax-driven Ruby quiz. This isn’t specifically humanities computing, but Ruby is a really nice scripting language, and a useful addition to the digital humanist’s tool box. I’m new to the language myself, and the aim of the thing is as much for me personally to learn Ruby as anything else – the questions are written with the intention of making Ruby’s syntax clear to the newcomer rather than to be really tricky.

Leave a Comment
