Archive for text analysis

Tutorial: Key Word In Context with Javascript

The Digital History Hacks blog has been running a nice series on using Python for digital humanities type tasks. One of the tutorials was on creating a Key Word In Context list. It made me want to write a KWIC script too, so I wrote one in javascript:

// get the word you want to find
var word = prompt('Enter the word you wish to find');
//make a regular expression to find the word
var re = new RegExp('(\\w+?\\s+?)?(\\w+?\\s+?)?'+word+'(\\s+?\\w+?)?(\\s+?\\w+?)? ','gim');
var matches = this.document.getElementsByTagName('body')[0].textContent.match(re);
var report =;
for(i = 0; i < matches.length; i++)

And now you can strip the white space, bung a ‘javascript:’ protocol in front of it, stick it in an href of an anchor and you’ve got a bookmarklet you can drag to your bookmarks toolbar:


WordPress doesn’t seem to like bookmarklets, so I’m afraid you’ll have to turn it into a link yourself.

Click it whilst on a web page, and you’ll get a list of all the occurrences of the word in context.
It works with TEI documents, and even plain text too. (On firefox at least).

Comments (3)

Algorithms for Matching Strings

Last year, I had database table full of authors’ names I’d entered in from various sources. Many of these names were simply variants of one another: eg J. Smith and John Smith. I had already associated many of these names with biographical information in another table. My problem now was to match up the unidentified name variants with the identified names.

I was using PHP for the project, and browsing through the docs, I came across some functions for matching strings. soundex, metaphone, and levenshtein.

soundex and metaphone are phonetic algorithms that output a code that will be the same for similar sounding words (though differently perhaps spelt): eg: Smith and Smyth. I discounted these because my name variations would sound quite differently depending on how much of the name was given – eg: Smith, J. Smith, Mr. John H. Smith.

looked more promising; you give it two strings, and it gives you the number of changes you would have to make to change one string into the other. eg:
levenshtein('Smith', 'Smyth'); // returns 1

Ok, great for variants, but what about abbreviations? So I subtract the difference between the two strings from the levenshtein distance between them. Ok, better, but still not great: I’ve got an integer that might be a crucial difference to short strings, and negligible to long strings. So I need a ratio: my integer divided by the length of the smallest string.

$levenshtein = levenshtein($known_name, $unknown_name);
$lengthdifference = max($known_name, $unknown_name) - min($known_name, $unknown_name);	
$similarity = $levenshtein/strlen(min($known_name, $unknown_name));

So I experimented with this a bit, and found that a similarity of < 0.4
would get me (almost) all of my variants, and not too many false matches. (One weakness is that, for example, The Reverend J. H. would not match J. Hall.)

I used this to generate a form with each of the unknown names (and the facts that were known about them), presented along-side radio selects for the possible matching known names (and the biographical details).

I could then go through each of the unknown names, and relatively easily match it with the right person (from a managebly small, relatively plausible selection). – It should be noted that I was never trying to eliminate human decision altogether – human research was often necessary to determine if this John Smith really was the same as that John Smith.

I’ve posted this solution here because other people may have a similar problem, but also (more importantly) because my solution is stupid, and I’m hoping other people will post suggestions for better solutions. (Actually, some searching reveals that quite a lot of time and money has been spent on solving this, probably in very sophisticated ways – though the solutions may not be readily accessible to the average humanities hacker. I just found a website offering licenses for an algorithm called NameX – interestingly they explain how it works in reasonable detail.)

My solution (although it worked well enough for me) is stupid because it does not take into consideration many of the things that a human reader would in making a judgement.
A human reader knows, for example, about titles like Mr. and Reverend, and knows that they are not integral to the name (for these purposes at least). A human reader would also give more weight to a surname, perhaps than a forename. A skilled human reader would know from the context which orthographical differences were significant: for example, in the Renaissance era, I might replace J, VV replace W, etc, and names might well be latinised ('Peter' == 'Petrus').

A cleverer approach might have been to use a phonetic algorithm for the surname (assuming a surname can be separated from the rest of the string, perhaps with regular expressions), and if it passes that test, use my levenshtein-based approach with some other rules from human knowledge mixed in (eg: I == J). And if it was really clever, it might be able to look at the whole corpus to get an idea of context (eg: Is there a consistency to capitalisation or punctuation?).

Or perhaps even an AI approach – a program that could be trained to recognise good matches, much as you can train your email software to recognise junk mail?

(NB: Although this approach is ignorant about personal names, it is equally ignorant about other types of strings, such as place names and book titles; which at least means it is broadly applicable.)

Suggestions are most welcome.

On Identifying Name Equivalences in Digital Libraries

The Identification of Authors in the Mathematical Reviews Database

Names: A New Frontier in Text Mining

Comments (5)

Text Analysis – say it with flowers

Leave a Comment

Tutorial: Writing a Word Frequencies Script in Ruby

There are plenty of ready-made programs that do the same thing and more, but I hope that
this basic example can serve as a useful jumping off-point for your own more
ingenious scripts.

The basic steps are as follows:

  1. Read the text file into a string
  2. Split the text into an array of words
  3. Count the number of times each word occurs, storing it in a hash
  4. Display the word frequency list

Ok, install Ruby if you don’t already have it on your machine, and boot up a text editor (preferably one with ruby syntax highlighting), and on we go with the code.

Read the text-file into a string

First, we want to get the name of the text file we’re analysing, and we’ll let the
user enter it at the prompt:

 puts 'What is the name and path of the file?'
filename = gets.chomp

“puts” writes the string that follows it to the screen
“gets” gets a string from the user at the prompt
“chomp” removes the carriage return from the end of the string. After the user has
typed in the filename, s/he presses Return to signal that s/he has finished typing.
We need to remove that carriage return, so that all we have is the filename, which we
store in a variable we are calling ‘filename’.

We now create a new string variable that we are calling ‘text’.

text =

‘text’ is where we will put the contents of our file. { |f|  text = } 

Here, we are opening the file, and reading it into the ‘text’ variable. The syntax is
quite rubyish. In the first part, ‘ ‘, a file object is being
created, and passed to the block that follows it. The block is delimited by the curly
braces, and receives the file object through the variable ‘f’, which is specified
between the two pipe characters: |f|.

Split the text into an array of words

Onto step two: creating an array of all the words in the text. This is easy.

words = text.split(/[^a-zA-Z]/)

‘words’ is the name of our new array. We are ‘splitting’ our big string of text
(which we have called ‘text’) into chunks, using a regular expression ‘/[^a-zA-Z]/’.
Regular Expressions (reg exes) are a way of pattern matching text using wildcards.
They can be extraordinarily useful if you are working with electronic text, and
reading up on them will definitely reap rewards at some point ( has a fairly comprehensive amount of information). Suffice to say here
that ‘[^a-zA-Z]’ matches anything that isn’t an alphabetic character; so our
‘words’ are all the chunks of text between non-alphabetic characters. This may
not be precise enough definition of a word for your purposes, but we’ll assume it is
for now and push on.

Count the number of times each word occurs, storing it in a hash

freqs =

We create a new Hash to store the words and their frequencies in. A basic Hash
consists of pairs of ‘keys’ and ‘values’. You access a value by referring to its key.
In our case, the key will be a (unique) word, and its ‘value’ is the number of times
it occurs in the text.

words.each { |word| freqs[word] += 1 }

‘words.each’ takes each word one at a time from the array ‘words’, and passes it to
the block after it. If the word doesn’t yet have an entry in our hash (if
!freqs[word]), then we create an entry with a value of 1. Otherwise (if we have
encountered the word before), the value is whatever it was before, plus one.

 freqs = freqs.sort_by {|x,y| y }

This line sorts our hash by the frequency number.


This line sorts it in order of greatest frequency first (The exclamation mark after
the method ‘reverse’ means that ‘freqs’ is to be reset to the outcome of ‘reverse’;
it is the same as: ‘freqs = freqs.reverse’).

Display the word frequency list

freqs.each {|word, freq| puts word+' '+freq.to_s}

Finally, we write our results to the screen. Note that the frequency number must by
converted to a string (‘freq.to_s’) to be used with ‘puts’.

And for those who want to cut and paste

puts 'What is the name and path of the file?'
filename = gets.chomp
text = { |f| text = }
words = text.split(/[^a-zA-Z]/)
freqs =
words.each { |word| freqs[word] += 1 }
freqs = freqs.sort_by {|x,y| y }
freqs.each {|word, freq| puts word+' '+freq.to_s}

Or, inspired by the concision of William Turkel’s Python word frequency code, you could do it like this:

#replace 'filename.txt' with the file you want to process
words ='filename.txt') {|f| }.split
words.each { |word| freqs[word] += 1 }
freqs.sort_by {|x,y| y }.reverse.each {|w, f| puts w+' '+f.to_s}

Further Enhancements

And there we have it. There are definitely some improvements you might want to make.

You’ll probably want to convert your ‘text’ string to all lowercase or all uppercase
so that ‘Ruby’ and ‘ruby’ don’t get counted separately.

You may want to strip the text of sgml/xml tags before you split it into words.

You may want to convert plural nouns to singular, or normalise verb endings, or remove
any words that also occur in a stop-list (a list of very frequent common words that
you want to ignore).

You might make it work through a web browser instead of the
command line.

Best of all is if you make a customisation that is interesting to you
and your text, but isn’t already covered by the text analysis software currently
available. Please share your ideas for text analysis innovations in the Comments.

Comments (28)