Archive for February, 2006

Data Analytics Firefox Extension

This is still being developed, so not all the features are implemented yet.
You can get it and/or read more about it here.

I think this could be a really useful extension. It lets you select some tabular data on a page, then manipulate it as you would in a spreadsheet or database application: editing it, performing aggregate functions, and outputting reports and graphs.

Sure, lots of spreadsheet applications let you import data from html tables, but where it gets really interesting is the roadmap, which seems to promise the capability to import from RDF. Here’s hoping…

Comments (1)

Talking with Talis Podcast

Talking with Talis is a podcast you might be interested in listening to. Talis is a service provider to libraries in the UK, and they’ve set up a podcast interviewing various luminaries worldwide in the field of information management. The latest episode has a lively discussion of ‘Library 2.0’.

Leave a Comment

Tutorial: Writing a Word Frequencies Script in Ruby

There are plenty of ready-made programs that do the same thing and more, but I hope that
this basic example can serve as a useful jumping-off point for your own more
ingenious scripts.

The basic steps are as follows:

  1. Read the text file into a string
  2. Split the text into an array of words
  3. Count the number of times each word occurs, storing it in a hash
  4. Display the word frequency list

OK, install Ruby if you don’t already have it on your machine, boot up a text editor (preferably one with Ruby syntax highlighting), and on we go with the code.

Read the text file into a string

First, we want to get the name of the text file we’re analysing, and we’ll let the
user enter it at the prompt:

puts 'What is the name and path of the file?'
filename = gets.chomp

‘puts’ writes the string that follows it to the screen. ‘gets’ reads a string typed
by the user at the prompt. ‘chomp’ removes the trailing newline from the end of that
string: after typing in the filename, the user presses Return to signal that s/he has
finished typing, and we need to remove that newline so that all we have is the
filename, which we store in a variable we are calling ‘filename’.

We now create a new string variable that we are calling ‘text’.

text = String.new

‘text’ is where we will put the contents of our file.

File.open(filename) { |f|  text = f.read } 

Here, we are opening the file and reading it into the ‘text’ variable. The syntax is
quite rubyish. In the first part, ‘File.open(filename)’, a file object is being
created, and passed to the block that follows it. The block is delimited by the curly
braces, and receives the file object through the variable ‘f’, which is specified
between the two pipe characters: |f|.

Split the text into an array of words

Onto step two: creating an array of all the words in the text. This is easy.

words = text.split(/[^a-zA-Z]/)

‘words’ is the name of our new array. We are ‘splitting’ our big string of text
(which we have called ‘text’) into chunks, using the regular expression ‘/[^a-zA-Z]/’.
Regular expressions (regexes) are a way of pattern matching text using wildcards.
They can be extraordinarily useful if you are working with electronic text, and
reading up on them will definitely reap rewards at some point (regular-expressions.info has a fairly comprehensive amount of information). Suffice it to say here
that ‘[^a-zA-Z]’ matches anything that isn’t an alphabetic character; so our
‘words’ are all the chunks of text between non-alphabetic characters. This may
not be a precise enough definition of a word for your purposes, but we’ll assume it is
for now and push on.
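To see why that definition of a word may not be precise enough, it’s worth trying the split on a small string (a quick sketch, assuming you run it in a Ruby session): apostrophes cut words in two, and adjacent separators leave empty strings behind.

```ruby
# The character-class split breaks "don't" into two 'words', and the
# comma-plus-space pair leaves an empty string in the array.
p "don't stop, now".split(/[^a-zA-Z]/)   # ["don", "t", "stop", "", "now"]

# One possible refinement: throw the empty strings away afterwards.
words = "don't stop, now".split(/[^a-zA-Z]/).reject { |w| w.empty? }
p words                                  # ["don", "t", "stop", "now"]
```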

Count the number of times each word occurs, storing it in a hash

freqs = Hash.new(0)

We create a new Hash to store the words and their frequencies in. A basic Hash
consists of pairs of ‘keys’ and ‘values’. You access a value by referring to its key.
In our case, the key will be a (unique) word, and its ‘value’ is the number of times
it occurs in the text.
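The 0 we pass to ‘Hash.new’ is a default value: any key we haven’t stored yet comes back as 0 rather than nil, which is what lets us increment counts without first checking whether a word has an entry. A quick demonstration:

```ruby
# A hash created with a default value of 0 returns 0 for missing keys.
freqs = Hash.new(0)
puts freqs['never_seen']   # 0

# So counting needs no existence check: 0 + 1, then 1 + 1.
freqs['ruby'] += 1
freqs['ruby'] += 1
puts freqs['ruby']         # 2
```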

words.each { |word| freqs[word] += 1 }

‘words.each’ takes each word one at a time from the array ‘words’, and passes it to
the block after it. Because we created the hash with a default value of 0
(‘Hash.new(0)’), a word we haven’t seen yet starts with a count of 0, so
‘freqs[word] += 1’ creates its entry with a value of 1. If we have encountered the
word before, the value is whatever it was before, plus one.

freqs = freqs.sort_by {|x,y| y }

This line sorts our hash by the frequency number. Note that ‘sort_by’ actually
returns an array of [word, frequency] pairs, which is what ‘freqs’ now holds.

freqs.reverse!

This line reverses the array, so that the greatest frequencies come first (The
exclamation mark after the method ‘reverse’ means that ‘freqs’ is to be replaced by
the outcome of ‘reverse’; it is the same as: ‘freqs = freqs.reverse’).

Display the word frequency list

freqs.each {|word, freq| puts word+' '+freq.to_s}

Finally, we write our results to the screen. Note that the frequency number must be
converted to a string (‘freq.to_s’) to be used with ‘puts’.

And for those who want to cut and paste:


puts 'What is the name and path of the file?'
filename = gets.chomp
text = String.new
File.open(filename) { |f| text = f.read }
words = text.split(/[^a-zA-Z]/)
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs = freqs.sort_by {|x,y| y }
freqs.reverse!
freqs.each {|word, freq| puts word+' '+freq.to_s}

Or, inspired by the concision of William Turkel’s Python word frequency code, you could do it like this (note that a bare ‘split’ divides on whitespace, so punctuation stays attached to words):

#replace 'filename.txt' with the file you want to process
words = File.open('filename.txt') {|f| f.read }.split
freqs=Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs.sort_by {|x,y| y }.reverse.each {|w, f| puts w+' '+f.to_s}

Further Enhancements

And there we have it. There are definitely some improvements you might want to make.

You’ll probably want to convert your ‘text’ string to all lowercase or all uppercase
so that ‘Ruby’ and ‘ruby’ don’t get counted separately.
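A minimal sketch of the lowercase version (the sample string is just an illustration):

```ruby
# Downcase the text before splitting, so 'Ruby' and 'ruby' share one entry.
text = "Ruby is ruby"
words = text.downcase.split(/[^a-zA-Z]/)
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
puts freqs['ruby']   # 2
```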

You may want to strip the text of sgml/xml tags before you split it into words.
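One naive way to do the stripping is a regex substitution that deletes anything between angle brackets. This is fine for simple markup, though it is not a real parser (the sample string is invented for illustration):

```ruby
# Replace every tag with a space before splitting into words.
html = "<p>Hello <em>world</em></p>"
text = html.gsub(/<[^>]*>/, ' ')
p text.split(/[^a-zA-Z]/).reject { |w| w.empty? }   # ["Hello", "world"]
```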

You may want to convert plural nouns to singular, or normalise verb endings, or remove
any words that also occur in a stop-list (a list of very frequent common words that
you want to ignore).
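The stop-list idea can be sketched in a couple of lines (the word lists here are tiny, invented examples; a real stop-list would be much longer):

```ruby
# Drop any word that appears in the stop-list before counting.
stopwords = ['the', 'a', 'of', 'and']
words = ['the', 'name', 'of', 'the', 'rose']
words = words.reject { |word| stopwords.include?(word) }
p words   # ["name", "rose"]
```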

You might make it work through a web browser instead of the
command line.

Best of all is if you make a customisation that is interesting to you
and your text, but isn’t already covered by the text analysis software currently
available. Please share your ideas for text analysis innovations in the Comments.

Comments (28)

Text Analysis with PHP tutorial

The English Department at Stanford runs a Literature and Technology course that goes through some basic exercises with XML and PHP for humanities scholars. You can see the index at http://english.stanford.edu/resources/Exercises/exercises.php
If you already know something about XML and PHP, the meaty part is: http://english.stanford.edu/resources/Exercises/html/exercise7.2.html
which walks you through writing a word-frequency script.


Leave a Comment

Authoring Born TEI

I wrote my thesis in TEI. Before I began, I searched Google (mainly in vain) for advice and examples of writing born-TEI – that is, documents originally written in TEI, not encoded in it afterwards. So, for the benefit of others who are also thinking of authoring in TEI, here is some of what I took away from the experience.

  • Writing (and Thinking) Digitally and Semantically Can Be Quite Different from Writing (and Thinking) in the Conventions of the Printed Page

    For one thing, I was tempted down the path of DRY (Don’t Repeat Yourself). So, for example, when citing a book, instead of having a footnote with bibliographic details, I had a <ptr/> element with a target attribute pointing to the id of the <bibl> entry in my bibliography. Not having to repeat yourself is nice.

    Another thing you can do is write your notes inline with the text, and then transform them when it comes to presentation, replacing the text with a reference number, and moving the text to the foot of the page or the end of the document. Not only is this breaking out of the mindset of print, it is easier than having to shoot down to a notes section in the document every time you want to write one.

    You could also do the same thing with citations I suppose. Instead of targeting items in a bibliography section, you might simply write the bibliographic details out inline with your main text, targeting back in subsequent citations, and moving everything down into a bibliography (and notes) in the presentational stage. However, you may want to have a bibliography in the TEI as well, for those books and articles that you don’t refer to directly, but nonetheless want to acknowledge. You may also prefer to write your bibliography before you start writing the main text. It depends how you work.

  • You (probably) Still Have to Present It in the Conventions of the Printed Page

    It can be pretty annoying having to batter your born-digital document back into the typographical conventions you tried so hard to think and write outside of. The great advantage of course, is that you can present your text in many different forms without touching the original document. Unfortunately, most new documents, such as university dissertations, only really have to be presented in one form, so this advantage didn’t really console me much.

  • TEI offers too many different ways to fulfil common tasks

    Not that we need less choice, but it would be good if there were ‘microformats’ for authoring in TEI, so that you didn’t have to develop so many mini principles of best practice as you wrote.

    An example: in your bibliography, you have some URLs. Scholarly practice dictates that you include a ‘last accessed on’ date, but how do you mark it up? This is a situation where you have to follow a convention anyway, so it would be really useful if you could follow a conventional way to mark it up. If we all do <date type="lastAccessed">2004-10-16</date> then we can share stylesheets and other tools. And that would be nice.

  • HTML is pretty un-semantic

    There is a lot of talk in the web-dev community about the importance of semantic (x)html. And it’s true, html authors should try to write as semantically as possible. Transforming from a really semantic mark-up language like TEI, though, you realise how little meaning you can actually give text with html. Of course, it is a good thing that html has a far smaller tag set – imagine the success of the web if every homepage-jockey had to wade through the TEI guidelines to publish their poetry and pet photography. But it really puzzles me why in html we have so many tags for presenting programming stuff – kbd, samp, var, code – but no tags for marking up the stuff that programmers really care about, like dates and names.

    So, if you are going to transform your TEI into html, you also have to decide how semantic your html is going to be, how much presentation you are going to do with XSLT (or the scripting language of your choice), and how much you are going to do with CSS. This probably depends heavily on the browsers you need to support. CSS3 is quite powerful, but it ain’t gonna work in Internet Explorer. CSS is also, I find, a bit easier to read and work with than XSLT, but you will need to stop-gap html’s small tag set with plenty of classed spans and divs, and it can get quite time-consuming switching between xsl and css files trying to locate and solve various presentational glitches (did I do this in css, or xsl?).

    One answer is to skip the html stage. Style your TEI with css, and use only a mere sprinkling of scripting/xslt to re-order and copy chunks of content. This has the advantage that your document will retain its semantics right up until it hits the printer ribbon. The disadvantage is that it loses the functionality of html – you won’t have hyperlinks, and it will only really work in the newest, most standards-compliant browsers, so it won’t be terribly accessible.
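To make the <ptr/>-to-bibliography pattern from the first point concrete, a born-TEI citation might look something like this (the id and the bibliographic details are invented for illustration):

```xml
<!-- In the running text: point at the bibliography entry
     instead of repeating its details in a footnote -->
<p>As Smith has argued <ptr target="#smith1999"/>, ...</p>

<!-- Once, in the bibliography -->
<listBibl>
  <bibl xml:id="smith1999">Smith, J., <title>An Example Book</title>
    (London: Example Press, 1999)</bibl>
</listBibl>
```

At presentation time, a stylesheet can resolve each <ptr/> into a numbered footnote or an inline citation, so the details live in exactly one place.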

Mapping TEI to HTML

One of the annoying differences between TEI and valid (x)html is that in TEI, lists and quotes can occur inside paragraphs, but in html they can’t. So I thought it might be helpful to put my solution to this here as well. The following template assumes that quotes longer than 130 characters are blockquotes, whilst shorter quotes will be inline. Lists that are part of a paragraph’s text (i.e. a comma-separated list) cannot be transformed to a <ul> or an <ol> (well, maybe they can if you split the paragraph in two and fiddle enough with the css, but that’s probably less semantic than transforming them into plain text). I have marked up these lists in the TEI with @type=’inline’.

<xsl:template match="tei:p[child::tei:list[not(@type='inline')]|child::tei:cit[string-length(tei:quote) > 130]|child::tei:listBibl]">
  <p>
    <xsl:if test="@xml:id">
      <xsl:attribute name="id">
        <xsl:value-of select="@xml:id"/>
      </xsl:attribute>
    </xsl:if>
    <xsl:attribute name="class">
      <xsl:value-of select="string('preblock')"/>
    </xsl:attribute>
    <xsl:for-each select="node()[following-sibling::tei:cit[string-length(tei:quote) > 130]|following-sibling::tei:list[not(@type='inline')]|following-sibling::tei:listBibl]">
      <xsl:apply-templates select="current()"/>
    </xsl:for-each>
  </p>
  <xsl:apply-templates select="tei:list|tei:cit[string-length(tei:quote) > 130]|tei:listBibl"/>
  <p class="postblock">
    <xsl:for-each select="node()[preceding-sibling::tei:list[not(@type='inline')]|preceding-sibling::tei:listBibl|preceding-sibling::tei:cit[string-length(tei:quote) > 130]]">
      <xsl:apply-templates select="current()"/>
    </xsl:for-each>
  </p>
</xsl:template>

NB: XHTML 2.0, when it comes, will allow lists within paragraphs.
Also, if you don't already know, #tei-c at irc.freenode.net is a good place to ask, argue and discuss TEI.

Comments are (as always) most welcome.

Comments (3)

Ruby Quiz

Inspired by phpflashcards.com, I wrote an Ajax-driven Ruby quiz. This isn’t specifically humanities computing, but Ruby is a really nice scripting language, and a useful addition to the digital humanist’s tool box. I’m new to the language myself, and the aim of the thing is as much for me personally to learn Ruby as anything else – the questions are written with the intention of making Ruby’s syntax clear to the newcomer rather than to be really tricky.

Leave a Comment

Relationship Graphs

New JavaScript Canvas Graph Library
This library makes it easy to generate relationship graphs, which has really interesting applications in visualising humanities data and concepts – for instance, graphing social networks.

It’s pretty simple to get it to do its stuff. You just make an html page with a bunch of elements having unique ids and containing text. These are the entities to be graphed, and the text is the label. Then you define their relationships in your javascript like this:
var g = new Graph();

g.addEdge($('fred'), $('wilma'));

var layouter = new Graph.Layout.Spring(g);
layouter.layout();

var renderer = new Graph.Renderer.Basic($('people'), g);
renderer.draw();
This makes two entities (fred and wilma), creates a line between them, and draws this in the canvas element that you have in your html (with the id ‘people’).

I gave it a try with some of the data from my thesis project. While it’s easy to use, and is a nice proof of concept, I find it far too slow to be practically useful with any largish amount of data. To graph the relationships between 30 people, the page takes about 20-30 seconds to load, with at least one “Unresponsive Script” warning (in Firefox at least; admittedly it is a bit better in Safari and Opera).

My test graph shows book owners and book authors; a line between two people means that the owner owned a book by the author.

If I have the time, I’d like to write something similar in a server-side language.

Leave a Comment