<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Semantic Humanities &#187; tutorials</title>
	<atom:link href="http://semantichumanities.wordpress.com/category/tutorials/feed/" rel="self" type="application/rss+xml" />
	<link>http://semantichumanities.wordpress.com</link>
	<description>web technology and humanities scholarship</description>
	<lastBuildDate>Fri, 22 Dec 2006 00:46:13 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language></language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='semantichumanities.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/bd12708ab1f97a85cad305a9a4ffec26?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Semantic Humanities &#187; tutorials</title>
		<link>http://semantichumanities.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://semantichumanities.wordpress.com/osd.xml" title="Semantic Humanities" />
		<item>
		<title>Tutorial: Key Word In Context with Javascript</title>
		<link>http://semantichumanities.wordpress.com/2006/09/06/tutorial-key-word-in-context-with-javascript/</link>
		<comments>http://semantichumanities.wordpress.com/2006/09/06/tutorial-key-word-in-context-with-javascript/#comments</comments>
		<pubDate>Wed, 06 Sep 2006 00:12:48 +0000</pubDate>
		<dc:creator>semantichumanities</dc:creator>
				<category><![CDATA[text analysis]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">http://semantichumanities.wordpress.com/2006/09/06/tutorial-key-word-in-context-with-javascript/</guid>
		<description><![CDATA[The Digital History Hacks blog has been running a nice series on using Python for digital humanities type tasks. One of the tutorials was on creating a Key Word In Context list. It made me want to write a KWIC script too, so I wrote one in javascript:


// get the word you want to find
var [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><a href="http://digitalhistory.uwo.ca/dhh/index.php">The Digital History Hacks blog</a> has been running a nice series on using Python for digital humanities type tasks. One of the tutorials was on creating a Key Word In Context list. It made me want to write a KWIC script too, so I wrote one in javascript:</p>
<p><code></p>
<pre>
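// get the word you want to find
var word = prompt('Enter the word you wish to find');
// escape regex metacharacters so the word is matched literally
var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&amp;');
// make a regular expression to find the word, with up to two words
// of context on either side
var re = new RegExp('(\\w+\\s+)?(\\w+\\s+)?\\b' + escaped + '\\b(\\s+\\w+)?(\\s+\\w+)?', 'gi');
// match() returns null when nothing is found, so fall back to an empty list
var matches = this.document.getElementsByTagName('body')[0].textContent.match(re) || [];
var report = window.open();
report.document.write('&lt;ol&gt;');
for (var i = 0; i &lt; matches.length; i++)
{
 report.document.write('&lt;li&gt;' + matches[i] + '&lt;/li&gt;');
}
report.document.write('&lt;/ol&gt;');
report.document.close();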
</pre>
<p></code></p>
<p>And now you can strip the white space, bung a &#8216;javascript:&#8217; protocol in front  of it, stick it in an href of an anchor and you&#8217;ve got a bookmarklet you can drag to your bookmarks toolbar:</p>
<p><del><a href="var word = prompt('Enter the word you wish to find'); var re = new RegExp('(\\w+?\\s+?)?(\\w+?\\s+?)?'+word+'(\\s+?\\w+?)?(\\s+?\\w+?)? ','gim'); thisDoc = this.document.getElementsByTagName('body')[0]; var matches = thisDoc.textContent.match(re); var report = window.open(); report.document.write('&lt;ol&gt;'); for(i = 0;  i  matches.length;  i++){ report.document.write('&lt;li&gt;'+matches[i]+'&lt;/li&gt;'); }report.document.write('&lt;/ol&gt;'); report.document.close();" title="a keyword in context bookmarklet" rel="bookmarklet">KWIC</a></del></p>
<p><ins>WordPress doesn&#8217;t seem to like bookmarklets, so I&#8217;m afraid you&#8217;ll have to  turn it into a link yourself.</ins></p>
<p>Click it whilst on a web page, and  you&#8217;ll get a list of all the occurrences of the word in context.<br />
It works with TEI documents, and even plain text too (on Firefox at least).</p>
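<p>For comparison, here is a minimal sketch of the same key-word-in-context idea in Ruby, working on a plain string rather than a web page. It is not a port of the bookmarklet: the method name <var>kwic</var> and its <var>width</var> parameter are made up for illustration, and it splits on whitespace instead of using one big regular expression.</p>

```ruby
# Hypothetical helper: return each occurrence of `word` in `text`
# together with up to `width` words of context on either side.
def kwic(text, word, width = 2)
  tokens = text.split                       # split on runs of whitespace
  target = word.downcase
  lines = []
  tokens.each_with_index do |token, i|
    # ignore case and punctuation when comparing a token to the target
    next unless token.downcase.gsub(/\W/, '') == target
    first = [i - width, 0].max              # do not run off the start
    lines.push(tokens[first..(i + width)].join(' '))
  end
  lines
end

sample = 'The cat sat on the mat. The dog sat by the door.'
kwic(sample, 'sat').each { |line| puts line }
```

<p>Because it compares whole tokens, this sketch will not find the word inside a longer word, which may or may not be what you want.</p>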
</div>]]></content:encoded>
			<wfw:commentRss>http://semantichumanities.wordpress.com/2006/09/06/tutorial-key-word-in-context-with-javascript/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5e8e1a35811b32fde34824e34012e10d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">semantichumanities</media:title>
		</media:content>
	</item>
		<item>
		<title>Tutorial: Networks (of Folksonomy) with Ruby, del.icio.us and Graphviz</title>
		<link>http://semantichumanities.wordpress.com/2006/09/04/tutorial-networks-of-folksonomy-with-ruby-delicious-and-graphiz/</link>
		<comments>http://semantichumanities.wordpress.com/2006/09/04/tutorial-networks-of-folksonomy-with-ruby-delicious-and-graphiz/#comments</comments>
		<pubDate>Mon, 04 Sep 2006 11:09:05 +0000</pubDate>
		<dc:creator>semantichumanities</dc:creator>
				<category><![CDATA[Scripting]]></category>
		<category><![CDATA[tutorials]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">https://semantichumanities.wordpress.com/2006/09/04/tutorial-networks-of-folksonomy-with-ruby-delicious-and-graphiz/</guid>
		<description><![CDATA[I was idly thinking about my del.icio.us bookmarks, how the tags are connected to each other when they are used to describe the same bookmarks, and wondering what they would look like as a graph.
Instead of simply searching the web and finding this del.icio.us tag grapher, I decided that I wanted to try playing with [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I was idly thinking about my del.icio.us bookmarks, how the tags are connected to each other when they are used to describe the same bookmarks, and wondering what they would look like as a graph.</p>
<p>Instead of simply searching the web and finding this <a href="http://www.hubmed.org/touchgraphs/deltags.php?start=history">del.icio.us tag grapher</a>, I decided that I wanted to try playing with <a href="http://www.graphviz.org/">Graphviz</a> (open source graphing software), so I wrote a Ruby script to write the <strong>.dot</strong> file from my bookmarks.</p>
<p>I really liked Graphviz. It&#8217;s a great tool, and .dot is a nice format, as it lets you abstract away all the positioning and presentation. If I had been generating an SVG file (for example), I would have had to calculate the position of every node and edge myself.</p>
<p>Anyway, this is how I did it:</p>
<p><code>
<pre>
#open the bookmarks file (after running it through HTML Tidy
# first, to transform it into XML)
require "rexml/document"
file = File.new( "delicious.xhtml" )
doc = REXML::Document.new file

#create a 2D array: an array of arrays,
# one array of tags for each bookmark.
tag_sets = Array.new()
# (a bookmark with no tags attribute contributes an empty array)
doc.elements.each('//a') {|e| tag_sets.push((e.attributes['tags'] || '').split(',')) } 

# I added this following line because I had too many bookmarks,
# making the graph too big and complicated: -&gt;
#      tag_sets = tag_sets.slice(0..10)

# now flatten the 2D array, and get a 1D array
# of all the tags used - <var>.uniq</var> gets rid of duplicates
tag_list = tag_sets.flatten.uniq         

#get the relationships
relationships = Array.new()

# now iterate through the tag list,
# and for each tag, look for that in each of the bookmarks.
# If it's found, record a relationship with the other tags of
# that bookmark

tag_list.each do |tag|

 tag_sets.each do |tag_set|

   if tag_set.include? tag
     tag_set.each do |related_tag|
     relationships.push([tag, related_tag]) if tag!=related_tag
     end
   end

 end

end

# <var>relationships</var> is now a 2D array of arrays each
# containing two tags

# put it into the <strong>.dot</strong> syntax

graph = "digraph x {\n"+relationships.uniq.collect{|r|'"'+r.join('" -&gt; "')+'";'}.join("\n")+"\n}\n"

# now  write it all into the <strong>.dot</strong> file

# the block form of File.open closes the file when the block ends
File.open("delicious_graph.dot", "w") {|f| f.write(graph) }
</pre>
<p></code></p>
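<p>To see what the script produces without needing a bookmarks file, here is a small self-contained sketch that runs on two made-up bookmarks. The method name <var>tags_to_dot</var> is hypothetical, and it takes a more direct route than the loops above (it asks Ruby for every ordered pair of tags within each bookmark), but it should yield the same set of edges.</p>

```ruby
# Hypothetical helper: turn an array of tag sets into DOT source.
def tags_to_dot(tag_sets)
  pairs = []
  tag_sets.each do |tag_set|
    # permutation(2) yields every ordered pair of different tags
    tag_set.uniq.permutation(2) { |from, to| pairs.push([from, to]) }
  end
  # one DOT edge per unique pair, e.g. "ruby" -> "graphviz";
  edges = pairs.uniq.map { |from, to| %("#{from}" -> "#{to}";) }
  "digraph x {\n" + edges.join("\n") + "\n}\n"
end

# Two imaginary bookmarks that share the tag 'ruby'
puts tags_to_dot([%w[ruby scripting], %w[ruby graphviz]])
```

<p>Each pair appears in both directions, which is why the graph is dense. Since co-occurrence is symmetrical, an undirected <var>graph</var> with <var>--</var> edges would arguably suit the data better.</p>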
<h4>Links to the Results</h4>
<p>I don&#8217;t expect the results will be of much interest to anyone, but here they are for completeness&#8217; sake.</p>
<p><a href="http://keithalexander.co.uk/files/delicious_graph.dot">the .dot file</a><br />
<a href="http://keithalexander.co.uk/files/delicious_graph.svg">an SVG export of the graph</a> (you may need a plugin, or a recent version of Firefox, Safari or Opera)</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://semantichumanities.wordpress.com/2006/09/04/tutorial-networks-of-folksonomy-with-ruby-delicious-and-graphiz/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5e8e1a35811b32fde34824e34012e10d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">semantichumanities</media:title>
		</media:content>
	</item>
		<item>
		<title>Algorithms for Matching Strings</title>
		<link>http://semantichumanities.wordpress.com/2006/08/30/algorithms-for-matching-name-variants/</link>
		<comments>http://semantichumanities.wordpress.com/2006/08/30/algorithms-for-matching-name-variants/#comments</comments>
		<pubDate>Wed, 30 Aug 2006 14:59:13 +0000</pubDate>
		<dc:creator>semantichumanities</dc:creator>
				<category><![CDATA[Scripting]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">https://semantichumanities.wordpress.com/2006/08/30/algorithms-for-matching-name-variants/</guid>
		<description><![CDATA[Last year, I had a database table full of authors&#8217; names I&#8217;d entered in from various sources. Many of these names were simply variants of one another: eg J. Smith and John Smith. I had already associated many of these names with biographical information in another table. My problem now was to match up the [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Last year, I had a database table full of authors&#8217; names I&#8217;d entered from various sources. Many of these names were simply variants of one another: eg <q>J. Smith</q> and <q>John Smith</q>. I had already associated many of these names with biographical information in another table. My problem now was to match up the unidentified name variants with the identified names.</p>
<p>I was using PHP for the project, and browsing through the docs, I came across some functions for matching strings: <var>soundex</var>, <var>metaphone</var>, and <var>levenshtein</var>.</p>
<p><var>soundex</var> and <var>metaphone</var> are phonetic algorithms that output a code that will be the same for similar-sounding words (though perhaps spelt differently): eg: <q>Smith</q> and <q>Smyth</q>. I discounted these because my name variations would sound quite different depending on how much of the name was given &#8211; eg: <q>Smith</q>, <q>J. Smith</q>, <q>Mr. John H. Smith</q>.</p>
<p><strong>Levenshtein</strong> looked more promising; you give it two strings, and it gives you the number of single-character changes (insertions, deletions or substitutions) you would have to make to change one string into the other. eg:<br />
<code>	levenshtein('Smith', 'Smyth'); // returns 1</code></p>
<p>Ok, great for variants, but what about abbreviations? An abbreviation differs from the full name mostly by being shorter, so I subtract the <em>difference in length</em> between the two strings from the <strong>levenshtein distance</strong> between them. Ok, better, but still not great: I&#8217;ve got an integer that might be a crucial difference for short strings, and negligible for long strings. So I need a ratio: my integer divided by the length of the shorter string.</p>
<p><code></p>
<pre>
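$levenshtein = levenshtein($known_name, $unknown_name);
// compare the lengths of the strings, not the strings themselves
$shorter = min(strlen($known_name), strlen($unknown_name));
$longer = max(strlen($known_name), strlen($unknown_name));
$lengthdifference = $longer - $shorter;
$levenshtein -= $lengthdifference;
// assumes neither name is empty, otherwise this divides by zero
$similarity = $levenshtein / $shorter;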
</pre>
<p></code></p>
<p>So I experimented with this a bit, and found that a similarity of <var>&lt; 0.4</var> (the lower the score, the closer the match) would get me (almost) all of my variants, and not too many false matches. (One weakness is that, for example, <q>The Reverend J. H.</q> would not match <q>J. Hall</q>.)</p>
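<p>For anyone who would rather try this in Ruby, here is a rough sketch of the same ratio. The Levenshtein distance is written out by hand as a textbook dynamic-programming routine rather than taken from a library, and the method names <var>levenshtein</var> and <var>name_score</var> are my own choices for illustration, not anything from PHP or from the original project.</p>

```ruby
# Dynamic-programming Levenshtein distance, written out by hand.
def levenshtein(a, b)
  previous = (0..b.length).to_a             # distances from '' to prefixes of b
  a.each_char.with_index(1) do |char_a, i|
    current = [i]                           # distance from a[0, i] to ''
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      candidates = [
        previous[j] + 1,                    # delete char_a
        current[j - 1] + 1,                 # insert char_b
        previous[j - 1] + cost              # substitute (or keep, if equal)
      ]
      current.push(candidates.min)
    end
    previous = current
  end
  previous.last
end

# Hypothetical helper mirroring the ratio described above:
# (distance - difference in length) / length of the shorter string.
def name_score(known, unknown)
  shorter, longer = [known.length, unknown.length].minmax
  return nil if shorter.zero?               # an empty name cannot be scored
  (levenshtein(known, unknown) - (longer - shorter)).to_f / shorter
end

puts name_score('Smith', 'Smyth')
puts name_score('J. Smith', 'John Smith')
```

<p>Both example pairs score under 0.4 (0.2 and 0.125 respectively), so with the threshold above they would be offered as candidate matches.</p>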
<p>I used this to generate a form with each of the unknown names (and the facts that were known about them), presented along-side radio selects for the possible matching known names (and the biographical details).</p>
<p>I could then go through each of the unknown names, and relatively easily match it with the right person (from a manageably small, relatively plausible selection). It should be noted that I was never trying to eliminate human decision altogether &#8211; human research was often necessary to determine if <em>this</em> John Smith really was the same as <em>that</em> John Smith.</p>
<p>I&#8217;ve posted this solution here because other people may have a similar problem, but also (more importantly) because my solution is stupid, and I&#8217;m hoping other people will post suggestions for better solutions. <ins>(Actually, some searching reveals that quite a lot of time and money has been spent on solving this, probably in very sophisticated ways &#8211; though the solutions may not be readily accessible to the average humanities hacker. I just found a website offering licenses for an algorithm called <a href="http://www.originsnetwork.com/namex/">NameX</a> &#8211; interestingly they explain how it works in reasonable detail.)</ins></p>
<p>My solution (although it worked well enough for me) is stupid because it does not take into consideration many of the things that a human reader would in making a judgement.<br />
A human reader knows, for example, about titles like <q>Mr.</q> and <q>Reverend</q>, and knows that they are not integral to the name (for these purposes at least). A human reader would also give more weight to a surname, perhaps, than to a forename. A skilled human reader would know from the context which orthographical differences were significant: for example, in the Renaissance era, <var>I</var> might replace <var>J</var>, <var>VV</var> replace <var>W</var>, etc, and names might well be latinised (<code>'Peter' == 'Petrus'</code>).</p>
<p>A cleverer approach might have been to use a phonetic algorithm for the surname (assuming a surname can be separated from the rest of the string, perhaps with regular expressions), and if it passes that test, use my levenshtein-based approach with some other rules from human knowledge mixed in (eg: I == J). And if it was really clever, it might be able to look at the whole corpus to get an idea of context (eg: Is there a consistency to capitalisation or punctuation?). </p>
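<p>As a very rough sketch of the kind of human knowledge that could be mixed in, the following Ruby normalises a name before any comparison is made. Everything in it is an assumption for illustration: the list of titles is made up and far from complete, the surname is naively taken to be the last word, and only unaccented ASCII letters are handled. The result is meant as a comparison key, not something to show to a reader.</p>

```ruby
# Hypothetical list of titles to ignore; extend it for your own corpus.
TITLES = %w[mr mrs dr rev reverend sir the].freeze

# Reduce a name to a list of lower-case words with titles removed
# and some Renaissance spelling habits evened out.
def normalise_name(name)
  words = name.downcase.split(/[^a-z]+/).reject { |w| w.empty? }
  words = words.reject { |w| TITLES.include?(w) }
  # treat VV as W, then treat every I as J, so both spellings agree
  words.map { |w| w.gsub('vv', 'w').tr('i', 'j') }
end

# Naive assumption: the surname is whatever word comes last.
def surname(name)
  normalise_name(name).last
end

p normalise_name('Mr. John H. Smith')
p normalise_name('Iohn Smith') == normalise_name('John Smith')
p surname('VVilliam VVard') == surname('Rev. William Ward')
```

<p>Keys like these could then be fed to a phonetic or Levenshtein-based comparison instead of the raw strings.</p>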
<p>Or perhaps even an AI approach &#8211; a program that could be trained to recognise good matches, much as you can train your email software  to recognise junk mail?</p>
<p>(NB: Although this approach is ignorant about personal names, it is equally ignorant about other types of strings, such as place names and book titles; which at least means it is broadly applicable.)</p>
<p>Suggestions are most welcome.</p>
<h4>References</h4>
<p><a href="http://en.wikipedia.org/wiki/Hamming_distance">http://en.wikipedia.org/wiki/Hamming_distance</a><br />
<a href="http://en.wikipedia.org/wiki/Levenshtein_distance">http://en.wikipedia.org/wiki/Levenshtein_distance</a><br />
<a href="http://en.wikipedia.org/wiki/Soundex">http://en.wikipedia.org/wiki/Soundex</a><br />
<a href="http://informationr.net/ir/9-4/paper192.html">On Identifying Name Equivalences in Digital Libraries</a></p>
<p><a href="http://www.istl.org/01-summer/databases.html">The Identification of Authors in the Mathematical Reviews Database</a></p>
<p><a href="http://www.ecom.arizona.edu/ISI/long_paper_sample.pdf#search=%22match%20name%20variants%22">Names: A New Frontier in Text Mining</a></p>
</div>]]></content:encoded>
			<wfw:commentRss>http://semantichumanities.wordpress.com/2006/08/30/algorithms-for-matching-name-variants/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5e8e1a35811b32fde34824e34012e10d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">semantichumanities</media:title>
		</media:content>
	</item>
		<item>
		<title>Tutorial: Writing a Word Frequencies Script in Ruby</title>
		<link>http://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/</link>
		<comments>http://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/#comments</comments>
		<pubDate>Tue, 21 Feb 2006 01:03:14 +0000</pubDate>
		<dc:creator>semantichumanities</dc:creator>
				<category><![CDATA[Scripting]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">https://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/</guid>
		<description><![CDATA[There are plenty of ready-made programs that do the same thing and more, but I hope that
this basic example can serve as a useful jumping off-point for your own more
ingenious scripts.
The basic steps are as follows:

Read the text file into a string
Split the text into an array of words
Count the number of times each word [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>There are plenty of ready-made programs that do the same thing and more, but I hope that<br />
this basic example can serve as a useful jumping-off point for your own more<br />
ingenious scripts.</p>
<p>The basic steps are as follows:</p>
<ol>
<li>Read the text file into a string</li>
<li>Split the text into an array of words</li>
<li>Count the number of times each word occurs, storing it in a hash</li>
<li>Display the word frequency list</li>
</ol>
<p>Ok, <a href="http://pine.fm/LearnToProgram/?Chapter=00">install Ruby if you don&#8217;t already have it on your machine</a>, and boot up a text editor (preferably one with ruby syntax highlighting), and on we go with the code.</p>
<h4>Read the text-file into a string</h4>
<p>First, we want to get the name of the text file we&#8217;re analysing, and we&#8217;ll let the<br />
user enter it at the prompt:</p>
<pre><code> puts 'What is the name and path of the file?'
filename = gets.chomp
</code></pre>
<p>&#8220;puts&#8221; writes the string that follows it to the screen.<br />
&#8220;gets&#8221; reads a line that the user types at the prompt.<br />
&#8220;chomp&#8221; removes the trailing newline from the end of the string. After the user has<br />
typed in the filename, s/he presses Return to signal that s/he has finished typing, and that<br />
keypress arrives as a newline character. We need to remove it, so that all we have is the filename, which we<br />
store in a variable we are calling &#8216;filename&#8217;.</p>
<p>We now create a new string variable that we are calling &#8216;text&#8217;.</p>
<pre><code>text = String.new
</code></pre>
<p>&#8216;text&#8217; is where we will put the contents of our file.</p>
<pre><code>File.open(filename) { |f|  text = f.read } </code></pre>
<p>Here, we are opening the file, and reading it into the &#8216;text&#8217; variable. The syntax is<br />
quite rubyish. In the first part, &#8216;File.open(filename)&#8217;, a file object is being<br />
created, and passed to the block that follows it. The block is delimited by the curly<br />
braces, and receives the file object through the variable &#8216;f&#8217;, which is specified<br />
between the two pipe characters: |f|. When the block finishes, the file is closed automatically.</p>
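<p>If you want to try the block form on its own, here is a small sketch that writes a throwaway file and reads it back. The method name <var>read_text</var> is made up for this example, and <var>Tempfile</var> is only used so that the sketch does not depend on any file on your machine.</p>

```ruby
require 'tempfile'

# Read a whole file into a string. The block form of File.open
# closes the file when the block finishes, and File.open returns
# whatever the block returns.
def read_text(path)
  File.open(path) { |f| f.read }
end

# Demonstration with a throwaway file so the sketch runs anywhere.
Tempfile.create('sample') do |tmp|
  tmp.write("to be or not to be\n")
  tmp.flush                                 # make sure the text is on disk
  puts read_text(tmp.path)
end
```

<p>Because the block&#8217;s value is handed back by <var>File.open</var>, there is no need to create an empty string first; assigning the result directly works just as well.</p>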
<h4>Split the text into an array of words</h4>
<p>Onto step two: creating an array of all the words in the text. This is easy.</p>
<pre><code>words = text.split(/[^a-zA-Z]+/).reject { |word| word.empty? }
</code></pre>
<p>&#8216;words&#8217; is the name of our new array. We are &#8216;splitting&#8217; our big string of text<br />
(which we have called &#8216;text&#8217;) into chunks, using a regular expression &#8216;/[^a-zA-Z]+/&#8217;.<br />
Regular Expressions (reg exes) are a way of pattern matching text using wildcards.<br />
They can be extraordinarily useful if you are working with electronic text, and<br />
reading up on them will definitely reap rewards at some point (<a href="http://www.regular-expressions.info/">regular-expressions.info</a> covers them fairly comprehensively). Suffice to say here<br />
that &#8216;[^a-zA-Z]+&#8217; matches any run of one or more characters that <em>aren&#8217;t</em> alphabetic; so our<br />
&#8216;words&#8217; are all the chunks of text between non-alphabetic characters. (The &#8216;reject&#8217; at the end throws away the empty string that &#8216;split&#8217; returns when the text begins with a non-alphabetic character.) This may<br />
not be a precise enough definition of a word for your purposes, but we&#8217;ll assume it is<br />
for now and push on.</p>
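<p>Here is a small sketch of how the splitting behaves on a sample sentence, including a case where this definition of a word falls short. The method name <var>split_words</var> is just for illustration. It matches runs of non-alphabetic characters (the <var>+</var>) and discards empty strings, since splitting on single characters leaves an empty string wherever two separators sit side by side.</p>

```ruby
# Split text into alphabetic chunks, discarding empty strings.
def split_words(text)
  text.split(/[^a-zA-Z]+/).reject { |word| word.empty? }
end

# Splitting on single characters leaves an empty string between
# the comma and the space:
p 'a, b'.split(/[^a-zA-Z]/)

# Splitting on runs of characters avoids that. Note that the
# apostrophe in "it's" cuts the word in two, and the year vanishes.
p split_words("Hello, world -- it's 2006!")
```

<p>If contractions or numbers matter for your text, the character class is the place to adjust.</p>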
<h4>Count the number of times each word occurs, storing it in a hash</h4>
<pre><code>freqs = Hash.new(0)
</code></pre>
<p>We create a new Hash to store the words and their frequencies in. A basic Hash<br />
consists of pairs of &#8216;keys&#8217; and &#8216;values&#8217;. You access a value by referring to its key.<br />
In our case, the key will be a (unique) word, and its &#8216;value&#8217; is the number of times<br />
it occurs in the text.</p>
<pre><code>words.each { |word| freqs[word] += 1 }
</code></pre>
<p>&#8216;words.each&#8217; takes each word one at a time from the array &#8216;words&#8217;, and passes it to<br />
the block after it. If the word doesn&#8217;t yet have an entry in our hash, looking it up<br />
returns the default of 0 that we gave to &#8216;Hash.new(0)&#8217;, so the entry is created with a value of 1. Otherwise (if we have<br />
encountered the word before), the value is whatever it was before, plus one.</p>
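<p>A quick sketch of the counting step on a tiny word list, to show what the default value of 0 is doing. The method name <var>count_words</var> is only for illustration.</p>

```ruby
# Count how often each word occurs.
def count_words(words)
  freqs = Hash.new(0)                       # missing keys read as 0
  words.each { |word| freqs[word] += 1 }
  freqs
end

counts = count_words(%w[to be or not to be])
p counts['to']                              # seen twice
p counts['question']                        # never seen, so the default
p counts.key?('question')                   # reading did not create an entry
```

<p>Without the default, <var>freqs[word]</var> would be <var>nil</var> the first time round and the <var>+=</var> would fail.</p>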
<pre><code> freqs = freqs.sort_by {|x,y| y }
</code></pre>
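<p>This line sorts our hash by the frequency number, lowest first. Note that &#8216;sort_by&#8217; returns an array of [word, frequency] pairs rather than a hash, so from here on &#8216;freqs&#8217; is an array.</p>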
<pre><code> freqs.reverse!
</code></pre>
<p>This line reverses the array, so that the greatest frequency comes first. (The exclamation mark after<br />
the method &#8216;reverse&#8217; means that the array is reversed in place;<br />
the effect here is the same as: &#8216;freqs = freqs.reverse&#8217;.)</p>
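<p>The two sorting steps can be tried on their own with a hand-made hash. This is only a sketch: the method name <var>ranked</var> and the sample counts are invented.</p>

```ruby
# Turn a hash of counts into [word, count] pairs, highest count first.
def ranked(freqs)
  freqs.sort_by { |_word, count| count }.reverse
end

p ranked({ 'to' => 2, 'or' => 1, 'be' => 3 })
```

<p>The result is an array of pairs, which is why the display step can take each pair apart into a word and a frequency.</p>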
<h4>Display the word frequency list</h4>
<pre><code>freqs.each {|word, freq| puts word+' '+freq.to_s}
</code></pre>
<p>Finally, we write our results to the screen. Note that the frequency number must be<br />
converted to a string (&#8216;freq.to_s&#8217;) before it can be joined to the word with &#8216;+&#8217;.</p>
<h4>And for those who want to cut and paste</h4>
<p><code><br />
puts 'What is the name and path of the file?'<br />
<var>filename</var> = gets.chomp<br />
<var>text</var> = String.new<br />
File.open(filename) { |f|  <var>text</var> = f.read } <br />
<var>words</var> = <var>text</var>.split(/[^a-zA-Z]+/).reject { |word| word.empty? }<br />
<var>freqs</var> = Hash.new(0)<br />
<var>words</var>.each { |word| freqs[word] += 1 }<br />
<var>freqs</var> = freqs.sort_by {|x,y| y }<br />
<var>freqs</var>.reverse!<br />
<var>freqs</var>.each {|word, freq| puts word+' '+freq.to_s}<br />
</code></p>
<p><ins><em>Or</em>, inspired by the concision of <a href="http://digitalhistory.uwo.ca/dhh/index.php/2006/08/20/easy-pieces-in-python-word-frequencies/">William Turkel&#8217;s Python word frequency code</a>, you could do it like this:</ins><br />
<code><br />
#replace 'filename.txt' with the file you want to process <br />
 words   = File.open('filename.txt') {|f| f.read }.split        <br />
 freqs=Hash.new(0)      <br />
words.each { |word| freqs[word] += 1 } <br />
freqs.sort_by {|x,y| y }.reverse.each {|w, f| puts w+'  '+f.to_s}          <br />
</code></p>
<h4>Further Enhancements</h4>
<p>And there we have it. There are definitely some improvements you might want to make.</p>
<p>You&#8217;ll probably want to convert your &#8216;text&#8217; string to all lowercase or all uppercase<br />
so that &#8216;Ruby&#8217; and &#8216;ruby&#8217; don&#8217;t get counted separately.</p>
<p>You may want to strip the text of sgml/xml tags before you split it into words. </p>
<p>You may want to convert plural nouns to singular, or normalise verb endings, or remove<br />
any words that also occur in a stop-list (a list of very frequent common words that<br />
you want to ignore). </p>
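<p>As a starting point for two of these, here is a rough sketch that folds everything to lowercase and skips words on a stop-list. The stop-list shown is a tiny made-up sample, and the method name <var>content_word_counts</var> is hypothetical. It also uses <var>scan</var> to pick out the words directly instead of splitting on what lies between them.</p>

```ruby
# Hypothetical stop-list; a real one would be much longer.
STOP_WORDS = %w[the a an and of to in is].freeze

# Count words, ignoring case and anything on the stop-list.
def content_word_counts(text)
  words = text.downcase.scan(/[a-z]+/)      # fold case, then pick out words
  freqs = Hash.new(0)
  words.each { |word| freqs[word] += 1 unless STOP_WORDS.include?(word) }
  freqs
end

p content_word_counts('The Ruby and the ruby of Ruby')
```

<p>With this sample, &#8216;Ruby&#8217; and &#8216;ruby&#8217; are counted together and the stop words drop out altogether.</p>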
<p>You might make it work through a web browser instead of the<br />
command line. </p>
<p>Best of all is if you make a customisation that is interesting to you<br />
and your text, but isn&#8217;t already covered by the text analysis software currently<br />
available. Please share your ideas for text analysis innovations in the Comments. </p>
</div>]]></content:encoded>
			<wfw:commentRss>http://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5e8e1a35811b32fde34824e34012e10d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">semantichumanities</media:title>
		</media:content>
	</item>
	</channel>
</rss>