<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Semantic Humanities &#187; text analysis</title>
	<atom:link href="http://semantichumanities.wordpress.com/category/text-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://semantichumanities.wordpress.com</link>
	<description>web technology and humanities scholarship</description>
	<lastBuildDate>Fri, 22 Dec 2006 00:46:13 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language></language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='semantichumanities.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/bd12708ab1f97a85cad305a9a4ffec26?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Semantic Humanities &#187; text analysis</title>
		<link>http://semantichumanities.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://semantichumanities.wordpress.com/osd.xml" title="Semantic Humanities" />
		<item>
		<title>Tutorial: Key Word In Context with Javascript</title>
		<link>http://semantichumanities.wordpress.com/2006/09/06/tutorial-key-word-in-context-with-javascript/</link>
		<comments>http://semantichumanities.wordpress.com/2006/09/06/tutorial-key-word-in-context-with-javascript/#comments</comments>
		<pubDate>Wed, 06 Sep 2006 00:12:48 +0000</pubDate>
		<dc:creator>semantichumanities</dc:creator>
				<category><![CDATA[text analysis]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">http://semantichumanities.wordpress.com/2006/09/06/tutorial-key-word-in-context-with-javascript/</guid>
		<description><![CDATA[The Digital History Hacks blog has been running a nice series on using Python for digital humanities type tasks. One of the tutorials was on creating a Key Word In Context list. It made me want to write a KWIC script too, so I wrote one in javascript:


// get the word you want to find
var [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><a href="http://digitalhistory.uwo.ca/dhh/index.php">The Digital History Hacks blog</a> has been running a nice series on using Python for digital humanities type tasks. One of the tutorials was on creating a Key Word In Context list. It made me want to write a KWIC script too, so I wrote one in javascript:</p>
<pre><code>// get the word you want to find
var word = prompt('Enter the word you wish to find');
// make a regular expression that finds the whole word, with up to two words of context on either side
var re = new RegExp('(\\w+\\s+)?(\\w+\\s+)?\\b' + word + '\\b(\\s+\\w+)?(\\s+\\w+)?', 'gim');
// match() returns null when nothing is found, so fall back to an empty list
var matches = document.body.textContent.match(re) || [];
var report = window.open();
report.document.write('&lt;ol&gt;');
for (var i = 0; i &lt; matches.length; i++)
{
 report.document.write('&lt;li&gt;' + matches[i] + '&lt;/li&gt;');
}
report.document.write('&lt;/ol&gt;');
report.document.close();
</code></pre>
<p>And now you can strip the white space, bung a &#8216;javascript:&#8217; protocol in front  of it, stick it in an href of an anchor and you&#8217;ve got a bookmarklet you can drag to your bookmarks toolbar:</p>
<p><del><a href="var word = prompt('Enter the word you wish to find'); var re = new RegExp('(\\w+?\\s+?)?(\\w+?\\s+?)?'+word+'(\\s+?\\w+?)?(\\s+?\\w+?)? ','gim'); thisDoc = this.document.getElementsByTagName('body')[0]; var matches = thisDoc.textContent.match(re); var report = window.open(); report.document.write('&lt;ol&gt;'); for(i = 0;  i  matches.length;  i++){ report.document.write('&lt;li&gt;'+matches[i]+'&lt;/li&gt;'); }report.document.write('&lt;/ol&gt;'); report.document.close();" title="a keyword in context bookmarklet" rel="bookmarklet">KWIC</a></del></p>
<p><ins>WordPress doesn&#8217;t seem to like bookmarklets, so I&#8217;m afraid you&#8217;ll have to  turn it into a link yourself.</ins></p>
<p>Click it whilst on a web page, and you&#8217;ll get a list of all the occurrences of the word in context.<br />
It works with TEI documents, and even with plain text (in Firefox at least).</p>
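<p>The same idea also works outside the browser. Here is a minimal sketch in Ruby, assuming the text is already in a string. The method name <var>kwic</var> and its <var>width</var> parameter are made up for this illustration, and it splits the text into word tokens rather than using one big regular expression, so it is a variation on the script above rather than a translation of it.</p>

```ruby
# Return each occurrence of `word` with up to `width` words of context
# on either side. `kwic` and `width` are illustrative names only.
def kwic(text, word, width = 2)
  tokens = text.scan(/\w+/)            # crude tokeniser: runs of word characters
  hits = []
  tokens.each_with_index do |token, i|
    next unless token.casecmp?(word)   # case-insensitive comparison
    first = [i - width, 0].max         # do not run off the start of the text
    hits.push(tokens[first..(i + width)].join(' '))
  end
  hits
end

sample = 'The quick brown fox jumps over the lazy dog'
puts kwic(sample, 'the')
```

<p>Because it works on tokens, two keywords that sit close together each get their own line of context, instead of one swallowing the other.</p>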
</div>]]></content:encoded>
			<wfw:commentRss>http://semantichumanities.wordpress.com/2006/09/06/tutorial-key-word-in-context-with-javascript/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5e8e1a35811b32fde34824e34012e10d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">semantichumanities</media:title>
		</media:content>
	</item>
		<item>
		<title>Algorithms for Matching Strings</title>
		<link>http://semantichumanities.wordpress.com/2006/08/30/algorithms-for-matching-name-variants/</link>
		<comments>http://semantichumanities.wordpress.com/2006/08/30/algorithms-for-matching-name-variants/#comments</comments>
		<pubDate>Wed, 30 Aug 2006 14:59:13 +0000</pubDate>
		<dc:creator>semantichumanities</dc:creator>
				<category><![CDATA[Scripting]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">https://semantichumanities.wordpress.com/2006/08/30/algorithms-for-matching-name-variants/</guid>
		<description><![CDATA[Last year, I had a database table full of authors&#8217; names I&#8217;d entered from various sources. Many of these names were simply variants of one another: eg J. Smith and John  Smith. I had already associated many of these names with biographical information in another table. My problem now was to match up the [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Last year, I had a database table full of authors&#8217; names I&#8217;d entered from various sources. Many of these names were simply variants of one another: eg <q>J. Smith</q> and <q>John  Smith</q>. I had already associated many of these names with biographical information in another table. My problem now was to match up the unidentified name variants with the identified names.</p>
<p>I was using PHP for the project, and browsing through the docs, I came across some functions for matching strings: <var>soundex</var>, <var>metaphone</var>, and <var>levenshtein</var>.</p>
<p><var>soundex</var> and <var>metaphone</var> are phonetic algorithms that output a code that will be the same for similar sounding words (though perhaps spelt differently): eg: <q>Smith</q> and <q>Smyth</q>. I discounted these because my name variations would sound quite different depending on how much of the name was given &#8211; eg:  <q>Smith</q>, <q>J. Smith</q>, <q>Mr. John H. Smith</q>.</p>
<p><strong>Levenshtein</strong> looked more promising; you give it two strings, and it gives you the number of single-character changes (insertions, deletions or substitutions) you would have to make to change one string into the other. eg:<br />
<code>levenshtein('Smith', 'Smyth'); // returns 1</code></p>
<p>Ok, great for variants, but what about abbreviations? So I subtract the <em>difference in length</em> between the two strings from the <strong>levenshtein distance</strong> between them. Ok, better, but still not great: I&#8217;ve got an integer that might be a crucial difference to short strings, and negligible to long strings. So I need a ratio: my integer divided by the length of the shorter string.</p>
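<pre><code>// Edit distance between the two names
$levenshtein = levenshtein($known_name, $unknown_name);
// Discount the difference in length, so that abbreviations are not penalised
$lengthdifference = abs(strlen($known_name) - strlen($unknown_name));
$levenshtein -= $lengthdifference;
// Scale by the length of the shorter name (0 means the closest possible match)
$similarity = $levenshtein / min(strlen($known_name), strlen($unknown_name));
</code></pre>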
<p>So I experimented with this a bit, and found that a similarity of <var>&lt; 0.4</var> (the lower the number, the closer the match) would get me (almost) all of my variants, and not too many false matches. (One weakness is that, for example, <q>The Reverend J. H.</q> would not match <q>J. Hall</q>.)</p>
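<p>For readers not working in PHP, here is a minimal sketch of the same calculation in Ruby. Ruby has no built-in levenshtein function, so the sketch includes a plain dynamic-programming version; the method names <var>levenshtein</var> and <var>name_distance</var> are chosen for this illustration only, and the guard for empty strings is an addition of the sketch.</p>

```ruby
# Levenshtein distance by dynamic programming, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # cheapest of: deletion, insertion, substitution
      curr.push([prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min)
    end
    prev = curr
  end
  prev.last
end

# Edit distance, minus the difference in length, divided by the length
# of the shorter name. Lower means a closer match.
def name_distance(known, unknown)
  lengths = [known.length, unknown.length]
  shortest = lengths.min
  return 1.0 if shortest.zero?   # assumption: treat an empty name as no match
  adjusted = levenshtein(known, unknown) - (lengths.max - shortest)
  adjusted.to_f / shortest
end

puts levenshtein('Smith', 'Smyth')
puts name_distance('J. Smith', 'John Smith')
```

<p>A pair of names would then count as a candidate match when <var>name_distance</var> comes out below the 0.4 threshold mentioned above.</p>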
<p>I used this to generate a form with each of the unknown names (and the facts that were known about them), presented along-side radio selects for the possible matching known names (and the biographical details).</p>
<p>I could then go through each of the unknown names, and relatively easily match it with the right person (from a manageably small, relatively plausible selection). It should be noted that I was never trying to eliminate human decision altogether &#8211; human research was often necessary to determine if <em>this</em> John Smith really was the same as <em>that</em> John Smith.</p>
<p>I&#8217;ve posted this solution here because other people may have a similar problem, but also (more importantly) because my solution is stupid, and I&#8217;m hoping other people will post suggestions for better solutions. <ins>(Actually, some searching reveals that quite a lot of time and money has been spent on solving this, probably in very sophisticated ways &#8211; though the solutions may not be readily accessible to the average humanities hacker. I just found a website offering licenses for an algorithm called <a href="http://www.originsnetwork.com/namex/">NameX</a> &#8211; interestingly they explain how it works in reasonable detail.)</ins></p>
<p>My solution (although it worked well enough for me) is stupid because it does not take into consideration many of the things that a human reader would in making a judgement. A human reader knows, for example, about titles like <q>Mr.</q> and <q>Reverend</q>, and knows that they are not integral to the name (for these purposes at least). A human reader would also perhaps give more weight to a surname than to a forename. A skilled human reader would know from the context which orthographical differences were significant: for example, in the Renaissance era, <var>I</var> might replace <var>J</var>, <var>VV</var> replace <var>W</var>, etc, and names might well be latinised (<code>'Peter' == 'Petrus'</code>).</p>
<p>A cleverer approach might have been to use a phonetic algorithm for the surname (assuming a surname can be separated from the rest of the string, perhaps with regular expressions), and if it passes that test, use my levenshtein-based approach with some other rules from human knowledge mixed in (eg: I == J). And if it was really clever, it might be able to look at the whole corpus to get an idea of context (eg: Is there a consistency to capitalisation or punctuation?). </p>
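<p>As a rough sketch of what mixing in some human knowledge might look like, the Ruby below strips titles and folds the Renaissance spellings together before any comparison is made. The <var>TITLES</var> list, the method names, and the assumption that the surname is simply the last word are all illustrative guesses, not a tested recipe.</p>

```ruby
# Hypothetical list of words that are not really part of the name.
TITLES = %w[mr mrs dr rev reverend the].freeze

# Lowercase the name, drop titles and punctuation, and fold spelling
# variants together: VV becomes W, and J is treated as I.
def normalise_name(name)
  words = name.downcase.scan(/[a-z]+/).reject { |w| TITLES.include?(w) }
  words.map { |w| w.gsub('vv', 'w').tr('j', 'i') }
end

# Assumption: the surname is whatever word comes last.
def surname(name)
  normalise_name(name).last
end

p normalise_name('Mr. Iohn VVhite')
p normalise_name('John White')
puts surname('The Reverend John H. Smith')
```

<p>Both sides of a comparison would need to go through the same normalisation, and the surnames could then be handed to a phonetic algorithm before the levenshtein-based ratio is tried on the rest.</p>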
<p>Or perhaps even an AI approach &#8211; a program that could be trained to recognise good matches, much as you can train your email software  to recognise junk mail?</p>
<p>(NB: Although this approach is ignorant about personal names, it is equally ignorant about other types of strings, such as place names and book titles; which at least means it is broadly applicable.)</p>
<p>Suggestions are most welcome.</p>
<h4>References</h4>
<p><a href="http://en.wikipedia.org/wiki/Hamming_distance">http://en.wikipedia.org/wiki/Hamming_distance</a><br />
<a href="http://en.wikipedia.org/wiki/Levenshtein_distance">http://en.wikipedia.org/wiki/Levenshtein_distance</a><br />
<a href="http://en.wikipedia.org/wiki/Soundex">http://en.wikipedia.org/wiki/Soundex</a><br />
<a href="http://informationr.net/ir/9-4/paper192.html">On Identifying Name Equivalences in Digital Libraries</a></p>
<p><a href="http://www.istl.org/01-summer/databases.html">The Identification of Authors in the Mathematical Reviews Database</a></p>
<p><a href="http://www.ecom.arizona.edu/ISI/long_paper_sample.pdf#search=%22match%20name%20variants%22">Names: A New Frontier in Text Mining</a></p>
</div>]]></content:encoded>
			<wfw:commentRss>http://semantichumanities.wordpress.com/2006/08/30/algorithms-for-matching-name-variants/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5e8e1a35811b32fde34824e34012e10d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">semantichumanities</media:title>
		</media:content>
	</item>
		<item>
		<title>Text Analysis &#8211; say it with flowers</title>
		<link>http://semantichumanities.wordpress.com/2006/08/23/text-analysis-say-it-with-flowers/</link>
		<comments>http://semantichumanities.wordpress.com/2006/08/23/text-analysis-say-it-with-flowers/#comments</comments>
		<pubDate>Wed, 23 Aug 2006 08:41:42 +0000</pubDate>
		<dc:creator>semantichumanities</dc:creator>
				<category><![CDATA[text analysis]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">https://semantichumanities.wordpress.com/2006/08/23/text-analysis-say-it-with-flowers/</guid>
		<description><![CDATA[http://www.neoformix.com/2006/TopicFlower.html
(seen at http://infosthetics.com/archives/2006/08/topic_flowers.html)
]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><a href="http://www.neoformix.com/2006/TopicFlower.html">http://www.neoformix.com/2006/TopicFlower.html</a></p>
<p>(seen at <a href="http://infosthetics.com/archives/2006/08/topic_flowers.html">http://infosthetics.com/archives/2006/08/topic_flowers.html</a>)</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://semantichumanities.wordpress.com/2006/08/23/text-analysis-say-it-with-flowers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5e8e1a35811b32fde34824e34012e10d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">semantichumanities</media:title>
		</media:content>
	</item>
		<item>
		<title>Tutorial: Writing a Word Frequencies Script in Ruby</title>
		<link>http://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/</link>
		<comments>http://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/#comments</comments>
		<pubDate>Tue, 21 Feb 2006 01:03:14 +0000</pubDate>
		<dc:creator>semantichumanities</dc:creator>
				<category><![CDATA[Scripting]]></category>
		<category><![CDATA[text analysis]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">https://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/</guid>
		<description><![CDATA[There are plenty of ready-made programs that do the same thing and more, but I hope that
this basic example can serve as a useful jumping off-point for your own more
ingenious scripts.
The basic steps are as follows:

Read the text file into a string
Split the text into an array of words
Count the number of times each word [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>There are plenty of ready-made programs that do the same thing and more, but I hope that this basic example can serve as a useful jumping-off point for your own more ingenious scripts.</p>
<p>The basic steps are as follows:</p>
<ol>
<li>Read the text file into a string</li>
<li>Split the text into an array of words</li>
<li>Count the number of times each word occurs, storing it in a hash</li>
<li>Display the word frequency list</li>
</ol>
<p>Ok, <a href="http://pine.fm/LearnToProgram/?Chapter=00">install Ruby if you don&#8217;t already have it on your machine</a>, and boot up a text editor (preferably one with ruby syntax highlighting), and on we go with the code.</p>
<h4>Read the text-file into a string</h4>
<p>First, we want to get the name of the text file we&#8217;re analysing, and we&#8217;ll let the user enter it at the prompt:</p>
<pre><code> puts 'What is the name and path of the file?'
filename = gets.chomp
</code></pre>
<p>&#8220;puts&#8221; writes the string that follows it to the screen.<br />
&#8220;gets&#8221; reads a line typed by the user at the prompt.<br />
&#8220;chomp&#8221; removes the line ending from the end of the string. After the user has typed in the filename, s/he presses Return to signal that s/he has finished typing. We need to remove that trailing newline, so that all we have is the filename, which we store in a variable we are calling &#8216;filename&#8217;.</p>
<p>We now create a new string variable that we are calling &#8216;text&#8217;.</p>
<pre><code>text = String.new
</code></pre>
<p>&#8216;text&#8217; is where we will put the contents of our file.</p>
<pre><code>File.open(filename) { |f|  text = f.read } </code></pre>
<p>Here, we are opening the file, and reading it into the &#8216;text&#8217; variable. The syntax is quite rubyish. In the first part, &#8216;File.open(filename)&#8217;, a file object is being created, and passed to the block that follows it. The block is delimited by the curly braces, and receives the file object through the variable &#8216;f&#8217;, which is specified between the two pipe characters: |f|. When the block finishes, Ruby closes the file for us.</p>
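<p>To see the block form in action without needing a text file to hand, here is a small self-contained sketch. It writes a throwaway file with Ruby&#8217;s standard Tempfile library (the sample sentence is made up), reads it back with a block, and compares the result with File.read, which does the opening, reading and closing in a single call.</p>

```ruby
require 'tempfile'

# Create a throwaway file so that the example is self-contained.
sample = Tempfile.new('sample')
sample.write("the cat sat on the mat\n")
sample.close

# Block form, as in the tutorial: File.open hands the file object to the
# block, returns whatever the block returns, and then closes the file.
text = File.open(sample.path) { |f| f.read }

# File.read opens, reads and closes in one call.
same_text = File.read(sample.path)

puts text == same_text
sample.unlink   # delete the temporary file
```

<p>Either form leaves the whole file in one string, ready for splitting into words.</p>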
<h4>Split the text into an array of words</h4>
<p>Onto step two: creating an array of all the words in the text. This is easy.</p>
<pre><code>words = text.split(/[^a-zA-Z]/)
</code></pre>
<p>&#8216;words&#8217; is the name of our new array. We are &#8216;splitting&#8217; our big string of text (which we have called &#8216;text&#8217;) into chunks, using a regular expression &#8216;/[^a-zA-Z]/&#8217;. Regular Expressions (reg exes) are a way of pattern matching text using wildcards. They can be extraordinarily useful if you are working with electronic text, and reading up on them will definitely reap rewards at some point (<a href="http://www.regular-expressions.info/">regular-expressions.info</a> covers them fairly comprehensively). Suffice to say here that &#8216;[^a-zA-Z]&#8217; matches anything that <em>isn&#8217;t</em> an alphabetic character; so our &#8216;words&#8217; are all the chunks of text between non-alphabetic characters. This may not be a precise enough definition of a word for your purposes, but we&#8217;ll assume it is for now and push on.</p>
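<p>One thing to watch for: wherever two non-alphabetic characters sit side by side (a comma followed by a space, say), &#8216;split&#8217; hands back an empty string, and that empty string will later be counted as if it were a word. The sketch below shows this on a made-up sentence, alongside &#8216;scan&#8217;, which collects the runs of letters directly and so avoids the problem.</p>

```ruby
text = 'Hello, world! Hello again.'

# Splitting on every non-letter leaves an empty string between
# neighbouring separators such as ", " and "! ".
split_words = text.split(/[^a-zA-Z]/)

# Scanning for runs of letters returns only the words themselves.
scan_words = text.scan(/[a-zA-Z]+/)

p split_words
p scan_words
```

<p>If you stay with &#8216;split&#8217;, calling &#8216;words.reject { |w| w.empty? }&#8217; afterwards has the same effect.</p>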
<h4>Count the number of times each word occurs, storing it in a hash</h4>
<pre><code>freqs = Hash.new(0)
</code></pre>
<p>We create a new Hash to store the words and their frequencies in. A basic Hash consists of pairs of &#8216;keys&#8217; and &#8216;values&#8217;. You access a value by referring to its key. In our case, the key will be a (unique) word, and its &#8216;value&#8217; is the number of times it occurs in the text.</p>
<pre><code>words.each { |word| freqs[word] += 1 }
</code></pre>
<p>&#8216;words.each&#8217; takes each word one at a time from the array &#8216;words&#8217;, and passes it to the block after it. Because we created the hash with &#8216;Hash.new(0)&#8217;, a word that doesn&#8217;t yet have an entry is treated as having a count of 0, so the first time we meet it its value becomes 1. Otherwise (if we have encountered the word before), the value is whatever it was before, plus one.</p>
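<p>The default value is what makes the one-line counter work. Here is a small sketch on a made-up three-word list, showing what a missing key gives you with &#8216;Hash.new(0)&#8217; and what happens if you try the same thing with an ordinary empty hash.</p>

```ruby
freqs = Hash.new(0)
puts freqs['ruby']   # reading a missing key gives the default, without storing it

%w[ruby is ruby].each { |word| freqs[word] += 1 }
p freqs

# An ordinary hash returns nil for a missing key, and nil + 1 is an error.
plain = {}
begin
  plain['ruby'] += 1
rescue NoMethodError => e
  puts "plain hash raised #{e.class}"
end
```

<p>Without the default, the block would need an explicit check for whether the word had been seen before.</p>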
<pre><code> freqs = freqs.sort_by {|x,y| y }
</code></pre>
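<p>This line sorts our word counts by the frequency number, lowest first. Note that &#8216;sort_by&#8217; returns an array of [word, frequency] pairs rather than a hash.</p>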
<pre><code> freqs.reverse!
</code></pre>
<p>This line puts it in order of greatest frequency first. (The exclamation mark after the method &#8216;reverse&#8217; means that &#8216;freqs&#8217; itself is reversed in place; the effect is the same as: &#8216;freqs = freqs.reverse&#8217;.)</p>
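<p>Here is a small sketch of these two steps on a made-up hash of three counts. It also shows an alternative, sorting on the negated count, which gives highest-first order in one step. Words that share the same count may come out in a different order between the two approaches.</p>

```ruby
freqs = { 'the' => 3, 'cat' => 1, 'sat' => 2 }

# sort_by returns an Array of [word, count] pairs, smallest count first.
pairs = freqs.sort_by { |word, count| count }
p pairs

# reverse! flips that array in place, so the biggest count now comes first.
pairs.reverse!
p pairs

# Sorting on the negated count reaches the same order without reversing.
descending = freqs.sort_by { |word, count| -count }
p descending
```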
<h4>Display the word frequency list</h4>
<pre><code>freqs.each {|word, freq| puts word+' '+freq.to_s}
</code></pre>
<p>Finally, we write our results to the screen. Note that the frequency number must be converted to a string (&#8216;freq.to_s&#8217;) to be used with &#8216;puts&#8217;.</p>
<h4>And for those who want to cut and paste</h4>
<p><code><br />
puts 'What is the name and path of the file?'<br />
<var>filename</var> = gets.chomp<br />
<var>text</var> = String.new<br />
File.open(filename) { |f|  <var>text</var> = f.read } <br />
<var>words</var> = <var>text</var>.split(/[^a-zA-Z]/)<br />
<var>freqs</var> = Hash.new(0)<br />
<var>words</var>.each { |word| freqs[word] += 1 }<br />
<var>freqs</var> = freqs.sort_by {|x,y| y }<br />
<var>freqs</var>.reverse!<br />
<var>freqs</var>.each {|word, freq| puts word+' '+freq.to_s}<br />
</code></p>
<p><ins><em>Or</em>, inspired by the concision of <a href="http://digitalhistory.uwo.ca/dhh/index.php/2006/08/20/easy-pieces-in-python-word-frequencies/">William Turkel&#8217;s Python word frequency code</a>, you could do it like this:</ins><br />
<code><br />
#replace 'filename.txt' with the file you want to process <br />
 words   = File.open('filename.txt') {|f| f.read }.split        <br />
 freqs=Hash.new(0)      <br />
words.each { |word| freqs[word] += 1 } <br />
freqs.sort_by {|x,y| y }.reverse.each {|w, f| puts w+'  '+f.to_s}          <br />
</code></p>
<h4>Further Enhancements</h4>
<p>And there we have it. There are definitely some improvements you might want to make.</p>
<p>You&#8217;ll probably want to convert your &#8216;text&#8217; string to all lowercase or all uppercase so that &#8216;Ruby&#8217; and &#8216;ruby&#8217; don&#8217;t get counted separately.</p>
<p>You may want to strip the text of sgml/xml tags before you split it into words.</p>
<p>You may want to convert plural nouns to singular, or normalise verb endings, or remove any words that also occur in a stop-list (a list of very frequent common words that you want to ignore).</p>
<p>You might make it work through a web browser instead of the command line.</p>
<p>Best of all is if you make a customisation that is interesting to you and your text, but isn&#8217;t already covered by the text analysis software currently available. Please share your ideas for text analysis innovations in the Comments.</p>
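<p>As a starting point, here is a minimal sketch of three of those enhancements together: lowercasing, stripping tags, and skipping a stop-list. The <var>STOP_WORDS</var> list and the method name <var>word_frequencies</var> are made up for this example, and the tag-stripping pattern is deliberately crude (it simply removes anything between angle brackets, which are written as the hex escapes \x3C and \x3E so that the code can be pasted into a web page unharmed).</p>

```ruby
# Hypothetical stop-list; a real one would be much longer.
STOP_WORDS = %w[the a an and of to in is].freeze

def word_frequencies(text, stop_words = STOP_WORDS)
  # \x3C and \x3E are the opening and closing angle brackets.
  without_tags = text.gsub(/\x3C[^\x3E]*\x3E/, ' ')
  words = without_tags.downcase.scan(/[a-z]+/)
  freqs = Hash.new(0)
  words.each { |word| freqs[word] += 1 unless stop_words.include?(word) }
  # Highest count first; words with equal counts in alphabetical order.
  freqs.sort_by { |word, count| [-count, word] }
end

word_frequencies('Ruby is fun and the ruby is red').each do |word, count|
  puts "#{word} #{count}"
end
```

<p>Plural and verb-ending normalisation is a much bigger job, and is probably best left to a dedicated stemming library.</p>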
</div>]]></content:encoded>
			<wfw:commentRss>http://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5e8e1a35811b32fde34824e34012e10d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">semantichumanities</media:title>
		</media:content>
	</item>
	</channel>
</rss>