There are plenty of ready-made programs that do the same thing and more, but I hope that
this basic example can serve as a useful jumping-off point for your own more
ingenious scripts.
The basic steps are as follows:
- Read the text file into a string
- Split the text into an array of words
- Count the number of times each word occurs, storing it in a hash
- Display the word frequency list
OK, install Ruby if you don’t already have it on your machine, boot up a text editor (preferably one with Ruby syntax highlighting), and on we go with the code.
Read the text file into a string
First, we want to get the name of the text file we’re analysing, and we’ll let the
user enter it at the prompt:
puts 'What is the name and path of the file?'
filename = gets.chomp
“puts” writes the string that follows it to the screen.
“gets” reads a line of input from the user at the prompt.
“chomp” removes the trailing newline from the end of the string. After the user has
typed in the filename, s/he presses Return to signal that s/he has finished typing.
That keypress ends up in the input as a newline character. We need to remove it, so
that all we have is the filename, which we store in a variable we are calling ‘filename’.
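To see what ‘chomp’ does without waiting at a prompt, here is a small sketch that applies it to hard-coded strings standing in for what ‘gets’ would return. The helper name ‘clean_input’ and the sample filename are made up for illustration.

```ruby
# Hypothetical helper: tidy up a line of input as 'gets' would return it.
def clean_input(line)
  line.chomp # removes one trailing "\n", "\r\n" or "\r", if present
end

typed = "mytext.txt\n"            # what gets returns after the user presses Return
puts clean_input(typed)           # mytext.txt
puts clean_input("mytext.txt")    # mytext.txt (unchanged when there is no newline)
```

Without ‘chomp’, the newline would stay on the end of the filename and the file would most likely not be found.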
We now create a new string variable that we are calling ‘text’.
text = String.new
‘text’ is where we will put the contents of our file.
File.open(filename) { |f| text = f.read }
Here, we are opening the file and reading it into the ‘text’ variable. The syntax is
quite Rubyish. In the first part, ‘File.open(filename)’, a file object is
created and passed to the block that follows it. The block is delimited by the curly
braces, and receives the file object through the variable ‘f’, which is specified
between the two pipe characters: |f|. Inside the block, ‘f.read’ returns the whole
file as a string, and the file is closed automatically when the block ends.
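As a sketch of the same idea in a form you can run without an existing file, the snippet below writes a small temporary file (using Ruby’s standard ‘tempfile’ library) and reads it back two ways: with the block form of ‘File.open’, and with ‘File.read’, a built-in shortcut that opens, reads and closes in one call. The method name ‘read_whole_file’ and the sample text are assumptions for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: read a whole file into a string using the block form.
# File.open returns whatever the block returns, here the file's contents.
def read_whole_file(path)
  File.open(path) { |f| f.read }
end

# Create a throwaway file so the example is self-contained.
sample = Tempfile.new('sample')
sample.write("the cat sat on the mat\n")
sample.close

text = read_whole_file(sample.path)
puts text                            # the cat sat on the mat
puts text == File.read(sample.path)  # true

sample.unlink # delete the temporary file
```

Because ‘File.open’ with a block hands back the block’s result, the contents can be assigned directly, which is one way to avoid creating an empty string first.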
Split the text into an array of words
Onto step two: creating an array of all the words in the text. This is easy.
words = text.split(/[^a-zA-Z]/)
‘words’ is the name of our new array. We are ‘splitting’ our big string of text
(which we have called ‘text’) into chunks, using a regular expression ‘/[^a-zA-Z]/’.
Regular expressions (regexes) are a way of pattern matching text using wildcards.
They can be extraordinarily useful if you are working with electronic text, and
reading up on them will definitely reap rewards at some point (regular-expressions.info
has a fairly comprehensive amount of information). Suffice to say here
that ‘[^a-zA-Z]’ matches any single character that isn’t an alphabetic character; so our
‘words’ are all the chunks of text between non-alphabetic characters. This may
not be a precise enough definition of a word for your purposes, but we’ll assume it is
for now and push on.
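One consequence of this definition is worth seeing in a small sketch. When two non-alphabetic characters sit next to each other (a comma followed by a space, say), ‘split’ returns an empty string for the gap between them, and that empty string would later be counted as a ‘word’. A word like “don’t” is also cut into “don” and “t”. The snippet compares the tutorial’s ‘split’ with ‘scan’, which collects runs of letters instead; the helper names are made up, and ‘scan’ is offered as one possible alternative rather than the method used in this post.

```ruby
# Hypothetical helpers comparing two ways of extracting words.
def words_by_split(text)
  text.split(/[^a-zA-Z]/)  # the approach used in this tutorial
end

def words_by_scan(text)
  text.scan(/[a-zA-Z]+/)   # collect runs of letters instead of splitting
end

sample = "Hello, world!"
p words_by_split(sample)   # ["Hello", "", "world"]
p words_by_scan(sample)    # ["Hello", "world"]
```

If you keep ‘split’, you could also drop the empty strings afterwards with ‘words.reject { |w| w.empty? }’.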
Count the number of times each word occurs, storing it in a hash
freqs = Hash.new(0)
We create a new Hash to store the words and their frequencies in. A basic Hash
consists of pairs of ‘keys’ and ‘values’. You access a value by referring to its key.
In our case, the key will be a (unique) word, and its ‘value’ is the number of times
it occurs in the text.
words.each { |word| freqs[word] += 1 }
‘words.each’ takes each word one at a time from the array ‘words’, and passes it to
the block after it. Because we created the hash with ‘Hash.new(0)’, a word that
doesn’t yet have an entry is treated as having a value of 0, so the first time we
meet a word its value becomes 1. Otherwise (if we have
encountered the word before), the value is whatever it was before, plus one.
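Here is a minimal sketch of the counting step on a tiny hard-coded list, so you can check the default value of 0 at work. The method name ‘count_words’ and the sample words are assumptions for illustration.

```ruby
# Hypothetical helper: count how often each word occurs in an array.
def count_words(words)
  freqs = Hash.new(0)                     # missing keys read as 0
  words.each { |word| freqs[word] += 1 }  # 0 + 1 on first sight, then n + 1
  freqs
end

counts = count_words(%w[foo bar baz bar foo foo foo])
puts counts['foo']        # 4
puts counts['bar']        # 2
puts counts['qux']        # 0 (never seen)
puts counts.key?('qux')   # false (just reading a missing key adds no entry)
```

In Ruby 2.7 and later, ‘words.tally’ produces the same counts in a single call, if you prefer a built-in.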
freqs = freqs.sort_by {|x,y| y }
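This line sorts our hash by the frequency number, lowest first. Note that ‘sort_by’ returns an array of [word, frequency] pairs rather than a hash, so ‘freqs’ is now an array.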
freqs.reverse!
This line reverses the order, so that the greatest frequency comes first (the exclamation mark after
the method ‘reverse’ means that ‘freqs’ itself is changed in place; the effect
is the same as: ‘freqs = freqs.reverse’).
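The sorting step can be sketched on a small hand-made hash, which makes it easy to see that the result is an array of pairs with the most frequent word first. The method name ‘sort_by_frequency’ and the sample counts are made up for illustration.

```ruby
# Hypothetical helper: turn a hash of counts into [word, count] pairs,
# most frequent first.
def sort_by_frequency(freqs)
  freqs.sort_by { |_word, count| count }.reverse
end

counts = { 'bar' => 2, 'foo' => 4, 'baz' => 1 }
sorted = sort_by_frequency(counts)
puts sorted.class        # Array
puts sorted.first.first  # foo
puts sorted.last.first   # baz
```

Sorting on the negated count, ‘freqs.sort_by { |_word, count| -count }’, gives a most-frequent-first list in one step, though words with equal counts may come out in a different order.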
Display the word frequency list
freqs.each {|word, freq| puts word+' '+freq.to_s}
Finally, we write our results to the screen. Note that the frequency number must be
converted to a string (‘freq.to_s’) before it can be joined to the word with ‘+’.
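String interpolation is another way to build each output line, and it does the conversion to a string for you. This is a small sketch rather than a change to the script; the method name ‘format_line’ is made up for illustration.

```ruby
# Hypothetical helper: format one line of the frequency list.
def format_line(word, freq)
  "#{word} #{freq}" # interpolation calls to_s on freq for us
end

puts format_line('ruby', 3)                            # ruby 3
puts format_line('ruby', 3) == 'ruby' + ' ' + 3.to_s   # true
```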
And for those who want to cut and paste:
puts 'What is the name and path of the file?'
filename = gets.chomp
text = String.new
File.open(filename) { |f| text = f.read }
words = text.split(/[^a-zA-Z]/)
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs = freqs.sort_by {|x,y| y }
freqs.reverse!
freqs.each {|word, freq| puts word+' '+freq.to_s}
Or, inspired by the concision of William Turkel’s Python word frequency code, you could do it like this:
# Replace 'filename.txt' with the file you want to process
words = File.open('filename.txt') { |f| f.read }.split
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs.sort_by { |x, y| y }.reverse.each { |w, f| puts w + ' ' + f.to_s }
Further Enhancements
And there we have it. There are definitely some improvements you might want to make.
You’ll probably want to convert your ‘text’ string to all lowercase or all uppercase
so that ‘Ruby’ and ‘ruby’ don’t get counted separately.
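A minimal sketch of the lowercase idea, assuming you apply ‘downcase’ to the text before splitting it. The helper name ‘fold_case’ and the sample sentence are made up for illustration.

```ruby
# Hypothetical helper: lowercase the text so 'Ruby' and 'ruby' match.
def fold_case(text)
  text.downcase
end

sample = "Ruby is fun. I like ruby."
words = fold_case(sample).split(/[^a-zA-Z]/)
puts words.count('ruby')  # 2
puts words.count('Ruby')  # 0
```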
You may want to strip the text of SGML/XML tags before you split it into words.
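For simple markup, a rough sketch of tag stripping is to replace anything between ‘<’ and ‘>’ with a space. This is a crude assumption-laden approach: it does not handle comments, CDATA sections, or a ‘>’ inside an attribute value, so a proper XML or HTML parser is the safer choice for real documents. The helper name ‘strip_tags’ is made up for illustration.

```ruby
# Hypothetical helper: crude tag removal with a regular expression.
# Replaces each <...> run with a space so neighbouring words stay separate.
def strip_tags(text)
  text.gsub(/<[^>]*>/, ' ')
end

marked_up = "<p>Hello <b>world</b></p>"
p strip_tags(marked_up).split  # ["Hello", "world"]
```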
You may want to convert plural nouns to singular, or normalise verb endings, or remove
any words that also occur in a stop-list (a list of very frequent common words that
you want to ignore).
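A stop-list can be sketched as an array of words to throw away before counting. The list below is a tiny made-up sample, not a recommended stop-list, and the helper name ‘remove_stop_words’ is likewise an assumption.

```ruby
# Hypothetical, deliberately tiny stop-list for illustration only.
STOP_WORDS = %w[the a of and on]

# Hypothetical helper: drop any word that appears in the stop-list.
def remove_stop_words(words, stop_list = STOP_WORDS)
  words.reject { |word| stop_list.include?(word) }
end

words = %w[the cat sat on the mat]
p remove_stop_words(words)  # ["cat", "sat", "mat"]
```

For a long stop-list, putting the words in a ‘Set’ (from Ruby’s standard library) instead of an array should make each lookup faster.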
You might make it work through a web browser instead of the
command line.
Best of all is if you make a customisation that is interesting to you
and your text, but isn’t already covered by the text analysis software currently
available. Please share your ideas for text analysis innovations in the Comments.
Digital History Hacks » Blog Archive » Easy Pieces in Python: Word Frequencies said
[...] I had originally planned to use Perl with my digital history students but have come to the reluctant conclusion that the language probably isn’t ideal for my purposes. Perl has the motto that “there’s more than one way to do it,” which is fine for experienced programmers but a bit confusing for beginners. So I’ve made the shift to Python and am very happy so far. When I came across the tutorial on word frequencies in Ruby at Semantic Humanities, I decided it would make a nice demo for Python, too. The basic problem is to split a text file into an array of words, count the number of occurrences of each word, and return a dictionary sorted by frequency. For my text, I chose Charles William Colby, The Fighting Governor: A Chronicle of Frontenac (1915) available from Project Gutenberg. We start by reading the file into one long string and then use whitespace to split the string into a list of separate words. In Python it looks like this: input = open(’cca0710-trimmed.txt’, ‘r’) text = input.read() wordlist = text.split() [...]
William J. Turkel said
Hi, I posted a riff on this at Digital History Hacks.
semantichumanities said
Nice one William. If anyone finds or writes similar tutorials in other languages, it’d be great if you could post links to them in the comments.
» Planning a Ruby course - request for comments « amazing development said
[...] Did I forget something essential? I hope not. I’m still looking for a few simple and short examples which show the differences and advantages of Ruby. At first I thought I would use something from the Rubyquiz but I think these excercises take too much time. At the end of the second day I would like to write one “larger” program, so far I like the idea from PostHelloWorld to parse text and create a histogramm for the words found, like the one described in this tutorial. [...]
Doug @ Straw Dogs said
I did one of these in PHP a good year or two ago for analysing the word frequency of a page for SEO purposes. This Ruby version could come in handy – if I add the HTML extraction too it could become a webpage word index counter.
I’ll get to work on it.
semantichumanities said
Hi Doug, If you do, feel free to put a link up here to your code. Cheers
slabounty said
In your puts line:
wordfreq.each {|word, freq| puts word+' '+freq.to_s}
shouldn’t wordfreq be freqs? wordfreq doesn’t seem to have anything assigned to it.
semantichumanities said
absolutely right slabounty, slip of the key, thanks for that
MoeD said
Interesting, but a typo. Unlike Perl, you can’t “add 1” (using += 1) to a hash value that hasn’t yet been initialized to something that accepts the “+” message.
A simple example points this out:
############################################
freqs = Hash.new
%w(foo bar baz bar foo foo foo).each{|word|
freqs[word] += 1
}
p freqs
############################################
You will get:
/tmp/tmp.rb:5: undefined method `+' for nil:NilClass (NoMethodError)
from c:/tmp/tmp.rb:4:in `each'
from c:/tmp/tmp.rb:4
Since freqs['foo'] is nil at the first iteration of the each loop, you get an error since nil doesn’t respond to the “+” message.
Change the “freqs = Hash.new” to “freqs = Hash.new(0)”, and you will get your intended effect. Using a parameter to the Hash::new method tells it to use the parameter as the value if one hasn’t already been assigned.
NOW you get:
~>ruby /tmp/tmp.rb
{"baz"=>1, "foo"=>4, "bar"=>2}
Moe
Eddie said
MoeD,
If you initialize your hash like you did…
freqs = Hash.new
… you will get an exception, but if you initialize the hash like the example…
freqs = Hash.new(0)
…everything works fine.
Martin DeMello said
Nice tutorial. but it’d be a lot more readable if the code stood out more from the text
semantichumanities said
I know, I know. I just changed the theme, but code still looks tiny, and I can’t be bothered trying to find a nice theme right now, or stumping up the readies for custom css.
Anyway, I’ve put all the code in one place now, so at least it’s easy to cut’n’paste
Patrick Hall said
Hi there, interesting post.
I’ve got a similar tutorial in Python, the first in a project on Python in NLP which I started and haven’t yet continued. ☺
Actually mine isn’t quite the same, it counts letters; also, it tries to be a very from-scratch sort of intro to programming. I’m not sure I succeeded, but here ya go:
http://blogamundo.net/py4lx/
It’s also as inefficient as all get out. ☺
Fabian Pena said
Great article and very good readable code.
Thanks