Tutorial: Writing a Word Frequencies Script in Ruby

There are plenty of ready-made programs that do the same thing and more, but I hope that
this basic example can serve as a useful jumping-off point for your own more
ingenious scripts.

The basic steps are as follows:

  1. Read the text file into a string
  2. Split the text into an array of words
  3. Count the number of times each word occurs, storing it in a hash
  4. Display the word frequency list

OK, install Ruby if you don’t already have it on your machine, fire up a text editor (preferably one with Ruby syntax highlighting), and on we go with the code.

Read the text file into a string

First, we want to get the name of the text file we’re analysing, and we’ll let the
user enter it at the prompt:

puts 'What is the name and path of the file?'
filename = gets.chomp

“puts” writes the string that follows it to the screen.
“gets” reads a line of text typed by the user at the prompt.
“chomp” removes the line break from the end of the string. After typing the filename,
the user presses Return to signal that they have finished, and that keypress ends up
as a newline at the end of the string that “gets” returns. We need to remove it, so
that all we have is the filename, which we store in a variable we are calling ‘filename’.
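
If you want to see what ‘chomp’ does without typing anything at a prompt, here is a minimal sketch. It assumes a literal string standing in for the user's input, and the file name in it is made up for illustration:

```ruby
# Stand-in for what gets would return after the user types a name
# and presses Return. The file name is invented for this example.
raw_input = "mytext.txt\n"

# chomp returns a copy without the trailing line break;
# raw_input itself is left unchanged.
filename = raw_input.chomp

p raw_input   # prints "mytext.txt\n"
p filename    # prints "mytext.txt"
```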

We now create a new string variable that we are calling ‘text’.

text = String.new

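‘text’ is where we will put the contents of our file. Creating it before the block matters: a variable first assigned inside a block is local to that block and would not be visible afterwards.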

File.open(filename) { |f|  text = f.read } 

Here, we are opening the file and reading its contents into the ‘text’ variable. The
syntax is quite Rubyish. In the first part, ‘File.open(filename)’, a file object is
created and passed to the block that follows it. The block is delimited by the curly
braces, and receives the file object through the variable ‘f’, which is specified
between the two pipe characters: |f|. When the block finishes, the file is closed
automatically.
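
As an aside, Ruby also offers ‘File.read’, which opens, reads and closes the file in one call. The sketch below is one possible variation rather than a replacement for the code above; the helper name ‘read_text’ and the sample sentence are assumptions made for the example, and a temporary file is used so that it runs anywhere:

```ruby
require 'tempfile'

# Hypothetical helper: returns the whole file as one string,
# or nil if no file exists at that path.
def read_text(filename)
  return nil unless File.exist?(filename)
  File.read(filename)
end

# Demonstration with a throwaway file that is deleted after the block.
Tempfile.create('sample') do |f|
  f.write("the cat sat on the mat\n")
  f.flush
  p read_text(f.path)            # prints "the cat sat on the mat\n"
end

p read_text('no-such-file.txt')  # prints nil
```

Checking that the file exists first means a mistyped filename gives you nil to test for, instead of an exception.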

Split the text into an array of words

On to step two: creating an array of all the words in the text. This is easy.

words = text.split(/[^a-zA-Z]/)

‘words’ is the name of our new array. We are ‘splitting’ our big string of text
(which we have called ‘text’) into chunks, using a regular expression ‘/[^a-zA-Z]/’.
Regular expressions (regexes) are a way of pattern matching text using wildcards.
They can be extraordinarily useful if you are working with electronic text, and
reading up on them will definitely pay off at some point (regular-expressions.info
covers them fairly comprehensively). Suffice it to say here
that ‘[^a-zA-Z]’ matches anything that isn’t an alphabetic character; so our
‘words’ are all the chunks of text between non-alphabetic characters. This may
not be a precise enough definition of a word for your purposes, but we’ll assume it is
for now and push on.
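
One thing worth knowing about this pattern is that two non-alphabetic characters in a row (a comma followed by a space, say) leave an empty string between them, and that empty string will later be counted as a ‘word’. The following sketch, using made-up sample text, shows the effect and two possible ways around it; ‘scan’ is a swapped-in alternative to ‘split’, not what the script in this tutorial uses:

```ruby
text = "Ruby, Ruby: it's fun!"

# The pattern from the tutorial: one split per non-alphabetic character.
basic_words = text.split(/[^a-zA-Z]/)
p basic_words   # ["Ruby", "", "Ruby", "", "it", "s", "fun"]

# Adding + treats a run of non-alphabetic characters as one separator.
plus_words = text.split(/[^a-zA-Z]+/)
p plus_words    # ["Ruby", "Ruby", "it", "s", "fun"]

# With +, text that starts with punctuation still yields one leading
# empty string.
quoted_split = '"Hi"'.split(/[^a-zA-Z]+/)
p quoted_split  # ["", "Hi"]

# scan collects the runs of letters directly, so it never produces
# an empty string.
scan_words = text.scan(/[a-zA-Z]+/)
p scan_words    # ["Ruby", "Ruby", "it", "s", "fun"]
```

Note that all three variants still break ‘it's’ into ‘it’ and ‘s’, because the apostrophe is not an alphabetic character.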

Count the number of times each word occurs, storing it in a hash

freqs = Hash.new(0)

We create a new Hash to store the words and their frequencies in. A basic Hash
consists of pairs of ‘keys’ and ‘values’. You access a value by referring to its key.
In our case, the key will be a (unique) word, and its ‘value’ is the number of times
it occurs in the text.

words.each { |word| freqs[word] += 1 }

‘words.each’ takes each word one at a time from the array ‘words’, and passes it to
the block after it. Because we created the hash with ‘Hash.new(0)’, a word that
doesn’t yet have an entry is treated as having a count of 0, so the first time we
meet a word its value becomes 1. Otherwise (if we have encountered the word before),
the value is whatever it was before, plus one.
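
To make the role of the default value concrete, here is a small sketch with a made-up word list. The last part uses ‘tally’, which is only available in Ruby 2.7 and later; it is shown as an alternative to the counting loop, not as part of the script in this tutorial:

```ruby
words = %w[foo bar baz bar foo foo foo]

# With Hash.new(0), looking up a missing key gives 0 instead of nil,
# so += 1 works the first time a word is seen.
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
puts freqs['foo']    # prints 4
puts freqs['bar']    # prints 2
puts freqs['qux']    # prints 0 (never seen, so the default is returned)

# A plain Hash.new returns nil for a missing key, and nil + 1 raises
# NoMethodError.
plain = Hash.new
p plain['foo']       # prints nil

# Assumes Ruby 2.7 or later: tally builds the same counts in one call.
tallied = words.tally
puts tallied == freqs   # prints true
```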

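freqs = freqs.sort_by {|x,y| y }

This line sorts our hash by the frequency number, lowest first. Note that ‘sort_by’ returns an array of [word, frequency] pairs rather than a hash, so from here on ‘freqs’ is an array.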

freqs.reverse!

This line reverses the order, so that the greatest frequency comes first. (The
exclamation mark after the method ‘reverse’ means that ‘freqs’ itself is reversed in
place; the effect is the same as ‘freqs = freqs.reverse’.)
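
Because ‘sort_by’ hands back an array of pairs, it can help to look at the intermediate values. The sketch below uses a small made-up hash; sorting on the negated count is shown as one possible alternative to sorting and then reversing:

```ruby
freqs = { 'the' => 3, 'cat' => 1, 'sat' => 2 }

# sort_by returns an Array of [word, count] pairs, lowest count first.
ascending = freqs.sort_by { |word, count| count }
p ascending    # [["cat", 1], ["sat", 2], ["the", 3]]

descending = ascending.reverse
p descending   # [["the", 3], ["sat", 2], ["cat", 1]]

# Alternative: sort on the negated count to get the same order in one step.
one_step = freqs.sort_by { |word, count| -count }
p one_step == descending   # true
```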

Display the word frequency list

freqs.each {|word, freq| puts word+' '+freq.to_s}

Finally, we write our results to the screen. Note that the frequency number must be
converted to a string (‘freq.to_s’) before it can be joined to the word with ‘+’.
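
An alternative to ‘+’ and ‘to_s’ is string interpolation, which converts the number for you. A minimal sketch, with a made-up list of pairs standing in for the sorted ‘freqs’:

```ruby
freqs = [['the', 3], ['sat', 2], ['cat', 1]]

# Inside a double-quoted string, #{...} inserts the value and calls
# to_s on it automatically.
lines = freqs.map { |word, freq| "#{word} #{freq}" }

# puts prints each element of an array on its own line.
puts lines
# the 3
# sat 2
# cat 1
```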

And for those who want to cut and paste:


puts 'What is the name and path of the file?'
filename = gets.chomp
text = String.new
File.open(filename) { |f| text = f.read }
words = text.split(/[^a-zA-Z]/)
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs = freqs.sort_by {|x,y| y }
freqs.reverse!
freqs.each {|word, freq| puts word+' '+freq.to_s}

Or, inspired by the concision of William Turkel’s Python word frequency code, you could do it like this (note that this version splits on whitespace only, so punctuation stays attached to the words):

# replace 'filename.txt' with the file you want to process
words = File.open('filename.txt') {|f| f.read }.split
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs.sort_by {|x,y| y }.reverse.each {|w, f| puts w+' '+f.to_s}

Further Enhancements

And there we have it. There are definitely some improvements you might want to make.

You’ll probably want to convert your ‘text’ string to all lowercase or all uppercase
so that ‘Ruby’ and ‘ruby’ don’t get counted separately.
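
A minimal sketch of the lowercase idea, using a made-up sentence and the same splitting pattern as the script (empty strings are dropped here so that they don't get counted):

```ruby
text = 'Ruby is fun. I like ruby.'

# downcase returns a lowercase copy, so 'Ruby' and 'ruby' become the same word.
words = text.downcase.split(/[^a-zA-Z]/).reject { |word| word.empty? }

freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }

puts freqs['ruby']   # prints 2
puts freqs['Ruby']   # prints 0 (the capitalised form no longer occurs)
```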

You may want to strip the text of SGML/XML tags before you split it into words.
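
For simple markup, a regular expression can remove the tags before splitting. This is only a rough sketch: the helper name ‘strip_tags’ and the sample markup are made up, and a pattern like this will not cope with comments, CDATA sections or a ‘>’ inside an attribute value, for which a real XML parser would be the safer choice:

```ruby
# Hypothetical helper: replaces anything that looks like <...> with a
# space, so that words on either side of a tag don't run together.
def strip_tags(markup)
  markup.gsub(/<[^>]*>/, ' ')
end

text = '<p>The <em>quick</em> fox</p>'
words = strip_tags(text).split
p words   # ["The", "quick", "fox"]
```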

You may want to convert plural nouns to singular, or normalise verb endings, or remove
any words that also occur in a stop-list (a list of very frequent common words that
you want to ignore).
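
Filtering against a stop-list could look something like this. The list here is a tiny made-up example; a real one would be much longer and chosen to suit your texts:

```ruby
# Illustrative stop-list only; not a recommended set of words.
stop_words = %w[the a an of and to in]
words = %w[the cat sat in the hat]

# reject keeps only the words that are not in the stop-list.
content_words = words.reject { |word| stop_words.include?(word) }
p content_words   # ["cat", "sat", "hat"]
```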

You might make it work through a web browser instead of the
command line.

Best of all is if you make a customisation that is interesting to you
and your text, but isn’t already covered by the text analysis software currently
available. Please share your ideas for text analysis innovations in the Comments.


28 Comments

  1. […] I had originally planned to use Perl with my digital history students but have come to the reluctant conclusion that the language probably isn’t ideal for my purposes. Perl has the motto that “there’s more than one way to do it,” which is fine for experienced programmers but a bit confusing for beginners. So I’ve made the shift to Python and am very happy so far. When I came across the tutorial on word frequencies in Ruby at Semantic Humanities, I decided it would make a nice demo for Python, too. The basic problem is to split a text file into an array of words, count the number of occurrences of each word, and return a dictionary sorted by frequency. For my text, I chose Charles William Colby, The Fighting Governor: A Chronicle of Frontenac (1915) available from Project Gutenberg. We start by reading the file into one long string and then use whitespace to split the string into a list of separate words. In Python it looks like this: input = open('cca0710-trimmed.txt', 'r'); text = input.read(); wordlist = text.split() […]

  2. Hi, I posted a riff on this at Digital History Hacks.

  3. Nice one William. If anyone finds or writes similar tutorials in other languages, it’d be great if you could post links to them in the comments.

  4. […] Did I forget something essential? I hope not. I’m still looking for a few simple and short examples which show the differences and advantages of Ruby. At first I thought I would use something from the Rubyquiz but I think these exercises take too much time. At the end of the second day I would like to write one “larger” program, so far I like the idea from PostHelloWorld to parse text and create a histogram for the words found, like the one described in this tutorial. […]

  5. I did one of these in PHP a good year or two ago for analysing the word frequency of a page for SEO purposes. This Ruby version could come in handy – if I add the HTML extraction too it could become a webpage word index counter.

    I’ll get to work on it.

  6. Hi Doug, If you do, feel free to put a link up here to your code. Cheers

  7. slabounty said

    In your puts line:

    wordfreq.each {|word, freq| puts word+' '+freq.to_s}

    shouldn’t wordfreq be freqs? wordfreq doesn’t seem to have anything assigned to it.

  8. absolutely right slabounty, slip of the key, thanks for that

  9. MoeD said

    Interesting, but a typo. Unlike perl, you can’t “add 1” (using += 1) to a hash value that hasn’t yet been initialized to something that accepts the “+” message.

    A simple example points this out:

    ############################################
    freqs = Hash.new

    %w(foo bar baz bar foo foo foo).each{|word|
    freqs[word] += 1
    }

    p freqs
    ############################################

    You will get:

    /tmp/tmp.rb:5: undefined method `+' for nil:NilClass (NoMethodError)
    from c:/tmp/tmp.rb:4:in `each'
    from c:/tmp/tmp.rb:4

    Since freqs['foo'] is nil at the first iteration of the each loop, you get an error since nil doesn’t respond to the “+” message.

    Change the “freqs = Hash.new” to “freqs = Hash.new(0)”, and you will get your intended effect. Using a parameter to the Hash::new method tells it to use the parameter as the value if one hasn’t already been assigned.

    NOW you get:

    ~>ruby /tmp/tmp.rb

    {"baz"=>1, "foo"=>4, "bar"=>2}

    Moe

  10. Eddie said

    MoeD,

    If you initialize your hash like you did…

    freqs = Hash.new

    … you will get an exception, but if you initialize the hash like the example…

    freqs = Hash.new(0)

    …everything works fine.

  11. Nice tutorial. but it’d be a lot more readable if the code stood out more from the text

  12. I know, I know. I just changed the theme, but code still looks tiny, and I can’t be bothered trying to find a nice theme right now, or stumping up the readies for custom css.

    Anyway, I’ve put all the code in one place now, so at least it’s easy to cut’n’paste

  13. Hi there, interesting post.

    I’ve got a similar tutorial in Python, the first in a project on Python in NLP which I started and haven’t yet continued. ☺

    Actually mine isn’t quite the same, it counts letters; also, it tries to be a very from-scratch sort of intro to programming. I’m not sure I succeeded, but here ya go:

    http://blogamundo.net/py4lx/

    It’s also as inefficient as all get out. ☺

  14. Great article and very good readable code.
    Thanks

  15. Ray Renteria said

    Fantastic instructional! Thanks for helping me break through the ice!

  16. It creates an empty string key in the hash when you have consecutive non-word characters like ‘, ‘. Changing words = text.split(/[^a-zA-Z]/) to
    words = text.split(/[^a-zA-Z]+/) will solve this problem.

  17. What a information of un-ambiguity and preserveness of precious familiarity about
    unpredicted feelings.

  18. Attractive section of content. I just stumbled upon your web site and in
    accession capital to assert that I acquire in fact enjoyed account your blog posts.
    Anyway I will be subscribing to your feeds and even I achievement
    you access consistently quickly.

  19. Dirk said

    Thanks for a marvelous posting! I certainly enjoyed
    reading it, you can be a great author. I will be sure to bookmark your blog and may come back
    down the road. I want to encourage that you continue your great job,
    have a nice weekend!

  20. Hi, I do think this is an excellent site. I stumbledupon it 😉 I’m going to return once again since i have saved as a favorite it. Money and freedom is the greatest way to change, may you be rich and continue to help others.

  21. Unfortunately, if the house is a testament to modern innovation.
    You may also be a skilled you may finish up with
    a market research on the modern trends and
    materials available. Many want their tables to appear older and opt for darker shades on the floor of a small bathroom look
    smaller. As you might have the money. For grinding, sanding, cutting, and scraping there is no
    running water or electricity in the bathroom then anywhere else.

  22. SOWNDARYA PALANISAMY said

    awesome …….

  23. Steven C said

    Good site especially for Ruby newcomers like myself. I loved your examples I hope these’ll help me progress with my ruby programming. Thanks.

  24. Your style is unique compared to other folks I’ve read stuff from.
    Many thanks for posting when you’ve got the opportunity, Guess I will just book mark this web site.

  25. I do consider all the ideas you have presented on your post.
    They’re very convincing and can definitely work. Nonetheless, the
    posts are very short for beginners. May just you please
    extend them a little from subsequent time? Thanks for the
    post.

  26. You really make it seem so easy with your presentation however I in finding this matter to
    be really one thing that I believe I would by no
    means understand. It kind of feels too complex and very extensive for me.
    I’m having a look ahead to your subsequent submit,
    I’ll attempt to get the hang of it!

  27. David said

    I’m gone to convey my little brother, that he should also go
    to see this weblog on regular basis to get updated from
    hottest gossip.

  28. In the latest installation the customer wanted to keep their gas boiler for hot water, so we fitted the pellet boiler in parallel with the gas one but for heating only; we also installed a system so that if the pellet boiler runs out of fuel, the gas boiler automatically starts up in heating mode so that the temperature in the home does not drop.
