There are plenty of ready-made programs that do the same thing and more, but I hope that
this basic example can serve as a useful jumping-off point for your own more
ingenious scripts.
The basic steps are as follows:
- Read the text file into a string
- Split the text into an array of words
- Count the number of times each word occurs, storing it in a hash
- Display the word frequency list
OK, install Ruby if you don’t already have it on your machine, open a text editor (preferably one with Ruby syntax highlighting), and on we go with the code.
Read the text file into a string
First, we want to get the name of the text file we’re analysing, and we’ll let the
user enter it at the prompt:
puts 'What is the name and path of the file?'
filename = gets.chomp
“puts” writes the string that follows it to the screen.
“gets” reads a line of text typed by the user at the prompt.
“chomp” removes the trailing newline from the end of the string. After the user has
typed in the filename, s/he presses Return to signal that s/he has finished typing,
and “gets” includes that line ending in the string it returns. We need to remove it,
so that all we have is the filename, which we store in a variable we are calling
‘filename’.
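To see what ‘chomp’ does without having to type at a prompt, here is a small sketch that applies it to a literal string standing in for the user’s input. The variable ‘typed’ and the path in it are made up for illustration.

```ruby
# A stand-in for what gets returns: the typed text plus the line ending.
# The path is a made-up example, not a file the script expects.
typed = "texts/moby_dick.txt\n"

# chomp returns a copy of the string with the trailing line ending removed.
filename = typed.chomp

p typed     # p shows the newline explicitly as \n
p filename  # the bare file name, ready to pass to File.open
```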
We now create a new string variable that we are calling ‘text’.
text = String.new
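‘text’ is where we will put the contents of our file. We create it before opening the file because a variable first assigned inside a block disappears when the block ends.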
File.open(filename) { |f| text = f.read }
Here, we are opening the file, and reading it into the ‘text’ variable. The syntax is
quite Rubyish. In the first part, ‘File.open(filename)’, a file object is being
created, and passed to the block that follows it. The block is delimited by the curly
braces, and receives the file object through the variable ‘f’, which is specified
between the two pipe characters: |f|. When the block finishes, Ruby closes the file
for us.
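If you want to try the block form of ‘File.open’ without a text file to hand, the sketch below first writes a throwaway file using Ruby’s standard ‘tempfile’ library (that part is my addition, not part of the script). It also shows that ‘File.open’ hands back whatever the block returns, and that ‘File.read’ is a one-call shortcut for the same job.

```ruby
require 'tempfile'

# Create a throwaway file so the example can run anywhere.
sample = Tempfile.new('sample')
sample.write('the cat sat on the mat')
sample.close

# Block form: the file is closed when the block finishes,
# and File.open returns the value of the block.
text = File.open(sample.path) { |f| f.read }

# File.read opens, reads and closes the file in one call.
same_text = File.read(sample.path)

puts text
puts text == same_text
```

Either form would do for our purposes; the version in the main script just makes each step visible.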
Split the text into an array of words
Onto step two: creating an array of all the words in the text. This is easy.
words = text.split(/[^a-zA-Z]/)
‘words’ is the name of our new array. We are ‘splitting’ our big string of text
(which we have called ‘text’) into chunks, using a regular expression ‘/[^a-zA-Z]/’.
Regular expressions (regexes) are a way of pattern matching text using wildcards.
They can be extraordinarily useful if you are working with electronic text, and
reading up on them will definitely reap rewards at some point
(regular-expressions.info has a fairly comprehensive amount of information).
Suffice to say here that ‘[^a-zA-Z]’ matches any single character that isn’t an
unaccented letter from A to Z; so our ‘words’ are all the chunks of text between
non-alphabetic characters. Note that two such characters in a row (a comma followed
by a space, say) leave an empty chunk between them, which turns up in the array as
an empty ‘word’. This may not be a precise enough definition of a word for your
purposes, but we’ll assume it is for now and push on.
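To make the behaviour of ‘split’ concrete, here is a short sketch on a made-up sentence. It compares the pattern used in the script with two possible alternatives that are my suggestions rather than part of the original code: adding ‘+’ so that a run of non-letters counts as one separator, and using ‘scan’ to collect the runs of letters directly.

```ruby
text = 'Hello, world! Hello again.'

# Splitting on a single non-letter leaves an empty string wherever
# two non-letters sit side by side (here ", " and "! ").
single = text.split(/[^a-zA-Z]/)

# Adding + treats a whole run of non-letters as one separator.
runs = text.split(/[^a-zA-Z]+/)

# scan works the other way round: it collects the runs of letters.
scanned = text.scan(/[a-zA-Z]+/)

p single   # ["Hello", "", "world", "", "Hello", "again"]
p runs     # ["Hello", "world", "Hello", "again"]
p scanned  # ["Hello", "world", "Hello", "again"]
```

The ‘+’ version can still produce one empty string at the very start if the text begins with a non-letter, which is one reason you might prefer ‘scan’.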
Count the number of times each word occurs, storing it in a hash
freqs = Hash.new(0)
We create a new Hash to store the words and their frequencies in. A basic Hash
consists of pairs of ‘keys’ and ‘values’. You access a value by referring to its key.
In our case, the key will be a (unique) word, and its ‘value’ is the number of times
it occurs in the text. The 0 we pass to ‘Hash.new’ is the default value: it is what
the hash gives back for any word that doesn’t have an entry yet.
words.each { |word| freqs[word] += 1 }
‘words.each’ takes each word one at a time from the array ‘words’, and passes it to
the block after it. If the word doesn’t yet have an entry in our hash, looking it up
gives the default of 0 that we passed to ‘Hash.new’, so ‘+= 1’ creates an entry with
a value of 1. Otherwise (if we have encountered the word before), the value is
whatever it was before, plus one.
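Here is the counting step on its own, run over a small made-up list of words so you can check the result by eye. It is only a sketch of the same idea as the line above.

```ruby
words = %w[the cat and the hat and the bat]

# Hash.new(0) makes 0 the value returned for any key not yet stored.
freqs = Hash.new(0)
p freqs['the']  # 0, because nothing has been counted yet

# freqs[word] += 1 is shorthand for freqs[word] = freqs[word] + 1
words.each { |word| freqs[word] += 1 }

p freqs
```

If you are on Ruby 2.7 or later, ‘words.tally’ builds the same kind of hash in a single call.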
freqs = freqs.sort_by {|x,y| y }
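This line sorts our hash by the frequency number, lowest first. Note that ‘sort_by’ returns an array of [word, frequency] pairs rather than a hash, so from here on ‘freqs’ is an array.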
freqs.reverse!
This line reverses the sorted list, so that the greatest frequency comes first. (The
exclamation mark after the method ‘reverse’ means that ‘freqs’ itself is changed in
place; the effect is the same as: ‘freqs = freqs.reverse’.)
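The sketch below runs the sorting step on a tiny hand-made hash. It shows that ‘sort_by’ gives back an array of pairs, and tries one possible alternative to sorting and then reversing: negating the count inside the block. That variation is my own suggestion, not something the script above relies on.

```ruby
freqs = { 'cat' => 1, 'the' => 3, 'and' => 2 }

# sort_by on a hash returns an array of [key, value] pairs, lowest first.
ascending = freqs.sort_by { |word, freq| freq }

# Negating the count sorts highest first in a single step.
descending = freqs.sort_by { |word, freq| -freq }

p ascending   # [["cat", 1], ["and", 2], ["the", 3]]
p descending  # [["the", 3], ["and", 2], ["cat", 1]]
```

When several words share the same count, the order among them is not guaranteed either way, so the two approaches may list ties differently.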
Display the word frequency list
freqs.each {|word, freq| puts word+' '+freq.to_s}
Finally, we write our results to the screen. Note that the frequency number must be
converted to a string (‘freq.to_s’) before it can be joined to the other strings
with ‘+’.
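If the ‘+’ and ‘to_s’ combination feels clumsy, string interpolation is a common alternative. The sketch below builds the same output line three ways for one made-up word and count; the column width of 10 is an arbitrary choice for illustration.

```ruby
word = 'ruby'
freq = 12

# Concatenation with + needs both sides to be strings.
joined = word + ' ' + freq.to_s

# Interpolation calls to_s on the value for us.
interpolated = "#{word} #{freq}"

# ljust pads the word with spaces so the counts line up in a column.
aligned = "#{word.ljust(10)}#{freq}"

puts joined
puts interpolated
puts aligned
```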
And for those who want to cut and paste
puts 'What is the name and path of the file?'
filename = gets.chomp
text = String.new
File.open(filename) { |f| text = f.read }
words = text.split(/[^a-zA-Z]/)
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs = freqs.sort_by {|x,y| y }
freqs.reverse!
freqs.each {|word, freq| puts word+' '+freq.to_s}
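Or, inspired by the concision of William Turkel’s Python word frequency code, you could do it like this (note that the bare ‘split’ here breaks the text on whitespace only, so punctuation stays attached to the words):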
# replace 'filename.txt' with the file you want to process
words = File.open('filename.txt') {|f| f.read }.split
freqs = Hash.new(0)
words.each { |word| freqs[word] += 1 }
freqs.sort_by {|x,y| y }.reverse.each {|w, f| puts w+' '+f.to_s}
Further Enhancements
And there we have it. There are definitely some improvements you might want to make.
You’ll probably want to convert your ‘text’ string to all lowercase or all uppercase
so that ‘Ruby’ and ‘ruby’ don’t get counted separately.
You may want to strip the text of SGML/XML tags before you split it into words.
You may want to convert plural nouns to singular, or normalise verb endings, or remove
any words that also occur in a stop-list (a list of very frequent common words that
you want to ignore).
You might make it work through a web browser instead of the
command line.
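As a starting point, three of these suggestions (lowercasing, tag stripping and a stop-list) might be sketched as follows. The method name ‘content_words’, the tiny stop-list and the sample text are all invented for illustration, and the tag pattern is deliberately crude: it simply drops anything between ‘<’ and ‘>’, which will not cope with every kind of markup.

```ruby
# A tiny, made-up stop-list; a real one would be much longer.
STOP_WORDS = %w[the a an and of to in is].freeze

# Hypothetical helper: turns raw text into a list of words worth counting.
def content_words(text)
  text
    .gsub(/<[^>]*>/, ' ')  # crude tag removal: drop anything between < and >
    .downcase              # so 'Ruby' and 'ruby' count as one word
    .scan(/[a-z]+/)        # collect runs of letters
    .reject { |word| STOP_WORDS.include?(word) }
end

sample = '<p>The Ruby and the <em>ruby</em> of Ruby</p>'
freqs = Hash.new(0)
content_words(sample).each { |word| freqs[word] += 1 }
p freqs  # {"ruby"=>3}
```

Normalising plurals and verb endings is a much bigger job, and is probably best left to a stemming library.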
Best of all is if you make a customisation that is interesting to you
and your text, but isn’t already covered by the text analysis software currently
available. Please share your ideas for text analysis innovations in the Comments.