CSIS 4244 Homework 3

Homework 3: Due February 14, 2011

Ruby Programming, part 1

This assignment includes several short interactive exercises and Ruby programs. These can be run on either a Windows or a Linux computer, provided that Ruby is installed (it is available in D017 on campus).

One thing you'll notice at the end of these exercises is that you can do a huge amount of processing with only a few lines of Ruby.

Lab warm-up:

Launch the interactive Ruby interpreter irb (in the lab it is located in the Programming Applications folder). We'll begin by trying out a few String operations. Enter the following and observe the results:

"Ruby practice".length

s = "The First Ruby Practice"

s[0..1] + s[5..8] + s[13..14] + s[-3..-1]

s.swapcase

Go to the Ruby-doc.org site and find the documentation for the String class. Look up the center and upcase methods. Write a single statement that uses these methods with s from above to produce a 35-character string that looks like this:

++++++THE FIRST RUBY PRACTICE++++++

Q1: Write down the statement to make the above string.

Enter this command and observe the result:

s.gsub(' ', '')

Have a look at the documentation for gsub, since we'll be using this method again later.
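
As a quick illustration of what to look for in the gsub documentation, here is a minimal sketch showing that the pattern can be given either as a string or as a regular expression. The sample text and variable names are made up for this example.

```ruby
# Illustrative only: the sample text and variable names are assumptions.
text = "one fish, two fish"

# String pattern: every literal occurrence is replaced.
swapped = text.gsub("fish", "cat")

# Regular expression pattern: every lowercase vowel is replaced.
masked = text.gsub(/[aeiou]/, "*")

puts swapped
puts masked
puts text   # gsub returns a new string; text itself is unchanged
```

Notice that gsub leaves the original string alone and returns a modified copy.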

Try the following statements and observe which expressions cause the if condition to succeed:

puts "This one works" if true
puts "This one works" if false
puts "This one works" if 500
puts "This one works" if 0
puts "This one works" if "abc"
puts "This one works" if nil
puts "This one works" if "abc".length or false

Q2: Based on this experiment, generalize what evaluates to true and what evaluates to false in Ruby.

The "spaceship" operator. Try the following:

"abc" <=> "def"
"xyz" <=> "xy"
"x" <=> "x"

Q3: Based on this experiment, generalize what the <=> operator does.
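
Once you have answered Q3, it may help to see where this operator is typically used: as the comparison inside a sort block. The following is only a sketch with a made-up word list.

```ruby
# Illustrative only: the word list is an assumption.
words = %w{pear fig banana}

# The block tells sort how to compare any two elements a and b.
ascending  = words.sort { |a, b| a <=> b }
descending = words.sort { |a, b| b <=> a }
by_length  = words.sort { |a, b| a.length <=> b.length }

p ascending
p descending
p by_length
```

Swapping a and b in the block reverses the order, and comparing a derived value (here the length) sorts by that value instead.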

It's easy to do input in Ruby. Enter the following statement to read one line from the keyboard and store the result in the variable line.

line = gets

(Note there won't be any prompt, just type in the line and press the Enter key.)

Look at the contents of line and notice exactly what it contains. Then try this:

line.chomp

Observe the returned result and how it's different from line. Once again, look at the contents of line and notice exactly what it contains.

Q4: Now find a method that will permanently remove the end of line character and apply it to line.

Basic formatting is fairly simple in Ruby. It usually involves printing a character string with values of variables and expressions inserted where needed. Just enclose the variable or expression in #{}. Here's an example to illustrate:

sum = 500
n = 24
puts "The average of #{n} values is #{sum/n}"

The output produced by this is

The average of 24 values is 20
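
The average prints as 20 rather than 20.83 because sum and n are both integers, so Ruby performs integer division. A small sketch of one way to get a fractional result, using the same values as above (the two-decimal format string is just an illustrative choice):

```ruby
sum = 500
n = 24

# Integer / Integer discards the fractional part.
puts "Integer average: #{sum / n}"

# Converting one operand to a Float gives a fractional result.
puts "Float average: #{sum.to_f / n}"

# A format string controls how many decimal places are shown.
puts "Formatted average: #{"%.2f" % (sum.to_f / n)}"
```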

Writing Ruby scripts

Next we're going to write some Ruby scripts that are capable of reading and processing significant amounts of data. Launch a text editor and enter the following:

# Ruby script to read and echo lines of text input
lines = readlines
for line in lines
puts line
end

Save the file with the name echo.rb and take note of the folder where the file is stored. Launch a command window (from the Start menu, select Run and enter cmd). Change directories to the one that contains echo.rb and run the command

ruby echo.rb

Type some lines of text and when you're finished, indicate the end of input by pressing Ctrl-Z (in Linux it's Ctrl-D). You should see all the lines you typed echoed to the command window.

A nice feature of command line programs is how easy it is to redirect input from another place. Enter the following to have the program get its input from the file echo.rb rather than from the keyboard. You should see the source code for the script displayed.

ruby echo.rb < echo.rb

Before continuing, have a look at the documentation for the readlines (Kernel) method.

Q5: What does this method return?

1. A Ruby script to do text analysis

Now that you know how to easily input lines of text, we're ready to write a script that will do some simple text analysis, including counting the number of lines, characters, non-blank characters, words, sentences, and paragraphs. When we're done, the output should look similar to the following:

25 lines
3456 characters
2971 non-whitespace characters
720 words
26 sentences
9 paragraphs

Counting lines

This is trivial when you know how readlines works. Have your script input lines of text as before and output the line count. Save your script and test it from the command window.

Counting characters

If we could join all the characters in lines into a single string, it would be trivial to get a count of the total characters. Luckily there is a method for doing just that. Add one more statement to the script to get the text as a string and print the character count (end of line characters included). Test the modified script.

When the previous step is working properly, modify the statement to exclude the end of line characters.

Counting characters excluding white space

This requires deleting the spaces from the string obtained in the previous step. Eliminating all white space requires a simple regular expression that matches all white space: /\s+/

The forward slashes are used to delimit the regular expression, and \s is a meta-character that represents all forms of spaces (blanks, tabs, end-of-lines).
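
To see what this pattern actually matches, here is a minimal sketch that collects every run of white space in a made-up sample string. The string and variable names are assumptions for illustration only.

```ruby
# Illustrative only: the sample string is an assumption.
sample = "one  two\tthree\nfour"

# scan returns an array of every substring that matches the pattern.
gaps = sample.scan(/\s+/)
p gaps

# =~ returns the index of the first match (or nil if there is none).
first_gap = sample =~ /\s+/
p first_gap
```

Because of the +, two adjacent blanks are matched as one run rather than as two separate matches.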

Add a statement to count the non-whitespace characters and print the count. Test the modified script.

Counting words

For our purposes, we'll always consider words to be separated by spaces (another approach is to use contiguous sequences of letters as words, but we won't use that here). This makes it very easy to split a string into words. Add a statement to print the word count. Test the script.

Counting sentences

A sentence ends with a period (.), question mark (?), or exclamation mark (!). Hint: Use a simple regular expression. Add a statement to print a sentence count and test the script.

Counting paragraphs

Paragraphs in a text file are separated by two newline characters in succession. Add a statement to print a paragraph count and test the script.

A longer test

Let's try a longer text file to test this script. Here's a text file containing the Declaration of Independence: decind.txt. You might want to comment out the loop that echoes the file.

2. A Ruby script to identify "useful" words

Most text processing applications (search engines like Google for example) ignore words that are normally uninteresting, like "the", "and", "is", etc.

Note: You can learn more about these "stop words" and get a list of them at http://en.wikipedia.org/wiki/Stop_words

For this assignment, we'll just use a small list of stop words. Create a Ruby script that inputs lines of text, like the previous one. Copy and paste the following to your script to create an array of common stop words:

stop_words = %w{a by on for of are the with but me etc is
and to my I has some in if am now he she it}

The above is a fast way to make an array from a list of words.
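
To make the %w notation concrete, this short sketch compares it with the equivalent array written out in full. The color names are just sample data.

```ruby
# Illustrative only: the sample words are assumptions.
short_form = %w{red green blue}
long_form  = ["red", "green", "blue"]

# Both forms build the same array of strings.
puts short_form == long_form
puts short_form.length
```

Any white space, including a line break, separates the words, which is why the stop word list above can span two lines.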

Now we need an array of words from the text to compare with the stop words. The technique in the previous script won't quite work here because it treats all non-blank characters as words, so it probably includes punctuation marks in some of the words. The scan method can be used to make an array containing just strings of "word characters". Look at the first example in the documentation for the scan method to see the pattern for this.

Now to find only the "useful" words, we'll use the select method, which will iterate through an array and select only the elements that meet a specified condition. For example, to make an array that contains only strings from list that are more than 4 characters long we could use the second statement below:

list = %w{abcde qwerty abc zzzz ttttttt ggggg 12345 xxxyyy}
long_strings = list.select {|str| str.length > 4}

In this exercise, we need a condition that will select only words which are not in the array of stop words. Check the documentation for Array if you can't think of a method to do this.

Include statements to output the percentage of "useful" words in the input text. Test the script.

3. A Ruby script to find the "median" of a list of words

Write a Ruby script that finds the median(s) of a list. The term median is normally applied to numerical data, but this program should find the median of a list of words. Informally, a median of a list is an item which is "halfway" down the list (that is, there are as many items above the median as there are below it).

For this exercise, locations are counted from 0 and the division is integer division. If there are n items in a sorted list and n is odd, there is a single median at location (n-1)/2. If n is even, there are two medians at locations (n-1)/2 and (n-1)/2 + 1. The program should allow the user to enter a series of words (or other lines of text), with the entire list terminated by an end-of-file. It should read the lines into an array and sort them based on ASCII ordering. Use the sort method to return a sorted version of a list. The program should then print the median or medians of the list. Suppose your program is named median. Then entering the command

ruby median

should run the program, taking input from the keyboard until the user enters Ctrl-Z and then display the median(s).

The program should also work when a file is used for input. So if words.txt contains lines of text, then the following command will give the median(s) of the lines in the file:

ruby median < words.txt
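
The location rule given earlier can be checked on a small hard-coded list before you write the full program. The following is only a sketch under stated assumptions: the helper name median_positions and the sample words are hypothetical, locations are counted from 0, and reading the input is left out.

```ruby
# Hypothetical helper (not part of the assignment): returns the
# zero-based median location(s) for a list of n items, n >= 1.
def median_positions(n)
  mid = (n - 1) / 2   # integer division
  n.odd? ? [mid] : [mid, mid + 1]
end

# Illustrative data only: four words, so there are two medians.
sorted = %w{delta alpha charlie bravo}.sort
positions = median_positions(sorted.length)

p positions
p sorted.values_at(*positions)
```

With five items the helper returns the single location 2, and with four items it returns locations 1 and 2.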

On many Unix systems, the file /usr/dict/words contains English words, one per line. Here's a file from a Linux system that contains more than 38,000 words: words.zip. Normally, this file would be in alphabetical order, but this one was purposely arranged randomly. Unzip this file and use it to test your median program on a big input list.

Warning: Do not print the list for this collection of data - it's way too big!

Hand In

Put the three Ruby scripts and a file containing the answers to questions Q1 - Q5 in a zip file and SUBMIT HERE.