Saturday, March 29, 2014

Lesson 3: Ruby Data Harvesting, String Array Iteration


In Lesson 2 we learned how to download an HTML file from the Internet and store each line of its source code in an array element. In this data harvesting program, all downloaded files are stored in string arrays, so before we continue it is important to learn the basics of accessing, searching, and displaying string arrays. In this example, the each iterator is used to examine the contents of a string array.

Code Example: Looping Through An Array of Strings

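file_contents = ["x y"]                            # element 0
file_contents[1] = "<http:www.bookmark"            # element 1
puts file_contents.include?("<http:www.bookmark")  # true: matches a whole element
count = 0                                          # line counter, set up before the loop
c = []                                             # copy of the lines, filled from index 1
file_contents.each do |loc|

  count = count + 1                                # 1-based line number
  puts "Line #{count} contains " + loc + '!'
  b = loc
  puts 'loc is ' + loc
  puts b
  puts count
  c[count] = b                                     # c[0] is never assigned and stays nil
  puts c[count], count
end

a = file_contents[0]
all_words = a.scan(/\w+/)                          # every run of word characters
puts all_words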

To execute the program: C:\Ruby193>ruby array_include.rb

The output of this code is:

true
Line 1 contains x y!
loc is x y
x y
1
x y
1
Line 2 contains <http:www.bookmark!
loc is <http:www.bookmark
<http:www.bookmark
2
<http:www.bookmark
2
x
y

Code Analysis

In the example code above, an array called file_contents is created, and its first two elements are filled with string values. On the third line, the include? method tests whether the string loaded into the second element is indeed an element of the array. As the first line of the output (true) reveals, it is.

In the each code block, the file_contents array is looped through with the each method. The contents of each element are displayed along with its line number. The line number is obtained with the program statement "count = count + 1". Because count is incremented before it is used, it starts at 1 and is always one greater than the element's zero-based array index. Since there are only two values in the array, the block runs just two times.
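As an alternative to keeping a manual counter, Ruby's each_with_index passes the zero-based index to the block along with the element. The following is only a sketch using the same sample data; the names numbered_lines and line_number are made up for illustration and do not appear in the lesson's program.

```ruby
# Same sample data as the lesson's example.
file_contents = ["x y", "<http:www.bookmark"]

numbered_lines = []   # hypothetical name, used only in this sketch

# each_with_index yields each element together with its zero-based index.
file_contents.each_with_index do |loc, index|
  line_number = index + 1   # convert the index to a 1-based line number
  numbered_lines << "Line #{line_number} contains #{loc}!"
end

puts numbered_lines
# Line 1 contains x y!
# Line 2 contains <http:www.bookmark!
```

With this approach there is no counter to initialize before the loop, which removes one possible source of mistakes.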

An interpolated variable, #{count}, is used to print the line number on which the contents are found. In this example, double quotes contain the displayed text and the interpolated count variable, so the actual numeric value of count is displayed. If single quotes were used, the literal string #{count} would be displayed instead of the numeric value of the count variable.
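The difference between the two kinds of quotes can be seen in a short sketch. The variable names double_quoted and single_quoted are illustrative only.

```ruby
count = 2

double_quoted = "Line #{count}"   # interpolation happens inside double quotes
single_quoted = 'Line #{count}'   # single quotes keep the text exactly as typed

puts double_quoted   # Line 2
puts single_quoted   # Line #{count}
```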

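What is important to realize about the code is that the array variable, c, and the number variable, count, must be initialized outside of the each loop. A variable first assigned inside the block is local to the block and is created anew on every iteration, so its value would neither carry over from one line to the next nor be available after the loop ends. If count were not initialized at all, the statement "count = count + 1" would fail with a NoMethodError, because count would be nil at the moment 1 is added to it.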
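The scoping behavior can be checked with a small experiment. This is a sketch rather than part of the harvesting program; the names inner, inner_values, total, and error_raised are hypothetical.

```ruby
file_contents = ["x y", "<http:www.bookmark"]

# Case 1: count is initialized before the block, so it keeps its value
# across iterations and is still available afterwards.
count = 0
file_contents.each do |loc|
  count = count + 1
end
puts count                     # 2

# Case 2: inner is first assigned inside the block, so it is local to the
# block and starts over on every iteration.
inner_values = []
file_contents.each do |loc|
  inner = 0
  inner = inner + 1
  inner_values << inner
end
puts inner_values.inspect      # [1, 1]
puts defined?(inner).inspect   # nil (inner does not exist outside the block)

# Case 3: total is never initialized, so it is nil when 1 is added to it.
error_raised = false
begin
  file_contents.each do |loc|
    total = total + 1
  end
rescue NoMethodError
  error_raised = true
end
puts error_raised              # true
```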

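In the last three lines of the code, the element at index 0 of the array is stored in a variable, a. That variable is then parsed with the scan method and the regular expression \w+, and the result is printed. The \w+ regular expression matches each run of one or more word characters (letters, digits, and underscores), and scan returns every such run as a separate array element. Spaces, punctuation, and any other non-word characters therefore act as separators between words.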
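To see how scan with \w+ treats characters other than spaces, it can be applied to both sample strings and compared with split, which by default separates on whitespace only. The helper name extract_words is an assumption made for this sketch.

```ruby
# Hypothetical helper: returns every run of word characters in a line.
def extract_words(line)
  line.scan(/\w+/)
end

puts extract_words("x y").inspect                  # ["x", "y"]
puts extract_words("<http:www.bookmark").inspect   # ["http", "www", "bookmark"]

# For comparison, split with no arguments breaks on whitespace only,
# so the punctuation stays inside the single resulting element.
puts "<http:www.bookmark".split.inspect            # ["<http:www.bookmark"]
```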


When searching for contents within an array, it is important to realize that the include? method of a string array will not directly find substrings of array elements (only complete string matches return true). For example, if you passed "http" as the argument to include? in the code above, false would be returned, since there is not a complete string match.
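One possible way to search for a substring is to test each element individually with the string version of include?. The sketch below uses the same sample data; the variable names are illustrative and not taken from the lesson.

```ruby
file_contents = ["x y", "<http:www.bookmark"]

# Array#include? compares whole elements, so a partial string is not found.
exact_match = file_contents.include?("http")
puts exact_match               # false

# String#include? looks inside each element, so the substring is found.
substring_match = file_contents.any? { |line| line.include?("http") }
puts substring_match           # true

# select returns the matching elements; index with a block returns the
# zero-based position of the first match.
matching_lines = file_contents.select { |line| line.include?("http") }
matching_index = file_contents.index { |line| line.include?("http") }
puts matching_lines.inspect    # ["<http:www.bookmark"]
puts matching_index            # 1
```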
















