Friday, March 28, 2014

Ruby Data Harvesting, Lesson 2, Web Page Downloading, Readlines and Arrays

In lesson 1, the code was written to download the HTML source of the page at www.bookmarktutoring.com. If you ran the example code, you will have noticed that this web page has quite a few lines.

In the example code below, instead of just downloading the HTML file and displaying it, we are going to download the file and place its individual lines into an array. This will allow us to access each line of the web page individually. It will also allow us, in the next lesson, to search for specific content within the file, extract data, and replace data in the file. Being able to download a web page and store it in an array gives us part of the foundation for building a data harvesting or scraping tool.

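require 'open-uri'

# Download the page. On Ruby 3.0 and later use URI.open; a bare open() no longer accepts URLs.
f = URI.open('http://www.bookmarktutoring.com/default.html')

# Read every line of the page into an array, one element per line
file_contents = f.readlines

puts file_contents     # display the entire page
puts file_contents[0]  # first line
puts file_contents[1]  # second line
puts file_contents[2]  # third line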


In the above code, the open-uri library is loaded into the program. The main page of bookmarktutoring.com is downloaded and stored in the variable f by the open command (note the single quotes around the URL). Next, the readlines method is invoked on f. The readlines method reads the downloaded web page one line at a time and stores each line, in order, in an array called file_contents.
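To see what readlines returns without touching the network, here is a minimal sketch that uses Ruby's StringIO as a stand-in for the downloaded page. The sample HTML and the variable names sample_page and lines are made up for illustration; they are not part of the lesson's program.

```ruby
require 'stringio'

# Hypothetical stand-in for a downloaded page (not the real site content)
sample_page = StringIO.new("<html>\n<head><title>Demo</title></head>\n<body>Hello</body>\n</html>\n")

# readlines returns an Array with one String per line, newline included
lines = sample_page.readlines

puts lines.class       # Array
puts lines.length      # 4
puts lines[0].inspect  # "<html>\n"
```

Each element keeps its trailing newline, which is why puts prints the lines without adding extra blank lines between them.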

The first puts command, puts file_contents, displays the entire page, as in the last example. The next three puts statements, however, display only the first, second and third lines of the web page. Array indexes start at 0, so file_contents[0] is the first line. When you run the program, you will first see the entire contents of the web page fly by on the screen. The last three lines displayed will be the first three lines of the web page.
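As a small sketch of other ways to pick lines out of the array, the example below uses a hand-written array in place of a real download. The sample lines and the variable first_three are assumptions for illustration only.

```ruby
# Hypothetical sample lines standing in for a downloaded page
file_contents = ["<html>\n", "<head>\n", "<body>\n", "</html>\n"]

# first(3) returns the first three elements, the same as file_contents[0..2]
first_three = file_contents.first(3)
puts first_three

# length gives the number of lines; a negative index counts from the end
puts file_contents.length
puts file_contents[-1]
```

Using first(3) or a range avoids repeating a puts statement for every index.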

As an exercise, substitute a URL of your choice in the open command. Be aware that you may run into problems with web pages whose URLs use different file extensions or prefixes.
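Before trying a new URL, it can help to look at its parts. The sketch below reads "prefix" as the URL scheme (http or https) and "extension" as the file extension of the path, which is only one interpretation of the warning above. The helper name describe_url and the example.com address are hypothetical.

```ruby
require 'uri'

# Hypothetical helper: report the scheme (prefix) and file extension of a URL
def describe_url(url)
  uri = URI.parse(url)
  { scheme: uri.scheme, extension: File.extname(uri.path.to_s) }
end

p describe_url('http://www.bookmarktutoring.com/default.html')
p describe_url('https://example.com/index.php')
```

If a page fails to download, comparing these two values with those of a URL that works is a reasonable first check.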


TO LESSON 3: DATA HARVESTING, ARRAY ITERATION
