BookMark Magazine, Education with Imagination: downloading web pages

Friday, March 28, 2014

Ruby Data Harvesting, Lesson 2, Web Page Downloading, Readlines and Arrays

In lesson 1, we wrote code to download the HTML source of a page from the website www.bookmarktutoring.com. If you ran the example code, you will have noticed that this web page has quite a few lines.

In the example code below, instead of just downloading the HTML file and displaying it, we are going to download the file and place its individual lines into an array. This will allow us to access each line of the web page individually. It will also allow us, in the next lesson, to search for specific content within the file, extract data, and replace data in the file. Being able to download a web page and store it in an array gives us part of the foundation for building a data harvesting or scraping tool.

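require 'open-uri'

# URI.open downloads the page; a bare open() no longer accepts URLs in Ruby 3.0 and later
f = URI.open('http://www.bookmarktutoring.com/default.html')
# readlines returns an array with one element per line of the page
file_contents = f.readlines
puts file_contents      # the entire page
puts file_contents[0]   # first line
puts file_contents[1]   # second line
puts file_contents[2]   # third line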

In the above code, the open-uri library (part of Ruby's standard library) was loaded into the program. The main page of bookmarktutoring.com was downloaded and stored in the variable f by the open call (note the single quotes around the URL). Next, the readlines method was invoked on f. The readlines method reads the downloaded web page one line at a time and stores each line, in order, in an array called file_contents.

The first puts command, puts file_contents, displays the entire page, as in the last example. The next three puts statements, however, display only the first, second and third lines of the web page. When you run the program you will first see the entire contents of the web page fly by on the screen. The last three lines displayed will be the first three lines of the web page.
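To see what readlines returns without touching the network, the short sketch below runs it on a StringIO object. The StringIO and its made-up HTML are assumptions for illustration only; they stand in for the downloaded page.

```ruby
require 'stringio'

# Hypothetical stand-in for a downloaded page; the HTML is made up.
sample_page = StringIO.new("<html>\n<head><title>Demo</title></head>\n<body>Hello</body>\n</html>\n")

# readlines returns an Array with one String per line.
lines = sample_page.readlines

puts lines.class      # Array
puts lines.length     # 4
p lines[0]            # "<html>\n" (each element keeps its trailing newline)
puts lines[0].chomp   # chomp strips that newline
```

Because every element keeps its newline, calling chomp on a line is a common first step before comparing or searching it.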
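Ruby arrays are indexed from zero, so the first line sits at index 0 and the third at index 2. The sketch below illustrates this with a small hand-written array that stands in for file_contents; the sample lines are invented for the example.

```ruby
# Hypothetical sample lines standing in for file_contents.
file_contents = ["<html>\n", "<head>\n", "<body>\n", "</html>\n"]

puts file_contents[0]          # first line (index 0)
puts file_contents[2]          # third line (index 2)
puts file_contents.first(3)    # first three lines in one call
p file_contents[10]            # nil: an index past the end returns nil, not an error
```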

As an exercise, substitute a URL of your choice in the open command. Be aware that you may run into problems for web pages that have URLs with different extension names and prefixes.
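One possible way to catch a problem URL before trying to open it is to check that it parses as an http or https address. The helper below, web_url?, is a hypothetical name introduced for this sketch and is not part of the lesson's code; the second, third and fourth test URLs are likewise only examples.

```ruby
require 'uri'

# Hypothetical helper: returns true when the string parses as an http or https URL.
def web_url?(text)
  uri = URI.parse(text)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

puts web_url?('http://www.bookmarktutoring.com/default.html')  # true
puts web_url?('https://www.ruby-lang.org/en/')                 # true
puts web_url?('www.bookmarktutoring.com')                      # false (no http:// prefix)
puts web_url?('ftp://example.com/file.txt')                    # false (not a web URL)
```

A check like this only confirms the shape of the address. The download itself can still fail, for example if the page does not exist.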

TO LESSON 3: DATA HARVESTING, ARRAY ITERATION

Thursday, March 27, 2014

Data Harvesting with Ruby: Lesson 1: Downloading Web Pages

In order to harvest or scrape data from the World Wide Web, you have to learn how to download files from the web. The sample code below, a simplified method, is a program that downloads the default page of bookmarktutoring.com.

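require 'net/http'
# Address of the page to download
url = 'http://www.bookmarktutoring.com/default.html'
# Split the URL into its parts (scheme, host, path)
uri = URI.parse(url)
# Send a GET request and keep the server's response
response = Net::HTTP.get_response(uri)
# Print the HTML returned by the server
puts response.body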

In Ruby, downloading a file from the Internet can be done with the net/http library, which is part of Ruby's standard library. This library lets you send requests to remote servers and read their responses.

The code downloads the main page of BookMarkTutoring.com and displays it on the screen. Note that, to access the main page of bookmarktutoring.com, the default (or index) page of the web site is requested.

On the first line of the code, the net/http library is loaded into the program. Next, the address of the page to read is stored in a variable called url. Note that the address must be written as a string, here enclosed in single quote marks.

The parse method of URI is then invoked. This call splits the URL into its component parts (such as the scheme, host and path) so that the program knows which server to contact and which file to request.
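The parts produced by URI.parse can be inspected directly. The short sketch below prints a few of them for the lesson's URL; it needs no network connection because parsing only examines the string.

```ruby
require 'uri'

uri = URI.parse('http://www.bookmarktutoring.com/default.html')

puts uri.scheme   # http
puts uri.host     # www.bookmarktutoring.com
puts uri.port     # 80 (the default port for http)
puts uri.path     # /default.html
```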

The get_response method is then called. It sends the request to the server and returns the server's response, which is stored in the variable called response.

The body method is then invoked on the response object. The body holds the content of the returned page, in this case its HTML source.
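The server returns a response object even when the request fails (for example with a 404), so it can be worth checking the status before printing the body. The sketch below shows one way to do that. The helper describe_response is a hypothetical name introduced here, and the two responses are built by hand so the example runs without a network connection.

```ruby
require 'net/http'

# Hypothetical helper: summarizes a response in one line.
def describe_response(response)
  status = response.is_a?(Net::HTTPSuccess) ? 'success' : 'failure'
  "#{response.code} #{response.message} (#{status})"
end

# Responses built by hand so the sketch runs offline; in the lesson's
# program the object would come from Net::HTTP.get_response(uri).
ok = Net::HTTPOK.new('1.1', '200', 'OK')
missing = Net::HTTPNotFound.new('1.1', '404', 'Not Found')

puts describe_response(ok)        # 200 OK (success)
puts describe_response(missing)   # 404 Not Found (failure)
```

In the lesson's program, the same is_a?(Net::HTTPSuccess) test could guard the puts response.body line.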

Lesson 2 Ruby Data Harvesting, Downloading Web Pages, Readlines, Arrays