Thursday, March 27, 2014

Data Harvesting with Ruby: Lesson 1: Downloading Web Pages

In order to harvest or scrape data from the World Wide Web, you first have to learn how to download files from the web. The sample code below, a simplified example, illustrates a program that retrieves the default page of bookmarktutoring.com.

require 'net/http'                                     # standard library for HTTP requests
url = 'http://www.bookmarktutoring.com/default.html'   # address of the page to download
uri = URI.parse(url)                                   # split the URL into its parts
response = Net::HTTP.get_response(uri)                 # send a GET request to the server
puts response.body                                     # print the page's HTML


In Ruby, downloading a file from the Internet requires the net/http library. This library ships with Ruby's standard library, so no separate gem installation is needed. It provides methods for reading from, and writing to, files stored on remote servers.
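In practice you may want to guard the download against network failures. The helper below is a sketch only; the method name fetch_page and the rescue clauses are additions for illustration and are not part of the lesson's original code.

```ruby
require 'net/http'

# Hypothetical helper: wraps the lesson's download steps and rescues
# the most common connection errors instead of crashing.
def fetch_page(url)
  uri = URI.parse(url)
  Net::HTTP.get_response(uri)
rescue SocketError, Errno::ECONNREFUSED => e
  warn "Could not reach #{uri.host}: #{e.message}"
  nil
end

# Usage (requires network access):
# response = fetch_page('http://www.bookmarktutoring.com/default.html')
# puts response.body if response
```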

The code retrieves the main page of BookMarkTutoring.com and displays it on the screen. Note that to reach the main page of bookmarktutoring.com, the site's default or index page is requested.

On the first line of the code, the net/http library is loaded into the program. Next, the address of the page to read is stored in a variable called url. Note that the web address must be enclosed in quote marks, because it is a string.

The parse method of the URI module is then invoked. This call splits the URL into its component segments (scheme, host, path, and so on) so that the server where the web page resides can locate the file.
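To see what URI.parse actually produces, the snippet below prints the individual pieces of the lesson's URL; no network access is needed for this step.

```ruby
require 'uri'

uri = URI.parse('http://www.bookmarktutoring.com/default.html')
puts uri.scheme  # "http"
puts uri.host    # "www.bookmarktutoring.com"
puts uri.path    # "/default.html"
puts uri.port    # 80 (the default port for http)
```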

The method get_response is then called. This method sends the request and retrieves the server's response, which is stored in a variable called response.
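The object that get_response returns is a subclass of Net::HTTPResponse. In the snippet below a response object is built by hand, without any network traffic, purely to show the fields you would inspect on a real one; a live request would produce the same kind of object.

```ruby
require 'net/http'

# Hand-built stand-in for a successful server response.
response = Net::HTTPOK.new('1.1', '200', 'OK')

puts response.code                     # "200" (the status code, as a string)
puts response.message                  # "OK"
puts response.is_a?(Net::HTTPSuccess)  # true
```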

Finally, the body method is invoked on the response object. The body contains the data of the returned page, its HTML, as an ordinary Ruby string.
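Because the body is just a string, any string or file method applies to it. The HTML below is a stand-in for what a live request would return; with a real download you would use response.body in its place.

```ruby
# Stand-in for response.body from a live request.
html = '<html><head><title>Demo</title></head><body>Hello</body></html>'

File.write('default.html', html)  # save the page to disk
puts html.length                  # size of the page in characters
puts html.include?('<title>')     # true
```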

Lesson 2 Ruby Data Harvesting, Downloading Web Pages, Readlines, Arrays
