Sunday, March 30, 2014

Extracting Links with Ruby String Methods: index, rindex

Extracting links from web pages (often known as data harvesting or web scraping) is an easy task with Ruby's string methods. With the index and rindex string methods, the positions of the first delimiter and the last delimiter in a line of HTML code can be found with just two lines of code. Many links within HTML source code begin and end with a double quote character ("). This makes the double quote character the ideal delimiter to locate when attempting to extract a link. Using the index method to find the position of the first double quote character and the rindex method to find the position of the last double quote character, one can then pass a range of positions to the string to extract the complete link.
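As a quick illustration of how the two methods behave, here is a minimal sketch using a short sample string. The string is made up for this example. Both methods count positions from 0, and both return nil when the character is not found.

```ruby
# Hypothetical sample string, used only for illustration.
s = 'say "hi" now'

puts s.index('"')    # 4, the position of the first double quote
puts s.rindex('"')   # 7, the position of the last double quote
p s.index('#')       # nil, because '#' does not occur in s
```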


Identifying the Link to Extract 

In this code example, the goal is to extract the link from the following line within an HTML file:

for background information visit <a href="http://www.linkedin.com/pub/mark- stansberry/60/828/826">linkedin page</a><br />

The link we want, http://www.linkedin.com/pub/mark- stansberry/60/828/826, is delimited with two double quotes.

The code used to find the positions of the first and last double quote characters is given below. Note that the string from which the link is extracted is assigned to the string variable a.


The Code: File name: strings_lesson.rb

a = 'for background information visit <a href="http://www.linkedin.com/pub/mark- stansberry/60/828/826">linkedin page</a><br />'


# Find the position of the first occurrence of a double quote character

puts "The first quote in the string variable, a, is at position #{a.index('"')}"

# For the position of the last double quote ("), use the rindex string method

puts "The last quote in the string variable, a, is at position #{a.rindex('"')}"


Output from the Code

The first quote in the string variable, a, is at position 41
The last quote in the string variable, a, is at position 97
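
One way to sanity-check the reported positions is to look at the characters at those positions. The sketch below assumes the same string as above. The variable names first and last are chosen for this example only.

```ruby
a = 'for background information visit <a href="http://www.linkedin.com/pub/mark- stansberry/60/828/826">linkedin page</a><br />'

first = a.index('"')
last  = a.rindex('"')

puts a[first]          # the character at the first position, a double quote
puts a[last]           # the character at the last position, a double quote
puts a[first + 1, 7]   # the 7 characters after the first quote: http://
```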

The Next Step 

Now that the positions of the first and last double quote characters are known, that information can be used with Ruby's powerful substring extraction methods. These methods take the starting position and the ending position of the substring delimiters as parameters in order to extract the substring. In the next lesson, the code to actually extract and store the link is given.
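
For readers who want a preview, the extraction step might look roughly like the sketch below. This is only one possible approach and not necessarily the code used in the next lesson. The helper name text_between_quotes is hypothetical. It passes an exclusive range (three dots) to the string so that neither quote character is included in the result.

```ruby
# Hypothetical helper, not from the original post: returns the text between
# the first and last double quote in a line, or nil if the line does not
# contain at least two double quotes.
def text_between_quotes(line)
  first = line.index('"')
  last  = line.rindex('"')
  return nil if first.nil? || first == last

  # Start one past the opening quote; the exclusive range stops before the closing quote.
  line[(first + 1)...last]
end

a = 'for background information visit <a href="http://www.linkedin.com/pub/mark- stansberry/60/828/826">linkedin page</a><br />'

link = text_between_quotes(a)
puts link
```

Keep in mind that rindex finds the last double quote on the whole line, so a line with more than one quoted attribute would need different handling.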

Error Messages 

To run the code, cut and paste it into a file with the Ruby extension, .rb, and then run the program (for example, ruby strings_lesson.rb). If for some reason you are obtaining errors, feel free to comment on this blog. The code has been tested (actually cut and pasted from the blog) and runs correctly.


References

String: http://www.ruby-doc.org/core-2.1.0/String.html
