BookMark Magazine: Education with Imagination: web scraping


Monday, March 31, 2014

Extracting Links from Web Pages: The Anchor Tag as a Start and End Delimiter

Link extraction, or link harvesting, is an essential operation in any web scraping program. To extract a link, you must first decide which delimiters to search for. Fortunately, html marks up every link with an anchor element that opens with <a and closes with </a>. This means that, for the most part, if the anchor tags are used as the beginning and ending delimiters, the link will be found between them.

The generalized html code for a link is given as

<a href="url">Link text</a>

Although the anchor tag is a better search delimiter than the double quotes that surround the URL, it is not foolproof: an anchor tag can also mark a bookmark within the page or an email address rather than a link to another page. The problem with double quotes as delimiters arises when one starts a line-by-line parse. A line of html code can contain any number of double quotes, many of which do not enclose links. This makes double quotes acceptable only when a line contains a single link and no other tags or text with double quote marks.
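To see the difference, here is a small sketch in Ruby. The sample line of html and both method names are made up for illustration. The first method searches for the first and last double quote in the line, the second searches for the anchor tags.

```ruby
# Hypothetical helpers for illustration; the names are not from any library.

# Returns the text between the first and the last double quote in a line.
def between_quotes(line)
  first = line.index('"')
  last  = line.rindex('"')
  return nil if first.nil? || first == last
  line[(first + 1)...last]
end

# Returns the first anchor element in a line, from "<a " through "</a>".
# The trailing space in '<a ' keeps tags such as <abbr> from matching.
def first_anchor(line)
  start = line.index('<a ')
  return nil if start.nil?
  finish = line.index('</a>', start)
  return nil if finish.nil?
  line[start...(finish + '</a>'.length)]
end

# Made-up line with three quoted attributes but only one link.
sample = '<p class="note"><a href="http://example.com/" title="Example">Example</a></p>'

puts between_quotes(sample)
puts first_anchor(sample)
```

With this sample, the quote search returns a span that starts inside the class attribute and runs into the title attribute, while the anchor search returns only the anchor element that holds the link.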

Examining the page source, and in particular the html code that surrounds the links, will give you an idea of whether the delimiters you select will work for a specific web page. You will find that, for the most part, anchor tags are the best choice.

Below are some sample clips of html code that contain links. Review them and decide for yourself what would be the best delimiters to use. The answer is the html anchor tag.

<a href="#tips">Visit the Useful Tips Section</a>

<a href="http://www.w3schools.com/html_links.htm#tips">Visit the Useful Tips Section</a>

<strong>Email: </strong><a href="mailto:mark.c.stansberry@gmail.com">mark.c.stansberry@gmail.com</a></p>
<p><br />

<p><a href="http://www.amazon.com/Very-Best-Algebra-Notes-ebook/dp/B00B7AY2LA"><img src="cover.jpg" width="540" height="324" alt="Very Best Algebra Notes" /></a></p>

<p><a href="http://www.santa-rosa-algebra-geometry-statistics--tutoring.com/santa-rosa-online-tutoring-chemistry-AP-santa-rosa-junior-college-srjc-sonoma-state-university-chem-42-1A-1B-102-105-107-110-115A-115B-ssu.html" title="Santa Rosa Online Tutoring: AP Chemistry, SRJC, SSU Chemistry Classes, Mark Stansberry Tutoring, BookMark Tutoring"><img src="Chemistry-button-icon.jpg" alt="chemistry tutoring santa rosa bookmark" width="82" height="78" hspace="3" align="left" />Chemistry: </a>Chemistry tutoring is available for topics covered in AP Chemistry and for chemistry courses taught at Santa Rosa Junior College and Sonoma State University. Provide homework help, laboratory report writing services and test preparation for chemistry topics that relate to atomic theory, the periodic table , bonding, thermochemistry, solids, liquids and gases, chemical equilibrium, acids and bases, electrochemistry, nuclear reactions, and organic chemistry. </p>

<a href="https://archive.org/details/movies">Video</a>
</div>
<div class="tab">
<a href="https://archive.org/details/texts">Texts</a>
</div>
<div class="tab">
<a href="https://archive.org/details/audio">Audio</a>
</div>
<div class="tab">
<a href="https://archive.org/details/software">Software</a>
</div>
<div class="tabsel backColor1">
<a href="https://archive.org/about/">About</a>
</div>

<tr class="hitRow"><td class="numberCell"><img src="/images/mediatype_audio.gif" alt="[audio]"/></td><td class="hitCell" colspan="2"><a class="titleLink" href="/details/DavidRMerryLennon"><span class="searchTerm">Lennon</span></a> - David R Merry<br/>A Folk/Rock track written as a tribute to the memory of the late, great "John <span class="searchTerm">Lennon</span>" (solo multi track digital recording)<br/><span style="font-weight:bold;">Keywords:</span> <a href="/search.php?query=lennon%20AND%20subject%3A%22John%22">John</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Lennon%22">Lennon</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Folk%22">Folk</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Blues%22">Blues</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Beatlemania%22">Beatlemania</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Rock%22">Rock</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Daveyboy%22">Daveyboy</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Singer%22">Singer</a>;
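For a line that holds several links, like the last sample above, the anchor tag search can be repeated from the end of each match. The sketch below is one possible way to do that in Ruby. It is not a full html parser, all three method names are hypothetical, and the test line is a shortened, made-up version of the samples.

```ruby
# Hypothetical helper names; a sketch, not a full HTML parser.

# Collects every span that starts with "<a " and ends with "</a>".
def anchors_in(line)
  found = []
  pos = 0
  while (start = line.index('<a ', pos))
    finish = line.index('</a>', start)
    break if finish.nil?
    pos = finish + '</a>'.length
    found << line[start...pos]
  end
  found
end

# Pulls the href value out of one anchor span, or nil if there is none.
def href_of(anchor)
  key = 'href="'
  start = anchor.index(key)
  return nil if start.nil?
  start += key.length
  finish = anchor.index('"', start)
  return nil if finish.nil?
  anchor[start...finish]
end

# Labels an href as a bookmark, an email address, or an ordinary link.
def kind_of(href)
  return :bookmark if href.start_with?('#')
  return :email if href.start_with?('mailto:')
  :link
end

line = '<a href="#tips">Tips</a> <a href="mailto:someone@example.com">Mail</a> ' \
       '<a class="titleLink" href="/details/DavidRMerryLennon">Lennon</a>'

anchors_in(line).each do |anchor|
  href = href_of(anchor)
  puts "#{kind_of(href)}: #{href}" unless href.nil?
end
```

Searching for href=" inside each anchor, rather than taking the first quote, matters for anchors such as the titleLink sample, where a class attribute comes before the href.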

Sunday, March 30, 2014

Extracting Links with Ruby String Methods: index, rindex

Extracting links from web pages (often known as data harvesting or web scraping) is an easy task with Ruby's string methods. With the index and rindex string methods, the positions of the first and last delimiters in a line of html code can be found with just two lines of code. Many links within html source code begin and end with a double quote character ("). This makes the double quote character a convenient delimiter to locate when a line contains a single link. Using the index method to find the position of the first double quote character and the rindex method to find the position of the last one, one can then use a range of positions to extract the complete link.

Identifying the Link to Extract

In this code example, the goal is to extract the link from the following string within an html file:

for background information visit <a href="http://www.linkedin.com/pub/mark- stansberry/60/828/826">linkedin page</a><br />

The link we want, http://www.linkedin.com/pub/mark- stansberry/60/828/826, is delimited with two double quotes.

The code used to find the positions of the first and last double quote characters is given below. Note that we assign the string from which the link is extracted to the string variable a.

The Code: File name: strings_lesson.rb

a = 'for background information visit <a href="http://www.linkedin.com/pub/mark- stansberry/60/828/826">linkedin page</a><br />'

# Find the position of the first occurrence of a double quote

puts "The first quote in the string variable, a, is at position #{a.index('"')}"

# For the position of the last double quote ("), use the rindex string method

puts "The last quote in the string variable, a, is at position #{a.rindex('"')}"

Output from the Code

The first quote in the string variable, a, is at position 41
The last quote in the string variable, a, is at position 97

The Next Step

Now that the positions of the first and last double quote characters are known, that information can be used with Ruby's substring extraction methods. These methods take the starting and ending positions of the substring delimiters as parameters in order to extract the substring. In the next lesson, the code to actually extract and store the link is given.
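A minimal sketch of that extraction step is shown here, assuming the link is the only quoted text in the line. The method name extract_quoted is made up for this example and is not taken from the next lesson.

```ruby
# A minimal sketch; extract_quoted is a hypothetical name.
def extract_quoted(line)
  first = line.index('"')   # position of the first double quote
  last  = line.rindex('"')  # position of the last double quote
  return nil if first.nil? || first == last
  # Start one past the first quote; the three-dot range stops before the last quote.
  line[(first + 1)...last]
end

a = 'for background information visit <a href="http://www.linkedin.com/pub/mark- stansberry/60/828/826">linkedin page</a><br />'
puts extract_quoted(a)
```

The three-dot range (...) excludes its end position, so the closing quote is left out without having to subtract one from the rindex result.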

Error Messages

To run the code, cut and paste it into a file with the Ruby extension, .rb, and then run the program. If for some reason you are getting errors, feel free to comment on this blog. The code has been tested (actually cut and pasted from the blog) and runs correctly.

References

String: http://www.ruby-doc.org/core-2.1.0/String.html

Friday, March 28, 2014

Ruby Data Harvesting, Lesson 2, Web Page Downloading, Readlines and Arrays

In lesson 1, code was written to download the html source page from the website www.bookmarktutoring.com. If you ran the example code, you will have noticed that this web page has quite a few lines.

In the example code below, instead of just downloading the html file and displaying it, we are going to download the file and place its individual lines into an array. This will allow us to access each line of the web page individually. It will also allow us, in the next lesson, to search for specific content within the file, extract the data, and replace data in the file. Being able to download a web page and store its lines in an array gives us part of the foundation for building a data harvesting or scraping tool.

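require 'open-uri'

# URI.open is used here because Kernel#open no longer accepts URLs in Ruby 3.0 and later
f = URI.open('http://www.bookmarktutoring.com/default.html')
file_contents = f.readlines   # one array element per line of the page
puts file_contents            # display the entire page
puts file_contents[0]         # first line
puts file_contents[1]         # second line
puts file_contents[2]         # third line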

In the above code, the open-uri library was imported into the program. The main page of bookmarktutoring.com was downloaded and stored in the open-uri object variable, f, with open-uri's open method (note the single quotes around the URL). Next, the readlines method was invoked on the variable f. The readlines method reads the downloaded web page one line at a time and stores each line sequentially in an array called file_contents.

The first puts command, puts file_contents, displays the entire page, as in the last example. The next three puts statements, however, display only the first, second and third lines of the web page. When you run the program you will first see the entire contents of the web page fly by on the screen. The last three lines displayed will be the first three lines of the web page.
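To experiment with readlines without a network connection, the sketch below swaps the downloaded page for a StringIO object holding a few made-up lines of html. StringIO is part of Ruby's standard library and responds to readlines in the same way. The page text and the method name lines_of are assumptions for illustration.

```ruby
require 'stringio'

# Hypothetical helper: reads every line of an IO-like object into an array.
def lines_of(io)
  io.readlines
end

# Made-up page content standing in for a downloaded web page.
page = StringIO.new("<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n")

file_contents = lines_of(page)
puts file_contents.length   # number of lines read
puts file_contents[0]       # first line
puts file_contents[2]       # third line
```

Each array element keeps its trailing newline character. Calling chomp on a line removes it if it gets in the way of later searches.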

As an exercise, substitute a URL of your choice in the open command. Be aware that you may run into problems for web pages that have URLs with different extension names and prefixes.

TO LESSON 3: DATA HARVESTING, ARRAY ITERATION