Link extraction, also called link harvesting, is an essential operation in any web scraping program. To extract a link, you first have to decide which delimiters to search for. Fortunately, links in HTML are written with the anchor element, which opens with <a and closes with </a>. If you use these two tags as the beginning and ending delimiters, the link will, for the most part, be found between them.
The general form of an HTML link is:
<a href="url">Link text</a>
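The delimiter idea described above can be sketched in a few lines of Python. This is only an illustration, not a complete scraper: the function name find_anchor_spans is hypothetical, and the sketch assumes lowercase tags, as in the samples on this page.

```python
def find_anchor_spans(html):
    """Return every <a ...>...</a> span, using the anchor tags as delimiters."""
    spans = []
    pos = 0
    while True:
        start = html.find("<a", pos)
        if start == -1:
            break
        # Skip tags that merely begin with "a", such as <abbr> or <address>.
        next_char = html[start + 2:start + 3]
        if next_char not in (" ", "\t", "\n", "\r", ">"):
            pos = start + 2
            continue
        end = html.find("</a>", start)
        if end == -1:
            break  # opening delimiter without a closing one
        spans.append(html[start:end + len("</a>")])
        pos = end + len("</a>")
    return spans


sample = '<p><a href="url">Link text</a> and <abbr>x</abbr> <a href="#tips">Tips</a></p>'
for span in find_anchor_spans(sample):
    print(span)
```

Each returned span still contains the whole tag, so the URL has to be pulled out of the href attribute in a second step.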
Although the anchor tag is a better search delimiter than the double quotes that surround the URL, it is not foolproof. An anchor tag does not always point to another web page: its href may be a bookmark within the same page or a mailto: email address. Double quotes cause a different problem in a line-by-line parse. A single line of HTML can contain any number of quoted values, most of which are not links. Double quotes are therefore acceptable as delimiters only when each line contains exactly one link and no other tags or text with double quote marks.
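Both pitfalls can be shown with a short Python sketch. The names HREF_PATTERN, classify_href and extract_links are hypothetical, and the regular expression assumes the href value is wrapped in double quotes, as in the samples on this page.

```python
import re

# Capture the href value of an anchor tag (double-quoted values only).
HREF_PATTERN = re.compile(r'<a\s+(?:[^>]*?\s)?href="([^"]*)"', re.IGNORECASE)


def classify_href(href):
    """Label an href as a same-page bookmark, an email address, or a link."""
    if href.startswith("#"):
        return "bookmark"
    if href.lower().startswith("mailto:"):
        return "email"
    return "link"


def extract_links(line):
    """Return (href, kind) pairs for every anchor tag found in the line."""
    return [(href, classify_href(href)) for href in HREF_PATTERN.findall(line)]


# Splitting on double quotes returns every quoted value, not just the URL.
line = '<a class="titleLink" href="/details/DavidRMerryLennon"><span class="searchTerm">Lennon</span></a>'
quoted_values = line.split('"')[1::2]
print(quoted_values)        # ['titleLink', '/details/DavidRMerryLennon', 'searchTerm']
print(extract_links(line))  # [('/details/DavidRMerryLennon', 'link')]
```

Splitting on quotes returns three values for one link, while anchoring the search on the tag and its href returns only the URL, along with a label that lets you discard bookmarks and email addresses.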
Examining the page source and the HTML that surrounds its links will tell you whether the delimiters you have selected will work for a specific web page. For the most part, you will find that anchor tags are the best choice.
Below are some sample clips of HTML code that contain links. Review them and decide for yourself which delimiters would work best. The answer is the HTML anchor tag.
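One way to check a delimiter choice against a particular page, offered here as a rough sketch, is to compare a naive delimiter count with what a real HTML parser finds. This swaps in Python's standard html.parser module for the cross-check; the names LinkCollector, parser_links and delimiters_agree are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def parser_links(html):
    """Return all anchor hrefs found by the parser, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def delimiters_agree(html):
    """True when a naive '<a ' count matches the number of parsed links."""
    return html.count("<a ") == len(parser_links(html))


page = (
    '<div class="tab">\n<a href="https://archive.org/details/texts">Texts</a>\n</div>\n'
    '<div class="tab">\n<a href="https://archive.org/details/audio">Audio</a>\n</div>'
)
print(parser_links(page))
print(delimiters_agree(page))
```

When the two counts disagree, the page probably contains anchor tags without an href, or markup that the simple delimiter search misreads, and the source deserves a closer look.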
<a href="#tips">Visit the Useful Tips Section</a>
<a href="http://www.w3schools.com/html_links.htm#tips">Visit the Useful Tips Section</a>
<strong>Email: </strong><a href="mailto:mark.c.stansberry@gmail.com">mark.c.stansberry@gmail.com</a></p>
<p><br />
<p><a href="http://www.amazon.com/Very-Best-Algebra-Notes-ebook/dp/B00B7AY2LA"><img src="cover.jpg" width="540" height="324" alt="Very Best Algebra Notes" /></a></p>
<p><a href="http://www.santa-rosa-algebra-geometry-statistics--tutoring.com/santa-rosa-online-tutoring-chemistry-AP-santa-rosa-junior-college-srjc-sonoma-state-university-chem-42-1A-1B-102-105-107-110-115A-115B-ssu.html" title="Santa Rosa Online Tutoring: AP Chemistry, SRJC, SSU Chemistry Classes, Mark Stansberry Tutoring, BookMark Tutoring"><img src="Chemistry-button-icon.jpg" alt="chemistry tutoring santa rosa bookmark" width="82" height="78" hspace="3" align="left" />Chemistry: </a>Chemistry tutoring is available for topics covered in AP Chemistry and for chemistry courses taught at Santa Rosa Junior College and Sonoma State University. Provide homework help, laboratory report writing services and test preparation for chemistry topics that relate to atomic theory, the periodic table , bonding, thermochemistry, solids, liquids and gases, chemical equilibrium, acids and bases, electrochemistry, nuclear reactions, and organic chemistry. </p>
<a href="https://archive.org/details/movies">Video</a>
</div>
<div class="tab">
<a href="https://archive.org/details/texts">Texts</a>
</div>
<div class="tab">
<a href="https://archive.org/details/audio">Audio</a>
</div>
<div class="tab">
<a href="https://archive.org/details/software">Software</a>
</div>
<div class="tabsel backColor1">
<a href="https://archive.org/about/">About</a>
</div>
<tr class="hitRow"><td class="numberCell"><img src="/images/mediatype_audio.gif" alt="[audio]"/></td><td class="hitCell" colspan="2"><a class="titleLink" href="/details/DavidRMerryLennon"><span class="searchTerm">Lennon</span></a> - David R Merry<br/>A Folk/Rock track written as a tribute to the memory of the late, great "John <span class="searchTerm">Lennon</span>" (solo multi track digital recording)<br/><span style="font-weight:bold;">Keywords:</span> <a href="/search.php?query=lennon%20AND%20subject%3A%22John%22">John</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Lennon%22">Lennon</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Folk%22">Folk</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Blues%22">Blues</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Beatlemania%22">Beatlemania</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Rock%22">Rock</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Daveyboy%22">Daveyboy</a>; <a href="/search.php?query=lennon%20AND%20subject%3A%22Singer%22">Singer</a>;
BookMark Magazine, an educational publication from BookMarkTutoring.com, provides educational supplemental material for students and teachers. These range from captivating educational classroom activities, such as click and color educational coloring pages and our Infinity Machine online drawing software (mathematical brush driven) to subject specific learning link libraries.