In a previous article, I showed how to import text from a table on a Web page. However, as you're probably aware, cells in tables on Web pages can contain more than simply text. For example, hyperlinks and images are often included as data.
As before, for the purposes of this article, I've hosted a sample Web page. This page is actually a copy of the CIA page, but I was a little afraid that they might change the layout before this article was published. (I didn't think there was too much chance that they'd go out of business!) As Figure 1 at the bottom of this page illustrates, the page displays over 250 flags of the world. Each flag is a hyperlink to a larger version of the flag's image, while the name of each country is a hyperlink to information about that country.
What I intend to show you in this article is how to import the hyperlinks (that is, both the URL pointed to and the text describing the link), as well as how to download the images to your computer.
HTML Anchor Tags
Once again, I'm not going to go into great detail about the HTML that makes up that Web page, nor about the Document Object Model which will allow me to work with the data on that page, but I feel it's necessary to give you some background.
The actual HTML for a sample cell is something like the following (although I'm leaving out some of the actual URLs in an attempt to make it fit):
<TD align=middle width=117 height=100>
<A href="https://.../flags/ca-flag.html">
<IMG alt="Flag of Canada" src=".../ca-flag.gif">
</A><BR>
<A href="https://.../geos/ca.html">
<FONT color=#ffffff>Canada</FONT>
</A>
</TD>
In HTML, hyperlinks are designated by <A> (anchor) tags. You can have text, images, or both within the anchor, while the href attribute of the tag indicates the destination URL for the anchor. As you can see, there are two hyperlinks in the previous example and one image (the <IMG> tag). What may not be quite as obvious is that the first hyperlink has no text associated with it: only the image is contained between the <A> and </A> tags. To get some useful text to display in my database, I'll have to use the alt attribute of the <IMG> tag.
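To make the alt-text fallback concrete, here is a minimal sketch in Python using the standard library's html.parser module. It is only an illustration of the idea, not the DOM-based approach this article relies on. The names LinkCollector and extract_links are hypothetical, and the sample cell keeps the elided URLs exactly as shown above.

```python
from html.parser import HTMLParser

# The sample cell from the article, with the elided URLs left as-is.
SAMPLE_CELL = """
<TD align=middle width=117 height=100>
<A href="https://.../flags/ca-flag.html">
<IMG alt="Flag of Canada" src=".../ca-flag.gif">
</A><BR>
<A href="https://.../geos/ca.html">
<FONT color=#ffffff>Canada</FONT>
</A>
</TD>
"""


class LinkCollector(HTMLParser):
    """Collect (href, description) pairs for every anchor in the input."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, description) tuples
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor
        self._alt = None     # alt text of the first image inside it

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href, self._text, self._alt = attrs["href"], [], None
        elif tag == "img" and self._href is not None and self._alt is None:
            self._alt = attrs.get("alt")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so an anchor holding only an image is empty.
            text = " ".join("".join(self._text).split())
            # Fall back to the image's alt text when the anchor has no text.
            self.links.append((self._href, text or self._alt or ""))
            self._href = None


def extract_links(html):
    """Return a list of (href, description) tuples found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


for href, description in extract_links(SAMPLE_CELL):
    print(href, "|", description)
```

The first anchor contains only whitespace and an image, so its description comes from the alt text; the second anchor uses its own text.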
Those of you who read the previous article may recall that the DOM didn't really provide me with a direct means of getting at the tables: I had to use the getElementsByTagName() method to create a nodeList collection representing all of the tables contained on the page. For hyperlinks, the DOM explicitly includes collections of the links and images on the page (the links and images collections, respectively), so I won't have to do quite as much work with hyperlinks as I did with tables. The catch is that the links and images collections both refer to the entire page, and there's no real way to determine which specific entries are part of the table and which aren't. You'll see the kludges I had to use in order to limit myself to only those hyperlinks and images contained within the table.
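To illustrate the scoping problem outside the DOM, here is a minimal Python sketch that keeps only the links and images appearing inside a table. It is not the technique this article uses: it simply tracks table nesting while parsing with the standard html.parser module. The class name, the function name, and the tiny sample page (including its file names) are all hypothetical.

```python
from html.parser import HTMLParser


class TableScopedCollector(HTMLParser):
    """Record link hrefs and image srcs, but only those inside a <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # greater than 0 while inside a table
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.table_depth += 1
        elif self.table_depth > 0:
            if tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])
            elif tag == "img" and "src" in attrs:
                self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


# Hypothetical page: a logo link and a contact link outside the table,
# and one flag cell inside it.
PAGE = """
<A href="home.html"><IMG src="logo.gif" alt="Logo"></A>
<TABLE><TR><TD>
<A href="ca-flag.html"><IMG alt="Flag of Canada" src="ca-flag.gif"></A><BR>
<A href="ca.html">Canada</A>
</TD></TR></TABLE>
<A href="contact.html">Contact</A>
"""


def links_and_images_in_tables(html):
    """Return (links, images) for anchors and images nested in any table."""
    parser = TableScopedCollector()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images


table_links, table_images = links_and_images_in_tables(PAGE)
print(table_links)
print(table_images)
```

A page-wide collection would also pick up home.html, logo.gif, and contact.html; tracking the table depth is one way to leave them out.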
Figure 1: A sample Web page to import -- The data of interest is presented in a table composed of images and hyperlinks.