Is there an easy way to extract data from specific HTML tables using Mathematica? Import seems to be pretty powerful, and Mathematica appears to be capable of handling formats such as XML pretty well.
Here’s an example: http://en.wikipedia.org/wiki/Unemployment_by_country
For general examples of this there are these How tos:
For this specific example, just import it:
Cleaning it up is fairly straightforward with this import. The table has three columns, so extract the three-element rows from the rest of the content:
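A minimal sketch of that import step, assuming the "Data" element of the HTML importer (which returns tables and lists as nested lists) is what you want. The variable names are mine, and the page layout may have changed since this was written:

```mathematica
(* Assumption: "Data" is the import element used; it returns nested lists *)
url = "http://en.wikipedia.org/wiki/Unemployment_by_country";
data = Import[url, "Data"];

(* Peek at the overall nesting before deciding how to pull the table out *)
Short[data, 5]
```

Inspecting the abbreviated output first makes it easier to see how deeply the table of interest is nested.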
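One way to do that extraction, sketched on a small made-up stand-in for the imported data (the names and numbers below are placeholders, not values from the page). The pattern-based approach is my assumption about what works here; the real nesting may need a different pattern:

```mathematica
(* Hypothetical stand-in for the imported data: page clutter plus one 3-column table *)
sample = {
   {"Navigation", "Search"},
   {{"Country", "Rate (%)", "Source"},
    {"Country A", 8.2, "2011[5]"},
    {"Country B", 4.6, "2011[12]"},
    {"Country C", 7.0, "2010[3]"},
    {"Country D", 5.5, "2010[4]"}}
   };

(* Every three-element list at any depth; this includes the header row *)
rows = Cases[sample, {_, _, _}, Infinity];

(* Stricter variant: require a numeric second column, which also drops the header *)
body = Cases[sample, {_String, _?NumericQ, _}, Infinity];
```

Note that the loose pattern `{_, _, _}` would also match an enclosing list that happens to have exactly three elements, so the stricter variant is usually the safer choice.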
You will presumably want to remove the square-bracket reference markers:
Note also that you can add the header back if you want it in your table, which you probably do.
Purists might object to the last step, but when you are scraping data you generally just want to get the job done, and each site has to be handled case by case. Some manual inspection and flexibility gets you the fastest overall result.
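A rough sketch of stripping those markers, again on made-up rows. The helper name stripRefs is hypothetical, and it assumes the markers always look like "[...]" inside string cells:

```mathematica
(* Hypothetical rows as they might look after extraction *)
rows = {{"Country A", 8.2, "2011[5]"}, {"Country B[note 1]", 4.6, "2011[12][13]"}};

(* Remove each "[...]" marker; Shortest keeps adjacent markers from merging into one match *)
stripRefs[s_String] := StringTrim[StringReplace[s, "[" ~~ Shortest[___] ~~ "]" -> ""]];

(* Apply to string cells only; numeric cells are left untouched *)
clean = rows /. s_String :> stripRefs[s]
```

Because the replacement rule only fires on strings, it can be applied to the whole table at once without touching the numeric column.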
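For example, something along these lines would put a header row back. The header labels here are my own guesses, not copied from the page, and the body rows are placeholders:

```mathematica
(* Placeholder body rows standing in for the cleaned table *)
body = {{"Country A", 8.2, "2011"}, {"Country B", 4.6, "2011"}};

(* Assumed header labels; adjust to whatever the source table actually uses *)
header = {"Country", "Unemployment rate (%)", "Source / date"};

table = Prepend[body, header];
Grid[table, Frame -> All]
```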
Edit
If you wanted the flags, you could also get them from CountryData. Some further cleaning up is needed, otherwise a lot of misses will occur. The cleanup involves removing the reference to the “sovereign country” in parentheses, e.g. “Guam ( United States )” -> “Guam”. This will still produce some output that CountryData does not recognize (6 misses out of 190). Remove those misses from the output:
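A sketch of both steps, under a couple of assumptions: the helper name cleanName is hypothetical, the third name is deliberately bogus to simulate a miss, and I am assuming that a successful flag lookup comes back as a Graphics or Image expression (this may differ between Mathematica versions):

```mathematica
(* Raw names as they might appear in the scraped table; the last one is a deliberate miss *)
names = {"Guam ( United States )", "Japan", "Not A Real Country"};

(* Drop the parenthesised qualifier, then trim leftover whitespace *)
cleanName[s_String] := StringTrim[StringReplace[s, "(" ~~ ___ ~~ ")" -> ""]];
cleaned = cleanName /@ names;

(* Look up each flag; Quiet suppresses messages for unrecognised names *)
pairs = {#, Quiet[CountryData[#, "Flag"]]} & /@ cleaned;

(* Assumption: a hit is a Graphics or Image expression; anything else counts as a miss *)
hits = Select[pairs, MatchQ[Last[#], _Graphics | _Image] &];
Grid[hits, Alignment -> Left]
```

Filtering on the shape of the result, rather than on a list of known names, avoids depending on how a particular version represents unrecognised countries.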
Note that this takes a while to render.
You can obviously style the Grid as desired using Grid options, and also resize the images if needed.
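As a small illustration, with a plain red disk standing in for a real flag and placeholder table values, the styling and resizing might look like this:

```mathematica
(* Stand-in for a flag graphic; a real one would come from CountryData *)
flag = Graphics[{Red, Disk[]}];

(* Show with ImageSize shrinks the graphic so it fits a table cell *)
rows = {
   {"Country", "Rate (%)", "Flag"},
   {"Country A", 8.2, Show[flag, ImageSize -> 20]}
   };

(* A few common Grid options: frame every cell, left-align columns, adjust spacing *)
Grid[rows, Frame -> All, Alignment -> {Left, Center}, Spacings -> {1, 0.8}]
```

Shrinking the images before building the Grid should also help with the slow rendering.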