Is there an easy way to extract data from specific HTML tables using Mathematica?

Asked: May 28, 2026

Is there an easy way to extract data from specific HTML tables using Mathematica? Import seems to be pretty powerful, and Mathematica appears to be capable of handling formats such as XML pretty well.

Here’s an example: http://en.wikipedia.org/wiki/Unemployment_by_country

1 Answer

Editorial Team · Answer 1 · 2026-05-28T04:34:28+00:00

For general examples of this kind of task, there are these how-tos:

How to | Clean Up Data Imported from a ZIP File
How to | Clean Up Data Imported from a Website

For this specific example, just import the page:

tmp = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]

Cleaning it up is fairly straightforward with this import. The table has three columns, the middle one numeric, so you can pick its rows out of the rest of the imported data:

tmp1 = Cases[tmp, {_, _?NumberQ, _}, \[Infinity]]
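To see why that pattern isolates the table rows, here is a minimal sketch on a small nested list. The `sample` data is made up for illustration and only mimics the shape of what Import returns for a page; the real result is much larger and nested differently.

```mathematica
(* Hypothetical stand-in for imported page data: navigation text,
   one three-column table, and a footer, nested at different depths. *)
sample = {
   {"nav", "links"},
   {{"Albania", 13.5, "2010 [1]"}, {"Algeria", 10., "2010 [2]"}},
   {"footer"}
   };

(* Match any three-element list whose middle element is a number,
   at any depth of the nested structure. *)
Cases[sample, {_, _?NumberQ, _}, Infinity]
(* {{"Albania", 13.5, "2010 [1]"}, {"Algeria", 10., "2010 [2]"}} *)
```

The two-element and one-element lists are skipped because they do not have exactly three parts, which is what makes the row shape a usable filter here.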

You will presumably want to remove the square-bracket references:

tmp1[[All, 3]] = Flatten[If[StringQ[#], 
StringCases[#, x__ ~~ Whitespace ~~ "[" ~~ __ :> x], #] & /@ tmp1[[All, 3]]]
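One thing to watch with the Flatten/StringCases approach: if a string in the third column has no bracketed reference, StringCases returns an empty list, Flatten drops it, and the assignment then fails because the lengths no longer match. A possible alternative, sketched here with a hypothetical helper named `stripRefs` (not part of the original answer), deletes the references with StringReplace so every entry maps to exactly one result:

```mathematica
(* Hypothetical helper: delete every "[...]" group from a string,
   then trim leftover whitespace. *)
stripRefs[s_String] :=
  StringTrim[StringReplace[s, "[" ~~ Shortest[___] ~~ "]" -> ""]]

(* Leave non-string entries untouched. *)
stripRefs[x_] := x

(* Made-up sample values in the style of the third column. *)
stripRefs /@ {"2010 [1]", "2009 [2][3]", "2011", 5}
(* {"2010", "2009", "2011", 5} *)
```

Applied to the table, this would be `tmp1[[All, 3]] = stripRefs /@ tmp1[[All, 3]]`.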

Grid[tmp1, Frame -> All]

Note that you can also add the header back, which you probably want in your table:

Grid[Join[{{"Country / Region", "Unemployment rate (%)", 
   "Source / date of information"}}, tmp1], Frame -> All]

Purists might object to the last step, but when you are scraping data you generally just want to get the job done, and each site has to be handled case by case. Some manual inspection and flexibility gets you the fastest overall result.

Edit

If you want the flags, you can also get them from CountryData. Some further cleanup is needed first; otherwise there will be a lot of misses. The cleanup involves removing the reference to the "sovereign country" in parentheses, e.g. "Guam ( United States )" -> "Guam".

tmp2 = Flatten[
  If[StringMatchQ[#, __ ~~ "(" ~~ __], 
     StringCases[#, 
      z__ ~~ Shortest["(" ~~ __ ~~ ")" ~~ EndOfString] :> 
       StringTrim@z], StringTrim[#]] & /@ tmp1[[All, 1]]]
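As a sanity check on what this cleanup is meant to produce, here is a small sketch of the same idea using StringReplace. The helper name `cleanName` and the sample strings are assumptions for illustration. Unlike the Flatten version, it always returns one name per input, which keeps the list aligned with tmp1 when the columns are joined later.

```mathematica
(* Hypothetical helper: drop a trailing "( ... )" qualifier, if any,
   and trim surrounding whitespace. *)
cleanName[s_String] :=
  StringTrim[StringReplace[s, "(" ~~ __ ~~ ")" ~~ EndOfString -> ""]]

(* Made-up examples in the style of the first column. *)
cleanName /@ {"Guam ( United States )", " France ", "Albania"}
(* {"Guam", "France", "Albania"} *)
```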

This will still produce some output that CountryData does not recognize.

flags = CountryData[#, "Flag"] & /@ tmp2;
Cases[flags, _CountryData]

That gives 6 misses out of 190. Replace those misses with empty strings in the output:

flags = If[Head[#] === CountryData, {""}, {#}] & /@ flags; (*much faster than rule replacement*)
tmp2 = Join[flags, tmp1, 2];
Grid[tmp2, Frame -> All]

Note that this takes a while to render.

You can obviously style the Grid as desired using Grid options and also resize the images if needed.
