I have some data in text form, taken from a webpage. It’s quite lengthy but follows the form:
<p><span class="monthyear">Jan 2001</span>
<br><b>Foo text (2)</b></p>
<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b>
<br><b>Yet more bar text (102)</b></p>
<p><span class="monthyear">Apr 2004</span>
<br><b>Further foo text (1)</b>
<br><b>Combination foo and bar text (41)</b></p>
I want to extract the relevant parts of this into a data frame, like so:
  monthyear          info  n
1  Jan 2001      Foo text  2
2  Nov 2006      Bar text 29
3  Nov 2006 More bar text  4
…but I’m not sure how to do it. If I have the html in a character vector called text I can extract the monthyear data using a function from the stringr package:
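library(stringr)
# perl() is the modifier from older stringr releases; newer ones use regex() instead
monthyear <- str_extract_all(
  text[1], perl("(?<=\\\"monthyear\\\">).*?20[0-9]{2}")
)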
and I could extract the info and n data in the same sort of way, but given that there are multiple info and n entries for each monthyear entry, I’m not sure how to combine them. Am I going about this all wrong?
Unfortunately, we can’t always control the quality of our data sources, so we have to resort to some tedious manual processing. (Some people say that the majority of a data analyst’s time is spent in cleaning data, and not in analysis.)
As already noted in the comments, regular expressions aren’t the best tools for working with HTML, because HTML, in general, isn’t really a regular language (I think it’s called a context-free language). But, if the HTML sources are somewhat regular (as they are in the example data you’ve provided), you might still be able to use them effectively.
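To illustrate that last point, here is a minimal sketch in base R. It assumes each entry sits on a single line shaped exactly like the sample in the question; the names line and m are only for illustration.

```r
# One entry line, copied from the sample in the question
line <- "<br><b>Bar text (29)</b>"

# Capture the label and the bracketed count; this only works because every
# entry is assumed to follow the same "<b>label (number)</b>" shape
m <- regmatches(line, regexec("<b>(.*) \\(([0-9]+)\\)</b>", line))[[1]]

info <- m[2]              # the text before the brackets
n <- as.integer(m[3])     # the number inside the brackets
```

A label that itself contains something like "(3)" before the final count would still parse, because the greedy (.*) backtracks to the last bracketed number, but markup that wraps across lines would not.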
Here’s a step-by-step example. I’ve added HTML header tags to your example text and stored it here: http://ideone.com/O1PC05
1. Read in your data using readLines.
2. Isolate the “body” of the web page.
3. grepl returns TRUE or FALSE depending on whether “monthyear” was found in a given line. Use cumsum to create “groups”, and split to convert the character vector to a list.
4. You can do the following in multiple steps if you prefer. The basic idea is to lapply over your list, replace all your HTML tags with tabs, and replace your brackets with tabs. After that you can use read.delim, but expect to get a lot of columns that are full of NA values, since we’re inserting a lot more tabs than we need. This is most likely where you will fail, for several reasons: (1) it assumes that the source data is indeed well structured; (2) the text itself might contain brackets; (3) there might be other content in the body, including script tags, table tags, and so on, that will be read in and processed along with everything else.
5. I mentioned in step 4 that we will end up with a lot of junk columns. Let’s get rid of those.
6. And now, let’s name the columns in a more meaningful way. We know that the first column will be the “monthyear” variable by design, and the others should be “info” and “n”, so we can do some basic rep calls wrapped in paste to get our variable names. While we’re at it, we’ll use as.yearmon from the “zoo” package to convert our “monthyear” variable to actual dates, allowing us to sort and do other nifty things that we can do with actual dates.
7. If you really wanted your data in long form, use reshape.
8. Do some optional cleanup, like ordering the output by date, resetting your row names, and dropping incomplete cases.
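The steps above can be sketched end to end roughly as follows. This is a reconstruction under stated assumptions, not the original code: the sample from the question is inlined and written to a temporary file instead of being downloaded, the regular expression and the names src, body, groups, rows, wide and long are my own, and the sort key is built in base R from month.abb rather than with as.yearmon so that the sketch runs without extra packages.

```r
# Step 1: read the page. The question's sample is inlined and written to a
# temporary file so that readLines() has something to read.
html <- c(
  "<html>",
  "<head><title>Example</title></head>",
  "<body>",
  "<p><span class=\"monthyear\">Jan 2001</span>",
  "<br><b>Foo text (2)</b></p>",
  "<p><span class=\"monthyear\">Nov 2006</span>",
  "<br><b>Bar text (29)</b>",
  "<br><b>More bar text (4)</b>",
  "<br><b>Yet more bar text (102)</b></p>",
  "<p><span class=\"monthyear\">Apr 2004</span>",
  "<br><b>Further foo text (1)</b>",
  "<br><b>Combination foo and bar text (41)</b></p>",
  "</body>",
  "</html>"
)
src <- tempfile(fileext = ".html")
writeLines(html, src)
x <- readLines(src)

# Step 2: keep only the lines between <body> and </body>
body <- x[(grep("<body>", x, fixed = TRUE) + 1):(grep("</body>", x, fixed = TRUE) - 1)]

# Step 3: every "monthyear" line starts a new group
groups <- split(body, cumsum(grepl("monthyear", body, fixed = TRUE)))

# Step 4: turn tags and brackets into tabs, one tab-separated row per group
rows <- vapply(groups, function(g) {
  paste(gsub("<[^>]+>|[()]", "\t", g), collapse = "\t")
}, character(1))
wide <- read.delim(text = rows, header = FALSE, strip.white = TRUE,
                   na.strings = "", stringsAsFactors = FALSE)

# Step 5: drop the junk columns that contain nothing but NA
wide <- wide[, colSums(!is.na(wide)) > 0, drop = FALSE]

# Step 6: name the columns monthyear, info.1, n.1, info.2, n.2, ...
n_items <- (ncol(wide) - 1) / 2
names(wide) <- c("monthyear",
                 paste(rep(c("info", "n"), times = n_items),
                       rep(seq_len(n_items), each = 2), sep = "."))

# Step 7: wide to long (assumes each monthyear value appears only once)
long <- reshape(wide, direction = "long", idvar = "monthyear",
                varying = names(wide)[-1], sep = ".")

# Step 8: drop incomplete cases, sort by date, reset the row names.
# The Date is built from month.abb (always English), standing in for as.yearmon.
long <- long[complete.cases(long), ]
mon <- match(substr(long$monthyear, 1, 3), month.abb)
yr <- as.integer(substr(long$monthyear, 5, 8))
long$date <- as.Date(sprintf("%d-%02d-01", yr, mon))
long <- long[order(long$date, long$time), c("monthyear", "info", "n")]
rownames(long) <- NULL

print(long)
```

For the three sample paragraphs this should leave six rows, one per bold entry. With zoo installed, zoo::as.yearmon(long$monthyear, "%b %Y") would give the year-month class mentioned in step 6 instead of a first-of-month Date.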
Anyway, try it out, and modify as needed. My guess is that at some point, you’ll have to open up the files in a plain text editor and do some preliminary cleanup there before you can proceed.