The Archive Base

Editorial Team
Asked: June 17, 2026

I have some data in text form, taken from a webpage. It’s quite lengthy

I have some data in text form, taken from a webpage. It’s quite lengthy but follows the form:

<p><span class="monthyear">Jan 2001</span>
<br><b>Foo text (2)</b></p>
<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b>
<br><b>Yet more bar text (102)</b></p>
<p><span class="monthyear">Apr 2004</span>
<br><b>Further foo text (1)</b>
<br><b>Combination foo and bar text (41)</b></p>

I want to extract the relevant parts of this into a data frame, like so:

  monthyear          info  n
1  Jan 2001      Foo text  2
2  Nov 2006      Bar text 29
3  Nov 2006 More bar text  4

…but I’m not sure how to do it. If I have the html in a character vector called text I can extract the monthyear data using a function from the stringr package:

monthyear <- str_extract_all(
text[1],perl("(?<=\\\"monthyear\\\">).*?20[0-9]{2}")
)

and I could extract the info and n data in the same sort of way, but given that there are multiple info and n entries for each monthyear entry, I’m not sure how to combine them. Am I going about this all wrong?

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 9:35 pm

    Unfortunately, we can’t always control the quality of our data sources, so we have to resort to some tedious manual processing. (Some people say that the majority of a data analyst’s time is spent in cleaning data, and not in analysis.)

    As already noted in the comments, regular expressions aren’t the best tool for working with HTML, because HTML, in general, isn’t a regular language (arbitrarily nested tags need at least a context-free grammar). But if the HTML source is laid out consistently, as it is in the example data you’ve provided, you might still be able to use them effectively.
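
    As a minimal sketch of that point, assuming the markup is as regular as in the question, the lookbehind extraction the question attempts can also be written in base R. The names `html` and `pattern` are illustrative and do not come from the original post.

```r
# Two lines copied from the question's sample; `html` is an illustrative name.
html <- c('<p><span class="monthyear">Jan 2001</span>',
          '<br><b>Foo text (2)</b></p>')

# Lookbehind and lookahead require perl = TRUE in base R.
pattern <- '(?<=class="monthyear">)[^<]+(?=</span>)'

# gregexpr() finds the matches per line; regmatches() pulls them out.
hits <- regmatches(html, gregexpr(pattern, html, perl = TRUE))
monthyear <- unlist(hits)
monthyear
```

    This only works while every label sits on one line inside that exact span. Nested or reformatted markup would need a real HTML parser.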

    Here’s a step-by-step example. I’ve added HTML header tags to your example text and stored it here: http://ideone.com/O1PC05

    1. Read in your data using readLines

      x1 <- readLines("http://ideone.com/plain/O1PC05")
      
    2. Isolate the “body” of the web page

      bodycontent <- grep("<body>|</body>", x1)
      x2 <- x1[(bodycontent[1]+1):(bodycontent[2]-1)]
      
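    3. grepl returns TRUE or FALSE for each line, depending on whether “monthyear” was found in it. Use cumsum on that logical vector to create group numbers (the running count goes up by one at every “monthyear” line), and split to convert the character vector to a list with one element per group.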

      x3 <- split(x2, cumsum(grepl("monthyear", x2)))
      
    4. You can do the following in multiple steps if you prefer. The basic idea is to lapply over your list, replace all the HTML tags with tabs, and replace the parentheses with tabs. After that you can use read.delim, but expect to get a lot of columns that are full of NA values, since we’re inserting many more tabs than we need.

      This is the step most likely to fail, for several reasons: (1) it assumes that the source data is indeed well structured; (2) the text itself might contain parentheses; and (3) there might be other content in the body, including script tags, table tags, and so on, that will be read in and processed along with the data.

      x4 <- read.delim(header = FALSE,
                       stringsAsFactors = FALSE,
                       strip.white = TRUE, 
                       sep = "\t", 
                       text = 
                         unlist(lapply(x3, 
                                       function(x) {
                                         temp <- gsub("<(.|\n)*?>", "\t", x)
                                         paste(gsub("[()]", "\t", temp), 
                                               collapse="\t")
                                         })))
      
    5. I mentioned that in step 4, we will end up with a lot of junk columns. Let’s get rid of those.

      x5 <- x4[apply(x4, 2, function(x) !all(is.na(x)))]
      
    6. Now let’s name the columns in a more meaningful way. We know that the first column will be the “monthyear” variable by design, and the others should be alternating “info” and “n” columns, so we can use rep wrapped in paste to build the variable names. While we’re at it, we’ll use as.yearmon from the “zoo” package to convert the “monthyear” variable to actual dates, which lets us sort and do the other things that are possible with real dates.

      myseq <- ncol(x5[-1])/2 # We expect pairs of columns, right?
      names(x5) <- c("monthyear", 
                     paste(rep(c("info", "n"), myseq), 
                           rep(1:myseq, each = 2), sep = "."))
      library(zoo)
      x5$monthyear <- as.Date(as.yearmon(x5$monthyear, "%b %Y"))
      x5
      #    monthyear           info.1 n.1                       info.2 n.2            info.3 n.3
      # 1 2001-01-01         Foo text   2                               NA                    NA
      # 2 2006-11-01         Bar text  29                More bar text   4 Yet more bar text 102
      # 3 2004-04-01 Further foo text   1 Combination foo and bar text  41                    NA
      
    7. If you really wanted your data in long form, use reshape:

      x6 <- reshape(x5, 
                    direction = "long", 
                    idvar = "monthyear", 
                    varying = 2:ncol(x5))
      
    8. Do some optional cleanup, like ordering the output by date, resetting your row names, and dropping incomplete cases:

      x6 <- x6[order(x6$monthyear), ]
      rownames(x6) <- NULL
      x6[complete.cases(x6), ]
      #    monthyear time                         info   n
      # 1 2001-01-01    1                     Foo text   2
      # 4 2004-04-01    1             Further foo text   1
      # 5 2004-04-01    2 Combination foo and bar text  41
      # 7 2006-11-01    1                     Bar text  29
      # 8 2006-11-01    2                More bar text   4
      # 9 2006-11-01    3            Yet more bar text 102
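
    Steps 6 to 8 can be tried without the download by building a small wide table by hand. This is only a sketch: `wide` is a hypothetical stand-in for x5, holding two of the rows shown above and keeping the month labels as plain text rather than dates.

```r
# Hypothetical stand-in for x5, using the "info.k" / "n.k" naming from step 6.
wide <- data.frame(
  monthyear = c("Jan 2001", "Nov 2006"),
  info.1 = c("Foo text", "Bar text"),
  n.1 = c(2L, 29L),
  info.2 = c(NA, "More bar text"),
  n.2 = c(NA, 4L),
  stringsAsFactors = FALSE
)

# reshape() splits the varying column names on "." to work out that
# "info" and "n" are the variables and 1, 2 are the time points.
long <- reshape(wide,
                direction = "long",
                idvar = "monthyear",
                varying = 2:ncol(wide))

# Drop the padding rows that only exist because groups differ in length.
long <- long[complete.cases(long), ]
rownames(long) <- NULL
long
```

    The ".1", ".2" suffixes matter here: reshape relies on them to pair each info column with its n column.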
      

    Anyway, try it out, and modify as needed. My guess is that at some point, you’ll have to open up the files in a plain text editor and do some preliminary cleanup there before you can proceed.
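
    If the markup really is as regular as the sample, one alternative is to build the long table directly from the groups and skip the wide intermediate. This is a sketch under that assumption, not the approach used in the steps above: `parse_group` is a hypothetical helper, and the patterns assume every entry looks like `<b>info (n)</b>` on its own line.

```r
# Sample lines copied from the question.
text <- c('<p><span class="monthyear">Jan 2001</span>',
          '<br><b>Foo text (2)</b></p>',
          '<p><span class="monthyear">Nov 2006</span>',
          '<br><b>Bar text (29)</b>',
          '<br><b>More bar text (4)</b>',
          '<br><b>Yet more bar text (102)</b></p>',
          '<p><span class="monthyear">Apr 2004</span>',
          '<br><b>Further foo text (1)</b>',
          '<br><b>Combination foo and bar text (41)</b></p>')

# Same grouping idea as step 3: a new group starts at each "monthyear" line.
groups <- split(text, cumsum(grepl("monthyear", text, fixed = TRUE)))

# Hypothetical helper: turn one group of lines into a small data frame.
parse_group <- function(g) {
  # The first line of each group carries the month/year label.
  monthyear <- sub('.*class="monthyear">([^<]+)</span>.*', "\\1", g[1])
  # The remaining lines carry the "<b>info (n)</b>" entries.
  items <- grep("<b>", g, value = TRUE, fixed = TRUE)
  item_pattern <- ".*<b>(.*) \\(([0-9]+)\\)</b>.*"
  data.frame(
    monthyear = rep(monthyear, length(items)),
    info = sub(item_pattern, "\\1", items),
    n = as.integer(sub(item_pattern, "\\2", items)),
    stringsAsFactors = FALSE
  )
}

# Stack the per-group data frames into one long table.
result <- do.call(rbind, lapply(groups, parse_group))
rownames(result) <- NULL
result
```

    Because only the trailing "(digits)" is treated as the count, parentheses elsewhere in the info text are left alone. Lines that do not follow the pattern would still come through unparsed, so check the result before relying on it.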
