The Archive Base

Editorial Team
Asked: June 15, 2026


I have a corpus, x, in R created from a directory using DirSource. Each document is a text file containing the full HTML of a related vBulletin forum webpage. Since it is a thread, each document has multiple separate posts that I want to capture with my XPath. The XPath seems to work, but I cannot put all my captured nodes back into the corpus.

If my corpus has 25 documents with an average of 4 posts each, then my new corpus should have 100 documents. I'm wondering whether I have to write a loop and build a new corpus.

Here is my messy work so far. Any source from a thread in http://www.vbulletin.org/forum/ is an example of the structure.

#for stepping through
xt <- x[[5]]
xpath <- "//div[contains(@id,'post_message')]"

getxpath <- function(xt,xpath){
  require(XML)

  #either parse
  doc <- htmlParse(file=xt)
  #doc <- htmlTreeParse(tolower(xt), asText = TRUE, useInternalNodes = TRUE)

  #don't know which to use
  #result <- xpathApply(doc,xpath,xmlValue)
  result <- xpathSApply(doc,xpath,xmlValue)

  #clean up
  result <- gsub(pattern="\\s+",replacement=" ",x=gsub(pattern="\n|\t",replacement=" ",x=result))

  result <- c(result[1:length(result)])

  free(doc)

  #converts group of nodes into 1 data frame with numbers before separate posts
  #require(plyr)
  #xbythread <- ldply(.data=result,.fun=function(x){unlist(x)})

  #don't know what needs to be returned
  result <- Corpus(VectorSource(result))
  #result <- as.PlainTextDocument(result)

  return(result)
}

#call
x2 <- tm_map(x=x,FUN=getxpath,"//div[contains(@id,'post_message')]")

1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 2:47 am

    Figured it out a while ago. htmlParse needs isURL=TRUE.

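    getxpath <- function(xt,xpath){
      # XML provides htmlParse/xpathSApply; tm is needed later for Corpus()
      require(XML);require(tm)
      # parse the page at xt, telling the parser that xt is a URL
      doc <- htmlParse(file=xt,isURL=TRUE)
      # pull the text of every node that matches the XPath
      resultvector <- xpathSApply(doc,xpath,xmlValue)
      # replace newlines/tabs with spaces, then collapse runs of whitespace
      result <- gsub(pattern="\\s+",replacement=" ",x=gsub(pattern="\n|\t",replacement=" ",x=resultvector))
      return(result)
    }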
    
    res <- getxpath("http://url.com/board.html","//xpath")
    

    To get all the files, I use list.files to get the file list, Map/clusterMap with getxpath() to put them in a list, do.call to get them in a vector, and Corpus(VectorSource(res)) to put them in a Corpus.
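The steps described above could look roughly like the sketch below. It is not the exact code from the answer: the helper name `get_posts`, the temporary `threads` directory, and the two tiny sample pages are made up for illustration, and it uses plain `Map` rather than `clusterMap`. It also parses local files with `htmlParse`'s default URL detection, which is an assumption about how the pages were saved.

```r
library(XML)
library(tm)

# Hypothetical helper: return the cleaned text of every post in one saved page.
get_posts <- function(path, xpath) {
  doc <- htmlParse(file = path)
  on.exit(free(doc))                      # release the parsed document on exit
  posts <- xpathSApply(doc, xpath, xmlValue)
  if (length(posts) == 0) return(character(0))  # no matching nodes in this page
  unname(gsub("\\s+", " ", posts))        # collapse runs of whitespace
}

# Illustrative input only: two small pages written to a temporary directory.
thread_dir <- file.path(tempdir(), "threads")
dir.create(thread_dir, showWarnings = FALSE)
writeLines('<html><body><div id="post_message_1">First   post</div><div id="post_message_2">Second post</div></body></html>',
           file.path(thread_dir, "thread1.html"))
writeLines('<html><body><div id="post_message_3">Third post</div></body></html>',
           file.path(thread_dir, "thread2.html"))

xpath <- "//div[contains(@id,'post_message')]"

# 1. list.files: collect the saved pages
files <- list.files(thread_dir, pattern = "\\.html$", full.names = TRUE)

# 2. Map: one character vector of posts per file
per_file <- Map(get_posts, files, MoreArgs = list(xpath = xpath))

# 3. do.call: flatten the list into a single vector, one element per post
res <- do.call(c, unname(per_file))

# 4. Corpus: one document per post
posts_corpus <- Corpus(VectorSource(res))
```

Because every post becomes its own element of `res`, the resulting corpus has one document per post rather than one per thread, which is the 25-threads-to-100-documents shape the question asks for.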

© 2021 The Archive Base. All Rights Reserved