The Archive Base

Editorial Team
Asked: May 29, 2026

Context

When iterating over a set of Rdata files (each containing a character vector of HTML code) that are loaded, analyzed (via XML functionality) and then removed from memory again, I experience a significant increase in an R process’s memory consumption (eventually killing the process).

It seems that

  • freeing objects via free(),
  • removing them via rm() and
  • running gc()

have no effect, so memory consumption accumulates until there’s no more memory left.
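
The kind of loop described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original scripts: process_file() and mem_used_mb() are hypothetical helpers, the analysis step is a placeholder (nchar()) standing in for the real XML parsing, and the example files are generated on the fly in a temporary directory.

```r
# Hypothetical helper: memory currently used by R, in Mb, taken from the
# second column ("used (Mb)") of the summary matrix returned by gc().
mem_used_mb <- function() {
  sum(gc()[, 2])
}

# Hypothetical helper: load one .Rdata file into a private environment,
# analyze its contents, and return only a small result.
process_file <- function(path) {
  env <- new.env()
  load(path, envir = env)          # assumed to define a character vector `obj`
  result <- sum(nchar(env$obj))    # placeholder for the real XML analysis
  rm(list = ls(env), envir = env)  # drop the loaded objects again
  result
}

# Create a few small example files so the sketch is self-contained.
data_dir <- tempfile("rdata_")
dir.create(data_dir)
for (i in 1:3) {
  obj <- rep("<html><body><p>example</p></body></html>", 10)
  save(obj, file = file.path(data_dir, paste0(i, ".Rdata")))
}
rm(obj)

files   <- list.files(data_dir, pattern = "\\.Rdata$", full.names = TRUE)
results <- numeric(length(files))
mem_log <- numeric(length(files))

for (i in seq_along(files)) {
  results[i] <- process_file(files[i])
  mem_log[i] <- mem_used_mb()      # gc() runs here, once per iteration
}

print(results)
print(round(mem_log, 2))
```

Logging one value per iteration like this makes it easy to see whether memory stays flat or keeps climbing. Note that gc() only reports memory managed by R itself, so memory held by C-level structures may not show up in these numbers.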

EDIT 2012-02-13 23:30:00

Thanks to valuable insight shared by the author and maintainer of package XML, Duncan Temple Lang (again: I really appreciate it very much!), the problem seems to be closely related to the way external pointers are freed and how garbage collection is handled in the XML package. Duncan issued a bug-fixed version of the package (3.92-0) that consolidates certain aspects of parsing XML and HTML and features improved garbage collection, so it is no longer necessary to explicitly free the object containing the external pointer via free(). You can find the source code and a Windows binary at Duncan’s Omegahat website.


EDIT 2012-02-13 23:34:00

Unfortunately, the new package version still does not seem to fix the issues I’m encountering in the little example that I’ve put together. I followed some suggestions and simplified the example a bit, making it easier to grasp and to find the relevant functions where things seem to go wrong (check functions ./lib/exampleRun.R and ./lib/scrape.R).


EDIT 2012-02-14 15:00:00

Duncan suggested forcing the parsed document to be freed explicitly via .Call("RS_XML_forceFreeDoc", html). I’ve included a logical switch in the example (do.forcefree in script ./scripts/memory.R) that, if set to TRUE, will do just that. Unfortunately, this made my R console crash. It’d be great if someone could verify this on their machine! Actually, the doc should be freed automatically when using the latest version of XML (see above). The fact that it isn’t seems to be a bug (according to Duncan).


EDIT 2012-02-14 23:12:00

Duncan pushed yet another version of XML (3.92-1) to his Omegahat website. This should fix the issue in general. However, I seem to be out of luck with my example, as I still experience the same memory leak.


EDIT 2012-02-17 20:39:00 > SOLUTION!

YES! Duncan found and fixed the bug! It was a little typo in a Windows-only script, which explains why the bug didn’t show up on Linux, Mac OS etc. Check out the latest version, 3.92-2! Memory consumption is now as constant as can be when iteratively parsing and processing XML files!

Special thanks again to Duncan Temple Lang and thanks to everyone else that responded to this question!


>>> LEGACY PARTS OF THE ORIGINAL QUESTION <<<

Example Instructions (edited 2012-02-14 15:00:00)

  1. Download folder ‘memory’ from my Github repo.
  2. Open up the script ./scripts/memory.R and set a) your working directory at line 6, b) the example scope at line 16, as well as c) whether to force the freeing of the parsed doc or not at line 22. Note that you can still find the old scripts; they are “tagged” with a “LEGACY” at the end of the filename.
  3. Run the script.
  4. Investigate the latest file ./memory_<TIMESTAMP>.txt to see the increase in logged memory states over time. I’ve included two text files that resulted from my own test runs.

Things I’ve done with respect to memory control

  • making sure a loaded object is removed again via rm() at the end of each iteration.
  • When parsing XML files, I’ve set argument addFinalizer=TRUE, removed all R objects that have a reference to the parsed XML doc before freeing the C pointer via free() and removing the object containing the external pointer.
  • adding a gc() here and there.
  • trying to follow the advice in Duncan Temple Lang’s notes on memory management when using his XML package (I have to admit, though, that I did not fully comprehend what’s stated there)

EDIT 2012-02-13 23:42:00:
As I pointed out above, explicit calls to free() followed by rm() should not be necessary anymore, so I commented these calls out.

System Info

  • Windows XP 32 Bit, 4 GB RAM
  • Windows 7 32 Bit, 2 GB RAM
  • Windows 7 64 Bit, 4 GB RAM
  • R 2.14.1
  • XML 3.9-4
  • XML 3.92-0 as found at http://www.omegahat.org/RSXML/

Initial Findings as of 2012-02-09 01:00:00

  1. Running the webscraping scenario on several machines (see section “System Info” above) always busts the memory consumption of my R process after about 180 – 350 iterations (depending on OS and RAM).
  2. Running the plain rdata scenario yields constant memory consumption if and only if you set an explicit call to the garbage collector via gc() in each iteration; else you experience the same behavior as in the webscraping scenario.

Questions

  1. Any idea what’s causing the memory increase?
  2. Any ideas how to work around this?

Findings as of 2012-02-13 23:44:00

Running the example in ./scripts/memory.R on several machines (see section “System Info” above) still busts the memory consumption of my R process after about 180 – 350 iterations (depending on OS and RAM).

There’s still an evident increase in memory consumption and even though it may not appear to be that much when just looking at the numbers, my R processes always died at some point due to this.

Below, I’ve posted a couple of time series that resulted from running my example on a WinXP 32 Bit box with 2 GB RAM:

TS_1 (XML 3.9-4, 2012-02-09)

29.07
33.32
30.55
35.32
30.76
30.94
31.13
31.33
35.44
32.34
33.21
32.18
35.46
35.73
35.76
35.68
35.84
35.6
33.49
33.58
33.71
33.82
33.91
34.04
34.15
34.23
37.85
34.68
34.88
35.05
35.2
35.4
35.52
35.66
35.81
35.91
38.08
36.2

TS_2 (XML 3.9-4, 2012-02-09)

28.54
30.13
32.95
30.33
30.43
30.54
35.81
30.99
32.78
31.37
31.56
35.22
31.99
32.22
32.55
32.66
32.84
35.32
33.59
33.32
33.47
33.58
33.69
33.76
33.87
35.5
35.52
34.24
37.67
34.75
34.92
35.1
37.97
35.43
35.57
35.7
38.12
35.98

Error Message associated with TS_2

[...]
Scraping html page 30 of ~/data/rdata/132.rdata
Scraping html page 31 of ~/data/rdata/132.rdata
error : Memory allocation failed : growing buffer
error : Memory allocation failed : growing buffer
I/O error : write error
Scraping html page 32 of ~/data/rdata/132.rdata
Error in htmlTreeParse(file = obj[x.html], useInternalNodes = TRUE, addFinalizer = TRUE): 
 error in creating parser for (null)
> Synch18832464393836

TS_3 (XML 3.92-0, 2012-02-13)

20.1
24.14
24.47
22.03
25.21
25.54
23.15
23.5
26.71
24.6
27.39
24.93
28.06
25.64
28.74
26.36
29.3
27.07
30.01
27.77
28.13
31.13
28.84
31.79
29.54
32.4
30.25
33.07
30.96
33.76
31.66
34.4
32.37
35.1
33.07
35.77
38.23
34.16
34.51
34.87
35.22
35.58
35.93
40.54
40.9
41.33
41.6

Error Message associated with TS_3

[...]
---------- status: 31.33 % ----------

Scraping html page 1 of 50
Scraping html page 2 of 50
[...]
Scraping html page 36 of 50
Scraping html page 37 of 50
Error: 1: Memory allocation failed : growing buffer
2: Memory allocation failed : growing buffer
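
One way to quantify the upward drift in series like the ones above is a simple linear fit against the iteration index. The following is a minimal sketch using only the first ten values of TS_3 as listed above. The unit of the logged values is not stated here, so the slope is simply "logged units per iteration", and the variable names are made up for illustration.

```r
# First ten logged memory values of TS_3 (XML 3.92-0), copied from above.
ts3 <- c(20.1, 24.14, 24.47, 22.03, 25.21, 25.54, 23.15, 23.5, 26.71, 24.6)

iteration <- seq_along(ts3)
fit   <- lm(ts3 ~ iteration)            # ordinary least-squares trend line
slope <- unname(coef(fit)["iteration"])

# A clearly positive slope means memory keeps growing across iterations.
cat("Estimated growth per iteration:", round(slope, 3), "\n")
```

A slope near zero would indicate constant memory consumption, while a persistently positive slope points to a leak, even if individual values jump up and down.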

Edit 2012-02-17: please help me verify the counter value

You’d do me a huge favor if you could run the following code.
It won’t take more than 2 minutes of your time.
All you need to do is

  1. Download an Rdata file and save it as seed.Rdata.
  2. Download the script containing my scraping function and save it as scrape.R.
  3. Source the following code after setting the working directory accordingly.

Code:

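# Set the working directory to the folder holding seed.Rdata and scrape.R
setwd("set/path/to/your/wd")
# Install the patched XML package from the Omegahat repository
install.packages("XML", repos="http://www.omegahat.org/R")
library(XML)
source("scrape.R")
# Loads the character vector 'obj' of HTML code (file name as saved in step 1)
load("seed.Rdata")
html <- htmlParse(obj[1], asText = TRUE)
# Reference count of the parsed document before scraping
counter.1 <- .Call("R_getXMLRefCount", html)
print(counter.1)
z <- scrape(html)
gc()
gc()
# Reference count after scraping and garbage collection
counter.2 <- .Call("R_getXMLRefCount", html)
print(counter.2)
# Remove the document and collect garbage again
rm(html)
gc()
gc()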

I’m particularly interested in the values of counter.1 and counter.2, which should be 1 in both calls. In fact, they are on all machines that Duncan has tested this on. However, as it turns out, counter.2 has the value 259 on all of my machines (see details above), and that’s exactly what’s causing my problem.


1 Answer

  1. Editorial Team
    Added an answer on May 29, 2026 at 10:12 am

    From the XML package’s webpage, it seems that the author, Duncan Temple Lang, has quite extensively described certain memory management issues. See this page: “Memory Management in the XML Package”.

    Honestly, I’m not proficient in the details of what’s going on here with your code and the package, but I think you’ll either find the answer in that page, specifically in the section called “Problems”, or in direct communication with Duncan Temple Lang.


    Update 1. An idea that might work is to use the multicore and foreach packages, i.e. listResults = foreach(ix = 1:N) %dopar% {your processing; return(listElement)}. I think that for Windows you’ll need doSMP, or maybe doRedis; under Linux, I use doMC. In any case, by parallelizing the loading, you’ll get faster throughput. The reason I think you may also get some benefit in memory usage is that forking R could lead to different memory cleanup, as each spawned process gets killed when it completes. This isn’t guaranteed to work, but it could address both memory and speed issues.

    Note, though: doSMP has its own idiosyncrasies (i.e. you may still have some memory issues with it). There have been other Q&As on SO that mentioned some issues, but I’d still give it a shot.
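
The idea of pushing the work into separate processes can be sketched as below. Note that this swaps in R's bundled parallel package (a socket cluster, which also works on Windows) rather than the foreach/doSMP/doMC combination mentioned above. process_item() is a hypothetical placeholder for the real load-and-parse step, and the worker count of 2 is arbitrary.

```r
library(parallel)

# Hypothetical placeholder for the real per-file work (load, parse, extract).
process_item <- function(ix) {
  html <- sprintf("<html><body><p>page %d</p></body></html>", ix)
  nchar(html)   # return only a small summary, not the parsed document
}

N  <- 4
cl <- makeCluster(2)             # two separate R worker processes
listResults <- tryCatch(
  parLapply(cl, seq_len(N), process_item),
  finally = stopCluster(cl)      # workers exit here; the OS reclaims their memory
)

str(listResults)
```

Memory leaked inside a worker is only returned to the operating system when that worker stops, so for long runs it would probably make sense to process the files in batches and start a fresh cluster for each batch. Whether this actually sidesteps the leak is an assumption that would need testing.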

© 2021 The Archive Base. All Rights Reserved