I’m writing a 32-bit .NET program with a two-stage input process:
- It uses native C++ via C++/CLI to parse an indefinite number of files into corresponding SQLite databases (all with the same schema). The allocations by C++ ‘new’ will typically consume up to 1GB of the virtual address space (out of 2GB available; I’m aware of the 3GB extension but that’ll just delay the issue).
- It uses complex SQL queries (run from C#) to merge the databases into a single database. I set the cache_size to 1GB for the merged database so that the merging part has minimal page faults.
My problem is that the cache in stage 2 does not re-use the 1GB of memory allocated by ‘new’ and properly released by ‘delete’ in stage 1. I know there’s no leak because immediately after leaving stage 1, ‘private bytes’ drops down to a low amount like I’d expect. ‘Virtual size’ however remains at about the peak of what the C++ used.
This non-sharing between the C++ and SQLite cache causes me to run out of virtual address space. How can I resolve this, preferably in a fairly standards-compliant way? I really would like to release the memory allocated by C++ back to the OS.
This is not something you can control effectively from the C++ level of abstraction: you cannot know for sure whether memory that your program releases to the C++ runtime will be returned to the OS. Special allocation policies and non-standard extensions probably won’t solve the issue either, because you cannot control how the external libraries you use deal with memory (e.g. if they have cached data).
A possible solution is to move the C++ part into an external process that terminates once the SQLite databases have been created. When that process exits, the OS reclaims its entire address space, whatever the C++ runtime did with the freed memory. An external process introduces some annoyance (e.g. it’s a bit harder to keep “live” control over what happens), but it also opens up more possibilities, such as parallel processing even when the libraries don’t support multithreading, or spreading the work across multiple machines over a network.
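As a rough illustration of the external-process idea, here is a minimal sketch in standard C++. It assumes the stage 1 parser has been built as a separate executable that takes an input file and an output database path as its two arguments; that executable and the names `ParseJob`, `build_parser_command`, `run_parser_job` and `run_all_jobs` are hypothetical, not part of the original program. It also assumes paths contain no double-quote characters.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// One unit of stage 1 work: parse input_file into the SQLite file output_db.
struct ParseJob {
    std::string input_file;
    std::string output_db;
};

// Wrap an argument in double quotes so paths with spaces survive the shell.
// Assumption: the argument itself contains no double-quote characters.
std::string quote_arg(const std::string& arg) {
    return "\"" + arg + "\"";
}

// Build the command line for one job.
// parser_exe is a hypothetical standalone build of the stage 1 parser.
std::string build_parser_command(const std::string& parser_exe,
                                 const ParseJob& job) {
    return quote_arg(parser_exe) + " " + quote_arg(job.input_file) + " " +
           quote_arg(job.output_db);
}

// Run one job in a child process and block until it exits.
// The value returned by std::system is implementation-defined;
// this sketch treats 0 as success, which is the usual convention.
int run_parser_job(const std::string& parser_exe, const ParseJob& job) {
    return std::system(build_parser_command(parser_exe, job).c_str());
}

// Run every job, one child process per input file, and count the failures.
// All memory the parser used is returned to the OS each time a child exits.
std::size_t run_all_jobs(const std::string& parser_exe,
                         const std::vector<ParseJob>& jobs) {
    std::size_t failures = 0;
    for (const ParseJob& job : jobs) {
        if (run_parser_job(parser_exe, job) != 0) {
            ++failures;
        }
    }
    return failures;
}
```

`std::system` goes through the platform's command interpreter, whose quoting rules differ between systems, so treat the quoting here as illustrative only. Since the host in the question is a .NET program, launching the child from the C# side with `System.Diagnostics.Process` (start it, call `WaitForExit`, check `ExitCode`) would probably be the more natural fit; the structure stays the same.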