I am developing a C# application which needs to process approximately 4,000,000 English sentences. All these sentences are stored in a tree, where each node is a class with these fields:
class TreeNode
{
    protected string word;
    protected Dictionary<string, TreeNode> children;
}
My problem is that the application is using up all the RAM (I have 2 GB RAM) when it reaches the 2,000,000th sentence. So it only manages to process half the sentences and then it slows down drastically.
What can I do to try and reduce the memory footprint of the application?
EDIT: Let me explain my application a bit more. I have approximately 300,000 English sentences, and from each sentence I generate further sub-sentences like this:
Example:
Sentence: Football is a very popular sport
Sub Sentences I need:
- Football is a very popular sport
- is a very popular sport
- a very popular sport
- very popular sport
- popular sport
- sport
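The list above amounts to taking every word-level suffix of the sentence. A minimal sketch of that step, assuming words are separated by single spaces; the names SubSentences and Suffixes are hypothetical and not taken from the application:

```csharp
using System;
using System.Collections.Generic;

static class SubSentences
{
    // Yields every word-level suffix of the sentence, longest first.
    // Assumption: words are separated by spaces; punctuation is not handled.
    public static IEnumerable<string[]> Suffixes(string sentence)
    {
        string[] words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int start = 0; start < words.Length; start++)
        {
            // Copy words[start..] into a fresh array for this sub-sentence.
            string[] suffix = new string[words.Length - start];
            Array.Copy(words, start, suffix, 0, suffix.Length);
            yield return suffix;
        }
    }
}
```

Calling SubSentences.Suffixes("Football is a very popular sport") yields six word arrays, one per line of the list above.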
Each sentence is stored in a tree word by word. So considering the example above, I have a TreeNode with the word field = “Football”, and its children list holds the TreeNode for the word “is”. The child of the “is” node is the “a” node. The child of the “a” node is the “very” node. I need to store the sentences word by word because I need to be able to search for all the sentences starting with a given prefix, for example “Football is”.
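To make the structure concrete, here is a rough sketch of inserting a sentence word by word and looking up a prefix. It is only an illustration: the class WordTrieNode and its Insert and Find methods are hypothetical names, and it uses public members rather than the protected fields of the real TreeNode.

```csharp
using System.Collections.Generic;

class WordTrieNode
{
    public readonly string Word;
    public readonly Dictionary<string, WordTrieNode> Children =
        new Dictionary<string, WordTrieNode>();

    public WordTrieNode(string word) { Word = word; }

    // Inserts words[start..] below this node, reusing nodes that already exist.
    public void Insert(string[] words, int start)
    {
        WordTrieNode node = this;
        for (int i = start; i < words.Length; i++)
        {
            WordTrieNode child;
            if (!node.Children.TryGetValue(words[i], out child))
            {
                child = new WordTrieNode(words[i]);
                node.Children.Add(words[i], child);
            }
            node = child;
        }
    }

    // Returns the node reached by following the prefix, or null if no stored
    // sentence starts with it. Everything below that node is a match.
    public WordTrieNode Find(string[] prefix)
    {
        WordTrieNode node = this;
        foreach (string w in prefix)
        {
            if (!node.Children.TryGetValue(w, out node)) return null;
        }
        return node;
    }
}
```

With a root node created as new WordTrieNode(null), calling Insert(words, start) for every start index stores all the sub-sentences of one sentence, and Find(new[] { "Football", "is" }) returns the “is” node whose subtree holds every sentence starting with “Football is”.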
So basically, for each word in a sentence I am creating a new sub-sentence, and this is the reason I ultimately end up with 4,000,000 different sentences. Storing the data in a database is not an option, since the app needs to work on the whole structure at once, and constantly writing all the data to a database would slow the process down even further.
Thanks
What are you using as the key? Where are you getting the data from? If these are words (not full sentences), I’m wondering if you have a lot of duplicated keys (different string instances with the same fundamental value), in which case you might benefit from implementing a local interner to re-use the values (and let the transient copies get garbage collected). Instantiate the interner when building the tree, and pass a value through it whenever you think that value is likely to be duplicated.
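The interner code itself is not shown above, so the following is only a minimal sketch of what a local interner could look like, not the original implementation. The class name LocalStringInterner and the method Intern are assumptions, and this version is not thread-safe.

```csharp
using System.Collections.Generic;

class LocalStringInterner
{
    // Maps each distinct string value to the single instance that is kept alive.
    private readonly Dictionary<string, string> pool = new Dictionary<string, string>();

    // Returns the pooled instance equal to value, adding value to the pool
    // the first time it is seen.
    public string Intern(string value)
    {
        if (value == null) return null;
        string existing;
        if (pool.TryGetValue(value, out existing)) return existing;
        pool.Add(value, value);
        return value;
    }
}
```

While building the tree, a word would be passed through it before being stored, for example word = interner.Intern(word);, so that every node holding “sport” references the same string object and the duplicate copies become eligible for garbage collection. Unlike string.Intern, the pool is released as soon as the interner instance itself is no longer referenced.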