I am trying to run a program that analyzes a bunch of text files containing numbers. The total size of the text files is ~12 MB; I take 1,000 doubles from each of 360 text files and put them into a vector. My problem is that I get about halfway through the list of text files and then my computer slows down until it isn’t processing any more files. The program is not stuck in an infinite loop, but I think it is using too much memory. Is there a better way to store this data that won’t use as much memory?
Other possibly relevant system information:
Running Linux
8 GB memory
CERN ROOT framework installed (I don’t know how to reduce my memory footprint with it, though)
Intel Xeon Quad Core Processor
If you need other information, I will update this list
EDIT: I ran top, and my program’s memory usage keeps climbing; once it got above 80% I killed it. There’s a lot of code, so I’ll pick out the bits where memory is being allocated and share those.
EDIT 2: My code:
void FileAnalysis::doWork(std::string opath, std::string oName)
{
    //sets the output filepath and the name of the file to contain the results
    outpath = opath;
    outname = oName;
    //Reads the data source and writes it to a text file before pushing the filenames into a vector
    setInput();
    //Goes through the files queue and analyzes each file
    while(!files.empty())
    {
        //Puts all of the data points from the next file onto the points vector then deletes the file from the files queue
        readNext();
        //Places all of the min or max points into their respective vectors
        analyze();
        //Calculates the averages and the offset and pushes those into their respective vectors
        calcAvg();
    }
    makeGraph();
}
//Creates the vector of files to be read
void FileAnalysis::setInput()
{
    string sysCall = "", filepath="", temp;
    filepath = outpath+"filenames.txt";
    sysCall = "ls "+dataFolder+" > "+filepath;
    system(sysCall.c_str());
    ifstream allfiles(filepath.c_str());
    while (!allfiles.eof())
    {
        getline(allfiles, temp);
        files.push(temp);
    }
}
//Places the data from the next filename into the files vector, then deletes the filename from the vector
void FileAnalysis::readNext()
{
    cout<<"Reading from "<<dataFolder<<files.front()<<endl;
    ifstream curfile((dataFolder+files.front()).c_str());
    string temp, temptodouble;
    double tempval;
    getline(curfile, temp);
    while (!curfile.eof())
    {
        if (temp.size()>0)
        {
            unsigned long pos = temp.find_first_of("\t");
            temptodouble = temp.substr(pos, pos);
            tempval = atof(temptodouble.c_str());
            points.push_back(tempval);
        }
        getline(curfile, temp);
    }
    setTime();
    files.pop();
}
//Sets the maxpoints and minpoints vectors from the points vector and adds the vectors to the allmax and allmin vectors
void FileAnalysis::analyze()
{
    for (unsigned int i = 1; i<points.size()-1; i++)
    {
        if (points[i]>points[i-1]&&points[i]>points[i+1])
        {
            maxpoints.push_back(points[i]);
        }
        if (points[i]<points[i-1]&&points[i]<points[i+1])
        {
            minpoints.push_back(points[i]);
        }
    }
    allmax.push_back(maxpoints);
    allmin.push_back(minpoints);
}
//Calculates the average max and min points from the maxpoints and minpoints vector and adds those averages to the avgmax and avgmin vectors, and adds the offset to the offset vector
void FileAnalysis::calcAvg()
{
    double maxtotal = 0, mintotal = 0;
    for (unsigned int i = 0; i<maxpoints.size(); i++)
    {
        maxtotal+=maxpoints[i];
    }
    for (unsigned int i = 0; i<minpoints.size(); i++)
    {
        mintotal+=minpoints[i];
    }
    avgmax.push_back(maxtotal/maxpoints.size());
    avgmin.push_back(mintotal/minpoints.size());
    offset.push_back((maxtotal+mintotal)/2);
}
EDIT 3: I added in the code to reserve vector space and I added code to close the files, but my memory still gets filled to 96% before the program stops…
This could be optimized endlessly, but my immediate reaction would be to use a container other than vector. Remember that storage for a vector is allocated contiguously in memory, which means adding elements causes a reallocation (and a copy) of the entire vector if there isn’t enough spare capacity to hold the new elements.
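The reallocation behaviour described above can be observed directly. The following is a minimal sketch, not taken from the posted program: countReallocations is a hypothetical helper that appends n doubles to a std::vector and counts how often capacity() changes, optionally calling reserve() first. The exact growth policy is implementation-defined, so the count without reserve() will vary between standard libraries.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the posted code): counts how many times a
// std::vector<double> changes capacity, i.e. reallocates, while n values are
// appended. If reserveFirst is true, room for all n elements is requested
// up front with reserve().
std::size_t countReallocations(std::size_t n, bool reserveFirst)
{
    std::vector<double> v;
    if (reserveFirst)
        v.reserve(n); // capacity is expressed in elements, not bytes

    std::size_t reallocations = 0;
    std::size_t lastCapacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i)
    {
        v.push_back(static_cast<double>(i));
        if (v.capacity() != lastCapacity) // the buffer was reallocated
        {
            ++reallocations;
            lastCapacity = v.capacity();
        }
    }
    return reallocations;
}
```

Calling countReallocations(360000, true) should report no reallocations, while the same call with false reports however many times the library had to grow the buffer. For scale, and assuming 8-byte doubles, 360 files of 1,000 values each is 360,000 elements, or roughly 2.9 MB.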
Try a container optimized for constant insertions, such as a queue or list.
Alternatively, if vector is required, you could try allocating the expected memory footprint up-front to avoid continuous reallocation. See vector.reserve(). Note that the reserved capacity is in terms of elements, not bytes.
---------- EDIT FOLLOWING CODE POST ----------
My immediate concern would be the logic in analyze(). Specifically, my concern is the allmax and allmin containers, onto which you are pushing copies of the maxpoints and minpoints containers. The maxpoints and minpoints containers themselves can grow quite large with this logic, depending on the datasets.
You’re incurring the cost of container copies several times. Is it really necessary to copy the minpoints/maxpoints containers into allmax/allmin? Without knowing a bit more, it’s hard to optimize your storage design.
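If the per-file containers do need to end up inside allmax/allmin, one way to avoid the copy is to move them instead. This is only a sketch under two assumptions: a C++11 (or later) compiler is available, and the per-file vector is no longer needed once it has been stored. The name storeAndReset is hypothetical and does not appear in the posted code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: appends perFile to all without copying its elements
// (C++11 move semantics), then leaves perFile empty for the next file.
void storeAndReset(std::vector<std::vector<double> >& all,
                   std::vector<double>& perFile)
{
    all.push_back(std::move(perFile)); // hands over the buffer instead of copying it
    perFile.clear();                   // a moved-from vector is valid but unspecified; make it empty
}
```

Note that in the posted code calcAvg() reads maxpoints and minpoints after analyze() has run, so a call like this would have to happen after calcAvg(), not inside analyze().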
I don’t see anywhere that minpoints and maxpoints are actually emptied, which means that over time they can grow very large, and their corresponding copies to the allmin/allmax containers will grow very large. Are minpoints/maxpoints supposed to represent the min/max points for just one file?
As an example, let’s look at a simplified minpoints and allmin scenario (but keep in mind that this applies to max just as well, and both are on a larger scale than shown here). This is, obviously, a dataset engineered to show my point:
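A minimal sketch of such a scenario, with made-up numbers rather than the poster's data: totalStoredInAllmin is a hypothetical function that mimics the minpoints/allmin pattern, adding a fixed number of minima per file and pushing a copy of minpoints onto allmin each time. The clearBetweenFiles flag is an assumption about the intended behaviour, namely that minpoints should describe one file only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical simulation of the minpoints/allmin pattern. Each "file" adds
// minimaPerFile values to minpoints, and a copy of minpoints is then pushed
// onto allmin. Returns the total number of doubles held in allmin at the end.
std::size_t totalStoredInAllmin(std::size_t fileCount,
                                std::size_t minimaPerFile,
                                bool clearBetweenFiles)
{
    std::vector<double> minpoints;
    std::vector<std::vector<double> > allmin;

    for (std::size_t f = 0; f < fileCount; ++f)
    {
        if (clearBetweenFiles)
            minpoints.clear(); // assumption: minpoints is meant to be per-file

        for (std::size_t i = 0; i < minimaPerFile; ++i)
            minpoints.push_back(static_cast<double>(f)); // placeholder values

        allmin.push_back(minpoints); // copies everything currently in minpoints
    }

    std::size_t total = 0;
    for (std::size_t k = 0; k < allmin.size(); ++k)
        total += allmin[k].size();
    return total;
}
```

With 3 files of 2 minima each and no clearing, allmin ends up holding 2 + 4 + 6 = 12 values instead of 6. With 360 files of 100 minima each, the same pattern stores 6,498,000 values instead of 36,000, and the gap keeps widening as the file count grows.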
There are other optimizations and critiques to be made, but for now I’m limiting this to trying to solve your immediate problem. Can you post the makeGraph() function, as well as the definitions of all containers involved (points, minpoints, maxpoints, allmin, allmax)?