Segmentation fault occurs when datact=10736 when trying to insert into an unordered_map (see commented line for which call throws the fault). See below for attempted fixes.
When SIGSEGV is thrown it points me to line 764 of hashtable_policy.h
INPUT: data file with column1=counts, column2=16-character strings
PURPOSE: cluster 16-character sequences by adding all the counts of 1-substitution different sequences together. First sequence seen is “origin” by which all its 1-substitution friends are identified.
PSEUDOCODE: for each line in file:
-
read count, read sequence.
-
if sequence key_value exists in hash ‘location’ (type
unordered_map), add current count; -
otherwise make a new key_value, have it point to the count here, and
assign all 1-substitution sequences to also point to this count.
Code:
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <cmath>
#include <vector>
#include <unordered_map>
#include <map>
#include "matrix.h"
using namespace std;
int nuc2num(char currchar)
{ // returns 0,1,2,3 for A,C,T,G respectively
int outnum;
if (currchar=='A')
{
outnum=0;
}
else if (currchar=='C')
{
outnum=1;
}
else if (currchar=='T')
{
outnum=2;
}
else
{
outnum=3;
}
return outnum;
}
int main(int argc, char* argv[])
{
//command line arguments
// arg1=filename, arg2=barcode sequence length, arg3=#mismatches permitted
//input handling
// file format: column 1 | column 2
// counts | sequence [int | string]
string filename;
string funnelstring;
// define lookup matrix; rows=ACTG, cols = ACTG without row element
Matrix <char> sub_lookup(4,3);
sub_lookup[0][0] = 'C';
sub_lookup[0][1] = 'T';
sub_lookup[0][2] = 'G';
sub_lookup[1][0] = 'A';
sub_lookup[1][1] = 'T';
sub_lookup[1][2] = 'G';
sub_lookup[2][0] = 'A';
sub_lookup[2][1] = 'C';
sub_lookup[2][2] = 'G';
sub_lookup[3][0] = 'A';
sub_lookup[3][1] = 'C';
sub_lookup[3][2] = 'T';
int L,k;
int j=0;
const int buffersize=10000;
int currentsize=buffersize;
int datact=0;
int currchar;
vector <unsigned int> ctarr(buffersize);
vector <string> seqarr(buffersize);
filename=argv[1];
L=atoi(argv[2]);
k=atoi(argv[3]);
unsigned int sct;
int substrlen;
string sequence,textct;
ifstream seqfile (filename.c_str());
//map <string,unsigned int*> location;
unordered_map <string,unsigned int*> location;
if (seqfile.is_open())
{
getline(seqfile,textct,'\n');
while (textct != "")
{
sct=atoi(textct.c_str());
substrlen=textct.length();
//cout << textct << endl;
sequence=textct.substr(substrlen-L,L);
//cout << sequence << endl;
//is there an associated guy?
if (location.find(sequence) != location.end()) //just asks whether this key has been assigned
{ //there's a value in the region
*location[sequence]+=sct;
}
else
{ //no value in region, make a footprint
ctarr[datact]=sct;
seqarr[datact]=sequence;
location[sequence]=&ctarr[datact]; //assign current key to point to data count
//assign k substitution "funnel" region to point to this count as well
for (j=0; j<L; j++)
{
funnelstring=sequence;
currchar = nuc2num(sequence[j]);
if (datact==10736 && j==13)
{
cout << "here" << endl;
cout << sequence << endl;
}
for (k=0; k<3; k++)
{
funnelstring[j]=sub_lookup[currchar][k];
// if (datact==10736 && j==13)
// {
// cout << funnelstring << endl;
// cout << location.max_size() << " | " << location.size() << endl;
// string asdf;
// asdf="AAAAAAAAAAAAAAAA";
// location[asdf]=&ctarr[datact]; //still segfaults
// }
if (location.find(funnelstring) == location.end()) // asks whether this key has been assigned
{ //this region is not assigned to another funnel
location[funnelstring]=&ctarr[datact]; //LINE THAT CAUSES SIGSEGV
}
}
}
datact++;
cout << datact << endl;
if (datact>=currentsize)
{
ctarr.resize(currentsize+buffersize);
seqarr.resize(currentsize+buffersize);
currentsize+=buffersize;
}
}
getline(seqfile,textct,'\n');
}
seqfile.close();
}
Explorations.
- It doesn’t matter what key is added, when
datact==10736andj=13, any
key added to (unordered_map) location results in a SIGSEGV. - The line of code in question (marked by a comment above) is called many times prior, and operates correctly.
- Exchanging unordered_map for map results in the same event, but at a different (higher) count of datact. Still low (between
datact==16-35
thousand). - Rewriting this code almost exactly, but with randomly generated 16-character sequences works perfectly, as far as I can tell (no
segfaults for datact up to 200,000, didnt test higher). - In the code in (4), it does appear that there is a rehash right around 10000, which could be related or coincidence.
If need be, I can post the datafile that is read.
EDIT: Solved
Instead of unordered_map <string,unsigned int*> location, replaced with unordered_map <string,unsigned int> location (value_type is int instead of int*). Now value_type holds the index in the ctarr[]. Runs fine. Thanks!
Pointers to
vectorelements can be invalidated when you callvector::resize(). This is because the entire data may have to be moved around in order to find a contiguous memory block that fits the new size. In other words: As soon as you callresize, all yourlocationdata suddenly becomes useless garbage.Possible solutions:
locationstore the index of the required element inctarras its value rather than a pointer. (This will certainly not change the semantics of your program.)locationstore the actualunsigned intvalue instead of a pointer. Depending on your program logic and how you change and access this data, this may not be what you want.Also note that although the segfault occurs in
hashtable_policy.h, this bug has nothing to do with the implementation ofunordered_map(orvector) – it is entirely your fault for not reading the reference forvector::resize()😉 : http://www.cplusplus.com/reference/vector/vector/resize/ (section ‘Iterator validity’)The other thing I noticed about your code is that you use
operator[]for accessing yourvectorelements. This disables out-of-bounds checking. If I came across an error like yours in my code (hard to trace back because it occurs somewhere far from my erroneous code), my first course of action would be swappingoperator[]forvector::at()(actually, I always start withat()and only switch if I can prove beyond reasonable doubt that bounds checking is a performance bottleneck for this particular purpose). This would not have helped with your problem, but is often an invaluable help in spotting mistakes.