I have some C code embedded inside an R function which keeps segfaulting in the same way, but at different points in the program's run (it always seems to come from the same function).
Here's the thing: the error I get is
*** glibc detected *** /packages/R/2.15.0/lib64/R/bin/exec/R: munmap_chunk():
invalid pointer: 0x0000000014059b20 ***
Now this is a pretty standard error (munmap_chunk() is part of free(), if I recall). The weird thing is that the error comes from a function which frees a set of arrays held in a struct, and the program allocates and frees millions of these structs over the course of a normal run.
The function looks like this;
multifit_work_t *free_multifit(multifit_work_t *work)
{
    if (work == NULL || work->u==NULL || work->w==NULL || work->v==NULL || work->b==NULL || work->rv1==NULL) {
        fprintf(stderr,"ERROR: Internal array in multifit_work_t object was already NULL\n");
        exit(1);
    }

    // each of the work->* arrays is just an array of doubles of length 1 or more.
    // LOGGING FUNCTIONALITY: Here, prints out the address and values of each
    // of the arrays

    // free each array first
    free(work->u);
    free(work->w);
    free(work->v);
    free(work->b);
    free(work->rv1);
    free(work);

    // LOGGING FUNCTIONALITY: Here, prints an "Exiting free_multifit()" message
    return NULL;
}
So I'm checking each pointer against NULL before I free it. I added logging to output the address and first value of each of these arrays. Grepping the logfile of the crash that generated the above error for the offending pointer, I get a lot of hits (understandably, since the same memory location gets reused after it's freed):
$: grep 14059b20 logfile.txt
....
194624) work->b: ADDRESS: [0x14059b20] VALUE: [-5.620804e-02]
194629) work->b: ADDRESS: [0x14059b20] VALUE: [2.759472e+00]
194634) work->b: ADDRESS: [0x14059b20] VALUE: [5.498979e-02]
194684) work->b: ADDRESS: [0x14059b20] VALUE: [9.323869e+07]
194689) work->b: ADDRESS: [0x14059b20] VALUE: [3.016410e+07]
194694) work->b: ADDRESS: [0x14059b20] VALUE: [1.688376e-08]
194699) work->b: ADDRESS: [0x14059b20] VALUE: [1.660441e+00]
.....
Operation 194699 is in the last set of values I get before the segfault;
Calling free_multifit...
194696) work->u: ADDRESS: [0x1305f7d0] VALUE: [1.350474e+01]
194697) work->w: ADDRESS: [0x92ec810] VALUE: [1.350474e+01]
194698) work->v: ADDRESS: [0x122cc210] VALUE: [5.798884e-09]
194699) work->b: ADDRESS: [0x14059b20] VALUE: [1.660441e+00]
194700) work->rv1: ADDRESS: [0xea37a50] VALUE: [0.000000e+00]
< If it didn't crash in the function we'd see an "Exiting free_multifit()" message here, so it segfaults while freeing one of the arrays or the work object itself.
[EOF]
So, despite checking that the pointer is non-NULL, and actually reading a value from its location (1.66), it all goes wrong when I try to free it.
Any ideas why/how this could happen? Is this a hardware issue? I’m running it on a cluster, if that makes any difference.
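A NULL check only proves the pointer is non-NULL; it says nothing about whether something wrote past the ends of a block and damaged what sits next to it. One way to narrow this down is to wrap each array in guard words and verify them just before freeing. The following is only a minimal sketch of that idea: `guarded_alloc`, `guarded_check`, `guarded_free`, and `GUARD_WORD` are hypothetical names that do not appear in the code above, and the size arithmetic is not checked for overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical debugging helpers, not part of the original program. */
#define GUARD_WORD 0xDEADBEEFCAFEF00DULL

/* Stored directly in front of the caller's data. Two 64-bit fields give
   a 16-byte header, so the doubles that follow stay 8-byte aligned. */
typedef struct {
    uint64_t count;  /* number of doubles the caller asked for */
    uint64_t guard;  /* front canary */
} guard_header_t;

/* Allocate count doubles with a canary before and after the data. */
double *guarded_alloc(size_t count)
{
    size_t bytes = sizeof(guard_header_t) + count * sizeof(double) + sizeof(uint64_t);
    unsigned char *raw = malloc(bytes);
    if (raw == NULL)
        return NULL;

    guard_header_t hdr = { (uint64_t)count, GUARD_WORD };
    uint64_t tail = GUARD_WORD;
    memcpy(raw, &hdr, sizeof hdr);
    memcpy(raw + sizeof hdr + count * sizeof(double), &tail, sizeof tail);
    return (double *)(raw + sizeof hdr);
}

/* Returns 0 if both canaries are intact, 1 if the front canary was
   overwritten (underrun), 2 if the back canary was overwritten (overrun). */
int guarded_check(const double *data)
{
    assert(data != NULL);
    const unsigned char *raw = (const unsigned char *)data - sizeof(guard_header_t);
    guard_header_t hdr;
    uint64_t tail;

    memcpy(&hdr, raw, sizeof hdr);
    if (hdr.guard != GUARD_WORD)
        return 1;  /* count may be damaged too, so do not trust it */
    memcpy(&tail, raw + sizeof hdr + hdr.count * sizeof(double), sizeof tail);
    return tail == GUARD_WORD ? 0 : 2;
}

/* Check the canaries, report damage by array name, then release the block.
   Returns the same code as guarded_check(). */
int guarded_free(double *data, const char *name)
{
    int status = guarded_check(data);
    if (status != 0)
        fprintf(stderr, "guard damaged around %s (code %d)\n", name, status);
    free((unsigned char *)data - sizeof(guard_header_t));
    return status;
}
```

With something like this, free_multifit() could call guarded_free(work->b, "work->b") and so on, and the log would name the array whose neighbourhood was overwritten instead of crashing inside free(). It only catches writes that land directly next to a block, so it complements a memory checker rather than replacing one.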
UPDATE 1
multifit_work_t is created through the following;
typedef struct {
    int m,n;
    double *w,*u,*v,*b,*rv1;
} multifit_work_t;

multifit_work_t *alloc_multifit(int m, int n)
{
    multifit_work_t *work=(multifit_work_t *)malloc(sizeof(multifit_work_t));
    if (work==NULL) {
        fprintf(stderr,"failed to allocate multifit_work\n");
        exit(1);
    }
    work->m=m;
    work->n=n;
    work->u=(double *)malloc(n*m*sizeof(double)); /* temporary storage - n x m matrix */
    work->w=(double *)malloc(n*sizeof(double)); /* n vector */
    work->v=(double *)malloc(n*n*sizeof(double)); /* n x n matrix */
    work->b=(double *)malloc(m*sizeof(double)); /* m vector */
    work->rv1=(double *)malloc(n*sizeof(double)); /* temporary storage - n vector */
    if (work->u==NULL || work->w==NULL || work->v==NULL || work->b==NULL || work->rv1==NULL) {
        fprintf(stderr,"failed to allocate multifit_work\n");
        exit(1);
    }
    return work;
}
UPDATE 2
When I run it on my local system the same thing happens, but the error is along the lines of;
*** caught segfault ***
address 0x11e000000, cause 'memory not mapped'
Always at a noticeably round memory address (one ending in a long run of zeros).
UPDATE 3
Below is the valgrind report
valgrind --leak-check=full --show-reachable=yes ./execute
==23072== Memcheck, a memory error detector
==23072== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==23072== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==23072== Command: ./execute
==23072==
==23072==
==23072== HEAP SUMMARY:
==23072== in use at exit: 0 bytes in 0 blocks
==23072== total heap usage: 445 allocs, 445 frees, 27,900 bytes allocated
==23072==
==23072== All heap blocks were freed -- no leaks are possible
==23072==
==23072== For counts of detected and suppressed errors, rerun with: -v
==23072== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 23 from 8)
This is killing me!
Somewhere deep, deep in the code, an input array was filling only a small portion of a local array's (m)allocated size. The code then read from the uninitialized regions of that array, and the values there looked a lot like the expected input (which is why it took me so long to identify).
In summary: overstepping arrays is bad, but watch out for understepping (when initializing) too!
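One way to make this kind of understepping visible early is to poison a freshly allocated array with NaN, so any slot the input never filled stands out instead of looking like plausible data. This is a sketch under assumptions, not the fix that was actually applied: `alloc_poisoned`, `fill_from_input`, and `first_unset` are hypothetical names, and it assumes NaN never occurs as a legitimate input value.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate count doubles and set every slot to NaN,
   so slots that are never written are easy to recognise later. */
double *alloc_poisoned(size_t count)
{
    double *a = malloc(count * sizeof *a);
    if (a == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        a[i] = NAN;
    return a;
}

/* Copy at most capacity values from src into dst and return how many
   were copied. The rest of dst is left untouched (still NaN). */
size_t fill_from_input(double *dst, size_t capacity, const double *src, size_t n_src)
{
    size_t n = n_src < capacity ? n_src : capacity;
    memcpy(dst, src, n * sizeof *dst);
    return n;
}

/* Index of the first slot that still holds NaN, or count if every slot
   has been written. */
size_t first_unset(const double *a, size_t count)
{
    assert(a != NULL);
    for (size_t i = 0; i < count; i++)
        if (isnan(a[i]))
            return i;
    return count;
}
```

Calling first_unset() before the array is used (and comparing the result with the length the later code expects) turns a silent read of stale memory into an explicit check. Using calloc() instead of malloc() is a simpler variant, but zeros can pass for real data more easily than NaN does.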