Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 1062145
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 16, 20262026-05-16T18:34:45+00:00 2026-05-16T18:34:45+00:00

I’m writing a C application which is run across a compute cluster (using condor).

  • 0

I’m writing a C application which is run across a compute cluster (using condor). I’ve tried many methods to reveal the offending code but to no avail.

Clues:

  • On Average when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
  • When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.

Attempts:

  • I ran the code in valGrind for four days locally with no memory errors.
  • I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
  • Now when a segfault happens I can print out the current stack using backtrace.
  • I can print out variable values.
  • I created a variable which is set to the current line number.
  • Have also tried commenting chunks of the code out, hoping that if the problem goes away I will discover the segfault.

Sadly the line number outputted is fairly random. I’m not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?

Suspicions:

  • I suspect that the check pointing system which condor uses to move jobs across machines is more sensitive to memory corruption and this is why I don’t see it locally.
  • That indices are being corrupted by the bug, and that these indices are causing the segfault. This would explain the fact that the segfaults are occurring on fairly random line numbers.

UPDATE

Researching this some more I’ve found the following links:

  • LibSegFault – a library for automatically catching and printing state data about segfaults.

  • Stack unwinding (stack trace) with GCC tutorial on catching segfaults and get the line numbers of the offending instructions.

UPDATE 2

Greg suggested looking at the condor log and to ‘correlate the segfaults to when condor restarts the executable from a checkpoint’. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.

UPDATE 3

The segfault was being caused by differences between hosts, by setting the ‘requiremets’ field in the condor submit file to problem completely disappeared.

One can set individual machines:

requirements = machine == "hostname1" || machine == "hostname2"

or an entire class of machines:

requirements = classOfMachinesName

See requirements example here

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-16T18:34:46+00:00Added an answer on May 16, 2026 at 6:34 pm

    if you can, compile with debugging, and run under gdb.
    alternatively, get core dumped and load that into debugger.

    mpich has built-in debugger, or you can buy commercial parallel debugger.

    Then you can step through the code to see what happening in debugger

    http://nmi.cs.wisc.edu/node/1610

    http://nmi.cs.wisc.edu/node/1611

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I would like to run a str_replace or preg_replace which looks for certain words
link Im having trouble converting the html entites into html characters, (&# 8217;) i
That's pretty much it. I'm using Nokogiri to scrape a web page what has
I have just tried to save a simple *.rtf file with some websites and
I want to count how many characters a certain string has in PHP, but
I am trying to understand how to use SyndicationItem to display feed which is
I used javascript for loading a picture on my website depending on which small
I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
I am reading a book about Javascript and jQuery and using one of the
I'm using v2.0 of ClassTextile.php, with the following call: $testimonial_text = $textile->TextileRestricted($_POST['testimonial']); ... and

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.