Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 5968991
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 22, 20262026-05-22T20:10:02+00:00 2026-05-22T20:10:02+00:00

My monte carlo pi calculation CUDA program is causing my nvidia driver to crash

  • 0

My monte carlo pi calculation CUDA program is causing my nvidia driver to crash when I exceed around 500 trials and 256 full blocks. It seems to be happening in the monteCarlo kernel function.Any help is appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>


#define NUM_THREAD 256
#define NUM_BLOCK 256



///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////

// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];

// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();

// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
    if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int  trials, curandState *states){
//  unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int incircle, k;
    float x, y, z;
    incircle = 0;

    curand_init(1234, i, 0, &states[i]);

    for(k = 0; k < trials; k++){
        x = curand_uniform(&states[i]);
        y = curand_uniform(&states[i]);
        z =(x*x + y*y);
        if (z <= 1.0f) incircle++;
    }
    __syncthreads();
    g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {

    float* solution = (float*)calloc(100, sizeof(float));
    float *sumDev, *sumHost, total;
    const char *error;
    int trials; 
    curandState *devStates;

    trials = 500;
    total = trials*NUM_THREAD*NUM_BLOCK;

    dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
    dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
    size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
    sumHost = (float*)calloc(NUM_BLOCK*NUM_THREAD, sizeof(float));

    cudaMalloc((void **) &sumDev, size); // Allocate array on device
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);


    cudaMalloc((void **) &devStates, (NUM_THREAD*NUM_BLOCK)*sizeof(curandState));
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);


    // Do calculation on device by calling CUDA kernel
    monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);

        // call reduction function to sum
    reduce0 <<<dimGrid, dimBlock, (NUM_THREAD*sizeof(float))>>> (sumDev);
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);

    dim3 dimGrid1(1,1,1);
    dim3 dimBlock1(256,1,1);
    reduce0 <<<dimGrid1, dimBlock1, (NUM_THREAD*sizeof(float))>>> (sumDev);
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);

    // Retrieve result from device and store it in host array
    cudaMemcpy(sumHost, sumDev, sizeof(float), cudaMemcpyDeviceToHost);
    error = cudaGetErrorString(cudaGetLastError());
    printf("%s\n", error);


    *solution = 4*(sumHost[0]/total);
    printf("%.*f\n", 1000, *solution);
    free (solution);
    free(sumHost);
    cudaFree(sumDev);
    cudaFree(devStates);
    //*solution = NULL;
    return 0;
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-22T20:10:03+00:00Added an answer on May 22, 2026 at 8:10 pm

    If smaller numbers of trials work correctly, and if you are running on MS Windows without the NVIDIA Tesla Compute Cluster (TCC) driver and/or the GPU you are using is attached to a display, then you are probably exceeding the operating system’s “watchdog” timeout. If the kernel occupies the display device (or any GPU on Windows without TCC) for too long, the OS will kill the kernel so that the system does not become non-interactive.

    The solution is to run on a non-display-attached GPU and if you are on Windows, use the TCC driver. Otherwise, you will need to reduce the number of trials in your kernel and run the kernel multiple times to compute the number of trials you need.

    EDIT: According to the CUDA 4.0 curand docs(page 15, “Performance Notes”), you can improve performance by copying the state for a generator to local storage inside your kernel, then storing the state back (if you need it again) when you are finished:

    curandState state = states[i];
    
    for(k = 0; k < trials; k++){
        x = curand_uniform(&state);
        y = curand_uniform(&state);
        z =(x*x + y*y);
        if (z <= 1.0f) incircle++;
    }
    

    Next, it mentions that setup is expensive, and suggests that you move curand_init into a separate kernel. This may help keep the cost of your MC kernel down so you don’t run up against the watchdog.

    I recommend reading that section of the docs, there are several useful guidelines.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a program which performs a Monte Carlo-type simulation. Currently I have written
Whenever I run large scale monte carlo simulations in S-Plus, I always end up
I've been tasked with taking an existing single threaded monte carlo simulation and optimising
When should the Monte-Carlo method be used? For example, why did Joel decide to
Hiya im trying to write a code for basic 1D Monte-Carlo integration. To do
We got a 12-core MacPro to do some Monte Carlo calculations. Its Intel Xeon
I am trying to make an iPhone-app that does the Monte Carlo simulation on
I am rewriting a Monte Carlo simulation model in MATLAB with an emphasis on
I've written a Monte Carlo player for the board game Nine Men's Morris. Everything
I want to implement Monte Carlo Localization in a project I'm doing. The first

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.