Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 5962313
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 22, 20262026-05-22T19:05:17+00:00 2026-05-22T19:05:17+00:00

My code is a parallel implmentation that calculates the nth digit of pi. When

  • 0

My code is a parallel implmentation that calculates the nth digit of pi. When I finish the kernel and try to copy the memory back to the host I get a “the launch timed out and was terminated” error.
I used this code for error checking for each cudamalloc, cudamemcpy, and kernal launch.

std::string error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);

These calls were saying everything was fine until the first cudamemcpy call after returning from the kernel. the error happens in the line “cudaMemcpy(avhost, avdev, size, cudaMemcpyDeviceToHost);” in main. Any help is appreciated.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)
///////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////
/* return the inverse of x mod y */
__device__ int inv_mod(int x,int y) {
  int q,u,v,a,c,t;

  u=x;
  v=y;
  c=1;
  a=0;
  do {
    q=v/u;

    t=c;
    c=a-q*c;
    a=t;

    t=u;
    u=v-q*u;
    v=t;
  } while (u!=0);
  a=a%y;
  if (a<0) a=y+a;
  return a;
}
///////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////
/* return the inverse of u mod v, if v is odd */
__device__ int inv_mod2(int u,int v) {
  int u1,u3,v1,v3,t1,t3;

  u1=1;
  u3=u;

  v1=v;
  v3=v;

  if ((u&1)!=0) {
    t1=0;
    t3=-v;
    goto Y4;
  } else {
    t1=1;
    t3=u;
  }

  do {

    do {
      if ((t1&1)==0) {
    t1=t1>>1;
    t3=t3>>1;
      } else {
    t1=(t1+v)>>1;
    t3=t3>>1;
      }
      Y4:;
    } while ((t3&1)==0);

    if (t3>=0) {
      u1=t1;
      u3=t3;
    } else {
      v1=v-t1;
      v3=-t3;
    }
    t1=u1-v1;
    t3=u3-v3;
    if (t1<0) {
      t1=t1+v;
    }
  } while (t3 != 0);
  return u1;
}


/* return (a^b) mod m */
__device__ int pow_mod(int a,int b,int m)
{
  int r,aa;

  r=1;
  aa=a;
  while (1) {
    if (b&1) r=mul_mod(r,aa,m);
    b=b>>1;
    if (b == 0) break;
    aa=mul_mod(aa,aa,m);
  }
  return r;
}
///////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////
/* return true if n is prime */
int is_prime(int n)
{
   int r,i;
   if ((n % 2) == 0) return 0;

   r=(int)(sqrtf(n));
   for(i=3;i<=r;i+=2) if ((n % i) == 0) return 0;
   return 1;
}
///////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////
/* return the prime number immediatly after n */
int next_prime(int n)
{
   do {
      n++;
   } while (!is_prime(n));
   return n;
}
///////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////
#define DIVN(t,a,v,vinc,kq,kqinc)       \
{                       \
  kq+=kqinc;                    \
  if (kq >= a) {                \
    do { kq-=a; } while (kq>=a);        \
    if (kq == 0) {              \
      do {                  \
    t=t/a;                  \
    v+=vinc;                \
      } while ((t % a) == 0);           \
    }                       \
  }                     \
}

///////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////

__global__ void digi_calc(int *s, int *av, int *primes, int N, int n, int nthreads){
    int a,vmax,num,den,k,kq1,kq2,kq3,kq4,t,v,i,t1, h;
    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
// GIANT LOOP
    for (h = 0; h<1; h++){
    if(tid > nthreads) continue;
    a = primes[tid];
    vmax=(int)(logf(3*N)/logf(a));
    if (a==2) {
      vmax=vmax+(N-n);
      if (vmax<=0) continue;
    }
    av[tid]=1;
    for(i=0;i<vmax;i++) av[tid]*= a;

    s[tid]=0;
    den=1;
    kq1=0;
    kq2=-1;
    kq3=-3;
    kq4=-2;
    if (a==2) {
      num=1;
      v=-n; 
    } else {
      num=pow_mod(2,n,av[tid]);
      v=0;
    }

    for(k=1;k<=N;k++) {

      t=2*k;
      DIVN(t,a,v,-1,kq1,2);
      num=mul_mod(num,t,av[tid]);

      t=2*k-1;
      DIVN(t,a,v,-1,kq2,2);
      num=mul_mod(num,t,av[tid]);

      t=3*(3*k-1);
      DIVN(t,a,v,1,kq3,9);
      den=mul_mod(den,t,av[tid]);

      t=(3*k-2);
      DIVN(t,a,v,1,kq4,3);
      if (a!=2) t=t*2; else v++;
      den=mul_mod(den,t,av[tid]);

      if (v > 0) {
    if (a!=2) t=inv_mod2(den,av[tid]);
    else t=inv_mod(den,av[tid]);
    t=mul_mod(t,num,av[tid]);
    for(i=v;i<vmax;i++) t=mul_mod(t,a,av[tid]);
    t1=(25*k-3);                                                                                                                                                                                                                                                                                                                                                                       
    t=mul_mod(t,t1,av[tid]);
    s[tid]+=t;
    if (s[tid]>=av[tid]) s-=av[tid];
      }
    }

    t=pow_mod(5,n-1,av[tid]);
    s[tid]=mul_mod(s[tid],t,av[tid]);
    }
    __syncthreads();
}
///////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////
int main(int argc,char *argv[])
{
  int N,n,i,totalp, h;
  double sum;
  const char *error;
  int *sdev, *avdev, *shost, *avhost, *adev, *ahost;
    argc = 2;
    argv[1] = "2";
  if (argc<2 || (n=atoi(argv[1])) <= 0) {
    printf("This program computes the n'th decimal digit of pi\n"
       "usage: pi n , where n is the digit you want\n"
       );
    exit(1);
  }
    sum = 0;
    N=(int)((n+20)*logf(10)/logf(13.5));
    totalp=(N/logf(N))+10;
    ahost = (int *)calloc(totalp, sizeof(int));
    i = 0;
    ahost[0]=2;
    for(i=1; ahost[i-1]<=(3*N); ahost[i+1]=next_prime(ahost[i])){
        i++;
    }
    // allocate host memory
    size_t size = i*sizeof(int);
    shost = (int *)malloc(size);
    avhost = (int *)malloc(size);

  //allocate memory on device
    cudaMalloc((void **) &sdev, size);
    cudaMalloc((void **) &avdev, size);
    cudaMalloc((void **) &adev, size);
    cudaMemcpy(adev, ahost, size, cudaMemcpyHostToDevice);

    if (i >= 512){
        h = 512;
    }
    else h = i;
    dim3 dimGrid(((i+512)/512),1,1);                   
    dim3 dimBlock(h,1,1);

    // launch kernel
    digi_calc <<<dimGrid, dimBlock >>> (sdev, avdev, adev, N, n, i);

    //copy memory back to host
    cudaMemcpy(avhost, avdev, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(shost, sdev, size, cudaMemcpyDeviceToHost);

  // end malloc's, memcpy's, kernel calls
    for(h = 0; h <=i; h++){
    sum=fmod(sum+(double) shost[h]/ (double) avhost[h],1.0);
    }
  printf("Decimal digits of pi at position %d: %09d\n",n,(int)(sum*1e9));
    //free memory
    cudaFree(sdev);
    cudaFree(avdev);
    cudaFree(adev);
    free(shost);
    free(avhost);
    free(ahost);
  return 0;
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-22T19:05:18+00:00Added an answer on May 22, 2026 at 7:05 pm

    This is exactly the same problem you asked about in this question. The kernel is getting terminated early by the driver because it is taking too long to finish. If you read the documentation for any of these runtime API functions you will see the following note:

    Note:
    Note that this function may also return error codes from previous,
    asynchronous launches.

    All that is happening is that the first API call after the kernel launch is returning the error incurred while the kernel was running – in this case the cudaMemcpy call. The way you can confirm this for yourself is to do something like this directly after the kernel launch:

    // launch kernel
    digi_calc <<<dimGrid, dimBlock >>> (sdev, avdev, adev, N, n, i);
    std::string error = cudaGetErrorString(cudaPeekAtLastError());
    printf("%s\n", error);
    error = cudaGetErrorString(cudaThreadSynchronize());
    printf("%s\n", error);
    

    The cudaPeekAtLastError() call will show you if there are any errors in the kernel launch, and the error code returned by the cudaThreadSynchronize() call will show whether any errors were generated while the kernel was executing.

    The solution is exactly as outlined in the previous question: probably the simplest way is redesign the code so it is “re-entrant” so you can split the work over several kernel launches, with each kernel launch safely under the display driver watchdog timer limit.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Here some C++ code that is accessed from multiple threads in parallel. It has
My question centers on some Parallel.ForEach code that used to work without fail, and
Whats a simple code that does parallel processing in python 2.7? All the examples
In my parallel programming book, I came across this code that says the slaves
Code that is untestable really annoys me. The following things make oo-code untestable: global
(code examples are python) Lets assume we have a list of percentages that add
Code: for i in {0..3}; do ping http://www.pythonchallenge.com/pc/def/$i.html; done A host should be found
I am porting some code to Parallel.ForEach and got an error with a continue
Currently I have a section of code that needs to make about 7 web
Can i get the source code for a WAMP stack installer somewhere? Any help

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.