Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8471509
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 10, 20262026-06-10T16:48:55+00:00 2026-06-10T16:48:55+00:00

I am using 63 registers/thread ,so (32768 is maximum) i can use about 520

  • 0

I am using 63 registers/thread ,so (32768 is maximum) i can use about 520 threads.I am using now 512 threads in this example.

(The parallelism is in the function “computeEvec” inside global computeEHfields function function.)
The problems are:

1) The mem check error below.

2) When i use numPointsRp>2000 it show me “out of memory” ,but (if i am not doing wrong) i compute the global memory and it’s ok.

——————————-UPDATED—————————

i run the program with cuda-memcheck and it gives me (only when numPointsRs>numPointsRp):

========= Invalid global read of size 4

========= at 0x00000428 in computeEHfields

========= by thread (2,0,0) in block (0,0,0)

========= Address 0x4001076e0 is out of bounds

=========
========= Invalid global read of size 4

========= at 0x00000428 in computeEHfields

========= by thread (1,0,0) in block (0,0,0)

========= Address 0x4001076e0 is out of bounds

=========
========= Invalid global read of size 4

========= at 0x00000428 in computeEHfields

========= by thread (0,0,0) in block (0,0,0)

========= Address 0x4001076e0 is out of bounds

ERROR SUMMARY: 160 errors

———–EDIT—————————-

Also , some times (if i use only threads and not blocks (i haven’t test it for blocks) ) if for example i have numPointsRs=1000 and numPointsRp=100 and then change the numPointsRp=200 and then again change the numPointsRp=100 i am not taking the first results!

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv


Rs=np.zeros((numPointsRs,3)).astype(np.float32)
for k in range (numPointsRs): 
    Rs[k]=[0,k,0]

Rp=np.zeros((numPointsRp,3)).astype(np.float32)
for k in range (numPointsRp): 
    Rp[k]=[1+k,0,0]


#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)


J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))

Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))


mod =SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>
#define RowRsSize %(numrs)d
#define RowRpSize %(numrp)d


typedef  pycuda::complex<float> cmplx;
extern "C"{


    __device__ void computeEvec(float Rs_mat[][3], int numPointsRs,   
         cmplx J[][3],
         cmplx M[][3],
         float *Rp,
         cmplx kp, 
         cmplx eta,
         cmplx *Evec,
         cmplx *Hvec, cmplx *All)

{

    while (c<numPointsRs){
        ...         
                c++;

                }     
        }


__global__  void computeEHfields(float *Rs_mat_, int numPointsRs,   
        float *Rp_mat_, int numPointsRp,    
    cmplx *J_,
    cmplx *M_,
    cmplx  kp, 
    cmplx  eta,
    cmplx E[][3],
    cmplx H[][3], cmplx *All )
    {
        float Rs_mat[RowRsSize][3];
        float Rp_mat[RowRpSize][3];

        cmplx J[RowRsSize][3];
        cmplx M[RowRsSize][3];


    int k=threadIdx.x+blockIdx.x*blockDim.x;

      while (k<numPointsRp)  
     {

        computeEvec( Rs_mat, numPointsRs,  J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
        k+=blockDim.x*gridDim.x;


    }

}
}

"""% { "numrs":numPointsRs, "numrp":numPointsRp},no_extern_c=1)


func = mod.get_function("computeEHfields")


func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))

print(" \n")


#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()

——————–GPU MODEL————————————————

Device 0: "GeForce GTX 560"
  CUDA Driver Version / Runtime Version          4.20 / 4.10
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073283072 bytes)
  ( 0) Multiprocessors x (48) CUDA Cores/MP:     0 CUDA Cores   //CUDA Cores    336 => 7 MP and 48 Cores/MP
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-10T16:48:56+00:00Added an answer on June 10, 2026 at 4:48 pm

    When i use numPointsRp>2000 it show me “out of memory”

    Now we have some real code to work with, let’s compile it and see what happens. Using RowRsSize=2000 and RowRpSize=200 and compiling with the CUDA 4.2 toolchain, I get:

    nvcc -arch=sm_21 -Xcompiler="-D RowRsSize=2000 -D RowRpSize=200" -Xptxas="-v" -c -I./ kivekset.cu 
    ptxas info    : Compiling entry function '_Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_' for 'sm_21'
    ptxas info    : Function properties for _Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_
        122432 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 57 registers, 84 bytes cmem[0], 168 bytes cmem[2], 76 bytes cmem[16]
    

    The key numbers are 57 registers and 122432 bytes stack frame per thread. The occupancy calculator suggests that a block of 512 threads will have a maximum of 1 block per SM, and your GPU has 7 SM. This gives a total of 122432 * 512 * 7 = 438796288 bytes of stack frame (local memory) to run your kernel, before you have allocated a single of byte of memory for input and output using pyCUDA. On a GPU with 1Gb of memory, it isn’t hard to imagine running out of memory. Your kernel has a enormous local memory footprint. Start thinking about ways to reduce it.


    As I indicated in comments, it is absolutely unclear why every thread needs a complete copy of the input data in this kernel code. It results in a gigantic local memory footprint and there seems to be absolutely no reason why the code should be written in this way. You could, I suspect, modify the kernel to something like this:

    typedef  pycuda::complex<float> cmplx;
    typedef float fp3[3];
    typedef cmplx cp3[3];
    
    __global__  
    void computeEHfields2(
            float *Rs_mat_, int numPointsRs,
            float *Rp_mat_, int numPointsRp,
            cmplx *J_,
            cmplx *M_,
            cmplx  kp, 
            cmplx  eta,
            cmplx E[][3],
            cmplx H[][3], 
            cmplx *All )
    {
    
        fp3 * Rs_mat = (fp3 *)Rs_mat_;
        cp3 * J = (cp3 *)J_;
        cp3 * M = (cp3 *)M_;
    
        int k=threadIdx.x+blockIdx.x*blockDim.x;
        while (k<numPointsRp)  
        {
            fp3 * Rp_mat = (fp3 *)(Rp_mat_+k);
            computeEvec2( Rs_mat, numPointsRs, J, M, *Rp_mat, kp, eta, E[k], H[k], All );
            k+=blockDim.x*gridDim.x;
        }
    }
    

    and the main __device__ function it calls to something like this:

    __device__ void computeEvec2(
            fp3 Rs_mat[], int numPointsRs,   
            cp3 J[],
            cp3 M[],
            fp3   Rp,
            cmplx kp, 
            cmplx eta,
            cmplx *Evec,
            cmplx *Hvec, 
            cmplx *All)
    {
     ....
    }
    

    and eliminate every byte of thread local memory without changing the functionality of the computational code at all.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a COM outproc server written in ATL that registers itself using _Module.RegisterClassObjects(CLSCTX_LOCAL_SERVER,
Using phpMyAdmin and running the following query I get 19 registers: SELECT * FROM
I am using the next instructions to get some registers from my Database. Create
I want to create sub-domains using PHP on the fly. Suppose a user registers
I have a UITableView that fetches data from CoreData using FetchedResultsController and it registers
Code: #include <iostream> #include stdafx.h #include <boost/thread.hpp> #include <boost/thread/mutex.hpp> using namespace std; boost::mutex mut;
How can I wait for a detached thread to finish in C++? I don't
I get this crash report when I run java bean of my project using
I am a tinkerer—no doubt about that. For this reason (and very little beyond
I've been having trouble using xmpp4r to do in-band registration following the SO thread

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.