The Archive Base


Editorial Team
Asked: June 16, 2026

I would like to write an electromagnetic 2D Finite Difference Time Domain (FDTD) code


I would like to write an electromagnetic 2D Finite Difference Time Domain (FDTD) code in CUDA.
The C code for the update of the magnetic field is the following:

// --- Update for Hy and Hx
for(int i=n1; i<=n2; i++) {
   for(int j=n11; j<=n21; j++){
      Hy[i*ydim+j]=A[i*ydim+j]*Hy[i*ydim+j]+B[i*ydim+j]*(Ezx[(i+1)*ydim+j]-Ezx[i*ydim+j]+Ezy[(i+1)*ydim+j]-Ezy[i*ydim+j]);
      Hx[i*ydim+j]=G[i*ydim+j]*Hx[i*ydim+j]-H[i*ydim+j]*(Ezx[i*ydim+j+1]-Ezx[i*ydim+j]+Ezy[i*ydim+j+1]-Ezy[i*ydim+j]);
   }
}

My first parallelization attempt has been the following kernel:

__global__ void H_update_kernel(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h,
                                double* A_h, double* B_h, double* G_h, double* H_h,
                                int n1, int n2, int n11, int n21)
{
   int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
   int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

   // Only threads inside the update region [n1,n2] x [n11,n21] do any work
   if ((idx <= n2 && idx >= n1)&&(idy <= n21 && idy >= n11)) {
      Hy_h[idx*ydim+idy]=A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy]+B_h[idx*ydim+idy]*(Ezx_h[(idx+1)*ydim+idy]-Ezx_h[idx*ydim+idy]+Ezy_h[(idx+1)*ydim+idy]-Ezy_h[idx*ydim+idy]);
      Hx_h[idx*ydim+idy]=G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy]-H_h[idx*ydim+idy]*(Ezx_h[idx*ydim+idy+1]-Ezx_h[idx*ydim+idy]+Ezy_h[idx*ydim+idy+1]-Ezy_h[idx*ydim+idy]);
   }
}

However, after profiling it with the Visual Profiler, I am unsatisfied with this solution for two reasons:
1) The memory accesses are poorly coalesced;
2) The shared memory is not used.

I then decided to use the following solution:

__global__ void H_update_kernel(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h,
                                double* A_h, double* B_h, double* G_h, double* H_h,
                                int n1, int n2, int n11, int n21)
{
    int i       = threadIdx.x;
    int j       = threadIdx.y;
    int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

    // Linear thread index within the block (a row of threads has BLOCK_SIZE_X entries)
    int index1  = j*BLOCK_SIZE_X+i;

    // First tile element loaded by this thread
    int i1      = (index1)%(BLOCK_SIZE_X+1);
    int j1      = (index1)/(BLOCK_SIZE_X+1);

    // Second tile element loaded by this thread (covers the rest of the tile, including the halo)
    int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
    int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_X+1);

    __shared__ double Ezx_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
    // Declared but never loaded or read: Ezy is still fetched from global memory below
    __shared__ double Ezy_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];

    if (((blockIdx.x*BLOCK_SIZE_X+i1)<xdim)&&((blockIdx.y*BLOCK_SIZE_Y+j1)<ydim))
        Ezx_h_shared[i1][j1]=Ezx_h[(blockIdx.x*BLOCK_SIZE_X+i1)*ydim+(blockIdx.y*BLOCK_SIZE_Y+j1)];

    // Row stride is ydim here as well, matching the first load and the update below
    if (((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1)))&&(((blockIdx.x*BLOCK_SIZE_X+i2)<xdim)&&((blockIdx.y*BLOCK_SIZE_Y+j2)<ydim)))
        Ezx_h_shared[i2][j2]=Ezx_h[(blockIdx.x*BLOCK_SIZE_X+i2)*ydim+(blockIdx.y*BLOCK_SIZE_Y+j2)];

    __syncthreads();

    if ((idx <= n2 && idx >= n1)&&(idy <= n21 && idy >= n11)) {
        Hy_h[idx*ydim+idy]=A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy]+B_h[idx*ydim+idy]*(Ezx_h_shared[i+1][j]-Ezx_h_shared[i][j]+Ezy_h[(idx+1)*ydim+idy]-Ezy_h[idx*ydim+idy]);
        Hx_h[idx*ydim+idy]=G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy]-H_h[idx*ydim+idy]*(Ezx_h_shared[i][j+1]-Ezx_h_shared[i][j]+Ezy_h[idx*ydim+idy+1]-Ezy_h[idx*ydim+idy]);
    }
}

The index trick is needed to make a block of BS_x * BS_y threads load (BS_x+1)*(BS_y+1) global memory locations into shared memory.
I believe that this choice is better than the previous one because it uses shared memory, although not all the accesses are fully coalesced; see

Analyzing memory access coalescing of my CUDA kernel
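
The index trick can be checked on the host before running it on the GPU. The following is only a sketch in plain C, assuming a 16 x 16 block and a linear thread index of j*BS_X + i; the macro values and the function name tile_load_covers_halo are hypothetical and not part of the original kernel. It emulates the two load passes and confirms that each of the (BS_X+1)*(BS_Y+1) tile elements is written exactly once.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BS_X 16  /* assumed block width  (BLOCK_SIZE_X in the question) */
#define BS_Y 16  /* assumed block height (BLOCK_SIZE_Y in the question) */

/* Emulates the two-pass cooperative load on the host: thread (i, j) of a
   BS_X x BS_Y block writes tile element index1 in pass 0 and element
   index1 + BS_X*BS_Y in pass 1. Returns true if every element of the
   (BS_X+1) x (BS_Y+1) tile is written exactly once. */
bool tile_load_covers_halo(void)
{
    int hits[BS_X + 1][BS_Y + 1];
    memset(hits, 0, sizeof hits);

    /* Two passes can only cover the tile if there are enough threads. */
    assert(2 * BS_X * BS_Y >= (BS_X + 1) * (BS_Y + 1));

    for (int j = 0; j < BS_Y; j++) {
        for (int i = 0; i < BS_X; i++) {
            int index1 = j * BS_X + i;      /* linear thread index in the block */
            for (int pass = 0; pass < 2; pass++) {
                int lin = index1 + pass * BS_X * BS_Y;
                int it  = lin % (BS_X + 1); /* tile coordinate along x */
                int jt  = lin / (BS_X + 1); /* tile coordinate along y */
                if (jt < BS_Y + 1)          /* threads past the tile end stay idle */
                    hits[it][jt]++;
            }
        }
    }

    for (int it = 0; it <= BS_X; it++)
        for (int jt = 0; jt <= BS_Y; jt++)
            if (hits[it][jt] != 1)
                return false;
    return true;
}
```

Calling tile_load_covers_halo() from a small test program and checking that it returns true is a cheap way to validate the mapping for a given block size before profiling the kernel.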

Can anyone point me to a better solution in terms of coalesced memory access? Thank you.


1 Answer

  1. Editorial Team
    Added an answer on June 16, 2026 at 2:11 am

    Thank you for providing the profiling information.

    Your second algorithm is better because you are getting a higher IPC. Still, on CC 2.0 the maximum IPC is 2.0, so your average of 1.018 in the second solution means that only about half of your available compute power is utilized. Normally that means your algorithm is memory bound, but I'm not sure in your case, because almost all the code in your kernel is inside if conditionals. A significant amount of warp divergence will affect performance, but I haven't checked whether instructions whose results are not used count towards the IPC.

    You may want to look into reading through the texture cache. Textures are optimized for 2D spatial locality and better support semi-random 2D access. It may help your [i][j] type accesses.

    In the current solution, make sure that it’s the Y position ([j]) that changes the least between two threads with adjacent thread IDs (to keep memory accesses as sequential as possible).
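
One way to check which thread mapping keeps accesses sequential is to compute the address stride between two threads with adjacent threadIdx.x. The sketch below is plain C and assumes the row-major layout idx*ydim+idy from the question; the function names are hypothetical, and the second mapping (threadIdx.x driving idy) is an alternative shown only for comparison, not something taken from the posted kernels.

```c
#include <assert.h>

/* Flattened row-major index used throughout the question. */
static int flat_index(int idx, int idy, int ydim)
{
    return idx * ydim + idy;
}

/* Stride (in elements) between two threads with adjacent threadIdx.x
   when threadIdx.x is mapped to idx, as in the posted kernels. */
int stride_x_maps_to_idx(int ydim)
{
    assert(ydim > 0);
    return flat_index(1, 0, ydim) - flat_index(0, 0, ydim);
}

/* Same stride when threadIdx.x is mapped to idy instead
   (hypothetical alternative mapping, not from the original post). */
int stride_x_maps_to_idy(int ydim)
{
    assert(ydim > 0);
    return flat_index(0, 1, ydim) - flat_index(0, 0, ydim);
}
```

For any ydim, the first function returns ydim and the second returns 1. Since the threads of a warp are consecutive in threadIdx.x, a stride of 1 means the warp touches consecutive doubles, while a stride of ydim means each thread lands in a different row.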

    It could be that the compiler has already optimized this for you, but you recalculate idx*ydim+idy many times. Try calculating it once and reusing the result. That would have more potential for improvement if your algorithm were compute bound, though.
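
A minimal sketch of that last suggestion, written as host-side C rather than as a kernel so it can be checked without a GPU. The array names follow the question; the function name update_h, the explicit ydim parameter, and the local names c, e and n are assumptions made for this example. The flattened index is computed once per cell and the two neighbour indices are derived from it.

```c
#include <assert.h>

/* Host-side H update with the flattened index computed once per cell.
   The arithmetic keeps the same evaluation order as the loop in the
   question, so results match it bit for bit. */
void update_h(double *Hx, double *Hy,
              const double *Ezx, const double *Ezy,
              const double *A, const double *B,
              const double *G, const double *H,
              int n1, int n2, int n11, int n21, int ydim)
{
    assert(ydim > 0);
    for (int i = n1; i <= n2; i++) {
        for (int j = n11; j <= n21; j++) {
            const int c = i * ydim + j;  /* current cell (i, j)   */
            const int e = c + ydim;      /* neighbour  (i + 1, j) */
            const int n = c + 1;         /* neighbour  (i, j + 1) */

            Hy[c] = A[c] * Hy[c] + B[c] * (Ezx[e] - Ezx[c] + Ezy[e] - Ezy[c]);
            Hx[c] = G[c] * Hx[c] - H[c] * (Ezx[n] - Ezx[c] + Ezy[n] - Ezy[c]);
        }
    }
}
```

The same three local indices can be used inside the kernel body in place of the repeated idx*ydim+idy expressions. As noted above, the compiler may already perform this common-subexpression elimination, so the gain is mostly in readability unless the kernel turns out to be compute bound.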


© 2021 The Archive Base. All Rights Reserved