This is the first time i ask question here so thanks very much in

Question


Asked: June 13, 2026 (2026-06-13T11:37:54+00:00)


This is my first time asking a question here, so thanks very much in advance, and please forgive my ignorance. I have also only just started CUDA programming.

Basically, I have a bunch of points, and I want to calculate all the pairwise distances. Currently my kernel function holds on to one point, iteratively reads in all the other points (from global memory), and performs the calculation. Here are the things I'm confused about:

I'm using a Tesla M2050 with 448 cores, but my current parallel version (kernel<<<128,16,16>>>) achieves a much higher speedup than that (about 600x faster than kernel<<<1,1,1>>>). Is this due to multithreading or to pipelining, or do those two actually mean the same thing?
I want to improve the performance further, so I tried using shared memory to hold some of the input points for each thread block. But the new code is no faster. What is the possible cause? Could it be that I launch too many threads?
Or is it because I have an if-statement in the code? I only consider and count the short distances, so I have a statement like if (dist < 200). How much should I worry about this one?

A million thanks!
Bin


1 Answer

Editorial Team · Answer 1 · 2026-06-13T11:37:55+00:00

Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.

Algorithmic optimizations: changes to addressing, algorithm cascading (11.84x speedup, combined)
Code optimizations: loop unrolling (2.54x speedup, combined)

Having an extra conditional statement does indeed cost you something, although it should be the last thing you optimize, if only because you need to know the final layout of your code before building size assumptions into it.

The problem you are working on sounds like the famous n-body problem; see Fast N-Body Simulation with CUDA.

An additional performance increase can be achieved if you can avoid doing some of the pairwise computations altogether, for example when elements are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the space into a grid of boxes, have each element put itself into a box by dividing its coordinates by the box size, and then only evaluate pairwise relations between neighboring boxes. The cost is then roughly O(n*m), where m is the number of elements in the neighboring boxes, rather than O(n^2).
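The box method can be sketched on the CPU as follows. This is only an illustration under stated assumptions, not the asker's code and not a CUDA kernel: it assumes 2D points, a box side equal to the distance cutoff (200 in the question), and the names Point, count_close_pairs_brute and count_close_pairs_grid are made up for the example. The brute-force version is included only as a reference to check the grid version against.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D point type; the original post does not state the dimension.
struct Point { float x, y; };

// Squared distance, so the cutoff test needs no sqrt.
static float dist2(const Point& a, const Point& b) {
    const float dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Reference version: tests every pair once, O(n^2).
std::size_t count_close_pairs_brute(const std::vector<Point>& pts, float cutoff) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < pts.size(); ++i)
        for (std::size_t j = i + 1; j < pts.size(); ++j)
            if (dist2(pts[i], pts[j]) < cutoff * cutoff) ++count;
    return count;
}

// Grid version: boxes have side `cutoff`, so any pair closer than `cutoff`
// lies in the same box or in two adjacent boxes.
// Assumes the point extent divided by `cutoff` gives a grid that fits in memory.
std::size_t count_close_pairs_grid(const std::vector<Point>& pts, float cutoff) {
    assert(cutoff > 0.0f);
    if (pts.empty()) return 0;

    // Bounding box of the input, used as the grid origin and extent.
    float min_x = pts[0].x, max_x = pts[0].x;
    float min_y = pts[0].y, max_y = pts[0].y;
    for (const Point& p : pts) {
        min_x = std::min(min_x, p.x); max_x = std::max(max_x, p.x);
        min_y = std::min(min_y, p.y); max_y = std::max(max_y, p.y);
    }
    const int nx = static_cast<int>((max_x - min_x) / cutoff) + 1;
    const int ny = static_cast<int>((max_y - min_y) / cutoff) + 1;

    // Each point finds its box by dividing its offset by the box size.
    auto box_x = [&](const Point& p) {
        return std::min(nx - 1, static_cast<int>((p.x - min_x) / cutoff));
    };
    auto box_y = [&](const Point& p) {
        return std::min(ny - 1, static_cast<int>((p.y - min_y) / cutoff));
    };

    // boxes[b] holds the indices of the points that fell into box b.
    std::vector<std::vector<std::size_t>> boxes(static_cast<std::size_t>(nx) * ny);
    for (std::size_t i = 0; i < pts.size(); ++i)
        boxes[static_cast<std::size_t>(box_y(pts[i])) * nx + box_x(pts[i])].push_back(i);

    // For each point, look only at its own box and the up-to-8 neighbors.
    std::size_t count = 0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const int bx = box_x(pts[i]), by = box_y(pts[i]);
        for (int y = std::max(0, by - 1); y <= std::min(ny - 1, by + 1); ++y)
            for (int x = std::max(0, bx - 1); x <= std::min(nx - 1, bx + 1); ++x)
                for (std::size_t j : boxes[static_cast<std::size_t>(y) * nx + x])
                    if (j > i && dist2(pts[i], pts[j]) < cutoff * cutoff) ++count;  // j > i counts each pair once
    }
    return count;
}
```

Calling count_close_pairs_grid(points, 200.0f) should return the same count as the brute-force version while touching far fewer pairs when the points are spread out. A GPU port would likely replace the per-box vectors with a sort by box index or a counting pass, but that part is beyond what the answer describes.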
