Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8743185
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 13, 20262026-06-13T11:37:54+00:00 2026-06-13T11:37:54+00:00

This is the first time i ask question here so thanks very much in

  • 0

This is the first time i ask question here so thanks very much in advance and please forgive my ignorance. And also I’ve just started to CUDA programming.

Basically, i have a bunch of points, and i want to calculate all the pair-wise distances. Currently my kernel function just holds on one point, and iteratively read in all other points (from global memory), and conduct the calculation. Here’s some of my confusions:

  • I’m using a Tesla M2050 with 448 cores. But my current parallel version (kernel<<<128,16,16>>>) achieves a much higher parallelism (about 600x faster than kernel<<<1,1,1>>>). Is it possibly due to the multithreading thing or pipeline issue, or they actually indicate the same thing?

  • I want to further improve the performance. So i figure to use shared memory to hold some input points for each multiprocessing block. But the new code is just as fast. What’s the possible cause? Could it be related to the fact that i set too many threads?

  • Or, is it because i have a if-statement in the code? The thing is, i only consider and count the short distances, so i have a statement like (if dist < 200). How much should i worry about this one?

A million thanks!
Bin

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-13T11:37:55+00:00Added an answer on June 13, 2026 at 11:37 am

    Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.

    Algorithmic optimizations
    Changes to addressing, algorithm cascading
    11.84x speedup, combined!
    Code optimizations
    Loop unrolling
    2.54x speedup, combined

    Having an extra operations statement, does indeed cause problems although it will be the last thing you want to optimize, if not simply because you need to know the layout of your code before implementing the size assumptions!

    The problem you are working on sounds like the famous n-body problem,
    see Fast N-Body Simulation with CUDA.

    An additional performance increase can be achieved if you can avoid doing a pairwise computation, for example, the elements are too far to have an effect on each-other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes and, with each element putting itself into a box via division, then only evaluate pairwise relations between between neighboring boxes. This can be called O(n*m).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Sorry, this's my first time to ask a question here. So, I don't have
this is my first time asking a question here. I tried to be well
It is the first time I ask a question on this forum. Sorry if
First time I ever ask a question here so correct me if I´m doing
This is my first time on this site. I will ask you to help
First time asking a question here. Usually I can find my answer without having
I had another question about this issue, but I didn't ask properly, so here
Related to this question here , but I decided to ask another question for
Okay, this is like the 5th time I have had to ask this question,
Am just learning C and this is my first time on stackoverflow so am

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.