Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7940593
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 3, 20262026-06-03T23:26:09+00:00 2026-06-03T23:26:09+00:00

How can I use two devices in order to improve for example the performance

  • 0

How can I use two devices in order to improve for example
the performance of the following code (sum of vectors)?
Is it possible to use more devices “at the same time”?
If yes, how can I manage the allocations of the vectors on the global memory of the different devices?

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>

#define NB 32
#define NT 500
#define N NB*NT

__global__ void add( double *a, double *b, double *c);

//===========================================
__global__ void add( double *a, double *b, double *c){

    int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }

}

//============================================
//BEGIN
//===========================================
int main( void ) {

    double *a, *b, *c;
    double *dev_a, *dev_b, *dev_c;

    // allocate the memory on the CPU
    a=(double *)malloc(N*sizeof(double));
    b=(double *)malloc(N*sizeof(double));
    c=(double *)malloc(N*sizeof(double));

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, N * sizeof(double) );
    cudaMalloc( (void**)&dev_b, N * sizeof(double) );
    cudaMalloc( (void**)&dev_c, N * sizeof(double) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = (double)i;
        b[i] = (double)i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);

    for(int i=0;i<10000;++i)
        add<<<NB,NT>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);

    // display the results
    // for (int i=0; i<N; i++) {
    //      printf( "%g + %g = %g\n", a[i], b[i], c[i] );
    //  }
    printf("\nGPU done\n");

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    // free the memory allocated on the CPU
    free( a );
    free( b );
    free( c );

    return 0;
}

Thank you in advance.
Michele

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-03T23:26:10+00:00Added an answer on June 3, 2026 at 11:26 pm

    Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have need to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use mutliple GPUs inside the same host application.

    Now it is possible to do something like this for the memory allocation part of your host code:

    double *dev_a[2], *dev_b[2], *dev_c[2];
    const int Ns[2] = {N/2, N-(N/2)};
    
    // allocate the memory on the GPUs
    for(int dev=0; dev<2; dev++) {
        cudaSetDevice(dev);
        cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
        cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
        cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
    }
    

    (disclaimer: written in browser, never compiled, never tested, use at own risk).

    The basic idea here is that you use cudaSetDevice to select between devices when you are preforming operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [(N/2) doubles on the first device and N-(N/2) on the second].

    The transfer of data from the host to device could be as simple as:

    // copy the arrays 'a' and 'b' to the GPUs
    for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
        cudaSetDevice(dev);
        cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
    }
    

    (disclaimer: written in browser, never compiled, never tested, use at own risk).

    The kernel launching section of your code could then look something like:

    for(int i=0;i<10000;++i) {
        for(int dev=0; dev<2; dev++) {
            cudaSetDevice(dev);
            add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
        }
    }
    

    (disclaimer: written in browser, never compiled, never tested, use at own risk).

    Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I Will leave it to you to work out the modifications required.
    But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.

    You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features which can be used in recent CUDA versions and hardware to assist multiple GPU applications (like unified addressing, the peer-to-peer facilities are more), but this should be enough to get you started. There is also a simple muLti-GPU application in the CUDA SDK you can look at for more ideas.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Can i use Tcp Sockets to connect two Android Devices(not Emulators) giving the proper
Can you use two regex in preg_replace to match and replace items in an
I know that I can use cmp, diff, etc to compare two files, but
How can I use DATEDIFF to return the difference between two dates in years,
How to merge two PNG's together? I know that you can't use PNGObject.Draw because
I am using two update panel..in nested way.. can i use like this and
I've got two attributes start_time and end_time . How can I use the date
Is this possible to share data between two applications on the same device? Or
I can use the Network tab in the Google Chrome Web Inspector to debug
I can use -d option to put files into a database. mongofiles -d test

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.