The Archive Base

Editorial Team
Asked: May 30, 2026

Executing the following code sample takes ~750 ms on a GeForce GT540M whereas the same code executes in ~250 ms on a GT330M.

Copying a and b to the CUDA device memory (dev_a and dev_b) takes ~350 ms on the GT540M and ~250 ms on the GT330M. Executing addCuda and copying the result back to the host takes another ~400 ms on the GT540M and ~0 ms on the GT330M.

This is not what I expected, so I checked the devices’ properties and discovered that the GT540M device surpasses or equals GT330M in every way except in the number of multiprocessors – GT540M has 2 and GT330M has 6. Can this really be true? And if so, can it really have such a great impact on the execution time?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define T 512
#define N (60000*T)

__global__ void addCuda(double *a, double *b, double *c) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < N) {
        c[tid] = sqrt(fabs(a[tid] * b[tid] / 12.34567)) * cos(a[tid]);
    }
}

int main() {
    double *dev_a, *dev_b, *dev_c;

    double* a = (double*)malloc(N*sizeof(double));
    double* b = (double*)malloc(N*sizeof(double));
    double* c = (double*)malloc(N*sizeof(double));

    printf("Filling arrays (CPU)...\n\n");
    int i;
    for(i = 0; i < N; i++) {
        a[i] = (double)-i;
        b[i] = (double)i;
    }

    // Note: this first interval also covers the three cudaMalloc calls.
    clock_t timer = clock();
    cudaMalloc((void**) &dev_a, N*sizeof(double));
    cudaMalloc((void**) &dev_b, N*sizeof(double));
    cudaMalloc((void**) &dev_c, N*sizeof(double));
    cudaMemcpy(dev_a, a, N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(double), cudaMemcpyHostToDevice);

    printf("Memcpy time: %ld\n", (long)(clock() - timer));
    addCuda<<<(N+T-1)/T,T>>>(dev_a, dev_b, dev_c);
    // This blocking copy waits for the kernel, so the total below includes it.
    cudaMemcpy(c, dev_c, N*sizeof(double), cudaMemcpyDeviceToHost);

    printf("Time elapsed: %ld\n", (long)(clock() - timer));

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a);
    free(b);
    free(c);

    return 0;
}

The device properties for the devices:

GT540M:

Major revision number:         2
Minor revision number:         1
Name:                          GeForce GT 540M
Total global memory:           1073741824
Total shared memory per block: 49152
Total registers per block:     32768
Warp size:                     32
Maximum memory pitch:          2147483647
Maximum threads per block:     1024
Maximum dimension 0 of block:  1024
Maximum dimension 1 of block:  1024
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   65535
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   65535
Clock rate:                    1344000
Total constant memory:         65536
Texture alignment:             512
Concurrent copy and execution: Yes
Number of multiprocessors:     2
Kernel execution timeout:      Yes

GT330M:

Major revision number:         1
Minor revision number:         2
Name:                          GeForce GT 330M
Total global memory:           268435456
Total shared memory per block: 16384
Total registers per block:     16384
Warp size:                     32
Maximum memory pitch:          2147483647
Maximum threads per block:     512
Maximum dimension 0 of block:  512
Maximum dimension 1 of block:  512
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   65535
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   1
Clock rate:                    1100000
Total constant memory:         65536
Texture alignment:             256
Concurrent copy and execution: Yes
Number of multiprocessors:     6
Kernel execution timeout:      Yes

1 Answer

Editorial Team
Added an answer on May 30, 2026 at 8:52 am

I don't think a device-to-host copy of this size can take ~0 ms. I would suggest checking whether something is wrong with that copy, for example by inspecting the error code that cudaMemcpy returns.

© 2021 The Archive Base. All Rights Reserved