Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8012337
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 4, 20262026-06-04T19:18:01+00:00 2026-06-04T19:18:01+00:00

I’m new to CUDA and I’m probably doing something wrong. All I need is

  • 0

I’m new to CUDA and I’m probably doing something wrong.
All I need is logical operation on two binary vectors. Length of vectors is 2048000. I compared speed between logical and in Matlab’s C mex file and in CUDA kernel. C on CPU is ~5% faster than CUDA. Please note that I measured only kernel execution (without memory transfer). I have i7 930 and 9800GT.

##MEX file testCPU.c:##

#include "mex.h"
void mexFunction( int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[] ) {
    
    int i, varLen;
    unsigned char *vars, *output;
            
    vars = mxGetPr(prhs[0]);
    plhs[0] = mxCreateLogicalMatrix(2048000, 1);
    output = mxGetPr(plhs[0]);
    for (i=0;i<2048000;i++){
        output[i] = vars[i] & vars[2048000+i];
    }
}

Compile

mex testCPU.c

Create vectors

vars = ~~(randi(2,2048000,2)-1);

Measure speed:

tic;testCPU(vars);toc;

CUDA:

#CUDA file testGPU.cu#
#include "mex.h"
#include "cuda.h"

__global__ void logical_and(unsigned char* in, unsigned char* out, int N) {
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    out[idx] = in[idx] && in[idx+N];
}


void mexFunction( int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[] ) {
    
    int i;
    unsigned char *vars, *output, *gpu, *gpures;
    
    vars = (unsigned char*)mxGetData(prhs[0]);
    
    plhs[0] = mxCreateLogicalMatrix(2048000, 1);
    output = (unsigned char*)mxGetData(plhs[0]);       
       
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float dt_ms;
    
    // input GPU malloc
    cudaEventRecord(start, 0);
    cudaMalloc( (void **) &gpu, sizeof(unsigned char)*4096000);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("GPU input malloc: %f ms, %i\n", dt_ms, cudaGetLastError());
    
    // output GPU malloc
    cudaEventRecord(start, 0);
    cudaMalloc( (void **) &gpures, sizeof(unsigned char)*2048000);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("GPU output malloc: %f ms, %i\n", dt_ms, cudaGetLastError());
    
    // copy from CPU to GPU
    cudaEventRecord(start, 0);
    cudaMemcpy( gpu, vars, sizeof(unsigned char)*4096000, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("copy input from CPU to GPU: %f ms, %i\n", dt_ms, cudaGetLastError());
    
    dim3 dimBlock(32);
    printf("thread count: %i\n", dimBlock.x);
    dim3 dimGrid(2048000/dimBlock.x);
    printf("block count: %i\n", dimGrid.x);
    
    // --- KERNEL ---
    cudaEventRecord(start, 0);
    logical_and<<<dimGrid, dimBlock>>>(gpu, gpures, 2048000);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("GPU kernel: %f ms, %i\n", dt_ms, cudaGetLastError());
    
    // result from GPU to CPU
    cudaEventRecord(start, 0);
    cudaMemcpy( output, gpures, sizeof(unsigned char)*2048000, cudaMemcpyDeviceToHost );
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&dt_ms, start, stop);
    printf("copy output from GPU to CPU: %f ms, %i\n", dt_ms, cudaGetLastError());
    
    
    cudaFree(gpu);
    cudaFree(gpures);
    
}

Compile:

 nvmex -f nvmexopts_9.bat testGPU.cu 
-I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include" 
-L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\lib\x64" -lcudart -lcufft

Output:

GPU input malloc: 0.772160 ms, 0
GPU output malloc: 0.041728 ms, 0
copy input from CPU to GPU: 1.494784 ms, 0
thread count: 32
block count: 64000
*** GPU kernel: 3.761216 ms, 0 ***
copy output from GPU to CPU: 1.203488 ms, 0

Is that code OK? CPU was ~0.1ms faster than CUDA kernel. I tried different thread counts (multipliers of 32) up to 512, 32 was fastest. Operator & instead of && was almost 1ms slower.

Is 9800GT really so weak? What speed-up can I expect with today’s mainstream card (ie. GTX460,560)?

Thank you

EDIT: based on talonmies’ comment, I made these modifications:

Kernel function:

__global__ void logical_and(uchar4* in, uchar4* out, int N) {
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    out[idx].x = in[idx].x & in[idx+N].x;
    out[idx].y = in[idx].y & in[idx+N].y;
    out[idx].z = in[idx].z & in[idx+N].z;
    out[idx].w = in[idx].w & in[idx+N].w;
}

Main function:

uchar4 *gpu, *gpures;

// 32 was worst, 64,128,256,512 were similar
dim3 dimBlock(128);
// block count is now 4xtimes smaller
dim3 dimGrid(512000/dimBlock.x);

Output:

GPU input malloc: 0.043360 ms, 0
GPU output malloc: 0.038592 ms, 0
copy input from CPU to GPU: 1.499584 ms, 0
thread count: 128
block count: 4000
*** GPU kernel: 0.131296 ms, 0 ***
copy output from GPU to CPU: 1.281120 ms, 0

Is that correct? Almost 30x speed-up! It seems too good to be true, but result is correct 🙂
How faster will be GTX560 on this particular task? Thx

Edit 2:

Is this code

__global__ void logical_and(uchar4* in, uchar4* out, int N) {
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    
    out[idx].x = in[idx].x & in[idx+N].x;
    out[idx].y = in[idx].y & in[idx+N].y;
    out[idx].z = in[idx].z & in[idx+N].z;
    out[idx].w = in[idx].w & in[idx+N].w;
}

automatically transformed to:

__global__ void logical_and(uchar4* in, uchar4* out, int N) {
    int idx = blockIdx.x*blockDim.x+threadIdx.x;  
    uchar4 buff;

    buff.x = in[idx].x;
    buff.y = in[idx].y;
    buff.z = in[idx].z;
    buff.w = in[idx].w;

    buff.x &= in[idx+N].x;
    buff.y &= in[idx+N].y;
    buff.z &= in[idx+N].z;
    buff.w &= in[idx+N].w;

    out[idx].x = buff.x;
    out[idx].y = buff.y;
    out[idx].z = buff.z;
    out[idx].w = buff.w;
}

by compiler?

If it is correct, it explains my confusion about coalesced access. I thought that in[idx] & in[idx+N] leads to non-coalesced access, because of accessing non-contiguous memory. But in fact, in[idx] and in[idx+N] are loaded in two coalesced steps. N can be any multiple of 16, because uchar4 is 4 bytes long, and for coalesced access address must be aligned to 64 bytes (on 1.1 device). Am I right?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-04T19:18:03+00:00Added an answer on June 4, 2026 at 7:18 pm

    As talonmies pointed out, you’re accessing and processing your data byte-wise, which is far from optimal. A collection of techniques you may want to consider, such as Instruction-Level Parallelism and buffered read/writes, are summarized in the nVidia Webinar Better Performance at Lower Occupancy by Vasily Volkov.

    In a nutshell, what you want to do is, in each thread, read several uint4 in a coalesced way, process them, and only then store them.

    Update

    Does it make any difference if you re-write your code as follows?

    __global__ void logical_and(unsigned int* in, unsigned int* out, int N) {
        int idx = blockIdx.x*blockDim.x*chunksize+threadIdx.x;
        unsigned int buff[chunksize];
        #pragma unroll
        for ( int k = 0 ; k < chunksize ; k++ )
            buff[k] = in[ blockDim.x*k + idx ];
        #pragma unroll
        for ( int k = 0 ; k < chunksize ; k++ )
            buff[k] &= in[ blockDim.x*k + idx + N ];
        #pragma unroll
        for ( int k = 0 ; k < chunksize ; k++ )
            out[ blockDim.x*k + idx ] = buff[k];
    }
    

    Note that I’ve assumed chunksize is a variable you’ve #defined somewhere, e.g.

    #define chunksize 4
    

    And that you have to divide the number of blocks you launch and N by that number. I’ve also used unsigned int which is just four packed uchar. In your calling function, you may have to cast your pointers accordingly.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a string like this: La Torre Eiffel paragonata all&#8217;Everest What PHP function
I'm new to using the Perl treebuilder module for HTML parsing and can't figure
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I would like to count the length of a string with PHP. The string
I want use html5's new tag to play a wav file (currently only supported
I am doing a simple coin flipping experiment for class that involves flipping a
In my XML file chapters tag has more chapter tag.i need to display chapters
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
I need to clean up various Word 'smart' characters in user input, including but
I have a text area in my form which accepts all possible characters from

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.