I just learned about Nvidia’s thrust library. Just to try it wrote a small

Question

0

Asked: May 21, 20262026-05-21T11:52:05+00:00 2026-05-21T11:52:05+00:00

I just learned about Nvidia’s thrust library. Just to try it wrote a small

0

I just learned about Nvidia’s thrust library. Just to try it wrote a small example which is supposed to normalize a bunch of vectors.

#include <cstdio>

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor: public thrust::unary_function<double4, double4>
{
    __device__ __host__ double4 operator()(double4 v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
    }
};

int main()
{
    thrust::host_vector<double4> v(2);
    v[0].x = 1; v[0].y = 2; v[0].z = 3;
    v[1].x = 4; v[1].y = 5; v[1].z = 6;

    thrust::device_vector<double4> v_d = v; 
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    // This doesn't seem to copy back
    v = v_d;

    // Neither this does..
    thrust::host_vector<double4> result = v_d;

    for(int i=0; i<v.size(); i++)
        printf("[ %f %f %f ]\n", result[i].x, result[i].y, result[i].z);

    return 0;
}

The example above seems to work, however I’m unable to copy the data back.. I thought a simple assignment would invoke a cudaMemcpy. It works to copy the data from the host to the device but not back???

Secondly I’m not sure if I do this the right way. The documentation of for_each says:

for_each applies the function object f to each element in the range [first, last); f’s return value, if any, is ignored.

But the unary_function struct template expects two template arguments (one for the return value) and forces the operator() to also return a value, this results in a warning when compiling. I don’t see how I’m supposed to write an unary functor with no return value.

Next is the data arrangement. I just chose double4 since this will result in two fetch instructions ld.v2.f64 and ld.f64 IIRC. However I’m wondering how thrust fetches data under the hood (and how many cuda threads/blocks) are created. If I would chose instead a struct of 4 vectors would it be able to fetch data in a coalesced way.

Finally thrust provides tuples. What about an array of tuples? How would the data be arranged in this case.

I looked through the examples, but I haven’t found an example which explains which data structure to choose for a bunch of vectors (the dot_products_with_zip.cu example says something about “structure of arrays” instead of “arrays of structures” but I see no structures used in the example.

Update

I fixed the code above and tried to run a larger example, this time normalizing 10k vectors.

#include <cstdio>

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor
{
    __device__ __host__ void operator()(double4& v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};

int main()
{
    int n = 10000;
    thrust::host_vector<double4> v(n);
    for(int i=0; i<n; i++) {
        v[i].x = rand();
        v[i].y = rand();
        v[i].z = rand();
    }

    thrust::device_vector<double4> v_d = v;

    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    v = v_d;

    return 0;
}

Profiling with computeprof shows me a low occupancy and non-coalesced memory access:

Kernel Occupancy Analysis

Kernel details : Grid size: 23 x 1 x 1, Block size: 448 x 1 x 1
Register Ratio      = 0.984375  ( 32256 / 32768 ) [24 registers per thread] 
Shared Memory Ratio     = 0 ( 0 / 49152 ) [0 bytes per Block] 
Active Blocks per SM        = 3 / 8
Active threads per SM       = 1344 / 1536
Potential Occupancy     = 0.875  ( 42 / 48 )
Max achieved occupancy  = 0.583333  (on 9 SMs)
Min achieved occupancy  = 0.291667  (on 5 SMs)
Occupancy limiting factor   = Block-Size

Memory Throughput Analysis for kernel launch_closure_by_value on device GeForce GTX 470

Kernel requested global memory read throughput(GB/s): 29.21
Kernel requested global memory write throughput(GB/s): 17.52
Kernel requested global memory throughput(GB/s): 46.73
L1 cache read throughput(GB/s): 100.40
L1 cache global hit ratio (%): 48.15
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 42.44
L2 cache global memory write throughput(GB/s): 46.73
L2 cache global memory throughput(GB/s): 89.17
L2 cache read hit ratio(%): 88.86
L2 cache write hit ratio(%): 3.09
Local memory bus traffic(%): 0.00
Global memory excess load(%): 31.18
Global memory excess store(%): 62.50
Achieved global memory read throughput(GB/s): 4.73
Achieved global memory write throughput(GB/s): 45.29
Achieved global memory throughput(GB/s): 50.01
Peak global memory throughput(GB/s): 133.92

I wonder how I can optimized this?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-21T11:52:06+00:00

If you want to modify a sequence in-place with for_each then you’ll need to take the argument by reference in the functor:

struct normalize_functor
{
    __device__ __host__ void operator()(double4& ref)
    {
        double v = ref;
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
        ref = v;
    }
};

Alternatively, you could use your definition of normalize_functor with the transform algorithm, specifying v_d as both the source and destination range:

thrust::transform(v_d.begin(), v_d.end(), v_d.begin(), normalize_functor());

My personal preference is to use transform in this situation, but the performance ought to be the same in either case.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I just learned about Nvidia’s thrust library. Just to try it wrote a small

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply