The Archive Base

Editorial Team
Asked: May 21, 2026

I just learned about Nvidia’s Thrust library. To try it out, I wrote a small example that is supposed to normalize a bunch of vectors.

#include <cstdio>

#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor: public thrust::unary_function<double4, double4>
{
    __device__ __host__ double4 operator()(double4 v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
    }
};

int main()
{
    thrust::host_vector<double4> v(2);
    v[0].x = 1; v[0].y = 2; v[0].z = 3;
    v[1].x = 4; v[1].y = 5; v[1].z = 6;

    thrust::device_vector<double4> v_d = v; 
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    // This doesn't seem to copy back
    v = v_d;

    // Neither this does..
    thrust::host_vector<double4> result = v_d;

    for(int i=0; i<v.size(); i++)
        printf("[ %f %f %f ]\n", result[i].x, result[i].y, result[i].z);

    return 0;
}

The example above seems to work; however, I’m unable to copy the data back. I thought a simple assignment would invoke a cudaMemcpy. It works for copying the data from the host to the device, but not back.

Secondly, I’m not sure whether I’m doing this the right way. The documentation of for_each says:

for_each applies the function object f to each element in the range [first, last); f’s return value, if any, is ignored.

But the unary_function struct template expects two template arguments (one for the return value) and forces operator() to return a value as well, which results in a warning when compiling. I don’t see how I’m supposed to write a unary functor with no return value.

Next is the data arrangement. I chose double4 since, IIRC, this results in two fetch instructions, ld.v2.f64 and ld.f64. However, I’m wondering how Thrust fetches data under the hood (and how many CUDA threads/blocks are created). If I instead chose a struct of 4 vectors, would it be able to fetch data in a coalesced way?

Finally, Thrust provides tuples. What about an array of tuples? How would the data be arranged in this case?

I looked through the examples, but I haven’t found one that explains which data structure to choose for a bunch of vectors. The dot_products_with_zip.cu example says something about “structure of arrays” instead of “arrays of structures”, but I see no structures used in the example.
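To make the two layouts concrete, here is a minimal host-only C++ sketch. It is not Thrust code and says nothing about how Thrust schedules threads; the names `Vec3AoS`, `Vec3SoA`, `normalize_soa`, `aos_x_stride`, and `soa_x_stride` are hypothetical and chosen only for illustration. In the array-of-structures form the x components of neighbouring elements are a whole struct apart, whereas in the structure-of-arrays form each component lives in its own contiguous array, so no struct type is needed at all.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Array of structures (AoS): one struct per vector, components interleaved.
struct Vec3AoS {
    double x, y, z;
};

// Structure of arrays (SoA): one contiguous array per component.
struct Vec3SoA {
    std::vector<double> x, y, z;

    explicit Vec3SoA(std::size_t n) : x(n), y(n), z(n) {}
    std::size_t size() const { return x.size(); }
};

// Normalize every vector of an SoA container in place.
void normalize_soa(Vec3SoA& v)
{
    for (std::size_t i = 0; i < v.size(); ++i) {
        const double len =
            std::sqrt(v.x[i] * v.x[i] + v.y[i] * v.y[i] + v.z[i] * v.z[i]);
        v.x[i] /= len;
        v.y[i] /= len;
        v.z[i] /= len;
    }
}

// Distance in bytes between the x components of two neighbouring elements.
std::size_t aos_x_stride() { return sizeof(Vec3AoS); }
std::size_t soa_x_stride() { return sizeof(double); }
```

With the SoA layout, element i and element i + 1 read adjacent doubles from each component array, which is the access pattern usually associated with coalesced loads on a GPU.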

Update

I fixed the code above and tried to run a larger example, this time normalizing 10k vectors.

#include <cmath>    // sqrt
#include <cstdlib>  // rand

#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/host_vector.h>

struct normalize_functor
{
    __device__ __host__ void operator()(double4& v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};

int main()
{
    int n = 10000;
    thrust::host_vector<double4> v(n);
    for(int i=0; i<n; i++) {
        v[i].x = rand();
        v[i].y = rand();
        v[i].z = rand();
    }

    thrust::device_vector<double4> v_d = v;

    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    v = v_d;

    return 0;
}

Profiling with computeprof shows me a low occupancy and non-coalesced memory access:

Kernel Occupancy Analysis

Kernel details : Grid size: 23 x 1 x 1, Block size: 448 x 1 x 1
Register Ratio      = 0.984375  ( 32256 / 32768 ) [24 registers per thread] 
Shared Memory Ratio     = 0 ( 0 / 49152 ) [0 bytes per Block] 
Active Blocks per SM        = 3 / 8
Active threads per SM       = 1344 / 1536
Potential Occupancy     = 0.875  ( 42 / 48 )
Max achieved occupancy  = 0.583333  (on 9 SMs)
Min achieved occupancy  = 0.291667  (on 5 SMs)
Occupancy limiting factor   = Block-Size

Memory Throughput Analysis for kernel launch_closure_by_value on device GeForce GTX 470

Kernel requested global memory read throughput(GB/s): 29.21
Kernel requested global memory write throughput(GB/s): 17.52
Kernel requested global memory throughput(GB/s): 46.73
L1 cache read throughput(GB/s): 100.40
L1 cache global hit ratio (%): 48.15
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 42.44
L2 cache global memory write throughput(GB/s): 46.73
L2 cache global memory throughput(GB/s): 89.17
L2 cache read hit ratio(%): 88.86
L2 cache write hit ratio(%): 3.09
Local memory bus traffic(%): 0.00
Global memory excess load(%): 31.18
Global memory excess store(%): 62.50
Achieved global memory read throughput(GB/s): 4.73
Achieved global memory write throughput(GB/s): 45.29
Achieved global memory throughput(GB/s): 50.01
Peak global memory throughput(GB/s): 133.92

How can I optimize this?
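The occupancy figures in the profiler output can be reproduced with a little arithmetic, which hints at why the achieved occupancy is well below the potential one. The sketch below is a plain C++ check, not profiler output. The helper names are hypothetical, the warp size of 32 is an assumption not stated in the output (it is the standard value on NVIDIA GPUs), and the reading that some SMs received two blocks and others one is an inference from the numbers.

```cpp
#include <cstdio>

// Values copied from the profiler output.
constexpr int kBlockSize       = 448;   // threads per block
constexpr int kGridSize        = 23;    // blocks in the grid
constexpr int kMaxThreadsPerSM = 1536;  // from "Active threads per SM = 1344 / 1536"
constexpr int kWarpSize        = 32;    // assumption: standard warp size

constexpr int kMaxWarpsPerSM = kMaxThreadsPerSM / kWarpSize;

// Warps resident on one SM when it holds `blocks` blocks.
constexpr int warps_for_blocks(int blocks)
{
    return blocks * kBlockSize / kWarpSize;
}

// Occupancy = resident warps / maximum warps per SM.
constexpr double occupancy(int blocks)
{
    return static_cast<double>(warps_for_blocks(blocks)) / kMaxWarpsPerSM;
}

// Total number of threads in the launch.
constexpr int threads_launched() { return kGridSize * kBlockSize; }

// Print the occupancy for one, two, and three resident blocks per SM.
void print_occupancy_table()
{
    for (int blocks = 1; blocks <= 3; ++blocks)
        std::printf("%d block(s) per SM: %d warps, occupancy %.6f\n",
                    blocks, warps_for_blocks(blocks), occupancy(blocks));
}
```

Three resident blocks give 42 of 48 warps (0.875, the potential occupancy), two give 28 (0.583333, the reported maximum), and one gives 14 (0.291667, the reported minimum). Read this way, 9 SMs holding two blocks and 5 SMs holding one account for all 23 blocks (9 × 2 + 5 × 1 = 23). If that reading is right, 10,000 elements are too few to place three blocks on every SM, so a larger input may raise the achieved occupancy.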

1 Answer

  1. Editorial Team
    Added an answer on May 21, 2026 at 11:52 am

    If you want to modify a sequence in-place with for_each then you’ll need to take the argument by reference in the functor:

    struct normalize_functor
    {
        __device__ __host__ void operator()(double4& ref)
        {
            // Work on a local copy, then write the result back through the reference.
            double4 v = ref;
            double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
            v.x /= len;
            v.y /= len;
            v.z /= len;
            printf("%f %f %f\n", v.x, v.y, v.z);
            ref = v;
        }
    };
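Why the reference matters can be seen in a host-only analogue that uses std::for_each from the C++ standard library, whose interface thrust::for_each mirrors. This is a sketch rather than Thrust code: `Vec3`, `normalize_by_value`, `normalize_by_reference`, and `apply_for_each` are hypothetical names, and the plain struct merely stands in for double4. A functor that takes its argument by value normalizes a temporary copy, so the sequence is left unchanged; taking the argument by reference writes through to the element.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Stand-in for double4 (hypothetical; only x, y, z are used here).
struct Vec3 {
    double x, y, z;
};

// Takes a copy: the element in the sequence is never modified.
struct normalize_by_value {
    void operator()(Vec3 v) const
    {
        const double len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        v.x /= len;  // changes only the local copy
        v.y /= len;
        v.z /= len;
    }
};

// Takes a reference: the element in the sequence is updated in place.
struct normalize_by_reference {
    void operator()(Vec3& v) const
    {
        const double len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};

// Apply a functor to every element of a copy of `data` and return that copy.
template <typename Functor>
std::vector<Vec3> apply_for_each(std::vector<Vec3> data, Functor f)
{
    std::for_each(data.begin(), data.end(), f);
    return data;
}
```

Calling `apply_for_each` with `normalize_by_value` returns the input unchanged, while `normalize_by_reference` returns unit-length vectors. That is consistent with the symptom in the question, where the copy back from the device appeared not to work because the device data had never been modified.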
    

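    Alternatively, you could use your original by-value definition of normalize_functor (with a return v; added at the end of operator(), since transform stores the functor’s return value) with the transform algorithm, specifying v_d as both the source and destination range: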

    thrust::transform(v_d.begin(), v_d.end(), v_d.begin(), normalize_functor());
    

    My personal preference is to use transform in this situation, but the performance ought to be the same in either case.
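As a host-only analogue of the transform variant, the sketch below uses std::transform with the same range as both source and destination. Again, this is not Thrust code: `Vec3`, `normalize_value_functor`, and `normalize_in_place` are hypothetical names. The point it illustrates is that a functor used with transform has to return the normalized value, because transform assigns the return value to the output element.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Stand-in for double4 (hypothetical; only x, y, z are used here).
struct Vec3 {
    double x, y, z;
};

// By-value functor: normalizes its copy and returns the result.
struct normalize_value_functor {
    Vec3 operator()(Vec3 v) const
    {
        const double len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        return v;  // transform writes this value to the destination
    }
};

// Source and destination are the same range, so the update is in place.
void normalize_in_place(std::vector<Vec3>& data)
{
    std::transform(data.begin(), data.end(), data.begin(),
                   normalize_value_functor());
}
```

The functor no longer needs a reference parameter, since the write-back happens through the output iterator rather than inside operator().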
