
The Archive Base


Editorial Team
Asked: May 28, 2026


I have the following representative code:

__global__ void func()
{
    register ushort4 result = make_ushort4(__float2half_rn(0.5), __float2half_rn(0.5), __float2half_rn(0.5), __float2half_rn(1.0));
}

When compiling, result is stored in local memory. Is it possible to force this to registers? Local memory is too slow for the intended application.

Furthermore, this result must be stored to an array of var4 elements. I would like to store these results coalesced, like ((ushort4*)(output))[x + y * width] = result;. Another solution without var4 is also an option.

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Added an answer on May 28, 2026 at 4:51 pm

    A vector type should be compiled into registers if there are registers available to do so. Turning your snippet into something that will survive dead code removal:

    __global__ void func(ushort4 *out) 
    { 
        ushort4 result = make_ushort4(__float2half_rn(0.5), __float2half_rn(0.5), 
                __float2half_rn(0.5), __float2half_rn(1.0)); 
    
        out[threadIdx.x+blockDim.x*blockIdx.x] = result;
    } 
    

    and compiling it:

    >nvcc -cubin -arch=sm_20 -Xptxas="-v" ushort4.cu
    ushort4.cu
    ushort4.cu
    tmpxft_000010b8_00000000-3_ushort4.cudafe1.gpu
    tmpxft_000010b8_00000000-10_ushort4.cudafe2.gpu
    ptxas info    : Compiling entry function '_Z4funcP7ushort4' for 'sm_20'
    ptxas info    : Function properties for _Z4funcP7ushort4
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 8 registers, 36 bytes cmem[0]
    

    shows no spills (i.e. no local memory use). Further, disassembling the resulting cubin file shows:

    >cuobjdump --dump-sass ushort4.cubin
    
            code for sm_20
                    Function : _Z4funcP7ushort4
            /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
            /*0008*/     /*0x01101c041000cfc0*/     F2F.F16.F32 R0, 0x3f000;
            /*0010*/     /*0x94009c042c000000*/     S2R R2, SR_CTAid_X;
            /*0018*/     /*0x8400dc042c000000*/     S2R R3, SR_Tid_X;
            /*0020*/     /*0x01111c041000cfe0*/     F2F.F16.F32 R4, 0x3f800;
            /*0028*/     /*0x00915c041c000000*/     I2I.U16.U16 R5, R0;
            /*0030*/     /*0x20209c0320064000*/     IMAD.U32.U32 R2, R2, c [0x0] [0x8], R3;
            /*0038*/     /*0x40019c03280ac040*/     BFI R6, R0, 0x1010, R5;
            /*0040*/     /*0x4041dc03280ac040*/     BFI R7, R4, 0x1010, R5;
            /*0048*/     /*0x80201c6340004000*/     ISCADD R0, R2, c [0x0] [0x20], 0x3;
            /*0050*/     /*0x00019ca590000000*/     ST.64 [R0], R6;
            /*0058*/     /*0x00001de780000000*/     EXIT;
                    .................................
    

    i.e. the ushort4 is packed into a pair of registers (R6 and R7) and then a single 64-bit store writes the packed vector out to global memory. No local memory access to be seen.
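The coalesced 2D store from the question can be sketched along the same lines. This is only an illustration, assuming a recent toolkit in which `cuda_fp16.h` declares `__float2half_rn` as returning `__half`, so the bits are reinterpreted with `__half_as_ushort`. The kernel name `fill_half4`, the helper `run_fill_half4`, and the image dimensions are hypothetical and do not come from the original post.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per ushort4 element of a width x height image.
__global__ void fill_half4(ushort4 *output, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Assumption: __float2half_rn returns __half here, so take its raw 16 bits.
    unsigned short h05 = __half_as_ushort(__float2half_rn(0.5f));
    unsigned short h10 = __half_as_ushort(__float2half_rn(1.0f));
    ushort4 result = make_ushort4(h05, h05, h05, h10);

    // Neighbouring threads in x write neighbouring 8-byte elements.
    output[x + y * width] = result;
}

// Hypothetical host helper: returns 0 if every element holds the expected half bits.
int run_fill_half4()
{
    const int width = 64, height = 8;
    const size_t count = static_cast<size_t>(width) * height;

    ushort4 *d_out = nullptr;
    if (cudaMalloc(&d_out, count * sizeof(ushort4)) != cudaSuccess) return 1;

    dim3 block(32, 4);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fill_half4<<<grid, block>>>(d_out, width, height);

    std::vector<ushort4> h_out(count);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, count * sizeof(ushort4),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return 2;

    // IEEE half precision: 0.5 is 0x3800 and 1.0 is 0x3C00.
    for (size_t i = 0; i < count; ++i) {
        const ushort4 v = h_out[i];
        if (v.x != 0x3800 || v.y != 0x3800 || v.z != 0x3800 || v.w != 0x3C00)
            return 3;
    }
    return 0;
}

int main()
{
    int rc = run_fill_half4();
    printf("run_fill_half4 returned %d\n", rc);
    return rc;
}
```

Because `output` is typed as `ushort4 *`, no cast is needed at the store, and each thread's 8-byte write sits next to its neighbour's in the x direction, which is the access pattern the question asks for.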

    So if you have convinced yourself that you have a vector value compiling into local memory, it is either because you have a kernel with a lot of register pressure, or you are asking the compiler to (the volatile keyword will do that), or you have misinterpreted what the compiler/assembler are telling you at compile time.
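One way to cross-check what the compiler reports is to query the compiled kernel at runtime. The sketch below uses `cudaFuncGetAttributes`, whose `cudaFuncAttributes` result includes `numRegs` and `localSizeBytes`. It again assumes a recent toolkit where `__float2half_rn` returns `__half`; the helper name `report_local_memory` is hypothetical. The values printed depend on the toolkit version and target architecture, so none are shown here.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Same kernel as above, adapted for toolkits where __float2half_rn returns __half.
__global__ void func(ushort4 *out)
{
    unsigned short h05 = __half_as_ushort(__float2half_rn(0.5f));
    unsigned short h10 = __half_as_ushort(__float2half_rn(1.0f));
    ushort4 result = make_ushort4(h05, h05, h05, h10);

    out[threadIdx.x + blockDim.x * blockIdx.x] = result;
}

// Hypothetical helper: prints per-thread resource usage of func.
// Returns 0 if no local memory is used, 1 if some is, -1 on a runtime error.
int report_local_memory()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, func);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return attr.localSizeBytes == 0 ? 0 : 1;
}

int main()
{
    int rc = report_local_memory();
    return rc < 0 ? 1 : 0;
}
```

A non-zero `localSizeBytes` for a kernel like this would point to one of the causes listed above, such as register pressure elsewhere in the kernel or a `volatile` qualifier.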


    EDIT: Using the CUDA 4.0 release toolkit with Visual Studio Express 2008 and compiling on 32-bit Windows 7 for a compute 1.1 device gives:

    >nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2011 NVIDIA Corporation
    Built on Fri_May_13_02:42:40_PDT_2011
    Cuda compilation tools, release 4.0, V0.2.1221
    
    >cl.exe
    Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    usage: cl [ option... ] filename... [ /link linkoption... ]
    
    
    >nvcc -cubin -arch=sm_11 -Xptxas=-v ushort4.cu
    ushort4.cu
    ushort4.cu
    tmpxft_00001788_00000000-3_ushort4.cudafe1.gpu
    tmpxft_00001788_00000000-10_ushort4.cudafe2.gpu
    ptxas info    : Compiling entry function '_Z4funcP7ushort4' for 'sm_11'
    ptxas info    : Used 4 registers, 4+16 bytes smem
    

    which again reports no local memory use, the same conclusion as for the original build for a compute 2.0 target.
