I am reading the CUDA 5.0 samples (AdvancedQuickSort) now. However, I cannot understand this

Asked: June 15, 2026 (2026-06-15T07:09:10+00:00)

I am reading the CUDA 5.0 samples (AdvancedQuickSort) now. However, I cannot fully understand this sample because of the following code:

// Now compute my own personal offset within this. I need to know how many
// threads with a lane ID less than mine are going to write to the same buffer
// as me. We can use popc to implement a single-operation warp scan in this case.
unsigned lane_mask_lt;
asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt) );
unsigned int my_mask = greater ? gt_mask : lt_mask;
unsigned int my_offset = __popc(my_mask & lane_mask_lt);

This code is in the __global__ void qsort_warp function. I am especially confused by the inline assembly. Can anyone explain what this assembly does?

1 Answer

Editorial Team · Answer 1 · 2026-06-15T07:09:11+00:00

%lanemask_lt is a special, read-only register in PTX assembly. It holds a 32-bit mask with bits set in every position below the thread’s lane number in the warp; for lane 5, for example, the mask is 0x0000001F. The inline PTX you posted simply reads that register and stores its value in a variable, where the subsequent C++ code you posted can use it.
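To make the arithmetic concrete, here is a minimal host-side C++ sketch that emulates what the snippet in the question computes for one lane. It is not the sample's code: the helper names lanemask_lt, popc and warp_offset are hypothetical, std::bitset stands in for the device intrinsic __popc, and the lane number is passed in as a plain argument instead of being read from a hardware register.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for the PTX special register %lanemask_lt:
// bits 0 .. lane-1 are set, every other bit is clear.
inline std::uint32_t lanemask_lt(unsigned lane) {
    assert(lane < 32);  // a warp has 32 lanes, numbered 0..31
    return (std::uint32_t{1} << lane) - 1u;
}

// Hypothetical host-side stand-in for the device intrinsic __popc:
// the number of set bits in a 32-bit word.
inline unsigned popc(std::uint32_t x) {
    return static_cast<unsigned>(std::bitset<32>(x).count());
}

// Mirrors `__popc(my_mask & lane_mask_lt)` from the question: counts the
// lanes with a lower lane ID than `lane` whose bit is set in `my_mask`,
// i.e. how many lower lanes write to the same buffer.
inline unsigned warp_offset(std::uint32_t my_mask, unsigned lane) {
    return popc(my_mask & lanemask_lt(lane));
}
```

As a usage example, suppose lanes 2, 4, 5 and 7 all write to the same buffer, so my_mask is 0xB4. Then warp_offset(0xB4, 2), warp_offset(0xB4, 4), warp_offset(0xB4, 5) and warp_offset(0xB4, 7) evaluate to 0, 1, 2 and 3, which gives each of those lanes a distinct slot without any communication between threads.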

Every version of the CUDA toolkit ships with a PTX assembly language reference guide that you can use to look up things like this.
