The Archive Base


Editorial Team
Asked: June 11, 2026 (2026-06-11T15:44:09+00:00)

I am looking into how to copy a 2D array of variable width for each row into the GPU.

int rows = 1000;
int cols;
int** host_matrix = (int**)malloc(sizeof(int*)*rows);
int *d_array;
int *length;

...

Each host_matrix[i] might have a different length, which I know as length[i], and that is where the problem starts. I would like to avoid copying dummy data. Is there a better way of doing it?

According to this thread, that won’t be a clever way of doing it:

cudaMalloc(d_array, rows*sizeof(int*));  
for(int i = 0 ; i < rows ; i++)    {  
    cudaMalloc((void **)&d_array[i], length[i] * sizeof(int)); 
}  

But I cannot think of any other method. Is there any other smarter way of doing it?
Can it be improved using cudaMallocPitch and cudaMemcpy2D?


1 Answer

  1. Editorial Team
    Added an answer on June 11, 2026 at 3:44 pm (2026-06-11T15:44:10+00:00)

    The correct way to allocate an array of pointers for the GPU in CUDA is something like this:

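    int **hd_array, **d_array;
    // Host-side staging array that will hold the device row pointers
    hd_array = (int **)malloc(nrows*sizeof(int*));
    // Device-side array of row pointers
    cudaMalloc((void **)&d_array, nrows*sizeof(int*));
    for(int i = 0 ; i < nrows ; i++)    {
        // One device allocation per row; the returned pointer is stored on the host
        cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int));
    }
    // Copy the table of device pointers to the device
    cudaMemcpy(d_array, hd_array, nrows*sizeof(int*), cudaMemcpyHostToDevice);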
    

    (disclaimer: written in browser, never compiled, never tested, use at own risk)

    The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.

    If you have very jagged data and need to store it on the device, might I suggest taking a cue from the mother of all jagged data problems (large, unstructured sparse matrices) and copying one of the sparse matrix formats for your data instead. Using the classic compressed sparse row format as a model, you could do something like this:

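    int * data, * rows, * lengths;

    // rows[i]: start offset of row i within data; lengths[i]: its element count
    cudaMalloc((void **)&rows, nrows*sizeof(int));
    cudaMalloc((void **)&lengths, nrows*sizeof(int));
    // N is the total number of elements over all rows
    cudaMalloc((void **)&data, N*sizeof(int));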
    

    In this scheme, all the data is stored in a single, linear memory allocation, data. The ith row of the jagged array starts at data[rows[i]] and has a length of lengths[i]. This means you only need three memory allocation and copy operations to transfer any amount of data to the device, rather than one per row (plus one for the pointer array) in your current scheme, i.e. it reduces the number of allocation and copy calls from O(nrows) to O(1).
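As a rough sketch of the host-side packing step this scheme implies (not part of the original answer, and untested on a GPU): the code below flattens a jagged host array into the three buffers data, rows and lengths. It is plain C++ that should also compile in a .cu file. The names PackedRows, pack_rows and element_at are hypothetical, and using std::vector for the host storage is an assumption made for brevity.

```cpp
#include <cassert>
#include <vector>

// Flattened (CSR-style) form of a jagged 2D array.
// Field names mirror the answer: data, rows (start offsets), lengths.
struct PackedRows {
    std::vector<int> data;     // all elements, stored row after row
    std::vector<int> rows;     // rows[i] = index in data where row i starts
    std::vector<int> lengths;  // lengths[i] = number of elements in row i
};

// Hypothetical helper: pack a jagged host array into three flat buffers.
PackedRows pack_rows(const std::vector<std::vector<int>>& host_matrix) {
    PackedRows p;
    int offset = 0;
    for (const std::vector<int>& row : host_matrix) {
        p.rows.push_back(offset);
        p.lengths.push_back(static_cast<int>(row.size()));
        p.data.insert(p.data.end(), row.begin(), row.end());
        offset += static_cast<int>(row.size());
    }
    return p;
}

// Element j of row i, using the same indexing a kernel would use:
// data[rows[i] + j] for 0 <= j < lengths[i].
int element_at(const PackedRows& p, int i, int j) {
    assert(i >= 0 && i < static_cast<int>(p.rows.size()));
    assert(j >= 0 && j < p.lengths[i]);
    return p.data[p.rows[i] + j];
}
```

After packing, each of the three buffers could be sent to the device with a single cudaMemcpy from p.data.data(), p.rows.data() and p.lengths.data() into the allocations shown above, with N equal to p.data.size(). A kernel handling row i would then loop j from 0 to lengths[i] and read data[rows[i] + j].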
