The Archive Base

Editorial Team
Asked: May 18, 2026


Summary:

memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?

Full details:

As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can’t get anywhere near those speeds with memcpy, even in a simple test program.

I have isolated the performance issue by adding/removing the memcpy call inside the buffer processing code. Without the memcpy, I can run at the full data rate, about 3GB/sec. With the memcpy enabled, I am limited to about 550MB/sec (using my current compiler).

In order to benchmark memcpy on my system, I’ve written a separate test program that just calls memcpy on some blocks of data. (I’ve posted the code below) I’ve run this both in the compiler/IDE that I’m using (National Instruments CVI) as well as Visual Studio 2010. While I’m not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.

Visual C++ 2010: 1900 MB/sec

NI CVI 2009: 550 MB/sec

I am not surprised that CVI is significantly slower than Visual Studio, but I am surprised that memcpy performance is this low. I’m not sure the numbers are directly comparable, but both results are far below the EVEREST benchmark bandwidth. I don’t need quite that level of performance, but a minimum of 3GB/sec is necessary. Surely the standard library implementation can’t be this much worse than whatever EVEREST is using!

What, if anything, can I do to make memcpy faster in this situation?


Hardware details:
AMD Magny-Cours, 4x octal core
128 GB DDR3
Windows Server 2003 Enterprise X64

Test program:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free, rand
#include <string.h>   // memcpy

const size_t NUM_ELEMENTS = 2*1024 * 1024;
const size_t ITERATIONS = 10000;

int main (int argc, char *argv[])
{
    LARGE_INTEGER start, stop, frequency;

    QueryPerformanceFrequency(&frequency);

    unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);

    if(src == NULL || dest == NULL)
    {
        printf("Allocation failed\n");
        return 1;
    }

    // Fill the source block with arbitrary data
    for(size_t ctr = 0; ctr < NUM_ELEMENTS; ctr++)
    {
        src[ctr] = (unsigned short) rand();
    }

    QueryPerformanceCounter(&start);

    for(size_t iter = 0; iter < ITERATIONS; iter++)
        memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));

    QueryPerformanceCounter(&stop);

    __int64 duration = stop.QuadPart - start.QuadPart;

    // Elapsed time in seconds
    double duration_d = (double)duration / (double) frequency.QuadPart;

    // Throughput in MB/sec (1 MB = 1024*1024 bytes), computed in floating point
    double mb_sec = ((double) ITERATIONS * NUM_ELEMENTS * sizeof(unsigned short) / (1024.0 * 1024.0)) / duration_d;

    printf("Duration: %.5lfs for %u iterations, %.3lfMB/sec\n", duration_d, (unsigned) ITERATIONS, mb_sec);

    free(src);
    free(dest);

    getchar();

    return 0;
}

EDIT: If you have an extra five minutes and want to contribute, can you run the above code on your machine and post your time as a comment?
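For anyone who wants to try this on a machine without Windows, the same measurement can be sketched with standard C++11 only. This is a minimal sketch, not the code used for the numbers in this post: the function name `memcpy_mb_per_sec` and the checksum are assumptions added for illustration, and `std::chrono::steady_clock` stands in for `QueryPerformanceCounter`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original post): times `iterations`
// copies of `bytes` bytes and returns throughput in MB/sec (1 MB = 2^20 bytes).
double memcpy_mb_per_sec(std::size_t bytes, std::size_t iterations)
{
    assert(bytes > 0 && iterations > 0);

    std::vector<unsigned char> src(bytes, 0x5A);
    std::vector<unsigned char> dest(bytes, 0);

    unsigned long checksum = 0;
    const auto start = std::chrono::steady_clock::now();

    for (std::size_t iter = 0; iter < iterations; ++iter)
    {
        src[iter % bytes] = static_cast<unsigned char>(iter);  // vary the input slightly
        std::memcpy(dest.data(), src.data(), bytes);
        checksum += dest[iter % bytes];                        // read back one copied byte
    }

    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();

    // The checksum exists only so that every copy has an observable effect.
    volatile unsigned long sink = checksum;
    (void) sink;

    const double megabytes =
        static_cast<double>(bytes) * static_cast<double>(iterations) / (1024.0 * 1024.0);
    return megabytes / seconds;
}
```

Calling `memcpy_mb_per_sec(4 * 1024 * 1024, 10000)` corresponds to the 4MB block and 10000 iterations used above. The result depends heavily on compiler, optimization flags and hardware, so it is only comparable to other runs of the same build.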

1 Answer

  1. Editorial Team
    Added an answer on May 18, 2026 at 7:21 am

    I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.

    Performance (10000x 4MB block memcpy):
    
     1 thread :  1826 MB/sec
     2 threads:  3118 MB/sec
     3 threads:  4121 MB/sec
     4 threads: 10020 MB/sec
     5 threads: 12848 MB/sec
     6 threads: 14340 MB/sec
     8 threads: 17892 MB/sec
    10 threads: 21781 MB/sec
    12 threads: 25721 MB/sec
    14 threads: 25318 MB/sec
    16 threads: 19965 MB/sec
    24 threads: 13158 MB/sec
    32 threads: 12497 MB/sec
    

    I don’t understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?
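The size of that jump can be put in numbers by dividing each result by the single-threaded one. The sketch below uses only the figures from the table; the names `Sample`, `speedup` and `efficiency` are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// One row of the table above: thread count and measured throughput.
struct Sample { int threads; double mb_per_sec; };

static const Sample kSamples[] = {
    {1, 1826.0},   {2, 3118.0},   {3, 4121.0},   {4, 10020.0},  {5, 12848.0},
    {6, 14340.0},  {8, 17892.0},  {10, 21781.0}, {12, 25721.0}, {14, 25318.0},
    {16, 19965.0}, {24, 13158.0}, {32, 12497.0},
};
static const int kNumSamples = sizeof(kSamples) / sizeof(kSamples[0]);

// Throughput relative to the single-threaded result.
double speedup(const Sample & s)
{
    return s.mb_per_sec / kSamples[0].mb_per_sec;
}

// Speedup per thread; 1.0 would be perfectly linear scaling.
double efficiency(const Sample & s)
{
    assert(s.threads > 0);
    return speedup(s) / s.threads;
}

// Prints one line per table row.
void print_scaling()
{
    for (int i = 0; i < kNumSamples; ++i)
        std::printf("%2d threads: speedup %5.2fx, efficiency %4.2f\n",
                    kSamples[i].threads, speedup(kSamples[i]), efficiency(kSamples[i]));
}
```

With these figures, 3 threads reach roughly 2.3x the single-threaded rate and 4 threads roughly 5.5x. Four threads delivering more than four times the throughput means each thread is itself copying faster than the lone thread did, which adding cores alone would not explain. The table by itself cannot say why.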

    I’ve included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; you may need to add it for your application.

    #include <windows.h>
    #include <string.h>   // memcpy

    #define NUM_CPY_THREADS 4   // must not exceed MAXIMUM_WAIT_OBJECTS (64)
    
    HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
    typedef struct
    {
        int ct;              // index of this worker thread
        int quit;            // set to 1 to ask the worker to exit
        void * src, * dest;  // chunk assigned to this worker
        size_t size;         // chunk length in bytes
    } mt_cpy_t;
    
    mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};
    
    //worker: sleeps on its start semaphore, copies its chunk, signals its stop semaphore
    DWORD WINAPI thread_copy_proc(LPVOID param)
    {
        mt_cpy_t * p = (mt_cpy_t * ) param;
    
        while(1)
        {
            WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
            if(p->quit)
                break;
            memcpy(p->dest, p->src, p->size);
            ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
        }
    
        return 0;
    }
    
    //create the semaphores and worker threads; call once before mt_memcpy
    int startCopyThreads()
    {
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            mtParamters[ctr].ct = ctr;
            mtParamters[ctr].quit = 0;
            hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL);
        }
    
        return 0;
    }
    
    void * mt_memcpy(void * dest, void * src, size_t bytes)
    {
        //set up parameters: chunk ctr covers [ctr*bytes/N, (ctr+1)*bytes/N)
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
            mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
            mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
        }
    
        //release semaphores to start computation
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
            ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);
    
        //wait for all threads to finish
        WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);
    
        return dest;
    }
    
    int stopCopyThreads()
    {
        //ask every worker to leave its loop, then wake it up
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            mtParamters[ctr].quit = 1;
            ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);
        }
    
        //wait for the workers to exit on their own instead of calling TerminateThread
        WaitForMultipleObjects(NUM_CPY_THREADS, hCopyThreads, TRUE, INFINITE);
    
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            CloseHandle(hCopyThreads[ctr]);
            CloseHandle(hCopyStartSemaphores[ctr]);
            CloseHandle(hCopyStopSemaphores[ctr]);
        }
        return 0;
    }
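For readers who are not on Windows, the same splitting scheme can be sketched with `std::thread` from C++11. This is an illustrative variant under stated assumptions, not the code that produced the numbers above: `chunk_begin` and `mt_memcpy_portable` are hypothetical names, and unlike `mt_memcpy` this version starts fresh threads on every call instead of waking a pool of waiting ones, so it pays thread start-up cost each time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Start offset of chunk `index` when `bytes` is split into `parts` chunks.
// Same arithmetic as mt_memcpy: chunk i covers [i*bytes/parts, (i+1)*bytes/parts).
std::size_t chunk_begin(std::size_t index, std::size_t bytes, std::size_t parts)
{
    assert(parts > 0 && index <= parts);
    return index * bytes / parts;
}

// Hypothetical portable variant: one short-lived thread per chunk.
void * mt_memcpy_portable(void * dest, const void * src, std::size_t bytes, std::size_t parts)
{
    std::vector<std::thread> workers;
    workers.reserve(parts);

    for (std::size_t i = 0; i < parts; ++i)
    {
        const std::size_t begin = chunk_begin(i, bytes, parts);
        const std::size_t end = chunk_begin(i + 1, bytes, parts);

        // Each worker copies only its own byte range, so the ranges never overlap.
        workers.emplace_back([=] {
            std::memcpy(static_cast<char *>(dest) + begin,
                        static_cast<const char *>(src) + begin,
                        end - begin);
        });
    }

    // Wait for every chunk before returning, as mt_memcpy does.
    for (std::thread & t : workers)
        t.join();

    return dest;
}
```

The integer divisions are arranged so that rounding never drops or duplicates a byte. For 10 bytes in 4 parts the boundaries are 0, 2, 5, 7, 10, giving chunk sizes 2, 3, 2, 3. On some toolchains, threads also need to be enabled at link time (for example `-pthread` with GCC).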
    