I’m trying to improve my multiprocess application by using shared memory for communication. While profiling with some simple tests, something strange came up: when I copy the data stored in the shared memory, ReadProcessMemory is faster than memcpy.
I know I’m not supposed to use shared memory that way (it’s better to read directly from the shared memory), but I’m still wondering why this happens. Investigating further, another thing showed up: if I do two consecutive memcpy calls on the same shared memory area (in fact, the same mapped region), the second copy is about twice as fast as the first.
Here is sample code showing the problem. In this example there is only one process, but the problem is still there: doing a memcpy from the shared memory region is slower than doing a ReadProcessMemory of that same area on my own process!
#include <tchar.h>
#include <basetsd.h>
#include <cstdlib>   // malloc, free
#include <cstring>   // memcpy, memset
#include <iostream>
#include <time.h>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/interprocess/windows_shared_memory.hpp>
#include <boost/asio.hpp>
namespace bip = boost::interprocess;
bip::windows_shared_memory* AllocateSharedMemory(UINT32 a_UI32_Size)
{
    bip::windows_shared_memory* l_pShm = new bip::windows_shared_memory (bip::create_only, "Global\\testSharedMemory", bip::read_write, a_UI32_Size);
    bip::mapped_region l_region(*l_pShm, bip::read_write);
    std::memset(l_region.get_address(), 1, l_region.get_size());
    return l_pShm;
}
//Copy the shared memory with memcpy
void CopySharedMemory(UINT32 a_UI32_Size)
{
    bip::windows_shared_memory m_shm(bip::open_only, "Global\\testSharedMemory", bip::read_only);
    bip::mapped_region l_region(m_shm, bip::read_only);
    void* l_pData = malloc(a_UI32_Size);
    memcpy(l_pData, l_region.get_address(), a_UI32_Size);
    free(l_pData);
}
//Copy the shared memory with ReadProcessMemory
void ProcessCopySharedMemory(UINT32 a_UI32_Size)
{
    bip::windows_shared_memory m_shm(bip::open_only, "Global\\testSharedMemory", bip::read_only);
    bip::mapped_region l_region(m_shm, bip::read_only);
    void* l_pData = malloc(a_UI32_Size);
    HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, FALSE, GetCurrentProcessId());
    SIZE_T l_szt_CurRemote_Readsize = 0;
    ReadProcessMemory(hProcess,
                      l_region.get_address(),
                      l_pData,
                      a_UI32_Size,
                      &l_szt_CurRemote_Readsize);
    CloseHandle(hProcess); // release the process handle opened above
    free(l_pData);
}
// do 2 memcpy on the same shared memory
void CopySharedMemory2(UINT32 a_UI32_Size)
{
    bip::windows_shared_memory m_shm(bip::open_only, "Global\\testSharedMemory", bip::read_only);
    bip::mapped_region l_region(m_shm, bip::read_only);
    clock_t begin = clock();
    void* l_pData = malloc(a_UI32_Size);
    memcpy(l_pData, l_region.get_address(), a_UI32_Size);
    clock_t end = clock();
    std::cout << "FirstCopy: " << (end - begin) * 1000 / CLOCKS_PER_SEC << " ms" << std::endl;
    free(l_pData);
    begin = clock();
    l_pData = malloc(a_UI32_Size);
    memcpy(l_pData, l_region.get_address(), a_UI32_Size);
    end = clock();
    std::cout << "SecondCopy: " << (end - begin) * 1000 / CLOCKS_PER_SEC << " ms" << std::endl;
    free(l_pData);
}
int _tmain(int argc, _TCHAR* argv[])
{
    UINT32 l_UI32_Size = 1048576000;
    bip::windows_shared_memory* l_pShm = AllocateSharedMemory(l_UI32_Size);
    clock_t begin = clock();
    for (int i = 0; i < 10; i++)
        CopySharedMemory(l_UI32_Size);
    clock_t end = clock();
    std::cout << "MemCopy: " << (end - begin) * 1000 / CLOCKS_PER_SEC << " ms" << std::endl;
    begin = clock();
    for (int i = 0; i < 10; i++)
        ProcessCopySharedMemory(l_UI32_Size);
    end = clock();
    std::cout << "ReadProcessMemory: " << (end - begin) * 1000 / CLOCKS_PER_SEC << " ms" << std::endl;
    for (int i = 0; i < 10; i++)
        CopySharedMemory2(l_UI32_Size);
    delete l_pShm;
    return 0;
}
And here is the output:
MemCopy: 8891 ms
ReadProcessMemory: 6068 ms
FirstCopy: 796 ms
SecondCopy: 327 ms
FirstCopy: 795 ms
SecondCopy: 328 ms
FirstCopy: 780 ms
SecondCopy: 344 ms
FirstCopy: 780 ms
SecondCopy: 343 ms
FirstCopy: 780 ms
SecondCopy: 327 ms
FirstCopy: 795 ms
SecondCopy: 343 ms
FirstCopy: 780 ms
SecondCopy: 344 ms
FirstCopy: 796 ms
SecondCopy: 343 ms
FirstCopy: 796 ms
SecondCopy: 327 ms
FirstCopy: 780 ms
SecondCopy: 328 ms
If anybody has an idea why memcpy is so slow, and whether there is a solution to this problem, I’m all ears.
Thanks.
My comment as answer for reference.
Using ‘memcpy’ across a big chunk of memory requires the OS to go through its process/memory tables for each new page that gets copied (presumably as a page fault the first time each page of the mapped region is touched). ‘ReadProcessMemory’, in turn, tells the OS directly which pages should be copied from which process to which other process.
This difference went away when you benchmarked with a single page, which confirms some of this.
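One way to check the per-page explanation is to touch every page of the source once before the timed copy, so that any per-page cost is paid up front. The following is only a sketch: TouchPages is a hypothetical helper name, and the 4096-byte default page size is an assumption (on Windows the real value can be queried with GetSystemInfo).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the original code): reads one byte from
// every page in [data, data + size) so that each page has been touched
// before a timed memcpy runs.
// pageSize defaults to 4096, which is an assumption, not a queried value.
// Returns the number of pages that were touched.
std::size_t TouchPages(const void* data, std::size_t size, std::size_t pageSize = 4096)
{
    assert(pageSize != 0);
    const volatile unsigned char* bytes =
        static_cast<const volatile unsigned char*>(data);
    std::size_t pages = 0;
    for (std::size_t offset = 0; offset < size; offset += pageSize)
    {
        // Volatile read, so the compiler has to perform the access.
        unsigned char sink = bytes[offset];
        (void)sink;
        ++pages;
    }
    return pages;
}
```

Calling TouchPages(l_region.get_address(), l_region.get_size()) right after the region is mapped in CopySharedMemory2, before the first clock() call, should bring FirstCopy closer to SecondCopy if the per-page cost on the source is what dominates. That is a prediction to test, not a measured result.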
My guess is that ‘memcpy’ is faster in the ‘small’ scenario because ‘ReadProcessMemory’ has to make an extra switch from user mode to kernel mode. ‘memcpy’, on the other hand, runs in user mode and leaves the address translation to the underlying memory management system, which works alongside your process and is supported natively by the hardware to some extent.
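To compare the ‘small’ and ‘large’ scenarios on the memcpy side, a timing helper along these lines could be used. It is a minimal sketch with hypothetical names (CopyTiming, TimeMemcpy). It uses ordinary heap buffers instead of the shared memory region, and both buffers are already written to before the timer starts, so it only measures the user-mode copy itself.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical result type for this sketch.
struct CopyTiming
{
    double averageMicros; // mean wall-clock time per memcpy call
    bool   dataMatches;   // true if the destination equals the source afterwards
};

// Times `iterations` memcpy calls of `size` bytes between two heap buffers.
// Both vectors are filled on construction, so their pages are already touched
// when the timer starts. Invalid arguments yield {0.0, false}.
CopyTiming TimeMemcpy(std::size_t size, int iterations)
{
    if (size == 0 || iterations <= 0)
        return CopyTiming{0.0, false};

    std::vector<unsigned char> source(size, 1);
    std::vector<unsigned char> destination(size, 0);

    const auto begin = std::chrono::steady_clock::now();
    // Note: an optimizing compiler may in principle merge identical copies,
    // so treat the numbers as rough indications only.
    for (int i = 0; i < iterations; ++i)
        std::memcpy(destination.data(), source.data(), size);
    const auto end = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::micro> elapsed = end - begin;
    assert(elapsed.count() >= 0.0); // steady_clock is monotonic

    CopyTiming result;
    result.averageMicros = elapsed.count() / iterations;
    result.dataMatches = (destination == source);
    return result;
}
```

For example, TimeMemcpy(4096, 100000) gives a per-call figure for a single assumed page, and TimeMemcpy(100 * 1024 * 1024, 10) gives one for a large block. The ‘ReadProcessMemory’ side would need a matching loop in the Windows-specific code above.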