In this scenario: I receive about 1MB of data over the network into a buffer

Question

Editorial Team

Asked: June 18, 2026

In this scenario:

1.) I receive about 1MB of data over TCP into a buffer, in a separate thread.
2.) Decompressing this buffer produces about 2MB of data.
3.) I make two calls to the same dynamically linked library function, for which I own the source. It is basically a bunch of FFTs using FFTW 3.3.3 with NEON support. I consider the first set of FFTs cold and the second hot.

The cold run is about 310 ms slower than the hot run:

Cold: 570 ms
Hot: 260 ms

If 1.) and 2.) are replaced by a read from a file of the exact same data, the hot and cold runs are equally fast.

If I reduce the network data to about 200K, and thus the decompressed data to 400K, performance is the same between cold and hot.

If I perform an L2 flush* immediately after 2.), the cold run becomes as fast as the hot run. I don’t understand this. I’ve tried changing many compiler options, and as long as the optimizer is used I see this behavior.

If I flush less of the cache, then the performance of the cold run worsens proportionally.

*Here is the code that I’m using to attempt to flush the 1MB L2 cache:

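const size_t cache_size = 1024 * 1024;  // same size as the 1MB L2 cache
char *cache = new char[cache_size];
srand(1);
cache[0] = 0;  // new[] leaves the buffer uninitialized, so seed the chain explicitly
for (size_t dc = 1; dc < cache_size; ++dc)
{
    // Each store touches one more byte; the step is a pseudo-random value in [0, 254].
    cache[dc] = static_cast<char>(cache[dc - 1] + rand() % 255);
}
delete[] cache;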

Using a giant switch statement to flush the instruction cache does not have much of an effect. I used 32,000 cases, as shown in the question "How can I cause an instruction cache miss?"

If I copy the data that the FFTs operate on to a duplicate structure in a different part of memory, and then operate on the copy, it does not have the same effect as flushing L2.

I’d like to understand what’s going on. I’ve forced process and thread affinity to a single core on the Tegra 3. What other easily accessible measurements can I make or view?


1 Answer

Editorial Team · Answer 1 · 2026-06-18T15:39:57+00:00


If the cache is configured as write-back (typically the higher-performance setting), then your flush forces all the output of the first FFTW call to be written to memory immediately, rather than waiting until the second FFTW call needs those cache lines. The cache then starts out empty rather than polluted with dirty lines.
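One way to probe this explanation is a small timing experiment. The sketch below is an assumption-laden illustration, not the asker's code: `produce`, `consume`, `sweep`, and `run_comparison` are hypothetical names, and the byte loops are simple stand-ins for the decompressor and the FFTW calls. It times the same work once right after the input was written (cache full of dirty lines) and once after sweeping a cache-sized scratch buffer. Whether the two timings differ, and in which direction, depends on the hardware and cache configuration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the decompressor: stores to every byte of `buf`,
// which leaves those lines modified (dirty) in a write-back cache.
void produce(std::vector<unsigned char>& buf, unsigned char seed)
{
    for (std::size_t i = 0; i < buf.size(); ++i)
        buf[i] = static_cast<unsigned char>(seed + i);
}

// Hypothetical stand-in for the FFT call: reads `in`, writes `out`, and
// returns a checksum so the work has an observable result.
std::uint32_t consume(const std::vector<unsigned char>& in,
                      std::vector<unsigned char>& out)
{
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = static_cast<unsigned char>(in[i] ^ 0x5A);
        sum += out[i];
    }
    return sum;
}

// Reads and writes every byte of `scratch` so that older lines are pushed
// out of the cache (and written back first if they were dirty).
std::uint32_t sweep(std::vector<unsigned char>& scratch)
{
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < scratch.size(); ++i) {
        scratch[i] = static_cast<unsigned char>(scratch[i] + 1);
        sum += scratch[i];
    }
    return sum;
}

struct Timings {
    double dirty_ms;          // consume() right after produce()
    double swept_ms;          // consume() after produce() and sweep()
    std::uint32_t dirty_sum;  // checksum from the first consume()
    std::uint32_t swept_sum;  // checksum from the second consume()
};

// Assumption: cache_bytes is chosen by the caller to match the L2 size.
Timings run_comparison(std::size_t data_bytes, std::size_t cache_bytes)
{
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    std::vector<unsigned char> in(data_bytes), out(data_bytes), scratch(cache_bytes);
    Timings t{};

    // Case 1: consume immediately after the producer dirtied the cache.
    produce(in, 1);
    auto start = clock::now();
    t.dirty_sum = consume(in, out);
    t.dirty_ms = ms(clock::now() - start).count();

    // Case 2: same input, but sweep a cache-sized buffer before consuming.
    produce(in, 1);
    volatile std::uint32_t sink = sweep(scratch);  // keep the sweep's result live
    (void)sink;
    start = clock::now();
    t.swept_sum = consume(in, out);
    t.swept_ms = ms(clock::now() - start).count();
    return t;
}
```

Calling `run_comparison(2 * 1024 * 1024, 1024 * 1024)` mirrors the sizes in the question (2MB of decompressed data, 1MB L2). Repeating the call several times and comparing `dirty_ms` against `swept_ms` gives a rough indication of whether pending write-backs are being charged to the consumer.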
