The Archive Base

The Archive Base Latest Questions

Editorial Team
Asked: May 25, 2026

Can anyone give an example or a link to an example which uses __builtin_prefetch


Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I’d like the example to meet the following criteria:

  1. It is a simple, small, self-contained example.
  2. Removing the __builtin_prefetch instruction results in performance degradation.
  3. Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.

That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn’t be managed without it.

1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 9:42 am

    Here’s an actual piece of code that I’ve pulled out of a larger project. (Sorry, it’s the shortest one I could find that showed a noticeable speedup from prefetching.)
    This code performs a very large data transpose.

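    This example uses the SSE prefetch intrinsic _mm_prefetch with _MM_HINT_T0, which corresponds to the prefetcht0 instruction. On x86, GCC’s __builtin_prefetch with its default arguments typically emits the same instruction.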

    To run this example, you will need to compile it for x64 and have more than 4 GB of memory, because the buffer alone is 4 GiB. You can run it with a smaller data size, but then it finishes too quickly to time reliably.
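The 4 GiB figure follows from the allocation in main: block * dimension * dimension vectors of 16 bytes each. As a rough aid for picking a smaller run, here is a minimal sketch of that arithmetic. The name transpose_bytes is hypothetical and not part of the original program, and the 16-byte default assumes sizeof(__m128d) == 16.

```cpp
#include <cstdint>

// Bytes needed for a dimension-by-dimension matrix of units, where each
// unit is `block` vectors of `vector_bytes` bytes (16 for __m128d).
// Hypothetical helper, shown only to illustrate the size calculation.
constexpr std::uint64_t transpose_bytes(std::uint64_t dimension,
                                        std::uint64_t block,
                                        std::uint64_t vector_bytes = 16) {
    return block * dimension * dimension * vector_bytes;
}

// The configuration used in the answer: 4096 x 4096 units of 16 vectors.
static_assert(transpose_bytes(4096, 16) == 4294967296ULL, "4 GiB");
// A quarter of the dimension needs one sixteenth of the memory.
static_assert(transpose_bytes(1024, 16) == 268435456ULL, "256 MiB");
```

Because the size grows with the square of dimension, halving dimension cuts the buffer to a quarter.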

    #include <iostream>
    using std::cout;
    using std::endl;
    
    #include <emmintrin.h>
    #include <malloc.h>
    #include <stdlib.h>     //  system(), exit()
    #include <time.h>
    #include <string.h>
    
    #define ENABLE_PREFETCH
    
    
    #define f_vector    __m128d
    #define i_ptr       size_t
    inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
        //  To be super-optimized later.
    
        f_vector *stop = A + L;
    
        do{
            f_vector tmpA = *A;
            f_vector tmpB = *B;
            *A++ = tmpB;
            *B++ = tmpA;
        }while (A < stop);
    }
    void transpose_even(f_vector *T,i_ptr block,i_ptr x){
        //  Transposes T.
        //  T contains x columns and x rows.
        //  Each unit is of size (block * sizeof(f_vector)) bytes.
    
        //Conditions:
        //  - 0 < block
        //  - 1 < x
    
        i_ptr row_size = block * x;
        i_ptr iter_size = row_size + block;
    
        //  End of entire matrix.
        f_vector *stop_T = T + row_size * x;
        f_vector *end = stop_T - row_size;
    
        //  Iterate each row.
        f_vector *y_iter = T;
        do{
            //  Iterate each column.
            f_vector *ptr_x = y_iter + block;
            f_vector *ptr_y = y_iter + row_size;
    
            do{
    
    #ifdef ENABLE_PREFETCH
                _mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
    #endif
    
                swap_block(ptr_x,ptr_y,block);
    
                ptr_x += block;
                ptr_y += row_size;
            }while (ptr_y < stop_T);
    
            y_iter += iter_size;
        }while (y_iter < end);
    }
    int main(){
    
        i_ptr dimension = 4096;
        i_ptr block = 16;
    
        i_ptr words = block * dimension * dimension;
        i_ptr bytes = words * sizeof(f_vector);
    
        cout << "bytes = " << bytes << endl;
    //    system("pause");
    
        f_vector *T = (f_vector*)_mm_malloc(bytes,16);
        if (T == NULL){
            cout << "Memory Allocation Failure" << endl;
            system("pause");
            exit(1);
        }
        memset(T,0,bytes);
    
        //  Perform in-place data transpose
        cout << "Starting Data Transpose...   ";
        clock_t start = clock();
        transpose_even(T,block,dimension);
        clock_t end = clock();
    
        cout << "Done" << endl;
        cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;
    
        _mm_free(T);
        system("pause");
    }
    

    When I run it with ENABLE_PREFETCH enabled, this is the output:

    bytes = 4294967296
    Starting Data Transpose...   Done
    Time: 0.725 seconds
    Press any key to continue . . .
    

    When I run it with ENABLE_PREFETCH disabled, this is the output:

    bytes = 4294967296
    Starting Data Transpose...   Done
    Time: 0.822 seconds
    Press any key to continue . . .
    

    So prefetching gives a speedup of about 13% (0.822 s / 0.725 s ≈ 1.13).
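The program above zero-fills its buffer, so it times the transpose but cannot show that the result is correct. Below is a scalar sketch of the same traversal that can be checked on a tiny matrix, with the hint issued through GCC's __builtin_prefetch as the question asks. Assumptions: GCC or Clang (on other compilers the hint compiles to nothing), plain double in place of __m128d, and the names prefetch_read and transpose_blocks are hypothetical. It is not tuned and makes no performance claim of its own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Read-prefetch hint. With GCC/Clang, (p, 0, 3) means "read access,
// keep in all cache levels", which on x86 typically maps to prefetcht0.
// Assumption: compiles to a no-op on compilers without the builtin.
static inline void prefetch_read(const void *p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 0, 3);
#else
    (void)p;
#endif
}

// In-place transpose of an x-by-x matrix whose elements are units of
// `block` consecutive doubles. Same walk as transpose_even: for each
// diagonal position, swap the units to its right with the units below it.
void transpose_blocks(double *T, std::size_t block, std::size_t x) {
    assert(T != nullptr && block > 0);
    const std::size_t row_size = block * x;

    for (std::size_t d = 0; d + 1 < x; ++d) {
        double *ptr_x = T + d * row_size + (d + 1) * block;  // moves along row d
        double *ptr_y = T + (d + 1) * row_size + d * block;  // moves down column d

        for (std::size_t k = d + 1; k < x; ++k) {
            // Hint the unit one row further down, which is the next ptr_y.
            // Skipped on the last step so no out-of-range pointer is formed.
            if (k + 1 < x) {
                prefetch_read(ptr_y + row_size);
            }
            std::swap_ranges(ptr_x, ptr_x + block, ptr_y);
            ptr_x += block;
            ptr_y += row_size;
        }
    }
}
```

The column walk is the part that strides by a whole row each step, which is why the hint targets ptr_y rather than ptr_x.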

    EDIT:

    Here are some more results (times in seconds):

    Operating System: Windows 7 Professional/Ultimate
    Compiler: Visual Studio 2010 SP1
    Compile Mode: x64 Release
    
    Intel Core i7 860 @ 2.8 GHz, 8 GB DDR3 @ 1333 MHz
    Prefetch   : 0.868
    No Prefetch: 0.960
    
    Intel Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz
    Prefetch   : 0.725
    No Prefetch: 0.822
    
    Intel Core i7 2600K @ 4.6 GHz, 16 GB DDR3 @ 1333 MHz
    Prefetch   : 0.718
    No Prefetch: 0.796
    
    2 x Intel Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz
    Prefetch   : 2.273
    No Prefetch: 2.666
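To compare the machines above on one scale, the timings can be turned into a relative speedup. This is a minimal sketch of that calculation. The name speedup_percent is hypothetical, and it takes "speedup" to mean the no-prefetch time divided by the prefetch time, minus one.

```cpp
#include <cassert>

// Speedup of the prefetch build over the no-prefetch build, in percent.
// Both arguments are wall times in seconds for the same workload.
inline double speedup_percent(double no_prefetch_s, double prefetch_s) {
    assert(prefetch_s > 0.0);
    return (no_prefetch_s / prefetch_s - 1.0) * 100.0;
}
```

Applied to the figures listed above, this gives roughly 10.6% for the Core i7 860, 13.4% for the Core i7 920, 10.9% for the Core i7 2600K, and 17.3% for the dual Xeon X5482.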
    
© 2021 The Archive Base. All Rights Reserved