The Archive Base


Editorial Team
Asked: June 11, 2026


Referring to @auselen’s answer here: Using ARM NEON intrinsics to add alpha and permute, it looks like the armcc compiler is far better than GCC at NEON optimizations. Is this really true? I haven’t tried armcc myself, but I got well-optimized code from GCC with the -O3 optimization flag. Now I’m wondering whether armcc is really that good. Which of the two compilers is better, all factors considered?


1 Answer

  1. Editorial Team
    Added an answer on June 11, 2026 at 9:18 pm

    Compilers are software too, and they tend to improve over time. A generic claim such as “armcc is better than GCC at NEON” (or, more precisely, at vectorization) can’t hold true forever, since one developer group can close the gap with enough attention. Initially, though, it is logical to expect compilers developed by hardware companies to be superior, because those companies need to demonstrate and market these features.

    One recent example I saw here on Stack Overflow was an answer about branch prediction. Quoting the last line of its updated section: “This goes to show that even mature modern compilers can vary wildly in their ability to optimize code…”.

    I am a big fan of GCC, but I wouldn’t bet on the quality of its generated code against compilers from Intel or ARM. I expect any mainstream commercial compiler to produce code at least as good as GCC’s.

    One empirical way to answer this question is to take hilbert-space’s NEON optimization example and see how different compilers optimize it.

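    #include <arm_neon.h>
    #include <stdint.h>

    /* Convert interleaved RGB pixels to 8-bit grayscale, 8 pixels per iteration. */
    void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
    {
      int i;
      /* Channel weights; 77 + 151 + 28 = 256, so the final shift by 8 normalises. */
      uint8x8_t rfac = vdup_n_u8 (77);
      uint8x8_t gfac = vdup_n_u8 (151);
      uint8x8_t bfac = vdup_n_u8 (28);
      n/=8;  /* number of 8-pixel blocks; a remainder of n % 8 pixels is not processed */
    
      for (i=0; i<n; i++)
      {
        uint16x8_t  temp;
        uint8x8x3_t rgb  = vld3_u8 (src);  /* de-interleave 8 pixels into 3 channel vectors */
        uint8x8_t result;
    
        temp = vmull_u8 (rgb.val[0],      rfac);  /* widening multiply to 16 bits */
        temp = vmlal_u8 (temp,rgb.val[1], gfac);  /* widening multiply-accumulate */
        temp = vmlal_u8 (temp,rgb.val[2], bfac);
    
        result = vshrn_n_u16 (temp, 8);  /* shift right by 8 and narrow back to 8 bits */
        vst1_u8 (dest, result);
        src  += 8*3;
        dest += 8;
      }
    }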
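
For readers less familiar with the intrinsics, here is a minimal scalar sketch of the arithmetic that neon_convert performs per pixel. The name reference_convert is hypothetical (it is not part of the original example), and the sketch assumes the first channel in memory is red, as the variable names rfac, gfac and bfac suggest. It could serve as a baseline when checking a vectorized build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference (hypothetical helper, not from the original example):
   one grayscale byte per interleaved RGB pixel.
   The weights 77, 151 and 28 sum to 256, so ">> 8" normalises the sum. */
static void reference_convert(uint8_t *dest, const uint8_t *src, int n)
{
    assert(dest != NULL && src != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        unsigned r = src[3 * i + 0];
        unsigned g = src[3 * i + 1];
        unsigned b = src[3 * i + 2];
        /* Largest possible sum is 256 * 255 = 65280, which fits in 16 bits,
           matching the uint16x8_t accumulator used by the NEON version. */
        dest[i] = (uint8_t)((77 * r + 151 * g + 28 * b) >> 8);
    }
}
```

Unlike the NEON loop, this version processes every pixel, including a trailing group of fewer than eight, so the two should only be compared on the first (n / 8) * 8 outputs.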
    

    armcc 5.01 compiles the inner loop of neon_convert to:

      20:   f421140d    vld3.8  {d1-d3}, [r1]!
      24:   e2822001    add r2, r2, #1
      28:   f3810c04    vmull.u8    q0, d1, d4
      2c:   f3820805    vmlal.u8    q0, d2, d5
      30:   f3830806    vmlal.u8    q0, d3, d6
      34:   f2880810    vshrn.i16   d0, q0, #8
      38:   f400070d    vst1.8  {d0}, [r0]!
      3c:   e1520003    cmp r2, r3
      40:   bafffff6    blt 20 <neon_convert+0x20>
    

    GCC 4.4.3 through 4.7.1 compiles the same loop to:

      1e:   f961 040d   vld3.8  {d16-d18}, [r1]!
      22:   3301        adds    r3, #1
      24:   4293        cmp r3, r2
      26:   ffc0 4ca3   vmull.u8    q10, d16, d19
      2a:   ffc1 48a6   vmlal.u8    q10, d17, d22
      2e:   ffc2 48a7   vmlal.u8    q10, d18, d23
      32:   efc8 4834   vshrn.i16   d20, q10, #8
      36:   f940 470d   vst1.8  {d20}, [r0]!
      3a:   d1f0        bne.n   1e <neon_convert+0x1e>
    

    The two listings look extremely similar, so we have a draw. After seeing this, I tried the add-alpha-and-permute example mentioned in the question again.

    void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
    {
        numPix /= 8; //process 8 pixels at a time
    
        uint8x8_t alpha = vdup_n_u8 (0xff);
    
        for (int i=0; i<numPix; i++)
        {
            uint8x8x3_t rgb  = vld3_u8 (src);
            uint8x8x4_t bgra;
    
            bgra.val[0] = rgb.val[2]; //these lines are slow
            bgra.val[1] = rgb.val[1]; //these lines are slow 
            bgra.val[2] = rgb.val[0]; //these lines are slow
    
            bgra.val[3] = alpha;
    
            vst4_u8(dst, bgra);
    
            src += 8*3;
            dst += 8*4;
        }
    }
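
One detail worth noting in the function above: numPix /= 8 truncates, so when numPix is not a multiple of 8 the last numPix % 8 pixels are never written. A minimal sketch of a scalar tail handler follows. Both function names are hypothetical and not part of the original code; the sketch assumes the same memory layout (3 bytes per source pixel, 4 bytes per destination pixel).

```c
#include <assert.h>
#include <stddef.h>

/* Scalar RGB -> BGRA copy with alpha forced to 0xff (hypothetical helper). */
static void scalar_permute_rgb_to_bgra(const unsigned char *src,
                                       unsigned char *dst, int numPix)
{
    assert(src != NULL && dst != NULL && numPix >= 0);
    for (int i = 0; i < numPix; i++) {
        dst[4 * i + 0] = src[3 * i + 2]; /* B */
        dst[4 * i + 1] = src[3 * i + 1]; /* G */
        dst[4 * i + 2] = src[3 * i + 0]; /* R */
        dst[4 * i + 3] = 0xff;           /* A */
    }
}

/* Converts only the pixels that an 8-at-a-time loop leaves untouched.
   src and dst point at the start of the full buffers. */
static void permute_tail(const unsigned char *src, unsigned char *dst,
                         int numPix)
{
    assert(numPix >= 0);
    int done = (numPix / 8) * 8; /* pixels already covered by full blocks */
    scalar_permute_rgb_to_bgra(src + 3 * done, dst + 4 * done, numPix - done);
}
```

Calling permute_tail(src, dst, numPix) after the NEON function, with the original pointers and pixel count, would fill in the remaining pixels. The scalar helper can also double as a correctness reference for the vectorized path.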
    

    Compiling with gcc…

    $ arm-linux-gnueabihf-gcc --version
    arm-linux-gnueabihf-gcc (crosstool-NG linaro-1.13.1-2012.05-20120523 - Linaro GCC 2012.05) 4.7.1 20120514 (prerelease)
    $ arm-linux-gnueabihf-gcc -std=c99 -O3 -c ~/temp/permute.c -marm -mfpu=neon-vfpv4 -mcpu=cortex-a9 -o ~/temp/permute_gcc.o
    
    00000000 <neonPermuteRGBtoBGRA>:
       0:   e3520000    cmp r2, #0
       4:   e2823007    add r3, r2, #7
       8:   b1a02003    movlt   r2, r3
       c:   e92d01f0    push    {r4, r5, r6, r7, r8}
      10:   e1a021c2    asr r2, r2, #3
      14:   e24dd01c    sub sp, sp, #28
      18:   e3520000    cmp r2, #0
      1c:   da000019    ble 88 <neonPermuteRGBtoBGRA+0x88>
      20:   e3a03000    mov r3, #0
      24:   f460040d    vld3.8  {d16-d18}, [r0]!
      28:   eccd0b06    vstmia  sp, {d16-d18}
      2c:   e59dc014    ldr ip, [sp, #20]
      30:   e2833001    add r3, r3, #1
      34:   e59d6010    ldr r6, [sp, #16]
      38:   e1530002    cmp r3, r2
      3c:   e59d8008    ldr r8, [sp, #8]
      40:   e1a0500c    mov r5, ip
      44:   e59dc00c    ldr ip, [sp, #12]
      48:   e1a04006    mov r4, r6
      4c:   f3c73e1f    vmov.i8 d19, #255   ; 0xff
      50:   e1a06008    mov r6, r8
      54:   e59d8000    ldr r8, [sp]
      58:   e1a0700c    mov r7, ip
      5c:   e59dc004    ldr ip, [sp, #4]
      60:   ec454b34    vmov    d20, r4, r5
      64:   e1a04008    mov r4, r8
      68:   f26401b4    vorr    d16, d20, d20
      6c:   e1a0500c    mov r5, ip
      70:   ec476b35    vmov    d21, r6, r7
      74:   f26511b5    vorr    d17, d21, d21
      78:   ec454b34    vmov    d20, r4, r5
      7c:   f26421b4    vorr    d18, d20, d20
      80:   f441000d    vst4.8  {d16-d19}, [r1]!
      84:   1affffe6    bne 24 <neonPermuteRGBtoBGRA+0x24>
      88:   e28dd01c    add sp, sp, #28
      8c:   e8bd01f0    pop {r4, r5, r6, r7, r8}
      90:   e12fff1e    bx  lr
    

    Compiling with armcc…

    $ armcc
    ARM C/C++ Compiler, 5.01 [Build 113]
    $ armcc --C99 --cpu=Cortex-A9 -O3 -c permute.c -o permute_arm.o
    
    00000000 <neonPermuteRGBtoBGRA>:
       0:   e1a03fc2    asr r3, r2, #31
       4:   f3870e1f    vmov.i8 d0, #255    ; 0xff
       8:   e0822ea3    add r2, r2, r3, lsr #29
       c:   e1a031c2    asr r3, r2, #3
      10:   e3a02000    mov r2, #0
      14:   ea000006    b   34 <neonPermuteRGBtoBGRA+0x34>
      18:   f420440d    vld3.8  {d4-d6}, [r0]!
      1c:   e2822001    add r2, r2, #1
      20:   eeb01b45    vmov.f64    d1, d5
      24:   eeb02b46    vmov.f64    d2, d6
      28:   eeb05b40    vmov.f64    d5, d0
      2c:   eeb03b41    vmov.f64    d3, d1
      30:   f401200d    vst4.8  {d2-d5}, [r1]!
      34:   e1520003    cmp r2, r3
      38:   bafffff6    blt 18 <neonPermuteRGBtoBGRA+0x18>
      3c:   e12fff1e    bx  lr
    

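    In this case armcc produces much better code: GCC spills the three loaded registers to the stack and shuffles them back through core registers on every iteration, whereas armcc performs the permutation with four register-to-register vmov instructions. I think this justifies fgp’s answer above. Most of the time GCC will produce good enough code, but you should keep an eye on the critical parts and, most importantly, measure and profile first.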
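
As a rough illustration of the "measure first" advice, here is a minimal timing sketch in standard C. Everything in it is an assumption made for illustration: time_kernel, pixel_kernel, stand_in_kernel and calls_made are hypothetical names, and clock() measures CPU time at coarse resolution, so a real comparison would need many repetitions on the target hardware and ideally a proper profiler.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Same signature as neonPermuteRGBtoBGRA, so that function could be passed in. */
typedef void (*pixel_kernel)(unsigned char *src, unsigned char *dst, int numPix);

/* Average CPU seconds per call of `kernel` over `reps` repetitions. */
static double time_kernel(pixel_kernel kernel, unsigned char *src,
                          unsigned char *dst, int numPix, int reps)
{
    assert(kernel != NULL && reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        kernel(src, dst, numPix);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}

/* Trivial stand-in kernel so the sketch builds on any machine:
   it only writes the alpha byte of each 4-byte output pixel. */
static int calls_made = 0;
static void stand_in_kernel(unsigned char *src, unsigned char *dst, int numPix)
{
    (void)src; /* unused in this placeholder */
    for (int i = 0; i < numPix; i++)
        dst[4 * i + 3] = 0xff;
    calls_made++;
}
```

On an ARM target one would pass the GCC-built and armcc-built versions of the real function in place of stand_in_kernel and compare the two averages, using buffers large enough and enough repetitions that the elapsed time is well above the clock() resolution.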
