Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8378397
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 9, 20262026-06-09T15:55:08+00:00 2026-06-09T15:55:08+00:00

First, in tuning a frequency analysis function using the Accelerate framework, the absolute system

  • 0

First, in tuning a frequency analysis function using the Accelerate framework, the absolute system time has consistently been 225ms per iteration. Then last night I changed the order of which two of the arrays were declared and suddenly it went down to 202ms. A 10% increase by just changing the declaration order seems insane. Can someone explain to me why the compiler (which is set to optimize) is not already finding this solution?

Additional info: Before the loop there is some setup of the arrays used in the loop consisting of converting them from integer to float arrays (for Accelerate) and then taking sin and cos of the time array (16 lines long). All of the float arrays (8 arrays x 1000 elements) are declared first in the function (after a sanity check of the parameters). They are always declared the same size (by a constant), because otherwise performance suffered for little shrinkage of the footprint. I tested making them globals, but I think the compiler already figured that out as there is no performance change. The loop is 25 lines long.

—Additions—

Yes, “-Os” is the flag. (default in Xcode anyways: Fastest, Smallest)

(below is from memory – don’t try to compile it, cause I didn’t put in things like stride (which is 1), etc. However, all of the Accelerate calls are there)

passed parameters: inttimearray, intamparray, length, scale1, scale2, amp

float trigarray1[maxsize];
float trigarray2[maxsize];
float trigarray3[maxsize];
float trigarray4[maxsize];
float trigarray5[maxsize];
float temparray[maxsize];
float amparray[maxsize];    //these two make the most change
float timearray[maxsize];    //these two make the most change

vDSP_vfltu32(inttimearray,timearray,length); //convert to float array
vDSP_vflt16(intamparray,amparray,length);    //convert to float array

vDSP_vsmul(timearray,scale1,temparray,length);    //scale time and store in temp
vvcosf(temparray,trigarray3,length);     //cos of temparray
vvsinf(temparray,trigarray4,length);     //sin of temparray
vDSP_vneg(trigarray4,trigarray5,length); //negative of trigarray4

vDSP_vsmul(timearray,scale2,temparray,length); //scale time and store in temp
vvcosf(temparray,trigarray1,length);           //cos of temparray
vvsinf(temprray,trigarray2,length);            //sin of temparray

float ysum;
vDSP_sve(amparray,ysum,length);    //sum of amparray

float csum, ssum, ccsum, sssum, cssum, ycsum, yssum;

for (i = 0; i<max; i++) {

    vDSP_sve(trigarray1,csum,length);    //sum of trigarray1
    vDSP_sve(trigarray2,ssum,length);    //sum of trigarray2

    vDSP_svesq(trigarray1,ccsum,length); //sum of trigarray1^2
    vDSP_svesq(trigarray2,sssum,length); //sum of trigarray2^2

    vDSP_vmul(trigarray1,trigarray2,temparray,length); //temp = trig1*trig2
    vDSP_sve(temparray,cssum,length);                  //sum of temp array
    // 2 more sets of the above 2 lines, for the 2 remaining sums

    amp[i] = (arithmetic of sums);

    //trig identity to increase the sin/cos by a delta frequency
    //vmma is a*b+c*d=result
    vDSP_vmma (trigarray1,trigarray3,trigarray2,trigarray4,temparray,length);
    vDSP_vmma (trigarray2,trigarray3,trigarray1,trigarray5,trigarray2,length);
    memcpy(trigarray1,temparray,length*sizeof(float));
}

—Current Solution—

I’ve made some changes as follows:

The arrays are all declared aligned, and zero’d out (I’ll explain next) and maxsize is now a multiple of 16

__attribute__ ((align (16))) float timearray[maxsize] = {0};

I’ve zero’d out all of the arrays because now, when the length is less than maxsize, I round the length up to the nearest multiple of 16 so that all of the looped functions operate on widths divisible by 16, without affecting the sums.

The benefits are:

  • Slight performance boost
  • The speed is nearly constant regardless of order of array declaration (which is now done right before they are needed, instead of all in a big block)
  • The speed is also nearly constant for any 16-wide length (i.e. 241 to 256, or 225 to 240…), whereas before, if the length went from 256 to 255, the function would take a 3+% performance hit.

In the future (possibly with this code, as analysis requirements are still in flux), I realize I’ll need to take into consideration stack usage more, and alignment/chunks of vectors. Unfortunately, for this code, I can’t make these arrays static or globals as this function can be called by more than one object at a time.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-09T15:55:09+00:00Added an answer on June 9, 2026 at 3:55 pm

    The first thing I would suspect is alignment. You may want to experiment with:

    __attribute__ ((align (16))) float ...[maxsize];
    

    Or make sure that maxsize is a multiple of 16. That could definitely cause a 10% hit if in one configuration you’re aligned and in another you’re not. Vector operations can be extremely sensitive to this.

    The next major issue you may have is a huge stack (assuming maxsize is fairly large). ARM can deal with numbers less than 4k much more efficiently than it can deal with numbers larger than 4k (because it can only deal with 12-bit immediate values). So depending on the how the compiler has optimized it, pushing amparray way down on the stack could lead to more complicated math to access it.

    When small twiddly things lead to big performance changes, I always recommend pulling up the assembly (Product>Generate Output>Assembly) and seeing what’s changes in the compiler output. I also highly recommend Whirlwind Tour of ARM Assembly to get you started understanding what you’re looking at. (Make sure you set the output to “For Archiving” so you see the optimized result.)

    You should also do a few more things:

    • Try rewriting this routine as simple C instead of using Accelerate. Yes, I know Accelerate is always faster, except it’s not. All those function calls are quite expensive, and the compiler can often better vectorize simple multiplication and addition that Accelerate can in my experience. This is particularly true if your stride is 1, your vectors are not enormous, and you’re on a 1-2 core device like an iPad. The moment you have code that handles a stride (if you don’t need a stride), it’s more complicated (slower) than the code you would have written by hand. In my experience, Accelerate does seem to be very good at ramps and transcendentals (cosines of big tables for example), but not nearly so good at simple vector and matrix math.

    • If this code really matters to you, I’ve found that hand-writing the assembly can definitely out-pace the compiler. I’m not even that good at ARM assembler, and I’ve been able to beat the compiler by 2x on simple matrix math (and the compiler crushed Accelerate). I’m particularly talking about your loop here that seems to be doing just adds and multiplies. Handwriting the assembly is a pain of course, and you then have to maintain a C version for the assembler, but when it really matters it’s really fast.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

First, some context: I'm a Python developer who has written a medium-sized application using
I've been tuning my game's renderer for my laptop, which has a Radeon HD
First time posting here, will try to be succinct. This is a classic 'can't
First things first: I'm using Ruby 1.9.3 and DataMapper 1.2.0. I created the following
This is going to be my first attempt at fine tuning our SQL Server
Well, this is my first foray into memory profiling a .NET app (CPU tuning
I have an iphone app that has been submitted that makes a lot of
I know that the question about turning on/off GPS programatically on android has been
I wish my first post wasn't so newbie. I've been working with openframeworks, so
I have some data I load into the database the first time a user

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.