Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 744521
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 14, 20262026-05-14T08:59:16+00:00 2026-05-14T08:59:16+00:00

I’m calculating fixedpoint reciprocals in Q22.10 with Goldschmidt division for use in my software

  • 0

I’m calculating fixedpoint reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.

This is done by just setting the numerator to 1, i.e the numerator becomes the scalar on the first iteration. To be honest, I’m kind of following the wikipedia algorithm blindly here. The article says that if the denominator is scaled in the half-open range (0.5, 1.0], a good first estimate can be based on the denominator alone: Let F be the estimated scalar and D be the denominator, then F = 2 – D.

But when doing this, I lose a lot of precision. Say if I want to find the reciprocal of 512.00002f. In order to scale the number down, I lose 10 bits of precision in the fraction part, which is shifted out. So, my questions are:

  • Is there a way to pick a better estimate which does not require normalization? Why? Why not? A mathematical proof of why this is or is not possible would be great.
  • Also, is it possible to pre-calculate the first estimates so the series converges faster? Right now, it converges after the 4th iteration on average. On ARM this is about ~50 cycles worst case, and that’s not taking emulation of clz/bsr into account, nor memory lookups. If it’s possible, I’d like to know if doing so increases the error, and by how much.

Here is my testcase. Note: The software implementation of clz on line 13 is from my post here. You can replace it with an intrinsic if you want. clz should return the number of leading zeros, and 32 for the value 0.

#include <stdio.h>
#include <stdint.h>

const unsigned int BASE = 22ULL;

static unsigned int divfp(unsigned int val, int* iter)
{
  /* Numerator, denominator, estimate scalar and previous denominator */
  unsigned long long N,D,F, DPREV;
  int bitpos;

  *iter = 1;
  D = val;
  /* Get the shift amount + is right-shift, - is left-shift. */
  bitpos = 31 - clz(val) - BASE;
  /* Normalize into the half-range (0.5, 1.0] */
  if(0 < bitpos)
    D >>= bitpos;
  else
    D <<= (-bitpos);

  /* (FNi / FDi) == (FN(i+1) / FD(i+1)) */
  /* F = 2 - D */
  F = (2ULL<<BASE) - D;
  /* N = F for the first iteration, because the numerator is simply 1.
     So don't waste a 64-bit UMULL on a multiply with 1 */
  N = F;
  D = ((unsigned long long)D*F)>>BASE;

  while(1){
    DPREV = D;
    F = (2<<(BASE)) - D;
    D = ((unsigned long long)D*F)>>BASE;
    /* Bail when we get the same value for two denominators in a row.
      This means that the error is too small to make any further progress. */
    if(D == DPREV)
      break;
    N = ((unsigned long long)N*F)>>BASE;
    *iter = *iter + 1;
  }
  if(0 < bitpos)
    N >>= bitpos;
  else
    N <<= (-bitpos);
  return N;
}


int main(int argc, char* argv[])
{
  double fv, fa;
  int iter;
  unsigned int D, result;

  sscanf(argv[1], "%lf", &fv);

  D = fv*(double)(1<<BASE);
  result = divfp(D, &iter); 

  fa = (double)result / (double)(1UL << BASE);
  printf("Value: %8.8lf 1/value: %8.8lf FP value: 0x%.8X\n", fv, fa, result);
  printf("iteration: %d\n",iter);

  return 0;
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-14T08:59:16+00:00Added an answer on May 14, 2026 at 8:59 am

    I could not resist spending an hour on your problem…

    This algorithm is described in section 5.5.2 of “Arithmetique des ordinateurs” by Jean-Michel Muller (in french). It is actually a special case of Newton iterations with 1 as starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized in range [1/2,1[:

    e = 1 - D
    Q = N
    repeat K times:
      Q = Q * (1+e)
      e = e*e
    

    The number of correct bits doubles at each iteration. In the case of 32 bits, 4 iterations will be enough. You can also iterate until e becomes too small to modify Q.

    Normalization is used because it provides the max number of significant bits in the result. It is also easier to compute the error and number of iterations needed when the inputs are in a known range.

    Once your input value is normalized, you don’t need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized in range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).

    This simplified algorithm may be implemented for your Q22.10 representation as follows:

    // Fixed point inversion
    // EB Apr 2010
    
    #include <math.h>
    #include <stdio.h>
    
    // Number X is represented by integer I: X = I/2^BASE.
    // We have (32-BASE) bits in integral part, and BASE bits in fractional part
    #define BASE 22
    typedef unsigned int uint32;
    typedef unsigned long long int uint64;
    
    // Convert FP to/from double (debug)
    double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
    uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }
    
    // Return inverse of FP
    uint32 inverse(uint32 fp)
    {
      if (fp == 0) return (uint32)-1; // invalid
    
      // Shift FP to have the most significant bit set
      int shl = 0; // normalization shift
      uint32 nfp = fp; // normalized FP
      while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
    
      uint64 q = 0x100000000ULL; // 2^32
      uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
      int i;
      for (i=0;i<4;i++) // iterate
        {
          // Both multiplications are actually
          // 32x32 bits truncated to the 32 high bits
          q += (q*e)>>(uint64)32;
          e = (e*e)>>(uint64)32;
          printf("Q=0x%llx E=0x%llx\n",q,e);
        }
      // Here, (Q/2^32) is the inverse of (NFP/2^32).
      // We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
      return (uint32)(q>>(64-2*BASE-shl));
    }
    
    int main()
    {
      double x = 1.234567;
      uint32 xx = toFP(x);
      uint32 yy = inverse(xx);
      double y = toDouble(yy);
    
      printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
      printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
    }
    

    As noted in the code, the multiplications are not full 32×32->64 bits. E will become smaller and smaller and fits initially on 32 bits. Q will always be on 34 bits. We take only the high 32 bits of the products.

    The derivation of 64-2*BASE-shl is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).

    EDIT. As a follow-up to my comment, here is a second version with an implicit 32-th bit on Q. Both E and Q are now stored on 32 bits:

    uint32 inverse2(uint32 fp)
    {
      if (fp == 0) return (uint32)-1; // invalid
    
      // Shift FP to have the most significant bit set
      int shl = 0; // normalization shift for FP
      uint32 nfp = fp; // normalized FP
      while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
      int shr = 64-2*BASE-shl; // normalization shift for Q
      if (shr <= 0) return (uint32)-1; // overflow
    
      uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
      uint64 q = e; // 2^32 implicit bit, and implicit first iteration
      int i;
      for (i=0;i<3;i++) // iterate
        {
          e = (e*e)>>(uint64)32;
          q += e + ((q*e)>>(uint64)32);
        }
      return (uint32)(q>>shr) + (1<<(32-shr)); // insert implicit bit
    }
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
I have a string like this: La Torre Eiffel paragonata all&#8217;Everest What PHP function
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I have just tried to save a simple *.rtf file with some websites and
For some reason, after submitting a string like this Jack’s Spindle from a text
I am trying to understand how to use SyndicationItem to display feed which is
this is what i have right now Drawing an RSS feed into the php,
I have this code to decode numeric html entities to the UTF8 equivalent character.
I want use html5's new tag to play a wav file (currently only supported
I have this code: - (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock { NSString *someString = [[NSString

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.