The Archive Base

Editorial Team
Asked: June 10, 2026 (2026-06-10T09:55:43+00:00)

I want to convert a floating point value to a 16-bit unsigned integer without saturating (wraparound/overflow instead).

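#include <cstdint>      // uint16_t
#include <cstdlib>      // posix_memalign (POSIX)
#include <iostream>
#include <emmintrin.h>  // SSE2: __m128i, _mm_cvttps_epi32, _mm_packs_epi32
#include <xmmintrin.h>  // SSE: _mm_cvtps_pi16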

void satur_wrap()
{
    const float bigVal = 99000.f;
    const __m128 bigValVec = _mm_set1_ps(bigVal);

    const __m64 outVec64 =_mm_cvtps_pi16(bigValVec);

#if 0
    const __m128i outVec = _mm_movpi64_epi64(outVec64);
#else

    #if 1
        const __m128i outVec  = _mm_packs_epi32(_mm_cvttps_epi32(bigValVec), _mm_cvttps_epi32(bigValVec));
    #else
        const __m128i outVec  = _mm_cvttps_epi32(bigValVec);
    #endif

#endif

    uint16_t *outVals = NULL;
    posix_memalign((void **) &outVals, sizeof(__m128i), sizeof(__m128i));

    _mm_store_si128(reinterpret_cast<__m128i *>(outVals), outVec);

    for (int i = 0; i < sizeof(outVec) / sizeof(*outVals); i++)
    {
        std::cout << "outVals[" << i << "]: " << outVals[i] << std::endl;
    }

    std::cout << std::endl
        << "\tbigVal: " << bigVal << std::endl
        << "\t(unsigned short) bigVal: " << ((unsigned short) bigVal)  << std::endl
        << "\t((unsigned short)((int) bigVal)): " << ((unsigned short)((int) bigVal)) << std::endl
        << std::endl;
}

Sample execution:

$ ./row
outVals[0]: 32767
outVals[1]: 32767
outVals[2]: 32767
outVals[3]: 32767
outVals[4]: 32767
outVals[5]: 32767
outVals[6]: 32767
outVals[7]: 32767

        bigVal: 99000
        (unsigned short) bigVal: 65535
        ((unsigned short)((int) bigVal)): 33464

The ((unsigned short)((int) bigVal)) expression works as desired (but it’s probably UB, right?). But I can’t find something quite similar with SSE. I must be missing something, but I couldn’t find a primitive to convert four 32-bit floats to four 32-bit ints.


EDIT: Oops, I figured it would be “normal” for 32-bit integer -> 16-bit unsigned integer conversion to use wraparound. But I’ve since learned that _mm_packs_epi32 uses signed saturation (and there doesn’t appear to be a _mm_packus_epi32). Is there a way to set the mode, or another primitive besides _mm_packs_epi32?

1 Answer

  1. Editorial Team
    Added an answer on June 10, 2026 at 9:55 am

    I’m answering only part of the question concerning 32-bit integer -> 16-bit unsigned integer conversion.

    Since you need wraparound, just take the low-order word of each doubleword that holds a 32-bit integer. These 16-bit integers are interleaved with 16-bit pieces of unused data, so it may be convenient to pack them into a contiguous array. The easiest way to do this is with the _mm_shuffle_epi8 intrinsic (SSSE3).
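
    As a minimal sketch of the shuffle approach (not taken from the original answer), the code below gathers the low word of each 32-bit lane with _mm_shuffle_epi8. The names low_words_ssse3, wrap4_ssse3 and TARGET_SSSE3 are made up for illustration, and the target attribute assumes GCC or Clang; with other compilers, enable SSSE3 through the compiler's own options.

```cpp
#include <cstdint>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// Assumption: GCC/Clang function attribute so the sketch builds without -mssse3.
#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

// Gathers the low 16 bits of each of the four 32-bit lanes of x into the
// low 64 bits of the result; the upper 64 bits are zeroed.
TARGET_SSSE3 inline __m128i low_words_ssse3(__m128i x)
{
    // Byte indices 0,1 / 4,5 / 8,9 / 12,13 select the low word of each lane.
    // An index with its high bit set (-1 here) produces a zero byte.
    const __m128i mask = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13,
                                       -1, -1, -1, -1, -1, -1, -1, -1);
    return _mm_shuffle_epi8(x, mask);
}

// Hypothetical helper: converts four ints to four uint16_t with wraparound.
TARGET_SSSE3 inline void wrap4_ssse3(const int32_t in[4], uint16_t out[4])
{
    const __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    // Store only the low 64 bits, which hold the four packed words.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out), low_words_ssse3(x));
}
```

    To pack eight values into one register, shuffle a second vector the same way and combine the two low halves, for example with _mm_unpacklo_epi64.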

    If you want your program to be more portable and to require only the SSE2 instruction set, you can pack the values with _mm_packs_epi32 and disable its saturating behavior with the following trick:

    x = _mm_slli_epi32(x, 16);
    y = _mm_slli_epi32(y, 16);
    
    x = _mm_srai_epi32(x, 16);
    y = _mm_srai_epi32(y, 16);
    
    x = _mm_packs_epi32(x, y);
    

    This trick works because it sign-extends the low 16 bits of each value, so every value already fits in a signed 16-bit integer and signed saturation becomes a no-op.
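
    Putting the SSE2 trick together with the float conversion from the question, a rough end-to-end sketch might look like this. The function names pack_wrap_sse2 and float_to_u16_wrap are illustrative only, and the sketch assumes every input, once truncated toward zero, fits in a 32-bit signed integer.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Packs eight 32-bit integers (four in lo, four in hi) into eight 16-bit
// values, keeping only the low 16 bits of each (wraparound, no saturation).
inline __m128i pack_wrap_sse2(__m128i lo, __m128i hi)
{
    // Shift left, then arithmetic shift right: this sign-extends the low
    // word, so every lane lies in [-32768, 32767] and packs cannot saturate.
    lo = _mm_srai_epi32(_mm_slli_epi32(lo, 16), 16);
    hi = _mm_srai_epi32(_mm_slli_epi32(hi, 16), 16);
    return _mm_packs_epi32(lo, hi);
}

// Hypothetical helper: truncates eight floats toward zero and stores them
// as uint16_t modulo 65536.
inline void float_to_u16_wrap(const float in[8], uint16_t out[8])
{
    const __m128i lo = _mm_cvttps_epi32(_mm_loadu_ps(in));
    const __m128i hi = _mm_cvttps_epi32(_mm_loadu_ps(in + 4));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), pack_wrap_sse2(lo, hi));
}
```

    With the question's input of 99000.f, this yields 33464, matching the scalar ((unsigned short)((int) bigVal)) result.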

    The same trick works with _mm_packus_epi32:

    x = _mm_and_si128(x, _mm_set1_epi32(65535));
    y = _mm_and_si128(y, _mm_set1_epi32(65535));
    x = _mm_packus_epi32(x, y);
    

    This trick works because it zero-extends the low 16 bits of each value, so every value already fits in an unsigned 16-bit integer and unsigned saturation becomes a no-op. Zero extension is simpler to perform, but you need the SSE4.1 instruction set to make _mm_packus_epi32 available.
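
    The SSE4.1 variant can be wrapped up in the same way. This is only a sketch: pack_wrap_sse41, int_to_u16_wrap_sse41 and TARGET_SSE41 are invented names, and the target attribute assumes GCC or Clang.

```cpp
#include <cstdint>
#include <smmintrin.h>  // SSE4.1: _mm_packus_epi32

// Assumption: GCC/Clang function attribute so the sketch builds without -msse4.1.
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

// Packs eight 32-bit integers into eight 16-bit values with wraparound.
TARGET_SSE41 inline __m128i pack_wrap_sse41(__m128i lo, __m128i hi)
{
    // Masking keeps the low word and zeroes the high word, so every lane
    // lies in [0, 65535] and the unsigned saturation never triggers.
    const __m128i low_word = _mm_set1_epi32(0xFFFF);
    lo = _mm_and_si128(lo, low_word);
    hi = _mm_and_si128(hi, low_word);
    return _mm_packus_epi32(lo, hi);
}

// Hypothetical helper: converts eight ints to eight uint16_t modulo 65536.
TARGET_SSE41 inline void int_to_u16_wrap_sse41(const int32_t in[8], uint16_t out[8])
{
    const __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    const __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 4));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), pack_wrap_sse41(lo, hi));
}
```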

    It is possible to pack eight 16-bit integers with a single instruction using _mm_perm_epi8, but this requires the fairly rare XOP instruction set.


    A few words about saturating conversion.

    In fact, the _mm_packus_epi32 intrinsic is available if you change #include <xmmintrin.h> to #include <smmintrin.h> or #include <x86intrin.h>. Both your CPU and your compiler need to support the SSE4.1 extensions (with GCC or Clang this usually means compiling with -msse4.1).

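    If you have no SSE4.1-capable CPU or compiler, or want your program to be more portable, replace the saturating step of _mm_packus_epi32 with code like this. It leaves the clamped value in the low word of each 32-bit element, which can then be packed with either wraparound trick above:

    __m128i m1 = _mm_cmpgt_epi32(x, _mm_set1_epi32(0));      // all ones where x > 0
    __m128i m2 = _mm_cmpgt_epi32(x, _mm_set1_epi32(65535));  // all ones where x > 65535
    x = _mm_and_si128(x, m1);                                // negative values become 0
    x = _mm_or_si128(x, m2);                                 // too-large values become all ones (low word 65535)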
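
    For completeness, here is a sketch of how the clamp can be combined with the SSE2 shift-and-pack trick to emulate the whole of _mm_packus_epi32. The names clamp_low_word_u16, packus_epi32_sse2 and int_to_u16_saturate are hypothetical and not part of any intrinsics header.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Clamps each signed 32-bit lane so that its low 16 bits hold the value
// saturated to [0, 65535].
inline __m128i clamp_low_word_u16(__m128i x)
{
    const __m128i positive = _mm_cmpgt_epi32(x, _mm_setzero_si128());
    const __m128i too_big  = _mm_cmpgt_epi32(x, _mm_set1_epi32(65535));
    x = _mm_and_si128(x, positive);   // negative lanes become 0
    return _mm_or_si128(x, too_big);  // lanes above 65535 become all ones
}

// SSE2-only stand-in for _mm_packus_epi32: clamp, then pack the low words.
inline __m128i packus_epi32_sse2(__m128i lo, __m128i hi)
{
    lo = clamp_low_word_u16(lo);
    hi = clamp_low_word_u16(hi);
    // Sign-extend the low word so the signed pack cannot saturate.
    lo = _mm_srai_epi32(_mm_slli_epi32(lo, 16), 16);
    hi = _mm_srai_epi32(_mm_slli_epi32(hi, 16), 16);
    return _mm_packs_epi32(lo, hi);
}

// Hypothetical helper: converts eight ints to eight saturated uint16_t.
inline void int_to_u16_saturate(const int32_t in[8], uint16_t out[8])
{
    const __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    const __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 4));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), packus_epi32_sse2(lo, hi));
}
```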
    