The Archive Base

Editorial Team
Asked: June 8, 2026

I am learning assembly and have started experimenting with the SSE and MMX registers using inline assembly in the Digital Mars C++ compiler (I find Intel syntax easier to read). I have finished a program that takes var_1 as a value and converts it to the number system (base) given by var_2. It is 8-bit for now; I will expand it to 32, 64 and 128 bits later. The program does this in two ways:

  1. __asm inlining

  2. The usual C++ way, with the % (modulo) operator.

Question: Is there a more efficient way to use the xmm0-7 and mm0-7 registers, and how can I exchange individual bytes between them and the 8-bit registers such as al and ah?

The usual C++ version with the % (modulo) operator is very slow compared with the __asm version on my computer (Pentium M Centrino, 2.0 GHz).
If you can tell me how to get rid of the division instruction in the __asm block, it will be even faster.

When I run the program, it prints:

(for the values var_1=17, var_2=2; all loops run 200M times)

17 is 10001 in number system 2
__asm(clock)...........: 7250    <------too bad. it is 8-bit calc.
C++(clock).............: 12250   <------not very slow(var_2 is a power of 2)


(for the values var_1=33, var_2=7; all loops run 200M times)
33 is 45 in number system 7
__asm(clock)...........: 2875    <------not good. it is 8-bit calc.
C++(clock).............: 6328    <------really slow(var_2 is not a power of 2)

The second C++ code (the one with the % operator):

t1=clock();//reference time
for(int i=0;i<200000000;i++)
{
    y=x;
    counter=0;
    while(y>=g)   //keep dividing while more than one digit remains (y==g still yields two digits)
    {   

        var_3[counter]=y%g;
        y/=g;
        counter++;
    }

     var_3[counter]=y%g;
}   
t2=clock();//final time

__asm code:

     __asm  // i love assembly in some parts of C++
        {

        pushf   //here does register backup
        push eax
        push ebx
        push ecx
        push edx
        push edi

            mov eax,0h      //this will be outer loop counter init to zero
            //init of medium-big registers to zero
            movd xmm0,eax    //cannot set to immediate constant: xmm0=outer loop counter 
            shufps xmm0,xmm0,0h //broadcasts the low dword (zero) to all four lanes, so all bits are zero
            movd xmm1,eax
            movd xmm2,eax   
            shufps xmm1,xmm1,0h
            shufps xmm2,xmm2,0h
            movd xmm3,eax 
            shufps xmm3,xmm3,0h//could have made pxor xmm3,xmm3(single instruction)
            //init complete(xmm0,xmm1,xmm2,xmm3 are zero)

            movd xmm1,[var_1] //storing variable_1 to register
            movd xmm2,[var_2] //storing var_2 to register    
            lea ebx,var_3     //calculate var_3 address
            movd xmm3,ebx     //storing var_3's address to register
            for_loop:
            mov eax,0h      
            //clear eax before loading the 8-bit dividend
            movd edx,xmm2
            mov cl,dl       //this is var_2 (the base, used as the divisor) stored in cl
            movd edx,xmm1
            mov al,dl       //this is var_1 (the value, used as the dividend) stored in al
            mov edx,0h      //digit array index init to zero
            dng:
                mov ah,00h      //preparation for a 8-bit division
                div cl          //divide

                movd ebx,xmm3   //get var_3 address
                add ebx,edx     //i couldnt find a way to multiply with 4
                add ebx,edx     //so i added 4 times ^^
                add ebx,edx     //add   
                add ebx,edx     //last adding
                //below, mov [ebx],ah is the only memory accessing instruction
                mov [ebx],ah    //(8 bit)this line is equivalent to var_3[i]=remainder


                inc edx         //i++;
                cmp al,00h      //is division zero?
            jne dng             //if no, loop again

            //here edx register has the number of digits

            movd eax,xmm0       //get the outer loop counter from medium-big register
            add eax,01h         //j++;
            movd xmm0,eax       //store the new counter to medium-big register
            cmp eax,0BEBC200h           //is j<(200,000,000) ?
            jb for_loop     //if yes, go loop again
            mov [var_3_size],edx //now we have number of digits too!
         //here does registers revert back to old values
        pop edi
        pop edx
        pop ecx
        pop ebx
        pop eax
        popf     

        }

Whole code:

#include <iostream.h>
#include <cmath>
#include<stdlib.h>
#include<stdio.h>
#include<time.h>
int main()
    {

    srand(time(0));


    clock_t t1=clock();
    clock_t t2=clock();

    int var_1=17;  //number itself
    int var_2=2;   //number system
    int var_3[100];  //digits to be showed(maximum 100 as seen )
    int var_3_size=0;//asm block will decide what will the number of  digits be

    for(int i=0;i<100;i++)
    {
    var_3[i]=0; //here we initialize digits to zeroes
    }


    t1=clock();//reference time to take
     __asm  // i love assembly in some parts of C++
        {

        pushf   //here does register backup
        push eax
        push ebx
        push ecx
        push edx
        push edi

            mov eax,0h      //this will be outer loop counter init to zero
            //init of medium-big registers to zero
            movd xmm0,eax    //cannot set to immediate constant: xmm0=outer loop counter 
            shufps xmm0,xmm0,0h //broadcasts the low dword (zero) to all four lanes, so all bits are zero
            movd xmm1,eax
            movd xmm2,eax   
            shufps xmm1,xmm1,0h
            shufps xmm2,xmm2,0h
            movd xmm3,eax 
            shufps xmm3,xmm3,0h
            //init complete(xmm0,xmm1,xmm2,xmm3 are zero)

            movd xmm1,[var_1] //storing variable_1 to register
            movd xmm2,[var_2] //storing var_2 to register    
            lea ebx,var_3     //calculate var_3 address
            movd xmm3,ebx     //storing var_3's address to register
            for_loop:
            mov eax,0h      
            //clear eax before loading the 8-bit dividend
            movd edx,xmm2
            mov cl,dl       //this is var_2 (the base, used as the divisor) stored in cl
            movd edx,xmm1
            mov al,dl       //this is var_1 (the value, used as the dividend) stored in al
            mov edx,0h      //digit array index init to zero
            dng:
                mov ah,00h      //preparation for a 8-bit division
                div cl          //divide

                movd ebx,xmm3   //get var_3 address
                add ebx,edx     //i couldnt find a way to multiply with 4
                add ebx,edx     //so i added 4 times ^^
                add ebx,edx     //add   
                add ebx,edx     //last adding
                //below, mov [ebx],ah is the only memory accessing instruction
                mov [ebx],ah    //(8 bit)this line is equivalent to var_3[i]=remainder


                inc edx         //i++;
                cmp al,00h      //is division zero?
            jne dng             //if no, loop again

            //here edx register has the number of digits

            movd eax,xmm0       //get the outer loop counter from medium-big register
            add eax,01h         //j++;
            movd xmm0,eax       //store the new counter to medium-big register
            cmp eax,0BEBC200h           //is j<(200,000,000) ?
            jb for_loop     //if yes, go loop again
            mov [var_3_size],edx //now we have number of digits too!
         //here does registers revert back to old values
        pop edi
        pop edx
        pop ecx
        pop ebx
        pop eax
        popf     

        }
    t2=clock(); //finish time
    printf("\n assembly_inline(clocks): %i  for the 200 million calculations",(t2-t1)); 

        printf("\n value %i(in decimal) is: ",var_1);
for(int i=var_3_size-1;i>=0;i--)
{
    printf("%i",var_3[i]);
}
printf(" in the number system: %i \n",var_2);




//and: more readable form(end easier)
    int counter=var_3_size;
    int x=var_1;
    int g=var_2;
    int y=x;// backup
t1=clock();//reference time

for(int i=0;i<200000000;i++)
{
    y=x;
    counter=0;
    while(y>=g)   //keep dividing while more than one digit remains (y==g still yields two digits)
    {   

        var_3[counter]=y%g;
        y/=g;
        counter++;
    }

     var_3[counter]=y%g;
}

t2=clock();//final time
printf("\n C++(clocks): %i  for the 200 million calculations",(t2-t1)); 

printf("\n value %i(in decimal) is: ",x);
for(int i=var_3_size-1;i>=0;i--)
{
    printf("%i",var_3[i]);
}
printf(" in the number system: %i \n",g);
return 0;

}

Edit: this is the 32-bit version:

    void get_digits_asm()
{
    __asm
    {

        pushf       //couldnt store this in other registers 
        movd xmm0,eax//storing in xmm registers instead of pushing
        movd xmm1,ebx//
        movd xmm2,ecx//
        movd xmm3,edx//
        movd xmm4,edi//end of push backups

        mov eax,[variable_x]
        mov ebx,[number_system]
        mov ecx,0h
        mov edi,0h

        begin_loop:
        mov edx,0h
        div ebx             
        lea edi,digits  
        mov [edi+ecx*4],edx
        add ecx,01h
        cmp eax,ebx
        jae begin_loop      //keep looping while the quotient still has more than one digit

        mov edx,0
        div ebx
        lea edi,digits
        mov [edi+ecx*4],edx
        inc ecx
        mov [digits_total],ecx


        movd edi,xmm4//pop edi
        movd edx,xmm3//pop edx
        movd ecx,xmm2//pop ecx
        movd ebx,xmm1//pop ebx
        movd eax,xmm0//pop eax
        popf            
    }

}
1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 6:01 am

    The code can of course be much simpler (modeled after the C++ version; it does not include the pushes and pops, and is not tested):

      mov esi,200000000
    _bigloop:
      mov eax,[y]
      mov ebx,[g]
      lea edi,var_3
      ; eax = y
      ; ebx = g
      ; edi = var_3
      xor ecx,ecx
      ; ecx = counter
    _loop:
      xor edx,edx
      div ebx
      mov [edi+ecx*4],edx
      add ecx,1
      test eax,eax
      jnz _loop
      sub esi,1
      jnz _bigloop
    

    But I would be surprised if it were faster than the C++ version, and in fact it will almost certainly be slower if the base is a power of two: when the divisor is known at compile time, all sane compilers know how to turn a division and/or modulo by a power of two into bit shifts and bitwise ANDs.
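
    To make the power-of-two point concrete, here is a minimal C++ sketch, not taken from the question's code. The names `digits_generic` and `digits_pow2` are hypothetical helpers introduced only for illustration, and `shift` is assumed to be between 1 and 31. It shows that for an unsigned value, `value % base` and `value / base` with `base == 1u << shift` give the same digits as a mask and a right shift, with no division instruction involved.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: digits of `value` in an arbitrary base (>= 2),
// least significant digit first, using the usual % and / operators.
std::vector<std::uint32_t> digits_generic(std::uint32_t value, std::uint32_t base) {
    std::vector<std::uint32_t> digits;
    do {
        digits.push_back(value % base);
        value /= base;
    } while (value != 0);  // do/while so that 0 still produces one digit
    return digits;
}

// Hypothetical helper: same result for base == 1u << shift (assumes 1 <= shift <= 31),
// but with a bitwise AND in place of % and a right shift in place of /.
std::vector<std::uint32_t> digits_pow2(std::uint32_t value, unsigned shift) {
    const std::uint32_t mask = (1u << shift) - 1u;
    std::vector<std::uint32_t> digits;
    do {
        digits.push_back(value & mask);  // equals value % (1u << shift)
        value >>= shift;                 // equals value / (1u << shift)
    } while (value != 0);
    return digits;
}
```

    For example, `digits_pow2(17, 1)` and `digits_generic(17, 2)` both yield the digits 1, 0, 0, 0, 1 (least significant first), matching the "17 is 10001 in number system 2" output in the question.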


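    Here’s a version that uses an 8-bit division. Similar caveats apply, but now the division can even overflow: div bl raises a divide error if the quotient does not fit in al (that is, if y / g is more than 255). Note that it stores one byte per digit, so var_3 has to be treated as a byte array here.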

      mov esi,200000000
    _bigloop:
      mov eax,[y]
      mov ebx,[g]
      lea edi,var_3
      ; eax = y
      ; ebx = g
      ; edi = var_3
      xor ecx,ecx
      ; ecx = counter
    _loop:
      div bl
      mov [edi+ecx],ah
      add ecx,1
      and eax,0xFF
      jnz _loop
      sub esi,1
      jnz _bigloop
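
    The overflow caveat can be checked before the loop starts. The following is a small C++ sketch under stated assumptions: `div8_is_safe` is a hypothetical helper name, `ax` stands for the 16-bit dividend that `div bl` reads from the ax register, and `divisor` stands for bl. The quotient fits in al exactly when the high byte of the dividend is smaller than the divisor.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical guard: returns true when an 8-bit `div` of the 16-bit
// dividend `ax` by `divisor` would neither divide by zero nor produce
// a quotient larger than 255 (both cases raise a divide error on x86).
bool div8_is_safe(std::uint16_t ax, std::uint8_t divisor) {
    // ax / divisor <= 255  <=>  ax < 256 * divisor  <=>  (ax >> 8) < divisor
    return divisor != 0 && (ax >> 8) < divisor;
}
```

    With the question's values (17 in base 2, 33 in base 7) the check passes, since the dividend fits in one byte. A value such as 600 in base 2 would fail it, because 600 / 2 = 300 does not fit in al.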
    
© 2021 The Archive Base. All Rights Reserved