The Archive Base

Editorial Team
Asked: May 17, 2026

Is there a way to speed up this 1D convolution?

I tried to make the dy pass cache-efficient, but compiling with g++ and -O3 gave worse performance.

I am convolving with the kernel [-1, 0, 1] in both directions.
This is not homework.

#include<iostream>
#include<cstdlib>
#include<sys/time.h>

void print_matrix( int height, int width, float *matrix){
  for (int j=0; j < height; j++){
    for (int i=0; i < width; i++){
      std::cout << matrix[j * width + i] << ",";
    }
    std::cout << std::endl;
  }
}

void fill_matrix( int height, int width,  float *matrix){
  for (int j=0; j < height; j++){
    for (int i=0; i < width; i++){
      matrix[j * width + i] = ((float)rand() / (float)RAND_MAX) ;
    }
  }
}

#define RESTRICT __restrict__

void dx_matrix( int height, int width, float * RESTRICT in_matrix,  float * RESTRICT out_matrix, float *min, float *max){
  //init min,max
  *min = *max = -1.F * in_matrix[0] + in_matrix[1]; 

    for (int j=0; j < height; j++){
      float* row = in_matrix + j * width;
      for (int i=1; i < width-1; i++){
        float res = -1.F * row[i-1] + row[i+1]; /* -1.F * value + 0.F * value + 1.F * value; */ 
        if (res > *max ) *max = res;
        if (res < *min ) *min = res;
        out_matrix[j * width + i] = res;
      }
    }
}

void dy_matrix( int height, int width, float * RESTRICT in_matrix,  float * RESTRICT out_matrix, float *min, float *max){
  //init min,max
  *min = *max = -1.F * in_matrix[0] + in_matrix[2 * width]; // first result the loop produces: row 1, column 0

  for (int j=1; j < height-1; j++){
      for (int i=0; i < width; i++){
        float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;
        if (res > *max ) *max = res;
        if (res < *min ) *min = res;
        out_matrix[j * width + i] =  res;
      }
    }
}

double now (void)                                                                                          
{                                                                                                                    
  struct timeval tv;                                                                                               
  gettimeofday(&tv, NULL);                                                                                         
  return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}


int main(int argc, char **argv){

  int width, height;
  float *in_matrix;
  float *out_matrix;

  if(argc < 3){
    std::cout << "usage: " << argv[0] << " width height" << std::endl;
    return -1;
  }

  srand(123);

  width = atoi(argv[1]);
  height = atoi(argv[2]);

  std::cout << "Width:"<< width << " Height:" << height << std::endl;

  if (width < 3){
    std::cout << "Width too short " << std::endl;
    return -1;
  }
  if (height < 3){
    std::cout << "Height too short " << std::endl;
    return -1;
  }

  in_matrix = (float *) malloc( height * width * sizeof(float));
  out_matrix = (float *) malloc( height * width * sizeof(float));

  fill_matrix(height, width, in_matrix);
  //print_matrix(height, width, in_matrix);

  float min, max;

  double a = now();
  dx_matrix(height, width, in_matrix, out_matrix, &min, &max);
  std::cout << "dx min:" << min << " max:" << max << std::endl;

  dy_matrix(height, width, in_matrix, out_matrix, &min, &max);
  double b = now();
  std::cout << "dy min:" << min << " max:" << max << std::endl;
  std::cout << "time: " << b-a << " sec" << std::endl;
  free(in_matrix);
  free(out_matrix);

  return 0;
}
1 Answer

  1. Editorial Team
    Added an answer on May 17, 2026 at 3:11 pm

    First of all, I would rewrite the dy loop to get rid of the per-element index arithmetic in “in_matrix[ (j-1) * width + i]” and “in_matrix[ (j+1) * width + i]”, and do something like:

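      for (int j=1; j < height-1; j++){
        // Row pointers are computed once per row instead of once per element
        float *p = &in_matrix[(j-1)*width];   // row above
        float *q = &in_matrix[(j+1)*width];   // row below
        float *out = &out_matrix[j*width];    // output row
        for (int i=0; i < width; i++){
          float res = -1.F * p[i] + q[i];
          if (res > *max ) *max = res;
          if (res < *min ) *min = res;
          out[i] = res;
        }
      }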
    

    But that is a trivial optimization that the compiler may already be doing for you.

    It may be slightly faster to compute “q[i] - p[i]” instead of “-1.F * p[i] + q[i]”, but, again, the compiler may be smart enough to do that behind your back.
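
    Putting the two suggestions so far together, a complete version of the dy pass might look like the sketch below. The function name dy_matrix_rows is made up for illustration, and keeping min/max in local variables until the end is an extra assumption on my part rather than something stated above. It assumes height >= 3 and width >= 1, which the main() in the question already enforces.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical variant of dy_matrix (the name is illustrative only).
// Assumes height >= 3 and width >= 1, as the question's main() checks.
void dy_matrix_rows(int height, int width, const float* in_matrix,
                    float* out_matrix, float* min, float* max) {
  assert(height >= 3 && width >= 1);

  // Seed min/max with the first value the loop produces (row 1, column 0).
  float lo = in_matrix[2 * (std::size_t)width] - in_matrix[0];
  float hi = lo;

  for (int j = 1; j < height - 1; j++) {
    // Row pointers are computed once per row, not once per element.
    const float* p = in_matrix + (std::size_t)(j - 1) * width;  // row above
    const float* q = in_matrix + (std::size_t)(j + 1) * width;  // row below
    float* out = out_matrix + (std::size_t)j * width;
    for (int i = 0; i < width; i++) {
      float res = q[i] - p[i];  // same value as -1.F * p[i] + q[i]
      if (res > hi) hi = res;
      if (res < lo) lo = res;
      out[i] = res;
    }
  }

  // Assumption: local accumulators, written back once at the end.
  *min = lo;
  *max = hi;
}
```

    It is called the same way as the original: dy_matrix_rows(height, width, in_matrix, out_matrix, &min, &max). Whether it is measurably faster than what -O3 already produces is something to time, not assume.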

    The whole thing would benefit considerably from SSE2 and multithreading. I’d bet on at least a 3x speedup from SSE2 right away. Multithreading can be added using OpenMP and it will only take a few lines of code.
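
    As a rough sketch of what that could look like for the dy pass: the code below processes four floats at a time with SSE intrinsics and splits the rows across threads with OpenMP. The names dy_row and dy_matrix_simd are hypothetical. The packed-float intrinsics used come from <xmmintrin.h> (SSE), which every SSE2-capable CPU also supports. The min/max reduction clauses assume a compiler with OpenMP 3.1 or later; without OpenMP enabled the pragma is ignored and the loop simply runs on one thread. No speedup figure is implied here, so measure it on your own data.

```cpp
#include <cassert>
#include <cfloat>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE packed-float intrinsics
#define DY_HAVE_SSE 1
#endif

// Hypothetical helper: out[i] = below[i] - above[i] for one row,
// folding every result into the running lo/hi.
static void dy_row(const float* above, const float* below, float* out,
                   int width, float& lo, float& hi) {
  int i = 0;
#ifdef DY_HAVE_SSE
  __m128 vlo = _mm_set1_ps(lo);
  __m128 vhi = _mm_set1_ps(hi);
  for (; i + 4 <= width; i += 4) {
    // Unaligned loads/stores: rows are not guaranteed to be 16-byte aligned.
    __m128 r = _mm_sub_ps(_mm_loadu_ps(below + i), _mm_loadu_ps(above + i));
    vlo = _mm_min_ps(vlo, r);
    vhi = _mm_max_ps(vhi, r);
    _mm_storeu_ps(out + i, r);
  }
  // Reduce the four lanes back to scalars.
  float tlo[4], thi[4];
  _mm_storeu_ps(tlo, vlo);
  _mm_storeu_ps(thi, vhi);
  for (int k = 0; k < 4; k++) {
    if (tlo[k] < lo) lo = tlo[k];
    if (thi[k] > hi) hi = thi[k];
  }
#endif
  // Scalar tail (or the whole row when SSE is unavailable).
  for (; i < width; i++) {
    float res = below[i] - above[i];
    if (res < lo) lo = res;
    if (res > hi) hi = res;
    out[i] = res;
  }
}

// Hypothetical SIMD + OpenMP variant of dy_matrix.
// Assumes height >= 3 and width >= 1, as the question's main() checks.
void dy_matrix_simd(int height, int width, const float* in_matrix,
                    float* out_matrix, float* min, float* max) {
  assert(height >= 3 && width >= 1);
  float lo = FLT_MAX, hi = -FLT_MAX;

  // Each output row depends only on the input, so rows can run in parallel.
  // Every thread gets private lo/hi copies that OpenMP combines at the end.
#pragma omp parallel for reduction(min : lo) reduction(max : hi)
  for (int j = 1; j < height - 1; j++) {
    dy_row(in_matrix + (std::size_t)(j - 1) * width,
           in_matrix + (std::size_t)(j + 1) * width,
           out_matrix + (std::size_t)j * width, width, lo, hi);
  }

  *min = lo;
  *max = hi;
}
```

    With g++ this would be built with something like "g++ -O3 -fopenmp" and called as dy_matrix_simd(height, width, in_matrix, out_matrix, &min, &max). The dx pass can be handled the same way, using unaligned loads from row + i - 1 and row + i + 1.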
