The Archive Base Latest Questions

Editorial Team
Asked: June 1, 2026 (2026-06-01T06:09:23+00:00)

Assume that the dimensions are very large (up to 1 billion elements in a matrix). How would I implement a cache-oblivious algorithm for the matrix-vector product? Based on Wikipedia, I would need to divide and conquer recursively, but I feel that would carry a lot of overhead. Would it be efficient to do so?

Follow up question and answer: OpenMP with matrices and vectors

1 Answer

  1. Editorial Team
    Added an answer on June 1, 2026 at 6:09 am

    So the answer to the question “how do I make this basic linear algebra operation fast” is always and everywhere to find and link to a tuned BLAS library for your platform: for example GotoBLAS (whose work is being continued in OpenBLAS), the slower autotuned ATLAS, or commercial packages like Intel’s MKL. Linear algebra is so fundamental to so many other operations that enormous amounts of effort go into optimizing these packages for various platforms, and there’s just no chance you’re going to come up with something in a few afternoons’ work that will compete. The subroutine calls you’re looking for for general dense matrix-vector multiplication are SGEMV/DGEMV/CGEMV/ZGEMV (single, double, complex, and double-complex precision respectively).

    Cache-oblivious algorithms, or autotuning, are for when you can’t be bothered to tune for the specific cache architecture of your system. That might normally be fine, but since people are willing to do that tuning for BLAS routines and then make the tuned results available, you’re best off just using those routines.

    The memory access pattern for GEMV is straightforward enough that you don’t really need divide and conquer (the same goes for the standard case of matrix transpose); you just find the cache blocking size and use it. In GEMV (y = Ax), you still have to scan through the entire matrix once, so there’s nothing to be done for reuse (and thus effective cache use) there, but you can try to reuse x as much as possible so you load it once instead of (number of rows) times, and you still want access to A to be cache friendly. So the obvious cache blocking thing to do is to break along blocks:

      A x -> [ A11 | A12 ] | x1 | = | A11 x1 + A12 x2 |
             [ A21 | A22 ] | x2 |   | A21 x1 + A22 x2 |
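
    A minimal sketch of that blocking, assuming a row-major matrix stored in one contiguous array. The function name gemv_blocked and the block size bs are hypothetical, not part of any BLAS interface, and bs would have to be tuned for the target cache. Each bs x bs tile of A is multiplied against the matching slice of x, so that slice is reused across the rows of the tile while A is still read contiguously:

```c
#include <assert.h>
#include <stddef.h>

/* y = A x for a row-major n_rows x n_cols matrix stored contiguously in a.
   bs is a hypothetical tile size (an assumption, to be tuned per machine). */
void gemv_blocked(const float *a, const float *x, float *y,
                  size_t n_rows, size_t n_cols, size_t bs) {
    assert(bs > 0);
    for (size_t ib = 0; ib < n_rows; ib += bs) {
        size_t iend = (ib + bs < n_rows) ? ib + bs : n_rows;

        /* clear the slice of y owned by this row block */
        for (size_t i = ib; i < iend; i++)
            y[i] = 0.0f;

        for (size_t jb = 0; jb < n_cols; jb += bs) {
            size_t jend = (jb + bs < n_cols) ? jb + bs : n_cols;

            /* one tile: x[jb..jend) is reused for every row of the tile */
            for (size_t i = ib; i < iend; i++) {
                const float *row = a + i * n_cols;
                float sum = 0.0f;
                for (size_t j = jb; j < jend; j++)
                    sum += row[j] * x[j];
                y[i] += sum;
            }
        }
    }
}
```

    Calling gemv_blocked(a, x, y, n, n, 256) on an n x n matrix should give the same y as the plain double loop, up to floating-point rounding; only the traversal order changes. The value 256 is just a placeholder.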
    

    And you can certainly do that recursively. But a naive implementation is slower than the simple double loop, and way slower than a proper SGEMV library call:

    $ ./gemv
    Testing for N=4096
    Double Loop: time = 0.024995, error = 0.000000
    Divide and conquer: time = 0.299945, error = 0.000000
    SGEMV: time = 0.013998, error = 0.000000
    

    The code follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "mkl.h"
    
    float **alloc2d(int n, int m) {
        float *data = malloc(n*m*sizeof(float));
        float **array = malloc(n*sizeof(float *));
        for (int i=0; i<n; i++)
            array[i] = &(data[i*m]);
        return array;
    }
    
    void tick(struct timeval *t) {
        gettimeofday(t, NULL);
    }
    
    /* returns time in seconds from now to time described by t */
    double tock(struct timeval *t) {
        struct timeval now;
        gettimeofday(&now, NULL);
        return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
    }
    
    float checkans(float *y, int n) {
        float err = 0.;
        for (int i=0; i<n; i++)
            err += (y[i] - 1.*i)*(y[i] - 1.*i);
        return err;
    }
    
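    /* assume a square matrix whose size n is a power of two; for other sizes
       the recursion reaches empty sub-blocks and never terminates */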
    void divConquerGEMV(float **a, float *x, float *y, int n,
                        int startr, int endr, int startc, int endc) {
    
        int nr = endr - startr + 1;
        int nc = endc - startc + 1;
    
        if (nr == 1 && nc == 1) {
            /* the row index selects the output element, the column index the input */
            y[startr] += a[startr][startc] * x[startc];
        } else {
            int midr = (endr + startr+1)/2;
            int midc = (endc + startc+1)/2;
            divConquerGEMV(a, x, y, n, startr, midr-1, startc, midc-1);
            divConquerGEMV(a, x, y, n, midr,   endr,   startc, midc-1);
            divConquerGEMV(a, x, y, n, startr, midr-1, midc,   endc);
            divConquerGEMV(a, x, y, n, midr,   endr,   midc,   endc);
        }
    }
    int main(int argc, char **argv) {
        const int n=4096;
        float **a = alloc2d(n,n);
        float *x  = malloc(n*sizeof(float));
        float *y  = malloc(n*sizeof(float));
        struct timeval clock;
        double eltime;
    
        printf("Testing for N=%d\n", n);
    
        for (int i=0; i<n; i++) {
            x[i] = 1.*i;
            for (int j=0; j<n; j++)
                a[i][j] = 0.;
            a[i][i] = 1.;
        }
    
        /* naive double loop */
        tick(&clock);
        for (int i=0; i<n; i++) {
            y[i] = 0.;
            for (int j=0; j<n; j++) {
                y[i] += a[i][j]*x[j];
            }
        }
        eltime = tock(&clock);
        printf("Double Loop: time = %lf, error = %f\n", eltime, checkans(y,n));
    
        for (int i=0; i<n; i++) y[i] = 0.;
    
        /* naive divide and conquer */
        tick(&clock);
        divConquerGEMV(a, x, y, n, 0, n-1, 0, n-1);
        eltime = tock(&clock);
        printf("Divide and conquer: time = %lf, error = %f\n", eltime, checkans(y,n));
    
        /* decent GEMV implementation */
        tick(&clock);
    
        float alpha = 1.;
        float beta =  0.;
        int incrx=1;
        int incry=1;
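        /* a is row-major but this BLAS interface expects column-major storage,
           so request the transpose to get y = A x for a general matrix */
        char trans='T';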
    
        sgemv(&trans,&n,&n,&alpha,&(a[0][0]),&n,x,&incrx,&beta,y,&incry);
        eltime = tock(&clock);
        printf("SGEMV: time = %lf, error = %f\n", eltime, checkans(y,n));
    
        return 0;
    }
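
    The naive divConquerGEMV above recurses all the way down to single elements, so it makes on the order of n*n function calls for one multiply-add each. A common way to reduce that overhead, shown here only as an untimed sketch, is to stop the recursion at a small sub-block and finish with a plain double loop. The names gemv_recursive and gemv_leaf and the cutoff parameter are hypothetical; the matrix is assumed to be row-major in one contiguous array with row stride ld, and the caller is assumed to zero y first:

```c
#include <assert.h>
#include <stddef.h>

/* Base case: plain double loop over a small nr x nc sub-block, adding into y. */
static void gemv_leaf(const float *a, size_t ld, const float *x, float *y,
                      size_t nr, size_t nc) {
    for (size_t i = 0; i < nr; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < nc; j++)
            sum += a[i * ld + j] * x[j];
        y[i] += sum;
    }
}

/* y += A x on an nr x nc sub-block of a row-major matrix with row stride ld.
   cutoff is a hypothetical tuning knob: sub-blocks no larger than
   cutoff x cutoff are handled by the double loop instead of recursing. */
void gemv_recursive(const float *a, size_t ld, const float *x, float *y,
                    size_t nr, size_t nc, size_t cutoff) {
    assert(cutoff > 0);
    if (nr <= cutoff && nc <= cutoff) {
        gemv_leaf(a, ld, x, y, nr, nc);
    } else if (nr >= nc) {
        /* split the rows: each half writes its own part of y */
        size_t half = nr / 2;
        gemv_recursive(a, ld, x, y, half, nc, cutoff);
        gemv_recursive(a + half * ld, ld, x, y + half, nr - half, nc, cutoff);
    } else {
        /* split the columns: each half reads its own part of x */
        size_t half = nc / 2;
        gemv_recursive(a, ld, x, y, nr, half, cutoff);
        gemv_recursive(a + half, ld, x + half, y, nr, nc - half, cutoff);
    }
}
```

    Because it always halves the larger dimension, this variant also handles non-square sizes and sizes that are not powers of two. Whether it beats the double loop on a given machine would have to be measured; it is not expected to compete with a tuned SGEMV.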
    