Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8389191
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 9, 20262026-06-09T18:39:28+00:00 2026-06-09T18:39:28+00:00

I am learning OpenACC (with PGI’s compiler) and trying to optimize matrix multiplication example.

  • 0

I am learning OpenACC (with PGI’s compiler) and trying to optimize matrix multiplication example. The fastest implementation I came up so far is the following:

void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate){

#pragma acc data copyin (a[0: N * N ], b[0: N * N]) copyout (r [0: N * N ]) if(accelerate)
{
# pragma acc region if(accelerate)
{
# pragma acc loop independent vector(32) 
for (int j = 0; j < N; j ++)
{    
   # pragma acc loop independent vector(32) 
   for (int i = 0; i < N ; i ++ )
   {
      float sum = 0;
      for (int k = 0; k < N ; k ++ ) {
         sum += a [ i + k*N ] * b [ k + j * N ];
      }
      r[i + j * N ] = sum ;
   }
}
}
}

This results in thread blocks of size 32×32 threads and gives me the best performance so far.
Here are the benchmarks:

Matrix multiplication (1500x1500): 
GPU: Geforce GT650 M, 64-bit Linux 

Data sz             : 1500     
Unaccelerated:
     matrix_mul() time    : 5873.255333 msec
Accelerated:
     matrix_mul() time    : 420.414700 msec

Data size             : 1750 x 1750     
    matrix_mul() time    : 876.271200 msec
Data size             : 2000 x 2000     
    matrix_mul() time    : 1147.783400 msec
Data size             : 2250 x 2250     
    matrix_mul() time    : 1863.458100 msec
Data size             : 2500 x 2500     
    matrix_mul() time    : 2516.493200 msec

Unfortunately I realized that the generated CUDA code is quite primitive (e.g. it does not even use shared memory) and hence cannot compete with hand-optimized CUDA program. As a reference implementation I took Arrayfire lib with the following results:

Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af:  0.03166 seconds

Arrayfire 1750 x 1750 matrix mul
 af:  0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
 af:  0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
 af:  0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
 af:  0.14795 seconds

I wonder if there any suggestions how to get better performance from OpenACC ?
Perhaps my choice of directives is not right ?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-09T18:39:30+00:00Added an answer on June 9, 2026 at 6:39 pm

    You’re getting right at a 14x speedup, which is pretty good for PGI’s compiler in my experience.

    First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.

    You are using a 32×32 thread block, but in my experience 16×16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?

    Declaring a and b with restrict might let the compiler generate better code.

    Just by looking at your code, I’m not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you’re not reusing any part of a or b after reading it.

    It’s also worth noting that I’ve had bad experiences with PGI’s compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (seems to mostly happen if you iterate a loop backward), generating wrong results. I actually have to compile my current application using the undocumented -ta=nvidia,nocache option to get it to work correctly, by bypassing shared memory usage altogether.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Learning CasperJS Trying to understand why the following is not displaying my results in
Learning some VBA. So far, I've constructed this piece of code which should allow
Learning xml, Can anyone help me? I have following XML code: **<book lang=en>name of
Learning how to do a master page in asp.net. Trying to figure out how
Learning java server technologies, trying to clarify some things. There are few technologies that
Learning rails and something smells a little funny. I have the following form for
For learning purposes, I'm trying to make a function using Python that takes in
When learning regular expressions, I once saw the following four examples. How can I
Learning C and I'm trying to get a visual comparison of the variable types
Learning how to test is HARD! I'm just trying to get MiniTest running and

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.