Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 5943787
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 22, 20262026-05-22T16:26:44+00:00 2026-05-22T16:26:44+00:00

I’m I have come across a strange problem. I’m implementing some linear algebra, only

  • 0

I’m I have come across a strange problem. I’m implementing some linear algebra, only matrix multiplications so far, in OpenCL, and have been testing this on my laptop. The code is really simple:

__kernel void matrix_mult(__global float* a, 
              __global float* b, 
              __global float* c,
              const int N) 
{
  int row = get_global_id(1);
  int col = get_global_id(0);
  float sum = 0.0f;
  for (int i = 0; i < N; i++) {
    sum += a[row*N+i] * b[i*N+col];
  }
  c[row*N+col] = sum;
}

I test the hardware by running the code 100 times like this:

  clock_t begin=clock(); 

  const unsigned int repeats = 100;
  for(int  i = 0; i != repeats; i++){
    runCL(a, b, results,N, N*N);
  }

  clock_t end=clock();

On my MBP matrix_multiplications take about 1.2 ms, on matrices of size 512*512 while the same code takes about 3 ms when running on a GTX 480 Linux box. This bothers me since, I would’t expect the expensive GTX card to be a little faster than the laptop.

As far as I can see either my code is ‘wrong’ of I’m timing in some wrong way.

I tried using the event-based timing system in the OpenCL spec, this gave some a bit more realistic results.

cl_event event = {0}; 
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL, global_work_size, NULL, 0, NULL, &event);
assert(err == CL_SUCCESS);


cl_int err =  clWaitForEvents (1,&event);
cl_ulong start, end; 
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &end,   NULL); 
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL); 
double executionTimeInMilliseconds = (end - start) * 1.0e-6f;
std::cout << "execution time in milis : " << executionTimeInMilliseconds << std::endl;

Now the GT330M will do the operation in 46ms and the GTX480 does it in 2.5 ms. This then makes for another really interesting question, with PROFILING turned on the GT 330M becomes about 30 times slower, this sorta makes sense, but the GTX480 keeps up the same performance. Can anyone explain why this is?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-22T16:26:45+00:00Added an answer on May 22, 2026 at 4:26 pm

    In timing the original problem, what you’re seeing here is that with this naive code, the better specs of the GTX480 are actually hurting you.

    The code sample, a first pass at a matrix multiply, is completely dominated by memory bandwidth; each thread is accessing a different element of B which can’t be coallesced because of the stride.

    The GTX480 has a 3x larger (384 bit) and 2x faster (1840 MHz) memory bus than the GT330M (128bit, 800 MHz). Nominally, that gives a peak bandwidth advantage of 177.4GB/s vs 25.6 GB/s, and since this is memory-bandwidth dominated, you might think that would win. However, because of the non-coalesced reads and the wider memory bus, the b-array accesses are only using 32 bits of that 384 bit memory access, and in the 330M case, only 32 bits out of each 128 bit access. So the effective memory bandwidths for the b access are 14.8GB/s and 6.4GB/s; so now there’s only a factor of 2 difference in total memory bandwidth rather than 7 or so, and so much of the advantage of the faster card is being squandered; in addition, that memory bandwidth has to be divided by 10x as many cores, so the latency for each core to get its access and do the calculation is longer. I suspect that if you used larger matrix sizes, you could hide more of the latency and get at closer to the best-possible 2x speedup rather than the 2.5x slowdown you’re seeing.

    The ultimate solution here is to use a more memory-friendly matrix multiplication algorithm as a benchmark.

    The profiling results you’re seeing, though, I have no idea about. Perhaps the 330M doesn’t have as good hardware support for the profiling, so things have to be implemented in software? Since the GTX numbers are about the same either way, I’d just use the simpler timing approach for now, which since you’re not using asynchronous kernels or transfer, should be fine.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have just tried to save a simple *.rtf file with some websites and
I have a string like this: La Torre Eiffel paragonata all&#8217;Everest What PHP function
I have some data like this: 1 2 3 4 5 9 2 6
I have a jquery bug and I've been looking for hours now, I can't
link Im having trouble converting the html entites into html characters, (&# 8217;) i
For some reason, after submitting a string like this Jack’s Spindle from a text
this is what i have right now Drawing an RSS feed into the php,
I have a French site that I want to parse, but am running into
I am currently running into a problem where an element is coming back from
I want use html5's new tag to play a wav file (currently only supported

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.