Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8397089
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 9, 20262026-06-09T20:42:07+00:00 2026-06-09T20:42:07+00:00

I’ve got a Jacobi implementation on CUDA, but the problem is: I assign threads

  • 0

I’ve got a Jacobi implementation on CUDA, but the problem is:

I assign threads at this way:

#define imin(a,b) (a < b ? a : b)
int  dimBlocks, dimThreads;
dimThreads = 256;
dimBlocks =  imin(32, (dimThreads + dim - 1)/dimThreads);

But if I use 32 threads it’s fastest than using 256 threads or moreover…

I’ve got these results:

Sequential times:
9900 5.882000 
9900 6.071000

Parallel times:
9900 1.341000 //using 32
9900 1.626000 //using 256

Where 9900 is matrix WIDTH… And we can see the following:

5.882 / 1.34 = 4.39
6.07 / 1.62 = 3.74

So 32 threads is more efficient than 256?

Sorry, I don’t know if I should upload the code(since they are a bit long), if you request it I will do it.

EDIT:

//Based on doubletony algorithm
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "Jacobi.cuh"

#include "thrust\host_vector.h"
#include "thrust\device_vector.h"
#include "thrust\extrema.h"

#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <ctime>

#define imin(a,b) (a < b ? a : b)

// name OF FUNCTION: __copy_vector
// PURPOSE:
//    The function will copy a vector.
//
// PARAMETERS:
// name         type      value/reference   description
// ---------------------------------------------------------------------
// source       double*   value             vector to be copied
// dest         double*   reference         vector copied
// dim          int       value             vector dimension
// RETURN VALUE:
// name         type  description
// ---------------------------------------------------------------------
//
__global__ void __copy_vector(double *source, double *dest, const int  dim)
{
    int  tIdx = blockDim.x * blockIdx.x + threadIdx.x;
    while(tIdx < dim){
        dest[tIdx] = source[tIdx];
        tIdx += gridDim.x * blockDim.x;
    }
}

// name OF FUNCTION: __Jacobi_sum
// PURPOSE:
//    The function will execute matrix vector multiplication
//
// PARAMETERS:
// name         type      value/reference   description
// ---------------------------------------------------------------------
// A            double*   value             A
// B            double*   value             B
// C            double*   reference         A*B
// dim          int       value             vector dimension
// RETURN VALUE:
// name         type  description
// ---------------------------------------------------------------------
//
__global__ void __Jacobi_sum(const double *A, 
                             const double *B, 
                             double *resul, 
                             const int  dim)
{
    int  tIdx = blockIdx.x * blockDim.x + threadIdx.x;
    while(tIdx < dim){
        resul[tIdx] = 0;
        for(int i = 0; i < dim; i++)
            if(tIdx != i)
                resul[tIdx] += A[tIdx * dim + i] * B[i];
        tIdx += gridDim.x * blockDim.x;
    }
    __syncthreads;
}
// name OF FUNCTION: __substract
// PURPOSE:
//    The function will execute A-B=resul
//
// PARAMETERS:
// name         type      value/reference   description
// ---------------------------------------------------------------------
// A            double*   value             A
// B            double*   value             B
// C            double*   reference         A-B
// dim          int       value             vector dimension
// RETURN VALUE:
// name         type  description
// ---------------------------------------------------------------------
//
__global__ void __substract(const double *A, 
                            const double *B, 
                            double *C, 
                            const int  dim)
{
    int  tIdx = blockIdx.x * blockDim.x + threadIdx.x;
    while(tIdx < dim){
        C[tIdx] = A[tIdx] - B[tIdx];
        tIdx += gridDim.x * blockDim.x;
    }
}
// name OF FUNCTION: __divide
// PURPOSE:
//    The function will execute the jacobi division, that is, 
//    (B-sum)/A[i,i]
//
// PARAMETERS:
// name         type      value/reference   description
// ---------------------------------------------------------------------
// A            double*   value             A
// B            double*   reference         (B-sum)/A[i,i]
// dim          int       value             vector dimension
// RETURN VALUE:
// name         type  description
// ---------------------------------------------------------------------
//
__global__ void __divide(const double *A, double *B, const int dim)
{
    int  tIdx = blockIdx.x * blockDim.x + threadIdx.x;
    while(tIdx < dim){
        //if(A[tIdx * dim + tIdx] != 0)
            B[tIdx] /= A[tIdx * dim + tIdx];
        tIdx += blockDim.x * gridDim.x;
    }
}
// name OF FUNCTION: __absolute
// PURPOSE:
//    The function will calculate the absolute value for each
//    number in an array
//
// PARAMETERS:
// name         type      value/reference   description
// ---------------------------------------------------------------------
// A            double*   reference         |A[i]|
// dim          int       value             vector dimension
// RETURN VALUE:
// name         type  description
// ---------------------------------------------------------------------
//
__global__ void __absolute(double *A, const int dim)
{
    int  tIdx = blockIdx.x * blockDim.x + threadIdx.x;
    while(tIdx < dim){
        if(A[tIdx] < 0)
            A[tIdx] = -A[tIdx];
        tIdx += blockDim.x * gridDim.x;
    }
}
// name OF FUNCTION: Jacobi_Cuda
// PURPOSE:
//    The function will calculate a X solution for a linear system
//    using Jacobi's Method.
//
// PARAMETERS:
// name         type      value/reference   description
// ---------------------------------------------------------------------
// Matrix_A     double*   value             Matrix A(coefficients)
// Vector_B     double*   value             Vector B
// Vector_X     double*   reference         Solution
// dim          int       value             Matrix Dimension
// e            double    value             Error allowed
// maxIter      int       value             Maximum iterations allowed
// RETURN VALUE:
// name         type  description
// ---------------------------------------------------------------------
//

void Jacobi_Cuda(const double *Matrix_A, 
                 const double *Vector_B,
                 double *Vector_X, 
                 const  int  dim, 
                 const double e,
                 const int  maxIter,
                 double *t)
{

    /** Host variables **/
    int iter = 0; // iter counter
    double err = 1; // error between X^k and X^k-1
    double *tmp; // temporary for thrust norm
    double *norm; // Vector norm

    tmp = (double *) malloc(sizeof(double) * dim);
    norm = (double *) malloc(sizeof(double));

    int  dimBlocks, dimThreads;
    dimThreads = 64;
    dimBlocks =  imin(32, (dim + dimThreads - 1)/dimThreads);
    /** ************** **/

    /** Device variables **/
    double *d_Matrix_A, *d_Vector_B, *d_Vector_X, *d_Vector_Y, *d_Vector_Resul;

    cudaMalloc((void**)&d_Matrix_A, sizeof(double) * dim * dim);
    cudaMalloc((void**)&d_Vector_B, sizeof(double) * dim);
    cudaMalloc((void**)&d_Vector_X, sizeof(double) * dim);
    cudaMalloc((void**)&d_Vector_Y, sizeof(double) * dim);
    cudaMalloc((void**)&d_Vector_Resul, sizeof(double) * dim);

    /** **************** **/

    /** Initialize **/
    cudaMemcpy(d_Matrix_A, Matrix_A, sizeof(double) * dim * dim, 
                cudaMemcpyHostToDevice);
    cudaMemcpy(d_Vector_B, Vector_B, sizeof(double) * dim, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Vector_X, Vector_X, sizeof(double) * dim, cudaMemcpyHostToDevice);
    /** ********** **/


    clock_t start,finish;
    double totaltime;
    start = clock(); 

    /** Jacobi **/
    while(err > e && iter < maxIter){

        __copy_vector<<<dimBlocks, dimThreads>>>(d_Vector_X, d_Vector_Y, dim);

        __Jacobi_sum<<<dimBlocks, dimThreads>>>(d_Matrix_A, d_Vector_Y, 
                                                d_Vector_Resul, dim); 
        __substract<<<dimBlocks, dimThreads>>>(d_Vector_B, d_Vector_Resul, 
                                                d_Vector_X, dim);

        __divide<<<dimBlocks, dimThreads>>>(d_Matrix_A, d_Vector_X, dim);

        __substract<<<dimBlocks, dimThreads>>>(d_Vector_Y, d_Vector_X,
                                                d_Vector_Resul, dim);
        __absolute<<<dimBlocks, dimThreads>>>(d_Vector_Resul, dim);

        cudaMemcpy(tmp, d_Vector_Resul, sizeof(double) * dim, cudaMemcpyDeviceToHost);

        double *t = thrust::max_element(tmp, tmp + dim); //vector norm

        err = *t;

        iter++;
    }

    finish = clock();

    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;   

    *t = totaltime;

    cudaMemcpy(Vector_X, d_Vector_X, sizeof(double) * dim, 
                cudaMemcpyDeviceToHost);
    if(iter == maxIter)
        puts("Jacobi has reached maxIter!");
    /** ****** **/

    /** Free memory **/
    cudaFree(d_Matrix_A);
    cudaFree(d_Vector_B);
    cudaFree(d_Vector_X);
    cudaFree(d_Vector_Y);
    cudaFree(d_Vector_Resul);
    free(tmp);
    free(norm);
    /** *********** **/
}
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-09T20:42:08+00:00Added an answer on June 9, 2026 at 8:42 pm

    It depends on your algorithm. Some algorithms are by definition non-parallelizable (calculating the Fibonacci series, for example). But here’s a parallelizable Jacobi algorithm courtesy of Brown. Note that solving systems of equations CAN be solved either in serial or in parallel, it’s just a matter of writing the code.

    In short, it’s impossible to know whether or not more threads = more speed unless you show us (or at least explain) the algorithm. As far as thread synchronization goes, CUDA is very (very) good at normalizing synchronization costs so (if your algorithm is proper), more threads should almost always yield more speed.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a string like this: La Torre Eiffel paragonata all&#8217;Everest What PHP function
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
I want to count how many characters a certain string has in PHP, but
link Im having trouble converting the html entites into html characters, (&# 8217;) i
For some reason, after submitting a string like this Jack’s Spindle from a text
I've got a string that has curly quotes in it. I'd like to replace
this is what i have right now Drawing an RSS feed into the php,
I have this code to decode numeric html entities to the UTF8 equivalent character.
I have a French site that I want to parse, but am running into
I have this code: - (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock { NSString *someString = [[NSString

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.