In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here’s the full function

Asked: June 15, 2026 (2026-06-15T10:48:29+00:00)

In cuBLAS, cublasIsamin() gives the argmin for a single-precision array.

Here’s the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result)

The cuBLAS programmer guide provides this information about the cublasIsamin() parameters:
[Image not preserved: the cublasIsamin() parameter table from the cuBLAS documentation]

If I use host (CPU) memory for result, then cublasIsamin works properly. Here’s an example:

void argmin_experiment_hostOutput(){
    float h_A[4] = {1, 2, 3, 4}; int N = 4; 
    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));

    int result; //host memory
    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, &result));
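    //cuBLAS returns a 1-based index, so subtract 1 before indexing the C array
    printf("argmin = %d, min = %f \n", result, h_A[result - 1]);

    CHECK_CUBLAS(cublasDestroy(handle));
    CHECK_CUDART(cudaFree(d_A));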
}

However, if I use device (GPU) memory for result, then cublasIsamin segfaults. Here’s an example that segfaults:

void argmin_experiment_deviceOutput(){
    float h_A[4] = {1, 2, 3, 4}; int N = 4;
    float* d_A = 0;
    CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0])));
    CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1));
    cublasHandle_t handle; CHECK_CUBLAS(cublasCreate(&handle));

    int* d_result = 0; 
    CHECK_CUDART(cudaMalloc((void**)&d_result, 1 * sizeof(d_result[0]))); //just enough device memory for 1 result
    CHECK_CUDART(cudaMemset(d_result, 0, 1 * sizeof(d_result[0])));
    CHECK_CUBLAS(cublasIsamin(handle, N, d_A, 1, d_result)); //SEGFAULT!

    CHECK_CUBLAS(cublasDestroy(handle));
}

The Nvidia guide says that `cublasIsamin()` can output to device memory. What am I doing wrong?

Motivation: I want to compute the argmin() of several vectors concurrently in multiple streams. Outputting to host memory requires CPU-GPU synchronization and seems to kill the multi-kernel concurrency. So, I want to output the argmin to device memory instead.

1 Answer

Editorial Team · Answer 1 · 2026-06-15T10:48:31+00:00

The cuBLAS v2 API does support writing scalar results to device memory, but not by default. As per Section 2.4 “Scalar parameters” of the documentation, you need to call cublasSetPointerMode() with CUBLAS_POINTER_MODE_DEVICE so that the API knows scalar argument pointers reside in device memory. In the default host pointer mode, cuBLAS treats result as a host address, which is why passing a device pointer segfaults. Note that device pointer mode also makes these Level 1 BLAS functions asynchronous, so you must ensure that the GPU has finished the kernel(s) before reading the result.

See this answer for a complete working example.
