The Archive Base Latest Questions

Editorial Team
Asked: May 31, 2026

TL;DR version: What’s the best way to round-robin kernel calls to multiple GPUs with

TL;DR version: “What’s the best way to round-robin kernel calls to multiple GPUs with Python/PyCUDA such that CPU and GPU work can happen in parallel?” with a side of “I can’t have been the first person to ask this; anything I should read up on?”

Full version:

I would like to know the best way to design the handling of contexts (and related resources) in an application that uses CUDA on a system with multiple GPUs. I’ve been trying to find literature with guidelines on when context reuse vs. recreation is appropriate, but so far haven’t found anything that outlines best practices, rules of thumb, etc.

The general overview of what we’re needing to do is:

  • Requests come in to a central process.
  • That process forks to handle a single request.
  • Data is loaded from the DB (relatively expensive).

The following is repeated an arbitrary number of times (dozens), depending on the request:

  • A few quick kernel calls to compute data that is needed for later kernels.
  • One slow kernel call (10 sec).

Finally:

  • Results from the kernel calls are collected and processed on the CPU, then stored.

At the moment, each kernel call creates and then destroys a context, which seems wasteful. Setup is taking about 0.1 sec per context and kernel load, and while that’s not huge, it is precluding us from moving other quicker tasks to the GPU.

I am trying to figure out the best way to manage contexts, etc. so that we can use the machine efficiently. I think that in the single-GPU case, it’s relatively simple:

  • Create a context before starting any of the GPU work.
  • Launch the kernels for the first set of data.
  • Record an event for after the final kernel call in the series.
  • Prepare the second set of data on the CPU while the first is computing on the GPU.
  • Launch the second set, repeat.
  • Ensure that each event is synchronized before collecting the results and storing them.

That seems like it should do the trick, assuming proper use of overlapped memory copies.
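A minimal sketch of the single-GPU flow listed above, written for Python 3 rather than 2.7. The names `run_pipelined`, `make_pycuda_launcher`, and the `depth` parameter are made up for illustration, and the PyCUDA half assumes a hypothetical in-place kernel with the signature `(float *a, int n)`; it only runs on a machine with a CUDA device. The control loop itself is plain Python and only needs `launch` to return an object with a `synchronize()` method, which `pycuda.driver.Event` provides.

```python
import collections


def run_pipelined(work_items, prepare, launch, collect, depth=2):
    """Keep up to `depth` batches in flight while the CPU prepares the next one.

    prepare(item)   -> host data         (CPU work)
    launch(data)    -> (event, handle)   (must return without waiting for the GPU)
    collect(handle) -> result            (called only after event.synchronize())
    """
    in_flight = collections.deque()
    results = []
    for item in work_items:
        data = prepare(item)               # overlaps with GPU work queued earlier
        in_flight.append(launch(data))     # asynchronous: returns immediately
        if len(in_flight) >= depth:
            event, handle = in_flight.popleft()
            event.synchronize()            # block only when the data is needed
            results.append(collect(handle))
    while in_flight:                       # drain whatever is still running
        event, handle = in_flight.popleft()
        event.synchronize()
        results.append(collect(handle))
    return results


def make_pycuda_launcher(kernel_source, kernel_name):
    """Assumed PyCUDA backend for run_pipelined; needs a CUDA device."""
    import numpy as np
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    cuda.init()
    ctx = cuda.Device(0).make_context()    # created once, reused for every batch
    func = SourceModule(kernel_source).get_function(kernel_name)
    stream = cuda.Stream()

    def launch(host_in):
        # host_in should be page-locked (cuda.pagelocked_empty) so that the
        # host-to-device copy can actually overlap with CPU work.
        dev = cuda.mem_alloc(host_in.nbytes)
        host_out = cuda.pagelocked_empty(host_in.shape, host_in.dtype)
        cuda.memcpy_htod_async(dev, host_in, stream)
        func(dev, np.int32(host_in.size), block=(256, 1, 1),
             grid=((host_in.size + 255) // 256, 1), stream=stream)
        cuda.memcpy_dtoh_async(host_out, dev, stream)
        done = cuda.Event()
        done.record(stream)                # fires after the copy back completes
        return done, (host_out, dev)

    def collect(handle):
        host_out, dev = handle
        dev.free()
        return host_out

    return launch, collect, ctx
```

The caller would pass the returned `launch` and `collect` to `run_pipelined` and call `ctx.pop()` once all results are in. With `depth=2`, the second batch is prepared and queued before the first event is synchronized.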

However, I’m unsure what I should do when wanting to round-robin each of the dozens of items to process over multiple GPUs.

The host program is Python 2.7, using PyCUDA to access the GPU. Currently it’s not multi-threaded, and while I’d rather keep it that way (“now you have two problems” etc.), if the answer means threads, it means threads. Similarly, it would be nice to just be able to call event.synchronize() in the main thread when it’s time to block on data, but for our needs efficient use of the hardware is more important. Since we’ll potentially be servicing multiple requests at a time, letting other processes use the GPU when this process isn’t using it is important.
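If threads do turn out to be the answer, one possible shape is a single worker thread per GPU, with items handed out in turn. The sketch below is an assumption about how that could look, not an established PyCUDA pattern: `DeviceWorker`, `round_robin`, `pycuda_setup`, and `pycuda_teardown` are hypothetical names, and it uses Python 3's `concurrent.futures` (on 2.7 a backport would be needed). Each worker runs its setup on its own thread, which matters because a CUDA context is current per thread.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class DeviceWorker:
    """One thread per GPU; setup and teardown each run once on that thread."""

    def __init__(self, device_id, setup, teardown):
        self.device_id = device_id
        self._teardown = teardown
        # max_workers=1 means every task for this device runs on the same thread.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._state = self._pool.submit(setup, device_id).result()

    def submit(self, fn, item):
        # fn(state, item) runs on the worker thread; returns a Future.
        return self._pool.submit(fn, self._state, item)

    def close(self):
        self._pool.submit(self._teardown, self._state).result()
        self._pool.shutdown()


def round_robin(items, workers, fn):
    """Hand items to the workers in turn; return results in input order."""
    futures = [w.submit(fn, item)
               for w, item in zip(itertools.cycle(workers), items)]
    return [f.result() for f in futures]


def pycuda_setup(device_id):
    """Assumed setup for a real GPU: create the context on the worker thread."""
    import pycuda.driver as cuda
    cuda.init()
    return cuda.Device(device_id).make_context()


def pycuda_teardown(ctx):
    ctx.pop()
```

The main thread stays free to prepare data or post-process results while the futures are pending, and it only blocks in `f.result()`. Because each context lives for the duration of the worker rather than per kernel call, the 0.1 sec setup cost is paid once per device.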

I don’t think we have any explicit reason to use the Exclusive compute mode (i.e., we’re not filling up the card’s memory with a single work item), so I don’t think that solutions involving long-lived contexts are off the table.

Note that answers in the form of links to other content that covers my questions are completely acceptable (encouraged, even), provided they go into enough detail about the why, not just the API. Thanks for reading!


1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 9:55 am

    Caveat: I’m not a PyCUDA user (yet).

    With CUDA 4.0+ you don’t even need an explicit context per GPU. You can just call cudaSetDevice (or the PyCUDA equivalent) before doing per-device stuff (cudaMalloc, cudaMemcpy, launch kernels, etc.).

    If you need to synchronize between GPUs, you may need to create streams and/or events and use cudaEventSynchronize (or the PyCUDA equivalent). You can even have one stream wait on an event inserted in another stream to express more sophisticated dependencies.
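    In PyCUDA terms, the "one stream waits on another stream's event" idea might look like the helper below. `make_dependent` is a hypothetical name; the two calls it makes, `Event.record(stream)` and `Stream.wait_for_event(event)`, are the PyCUDA driver-interface counterparts of cudaEventRecord and cudaStreamWaitEvent.

```python
def make_dependent(consumer_stream, producer_stream, event):
    """Make work queued later on consumer_stream wait for everything
    queued so far on producer_stream, without blocking the host."""
    event.record(producer_stream)          # marks a point in the producer's queue
    consumer_stream.wait_for_event(event)  # consumer stalls on the GPU side only
    return event
```

    With real PyCUDA objects this would be called as `make_dependent(stream_b, stream_a, pycuda.driver.Event())`; the host can still call `synchronize()` on the returned event later if it needs to block.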

    So I suspect the answer today is quite a lot simpler than talonmies’ excellent pre-CUDA-4.0 answer.

    You might also find this answer useful.

    (Re)Edit by OP: Per my understanding, PyCUDA supports versions of CUDA prior to 4.0, and so still uses the old API/semantics (the driver API?), so talonmies’ answer is still relevant.
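    Under the driver-API semantics PyCUDA exposes, the closest thing to cudaSetDevice is pushing and popping an explicit context. A rough sketch, assuming one long-lived context per device held by a single thread; `make_contexts` and `current` are hypothetical helper names, and `make_contexts` needs a CUDA-capable machine to run:

```python
import contextlib


def make_contexts():
    """Create one context per visible device, leaving none of them current."""
    import pycuda.driver as cuda
    cuda.init()
    contexts = []
    for i in range(cuda.Device.count()):
        ctx = cuda.Device(i).make_context()  # becomes current on this thread
        cuda.Context.pop()                   # park it until it is needed
        contexts.append(ctx)
    return contexts


@contextlib.contextmanager
def current(ctx):
    """Make ctx current for the duration of a with-block."""
    ctx.push()
    try:
        yield ctx
    finally:
        ctx.pop()                            # always restore, even on error


# Usage (hypothetical): allocations, copies, and kernel launches inside the
# with-block go to whichever device owns the context.
#
#     for ctx in make_contexts():
#         with current(ctx):
#             ...  # per-device PyCUDA calls
```

    Kernel launches and async copies return before the GPU finishes, so a loop like this can queue work on every device in turn and synchronize on events afterwards.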
