To put the question another way, if one were to try and reimplement OpenGL

Question


Asked: May 26, 2026 (2026-05-26T12:25:02+00:00)


To put the question another way, if one were to try to reimplement OpenGL or DirectX (or an analogue) using GPGPU (CUDA, OpenCL), where and why would it be slower than the stock implementations on NVIDIA and AMD cards?

I can see how vertex/fragment/geometry/tessellation shaders could be made nice and fast using GPGPU, but what about things like generating the list of fragments to be rendered, clipping, texture sampling and so on?

I’m asking purely for academic interest.


1 Answer

Editorial Team · Answer 1 · 2026-05-26T12:25:03+00:00

Modern GPUs still have a lot of fixed-function hardware that is hidden from the compute APIs. This includes the blending stages, triangle rasterization, and many on-chip queues. The shaders, of course, all map well to CUDA/OpenCL; after all, shaders and the compute languages run on the same part of the GPU, the general-purpose shader cores. Think of those units as a bunch of very wide SIMD CPUs (for instance, a GTX 580 has 16 cores, each with a 32-wide SIMD unit).

You get access to the texture units via shaders, though, so there’s no need to implement texture sampling in “compute”. If you did, performance would most likely be poor, because you would not get access to the texture caches, which are optimized for spatial locality.
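To give a feel for what the texture unit does per sample, here is a minimal CPU-side sketch of bilinear filtering with clamp-to-edge addressing. The function name `sample_bilinear` and the list-of-rows texture layout are assumptions made for illustration; real hardware also handles format decoding, mipmapping and caching, none of which is modelled here.

```python
import math


def sample_bilinear(texture, u, v):
    """Bilinearly sample a single-channel texture at normalized (u, v).

    `texture` is a list of rows of floats (an assumed layout, for illustration).
    Texel centers sit at half-integer positions; addressing is clamp-to-edge.
    """
    height = len(texture)
    width = len(texture[0])

    # Map normalized coordinates into texel space.
    x = u * width - 0.5
    y = v * height - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal interpolation weight
    fy = y - y0  # vertical interpolation weight

    def texel(ix, iy):
        # Clamp-to-edge: out-of-range indices reuse the border texel.
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    # Four fetches and three linear interpolations per sample.
    top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy
```

Every filtered sample costs four memory fetches plus the interpolation arithmetic, which is the work you would be taking on yourself if you bypassed the texture units.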

You shouldn’t underestimate the amount of work required for rasterization. It is a major problem: even if you throw the whole GPU at it, you get roughly 25% of the performance of the raster hardware (see: High-Performance Software Rasterization on GPUs). That figure includes the blending costs, which are usually also handled by fixed-function units.
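As a rough illustration of the work involved, the following is a naive sketch of edge-function rasterization with alpha blending for one triangle. It is not the approach of the cited paper, which is far more elaborate; the names `edge` and `rasterize_triangle` and the single-channel framebuffer are assumptions, and no top-left fill rule or proper clipping is attempted.

```python
import math


def edge(a, b, p):
    """Edge function: twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_triangle(framebuffer, v0, v1, v2, src_color, src_alpha):
    """Rasterize one triangle and alpha-blend it into `framebuffer`.

    `framebuffer` is a list of rows of floats (one channel, assumed layout).
    Returns the number of pixels whose centers were covered.
    """
    height = len(framebuffer)
    width = len(framebuffer[0])

    area = edge(v0, v1, v2)
    if area == 0:
        return 0  # degenerate triangle covers nothing
    if area < 0:
        v1, v2 = v2, v1  # normalize winding so "inside" means all weights >= 0

    # Bounding box clamped to the framebuffer (a crude stand-in for clipping).
    min_x = max(math.floor(min(v0[0], v1[0], v2[0])), 0)
    max_x = min(math.floor(max(v0[0], v1[0], v2[0])), width - 1)
    min_y = max(math.floor(min(v0[1], v1[1], v2[1])), 0)
    max_y = min(math.floor(max(v0[1], v1[1], v2[1])), height - 1)

    covered = 0
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Classic "source over" blend: a read-modify-write per pixel.
                dst = framebuffer[y][x]
                framebuffer[y][x] = src_color * src_alpha + dst * (1 - src_alpha)
                covered += 1
    return covered
```

Even this toy version needs three edge evaluations and a read-modify-write per candidate pixel. A compute implementation additionally has to keep those blends in primitive order across many threads, which the fixed-function hardware takes care of for you.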

Tessellation also has a fixed-function part that is difficult to emulate efficiently: it amplifies the input by up to 1:4096, and you surely don’t want to reserve that much memory up front.
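A back-of-the-envelope sketch shows why reserving for the worst case is unattractive. The 1:4096 amplification comes from the text above; the patch count and the 32 bytes per output element are assumptions picked purely for illustration, and `worst_case_tessellation_bytes` is a hypothetical helper.

```python
def worst_case_tessellation_bytes(num_patches, max_amplification=4096,
                                  bytes_per_element=32):
    """Bytes needed if every patch is amplified by the maximum factor.

    `bytes_per_element` is an assumed size for one output element.
    """
    return num_patches * max_amplification * bytes_per_element


# 10,000 input patches (an assumed workload) at full amplification.
total = worst_case_tessellation_bytes(10_000)
print(f"{total} bytes = {total / 2**30:.2f} GiB")
```

Under these assumptions the buffer would be about 1.22 GiB for just 10,000 patches, even though most patches typically use far lower tessellation factors.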

Next, you take a significant performance hit because you don’t have access to framebuffer compression; again, there is dedicated hardware for this which is “hidden” from you when you’re in compute-only mode. Finally, since you don’t have any on-chip queues, it will be difficult to reach the same utilization as the “graphics pipeline” (for instance, it can easily buffer output from vertex shaders depending on shader load, whereas you can’t switch shaders that flexibly).
