Motivation: I have been tasked with measuring the Karp-Flatt metric and parallel efficiency of my CUDA C code, which requires computation of speedup. In particular, I need to plot all these metrics as a function of the number of processors p.
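Definition: Speedup refers to how much faster a parallel algorithm is than a corresponding sequential algorithm, and is defined as:
$$S_p = \frac{T_1}{T_p}$$
where T1 is the execution time of the sequential algorithm and Tp is the execution time of the parallel algorithm on p processors.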
Issue: I have implemented my algorithm in CUDA C, and have timed it to get Tp. However, there remain some issues in determining Sp:
- How to observe T1 without completely rewriting my code from scratch?
  - Can I execute CUDA code in serial?
- What is p when I run different kernels with different numbers of threads?
  - Does it refer to the number of threads or the number of processors used throughout runtime?
  - Since both of these quantities will also vary throughout runtime, is it the maximum or the average used?
- How do I even restrict my code to run on a subset of processors or with fewer threads?
Many thanks.
To get a reasonable measure of speedup, you need the actual sequential program. If you don’t have one, you need to write the best sequential version you can, because comparing a highly tuned parallel code to a junk serial implementation is unreasonable.
Nor can you reasonably compare a 1-processor version of your parallel program to the N-processor version to get a true measure of speedup. Such a comparison tells you the speedup from going from P=1 to P=N for the same program, but the point of speedup curves is to show why building a parallel program (which is usually harder and requires more complicated hardware [GPU] and tools [OpenCL]) makes sense compared to coding the best sequential version using more widely available hardware and tools.
In other words, no cheating.
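Once you have a measured time for the best sequential version (T1) and for the parallel version (Tp), the metrics from the question follow from a few lines of arithmetic. Here is a minimal sketch in plain C, which also compiles as host code in a CUDA C program. The function names are hypothetical, and what you pass as `p` on a GPU is your own choice to make and report; nothing here settles that question.

```c
/* Speedup: S_p = T_1 / T_p.
   t1 = time of the best sequential version, tp = time of the parallel version.
   Both must be in the same unit (e.g. seconds). */
double speedup(double t1, double tp) {
    return t1 / tp;
}

/* Parallel efficiency: E_p = S_p / p.
   p is the processor count you choose to report (an assumption on a GPU). */
double efficiency(double t1, double tp, int p) {
    return speedup(t1, tp) / (double)p;
}

/* Karp-Flatt metric (experimentally determined serial fraction):
   e = (1/S_p - 1/p) / (1 - 1/p).
   Only defined for p > 1, since the denominator is zero at p = 1. */
double karp_flatt(double t1, double tp, int p) {
    double s = speedup(t1, tp);
    double inv_p = 1.0 / (double)p;
    return (1.0 / s - inv_p) / (1.0 - inv_p);
}
```

For example, with T1 = 10 s, Tp = 2.5 s and p = 8, these give a speedup of 4, an efficiency of 0.5, and a serial fraction of about 0.143. Calling them for each value of p you measured gives the data points for the plots.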