Each GPU device (AMD, NVIDIA, or any other) is split into several Compute Units (MultiProcessors), each of which has a fixed number of cores (VertexShaders/StreamProcessors). So one has (Compute Units) x (VertexShaders per Compute Unit) simultaneous processors to compute with, but there is only a small fixed amount of __local memory (usually 16KB or 32KB) available per MultiProcessor. Hence, the exact number of these multiprocessors matters.
Now my questions:
- (a) How can I know the number of multiprocessors on a device? Is this the same as CL_DEVICE_MAX_COMPUTE_UNITS? Can I deduce it from specification sheets such as http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units?
- (b) How can I know how much __local memory per MP is available on a GPU before buying it? Of course I can request CL_DEVICE_LOCAL_MEM_SIZE on a computer that runs it, but I don’t see how I can deduce it from even an individual detailed specifications sheet such as http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#3?
- (c) What is the card with currently the largest CL_DEVICE_LOCAL_MEM_SIZE? Price doesn’t really matter, but 64KB (or larger) would give a clear benefit for the application I’m writing, since my algorithm is completely parallelizable, but also highly memory-intensive with a random access pattern within each MP (iterating over edges of graphs).
CL_DEVICE_MAX_COMPUTE_UNITS should give you the number of Compute Units; otherwise you can glean it from the appropriate manuals (the AMD OpenCL programming guide and the Nvidia OpenCL programming guide).

For your usage of local memory, however, you should remember that local memory is allocated per workgroup (and is only accessible within that workgroup), while a Compute Unit can typically sustain more than one workgroup. So if your algorithm allocates the whole local memory to one workgroup, you will not be able to achieve the maximum amount of parallelism.

Also note that since local memory is banked, random access will lead to a lot of bank conflicts and warp serialization. So your algorithm might not parallelize quite as well as you think it will (or maybe it will, just mentioning the possibility).
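To make the two local-memory caveats concrete, here is a rough sketch in plain C (no OpenCL calls, so it runs anywhere). The function names `resident_workgroups` and `conflict_degree` are made up for illustration. The model is an assumption, not a description of any particular GPU: local memory is taken as the only limit on resident workgroups, a bank is chosen as the 4-byte word index modulo the number of banks, and work-items touching the same word are assumed to be served by a single broadcast. Check the vendor programming guide for the real figures of your device.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the workgroups that can be resident on one compute unit
 * when local memory is the only limit. Registers and hardware caps on
 * workgroups per compute unit are ignored (simplifying assumption). */
static size_t resident_workgroups(size_t local_mem_per_cu,
                                  size_t local_mem_per_wg)
{
    assert(local_mem_per_wg > 0);
    return local_mem_per_cu / local_mem_per_wg;
}

/* Serialization factor for one simultaneous access by n work-items.
 * word_index[i] is the 4-byte word that work-item i touches; its bank is
 * assumed to be word_index[i] % num_banks. Work-items touching the same
 * word are counted once (assumed broadcast). Returns the largest number
 * of distinct words that land in a single bank: 1 means conflict-free,
 * 0 means no accesses at all. */
static size_t conflict_degree(const size_t *word_index, size_t n,
                              size_t num_banks)
{
    assert(num_banks > 0);
    size_t worst = 0;
    for (size_t bank = 0; bank < num_banks; ++bank) {
        size_t distinct = 0;
        for (size_t i = 0; i < n; ++i) {
            if (word_index[i] % num_banks != bank)
                continue;
            /* Count this word only at its first occurrence. */
            int seen = 0;
            for (size_t j = 0; j < i; ++j) {
                if (word_index[j] == word_index[i]) {
                    seen = 1;
                    break;
                }
            }
            if (!seen)
                ++distinct;
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

Under these assumptions, a workgroup that claims all 32KB of a 32KB compute unit leaves room for exactly one resident workgroup, while 4KB per workgroup allows up to eight. Likewise, with 32 banks, four work-items reading words 0, 32, 64 and 96 all hit bank 0 and are serialized four ways, whereas words 0, 1, 2 and 3 do not conflict.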
With a Fermi-based card your best bet might be to rely on the caches instead of explicit local memory, if all your workgroups operate on the same data (I don’t know how to switch the L1/local memory configuration though).