If I call EnqueueNDRange with the maximum local group size and some large global group size, can I be sure that the local groups will be executed in order?
i.e.:
Global 0 : Local 0 1 2 3 4
Global 5 : Local 0 1 2 3 4
Global 10 : Local 0 1 2 3 4
etc
AFAIK, the global_work_size argument specifies the number of work-items in each dimension of the NDRange, and local_work_size specifies the number of work-items in each dimension of a work-group.
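To make the relationship between the two sizes concrete, here is a minimal host-side sketch in plain C (no OpenCL calls) of the 1-D index arithmetic, using the numbers from the question: a global size of 15 and a local size of 5. The names work_item_pos, locate, num_groups and print_mapping are made up for illustration, and the sketch assumes a zero global offset and a global size that is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper type: mirrors what get_group_id(0) and
 * get_local_id(0) would report inside a 1-D kernel, assuming a
 * zero global offset. */
typedef struct {
    size_t group_id;
    size_t local_id;
} work_item_pos;

/* Splits a global ID into (work-group ID, local ID). */
work_item_pos locate(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos p;
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}

/* Number of work-groups; assumes global_size is a multiple of local_size. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Prints one line per work-item: global ID, work-group ID, local ID. */
void print_mapping(size_t global_size, size_t local_size)
{
    for (size_t gid = 0; gid < global_size; ++gid) {
        work_item_pos p = locate(gid, local_size);
        printf("global %zu -> group %zu, local %zu\n",
               gid, p.group_id, p.local_id);
    }
}
```

Calling print_mapping(15, 5) lists three work-groups whose first work-items have global IDs 0, 5 and 10, which is the layout drawn in the question. Note this only describes how IDs are assigned, not the order in which the groups actually run.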
All work-items and work-groups are supposed to run in parallel. Work-items within a work-group typically execute in lockstep (per wavefront/warp), and different work-groups may execute on different SIMD engines at the same time. They only end up running one after another when the GPU hits a hardware limit, such as the number of SIMD engines available or wave/warp scheduler limits, and even then the order in which work-groups get scheduled is not guaranteed, so you should not rely on it.
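Since the scheduling order is not something you can count on, the usual approach is to write the kernel so that the order does not matter. Below is a small plain-C simulation of that idea, not real OpenCL: run_group is a stand-in for a kernel body in which each work-item writes only its own output slot, and the groups are then "scheduled" in two different orders. All names and the sizes (3 groups of 5) are assumptions chosen to match the example in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_SIZE 5
#define NUM_GROUPS 3
#define GLOBAL_SIZE (LOCAL_SIZE * NUM_GROUPS)

/* Stand-in for a kernel body: every work-item reads and writes only
 * the element at its own global ID, so groups do not interact. */
void run_group(size_t group_id, const int *in, int *out)
{
    assert(group_id < NUM_GROUPS);
    for (size_t local_id = 0; local_id < LOCAL_SIZE; ++local_id) {
        size_t gid = group_id * LOCAL_SIZE + local_id;
        out[gid] = in[gid] * 2;
    }
}

/* Executes the work-groups sequentially in the order listed in `order`. */
void run_in_order(const size_t *order, const int *in, int *out)
{
    for (size_t i = 0; i < NUM_GROUPS; ++i)
        run_group(order[i], in, out);
}

/* Returns 1 if two different group orders give identical output. */
int results_match(void)
{
    int in[GLOBAL_SIZE], a[GLOBAL_SIZE], b[GLOBAL_SIZE];
    for (size_t i = 0; i < GLOBAL_SIZE; ++i)
        in[i] = (int)i;

    const size_t forward[NUM_GROUPS]   = {0, 1, 2};
    const size_t scrambled[NUM_GROUPS] = {2, 0, 1};

    run_in_order(forward, in, a);
    run_in_order(scrambled, in, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Here results_match() returns 1 because no work-group depends on data written by another. If a group needed results from a "previous" group, the scrambled order would break it, which is the situation to avoid on a real device.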