I’m just messing around, trying to learn a little bit about parallel computing. Suppose I have something that looks like this:
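#include <stdlib.h>
#include <time.h>

long A[12];        /* base-12 counter, one digit per element */
long B[5000000];
long C[12];
long long total = 0;
long long tmp = 0;

/* Stand-in for the work that is meant to run on the GPU. */
void GPUKernel(void) {
    long n, n2;
    tmp = 0;       /* start each call from a fresh sum */
    for (n = 0; n < 5000000; ++n) {
        B[n] = 0;
    }
    for (n = 0; n < 5000000; ++n) {
        for (n2 = 0; n2 < 12; ++n2) {
            B[n] += C[A[n2]];
        }
        tmp += B[n];
    }
    if (tmp > total) {
        total = tmp;   /* keep the largest sum seen so far */
    }
}

int main(void) {
    long long n;
    int p;
    srand((unsigned)time(NULL));
    for (n = 0; n < 12; ++n) {
        C[n] = rand() % 1000000;
    }
    /* 8916100448256 = 12^12, i.e. every combination of twelve base-12 digits */
    for (n = 0; n < 8916100448256LL; ++n) {
        ++A[0];
        /* propagate carries so every digit stays in 0..11 */
        for (p = 0; p < 12; ++p) {
            if (A[p] == 12) {
                A[p] = 0;
                if (p + 1 < 12) {
                    ++A[p + 1];
                }
            }
        }
        GPUKernel();
    }
    return 0;
}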
My idea is to find out how many threads the CPU can use. For example, if there are 4, I’ll make a separate copy of all the data for each CPU thread I create, so each GPU kernel will have its own data as well. Does this make sense? Would this be a good way of going about this task?
//cpu core 1
for (n=0; n < 8916100448256/4 ; ++n) {
...
GPUKernel1();
}
//cpu core 2
for (n=(8916100448256/4); n < (8916100448256/4)*2 ; ++n) {
...
GPUKernel2();
}
//cpu core 3
for (n=(8916100448256/4)*2; n < (8916100448256/4)*3 ; ++n) {
...
GPUKernel3();
}
//cpu core 4
for (n=(8916100448256/4)*3; n < 8916100448256 ; ++n) {
...
GPUKernel4();
}
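The four hand-written loops above could be generated from one helper instead. Here is a minimal sketch, not a definitive implementation: `Range`, `chunk_for`, and `decode_base12` are hypothetical names that do not appear in the original code. It also assumes each worker needs to rebuild its own copy of the counter `A` from its starting index, since a worker that begins mid-sequence cannot rely on the carries made by earlier iterations.

```c
#include <stdint.h>

#define DIGITS 12

/* Half-open range [begin, end) of iterations owned by one worker. */
typedef struct {
    uint64_t begin;
    uint64_t end;
} Range;

/* Split `total` iterations into `workers` near-equal chunks.
   The first (total % workers) workers each get one extra iteration. */
Range chunk_for(uint64_t total, unsigned workers, unsigned id) {
    uint64_t base = total / workers;
    uint64_t extra = total % workers;
    Range r;
    r.begin = id * base + (id < extra ? id : extra);
    r.end = r.begin + base + (id < extra ? 1 : 0);
    return r;
}

/* Write index n as DIGITS base-12 digits, least significant digit first,
   so a worker can initialize its private counter at the start of its chunk. */
void decode_base12(uint64_t n, long A[DIGITS]) {
    for (int p = 0; p < DIGITS; ++p) {
        A[p] = (long)(n % 12);
        n /= 12;
    }
}
```

Worker `id` would call `chunk_for(8916100448256ULL, 4, id)`, pass `begin` to `decode_base12` to fill its private copy of `A`, and then step the counter as before until it reaches `end`. Each worker would also need its own `tmp` and `total`, with the per-worker maxima compared at the end.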
Correct me if I’m wrong, but this seems like an algorithms question; OpenCL is nowhere in the picture. BTW, when you write kernel code in OpenCL/CUDA, the data each thread works on is determined by that thread’s ID, and you can also divide the work in terms of blocks, etc. Please refer to the programming guide (NVIDIA/AMD).
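To make the thread-ID point concrete, here is a rough CPU emulation in plain C of how a kernel picks its data. It is a sketch only: `zero_element` and `launch_zero` are hypothetical names, and the nested loops stand in for what the GPU runs in parallel. In a real CUDA kernel the index would come from `blockIdx.x * blockDim.x + threadIdx.x`, and in OpenCL from `get_global_id(0)`.

```c
#include <stddef.h>

/* Body of one hypothetical GPU thread: it touches only element `gid`. */
static void zero_element(long *B, size_t n, size_t gid) {
    if (gid < n) {      /* guard: the last block may run past the end of B */
        B[gid] = 0;
    }
}

/* CPU emulation of launching enough blocks of `block_size` threads
   to cover all n elements. On a GPU these iterations run concurrently. */
void launch_zero(long *B, size_t n, size_t block_size) {
    size_t blocks = (n + block_size - 1) / block_size;   /* round up */
    for (size_t block = 0; block < blocks; ++block) {
        for (size_t thread = 0; thread < block_size; ++thread) {
            size_t gid = block * block_size + thread;    /* global thread ID */
            zero_element(B, n, gid);
        }
    }
}
```

With this layout the first loop in `GPUKernel` (zeroing `B`) needs no loop inside the kernel at all: each of the 5000000 elements is handled by the thread whose global ID matches its index.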