I am trying to run MPI and CUDA code on a cluster. The code works fine on a single machine, but when I try to run it on the cluster I get this error:
error while loading shared libraries: libcudart.so.4: cannot open shared object file: No such file or directory
I checked my PATH and LD_LIBRARY_PATH and they look OK. I have a .bashrc file which contains the following entries:
export PATH=$PATH:/usr/local/lib/:/usr/local/lib/openmpi:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/ lib/openmpi/:/usr/local/cuda/lib
All the machines have the same installation of CUDA and OpenMPI.
I also have /usr/local/cuda/lib in /etc/ld.so.conf.
Can anyone help me with this? This problem is really annoying.
Thanks.
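If you are submitting a batch job to a cluster, please add commands that print the environment the job actually sees (for example, the values of PATH and LD_LIBRARY_PATH on the node) to your batch script. This should help to debug the problem.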
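As a rough sketch of such debugging commands, assuming the batch script runs under a POSIX shell and using a hypothetical executable name ./my_mpi_cuda_app, something like the following prints the host name, the search paths, and how the dynamic linker resolves libcudart:

```shell
# Debugging sketch for a batch script.
# APP is a hypothetical executable name; replace it with your own program.
APP="${APP:-./my_mpi_cuda_app}"

print_env_debug() {
    echo "host=$(hostname)"
    echo "PATH=$PATH"
    echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
    # ldd lists the shared libraries the binary needs and where each resolves;
    # a missing library shows up as "not found".
    if [ -x "$APP" ]; then
        ldd "$APP" | grep -i cudart || echo "libcudart is not listed by ldd for $APP"
    else
        echo "executable $APP not found on $(hostname)"
    fi
}

print_env_debug
```

If the output on a compute node differs from what you see in an interactive login, the job is probably not reading your .bashrc.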
Also make sure that you export environment variables in mpirun. For instance, in OpenMPI you would run your code with
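A minimal sketch of such an invocation, assuming Open MPI's mpirun and its -x option (which exports the named variable to the launched processes). The program name and process count are hypothetical placeholders:

```shell
# Sketch only: APP and NP are hypothetical placeholders.
APP="${APP:-./my_mpi_cuda_app}"
NP="${NP:-4}"

# -x VAR asks Open MPI's mpirun to export VAR to the launched processes,
# so remote ranks see the same PATH and LD_LIBRARY_PATH as the launching shell.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    mpirun -x PATH -x LD_LIBRARY_PATH -np "$NP" "$APP"
else
    echo "mpirun or $APP not available on this machine"
fi
```

Without -x, the remote processes may start with whatever environment a non-interactive shell provides on that node, which may not include the CUDA library directory.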