Parallel runs crash on one cluster but not the other
- Date: 2009/04/21 23:44
- Name: Baruch Feldman
I installed OpenMX about two weeks ago and have been trying to replicate the speedup benchmark curve shown in figure (c) at
based on DIA64_Band.dat.
I've found that the Methane and automatic run tests work fine. Moreover, I've run DIA64_Band.dat with several different numbers of processors and the results agree across all the runs, with near-linear speedup for core counts ranging from 8 to 36.
However, I'm currently only able to run parallel jobs (serial ones work fine) on our 400-series cluster, but not on our 500-series cluster, which has more total computing power. According to our documentation, these clusters have the same libraries and MPI. The only obvious difference (to me) is that each 500-series node contains two dual-core CPUs, whereas each 400-series node contains two single-core CPUs. I compiled OpenMX with OpenMP but am running without the -nt option, so I expect pure MPI parallelization. Here is the submission command from my PBS script:
/opt/mpi/bin/mpirun -np $NP -machinefile $PBS_NODEFILE openmx DIA64_Band.dat > diamond64.std
On the 500 cluster, the job crashes within seconds, with the following typical output:
p0_16740: p4_error: Child process exited while making connection to remote process on sfi5045: 0
p0_16740: (46.070312) net_send: could not write to fd=6, errno = 32
mpirun: got sig, my pid is 16190
childs pid is:
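Since the only difference I can see is the dual-core nodes, one test I'm considering is forcing a single MPI rank per node to see whether multi-rank-per-node placement is the trigger. A sketch of that variant of my PBS script is below (filtering $PBS_NODEFILE with sort -u is my assumption about its format; on this system the nodefile may be laid out differently):

```
#PBS -l nodes=4:ppn=2
# Sketch: launch one MPI rank per node instead of one per core,
# to test whether placing multiple ranks on a dual-core node
# causes the crash. Assumes $PBS_NODEFILE lists each host once
# per allocated core, so sort -u leaves one line per node.
sort -u $PBS_NODEFILE > $PBS_O_WORKDIR/nodes.unique
NP=$(wc -l < $PBS_O_WORKDIR/nodes.unique)
/opt/mpi/bin/mpirun -np $NP -machinefile $PBS_O_WORKDIR/nodes.unique \
    openmx DIA64_Band.dat > diamond64.std
```

If that runs cleanly, it would point at the rank placement rather than the MPI installation itself.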
Any suggestions would be appreciated.