problem with parallel running
Date: 2011/09/29 17:32
Name: David

Dear Prof. Dr. T.Ozaki,

I have run into a problem with parallel execution that is difficult to understand. I used OpenMPI to compile OpenMX.

Suppose that my computer cluster consists of quad-core machines. Everything goes well if I use just four MPI processes for my calculations by calling OpenMX with "mpirun -n 4 $OPENMX_DIR/openmx INPUT -nt 1 > OUTPUT".

However, if I use 16 MPI processes by running "mpirun -n 16 $OPENMX_DIR/openmx INPUT -nt 4 > OUTPUT", my jobs terminate with errors. The error messages seem to be related to OpenMPI; a few of them are listed below.

[quad001:21826] *** Process received signal ***
[quad001:21826] Signal: Segmentation fault (11)
[quad001:21826] Signal code: (128)
[quad001:21826] Failing at address: (nil)

...

[quad004:21826] [ 0] /lib64/libpthread.so.0(+0xf5d0) [0x2ac2e61335d0]
[quad004:21826] [ 1] /opt/libs//openmpi-1.2.6-INTEL-10.1.015-64/lib/openmpi/mca_btl_openib.so(+0x8767) [0x2ac2ec850767]
[quad004:21826] [ 2] /opt/libs//openmpi-1.2.6-INTEL-10.1.015-64/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x3d) [0x2ac2ec645549]
[quad004:21826] [ 3] /opt/libs/openmpi-1.2.6-INTEL-10.1.015-64/lib/libopen-pal.so.0(opal_progress+0x83) [0x2ac2e6e49399]
[quad004:21826] [ 4] /opt/libs//openmpi-1.2.6-INTEL-10.1.015-64/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x23) [0x2ac2e8d6f1a7]
[quad004:21826] [ 5] /opt/libs//openmpi-1.2.6-INTEL-10.1.015-64/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x4a7) [0x2ac2e8d739a3]
[quad004:21826] [ 6] /opt/libs/openmpi-1.2.6-INTEL-10.1.015-64/lib/libopen-rte.so.0(mca_oob_recv_packed+0x32) [0x2ac2e6bf2fe2]
[quad004:21826] [ 7] /opt/libs//openmpi-1.2.6-INTEL-10.1.015-64/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x25d) [0x2ac2e91887b5]
[quad004:21826] [ 8] /opt/libs/openmpi-1.2.6-INTEL-10.1.015-64/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x2cd) [0x2ac2e6c0d129]

Could you please be kind enough to tell me how to fix it? Thanks a lot.

Best regards,

David

Re: problem with parallel running ( No.1 )
Date: 2011/09/29 19:22
Name: David

I solved the problem. Setting MKL_PATH and LD_LIBRARY_PATH correctly is very important.
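For readers hitting the same issue, a minimal sketch of the environment setup before launching the job. The MKL path below is a placeholder, not the poster's actual path; adjust it to your own Intel MKL / OpenMPI installation:

```shell
# Hypothetical library path -- replace with your own MKL installation directory.
MKL_PATH=/opt/intel/mkl/lib/em64t
export LD_LIBRARY_PATH="$MKL_PATH:$LD_LIBRARY_PATH"

# Build the launch command as in the original post (shown here, not executed):
CMD="mpirun -n 16 \$OPENMX_DIR/openmx INPUT -nt 4 > OUTPUT"
echo "$CMD"
```

The point is that every compute node must be able to resolve the MKL and OpenMPI shared libraries at run time, which is why a job that works on one node can segfault when spread across several.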
Re: problem with parallel running ( No.2 )
Date: 2011/09/29 20:56
Name: T.Ozaki

Hi,

I am wondering how you specified the number of MPI processes per node
for the parallel calculation in the hybrid mode. A proper specification is
also important for the efficiency of the hybrid parallelization.

Regards,

TO
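To illustrate the point above, here is a minimal sketch of per-node process placement using an OpenMPI-style hostfile. The hostnames are taken from the logs earlier in the thread; the hostfile format and flags are assumptions and may differ between MPI implementations and versions:

```shell
# Sketch: one MPI process per quad-core node, four OpenMP threads per process.
# "slots=1" tells OpenMPI to place only one process on each listed host.
cat > hosts.txt <<'EOF'
quad001 slots=1
quad002 slots=1
quad003 slots=1
quad004 slots=1
EOF
export OMP_NUM_THREADS=4

# Launch command for OpenMX in hybrid mode (shown here, not executed):
#   mpirun -np 4 --hostfile hosts.txt $OPENMX_DIR/openmx INPUT -nt 4 > OUTPUT
echo "nodes: $(wc -l < hosts.txt), threads per process: $OMP_NUM_THREADS"
```

If instead all 16 processes land on a handful of nodes while each also spawns 4 OpenMP threads, the cores are oversubscribed and performance (or stability) suffers, which matches the symptoms described above.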
Re: problem with parallel running ( No.3 )
Date: 2012/03/17 17:29
Name: Johny Tang  <tangtao_my@163.com>

Dear Prof. Dr. T.Ozaki,

I have a cluster containing more than 64 nodes with 8 cores each. How should I mpirun the job?
I compiled openmx3.6 with the following options:
mpicc = /u1/local/mvapich1-icc/bin/mpicc
CC = $(mpicc) -openmp -O3 -I/usr/include -I/u1/local/include
LIB = -L/u1/local/fftw322/lib -lfftw3 -L/u1/local/icc11/mkl/lib/em64t -llapack -lblas

Only a few warnings appeared during compilation. The automatic run test for OpenMP/MPI parallel running is OK on 1 node with 8 cores. But when I run the automatic test with large-scale systems (OpenMP/MPI) on 16 nodes (128 cores in total), it is very slow compared with the reference test results.

Note: our cluster jobs are controlled via PBS.
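Since the question is how to launch the hybrid job under PBS, here is a hedged sketch of a submission script for 16 nodes with 8 cores each. Every path, resource line, and launcher flag below is an assumption to adapt to your site; in particular, MVAPICH installations often use `mpirun_rsh` with its own syntax rather than the `mpirun -machinefile` form shown:

```shell
# Write a hypothetical PBS script: 16 nodes x 8 cores, one MPI process per
# node, 8 OpenMP threads per process (128 cores in total).
cat > run_openmx.pbs <<'EOF'
#!/bin/sh
#PBS -l nodes=16:ppn=8
#PBS -N openmx-test
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
# $PBS_NODEFILE lists each node once per core (ppn=8); keep one entry per
# node so that MPI places a single process on each machine.
sort -u $PBS_NODEFILE > nodes16.txt
mpirun -np 16 -machinefile nodes16.txt ./openmx INPUT.dat -nt 8 > OUTPUT
EOF
echo "wrote run_openmx.pbs"
```

If the MPI processes are instead packed 8 per node while `-nt 8` spawns 8 threads each, every core is oversubscribed eightfold, which would explain the very slow large-scale runtest results.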
