Cannot finish the NEGF testrun
Date: 2020/01/07 18:57
Name: K. Yamaguchi

Dear Developers,

I've installed OpenMX 3.9 and applied the latest patch to it.
I tried to run the NEGF runtest, but the calculation was terminated unexpectedly.
I suspect that lead-l-8zgnr-nc.dat does not work on my system.

The other runtests (-runtest, -runtestL) completed normally.

---Install---
CC = mpiicc -O3 -xHOST -ip -no-prec-div -mcmodel=medium -shared-intel -qopenmp -I/opt/app/fftw/3.3.5/intel-17.0-impi-2017.1/include
FC = mpiifort -O3 -xHOST -ip -no-prec-div -mcmodel=medium -shared-intel -qopenmp
LIB= -L/opt/app/fftw/3.3.5/intel-17.0-impi-2017.1/lib -lfftw3 -L$(MKLROOT)/lib/intel64 -mkl=parallel -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lifcore -lmpi -liomp5 -lm -ldl
------------

----Log----
<Band_DFT> Eigen, time=0.945907
<Band_DFT> DM, time=1.280435
1 C MulP 2.20 1.95 sum 4.15 diff 0.25 (180.00 202.53) Ml 0.00 (125.12 224.44) Ml+s 0.25 (180.00 224.44)
2 C MulP 1.96 1.95 sum 3.90 diff 0.01 (180.00 176.69) Ml 0.00 ( 29.91 224.41) Ml+s 0.01 (180.00 224.41)
3 C MulP 2.03 1.95 sum 3.98 diff 0.07 (180.00 -7.09) Ml 0.00 (120.80 174.84) Ml+s 0.07 (180.00 174.84)
4 C MulP 2.00 1.99 sum 3.99 diff 0.01 (180.00 186.01) Ml 0.00 (173.61 150.23) Ml+s 0.01 (180.00 150.23)
5 C MulP 2.01 1.98 sum 3.99 diff 0.03 (180.00 161.03) Ml 0.00 (138.86 249.41) Ml+s 0.03 (180.00 249.41)
6 C MulP 2.00 1.99 sum 3.99 diff 0.01 (180.00 -36.54) Ml 0.00 ( 89.01 265.92) Ml+s 0.01 (180.00 265.92)
7 C MulP 2.01 1.99 sum 4.00 diff 0.02 (180.00 -1.95) Ml 0.00 ( 93.76 83.20) Ml+s 0.02 (180.00 83.20)
8 C MulP 2.00 1.99 sum 4.00 diff 0.01 (180.00 -7.69) Ml 0.00 (127.29 53.75) Ml+s 0.01 (180.00 53.75)
9 C MulP 2.00 1.99 sum 4.00 diff 0.01 (180.00 183.10) Ml 0.00 (134.30 125.59) Ml+s 0.01 (180.00 125.59)
10 C MulP 2.01 1.99 sum 4.00 diff 0.02 (180.00 32.66) Ml 0.00 (151.63 118.82) Ml+s 0.02 (180.00 118.82)
11 C MulP 2.00 1.99 sum 3.99 diff 0.01 (180.00 261.36) Ml 0.00 (162.75 94.40) Ml+s 0.01 (180.00 94.40)
12 C MulP 2.01 1.98 sum 3.99 diff 0.03 (180.00 30.12) Ml 0.00 ( 71.53 217.60) Ml+s 0.03 (180.00 217.60)
13 C MulP 2.00 1.99 sum 3.99 diff 0.01 (180.00 8.93) Ml 0.00 ( 53.06 120.80) Ml+s 0.01 (180.00 120.80)
14 C MulP 2.03 1.95 sum 3.98 diff 0.07 (180.00 268.84) Ml 0.00 ( 77.18 73.16) Ml+s 0.07 (180.00 73.16)
15 C MulP 1.96 1.95 sum 3.90 diff 0.01 (180.00 70.74) Ml 0.00 ( 92.65 79.57) Ml+s 0.01 (180.00 79.57)
16 C MulP 2.20 1.95 sum 4.15 diff 0.25 (180.00 212.79) Ml 0.00 ( 77.36 174.47) Ml+s 0.25 (180.00 174.47)
17 H MulP 0.53 0.47 sum 1.00 diff 0.05 ( 0.00 137.70) Ml 0.00 ( 90.00 0.00) Ml+s 0.05 ( 0.00 0.00)
18 H MulP 0.53 0.47 sum 1.00 diff 0.05 ( 0.00 175.80) Ml 0.00 ( 90.00 0.00) Ml+s 0.05 ( 0.00 0.00)
19 C MulP 2.20 1.95 sum 4.15 diff 0.25 (180.00 159.73) Ml 0.00 ( 76.90 243.41) Ml+s 0.25 (180.00 243.41)
20 C MulP 1.96 1.95 sum 3.90 diff 0.01 (180.00 3.55) Ml 0.00 ( 11.05 197.00) Ml+s 0.01 (180.00 197.00)
..........
......

Sum of MulP: up = 66.91906 down = 65.08094
total= 132.00000 ideal(neutral)= 132.00000

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 79853 RUNNING AT nb-0044.kudpc.kyoto-u.ac.jp
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 79853 RUNNING AT nb-0044.kudpc.kyoto-u.ac.jp
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
================================================================================

Resource Usage on 2020-01-07 17:55:55.419381:

JobId: 4188073.jb
Job_Name = B010717
queue = gr*****b
Resource_List.Aoption = gr*****:p=36:t=2:c=1:m=3413:c_dash=1
Resource_List.select = 1:ncpus=72:mpiprocs=36:mem=122868mb:ompthreads=2:jobfilter=long
qtime = 2020-01-07 17:54:49
stime = 2020-01-07 17:54:54
resources_used.walltime = 00:01:03
resources_used.cput = 00:38:30
resources_used.cpupercent = 3666
resources_used.ncpus = 72
resources_used.mem = 15135216kb

================================================================================
-----------

Best regards,
K. Yamaguchi

Re: Cannot finish the NEGF testrun ( No.1 )
Date: 2020/01/07 21:46
Name: T. Ozaki

Hi,

Could you copy work/negf_example/Lead-L-8ZGNR-NC.dat to your work directory and
run openmx directly with that input file, without the runtestNEGF command?

Since we do not have an account on the machine installed at Kyoto Univ., unfortunately,
it may be difficult for us to fix the problem.

Regards,

TO
Re: Cannot finish the NEGF testrun ( No.2 )
Date: 2020/01/08 23:30
Name: K. Yamaguchi

Dear Ozaki-sensei,

Thank you very much for your advice.

I've solved the MPI error.
I switched from Intel Compiler 18.0.5 and Intel MPI Library 2018.4
to Intel Compiler 19.0.5 and Intel MPI Library 2019.5.

Then I compiled the whole software again.
Finally, I was able to finish the NEGF runtest normally.

The MPI error seems to have been caused by the Intel MPI Library;
however, I am not sure what the exact problem was.

Best regards,
K. Yamaguchi
Re: Cannot finish the NEGF testrun ( No.3 )
Date: 2020/01/16 20:01
Name: K. Yamaguchi

Dear Ozaki-sensei,

The DFT calculation for non-collinear (NC) systems is unstable in my simulation environment.

****Makefile****
FFTW = /opt/app/fftw/3.3.5/intel-17.0-impi-2017.1
CC = mpiicc -O3 -xHOST -ip -no-prec-div -mcmodel=medium -shared-intel -qopenmp -I${FFTW}/include -fp-model precise -g
FC = mpiifort -O3 -xHOST -ip -no-prec-div -mcmodel=medium -shared-intel -qopenmp -I${FFTW}/include -fp-model precise -g
LIB= -L${FFTW}/lib -lfftw3 -L$(MKLROOT)/lib/intel64 -mkl=parallel -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lifcore -lmpi -liomp5 -lpthread -lm -ldl
****************

The error occurred again with lead-l-8zgnr-nc.dat and lead-r-8zgnr-nc.dat.
I investigated the problem using a debugger (Arm DDT).

****Error Message****
Process stopped in make_NC_v_eff (Occupation_Number_LDA_U.c:5120) with signal SIGSEGV (Segmentation fault).
Reason/Origin: address not mapped to object (attempt to access invalid address)

5120: time_per_atom[Gc_AN] += Etime_atom - Stime_atom;

Etime_atom: 1578738274.145021
Stime_atom: 4.9406564584124654e-324

openmx:110521 terminated with signal 11 at PC=711388 SP=7fffad93e680. Backtrace:
openmx[0x711388]
openmx[0x6f5b05]
openmx[0x4b8340]
openmx[0x996029]
openmx[0x994291]
openmx[0x405ff2]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1f232be495]
openmx[0x405379]
**********************
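
As a side note, the Stime_atom value reported above (4.9406564584124654e-324) appears to be
exactly DBL_TRUE_MIN, the smallest positive subnormal double, i.e. an almost-all-zero bit
pattern rather than a plausible timestamp, which might mean the variable is read before it
is written (or its memory is clobbered). A small standalone C check, not OpenMX code:

****Check (standalone C, not OpenMX)****
/* check_stime.c: compile with e.g. "icc -std=c11 check_stime.c" or "gcc -std=c11 check_stime.c". */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <float.h>

int main(void)
{
  double stime = 4.9406564584124654e-324;   /* the value shown in the DDT session */
  uint64_t bits;

  memcpy(&bits, &stime, sizeof bits);       /* inspect the raw 64-bit pattern */
  printf("value        = %.17g\n", stime);
  printf("bit pattern  = 0x%016llx\n", (unsigned long long)bits);
  printf("DBL_TRUE_MIN = %.17g\n", DBL_TRUE_MIN);  /* C11: smallest positive subnormal */
  return 0;
}
****************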

I compiled OpenMX 3.8.5 under the same conditions (with ScaLAPACK).
However, this error never occurred there.

Do you have any solutions for this problem?

Thank you in advance.

Best regards,
K. Yamaguchi
Re: Cannot finish the NEGF testrun ( No.4 )
Date: 2020/01/18 12:34
Name: T. Ozaki

Hi,

I cannot reproduce your problem in my environment.
So, it might be difficult to fix the problem.

Could you modify the code "Occupation_Number_LDA_U.c" as follows?

For line 5120:

Present:
time_per_atom[Gc_AN] += Etime_atom - Stime_atom;

Modified:
// time_per_atom[Gc_AN] += Etime_atom - Stime_atom;

For line 3379:

Present:
void make_NC_v_eff(int SCF_iter, int SucceedReadingDMfile, double dUele, double ECE[])

Modified:
#pragma optimization_level 1
void make_NC_v_eff(int SCF_iter, int SucceedReadingDMfile, double dUele, double ECE[])

Please modify the code as shown above, and recompile it.
Please let us know what happens after the modification.
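
For reference, the pragma above is Intel-compiler specific and lowers the optimization level
only for the function definition that immediately follows it, while the rest of the file is
still compiled with -O3. A minimal standalone sketch (not OpenMX code; the file and function
names are made up for illustration):

/* pragma_demo.c: compile with e.g. "mpiicc -O3 pragma_demo.c".
   Non-Intel compilers may ignore the pragma with an "unknown pragma" warning. */
#include <stdio.h>

#pragma optimization_level 1          /* applies only to the next function definition */
static double elapsed(double stime, double etime)
{
  return etime - stime;               /* plays the role of Etime_atom - Stime_atom */
}

int main(void)
{
  double time_acc = 0.0;              /* plays the role of time_per_atom[Gc_AN] */
  time_acc += elapsed(1.0, 2.5);
  printf("accumulated time: %f\n", time_acc);
  return 0;
}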

Regards,

TO