Segfaults during memory accesses in Band_DFT_Col.c
Date: 2011/07/05 02:10
Name: Renato Miceli
References: http://www.ichec.ie/

Hi,

a user of ours at ICHEC (http://www.ichec.ie/) is having issues with OpenMX again. This time, though, it does not seem to be an evident bug. I have put some effort into tracking down the bug itself, but, despite my assumption that it was an out-of-memory condition, all I get are segfaults. Could you take a look at this issue, please?

----------

File: Band_DFT_Col.c
Function: static double Band_collinear2(int SCF_iter,
int knum_i, int knum_j, int knum_k,
int SpinP_switch,
double *****nh,
double ****CntOLP,
double *****CDM,
double *****EDM,
double Eele0[2], double Eele1[2],
int *MP,
int *order_GA,
double *ko,
double *koS,
double ***EIGEN,
double *H1,
double *S1,
double *CDM1,
double *EDM1,
dcomplex **H,
dcomplex **S,
dcomplex **C,
dcomplex *BLAS_S,
int ***k_op,
int *T_k_op,
int **T_k_ID,
double *T_KGrids1,
double *T_KGrids2,
double *T_KGrids3,
int myworld1,
int *NPROCS_ID1,
int *Comm_World1,
int *NPROCS_WD1,
int *Comm_World_StartID1,
MPI_Comm *MPI_CommWD1,
int myworld2,
int *NPROCS_ID2,
int *NPROCS_WD2,
int *Comm_World2,
int *Comm_World_StartID2,
MPI_Comm *MPI_CommWD2);

Lines: 790 and 797 (initialization of matrices S and H, respectively -- the line where the error takes place depends on the process).
Piece of code:
if (SCF_iter==1 || all_knum!=1){

  /* zero the work array and the overlap matrix S (the segfault occurs here on some processes) */
  for (i1=0; i1<n*n; i1++) BLAS_S[i1] = Complex(0.0,0.0);

  for (i1=1; i1<=n; i1++){
    for (j1=1; j1<=n; j1++){
      S[i1][j1] = Complex(0.0,0.0);
    }
  }
}

/* zero the Hamiltonian matrix H (or the segfault occurs here, depending on the process) */
for (i1=1; i1<=n; i1++){
  for (j1=1; j1<=n; j1++){
    H[i1][j1] = Complex(0.0,0.0);
  }
}

Backtrace:
#0 0x00000000004a8037 in Band_collinear2 (SCF_iter=1, knum_i=1, knum_j=1, knum_k=1, SpinP_switch=0,
nh=0x7e50940, CntOLP=0x6dd4e80, CDM=0x91fa8c0, EDM=0x9218c70, Eele0=0x7fff309d7ad0, Eele1=0x7fff309d7ae0,
MP=0x2aab470a7ea0, order_GA=0x1786e8f0, ko=0x2aab47b543e0, koS=0x2aab47b5ea50, EIGEN=0x2aab470ac650,
H1=0x38d4d820, S1=0x3941b830, CDM1=0x39ae9840, EDM1=0x3a1b7850, H=0x2aab47b710f0, S=0x2aab47b29d50,
C=0x2aab9b4420d0, BLAS_S=0x1dccaf10, k_op=0x4099bd30, T_k_op=0x2aab474a6bd0, T_k_ID=0x2aab474a6bf0,
T_KGrids1=0x2aab47b690c0, T_KGrids2=0x2aabb636e6e0, T_KGrids3=0x2aabb646e3e0, myworld1=0,
NPROCS_ID1=0x2aabb646ba90, Comm_World1=0x2aabb636ba90, NPROCS_WD1=0x2aab475acdb0,
Comm_World_StartID1=0x2aab475acdd0, MPI_CommWD1=0x2aabb636c170, myworld2=0, NPROCS_ID2=0x17832f30,
NPROCS_WD2=0x2aabb636c190, Comm_World2=0x2aab47aa72d0, Comm_World_StartID2=0x2aabb646e200,
MPI_CommWD2=0x2aabb646e220) at Band_DFT_Col.c:790
#1 0x00000000004a6053 in Band_DFT_Col (SCF_iter=1, knum_i=1, knum_j=1, knum_k=1, SpinP_switch=0, nh=0x7e50940,
ImNL=0x0, CntOLP=0x6dd4e80, CDM=0x91fa8c0, EDM=0x9218c70, Eele0=0x7fff309d7ad0, Eele1=0x7fff309d7ae0,
MP=0x2aab470a7ea0, order_GA=0x1786e8f0, ko=0x2aab47b543e0, koS=0x2aab47b5ea50, EIGEN=0x2aab470ac650,
H1=0x38d4d820, S1=0x3941b830, CDM1=0x39ae9840, EDM1=0x3a1b7850, H=0x2aab47b710f0, S=0x2aab47b29d50,
C=0x2aab9b4420d0, BLAS_S=0x1dccaf10, k_op=0x4099bd30, T_k_op=0x2aab474a6bd0, T_k_ID=0x2aab474a6bf0,
T_KGrids1=0x2aab47b690c0, T_KGrids2=0x2aabb636e6e0, T_KGrids3=0x2aabb646e3e0, myworld1=0,
NPROCS_ID1=0x2aabb646ba90, Comm_World1=0x2aabb636ba90, NPROCS_WD1=0x2aab475acdb0,
Comm_World_StartID1=0x2aab475acdd0, MPI_CommWD1=0x2aabb636c170, myworld2=0, NPROCS_ID2=0x17832f30,
NPROCS_WD2=0x2aab47aa72d0, Comm_World2=0x2aabb636c190, Comm_World_StartID2=0x2aabb646e200,
MPI_CommWD2=0x2aabb646e220) at Band_DFT_Col.c:130
#2 0x0000000000644f2b in TRAN_DFT (comm1=1140850688, SucceedReadingDMfile=0, level_stdout=1, iter=3,
SpinP_switch=0, nh=0x7e50940, ImNL=0x0, CntOLP=0x6dd4e80, atomnum=844, Matomnum=7, WhatSpecies=0x60190e0,
Spe_Total_CNO=0x46ac810, FNAN=0x60243e0, natn=0x67eec50, ncn=0x67d7660, M2G=0x65a7fb0, G2ID=0x601c5e0,
F_G2M=0x601d320, atv_ijk=0x67ebe20, List_YOUSO=0xa13b20, CDM=0x91fa8c0, EDM=0x9218c70,
TRAN_DecMulP=0xb9adaa0, Eele0=0x7fff309d7ad0, Eele1=0x7fff309d7ae0, ChemP_e0=0x7fff309d7ab0)
at TRAN_DFT.c:457
#3 0x00000000004573e6 in DFT (MD_iter=1, Cnt_Now=1) at DFT.c:901
#4 0x0000000000407096 in main (argc=2, argv=0x7fff309d80b8) at openmx.c:462

Peeking the value of some local variables (taken from one of the segfaulted processes, which died for matrix S):
(gdb) p i1
$1 = 83
(gdb) p j1
$2 = 1
(gdb) p n
$3 = 5324
(gdb) p SCF_iter
$4 = 1
(gdb) p all_knum
$5 = 1
(gdb) p S
$6 = (dcomplex **) 0x2aab47b29d50
(gdb) p S[i1]
$7 = (dcomplex *) 0x2aabb66a7010
(gdb) p S[i1][j1]
Cannot access memory at address 0x2aabb66a7020

Issue: segfaults arise due to attempts to access unallocated elements S[i] and H[i] for certain values of i. This only happens for MPI processes whose identifier is a multiple of n, for n processes spawned on the same compute node (scenarios with 6, 7 and 8 processes per compute node were tried). All other processes are terminated with signal 15.
Attempt to solve: I tried checking the allocation of matrices S and H against NULL (both S == NULL and S[i] == NULL), but no such case was caught (a sketch of this check is given below). They are allocated in function TRAN_DFT, lines 320 to 328 of file TRAN_DFT.c -- matrices S and H are named S_Band and H_Band_Col there, respectively.
Cause of error: supposedly out-of-memory, but unknown as of now.
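For reference, the NULL check mentioned in the "Attempt to solve" item above looked roughly like the sketch below. It is illustrative only: the helper name, the myid rank variable and the exact placement are my own, and the dcomplex typedef merely mirrors OpenMX's layout.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double r, i; } dcomplex;   /* assumed to match OpenMX's dcomplex */

/* Hypothetical helper: verify that the row pointers of a 1-based [1..n] matrix are non-NULL. */
static void check_rows(dcomplex **M, int n, const char *name, int myid)
{
  int i;
  if (M == NULL) {
    printf("rank %d: %s itself is NULL\n", myid, name);
    abort();
  }
  for (i = 1; i <= n; i++) {
    if (M[i] == NULL) {
      printf("rank %d: %s[%d] is NULL\n", myid, name, i);
      abort();
    }
  }
}

/* usage, e.g. right after the allocations around TRAN_DFT.c, lines 320-328:
     check_rows(S_Band, n, "S_Band", myid);                                   */

Note that, as the gdb session above shows, S[i1] was non-NULL (0x2aabb66a7010) yet its memory could not be accessed, so a check like this necessarily comes back clean.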

Execution environment: the cluster Stoney (http://ichec.ie/infrastructure/stoney) over 16 nodes, each with 8 cores
Version of Code: OpenMX 3.5.1
Used libraries and tools: Intel MKL (version 10.2.5.035), MVAPICH2 for Intel Compilers, Intel C Compiler (version 11.1.072)
Time taken to fail: around 45 min (if compiled with -g -O0)

----------

You can find the input files at the following address:
http://www-staff.ichec.ie/~rmiceli/openmx-issue/

Please note that the definitions of the atomic species used were retrieved "as is" from the OpenMX official database of VPS and PAO (version 2006):
http://www.jaist.ac.jp/~t-ozaki/vps_pao2006/vps_pao.html

Please let me know if you'd like any further information.

I'd appreciate it if you could keep me informed of the bug-tracing process (e.g. any suppositions of where it might be, or where it has been confirmed not to be). I am very interested in seeing this issue solved as soon as possible.

Looking forward to hearing from you soon.

Kindest regards,
Renato Miceli

Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.1 )
Date: 2011/07/06 00:48
Name: T.Ozaki

Hi,

Using the input files with a little modification, I could reproduce
a similar error. Although I haven't fully analyzed it, the issue seems
to be a simple memory shortage.

Could you try the calculation using a smaller number of cores per node,
e.g., one core per node, so that the memory per core is increased?

Regards,

TO
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.2 )
Date: 2011/07/06 02:04
Name: Renato Miceli
References: http://www.ichec.ie/

Hi, T. Ozaki,

thank you for your quick reply.

It may well be a simple memory shortage; however, I could not confirm it by checking the aforementioned matrices H and S against NULL. None of their elements is NULL.

I ran these inputs for 8, 7 and 6 MPI processes per compute node, which gives 6 GB, ~6.8 GB and 8 GB of RAM per MPI process, respectively (each compute node on our cluster Stoney has 48 GB of RAM). The errors still appear despite there being more memory per process. I am now trying with 5 MPI processes per compute node, making it 9.6 GB RAM per process. I will let you know as soon as my tests finish.

Unfortunately I cannot try using one core per compute node, as that would require 128 compute nodes and Stoney only has 64 of them. Moreover, this cluster is used for research and production workloads, and requesting such a large number of compute nodes (right now I requested 26 compute nodes, about 40% of the entire cluster) leaves our job queued for a very long time, besides taking computing power away from productive jobs just for testing and debugging purposes.

I am certainly very interested in seeing the results of your full analysis of this error. Please let me know once you have further information about it. I am really looking forward to having this issue fixed as soon as possible.

Kindest regards,
Renato Miceli
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.3 )
Date: 2011/07/07 23:29
Name: Renato Miceli
References: http://www.ichec.ie/

Hi, T. Ozaki,

I tested the system with 5 MPI processes per compute node (hence every process has 9.6 GB of RAM at its disposal). The scenario still failed with the same errors.

Then, I ran a final test using 32 compute nodes (about 52% of the whole Stoney cluster), 4 MPI processes per compute node (every process having 12 GB of RAM to use). This time it worked.

However, there are a number of hints pointing to a memory bug rather than a plain memory shortage:
1) If the OS had run out of memory, the OOM killer should have killed the application rather than letting a segfault be raised on the application side; the signal we would have seen would not have been a segmentation fault.
2) The core dumps on disk for the scenario with 5 MPI processes per compute node were 5 to 5.2 GB in size, while 9.6 GB of RAM was available to each process. If all the memory had been used, we should have gotten core dumps of about 9.5 GB.

It is possible that having more memory per process is merely hiding the bug, or even that we were simply lucky that the execution finished successfully this time; if we ran it again, we might hit the error once more. Yet, jobs this large on a cluster like Stoney are not desirable, as they take computing power away from other jobs just for a single execution. Besides, we are at the limit allowed for job sizes on Stoney, so it won't be possible to test scenarios with even fewer MPI processes per compute node.

It should be noted that swapping is disabled on this cluster, so it is not possible to use more RAM than what is physically available. Moreover, please bear in mind that memory overcommitment is enabled.

I am very interested in seeing this issue solved as soon as possible. Please let me know if I can contribute to this fix somehow.

Thanks for your collaboration.
Looking forward to hearing from you soon.

Kindest regards,
Renato Miceli
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.4 )
Date: 2011/07/10 01:01
Name: T.Ozaki

Hi,

Thank you very much for providing the detailed information.
I ran the input file with two cases:

Case A:
32 processes & 1 thread, where 4 processes/node were allocated.

Case B:
16 processes & 2 threads, where 2 processes/node were allocated

Each node of our machine consists of 8 cores and has 16GB.

Case A was terminated by "PtlMEMDPost() failed", while case B finished
normally; in case B, each process was able to use 8 GB.

Although case A was not able to finish the job normally, this may be regarded
as a problem with the MPI setting we used rather than with OpenMX.
So I have not yet seen any clear evidence telling us that there is a program bug.
Of course, this does not guarantee that there is no program bug, as you suspect.

However, we need to find clearer evidence if this does indeed come from a bug
in OpenMX.

Regards,

TO
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.5 )
Date: 2011/07/12 01:33
Name: Renato Miceli
References: http://www.ichec.ie/

Hi, T. Ozaki,

thank you for your reply.

As a matter of information completeness, I'd like to let you know that our version of OpenMX was compiled using the following options:
- Without OpenMP: flag "-Dnoomp"
- Optimisation level zero: flag "-O0"
- Producing debugging information: flags "-g" and "-ggdb"

The two cases you ran the input file for spawned either 1 or 2 threads, which I assume were OpenMP threads. In the runs on our cluster, OpenMP was disabled at compile time, so each of the 128 cores ran a separate MPI process rather than a thread. I wonder whether a heavier use of MPI changes the behaviour of the code or of the MPI library, hence bringing the problem forward in the execution or increasing the chance of it happening with an even smaller number of processes. Is it possible that the MPI setting used is sensitive to the number of MPI processes spawned?

I agree that we should keep looking for more evidence to establish whether this issue is indeed a bug. Please let me know if I can contribute somehow to its resolution.
I am really looking forward to seeing this issue solved as soon as possible.

Thanks for your collaboration.

Kindest regards,
Renato Miceli
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.6 )
Date: 2011/07/19 04:07
Name: Renato Miceli
References: http://www.ichec.ie/

Hi, T. Ozaki,

one thing I forgot to mention is that the following premise I stated before was broken when I ran with 4 and 5 MPI processes per compute node:

"Issue: segfaults arise due to attempts to access unallocated elements S[i] and H[i] for certain values of i. This only happens for MPI processes whose identifier is a multiple of n, for n processes spawned on the same compute node (scenarios with 6, 7 and 8 processes per compute node were tried). All other processes are terminated with signal 15."

Not all MPI processes whose ID is a multiple of n died with a segfault for the aforementioned scenarios. In the case of 4 processes per node, no processes died; for 5 processes per node, only processes 5, 20, 30, 50, 55, 65, 70, 80, 85, 95, 100, 110, and 125 died with signal 11.

I hope this information is useful to you. Please let me know what more I can do to help out.
I am looking forward to hearing back from you soon.

Thanks again for your collaboration.

Kindest regards,
Renato Miceli
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.7 )
Date: 2011/07/21 20:43
Name: Renato Miceli
References: http://www.ichec.ie/

Hi, T. Ozaki,

as we suspect a memory issue, I executed the same scenario again with debugging hooks placed around the libc functions that deal with memory management. The goal was to catch any strange memory-management behaviour during the program execution. Basically, I exported an environment variable named "MALLOC_CHECK_" before execution. When its value is set, debugging (but less efficient) versions of the memory management functions are used; there is no need to recompile the application.
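For illustration only (this toy program is not part of OpenMX), the kind of bug these hooks catch is, for example, a double free:

#include <stdlib.h>

/* Toy example: run as  MALLOC_CHECK_=2 ./a.out
   glibc's debugging hooks detect the double free and abort with SIGABRT;
   with MALLOC_CHECK_=1 a diagnostic is printed to stderr instead. */
int main(void)
{
  char *p = malloc(16);
  free(p);
  free(p);   /* memory-management bug: the same pointer is freed twice */
  return 0;
}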

Firstly, I set MALLOC_CHECK_ to 2 and ran the scenario. When MALLOC_CHECK_ is 2, any detected memory-management bug causes the execution to abort (e.g. calling free twice on the same address, or writing outside the boundaries of an allocated memory block). That is what happened: the execution finished with signal 6 (SIGABRT) after 7 seconds, and all 128 processes died, each generating a core dump file. The details of one of the core files are as follows:


-------------------------------------------------------------------------------
Backtrace:
#0 0x0000003f36030215 in raise () from /lib64/libc.so.6
#1 0x0000003f36031cc0 in abort () from /lib64/libc.so.6
#2 0x00000000006952be in free_check ()
#3 0x0000000000696fad in free ()
#4 0x0000000000702044 in mvapich2_minit ()
#5 0x0000000000732921 in MPIDI_CH3I_CM_Init ()
#6 0x000000000072f8f1 in MPIDI_CH3_Init ()
#7 0x000000000071a2c8 in MPID_Init ()
#8 0x00000000006bbfda in MPIR_Init_thread ()
#9 0x00000000006bb100 in PMPI_Init ()
#10 0x000000000040626b in main (argc=2, argv=0x7fffb117c3a8) at openmx.c:107

Piece of code in openmx.c:
98 /* for idle CPUs */
99 int tag;
100 int complete;
101 MPI_Request request;
102 MPI_Status status;
103
104 /* MPI initialize */
105
106 mpi_comm_level1 = MPI_COMM_WORLD;
107 MPI_Init(&argc,&argv);
108 MPI_Comm_size(mpi_comm_level1,&numprocs);
109 MPI_Comm_rank(mpi_comm_level1,&myid);
110 Num_Procs = numprocs;
111
112 /* for measuring elapsed time */
113
114 dtime(&TStime);
-------------------------------------------------------------------------------


Afterwards, since the root cause of the abort was still unclear, I executed the same scenario once more with MALLOC_CHECK_ set to 1. When this variable is set to 1, diagnostic information is printed to stderr.
No core dumps were generated, and the execution finished with status 1 (not signal 1) after 3 seconds of walltime. Each of the 128 processes printed a message similar to the following (only the pointer address changes from one process to another):


----------------------------------------------------
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks

free(): invalid pointer 0xda46000!

Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Other MPI error

mpiexec_raw: Warning: tasks 0-127 exited with status 1.
----------------------------------------------------


It appears that we now have some hints of a memory issue in OpenMX. As you suggested, it seems to be related to the MPI setting used in OpenMX.
Could you take a look at this issue, please?

I hope this information is useful. Should you need more information, please let me know.

Thank you very much for your collaboration.
Looking forward to hearing back from you soon.

Kindest regards,
Renato Miceli
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.8 )
Date: 2011/07/23 21:18
Name: T.Ozaki

Dear Renato,

Thank you very much for reporting the detailed analysis.
However, it is still difficult for us to be sure that this abort happens
due to a bug in OpenMX.

In your last report, you mentioned that you had gotten an error message:

> Fatal error in MPI_Init:

But, as you can see, the initialization of MPI appears at the very beginning of
the main routine of the OpenMX code. Are you sure that this message gives us
useful information?

> Not all MPI processes whose ID is a multiple of n died with a segfault for the
> aforementioned scenarios. In the case of 4 processes per node, no processes died; for 5
> processes per node, only processes 5, 20, 30, 50, 55, 65, 70, 80, 85, 95, 100, 110, and
> 125 died with signal 11.

Also, the observation can be explained by the fact that the memory required by each
process can differ, and that the memory requirement of the dead processes exceeded
some threshold.

Since the memory size can be systematically controlled by changing "scf.energycutoff"
in your input file, you can further check whether the abort is due to a memory
shortage or to a bug.
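For example, the cutoff can be lowered in the input file; a smaller value gives a coarser regular real-space grid and therefore a smaller memory footprint (the value below is only an illustration):

scf.energycutoff    120.0    # default=150 (Ry)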

Thank you very much for your cooperation in advance.

Regards,

Taisuke
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.9 )
Date: 2011/09/13 20:58
Name: Renato Miceli
References: http://www.ichec.ie/

Hi, T. Ozaki,

the user has identified another scenario that triggers the same issue. However, the user also discovered a trick that allows this scenario to finish successfully. As the errors were happening at the beginning of step 3 of the NEGF loop, the scenario was divided into two runs: one to compute the first two steps, and another to compute the rest of the steps. The first run generates the restart and output files, and the second run starts from where the first one finished and generates the final output files. The total execution finishes without any errors when this trick is used.
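For illustration, such a split can in principle be expressed with the standard scf.maxIter and scf.restart keywords (the values below are hypothetical; the actual split input files used by the user are linked below):

# first run: stop after the first two steps and write the restart files
scf.maxIter     2
scf.restart     off

# second run: read the restart files and compute the remaining steps
scf.maxIter     40
scf.restart     on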

The input files can be found at the following link. Please note that the files "Left_lead.dat" and "Right_lead.dat" are the same whether or not the trick is used:
http://www-staff.ichec.ie/~rmiceli/openmx-issue/tricky-scenario/

The unsplit scenario input file, if fed to OpenMX, reproduces the very same error. It can be found at the following link:
http://www-staff.ichec.ie/~rmiceli/openmx-issue/tricky-scenario/no-trick/

The split input files, which run the first two steps and only thereafter the rest of the steps, no longer produce any errors. Both files can be found at the following link:
http://www-staff.ichec.ie/~rmiceli/openmx-issue/tricky-scenario/using-trick/

The execution trials were performed in the following environment:
---
Execution environment: cluster Stoney (http://ichec.ie/infrastructure/stoney) over 16 nodes, each with 8 cores
Version of code: OpenMX 3.5.2
Used libraries and tools: Intel MKL (version 10.2.5.035), MVAPICH2 for Intel Compilers, Intel C Compiler (version 11.1.072)
Time taken to fail: around 00h04m03s (binary compiled with -O0 -g -ggdb)
Time taken to run successfully: around 00h03m53s for the first two steps and 02h54m33s for the rest of the steps (binary compiled with -O0 -g -ggdb)
---


An interesting fact is that this trick does not work when executing the original scenario reported in the first post to this thread. Its input files can be found here:
http://www-staff.ichec.ie/~rmiceli/openmx-issue/

The execution still segfaulted despite using the 2-run trick; the first run finishes successfully, but the second run still segfaults. I also tried a 3-run trick, in an attempt to overcome this issue: the first run computes the first two steps, the second run computes only the third step, and the final run computes the rest of the steps (which are 6 in total for this scenario). In both trials the execution still segfaults while computing the third step.

It appears that this new scenario exercises the system in a different way than the original one. The fact that the new scenario works with the trick while the original one does not may be a good hint that there is a bug in the code, and the difference in outcome might be used to pinpoint exactly where in the code the bug is. Given that we now have two different scenarios that hit the very same issue, it might be possible to follow the execution path within OpenMX, which may differ from scenario to scenario, and find exactly where the defect lies.

Kindest regards,
Renato Miceli
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.10 )
Date: 2011/09/15 17:59
Name: Renato Miceli
References: http://www.ichec.ie/

Hi, T. Ozaki,

as suspicion is still being drawn towards a memory issue, I've rerun the scenarios with the environment variable "MALLOC_CHECK_" set, the same way I had done previously. Setting this variable places debugging hooks around the libc memory management functions in order to catch any strange memory-management behaviour at runtime.

I've tested this new scenario in two cases: using and not using the trick. For each case I ran the scenario twice: once with MALLOC_CHECK_=1 and once with MALLOC_CHECK_=2. All four executions failed with the same outputs I had gotten when running the original scenario with MALLOC_CHECK_ set. When the trick was used, the crashes happened while executing the first split part of the input file (the first two iterations), so the rest of the iterations never got to execute.

The outputs for the original scenario and for this new scenario under the debugging hooks were the same. However, concluding that the root cause of the error has something to do with this fact would be misleading. To test that hypothesis, I ran, as a guinea pig for the experiment, an old scenario that reproduced the first bug I reported in the OpenMX code. This scenario can be found here:
http://www-staff.ichec.ie/~rmiceli/openmx-input-segfaulted.txt

With MALLOC_CHECK_ set to either 1 or 2, that execution also finishes unsuccessfully with the very same outputs. On the other hand, without MALLOC_CHECK_ set, the execution finishes successfully with valid outputs (because the bug that affected this old scenario has already been fixed). This means that whatever causes the executions to fail under the debugging hooks is common to all three scenarios, regardless of whether the bug we are looking for is present. It may be something common to the three input files (e.g. a typo), to the OpenMX binary (a code bug), to the C compiler (which may be handling the debugging hooks incorrectly), or to the libraries (e.g. a bug in the MPI library). In any case, it would be another, unrelated problem that does not appear to have anything to do with the bug we are looking for. For now, there does not seem to be any evidence that can help trace the bug causing the original error.

Would you have any ideas on how we can use these results to trace the bugs? Or other tests we could perform to find the cause of these issues? All comments are greatly appreciated.

Thank you very much for your time and collaboration.
Should you have any questions, please don't hesitate to get in touch.

Looking forward to hearing back from you soon.

Kindest regards,
Renato Miceli
Re: Segfaults during memory accesses in Band_DFT_Col.c ( No.11 )
Date: 2011/10/01 00:45
Name: Renato Miceli
References: http://www.ichec.ie/

Hi, T. Ozaki,

just to update you: this issue is still unresolved in OpenMX 3.5.4. I've rerun the original and the "tricky" scenarios, both with and without the 2-run trick, and the status is still the same: the original scenario crashes whether or not the trick is used, while the "tricky" scenario crashes without the trick but passes when the trick is applied. I've also tried the 3-run trick with the original scenario (running 2, 3 and 6 iterations, respectively), but the execution still crashes in the second run. Please note that the definitions of the atomic species were retrieved from the official database version 2006, not the newest 2011 one.

Interestingly, only one of the 128 processes segfaulted when I ran the "tricky" scenario without applying the trick, and the error appeared in a different location this time -- all the other executions, on the other hand, crashed at the very same location as before (line 790 or 797 of file Band_DFT_Col.c). The details are as follows.


--------------------------------------------------------------------------------------------------
Backtrace:
#0 0x0000000000749e50 in __intel_new_memcpy ()
#1 0x0000000000747086 in _intel_fast_memcpy.J ()
#2 0x00000000007105ca in MPIDI_CH3U_Request_unpack_uebuf ()
#3 0x0000000000719e7d in MPID_Irecv ()
#4 0x00000000006a2d1b in MPIC_Sendrecv ()
#5 0x000000000069bc54 in MPIR_Allreduce ()
#6 0x000000000069b091 in PMPI_Allreduce ()
#7 0x00000000006b4a6d in MPIR_Get_contextid ()
#8 0x00000000006b11fe in PMPI_Comm_create ()
#9 0x00000000004a7aba in Band_collinear2 (SCF_iter=1, knum_i=1, knum_j=1, knum_k=1, SpinP_switch=0, nh=0x1ab70fe0,
CntOLP=0x18961890, CDM=0x1a5dafe0, EDM=0x18a31380, Eele0=0x7fff1297d250, Eele1=0x7fff1297d260, MP=0x3c568670,
order_GA=0x216732c0, ko=0x3c56a460, koS=0x3c56ff70, EIGEN=0x2aaae25a9b50, H1=0x33e293d0, S1=0x33f8c9e0,
CDM1=0x340efff0, EDM1=0x34253600, H=0x2aaae25af010, S=0x2aaae26a3bc0, C=0x41429e40, BLAS_S=0x2bcc53c0, k_op=0x3c569b40,
T_k_op=0x2aaae26ae2b0, T_k_ID=0x2aaae26ae2d0, T_KGrids1=0x2aaae59ac220, T_KGrids2=0x2aaae59ac9f0,
T_KGrids3=0x2aaae25aad60, myworld1=0, NPROCS_ID1=0x21675fe0, Comm_World1=0x2aaae26a96d0, NPROCS_WD1=0x2aaae58a9bd0,
Comm_World_StartID1=0x2aaae58a9bf0, MPI_CommWD1=0x2aaae56aeed0, myworld2=0, NPROCS_ID2=0x3c569cf0,
NPROCS_WD2=0x2aaae56aeef0, Comm_World2=0x2aaae26a98e0, Comm_World_StartID2=0x2aaae5ae2d30, MPI_CommWD2=0x2aaae5ae2d50)
at Band_DFT_Col.c:702
#10 0x00000000004a605f in Band_DFT_Col (SCF_iter=1, knum_i=1, knum_j=1, knum_k=1, SpinP_switch=0, nh=0x1ab70fe0, ImNL=0x0,
CntOLP=0x18961890, CDM=0x1a5dafe0, EDM=0x18a31380, Eele0=0x7fff1297d250, Eele1=0x7fff1297d260, MP=0x3c568670,
order_GA=0x216732c0, ko=0x3c56a460, koS=0x3c56ff70, EIGEN=0x2aaae25a9b50, H1=0x33e293d0, S1=0x33f8c9e0,
CDM1=0x340efff0, EDM1=0x34253600, H=0x2aaae25af010, S=0x2aaae26a3bc0, C=0x41429e40, BLAS_S=0x2bcc53c0, k_op=0x3c569b40,
T_k_op=0x2aaae26ae2b0, T_k_ID=0x2aaae26ae2d0, T_KGrids1=0x2aaae59ac220, T_KGrids2=0x2aaae59ac9f0,
T_KGrids3=0x2aaae25aad60, myworld1=0, NPROCS_ID1=0x21675fe0, Comm_World1=0x2aaae26a96d0, NPROCS_WD1=0x2aaae58a9bd0,
Comm_World_StartID1=0x2aaae58a9bf0, MPI_CommWD1=0x2aaae56aeed0, myworld2=0, NPROCS_ID2=0x3c569cf0,
NPROCS_WD2=0x2aaae26a98e0, Comm_World2=0x2aaae56aeef0, Comm_World_StartID2=0x2aaae5ae2d30, MPI_CommWD2=0x2aaae5ae2d50)
at Band_DFT_Col.c:130
#11 0x000000000064474b in TRAN_DFT (comm1=1140850688, SucceedReadingDMfile=0, level_stdout=1, iter=3, SpinP_switch=0,
nh=0x1ab70fe0, ImNL=0x0, CntOLP=0x18961890, atomnum=720, Matomnum=6, WhatSpecies=0x17ccc1a0, Spe_Total_CNO=0x17832760,
FNAN=0x17cd5f50, natn=0x183bb380, ncn=0x183d5fa0, M2G=0x1826bfb0, G2ID=0x17cceee0, F_G2M=0x17ccfa30,
atv_ijk=0x18557be0, List_YOUSO=0xa12b80, CDM=0x1a5dafe0, EDM=0x18a31380, TRAN_DecMulP=0x1bb22740, Eele0=0x7fff1297d250,
Eele1=0x7fff1297d260, ChemP_e0=0x7fff1297d230) at TRAN_DFT.c:457
#12 0x00000000004573e6 in DFT (MD_iter=1, Cnt_Now=1) at DFT.c:901
#13 0x0000000000407096 in main (argc=2, argv=0x7fff1297d838) at openmx.c:462

File: Band_DFT_Col.c
Line of code: 702
Piece of code:
698 MPI_Comm_group(MPI_CommWD1[myworld1], &old_group);
699
700 /* define a new group */
701 MPI_Group_incl(old_group,T_knum,new_ranks,&new_group);
702 MPI_Comm_create(MPI_CommWD1[myworld1], new_group, &MPI_CommWD_CDM1[ID0]);
703 MPI_Group_free(&new_group);
--------------------------------------------------------------------------------------------------
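For reference, the pattern used here -- carving a sub-communicator out of an existing communicator via a group -- is sketched below as a self-contained toy program (the communicator, ranks and names are illustrative, not the OpenMX ones; run with at least two MPI processes):

#include <mpi.h>

/* Minimal sketch of the MPI_Comm_group / MPI_Group_incl / MPI_Comm_create
   pattern that appears around Band_DFT_Col.c, line 702. */
int main(int argc, char *argv[])
{
  MPI_Comm  newcomm;
  MPI_Group old_group, new_group;
  int ranks[2] = {0, 1};   /* ranks to include in the new communicator */

  MPI_Init(&argc, &argv);

  MPI_Comm_group(MPI_COMM_WORLD, &old_group);
  MPI_Group_incl(old_group, 2, ranks, &new_group);

  /* collective over the old communicator; processes outside the group receive MPI_COMM_NULL */
  MPI_Comm_create(MPI_COMM_WORLD, new_group, &newcomm);

  if (newcomm != MPI_COMM_NULL) MPI_Comm_free(&newcomm);
  MPI_Group_free(&new_group);
  MPI_Group_free(&old_group);

  MPI_Finalize();
  return 0;
}

As the backtrace shows, the crash occurs inside the MPI_Allreduce that MPI_Comm_create performs internally (via MPIR_Get_contextid) to agree on a new context id.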


It still appears to be a problem within the MPI library or the way OpenMX uses the library. Can you advise me on this issue, please?

Thank you for your consideration.
Looking forward to hearing back from you soon.

Kindest regards,
Renato Miceli