Array index out of bounds causes segfault in TRAN_Set_CentOverlap.c
Date: 2011/08/25 20:42
Name: Renato Miceli
References: http://www.ichec.ie/

Hi,

a user of ours at ICHEC (http://www.ichec.ie/) is having issues with OpenMX once again. It appears to be another bug in OpenMX. Could you please take a look at this issue?

---------------------------------------------------------------------

File: TRAN_Set_CentOverlap.c

Function:
void TRAN_Set_CentOverlap(
    MPI_Comm comm1,
    int job,
    int SpinP_switch,
    double k2,
    double k3,
    int *order_GA,
    double **H1,
    double *S1,
    double *****H,   /* input */
    double ****OLP,  /* input */
    int atomnum,
    int Matomnum,
    int *M2G,
    int *G2ID,
    int *WhatSpecies,
    int *Spe_Total_CNO,
    int *FNAN,
    int **natn,
    int **ncn,
    int **atv_ijk
    /* int *WhichRegion */
);

Lines: 208 or 260 (both lines of code are identical and the line where the error takes place depends on the process).

Piece of code (the code surrounding lines 208 and 260 is identical):
(...)
for (GA_AN=1; GA_AN<=atomnum; GA_AN++){
  wanA = WhatSpecies[GA_AN];
  tnoA = Spe_Total_CNO[wanA];
  Anum = MP[GA_AN];

  GA_AN_e = TRAN_Original_Id[GA_AN];
  Anum_e = MP_e[iside][GA_AN_e]; /* = Anum */

  for (GB_AN=1; GB_AN<=atomnum; GB_AN++){
(...)

Backtrace for one of the processes:
#0 0x00000000006511c9 in TRAN_Set_CentOverlap (comm1=-2080374780, job=3,
SpinP_switch=0, k2=9.9999999997324451e-07, k3=-9.9999999999999995e-07,
order_GA=0x2aaadf8530b0, H1=0x2aaadfc53c00, S1=0x2aaad1ef6010, H=0x198bc770,
OLP=0x18f7ab50, atomnum=800, Matomnum=6, M2G=0x18812000, G2ID=0x181546c0,
WhatSpecies=0x18151480, Spe_Total_CNO=0x17c8ae40, FNAN=0x1815bff0,
natn=0x188030c0, ncn=0x187902a0, atv_ijk=0x18800180)
at TRAN_Set_CentOverlap.c:260
#1 0x0000000000647aca in TRAN_DFT_Kdependent (comm1=-2080374780, parallel_mode=1,
numprocs=128, myid=73, level_stdout=1, iter=4, SpinP_switch=0,
k2=9.9999999997324451e-07, k3=-9.9999999999999995e-07, k_op=1,
order_GA=0x2aaadf8530b0, DM1=0x2aaae0a539d0, H1=0x2aaadfc53c00,
S1=0x2aaad1ef6010, nh=0x198bc770, ImNL=0x0, CntOLP=0x18f7ab50, atomnum=800,
Matomnum=6, WhatSpecies=0x18151480, Spe_Total_CNO=0x17c8ae40, FNAN=0x1815bff0,
natn=0x188030c0, ncn=0x187902a0, M2G=0x18812000, G2ID=0x181546c0,
atv_ijk=0x18800180, List_YOUSO=0xa13b20, CDM=0x18e26e30, EDM=0x1b82ea80,
Eele0=0x7fff636676d0, Eele1=0x7fff636676e0) at TRAN_DFT.c:1291
#2 0x00000000006467d5 in TRAN_DFT_Original (comm1=1140850688, level_stdout=1,
iter=4, SpinP_switch=0, nh=0x198bc770, ImNL=0x0, CntOLP=0x18f7ab50,
atomnum=800, Matomnum=6, WhatSpecies=0x18151480, Spe_Total_CNO=0x17c8ae40,
FNAN=0x1815bff0, natn=0x188030c0, ncn=0x187902a0, M2G=0x18812000,
G2ID=0x181546c0, F_G2M=0x18155350, atv_ijk=0x18800180, List_YOUSO=0xa13b20,
CDM=0x18e26e30, EDM=0x1b82ea80, TRAN_DecMulP=0x1d455f00, Eele0=0x7fff636676d0,
Eele1=0x7fff636676e0, ChemP_e0=0x7fff636676b0) at TRAN_DFT.c:921
#3 0x0000000000643dda in TRAN_DFT (comm1=1140850688, SucceedReadingDMfile=0,
level_stdout=1, iter=4, SpinP_switch=0, nh=0x198bc770, ImNL=0x0,
CntOLP=0x18f7ab50, atomnum=800, Matomnum=6, WhatSpecies=0x18151480,
Spe_Total_CNO=0x17c8ae40, FNAN=0x1815bff0, natn=0x188030c0, ncn=0x187902a0,
M2G=0x18812000, G2ID=0x181546c0, F_G2M=0x18155350, atv_ijk=0x18800180,
List_YOUSO=0xa13b20, CDM=0x18e26e30, EDM=0x1b82ea80, TRAN_DecMulP=0x1d455f00,
Eele0=0x7fff636676d0, Eele1=0x7fff636676e0, ChemP_e0=0x7fff636676b0)
at TRAN_DFT.c:218
#4 0x00000000004573e6 in DFT (MD_iter=1, Cnt_Now=1) at DFT.c:901
#5 0x0000000000407096 in main (argc=2, argv=0x7fff63667cb8) at openmx.c:462

Values of some local variables, inspected in gdb for one of the segfaulted processes:
(gdb) p MP_e
$1 = {0x2aaadf853d40, 0x2aaacd264b00}
(gdb) p iside
$2 = 1
(gdb) p GA_AN_e
$3 = 320
(gdb) p Anum_e
$4 = 0
(gdb) p GA_AN
$5 = 360
(gdb) p TRAN_Original_Id
$6 = (int *) 0x18159210

------

Issue: a segfault occurs when accessing the array MP_e.

Cause of error:
MP_e[0] and MP_e[1] each hold 113 elements (NUM_e[0]+1 and NUM_e[1]+1, respectively), but GA_AN_e indexes MP_e[iside] at position 320, well past the end of the array.

MP_e is declared on line 87...
int *MP, *MP_e[2];

And lines 105...
MP_e[0] = (int*)malloc(sizeof(int)*(NUM_e[0]+1));

and 108...
MP_e[1] = (int*)malloc(sizeof(int)*(NUM_e[1]+1));

allocate its arrays. Printing NUM_e in gdb yields the following...
(gdb) p NUM_e
$7 = {112, 112}

Note that NUM_e is never modified after it is used to allocate MP_e[0] and MP_e[1]. Therefore, MP_e[0] and MP_e[1] each hold 113 elements.
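To make the mismatch concrete, here is a minimal sketch of a guarded version of the lookup `Anum_e = MP_e[iside][GA_AN_e]`. The helpers `index_in_bounds` and `lookup_MP_e` are hypothetical names introduced only for illustration; they are not part of OpenMX and this is not the fix that was applied. It only assumes, as described above, that MP_e[iside] was allocated with NUM_e[iside]+1 elements.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of OpenMX): returns 1 if idx is a
   valid index into an array of len elements, 0 otherwise. */
int index_in_bounds(int idx, int len)
{
  return idx >= 0 && idx < len;
}

/* Hypothetical guarded form of "Anum_e = MP_e[iside][GA_AN_e]".
   Assumes MP_e[iside] was allocated with NUM_e[iside]+1 elements.
   On success, stores the value in *out and returns 0.
   On a bad index, prints a diagnostic, leaves *out untouched and
   returns -1 instead of reading out of bounds. */
int lookup_MP_e(int *MP_e[2], const int NUM_e[2],
                int iside, int GA_AN_e, int *out)
{
  assert(out != NULL);

  if (!index_in_bounds(iside, 2)) {
    fprintf(stderr, "iside=%d is outside MP_e (length 2)\n", iside);
    return -1;
  }
  if (!index_in_bounds(GA_AN_e, NUM_e[iside] + 1)) {
    fprintf(stderr, "MP_e[%d][%d] is out of bounds (length %d)\n",
            iside, GA_AN_e, NUM_e[iside] + 1);
    return -1;
  }
  *out = MP_e[iside][GA_AN_e];
  return 0;
}
```

With the values seen in gdb (NUM_e = {112, 112}, iside = 1, GA_AN_e = 320), the second check fails, so the call reports the problem rather than touching memory beyond the 113 allocated elements.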

The fault could also arise when MP_e is indexed by iside, but iside has value 1, which is within the bounds of MP_e (length 2).

Another possibility is that GA_AN_e holds garbage from an out-of-bounds read of TRAN_Original_Id by GA_AN. This can be ruled out, however: TRAN_Original_Id has length atomnum+1 and GA_AN iterates over the interval [1, atomnum]. GA_AN_e therefore holds its intended value.

TRAN_Original_Id is allocated in TRAN_Allocate.c (line 38) and TranMain.c (lines 256 and 687), and every occurrence is the same line of code...
TRAN_Original_Id = (int*)malloc(sizeof(int)*(atomnum+1));

And printing the value of atomnum after the process segfaulted yields the following...
(gdb) p atomnum
$8 = 800
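Since the reads of TRAN_Original_Id are in bounds, the remaining question is which atoms map to an index that MP_e[iside] cannot hold. A diagnostic scan along the following lines could answer that; `first_out_of_range` is a hypothetical helper written for this post, not an OpenMX function, and it assumes the 1-based indexing used in the loop above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical diagnostic (not part of OpenMX).
   TRAN_Original_Id has atomnum+1 entries and is indexed 1..atomnum.
   limit is the largest valid index of the target array, for example
   NUM_e[iside] for MP_e[iside].
   Returns the first GA_AN whose mapped index falls outside
   [0, limit], or 0 if every atom maps to a valid index. */
int first_out_of_range(const int *TRAN_Original_Id, int atomnum, int limit)
{
  int GA_AN;

  assert(TRAN_Original_Id != NULL);

  for (GA_AN = 1; GA_AN <= atomnum; GA_AN++) {
    int GA_AN_e = TRAN_Original_Id[GA_AN];
    if (GA_AN_e < 0 || GA_AN_e > limit) {
      return GA_AN;
    }
  }
  return 0;
}
```

Called as first_out_of_range(TRAN_Original_Id, atomnum, NUM_e[iside]) just before the loop, a non-zero result would identify the first atom that triggers the invalid access.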

A final possibility is that the assignment to Anum_e itself raises the segfault. However, Anum_e is an ordinary local variable, not dynamically allocated, so assigning to it should not cause an error.

------

Execution environment: the Stoney cluster (http://ichec.ie/infrastructure/stoney), 16 nodes with 8 cores each
Version of Code: OpenMX 3.5.1
Used libraries and tools: Intel MKL (version 10.2.5.035), MVAPICH2 for Intel Compilers, Intel C Compiler (version 11.1.072)
Time taken to fail: around 02h 07min 15s (compiled with -g -ggdb -O0)

---------------------------------------------------------------------

You can find the input files for this scenario at the following address:
http://www-staff.ichec.ie/~rmiceli/openmx-bug/

Please note that the definitions of the atomic species used were taken "as is" from the official OpenMX database of VPS and PAO (version 2006):
http://www.jaist.ac.jp/~t-ozaki/vps_pao2006/vps_pao.html

Please let me know if you'd like any further information.

I'd appreciate it if you could keep me informed of progress in tracking down the bug (e.g. any hypotheses about where it might or might not be). I am very interested in seeing this issue resolved as soon as possible.

Thanks in advance for your collaboration.
Looking forward to hearing from you soon.

Kindest regards,
Renato Miceli

Re: Array index out of bounds causes segfault in TRAN_Set_CentOverlap.c ( No.1 )
Date: 2011/08/30 15:05
Name: T.Ozaki

Hi,

Thanks a lot for reporting the problem.
Indeed, this is a bug in the program, and I have fixed it.
The patch is available at
http://www.openmx-square.org/bugfixed/11Aug30/patch3.5.2.tar.gz

Could you try it and report how it works?

Regards,

TO
Re: Array index out of bounds causes segfault in TRAN_Set_CentOverlap.c ( No.2 )
Date: 2011/09/02 04:37
Name: Renato Miceli
References: http://www.ichec.ie/

Hi,

thank you very much for your quick response!

I've run OpenMX 3.5.2 using the same input files and this time the execution finished successfully. It appears that the bug is now fixed.
Thank you for your time and collaboration. This fix is very important for us.

I am also grateful for having my contributions acknowledged on the OpenMX website. It means a lot to me.

Should you have any questions, please don't hesitate to get in touch.

Kindest regards,
Renato Miceli