Bug in function Cluster_collinear
Date: 2011/04/15 01:24
Name: Renato Miceli
References: http://www.ichec.ie/

Hi,

one of our users at ICHEC (http://www.ichec.ie/) is having persistent problems with OpenMX while running her scenarios: every run ends in a segmentation fault. A closer look at the code suggests that there may be a bug in OpenMX. Could you please take a look at this?

----------

File: Cluster_DFT.c
Function: double Cluster_collinear(char *, int, int, double ***, double **, double *****, double ****, double *****, double *****, double[2], double[2])
Line: 743
Issue: OneD_Mat1 is being indexed outside the valid range (index out of bounds)

Piece of code:
for (i=istart; i<=iend; i++){
  i1 = (i - istart)*n;
  for (j=1; j<=n; j++){
    k = i1 + j - 1;
    OneD_Mat1[k] = C[spin][i][j];
  }
}

Function call:
Cluster_collinear (mode=0x70904c "scf", SCF_iter=1, SpinP_switch=0, C=0x34966f00, ko=0x34966f20, nh=0x2aaabb122ce0, CntOLP=0x2f1d3b10, CDM=0x39440f50, EDM=0x3582c610, Eele0=0x7fff43c97550, Eele1=0x7fff43c97560)

Values of variables (taken when process segfaulted):
nstep = 5
wstep = 1528
MaxN = 7644
n = 7644
OneD_Mat1 = (double *) 0x2aabbb127010
C = (double ***) 0x34966f00
spin = 0
step = 4
istart = 6113
iend = 7644
i = 7643
i1 = 11695320
j = 1831
k = 11697150

Cause of error: OneD_Mat1 holds 11696850 elements, but the index k has reached 11697150
Execution environment: the cluster Stoney (http://ichec.ie/infrastructure/stoney) over 16 nodes, each with 8 cores
Version of Code: OpenMX 3.5
Used libraries and tools: Intel MKL (version 10.2.5.035), MVAPICH2 for Intel Compilers, Intel C Compiler (version 11.1.072)
Time taken to fail: around 30 min (if compiled with -O3)

----------
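
The figures in the report above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not OpenMX code: it only replays the values from the debugger snapshot, and the helper names (flat_index, entries_needed, replay_first_snapshot) are hypothetical.

```c
#include <stdio.h>

/* Flat index used by the quoted copy loop: k = (i - istart)*n + j - 1. */
long flat_index(long i, long istart, long n, long j)
{
    return (i - istart) * n + j - 1;
}

/* Entries written during one step: each row in [istart, iend]
   contributes n values. */
long entries_needed(long istart, long iend, long n)
{
    return (iend - istart + 1) * n;
}

/* Replays the snapshot and prints the comparison. Returns 1 if the index
   at the crash lies outside the reported buffer, 0 otherwise.
   All constants are copied from the report; none come from the source. */
int replay_first_snapshot(void)
{
    const long n = 7644, nstep = 5, wstep = 1528;
    const long istart = 6113, iend = 7644;
    const long reported_size = 11696850L;
    const long k = flat_index(7643, istart, n, 1831);   /* i and j at the crash */

    printf("n / nstep         = %ld, remainder %ld (wstep = %ld)\n",
           n / nstep, n % nstep, wstep);
    printf("rows in this step = %ld\n", iend - istart + 1);
    printf("entries needed    = %ld\n", entries_needed(istart, iend, n));
    printf("reported size     = %ld\n", reported_size);
    printf("k at the crash    = %ld\n", k);
    return k >= reported_size;
}
```

With these values the step in question (step = 4) covers 1532 rows, four more than wstep = 1528, because 7644 is not a multiple of nstep = 5. The reported size also happens to equal (n+1)*(wstep+2); that is only an observation about the numbers, not a statement about how the buffer is allocated in the source.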

I understand that the bug is easier to trace if you can reproduce the scenario. I would be glad to give you the necessary input files; however, they are not mine, so I am waiting for the user's permission to pass them on to you.

If you would like any further information, please let me know.
And if you have any follow-up on this issue, please do keep me informed.

Kind regards,
Renato

Re: Bug in function Cluster_collinear ( No.1 )
Date: 2011/05/04 19:12
Name: T. Ozaki

Hi,

Thank you very much for letting me know about the possible bug in Cluster_DFT.c.
I would like to reproduce the error to check whether it comes from
a bug in the program or from the aggressive optimization of -O3.
Could you kindly give me the input file?

Thank you very much in advance for your cooperation.

Sincerely,

TO
Re: Bug in function Cluster_collinear ( No.2 )
Date: 2011/05/12 20:25
Name: Renato Miceli

Hi,

thank you for your reply.

Unfortunately the user could not disclose her original input file (the one whose variables were shown in my first post), but I was allowed to disclose another input file which leads to the same error. Below is a snapshot of the scenario at the moment the process segfaulted. The input file is too large to be posted here, so I have hosted it at this address:

Please note that the definitions of the atomic species were taken "as is" from the official OpenMX database of VPS and PAO (version 2006):


----------

File: Cluster_DFT.c
Function: double Cluster_collinear(char *, int, int, double ***, double **, double *****, double ****, double *****, double *****, double[2], double[2])
Line: 743
Issue: OneD_Mat1 is being indexed outside the valid range (index out of bounds)
Result: Signal 11 (segmentation fault) is raised

Piece of code:
for (i=istart; i<=iend; i++) {
  i1 = (i - istart) * n;
  for (j=1; j<=n; j++) {
    k = i1 + j - 1;
    OneD_Mat1[k] = C[spin][i][j];
  }
}

Function call:
Cluster_collinear (mode=0x70904c "scf", SCF_iter=1, SpinP_switch=0, C=0x2271f1f0, ko=0x2271f210, nh=0x22b197a0, CntOLP=0x244470c0, CDM=0x282aad40, EDM=0x33853c00, Eele0=0x7fff5a68f550, Eele1=0x7fff5a68f560)

Values of variables (taken when process segfaulted):
nstep = 5
wstep = 1486
MaxN = 7434
n = 7434
OneD_Mat1 = (double *) 0x2aab945e3010
C = (double ***) 0x2271f1f0
spin = 0
step = 4
istart = 5945
iend = 7434
i = 7433
i1 = 11061792
j = 1503
k = 11063294

Cause of error: OneD_Mat1 holds 11063280 elements, but the index k has reached 11063294
Execution environment: the cluster Stoney (http://ichec.ie/infrastructure/stoney) over 16 nodes, each with 8 cores
Version of Code: OpenMX 3.5
Used libraries and tools: Intel MKL (version 10.2.5.035), MVAPICH2 for Intel Compilers, Intel C Compiler (version 11.1.072)
Time taken to fail: around 30 min (if compiled with -O3)

----------


If any other piece of information is needed, please do let me know. I would also appreciate it if you could keep me informed of the progress in tracing the bug (any hypotheses about where it might be, or where it has been ruled out) and of whether the error is a result of the optimisations performed by the compiler. I am very interested in seeing this issue solved as soon as possible.

Thank you very much for your cooperation.

Kind regards,
Renato Miceli
Re: Bug in function Cluster_collinear ( No.3 )
Date: 2011/05/12 20:33
Name: Renato Miceli
References: http://www.ichec.ie/

Hi,

just to let you know I had to remove the referenced links to be able to post the message.

The address where you can find the input file is this one:
http://www-staff.ichec.ie/~rmiceli/openmx-input-segfaulted.txt

The address to the official OpenMX database (version 2006) is this one:
http://www.jaist.ac.jp/~t-ozaki/vps_pao2006/vps_pao.html

Kind regards,
Renato Miceli
Re: Bug in function Cluster_collinear ( No.4 )
Date: 2011/05/12 21:34
Name: T.Ozaki

Hi,

Thank you for providing the input file.
I will try to fix the problem and report the likely cause.

> just to let you know I had to remove the referenced links to be able
> to post the message.

Sorry for the inconvenience.

Best regards,

TO
Re: Bug in function Cluster_collinear ( No.5 )
Date: 2011/05/12 22:26
Name: Renato Miceli
References: http://www.ichec.ie/

Hi,

thank you for your quick reply.
As soon as you have any news in regard to this issue, please let me know.

Kind regards,
Renato Miceli
Re: Bug in function Cluster_collinear ( No.6 )
Date: 2011/05/13 00:41
Name: T.Ozaki

Hi,

It turned out that this is in fact a bug in the program: the rounding
down in integer division was not properly taken into account.
Such a segfault may occur for larger systems, while it rarely occurs
for smaller systems.
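
To illustrate the kind of rounding issue described here, the following is a minimal sketch and not the released patch. It assumes a block layout inferred only from the istart and iend values reported earlier in this thread (each step takes n / nstep rows and the last step also takes the remainder); the function names step_range, max_rows_per_step and buffer_entries are hypothetical.

```c
#include <stddef.h>

/* Assumed layout (inferred from the reported values, not from the source):
   rows are numbered 1..n; step s covers wstep = n / nstep rows, and the
   last step additionally absorbs the n % nstep rows left over. */
void step_range(long n, long nstep, long step, long *istart, long *iend)
{
    long wstep = n / nstep;                 /* integer division rounds down */
    *istart = step * wstep + 1;
    *iend   = (step == nstep - 1) ? n : (step + 1) * wstep;
}

/* Largest number of rows any single step can cover under that layout.
   Sizing from wstep alone ignores the remainder taken by the last step. */
long max_rows_per_step(long n, long nstep)
{
    return n / nstep + n % nstep;
}

/* Entries a per-step work buffer needs so that the index
   k = (i - istart)*n + j - 1 stays in range for every step. */
long buffer_entries(long n, long nstep)
{
    return max_rows_per_step(n, nstep) * n;
}
```

For n = 7644 and nstep = 5 this gives 1532 rows for the last step instead of 1528, which matches the reported istart = 6113 and iend = 7644. An alternative design would spread the remainder over the first n % nstep steps so that no step exceeds the rounded-up quotient; which approach the patch takes is not stated in this thread.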

I have fixed it, and released a patch at
http://www.openmx-square.org/bugfixed/11May13/patch3.5.1.tar.gz

Thank you for your valuable report.

Best regards,

TO
Re: Bug in function Cluster_collinear ( No.7 )
Date: 2011/05/13 01:06
Name: Renato Miceli
References: http://www.ichec.ie/

Hi,

thank you very much for your time and cooperation, and especially for your promptness. This fix is very important for us. I am now applying the patch and building the latest version of OpenMX so that our user can benefit from the fix. Hopefully we will not have any further issues.

Again, thank you very much. Should you have any comments or questions, don't hesitate to get in touch.

Kindest regards,
Renato Miceli
Re: Bug in function Cluster_collinear ( No.8 )
Date: 2011/05/13 21:42
Name: Renato Miceli
References: http://www.ichec.ie/

Hi,

just to let you know that our execution with the input file provided completed successfully, without any issues (no runtime errors). I believe we should not encounter any problems anymore. Thanks again for your cooperation.

Kind regards,
Renato Miceli