memory increasing as MD steps increase and the version issue on it
Date: 2021/07/14 21:32
Name: Kunihiro Yananose   <ykunihiro@snu.ac.kr>

Dear Developers and Users,

Hi. Recently, I tried to relax the atomic structure of a large system, but ran out of memory. The memory usage increased at every MD step, and the system terminated the calculation after several MD steps because it exceeded the available memory.

I found the following threads about similar situations on this forum:
http://www.openmx-square.org/forum/patio.cgi?mode=view&no=1248
http://www.openmx-square.org/forum/patio.cgi?mode=view&no=2619
http://www.openmx-square.org/forum/patio.cgi?mode=view&no=2843
In principle, I can finish the relaxation by restarting the calculation several times. However, I ran some tests, and I think the results are worth reporting in detail.

From the first thread, I learned that memory usage increases as MD proceeds when the BFGS, RF, or EF method is used.
If I understand correctly,
1. Even in such a case, the memory usage saturates once the MD step reaches (MD.Opt.StartDIIS)+(MD.Opt.DIIS.History).
2. With the steepest descent method (MD.Type Opt), memory usage does not increase significantly compared with the first MD step.
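
As a quick check of point 1 against the input attached below, the expected saturation step can be computed from the two keywords. This is only a sketch: `read_keywords` is a hypothetical helper written for this post, not part of OpenMX, and it handles only simple "keyword value" lines (not the `<...>` blocks). Lower-casing the keys is a convenience of the sketch.

```python
def read_keywords(text):
    """Parse simple 'keyword value' lines into a dict (keys lower-cased).

    Hypothetical helper for illustration; it is not OpenMX's own parser.
    """
    params = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, comments, and block delimiters such as <Atoms.UnitVectors
        if len(parts) >= 2 and not parts[0].startswith(("#", "<")):
            params[parts[0].lower()] = " ".join(parts[1:])
    return params


# Excerpt of the MD settings from the attached input file
excerpt = """
MD.Type RF
MD.Opt.DIIS.History 4
MD.Opt.StartDIIS 5
MD.maxIter 200
"""

params = read_keywords(excerpt)
saturation_step = int(params["md.opt.startdiis"]) + int(params["md.opt.diis.history"])
print("expected saturation step:", saturation_step)
```

With these settings the sum is 5 + 4 = 9, so under the understanding above the memory usage would be expected to level off after about the 9th MD step.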

However, what I found from tests with a smaller system is the following:
1. With the RF method, memory usage keeps increasing at every MD step, even when the MD step far exceeds (MD.Opt.StartDIIS)+(MD.Opt.DIIS.History).
2. With the SD method (Opt), memory usage increases by a similar amount to the RF case.
3. I used version 3.9 for the above two cases. When I used version 3.8 for the same system, the memory usage saturated at a moderate level relative to the first MD step.

In detail, I checked the memory usage of calculations running on 2 nodes with 20 cores each (40 cores in total) and 252 GB of memory in total. The changes between the 1st step and roughly the 50th step are as follows.
(1) ver 3.9 with RF : 13.5 % to 46.85 %
(2) ver 3.9 with SD : 13.5 % to 43.6 %
(3) ver 3.8 with RF : 14.75 % to 21.0 % (this value was almost saturated at the 13th step)
(4) ver 3.8 with SD : 17.65 % to 22.8 %
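
For scale, the percentages above can be converted to absolute memory. The sketch below assumes the percentages are fractions of the 252 GB total; the helper names `to_gb` and `growth_gb` are made up for this illustration.

```python
TOTAL_GB = 252.0  # total memory of the two nodes, as stated above

# (label, percent at the 1st step, percent near the 50th step), copied from the list above
runs = [
    ("v3.9 RF", 13.5, 46.85),
    ("v3.9 SD", 13.5, 43.6),
    ("v3.8 RF", 14.75, 21.0),
    ("v3.8 SD", 17.65, 22.8),
]


def to_gb(percent, total_gb=TOTAL_GB):
    """Convert a percentage of the total memory into GB."""
    return total_gb * percent / 100.0


def growth_gb(first_pct, last_pct, total_gb=TOTAL_GB):
    """Memory growth in GB between two percentage readings."""
    return to_gb(last_pct, total_gb) - to_gb(first_pct, total_gb)


for label, first, last in runs:
    print(f"{label}: {to_gb(first):.1f} GB -> {to_gb(last):.1f} GB "
          f"(+{growth_gb(first, last):.1f} GB)")
```

Under that assumption, case (1) grows by about 84 GB over the run, while case (3) grows by about 16 GB.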

So I guess that a memory leak was newly introduced in version 3.9. I tried the memory leak test by adding the “memory.leak on” option, but it did not work.
I also ran the automatic memory leak test in the work directory using the -mltest option, but I couldn’t find any problem in the ‘mltest.result’ file.

I’m not sure whether this is due to a bug in the code or to the way I compiled it. Could someone please check this issue?

My input file for the test is attached below.

Sincerely,
Kunihiro Yananose

===========================================
System.CurrrentDirectory ./
System.Name CsPbI3_222rel
level.of.stdout 1
level.of.fileout 0

memory.leak on

Species.Number 3
<Definition.of.Atomic.Species
Cs Cs12.0-s2p2d2f1 Cs_PBE19
Pb Pb8.0-s2p2d2f1 Pb_PBE19
I I7.0-s2p2d2f1 I_PBE19
Definition.of.Atomic.Species>

Atoms.Number 40
Atoms.SpeciesAndCoordinates.Unit Ang
<Atoms.SpeciesAndCoordinates
1 Cs -0.109350302481 0.017661825817 0.123717736208 4.5 4.5
2 Cs -0.124823342223 0.147765296525 6.380680590298 4.5 4.5
3 Cs -0.007008736483 6.410857820077 -0.074383741132 4.5 4.5
4 Cs -0.118783732689 6.555175721101 6.422955307289 4.5 4.5
5 Cs 6.464431501880 -0.123579728882 -0.070189557646 4.5 4.5
6 Cs 6.427261600317 0.064232887890 6.348300039800 4.5 4.5
7 Cs 6.431834491679 6.339778169950 -0.141025479534 4.5 4.5
8 Cs 6.443892616710 6.477404425606 6.549260030321 4.5 4.5
9 Pb 3.272791985076 3.198242404407 3.199496275314 7.0 7.0
10 Pb 3.101028713465 3.276815140899 9.738897527700 7.0 7.0
11 Pb 3.224216836450 9.675880163716 3.212617710133 7.0 7.0
12 Pb 3.115689670255 9.652795003628 9.466297669410 7.0 7.0
13 Pb 9.552866474663 3.131487577824 3.099067705741 7.0 7.0
14 Pb 9.682485584765 3.300933809636 9.589271571266 7.0 7.0
15 Pb 9.635313739043 9.529292406954 3.202637961559 7.0 7.0
16 Pb 9.736884237717 9.680878345311 9.595758252134 7.0 7.0
17 I -0.040470869233 3.331136487231 3.201244902193 3.5 3.5
18 I -0.129839528095 3.333964598607 9.685554599797 3.5 3.5
19 I 0.090021458574 9.591833976545 3.343405331190 3.5 3.5
20 I 0.093392408372 9.532349466972 9.761073697706 3.5 3.5
21 I 6.545890826776 3.072905920953 3.223745556771 3.5 3.5
22 I 6.383904299641 3.136338045401 9.738173316161 3.5 3.5
23 I 6.402500603032 9.502674533975 3.061946226745 3.5 3.5
24 I 6.359066892740 9.486870196897 9.585903242210 3.5 3.5
25 I 3.159863033433 0.082467653794 3.290511543652 3.5 3.5
26 I 3.091457017761 0.133947541598 9.647750421211 3.5 3.5
27 I 3.167633380446 6.475779158721 3.271996769516 3.5 3.5
28 I 3.082218155842 6.464731433346 9.598311680722 3.5 3.5
29 I 9.730941445839 -0.045670803688 3.246576249420 3.5 3.5
30 I 9.613230046510 -0.114531203152 9.540865849085 3.5 3.5
31 I 9.685824118632 6.394104141214 3.154133422327 3.5 3.5
32 I 9.558215936012 6.296042817188 9.508648494660 3.5 3.5
33 I 3.193330822127 3.233235651656 0.146946342923 3.5 3.5
34 I 3.059034707844 3.242768045783 6.420996399233 3.5 3.5
35 I 3.138091341784 9.605332554478 -0.040018785521 3.5 3.5
36 I 3.154600024318 9.637835775037 6.551585445171 3.5 3.5
37 I 9.737581048537 3.153552960117 0.028731563163 3.5 3.5
38 I 9.518078835362 3.093365127634 6.458017392987 3.5 3.5
39 I 9.734247407858 9.547443621117 0.061449670843 3.5 3.5
40 I 9.605756045341 9.736854155951 6.531283916087 3.5 3.5
Atoms.SpeciesAndCoordinates>
Atoms.UnitVectors.Unit Ang
<Atoms.UnitVectors
12.820340000000 0.000000000000 0.000000000000
0.000000000000 12.820340000000 0.000000000000
0.000000000000 0.000000000000 12.820340000000
Atoms.UnitVectors>

scf.XcType GGA-PBE
scf.SpinPolarization off
scf.SpinOrbit.Coupling off
scf.ElectronicTemperature 300.0
scf.energycutoff 250.0
scf.maxIter 300
scf.EigenvalueSolver band
scf.Kgrid 2 2 2
scf.Mixing.Type rmm-diisk
scf.Init.Mixing.Weight 0.30
scf.Min.Mixing.Weight 0.001
scf.Max.Mixing.Weight 0.400
scf.Mixing.History 7
scf.Mixing.StartPulay 10
scf.criterion 1.0e-7

MD.Type RF

MD.Opt.DIIS.History 4
MD.Opt.StartDIIS 5
MD.maxIter 200
MD.TimeStep 0.5
MD.Opt.criterion 1.0e-4


Re: memory increasing as MD steps increase and the version issue on it ( No.1 )
Date: 2021/07/20 11:02
Name: T. Ozaki

Hi,

Thank you for your detailed report.
I will try to see what's happening, and post my report once the problem is figured out.

Regards,

TO
Re: memory increasing as MD steps increase and the version issue on it ( No.2 )
Date: 2021/07/24 17:51
Name: T. Ozaki

Hi,

I have checked the memory usage of OpenMX in both v3.8.5 and v3.9.2 with the input file you provided
by monitoring RSS as shown at
https://t-ozaki.issp.u-tokyo.ac.jp/calcs/monitoring_memory.PNG

Both calculations were performed using 20 MPI processes, and the figure shows the memory usage summed over the 20 MPI processes.
As we can see in the figure, the memory usage of v3.9.2 seems to be comparable to that of v3.8.5.
So, from this comparison it is difficult to say that any specific memory leak exists in v3.9.2.
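
For readers who want to reproduce this kind of measurement, here is a minimal sketch of summing RSS over processes on one Linux node by reading /proc/<pid>/status. It is not the script used to produce the figure above; the process name "openmx" and the helper names are assumptions, and on a multi-node job it would have to be run on each node separately.

```python
import os
import re


def parse_vmrss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or 0 if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def parse_name(status_text):
    """Return the process name from the text of /proc/<pid>/status."""
    match = re.search(r"^Name:\s+(\S+)", status_text, re.MULTILINE)
    return match.group(1) if match else ""


def total_rss_kb(process_name="openmx", proc_root="/proc"):
    """Sum VmRSS over all processes on this node whose name equals process_name."""
    if not os.path.isdir(proc_root):
        return 0  # not a Linux-style /proc filesystem
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are process directories
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                text = handle.read()
        except OSError:
            continue  # the process exited or is not readable
        if parse_name(text) == process_name:
            total += parse_vmrss_kb(text)
    return total


if __name__ == "__main__":
    # 1 GiB = 1024**2 kB
    print(f"total RSS: {total_rss_kb() / 1024**2:.2f} GiB")
```

Calling `total_rss_kb()` once per MD step (or at a fixed time interval) and logging the result gives a curve that can be compared between versions.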

However, I have noticed from a series of benchmarks that an unexpected memory leak seems to occur when pdgemm or pzgemm is called in some environments.
Note that only v3.9.2 calls pdgemm or pzgemm in the eigenvalue solver.
I am not sure whether what I found is related to the issue mentioned at
https://software.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/managing-performance-and-memory/using-memory-functions/avoiding-memory-leaks-in-onemkl.html
If so, the issue may depend on the version of MKL.

I will keep the issue in mind.
Anyway, thank you very much for your detailed report.

Regards,

TO
Re: memory increasing as MD steps increase and the version issue on it ( No.3 )
Date: 2021/07/27 15:11
Name: Kunihiro Yananose  <ykunihiro@snu.ac.kr>

Dear Prof. Ozaki,

Thank you for the test and your kind reply.
I compiled OpenMX with Intel oneAPI MKL, so the MKL version might be related to this issue, as you remarked.

I tried applying the solutions described at the link and ran some tests. However, I did not see any visible improvement.

Specifically, I tried the following:

(1) Setting an environment variable in the job submission bash script:
export MKL_DISABLE_FAST_MM=1

(2) Recompiling after modifying openmx.c as follows.

At line 73, I added
#include <mkl.h>
void mkl_free_buffers(void);
void mkl_thread_free_buffers(void);

and at line 677, I added
mkl_free_buffers();
mkl_thread_free_buffers();

I don't know whether this is the correct way to do it.

Because this is not an urgent issue for me, I will also set it aside for now.

Thank you so much,
K. Yananose