Memory problem in supercomputers
Date: 2021/07/21 10:43
Name: Ninomiya

Dear OpenMX developers.

Hi, I am having a problem.

I ran a calculation on a supercell system with Atoms.Number = 700 on a supercomputer.
When I used the cell optimization "EF", the calculation stopped after about one hour.
This problem may be caused by exceeding the memory limit.

The calculation conditions are as follows
node = 16
mpirun -np 152 openmx A.dat -nt 4

The calculation parameters are as follows
scf.XcType = GGA-PBE
scf.Mixing.Type = rmm-diisk
scf.maxIter = 100
MD.Type = EF
MD.maxIter = 100

The memory usage was over 1TB. Why is the memory usage so large?

Please let me know how to avoid this problem.
Should I add more nodes or reduce the number of MPI processes?
Or should I change the SCF or structural relaxation settings?
Do you have any suggestions?

Best regards,
Ninomiya,

Re: Memory problem in supercomputers ( No.1 )
Date: 2021/06/16 19:47
Name: Naoya Yamaguchi

Hi,

Your problem might be similar to the one discussed in this thread:
http://www.openmx-square.org/forum/patio.cgi?mode=view&no=2619

In particular, as mentioned in the above thread, once several MD steps have finished, you can continue the calculation by restarting with `*.dat#`.
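
For example, a minimal sketch of such a restart, assuming the input file is named A.dat as in your post and reusing the same mpirun options (adjust to your actual job script):

========
# Restarting from A.dat#, which OpenMX writes with the latest structure
# and restart flags, continues the optimization from the last MD step.
mpirun -np 152 openmx A.dat# -nt 4
========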

Regards,
Naoya Yamaguchi
Re: Memory problem in supercomputers ( No.2 )
Date: 2021/06/17 08:55
Name: T. Ozaki

Hi,

I would think that the optimization of a system containing 700 atoms should be doable using 16 nodes.
How many cores and how much memory does each node have on that computer?
Why did you specify 152 MPI processes, which is not divisible by the 16 nodes?
A simple prescription is to reduce the number of threads from 4 to 2 or 1.
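
For instance, a sketch of the run line with 2 threads (the process count of 160 is only an illustration, chosen to be divisible by the 16 nodes; keep processes x threads within the cores of the allocated nodes):

========
# 160 MPI processes over 16 nodes = 10 processes/node, 2 OpenMP threads each
mpirun -np 160 openmx A.dat -nt 2
========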

If you show us your shell script and input file, we may be able to give you more appropriate suggestions.

Regards,

TO


Re: Memory problem in supercomputers ( No.3 )
Date: 2021/07/21 10:44
Name: Ninomiya

Dear Prof. Ozaki and Prof. Yamaguchi,
Thank you for your attention and help.

I’m using the SQUID supercomputer at Osaka University. (http://www.hpc.cmc.osaka-u.ac.jp/system/manual/squid-use/)
Each node has 76 cores, and 248 GB of memory is available per node. (http://www.hpc.cmc.osaka-u.ac.jp/system/manual/squid-use/jobclass/)

I am computing a supercell system with Atoms.Number = 687.
I am a beginner with OpenMX and supercomputers, so I may have made some mistakes in the calculation conditions.
As advised, I reduced the number of threads to 2 and submitted the job with 160 MPI processes, which is divisible by the 16 nodes.
The shell script and input file were set up as follows.

Shell script (http://www.hpc.cmc.osaka-u.ac.jp/system/manual/squid-use/jobscript/):
========
#!/bin/bash
#PBS -q SQUID
#PBS --group=
#PBS -l elapstim_req=05:00:00
#PBS -T intmpi
#PBS -b 16
#PBS -v OMP_NUM_THREADS=2

module load BaseCPU/2021
module load BaseApp/2021
module load OpenMX/3.9
cd $PBS_O_WORKDIR
mpirun ${NQSV_MPIOPTS} -np 160 openmx 6.dat## -nt 2 > 6.std
========

Input file:
========
System.CurrrentDirectory ./
System.Name 6
level.of.stdout 1
level.of.fileout 1

DATA.PATH /system/apps/rhel8/cpu/OpenMX/intel2020u4/3.9/openmx3.9/work/DFT_DATA19

Species.Number 2
<Definition.of.Atomic.Species
O O6.0-s2p2d1 O_PBE19
Ti Ti7.0-s3p2d1 Ti_PBE19
Definition.of.Atomic.Species>

Atoms.Number 687
Atoms.SpeciesAndCoordinates.Unit Ang
<Atoms.SpeciesAndCoordinates

Atoms.SpeciesAndCoordinates>

Atoms.UnitVectors.Unit Ang
<Atoms.UnitVectors
10.9272740 0.0000000 0.0000000
-4.5171250 13.5419169 0.0000000
44.8710433 51.1749099 47.2458760
Atoms.UnitVectors>

scf.XcType GGA-PBE
scf.SpinPolarization off
scf.ElectronicTemperature 300.0
scf.energycutoff 150.0
scf.maxIter 100
scf.EigenvalueSolver band
scf.Kgrid 2 2 1
scf.Mixing.Type rmm-diisk

scf.Init.Mixing.Weight 0.01
scf.Min.Mixing.Weight 0.01
scf.Max.Mixing.Weight 0.2
scf.Mixing.History 40
scf.Mixing.StartPulay 20
scf.Mixing.EveryPulay 1
scf.Kerker.factor 5.0

scf.criterion 1.0e-6
scf.maxIter 100

MD.Type EF
MD.Opt.DIIS.History 3
MD.Opt.StartDIIS 5
MD.Opt.EveryDIIS 200
MD.maxIter 100
MD.Opt.criterion 0.0003
========


The status of the job during the calculation was as follows.
========
ReqName Queue Pri STT S Memory CPU Elapse R H M Jobs
-------- -------- ---- --- - -------- -------- -------- - - - ----
openmx.sh SC64 0 RUN - 772.23G 248398.16 850 Y Y Y 16
========

When I set the number of threads to 2 and ran the calculation, the job stopped after about 20 minutes with the following error.
========
Exceeded per-job memory size warning limit
========

Should I just restart the calculation several times, in short runs?
I would appreciate your advice on the shell script and input file.

Best regards,
Ninomiya,
Re: Memory problem in supercomputers ( No.4 )
Date: 2021/06/17 13:05
Name: T. Ozaki

Hi,

To me, the resources you are using look sufficient to run the calculation.

(1)
Can I assume that you could successfully run openmx with other input files on SQUID?
If so, please let us know where openmx crashed.
Just at the first SCF step, or after several geometry optimization steps?
This can be checked by looking at the stdout or err file from the queuing system.

(2)
You seem not to specify the following options:

#PBS -l cpunum_job
#PBS -l memsz_job

ref.: http://www.hpc.cmc.osaka-u.ac.jp/system/manual/squid-use/jobscript/

In this case, are the maximum values automatically set?

(3)
Do you know how many MPI processes are allocated to each node by the option #PBS -l cpunum_job?

(4)
As for the input file, I think that a gamma-point calculation (scf.Kgrid 1 1 1) should be enough.
Also, decreasing scf.Mixing.History from 40 to 30 may reduce the memory requirement.
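
A sketch of the corresponding changes in the input file (only these two keywords; everything else stays as in your input):

========
# gamma-point sampling should be sufficient for a cell of this size
scf.Kgrid            1 1 1
# fewer stored residual vectors reduces the memory used by the mixer
scf.Mixing.History   30
========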

Regards,

TO
Re: Memory problem in supercomputers ( No.5 )
Date: 2021/07/21 10:44
Name: Ninomiya

Dear Prof. Ozaki,
Thank you for your attention and help.

(1)
Yes, I was able to finish the calculation for the model with 48 atoms.

In the .std file, it stops at the following part.
******************* MD= 1 SCF=31 *******************
<Poisson> Poisson's equation using FFT...
<Set_Hamiltonian> Hamiltonian matrix for VNA+dVH+Vxc...
<Band> Solving the eigenvalue problem...
KGrids1: -0.25000 0.25000
KGrids2: -0.25000 0.25000
KGrids3: 0.00000


(2), (3)
Yes. If #PBS -l cpunum_job and #PBS -l memsz_job are not set, they are automatically set to their maximum values,
that is, the upper limits of the submitted job class.

(4)
Thanks for your advice. I will change the conditions and run the calculation again.

Best regards,
Ninomiya,
Re: Memory problem in supercomputers ( No.6 )
Date: 2021/06/17 17:25
Name: Naoya Yamaguchi

Dear Ninomiya-san,

Your job script specifies `6.dat##` as the input file; is the calculation with the above error a restart calculation? If so, did the first (`6.dat`) and second (`6.dat#`) calculations finish normally?

Regards,
Naoya Yamaguchi
Re: Memory problem in supercomputers ( No.7 )
Date: 2021/06/17 18:06
Name: T. Ozaki

Hi,

By the following question:

> (3) Do you know how many MPI processes are allocated to each node by the option #PBS -l cpunum_job?

I intended to ask how we can know the number of MPI processes allocated to each node.

With the following specification:

#PBS -b 16
#PBS -v OMP_NUM_THREADS=2

mpirun ${NQSV_MPIOPTS} -np 160 openmx 6.dat## -nt 2 > 6.std

can we regard this as 160 MPI processes / 16 nodes = 10 MPI processes/node?
If so, you are using 10 MPI processes x 2 OMP threads = 20 CPU cores/node.

Then, according to your explanation, not specifying the keyword #PBS -l cpunum_job
allocates 76 CPU cores/node. In that case, 76 - 20 = 56 CPU cores will be idle and may waste memory unexpectedly.
It might be better to explicitly specify the number of CPU cores using the keyword #PBS -l cpunum_job;
in your case, it should be 20.
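
A sketch of the corresponding header lines, keeping the 16 nodes, 160 processes and 2 threads from your script (the other #PBS lines stay as they are):

========
#PBS -b 16                 # 16 nodes
#PBS -l cpunum_job=20      # 10 MPI processes x 2 OpenMP threads per node
#PBS -v OMP_NUM_THREADS=2
========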

Also, does "openmx 6.dat## -nt 2 > 6.std" make sense, since # tends to be regarded as a flag of
comment in shell script ?
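
If the '#' characters are a concern, one illustrative way to remove any ambiguity is to quote the argument so the shell passes the file name through literally:

========
# quoting guarantees that '6.dat##' is passed to openmx as-is
mpirun ${NQSV_MPIOPTS} -np 160 openmx '6.dat##' -nt 2 > 6.std
========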

Regards,

TO


Re: Memory problem in supercomputers ( No.8 )
Date: 2021/07/21 10:45
Name: Ninomiya

Dear Prof. Yamaguchi,

Yes, the calculation with the above error is a restart calculation.
The first (`6.dat`) and second (`6.dat#`) calculations did not finish normally;
each time, the calculation was restarted from the run that terminated with an error.

Best regards,
Ninomiya,
Re: Memory problem in supercomputers ( No.9 )
Date: 2021/06/19 18:40
Name: Naoya Yamaguchi

Dear Ninomiya-san,

I think that, as Prof. Ozaki says, you should review the job script.
Basically, you should set the numbers of MPI processes and OpenMP threads so that their product equals the number of cores in the allocated nodes.
If not, you might need to set additional options appropriately, such as cpunum_job or cpunum-lhost (https://www.hpc.nec/documents/nqsv/pdfs/g2ad03-NQSVUG-Operation.pdf).
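
For example, a sketch for SQUID's 76 cores per node with 16 nodes (the numbers are only illustrative and should be adjusted to your job class):

========
#PBS -b 16                    # 16 nodes x 76 cores = 1216 cores in total
#PBS -v OMP_NUM_THREADS=2
# 608 MPI processes x 2 OpenMP threads = 1216 cores (38 processes/node)
mpirun ${NQSV_MPIOPTS} -np 608 openmx 6.dat## -nt 2 > 6.std
========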

Regards,
Naoya Yamaguchi
Re: Memory problem in supercomputers ( No.10 )
Date: 2021/07/21 10:45
Name: Ninomiya

Dear Prof. Ozaki and Prof. Yamaguchi,

Thank you for your attention and help.
I will review the job script.

Best regards,
Ninomiya,
