Slow DM step with OpenMP parallelization

Date: 2022/01/14 00:25
Name: Pavel Ondracka <pavel.ondracka@email.cz>

Dear OpenMX users,

I see a huge difference in CPU time between pure MPI and hybrid OpenMP/MPI runs for one specific case. The system is meant to be amorphous TiON with ~130 atoms, and I have not seen this behavior for any other system. For the benchmarks I run just 10 SCF steps. The input file is here: https://drive.google.com/file/d/1XbQFSyC9kJQsnHFfJR-k_OhppGz4KNai/view?usp=sharing

When I run it fully MPI parallelized with 24 processes:

******************* MD= 1 SCF=10 *******************
<Poisson> Poisson's equation using FFT...
<Set_Hamiltonian> Hamiltonian matrix for VNA+dVH+Vxc...
<Band> Solving the eigenvalue problem...
KGrids1: -0.33333 0.00000 0.33333
KGrids2: -0.33333 -0.00000 0.33333
KGrids3: -0.33333 0.00000 0.33334
<Band_DFT> Eigen, time=11.755527
<Band_DFT> DM, time=0.000001
......
<DFT> Uele =-1152.589808396384 dUele = 1.660562545115
<DFT> NormRD = 3.810872422696 Criterion = 0.000500000000
......
Min_ID Min_Time Max_ID Max_Time
Total Computational Time = 14 252.855 0 252.871
readfile = 14 2.670 0 2.675
truncation = 14 8.278 19 8.654
MD_pac = 0 0.012 23 0.019
OutData = 13 0.000 0 0.011
DFT = 19 241.471 14 241.848

*** In DFT ***

Set_OLP_Kin = 14 5.804 19 8.477
Set_Nonlocal = 19 13.349 14 16.021
Set_ProExpn_VNA = 11 29.194 19 30.232
Set_Hamiltonian = 4 20.096 11 20.096
Poisson = 0 0.048 23 0.048
Diagonalization = 0 120.648 19 120.648
Mixing_DM = 12 0.313 1 0.313
Force = 0 19.957 9 19.957
Total_Energy = 3 10.171 0 10.175
Set_Aden_Grid = 19 0.175 11 1.211
Set_Orbitals_Grid = 14 3.009 19 4.707
Set_Density_Grid = 2 12.737 20 12.755
RestartFileDFT = 2 0.282 21 0.359
Mulliken_Charge = 6 0.021 13 0.021
FFT(2D)_Density = 14 0.098 0 0.098
Others = 19 0.093 14 2.169

When I run it with 12 MPI processes and 2 threads per process:

******************* MD= 1 SCF=10 *******************
<Poisson> Poisson's equation using FFT...
<Set_Hamiltonian> Hamiltonian matrix for VNA+dVH+Vxc...
<Band> Solving the eigenvalue problem...
KGrids1: -0.33333 0.00000 0.33333
KGrids2: -0.33333 -0.00000 0.33333
KGrids3: -0.33333 0.00000 0.33334
<Band_DFT> Eigen, time=14.865723
<Band_DFT> DM, time=25.889454
......
<DFT> Uele =-1152.589808383992 dUele = 1.660562545608
<DFT> NormRD = 3.810872418488 Criterion = 0.000500000000
......
Min_ID Min_Time Max_ID Max_Time
Total Computational Time = 1 542.048 0 542.064
readfile = 10 1.583 6 1.585
truncation = 6 9.128 1 9.522
MD_pac = 0 0.012 11 0.020
OutData = 11 0.000 0 0.012
DFT = 1 530.889 6 531.283

*** In DFT ***

Set_OLP_Kin = 6 6.714 8 8.059
Set_Nonlocal = 8 13.440 6 14.785
Set_ProExpn_VNA = 6 28.091 1 28.786
Set_Hamiltonian = 0 17.256 10 17.256
Poisson = 2 0.063 11 0.063
Diagonalization = 0 412.163 8 412.163
Mixing_DM = 4 0.413 1 0.413
Force = 0 22.612 1 22.612
Total_Energy = 6 9.618 0 9.623
Set_Aden_Grid = 1 0.199 6 0.892
Set_Orbitals_Grid = 6 2.997 1 3.840
Set_Density_Grid = 6 13.814 10 13.849
RestartFileDFT = 4 0.275 0 0.313
Mulliken_Charge = 3 0.024 4 0.024
FFT(2D)_Density = 2 0.169 1 0.169
Others = 1 0.116 6 1.389

I'm confused by the final timing report. The per-step SCF output says the diagonalization (Eigen) time is comparable and the DM step is slow, but the final report does not show this: there, only the diagonalization (ELPA?) is reported as much slower.
Is this just hybrid ELPA being slow, or is something else going on? For other systems this works fine, and the hybrid OpenMP/MPI setup actually has faster diagonalization (albeit for a system with ~300 atoms).
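
A quick arithmetic check on the two logs may help here. The sketch below assumes that every one of the 10 SCF steps costs about what step 10 does, and that the "Diagonalization" line in the final report covers both the `Eigen` and `DM` parts of `<Band_DFT>`. Neither assumption is confirmed by the output itself; they are inferred from the numbers. The dictionary keys and the helper `estimate_diag_total` are made up for illustration.

```python
# Compare the per-step <Band_DFT> timings with the final "Diagonalization" line.
# Assumption: all SCF steps cost roughly the same as step 10 shown in the logs.
SCF_STEPS = 10

runs = {
    "24 MPI": {"eigen": 11.755527, "dm": 0.000001, "diag_total": 120.648},
    "12 MPI x 2 OMP": {"eigen": 14.865723, "dm": 25.889454, "diag_total": 412.163},
}


def estimate_diag_total(run, steps=SCF_STEPS):
    """Estimate the total as steps * (Eigen + DM), under the assumption above."""
    return steps * (run["eigen"] + run["dm"])


for name, run in runs.items():
    est = estimate_diag_total(run)
    print(f"{name}: estimated {est:.1f} s, reported {run['diag_total']:.1f} s")
```

Under these assumptions the estimates come out at roughly 118 s and 408 s, close to the reported 120.648 s and 412.163 s. That would suggest the two reports are consistent, with the extra time in the hybrid run coming from the DM part rather than from the eigenvalue solve itself.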

This is with the latest OpenMX 3.9.9, compiled with Intel compiler version 19.0.1.144 and OpenMP-parallel MKL.

Any ideas?

Best regards
Pavel

Re: Slow DM step with OpenMP parallelization ( No.1 )

Date: 2022/01/14 15:29
Name: Naoya Yamaguchi

Hi,

I guess you forgot to set the OpenMP environment variable (e.g. `OMP_NUM_THREADS=2`).

Regards,
Naoya Yamaguchi

Re: Slow DM step with OpenMP parallelization ( No.2 )

Date: 2022/01/14 16:14
Name: Naoya Yamaguchi

Dear Pavel,

I misunderstood how you parallelized the runs.
Comparing the 24 MPI and 12 MPI/2 OMP cases: the 12 MPI/2 OMP case requires an additional diagonalization just for the DM, as explained in http://www.openmx-square.org/openmx_man3.9/node82.html , while the 24 MPI case doesn't, since the number of k-points to be calculated is 14 in your case.
If you run benchmark calculations, you need to take this into account.
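
As a rough illustration of where the 14 comes from and why the two runs differ, here is a minimal sketch. It assumes a 3x3x3 grid containing the Gamma point (as the `KGrids` lines in the log suggest), that only the k / -k pairing reduces the grid, and a spin multiplicity of 1 (not stated in the thread). The function names are hypothetical and are not taken from the OpenMX source; the condition itself uses "at least", as discussed further down in this thread.

```python
from itertools import product


def count_kpoints_with_inversion(n1, n2, n3):
    """Count k-points on an n1 x n2 x n3 grid after pairing k with -k.

    Grid points are labelled by integer indices; index i along an axis with
    n points stands for the fractional coordinate i/n (modulo 1).
    Assumption: the grid contains the Gamma point, as in the log above.
    """
    dims = (n1, n2, n3)
    kept = set()
    for k in product(range(n1), range(n2), range(n3)):
        minus_k = tuple((-i) % n for i, n in zip(k, dims))
        if minus_k in kept:
            continue  # -k is already counted, so k adds nothing new
        kept.add(k)
    return len(kept)


def avoids_extra_dm_diagonalization(n_mpi, spin_multiplicity, n_kpoints):
    """True when the number of MPI processes is at least spin multiplicity x k-points."""
    return n_mpi >= spin_multiplicity * n_kpoints


nk = count_kpoints_with_inversion(3, 3, 3)
print("k-points:", nk)
# Only MPI processes count here; OpenMP threads do not.
print("24 MPI:", avoids_extra_dm_diagonalization(24, 1, nk))
print("12 MPI x 2 OMP:", avoids_extra_dm_diagonalization(12, 1, nk))
```

With 27 grid points, Gamma maps onto itself and the other 26 pair up, giving 14. Under the stated assumptions 24 processes satisfy the condition and 12 do not, which matches the near-zero DM time in the first log and the ~26 s DM time in the second.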

Regards,
Naoya Yamaguchi

Re: Slow DM step with OpenMP parallelization ( No.3 )

Date: 2022/01/14 17:00
Name: Pavel Ondracka <pavel.ondracka@email.cz>

OK, my bad for not reading the manual well enough. I believe I get it now, thanks a lot for the explanation.

BTW I think there is a small mistake in the manual "In addition, when the number of processes used in the parallelization _exceeds_ (spin multiplicity)$\times $(the number of k-points), OpenMX uses an efficient way in which finding the Fermi level", specifically if I now understand it correctly the "exceeds" should in fact be "is at least", right?

I think it would also be nice if the distribution of k-points over processes were written somewhere in the output, the same way the atoms-per-process distribution is written right now, so that I can actually see when I'm doing something stupid. In this case the calculation of the number of irreducible k-points, as illustrated in the manual, is quite simple: this is P1, so there is just the inversion symmetry. But for something that actually has other symmetries this would be non-trivial IMO.

Re: Slow DM step with OpenMP parallelization ( No.4 )

Date: 2022/01/14 18:51
Name: Naoya Yamaguchi

Dear Pavel,

>BTW I think there is a small mistake in the manual "In addition, when the number of processes used in the parallelization _exceeds_ (spin multiplicity)$\times $(the number of k-points), OpenMX uses an efficient way in which finding the Fermi level", specifically if I now understand it correctly the "exceeds" should in fact be "is at least", right?

Yes, that is right.

>I think it would be also nice if the k-point-process distribution would be written somewhere into the output,

Although I have not tried this myself, according to the source code you can get such information by setting `level.of.stdout` to 3.

Regards,
Naoya Yamaguchi