NEB apparently hangs

Posted: Thu Feb 23, 2023 11:07 am
by paulf
I have installed VTST tools 6.3 in Vasp 6.3.2. While there are no test suites to run, I did try the standard Vasp NEB tutorial and it worked fine. In the problem I wish to solve, I have two phases with different lattice vectors, which requires the use of LNEBCELL. I used nebmake.pl to make five image POSCAR files; together with the beginning and ending structures, the directories run from 00 to 06. The program seems to run fine, but hangs after a single SCF convergence. The console output indicates no errors; it simply stops after the first iteration. A similar OSZICAR can be found in each of the image directories. I have attached the (smaller) files in each directory for reference. I would be grateful if a hint could be offered as to what I am doing incorrectly.

running on 160 total cores
each image running on 32 cores
distrk: each k-point on 32 cores, 1 groups
distr: one band on 4 cores, 8 groups
using from now: 01/INCAR
vasp.6.3.2 27Jun22 (build Feb 17 2023 17:23:00) complex

POSCAR found type information on POSCAR MnTe
01/POSCAR found : 2 types and 4 ions
scaLAPACK will be used
LDA part: xc-table for Pade appr. of Perdew
POSCAR found type information on POSCAR MnTe
00/POSCAR found : 2 types and 4 ions
POSCAR found type information on POSCAR MnTe
06/POSCAR found : 2 types and 4 ions
Jacobian: 6.13161304309615
POSCAR found type information on POSCAR MnTe
00/POSCAR found : 2 types and 4 ions
POSCAR found type information on POSCAR MnTe
06/POSCAR found : 2 types and 4 ions
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read
augmentation electrons 37.9563187076692
soft electrons 0.000000000000000E+000
total electrons 37.9563187076692
augmentation electrons -4.251541731503774E-009
soft electrons 0.000000000000000E+000
total electrons -4.251541731503774E-009
augmentation electrons 129.104521749217
soft electrons 0.000000000000000E+000
total electrons 129.104521749217
augmentation electrons 121.791508869217
soft electrons 0.000000000000000E+000
total electrons 121.791508869217
WARNING: random wavefunctions but no delay for mixing, default for NELMDL
entering main loop
N E dE d eps ncg rms rms(c)
DAV: 1 0.341236403766E+03 0.34124E+03 -0.10816E+04 99456 0.341E+02
DAV: 2 0.115094332188E+02 -0.32973E+03 -0.30331E+03100744 0.129E+02
DAV: 3 -0.175276898409E+02 -0.29037E+02 -0.27981E+02116584 0.428E+01
DAV: 4 -0.179579005608E+02 -0.43021E+00 -0.42940E+00117352 0.512E+00
DAV: 5 -0.179617826664E+02 -0.38821E-02 -0.38813E-02120816 0.471E-01 0.129E+01
DAV: 6 -0.204863093301E+02 -0.25245E+01 -0.10126E+00119232 0.809E+00 0.124E+01
DAV: 7 -0.206173684080E+02 -0.13106E+00 -0.12614E-01117424 0.306E+00 0.102E+01
DAV: 8 -0.205726433884E+02 0.44725E-01 -0.13498E-01119736 0.126E+00 0.603E+00
DAV: 9 -0.205732052168E+02 -0.56183E-03 -0.68356E-02112784 0.988E-01 0.158E+00
DAV: 10 -0.205755921613E+02 -0.23869E-02 -0.18857E-03115936 0.299E-01 0.554E-01
DAV: 11 -0.205749301332E+02 0.66203E-03 -0.91064E-04114264 0.983E-02 0.320E-01
DAV: 12 -0.205747815381E+02 0.14860E-03 -0.49486E-04116016 0.686E-02 0.807E-02
DAV: 13 -0.205747936006E+02 -0.12063E-04 -0.89394E-05113440 0.320E-02 0.449E-02
DAV: 14 -0.205747962922E+02 -0.26916E-05 -0.64912E-06114440 0.732E-03 0.230E-02
DAV: 15 -0.205747992027E+02 -0.29105E-05 -0.24378E-06110984 0.429E-03 0.799E-03
DAV: 16 -0.205747996474E+02 -0.44476E-06 -0.55293E-07117680 0.190E-03 0.409E-03
DAV: 17 -0.205747995234E+02 0.12409E-06 -0.52617E-08 92952 0.561E-04 0.219E-03
DAV: 18 -0.205747994486E+02 0.74779E-07 -0.25340E-08 70752 0.598E-04

Re: NEB apparently hangs

Posted: Wed Mar 01, 2023 2:45 am
by paulf
A slight update, but basically the same question. After reviewing the background paper, I realized that it was necessary to rotate the system so that the lattice vectors have lower triangular form. I have done so and restarted the calculation, but it seems to hang again. I have included a tar file of the output at the point at which the calculation apparently hangs (it is still using CPU). I am not familiar with how fast the calculation should progress, but my earlier experiments with the Vasp Pt surface NEB input files showed that the calculation ran more or less continuously.

I have manually confirmed that the first SCF loop worked for all five image files. The last few lines of the OUTCAR from one of the images (02) are posted below. I am concerned that the energy of the chain is reported to be zero, although the other forces seem to be finite and reasonable. I would be grateful if some hints could be offered as to what to try next.

FREE ENERGIE OF THE ION-ELECTRON SYSTEM (eV)
---------------------------------------------------
free energy TOTEN = -20.63357155 eV

energy without entropy= -20.63357155 energy(sigma->0) = -20.63357155



--------------------------------------------------------------------------------------------------------


POTLOK: cpu time 0.0122: real time 0.0123


--------------------------------------------------------------------------------------------------------


energy of chain is (eV) 0.000000 for this image 0.000000
tangential force (eV/A) -2.024192
left and right image 0.726959 0.726959 A
TANGENT CHAIN-FORCE (eV/Angst)
-----------------------------------------------------------------------------------
0.49319 -0.28474 -0.29638 0.998317 -0.576378 -0.599928
-0.49319 0.28474 -0.29638 -0.998317 0.576378 -0.599930
0.00000 0.00000 0.29638 0.000000 0.000000 0.599928
0.00000 0.00000 0.29638 0.000000 0.000000 0.599929
-----------------------------------------------------------------------------------
CHAIN + TOTAL (eV/Angst)
----------------------------------------------
0.37360 -0.21575 0.41446
-0.37359 0.21570 0.41449
-0.65825 0.38006 -0.41448
0.65823 -0.38001 -0.41447
----------------------------------------------
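
A quick way to repeat the manual per-image check is to count the ionic-step lines in each image's OSZICAR. The following is only a minimal sketch: it assumes the intermediate images live in directories 01, 02, ... under one run directory, and that each finished ionic step adds one summary line containing "F=" to OSZICAR. The function names are hypothetical helpers, not part of Vasp or VTST tools.

```python
from pathlib import Path


def ionic_steps_per_image(root, n_images):
    """Count finished ionic steps in each intermediate image directory (01, 02, ...)."""
    counts = {}
    for i in range(1, n_images + 1):
        name = f"{i:02d}"
        oszicar = Path(root) / name / "OSZICAR"
        if not oszicar.is_file():
            counts[name] = None  # this image has not written an OSZICAR yet
            continue
        with oszicar.open() as fh:
            # Assumption: one line containing "F=" per finished ionic step;
            # the electronic (DAV/RMM) lines do not contain that token.
            counts[name] = sum(1 for line in fh if "F=" in line)
    return counts


def lagging_images(counts):
    """Return the images that are behind the most advanced image."""
    done = [c for c in counts.values() if c is not None]
    target = max(done, default=0)
    return sorted(n for n, c in counts.items() if c is None or c < target)
```

Calling `lagging_images(ionic_steps_per_image(".", 5))` from the run directory would list any image that has not reached the same ionic step as the others. An empty list while the job is still using CPU would suggest that all images finished the step and the stall is in the exchange between images.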

Re: NEB apparently hangs

Posted: Wed Mar 01, 2023 7:23 pm
by graeme
I have a long answer that doesn't really answer your question. First, there is no need to rotate the cell as that will be done internally.

Unfortunately, it is hard to debug the hanging problem. I don't see it on our machines. My one thought is an MPI wait timeout, but that's just a guess.

I do have some other advice for your calculation. First, you can see in my neb_test that your band sort-of optimizes. The fact that I couldn't get the forces to the convergence setting made me worried that there were insufficient images.

In neb_test2, I increased the number of images to 8 and the band looks quite a bit different, as I had feared. You can also see the development of an intermediate minimum. I relaxed that minimum in the mini directory.

Finally, neb_test3 shows an 8 image band from the initial state to the intermediate minimum. I didn't finish converging it, but it is all looking good.

Hopefully this helps if/when you solve the hanging problem after the first iteration. Actually, I would be interested to know if you have the same problem with the regular NEB in vasp - I would expect it to behave in the same way.

Re: NEB apparently hangs

Posted: Thu Mar 02, 2023 9:46 am
by paulf
Dear Graeme,

Thank you very much for your help. I have a cluster with 5 nodes, so I was working with 5 images to start. There is indeed an MPI problem. When I tried the same input on a single 32-core node (with 8 images), the calculation executed without error. I am at a loss as to how to debug what is apparently an MPI problem, as vasp (6.3.2) runs fine in parallel across multiple nodes for non-NEB calculations. In any case I am grateful for the help. I don't suppose you have any insight on how to troubleshoot such an MPI problem? I also have a GPU (which runs under MPI on a single node, as the Vasp developers have implemented it). This also hangs after a single SCF iteration and does not start the next cell resizing.

I am now continuing the calculation on a single node while I ponder how to debug the MPI situation.

I hope I might ask a naive question about parallelism and NEB. Is it required to run the problem in parallel (e.g. allocating enough cores so that all images are worked on in parallel)?

Re: NEB apparently hangs

Posted: Thu Mar 02, 2023 3:56 pm
by graeme
In vasp the communication between NEB images is by MPI, and so you are stuck with at least as many MPI ranks as images.
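
As a rough illustration of how the ranks get divided, here is a small sketch. It assumes the usual rule that the total number of ranks is split evenly among the images, then among KPAR k-point groups within each image, then into band groups of NCORE ranks. The function name `vasp_neb_layout` is hypothetical, and the checks are my reading of the console output earlier in this thread rather than anything taken from the vasp source.

```python
def vasp_neb_layout(total_ranks, images, kpar=1, ncore=1):
    """Split MPI ranks over images, k-point groups, and band groups (assumed rule)."""
    if total_ranks < images:
        raise ValueError("need at least one MPI rank per image")
    if total_ranks % images:
        raise ValueError("total ranks must be a multiple of IMAGES")
    per_image = total_ranks // images
    if per_image % kpar:
        raise ValueError("ranks per image must be a multiple of KPAR")
    per_kgroup = per_image // kpar
    if per_kgroup % ncore:
        raise ValueError("ranks per k-point group must be a multiple of NCORE")
    return {
        "ranks_per_image": per_image,
        "ranks_per_kpoint_group": per_kgroup,
        "band_groups": per_kgroup // ncore,
    }


# The five-node run from the first post: 160 ranks, 5 images, 4 cores per band.
print(vasp_neb_layout(160, 5, kpar=1, ncore=4))
# {'ranks_per_image': 32, 'ranks_per_kpoint_group': 32, 'band_groups': 8}
```

Those numbers match the "each image running on 32 cores" and "one band on 4 cores, 8 groups" lines in the posted console output. The 8-image run on a single 32-core node would come out at 4 ranks per image.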

It is, however, possible to run the NEB through ASE (and with our TSASE extension for the SSNEB) with vasp as a calculator. For that, ASE will just run each image one after the other.

The MPI solution likely depends on which MPI implementation you are running. We have done well with openmpi. In any case, you might look for environment variables that you can set for your MPI. I would look for things related to timeout or wait and see if you can set a no_timeout option. But as I mentioned, I am really just guessing, and it's very hard to debug without access to the machine and trying things.

If you get stuck, I can give you access to a cluster where you can run these calculations.

Re: NEB apparently hangs

Posted: Fri Mar 03, 2023 2:17 am
by paulf
Dear Graeme,

Hello from Japan where the cherry blossoms should be blossoming in a few weeks.

I am still playing with the combinations, but using MPI over several nodes works. My understanding is still incomplete, but I have parallelized over three nodes without trouble (paying attention to the KPAR and IMAGES values). The combination that got stuck was five nodes for five images -- what I would have thought would have been the most communication efficient combination. I have more to learn. In any case, VTST tools seems to work fine. I have never seen the (Infiniband) MPI lock up like this before. It sounds like it would be an interesting experience trying to debug it. For reference, I am using Intel's oneapi 2023. I am also now using eight images as you suggested.

Thank you for your help again.

Best wishes,
Paul