Debugging stuck jobs with the LBFGS solver
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Debugging stuck jobs with the LBFGS solver
Hi VTST community,
First and foremost, thank you Professor Henkelman and all the associated members for making such a valuable tool and hosting such a lively discussion forum. My problem concerns jobs getting stuck while using the LBFGS solver. As suggested in many posts here, I first run my NEB with a first-order method (quick-min or FIRE) and then, once the forces are sufficiently low, switch to LBFGS with a very tight EDIFF cutoff (1e-7).
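For concreteness, here is roughly what I mean by the two stages in the INCAR (a sketch only; IMAGES and the stage-1 EDIFF below are placeholders, and the tags follow the VTST convention of IBRION = 3 / POTIM = 0 with IOPT selecting the optimizer):
------
# Stage 1: first-order optimizer via VTST
cat > INCAR <<'EOF'
IBRION = 3        ! hand the ionic steps to the VTST optimizers
POTIM  = 0        ! disable VASP's own step
IOPT   = 7        ! FIRE (IOPT = 3 is quick-min)
IMAGES = 5        ! placeholder: number of NEB images
SPRING = -5
EDIFF  = 1E-5     ! placeholder: looser electronic convergence for stage 1
EOF

# Stage 2: once the forces are low, switch the same INCAR to LBFGS
# with the tight electronic convergence
sed -i -e 's/^IOPT .*/IOPT   = 1        ! LBFGS/' \
       -e 's/^EDIFF .*/EDIFF  = 1E-7/' INCAR
------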
I have had a lot of success with this approach, but lately some jobs have gotten stuck at random points during the LBFGS runs and I cannot understand why. At first I thought there was something wrong with our platform, so I ran another job on one of the NERSC systems, where the same thing happened. There is no pattern in terms of the number of steps taken, elapsed time, etc. I also tried running from our scratch filesystem, thinking it could be a disk I/O problem, but the job hung there as well. Could someone suggest a way to find out what is happening while the job is stuck? The interesting part is that the job does not kill itself; it just keeps running until I kill it manually or the walltime is reached. I am using VASP 5.4.1 with the VTST build.
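The only low-level check I know to run while a job is hung is to attach to one of the ranks on the node; a rough sketch, assuming gdb and strace are available there and <PID> is a vasp rank taken from ps:
------
# See whether the ranks are spinning (state R, ~100% CPU, typical of an
# MPI busy-wait) or sleeping in a system call (state D/S, ~0% CPU,
# typical of stuck I/O)
ps -u $USER -o pid,stat,%cpu,etime,comm | grep vasp

# Grab a stack trace from one rank without killing it
gstack <PID>      # or: gdb -p <PID>, then "bt", then "detach"

# Watch the system calls it makes; a rank stuck in read/write points
# at the filesystem rather than the solver
strace -p <PID>
------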
Please help me out with this.
Thank you
Siddharth
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Hi,
I observed something more that might help debug the issue more quickly. All stuck jobs hang at the end of a completed electronic cycle (sorry, I said earlier there was no pattern :(). The OUTCAR contains information only up to the point where the LBFGS output should appear. I have attached an example OUTCAR. I am also doing two things at this point: running with a different version of VASP (5.3.5), and setting NWRITE = 3 (earlier it was 1). Let me know what else I can do to debug this issue. Looking forward to your reply :)
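(For anyone who hits the same thing, this is a quick way to see where each image stopped, assuming the usual numbered image directories:)
------
# Print the tail of every image's OUTCAR to see how far each one got
for d in 0[0-9]; do
    echo "== image $d =="
    tail -n 5 "$d/OUTCAR"
done
------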
Thanks
Sid
- Attachments
- OUTCAR_stuck.zip (1.08 MiB) Downloaded 13141 times
Re: Debugging stuck jobs with the LBFGS solver
This appears to be the OUTCAR from a single image of an NEB calculation. It would help to see the entire calculation: if one image is stuck, all the other images will wait until it finishes. A .tar.gz file of the whole calculation would make it easier to debug the problem.
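Something along these lines, run from the top of the NEB directory, is enough; the excludes just keep the large wavefunction and charge files out:
------
tar -czf stuck_job.tar.gz --exclude='*/WAVECAR' --exclude='*/CHGCAR' \
    INCAR KPOINTS 0*/
------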
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Hi Professor Henkelman,
Attached is the .tar.gz file of the job.
Thanks
Sid
- Attachments
- stuck_job.tar.gz (7.13 MiB) Downloaded 13328 times
Re: Debugging stuck jobs with the LBFGS solver
It looks like image 02 was still working. Is it possible that the calculation was terminated by the queueing system?
I don't see any problem with the calculation, except that the job ended.
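If the machine runs SLURM, the accounting record will show whether the scheduler killed the job; for example (with JOBID being the job in question):
------
sacct -j JOBID --format=JobID,State,ExitCode,Elapsed,Timelimit
------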
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Hi Professor Henkelman,
The calculation was not terminated by the queueing system. It was terminated almost an hour after the OUTCARs were last written (while it was running, it completed an electronic step every minute or so). Also, when the calculation terminated, the queue output file (.o) that was generated carried the timestamp of the last OUTCAR update, i.e. a timestamp about an hour old. I see this happening in a lot of my jobs, at varied intervals. I ran some calculations with NWRITE = 3, but they gave no new information either; it is a bit harder to post those here because the OUTCARs for those runs are just too big. Also, the quick-min runs, as far as I have seen, do not face this issue; they generally run through the entire walltime. Do you think this might have something to do with the way our VASP is compiled? Kindly let me know what you think and what can be done :).
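(In case I am misreading the timestamps, this is how I compared them; the .o file name is whatever our queue produces:)
------
# Compare last-modified times of the queue output and each image's OUTCAR
ls -l --time-style=full-iso *.o* 0*/OUTCAR
------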
Thank you so much for your help.
Sid
Re: Debugging stuck jobs with the LBFGS solver
I don't understand what is going wrong with your calculation. The OUTCARs that you have indicate that the calculation is still in the electronic structure convergence and not in the VTST code. As you say, though, if the calculation ran further without writing any output, then we don't know where it stopped.
I tried running the same calculation on one of our machines. You can see in the attached file that the calculation converged in just 14 ionic iterations.
- Attachments
- stuck_job.tar.gz (4.78 MiB) Downloaded 13493 times
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Hi Professor Henkelman,
Thank you so much for helping me out. I am getting more and more confident that this is an issue with the way we have built VASP with VTST on our systems. I tried the vasp-tpc build on NERSC and it has not given me problems for the 6-7 NEB runs I have tried so far (only our vasp-vtst build gave me issues). We will take another look at the way we build it on our end. Do you have any suggestions on the compilers or patches we should use so that the build is robust? Kindly let me know what you think.
Thanks
Sid
Re: Debugging stuck jobs with the LBFGS solver
You can try running my binaries at NERSC. I think anyone can access my ~graeme/bin/vasp_??? binaries. I will also paste my makefile.include below:
------
# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxIFC\"\
-DMPI -DMPI_BLOCK=8000 \
-Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Duse_bse_te \
-Dtbdyn \
-Duse_shmem \
-DMKL_ILP64
CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC = ftn -v
FCL = ftn -v -mkl=sequential -lstdc++
FREE = -free -names lowercase
FFLAGS = -assume byterecl
#OFLAG = -fast -no-ipo
OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0
#BLAS= /usr/local/lib/libgoto2_barcelona-r1.13.a
#LAPACK= #lib/lapack_double.o lib/linpack_double.o
#SCALAPACK = /usr/local/lib/libscalapack.a $(BLACS)
#MKLPATH = /opt/intel/parallel_studio_xe_2016.0.047/compilers_and_libraries_2016.0.109/linux/mkl/lib/intel64
MKLPATH = $(MKLROOT)/lib/intel64
#BLAS= -Wl,--start-group -L$(MKLPATH) $(MKLPATH)/libmkl_intel_lp64.a $(MKLPATH)/libmkl_sequential.a $(MKLPATH)/libmkl_core.a -Wl,--end-group -lpthread
BLAS=
#LAPACK= $(MKLPATH)/libmkl_lapack95_lp64.a
#LAPACK= $(MKLPATH)/libmkl_scalapack_lp64.a
BLACS = $(MKLPATH)/libmkl_blacs_intelimpi_lp64.a
LAPACK = $(MKLPATH)/libmkl_scalapack_lp64.a
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
#OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o \
# /usr/common/software/vasp/build/5.4.1/intel/fftw3xf/libfftw3xf_intel.a
INCS = -I$(MKLROOT)/include/fftw
#LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) -ldl -mkl=cluster
#LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) -ldl -mkl
#LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) -ldl -mkl
#LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) -ldl
LLIBS = -Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a \
        $(MKLROOT)/lib/intel64/libmkl_scalapack_lp64.a $(MKLROOT)/lib/intel64/libmkl_blacs_intelmpi_lp64.a \
        $(MKLROOT)/lib/intel64/libmkl_sequential.a $(MKLROOT)/lib/intel64/libmkl_core.a -Wl,--end-group
OBJECTS_O1 += fft3dfurth.o fftmpi_map.o fftmpi.o
#OBJECTS_O1 += fft3dfurth.o fftmpi_map.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o lapack_double.o getshmem.o
# For the parser library
CXX_PARS = icpc
LIBS += parser
LLIBS += -Lparser -lparser -lstdc++
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
#================================================
# GPU Stuff
CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28 -UscaLAPACK
OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o
CC = icc
CXX = icpc
CFLAGS = -fPIC -DADD_ -Wall -openmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS
CUDA_ROOT ?= /usr/local/cuda/
NVCC := $(CUDA_ROOT)/bin/nvcc -ccbin=icc
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas
GENCODE_ARCH := -gencode=arch=compute_30,code=\"sm_30,compute_30\" \
-gencode=arch=compute_35,code=\"sm_35,compute_35\" \
-gencode=arch=compute_60,code=\"sm_60,compute_60\"
MPI_INC = $(I_MPI_ROOT)/include64/
------
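To use it, something like the following should work, assuming the standard 5.4.1 source layout; ftn resolves to the Intel compiler under the Cray PrgEnv-intel environment:
------
cp makefile.include vasp.5.4.1/
cd vasp.5.4.1
make std gam ncl      # standard, gamma-only, and non-collinear binaries
------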
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Thank you so much. We will certainly use this.
Thanks
Sid