Debugging stuck jobs with the LBFGS solver
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Debugging stuck jobs with the LBFGS solver
Hi VTST community,
First and foremost, thank you, Professor Henkelman, and all the associated members for making such a valuable tool and maintaining such a lively discussion forum. My problem concerns jobs getting stuck while using the LBFGS solver. As suggested in many posts, I first use a first-order method for my NEB (quick-min or FIRE) and then, once the forces are sufficiently low, I switch to LBFGS with a very tight EDIFF cutoff (1e-7).
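(For reference, the two stages look roughly like this in the INCAR; a minimal sketch using the VTST IOPT tags, where everything except the EDIFF = 1e-7 cutoff is illustrative rather than taken from this particular job.)
------
# INCAR, stage 1: first-order optimizer (IOPT = 7 is FIRE, IOPT = 3 is quick-min)
IBRION = 3   ! hand the ionic steps over to the VTST optimizers
POTIM  = 0   ! required when IOPT > 0
IOPT   = 7
EDIFF  = 1E-5

# INCAR, stage 2: restart the pre-converged images with LBFGS
IOPT   = 1   ! LBFGS
EDIFF  = 1E-7
------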
I have had a lot of success with this. Lately, however, some jobs have gotten stuck at random during the LBFGS runs and I cannot understand why. At first I thought something was wrong with our platform, so I ran another job on one of the NERSC systems, where the same thing happened. There is no pattern to it in terms of the number of steps taken, elapsed time, etc. I also tried running from our scratch file system, thinking this could be a disk I/O problem, but the job hung there as well. I was wondering if someone could suggest a way to find out what is happening while the job is stuck. The interesting part is that it doesn't kill itself; it just keeps going until I kill it manually or the walltime is reached. I am using VASP 5.4.1 with the VTST build.
Please help me out with this.
Thank you
Siddharth
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Hi,
I observed something more that might help debug the issue more quickly. All stuck jobs hang at the end of a completed electronic cycle (I am sorry I said there was no pattern before :(). The OUTCAR shows information up to the point after which the LBFGS output should appear. I have attached an example OUTCAR. I am also doing two things at this point: running with a different version of VASP (5.3.5), and I have set NWRITE = 3 (earlier it was 1). Let me know what else I can do to debug this issue. Looking forward to your reply :)
Thanks
Sid
- Attachments
- OUTCAR_stuck.zip
- (1.08 MiB) Downloaded 5220 times
Re: Debugging stuck jobs with the LBFGS solver
This appears to be the OUTCAR from a single image of an NEB calculation. It would help to see the entire calculation: if one image is stuck, all of the images will wait until it finishes. A .tar.gz file of the whole calculation would help to debug the problem.
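A quick way to see which image is lagging, assuming the standard 00/01/.../NN image directories, is to compare the last reported step of each intermediate image:
------
# print the most recent OSZICAR line of every moving image (01..09; adjust glob)
for d in 0[1-9]; do echo "image $d:"; tail -n 1 $d/OSZICAR; done
------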
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Hi Professor Henkelman,
Attached is the .tar.gz file of the job.
Thanks
Sid
- Attachments
- stuck_job.tar.gz
- (7.13 MiB) Downloaded 5359 times
Re: Debugging stuck jobs with the LBFGS solver
It looks like image 02 was still working. Is it possible that the calculation was terminated by the queueing system?
I don't see any problem with the calculation, except that the job ended.
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Hi Professor Henkelman,
The calculation was not terminated by the queueing system. It was terminated almost an hour after the OUTCARs were last written (it was doing an electronic step every minute or so while running). Also, when the calculation terminated, the output (.o) file that got generated carried a timestamp from when the OUTCARs were last updated, i.e. about an hour older. I observe this happening in a lot of my jobs, at varied intervals. I ran some calculations with NWRITE = 3, but they also gave no new information. It is a little harder to post those here; the OUTCARs for those runs are just too big. Also, as far as I have seen, the quick-min runs do not face this issue; they generally run through the entire walltime. Do you think this might have something to do with the way our VASP is compiled? Kindly let me know what you think and what can be done :).
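(A check that distinguishes "hung" from "running but not flushing output", sketched under the assumption that the binary is called vasp_std and the image directories match 0*: watch whether CPU time keeps accumulating while the OUTCAR timestamps stay frozen.)
------
# binary name (vasp_std) and image glob (0*) are assumptions for this sketch
ps -o pid,etime,time,stat -C vasp_std   # TIME still growing => ranks are busy
ls -l --time-style=full-iso 0*/OUTCAR   # frozen mtimes => output not being written
------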
Thank you so much for your help.
Sid
Re: Debugging stuck jobs with the LBFGS solver
I don't understand what is going wrong with your calculation. The OUTCARs that you have indicate that the calculation is still in the electronic structure convergence and not in the VTST code. As you say, though, if the calculation ran further without showing the output, then we don't know where it stopped.
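One generic way to find out, assuming you can get onto a compute node running the job and gdb is available there, is to attach to one of the vasp processes and dump its stack:
------
# the PID and binary name (vasp_std) are assumptions; find them with pidof/ps
pidof vasp_std                                  # list the ranks on this node
gdb -p <PID> -batch -ex "thread apply all bt"   # dump all stacks, then detach
# strace -p <PID> also shows whether the rank is blocked in I/O or polling MPI
------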
I tried running the same calculation on one of our machines. You can see in the attached file that the calculation converged in just 14 ionic iterations.
- Attachments
- stuck_job.tar.gz
- (4.78 MiB) Downloaded 5395 times
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Hi Professor Henkelman,
Thank you so much for helping me out. I am getting more and more confident that this is an issue with the way we have built VASP with VTST on our systems. I tried the vasp-tpc build on NERSC and it has not given me problems for the 6-7 NEB runs I have tried on it so far (only the vasp-vtst build gave me issues). We will take another look at the way we build it on our end. Do you have any suggestions on the compilers we should use, or any patches we should apply, so that the build is robust? Kindly let me know what you think.
Thanks
Sid
Re: Debugging stuck jobs with the LBFGS solver
You can try running my binaries at NERSC. I think anyone can access my ~graeme/bin/vasp_??? binaries. I will also paste my makefile.include below:
------
# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxIFC\"\
-DMPI -DMPI_BLOCK=8000 \
-Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Duse_bse_te \
-Dtbdyn \
-Duse_shmem \
-DMKL_ILP64
CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC = ftn -v
FCL = ftn -v -mkl=sequential -lstdc++
FREE = -free -names lowercase
FFLAGS = -assume byterecl
#OFLAG = -fast -no-ipo
OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0
#BLAS= /usr/local/lib/libgoto2_barcelona-r1.13.a
#LAPACK= #lib/lapack_double.o lib/linpack_double.o
#SCALAPACK = /usr/local/lib/libscalapack.a $(BLACS)
#MKLPATH = /opt/intel/parallel_studio_xe_2016.0.047/compilers_and_libraries_2016.0.109/linux/mkl/lib/intel64
MKLPATH = $(MKLROOT)/lib/intel64
#BLAS= -Wl,--start-group -L$(MKLPATH) $(MKLPATH)/libmkl_intel_lp64.a $(MKLPATH)/libmkl_sequential.a $(MKLPATH)/libmkl_core.a -Wl,--end-group -lpthread
BLAS=
#LAPACK= $(MKLPATH)/libmkl_lapack95_lp64.a
#LAPACK= $(MKLPATH)/libmkl_scalapack_lp64.a
BLACS = $(MKLPATH)/libmkl_blacs_intelmpi_lp64.a
LAPACK = $(MKLPATH)/libmkl_scalapack_lp64.a
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
#OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o \
# /usr/common/software/vasp/build/5.4.1/intel/fftw3xf/libfftw3xf_intel.a
INCS = -I$(MKLROOT)/include/fftw
#LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) -ldl -mkl=cluster
#LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) -ldl -mkl
#LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) -ldl -mkl
#LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS) -ldl
LLIBS = -Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a \
$(MKLROOT)/lib/intel64/libmkl_scalapack_lp64.a $(MKLROOT)/lib/intel64/libmkl_blacs_intelmpi_lp64.a \
$(MKLROOT)/lib/intel64/libmkl_sequential.a $(MKLROOT)/lib/intel64/libmkl_core.a -Wl,--end-group
OBJECTS_O1 += fft3dfurth.o fftmpi_map.o fftmpi.o
#OBJECTS_O1 += fft3dfurth.o fftmpi_map.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o lapack_double.o getshmem.o
# For the parser library
CXX_PARS = icpc
LIBS += parser
LLIBS += -Lparser -lparser -lstdc++
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
#================================================
# GPU Stuff
CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28 -UscaLAPACK
OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o
CC = icc
CXX = icpc
CFLAGS = -fPIC -DADD_ -Wall -openmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS
CUDA_ROOT ?= /usr/local/cuda/
NVCC := $(CUDA_ROOT)/bin/nvcc -ccbin=icc
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas
GENCODE_ARCH := -gencode=arch=compute_30,code=\"sm_30,compute_30\" \
-gencode=arch=compute_35,code=\"sm_35,compute_35\" \
-gencode=arch=compute_60,code=\"sm_60,compute_60\"
MPI_INC = $(I_MPI_ROOT)/include64/
------
- Posts: 9
- Joined: Tue May 16, 2017 4:09 pm
Re: Debugging stuck jobs with the LBFGS solver
Thank you so much. We will certainly use this.
Thanks
Sid