about LANCZOS
Posted: Sun Jun 25, 2006 2:19 am
by hakuna
Dear all:
How can we find the coordinates of the saddle point when running a LANCZOS job?
I use the INCAR posted by Graeme, and VASP writes lanczos.out,
but I find the coordinates in it are just the initial guess of the saddle point. How can we grep the coordinates of the actual saddle point, and which is the relevant energy of the saddle point?
Below is the lanczos.out from my test.
Is it the right form for the file?
--------------
Echo control variables:
Maximum size of Lanczos matrix ... nl = 20
Finite difference step length ... dR = 1.0E-03
Tolerance for eigenvalue convergence ... ltol = 1.0E-02
Time step length for Quick-Min ... dt = 1.0E-01
Maximum total movement in one step ... maxmove = 5.0E-01
Use conjugate gradients when minimizing ... ifcg = T
eig Step# Iteration Eig EigOld |(Eig-EigOld)/EigOld|
eig -------------------------------------------------------------------------
conv Step# Iteration Energy Max|Feff| Eig
conv ----------------------------------------------------------
Point 1:
------------------------------------------------------
Coordinates:
Coo 1 3.599235105968 1.177390351737 11.128122180684
Coo 1 4.847356847474 2.685392462765 11.220254737064
Coo 1 3.980737579995 2.543432344092 11.620331212508
Coo 1 2.937990744838 0.139541814280 11.469089733710
Coo 1 0.000000000000 0.000000000000 13.697057786255
Coo 1 4.193850319217 2.421320610741 13.697057786255
Coo 1 1.397950106406 0.807106870247 15.979900750631
Coo 1 5.591800425623 3.228427480988 15.979900750631
Coo 1 1.397950106406 2.421320610741 13.697057786255
Coo 1 2.795900212812 0.000000000000 13.697057786255
Coo 1 2.795900212812 3.228427480988 15.979900750631
Coo 1 4.193850319217 0.807106870247 15.979900750631
Forces:
For 1 -1.060828574239E+00 4.846689803034E-02 2.109707509045E+00
For 1 6.172523257602E-01 3.922886296382E-01 -1.827286099419E-01
For 1 -9.054839668730E-01 -2.186792571370E+00 -7.558101448973E-01
For 1 1.307389797615E+00 1.692044751869E+00 -1.022975858908E+00
For 1 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
For 1 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
For 1 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
For 1 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
For 1 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
For 1 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
For 1 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
For 1 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
---------
Thank you for the reply! Cheers for the World Cup!
Posted: Sun Jun 25, 2006 6:48 pm
by andri
The final coordinates are of course given in the CONTCAR, XDATCAR and OUTCAR and the energy is obtained by <grep "energy without entropy" OUTCAR> as always.
What should happen though is that the current coordinates at each step are written to the lanczos.out file as the run progresses. Does that not happen?
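For example, a minimal shell sketch (assuming a standard single-image run in the current directory; the saddle.POSCAR name is just illustrative):
--------------
# energy of the last ionic step
grep "energy without entropy" OUTCAR | tail -1

# the final geometry is simply the CONTCAR
cp CONTCAR saddle.POSCAR
--------------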
Posted: Mon Jun 26, 2006 12:34 am
by hakuna
[quote="andri"]The final coordinates are of course given in the CONTCAR, XDATCAR and OUTCAR and the energy is obtained by <grep "energy without entropy" OUTCAR> as always.
What should happen though is that the current coordinates at each step are written to the lanczos.out file as the run progresses. Does that not happen?[/quote]
Yes, I pasted the whole lanczos.out; there seem to be no further ionic steps, so I think something may be wrong.
Posted: Mon Jun 26, 2006 7:52 am
by andri
Clearly the job has not converged to a saddle point since the forces are not particularly small. Did the job finish normally (perhaps NSW=1), or did it simply die or hang after one step? If the latter, then you need to look into how the executable was compiled. Given the great variety of architectures and system builds, we can usually offer only very limited assistance with that. The code has been successfully compiled and run on Intel (Pentium 4, Xeon, Itanium), Mac G4/G5, IBM AIX and AMD processors that I know of. Only if you are certain of the integrity of your build (including MPI) will we look into possible bugs in the code.
Posted: Mon Jun 26, 2006 11:13 am
by hakuna
Yes, the job really does die after only one step.
I run the job on an AMD Opteron (dual core, two CPUs) with MPICH; the CNEB runs very well (using the same compiled program as LANCZOS).
We compiled the program using PGI Fortran and ACML (the Intel Fortran compiled program suffers a memory leak on our machine).
The screen output of VASP was redirected to a file; it reads:
running on 4 nodes
distr: one band on 1 nodes, 4 groups
vasp.4.6.26 15Dec04 complex
POSCAR found : 4 types and 12 ions
LDA part: xc-table for Ceperly-Alder, standard interpolation
POSCAR, INCAR and KPOINTS ok, starting setup
WARNING: wrap around errors must be expected
FFT: planning ... 1
reading WAVECAR
entering main loop
N E dE d eps ncg rms rms(c)
DAV: 1 0.738737703242E+03 0.73874E+03 -0.33010E+04 1568 0.104E+03
DAV: 2 0.290414395651E+02 -0.70970E+03 -0.64956E+03 1816 0.347E+02
DAV: 3 -0.561727182491E+02 -0.85214E+02 -0.79792E+02 1828 0.115E+02
DAV: 4 -0.592467680940E+02 -0.30740E+01 -0.30149E+01 1752 0.251E+01
DAV: 5 -0.593233801033E+02 -0.76612E-01 -0.76525E-01 1848 0.357E+00 0.118E+01
DAV: 6 -0.558903271726E+02 0.34331E+01 -0.15981E+01 2216 0.254E+01 0.557E+00
DAV: 7 -0.558987275925E+02 -0.84004E-02 -0.37549E+00 2180 0.211E+01 0.569E+00
DAV: 8 -0.556109202535E+02 0.28781E+00 -0.20856E+00 2280 0.144E+01 0.295E+00
DAV: 9 -0.557014440688E+02 -0.90524E-01 -0.12743E+00 2272 0.104E+01 0.409E+00
DAV: 10 -0.555200896828E+02 0.18135E+00 -0.13353E+00 2380 0.907E+00 0.188E+00
DAV: 11 -0.554994254231E+02 0.20664E-01 -0.28060E-01 2288 0.524E+00 0.135E+00
DAV: 12 -0.554717644829E+02 0.27661E-01 -0.60142E-02 2384 0.270E+00 0.287E-01
DAV: 13 -0.554682944009E+02 0.34701E-02 -0.98451E-03 2084 0.694E-01 0.191E-01
DAV: 14 -0.554670846543E+02 0.12097E-02 -0.51923E-03 2192 0.659E-01 0.235E-01
DAV: 15 -0.554622039055E+02 0.48807E-02 -0.46472E-03 2140 0.556E-01 0.930E-02
DAV: 16 -0.554598228933E+02 0.23810E-02 -0.17989E-03 2036 0.349E-01 0.121E-01
DAV: 17 -0.554588124161E+02 0.10105E-02 -0.10216E-03 2196 0.319E-01 0.551E-02
DAV: 18 -0.554583211311E+02 0.49128E-03 -0.95748E-04 1632 0.165E-01 0.372E-02
DAV: 19 -0.554593047251E+02 -0.98359E-03 -0.21272E-04 1636 0.126E-01 0.383E-02
DAV: 20 -0.554601542902E+02 -0.84957E-03 -0.19826E-04 1984 0.114E-01 0.170E-02
DAV: 21 -0.554615273746E+02 -0.13731E-02 -0.17271E-04 1672 0.571E-02 0.143E-02
DAV: 22 -0.554621195046E+02 -0.59213E-03 -0.59436E-05 2132 0.486E-02 0.172E-02
DAV: 23 -0.554624682803E+02 -0.34878E-03 -0.43346E-05 1888 0.440E-02 0.472E-03
DAV: 24 -0.554626479791E+02 -0.17970E-03 -0.97265E-06 1684 0.147E-02 0.266E-03
DAV: 25 -0.554627259020E+02 -0.77923E-04 -0.88143E-06 1648 0.106E-02 0.131E-03
DAV: 26 -0.554627396923E+02 -0.13790E-04 -0.27609E-06 1736 0.515E-03 0.741E-04
DAV: 27 -0.554627433390E+02 -0.36468E-05 -0.11241E-06 1732 0.322E-03 0.441E-04
DAV: 28 -0.554627442295E+02 -0.89042E-06 -0.37696E-07 1780 0.188E-03 0.243E-04
DAV: 29 -0.554627442812E+02 -0.51726E-07 -0.70126E-08 1764 0.100E-03 0.215E-04
DAV: 30 -0.554627442657E+02 0.15456E-07 -0.98163E-09 1168 0.473E-04 0.163E-04
DAV: 31 -0.554627441658E+02 0.99967E-07 -0.95527E-09 1112 0.458E-04 0.102E-04
DAV: 32 -0.554627441207E+02 0.45093E-07 -0.19908E-09 1084 0.236E-04 0.648E-05
DAV: 33 -0.554627441118E+02 0.88894E-08 -0.64752E-10 1024 0.136E-04
1 F= -.55462744E+02 E0= -.55458802E+02 d E =-.554627E+02
quench: g(F)= 0.000E+00 g(S)= 0.000E+00 dE (1.order)= 0.000E+00
p0_25988: p4_error: net_recv read: probable EOF on socket: 1
p1_25992: p4_error: interrupt SIGSEGV: 11
rm_l_1_25997: (1804.054688) net_send: could not write to fd=5, errno = 32
p3_26008: p4_error: interrupt SIGSEGV: 11
rm_l_3_26013: (1804.015625) net_send: could not write to fd=5, errno = 32
rm_l_2_26006: (1804.035156) net_send: could not write to fd=5, errno = 32
"2layerneb.out" 53L, 3998C
Posted: Mon Jun 26, 2006 3:43 pm
by graeme
Do you have the IMAGES tag set in the INCAR file?
Can you try running this on the 2 processors in a single machine? It would be helpful to know if it's dying because of an MPI communication problem across the machines or if it's something else.
I have a cluster with the same setup you describe, so I could also try running this to see if I can reproduce the error.
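For the two-processor test, something like this should do (just a sketch; the launcher and binary name depend on your MPICH install and your build):
--------------
mpirun -np 2 ./vasp > lanczos_2cpu.out
--------------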
Posted: Tue Jun 27, 2006 8:18 am
by hakuna
[quote="graeme"]Do you have the IMAGES tag set in the INCAR file?
Can you try running this on the 2 processors in a single machine? It would be helpful to know if it's dying because of an MPI communication problem across the machines or if it's something else.
I have a cluster with the same setup you describe, so I could also try running this to see if I can reproduce the error.[/quote]
Thank you for the reply, Graeme.
1. No IMAGES tag was set in the INCAR file. In fact, I use the INCAR from your webpage. Is the IMAGES tag necessary, as it is in the NEB method?
2. The above problem occurred when the program ran on a single machine (2 CPUs, dual core). The CNEB runs well on the same computer, so it does not seem to be an MPI communication problem. To test that, I'll try to run a serial LANCZOS job.
Posted: Wed Jun 28, 2006 6:57 am
by hakuna
Yes, the serial LANCZOS seems to run well on the same machine; it has run 36 steps so far and is still going. I also find a NEWMODECAR file in the directory, and the E0 at every ionic step seems to be converging.
Now the problem is why the parallel run of LANCZOS dies after only one step. If it were an MPI communication problem, then the CNEB should suffer from the same problem, but in fact it does not; the CNEB runs well in parallel mode.
By the way, my tests of LANCZOS and CNEB, in parallel and serial, use the same compiled program and run on the same machine.
In addition, what about g(F) and g(S)? I find both of them are zero at every step; is that right?
What do you suggest, Dr. Graeme?
Posted: Wed Jun 28, 2006 5:02 pm
by graeme
If you post a link to a tar file of your run (or email it), I can try it on a similar cluster. Include the binary, which might work directly.
I think it would be good to figure out whether this is an MPI problem, a problem with our code, or a combination. It would also be interesting to know if the problem persists using LAM, or if it only happens with MPICH.
Cc your run to Andri. He's the author of the code and might be able to see problems in the run before the MPI error.
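Something like this is enough to package it up (a sketch; the file list assumes the usual VASP inputs, a MODECAR if you use one, and your vasp binary):
--------------
tar czf lanczos_run.tar.gz INCAR POSCAR KPOINTS POTCAR MODECAR vasp
--------------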
Posted: Thu Jun 29, 2006 12:15 am
by hakuna
Thank you, Dr. Graeme.
I have sent you an email with an attachment that includes all my input files and the binary.
Posted: Thu Jun 29, 2006 4:20 am
by graeme
I can't reproduce the error on my cluster using my binary. Below is a snippet of the output going through the first iteration.
I noticed that your binary is actually linked to the Intel Math Kernel Library (MKL), and not to ACML; maybe this is related to the problem. Anyway, try relinking against ACML to see if it fixes the problem - that is my setup, and I don't get the error.
However, I wonder if there could be a problem with the eigensystem LAPACK call in the Lanczos step when using MKL. What do you think, Andri? You have used MKL with Lanczos without problems, haven't you? Maybe it is related to the 64-bit Opteron architecture or something.
-------------
DAV: 31 -0.554627441632E+02 0.81263E-07 -0.69499E-09 938 0.433E-04 0.104E-04
DAV: 32 -0.554627440963E+02 0.66880E-07 -0.37521E-09 1024 0.270E-04 0.811E-05
DAV: 33 -0.554627440912E+02 0.51468E-08 -0.73632E-10 976 0.151E-04
1 F= -.55462744E+02 E0= -.55458802E+02 d E =-.554627E+02
quench: g(F)= 0.000E+00 g(S)= 0.000E+00 dE (1.order)= 0.000E+00
bond charge predicted
N E dE d eps ncg rms rms(c)
DAV: 1 -0.554612113721E+02 -0.55461E+02 -0.19762E-03 1508 0.189E-01 0.860E-03
DAV: 2 -0.554612159630E+02 -0.45909E-05 -0.45087E-05 1696 0.335E-02 0.999E-03
DAV: 3 -0.554612160908E+02 -0.12775E-06 -0.95578E-06 2134 0.314E-02 0.102E-02
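To check which math library a binary is actually linked against, something like this should show it (a shell sketch; adjust the binary name to whatever yours is called):
--------------
ldd ./vasp | grep -i -E "mkl|acml"
--------------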
Posted: Thu Jun 29, 2006 6:26 pm
by andri
I have used MKL exclusively on various types of Intel processors (both 32- and 64-bit) and never found any problems with it.
Does the problem still persist after the code has been recompiled using the proper ACML libraries instead of MKL?