Checkpoints?

eOn code for long time scale dynamics

michaelCG*
Posts: 12
Joined: Tue Mar 22, 2011 7:24 pm

Checkpoints?

Post by michaelCG* »

Maybe I've done something wrong, but I believe the checkpoints are not working. In the past there was no need for them because the workunits did not take much time. But I don't like to waste CPU time because after a break the workunit starts from the beginning.

Please have a look at the checkpoint feature and USE it.
chill
Posts: 96
Joined: Tue Jul 28, 2009 9:04 pm

Re: Checkpoints?

Post by chill »

There is nothing wrong with your setup. We do not currently have support for checkpointing in our code.
geosteve
Posts: 1
Joined: Mon Jul 11, 2011 7:03 pm

Re: Checkpoints?

Post by geosteve »

michaelCG* wrote: But I don't like to waste CPU time because after a break the workunit starts from the beginning.
I totally agree. I have a PC that runs 4-5 hours a day, and without checkpoints working on eOn is impossible. Also, the computer I use more frequently may be stopped sometimes, and then hours of work are lost.
chill
Posts: 96
Joined: Tue Jul 28, 2009 9:04 pm

Re: Checkpoints?

Post by chill »

Currently it is not useful for us to receive workunits that are older than a day, and most of our WUs can be finished in an hour or two. Thus implementing checkpointing would only allow users to waste their CPU time on WUs that are no longer needed.

Also, is there really a need to turn off computers? Between suspending to RAM and to disk, my laptop has an uptime of 70 days. Or is checkpointing needed for more cases than when the OS has been rebooted?
michaelCG*
Posts: 12
Joined: Tue Mar 22, 2011 7:24 pm

Re: Checkpoints?

Post by michaelCG* »

There are certainly cases in which a system cannot run 24/7.
There are cases where someone turns off the system to stop working, gets some sleep, and turns it on a few hours later (but in less than 24 hours). So do I have to suspend early enough to let a workunit finish to 100% before I can leave my office? Or do I have to accept that I have to crunch the same 99% workunit again when I start my computer in the office?
The reason for checkpointing is to avoid falling back to the very beginning of a calculation. Okay, maybe you do not see the necessity because your usage may be totally different, and in the past your workunits were small enough that one could wait until they were through.
Take a look around and tell me WHY other projects DO HAVE working checkpointing?!?

Think of failures like:
- Power Loss
- Software Bugs
- for Windows users, even a BSOD

But please let me know why you do not want to implement 1, 2, 3 or more little checkpoints.
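
For illustration, a few small checkpoints of the kind asked for above could look roughly like the sketch below. It is not eOn's or BOINC's code: the function names (`save_checkpoint`, `load_checkpoint`, `run_work_unit`), the JSON file format, and the toy accumulator loop are all hypothetical. The state is written to a temporary file and then renamed into place, so a power loss or crash in the middle of a write should leave the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path via a temp file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # old checkpoint survives until this succeeds
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


def run_work_unit(total_steps, checkpoint_path, checkpoint_every=100, stop_after=None):
    """Toy work unit (hypothetical): sums step indices, saving progress periodically.

    stop_after simulates an interruption after that many steps in this run.
    """
    state = load_checkpoint(checkpoint_path) or {"step": 0, "accumulator": 0}
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return state  # interrupted: only the last saved checkpoint is on disk
        state["accumulator"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, state)
    save_checkpoint(checkpoint_path, state)
    return state
```

Calling `run_work_unit` again with the same checkpoint path continues from the last saved step rather than from step 0, so at most `checkpoint_every` steps are repeated after an interruption.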
graeme
Site Admin
Posts: 2256
Joined: Tue Apr 26, 2005 4:25 am

Re: Checkpoints?

Post by graeme »

The issue here is related to the types of calculations that we are doing.

In some projects, there is a large amount of computational work that can be done in any order. With Seti, for example, the signal from some galaxy can be analyzed at any point by any number of different people. There are lots of other signals from other places in the sky that other computers can work on at the same time.

Our project is much more synchronized because we are modeling the dynamics of atomic systems through time. Basically, everyone is working to discover reaction mechanisms that are available given the current state of the system. Using statistical mechanics, we can evaluate the likelihood of each reaction pathway so that we can advance the simulation through time. As time evolves, however, results from a point in the past are not important. So we have to strike a balance between short work units in which everyone reports results frequently and long work units which minimize network traffic but also where slower machines may not report back. We could certainly implement checkpointing but for any significant delay, it would be better for the machine to get a new and current work unit.

There may be other algorithmic strategies which are better in terms of fault tolerance for distributed computing that we could try in the future. But at the moment, we are aiming to have a stable system for our current calculations. Hopefully by adjusting the work unit size, we can find an acceptable balance between rapid reporting and minimal lost work.
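
The balance described above can be put as a simple rule, sketched here under stated assumptions: a result is only useful for about a day after the work unit was issued (the figure mentioned earlier in this thread), and the client can estimate its remaining run time. The function name `should_resume` and its parameters are hypothetical and not part of eOn or BOINC.

```python
from datetime import datetime, timedelta

# Assumption taken from the thread: results older than about a day are not useful.
RESULT_USEFUL_FOR = timedelta(days=1)


def should_resume(issued_at, now, fraction_done, estimated_total_runtime):
    """Return True if resuming from a checkpoint would still finish in time.

    issued_at: when the server issued the work unit
    now: current time on the client
    fraction_done: progress stored in the checkpoint, between 0.0 and 1.0
    estimated_total_runtime: timedelta for a full run on this machine
    """
    remaining = estimated_total_runtime * (1.0 - fraction_done)
    expected_finish = now + remaining
    return expected_finish <= issued_at + RESULT_USEFUL_FOR


issued = datetime(2011, 8, 1, 9, 0)
# Machine switched off overnight and restarted 8 hours after issue, 99% done:
print(should_resume(issued, issued + timedelta(hours=8), 0.99, timedelta(hours=2)))
# Same checkpoint, but the machine comes back 30 hours after issue:
print(should_resume(issued, issued + timedelta(hours=30), 0.99, timedelta(hours=2)))
```

Under this rule a checkpoint only pays off when the interruption is short compared with the window in which the result is still needed; otherwise fetching a current work unit is the better use of the machine.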
Jwb52z
Posts: 4
Joined: Thu Jul 28, 2011 5:39 pm

Re: Checkpoints?

Post by Jwb52z »

Unless the 3.09 version's VERY long WU times are an error, I would say they are a reason to have checkpointing.
chill
Posts: 96
Joined: Tue Jul 28, 2009 9:04 pm

Re: Checkpoints?

Post by chill »

Jwb52z wrote: Unless the 3.09 version's VERY long WU times are an error, I would say they are a reason to have checkpointing.
It's not an error, but it is unexpected. There is no reason we need them to be this long. In about 24 hours there should be shorter work units. I will let the current long ones finish and resume with shorter ones.