[gmx-users] The mdrun -maxh option does not work with REMD simulations
Maud Jusot
Maud.Jusot at impmc.upmc.fr
Tue Apr 5 16:51:18 CEST 2016
Hello again
I still can't restart my REMD simulation correctly (see my previous
message below), but I can add some details. After a restart, GROMACS
doesn't write new checkpoint files (whatever value I give to -cpt), and
it doesn't stop at the -maxh time. I tried three different GROMACS
versions (4.6.5, 5.1.0 and 5.1.2) on two different clusters, so I am
fairly sure the problem comes neither from the installation nor from
the version.
It's a big issue for me because I'm trying to run a 2.4 microsecond
simulation, and on the cluster I use I can't run for more than 24 h
(which corresponds to approximately 300 ns) without restarting. So
without checkpoint files I am unable to relaunch my simulation.
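As a stopgap, the self-resubmitting job script quoted below could refuse to call sbatch again unless every replica actually rewrote its checkpoint, so that a job killed before checkpointing does not silently restart from the same stale state over and over. This is a minimal POSIX-shell sketch, not part of GROMACS. The helper name checkpoints_updated is hypothetical, and the file names assume that mdrun -multi appends the replica index to the -deffnm name (mdA_0.cpt ... mdA_7.cpt); please check that against the files your run actually produces.

```shell
#!/bin/sh
# Hypothetical helper (not a GROMACS tool): succeed only if every replica
# checkpoint PREFIX0.cpt .. PREFIX<N-1>.cpt exists and is newer than STAMP.
# Assumption: "mdrun -multi N -deffnm PREFIX" names the replica checkpoints
# PREFIX0.cpt, PREFIX1.cpt, ... (replica index appended to the name).
checkpoints_updated() {
    local stamp="$1"    # marker file created just before mdrun was started
    local prefix="$2"   # name given to -deffnm, e.g. mdA_
    local nrep="$3"     # number of replicas given to -multi, e.g. 8
    local status=0
    local i=0
    while [ "$i" -lt "$nrep" ]; do
        local cpt="${prefix}${i}.cpt"
        # Missing file, or modification time not later than the marker:
        # this replica did not write a new checkpoint during the job.
        if [ ! -e "$cpt" ] || [ ! "$cpt" -nt "$stamp" ]; then
            echo "checkpoint $cpt is missing or was not rewritten" >&2
            status=1
        fi
        i=$((i + 1))
    done
    return "$status"
}
```

In myJob.slurm, the marker could be created with `stamp=$(mktemp ./stamp.XXXXXX)` just before the srun line (in the run directory, so both timestamps come from the same filesystem), and the unconditional `sbatch myJob.slurm` replaced by `checkpoints_updated "$stamp" mdA_ 8 && sbatch myJob.slurm`.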
Is there something wrong in what I'm doing?
Does anybody have an idea, or do you think it's a bug and I should
write to the developers' mailing list?
Thanks,
Maud
On 29/03/2016 11:33, Maud Jusot wrote:
> Dear Gromacs users,
>
> I tried to run a REMD simulation with GROMACS 5.1 that is relaunched
> every hour (in a queuing system) with the -maxh option.
> The first time it was launched, it worked: the run stopped at the maxh
> time, and it was relaunched from the checkpoint files and continued
> the simulation. But during this second run, when the maxh time was
> reached (step 1981062), GROMACS said it was going to stop, but it
> did not stop until the system killed the job (step 2545600).
>
> I tried different maxh times (1/0.95/0.20 hour) to be sure that
> the margin between maxh and the cluster time limit was sufficient, but
> in every case the second run continued until it reached one hour and
> was killed by the system.
>
> I find it very strange that it works the first time and that the
> second time GROMACS says it has to stop but does not.
> Moreover, I tried the same thing with a classical simulation
> (without REMD) and this time there was no problem.
> Did I forget an option or something like that to make maxh
> compatible with REMD?
> I searched the web and the mailing list but did not find any
> reported problems between maxh and REMD.
>
> Do you have any idea of what the problem is ?
>
> Here are the command lines in my script myJob.slurm:
> ---------------------
> srun --mpi=pmi2 -K1 --resv-ports -n $SLURM_NTASKS mdrun_mpi -ntomp 1 \
>     -multi 8 -replex 500 -maxh 0.2 -deffnm mdA_ -cpi mdA_.cpt \
>     -cpo mdA_.cpt -v 2>> remdA.log
> # resubmit the same job at the end for a long run:
> sbatch myJob.slurm
> ---------------------
>
> Here is part of my remdA.log file:
> ---------------------
> starting mdrun 'myPeptide'
> starting mdrun 'myPeptide'
> 120000000 steps, 240000.0 ps (continuing from step 655701, 1311.4 ps).
> starting mdrun 'myPeptide'
> 120000000 steps, 240000.0 ps (continuing from step 655701, 1311.4 ps).
> starting mdrun 'myPeptide'
> 120000000 steps, 240000.0 ps (continuing from step 655701, 1311.4 ps).
> starting mdrun 'myPeptide'
> 120000000 steps, 240000.0 ps (continuing from step 655701, 1311.4 ps).
> starting mdrun 'myPeptide'
> 120000000 steps, 240000.0 ps (continuing from step 655701, 1311.4 ps).
> starting mdrun 'myPeptide'
> 120000000 steps, 240000.0 ps (continuing from step 655701, 1311.4 ps).
> starting mdrun 'myPeptide'
> 120000000 steps, 240000.0 ps (continuing from step 655701, 1311.4 ps).
> 120000000 steps, 240000.0 ps (continuing from step 655701, 1311.4 ps).
>
> Step 1981061: Run time exceeded 0.198 hours, will terminate the run
>
> Step 1981062: Run time exceeded 0.198 hours, will terminate the run
>
> Step 1981062: Run time exceeded 0.198 hours, will terminate the run
>
> Step 1981062: Run time exceeded 0.198 hours, will terminate the run
>
> Step 1981062: Run time exceeded 0.198 hours, will terminate the run
>
> Step 1981062: Run time exceeded 0.198 hours, will terminate the run
>
> Step 1981062: Run time exceeded 0.198 hours, will terminate the run
>
> Step 1981062: Run time exceeded 0.198 hours, will terminate the run
>
> step 1981100, will finish Sat Mar 26 11:14:40 2016
> step 1981200, will finish Sat Mar 26 11:14:40 2016
> ...
> step 2545600, will finish Sat Mar 26 11:15:49 2016
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>
> Received the TERM signal, stopping at the next NS step
>
> Received the TERM signal, stopping at the next NS step
>
> Received the TERM signal, stopping at the next NS step
>
> Received the TERM signal, stopping at the next NS step
>
> Received the TERM signal, stopping at the next NS step
>
> Received the TERM signal, stopping at the next NS step
>
> Received the TERM signal, stopping at the next NS step
>
> Received the TERM signal, stopping at the next NS step
> ---------------------
>
> Thanks a lot,
>
> Maud