[gmx-users] crystal waters crash parallel but not serial

chris.neale at utoronto.ca chris.neale at utoronto.ca
Mon Sep 25 15:24:32 CEST 2006


My run crashes in parallel but not in serial. I have narrowed it down
to the inclusion of crystal waters, but I can't imagine what problem
would occur only in parallel. I ran three replicas of each setup to be
sure that it is not a fluke. It's easy enough to avoid parallel runs
for now, but I am worried about my system. My question is basically:
does anybody know whether something like this would only happen if a
system is otherwise incorrectly built, or could it be a quirk of
multiple water groups / ordering of groups / pressure coupling /
something else across multiple processors?

My system is an OPLS-AA protein in a POPE/DMPE membrane with TIP4P
waters, run in single precision. I am using the double-pairlist
inclusion method to scale the lipid 1-4 interactions. It runs fine
through the initial protein heavy-atom position restraints (0.5 ns)
and then fine again with no topological restraints (3.5 ns).

When I repeat this setup procedure, this time also with the crystal  
waters as moleculetype xtip4p.itp/residue XSOL so that I can restrain  
their positions as well, it runs fine for 0.5ns.

However, if I try to repeat this in parallel, it crashes after ~40 ps.
I can extend this time by increasing the number of EM steps or by not
restraining the crystal waters (worrisome), but it still eventually
crashes. I have repeated this 3 times in parallel on 4 processors and
3 times on a single processor. I also did one additional replica in
double precision, which crashed in parallel (after 8 ps) and ran fine
for 500 ps in serial.

All of my crystal waters and my protein (the only things that are ever
restrained) are located on the first processor. I don't use shuffle
since I use position restraints. I have ordered the topology as
protein, crystal water, membrane, bulk water (with the includes in the
same order) and have temperature coupling groups that are 1) protein,
2) lipid, 3) crystal + bulk water.

I use LINCS.

When the run crashes in parallel, the frame prior to the crash shows a
single exploding water molecule (a bulk water, not a crystal water).
However, I could only see this in 2 of the crashes (the others gave me
a silent compute-forever death, even with kill -HUP, so I didn't get
to see the frame), and given the large number of bulk waters, it might
just be by chance that it was a bulk water.
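
To identify which water is exploding in a given frame, something like
the following Python sketch could be used, assuming the frame has been
written out as a .gro file. It flags waters whose O-H distance exceeds
a cutoff. The atom names OW/HW1/HW2, the residue names SOL and XSOL,
and the 0.15 nm cutoff are assumptions for illustration, and no
periodic-boundary handling is done (molecules are assumed whole).

```python
import math
from itertools import groupby


def parse_gro_atoms(text):
    """Parse atom records from the text of a fixed-column .gro file."""
    lines = text.splitlines()
    n_atoms = int(lines[1])  # second line holds the atom count
    atoms = []
    for line in lines[2:2 + n_atoms]:
        resnr = int(line[0:5])
        resname = line[5:10].strip()
        atomname = line[10:15].strip()
        xyz = (float(line[20:28]), float(line[28:36]), float(line[36:44]))
        atoms.append((resnr, resname, atomname, xyz))
    return atoms


def distorted_waters(atoms, water_names=("SOL", "XSOL"), max_oh=0.15):
    """Return (resnr, resname, longest O-H in nm) for stretched waters.

    ASSUMPTIONS: water atoms are named OW, HW1, HW2; molecules are
    whole (no PBC correction); max_oh is an arbitrary cutoff.
    """
    flagged = []
    # Group consecutive atoms, since residue numbers can wrap in .gro files.
    for (resnr, resname), members in groupby(atoms, key=lambda a: (a[0], a[1])):
        if resname not in water_names:
            continue
        site = {atomname: xyz for _, _, atomname, xyz in members}
        if not all(name in site for name in ("OW", "HW1", "HW2")):
            continue
        longest = max(math.dist(site["OW"], site[h]) for h in ("HW1", "HW2"))
        if longest > max_oh:
            flagged.append((resnr, resname, longest))
    return flagged
```

Running distorted_waters(parse_gro_atoms(open("frame.gro").read()))
on the last frame before each crash (frame.gro is a placeholder name)
would show whether the distorted molecule is consistently a bulk water.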

Thanks for any comments.
Chris.