[gmx-users] Re: parallel job crash for large system

chris.neale at utoronto.ca
Tue Aug 23 01:06:07 CEST 2011


Your density seems to be about 70% of what I would expect. Are you
sure that this is not just a normal case of a poorly equilibrated
system crashing? That would match what you say about the density
growing (although perhaps it has more to do with poor equilibration
than with the mixing, as you suggest).
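If it helps, one way to follow the density over the run is with g_energy
on the energy file (file names here are placeholders):

    # extract the density time series from the energy file
    echo Density | g_energy -f ener.edr -o density.xvg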

In any event, I'd suggest simplifying your system and making it
smaller to see if you can reproduce the problem with a system that
will run quickly in serial.
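A minimal sketch of such a serial test might look like this (file and
system names are placeholders):

    # build a reduced test system and run it in serial, without mpirun,
    # to see whether the crash still reproduces
    grompp -f md.mdp -c small_box.gro -p topol.top -o test.tpr
    mdrun -deffnm test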

Chris.

On 23/08/2011 8:44 AM, Dr. Vitaly V. Chaban wrote:
> In the issue below, the barostat is set up semi-isotropically and works
> only along the "long" direction. The density of the system slowly
> grows due to mixing. If this can be useful....

Does a different barostat work?
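For example, an illustrative .mdp sketch for a Berendsen barostat with
semiisotropic coupling only along z (values are placeholders, not a
recommendation):

    pcoupl           = berendsen      ; weaker coupling, often more forgiving while equilibrating
    pcoupltype       = semiisotropic  ; two values each for ref_p and compressibility (xy, z)
    tau_p            = 1.0
    ref_p            = 1.0  1.0
    compressibility  = 0  4.5e-5      ; xy kept fixed, only the long (z) axis is coupled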

Mark

>
>
> On Mon, Aug 22, 2011 at 5:32 PM, Dr. Vitaly V. Chaban
> <vvchaban at gmail.com>  wrote:
>> We are running a system consisting of 84000 atoms in a
>> parallelepipedic box, 6x6x33 nm. The starting geometry etc. are OK and
>> the evolution of the trajectory is reasonable, but after several
>> hundred thousand steps it suddenly crashes. Mysteriously, each time it
>> crashes at a different time-step, but it always occurs. The parts of
>> this system were equilibrated separately and did not crash. The system
>> is not in equilibrium, but there are no external forces. The
>> Parrinello-Rahman barostat is turned on. The md.log does not show any
>> problems, no PDB configurations are written out before the crash,
>> there are no constraints, and the time-step is 1 fs, which is OK for
>> the separate components (in separate boxes).
>>
>> With serial GROMACS the error has not yet been observed, but given the
>> size, the run is very slow.
>>
>> What can it be? Can it somehow be connected with the very elongated box?
>>
>>
>> Stdout below:
>>
>> 50000000 steps,  50000.0 ps.
>> [exciton04:10256] *** Process received signal ***
>> [exciton04:10256] Signal: Segmentation fault (11)
>> [exciton04:10256] Signal code: Address not mapped (1)
>> [exciton04:10256] Failing at address: 0x6c0ebf10
>> [exciton04:10257] *** Process received signal ***
>> [exciton04:10257] Signal: Segmentation fault (11)
>> [exciton04:10257] Signal code: Address not mapped (1)
>> [exciton04:10257] Failing at address: 0x6378320
>> [exciton04:10253] *** Process received signal ***
>> [exciton04:10253] Signal: Segmentation fault (11)
>> [exciton04:10253] Signal code: Address not mapped (1)
>> [exciton04:10253] Failing at address: 0x1bfbe110
>> [exciton04:10253] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10253] [ 1] mdrun [0x66bb4d]
>> [exciton04:10253] *** End of error message ***
>> [exciton04:10255] *** Process received signal ***
>> [exciton04:10255] Signal: Segmentation fault (11)
>> [exciton04:10255] Signal code: Address not mapped (1)
>> [exciton04:10255] Failing at address: 0x13dd139b0
>> [exciton04:10255] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10255] [ 1] mdrun [0x66bb5e]
>> [exciton04:10255] *** End of error message ***
>> [exciton04:10256] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10256] [ 1] mdrun [0x66bb6f]
>> [exciton04:10256] *** End of error message ***
>> [exciton04:10254] *** Process received signal ***
>> [exciton04:10254] Signal: Segmentation fault (11)
>> [exciton04:10254] Signal code: Address not mapped (1)
>> [exciton04:10254] Failing at address: 0x13d2103b0
>> [exciton04:10254] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10254] [ 1] mdrun [0x66bb5e]
>> [exciton04:10254] *** End of error message ***
>> [exciton04:10257] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10257] [ 1] mdrun [0x66bb5e]
>> [exciton04:10257] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 10253 on node exciton04
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> 5 total processes killed (some possibly by mpirun during cleanup)
>>
>>
>>
>> The GROMACS version is 4.0.7, used with OpenMPI.
>>




