[gmx-users] Re: parallel job crash for large system

Mark Abraham Mark.Abraham at anu.edu.au
Tue Aug 23 00:47:04 CEST 2011


On 23/08/2011 8:44 AM, Dr. Vitaly V. Chaban wrote:
> In the issue below, the barostat is set up semi-isotropically and acts
> only along the "long" direction. The density of the system slowly
> increases due to mixing, in case this is useful.

Does a different barostat work?

Mark
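
[Note: testing a different barostat means changing the pressure-coupling
options in the .mdp file. The sketch below is only an illustration and is
not taken from the original poster's input: it assumes z is the "long" axis,
and the tau_p, compressibility and ref_p values are placeholders. The option
names (pcoupl, pcoupltype, tau_p, compressibility, ref_p) are standard
GROMACS .mdp options; the helper function name is hypothetical.]

```python
def pressure_coupling_mdp(barostat, tau_p=1.0, compress_z=4.5e-5, ref_p=1.0):
    """Render semiisotropic pressure-coupling lines for a GROMACS .mdp file.

    Semiisotropic coupling takes two values per option: x/y first, then z.
    An x/y compressibility of 0 keeps the lateral box dimensions fixed, so
    the barostat acts only along z (assumed here to be the long axis).
    All numeric defaults are placeholder assumptions, not recommendations.
    """
    options = {
        "pcoupl": barostat,                      # e.g. berendsen or Parrinello-Rahman
        "pcoupltype": "semiisotropic",
        "tau_p": f"{tau_p:g}",                   # ps
        "compressibility": f"0 {compress_z:g}",  # bar^-1, x/y then z
        "ref_p": f"{ref_p:g} {ref_p:g}",         # bar, x/y then z
    }
    return "\n".join(f"{key:<16}= {value}" for key, value in options.items())


# Swap the Parrinello-Rahman barostat for Berendsen, keeping everything else.
print(pressure_coupling_mdp("berendsen"))
```

[If the crash disappears with the other barostat, that would point towards
the pressure coupling rather than the box shape alone.]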

>
>
> On Mon, Aug 22, 2011 at 5:32 PM, Dr. Vitaly V. Chaban
> <vvchaban at gmail.com>  wrote:
>> We are running a system of 84000 atoms in a
>> parallelepiped box, 6x6x33 nm. The starting geometry etc. is OK and the
>> trajectory evolves reasonably, but after several hundred
>> thousand steps it suddenly crashes. Mysteriously, it crashes at a
>> different time step each time, but it always crashes. The parts of
>> this system were equilibrated separately and did not crash. The system
>> is not in equilibrium, but there are no external forces. The
>> Parrinello-Rahman barostat is turned on. The md.log does not show any
>> problems, no PDB configurations are written out before the crash,
>> there are no constraints, and the time step is 1 fs, which is OK for the
>> separate components (in separate boxes).
>>
>> With serial GROMACS, the error has not yet been observed, but given the
>> size, the run is very slow.
>>
>> What can it be? Could it be connected with the very elongated box?
>>
>>
>> Stdout below:
>>
>> 50000000 steps,  50000.0 ps.
>> [exciton04:10256] *** Process received signal ***
>> [exciton04:10256] Signal: Segmentation fault (11)
>> [exciton04:10256] Signal code: Address not mapped (1)
>> [exciton04:10256] Failing at address: 0x6c0ebf10
>> [exciton04:10257] *** Process received signal ***
>> [exciton04:10257] Signal: Segmentation fault (11)
>> [exciton04:10257] Signal code: Address not mapped (1)
>> [exciton04:10257] Failing at address: 0x6378320
>> [exciton04:10253] *** Process received signal ***
>> [exciton04:10253] Signal: Segmentation fault (11)
>> [exciton04:10253] Signal code: Address not mapped (1)
>> [exciton04:10253] Failing at address: 0x1bfbe110
>> [exciton04:10253] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10253] [ 1] mdrun [0x66bb4d]
>> [exciton04:10253] *** End of error message ***
>> [exciton04:10255] *** Process received signal ***
>> [exciton04:10255] Signal: Segmentation fault (11)
>> [exciton04:10255] Signal code: Address not mapped (1)
>> [exciton04:10255] Failing at address: 0x13dd139b0
>> [exciton04:10255] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10255] [ 1] mdrun [0x66bb5e]
>> [exciton04:10255] *** End of error message ***
>> [exciton04:10256] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10256] [ 1] mdrun [0x66bb6f]
>> [exciton04:10256] *** End of error message ***
>> [exciton04:10254] *** Process received signal ***
>> [exciton04:10254] Signal: Segmentation fault (11)
>> [exciton04:10254] Signal code: Address not mapped (1)
>> [exciton04:10254] Failing at address: 0x13d2103b0
>> [exciton04:10254] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10254] [ 1] mdrun [0x66bb5e]
>> [exciton04:10254] *** End of error message ***
>> [exciton04:10257] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10257] [ 1] mdrun [0x66bb5e]
>> [exciton04:10257] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 10253 on node exciton04
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> 5 total processes killed (some possibly by mpirun during cleanup)
>>
>>
>>
>> The GROMACS version is 4.0.7, used with Open MPI.
>>
>> --
>> Dr. Vitaly V. Chaban, 430 Hutchison Hall, Chem. Dept.
>> Univ. Rochester, Rochester, New York 14627-0216
>> THE UNITED STATES OF AMERICA
>>
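
[Note: the stdout above interleaves the messages of several MPI processes,
which makes it hard to see which process failed where. The sketch below
groups the Open MPI signal-handler lines by PID. The function name and the
regular expression are illustrative assumptions based only on the line
format visible in this log; they are not part of GROMACS or Open MPI.]

```python
import re
from collections import defaultdict

# Matches lines such as "[exciton04:10256] Failing at address: 0x6c0ebf10",
# with or without leading ">" quote markers.
LINE_RE = re.compile(r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+(?P<msg>.*)")


def summarize_crash(log_text):
    """Collect signal, failing address and mdrun frame for each PID."""
    by_pid = defaultdict(dict)
    for raw in log_text.splitlines():
        match = LINE_RE.search(raw)
        if match is None:
            continue  # separators, the mpirun summary, blank lines
        pid, msg = int(match.group("pid")), match.group("msg")
        if msg.startswith("Signal:"):
            by_pid[pid]["signal"] = msg[len("Signal:"):].strip()
        elif msg.startswith("Failing at address:"):
            by_pid[pid]["address"] = msg.split(":", 1)[1].strip()
        elif "mdrun" in msg:
            # Backtrace frame, e.g. "[ 1] mdrun [0x66bb5e]"
            by_pid[pid]["mdrun_frame"] = msg.rsplit("[", 1)[1].rstrip("]")
    return dict(by_pid)


# A few lines copied from the log above.
sample = """\
>> [exciton04:10253] *** Process received signal ***
>> [exciton04:10253] Signal: Segmentation fault (11)
>> [exciton04:10253] Signal code: Address not mapped (1)
>> [exciton04:10253] Failing at address: 0x1bfbe110
>> [exciton04:10253] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
>> [exciton04:10253] [ 1] mdrun [0x66bb4d]
>> [exciton04:10255] Signal: Segmentation fault (11)
>> [exciton04:10255] Failing at address: 0x13dd139b0
>> [exciton04:10255] [ 1] mdrun [0x66bb5e]
"""

for pid, info in sorted(summarize_crash(sample).items()):
    print(pid, info)
```

[Applied to the full log, this kind of grouping shows that all five
processes report a segmentation fault, each at a different failing address,
with mdrun frame addresses that lie close together (0x66bb4d to 0x66bb6f).]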



