[gmx-users] parallel job crash for large system

Dr. Vitaly V. Chaban vvchaban at gmail.com
Mon Aug 22 23:32:37 CEST 2011


We are running a system of 84000 atoms in a parallelepiped box,
6x6x33 nm. The starting geometry, etc. are OK and the trajectory
evolves reasonably, but after several hundred thousand steps the run
suddenly crashes. Mysteriously, it crashes at a different time step
each time, but it always crashes. The parts of this system were
equilibrated separately and did not crash. The system is not in
equilibrium, but no external forces are applied. The
Parrinello-Rahman barostat is turned on. The md.log does not show any
problems, no PDB configurations are written before the crash, there
are no constraints, and the time step is 1 fs, which is OK for the
separate components (in separate boxes).

With serial GROMACS, the error has not been observed so far, but given
the system size the run is very slow.

What can it be? Could it be connected with the very elongated box?


Stdout below:

50000000 steps,  50000.0 ps.
[exciton04:10256] *** Process received signal ***
[exciton04:10256] Signal: Segmentation fault (11)
[exciton04:10256] Signal code: Address not mapped (1)
[exciton04:10256] Failing at address: 0x6c0ebf10
[exciton04:10257] *** Process received signal ***
[exciton04:10257] Signal: Segmentation fault (11)
[exciton04:10257] Signal code: Address not mapped (1)
[exciton04:10257] Failing at address: 0x6378320
[exciton04:10253] *** Process received signal ***
[exciton04:10253] Signal: Segmentation fault (11)
[exciton04:10253] Signal code: Address not mapped (1)
[exciton04:10253] Failing at address: 0x1bfbe110
[exciton04:10253] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
[exciton04:10253] [ 1] mdrun [0x66bb4d]
[exciton04:10253] *** End of error message ***
[exciton04:10255] *** Process received signal ***
[exciton04:10255] Signal: Segmentation fault (11)
[exciton04:10255] Signal code: Address not mapped (1)
[exciton04:10255] Failing at address: 0x13dd139b0
[exciton04:10255] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
[exciton04:10255] [ 1] mdrun [0x66bb5e]
[exciton04:10255] *** End of error message ***
[exciton04:10256] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
[exciton04:10256] [ 1] mdrun [0x66bb6f]
[exciton04:10256] *** End of error message ***
[exciton04:10254] *** Process received signal ***
[exciton04:10254] Signal: Segmentation fault (11)
[exciton04:10254] Signal code: Address not mapped (1)
[exciton04:10254] Failing at address: 0x13d2103b0
[exciton04:10254] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
[exciton04:10254] [ 1] mdrun [0x66bb5e]
[exciton04:10254] *** End of error message ***
[exciton04:10257] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
[exciton04:10257] [ 1] mdrun [0x66bb5e]
[exciton04:10257] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10253 on node exciton04
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
5 total processes killed (some possibly by mpirun during cleanup)
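
Because the output of the five processes is interleaved, it can be
easier to read once grouped by PID. Below is a minimal Python sketch of
that grouping, not a GROMACS or Open MPI tool: the function name
summarize and the LOG excerpt (two of the five processes, copied from
the stdout above) are illustrative assumptions.

```python
import re
from collections import defaultdict

# Excerpt copied from the stdout above (two of the five processes).
LOG = """\
[exciton04:10253] *** Process received signal ***
[exciton04:10253] Signal: Segmentation fault (11)
[exciton04:10253] Signal code: Address not mapped (1)
[exciton04:10253] Failing at address: 0x1bfbe110
[exciton04:10253] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
[exciton04:10253] [ 1] mdrun [0x66bb4d]
[exciton04:10253] *** End of error message ***
[exciton04:10255] *** Process received signal ***
[exciton04:10255] Signal: Segmentation fault (11)
[exciton04:10255] Signal code: Address not mapped (1)
[exciton04:10255] Failing at address: 0x13dd139b0
[exciton04:10255] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
[exciton04:10255] [ 1] mdrun [0x66bb5e]
[exciton04:10255] *** End of error message ***
"""

# Matches the "[host:pid] message" prefix seen in the log.
LINE = re.compile(r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] (?P<msg>.*)$")


def summarize(log_text):
    """Group the signal-handler lines by PID (hypothetical helper)."""
    ranks = defaultdict(dict)
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # separators, mpirun summary lines, etc.
        pid, msg = int(m["pid"]), m["msg"]
        if msg.startswith("Signal:"):
            ranks[pid]["signal"] = msg.split(":", 1)[1].strip()
        elif msg.startswith("Failing at address:"):
            ranks[pid]["address"] = msg.split(":", 1)[1].strip()
        elif "mdrun [" in msg:
            # Return address of the stack frame inside the mdrun binary.
            ranks[pid]["mdrun_frame"] = msg.split("mdrun [", 1)[1].rstrip("]")
    return dict(ranks)


if __name__ == "__main__":
    for pid, info in sorted(summarize(LOG).items()):
        print(pid, info)
```

Grouped this way, the log shows every process receiving the same
signal with a frame in mdrun at a nearby address (0x66bb4d to
0x66bb6f), while the failing data addresses differ between processes.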



The GROMACS version is 4.0.7, run with OpenMPI.

-- 
Dr. Vitaly V. Chaban, 430 Hutchison Hall, Chem. Dept.
Univ. Rochester, Rochester, New York 14627-0216
THE UNITED STATES OF AMERICA


