[gmx-users] Re: parallel job crash for large system

Dr. Vitaly V. Chaban vvchaban at gmail.com
Tue Aug 23 00:44:30 CEST 2011


In the issue below, the barostat is set up semi-isotropically and acts
only along the "long" direction. The density of the system slowly
grows due to mixing. In case this is useful...
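
As a rough illustration of what pressure coupling along the long axis only
means for the density: if the x and y edges stay fixed, the volume changes
only through the z edge, so the density scales as 1/Lz. The sketch below
assumes a rectangular box and uses the atom count and box size from the
quoted post; the 32 nm value is purely hypothetical and is not taken from
the actual run.

```python
def number_density(n_atoms, lx, ly, lz):
    """Atoms per nm^3 for a rectangular box with edge lengths in nm."""
    return n_atoms / (lx * ly * lz)


def density_ratio_after_z_change(lz_old, lz_new):
    """Density ratio when only the z edge changes (x and y stay fixed)."""
    return lz_old / lz_new


# Values from the quoted post: 84000 atoms in a 6 x 6 x 33 nm box
# (assumed rectangular here).
rho0 = number_density(84000, 6.0, 6.0, 33.0)

# Hypothetical: suppose the barostat shrinks the long edge from 33 to 32 nm.
ratio = density_ratio_after_z_change(33.0, 32.0)

print(f"initial number density: {rho0:.1f} atoms/nm^3")
print(f"density ratio after z shrinks to 32 nm: {ratio:.4f}")
```

Since only one edge responds, any densification from mixing shows up
entirely as a change in the long box dimension.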


On Mon, Aug 22, 2011 at 5:32 PM, Dr. Vitaly V. Chaban
<vvchaban at gmail.com> wrote:
> We are running a system consisting of 84000 atoms in a
> parallelepipedic box, 6x6x33 nm. The starting geometry, etc. are OK and
> the evolution of the trajectory is reasonable, but after several hundred
> thousand steps it suddenly crashes. Mysteriously, it crashes at a
> different time step each time, but it always occurs. The parts of
> this system were equilibrated separately and did not crash. The system
> is not in equilibrium, but there are no external forces. The
> Parrinello-Rahman barostat is turned on. The md.log does not show any
> problems, no PDB configurations are written before the crash,
> there are no constraints, and the time step is 1 fs, which is OK for
> the separate components (in separate boxes).
>
> With serial GROMACS, the error has not yet been observed, but given the
> system size the run is very slow.
>
> What could it be? Could it be somehow connected with the very elongated box?
>
>
> Stdout below:
>
> 50000000 steps,  50000.0 ps.
> [exciton04:10256] *** Process received signal ***
> [exciton04:10256] Signal: Segmentation fault (11)
> [exciton04:10256] Signal code: Address not mapped (1)
> [exciton04:10256] Failing at address: 0x6c0ebf10
> [exciton04:10257] *** Process received signal ***
> [exciton04:10257] Signal: Segmentation fault (11)
> [exciton04:10257] Signal code: Address not mapped (1)
> [exciton04:10257] Failing at address: 0x6378320
> [exciton04:10253] *** Process received signal ***
> [exciton04:10253] Signal: Segmentation fault (11)
> [exciton04:10253] Signal code: Address not mapped (1)
> [exciton04:10253] Failing at address: 0x1bfbe110
> [exciton04:10253] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
> [exciton04:10253] [ 1] mdrun [0x66bb4d]
> [exciton04:10253] *** End of error message ***
> [exciton04:10255] *** Process received signal ***
> [exciton04:10255] Signal: Segmentation fault (11)
> [exciton04:10255] Signal code: Address not mapped (1)
> [exciton04:10255] Failing at address: 0x13dd139b0
> [exciton04:10255] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
> [exciton04:10255] [ 1] mdrun [0x66bb5e]
> [exciton04:10255] *** End of error message ***
> [exciton04:10256] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
> [exciton04:10256] [ 1] mdrun [0x66bb6f]
> [exciton04:10256] *** End of error message ***
> [exciton04:10254] *** Process received signal ***
> [exciton04:10254] Signal: Segmentation fault (11)
> [exciton04:10254] Signal code: Address not mapped (1)
> [exciton04:10254] Failing at address: 0x13d2103b0
> [exciton04:10254] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
> [exciton04:10254] [ 1] mdrun [0x66bb5e]
> [exciton04:10254] *** End of error message ***
> [exciton04:10257] [ 0] /lib64/libpthread.so.0 [0x3402a0eb10]
> [exciton04:10257] [ 1] mdrun [0x66bb5e]
> [exciton04:10257] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 10253 on node exciton04
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 5 total processes killed (some possibly by mpirun during cleanup)
>
>
>
> The GROMACS version is 4.0.7, used with OpenMPI.
>
> --
> Dr. Vitaly V. Chaban, 430 Hutchison Hall, Chem. Dept.
> Univ. Rochester, Rochester, New York 14627-0216
> THE UNITED STATES OF AMERICA
>


