[gmx-users] Pull code stalling?
Justin Lemkul
jalemkul at vt.edu
Tue Apr 8 04:21:25 CEST 2014
Hi All,
I have noticed a strange problem involving the pull code, and perhaps other
types of restraints, in version 5.0-beta2. It seems that use of the pull code
causes runs to simply stall and produce no output beyond the header of the .log
file. A few notes to break down the situation:
1. I initially suspected a hardware problem, but I have determined that the
nodes in question work correctly. I have 64-CPU nodes that are handling these
jobs. Runs submitted using version 4.6.3 or 5.0-beta2 without the pull code run
correctly.
2. It seems that the runs with the pull code are indeed stalling. Logging in to
the node where the job is running shows that only 8 CPUs are in use instead of
64, and mdrun is stuck in uninterruptible sleep (see the process check sketched
after this list). Runs on the same nodes with version 4.6.3 and the pull code
correctly use all 64 CPUs and produce output at regular intervals.
3. GROMACS has been compiled with the thread-MPI library, which has worked well
for previous versions.
4. The mdrun command is simply mdrun -nt 64 -deffnm pull -px pullx.xvg -pf
pullf.xvg. Invoking mdrun -nt 64 for other runs with 5.0-beta2 without the pull
code works fine, with decent performance. The problem persists with different
numbers of CPUs/threads. An outline of the pull settings is sketched after this
list.
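
For reference, this is the kind of check I run on the stalled node; 'D' in the
STAT column is uninterruptible sleep, and NLWP is the thread count (the pgrep
pattern here is just illustrative):

# show process state, thread count, and kernel wait channel for the mdrun job
ps -o pid,stat,nlwp,wchan:24,cmd -p $(pgrep -f 'mdrun -nt 64')
# per-thread CPU usage, to see which threads are actually running
top -H -p $(pgrep -f 'mdrun -nt 64' | head -n 1)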
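
For context, these are the same kind of pull jobs that run fine under 4.6.3 on
these nodes; in 4.6.x .mdp syntax the setup looks roughly like the following
(group names and values are placeholders, not the exact input):

; illustrative pull section, 4.6.x option names; actual groups/values differ
pull            = umbrella      ; harmonic restraint on the pull coordinate
pull_geometry   = distance      ; COM distance between the two groups
pull_dim        = N N Y         ; only the z component contributes
pull_start      = yes           ; take the initial COM distance as the reference
pull_ngroups    = 1             ; one pulled group, plus the reference group 0
pull_group0     = Reference     ; placeholder reference group name
pull_group1     = Pulled        ; placeholder pulled group name
pull_rate1      = 0.0           ; static restraint (umbrella sampling window)
pull_k1         = 1000          ; kJ mol^-1 nm^-2
pull_nstxout    = 500           ; write pullx.xvg every 500 steps
pull_nstfout    = 500           ; write pullf.xvg every 500 steps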
I have not tried the new release candidate from today, but if that would help
narrow the problem down, I will gladly do so.
Any ideas? Sections of the .log file for a stalled run are posted below. Note,
too, that the compiler version does not affect the outcome; recompiling with
GCC 4.7.2 results in the same behavior.
As an aside, using flat-bottom restraints (a separate set of jobs entirely) also
results in curious behavior. Runs proceed at a normal rate, but then take 20
minutes or more to go from the final step to writing the final output
(coordinates, checkpoint, trajectory, and energy file), and another hour or
more to actually exit. Perhaps this is an unrelated issue, but in case something
is more globally wrong with restraints, I thought I'd mention it. Normal
position restraints work fine.
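
For completeness, those jobs use the standard flat-bottomed position restraints
(funct type 2 in [ position_restraints ]); the atom index, geometry, radius,
and force constant below are placeholders, not the values from these jobs:

[ position_restraints ]
; ai   funct   g     r (nm)   k (kJ mol^-1 nm^-2)
   1     2     1     0.5      1000    ; g = 1: flat-bottom sphere of radius r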
-Justin
==== .log of stall ====
GROMACS: gmx mdrun, VERSION 5.0-beta2
Executable: /home/jalemkul/software/gromacs/5.0-beta2/bin/gmx
Command line:
mdrun -nt 64 -deffnm pull -pf pullf.xvg -px pullx.xvg
Gromacs version: VERSION 5.0-beta2
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled
GPU support: disabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: SSE2
FFT library: fftw-3.3.3-sse2
RDTSCP usage: enabled
C++11 compilation: disabled
TNG support: enabled
Built on: Sun Apr 6 17:12:22 EDT 2014
Built by: jalemkul at ocracoke [CMAKE]
Build OS/arch: Linux 2.6.32-5-amd64 x86_64
Build CPU vendor: AuthenticAMD
Build CPU brand: AMD Opteron(tm) Processor 6172
Build CPU family: 16 Model: 9 Stepping: 1
Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm misalignsse mmx msr
nonstop_tsc pdpe1gb popcnt pse rdtscp sse2 sse3 sse4a
C compiler: /usr/bin/cc GNU 4.4.5
C compiler flags: -msse2 -Wextra -Wno-missing-field-initializers
-Wno-sign-compare -Wall -Wno-unused -Wunused-value -Wunused-parameter
-fomit-frame-pointer -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 4.4.5
C++ compiler flags: -msse2 -Wextra -Wno-missing-field-initializers -Wall
-Wno-unused-function -fomit-frame-pointer -funroll-all-loops -O3 -DNDEBUG
Boost version: 1.48.0 (internal)
...
Initializing Domain Decomposition on 64 nodes
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.401 nm, LJ-14, atoms 131 138
multi-body bonded interactions: 0.401 nm, Proper Dih., atoms 131 138
Minimum cell size due to bonded interactions: 0.441 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.768 nm
Estimated maximum distance required for P-LINCS: 0.768 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.12
Will use 56 particle-particle and 8 PME only nodes
This is a guess, check the performance at the end of the log file
Using 8 separate PME nodes, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 56 cells with a minimum initial size of 0.960 nm
The maximum allowed number of cells is: X 17 Y 7 Z 7
Domain decomposition grid 8 x 7 x 1, separate PME nodes 8
PME domain decomposition: 8 x 1 x 1
Interleaving PP and PME nodes
This is a particle-particle only node
Domain decomposition nodeid 0, coordinates 0 0 0
Using 64 MPI threads
Using 1 OpenMP thread per tMPI thread
Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: AuthenticAMD
Brand: AMD Opteron(TM) Processor 6276
Family: 21 Model: 1 Stepping: 2
Features: aes apic avx clfsh cmov cx8 cx16 fma4 htt lahf_lm misalignsse mmx msr
nonstop_tsc pclmuldq pdpe1gb popcnt pse rdtscp sse2 sse3 sse4a sse4.1 sse4.2
ssse3 xop
Acceleration most likely to fit this hardware: AVX_128_FMA
Acceleration selected at GROMACS compile time: SSE2
--
==================================================
Justin A. Lemkul, Ph.D.
Ruth L. Kirschstein NRSA Postdoctoral Fellow
Department of Pharmaceutical Sciences
School of Pharmacy
Health Sciences Facility II, Room 601
University of Maryland, Baltimore
20 Penn St.
Baltimore, MD 21201
jalemkul at outerbanks.umaryland.edu | (410) 706-7441
http://mackerell.umaryland.edu/~jalemkul
==================================================