[gmx-users] GROMACS 4.6.7 not running on more than 16 MPI threads

Mark Abraham mark.j.abraham at gmail.com
Thu Feb 26 23:36:15 CET 2015


Hi,

First, you're not using "MPI threads." You're using MPI, and perhaps OpenMP
threading within MPI ranks. MPI implemented on threads would be a totally
different thing.
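
To illustrate the distinction (a sketch; the file names are placeholders): a
default GROMACS build uses thread-MPI and starts its "ranks" as threads inside
a single process, whereas your real-MPI build needs the MPI launcher to start
one process per rank.

  # thread-MPI build (default, GMX_MPI=OFF): one process, ranks are threads
  mdrun -ntmpi 16 -s ex.tpr -deffnm out

  # real MPI build (your GMX_MPI=ON build): the launcher starts 16 processes
  mpirun -np 16 mdrun_mpi -s ex.tpr -deffnm out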

On Thu, Feb 26, 2015 at 10:12 PM, Agnivo Gosai <agnivogromacs14 at gmail.com>
wrote:

> Dear Users
>
> I am running GROMACS 4.6.7 on my university cluster. Its salient
> specifications are :-
>
> http://hpcgroup.public.iastate.edu/HPC/CyEnce/description.html
>
> *I compiled GROMACS 4.6.7 as follows :-*
>
> work/gb_lab/agosai/GROMACS/cmake-2.8.11/bin/cmake .. -DGMX_GPU=OFF
> -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_THREAD_MPI=OFF -DGMX_OPENMM=OFF
> -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DGMX_BUILD_OWN_FFTW=ON
> -DCMAKE_INSTALL_PREFIX=/work/gb_lab/agosai/gmx467ag -DGMX_DOUBLE=OFF
>
> *My mdrun command in a PBS script is as follows :-*
>
> mpirun -np 16 -f $PBS_NODEFILE mdrun_mpi -s ex.tpr deffnm -v , with
> lnodes=1 and ppn = 16.
>
> *This is part of a standard 'log file' of a mdrun command running on 1 node
> and 16 processes :-*
>
> Log file opened on Mon Feb 23 15:29:39 2015
> Host: node021  pid: 13159  nodeid: 0  nnodes:  16
> Gromacs version:    VERSION 4.6.7
> Precision:          single
> Memory model:       64 bit
> MPI library:        MPI
> OpenMP support:     enabled
> GPU support:        disabled
> invsqrt routine:    gmx_software_invsqrt(x)
> CPU acceleration:   AVX_256
> FFT library:        fftw-3.3.2-sse2
> Large file support: enabled
> RDTSCP usage:       enabled
> Built on:           Fri Nov 21 12:55:48 CST 2014
> Built by:           agosai at share [CMAKE]
> Build OS/arch:      Linux 2.6.32-279.19.1.el6.x86_64 x86_64
> Build CPU vendor:   GenuineIntel
> Build CPU brand:    Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> Build CPU family:   6   Model: 45   Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1
> sse4.2 ssse3 tdt x2apic
> C compiler:         /shared/intel/impi/4.1.0.024/intel64/bin/mpiicc Intel
> icc (ICC) 13.0.1 20121010
> C compiler flags:   -mavx    -std=gnu99 -Wall   -ip -funroll-all-loops  -O3
> -DNDEBUG
>
>
> ............................................................................................................................................
> Using 16 MPI processes
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> Family:  6  Model: 45  Stepping:  7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
> pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
> *This is found in the standard PBS error file :*
> ...................................
> Back Off! I just backed up smdelec1.log to ./#smdelec1.log.1#
>
> Number of CPUs detected (16) does not match the number reported by OpenMP
> (1).
> Consider setting the launch configuration manually!
> Reading file smdelec1.tpr, VERSION 4.6.7 (double precision)
> Using 16 MPI processes
>
> Non-default thread affinity set probably by the OpenMP library,
> disabling internal thread affinity
>
> ........................
> *The program runs successfully and speed is around 7 ns / day for my
> particular biomolecule.*
>

OK, so 16 ranks admits a permissible domain decomposition. However, there are
geometric limits on how many ranks can be used - at some point no domain
decomposition can be constructed. The contents of the simulation are relevant
here. You can see, early in the .log file, GROMACS reporting what DD it
decided to use, and how close that is to the limit at 16 ranks.
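
If you want to probe that limit directly, you can ask mdrun for an explicit DD
grid and a zero-length run (a sketch; the grid values and file names here are
placeholders to adjust for your own system):

  # request a 4x2x2 decomposition over 16 ranks, with no separate PME ranks,
  # so all 16 do PP; mdrun aborts with a clear error if the box and cut-offs
  # cannot support cells that small
  mpirun -np 16 mdrun_mpi -s ex.tpr -deffnm ddtest -dd 4 2 2 -npme 0 -nsteps 0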

> However, the mdrun command *fails to* run when I use *more than 1 node and
> keep ppn = 16*. I observed that it can run on 2 nodes with 4 processes or
> on 2 nodes with 8 processes. Similarly it can run on 4 nodes with 4
> processes.


That's expected, but doesn't shed any light. If there's a DD for 16 ranks,
then trivially there is a DD for any number of ranks that is a factor of
16, since you can just paste the 16-rank decomposition together differently.


> That is np = 16 is the limit for the command in my case.
>
> *For lnodes = 3 and ppn =3, I have a message like this :-*
>

Don't do this. You have 16 cores per node, with 8 cores on each of two
sockets. How are you going to split those up with three ranks per node? I
forget what the intended behaviour of GROMACS is here, but it's not something
you should even want to attempt, because it would run horribly even if it
could run. You need to choose a number of ranks that respects the structure of
the hardware, e.g. one that gives a number of OpenMP threads per rank that
divides the number of cores per socket. Here, 16, 8, 4 or 2 ranks per node
could make sense, but likely only 16, 8 or 4 are worth trying for a CPU-only
run.
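
For example, on two of these 2x8-core nodes something like the following has a
sensible shape (a sketch; -ppn is the Hydra/Intel MPI spelling for ranks per
node, and the file names are placeholders):

  #PBS -l nodes=2:ppn=16
  # 8 ranks per node x 2 OpenMP threads per rank = 16 cores per node
  mpirun -np 16 -ppn 8 mdrun_mpi -ntomp 2 -s ex.tpr -deffnm out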

> Number of CPUs detected (16) does not match the number reported by OpenMP
> (1).
> Consider setting the launch configuration manually!
> Reading file pull1.tpr, VERSION 4.6.7 (double precision)
> Using 9 MPI processes
>
> ..............................................................................................
> =>> PBS: job killed: walltime 50 exceeded limit 30. I killed the job.
>
> *For lnodes = 4 and ppn = 2, I get this :-*
>
> Number of CPUs detected (16) does not match the number reported by OpenMP
> (2).
> Consider setting the launch configuration manually!
> Reading file pull1.tpr, VERSION 4.6.7 (double precision)
> Using 8 MPI processes
>

That should work, but I would be suspicious of your statement that ppn = 2,
because that should give 8 cores per rank, not the 2 reported by OpenMP.
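
If you do want only 2 ranks per node on these 16-core nodes, tell mdrun
explicitly how many OpenMP threads each rank should use (a sketch, with the
same caveats as above):

  # 4 nodes x 2 ranks = 8 ranks, each driving one 8-core socket with OpenMP
  mpirun -np 8 -ppn 2 mdrun_mpi -ntomp 8 -s pull1.tpr -deffnm out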


>
> ........................................................................................................
> =>> PBS: job killed: walltime 50 exceeded limit 30. I killed the job.
>
> In the above test cases my walltime was 00:30:00 , arbitrarily chosen so as
> to see if they run or not.
>

Just pass -nsteps 10 to mdrun_mpi and save some time ;-)
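E.g. (a sketch, reusing your own command):

  mpirun -np 16 mdrun_mpi -s ex.tpr -deffnm test -nsteps 10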


> *If I use, say, lnode = 2, ppn = 16 and np = 32, the program runs but no
> output is generated. If I cancel it then this error comes :-*
>
> [mpiexec at node094] HYD_pmcd_pmiserv_send_signal
> (./pm/pmiserv/pmiserv_cb.c:221):
> assert (!closed) failed
> [mpiexec at node094] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to
> send SIGUSR1 downstream
> [mpiexec at node094] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77):
> callback returned error status
> [mpiexec at node094] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:388):
> error waiting for event
> [mpiexec at node094] main (./ui/mpich/mpiexec.c:718): process manager error
> waiting for completion
>
> Can anyone please help with this???? I am waiting for a reply in this forum
> after which I will take it up with the cluster admins.
>

That looks like you are misusing the MPI as well. That no stdout/stderr gets
returned from any of your non-working runs suggests something is misused or
misconfigured. Further, your output above suggests you compiled with Intel
MPI, but that mpiexec might be from MPICH. Don't mix MPI installations. For
the record, there have been reports of issues with (unknown versions of) Intel
MPI. Regardless, you should ask the admins about the setting for the eager
message protocol, and consider using a smaller eager threshold with PME
simulations (see the sketch below). Or try OpenMPI ;-)
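
(With Intel MPI the eager/rendezvous switchover is usually controlled by an
environment variable such as I_MPI_EAGER_THRESHOLD; the value below is a
placeholder to discuss with your admins, not a recommendation.)

  # sketch: pass a smaller eager threshold (in bytes) to every rank
  mpirun -genv I_MPI_EAGER_THRESHOLD 131072 -np 32 mdrun_mpi -s ex.tpr -deffnm out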

Mark


> Thanks & Regards
> Agnivo Gosai
> Grad Student, Iowa State University.
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.
>

