[gmx-users] GROMACS 4.6.7 not running on more than 16 MPI threads

Szilárd Páll pall.szilard at gmail.com
Fri Feb 27 00:27:28 CET 2015


On Thu, Feb 26, 2015 at 11:36 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
> Hi,
>
> First, you're not using "MPI threads." You're using MPI, and perhaps OpenMP
> threading within MPI ranks. MPI implemented on threads would be a totally
> different thing.
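>
> To illustrate the difference (just a sketch, using your ex.tpr and the
> default binary names of a 4.6 build, where a thread-MPI build provides
> plain mdrun and an MPI build provides mdrun_mpi):
>
>   # thread-MPI: one process starts 16 MPI-like threads internally
>   mdrun -ntmpi 16 -s ex.tpr
>
>   # real MPI: the launcher starts 16 separate ranks of the MPI binary
>   mpirun -np 16 mdrun_mpi -s ex.tpr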
>
> On Thu, Feb 26, 2015 at 10:12 PM, Agnivo Gosai <agnivogromacs14 at gmail.com>
> wrote:
>
>> Dear Users
>>
>> I am running GROMACS 4.6.7 on my university cluster. Its salient
>> specifications are :-
>>
>> http://hpcgroup.public.iastate.edu/HPC/CyEnce/description.html
>>
>> *I compiled GROMACS 4.6.7 as follows :-*
>>
>> work/gb_lab/agosai/GROMACS/cmake-2.8.11/bin/cmake .. -DGMX_GPU=OFF
>> -DGMX_MPI=ON -DGMX_OPENMP=ON -DGMX_THREAD_MPI=OFF -DGMX_OPENMM=OFF
>> -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DGMX_BUILD_OWN_FFTW=ON
>> -DCMAKE_INSTALL_PREFIX=/work/gb_lab/agosai/gmx467ag -DGMX_DOUBLE=OFF
>>
>> *My mdrun command in a PBS script is as follows :-*
>>
>> mpirun -np 16 -f $PBS_NODEFILE mdrun_mpi -s ex.tpr deffnm -v, with
>> lnodes=1 and ppn = 16.
>>
>> *This is part of a standard 'log file' of a mdrun command running on 1 node
>> and 16 processes :-*
>>
>> Log file opened on Mon Feb 23 15:29:39 2015
>> Host: node021  pid: 13159  nodeid: 0  nnodes:  16
>> Gromacs version:    VERSION 4.6.7
>> Precision:          single
>> Memory model:       64 bit
>> MPI library:        MPI
>> OpenMP support:     enabled
>> GPU support:        disabled
>> invsqrt routine:    gmx_software_invsqrt(x)
>> CPU acceleration:   AVX_256
>> FFT library:        fftw-3.3.2-sse2
>> Large file support: enabled
>> RDTSCP usage:       enabled
>> Built on:           Fri Nov 21 12:55:48 CST 2014
>> Built by:           agosai at share [CMAKE]
>> Build OS/arch:      Linux 2.6.32-279.19.1.el6.x86_64 x86_64
>> Build CPU vendor:   GenuineIntel
>> Build CPU brand:    Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
>> Build CPU family:   6   Model: 45   Stepping: 7
>> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr
>> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1
>> sse4.2 ssse3 tdt x2apic
>> C compiler:         /shared/intel/impi/4.1.0.024/intel64/bin/mpiicc Intel
>> icc (ICC) 13.0.1 20121010
>> C compiler flags:   -mavx    -std=gnu99 -Wall   -ip -funroll-all-loops  -O3
>> -DNDEBUG
>>
>>
>> ............................................................................................................................................
>> Using 16 MPI processes
>>
>> Detecting CPU-specific acceleration.
>> Present hardware specification:
>> Vendor: GenuineIntel
>> Brand:  Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
>> Family:  6  Model: 45  Stepping:  7
>> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
>> pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
>> tdt x2apic
>> Acceleration most likely to fit this hardware: AVX_256
>> Acceleration selected at GROMACS compile time: AVX_256
>>
>> *This is found in the standard PBS error file :*
>> ...................................
>> Back Off! I just backed up smdelec1.log to ./#smdelec1.log.1#
>>
>> Number of CPUs detected (16) does not match the number reported by OpenMP
>> (1).
>> Consider setting the launch configuration manually!
>> Reading file smdelec1.tpr, VERSION 4.6.7 (double precision)
>> Using 16 MPI processes
>>
>> Non-default thread affinity set probably by the OpenMP library,
>> disabling internal thread affinity
>>
>> ........................
>> *The program runs successfully and speed is around 7 ns / day for my
>> particular biomolecule.*
>>
>
> OK, so 16 ranks permits a domain decomposition. However, there are
> geometric limits on how many ranks can be used; at some point no domain
> decomposition can be made, and the contents of the simulation are relevant
> here. Early in the .log file you can see GROMACS deciding what DD it can
> use, and how close that might be to the limits at 16 ranks.
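>
> (A cheap way to probe those limits, if you want to: run just a handful of
> steps at the rank count you are interested in, e.g. with your ex.tpr
>
>   mpirun -np 32 mdrun_mpi -s ex.tpr -nsteps 10
>
> mdrun aborts with a domain decomposition error if no DD grid can be made
> for that many ranks; otherwise the chosen grid is reported in the .log
> file.)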
>
>> However, the mdrun command *fails to* run when I use *more than 1 node and
>> keep ppn = 16*. I observed that it can run on 2 nodes with 4 processes or
>> on 2 nodes with 8 processes. Similarly it can run on 4 nodes with 4
>> processes.
>
>
> That's expected, but doesn't shed any light. If there's a DD for 16 ranks,
> then trivially there is a DD for any number of ranks that is a factor of
> 16, since you can just paste the 16-rank decomposition together differently.
>
>
>> That is, np = 16 is the limit for the command in my case.
>>
>> *For lnodes = 3 and ppn = 3, I have a message like this :-*
>>
>
> Don't do this. You have 16 cores per node, with 8 cores on each of two
> sockets. How are you going to split those up with three ranks per node? I
> forget what the intended behaviour of GROMACS is here,

IIRC we implemented the automation such that it checks whether #ranks <
#hw-threads, and if so it starts nthreads = #hw-threads / #ranks OpenMP
threads per rank (and issues a warning otherwise). That should give 5
threads/rank with ppn = 3 and 16 cores (or 10 with HT on).

> but it's not
> something you should even want to attempt, because it would run horribly
> even if it could run. You need to choose a number of ranks that respects
> the structure of the hardware, e.g. one that leads to a number of threads
> per rank that is a divisor of the number of cores per socket. Here, 16, 8,
> 4, or 2 ranks per node make possible sense, but likely only 16, 8, or 4 are
> worth trying for a CPU-only run.
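>
> For example, on these 2 x 8-core nodes sensible launch configurations look
> roughly like the following (illustrative only; -ntomp sets the number of
> OpenMP threads per rank, and the output names are placeholders):
>
>   # 16 ranks per node, 1 OpenMP thread per rank (2 nodes -> 32 ranks)
>   mpirun -np 32 mdrun_mpi -ntomp 1 -s ex.tpr -deffnm out
>
>   # 4 ranks per node, 4 OpenMP threads per rank (2 nodes -> 8 ranks)
>   mpirun -np 8 mdrun_mpi -ntomp 4 -s ex.tpr -deffnm out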
>
>> Number of CPUs detected (16) does not match the number reported by OpenMP
>> (1).
>> Consider setting the launch configuration manually!
>> Reading file pull1.tpr, VERSION 4.6.7 (double precision)
>> Using 9 MPI processes
>>
>> ..............................................................................................
>> =>> PBS: job killed: walltime 50 exceeded limit 30. (I killed the job.)
>>
>> *For lnodes = 4 and ppn = 2, I get this :-*
>>
>> Number of CPUs detected (16) does not match the number reported by OpenMP
>> (2).
>> Consider setting the launch configuration manually!
>> Reading file pull1.tpr, VERSION 4.6.7 (double precision)
>> Using 8 MPI processes
>>
>
> That should work, but I would be suspicious of your statement of ppn = 2,
> because that should be 8 cores per rank, not the 2 reported by OpenMP.

Job schedulers do not always communicate very well with the MPI launcher;
e.g. even if you set ppn=4 on a 2x8-core node without HT, the scheduler
won't necessarily set OMP_NUM_THREADS to 4. At the same time, some MPI
launchers are eager to set the number of threads or thread affinities
themselves, which can lead to the above warning.
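
A minimal sketch of making the launch configuration explicit in the PBS
script instead of relying on the scheduler (resource line and file names
are placeholders; -ntomp and -pin are mdrun options in 4.6):

  #PBS -l nodes=2:ppn=16
  export OMP_NUM_THREADS=2
  mpirun -np 16 mdrun_mpi -ntomp 2 -pin on -s ex.tpr -deffnm out

That is 16 ranks x 2 OpenMP threads = 32 threads on 32 cores, with mdrun's
internal pinning keeping the threads from migrating.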

>
>>
>> ........................................................................................................
>> =>> PBS: job killed: walltime 50 exceeded limit 30. (I killed the job.)
>>
>> In the above test cases my walltime was 00:30:00 , arbitrarily chosen so as
>> to see if they run or not.
>>
>
> Just pass -nsteps 10 to mdrun_mpi and save some time ;-)
>
>
>> *If I use , say, lnode = 2 , ppn = 16 and np = 32 , the program runs but no
>> output is generated. If I cancel it then this error comes :-*
>>
>> [mpiexec at node094] HYD_pmcd_pmiserv_send_signal
>> (./pm/pmiserv/pmiserv_cb.c:221):
>> assert (!closed) failed
>> [mpiexec at node094] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to
>> send SIGUSR1 downstream
>> [mpiexec at node094] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77):
>> callback returned error status
>> [mpiexec at node094] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:388):
>> error waiting for event
>> [mpiexec at node094] main (./ui/mpich/mpiexec.c:718): process manager error
>> waiting for completion
>>
>> Can anyone please help with this? I am waiting for a reply in this forum,
>> after which I will take it up with the cluster admins.
>>
>
> That looks like you are misusing the MPI also. That no stdout/stderr get
> returned in any of your non-working runs suggests something is
> misused/misconfigured. Further, your output above suggests you compiled
> with IntelMPI, but mpiexec might be from MPICH. Don't mix MPI
> installations. For the record, there have been reports of issues with
> (unknown versions of) IntelMPI. Regardless, you should ask the admins about
> the eager message protocol setting, and consider using a smaller eager
> threshold with PME simulations. Or try OpenMPI ;-)
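>
> A quick way to check whether the launcher and the library actually match
> (paths on your cluster will differ):
>
>   which mpirun mpiexec
>   ldd $(which mdrun_mpi) | grep -i mpi
>
> Both should point into the same MPI installation, i.e. the IntelMPI under
> /shared/intel/impi/... that you compiled with.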
>
> Mark
>
>
>> Thanks & Regards
>> Agnivo Gosai
>> Grad Student, Iowa State University.