[gmx-users] g_tune_pme_mpi on GPU cluster fails

Mark Abraham mark.j.abraham at gmail.com
Wed Jan 7 17:04:30 CET 2015


On Wed, Jan 7, 2015 at 3:11 PM, Ebert Maximilian <m.ebert at umontreal.ca>
wrote:

> Hi there,
>
> I have another question regarding our GPU cluster. I tried to run
> g_tune_pme_mpi on the cluster. After starting it across 3 nodes I get the
> following errors:
>
>
> Command line:
>   g_tune_pme_mpi -v -x -deffnm 1G68_run1ns -s ../run100ns.tpr
>

You don't want to run MPI-compiled g_tune_pme. It launches MPI processes
(to time them), and (roughly speaking) that means it must not itself be an
MPI process. Install a non-MPI Gromacs, and arrange for it to call the MPI
mdrun build, e.g. one configured with cmake -DGMX_MPI=on
-DGMX_BUILD_MDRUN_ONLY=on. We could probably make this clearer in
g_tune_pme -h.
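
For example, a two-build setup could look roughly like the sketch below.
The install prefixes are placeholders, and the MPIRUN/MDRUN environment
variables are just one way of pointing g_tune_pme at the right binaries;
check g_tune_pme -h for how your version locates mdrun and the MPI
launcher:

  # MPI, mdrun-only build (produces mdrun_mpi)
  cmake .. -DGMX_MPI=on -DGMX_BUILD_MDRUN_ONLY=on \
           -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-5.0-mpi
  make install

  # separate non-MPI build that provides the tools, including g_tune_pme
  cmake .. -DGMX_MPI=off -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-5.0-tools
  make install

  # in the job script: g_tune_pme launches the benchmark runs itself
  export MPIRUN=`which mpirun`
  export MDRUN=$HOME/gromacs-5.0-mpi/bin/mdrun_mpi
  g_tune_pme -np <nranks> -v -x -deffnm 1G68_run1ns -s ../run100ns.tpr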

Mark

> Reading file ../run100ns.tpr, VERSION 5.0.1 (single precision)
> Reading file ../run100ns.tpr, VERSION 5.0.1 (single precision)
> Reading file ../run100ns.tpr, VERSION 5.0.1 (single precision)
> Reading file ../run100ns.tpr, VERSION 5.0.1 (single precision)
> Will test 1 tpr file.
> Will test 1 tpr file.
> Will test 1 tpr file.
> Will test 1 tpr file.
> [ngpu-a4-06:13382] [[25869,1],0] ORTE_ERROR_LOG: A message is attempting
> to be sent to a process whose contact information is unknown in file
> rml_oob_send.c at line 104
> [ngpu-a4-06:13382] [[25869,1],0] could not get route to [[INVALID],INVALID]
> [ngpu-a4-06:13382] [[25869,1],0] ORTE_ERROR_LOG: A message is attempting
> to be sent to a process whose contact information is unknown in file
> base/plm_base_proxy.c at line 81
> [ngpu-a4-06:13385] [[25869,1],1] ORTE_ERROR_LOG: A message is attempting
> to be sent to a process whose contact information is unknown in file
> rml_oob_send.c at line 104
> [ngpu-a4-06:13385] [[25869,1],1] could not get route to [[INVALID],INVALID]
> [ngpu-a4-06:13385] [[25869,1],1] ORTE_ERROR_LOG: A message is attempting
> to be sent to a process whose contact information is unknown in file
> base/plm_base_proxy.c at line 81
> [ngpu-a4-06:13384] [[25869,1],2] ORTE_ERROR_LOG: A message is attempting
> to be sent to a process whose contact information is unknown in file
> rml_oob_send.c at line 104
> [ngpu-a4-06:13384] [[25869,1],2] could not get route to [[INVALID],INVALID]
> [ngpu-a4-06:13384] [[25869,1],2] ORTE_ERROR_LOG: A message is attempting
> to be sent to a process whose contact information is unknown in file
> base/plm_base_proxy.c at line 81
> .....
> =>> PBS: job killed: walltime 3641 exceeded limit 3600
> mpirun: killing job...
>
> [ngpu-a4-06:13356] [[25869,0],0]-[[25869,1],3] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> [ngpu-a4-06:13356] [[25869,0],0]-[[25869,1],2] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> [ngpu-a4-06:13356] [[25869,0],0]-[[25869,1],0] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> [ngpu-a4-06:13356] [[25869,0],0]-[[25869,1],1] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
>
>
> Any idea what is wrong?
>
> Thank you very much!
>
> Max
>
> -----Original Message-----
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of Ebert
> Maximilian
> Sent: Wednesday, 7 January 2015 14:43
> To: gmx-users at gromacs.org
> Subject: Re: [gmx-users] Working on a GPU cluster with GROMACS 5
>
> Hi Carsten,
>
> thanks again for your reply. The way our cluster is set up, you request
> GPUs (not CPUs) with the ppn option, which is why I put 4 there. But to
> rule out the possibility that someone else is actually using the node, I
> requested all 7 GPUs (i.e. the whole node) and used -gpu_id to assign only
> the first 4 to GROMACS. I still get the same error. I also tried -gpu_id 00
> or -gpu_id 4444 to switch which GPU is used and to use only a single GPU,
> but I always get:
>
> NOTE: You assigned GPUs to multiple MPI processes.
>
> -------------------------------------------------------
> Program gmx_mpi, VERSION 5.0.1
> Source code file:
> /RQusagers/rqchpbib/stubbsda/gromacs-5.0.1/src/gromacs/gmxlib/cuda_tools/
> pmalloc_cuda.cu, line: 61
>
> Fatal error:
> cudaMallocHost of size 4 bytes failed: all CUDA-capable devices are busy
> or unavailable
>
> For more information and tips for troubleshooting, please check the
> GROMACS website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> Error on rank 1, will try to stop all ranks Halting parallel program
> gmx_mpi on CPU 1 out of 4
>
> -----Original Message-----
> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
> gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of Carsten
> Kutzner
> Sent: Wednesday, 7 January 2015 14:13
> To: gmx-users at gromacs.org
> Subject: Re: [gmx-users] Working on a GPU cluster with GROMACS 5
>
> Hi Max,
>
> On 07 Jan 2015, at 11:36, Ebert Maximilian <m.ebert at umontreal.ca> wrote:
>
> > Hi Carsten,
> >
> > thanks for your answer. I tried what you described and it is basically
> working, except for letting multiple MPI ranks use one GPU. In my setup I
> use 4 GPUs with 8 MPI ranks, hence 8 CPU cores and 1 OpenMP thread per
> rank. This is how I start GROMACS:
> >
> > mpirun -np 8 gmx_mpi mdrun -gpu_id 00112233 -v -x -deffnm run1ns -s
> > ../run1ns.tpr
> >
> > and I submit this using:
> >
> > qsub -q @test -lnodes=1:ppn=4 -lwalltime=1:00:00 gromacs_run_gpu
> why are you using ppn=4? Shouldn't that be 8?
>
> >
> > Now I get the following errors (the output is longer but to keep it
> shorter I omitted the rest):
> >
> > Using 8 MPI processes
> > Using 1 OpenMP thread per MPI process
> >
> > 7 GPUs detected on host ngpu-a4-06:
> >  #0: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> > compatible
> >  #1: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> > compatible
> >  #2: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> > compatible
> >  #3: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> > compatible
> >  #4: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> > compatible
> >  #5: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> > compatible
> >  #6: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> > compatible
> >
> > 4 GPUs user-selected for this run.
> > Mapping of GPUs to the 8 PP ranks in this node: #0, #0, #1, #1, #2,
> > #2, #3, #3
> >
> > NOTE: You assigned GPUs to multiple MPI processes.
> >
> > -------------------------------------------------------
> > Program gmx_mpi, VERSION 5.0.1
> > Source code file:
> > /RQusagers/rqchpbib/stubbsda/gromacs-5.0.1/src/gromacs/gmxlib/cuda_too
> > ls/pmalloc_cuda.cu, line: 61
> >
> > Fatal error:
> > cudaMallocHost of size 4 bytes failed: all CUDA-capable devices are
> > busy or unavailable
> >
> Could it be that someone else's processes are running on that node while
> Gromacs tries to use the GPUs? Maybe try to get the whole node, maybe even
> in interactive mode, to play around.
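>
> For a quick check, if nvidia-smi is available on that node, running it
> there should list any compute processes already using the GPUs and show
> each GPU's compute mode:
>
>   nvidia-smi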
>
> Carsten
>
> > For more information and tips for troubleshooting, please check the
> > GROMACS website at http://www.gromacs.org/Documentation/Errors
> > -------------------------------------------------------
> >
> > Error on rank 1, will try to stop all ranks Halting parallel program
> > gmx_mpi on CPU 1 out of 8
> >
> > -------------------------------------------------------
> > Program gmx_mpi, VERSION 5.0.1
> > Source code file:
> > /RQusagers/rqchpbib/stubbsda/gromacs-5.0.1/src/gromacs/gmxlib/cuda_too
> > ls/pmalloc_cuda.cu, line: 61
> >
> > Fatal error:
> > cudaMallocHost of size 4 bytes failed: all CUDA-capable devices are
> > busy or unavailable
> >
> > For more information and tips for troubleshooting, please check the
> > GROMACS website at http://www.gromacs.org/Documentation/Errors
> > -------------------------------------------------------
> >
> > Error on rank 3, will try to stop all ranks Halting parallel program
> > gmx_mpi on CPU 3 out of 8
> >
> > -----Original Message-----
> > From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se
> > [mailto:gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf
> > Of Carsten Kutzner
> > Sent: Thursday, 18 December 2014 17:27
> > To: gmx-users at gromacs.org
> > Subject: Re: [gmx-users] Working on a GPU cluster with GROMACS 5
> >
> > Hi Max,
> >
> > On 18 Dec 2014, at 15:30, Ebert Maximilian <m.ebert at umontreal.ca> wrote:
> >
> >> Dear list,
> >>
> >> I am benchmarking my system on a GPU cluster with 6 GPUs and two quad-
> core CPUs per node. First I am wondering if there is any output which
> confirms how many CPUs and GPUs were used during the run? I find the output
> for GPUs in the log file but only for a single node. When I use multiple
> nodes why don't the other nodes show up in the log file as hosts? For
> instance in this example I used two nodes and claimed 4 GPUs each but got
> this in my log file:
> >>
> >> 6 GPUs detected on host ngpu-a4-01:
> >> #0: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> >> compatible
> >> #1: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> >> compatible
> >> #2: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> >> compatible
> >> #3: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> >> compatible
> >> #4: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> >> compatible
> >> #5: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat:
> >> compatible
> >>
> >> 4 GPUs auto-selected for this run.
> >> Mapping of GPUs to the 4 PP ranks in this node: #0, #1, #2, #3
> > This will be the same across all nodes. Gromacs will refuse to run if
> there are not enough GPUs on any of your other nodes.
> >
> >>
> >>
> >>
> >> ngpu-a4-02 is not shown here. Any idea? The job was submitted in the
> following way:
> >>
> >> qsub -q @test -lnodes=2:ppn=4 -lwalltime=1:00:00 gromacs_run_gpu
> >>
> >> and the gromacs_run_gpu file:
> >>
> >> #!/bin/csh
> >> #
> >>
> >> #PBS -o result_run10ns96-8.dat
> >> #PBS -j oe
> >> #PBS -W umask=022
> >> #PBS -r n
> >>
> >> cd 8_gpu
> >>
> >> module add CUDA
> >> module load gromacs/5.0.1-gpu
> >>
> >> mpirun gmx_mpi mdrun -v -x -deffnm 10ns_rep1-8GPU
> >>
> >>
> >> Another question I had was how can I define the number of CPUs and
> check if they were really used?
> > Use -ntomp to control how many OpenMP threads each of your MPI processes
> will have.
> > This way you can make use of all cores you have on each node.
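> >
> > For example, on an 8-core node with 4 GPUs, something like the following
> > (just a sketch; the -deffnm name is only an example) would give 4 PP
> > ranks with 2 OpenMP threads each, one rank per GPU:
> >
> >   mpirun -np 4 gmx_mpi mdrun -ntomp 2 -gpu_id 0123 -v -x -deffnm run1ns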
> >
> >> I can't find any information about the number of CPUs in the log file.
> > Look for
> > "Using . MPI processes"
> > "Using . OpenMP threads per MPI process"
> > in the log file.
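> > E.g. something like this (the log file name is only an example; mdrun
> > names it after -deffnm):
> >
> >   grep -E "Using .* (MPI process|OpenMP thread)" run1ns.log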
> >
> >> I would also like to try combinations like 4 CPUs + 1 GPU
> > You can use the -gpu_id switch to supply a list of eligible GPUs (see
> mdrun -h).
> > If you just want to use the first GPU on your node with, e.g., 4 MPI
> processes, use -gpu_id 0000.
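> >
> > For example (other mdrun options as in your earlier command lines; the
> > -deffnm name is only an example):
> >
> >   mpirun -np 4 gmx_mpi mdrun -gpu_id 0000 -v -x -deffnm run1ns
> >     (4 ranks, all sharing GPU 0)
> >   mpirun -np 8 gmx_mpi mdrun -gpu_id 00112233 -v -x -deffnm run1ns
> >     (8 ranks, two per GPU on GPUs 0-3)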
> >
> > Best,
> >  Carsten
> >
> >
> >
> >> or 2 CPUs + 2 GPU. How do I set this up?
> >>
> >> Thank you very much for your help,
> >>
> >> Max
> >>
> >
> >
> > --
> > Dr. Carsten Kutzner
> > Max Planck Institute for Biophysical Chemistry Theoretical and
> > Computational Biophysics Am Fassberg 11, 37077 Goettingen, Germany
> > Tel. +49-551-2012313, Fax: +49-551-2012302
> > http://www.mpibpc.mpg.de/grubmueller/kutzner
> > http://www.mpibpc.mpg.de/grubmueller/sppexa
> >
>
>
> --
> Dr. Carsten Kutzner
> Max Planck Institute for Biophysical Chemistry Theoretical and
> Computational Biophysics Am Fassberg 11, 37077 Goettingen, Germany Tel.
> +49-551-2012313, Fax: +49-551-2012302
> http://www.mpibpc.mpg.de/grubmueller/kutzner
> http://www.mpibpc.mpg.de/grubmueller/sppexa
>
>

