[gmx-users] strange GPU load distribution

Mark Abraham mark.j.abraham at gmail.com
Mon May 7 01:14:02 CEST 2018


Hi,

In 2018 and 2018.1, mdrun does indeed run GPU detection and compatibility
checks before any logic about whether it should use any GPUs that were in
fact detected. However, there's nothing about those checks that should a)
take any noticeable time, b) acquire any ongoing resources, or c) lead to
persistent occupancy of the GPUs after the simulation process completes.
Those combined observations point to something about the installation of
the GPUs / runtime / SDK / drivers. What distro are you using? Is there
maybe some kind of security feature enabled that could be interfering? Are
the GPUs configured to use some kind of process-exclusive mode?
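
If it helps, a quick way to check that last point is to query the compute mode
directly; something along these lines should show it (the device index 0 is
just an example, and changing the mode needs root):

  nvidia-smi --query-gpu=index,name,compute_mode --format=csv
  # if a GPU reports Exclusive_Process or Prohibited, it can be reset with:
  sudo nvidia-smi -i 0 -c 0    # 0 = DEFAULT (shared) compute mode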

Mark

On Mon, May 7, 2018 at 12:14 AM Justin Lemkul <jalemkul at vt.edu> wrote:

>
>
> On 5/6/18 6:11 PM, Alex wrote:
> > A separate CPU-only build is what we were going to try, but if it
> > succeeds in not touching the GPUs, then what -- keep several builds?
> >
>
> If your CPU-only build produces a run that doesn't touch the GPU
> (which it shouldn't), that test would rather conclusively show that if
> the user requests a CPU-only run, then the mdrun code needs to be
> patched so that GPU detection is not carried out. If that's the case,
> yes, you'd have to wait for a patch and in the meantime maintain two
> different mdrun binaries, but it would be a valuable bit of
> information for the dev team.
>
> > That latency you mention is definitely there; I think it is related to
> > my earlier report of one of the regression tests failing (I think Mark
> > might remember that one). That failure, by the way, persists with the
> > 2018.1 we just installed on a completely different machine.
>
> I seem to recall that, which is what got me thinking.
>
> -Justin
>
> >
> > Alex
> >
> >
> > On 5/6/2018 4:03 PM, Justin Lemkul wrote:
> >>
> >>
> >> On 5/6/18 5:51 PM, Alex wrote:
> >>> Unfortunately, we're still bogged down when the EM runs (example
> >>> below) start -- CPU usage by these jobs is initially low, while
> >>> their PIDs show up in nvidia-smi. After about a minute all goes back
> >>> to normal. Because the user is doing it frequently (scripted),
> >>> everything is slowed down by a large factor. Interestingly, we have
> >>> another user utilizing a GPU with another MD package (LAMMPS) and
> >>> that GPU is never touched by these EM jobs.
> >>>
> >>> Any ideas will be greatly appreciated.
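
(One low-tech way to see exactly which processes are touching the GPUs while
those EM jobs start up is to keep nvidia-smi running in a loop and then ask
the kernel about any PID that appears there; the one-second interval is
arbitrary and the PID below is just the example from this thread:

  watch -n 1 nvidia-smi
  ps -o pid,ppid,stat,etime,args -p 60432
)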
> >>>
> >>
> >> Thinking out loud - a run that explicitly calls for only the CPU to
> >> be used might still be trying to detect GPUs if mdrun is GPU-enabled.
> >> Is that a possibility, including any latency in detecting those
> >> devices? Have you tested whether an mdrun binary built with GPU
> >> support explicitly disabled (-DGMX_GPU=OFF) leaves the GPU usage
> >> unaffected when running the same command?
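
(For reference, a minimal sketch of such a CPU-only build, assuming the
2018.1 source tree; the build directory and install prefix are just example
paths:

  cd gromacs-2018.1 && mkdir build-cpu && cd build-cpu
  cmake .. -DGMX_GPU=OFF -DGMX_BUILD_OWN_FFTW=ON \
        -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2018.1-cpu
  make -j 8 && make install
  source $HOME/gromacs-2018.1-cpu/bin/GMXRC

A binary built that way contains no GPU code at all, so it makes a clean
control for the test described above.)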
> >>
> >> -Justin
> >>
> >>> Thanks,
> >>>
> >>> Alex
> >>>
> >>>
> >>>> PID TTY      STAT TIME COMMAND
> >>>>
> >>>> 60432 pts/8    Dl+ 0:01 gmx mdrun -table ../../../tab_it.xvg -nt 1 -nb cpu -pme cpu -deffnm em_steep
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>> On 4/27/2018 2:16 PM, Mark Abraham wrote:
> >>>>> Hi,
> >>>>>
> >>>>> What you think was run isn't nearly as useful when troubleshooting as
> >>>>> asking the kernel what is actually running.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>>
> >>>>> On Fri, Apr 27, 2018, 21:59 Alex<nedomacho at gmail.com> wrote:
> >>>>>
> >>>>>> Mark, I copied the exact command line from the script, right above
> >>>>>> the mdp file. It is literally how the script calls mdrun in this case:
> >>>>>>
> >>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
> >>>>>>
> >>>>>>
> >>>>>> On 4/27/2018 1:52 PM, Mark Abraham wrote:
> >>>>>>> The group cutoff scheme can never run on a GPU, so none of that
> >>>>>>> should matter.
> >>>>>>> Use ps and find out what the command lines were.
> >>>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> On Fri, Apr 27, 2018, 21:37 Alex<nedomacho at gmail.com>  wrote:
> >>>>>>>
> >>>>>>>> Update: we're basically removing commands one by one from the
> >>>>>>>> script that submits the jobs causing the issue. The culprit is
> >>>>>>>> both EM and the MD run: GPUs are being affected _before_ MD
> >>>>>>>> starts loading the CPU, i.e. this is the initial setting up of
> >>>>>>>> the EM run -- CPU load is near zero, nvidia-smi reports the
> >>>>>>>> mess. I wonder if this is in any way related to that timing
> >>>>>>>> test we were failing a while back. The mdrun call and mdp are
> >>>>>>>> below, though I suspect they have nothing to do with what is
> >>>>>>>> happening. Any help will be very highly appreciated.
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>>>
> >>>>>>>> ***
> >>>>>>>>
> >>>>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
> >>>>>>>>
> >>>>>>>> mdp:
> >>>>>>>>
> >>>>>>>> ; Run control
> >>>>>>>> integrator               = md-vv       ; Velocity Verlet
> >>>>>>>> tinit                    = 0
> >>>>>>>> dt                       = 0.002
> >>>>>>>> nsteps                   = 500000    ; 1 ns
> >>>>>>>> nstcomm                  = 100
> >>>>>>>> ; Output control
> >>>>>>>> nstxout                  = 50000
> >>>>>>>> nstvout                  = 50000
> >>>>>>>> nstfout                  = 0
> >>>>>>>> nstlog                   = 50000
> >>>>>>>> nstenergy                = 50000
> >>>>>>>> nstxout-compressed       = 0
> >>>>>>>> ; Neighborsearching and short-range nonbonded interactions
> >>>>>>>> cutoff-scheme            = group
> >>>>>>>> nstlist                  = 10
> >>>>>>>> ns_type                  = grid
> >>>>>>>> pbc                      = xyz
> >>>>>>>> rlist                    = 1.4
> >>>>>>>> ; Electrostatics
> >>>>>>>> coulombtype              = cutoff
> >>>>>>>> rcoulomb                 = 1.4
> >>>>>>>> ; van der Waals
> >>>>>>>> vdwtype                  = user
> >>>>>>>> vdw-modifier             = none
> >>>>>>>> rvdw                     = 1.4
> >>>>>>>> ; Apply long range dispersion corrections for Energy and Pressure
> >>>>>>>> DispCorr                  = EnerPres
> >>>>>>>> ; Spacing for the PME/PPPM FFT grid
> >>>>>>>> fourierspacing           = 0.12
> >>>>>>>> ; EWALD/PME/PPPM parameters
> >>>>>>>> pme_order                = 6
> >>>>>>>> ewald_rtol               = 1e-06
> >>>>>>>> epsilon_surface          = 0
> >>>>>>>> ; Temperature coupling
> >>>>>>>> Tcoupl                   = nose-hoover
> >>>>>>>> tc_grps                  = system
> >>>>>>>> tau_t                    = 1.0
> >>>>>>>> ref_t                    = some_temperature
> >>>>>>>> ; Pressure coupling is off for NVT
> >>>>>>>> Pcoupl                   = No
> >>>>>>>> tau_p                    = 0.5
> >>>>>>>> compressibility          = 4.5e-05
> >>>>>>>> ref_p                    = 1.0
> >>>>>>>> ; options for bonds
> >>>>>>>> constraints              = all-bonds
> >>>>>>>> constraint_algorithm     = lincs
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Apr 27, 2018 at 1:14 PM, Alex<nedomacho at gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>>> As I said, only two users, and nvidia-smi shows the process
> >>>>>>>>> name. We're investigating and it does appear that it is EM
> >>>>>>>>> that uses cutoff electrostatics, and as a result the user did
> >>>>>>>>> not bother with -pme cpu in the mdrun call. What would be the
> >>>>>>>>> correct way to enforce cpu-only mdrun when coulombtype = cutoff?
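
(As far as I know, with a GPU-enabled 2018 build the explicit way is to keep
forcing both the short-range nonbondeds and PME onto the CPU, as in the EM
jobs earlier in this thread:

  gmx mdrun -nt 1 -nb cpu -pme cpu -deffnm em_steep

and, if I remember correctly, exporting GMX_DISABLE_GPU_DETECTION before
calling mdrun skips the detection step entirely -- worth double-checking
against the documentation for your version.)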
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Alex
> >>>>>>>>>
> >>>>>>>>> On Fri, Apr 27, 2018 at 12:45 PM, Mark Abraham <
> >>>>>>>>> mark.j.abraham at gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> No.
> >>>>>>>>>>
> >>>>>>>>>> Look at the processes that are running, e.g. with top or ps.
> >>>>>>>>>> Either
> >>>>>> old
> >>>>>>>>>> simulations or another user is running.
> >>>>>>>>>>
> >>>>>>>>>> Mark
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Apr 27, 2018, 20:33 Alex<nedomacho at gmail.com>  wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Strange. There are only two people using this machine, myself
> >>>>>>>>>>> being one of them, and the other person specifically forces
> >>>>>>>>>>> -nb cpu -pme cpu in his calls to mdrun. Are any other GMX
> >>>>>>>>>>> utilities (e.g. insert-molecules, grompp, or energy) trying
> >>>>>>>>>>> to use GPUs?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Alex
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Apr 27, 2018 at 5:33 AM, Szilárd Páll <
> >>>>>>>>>>> pall.szilard at gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> The second column is PIDs, so there is a whole lot more going
> >>>>>>>>>>>> on there than just a single simulation, single rank, using
> >>>>>>>>>>>> two GPUs. That would be one PID and two entries for the two
> >>>>>>>>>>>> GPUs. Are you sure you're not running other processes?
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Szilárd
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 26, 2018 at 5:52 AM, Alex<nedomacho at gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am running GMX 2018 with gmx mdrun -pinoffset 0 -pin on
> >>>>>>>>>>>>> -nt 24 -ntmpi 4 -npme 1 -pme gpu -nb gpu -gputasks 1122
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Once in a while the simulation slows down and nvidia-smi
> >>>>>>>>>>>>> reports something like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> |    1     12981      C gmx                        175MiB |
> >>>>>>>>>>>>> |    2     12981      C gmx                        217MiB |
> >>>>>>>>>>>>> |    2     13083      C gmx                        161MiB |
> >>>>>>>>>>>>> |    2     13086      C gmx                        159MiB |
> >>>>>>>>>>>>> |    2     13089      C gmx                        139MiB |
> >>>>>>>>>>>>> |    2     13093      C gmx                        163MiB |
> >>>>>>>>>>>>> |    2     13096      C gmx                         11MiB |
> >>>>>>>>>>>>> |    2     13099      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13102      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13106      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13109      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13112      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13115      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13119      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13122      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13125      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13128      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13131      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13134      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13138      C gmx                          8MiB |
> >>>>>>>>>>>>> |    2     13141      C gmx                          8MiB |
> >>>>>>>>>>>>> +----------------------------------------------------------+
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then goes back to the expected load. Is this normal?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>
> >>>>
> >>>
> >>
> >
>
> --
> ==================================================
>
> Justin A. Lemkul, Ph.D.
> Assistant Professor
> Virginia Tech Department of Biochemistry
>
> 303 Engel Hall
> 340 West Campus Dr.
> Blacksburg, VA 24061
>
> jalemkul at vt.edu | (540) 231-3129
> http://www.thelemkullab.com
>
> ==================================================
>

