[gmx-users] strange GPU load distribution

Justin Lemkul jalemkul at vt.edu
Mon May 7 00:03:54 CEST 2018



On 5/6/18 5:51 PM, Alex wrote:
> Unfortunately, we're still bogged down when the EM runs (example 
> below) start -- CPU usage by these jobs is initially low, while their 
> PIDs show up in nvidia-smi. After about a minute everything goes back 
> to normal, but because the user launches these runs frequently (from a 
> script), everything is slowed down by a large factor. Interestingly, 
> another user is running a GPU job with a different MD package (LAMMPS), 
> and that GPU is never touched by these EM jobs.
>
> Any ideas will be greatly appreciated.
>
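[For reference, per-process GPU usage of the kind described above can be 
watched with a stock nvidia-smi query; these are standard nvidia-smi 
options, not something from the original message, and the one-second 
refresh interval is arbitrary:

    # list compute processes with PID, name, and memory use, refreshing every second
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 1
]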

Thinking out loud - a run that explicitly calls for only the CPU to be 
used might still be trying to detect GPUs if mdrun is GPU-enabled. Is 
that a possibility, including any latency in detecting those devices? 
Have you tested to make sure that an mdrun binary built without GPU 
support (-DGMX_GPU=OFF) doesn't affect the GPU usage when running the 
same command?
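
A minimal sketch of building such a test binary, assuming an 
out-of-source CMake build (the install prefix is only a placeholder):

    # configure a separate GROMACS build with GPU support compiled out
    cmake .. -DGMX_GPU=OFF -DCMAKE_INSTALL_PREFIX=/opt/gromacs-cpu
    make -j 4 && make install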

-Justin

> Thanks,
>
> Alex
>
>
>> PID TTY      STAT TIME COMMAND
>>
>> 60432 pts/8    Dl+ 0:01 gmx mdrun -table ../../../tab_it.xvg -nt 1 -nb cpu -pme cpu -deffnm em_steep
>>
>>
>>
>
>
>> On 4/27/2018 2:16 PM, Mark Abraham wrote:
>>> Hi,
>>>
>>> What you think was run isn't nearly as useful when troubleshooting as
>>> asking the kernel what is actually running.
>>>
>>> Mark
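[For example, one generic way to ask the kernel for the full command 
lines of all running gmx processes -- a stock ps invocation, not 
something prescribed in the original message:

    # show PID and complete arguments for every process named gmx
    ps -C gmx -o pid,args
]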
>>>
>>>
>>> On Fri, Apr 27, 2018, 21:59 Alex <nedomacho at gmail.com> wrote:
>>>
>>>> Mark, I copied the exact command line from the script, right above the
>>>> mdp file. It is literally how the script calls mdrun in this case:
>>>>
>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>
>>>>
>>>> On 4/27/2018 1:52 PM, Mark Abraham wrote:
>>>>> Group cutoff scheme can never run on a GPU, so none of that 
>>>>> should matter.
>>>>> Use ps and find out what the command lines were.
>>>>>
>>>>> Mark
>>>>>
>>>>> On Fri, Apr 27, 2018, 21:37 Alex <nedomacho at gmail.com> wrote:
>>>>>
>>>>>> Update: we're removing commands one by one from the script that 
>>>>>> submits the jobs causing the issue. The culprits are both the EM 
>>>>>> and the MD runs, and the GPUs are being affected _before_ MD 
>>>>>> starts loading the CPU, i.e. during the initial setup of the EM 
>>>>>> run -- CPU load is near zero while nvidia-smi reports the mess. I 
>>>>>> wonder if this is in any way related to that timing test we were 
>>>>>> failing a while back. The mdrun call and mdp are below, though I 
>>>>>> suspect they have nothing to do with what is happening. Any help 
>>>>>> will be very highly appreciated.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> ***
>>>>>>
>>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>>>
>>>>>> mdp:
>>>>>>
>>>>>> ; Run control
>>>>>> integrator               = md-vv       ; Velocity Verlet
>>>>>> tinit                    = 0
>>>>>> dt                       = 0.002
>>>>>> nsteps                   = 500000    ; 1 ns
>>>>>> nstcomm                  = 100
>>>>>> ; Output control
>>>>>> nstxout                  = 50000
>>>>>> nstvout                  = 50000
>>>>>> nstfout                  = 0
>>>>>> nstlog                   = 50000
>>>>>> nstenergy                = 50000
>>>>>> nstxout-compressed       = 0
>>>>>> ; Neighborsearching and short-range nonbonded interactions
>>>>>> cutoff-scheme            = group
>>>>>> nstlist                  = 10
>>>>>> ns_type                  = grid
>>>>>> pbc                      = xyz
>>>>>> rlist                    = 1.4
>>>>>> ; Electrostatics
>>>>>> coulombtype              = cutoff
>>>>>> rcoulomb                 = 1.4
>>>>>> ; van der Waals
>>>>>> vdwtype                  = user
>>>>>> vdw-modifier             = none
>>>>>> rvdw                     = 1.4
>>>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>>>> DispCorr                  = EnerPres
>>>>>> ; Spacing for the PME/PPPM FFT grid
>>>>>> fourierspacing           = 0.12
>>>>>> ; EWALD/PME/PPPM parameters
>>>>>> pme_order                = 6
>>>>>> ewald_rtol               = 1e-06
>>>>>> epsilon_surface          = 0
>>>>>> ; Temperature coupling
>>>>>> Tcoupl                   = nose-hoover
>>>>>> tc_grps                  = system
>>>>>> tau_t                    = 1.0
>>>>>> ref_t                    = some_temperature
>>>>>> ; Pressure coupling is off for NVT
>>>>>> Pcoupl                   = No
>>>>>> tau_p                    = 0.5
>>>>>> compressibility          = 4.5e-05
>>>>>> ref_p                    = 1.0
>>>>>> ; options for bonds
>>>>>> constraints              = all-bonds
>>>>>> constraint_algorithm     = lincs
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 27, 2018 at 1:14 PM, Alex <nedomacho at gmail.com> wrote:
>>>>>>
>>>>>>> As I said, only two users, and nvidia-smi shows the process 
>>>>>>> name. We're investigating, and it does appear that the culprit is 
>>>>>>> EM, which uses cutoff electrostatics; as a result, the user did 
>>>>>>> not bother with -pme cpu in the mdrun call. What would be the 
>>>>>>> correct way to enforce a CPU-only mdrun when coulombtype = cutoff?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Alex
>>>>>>>
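[A minimal sketch of what enforcing a CPU-only run could look like here; 
the mdrun flags are standard, and GMX_DISABLE_GPU_DETECTION is a GROMACS 
environment variable documented to skip GPU detection entirely -- verify 
that your version supports it:

    # keep nonbonded and PME on the CPU, and skip GPU detection altogether
    GMX_DISABLE_GPU_DETECTION=1 gmx mdrun -nt 1 -nb cpu -pme cpu -deffnm em_steep
]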
>>>>>>> On Fri, Apr 27, 2018 at 12:45 PM, Mark Abraham 
>>>>>>> <mark.j.abraham at gmail.com> wrote:
>>>>>>>
>>>>>>>> No.
>>>>>>>>
>>>>>>>> Look at the processes that are running, e.g. with top or ps. 
>>>>>>>> Either old simulations or another user is running.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Fri, Apr 27, 2018, 20:33 Alex <nedomacho at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Strange. There are only two people using this machine, myself 
>>>>>>>>> being one of them, and the other person specifically forces 
>>>>>>>>> -nb cpu -pme cpu in his calls to mdrun. Are any other GMX 
>>>>>>>>> utilities (e.g. insert-molecules, grompp, or energy) trying to 
>>>>>>>>> use GPUs?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>> On Fri, Apr 27, 2018 at 5:33 AM, Szilárd Páll 
>>>>>>>>> <pall.szilard at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The second column shows PIDs, so there is a whole lot more 
>>>>>>>>>> going on there than just a single simulation with a single 
>>>>>>>>>> rank using two GPUs. That would be one PID with two entries 
>>>>>>>>>> for the two GPUs. Are you sure you're not running other 
>>>>>>>>>> processes?
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Szilárd
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 26, 2018 at 5:52 AM, Alex <nedomacho at gmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I am running GMX 2018 with gmx mdrun -pinoffset 0 -pin on 
>>>>>>>>>>> -nt 24 -ntmpi 4 -npme 1 -pme gpu -nb gpu -gputasks 1122
>>>>>>>>>>>
>>>>>>>>>>> Once in a while the simulation slows down and nvidia-smi 
>>>>>>>>>>> reports something like this:
>>>>>>>>>>>
>>>>>>>>>>> |    1     12981      C   gmx                         175MiB |
>>>>>>>>>>> |    2     12981      C   gmx                         217MiB |
>>>>>>>>>>> |    2     13083      C   gmx                         161MiB |
>>>>>>>>>>> |    2     13086      C   gmx                         159MiB |
>>>>>>>>>>> |    2     13089      C   gmx                         139MiB |
>>>>>>>>>>> |    2     13093      C   gmx                         163MiB |
>>>>>>>>>>> |    2     13096      C   gmx                          11MiB |
>>>>>>>>>>> |    2     13099      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13102      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13106      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13109      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13112      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13115      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13119      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13122      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13125      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13128      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13131      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13134      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13138      C   gmx                           8MiB |
>>>>>>>>>>> |    2     13141      C   gmx                           8MiB |
>>>>>>>>>>> +-----------------------------------------------------------------------------+
>>>>>>>>>>>
>>>>>>>>>>> Then it goes back to the expected load. Is this normal?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>>>>
>>
>

-- 
==================================================

Justin A. Lemkul, Ph.D.
Assistant Professor
Virginia Tech Department of Biochemistry

303 Engel Hall
340 West Campus Dr.
Blacksburg, VA 24061

jalemkul at vt.edu | (540) 231-3129
http://www.thelemkullab.com

==================================================


