[gmx-users] strange GPU load distribution

Alex nedomacho at gmail.com
Sun May 6 23:52:00 CEST 2018


Unfortunately, we're still bogged down whenever these EM runs (example
below) start: their CPU usage is initially low, yet their PIDs show up
in nvidia-smi. After about a minute everything goes back to normal.
Because the user launches these runs frequently (scripted), everything
is slowed down by a large factor. Interestingly, we have another user
utilizing a GPU with a different MD package (LAMMPS), and that GPU is
never touched by these EM jobs.

Any ideas will be greatly appreciated.

Thanks,

Alex


> PID   TTY      STAT TIME COMMAND
>
> 60432 pts/8    Dl+  0:01 gmx mdrun -table ../../../tab_it.xvg -nt 1 -nb cpu -pme cpu -deffnm em_steep
>
>
>
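
For anyone chasing something similar, a minimal way to correlate the
nvidia-smi entries with actual command lines (a sketch; it assumes a
reasonably recent nvidia-smi and the standard procps ps) is:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
ps -o pid,args -p <PID from the query above>

The first command lists the PIDs currently holding a GPU compute
context; feeding a PID into ps shows the full mdrun invocation, so one
can check whether a job launched with -nb cpu -pme cpu is nevertheless
opening a GPU context.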


> On 4/27/2018 2:16 PM, Mark Abraham wrote:
>> Hi,
>>
>> What you think was run isn't nearly as useful when troubleshooting as
>> asking the kernel what is actually running.
>>
>> Mark
>>
>>
>> On Fri, Apr 27, 2018, 21:59 Alex <nedomacho at gmail.com> wrote:
>>
>>> Mark, I copied the exact command line from the script, right above the
>>> mdp file. It is literally how the script calls mdrun in this case:
>>>
>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>
>>>
>>> On 4/27/2018 1:52 PM, Mark Abraham wrote:
>>>> Group cutoff scheme can never run on a gpu, so none of that should
>>> matter.
>>>> Use ps and find out what the command lines were.
>>>>
>>>> Mark
>>>>
>>>> On Fri, Apr 27, 2018, 21:37 Alex <nedomacho at gmail.com> wrote:
>>>>
>>>>> Update: we're basically removing commands one by one from the script
>>>>> that submits the jobs causing the issue. The culprit is both the EM and
>>>>> the MD run: the GPUs are affected _before_ MD starts loading the CPU,
>>>>> i.e. during the initial setup of the EM run -- CPU load is near zero
>>>>> while nvidia-smi reports the mess. I wonder if this is in any way
>>>>> related to that timing test we were failing a while back.
>>>>> The mdrun call and mdp are below, though I suspect they have nothing to
>>>>> do with what is happening. Any help will be very highly appreciated.
>>>>>
>>>>> Alex
>>>>>
>>>>> ***
>>>>>
>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>>
>>>>> mdp:
>>>>>
>>>>> ; Run control
>>>>> integrator               = md-vv       ; Velocity Verlet
>>>>> tinit                    = 0
>>>>> dt                       = 0.002
>>>>> nsteps                   = 500000    ; 1 ns
>>>>> nstcomm                  = 100
>>>>> ; Output control
>>>>> nstxout                  = 50000
>>>>> nstvout                  = 50000
>>>>> nstfout                  = 0
>>>>> nstlog                   = 50000
>>>>> nstenergy                = 50000
>>>>> nstxout-compressed       = 0
>>>>> ; Neighborsearching and short-range nonbonded interactions
>>>>> cutoff-scheme            = group
>>>>> nstlist                  = 10
>>>>> ns_type                  = grid
>>>>> pbc                      = xyz
>>>>> rlist                    = 1.4
>>>>> ; Electrostatics
>>>>> coulombtype              = cutoff
>>>>> rcoulomb                 = 1.4
>>>>> ; van der Waals
>>>>> vdwtype                  = user
>>>>> vdw-modifier             = none
>>>>> rvdw                     = 1.4
>>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>>> DispCorr                  = EnerPres
>>>>> ; Spacing for the PME/PPPM FFT grid
>>>>> fourierspacing           = 0.12
>>>>> ; EWALD/PME/PPPM parameters
>>>>> pme_order                = 6
>>>>> ewald_rtol               = 1e-06
>>>>> epsilon_surface          = 0
>>>>> ; Temperature coupling
>>>>> Tcoupl                   = nose-hoover
>>>>> tc_grps                  = system
>>>>> tau_t                    = 1.0
>>>>> ref_t                    = some_temperature
>>>>> ; Pressure coupling is off for NVT
>>>>> Pcoupl                   = No
>>>>> tau_p                    = 0.5
>>>>> compressibility          = 4.5e-05
>>>>> ref_p                    = 1.0
>>>>> ; options for bonds
>>>>> constraints              = all-bonds
>>>>> constraint_algorithm     = lincs
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 27, 2018 at 1:14 PM, Alex <nedomacho at gmail.com> wrote:
>>>>>
>>>>>> As I said, there are only two users, and nvidia-smi shows the process
>>>>>> name. We're investigating, and it does appear that the culprit is the
>>>>>> EM, which uses cutoff electrostatics; as a result the user did not
>>>>>> bother with -pme cpu in the mdrun call. What would be the correct way
>>>>>> to enforce a CPU-only mdrun when coulombtype = cutoff?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>>>
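As an aside, a minimal way to keep such a run off the GPUs entirely (a
sketch, not something taken from this thread) is to force both the
nonbonded and the PME work onto the CPU and, if necessary, hide the
devices from the CUDA runtime:

CUDA_VISIBLE_DEVICES="" gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm em_steep

Setting CUDA_VISIBLE_DEVICES to an empty value is a generic CUDA
mechanism rather than anything GROMACS-specific; GROMACS also documents
a GMX_DISABLE_GPU_DETECTION environment variable that should prevent
GPU detection altogether, though whether either is actually needed here
is an assumption on my part.
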
>>>>>> On Fri, Apr 27, 2018 at 12:45 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
>>>>>>
>>>>>>> No.
>>>>>>>
>>>>>>> Look at the processes that are running, e.g. with top or ps. Either
>>>>>>> old simulations or another user is running.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On Fri, Apr 27, 2018, 20:33 Alex <nedomacho at gmail.com> wrote:
>>>>>>>
>>>>>>>> Strange. There are only two people using this machine, myself being
>>>>>>>> one of them, and the other person specifically forces -nb cpu -pme
>>>>>>>> cpu in his calls to mdrun. Are any other GMX utilities (e.g.
>>>>>>>> insert-molecules, grompp, or energy) trying to use GPUs?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> On Fri, Apr 27, 2018 at 5:33 AM, Szilárd Páll <pall.szilard at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The second column is PIDs, so there is a whole lot more going on
>>>>>>>>> there than just a single simulation, single rank, using two GPUs.
>>>>>>>>> That would be one PID and two entries for the two GPUs. Are you
>>>>>>>>> sure you're not running other processes?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Szilárd
>>>>>>>>>
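For context on why one PID with two entries would be the expected
picture (my reading of GROMACS 2018 task assignment, so treat the
details as an assumption): with thread-MPI, -ntmpi 4 -npme 1 produces a
single gmx process containing three PP ranks and one PME rank, and
-gputasks 1122 assigns those four GPU tasks in rank order, i.e. GPU 1
to ranks 0-1 and GPU 2 to ranks 2-3 (the last being the PME rank).
Since all ranks live in the same process, nvidia-smi should show
exactly one gmx PID on GPU 1 and that same PID on GPU 2; the many
additional PIDs on GPU 2 in the listing below would therefore have to
come from separate gmx invocations.
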
>>>>>>>>> On Thu, Apr 26, 2018 at 5:52 AM, Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I am running GMX 2018 with gmx mdrun -pinoffset 0 -pin on -nt 24
>>>>>>>>>> -ntmpi 4 -npme 1 -pme gpu -nb gpu -gputasks 1122
>>>>>>>>>>
>>>>>>>>>> Once in a while the simulation slows down and nvidia-smi reports
>>>>>>>>>> something like this:
>>>>>>>>>>
>>>>>>>>>> |    1     12981      C   gmx                              175MiB |
>>>>>>>>>> |    2     12981      C   gmx                              217MiB |
>>>>>>>>>> |    2     13083      C   gmx                              161MiB |
>>>>>>>>>> |    2     13086      C   gmx                              159MiB |
>>>>>>>>>> |    2     13089      C   gmx                              139MiB |
>>>>>>>>>> |    2     13093      C   gmx                              163MiB |
>>>>>>>>>> |    2     13096      C   gmx                               11MiB |
>>>>>>>>>> |    2     13099      C   gmx                                8MiB |
>>>>>>>>>> |    2     13102      C   gmx                                8MiB |
>>>>>>>>>> |    2     13106      C   gmx                                8MiB |
>>>>>>>>>> |    2     13109      C   gmx                                8MiB |
>>>>>>>>>> |    2     13112      C   gmx                                8MiB |
>>>>>>>>>> |    2     13115      C   gmx                                8MiB |
>>>>>>>>>> |    2     13119      C   gmx                                8MiB |
>>>>>>>>>> |    2     13122      C   gmx                                8MiB |
>>>>>>>>>> |    2     13125      C   gmx                                8MiB |
>>>>>>>>>> |    2     13128      C   gmx                                8MiB |
>>>>>>>>>> |    2     13131      C   gmx                                8MiB |
>>>>>>>>>> |    2     13134      C   gmx                                8MiB |
>>>>>>>>>> |    2     13138      C   gmx                                8MiB |
>>>>>>>>>> |    2     13141      C   gmx                                8MiB |
>>>>>>>>>> +-----------------------------------------------------------------+
>>>>>>>>>>
>>>>>>>>>> Then it goes back to the expected load. Is this normal?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>>
>


