[gmx-users] strange GPU load distribution

Justin Lemkul jalemkul at vt.edu
Mon May 7 00:13:56 CEST 2018



On 5/6/18 6:11 PM, Alex wrote:
> A separate CPU-only build is what we were going to try, but if it
> succeeds in not touching the GPUs, then what -- keep several builds?
>

If your CPU-only build produces a run that doesn't touch the GPU (which 
it shouldn't), that test would rather conclusively show that when a 
user requests a CPU-only run, the mdrun code needs to be patched such 
that GPU detection is not carried out. If that's the case, yes, you'd 
have to wait for a patch and in the meantime maintain two different 
mdrun binaries, but it would be a valuable bit of information for the 
dev team.
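
For what it's worth, the second binary can live alongside the first 
under its own prefix. A minimal sketch, assuming a 2018.1 source tree 
(paths, job count, and install prefix are just examples):

  cd gromacs-2018.1
  mkdir build-cpu && cd build-cpu
  cmake .. -DGMX_GPU=OFF -DCMAKE_INSTALL_PREFIX=/opt/gromacs-2018.1-cpu
  make -j 8 && make install

Sourcing /opt/gromacs-2018.1-cpu/bin/GMXRC then puts the CPU-only gmx 
first on the PATH for whoever needs it.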

> That latency you mention is definitely there; I think it is related to 
> my earlier report of one of the regression tests failing (I think Mark 
> might remember that one). That failure, by the way, persists with the 
> 2018.1 we just installed on a completely different machine.

I seemed to recall that, which is what got me thinking.

-Justin

>
> Alex
>
>
> On 5/6/2018 4:03 PM, Justin Lemkul wrote:
>>
>>
>> On 5/6/18 5:51 PM, Alex wrote:
>>> Unfortunately, we're still bogged down when the EM runs (example 
>>> below) start -- CPU usage by these jobs is initially low, while 
>>> their PIDs show up in nvidia-smi. After about a minute all goes back 
>>> to normal. Because the user is doing it frequently (scripted), 
>>> everything is slowed down by a large factor. Interestingly, we have 
>>> another user utilizing a GPU with another MD package (LAMMPS) and 
>>> that GPU is never touched by these EM jobs.
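>>> (For the record, a loop along the lines of
>>>
>>> watch -n 5 'nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv'
>>>
>>> is enough to catch the EM PIDs appearing on the GPU; the exact query 
>>> fields may differ by driver version.)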
>>>
>>> Any ideas will be greatly appreciated.
>>>
>>
>> Thinking out loud - a run that explicitly requests only the CPU might 
>> still try to detect GPUs if mdrun was built with GPU support. Is that 
>> a possibility, including some latency in detecting the device? Have 
>> you tested whether an mdrun binary built with GPU support explicitly 
>> disabled (-DGMX_GPU=OFF) leaves GPU usage unaffected when running the 
>> same command?
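>>
>> In other words, run the same command under both binaries while 
>> watching nvidia-smi in another shell -- something like (the gmx_cpu 
>> name for the -DGMX_GPU=OFF build is hypothetical):
>>
>> gmx mdrun -nt 1 -nb cpu -pme cpu -deffnm em_steep
>> gmx_cpu mdrun -nt 1 -nb cpu -pme cpu -deffnm em_steep
>>
>> If only the first one shows up in nvidia-smi, that points squarely at 
>> the GPU detection pass.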
>>
>> -Justin
>>
>>> Thanks,
>>>
>>> Alex
>>>
>>>
>>>> PID TTY      STAT TIME COMMAND
>>>> 60432 pts/8   Dl+  0:01 gmx mdrun -table ../../../tab_it.xvg -nt 1 -nb cpu -pme cpu -deffnm em_steep
>>>>
>>>>
>>>>
>>>
>>>
>>>> On 4/27/2018 2:16 PM, Mark Abraham wrote:
>>>>> Hi,
>>>>>
>>>>> What you think was run isn't nearly as useful when troubleshooting as
>>>>> asking the kernel what is actually running.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On Fri, Apr 27, 2018, 21:59 Alex <nedomacho at gmail.com> wrote:
>>>>>
>>>>>> Mark, I copied the exact command line from the script, right above 
>>>>>> the mdp file. It is literally how the script calls mdrun in this case:
>>>>>>
>>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>>>
>>>>>>
>>>>>> On 4/27/2018 1:52 PM, Mark Abraham wrote:
>>>>>>> Group cutoff scheme can never run on a GPU, so none of that should matter.
>>>>>>> Use ps and find out what the command lines were.
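>>>>>>>
>>>>>>> For example,
>>>>>>>
>>>>>>> ps -eo pid,args | grep '[m]drun'
>>>>>>>
>>>>>>> prints the full command line of every mdrun actually running (the 
>>>>>>> brackets keep grep itself out of the output).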
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On Fri, Apr 27, 2018, 21:37 Alex <nedomacho at gmail.com> wrote:
>>>>>>>
>>>>>>>> Update: we're basically removing commands one by one from the 
>>>>>>>> script that submits the jobs causing the issue. The culprit is 
>>>>>>>> both the EM and the MD run, and the GPUs are being affected 
>>>>>>>> _before_ MD starts loading the CPU, i.e. during the initial 
>>>>>>>> setup of the EM run -- CPU load is near zero while nvidia-smi 
>>>>>>>> reports the mess. I wonder if this is in any way related to that 
>>>>>>>> timing test we were failing a while back. The mdrun call and mdp 
>>>>>>>> are below, though I suspect they have nothing to do with what is 
>>>>>>>> happening. Any help will be very highly appreciated.
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> ***
>>>>>>>>
>>>>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>>>>>
>>>>>>>> mdp:
>>>>>>>>
>>>>>>>> ; Run control
>>>>>>>> integrator               = md-vv       ; Velocity Verlet
>>>>>>>> tinit                    = 0
>>>>>>>> dt                       = 0.002
>>>>>>>> nsteps                   = 500000    ; 1 ns
>>>>>>>> nstcomm                  = 100
>>>>>>>> ; Output control
>>>>>>>> nstxout                  = 50000
>>>>>>>> nstvout                  = 50000
>>>>>>>> nstfout                  = 0
>>>>>>>> nstlog                   = 50000
>>>>>>>> nstenergy                = 50000
>>>>>>>> nstxout-compressed       = 0
>>>>>>>> ; Neighborsearching and short-range nonbonded interactions
>>>>>>>> cutoff-scheme            = group
>>>>>>>> nstlist                  = 10
>>>>>>>> ns_type                  = grid
>>>>>>>> pbc                      = xyz
>>>>>>>> rlist                    = 1.4
>>>>>>>> ; Electrostatics
>>>>>>>> coulombtype              = cutoff
>>>>>>>> rcoulomb                 = 1.4
>>>>>>>> ; van der Waals
>>>>>>>> vdwtype                  = user
>>>>>>>> vdw-modifier             = none
>>>>>>>> rvdw                     = 1.4
>>>>>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>>>>>> DispCorr                  = EnerPres
>>>>>>>> ; Spacing for the PME/PPPM FFT grid
>>>>>>>> fourierspacing           = 0.12
>>>>>>>> ; EWALD/PME/PPPM parameters
>>>>>>>> pme_order                = 6
>>>>>>>> ewald_rtol               = 1e-06
>>>>>>>> epsilon_surface          = 0
>>>>>>>> ; Temperature coupling
>>>>>>>> Tcoupl                   = nose-hoover
>>>>>>>> tc_grps                  = system
>>>>>>>> tau_t                    = 1.0
>>>>>>>> ref_t                    = some_temperature
>>>>>>>> ; Pressure coupling is off for NVT
>>>>>>>> Pcoupl                   = No
>>>>>>>> tau_p                    = 0.5
>>>>>>>> compressibility          = 4.5e-05
>>>>>>>> ref_p                    = 1.0
>>>>>>>> ; options for bonds
>>>>>>>> constraints              = all-bonds
>>>>>>>> constraint_algorithm     = lincs
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 27, 2018 at 1:14 PM, Alex <nedomacho at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> As I said, only two users, and nvidia-smi shows the process 
>>>>>>>>> name. We're investigating, and it does appear that it is the 
>>>>>>>>> EM, which uses cutoff electrostatics, and as a result the user 
>>>>>>>>> did not bother with -pme cpu in the mdrun call. What would be 
>>>>>>>>> the correct way to enforce a CPU-only mdrun when 
>>>>>>>>> coulombtype = cutoff?
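>>>>>>>>>
>>>>>>>>> (Is forcing both tasks onto the CPU, i.e.
>>>>>>>>>
>>>>>>>>> gmx mdrun -nt 1 -nb cpu -pme cpu -deffnm em_steep
>>>>>>>>>
>>>>>>>>> actually sufficient, or does a GPU-enabled binary still probe 
>>>>>>>>> the devices first?)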
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>> On Fri, Apr 27, 2018 at 12:45 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> No.
>>>>>>>>>>
>>>>>>>>>> Look at the processes that are running, e.g. with top or ps. 
>>>>>>>>>> Either old simulations or another user is running.
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 27, 2018, 20:33 Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Strange. There are only two people using this machine, myself 
>>>>>>>>>>> being one of them, and the other person specifically forces 
>>>>>>>>>>> -nb cpu -pme cpu in his calls to mdrun. Are any other GMX 
>>>>>>>>>>> utilities (e.g. insert-molecules, grompp, or energy) trying 
>>>>>>>>>>> to use GPUs?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 27, 2018 at 5:33 AM, Szilárd Páll <pall.szilard at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The second column is PIDs, so there is a whole lot more 
>>>>>>>>>>>> going on there than just a single simulation with a single 
>>>>>>>>>>>> rank using two GPUs. That would be one PID and two entries 
>>>>>>>>>>>> for the two GPUs. Are you sure you're not running other 
>>>>>>>>>>>> processes?
>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> Szilárd
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 26, 2018 at 5:52 AM, Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am running GMX 2018 with gmx mdrun -pinoffset 0 -pin on 
>>>>>>>>>>>>> -nt 24 -ntmpi 4 -npme 1 -pme gpu -nb gpu -gputasks 1122
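>>>>>>>>>>>>> (That is, 4 thread-MPI ranks -- 3 PP plus 1 PME -- with 
>>>>>>>>>>>>> -gputasks 1122 mapping the four GPU tasks to device IDs 
>>>>>>>>>>>>> 1, 1, 2, 2, so GPU 0 should stay untouched.)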
>>>>>>>>>>>>>
>>>>>>>>>>>>> Once in a while the simulation slows down and nvidia-smi 
>>>>>>>>>>>>> reports something like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> |    1     12981    C   gmx                           175MiB |
>>>>>>>>>>>>> |    2     12981    C   gmx                           217MiB |
>>>>>>>>>>>>> |    2     13083    C   gmx                           161MiB |
>>>>>>>>>>>>> |    2     13086    C   gmx                           159MiB |
>>>>>>>>>>>>> |    2     13089    C   gmx                           139MiB |
>>>>>>>>>>>>> |    2     13093    C   gmx                           163MiB |
>>>>>>>>>>>>> |    2     13096    C   gmx                            11MiB |
>>>>>>>>>>>>> |    2     13099    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13102    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13106    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13109    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13112    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13115    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13119    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13122    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13125    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13128    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13131    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13134    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13138    C   gmx                             8MiB |
>>>>>>>>>>>>> |    2     13141    C   gmx                             8MiB |
>>>>>>>>>>>>> +---------------------------------------------------------------+
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then it goes back to the expected load. Is this normal?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>

-- 
==================================================

Justin A. Lemkul, Ph.D.
Assistant Professor
Virginia Tech Department of Biochemistry

303 Engel Hall
340 West Campus Dr.
Blacksburg, VA 24061

jalemkul at vt.edu | (540) 231-3129
http://www.thelemkullab.com

==================================================


