[gmx-users] strange GPU load distribution

Alex nedomacho at gmail.com
Mon May 7 00:11:24 CEST 2018


A separate CPU-only build is what we were going to try, but if it 
succeeds in not touching the GPUs, then what -- keep several builds around?
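
In case it helps, the build we had in mind is nothing exotic -- a rough 
sketch is below; the tarball name, install prefix, and the own-FFTW choice 
are placeholders/assumptions, not what we actually ran:

# Sketch of a CPU-only GROMACS 2018 build; paths are placeholders.
tar xf gromacs-2018.1.tar.gz
cd gromacs-2018.1 && mkdir build && cd build
cmake .. -DGMX_GPU=OFF -DGMX_BUILD_OWN_FFTW=ON \
         -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2018-cpu
make -j 8 && make check && make install
# Point the EM scripts at this build so they pick up the CPU-only gmx
source $HOME/gromacs-2018-cpu/bin/GMXRC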

That latency you mention is definitely there; I think it is related to 
my earlier report of one of the regression tests failing (Mark might 
remember that one). That failure, by the way, persists with the 2018.1 
we just installed on a completely different machine.
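
For the record, this is roughly how we have been checking what actually 
lands on the GPUs while those EM jobs start up -- nothing more than asking 
the kernel, as Mark suggested; the PID is just whatever nvidia-smi reports:

# Refresh the per-process GPU listing every 5 seconds while the EM jobs start
nvidia-smi --loop=5
# For any gmx PID shown there, ask the kernel for its state and full command line
ps -o pid,user,stat,etime,args -p <PID>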

Alex


On 5/6/2018 4:03 PM, Justin Lemkul wrote:
>
>
> On 5/6/18 5:51 PM, Alex wrote:
>> Unfortunately, we're still bogged down when the EM runs (example 
>> below) start -- CPU usage by these jobs is initially low, while their 
>> PIDs show up in nvidia-smi. After about a minute everything goes back 
>> to normal. Because the user launches these runs frequently (scripted), 
>> everything is slowed down by a large factor. Interestingly, we have 
>> another user running GPU jobs with another MD package (LAMMPS), and 
>> that GPU is never touched by these EM jobs.
>>
>> Any ideas will be greatly appreciated.
>>
>
> Thinking out loud - a run that explicitly calls for only the CPU to be 
> used might still be trying to detect a GPU if mdrun is GPU-enabled. Is 
> that a possibility, including some latency in detecting the device? Have 
> you tested to make sure that an mdrun binary built without GPU support 
> (-DGMX_GPU=OFF) doesn't affect the GPU usage when running the same 
> command?
>
> -Justin
>
>> Thanks,
>>
>> Alex
>>
>>
>>> PID TTY      STAT TIME COMMAND
>>>
>>> 60432 pts/8    Dl+ 0:01 gmx mdrun -table ../../../tab_it.xvg -nt 1 -nb cpu -pme cpu -deffnm em_steep
>>>
>>>
>>>
>>
>>
>>> On 4/27/2018 2:16 PM, Mark Abraham wrote:
>>>> Hi,
>>>>
>>>> What you think was run isn't nearly as useful when troubleshooting as
>>>> asking the kernel what is actually running.
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Fri, Apr 27, 2018, 21:59 Alex <nedomacho at gmail.com> wrote:
>>>>
>>>>> Mark, I copied the exact command line from the script, right above 
>>>>> the
>>>>> mdp file. It is literally how the script calls mdrun in this case:
>>>>>
>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>>
>>>>>
>>>>> On 4/27/2018 1:52 PM, Mark Abraham wrote:
>>>>>> The group cutoff scheme can never run on a GPU, so none of that should
>>>>>> matter.
>>>>>> Use ps and find out what the command lines were.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On Fri, Apr 27, 2018, 21:37 Alex <nedomacho at gmail.com> wrote:
>>>>>>
>>>>>>> Update: we're basically removing commands one by one from the script
>>>>>>> that submits the jobs causing the issue. The culprits are both the EM
>>>>>>> and the MD run, and the GPUs are being affected _before_ MD starts
>>>>>>> loading the CPU, i.e. during the initial setup of the EM run -- CPU
>>>>>>> load is near zero, yet nvidia-smi reports the mess. I wonder if this
>>>>>>> is in any way related to that timing test we were failing a while back.
>>>>>>> The mdrun call and mdp are below, though I suspect they have nothing
>>>>>>> to do with what is happening. Any help will be very highly appreciated.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> ***
>>>>>>>
>>>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>>>>
>>>>>>> mdp:
>>>>>>>
>>>>>>> ; Run control
>>>>>>> integrator               = md-vv       ; Velocity Verlet
>>>>>>> tinit                    = 0
>>>>>>> dt                       = 0.002
>>>>>>> nsteps                   = 500000    ; 1 ns
>>>>>>> nstcomm                  = 100
>>>>>>> ; Output control
>>>>>>> nstxout                  = 50000
>>>>>>> nstvout                  = 50000
>>>>>>> nstfout                  = 0
>>>>>>> nstlog                   = 50000
>>>>>>> nstenergy                = 50000
>>>>>>> nstxout-compressed       = 0
>>>>>>> ; Neighborsearching and short-range nonbonded interactions
>>>>>>> cutoff-scheme            = group
>>>>>>> nstlist                  = 10
>>>>>>> ns_type                  = grid
>>>>>>> pbc                      = xyz
>>>>>>> rlist                    = 1.4
>>>>>>> ; Electrostatics
>>>>>>> coulombtype              = cutoff
>>>>>>> rcoulomb                 = 1.4
>>>>>>> ; van der Waals
>>>>>>> vdwtype                  = user
>>>>>>> vdw-modifier             = none
>>>>>>> rvdw                     = 1.4
>>>>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>>>>> DispCorr                  = EnerPres
>>>>>>> ; Spacing for the PME/PPPM FFT grid
>>>>>>> fourierspacing           = 0.12
>>>>>>> ; EWALD/PME/PPPM parameters
>>>>>>> pme_order                = 6
>>>>>>> ewald_rtol               = 1e-06
>>>>>>> epsilon_surface          = 0
>>>>>>> ; Temperature coupling
>>>>>>> Tcoupl                   = nose-hoover
>>>>>>> tc_grps                  = system
>>>>>>> tau_t                    = 1.0
>>>>>>> ref_t                    = some_temperature
>>>>>>> ; Pressure coupling is off for NVT
>>>>>>> Pcoupl                   = No
>>>>>>> tau_p                    = 0.5
>>>>>>> compressibility          = 4.5e-05
>>>>>>> ref_p                    = 1.0
>>>>>>> ; options for bonds
>>>>>>> constraints              = all-bonds
>>>>>>> constraint_algorithm     = lincs
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 27, 2018 at 1:14 PM, Alex <nedomacho at gmail.com> wrote:
>>>>>>>
>>>>>>>> As I said, only two users, and nvidia-smi shows the process name.
>>>>>>>> We're investigating, and it does appear that it is the EM that uses
>>>>>>>> cutoff electrostatics, and as a result the user did not bother with
>>>>>>>> -pme cpu in the mdrun call. What would be the correct way to enforce
>>>>>>>> a CPU-only mdrun when coulombtype = cutoff?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> On Fri, Apr 27, 2018 at 12:45 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> No.
>>>>>>>>>
>>>>>>>>> Look at the processes that are running, e.g. with top or ps. Either
>>>>>>>>> old simulations are still running, or another user is running something.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> On Fri, Apr 27, 2018, 20:33 Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Strange. There are only two people using this machine, myself being
>>>>>>>>>> one of them, and the other person specifically forces -nb cpu -pme cpu
>>>>>>>>>> in his calls to mdrun. Are any other GMX utilities (e.g.
>>>>>>>>>> insert-molecules, grompp, or energy) trying to use GPUs?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 27, 2018 at 5:33 AM, Szilárd Páll <pall.szilard at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The second column is the PIDs, so there is a whole lot more going on
>>>>>>>>>>> there than just a single simulation with a single rank using two GPUs.
>>>>>>>>>>> That would be one PID and two entries, one for each GPU. Are you sure
>>>>>>>>>>> you're not running other processes?
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> Szilárd
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 26, 2018 at 5:52 AM, Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I am running GMX 2018 with gmx mdrun -pinoffset 0 -pin on -nt 24
>>>>>>>>>>>> -ntmpi 4 -npme 1 -pme gpu -nb gpu -gputasks 1122
>>>>>>>>>>>>
>>>>>>>>>>>> Once in a while the simulation slows down and nvidia-smi reports
>>>>>>>>>>>> something like this:
>>>>>>>>>>>>
>>>>>>>>>>>> |    1     12981      C   gmx                              175MiB |
>>>>>>>>>>>> |    2     12981      C   gmx                              217MiB |
>>>>>>>>>>>> |    2     13083      C   gmx                              161MiB |
>>>>>>>>>>>> |    2     13086      C   gmx                              159MiB |
>>>>>>>>>>>> |    2     13089      C   gmx                              139MiB |
>>>>>>>>>>>> |    2     13093      C   gmx                              163MiB |
>>>>>>>>>>>> |    2     13096      C   gmx                               11MiB |
>>>>>>>>>>>> |    2     13099      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13102      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13106      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13109      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13112      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13115      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13119      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13122      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13125      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13128      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13131      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13134      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13138      C   gmx                                8MiB |
>>>>>>>>>>>> |    2     13141      C   gmx                                8MiB |
>>>>>>>>>>>> +-----------------------------------------------------------------+
>>>>>>>>>>>>
>>>>>>>>>>>> Then it goes back to the expected load. Is this normal?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>
>>
>


