[gmx-users] strange GPU load distribution

Alex nedomacho at gmail.com
Mon May 7 03:35:44 CEST 2018


Hi Mark,

I forwarded your email to the person who installed CUDA on our boxes. 
Just to be clear, there is no persistent occupancy of the GPUs _after_
the process has finished. The observation is as follows: the EM jobs are
submitted -> low CPU use by the EM jobs, GPUs bogged down, no output
files yet -> GPUs released, normal CPU use, output/log files appear,
normal completion.

I will update as soon as I know more. We seem to have run into a very
unpleasant combination (the type of jobs being submitted + slow GPU
init). I recall that when I first reported the regression test issue,
Szilard suggested that, for whatever reason, all of our GPU
initializations take longer than they should. We would not have noticed
otherwise, but we now have a user who triggers a lot of these
initializations.

Thanks,

Alex

On 5/6/2018 5:13 PM, Mark Abraham wrote:
> Hi,
>
> In 2018 and 2018.1, mdrun does indeed run GPU detection and compatibility
> checks before any logic about whether it should use any GPUs that were in
> fact detected. However, there's nothing about those checks that should a)
> take any noticeable time, b) acquire any ongoing resources, or c) lead to
> persistent occupancy of the GPUs after the simulation process completes.
> Those combined observations point to something about the installation of
> the GPUs / runtime / SDK / drivers. What distro are you using? Is there
> maybe some kind of security feature enabled that could be interfering? Are
> the GPUs configured to use some kind of process-exclusive mode?
>
> Mark
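
For reference, whether a GPU sits in an exclusive or prohibited compute mode
can be checked, and reset, with nvidia-smi; the query fields below are
standard, and changing the mode requires root:

# Show the compute mode of each GPU
nvidia-smi --query-gpu=index,name,compute_mode --format=csv

# Return GPU 0 to the default (shared) compute mode if it was set to exclusive
sudo nvidia-smi -i 0 -c DEFAULT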
>
> On Mon, May 7, 2018 at 12:14 AM Justin Lemkul <jalemkul at vt.edu> wrote:
>
>>
>> On 5/6/18 6:11 PM, Alex wrote:
>>> A separate CPU-only build is what we were going to try, but if it
>>> succeeds without touching the GPUs, then what -- keep several builds?
>>>
>> If your CPU-only run produces something that doesn't touch the GPU
>> (which it shouldn't), that test would rather conclusively show that,
>> when the user requests a CPU-only run, the mdrun code needs to be
>> patched in such a way that GPU detection is not carried out. If that's
>> the case, yes, you'd have to wait for a patch and in the meantime
>> maintain two different mdrun binaries, but it would be a valuable bit
>> of information for the dev team.
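
A CPU-only build for that test is just a second build tree configured without
GPU support; a minimal sketch using the usual GROMACS 2018 CMake options (the
install prefix is a placeholder):

# Run from the GROMACS 2018.x source directory
mkdir build-cpu && cd build-cpu
cmake .. -DGMX_GPU=OFF -DGMX_BUILD_OWN_FFTW=ON \
         -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2018-cpu
make -j 8 && make install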
>>
>>> That latency you mention is definitely there; I think it is related to
>>> my earlier report of one of the regression tests failing (Mark might
>>> remember that one). That failure, by the way, persists with the 2018.1
>>> we just installed on a completely different machine.
>> I seem to recall that, which is what got me thinking.
>>
>> -Justin
>>
>>> Alex
>>>
>>>
>>> On 5/6/2018 4:03 PM, Justin Lemkul wrote:
>>>>
>>>> On 5/6/18 5:51 PM, Alex wrote:
>>>>> Unfortunately, we're still bogged down when the EM runs (example
>>>>> below) start -- CPU usage by these jobs is initially low, while
>>>>> their PIDs show up in nvidia-smi. After about a minute everything
>>>>> goes back to normal. Because the user is running these frequently
>>>>> (scripted), everything is slowed down by a large factor.
>>>>> Interestingly, we have another user utilizing a GPU with another MD
>>>>> package (LAMMPS), and that GPU is never touched by these EM jobs.
>>>>>
>>>>> Any ideas will be greatly appreciated.
>>>>>
>>>> Thinking out loud - a run that explicitly calls for only the CPU to
>>>> be used might still be trying to detect GPUs if mdrun is GPU-enabled.
>>>> Is that a possibility, including any latency in detecting those
>>>> devices? Have you tested to make sure that an mdrun binary built
>>>> without GPU support (-DGMX_GPU=OFF) doesn't affect the GPU usage when
>>>> running the same command?
>>>>
>>>> -Justin
>>>>
>>>>> Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>>> PID TTY      STAT TIME COMMAND
>>>>>>
>>>>>> 60432 pts/8    Dl+ 0:01 gmx mdrun -table ../../../tab_it.xvg -nt 1 -nb cpu -pme cpu -deffnm em_steep
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>> On 4/27/2018 2:16 PM, Mark Abraham wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> What you think was run isn't nearly as useful when troubleshooting as
>>>>>>> asking the kernel what is actually running.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 27, 2018, 21:59 Alex <nedomacho at gmail.com> wrote:
>>>>>>>
>>>>>>>> Mark, I copied the exact command line from the script, right
>>>>>>>> above the
>>>>>>>> mdp file. It is literally how the script calls mdrun in this case:
>>>>>>>>
>>>>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/27/2018 1:52 PM, Mark Abraham wrote:
>>>>>>>>> The group cutoff scheme can never run on a GPU, so none of that
>>>>>>>>> should matter.
>>>>>>>>> Use ps and find out what the command lines were.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> On Fri, Apr 27, 2018, 21:37 Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Update: we're basically removing commands one by one from the script
>>>>>>>>>> that submits the jobs causing the issue. The culprits are both the EM
>>>>>>>>>> and the MD run, and the GPUs are being affected _before_ MD starts
>>>>>>>>>> loading the CPU, i.e. during the initial setup of the EM run -- CPU
>>>>>>>>>> load is near zero while nvidia-smi reports the mess. I wonder if this
>>>>>>>>>> is in any way related to that timing test we were failing a while back.
>>>>>>>>>> The mdrun call and mdp are below, though I suspect they have nothing
>>>>>>>>>> to do with what is happening. Any help will be very highly appreciated.
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> ***
>>>>>>>>>>
>>>>>>>>>> gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm
>>>>>>>>>>
>>>>>>>>>> mdp:
>>>>>>>>>>
>>>>>>>>>> ; Run control
>>>>>>>>>> integrator               = md-vv       ; Velocity Verlet
>>>>>>>>>> tinit                    = 0
>>>>>>>>>> dt                       = 0.002
>>>>>>>>>> nsteps                   = 500000    ; 1 ns
>>>>>>>>>> nstcomm                  = 100
>>>>>>>>>> ; Output control
>>>>>>>>>> nstxout                  = 50000
>>>>>>>>>> nstvout                  = 50000
>>>>>>>>>> nstfout                  = 0
>>>>>>>>>> nstlog                   = 50000
>>>>>>>>>> nstenergy                = 50000
>>>>>>>>>> nstxout-compressed       = 0
>>>>>>>>>> ; Neighborsearching and short-range nonbonded interactions
>>>>>>>>>> cutoff-scheme            = group
>>>>>>>>>> nstlist                  = 10
>>>>>>>>>> ns_type                  = grid
>>>>>>>>>> pbc                      = xyz
>>>>>>>>>> rlist                    = 1.4
>>>>>>>>>> ; Electrostatics
>>>>>>>>>> coulombtype              = cutoff
>>>>>>>>>> rcoulomb                 = 1.4
>>>>>>>>>> ; van der Waals
>>>>>>>>>> vdwtype                  = user
>>>>>>>>>> vdw-modifier             = none
>>>>>>>>>> rvdw                     = 1.4
>>>>>>>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>>>>>>>> DispCorr                  = EnerPres
>>>>>>>>>> ; Spacing for the PME/PPPM FFT grid
>>>>>>>>>> fourierspacing           = 0.12
>>>>>>>>>> ; EWALD/PME/PPPM parameters
>>>>>>>>>> pme_order                = 6
>>>>>>>>>> ewald_rtol               = 1e-06
>>>>>>>>>> epsilon_surface          = 0
>>>>>>>>>> ; Temperature coupling
>>>>>>>>>> Tcoupl                   = nose-hoover
>>>>>>>>>> tc_grps                  = system
>>>>>>>>>> tau_t                    = 1.0
>>>>>>>>>> ref_t                    = some_temperature
>>>>>>>>>> ; Pressure coupling is off for NVT
>>>>>>>>>> Pcoupl                   = No
>>>>>>>>>> tau_p                    = 0.5
>>>>>>>>>> compressibility          = 4.5e-05
>>>>>>>>>> ref_p                    = 1.0
>>>>>>>>>> ; options for bonds
>>>>>>>>>> constraints              = all-bonds
>>>>>>>>>> constraint_algorithm     = lincs
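
For completeness, an mdp like the one above is typically consumed as follows;
the file names are placeholders, and with vdwtype = user the tabulated
potential also has to be passed to mdrun, as in the ps listing earlier in the
thread:

gmx grompp -f nvt.mdp -c system.gro -p topol.top -o nvt.tpr
gmx mdrun -nt 2 -nb cpu -pme cpu -table tab_it.xvg -deffnm nvt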
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 27, 2018 at 1:14 PM, Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>>> As I said, there are only two users, and nvidia-smi shows the
>>>>>>>>>>> process name. We're investigating, and it does appear that the EM
>>>>>>>>>>> uses cutoff electrostatics, and as a result the user did not bother
>>>>>>>>>>> with -pme cpu in the mdrun call. What would be the correct way to
>>>>>>>>>>> enforce CPU-only mdrun when coulombtype = cutoff?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
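
For what it's worth, in the 2018 series the flags that keep all interaction
work on the CPU are -nb cpu and -pme cpu; skipping detection itself at run
time is a separate matter -- the GMX_DISABLE_GPU_DETECTION environment
variable is meant to do that, but it is worth verifying against the installed
build:

# Force nonbonded and PME work onto the CPU
gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm em_steep

# If supported by this build, additionally skip GPU detection entirely
GMX_DISABLE_GPU_DETECTION=1 gmx mdrun -nt 2 -nb cpu -pme cpu -deffnm em_steep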
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 27, 2018 at 12:45 PM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> No.
>>>>>>>>>>>>
>>>>>>>>>>>> Look at the processes that are running, e.g. with top or ps.
>>>>>>>>>>>> Either
>>>>>>>> old
>>>>>>>>>>>> simulations or another user is running.
>>>>>>>>>>>>
>>>>>>>>>>>> Mark
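
A ps invocation along those lines, filtered to gmx processes (the bracketed
pattern keeps grep itself out of the listing):

# PID, owner, state, elapsed time, and full command line of every gmx process
ps -eo pid,user,stat,etime,args | grep '[g]mx'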
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 27, 2018, 20:33 Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Strange. There are only two people using this machine, myself
>>>>>>>>>>>>> being one of them, and the other person specifically forces
>>>>>>>>>>>>> -nb cpu -pme cpu in his calls to mdrun. Are any other GMX
>>>>>>>>>>>>> utilities (e.g. insert-molecules, grompp, or energy) trying to
>>>>>>>>>>>>> use GPUs?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 27, 2018 at 5:33 AM, Szilárd Páll <pall.szilard at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The second column shows PIDs, so there is a whole lot more going
>>>>>>>>>>>>>> on there than just a single simulation with a single rank using
>>>>>>>>>>>>>> two GPUs. That would be one PID and two entries, one for each GPU.
>>>>>>>>>>>>>> Are you sure you're not running other processes?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Szilárd
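
One way to see exactly which process occupies which GPU, without the
truncation of the default table (standard nvidia-smi query fields; the UUIDs
map back to indices via nvidia-smi -L):

# Per-GPU compute processes, refreshed every 5 seconds
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv -l 5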
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 26, 2018 at 5:52 AM, Alex <nedomacho at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am running GMX 2018 with gmx mdrun -pinoffset 0 -pin on -nt 24
>>>>>>>>>>>>>>> -ntmpi 4 -npme 1 -pme gpu -nb gpu -gputasks 1122
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Once in a while the simulation slows down and nvidia-smi reports
>>>>>>>>>>>>>>> something like this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> |    1     12981      C   gmx                           175MiB |
>>>>>>>>>>>>>>> |    2     12981      C   gmx                           217MiB |
>>>>>>>>>>>>>>> |    2     13083      C   gmx                           161MiB |
>>>>>>>>>>>>>>> |    2     13086      C   gmx                           159MiB |
>>>>>>>>>>>>>>> |    2     13089      C   gmx                           139MiB |
>>>>>>>>>>>>>>> |    2     13093      C   gmx                           163MiB |
>>>>>>>>>>>>>>> |    2     13096      C   gmx                            11MiB |
>>>>>>>>>>>>>>> |    2     13099      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13102      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13106      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13109      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13112      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13115      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13119      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13122      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13125      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13128      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13131      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13134      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13138      C   gmx                             8MiB |
>>>>>>>>>>>>>>> |    2     13141      C   gmx                             8MiB |
>>>>>>>>>>>>>>> +-----------------------------------------------------------------+
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then it goes back to the expected load. Is this normal?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>
>> --
>> ==================================================
>>
>> Justin A. Lemkul, Ph.D.
>> Assistant Professor
>> Virginia Tech Department of Biochemistry
>>
>> 303 Engel Hall
>> 340 West Campus Dr.
>> Blacksburg, VA 24061
>>
>> jalemkul at vt.edu | (540) 231-3129
>> http://www.thelemkullab.com
>>
>> ==================================================
>>