[gmx-developers] GPU memory issue?

Szilárd Páll szilard.pall at cbr.su.se
Wed Mar 14 02:08:32 CET 2012


Hi,

> We have a new GPU workstation equipped with a Tesla C2075.  I installed
> Gromacs 4.5.5 and tried to run the DHFR benchmark but got a fatal error that
> the pre-simulation memory test had failed.  This was the first time anything
> had been run on the GPU, so it struck me as odd.  I saw the discussion on
> this list a few days ago regarding memory exhaustion, so I used Szilard's
> script to see what was going on.  I got the following:
>
> [2012-03-12 08:37:36]  Memory usage: virt      10736    res       4920
> [2012-03-12 08:37:46]  Memory usage: virt     148428    res      71188
> [2012-03-12 08:37:56]  Memory usage: virt     148428    res      71188
> ...
> [2012-03-12 08:52:18]  Memory usage: virt     148428    res      71188
> [2012-03-12 08:52:28]  Memory usage: virt     148428    res      71188
> [2012-03-12 08:52:38]  Memory usage: virt     149348    res      72256
> [2012-03-12 08:52:48]  Memory usage: virt     149348    res      72256
> [2012-03-12 08:52:58]  Memory usage: virt     149348    res      72256
> [2012-03-12 08:53:08]  Memory usage: virt     149348    res      72256
> [2012-03-12 08:53:18]  Memory usage: virt     149348    res      72256
> [2012-03-12 08:53:28]  Memory usage: virt     149348    res      72256
> ...
> (and onward to completion)
>
> Is this increase in memory usage indicative of a problem?  If so, is there
> anything that can be done?  I imagine that the bump in the beginning is
> simply an effect of the simulation starting up, or is that not correct?

The memory usage is printed in bytes; ~70 kB of resident memory looks
fine, and there is no relevant increase in memory usage over the
15-minute window you show above.
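For reference, the kind of polling loop behind output like the above can be sketched as follows. This is a minimal sketch, not the actual script from the earlier thread; note that `ps` reports vsz/rss in kB on Linux, so verify which units your script logs.

```shell
#!/bin/sh
# Sketch of a memory-usage monitor: poll a process's virtual (vsz) and
# resident (rss) memory via ps at a fixed interval until it exits.
# (Illustrative only -- not the script referenced in the thread.)
monitor_mem() {
    pid=$1
    interval=${2:-10}                     # default: sample every 10 s
    while kill -0 "$pid" 2>/dev/null; do  # loop while the process is alive
        set -- $(ps -o vsz= -o rss= -p "$pid")
        [ -n "$1" ] || break              # process vanished between checks
        printf '[%s]  Memory usage: virt %10s    res %10s\n' \
            "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$2"
        sleep "$interval"
    done
}

# Demo: watch a short-lived process, sampling once per second.
sleep 3 &
monitor_mem $! 1
```

A steadily growing `res` column over many samples would indicate a leak; a one-time jump at startup, as in the log above, is expected allocation.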

> Looking at the observables in the simulation, there seems to be nothing
> wrong - no abnormal structures or RMSD spikes, energy terms are very stable,
> etc. Should I be looking at other things?  The error that comes up about
> failing the memory test seems to indicate that I should expect catastrophic
> problems, so I'd like to fully investigate any issues before we invest
> serious time in doing data collection.

I've seen such errors pop up when the GPU/driver is "stuck" in a
corrupted state, which is generally resolved by a restart. You can also
try running memtestG80 (the standalone version of the memory test used
by mdrun) or cuda_memtest:
https://simtk.org/home/memtest
http://sourceforge.net/projects/cudagpumemtest/

> I would also note that on the Gromacs website, it states that the Tesla
> C2075 is supported, but it is not present in the hard-coded list.  Thus I
> have to use "force-device=yes" in my mdrun-gpu command.  I suppose this is
> something to be fixed, if it hasn't already been.

It should be fixed by the next release.

--
Szilárd

> -Justin
>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================


