[gmx-developers] GPU memory issue?
Szilárd Páll
szilard.pall at cbr.su.se
Wed Mar 14 02:08:32 CET 2012
Hi,
> We have a new GPU workstation equipped with a Tesla C2075. I installed
> Gromacs 4.5.5 and tried to run the DHFR benchmark but got a fatal error that
> the pre-simulation memory test had failed. This was the first time anything
> had been run on the GPU, so it struck me as odd. I saw the discussion on
> this list a few days ago regarding memory exhaustion, so I used Szilard's
> script to see what was going on. I got the following:
>
> [2012-03-12 08:37:36] Memory usage: virt 10736 res 4920
> [2012-03-12 08:37:46] Memory usage: virt 148428 res 71188
> [2012-03-12 08:37:56] Memory usage: virt 148428 res 71188
> ...
> [2012-03-12 08:52:18] Memory usage: virt 148428 res 71188
> [2012-03-12 08:52:28] Memory usage: virt 148428 res 71188
> [2012-03-12 08:52:38] Memory usage: virt 149348 res 72256
> [2012-03-12 08:52:48] Memory usage: virt 149348 res 72256
> [2012-03-12 08:52:58] Memory usage: virt 149348 res 72256
> [2012-03-12 08:53:08] Memory usage: virt 149348 res 72256
> [2012-03-12 08:53:18] Memory usage: virt 149348 res 72256
> [2012-03-12 08:53:28] Memory usage: virt 149348 res 72256
> ...
> (and onward to completion)
>
> Is this increase in memory usage indicative of a problem? If so, is there
> anything that can be done? I imagine that the bump in the beginning is
> simply an effect of the simulation starting up, or is that not correct?
The memory usage is printed in bytes, so ~70 kB looks fine, and there is
no relevant increase in memory usage over the roughly 15-minute window
you show above.
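If you want to keep an eye on it during longer runs, something along
these lines works too. This is only a quick Python sketch, not the
script I posted earlier; it polls /proc/<pid>/statm and prints the
virtual and resident size in bytes:

  #!/usr/bin/env python
  # Minimal memory monitor sketch: poll /proc/<pid>/statm and log
  # virtual and resident set size (converted from pages to bytes).
  import datetime
  import os
  import sys
  import time

  pid = int(sys.argv[1])                      # PID of the mdrun process
  interval = float(sys.argv[2]) if len(sys.argv) > 2 else 10.0
  page = os.sysconf('SC_PAGE_SIZE')           # bytes per page

  while os.path.exists('/proc/%d/statm' % pid):
      with open('/proc/%d/statm' % pid) as f:
          size, resident = [int(x) for x in f.read().split()[:2]]  # pages
      stamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
      print('[%s] Memory usage: virt %d res %d'
            % (stamp, size * page, resident * page))
      time.sleep(interval)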
> Looking at the observables in the simulation, there seems to be nothing
> wrong - no abnormal structures or RMSD spikes, energy terms are very stable,
> etc. Should I be looking at other things? The error that comes up about
> failing the memory test seems to indicate that I should expect catastrophic
> problems, so I'd like to fully investigate any issues before we invest
> serious time in doing data collection.
I've seen such errors pop up when the GPU/driver is "stuck" in a
corrupted state, which a reboot generally fixes. You can also try
running memtestG80 (the standalone version of the memory tester used by
mdrun) or cuda_memtest:
https://simtk.org/home/memtest
http://sourceforge.net/projects/cudagpumemtest/
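For what it's worth, the basic idea behind those testers is simply to
write known bit patterns into device memory, read them back, and compare;
the real tools do this with many patterns and dedicated on-device kernels.
A toy version of the write/read-back check (assuming pycuda and numpy are
installed; purely illustrative, not a substitute for memtestG80 or
cuda_memtest) could look like:

  # Toy GPU memory check: write a bit pattern to device memory,
  # copy it back, and count mismatching words. Illustrative only.
  import numpy as np
  import pycuda.autoinit          # creates a CUDA context on the default GPU
  import pycuda.driver as cuda

  nbytes = 64 * 1024 * 1024       # test 64 MiB of device memory
  for value in (0x00000000, 0xFFFFFFFF, 0xA5A5A5A5, 0x5A5A5A5A):
      pattern = np.empty(nbytes // 4, dtype=np.uint32)
      pattern.fill(value)
      dbuf = cuda.mem_alloc(pattern.nbytes)
      cuda.memcpy_htod(dbuf, pattern)         # host -> device
      readback = np.empty_like(pattern)
      cuda.memcpy_dtoh(readback, dbuf)        # device -> host
      errors = int(np.count_nonzero(readback != pattern))
      print('pattern 0x%08X: %d mismatching words' % (value, errors))
      dbuf.free()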
> I would also note that on the Gromacs website, it states that the Tesla
> C2075 is supported, but it is not present in the hard-coded list. Thus I
> have to use "force-device=yes" in my mdrun-gpu command. I suppose this is
> something to be fixed, if it hasn't already been.
This should be fixed in the next release.
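In the meantime, passing the device options explicitly is a reasonable
workaround. If I remember the syntax right (please double-check against
the 4.5 GPU documentation), it is something like:

  mdrun-gpu -device "OpenMM:platform=Cuda,deviceid=0,memtest=15,force-device=yes"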
--
Szilárd
> -Justin
>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================