[gmx-users] shortage of shared memory

David van der Spoel spoel at xray.bmc.uu.se
Sun Jul 8 08:25:19 CEST 2007


chris.neale at utoronto.ca wrote:
> I have a variety of systems that run in parallel without ever having 
> errors due to shortage of shared memory (up to 500K atoms). However, I 
> find that I sometimes run into this problem with lipid bilayer systems 
> of less than 30K atoms.
> 
> If I submit a job and I get the shared memory error the error occurs 
> before any simulation time. What's more, if I resubmit the job it often 
> works fine. However, one recent bilayer system set up by a colleague 
> won't ever run.
> 
> I am using openmpi_v1.2.1 and I can avoid using shared memory like this:
> 
> ${OMPI}/mpirun --mca btl ^sm ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp} -4
> (etc...)
> 
> That absolutely fixes the error, but when I do that the scaling to 4 
> processors is very poor as judged by walltime and also by the output at 
> the end of the gromacs .log file.
> 
> This also confuses me since my sysadmin tells me that gromacs doesn't 
> use shared memory.
> 
> I get two basic error messages. Sometimes it is this to stderr: 
> [cn-r4-18][0,1,1][btl_sm_component.c:521:mca_btl_sm_component_progress] 
> SM faild to send message due to shortage of shared memory.
> 
> And sometimes it is a longer style error message (see the end of this 
> email for all stderr from a run of that type.)
> 
> I believe this to be a problem with our cluster, and I guess that would 
> make this the wrong mailing list for this question, but I am hoping that 
> somebody can help me clarify what is going on with shared memory usage 
> in gromacs and perhaps why the error appears to be stochastic but also 
> related to bilayers.
> 
> Our cluster is also having some problems with random xtc or trr file 
> corruption (1 in 10 to 20 runs) in case that seems related to the shared 
> memory issue. However, that is not the issue that I am presenting in 
> this post.
> 
> Thanks,
> Chris.
> 
So it seems that there is a problem in the shared memory communication 
layer of OpenMPI that only shows up sporadically. (Your sysadmin is right 
in the sense that GROMACS itself does not allocate shared memory; it is 
OpenMPI's sm BTL that uses a shared-memory segment for message passing 
between processes on the same node.) However, if it is not reproducible 
it could also be a physical memory problem, i.e. bad DIMMs, especially 
since you have data corruption every once in a while. Before disabling 
the sm BTL outright, you could also try enlarging OpenMPI's shared-memory 
pool; a sketch follows.
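A minimal sketch of that middle ground, keeping the sm BTL enabled but 
with a bigger pool. The mpool_sm_max_size parameter (a size in bytes) is 
my assumption for OpenMPI 1.2.x; check the exact parameter names your 
installation supports first:

  # list the shared-memory pool parameters your OpenMPI actually supports
  ompi_info --param mpool sm

  # example: raise the sm pool ceiling to 512 MB (value is a placeholder),
  # keeping the sm BTL enabled
  ${OMPI}/mpirun --mca mpool_sm_max_size 536870912 \
      ${ED}/mdrun_openmpi_v1.2.1 -np ${mynp} (etc...)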
One hardware test you can do: take a big file (much larger than the 
amount of memory you have) and run md5sum on it a few times. Copy the 
file to a "good" machine and run it there as well. It should always give 
the same result; see the sketch below. If you can rule out hardware, then 
OpenMPI could be the problem, and you could try the latest LAM/MPI or 
MPICH 2.x (not 1.x!).
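A minimal sketch of that check, assuming a scratch path with enough free 
space (the file name, size, and host name are placeholders):

  # create a file larger than physical RAM (here 8 GB; adjust to your nodes)
  dd if=/dev/urandom of=/scratch/memtest.bin bs=1M count=8192

  # checksum it several times on the suspect node; all sums must match
  for i in 1 2 3 4 5; do md5sum /scratch/memtest.bin; done

  # compare against a known-good machine
  scp /scratch/memtest.bin good-host:/tmp/
  ssh good-host md5sum /tmp/memtest.bin

If the sums ever differ, suspect the node's RAM or disk before blaming 
OpenMPI.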


-- 
David van der Spoel, Ph.D.
Molec. Biophys. group, Dept. of Cell & Molec. Biol., Uppsala University.
Box 596, 75124 Uppsala, Sweden. Phone:	+46184714205. Fax: +4618511755.
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://folding.bmc.uu.se


