[gmx-users] MPI_Recv invalid count and system explodes for large but not small parallelization on power6 but not opterons
chris.neale at utoronto.ca
Wed Mar 4 05:32:35 CET 2009
Thanks Roland,

The system has 500,000 atoms. I use PME with a 0.9 nm cut-off for the
coulombic interactions and a 1.4/0.9 nm twin-range cut-off for LJ. The
interconnect is InfiniBand between POWER6 nodes, each of which has 32
cores @ 4.2 GHz with simultaneous multithreading, so I can put 64
"tasks" on each node. The scaling on the POWER6 looks like this
(scaling efficiency in parentheses):

N=1    0.036 ns/day
N=2    0.070 ns/day  (97%)
N=4    0.135 ns/day  (94%)
N=8    0.246 ns/day  (85%)
N=16   0.449 ns/day  (75%)
N=32   0.787 ns/day  (68%)
N=60   0.984 ns/day  (46%)

And I get errors above N=60.
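In .mdp terms, the non-bonded setup described above corresponds to
something like the following fragment (only the relevant options are
shown; this is a sketch of my settings, not the complete input file):

  ; non-bonded settings sketched from the description above
  coulombtype = PME
  rcoulomb    = 0.9     ; 0.9 nm coulombic cut-off
  rlist       = 0.9     ; short-range neighbour-list cut-off
  rvdw        = 1.4     ; 1.4/0.9 nm twin-range LJ cut-off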
Regarding what you want from the log file: are you referring to general
scaling issues for, say, N=32, which does run successfully? Or are you
requesting more log-file information to help me deal with the errors I
get while running 196 or 200 "tasks"?

Thanks so much for your help, Roland,
Chris.

-- original message --
Chris,

Depending on your system size and the interconnect, this might be OK.

Thus you need to give us more information, e.g. how many atoms, how
many ns/day, what interconnect, and whether you use PME.

Also, the messages at the end of md.log may give you some advice on how
to improve performance.

Roland

On Tue, Mar 3, 2009 at 10:48 PM, <chris.neale at utoronto.ca> wrote:
Thanks Mark,

your information is always useful. In this case, the page that you
reference appears to be empty. All I see is "There is currently no text
in this page, you can search for this page title in other pages or edit
this page."

Thanks also for your consideration of the massive scaling issue.

Chris.

chris.neale at utoronto.ca wrote:
Hello,

I am currently testing a large system on a POWER6 cluster. I have
compiled GROMACS 4.0.4 successfully, and it appears to be working fine
for <64 "cores" (sic, see below). First, I notice that it runs at
approximately half the speed it obtains on some older Opterons, which
is unfortunate but acceptable. Second, I run into some strange issues
when I use a larger number of cores. Since there are 32 cores per node
with simultaneous multithreading, this yields 64 tasks inside one box,
and I realize that these problems could be MPI-related.
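For concreteness, a single-node, 64-task launch would look something
like the following with a generic MPI launcher (the launcher and the
-deffnm file prefix here are only illustrative, not the exact commands
I use on this machine):

  # illustrative: 64 MPI tasks on one 32-core / 64-thread node
  mpirun -np 64 mdrun_mpi -deffnm md -v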
Some background:
This test system is stable for >100 ns on an Opteron, so I am quite
confident that I do not have a problem with my topology or starting
structure.

Compilation was successful with -O2 only when I modified the ./configure
file as follows; otherwise I got a stray ')' and a linking error:

[cneale at tcs-f11n05]$ diff configure.000 configure
5052a5053
> ac_cv_f77_libs="-L/scratch/cneale/exe/fftw-3.1.2_aix/exec/lib -lxlf90 -L/usr/lpp/xlf/lib -lxlopt -lxlf -lxlomp_ser -lpthreads -lm -lc"
Rather than modifying configure, I suggest you use a customized
configure command line, such as the one described here:
http://wiki.gromacs.org/index.php/GROMACS_on_BlueGene
The config.log output will have a record of what you did, too.
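For example, something along these lines should have the same effect as
your edit (this is only a sketch: the library list is copied from your
diff, and you would add your usual configure options to the same
command):

  # pass the value as a configure-time variable assignment instead of
  # editing the generated script; append your usual options, e.g. --enable-mpi
  ./configure ac_cv_f77_libs="-L/scratch/cneale/exe/fftw-3.1.2_aix/exec/lib -lxlf90 -L/usr/lpp/xlf/lib -lxlopt -lxlf -lxlomp_ser -lpthreads -lm -lc"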
Sorry I can't help with the massive scaling issue.

Mark