[gmx-developers] some notes on compiling the current gmx cvs sources on cray xt3

Wed May 17 13:16:33 CEST 2006

On Wed, 17 May 2006, David van der Spoel wrote:

DS> Axel Kohlmeyer wrote:

DS> I have fixed the // problems. I don't understand the remark about 
DS> compiling serially.

thanks. to compile serially i need to add the following change
(extracted from the patch i sent):

Index: src/mdlib/pme.c
===================================================================
RCS file: /home/gmx/cvs/gmx/src/mdlib/pme.c,v
retrieving revision 1.75
diff -u -r1.75 pme.c
--- src/mdlib/pme.c     16 May 2006 15:18:08 -0000      1.75
+++ src/mdlib/pme.c     16 May 2006 23:50:00 -0000
@@ -1294,7 +1294,9 @@
     pme->nodeid = cr->nodeid;
     pme->nnodes = cr->nnodes;
   }
+#ifdef GMX_MPI
   pme->mpi_comm = cr->mpi_comm_mygroup;
+#endif
  
   fprintf(log,"Will do PME sum in reciprocal space.\n");
   please_cite(log,"Essman95a");



mpi_comm_mygroup is only defined with --enable-mpi,
see include/types/commrec.h
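
for illustration, a minimal sketch of why the guard is needed. the
struct and function names below (demo_commrec_t, demo_pme_t,
demo_pme_init) are hypothetical stand-ins, not the real gromacs types,
and guarding the pme-side member as well is an assumption of this
sketch. the point is only that a member declared inside #ifdef GMX_MPI
must also be accessed inside #ifdef GMX_MPI, or a serial build fails.

```c
/* sketch only: simplified, hypothetical stand-ins for the gromacs
   structs; the real definitions are in the gromacs headers. */
#ifdef GMX_MPI
#include <mpi.h>
#endif

typedef struct {
  int nodeid;
  int nnodes;
#ifdef GMX_MPI
  MPI_Comm mpi_comm_mygroup;  /* exists only in MPI builds */
#endif
} demo_commrec_t;

typedef struct {
  int nodeid;
  int nnodes;
#ifdef GMX_MPI
  MPI_Comm mpi_comm;          /* assumption: also guarded in this sketch */
#endif
} demo_pme_t;

/* copy the communication info; the communicator is touched only
   when the build actually has it */
void demo_pme_init(demo_pme_t *pme, const demo_commrec_t *cr)
{
  pme->nodeid = cr->nodeid;
  pme->nnodes = cr->nnodes;
#ifdef GMX_MPI
  pme->mpi_comm = cr->mpi_comm_mygroup;
#endif
}
```

without the guard in demo_pme_init, a build without GMX_MPI would stop
with a "no member named mpi_comm_mygroup" style error.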

note that the same file now (revision 1.76) also
contains an unresolved cvs conflict.
 

DS> > 
DS> > as already reported by shawn brown elsewhere, the gcc compiled code
DS> > tends to crash
DS> > in different places (e.g. segfault in add_gbond(), mpi error in 
DS> > splitter.c).
DS> 
DS> This is weird, the add_gbond error might depend on the system studied 
DS> however.

this is the DPPC benchmark. the crashes seem to be quite random. i'll try
compiling with lower optimization. the compiler is gcc 3.3.1.

[...]

DS> > 
DS> Your numbers look good for the ring parallelization scheme that we have 
DS> used until now. It will quite soon be possible to obtain even better 

well, the xt3 has a 3d-torus network and that should have an advantage
with ring schemes (same as SCI dolphin / scali).

DS> scaling using domain decomposition, Berk has been working very hard to 
DS> implement it. The DPPC benchmark scales to 32 Opteron cores on my Gbit 
DS> network already, so it will be interesting to see whether it will be 
DS> even better on the Cray.

yes indeed. using -dd 6 6 2 and 72 nodes with the same input gives:
               NODE (s)   Real (s)      (%)
       Time:    102.000    102.000    100.0   1:42
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    481.538     40.828      8.471      2.833
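
as a quick consistency check on these numbers, here is a small sketch.
the helper names are made up for this note, and it assumes the three
-dd values are the grid dimensions with one cell per node.

```c
/* number of nodes implied by a "-dd nx ny nz" grid
   (assumption: one domain decomposition cell per node) */
int dd_node_count(int nx, int ny, int nz)
{
  return nx * ny * nz;
}

/* convert a rate in ns/day into the cost in hours per ns */
double hours_per_ns(double ns_per_day)
{
  return 24.0 / ns_per_day;
}
```

dd_node_count(6, 6, 2) gives 72, and hours_per_ns(8.471) comes out at
about 2.833, which matches the last two columns of the table.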

at this point output is so frequent that the i/o buffers will need
to be increased to reduce the latencies from the portals (the nodes
have no local disk, only access to a parallel lustre filesystem via
an RPC-like scheme which forwards all i/o to a comparatively small
number of i/o nodes).
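
one generic way to get larger buffers from inside a C program is
setvbuf() from the standard library. the sketch below is only an
illustration of that call: open_buffered is a hypothetical helper, and
it is not a statement about how gromacs or the xt3 i/o layer would
actually be tuned.

```c
#include <stdio.h>

/* open 'path' for writing with a caller-supplied, fully buffered
   stdio buffer of 'size' bytes. returns NULL on failure.
   the buffer must stay valid until the stream is closed. */
FILE *open_buffered(const char *path, char *buf, size_t size)
{
  FILE *fp = fopen(path, "w");
  if (fp == NULL)
    return NULL;
  /* setvbuf has to be called before any other operation on the stream */
  if (setvbuf(fp, buf, _IOFBF, size) != 0) {
    fclose(fp);
    return NULL;
  }
  return fp;
}
```

with a larger buffer, many small writes are collected in memory and
handed to the filesystem in fewer, bigger requests, which is what
helps when every request has to travel to a remote i/o node.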

the throughput is quite impressive.

regards,
   axel.

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.