[gmx-users] desort

Wed Jan 24 23:49:05 CET 2007

I have developed a methodology to desort based on the script that I 
previously mentioned. The speed of desorting has been improved. For a 
system of 100K atoms the setup and processing for mdrun_mpi (grompp's 
trjconv's g_desort's etc.) took me 30sec. For a system of 750K atoms the 
time was 90sec.

Benifits:
When running in 200ps segments on NP=4 in single precision I get a 
speedup of about 1.4 times vs without -shuffle and speedup of 1.15 times 
vs. with shuffle but without sort. It is only with sort that my scaling 
to NP=4 or NP=16 matches the gromacs benchmarks.

Testing:
1. I have run 20 x 20ps mdrun_mpi iteratively using -shuffle -sort and 
this g_desort program and the system "looks normal".. no quantitative 
test though.
2. I have compared the -deshuf.ndx file produced by grompp to a g_desort 
output .ndx file (both produced using -shuffle but not -sort) and they 
match exactly.

I have not posted this tool to the uploads section since I don't want 
anybody to run into unexpected problems in case it is incorrect since it 
affects the mdrun directly.

I welcome any suggestions for ways to test if this is correct. Also, if 
anybody has some long-time sorted trajectories out there then they could 
use this to recover individual positions and "follow their favourite 
water molecule". If anybody does do this, please post something about 
your success.

This was written from template.c and compilation of this program is done 
similarly.

Before using this, please read the "static char *desc[] = {" completely. 
Especially for the warnings that discuss what number your index files 
begin with and the exact implementation that is a bit of a hassle based 
on the order that grompp does things but it can all be scripted.

Chris Neale.

/*
 * $Id: template.c,v 1.4 2001/07/23 15:28:29 lindahl Exp $
 *
 *                This source code is part of
 *
 *                 G   R   O   M   A   C   S
 *
 *          GROningen MAchine for Chemical Simulations
 *
 *                        VERSION 3.0
 *
 * Copyright (c) 1991-2001
 * BIOSON Research Institute, Dept. of Biophysical Chemistry
 * University of Groningen, The Netherlands
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * If you want to redistribute modifications, please consider that
 * scientific software is very special. Version control is crucial -
 * bugs must be traceable. We will be happy to consider code for
 * inclusion in the official distribution, but derived work must not
 * be called official GROMACS. Details are found in the README & COPYING
 * files - if they are missing, get the official version at www.gromacs.org.
 *
 * To help us fund GROMACS development, we humbly ask that you cite
 * the papers on the package - you can find them in the top README file.
 *
 * Do check out http://www.gromacs.org , or mail us at gromacs at gromacs.org .
 *
 * And Hey:
 * Gyas ROwers Mature At Cryogenic Speed
 */

/* This line is only for CVS version info */
static char *SRCID_template_c = "$Id: template.c,v 1.4 2001/07/23 
15:28:29 lindahl Exp $";

#include "statutil.h"
#include "typedefs.h"
#include "smalloc.h"
#include "vec.h"
#include "copyrite.h"
#include "statutil.h"
#include "tpxio.h"
#include "math.h"

int main(int argc,char *argv[])
{
  static char *desc[] = {
    "g_desort takes two coordinate files as input and outputs a .ndx 
file that ",
    "can be used to properly sort/shuffle or desort/deshuffle. ",
    "The -f option takes two coordinate files as input and the order is 
important: ",
    "the first file is the desired order and the second file is the 
current order. ",
    "To desort coordinates g_desort -f unsorted.gro sorted.gro and to 
resort ",
    "coordinates g_desort -f sorted.gro unsorted.gro. In order to use 
the -sort option ",
    "of grompp you will only need to desort your trajectory and g_desort 
makes this trivial. ",
    "However, the benifit of sorting ",
    "decreases over time as the particles move away from their sorted 
positions. Therefore ",
    "regular re-sorting is desirable. The usage section below describes 
how to do this. ",
    "The basic idea is to pre-sort the .trr file for loading into grompp 
based on a resort ",
    ".ndx file generated by g_desort after a preliminary grompp without 
the .trr. An additional ",
    "sorting step is added to ensure that the grompp sorting did not 
change based on the loaded i",
    ".trr file. The mdrun in then completed and desorted prior to 
starting again. \n",
    "USAGE: \n",
    "1.   grompp -shuffle -sort -c original.gro -o sorted_temp.tpr \n",
    "2.   editconf -f sorted_temp.tpr -o sorted_temp.gro \n",
    "3.   g_desort -f sorted_temp.gro original.gro -o resort_temp.ndx \n",
    "4.   trjconv -f original.trr -n resort_temp.ndx -o sorted.trr \n",
    "5.   grompp -shuffle -sort -c original.gro -t sorted.trr -o 
sorted.tpr \n",
    "6.   editconf -f sorted.tpr -o sorted.gro \n",
    "7.   g_desort -f sorted.gro original.gro -o resort.ndx \n",
    "8.   look=`diff -q resort_temp.ndx resort.ndx`; if [ -n ${look} ]; 
then exit; fi \n",
    "9.   g_desort -f original.gro sorted.gro -o desort.ndx \n",
    "10.  mdrun_mpi -s sorted.tpr \n",
    "11a. trjconv -f output.xtc -n desort.ndx -o output_desorted.xtc \n",
    "11b. trjconv -f output.trr -n desort.ndx -o output_desorted.trr \n",
    "11c. trjconv -f output.gro -n desort.ndx -o output_desorted.gro \n",
    "12.  Use these output_desorted files as input to step 1 as the 
original.xtc/trr/gro files \n",
    "       - the .edr file does not need to be desorted \n",
    "       - g_desort takes care of sorting and shuffling at the same 
time \n",
    "           - Do not use the -deshuf deshuf.ndx file from grompp \n",
    "WARNINGS:\n",
    " - This is a beta version. It might not work properly. Don't trust 
it \n",
    " - Expecting your index files to start from the number 1. If they 
start from the number 0 then ",
    "you must modify the fprintf output lines to remove the +1 from the 
integer that is output \n",
    "BUGS: \n",
    "The determination of the expected number of particles per grid is 
sub-optimal ",
    "and if the ratio of system volume over grid points is too small 
then the program will not allocate ",
    "enough memory. This is detected, but it should be fixed (perhaps by 
actually counting before allocating\n",
  };

  static char *outputfnam;
  static real rDISTCUT=0.005;
  static int iNUMBINS=10;

  /* Extra arguments - but note how you always get the begin/end
   * options when running the program, without mentioning them here!
   */

  //
  t_pargs pa[] = {
         { "-c", FALSE, etREAL, {&rDISTCUT},
                "Cut-off distance for atom identification."
         },
         { "-n", FALSE, etINT, {&iNUMBINS},
                "Number of grids in each dim (>=1)."
         }

  };

  t_topology top;
  char       title[STRLEN];
  t_trxframe fra;
  t_trxframe frb;
  rvec       *xtop;
  matrix     box;
  int        status;
  int        flags = TRX_READ_X;
  char        **fnms;
  int        nfile_in,ata,atb,found_flag;
  real    dab;
  real    
max[3]={0.0,0.0,0.0},min[3]={0.0,0.0,0.0},maxminusmin[3]={0.0,0.0,0.0},minovermaxminusmin[3]={0.0,0.0,0.0};
  int    **bxyz; //[natoms][3]
  int    twobxyz[3]={0.0,0.0,0.0};
  int    ****twobinlist; //[xdim][ydim][zdim][list]  *dim is the cells 
[list=0] is the number in that list
  int maxcellnum;
  int i,j,k;
  FILE *fout;

  t_filenm fnm[] = {
    { efTRX, "-f", NULL, ffRDMULT },      /* and this for the trajectory */
    { efNDX, "-o", "deshuffledesort", ffWRITE }    //output
  };

#define NFILE asize(fnm)

  CopyRight(stderr,argv[0]);

  /* This is the routine responsible for adding default options,
   * calling the X/motif interface, etc. */
  parse_common_args(&argc,argv,PCA_CAN_TIME | PCA_CAN_VIEW,
            NFILE,fnm,asize(pa),pa,asize(desc),desc,0,NULL);

  nfile_in = opt2fns(&fnms,"-f",NFILE,fnm);
  if (nfile_in==0)
    gmx_fatal(FARGS,"");
  if (nfile_in!=2)
    gmx_fatal(FARGS,"There must be exactly 2 input files!");
  outputfnam=opt2fn("-o",NFILE,fnm);
  if(iNUMBINS<=0)
    gmx_fatal(FARGS,"The number of bins must be greater than or equal to 
1\n");
  /* The first time we read data is a little special */
  read_first_frame(&status,fnms[0],&fra,flags);
  read_first_frame(&status,fnms[1],&frb,flags);

  if(fra.natoms!=frb.natoms){
    printf("Error: unequal number of atoms (%d and 
%d)\n",fra.natoms,frb.natoms);
    exit(1);
  }

  printf("WARNING: This is a beta version. It might not work properly. 
Don't trust it\n");
  fout=fopen(outputfnam,"w");
  fprintf(fout,"[ DeShuffle ]");

  //Could use a smarter determination of maxcellnum based on loaded 
dimensions and on knowing how many grids unfilled
  if(iNUMBINS<=3)
    maxcellnum=fra.natoms+1;
  else
    maxcellnum=2*fra.natoms/((iNUMBINS-2)*(iNUMBINS-2)*(iNUMBINS-2)) + 1;
  bxyz=(int **)malloc((fra.natoms+1)*sizeof(int *));
  if(bxyz==NULL)exit(1);
  for(i=0; i<=fra.natoms; ++i){
    bxyz[i]=(int *)malloc(sizeof(int[3]));
    if(bxyz[i]==NULL)exit(1);
  }
  twobinlist=(int ****)malloc(iNUMBINS*sizeof(int ***));
  if(twobinlist==NULL)exit(1);
  for(i=0; i<iNUMBINS; ++i){
    twobinlist[i]=(int ***)malloc(iNUMBINS*sizeof(int **));
    if(twobinlist[i]==NULL)exit(1);
    for(j=0; j<iNUMBINS; ++j){
      twobinlist[i][j]=(int **)malloc(iNUMBINS*sizeof(int *));
      if(twobinlist[i][j]==NULL)exit(1);
      for(k=0; k<iNUMBINS; ++k){
        twobinlist[i][j][k]=(int *)malloc(maxcellnum*sizeof(int));
        if(twobinlist[i][j][k]==NULL)exit(1);
        twobinlist[i][j][k][0]=0;
      }
    }
  }

//printf("About to divide data\n");fflush(stdout);

  //Divide each dimension into the boxes
  for(ata=0; ata<fra.natoms; ++ata){
    for(i=0; i<=2; ++i){
      if(fra.x[ata][i]>max[i])max[i]=fra.x[ata][i];
      if(fra.x[ata][i]<min[i])min[i]=fra.x[ata][i];
    }
  }
  for(i=0; i<=2; ++i){
    maxminusmin[i]=(max[i]-min[i])/(real)iNUMBINS;          
    minovermaxminusmin[i]=min[i]/maxminusmin[i];
  }

  //We only need to know which bin the first one is in
  for(ata=0; ata<fra.natoms; ++ata){
    for(i=0; i<=2; ++i){
      bxyz[ata][i]=(int)floor(fra.x[ata][i]/maxminusmin[i] - 
minovermaxminusmin[i]);

bxyz[ata][i]=(bxyz[ata][i]<0)?0:(bxyz[ata][i]>(iNUMBINS-1))?(iNUMBINS-1):bxyz[ata][i];
    }
  }

  //For the second list, we need to group them by which cell
  for(atb=0; atb<frb.natoms; ++atb){
    for(i=0; i<=2; ++i){
      twobxyz[i]=(int)floor(frb.x[atb][i]/maxminusmin[i] - 
minovermaxminusmin[i]);

twobxyz[i]=(twobxyz[i]<0)?0:(twobxyz[i]>(iNUMBINS-1))?(iNUMBINS-1):twobxyz[i];
    }
    //printf("%d in %d,%d,%d as %d 
entry\n",atb,twobxyz[XX],twobxyz[YY],twobxyz[ZZ],twobinlist[twobxyz[XX]][twobxyz[YY]][twobxyz[ZZ]][0]);fflush(stdout);

    if(++twobinlist[twobxyz[XX]][twobxyz[YY]][twobxyz[ZZ]][0]>maxcellnum){
      printf("Error: too many atoms in one of the cells (exceeded max of 
%d). Try reducing the number of grids that you are using\n",maxcellnum);
      exit(1);
    }

twobinlist[twobxyz[XX]][twobxyz[YY]][twobxyz[ZZ]][twobinlist[twobxyz[XX]][twobxyz[YY]][twobxyz[ZZ]][0]]=atb;
  }

  //printf("About to find matches\n");fflush(stdout);
  /* IMPORTANT NOTE:
   * Printed values must be +1 to value
   * Also, rDISTCUT is squared here to avoid sqrt of dist
   */
  rDISTCUT*=rDISTCUT;
  for(ata=0; ata<fra.natoms; ++ata){
    if(div(ata,10).rem==0)fprintf(fout,"\n");
    found_flag=0;
    //try first to see if it is in the same box
    for(atb=1; 
atb<=twobinlist[bxyz[ata][XX]][bxyz[ata][YY]][bxyz[ata][ZZ]][0]; ++atb){

dab=distance2(fra.x[ata],frb.x[twobinlist[bxyz[ata][XX]][bxyz[ata][YY]][bxyz[ata][ZZ]][atb]]);
      if(dab<rDISTCUT){
        //printf("%d\t-ON GRID-\t%d %d 
%d\n",twobinlist[bxyz[ata][XX]][bxyz[ata][YY]][bxyz[ata][ZZ]][atb]+1,bxyz[ata][XX],bxyz[ata][YY],bxyz[ata][ZZ]);
        fprintf(fout,"  
%d",twobinlist[bxyz[ata][XX]][bxyz[ata][YY]][bxyz[ata][ZZ]][atb]+1);
        found_flag=1;
        break;
      }
    }
    if(found_flag)continue;
    //Look through the entire list
    for(atb=0; atb<frb.natoms; ++atb){
      dab=distance2(fra.x[ata],frb.x[atb]);
      if(dab<rDISTCUT){
        fprintf(fout,"  %d",atb+1);
        found_flag=1;
        break;     
      }
    }
    if(!found_flag){
      printf("Error: could not find pair for %d\n",ata);
      exit(1);
    }
  }
  fprintf(fout,"\n");
  fclose(fout);
  thanx(stderr);

  return 0;
}