[gmx-users] Looking for integration tips: multiple-GPU devices & platform LSF (plus checkpointing)

Chris Dagdigian dag at bioteam.net
Tue Aug 21 16:13:25 CEST 2012


Hello,

I've done some googling, searched the website and mailing lists so 
apologies in advance if this is a bothersome set of questions...

Long story short I'm trying to help make a gromacs user more efficient 
by developing a better set of tools for interacting with both Platform 
LSF and compute nodes that have multiple GPU cards installed. My main 
handicap is that I'm mostly a Grid Engine admin and it's been years 
since I was seriously hands-on with LSF.

I'd love to hear about any tips, tricks, links, FAQs or best-practices 
for the following "gromacs, gromacs-gpu & LSF integration" topics:


1. Gromacs, LSF and compute nodes w/ multiple Nvidia GPU cards.

The primary issue here is that unless you tell gromacs-gpu otherwise, it 
will always default to using GPU deviceID=0. This is problematic on 
nodes with 3x GPU cards installed where there is a risk that we'll slam 
GPU device 0 while leaving devices 1 and 2 untouched and unused. I would
prefer that the user not have to worry about device selection. It's easy,
of course, to pass in environment variables or other logic, but I'm
wondering if there is a "best practice" with LSF for doing this. Has
anyone used LSF features such as elim scripts (perhaps an elim script
that monitors the load on the GPUs and does some sort of load balancing
by updating an ENV variable or file flag...) or, more likely, some sort
of application-specific esub wrapper that automates substituting the
proper deviceID value into the command-line argument passed to the
gromacs binary?
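
A minimal sketch of the kind of wrapper described above, assuming one
lock file per GPU in a node-local directory. The function names, the
lock directory and the count of three GPUs are illustrative assumptions,
not part of gromacs or LSF. It relies on the CUDA runtime's
CUDA_VISIBLE_DEVICES variable: when it is set to a single index, the
process sees only that card, as device 0, so the default deviceID=0
lands on whichever card the wrapper claimed. It uses flock, so it is
Unix-only.

```python
import fcntl
import os
import subprocess
import sys
import tempfile


def claim_gpu(num_gpus, lock_dir):
    """Return (device_index, open_lock_file) for the first unclaimed GPU, or None."""
    os.makedirs(lock_dir, exist_ok=True)
    for index in range(num_gpus):
        lock_file = open(os.path.join(lock_dir, "gpu%d.lock" % index), "w")
        try:
            # Non-blocking exclusive lock: fails at once if another job holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            lock_file.close()
            continue
        return index, lock_file
    return None


def run_on_free_gpu(command, num_gpus=3, lock_dir=None):
    """Run command with CUDA_VISIBLE_DEVICES set to a GPU nobody else has claimed."""
    lock_dir = lock_dir or os.path.join(tempfile.gettempdir(), "gpu-locks")
    claim = claim_gpu(num_gpus, lock_dir)
    if claim is None:
        raise RuntimeError("no free GPU on this node")
    index, lock_file = claim
    # The child sees only the claimed card, renumbered as device 0.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(index))
    try:
        return subprocess.call(command, env=env)
    finally:
        # Closing the file releases the lock, including after a crash of the child.
        lock_file.close()
```

The job submitted to LSF would then call run_on_free_gpu() with the
unmodified gromacs command line as a list of arguments. The kernel
releases the lock if the wrapper itself dies, so a killed job does not
leave a GPU marked as busy.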


2. Gromacs Checkpointing vs LSF checkpoint/restart features

Any real-world recommendations on how best to handle checkpointing on
Platform LSF managed clusters? Is it best to stick with the built-in
gromacs checkpoint features and just submit the jobs to LSF so that they
are restartable/rerunnable? Do any of the LSF checkpoint / restart /
rerun features add value for gromacs users? Are there any recommended
queue or LSF settings?
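
To make the first option concrete, here is a minimal sketch of a job
script helper that relies only on gromacs' own checkpoint file, assuming
the job is submitted as rerunnable (for example with bsub -r) and that
mdrun's -cpi option is used to continue from a checkpoint. The function
name and the file names passed to it are placeholders.

```python
import os


def mdrun_command(tpr_file, checkpoint_file, mdrun="mdrun"):
    """Build an mdrun argument list that resumes when a checkpoint exists."""
    command = [mdrun, "-s", tpr_file]
    if os.path.exists(checkpoint_file):
        # A previous attempt got far enough to write a checkpoint:
        # continue from it instead of starting over.
        command += ["-cpi", checkpoint_file]
    return command
```

Because the same script works for the first run and for every rerun, LSF
only has to requeue the job; it does not need to know anything about the
checkpoint format.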

Honestly, I'd love to see what real-world LSF users are using with
respect to Gromacs and Gromacs-GPU integration. If there are any wrapper
scripts, or even queue configurations / LSF configurations / esub / elim
scripts, that I could look at, I would appreciate it.


Regards,
Chris




