[gmx-users] Looking for integration tips: multiple-GPU devices & platform LSF (plus checkpointing)
Chris Dagdigian
dag at bioteam.net
Tue Aug 21 16:13:25 CEST 2012
Hello,
I've done some googling, searched the website and mailing lists so
apologies in advance if this is a bothersome set of questions...
Long story short, I'm trying to help make a gromacs user more efficient
by developing a better set of tools for interacting with both Platform
LSF and compute nodes that have multiple GPU cards installed. My main
handicap is that I'm mostly a Grid Engine admin and it's been years
since I was seriously hands-on with LSF.
I'd love to hear about any tips, tricks, links, FAQs or best-practices
for the following "gromacs, gromacs-gpu & LSF integration" topics:
1. Gromacs, LSF and compute nodes w/ multiple Nvidia GPU cards.
The primary issue here is that unless you tell gromacs-gpu otherwise, it
will always default to using GPU deviceID=0. This is problematic on
nodes with 3x GPU cards installed, where there is a risk that we'll slam
GPU device 0 while leaving devices 1 and 2 untouched and unused. My
preference is that the user should not have to worry about device
selection. It's easy, of course, to pass in environment variables or
other logic, but I'm wondering if there is a "best practice" with LSF for
doing this. Has anyone used LSF features like elim scripts (perhaps an
elim script that monitors the load on the GPUs and does some sort of
load balancing by updating an ENV variable or file flag...)? Or, more
likely, some sort of application-specific esub wrapper that automates
substituting the proper deviceID value into the command-line arguments
passed to the gromacs binary?
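To make the idea concrete, here is a minimal sketch of the device-selection
logic such a wrapper might contain. It is an illustration under stated
assumptions, not a tested LSF integration: it assumes an nvidia-smi that
supports the --query-gpu CSV output, and it relies on the CUDA runtime's
CUDA_VISIBLE_DEVICES variable so that the chosen card shows up as device 0
to the application. The function names (query_gpu_utilization,
pick_least_loaded, gpu_env) are hypothetical.

```python
import os
import subprocess


def query_gpu_utilization():
    """Return nvidia-smi CSV text, one 'index, utilization' row per GPU.

    Assumes an nvidia-smi recent enough to support --query-gpu.
    """
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )


def pick_least_loaded(csv_text):
    """Parse 'index, utilization' rows and return the idlest GPU's index.

    Assumes every row carries a numeric utilization value.
    """
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, util = [field.strip() for field in line.split(",")]
        gpus.append((int(util), int(index)))
    if not gpus:
        raise RuntimeError("no GPUs reported")
    # min() on (utilization, index) tuples: lowest load wins,
    # ties go to the lowest device index.
    return min(gpus)[1]


def gpu_env(csv_text, base_env=None):
    """Return a copy of the environment restricted to the idlest GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # With only one device visible, the CUDA runtime renumbers it as
    # device 0, which matches the application's default deviceID.
    env["CUDA_VISIBLE_DEVICES"] = str(pick_least_loaded(csv_text))
    return env
```

A job wrapper could then launch the real binary with something like
subprocess.call(cmd, env=gpu_env(query_gpu_utilization())). Note that
sampling utilization is not a reservation: two jobs starting at the same
moment can still pick the same card, so a scheduler-level resource would
be more robust.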
2. Gromacs Checkpointing vs LSF checkpoint/restart features
Any real-world recommendations on how best to handle checkpointing on
Platform LSF managed clusters? Is it best to stick only with the
built-in gromacs checkpoint features and just submit the jobs to LSF so
that they are restartable/rerunnable? Do any of the LSF checkpoint /
restart / rerun features add value for gromacs users? Are there any
queue or LSF settings worth using?
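As a sketch of the "built-in checkpointing plus rerunnable job" option,
the snippet below builds an mdrun command line that resumes from a
checkpoint file when one already exists, so that a requeued job simply
picks up where it stopped. It uses mdrun's -deffnm, -cpt and -cpi options;
the binary name, the "md" file prefix, the 15-minute interval and the
helper name build_mdrun_command are assumptions for illustration only.

```python
import os


def build_mdrun_command(workdir, deffnm="md", binary="mdrun", cpt_minutes=15):
    """Build an mdrun command that continues from a checkpoint if present.

    workdir is where the run's files live; with -deffnm the checkpoint
    file is expected to be named <deffnm>.cpt.
    """
    # -cpt sets the checkpoint interval in minutes.
    cmd = [binary, "-deffnm", deffnm, "-cpt", str(cpt_minutes)]
    checkpoint = os.path.join(workdir, deffnm + ".cpt")
    if os.path.exists(checkpoint):
        # A previous attempt left a checkpoint: continue from it
        # instead of starting the simulation over.
        cmd += ["-cpi", deffnm + ".cpt"]
    return cmd
```

Submitted as a rerunnable job (for example with bsub -r), the same
wrapper can be executed again after a node failure without the user
having to edit the command line.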
Honestly, I'd love to see what real-world LSF users are using with
respect to Gromacs and Gromacs-GPU integration. Any wrapper scripts,
queue configurations, LSF configurations, or esub / elim scripts that I
could look at would be appreciated.
Regards,
Chris