[gmx-users] gmx5.0.2 with GPU acceleration problem

Szilárd Páll pall.szilard at gmail.com
Fri Mar 27 18:45:31 CET 2015


Hi,

That is not the full output; please share the *full* stderr/stdout
(and please don't paste it into the mail body). I find it hard to
believe that the log is completely empty: if the application actually
starts up, at least the version header should be present in the log
file.
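If the batch system is swallowing the output, you can capture it
explicitly; for example (file names here are just placeholders):

  #SBATCH --output=mdrun.%j.out
  #SBATCH --error=mdrun.%j.err

or redirect on the launch line itself:

  aprun ... mdrun_mpi ... > mdrun.${SLURM_JOB_ID}.out 2>&1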

That your admin suggested using "OMP_NUM_THREADS=8 aprun -N 1" is
somewhat strange: that setting is rarely optimal - perhaps only if you
don't use PME and have a very large input.
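For illustration only: on an 8-core node with one GPU, a layout like

  export OMP_NUM_THREADS=4
  # 2 PP ranks per node, 4 threads each, both ranks sharing GPU 0
  aprun -n 64 -N 2 -d 4 $GMX/mdrun_mpi -pin on -gpu_id 00

(with your 32 nodes, hence 64 ranks in total) is usually a better
starting point than a single 8-threaded rank per node; the numbers are
guesses, not tuned values.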

My submission scripts are rather specialized to executing multiple
runs in a single allocation and to processing a large number of
command-line options, so I doubt you can make much use of them.

However, here's a brief summary (a skeleton putting these pieces
together follows after the list):
* sbatch --nodes=N --time=T
* aprun -cc none -n N -N NPPN -d DEPTH
* OMP_NUM_THREADS=NT mdrun_mpi -npme NPME -pin on -dlb no
  -nstlist [25-50] -gpu_id [repeat 0 M times, M = #PP ranks per node]

Typical values:
- NT = 1-4
- NPME: half of the total ranks if N >= 8
- nstlist 40-50 is typically optimal (mdrun will often switch to a
  close-to-optimal setting on its own)
- depending on your setup, -dlb auto (the default) could help, but more
  often than not it doesn't
- gpu_id: maps the PP ranks to GPUs; see the GROMACS wiki for more details
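
Putting the above together, a skeleton for your 32-node job could look
roughly like the following. The rank/thread counts, -npme value and GPU
mapping are illustrative guesses for an 8-core, 1-GPU node, not tuned
settings; GMX is the path from your own script:

  #!/bin/bash -l
  #SBATCH --job-name test-gpu
  #SBATCH --nodes 32
  #SBATCH --time 01:00:00

  GMX=/apps/daint/gromacs/5.0.2/gnu_482/bin
  cd $SLURM_SUBMIT_DIR
  export OMP_NUM_THREADS=2

  # 4 ranks per node x 2 threads = 8 cores/node; half of the 128 ranks
  # are PME-only, and the 2 PP ranks on each node share GPU 0 (-gpu_id 00).
  # -cc none keeps aprun from binding threads so mdrun's -pin on can do it.
  aprun -cc none -n 128 -N 4 -d 2 \
    $GMX/mdrun_mpi -npme 64 -pin on -dlb no -nstlist 40 -gpu_id 00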

Additionally, there are further tweaks that relate specifically to the
Cray XC machine and its architecture, but I'd rather get back to those
later when your basic issues are sorted out.

Cheers,
--
Szilárd

On Fri, Mar 27, 2015 at 4:50 PM, Wayne Liang <chungwen.liang at gmail.com> wrote:
> Hello Szilárd,
>
> Thanks for your response. The error msg I got:
>
> from the slurm report:
>
> Rank 31 [Tue Mar 10 15:50:53 2015] [c6-2c0s5n2] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 31
> Rank 27 [Tue Mar 10 15:50:53 2015] [c6-2c0s4n2] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 27
> Rank 23 [Tue Mar 10 15:50:53 2015] [c5-2c1s14n2] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 23
> Rank 19 [Tue Mar 10 15:50:53 2015] [c9-1c1s14n2] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 19
> Rank 15 [Tue Mar 10 15:50:53 2015] [c9-1c1s13n2] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 15
> Rank 7 [Tue Mar 10 15:50:53 2015] [c7-1c2s7n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 7
>
> _pmiu_daemon(SIGCHLD): [NID 04922] [c5-2c1s14n2] [Tue Mar 10 15:50:53 2015] PE RANK 23 exit signal Aborted
> _pmiu_daemon(SIGCHLD): [NID 03423] [c7-1c2s7n3] [Tue Mar 10 15:50:53 2015] PE RANK 7 exit signal Aborted
> _pmiu_daemon(SIGCHLD): [NID 01382] [c7-0c0s9n2] [Tue Mar 10 15:50:53 2015] PE RANK 3 exit signal Aborted
> _pmiu_daemon(SIGCHLD): [NID 03766] [c9-1c1s13n2] [Tue Mar 10 15:50:53 2015] PE RANK 15 exit signal Aborted
> _pmiu_daemon(SIGCHLD): [NID 05014] [c6-2c0s5n2] [Tue Mar 10 15:50:53 2015] PE RANK 31 exit signal Aborted
> _pmiu_daemon(SIGCHLD): [NID 05010] [c6-2c0s4n2] [Tue Mar 10 15:50:53 2015] PE RANK 27 exit signal Aborted
> _pmiu_daemon(SIGCHLD): [NID 03762] [c9-1c1s12n2] [Tue Mar 10 15:50:53 2015] PE RANK 11 exit signal Aborted
>
> [NID 04922] 2015-03-10 15:50:53 Apid 3843535: initiated application termination
>
> Application 3843535 exit codes: 134
> Application 3843535 exit signals: Killed
> Application 3843535 resources: utime ~123s, stime ~455s, Rss ~96416, inblocks ~35678, outblocks ~60750
>
>
>
> and there is nothing showing in md.log.
>
>
> The option OMP_NUM_THREADS=8 (with -N 1) was suggested by the CSCS admin.
> Please let me know what you think is the most efficient way to run. Thanks
> very much for your suggestions.
>
> Could you please share your submission script with me for testing?
>
> Best,
>
> Chungwen
>
>
>
> On Fri, Mar 27, 2015 at 3:09 PM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
>>
>> Hi Chungwen,
>>
>> I run on Piz Daint on a regular basis and have never had these issues. The
>> only reason for such a thing to happen is that some strange distribution of
>> ranks over nodes ends up leaving some nodes without a PP rank.
>>
>> Could you please post a log file (through pastebin or something similar)
>> of the failing run?
>>
>> BTW, you know that you are almost always better off running with
>> OMP_NUM_THREADS<8 (and -N > 1)?
>>
>> --
>> Szilárd
>>
>>
>> On Fri, Mar 27, 2015 at 12:09 PM, Wayne Liang <chungwen.liang at gmail.com>
>> wrote:
>> > Dear Users,
>> >
>> > I have encountered a problem where mdrun always crashes with the
>> > following message when I run it on more than 32 nodes:
>> >
>> > Software inconsistency error:
>> > Limiting the number of GPUs to <1 doesn't make sense (detected 1, 0 requested)!
>> >
>> > For more information and tips for troubleshooting, please check the GROMACS
>> > website at http://www.gromacs.org/Documentation/Errors
>> >
>> >
>> > My submission script:
>> >
>> > #!/bin/bash -l
>> > #SBATCH --job-name test-gpu
>> > #SBATCH --nodes 32
>> > #SBATCH --cpus-per-task 8
>> > #SBATCH --ntasks-per-node 1
>> > #SBATCH --time 01:00:00
>> >
>> > GMX=/apps/daint/gromacs/5.0.2/gnu_482/bin
>> >
>> > cd $SLURM_SUBMIT_DIR/
>> > export OMP_NUM_THREADS=8
>> > aprun -n 32 -N 1 -d 8 $GMX/mdrun_mpi
>> >
>> > However, with the option -nb cpu (bypassing GPU acceleration), there is no
>> > problem at all. I have searched online, including the mailing list archives,
>> > but there is not much information about it.
>> >
>> > Thanks very much for any response.
>> >
>> > Best,
>> >
>> > Chungwen
>
>

