[gmx-users] Problems with simulation on multi-node cluster
James Starlight
jmsstarlight at gmail.com
Tue Mar 20 12:35:14 CET 2012
Could someone tell me what the error below means?
Getting Loaded...
Reading file MD_100.tpr, VERSION 4.5.4 (single precision)
Loaded with Money
Will use 30 particle-particle and 18 PME only nodes
This is a guess, check the performance at the end of the log file
[ib02:22825] *** Process received signal ***
[ib02:22825] Signal: Segmentation fault (11)
[ib02:22825] Signal code: Address not mapped (1)
[ib02:22825] Failing at address: 0x10
[ib02:22825] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)
[0x7f535903e03$
[ib02:22825] [ 1] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x7e23)
[0x7f535$
[ib02:22825] [ 2] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8601)
[0x7f535$
[ib02:22825] [ 3] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8bab)
[0x7f535$
[ib02:22825] [ 4] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(+0x42af)
[0x7f5353$
[ib02:22825] [ 5] /usr/lib/libopen-pal.so.0(opal_progress+0x5b)
[0x7f535790506b]
[ib02:22825] [ 6] /usr/lib/libmpi.so.0(+0x37755) [0x7f5359282755]
[ib02:22825] [ 7] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1c3a)
[0x7f$
[ib02:22825] [ 8] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0x7fae)
[0x7f$
[ib02:22825] [ 9] /usr/lib/libmpi.so.0(ompi_comm_split+0xbf)
[0x7f535926de8f]
[ib02:22825] [10] /usr/lib/libmpi.so.0(MPI_Comm_split+0xdb) [0x7f535929dc2b]
[ib02:22825] [11]
/usr/lib/libgmx_mpi_d.openmpi.so.6(gmx_setup_nodecomm+0x19b) $
[ib02:22825] [12] mdrun_mpi_d.openmpi(mdrunner+0x46a) [0x40be7a]
[ib02:22825] [13] mdrun_mpi_d.openmpi(main+0x1256) [0x407206]
[ib02:22825] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)
[0x7f$
[ib02:22825] [15] mdrun_mpi_d.openmpi() [0x407479]
[ib02:22825] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 36 with PID 22825 on node ib02 exited on
sign$
--------------------------------------------------------------------------
I obtained this error when I tried to run my system on a multi-node cluster
(there is no problem on a single node). Is this a problem with the cluster
setup, or is something wrong with the parameters of my simulation?
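To check whether plain MPI jobs even start across the nodes (independent of
GROMACS), I would first try something along these lines inside the same PBS
allocation (just a rough sketch; it assumes OpenMPI's mpiexec and that PBS
provides $PBS_NODEFILE for the job):

# show which nodes PBS assigned to the job
cat "$PBS_NODEFILE"

# launch one trivial process per MPI slot (48 = the 30 PP + 18 PME ranks above);
# every assigned node should report its hostname. If this already fails or
# hangs, the cluster/MPI setup is suspect rather than the simulation parameters.
mpiexec -np 48 hostname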
James
On 15 March 2012 at 15:25, James Starlight <jmsstarlight at gmail.com> wrote:
> Mark, Peter,
>
>
> I've tried to generate the .tpr file on my local machine and then launch only
>
> mpiexec -np 24 mdrun_mpi_d.openmpi -v -deffnm MD_100
>
> on the cluster with 2 nodes.
>
> The job appears to be running, but when I check the MD_100.log file
> (attached) there is no information about the simulation steps in it. When
> I use just one node I do see the step-by-step progression of the
> simulation in that file, like the excerpt below, which comes from the same
> log file for a ONE-NODE run:
>
> Started mdrun on node 0 Thu Mar 15 11:22:35 2012
>
> Step Time Lambda
> 0 0.00000 0.00000
>
> Grid: 12 x 9 x 12 cells
> Energies (kJ/mol)
> G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
> 1.32179e+04 3.27485e+03 2.53267e+03 4.06443e+02 6.15315e+04
> LJ (SR) LJ (LR) Disper. corr. Coulomb (SR) Coul. recip.
> 4.12152e+04 -5.51788e+03 -1.70930e+03 -4.54886e+05 -1.46292e+05
> Dis. Rest. D.R.Viol. (nm) Dih. Rest. Potential Kinetic En.
> 2.14240e-02 3.46794e+00 1.33793e+03 -4.84889e+05 9.88771e+04
> Total Energy Conserved En. Temperature Pres. DC (bar) Pressure (bar)
> -3.86012e+05 -3.86012e+05 3.11520e+02 -1.14114e+02 3.67861e+02
> Constr. rmsd
> 3.75854e-05
>
> Step Time Lambda
> 2000 4.00000 0.00000
>
> Energies (kJ/mol)
> G96Angle Proper Dih. Improper Dih. LJ-14 Coulomb-14
> 1.31741e+04 3.25280e+03 2.58442e+03 3.51371e+02 6.15913e+04
> LJ (SR) LJ (LR) Disper. corr. Coulomb (SR) Coul. recip.
> 4.16349e+04 -5.53474e+03 -1.70930e+03 -4.56561e+05 -1.46485e+05
> Dis. Rest. D.R.Viol. (nm) Dih. Rest. Potential Kinetic En.
> 4.78276e+01 3.38844e+00 9.82735e+00 -4.87644e+05 9.83280e+04
> Total Energy Conserved En. Temperature Pres. DC (bar) Pressure (bar)
> -3.89316e+05 -3.87063e+05 3.09790e+02 -1.14114e+02 7.25905e+02
> Constr. rmsd
> 1.88008e-05
>
> and so on...
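>
> While the two-node job is running I also check whether the output files are
> growing at all, with plain shell commands along the lines of:
>
> ls -l MD_100.log MD_100.trr
> tail -n 20 MD_100.log
>
> but no simulation steps ever appear, as in the attached MD_100.log.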
>
>
>
> What could be wrong with the multi-node computations?
>
>
> James
>
>
> On 15 March 2012 at 11:25, Mark Abraham <Mark.Abraham at anu.edu.au> wrote:
>
>> On 15/03/2012 6:13 PM, Peter C. Lai wrote:
>>
>>> Try separating your grompp run from your mpirun:
>>> You should not really be having the scheduler execute the grompp. Run
>>> your grompp step to generate a .tpr either on the head node or on your
>>> local machine (then copy it over to the cluster).
>>>
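>>> For example, with the script from the original post this would look
>>> roughly like the following (just a sketch, keeping the same file names):
>>>
>>> # run once on the head node or your local machine, then copy job.tpr
>>> # into /globaltmp/xz/job_name:
>>> grompp -f md.mdp -c nvtWprotonated.gro -p topol.top -n index.ndx -o job.tpr
>>>
>>> # the submitted job script then contains only:
>>> #!/bin/sh
>>> #PBS -N gromacs
>>> #PBS -l nodes=1:red:ppn=12
>>> #PBS -V
>>> #PBS -o gromacs.out
>>> #PBS -e gromacs.err
>>>
>>> cd /globaltmp/xz/job_name
>>> mpiexec -np 12 mdrun_mpi_d.openmpi -v -deffnm job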
>>
>> Good advice.
>>
>>
>>> (The -p that the scheduler is complaining about only appears in the
>>> grompp step, so don't have the scheduler run it).
>>>
>>
>> grompp is running successfully, as you can see from the output
>>
>> I think "mpiexec -np 12" is being interpreted as "mpiexec -n 12 -p", and
>> the process of separating the grompp stage from the mdrun stage would help
>> make that clear - read documentation first, however.
>>
>> Mark
>>
>>
>>
>>>
>>> On 2012-03-15 10:04:49AM +0300, James Starlight wrote:
>>>
>>>> Dear Gromacs Users!
>>>>
>>>>
>>>> I have some problems with running my simulation on a multi-node station
>>>> which uses OpenMPI.
>>>>
>>>> I launch my jobs by means of the script below. The example runs the job
>>>> on 1 node (12 CPUs).
>>>>
>>>> #!/bin/sh
>>>> #PBS -N gromacs
>>>> #PBS -l nodes=1:red:ppn=12
>>>> #PBS -V
>>>> #PBS -o gromacs.out
>>>> #PBS -e gromacs.err
>>>>
>>>> cd /globaltmp/xz/job_name
>>>> grompp -f md.mdp -c nvtWprotonated.gro -p topol.top -n index.ndx -o job.tpr
>>>> mpiexec -np 12 mdrun_mpi_d.openmpi -v -deffnm job
>>>>
>>>> Each node of my cluster has 12 CPUs. When I use just 1 node on the
>>>> cluster I have no problems running my jobs, but when I try to use more
>>>> than one node I get an error (an example is attached in the gromacs.err
>>>> file, along with the md.mdp of this system). Another outcome of such
>>>> multi-node runs is that the job starts but no calculation is done (the
>>>> name_of_my_job.log file stays empty and no update of the .trr file is
>>>> seen). Commonly this happens when I use many nodes (8-10). Finally, I
>>>> have sometimes obtained errors about the PME order (that time I used 3
>>>> nodes). The exact error differs when I vary the number of nodes.
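>>>>
>>>> For more than one node I only change the resource request and the number
>>>> of MPI processes accordingly, e.g. for two nodes (a sketch of the two
>>>> lines that differ from the script above):
>>>>
>>>> #PBS -l nodes=2:red:ppn=12
>>>> mpiexec -np 24 mdrun_mpi_d.openmpi -v -deffnm job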
>>>>
>>>>
>>>> Could you tell me what could be wrong with my cluster?
>>>>
>>>> Thanks for help
>>>>
>>>> James
>>>>
>>>
>>>