[gmx-users] Gromacs 4.6 crashes in PBS queue system

Tomek Wlodarski tomek.wlodarski at gmail.com
Tue Feb 19 13:32:32 CET 2013


Hi All,

The problem is that this is the only message I got...
I also get this warning:
--------------------------------------------------------------------------
WARNING: Open MPI will create a shared memory backing file in a
directory that appears to be mounted on a network filesystem.
Creating the shared memory backup file on a network file system, such
as NFS or Lustre is not recommended -- it may cause excessive network
traffic to your file servers and/or cause shared memory traffic in
Open MPI to be much slower than expected.

You may want to check what the typical temporary directory is on your
node.  Possible sources of the location of this temporary directory
include the $TEMPDIR, $TEMP, and $TMP environment variables.

Note, too, that system administrators can set a list of filesystems
where Open MPI is disallowed from creating temporary files by settings
the MCA parameter "orte_no_session_dir".

  Local host: n344
  Filename:   /tmp/openmpi-sessions-didymos at n344_0/19430/1/shared_mem_pool.n344

You can set the MCA parameter shmem_mmap_enable_nfs_warning to 0 to
disable this message.
--------------------------------------------------------------------------

However, I also got this warning with GROMACS 4.5.5, which runs fine, so I
don't think it is the problem in my case.
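
For what it's worth, the warning itself points at a workaround. A rough
sketch of how I could silence it or move the session directory off NFS
(assuming Open MPI's usual --mca syntax; /scratch/$USER is a hypothetical
node-local directory on my cluster):

  # silence the NFS shared-memory warning, as the message suggests
  mpirun --mca shmem_mmap_enable_nfs_warning 0 -np 32 mdrun_mpi ...

  # or point Open MPI's session directory at a node-local filesystem
  mpirun --mca orte_tmpdir_base /scratch/$USER -np 32 mdrun_mpi ...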

As Alexey noticed, the problem seems to be that my nodes have different
architectures... but this was not a problem with GROMACS 4.5.5. A quick way
to compare the relevant CPU flags is sketched below; the full /proc/cpuinfo
entries follow.
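
A minimal check (assuming I can ssh to a compute node; n344 is the node
name from the error message):

  # SIMD-related flags on the access node
  grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | grep -E 'sse|avx'

  # the same check on a compute node
  ssh n344 'grep -m1 flags /proc/cpuinfo' | tr ' ' '\n' | grep -E 'sse|avx'

The access node reports sse4_1, sse4_2 and avx; the compute node stops at
sse2/sse4a, so a binary built with AVX acceleration cannot run there.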

My access node:

processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 21
model        : 1
model name    : AMD Opteron(TM) Processor 6272
stepping    : 2
cpu MHz        : 2400.003
cache size    : 2048 KB
physical id    : 0
siblings    : 16
core id        : 0
cpu cores    : 16
apicid        : 32
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
pdpe1gb rdtscp lm constant_tsc nonstop_tsc extd_apicid amd_dcm pni
pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm
cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch
osvw ibs xop skinit wdt nodeid_msr arat
bogomips    : 4199.99
TLB size    : 1536 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate [9]

My computational node:

processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 2
model name    : Quad-Core AMD Opteron(tm) Processor 8354
stepping    : 3
cpu MHz        : 2200.001
cache size    : 512 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 4
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc
extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
bogomips    : 4399.99
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
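
So the compute nodes (Opteron 8354) lack AVX and SSE4.1/4.2, while the
access node (Opteron 6272) has them. If GROMACS 4.6 auto-detected AVX on
the access node at build time, that would explain the SIGILL. I guess the
fix is to rebuild 4.6 with a lower acceleration level. A rough sketch,
assuming 4.6's GMX_CPU_ACCELERATION cmake option and hypothetical paths on
my side:

  cd gromacs-4.6/build
  cmake .. -DGMX_MPI=ON \
           -DGMX_CPU_ACCELERATION=SSE2 \
           -DCMAKE_INSTALL_PREFIX=/home/users/didymos/gromacs-4.6-sse2
  make -j 8 && make install

An mdrun_mpi built with SSE2 should then run on both the 8354 and 6272
nodes, just without AVX. Does that sound right?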

Thanks a lot!

Best!

tomek




On Sun, Feb 17, 2013 at 2:37 PM, Alexey Shvetsov <alexxy at omrb.pnpi.spb.ru> wrote:

> Hi!
>
> In a message dated 16 February 2013 23:27:45, Tomek Wlodarski wrote:
> > Hi!
> >
> > I have a problem running GROMACS 4.6 in the PBS queue...
> > I end up with this error:
> >
> >
> > [n370:03036] [[19430,0],0]-[[19430,1],8] mca_oob_tcp_msg_recv: readv
> > failed: Connection reset by peer (104)
> >
> --------------------------------------------------------------------------
> > mpirun noticed that process rank 18 with PID 616 on node n344 exited on
> > signal 4 (Illegal instruction).
>
> Aha. Your mdrun process got SIGILL. This means that your nodes have a
> different instruction set than the head node, so try a different
> acceleration level.
> Can you share details about your hw?
>
> >
> --------------------------------------------------------------------------
> > [n370:03036] 3 more processes have sent help message
> > help-opal-shmem-mmap.txt / mmap on nfs
> > [n370:03036] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> > help / error messages
> > 3 total processes killed (some possibly by mpirun during cleanup)
> >
> > I run the same PBS files with the older GROMACS 4.5.5 (installed with
> > the same OpenMPI, gcc and FFTW) and everything works...
> >
> > Also, when I run GROMACS directly on the access node:
> >
> > mpirun -np 32 /home/users/didymos/gromacs/bin/mdrun_mpi -v -deffnm
> > protein-EM-solvated -c protein-EM-solvated.gro
> >
> > it is running OK.
> > Any ideas?
> > Thank you!
> > Best!
> >
> > tomek
> --
> Best Regards,
> Alexey 'Alexxy' Shvetsov
> Petersburg Nuclear Physics Institute, NRC Kurchatov Institute,
> Gatchina, Russia
> Department of Molecular and Radiation Biophysics
> Gentoo Team Ru
> Gentoo Linux Dev
> mailto:alexxyum at gmail.com
> mailto:alexxy at gentoo.org
> mailto:alexxy at omrb.pnpi.spb.ru


