[gmx-users] TestBed in MPI not working

Justin A. Lemkul jalemkul at vt.edu
Tue May 12 04:10:53 CEST 2009



Jones de Andrade wrote:
> Ok, a summary of the errors begins here.
> 
> First, errors with MPI in double precision:
> 
> 1 Simple Test:
> bham: ns type Simple is not supported with domain decomposition, use 
> particle decomposition: mdrun -pd
> 
> 7 Complex Tests:
> acetonitrilRF: ns type Simple is not supported with domain 
> decomposition, use particle decomposition: mdrun -pd
> aminoacids: ns type Simple is not supported with domain decomposition, 
> use particle decomposition: mdrun -pd
> argon: ns type Simple is not supported with domain decomposition, use 
> particle decomposition: mdrun -pd
> sw: ns type Simple is not supported with domain decomposition, use 
> particle decomposition: mdrun -pd
> tip4p: ns type Simple is not supported with domain decomposition, use 
> particle decomposition: mdrun -pd
> urea: ns type Simple is not supported with domain decomposition, use 
> particle decomposition: mdrun -pd
> water: ns type Simple is not supported with domain decomposition, use 
> particle decomposition: mdrun -pd
> 

All of the above can be fixed by changing the appropriate .mdp option: the test 
inputs use ns_type = simple, and grid-based neighbor searching is required for 
domain decomposition.
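
For example (a minimal sketch; exactly which files to edit depends on which 
tests failed), either switch the neighbor-searching scheme in each test's .mdp 
file:

   ; grid-based neighbor searching works with domain decomposition
   ns_type = grid

or run the tests in particle decomposition mode, as the error message itself 
suggests (mdrun -pd).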

> 16 Kernel Tests: 0 computation time. Something has gone REALLY bad with 
> those...  :(
> 

Known issue:

http://bugzilla.gromacs.org/show_bug.cgi?id=313

> Except for the kernel tests, it seems that in all of them I'm getting that 
> same error message (still looking at it). Are those expected to appear? And 
> what about the kernel ones? Am I wrong, or does that mean compilation 
> problems (especially since they appear in all tests, single and double 
> precision, with and without MPI)?
> 
> I'm also getting errors in serial in single precision in 4 complex tests. 
> Those seem to have run, but yielded wrong results?
> 

Someone else just experienced this problem as well, so it probably needs to be 
looked into.  Check the contents of checkpot.out and checkvir.out to see if the 
results are similar to:

http://www.gromacs.org/pipermail/gmx-users/2009-May/041696.html

The problem there appeared to be a missing energy term (Vir-XX).
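
A quick way to pull out the offending terms (a sketch; adjust the paths to 
wherever your failing tests live) is something like:

   % grep -B1 -A1 'Vir-XX' complex/*/checkvir.out
   % cat complex/*/checkpot.out

If one term is simply absent from the reference or test output, it is the same 
symptom as in the thread above.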

-Justin

> Does anybody have any clue, please? Shall I go straight to recompilation, 
> even though there is no apparent reason for failure here?
> 
> Thanks a lot!
> 
> Jones
> 
> On Mon, May 11, 2009 at 10:42 PM, Jones de Andrade <johannesrs at gmail.com 
> <mailto:johannesrs at gmail.com>> wrote:
> 
>     Hi Justin.
> 
>     Well, bothering again. Good and bad news.
> 
>     The good news: I found a strange "work-around" for my problems here.
>     For some reason, the perl script resets the path, environment and
>     everything else when it runs, so the variables I set in the calling
>     script were simply lost. The workaround was to put those variables
>     in the .tcshrc file and log in again.
> 
>     The problem is that this isn't practical. I'm trying a lot of
>     different MPI and library builds, and having to edit that file and
>     then log out/log in or source it every time is not practical at all.
>     Is there any other way, so that the perl script keeps the variables
>     it has when it's called, instead of initializing them all again?
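
One way around this (a sketch, assuming tcsh and Open MPI; the install prefix
/opt/openmpi is hypothetical, and mdrun_mpi stands in for whatever your
MPI-enabled mdrun binary is called) is to export the paths so that every child
process inherits them, or to let mpirun set up its own environment per run:

   # tcsh: make the MPI binaries and libraries visible to child processes
   setenv PATH /opt/openmpi/bin:${PATH}
   setenv LD_LIBRARY_PATH /opt/openmpi/lib:${LD_LIBRARY_PATH}

   # or, per run, without touching .tcshrc:
   mpirun --prefix /opt/openmpi -np 4 mdrun_mpi -s topol.tpr

Open MPI's --prefix makes orted and its libraries findable without relying on
the login environment.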
> 
>     Second, here comes the real bad news: lots of errors.
> 
>     Without MPI, in single precision, 4 complex and 16 kernel tests fail.
> 
>     Without MPI, but in double precision, "just" the 16 kernel tests fail.
> 
>     With MPI, in single precision, it fails on 1 simple, 9 complex and
>     16 kernel tests!
> 
>     And with MPI and double precision, 1 simple, 7 complex and 16 kernel
>     tests fail. :P
> 
>     Edit: Just received your message. Well, it seems that I made a
>     mistake in my script, but since at least part of the tests worked,
>     it means that it's not the MPI that is misconfigured, at least.
> 
>     I will look deeper into the errors above and tell you later.
> 
>     Thanks a lot,
> 
>     Jones
> 
> 
>     On Mon, May 11, 2009 at 9:41 PM, Jones de Andrade
>     <johannesrs at gmail.com <mailto:johannesrs at gmail.com>> wrote:
> 
>         Hi Justin.
> 
>         Thanks a lot for that. It helped, but not enough yet. :(  It
>         just brought the 4.0.4 tests to the same "range of errors" I'm
>         getting with 3.3.3. :P
> 
>         Using Open MPI, it just complains that it can't find orted.
>         That would mean that the paths are not set, BUT they are. :P
>         If I just run orted from the command line without any arguments:
> 
>         *****************
>         /gmxtest404 196% orted
>         [palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found
>         in file runtime/orte_init.c at line 125
>         --------------------------------------------------------------------------
>         It looks like orte_init failed for some reason; your parallel
>         process is
>         likely to abort.  There are many reasons that a parallel process can
>         fail during orte_init; some of which are due to configuration or
>         environment problems.  This failure appears to be an internal
>         failure;
>         here's some additional information (which may only be relevant to an
>         Open MPI developer):
> 
>           orte_ess_base_select failed
>           --> Returned value Not found (-13) instead of ORTE_SUCCESS
>         --------------------------------------------------------------------------
>         [palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found
>         in file orted/orted_main.c at line 323
>         /*****************
> 
>         So, the shell IS finding the file. But when it is launched from
>         the script instead (I was already suspecting something in the
>         if-else-end stack), all MPI tests fail with the following message
>         in the mdrun.out file:
> 
>         **********************
>         /orted: Command not found.
>         --------------------------------------------------------------------------
>         A daemon (pid 27972) died unexpectedly with status 1 while
>         attempting
>         to launch so we are aborting.
> 
>         There may be more information reported by the environment (see
>         above).
> 
>         This may be because the daemon was unable to find all the needed
>         shared
>         libraries on the remote node. You may set your LD_LIBRARY_PATH
>         to have the
>         location of the shared libraries on the remote nodes and this will
>         automatically be forwarded to the remote nodes.
>         --------------------------------------------------------------------------
>         --------------------------------------------------------------------------
>         mpirun noticed that the job aborted, but has no info as to the
>         process
>         that caused that situation.
>         --------------------------------------------------------------------------
>         mpirun: clean termination accomplished
>         /**********************
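
(For what it's worth: getting "orted: Command not found" from mpirun while an
interactive shell finds it usually means the daemon is launched with a PATH
that lacks the Open MPI bin directory. A quick check of what mpirun's children
actually inherit, assuming a single local node:

   mpirun -np 1 printenv PATH
   which orted

If the two disagree, Open MPI's --prefix option, or a build configured with
--enable-orterun-prefix-by-default, sidesteps the problem.)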
> 
>         What is going on? The next thing I plan to do is execute a
>         full command line from one of the tests directly, to see
>         whether it works...  :(  :P
> 
>         Now I'm absolutely lost. Any ideas, please?
> 
>         Thanks a lot,
> 
>         Jones
> 
> 
>         On Mon, May 11, 2009 at 9:07 PM, Justin A. Lemkul
>         <jalemkul at vt.edu <mailto:jalemkul at vt.edu>> wrote:
> 
> 
> 
>             Justin A. Lemkul wrote:
> 
> 
> 
>                 Jones de Andrade wrote:
> 
>                     Hi Justin
> 
>                         This has been discussed several times on the
>                         list.  The -np flag is no longer necessary with
>                         grompp.  You don't get an mdrun.out because the
>                         .tpr file is likely never created, since grompp
>                         fails.
> 
> 
>                     Yes, I know that, and that is what I would have
>                     expected. But what I'm running is the gmxtest.pl
>                     script. Even in the 4.0.4 version, it explicitly
>                     states that I must use "-np N" on its command line
>                     to make parallel runs work.
> 
>                     ************
>                     gmxtest.pl
>                     Usage: ./gmxtest.pl [ -np N ] [-verbose ] [ -double
>                     ] [ simple | complex | kernel | pdb2gmx | all ]
>                       or: ./gmxtest.pl clean | refclean | dist
>                     ************
> 
>                     I would expect the script to use it only for mdrun
>                     and not for grompp, but it seems to use it on both.
>                     What is really strange is that the testbed ever
>                     works at all. So, does gmxtest.pl have a bug in
>                     4.0.4? Or how should I tell gmxtest.pl to test on a
>                     growing number of cores?
> 
> 
>                 Ah, sorry for the misread :)  There is a simple fix
>                 that you can apply to the gmxtest.pl script:
> 
>                 % diff gmxtest.pl gmxtest_orig.pl
>                 161c161
>                 <         system("$grompp -maxwarn 10 $ndx > grompp.out 2>&1");
>                 ---
>                 >         system("$grompp -maxwarn 10 $ndx $par > grompp.out 2>&1");
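>
>                 With that change, an invocation like, for example:
>
>                 ./gmxtest.pl -np 4 -double complex
>
>                 should pass -np only to mdrun, so grompp no longer
>                 chokes on it.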
> 
>                 -Justin
> 
> 
> 
>                            Version 3.3.3, on the other hand, already
>                            failed in so many different places that I'm
>                            really wondering IF I'll make it available
>                            on the new cluster. :P
> 
> 
>                        What messages are you getting from 3.3.3?  I
>                        thought you said the 3.3.x series worked fine.
> 
> 
>                     I'll log in and try to get a reproducible error
>                     here. ;) As soon as I have one, I'll post back in
>                     this thread.
> 
>                     Thanks a lot again,
> 
>                     Jones
> 
> 
> 

-- 
========================================

Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

========================================


