[gmx-users] TestBed in MPI not working
Justin A. Lemkul
jalemkul at vt.edu
Tue May 12 04:10:53 CEST 2009
Jones de Andrade wrote:
> Ok, the summary of errors begins here.
>
> First, errors with MPI in double precision:
>
> 1 Simple Test:
> bham: ns type Simple is not supported with domain decomposition, use
> particle decomposition: mdrun -pd
>
> 7 Complex Tests:
> acetonitrilRF: ns type Simple is not supported with domain
> decomposition, use particle decomposition: mdrun -pd
> aminoacids: ns type Simple is not supported with domain decomposition,
> use particle decomposition: mdrun -pd
> argon: ns type Simple is not supported with domain decomposition, use
> particle decomposition: mdrun -pd
> sw: ns type Simple is not supported with domain decomposition, use
> particle decomposition: mdrun -pd
> tip4p: ns type Simple is not supported with domain decomposition, use
> particle decomposition: mdrun -pd
> urea: ns type Simple is not supported with domain decomposition, use
> particle decomposition: mdrun -pd
> water: ns type Simple is not supported with domain decomposition, use
> particle decomposition: mdrun -pd
>
All of the above can be fixed by changing the appropriate .mdp option.
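For example, for one of the failing tests (a sketch only; you could equally
leave the .mdp alone and run mdrun with -pd, as the error message suggests):

% cd complex/argon
% perl -pi -e 's/^ns_type.*/ns_type = grid/' grompp.mdp

Grid-based neighbour searching is compatible with domain decomposition, while
"simple" is not.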
> 16 Kernel Tests: 0 computation time. Something has gone REALLY wrong
> with those... :(
>
Known issue:
http://bugzilla.gromacs.org/show_bug.cgi?id=313
> Except for the kernel tests, it seems that in all of them I'm getting
> that same error message (still looking at it). Are those expected to
> appear? And what about the kernel ones? Am I wrong, or does that mean
> compilation problems (especially since they appear in all tests, single
> and double precision, with and without MPI)?
>
> Also, I'm getting errors in serial, in single precision, in 4 complex
> tests. Those seem to have run, but yielded wrong results?
>
Someone else just experienced this problem as well. It probably needs to be
looked into. Check the contents of checkpot.out and checkvir.out to see if the
results are similar to:
http://www.gromacs.org/pipermail/gmx-users/2009-May/041696.html
The problem there appeared to be a missing energy term (Vir-XX).
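For instance (the directory is just an example; check whichever tests failed
for you):

% cd complex/acetonitrilRF
% cat checkpot.out checkvir.out

If the same term shows up as missing or mismatched there, it is probably the
same issue.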
-Justin
> Does anybody have any clue, please? Shall I go straight to
> recompilation, even though there is no obvious reason for failure here?
>
> Thanks a lot!
>
> Jones
>
> On Mon, May 11, 2009 at 10:42 PM, Jones de Andrade <johannesrs at gmail.com
> <mailto:johannesrs at gmail.com>> wrote:
>
> Hi Justin.
>
> Well, I'm bothering you again. Good news and bad news.
>
> The good news: I found a strange "work-around" for my problems here.
> For some reason, the perl script reinitializes the path, environment
> variables and everything else when it runs. So the variables I set in
> the script I was using were simply lost. The workaround here was, then,
> to just include those in the .tcshrc file and log in again.
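> Roughly, I mean lines like these (the install path is just a placeholder
> for whichever MPI build I'm testing at the moment):
>
> *****************
> setenv PATH /opt/openmpi/bin:${PATH}
> setenv LD_LIBRARY_PATH /opt/openmpi/lib
> *****************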
>
> The problem is that it's not practical. I'm trying a lot of different
> MPI and library builds, and having to edit that file and then log
> out/log in or source it is not practical at all. Is there any other
> way, so that the perl script is happy with the variables it has when
> it's called, instead of reinitializing them all again?
>
> Second, here comes the really bad news: lots of errors.
>
> Without MPI, in single precision, 4 complex and 16 kernel tests fail.
>
> Without MPI, but in double precision, "just" the 16 kernel tests fail.
>
> With MPI, in single precision, it fails on 1 simple, 9 complex and
> 16 kernel tests!
>
> And with MPI and double precision, 1 simple, 7 complex and 16 kernel
> tests fail. :P
>
> Edit: I just received your message. Well, it seems that I made a
> mistake in my script, but since at least part of the tests worked,
> it means that it's not the MPI that is misconfigured, at least.
>
> I will look deeper into the errors above and tell you later.
>
> Thanks a lot,
>
> Jones
>
>
> On Mon, May 11, 2009 at 9:41 PM, Jones de Andrade
> <johannesrs at gmail.com <mailto:johannesrs at gmail.com>> wrote:
>
> Hi Justin.
>
> Thanks a lot for that. It helped, but not enough yet. :( It just made
> the 4.0.4 tests reach the same "range of errors" that I'm getting
> with 3.3.3. :P
>
> Using Open MPI, it just complains that it can't find orted. That would
> mean that orted is not in the path, BUT it is. :P If I just try to run
> orted from the command line without any arguments:
>
> *****************
> gmxtest404 196% orted
> [palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_base_select failed
> --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 323
> *****************
>
> So, the shell IS finding the file. But when I don't run it from the
> script anymore (I was already thinking of something in the
> "if-else-end" stack), all MPI tests fail with the following message in
> the mdrun.out file:
>
> **********************
> orted: Command not found.
> --------------------------------------------------------------------------
> A daemon (pid 27972) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
> /**********************
>
> What is going on? The next thing I'm thinking of doing is to execute
> the full command line from one of the tests directly, to see whether
> it works... :( :P
>
> Now I'm absolutely lost. Any ideas, please?
>
> Thanks a lot,
>
> Jones
>
>
> On Mon, May 11, 2009 at 9:07 PM, Justin A. Lemkul
> <jalemkul at vt.edu <mailto:jalemkul at vt.edu>> wrote:
>
>
>
> Justin A. Lemkul wrote:
>
>
>
> Jones de Andrade wrote:
>
> Hi Justin
>
> This has been discussed several times on the list. The -np flag is
> no longer necessary with grompp. You don't get an mdrun.out because
> the .tpr file is likely never created, since grompp fails.
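> In 4.0 the workflow is simply, e.g. (binary names depend on how the
> install was configured):
>
> grompp -f grompp.mdp -c conf.gro -p topol.top -o topol.tpr
> mpirun -np 4 mdrun_mpi -s topol.tpr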
>
>
> Yes, I know that, and that is what I would have expected. But what I'm
> running is the gmxtest.pl script. Even in the 4.0.4 version, it
> explicitly states that I must use "-np N" on its own command line to
> make parallel work:
>
> ************
> gmxtest.pl
> Usage: ./gmxtest.pl [ -np N ] [-verbose ] [ -double ] [ simple | complex | kernel | pdb2gmx | all ]
> or: ./gmxtest.pl clean | refclean | dist
> ************
>
> I would expect the script to use it only for mdrun and not for grompp,
> but it seems to try to use it for both. What makes it really strange is
> that the testbed otherwise really works. So, does gmxtest.pl have a bug
> in 4.0.4? Or how should I tell gmxtest.pl to run the tests on a growing
> number of cores?
>
>
>
> Ah, sorry for the misread :)  There is a simple fix that you can apply
> to the gmxtest.pl script:
>
> % diff gmxtest.pl gmxtest_orig.pl
> 161c161
> < system("$grompp -maxwarn 10 $ndx > grompp.out 2>&1");
> ---
> > system("$grompp -maxwarn 10 $ndx $par > grompp.out 2>&1");
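> With that change in place, something like
>
> % ./gmxtest.pl -np 4 complex
>
> should run grompp without the parallel flag while still passing it on
> for the mdrun stage.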
>
> -Justin
>
>
>
> Version 3.3.3, on the other hand, already failed in so many different
> places that I'm really wondering IF I'll make it available on the new
> cluster. :P
>
>
> What messages are you getting from 3.3.3? I thought you said the 3.3.x
> series worked fine.
>
>
> I'll log in and try to get a reproducible error for those. ;) As soon
> as I have it, I'll post back in this thread.
>
> Thanks a lot again,
>
> Jones
>
>
>
--
========================================
Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
========================================