[gmx-users] TestBed in MPI not working

Jones de Andrade johannesrs at gmail.com
Tue May 12 02:41:42 CEST 2009


Hi Justin.

Thanks a lot for that. It helped, but not enough yet. :(  It just made the 4.0.4 tests
reach the same "range of errors" that I'm getting with 3.3.3. :P

Using Open MPI, it just complains that it can't find orted. That would mean
that its directory is not in the path, BUT it is. :P If I just try to run orted
from the command line without any arguments:

*****************
gmxtest404 196% orted
[palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orted/orted_main.c at line 323
******************

So, the shell IS finding the file. But when I no longer run it from the script
(I was already suspecting something in the "if-else-end" stack),
all MPI tests fail with the following message in the mdrun.out file:

**********************
orted: Command not found.
--------------------------------------------------------------------------
A daemon (pid 27972) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
***********************

What is going on? The next thing I plan to do is execute a full
command line from one of the tests directly, to see whether it works...  :(  :P
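
One way to narrow this down is to check whether orted resolves both on the PATH of the current session and on the PATH that a freshly started, non-interactive shell ends up with, since a launcher may start the daemon from such a shell. The block below is only a minimal sketch under that assumption. The helper names find_executable and noninteractive_path are hypothetical, and the choice of /bin/sh is an assumption that should be replaced by whatever login shell is actually in use (for example tcsh).

```python
import os
import subprocess


def find_executable(name, path_string):
    """Return every match for `name` in the directories of a PATH-style string."""
    hits = []
    for directory in path_string.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


def noninteractive_path(shell="/bin/sh"):
    """Ask a fresh non-interactive shell for its PATH.

    Assumption: only HOME is passed on, so PATH comes from the shell's own
    defaults and startup files rather than from the current session.
    """
    minimal_env = {"HOME": os.environ.get("HOME", "/")}
    result = subprocess.run(
        [shell, "-c", "echo $PATH"],
        capture_output=True, text=True, check=True, env=minimal_env,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Compare what the current session sees with what a clean shell sees.
    print("current session :", find_executable("orted", os.environ.get("PATH", "")))
    print("non-interactive :", find_executable("orted", noninteractive_path()))
```

If the first list contains orted and the second one is empty, the path is probably being set only in a startup file that non-interactive shells do not read.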

Now I'm absolutely lost. Any ideas, please?

Thanks a lot,

Jones

On Mon, May 11, 2009 at 9:07 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:

>
>
> Justin A. Lemkul wrote:
>
>>
>>
>> Jones de Andrade wrote:
>>
>>> Hi Justin
>>>
>>>    This has been discussed several times on the list.  The -np flag is
>>>    no longer necessary with grompp.  You don't get an mdrun.out because
>>>    the .tpr file is likely never created, since grompp fails.
>>>
>>>
>>> Yes, I know that and that is what I would have expected. But what I'm
>>> running is the gmxtest.pl script. Even in the 4.0.4 version, it explicitly
>>> states that I must use "-np N" on its own command line to make parallel
>>> runs work.
>>>
>>> ************
>>> gmxtest.pl
>>> Usage: ./gmxtest.pl [ -np N ] [-verbose ] [ -double ] [ simple | complex
>>> | kernel | pdb2gmx | all ]
>>>   or: ./gmxtest.pl clean | refclean | dist
>>> ************
>>>
>>> I would expect that the script would use it only for mdrun and not for
>>> grompp, but it seems to try to use it for both. Which becomes really strange if
>>> the testbed really works. So, does gmxtest.pl have a bug in 4.0.4? Or how should I
>>> really tell gmxtest.pl to test on a growing number of cores?
>>>
>>>
>>
>> Ah, sorry for the mis-read :)  There is a simple fix that you can apply to
>> the gmxtest.pl script:
>>
>> % diff gmxtest.pl gmxtest_orig.pl
>> 161c161
>> <         system("$grompp -maxwarn 10 $ndx > grompp.out 2>&1");
>> ---
>>  >         system("$grompp -maxwarn 10 $ndx $par > grompp.out 2>&1");
>>
>> -Justin
>>
>>
>>>
>>>        Version 3.3.3, on the other hand, already failed in so many
>>>        different places that I'm really wondering whether I'll make it
>>>        available on the new cluster. :P
>>>
>>>
>>>    What messages are you getting from 3.3.3?  I thought you said the
>>>    3.3.x series worked fine.
>>>
>>>
>>> I'll log in for those and try to get a reproducible error here. ;) As
>>> soon as I have these, I'll post back in this thread.
>>>
>>> Thanks a lot again,
>>>
>>> Jones
>>>
>>
>>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================