[gmx-users] TestBed in MPI not working
Jones de Andrade
johannesrs at gmail.com
Tue May 12 04:06:23 CEST 2009
OK, the summary of errors begins here.
First, errors with MPI in double precision:
1 Simple Test:
bham: ns type Simple is not supported with domain decomposition, use
particle decomposition: mdrun -pd
7 Complex Tests:
acetonitrilRF: ns type Simple is not supported with domain decomposition,
use particle decomposition: mdrun -pd
aminoacids: ns type Simple is not supported with domain decomposition, use
particle decomposition: mdrun -pd
argon: ns type Simple is not supported with domain decomposition, use
particle decomposition: mdrun -pd
sw: ns type Simple is not supported with domain decomposition, use particle
decomposition: mdrun -pd
tip4p: ns type Simple is not supported with domain decomposition, use
particle decomposition: mdrun -pd
urea: ns type Simple is not supported with domain decomposition, use
particle decomposition: mdrun -pd
water: ns type Simple is not supported with domain decomposition, use
particle decomposition: mdrun -pd
16 Kernel Tests: 0 computation time. Something went REALLY bad with those...
:(
Except for the kernel tests, it seems that in all of them I'm getting that
same error message (still looking at it). Are those expected to appear? And
what about the kernel ones? Am I wrong, or does that mean compilation problems
(especially since they appear in all tests, single and double precision, with
and without MPI)?
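For reference, here is how I read the suggested workaround (just a sketch;
the test directory, binary names and core count are illustrative): either
rerun the failing test with particle decomposition, as the message says, or
switch ns_type to grid in the .mdp so that domain decomposition is allowed:
*****************
cd complex/argon
grompp -maxwarn 10
mpirun -np 4 mdrun -pd -s topol.tpr
# ...or, instead of -pd, edit the .mdp and regenerate the .tpr with:
#   ns_type = grid
*****************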
I'm also getting errors in serial, in single precision, in 4 complex tests.
Those seem to have run, but yielded wrong results?
Does anybody have any clue, please? Shall I go straight to recompilation,
even though there is no apparent reason for failure here?
Thanks a lot!
Jones
On Mon, May 11, 2009 at 10:42 PM, Jones de Andrade <johannesrs at gmail.com> wrote:
> Hi Justin.
>
> Well, bothering again. Good and bad news.
>
> The good news: I found a strange "workaround" for my problems here. For
> some reason, the perl script resets the path, environment variables and
> everything else when it runs. So the variables I had set in the script I
> was using were simply lost. The workaround, then, was to just include those
> in the .tcshrc file and log in again.
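>
> (Concretely, the lines I mean in .tcshrc are of this sort; the install
> paths are just illustrative:)
>
> *****************
> setenv PATH /opt/openmpi/bin:${PATH}
> setenv LD_LIBRARY_PATH /opt/openmpi/lib
> *****************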
>
> The problem is that it's not practical. I'm trying a lot of different MPI
> and library builds, and having to edit that file and then logout/login or
> source it is not practical at all. Is there any other way, so that the perl
> script will keep the variables it has when it's called, instead of
> reinitializing them all?
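>
> (To be concrete, what I would like is for something like this sketch to
> just work, with the perl script simply inheriting the environment it is
> called with; env is an external command, so it works from tcsh too, and
> the install paths are illustrative:)
>
> *****************
> env PATH=/opt/openmpi/bin:${PATH} \
>     LD_LIBRARY_PATH=/opt/openmpi/lib \
>     ./gmxtest.pl -np 4 all
> *****************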
>
> Second, here comes the really bad news: lots of errors.
>
> Without MPI, in single precision, 4 complex and 16 kernel tests fail.
>
> Without MPI, but in double precision, "just" the 16 kernel tests fail.
>
> With MPI, in single precision, it fails on 1 simple, 9 complex and 16
> kernel tests!
>
> And with MPI and double precision, 1 simple, 7 complex and 16 kernel tests
> fail. :P
>
> Edit: just received your message. Well, it seems that I made a mistake in
> my script, but since at least part of the tests worked, it means that, at
> least, it's not the MPI that is misconfigured.
>
> I will look deeper into the errors above and tell you later.
>
> Thanks a lot,
>
> Jones
>
>
> On Mon, May 11, 2009 at 9:41 PM, Jones de Andrade <johannesrs at gmail.com> wrote:
>
>> Hi Justin.
>>
>> Thanks a lot for that. It helped, but not enough yet. :( It just made the
>> 4.0.4 tests reach the same "range of errors" that I'm getting with 3.3.3. :P
>>
>> Using Open MPI, it just complains that it can't find orted. That would
>> suggest that the paths are not set, BUT they are. :P If I just try to run
>> orted from the command line without any arguments:
>>
>> *****************
>> gmxtest404 196% orted
>> [palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> runtime/orte_init.c at line 125
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>> orte_ess_base_select failed
>> --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> [palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> orted/orted_main.c at line 323
>> ******************
>>
>> So, the shell IS finding the file. But when I run it from the script
>> (I was already suspecting something in the if-else-endif logic), all the
>> MPI tests fail with the following message in the mdrun.out file:
>>
>> **********************
>> orted: Command not found.
>> --------------------------------------------------------------------------
>> A daemon (pid 27972) died unexpectedly with status 1 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>> ***********************
>>
>> What is going on? The next thing I'm thinking of doing is to execute a
>> full command line from one of the tests directly, to see whether it
>> works... :( :P
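>>
>> (Something like this sketch is what I have in mind; the install prefix
>> and test name are made up. Open MPI's --prefix option is supposed to
>> forward PATH and LD_LIBRARY_PATH to the spawned daemons, which is
>> exactly what the message above complains about:)
>>
>> *****************
>> cd complex/argon
>> grompp -maxwarn 10
>> mpirun --prefix /opt/openmpi -np 4 mdrun -s topol.tpr
>> *****************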
>>
>> Now I'm absolutely lost. Any ideas, please?
>>
>> Thanks a lot,
>>
>> Jones
>>
>>
>> On Mon, May 11, 2009 at 9:07 PM, Justin A. Lemkul <jalemkul at vt.edu> wrote:
>>
>>>
>>>
>>> Justin A. Lemkul wrote:
>>>
>>>>
>>>>
>>>> Jones de Andrade wrote:
>>>>
>>>>> Hi Justin
>>>>>
>>>>> This has been discussed several times on the list. The -np flag is
>>>>> no longer necessary with grompp. You don't get an mdrun.out because
>>>>> the .tpr file is likely never created, since grompp fails.
>>>>>
>>>>>
>>>>> Yes, I know that, and that is what I would have expected. But what I'm
>>>>> running is the gmxtest.pl script. Even in the 4.0.4 version, it
>>>>> explicitly states that I must use "-np N" on its own command line to
>>>>> make parallel runs work.
>>>>>
>>>>> ************
>>>>> gmxtest.pl
>>>>> Usage: ./gmxtest.pl [ -np N ] [-verbose ] [ -double ] [ simple |
>>>>> complex | kernel | pdb2gmx | all ]
>>>>> or: ./gmxtest.pl clean | refclean | dist
>>>>> ************
>>>>>
>>>>> I would expect the script to use it only for mdrun and not for grompp,
>>>>> but it seems to try to use it on both. What is really strange is that the
>>>>> testbed otherwise really works. So, does gmxtest.pl have a bug in 4.0.4? Or
>>>>> how should I really tell gmxtest.pl to test with a growing number of cores?
>>>>>
>>>>>
>>>>
>>>> Ah, sorry for the misread :) There is a simple fix that you can apply
>>>> to the gmxtest.pl script:
>>>>
>>>> % diff gmxtest.pl gmxtest_orig.pl
>>>> 161c161
>>>> < system("$grompp -maxwarn 10 $ndx > grompp.out 2>&1");
>>>> ---
>>>> > system("$grompp -maxwarn 10 $ndx $par > grompp.out 2>&1");
>>>>
>>>> -Justin
>>>>
>>>>
>>>>>
>>>>> Version 3.3.3, on the other hand, already failed in so many
>>>>> different places that I'm really wondering IF I'll make it
>>>>> available on the new cluster. :P
>>>>>
>>>>>
>>>>> What messages are you getting from 3.3.3? I thought you said the
>>>>> 3.3.x series worked fine.
>>>>>
>>>>>
>>>>> I'll log in and try to get a reproducible error here. ;) As
>>>>> soon as I have those, I'll post back in this thread.
>>>>>
>>>>> Thanks a lot again,
>>>>>
>>>>> Jones
>>>>>
>>>>
>>>>
>>> --
>>> ========================================
>>>
>>> Justin A. Lemkul
>>> Ph.D. Candidate
>>> ICTAS Doctoral Scholar
>>> Department of Biochemistry
>>> Virginia Tech
>>> Blacksburg, VA
>>> jalemkul[at]vt.edu | (540) 231-9080
>>> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>>>
>>> ========================================
>>> _______________________________________________
>>> gmx-users mailing list gmx-users at gromacs.org
>>> http://www.gromacs.org/mailman/listinfo/gmx-users
>>> Please search the archive at http://www.gromacs.org/search before
>>> posting!
>>> Please don't post (un)subscribe requests to the list. Use the www
>>> interface or send it to gmx-users-request at gromacs.org.
>>> Can't post? Read http://www.gromacs.org/mailing_lists/users.php
>>>
>>
>>
>