[gmx-users] TestBed in MPI not working

Mark Abraham Mark.Abraham at anu.edu.au
Tue May 12 03:26:36 CEST 2009


Jones de Andrade wrote:
> Hi Justin.
> 
> Thanks a lot for that. It helped, but not enough yet. :(  It just made the 
> 4.0.4 tests reach the same "range of errors" that I'm getting with 3.3.3. :P
> 
> Using openMPI, it just complains that it can't find orted. That would 
> mean that the paths are not in there, BUT they are. :P If I just try to 
> run orted from the command line without any arguments:
> 
> *****************
> /gmxtest404 196% orted
> [palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> runtime/orte_init.c at line 125
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_ess_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [palpatine:28366] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> orted/orted_main.c at line 323
> /*****************
> 
> So, the shell IS finding the file.

True, but you have a deeper problem that has nothing to do with GROMACS 
or the shell. You need to get your MPI environment working properly 
first. Read its documentation, test and troubleshoot there. Ask its 
mailing list if you can't work it out.

> But when I no longer do it from the script 
> (I was already suspecting something in the "if-else-end" 
> stack), all MPI tests fail with the following message in the mdrun.out file:
> 
> **********************
> /orted: Command not found.
> --------------------------------------------------------------------------
> A daemon (pid 27972) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
> /**********************
> 
> What is going on? 

Your environment inside your shell is different from your environment 
inside the gmxtest.pl script. If you're running under a queueing system, 
get the environment variables exported to the script. If you're using 
the script you were using earlier, then perhaps you need to learn about 
setting environment variables under different shells. I don't recall 
whether variable names are case-sensitive, but your "set path=(blah)" 
simply won't do the job. Only environment variables get exported to 
child processes (i.e. the gmxtest.pl you spawn later), so you need 
"setenv PATH blah". Under bash, use "export PATH=blah" not "PATH=blah".

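As a minimal sketch of the difference (POSIX sh rather than csh, and 
GMX_DEMO_VAR is a made-up name used only for illustration), a variable 
that is set but not exported is invisible to a child process, while an 
exported one is inherited:

```shell
#!/bin/sh
# GMX_DEMO_VAR is a hypothetical variable name, not something GROMACS uses.
# Start from a clean slate so an inherited value cannot skew the result.
unset GMX_DEMO_VAR

# Plain assignment: the variable exists only in this shell.
GMX_DEMO_VAR="set but not exported"
child_sees_plain=$(sh -c 'printf "%s" "${GMX_DEMO_VAR:-<unset>}"')

# Exported: the variable is copied into the environment of child processes.
export GMX_DEMO_VAR="exported"
child_sees_exported=$(sh -c 'printf "%s" "${GMX_DEMO_VAR:-<unset>}"')

echo "child without export sees: $child_sees_plain"
echo "child with export sees:    $child_sees_exported"
```

The child started with "sh -c" plays the role of the gmxtest.pl process 
here. Under csh/tcsh the exporting form is the "setenv" command given above.
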
Otherwise, change the script to dump the contents of the environment PATH 
variable (e.g. print STDERR "$ENV{'PATH'}\n";) to see if those contents 
clue you in to what is happening.
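
The same check can be sketched from the shell side, without editing the 
script. This is only an illustration in POSIX sh: find_in_child_path is a 
hypothetical helper name, and the child "sh -c" stands in for any process 
you spawn, such as gmxtest.pl. It asks a child process what PATH it 
inherited and whether a command resolves on it:

```shell
#!/bin/sh
# find_in_child_path is a hypothetical helper, named here for illustration.
# It runs "command -v" inside a child shell, so the lookup uses only the
# PATH that the child actually inherited from this shell's environment.
find_in_child_path() {
    sh -c 'command -v "$1"' sh "$1"
}

# Show the PATH as a child process sees it.
echo "PATH seen by child: $(sh -c 'printf "%s" "$PATH"')"

# Check whether the Open MPI daemon resolves on that inherited PATH.
if location=$(find_in_child_path orted); then
    echo "orted resolves to: $location"
else
    echo "orted is not on the inherited PATH"
fi
```

If orted resolves in your interactive shell but not here, the PATH change 
was probably not exported.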

> The next thing I am thinking of doing is to execute a full 
> command line from one of the tests directly, to see whether it works...  :(  :P

Won't work, because MPI is broken.

Mark


