[gmx-users] Gromacs 4.6.7 with MPI and OpenMP

Malcolm Tobias mtobias at wustl.edu
Fri May 8 16:45:38 CEST 2015


Szilárd,

On Friday 08 May 2015 15:56:12 Szilárd Páll wrote:
> What's being utilized vs what's being started are different things. If
> you don't believe the mdrun output - which is quite likely not wrong
> about the 2 ranks x 4 threads -, use your favorite tool to check the
> number of ranks and threads started and their placement. That will
> explain what's going on...

Good point.  If I use 'ps -L' I can see the OpenMP threads:

[root@gpu21 ~]# ps -Lfu mtobias
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
mtobias   9830  9828  9830  0    1 09:28 ?        00:00:00 sshd: mtobias@pts/0
mtobias   9831  9830  9831  0    1 09:28 pts/0    00:00:00 -bash
mtobias   9989  9831  9989  0    2 09:33 pts/0    00:00:00 mpirun -np 2 mdrun_mp
mtobias   9989  9831  9991  0    2 09:33 pts/0    00:00:00 mpirun -np 2 mdrun_mp
mtobias   9990  9831  9990  0    1 09:33 pts/0    00:00:00 tee mdrun.out
mtobias   9992  9989  9992 38    7 09:33 pts/0    00:00:02 mdrun_mpi -ntomp 4 -v
mtobias   9992  9989  9994  0    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9992  9989  9998  0    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9992  9989 10000  0    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9992  9989 10001 16    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9992  9989 10002 16    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9992  9989 10003 16    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9993  9989  9993 73    7 09:33 pts/0    00:00:05 mdrun_mpi -ntomp 4 -v
mtobias   9993  9989  9995  0    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9993  9989  9999  0    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9993  9989 10004  0    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9993  9989 10005 12    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9993  9989 10006 12    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v
mtobias   9993  9989 10007 12    7 09:33 pts/0    00:00:00 mdrun_mpi -ntomp 4 -v

but top shows only 2 CPUs being utilized:

top - 09:33:42 up 37 days, 19:48,  2 users,  load average: 2.13, 1.05, 0.68
Tasks: 517 total,   3 running, 514 sleeping,   0 stopped,   0 zombie
Cpu0  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132053748k total, 23817664k used, 108236084k free,   268628k buffers
Swap:  4095996k total,     1884k used,  4094112k free, 15600572k cached
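To see which cores those 7 threads per rank actually land on, ps can list the PSR column (the CPU each thread last ran on) per thread. A minimal sketch, assuming Linux with procps ps and pgrep; the helper name show_threads is hypothetical, and the binary name mdrun_mpi is taken from the command above:

```shell
#!/bin/sh
# show_threads (hypothetical helper): list every thread (LWP) of one process,
# the CPU it last ran on (PSR), and its CPU usage.
show_threads() {
  ps -L -o lwp,psr,pcpu,comm -p "$1"
}

# Run it for each mdrun_mpi rank owned by the current user.
for pid in $(pgrep -u "$(id -un)" mdrun_mpi); do
  echo "== mdrun_mpi rank, PID $pid =="
  show_threads "$pid"
done
```

If all threads of a rank keep reporting the same PSR, that rank is probably confined to a single core, which would be consistent with the two busy CPUs in top.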


> Very likely that's exactly what's screwing things up. We try to be
> nice and back off (mdrun should note that on the output) when
> affinities are set externally assuming that they are set for a good
> reason and to correct values. Sadly, that assumption often proves to
> be wrong. Try running with "-pin on" or turn off the CPUSET-ing (or
> double-check if it's right).

I wouldn't expect the CPUSETs to be problematic; I've been using them with Gromacs for over a decade now ;-)
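Following the suggestion to double-check the CPUSET, the allowed-CPU mask of each thread can be read from /proc. A sketch, assuming a Linux /proc filesystem; show_affinity is a hypothetical helper name:

```shell
#!/bin/sh
# show_affinity (hypothetical helper): for one process, print the cpuset it
# was placed in (if cpusets are enabled) and, for every thread, the list of
# CPUs the kernel currently allows that thread to run on.
show_affinity() {
  pid=$1
  if [ -r "/proc/$pid/cpuset" ]; then
    echo "cpuset: $(cat "/proc/$pid/cpuset")"
  fi
  for task in /proc/"$pid"/task/*; do
    tid=${task##*/}
    # Cpus_allowed_list is the thread's current mask, constrained by its cpuset.
    printf '%s %s\n' "$tid" "$(awk '/^Cpus_allowed_list/ {print $2}' "$task/status")"
  done
}

for pid in $(pgrep -u "$(id -un)" mdrun_mpi); do
  echo "== mdrun_mpi rank, PID $pid =="
  show_affinity "$pid"
done
```

Comparing this output with and without -pin on should show whether each rank is allowed only one core or several.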

If I use '-pin on', it appears to utilize 8 CPU cores, as expected:

[mtobias@gpu21 Gromacs_Test]$ mpirun -np 2 mdrun_mpi -ntomp 4 -pin on -v -deffnm PolyA_Heli_J_hi_equil

top - 09:36:26 up 37 days, 19:50,  2 users,  load average: 1.00, 1.14, 0.78
Tasks: 516 total,   4 running, 512 sleeping,   0 stopped,   0 zombie
Cpu0  : 78.9%us,  2.7%sy,  0.0%ni, 18.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 63.7%us,  0.3%sy,  0.0%ni, 36.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 65.6%us,  0.3%sy,  0.0%ni, 33.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  : 64.9%us,  0.3%sy,  0.0%ni, 34.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 80.7%us,  2.7%sy,  0.0%ni, 16.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 64.0%us,  0.3%sy,  0.0%ni, 35.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 62.0%us,  0.3%sy,  0.0%ni, 37.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 60.3%us,  0.3%sy,  0.0%ni, 39.1%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st


Weird. I wonder if anyone else has experience using pinning with CPUSETs?

Malcolm

-- 
Malcolm Tobias
314.362.1594
