[gmx-users] 2018 installation make check errors, probably CUDA related

Tresadern, Gary [RNDBE] gtresade at its.jnj.com
Thu Mar 22 17:45:55 CET 2018


Hi Mark, 
Thanks. I tried 2018.1 and was hopeful it would solve the problem, as I'd seen comments about odd findGpus() behavior while googling for a fix. Unfortunately I still have the same problem. I've spent the day trying to pin down the nvidia-smi settings: I have persistence mode on and the daemon set to restart at reboot, and I have clocked the K40 up to 3004,875, but these are minor issues. Something more fundamental must be going wrong. I'm out of ideas at this point; I must have tried the rebuild three dozen times in the last ten days or so.

Cheers
Gary
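
One way to narrow this down, offered as a sketch rather than something tried in this thread, is to run a single failing test with only one GPU visible at a time. That would show whether the findGpus() assertion depends on the Quadro K4200, the Tesla K40c, or neither. The helper name run_per_gpu is hypothetical; CUDA_VISIBLE_DEVICES is the standard CUDA environment variable already mentioned in the quoted message.

```shell
#!/bin/sh
# Hypothetical helper (not part of GROMACS): run a command once per GPU id,
# exposing only that GPU to CUDA through CUDA_VISIBLE_DEVICES.
# Usage: run_per_gpu "0 1" command [args...]
run_per_gpu() {
    ids=$1
    shift
    for id in $ids; do
        echo "== CUDA_VISIBLE_DEVICES=$id =="
        # env sets the variable for this one invocation only.
        env CUDA_VISIBLE_DEVICES="$id" "$@" || echo "exit status: $?"
    done
}
```

From the build directory this could be called as: run_per_gpu "0 1" ctest -R GpuUtilsUnitTests --output-on-failure. Note that the ids CUDA assigns need not match the nvidia-smi numbering unless CUDA_DEVICE_ORDER=PCI_BUS_ID is also set.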


-----Original Message-----
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of Mark Abraham
Sent: Wednesday, 21 March 2018 17:03
To: gmx-users at gromacs.org
Cc: gromacs.org_gmx-users at maillist.sys.kth.se
Subject: [EXTERNAL] Re: [gmx-users] 2018 installation make check errors, probably CUDA related

Hi,

Please try 2018.1 and let us know, as some issues that look like these have been resolved.

Thanks,

Mark



>> Cheers
>> Gary
>>
>>
>>
>>
>> wrndbeberhel13 :~> nvidia-smi
>> Wed Mar 21 16:25:23 2018
>>
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 390.42                 Driver Version: 390.42                    |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |===============================+======================+======================|
>> |   0  Quadro K4200        On   | 00000000:03:00.0  On |                  N/A |
>> | 30%   36C    P8    15W / 110W |     71MiB /  4036MiB |      0%   E. Process |
>> +-------------------------------+----------------------+----------------------+
>> |   1  Tesla K40c          On   | 00000000:81:00.0 Off |                    2 |
>> | 23%   40C    P8    22W / 235W |      0MiB / 11441MiB |      0%   E. Process |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID   Type   Process name                             Usage      |
>> |=============================================================================|
>> |    0      7891      G   /usr/bin/Xorg                                 69MiB |
>> +-----------------------------------------------------------------------------+
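
The settings shown in the table above can also be requested in CSV form with the --query-gpu option of nvidia-smi, which is easier to compare before and after a driver update. A minimal sketch, assuming the index, name, persistence_mode and compute_mode query fields (listed by nvidia-smi --help-query-gpu) and assuming the default compute mode is reported as the string Default. The helper name summarize_gpus is hypothetical, and whether either setting is related to the findGpus() failure is not established here.

```shell
#!/bin/sh
# Hypothetical helper: read "index, name, persistence_mode, compute_mode"
# CSV rows on stdin and print one summary line per GPU, flagging any GPU
# whose compute mode is not Default.
summarize_gpus() {
    awk -F', *' '{
        flag = ($4 == "Default") ? "" : "  <- non-default compute mode"
        printf "GPU %s (%s): persistence=%s compute=%s%s\n", $1, $2, $3, $4, flag
    }'
}
```

It is meant to be fed from the driver tool, for example: nvidia-smi --query-gpu=index,name,persistence_mode,compute_mode --format=csv,noheader | summarize_gpus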
>>
>> -----Original Message-----
>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
>> gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of 
>> Tresadern, Gary [RNDBE]
>> Sent: Saturday, 17 March 2018 16:46
>> To: 'gromacs.org_gmx-users at maillist.sys.kth.se' < 
>> gromacs.org_gmx-users at maillist.sys.kth.se>
>> Subject: [EXTERNAL] Re: [gmx-users] 2018 installation make check 
>> errors, probably CUDA related
>>
>> Hi,
>>
>> I am unable to pass the make check tests for a 2018 build. I had a 
>> working build earlier in the week, but since we updated the cuda 
>> toolkit and nvidia driver it now fails.
>> Below are some details of the installation procedure.
>> I tried manually setting variables such as CUDA_VISIBLE_DEVICES but 
>> that also didn't help.
>> I am running out of ideas, if you have any tips please let me know.
>>
>> Thanks
>> Gary
>>
>> bash-4.1$ su softinst
>> bash-4.1$ scl enable devtoolset-2 bash
>> bash-4.1$ which cmake
>> /usr/local/bin/cmake
>> bash-4.1$ cmake --version
>> cmake version 3.6.2
>> CMake suite maintained and supported by Kitware (kitware.com/cmake).
>> bash-4.1$ gcc --version
>> gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
>> Copyright (C) 2013 Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions.  There is NO
>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>> bash-4.1$ ls /usr/local/cuda-9.1/
>> bin/              extras/           lib64/            libnvvp/
>>  nsightee_plugins/ nvvm/             samples/          src/
>> tools/ doc/              include/          libnsight/
>> LICENSE           nvml/             README            share/
>> targets/          version.txt
>> bash-4.1$ ls /usr/local/cuda-9.1/bin/
>> bin2c                        cuda-gdb
>> fatbinary                    nvcc.profile                 nvvp
>> computeprof                  cuda-gdbserver
>> gpu-library-advisor          nvdisasm                     ptxas
>> crt/                         cuda-install-samples-9.1.sh
>> nsight                       nvlink cudafe
>> cuda-memcheck                nsight_ee_plugins_manage.sh
>> nvprof
>> cudafe++                     cuobjdump
>> nvcc
>> cudafe++nvprune
>> bash-4.1$ export PATH=$PATH:/usr/local/bin/
>> bash-4.1$ export CUDA_HOME=/usr/local/cuda-9.1/
>> bash-4.1$ export PATH=$PATH:/usr/lib64/mpich/bin/
>> bash-4.1$ export LD_LIBRARY_PATH="/usr/local/cuda-9.1/lib64/:${LD_LIBRARY_PATH}"
>> bash-4.1$ export LD_LIBRARY_PATH="/usr/local/cuda-9.1/lib64:/usr/local/cuda-9.1/targets/x86_64-linux/lib/:${LD_LIBRARY_PATH}"
>> bash-4.1$ export LD_LIBRARY_PATH=/usr/lib64/openmpi-1.10/lib/openmpi/:$LD_LIBRARY_PATH
>> bash-4.1$ export MPI_CXX_INCLUDE_PATH=/usr/include/openmpi-1.10-x86_64/openmpi/ompi/mpi/cxx/
>> bash-4.1$ export PATH=$PATH:/usr/lib64/openmpi-1.10/bin/
>>
>> bash-4.1$ cmake .. -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.1/ -DGMX_GPU=on -DCMAKE_INSTALL_PREFIX=/prd/pkgs/gromacs/gromacs-2018/ -DGMX_MPI=on
>> bash-4.1$ make
>> bash-4.1$ make check
>> Test project /prd/pkgs/gromacs/gromacs-2018/build
>>       Start  1: TestUtilsUnitTests
>> 1/39 Test  #1: TestUtilsUnitTests ...............   Passed    0.41 sec
>>       Start  2: TestUtilsMpiUnitTests
>> 2/39 Test  #2: TestUtilsMpiUnitTests ............   Passed    0.29 sec
>>       Start  3: MdlibUnitTest
>> 3/39 Test  #3: MdlibUnitTest ....................   Passed    0.24 sec
>>       Start  4: AppliedForcesUnitTest
>> 4/39 Test  #4: AppliedForcesUnitTest ............   Passed    0.22 sec
>>       Start  5: ListedForcesTest
>> 5/39 Test  #5: ListedForcesTest .................   Passed    0.25 sec
>>       Start  6: CommandLineUnitTests
>> 6/39 Test  #6: CommandLineUnitTests .............   Passed    0.29 sec
>>       Start  7: EwaldUnitTests
>> 7/39 Test  #7: EwaldUnitTests ...................***Failed    0.92 sec
>> [==========] Running 257 tests from 10 test cases.
>> [----------] Global test environment set-up.
>>
>> -------------------------------------------------------
>> Program:     ewald-test, version 2018
>> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 735)
>> Function:    void findGpus(gmx_gpu_info_t*)
>>
>> Assertion failed:
>> Condition: cudaSuccess == cudaPeekAtLastError() Should be cudaSuccess
>>
>> For more information and tips for troubleshooting, please check the 
>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>>       Start  8: FFTUnitTests
>> 8/39 Test  #8: FFTUnitTests .....................   Passed    0.37 sec
>>       Start  9: GpuUtilsUnitTests
>> 9/39 Test  #9: GpuUtilsUnitTests ................***Failed    0.91 sec
>> [==========] Running 35 tests from 7 test cases.
>> [----------] Global test environment set-up.
>> [----------] 7 tests from HostAllocatorTest/0, where TypeParam = int
>> [ RUN      ] HostAllocatorTest/0.EmptyMemoryAlwaysWorks
>>
>> -------------------------------------------------------
>> Program:     gpu_utils-test, version 2018
>> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 735)
>> Function:    void findGpus(gmx_gpu_info_t*)
>>
>> Assertion failed:
>> Condition: cudaSuccess == cudaPeekAtLastError() Should be cudaSuccess
>>
>> For more information and tips for troubleshooting, please check the 
>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>>       Start 10: HardwareUnitTests
>> 10/39 Test #10: HardwareUnitTests ................   Passed    0.24 sec
>>       Start 11: MathUnitTests
>> 11/39 Test #11: MathUnitTests ....................   Passed    0.25 sec
>>       Start 12: MdrunUtilityUnitTests
>> 12/39 Test #12: MdrunUtilityUnitTests ............   Passed    0.22 sec
>>       Start 13: MdrunUtilityMpiUnitTests
>> 13/39 Test #13: MdrunUtilityMpiUnitTests .........   Passed    0.35 sec
>>       Start 14: OnlineHelpUnitTests
>> 14/39 Test #14: OnlineHelpUnitTests ..............   Passed    0.24 sec
>>       Start 15: OptionsUnitTests
>> 15/39 Test #15: OptionsUnitTests .................   Passed    0.25 sec
>>       Start 16: RandomUnitTests
>> 16/39 Test #16: RandomUnitTests ..................   Passed    0.26 sec
>>       Start 17: TableUnitTests
>> 17/39 Test #17: TableUnitTests ...................   Passed    0.41 sec
>>       Start 18: TaskAssignmentUnitTests
>> 18/39 Test #18: TaskAssignmentUnitTests ..........   Passed    0.21 sec
>>       Start 19: UtilityUnitTests
>> 19/39 Test #19: UtilityUnitTests .................   Passed    0.32 sec
>>       Start 20: FileIOTests
>> 20/39 Test #20: FileIOTests ......................   Passed    0.26 sec
>>       Start 21: PullTest
>> 21/39 Test #21: PullTest .........................   Passed    0.24 sec
>>       Start 22: AwhTest
>> 22/39 Test #22: AwhTest ..........................   Passed    0.23 sec
>>       Start 23: SimdUnitTests
>> 23/39 Test #23: SimdUnitTests ....................   Passed    0.29 sec
>>       Start 24: GmxAnaTest
>> 24/39 Test #24: GmxAnaTest .......................   Passed    0.38 sec
>>       Start 25: GmxPreprocessTests
>> 25/39 Test #25: GmxPreprocessTests ...............   Passed    0.58 sec
>>       Start 26: CorrelationsTest
>> 26/39 Test #26: CorrelationsTest .................   Passed    1.23 sec
>>       Start 27: AnalysisDataUnitTests
>> 27/39 Test #27: AnalysisDataUnitTests ............   Passed    0.30 sec
>>       Start 28: SelectionUnitTests
>> 28/39 Test #28: SelectionUnitTests ...............   Passed    0.61 sec
>>       Start 29: TrajectoryAnalysisUnitTests
>> 29/39 Test #29: TrajectoryAnalysisUnitTests ......   Passed    1.19 sec
>>       Start 30: EnergyAnalysisUnitTests
>> 30/39 Test #30: EnergyAnalysisUnitTests ..........   Passed    0.58 sec
>>       Start 31: CompatibilityHelpersTests
>> 31/39 Test #31: CompatibilityHelpersTests ........   Passed    0.23 sec
>>       Start 32: MdrunTests
>> 32/39 Test #32: MdrunTests .......................***Failed    0.98 sec
>> [==========] Running 29 tests from 11 test cases.
>> [----------] Global test environment set-up.
>> [----------] 6 tests from BondedInteractionsTest
>> [ RUN      ] BondedInteractionsTest.NormalBondWorks
>>
>> NOTE 1 [file /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/BondedInteractionsTest_NormalBondWorks_input.mdp, line 1]:
>>   /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/BondedInteractionsTest_NormalBondWorks_input.mdp
>>   did not specify a value for the .mdp option "cutoff-scheme". Probably it
>>   was first intended for use with GROMACS before 4.6. In 4.6, the Verlet
>>   scheme was introduced, but the group scheme was still the default. The
>>   default is now the Verlet scheme, so you will observe different behaviour.
>>
>>
>> NOTE 2 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/BondedInteractionsTest_NormalBondWorks_input.mdp]:
>>   For a correct single-point energy evaluation with nsteps = 0, use
>>   continuation = yes to avoid constraining the input coordinates.
>>
>> Setting the LD random seed to 417973934
>> Generated 3 of the 3 non-bonded parameter combinations
>> Excluding 3 bonded neighbours molecule type 'butane'
>> Removing all charge groups because cutoff-scheme=Verlet
>>
>> NOTE 3 [file BondedInteractionsTest_NormalBondWorks_butane1.top, line 31]:
>>   In moleculetype 'butane' 2 atoms are not bound by a potential or
>>   constraint to any other atom in the same moleculetype. Although
>>   technically this might not cause issues in a simulation, this often means
>>   that the user forgot to add a bond/potential/constraint or put multiple
>>   molecules in the same moleculetype definition by mistake. Run with -v to
>>   get information for each atom.
>>
>> Number of degrees of freedom in T-Coupling group rest is 9.00
>>
>> NOTE 4 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/BondedInteractionsTest_NormalBondWorks_input.mdp]:
>>   NVE simulation with an initial temperature of zero: will use a Verlet
>>   buffer of 10%. Check your energy drift!
>>
>>
>> There were 4 notes
>>
>> -------------------------------------------------------
>> Program:     mdrun-test, version 2018
>> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 735)
>> Function:    void findGpus(gmx_gpu_info_t*)
>>
>> Assertion failed:
>> Condition: cudaSuccess == cudaPeekAtLastError() Should be cudaSuccess
>>
>> For more information and tips for troubleshooting, please check the 
>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> This run will generate roughly 0 Mb of data
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>>       Start 33: MdrunMpiTests
>> 33/39 Test #33: MdrunMpiTests ....................***Failed    2.06 sec
>> [==========] Running 7 tests from 5 test cases.
>> [----------] Global test environment set-up.
>> [----------] 1 test from MultiSimTerminationTest
>> [ RUN      ] MultiSimTerminationTest.WritesCheckpointAfterMaxhTerminationAndThenRestarts
>>
>> NOTE 1 [file /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input1.mdp, line 14]:
>>   /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input1.mdp
>>   did not specify a value for the .mdp option "cutoff-scheme". Probably it
>>   was first intended for use with GROMACS before 4.6. In 4.6, the Verlet
>>   scheme was introduced, but the group scheme was still the default. The
>>   default is now the Verlet scheme, so you will observe different behaviour.
>>
>>
>> NOTE 1 [file /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input0.mdp, line 14]:
>>   /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input0.mdp
>>   did not specify a value for the .mdp option "cutoff-scheme". Probably it
>>   was first intended for use with GROMACS before 4.6. In 4.6, the Verlet
>>   scheme was introduced, but the group scheme was still the default. The
>>   default is now the Verlet scheme, so you will observe different behaviour.
>>
>> Setting the LD random seed to 73630723
>> Generated 3 of the 3 non-bonded parameter combinations
>> Generating 1-4 interactions: fudge = 0.5
>> Generated 3 of the 3 1-4 parameter combinations
>> Excluding 2 bonded neighbours molecule type 'SOL'
>> Setting gen_seed to -1322183961
>> Velocities were taken from a Maxwell distribution at 288 K
>> Removing all charge groups because cutoff-scheme=Verlet
>> Number of degrees of freedom in T-Coupling group System is 9.00
>> Determining Verlet buffer for a tolerance of 0.005 kJ/mol/ps at 298 K
>> Calculated rlist for 1x1 atom pair-list as 1.026 nm, buffer size 0.026 nm
>> Set rlist, assuming 4x4 atom pair-list, to 1.024 nm, buffer size 0.024 nm
>> Note that mdrun will redetermine rlist based on the actual pair-list setup
>> This run will generate roughly 0 Mb of data
>>
>> NOTE 2 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input1.mdp]:
>>   You are using a plain Coulomb cut-off, which might produce artifacts.
>>   You might want to consider using PME electrostatics.
>>
>>
>>
>> There were 2 notes
>> Setting the LD random seed to 408678750
>> Generated 3 of the 3 non-bonded parameter combinations
>> Generating 1-4 interactions: fudge = 0.5
>> Generated 3 of the 3 1-4 parameter combinations
>> Excluding 2 bonded neighbours molecule type 'SOL'
>> Setting gen_seed to 1490520586
>> Velocities were taken from a Maxwell distribution at 298 K
>> Removing all charge groups because cutoff-scheme=Verlet
>> Number of degrees of freedom in T-Coupling group System is 9.00
>> Determining Verlet buffer for a tolerance of 0.005 kJ/mol/ps at 298 K
>>
>> NOTE 2 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input0.mdp]:
>>   You are using a plain Coulomb cut-off, which might produce artifacts.
>>   You might want to consider using PME electrostatics.
>>
>>
>>
>> There were 2 notes
>> Calculated rlist for 1x1 atom pair-list as 1.026 nm, buffer size 0.026 nm
>> Set rlist, assuming 4x4 atom pair-list, to 1.024 nm, buffer size 0.024 nm
>> Note that mdrun will redetermine rlist based on the actual pair-list setup
>> This run will generate roughly 0 Mb of data
>>
>> -------------------------------------------------------
>> Program:     mdrun-mpi-test, version 2018
>> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 735)
>> Function:    void findGpus(gmx_gpu_info_t*)
>> MPI rank:    0 (out of 2)
>>
>> Assertion failed:
>> Condition: cudaSuccess == cudaPeekAtLastError() Should be cudaSuccess
>>
>> For more information and tips for troubleshooting, please check the 
>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>>       Start 34: regressiontests/simple
>> 34/39 Test #34: regressiontests/simple ...........   Passed   25.95 sec
>>       Start 35: regressiontests/complex
>> 35/39 Test #35: regressiontests/complex ..........   Passed   80.79 sec
>>       Start 36: regressiontests/kernel
>> 36/39 Test #36: regressiontests/kernel ...........   Passed  223.69 sec
>>       Start 37: regressiontests/freeenergy
>> 37/39 Test #37: regressiontests/freeenergy .......   Passed   16.11 sec
>>       Start 38: regressiontests/pdb2gmx
>> 38/39 Test #38: regressiontests/pdb2gmx ..........   Passed   92.77 sec
>>       Start 39: regressiontests/rotation
>> 39/39 Test #39: regressiontests/rotation .........   Passed   20.51 sec
>>
>> 90% tests passed, 4 tests failed
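
All four failures above stop at the same assertion in findGpus(), so it can help to pull just the failing test names out of a saved make check log and rerun them one at a time. A minimal sketch, assuming the log was saved to a file; the helper name failed_tests is hypothetical and it relies only on the CTest result lines shown above.

```shell
#!/bin/sh
# Hypothetical helper: read a CTest / "make check" log on stdin and print the
# name of each test whose result line is marked ***Failed.
# Any text before "Test #N:" (such as mail quote markers) is ignored.
failed_tests() {
    sed -n 's/^.*Test *#[0-9][0-9]*: *\([^ ]*\) \.*\*\*\*Failed.*$/\1/p'
}
```

With the log in a file (make-check.log is a placeholder name), failed_tests < make-check.log lists the failing tests, and each one can then be rerun from the build directory with ctest -R GpuUtilsUnitTests --output-on-failure, substituting the test name.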
>>