[gmx-users] 2018 installation make check errors, probably CUDA related
Tresadern, Gary [RNDBE]
gtresade at its.jnj.com
Thu Mar 22 17:45:55 CET 2018
Hi Mark,
Thanks. I tried 2018.1 and was hopeful it would solve the problem, as I'd seen comments about odd findGpus() behavior while googling for a fix. Unfortunately I still have the same problem. I've spent the day trying to pin down the nvidia-smi settings: persistence mode is on, the daemon is running so that it restarts at reboot, and I have clocked the K40 up to 3004,875. But these are minor issues; something more fundamental must be going wrong. I'm out of ideas at this point. I must have tried the rebuild three dozen times in the last ten days or so.
Cheers
Gary
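
A note on the nvidia-smi settings mentioned above: the nvidia-smi output quoted further down in this thread shows both GPUs with compute mode "E. Process" (exclusive process). Whether that has anything to do with the findGpus() assertion is not established anywhere in this thread. The following is only a minimal sketch for listing GPUs that are in a non-default compute mode. The function name flag_non_default_compute_mode is made up for illustration, and the sketch assumes nvidia-smi prints CSV rows of the form "index, name, compute_mode".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GROMACS or nvidia-smi): reads CSV rows of
# "index, name, compute_mode" on stdin and prints every GPU whose compute
# mode is something other than Default.
flag_non_default_compute_mode() {
    awk -F', *' '
        NF >= 3 && tolower($3) != "default" {
            printf "GPU %s (%s): compute mode %s\n", $1, $2, $3
        }
    '
}

# Only query the driver when nvidia-smi is actually on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,compute_mode --format=csv,noheader \
        | flag_non_default_compute_mode
fi
```

If a GPU should go back to the default mode for a test run, "nvidia-smi -i <index> -c DEFAULT" (as root) is the usual way to do that. This is something to try, not a confirmed fix.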
-----Original Message-----
From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of Mark Abraham
Sent: Wednesday, 21 March 2018 17:03
To: gmx-users at gromacs.org
Cc: gromacs.org_gmx-users at maillist.sys.kth.se
Subject: [EXTERNAL] Re: [gmx-users] 2018 installation make check errors, probably CUDA related
Hi,
Please try 2018.1 and let us know, as some issues that look like these have been resolved.
Thanks,
Mark
>> Cheers
>> Gary
>>
>>
>>
>>
>> wrndbeberhel13 :~> nvidia-smi
>> Wed Mar 21 16:25:23 2018
>>
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 390.42                 Driver Version: 390.42                    |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |===============================+======================+======================|
>> |   0  Quadro K4200        On   | 00000000:03:00.0  On |                  N/A |
>> | 30%   36C    P8    15W / 110W |     71MiB /  4036MiB |      0%   E. Process |
>> +-------------------------------+----------------------+----------------------+
>> |   1  Tesla K40c          On   | 00000000:81:00.0 Off |                    2 |
>> | 23%   40C    P8    22W / 235W |      0MiB / 11441MiB |      0%   E. Process |
>> +-------------------------------+----------------------+----------------------+
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID   Type   Process name                             Usage      |
>> |=============================================================================|
>> |    0      7891      G   /usr/bin/Xorg                                 69MiB |
>> +-----------------------------------------------------------------------------+
>>
>> -----Original Message-----
>> From: gromacs.org_gmx-users-bounces at maillist.sys.kth.se [mailto:
>> gromacs.org_gmx-users-bounces at maillist.sys.kth.se] On Behalf Of
>> Tresadern, Gary [RNDBE]
>> Sent: Saturday, 17 March 2018 16:46
>> To: 'gromacs.org_gmx-users at maillist.sys.kth.se' <
>> gromacs.org_gmx-users at maillist.sys.kth.se>
>> Subject: [EXTERNAL] Re: [gmx-users] 2018 installation make check
>> errors, probably CUDA related
>>
>> Hi,
>>
>> I am unable to pass the make check tests for a 2018 build. I had a
>> working build earlier in the week, but since we updated the cuda
>> toolkit and nvidia driver it now fails.
>> Below are some details of the installation procedure.
>> I tried manually setting variables such as CUDA_VISIBLE_DEVICES but
>> that also didn't help.
>> I am running out of ideas, if you have any tips please let me know.
>>
>> Thanks
>> Gary
>>
>> bash-4.1$ su softinst
>> bash-4.1$ scl enable devtoolset-2 bash
>> bash-4.1$ which cmake
>> /usr/local/bin/cmake
>> bash-4.1$ cmake --version
>> cmake version 3.6.2
>> CMake suite maintained and supported by Kitware (kitware.com/cmake).
>> bash-4.1$ gcc --version
>> gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
>> Copyright (C) 2013 Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions. There
>> is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
>> PARTICULAR PURPOSE.
>> bash-4.1$ ls /usr/local/cuda-9.1/
>> bin/ extras/ lib64/ libnvvp/ nsightee_plugins/ nvvm/ samples/ src/ tools/
>> doc/ include/ libnsight/ LICENSE nvml/ README share/ targets/ version.txt
>> bash-4.1$ ls /usr/local/cuda-9.1/bin/
>> bin2c cuda-gdb fatbinary nvcc.profile nvvp
>> computeprof cuda-gdbserver gpu-library-advisor nvdisasm ptxas
>> crt/ cuda-install-samples-9.1.sh nsight nvlink
>> cudafe cuda-memcheck nsight_ee_plugins_manage.sh nvprof
>> cudafe++ cuobjdump nvcc nvprune
>> bash-4.1$ export PATH=$PATH:/usr/local/bin/
>> bash-4.1$ export CUDA_HOME=/usr/local/cuda-9.1/
>> bash-4.1$ export PATH=$PATH:/usr/lib64/mpich/bin/
>> bash-4.1$ export LD_LIBRARY_PATH="/usr/local/cuda-9.1/lib64/:${LD_LIBRARY_PATH}"
>> bash-4.1$ export LD_LIBRARY_PATH="/usr/local/cuda-9.1/lib64:/usr/local/cuda-9.1/targets/x86_64-linux/lib/:${LD_LIBRARY_PATH}"
>> bash-4.1$ export LD_LIBRARY_PATH=/usr/lib64/openmpi-1.10/lib/openmpi/:$LD_LIBRARY_PATH
>> bash-4.1$ export MPI_CXX_INCLUDE_PATH=/usr/include/openmpi-1.10-x86_64/openmpi/ompi/mpi/cxx/
>> bash-4.1$ export PATH=$PATH:/usr/lib64/openmpi-1.10/bin/
>>
>> bash-4.1$ cmake .. -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.1/ -DGMX_GPU=on -DCMAKE_INSTALL_PREFIX=/prd/pkgs/gromacs/gromacs-2018/ -DGMX_MPI=on
>> bash-4.1$ make
>> bash-4.1$ make check
>> Test project /prd/pkgs/gromacs/gromacs-2018/build
>> Start 1: TestUtilsUnitTests
>> 1/39 Test #1: TestUtilsUnitTests ............... Passed 0.41 sec
>> Start 2: TestUtilsMpiUnitTests
>> 2/39 Test #2: TestUtilsMpiUnitTests ............ Passed 0.29 sec
>> Start 3: MdlibUnitTest
>> 3/39 Test #3: MdlibUnitTest .................... Passed 0.24 sec
>> Start 4: AppliedForcesUnitTest
>> 4/39 Test #4: AppliedForcesUnitTest ............ Passed 0.22 sec
>> Start 5: ListedForcesTest
>> 5/39 Test #5: ListedForcesTest ................. Passed 0.25 sec
>> Start 6: CommandLineUnitTests
>> 6/39 Test #6: CommandLineUnitTests ............. Passed 0.29 sec
>> Start 7: EwaldUnitTests
>> 7/39 Test #7: EwaldUnitTests ...................***Failed 0.92 sec
>> [==========] Running 257 tests from 10 test cases.
>> [----------] Global test environment set-up.
>>
>> -------------------------------------------------------
>> Program: ewald-test, version 2018
>> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 735)
>> Function: void findGpus(gmx_gpu_info_t*)
>>
>> Assertion failed:
>> Condition: cudaSuccess == cudaPeekAtLastError() Should be cudaSuccess
>>
>> For more information and tips for troubleshooting, please check the
>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>> Start 8: FFTUnitTests
>> 8/39 Test #8: FFTUnitTests ..................... Passed 0.37 sec
>> Start 9: GpuUtilsUnitTests
>> 9/39 Test #9: GpuUtilsUnitTests ................***Failed 0.91 sec
>> [==========] Running 35 tests from 7 test cases.
>> [----------] Global test environment set-up.
>> [----------] 7 tests from HostAllocatorTest/0, where TypeParam = int
>> [ RUN      ] HostAllocatorTest/0.EmptyMemoryAlwaysWorks
>>
>> -------------------------------------------------------
>> Program: gpu_utils-test, version 2018
>> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 735)
>> Function: void findGpus(gmx_gpu_info_t*)
>>
>> Assertion failed:
>> Condition: cudaSuccess == cudaPeekAtLastError() Should be cudaSuccess
>>
>> For more information and tips for troubleshooting, please check the
>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>> Start 10: HardwareUnitTests
>> 10/39 Test #10: HardwareUnitTests ................ Passed 0.24 sec
>> Start 11: MathUnitTests
>> 11/39 Test #11: MathUnitTests .................... Passed 0.25 sec
>> Start 12: MdrunUtilityUnitTests
>> 12/39 Test #12: MdrunUtilityUnitTests ............ Passed 0.22 sec
>> Start 13: MdrunUtilityMpiUnitTests
>> 13/39 Test #13: MdrunUtilityMpiUnitTests ......... Passed 0.35 sec
>> Start 14: OnlineHelpUnitTests
>> 14/39 Test #14: OnlineHelpUnitTests .............. Passed 0.24 sec
>> Start 15: OptionsUnitTests
>> 15/39 Test #15: OptionsUnitTests ................. Passed 0.25 sec
>> Start 16: RandomUnitTests
>> 16/39 Test #16: RandomUnitTests .................. Passed 0.26 sec
>> Start 17: TableUnitTests
>> 17/39 Test #17: TableUnitTests ................... Passed 0.41 sec
>> Start 18: TaskAssignmentUnitTests
>> 18/39 Test #18: TaskAssignmentUnitTests .......... Passed 0.21 sec
>> Start 19: UtilityUnitTests
>> 19/39 Test #19: UtilityUnitTests ................. Passed 0.32 sec
>> Start 20: FileIOTests
>> 20/39 Test #20: FileIOTests ...................... Passed 0.26 sec
>> Start 21: PullTest
>> 21/39 Test #21: PullTest ......................... Passed 0.24 sec
>> Start 22: AwhTest
>> 22/39 Test #22: AwhTest .......................... Passed 0.23 sec
>> Start 23: SimdUnitTests
>> 23/39 Test #23: SimdUnitTests .................... Passed 0.29 sec
>> Start 24: GmxAnaTest
>> 24/39 Test #24: GmxAnaTest ....................... Passed 0.38 sec
>> Start 25: GmxPreprocessTests
>> 25/39 Test #25: GmxPreprocessTests ............... Passed 0.58 sec
>> Start 26: CorrelationsTest
>> 26/39 Test #26: CorrelationsTest ................. Passed 1.23 sec
>> Start 27: AnalysisDataUnitTests
>> 27/39 Test #27: AnalysisDataUnitTests ............ Passed 0.30 sec
>> Start 28: SelectionUnitTests
>> 28/39 Test #28: SelectionUnitTests ............... Passed 0.61 sec
>> Start 29: TrajectoryAnalysisUnitTests
>> 29/39 Test #29: TrajectoryAnalysisUnitTests ...... Passed 1.19 sec
>> Start 30: EnergyAnalysisUnitTests
>> 30/39 Test #30: EnergyAnalysisUnitTests .......... Passed 0.58 sec
>> Start 31: CompatibilityHelpersTests
>> 31/39 Test #31: CompatibilityHelpersTests ........ Passed 0.23 sec
>> Start 32: MdrunTests
>> 32/39 Test #32: MdrunTests .......................***Failed 0.98 sec
>> [==========] Running 29 tests from 11 test cases.
>> [----------] Global test environment set-up.
>> [----------] 6 tests from BondedInteractionsTest
>> [ RUN      ] BondedInteractionsTest.NormalBondWorks
>>
>> NOTE 1 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/BondedInteractionsTest_NormalBondWorks_input.mdp,
>> line 1]:
>>
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/BondedInteractionsTest_NormalBondWorks_input.mdp
>> did not specify a value for the .mdp option "cutoff-scheme". Probably it
>> was first intended for use with GROMACS before 4.6. In 4.6, the Verlet
>> scheme was introduced, but the group scheme was still the default. The
>> default is now the Verlet scheme, so you will observe different
>> behaviour.
>>
>>
>> NOTE 2 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/BondedInteractionsTest_NormalBondWorks_input.mdp]:
>> For a correct single-point energy evaluation with nsteps = 0, use
>> continuation = yes to avoid constraining the input coordinates.
>>
>> Setting the LD random seed to 417973934
>> Generated 3 of the 3 non-bonded parameter combinations
>> Excluding 3 bonded neighbours molecule type 'butane'
>> Removing all charge groups because cutoff-scheme=Verlet
>>
>> NOTE 3 [file BondedInteractionsTest_NormalBondWorks_butane1.top, line 31]:
>> In moleculetype 'butane' 2 atoms are not bound by a potential or
>> constraint to any other atom in the same moleculetype. Although
>> technically this might not cause issues in a simulation, this often means
>> that the user forgot to add a bond/potential/constraint or put multiple
>> molecules in the same moleculetype definition by mistake. Run with -v to
>> get information for each atom.
>>
>> Number of degrees of freedom in T-Coupling group rest is 9.00
>>
>> NOTE 4 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/BondedInteractionsTest_NormalBondWorks_input.mdp]:
>> NVE simulation with an initial temperature of zero: will use a Verlet
>> buffer of 10%. Check your energy drift!
>>
>>
>> There were 4 notes
>>
>> -------------------------------------------------------
>> Program: mdrun-test, version 2018
>> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 735)
>> Function: void findGpus(gmx_gpu_info_t*)
>>
>> Assertion failed:
>> Condition: cudaSuccess == cudaPeekAtLastError() Should be cudaSuccess
>>
>> For more information and tips for troubleshooting, please check the
>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> This run will generate roughly 0 Mb of data
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>> Start 33: MdrunMpiTests
>> 33/39 Test #33: MdrunMpiTests ....................***Failed 2.06 sec
>> [==========] Running 7 tests from 5 test cases.
>> [----------] Global test environment set-up.
>> [----------] 1 test from MultiSimTerminationTest
>> [ RUN      ] MultiSimTerminationTest.WritesCheckpointAfterMaxhTerminationAndThenRestarts
>>
>> NOTE 1 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input1.mdp,
>> line 14]:
>>
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input1.mdp
>> did not specify a value for the .mdp option "cutoff-scheme". Probably it
>> was first intended for use with GROMACS before 4.6. In 4.6, the Verlet
>> scheme was introduced, but the group scheme was still the default. The
>> default is now the Verlet scheme, so you will observe different
>> behaviour.
>>
>>
>> NOTE 1 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input0.mdp,
>> line 14]:
>>
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input0.mdp
>> did not specify a value for the .mdp option "cutoff-scheme". Probably it
>> was first intended for use with GROMACS before 4.6. In 4.6, the Verlet
>> scheme was introduced, but the group scheme was still the default. The
>> default is now the Verlet scheme, so you will observe different
>> behaviour.
>>
>> Setting the LD random seed to 73630723
>> Generated 3 of the 3 non-bonded parameter combinations
>> Generating 1-4 interactions: fudge = 0.5
>> Generated 3 of the 3 1-4 parameter combinations
>> Excluding 2 bonded neighbours molecule type 'SOL'
>> Setting gen_seed to -1322183961
>> Velocities were taken from a Maxwell distribution at 288 K
>> Removing all charge groups because cutoff-scheme=Verlet
>> Number of degrees of freedom in T-Coupling group System is 9.00
>> Determining Verlet buffer for a tolerance of 0.005 kJ/mol/ps at 298 K
>> Calculated rlist for 1x1 atom pair-list as 1.026 nm, buffer size 0.026 nm
>> Set rlist, assuming 4x4 atom pair-list, to 1.024 nm, buffer size 0.024 nm
>> Note that mdrun will redetermine rlist based on the actual pair-list setup
>> This run will generate roughly 0 Mb of data
>>
>> NOTE 2 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input1.mdp]:
>> You are using a plain Coulomb cut-off, which might produce artifacts.
>> You might want to consider using PME electrostatics.
>>
>>
>>
>> There were 2 notes
>> Setting the LD random seed to 408678750
>> Generated 3 of the 3 non-bonded parameter combinations
>> Generating 1-4 interactions: fudge = 0.5
>> Generated 3 of the 3 1-4 parameter combinations
>> Excluding 2 bonded neighbours molecule type 'SOL'
>> Setting gen_seed to 1490520586
>> Velocities were taken from a Maxwell distribution at 298 K
>> Removing all charge groups because cutoff-scheme=Verlet
>> Number of degrees of freedom in T-Coupling group System is 9.00
>> Determining Verlet buffer for a tolerance of 0.005 kJ/mol/ps at 298 K
>>
>> NOTE 2 [file
>> /prd/pkgs/gromacs/gromacs-2018/build/src/programs/mdrun/tests/Testing/Temporary/MultiSimTerminationTest_WritesCheckpointAfterMaxhTerminationAndThenRestarts_input0.mdp]:
>> You are using a plain Coulomb cut-off, which might produce artifacts.
>> You might want to consider using PME electrostatics.
>>
>>
>>
>> There were 2 notes
>> Calculated rlist for 1x1 atom pair-list as 1.026 nm, buffer size 0.026 nm
>> Set rlist, assuming 4x4 atom pair-list, to 1.024 nm, buffer size 0.024 nm
>> Note that mdrun will redetermine rlist based on the actual pair-list setup
>> This run will generate roughly 0 Mb of data
>>
>> -------------------------------------------------------
>> Program: mdrun-mpi-test, version 2018
>> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 735)
>> Function: void findGpus(gmx_gpu_info_t*)
>> MPI rank: 0 (out of 2)
>>
>> Assertion failed:
>> Condition: cudaSuccess == cudaPeekAtLastError() Should be cudaSuccess
>>
>> For more information and tips for troubleshooting, please check the
>> GROMACS website at http://www.gromacs.org/Documentation/Errors
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>>
>> Start 34: regressiontests/simple
>> 34/39 Test #34: regressiontests/simple ........... Passed 25.95 sec
>> Start 35: regressiontests/complex
>> 35/39 Test #35: regressiontests/complex .......... Passed 80.79 sec
>> Start 36: regressiontests/kernel
>> 36/39 Test #36: regressiontests/kernel ........... Passed 223.69 sec
>> Start 37: regressiontests/freeenergy
>> 37/39 Test #37: regressiontests/freeenergy ....... Passed 16.11 sec
>> Start 38: regressiontests/pdb2gmx
>> 38/39 Test #38: regressiontests/pdb2gmx .......... Passed 92.77 sec
>> Start 39: regressiontests/rotation
>> 39/39 Test #39: regressiontests/rotation ......... Passed 20.51 sec
>>
>> 90% tests passed, 4 tests failed
>>
>
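
For anyone comparing against their own run: the failing tests in the quoted log above can be pulled out mechanically rather than by scrolling. This is a minimal sketch, assuming the ctest result lines look like the ones quoted (a "#N:" field followed by the test name, and "***Failed" on the same line). The function name list_failed_tests is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads ctest output on stdin and prints the name of each
# test whose result line contains "***Failed". Leading quote markers such as
# ">> " are tolerated, so a log pasted from an email also works.
list_failed_tests() {
    awk '
        /\*\*\*Failed/ {
            # The test name is the field right after the "#N:" field.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^#[0-9]+:$/) { print $(i + 1); break }
            }
        }
    '
}
```

With the output of "make check" saved to a file (check.log is just an example name), "list_failed_tests < check.log" lists the failures. A single test can then be rerun from the build directory with something like "ctest -R GpuUtilsUnitTests --output-on-failure".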