This repository has been archived by the owner on Jul 13, 2024. It is now read-only.

GPU folding stops after sleep (in Linux) -- BAD_WORK_UNIT #1720

Open
spexgaelen opened this issue Jan 15, 2024 · 2 comments

Comments

@spexgaelen


Your Environment

  • F@H Software version: 7.6.21
  • Operating System: Pop!_OS 22.04 LTS
  • Kernel: Linux merope 6.6.6-76060606-generic #202312111032170230614322.04~d28ffec SMP PREEMPT_DYNAMIC Mon D x86_64 x86_64 x86_64 GNU/Linux
  • CPU: AMD® Ryzen 9 3900x 12-core processor × 24
  • GPU: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A]
  • Browser: I mostly use the CLI, FAHClient.

Expected Behavior

While FAH is running, I put the system into suspend. Later I wake the system and both the CPU and GPU based FAH jobs continue running.


Current Behavior

Only the CPU based FAH jobs continue running.
Rebooting allows the GPU jobs to run again.


Possible Solution (Optional)

Perhaps the CUDA interface needs to be reinitialized after resume?
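
One way to narrow this down is to check whether a brand-new process can initialize CUDA after resume. The log excerpt later in this thread shows a freshly started FahCore process failing with "Error initializing CUDA: CUDA_ERROR_UNKNOWN (999)". That suggests the problem sits in the driver state rather than in a stale context held by a long-running process. The sketch below is a minimal probe, assuming the NVIDIA driver library is installed as `libcuda.so.1` and that the failing call is the driver API's `cuInit` (inferred from the error message, not confirmed by the FahCore source). The names `probe_cuda_init` and `describe` are made up for this example.

```python
import ctypes

# A few CUresult values from the CUDA driver API header (cuda.h).
CU_RESULTS = {
    0: "CUDA_SUCCESS",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe(code):
    """Return a readable name for a CUresult code."""
    return CU_RESULTS.get(code, f"CUresult {code}")


def probe_cuda_init(lib_name="libcuda.so.1"):
    """Call cuInit(0) in this process.

    Returns the integer CUresult, or None if the driver library
    cannot be loaded at all (assumed library name, see lead-in).
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


if __name__ == "__main__":
    code = probe_cuda_init()
    print("libcuda not found" if code is None else describe(code))
```

Running this once before suspend and once after wake would show whether the failure is visible outside FAHClient. If it also reports CUDA_ERROR_UNKNOWN after wake, restarting FAHClient alone is unlikely to help.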


Steps To Reproduce

  1. start FAH
  2. sleep the computer
  3. wake the computer

Context

I would like to be able to sleep my PC between sessions, and to avoid rebooting every time I want the GPU side of F@H to work.



spexgaelen commented Jan 17, 2024

Here is an excerpt of the logs where the GPU side repeatedly fails:

14:01:14:WU00:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
14:01:14:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
14:01:14:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:18725 run:0 clone:1466 gen:569 core:0x22 unit:0x39020000ba0500000000000025490000
14:01:14:WU00:FS01:Uploading 11.50KiB to 131.239.113.97
14:01:14:WU00:FS01:Connecting to 131.239.113.97:8080
14:01:14:WU00:FS01:Upload complete
14:01:15:WU00:FS01:Server responded WORK_ACK (400)
14:01:15:WU00:FS01:Cleaning up
14:01:15:WU01:FS01:Connecting to assign1.foldingathome.org:80
14:01:15:WU01:FS01:Assigned to work server 206.223.170.146
14:01:15:WU01:FS01:Requesting new work unit for slot 01: gpu:9:0 TU102 [GeForce RTX 2080 Ti Rev. A] M 13448 from 206.223.170.146
14:01:15:WU01:FS01:Connecting to 206.223.170.146:8080
14:01:15:WU01:FS01:Downloading 6.98MiB
14:01:17:WU01:FS01:Download complete
14:01:18:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:12272 run:0 clone:132 gen:64 core:0x23 unit:0x000000840000004000002ff000000000
14:01:18:WU01:FS01:Starting
14:01:18:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/openmm-core-23/centos-7.9.2009-64bit/release/0x23-8.0.3/Core_23.fah/FahCore_23 -dir 01 -suffix 01 -version 706 -lifeline 1989 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
14:01:18:WU01:FS01:Started FahCore on PID 97450
14:01:18:WU01:FS01:Core PID:97454
14:01:18:WU01:FS01:FahCore 0x23 started
14:01:18:WU01:FS01:0x23:*********************** Log Started 2024-01-15T14:01:18Z ***********************
14:01:18:WU01:FS01:0x23:*************************** Core23 Folding@home Core ***************************
14:01:18:WU01:FS01:0x23:       Core: Core23
14:01:18:WU01:FS01:0x23:       Type: 0x23
14:01:18:WU01:FS01:0x23:    Version: 8.0.3
14:01:18:WU01:FS01:0x23:     Author: Joseph Coffland <[email protected]>
14:01:18:WU01:FS01:0x23:  Copyright: 2022 foldingathome.org
14:01:18:WU01:FS01:0x23:   Homepage: https://foldingathome.org/
14:01:18:WU01:FS01:0x23:       Date: Aug 3 2023
14:01:18:WU01:FS01:0x23:       Time: 08:28:22
14:01:18:WU01:FS01:0x23:   Revision: 199cb870317d05441d0a301287d9ef61254fa32b
14:01:18:WU01:FS01:0x23:     Branch: HEAD
14:01:18:WU01:FS01:0x23:   Compiler: GNU 7.5.0
14:01:18:WU01:FS01:0x23:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
14:01:18:WU01:FS01:0x23:             -fdata-sections -O3 -funroll-loops -fno-pie
14:01:18:WU01:FS01:0x23:             -DOPENMM_VERSION="\"8.0.0\""
14:01:18:WU01:FS01:0x23:   Platform: linux 5.15.0-1041-azure
14:01:18:WU01:FS01:0x23:       Bits: 64
14:01:18:WU01:FS01:0x23:       Mode: Release
14:01:18:WU01:FS01:0x23:Maintainers: John Chodera <[email protected]> and Peter Eastman
14:01:18:WU01:FS01:0x23:             <[email protected]>
14:01:18:WU01:FS01:0x23:       Args: -dir 01 -suffix 01 -version 706 -lifeline 97450 -checkpoint 15
14:01:18:WU01:FS01:0x23:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
14:01:18:WU01:FS01:0x23:             nvidia -gpu 0 -gpu-usage 100
14:01:18:WU01:FS01:0x23:************************************ libFAH ************************************
14:01:18:WU01:FS01:0x23:       Date: Aug 3 2023
14:01:18:WU01:FS01:0x23:       Time: 08:27:48
14:01:18:WU01:FS01:0x23:   Revision: 112c2234abe20611a05652defc3c7f854cbf927f
14:01:18:WU01:FS01:0x23:     Branch: HEAD
14:01:18:WU01:FS01:0x23:   Compiler: GNU 7.5.0
14:01:18:WU01:FS01:0x23:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
14:01:18:WU01:FS01:0x23:             -fdata-sections -O3 -funroll-loops -fno-pie
14:01:18:WU01:FS01:0x23:   Platform: linux 5.15.0-1041-azure
14:01:18:WU01:FS01:0x23:       Bits: 64
14:01:18:WU01:FS01:0x23:       Mode: Release
14:01:18:WU01:FS01:0x23:************************************ CBang *************************************
14:01:18:WU01:FS01:0x23:    Version: 1.7.2
14:01:18:WU01:FS01:0x23:     Author: Joseph Coffland <[email protected]>
14:01:18:WU01:FS01:0x23:        Org: Cauldron Development LLC
14:01:18:WU01:FS01:0x23:  Copyright: Cauldron Development LLC, 2003-2023
14:01:18:WU01:FS01:0x23:   Homepage: https://cauldrondevelopment.com/
14:01:18:WU01:FS01:0x23:    License: GPL 2+
14:01:18:WU01:FS01:0x23:       Date: Aug 3 2023
14:01:18:WU01:FS01:0x23:       Time: 08:27:30
14:01:18:WU01:FS01:0x23:   Revision: eae4b58965bdd4d54ea9eb77972674352b37a547
14:01:18:WU01:FS01:0x23:     Branch: HEAD
14:01:18:WU01:FS01:0x23:   Compiler: GNU 7.5.0
14:01:18:WU01:FS01:0x23:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
14:01:18:WU01:FS01:0x23:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
14:01:18:WU01:FS01:0x23:   Platform: linux 5.15.0-1041-azure
14:01:18:WU01:FS01:0x23:       Bits: 64
14:01:18:WU01:FS01:0x23:       Mode: Release
14:01:18:WU01:FS01:0x23:************************************ System ************************************
14:01:18:WU01:FS01:0x23:        CPU: AMD Ryzen 9 3900X 12-Core Processor
14:01:18:WU01:FS01:0x23:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
14:01:18:WU01:FS01:0x23:       CPUs: 24
14:01:18:WU01:FS01:0x23:     Memory: 31.26GiB
14:01:18:WU01:FS01:0x23:Free Memory: 9.22GiB
14:01:18:WU01:FS01:0x23:    Threads: POSIX_THREADS
14:01:18:WU01:FS01:0x23: OS Version: 6.6
14:01:18:WU01:FS01:0x23:Has Battery: false
14:01:18:WU01:FS01:0x23: On Battery: false
14:01:18:WU01:FS01:0x23: UTC Offset: -5
14:01:18:WU01:FS01:0x23:        PID: 97454
14:01:18:WU01:FS01:0x23:        CWD: /var/lib/fahclient/work
14:01:18:WU01:FS01:0x23:       Exec: /var/lib/fahclient/cores/cores.foldingathome.org/openmm-core-23/centos-7.9.2009-64bit/release/0x23-8.0.3/Core_23.fah/FahCore_23
14:01:18:WU01:FS01:0x23:************************************ OpenMM ************************************
14:01:18:WU01:FS01:0x23:    Version: 8.0.0
14:01:18:WU01:FS01:0x23:********************************************************************************
14:01:18:WU01:FS01:0x23:Project: 12272 (Run 0, Clone 132, Gen 64)
14:01:18:WU01:FS01:0x23:Reading tar file core.xml
14:01:18:WU01:FS01:0x23:Reading tar file integrator.xml
14:01:18:WU01:FS01:0x23:Reading tar file state.xml.bz2
14:01:18:WU01:FS01:0x23:Reading tar file system.xml.bz2
14:01:18:WU01:FS01:0x23:Digital signatures verified
14:01:18:WU01:FS01:0x23:Folding@home GPU Core23 Folding@home Core
14:01:18:WU01:FS01:0x23:Version 8.0.3
14:01:18:WU01:FS01:0x23:  Checkpoint write interval: 50000 steps (2%) [50 total]
14:01:18:WU01:FS01:0x23:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
14:01:18:WU01:FS01:0x23:  XTC frame write interval: 25000 steps (1%) [100 total]
14:01:18:WU01:FS01:0x23:  Global context and integrator variables write interval: disabled
14:01:18:WU01:FS01:0x23:There are 3 platforms available.
14:01:18:WU01:FS01:0x23:Platform 0: Reference
14:01:18:WU01:FS01:0x23:Platform 1: CPU
14:01:18:WU01:FS01:0x23:Platform 2: CUDA
14:01:18:WU01:FS01:0x23:  cuda-device 0 specified
14:01:18:WU01:FS01:0x23:opencl-device was set but OpenCL platform could not be found.
14:01:22:WU01:FS01:0x23:Attempting to create CUDA context:
14:01:22:WU01:FS01:0x23:  Configuring platform CUDA
14:01:22:WU01:FS01:0x23:Failed to create CUDA context:
14:01:22:WU01:FS01:0x23:Error initializing CUDA: CUDA_ERROR_UNKNOWN (999) at /home/conda/feedstock_root/build_artifacts/openmm_1682500577703/work/platforms/cuda/src/CudaContext.cpp:140
14:01:22:WU01:FS01:0x23:ERROR:125: Failed to create a GPU-enabled OpenMM Context.
14:01:22:WU01:FS01:0x23:Saving result file ../logfile_01.txt
14:01:22:WU01:FS01:0x23:Saving result file science.log
14:01:22:WU01:FS01:0x23:Folding@home Core Shutdown: BAD_WORK_UNIT
14:01:22:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)

message repeats
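
Since the same failure repeats for every newly assigned work unit, it can help to pull just the failure lines out of the full log. The sketch below is one way to do that, not part of FAHClient. It assumes log lines follow the `HH:MM:SS:[WARNING:]WUxx:FSxx:message` shape seen in the excerpt, and that warning lines may be wrapped in ANSI colour codes, either as a raw escape byte or rendered literally as `ESC`. The name `scan_log` and the event labels are made up for this example.

```python
import re

# ANSI colour codes, either as a real escape byte or rendered as "ESC".
ANSI_RE = re.compile(r"(?:\x1b|ESC)\[[0-9;]*m")

# Assumed line prefix: HH:MM:SS:[WARNING:]WUxx:FSxx:<message>
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?:WARNING:)?"
    r"(?P<wu>WU\d+):(?P<slot>FS\d+):(?P<msg>.*)$"
)


def scan_log(lines):
    """Return (time, slot, kind) tuples for CUDA init failures and
    BAD_WORK_UNIT returns found in the given log lines."""
    events = []
    for raw in lines:
        m = LINE_RE.match(ANSI_RE.sub("", raw.strip()))
        if not m:
            continue
        msg = m.group("msg")
        if "Error initializing CUDA" in msg:
            kind = "cuda_init_error"
        elif "FahCore returned: BAD_WORK_UNIT" in msg:
            kind = "bad_work_unit"
        else:
            continue
        events.append((m.group("time"), m.group("slot"), kind))
    return events
```

Usage would be along the lines of `scan_log(open(path))`, where `path` points at the client log. A run of `cuda_init_error` events on the GPU slot that starts right after a resume would match the behavior described in this issue.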

@spexgaelen spexgaelen changed the title GPU folding stops after sleep (in Linux) GPU folding stops after sleep (in Linux) -- BAD_WORK_UNIT Jan 17, 2024
@spexgaelen
Author

Regrettably this appears to be a known/regular issue somewhere between NVIDIA and Pop!_OS.
