build: init mimas #501

Merged: mweinelt merged 9 commits into master from mimas on Dec 6, 2024

Conversation

mweinelt
Member

Work-in-progress configuration for a successor to rhea, because that machine cannot keep up with the compression workload generated by the large number of builders we employ.

The plan is for an AX162-R at Hetzner.

TODO:

  • Check we have all relevant configs and secrets from rhea

@mweinelt
Member Author

mweinelt commented Dec 3, 2024

The script used to migrate the state from rhea, as found in ~/sync.sh, with some further improvements:

#!/usr/bin/env bash

NIX_STORE=daemon
FROM=rhea.nixos.org

# Abort on errors and on unset variables; trace every command.
set -eux

# Create the target directories for the Hydra state.
mkdir -p /var/lib/hydra/queue-runner/keys/
mkdir -p /var/lib/hydra/www/keys/
mkdir -p /var/lib/hydra/{build-logs,runcommand-logs}
mkdir -p /nix/var/nix/gcroots/hydra
# Copy the builder list, keys, and credentials from the old host.
rsync -avPz "${FROM}":/etc/nix/machines /etc/nix/machines
rsync -avPz "${FROM}":/var/lib/hydra/queue-runner/keys/ /var/lib/hydra/queue-runner/keys
rsync -avPz "${FROM}":/var/lib/hydra/queue-runner/.aws/ /var/lib/hydra/queue-runner/.aws
rsync -avPz "${FROM}":/var/lib/hydra/queue-runner/.ssh/ /var/lib/hydra/queue-runner/.ssh
rsync -avPz "${FROM}":/var/lib/hydra/www/keys/ /var/lib/hydra/www/keys
# Logs and GC roots are best effort: a failed transfer does not abort the script.
rsync -avPz "${FROM}":/var/lib/hydra/build-logs/ /var/lib/hydra/build-logs || true
rsync -avPz "${FROM}":/var/lib/hydra/runcommand-logs/ /var/lib/hydra/runcommand-logs || true
rsync -avPz "${FROM}":/nix/var/nix/gcroots/hydra/ /nix/var/nix/gcroots/hydra || true
# For every GC root entry that does not resolve locally, print the matching
# store path and copy its closure from the old host.
find /nix/var/nix/gcroots/hydra \( -type l -o -type f \) -print0 \
  | while IFS= read -r -d $'\0' file; do
      test -e "$file" || echo "/nix/store/$(basename "$file")"
    done \
  | xargs -P 16 nix-copy-closure --from --gzip --include-outputs --use-substitutes "${FROM}"
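
Before running the copy step, it can help to see which store paths it would request. The following is only a sketch and not part of the original sync.sh: the function name list_missing_roots and its optional directory argument are hypothetical, and it reuses the same find/test logic as the script above without copying anything.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper (not part of the original sync.sh).
# Prints the store path for every GC root entry that does not resolve
# locally, i.e. the paths the copy step would ask the old host for.
list_missing_roots() {
  local roots_dir="${1:-/nix/var/nix/gcroots/hydra}"
  find "$roots_dir" \( -type l -o -type f \) -print0 \
    | while IFS= read -r -d '' file; do
        # test -e follows symlinks, so it fails only for dangling links;
        # regular files and resolvable links are skipped.
        test -e "$file" || echo "/nix/store/$(basename "$file")"
      done
}
```

Calling `list_missing_roots | wc -l` would give a rough idea of how many paths are still missing before starting the transfer.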

@mweinelt
Member Author

mweinelt commented Dec 3, 2024

Reformatted the NVMe namespaces to use 4k blocks (LBA format 1).

root@rescue /dev/disk/by-id # nvme format -l 1 /dev/nvme0n1
You are about to format nvme0n1, namespace 0x1.
WARNING: Format may irrevocably delete this device's data.
You have 10 seconds to press Ctrl-C to cancel this operation.

Use the force [--force] option to suppress this warning.
Sending format operation ... 
Success formatting namespace:1
root@rescue /dev/disk/by-id # nvme format -l 1 /dev/nvme1n1
You are about to format nvme1n1, namespace 0x1.
WARNING: Format may irrevocably delete this device's data.
You have 10 seconds to press Ctrl-C to cancel this operation.

Use the force [--force] option to suppress this warning.
Sending format operation ... 
Success formatting namespace:1
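
One way to confirm the block size after formatting is to read back what the kernel reports in sysfs. This is a minimal sketch, assuming the kernel has already re-read the namespace; the function name logical_block_size and its optional sysfs root argument are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical check: print the logical block size the kernel reports
# for a block device. The second argument (sysfs root) is an assumption
# added so the function can be pointed at a different directory.
logical_block_size() {
  local dev="$1"
  local sysfs_root="${2:-/sys/block}"
  cat "${sysfs_root}/${dev}/queue/logical_block_size"
}

# Example usage on the machine itself (device names as in the log above):
#   logical_block_size nvme0n1
#   logical_block_size nvme1n1
```

With 4k blocks in effect, both calls would be expected to print 4096.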

nvme-SAMSUNG_MZQL21T9HCJR-00A07_S64GNNFX604905

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.11.3] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQL21T9HCJR-00A07
Serial Number:                      S64GNNFX604905
Firmware Version:                   GDC5902Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.4
Number of Namespaces:               32
Namespace 1 Size/Capacity:          1,920,383,410,176 [1.92 TB]
Namespace 1 Utilization:            1,153,474,560 [1.15 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Tue Dec  3 19:55:52 2024 CET
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x005f):   Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     83 Celsius
Namespace 1 Features (0x1a):        NA_Fields No_ID_Reuse NP_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W   14.00W       -    0  0  0  0       70      70
 1 +     8.00W    8.00W       -    1  1  1  1       70      70

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0
 1 +    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    528,298 [270 GB]
Data Units Written:                 439,541 [225 GB]
Host Read Commands:                 64,232,059
Host Write Commands:                54,331,568
Controller Busy Time:               10
Power Cycles:                       3
Power On Hours:                     6
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               38 Celsius
Temperature Sensor 2:               46 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

nvme-SAMSUNG_MZQL21T9HCJR-00A07_S64GNNFX604919

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.11.3] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQL21T9HCJR-00A07
Serial Number:                      S64GNNFX604919
Firmware Version:                   GDC5902Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.4
Number of Namespaces:               32
Namespace 1 Size/Capacity:          1,920,383,410,176 [1.92 TB]
Namespace 1 Utilization:            1,153,474,560 [1.15 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Tue Dec  3 19:55:56 2024 CET
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x005f):   Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     83 Celsius
Namespace 1 Features (0x1a):        NA_Fields No_ID_Reuse NP_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W   14.00W       -    0  0  0  0       70      70
 1 +     8.00W    8.00W       -    1  1  1  1       70      70

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0
 1 +    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    527,834 [270 GB]
Data Units Written:                 440,387 [225 GB]
Host Read Commands:                 64,173,963
Host Write Commands:                54,437,024
Controller Busy Time:               10
Power Cycles:                       3
Power On Hours:                     6
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               45 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

mweinelt force-pushed the mimas branch 5 times, most recently from ab028e1 to dc5dcfe (December 6, 2024 00:25)
mweinelt marked this pull request as ready for review (December 6, 2024 01:35)
mweinelt requested a review from a team as a code owner (December 6, 2024 01:35)
Member

@Mic92 Mic92 left a comment

Passes the smell test.

@mweinelt
Member Author

mweinelt commented Dec 6, 2024

Hydra is now running on mimas. I runtime-masked the hydra services on rhea for now. Will get some sleep and check back tomorrow.

mweinelt force-pushed the mimas branch 2 times, most recently from f9fab7b to 0adfb7f (December 6, 2024 03:08)
mweinelt force-pushed the mimas branch 2 times, most recently from 1c88182 to 2c42161 (December 6, 2024 17:17)

We currently hold over 430 GiB of build logs, which makes up 25% of the
total disk space used on mimas.

Our hydra fork has a limit on the number of active build slots,
which defaults to the number of threads minus two, so 96 - 2 = 94.

vcunat looked at the mimas graphs while we were hitting this limit,
and it seemed OK to push further (negligible CPU "some" PSI).

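
The default slot count and the CPU pressure value mentioned above can be checked from a shell. This is a sketch under stated assumptions: the function names default_slots and cpu_some_avg10 are hypothetical, the optional arguments exist only for illustration, and /proc/pressure/cpu requires a kernel with PSI enabled.

```shell
#!/usr/bin/env bash
# Default slot count as described above: hardware threads minus two.
# The thread count can be passed explicitly; otherwise nproc is used.
default_slots() {
  local threads="${1:-$(nproc)}"
  echo $(( threads - 2 ))
}

# Print the 10-second average of the "some" line from a PSI file.
# The file argument is a hypothetical convenience; the default is the
# kernel's CPU pressure file.
cpu_some_avg10() {
  local psi_file="${1:-/proc/pressure/cpu}"
  awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "$psi_file"
}
```

For the 96 threads on mimas, `default_slots 96` gives 94, matching the figure above. A `cpu_some_avg10` value close to zero means tasks rarely had to wait for CPU time.
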
@mweinelt
Member Author

mweinelt commented Dec 6, 2024

Alright. There is much to improve still, but the migration looks pretty solid to me. Let's leave rhea around for another week just in case we've missed something and then cancel the contract.

mweinelt enabled auto-merge (December 6, 2024 17:36)
mweinelt merged commit fadaefc into master (Dec 6, 2024)
3 checks passed
mweinelt deleted the mimas branch (December 6, 2024 17:41)

@@ -93,6 +93,9 @@ in

max_concurrent_evals = 1

# increase the number of active compress slots (CPU is 48*2 on mimas)
max_local_worker_threads = 144
Member

We have now had a longer period with "Copying Results" hovering around 130, and PSI still stayed very low. So I believe this value certainly isn't too high yet. We'll see over time. So far it's been relatively rare to get that high anyway.

5 participants