Gigahorse-1.1.8-9d66b86: GPU address limit 1TB/40bit problem: instance of 'std::runtime_error, signal 6, swiotlb buffer is full, NVRM: Failed to create a DMA mapping! #28

Open
bladeuserpi opened this issue Feb 3, 2023 · 6 comments

Comments

@bladeuserpi
bladeuserpi commented Feb 3, 2023

Hi,

On a machine with 1.5TB of RAM, K34 plots do not work (GPU: Quadro M6000).
I believe this is due to the GPU's 40-bit address limit.

Here it is mentioned:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/iommu-dma-remapping
"This page describes the IOMMU DMA remapping feature that was introduced in Windows 11 22H2 (WDDM 3.0).
...
Upcoming servers and high end workstations can be configured with over 1TB of memory which crosses the common 40-bit address space limitation of many GPUs."

So it seems that while Windows 11 22H2 can handle it, on Linux it can be a problem (kernel 4.18.0-425.10.1.el8_7).
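The arithmetic behind the suspected limit can be sketched as follows. This is only an illustration: it reads "1.5TB" as 1.5 TiB (an assumption) and ignores where the firmware actually places RAM in the physical address map.

```python
# Sketch: compare installed RAM with what a 40-bit address can reach.
ADDRESS_BITS = 40                    # limit quoted for many GPUs
TIB = 1 << 40                        # one tebibyte in bytes

addressable = 1 << ADDRESS_BITS      # bytes reachable with 40 address bits
installed = 3 * TIB // 2             # 1.5 TB machine, read as 1.5 TiB (assumption)

# Number of address bits needed to reach the last installed byte.
bits_needed = (installed - 1).bit_length()

print("40 bits cover exactly 1 TiB:", addressable == TIB)
print("installed RAM exceeds that:", installed > addressable)
print("bits needed for the top byte:", bits_needed)
```

Any DMA buffer that lands above the 1 TiB mark would need a 41st address bit, which such a device cannot supply without bouncing or remapping.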

Also note that increasing swiotlb only delays the termination until the 2nd plot, and even the 1st plot
might be corrupt, as there are ten thousand (!) messages like these (and others):

    park_delta(): LP_1 < LP_0 (1875189930, 18446744073709551615) (x = 1348, y = 6770)
    park_delta(): LP_1 < LP_0 (2255150717, 18446744073709551615) (x = 1351, y = 6770)
    park_delta(): LP_1 < LP_0 (1891597597, 18446744073709551615) (x = 1353, y = 6770)
    park_delta(): LP_1 < LP_0 (1267797774, 1891597597) (x = 1354, y = 6770)
    park_delta(): LP_1 < LP_0 (2922005224, 3001459040) (x = 1356, y = 6770)
    park_delta(): LP_1 < LP_0 (30753450, 2922005224) (x = 1357, y = 6770)

Furthermore, these messages also appear for K32 after some successful plots when running "-n -1",
so although that run did not terminate, it might also be producing corrupt plots.
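One way to triage a large log of these warnings is sketched below. The value 18446744073709551615 is 2^64 - 1, the largest unsigned 64-bit integer; treating it as a marker for an unset or wrapped value is an assumption, not something the plotter documents. The function name `summarize` is hypothetical.

```python
import re

UINT64_MAX = 2**64 - 1  # 18446744073709551615, as seen in the messages

# Matches the warning format quoted above; the two numbers are captured
# in the order they are printed, without assuming which is LP_0 or LP_1.
PATTERN = re.compile(
    r"park_delta\(\): LP_1 < LP_0 \((\d+), (\d+)\) \(x = (\d+), y = (\d+)\)"
)

def summarize(lines):
    """Return (warnings, warnings containing UINT64_MAX)."""
    total = 0
    with_max = 0
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue
        total += 1
        first, second = int(m.group(1)), int(m.group(2))
        if UINT64_MAX in (first, second):
            with_max += 1
    return total, with_max

sample = [
    "park_delta(): LP_1 < LP_0 (1875189930, 18446744073709551615) (x = 1348, y = 6770)",
    "park_delta(): LP_1 < LP_0 (2255150717, 18446744073709551615) (x = 1351, y = 6770)",
    "park_delta(): LP_1 < LP_0 (1891597597, 18446744073709551615) (x = 1353, y = 6770)",
    "park_delta(): LP_1 < LP_0 (1267797774, 1891597597) (x = 1354, y = 6770)",
    "park_delta(): LP_1 < LP_0 (2922005224, 3001459040) (x = 1356, y = 6770)",
    "park_delta(): LP_1 < LP_0 (30753450, 2922005224) (x = 1357, y = 6770)",
]
print(summarize(sample))
```

Running the same function over a full plotter log would show how many of the warnings carry the all-ones value versus ordinary out-of-order pairs.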

This workstation has a BIOS option "1TB Memory Cap":

"If 1 TB of memory is installed, limits useable memory to 1TB-64MB for compatibility with graphics cards that
can't address 1TB or more of memory."

I will try that next.
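For reference, the capped size works out as below. Reading "1TB" and "64MB" as binary units (TiB, MiB) is an assumption; the BIOS help text does not say which it means.

```python
TIB = 1 << 40
MIB = 1 << 20

capped = TIB - 64 * MIB          # "1TB-64MB" from the BIOS help text

print(capped, "bytes")           # usable memory under the cap
print(capped // MIB, "MiB")      # same value in MiB
print("fits below 2**40:", capped < (1 << 40))
```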

Logs

    Chia k34 next-gen CUDA plotter - 9d66b86
    Plot Format: v2.4
    Network Port: 11337 [MMX] (unique)
    No. GPUs: 1
    No. Streams: 4
    Final Destination: ./
    Shared Memory limit: unlimited
    Number of Plots: 5
    Initialization took 0.106 sec
    Crafting plot 1 out of 5 (2023/02/01 16:36:20)
    Process ID: 1993
    Pool Puzzle Hash:  xxx
    Farmer Public Key: xxx
    Working Directory:   ./
    Working Directory 2: @RAM
    Compression Level: C1 (xbits = 15, final table = 3)
    Plot Name: plot-mmx-k34-c1-2023-02-01-16-36-xxx
    [P1] Setup took 0.894 sec
    [P1] Table 1 took 77.857 sec, 17179869184 entries, 16789935 max, 17020 tmp, 0 GB/s up, 2.31198 GB/s down
    [P1] Table 2 took 143.196 sec, 17179636764 entries, 16790768 max, 17010 tmp, 1.00561 GB/s up, 1.81572 GB/s down
    [P1] Table 3 took 319.245 sec, 17178866356 entries, 16787901 max, 16960 tmp, 0.651528 GB/s up, 1.8168 GB/s down
    terminate called after throwing an instance of 'std::runtime_error'
      what():  OS call failed or operation not supported on this OS
    Command terminated by signal 6
    223.30user 287.34system 10:14.91elapsed 83%CPU (0avgtext+0avgdata 541219792maxresident)k
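A few of the numbers in this log can be sanity-checked with a short sketch. It assumes the last line comes from GNU time, which reports maximum resident set size in kilobytes; all values are copied from the log above.

```python
K = 34
table1_entries = 17179869184           # from the "[P1] Table 1" line

# Table 1 of the k34 plot holds exactly 2**34 entries.
print("Table 1 == 2**k:", table1_entries == 2**K)

# Peak resident memory, converted from KiB to GiB.
max_resident_kib = 541219792
max_resident_gib = round(max_resident_kib / 2**20, 1)
print("peak RSS (GiB):", max_resident_gib)

# CPU utilisation as (user + system) / elapsed wall-clock time.
user_s, system_s = 223.30, 287.34
elapsed_s = 10 * 60 + 14.91            # 10:14.91
cpu_percent = round(100 * (user_s + system_s) / elapsed_s)
print("CPU %:", cpu_percent)
```

The peak resident size of roughly half a terabyte shows how much host memory the plotter had pinned or mapped by the time Table 3 finished, which is consistent with buffers ending up in the upper part of a 1.5TB machine.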

This can be seen with "dmesg -T" or /var/log/messages:

    Feb  1 17:45:35 m8 kernel: nvidia 0000:2d:00.0: swiotlb buffer is full (sz: 4194304 bytes), total 32768 (slots), used 0 (slots)
    Feb  1 17:45:35 m8 kernel: NVRM: 0000:2d:00.0: Failed to create a DMA mapping!
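The swiotlb message can be decoded with a small sketch like the one below. It assumes the usual 2 KiB swiotlb slot size (`IO_TLB_SHIFT = 11` in the kernel source); that value is an assumption for this particular kernel build, and `parse_swiotlb` is a hypothetical helper name.

```python
import re

SLOT_BYTES = 2048  # assumption: 2 KiB per swiotlb slot (IO_TLB_SHIFT = 11)

LINE = ("nvidia 0000:2d:00.0: swiotlb buffer is full (sz: 4194304 bytes), "
        "total 32768 (slots), used 0 (slots)")

def parse_swiotlb(line):
    """Extract request size and pool size from a 'swiotlb buffer is full' line."""
    m = re.search(
        r"sz: (\d+) bytes\), total (\d+) \(slots\), used (\d+) \(slots\)", line
    )
    if m is None:
        return None
    size, total, used = (int(g) for g in m.groups())
    return {
        "request_mib": size / 2**20,                 # size of the failed mapping
        "pool_mib": total * SLOT_BYTES / 2**20,      # whole bounce-buffer pool
        "slots_needed": -(-size // SLOT_BYTES),      # ceiling division
        "slots_used": used,
    }

print(parse_swiotlb(LINE))
```

Under that assumption the pool is 64 MiB and the failed request is a single 4 MiB mapping needing 2048 contiguous slots, even though no slots were in use at the time.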

With some experimentation I also got this failure mode:

Feb  1 22:43:28 m8 kernel: NVRM: GPU 0000:2d:00.0: RmInitAdapter failed! (0x25:0x65:1457)
Feb  1 22:43:28 m8 kernel: NVRM: GPU 0000:2d:00.0: rm_init_adapter failed, device minor number 0

Related documentation:
https://lenovopress.lenovo.com/lp1467.pdf
An Introduction to IOMMU Infrastructure in the Linux Kernel

@madMAx43v3r
Owner

Interesting problem, I don't think I can fix this via code though...

@bladeuserpi
Author

With the BIOS option "1TB Memory Cap" enabled, the problems are resolved; K34 works fine.

@bladeuserpi
Author

The 40-bit limit is also mentioned in the CUDA C Programming Guide (link below).
So it seems it affects Maxwell (and earlier), but does not affect Pascal (and later).
https://docs.nvidia.com/cuda/cuda-c-programming-guide/

Linear memory is allocated in a single unified address space, which means that separately allocated entities can reference one another via pointers, for example, in a binary tree or linked list. The size of the address space depends on the host system (CPU) and the compute capability of the used GPU:

                                                x86_64 (AMD64)   POWER (ppc64le)   ARM64
    up to compute capability 5.3 (Maxwell)      40bit            40bit             40bit
    compute capability 6.0 (Pascal) or newer    up to 47bit      up to 49bit       up to 48bit

Note

On devices of compute capability 5.3 (Maxwell) and earlier, the CUDA driver creates an uncommitted 40bit virtual address reservation to ensure that memory allocations (pointers) fall into the supported range. This reservation appears as reserved virtual memory, but does not occupy any physical memory until the program actually allocates memory.
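The table above can be expressed as a small lookup, sketched here. The function name `address_space_bits` is hypothetical, and the compute capabilities used in the examples (5.2 for the Quadro M6000, 8.6 for the RTX 3060 Ti) are taken from NVIDIA's published lists rather than from this thread.

```python
# Upper bounds for compute capability 6.0 and newer, per the quoted table.
NEWER_LIMIT_BITS = {"x86_64": 47, "ppc64le": 49, "arm64": 48}

def address_space_bits(compute_capability, arch="x86_64"):
    """Return the (maximum) unified address space width in bits.

    compute_capability is a (major, minor) tuple, e.g. (5, 2).
    """
    if arch not in NEWER_LIMIT_BITS:
        raise ValueError(f"unknown architecture: {arch}")
    if tuple(compute_capability) <= (5, 3):
        return 40                      # Maxwell and earlier, on every host
    return NEWER_LIMIT_BITS[arch]      # "up to" this many bits

print(address_space_bits((5, 2)))      # Quadro M6000 (Maxwell)
print(address_space_bits((8, 6)))      # RTX 3060 Ti (Ampere)
```

With 40 bits the card tops out at 1 TiB, which is why a 1.5TB host is a problem for the M6000 but not, by this table, for Pascal or later cards.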

@bladeuserpi
Author

The workaround "BIOS option 1TB Memory Cap" reduces my RAM from 1.5TB to 1TB, so I lose 512GB.
I also tested another workaround that does not lose any memory, and it also worked for me:

  1. Enable VT-x and VT-d in the BIOS (VT-d enables the IOMMU hardware)
  2. Append to the kernel command line: "intel_iommu=on iommu=nopt"
    intel_iommu=on: this is needed to tell the kernel to make use of VT-d
    iommu=nopt: this is needed to use "translated" mode (i.e. no passthrough)
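After rebooting, whether both parameters actually reached the kernel can be checked with a sketch like this. The helper names and the sample command line are hypothetical; on a real system the string would come from `/proc/cmdline`.

```python
REQUIRED = ("intel_iommu=on", "iommu=nopt")

def iommu_translation_requested(cmdline):
    """True if both parameters from the workaround are on the command line."""
    params = cmdline.split()
    return all(p in params for p in REQUIRED)

def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()

# Hypothetical example string; use read_cmdline() on the real machine.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro intel_iommu=on iommu=nopt"
print(iommu_translation_requested(sample))
print(iommu_translation_requested("BOOT_IMAGE=/vmlinuz ro intel_iommu=on"))
```

This only confirms that the parameters were passed; whether the IOMMU is really active still has to be confirmed in the kernel log.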

@bladeuserpi
Author

While it works with iommu, the performance is massively degraded for this use case.

With a 3060Ti I get these numbers (1st plot omitted):

  1. Running with iommu enabled
    Total plot creation time was 208.395 sec (3.47324 min)
    Total plot creation time was 207.218 sec (3.45363 min)

  2. Running with iommu not enabled
    Total plot creation time was 151.548 sec (2.5258 min)
    Total plot creation time was 152.016 sec (2.53361 min)
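The size of the slowdown can be worked out from the four timings above; this sketch simply averages each pair, which is a rough estimate given only two samples per configuration.

```python
from statistics import mean

with_iommu = [208.395, 207.218]       # seconds, IOMMU enabled
without_iommu = [151.548, 152.016]    # seconds, IOMMU not enabled

avg_with = mean(with_iommu)
avg_without = mean(without_iommu)
slowdown = avg_with / avg_without     # ratio of average plot times

print(f"average with IOMMU:    {avg_with:.1f} s")
print(f"average without IOMMU: {avg_without:.1f} s")
print(f"extra time per plot:   {(slowdown - 1) * 100:.0f}%")
```

That is roughly 37% more time per plot, or about 56 seconds, with translation enabled.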

@madMAx43v3r
Owner

Interesting to know
