Gigahorse-1.1.8-9d66b86: GPU address limit 1TB/40bit problem: instance of 'std::runtime_error, signal 6, swiotlb buffer is full, NVRM: Failed to create a DMA mapping! #28

bladeuserpi · 2023-02-03T18:12:51Z

Hi,

On a machine with 1.5TB of RAM, K34 plots do not work (GPU: Quadro M6000).
I believe this is due to the GPU's 40-bit address limit.

Here it is mentioned:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/iommu-dma-remapping
" This page describes the IOMMU DMA remapping feature that was introduced in Windows 11 22H2 (WDDM 3.0).
...
Upcoming servers and high end workstations can be configured with over 1TB of memory which crosses the common 40-bit address space limitation of many GPUs."

So it seems that while Windows 11 22H2 can handle it, on Linux it can be a problem (kernel 4.18.0-425.10.1.el8_7).
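The arithmetic behind the suspected limit can be sketched as follows. This is only an illustration, assuming the 40-bit figure applies to the addresses the GPU can reach for DMA and ignoring memory holes in the physical address map; the helper names are hypothetical.

```python
TIB = 2**40  # bytes in one tebibyte

def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 2**bits

def fits_in_address_bits(ram_bytes: int, bits: int) -> bool:
    """True if all of RAM lies below the address limit.

    Assumption: RAM is mapped contiguously from address 0 (real systems
    have holes, so the true top of memory is somewhat higher).
    """
    return ram_bytes <= addressable_bytes(bits)

ram_1_5_tb = 3 * TIB // 2  # the 1.5TB machine from this report

print(addressable_bytes(40) // TIB)           # 1 (TiB reachable with 40 bits)
print(fits_in_address_bits(ram_1_5_tb, 40))   # False
print(fits_in_address_bits(TIB, 40))          # True
```

With 1.5TB installed, roughly a third of the RAM sits above the 40-bit boundary, which would be consistent with the kernel falling back to bounce buffers (swiotlb) for those pages.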

Also note that increasing swiotlb only delays the termination until the 2nd plot, and even the 1st plot
might be corrupt, as there are some ten thousand (!) messages like these (and others):

    park_delta(): LP_1 < LP_0 (1875189930, 18446744073709551615) (x = 1348, y = 6770)
    park_delta(): LP_1 < LP_0 (2255150717, 18446744073709551615) (x = 1351, y = 6770)
    park_delta(): LP_1 < LP_0 (1891597597, 18446744073709551615) (x = 1353, y = 6770)
    park_delta(): LP_1 < LP_0 (1267797774, 1891597597) (x = 1354, y = 6770)
    park_delta(): LP_1 < LP_0 (2922005224, 3001459040) (x = 1356, y = 6770)
    park_delta(): LP_1 < LP_0 (30753450, 2922005224) (x = 1357, y = 6770)

Furthermore, these messages also appear for K32 after some successful plots when running with "-n -1",
so although that run did not terminate, it might also produce corrupt plots.
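To get a feel for how many of these messages a log contains, a small counter like the one below may help. It is a sketch, not part of the plotter: the regular expression is written against the message format shown above, and the function name is hypothetical. It also counts how often the second value equals 18446744073709551615, which is 2^64 - 1, the largest unsigned 64-bit integer. What that value means inside the plotter is not stated in this thread; it often indicates an unset or wrapped-around value, but that is an assumption.

```python
import re

UINT64_MAX = 2**64 - 1  # 18446744073709551615

# Matches the warning format seen in the plotter output above.
PARK_DELTA_RE = re.compile(
    r"park_delta\(\): LP_1 < LP_0 \((\d+), (\d+)\) \(x = (\d+), y = (\d+)\)"
)

def summarize_park_delta(log_text: str) -> tuple[int, int]:
    """Return (total warnings, warnings whose second value is UINT64_MAX)."""
    total = 0
    at_uint64_max = 0
    for match in PARK_DELTA_RE.finditer(log_text):
        total += 1
        if int(match.group(2)) == UINT64_MAX:
            at_uint64_max += 1
    return total, at_uint64_max

sample = """\
park_delta(): LP_1 < LP_0 (1875189930, 18446744073709551615) (x = 1348, y = 6770)
park_delta(): LP_1 < LP_0 (2255150717, 18446744073709551615) (x = 1351, y = 6770)
park_delta(): LP_1 < LP_0 (1891597597, 18446744073709551615) (x = 1353, y = 6770)
park_delta(): LP_1 < LP_0 (1267797774, 1891597597) (x = 1354, y = 6770)
park_delta(): LP_1 < LP_0 (2922005224, 3001459040) (x = 1356, y = 6770)
park_delta(): LP_1 < LP_0 (30753450, 2922005224) (x = 1357, y = 6770)
"""

print(summarize_park_delta(sample))  # (6, 3)
```

Any non-zero count on a finished plot would be a reason to verify that plot before farming it.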

This workstation has a BIOS option "1TB Memory Cap":

"If 1 TB of memory is installed, limits useable memory to 1TB-64MB for compatibility with graphics cards that
can't address 1TB or more of memory".

I will try that next.

Logs

    Chia k34 next-gen CUDA plotter - 9d66b86
    Plot Format: v2.4
    Network Port: 11337 [MMX] (unique)
    No. GPUs: 1
    No. Streams: 4
    Final Destination: ./
    Shared Memory limit: unlimited
    Number of Plots: 5
    Initialization took 0.106 sec
    Crafting plot 1 out of 5 (2023/02/01 16:36:20)
    Process ID: 1993
    Pool Puzzle Hash:  xxx
    Farmer Public Key: xxx
    Working Directory:   ./
    Working Directory 2: @RAM
    Compression Level: C1 (xbits = 15, final table = 3)
    Plot Name: plot-mmx-k34-c1-2023-02-01-16-36-xxx
    [P1] Setup took 0.894 sec
    [P1] Table 1 took 77.857 sec, 17179869184 entries, 16789935 max, 17020 tmp, 0 GB/s up, 2.31198 GB/s down
    [P1] Table 2 took 143.196 sec, 17179636764 entries, 16790768 max, 17010 tmp, 1.00561 GB/s up, 1.81572 GB/s down
    [P1] Table 3 took 319.245 sec, 17178866356 entries, 16787901 max, 16960 tmp, 0.651528 GB/s up, 1.8168 GB/s down
    terminate called after throwing an instance of 'std::runtime_error'
      what():  OS call failed or operation not supported on this OS
    Command terminated by signal 6
    223.30user 287.34system 10:14.91elapsed 83%CPU (0avgtext+0avgdata 541219792maxresident)k

This can be seen with "dmesg -T" or /var/log/messages:

    Feb  1 17:45:35 m8 kernel: nvidia 0000:2d:00.0: swiotlb buffer is full (sz: 4194304 bytes), total 32768 (slots), used 0 (slots)
    Feb  1 17:45:35 m8 kernel: NVRM: 0000:2d:00.0: Failed to create a DMA mapping!
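The numbers in the swiotlb message can be put into perspective with a little arithmetic. The sketch below rests on two assumptions that do not come from this thread but from my reading of mainline Linux sources, and which may differ in this RHEL kernel: a slot is 2 KiB (IO_TLB_SHIFT = 11), and a single mapping may span at most 128 contiguous slots (IO_TLB_SEGSIZE).

```python
SLOT_BYTES = 1 << 11   # assumption: IO_TLB_SHIFT = 11, i.e. 2 KiB per slot
SEGMENT_SLOTS = 128    # assumption: IO_TLB_SEGSIZE, max contiguous slots per mapping

def pool_bytes(total_slots: int) -> int:
    """Total size of the bounce-buffer pool in bytes."""
    return total_slots * SLOT_BYTES

def slots_needed(request_bytes: int) -> int:
    """Slots required for one mapping, rounded up."""
    return -(-request_bytes // SLOT_BYTES)

total_slots = 32768      # "total 32768 (slots)" from the kernel message
request = 4194304        # "sz: 4194304 bytes" from the kernel message

print(pool_bytes(total_slots) // 2**20)          # 64 (MiB pool)
print(slots_needed(request))                     # 2048
print(slots_needed(request) <= SEGMENT_SLOTS)    # False
```

If those assumptions hold, a 4 MiB mapping needs far more contiguous slots than one segment offers, which would explain why the buffer is reported as full with 0 slots in use.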

With some experimentation I also got this failure mode:

    Feb  1 22:43:28 m8 kernel: NVRM: GPU 0000:2d:00.0: RmInitAdapter failed! (0x25:0x65:1457)
    Feb  1 22:43:28 m8 kernel: NVRM: GPU 0000:2d:00.0: rm_init_adapter failed, device minor number 0

Related documentation:
https://lenovopress.lenovo.com/lp1467.pdf
An Introduction to IOMMU Infrastructure in the Linux Kernel


madMAx43v3r · 2023-02-03T18:23:52Z

Interesting problem, I don't think I can fix this via code though...

bladeuserpi · 2023-02-04T11:54:29Z

With Bios "1TB Memory Cap" the problems are resolved, K34 works fine.

bladeuserpi · 2023-02-09T16:43:35Z

The 40-bit limit is also mentioned in the CUDA C Programming Guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/
So it seems it affects Maxwell (and earlier), but does not affect Pascal (and later).

Linear memory is allocated in a single unified address space, which means that separately allocated entities can reference one another via pointers, for example, in a binary tree or linked list. The size of the address space depends on the host system (CPU) and the compute capability of the used GPU:

                                            x86_64 (AMD64)   POWER (ppc64le)   ARM64
up to compute capability 5.3 (Maxwell)      40bit            40bit             40bit
compute capability 6.0 (Pascal) or newer    up to 47bit      up to 49bit       up to 48bit

Note

On devices of compute capability 5.3 (Maxwell) and earlier, the CUDA driver creates an uncommitted 40bit virtual address reservation to ensure that memory allocations (pointers) fall into the supported range. This reservation appears as reserved virtual memory, but does not occupy any physical memory until the program actually allocates memory.
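The x86_64 column of the table above can be expressed as a small lookup. This is a sketch with a hypothetical function name; the compute capability values used in the example (5.2 for the Quadro M6000, 8.6 for the RTX 3060 Ti) are filled in from NVIDIA's published lists, not from this thread. Note also that the table describes the unified virtual address space, so applying it to DMA addressing is an inference.

```python
def max_address_bits_x86_64(major: int, minor: int) -> int:
    """Upper bound on address bits per the CUDA guide table (x86_64 column).

    Compute capability 5.3 and earlier: 40 bits.
    Compute capability 6.0 and newer: up to 47 bits.
    """
    return 40 if (major, minor) <= (5, 3) else 47

# Assumed compute capabilities (not stated in this thread).
gpus = {
    "Quadro M6000": (5, 2),
    "RTX 3060 Ti": (8, 6),
}

for name, (major, minor) in gpus.items():
    bits = max_address_bits_x86_64(major, minor)
    print(f"{name}: compute capability {major}.{minor}, up to {bits} bits")
```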

bladeuserpi · 2023-02-09T17:47:27Z

The workaround "BIOS option 1TB Memory Cap" reduces my RAM from 1.5TB to 1TB, so I lose 512GB.
I also tested another workaround that does not lose memory, and it also worked for me:

Enable VT-x and VT-d in BIOS (the VT-d enables IOMMU hardware)
Append to kernel commandline: "intel_iommu=on iommu=nopt"
intel_iommu=on: this is needed so the kernel is told to make use of VT-d
iommu=nopt: this is needed to use "translated" mode (i.e. no passthrough)
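After rebooting, it is worth confirming that both parameters actually reached the running kernel. The check below is a minimal sketch with hypothetical function names; it only inspects the command line string and does not verify that the IOMMU hardware was initialized (dmesg would show that).

```python
from pathlib import Path

REQUIRED_PARAMS = ("intel_iommu=on", "iommu=nopt")

def missing_iommu_params(cmdline: str) -> list[str]:
    """Return the required parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in REQUIRED_PARAMS if param not in present]

def read_running_cmdline() -> str:
    """Read the running kernel's command line; empty string if unavailable."""
    try:
        return Path("/proc/cmdline").read_text()
    except OSError:
        return ""

if __name__ == "__main__":
    missing = missing_iommu_params(read_running_cmdline())
    if missing:
        print("missing kernel parameters:", " ".join(missing))
    else:
        print("intel_iommu=on and iommu=nopt are both set")
```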

bladeuserpi · 2023-02-12T13:14:43Z

While it works with iommu, the performance is massively degraded for this use case.

For 3060Ti I get these numbers (1st plot omitted)

Running with iommu enabled
Total plot creation time was 208.395 sec (3.47324 min)
Total plot creation time was 207.218 sec (3.45363 min)
Running with iommu not enabled
Total plot creation time was 151.548 sec (2.5258 min)
Total plot creation time was 152.016 sec (2.53361 min)
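To quantify the slowdown, the four timings above can be averaged and compared. This is plain arithmetic on the numbers reported here; two runs per configuration is a small sample, so the result is only indicative.

```python
from statistics import mean

# Plot creation times in seconds, copied from the runs above.
with_iommu = [208.395, 207.218]
without_iommu = [151.548, 152.016]

mean_with = mean(with_iommu)
mean_without = mean(without_iommu)
slowdown_percent = (mean_with / mean_without - 1) * 100

print(f"mean with iommu:    {mean_with:.1f} sec")
print(f"mean without iommu: {mean_without:.1f} sec")
print(f"slowdown:           {slowdown_percent:.1f} %")
```

That works out to roughly 37% more time per plot with the IOMMU in translated mode.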

madMAx43v3r · 2023-02-12T16:23:07Z

Interesting to know
