Incorrect CPU kinds on AMD Threadripper PRO 7000 #690
Comments
Hello. This is actually rather related to #502 and we have a workaround that may work on AMD too. Try setting HWLOC_CPUKINDS_MAXFREQ=adjust=50 to ignore frequency differences up to 50% and let us know if you get a single cpukind.
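To illustrate what "ignore frequency differences up to 50%" could mean, here is a minimal Python sketch. It is not hwloc's implementation: the function name, the greedy grouping rule and the sample frequencies are all hypothetical, and (as noted later in this thread) hwloc only adjusts max frequencies when identical base frequencies are found.

```python
def merge_by_max_freq(max_freq_khz, tolerance_pct):
    """Group CPUs whose max frequency lies within tolerance_pct of the
    highest frequency already seen in a group (hypothetical rule)."""
    groups = []  # list of (reference_freq, [cpu ids])
    # Walk CPUs from the highest max frequency down to the lowest.
    for cpu, freq in sorted(max_freq_khz.items(), key=lambda kv: -kv[1]):
        for ref, cpus in groups:
            if (ref - freq) * 100 <= tolerance_pct * ref:
                cpus.append(cpu)
                break
        else:
            groups.append((freq, [cpu]))
    return [sorted(cpus) for _, cpus in groups]


# Hypothetical per-CPU max frequencies in kHz (not taken from the machine).
sample = {0: 5300000, 1: 6200000, 2: 7700000, 3: 7100000}
print(merge_by_max_freq(sample, 50))  # one group when 50% is tolerated
print(merge_by_max_freq(sample, 0))   # one group per distinct frequency
```

With a tolerance of 0 every distinct max frequency becomes its own group, which is the kind of spurious split this issue is about.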
It's unchanged, still reporting 16 different kinds.
Or simply ignore frequency differences on AMD entirely, at least until AMD starts making hybrid CPUs. At that point #587 should help tell the core types apart, similar to cpu_{atom,core} on Intel.
Can you send the tarball foo.tar.bz2 generated by hwloc-gather-topology foo on this machine so that I can debug this from here? In the meantime, setting HWLOC_CPUKINDS_HOMOGENEOUS=1 should work around the issue. The issue with #587 is that it's only in our x86 backend. It'd be easier if Linux exposed it in sysfs but Linux kernel devs aren't convinced it's useful. Intel added /sys/devices/cpu_{atom,core} for PMU but I don't know if AMD will do the same. I agree ignoring cpukinds on AMD might be easier for now (with an envvar to reenable it if ever needed). But I am going to poke my AMD contacts to better know what's coming. There are some leaks of Zen 5 "strix point" coming with both P and E cores soon.
Will do.
Indeed it does.
I just realized that there are some Zen 4 CPUs that mix Zen 4 and Zen 4c, see https://www.phoronix.com/review/amd-zen4-zen4c-scaling. So CPU kind detection is desirable even on current generation AMD CPUs.
Threadripper 7000 doesn't mix Zen 4 and Zen 4c. I suspect this is actually tied to a preferred cores detection issue. AMD does do rankings via CPPC of which cores on the die are better, even if they can all clock identically. There is a series that I submitted for 6.12-rc1 that I think will make this behave properly both with acpi-cpufreq and amd-pstate. The PR for it is already merged, so if you want to try Linus' tree as of today you can see if it helps.
And yes https://www.amd.com/en/products/processors/laptop/ryzen/300-series/amd-ryzen-ai-9-hx-370.html is already public and Strix is on the market. You can see that SKU clocks at 5.1 GHz for the performant cores and 3.3 GHz for efficient.
Thanks @superm1. I assume you are referring to torvalds/linux@9bcf303? I unfortunately don't have admin privileges on that Threadripper 7000 machine, but I'll see if I can get someone else to test it.
Is there a sysfs node that exposes whether a core is performant or efficient?
Yes that's the merge commit that pulls in all the 6.12 content and I expect helps this with acpi-cpufreq OR amd-pstate.
There's a CPUID explained in the APM volume 2 for it on page 213: The appendix of volume 3 on page 646 explains more about it too: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf
Support for this CPUID in hwloc is pretty much ready in #587 but I don't have any platform to test it. If strix is already public, it would be nice if you could run "hwloc-gather-cpuid" on it and send me a tarball of the resulting "cpuid" directory (this tool dumps the output of all CPUID leaves on each core so that I can use them remotely in hwloc). Regarding sysfs itself, as far as I know, the only way to get E-core vs P-core information on Intel is by reading /sys/devices/cpu_{atom,core}/cpus (that's where they store PMU info). Any chance this would be available for AMD too? I requested the addition of a dedicated "type" in sysfs cpu files recently but it's not clear it'll ever happen, at least because the "atom vs core" type isn't enough on Intel when there are low-power cores (but they could tweak that "type" sysfs file to report something different for E-core and LPE-core).
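For readers unfamiliar with the Intel files mentioned above, a minimal Python sketch of reading them follows. It assumes the cpus files use the usual kernel cpulist format (for example "0-15"); the function names and the sysfs_root parameter are hypothetical helpers, not part of hwloc.

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8,10-11" into CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_intel_core_types(sysfs_root="/sys/devices"):
    """Return {"cpu_core": [...], "cpu_atom": [...]} for the files that exist."""
    kinds = {}
    for name in ("cpu_core", "cpu_atom"):
        path = Path(sysfs_root) / name / "cpus"
        if path.is_file():
            kinds[name] = parse_cpulist(path.read_text())
    return kinds


if __name__ == "__main__":
    # Prints an empty dict on machines without these PMU directories.
    print(read_intel_core_types())
```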
Sure, this is with specifically the SKU AMD Ryzen AI 9 HX 370 w/ Radeon 890M.
As it's already available from the cpuid information, I would like to understand the use case to justify exporting it somewhere. What are you going to do with it? IMO alone it doesn't tell you enough relational information. For example whether the cores are on the same CCX, CCD, the family etc. The CPUID tells you a lot more so you can make informed decisions on it.
With your argument, should we remove sysfs cpu topology files? The vast majority of topology info from CPUID 0xb, 0x1f on Intel and 0x80000026 on AMD is already exposed in sysfs in a portable way. Hybrid core info is an important piece that is still missing in sysfs, for many users who are going to look at which core is small or big before binding tasks in parallel jobs. CPUID is far less convenient than sysfs because you have to bind to every single core to run Intel or AMD specific CPUID calls to get hybrid info (what #587 will do when the operating system doesn't expose it).
No; that would cause regressions from any software that utilized them. Once you introduce such a file, you can't remove it. That's exactly the reason I want to make sure that it makes sense to create before doing so. It's a maintenance burden to hang on to.
I have the view that this is the scheduler's job, not the user's job. The scheduler should be made aware of the capacity of the cores and place and migrate tasks based upon that. Even without the hetero detection code I'm working on for 6.13, I would expect that amd-pstate does a relatively good job using preferred cores and CPPC highest perf values to rank them.
That's the eternal debate between kernel developers saying the kernel can guess what userspace wants, and HPC users not trusting the kernel to understand anything correctly. In the past, it was only HPC users, but nowadays it's very common because parallel libraries are everywhere. For general purpose irregular workloads, the scheduler may be able to do good things. However when userspace knows what it's doing, it'll create one task per cpu and has better information about which one should go where. Also, another use case with hybrid info is userspace apps running regular parallelism where you want all your tasks to run at the same speed so that they don't slow each other down. If you have 8 E-cores and 4 P-cores, you'll want either 4 tasks on P-cores or 8 tasks on E-cores. But first you have to know how many P-cores and E-cores exist in the system.
Of course affinitizing a task to a certain core could be helpful in some contexts by some users. The problem is you might not be able to correctly classify it against the available hardware performance capacity from userspace. It's alluded to in this series, but I'll mention that some hardware can actually feed back hints to the scheduler for information about tasks that should be migrated.
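The "4 tasks on P-cores or 8 tasks on E-cores" choice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the kinds list, its fastest-first ordering and the function name are hypothetical, and how the core lists are obtained (hwloc, sysfs or CPUID) is left out.

```python
def pick_uniform_cores(kinds, ntasks):
    """Pick ntasks cores that all belong to one kind.

    kinds is a list of (name, core_ids) pairs ordered from fastest to
    slowest. Returns (name, chosen_cores), or None when no single kind
    has enough cores to run every task at the same speed.
    """
    for name, cores in kinds:
        if len(cores) >= ntasks:
            return name, cores[:ntasks]
    return None


# Hypothetical machine with 4 P-cores and 8 E-cores.
kinds = [("P-core", [0, 1, 2, 3]), ("E-core", list(range(4, 12)))]
print(pick_uniform_cores(kinds, 4))   # fits on the P-cores
print(pick_uniform_cores(kinds, 8))   # only the E-cores have 8 cores
print(pick_uniform_cores(kinds, 12))  # no uniform placement exists
```

The point of the sketch is that the decision needs only the per-kind core counts, which is exactly the information userspace has to discover first.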
But the thing is it's not just raw max frequency. You have other factors like how much cache the cores have and which cores share that cache. You have to know which cores have not hit thermal limits and can boost for longer. IMO I think if you're going to try to play with gaming which cores to put tasks on, you're better off using something like sched_ext and working out a userspace scheduler.
But not all performance cores are identical! Just looking at the raw CPPC highest performance characterization for that Strix system I had above let me show you:
$ grep -v foo /sys/bus/cpu/devices/*/cpufreq/amd_pstate_highest_perf
/sys/bus/cpu/devices/cpu0/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu10/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu11/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu12/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu13/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu14/cpufreq/amd_pstate_highest_perf:202
/sys/bus/cpu/devices/cpu15/cpufreq/amd_pstate_highest_perf:196
/sys/bus/cpu/devices/cpu16/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu17/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu18/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu19/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu1/cpufreq/amd_pstate_highest_perf:208
/sys/bus/cpu/devices/cpu20/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu21/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu22/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu23/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu2/cpufreq/amd_pstate_highest_perf:202
/sys/bus/cpu/devices/cpu3/cpufreq/amd_pstate_highest_perf:196
/sys/bus/cpu/devices/cpu4/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu5/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu6/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu7/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu8/cpufreq/amd_pstate_highest_perf:125
/sys/bus/cpu/devices/cpu9/cpufreq/amd_pstate_highest_perf:125
You can probably infer which of those cores are the Zen5c cores and which are the Zen5 without an extra sysfs file. But how would you want to parallelize things? Only a few of the Zen5 cores behave the same. You'll have two SMT pairs at 208, one SMT pair at 202 and one at 196. So would you put your workload on the Zen5c cores because they all can boost the same? No, you would need to know which siblings make sense from shared cache, which make sense because they're SMT pairs (depending upon the workload). And you would need to know if you're going to be jumping from one CCD to another.
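As a small illustration of the grouping visible in that listing, here is a Python sketch that parses grep-style "path:value" lines and buckets CPUs by their highest_perf value. The helper name and regular expression are hypothetical; the sample values are the ones from the listing above.

```python
import re
from collections import defaultdict

# Matches ".../cpuN/cpufreq/amd_pstate_highest_perf:VALUE" as printed by grep.
LINE_RE = re.compile(r"/cpu(\d+)/cpufreq/amd_pstate_highest_perf:(\d+)\s*$")


def group_by_highest_perf(grep_output):
    """Return {highest_perf: [cpu ids]} ordered from highest to lowest value."""
    groups = defaultdict(list)
    for line in grep_output.splitlines():
        match = LINE_RE.search(line)
        if match:
            groups[int(match.group(2))].append(int(match.group(1)))
    return {perf: sorted(cpus)
            for perf, cpus in sorted(groups.items(), reverse=True)}


# Values copied from the listing above (cpu -> amd_pstate_highest_perf).
VALUES = {0: 208, 1: 208, 2: 202, 3: 196, 12: 208, 13: 208, 14: 202, 15: 196}
VALUES.update({cpu: 125 for cpu in list(range(4, 12)) + list(range(16, 24))})
SAMPLE = "\n".join(
    f"/sys/bus/cpu/devices/cpu{cpu}/cpufreq/amd_pstate_highest_perf:{value}"
    for cpu, value in VALUES.items()
)

for perf, cpus in group_by_highest_perf(SAMPLE).items():
    print(perf, cpus)
```

Grouping on this value alone yields four buckets, which shows why it cannot be read directly as a two-way "Zen5 vs Zen5c" classification.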
I'd be happy to use sched_ext, but there are tons of existing users and legacy apps who didn't ever trust the kernel for scheduling their apps correctly, it's not going to change (especially because the hardware is much more complicated and they don't see why the kernel would do anything better than them). They just disable things like turboboost to mitigate the issue, ignore tiny differences like your 196 vs 208 above, etc. And rely on a library like hwloc to get hardware info (including cache sharing, cache sizes, as you say) to do their own scheduling (which isn't really scheduling but often rather placing one task per thread). Anyway, this issue is diverging from the original issue. Do you think @mkuron's original issue with very different frequencies will go away in future releases, so that disabling hwloc frequency comparison algorithm is enough for now?
Sorry for the delay. Here is the topology data from the Threadripper 7975WX with the spurious 16 CPU kinds: threadripper7975WX.tar.gz. I'll be happy to test whatever workaround you might come up with inside hwloc, @bgoglin.
Thanks @mkuron. The reason why HWLOC_CPUKINDS_MAXFREQ=adjust=50 didn't help is that there is no basefreq like in Intel pstate (I don't adjust max frequencies unless base frequencies are found and identical). I'll use acpi_cppc/nominal_freq instead when available. Anyway, the real workaround is HWLOC_CPUKINDS_HOMOGENEOUS=1 and hope the kernel fix works.
Sounds good. That value is consistently reported as 4001 on this machine.
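The base-frequency lookup discussed here (cpufreq/base_frequency where present, otherwise acpi_cppc/nominal_freq) can be sketched in Python as follows. This is not hwloc's C code: the function name and sysfs_root parameter are hypothetical, and the units (kHz for base_frequency, MHz for nominal_freq) are my understanding of the kernel documentation, which fits the 4001 value reported above.

```python
from pathlib import Path


def read_base_freq_khz(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the base frequency of one CPU in kHz, or None if unknown."""
    cpu_dir = Path(sysfs_root) / f"cpu{cpu}"
    candidates = (
        # (file, multiplier to kHz); units are assumptions, see the lead-in.
        (cpu_dir / "cpufreq" / "base_frequency", 1),    # assumed kHz
        (cpu_dir / "acpi_cppc" / "nominal_freq", 1000),  # assumed MHz
    )
    for path, to_khz in candidates:
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # file missing, unreadable or not a number
        if value > 0:
            return value * to_khz
    return None


if __name__ == "__main__":
    print(read_base_freq_khz(0))
```

Trying base_frequency first keeps the Intel behaviour unchanged, and the fallback only matters on platforms where that file does not exist.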
cpufreq/base_frequency is only available on Intel so far, and works well. acpi_cppc/nominal_freq is already available on AMD (and ARM or soon), so it's likely good for the future. However it reports incorrect values on Intel SPR and MTL at least. Hence try cpufreq/base_frequency first, then fallback to acpi_cppc/nominal_freq. Refs #690 Signed-off-by: Brice Goglin <[email protected]>
cpufreq/base_frequency is only available on Intel so far, and works well. acpi_cppc/nominal_freq is already available on AMD (and ARM or soon), so it's likely good for the future. However it reports incorrect values on Intel SPR and MTL at least. Hence try cpufreq/base_frequency first, then fallback to acpi_cppc/nominal_freq. Refs #690 Signed-off-by: Brice Goglin <[email protected]> (cherry picked from commit 2292110)
What version of hwloc are you using?
2.10.0
Which operating system and hardware are you running on?
Alma Linux 8.10
Linux 4.18.0-553.5.1.el8_10.x86_64
Dell Precision 7875 Tower
BIOS version 1.6.2
AMD Ryzen Threadripper PRO 7975WX 32-Cores
Details of the problem
lstopo shows multiple CPU kinds on AMD Ryzen Threadripper PRO 7975WX due to variations in the max frequency (which looks excessively high) and lack of a base frequency (base frequencies in general do not seem to be reported by hwloc for AMD CPUs). The AMD Ryzen Threadripper "Storm Peak"/Zen 4 generation is a homogeneous CPU and should have all cores represented as the same kind.
This issue bears some similarity to #634, though there the frequencies had only very minor variations and looked much more reasonable. I am not entirely sure whether this CPU really thinks it has such excessively high and varying frequencies, or if this is simply a bug in the BIOS, firmware, or Linux kernel that leads to incorrect reporting.
Notes
The data sheet for this CPU says that the boost frequency is 5.3 GHz (which actually coincides with CPU kind #0), but I can't imagine 7.7 GHz being achievable with any kind of cooling. https://openbenchmarking.org/s/AMD+Ryzen+Threadripper+PRO+7975WX+32-Cores has the lscpu output for the same machine and theirs even goes up to 8.1 GHz. https://www.phoronix.com/review/hp-z6-g5-a/3 actually stated 9 months ago that:
As this bug remains unfixed at least in RHEL8's Linux kernel (didn't verify any others), a workaround for this hardware quirk inside hwloc would be desirable. The frequencies reported in /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq correspond to the ones reported by hwloc, so this is clearly not an hwloc bug, but could potentially be worked around in ways similar to #634/#635.