Hardware inventory #661
👍
Verified the nvidia GPUs and the RAM in the Apple chips, ignored all the rest.
If things are being simplified such that a single FLOPS number is used, it's probably best to focus on the most used/usable option: for Nvidia cards, the non-sparse half-precision tensor-core FLOPS is what most people will be operating under for training workloads, and much of inference too... although inference is starting to see heavier use of 8-bit int or float precision on chips with support. Comparing any Nvidia datacenter GPU on its base (non-tensor-core) FP32 FLOPS does not make any sense. Nobody uses them in that mode; the capability of that mode is cut relative to gamer GPUs so that more tensor cores fit on the die. Maybe a few users will be on TF32 on the tensor cores, but most will be in mixed or half precision for matrix multiplies. |
Also, some of the latest Intel (and ARM) CPUs have bfloat16 and/or int8 instruction sets for mixed/half-precision training and inference, so that's something to watch. Up to this point there were usually just float32 FLOPS to consider on CPUs. |
An example of what @rwightman is bringing up is the A100: A100s have only 19.49 TFLOPS of FP32, which would make them look weaker than the A10, A10G, L4 and L40S, which they clearly aren't in real-world use cases (e.g. the attached benchmark graph). So I agree with Ross that it's probably best to use FP16 as the baseline FLOPS instead of FP32 for all GPUs. I think it correlates better with perceived real-world performance. |
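To make the comparison concrete, here is a minimal sketch of what a single-number entry could look like under that convention. The `HardwareSpec` shape and field names are hypothetical (not the schema actually used in this PR); the figures are NVIDIA's published A100 numbers (~19.5 TFLOPS base FP32 vs 312 TFLOPS dense FP16/BF16 tensor-core).

```ts
// Hypothetical shape, for illustration only; the real schema in the tasks package may differ.
interface HardwareSpec {
  /** Dense (non-sparse) half-precision tensor-core throughput, in TFLOPS. */
  tflops: number;
  /** Available memory configurations, in GB. */
  memory?: number[];
}

const A100: HardwareSpec = {
  // 312 TFLOPS dense FP16/BF16 tensor-core (624 is the "with sparsity" figure);
  // the base non-tensor-core FP32 rate is only ~19.5 TFLOPS.
  tflops: 312,
  memory: [40, 80],
};
```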
Co-authored-by: Pedro Cuenca <[email protected]>
@apolinario @rwightman, sounds good, where can one find the values for FP16? (@apolinario where is your graph from) |
i've switched from FP32 => FP16 for NV and AMD GPUs. TFlops for CPUs are still kinda random, if anyone wants to give me a hand? |
Regarding CPUs, the same generation of processor can vary in TFLOPS. I think this website can provide decent directions (https://www.cpu-monkey.com/en/benchmark-intel_core_i5_13600k-bench_11) and I'm happy to help with data entry for CPUs. However, I'm wondering how to handle this scenario:
Suggestion: having, for each processor family, a low-end and a high-end value.
It would be bad if the user doesn't know how low/high end their CPU is, but if they don't know whether theirs is low or high end they may not know the specific model to pick either 🤔 - it also misses intermediate steps - so I'm not super sure this is the best way, but if we think it's decent enough I'm happy to fill in the data for it |
honestly i would just suggest inputting a value in the middle of the range. Especially given this comment line at the top of the file: "This is only approximate/theoretical and shouldn't be taken too seriously." |
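A rough sketch of how the two ideas could fit together: store a per-family low/high range and report its midpoint as the single approximate value. The `CpuSpec` type and the numbers below are made up for illustration only, not a proposed schema or real measurements.

```ts
// Illustrative only: hypothetical type with made-up numbers.
interface CpuSpec {
  /** Approximate FP32 TFLOPS for the low-end and high-end SKUs of a family. */
  tflopsRange: [number, number];
}

// Report the middle of the range as the single "approximate/theoretical" value.
function approxTflops(cpu: CpuSpec): number {
  const [low, high] = cpu.tflopsRange;
  return (low + high) / 2;
}

const exampleFamily: CpuSpec = { tflopsRange: [0.5, 1.5] }; // made-up numbers
console.log(approxTflops(exampleFamily)); // 1
```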
CPU FLOPS can be quite confusing because there are FLOPS from the old FPU (floating-point unit), there are FLOPS from the SIMD AVX/NEON/etc. units that are often leveraged by ML libs, and there are FLOPS from integrated GPUs. The advertising of each is not consistent, and vendors tend to focus on the bigger number, often the iGPU, but that is most likely not usable for ML in most AMD or Intel cases. |
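For the SIMD figure (the one ML libraries can actually approach), the usual back-of-the-envelope estimate is cores × clock × FP32 lanes × FMA units per core × 2 ops per FMA. The helper below is only a sketch of that formula, and the example inputs are hypothetical rather than taken from any specific CPU's datasheet.

```ts
// Theoretical peak FP32 TFLOPS of the SIMD units, ignoring the legacy FPU and any iGPU.
function simdPeakTflops(
  cores: number,
  clockGhz: number,
  fp32Lanes: number, // e.g. 8 for 256-bit AVX2, 16 for AVX-512
  fmaUnitsPerCore: number,
): number {
  // 2 ops per FMA (multiply + add); clock in GHz, so divide by 1000 to get TFLOPS.
  return (cores * clockGhz * fp32Lanes * fmaUnitsPerCore * 2) / 1000;
}

// Hypothetical 8-core AVX2 part at 4 GHz with 2 FMA units per core: ~1 TFLOPS FP32.
console.log(simdPeakTflops(8, 4.0, 8, 2)); // 1.024
```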
Most NVIDIA datasheets or spec summaries will list the different FLOPS. The bfloat16 or float16 'Tensor Core' FLOPS are the interesting ones (not the non-TC FP16 FLOPS), more specifically FP16 w/ FP32 accumulate (though they don't always distinguish this). For datacenter GPUs like A100 and H100 (they are called 'Tensor Core' GPUs) or workstation Quadro cards, FP16 w/ FP32 accum is the default in lower precision. Gamer GPUs are crippled to differentiate price points: FP16 w/ FP32 accum usually runs at half the rate of FP16 w/ FP16 accum (which is not that useful for ML), so spec sheets might distinguish the two. The other silly thing: some spec sheets use the 'sparsity' FLOPS number, which is only realizable in specific situations. It's often denoted by a small superscript ("with sparsity"). The actual number is typically half that. |
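For data entry, the two gotchas above boil down to a couple of divide-by-two adjustments. The helper below is just a sketch of those rules of thumb as described in the comment, not an official NVIDIA formula; the 1979 figure is roughly what the H100 SXM sheet quotes for FP16 tensor-core with sparsity.

```ts
// Derive the "useful for ML" dense FP16 w/ FP32 accumulate number from a spec-sheet figure:
// - if the sheet quotes the sparsity TFLOPS, the dense value is typically half;
// - on gamer cards, FP16 w/ FP32 accumulate typically runs at half the quoted
//   FP16 w/ FP16 accumulate rate.
function usableFp16Tflops(
  sheetTflops: number,
  quotedWithSparsity: boolean,
  gamerFp16Fp16Accum: boolean,
): number {
  let tflops = sheetTflops;
  if (quotedWithSparsity) tflops /= 2;
  if (gamerFp16Fp16Accum) tflops /= 2;
  return tflops;
}

console.log(usableFp16Tflops(1979, true, false)); // 989.5 (dense)
```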
Co-authored-by: apolinário <[email protected]>
thanks a lot @apolinario!
cc @mfuntowicz too on the whole thread btw |
@rwightman even if not cleanly comparable to GPUs, are the CPU tflops values in the current state of this PR reasonable enough? |
@julien-c bit late to the party it seems. Checking Ice Lake and Sapphire Rapids, they're both lower than I'd expect... I think Ice Lake is in the 1-3 TFLOPS range and Sapphire Rapids is 3-4+. Also, GPUs are still way off: consumer cards are at about 1/2 of what they should be, and A100/H100 are off by much more... |
@rwightman oops, PR welcome on top of this! |
I hesitated putting this one in @huggingface/tasks, or creating a new @huggingface/hardware. What do you think?
The idea is for the community to be able to contribute to that list.
What I picked for now ⤵️
Because I'm lazy – and because it's somewhat linked to #659 – i've added it to tasks.