stmt and stmt_html output are too low level #7519
Comments
@maaz139 Other than line 441 of Module lower(const std::vector<Function> &output_funcs, ..., Halide::my_intent_t = Stmt_only); |
I'd like to chime in, and state that I ❤️ the PTX code in the Stmt HTML to get a good understanding of what's happening in a single CUDA thread. I examined these for months when working on something for work. Much like how one wants to check the generated assembly for CPU code, you'd want to validate what got generated for the GPU code. In my opinion, the compilation pipeline should be respected for what it is: a pipeline. The code can be useful to inspect at different stages. Statement IR can be useful, PTX code can be useful. Just like we are able to get a Alternatively, we output the PTX code in the assembly file, but that would cause us to lose the colors, and in general is hard to navigate, as the Considering all the discussions regarding the VizIR HTML, the problems it has, and the fact that now PTX code is likely to be another variable, I believe a possible solution could be to have more generator emit options: now there is only
In the above, Additionally, to generalize the
If this sounds reasonable, I could give this a try to implement this. What I like about this, is that it will be backward-compatible. Feedback on this idea? |
To make the discussion complete, I'll copy-paste the thoughts I posted on the other PR here: I have been thinking -- like most of you -- quite a bit about this. I believe it makes sense to pick a few points in the lowering process, stash the IR away into the Module for later, and emit multiple files, each file representing one point in the lowering process. I'm absolutely not sure what to do with the PTX code and the assembly tab. It is assembly, and therefore could go there, but it's not the assembly of the pipeline. It really is a buffer containing assembly for the GPU. Maybe... Perhaps... It is more sensible to have a "Buffers" tab? Overall, I believe the HTML way of doing things is getting convoluted, and I'm thinking of an approach where we just dump a very information-rich IR tree to some file format and have a custom DearImGui-based tool be the visualizer. I don't know what the recently merged "experimental serializer" is exactly capable of, but maybe piggybacking on that to dump flatbuffers of the IR at a few points of the lowering process can be a nice trick here to build such a tool. @TH3CHARLie Can the serializer you built be used for dumping an IR tree? |
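The "stash the IR at a few points in lowering" idea above can be sketched roughly as follows. This is a toy model, not Halide code: `ToyStmt`, `ToyPass`, `ToyModule`, `toy_lower`, and the pass names are all hypothetical stand-ins invented for illustration, and the "IR" is just a string.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an IR statement; here the "IR" is plain text.
using ToyStmt = std::string;

// One named lowering pass. `snapshot_after` marks the points worth keeping.
struct ToyPass {
    std::string name;
    std::function<ToyStmt(const ToyStmt &)> run;
    bool snapshot_after;
};

// Hypothetical stand-in for a module: the final IR plus the snapshots
// stashed along the way, keyed by the pass that produced them.
struct ToyModule {
    ToyStmt final_stmt;
    std::map<std::string, ToyStmt> snapshots;
};

// Run the passes in order, copying the IR aside after each marked pass.
ToyModule toy_lower(ToyStmt s, const std::vector<ToyPass> &passes) {
    ToyModule m;
    for (const ToyPass &p : passes) {
        assert(p.run && "every pass needs a body");
        s = p.run(s);
        if (p.snapshot_after) {
            m.snapshots[p.name] = s;  // one output file per snapshot, later
        }
    }
    m.final_stmt = std::move(s);
    return m;
}

// Made-up pipeline: the offload step replaces readable IR with a buffer,
// so the snapshot taken just before it is the one a human wants to read.
std::vector<ToyPass> example_passes() {
    return {
        {"custom_passes", [](const ToyStmt &s) { return s + " +custom"; }, true},
        {"offload_gpu", [](const ToyStmt &s) { return "buffer<ptx>(" + s + ")"; }, false},
        {"final_simplify", [](const ToyStmt &s) { return s + " +simplified"; }, true},
    };
}
```

Calling `toy_lower("for x: f(x)", example_passes())` keeps two snapshots: the one taken after `custom_passes` still shows the loop, while the final statement only shows the opaque buffer wrapper. Each snapshot could then be written to its own file.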
Yes, that's pretty much the only thing it can be used for :-) |
Moving the discussions from the draft PR here. Feel free to clarify if I mis-quoted you.
@mcourteaux pointed out the PTX code dump in the @maaz139 I am not sure if the "assembly tab" will visualize the PTX code, even if we add a "jump" button. If I understand the docs correctly, the PTX IR is further "lowered" by LLVM into Nvidia bytecode, then encoded/obfuscated into a buffer entry in the assembly. So, the jump button may be useless here.
Understood. I retracted my draft PR. It does serve my most immediate needs. That is, to reason about the Agreed it can be misleading if (other) users' goal, unlike mine, is to cross-reference the lowered IR and the assembly output. |
Perhaps we use the term "assembly tab/code/dump" too broadly, creating confusion. There are more formal terms available: 2GL and 3GL. When I said the PTX, as an IR, looks too low level, I meant that the PTX dump looks too much like 2GL. I was expecting a 3GL-style textual representation of the Halide IR, as back in Halide 10.0. @maaz139, could you please drive the discussion on the work scope? That is, how low, 2GL or 3GL, should Halide IR be printed as In the meantime, I will put in the effort to learn the PTX IR syntax. (An off-topic opinion: I personally call the Halide language a 4GL, and languages like CVXGEN 5GLs.) |
@antonysigma To clear this up: what I meant there is that the IR printed to the HTML file is actually the IR that gets compiled as CPU code. The I'm not sure about the LLVM IR PTX you describe... As far as I understood, the buffer contains PTX code, which is NVIDIA specific, and not really anything related to LLVM.
@steven-johnson It seems, though, that at this point the interface only supports dumping a |
Yeah -- it's still experimental, and thus the precise use cases that are sensible to allow (or disallow) are still a bit fluid (and likely will be for a version, at least). Feel free to offer a PR that allows de/serializing fragments. |
That's a great question. In my personal opinion, the statement files are meant to look like 3GL code. The whole point of printing the Stmt file is to look at code that is not 2GL -- at least in my use cases. I can't really comment on why folks decided to print low-level PTX code inside the stmt files; I was not involved with the code back then, but I can imagine there was a demand for it. Generally speaking, I agree that different users typically seek out different details. Generating different variations of the stmt_html files is one way to disentangle different "views" into the Halide program and have consistent expectations with each view. I am not sure if there should be dedicated generator flags for each; perhaps we can generate all of the versions whenever a user selects
That seems like a fair bit of work but I concur that the HTML based method is not very scalable. I'm happy to contribute on the de-serialization of missing lower-level IR constructs. |
@maaz139 I requested this long ago, and eventually contributed this feature in #6444 (buffer in the stmt HTML) and #6447 (syntax highlighting of the buffer). I can maybe answer this by asking a question instead: how do you otherwise review what the generated code looks like? I originally asked about this exact thing in #6410, as I was clueless about how to check what code actually gets run as a CUDA kernel. @abadams helped me out and said that one can set the environment variable HL_DEBUG_CODEGEN=1. This will output the PTX code to stdout while the generator runs. Not at all pleasant to work with. Making the PTX available in the stmt file made sense to me, as that is really what gets compiled. |
But I agree that a patch like the one @antonysigma proposed in #7753 looks super useful to get something like in the screenshot above. I don't know how you ever got to that screenshot (supposedly in Halide 10?), because the GPU-specific Stmt IR had already been offloaded in the lowering passes before the Generator could even generate the HTML. |
Gotcha! That makes sense. I agree that having specialized views into different stages of the pipeline would be a nice solution. |
Yes, the screenshot was created in Halide 10.0. At the time, Halide 14.0 and Halide 10.0 still retained some form of API/ABI compatibility. I exploited that to "switch" between the Halide IR view and the so-called "Halide IR with offloaded PTX" view. I can no longer do that beyond Halide 15.0.
Great! Are we getting a multi-panel HTML page, or multiple HTML pages featuring various stages? Both are fine with me. Again, my use case is to:
These, I think, will require Halide IR printouts to be in 3GL to be productive. |
I am with you on the PTX printout requirement. I know a few companies that demand 2GL program listings for security/high-availability auditing purposes. I simply don't know how popular such unorthodox usage of Halide is. I will defer to those who actually use the PTX printout features. I can give my two cents on the purpose of the direct PTX/NEON/AVX printouts. I have encountered a few industries that treat the So, these industries adopt a "cleanroom" design protocol: nothing (Halide generated) gets in, nothing (proprietary) gets out. In other words, employee A programs the AOT to generate the Yeah, I know such a cleanroom protocol effectively rejects Halide's core design philosophy. Again, I am simply not sure how popular this approach is, and the unorthodox use of |
🤯 Waw... Crazy.
Today I got back into making GPU schedules in Halide, and here I am: struggling with the PTX-only code. I think one of these days I will revive the idea in #7753 and make it generate three separate HTML files:
Maybe I'm missing something, but I believe this could be a reasonable solution that offers both the 3GL and 2GL code views. If you have any tips or requests regarding this, please let me know such that I can consider those when working on it. |
Still working on this! PR in a few days. 😄 |
Thanks @mcourteaux. I am looking forward to the PR. My web development skills are 20 years out of date (XHTML 1.0, Backbone.js, MVC-based architecture). But I can help review the UX stuff, verify the Makefile rules, and check for |
@abadams @steven-johnson I'm currently working on adding an assembly split pane for "devices" (as opposed to "host" code). So far I am only familiar with CUDA PTX as a Halide "device". I wonder whether, for example, the Hexagon DSP code is considered "host" code in the end, or whether it's also treated as a device in Halide. I see that the Now that I'm thinking about it, probably not all of those will use LLVM as their backend? |
I think we can compile hexagon in either mode (it's the host code, or it's device code embedded in a Buffer like PTX). While it would be very cool to be able to see the hexagon assembly, the challenge might be that hexagon is already compiled into binary when it's embedded. For opencl, metal, d3d12 etc., the embedded buffer is shader source code. It would be cool to have that in a pane. I think "device code" is a good name. |
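The distinction drawn above, between embedded buffers that hold readable source or assembly text and those that hold an already-compiled binary, could be checked with a heuristic like the one below before deciding whether to show a buffer in a "device code" pane. This is only a sketch under stated assumptions: `DeviceCodeKind`, `classify_device_code`, and `to_bytes` are hypothetical names, not part of Halide, and the rule (printable ASCII plus whitespace, trailing NUL bytes tolerated) is an assumption that would misclassify source containing non-ASCII characters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical classification of an embedded device-code buffer.
enum class DeviceCodeKind { Text, Binary };

// Heuristic (an assumption, not Halide behaviour): a buffer counts as text
// if, after ignoring trailing NUL bytes, it is non-empty and every byte is
// printable ASCII or common whitespace. Anything else is treated as binary.
DeviceCodeKind classify_device_code(const std::vector<std::uint8_t> &bytes) {
    std::size_t n = bytes.size();
    while (n > 0 && bytes[n - 1] == 0) {
        --n;  // tolerate NUL terminators or padding at the end
    }
    if (n == 0) {
        return DeviceCodeKind::Binary;  // nothing readable to display
    }
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint8_t b = bytes[i];
        const bool whitespace = (b == '\n' || b == '\r' || b == '\t');
        if (!whitespace && !std::isprint(b)) {
            return DeviceCodeKind::Binary;
        }
    }
    return DeviceCodeKind::Text;
}

// Small helper for building test inputs from string literals.
std::vector<std::uint8_t> to_bytes(const std::string &s) {
    return std::vector<std::uint8_t>(s.begin(), s.end());
}
```

With this rule, a PTX-like listing or shader source would be shown as text, while a buffer starting with an ELF-style header would be flagged as binary and would need disassembling, or a separately generated assembly file, before it could be displayed.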
Meaning that we'd need to disassemble it first, or ask the compiler to additionally generate an assembly file next to the binary? |
I was thinking we'd ask the compiler to additionally generate the assembly file if possible. |
Okay, I'm done with the Stmt HTML stuff. I'm wondering now... Do we want to keep the VizTree stuff? I haven't looked at that yet. I have added jump-to-device-code buttons as well. Panes are collapsible and resizable. Overall performance is great, and there is no jQuery or bootstrap used. Only dependency right now is the syntax highlighter for the assembly. |
Mostly resolved by #7843 being merged. |
For example, they show already-compiled PTX assembly for CUDA kernels instead of Stmt IR, because those kernels have already been offloaded. As more and more of codegen creeps into lowering, this problem will get worse. We need to identify a point in lowering at which the stmt should be preserved for stmt and stmt_html output. I propose just after the custom passes and before hexagon kernel offload.
See #7507