Halide Development Roadmap #5055
Assigning a bunch of people who seem like they may want to contribute to top-level planning.
GPU support: I think there's a bigger, higher-level architectural issue: the memory management and runtime models for GPUs/accelerators feel broken and insufficient. We should consider rethinking them significantly to allow clearer and more explicit, predictable control (as we have on CPUs with a single unified memory space).
Modules / Libraries / reusable code / abstraction
Build system: Should we explicitly break apart build system issues for Halide development and build system issues for Halide users? I think these are mostly quite distinct and probably should be separate top-level headings.
Better accessibility and support for research within/on the Halide code base |
Build Halide without LLVM: useful for GPU JIT and IR manipulation. |
Have you been following the store_in stuff? You just explicitly place Funcs in the memory type you want now.
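For readers who haven't followed it, a minimal sketch (the Funcs here are hypothetical; MemoryType is the real enum, with values like Heap, Stack, Register, and GPUShared):

```cpp
#include "Halide.h"

Halide::Func f("f"), g("g");
Halide::Var x("x"), y("y");
f(x, y) = x + y;
g(x, y) = f(x, y) * 2;

// Compute f per scanline of g and pin its intermediate storage to the stack.
f.compute_at(g, y).store_in(Halide::MemoryType::Stack);
```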
I agree with @jrk that we should distinguish build issues that affect Halide developers versus users. Generator aliases fix the GeneratorParams thing a bit, but they aren't very discoverable and aren't covered in the tutorials AFAIK. See #4054 and #3677
I'm not sure why this is painful in Visual Studio? Just because of how many times GenGen.cpp gets built? We could fix that by optimizing GenGen for build time. Windows users who wish to build Halide from source should use CMake. If they want to use binary releases without our CMake rules, then they're on their own. We shouldn't pay off their technical debt for nothing in return.
Properly caching Halide outputs is complicated. Outputs are a function of the Halide version (we don't currently version Halide), the autoscheduler version (if used; we also don't currently version our autoschedulers), the algorithm and schedule (can these be consistently hashed?), and the generator parameters. It's not clear to me how often this is a benefit in incremental build scenarios. If your source files changed, then typically so has your pipeline. In CI scenarios, Halide versioning becomes more important since users would otherwise run into cache invalidation issues every time they update. Between builds, there could be some wins here, but they could also implement their own caching system by hashing the source files, Halide git commit hash, and generator parameters.
Maybe one of the companies that wants Halide to work around their build system should pay for hardware and hire a full-time DevOps specialist. They could get all our buildbots configured with Ansible and set up all the Docker/virtual machine images we'd need. Failing that, they could foot the bill for a cloud-based CI service that has GPU and Android support.
Versioning and Releasing Halide
We should start versioning Halide and getting on a steady (quarterly?) release schedule. We could start at v0.1.0 so we don't imply any API stability per semantic versioning -- only v1 and above implies API stability within a major version. This would allow us to publish Halide on vcpkg / pip / APT PPAs / etc.
I was partly thinking of the heavily dynamic, lazy runtime aspects. When does memory get allocated and freed? When do copies happen? Most things in Halide are pretty static and eager, and explicitly controlled via schedules; GPU runtime behavior inherently includes a bunch of dynamic and lazy behavior, which is not clearly controlled by the schedule. Imagine now having multiple GPUs or different accelerators in a machine. I should be able to use schedules to decompose computation across multiple GPUs, reason about and control explicit data movement between them, etc.
It's painful in Visual Studio because the actual GUI stops working right once you have more than a certain number of binary targets. Shoaib says it just stops showing them, so you have no way to access them. "People should just use X build system" is a non-solution. Halide is being used in large products that already have build systems, and it must exist within them. Punting on solving this problem entirely means that the current experience of using Halide at a company other than Google is 80% build system nightmare and 20% writing code. If we want Halide to be a useful tool it must be able to integrate into existing build systems cleanly.
The particular problem I've seen is that people work around the number-of-binaries issue by packing all of their generators into a single binary, but then a naive dependency analysis thinks that editing any source file requires rerunning every generator (and there may be hundreds). We might be able to help. Telling people to just fix their damn build system is obviously an attractive attitude, but that's also asking them to pay down a large amount of technical debt before they can start using Halide. The outcome is that they don't use Halide.
Probably better to do the hashing correctly in one place upstream than have lots of incorrectly-implemented hashing schemes downstream. We're the ones who know how to hash an algorithm/schedule/Halide version correctly. But the caching idea was just an example of how we can make life easier for people by making it possible to do some of the things that should really be happening in the build system in C++ instead/as well, so that people can use Halide without taking on the possibly-intractable task of fixing their build system first.
Generally agree, but wanted to add that you can explicitly schedule the copies using Func::copy_to_device and friends if you don't want them done lazily. If you use that, often no dirty bits come into play. The input to the Func lives only on the CPU, and the output lives only on the GPU.
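A sketch of what that looks like, assuming producer is some existing Func (copy_to_device is the real API; it requires the stage it's applied to be a trivial copy):

```cpp
// staged is a pure copy of producer; scheduling it with copy_to_device()
// turns the stage into an explicit host-to-device buffer copy, so no lazy
// dirty-bit tracking is involved for this transfer.
Halide::Func staged("staged");
staged(x, y) = producer(x, y);
staged.copy_to_device(Halide::DeviceAPI::Default_GPU);
staged.compute_root();
```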
I edited the top post to add:
I don't understand why this is our problem as opposed to Visual Studio's. Isn't this something their customers would complain about? I've seen this happen in Visual Studio myself, but Googling for the issue doesn't turn up much. I'll bet if Adobe, no doubt paying hundreds of thousands of dollars for Visual Studio licenses, complained loudly enough, it could get fixed.
From offline discussion, it sounds like incremental building with a unified generator binary is our worst end-user story. Still, shouldering the maintenance burden for every hand-rolled, proprietary build system is also a non-solution. If we implement caching, even opt-in, we'll have to test it, keep its behavior stable, and deal with the whole can of worms that opens up. Even our first-party CMake doesn't get it perfect because it can't assume one-generator-per-file. It can't generically establish a mapping between source files and generator invocations. We should look into Ninja depfiles for more precise dependencies in CMake.
It may seem dumb, but it is a reality that many real users face today, and the alternatives are:
Even if Microsoft should fix it (and I think it is not likely that they could or would on any reasonable time scale), the only thing in our power to do is help support working around it. If we don't, we're effectively just shutting out some of our highest-impact potential users.
My grab-bag of thoughts:
👍 Fully agree here. Having a version is basically a prerequisite for inclusion into package managers, too. Having a stable API also means that shared distributions of Halide can be upgraded independently of applications, which is important if we hope FOSS will adopt us.
👍 AppVeyor seems to have a reasonable set-up that allows for a mix of self-hosted (for special hardware) and cloud-hosted instances. Also, we should try to convince one or more of the multi-billion-dollar companies that employ our developers and benefit from our work to donate computing resources for this purpose.
I agree. Weak linkage and dynamic lookup into executable exports are super cool... if you're writing Linux software. Unfortunately, since they aren't standard C/C++, they're inherently non-portable and aren't modeled by CMake, so they require hacks for the supported platforms and don't work on Windows. Plus, dynamic lookup breaks a fundamental assumption about static linkage, namely that other modules won't be affected by changes to statically linked libraries. This doesn't just affect the runtime, but the plugins/autoschedulers, too. We're already planning to refactor the autoschedulers out of apps. While we're at it, we should make the interface accept a pointer to a structure in the parent process that it can populate, rather than trying to find the structure via dynamic lookup.
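To make that concrete, a purely hypothetical sketch of the proposed shape (none of these names exist in Halide today):

```cpp
// The parent process hands the plugin a struct to populate, so the plugin
// never has to resolve symbols in the host executable dynamically.
struct AutoschedulerInterface {
    const char *name;
    void (*generate_schedule)(void *pipeline, void *target, void *results);
};

// The single entry point a plugin would export:
extern "C" void halide_autoscheduler_init(AutoschedulerInterface *iface);
```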
See also #4651 -- as we discuss versioning, we should also discuss symbol export, since they're inter-related. At the very least, we should investigate whether
I think both @BachiLi and I have put some thought into writing Halide tutorials. I think it would be a good idea to merge our efforts 🙂
Fortunately, this is now pretty easy to do if you're using our CMake build 😉
An export API that would generate C++ code representing a Halide pipeline and schedule would be cool... but I'm not sure it would be more useful than our existing
I'm torn on the idea of having an external Halide syntax. There are some clear benefits... it would become easier to write tests, to write analysis tools, to metaprogram (maybe), provide more helpful compiler diagnostics, integrate with the Compiler Explorer, etc. But on the other hand, maybe it would just be a high-maintenance dunsel.
Porting JIT code to AOT is much bigger than just build system issues. All of a sudden it's staged compilation. E.g. things that were constants like memory layout and image size are now unknown.
That would be a nice addition. I've been experimenting with On a side note, speaking of
You can just instantiate a generator and call it via JIT. Generator instances have a "realize" method you can call directly, or a get_pipeline() method that gives you a Pipeline object just like when you're jitting code.
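A sketch of that flow (MyGen, input, width, and height are hypothetical; create, set_inputs, realize, and get_pipeline are the real generator entry points):

```cpp
// JIT-call a generator directly, without going through GenGen/AOT.
Halide::GeneratorContext ctx(Halide::get_jit_target_from_environment());
auto gen = MyGen::create(ctx);
gen->set_inputs(input);                    // bind the generator's inputs
Halide::Buffer<float> out = gen->realize({width, height});

// Or grab the Pipeline and treat it like hand-written JIT code:
Halide::Pipeline p = gen->get_pipeline();
```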
In general, I'd love to see a cleaner workflow for handling (inspecting, modifying, etc.) the "fat binaries" we produce. Right now, we have a lot of targets that generate some code for a host target, and an "offload" target. This includes GPUs, OpenCL, Hexagon, etc. Currently, these are designed to produce single object files with the offloaded code embedded in them somehow, which is great for convenience and dependency management. However, inspecting these embedded objects, or even modifying them somehow (e.g. signing Hexagon code), is hard and requires inspecting object files, or hooks/callbacks (mostly implemented with environment variables like HL_DEBUG_CODEGEN or HL_HEXAGON_CODE_SIGNER). We took some small steps here, like
I would say that any approach that does not involve inspecting an object file is inherently better than any approach that does. It's simply not portable and we try to support a variety of compilers and platforms.
For non-Hexagon backends, .stmt files should capture the IR, and assembly output should capture the generated machine code in human-readable form. Hexagon is compiled earlier in lowering, so it's tricky. We should find some way to carry along the higher-level representations of it. For other shader backends it's an escaped string constant, so it's there, but it looks like:
Maybe we should add a shader_assembly generator output?
Meanwhile I believe standard practice is HL_DEBUG_CODEGEN=1.
I would argue that even if we were to output a separate .stmt file for the offloaded Hexagon part of the pipeline, it would be a significant benefit.
Autoscheduler
Debugging
We should start talking about converting this into actionable items and divvying up the work.
I think the commit messages are too short and lack detail.
Halide issues are not maintained very well.
Maybe we can start by closing all the issues that were opened more than, say, 6-12 months ago and never got a comment. That would take care of 169 issues; we have many issues that are open and quite old. Similar to this is the number of branches that are still on this repo that have been merged or are stale. See #4567. Both of these issues make life harder for new collaborators ("Which issues/branches are important? Where do I get started?") and deter would-be new users ("This project's maintainers don't care / are overwhelmed. The project is buggy and/or unstable").
The community for OpenCV is huge. Many of them find that OpenCV is not fast enough in practice and are eager to know how well Halide can perform. However, most of them get stuck on the issue of data structure conversion (Mat <-> halide_buffer_t) and eventually give up. Halide should at least give an official guide for this issue. I suggest the Halide team write some simple examples for OpenCV users. We have a bunch of good practical applications in apps; porting shouldn't be an issue.
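For reference, the zero-copy direction is only a few lines with Halide::Runtime::Buffer; a sketch covering the common continuous, interleaved 8-bit case (other Mat types and strides need more care):

```cpp
#include <cassert>
#include <opencv2/core.hpp>
#include "HalideBuffer.h"

// Wrap a continuous 8-bit, 3-channel cv::Mat without copying. OpenCV stores
// channels interleaved, so use the interleaved layout helper.
Halide::Runtime::Buffer<uint8_t> wrap_mat(cv::Mat &mat) {
    assert(mat.type() == CV_8UC3 && mat.isContinuous());
    return Halide::Runtime::Buffer<uint8_t>::make_interleaved(
        mat.data, mat.cols, mat.rows, mat.channels());
}
```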
This is a good discussion. I want to provide an alternative point of view for some things raised above: Open source projects have to strike a careful balance between rigor/discipline and ease-of-contribution. Things like aggressively closing issues, deleting branches, mandating clean git histories, requiring very verbose commit messages, and having very strict CI checks have real benefits but also real costs: they can make working on Halide less fun and deter contributors. Very few of our core contributors get paid to specifically maintain Halide (just Steven, I think), so you can't tell people to eat their vegetables without risking them just working on Halide less. This even applies to me: one reason I left Google in 2017 was to get a break from the day-to-day maintenance and do more research. PRs from outsiders get abandoned because we pile on requirements that they have no incentive to address.

So far we've avoided being strict about things that don't slow down development or hurt our users in obvious ways. So I don't care much about deleting branches and closing old issues. Defunct branches aren't particularly visible. Issues can be searched rather than sorted, and we have plenty of issues tagged as good for first-time contributors. I also don't care much about semantic versioning, because it's not particularly useful to the main users that I'm aware of (compared to calendar or git-commit based versioning). I have no objection to these things, and would be appreciative of anyone who wants to take it on. I certainly appreciate that semantic versioning would help with package managers.

Historically we've been much stricter about CI cleanliness than these other issues, because not being strict about keeping tests working causes cascading issues that can stall development of unrelated work (i.e. it makes development less fun). clang-format/clang-tidy was added to the mix because the fixes are automatic and it removes an entire class of nits from code review, so while I was wary, I think it actually makes life easier for regular contributors and reviewers on balance.

Some of these attitudes come at the cost of growth. E.g. clang-tidy/format as part of CI instead of done periodically after-the-fact makes life easier for regular contributors and maintainers but is another roadblock to PRs from outsiders. But growth is not an end we should target for its own sake. I want Halide to be a great tool for high-performance image processing, and I want it to be a great platform for PL research. Those objectives don't necessarily require growth, and can be in conflict with it.

Regarding OpenCV specifically, an app or tutorial that demos integrating Halide with OpenCV would be very welcome.
We might consider completely gutting and restarting (or at least auditing for outdated and misleading content, and properly emphasizing what we most want to emphasize today) the top-level README.
@abadams I am attempting to address the following three points at the top-level:
I agree with you about the commit messages / git histories, but closing issues that are more than a year old and never got a comment is not "aggressive". We could attach a bot that would do it automatically, but if we cared to do it manually, I doubt it would take more than an hour. I'd guess we could decide whether to close such an issue in less than a minute, and there are only ~150.
That assumes that the search results are relevant. If we merge a PR that fixes an issue, then that issue should be closed. There's no way to filter out issues that are open but shouldn't be, and there are a lot of them. This is a drag on new contributors and on our own maintenance. It's also an issue for our users who run into a bug in an old version, see the open issue, and conclude that the bug is still relevant.
Would closing an issue you've fixed really make working on Halide so much less fun and such a deterrent that you or another reasonable person would consider leaving the project? I would be surprised if that were true. It's not a burden to put "Fixes #issue" in the top-level PR comment. We're not talking about going on a full-on raw diet here, just eating something green every once in a while.
This is confusing. Asking that an outsider's PR passes the tests and meets basic static analysis / formatting checks is (a) reasonable and widely expected, and (b) less costly than whatever incentivized them to put in the work and open the PR in the first place. PRs get abandoned because our testing infrastructure is unreasonably slow.
Using semantic versioning, which is well understood by tooling, makes it easier for other researchers to compare to Halide in a reproducible manner. Yes, they can reference the git commit hash, but nobody does this and those values aren't directly comparable. Semantic version numbers can be registered in a package repo and then particular versions can be quickly installed for comparison. It also adds zero additional maintenance burden over our calendar-based versioning (which I believe is great for applications, but unsuitable for libraries) because we can just version as
Clang-tidy/format is not a roadblock to PRs for outsiders. Just the opposite. It paves the road for such PRs and makes the process faster by streamlining code review. Less burden on both the reviewers and the PR author when we don't have to argue about where the space goes around
No one is saying that we need to aggressively scale. I don't see how my proposed actions:
are promoting growth "for its own sake" or are in conflict with making Halide a great tool and research platform.
Added more details regarding fast prototyping. Let me know if it is unclear. These are from our experience developing new (differentiable) image processing pipelines in Halide.
@alexreinking I don't have much new to add, but there's one thing that's important to respond to for the benefit of others reading this: The correct way to cite a version of Halide is with a git commit hash. That's the current standard practice I see in papers that reference Halide in a way where the version matters (e.g. things that benchmark against our manual schedules). Everyone reading this, please keep doing that. The growth comments weren't a response to you. They were more a response to benzwt proposing actively trying to grow by attracting OpenCV users. While it's great to be as useful as possible to the largest number of users possible, I wanted to point out that we shouldn't get distracted by growth for its own sake as I've seen some projects do (which leads to things like over-promising).
Is this not a consequence of our slow release schedule and versioning scheme? I cannot think of another case where I have seen a git commit hash in a citation as opposed to a released version number (e.g. when comparing compilers, libraries, etc.). Ideally a version number would also tag a commit such that they are equivalent (and equally unambiguous). This is just as good, since git tags should change about as often as you force-push.
Version numbers can't refer to branches, commits not associated with any release (e.g. just after a paper author reaches out to us about a bug), etc. I see both commit hashes and version numbers in papers for different projects. I prefer commit hashes.
We should revisit and complete https://github.com/halide/halide-app-template with standalone Makefiles/CMakefiles/VS solutions/XCode projects inside
Another thing to make plans for: when do we want to upgrade the minimum C++ requirement? I would hope that we could consider moving to require C++17 as a baseline in the short-to-medium-term future.
An idea from an offline conversation: it would be nice to have a mode where bounds information is forward-propagated from the inputs (similar to NumPy/Tensor Comprehensions). More suitable for building deep learning architectures/linear algebra/etc.
Part of the motivation for @BachiLi's comment above is the crazy list of estimates needed for autoscheduling resnet50: https://github.com/halide/Halide/blob/standalone_autoscheduler_gpu/apps/resnet_50_blockwise/Resnet50BlockGenerator.cpp#L286 These are all inferable using simple forward propagation from the inputs, in the style of NumPy or any ML framework.
This is a somewhat minor issue, but I think we should consider making it a policy to prefer squash-and-merge (when possible) when merging Pull Requests; the individual commit history within a PR is useful for people reviewing the PR, but it really clogs up the history of the main branch with small commits. I can't think of a good reason for all the interstitial commits to be preserved for eternity.
Relatedly, we should codify this in the standard
I'd appreciate the group's take on Modular/Mojo regarding the Halide roadmap. After listening to Lex Fridman's long interview with Chris Lattner on the subject, it seems that the goals of Mojo include those of Halide, especially the autoscheduling part. Thanks for any input. It seems that the community that has developed Halide all these years is a natural group to converge with efforts in the ML community to get at the same functionality from either a C++ foundation or a Python foundation.
Would it be possible to show source locations for errors produced with generator invocations? Consider, for example, the following error:
Would it be possible to highlight the source location in the error output? I think highlighting source locations would facilitate working with Halide, especially for addressing bugs during code generation. I have looked for a related ticket but didn't find one.
This issue serves to collect high-level areas where we want to improve or extend Halide. Reading it will let you know what is on the minds of the core Halide developers. If there's something you think we're not considering that we should be, leave a comment. This document is a continual work in progress.
This document aims to address the following high-level questions:
To the greatest extent possible we should attach actionable items to roadmap issues.
Documentation and education
The new user experience could use an audit (e.g. the README).
There are a large number of topics that are missing tutorials.
Some examples:
There is not enough educational material on the Halide expert-user development flow, looping between tweaking a schedule, benchmarking/profiling it, and examining the .stmt and assembly output.
One thing we have is this: https://www.youtube.com/watch?v=UeyWo42_PS8
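The mechanics of that loop fit in two lines (a sketch; f stands in for whatever Func is being tuned):

```cpp
// After each schedule tweak, re-emit the lowered statement and the
// generated assembly, then inspect and benchmark.
f.compile_to_lowered_stmt("f.stmt", f.infer_arguments(), Halide::Text);
f.compile_to_assembly("f.s", f.infer_arguments());
```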
Documentation for developers
There should be a guide for how an external contributor should make their first pull request on Halide and what to expect. This is commonly in a CONTRIBUTING.md top-level document. There are also pull request templates we can create.
There should be a more detailed document or talk describing the entire compilation pipeline from the front-end IR to backend code to help new developers understand the entire project.
Support for extending or repurposing parts of Halide for other projects
Some things that could help:
Build system issues
We shouldn't assume companies have functioning build tools
Some companies build projects using a mix of duct tape and glue in a platform-varying way. Any configuration that goes into the build system is very painful for them (e.g. GeneratorParams for generator variants). Large numbers of binaries (e.g. one generator binary per generator) can also be painful (e.g. in Visual Studio). We should consider making GenGen.cpp friendlier to the build system (e.g. by implementing caching or depfiles) to help out these users.
Our buildbots aren't keeping up and require too much manual maintenance
Our buildbots are overloaded and have increasingly out-of-date hardware in them. Some can only be administered by employees at specific companies. We need to figure out how to increase capacity without requiring excessive manual management of them.
Runtime issues
The runtime includes a lot of global state, which is great for sharing things between all the Halide pipelines in a process, but if there are multiple types of user of Halide in the same large process things can get complicated quickly (e.g. if they want different custom allocators). One option would be removing all global state and passing the whole runtime in as a struct of function pointers.
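A hypothetical sketch of the struct-of-function-pointers option (the member signatures mirror real override points in HalideRuntime.h, but the struct itself does not exist today):

```cpp
#include "HalideRuntime.h"

// All runtime services a pipeline needs, bundled into one explicit value
// instead of process-global function pointers.
struct halide_runtime_vtable {
    void *(*malloc_fn)(void *user_context, size_t bytes);
    void (*free_fn)(void *user_context, void *ptr);
    void (*print_fn)(void *user_context, const char *msg);
    void (*error_fn)(void *user_context, const char *msg);
    int (*do_par_for_fn)(void *user_context, halide_task_t task,
                         int min, int size, uint8_t *closure);
};

// Each pipeline would then take its runtime explicitly:
// int my_pipeline(const halide_runtime_vtable *rt,
//                 halide_buffer_t *input, halide_buffer_t *output);
```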
While most of the important parts of the runtime can be overridden by setting function pointers, some parts of the runtime can only be overridden using weak linkage or other linker tricks, and this is problematic on some platforms in some build configurations.
There needs to be more top-level documentation for the runtime, describing how one may want to customize it in various situations. Currently there's just a few paragraphs at the top of HalideRuntime.h, and then documentation on the individual functions.
Runtime error handling is a contentious topic. The default behavior (abort on any error) is the wrong thing for production environments. There isn't much guidance or consistency on how to handle errors in production environments.
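For reference, the usual production pattern today is to install a non-aborting handler via halide_set_error_handler (the real runtime hook) and check pipeline return codes instead:

```cpp
#include <cstdio>
#include "HalideRuntime.h"

// Record the error instead of aborting; the failing pipeline call then
// returns a nonzero error code that the caller must check.
void my_error_handler(void *user_context, const char *msg) {
    fprintf(stderr, "Halide error: %s\n", msg);
}

void install_handler() {
    halide_set_error_handler(my_error_handler);
}
```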
Lifecycle
Versioning
Since October 2020, Halide uses semantic versioning. The latest release is (or will soon be) v15.0.0. We should adopt some practice for keeping a changelog between versions for Halide users. Our current approach of labeling "important" PRs with release_notes has not scaled.
Packaging
Much work has been put into making Halide's CMake build amenable to third-party package maintainers. There is still more to do for cross-compiling our arm builds on x86.
We maintain a list of packaging partners here: #4660
Code reuse, modularity
How do we reuse existing Halide code without recompiling it, especially in a fast prototyping JIT environment? An extension of the extern function calls or the generators should be able to achieve this.
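One existing hook in this direction is Func::define_extern, which splices separately compiled code into a pipeline. A sketch (blur_2x2 is a hypothetical precompiled function following the extern-stage protocol, and input an existing Func or Buffer):

```cpp
// Treat a precompiled function as an opaque 2D float32 producer; Halide
// routes bounds queries and buffer allocation through the extern protocol.
Halide::Func wrapper("wrapper");
std::vector<Halide::ExternFuncArgument> ext_args = {input};
wrapper.define_extern("blur_2x2", ext_args, Halide::Float(32), 2);
```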
Building a Halide standard library
There should be a set of Halide functions people can just call or include in their programs (e.g., image resampling, FFT, Winograd convolution). The longstanding issue to solve is that it's hard to compose the scheduling.
Fast prototyping
How can we make fast prototyping of algorithms in Halide easier? JIT is great for getting started, but not all platforms support it (e.g. iOS), and the step from JIT to AOT is large, in terms of what the code looks like syntactically, what the API is, and what the mental model is.
Consider typical deep learning/numerical computation workflows (PyTorch, NumPy, Matlab, etc.). A user would fire up an interpreter, manipulate and visualize their data, experiment with different computation models, print out intermediate values of their program for understanding the data and debugging, and rerun the program multiple times for different inputs and iterate.
Unfortunately, the current Halide workflow does not fit this very well, even with the Python frontend.
Inspecting intermediate values requires either adding print() (and recompiling the program; see the sketch below) or adding the intermediate Halide function to the output (and recompiling the program).
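The print() route looks like this (real API, as in the debugging tutorial):

```cpp
// Wrapping an Expr in print() logs every evaluation at runtime, but adding
// or removing it means rebuilding the pipeline each time.
Halide::Func f("f");
Halide::Var x("x");
f(x) = Halide::print(x * x, "<- f at x =", x);
Halide::Buffer<int> result = f.realize({4});
```

Two immediate work items.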
GPU features
We should be able to place Funcs in texture memory and use texture sampling units to access them.
This is particularly relevant on mobile GPUs where you can't otherwise get things to use the texture cache. It's also necessary to interop with other frameworks that use texture memory (e.g. coreML).
An API to perform filtered texture sampling is needed. Ideally this will work, if not necessarily be blazingly fast, in a cross platform way. Validating on CPUs is very useful. There are some issues in the design having to do with the scope and cost of required sampler object allocations in many GPU APIs.
Currently this has been low priority because we don't have examples where texture sampling matters a lot. Even for cases where it obviously should (e.g. bilateral guided upsample), it doesn't seem to matter much.
A good first step is supporting texture sampling on CUDA, because it doesn't require changing the way in which the original buffer is written to or allocated. An independent first step would be supporting texture memory on some GPU API without supporting filtered texture sampling. These two things can be done orthogonally.
Past issues on this topic: #1021, #1866
We should support tensor instructions.
We have support for dot product instructions on ARM and PTX via within-vector reductions. The next task is nested vectorization #4873. After that we'll need to do some backend work to recognize the right set of multi-dimensional vector reductions that map to tensor cores. A relevant paper on the topic is: https://dl.acm.org/doi/10.1145/3378678.3391880
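For reference, the existing within-vector reduction pattern looks roughly like this (a sketch; a and b are hypothetical inputs, and the exact schedule needed to hit a particular dot-product instruction may differ):

```cpp
// Vectorizing the reduction variable of an associative update produces a
// VectorReduce node, which backends pattern-match to horizontal/dot-product
// instructions (e.g. ARM sdot/udot, PTX dp4a).
Halide::Var x("x");
Halide::RDom r(0, 4);
Halide::Func dot("dot");
dot(x) = 0;
dot(x) += Halide::cast<int>(a(4 * x + r)) * Halide::cast<int>(b(4 * x + r));
dot.update().vectorize(r);
```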
New CPU features
ARM SVE support
Machine learning use-cases
We should be able to compile generators to tensorflow and coreML custom ops.
We can currently do this for pytorch (see apps/HelloPytorch), but it's not particularly discoverable.
We should have fully-scheduled examples of a few neural networks. We have resnet50, but it's still unscheduled.
Targeting MLIR is worth consideration as well.
This is likely a poor match, because most MLIR flavors operate at a higher level of abstraction than Halide (operations on tensors rather than loops around scalar computation).
Autoschedulers
There's lots of work to do before autoschedulers are truly useful. A list of tasks:
We need to figure out how to provide stable autoschedulers that work with Halide master to serve as baselines for academic work while at the same time being able to improve autoschedulers over time.
There needs to be a tutorial on using standalone autoschedulers, including autotuning modes for those that can autotune.
We need to figure out how to include them in distributions
There should be a hello-world autoscheduler that serves as a guide for writing a custom one.
There should be a click button solution for all sorts of autoscheduling scenarios (pretraining, autotuning, heuristics-based, etc).
For several autoschedulers, the generated schedules may or may not work for image sizes smaller than the estimates provided. This is lousy, because autoschedulers should be usable by people who don't understand the scheduling language and don't know to fix tail strategies.
loop-unroll failure in Autoscheduler #4271
Things we can deprecate