[FEATURE REQUEST] CUDA acceleration? #281
Is there any possibility of getting CUDA acceleration enabled for the computationally heavy operations?

I recall from reading some documentation that this is not as simple as just enabling CUDA and rebuilding; it would require explicit re-implementation in places to move data to/from GPU buffers. So this might be a significant long-term project.

Comments
Yes, it requires the code to be rewritten. I haven't done any tests, but my guess is that performance will be worse than on a decent CPU. In UVtools, most operations are performed once per image for each layer.

So we would waste time transferring the image back and forth just to perform one operation, which is a performance killer. UVtools has a few areas where it performs multiple operations on the same mat, but even there it would be a bottleneck, because most of those operations run and then need to do checks around the mat/pixels, and for that you need to transfer the image back and use the CPU after all. I can run a test later, but I have no hope of better performance.
Interesting. Along the same lines: is OpenCL acceleration built into the OpenCV lib? Beyond that, there seems to be a fair amount of controversy around OS-level support for these various accelerator solutions, so some of them might be total dead ends. Regarding CUDA, it seems like some operations could be swapped out directly (with supporting code), for example:

UVtools/UVtools.Core/Layer/LayerManager.cs Line 1157 in 08a5797

fwiw, back of the envelope for 16-bit mono 4K matrices, a 4GB GPU could nominally hold ~250 layers at once. An interesting first step might be to profile the current implementation to see whether certain OpenCV-wrapped functions take the bulk of the time.
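Spelling out that estimate: a 16-bit mono 4K layer is 3840 × 2160 × 2 B ≈ 16.6 MB, and 4 GB / 16.6 MB ≈ 240 layers; at 8-bit depth a layer halves to ≈ 8.3 MB, which is where the ~500-layer figure in the reply below comes from.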
It is, and it is already used by default in some functions; you don't need to enable it.

No, the default is to use OpenCL when supported; we don't need to enable it.
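For reference, Emgu exposes that OpenCL default through `CvInvoke`, so a quick way to confirm what the library is doing (the printout is just illustrative):

```csharp
// OpenCL ("transparent API") is used automatically for UMat-based calls when
// available; these Emgu switches just query or override that default.
using System;
using Emgu.CV;

Console.WriteLine($"OpenCL available: {CvInvoke.HaveOpenCL}");
Console.WriteLine($"OpenCL in use:    {CvInvoke.UseOpenCL}");
CvInvoke.UseOpenCL = false; // force the CPU path, e.g. for benchmarking
```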
UVtools processes 8-bit bitmaps, so it's more like 500 layers. I'm waiting on the CUDA package to update, because right now I can't use it: it's one version behind the main lib. But I have code ready to test, although my faith in it is really low. Maybe it will benefit 8K images, but a poor laptop with low-speed interfaces will kill all the performance. E.g. for morph:

```csharp
if (CoreSettings.CanUseCuda)
{
    using var gpuMat = target.ToGpuMat(); // CPU -> GPU transfer
    using var morph = new CudaMorphologyFilter((MorphOp)MorphOperation, target.Depth,
        target.NumberOfChannels, Kernel.Matrix, Kernel.Anchor, iterations);
    morph.Apply(gpuMat, gpuMat);
    gpuMat.Download(target);              // GPU -> CPU transfer
}
else
{
    CvInvoke.MorphologyEx(target, target, (MorphOp)MorphOperation, Kernel.Matrix,
        Kernel.Anchor, iterations, BorderType.Reflect101, default);
}
```
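For the promised test, a minimal timing harness along these lines should do. It reuses the calls from the snippet above; the `Run` entry point and however the layers and kernel are obtained are hypothetical stand-ins, and `CudaInvoke.HasCuda` is assumed to be Emgu's CUDA availability check:

```csharp
// Sketch of a CPU-vs-CUDA timing harness for the morph snippet above. It only
// measures the two code paths, including the per-layer transfers on the GPU side.
using System;
using System.Diagnostics;
using System.Drawing;
using Emgu.CV;
using Emgu.CV.Cuda;
using Emgu.CV.CvEnum;

static class MorphBenchmark
{
    public static void Run(Mat[] layers, Mat kernel, Point anchor, int iterations)
    {
        var sw = Stopwatch.StartNew();
        foreach (var layer in layers)
            CvInvoke.MorphologyEx(layer, layer, MorphOp.Open, kernel, anchor,
                iterations, BorderType.Reflect101, default);
        sw.Stop();
        Console.WriteLine($"CPU:  {sw.ElapsedMilliseconds} ms for {layers.Length} layers");

        if (!CudaInvoke.HasCuda) return; // no CUDA device, skip the GPU path

        sw.Restart();
        using var morph = new CudaMorphologyFilter(MorphOp.Open, layers[0].Depth,
            layers[0].NumberOfChannels, kernel, anchor, iterations);
        using var gpuMat = new GpuMat(); // reused across equally sized layers
        foreach (var layer in layers)
        {
            gpuMat.Upload(layer);     // CPU -> GPU transfer
            morph.Apply(gpuMat, gpuMat);
            gpuMat.Download(layer);   // GPU -> CPU transfer
        }
        sw.Stop();
        Console.WriteLine($"CUDA: {sw.ElapsedMilliseconds} ms for {layers.Length} layers");
    }
}
```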
Further note: I tested UMat (OpenCL) on resin traps, which uses many OpenCV calls, and performance is worse by a huge amount. The CPU sits idle while the GPU fires up, but performance is so bad that it isn't worth using. Again, the only benefit should be in video and AI workloads, where the stream is constant or many operations are chained on the same UMat with no CPU work in between. I haven't tested CUDA yet, but I don't see a future for it either. Will report later.
Interesting. As far as I know, the standard development cycle for CUDA is to implement, then use the NVIDIA profiler to check whether you are using all your bandwidth for copies to/from the GPU or whether one side is waiting on the other, then iterate. It looks like only a few of the OpenCV calls have accelerated CUDA implementations. One first step (I could open a separate ticket) would be to profile the current implementation to see which OpenCV calls take the bulk of the time.
If you use pinned GPU memory, then memory transfers will usually not be a bottleneck. You can also use a double buffer to upload one layer while the other is being processed. Even better is not to treat each mat individually, but to load N mats at a time and process them in batch. I'm surprised your custom CUDA code is 4x slower, especially on an 8s operation; even with simple operations, I've found the GPU to be faster than manually vectorized OpenMP CPU code. Are you making use of shared memory, and have you made sure you have high kernel occupancy? Image editors often use the GPU for simple one-off filters, so I don't think the claim that GPUs are only useful for video and AI holds up.
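A sketch of the double-buffer idea in Emgu terms. It assumes the stream-taking overloads of `Upload`, `Download` and `Apply` are exposed the way OpenCV's `cv::cuda` API exposes them, which would need verifying against the Emgu version in use; pinned host memory (`cv::cuda::HostMem`) would help further, but its Emgu exposure is also unverified:

```csharp
// Double-buffered CUDA pipeline: upload layer i+1 on one stream while layer i
// is filtered and downloaded on another, so transfers overlap compute.
using Emgu.CV;
using Emgu.CV.Cuda;

static void ProcessDoubleBuffered(Mat[] layers, CudaMorphologyFilter morph)
{
    using var streamA = new Stream();
    using var streamB = new Stream();
    using var bufA = new GpuMat();
    using var bufB = new GpuMat();

    // Prime the pipeline with the first layer.
    bufA.Upload(layers[0], streamA);

    for (int i = 0; i < layers.Length; i++)
    {
        var (cur, curStream)  = (i % 2 == 0) ? (bufA, streamA) : (bufB, streamB);
        var (next, nextStream) = (i % 2 == 0) ? (bufB, streamB) : (bufA, streamA);

        // Start uploading the next layer while the current one is processed.
        if (i + 1 < layers.Length) next.Upload(layers[i + 1], nextStream);

        morph.Apply(cur, cur, curStream);   // async filter on curStream
        cur.Download(layers[i], curStream); // async readback
        curStream.WaitForCompletion();      // block only for this layer's work
    }
}
```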
I don't have experience with GPU calls, but I followed their calls and practices. The workflow of an operation is:

1. Upload the Mat to a GpuMat (CPU → GPU transfer)
2. Run the CUDA operation
3. Download the GpuMat back to a Mat (GPU → CPU transfer)

So we have two transfers just to use one CUDA operation per layer. In my tests the CPU version just kills OpenCL (UMat) and CUDA (GpuMat); both get beaten easily. Maybe I'm doing something wrong, I don't know... If you have experience, fork UVtools and give it a try on a cheap operation like blur or resize, which are easy to convert (see the sketch below), and report back with results; if it shows a large benefit I may consider it.
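A sketch of that suggested test on resize; `CudaInvoke.Resize` is assumed to be Emgu's wrapper around `cv::cuda::resize`, so the exact overload should be checked:

```csharp
// Resize each layer on the CPU and on the GPU and compare wall-clock time.
// The GPU timing deliberately includes the per-layer upload/download, since
// that round trip is exactly what is under debate here.
using System;
using System.Diagnostics;
using System.Drawing;
using Emgu.CV;
using Emgu.CV.Cuda;
using Emgu.CV.CvEnum;

static void CompareResize(Mat[] layers)
{
    var half = new Size(layers[0].Width / 2, layers[0].Height / 2);

    var sw = Stopwatch.StartNew();
    using var cpuDst = new Mat();
    foreach (var layer in layers)
        CvInvoke.Resize(layer, cpuDst, half, 0, 0, Inter.Linear);
    sw.Stop();
    Console.WriteLine($"CPU resize:  {sw.ElapsedMilliseconds} ms");

    sw.Restart();
    using var src = new GpuMat();
    using var dst = new GpuMat();
    foreach (var layer in layers)
    {
        src.Upload(layer);                              // CPU -> GPU
        CudaInvoke.Resize(src, dst, half, 0, 0, Inter.Linear);
        dst.Download(cpuDst);                           // GPU -> CPU
    }
    sw.Stop();
    Console.WriteLine($"CUDA resize: {sw.ElapsedMilliseconds} ms (incl. transfers)");
}
```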
I also agree that GPU acceleration would be a great addition. However, I'd recommend a larger framework such as OpenGL or DirectX, because UVtools could then better utilize GPU-side memory, ideally only ever reading back small buffers to show statistics in the UI. For instance, rendering of the matrices could be done GPU-side, so there would be no need to read back from the GPU after an operation just to render to the UI. There would be further speed and architectural benefits beyond that.

I've been researching this for the last couple of days, as it's getting more and more important now that resin printers are getting higher-resolution screens. Mine is 8K; with a few thousand layers (an overnight print) it can take nearly 30 minutes to run a few passes finding and fixing issues, most of that time spent waiting for processing.
It's normal for performance to drop at 8K resolutions; there are many more pixels to deal with. Even then, the performance is better than I could expect. You just need a proper desktop; laptops are a joke for serious computation. Also make sure you process only one copy of the model and pattern it afterwards in UVtools: the fewer pixels on the plate, the better for UVtools. To gain additional performance, select LZ4 in the settings. DirectX is out, as it's Windows-only.

Also note that you can't discard the CPU even for the library (never mind the UI): we need to store a cache somewhere, and no GPU will hold a high count of 8K images unless you compress them there; today's PCs have more RAM than GPUs have memory. Then you also have many operations that access pixels to perform 'if' logic, so the CPU will always be required and will always be a bottleneck, unless someone manages to do that directly on the GPU. If you want to start on this, there is a library that can help: Amplifier.NET
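To make the 'if' logic point concrete, a hedged sketch of the forced round trip: even if the thresholding step runs on the GPU (assuming `CudaInvoke.Threshold` wraps `cv::cuda::threshold`), contour analysis only exists on the CPU, so the image has to come back before any decision can be made:

```csharp
// Even with a GPU-side threshold, FindContours has no CUDA implementation,
// so the binary image must be downloaded before the shape logic can run.
using Emgu.CV;
using Emgu.CV.Cuda;
using Emgu.CV.CvEnum;
using Emgu.CV.Util;

static int CountBlobs(Mat layer)
{
    using var gpu = new GpuMat();
    using var gpuBin = new GpuMat();
    gpu.Upload(layer);                                       // CPU -> GPU
    CudaInvoke.Threshold(gpu, gpuBin, 127, 255, ThresholdType.Binary); // GPU step

    using var binary = new Mat();
    gpuBin.Download(binary);                                 // forced GPU -> CPU

    using var contours = new VectorOfVectorOfPoint();
    CvInvoke.FindContours(binary, contours, null, RetrType.External,
        ChainApproxMethod.ChainApproxSimple);                // CPU-only analysis
    return contours.Size;
}
```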
@sn4k3
@jremen afaik, the Neural Engine is not a general accelerator; only workloads that fit into the Core ML framework can target it. I believe the Apple Accelerate library can do general compute on the Apple Silicon GPU, though. I don't know of a general accelerator framework that could be adopted to target both the Apple GPU and the NVIDIA GPU, and OpenCL is deprecated on Apple silicon now, afaik.
Optimizations should come from the OpenCV library; no framework would benefit UVtools, since a substantial portion of the code and all the algorithms come from an external source: OpenCV. Any kind of accelerator would only help if the algorithms were written and implemented in-house, and that is not the case; we depend entirely on OpenCV. The only accelerator we could use is CUDA, because it is inside OpenCV, but my tests still show a huge performance loss compared to the CPU. This is because each object is stored in RAM and must be decoded, converted and sent to the GPU and then back to the CPU and RAM. The CPU code alone is so well optimized that CUDA is defeated by this kind of usage. CUDA would win if everything could stay and be processed within the GPU, without much need for the CPU; unfortunately that's not the case. CUDA also lacks many functions, which then only run on the CPU.

As resolutions grow, it is normal that processing time increases: not long ago we were at 1080p, and now 12K, which is a huge jump in every sense (2,073,600 pixels vs 79,626,240 pixels, roughly 38x more). In every computational problem, apart from optimizations, if you want speed you need to boost your hardware. For UVtools, give it the best CPU and RAM you can, as it will utilize them fully.

Most people don't believe how good OpenCV is at image processing. UVtools is also well optimized; just consider that you have 12K × n layers loaded into memory and each one is decoded/encoded at access/save time. I hope OpenCV 5 brings a substantial boost.