get_data on GPU using cupy #241

Open · wants to merge 1 commit into base: master
Conversation

@Rad-hi commented Sep 24, 2024

Previously, I tried to extend the Python API with the ability to keep the data on the GPU (#230), and I ran into some weird behaviors (they seemed weird back then, but it's now obvious it was just a lack of understanding of how the data is laid out in memory).

This PR, however, provides a fully functional extension.

NOTE: this change adds an extra dependency: cupy.

The targeted function is get_data(), and both modes of providing data (memory view / deep copy) are implemented for the GPU as well.
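A hypothetical usage sketch of the extended API is below, assuming that get_data() with a GPU memory type returns a cupy.ndarray (either a zero-copy view over the sl.Mat device buffer or a deep copy of it); the exact argument names and the helper function here are illustrative, not taken from the PR's diff.

```python
# Sketch only: pyzed and cupy are available on a ZED + CUDA setup,
# so the imports are guarded for machines without that hardware.
try:
    import pyzed.sl as sl
    import cupy as cp
except ImportError:  # no ZED SDK / CUDA on this machine
    sl = cp = None


def grab_frame_on_gpu(cam, image_mat, deep_copy=False):
    """Grab one frame and keep the pixel data on the GPU as a CuPy array.

    deep_copy=False would return a view over the sl.Mat's device memory;
    deep_copy=True would allocate and copy a fresh device buffer.
    """
    if cam.grab() != sl.ERROR_CODE.SUCCESS:
        return None
    cam.retrieve_image(image_mat, sl.VIEW.LEFT, sl.MEM.GPU)
    return image_mat.get_data(memory_type=sl.MEM.GPU, deep_copy=deep_copy)
```

Because the returned array never leaves the device, the rest of the pipeline (resize, normalization, inference) can stay on the GPU with no host round-trip, which is where the grab speedup below comes from.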

This was tested on an NVIDIA AGX Orin 32 GB, with JetPack 5.1.2 and ZED_SDK_4.1.4.

Shoutout to @andreacelani for the discussion that led to figuring out how to implement this correctly (see the closed PR #230 for details).

Benchmarking with an ML pipeline:

@andreacelani did some benchmarking with impressive results: #230 (comment)

Additionally, I tested it myself using a real feed from a ZED Mini with a simple pipeline (see picture), and here are my findings:

[image: test pipeline diagram]

TL;DR:

  • grabbing is 60% faster
  • preprocessing on the GPU would be faster (when implemented correctly)

Details:

"""
HD2K @15FPS:

GPU:
[GPU_GRAB]             Mean: 8.531 ms, Std: 2.563 ms, Max: 15.460 ms, Min: 5.205 ms, N Samples: 200.
[GPU_PREP_RESIZE]      Mean: 5.205 ms, Std: 1.553 ms, Max: 7.745 ms, Min: 2.473 ms, N Samples: 200.
[GPU_PREP]             Mean: 6.004 ms, Std: 1.554 ms, Max: 8.721 ms, Min: 3.259 ms, N Samples: 200.
[GPU_ROT]              Mean: 0.916 ms, Std: 0.061 ms, Max: 1.162 ms, Min: 0.827 ms, N Samples: 200.
[GPU_INF]              Mean: 24.066 ms, Std: 0.701 ms, Max: 28.860 ms, Min: 23.353 ms, N Samples: 200.
[GPU_STEP]             Mean: 39.537 ms, Std: 1.452 ms, Max: 44.720 ms, Min: 38.024 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 30.065 ms, Std: 0.003 ms, Max: 30.084 ms, Min: 30.046 ms, N Samples: 200.

Throughput: ~13 iter/s

CPU:
[CPU_GRAB]             Mean: 21.728 ms, Std: 1.193 ms, Max: 25.891 ms, Min: 20.530 ms, N Samples: 200.
[CPU_PREP_RESIZE]      Mean: 5.252 ms, Std: 0.167 ms, Max: 6.051 ms, Min: 5.183 ms, N Samples: 200.
[CPU_PREP_D2H]         Mean: 1.123 ms, Std: 0.066 ms, Max: 1.445 ms, Min: 0.772 ms, N Samples: 200.
[CPU_PREP]             Mean: 13.468 ms, Std: 0.468 ms, Max: 15.780 ms, Min: 13.130 ms, N Samples: 200.
[CPU_ROT]              Mean: 1.767 ms, Std: 0.475 ms, Max: 3.314 ms, Min: 1.053 ms, N Samples: 200.
[CPU_INF]              Mean: 24.054 ms, Std: 1.301 ms, Max: 31.345 ms, Min: 23.337 ms, N Samples: 200.
[CPU_STEP]             Mean: 61.058 ms, Std: 2.245 ms, Max: 70.546 ms, Min: 58.555 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 30.064 ms, Std: 0.012 ms, Max: 30.084 ms, Min: 30.016 ms, N Samples: 200.

Throughput: ~10 iter/s

HD1080 @30FPS:

GPU:
[GPU_GRAB]             Mean: 6.146 ms, Std: 1.574 ms, Max: 11.672 ms, Min: 4.429 ms, N Samples: 200.
[GPU_PREP_RESIZE]      Mean: 6.188 ms, Std: 1.396 ms, Max: 7.494 ms, Min: 1.917 ms, N Samples: 200.
[GPU_PREP]             Mean: 6.907 ms, Std: 1.404 ms, Max: 8.313 ms, Min: 2.610 ms, N Samples: 200.
[GPU_ROT]              Mean: 0.851 ms, Std: 0.051 ms, Max: 1.244 ms, Min: 0.795 ms, N Samples: 200.
[GPU_INF]              Mean: 23.864 ms, Std: 0.697 ms, Max: 30.536 ms, Min: 22.047 ms, N Samples: 200.
[GPU_STEP]             Mean: 37.785 ms, Std: 0.774 ms, Max: 44.811 ms, Min: 35.756 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 0.005 ms, Std: 0.003 ms, Max: 0.038 ms, Min: 0.003 ms, N Samples: 200.

Throughput: ~26 iter/s


CPU:
[CPU_GRAB]             Mean: 18.501 ms, Std: 1.092 ms, Max: 22.510 ms, Min: 17.040 ms, N Samples: 200.
[CPU_PREP_RESIZE]      Mean: 4.796 ms, Std: 0.139 ms, Max: 5.671 ms, Min: 4.714 ms, N Samples: 200.
[CPU_PREP_D2H]         Mean: 1.107 ms, Std: 0.062 ms, Max: 1.447 ms, Min: 0.901 ms, N Samples: 200.
[CPU_PREP]             Mean: 11.538 ms, Std: 0.361 ms, Max: 13.599 ms, Min: 11.297 ms, N Samples: 200.
[CPU_ROT]              Mean: 1.319 ms, Std: 0.350 ms, Max: 1.848 ms, Min: 0.862 ms, N Samples: 200.
[CPU_INF]              Mean: 24.247 ms, Std: 1.295 ms, Max: 31.933 ms, Min: 22.330 ms, N Samples: 200.
[CPU_STEP]             Mean: 55.640 ms, Std: 2.117 ms, Max: 69.769 ms, Min: 52.252 ms, N Samples: 200.
[GPU_CPU_DUMMY_SLEEP]  Mean: 0.009 ms, Std: 0.011 ms, Max: 0.163 ms, Min: 0.003 ms, N Samples: 200.

Throughput: ~17 iter/s
"""

Notes:

  • I used the generic YOLO (from ultralytics import YOLO) and a custom-trained PyTorch YOLOv8 model.
  • I added the sleep because, in the HD2K case, my pipeline wasn't saturating the 15 FPS grab rate, so grabbing only appeared slower on the GPU (a faulty reading).
  • The preprocessing includes a 4-channel to 3-channel reduction, resizing (to meet the 640x640 expected input), and normalization.
  • There's one step not shown in the pipeline: a rotation of the point cloud (PCL) about the X axis, just to simulate real work (code details are in Feature/get data gpu #230 (comment)).
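The preprocessing and the synthetic point-cloud rotation described above can be sketched as follows, here with NumPy on the host; in the GPU path the same calls would run on device arrays, since CuPy mirrors the NumPy API. The function names and the nearest-neighbour resize are illustrative choices, not the PR's actual code.

```python
import numpy as np


def preprocess(bgra, size=640):
    """Drop alpha (4 -> 3 channels), nearest-neighbour resize, normalize to [0, 1]."""
    bgr = bgra[..., :3]                        # 4-channel -> 3-channel reduction
    h, w = bgr.shape[:2]
    ys = np.arange(size) * h // size           # nearest-neighbour row indices
    xs = np.arange(size) * w // size           # nearest-neighbour column indices
    resized = bgr[ys[:, None], xs[None, :]]    # meet the 640x640 expected input
    return resized.astype(np.float32) / 255.0  # normalization


def rotate_x(points, angle_rad):
    """Rotate an (N, 3) point cloud about the X axis (the dummy-work step)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[1.0, 0.0, 0.0],
                    [0.0, c, -s],
                    [0.0, s, c]], dtype=points.dtype)
    return points @ rot.T
```

With CuPy, swapping `np` for `cp` (and feeding in the device array returned by get_data) keeps the whole chain on the GPU, avoiding the device-to-host copy that shows up as CPU_PREP_D2H in the CPU logs.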

@Rad-hi Rad-hi marked this pull request as ready for review September 27, 2024 14:06