Using a large model on an ESP32-S3 N16R8 (TFMIC-37) #94

Open
nicklasb opened this issue Aug 25, 2024 · 7 comments

nicklasb commented Aug 25, 2024

Hi,

I am running inference with a large model (~2 MB) on the ESP32-S3, and it takes about 60 seconds, versus about 50 ms on my PC.
Since the tensor arena seems to have to be about 5 MB to satisfy TFLite, and the RGB input image alone is larger than SRAM (it's a YOLO model, and it prefers RGB), everything obviously ends up happening in PSRAM, which slows things down significantly.
However, I don't think that accounts for a factor of a thousand, even though the ESP32-S3 is obviously slower as well.

What can I do? Can I override the memory allocator to put just some of the data in SRAM? I have about 300 KB available that isn't being used.
I saw that the ESP32-P4 will more or less use SRAM as a cache for a faster PSRAM; is there something similar that could be done in the meantime? Or something else?

@github-actions github-actions bot changed the title from "Using a large model on an ESP32-S3 N16R8" to "Using a large model on an ESP32-S3 N16R8 (TFMIC-37)" on Aug 25, 2024
@vikramdattu (Collaborator)

Hi @nicklasb, the ESP32-S3 has cache options you can explore. If you have that much internal RAM underutilised, increase the data cache and instruction cache sizes from menuconfig to their maximums. That should give you a good boost in performance.
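
For reference, the relevant settings look like this in sdkconfig (option names as in recent ESP-IDF; please verify them in your menuconfig under "Cache config"):

```
# ESP32-S3 cache configuration, set to the maximums
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y
CONFIG_ESP32S3_DATA_CACHE_64KB=y
CONFIG_ESP32S3_DATA_CACHE_LINE_64B=y
```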

As far as moving some of the allocations to internal RAM goes, the current tflite structure is not flexible enough to allow that. You may move some of the critical kernels (from esp-nn) to IRAM so they always persist in RAM, which should boost performance even further. Please explore esp_attr.h for this. Simply add IRAM_ATTR in front of a function and it will be placed in IRAM.
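
A minimal sketch of the attribute usage (the kernel below is a hypothetical stand-in; in practice you would add the attribute to the esp-nn functions you profile as hot):

```cpp
#include "esp_attr.h"

// Hypothetical hot kernel. IRAM_ATTR places the function in internal
// IRAM, so it is never fetched through the (evictable) flash cache.
IRAM_ATTR void hot_kernel(const int8_t *in, int8_t *out, int len)
{
    for (int i = 0; i < len; i++) {
        out[i] = in[i];  // placeholder body
    }
}
```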

A 5 MB tensor arena requirement is indeed high, and I am not sure this should be the case. I will give the YOLO model a try and run some experiments myself.

nicklasb commented Aug 26, 2024

Hi, thanks for your answer!

> Hi @nicklasb, the ESP32-S3 has cache options you can explore. If you have that much internal RAM underutilised, increase the data cache and instruction cache sizes from menuconfig to their maximums. That should give you a good boost in performance.

I'm afraid they have not made much of a difference in my case; also, their maximum values aren't very high. I will revisit them and see if I have missed something.

> As far as moving some of the allocations to internal RAM goes, the current tflite structure is not flexible enough to allow that.

It sort of is, IMO, but only up to a point: the MicroAllocator can be initialized with a separate non-persistent arena, which I think could help to some degree. However, I have not been able to make that work; then again, C++ semantics is not my home turf (yet), and they seem to use all the tricks. There are some PRs over at TFLite that touch on this; if Espressif put some of their weight behind them, I think it would make a difference.
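
For what it's worth, a minimal sketch of what I mean, assuming the dual-arena MicroAllocator::Create overload available in recent tflite-micro checkouts (sizes are illustrative, error handling omitted):

```cpp
#include "esp_heap_caps.h"
#include "tensorflow/lite/micro/micro_allocator.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_op_resolver.h"

// Persistent allocations (plans, metadata) go to a large PSRAM arena,
// while the hot non-persistent scratch/activation buffers go to a
// small arena in internal SRAM.
tflite::MicroInterpreter *make_interpreter(
    const tflite::Model *model, const tflite::MicroOpResolver &resolver) {
  constexpr size_t kPersistentSize = 4 * 1024 * 1024;  // PSRAM
  constexpr size_t kScratchSize = 256 * 1024;          // internal SRAM

  auto *persistent = static_cast<uint8_t *>(
      heap_caps_malloc(kPersistentSize, MALLOC_CAP_SPIRAM));
  auto *scratch = static_cast<uint8_t *>(heap_caps_malloc(
      kScratchSize, MALLOC_CAP_INTERNAL | MALLOC_CAP_8BIT));

  tflite::MicroAllocator *allocator = tflite::MicroAllocator::Create(
      persistent, kPersistentSize, scratch, kScratchSize);

  // Remember to call interpreter->AllocateTensors() before Invoke().
  static tflite::MicroInterpreter interpreter(model, resolver, allocator);
  return &interpreter;
}
```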

> You may move some of the critical kernels (from esp-nn) to IRAM so they always persist in RAM, which should boost performance even further. Please explore esp_attr.h for this. Simply add IRAM_ATTR in front of a function and it will be placed in IRAM.

Ok, I will take a look.

> A 5 MB tensor arena requirement is indeed high, and I am not sure this should be the case.

It is, but perhaps that is not so strange: the model is > 2 MB. Either way, the size doesn't matter much as long as it fits comfortably within PSRAM. The point is rather that the frequently accessed data needs faster memory.

> I will give the YOLO model a try and run some experiments myself.

For your information, when testing out YOLO: its export.py can both quantize to int8 and export directly to .tflite format. I went on a long tangent before realizing that. Also, I used YOLOv5.
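
For reference, the export invocation looks roughly like this (flags as in the YOLOv5 repo's export.py; the weights file and image size are just examples):

```
python export.py --weights yolov5s.pt --include tflite --int8 --img 320
```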

nicklasb commented Aug 26, 2024

@vikramdattu

> Please explore esp_attr.h for this. Simply add IRAM_ATTR in front of a function and it will be placed in IRAM.

I am afraid that made no discernible difference.

What I am going to do now is clone this library instead of using it as a stand-alone component, and focus on whether I can override the memory management in some way using MicroAllocator. Maybe I can help out somehow.

nicklasb commented Aug 26, 2024

@vikramdattu
On a side note, I ran the model without the esp-nn optimizations, and that made inference about 3.5 times slower.
So good work there. :-)

Weirdly, setting the compiler option to (x) Optimize for performance (-O2) makes inference about 20 percent slower.
The inference also produced some strange results. Very odd.

@vikramdattu (Collaborator)

Did you set -O2 from the menuconfig options or from esp-nn/CMakeLists.txt? Can you please share your observations about the strange results? Do you see a bit mismatch?
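
For reference, the two routes differ in scope; a sketch, assuming recent ESP-IDF option names (please verify in your version):

```
# Global -O2 via menuconfig ("Optimize for performance"):
CONFIG_COMPILER_OPTIMIZATION_PERF=y
```

versus a per-component override such as `target_compile_options(${COMPONENT_LIB} PRIVATE -O2)` in esp-nn's CMakeLists.txt.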

@nicklasb (Author)

Hi, I set it from menuconfig, and basically the box coordinates ended up smaller; I had some of them going negative.
I don't have a working codebase at the moment (I'm writing a custom MicroAllocator), but I can get you more specifics tonight (CET).

@nicklasb (Author)

I couldn't, as I am now deep into the custom MicroAllocator/planner work, but generally all values became smaller. It should not be too hard to replicate with any YOLO model; there is nothing special about mine.
I did have a strange general issue where all boxes ended up about 50 pixels too high (with the optimizations on); that is likely unrelated to this, but just to mention it.
