Cache for Encoding - Runtime Boosted by 12% #319

Majdoddin · 2024-07-10T10:24:00Z

This PR introduces a caching mechanism in _encode_ordinary_native(), which stores the tokens for each "piece" of text. When a piece of text is repeated, its tokens are retrieved from the cache instead of being tokenized again.
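The idea can be sketched as follows. This is a minimal illustration, not the code from this PR: `tokenize_piece` is a hypothetical stand-in for the BPE merge step, and splitting on spaces replaces the regex split so that the example needs only the standard library.

```rust
use std::collections::HashMap;

type Rank = u32;

// Hypothetical stand-in for the expensive per-piece BPE step (assumption):
// here each byte simply becomes one token.
fn tokenize_piece(piece: &str) -> Vec<Rank> {
    piece.bytes().map(Rank::from).collect()
}

// Encodes `text` piece by piece, reusing tokens of pieces seen before.
// Returns the tokens together with the number of cache hits and misses.
fn encode_with_cache(text: &str) -> (Vec<Rank>, usize, usize) {
    // The cache lives only for this call; keys borrow from `text`.
    let mut cache: HashMap<&str, Vec<Rank>> = HashMap::new();
    let mut out = Vec::new();
    let (mut hits, mut misses) = (0usize, 0usize);

    // Assumption: split after each space instead of using the real regex.
    for piece in text.split_inclusive(' ') {
        match cache.get(piece) {
            Some(tokens) => {
                // Repeated piece: copy the stored tokens.
                hits += 1;
                out.extend_from_slice(tokens);
            }
            None => {
                // New piece: tokenize once, then remember the result.
                misses += 1;
                let tokens = tokenize_piece(piece);
                out.extend_from_slice(&tokens);
                cache.insert(piece, tokens);
            }
        }
    }
    (out, hits, misses)
}
```

Calling `encode_with_cache("a b a b ")` tokenizes `"a "` and `"b "` once each and serves the two repeats from the cache.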

When encoding 100MB of Linux source code as a single text on a single CPU core, runtime drops from 20.21s to 17.96s. That is about 11% less wall time, or roughly 12.5% higher throughput.

The cache hit ratio is very high, approximately 95%. The final cache holds only about 0.55% as many entries as there are pieces (218,450 entries vs. 39,769,721 pieces).

TODO:

- Despite the 95% cache hit ratio, the expected runtime boost was not fully realized. This is because 80% of the loop runtime in the current code is spent splitting the text using regex. Tokenization accounts for only the remaining 20% or so, so even though this PR makes the tokenization logic 65% faster, the effect on total runtime is limited. The big gain can be achieved by optimizing the text splitting, possibly through multithreading.
- Investigate declaring the cache in the struct CoreBPE so that it can be utilized across subsequent calls.
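For the second item, one possible shape is sketched below. `CachedEncoder` is a hypothetical, simplified stand-in for `CoreBPE`, `tokenize_piece` again replaces the real BPE step, and splitting on spaces replaces the regex. The `Mutex` assumes encoding is called through a shared reference, possibly from several threads; whether that fits the real struct is untested.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type Rank = u32;

// Hypothetical stand-in for the per-piece BPE step (assumption).
fn tokenize_piece(piece: &str) -> Vec<Rank> {
    piece.bytes().map(Rank::from).collect()
}

// Hypothetical, simplified stand-in for CoreBPE: the cache is a field,
// so it survives from one encode call to the next.
struct CachedEncoder {
    piece_cache: Mutex<HashMap<String, Vec<Rank>>>,
}

impl CachedEncoder {
    fn new() -> Self {
        Self { piece_cache: Mutex::new(HashMap::new()) }
    }

    fn encode_ordinary(&self, text: &str) -> Vec<Rank> {
        // Keys are owned Strings because they must outlive `text`.
        let mut cache = self.piece_cache.lock().unwrap();
        let mut out = Vec::new();
        // Assumption: split after each space instead of using the real regex.
        for piece in text.split_inclusive(' ') {
            if !cache.contains_key(piece) {
                cache.insert(piece.to_owned(), tokenize_piece(piece));
            }
            out.extend_from_slice(&cache[piece]);
        }
        out
    }

    // Number of distinct pieces currently cached.
    fn cached_pieces(&self) -> usize {
        self.piece_cache.lock().unwrap().len()
    }
}
```

A cache that persists across calls grows without bound unless it is capped or evicted, and holding the lock for a whole call serializes concurrent encoders, so both points would need measuring before adopting this design.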

added cache to _encode_ordinary_native() to improve runtime.

55263ac
