Week 1 is the most theory-heavy week of the course. You can find the lecture slides here: Week 1 Slides.
Research tokenizers and add a section to your final report that reflects on the following questions:
- What are tokenizers?
- Why are they important for language modeling and LLMs?
- Which tokenization algorithms exist, which are the most popular, and why?
Some references:
- Neural Machine Translation of Rare Words with Subword Units: https://arxiv.org/abs/1508.07909
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing: https://arxiv.org/abs/1808.06226
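
As a starting point for your own research, here is a minimal sketch of the merge-learning idea behind byte pair encoding (BPE), the subword method discussed in the first reference. It is a toy illustration, not the implementation from either paper: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, and the four-word corpus are assumptions chosen for readability. Ties between equally frequent pairs are broken by first-seen order, which is also just a convenience of this sketch.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Return a new vocab where every occurrence of `pair` is one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word -> frequency mapping."""
    # Start from single characters plus a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair, first seen wins ties
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy corpus (made up for illustration): word -> count
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(vocab)
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls the vocabulary size. When you write your report, it is worth comparing this greedy, frequency-based procedure with the other algorithms you find.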