simecek/dspracticum2020

Data Science Practicum 2020

This Data Science course has been taught at the Faculty of Science, Masaryk University, in the fall semester 2020/2021.

I gave 12 lectures, each focused on one ML technique and one dataset (typically from kaggle.com). The emphasis was on coding and practicing data science skills rather than on theoretical background.

Course Info

The course is now in the IS Course Catalogue; look for M7DataSP Data Science Practicum (Praktikum z pokročilé datové vědy).

The course is scheduled for Mondays, 12:00-13:30, and will be taught remotely through Google Meet/Hangouts. The first lecture will take place on October 5. To be invited to the classes, enroll in the course in IS. If you want to try the first few lectures without formal enrollment, send me an email.

No special knowledge is expected, but you should have at least one year of coding experience in either R or Python. I like diverse crowds; students from different faculties and specializations are encouraged to enroll (if still in doubt, let me know so that you can be paired with a more experienced student). The course will be taught in English if at least two students are interested, otherwise in Czech.

Lectures

  1. Intro, linear regression (one neuron), neural networks (NN), TensorFlow (TF)
  2. Logistic regression, softmax, cross-entropy
  3. Image data, convolutional NN
  4. ImageNet, fine-tuning, transfer learning, data augmentation
  5. TensorFlow.js, GitHub Pages, backpropagation
  6. Natural language processing (NLP), text preprocessing, dense NN
  7. Embeddings, recurrent NN (LSTM, GRU)
  8. Text classification, transformers, NLP methods on genomic data
  9. Recommenders / Collaborative filtering, optimization
  10. Tabular data, batch normalization
  11. Trees, random forest, XGBoost, LightGBM, CatBoost
  12. ML model interpretation, hyperparameter optimization, AutoML

Recordings of the lectures can be found in the teaching materials in IS (you need to have MU GSuite to access the videos).

Assignments

  1. Create a GitHub account and your first repository
  2. Classify penguins based on their size
  3. Fashion MNIST classification
  4. Image classification app
  5. Text generation
  6. Genomic sequence classification
  7. Ratings prediction
  8. Blue Book for Bulldozers

For the students' solutions, see Assignments.md.

Recommended books and blogs

  1. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd edition)
  2. Deep Learning with Python, Second Edition
  3. TensorFlow 2 in 30 days
  4. Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD
  5. RStudio AI Blog
  6. The Missing Semester of Your CS Education

Acknowledgement

This work would be impossible without the tutorials provided by TensorFlow and RStudio. I also got a lot of inspiration from numerous Kaggle notebooks and blogs all over the internet. Sometimes, under time pressure before a lecture, I might have forgotten to properly link all my sources. If this is the case, I would be grateful if you corrected my mistake, either by a pull request or by sending me a message. Thank you.