FilipBolt edited this page Sep 17, 2019 · 19 revisions

Welcome to TakeLab Podium, a Python machine learning library that helps users accelerate the use of natural language processing models.

This wiki is the main source of documentation for developers working with (or contributing to) the TakeLab Podium project.

The goal of Podium is illustrated in the following figure.

Podium goal

Data

The data part of Podium starts with a Dataset definition, which is composed of Examples and Fields.

Every Field can have its own vocabulary, about which you can find more here.

Iteration over a dataset is defined by Iterators.
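To make the relationship between examples, per-field vocabularies, and iteration concrete, here is a minimal plain-Python sketch. It does not use the Podium API: the dictionary-based examples and the helpers `build_vocab` and `iterate_batches` are hypothetical stand-ins, chosen only to illustrate the idea.

```python
from collections import Counter

# Hypothetical data: each example maps a field name to its value.
# The "text" field is already tokenized; "label" is a plain string.
examples = [
    {"text": ["a", "good", "film"], "label": "pos"},
    {"text": ["a", "bad", "film"], "label": "neg"},
    {"text": ["good", "fun"], "label": "pos"},
]


def build_vocab(examples, field_name):
    """Build a token -> index mapping for a single field (hypothetical helper)."""
    counts = Counter(tok for ex in examples for tok in ex[field_name])
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: index for index, tok in enumerate(ordered)}


def iterate_batches(examples, batch_size):
    """Yield consecutive slices of the dataset (hypothetical helper)."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


vocab = build_vocab(examples, "text")
batch_sizes = [len(batch) for batch in iterate_batches(examples, 2)]
print(vocab)
print(batch_sizes)
```

The vocabulary is built per field, so a label field could carry its own, separate mapping.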

Preprocessing

Preprocessing utilities are defined as part of the preproc submodule. Here, one can find some typical natural language processing utilities:

  • tokenizers -- divide a single string into a list of tokens
  • stop word lists -- words that are typically omitted when building NLP models
  • lemmatizers -- procedures that determine the canonical form (lemma) of a word
  • stemmers -- procedures that reduce a word to its stem (root form)

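Typically, preprocessing is defined as hooks, which are executed when data is loaded, either as pretokenize or posttokenize steps. Pretokenize hooks operate on the raw string before tokenization, and posttokenize hooks operate on the resulting tokens. Hooks are attached to Fields.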
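The hook mechanism can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Podium's implementation: the `preprocess` function, the hook names, and the small `STOP_WORDS` set are all hypothetical, and the real hook signatures in the preproc submodule may differ.

```python
import string

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of"}


def lowercase(raw):
    """Pretokenize hook: takes the raw string and returns a string."""
    return raw.lower()


def strip_punctuation(tokens):
    """Posttokenize hook: trims punctuation and drops empty tokens."""
    stripped = [tok.strip(string.punctuation) for tok in tokens]
    return [tok for tok in stripped if tok]


def remove_stop_words(tokens):
    """Posttokenize hook: drops tokens found in the stop word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def preprocess(raw, pretokenize_hooks, tokenizer, posttokenize_hooks):
    """Run hooks in order: raw-string hooks, tokenizer, then token hooks."""
    for hook in pretokenize_hooks:
        raw = hook(raw)
    tokens = tokenizer(raw)
    for hook in posttokenize_hooks:
        tokens = hook(tokens)
    return tokens


tokens = preprocess(
    "The goal of Podium is simple.",
    pretokenize_hooks=[lowercase],
    tokenizer=str.split,
    posttokenize_hooks=[strip_punctuation, remove_stop_words],
)
print(tokens)
```

Hook order matters here: punctuation is stripped before stop words are removed, so a token such as "the," would still match the stop word list.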

Models

Models simply produce an output from an input. Thus far, we have focused on supervised models, which are defined through two methods: fit and predict, similarly to scikit-learn. Models implementing the AbstractFrameworkModel should define how to save and load their weights and parameters, so they can be fine-tuned with additional training or simply used out of the box.
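As a rough illustration of this contract, the sketch below shows a toy supervised model with fit, predict, save, and load. The class `MajorityClassModel` is hypothetical and does not subclass the real AbstractFrameworkModel, whose actual method names and signatures are not shown on this page; pickle is used only as a simple way to persist the state.

```python
import os
import pickle
import tempfile
from collections import Counter


class MajorityClassModel:
    """Hypothetical toy model that always predicts the most frequent label."""

    def __init__(self):
        self.majority_label = None

    def fit(self, X, y):
        # "Training" only records the most common label in y.
        self.majority_label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        if self.majority_label is None:
            raise ValueError("call fit before predict")
        return [self.majority_label for _ in X]

    def save(self, path):
        # Persist only the learned state, not the whole object.
        with open(path, "wb") as f:
            pickle.dump({"majority_label": self.majority_label}, f)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        model = cls()
        model.majority_label = state["majority_label"]
        return model


model = MajorityClassModel().fit([[0], [1], [2]], ["pos", "neg", "pos"])
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
model.save(path)

# A restored model can be used out of the box or fitted again.
restored = MajorityClassModel.load(path)
predictions = restored.predict([[3], [4]])
print(predictions)
```

Saving only a small state dictionary, rather than the whole object, keeps the stored file independent of the class internals.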

Hyperparameter optimization

Validation and statistical tests

Utilities

  • Large resource - if you need to use or make a class that downloads a large resource from a server to the takepod resources folder
  • Logging - if you need good logging from Podium modules