Home
Welcome to TakeLab Podium, a Python machine learning library that helps users accelerate their use of natural language processing models.
This wiki is the main source of documentation for developers working with (or contributing to) the TakeLab Podium project.
Podium's goal is described in the following figure.
The data part of Podium starts with a Dataset definition, which is composed of Examples and Fields.
Every Field can have its own vocabulary, about which you can find more here.
Iteration through a dataset is defined by Iterators.
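To make the relationship between these pieces concrete, here is a minimal sketch in plain Python. It does not use Podium's actual classes: `ToyField`, `make_dataset`, and `iterate_batches` are hypothetical names invented for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


class ToyField:
    """Hypothetical stand-in for a Field: names a column and tokenizes it."""

    def __init__(self, name):
        self.name = name
        self.vocab = Counter()  # each field keeps its own vocabulary

    def process(self, raw):
        tokens = raw.split()  # assumption: plain whitespace tokenization
        self.vocab.update(tokens)
        return tokens


def make_dataset(rows, fields):
    """Turn raw rows (dicts) into examples: dicts of field name -> tokens."""
    return [{f.name: f.process(row[f.name]) for f in fields} for row in rows]


def iterate_batches(dataset, batch_size):
    """Yield consecutive slices of the dataset, mimicking an iterator."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


text = ToyField("text")
dataset = make_dataset(
    [{"text": "a good movie"}, {"text": "a bad movie"}, {"text": "fine"}],
    [text],
)
batches = list(iterate_batches(dataset, batch_size=2))
```

In this sketch the field owns the vocabulary and the dataset is just a list of processed examples, so the iterator only has to decide how to slice it.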
Preprocessing utilities are defined as part of the preproc submodule. Here, one can find some typical natural language processing utilities:
- tokenizers -- divide a single string into a list of tokens,
- stop word lists -- words that are typically omitted when building NLP models,
- lemmatizers -- procedures that determine the canonical (dictionary) form of a word,
- stemmers -- procedures that reduce a word to its stem.
Typically, preprocessing is defined as hooks, which are executed when data is loaded, as either pretokenize or posttokenize steps. Hooks are attached to Fields.
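The hook mechanism can be sketched roughly as follows. This is an illustration in plain Python, not Podium's implementation: the `HookedField` class, its attribute names, and the tiny stop word set are all assumptions made for the example.

```python
import re

STOP_WORDS = {"the", "a", "an"}  # tiny illustrative list, not a real resource


def lowercase(raw):
    """Pretokenize hook: operates on the raw string."""
    return raw.lower()


def remove_stop_words(tokens):
    """Posttokenize hook: operates on the list of tokens."""
    return [t for t in tokens if t not in STOP_WORDS]


class HookedField:
    """Hypothetical field that runs hooks before and after tokenization."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.pretokenize_hooks = []
        self.posttokenize_hooks = []

    def preprocess(self, raw):
        for hook in self.pretokenize_hooks:
            raw = hook(raw)
        tokens = self.tokenizer(raw)
        for hook in self.posttokenize_hooks:
            tokens = hook(tokens)
        return tokens


field = HookedField(tokenizer=lambda s: re.findall(r"\w+", s))
field.pretokenize_hooks.append(lowercase)
field.posttokenize_hooks.append(remove_stop_words)
tokens = field.preprocess("The cat sat on a mat.")
```

The difference between the two hook kinds is the type they see: a pretokenize hook receives and returns a string, while a posttokenize hook receives and returns a list of tokens.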
Models map inputs to outputs. Thus far, we have focused on supervised models, which are defined through two methods, fit and predict, similarly to scikit-learn. Models implementing the AbstractFrameworkModel should define how to save and load their state (weights and parameters), so they can be fine-tuned with additional training or simply used out of the box.
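As a rough illustration of the fit/predict contract together with saving and loading, consider the toy model below. It is a sketch only: `MajorityClassModel` and its `save`/`load` signatures are hypothetical and are not taken from AbstractFrameworkModel, whose actual interface may differ.

```python
import json
import os
import tempfile
from collections import Counter


class MajorityClassModel:
    """Toy supervised model: always predicts the most frequent training label."""

    def __init__(self):
        self.majority_label = None

    def fit(self, X, y):
        # The only "parameter" learned is the most common label in y.
        self.majority_label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_label for _ in X]

    def save(self, path):
        # Persist the learned state so the model can be reused later.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"majority_label": self.majority_label}, f)

    @classmethod
    def load(cls, path):
        model = cls()
        with open(path, encoding="utf-8") as f:
            model.majority_label = json.load(f)["majority_label"]
        return model


model = MajorityClassModel().fit([[0], [1], [2]], ["pos", "neg", "pos"])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    model.save(path)
    restored = MajorityClassModel.load(path)
predictions = restored.predict([[3], [4]])
```

A restored model can be used for prediction right away, or passed through `fit` again with new data, which is the out-of-the-box versus additional-training distinction described above.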
- Large resource - if you need to use or create a class that downloads a large resource from a server to the takepod resources folder
- Logging - if you need good logging from podium modules