This repository contains the winning submissions for PolEval 2021 Task 3 (post-correction of OCR results) and Task 4 (question answering). Both submissions rely on fine-tuning the mT5 model on the respective tasks.
Solution details are described in the workshop proceedings.
Task 3 (OCR correction) results:

model | dev-0 | test-A | test-B |
---|---|---|---|
original | 16.550 | 16.527 | 16.543 |
base | 4.678 | 4.792 | 4.796 |
large | 4.418 | 4.515 | 4.559 |
XXL | 3.604 | 3.725 | 3.744 |
Task 4 (question answering) results:

model | test-B |
---|---|
base | 52.12 |
large | 59.20 |
XXL | 71.68 |
Common steps for both tasks
- Install pip requirements
pip install -r requirements.txt
- Download the mT5 vocabulary to the repository root (an optional load check is sketched after this list)
gsutil cp gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model .
- Prepare a GCS bucket for storing training datasets: https://cloud.google.com/storage/docs/creating-buckets
- Update `gs_base_path` in `config/config.yaml` to point to the created bucket
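As an optional sanity check (not part of the original instructions), the downloaded vocabulary can be loaded with the `sentencepiece` package:

```python
# Optional sanity check: load the downloaded mT5 SentencePiece vocabulary.
# Assumes the `sentencepiece` package is installed.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")
print("vocabulary size:", sp.get_piece_size())
print(sp.encode("Przykładowe zdanie po polsku.", out_type=str))
```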
For OCR correction, the provided data contains pages of text that are in many instances longer than the maximum sequence length allowed by the model architecture. To alleviate this, training examples are created by aligning and splitting longer input/output pairs, as illustrated by the sketch below.
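The actual alignment and splitting is implemented in `data_preparation.ocr_correction.split_text` (used below); the following is only a rough sketch of the idea, using a naive word-level budget instead of the real alignment and subword length limit:

```python
# Rough illustration only: cut an aligned (OCR, corrected) text pair into
# chunks that fit a length budget. The repository's split_text module aligns
# the texts properly and works with a subword limit (--length-limit 384);
# a naive 1:1 word alignment is assumed here purely for illustration.
from typing import Iterator, Tuple


def split_aligned_pair(ocr_text: str, corrected_text: str,
                       limit: int = 384) -> Iterator[Tuple[str, str]]:
    ocr_words = ocr_text.split()
    corrected_words = corrected_text.split()
    total = max(len(ocr_words), len(corrected_words))
    for start in range(0, total, limit):
        yield (" ".join(ocr_words[start:start + limit]),
               " ".join(corrected_words[start:start + limit]))


for src, tgt in split_aligned_pair("n0isy ocr page ...", "noisy ocr page ..."):
    print(src, "=>", tgt)
```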
- Pull task repository
git clone -b secret https://github.com/poleval/2021-ocr-correction.git
- Split examples into chunks to match the maximum sequence length
python3 -m data_preparation.ocr_correction.split_text \
2021-ocr-correction \
--length-limit 384
- Upload the files to the created bucket and update or match the paths in `config/task/ocr_correction.yaml`. Keep the `.index` files to restore the full text from predictions.
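The upload itself can be done with `gsutil` or, equivalently, with the `google-cloud-storage` Python client; the bucket and object names below are placeholders:

```python
# Alternative to gsutil: upload a prepared file with the google-cloud-storage
# client. Bucket and object names are placeholders; adjust them to match
# gs_base_path and the paths in config/task/ocr_correction.yaml.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("data/ocr/train-input-384.txt")
blob.upload_from_filename("train-input-384.txt")
print(f"uploaded gs://{bucket.name}/{blob.name}")
```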
For question answering, the model input prompt consists of the question and context passages retrieved from Wikipedia. This section shows how to reproduce the data used in the submission.
The prepared data is available here. Skip to step 5 if you are using this dataset.
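The exact prompt format is produced by the data preparation scripts below; the snippet here is only a hedged illustration of combining a question with retrieved passages, with made-up separators and limits:

```python
# Illustration only: assemble a QA prompt from a question and retrieved
# passages. The separators ("pytanie:", "kontekst:") and the passage count
# are hypothetical, not necessarily the format used in the submission.
from typing import List


def build_prompt(question: str, passages: List[str], max_passages: int = 5) -> str:
    context = " ".join(passages[:max_passages])
    return f"pytanie: {question} kontekst: {context}"


print(build_prompt(
    "Kto napisał Pana Tadeusza?",
    ["Pan Tadeusz – poemat Adama Mickiewicza wydany w 1834 roku.", "..."],
))
```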
- Pull task repository
git clone -b secret https://github.com/poleval/2021-question-answering.git
- Start a local Elasticsearch instance using Docker (skip if using an existing cluster)
docker volume create poleval-es # recommended for persistence
docker run \
-p 9200:9200 \
-p 9300:9300 \
-v poleval-es:/usr/share/elasticsearch/data \
-e "discovery.type=single-node" \
docker.elastic.co/elasticsearch/elasticsearch:7.13.4
- Download the spaCy model
python -m spacy download pl_core_news_md
- Index and retrieve context passages for the Polish QA dataset (a retrieval sketch follows this list)
python3 -m data_preparation.question_answering.quiz_pl \
2021-question-answering \
wiki_passages_pl
- Index and retrieve context passages for the TriviaQA dataset
python3 -m data_preparation.question_answering.trivia_qa wiki_passages_en
- Keep only the questions (first column) for prediction
cat test-B-input-510.tsv | cut -f1 > test-B-questions-510.tsv
- Upload the files to the created bucket and update or match the paths in `config/task/question_answering.yaml`
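The repository's own indexing and retrieval lives in the `data_preparation.question_answering` modules used above; the sketch below only shows what a basic full-text query against the local Elasticsearch instance looks like with the `elasticsearch` Python client (the index and field names are assumptions):

```python
# Illustration only: query the local Elasticsearch instance for passages.
# The index name and the "text" field are assumptions; the actual schema is
# created by the data_preparation.question_answering modules.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
assert es.ping(), "Elasticsearch is not reachable on localhost:9200"

response = es.search(
    index="wiki_passages_pl",
    body={"query": {"match": {"text": "Kto napisał Pana Tadeusza?"}}, "size": 5},
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("text", "")[:80])
```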
The models were trained on a TPUv3 device. Model configurations are defined in the `config/` folder.
After training completes, inference is run using the prompts from the files specified under `predict_files` in `config/task/<task>.yaml`.
- Start a TPUv3 and a cloud instance, e.g. using the ctpu tool
ctpu up --name poleval --tpu-size=v3-8 --tf-version 2.5.0
- SSH to the TPU instance, download this repository, and install the requirements
- Start the training (or resume from the latest checkpoint), specifying the task and model configuration
python3 main.py model=xxl task=question_answering +tpu_name=poleval
- (OCR only) Concatenate the corrected fragments to restore the source text (an illustrative sketch follows this list)
python3 -m data_preparation.ocr_correction.restore \
gs://my-bucket/data/ocr/dev-0-input-384.txt-1100000 \
dev-0-384.index \
dev-0-restored.txt
- Evaluate the results using the geval tool
cd 2021-question-answering # or 2021-ocr-correction
gsutil cp gs://my-bucket/data/polish_qa/test-B-questions-510.tsv-1010000 test-B/out.tsv
./geval --test-name test-B
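For the OCR-only restore step above, the reconstruction is done by `data_preparation.ocr_correction.restore`; the sketch below only illustrates the underlying idea of gluing predicted chunks back into full documents, assuming a hypothetical index that stores a chunk count per document (the real `.index` format may differ):

```python
# Illustration only: rebuild full texts from per-chunk predictions using a
# hypothetical index of (document_id, number_of_chunks) pairs. The actual
# .index files written by split_text may use a different format.
from typing import Dict, List, Tuple


def restore_documents(chunks: List[str],
                      index: List[Tuple[str, int]]) -> Dict[str, str]:
    restored: Dict[str, str] = {}
    position = 0
    for doc_id, num_chunks in index:
        restored[doc_id] = " ".join(chunks[position:position + num_chunks])
        position += num_chunks
    return restored


predictions = ["first corrected chunk", "second corrected chunk", "other doc"]
index = [("doc-1", 2), ("doc-2", 1)]
for doc_id, text in restore_documents(predictions, index).items():
    print(doc_id, ":", text)
```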
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC)