This is an example to demonstrate the use of a Continuous Integration toolkit based on a machine learning program, as described in James Coupe's article A Continuous Integration Toolkit for Artists.
This is a Python program that downloads the images from the links listed in `images.json`, uses Amazon Textract to perform OCR on them, and prints the recognized text to the console.
This is primarily intended to be run in a CI pipeline, but can also be executed locally to check and verify the results.
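The flow described above (download an image, send it to Textract, print the recognized text) can be sketched roughly as follows. This is a minimal illustration and not the contents of `main.py`: the function names `extract_lines` and `ocr_image` and the default region are assumptions, while `detect_document_text` is the synchronous text detection call in `boto3`.

```python
import os
import urllib.request


def extract_lines(response):
    """Return the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def ocr_image(url, region="us-east-1"):
    """Download one image and return the lines Textract recognizes in it."""
    # Imported here so extract_lines can be used without boto3 installed.
    import boto3

    with urllib.request.urlopen(url) as resp:
        image_bytes = resp.read()

    client = boto3.client(
        "textract",
        region_name=region,  # assumption: this README does not name a region
        aws_access_key_id=os.environ["AWS_ACCESS_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_KEY"],
    )
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return extract_lines(response)
```

Calling `ocr_image(url)` for each URL and printing the returned lines would give console output similar in spirit to what this project produces; the exact output format of `main.py` may differ.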
A Linux or macOS system is assumed. See this to set up a Python virtual environment on Windows.
- Make sure Python 3.7+ is installed
- Create a directory `~/.virtualenvs`
- Create a virtual environment called `arts` for example. Run `python3 -m venv ~/.virtualenvs/arts` to create it
- Switch to it by running `source ~/.virtualenvs/arts/bin/activate`
- Set 2 environment variables to be able to use AWS Textract (see this to make a key pair):
  - `AWS_ACCESS_ID`: The ID of the access key
  - `AWS_SECRET_KEY`: The secret of the key
- Run `pip install -r requirements.txt` to install the dependencies
- Run `python3 main.py` to obtain the output in the console
- Customize the URLs in `images.json` to change which images are downloaded
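Before running the program, a quick check like the one below can confirm that the two environment variables are set and that `images.json` is readable. It is only a sketch: the helper names are hypothetical, and it assumes `images.json` holds a plain JSON array of URL strings, which may not match the actual file layout in this repo.

```python
import json
import os

# The two variable names come from the setup steps above.
REQUIRED_VARS = ("AWS_ACCESS_ID", "AWS_SECRET_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def load_image_urls(path="images.json"):
    """Load image URLs, assuming the file is a JSON array of strings."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list) or not all(isinstance(u, str) for u in data):
        raise ValueError(f"{path} is not a JSON array of URL strings")
    return data


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print(f"{len(load_image_urls())} image URL(s) found in images.json")
```

If the real `images.json` uses a different shape (for example, objects with a URL field), `load_image_urls` would need to be adjusted accordingly.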
This project uses GitHub Actions for CI. The pipeline connects to AWS and performs the OCR using AWS credentials that are pre-configured as Actions secrets.
The status of the last pipeline runs can be viewed here. You can expand the pipeline stage called `extract text` and view the outcome of the OCR.
To execute the runs on your own, fork this repo. You get your own copy of the pipeline and can set your own AWS variables as secrets. The pipeline should then be triggered whenever you push a commit to your fork.