GitHub - DaBeIDS/osc-transformer-presteps: Data Extraction: Transformer Pre-steps Tool

OSC Data Extractor Pre-Steps

OS-Climate Data Extraction Tool

This tool provides an API and a Streamlit app that take a PDF document and return its text content in JSON format. The backend uses a Python module for extracting text from PDFs, which may be extended to other file types in the future. The JSON file is intended as input for transformer models that extract relevant information, but it can also be used independently.

Quick start

For a quick start, install Python and clone the repository to your local environment:

$ git clone https://github.com/os-climate/osc-transformer-presteps

Afterwards, install the project's requirements into your Python environment (for example via pdm update) and start a local API server:

$ python ./src/run_server.py

Note:

We assume that your working directory is the cloned repository.
To check that the server is running, open "http://localhost:8000/liveness". You should see:

{
"message": "OSC Transformer Pre-Steps Server is running."
}
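The same check can also be scripted. The following is a minimal sketch, assuming the server runs on the default address shown above and returns exactly the JSON message quoted in this note. The helper names parse_liveness and server_is_running are illustrative and not part of this repository.

```python
import json
import urllib.error
import urllib.request

# URL and message are taken from the note above; adjust them if your setup differs.
LIVENESS_URL = "http://localhost:8000/liveness"
EXPECTED_MESSAGE = "OSC Transformer Pre-Steps Server is running."


def parse_liveness(body: str) -> bool:
    """Return True if the response body carries the expected message."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("message") == EXPECTED_MESSAGE


def server_is_running(url: str = LIVENESS_URL, timeout: float = 5.0) -> bool:
    """Query the liveness endpoint; any connection problem counts as not running."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_liveness(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("running" if server_is_running() else "not reachable")
```

Splitting the parsing from the network call keeps the message check usable without a live server, for example in a unit test.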

Finally, run the following command to start a Streamlit app, which lets you upload a file and extract its data from PDF to JSON:

$ streamlit run ./src/osc_transformer_presteps/streamlit/app.py

Note: See also docs/demo. There you can find local_extraction_demo.py, which runs an extraction without any API call, and post_request_demo.py, which sends a file to the API (you have to start the server as described above first).

Developer Notes

To add new dependencies, use pdm. First install it via pip:

$ pip install pdm

Then add new packages via pdm add. For example, to add numpy:

$ pdm add numpy

To run the linting tools, do the following:

$ pip install tox
$ tox -e lint

