The system has two main components:
- Data preparation module
- Streamlit app
This system was developed and run on a machine with an L4 GPU, 24 GB of memory, and 16 CPUs. Do not attempt to run it if no GPU is available.
- To set up the environment, first create a virtual environment
python3 -m venv .
- There are two requirements files: requirements.txt is for the Streamlit app and RAG dependencies, and requirements-prep.txt is for the data preparation module dependencies.
- Install the libraries in both files using
pip install -r requirements.txt
and
pip install -r requirements-prep.txt
To run the data preparation module through the CLI, run:
- For the "Press Release" PDF
python --pdf_path pdf1.pdf --originals_path pdf1/originals --cropped_path pdf1/cropped --text_file_path pdf1/extracted_text.txt --start_page 1 --segment
- For the "Budget Reports" PDF
python --pdf_path pdf2.pdf --originals_path pdf2/originals --cropped_path pdf2/cropped --text_file_path pdf2/extracted_text.txt --start_page 2
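As a rough illustration of how the flags in the commands above fit together, here is a minimal argparse sketch. It is not the module's actual parser: the function name `build_parser`, the help strings, the default for `--start_page`, and the treatment of `--segment` as a boolean switch are all assumptions inferred from how the flags are used above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Data preparation module (sketch)")
    parser.add_argument("--pdf_path", required=True, help="Input PDF file")
    parser.add_argument("--originals_path", required=True,
                        help="Output directory for original images")
    parser.add_argument("--cropped_path", required=True,
                        help="Output directory for cropped images")
    parser.add_argument("--text_file_path", required=True,
                        help="Output file for the extracted text")
    # Assumed default of 0; the settings in this README call the value zero-based.
    parser.add_argument("--start_page", type=int, default=0,
                        help="Zero-based index of the first page to process")
    # Assumed to be an on/off switch, because it is passed without a value.
    parser.add_argument("--segment", action="store_true",
                        help="Assumed meaning: enable a segmentation step")
    return parser


# Parse the "Press Release" arguments without touching the real command line.
press_args = build_parser().parse_args([
    "--pdf_path", "pdf1.pdf",
    "--originals_path", "pdf1/originals",
    "--cropped_path", "pdf1/cropped",
    "--text_file_path", "pdf1/extracted_text.txt",
    "--start_page", "1",
    "--segment",
])
print(press_args.start_page, press_args.segment)
```

Each value-taking flag expects exactly one argument, so every path has to follow its own flag directly.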
The module can also be run directly through python, without the CLI, though the settings then have to be changed manually:
- For the "Press Release" PDF
PDF_PATH = ...
ORIGINAL_IMGS_PATH = "data/pdf1/original_images"
CROPPED_IMGS_PATH = "data/pdf1/cropped_images"
TEXT_FILE_PATH = "data/pdf1/pdf1.txt"
START_PAGE = 1 # zero-based
- For the "Budget Reports" PDF
PDF_PATH = ...
ORIGINAL_IMGS_PATH = "data/pdf2/original_images"
CROPPED_IMGS_PATH = "data/pdf2/cropped_images"
TEXT_FILE_PATH = "data/pdf2/pdf2.txt"
START_PAGE = 2 # zero-based
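To avoid editing the constants by hand each time, the two sets of values above could be kept side by side as presets. The sketch below is only an illustration: `PrepSettings`, `PRESETS`, and `pages_to_process` are hypothetical names, `PDF_PATH` is left out because its value is not given here, and the assumption that processing runs from `START_PAGE` to the last page is mine rather than something the module confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrepSettings:
    """Hypothetical container for the per-PDF constants listed above."""
    original_imgs_path: str
    cropped_imgs_path: str
    text_file_path: str
    start_page: int  # zero-based, as noted in the settings above


PRESETS = {
    "press_release": PrepSettings(
        original_imgs_path="data/pdf1/original_images",
        cropped_imgs_path="data/pdf1/cropped_images",
        text_file_path="data/pdf1/pdf1.txt",
        start_page=1,
    ),
    "budget_reports": PrepSettings(
        original_imgs_path="data/pdf2/original_images",
        cropped_imgs_path="data/pdf2/cropped_images",
        text_file_path="data/pdf2/pdf2.txt",
        start_page=2,
    ),
}


def pages_to_process(num_pages: int, start_page: int) -> range:
    """Zero-based page indices from start_page to the end (assumed behaviour)."""
    return range(start_page, num_pages)


# With a zero-based START_PAGE of 1, the first page (index 0) is skipped.
print(list(pages_to_process(5, PRESETS["press_release"].start_page)))  # → [1, 2, 3, 4]
```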
- You can find the RAG deployed by clicking this link.
- To manually run the Streamlit app, navigate to the root project directory and run
streamlit run
If you get a 429 error from the OpenAI API, please switch the API key or contact me to switch it. To stay within the rate limits, the deployed app only has the text for the "Press Release" PDF. If you want it to work on text from both PDFs:
- Navigate to the file, and change line 20 from
loader = DirectoryLoader('data/pdf1', glob="**/*.txt", loader_cls=TextLoader)
to
loader = DirectoryLoader('data/', glob="**/*.txt", loader_cls=TextLoader)
- Run the app locally with the command shown above.
- The link may show the app as asleep due to inactivity. In that case, please reach out and I will respond as soon as possible.
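To see why changing the loader directory from `data/pdf1` to `data/` brings in the text from both PDFs, the sketch below applies the same `**/*.txt` pattern with the standard library's `pathlib`. It is a stand-in, not the app's code: it assumes the loader resolves its glob relative to the directory it is given, and it builds a throwaway copy of the `data/pdf1/pdf1.txt` and `data/pdf2/pdf2.txt` layout in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"

    # Recreate the text file layout from the settings in this README.
    for name in ("pdf1/pdf1.txt", "pdf2/pdf2.txt"):
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder text")

    # Pattern applied from data/pdf1: only the "Press Release" text matches.
    only_pdf1 = sorted(
        p.relative_to(root).as_posix() for p in (root / "pdf1").glob("**/*.txt")
    )

    # Same pattern applied from data/: text files of both PDFs match.
    both = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(only_pdf1)  # → ['pdf1/pdf1.txt']
print(both)       # → ['pdf1/pdf1.txt', 'pdf2/pdf2.txt']
```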