Arabic RAG Application

System architecture

The system has two main components:

Data preparation module
Streamlit app

Data preparation

RAG

Instructions

This system was developed and run on a machine with an L4 GPU, 24 GB of memory, and 16 CPUs. If no GPU is available, do not attempt to run it.

Env

To set up the environment, first create a virtual environment: python3 -m venv .
There are two requirements files: requirements.txt covers the Streamlit app and RAG dependencies, and requirements-prep.txt covers the data processing module dependencies.
Install the libraries in both files using pip install -r requirements.txt and pip install -r requirements-prep.txt

Data preparation

via CLI

To run the data preparation module through the CLI, run:

For the "Press Release" PDF: python cli.py --pdf_path pdf1.pdf --originals_path pdf1/originals --cropped_path pdf1/cropped --text_file_path pdf1/extracted_text.txt --start_page 1 --segment
For the "Budget Reports" PDF: python cli.py --pdf_path pdf2.pdf --originals_path pdf2/originals --cropped_path pdf2/cropped --text_file_path pdf2/extracted_text.txt --start_page 2
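The flags in the commands above map naturally onto an argparse parser. The following is only a sketch of how such a parser could look, not the actual contents of cli.py: the flag names are taken from the commands, while the help texts, the required/optional split, and the default for --start_page are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags used in the commands above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Data preparation module")
    parser.add_argument("--pdf_path", required=True, help="Path to the input PDF")
    parser.add_argument("--originals_path", required=True, help="Directory for the original page images")
    parser.add_argument("--cropped_path", required=True, help="Directory for the cropped page images")
    parser.add_argument("--text_file_path", required=True, help="Output file for the extracted text")
    # Assumed default; the real cli.py may require this flag or use another value.
    parser.add_argument("--start_page", type=int, default=0, help="First page to process")
    # Boolean switch: passed for "Press Release", omitted for "Budget Reports".
    parser.add_argument("--segment", action="store_true", help="Turn on the segment option")
    return parser


# Parse the "Press Release" command from an explicit argument list.
args = build_parser().parse_args([
    "--pdf_path", "pdf1.pdf",
    "--originals_path", "pdf1/originals",
    "--cropped_path", "pdf1/cropped",
    "--text_file_path", "pdf1/extracted_text.txt",
    "--start_page", "1",
    "--segment",
])
print(args.start_page, args.segment)  # 1 True
```

With action="store_true", leaving out --segment yields False, which matches the "Budget Reports" command not passing the flag.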

via module

The module can also be run directly, without the CLI, through python processor.py. In that case the settings have to be changed manually in config.py.

For the "Press Release" PDF:

PDF_PATH = ...
ORIGINAL_IMGS_PATH = "data/pdf1/original_images"
CROPPED_IMGS_PATH = "data/pdf1/cropped_images"
TEXT_FILE_PATH = "data/pdf1/pdf1.txt"
START_PAGE = 1  # zero-based
SEGMENT = True

For the "Budget Reports" PDF:

PDF_PATH = ...
ORIGINAL_IMGS_PATH = "data/pdf2/original_images"
CROPPED_IMGS_PATH = "data/pdf2/cropped_images"
TEXT_FILE_PATH = "data/pdf2/pdf2.txt"
START_PAGE = 2  # zero-based
SEGMENT = False
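Editing config.py by hand for each PDF is easy to get wrong, so one possible alternative is to keep both sets of values as presets. The sketch below assumes nothing about how processor.py reads its settings; PrepConfig and make_config are hypothetical names, and pdf_path is left as a caller-supplied argument because the README does not give it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrepConfig:
    """Settings mirroring the constants in config.py (hypothetical container)."""
    pdf_path: str
    original_imgs_path: str
    cropped_imgs_path: str
    text_file_path: str
    start_page: int  # zero-based
    segment: bool


def make_config(name: str, pdf_path: str) -> PrepConfig:
    """Return the settings listed above for "pdf1" (Press Release) or "pdf2" (Budget Reports)."""
    presets = {
        "pdf1": {"start_page": 1, "segment": True},
        "pdf2": {"start_page": 2, "segment": False},
    }
    if name not in presets:
        raise ValueError(f"unknown preset {name!r}; expected one of {sorted(presets)}")
    return PrepConfig(
        pdf_path=pdf_path,  # not specified in the README, so supplied by the caller
        original_imgs_path=f"data/{name}/original_images",
        cropped_imgs_path=f"data/{name}/cropped_images",
        text_file_path=f"data/{name}/{name}.txt",
        **presets[name],
    )
```

Calling make_config("pdf2", pdf_path) with your own PDF location then yields the "Budget Reports" values in one step.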

App

You can find the RAG deployed by clicking this link.
To run the Streamlit app manually, navigate to the root project directory and run streamlit run app.py

If you get a 429 error from the OpenAI API, please switch the API key or contact me so I can switch it myself. For rate limiting purposes, the deployed app only has the text for the "Press Release" PDF. If you wish to have it work on text from both PDFs:

Navigate to the app.py file, and change line 20 from loader = DirectoryLoader('data/pdf1', glob="**/*.txt", loader_cls=TextLoader) to loader = DirectoryLoader('data/', glob="**/*.txt", loader_cls=TextLoader)

Run the app locally with the command shown above.
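To see why changing the directory is enough, the sketch below reproduces the effect of the glob pattern with the standard library only. It is not how DirectoryLoader is implemented; find_text_files is a hypothetical helper, and the temporary file layout assumes the TEXT_FILE_PATH values from the config section.

```python
import tempfile
from pathlib import Path


def find_text_files(root: str, pattern: str = "**/*.txt") -> list[str]:
    """Return the sorted files under root that match pattern, as paths relative to root."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the assumed layout: data/pdf1/pdf1.txt and data/pdf2/pdf2.txt
    for name in ("pdf1", "pdf2"):
        folder = Path(tmp) / "data" / name
        folder.mkdir(parents=True)
        (folder / f"{name}.txt").write_text("placeholder", encoding="utf-8")

    only_pdf1 = find_text_files(str(Path(tmp) / "data" / "pdf1"))
    both = find_text_files(str(Path(tmp) / "data"))

print(only_pdf1)  # ['pdf1.txt']
print(both)       # ['pdf1/pdf1.txt', 'pdf2/pdf2.txt']
```

Because the pattern is recursive, pointing the loader at data/ picks up the text files of both PDFs without any other change.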

The link may show the app as asleep because it has been inactive. In that case, please reach out so that I can respond as soon as possible.

kareemamrr/Arabic-RAG

Folders and files

Latest commit

History

Repository files navigation

Arabic RAG Application

System architecture

Data preparation

RAG

Instructions

Env

Data preparation

via CLI

via module

App

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages