This repository contains a system for generating question-answer pairs for FINE-TUNING LLMs
from the data you have. The system leverages various modules to extract text, generate questions using a language model, and save the generated questions.
- Added a feature to prcoess HTML as input files.
- Added a feature to remove duplicate and similar questions.
- Simplified the JSONL ouput format cleaning process.
- VLLM
- OpenAI API
- Azure OpenAI API
- Ollama
- Clone the repository:
git clone https://github.com/yourusername/question-generation.git
cd question-generation
- Create a virtual environment and activate it:
python3.11 -m venv .venv
source .venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install the required dependencies:
pip install -r requirements.txt
- Copy the example environment file and configure it:
cp .env.example .env
- Update the .env file with your API URL and API Key.
The configuration for the model is specified in the config.json file. You can update the model name or other parameters as needed:
{
"inference_engine": "azure", # inference engine name here
"model_name": "llama3.1", # model name here
"model_max_tokens": 10000, # model's max tokens here
"input_folder": "input_data", # input data location
"output_folder": "generated_questions", # output data location
"chroma_db_path": "chromadb", # vector db location
"chroma_collection_name": "questions", # vectordb collection name
"duplicate_threshold": 0.1 # duplicate checking threshold
}
-
Place your input files in the
input_data
folder. -
To run the question generation process, execute the main.py script:
python main.py
- The system prompt for generating question-answer pairs is located in the
prompts
folder asgenerateQA-sys_prompt.txt
Contributions are welcome! Please open an issue or submit a pull request for any changes.
This project is licensed under the Apache-2.0 license. See the LICENSE file for details.