Dataset Generator for Fine-Tuning

This repository contains a system for generating question-answer pairs for FINE-TUNING LLMs from the data you have. The system leverages various modules to extract text, generate questions using a language model, and save the generated questions.

Latest Update

Added a feature to prcoess HTML as input files.
Added a feature to remove duplicate and similar questions.
Simplified the JSONL ouput format cleaning process.

Architecture Diagram

Supported Inference Engine

VLLM
OpenAI API
Azure OpenAI API
Ollama

Installation

Clone the repository:

git clone https://github.com/yourusername/question-generation.git
cd question-generation

Create a virtual environment and activate it:

python3.11 -m venv .venv
source .venv/bin/activate # On Windows use `venv\Scripts\activate`

Install the required dependencies:

pip install -r requirements.txt

Copy the example environment file and configure it:

cp .env.example .env

Update the .env file with your API URL and API Key.

Configuration

The configuration for the model is specified in the config.json file. You can update the model name or other parameters as needed:

{   
    "inference_engine": "azure", # inference engine name here
    "model_name": "llama3.1", # model name here
    "model_max_tokens": 10000, # model's max tokens here
    "input_folder": "input_data", # input data location
    "output_folder": "generated_questions", # output data location
    "chroma_db_path": "chromadb", # vector db location
    "chroma_collection_name": "questions", # vectordb collection name
    "duplicate_threshold": 0.1 # duplicate checking threshold
}

Usage

Place your input files in the input_data folder.
To run the question generation process, execute the main.py script:

python main.py

Prompts

The system prompt for generating question-answer pairs is located in the prompts folder as generateQA-sys_prompt.txt

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any changes.

License

This project is licensed under the Apache-2.0 license. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 202 Commits
.github		.github
asserts		asserts
prompts		prompts
src		src
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.json		config.json
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dataset Generator for Fine-Tuning

Latest Update

Architecture Diagram

Table of Contents

Supported Inference Engine

Installation

Configuration

Usage

Prompts

Contributing

License

About

Releases

Packages

Languages

License

shrijayan/dataset_generator

Folders and files

Latest commit

History

Repository files navigation

Dataset Generator for Fine-Tuning

Latest Update

Architecture Diagram

Table of Contents

Supported Inference Engine

Installation

Configuration

Usage

Prompts

Contributing

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages