- About
- Problem Statement
- Our Proposed Solution
- Workflow
- Reporting issues
- FAIR practices
- License
- Team Members
- Acknowledgements
This repository belongs to Team #5, **SPARC CHAT**, which took part in the SPARC Codeathon 2023. The project's concept and plan were formulated collaboratively by the team members during the event.
For a new user, navigating unfamiliar resources like SPARC and its associated portals can be quite challenging, especially when trying to find specific information quickly. Acquiring relevant information or datasets can require extensive exploration, costing significant time and effort, and users may find themselves repeatedly searching for the same things. To achieve their goal, users often need to search through various sections, projects, and pipelines, which becomes a time-consuming task.
The emergence of OpenAI ChatGPT marks a significant advancement in chatbot technology. This next-generation chatbot enables users to interactively and efficiently ask queries and receive relevant answers. However, it is essential to exercise caution while using it. OpenAI ChatGPT is a large language model (LLM) trained on extensive datasets gathered from the internet. Since its launch, numerous closed and open-source LLMs have also been released.
In this project, we leverage open-source LLMs and the available data on the SPARC portal to create a chatbot that assists users in finding the desired links and provides summaries of relevant information. Currently, the chatbot is limited to processing text-based information.
We gathered data from various pages of the SPARC portal, including the SPARC Data & Models page and other provided web links. For our model, we randomly picked 15 datasets that contain descriptions, abstracts, protocols, links to related datasets, and other relevant details.
The data from these datasets, including their descriptions, were stored manually in `.txt` files. They are available in the `texts` folder of the repo.
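The stored descriptions can be read back with a few lines of standard-library Python. A minimal sketch (the function name is illustrative, not from `app.py`):

```python
from pathlib import Path

def load_dataset_texts(folder="texts"):
    """Read every .txt description file in `folder` into a dict
    keyed by the filename without its extension."""
    docs = {}
    for path in sorted(Path(folder).glob("*.txt")):
        docs[path.stem] = path.read_text(encoding="utf-8")
    return docs
```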
We use publicly available HuggingFace models to vectorize our data. We then retrieve the relevant information for each prompt, generate an answer through an LLM, and finally serve the result through a Gradio GUI.
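The vectorize-and-retrieve step can be sketched as follows. This is a toy illustration only: a bag-of-words counter stands in for the HuggingFace embedding model, and cosine similarity ranks the stored dataset descriptions against the user's query.

```python
from collections import Counter
import math

def vectorize(text):
    # Toy bag-of-words stand-in for a HuggingFace sentence embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=3):
    # Rank stored dataset descriptions by similarity to the query
    # and return the names of the top-k matches.
    qv = vectorize(query)
    scored = sorted(docs.items(),
                    key=lambda kv: cosine(qv, vectorize(kv[1])),
                    reverse=True)
    return [name for name, _ in scored[:k]]
```

The retrieved descriptions are then passed to the LLM as context for answering the query.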
- Create a virtual environment: `conda create -n chat`
- Activate the virtual environment: `conda activate chat`
- Install the requirements: `pip install -r requirements.txt`
- Run the app: `python app.py --hf_token <YOUR-HUGGING-FACE_TOKEN>`
- Open the app in your browser: `http://127.0.0.1:7860`
You should see the Gradio interface running locally, and you will be prompted to enter your query, like so:
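Behind the interface, the retrieved dataset snippets are combined with the user's question into a single prompt for the LLM. A rough sketch of that assembly step (the function name and prompt wording are illustrative, not the exact format used in `app.py`):

```python
def build_prompt(question, snippets):
    """Combine retrieved dataset snippets with the user's question
    into one prompt string for the LLM."""
    context = "\n\n".join(f"- {s}" for s in snippets)
    return (
        "Answer the question using only the SPARC dataset context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```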
If you run into issues installing `hnswlib`, try installing it from source: `pip install git+https://github.com/nmslib/hnswlib.git`. You may also need to run `export HNSWLIB_NO_NATIVE=1`; see the ongoing GitHub thread for the discussion. Then proceed with installing the requirements from `requirements.txt`.
Please report an issue or suggest a new feature using the issue page. Check existing issues before submitting a new one.
Since the codeathon focused on FAIR data principles, SPARC CHAT also adheres to FAIR principles.
- Alireza Moshayedi [Lead]
- Lee Jia Lin [System Engineer]
- Anmol Kiran [Writer - Documentation]
This code is licensed under the MIT License.
- We may change to another license if needed.
We would like to thank the organizers of the SPARC Codeathon 2023 for their guidance and help during the Codeathon.
- FAIR practices statement for this project