EchoBot is a Telegram bot designed to streamline data collection for research on Large Language Models (LLMs), with a specific focus on languages spoken in Southeast Asia. It transcribes and translates audio and video files in Southeast Asian languages and returns the result in JSON format; with the user's consent, the data is then included in the public dataset. It is an easy-to-use tool that comes with extensive documentation and a friendly user interface.
Explore our dataset for a closer look at how all the features finally come together.
(Currently not hosted due to free plan limitations)
Start interacting with the bot here.
OR
Find us on Telegram, whether you're using a desktop or mobile device. Simply search for echo_kd_bot in the search bar.
- Start using EchoBot with the /start command. You can then choose from a variety of translation preferences.
- To share files, upload video or audio files, or record new audio/video messages.
- Use the /trmode command to switch between translation settings:
  - Both: Receive a transcription of your uploaded media's audio in the original language, as well as an English translation of it.
  - English only: Receive only the English translation of your uploaded media's audio.
  - SEA only: Receive only the transcription of your uploaded media in the original language.
- View all supported languages using the /languages command.
- The /privacy command toggles between privacy settings. You can choose to keep your data private or to contribute it anonymously to our dataset.
- Use /info to learn more about our project and data policies.
- Use the /help command to get a quick overview of how to use all features within the bot.

To test the bot, you may use the following files: Test file 1, Test file 2. With the default privacy options, the file transcription/translation won't be stored in the dataset.
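The three /trmode settings amount to choosing which text fields are returned. A minimal sketch of that selection is shown here; the mode names, the function name, and the output keys are hypothetical and are not taken from the bot's source.

```python
# Hypothetical mode names mirroring the three /trmode settings described above.
MODES = ("both", "english_only", "sea_only")


def apply_trmode(transcription: str, translation: str, mode: str) -> dict:
    """Return only the text fields that the selected mode asks for."""
    if mode not in MODES:
        raise ValueError(f"unknown translation mode: {mode!r}")
    result = {}
    if mode in ("both", "sea_only"):
        result["transcription"] = transcription  # original-language text
    if mode in ("both", "english_only"):
        result["translation"] = translation      # English text
    return result


print(apply_trmode("Xin chào", "Hello", "english_only"))
```

Keeping the selection in one pure function means the same transcription result can be reshaped whenever the user changes mode, without reprocessing the media.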
- Relevancy to SEA-based Research
  On receiving media, the bot automatically detects its language and accepts only languages from the Southeast Asian region. It currently supports Vietnamese, Thai, Indonesian, Burmese/Myanmar, Malay, Lao/Laotian, and Khmer.
- Data Compatible with LLM Corpora Policies
  The bot generates a transcript for all provided media, along with an English translation. A JSON file containing both the transcribed and the translated text is returned to the user. The data within this file is structured as key-value pairs, where the key denotes the identified language. The JSON formatting of potential training data allows the respective LLM to understand and respond to conversational cues, such as questions, requests, and follow-up questions.
- Supports multiple media formats
  The bot accepts all audio and video formats, including .mp4, .mov, and .webm for video, as well as .wav, .mp3, .mpeg, and .oga for audio-focused content. The user can upload any on-device media to the chat, or record content directly within the Telegram app.
- Open Sourced Data Collection
  Users around the world can contribute to our dataset through the bot. Within the dataset, records are structured according to the following fields: SEA_TRANSCRIPTION, containing the transcription of the uploaded file in its regional language; ENG, containing the respective English translation; and LANGUAGE, identifying the original language of the content.
- User Friendly Interface
  The bot features an easy-to-use, intuitive interface with clear instructions and helpful messages. The /help command makes it simple to navigate through all the included features. The user can upload multiple media files at once, and the bot notifies them once those files have been processed.
- Respect for User Privacy
  We respect the user's privacy and their concerns about shared media. Accordingly, the bot ensures secure removal of all uploaded files. Processed data is not stored in the open-source dataset without the user's consent.
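To make the file-format and record-structure features concrete, here is a minimal sketch. The extension sets and the three field names come from the list above; the function names are hypothetical. Since the bot is described as accepting all audio and video formats, an "unknown" result here only means the extension is not one of those explicitly listed, not that the file would be rejected.

```python
import json
from pathlib import Path

# Extensions named explicitly in the feature list above.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm"}
AUDIO_EXTENSIONS = {".wav", ".mp3", ".mpeg", ".oga"}


def media_kind(filename: str) -> str:
    """Classify a file as 'video', 'audio', or 'unknown' by its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return "unknown"


def build_record(language: str, transcription: str, translation: str) -> dict:
    """Assemble one record using the dataset's three field names."""
    return {
        "SEA_TRANSCRIPTION": transcription,
        "ENG": translation,
        "LANGUAGE": language,
    }


record = build_record("Vietnamese", "Xin chào", "Hello")
print(media_kind("clip.MP4"))
# ensure_ascii=False keeps non-Latin scripts such as Thai or Khmer readable in the output.
print(json.dumps(record, ensure_ascii=False, indent=2))
```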
The bot handles a wide range of user inputs, including audio, video, and all in-app recorded content. It automatically recognises the language of the uploaded media, which helps filter out data irrelevant to the SEA context. Once the language is identified, the files provided by the user are fetched from the Telegram server and passed through a machine learning backend. EchoBot is powered by OpenAI's Whisper model, which produces highly accurate media transcriptions. On demand, English translations of the uploaded media are also generated.
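The detect, filter, transcribe, and translate steps described above could be sketched as follows with the open-source whisper package. This is an illustration under assumptions, not the bot's actual code: the function names are hypothetical, and the "small" checkpoint is assumed because it is the Whisper size with 244 million parameters.

```python
from typing import Optional

# Whisper language codes for the seven languages the bot lists as supported.
SEA_LANGUAGES = {
    "vi": "Vietnamese",
    "th": "Thai",
    "id": "Indonesian",
    "my": "Burmese/Myanmar",
    "ms": "Malay",
    "lo": "Lao/Laotian",
    "km": "Khmer",
}


def is_sea_language(code: str) -> bool:
    """True when a detected language code is one of the supported SEA languages."""
    return code in SEA_LANGUAGES


def process_media(path: str, model_name: str = "small") -> Optional[dict]:
    """Transcribe and translate a file, or return None if it is not in a SEA language."""
    import whisper  # imported lazily so the helpers above work without the package

    model = whisper.load_model(model_name)
    # With no language given, Whisper detects it and reports the code in the result.
    transcription = model.transcribe(path)
    code = transcription["language"]
    if not is_sea_language(code):
        return None
    # task="translate" asks Whisper for English output instead of a transcript.
    translation = model.transcribe(path, task="translate", language=code)
    return {
        "SEA_TRANSCRIPTION": transcription["text"].strip(),
        "ENG": translation["text"].strip(),
        "LANGUAGE": SEA_LANGUAGES[code],
    }
```

Loading the model on every call is wasteful; a long-running bot would more plausibly load it once at startup and reuse it.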
After processing, the bot integrates anonymized content into an open-source dataset, respecting the user's privacy preferences. This dataset is currently built upon the Google Sheets API. Where privacy settings restrict a submission, the bot ensures the secure removal of all associated files. Users are kept well-informed with clear confirmation messages at each stage of the process.
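The consent check before writing to the dataset can be sketched as below. The function and class names are hypothetical. The sketch only assumes a worksheet object with an append_row method; the gspread library's Worksheet offers a method of that name, but whether the bot uses gspread is an assumption, so an in-memory stand-in is used here to keep the example runnable offline.

```python
class FakeWorksheet:
    """In-memory stand-in for a spreadsheet tab, so the sketch runs without credentials."""

    def __init__(self):
        self.rows = []

    def append_row(self, values):
        self.rows.append(list(values))


def store_if_consented(worksheet, record: dict, consent: bool) -> bool:
    """Append the record as one row only when the user opted in; report whether it was stored."""
    if not consent:
        return False
    # Only the three dataset fields are written, so no user identifier reaches the sheet.
    worksheet.append_row(
        [record["SEA_TRANSCRIPTION"], record["ENG"], record["LANGUAGE"]]
    )
    return True


sheet = FakeWorksheet()
record = {"SEA_TRANSCRIPTION": "Xin chào", "ENG": "Hello", "LANGUAGE": "Vietnamese"}
print(store_if_consented(sheet, record, consent=False), len(sheet.rows))
print(store_if_consented(sheet, record, consent=True), len(sheet.rows))
```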
The bot uses a model with 244 million parameters and is hosted on a virtual machine with 4 vCPUs and 8 GB of RAM.
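As a rough back-of-the-envelope check of how the model size relates to the 8 GB of RAM, the weights alone can be estimated from the parameter count. This sketch assumes 4 bytes per parameter (32-bit floats) and ignores activations, audio buffers, and the Python runtime, so it is a lower bound rather than a measurement.

```python
def weights_size_gb(parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate size of the model weights alone, in gigabytes (10**9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


print(round(weights_size_gb(244_000_000), 3))     # 32-bit floats
print(round(weights_size_gb(244_000_000, 2), 3))  # 16-bit floats
```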
Requires Python version 3.8 *
To build the bot from source, run the following commands:
git clone https://github.com/kashvigarg/echobot
cd echobot
The bot requires ffmpeg to be installed on the system:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
Once inside the repository, install the dependencies:
pip install -r requirements.txt
To run the bot locally, you'll need a Telegram Bot Token and a Google Cloud Service Account Key.
Obtaining a bot token is as simple as contacting @BotFather, issuing the /newbot command, and following the steps until you're given a new token. You can find a step-by-step guide here.
To generate a Service Account Key, you can follow the instructions mentioned in this article.
Save the value of the bot token in the .env file as
BOT_TOKEN="Value of token"
and save the service account JSON file as service_account.json, both in the root folder.
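A quick way to check that both files are in place before starting the bot is sketched below. The helper names are hypothetical, and the .env parsing is a standard-library stand-in written for this example; the project itself may well load the file with a package such as python-dotenv.

```python
import os


def parse_env_text(text: str) -> dict:
    """Parse simple KEY="value" lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(root: str = ".") -> dict:
    """Read BOT_TOKEN from .env and confirm service_account.json exists in the root folder."""
    with open(os.path.join(root, ".env"), encoding="utf-8") as handle:
        env = parse_env_text(handle.read())
    token = env.get("BOT_TOKEN")
    if not token:
        raise RuntimeError("BOT_TOKEN is missing from .env")
    key_path = os.path.join(root, "service_account.json")
    if not os.path.exists(key_path):
        raise RuntimeError("service_account.json not found in the root folder")
    return {"bot_token": token, "service_account_path": key_path}
```

Calling load_settings() from the repository root raises a descriptive error if either file is missing, which is easier to diagnose than a failure deep inside the bot's startup.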
Finally, run the following command to start the bot:
python bot.py
Note: *Later versions of Python may cause version mismatches among the installed Python libraries, which may need to be resolved manually.
Note: The language model currently cannot recognize a file as being in a SEA language if the first few seconds of the file contain any other language. In some cases, it may also hallucinate while processing very short (1-4 second) files.
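One possible mitigation for the short-file limitation is to check a clip's duration before transcription and warn the user. The sketch below is not part of the bot; the 5-second threshold and the function names are assumptions, and it relies on ffprobe, which is usually installed together with ffmpeg.

```python
import subprocess

MIN_SECONDS = 5.0  # assumed threshold, just above the 1-4 second range noted above


def probe_duration(path: str) -> float:
    """Ask ffprobe for the media duration in seconds."""
    output = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return float(output.strip())


def is_too_short(duration_seconds: float, minimum: float = MIN_SECONDS) -> bool:
    """True when a clip is short enough that transcription may be unreliable."""
    return duration_seconds < minimum


print(is_too_short(3.2))
print(is_too_short(12.0))
```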
This project won first place in the bounty organized by Infocomm Media Authority, Singapore and Stackup.