This is the dataset of E-QGen: Educational Lecture Abstract-based Question Generation System
This dataset was constructed using the method described in the paper E-QGen: Educational Lecture Abstract-based Question Generation System. We collected transcripts of online courses on YouTube and matched them with corresponding questions from the comment section. The dataset mainly focuses on lectures related to computer science. In total, 356 golden pairs, 4,434 silver pairs, and 4,829 platinum pairs were collected. Please refer to the paper for a more detailed description of the collection procedure and the dataset.
In this repo, we provide direct access to our dataset, which consists of paragraph and question pairs.
Golden pairs are constructed by matching the timestamps back to the specific transcripts.
golden_pair_3agree.csv, golden_pair_2agree.csv
- The suffix of the file name indicates the number of LLMs that agreed when filtering questions out of the comments.
golden_pair_3agree_notime_gpt4.csv, golden_pair_2agree_notime_notime_gpt4.csv
- The suffix _notime_gpt4 means that the timestamps have been removed from the questions. Since removing a timestamp can leave a sentence reading awkwardly, we used GPT-4 to refine the question comments.
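The timestamp matching used for the golden pairs can be illustrated with a minimal sketch. It assumes that a comment carries a timestamp such as 12:34 or 1:02:03 and that each transcript paragraph has a known start time in seconds. The function names, the regular expression, and the paragraph start times are hypothetical and are not taken from the paper or the released files.

```python
import re
from bisect import bisect_right

# Matches m:ss or h:mm:ss timestamps (an assumed comment format).
TIMESTAMP = re.compile(r"\b(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\b")


def timestamp_to_seconds(comment):
    """Return the first timestamp found in a comment as seconds, or None."""
    match = TIMESTAMP.search(comment)
    if match is None:
        return None
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)


def find_paragraph(paragraph_starts, t):
    """Return the index of the paragraph with the latest start time <= t.

    paragraph_starts must be sorted in ascending order (in seconds).
    """
    return max(bisect_right(paragraph_starts, t) - 1, 0)


# Hypothetical paragraph start times, in seconds.
starts = [0, 60, 180]
t = timestamp_to_seconds("At 1:15, why is the loop bounded by n?")
if t is not None:
    print(find_paragraph(starts, t))
```

A binary search over sorted start times keeps the lookup cheap even for long lectures; how the original pipeline segments transcripts into paragraphs is described in the paper, not here.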
Silver pairs (silver_pairs.csv) are collected by matching comments without timestamps to lecture paragraphs. We compute the cosine similarity using PaLM, PaLM embeddings, and Sentence Transformer embeddings.
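As a rough illustration of similarity-based matching, the sketch below scores a comment embedding against several paragraph embeddings with cosine similarity and picks the best one. The vectors are toy values; producing real embeddings, and any threshold applied to the scores, are outside this sketch, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def best_paragraph(comment_vec, paragraph_vecs):
    """Return (index, score) of the paragraph most similar to the comment."""
    scores = [cosine_similarity(comment_vec, p) for p in paragraph_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy embeddings standing in for model outputs.
comment = [1.0, 0.0]
paragraphs = [[0.0, 1.0], [2.0, 0.0]]
index, score = best_paragraph(comment, paragraphs)
print(index, score)
```

Cosine similarity ignores vector length, so the second toy paragraph matches the comment perfectly despite its larger magnitude.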
Platinum pairs (platinum_pairs.csv) are generated by the OpenAI GPT-4 model. We ask GPT-4 to generate 20 questions for each lecture paragraph.
- Paragraph and question pairs are collected from the MIT OpenCourseWare and Stanford Online YouTube channels.
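To inspect any of the CSV files, a small loader along these lines may be enough. It makes no assumption about column names; it reads whatever header the file has, so printing the keys of the first row is a quick way to see the actual schema. The helper name load_pairs is hypothetical.

```python
import csv


def load_pairs(path):
    """Read a CSV file into a list of dicts keyed by its header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    import os

    path = "silver_pairs.csv"  # any of the files listed above
    if os.path.exists(path):
        rows = load_pairs(path)
        print(len(rows), "pairs")
        if rows:
            print(list(rows[0].keys()))  # shows the real column names
```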