[feature request] Langchain-based quality evaluation for generated QA pairs #20

NISH1001 commented May 24, 2023

What

One downstream application of LLMs is to auto-generate an initial list of question-answer pairs from a given scientific document. Currently, this is attempted through the codebase.

We want to evaluate the "quality" of questions that are generated from a given text document.

Why

We want a zero-shot (or maybe few-shot) out-of-the-box evaluation using LLMs such as GPT-3.5/4, Anthropic's Claude, or LLaMA in general.
The evaluation could help accelerate annotation for generative QA tasks.

How

We can use LangChain to help achieve this with proper prompt engineering.

An initial dummy attempt using GPT-3.5/4 uses a simple prompt hack to evaluate only the questions (not the answers), such as:

```
You are an expert scientist grading the quality of questions from scientific documents.

You are given a scientific document text and a question and are asked to score the question as GOOD or BAD.

Example format:
DOCUMENT: document here
QUESTION: question here
GRADE: GOOD or BAD here

Grade the quality of the question based only on its factual accuracy with respect to the provided document text.
Grade simple factoid questions as BAD if they are "wh" questions (such as what/how, etc.) or otherwise very simple.
Grade questions as GOOD if they are complex and can be answered from the given document. Begin!

DOCUMENT: "Menu National Snow and Ice Data Center NSIDC a part of CIRES at the University of Colorado Boulder Skip to main content Main navigation News & Analyses News & Stories Scientific Analyses About our Analyses Snow Today Greenland Today & Antarctic Ice Sheet Today Arctic Sea Ice News & Analysis (ASINA) Multimedia Data Explore Data Visualize Data Submit Data Submit NASA Data to NSIDC DAAC Submit Data to Other NSIDC Programs User Resources Get Started with Data Data Announcements Help Center Data Tools Documents Levels of service NASA Earthdata Forum Data Programs About our Programs NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC) NOAA at NSIDC Exchange for Observations and Local Knowledge of the Arctic (ELOKA) Data Policies Our Research Learn What is the Cryosphere? Parts of the Cryosphere Arctic Weather & Climate Frozen Ground & Permafrost Glaciers Ice Sheets Ice Shelves Sea Ice Snow Ask a Scientist Cryosphere glossary About About NSIDC What we do Our People Published Research Our History Diversity, Equity & Inclusion Careers For the Media Contact Us Citation Policies Web Policy Land Acknowledgement Search News & Analyses News & Stories Scientific Analyses About our Analyses Snow Today Greenland Today & Antarctic Ice Sheet Today Arctic Sea Ice News & Analysis (ASINA) Multimedia Data Explore Data Visualize Data Submit Data Submit NASA Data to NSIDC DAAC Submit Data to Other NSIDC Programs User Resources Get Started with Data Data Announcements Help Center Data Tools Documents Levels of service NASA Earthdata Forum Data Programs About our Programs NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC DAAC) NOAA at NSIDC Exchange for Observations and Local Knowledge of the Arctic (ELOKA) Data Policies Our Research Learn What is the Cryosphere? Parts of the Cryosphere Arctic Weather & Climate Frozen Ground"

QUESTION: What is NSIDC?

GRADE:
```
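For reference, here is a minimal sketch of how this grading prompt could be wired up through LangChain (assuming the `langchain` 0.0.x API; the abbreviated template and the `document_text` variable are illustrative, not part of any existing code):

```python
# Minimal sketch: run the grading prompt above through LangChain.
# The template is abbreviated with "..." here; `document_text` stands in
# for the scraped NSIDC page text shown in the example above.
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

GRADING_TEMPLATE = """You are an expert scientist grading the quality of questions from scientific documents.
...
DOCUMENT: {document}
QUESTION: {question}
GRADE:"""

prompt = PromptTemplate(
    input_variables=["document", "question"],
    template=GRADING_TEMPLATE,
)

# temperature=0 to keep the GOOD/BAD grading deterministic
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

document_text = "Menu National Snow and Ice Data Center NSIDC ..."
grade = chain.run(document=document_text, question="What is NSIDC?")
print(grade)  # expected to be "GOOD" or "BAD"
```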

To achieve this, we could have a new QA evaluation component like LangChainBasedQAEvaluator where we can provide prompt templates. Something like:

```python
from evalem.evaluators import LangChainBasedQAEvaluator

qa_data = [dict(context=<CONTEXT>, question=<QUESTION>, answer=<ANSWER>), ...]

evaluator = LangChainBasedQAEvaluator(prompt=<PROMPT>, llm=<MAYBE_OPENAI>)
res = evaluator(qa_data, references)
```
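As a rough illustration (the class internals below are an assumption, not an existing evalem API), the evaluator could simply wrap an LLMChain and grade each (context, question) pair:

```python
# Hypothetical internals for the proposed evaluator; names are illustrative.
from typing import Dict, List

from langchain.chains import LLMChain


class LangChainBasedQAEvaluator:
    def __init__(self, prompt, llm) -> None:
        # `prompt` is a LangChain PromptTemplate with `document`/`question`
        # variables, `llm` is any LangChain-compatible LLM (e.g., OpenAI).
        self.chain = LLMChain(llm=llm, prompt=prompt)

    def __call__(self, qa_data: List[Dict], references=None) -> List[str]:
        # Grade each (context, question) pair; `references` is unused for
        # question-only grading but kept for interface parity with evalem.
        return [
            self.chain.run(document=d["context"], question=d["question"]).strip()
            for d in qa_data
        ]
```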

Or, instead of an actual evaluator, this could just be a LangChain-based metric that outputs 0 (BAD) or 1 (GOOD) and computes the GOOD-ness of the generated questions.

```python
from evalem.metrics import LangChainBasedQuestionQualityMetric

inputs = [dict(context=<CONTEXT>, question=<QUESTION>, answer=<ANSWER>), ...]

metric = LangChainBasedQuestionQualityMetric(...)
res = metric(inputs, references)
...
```
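A hypothetical sketch of how such a metric could map GOOD/BAD grades to 1/0 and aggregate them (the class name mirrors the proposal above; the aggregation and return format are assumptions):

```python
# Hypothetical metric sketch: convert per-question GOOD/BAD grades into
# binary scores and report the fraction of GOOD questions.
class LangChainBasedQuestionQualityMetric:
    def __init__(self, evaluator) -> None:
        # reuse the evaluator sketched above to obtain per-question grades
        self.evaluator = evaluator

    def __call__(self, inputs, references=None) -> dict:
        grades = self.evaluator(inputs, references)
        scores = [1 if g.upper().startswith("GOOD") else 0 for g in grades]
        return {
            "scores": scores,
            "goodness": sum(scores) / len(scores) if scores else 0.0,
        }
```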

References


cc: @muthukumaranR @xhagrg
