
Created MultipleChoiceQuestion and MultipleChoiceEvaluation #157

Merged: jamesbraza merged 3 commits into main from multiple-choice on Dec 18, 2024

Conversation

jamesbraza (Collaborator) commented:

This PR generalizes https://github.com/Future-House/paper-qa/blob/a630d922d76551e9cd0258c62e2012cd5d459937/paperqa/litqa.py by creating a common multiple-choice class that other environments besides paper-qa can use. It also aims to:
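
As a rough illustration of how an environment other than paper-qa might consume these pieces: the grade_letter helper and the (accuracy, precision) return order below are hypothetical assumptions; only the enum members and calculate_accuracy_precision are confirmed by this PR's discussion.

from aviary.utils import MultipleChoiceEvaluation


def grade_letter(chosen: str | None, ideal: str) -> MultipleChoiceEvaluation:
    """Hypothetical helper: map a model's chosen option letter to an evaluation."""
    if chosen is None:  # the model declined to answer
        return MultipleChoiceEvaluation.UNSURE
    if chosen.strip().upper() == ideal.strip().upper():
        return MultipleChoiceEvaluation.CORRECT
    return MultipleChoiceEvaluation.INCORRECT


evaluations = [grade_letter(c, i) for c, i in [("A", "A"), (None, "B"), ("C", "B")]]
accuracy, precision = MultipleChoiceEvaluation.calculate_accuracy_precision(evaluations)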

jamesbraza added the enhancement (New feature or request) label on Dec 17, 2024
jamesbraza requested review from sidnarayanan and a team on December 17, 2024 at 21:38
jamesbraza self-assigned this on Dec 17, 2024
dosubot (bot) added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) label on Dec 17, 2024

codeflash-ai bot commented Dec 17, 2024

⚡️ Codeflash found optimizations for this PR

📄 MultipleChoiceEvaluation.calculate_accuracy_precision in src/aviary/utils.py

✨ Performance Summary:

  • Speed Increase: 📈 163% (2.63× as fast, since 576 ms / 219 ms ≈ 2.63)
  • Runtime Reduction: ⏱️ From 576 milliseconds down to 219 milliseconds (best of 5 runs)

📝 Explanation and details

Explanation of the optimization:

  1. The code now counts num_correct and num_total and determines whether each evaluation is unsure in a single loop, instead of two separate loops, reducing processing overhead.
  2. As a result, both the accuracy and precision metrics are computed in one pass through the evaluations.
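
A minimal sketch of that single-pass shape, assuming the method returns an (accuracy, precision) tuple with accuracy = correct / total and precision = correct / (total - unsure), and assuming the enum's string values; this is not the exact code merged in src/aviary/utils.py:

from collections.abc import Sequence
from enum import StrEnum
from typing import Self


class MultipleChoiceEvaluation(StrEnum):
    CORRECT = "correct"  # member values are assumptions for this sketch
    INCORRECT = "incorrect"
    UNSURE = "unsure"

    @classmethod
    def calculate_accuracy_precision(
        cls, evaluations: Sequence[Self | str]
    ) -> tuple[float, float]:
        # Single pass: count totals, correct answers, and unsure answers together.
        num_total = num_correct = num_unsure = 0
        for evaluation in evaluations:
            evaluation = cls(evaluation)  # raises ValueError on invalid entries
            num_total += 1
            if evaluation == cls.CORRECT:
                num_correct += 1
            elif evaluation == cls.UNSURE:
                num_unsure += 1
        accuracy = num_correct / num_total  # ZeroDivisionError on empty input
        precision = num_correct / (num_total - num_unsure)
        return accuracy, precision

Under these assumptions, [CORRECT, UNSURE, INCORRECT] yields accuracy 1/3 and precision 1/2, and an empty input raises ZeroDivisionError, which matches the generated regression tests below.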

Correctness verification

The new optimized code was tested for correctness. The results are listed below:

Test                          | Status        | Details
⚙️ Existing Unit Tests         | 5 Passed      | See below
🌀 Generated Regression Tests  | 24 Passed     | See below
⏪ Replay Tests                | 🔘 None Found |
🔎 Concolic Coverage Tests     | 🔘 None Found |
📊 Coverage                    | undefined     |

⚙️ Existing Unit Tests Details

- test_utils.py

🌀 Generated Regression Tests Details

from collections.abc import Sequence
from enum import StrEnum
from typing import Self

# imports
import pytest  # used for our unit tests
from aviary.utils import MultipleChoiceEvaluation

# unit tests

def test_single_correct():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT]
    )

def test_single_incorrect():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.INCORRECT]
    )


def test_mixed_evaluations():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT, MultipleChoiceEvaluation.INCORRECT, MultipleChoiceEvaluation.CORRECT]
    )

def test_mixed_with_unsure():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT, MultipleChoiceEvaluation.UNSURE, MultipleChoiceEvaluation.INCORRECT]
    )

def test_empty_input():
    with pytest.raises(ZeroDivisionError):
        MultipleChoiceEvaluation.calculate_accuracy_precision([])




def test_large_number_of_evaluations():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT] * 1000 + [MultipleChoiceEvaluation.INCORRECT] * 1000
    )

def test_large_number_with_unsure():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT] * 1000 + [MultipleChoiceEvaluation.INCORRECT] * 1000 + [MultipleChoiceEvaluation.UNSURE] * 1000
    )

def test_typical_classroom_scenario():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT] * 20 + [MultipleChoiceEvaluation.INCORRECT] * 5 + [MultipleChoiceEvaluation.UNSURE] * 5
    )

def test_all_correct():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT] * 10
    )

def test_all_incorrect():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.INCORRECT] * 10
    )

def test_non_string_non_enum_inputs():
    with pytest.raises(ValueError):
        MultipleChoiceEvaluation.calculate_accuracy_precision([None, 123, 45.6])

def test_minimum_valid_input():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT]
    )

def test_maximum_valid_input():
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(
        [MultipleChoiceEvaluation.CORRECT] * 10**6
    )
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from collections.abc import Sequence
from enum import StrEnum
from typing import Self

# imports
import pytest  # used for our unit tests
from aviary.utils import MultipleChoiceEvaluation

# unit tests

def test_single_correct():
    """Test with a single correct evaluation"""
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision([MultipleChoiceEvaluation.CORRECT])

def test_single_incorrect():
    """Test with a single incorrect evaluation"""
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision([MultipleChoiceEvaluation.INCORRECT])


def test_mixed_correct_incorrect():
    """Test with a mix of correct and incorrect evaluations"""
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision([MultipleChoiceEvaluation.CORRECT, MultipleChoiceEvaluation.INCORRECT])

def test_mixed_correct_incorrect_unsure():
    """Test with a mix of correct, incorrect, and unsure evaluations"""
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision([MultipleChoiceEvaluation.CORRECT, MultipleChoiceEvaluation.INCORRECT, MultipleChoiceEvaluation.UNSURE])

def test_empty_input():
    """Test with an empty input sequence"""
    with pytest.raises(ZeroDivisionError):
        MultipleChoiceEvaluation.calculate_accuracy_precision([])




def test_invalid_string_input():
    """Test with an invalid string input"""
    with pytest.raises(ValueError):
        MultipleChoiceEvaluation.calculate_accuracy_precision(['INVALID'])

def test_mixed_valid_invalid_inputs():
    """Test with mixed valid and invalid inputs"""
    with pytest.raises(ValueError):
        MultipleChoiceEvaluation.calculate_accuracy_precision([MultipleChoiceEvaluation.CORRECT, 'INVALID'])

def test_large_number_of_evaluations():
    """Test with a large number of evaluations"""
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision([MultipleChoiceEvaluation.CORRECT] * 1000 + [MultipleChoiceEvaluation.INCORRECT] * 1000)

def test_large_number_of_mixed_evaluations():
    """Test with a large number of mixed evaluations"""
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision([MultipleChoiceEvaluation.CORRECT] * 1000 + [MultipleChoiceEvaluation.INCORRECT] * 500 + [MultipleChoiceEvaluation.UNSURE] * 500)

def test_very_large_input():
    """Test with a very large input"""
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision([MultipleChoiceEvaluation.CORRECT] * 10**6 + [MultipleChoiceEvaluation.INCORRECT] * 10**6)

def test_stress_test_with_random_evaluations():
    """Stress test with random evaluations"""
    import random
    evaluations = random.choices([MultipleChoiceEvaluation.CORRECT, MultipleChoiceEvaluation.INCORRECT, MultipleChoiceEvaluation.UNSURE], k=1000)
    codeflash_output = MultipleChoiceEvaluation.calculate_accuracy_precision(evaluations)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

src/aviary/utils.py (review comment, resolved)
@@ -88,6 +108,23 @@ def is_coroutine_callable(obj) -> bool:
return False


async def run_prompt(


worth using the llm-client code here?

jamesbraza (Collaborator, Author) replied:

Yes, 100% agreed, but I'll leave that for another PR.

I don't think @maykcaldas has pulled llm-client into this repo's llm extra yet. One note, though: llm-client is a bit heavier than just litellm, so maybe we should allow for both routes.
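
As a rough sketch of the "allow for both routes" idea (the llmclient module name, the helper name, and the example model are assumptions, not this repo's code):

import litellm

try:
    import llmclient  # assumed import name for llm-client; not yet in this repo's llm extra

    HAS_LLM_CLIENT = True
except ImportError:
    HAS_LLM_CLIENT = False


async def run_multiple_choice_prompt(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Hypothetical helper: prefer llm-client when installed, else fall back to litellm."""
    if HAS_LLM_CLIENT:
        # llm-client's API is not shown in this PR, so this branch is left as a stub.
        raise NotImplementedError("route through llm-client here once it is adopted")
    response = await litellm.acompletion(
        model=model,  # example model name, not a project default
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content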

@mskarlin left a comment:

lgtm -- couple minor comments

jamesbraza merged commit c705773 into main on Dec 18, 2024 (6 checks passed)
jamesbraza deleted the multiple-choice branch on December 18, 2024 at 21:20
Labels: enhancement (New feature or request), size:XXL (This PR changes 1000+ lines, ignoring generated files.)

Successfully merging this pull request may close these issues.

Reactor: Overly Complicated Question Eval