
Build comparison table #34

Merged: 15 commits merged into develop from feature/comparison-table on Jul 17, 2023

Conversation

@NISH1001 (Collaborator) commented on Jul 11, 2023

Changelog

Major

  • all Metric.compute(...) methods now return an evalem._base.structures.MetricResult object (see the sketch below this list)
  • evalem._base.pipelines.NamedSimpleEvaluationPipeline is added, which provides an explicit name for each pipeline (if no name is provided, an auto-naming scheme is used)
  • the evalem.misc.utils.build_comparison_table(...) utility function is added (see the Usage section). In the future we need a better way to package this comparison; currently it is very basic/naive and assumes the same set of inputs and references is passed through all the evaluation pipelines.
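
A minimal sketch (not from this PR) of the new return shape, assuming MetricResult exposes the score and extra attributes mentioned in this changelog and in the review diff below; the compute(...) argument names used here are an assumption for illustration only:

from evalem.nlp.metrics import MeteorMetric

metric = MeteorMetric()
# `predictions`/`references` argument names are assumed for illustration
result = metric.compute(predictions=["a cat sat"], references=["a cat sat"])
print(result.score)  # aggregated score (MetricResult.score)
print(result.extra)  # metric-specific extras (MetricResult.extra)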

Minor

  • the evalem.nlp.metrics.RougeMetric score (MetricResult.score) is now computed by averaging rouge1, rougeL, and rougeLsum
  • test cases are refactored to remove redundancy
  • the evalem._base.abc.InstantCountMixin mixin is added to support auto-naming/auto-id of objects in any downstream class (see the sketch after this list)
  • the GitHub Action now runs pytest with coverage via python -m pytest --cov evalem --verbose tests/
  • evalem._base.evaluators.Evaluator invocations now return a list of results instead of a dictionary, since we now use the evalem.MetricResult data structure for downstream result ingestion
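
A minimal sketch of the instance-counting pattern behind such a mixin; this is an illustration only, and the actual evalem._base.abc.InstantCountMixin implementation (including attribute names) may differ:

import itertools

class InstanceCountMixinSketch:
    """Illustrative only: gives each instance of a subclass an auto-incremented id/name."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # each subclass gets its own counter
        cls._instance_counter = itertools.count()

    def __init__(self):
        self.idx = next(self._instance_counter)
        self.name = f"{type(self).__name__}:{self.idx}"

class DummyPipeline(InstanceCountMixinSketch):
    pass

print(DummyPipeline().name)  # DummyPipeline:0
print(DummyPipeline().name)  # DummyPipeline:1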

Usage

from evalem import NamedSimpleEvaluationPipeline

from evalem.nlp.models import QuestionAnsweringHFPipelineWrapper
from evalem.nlp.evaluators import QAEvaluator
from evalem.nlp.metrics import BertScore, RougeMetric, MeteorMetric

from evalem.misc.utils import build_comparison_table

# vanilla model
wrapped_model = QuestionAnsweringHFPipelineWrapper(device="mps")

# nasa-v6 model
wrapped_model_2 = QuestionAnsweringHFPipelineWrapper.from_onnx(
    model="tmp/onnx/nasa-v6/",
    tokenizer="tmp/onnx/nasa-v6/",
    device="mps"
)

# common metrics
evaluators_common = [
    QAEvaluator(),
    BertScore(device="mps"),
#     BartScore(device="mps"),
    RougeMetric(),
    MeteorMetric()
]

# pipeline no. 1
eval_pipe = NamedSimpleEvaluationPipeline(
    model=wrapped_model,
    evaluators=evaluators_common,
    name="distilbert"
)

# pipeline no. 2
eval_pipe_2 = NamedSimpleEvaluationPipeline(
    model=wrapped_model_2,
    evaluators=evaluators_common,
    name="nasa-v6-onnx"
)

# get comparison table
# (`data` is assumed to be a pandas DataFrame, loaded elsewhere, with
#  "context", "question", and "answer_text" columns)
results = build_comparison_table(
    eval_pipe, eval_pipe_2,
    inputs=list(data[["context", "question"]].T.to_dict().values()),
    references=data["answer_text"].to_list(),
)

print(results)

@NISH1001 changed the title from "Add comparison table" to "Build comparison table" on Jul 11, 2023
@NISH1001 requested a review from xhagrg on Jul 11, 2023 19:37
@NISH1001 (Collaborator, Author) commented:

We used these changes to evaluate the askathon dataset in #35.

Comment on lines +106 to +108
result.extra["bertscore"][_key] = np.mean(
    result.extra["bertscore"][_key],
)
@muthukumaranR (Collaborator) commented on Jul 17, 2023:

How are negative bertscore values to be interpreted? Does that affect the averaging done here?

@NISH1001 (Collaborator, Author) replied on Jul 17, 2023:

The final bertscore itself isn't computed via a mean. Only the f1/precision/recall values are averaged -- hence no negative scores for these -- since Jury returns per-instance scores for them as well. (We could possibly remove these, as they will also be computed by evalem.F1Metric, etc.)
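
For illustration, a minimal sketch of the averaging referenced in the diff above, using made-up per-instance f1 values (the real values come from Jury's per-instance bertscore output):

import numpy as np

# hypothetical per-instance bertscore f1 values
result_extra = {"bertscore": {"f1": [0.91, 0.87, 0.95]}}
result_extra["bertscore"]["f1"] = np.mean(result_extra["bertscore"]["f1"])
print(result_extra["bertscore"]["f1"])  # 0.91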

@NISH1001 merged commit 26340fa into develop on Jul 17, 2023
@NISH1001 deleted the feature/comparison-table branch on Jul 17, 2023 19:33