
Build comparison table #34

Merged: 15 commits merged into develop from feature/comparison-table on Jul 17, 2023

Conversation

@NISH1001 (Collaborator) commented on Jul 11, 2023

Changelog

Major

  • all Metric.compute(...) methods now return an evalem._base.structures.MetricResult object (see the sketch below this list)
  • evalem._base.pipelines.NamedSimpleEvaluationPipeline is added, which provides an explicit name for each pipeline (if no name is provided, an auto-naming scheme is used)
  • the evalem.misc.utils.build_comparison_table(...) utility function is added (see the Usage section). In the future we need a better way to package this comparison; currently it is very basic/naive and assumes the same set of inputs and references is passed through all the evaluation pipelines.
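
A minimal sketch (not from this PR) of the new return shape, assuming MetricResult exposes the score and extra attributes mentioned in this changelog and in the review diff below; the compute(...) argument names used here are an assumption for illustration only:

from evalem.nlp.metrics import MeteorMetric

metric = MeteorMetric()
# `predictions`/`references` argument names are assumed for illustration
result = metric.compute(predictions=["a cat sat"], references=["a cat sat"])
print(result.score)  # aggregated score (MetricResult.score)
print(result.extra)  # metric-specific extras (MetricResult.extra)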

Minor

  • the evalem.nlp.metrics.RougeMetric score (MetricResult.score) is now computed by averaging rouge1, rougeL, and rougeLsum
  • test cases are refactored to remove redundancy
  • the evalem._base.abc.InstantCountMixin mixin is added to support auto-naming/auto-id of objects in any downstream class (see the sketch after this list)
  • the GitHub Action now runs pytest with coverage via python -m pytest --cov evalem --verbose tests/
  • evalem._base.evaluators.Evaluator invocations now return a list of results instead of a dictionary, since we now use the evalem.MetricResult data structure for downstream result ingestion
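
A minimal sketch of the instance-counting pattern behind such a mixin; this is an illustration only, and the actual evalem._base.abc.InstantCountMixin implementation (including attribute names) may differ:

import itertools

class InstanceCountMixinSketch:
    """Illustrative only: gives each instance of a subclass an auto-incremented id/name."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # each subclass gets its own counter
        cls._instance_counter = itertools.count()

    def __init__(self):
        self.idx = next(self._instance_counter)
        self.name = f"{type(self).__name__}:{self.idx}"

class DummyPipeline(InstanceCountMixinSketch):
    pass

print(DummyPipeline().name)  # DummyPipeline:0
print(DummyPipeline().name)  # DummyPipeline:1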

Usage

from evalem import NamedSimpleEvaluationPipeline

from evalem.nlp.models import QuestionAnsweringHFPipelineWrapper
from evalem.nlp.evaluators import QAEvaluator
from evalem.nlp.metrics import BertScore, RougeMetric, MeteorMetric

from evalem.misc.utils import build_comparison_table

# vanilla model
wrapped_model = QuestionAnsweringHFPipelineWrapper(device="mps")

# nasa-v6 model
wrapped_model_2 = QuestionAnsweringHFPipelineWrapper.from_onnx(
    model="tmp/onnx/nasa-v6/",
    tokenizer="tmp/onnx/nasa-v6/",
    device="mps"
)

# common metrics
evaluators_common = [
    QAEvaluator(),
    BertScore(device="mps"),
#     BartScore(device="mps"),
    RougeMetric(),
    MeteorMetric()
]

# pipeline no. 1
eval_pipe = NamedSimpleEvaluationPipeline(
    model=wrapped_model,
    evaluators=evaluators_common,
    name="distilbert"
)

# pipeline no. 2
eval_pipe_2 = NamedSimpleEvaluationPipeline(
    model=wrapped_model_2,
    evaluators=evaluators_common,
    name="nasa-v6-onnx"
)

# get comparison table
# (`data` is assumed to be a pandas DataFrame, loaded elsewhere, with
#  "context", "question", and "answer_text" columns)
results = build_comparison_table(
    eval_pipe, eval_pipe_2,
    inputs=list(data[["context", "question"]].T.to_dict().values()),
    references=data["answer_text"].to_list(),
)

print(results)

@NISH1001 changed the title from "Add comparison table" to "Build comparison table" on Jul 11, 2023
@NISH1001 requested a review from xhagrg on Jul 11, 2023 19:37
@NISH1001 (Collaborator, Author) commented:

We used these changes to evaluate the askathon dataset in #35.

Comment on lines +106 to +108
result.extra["bertscore"][_key] = np.mean(
    result.extra["bertscore"][_key],
)
@muthukumaranR (Collaborator) commented on Jul 17, 2023:

How are negative bertscore values to be interpreted? Does that affect the averaging done here?

@NISH1001 (Collaborator, Author) replied on Jul 17, 2023:

The final bertscore itself isn't computed via a mean. Only the f1/precision/recall values are averaged -- hence no negative scores for these -- since Jury returns per-instance scores for them as well. (We could possibly remove these, as they will also be computed by evalem.F1Metric, etc.)
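
For illustration, a minimal sketch of the averaging referenced in the diff above, using made-up per-instance f1 values (the real values come from Jury's per-instance bertscore output):

import numpy as np

# hypothetical per-instance bertscore f1 values
result_extra = {"bertscore": {"f1": [0.91, 0.87, 0.95]}}
result_extra["bertscore"]["f1"] = np.mean(result_extra["bertscore"]["f1"])
print(result_extra["bertscore"]["f1"])  # 0.91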

@NISH1001 merged commit 26340fa into develop on Jul 17, 2023
@NISH1001 deleted the feature/comparison-table branch on Jul 17, 2023 19:33