Build comparison table #34
Conversation
- `Metric.compute()` now returns an `evalem._base.structures.MetricResult` dataclass, which is used as the DTO for the return value of any compute method.
- `evalem.misc.utils.build_comparison_table(...)` is added. Note: `evalem._base.evaluators.Evaluator(...)` now returns a list of results, not a dict.
- The Rouge score is now averaged over `rouge1`, `rougeL`, and `rougeLsum` (see the changelog below). Also, `test_metric_score` is refactored.
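A minimal sketch of what the new return value looks like. The keyword argument names and any `MetricResult` fields other than `score` and `extra` are assumptions for illustration, not the confirmed API.

```python
# Sketch only: keyword names and fields beyond `score`/`extra` are assumptions.
from evalem.nlp.metrics import RougeMetric

metric = RougeMetric()
result = metric.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat sat on the mat"],
)

# `result` is an evalem._base.structures.MetricResult dataclass
print(result.score)  # aggregated score for the metric
print(result.extra)  # metric-specific payload (e.g. per-variant rouge values)
```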
Re:

    result.extra["bertscore"][_key] = np.mean(
        result.extra["bertscore"][_key],
    )
how are bertscore negative scores to be interpreted? does that affect the averaging done here?
The final actual bertscore isn't computed via a mean. Only the f1/precision/recall are averaged -- hence no negative scores for these -- because of how Jury returns per-instance scores for these as well. (We could possibly also remove these, since they will also be computed by `evalem.F1Metric`, etc.)
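Roughly, the averaging under discussion looks like the sketch below. The payload shape is assumed (Jury-style per-instance lists for precision/recall/f1); only those lists get reduced with `np.mean`, while the overall bertscore value is left as-is.

```python
import numpy as np

# Assumed, Jury-style payload shape -- for illustration only.
bertscore = {
    "score": 0.91,                    # final bertscore, not re-averaged
    "precision": [0.93, 0.88, 0.90],  # per-instance values
    "recall": [0.92, 0.87, 0.91],
    "f1": [0.925, 0.875, 0.905],
}

# Mirrors the diff above: collapse only the per-instance lists to scalars.
for _key in ("precision", "recall", "f1"):
    bertscore[_key] = np.mean(bertscore[_key])

print(bertscore)
```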
Changelog

Major

- `Metric.compute(...)` method returns an `evalem._base.structures.MetricResult` object.
- `evalem._base.pipelines.NamedSimpleEvaluationPipeline` is added that (verbose as it is) provides a name for each pipeline. (If a name is not provided, an auto-naming scheme is applied.)
- `evalem.misc.utils.build_comparison_table(...)` utility function is added (see the Usage section). (In the future, we need to figure out a better way to package this comparison. Currently, the comparison is very basic/naive and assumes that the same set of inputs and references is passed through all the evaluation pipes.)
Minor

- `evalem.nlp.metrics.RougeMetric` now computes its actual score by averaging `rouge1`, `rougeL`, and `rougeLsum` into the `MetricResult.score` field.
- `evalem._base.abc.InstantCountMixin` mixin is added to provide auto-naming/auto-id support for objects of any downstream class (see the sketch after this list).
- `evalem._base.evaluators.Evaluator` calls/invocations now return only a list of results instead of a dictionary, since we now use the `evalem.MetricResult` data structure for downstream result ingestion.
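The mixin mentioned in the list above is, conceptually, an instance counter used for auto-naming. The snippet below is an independent illustration of that pattern under assumed semantics, not the actual `evalem._base.abc.InstantCountMixin` implementation.

```python
import itertools
from typing import Optional


class InstanceCounterMixin:
    """Illustrative stand-in for an instance-counting mixin.

    Each subclass gets its own counter, so objects can be auto-named
    ("SomePipeline-1", "SomePipeline-2", ...) when no explicit name is given.
    """

    _counters: dict = {}

    def __init__(self, name: Optional[str] = None) -> None:
        counter = self._counters.setdefault(type(self).__name__, itertools.count(1))
        self.name = name or f"{type(self).__name__}-{next(counter)}"


class SomePipeline(InstanceCounterMixin):
    pass


print(SomePipeline().name)               # SomePipeline-1
print(SomePipeline().name)               # SomePipeline-2
print(SomePipeline(name="custom").name)  # custom
```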
Run the test suite with `python -m pytest --cov evalem --verbose tests/`.

Usage
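A rough sketch of the intended flow, under stated assumptions: the pipeline constructor arguments, the model/evaluator placeholders, and the exact `build_comparison_table(...)` signature are guesses for illustration, not the confirmed API.

```python
# Rough sketch only -- constructor arguments and the build_comparison_table
# signature are assumptions; replace the placeholders with real evalem objects.
from evalem._base.pipelines import NamedSimpleEvaluationPipeline
from evalem.misc.utils import build_comparison_table

model_a, model_b = ..., ...        # placeholders: your model wrappers
evaluators = [...]                 # placeholders: shared evaluator(s)
inputs, references = [...], [...]  # placeholders: same data for every pipe

pipe_a = NamedSimpleEvaluationPipeline(
    model=model_a,
    evaluators=evaluators,
    name="model-a",            # explicit name
)
pipe_b = NamedSimpleEvaluationPipeline(
    model=model_b,
    evaluators=evaluators,     # no name -> auto-naming scheme kicks in
)

# The comparison assumes the *same* inputs/references go through every pipe.
table = build_comparison_table(
    pipe_a,
    pipe_b,
    inputs=inputs,
    references=references,
)
print(table)
```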