feat(ds): add init_error for worker init error matrix #114

shenxiangzhuang · 2024-08-02T08:19:04Z

Close #112

Checklist

I have read the CONTRIBUTING document
I hereby agree to the terms of either individual CLA available at: https://toloka.ai/legal/cla_individual or corporate CLA available at: https://toloka.ai/legal/cla_corporate, whichever is applicable to me, as it pertains to direct contributions to crowd-kit only
My change requires a change to the documentation
I have updated the documentation accordingly
I have added tests to cover my changes
All new and existing tests passed

shenxiangzhuang · 2024-08-02T08:35:31Z

This PR contains major changes to implement the feature in #112 . If you have any suggestions about the implementation, I'd love to make further improvements. Then, I will add the required docs and tests in this PR or a new one.

dustalov · 2024-08-02T21:50:47Z

I believe it would be useful to have it in Crowd-Kit, but I have two comments due to the significance of the proposed changes.

If I remember other implementations correctly, the initial estimate of worker errors is usually applied once at the beginning, effectively replacing the first M-step. The current approach adjusts the error matrices at each iteration.
Would it be possible to cover this new functionality with tests?

shenxiangzhuang · 2024-08-03T15:38:14Z

Thanks for your timely reply!

The implementaion is applied only once, and just as you said in the first M-step. Here the new defination of _m_step take initial_error as a optional parameter, and we only pass the initial_error at the first M-step(errors = self._m_step(data, probas, initial_error)), in the inner iterations we just ignore it(pass None) to _m_step method(errors = self._m_step(data, probas)), so it works just as before.
Sure, I would like to add tests for this functionality. In addition, I will also add some usage examples in docs to make it easier to use.

dustalov · 2024-08-04T09:53:37Z

It would be great to have docs and tests. Thank you for the clarifications. This new parameter is only applied at the very first iteration. I am wondering why this code performs an addition instead of an assignment. We already estimate these matrices from data, and right after that, we adjust them with initial_error.

shenxiangzhuang · 2024-08-05T10:55:53Z

It would be great to have docs and tests. Thank you for the clarifications. This new parameter is only applied at the very first iteration. I am wondering why this code performs an addition instead of an assignment. We already estimate these matrices from data, and right after that, we adjust them with initial_error.

Great observation! The addition serves to address the issue of incomplete initialization error matrices, particularly when conducting truth inference in a production environment. Essentially, in such environments, we might not have all the workers' initialization error matrices available for the truth inference process.

To clarify, the worker's initialization error matrix is more akin to a "count matrix" rather than a "probability matrix" (I will provide additional documentation on this later). It resembles the errors in the M-step before the averaging process.

Therefore, the addition operation on the error matrix functions as an accumulation. We combine the historical error matrix counts of the worker with the current task's error matrix counts to obtain an accumulated error matrix. In cases where we lack the initialization error matrices for certain workers, their error matrices remain unchanged. However, when available, we obtain an accumulated error matrix, which will later be averaged to derive the final "probability matrix."

shenxiangzhuang · 2024-08-06T04:47:55Z

The docs preview:

dustalov · 2024-08-06T21:19:08Z

I'd still prefer to have the option to assign the passed initial_error instead of adding it to the pre-computed error matrices. It might be a boolean or string literal option adjusting the approach.

For example, it might be useful for reproducibility to run the new aggregation with the same matrices as estimated before without making the new initialization of them.

Could you please describe your scenario so I can understand it well?

shenxiangzhuang · 2024-08-07T03:39:04Z

Sorry for the unclear illustration about the motivation of this implementation.

Ours scenario is pretty simple actually: all the T tasks are annotated by N workers(every one only annotates part of the tasks, not all), while M (<N) workers have initial_error because they have history annotation records data on this kind task. The others N-M workers don't have initial_error, because this is their first time to do these tasks.

In this case, if we simply assign the passed initial_error, we only got the error matrices for the M workers, which leads to an uncompleted errors_.(I think the errors_ should contains all the workers' error matrices to make the EM algorithm work. If I was wrong, correct me please.)

So, what we can do to use the N-M workers' initial error matrices and want a completed errors_? We choose to use the addition to do this which works more like a weighted sum of the initial error matrix with priori error matrix(evaluated from the task data). For the M workers, their error matrices(after first m-step) are weighted sum of the initial error matrix and priori error matrix, the weights come from the annotation task count; for the N-M workers, their error matrices is just priori error matrix, which is set the weight of priori error matrix to 1.

shenxiangzhuang · 2024-08-07T06:09:04Z

I'd still prefer to have the option to assign the passed initial_error instead of adding it to the pre-computed error matrices. It might be a boolean or string literal option adjusting the approach.

I think the idea of an extra method parameter is a good one. When using the assign method, in addition to making sure initial_error is "complete", there should be some data checking logic, as discussed earlier.

dustalov · 2024-08-07T22:33:26Z

I see, thank you. Let's add it and I'll be happy to merge it.

shenxiangzhuang · 2024-08-08T07:36:12Z

I have added initial_error_strategy for the difference usages of initial_error with examples, docs and tests. Please review again.

PS: The ci still fails due to some mypy checking errors, maybe we can fix them in another pr after figure out the reasons.( #115 #116)

dustalov · 2024-08-08T21:47:57Z

These issues have been fixed in main.

shenxiangzhuang · 2024-08-08T22:34:49Z

These issues have been fixed in main.

Thanks for your quick fix! The ci works well now. Please review again.

feat(ds): add init_error for worker init error matrix

5310974

shenxiangzhuang requested review from dustalov, pilot7747, aliskin, denaxen, varfolomeii, alexdrydew, Pocoder, vlad-mois and DrhF as code owners August 2, 2024 08:19

add initial_error in m_step

587f392

add: tests

0699e77

shenxiangzhuang marked this pull request as draft August 5, 2024 08:20

fix: mypy error

087e88c

This was referenced Aug 6, 2024

[BUG] New version of attrs got an mypy checking error #115

Closed

fix: temporary constrain attrs version to <= 23.2.0 to avoid mypy type checking error #116

Merged

shenxiangzhuang added 2 commits August 6, 2024 12:41

fix: test cases

9bab4ba

docs: add examples and notes about in docs

4d6dba2

shenxiangzhuang marked this pull request as ready for review August 6, 2024 04:48

add strategy for initial error matrix apply

31f342a

shenxiangzhuang added 2 commits August 8, 2024 11:25

Merge branch 'main' into feat/ds_with_history_error_matrix

7536965

fix: make m_step function as static method

0551d81

shenxiangzhuang marked this pull request as draft August 8, 2024 05:50

shenxiangzhuang added 2 commits August 8, 2024 14:19

tests: add more tests for assign strategy

9fb6f56

add more docs & examples

c83c92f

shenxiangzhuang marked this pull request as ready for review August 8, 2024 07:03

shenxiangzhuang added 3 commits August 9, 2024 06:18

Merge branch 'main' into feat/ds_with_history_error_matrix

a46c316

fix: tests mypy error

da28b73

fix: format

699defa

dustalov approved these changes Aug 8, 2024

View reviewed changes

dustalov merged commit e67d3a7 into Toloka:main Aug 9, 2024
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ds): add init_error for worker init error matrix #114

feat(ds): add init_error for worker init error matrix #114

shenxiangzhuang commented Aug 2, 2024 •

edited

Loading

shenxiangzhuang commented Aug 2, 2024 •

edited

Loading

dustalov commented Aug 2, 2024

shenxiangzhuang commented Aug 3, 2024

dustalov commented Aug 4, 2024

shenxiangzhuang commented Aug 5, 2024

shenxiangzhuang commented Aug 6, 2024

dustalov commented Aug 6, 2024

shenxiangzhuang commented Aug 7, 2024

shenxiangzhuang commented Aug 7, 2024

dustalov commented Aug 7, 2024

shenxiangzhuang commented Aug 8, 2024

dustalov commented Aug 8, 2024

shenxiangzhuang commented Aug 8, 2024

feat(ds): add init_error for worker init error matrix #114

feat(ds): add init_error for worker init error matrix #114

Conversation

shenxiangzhuang commented Aug 2, 2024 • edited Loading

Checklist

shenxiangzhuang commented Aug 2, 2024 • edited Loading

dustalov commented Aug 2, 2024

shenxiangzhuang commented Aug 3, 2024

dustalov commented Aug 4, 2024

shenxiangzhuang commented Aug 5, 2024

shenxiangzhuang commented Aug 6, 2024

dustalov commented Aug 6, 2024

shenxiangzhuang commented Aug 7, 2024

shenxiangzhuang commented Aug 7, 2024

dustalov commented Aug 7, 2024

shenxiangzhuang commented Aug 8, 2024

dustalov commented Aug 8, 2024

shenxiangzhuang commented Aug 8, 2024

shenxiangzhuang commented Aug 2, 2024 •

edited

Loading

shenxiangzhuang commented Aug 2, 2024 •

edited

Loading