Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ds): add init_error for worker init error matrix #114

Merged

Conversation

shenxiangzhuang
Copy link
Contributor

@shenxiangzhuang shenxiangzhuang commented Aug 2, 2024

Close #112

Checklist

  • I have read the CONTRIBUTING document
  • I hereby agree to the terms of either individual CLA available at: https://toloka.ai/legal/cla_individual or corporate CLA available at: https://toloka.ai/legal/cla_corporate, whichever is applicable to me, as it pertains to direct contributions to crowd-kit only
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@shenxiangzhuang
Copy link
Contributor Author

shenxiangzhuang commented Aug 2, 2024

This PR contains major changes to implement the feature in #112 . If you have any suggestions about the implementation, I'd love to make further improvements. Then, I will add the required docs and tests in this PR or a new one.

@dustalov
Copy link
Collaborator

dustalov commented Aug 2, 2024

I believe it would be useful to have it in Crowd-Kit, but I have two comments due to the significance of the proposed changes.

  1. If I remember other implementations correctly, the initial estimate of worker errors is usually applied once at the beginning, effectively replacing the first M-step. The current approach adjusts the error matrices at each iteration.
  2. Would it be possible to cover this new functionality with tests?

@shenxiangzhuang
Copy link
Contributor Author

Thanks for your timely reply!

  1. The implementaion is applied only once, and just as you said in the first M-step. Here the new defination of _m_step take initial_error as a optional parameter, and we only pass the initial_error at the first M-step(errors = self._m_step(data, probas, initial_error)), in the inner iterations we just ignore it(pass None) to _m_step method(errors = self._m_step(data, probas)), so it works just as before.

  2. Sure, I would like to add tests for this functionality. In addition, I will also add some usage examples in docs to make it easier to use.

@dustalov
Copy link
Collaborator

dustalov commented Aug 4, 2024

It would be great to have docs and tests. Thank you for the clarifications. This new parameter is only applied at the very first iteration. I am wondering why this code performs an addition instead of an assignment. We already estimate these matrices from data, and right after that, we adjust them with initial_error.

@shenxiangzhuang shenxiangzhuang marked this pull request as draft August 5, 2024 08:20
@shenxiangzhuang
Copy link
Contributor Author

It would be great to have docs and tests. Thank you for the clarifications. This new parameter is only applied at the very first iteration. I am wondering why this code performs an addition instead of an assignment. We already estimate these matrices from data, and right after that, we adjust them with initial_error.

Great observation! The addition serves to address the issue of incomplete initialization error matrices, particularly when conducting truth inference in a production environment. Essentially, in such environments, we might not have all the workers' initialization error matrices available for the truth inference process.

To clarify, the worker's initialization error matrix is more akin to a "count matrix" rather than a "probability matrix" (I will provide additional documentation on this later). It resembles the errors in the M-step before the averaging process.

Therefore, the addition operation on the error matrix functions as an accumulation. We combine the historical error matrix counts of the worker with the current task's error matrix counts to obtain an accumulated error matrix. In cases where we lack the initialization error matrices for certain workers, their error matrices remain unchanged. However, when available, we obtain an accumulated error matrix, which will later be averaged to derive the final "probability matrix."

@shenxiangzhuang
Copy link
Contributor Author

The docs preview:
image

@shenxiangzhuang shenxiangzhuang marked this pull request as ready for review August 6, 2024 04:48
@dustalov
Copy link
Collaborator

dustalov commented Aug 6, 2024

I'd still prefer to have the option to assign the passed initial_error instead of adding it to the pre-computed error matrices. It might be a boolean or string literal option adjusting the approach.

For example, it might be useful for reproducibility to run the new aggregation with the same matrices as estimated before without making the new initialization of them.

Could you please describe your scenario so I can understand it well?

@shenxiangzhuang
Copy link
Contributor Author

Sorry for the unclear illustration about the motivation of this implementation.

Ours scenario is pretty simple actually: all the T tasks are annotated by N workers(every one only annotates part of the tasks, not all), while M (<N) workers have initial_error because they have history annotation records data on this kind task. The others N-M workers don't have initial_error, because this is their first time to do these tasks.

In this case, if we simply assign the passed initial_error, we only got the error matrices for the M workers, which leads to an uncompleted errors_.(I think the errors_ should contains all the workers' error matrices to make the EM algorithm work. If I was wrong, correct me please.)

So, what we can do to use the N-M workers' initial error matrices and want a completed errors_? We choose to use the addition to do this which works more like a weighted sum of the initial error matrix with priori error matrix(evaluated from the task data). For the M workers, their error matrices(after first m-step) are weighted sum of the initial error matrix and priori error matrix, the weights come from the annotation task count; for the N-M workers, their error matrices is just priori error matrix, which is set the weight of priori error matrix to 1.

@shenxiangzhuang
Copy link
Contributor Author

I'd still prefer to have the option to assign the passed initial_error instead of adding it to the pre-computed error matrices. It might be a boolean or string literal option adjusting the approach.

I think the idea of ​​an extra method parameter is a good one. When using the assign method, in addition to making sure initial_error is "complete", there should be some data checking logic, as discussed earlier.

@dustalov
Copy link
Collaborator

dustalov commented Aug 7, 2024

I see, thank you. Let's add it and I'll be happy to merge it.

@shenxiangzhuang shenxiangzhuang marked this pull request as draft August 8, 2024 05:50
@shenxiangzhuang shenxiangzhuang marked this pull request as ready for review August 8, 2024 07:03
@shenxiangzhuang
Copy link
Contributor Author

I have added initial_error_strategy for the difference usages of initial_error with examples, docs and tests. Please review again.

PS: The ci still fails due to some mypy checking errors, maybe we can fix them in another pr after figure out the reasons.( #115 #116)

@dustalov
Copy link
Collaborator

dustalov commented Aug 8, 2024

These issues have been fixed in main.

@shenxiangzhuang
Copy link
Contributor Author

These issues have been fixed in main.

Thanks for your quick fix! The ci works well now. Please review again.

@dustalov dustalov merged commit e67d3a7 into Toloka:main Aug 9, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Improve ds algorithm to use worker's history error matrix info
2 participants