compute topic similarity more efficiently? #38

Open
f-hafner opened this issue Mar 8, 2023 · 4 comments
Comments

@f-hafner
Owner

f-hafner commented Mar 8, 2023

Currently, we iterate over each graduation year, and in each iteration we load a window of data of +/- 5 years into memory. If we compute the similarity for two or more neighboring graduation years in the same iteration, the windows overlap almost entirely, so we only have to add data for one additional year per extra graduation year (two additional years when processing three neighboring years). This could speed up the calculations. The trade-off is that it needs more memory.
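To make the trade-off concrete, here is a minimal sketch that only counts how many year-slices of data get loaded under each scheme. It assumes consecutive graduation years and a window of +/- 5 years; the function names and the range of years are made up for illustration and are not taken from the project's code.

```python
def years_loaded_per_year(grad_years, half_width=5):
    """Year-slices loaded when every graduation year loads its own window."""
    return len(list(grad_years)) * (2 * half_width + 1)


def years_loaded_batched(grad_years, batch_size, half_width=5):
    """Year-slices loaded when neighboring graduation years share one window."""
    grad_years = sorted(grad_years)
    total = 0
    for start in range(0, len(grad_years), batch_size):
        batch = grad_years[start:start + batch_size]
        # One window covers the whole batch: (first - w) up to (last + w).
        total += (batch[-1] + half_width) - (batch[0] - half_width) + 1
    return total


years = list(range(1990, 2020))          # 30 hypothetical graduation years
print(years_loaded_per_year(years))      # 30 windows of 11 years -> 330
print(years_loaded_batched(years, 3))    # 10 windows of 13 years -> 130
```

In this toy setting, batching three neighboring years cuts the loaded volume from 330 to 130 year-slices, while the largest window held in memory grows from 11 to 13 years.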

@chrished
Collaborator

chrished commented Mar 9, 2023

what about querying the data iteratively instead: when done with the first window -> drop the first year + load only the one additional year needed -> ...
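A rough sketch of this sliding-window idea, under assumptions: `load_year` is a hypothetical callable that returns the data for one calendar year (for instance, a wrapper around a database query), and `sliding_windows` is an illustrative name, not a function in the project.

```python
from collections import OrderedDict


def sliding_windows(grad_years, load_year, half_width=5):
    """Yield (grad_year, window) pairs, reusing data across neighboring years.

    `load_year` is a hypothetical loader for one calendar year of data.
    The same `window` object is mutated between iterations, so consume it
    before advancing to the next graduation year.
    """
    window = OrderedDict()  # calendar year -> data, in ascending year order
    for grad_year in sorted(grad_years):
        lo, hi = grad_year - half_width, grad_year + half_width
        # Drop years that have fallen out of the window on the left.
        while window and next(iter(window)) < lo:
            window.popitem(last=False)
        # Load only the years that are not in memory yet.
        start = next(reversed(window)) + 1 if window else lo
        for year in range(start, hi + 1):
            window[year] = load_year(year)
        yield grad_year, window


loaded = []


def fake_load(year):
    """Stand-in loader that records which years were requested."""
    loaded.append(year)
    return f"data for {year}"


for grad_year, window in sliding_windows(range(2000, 2003), fake_load):
    print(grad_year, min(window), max(window), len(window))
print(len(loaded))  # 13 loads instead of 3 * 11 = 33
```

Memory stays at one window of 11 years, but because each window is built from the previous one, the graduation years have to be processed in order.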

@f-hafner
Owner Author

f-hafner commented Mar 9, 2023

that is an option, although it would require some rewriting, and the parallelization would "only" be over fields of study. an alternative is DuckDB. #39

@chrished
Collaborator

chrished commented Mar 9, 2023 via email

@f-hafner
Owner Author

because we calculate the similarities for all pairs of graduates and potential employers, this step only comes after linking. so when we update the linking, we need to rerun the similarity calculations as well.

it's not urgent and not very important, but it crossed my mind and I wanted to keep it as an issue for the moment.
