Running time #60

Open
MAranzazuRU89 opened this issue Feb 10, 2022 · 5 comments

Comments

@MAranzazuRU89

I have a question about the expected running time and the computing capacity I should plan for when using fastLink. I am trying to run it on a database of 1.7M observations, matching on only two variables. However, the code has been running for 12 hours and has not yet gotten past the first task of calculating matches for each variable. So I was wondering whether this is to be expected and I should move to a cluster, or whether this sounds odd and I am doing something wrong.
Thank you!

@aalexandersson

Disclaimer: I am a regular fastLink user, not a fastLink developer.

It depends; details matter. Please show the fastLink code that you used. Are you using blocking?

@tedenamorado
Collaborator

Hi @MAranzazuRU89,

As @aalexandersson mentions, a bit more context would help here. If your data allows for blocking (creating subsets of observations that are similar in at least one dimension), then I have no doubt the task you have in mind can be scaled and perhaps finished in less than 12 hours. If blocking is not an option, then more computing power could be a solution.
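
For illustration only (not your actual code), here is a rough sketch of what a blocked run could look like with fastLink's blockData(); dfA, dfB, "city", and the two matching fields are placeholders for your own data:

```r
library(fastLink)

## blockData() returns a list of blocks; each block carries the row indices
## (dfA.inds / dfB.inds) of the observations that share the blocking value.
blocks <- blockData(dfA, dfB, varnames = "city")

## Run fastLink separately within each block instead of over all 1.7M rows at once.
results <- lapply(blocks, function(b) {
  fastLink(
    dfA = dfA[b$dfA.inds, ],
    dfB = dfB[b$dfB.inds, ],
    varnames         = c("firstname", "lastname"),  # placeholder field names
    stringdist.match = c("firstname", "lastname")
  )
})
```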

Keep us posted!

Ted

@MAranzazuRU89
Author

Hi!
I thought I couldn't block, but now I think I can. I will try that, and if not, I'll move to a cluster. But I think the smarter move is to try blocking first.
Thank you!

@ishanaratan

Hi! I have a question directly related to reducing run time. I am trying to run fastLink on a cluster computer (matching a few million firms), and I was wondering whether I need to specify the number of nodes available (and perhaps structure the code differently)?

I didn't see a mention of how to do this in the documentation, but perhaps missed it. Thanks in advance!

@tedenamorado
Collaborator

Hi @ishanaratan,

If you are using a cluster computer, I would do the following:

  1. Block the data. For example, if you match firms from different cities, one idea is to subset your data by city name.
  2. Run fastLink on one subset per node (or group of nodes); see the sketch below. Within each node, fastLink will allocate the number of clusters so that you do not run into memory issues.

fastLink runs in parallel within a node, but not across nodes. If the nodes have multiple threads, fastLink will make use of all of them when the data are large; if the data are small, it will use only as many threads as it needs.
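
As a rough sketch (assuming a SLURM-style array job where each node picks up one block through an environment variable, and with placeholder column names for your firm data):

```r
library(fastLink)

## Each array task (one per node) handles a single block.
task.id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))

blocks <- blockData(dfA, dfB, varnames = "city")
b      <- blocks[[task.id]]

fl.out <- fastLink(
  dfA = dfA[b$dfA.inds, ],
  dfB = dfB[b$dfB.inds, ],
  varnames         = c("firmname", "address"),   # placeholder field names
  stringdist.match = c("firmname", "address")
)

## Save the matched pairs for this block; combine the files across blocks afterwards.
matches <- getMatches(dfA[b$dfA.inds, ], dfB[b$dfB.inds, ], fl.out)
saveRDS(matches, sprintf("matches_block_%02d.rds", task.id))
```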

Please let us know if anything comes up.

All my best,

Ted
