Running time #60

Open
MAranzazuRU89 opened this issue Feb 10, 2022 · 5 comments

Comments

@MAranzazuRU89

I have a question about the expected running time and the computing capacity I should plan for when using fastLink. I am trying to run it on a database of 1.7M observations, matching on only two variables. However, the code has been running for 12 hours and has not yet gotten past the first task of calculating matches for each variable. So I was wondering whether this is to be expected and I should move to a cluster, or whether this sounds odd and I am doing something wrong.
Thank you!

@aalexandersson

Disclaimer: I am a regular fastLink user, not a fastLink developer.

It depends; details matter. Please show the fastLink code that you used. Are you using blocking?

@tedenamorado
Collaborator

Hi @MAranzazuRU89,

As @aalexandersson mentions, a bit more context would help here. If your data allows for blocking (creating subsets of observations that are similar in at least one dimension), then I have no doubt the task you have in mind can be scaled and perhaps finished in less than 12 hours. If blocking is not an option, then more computing power could be a solution.
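
For illustration only (not your actual code), here is a rough sketch of what a blocked run could look like with fastLink's blockData(); dfA, dfB, "city", and the two matching fields are placeholders for your own data:

```r
library(fastLink)

## blockData() returns a list of blocks; each block carries the row indices
## (dfA.inds / dfB.inds) of the observations that share the blocking value.
blocks <- blockData(dfA, dfB, varnames = "city")

## Run fastLink separately within each block instead of over all 1.7M rows at once.
results <- lapply(blocks, function(b) {
  fastLink(
    dfA = dfA[b$dfA.inds, ],
    dfB = dfB[b$dfB.inds, ],
    varnames         = c("firstname", "lastname"),  # placeholder field names
    stringdist.match = c("firstname", "lastname")
  )
})
```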

Keep us posted!

Ted

@MAranzazuRU89
Author

Hi!
I thought I couldn't block, but now I think I can. I will try that, and if not, I'll move to a cluster. But I think the smarter move is to try blocking first.
Thank you!

@ishanaratan

Hi! I have a question directly related to reducing run time. I am trying to run fastLink on a cluster computer (matching a few million firms), and I was wondering whether I need to specify the number of nodes available (and perhaps structure the code differently)?

I didn't see a mention of how to do this in the documentation, but perhaps missed it. Thanks in advance!

@tedenamorado
Collaborator

Hi @ishanaratan,

If you are using a cluster computer, I would do the following:

  1. Block the data. For example, if you match firms from different cities, one idea is to subset your data by city name.
  2. Run fastLink on one subset per node (or group of nodes); see the sketch below. Within each node, fastLink will allocate the number of clusters so that you do not run into memory issues.

fastLink runs in parallel within a node, but not across nodes. If the nodes have multiple threads, fastLink will make use of all of them when the data are large; if the data are small, it will use only as many threads as it needs.
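
As a rough sketch (assuming a SLURM-style array job where each node picks up one block through an environment variable, and with placeholder column names for your firm data):

```r
library(fastLink)

## Each array task (one per node) handles a single block.
task.id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))

blocks <- blockData(dfA, dfB, varnames = "city")
b      <- blocks[[task.id]]

fl.out <- fastLink(
  dfA = dfA[b$dfA.inds, ],
  dfB = dfB[b$dfB.inds, ],
  varnames         = c("firmname", "address"),   # placeholder field names
  stringdist.match = c("firmname", "address")
)

## Save the matched pairs for this block; combine the files across blocks afterwards.
matches <- getMatches(dfA[b$dfA.inds, ], dfB[b$dfB.inds, ], fl.out)
saveRDS(matches, sprintf("matches_block_%02d.rds", task.id))
```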

Please let us know if anything comes up.

All my best,

Ted
