Scraping should be distributed across multiple machines in order to make it faster.
A few ideas to implement this:
Split the config by period of time. E.g.: with 4 machines, the interval between the start and end time could be split into 4, and each sub-period could be handled by a different machine.
Use Docker images and pull them across multiple machines.
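The first idea above can be sketched as a small helper. This is only an illustration of the time-splitting step, not redditflow's actual code; the function name and the use of `datetime` objects are assumptions:

```python
from datetime import datetime


def split_time_range(start_time, end_time, n_machines):
    """Split [start_time, end_time] into n_machines contiguous frames."""
    step = (end_time - start_time) / n_machines
    frames = []
    for i in range(n_machines):
        frame_start = start_time + i * step
        # use the exact end_time for the last frame to avoid rounding drift
        frame_end = end_time if i == n_machines - 1 else start_time + (i + 1) * step
        frames.append((frame_start, frame_end))
    return frames


# 4 machines -> 4 one-day frames covering Jan 1 to Jan 5
frames = split_time_range(datetime(2022, 1, 1), datetime(2022, 1, 5), 4)
```

Each `(frame_start, frame_end)` pair would then become the `start_time`/`end_time` of one machine's config.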
Hey
I wish to contribute to this feature. Could you please assign it to me?
Also, could you please elaborate on the feature a little more?
Reference: Aviyel
Thanks
Sure! I hope you understand this might be quite a long task, but I will guide you through the requirements if you wish to take this forward.
As of now, redditflow supports running only on a single machine, where the scraping and filtering are done. This might be time-consuming. If a researcher has multiple cloud machines across which they wish to split the task, it can be done the following way:
Take the time period from start_time to end_time in the config, divide it into time frames, and create new configs with a new start_time and end_time for each machine according to the time split. Then, from Python, SSH into the respective cloud machines, copy the configs and scripts over, and run the Python scripts remotely on those machines.
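The config-generation and remote-dispatch steps described above could look roughly like this. This is a sketch, not redditflow's implementation: the config keys (`start_time`, `end_time`), the entry-point script name `run_scraper.py`, and the hostnames are all hypothetical, and it shells out to `scp`/`ssh` rather than using an SSH library:

```python
import subprocess


def make_configs(base_config, frames):
    """Produce one config dict per machine, overriding only the time window."""
    return [dict(base_config, start_time=start, end_time=end)
            for start, end in frames]


def dispatch(host, config_path):
    # Copy the per-machine config to the remote host, then launch the
    # scraper there. "run_scraper.py" stands in for the real entry point.
    subprocess.run(["scp", config_path, f"{host}:~/config.json"], check=True)
    subprocess.run(["ssh", host, "python", "run_scraper.py", "~/config.json"],
                   check=True)


# Example: two machines, each handling half of the overall time window.
configs = make_configs(
    {"subreddit": "news"},
    [("2022-01-01", "2022-01-03"), ("2022-01-03", "2022-01-05")],
)
```

A library such as paramiko or fabric would be a more robust choice than shelling out, since it allows error handling and output capture per machine.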