Support Distributed Machines #3

Open
abhijithneilabraham opened this issue May 5, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@abhijithneilabraham
Member

Scraping should be able to run across distributed machines to make it faster.
A few ideas for implementing this:

  • Split the config by time period. E.g., with 4 machines, the span between the start and end time could be split into 4 parts, with each machine handling one part.
  • Use Docker images and pull them onto multiple machines.
@abhijithneilabraham abhijithneilabraham added the enhancement New feature or request label May 5, 2022
@raaghavrm

Hey,
I would like to contribute to this feature. Could you please assign it to me?
Also, could you please elaborate on the feature a little more?
Reference: Aviyel
Thanks

@abhijithneilabraham
Member Author

Hi @Raaghav4243!

Sure! I hope you understand this might be quite a long task, but I will guide you through the requirements if you wish to take this forward.

As of now, redditflow supports running only on a single machine, where both the scraping and the filtering are done. This can be time-consuming. If a researcher has multiple cloud machines and wishes to split the task across them, it can be done in the following way:

Take the time period from start_time to end_time in the config, divide it into time frames, and create new configs, each with its own start_time and end_time for one machine according to the time split. Then, from Python, use SSH to copy the scripts to the respective cloud machines and run them remotely there.
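
The time split described above could look roughly like the sketch below. It is not taken from redditflow itself: the function name `split_config` is hypothetical, and the timestamp format `%d.%m.%Y %H:%M:%S` is an assumption that should be changed to match whatever the real config uses. Only the `start_time` and `end_time` keys come from the description; every other key is copied through unchanged.

```python
import copy
from datetime import datetime

# Assumed timestamp format; adjust to match the real config.
TIME_FORMAT = "%d.%m.%Y %H:%M:%S"


def split_config(config, n_machines, time_format=TIME_FORMAT):
    """Return n_machines copies of config, each covering one contiguous time slice.

    Hypothetical helper: only start_time and end_time are rewritten.
    """
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")

    start = datetime.strptime(config["start_time"], time_format)
    end = datetime.strptime(config["end_time"], time_format)
    if end <= start:
        raise ValueError("end_time must be after start_time")

    # Equal-length slices; timedelta supports division by an int.
    step = (end - start) / n_machines

    configs = []
    for i in range(n_machines):
        part = copy.deepcopy(config)  # leave the caller's config untouched
        slice_start = start + i * step
        # Pin the last slice to the exact end so rounding never shortens the range.
        slice_end = end if i == n_machines - 1 else start + (i + 1) * step
        part["start_time"] = slice_start.strftime(time_format)
        part["end_time"] = slice_end.strftime(time_format)
        configs.append(part)
    return configs
```

Adjacent slices share a boundary timestamp, so depending on whether the scraper treats the bounds as inclusive, posts created exactly on a boundary may show up on two machines and need de-duplicating when the results are merged.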

Reference for ssh connection via python: https://github.com/paramiko/paramiko
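
As a rough illustration of the SSH step, here is a minimal sketch using paramiko to upload one config file to a machine and run a script against it. The helper names, the remote paths, and the idea that the remote script takes the config path as its only command-line argument are all assumptions made for this example, not part of redditflow.

```python
import shlex


def build_remote_command(remote_script, remote_config, python_bin="python3"):
    """Build the shell command that runs the script on one config file.

    Assumes (hypothetically) that the script accepts the config path as an argument.
    """
    return " ".join(shlex.quote(p) for p in (python_bin, remote_script, remote_config))


def run_on_machine(host, username, key_path, local_config, remote_config, remote_script):
    """Upload one config over SFTP, run the script remotely,
    and return (exit_code, stdout, stderr)."""
    import paramiko  # third-party dependency: pip install paramiko

    client = paramiko.SSHClient()
    client.load_system_host_keys()
    # AutoAddPolicy accepts unknown host keys; fine for a sketch,
    # but a stricter policy is safer for real deployments.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, key_filename=key_path)
    try:
        sftp = client.open_sftp()
        try:
            sftp.put(local_config, remote_config)
        finally:
            sftp.close()

        command = build_remote_command(remote_script, remote_config)
        _, stdout, stderr = client.exec_command(command)
        out = stdout.read().decode()
        err = stderr.read().decode()
        exit_code = stdout.channel.recv_exit_status()
        return exit_code, out, err
    finally:
        client.close()
```

Calling `run_on_machine` once per machine, each with its own config file, would run the slices one after another; running them concurrently (for example with a thread pool) is a natural next step. For scripts that print a lot, redirecting output to a log file on the remote side avoids holding it all in memory.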

Here's another reference project where such distributed configurations and remote connections were done: https://github.com/autonomio/jako

Happy coding!
