Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parallel download with asyncio #204

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

diegorodriguezv
Copy link

Hi. Thanks for this amazing piece of software.
This is a fix for bugs #163 #174 #199.
The bug happens because SSL uses the process id for encryption and if the same requests session is used in more than one process it will fail.
This fix solves the problem by using asyncio instead of multiprocessing. This way the parallelism is handled in-process without the need to fork. It uses less cpu, less memory and solves the problem in windows since no IPC pickling is necessary. Besides, it speeds up execution when the book contains several images or CSS files.
For an easy explanation of why this approach is better for IO bound problems (networking) see this: https://timber.io/blog/multiprocessing-vs-multithreading-in-python-what-you-need-to-know/
A good example is book 9780135262047 which include 1614 images. The execution time went from 30 minutes to 12 in my machine.
I hope this helps.

@diegorodriguezv
Copy link
Author

Other approach that I successfully tried but was far less elegant (and more resource consuming) was creating a copy of the session for each new process.

@lorenzodifuccia
Copy link
Owner

Hi @diegorodriguezv, thank you for your amazing spot.
Could you test it also in a Windows environment?
If it works on every env, we can push it on master!

@diegorodriguezv
Copy link
Author

Hi Lorenzo. I tested it in Ubuntu 19.10 with python 3.7.5 and in windows 10 with python 3.8.2. I downloaded several books and everything looked fine.

@bborysenko
Copy link

It raises NotImplementedError on macOS where sem_getvalue() is not implemented:

[05/Jun/2020 00:11:24]   File "safaribooks.py", line 1098, in <module>
    SafariBooks(args_parsed)
  File "safaribooks.py", line 392, in __init__
    self.collect_css()
  File "safaribooks.py", line 899, in collect_css
    self._start_parallel_download(self._thread_download_css, self.css)
  File "safaribooks.py", line 894, in _start_parallel_download
    loop.run_until_complete(asyncio.gather(*futures))
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 583, in run_until_complete
    return future.result()
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "safaribooks.py", line 863, in _thread_download_css
    self.display.state(len(self.css), self.css_done_queue.qsize())
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/queues.py", line 117, in qsize
    return self._maxsize - self._sem._semlock._get_value()

[05/Jun/2020 00:11:24] Unhandled Exception:  (type: NotImplementedError)

@mfradcourt
Copy link

Tested on windows 10 with Python 3.8.3 and it works well. I guess the macOS error should still be fixed before merging.

@Korred
Copy link

Korred commented Jan 24, 2022

The error mentioned by @bborysenko is not related to the asyncio implementation.
Instead, this is connected to the use of qsize() on macOS.

Please see the following discussion for more information:
keras-team/autokeras#368

Also, a possible fix for the qsize issue for macOS can be taken from here:
https://github.com/RLBot/RLBot/pull/499/files/a5b7e28e72f3009292d61275201f1d9d4fa749be

A portable implementation of multiprocessing.Queue.
Because of multithreading / multiprocessing semantics, Queue.qsize() may
raise the NotImplementedError exception on Unix platforms like Mac OS X
where sem_getvalue() is not implemented. This subclass addresses this
problem by using a synchronized shared counter (initialized to zero) and
increasing / decreasing its value every time the put() and get() methods
are called, respectively. This not only prevents NotImplementedError from
being raised, but also allows us to implement a reliable version of both
qsize() and empty().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants