Multiple threads #106

Open
adildg opened this issue Nov 7, 2022 · 5 comments · May be fixed by #152

@adildg

adildg commented Nov 7, 2022

Hello,

I would like to check multiple domains at the same time. Is it okay to use multithreading?

@Mr0grog
Member

Mr0grog commented Nov 7, 2022

This package is currently based on the very widely used Requests package, which is unfortunately not thread-safe. That means that, if you want to make requests from multiple threads, you should create a separate WaybackClient instance in each thread.

For example:

import threading
from concurrent.futures import ThreadPoolExecutor
import wayback

mementos_to_get = []  # Fill in with a list of CDX records or URLs.

# Thread-local storage: each thread sees its own attributes on this object.
thread_data = threading.local()

# Get a unique WaybackClient for whatever thread you're on.
def get_wayback_client():
    if not hasattr(thread_data, 'wayback'):
        thread_data.wayback = wayback.WaybackClient()
    return thread_data.wayback

def get_memento_safely(*args, **kwargs):
    return get_wayback_client().get_memento(*args, **kwargs)

with ThreadPoolExecutor(max_workers=4) as executor:
    for memento in executor.map(get_memento_safely, mementos_to_get):
        # Do something with each memento result
        pass
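
To check that the thread-local pattern really gives each worker thread its own client, here is a minimal offline sketch. `FakeClient`, `check_ownership`, and `run_check` are made-up names for illustration; `FakeClient` only stands in for `wayback.WaybackClient` so that nothing touches the network.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class FakeClient:
    """Stand-in for wayback.WaybackClient (assumption: illustration only)."""
    def __init__(self):
        # Remember which thread created this client.
        self.created_on = threading.get_ident()

# One shared threading.local(); attributes set on it are private to each thread.
thread_data = threading.local()

def get_client():
    if not hasattr(thread_data, "client"):
        thread_data.client = FakeClient()
    return thread_data.client

def check_ownership(_):
    client = get_client()
    # True when the client was created on the thread that is now using it.
    return client.created_on == threading.get_ident(), id(client)

def run_check(task_count=20, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(check_ownership, range(task_count)))
```

Every task should report that its client was created on the current thread, and the number of distinct clients should never exceed `max_workers`.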

Or using classic thread classes:

import queue
import threading
import wayback

mementos_to_get = []  # Fill in with a list of CDX records or URLs.

class Worker(threading.Thread):
    def __init__(self, input_queue, output_queue):
        super().__init__()
        self.input_queue = input_queue
        self.output_queue = output_queue

    def run(self):
        # Make a client for this thread and use it:
        with wayback.WaybackClient() as client:
            while True:
                try:
                    # This expects the queue to already be full, and not be added to in real time.
                    # Otherwise you should use get() instead of get_nowait().
                    item = self.input_queue.get_nowait()
                except queue.Empty:
                    # This thread is done, so let the run() method end.
                    break
                try:
                    # Pass whatever other arguments you need here.
                    memento = client.get_memento(item)
                    self.output_queue.put(memento)
                except Exception as error:
                    self.output_queue.put(error)
                finally:
                    # Only mark items that were actually taken off the queue.
                    self.input_queue.task_done()

processing_queue = queue.Queue()
results_queue = queue.Queue()
for item in mementos_to_get:
    processing_queue.put_nowait(item)
threads = [Worker(processing_queue, results_queue) for i in range(4)]
for thread in threads:
    thread.start()

# Wait for them all to finish:
processing_queue.join()
# Start reading the results:
while not results_queue.empty():
    memento_or_error = results_queue.get()
    # Do something with the result
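
If items are added to the queue while the workers are already running, the blocking `get()` mentioned in the comments needs some way to tell workers when to stop. A rough sketch of one common approach, a sentinel value, follows. `STOP`, `worker`, `run_workers`, and `fetch` are hypothetical names; `fetch` stands in for a call like `client.get_memento`, and in real use each worker would create its own WaybackClient rather than share one.

```python
import queue
import threading

STOP = object()  # Sentinel telling a worker that no more items are coming.

def worker(fetch, input_queue, output_queue):
    """Process items until the STOP sentinel arrives."""
    while True:
        item = input_queue.get()  # Blocks until an item is available.
        try:
            if item is STOP:
                break
            output_queue.put(fetch(item))
        except Exception as error:
            # Report failures as results instead of killing the thread.
            output_queue.put(error)
        finally:
            # Runs for every item taken off the queue, including STOP.
            input_queue.task_done()

def run_workers(fetch, items, worker_count=4):
    input_queue = queue.Queue()
    output_queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(fetch, input_queue, output_queue))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for item in items:  # Items can be added while workers are running.
        input_queue.put(item)
    for _ in threads:  # One sentinel per worker.
        input_queue.put(STOP)
    for thread in threads:
        thread.join()
    results = []
    while not output_queue.empty():
        results.append(output_queue.get())
    return results
```

Results arrive in completion order, not input order, so pair each result with its input if the order matters.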

You can also use WaybackSession to share a pool of connections across threads, but it’s really complicated and I don’t recommend it. Here’s an example: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/fcfb36341090bf1a2b560a9008c711386ef8da17/web_monitoring/cli/cli.py

That said, thread safety is one of my next two priorities (the other is the Wayback Machine’s new, beta CDX search API). v0.4.0 will be out in the next couple of days, and then thread safety should be in v0.5.0. When that’s done, you can just use one client wherever you want, without worrying about whether you are on different threads. But that will take a lot of work, since it means moving off the Requests package. I don’t have a clear timeframe for it. (See #58.)

@Mr0grog
Member

Mr0grog commented Nov 7, 2022

Relatedly, if your use case is basically:

  1. Use search() to find a list of mementos, then
  2. Get those mementos efficiently on a bunch of threads

I’d appreciate any feedback on how we could or should make a nice wrapper for that in #17. (It will probably be a while before that gets implemented, though!)
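
For discussion, here is one rough shape such a wrapper could take. This is only a sketch, not the package's API: `get_mementos_threaded` and `make_client` are hypothetical names, and the only assumption about the client is that it has a `get_memento()` method that accepts a record.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def get_mementos_threaded(records, make_client, max_workers=4):
    """Fetch a memento for each record, using one client per worker thread.

    Hypothetical helper, not part of the wayback package. `make_client` is
    any zero-argument callable returning an object with `get_memento()`.
    Returns a list of (record, memento_or_exception) pairs in input order.
    """
    thread_data = threading.local()

    def fetch(record):
        # Lazily create one client per worker thread.
        if not hasattr(thread_data, "client"):
            thread_data.client = make_client()
        try:
            return record, thread_data.client.get_memento(record)
        except Exception as error:
            # Return the error so one failure does not stop the whole batch.
            return record, error

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, records))
```

In use, this might look like `records = list(client.search('https://example.com/'))` followed by `get_mementos_threaded(records, wayback.WaybackClient)`. Note that this sketch never closes the per-thread clients, which a real implementation would need to handle.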

@adildg
Author

adildg commented Nov 8, 2022

Amazing! Thank you so, so much for your explanation <3

@Mr0grog Mr0grog changed the title from "Multiple requests" to "Multiple threads" Nov 10, 2022
@Mr0grog Mr0grog added this to the v0.5.0 milestone Nov 10, 2022
@kyungsub1108

I tried multithreading and got blocked by the website.
If you are trying it, I recommend adding a time.sleep() call between requests.
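
With several threads, a plain `time.sleep()` in each thread only spaces out that thread's own requests. A minimal sketch of a limiter shared by all threads is below. `MinIntervalLimiter` is a made-up name, and the interval you pass is your own guess; the limits the Wayback Machine actually enforces are not stated in this thread.

```python
import threading
import time

class MinIntervalLimiter:
    """Allow at most one call every `interval` seconds across all threads."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def wait(self):
        # Reserve the next free time slot while holding the lock...
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.interval
        # ...then sleep outside the lock so other threads can reserve theirs.
        delay = slot - now
        if delay > 0:
            time.sleep(delay)
```

Create one limiter, share it between all worker threads, and call `limiter.wait()` right before each request.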

@Mr0grog
Copy link
Member

Mr0grog commented Dec 13, 2023

Quick update: I’m considering this a duplicate of #58, which I am pretty committed to actually solving this month.

@kyungsub1108 we made a bunch of rate limiting improvements recently in v0.4.4, and have some even bigger ones coming in v0.5.0 later this month (along with actual thread safety, so you can use a single client across multiple threads). Hopefully those help with situations like yours.

@Mr0grog Mr0grog moved this to Backlog in Wayback Roadmap Dec 13, 2023
@Mr0grog Mr0grog moved this from Backlog to Prioritized in Wayback Roadmap Dec 13, 2023
@Mr0grog Mr0grog assigned Mr0grog and unassigned Mr0grog Dec 13, 2023
@Mr0grog Mr0grog moved this from Prioritized to In Progress in Wayback Roadmap Dec 13, 2023
Projects
Status: In Progress