misc(RequestQueueV2): adding notes for request queue v2 implementation #2700
base: master
Conversation
Thanks @drobnikj for writing this! I briefly tested applying some of your notes on fetchNextRequest on the Python version and it seems to work so far.
I will need to read the code at least once more to make sure I get what's happening though 🙂
for (const { id, uniqueKey } of headData.items) {
    // Queue head index might be behind the main table, so ensure we don't recycle requests
    if (
// TODO Kuba: Missing id or uniqueKey means that the request returned from the queue is malformed. This should never happen; we need to investigate why.
I believe this is just defensive programming in the wrong place. The types support your claim that the ID and unique key should always be present. But since there is no runtime validation of the API responses (on the client side), we have no guarantee that the types actually match.
Maybe we should log a warning or something rather than ignoring this inconsistency completely.
If the API can return objects that do not conform to the docs, maybe that should be checked on the server side? Adding code downstream for "what if the docs lie...?" is not an efficient thing to do.
This is true. I even added this check and a log warning on the backend at the time I was writing this feedback; so far there's been no issue there. The problem could be caused somewhere down the way by some parsing error or network error, who knows. So my point was that this should almost never happen, and if it does, we should log it here so we have some evidence. I don't like the approach of just ignoring it here.
Could you turn that warning into an HTTP 500 then? It's a stretch to imagine that a JSON object that passed validation would somehow lose a field during the transfer over TCP.
I will change it to a 500, but still, if this happens we should log it here, not ignore it; that was my point 😄
From the meeting: this is optional and should be fixed on the backend.
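For illustration, a minimal sketch of what "log it instead of silently skipping" could look like on the client side; filterValidHeadItems and the plain console.warn are stand-ins for illustration, not Crawlee's actual code or logger:

interface QueueHeadItem {
    id?: string;
    uniqueKey?: string;
}

// Hypothetical helper: keep only well-formed head items and log the rest.
function filterValidHeadItems(items: QueueHeadItem[]): Required<QueueHeadItem>[] {
    return items.filter((item): item is Required<QueueHeadItem> => {
        if (!item.id || !item.uniqueKey) {
            // Keep some evidence instead of dropping the malformed item silently.
            console.warn('Queue head returned a malformed request', item);
            return false;
        }
        return true;
    });
}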
// TODO Kuba: There is a race condition here: if you lock the request and do not unlock it, it will stay locked and the run gets stuck.
// TODO Kuba: We should at least log this.
This only happens under very specific circumstances, and even if we don't manage to unlock the request, the lock will time out in ~60 seconds, right? So we shouldn't be risking a deadlock.
It's 60 seconds only when there is no pageTimeout set to hours, which happens in the Google Maps scraper.
I think at least a warning or debug log should be there so we have something to go on when debugging this.
Well, maybe it's a bad idea to lock requests for this long then. The RequestQueue should be able to lock for a shorter time and prolong the lock periodically.
Yeah, this was the thing I was proposing in some Slack thread, but I think the outcome of the discussion was that it can generate even more issues and can be hard to implement, as the prolonging needs to be solid (it should be done in a separate thread to avoid being blocked by the scraping processes, etc.).
That brings us back to the "please make sure I have N locked requests, prolong them and send them to me" endpoint. If that's the only thing we need to do to have a properly managed queue head, it's worth the extra overhead of making sure that it's not blocked by random stuff on the event loop.
But even if somebody runs something that can hog the event loop for 60s straight (I know, under some circumstances, a shorter time would also be problematic), I'm pretty confident that they will have bigger problems than losing locks on their requests.
From the meeting: no strong opinion, but prolonging the lock using an interval looks better in that case.
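For context, a rough sketch of the interval-based prolonging discussed above, assuming a prolong callback that wraps something like the API client's prolongRequestLock call; the class, names, and timing are illustrative, not the agreed design:

class LockProlonger {
    private timer?: NodeJS.Timeout;

    constructor(
        // Assumed callback wrapping a "prolong request lock" API call.
        private readonly prolong: (requestId: string, lockSecs: number) => Promise<void>,
        private readonly lockSecs = 60,
    ) {}

    // Renew the lock at half its lifetime, so a single missed tick does not lose it.
    track(requestId: string): void {
        this.timer = setInterval(async () => {
            try {
                await this.prolong(requestId, this.lockSecs);
            } catch (err) {
                console.warn(`Failed to prolong lock for request ${requestId}`, err);
            }
        }, (this.lockSecs / 2) * 1000);
    }

    release(): void {
        if (this.timer) clearInterval(this.timer);
    }
}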
const headData = await this.client.listAndLockHead({
    limit: Math.min(forefront ? this.assumedForefrontCount : 25, 25),
    lockSecs: this.requestLockSecs,
});

const headIdBuffer = [];

// TODO Kuba: This is a bit of a hack; this should not happen. We need to investigate why.
I suppose you mean unlocking the forefront requests? I guess @barjin could provide some context here?
Anyway, this is the PR that added it: https://github.com/apify/crawlee/pull/2689/files
From the meeting: fix this only if we implement "ensure head and lock".
@@ -567,6 +567,30 @@ export abstract class RequestProvider implements IStorage {
 * the function might occasionally return a false negative,
 * but it will never return a false positive.
 */
/**
 * TODO Kuba:
 * I would move this to the request_queue_v2.ts file as it is related to the new API.
I agree - I will just copy the relevant bits from RequestProvider to the Request queue v2 class and remove the inheritance.
/**
 * TODO Kuba:
 * I would move this to the request_queue_v2.ts file as it is related to the new API.
 * Let's clean it up. We would probably need some new API endpoints to make this work.
I can imagine consuming the queue would be easier if we had the following endpoints:
1. "ensure queue head" - it would take itemCount: number, look at the items locked by the client, and if there were fewer than itemCount of them, it would lock more. Then it would return all requests held by that client (or IDs thereof).
2. "is the queue finished?" - that would contain the logic described below and probably just return true/false; maybe it could also respond "ask me later" in the case of the safety timeout.

Living without 1. is pretty easy, but if we had it, it would mean less bookkeeping on Crawlee's side - we could just spam it to make sure we have stuff to work on.
2., on the other hand, would be a tremendous help.
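To make endpoint 1 concrete, a hypothetical sketch of how Crawlee could consume it; the ensureQueueHead function and its response shape do not exist anywhere and are purely illustrative:

// Hypothetical response of the proposed "ensure queue head" endpoint.
interface EnsureQueueHeadResponse {
    // All requests currently locked by this client, topped up to itemCount.
    lockedRequestIds: string[];
}

// The endpoint does not exist; declared here only to sketch the consumer side.
declare function ensureQueueHead(itemCount: number): Promise<EnsureQueueHeadResponse>;

async function consumeLoop(handle: (id: string) => Promise<void>): Promise<void> {
    for (;;) {
        // Safe to call repeatedly ("spam" it): it returns existing locks, locks
        // more requests if we hold fewer than itemCount, and prolongs all of them.
        const { lockedRequestIds } = await ensureQueueHead(25);
        if (lockedRequestIds.length === 0) break; // nothing left (or ask endpoint 2)
        for (const id of lockedRequestIds) await handle(id);
    }
}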
Ad 1. This is a little bit of a new concept; I think if we introduce this one, listAndLockHead will not make sense anymore. Maybe we can add some parameter instead. But can you elaborate on how this will help?
Ad 2. This makes sense to offload to the API, I just do not want to offload the whole logic, as the scraper needs to decide whether it wants to finish or not. Plus, it has a better idea of whether it needs to enqueue new URLs or not. Maybe just an isHeadEmpty endpoint would be OK to have in the API.
Ad 1. This is a little bit of a new concept; I think if we introduce this one, listAndLockHead will not make sense anymore. Maybe we can add some parameter instead. But can you elaborate on how this will help?
I would love to make it trivial to use the API from Crawlee. If I could just call this endpoint repeatedly and it would a) return my current locked requests, b) lock more requests if possible and if I have fewer than itemCount, and c) prolong locks on those I already have, that would be a step in that direction.
On top of that, if adding requests to the queue also returned the new "queue head", we could get rid of the whole special treatment for forefronted requests.
Ad 2. This makes sense to offload to the API, I just do not want to offload the whole logic, as the scraper needs to decide whether it wants to finish or not. Plus, it has a better idea of whether it needs to enqueue new URLs or not. Maybe just an isHeadEmpty endpoint would be OK to have in the API.
Which part of the logic would you like to keep in the crawler? Just because RequestQueue.isFinished() says that I can finish doesn't mean that e.g. BasicCrawler needs to obey it without question if e.g. keepAlive: true is used.
Ad 1. Yeah, your points make sense. On the other hand, the request queue will be more expensive, because you will need to do more queries to Dynamo, plus you need to transfer more data between the client and the server. I will try to find out how the cost will change and whether it is doable with Dynamo.
Ad 2. On the backend you have no idea what is happening in Crawlee; there can still be some parsing of the sitemap or similar going on. Maybe we are talking about the same function but do not agree on naming, because you can easily say that the queue is empty on the backend, but you cannot tell for sure that the requestQueue is finished. That was my point.
Ad 1. Yeah, your points make sense. On the other hand, the request queue will be more expensive
Well, if it's cheap but nobody actually knows how to use it, that's a loss for us. Bugs and leaky abstractions have already cost us plenty, and I've seen users running stuff on Apify with their home-baked queues, which also isn't great.
Ad 2. On the backend you have no idea what is happening in Crawlee; there can still be some parsing of the sitemap or similar going on. Maybe we are talking about the same function but do not agree on naming, because you can easily say that the queue is empty on the backend, but you cannot tell for sure that the requestQueue is finished. That was my point.
True, but if someone wants to 1. run Crawlee in parallel with a single shared request queue and 2. break the paradigm that you add requests while handling another request from the same queue, I'd argue that they can either just wait for some time (the current grace period) or implement their own synchronization in whatever way they see fit.
Ad 2. On the backend you have no idea what is happening in Crawlee
Also, in Crawlee you have no idea what's happening in another Crawlee process that consumes the same queue 🙂
From the meeting:
Ad 1. Maybe implement "ensure queue head", depending on whether it is possible with the current index in Dynamo.
Ad 2.
- Document modifiedAt for the Get queue API endpoint, plus serve it from the requestQueueCache.
- Add a flag to listAndLockHead indicating whether there are any locks in the queue.
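For illustration only, a hypothetical shape of a listAndLockHead response extended with the flag from the second point; the hadLockedRequests field is an assumption, not a real API field:

// Hypothetical extension of the listAndLockHead response.
interface ListAndLockHeadResponseSketch {
    items: { id: string; uniqueKey: string }[];
    lockSecs: number;
    // Proposed flag: true if the queue still contains requests locked by any
    // client, letting callers distinguish "empty" from "all locked elsewhere".
    hadLockedRequests: boolean;
}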
// TODO Kuba: We are missing a check here for queuePausedForMigration and for aborting.
// If we did not pause it here, we would keep fetching and locking requests.
// The race condition here is that we call _clearPossibleLocks(), but after that we can still fetch and lock requests, so the requests will be locked and never unlocked.
Soo... we should just return null if queuePausedForMigration || aborting? That should prevent us from calling listAndLockHead again.
I didn't go deep into the implementation of each scraper, but overall, once a migrating or aborting event happens, we should stop working with the queue and persist the state.
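A minimal sketch of the guard being discussed, reusing the flag names that already appear in the diff (queuePausedForMigration, aborting); the class and method bodies are simplified stand-ins, not the actual fix:

class RequestQueueV2Sketch {
    private queuePausedForMigration = false;
    private aborting = false;

    async fetchNextRequest(): Promise<string | null> {
        // Bail out before calling listAndLockHead(): once a migration or abort
        // has started, locking new requests would race with _clearPossibleLocks()
        // and leave requests locked with nobody left to unlock them.
        if (this.queuePausedForMigration || this.aborting) return null;
        // ... the normal listAndLockHead() path would follow here ...
        return null;
    }
}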
* - If the queueHeadIds is not empty -> happy path, we have some requests to process. -> false
* - If the queueHeadIds is empty
*   -> we need to ensure that the queue is really empty (ensureHeadIsNonEmpty).
*   -> we need to check whether other clients are still processing requests (using listHead and getRequest on the first request)
What does the getRequest accomplish here? We already established that our queue head is empty (line above), and if listHead returns a non-empty list, it means that a) some request arrived in the meantime or b) some other client is still working. Both mean we are not finished, don't they?
If you check the line below, the getRequest here is only there to check which type of request it is. We will use this info to give the user an idea of what is happening and log it, e.g. if there is a request locked by another client, we just print something like "other client is still processing the queue".
Pardon me, I'm stuck at "type of request" - what should I be looking for? If we're just interested in lock status, isn't it already certain that the request is locked by another client if ensureHeadIsNonEmpty returns nothing but listHead still returns requests?
Yeah, listHead returns even locked requests, plus pending ones. This way you can list all pending and locked requests for the queue, and then check whether other clients have locked them.
But as I write this it feels strange, as it does not make sense 😄. Anyway, we should fix this to return just pending requests without the locked ones, and introduce a new API to list just the requests locked by a client.
But... if I got nothing from listAndLockHead, there should be no pending requests..? Right? Or if there are (added between listAndLockHead and the subsequent listHead), isn't that what that grace period is for?
But... if I got nothing from listAndLockHead, there should be no pending requests..?
Yes
Or if there are (added between listAndLockHead and the subsequent listHead), isn't that what that grace period is for?
I think listHead is used there just to get the locked requests. True, it could be used this way too, but you can achieve the same with listAndLockHead and a subsequent listAndLockHead.
Okay, so what you suggest in the comment is:
1. try listAndLockHead
2. if it returns nothing, try listHead
3. if there's something, it might be a) a request that arrived between 1. and 2., or b) a request locked by another client, and we detect whether it's a) or b) using getRequest, but this whole shtick is just so that we have more information to log

Do I understand this right?
Yep, it can sound strange when you write it out 😄. I think the user should have an idea of whether they are paying compute units to wait for another run to finish its job or not.
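Putting the flow from this thread together, a hedged sketch of what the described check could look like; the client method shapes only loosely mirror the request queue API, and the lockedByAnotherClient field is an assumption for illustration:

interface QueueClientSketch {
    listAndLockHead(opts: { limit: number; lockSecs: number }): Promise<{ items: { id: string }[] }>;
    listHead(opts: { limit: number }): Promise<{ items: { id: string }[] }>;
    getRequest(id: string): Promise<{ lockedByAnotherClient?: boolean } | undefined>;
}

async function isFinished(client: QueueClientSketch): Promise<boolean> {
    const locked = await client.listAndLockHead({ limit: 25, lockSecs: 60 });
    if (locked.items.length > 0) return false; // happy path: we have work to do

    const head = await client.listHead({ limit: 1 });
    if (head.items.length === 0) return true; // nothing pending, nothing locked

    // The head is non-empty but we could not lock anything: either a request
    // arrived in between, or another client holds the lock. getRequest only
    // tells us which case it is, so we can log a useful message for the user.
    const request = await client.getRequest(head.items[0].id);
    if (request?.lockedByAnotherClient) {
        console.info('Another client is still processing the queue, waiting...');
    }
    return false;
}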
Adding some notes for the requestQueueV2 implementation.
Let's discuss.