docs: clarify AWS Lambda storage #2477
base: master
There is ephemeral storage in `/tmp` (https://docs.aws.amazon.com/lambda/latest/api/API_EphemeralStorage.html), which could technically be used if desired: `CRAWLEE_STORAGE_DIR=/tmp/crawlee/storage`
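As a rough illustration of the environment variable mentioned above, here is a minimal sketch of how it might be set from code before any crawler is created. The helper `resolveStorageDir` and the constant `LAMBDA_TMP` are hypothetical names invented for this example and are not part of Crawlee; only `CRAWLEE_STORAGE_DIR` and the `/tmp/crawlee/storage` path come from the description.

```typescript
import * as path from "node:path";

// Assumption: /tmp is the writable ephemeral storage location on Lambda.
const LAMBDA_TMP = "/tmp";

// Hypothetical helper (not a Crawlee API): keep an explicitly configured
// directory, otherwise fall back to a path under /tmp.
function resolveStorageDir(env: Record<string, string | undefined>): string {
  const configured = env.CRAWLEE_STORAGE_DIR;
  if (configured !== undefined && configured.length > 0) {
    return configured;
  }
  return path.posix.join(LAMBDA_TMP, "crawlee", "storage");
}

// Set the variable early, before any crawler or storage is created.
process.env.CRAWLEE_STORAGE_DIR = resolveStorageDir(process.env);
```

Setting the same variable in the Lambda function's environment configuration should have the same effect without any code.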
I am honestly not sure if this is adding any clarity; to me it's actually adding confusion (as now you say it's Crawlee that has read-only storage?). If you want to improve this, why not mention what you said in the PR description explicitly?
This is only true to an extent - the ephemeral storage can be shared between different Lambda invocations, provided they run in the same execution environment (i.e. if you call the Lambdas one after another, AWS will repurpose the running Lambda environment). This might cause some very hard-to-debug issues (stuck shared state from the previous runs) - even though Crawlee should always purge the previous state, you can never be too cautious with these things :) This is especially important if you want to run multiple crawler instances in one Lambda.

I agree w/ @B4nan that explaining all these whys and wherefores is rather counterproductive - I'd show the one and only way to do this rather than confusing the reader with (more or less) irrelevant details.
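To make the shared-state concern concrete, here is a minimal sketch of one way to keep invocations from seeing each other's files if `/tmp` were used anyway: give every invocation its own freshly created directory and delete it afterwards. This is an illustration under assumptions, not the approach the docs recommend; `createInvocationStorageDir` and `removeInvocationStorageDir` are hypothetical helpers, and only Node.js standard library calls are used.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: create a unique directory for one invocation.
// os.tmpdir() resolves to /tmp on Linux unless TMPDIR is set.
function createInvocationStorageDir(): string {
  return fs.mkdtempSync(path.join(os.tmpdir(), "crawlee-"));
}

// Hypothetical helper: remove the directory so nothing leaks into the
// next invocation that reuses the same execution environment.
function removeInvocationStorageDir(dir: string): void {
  fs.rmSync(dir, { recursive: true, force: true });
}

// Usage sketch: two "invocations" in the same process get separate directories.
const first = createInvocationStorageDir();
const second = createInvocationStorageDir();
console.log(first !== second); // each invocation has its own directory
removeInvocationStorageDir(first);
removeInvocationStorageDir(second);
```

Even with this isolation, in-memory storage avoids the cleanup step altogether, which is the simpler option argued for above.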
Thanks for your feedback @B4nan and @barjin. Sounds like you're saying we should use in-memory storage not because of the read-only Lambda filesystem, but because using the filesystem would introduce "statefulness" and potentially hard-to-debug issues. I've tried to update it to express that in 70a4fdd. If you still think it's worse than before, then feel free to edit it and/or close this pull request.