Restore support for profile saving aka stateful crawl support #423
This project collects the issues related to stateful crawling. I suspect the problem will be pretty simple to fix, but will require a lot of testing to discover inevitable corner-case failures. Unfortunately, these corner cases are quite important for stateful crawling, as missing a site leads to an incomplete/inconsistent profile. The instability was introduced by geckodriver (back during the conversion from Selenium 2 to 3). Geckodriver manages browser profiles separately from Selenium, and I suspect this upgrade introduced a race condition between our profile management code and geckodriver's. E.g., sometimes the geckodriver process would delete the profile directory before we had a chance to back it up. Some specific pointers:
Note that it appears that the way we pass most configuration arguments (including the profile) to Selenium is deprecated, and we instead need to use a …
I wrote some code to do this for multipreffer tests: https://github.com/mozilla/multipreffer/blob/65f3bc67e4b8b381fd101a861bd8836b98cec101/test/functional/utils.js#L100-L122
Hello @englehardt, I am trying to figure out this issue as part of my Outreachy internship. After reading this issue, my understanding is that we are facing a clash in temporary profile storage between the directory generated by the Selenium driver and Firefox's default.
Oversimplified, the problem is: when GeckoDriver passes the critical exception through Selenium to our Python code, the best we can do is try to be faster at copying the temp directory than GeckoDriver is at deleting it, which is inherently racy and bad.
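To make the race concrete, here is a minimal sketch of what "be faster at copying than GeckoDriver is at deleting" looks like. The helper name and paths are hypothetical, not OpenWPM's actual code; the point is that the existence check and the copy can never be atomic against an external process.

```python
import shutil
from pathlib import Path


def try_backup_profile(profile_dir: str, backup_dir: str) -> bool:
    """Attempt to snapshot the browser profile before geckodriver's
    cleanup removes it. Inherently racy: the directory can disappear
    between the existence check and (or even during) the copy."""
    src = Path(profile_dir)
    if not src.is_dir():
        return False  # geckodriver already won the race
    try:
        shutil.copytree(src, backup_dir)
        return True
    except FileNotFoundError:
        # The profile was deleted mid-copy: the race was lost.
        return False
```

No amount of ordering on our side closes the window; that is why the discussion below moves toward intervening before geckodriver's cleanup runs at all.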
@vringar - now that we have the finalize command, do things get easier?
For a successful visit, quite likely, but it still doesn't handle an unexpected shutdown.
@englehardt I see a path for restoring some of the original functionality. While I take your point that there are a lot of important edge cases, I think the only way to work through them is to restore the functionality and start figuring them out. Here's what I'm thinking: I believe I have a path where our profile path is the actual geckodriver temp profile path rather than the path we get from Selenium. If we know when things are about to die (which I think @vringar's big efforts have given us more insight into), then we should be able to zip up and save that profile directory. If geckodriver really did crash, I would expect it to not have cleaned up after itself, so that temp directory should still be there too.
I agree. Doing this incrementally is necessary and okay. My point was that it's hard to rely on the feature prior to doing the work, because losing a profile can heavily impact a measurement.
It seems like that could work. I suspect that the geckodriver process doesn't crash so much as throw an error. Perhaps we can handle that error somehow and deterministically intervene before geckodriver goes through its cleanup process. I'm not 100% sure, but investigating that could be an initial step.
As we discussed in the weekly meeting, a possible solution is described here: https://firefox-source-docs.mozilla.org/testing/geckodriver/CrashReports.html |
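The linked documentation describes enabling the Firefox crash reporter through environment variables so that minidumps survive a hard crash. A sketch of that setup, building the environment entries as a plain capabilities dict (the helper name is hypothetical, and the exact capability layout should be checked against the linked docs and your Selenium version):

```python
def crash_report_capabilities(minidump_dir: str) -> dict:
    """Build a capabilities fragment that enables Firefox's crash
    reporter via environment variables passed to the browser, per the
    geckodriver CrashReports documentation. Treat the variable set and
    the moz:firefoxOptions layout as assumptions to verify."""
    return {
        "moz:firefoxOptions": {
            "env": {
                # Enable the crash reporter (geckodriver disables it by default).
                "MOZ_CRASHREPORTER": "1",
                # Don't submit reports to Mozilla; just write minidumps locally.
                "MOZ_CRASHREPORTER_NO_REPORT": "1",
                # Also capture crashes that happen during shutdown.
                "MOZ_CRASHREPORTER_SHUTDOWN": "1",
                # Directory where minidump files should be saved.
                "MINIDUMP_SAVE_PATH": minidump_dir,
            }
        }
    }
```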
It seems that since bdb930f we get the browser profile location from … This means that we no longer parse the profile location out of the geckodriver logs and can remove most (if not all) of the code in https://github.com/mozilla/OpenWPM/blob/master/openwpm/deploy_browsers/selenium_firefox.py. Do we have other reasons to want to redirect the geckodriver logs to the main logger?
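One way to read the profile location without log parsing is from the session capabilities, assuming the `moz:profile` entry that geckodriver reports for the active session (a sketch with a hypothetical helper; with a live driver this would be called as `profile_path_from_capabilities(driver.capabilities)`):

```python
def profile_path_from_capabilities(capabilities: dict) -> str:
    """Return the temp profile directory reported in the session
    capabilities under "moz:profile". Raises if the session does not
    expose a profile path (e.g. a non-Firefox driver)."""
    try:
        return capabilities["moz:profile"]
    except KeyError:
        raise RuntimeError("session capabilities carry no moz:profile entry")
```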
We want to have the geckodriver logs as part of our logging infrastructure so we can see errors happening inside the WebExtension along with the errors that are happening in the Python part of OpenWPM. |
I have run into a problem in the implementation of the solution described above. Using … I checked Selenium's issue tracker and it seems that no one has mentioned this issue so far. This behavior can be reproduced by running the code example below and checking the …

```python
import tempfile

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

profile = tempfile.mkdtemp("firefox-profile")

fo = Options()
fo.add_argument("-profile")
fo.add_argument(profile)
fo.set_preference("browser.startup.page", "https://example.com")

driver = webdriver.Firefox(options=fo, service_args=["--marionette-port", "2828"])
driver.get("https://example.com")
```

As a workaround we could create the user.js file ourselves.
Interesting. I think you should file a bug in the geckodriver component. I'd expect this to be Firefox specific.
Done. Here's the issue: mozilla/geckodriver#1844 |
Very nice writeup! |
Geckodriver has a bug that makes it write the browser preferences we set, as well as its own default browser preferences, to a user.js file in the wrong profile directory when using a custom profile: mozilla/geckodriver#1844. As a temporary workaround until this issue gets fixed, we create the user.js file ourselves. In order to do this, we keep a copy of geckodriver's default preferences in our code. Closes openwpm#423
#382 disables support for stateful crawling and profile management. This should be restored.
Related to #383 and openwpm/openwpm-crawler#13. There is a project but no specific issue to track restoration of stateful crawl support, hence this issue.