Restore support for profile saving aka stateful crawl support #423

Closed
motin opened this issue Jul 24, 2019 · 15 comments · Fixed by #864
Comments

@motin
Contributor

motin commented Jul 24, 2019

#382 disables support for stateful crawling and profile management. This should be restored.

Related to #383 and openwpm/openwpm-crawler#13, and there is a project but no specific issue to track restoration of stateful crawl support, hence this.

@englehardt
Collaborator

englehardt commented Aug 13, 2019

This project collects the issues related to stateful crawling. I suspect the problem will be pretty simple to fix, but will require a lot of testing to discover inevitable corner-case failures. Unfortunately, these corner cases are quite important for stateful crawling, as missing a site leads to an incomplete/inconsistent profile.

The instability was introduced by geckodriver (back during the conversion from Selenium 2 to 3). Geckodriver manages browser profiles separately from Selenium, and I suspect this upgrade introduced a race condition between our profile management code and geckodriver's. E.g., sometimes the geckodriver process would delete the profile directory before we had a chance to back it up.

Some specific pointers:

Note that the way we pass most configuration arguments (including the profile) to Selenium appears to be deprecated, and we instead need to use a Service object. Also note that we currently use a patched version of the Service object, which may no longer be necessary. Moving to this will hopefully fix the inconsistencies in profile locations, but likely won't fix the race condition related to dumping and reloading profiles. For that, we'll need to make sure we archive the temporary profile before the geckodriver process deletes it (and will need to ensure we do so regardless of how geckodriver closes).
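
The archiving step described above could look roughly like the sketch below. It is a minimal illustration using only the Python standard library, not OpenWPM's actual profile management code; the function names `archive_profile` and `restore_profile` and the archive layout are assumptions made for this example.

```python
import tarfile
from pathlib import Path


def archive_profile(profile_dir: Path, archive_path: Path) -> Path:
    """Pack the contents of profile_dir into a gzipped tar archive.

    Hypothetical helper: raises FileNotFoundError if the profile directory
    has already been removed (for example by geckodriver's cleanup).
    """
    if not profile_dir.is_dir():
        raise FileNotFoundError(f"profile directory is gone: {profile_dir}")
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores member paths relative to the profile root,
        # so the archive can be unpacked into any target directory.
        tar.add(profile_dir, arcname=".")
    return archive_path


def restore_profile(archive_path: Path, target_dir: Path) -> Path:
    """Unpack an archive created by archive_profile into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
    return target_dir
```

Note that this only covers the packing and unpacking; it does nothing about the race itself, since the directory can still disappear between the `is_dir()` check and the end of `tar.add()`.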

@nhnt11
Contributor

nhnt11 commented Nov 12, 2019

@shreyagupta30

This project collects the issues related to stateful crawling. I suspect the problem will be pretty simple to fix, but will require a lot of testing to discover inevitable corner-case failures. Unfortunately, these corner cases are quite important for stateful crawling, as missing a site leads to an incomplete/inconsistent profile.

The instability was introduced by geckodriver (back during the conversion from Selenium 2 to 3). Geckodriver manages browser profiles separately from Selenium, and I suspect this upgrade introduced a race condition between our profile management code and geckodriver's. E.g., sometimes the geckodriver process would delete the profile directory before we had a chance to back it up.

Some specific pointers:

* Most of the profile management commands are in [`profile_commands.py`](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/Commands/profile_commands.py). This includes the code to dump and load the current profile.

* When launching Firefox with Selenium we use the [`FirefoxProfile` interface](https://seleniumhq.github.io/selenium/docs/api/py/webdriver_firefox/selenium.webdriver.firefox.firefox_profile.html#module-selenium.webdriver.firefox.firefox_profile). This creates a profile in `/tmp/`. Prior to launching the browser we [copy in a profile](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/deploy_firefox.py#L41-L58) from a specified archive (via the `profile_tar` browser parameter). We also [write](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/deploy_firefox.py#L98-L115) the instrumentation extension configuration file to this directory.

* Once Firefox [is launched](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/deploy_firefox.py#L151-L153), the geckodriver process copies this profile to a new temporary location in `/tmp/`. I believe it also deletes the original location. Thus, any additional changes to the folder specified by the original `FirefoxProfile` call will fail. We [access](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/deploy_firefox.py#L179) this new location through `driver.capabilities["moz:profile"]`. Note that this weirdness also means we have to hack around the unknown location when setting up logging by [parsing the real profile location out of the geckodriver logs](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/selenium_firefox.py#L79-L81). It also means we have to [clear the webdriver's `profile` attribute](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/BrowserManager.py#L465-L473) before calling quit (to avoid meaningless error messages printed to the console).

* If a stateful browser crashes we [archive the previous profile directory](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/BrowserManager.py#L89-L110) and overwrite the `profile_tar` browser parameter with the new location so the browser will re-launch with the previous profile. This is the step that would fail intermittently as described in #419

Note that the way we pass most configuration arguments (including the profile) to Selenium appears to be deprecated, and we instead need to use a Service object. Also note that we currently use a patched version of the Service object, which may no longer be necessary. Moving to this will hopefully fix the inconsistencies in profile locations, but likely won't fix the race condition related to dumping and reloading profiles. For that, we'll need to make sure we archive the temporary profile before the geckodriver process deletes it (and will need to ensure we do so regardless of how geckodriver closes).

Hello @englehardt, I am trying to figure this issue out as part of my Outreachy internship. After reading this issue, my understanding is that we are facing a clash over temporary storage between the profile generated by the Selenium driver and the one Firefox creates by default.
Would using a cache storage service like Redis solve this issue of stateful crawling? If not, can you please tell me why?
Thank you!

@vringar
Contributor

vringar commented Apr 3, 2020

Oversimplified, the problem is:
On shutdown, GeckoDriver deletes the temporary Firefox profile that we are interested in, since it contains a lot of state we want to analyse.

When GeckoDriver passes the critical exception through Selenium to our Python code, the best we can do is try to be faster at copying the temp directory than GeckoDriver is at deleting it, which is inherently racy and bad.
The only clean option is to land a patch in geckodriver that allows us to specify a place where the temp profile should be saved instead of being deleted.
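
To make the race concrete, here is a hedged sketch of what a "copy it before it gets deleted" attempt looks like in Python. The helper name `try_snapshot` is hypothetical and this is not OpenWPM code; it only shows that the copy can fail partway through and that the caller has to treat the result as best-effort.

```python
import shutil
from pathlib import Path


def try_snapshot(profile_dir: Path, backup_dir: Path) -> bool:
    """Best-effort copy of profile_dir to backup_dir (which must not exist yet).

    Returns True if the whole tree was copied, False if the source directory
    or some of its files disappeared while the copy was running.
    """
    try:
        shutil.copytree(profile_dir, backup_dir)
    except (FileNotFoundError, shutil.Error):
        # The source vanished before or during the copy. A partial copy is
        # useless as a profile, so remove whatever was written.
        shutil.rmtree(backup_dir, ignore_errors=True)
        return False
    return True
```

A `False` result here means the profile is lost, which is exactly the outcome that cannot be ruled out as long as another process is free to delete the directory at the same time.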

@birdsarah
Contributor

@vringar - now that we have the finalize command, do things get easier?

@vringar
Contributor

vringar commented May 15, 2020

For a successful visit, quite likely, but it still doesn't handle an unexpected shutdown,
which might be a requirement that can reasonably be dropped.
I'm hesitant to commit too much to this as I don't have the full context right now.
I can follow up on Monday.

@birdsarah
Contributor

@englehardt I see a path for restoring some of the original functionality. While I take your point that there are a lot of important edge cases, I think the only way to work through them is to restore functionality and start figuring them out.

Here's what I'm thinking. I think I have a path where our profile path is the actual geckodriver temp profile path rather than the path we get from selenium. If we know when things are about to die (which I think @vringar's big efforts have given us more insight to) then we should be able to zip up and save that profile directory. If geckodriver really crashed crashed then I would expect it to not have cleaned up after itself and that temp directory to still be there too.

@englehardt
Collaborator

englehardt commented May 15, 2020

@englehardt I see a path for restoring some of the original functionality. While I take your point that there are a lot of important edge cases, I think the only way to work through them is to restore functionality and start figuring them out.

I agree. Doing this incrementally is necessary and okay. My point was that it's hard to rely on the feature prior to doing the work because losing a profile can heavily impact a measurement.

Here's what I'm thinking. I think I have a path where our profile path is the actual geckodriver temp profile path rather than the path we get from selenium. If we know when things are about to die (which I think @vringar's big efforts have given us more insight to) then we should be able to zip up and save that profile directory. If geckodriver really crashed crashed then I would expect it to not have cleaned up after itself and that temp directory to still be there too.

It seems like that could work. I suspect that the geckodriver process doesn't crash so much as throw an error. Perhaps we can handle that error somehow and deterministically intervene before geckodriver goes through its cleanup process. I'm not 100% sure though, but investigating that can be an initial step.

@boolean5
Contributor

boolean5 commented Feb 1, 2021

As we discussed in the weekly meeting, a possible solution is described here: https://firefox-source-docs.mozilla.org/testing/geckodriver/CrashReports.html
We could use `Options` instead of `FirefoxProfile` to set the profile. This way geckodriver will not delete it when closing or crashing, thus eliminating the race condition that broke stateful crawling. I'm implementing this change now and will soon open a PR.

@boolean5
Contributor

boolean5 commented Feb 1, 2021

It seems that since bdb930f we get the browser profile location from `driver.capabilities["moz:profile"]`.

This means that we no longer parse the profile location out of the geckodriver logs and can remove most (if not all) of the code in https://github.com/mozilla/OpenWPM/blob/master/openwpm/deploy_browsers/selenium_firefox.py. Do we have other reasons to want to redirect the geckodriver logs to the main logger?

@vringar
Contributor

vringar commented Feb 1, 2021

We want to have the geckodriver logs as part of our logging infrastructure so we can see errors happening inside the WebExtension along with the errors that are happening in the Python part of OpenWPM.
I'm not sure that this is the best solution to that problem but it's the one we currently have.

@boolean5
Contributor

I have run into a problem in the implementation of the solution described above. Using `Options` to set the browser profile indeed lets us use the selected profile in-place and the profile does not get deleted when geckodriver crashes or closes. Also, the profile is updated as expected after visiting some sites.
However, when geckodriver starts it also creates another profile directory whose name starts with the prefix `rust_mozprofile`. The only file in that directory is `user.js`. This means that the preferences we set in `deploy_firefox()` are not seen by the browser.

I checked Selenium's issue tracker and it seems that no one has mentioned this issue so far.

This behavior can be reproduced by running the code example below and checking the /tmp directory after execution:

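```python
import tempfile

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Create an empty directory to use as the custom browser profile.
profile = tempfile.mkdtemp("firefox-profile")

# Point Firefox at the custom profile via command-line arguments
# instead of going through FirefoxProfile.
fo = Options()
fo.add_argument("-profile")
fo.add_argument(profile)

# This preference ends up in the user.js of the extra rust_mozprofile
# directory rather than in the custom profile.
fo.set_preference("browser.startup.page", "https://example.com")

driver = webdriver.Firefox(options=fo, service_args=["--marionette-port", "2828"])
# WebDriver requires an absolute URL including the scheme.
driver.get("https://example.com")
```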

As a workaround we could create the `user.js` file ourselves (not by setting preferences via Selenium with `fo.set_preference()`) and place it in the browser profile directory before starting Firefox.
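
A minimal sketch of that workaround is shown below, assuming preferences are held in a plain dictionary. The helper names `format_user_pref` and `write_user_js` are hypothetical and are not taken from OpenWPM or Selenium; the sketch only relies on the `user_pref("name", value);` line format that Firefox reads from `user.js`.

```python
import json
from pathlib import Path
from typing import Mapping, Union

PrefValue = Union[bool, int, str]


def format_user_pref(name: str, value: PrefValue) -> str:
    """Render one preference as a user.js line."""
    # json.dumps produces true/false, bare integers and double-quoted
    # strings, which are the literal forms user.js expects.
    return f"user_pref({json.dumps(name)}, {json.dumps(value)});"


def write_user_js(profile_dir: Path, prefs: Mapping[str, PrefValue]) -> Path:
    """Write all prefs to <profile_dir>/user.js, replacing any existing file."""
    user_js = profile_dir / "user.js"
    lines = [format_user_pref(name, value) for name, value in prefs.items()]
    user_js.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return user_js
```

The file would need to be written before the browser is launched, since Firefox reads `user.js` at startup. Any defaults that geckodriver normally sets would also have to be included in `prefs`, as they would no longer reach the custom profile either.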

@vringar
Contributor

vringar commented Feb 15, 2021

Interesting. I think you should file a bug in the geckodriver component. I'd expect this to be Firefox specific.

@boolean5
Contributor

Done. Here's the issue: mozilla/geckodriver#1844

@vringar
Contributor

vringar commented Feb 16, 2021

Very nice writeup!

boolean5 added a commit to boolean5/OpenWPM that referenced this issue Mar 3, 2021
Geckodriver has a bug that makes it write the browser preferences we
set, as well as its own default browser preferences, to a user.js file
in the wrong profile directory when using a custom profile:
mozilla/geckodriver#1844. As a temporary
workaround until this issue gets fixed, we create the user.js file
ourselves. In order to do this, we keep a copy of geckodriver's default
preferences in our code.

Closes openwpm#423
Zaxeli pushed a commit to Zaxeli/OpenWPM that referenced this issue Aug 10, 2021
7 participants