Restore support for profile saving aka stateful crawl support #423

motin · 2019-07-24T07:33:18Z

#382 disables support for stateful crawling and profile management. This should be restored.

Related to #383 and openwpm/openwpm-crawler#13. There is a project, but no specific issue, tracking the restoration of stateful crawl support, hence this one.

englehardt · 2019-08-13T21:34:30Z

This project collects the issues related to stateful crawling. I suspect the problem will be pretty simple to fix, but will require a lot of testing to discover inevitable corner-case failures. Unfortunately, these corner cases are quite important for stateful crawling, as missing a site leads to an incomplete/inconsistent profile.

The instability was introduced by geckodriver (back during the conversion from Selenium 2 to 3). Geckodriver manages browser profiles separately from Selenium, and I suspect this upgrade introduced a race condition between our profile management code and geckodriver's. E.g., sometimes the geckodriver process would delete the profile directory before we had a chance to back it up.

Some specific pointers:

* Most of the profile management commands are in `profile_commands.py`. This includes the code to dump and load the current profile.
* When launching Firefox with Selenium we use the `FirefoxProfile` interface. This creates a profile in `/tmp/`. Prior to launching the browser we copy in a profile from a specified archive (via the `profile_tar` browser parameter). We also write the instrumentation extension configuration file to this directory.
* Once Firefox is launched, the geckodriver process copies this profile to a new temporary location in `/tmp/`. I believe it also deletes the original location. Thus, any additional changes to the folder specified by the original `FirefoxProfile` call will fail. We access this new location through `driver.capabilities["moz:profile"]`. Note that this weirdness also means we have to hack around the unknown location when setting up logging by parsing the real profile location out of the geckodriver logs. It also means we have to clear the webdriver's `profile` attribute before calling quit (to avoid meaningless error messages printed to console).
* If a stateful browser crashes we archive the previous profile directory and overwrite the `profile_tar` browser parameter with the new location so the browser will re-launch with the previous profile. This is the step that would fail intermittently as described in #419 (Bump lodash.merge from 4.6.1 to 4.6.2 in /automation/Extension/webext-instrumentation).

Note that it appears that the way we pass most configuration arguments (including the profile) to Selenium is deprecated, and we instead need to use a Service object. Also note that we currently use a patched version of the Service object, which may no longer be necessary. Moving to this will hopefully fix the inconsistencies in profile locations, but likely won't fix the race condition related to dumping and reloading profiles. For that, we'll need to make sure we archive the temporary profile before the geckodriver process deletes it (and will need to ensure we do so regardless of how geckodriver closes).
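The dump and load steps mentioned in the pointers above can be sketched with the standard library alone. This is a minimal illustration, not the actual code in `profile_commands.py`; the function names `dump_profile` and `load_profile` and the archive layout are assumptions made for the example.

```python
import tarfile
from pathlib import Path


def dump_profile(profile_dir: str, archive_path: str) -> None:
    """Pack the contents of a profile directory into a gzipped tar archive.

    `archive_path` should live outside `profile_dir`, otherwise the archive
    would try to include itself.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores paths relative to the profile root, so the
        # archive can later be unpacked into any directory.
        tar.add(profile_dir, arcname=".")


def load_profile(archive_path: str, profile_dir: str) -> None:
    """Unpack a previously dumped profile into a (possibly new) directory."""
    Path(profile_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(profile_dir)
```

In a real crawl, the directory passed to `dump_profile` would presumably be the location reported by `driver.capabilities["moz:profile"]`, and the call would have to happen before geckodriver removes that directory.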

nhnt11 · 2019-11-12T10:01:44Z

I wrote some code to do this for multipreffer tests: https://github.com/mozilla/multipreffer/blob/65f3bc67e4b8b381fd101a861bd8836b98cec101/test/functional/utils.js#L100-L122

shreyagupta30 · 2020-04-01T21:15:52Z

> This project collects the issues related to stateful crawling. I suspect the problem will be pretty simple to fix, but will require a lot of testing to discover inevitable corner-case failures. Unfortunately, these corner cases are quite important for stateful crawling, as missing a site leads to an incomplete/inconsistent profile.
>
> The instability was introduced by geckodriver (back during the conversion from Selenium 2 to 3). Geckodriver manages browser profiles separately from Selenium, and I suspect this upgrade introduced a race condition between our profile management code and geckodriver's. E.g., sometimes the geckodriver process would delete the profile directory before we had a chance to back it up.
>
> Some specific pointers:
>
> * Most of the profile management commands are in [`profile_commands.py`](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/Commands/profile_commands.py). This includes the code to dump and load the current profile.
>
> * When launching Firefox with Selenium we use the [`FirefoxProfile` interface](https://seleniumhq.github.io/selenium/docs/api/py/webdriver_firefox/selenium.webdriver.firefox.firefox_profile.html#module-selenium.webdriver.firefox.firefox_profile). This creates a profile in `/tmp/`. Prior to launching the browser we [copy in a profile](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/deploy_firefox.py#L41-L58) from a specified archive (via the `profile_tar` browser parameter). We also [write](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/deploy_firefox.py#L98-L115) the instrumentation extension configuration file to this directory.
>
> * Once Firefox [is launched](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/deploy_firefox.py#L151-L153), the geckodriver process copies this profile to a new temporary location in `/tmp/`. I believe it also deletes the original location. Thus, any additional changes to the folder specified by the original `FirefoxProfile` call will fail. We [access](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/deploy_firefox.py#L179) this new location through `driver.capabilities["moz:profile"]`. Note that this weirdness also means we have to hack around the unknown location when setting up logging by [parsing the real profile location out of the geckodriver logs](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/DeployBrowsers/selenium_firefox.py#L79-L81). It also means we have to [clear the webdriver's `profile` attribute](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/BrowserManager.py#L465-L473) before calling quit (to avoid meaningless error messages printed to console).
>
> * If a stateful browser crashes we [archive the previous profile directory](https://github.com/mozilla/OpenWPM/blob/e96d97c7f2463f1d33ec1da6c328087f70d6f92c/automation/BrowserManager.py#L89-L110) and overwrite the `profile_tar` browser parameter with the new location so the browser will re-launch with the previous profile. This is the step that would fail intermittently as described in #419.
>
> Note that it appears that the way we pass most configuration arguments (including the profile) to Selenium is deprecated, and we instead need to use a Service object. Also note that we currently use a patched version of the Service object, which may no longer be necessary. Moving to this will hopefully fix the inconsistencies in profile locations, but likely won't fix the race condition related to dumping and reloading profiles. For that, we'll need to make sure we archive the temporary profile before the geckodriver process deletes it (and will need to ensure we do so regardless of how geckodriver closes).

Hello @englehardt, I am trying to figure out this issue as part of my Outreachy internship. After reading it, my understanding is that we are facing a clash over temporary storage between the profile generated by the Selenium driver and the one Firefox creates by default.
Could a cache storage service like Redis solve this problem for stateful crawling? If not, can you please tell me why?
Thank you!

vringar · 2020-04-03T09:36:56Z

Oversimplified, the problem is:
On shutdown, GeckoDriver deletes the temporary Firefox profile that we are interested in, since it contains a lot of state we want to analyse.

When GeckoDriver passes the critical exception through Selenium to our Python code, the best we can do is try to be faster at copying the temp directory than GeckoDriver is at deleting it, which is inherently racy and bad.
The only clean option is to land a patch in geckodriver that allows us to specify a place where the temp profile should be saved to instead of being deleted.
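To make the race concrete, here is a small sketch of what "trying to be faster" looks like in practice. It is an illustration only: `try_snapshot` is a hypothetical helper, not OpenWPM code, and it simply reports whether the copy finished before the directory (or parts of it) disappeared.

```python
import shutil


def try_snapshot(profile_dir: str, backup_dir: str) -> bool:
    """Best-effort copy of a profile that another process may be deleting.

    Returns True only if the whole tree was copied without errors.
    `backup_dir` must not exist yet (a requirement of shutil.copytree).
    """
    try:
        shutil.copytree(profile_dir, backup_dir)
    except (FileNotFoundError, shutil.Error):
        # Either the directory was already gone, or files vanished while
        # we were copying. A partial copy is not a usable profile, so
        # discard whatever was written.
        shutil.rmtree(backup_dir, ignore_errors=True)
        return False
    return True
```

Nothing in this function can guarantee success: whether it returns `True` depends purely on timing relative to the deleting process, which is why a fix on the geckodriver side is the cleaner option.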

birdsarah · 2020-05-15T21:11:57Z

@vringar - now we have finalize command, do things get easier?

vringar · 2020-05-15T21:21:08Z

For a successful visit, quite likely, but it still doesn't handle the unexpected shutdown,
which might be a requirement that can reasonably be dropped.
I'm hesitant to commit too much to this as I don't have the full context rn.
I can follow up on Monday.

birdsarah · 2020-05-15T21:24:39Z

@englehardt I see a path for restoring some of the original functionality. While I take your point that there are a lot of important edge cases, I think the only way to work through them is to restore functionality and start figuring them out.

Here's what I'm thinking. I think I have a path where our profile path is the actual geckodriver temp profile path rather than the path we get from selenium. If we know when things are about to die (which I think @vringar's big efforts have given us more insight into) then we should be able to zip up and save that profile directory. If geckodriver really crashed crashed then I would expect it to not have cleaned up after itself and that temp directory to still be there too.

englehardt · 2020-05-15T21:48:29Z

> @englehardt I see a path for restoring some of the original functionality. While I take your point that there are a lot of important edge cases, I think the only way to work through them is to restore functionality and start figuring them out.

I agree. Doing this incrementally is necessary and okay. My point was that it's hard to rely on the feature prior to doing the work because losing a profile can heavily impact a measurement.

> Here's what I'm thinking. I think I have a path where our profile path is the actual geckodriver temp profile path rather than the path we get from selenium. If we know when things are about to die (which I think @vringar's big efforts have given us more insight into) then we should be able to zip up and save that profile directory. If geckodriver really crashed crashed then I would expect it to not have cleaned up after itself and that temp directory to still be there too.

It seems like that could work. I suspect that the geckodriver process doesn't crash so much as throw an error. Perhaps we can handle that error somehow and deterministically intervene before geckodriver goes through its cleanup process. I'm not 100% sure though, but investigating that can be an initial step.
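One way to picture "intervening before cleanup" is a wrapper that always archives the profile before the driver is told to quit, whether the visit succeeded or raised. The sketch below is an assumption-laden illustration: `saved_profile` is a hypothetical helper, `driver` is any object with a `quit()` method, and in real use `profile_dir` would presumably come from `driver.capabilities["moz:profile"]`.

```python
import shutil
from contextlib import contextmanager


@contextmanager
def saved_profile(driver, profile_dir: str, archive_base: str):
    """Archive `profile_dir` before `driver.quit()` runs, even on errors.

    The archive is written to `<archive_base>.tar.gz`.
    """
    try:
        yield driver
    finally:
        try:
            # Runs while the driver is still alive, i.e. before quit()
            # gives it a chance to clean up its temporary profile.
            shutil.make_archive(archive_base, "gztar", root_dir=profile_dir)
        finally:
            # Quit even if archiving itself failed.
            driver.quit()
```

This only covers errors that surface as Python exceptions. If the geckodriver process removes the profile on its own before the `finally` block runs, the archive step will still fail.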

boolean5 · 2021-02-01T14:33:35Z

As we discussed in the weekly meeting, a possible solution is described here: https://firefox-source-docs.mozilla.org/testing/geckodriver/CrashReports.html
We could use Options instead of FirefoxProfile to set the profile. This way geckodriver will not delete it when closing or crashing, thus eliminating the race condition that broke stateful crawling. I'm implementing this change now and will soon open a PR.

boolean5 · 2021-02-01T14:34:20Z

It seems that since bdb930f we get the browser profile location from driver.capabilities["moz:profile"].

This means that we no longer parse the profile location out of the geckodriver logs and can remove most (if not all) of the code in https://github.com/mozilla/OpenWPM/blob/master/openwpm/deploy_browsers/selenium_firefox.py. Do we have other reasons to want to redirect the geckodriver logs to the main logger?

vringar · 2021-02-01T14:43:50Z

We want to have the geckodriver logs as part of our logging infrastructure so we can see errors happening inside the WebExtension along with the errors that are happening in the Python part of OpenWPM.
I'm not sure that this is the best solution to that problem but it's the one we currently have.
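For readers unfamiliar with the idea, forwarding an external process's log output into Python's `logging` module can look roughly like the sketch below. It is not the code in `selenium_firefox.py`; `forward_lines` and the prefix are hypothetical names chosen for the example.

```python
import logging
import threading
from typing import IO


def forward_lines(stream: IO[str], logger: logging.Logger, prefix: str) -> threading.Thread:
    """Copy each non-empty line from `stream` into `logger` on a background thread."""

    def pump() -> None:
        # Iterating a text stream yields one line at a time until EOF.
        for line in stream:
            line = line.rstrip("\n")
            if line:
                logger.debug("%s%s", prefix, line)

    thread = threading.Thread(target=pump, daemon=True)
    thread.start()
    return thread
```

The `stream` could be, for instance, the `stdout` of a `subprocess.Popen` started with `text=True`, or a log file opened for reading.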

boolean5 · 2021-02-15T12:14:22Z

I have run into a problem in the implementation of the solution described above. Using Options to set the browser profile indeed lets us use the selected profile in-place and the profile does not get deleted when geckodriver crashes or closes. Also, the profile is updated as expected after visiting some sites.
However, when geckodriver starts it also creates another profile directory whose name starts with the prefix rust_mozprofile. The only file in that directory is user.js. This means that the preferences we set in deploy_firefox() are not seen by the browser.

I checked Selenium's issue tracker and it seems that no one has mentioned this issue so far.

This behavior can be reproduced by running the code example below and checking the /tmp directory after execution:

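```python
import tempfile

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Create an empty directory to act as the custom browser profile.
profile = tempfile.mkdtemp(prefix="firefox-profile-")

# Point Firefox at the custom profile via command-line arguments.
fo = Options()
fo.add_argument("-profile")
fo.add_argument(profile)

# This preference should end up in <profile>/user.js, but geckodriver
# writes it to a separate rust_mozprofile* directory instead.
# (browser.startup.page is an integer pref; the homepage URL lives in
# browser.startup.homepage.)
fo.set_preference("browser.startup.homepage", "https://example.com")

# Selenium 3-style call; newer Selenium versions pass service_args
# through a Service object instead.
driver = webdriver.Firefox(options=fo, service_args=["--marionette-port", "2828"])
driver.get("https://example.com")  # the URL must include a scheme
# No driver.quit() here: quitting may let geckodriver clean up its
# temporary profile before /tmp can be inspected.
```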

As a workaround we could create the user.js file ourselves (not by setting preferences via Selenium with fo.set_preference()) and place it in the browser profile directory before starting Firefox.
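A minimal sketch of that workaround follows. The helper name `write_user_js` is hypothetical, and the example assumes preference values are limited to booleans, integers, and strings, which is what Firefox's `user_pref(...)` lines accept.

```python
import json
from pathlib import Path
from typing import Mapping, Union

PrefValue = Union[bool, int, str]


def write_user_js(profile_dir: str, prefs: Mapping[str, PrefValue]) -> Path:
    """Write `prefs` to <profile_dir>/user.js in Firefox's user_pref format."""
    lines = [
        # json.dumps yields valid JavaScript literals for str, int and bool
        # (quoted and escaped strings, lowercase true/false).
        f"user_pref({json.dumps(name)}, {json.dumps(value)});"
        for name, value in prefs.items()
    ]
    user_js = Path(profile_dir) / "user.js"
    user_js.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return user_js
```

Called before the browser starts, for example as `write_user_js(profile, {"browser.startup.homepage": "https://example.com"})`, this places the preferences in the profile directory that Firefox actually reads.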

vringar · 2021-02-15T13:45:15Z

Interesting. I think you should file a bug in the geckodriver component. I'd expect this to be Firefox specific.

boolean5 · 2021-02-16T15:29:43Z

Done. Here's the issue: mozilla/geckodriver#1844

vringar · 2021-02-16T15:44:51Z

Very nice writeup!

Geckodriver has a bug that makes it write the browser preferences we set, as well as its own default browser preferences, to a user.js file in the wrong profile directory when using a custom profile: mozilla/geckodriver#1844. As a temporary workaround until this issue gets fixed, we create the user.js file ourselves. In order to do this, we keep a copy of geckodriver's default preferences in our code. Closes openwpm#423

motin mentioned this issue Jul 24, 2019

Can redis/this infrastructure be used for stateful crawl? openwpm/openwpm-crawler#13

Open

englehardt added the feature-request label Aug 6, 2019

boolean5 mentioned this issue Mar 3, 2021

Restore stateful crawling support #864

Merged

englehardt closed this as completed in #864 Mar 29, 2021
