
## How does CleanLinks work?

CleanLinks protects your privacy by automatically detecting and skipping redirect pages that track you on your way to the link you really wanted. Tracking parameters (e.g. utm_* or fbclid) are also removed.

You can test the current (master) link cleaning code online.

## Embedded URL detection

We automatically detect embedded URLs, which are used either:

  1. when websites report your current URL, or
  2. when websites send you to an intermediate page that tracks you before redirecting you to your actual destination.

Requests of the first kind are then dropped (removing only the query parameter that contains the current URL could be an alternative), and requests of the second kind are redirected straight to the embedded URL.
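
As a rough illustration of the detection step (a simplified sketch, not the add-on's actual code), an embedded URL can be spotted by checking whether any query parameter value parses as an absolute http(s) URL:

```javascript
// Minimal sketch: find a query parameter whose value is itself a full URL.
// Not CleanLinks' actual implementation.
function findEmbeddedUrl(link) {
  const url = new URL(link);
  for (const [name, value] of url.searchParams) {
    try {
      const embedded = new URL(value);  // throws if not an absolute URL
      if (embedded.protocol === 'http:' || embedded.protocol === 'https:')
        return { parameter: name, embedded: embedded.href };
    } catch (e) {
      // this parameter does not contain a URL, keep looking
    }
  }
  return null;
}

// e.g. a typical (made-up) redirect page:
findEmbeddedUrl('https://www.example.com/redirect?url=https%3A%2F%2Fsoundcloud.com%2Fartist%2Ftrack');
// → { parameter: 'url', embedded: 'https://soundcloud.com/artist/track' }
```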

CleanLinks has rules that let you specify which uses of embedded URLs are legitimate, and whitelist those, i.e. not redirect them. A typical example is a login page with a ?redirectUrl= parameter specifying where to go once the login is successful.

CleanLinks will break some websites, and you will need to whitelist the affected URLs manually for them to work. This is easily done via the popup opened from the CleanLinks toolbar icon.

## Rules

Rules allow whitelisting some embedded URLs, and performing further cleaning actions such as removing tracking parameters (e.g. utm_*) or rewriting a URL's path.

Different parts of a URL, using https://addons.mozilla.org/en-GB/firefox/addon/clean-links-webext/reviews/?score=5 as an example:

| Part of the URL | Name |
|---|---|
| `https` | Protocol |
| `org` | Public suffix (usually same as top-level domain) |
| `mozilla.org` | Domain name |
| `addons.` | Subdomain |
| `addons.mozilla.org` | Fully-Qualified Domain Name (FQDN) |
| `/en-GB/firefox/addon/clean-links-webext/reviews/` | Path |
| `?score=5` | Query |
| `score` | Parameter |
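
These parts map onto the standard URL API as follows (note that the public suffix and the domain name are not exposed by this API; computing those requires a public-suffix list):

```javascript
const url = new URL('https://addons.mozilla.org/en-GB/firefox/addon/clean-links-webext/reviews/?score=5');

url.protocol;                  // 'https:'             → protocol
url.hostname;                  // 'addons.mozilla.org' → fully-qualified domain name
url.pathname;                  // '/en-GB/firefox/addon/clean-links-webext/reviews/' → path
url.search;                    // '?score=5'           → query
[...url.searchParams.keys()];  // ['score']            → parameter names
```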

For maximum privacy, rules are maintained and editable locally (with sensible defaults shipped with the add-on). There are currently 4 types of actions, each of which is a list of regular expressions.

  1. Remove query parameters: Any query parameter matched by any expression in this list is removed, unless it is also matched by a whitelist expression.

    For example, Facebook adds a fbclid parameter with a unique identifier to every outgoing link, e.g.: https://soundcloud.com/artist/track?fbclid=IwAR1eyii3yum_rNgxs7ym2SY4bsb8QtCVtpOb3hYQ9bYOR-oao7lCC1fI1tY

  2. Whitelist query parameters: Any query parameter matched by any expression in this list is preserved as-is, even if it includes an embedded URL or is matched as a removable parameter.

    In particular, whitelisting a parameter that contains an embedded URL avoids redirecting (or dropping) requests to the intermediate page.

    For example, the Stack Overflow login page carries the current page's URL, to allow returning there once the user is logged in: https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2fquestions%2f32814161%2fhow-to-make-spoiler-text-in-github-wiki-pages

  3. Replace in URL path: Any part of the URL path that is matched by any expression in this list is replaced with the specified replacement, or removed if no replacement is specified.

    For example, Amazon puts its tracking data directly in the URL path, as /ref=some-value: https://www.amazon.es/gp/product/B06Y1VKRXJ/ref=ppx_od_dt_b_asin_title_s00?ie=UTF8&psc=1

  4. Whitelist URL path (allow URLs embedded in the path): Embedded URLs are also allowed in the URL path, without causing the intermediate page to be skipped. This also prevents replacements from being performed in the URL's path.

    For example, the Web Archive puts the archived page's URL in the path, so whitelisting the path allows https://web.archive.org/web/20200304112831/http://www.google.com/ not to redirect to google.com.

NB: Parameter cleaning happens before embedded-URL detection; if an embedded URL is found, it is itself cleaned as well before being returned, as in the sketch below.
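
A rough sketch of this ordering (with a simplified rule shape of plain RegExp lists, not the actual rule file format):

```javascript
// Illustrative sketch of the cleaning order: remove parameters first,
// then extract and recursively clean any embedded URL.
function cleanUrl(link, rules) {
  const url = new URL(link);

  // 1. Remove matching query parameters, unless whitelisted
  for (const name of [...url.searchParams.keys()]) {
    const whitelisted = rules.whitelist.some(re => re.test(name));
    if (!whitelisted && rules.remove.some(re => re.test(name)))
      url.searchParams.delete(name);
  }

  // 2. If a remaining, non-whitelisted parameter embeds a URL,
  //    recurse: the embedded URL is cleaned too before being returned.
  for (const [name, value] of url.searchParams) {
    if (rules.whitelist.some(re => re.test(name)))
      continue;
    try {
      const embedded = new URL(value);
      if (/^https?:$/.test(embedded.protocol))
        return cleanUrl(embedded.href, rules);
    } catch (e) { /* not a URL */ }
  }
  return url.href;
}

cleanUrl('https://example.com/?fbclid=abc&next=https%3A%2F%2Fexample.org%2F%3Futm_source%3Dx',
         { remove: [/^utm_/, /^fbclid$/], whitelist: [] });
// → 'https://example.org/' (tracking removed, then the embedded URL extracted and cleaned)
```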

## Request types

CleanLinks analyses and cleans your browser’s requests before they leave the browser, except for javascript links, which are cleaned at the moment they are clicked.

At this stage, it can distinguish between 3 types of requests:

  1. top-level requests, which are the websites being opened, and typically correspond to links clicked inside or outside of the browser.

  2. other requests, which are initiated by the website to load resources: scripts, images, iframes, etc.

  3. header redirects, which happen when a website issues a 30x response to send you from one location to the next. In this case we can clean the destination to which we are redirected.

    The BBC for example uses link shorteners: https://bbc.in/some-hash redirects (via a trib.al link shortener) to the following URL, with added tracking parameters: https://www.bbc.co.uk/news/article-id?at_custom2=facebook_page&at_custom1=%5Bpost+type%5D&at_campaign=64&at_medium=custom7&at_custom3=BBC+News
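
A simplified sketch of how a WebExtension can tell these request types apart and act on them (cleanUrl stands in for the cleaning logic sketched in the Rules section; this is not CleanLinks' actual code):

```javascript
// 1. & 2. Requests are intercepted before they leave the browser.
browser.webRequest.onBeforeRequest.addListener(details => {
  const cleaned = cleanUrl(details.url);
  if (cleaned === details.url)
    return {};                          // nothing to clean

  if (details.type === 'main_frame')    // 1. top-level request: go straight to the target
    return { redirectUrl: cleaned };
  else                                  // 2. resource request (script, image, iframe, …)
    return { cancel: true };            //    e.g. drop a request leaking the current URL
}, { urls: ['<all_urls>'] }, ['blocking']);

// 3. Header redirects: clean the Location header of a 30x response
//    before the browser follows it.
browser.webRequest.onHeadersReceived.addListener(details => {
  if (details.statusCode >= 300 && details.statusCode < 400)
    for (const header of details.responseHeaders)
      if (header.name.toLowerCase() === 'location')
        header.value = cleanUrl(header.value);
  return { responseHeaders: details.responseHeaders };
}, { urls: ['<all_urls>'] }, ['blocking', 'responseHeaders']);
```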

## CleanLinks rule file structure

The default rules are available as a JSON file, and can be exported or imported from the CleanLinks settings to allow backing up and restoring your rule set. Rules are stored hierarchically per domain, i.e. reading the .-separated domain parts from right to left.

### The `actions` key specifies which actions to perform

These are detailed in the Rules section above.
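
For illustration only, an actions block could look like the following (the key names here are made up for the example; see the default rules file for the actual names):

```json
"actions": {
  "remove":         ["^utm_", "^fbclid$"],
  "whitelist":      ["^returnurl$"],
  "rewrite":        ["^/ref=[^/]*"],
  "whitelist path": ["^/web/"]
}
```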

### Keys starting with `.` match domain parts

For example, with the following rules:

```json
{
  ".org": {
    "actions": {
      ...
    },
    ".mozilla": {
      "actions": {
        ...
      }
    }
  }
}
```
- The rules in the first `actions` block are applied to all websites with the top-level domain .org.
- The rules in the second `actions` block are applied to all websites of the domain mozilla.org, or subdomains thereof (e.g. www.mozilla.org and addons.mozilla.org).
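
A hypothetical lookup illustrating this right-to-left walk (assuming rules accumulate from the most generic to the most specific match):

```javascript
// Collect every 'actions' block that applies to a given host,
// walking the domain parts from right to left through the rule tree.
function findActions(rules, hostname) {
  const collected = [];
  let node = rules;
  for (const part of hostname.split('.').reverse()) {  // 'addons.mozilla.org' → ['org', 'mozilla', 'addons']
    node = node['.' + part];
    if (node === undefined)
      break;
    if ('actions' in node)
      collected.push(node.actions);
  }
  return collected;  // most generic first, most specific last
}

// With the rules above: both the .org and the .mozilla actions apply.
findActions(rules, 'addons.mozilla.org');
// → [rules['.org'].actions, rules['.org']['.mozilla'].actions]
```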

### Every other key is a regular expression matching the URL’s path
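
For example (again with made-up action names), a rule could be restricted to the archived-page paths of web.archive.org like this, where the "^/web/" key is a regular expression matched against the path:

```json
{
  ".org": {
    ".archive": {
      ".web": {
        "^/web/": {
          "actions": {
            "whitelist path": ["^/web/\\d+/"]
          }
        }
      }
    }
  }
}
```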