Add functionality to redact/filter sensitive data #1

ErikBjare · 2016-05-23T21:12:39Z

We need a model to filter out sensitive data by default.

For example if a window title contains: "[title] - Firefox (Private Browsing)" we should redact [title] to some magic string such as "REDACTED".

In some cases we might want to filter the window out entirely, giving zero information about which window is running; better to catch too much than too little.

It should be the goal that every user has a set of "clean" data. The filtering should also be able to be run on an existing database of data, so that cleaner data can be output. Preferably, the data should be so clean that there is little (or even no) reason not to share it (which would be great since easy access to a large dataset could make research in some areas a lot easier!).

The remaining question is where this processing step should take place. We want the filtering/redacting to happen before data is sent anywhere, but it should also be enforceable on a server (if the server owner doesn't trust the server's security, for example when it runs in the cloud), with clients notified so that they can do the filtering on their side, removing the need to send sensitive data at all. It might therefore be prudent to write a module in aw-core that implements this functionality, since it should be usable from the server and from all clients that transmit sensitive data.

This feature should be on by default. We don't need anything advanced yet: the first priority is to redact titles from Incognito/Private Browsing, which is a good step in the right direction.
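
A minimal sketch of that first step, assuming the window title has exactly the shape quoted above ("[title] - Firefox (Private Browsing)"). The pattern, the function name `redact_title` and the placeholder default are illustrative assumptions, not existing aw-core code:

```python
import re

# Assumed title shape, taken from the example in this issue. Other browsers
# and newer Firefox versions format their titles differently.
PRIVATE_TITLE = re.compile(r"^(?P<title>.*) - (?P<app>Firefox \(Private Browsing\))$")


def redact_title(title: str, placeholder: str = "REDACTED") -> str:
    """Replace the page-title part of a private-browsing window title."""
    match = PRIVATE_TITLE.match(title)
    if match is None:
        return title  # not a private window: leave the title untouched
    return f"{placeholder} - {match.group('app')}"


print(redact_title("My bank - Firefox (Private Browsing)"))
print(redact_title("My bank - Firefox"))
```

Keeping the browser suffix still reveals that a private window was open; filtering the window out entirely would avoid that.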

This should have a far higher priority than Zero-Knowledge storage right now, because it's a lot easier and more user-friendly (with ZK storage, if you lose your keys you lose your data).

Useful when:

We want to export data to a 3rd party service but don't need them to know all the details.
We want to do overview analysis where full detail is not necessary.
We want to redact some information in the log, such as the window titles of Incognito/Private browser-windows, Tor Browser, etc.

This issue was originally moved from ActivityWatch/aw-server#4. It ended up here because its scope turned out to be wider than aw-server alone.

ErikBjare · 2016-07-10T19:58:02Z

It might be better to do redaction by simply removing the window-title property of the event entirely.
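
As a sketch of that alternative, assuming the event's data is a plain dict with a "title" key (the key name and the helper `strip_title` are assumptions for illustration):

```python
def strip_title(event_data: dict) -> dict:
    """Return a copy of the event data without the "title" key."""
    return {key: value for key, value in event_data.items() if key != "title"}


print(strip_title({"app": "firefox", "title": "My bank - Firefox (Private Browsing)"}))
```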

ErikBjare · 2017-06-24T19:04:31Z

I've done some basic work in ActivityWatch/aw-research@eb45bf6

unode · 2018-04-20T14:57:17Z

Would it make more sense to implement some kind of data encryption instead of redaction?
Different users might have different opinions on what constitutes sensitive data...

For instance, with some form of asymmetric cryptography (public/private key), encrypting would add no requirements, but accessing the data would require a private key and/or a password.

Otto-AA · 2018-06-12T11:01:17Z

I'd also really like to have a filtering feature, so I'll add my two cents here. (Edit: Well, that grew to a bit more than two cents... o.O)

In my wanton imagination it would look something like this:

1) main idea

Via an HTML form (on the activitywatch website) the user can create filters (e.g. delete event if event.data.incognito == true).
They can then apply these to
(1.1) filter future events sent to the server (from all watchers) and
(1.2) remove already existing events in the database.

2) Filters

Filters should consist of two parts, namely
(2.1) Filter criteria
e.g. if event.data.incognito == true
(2.2) Filter action
e.g. then remove event.title

2.1) Filter criteria

Specifies when the filter should be applied.
Following functions would be nice:

(2.1.1) if [title/data.incognito] [equals/differs/includes/regex/>/>=/</<=] [comparison]

Examples:
if event.title includes 'Private Browsing'
if event.data.incognito equals true
if event.data.nested.val differs 'abc'
if event.data.val regex i_dunno_about_regex
if event.data.count > 10

(2.1.2) logical operators

Examples:
if event.title includes 'Private Browsing' and event.data.isSensitive equals true
if event.title includes 'Private Browsing' or event.data.isSensitive equals true
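
A rough sketch of how the comparisons in (2.1.1) and the logical operators in (2.1.2) could be evaluated. The event shape, the helper names `lookup` and `check`, and the choice that a missing field never matches are all assumptions made for illustration:

```python
import operator
import re
from typing import Any

# Comparison names from (2.1.1), mapped to plain Python callables.
COMPARATORS = {
    "equals": operator.eq,
    "differs": operator.ne,
    "includes": lambda value, needle: needle in value,
    "regex": lambda value, pattern: re.search(pattern, value) is not None,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}

_MISSING = object()  # sentinel: distinguishes "field absent" from a None value


def lookup(event: dict, path: str) -> Any:
    """Resolve a dotted path such as "data.nested.val"; _MISSING if absent."""
    current: Any = event
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def check(event: dict, path: str, comparator: str, comparison: Any) -> bool:
    """Evaluate one criterion; a missing field never matches (an assumption)."""
    value = lookup(event, path)
    if value is _MISSING:
        return False
    return COMPARATORS[comparator](value, comparison)


# Hypothetical event, shaped like the examples above.
event = {"title": "News - Firefox (Private Browsing)", "data": {"isSensitive": True, "count": 3}}

# (2.1.2): logical operators are just Python's own `and` / `or`.
both = check(event, "title", "includes", "Private Browsing") and check(event, "data.isSensitive", "equals", True)
either = check(event, "title", "includes", "Tor Browser") or check(event, "data.count", ">", 10)
print(both, either)
```

Type mismatches (for example "includes" on a number) raise TypeError here; a real parser would need to validate the filter against the field type first.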

(2.1.3) metadata checks [watcher_name, is_test, ...]

e.g. if watcher_name equals 'aw-watcher-vscode'

(2.1.4) time ranges

Examples:
if event.timestamp is in_time_range(7:00, 9:00)
if event.timestamp is on_day('Monday')
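
The two time checks could look roughly like this. The function names come from the examples above; the signatures, the half-open interval, and the handling of ranges that cross midnight are assumptions:

```python
from datetime import datetime, time

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def in_time_range(timestamp: datetime, start: time, end: time) -> bool:
    """True if the time of day lies in [start, end); also handles ranges past midnight."""
    now = timestamp.time()
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # e.g. 22:00 to 06:00


def on_day(timestamp: datetime, day_name: str) -> bool:
    """True if the timestamp falls on the named weekday (English name, any case)."""
    return WEEKDAYS[timestamp.weekday()] == day_name.lower()


ts = datetime(2018, 6, 11, 8, 30)  # a Monday morning
print(in_time_range(ts, time(7, 0), time(9, 0)), on_day(ts, "Monday"))
```

The sketch compares the timestamp exactly as given; converting it to the user's local timezone first is left out.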

2.2) Filter action

Specifies what should be done, if applied.
Following functions would be nice:

(2.2.1) remove [event/event.data/event.title/...]

Examples:
remove event
remove event.title

(2.2.2) replace [target] with [val]

Examples:
replace event.title with 'REDACTED'
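
Both actions can be sketched as small functions that return a modified copy of the event. The "event." prefix on targets follows the examples above; the function names, the copy-instead-of-mutate behaviour, and returning None for a removed event are assumptions:

```python
import copy
from typing import Any, Optional


def remove(event: dict, target: str) -> Optional[dict]:
    """Apply "remove [target]"; the target "event" drops the whole event (None)."""
    if target == "event":
        return None
    result = copy.deepcopy(event)
    *parents, key = target.split(".")[1:]  # skip the leading "event"
    node = result
    for part in parents:
        node = node.get(part)
        if not isinstance(node, dict):
            return result  # path does not exist: nothing to remove
    node.pop(key, None)
    return result


def replace(event: dict, target: str, value: Any) -> dict:
    """Apply "replace [target] with [val]"; the target must name a field."""
    result = copy.deepcopy(event)
    *parents, key = target.split(".")[1:]  # skip the leading "event"
    node = result
    for part in parents:
        node = node.setdefault(part, {})  # create missing intermediate dicts
    node[key] = value
    return result


event = {"title": "News - Firefox (Private Browsing)", "data": {"incognito": True}}
print(remove(event, "event.title"))
print(replace(event, "event.title", "REDACTED"))
```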

3) Implementation

(3.1) User interface

Filters should be creatable via an HTML interface on the localhost site (http://localhost:5600/filters)

(3.1.1) Filters page

(3.1.1.1) A list of the active filters with the options to [edit/copy/disable/delete] the filter
(3.1.1.2) Option to add a new filter

(3.1.2) Add new filter UI

Should be easy to understand for non-coders. Likely with dropdowns and predefined fields.
(3.1.2.1) Filter name
(3.1.2.2) Filter criteria (see 2.1)
(3.1.2.3) Filter action (see 2.2)

(3.2) Server part

Does anyone know of a library for that? o.o
(3.2.1) API endpoint
(3.2.2) filter parser
(3.2.3) store filter in file/database
(3.2.4) filter incoming events by stored filters
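
A very small sketch of (3.2.2) to (3.2.4), assuming filters are stored as a JSON list. The JSON layout, the single "includes" criterion, and the names `load_filters` and `apply_filters` are invented for illustration and are not an existing aw-server format:

```python
import json
from typing import List, Optional

# Hypothetical stored format: one JSON object per filter.
STORED = """[
  {"field": "title", "includes": "Private Browsing", "action": "replace", "value": "REDACTED"},
  {"field": "title", "includes": "Tor Browser", "action": "remove"}
]"""


def load_filters(text: str) -> List[dict]:
    """Parse stored filters; malformed JSON raises ValueError."""
    return json.loads(text)


def apply_filters(event: dict, filters: List[dict]) -> Optional[dict]:
    """Run an incoming event through every filter; None means "drop it"."""
    for rule in filters:
        if rule["includes"] not in str(event.get(rule["field"], "")):
            continue  # criterion not met: try the next filter
        if rule["action"] == "remove":
            return None
        event = {**event, rule["field"]: rule["value"]}
    return event


filters = load_filters(STORED)
print(apply_filters({"title": "News - Firefox (Private Browsing)"}, filters))
print(apply_filters({"title": "Home - Tor Browser"}, filters))
```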

(3.3) Standardized events format per bucket

This would be really nice, as we then can give the users a list of available options when creating filters (e.g. data.[dropdown: 1) pizza, 2) pasta, 3) ...]) and for making sure a filter is valid.
(3.3.1) Alter create_bucket method to take additional data_structure parameter
(3.3.2) On API, check if the sent event matches the data-structure
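
The check in (3.3.2) might be as simple as the following, assuming the hypothetical data_structure parameter is a mapping from field name to expected Python type:

```python
def matches_structure(data: dict, structure: dict) -> bool:
    """True if data has exactly the declared keys, each with the declared type."""
    if data.keys() != structure.keys():
        return False
    return all(isinstance(data[key], expected) for key, expected in structure.items())


# Hypothetical structure declared when the bucket is created.
WINDOW_STRUCTURE = {"app": str, "title": str, "incognito": bool}

print(matches_structure({"app": "firefox", "title": "News", "incognito": False}, WINDOW_STRUCTURE))
print(matches_structure({"app": "firefox", "title": 42, "incognito": False}, WINDOW_STRUCTURE))
```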

Notes

Of course, I realize that it would take time to implement this, especially if there's no library for it. But from my point of view, this would really enhance the tool.

Also much of this is just nice-to-have and doesn't need to be implemented right from the beginning. I just thought I would write out everything, so that while developing we can keep an eye on these (and maybe code in a way these other options can be implemented easily)

From next week on, I would have more time for developing, so until then maybe we can discuss if/how we should implement this? :)

Otto-AA · 2018-06-12T16:57:18Z

Had a bit of time, so here is a quick draft showing what I mean with these filters: https://github.com/Otto-AA/aw-filter/blob/master/filter.py
After trying out a bit, it actually seems rather easy to implement these filters in python. Thought it would be much more work O.o

Nonetheless, before getting into the details we should agree on how we implement it ^_^

Otto-AA · 2018-06-16T08:33:27Z

Any thoughts on this proposal?
If not, I'd do a bit more work and then create a pull request in aw_server

johan-bjareholt · 2018-06-16T09:38:16Z

@Otto-AA I've only skimmed through it so far, but it seems to be roughly in line with what we have been thinking as well. For now I want to prioritize editor format and visualizations, and once that's done the more important feature IMO is tagging (which would share some similarities in the datastore, making this easier later on). But even more important is making a final 0.8 release.

This task is huge (just planning and prototyping the design would probably take 2 full days of work), so I'm not sure I want to prioritize discussing its design right now. I'm sorry, I really want this feature as well.

stale · 2020-02-15T12:28:02Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ErikBjare · 2020-02-15T14:07:28Z

Since categorization is now done, I'd just like to throw out a suggestion: one way to do this is to have a "sensitive/to-redact" category and then wipe the title/URL/app of all the events that match the category.
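
A sketch of that suggestion, assuming the "sensitive" category is defined by a single regex matched against the event's title, URL and app. The pattern, the field names and the helper `wipe_if_sensitive` are illustrative assumptions:

```python
import re

# Hypothetical rule for a "sensitive/to-redact" category.
SENSITIVE = re.compile(r"private browsing|incognito", re.IGNORECASE)
WIPED_KEYS = ("title", "url", "app")


def wipe_if_sensitive(data: dict, placeholder: str = "REDACTED") -> dict:
    """Wipe title/URL/app if any of them matches the sensitive category."""
    text = " ".join(str(data.get(key, "")) for key in WIPED_KEYS)
    if not SENSITIVE.search(text):
        return data  # event is not in the category: keep it as-is
    return {key: (placeholder if key in WIPED_KEYS else value) for key, value in data.items()}


print(wipe_if_sensitive({"app": "firefox", "title": "News - Firefox (Private Browsing)"}))
```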

johan-bjareholt · 2020-03-01T15:06:33Z

@ErikBjare That is not a good solution in terms of security; to make it truly secure we have to never even add the data to the buckets in the first place, not filter it when querying.

We could use the new settings API to solve this: add a way in the web-ui to add regexes which should be filtered, and then let aw-watcher-window check those on startup and filter matching events before they get sent.

There's also a duplicate feature request on the forum
https://forum.activitywatch.net/t/add-an-exclude-list/345

ErikBjare · 2020-03-02T11:04:50Z

> to make it truly secure we have to never even add the data to the buckets in the first place

Agreed.

> not filter it when querying.

That's not what I mean. I mean to classify & filter when a heartbeat is received.

> We could add the new settings API to solve this, add a way in the web-ui to add regexes which should be filtered and then let aw-watcher-window check those on startup and filter them before the events get sent,

That makes the watchers depend on the server settings, and also requires us to implement the same filtering in all watchers. It's a bit more secure than what I had in mind since the server would never see the sensitive info at all, but not sure if it's worth it.

It's worth mentioning that the rules themselves are sensitive information, especially if they only contain a few things, making the "anonymity set" for redacted events small. However, it would be less of a problem if we went for deleting events entirely.

In any case, I've been thinking of building a feature in aw-webui that lets you search for events matching a particular pattern, and then let you delete them or replace them with redacted versions of the events. Wouldn't take that much work to build, search would be a generally useful feature anyway, and wouldn't add any code to the server or watchers.

johan-bjareholt · 2020-03-02T13:23:30Z

> That's not what I mean. I mean to classify & filter when a heartbeat is received.

Oh, alright.

Might still be an issue though: we would need to be aware of bucket types, so that we don't corrupt events in buckets we don't expect to touch (for example by replacing "afk" with "redacted").
At that point, architecture-wise, it makes more sense for the watchers themselves to redact sensitive information in a way that matches their event format.
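
To illustrate the bucket-type concern, here is a sketch in which redaction only touches fields that are declared redactable for a given bucket type. The bucket type names and the field lists are assumptions for the example:

```python
# Assumed mapping from bucket type to the fields that are safe to redact.
REDACTABLE_FIELDS = {
    "currentwindow": ("title",),
    "web.tab.current": ("title", "url"),
}


def redact_for_bucket(bucket_type: str, data: dict, placeholder: str = "REDACTED") -> dict:
    """Redact only the fields declared for this bucket type; unknown types pass through."""
    fields = REDACTABLE_FIELDS.get(bucket_type, ())
    return {key: (placeholder if key in fields else value) for key, value in data.items()}


print(redact_for_bucket("currentwindow", {"app": "firefox", "title": "secret"}))
print(redact_for_bucket("afkstatus", {"status": "afk"}))  # left untouched
```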

> That makes the watchers depend on the server settings, and also requires us to implement the same filtering in all watchers. It's a bit more secure than what I had in mind since the server would never see the sensitive info at all, but not sure if it's worth it.

Agreed, currently that's just a few watchers (aw-watcher-window and aw-watcher-web) but in the future it might become more.

> It's worth mentioning that the rules themselves are sensitive information, especially if they only contain a few things, making the "anonymity set" for redacted events small. However, it would be less of a problem if we went for deleting events entirely.

Very good point, didn't think of that.

> In any case, I've been thinking of building a feature in aw-webui that lets you search for events matching a particular pattern, and then let you delete them or replace them with redacted versions of the events. Wouldn't take that much work to build, search would be a generally useful feature anyway, and wouldn't add any code to the server or watchers.

Definitely a good start!

Not sure myself which one of our suggested solutions is the best; both have their pros and cons really.

stale · 2020-08-29T13:49:34Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ErikBjare · 2020-12-06T13:25:05Z

I've added a new example redact_sensitive.py to aw-client that can be used to redact sensitive events: https://github.com/ActivityWatch/aw-client/blob/master/examples/redact_sensitive.py

pcuci · 2021-03-16T15:55:09Z

Why not encrypt the data going in?

The goal should be to not leak any private information if your machine gets hacked (unfortunately very common)

Then require a 2FA to view your own data.

NathanaelA · 2022-08-31T20:34:27Z

I've been looking at making a change to the Rust server engine so that it filters out events on its side. That way, no matter which tracker is sending data to the server, the server itself is responsible for filtering events out based on regex matches.

For simplicity, I'm looking at making the configuration part of the server config file at this point, but ultimately I think a filter table built into the database would be useful, so that the front end could easily send new filters to the backend.

@ErikBjare @johan-bjareholt - Would this be a PR you would be interested in?

redactedscribe · 2023-01-18T22:18:43Z

Just to throw out a couple of ideas relating to this/window titles that I'd like to see realised (some points mentioned earlier by others):

A way to:

Encrypt all of your data.
Mark certain window titles as sensitive but still have them logged. Think something similar to Discord's Streamer Mode: Keep all the data, but have some of it obscured/omitted if you want to share your stats, e.g. via screenshots of graphs, for example. Could be useful for sanitizing exports for debugging purposes too.
Mark certain window titles as blacklisted and never have them logged. For example, some window titles are spammy and of little relevance, such as those with a countdown timer in their title. Also, private browsing, as mentioned before.
Treat certain window titles as the same thing for display purposes. E.g. for the timer example, group all those window title entries as one since they're conceptually the same. Perhaps this would be achieved through user Regex rules, or maybe a user settable "window title variance" slider that fuzzily matches on how similar a title is to others, but in either case, on a per-application basis.
Merge window titles. After a rule is made for display purposes (above), an option to move it before the logging function to permanently merge captured titles as they come in. Less data but of higher quality if done right.
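
The grouping idea can be sketched in two ways: a regex rule that normalizes countdown timers, and a fuzzy match standing in for the "window title variance" slider. The timer pattern, the `<time>` marker and the threshold value are assumptions:

```python
import re
from difflib import SequenceMatcher

# Assumed timer shape: MM:SS or H:MM:SS anywhere in the title.
TIMER = re.compile(r"\d{1,2}:\d{2}(?::\d{2})?")


def normalize(title: str) -> str:
    """Collapse countdown timers so all ticks of one timer share a title."""
    return TIMER.sub("<time>", title)


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy match; `threshold` plays the role of the variance slider."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(normalize("04:59 - Pomodoro"))
print(similar("04:59 - Pomodoro", "04:58 - Pomodoro"))
```

The regex rule is predictable but has to be written per application; the fuzzy match needs no rules but can merge titles that merely look alike.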

I've known about ActivityWatch for many years and have probably installed it once every year or two, but the lack of any way to disable window title capturing completely has always caused me to inevitably uninstall. Until there are ways for a user to handle window title capturing, an on/off switch would be great. Excuse any ignorance as my overall experience with ActivityWatch is quite limited. I hope that will change because I'd really like to use your useful program.

Thanks.

dennisorlando · 2024-02-04T18:23:18Z

Shouldn't this be as easy as creating a filtering list on the server side, thus "if entry has a match with a filter, don't add it to the sql database"?
(I mean, I suppose not... else it wouldn't be open since 2016 🥲)
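
In its simplest form that idea looks like the sketch below, using an in-memory SQLite database. The table layout, the filter list and `insert_event` are made up for the example and do not reflect the actual aw-server schema:

```python
import json
import re
import sqlite3

# Hypothetical server-side filter list.
FILTERS = [re.compile(r"Private Browsing", re.IGNORECASE)]


def insert_event(conn: sqlite3.Connection, data: dict) -> bool:
    """Insert the event unless a filter matches its title; True if a row was written."""
    title = str(data.get("title", ""))
    if any(pattern.search(title) for pattern in FILTERS):
        return False  # matched a filter: never reaches the database
    conn.execute("INSERT INTO events (data) VALUES (?)", (json.dumps(data),))
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, data TEXT)")
insert_event(conn, {"app": "firefox", "title": "News - Firefox"})
insert_event(conn, {"app": "firefox", "title": "News - Firefox (Private Browsing)"})
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```

This only covers new events; data that is already stored would still need a separate clean-up pass.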

I can easily create a Category with a "Private browsing" pattern which correctly identifies all my "Private Browsing" data;
A really simple button named like "filter out data from this category" would work perfectly well for a lot of people.

Currently, there is no solution nor compromise which would fix / alleviate the problem, apart from using this pull or running "redact_sensitive.py" periodically.

pcuci · 2024-02-05T01:36:56Z

Doesn't Chrome already know where you've been? (Unless you turned off all settings to track you?)

I believe most AW users' expectations differ wildly from those of a https://www.qubes-os.org/ user

If you really, really can't trust yourself with what you're doing on your computer, simply use a different operating system that allows you to hide entire compute workloads from yourself.

ActivityWatch, in my mind, is not for PEPs or investigative journalists, it's for everyone else who wants more control (but not total control, as if that were even a thing...) over their digital crumbs, and trusts themselves enough with a local database, on a non-air-gapped computer, likely connected to the internet.

If you need even lower level trust, go for https://puri.sm/ with Qubes on it :-)

No need to overcomplexify AW, IMO

codermrrob · 2024-03-03T08:12:07Z

The default should be to respect private browsing, with opt-out option if somebody wants to record that. Mostly people will not want to record private browsing time, which by default for most people is not work related anyway.

pcuci · 2024-03-03T23:48:39Z

I do want to record private browsing time.

The reason I use private browsing and VPNs is to hide my activity from others on the web, not from myself. The reason I use AW is to surface insights into my own digital behavior (on and off the web, work and personal, both), private browsing included.

Actually, I use multiple computers (and VMs) and I'd like to track my behavior across all these (virtual) devices, not just my "main" device.

I do trust my LAN/VPNs to not be compromised... and AW fits the bill quite nicely. :-)

ErikBjare mentioned this issue May 23, 2016

Add functionality to redact/filter sensitive data ActivityWatch/aw-server#4

Closed

ErikBjare mentioned this issue Mar 10, 2017

Add upload to Zenobase feature #20

Closed

ErikBjare added the type: enhancement label Apr 18, 2017

johan-bjareholt added the backlog label Oct 10, 2017

ErikBjare mentioned this issue Nov 28, 2017

Feedback from a reddit user #144

Closed

johan-bjareholt added the priority: low label Jun 16, 2018

stale bot added the !wontfix label Feb 15, 2020

ErikBjare removed the !wontfix label Feb 15, 2020

johan-bjareholt mentioned this issue Mar 1, 2020

Support private mode #330

Closed

stale bot added the stale label Aug 29, 2020

johan-bjareholt removed the stale label Aug 30, 2020

aftabnaveed mentioned this issue Nov 18, 2020

Write mission/vision statement #236

Open

ErikBjare mentioned this issue Dec 6, 2020

feat: added basic search feature ActivityWatch/aw-webui#236

Merged

5 tasks

stale bot added the stale label Mar 17, 2022

stale bot closed this as completed Apr 13, 2022

ActivityWatch deleted a comment from stale bot Apr 13, 2022

ErikBjare reopened this Apr 13, 2022

stale bot removed the stale label Apr 13, 2022

NathanaelA mentioned this issue Sep 18, 2022

Allow ActivityWatch to Ignoring / Filtering ActivityWatch/aw-server-rust#302

Open

ErikBjare mentioned this issue Sep 4, 2023

"Ignore case" wording issue: I genuinly thought this was to create exclude lists #948

Closed

s1lverkin mentioned this issue Jul 1, 2024

High CPU usage since upgrading to v0.13.1 (rust) #1082

Closed
