Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalizing email headers in SQLite #7

Open
knowtheory opened this issue Jul 20, 2024 · 1 comment
Open

Normalizing email headers in SQLite #7

knowtheory opened this issue Jul 20, 2024 · 1 comment

Comments

@knowtheory
Copy link

Hey everybody,

I've been spending some time over the past week trying to use mbox-to-sqlite to import my 57gb mbox from GMail. After a bunch of tinkering I've created an sqlite db for myself, but wanted to talk w/ users of this library (or just Simon) about some suggested enhancements, and whether the changes i made would be worth contributing.

Email headers are mixed case, and SQL is not.

As noted in #5 and #6, running mbox-to-sqlite will fail for arbitrary mbox files with a duplicate column error. These failures are caused by the fact that SQL and SQLite's column space are single case while email header keys are mixed case. Attempting to insert mixed case will result in these duplicate column errors.

So what do?

Should mbox-to-sqlite normalize header keys?

I'd posit yes. My personal solution was to preprocess each message that mailbox produced prior to upserting them as mbox-to-sqlite does. The normalization I chose was to map mix-cased keys to lowercase, and merge the values of any colliding keys.

Merging values seems fine for my purposes (search & retrieval). I don't need to trace individual headers back for forensic purposes or the like.

SQLite has a maximum column limit.

After deciding on a strategy for avoiding column naming conflicts, I discovered that SQLite has a limit to the number of columns allowed in a single table. SQLITE_MAX_COLUMNS is a compile time setting, and defaults to 2000. Since this setting is determined at compile time I assume it to be immutable for all practical intents and purposes.

This is a problem because the set of columns that mbox-to-sqlite builds can be arbitrarily large, since it's the union of all of the email header keys contained in the mbox. In my case, 57gb of email contained 4040 unique headers (3574 if mapped to lower case).

Should mbox-to-sqlite provide a (configurable) fixed set of columns?

If we treat SQLITE_MAX_COLUMNS as fixed there's no practical solution other than limiting the number of columns that mbox-to-sqlite produces. The question shifts to what columns to include, and what to do with remaining headers.

For myself I counted the frequency of headers in my mail archive and used that to filter out a list of headers with 1000 entries or more. I used that list of columns to set up my messages table, and filter the keys for each record dict. Any header not in the list gets put into a dict that gets dumped as a JSON string into an additional_headers field.

It shouldn't be hard to automate scanning the headers to calculate the distribution and suggest what an appropriate cut off point might be.

Anyway, if any of these contributions would be helpful, let me know and i can fire up a pr.

@knowtheory
Copy link
Author

p.s. SQLite limits aren't accessible via SQL, but Stack Overflow provides an insane but relatively quick way to find SQLITE_MAX_COLUMNS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant