Both a study on the MediaWiki schema evolution and this paper identify the following operations as by far the most common:
- add a column (39% of MW schema changes, which are 55% of total)
- drop a column (26% of MW schema changes)
- modify an index (40% of overall MW changes)
- add / remove a table (not really a schema change)
From the data above it is clear that a lot of utility can already be provided by the ability to add or remove columns and indexes. Additionally, more complex operations are often not or only partially supported by backend systems (Cassandra for example only supports renaming primary key columns), so would be much harder to support.
MediaWiki has had good success using a sequential schema evolution mechanism,
in which an upgrade step (which can be just a SQL statement like alter table
, but also more complex code) is executed if associated tests match the
existing schema. This is a very powerful system, perhaps too powerful for a
distributed storage service. It might make more sense to focus on the minimal
functionality needed on the service side, and leave anything more complex to
users.
If we focus on addition / removal of columns and indexes, then it might be feasible to directly diff the declarative JSON schema sent during table creation to figure out the necessary changes. This is especially straightforward for new non-primary-key columns and new indexes (okay, building indexes is more work, but it's all safe).
Deletion is more problematic. For secondary indexes this might be okay as those can be rebuilt, but column deletion could lose data. It is also common that an app needs to migrate old data to a new format before removing the column. Removal of columns can also affect existing indexes, which is more complex. For these reasons it seems to make more sense to provide a different, more explicit, mechanism to delete columns. Even if that functionality is missing initially, additional information can be excluded using the proj selector. In systems like Cassandra, the main practical downside of old columns is the storage used for existing data in them. In new rows without explicit values for these old columns, no space is used for these columns.
- Whenever a PUT is made to re-create an existing table, diff the schemas and bail out if there is any difference.
- Handle new non-primary-key columns by adding them to the table.
- Ignore deleted non-primary-key columns, and provide explicit entry point to delete non-primary-key columns from an app.
- Add / build and remove secondary indexes.