DBSamizdat

The blissfully naive PostgreSQL database object manager.

For specifics on integrating with the Django web framework, and some examples for integrating with its ORM, see the Django integration section further down.

For discussions, questions, bug reports, feature requests, development, and engaging with other users, use the public mailing list.

It requires Python 3.6 or later, and is tested for compatibility with PostgreSQL 9.6 and up.

What it is

DBSamizdat is a lightweight tool that manages code that lives in the database: views, materialized views, functions and triggers, and the dependencies between those and other DB objects.

There is a command-line utility, and a library-style API.

It does not take over your database; it is designed to augment an existing database.

It works by modeling DB objects that you wish to manage as Python classes, then syncing that model state to a database. It can also refresh materialized views in proper dependency order, via autogenerated triggers or on command.

For whom

  • Analysts:

    • share ontologies of business data and centralize their definitions
    • stop passing around various versions of SQL snippets embedded in ad-hoc queries
    • build data processing pipelines as stacks of (materialized) views, with reproducible results, and re-derive everything from ground truth at a keystroke
    • if you have permission to create objects in a database (if not in the public schema, then perhaps in your own namespace, aka "schema"), you can use DBSamizdat
  • Developers:

    • contain the use of raw SQL, make it reviewable, versionable and easily refactorable as part of a project's codebase, rather than treating it as DB state (as with table contents)
    • manage DB object lifecycles alongside existing migration infrastructure.
    • ORMs, and other developers, can read from views as if they were "just tables": a familiar interface. Using views to encapsulate SQL can actually cut down on the proliferation of one-off SQL queries (or esoteric ORM constructs, which are not necessarily better than SQL) in a codebase.

In particular, dbsamizdat is useful when:

  • you need a light & easy migration framework, allowing you to version your database handiwork in something more structure-aware than a flat SQL DDL dump
  • you work with stacks of database views, and want to refactor with more ease than is possible by search & replace/reorder of text in a flat SQL DDL dump
  • you would like to visualize analytical data flows, or visualize unit dependencies to aid in refactoring
  • you would like to easily refresh stacked materialized views, in dependency order, when certain base tables change
  • potentially even refreshing those materialized views automatically, in dependency order, using autogenerated triggers
  • you have a fine ORM, which does mundane DB operations very well, but you need a way to structure and manage raw SQL DB code for when the ORM does not suffice

How it works

The dependency graph

Based on dependencies between the objects it manages, and between managed objects and unmanaged objects (base tables, for instance), it creates a dependency graph.

That graph is useful:

  • it is checked for cycles and other errors and anomalies
  • it can be applied to a database, creating objects in their proper dependency order
  • it can be diffed with existing database state
  • it can be used to patch up existing database state
  • it is a model of data flows, so it can tell us where a certain base table's data ends up

State tracking

DBSamizdat tracks which objects fall under its responsibility. For this, rather than adding a table to your DB that tracks the application of successive migrations (as many DB migration frameworks do), it simply signs¹ the objects it creates by setting a PostgreSQL comment on those objects. The advantage is that the comment state is tied to the object's lifecycle: if someone or something DROPs the object, there is no potential source of disagreement left lying around.

Corollaries:

  • Remove DBSamizdat's comment from a DB object, and it will stop managing it.
  • Change DBSamizdat's comment on a DB object to something that doesn't look like a DBSamizdat signature anymore, and it will stop managing it.
  • Tamper with the comment on a DB object so that it still looks like a DBSamizdat signature, and you'll be in for surprises.
  • If you look at the comments (eg \d[mvf] when in psql), you'll see some JSON, which might end up surprising end-users if those comments are served out through, for instance, some BI tool.
  • DBSamizdat will not notice when you alter an object behind its back. It only looks at the signature in the comment.

With this type of state tracking, it's possible to let multiple people create their own Samizdat structures in the same database, without stepping on each other's toes². For this, leverage PostgreSQL schemas and let everyone work in their own namespace. This can be done by convention, but can also be administratively enforced through schema permissions.

¹) It's not a cryptographic signature or MAC. No guarantees beyond "this looks very much like something I would say" from DBSamizdat's perspective.

²) With the exception of triggers on shared base tables.

The basics

Install it: pip install --user dbsamizdat

Create one or more modules with classes inheriting from SamizdatView, SamizdatMaterializedView, SamizdatFunction or SamizdatTrigger. Be sure to read the class reference below. The below is just an example:

from dbsamizdat import SamizdatView, SamizdatMaterializedView

class Salutation(SamizdatView):
    # SQL is written into a template, which will be read as a string.Template to substitute 
    # various values, such as the preamble, which will be substituted with the CREATE statement.
    sql_template = """
      ${preamble}
      SELECT format('Hi %s!', current_user) AS utterance
      ${postamble}
      """

class TimestampedSalutation(SamizdatMaterializedView):
    # As long as the result is a string and you embed the substitution placeholders,
    # you can use whatever you want to create the `sql_template`. Here we've
    # added an index, making use of the templated-in object name,
    # so that we don't have to hardcode it.
    # In this case we use an f-string, and instead of hardcoding
    # the DB identity of the Salutation view from which we are selecting, we're
    # referencing it through its Samizdat class.
    # Note that in this f-string we have to escape the string.Template braces.
    sql_template = f"""
      ${{preamble}}
      SELECT now() AS saluted_at, utterance
      FROM {Salutation.db_object_identity}
      ${{postamble}}
      CREATE INDEX ON ${{samizdatname}} (saluted_at)
      """
    # As DBSamizdat doesn't parse your SQL, it doesn't actually know we're reading from the Salutation view.
    # We have to specify this relationship explicitly.
    deps_on = {Salutation}  # Dependencies can also be specified as flat strings, or tuples of strings; see the class reference.

You can put your classes in any module, and modules can contain classes other than Samizdat classes without problems. It may make sense to organize them into several modules.

To apply the specified state to your PostgreSQL database, use the dbsamizdat CLI tool, like so:

dbsamizdat sync postgresql:///some_database your_module maybe_yet_another_module

Running the sync command again should tell you that there's nothing to do. But should you DROP the TimestampedSalutation materialized view, sync will recreate it.
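
If you want to see whether a sync would succeed without changing anything, you can use the "dryrun" transaction discipline documented in the CLI help further down, which builds the whole transaction and then rolls it back:

dbsamizdat sync --txdiscipline dryrun postgresql:///some_database your_module maybe_yet_another_module

(The diff subcommand, described in the CLI section below, compares Python-side and DB-side state more directly.)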

That's it, very basically!

Using the CLI

The CLI works through subcommands. Start exploring with a simple:

dbsamizdat --help

which will give you:

usage: dbsamizdat [-h] [--quiet] [--verbose] {nuke,printdot,diff,refresh,sync} ...

dbsamizdat, the blissfully naive PostgreSQL database object manager

optional arguments:
  -h, --help            show this help message and exit
  --quiet, -q           Be quiet (minimal output)
  --verbose, -v         Be verbose (on stderr).

commands:
  {nuke,printdot,diff,refresh,sync}
    nuke                Drop all dbsamizdat database objects.
    printdot            Print DB object dependency tree in GraphViz format.
    diff                Show differences between dbsamizdat state and database state. Exits nonzero
                        if any are found: 101 when there are excess DB-side objects, 102 if there 
                        are excess python-side objects, 103 if both sides have excess objects.
    refresh             Refresh materialized views, in dependency order
    sync                Make it so!

These commands have their specific parameter sets and explanatory descriptions. For instance, to learn more about the sync command, we can ask for some help on it:

dbsamizdat sync --help

which will then tell you:

usage: dbsamizdat sync [-h] [--txdiscipline {checkpoint,jumbo,dryrun}] dburl samizdatmodules [samizdatmodules ...]

positional arguments:
  dburl                 PostgreSQL DB connection string. Trivially, this might be 'postgresql:///mydbname'.
                        See https://www.postgresql.org/docs/14/static/libpq-connect.html#id-1.7.3.8.3.6 .
  samizdatmodules       Names of modules containing Samizdat subclasses

optional arguments:
  -h, --help            show this help message and exit
  --txdiscipline {checkpoint,jumbo,dryrun}, -t {checkpoint,jumbo,dryrun}
                        Transaction discipline. The "checkpoint" level commits after every dbsamizdat-level action.
                        The safe default of "jumbo" creates one large transaction.
                        "dryrun" also creates one large transaction, but rolls it back.

Class reference

Read-only properties

  • fq — The fully qualified DB object, including schema (see "Using schemas" under "Tips") and, in the case of a function, the argument signature.
  • db_object_identity — The unique object reference from the DB's point of view (for functions, this includes the argument signature).

Methods

  • create, drop, sign — These generate SQL. They are normally not called directly; rather, use the sync API function to operate on the Samizdat collection as a whole.
  • dbinfo — Generates the comment that will be inserted into the database for this object.

Required attributes

  • sql_template — The create statement for this entity, as a string template. Start with the ${preamble}, which will be templated for you into an appropriate CREATE statement for the entity. Then start your SQL where the automation leaves off:
    • in the case of views, it's the SELECT statement following AS
    • in the case of a function, you'd start with the RETURNS (as you'd type right after the function head, which will be templated in for you)
    • in the case of a trigger, you'd start with the part that comes after [ FROM referenced_table_name ]. Refer to the PostgreSQL documentation on CREATE TRIGGER.

If you get the templating wrong, it's often easy enough to see why from the SQL that was generated; run the CLI tool with -v (-v2 for the Django management command) or get the .sql attribute from the ensuing DatabaseError.

For a view or materialized view, add a ${postamble}. After the postamble, you may tack on some more CREATE statements, eg to create indices on materialized views. These will be unmanaged: DBSamizdat doesn't parse your SQL and thus knows nothing about what you put there, so while you COULD create a table in there, you definitely shouldn't. But an index on a materialized view would be fine, since it is attached to the lifecycle of the view without DBSamizdat needing to worry about it.
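
For a function, the preamble covers everything up to and including the function head, so you pick up at RETURNS. A minimal sketch under those assumptions (the greeting_text function is made up, and the postamble is omitted since it is only described for views):

from dbsamizdat import SamizdatFunction

class greeting_text(SamizdatFunction):
    # The preamble is assumed to expand to the CREATE statement up to and
    # including the function head; we pick up at RETURNS, as described above.
    sql_template = """
      ${preamble}
      RETURNS TEXT AS $$
        SELECT format('Hello, %s!', current_user)
      $$ LANGUAGE SQL
      """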

Required attributes for SamizdatTrigger-subclasses

  • on_table — The table this trigger should be installed on.
  • condition — The { BEFORE | AFTER | INSTEAD OF } { event [ OR ... ] } part of the CREATE TRIGGER syntax. A trivial example would be "AFTER INSERT". Refer to the PostgreSQL documentation on CREATE TRIGGER.
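
A hedged sketch tying these together with a trigger function; the widget table and the notify function are made up, and the preamble is assumed to cover the CREATE TRIGGER head through the ON table (and optional FROM) clause, as described above:

from dbsamizdat import SamizdatFunction, SamizdatTrigger

class notify_widget_change(SamizdatFunction):
    # A trigger function: returns TRIGGER, notifies listeners of the operation.
    sql_template = """
      ${preamble}
      RETURNS TRIGGER AS $$
      BEGIN
        PERFORM pg_notify('widget_changes', TG_OP);
        RETURN NULL;
      END
      $$ LANGUAGE plpgsql
      """

class widget_change_trigger(SamizdatTrigger):
    on_table = "widget"                             # hypothetical base table
    condition = "AFTER INSERT OR UPDATE OR DELETE"
    deps_on = {notify_widget_change}                # ensure the function is created first
    sql_template = """
      ${preamble}
      FOR EACH STATEMENT EXECUTE PROCEDURE notify_widget_change()
      """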

Optional attributes

  • schema — This defaults to public; if you've never heard of PostgreSQL schemas, don't worry and leave it out.
  • deps_on — Defaults to the empty set. This lists any other Samizdats that this entity depends on. DBSamizdat uses these dependency specifications to construct the aforementioned graph. As it is incapable of comprehending your SQL template, it can't extract dependencies from there, so you'll have to repeat yourself a bit here. Only include samizdat entities. References can be of the following forms, in reverse order of preference:
    • a string: Will be interpreted as the bare name of a samizdat in the same schema. This is deceptively simple. Too simple. It won't work for everything; referencing functions this way is not possible.
    • a 2- or 3-tuple of strings: Will be interpreted as (schemaname, samizdatname) or (schemaname, samizdatname, argument-signature); the latter form is for referencing functions. This is an unambiguous reference, but it is potentially unbound (nothing on the Python side checks that the referenced Samizdat actually exists).
    • a direct class reference: Excellent. The preferred way of referencing Samizdats. You get the benefit of having the language (Python) watching over referential consistency, and not having to chase down and update your references when, for instance, a referenced function changes its argument signature.
  • deps_on_unmanaged — Defaults to the empty set. This lists other DB entities this entity depends on. If your entity depends on any database tables, list them here. This information is used for visualization (printdot command) and for determining which materialized views to refresh when using the belownodes CLI/API parameter for a refresh.
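
For illustration, a sketch combining the reference forms above (all names here are hypothetical, and the SQL bodies are placeholders):

from dbsamizdat import SamizdatView

class CustomerBase(SamizdatView):
    sql_template = """
      ${preamble}
      SELECT id, name FROM customer
      ${postamble}
      """

class CustomerReport(SamizdatView):
    deps_on = {
        CustomerBase,                                # direct class reference: preferred
        ("analytics", "OrderStats"),                 # (schemaname, samizdatname)
        ("analytics", "score", "customer_id INT"),   # (schemaname, samizdatname, argument-signature), for a function
        "CustomerNotes",                             # bare name in the same schema: least preferred
    }
    deps_on_unmanaged = {"customer", ("sales", "order")}   # plain tables this view reads from
    sql_template = """
      ${preamble}
      SELECT 42 AS answer  -- placeholder SQL; the point here is the reference forms above
      ${postamble}
      """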

Optional attributes for SamizdatMaterializedView-subclasses

  • refresh_concurrently — Defaults to False. See the PostgreSQL documentation on the requirements for your materialized view in order to benefit from refreshes that don't lock out concurrent readers.
  • refresh_triggers — Defaults to the empty set. Tables referenced here will get a trigger installed that refreshes the materialized view on INSERT, UPDATE, DELETE and TRUNCATE. Triggers will run in dependency order, that is, if both materialized views A and B have a refresh_triggers reference to table X, and materialized view A depends directly or indirectly on materialized view B (through deps_on connections), then the autogenerated triggers will be named such that materialized view B will be refreshed before materialized view A.
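
A sketch of a materialized view that keeps itself up to date via such triggers (the orders table and column names are hypothetical):

from dbsamizdat import SamizdatMaterializedView

class DailyOrderTotals(SamizdatMaterializedView):
    refresh_concurrently = True       # requires a UNIQUE index on the materialized view, created below
    deps_on_unmanaged = {"orders"}    # the base table we read from
    refresh_triggers = {"orders"}     # autogenerated trigger refreshes this view on writes to "orders"
    sql_template = """
      ${preamble}
      SELECT date_trunc('day', ordered_at) AS day, count(*) AS order_count
      FROM orders
      GROUP BY 1
      ${postamble}
      CREATE UNIQUE INDEX ON ${samizdatname} (day)
      """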

Optional attributes for SamizdatFunction-subclasses

  • function_name — Defaults to the class name. The reason this is overridable is to support function polymorphism: the identity of a DB function is made up of its schema, name, and argument signature, so you might want to create several functions sharing a name but differing in argument signature. If the function name were always taken from the Python class name, you'd need several classes with the same name to support this polymorphism, which would then force you to place them in separate modules; not very convenient.
  • function_arguments_signature — Defaults to '', that is to say, the function takes no arguments. Specify the argument signature sans parentheses, eg username TEXT, userid INT.
  • function_arguments — If your function's call signature differs from its full argument list (which can be the case, for instance, when you use arguments with defaults, or IN/OUT/INOUT/VARIADIC argument modes), then you'll have to list the full function arguments here, and repeat the proper signature subset of that in function_arguments_signature. Because DBSamizdat doesn't interpret SQL, it will not derive the signature from the arguments; you'll need to spell it out.
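
A sketch of the polymorphism and defaults described above (the names are made up; note how the DEFAULT stays out of the identity signature):

from dbsamizdat import SamizdatFunction

class greet_with_default(SamizdatFunction):
    function_name = "greet"   # shared DB name; identity differs by argument signature
    function_arguments = "username TEXT, greeting TEXT DEFAULT 'Hi'"   # full argument list, defaults included
    function_arguments_signature = "username TEXT, greeting TEXT"      # identity signature, sans defaults
    sql_template = """
      ${preamble}
      RETURNS TEXT AS $$
        SELECT format('%s, %s!', greeting, username)
      $$ LANGUAGE SQL
      """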

Tips

Programmatic access

The api and django_api modules provide access to DBSamizdat functionality from Python: the sync, refresh and nuke actions, which function as their CLI namesakes.
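
A minimal sketch of library-style use; the exact parameter names and order shown here are assumptions, so consult the api module itself for the authoritative signatures:

from dbsamizdat import api

DBURL = "postgresql:///some_database"
MODULES = ["your_module", "maybe_yet_another_module"]

# Assumed call shape: a libpq connection string plus the modules containing Samizdat classes.
api.sync(DBURL, MODULES)
api.refresh(DBURL, MODULES)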

Using schemas

DBSamizdat has support for PostgreSQL schemas. They can be thought of as namespaces for DB objects.

As noted in the class attribute reference, use the schema class attribute to place an object in a schema other than the default public schema. When dependency-referencing Samizdat classes, you can simply use the class itself, if you have a reference to it (say, you imported it). If you're referencing non-Samizdat entities that exist in schemas foreign to the schema of the referencing class, you can specify them as a tuple of strings, like so: ('schemaname', 'tablename'). This tuple is exposed on any dbsamizdat class as the .fq ("fully qualified") property. When you can't reference a class directly, you may instead use such a tuple to refer to it, and the reference will be resolved when dbsamizdat runs. Note that functions are referenced with a 3-tuple of strings which includes their argument signature, eg ('someschema', 'somefunctionname', 'userid int, comment text').
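
A short sketch of those schema conventions (the schema and table names are made up, and the schemas are assumed to already exist):

from dbsamizdat import SamizdatMaterializedView

class CustomerStats(SamizdatMaterializedView):
    schema = "analytics"                        # placed in the "analytics" schema instead of "public"
    deps_on_unmanaged = {("sales", "orders")}   # a table in a foreign schema, referenced as a 2-tuple
    sql_template = """
      ${preamble}
      SELECT customer_id, count(*) AS order_count
      FROM "sales"."orders"
      GROUP BY customer_id
      ${postamble}
      """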

Visualization

Install GraphViz. If you are on Linux, chances are you can install it through your distribution. Your distribution may also carry the xdot program for visualizing dotfiles.

To generate a graph description in the dot language, and create an image of it, run a variation of:

dbsamizdat printdot dbsamizdat.test_samizdats | dot -Tsvg > test_samizdats.svg

And then open the resulting test_samizdats.svg file in something that can handle .svg files, such as a modern web browser.

Or pipe the dotfile output directly into xdot, if you have that:

dbsamizdat printdot dbsamizdat.test_samizdats | xdot -

You will see something like this:

[example_output: the rendered dependency graph image for the test classes]

The legend for these nonsensical test classes:

  • Nodes:
    • Yellow houses are DB tables
    • Grey squares are views
    • Red blocks are materialized views
    • Green hexagons are functions
    • Purple arrowboxes are triggers
  • Edges:
    • A blue dotted line with ball arrowhead and refresh symbol label denotes an autorefresh-relationship between a materialized view and a base table. They're implemented with an autogenerated trigger and trigger-function pair, which are left out of the visualization to reduce clutter.
    • A red edge denotes a relationship with an unmanaged object at its origin
    • A black edge denotes all else

Django integration

What integration?

"Integration" is a loosely defined term. For DBSamizdat, it relies on:

  1. A Django model's managed Meta-property, which we can use to tell Django to disavow the DB object underlying the model, so we can manage it through DBSamizdat.
  2. The fact that the ORM will read views as if they're tables just fine.
  3. The fact that Django doesn't mind extraneous DB objects (objects beyond those flowing from its Models). Unless, of course, they are in the way of something Django wants to do (say, create a table foo for a new Model while a SamizdatView foo already exists).

Thus, the "integration" comes down to:

  1. Getting out of the way when Django needs to perform migrations, and making things right afterwards
  2. Autodiscovering your various apps' dbsamizdat definition modules
  3. Wrapping the CLI in a management command, so that the Django DB connection is used, and the autodiscovered modules are applied

Limitations

Currently, Samizdats are applied to the default Django DB connection. Multi-DB setups aren't supported through the Django helpers. But you could use the regular API and CLI to work with the other databases regardless.

Getting up and running

pip install dbsamizdat, then add dbsamizdat to your INSTALLED_APPS.

Then add a module called dbsamizdat_defs to your app, and define your Samizdats.

Then run ./manage.py dbsamizdat sync to sync state to your DB. ./manage.py dbsamizdat works pretty much the same as the normal CLI, except that you don't have to specify how to connect to PostgreSQL, nor do you need to list any modules, because they're autodiscovered.

To use the library API functions, load the django_api module rather than the api module.

Extra configuration

The following Django settings are consulted by DBSamizdat, if declared:

  • DBSAMIZDAT_SMART_MIGRATIONS — a boolean, False by default. This has to do with migrations. As Django's migrations are blissfully unaware of your samizdats, things will fail if, for instance, a table column is removed that one of your views selects. The default lets DBSamizdat assume the worst case: it will simply remove all its DB state before Django's migrations are run, and restore that state afterwards. But if you set this to True, DBSamizdat will examine the migration plan to determine the subset of DB state that needs to be dropped and rebuilt after all the migrations have run. This behaviour can be desirable when you have large or computationally intensive materialized views. It's not turned on by default, as it's deemed experimental because it relies on non-public Django APIs. See if it works for you; if not, turn it back off and send a bug report to the mailing list!
  • DBSAMIZDAT_MODULES — an Iterable[str] of module names, eg ["myapp.somesamizdats"]. Empty by default. These modules will be loaded in addition to any autodiscovered modules. When you place a module named dbsamizdat_defs in a Django app, it will automatically be picked up by DBSamizdat; think of it as a models.py, but for Samizdats. DBSAMIZDAT_MODULES is there for when you want to place some (extra) Samizdat classes elsewhere. Though of course, you could also import those classes from any module into the canonical dbsamizdat_defs module.
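
In settings.py, that could look like this:

# settings.py
DBSAMIZDAT_SMART_MIGRATIONS = True              # opt in to the experimental migration-plan analysis
DBSAMIZDAT_MODULES = ["myapp.somesamizdats"]    # loaded in addition to autodiscovered dbsamizdat_defs modules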

Tips

Referencing things nicely

Django allows you to define Model classes that read from a SamizdatView (or SamizdatMaterializedView) rather than a table. For that, set your model's Meta up like so:

class MyModel(models.Model):
    ...

    class Meta:
        managed = False
        db_table = MySamizdatView.db_object_identity

We're directly referencing the Samizdat class here. You could use a bare string to refer to the view instead, but doing it as shown is better, as then the language (Python) will do the work of guarding referential integrity. Plus perhaps you'll gain some IDE integration, as you might now easily jump to the referenced class' definition.

Similarly, to reference one of Django's tables from a Samizdat:

class MyModel(models.Model):
    ...

class MySamizdat(SamizdatView):
    deps_on_unmanaged = {MyModel._meta.db_table}
    sql_template = f"""
    ${{preamble}}
    SELECT * FROM "{MyModel._meta.db_table}";
    ${{postamble}}
    """

We could have used a bare string to refer to MyModel's table, but instead we're directly referencing it, for reasons explained earlier. As this is an F-string, we'll need to escape the string.Template variables, so ${preamble} became ${{preamble}}. But you can use any templating/formatting you want, as long as the resultant string makes sense as the string.Template that it will be interpreted as. Of course, dynamic string templating SQL is a danger zone with respect to escaping and injection, but we're consenting adults here.

Refreshing materialized views: Django signals vs PostgreSQL triggers

Materialized views are nice, but they become stale as their source data is updated. To keep materialized views in sync following mutations on Model instances, you could use Django signals: call the refresh() function from the django_api module in a signal handler acting on a model's post_save and post_delete signals, passing an appropriate belownodes argument to refresh everything that depends on that model's DB table. Depending on your needs, you may want to choose the "checkpoint" transaction discipline if you want each affected view's updated content to become visible as refreshing progresses (on a view-by-view basis) rather than only at the very end.
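
A hedged sketch of that signal wiring; the model and app are made up, and the exact refresh() parameters (notably whether it accepts a belownodes keyword argument in this form) are assumptions to verify against the django_api module:

from django.db.models.signals import post_delete, post_save
from django.dispatch import receiver

from dbsamizdat import django_api
from myapp.models import Order   # hypothetical model


@receiver(post_save, sender=Order)
@receiver(post_delete, sender=Order)
def refresh_downstream_matviews(sender, **kwargs):
    # Refresh only the materialized views downstream of Order's table.
    # Ideally, hand this off to a task queue instead of running it inline.
    django_api.refresh(belownodes={Order._meta.db_table})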

The upside of this approach is that you can defer the refreshing to a task queue so that it does not hold up the HTTP request/response cycle. You can also debounce the event.

The downside is that there is now a window of inconsistency; the materialized view will, if all goes well, be updated eventually, but in the meantime it might not be in agreement anymore with the updated source data. Also, whether the mechanism fires depends on whether you use .save() on a Model instance - if you use a QuerySet's update function instead, signals will not fire.

An alternative is to use DB triggers, so that a refresh takes place within the same transaction as the mutation on its base data. The big advantages are consistency and atomicity. They're also very easy to set up: simply list the base tables in the refresh_triggers attribute of your materialized view; in the trivial case, that could simply be refresh_triggers = deps_on_unmanaged, as you've already spelled out the base tables there.

The disadvantage of the trigger approach is that what used to (probably) be a fast insert now has to wait for the (potentially lengthy) refresh of the materialized view; whether that is acceptable very much depends on the specifics of your use case.

In either case, to avoid locking out readers while a materialized view refreshes, you could investigate whether you can turn on concurrent refreshes.

Depending on write flux, consistency requirements, and sheer computational demands, there are certainly cases for which neither signals nor triggers are a good fit, and a nightly batch process would work better, perhaps not even using materialized views at all.