---
title: Database Configuration
---

import Warning from 'taskcluster-ui/views/Documentation/components/Warning';


Taskcluster uses a Postgres 11 database for its backend storage. A single database is shared among all services.

Taskcluster assumes that it "owns" the database, but it can share a single server with other users. For production purposes we recommend a dedicated server, due to the possibility of contention for server resources, but sharing a server is acceptable for non-production environments.

The default collation (sort order) for the database must be `en_US.UTF8`. This can be achieved by setting `LC_CTYPE=en_US.UTF8 LC_COLLATE=en_US.UTF8` when creating the database, by passing `-l en_US.UTF8` to `createdb`, or by invoking `CREATE DATABASE` with `LC_CTYPE 'en_US.UTF8' LC_COLLATE 'en_US.UTF8'`.

The choice of `en_US` is an accident of history, and not meant to demonstrate a preference for that locale over any other. Since most sorting is performed on identifiers such as slugIds, `C` would have been a more appropriate choice, but all large production deployments were already using `en_US` when this issue was discovered, and the default collation cannot be changed after the fact. In practice, the difference is minor and only visible when sorting identifiers that differ in case.
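
For example, a database with the required collation could be created like this (the database name `taskcluster` here is just a placeholder):

```sql
-- TEMPLATE template0 is needed when the locale differs from the server default
CREATE DATABASE taskcluster
  LC_COLLATE 'en_US.UTF8'
  LC_CTYPE 'en_US.UTF8'
  TEMPLATE template0;
```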

# DB Users

Each service uses its own Postgres user to access the database, and Taskcluster uses this functionality internally to isolate services from one another. The Taskcluster deployment process additionally requires an "admin" user that is used for schema migrations and to update the permissions of the per-service users. This admin user must have full permission to all resources in the selected database, as well as permission to grant and revoke access for the service users. The admin user can have any name.

Because users are global to a Postgres server, Taskcluster requires a name prefix which is applied to each per-service user. In a non-production environment, this allows several installations of Taskcluster to co-exist on the same server. We recommend that the admin username and the prefix be the same string.

The prefix must consist of one or more lowercase letters and underscores (matching `/[a-z_]+/`).

The set of users that must be configured, then, is:

* `<prefix>` -- admin user
* `<prefix>_auth` -- user for Taskcluster auth service
* `<prefix>_github` -- user for Taskcluster github service
* `<prefix>_hooks` -- user for Taskcluster hooks service
* `<prefix>_index` -- user for Taskcluster index service
* `<prefix>_notify` -- user for Taskcluster notify service
* `<prefix>_object` -- user for Taskcluster object service
* `<prefix>_purge_cache` -- user for Taskcluster purge_cache service
* `<prefix>_queue` -- user for Taskcluster queue service
* `<prefix>_secrets` -- user for Taskcluster secrets service
* `<prefix>_web_server` -- user for Taskcluster web_server service
* `<prefix>_worker_manager` -- user for Taskcluster worker_manager service

Set these up using something with the following effect:

```sql
CREATE USER <prefix> PASSWORD '<admin_password>';
GRANT ALL ON DATABASE <dbname> TO <prefix> WITH GRANT OPTION;
CREATE USER <prefix>_queue PASSWORD '<queue_password>';
-- ...and so on for each service user
```

Every run of `db:upgrade` will check permissions for these users, including permission bits like `SUPERUSER` and role memberships. Note that some cloud providers create users with elevated permissions by default. In such situations, it's best to use commands such as the above instead, or to reduce the new users' permissions after the fact.
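
For example, on a server whose provider grants extra attributes to new users, a sketch like the following (with `myprefix` standing in for your chosen prefix) strips them back and verifies the result:

```sql
-- remove elevated attributes that may have been granted by default
ALTER USER myprefix_queue NOSUPERUSER NOCREATEDB NOCREATEROLE NOREPLICATION;

-- review the remaining attributes for all Taskcluster users
SELECT rolname, rolsuper, rolcreatedb, rolcreaterole
  FROM pg_roles
 WHERE rolname LIKE 'myprefix%';
```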

# Read and Write Replicas

Taskcluster assumes full transactional consistency from Postgres. That is, once a transaction is committed, the results of that transaction must be visible everywhere. Furthermore, Taskcluster relies on per-row locks to perform queueing operations. Within these parameters, the backend database can be replicated and sharded as load and availability demand.

Taskcluster allows each service to be configured with both "read-only" and "read-write" access configuration. The read-only configuration will only be used to read from the database. This can be used to direct reads to read-only replicas, in cases where the read load is significantly higher than the write load. However, note that these read-only replicas must conform to the consistency requirements described above.

# Connection Counts

Each Kubernetes deployment can use up to 10 connections. In particular, the web pods may, under load, use this maximum. Periodic tasks and worker pods will use fewer connections. A basic deployment of Taskcluster, with the default scale of one for each Kubernetes deployment, can use about 70 connections when under load. This may seem like a lot, but it is relatively small given the number of pods in a running deployment.

When updating Kubernetes deployments, where new pods start up before old pods shut down, the connection count can temporarily double.

We recommend setting Postgres's `max_connections` to 200 for production Taskcluster deployments. Adjust that number based on observed connection counts, paying careful attention to spikes during Kubernetes deployments. Increase the number as necessary when increasing the replica counts on Kubernetes deployments.
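
To see how the configured limit compares to actual usage, queries along these lines can be run against the server:

```sql
-- the configured limit
SHOW max_connections;

-- current connections, broken down by user
SELECT usename, count(*)
  FROM pg_stat_activity
 GROUP BY usename
 ORDER BY count(*) DESC;
```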

# Configuration

Database access is configured with URL-shaped strings as defined by node-postgres.

These URLs can specify TLS connections with a `?ssl=1` suffix, but note that this only enables end-to-end encryption, without identity validation on either end; the connection could be intercepted by a MITM attack. An alternative syntax is `?sslcert=/path/to/client/cert&sslkey=/path/to/client/key&sslrootcert=/path/to/server/cert`, giving filenames for a client certificate and key, and for a server certificate. These files must be on the filesystem, making this form difficult to use in deployed services. We recommend configuring private IP access for services and using `?ssl=1`, and using a public IP with the full set of certificates and keys for administrative access. In the unusual case that the database server presents a certificate for its hostname with a validation chain leading to a trusted root CA, use `?ssl=authorized`.
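
For illustration, the two forms look like this (hostnames, credentials, and paths are placeholders):

```
# end-to-end encryption only, suitable over a private IP
postgresql://someuser:[email protected]/taskcluster?ssl=1

# full certificate validation, suitable for administrative access over a public IP
postgresql://someuser:[email protected]/taskcluster?sslcert=/path/to/client/cert&sslkey=/path/to/client/key&sslrootcert=/path/to/server/cert
```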

Each service that requires database access has Helm properties `<service>.read_db_url` and `<service>.write_db_url`. Set these values to include the corresponding database users' credentials. In a non-replicated scenario, set each service's read and write URLs to the same value. For example:

```yaml
queue:
  read_db_url: postgresql://prod_taskcluster_queue:[email protected]/taskcluster_prod?ssl=1
  write_db_url: postgresql://prod_taskcluster_queue:[email protected]/taskcluster_prod?ssl=1
  db_crypto_keys: [] # (optional) defined in "Supporting Encrypted Columns"
```

Note that the admin user's credentials should not appear anywhere in the Helm configuration!

# Supporting Encrypted Columns

If a service has data that it wants encrypted at rest, it requires keys in order to do so. The `db_crypto_keys` key of a service's configuration is a JSON array where each element is an object of the form:

```json
{
  "id": "a unique identifier",
  "algo": "aes-256",
  "key": "32 bytes of base64 string"
}
```

The last element of the array is the current key, and all encryption is performed with it. The other keys are used to decrypt values that were encrypted with older keys; this allows you to rotate your encryption keys. At the current time, Taskcluster implicitly treats `azure_crypto_key` as the oldest element of this list, with id `azure`, to maintain compatibility with configurations from older table formats.
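
For example, a hypothetical configuration mid-rotation might look like the following, with the newest key last (the ids shown are arbitrary):

```yaml
queue:
  db_crypto_keys:
    - id: 'key-2021-05'    # older key, used only to decrypt existing values
      algo: 'aes-256'
      key: '<32 bytes of base64>'
    - id: 'key-2021-11'    # current key, used for all new encryption
      algo: 'aes-256'
      key: '<32 bytes of base64>'
```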

An easy way to create a random key, using a CSPRNG, is:

```sh
$ echo "console.log(require('crypto').randomBytes(32).toString('base64'));" | node
```

# Database Versions

Taskcluster is designed to modify its database structure as it is upgraded, in a fashion that often requires little or no downtime. This is accomplished by tracking the current version of the database with an integer "database version", and upgrading from one version to the next using the process described below. A single Taskcluster release may contain many database versions, corresponding to the changes made in that release. A list of the database versions with brief descriptions can be found in the source code.

Taskcluster maintains the invariant that older code is compatible with newer databases, so an existing services deployment will continue to function after a database upgrade is committed. This compatibility is limited to major Taskcluster versions: code in version vN.x.y will function against database versions released in v(N+1).x.y.

# Upgrades

In order to conform to the invariant given above, a database upgrade must be completed before the corresponding code is deployed.

Before running an upgrade, we recommend:

* take a backup of the database
* suspend any crons and kill any running jobs (otherwise, these can hold locks for a long time)

To run an upgrade, run a Docker container using the new Taskcluster image. Set `ADMIN_DB_URL` and `USERNAME_PREFIX` to the values described above and run:

```sh
docker run -ti --rm -e ADMIN_DB_URL -e USERNAME_PREFIX taskcluster/taskcluster:1.2.3 script/db:upgrade
```

The output will describe the changes taking place.

During the upgrade, monitor `select * from pg_stat_activity`, watching for anything that might be preventing the upgrade activities from proceeding, especially locks (`wait_event_type = Lock`). Everything in Taskcluster is designed for failure, so it's OK to kill anything long-running that might be holding a lock the migration process needs in order to proceed.
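
For example, a query along these lines will surface sessions stuck waiting on locks, and `pg_terminate_backend` can then kill an offending backend by its pid:

```sql
-- show sessions currently waiting on a lock
SELECT pid, usename, state, wait_event, query
  FROM pg_stat_activity
 WHERE wait_event_type = 'Lock';

-- terminate a blocking backend, using a pid from the query above
SELECT pg_terminate_backend(<pid>);
```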

As with any upgrade, consult the release notes before running the upgrade.

# Downgrades

In general, it is safe to downgrade the Taskcluster services without downgrading the database. By design, older versions of the services can run against newer versions of the database.

However, in the specific circumstance that a database upgrade has itself caused problems -- for example, if a new stored function is failing -- then a database downgrade can be performed.

Database downgrades must be performed *after* the corresponding code is deployed. Database downgrades can cause data loss -- for example, downgrading a version that created a table will drop that table. Database versions are unrelated to Taskcluster release versions. Consult the Taskcluster source or development team to find the appropriate database version.

To perform a database downgrade, determine the Taskcluster version `<tcversion>` that defined the currently-deployed database version. This is the Taskcluster release version that was deployed and caused the error requiring a rollback. Then, determine the target database version. This should be the database version `<dbversion>` defined by the Taskcluster release to which you have (already!) rolled back. Then, run:

```sh
docker run -ti --rm -e ADMIN_DB_URL -e USERNAME_PREFIX taskcluster/taskcluster:<tcversion> script/db:downgrade <dbversion>
```
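
As a purely hypothetical example: if the problematic deploy was Taskcluster v44.0.0, and you have already rolled the services back to v43.1.0, which defines database version 73, the downgrade command would be:

```sh
docker run -ti --rm -e ADMIN_DB_URL -e USERNAME_PREFIX taskcluster/taskcluster:44.0.0 script/db:downgrade 73
```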