
Spike data export for restoring prod db to local #2469

Open · wants to merge 10 commits into master
Conversation

@symroe (Member) commented Nov 12, 2024

What is this all about?

This is an idea for a pattern of getting prod databases onto local dev machines without having to download personal data.

It uses a Docker container running on Lambda to make a new database on the production server and then remove PII from that database. It then dumps the data to an S3 bucket (already set to expire objects), and a local script downloads that dump and restores it into a local database.
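As a rough sketch of what this looks like from the developer's side (the function name is from the SAM template below; the response field and dump filename are illustrative assumptions, not fixed by this PR):

```sh
# Hypothetical end-to-end flow, assuming the Lambda's response payload
# includes a pre-signed download URL (field name is an assumption).
aws lambda invoke --function-name DataExportFunction response.json
URL=$(jq -r '.url' response.json)
curl -o ynr-export.dump "$URL"
pg_restore --dbname "$LOCAL_DB_NAME" --no-owner ynr-export.dump
```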

If we think this is a reasonable shape for the project, then we should be able to apply it to other projects.

@chris48s (Member)

OK. In the grand scheme of things I like this idea/approach. I think if we can get it working it is going to be a big productivity boost.

Also, I am going to leave a lot of comments on the diff (many of them quite minor, but a lot of them).

I want to lead with the highest priority thing: I did try to run it locally a few times, but I haven't managed to see it actually working properly end-to-end. When I ran it, this happened:

[Screenshot at 2024-11-13 14-20-44]

The CREATE DATABASE {} TEMPLATE {} needs the source DB to not have any connections in order to work. I think we're going to need to find a different approach to this. Maybe we can do a pg_dump --schema-only to get the structure? I'm surprised you didn't run into this.
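For illustration, a minimal sketch of that alternative (untested; database names taken from this PR, connection flags/env omitted):

```sh
# Build the scratch DB from a schema-only dump instead of
# CREATE DATABASE ... TEMPLATE ..., which fails while the source DB
# has open connections.
createdb "ynr-for-dev-export"
pg_dump --schema-only "ynr" | psql "ynr-for-dev-export"
# The data would then still need copying, e.g. pg_dump --data-only
# per table, skipping or scrubbing the PII columns.
```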

I'll leave some more comments inline on the diff.


ssm = boto3.client("ssm")
s3 = boto3.client("s3", region_name="eu-west-1")
bucket_name = "dc-ynr-short-term-backups"
Member

Presumably this S3 bucket is just manually provisioned.
If we want to roll this out to another project, we just manually make another bucket in the relevant AWS account?

Also I'd probably make bucket_name BUCKET_NAME here.

Member Author

Yeah. This bucket already existed so I just used that. I didn't include it in SAM for that reason, but there's no particular reason we can't set up a bucket in SAM.

I guess we could also have a single DC-wide bucket for this: the user doesn't need bucket write permissions because of the pre-signed URL, so there's no problem with, e.g., limiting a user to getting backups from one project but not others.
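For reference, a pre-signed GET URL can be minted without granting the user any S3 permissions at all; a minimal sketch using the AWS CLI equivalent of what the Lambda would do via boto3 (the object key is illustrative):

```sh
# Generate a time-limited download URL for a dump in the existing bucket.
aws s3 presign "s3://dc-ynr-short-term-backups/ynr-export.dump" --expires-in 3600
```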

Comment on lines +43 to +49
Outputs:
DataExportFunction:
Description: Hello World Lambda Function ARN
Value: !GetAtt DataExportFunction.Arn
DataExportFunctionIamRole:
Description: Implicit IAM Role created for Hello World function
Value: !GetAtt DataExportFunctionRole.Arn
Member

Oh. Hi there, world 👋
Can we replace with sensible descriptions?

Member

This one is still relevant.

Comment on lines 51 to 54
echo "Dropping DB $(LOCAL_DB_NAME)"
dropdb --if-exists "$LOCAL_DB_NAME"
echo "Creating DB $(LOCAL_DB_NAME)"
createdb "$LOCAL_DB_NAME"
Member

If I try to run this locally (using AWS_PROFILE=sym-roe scripts/get-prod-db.sh), I get

+ dropdb --if-exists ynr-prod
dropdb: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  role "chris" does not exist

Normally, I would sudo -u postgres dropdb <dbname> or sudo -u postgres createdb <dbname> in order to drop or create a DB. Do you have your local user set up as a Postgres admin, or do you run the whole script as postgres? I wonder if it would be useful to allow the user to pass an option to use sudo here?
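One possible shape for that option, as a sketch (the PG_SUDO variable is hypothetical, not part of this PR):

```sh
# Let the caller opt in to running DB commands as the postgres system
# user, e.g. PG_SUDO="sudo -u postgres" scripts/get-prod-db.sh
PG_SUDO="${PG_SUDO:-}"
# $PG_SUDO is deliberately unquoted so an empty value disappears.
$PG_SUDO dropdb --if-exists "$LOCAL_DB_NAME"
$PG_SUDO createdb "$LOCAL_DB_NAME"
```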

Member Author

I reckon this should be sorted with 51b3570.

Comment on lines +22 to +26
SOURCE_DATABASE = "ynr"
TMP_DATABASE_NAME = "ynr-for-dev-export"
DB_HOST = get_parameter("/ynr/production/POSTGRES_HOST")
DB_USER = get_parameter("/ynr/production/POSTGRES_USERNAME")
DB_PASSWORD = get_parameter("/ynr/production/POSTGRES_PASSWORD")
Member

I don't want to get too far into prematurely trying to generalise. To start with, let's just copy and paste code to the 3 repos where we need it.

That said...
There are parts of this that could potentially be a shared library, and parts that are application specific.

For example, this block here, the scrubbing rules, etc. will be different for every project.

I think let's not get too bogged down in it at this point, but having implemented it in a few places, it might be worth having a think about which bits ended up the same/different and whether it is worth trying to generalise any of it.

Member Author

Actually, I think the parameters could be generic. There's no reason we can't enforce some conventions, especially when we have one application per account. These settings are only at this path because we're still in a shared AWS account.

@chris48s (Member)

I have not had a chance to try this locally again, but I've had a really quick scan over the diff/comments and the direction this is going in seems sensible.

The one thing you missed after you dropped off the call earlier was: @awdem (running WSL) may also need to pass specific flags when running createdb and dropdb. I think they have to specify the host, so they have to run createdb --host 127.0.0.1 <dbname> rather than just createdb <dbname>.

Probably one to follow up with @awdem next week.

@symroe (Member Author) commented Nov 15, 2024

> I have not had a chance to try this locally again, but I've had a really quick scan over the diff/comments and the direction this is going in seems sensible.
>
> The one thing you missed after you dropped off the call earlier was: @awdem (running WSL) may also need to pass specific flags when running createdb and dropdb. I think they have to specify the host, so they have to run createdb --host 127.0.0.1 <dbname> rather than just createdb <dbname>.
>
> Probably one to follow up with @awdem next week.

I think the solution here is to use a database connection URL rather than a name. The URL is an existing protocol for specifying all the connection parameters needed, and Postgres's CLI tooling like psql and createdb already supports it.

Using a URL also opens the door, in theory, to using dj-database-url in the application, with a single DATABASE_URL environment variable.

This doesn't work cleanly with multi-DB apps, but even with them we have a "principal" database that we could use that variable for.
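A minimal sketch of what that looks like in practice (the URL value and dump filename are examples):

```sh
export DATABASE_URL="postgresql://postgres:postgres@127.0.0.1:5432/ynr-prod"
# libpq-style tools accept the URL directly in place of a database name.
psql "$DATABASE_URL" -c '\conninfo'
pg_restore --dbname="$DATABASE_URL" --no-owner ynr-export.dump
```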

# Check if DATABASE_URL is set after all attempts
if [ -z "$DATABASE_URL" ]; then
echo "Error: DATABASE_URL is not provided."
echo "please the environment variable DATABASE_URL or pass it in as an argument"
Contributor

please the environment variable -> please set the environment variable

echo "Creating the DB if it doesn't exist."
createdb $DB_NAME >/dev/null 2>&1 || true

# Check that we can connect to the local DB before returning
Contributor

Is this only intended for local DBs? If so, should we validate that in the URL?
If it's not only for local DBs, what happens if you call createdb on a remote URL?

Contributor

I'm not sure how we'd validate it in the local case, because I don't know if localhost works on (e.g.) WSL.

esac

# Check if DATABASE_URL is set after all attempts
if [ -z "$DATABASE_URL" ]; then
Member

This script exhibits different behaviour when run standalone vs when called by get-prod-db.sh.

If I run scripts/check-database-url.sh this block runs and outputs a help message.
If I run scripts/get-prod-db.sh this just fails with

scripts/get-prod-db.sh: 33: ./scripts/check-database-url.sh: DATABASE_URL: parameter not set

I think the reason for this is that we've set set -euxo in get-prod-db.sh but not in this script. So when it is called from get-prod-db.sh those settings are inherited, whereas in standalone mode it just tries to plough on.
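A sketch of one way to make the standalone behaviour match (an assumed fix, not from this PR):

```sh
# Give check-database-url.sh its own strict mode, and use a default
# expansion so the emptiness check still works under `set -u`.
set -euo pipefail
if [ -z "${DATABASE_URL:-}" ]; then
    echo "Error: DATABASE_URL is not provided." >&2
    exit 1
fi
```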


# Check that we can connect to the local DB before returning
psql $DATABASE_URL -c "\q"
if [ $? -ne 0 ]; then
@chris48s (Member) commented Nov 27, 2024

Here's another example. This probably does something standalone, but when invoked from get-prod-db.sh it doesn't do anything because we have already exited the script on the previous line.
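A sketch of a check that behaves the same with or without set -e:

```sh
# Branch on psql's exit status directly rather than testing $?, which
# is never reached under `set -e` when psql fails.
if ! psql "$DATABASE_URL" -c '\q'; then
    echo "Error: could not connect to the local database." >&2
    exit 1
fi
```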

if [ -z "$DATABASE_URL" ]; then
echo "Error: DATABASE_URL is not provided."
echo "please the environment variable DATABASE_URL or pass it in as an argument"
echo "The format must comply with \033[4mhttps://www.postgresql.org/docs/$REQUIRED_POSTGRES_VERSION/libpq-connect.html#LIBPQ-CONNSTRING-URIS\033[0m"
Member

It looks like this is supposed to be doing nice formatting but for me this just literally printed the characters \033[4m and \033[0m to my terminal.
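For example, printf interprets the ANSI escapes that a plain echo prints literally in some shells:

```sh
# printf (unlike bare echo in some shells) interprets \033[4m/\033[0m,
# underlining the URL on ANSI terminals.
printf 'The format must comply with \033[4m%s\033[0m\n' \
    "https://www.postgresql.org/docs/$REQUIRED_POSTGRES_VERSION/libpq-connect.html#LIBPQ-CONNSTRING-URIS"
```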

Comment on lines +65 to +68
echo "Dropping DB $(_SCRIPT_DATABASE_URL)"
dropdb --if-exists "$_SCRIPT_DATABASE_URL"
echo "Creating DB $(_SCRIPT_DATABASE_URL)"
createdb "$_SCRIPT_DATABASE_URL"
Member

I was not able to get this to work at all. As far as I can see, dropdb and createdb don't work with a postgresql:// connection string in the same way that psql and pg_restore can. I think we would have to parse out the user/pass/host/port to use dropdb and createdb.
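One possible workaround, sketched under the assumption that the client tools are recent enough for --maintenance-db to accept a connection string (otherwise the URL does need fully parsing out):

```sh
# Pass only the bare database name to dropdb/createdb, and point them
# at a maintenance DB via a connection string. The parsing here is
# illustrative and assumes postgresql://user:pass@host:port/dbname.
DB_NAME="${DATABASE_URL##*/}"
MAINT_URL="${DATABASE_URL%/*}/postgres"
dropdb --if-exists --maintenance-db="$MAINT_URL" "$DB_NAME"
createdb --maintenance-db="$MAINT_URL" "$DB_NAME"
```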

@chris48s (Member)

Final comment. I ran https://www.shellcheck.net/ on your files (you can install this locally with apt install shellcheck) and it flagged quite a few issues. Some of them are belt-and-braces things, but some of them may be legit bugs. I find it is a very useful check when writing non-trivial shell scripts like this. If we go down the route of writing a lot of shell scripts, it might be a useful thing to run in CI.
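Running it over the scripts in this PR would look something like (paths taken from the diff):

```sh
# Lint the shell scripts added in this PR.
shellcheck scripts/get-prod-db.sh scripts/check-database-url.sh
```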
