
Spike data export for restoring prod db to local #2469

Open · wants to merge 10 commits into master
Conversation

@symroe (Member) commented Nov 12, 2024

What is this all about?

This is an idea for a pattern of getting prod databases onto local dev machines without having to download personal data.

It uses a Docker container running on Lambda to make a new database on the production server and then remove PII from that database. It then dumps the data to an S3 bucket (already set to expire objects), and a local script downloads that dump and restores it into a local database.
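As a rough sketch of what this looks like from the developer's side (the function name is from the SAM template below; the response field and dump filename are illustrative assumptions, not fixed by this PR):

```sh
# Hypothetical end-to-end flow, assuming the Lambda's response payload
# includes a pre-signed download URL (field name is an assumption).
aws lambda invoke --function-name DataExportFunction response.json
URL=$(jq -r '.url' response.json)
curl -o ynr-export.dump "$URL"
pg_restore --dbname "$LOCAL_DB_NAME" --no-owner ynr-export.dump
```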

If we think this is a reasonable shape for the project, then we should be able to apply it to other projects.

@chris48s (Member)

OK. In the grand scheme of things I like this idea/approach. I think if we can get it working it is going to be a big productivity boost.

Also, I am going to leave a lot of comments on the diff (many of them quite minor, but a lot of them).

I want to lead with the highest priority thing: I did try to run it locally a few times, but I haven't managed to see it actually working properly end-to-end. When I ran it, this happened:

[Screenshot at 2024-11-13 14-20-44]

The CREATE DATABASE {} TEMPLATE {} needs the source DB to not have any connections in order to work. I think we're going to need to find a different approach to this. Maybe we can do a pg_dump --schema-only to get the structure? I'm surprised you didn't run into this.
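For illustration, a minimal sketch of that alternative (untested; database names taken from this PR, connection flags/env omitted):

```sh
# Build the scratch DB from a schema-only dump instead of
# CREATE DATABASE ... TEMPLATE ..., which fails while the source DB
# has open connections.
createdb "ynr-for-dev-export"
pg_dump --schema-only "ynr" | psql "ynr-for-dev-export"
# The data would then still need copying, e.g. pg_dump --data-only
# per table, skipping or scrubbing the PII columns.
```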

I'll leave some more comments inline on the diff.


ssm = boto3.client("ssm")
s3 = boto3.client("s3", region_name="eu-west-1")
bucket_name = "dc-ynr-short-term-backups"
Member

Presumably this S3 bucket is just manually provisioned.
If we want to roll this out to another project, we just manually make another bucket in the relevant AWS account?

Also I'd probably make bucket_name BUCKET_NAME here.

Member Author

Yeah. This bucket already existed so I just used that. I didn't include it in SAM for that reason, but there's no particular reason we can't set up a bucket in SAM.

I guess we could also have a single DC-wide bucket for this: the user doesn't need bucket write permissions because of the pre-signed URL, so there's no problem with, e.g., limiting a user to getting backups from one project but not others.
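For reference, a pre-signed GET URL can be minted without granting the user any S3 permissions at all; a minimal sketch using the AWS CLI equivalent of what the Lambda would do via boto3 (the object key is illustrative):

```sh
# Generate a time-limited download URL for a dump in the existing bucket.
aws s3 presign "s3://dc-ynr-short-term-backups/ynr-export.dump" --expires-in 3600
```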

Comment on lines +43 to +49
Outputs:
DataExportFunction:
Description: Hello World Lambda Function ARN
Value: !GetAtt DataExportFunction.Arn
DataExportFunctionIamRole:
Description: Implicit IAM Role created for Hello World function
Value: !GetAtt DataExportFunctionRole.Arn
Member

Oh. Hi there, world 👋
Can we replace with sensible descriptions?

Member

This one is still relevant.

Comment on lines 51 to 54
echo "Dropping DB $(LOCAL_DB_NAME)"
dropdb --if-exists "$LOCAL_DB_NAME"
echo "Creating DB $(LOCAL_DB_NAME)"
createdb "$LOCAL_DB_NAME"
Member

If I try to run this locally (using AWS_PROFILE=sym-roe scripts/get-prod-db.sh), I get

+ dropdb --if-exists ynr-prod
dropdb: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  role "chris" does not exist

Normally, I would sudo -u postgres dropdb <dbname> or sudo -u postgres createdb <dbname> in order to drop or create a DB. Do you have your local user set up as a Postgres admin, or do you run the whole script as postgres? I wonder if it would be useful to allow the user to pass an option to use sudo here?
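One possible shape for that option, as a sketch (the PG_SUDO variable is hypothetical, not part of this PR):

```sh
# Let the caller opt in to running DB commands as the postgres system
# user, e.g. PG_SUDO="sudo -u postgres" scripts/get-prod-db.sh
PG_SUDO="${PG_SUDO:-}"
# $PG_SUDO is deliberately unquoted so an empty value disappears.
$PG_SUDO dropdb --if-exists "$LOCAL_DB_NAME"
$PG_SUDO createdb "$LOCAL_DB_NAME"
```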

Member Author

I reckon this should be sorted with 51b3570.

Comment on lines +22 to +26
SOURCE_DATABASE = "ynr"
TMP_DATABASE_NAME = "ynr-for-dev-export"
DB_HOST = get_parameter("/ynr/production/POSTGRES_HOST")
DB_USER = get_parameter("/ynr/production/POSTGRES_USERNAME")
DB_PASSWORD = get_parameter("/ynr/production/POSTGRES_PASSWORD")
Member

I don't want to get too far into prematurely trying to generalise. To start with, let's just copy and paste code to the 3 repos where we need it.

That said...
There are parts of this that could potentially be a shared library, and parts that are application specific.

For example, this block here, the scrubbing rules, etc. will be different for every project.

I think let's not get too bogged down in it at this point, but having implemented it in a few places, it might be worth having a think about which bits ended up the same/different and whether it is worth trying to generalise any of it.

Member Author

Actually, I think the parameters could be generic. There's no reason we can't enforce some conventions, especially when we have one application per account. These settings are only at this path because we're still in a shared AWS account.

@chris48s (Member)

I have not had a chance to try this locally again, but I've had a really quick scan over the diff/comments and the direction this is going in seems sensible.

The one thing you missed after you dropped off the call earlier was: @awdem (running WSL) may also need to pass specific flags when running createdb and dropdb. I think they have to specify the host, so they have to run createdb --host 127.0.0.1 <dbname> rather than just createdb <dbname>.

Probably one to follow up with @awdem next week.

@symroe (Member Author) commented Nov 15, 2024

> I have not had a chance to try this locally again, but I've had a really quick scan over the diff/comments and the direction this is going in seems sensible.
>
> The one thing you missed after you dropped off the call earlier was: @awdem (running WSL) may also need to pass specific flags when running createdb and dropdb. I think they have to specify the host, so they have to run createdb --host 127.0.0.1 <dbname> rather than just createdb <dbname>.
>
> Probably one to follow up with @awdem next week.

I think the solution here is to use a database connection URL rather than a name. The URL is an existing protocol for specifying all the connection parameters needed, and Postgres's CLI tooling like psql and createdb already supports it.

Using a URL also opens the door, in theory, to using dj-database-url in the application, with a single DATABASE_URL environment variable.

This doesn't work cleanly with multi-DB apps, but even with them we have a "principal" database that we could use that variable for.
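A minimal sketch of what that looks like in practice (the URL value and dump filename are examples):

```sh
export DATABASE_URL="postgresql://postgres:postgres@127.0.0.1:5432/ynr-prod"
# libpq-style tools accept the URL directly in place of a database name.
psql "$DATABASE_URL" -c '\conninfo'
pg_restore --dbname="$DATABASE_URL" --no-owner ynr-export.dump
```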

# Check if DATABASE_URL is set after all attempts
if [ -z "$DATABASE_URL" ]; then
echo "Error: DATABASE_URL is not provided."
echo "please the environment variable DATABASE_URL or pass it in as an argument"
Contributor

please the environment variable -> please set the environment variable

echo "Creating the DB if it doesn't exist."
createdb $DB_NAME >/dev/null 2>&1 || true

# Check that we can connect to the local DB before returning
Contributor

Is this only intended for local DBs? If so, should we validate that in the URL?
If it's not only for local DBs, what happens if you call createdb on a remote URL?

Contributor

I'm not sure how we'd validate it in the local case, because I don't know if localhost works on (e.g.) WSL.

esac

# Check if DATABASE_URL is set after all attempts
if [ -z "$DATABASE_URL" ]; then
Member

This script exhibits different behaviour when run standalone vs when called by get-prod-db.sh.

If I run scripts/check-database-url.sh this block runs and outputs a help message.
If I run scripts/get-prod-db.sh this just fails with

scripts/get-prod-db.sh: 33: ./scripts/check-database-url.sh: DATABASE_URL: parameter not set

I think the reason for this is that we've set set -euxo in get-prod-db.sh but not in this script. So when it is called from get-prod-db.sh those settings are inherited, whereas in standalone mode it just tries to plough on.
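A sketch of one way to make the standalone behaviour match (an assumed fix, not from this PR):

```sh
# Give check-database-url.sh its own strict mode, and use a default
# expansion so the emptiness check still works under `set -u`.
set -euo pipefail
if [ -z "${DATABASE_URL:-}" ]; then
    echo "Error: DATABASE_URL is not provided." >&2
    exit 1
fi
```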


# Check that we can connect to the local DB before returning
psql $DATABASE_URL -c "\q"
if [ $? -ne 0 ]; then
@chris48s (Member) commented Nov 27, 2024

Here's another example. This probably does something standalone, but when invoked from get-prod-db.sh it doesn't do anything because we have already exited the script on the previous line.
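A sketch of a check that behaves the same with or without set -e:

```sh
# Branch on psql's exit status directly rather than testing $?, which
# is never reached under `set -e` when psql fails.
if ! psql "$DATABASE_URL" -c '\q'; then
    echo "Error: could not connect to the local database." >&2
    exit 1
fi
```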

if [ -z "$DATABASE_URL" ]; then
echo "Error: DATABASE_URL is not provided."
echo "please the environment variable DATABASE_URL or pass it in as an argument"
echo "The format must comply with \033[4mhttps://www.postgresql.org/docs/$REQUIRED_POSTGRES_VERSION/libpq-connect.html#LIBPQ-CONNSTRING-URIS\033[0m"
Member

It looks like this is supposed to be doing nice formatting but for me this just literally printed the characters \033[4m and \033[0m to my terminal.
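For example, printf interprets the ANSI escapes that a plain echo prints literally in some shells:

```sh
# printf (unlike bare echo in some shells) interprets \033[4m/\033[0m,
# underlining the URL on ANSI terminals.
printf 'The format must comply with \033[4m%s\033[0m\n' \
    "https://www.postgresql.org/docs/$REQUIRED_POSTGRES_VERSION/libpq-connect.html#LIBPQ-CONNSTRING-URIS"
```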

Comment on lines +65 to +68
echo "Dropping DB $(_SCRIPT_DATABASE_URL)"
dropdb --if-exists "$_SCRIPT_DATABASE_URL"
echo "Creating DB $(_SCRIPT_DATABASE_URL)"
createdb "$_SCRIPT_DATABASE_URL"
Member

I was not able to get this to work at all. As far as I can see, dropdb and createdb don't work with a postgresql:// connection string in the same way that psql and pg_restore can. I think we would have to parse out the user/pass/host/port to use dropdb and createdb.
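One possible workaround, sketched under the assumption that the client tools are recent enough for --maintenance-db to accept a connection string (otherwise the URL does need fully parsing out):

```sh
# Pass only the bare database name to dropdb/createdb, and point them
# at a maintenance DB via a connection string. The parsing here is
# illustrative and assumes postgresql://user:pass@host:port/dbname.
DB_NAME="${DATABASE_URL##*/}"
MAINT_URL="${DATABASE_URL%/*}/postgres"
dropdb --if-exists --maintenance-db="$MAINT_URL" "$DB_NAME"
createdb --maintenance-db="$MAINT_URL" "$DB_NAME"
```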

@chris48s (Member)

Final comment. I ran https://www.shellcheck.net/ on your files (you can install this locally with apt install shellcheck) and it flagged quite a few issues. Some of them are belt-and-braces things, but some of them may be legit bugs. I find it is a very useful check when writing non-trivial shell scripts like this. If we go down the route of writing a lot of shell scripts, it might be a useful thing to run in CI.
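Running it over the scripts in this PR would look something like (paths taken from the diff):

```sh
# Lint the shell scripts added in this PR.
shellcheck scripts/get-prod-db.sh scripts/check-database-url.sh
```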
