Spike data export for restoring prod db to local #2469
base: master
Conversation
force-pushed from 28dc960 to e1a07cf
force-pushed from e1a07cf to 8a729cd
OK. In the grand scheme of things I like this idea/approach, and I think if we can get it working it is going to be a big productivity boost. I am going to leave a lot of comments on the diff (many of them quite minor, but a lot of them), so I want to lead with the highest priority thing: I did try to run it locally a few times, but I haven't managed to see it actually working properly end-to-end. I'll leave the details in comments inline on the diff.
```python
ssm = boto3.client("ssm")
s3 = boto3.client("s3", region_name="eu-west-1")
bucket_name = "dc-ynr-short-term-backups"
```
Presumably this S3 bucket is just manually provisioned. If we want to roll this out to another project, we just manually make another bucket in the relevant AWS account?

Also, I'd probably make `bucket_name` `BUCKET_NAME` here.
Yeah. This bucket already existed so I just used that. I didn't include it in SAM for that reason, but there's no particular reason we can't set up a bucket in SAM.

I guess we can also have a single DC-wide bucket for this: the user doesn't need bucket write permissions because of the pre-signed URL, so there's no problem with e.g. limiting a user to only getting backups from one project but not others.
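For anyone reading along, a minimal sketch of the pre-signed URL idea using the AWS CLI (the object key here is made up):

```bash
# Generate a time-limited download URL for one object in the bucket.
# Whoever holds the URL can fetch just that object, with no AWS
# credentials and no bucket-level read or write permissions.
aws s3 presign "s3://dc-ynr-short-term-backups/ynr-export.dump" --expires-in 3600
```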
```yaml
Outputs:
  DataExportFunction:
    Description: Hello World Lambda Function ARN
    Value: !GetAtt DataExportFunction.Arn
  DataExportFunctionIamRole:
    Description: Implicit IAM Role created for Hello World function
    Value: !GetAtt DataExportFunctionRole.Arn
```
Oh. Hi there, world 👋
Can we replace with sensible descriptions?
this one is still relevant
`scripts/get-prod-db.sh` (outdated)
echo "Dropping DB $(LOCAL_DB_NAME)" | ||
dropdb --if-exists "$LOCAL_DB_NAME" | ||
echo "Creating DB $(LOCAL_DB_NAME)" | ||
createdb "$LOCAL_DB_NAME" |
If I try to run this locally (using `AWS_PROFILE=sym-roe scripts/get-prod-db.sh`), I get

```
+ dropdb --if-exists ynr-prod
dropdb: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: role "chris" does not exist
```

Normally I would `sudo -u postgres dropdb <dbname>` or `sudo -u postgres createdb <dbname>` in order to drop or create a DB. Do you have your local user set up as a postgres admin, or do you run the whole script as `postgres`? I wonder if it would be useful to allow the user to pass an option to use sudo here?
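Something like this sketch is what I mean, where `USE_SUDO` is a name I've just made up:

```bash
# If USE_SUDO=1, run the postgres admin commands as the postgres user;
# otherwise PG_RUNNER stays empty and behaviour is unchanged.
PG_RUNNER=""
if [ "${USE_SUDO:-0}" = "1" ]; then
  PG_RUNNER="sudo -u postgres"
fi

$PG_RUNNER dropdb --if-exists "$LOCAL_DB_NAME"  # unquoted on purpose, so it word-splits
$PG_RUNNER createdb "$LOCAL_DB_NAME"
```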
I reckon this should be sorted with 51b3570
```python
SOURCE_DATABASE = "ynr"
TMP_DATABASE_NAME = "ynr-for-dev-export"
DB_HOST = get_parameter("/ynr/production/POSTGRES_HOST")
DB_USER = get_parameter("/ynr/production/POSTGRES_USERNAME")
DB_PASSWORD = get_parameter("/ynr/production/POSTGRES_PASSWORD")
```
I don't want to get too into prematurely trying to generalise. To start with, let's just copy and paste code to the 3 repos where we need it.

That said...

There are parts of this that could potentially be a shared library, and parts that are application specific. For example, this block here, the scrubbing rules, etc. will be different for every project.

Let's not get too bogged down in it at this point, but having implemented it in a few places it might be worth having a think about which bits ended up the same/different and whether it is worth trying to generalise any of it.
Actually, I think the parameters could be generic. There's no reason we can't enforce some conventions, especially when we have one application per account. These settings are only at this path because we're still in a shared AWS account.
I have not had a chance to try this locally again, but I've had a really quick scan over the diff/comments and the direction this is going in seems sensible. The one thing you missed after you dropped off the call earlier was: @awdem (running WSL) may also have some need to pass specific flags when running the scripts. Probably one to follow up with @awdem next week.
I think the solution here is to use a database connection URL rather than a name. The URL is an existing protocol for specifying all the connection parameters needed, and postgres's CLI tooling like `psql` and `pg_restore` can accept one directly. Using a URL also opens the door, in theory, to driving everything off a single `DATABASE_URL` environment variable. This doesn't work cleanly with multi-DB apps, but even with them we have a "principal" database that we could use that variable for.
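For reference, a sketch of what I mean; the credentials, host, and dump file name here are placeholders:

```bash
# One string carries user, password, host, port, and database name,
# following the libpq connection URI syntax.
DATABASE_URL="postgresql://myuser:mypass@localhost:5432/ynr"

# psql and pg_restore both accept the URI directly:
psql "$DATABASE_URL" -c "SELECT 1"
pg_restore --dbname="$DATABASE_URL" ynr-export.dump
```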
```bash
# Check if DATABASE_URL is set after all attempts
if [ -z "$DATABASE_URL" ]; then
  echo "Error: DATABASE_URL is not provided."
  echo "please the environment variable DATABASE_URL or pass it in as an argument"
```
`please the environment variable` -> `please set the environment variable`
echo "Creating the DB if it doesn't exist." | ||
createdb $DB_NAME >/dev/null 2>&1 || true | ||
|
||
# Check that we can connect to the local DB before returning |
Is this only intended for local DBs? If so, should we validate that in the URL?

If it's not only for local DBs, what happens if you call `createdb` on a remote URL?
I'm not sure how we'd validate it in the local case, because I don't know if `localhost` works on (e.g.) WSL.
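If we did want a best-effort guard, a crude allow-list along these lines might be enough, though as noted it may not cover every WSL setup:

```bash
# Crude check: only proceed if the URL mentions an obviously-local host.
case "$DATABASE_URL" in
  *localhost*|*127.0.0.1*)
    ;;  # looks local, carry on
  *)
    echo "Refusing to run against a non-local DATABASE_URL"
    exit 1
    ;;
esac
```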
```bash
esac

# Check if DATABASE_URL is set after all attempts
if [ -z "$DATABASE_URL" ]; then
```
This script exhibits different behaviour when run standalone vs when called by `get-prod-db.sh`.

If I run `scripts/check-database-url.sh` directly, this block runs and outputs a help message. If I run `scripts/get-prod-db.sh`, this just fails with

```
scripts/get-prod-db.sh: 33: ./scripts/check-database-url.sh: DATABASE_URL: parameter not set
```

I think the reason for this is that we've set `set -euxo` in `get-prod-db.sh` but not in this script. So when it is called from `get-prod-db.sh` those settings are inherited, whereas in standalone mode it just tries to plough on.
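A minimal repro of what I think is going on, assuming the child script is sourced rather than executed (file names are illustrative):

```bash
# parent.sh
set -u          # referencing an unset variable is now fatal
. ./child.sh    # sourcing runs child.sh in this shell, so it inherits -u

# child.sh
# Standalone: $DATABASE_URL expands to empty and the help text prints.
# Sourced from parent.sh: the shell aborts on the unset variable before
# the test even runs.
if [ -z "$DATABASE_URL" ]; then
  echo "please set the environment variable DATABASE_URL"
fi
```

Using `${DATABASE_URL:-}` in the test would make both modes behave the same, since the `:-` default expansion is allowed under `set -u`.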
```bash
# Check that we can connect to the local DB before returning
psql $DATABASE_URL -c "\q"
if [ $? -ne 0 ]; then
```
Here's another example. This probably does something standalone, but when invoked from `get-prod-db.sh` it doesn't do anything, because we have already exited the script on the previous line.
```bash
if [ -z "$DATABASE_URL" ]; then
  echo "Error: DATABASE_URL is not provided."
  echo "please the environment variable DATABASE_URL or pass it in as an argument"
  echo "The format must comply with \033[4mhttps://www.postgresql.org/docs/$REQUIRED_POSTGRES_VERSION/libpq-connect.html#LIBPQ-CONNSTRING-URIS\033[0m"
```
It looks like this is supposed to be doing nice formatting, but for me it just literally printed the characters `\033[4m` and `\033[0m` to my terminal.
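Assuming the underlining is worth keeping, `printf` is probably the safer bet here, since plain `echo` only interprets backslash escapes in some shells:

```bash
# printf interprets \033 (the escape character) portably; echo needs -e
# in bash and behaves differently again under some /bin/sh implementations.
printf 'The format must comply with \033[4m%s\033[0m\n' \
  "https://www.postgresql.org/docs/$REQUIRED_POSTGRES_VERSION/libpq-connect.html#LIBPQ-CONNSTRING-URIS"
```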
echo "Dropping DB $(_SCRIPT_DATABASE_URL)" | ||
dropdb --if-exists "$_SCRIPT_DATABASE_URL" | ||
echo "Creating DB $(_SCRIPT_DATABASE_URL)" | ||
createdb "$_SCRIPT_DATABASE_URL" |
I was not able to get this to work at all. As far as I can see, `dropdb` and `createdb` don't accept a `postgresql://` connection string in the same way that `psql` and `pg_restore` can. I think we would have to parse out the user/pass/host/port to use `dropdb` and `createdb`.
Final comment: I ran https://www.shellcheck.net/ on your files (you can also install shellcheck and run it locally).
What is this all about?
This is an idea for a pattern for getting prod databases onto local dev machines without having to download personal data.

It uses a Docker container run on Lambda to make a new copy of the database on the production server and then removes PII from that copy. It then dumps the scrubbed data to an S3 bucket (already set to expire objects), and a local script downloads and restores the dump into a local database.

If we think this is a reasonable shape, then we should be able to apply the same pattern to other projects.
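To make the shape concrete, the local half of the flow is roughly the following sketch (the function name comes from the SAM template above; the response field and file names are made up):

```bash
# 1. Invoke the Lambda, which clones the prod DB, scrubs PII, dumps it to
#    S3, and returns a short-lived pre-signed download URL.
aws lambda invoke --function-name DataExportFunction response.json
PRESIGNED_URL=$(jq -r '.url' response.json)   # field name is hypothetical

# 2. Download the scrubbed dump; no bucket permissions needed.
curl -o ynr-export.dump "$PRESIGNED_URL"

# 3. Restore it into the local database.
pg_restore --dbname="$DATABASE_URL" --no-owner ynr-export.dump
```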