Spike data export for restoring prod db to local #2469

Open · wants to merge 10 commits into base: master · Changes from 1 commit
1 change: 1 addition & 0 deletions deploy/data_exporter/.gitignore
@@ -0,0 +1 @@
.aws-sam
Empty file.
14 changes: 14 additions & 0 deletions deploy/data_exporter/data_export_function/Dockerfile
@@ -0,0 +1,14 @@
FROM public.ecr.aws/docker/library/ubuntu:24.04

RUN apt update && \
apt install -y postgresql-client-16 python3.12 python3-pip curl unzip && \
rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip3 install --break-system-packages -r requirements.txt
RUN pip3 install --break-system-packages awslambdaric

COPY . .

ENTRYPOINT ["python3", "-m", "awslambdaric" ]
CMD [ "app.lambda_handler" ]
178 changes: 178 additions & 0 deletions deploy/data_exporter/data_export_function/app.py
@@ -0,0 +1,178 @@
import os
import subprocess
from datetime import datetime, timedelta, timezone

import boto3
import psycopg
from psycopg import sql

ssm = boto3.client("ssm")
s3 = boto3.client("s3", region_name="eu-west-1")
bucket_name = "dc-ynr-short-term-backups"
Member:

Presumably this S3 bucket is just manually provisioned.
If we want to roll this out to another project, we just manually make another bucket in the relevant AWS account?

Also I'd probably make bucket_name BUCKET_NAME here.

Member Author:

Yeah. This bucket already existed so I just used that. I didn't include it in SAM for that reason, but there's no particular reason we can't set up a bucket in SAM.

I guess we can also have a single DC-wide bucket for this: the user doesn't need bucket write permissions because of the pre-signed URL, so there's no problem with, e.g., limiting a user to only getting backups from one project but not others.
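
For illustration only (not part of this PR): anyone holding the pre-signed URL can fetch that one object over plain HTTPS, with no AWS credentials or bucket permissions at all. A minimal sketch, with a made-up URL:

import shutil
import urllib.request

# Hypothetical pre-signed URL returned by the Lambda; it embeds a time-limited
# signature for a single GET on a single key, so it grants no other bucket access.
presigned_url = "https://dc-ynr-short-term-backups.s3.eu-west-1.amazonaws.com/ynr-export-example.dump?X-Amz-Signature=..."

with urllib.request.urlopen(presigned_url) as resp, open("ynr-export-example.dump", "wb") as out:
    shutil.copyfileobj(resp, out)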

current_time = datetime.now().isoformat()
PREFIX = "ynr-export"
FILENAME = f"{PREFIX}-{current_time.replace(':', '-')}.dump"


def get_parameter(name):
response = ssm.get_parameter(Name=name)
return response["Parameter"]["Value"]


SOURCE_DATABASE = "ynr"
TMP_DATABASE_NAME = "ynr-for-dev-export"
DB_HOST = get_parameter("/ynr/production/POSTGRES_HOST")
DB_USER = get_parameter("/ynr/production/POSTGRES_USERNAME")
DB_PASSWORD = get_parameter("/ynr/production/POSTGRES_PASSWORD")
Comment on lines +21 to +25
Member:

I don't want to get too into prematurely trying to generalise. To start with, let's just copy & paste code to the 3 repos where we need it.

That said...
There are parts of this that could potentially be a shared library, and parts that are application-specific.

For example, this block here, the scrubbing rules, etc. will be different for every project.

I think let's not get too bogged down in it at this point, but having implemented it in a few places it might be worth having a think about which bits ended up the same/different and whether it is worth trying to generalise any of it.

Member Author:

Actually I think the parameters could be generic. No reason we can't enforce some conventions, especially when we have one application per account. These settings are only at this path because we're still in a shared AWS account.
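
For example, a purely hypothetical sketch of such a convention (the APP_NAME environment variable and the generic parameter path are illustrative, not something this PR sets up):

import os

import boto3

ssm = boto3.client("ssm")
# Hypothetical convention: one application per AWS account, named via APP_NAME.
APP_NAME = os.environ.get("APP_NAME", "ynr")


def get_app_parameter(name):
    # Reads /ynr/production/POSTGRES_HOST today; the same code would read
    # /other-app/production/POSTGRES_HOST in another account.
    return ssm.get_parameter(Name=f"/{APP_NAME}/production/{name}")["Parameter"]["Value"]


DB_HOST = get_app_parameter("POSTGRES_HOST")
DB_USER = get_app_parameter("POSTGRES_USERNAME")
DB_PASSWORD = get_app_parameter("POSTGRES_PASSWORD")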

DB_PORT = "5432"
os.environ["PGPASSWORD"] = DB_PASSWORD


def get_db_conn(db_name):
conn = psycopg.connect(
dbname=db_name,
user=DB_USER,
password=DB_PASSWORD,
host=DB_HOST,
port=DB_PORT,
)
conn.autocommit = True
return conn


def create_database_from_template():
# Connect to the PostgreSQL server (usually to the 'postgres' database for administrative tasks)
conn = get_db_conn(SOURCE_DATABASE)
# Enable autocommit to run CREATE DATABASE commands
try:
with conn.cursor() as cur:
print(f"Deleting {TMP_DATABASE_NAME}")
cur.execute(
sql.SQL("DROP DATABASE IF EXISTS {};").format(
sql.Identifier(TMP_DATABASE_NAME)
)
)
with conn.cursor() as cur:
# SQL to create the new database from the template
print(f"Creating {TMP_DATABASE_NAME}")
cur.execute(
sql.SQL("CREATE DATABASE {} TEMPLATE {};").format(
sql.Identifier(TMP_DATABASE_NAME),
sql.Identifier(SOURCE_DATABASE),
)
)
print(
f"Database '{TMP_DATABASE_NAME}' created successfully from template '{SOURCE_DATABASE}'."
)
except psycopg.Error as e:
print(f"Error creating database: {e}")
finally:
conn.close()


def clean_database():
conn = get_db_conn(db_name=TMP_DATABASE_NAME)
with conn.cursor() as cur:
print("Cleaning Users table")
cur.execute(
"""UPDATE auth_user SET
email = CONCAT('anon_', id, '@example.com'),
password = md5(random()::text);
"""
)
print("Cleaning Account email table")
cur.execute(
"""UPDATE auth_user SET
email = CONCAT('anon_', id, '@example.com');
"""
)
print("Cleaning IP addresses from LoggedActions")
cur.execute(
"""UPDATE candidates_loggedaction SET
ip_address = '127.0.0.1';
"""
)
print("Cleaning API tokens")
cur.execute(
"""UPDATE authtoken_token SET
key = md5(random()::text);
"""
)
print("Cleaning sessions")
cur.execute("""TRUNCATE TABLE django_session;""")


def dump_and_export():
dump_file = "/tmp/db_dump.sql" # Temporary file for the dump
Member:

I'd suggest using tempfile.NamedTemporaryFile for this rather than hard-coding the location.
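
Roughly like this, assuming the rest of dump_and_export() stays as in this PR (a sketch, not the committed change):

import subprocess
import tempfile

def dump_and_export():
    # NamedTemporaryFile puts the dump under the Lambda's /tmp and removes it
    # when the with-block exits, instead of hard-coding /tmp/db_dump.sql.
    with tempfile.NamedTemporaryFile(suffix=".dump") as tmp:
        subprocess.run(
            ["pg_dump", "-h", DB_HOST, "-U", DB_USER, "-d", TMP_DATABASE_NAME, "-Fc", "-f", tmp.name],
            check=True,
        )
        s3.upload_file(tmp.name, bucket_name, FILENAME)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket_name, "Key": FILENAME},
        ExpiresIn=3600,  # URL expires in 1 hour
    )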

Member Author:

Happy to do so, but to be clear this is writing to the Lambda temp directory, not the local dev one. That directory is limited to the invocation of the function, so there's no way it can ever interact with anything outside of the scope of the invocation.


# Database credentials and parameters

print("Run pg_dump to create the database dump")
try:
subprocess.run(
[
"pg_dump",
"-h",
DB_HOST,
"-U",
DB_USER,
"-d",
TMP_DATABASE_NAME,
"-Fc",
"-f",
dump_file,
],
check=True,
)

print("Upload the dump to S3")
s3.upload_file(dump_file, bucket_name, FILENAME)

print("Generate a presigned URL for downloading the dump")
presigned_url = s3.generate_presigned_url(
"get_object",
Params={"Bucket": bucket_name, "Key": FILENAME},
ExpiresIn=3600, # URL expires in 1 hour
)
print("Finished")
return presigned_url

except subprocess.CalledProcessError as e:
return f"Error generating database dump: {str(e)}"


def check_for_recent_exports():
"""
If we've exported a file in the last hour, don't export another one

"""
one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=PREFIX)
if "Contents" in response:
recent_files = [
obj
for obj in response["Contents"]
if obj["LastModified"] >= one_hour_ago
]

recent_files.sort(key=lambda obj: obj["LastModified"], reverse=True)

if recent_files:
return s3.generate_presigned_url(
"get_object",
Params={"Bucket": bucket_name, "Key": recent_files[0]["Key"]},
ExpiresIn=3600, # URL expires in 1 hour
)
return None


def lambda_handler(event, context):
if recent_export := check_for_recent_exports():
return recent_export

print("Creating temp database")
create_database_from_template()
print("Cleaning temp database")
clean_database()
print("Dumping and exporting")
return dump_and_export()
2 changes: 2 additions & 0 deletions deploy/data_exporter/data_export_function/requirements.txt
@@ -0,0 +1,2 @@
boto3==1.35.56
psycopg[binary]==3.2.3
33 changes: 33 additions & 0 deletions deploy/data_exporter/samconfig.toml
@@ -0,0 +1,33 @@
# More information about the configuration file can be found here:
# https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-config.html
version = 0.1

[default.global.parameters]
stack_name = "ynr-data-exporter"

[default.build.parameters]
cached = true
parallel = true

[default.validate.parameters]
lint = true

[default.deploy.parameters]
capabilities = "CAPABILITY_IAM"
confirm_changeset = true
resolve_s3 = true
s3_prefix = "ynr-data-exporter"
region = "eu-west-2"
image_repositories = ["DataExportFunction=929325949831.dkr.ecr.eu-west-2.amazonaws.com/ynrdataexporter736bb2dc/dataexportfunctionb95e9e19repo"]

[default.package.parameters]
resolve_s3 = true

[default.sync.parameters]
watch = true

[default.local_start_api.parameters]
warm_containers = "EAGER"

[default.local_start_lambda.parameters]
warm_containers = "EAGER"
49 changes: 49 additions & 0 deletions deploy/data_exporter/template.yaml
@@ -0,0 +1,49 @@
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
data_exporter

Exports data from the prod database, cleans it and puts the resulting dump in an S3 bucket

Globals:
Function:
Timeout: 600 # 10 minutes
MemorySize: 1024

LoggingConfig:
LogFormat: JSON
Resources:
DataExportFunction:
Type: AWS::Serverless::Function
Properties:
FunctionName: ynr-data-exporter
PackageType: Image
ImageUri: data_export_function
# Needs to be at least as big as the DB export, currently at around 350mb
EphemeralStorage:
Size: 1024
# Don't allow more than one export job to run at a time
ReservedConcurrentExecutions: 1
Policies:
- Statement:
- Sid: S3Access
Effect: Allow
Action:
- s3:*
Resource:
- 'arn:aws:s3:::dc-ynr-short-term-backups'
- 'arn:aws:s3:::dc-ynr-short-term-backups/*'
- Sid: SSM
Effect: Allow
Action:
- ssm:*
Resource:
- 'arn:aws:ssm:*:*:parameter/ynr/*'

Outputs:
DataExportFunction:
Description: Hello World Lambda Function ARN
Value: !GetAtt DataExportFunction.Arn
DataExportFunctionIamRole:
Description: Implicit IAM Role created for Hello World function
Value: !GetAtt DataExportFunctionRole.Arn
Comment on lines +43 to +49
Member:

Oh. Hi there, world 👋
Can we replace with sensible descriptions?

Member:

this one is still relevant

57 changes: 57 additions & 0 deletions scripts/get-prod-db.sh
@@ -0,0 +1,57 @@
#!/bin/sh
set -eux

# This script invokes an AWS Lambda function to retrieve a URL for downloading
# a cleaned version of the production database and then restores
# that data locally. By default the db name is "ynr-prod" but you can change the
# local name by passing it as the first argument to the script.
#
# This script requires access to the YNR production AWS account
#
# Usage:
# ./scripts/get-prod-db.sh [LOCAL_DB_NAME]
#
# Arguments:
# LOCAL_DB_NAME: Optional. Name of the local database to restore data to.
# Defaults to 'ynr-prod' if not specified.

# Configurable variables
LAMBDA_FUNCTION_NAME="ynr-data-exporter"
LOCAL_DB_NAME="${1:-ynr-prod}"

# Check for required tools
REQUIRED_TOOLS="aws dropdb createdb pg_restore wget"
for tool in $REQUIRED_TOOLS; do
if ! command -v "$tool" >/dev/null 2>&1; then
echo "Error: $tool is required but not installed." >&2
exit 1
fi
done

# Create a temporary file and set up clean up on script exit
TEMP_FILE=$(mktemp)
trap 'rm -f "$TEMP_FILE"' EXIT

# Invoke AWS Lambda and store the result in the temp file
# The result is a presigned URL to the dump file on S3
echo "Invoking Lambda to get DB URL. This might take a few minutes..."
aws lambda invoke \
--function-name "$LAMBDA_FUNCTION_NAME" \
--cli-read-timeout=0 \
--no-cli-pager \
--output text \
--query 'Payload' \
"$TEMP_FILE"

# Extract the URL from the response
# This is because the response is quoted, so we just need to remove the quotation marks
URL=$(sed 's/^"\(.*\)"$/\1/' "$TEMP_FILE")
Comment on lines +42 to +52
Member @chris48s (Nov 13, 2024):

I think we should add some error handling to make sure what we got back appears to be a URL. When I ran this, I got

URL={"errorMessage": "connection failed: connection to server at \"13.41.39.44\", port 5432 failed: FATAL:  database \"ynr-for-dev-export\" does not exist", "errorType": "OperationalError", "requestId": "9c15f87a-f4ca-4081-b3f5-ecd2a97aaf7c", "stackTrace": ["  File \"/app.py\", line 176, in lambda_handler\n    clean_database()\n", "  File \"/app.py\", line 74, in clean_database\n    conn = get_db_conn(db_name=TMP_DATABASE_NAME)\n", "  File \"/app.py\", line 32, in get_db_conn\n    conn = psycopg.connect(\n", "  File \"/usr/local/lib/python3.12/dist-packages/psycopg/connection.py\", line 119, in connect\n    raise last_ex.with_traceback(None)\n"]}

but the script just tried to plough on regardless

Member Author:

I forgot to pull the commit out (I imagine we'll squash them all later anyway), but this is done here: 51b3570#diff-388985bd50a1bb1aefbf516845f5019a2f244d5ba5cc1f10f0f6df6e98dddc1fR53

echo "Got URL: $(URL)"

echo "Dropping DB $(LOCAL_DB_NAME)"
dropdb --if-exists "$LOCAL_DB_NAME"
echo "Creating DB $(LOCAL_DB_NAME)"
createdb "$LOCAL_DB_NAME"
Member:

If I try to run this locally (using AWS_PROFILE=sym-roe scripts/get-prod-db.sh), I get

+ dropdb --if-exists ynr-prod
dropdb: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  role "chris" does not exist

Normally, I would sudo -u postgres dropdb or sudo -u postgres createdb <dbname> in order to drop or create a DB. Do you have your local user set up as a postgres admin, or do you run the whole script as postgres? I wonder if it would be useful to allow the user to pass an option to use sudo here?

Member Author:

I reckon this should be sorted with 51b3570


echo "Downloading and restoring DB $(LOCAL_DB_NAME)"
wget -qO- "$URL" | pg_restore -d "$LOCAL_DB_NAME" -Fc --no-owner --no-privileges