add iceberg and athena summary
SoumayaMauthoorMOJ committed Sep 23, 2024
1 parent bdd74f3 commit 0ca9444
Showing 14 changed files with 4,724 additions and 1 deletion.
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,5 +1,6 @@
-# pyenv
+# python
.python-version
+*.egg-info

# Environments
.env
@@ -10,6 +11,7 @@ ENV/
env.bak/
venv.bak/
venv*/
+*build/

# ds store
.DS_Store
198 changes: 198 additions & 0 deletions investigations/iceberg_athena_dbt/README.md
@@ -0,0 +1,198 @@
---
marp: true
lang: en-US
title: Building a transaction data lake using Amazon Athena, Apache Iceberg and dbt
description: Benchmarking Glue PySpark against Athena for curating Iceberg tables
theme: moj
transition: fade
paginate: true
_paginate: skip
_class: title
footer: ![image w:40](https://raw.githubusercontent.com/ministryofjustice/marp-moj-theme/main/images/moj.png)
_footer: ''
---

<!-- _header: ![w:100](https://raw.githubusercontent.com/ministryofjustice/marp-moj-theme/main/images/moj.png) -->

## Building a transaction data lake using<br/> [Amazon Athena, Apache Iceberg and dbt](https://github.com/moj-analytical-services/dmet-cfe/iceberg_athena_dbt)

<br/>

### Dr Soumaya Mauthoor

September 2024

---
<style scoped>
section {
justify-content: flex-end;
}
</style>

![bg 70%](https://assets.publishing.service.gov.uk/media/6241e4dae90e075f06b37247/digi-strategy-2025.jpg)

Published in 2022 under the Johnson Conservative government

<!--
https://intranet.justice.gov.uk/blog/becoming-a-truly-data-led-justice-system/
Our data strategy:
We will improve justice outcomes through data driven insight and innovation.
We will ensure data meets user needs.
We will build a data culture to value data as a strategic asset.
-->

---
## MoJ Analytical Platform

<br/>

![w:800](https://user-guidance.analytical-platform.service.justice.gov.uk/images/overview/analytical-platform.excalidraw.png)

![bg right:40% w:90%](./images/cjs_dashboard.png)

---

<style scoped>
p { text-align: center; }
</style>

## Previous ELT Architecture

![](./images/previous_etl_architecture.excalidraw.png)

---

<style scoped>
p, h5 {
text-align: center;
}
</style>

## Modern Table Formats

<!-- Provide a table-like abstraction on top of native file formats like Parquet by storing additional metadata. -->

![w:1200](https://miro.medium.com/v2/resize:fit:720/format:webp/1*H_goBvOV52AUid4egopzGw.png)

##### **Iceberg** was the obvious choice for our use case because of its enhanced Athena support


---

<style scoped>
li {
font-size: 25px;
}
</style>


## Glue PySpark vs Athena Curation Benchmarking

![bg left w:600](./images/athena_vs_glue.excalidraw.png)

**Criteria**
1. Cost
2. Complexity
3. Run Time

**Dataset**

TPC-DS `store_sales`

- scale: 100 (~10GB)
- scale: 3000 (~400GB)

---
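
The cost criterion can be approximated from each service's billing model: Athena bills per byte scanned, Glue per DPU-hour. A minimal sketch, assuming list prices of $5 per TB scanned and $0.44 per DPU-hour (both vary by region and over time) and ignoring minimum-billing rules. The function names are hypothetical and not part of the benchmark code.

```python
def athena_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate Athena query cost from the volume of data scanned.

    usd_per_tb is an assumed list price, not a figure from the benchmark.
    """
    return bytes_scanned / 1024**4 * usd_per_tb


def glue_cost_usd(
    dpus: int, runtime_seconds: float, usd_per_dpu_hour: float = 0.44
) -> float:
    """Estimate Glue job cost from allocated DPUs and run time.

    usd_per_dpu_hour is an assumed list price, not a figure from the benchmark.
    """
    return dpus * runtime_seconds / 3600 * usd_per_dpu_hour
```

Under these assumptions Athena cost tracks data volume while Glue cost tracks cluster size and run time, which is why the two can rank differently at different scales.

---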
## Bulk Insert

<br/>

```
CREATE TABLE target_table
AS SELECT * FROM source_table
```
<br/>

- Athena is cheaper at scales up to 3TB
- Glue PySpark is faster at the 3TB scale

![bg left w:550](./images/bulk_insert.png)

---
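
For the bulk insert to produce an Iceberg table, the Athena CTAS statement needs a `WITH` clause that sets `table_type = 'ICEBERG'`, a `location`, and `is_external = false`. A minimal sketch that renders such a statement; the helper name and the S3 path are hypothetical, and the benchmark's actual table properties are not shown in these slides.

```python
def iceberg_ctas(target: str, source: str, location: str) -> str:
    """Render an Athena CTAS statement that creates an Iceberg table."""
    return (
        f"CREATE TABLE {target}\n"
        "WITH (\n"
        "  table_type = 'ICEBERG',\n"
        f"  location = '{location}',\n"
        "  is_external = false\n"
        ")\n"
        f"AS SELECT * FROM {source}"
    )


# Hypothetical bucket and prefix, for illustration only.
print(iceberg_ctas("target_table", "source_table", "s3://example-bucket/target_table/"))
```

---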
## SCD2 Merge

Update 0.1% of rows

```
MERGE INTO target_table
USING source_query
ON search_condition
WHEN MATCHED THEN UPDATE []
WHEN NOT MATCHED THEN INSERT []
```
<br/>

- Athena is cheaper and faster
- Glue PySpark runs out of memory at the 3TB scale

![bg left w:550](./images/scd2_merge.png)

---
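
The logic an SCD2 merge has to implement can be illustrated in plain Python: a changed row has its current version closed off, and a new current version is inserted alongside it. This is a conceptual sketch only, not the Athena or PySpark implementation that was benchmarked; the `Row` columns and the `scd2_merge` name are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Row:
    key: int                          # business key (hypothetical column)
    value: str                        # tracked attribute (hypothetical column)
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version


def scd2_merge(target: list[Row], source: dict[int, str], as_of: date) -> list[Row]:
    """Apply {key: value} updates from source to target, keeping history."""
    current = {r.key: r for r in target if r.valid_to is None}
    merged = []
    for row in target:
        changed = (
            row.valid_to is None
            and row.key in source
            and source[row.key] != row.value
        )
        # Matched and changed: close the current version instead of overwriting it.
        merged.append(replace(row, valid_to=as_of) if changed else row)
    for key, value in source.items():
        existing = current.get(key)
        # New key, or changed value: insert a fresh current version.
        if existing is None or existing.value != value:
            merged.append(Row(key, value, valid_from=as_of))
    return merged
```

Unchanged rows pass through untouched, so the cost of the merge is driven by locating the small fraction of rows that did change.

---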

<style scoped>
p {
text-align: center;
}
</style>

## Full Load Blue-Green Deployment

![w:750](./images/wap.excalidraw.png)

---
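
A full-load blue-green deployment is commonly implemented as write-audit-publish: build the new table beside the live one, check it, and only then swap it in. A minimal in-memory sketch of that pattern, assuming a dictionary stands in for the data catalog; the `__staging` suffix and all function names are hypothetical and not taken from the pipeline in these slides.

```python
from typing import Callable

Catalog = dict[str, list[dict]]  # table name -> rows; stands in for a real catalog


def full_load_blue_green(
    catalog: Catalog,
    live: str,
    build: Callable[[], list[dict]],
    audit: Callable[[list[dict]], bool],
) -> bool:
    """Write to a staging table, audit it, then publish by swapping it in."""
    staging = f"{live}__staging"           # hypothetical naming convention
    catalog[staging] = build()             # write: full reload into "green"
    if not audit(catalog[staging]):        # audit: run data quality checks
        del catalog[staging]               # failed: live ("blue") is untouched
        return False
    catalog[live] = catalog.pop(staging)   # publish: green replaces blue
    return True
```

Because readers only ever see the live name, a failed audit leaves the previously published data in place.

---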

<style scoped>
p {
text-align: center;
}
</style>

## Incremental Blue-Green Deployment

![w:750](./images/wap_incremental.excalidraw.png)

---

## Outcomes

![bg left w:600](./images/glue_vs_athena_costs.png)

- Reduced query costs by 99%
- Reduced query time by 50-75%
- Enabled daily refresh cycle
- Stabilised data pipeline
- Ensured data quality
- Streamlined technology stack
- Facilitated phased development

---

<!-- _class: title -->
<style scoped>
p {
font-size: 20px;
}
</style>

# Questions?

<br/>
<br/>
<br/>
<br/>

These slides were created using <img style="width:100px" src="https://github.com/marp-team/marp/raw/main/marp-dark.png">

33 changes: 33 additions & 0 deletions investigations/iceberg_athena_dbt/costs.csv
@@ -0,0 +1,33 @@
Date,Glue,Athena
2022-01-01,0.200420705,0.006788253
2022-02-01,0.234382219,0.008365781
2022-03-01,0.179985882,0.011511357
2022-04-01,0.672464054,0.011120005
2022-05-01,0.328547535,0.015959739
2022-06-01,0.329692626,0.046454171
2022-07-01,0.328822443,0.022741952
2022-08-01,0.335708303,0.02577245
2022-09-01,0.45159921,0.03685975
2022-10-01,0.340109882,0.032156505
2022-11-01,0.220127312,0.036355487
2022-12-01,0.403483999,0.021851418
2023-01-01,0.569655357,0.023029389
2023-02-01,1,0.021210018
2023-03-01,0.899546517,0.044674264
2023-04-01,0.73687876,0.031588545
2023-05-01,0.551994307,0.051260474
2023-06-01,0.636772285,0.072586448
2023-07-01,0.614878175,0.058661321
2023-08-01,0.377902109,0.054203463
2023-09-01,0.269275657,0.049290057
2023-10-01,0.336504288,0.070287508
2023-11-01,0.401783323,0.083716463
2023-12-01,0.215929806,0.13910671
2024-01-01,0.209232392,0.118885746
2024-02-01,0.216702723,0.245415438
2024-03-01,0.232509031,0.151613374
2024-04-01,0.157422939,0.132180651
2024-05-01,0.042727833,0.126205249
2024-06-01,0.037141199,0.162597972
2024-07-01,0.039169778,0.154782834
2024-08-01,0.038066579,0.135886554