This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent bdd74f3, commit 0ca9444
Showing 14 changed files with 4,724 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,198 @@ | ||
--- | ||
marp: true | ||
lang: en-US | ||
title: Codebase Collaborator | ||
description: LLMs for source code analysis and augmentation | ||
theme: moj | ||
transition: fade | ||
paginate: true | ||
_paginate: skip | ||
_class: title | ||
footer: ![image w:40](https://raw.githubusercontent.com/ministryofjustice/marp-moj-theme/main/images/moj.png) | ||
_footer: '' | ||
--- | ||
|
||
<!-- _header: ![w:100](https://raw.githubusercontent.com/ministryofjustice/marp-moj-theme/main/images/moj.png) --> | ||
|
||
## Building a transaction data lake using<br/> [Amazon Athena, Apache Iceberg and dbt](https://github.com/moj-analytical-services/dmet-cfe/iceberg_athena_dbt) | ||
|
||
<br/> | ||
|
||
### Dr Soumaya Mauthoor | ||
|
||
September 2024 | ||
|
||
--- | ||
<style scoped> | ||
section { | ||
justify-content: flex-end; | ||
} | ||
</style> | ||
|
||
![bg 70%](https://assets.publishing.service.gov.uk/media/6241e4dae90e075f06b37247/digi-strategy-2025.jpg) | ||
|
||
Published in 2022 under the Johnson Conservative government | ||
|
||
<!-- | ||
https://intranet.justice.gov.uk/blog/becoming-a-truly-data-led-justice-system/ | ||
Our data strategy: | ||
We will improve justice outcomes through data driven insight and innovation. | ||
We will ensure data meets user needs. | ||
We will build a data culture to value data as a strategic asset. | ||
--> | ||
|
||
--- | ||
## MoJ Analytical Platform | ||
|
||
<br/> | ||
|
||
![w:800](https://user-guidance.analytical-platform.service.justice.gov.uk/images/overview/analytical-platform.excalidraw.png) | ||
|
||
![bg right:40% w:90%](./images/cjs_dashboard.png) | ||
|
||
--- | ||
|
||
<style scoped> | ||
p { text-align: center; } | ||
</style> | ||
|
||
## Previous ELT Architecture | ||
|
||
![](./images/previous_etl_architecture.excalidraw.png) | ||
|
||
--- | ||
|
||
<style scoped> | ||
p, h5 { | ||
text-align: center; | ||
} | ||
</style> | ||
|
||
## Modern Table Formats | ||
|
||
<!-- Provide a table-like abstraction on top of native file formats like Parquet by storing additional metadata. --> | ||
|
||
![w:1200](https://miro.medium.com/v2/resize:fit:720/format:webp/1*H_goBvOV52AUid4egopzGw.png) | ||
|
||
##### **Iceberg** was the obvious choice for our use case because of its enhanced Athena support | ||
|
||
|
||
--- | ||
|
||
<style scoped> | ||
li { | ||
font-size: 25px; | ||
} | ||
</style> | ||
|
||
|
||
## Glue PySpark vs Athena Curation Benchmarking | ||
|
||
![bg left w:600](./images/athena_vs_glue.excalidraw.png) | ||
|
||
**Criteria** | ||
1. Cost | ||
2. Complexity | ||
3. Run Time | ||
|
||
**Dataset** | ||
|
||
TPC-DS `store_sales` | ||
|
||
- scale: 100 (~10GB) | ||
- scale: 3000 (~400GB) | ||
|
||
--- | ||
## Bulk Insert | ||
|
||
<br/> | ||
|
||
``` | ||
CREATE TABLE target_table | ||
AS SELECT * FROM source_table | ||
``` | ||
<br/> | ||
|
||
- Athena is cheaper at scales up to 3TB | ||
- Glue PySpark is faster at the 3TB scale | ||
|
||
![bg left w:550](./images/bulk_insert.png) | ||
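
The statement on this slide is engine-neutral. As a minimal sketch of how it might look when Athena writes an Iceberg table, the helper below renders a CTAS statement with the Iceberg table properties Athena expects (`table_type`, `is_external`, `location`). The function name `iceberg_ctas` and the S3 path are hypothetical, and the deck does not confirm that the benchmark used exactly this form.

```python
def iceberg_ctas(target: str, source: str, location: str) -> str:
    """Render an Athena CTAS statement that writes an Iceberg table.

    Assumption: the table names and S3 location are illustrative only.
    """
    return (
        f"CREATE TABLE {target}\n"
        "WITH (\n"
        "  table_type = 'ICEBERG',\n"   # write the result as an Iceberg table
        "  is_external = false,\n"      # Athena requires this for Iceberg CTAS
        f"  location = '{location}'\n"  # S3 prefix for data and metadata
        ")\n"
        f"AS SELECT * FROM {source}"
    )


if __name__ == "__main__":
    # Hypothetical bucket name, used only to show the rendered statement.
    print(iceberg_ctas("target_table", "source_table",
                       "s3://example-bucket/target_table/"))
```

The rendered string can then be submitted through whichever Athena client the pipeline already uses.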
|
||
--- | ||
## SCD2 Merge | ||
|
||
Update 0.1% of rows | ||
|
||
``` | ||
MERGE INTO target_table | ||
USING source_query | ||
ON search_condition | ||
WHEN MATCHED THEN UPDATE [] | ||
WHEN NOT MATCHED THEN INSERT [] | ||
``` | ||
<br/> | ||
|
||
- Athena is cheaper and faster | ||
- Glue PySpark runs out of memory at the 3TB scale | ||
|
||
![bg left w:550](./images/scd2_merge.png) | ||
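
The `MERGE` template on this slide leaves the SCD2 logic implicit. The sketch below illustrates the intended row-level semantics in plain Python, not the Athena implementation: a changed key has its current version closed and a new version appended, and an unseen key is simply inserted. The `Version` record, its field names, and the use of `valid_to = None` to mark the current row are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    key: int
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def scd2_merge(target: List[Version], source: Dict[int, str],
               load_date: date) -> List[Version]:
    """Return a new target after applying `source` as an SCD2 merge."""
    current = {v.key: v for v in target if v.valid_to is None}
    merged: List[Version] = []
    for v in target:
        incoming = source.get(v.key)
        if v.valid_to is None and incoming is not None and incoming != v.value:
            # Matched and changed: close the current version.
            merged.append(replace(v, valid_to=load_date))
        else:
            # Historical rows and unchanged rows are kept as they are.
            merged.append(v)
    for key, value in source.items():
        existing = current.get(key)
        if existing is None or existing.value != value:
            # New key, or changed value: append a new current version.
            merged.append(Version(key, value, load_date))
    return merged


if __name__ == "__main__":
    target = [Version(1, "a", date(2024, 1, 1)),
              Version(2, "b", date(2024, 1, 1))]
    for row in scd2_merge(target, {2: "B", 3: "c"}, date(2024, 2, 1)):
        print(row)
```

In SQL, the close-and-insert pair for a changed key usually means the source query has to supply that key twice, once to match and once to insert, which is part of why the merge is the more demanding of the two benchmarks.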
|
||
--- | ||
|
||
<style scoped> | ||
p { | ||
text-align: center; | ||
} | ||
</style> | ||
|
||
## Full Load Blue-Green Deployment | ||
|
||
![w:750](./images/wap.excalidraw.png) | ||
|
||
--- | ||
|
||
<style scoped> | ||
p { | ||
text-align: center; | ||
} | ||
</style> | ||
|
||
## Incremental Blue-Green Deployment | ||
|
||
![w:750](./images/wap_incremental.excalidraw.png) | ||
|
||
--- | ||
|
||
## Outcomes | ||
|
||
![bg left w:600](./images/glue_vs_athena_costs.png) | ||
|
||
- Reduced query costs by 99% | ||
|
||
- Reduced query time by 50-75% | ||
- Enabled daily refresh cycle | ||
- Stabilised data pipeline | ||
- Ensured data quality | ||
- Streamlined technology stack | ||
- Facilitated phased development | ||
|
||
--- | ||
|
||
<!-- _class: title --> | ||
<style scoped> | ||
p { | ||
font-size: 20px; | ||
} | ||
</style> | ||
|
||
# Questions? | ||
|
||
<br/> | ||
<br/> | ||
<br/> | ||
<br/> | ||
|
||
These slides were created using <img style="width:100px" src="https://github.com/marp-team/marp/raw/main/marp-dark.png"> | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
Date,Glue,Athena | ||
2022-01-01,0.200420705,0.006788253 | ||
2022-02-01,0.234382219,0.008365781 | ||
2022-03-01,0.179985882,0.011511357 | ||
2022-04-01,0.672464054,0.011120005 | ||
2022-05-01,0.328547535,0.015959739 | ||
2022-06-01,0.329692626,0.046454171 | ||
2022-07-01,0.328822443,0.022741952 | ||
2022-08-01,0.335708303,0.02577245 | ||
2022-09-01,0.45159921,0.03685975 | ||
2022-10-01,0.340109882,0.032156505 | ||
2022-11-01,0.220127312,0.036355487 | ||
2022-12-01,0.403483999,0.021851418 | ||
2023-01-01,0.569655357,0.023029389 | ||
2023-02-01,1,0.021210018 | ||
2023-03-01,0.899546517,0.044674264 | ||
2023-04-01,0.73687876,0.031588545 | ||
2023-05-01,0.551994307,0.051260474 | ||
2023-06-01,0.636772285,0.072586448 | ||
2023-07-01,0.614878175,0.058661321 | ||
2023-08-01,0.377902109,0.054203463 | ||
2023-09-01,0.269275657,0.049290057 | ||
2023-10-01,0.336504288,0.070287508 | ||
2023-11-01,0.401783323,0.083716463 | ||
2023-12-01,0.215929806,0.13910671 | ||
2024-01-01,0.209232392,0.118885746 | ||
2024-02-01,0.216702723,0.245415438 | ||
2024-03-01,0.232509031,0.151613374 | ||
2024-04-01,0.157422939,0.132180651 | ||
2024-05-01,0.042727833,0.126205249 | ||
2024-06-01,0.037141199,0.162597972 | ||
2024-07-01,0.039169778,0.154782834 | ||
2024-08-01,0.038066579,0.135886554 |
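
A minimal sketch of how this file could be summarised, assuming the `Glue` and `Athena` columns hold relative monthly costs on a shared scale (the largest Glue value is 1, which suggests normalisation, but the commit does not say so). Only three rows are embedded as a sample; the helper names are hypothetical.

```python
import csv
import io

# Three rows copied from the file above; the full file has 32 monthly rows.
SAMPLE = """Date,Glue,Athena
2022-01-01,0.200420705,0.006788253
2023-02-01,1,0.021210018
2024-08-01,0.038066579,0.135886554
"""


def load_costs(text):
    """Parse the CSV text into a list of dicts with float cost columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"Date": r["Date"], "Glue": float(r["Glue"]), "Athena": float(r["Athena"])}
        for r in reader
    ]


def peak_month(rows, column):
    """Return the date of the row with the largest value in `column`."""
    return max(rows, key=lambda r: r[column])["Date"]


def combined(rows):
    """Map each date to the sum of the Glue and Athena values."""
    return {r["Date"]: r["Glue"] + r["Athena"] for r in rows}


if __name__ == "__main__":
    rows = load_costs(SAMPLE)
    print("Glue peak:", peak_month(rows, "Glue"))
    print("Combined by month:", combined(rows))
```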