Added Spanner Migration Tool alerting and monitoring code for sharded migrations #2017

Open
wants to merge 3 commits into
base: main

Conversation

nehamodgil

This Terraform module delivers the following features tailored for the Spanner Migration Tool used in sharded migrations:

  1. Creates a custom Google Cloud monitoring dashboard to visualize critical metrics associated with the Spanner Migration Tool, enabling better monitoring and analysis.
  2. Configures alert policies to track and respond to errors and performance concerns during the migration process.
  3. Sets up notification channels to route alerts to designated endpoints, such as email or SMS.
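
As a rough illustration of features 2 and 3, the sketch below wires an email notification channel to a single alert policy. It is not taken from the PR. The resource names, display names, the 5-minute duration, and the choice of the Pub/Sub `oldest_unacked_message_age` metric are assumptions made for the example; the variable names `email_address` and `pubsub_age_of_oldest_message_threshold` and the 120-second value are borrowed from the tfvars excerpt quoted later in this conversation.

```hcl
variable "email_address" {
  description = "Email address that receives alert notifications."
  type        = string
}

variable "pubsub_age_of_oldest_message_threshold" {
  description = "Maximum tolerated age, in seconds, of the oldest unacknowledged Pub/Sub message."
  type        = number
  default     = 120
}

# Hypothetical resource name: routes alerts to the configured email address.
resource "google_monitoring_notification_channel" "email" {
  display_name = "Live migration alerts (email)"
  type         = "email"

  labels = {
    email_address = var.email_address
  }
}

# Hypothetical policy: fires when the oldest unacknowledged message stays
# older than the threshold for five minutes (the duration is an assumption).
resource "google_monitoring_alert_policy" "pubsub_backlog_age" {
  display_name = "Pub/Sub oldest unacked message age"
  combiner     = "OR"

  conditions {
    display_name = "Oldest unacked message age above threshold"

    condition_threshold {
      filter          = "resource.type = \"pubsub_subscription\" AND metric.type = \"pubsub.googleapis.com/subscription/oldest_unacked_message_age\""
      comparison      = "COMPARISON_GT"
      threshold_value = var.pubsub_age_of_oldest_message_threshold
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MAX"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.name]
}
```

Keeping the threshold and the recipient in variables lets each sample override them from its own tfvars file without touching the module.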

@nehamodgil nehamodgil requested a review from a team as a code owner November 16, 2024 04:37
Member

@manitgupta manitgupta left a comment

Thanks for the PR. This is great work. Added some comments, please take a look.

Overall -
Please look at an existing template for reference and follow a similar structure wherever possible. We also have a status check that validates Terraform templates and should catch many of these issues as well; please review its output.

@@ -0,0 +1,48 @@
# Terraform state files
Member

Do we need this? I think there is a global .gitignore that we can use instead.

Author

Agree, we do not need this. I have removed this file in the new commit.

@@ -0,0 +1,63 @@
# Spanner Migration Tool Monitoring Dashboard - Terraform Module
Member

Where possible, can you mirror the sample structure we have for other templates?

Ref - https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/datastream-to-spanner/terraform/samples/mysql-sharded-end-to-end

Example: We include a terraform.tfvars and terraform_simple.tfvars for each sample. Definitions are here - https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/datastream-to-spanner/terraform/samples#sample-structure

As much as possible, please try to keep the naming convention consistent.

Member

Instead of referring to this as "Spanner Migration Tool", could you please use "Live migration template(s)"?

For example -

  1. Live Migration Monitoring Dashboard (Title)
  2. visualize key metrics related to the Live migration template(s) (first line)

Author

Updated the README.md and the dashboard title to use "Live Migration Monitoring Dashboard".
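
For context, a dashboard carrying the renamed title could look roughly like the sketch below. This is an assumption-laden illustration, not the PR's actual dashboard: the resource name, the single widget, and the aligner are hypothetical, while the log-based metric name `dataflow_total_errors_metric` comes from the diff reviewed further down.

```hcl
# Hypothetical resource name; only the display name is taken from the review.
resource "google_monitoring_dashboard" "live_migration" {
  dashboard_json = jsonencode({
    displayName = "Live Migration Monitoring Dashboard"
    gridLayout = {
      columns = "2"
      widgets = [
        {
          title = "Dataflow errors (log-based metric)"
          xyChart = {
            dataSets = [
              {
                plotType = "LINE"
                timeSeriesQuery = {
                  timeSeriesFilter = {
                    # User-defined log-based metrics are exposed under
                    # the logging.googleapis.com/user/ prefix.
                    filter = "resource.type=\"dataflow_job\" AND metric.type=\"logging.googleapis.com/user/dataflow_total_errors_metric\""
                    aggregation = {
                      alignmentPeriod  = "60s"
                      perSeriesAligner = "ALIGN_RATE"
                    }
                  }
                }
              }
            ]
          }
        }
      ]
    }
  })
}
```

Building the JSON with `jsonencode` keeps the title in one place, so a rename like this one is a single-line change.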

name = "dataflow_conversion_errors_metric"

# Filter to capture only conversion errors with severity of ERROR or higher for Dataflow jobs
filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND textPayload:\"conversion error\""
Member

@shreyakhajanchi Please validate against the recent updates to the metrics whether the filter is still accurate.

Comment on lines +2 to +44
resource "google_logging_metric" "dataflow_conversion_errors_metric" {
  name = "dataflow_conversion_errors_metric"

  # Filter to capture only conversion errors with severity of ERROR or higher for Dataflow jobs
  filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND textPayload:\"conversion error\""

  # Metric descriptor settings to define the metric type and value type
  metric_descriptor {
    metric_kind = "DELTA" # Tracks change over time
    value_type  = "INT64" # Counts integer values
  }

}

# Log-based metric for other (non-conversion) Dataflow errors
resource "google_logging_metric" "dataflow_other_errors_metric" {
  name        = "dataflow_other_errors_metric"
  description = "Metric to track other Dataflow errors (excluding conversion errors)"

  # Filter to capture all Dataflow errors except conversion errors
  filter = "resource.type=\"dataflow_job\" AND severity>=ERROR AND NOT textPayload:\"conversion error\""

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }

}

# Log-based metric for total Dataflow errors (both conversion and other errors)
resource "google_logging_metric" "dataflow_total_errors_metric" {
  name        = "dataflow_total_errors_metric"
  description = "Metric to track the total number of Dataflow errors"

  # Filter to capture all Dataflow errors with severity of ERROR or higher
  filter = "resource.type=\"dataflow_job\" AND severity>=ERROR"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }

}
Member

Why do we need log-based metrics? Aren't these metrics directly published to GCP in a counter?

An older (but similar) example of querying Retryable Errors from Monitoring -

fetch dataflow_job
| metric 'dataflow.googleapis.com/job/user_counter'
| filter (resource.job_name == 'hb-dataflow-polltest-100tables-ef7e-3589')
| filter (metric.metric_name == 'Retryable errors')
| group_by 1m, [value_user_counter_mean: mean(value.user_counter)]
| every 1m
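
If the counter route is taken, an alert policy could read the same `user_counter` metric directly instead of going through a log-based metric. The following is only a sketch under stated assumptions: the resource name, the `dataflow_job_name` variable, and the threshold of 0 are hypothetical, and the filter is a filter-syntax rendering of the MQL query above, so it should be checked against the metrics the templates currently publish.

```hcl
# Hypothetical variable: the Dataflow job whose counter is monitored.
variable "dataflow_job_name" {
  description = "Name of the Dataflow job to monitor."
  type        = string
}

# Hypothetical policy built on the job's user counter rather than on logs.
resource "google_monitoring_alert_policy" "retryable_errors" {
  display_name = "Dataflow retryable errors"
  combiner     = "OR"

  conditions {
    display_name = "Retryable errors counter above threshold"

    condition_threshold {
      # Same selection as the MQL query: the user_counter metric, restricted
      # to one job and to the 'Retryable errors' counter.
      filter = join(" AND ", [
        "resource.type = \"dataflow_job\"",
        "metric.type = \"dataflow.googleapis.com/job/user_counter\"",
        "metric.labels.metric_name = \"Retryable errors\"",
        "resource.labels.job_name = \"${var.dataflow_job_name}\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 0 # Placeholder; tune to the tolerated error count.
      duration        = "0s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
}
```

The 60-second alignment with `ALIGN_MEAN` mirrors the `group_by 1m` mean in the query. Because the counter reports a running value for the job, a threshold of 0 would trip after the first retryable error, which is why it is marked as a placeholder.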

@manitgupta
Member

Sample Status check run -
https://github.com/GoogleCloudPlatform/DataflowTemplates/actions/runs/11867244367/job/33121707907?pr=2017

This shows that the file is not formatted with terraform fmt.


codecov bot commented Nov 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.93%. Comparing base (7fed825) to head (fe4a4d4).
Report is 5 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2017      +/-   ##
============================================
+ Coverage     45.41%   52.93%   +7.51%     
+ Complexity     3675     1367    -2308     
============================================
  Files           842      378     -464     
  Lines         49970    20661   -29309     
  Branches       5261     2090    -3171     
============================================
- Hits          22692    10936   -11756     
+ Misses        25608     9045   -16563     
+ Partials       1670      680     -990     
| Components | Coverage Δ |
|---|---|
| spanner-templates | 67.93% <ø> (+1.22%) ⬆️ |
| spanner-import-export | ∅ <ø> (∅) |
| spanner-live-forward-migration | 75.88% <ø> (ø) |
| spanner-live-reverse-replication | 76.65% <ø> (ø) |
| spanner-bulk-migration | 86.37% <ø> (ø) |

see 481 files with indirect coverage changes

@nehamodgil nehamodgil requested a review from manitgupta December 4, 2024 04:06
pubsub_age_of_oldest_message_threshold = 120 # Maximum age of the oldest message in Pub/Sub in seconds.

# Notification Configuration
email_address = "[email protected]" # Email address to receive alert notifications.
Member

Please change!

@@ -0,0 +1,63 @@
# Live Migration Monitoring Dashboard - Terraform Module
Member

@manitgupta
Member

Looks like there are some formatting issues. Please run terraform fmt.
