Releases: GoogleCloudPlatform/gcpdiag

v0.77

14 Nov 12:27

0.77 (2024-11-13)

New Lint Rules

  • gke/err/2024_002: gke webhook failure endpoints not available
  • gke/warn/2024_007: GKE cluster in a dual-stack with external IPv6 access

New Runbooks

  • lb/ssl-certificate: Runbook for troubleshooting LB SSL certificate issues
  • gke/node-unavailability: Identifies the reasons for a GKE node being unavailable

New Queries

  • gke.get_cluster: Retrieve a single GKE cluster using its project, region, and cluster name.
  • dns.find_dns_records: Resolves DNS records for a given domain and returns a set of IP addresses.
  • lb.get_ssl_certificate: Returns object matching certificate name and region
  • lb.get_target_https_proxies: Retrieves the list of all TargetHttpsProxy resources, regional and global, available to the specified project.
  • lb.get_forwarding_rule: Returns the specified ForwardingRule resource.

Enhancements

  • Functionality to auto suggest correct runbook names for misspelled runbooks
  • Updated docker images to ubuntu:24.04 (python 3.12)
  • Updated devcontainer to python 3.12
  • Migrated crm queries from v1 to v3
  • gce/vm-performance: Added PD performance health check
  • gce/vm-performance: Implemented disk average_io_latency check
  • Removed apis_utils.batch_execute_all call from orgpolicy query
  • Enabled gcpdiag.dev page indexing
  • Reduced API retries to 3 attempts
  • Fixed START_TIME_UTC inconsistency and date string parsing error
  • pubsub/pull-subscription-delivery: removed cold cache checks
  • Add functionality to disable query caching for edge cases
  • Improved error handling within the gcpdiag library to raise errors for callers to handle rather than exiting.
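
The runbook-name suggestion listed above could work along these lines. This is a minimal sketch using Python's standard difflib module, not gcpdiag's actual implementation; KNOWN_RUNBOOKS and suggest_runbook are hypothetical names introduced only for illustration.

```python
import difflib

# Hypothetical registry for illustration; the entries are runbook names
# that appear in these release notes.
KNOWN_RUNBOOKS = [
    "lb/ssl-certificate",
    "gke/node-unavailability",
    "gce/vm-performance",
    "pubsub/pull-subscription-delivery",
]


def suggest_runbook(name, known=KNOWN_RUNBOOKS, cutoff=0.6):
    """Return the closest known runbook name, or None if nothing is similar enough.

    The similarity cutoff (0.6 is difflib's default) is an assumption,
    not a value taken from gcpdiag.
    """
    matches = difflib.get_close_matches(name.lower(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this sketch, suggest_runbook("gke/node-unavailabilty") returns "gke/node-unavailability", while a name with no close match returns None so the caller can fall back to a plain "unknown runbook" error.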

Fixes

  • lb.get_backend_service: Improved calls to fetch global backend
  • Added project_id parameters for the runbook tests without valid project ids

Deprecation

  • Flag --project: Full deprecation in runbook command to allow multiple project ids/numbers to be specified via --parameter

v0.76

01 Oct 17:12

0.76 (2024-10-1)

New Lint Rules

  • dataproc/warn/2024_005: Investigates if Data Fusion version is compatible with Dataproc version from the CDAP Preferences settings

New Runbooks

  • pubsub/pull-subscription-delivery: Investigates common Cloud Pub/Sub pull delivery issues related to delivery latency, quotas, pull rate and throughput rate

New Queries

  • pubsub.get_subscription: Retrieves a single pubsub subscription resource
  • apis.is_all_enabled: Check if a list of services are enabled on a given project
  • gke.get_release_schedule: Fetch GKE cluster release schedule

Enhancements

  • make new-rule: A make rule with a cookiecutter recipe to generate new lint rule templates
  • gce.get_gce_public_images: Improved gce_stub query to correctly fetch all image licenses during test.
  • Runbooks metrics generation for Google Internal Users
  • New flag --reason: argument primarily used by Google internal users to specify rationale for executing the tool
  • Bundles: A runbook feature to allow execution of a collection of steps
  • Runbook operation (op.add_metadata) to create or retrieve metadata related to steps

Fixes

  • Enforce explicit parameter configuration in gce generalized steps.
  • dataflow/dataflow-permission: Refactored runbook to dataflow/job-permission
  • dataflow/bp/2024_002: Fixed resource filtering bug for forwarding rule (internal LB)
  • gce/vm-performance: Fixed disk performance benchmark lookup

Deprecation

  • apis_utils.batch_list_all: Replaced by apis_utils.multi_list_all
  • Flag --project: Soft deprecation in runbook command to allow multiple project ids/numbers to be specified via --parameter
  • Deprecated pre-commit hook gke-eol-file

v0.75

03 Sep 17:14

0.75 (2024-9-2)

New Lint Rules

  • bigquery/WARN/2024_005: Checks BigQuery table does not exceed number of partition modifications
    to a column partitioned table
  • bigquery/WARN/2024_006: Checks BigQuery job does not exceed tabledata.list bytes
    per second per project
  • dataflow/ERR/2024_006: Checks Dataflow job does not fail during execution due
    to resource exhaustion in zone
  • datafusion/WARN/2024_004: Checks Data Fusion version is compatible with Dataproc
    version from the corresponding compute profiles
  • gke/WARN/2024_003: Checks Ingress traffic is successful if service is correctly mapped
  • gke/WARN/2024_004: Checks Ingress is successful if backendconfig crd is correctly mapped
  • gke/WARN/2024_005: Checks GKE Ingress successfully routes external traffic to NodePort service
  • gce/BP_EXT/2024_002: Calculate a GCE VM's IOPS and Throughput Limits

New Runbooks

  • lb/unhealthy-backends: Diagnose Unhealthy Backends of a Load Balancer
  • gke/resource-quota: Diagnose quota-related issues for GKE clusters.
  • gce/vm-performance: Diagnose GCE VM performance
  • gke/image-pull: Diagnose image pull failures related to GKE clusters.
  • gke/node-auto-repair: RCA Node auto-repaired incidents
  • gke/gke-ip-masq-standard: Diagnose IP Masquerading issues on GKE clusters
  • dataflow/dataflow-permission: Diagnose Permission required for cluster creation and operation

New Query

  • lb.get_backend_service: Fetch instances matching compute backend service name and/or region
  • lb.get_backend_service_health: Fetch compute backend service health data
  • generic_api/datafusion: Re-implementation of how to call and test generic apis

Enhancements

  • cloudrun/service-deployment: 2 additional checks for image not found and image permissions failure
  • bigquery/WARN/2022_001: Updated lint rule diagnostic steps documentation
  • Implement ignorecase for input parameters
  • gce/ssh and gce/serial-log-analyzer: Include Auth failure checks in runbooks
  • Updated GKE version End of Life tracker
  • New API Stub for Recommender API

Fixes

  • gce/vm-termination: Made vm name and zone mandatory fields
  • Updated dependencies:
    • aiohttp: 3.9.5 -> 3.10.3
    • attrs: 23.2.0 -> 24.2.0
    • cachetools: 5.3.3 -> 5.4.0
    • certifi: 2024.6.2 -> 2024.7.4
    • exceptiongroup: 1.2.1 -> 1.2.2
    • google-api-python-client: 2.134.0 -> 2.141.0
    • google-auth: 2.30.0 -> 2.33.0
    • google-auth-oauthlib: 1.2.0 -> 1.2.1
    • importlib-resources: 6.4.0 -> 6.4.2
    • protobuf: 5.27.2 -> 5.27.3
    • pyyaml: 6.0.1 -> 6.0.2
    • soupsieve: 2.5 -> 2.6
  • Fix lint output and GCE query functions for multi-region resources
  • Removed deprecated option skip_delete from TF code

v0.74

11 Jul 15:51

Full Changelog: v0.68...v0.74

0.74 (2024-7-10)

Fixes

  • Re-roll of v0.72 after correcting pip module issue with the docker image build

New Lint Rule

  • datafusion/warn_2024_002 Data Fusion instance is in a running state

New Runbook

  • dataproc/cluster_creation Dataproc cluster creation diagnostic tree

0.73 (2024-7-8)

New Feature

  • Added search command to scan the docstrings for lint rules or runbooks to
    match keywords
  • Added runbook check step outcome: step_ok, step_failed, etc.
  • Added a zonal endpoint in osconfig library. It returns inventories for all VMs under a certain zone

Fixes

  • Create runbook report regardless of the number of failed steps
  • Improve introductory error message for new runbooks
  • Update lint command API return value for display of resources in each rule
  • General spelling corrections
  • Add documentation for runbook operator methods
  • Remove unneeded google path reference in loading template block content
  • Update runbook name validation
  • Handle when gcloud command is not installed when running runbook generator
  • Allow to query logs for each test data separately in logs_stub
  • Update GKE EOL date
  • Relax constraints on location of end steps in runbook
  • Update pip dependencies; security fix for pdoc
  • Added monitoring to the list of supported products runbook steps
  • generic_api/datafusion apis.make_request() re-implementation
  • Update and improve runbook error handling

New Lint Rule

  • gke/err_2024_001_psa_violations Checking for no Pod Security Admission violations in the project
  • bigquery/warn_2024_002_invalid_external_connection BigQuery external
    connection with Cloud SQL does not fail
  • pubsub/err_2024_003_snapshot_creation_fails snapshot creation fails if
    backlog is too old
  • pubsub/err_2024_002_vpc_sc_new_subs_create_policy_violated check for
    pubsub error due to organization policy
  • bigquery/warn_2024_003 BigQuery job does not fail due to Maximum API requests per user per method exceeded

New Runbook

  • gce/ops_agent Ops Agent Onboarding runbook
  • gce/serial_log_analyzer runbook to analyse known issues logged into Serial Console logs
  • vertex/workbench_instance_stuck_in_provisioning Runbook to Troubleshoot Issue: Vertex AI Workbench Instance Stuck in Provisioning State
  • cloudrun/service_deployment Cloud Run deployment runbook
  • gke/ip_exhaustion gke ip exhaustion runbook
  • dataflow/failed_streaming_pipeline Diagnostic checks for failed Dataflow Streaming Pipelines
  • nat/out_of_resources vm external ip connectivity runbook

v0.67

21 Nov 18:24

0.67 (2023-10-17)

Fixes

  • Updating GKE EOL file and snapshot
  • Rewording message triggering internal leak test

New Command and Rules

  • Runbook POC with ssh runbook and terraform scripts

New rules

  • GKE cluster has workload identity enabled
  • Splunk job uses valid certificate

Full Changelog: v0.66...v0.67

gcpdiag 0.71

19 Apr 18:30

0.71 (2024-4-17)

New lint rules

  • datafusion/err_2024_001_delete_operation_failing datafusion
    deletion operation
  • gce/err_2024_003_vm_secure_boot_failures GCE Lint rule for boot
    failures for Shielded VM
  • gce/bp_2024_001_legacy_monitoring_agent GCE Legacy Monitoring Agent
    is not installed
  • gce/bp_2024_002_legacy_logging_agent GCE Legacy Logging Agent is not
    installed
  • gce/bp_ext_2024_001_no_public_ip.py GCE SSH in Browser: SSH Button
    Disabled
  • pubsub/bp_2024_001_ouma_less_one_day Oldest Unacked Message Age
    Value less than 24 hours
  • bigquery/err_2024_001_query_too_complex query is too complex
  • bigquery/warn_2024_001_imports_or_query_appends_per_table table
    exceeds limit for imports or query appends

New query

New runbook

  • gce/vm_termination assist investigating underlying reasons behind
    termination or reboot
  • gke/cluster_autoscaler GKE Cluster autoscaler error messages check

New features

  • Add cache bypass option for runbook steps
  • Add runbook starter code generator; updates to code generator
  • Add API for runbook command

Fixes

  • Add mock data for datafusion API testing
  • Correct runbook documentation generation output
  • Improve runbook operator functions usage
  • Add dataflow and other components to supported runbook component list
  • Remove duplicate vm_termination.py script
  • Add jinja templates to docker image on cloud shell
  • correct argv passed for parsing in runbook command
  • Adding pipenv and git checks to help beginners get started easily on runbook
    generator
  • update idna pipenv CVE-2024-3651 Moderate severity
  • SSH runbook enhancements
  • runbook fixes - catch missing template errors, include project id when no
    parameters

gcpdiag 0.70

19 Apr 18:29

0.70 (2024-3-27)

New lint rules

  • pubsub/ERR_2024_001 bq subscription table not found
  • composer/WARN_2024_001 low scheduler cpu usage
  • datafusion/WARN_2024_001 data fusion version
  • composer/WARN_2024_002 worker pod eviction
  • gce/ERR_2024_002 performance
  • notebooks/ERR_2024_001 executor explicit project permissions
  • dataflow/WARN_2024_001 dataflow operation ongoing
  • dataflow/ERR_2024_001 dataflow gce quotas
  • dataflow/WARN_2024_002 dataflow streaming appliance commit failed
  • dataflow/ERR_2024_002 dataflow key commit
  • gke/WARN_2024_001 cluster nap limits prevent autoscaling

New query

  • datafusion_cdap API query implementation - provides CDAP profile metadata

Fixes

  • Updated pipenv packages, Pipenv.lock dependencies
  • Updated github action workflow versions to stop warnings about node v10 and v10
  • Refactor Runbook: Implemented a modular, class-based design to facilitate a
    more configurable method for tree construction.

v0.69

26 Feb 20:43

0.69 (2024-2-21)

New feature

  • add universe_domain for Trusted Partner Client (TPC)

New rules

  • asm/WARN_2024_001 Webhook failed
  • lb/BP_2024_002 Check if global access is on for the regional iLB
  • pubsub/WARN_2024_003 Pub/Sub rule: CMEK - Topic Permissions
  • dataproc/WARN_2024_001 dataproc check hdfs safemode status
  • dataproc/WARN_2024_002 dataproc hdfs write issues
  • gce/ERR_2024_001 GCE rule: Snapshot creation rate limit
  • lb/BP_2024_001 session affinity enabled on load balancer
  • pubsub/WARN_2024_002 GCS subscription has the apt permissions
  • dataflow/ERR_2023_010 missing required field
  • pubsub/WARN_2024_001 DLQ Subscription has apt permissions

Fixes

  • Update Pull Request and Merge to only run when an update was committed
  • Creating a github action Workflow to automatically update the gke/eol.yaml file
  • Update gke/eol.yaml file

Full Changelog: https://github.com/GoogleCloudPlatform/gcpdiag/commits/v0.69

gcpdiag 0.68

18 Jan 20:54

0.68 (2024-1-17)

New Rules

  • gke/bp_2023_002 GKE cluster is a private cluster
  • composer/err_2023_002 Use allowed IP ranges to create Private IP Cluster
  • composer/err_2023_004 DAG is detected as zombie
  • composer/err_2023_003 DAG timeout issue
  • composer/err_2023_005 Check NAT config for environment deletion fail
  • bigquery/err_2023_009 BigQuery job not failed due to Schedule query with multiple DML
  • gce/warn_2023_002 Serial logs don’t contain out-of-memory message due to airflow task run
  • dataflow/err_2023_011 Streaming insert mismatch column type
  • dataflow/err_2023_012 Spanner OOM
  • dataflow/err_2023_013 Spanner deadline error
  • pubsub/warn_2023_006 Pubsub push subscriptions have no push errors
  • dataproc/err_2023_008 Dataproc cluster disk space issues check and web page
  • composer/err_2024_001 Composer not failed due to 'no error was surfaced' error
  • lb/bp_2023_002 check that logging is enabled on health checks for load balancer backend
    services
  • vpc/warn_2024_001 Check Unused Reserved IP addresses
  • iam/sec_2024_001 Detect unused service accounts

New module

  • Add billing module query and lint rules

Fixes

  • Skip notebook instances query if API is not enabled
  • Update MD formatting for gke/WARN/2023_004.md
  • Update conflicting credentials import name
  • Updating EOL rule snapshot to match new schedule
  • Update gke eol.yaml
  • add str repr of RuleModule for more info in exceptions loading rules
  • fixed bug in billing change 1673236 - added checks for correct permissions
  • fixed bug in change id 2113602 - updated condition for check NAT config rule

Features and Improvements

  • Improved report generation for runbook
  • refactor lint.command.run to return a dict when run from API service
  • Add set_credentials() method
  • Clear credentials used in API service after request
  • Updated gke eol.yaml
  • Added the id label to filter the Dataflow jobs using the job id

gcpdiag 0.66

06 Nov 23:27

0.66 (2023-11-06)

New rules

  • bigquery/ERR/2023_008: user not authorized to perform this action
  • pubsub/WARN/2023_005: bigquery subscription has apt permissions
  • asm/ERR/2023_001, asm/ERR/2023_002: Anthos Service mesh
  • gke/BP/2022_003: Make GKE EOL detection more robust and less hardcoded
  • gke/WARN/2023_004: Add a check for too low maxPodsPerNode number
  • gke/ERR/2023_012: missing memory request for hpa
  • bigquery/ERR/2023_006: bigquery policy does not belong to user
  • pubsub/WARN/2023_00[14]: no subscription without attached topic
  • composer/WARN/2023_009: Cloud Composer Intermittent Task Failure during Scheduling

New module

  • Anthos Service mesh

Fixes