Releases: GoogleCloudPlatform/gcpdiag
v0.77
0.77 (2024-11-13)
New Lint Rules
- gke/err/2024_002: gke webhook failure endpoints not available
- gke/warn/2024_007: GKE cluster in a dual-stack with external IPv6 access
New Runbooks
- lb/ssl-certificate: Runbook for troubleshooting LB SSL certificates issues
- gke/node-unavailability: Identifies the reasons for a GKE node being unavailable
New Queries
- gke.get_cluster: Retrieve a single GKE cluster using its project, region, and cluster name.
- dns.find_dns_records: Resolves DNS records for a given domain and returns a set of IP addresses.
- lb.get_ssl_certificate: Returns object matching certificate name and region
- lb.get_target_https_proxies: Retrieves the list of all TargetHttpsProxy resources, regional and global, available to the specified project.
- lb.get_forwarding_rule: Returns the specified ForwardingRule resource.
Enhancements
- Functionality to auto suggest correct runbook names for misspelled runbooks
- Updated docker images to ubuntu:24.04 (python 3.12)
- Updated devcontainer to python 3.12
- Migrated crm queries from v1 to v3
- gce/vm-performance: Added PD performance health check
- gce/vm-performance: Implemented disk average_io_latency check
- Removed apis_utils.batch_execute_all call from orgpolicy query
- Enabled gcpdiag.dev page indexing
- Reduced API retries to 3 attempts
- Fixed START_TIME_UTC inconsistency and "Error parsing date string" failures
- pubsub/pull-subscription-delivery: removed cold cache checks
- Add functionality to disable query caching for edge cases
- Improve error handling within gcpdiag library to raise errors for handling rather than exiting.
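As an illustration of the runbook-name suggestion enhancement listed above, the following is a minimal sketch using Python's standard `difflib` module. It is not gcpdiag's implementation: the `KNOWN_RUNBOOKS` list and the `suggest_runbooks` helper are hypothetical, and the runbook names are simply taken from these release notes.

```python
import difflib

# Hypothetical registry for illustration; names are runbooks mentioned in these notes.
KNOWN_RUNBOOKS = [
    "lb/ssl-certificate",
    "gke/node-unavailability",
    "gce/vm-performance",
    "pubsub/pull-subscription-delivery",
]


def suggest_runbooks(name, known=KNOWN_RUNBOOKS, limit=3, cutoff=0.6):
    """Return up to `limit` known runbook names that closely resemble `name`.

    Matches are ordered from most to least similar; an empty list means
    nothing scored at or above `cutoff`.
    """
    return difflib.get_close_matches(name.lower(), known, n=limit, cutoff=cutoff)


if __name__ == "__main__":
    typo = "gce/vm-performence"
    matches = suggest_runbooks(typo)
    if matches:
        print(f"Unknown runbook {typo!r}. Did you mean: {', '.join(matches)}?")
    else:
        print(f"Unknown runbook {typo!r}.")
```

The `cutoff` value trades recall for precision: a lower value suggests more distant names, a higher one only near-exact matches.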
Fixes
- lb.get_backend_service: Improved calls to fetch global backend
- Added project_id parameters for the runbook tests without valid project ids
Deprecation
- Flag --project: Full deprecation in runbook command to allow multiple project ids/numbers to be specified via --parameter
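To show how a repeatable --parameter flag can carry several values that a single --project flag cannot, here is a rough sketch built on `argparse`. It is not gcpdiag's actual parser: the `parse_parameters` helper, the KEY=VALUE format, and the `name` key are assumptions made for illustration; only the `--parameter` flag and the `project_id` parameter name appear in these notes.

```python
import argparse


def parse_parameters(argv):
    """Collect repeated --parameter KEY=VALUE arguments into a dict.

    Hypothetical helper: assumes each value is passed as KEY=VALUE.
    """
    parser = argparse.ArgumentParser(prog="runbook-sketch")
    parser.add_argument(
        "--parameter",
        action="append",  # each occurrence is appended to a list
        default=[],
        metavar="KEY=VALUE",
    )
    args = parser.parse_args(argv)

    params = {}
    for item in args.parameter:
        key, sep, value = item.partition("=")
        if not sep or not key:
            # parser.error prints a usage message and exits with status 2
            parser.error(f"expected KEY=VALUE, got {item!r}")
        params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    print(parse_parameters(
        ["--parameter", "project_id=my-project", "--parameter", "name=vm-1"]
    ))
```

Because every value arrives through the same flag, further project ids or numbers can be added as extra key/value pairs without introducing new command-line options.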
v0.76
0.76 (2024-10-1)
New Lint Rules
- dataproc/warn/2024_005: Investigates if Data Fusion version is compatible with Dataproc version from the CDAP Preferences settings
New Runbooks
- pubsub/pull-subscription-delivery: Investigates common Cloud Pub/Sub pull delivery issues related to delivery latency, quotas, pull rate and throughput rate
New Queries
- pubsub.get_subscription: Retrieves a single pubsub subscription resource
- apis.is_all_enabled: Check if a list of services are enabled on a given project
- gke.get_release_schedule: Fetch GKE cluster release schedule
Enhancements
- make new-rule: A make rule with a cookiecutter recipe to generate new lint rule templates
- gce.get_gce_public_images: Improved gce_stub query to correctly fetch all image licenses during test.
- Runbooks metrics generation for Google Internal Users
- New flag --reason: argument primarily used by Google internal users to specify the rationale for executing the tool
- Bundles: A runbook feature to allow execution of a collection of steps
- Runbook operation (op.add_metadata) to create or retrieve metadata related to steps
Fixes
- Enforce explicit parameter configuration in gce generalized steps.
- dataflow/dataflow-permission: Refactored runbook to dataflow/job-permission
- dataflow/bp/2024_002: Fixed resource filtering bug for forwarding rule (internal LB)
- gce/vm-performance: Fixed disk performance benchmark lookup
Deprecation
- apis_utils.batch_list_all: Replaced by apis_utils.multi_list_all
- Flag --project: Soft deprecation in runbook command to allow multiple project ids/numbers to be specified via --parameter
- Deprecated pre-commit hook gke-eol-file
v0.75
0.75 (2024-9-2)
New Lint Rules
- bigquery/WARN/2024_005: Checks BigQuery table does not exceed number of partition modifications to a column partitioned table
- bigquery/WARN/2024_006: Checks BigQuery job does not exceed tabledata.list bytes per second per project
- dataflow/ERR/2024_006: Checks Dataflow job does not fail during execution due to resource exhaustion in zone
- datafusion/WARN/2024_004: Checks Data Fusion version is compatible with Dataproc version from the corresponding compute profiles
- gke/WARN/2024_003: Checks Ingress traffic is successful if service is correctly mapped
- gke/WARN/2024_004: Checks Ingress is successful if backendconfig crd is correctly mapped
- gke/WARN/2024_005: Checks GKE Ingress successfully routes external traffic to NodePort service
- gce/BP_EXT/2024_002: Calculate a GCE VM's IOPS and Throughput Limits
New Runbooks
- lb/unhealthy-backends: Diagnose Unhealthy Backends of a Load Balancer
- gke/resource-quota: Diagnose quota-related issues in GKE clusters.
- gce/vm-performance: Diagnose GCE VM performance
- gke/image-pull: Diagnose image pull failures related to GKE clusters.
- gke/node-auto-repair: RCA Node auto-repaired incidents
- gke/gke-ip-masq-standard: Diagnose IP Masquerading issues on GKE clusters
- dataflow/dataflow-permission: Diagnose Permission required for cluster creation and operation
New Query
- lb.get_backend_service: Fetch instances matching compute backend service name and/or region
- lb.get_backend_service_health: Fetch compute backend service health data
- generic_api/datafusion: Re-implementation of how to call and test generic apis
Enhancements
- cloudrun/service-deployment: 2 additional checks for image not found and image permissions failure
- bigquery/WARN/2022_001: Updated lint rule diagnostic steps documentation
- Implement ignorecase for input parameters
- gce/ssh and gce/serial-log-analyzer: Include Auth failure checks in runbooks
- Updated GKE version End of Life tracker
- New API Stub for Recommender API
Fixes
- gce/vm-termination: Made vm name and zone mandatory fields
- Updated dependencies:
  - aiohttp: 3.9.5 -> 3.10.3
  - attrs: 23.2.0 -> 24.2.0
  - cachetools: 5.3.3 -> 5.4.0
  - certifi: 2024.6.2 -> 2024.7.4
  - exceptiongroup: 1.2.1 -> 1.2.2
  - google-api-python-client: 2.134.0 -> 2.141.0
  - google-auth: 2.30.0 -> 2.33.0
  - google-auth-oauthlib: 1.2.0 -> 1.2.1
  - importlib-resources: 6.4.0 -> 6.4.2
  - protobuf: 5.27.2 -> 5.27.3
  - pyyaml: 6.0.1 -> 6.0.2
  - soupsieve: 2.5 -> 2.6
- Fix lint output and GCE query functions for multi-region resources
- Removed deprecated option skip_delete from TF code
v0.74
Full Changelog: v0.68...v0.74
0.74 (2024-7-10)
Fixes
- Re-roll of v0.72 after correcting pip module issue with the docker image build
New Lint Rule
- datafusion/warn_2024_002 Data Fusion instance is in a running state
New Runbook
- dataproc/cluster_creation Dataproc cluster creation diagnostic tree
0.73 (2024-7-8)
New Feature
- Added search command to scan the docstrings for lint rules or runbooks to match keywords
- Added runbook check step outcome: step_ok, step_failed, etc.
- Added a zonal endpoint in osconfig library. It returns inventories for all VMs under a certain zone
Fixes
- Create runbook report regardless of the number of failed steps
- Improve introductory error message for new runbooks
- Update lint command API return value for display of resources in each rule
- General spelling corrections
- Add documentation for runbook operator methods
- Remove unneeded google path reference in loading template block content
- Update runbook name validation
- Handle when gcloud command is not installed when running runbook generator
- Allow to query logs for each test data separately in logs_stub
- Update GKE EOL date
- Relax constraints on location of end steps in runbook
- Update pip dependencies; security fix for pdoc
- Added monitoring to the list of supported products runbook steps
- generic_api/datafusion apis.make_request() re-implementation
- Update and improve runbook error handling
New Lint Rule
- gke/err_2024_001_psa_violations Checking for no Pod Security Admission violations in the project
- bigquery/warn_2024_002_invalid_external_connection BigQuery external connection with Cloud SQL does not fail
- pubsub/err_2024_003_snapshot_creation_fails snapshot creation fails if backlog is too old
- pubsub/err_2024_002_vpc_sc_new_subs_create_policy_violated check for pubsub error due to organization policy
- bigquery/warn_2024_0003 BigQuery job does not fail due to Maximum API requests per user per method exceeded
New Runbook
- gce/ops_agent Ops Agent Onboarding runbook
- gcp/serial_log_analyzer runbook to analyse known issues logged into Serial Console logs
- vertex/workbench_instance_stuck_in_provisioning Runbook to Troubleshoot Issue: Vertex AI Workbench Instance Stuck in Provisioning State
- cloudrun/service_deployment Cloud Run deployment runbook
- gke/ip_exhaustion gke ip exhaustion runbook
- dataflow/failed_streaming_pipeline Diagnostic checks for failed Dataflow Streaming Pipelines
- nat/out_of_resources vm external ip connectivity runbook
v0.67
0.67 (2023-10-17)
Fixes
- Updating GKE EOL file and snapshot
- Rewording message triggering internal leak test
New Command and Rules
- Runbook POC with ssh runbook and terraform scripts
New rules
- GKE cluster has workload identity enabled
- Splunk job uses valid certificate
Full Changelog: v0.66...v0.67
gcpdiag 0.71
0.71 (2024-4-17)
New lint rules
- datafusion/err_2024_001_delete_operation_failing datafusion deletion operation
- gce/err_2024_003_vm_secure_boot_failures GCE Lint rule for boot failures for Shielded VM
- gce/bp_2024_001_legacy_monitoring_agent GCE Legacy Monitoring Agent is not installed
- gce/bp_2024_002_legacy_logging_agent GCE Legacy Logging Agent is not installed
- gce/bp_ext_2024_001_no_public_ip.py GCE SSH in Browser: SSH Button Disabled
- pubsub/bp_2024_001_ouma_less_one_day Oldest Unacked Message Age Value less than 24 hours
- bigquery/err_2024_001_query_too_complex query is too complex
- bigquery/warn_2024_001_imports_or_query_appends_per_table table exceeds limit for imports or query appends
New query
- osconfig: "OS management tools that can be used for patch management, patch compliance, and configuration management on VM instances." https://cloud.google.com/compute/docs/osconfig/rest
New runbook
- gce/vm_termination assist investigating underlying reasons behind termination or reboot
- gke/cluster_autoscaler GKE Cluster autoscaler error messages check
New features
- Add cache bypass option for runbook steps
- Add runbook starter code generator; updates to code generator
- Add API for runbook command
Fixes
- Add mock data for datafusion API testing
- Correct runbook documentation generation output
- Improve runbook operator functions usage
- Add dataflow and other components to supported runbook component list
- Remove duplicate vm_termination.py script
- Add jinja templates to docker image on cloud shell
- correct argv passed for parsing in runbook command
- Adding pipenv and git checks to help beginners get started easily on runbook generator
- update idna pipenv CVE-2024-3651 Moderate severity
- SSH runbook enhancements
- runbook fixes - catch missing template errors, include project id when no parameters
gcpdiag 0.70
0.70 (2024-3-27)
New lint rules
- pubsub/ERR_2024_001 bq subscription table not found
- composer/WARN_2024_001 low scheduler cpu usage
- datafusion/WARN_2024_001 data fusion version
- composer/WARN_2024_002 worker pod eviction
- gce/ERR_2024_002 performance
- notebooks/ERR_2024_001 executor explicit project permissions
- dataflow/WARN_2024_001 dataflow operation ongoing
- dataflow/ERR_2024_001 dataflow gce quotas
- dataflow/WARN_2024_002 dataflow streaming appliance commit failed
- dataflow/ERR_2024_002 dataflow key commit
- gke/WARN_2024_001 cluster nap limits prevent autoscaling
New query
- datafusion_cdap API query implementation - provides CDAP profile metadata
Fixes
- Updated pipenv packages, Pipenv.lock dependencies
- Updated github action workflow versions to stop warnings about node v10 and v10
- Refactor Runbook: Implemented a modular, class-based design to facilitate a more configurable method for tree construction.
v0.69
0.69 (2024-2-21)
New feature
- add universe_domain for Trusted Partner Client (TPC)
New rules
- asm/WARN_2024_001 Webhook failed
- lb/BP_2024_002 Check if global access is on for the regional iLB
- pubsub/WARN_2024_003 Pub/Sub rule: CMEK - Topic Permissions
- dataproc/WARN_2024_001 dataproc check hdfs safemode status
- dataproc/WARN_2024_002 dataproc hdfs write issues
- gce/ERR_2024_001 GCE rule:Snapshot creation rate limit
- lb/BP_2024_001 session affinity enabled on load balancer
- pubsub/WARN_2024_002 GCS subscription has the apt permissions
- dataflow/ERR_2023_010 missing required field
- pubsub/WARN_2024_001 DLQ Subscription has apt permissions
Fixes
- Update Pull Request and Merge to only run when an update was committed
- Creating a github action Workflow to automatically update the gke/eol.yaml file
- Update gke/eol.yaml file
Full Changelog: https://github.com/GoogleCloudPlatform/gcpdiag/commits/v0.69
gcpdiag 0.68
0.68 (2024-1-17)
New Rules
- gke/bp_2023_002 Gke cluster is a private cluster
- composer/err_2023_002 Use allowed IP ranges to create Private IP Cluster
- composer/err_2023_004 DAG is detected as zombie
- composer/err_2023_003 DAG timeout issue
- composer/err_2023_005 Check NAT config for environment deletion fail
- bigquery/err_2023_009 BigQuery job not failed due to Schedule query with multiple DML
- gce/warn_2023_002 Serial logs don’t contain out-of-memory message due to airflow task run
- dataflow/err_2023_011 Streaming insert mismatch column type
- dataflow/err_2023_012 Spanner OOM
- dataflow/err_2023_013 Spanner deadline error
- pubsub/warn_2023_006 Pubsub push subscriptions have no push errors
- dataproc/err_2023_008 Dataproc cluster disk space issues check and web page
- composer/err_2024_001 Composer not failed due to 'no error was surfaced' error
- lb/bp_2023_002 check that logging is enabled on health checks for load balancer backend services
- vpc/warn_2024_001 Check Unused Reserved IP addresses
- iam/sec_2024_001 Detect unused service accounts
New module
- Add billing module query and lint rules
Fixes
- Skip notebook instances query if API is not enabled
- Update MD formatting for gke/WARN/2023_004.md
- Update conflicting credentials import name
- Updating EOL rule snapshot to match new schedule
- Update gke eol.yaml
- add str repr of RuleModule for more info in exceptions loading rules
- fixed bug in billing change 1673236 - added checks for correct permissions
- fixed bug in change id 2113602 - updated condition for check NAT config rule
Features and Improvements
- Improved report generation for runbook
- refactor lint.command.run to return a dict when run from API service
- Add set_credentials() method
- Clear credentials used in API service after request
- Updated gke eol.yaml
- Added the id label to filter the Dataflow jobs using the job id
gcpdiag 0.66
0.66 (2023-11-06)
New rules
- bigquery/ERR/2023_008: user not authorized to perform this action
- pubsub/WARN/2023_005: bigquery subscription has apt permissions
- asm/ERR/2023_001, asm/ERR/2023_002: Anthos Service mesh
- gke/BP/2022_003: Make GKE EOL detection more robust and less hardcoded
- gke/WARN/2023_004: Add a check for too low maxPodsPerNode number
- gke/ERR/2023_012: missing memory request for hpa
- bigquery/ERR/2023_006: bigquery policy does not belong to user
- pubsub/WARN/2023_00[14]: no subscription without attached topic
- composer/WARN/2023_009: Cloud Composer Intermittent Task Failure during Scheduling
New module
- Anthos Service mesh
Fixes
- Handle app failure when project policy contains cross-project service accounts
- Update the version skew for modern versions of Kubernetes. https://kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/#changes-to-supported-skew-between-control-plane-and-node-versions
- Updating wording and typos in multiple files
- Update gke test snapshot.
- added content in md file for rule apigee_err_2023_003