Thanos Receive

Table of Contents

[TOC]

Docs: https://thanos.io/tip/components/receive.md/

Thanos Receive implements a remote write endpoint for Prometheus. We are using it to more easily ingest metrics from various projects.

The receivers run in ops and are deployed by the k8s-workloads Helm charts.

Receive Components

There are 4 components that make up the receiver.

Nginx

Nginx is currently used for authentication and tenant header injection. When a request is sent to the remote-write endpoint, nginx first authenticates the credentials using htpasswd/basicAuth, and then sets the THANOS_TENANT header to the username.
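The authenticate-then-inject flow can be sketched in Python as follows. This is only an illustration of the logic, not the nginx configuration itself: the function names are hypothetical, and the plain-text credentials dictionary stands in for the hashed entries an htpasswd file would hold.

```python
import base64


def tenant_from_basic_auth(authorization, credentials):
    """Return the username if the Basic auth header is valid, else None.

    `credentials` is a hypothetical {username: password} mapping; a real
    htpasswd file stores password hashes rather than plain text.
    """
    scheme, _, encoded = authorization.partition(" ")
    if scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None
    username, sep, password = decoded.partition(":")
    if not sep or credentials.get(username) != password:
        return None
    return username


def inject_tenant_header(headers, credentials):
    """Return a copy of the headers with THANOS_TENANT set, or None on auth failure."""
    tenant = tenant_from_basic_auth(headers.get("Authorization", ""), credentials)
    if tenant is None:
        return None  # the proxy would reject the request at this point
    forwarded = dict(headers)
    forwarded["THANOS_TENANT"] = tenant
    return forwarded
```

The tenant is derived only from the authenticated username, so a client cannot choose its own tenant by sending a header directly.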

Receive Distributor (Router)

The distributor (AKA router) is responsible for routing requests to downstream receivers. It leverages a hashring config file, hashring.json, which tells the distributor which tenants should be sent to which receiver.

Example File:

[
    {
        "hashring": "hashring0",
        "tenants": ["high_volume_tenant_1", "tenant_b"],
        "endpoints": ["thanos-high-volume-receiver-1:10901"]
    },
    {
        "hashring": "hashring1",
        "tenants": [],
        "endpoints": ["thanos-catchall-receiver-1:10901"]
    }
]

Matching works on a first-match basis. In the above example, tenant_b matches hashring0, while any tenant that is not listed falls through to hashring1 (an empty tenants list matches every tenant).

The hashring config file is updated automatically by the receive-controller.
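As a rough illustration of the first-match rule, the sketch below walks the example file above in order and returns the first hashring whose tenants list is empty or contains the tenant. The function name match_hashring is hypothetical, and the sketch covers only tenant matching, not how a receiver is picked inside a hashring.

```python
import json

HASHRING_JSON = """
[
    {
        "hashring": "hashring0",
        "tenants": ["high_volume_tenant_1", "tenant_b"],
        "endpoints": ["thanos-high-volume-receiver-1:10901"]
    },
    {
        "hashring": "hashring1",
        "tenants": [],
        "endpoints": ["thanos-catchall-receiver-1:10901"]
    }
]
"""


def match_hashring(tenant, hashrings):
    """Return the name of the first hashring that accepts the tenant, or None."""
    for ring in hashrings:
        tenants = ring.get("tenants") or []
        # An empty tenants list acts as a catch-all.
        if not tenants or tenant in tenants:
            return ring["hashring"]
    return None


hashrings = json.loads(HASHRING_JSON)
print(match_hashring("tenant_b", hashrings))           # hashring0
print(match_hashring("some_other_tenant", hashrings))  # hashring1
```

Because evaluation stops at the first match, a catch-all entry only makes sense as the last element of the list.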

Receive (Ingester)

The receive ingester is the statefulset responsible for persisting write requests to disk. It also replicates data based on the configured replication factor, to ensure data availability in the event a pod goes down. Much like other Thanos components that receive or write data, it uploads to our long-term storage bucket on a 2-hour interval (by default).

Receive Controller

The receive controller helps with discoverability and scalability of receive ingester pods, and updates the distributor as needed. It does this by querying the Kubernetes API to discover the provisioned pods in a given statefulset. When changes are detected, it updates the hashring.json config and updates an annotation on the receive distributor to force an immediate re-read of the mounted config.
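The reconcile step described above can be sketched roughly as follows. This is not the controller's actual code: build_hashrings and config_changed are hypothetical names, the pod lists are passed in rather than fetched from the Kubernetes API, and the pod-name:port endpoint format is an assumption made for illustration.

```python
import json


def build_hashrings(base_hashrings, pods_by_hashring, port=10901):
    """Regenerate hashring entries from the pods currently discovered.

    `pods_by_hashring` is a hypothetical {hashring_name: [pod_name, ...]}
    mapping standing in for a statefulset lookup against the Kubernetes API.
    """
    generated = []
    for ring in base_hashrings:
        pods = sorted(pods_by_hashring.get(ring["hashring"], []))
        generated.append({
            "hashring": ring["hashring"],
            "tenants": list(ring.get("tenants", [])),
            # Assumed endpoint format; real endpoints may be fully qualified.
            "endpoints": [f"{pod}:{port}" for pod in pods],
        })
    return generated


def config_changed(current_json, generated):
    """Return True when the generated config differs from what is deployed."""
    return json.loads(current_json) != generated
```

Only when config_changed returns True would the config be rewritten and the distributor annotated, which avoids needless reloads on every poll.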

This enables us to scale out the ingester statefulsets automatically based on load at a given period.

Configuring Tenants

We leverage tenants in Thanos to help identify the origin of metrics, as well as to provide limits/quotas to given teams or environments. Tenants for Thanos Receive are configured in two parts:

An entry for the tenant and limits in k8s-workloads
Tenant Credentials in Vault

Adding A New Prometheus Client

After you have set up the tenant (or chosen an existing one), you can give the auth credentials and the Thanos Receive endpoint URL to the team. Prometheus configuration is done via the remote_write config.

Example:

remoteWrite:
  - url: https://remote-write.ops.gke.gitlab.net/api/v1/receive
    name: thanos
    basicAuth:
      username:
        name: remote-write-auth
        key: username
      password:
        name: remote-write-auth
        key: password

Unfortunately, Prometheus doesn't support environment variable substitution in the config file. However, when deployed via prometheus-operator, it does support a Kubernetes secret reference. In the above example we point the auth to a secret named remote-write-auth and the corresponding object keys for both username and password.

Here is an example config from Code Suggestions.

Note that the current usage of htpasswd/basicAuth will be replaced in a future iteration.

Scaling

All components in the receive service are built with autoscaling via Kubernetes HPA. Both nginx and the receive distributor are deployments and scale normally based on the HPA's configured thresholds.

The receive ingester, however, is a statefulset. While it is a stateful workload, we are able to scale it as well via an HPA, leveraging the thanos-controller.
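For intuition on how an HPA turns a threshold into a replica count, the sketch below applies the formula from the Kubernetes documentation: desired = ceil(current replicas * current metric / target metric), skipped when the ratio is within a tolerance of 1 (0.1 by default). The metric values and replica bounds used here are made up for illustration and are not our configured thresholds.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA scaling decision for a single metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the HPA's configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical numbers: 3 replicas at 90% utilisation against a 60% target.
print(desired_replicas(3, 90, 60, min_replicas=3, max_replicas=10))  # 5
```

For the ingester statefulset, a change in replica count is what the controller picks up to regenerate the hashring config.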

Monitoring of Receive

We have implemented initial rules, in the rules config for the Receive deployment, to notify us when a tenant is approaching their quotas. This will post to the Observability Team's Slack channel.

Troubleshooting

Prometheus Remote Write 429 Errors

We enforce limits for tenants in Thanos. A 429 response indicates that the client is being rate limited. If this is seen from a Prometheus client:

Check the dashboard to validate that the tenant has reached its limit.
Validate that there are no drastic changes in the given Prometheus client workload.
If required, increase the tenant limits.

Before increasing limits, it's important we ensure that the given tenant's increase in metrics is valid and required. This is a good opportunity to look into unused metrics and potential cardinality explosions. If possible, we should encourage dropping metrics that are not in use before increasing the set limits.

Remote Write requests failing

These failures are likely to result in 500 errors. There are a few things we can check:

Ensure the nginx pods are running and processing requests.
Make sure the receive distributor pods are running.
Check that the receive statefulset pods are running and have quorum. We use a replication factor of 3, so at least 2 pods must be available at any given time.
Lastly, ensure the generated config matches the state of the active receivers: kubectl -n thanos get cm thanos-thanos-stack-tenants-generated -o yaml
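The quorum check in the list above follows the usual majority rule, floor(replication factor / 2) + 1, which gives 2 for a replication factor of 3. A minimal sketch, with hypothetical function names and a simplified view that only counts ready pods:

```python
def write_quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def has_quorum(ready_pods, replication_factor=3):
    """Return True when enough pods are ready to satisfy write quorum."""
    return ready_pods >= write_quorum(replication_factor)


print(write_quorum(3))  # 2
print(has_quorum(1))    # False
```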

Current Limitations

Currently, the main limitation of this solution is the use of basicAuth via nginx. This will be replaced with OAuth in a future iteration.
