Failing Loki remote backend prevents working backends from receiving data regularly while forwarding logs to multiple Loki clients at once #6963

Open
makeittotop opened this issue Mar 18, 2024 · 4 comments
Labels
bug (Something isn't working) · needs-attention (An issue or PR has been sitting around and needs attention.) · variant/static (Related to Grafana Agent Static.)

Comments

@makeittotop

What's wrong?

I've noticed that when the agent is set up to forward logs to multiple Loki clients, a single failing client (e.g. no process listening on the specified port) starves the other, working Loki endpoints of data until the failing client exhausts all of its max_retries (default = 10). Once the retry loop resets, the same cycle repeats.
As a result, the working clients only receive data every 6 minutes or so, depending on what max_period is set to (default = 5m). This also produces "gaps" in Grafana dashboards when looking at data from those clients.

Steps to reproduce

Take a look at this nominal config:

./agent-local-config.yaml

server:
  log_level: info

logs:
  configs:
  - clients:
    - tls_config:
        insecure_skip_verify: true
      basic_auth:
        password: xxxx
        username: loki
      url: https://logs.my-loki-instance.net/loki/api/v1/push
    - tls_config:
        insecure_skip_verify: true
      url: https://localhost:13100/loki/api/v1/push
      # backoff_config:
      #   # max_retries: 10
      #   max_period: 10s
    name: default
    positions:
      filename: /data/grafana_agent/log-positions.yml
    scrape_configs:
    - job_name: nginx
      pipeline_stages:
      - regex:
          expression: (?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^]]+)\]
            "(?P<request_method>[A-Z]+) (?P<request_url>[^? ]+)[?]*(?P<request_url_params>\S*)
            (?P<request_http_version>[^"]+)" (?P<status_code>\d+) (?P<body_bytes_sent>\d+)
            "(?P<http_referer>[^"]+)" "(?P<http_user_agent>[^"]+)" "(?P<http_x_forwarded_for>[^"]+)"
      - labels:
          remote_user: null
          request_http_version: null
          request_method: null
          request_url: null
          status_code: null
      - timestamp:
          format: 02/Jan/2006:15:04:05 -0700
          source: time_local
      static_configs:
      - labels:
          __path__: /var/log/nginx.log
          instance: dist1.foobar.com
          job: nginx
        targets:
        - dist1.foobar.com

Start the agent as

# /tmp/agent: ./grafana-agent --config.file ./agent-local-config.yaml

Now, let's assume that the localhost:13100 instance is missing for some reason. In such a case I would expect the other endpoint (logs.my-loki-instance) to keep receiving data at the configured scrape interval (60s), but that doesn't happen, as explained above.
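If a second real Loki isn't handy, a throwaway push sink can stand in for the working endpoint so the delivery cadence is easy to observe. This is a hypothetical helper (not part of the agent or Loki): point the first client's url at http://localhost:3100/loki/api/v1/push (plain HTTP, so drop that client's tls_config) and leave nothing listening on 13100.

```go
// push_sink.go: hypothetical stand-in for the healthy Loki endpoint, used only
// to observe how often the working client actually delivers data.
package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/loki/api/v1/push", func(w http.ResponseWriter, r *http.Request) {
		// Log the arrival time of each push; gaps here show the starvation
		// caused by the unreachable localhost:13100 client.
		log.Printf("push received (%d bytes)", r.ContentLength)
		w.WriteHeader(http.StatusNoContent) // Loki replies 204 No Content on success
	})
	log.Fatal(http.ListenAndServe(":3100", nil))
}
```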

System information

Linux 6.5.0-15-generic

Software version

Grafana Agent 0.35.0, and current master at the time of writing

Configuration

(Same as the configuration shown under "Steps to reproduce" above.)


Logs

Mar 18 20:28:22 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:22.36522416Z caller=client.go:430 level=error component=logs logs_config=default component=client host=localhost:13100 msg="final error sending batch" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:22 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:22.507835563Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:23 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:23.271720016Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:25 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:25.123445134Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:28 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:28.795872338Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:35 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:35.337596441Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:28:51 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:28:51.028375765Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:29:08 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:29:08.033159675Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:29:40 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:29:40.383066904Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
Mar 18 20:31:09 dist1.foobar.com grafana-agent[115847]: ts=2024-03-18T20:31:09.086003766Z caller=client.go:419 level=warn component=logs logs_config=default component=client host=localhost:13100 msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://localhost:13100/loki/api/v1/push\": dial tcp 127.0.0.1:13100: connect: connection refused"
@makeittotop makeittotop added the bug Something isn't working label Mar 18, 2024
@makeittotop

From what I can tell with my limited knowledge of Go and channels, it appears that there are two goroutines in this case (one for localhost:13100, the other for logs.my-loki-instance.net) in the grafana-agent process. Both of them are reading from the same channel (api.Entry), which is populated in the promtail package by the readLines() function in grafana/clients/pkg/promtail/targets/file/tailer.go. As the localhost:13100 goroutine gets blocked by falling into retries and exponential backoffs, it delays the other (my-loki) goroutine from receiving data too; at least, my tests confirm this.
Is this because the underlying api.Entry channel is "full" while one of the two receivers is tied up elsewhere? My tests show that as soon as the failing goroutine unblocks after exhausting its retries, both receivers receive data almost immediately.
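As a rough illustration of the suspected behaviour (a standalone sketch, not the agent's actual code, and the channel layout here is an assumption), a fan-out loop that has to hand every entry to every client's channel stalls as soon as the slow client's channel fills up, which throttles the healthy client to the slow one's pace:

```go
// fanout_starvation.go: standalone sketch of the suspected mechanism; the real
// agent's internals may differ. It only demonstrates how one blocked consumer
// of a fan-out can starve another.
package main

import (
	"fmt"
	"time"
)

func main() {
	// Small buffers stand in for the per-client entry channels.
	healthy := make(chan int, 2)
	failing := make(chan int, 2)

	// Healthy client: ships entries as fast as they arrive.
	go func() {
		for e := range healthy {
			fmt.Printf("%s healthy client shipped entry %d\n",
				time.Now().Format("15:04:05"), e)
		}
	}()

	// Failing client: spends 5s per entry, standing in for retries and
	// exponential backoff against an endpoint that refuses connections.
	go func() {
		for range failing {
			time.Sleep(5 * time.Second)
		}
	}()

	// Fan-out: every entry must be handed to both clients. Once the failing
	// client's buffer is full, the send to it blocks, and the healthy client
	// is starved even though it is idle.
	for e := 1; e <= 8; e++ {
		healthy <- e
		failing <- e
	}
	time.Sleep(time.Second) // give the last print a chance to flush before exit
}
```

Running this, the healthy client's output quickly slows to one entry per 5 seconds, the pace of the failing client, and only recovers once the backlog drains, which lines up with the gaps described above.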


@rfratto rfratto transferred this issue from grafana/agent Apr 11, 2024

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@github-actions github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label May 18, 2024
@rfratto rfratto transferred this issue from grafana/alloy Jun 14, 2024
@rfratto rfratto added the variant/static Related to Grafana Agent Static. label Jun 14, 2024
@github-actions github-actions bot removed the needs-attention An issue or PR has been sitting around and needs attention. label Jun 15, 2024

@github-actions github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label Jul 15, 2024