Pointing to an ExternalName service without a DNS record can overload the DNS service #6523

lucianjon · 2020-11-25T20:43:08Z

NGINX Ingress controller version: v0.41.2

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.10", GitCommit:"f3add640dbcd4f3c33a7749f38baaac0b3fe810d", GitTreeState:"clean", BuildDate:"2020-05-20T14:00:52Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.9", GitCommit:"94f372e501c973a7fa9eb40ec9ebd2fe7ca69848", GitTreeState:"clean", BuildDate:"2020-09-16T13:47:43Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Environment:

Cloud provider or hardware configuration: kops managed cluster on AWS
OS (e.g. from /etc/os-release):

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal

Kernel (e.g. uname -a): Linux ip-10-60-10-234 5.4.0-1024-aws #24-Ubuntu SMP Sat Sep 5 06:19:55 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

What happened:

If an ingress definition points to an ExternalName service whose external name produces a DNS lookup error, an endless loop of DNS requests is created that can bring the system down.

We noticed this when migrating from v0.19.0 to v0.41.2, with both controllers running in parallel. One of our teams was preparing for the migration by creating routes that pointed to DNS records that did not exist yet. The old controllers appeared unaffected, but the routes on the new controller generated a huge number of DNS lookups. No actual requests to the routes are required; just creating the ingress and service definitions is enough.

Eventually this overwhelmed dnsmasq and brought down our cluster's DNS. dnsmasq limited the number of concurrent requests, but we were still seeing thousands of requests per second. Was there a behaviour change between the two versions that could introduce this, and is it expected? My naive guess is that there would typically be some kind of exponential backoff on a DNS lookup error.

This is the error produced by the controller:

2020/11/25 20:18:52 [error] 1707#1707: *51723 [lua] dns.lua:152: dns_lookup(): failed to query the DNS server for foo.unknown.com:
server returned error code: 3: name error
server returned error code: 3: name error, context: ngx.timer

What you expected to happen:

DNS lookup failures to be handled with some form of backoff.
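
To illustrate the kind of backoff meant here, below is a minimal sketch in Python. It is not the controller's code (the failing lookup in the log above comes from dns.lua), and every name in it (backoff_delays, resolve_with_backoff, the base/cap/max_attempts parameters) is hypothetical. It only shows the idea: after each failed lookup, wait twice as long before asking the DNS server again, up to a cap.

```python
import socket
import time
from typing import Callable, Iterator, List


def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield delays that double after every failure, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def _system_resolve(hostname: str) -> List[str]:
    """Default resolver: ask the system resolver for the host's IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def resolve_with_backoff(
    hostname: str,
    max_attempts: int = 5,
    resolve: Callable[[str], List[str]] = _system_resolve,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Resolve `hostname`, waiting longer after each failed lookup.

    Returns an empty list if every attempt fails, so the caller can carry on
    without endpoints instead of retrying in a hot loop.
    """
    delays = backoff_delays()
    for attempt in range(max_attempts):
        try:
            return resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt < max_attempts - 1:
                sleep(next(delays))
    return []
```

With the defaults assumed above, a name that never resolves costs five queries spread over roughly 15 seconds, rather than thousands of queries per second.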

How to reproduce it:

These two definitions should be enough to reproduce the issue, assuming a proper class and namespace:

---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: dns-issue-repro
  namespace: default
  annotations:
    kubernetes.io/ingress.provider: "nginx"
    kubernetes.io/ingress.class: "external"
spec:
  rules:
    - host: foo.unknown.com
      http:
        paths:
          - path: /
            backend:
              serviceName: bad-svc
              servicePort: 80

---
apiVersion: v1
kind: Service
metadata:
  name: bad-svc
  namespace: default
spec:
  type: ExternalName
  externalName: foo.unknown.com

/kind bug


aledbf · 2020-11-25T20:50:39Z

The behavior changed here #4671

fejta-bot · 2021-02-23T21:08:45Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

fejta-bot · 2021-03-25T21:54:34Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

fejta-bot · 2021-04-24T22:32:59Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot · 2021-04-24T22:33:09Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

adamcharnock · 2022-02-15T17:01:13Z

Yep, this just got me too while working on a new cluster. Nginx Ingress essentially DoSed CoreDNS, which caused all kinds of weirdness in the cluster.

Edit: Running k8s.gcr.io/ingress-nginx/controller:v1.1.1

unnikm8 · 2022-02-18T14:56:18Z

I am getting this issue too.

Running k8s.gcr.io/ingress-nginx/controller:v1.1.0

VsevolodSauta · 2022-04-06T09:38:04Z

I'm also affected by this issue. Hoping for some activity on it.
/reopen

k8s-ci-robot · 2022-04-06T09:38:19Z

@VsevolodSauta: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

I'm also affected by this issue. Hoping for some activity on it.
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

javimosch · 2022-09-22T15:05:20Z

I'm also getting this issue:
k8s.gcr.io/ingress-nginx/controller:v1.2.0

karlhaworth · 2022-09-23T12:46:06Z

Same issue.

dexterlakin-bdm · 2022-10-03T11:02:04Z

Why is this closed?

I am also seeing the same issue - has anyone here resolved it or has a workaround?

longwuyuan · 2022-10-03T11:20:11Z

Even without Kubernetes, if a process makes calls to an unresolvable hostname in an infinite loop, there will be an impact. Thanks, Long


sravanakinapally · 2022-10-07T19:00:28Z

Same issue
dns.lua:152: dns_lookup(): failed to query the DNS server for

ahmad-sharif · 2022-10-11T05:16:27Z

I am also having the same issue with v1.3.1 in some clusters

qixiaobo · 2023-02-07T12:47:27Z

Same problem, keep watching

alv91 · 2023-03-10T12:47:23Z

+1

fuog · 2023-10-26T18:27:37Z

We are experiencing identical issues on both GKE and AKS clusters while using ingress-nginx versions 1.9.1 and 1.9.3.

Occasionally, we encounter situations where the backend resides outside the cluster. The "ExternalName" record is dynamically resolved using endpoints controlled by Consul. However, if it is the only backend service, or the last one remaining, and it deregisters for a reason such as a reboot, the "ExternalName" points at a non-existent CNAME record. This, in turn, causes ingress-nginx to go completely crazy with errors like these:

2023/10/26 18:16:18 [error] 432#432: *18134 [lua] dns.lua:152: dns_lookup(): failed to query the DNS server for my-not-existing-record.example.com:
server returned error code: 3: name error
server returned error code: 3: name error, context: ngx.timer

In situations where there are only a few occurrences, this behavior can be obscured by the sheer volume of logs. However, when a substantial number of endpoints become unreachable all at once, compounded by the number of ingress-nginx pods we run (in our scenario, both internal and external-facing ingress classes), the problem escalates significantly and places a severe burden on our CoreDNS servers, potentially overwhelming them.

What I would like to see is a limit on the number of resolve attempts, a rate limit on resolve retries, or, even more desirable, a back-off mechanism.
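
As a rough sketch of what limiting the resolve-retry rate could look like, here is a small negative cache in Python: once a name fails to resolve, further lookups for it are answered locally until a TTL expires, so only one query per name per window reaches the DNS server. This is an illustration only, not ingress-nginx code; the class name NegativeCachingResolver, the negative_ttl parameter, and the 30-second default are all assumptions.

```python
import time
from typing import Callable, Dict, List, Optional


class NegativeCachingResolver:
    """Wrap a resolver and remember failed names for `negative_ttl` seconds."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]],
        negative_ttl: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._failed_until: Dict[str, float] = {}
        self.upstream_queries = 0  # how many lookups actually hit DNS

    def lookup(self, hostname: str) -> Optional[List[str]]:
        """Return addresses, or None if the name is (still) unresolvable."""
        now = self._clock()
        expires = self._failed_until.get(hostname)
        if expires is not None and now < expires:
            return None  # inside the negative-cache window: skip the DNS query
        self.upstream_queries += 1
        try:
            addresses = self._resolve(hostname)
        except OSError:
            # Remember the failure so repeated callers do not hammer DNS.
            self._failed_until[hostname] = now + self._negative_ttl
            return None
        self._failed_until.pop(hostname, None)
        return addresses
```

With this in front of the resolver, a timer that fires every second for a dead name would produce about two upstream queries per minute per name (at the assumed 30-second TTL) instead of one per tick.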

mjozefcz · 2023-10-31T15:23:47Z

We're experiencing the same behavior. With a few 'invalid' or 'temporarily invalid' ExternalName service backend configurations, we noticed tons of messages like this and a huge number of DNS calls.

We tested the same scenario with Traefik as the ingress controller: no issue at all, just a 502 response to the client call.

tao12345666333 · 2024-02-17T10:56:01Z

/reopen

k8s-ci-robot · 2024-02-17T10:56:05Z

@tao12345666333: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2024-02-17T10:56:12Z

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2024-03-18T11:39:58Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-03-18T11:40:03Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lucianjon added the kind/bug label Nov 25, 2020

nic-6443 mentioned this issue Dec 13, 2020: Allow FQDN for ExternalName Service #6617 (Merged)

k8s-ci-robot added the lifecycle/stale label Feb 23, 2021

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 25, 2021

k8s-ci-robot closed this as completed Apr 24, 2021

marcandrews mentioned this issue Nov 2, 2022: dns_lookup(): failed to query the DNS server: no AAAA record resolved, context: ngx.timer error #9248 (Closed)

neerfri mentioned this issue Feb 17, 2024: fix DNS issues with unresolvable backends with ExternalName #10989 (Open)

k8s-ci-robot reopened this Feb 17, 2024

k8s-ci-robot added the needs-triage label Feb 17, 2024

k8s-ci-robot added the needs-priority label Feb 17, 2024

k8s-ci-robot closed this as not planned Mar 18, 2024
