Provide a way to customize via Spring framework a TTL for Netty's DNS cache #3517

Open
dimzul opened this issue Sep 6, 2024 · 5 comments

dimzul commented Sep 6, 2024

Problem

In a Kubernetes environment, multiple instances of the same service are hidden behind a k8s Service name (for example, my-test.my-namespace.svc.cluster.local). The same applies to the DNS servers in k8s: multiple instances are hidden behind a k8s Service. When one DNS server instance dies and comes back on a new k8s node with a different IP address, the old address is still used for DNS resolution, because Netty (a transitive dependency of project-reactor) caches the IP addresses of DNS servers via DnsNameResolverBuilder and DefaultAuthoritativeDnsServerCache for Integer.MAX_VALUE seconds by default. Requests then go to an IP address with no DNS server listening, which causes the following error:

500 Server Error for HTTP GET "/my-test"
io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'my-test' [A(1)] and search domain query for configured domains failed as well: [production.svc.cluster.local, svc.cluster.local, cluster.local]
	at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1151)
	Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
	*__checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain]
	*__checkpoint ⇢ HTTP GET "/my-test" [ExceptionHandlingWebHandler]
Original Stack Trace:
		at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1151)
		at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1098)
		at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:457)
		at io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:69)
		at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:526)
		at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
		at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
		at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
		at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
		at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
		at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
		at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
		at io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:380)
		at io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:315)
		at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
		at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
		at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
		at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
		at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
		at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405)
		at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
		at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
		at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
		at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [15365: /12.34.5.67:53] DefaultDnsQuestion(my-test.production.svc.cluster.local. IN A) query '15365' via UDP timed out after 5000 milliseconds (no stack trace available)
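
The caching behaviour described in the problem statement can be illustrated with a minimal sketch. This is not Netty's implementation: `TtlCacheSketch`, its methods, and the addresses in the usage note are hypothetical names chosen for illustration, and time is passed in explicitly so the expiry logic is easy to follow. It only shows why an entry stored with a TTL of Integer.MAX_VALUE seconds is effectively never re-resolved, while a short TTL forces a fresh lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a TTL-bounded DNS cache; NOT Netty's implementation.
final class TtlCacheSketch {

    // One cached address together with the time at which it stops being valid.
    private static final class Entry {
        final String address;
        final long expiresAtSeconds;

        Entry(String address, long expiresAtSeconds) {
            this.address = address;
            this.expiresAtSeconds = expiresAtSeconds;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long ttlSeconds;

    TtlCacheSketch(long ttlSeconds) {
        this.ttlSeconds = ttlSeconds;
    }

    // Stores an address; it expires ttlSeconds after nowSeconds.
    void put(String host, String address, long nowSeconds) {
        entries.put(host, new Entry(address, nowSeconds + ttlSeconds));
    }

    // Returns the cached address, or null when the entry is absent or expired,
    // in which case the caller would have to resolve the host again.
    String get(String host, long nowSeconds) {
        Entry entry = entries.get(host);
        if (entry == null || nowSeconds >= entry.expiresAtSeconds) {
            entries.remove(host);
            return null;
        }
        return entry.address;
    }
}
```

With `new TtlCacheSketch(5)`, an address stored at second 0 is gone at second 5, so a replaced DNS server would be picked up on the next lookup. With `new TtlCacheSketch(Integer.MAX_VALUE)`, the same address is still returned a day later, which matches the stale-address symptom reported here.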

Steps to reproduce

Based on suggestions by @violetagg and @spencergibb on customizing DNS cache TTL and TcpClient in Spring Cloud Gateway, I created the following configuration:

@Component
@Configuration(proxyBeanMethods = false)
public class DnsCacheCustomizer implements HttpClientCustomizer {

    private static final int CACHE_TTL = 5;

    @Bean
    ClientHttpConnector clientHttpConnector(ReactorResourceFactory resourceFactory) {
        TcpClient tcpClient = TcpClient.create(resourceFactory.getConnectionProvider())
                .resolver(nameResolverSpec -> nameResolverSpec.cacheMaxTimeToLive(Duration.ofSeconds(CACHE_TTL)));
        return new ReactorClientHttpConnector(HttpClient.from(tcpClient));
    }

    @Override
    public HttpClient customize(HttpClient httpClient) {
        DnsNameResolverBuilder dnsResolverBuilder = new DnsNameResolverBuilder()
                .channelFactory(EpollDatagramChannel::new)
                .resolveCache(new DefaultDnsCache(0, CACHE_TTL, 0));
        httpClient
                .resolver(nameResolverSpec -> nameResolverSpec.cacheMaxTimeToLive(Duration.ofSeconds(CACHE_TTL)))
                .tcpConfiguration(tcpClient -> tcpClient.resolver(new DnsAddressResolverGroup(dnsResolverBuilder)));
        return httpClient;
    }
}

With this configuration, multiple instances of DnsNameResolverBuilder were created: two with the configured cache TTL and two with the default cache TTL:
[Screenshots: cache_as_configured_1, cache_as_configured_2, cache_as_default_1, cache_as_default_2]

But when an actual request comes in, the DnsNameResolverBuilder with the default cache TTL is used, so the DNS cache applies the default TTL (2147483647 seconds):
[Screenshot: actual request]

Expected result

There should be a way to configure the DNS cache TTL via Spring Framework.

Versions

spring boot/spring-cloud-starter-gateway/spring-boot-starter-webflux: 3.2.8
reactor-netty-http: 1.1.21
netty: 4.1.111.Final


bindupatnaik commented Sep 15, 2024

@spring-cloud-issues any update on this issue? I am facing the same problem described here and am looking for help.
I also commented on the open issue #561. Please provide an update on when this issue will be fixed. I tried all the workarounds mentioned, with no luck.

@bindupatnaik

@dimzul did you find any workarounds for this problem? I am happy to connect with you to discuss further.

Author

dimzul commented Sep 24, 2024

@bindupatnaik, unfortunately, no: none of the provided solutions has any effect on the DNS cache TTL in Netty. I debugged it locally and tested it in a real cluster, and got the same result with the default TTL applied. Switching to the JVM built-in resolver also had no effect:

    @Override
    public HttpClient customize(HttpClient httpClient) {
        httpClient
                .resolver(DefaultAddressResolverGroup.INSTANCE)
                .tcpConfiguration(tcpClient -> tcpClient.resolver(DefaultAddressResolverGroup.INSTANCE));
        return httpClient;
    }

If you find a solution, please share it here.

Contributor

violetagg commented Sep 24, 2024

@dimzul

This configuration is not quite correct. Use either HttpClient#resolver or HttpClient#tcpConfiguration, but never both.
I would recommend HttpClient#resolver. HttpClient#tcpConfiguration is deprecated, and everything that you can configure there can be configured by invoking HttpClient directly.

    @Override
    public HttpClient customize(HttpClient httpClient) {
        httpClient
                .resolver(DefaultAddressResolverGroup.INSTANCE)
                .tcpConfiguration(tcpClient -> tcpClient.resolver(DefaultAddressResolverGroup.INSTANCE));
        return httpClient;
    }

DefaultAddressResolverGroup.INSTANCE is the JDK's built-in domain name lookup mechanism, so you need to use the JDK configuration for the TTL.

I also do not recommend using HttpClient#from which is also deprecated.
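
For the JDK configuration mentioned above, a minimal sketch follows. It assumes the standard `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties are what govern the JDK resolver's cache; the class name `JdkDnsCacheTtl` and the chosen values are hypothetical. To my understanding the JDK reads these properties once, so the call would need to run early in startup, before the first host name lookup. Whether this resolves the problem reported in this issue has not been confirmed in the thread.

```java
import java.security.Security;

// Hypothetical helper; the property names are the JDK's documented networking
// security properties, the class and method names are illustrative only.
final class JdkDnsCacheTtl {

    // Sets the JDK-level DNS cache TTLs (in seconds) for successful and failed lookups.
    // Assumption: this runs before the JVM performs its first InetAddress lookup.
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

Calling `JdkDnsCacheTtl.apply(5, 0)` at the start of `main`, before the Spring application starts, would be one place to do this. The same properties can also be set in the JDK's `java.security` file.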


ParkerM commented Oct 1, 2024

    @Override
    public HttpClient customize(HttpClient httpClient) {
        httpClient
                .resolver(DefaultAddressResolverGroup.INSTANCE)
                .tcpConfiguration(tcpClient -> tcpClient.resolver(DefaultAddressResolverGroup.INSTANCE));
        return httpClient;
    }

Note that the fluent config methods in reactor-netty's HttpClient don't modify the instance -- they configure and return a duplicated instance. This has bitten me before, and was ultimately solved by reassigning each call or returning the entire chain. Try this:

    @Override
    public HttpClient customize(HttpClient httpClient) {
        return httpClient
                .resolver(DefaultAddressResolverGroup.INSTANCE)
                .tcpConfiguration(tcpClient -> tcpClient.resolver(DefaultAddressResolverGroup.INSTANCE));
    }
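
The pitfall can be reproduced without reactor-netty. The sketch below uses a hypothetical `FluentClientSketch` class as a stand-in for any immutable fluent API. It is not `HttpClient`, and its method names are invented for illustration. It only demonstrates why discarding the return value leaves the original configuration in place.

```java
// Hypothetical stand-in that mimics an immutable fluent API; NOT reactor-netty's HttpClient.
final class FluentClientSketch {

    private final String resolverName;

    FluentClientSketch(String resolverName) {
        this.resolverName = resolverName;
    }

    // Like an immutable fluent method: returns a configured copy, leaves "this" unchanged.
    FluentClientSketch resolver(String newResolverName) {
        return new FluentClientSketch(newResolverName);
    }

    String resolverName() {
        return resolverName;
    }

    // Mirrors the problematic customizer: the configured copy is discarded.
    static FluentClientSketch customizeDiscardingResult(FluentClientSketch client) {
        client.resolver("custom");
        return client; // still the unconfigured original
    }

    // Mirrors the suggested fix: the configured copy is returned.
    static FluentClientSketch customizeReturningChain(FluentClientSketch client) {
        return client.resolver("custom");
    }
}
```

Starting from `new FluentClientSketch("default")`, the first method hands back a client whose resolver is still `default`, while the second hands back one whose resolver is `custom`.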
