Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fleet-Server abruptly closes connection when limit is reached instead of returning TooManyRequests (429) #4200

Open
belimawr opened this issue Dec 11, 2024 · 3 comments
Assignees
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@belimawr
Copy link
Contributor

When limits.max_connections is set, Fleet-Server will abruptly close connections, which makes the clients (usually Elastic-Agent) to get a read: connection reset by peer. This is not helpful and does not enable the client to correctly adapt its behaviour.

The problem comes from the limitListener (L 69):

func (l *limitListener) Accept() (net.Conn, error) {
// Accept the connection irregardless
c, err := l.Listener.Accept()
if err != nil {
return nil, err
}
// If we cannot acquire the semaphore, close the connection
if acquired := l.acquire(); !acquired {
zlog := log.Warn()
var err error
if c != nil {
err = c.Close()
zlog.Str(logger.ECSServerAddress, c.LocalAddr().String())
zlog.Str(logger.ECSClientAddress, c.RemoteAddr().String())
zlog.Err(err)
}
zlog.Int("max", cap(l.sem)).Msg("Connection closed due to max limit")
return c, nil
}
return &limitListenerConn{Conn: c, release: l.release}, nil
}

If there is a proxy in front of Fleet-Server this can cause a very hard situation to debug from the client's (e.g: Elastic-Agent) perspective: Fleet Server gets a connection that is over the limit, closes the connection, which results in an EOF for the proxy, which translates it into a 502, which the elastic-agent enrol command swallows and does not display.

@cmacknz
Copy link
Member

cmacknz commented Dec 11, 2024

Our cloud proxy will specifically translate an EOF from the fleet-server backend into a 502, which leads us to searching the Fleet Server code for places where 502 could be returned that don't exist.

@michel-laterman
Copy link
Contributor

If we wanted to change fleet-server's behaviour when the connection limit is reached to return a proper HTTP response, we would need to move our connection tracking; however we currently have the assumption that we track connections before the TLS handshake.

// Conn Limiter must be before the TLS handshake in the stack;
// The server should not eat the cost of the handshake if there
// is no capacity to service the connection.
// Also, it appears the HTTP2 implementation depends on the tls.Listener
// being at the top of the stack.
ln = wrapConnLimitter(ctx, ln, s.cfg)

Changing how we do this will have performance implications

@blakerouse
Copy link
Contributor

It was specifically implemented this way to not require TLS handshake of max connections.

@ycombinator ycombinator added the bug Something isn't working label Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

No branches or pull requests

5 participants