s6 service issue on start up after down/up or restart of the mailserver container #68

Open
2 tasks done
kevinrode opened this issue May 31, 2024 · 4 comments · May be fixed by #70

Comments

@kevinrode

Classification

  • Serious bug

Reproducibility

  • Always

Docker information

#docker info
Client: Docker Engine - Community
 Version:    26.1.3
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 6
  Running: 6
  Paused: 0
  Stopped: 0
 Images: 18
 Server Version: 26.1.3
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8b3b7ca2e5ce38e8f31a34f35b2b68ceb8470d89
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-107-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.82GiB
 Name: mail
 ID: 7881e465-10f6-49b0-9515-fdb559ed7cdc
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

#docker images mailserver2/mailserver --digests --filter "dangling=false"
REPOSITORY               TAG       DIGEST                                                                    IMAGE ID       CREATED        SIZE
mailserver2/mailserver   1.1.17    sha256:9ee31561400010b15c73cec122345017f6759cc8380bd8ba49d08bdc17a311fd   cf4a37c964d9   2 months ago   423MB

Description

After a down and up, or just a restart, of the "mailserver2/mailserver" container, s6 fails to start its services properly.

mailserver-1    | s6-svc: fatal: unable to control /services/rsyslogd: supervisor not listening
mailserver-1    | s6-svc: fatal: unable to control /services/unbound: supervisor not listening
mailserver-1    | s6-svwait: fatal: unable to s6_svstatus_read: No such file or directory

It seems that only a restart of the host system lets the container start normally again without errors. We also tried restarting the Docker service, but that does not seem to help.

Steps to reproduce

docker compose down && docker compose up
docker restart docker-mailserver-1

Expected results

All services start without issues

Actual results

s6 runs into issues on startup

mailserver-1    | s6-svc: fatal: unable to control /services/rsyslogd: supervisor not listening
mailserver-1    | s6-svc: fatal: unable to control /services/unbound: supervisor not listening
mailserver-1    | s6-svwait: fatal: unable to s6_svstatus_read: No such file or directory

Debugging information

#docker compose logs mailserver -f
mailserver-1  | [INFO] MariaDB/PostgreSQL hostname not found in /etc/hosts
mailserver-1  | [INFO] Container IP found, adding a new record in /etc/hosts
mailserver-1  | [INFO] Redis hostname not found in /etc/hosts
mailserver-1  | [INFO] Container IP found, adding a new record in /etc/hosts
mailserver-1  | [INFO] Search for SSL certificates generated by Traefik
mailserver-1  | [INFO] acme.json found with Traefik v2 format, dumping into pem files
mailserver-1  | [INFO] Live Certificates match
mailserver-1  | [INFO] Starting services
mailserver-1  | s6-svc: fatal: unable to control /services/rsyslogd: supervisor not listening
mailserver-1  | s6-svc: fatal: unable to control /services/unbound: supervisor not listening
mailserver-1  | s6-svwait: fatal: some services reported permanent failure or their supervisor died

Configuration (docker-compose.yml, traefik.toml...etc)

cat docker-compose.yml
#version: '3.7'

# IPv4 only
# docker network create http_network

# IPv4/IPv6 network
# docker network create http_network --ipv6 --subnet "fd00:0000:0000:0000::/64"
# Refer to https://github.com/hardware/mailserver/#ipv6-support for more information.

networks:
  http_network:
    external: true
  mail_network:
    external: false

services:

  traefik:
    image: "traefik:${TRAEFIK_DOCKER_TAG}"
    restart: ${RESTART_MODE}
    networks:
      - http_network
    ports:
      # This allows incoming connections on 80 to be forwarded to port 80 of traefik
      - "80:80"
      # This allows incoming connections on 443 to be forwarded to port 443 of traefik
      - "443:443"
      # As above. Browse to port 8080 http to see traefik dashboard
      #      - "8080:8080"
    command:
      - "--log.level=DEBUG"
    volumes:
      # static config
      - "${VOLUMES_ROOT_PATH}/traefik/traefik.toml:/traefik.toml"
      # dynamic config
      - "${VOLUMES_ROOT_PATH}/traefik/file.toml:/file.toml"
      # let's encrypt data
      - "${VOLUMES_ROOT_PATH}/traefik/acme:/acme"
      # This is required for the docker provider of traefik to work (read container labels, etc)
      - "/var/run/docker.sock:/var/run/docker.sock:ro"

  mailserver:
    image: mailserver2/mailserver:${MAILSERVER_DOCKER_TAG}
    restart: ${RESTART_MODE}
    domainname: ${MAILSERVER_DOMAIN}                    # Mail server A/MX/FQDN & reverse PTR = mail.domain.tld.
    hostname: ${MAILSERVER_HOSTNAME}
    # extra_hosts:                          - Required for external database (on other server or for local database on host)
    #  - "mariadb:xx.xx.xx.xx"              - Replace with IP address of MariaDB server
    #  - "redis:xx.xx.xx.xx"                - Replace with IP address of Redis server
    ports:
      - "25:25"       # SMTP                - Required
    # - "110:110"     # POP3       STARTTLS - Optional - For webmails/desktop clients
      - "143:143"     # IMAP       STARTTLS - Optional - For webmails/desktop clients
    # - "465:465"     # SMTPS      SSL/TLS  - Optional - Enabled for compatibility reason, otherwise disabled
      - "587:587"     # Submission STARTTLS - Optional - For webmails/desktop clients
      - "993:993"     # IMAPS      SSL/TLS  - Optional - For webmails/desktop clients
    # - "995:995"     # POP3S      SSL/TLS  - Optional - For webmails/desktop clients
      - "4190:4190"   # SIEVE      STARTTLS - Optional - Recommended for mail filtering
    # - "11334:11334" # HTTP                - Optional - Rspamd WebUI
    environment:
      - DBPASS=${DATABASE_USER_PASSWORD}       # MariaDB database password (required)
      - RSPAMD_PASSWORD=${RSPAMD_PASSWORD}     # Rspamd WebUI password (required)
      - ADD_DOMAINS=${ADD_DOMAINS}             # Add additional domains separated by commas (needed for dkim keys etc.)
    # - DEBUG_MODE=true                        # Enable Postfix, Dovecot, Rspamd and Unbound verbose logging
    # - ENABLE_POP3=true                       # Enable POP3 protocol
    # - ENABLE_FETCHMAIL=true                  # Enable fetchmail forwarding
    # - DISABLE_RATELIMITING=false             # Enable ratelimiting policy
    # - DISABLE_CLAMAV=true                    # Disable virus scanning
    # - DISABLE_SIGNING=true                   # Disable DKIM/ARC signing
    # - DISABLE_GREYLISTING=true               # Disable greylisting policy
    # - DISABLE_VHOSTS_OWNERSHIP_SET=true     # Disable vhosts directory ownership set (useful when you have lots of mailboxes)
    #
    # Full list : https://github.com/hardware/mailserver#environment-variables
    #
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=http_network"
      - "traefik.http.routers.spam.entrypoints=websecure"
      - "traefik.http.routers.spam.rule=Host(`spam.${MAILSERVER_DOMAIN}`)"
      - "traefik.http.routers.spam.service=spam"
      - "traefik.http.routers.spam.tls=true"
      - "traefik.http.routers.spam.tls.certresolver=letsencrypt"
      - "traefik.http.routers.spam.tls.domains[0].main=${MAILSERVER_HOSTNAME}.${MAILSERVER_DOMAIN}"
      - "traefik.http.routers.spam.tls.domains[0].sans=my.domain1.com, my.domain2.com, my.domain3.com, my.domain4.com"
      - "traefik.http.routers.spam.tls.options=default"
      - "traefik.http.services.spam.loadbalancer.server.port=11334"
      - "traefik.http.services.spam.loadbalancer.server.scheme=http"
    volumes:
      - ${VOLUMES_ROOT_PATH}/mail:/var/mail
      - ${VOLUMES_ROOT_PATH}/traefik/acme:/etc/letsencrypt/acme
      # Uncomment the line below, when you want whitelist some IP Addresses or domains in Postfix (please check the 'Whitelist Hosts/IP Addresses In Postfix' in README.md for more info)
      # - ${VOLUMES_ROOT_PATH}/postfix/rbl_override:/etc/postfix/rbl_override
    depends_on:
      - mariadb
      - redis
    networks:
      - mail_network
      - http_network

  # Administration interface
  # https://github.com/hardware/postfixadmin
  # http://postfixadmin.sourceforge.net/
  # Configuration : https://github.com/hardware/mailserver/wiki/Postfixadmin-initial-configuration
  postfixadmin:
    image: mailserver2/postfixadmin:${POSTFIXADMIN_DOCKER_TAG}
    restart: ${RESTART_MODE}
    domainname: ${MAILSERVER_DOMAIN}
    hostname: ${MAILSERVER_HOSTNAME}
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=http_network"
      - "traefik.http.routers.postfixadmin.entrypoints=websecure"
      - "traefik.http.routers.postfixadmin.rule=Host(`postfixadmin.${MAILSERVER_DOMAIN}`)"
      - "traefik.http.routers.postfixadmin.service=postfixadmin"
      - "traefik.http.routers.postfixadmin.tls=true"
      - "traefik.http.routers.postfixadmin.tls.certresolver=letsencrypt"
      - "traefik.http.routers.postfixadmin.tls.domains[0].main=postfixadmin.${MAILSERVER_DOMAIN}"
      - "traefik.http.routers.postfixadmin.tls.options=default"
      - "traefik.http.services.postfixadmin.loadbalancer.server.port=8888"
      - "traefik.http.services.postfixadmin.loadbalancer.server.scheme=http"
    environment:
      - DBPASS=${DATABASE_USER_PASSWORD}
    depends_on:
      - mailserver
      - mariadb
      - traefik
    networks:
      - mail_network
      - http_network

  # Webmail (Optional)
  # https://github.com/hardware/rainloop
  # https://www.rainloop.net/
  # Configuration : https://github.com/hardware/mailserver/wiki/Rainloop-initial-configuration
  rainloop:
    image: mailserver2/rainloop:${RAINLOOP_DOCKER_TAG}
    restart: ${RESTART_MODE}
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=http_network"
      - "traefik.http.routers.rainloop.entrypoints=websecure"
      - "traefik.http.routers.rainloop.rule=Host(`webmail.${MAILSERVER_DOMAIN}`)"
      - "traefik.http.routers.rainloop.service=rainloop"
      - "traefik.http.routers.rainloop.tls=true"
      - "traefik.http.routers.rainloop.tls.certresolver=letsencrypt"
      - "traefik.http.routers.rainloop.tls.domains[0].main=webmail.${MAILSERVER_DOMAIN}"
      - "traefik.http.routers.rainloop.tls.options=default"
      - "traefik.http.services.rainloop.loadbalancer.server.port=8888"
      - "traefik.http.services.rainloop.loadbalancer.server.scheme=http"
    volumes:
      - ${VOLUMES_ROOT_PATH}/rainloop:/rainloop/data
    #environment:
      #LOG_TO_STDOUT: "true"
    depends_on:
      - mailserver
      - mariadb
    networks:
      - mail_network
      - http_network

  # Database
  # https://github.com/docker-library/mariadb
  # https://mariadb.org/
  mariadb:
    image: mariadb:10.5
    restart: ${RESTART_MODE}
    # Info : These variables are ignored when the volume already exists (if databases was created before).
    environment:
      - MYSQL_RANDOM_ROOT_PASSWORD=yes
      - MYSQL_DATABASE=postfix
      - MYSQL_USER=postfix
      - MYSQL_PASSWORD=${DATABASE_USER_PASSWORD}
    volumes:
      - ${VOLUMES_ROOT_PATH}/mysql/db:/var/lib/mysql
    networks:
      - mail_network

  # Cache Database
  # https://github.com/docker-library/redis.
  # https://redis.io/
  redis:
    image: redis:6.0-alpine
    restart: ${RESTART_MODE}
    command: redis-server --appendonly yes
    sysctls:
      - net.core.somaxconn=1024
    volumes:
      - ${VOLUMES_ROOT_PATH}/redis/db/:/data
    networks:
      - mail_network
@financelurker

financelurker commented Aug 21, 2024

Interestingly enough, I sometimes get the same error message when starting the docker container, version 1.1.18:
unable to control /services/rsyslogd: supervisor not listening

Accessing a shell within the container - after startup finished - and starting the rsyslog service then succeeds.
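
For reference, one way to do that manual recovery from the Docker host might look like the sketch below. The comment above does not say which command was used, so `s6-svc -u` is an assumption here. The helper name `start_service` is hypothetical, the container name `docker-mailserver-1` is taken from the reproduction steps in this issue, and the `/services` prefix is taken from the error messages.

```shell
#!/bin/sh
# Hypothetical helper: ask the s6 supervisor inside a running container
# to bring one service up. Assumes service directories live under
# /services and that s6-svc is on the PATH inside the container.
start_service() {
    container=$1
    service=$2
    docker exec "$container" s6-svc -u "/services/$service"
}

# Example, to be run on the Docker host once startup has finished:
# start_service docker-mailserver-1 rsyslogd
```

This only recovers a single service after the fact; it does not address why the first `s6-svc` call fails during startup.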

@AndrewSav
Collaborator

I cannot run 1.1.19 because of this. I guess I need to get to the bottom of it over Christmas, if I find time.

@AndrewSav
Collaborator

I consulted the author of s6, here is what I got:

I think I understand what's happening. The mailserver2 is abusing the supervision pattern and starting services from another service (the _parent one) instead of using a service manager. And it's hitting a race condition: s6-svscan starts all the supervisors at the same time, including the one for _parent, but it takes some time to get ready, and _parent/run tries to s6-svc -u /services/rsyslogd before it's ready.
You can check that it's the correct hypothesis by editing https://github.com/mailserver2/mailserver/blob/master/rootfs/services/_parent/run and swapping lines 15 and 16 for instance. If the error message says "/services/unbound: supervisor not ready" instead, it confirms it - the first s6-svc fails.
The quick and dirty workaround is to add a small sleep (500ms or 1 second will be more than enough) at the start of _parent/run, this will give everything enough time to get ready. But there are still issues in the way the system is set up.
For instance, _parent will keep running even when everything's already up.
The real solution is for the project to change the way they're starting services. They need to 1. set up a supervision tree and 2. start the services from an external script, not one that is supervised.
They might benefit from s6-linux-init, which runs s6-svscan as pid 1 and runs an external script for one-time initialization once s6-svscan is running.
And if they're providing a container, they might want to switch to s6-overlay, which automates all this.

I think that adding the delay as advised should resolve this issue. I'll look into it in the next few weeks.
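
A minimal sketch of that workaround, not the project's actual script: the simplest form is a single `sleep 1` at the top of `_parent/run`. The variation below is not something proposed in the quote; it retries a command a bounded number of times, so a slow supervisor gets more than one chance without relying on a fixed delay. The helper name `retry`, the attempt count, and the delay are all hypothetical.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD [ARGS...]
# Run CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between attempts. Returns 0 on success, 1 if every attempt failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Hypothetical usage inside _parent/run, replacing a bare s6-svc call:
# retry 10 1 s6-svc -u /services/rsyslogd
# retry 10 1 s6-svc -u /services/unbound
```

Like the plain sleep, this only papers over the race; it does not change the fact that a supervised script is starting the other services.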

@AndrewSav
Collaborator

AndrewSav commented Dec 16, 2024

Some more info:

all these services have down files so they aren't up until _parent/run runs
this is very bad design
(the down file hack is what I suspected)
the minimum-touch approach would be to add a one second sleep to _parent, the better (and more idiomatic) approach would be to have each service exit loop until the things they depend on are running
(or simply just start and assume that everything else will eventually be fine)
so yeah, step one would be to figure out if you need strict ordering or not
if not just get rid of all the down files and _parent
if yes then move the checks into the dependent scripts using s6-svwait
(and also get rid of the down files)
for clamav and freshclam you'll want to run `s6-svc -d .` in their respective run scripts if DISABLE_CLAMAV is true
(or unset I guess)
(note that s6-svwait OTHER_SERVICE can still error, but it'll be less dramatic to swallow errors in that case)
also, assuming this works the way I think it does, _parent/run is executed every two seconds
since it's a supervised process that exits without setting itself down
ah
setup.sh does that
by creating finish scripts
er, by creating finish scripts that set that particular service down
(they should be able to use s6-svc -d . instead of s6-svc -d /services/SERVICE)
because the cwd of run and finish starts in the service directory always
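
To make the advice above concrete, here is a rough sketch of what moving the checks into the dependent run scripts could look like. It is an illustration under assumptions, not the project's code: the helper names `wait_for_deps` and `down_if_disabled` are hypothetical, the 5000 ms timeout is arbitrary, and treating only `DISABLE_CLAMAV=true` as "disabled" is a guess based on the compose file comment.

```shell
#!/bin/sh
# Hypothetical helpers for a service's run script.

# Block until every listed service directory reports "up", or give up
# after timeout_ms milliseconds (s6-svwait -t takes milliseconds).
wait_for_deps() {
    timeout_ms=$1
    shift
    s6-svwait -u -t "$timeout_ms" "$@"
}

# If the service is disabled through the environment, tell the supervisor
# to keep it down. "." works because run and finish scripts always start
# in the service directory. Returns 0 when the service was set down.
down_if_disabled() {
    if [ "${DISABLE_CLAMAV:-false}" = "true" ]; then
        s6-svc -d .
        return 0
    fi
    return 1
}

# Sketch of a run script for a service that needs rsyslogd first:
# down_if_disabled && exit 0
# wait_for_deps 5000 /services/rsyslogd || exit 1
# exec some-daemon --foreground    # placeholder for the real daemon
```

With checks like these in each run script, the down files and `_parent` would no longer be needed for ordering, which is the direction suggested above.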

3 participants