Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add docker autoscaler executor #1118

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

mmoutama09
Copy link
Contributor

@mmoutama09 mmoutama09 commented Apr 23, 2024

Description

This is a first draft of using the new gitlab autoscaler executor.
I've been using the fleeting plugin for AWS only.

Prerequisite: Docker must already be installed on the AMI used by worker machines (the Docker autoscaler does not install it, unlike the Docker machine). Additionally, the user used to connect to the workers must also be added to the Docker group.

Related to issue #624

Migrations required

No

Verification

The tests are still in progress but it seems to work.

Copy link
Contributor

Hey @mmoutama09! 👋

Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process.

Make sure that this PR clearly explains:

  • the problem being solved
  • the best way a reviewer and you can test your changes

With submitting this PR you confirm that you hold the rights of the code added and agree that it will published under this LICENSE.

The following ChatOps commands are supported:

  • /help: notifies a maintainer to help you out

Simply add a comment with the command in the first line. If you need to pass more information, separate it with a blank line from the command.

This message was generated automatically. You are welcome to improve it.

@Tiduster
Copy link

Hi @kayman-mk.
I am a colleague of @mmoutama09 and @Kadeux.

This change could be the next major release of the module.

Gitlab is still on track to make their plugin GA this summer: https://gitlab.com/groups/gitlab-org/-/epics/6995

We are still NOT using this version in our production setup, but we will deploy it on part of our runners in June.

What should be the next steps for this PR?

Best regards,

@kayman-mk
Copy link
Collaborator

kayman-mk commented May 23, 2024

Sounds quite promising to get rid of the outdated docker machine. As soon as GitLab has published their module, we can integrate it here.

As far as I can see the docker machine can still be used, so we can create a feature release. Before the next major release I will check if we can get rid of docker machine to simplify the code.

Could you please post the settings to test this change?

At the moment I am working on #1117. That change will be merged before to support zero downtime during deployment of a new version.

@Tiduster
Copy link

Thanks @kayman-mk for your answer.

I was not aware of this zero downtime PR, very interesting, we can test it as well in our environment.
I will try to look at it and gave a feedback if I find something interesting

@kayman-mk
Copy link
Collaborator

@Tiduster Could you please post a minimal configuration showing which AMIs to use to get this up and running?

@mmoutama09 mmoutama09 force-pushed the add_gitlab_docker_autoscaler branch from ea41c66 to 0226a05 Compare June 3, 2024 14:40
@mmoutama09 mmoutama09 force-pushed the add_gitlab_docker_autoscaler branch from 0226a05 to 9a967f6 Compare June 5, 2024 09:39
@kayman-mk
Copy link
Collaborator

kayman-mk commented Jun 6, 2024

Just tried it, but with no success. Runner is up and working. But in case a job is processed, GitLab shows

Running with gitlab-runner 16.4.2 (e77af703)
  on prod-gitlab-ci-runner-test-Gitlab-Runner-TEST-A PsqsZYpLQ, system ID: s_0a07de49d04b
Resolving secrets 00:00
Preparing the "docker-autoscaler" executor 00:50
Dialing instance i-05caf9a7284ccaxxx...
Instance i-05caf9a7284ccaxxx connected
ERROR: Failed to remove network for build
ERROR: Preparation failed: error during connect: Get "http://internel.tunnel.invalid/v1.24/info": dialing environment connection: ssh: rejected: connect failed (open failed) (docker.go:826:0s)

The Runner shows in Cloudwatch

{
    "external-address": "",
    "instance-id": "i-05caf9a7284ccaxxx",
    "internal-address": "100.64.30.16",
    "job": 5314382,
    "level": "info",
    "msg": "Dialing instance",
    "project": 987,
    "runner": "PsqsZYpLQ",
    "time": "2024-06-06T07:40:32Z",
    "use-external-address": true
}
{"error":"networksManager is undefined","job":5314382,"level":"error","msg":"Failed to remove network for build","network":"","project":987,"runner":"PsqsZYpLQ","time":"2024-06-06T07:40:50Z"}

The first error seems to be related to use_external_addr = true in the config. Changed to false.

And I noticed that Docker was not installed on the Runner and ubuntu was not part of the docker group. After fixing that, my job was executed. AMI is ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240531, but I have no idea how to change it.

@kayman-mk
Copy link
Collaborator

kayman-mk commented Jun 6, 2024

  • We mustn't use runner_worker_docker_machine_instance for configuration as it is tied to the docker+machine executor

@mmoutama09
Copy link
Contributor Author

@kayman-mk The installation of Docker is now mandatory indeed; I've mentioned it in the usage.md file (along with adding the user in docker group).
I reused all the variables from runner_worker_docker_machine_instance to avoid the duplication of multiple variables. However, we could do it differently if we don't mind having multiple if conditions in a local block to determine which variable should be taken. What do you think?

@Tiduster
Copy link

@kayman-mk

And I noticed that Docker was not installed on the Runner and ubuntu was not part of the docker group. After fixing that, my job was executed. AMI is ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20240531, but I have no idea how to change it.

On our side we build a custom AMI from ubuntu and we add docker package manually. Docker autoscaler do not do this by default, so it require an AMI with docker engine to work.

We re-used runner_worker_docker_machine_ami_filter and runner_worker_docker_machine_ami_owners for this to no duplicate variables. We can create new variables if you prefer.

@mmoutama09 added some information about this in usage.md.

Best regards,

@mmoutama09 mmoutama09 force-pushed the add_gitlab_docker_autoscaler branch from 20fd1b6 to bff273f Compare June 25, 2024 14:22
@mmoutama09
Copy link
Contributor Author

@kayman-mk I've updated my code to separate docker-autoscaler from docker+machine.

To use the new docker-autoscaler we must provide an AMI with docker installed and the user used by autoscaler to connect to workers must be added to docker group.

The variable docker-registry-mirror is no longer provided as it is not in the runner autoscaler configuration, but we could add it directly to the AMI.

@kayman-mk
Copy link
Collaborator

Hmm, the need for a custom built image doesn't sound good to me at first hand. Any chance to use a pre-existing AMI instead? Or can we install Docker on the fly?

In case we want to host this AMI: Can you provide a built script (Packer?)?

@kayman-mk
Copy link
Collaborator

@Tiduster Could you please share the PAcker scripts to build the AMI? Would be a good idea to have them available and/or publish an AMI here.

@mmoutama09
Copy link
Contributor Author

@kayman-mk here is a packer script to build the image with ubuntu 22 as a base
ubuntu-docker.json

We could also try to install it at launch using userdata, but we would have to add a lifecycle to prevent the autoscaler from connecting to the instance too early. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants