Table of Contents
[[TOC]]
- Alerts: https://alerts.gitlab.net/#/alerts?filter=%7Btype%3D%22logging%22%2C%20tier%3D%22inf%22%7D
- Label: gitlab-com/gl-infra/production~"Service::Logging"
- Camoproxy troubleshooting
- ../ci-runners/ci_pending_builds.md
- Cloudflare Logs
- ../elastic/advanced-search-in-gitlab.md
- Elastic Nodes Disk Space Saturation
- ../elastic/elastic-cloud.md
- ../elastic/kibana.md
- HAProxy Logging
- HAProxy Management at GitLab
- Possible Breach of SSH MaxStartups
- Purge Git data
- HostedRunnersServicePendingBuildsSaturationSingleShard
- ../hosted-runners/runner_system_failure.md
- HTTP Router Worker Logs
- Kubernetes-Agent Basic Troubleshooting
- Kubernetes
- logging_gcs_archive_bigquery.md
- Scaling Elastic Cloud Clusters
- Diagnosis with Kibana
- GitLab Pages returning 404
- Troubleshooting LetsEncrypt for Pages
- Log analysis on PostgreSQL, Pgbouncer, Patroni and consul Runbook
- ../patroni/postgres-checkup.md
- ../patroni/postgres.md
- ../patroni/postgresql-backups-wale-walg.md
- Sidekiq or Web/API is using most of its PgBouncer connections
- ../pgbouncer/service-pgbouncer.md
- PubSub Queuing Rate Increasing
- Container Registry Database Load Balancing
- ../registry/gitlab-registry.md
- Disabling Sidekiq workers
- ../sidekiq/sharding.md
- Example Tutorial Template
- Life of a Git Request
- Life of a Web Request
- Azure Test Subscription
- ../uncategorized/access-gcp-hosts.md
- How to Use Vault for Secrets Management in Infrastructure
- Vault Secrets Management
- version.gitlab.com Runbook
- Diagnostic Reports
- Static repository objects caching
Logging clusters:
Operations (formerly Stackdriver; we are currently not indexing any logs, it is used only for cold storage):
BigQuery:
For the up-to-date retention period, see the details of the ILM policy assigned to the index. See also:
- https://gitlab.com/gitlab-com/runbooks/-/tree/master/elastic/managed-objects/log_gprd/ILM
- https://gitlab.com/gitlab-com/runbooks/-/blob/master/elastic/managed-objects/log_gprd/ILM/gitlab-infra-high-ilm-policy.jsonnet
Index | Production (gprd) | Staging (gstg) |
---|---|---|
pubsub-application-inf | 7 days | 6 days |
pubsub-gitaly-inf | 7 days | 6 days |
pubsub-haproxy-inf | 7 days | 6 days |
pubsub-pages-inf | 7 days | 6 days |
pubsub-postgres-inf | 7 days | 6 days |
pubsub-rails-inf | 7 days | 6 days |
pubsub-shell-inf | 7 days | 6 days |
pubsub-sidekiq-inf | 7 days | 6 days |
pubsub-system-inf | 7 days | 6 days |
pubsub-puma-inf | 7 days | 6 days |
pubsub-unstructured-inf | 7 days | 6 days |
pubsub-workhorse-inf | 7 days | 6 days |
pubsub-consul-inf | 7 days | 6 days |
We are not indexing any application logs. We are only using Operations for the sinks that are saving logs to cold storage.
All logs received by Stackdriver (even if excluded from indexing) are archived to object storage (GCS). Retention in GCS is currently configured for 365 days. For more details, see the GCS section below.
For up to date config see:
- https://gitlab.com/gitlab-cookbooks/gitlab_fluentd/
- https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/tanka-deployments/-/tree/master/environments/fluentd-elasticsearch
- https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/tanka-deployments/-/tree/master/lib/fluentd
There are many entries missing from this list:
name | logfile | type | index |
---|---|---|---|
gitaly | gitaly/current | JSON | pubsub-gitaly-inf |
pages | gitlab-pages/current | JSON | pubsub-pages-inf |
db.postgres | postgresql/current | line regex | pubsub-postgres-inf |
db.pgbouncer | gitlab/pgbouncer/current | line regex | pubsub-postgres-inf |
workhorse | gitlab/gitlab-workhorse/current | JSON | pubsub-workhorse-inf |
rails.api | gitlab-rails/api_json.log | JSON | pubsub-rails-inf |
rails.application | gitlab-rails/application.log | JSON | pubsub-application-inf |
rails.audit | gitlab-rails/audit_json.log | JSON | pubsub-rails-inf |
rails.auth | gitlab-rails/auth.log | JSON | pubsub-rails-inf |
rails.clickhouse | gitlab-rails/clickhouse.log | JSON | pubsub-rails-inf |
rails.database_load_balancing | gitlab-rails/database_load_balancing.log | JSON | pubsub-rails-inf |
rails.geo | gitlab-rails/geo.log | JSON | pubsub-rails-inf |
rails.git | gitlab-rails/git_json.log | JSON | pubsub-rails-inf |
rails.importer | gitlab-rails/importer.log | JSON | pubsub-rails-inf |
rails.exporter | gitlab-rails/exporter.log | JSON | pubsub-rails-inf |
rails.service_measurement | gitlab-rails/service_measurement.log | JSON | pubsub-rails-inf |
rails.integrations | gitlab-rails/integrations_json.log | JSON | pubsub-rails-inf |
rails.kubernetes | gitlab-rails/kubernetes.log | JSON | pubsub-rails-inf |
rails.web_exporter | gitlab-rails/web_exporter.log | JSON | pubsub-rails-inf |
rails.web_hooks | gitlab-rails/web_hooks.log | JSON | pubsub-rails-inf |
rails.exceptions | /var/log/gitlab/gitlab-rails/exceptions_json.log | JSON | pubsub-rails-inf |
rails.features | /var/log/gitlab/gitlab-rails/features_json.log | JSON | pubsub-rails-inf |
rails.production | gitlab-rails/production_json.log | JSON | pubsub-rails-inf |
shell | gitlab-shell/gitlab-shell.log | JSON | pubsub-shell-inf |
puma.current | /var/log/gitlab/puma/current | line regex | pubsub-puma-inf |
puma.stderr | /var/log/gitlab/puma/puma_stderr.log | line regex | pubsub-puma-inf |
puma.stdout | /var/log/gitlab/puma/puma_stdout.log | line regex | pubsub-puma-inf |
unstructured.production | gitlab-rails/production.log | lines | pubsub-unstructured-inf |
sidekiq | n/a | JSON | pubsub-sidekiq-inf |
haproxy | /var/log/haproxy.log | syslog | pubsub-haproxy-inf |
nginx.error | n/a | unstructured | StackDriver |
registry | n/a | mix (lines/json) | pubsub-gke-inf |
system.auth | /var/log/auth.log | syslog | pubsub-system-inf |
system.syslog | /var/log/syslog | syslog | pubsub-system-inf |
rails.graphql | /var/log/gitlab/gitlab-rails/graphql_json.log | JSON | pubsub-rails-inf |
rails.mail_room | /var/log/gitlab/mailroom/mail_room_json.log | JSON | pubsub-rails-inf |
history.psql | /home/*-db/.psql_history | ||
history.irb | /var/log/irb_history/*.log | ||
migration.log | Sidekiq BatchedBackgroundMigration worker | JSON | pubsub-sidekiq-inf |
- Go to https://log.gprd.gitlab.net/
- in Kibana, in the Discover application, select the relevant index pattern, e.g. `pubsub-rails-inf-gprd-*`
- Go to https://nonprod-log.gitlab.net/
- select the relevant index pattern, e.g. `pubsub-rails-inf-gstg-*`
- (optional) filter on the environment, e.g. `json.environment=gstg` or `json.environment=dev`
Note:
- logs from dev are sent to staging indices
- almost no logs are forwarded from preprod
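The naming convention behind these steps can be sketched in a few lines of Python. This is only an illustration: the helper names `index_pattern` and `environment_filter` are hypothetical, and the pattern format is inferred from the two examples above rather than taken from any tooling in this repo.

```python
def index_pattern(log_type: str, env: str) -> str:
    """Build a Kibana index pattern such as 'pubsub-rails-inf-gprd-*'."""
    return f"pubsub-{log_type}-inf-{env}-*"


def environment_filter(env: str) -> str:
    """Build the optional environment filter, e.g. 'json.environment=gstg'."""
    return f"json.environment={env}"


if __name__ == "__main__":
    # Production rails logs live in the gprd cluster.
    print(index_pattern("rails", "gprd"))
    # Logs from dev are sent to staging indices, so search the gstg
    # pattern and narrow it down with the environment filter.
    print(index_pattern("rails", "gstg"), environment_filter("dev"))
```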
It is not possible to search for application logs using Operations (Stackdriver). You can still search for other logs, for example GCP authentication logs.
- Decide whether you want to use an existing ES index or create a new one (which will also require making some changes using Terraform). Some hints about how to decide:
- You won't need a dedicated index if the amount of logs is small
- It makes sense to use an existing index if the log file "conceptually" belongs to an existing log stream (for example, it used to be part of the rails logs and is simply going to be separated into a dedicated file)
- If the log structure is significantly different (log entries contain completely different fields from already existing logs), you'll want to use a dedicated index. Otherwise the ES cluster might refuse to process your logs due to mapping conflicts or too many field mappings set in an index.
Adding a logfile and using an existing ES index
- GCE VMs infrastructure:
- Update one of the fluentd templates and add a section for the new log file. Remember to bump the cookbook version in `metadata.rb`
- follow the chef roll out process
- Update config of the Gitlab managed Fluentd DaemonSet. You will need to add a new entry in logging-config.yaml.
- Configure Elastic
- Add and modify the following and once merged, wait for the CI job to update ES config. An example of these changes is in this commit https://gitlab.com/gitlab-com/runbooks/-/commit/2b1c86471cfb3c792137c746613838d34d223e59
- Add the index name to the indices array file
- Add a new file with your index name with empty index mapping (such as `{}`) in the index mapping directory. This will need to be modified later with the log mappings.
- Add the index to prod and non-prod index template files
- Initialize the alias and create the first index using an API call. You can do it in the Kibana UI with:

```
PUT /pubsub-<index_name>-inf-<env_name>-000001
{
  "aliases": {
    "pubsub-<index_name>-inf-<env_name>": {
      "is_write_index": true
    }
  }
}
```

or using a script documented here: https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/api_calls/single/initialize-alias-create-index.sh
- Make sure that all three exist in the relevant cluster (alias, index template, first index) and that the index has an ILM policy assigned to it.
- If you're not sure which policy to use, `gitlab-infra-ilm-policy` is a sensible default.
- You can assign an ILM policy using the Kibana UI, or add the following JSON to the index settings: `"index.lifecycle.name": "gitlab-infra-ilm-policy",`
- Index patterns in Kibana can only be created once there are documents in indices.
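As a rough sketch of the alias initialization step, the request can be assembled programmatically. The function names `first_index_request` and `ilm_settings` are hypothetical; the path and body follow the `PUT` call shown above, and the default policy name is the one suggested in this section. Nothing here talks to a cluster, it only builds the payloads.

```python
import json


def first_index_request(index_name: str, env_name: str) -> tuple[str, str, dict]:
    """Return (method, path, body) for creating the first index behind an alias."""
    alias = f"pubsub-{index_name}-inf-{env_name}"
    path = f"/{alias}-000001"
    # The alias must point at the new index and mark it as the write index.
    body = {"aliases": {alias: {"is_write_index": True}}}
    return "PUT", path, body


def ilm_settings(policy: str = "gitlab-infra-ilm-policy") -> dict:
    """Index setting that assigns an ILM policy to the index."""
    return {"index.lifecycle.name": policy}


if __name__ == "__main__":
    # "example" and "gstg" are placeholder values for illustration.
    method, path, body = first_index_request("example", "gstg")
    print(method, path)
    print(json.dumps(body, indent=2))
    print(json.dumps(ilm_settings(), indent=2))
```

The printed method, path and body can be pasted into the Kibana console, or sent with whichever HTTP client you prefer.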
- Create a PubSub topic
- Add your topic to the list of pubsubbeat topics in the `variables.tf` file of the environment where you want to make the change, e.g. for gstg: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/blob/master/environments/gstg/variables.tf .
- Create a beat that will forward logs from PubSub to ES (this step should only be performed after the topic was created)
- Add a beat for your topic to the list of beats in the relevant environment: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/tree/master/releases/pubsubbeat
- For log files on the GCE VMs infrastructure
- Add a new recipe in the `gitlab_fluentd` cookbook for your log file, for example: https://gitlab.com/gitlab-cookbooks/gitlab_fluentd/merge_requests/99/diffs
- Edit the relevant roles in the chef repo to apply the new recipe to VMs managed with that role, for example: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2367/diffs
- follow the chef roll out process
- For log files in GKE
- Update config of the Gitlab managed Fluentd DaemonSet. You will need to add a new entry in logging-config.yaml and add your topic to pubsub-topics.libsonnet.
- To view the logs in Kibana, you'll need to create an index pattern. You can do this by going to Management > Stack Management > Kibana > Index Patterns and clicking the "Create index pattern" button.
Roadmap:
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/10095
- https://docs.google.com/document/d/1EK3QUuC0JrN5ndXdz1McEwKirD_jfA7bk0siTgHXwI4/edit#
We are using Fluentd (`td-agent`) for parsing log files and forwarding log messages to different destinations. There are at least three (3) different types of Fluentd processes involved: GitLab managed Fluentd on GCE VMs, GitLab managed Fluentd DaemonSet and StatefulSet in GKE, and GCP managed Fluentd DaemonSet in GKE.
Fluentd running on VMs is configured to send logs to two destinations: Operations(formerly Stackdriver) and Cloud Pub/Sub. Here's the fluentd config for running on GCE VMs.
Gitlab managed Fluentd in Kubernetes (running as a DaemonSet) sends logs only to ElasticStack (this will likely change in the future, see: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/11655 and: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/10095). Its config can be found here.
Additionally, another GitLab managed Fluentd in Kubernetes named `fluentd-archiver` (running as a StatefulSet) exports the logs sent to PubSub by `fluentd-elasticsearch` to the GCS bucket `gitlab-${env}-logging-archive` under the folder `gke/`. This is because Stackdriver exports all Kubernetes container logs mixed together into `stdout/` and `stderr/` folders, making it difficult to filter per container name. Its config can be found here.
GCP managed Fluentd running as a DaemonSet in GKE, sends logs only to Operations. At the moment of writing, its config was not altered.
All logs reaching Operations are saved to a GCS bucket `gitlab-${env}-logging-archive` using an export sink where they are stored long-term (e.g. 6 months) for compliance reasons and can be read using BigQuery.
We are using log exclusions to prevent application logs from being indexed: https://cloud.google.com/logging/docs/exclusions At the moment of writing, we are not indexing any of the application logs. The current exclusions for Operations (Stackdriver) can be found in terraform variables.tf, search for `sd_log_filters`.
Logs from different components have designated topics in Pub/Sub and each topic has corresponding subscriptions. There is a subscriber for each subscription. At the moment of writing we are using pubsubbeat to subscribe to Pub/Sub subscriptions and forward logs to an ElasticStack cluster.
Cloud PubSub quotas and limits. Note: there's a limit of 10MB/s per StreamingPull stream
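To get a feel for what the StreamingPull limit means for a busy topic, here is a small worked calculation. It is a back-of-the-envelope sketch: the function name `min_streams` and the 25 MB/s throughput figure are made up for illustration, and it says nothing about how many streams pubsubbeat actually opens.

```python
import math

# Per-StreamingPull limit quoted above, in MB/s.
STREAMING_PULL_LIMIT_MB_S = 10


def min_streams(throughput_mb_s: float) -> int:
    """Smallest number of StreamingPull streams able to carry the throughput."""
    if throughput_mb_s <= 0:
        return 0
    return math.ceil(throughput_mb_s / STREAMING_PULL_LIMIT_MB_S)


if __name__ == "__main__":
    # A hypothetical topic producing 25 MB/s needs at least 3 streams,
    # because 2 streams top out at 20 MB/s.
    print(min_streams(25))
```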
PubSub topics are managed using Terraform (search for pubsub here)
Examples of alternatives to Cloud Pub/Sub include: Kafka
Pubsubbeat is a tool that uses the ES Beat framework: https://www.elastic.co/beats/ and Google PubSub client library: https://pkg.go.dev/cloud.google.com/go/pubsub . The project is maintained by Gitlab and can be found here: https://gitlab.com/gitlab-org/pubsubbeat/ . The binary pulls logs from a subscription in Pub/Sub and uploads them to Elastic using the bulk API. The default configuration of Pubsubbeat is to create templates and field mappings in indices. We are not utilizing this feature and instead are precreating index templates to match our needs. A lot of the ES config is version controlled in our runbooks repo: https://gitlab.com/gitlab-com/runbooks/-/tree/master/elastic and applied using CI jobs. Many, but not all mappings are set statically here.
We deploy pubsubbeat to GKE using helm. For more details see config here. Permissions are given to k8s service accounts through Workload Identity, which is configured using terraform (search for pubsubbeat here)
Examples of alternatives to Pubsubbeat include: Filebeat, Fluentd, Logstash.
Aliases are referenced by Pubsubbeat when uploading logs to Elastic. When logs reach the Elastic cluster, they are indexed into documents by a worker, the alias name is resolved to an index name and the documents are saved in the index. There should only ever be one active index per alias and the alias should be pointing to that index.
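The "one active index per alias" rule can be illustrated with a simplified model. This is not how Elasticsearch implements alias resolution internally; it is a sketch in which `resolve_write_index` is a hypothetical helper and the index names are sample values following the naming used elsewhere in this document.

```python
def resolve_write_index(alias: str, indices: dict[str, dict]) -> str:
    """Return the single index flagged as the write index for `alias`.

    `indices` maps an index name to its alias definitions, shaped like
    {"<alias>": {"is_write_index": bool}}.
    """
    write_indices = [
        name
        for name, aliases in indices.items()
        if aliases.get(alias, {}).get("is_write_index", False)
    ]
    # Zero or several write indices means the alias is misconfigured.
    if len(write_indices) != 1:
        raise ValueError(
            f"expected exactly one write index for {alias!r}, found {write_indices}"
        )
    return write_indices[0]


if __name__ == "__main__":
    alias = "pubsub-rails-inf-gprd"
    indices = {
        # Rolled-over index: still reachable through the alias, no longer written to.
        "pubsub-rails-inf-gprd-000001": {alias: {"is_write_index": False}},
        # Current index: new documents land here.
        "pubsub-rails-inf-gprd-000002": {alias: {"is_write_index": True}},
    }
    print(resolve_write_index(alias, indices))
```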
Logs (documents) can be viewed in Kibana using index patterns, i.e. when you open the Discover application in Kibana, you can select the index pattern from a drop-down list and all searches you submit are performed against all indices matching the index pattern. There are also a number of other features in Kibana we're using: dashboards, saved searches, visualizations, watchers.
More documentation on Elastic can be found here: elastic/doc/
Indices can be managed in different ways e.g. custom scripts, Curator, Index Lifecycle Management (ILM) plugin. ILM meets a lot of our requirements so that's what we're using for logs sent to ES7 clusters.
The ILM plugin has proved to be particularly useful and has become very popular in the Elastic community in recent years. For this reason, it was integrated into Elastic.
ILM behavior is configured via policies assigned to indices. The plugin runs at a scheduled interval (like a cronjob). When triggered, ILM goes through each index that has a policy assigned to it and performs a step defined in the policy (assigned to that particular index). If the conditions are not met, ILM will do nothing. Notice that the conditions are evaluated on a scheduled basis (rather than being event driven).
Policies can define a number of steps which in simple words translate to: hot -> warm -> cold -> delete. Behavior of ILM at each of those steps is defined in the ILM policy.
Here's an example policy:
```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "3d",
            "max_size": "50gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "1m",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```
Let's say ILM is configured to run every 10 minutes and the above policy is assigned to a newly created index. After 10 minutes, ILM runs the hot phase check, which looks at the size and age of the index. If the size exceeds 50GB or the age exceeds 3 days, the configured action is triggered, which in this case sends a call to the rollover API. The rollover API marks the current index as non-writable, marks it for the warm phase and creates a new index from an index template. This way we can control, for example, the size of shards within indices or the log retention period.
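The scheduled nature of the check can be made concrete with a small simulation. This is a simplified model, not Elasticsearch code: `should_rollover` and `first_rollover_poll` are hypothetical names, the thresholds are the ones from the example policy, the index is assumed to grow at a constant rate, and a threshold counts as met once it is reached.

```python
from datetime import timedelta

GB = 1024 ** 3


def should_rollover(age: timedelta, size_bytes: int,
                    max_age: timedelta = timedelta(days=3),
                    max_size_bytes: int = 50 * GB) -> bool:
    """Mirror the hot-phase rollover conditions of the example policy."""
    return age >= max_age or size_bytes >= max_size_bytes


def first_rollover_poll(growth_bytes_per_min: int, poll_minutes: int = 10) -> int:
    """Minute at which a periodic check would first trigger a rollover."""
    minute = 0
    while True:
        # Conditions are only evaluated when the scheduler fires.
        minute += poll_minutes
        age = timedelta(minutes=minute)
        size = growth_bytes_per_min * minute
        if should_rollover(age, size):
            return minute


if __name__ == "__main__":
    # Growing by 1 GB per minute, the size condition is met at minute 50.
    print(first_rollover_poll(GB, poll_minutes=10))
    # With a 15-minute interval the rollover only happens at minute 60,
    # by which time the index has already grown to 60 GB.
    print(first_rollover_poll(GB, poll_minutes=15))
    # An idle index rolls over on age alone: 3 days = 4320 minutes.
    print(first_rollover_poll(0, poll_minutes=10))
```

The second case shows why an index can overshoot `max_size`: the conditions are polled rather than event driven.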
See also:
- https://www.elastic.co/blog/implementing-hot-warm-cold-in-elasticsearch-with-index-lifecycle-management
- https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
Our Elastic clusters have xpack monitoring enabled in Elastic Cloud and the monitoring metrics are forwarded to a separate monitoring cluster (called `monitoring-es7`).
There is a VM in each environment called `sd-exporter-*`. This VM is created using a generic terraform module https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/generic-sv-with-group . The VM has a chef role assigned to it which downloads and runs the stackdriver exporter https://gitlab.com/gitlab-cookbooks/gitlab-exporters/ . The exporter service listens on TCP port 9255. Prometheus is configured through a role in chef-repo to scrape port 9255 on `sd-exporter-*` VMs. Metrics scraped this way are the basis for Prometheus pubsub alerts.
For the monitoring of pubsubbeats in GKE, we use a ServiceMonitor object to tell Prometheus to scrape the endpoints exposed by the pubsubbeat pod. The pod exposes two endpoints: `beat-metrics` and `mtail-metrics`. These are exposed by sidecars, a beat-exporter process and an mtail process.
BigQuery can be used to search logs that are in cold storage (GCS).
The `haproxy` logs are also configured to be forwarded to a BigQuery dataset using a StackDriver sink: `gitlab-production:haproxy_logs`
Logs from the export sink are saved to a GCS bucket which we manage with Terraform: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/blob/master/main.tf#L1
We configure this bucket with GCP lifecycle rules: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/blob/master/main.tf#L14
These rules are parameterized and are configured with defaults which are set in the module: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/blob/master/variables.tf#L150
We created cronjobs that dump diagnostic information from ES clusters.
Script repo: https://gitlab.com/gitlab-com/gl-infra/es-diagnostics/
Deployment: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/tree/master/releases/es-diagnostics
HAProxy logs and NGINX logs are not sent to ElasticStack because they were overwhelming our cluster. Currently, HAProxy logs are only available through BigQuery. NGINX logs via STDERR on Kubernetes can be found in StackDriver. NGINX access logs are completely ignored.
Design document for migration to ES7: https://about.gitlab.com/handbook/engineering/infrastructure/design/logging-upgrade/
There are three cookbooks that configure logging on gitlab.com
- gitlab-proxy - Sets up the nginx proxy so that users can access elastic cloud via log.gprd.gitlab.net, to be deprecated: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/9145 and https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/331
- gitlab_fluentd - Sets up td-agent on all nodes, forwards logs to pubsub topics.
- gitlab-elk (deprecated) - Sets up the pubsub beat which reads from the topics and forwards to elastic cloud.
- Add `recipe[gitlab_fluentd::<type>]` to the corresponding role to enable td-agent for the template
- The ops proxy role configures the proxy vm that is the reverse proxy for elastic cloud.
- (deprecated) pubsubbeat module: https://gitlab.com/gitlab-com/gl-infra/terraform-modules/google/pubsubbeat
- pubsub topics are managed using a pubsub module: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/tree/master/modules/pubsub
- Workload Identity config: search for pubsubbeats in https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/blob/master/environments/gstg/main.tf
- Stackdriver log exclusions: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/stackdriver
- es-diagnostics storage bucket and permissions
Pub/Sub Subscriptions should be automatically created by the pubsubbeat service on each pubsub host. If subscriptions get misconfigured (e.g. topics appear as `_deleted-topic_`) you can delete them and restart the pubsubbeat services to re-create them. This will be deprecated soon: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/8339
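Spotting the misconfigured subscriptions described above can be sketched as a simple filter. The helper `orphaned_subscriptions` is hypothetical, and the input is assumed to be a list of dictionaries with `name` and `topic` keys (for example, parsed from a JSON listing of subscriptions); the project and subscription names in the demo are invented.

```python
# Marker shown in place of the topic once the topic has been deleted.
DELETED_TOPIC = "_deleted-topic_"


def orphaned_subscriptions(subscriptions: list[dict]) -> list[str]:
    """Names of subscriptions whose topic no longer exists."""
    return [s["name"] for s in subscriptions if s.get("topic") == DELETED_TOPIC]


if __name__ == "__main__":
    # Invented sample data for illustration only.
    subscriptions = [
        {
            "name": "projects/example-project/subscriptions/pubsub-rails-inf-gstg",
            "topic": "projects/example-project/topics/pubsub-rails-inf-gstg",
        },
        {
            "name": "projects/example-project/subscriptions/pubsub-pages-inf-gstg",
            "topic": DELETED_TOPIC,
        },
    ]
    for name in orphaned_subscriptions(subscriptions):
        print(f"candidate for deletion: {name}")
```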
There's a number of things configured using CI jobs in the runbooks repo, in `runbooks/elastic/`, for example:
- index templates
- static mappings for indices
We are sending logs to Operations and GCS in addition to elasticsearch for longer retention and to preserve logs in object storage for 365 days.
We use PubSub for two reasons. Firstly, to handle situations when our log sources emit more logs than Elasticsearch can ingest in real time. In this scenario, PubSub serves the role of a buffer.
Secondly, we were overloading Elastic Cloud with the number of connections. Having only a few pubsubbeats helps to reduce the overhead of a separate connection for each fluentd instance.
See also: message queueing discussion in roadmap
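The buffering argument can be illustrated with a toy simulation. The numbers are invented and the model is deliberately crude (fixed ingest capacity per interval, no message expiry); `backlog_over_time` is a hypothetical helper, not part of any tooling here.

```python
def backlog_over_time(produced: list[int], ingest_capacity: int) -> list[int]:
    """Backlog left in the buffer after each interval.

    `produced[i]` is the number of messages emitted in interval i and
    `ingest_capacity` is how many the consumer can index per interval.
    """
    backlog = 0
    history = []
    for emitted in produced:
        # Anything the consumer cannot absorb waits in the queue.
        backlog = max(0, backlog + emitted - ingest_capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # A burst in intervals 1 and 2 exceeds the capacity of 100 per interval.
    print(backlog_over_time([50, 200, 200, 50, 50], ingest_capacity=100))
    # prints [0, 100, 200, 150, 100]
```

During the burst the backlog grows instead of logs being dropped, and it drains again once the emit rate falls below the ingest capacity.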
See Quick start
- Select the `pubsub-rails-inf-gprd-*` index pattern in Kibana
- Search for `+json.username: <user>`

If the request has the `json.correlation_id` field set, you can use that id for checking logs from all gitlab.com components using the Correlation dashboard.
see also: searching logs
They are created by https://github.com/GoogleCloudPlatform/pubsubbeat . I don't see a way we can remove them without forking the project.
At the moment, the only other way to access logs is using BigQuery.
See logging_gcs_archive_bigquery.md for
instructions on loading logs into BigQuery
from their GCS archive files.