
Table of Contents

[[TOC]]

Logging Service

Logging

Troubleshooting Pointers

Quick start

URLs

Logging clusters:

Operations (formerly Stackdriver; we are currently not indexing any logs there, it is used only for cold storage):

BigQuery:

Retention

ES

For the up-to-date retention period, see the details of the ILM policy assigned to the index (an example request follows the table below). See also:

| Index | Production (gprd) | Staging (gstg) |
| --- | --- | --- |
| pubsub-application-inf | 7 days | 6 days |
| pubsub-gitaly-inf | 7 days | 6 days |
| pubsub-haproxy-inf | 7 days | 6 days |
| pubsub-pages-inf | 7 days | 6 days |
| pubsub-postgres-inf | 7 days | 6 days |
| pubsub-rails-inf | 7 days | 6 days |
| pubsub-shell-inf | 7 days | 6 days |
| pubsub-sidekiq-inf | 7 days | 6 days |
| pubsub-system-inf | 7 days | 6 days |
| pubsub-puma-inf | 7 days | 6 days |
| pubsub-unstructured-inf | 7 days | 6 days |
| pubsub-workhorse-inf | 7 days | 6 days |
| pubsub-consul-inf | 7 days | 6 days |
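
To check the live retention settings yourself, you can ask the cluster which ILM policy an index is using and what that policy contains. A minimal sketch using the standard ILM APIs (the index pattern is an example and the policy name in the second request is hypothetical):

```
# Which ILM policy applies to these indices, and which phase is each one in?
GET pubsub-rails-inf-gprd-*/_ilm/explain

# Full definition of a policy (substitute the policy name reported above)
GET _ilm/policy/gitlab-logs-policy
```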

Operations (Stackdriver)

We are not indexing any application logs. We are only using Operations for the sinks that are saving logs to cold storage.

GCS

All logs received by Stackdriver (even if excluded from indexing) are archived to object storage (GCS). Retention in GCS is currently configured for 365 days. For more details see the GCS section below.

What are we logging?

For up to date config see:

Note that this list is incomplete; many entries are missing from it:

| name | logfile | type | index |
| --- | --- | --- | --- |
| gitaly | gitaly/current | JSON | pubsub-gitaly-inf |
| pages | gitlab-pages/current | JSON | pubsub-pages-inf |
| db.postgres | postgresql/current | line regex | pubsub-postgres-inf |
| db.pgbouncer | gitlab/pgbouncer/current | line regex | pubsub-postgres-inf |
| workhorse | gitlab/gitlab-workhorse/current | JSON | pubsub-workhorse-inf |
| rails.api | gitlab-rails/api_json.log | JSON | pubsub-rails-inf |
| rails.application | gitlab-rails/application.log | JSON | pubsub-application-inf |
| rails.audit | gitlab-rails/audit_json.log | JSON | pubsub-rails-inf |
| rails.auth | gitlab-rails/auth.log | JSON | pubsub-rails-inf |
| rails.clickhouse | gitlab-rails/clickhouse.log | JSON | pubsub-rails-inf |
| rails.database_load_balancing | gitlab-rails/database_load_balancing.log | JSON | pubsub-rails-inf |
| rails.geo | gitlab-rails/geo.log | JSON | pubsub-rails-inf |
| rails.git | gitlab-rails/git_json.log | JSON | pubsub-rails-inf |
| rails.importer | gitlab-rails/importer.log | JSON | pubsub-rails-inf |
| rails.exporter | gitlab-rails/exporter.log | JSON | pubsub-rails-inf |
| rails.service_measurement | gitlab-rails/service_measurement.log | JSON | pubsub-rails-inf |
| rails.integrations | gitlab-rails/integrations_json.log | JSON | pubsub-rails-inf |
| rails.kubernetes | gitlab-rails/kubernetes.log | JSON | pubsub-rails-inf |
| rails.web_exporter | gitlab-rails/web_exporter.log | JSON | pubsub-rails-inf |
| rails.web_hooks | gitlab-rails/web_hooks.log | JSON | pubsub-rails-inf |
| rails.exceptions | /var/log/gitlab/gitlab-rails/exceptions_json.log | JSON | pubsub-rails-inf |
| rails.features | /var/log/gitlab/gitlab-rails/features_json.log | JSON | pubsub-rails-inf |
| rails.production | gitlab-rails/production_json.log | JSON | pubsub-rails-inf |
| shell | gitlab-shell/gitlab-shell.log | JSON | pubsub-shell-inf |
| puma.current | /var/log/gitlab/puma/current | line regex | pubsub-puma-inf |
| puma.stderr | /var/log/gitlab/puma/puma_stderr.log | line regex | pubsub-puma-inf |
| puma.stdout | /var/log/gitlab/puma/puma_stdout.log | line regex | pubsub-puma-inf |
| unstructured.production | gitlab-rails/production.log | lines | pubsub-unstructured-inf |
| sidekiq | n/a | JSON | pubsub-sidekiq-inf |
| haproxy | /var/log/haproxy.log | syslog | pubsub-haproxy-inf |
| nginx.error | n/a | unstructured | Stackdriver |
| registry | n/a | mix (lines/json) | pubsub-gke-inf |
| system.auth | /var/log/auth.log | syslog | pubsub-system-inf |
| system.syslog | /var/log/syslog | syslog | pubsub-system-inf |
| rails.graphql | /var/log/gitlab/gitlab-rails/graphql_json.log | JSON | pubsub-rails-inf |
| rails.mail_room | /var/log/gitlab/mailroom/mail_room_json.log | JSON | pubsub-rails-inf |
| history.psql | /home/*-db/.psql_history | | |
| history.irb | /var/log/irb_history/*.log | | |
| migration.log | Sidekiq BatchedBackgroundMigration worker | JSON | pubsub-sidekiq-inf |

How-to guides

Searching logs

Searching in Elastic

production (gitlab.com)
  1. Go to https://log.gprd.gitlab.net/
  2. In Kibana's Discover application, select the relevant index pattern, e.g. pubsub-rails-inf-gprd-*
dev (dev.gitlab.org), staging (staging.gitlab.com), ops (ops.gitlab.com), preprod (pre.gitlab.com)
  1. Go to https://nonprod-log.gitlab.net/
  2. Select the relevant index pattern, e.g. pubsub-rails-inf-gstg-*
  3. (optional) Filter on the environment, e.g. json.environment=gstg or json.environment=dev

Note:

  • logs from dev are sent to staging indices
  • almost no logs are forwarded from preprod
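
For example, a Discover search on the non-prod cluster that narrows results down to staging errors could look like this in KQL (the field values are illustrative):

```
json.environment : "gstg" and json.severity : "ERROR"
```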

Searching in Operations (Stackdriver)

It is not possible to search for application logs using Operations (Stackdriver). You can still search for other logs, for example GCP authentication logs.

Searching in object storage (GCS)

using BigQuery tutorial

Adding a new logfile

  • Decide whether you want to use an existing ES index or create a new one (which will also require making some changes in Terraform). Some hints on how to decide:
    • You won't need a dedicated index if the volume of logs is small
    • It makes sense to use an existing index if the log file "conceptually" belongs to an existing log stream (for example, it used to be part of the Rails logs and is simply being separated into a dedicated file)
    • If the log structure is significantly different (log entries contain completely different fields from already existing logs), you'll want a dedicated index. Otherwise the ES cluster might refuse to process your logs due to mapping conflicts or too many field mappings set in an index (see the sketch after this list).
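
For reference, the "too many field mappings" failure mode mentioned above is governed by the index.mapping.total_fields.limit setting (1000 by default), which can be raised in an index template when sharing an index is unavoidable. A minimal sketch; the template name, pattern and limit are illustrative, not our actual config:

```
PUT _template/pubsub-example-inf
{
    "index_patterns": ["pubsub-example-inf-gprd-*"],
    "settings": {
        "index.mapping.total_fields.limit": 2000
    }
}
```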

Adding a logfile and using an existing ES index

Chef-managed VMs
  • GCE VMs infrastructure:
    • Update one of the fluentd templates and add a section for the new log file. Remember to bump the cookbook version in metadata.rb
    • Follow the Chef roll-out process
Kubernetes infrastructure

Adding a logfile and creating a dedicated index for it

Concepts

Roadmap

Roadmap:

Logging infrastructure overview

Overview

Fluentd

We are using Fluentd (td-agent) for parsing log files and forwarding log messages to different destinations. There are at least three different types of Fluentd processes involved: GitLab-managed Fluentd on GCE VMs, GitLab-managed Fluentd DaemonSets and StatefulSets in GKE, and a GCP-managed Fluentd DaemonSet in GKE.

Fluentd running on VMs is configured to send logs to two destinations: Operations (formerly Stackdriver) and Cloud Pub/Sub. Here's the fluentd config for running on GCE VMs.

GitLab-managed Fluentd in Kubernetes (running as a DaemonSet) sends logs only to ElasticStack (this will likely change in the future, see: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/11655 and: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/10095). Its config can be found here.

Additionally, another GitLab-managed Fluentd in Kubernetes named fluentd-archiver (running as a StatefulSet) exports the logs sent to PubSub by fluentd-elasticsearch to the GCS bucket gitlab-${env}-logging-archive under the folder gke/. This is because Stackdriver exports all Kubernetes container logs mixed together into stdout/ and stderr/ folders, making it difficult to filter per container name. Its config can be found here.

GCP-managed Fluentd, running as a DaemonSet in GKE, sends logs only to Operations. At the time of writing, its config has not been altered.

Operations (Stackdriver)

All logs reaching Operations are saved to a GCS bucket gitlab-${env}-logging-archive using an export sink where they are stored long-term (e.g. 6 months) for compliance reasons and can be read using BigQuery.

We are using log exclusions to prevent application logs from being indexed: https://cloud.google.com/logging/docs/exclusions. At the time of writing, we are not indexing any of the application logs. The current exclusions for Operations (Stackdriver) can be found in the Terraform variables.tf; search for sd_log_filters.
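
For illustration, an exclusion is just a Logging filter expression that matches the entries to drop. A hypothetical filter of the right shape (this is not the actual production filter; see variables.tf for that):

```
resource.type="gce_instance" AND logName:"haproxy"
```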

Cloud Pub/Sub

Logs from different components have designated topics in Pub/Sub and each topic has corresponding subscriptions. There is a subscriber for each subscription. At the moment of writing we are using pubsubbeat to subscribe to Pub/Sub subscriptions and forward logs to an ElasticStack cluster.

Cloud PubSub quotas and limits. Note: there is a 10 MB/s limit per StreamingPull connection

PubSub topics are managed using Terraform (search for pubsub here)

Examples of alternatives to Cloud Pub/Sub include: Kafka

Pubsubbeat

Pubsubbeat is a tool built on the ES Beat framework (https://www.elastic.co/beats/) and the Google Pub/Sub client library (https://pkg.go.dev/cloud.google.com/go/pubsub). The project is maintained by GitLab and can be found here: https://gitlab.com/gitlab-org/pubsubbeat/. The binary pulls logs from a subscription in Pub/Sub and uploads them to Elastic using the bulk API. By default, Pubsubbeat creates templates and field mappings in indices; we are not using this feature and instead pre-create index templates to match our needs. A lot of the ES config is version-controlled in our runbooks repo (https://gitlab.com/gitlab-com/runbooks/-/tree/master/elastic) and applied using CI jobs. Many, but not all, mappings are set statically here.

We deploy pubsubbeat to GKE using helm. For more details see the config here. Permissions are granted to k8s service accounts via Workload Identity, managed in Terraform (search for pubsubbeat here)

Examples of alternatives to Pubsubbeat include: Filebeat, Fluentd, Logstash.

Elastic

Aliases are referenced by Pubsubbeat when uploading logs to Elastic. When logs reach the Elastic cluster, they are indexed into documents by a worker, the alias name is resolved to an index name and the documents are saved in the index. There should only ever be one active index per alias and the alias should be pointing to that index.
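
You can verify this invariant by listing the indices behind an alias; exactly one of them should report is_write_index: true. A sketch (the alias name is an example):

```
GET _alias/pubsub-rails-inf-gprd
```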

Logs (documents) can be viewed in Kibana using index patterns, i.e. when you open the Discover application in Kibana, you can select the index pattern from a drop-down list and all searches you will submit will be performed against all indices matching the index pattern. There are also a number of other features in Kibana we're using: dashboards, saved searches, visualizations, watchers.

More documentation on Elastic can be found here: elastic/doc/

Index Lifecycle Management (ILM)

Indices can be managed in different ways e.g. custom scripts, Curator, Index Lifecycle Management (ILM) plugin. ILM meets a lot of our requirements so that's what we're using for logs sent to ES7 clusters.

The ILM plugin has proved to be particularly useful and has become very popular in the Elastic community in recent years. For this reason, it was integrated into Elastic.

ILM behavior is configured via policies assigned to indices. The plugin runs at a scheduled interval (like a cronjob). When triggered, ILM goes through each index that has a policy assigned to it and performs a step defined in the policy (assigned to that particular index). If the conditions are not met, ILM will do nothing. Notice that the conditions are evaluated on a scheduled basis (rather than being event driven).

Policies can define a number of steps which, in simple words, translate to: hot -> warm -> cold -> delete. The behavior of ILM at each of those steps is defined in the ILM policy.

Here's an example policy:

```json
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_age": "3d",
                        "max_size": "50gb"
                    },
                    "set_priority": {
                        "priority": 100
                    }
                }
            },
            "warm": {
                "min_age": "1m",
                "actions": {
                    "forcemerge": {
                        "max_num_segments": 1
                    },
                    "allocate": {
                        "require": {
                            "data": "warm"
                        }
                    },
                    "set_priority": {
                        "priority": 50
                    }
                }
            },
            "delete": {
                "min_age": "7d",
                "actions": {
                    "delete": {}
                }
            }
        }
    }
}
```

Let's say ILM is configured to run every 10 minutes and the above policy is assigned to a newly created index. After 10 minutes, ILM will trigger the hot phase, which checks the size and age of the index. If the size exceeds 50GB or the age exceeds 3 days, the configured action is triggered, which in this case sends a call to the rollover API. The rollover API marks the current index as non-writable, marks it for the warm phase and creates a new index from an index template. This way we can control, for example, the size of shards within indices or the log retention period.
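
The rollover step can also be invoked by hand against the write alias, e.g. to cut over to a fresh index during an incident. A sketch (the alias name is an example; with no conditions in the body, the rollover happens unconditionally):

```
POST pubsub-rails-inf-gprd/_rollover
```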

See also:

Monitoring

Our Elastic clusters have xpack monitoring enabled in Elastic Cloud and the monitoring metrics are forwarded to a separate monitoring cluster (called monitoring-es7).

There is a VM in each environment called sd-exporter-*. This VM is created using a generic Terraform module: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/generic-sv-with-group. The VM has a Chef role assigned to it which downloads and runs the stackdriver exporter: https://gitlab.com/gitlab-cookbooks/gitlab-exporters/. The exporter service listens on TCP port 9255. Prometheus is configured through a role in chef-repo to scrape port 9255 on the sd-exporter-* VMs. Metrics scraped this way are the basis for the Prometheus pubsub alerts.
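
The resulting Prometheus configuration is conceptually equivalent to a scrape job like the following (a hand-written sketch; the hostname is hypothetical and the real config is generated from the Chef role):

```yaml
scrape_configs:
  - job_name: stackdriver-exporter
    static_configs:
      # sd-exporter-* VM, exporter listening on TCP port 9255
      - targets: ['sd-exporter-gprd-01:9255']
```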

For the monitoring of pubsubbeats in GKE, we use a ServiceMonitor object to tell Prometheus to scrape the endpoints exposed by the pubsubbeat pod. The pod exposes two endpoints, beat-metrics and mtail-metrics, which are served by two sidecars: a beat-exporter process and an mtail process.
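
A minimal sketch of such a ServiceMonitor (the name, namespace and labels are assumptions; the two port names match the endpoints described above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pubsubbeat          # assumed name
  namespace: logging        # assumed namespace
spec:
  selector:
    matchLabels:
      app: pubsubbeat       # assumed label on the pubsubbeat service
  endpoints:
    - port: beat-metrics    # served by the beat-exporter sidecar
    - port: mtail-metrics   # served by the mtail sidecar
```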

BigQuery

BigQuery can be used to search logs that are in cold storage (GCS).

The haproxy logs are also configured to be forwarded to a BigQuery dataset using a StackDriver sink: gitlab-production:haproxy_logs

GCS (long-term storage)

Logs from the export sink are saved to a GCS bucket which we manage with Terraform: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/blob/master/main.tf#L1

We configure this bucket with GCP lifecycle rules: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/blob/master/main.tf#L14

These rules are parameterized and are configured with defaults which are set in the module: https://ops.gitlab.net/gitlab-com/gl-infra/terraform-modules/google/storage-buckets/blob/master/variables.tf#L150
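
In plain GCS terms, the 365-day retention mentioned earlier corresponds to a lifecycle rule shaped like this (a sketch; the authoritative values are the Terraform defaults linked above):

```json
{
    "lifecycle": {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": 365}
            }
        ]
    }
}
```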

es-diagnostics

We created cronjobs that dump diagnostic information from ES clusters.

Script repo: https://gitlab.com/gitlab-com/gl-infra/es-diagnostics/

Deployment: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/tree/master/releases/es-diagnostics

Historical notes

haproxy logs and nginx logs are not sent to ElasticStack because they were overwhelming our cluster. Currently, HAProxy logs are only available through BigQuery. NGINX logs emitted via STDERR on Kubernetes can be found in Stackdriver. NGINX access logs are completely ignored.

Design document for migration to ES7: https://about.gitlab.com/handbook/engineering/infrastructure/design/logging-upgrade/

Configuration

Chef

Cookbooks

There are three cookbooks that configure logging on gitlab.com

Role configuration

  • Add recipe[gitlab_fluentd::<type>] to the corresponding role to enable td-agent for the template
  • The ops proxy role configures the proxy vm that is the reverse proxy for elastic cloud.

Terraform

Pub/Sub Subscriptions should be automatically created by the pubsubbeat service on each pubsub host. If subscriptions get misconfigured (e.g. topics appear as _deleted-topic_) you can delete them and restart the pubsubbeat services to re-create them. This will be deprecated soon: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/8339
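
For example, to spot and remove a broken subscription so that a pubsubbeat restart can recreate it (the project and subscription names are illustrative):

```
# Look for subscriptions whose topic shows as _deleted-topic_
gcloud pubsub subscriptions list --project=gitlab-production

# Delete the broken subscription; restarting the pubsubbeat service recreates it
gcloud pubsub subscriptions delete pubsub-rails-inf-gprd --project=gitlab-production
```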

runbooks repo

There's a number of things configured using CI jobs in the runbooks repo, in runbooks/elastic/, for example:

  • index templates
  • static mappings for indices

FAQ

Why are we using Operations and GCS in addition to ElasticSearch?

We are sending logs to Operations and GCS in addition to Elasticsearch for longer retention and to preserve logs in object storage for 365 days.

Why are we using pubsub queues instead of sending logs from fluentd directly to Elastic?

We use it for two reasons. Firstly, to handle situations when our log sources emit more logs than Elasticsearch can ingest in real time. In this scenario, pubsub serves the role of a buffer.

Secondly, we were overloading Elastic Cloud with the number of connections. Having only a few pubsubbeats helps to reduce the overhead of a separate connection for each fluentd instance.

See also: message queueing discussion in roadmap

How do I find the right logs for my service?

See Quick start

A user sees an error on GitLab.com, how do I find logs for that user?

  • Select the pubsub-rails-inf-gprd-* index pattern in Kibana
  • Search for +json.username: <user>

If the request has json.correlation_id field set, you can use that id for checking logs from all gitlab.com components using the Correlation dashboard.
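
For example, paste the correlation ID into the Discover search bar as a KQL filter (the ID below is made up):

```
json.correlation_id : "q1b2c3d4e5f6"
```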

see also: searching logs

Why do we have these annoying json. prefixes?

They are created by https://github.com/GoogleCloudPlatform/pubsubbeat; I don't see a way to remove them without forking the project.

What if I need to query logs older than the ones present in Elastic?

At the moment, the only other way to access logs is using BigQuery.

See logging_gcs_archive_bigquery.md for instructions on loading logs into BigQuery from their GCS archive files.