
Table of Contents

[[TOC]]

Quick start

Elastic-related resources

  1. Logging dashboard in Grafana
  2. runbooks repo:
    1. documentation
    2. Prometheus alerts
    3. dashboards/watchers/visualizations/searches
  3. terraform config:
    1. infra managed in the gitlab-com-infrastructure repo (e.g. pubsubbeat VMs, stackdriver exporter)
    2. relevant terraform modules
  4. chef config
  5. Design documents in the www-gitlab-com repo (TODO: link design docs here once they are ready)
  6. Logging working group: https://about.gitlab.com/company/team/structure/working-groups/log-aggregation/
  7. vendor issue tracker: https://gitlab.com/gitlab-com/gl-infra/elastic/issues
  8. Global Search engineering team
  9. Slack channel #g_global_search
  10. Discussions in different issues across multiple projects (e.g. regarding costs for indexing entire gitlab.com)
  11. Discussions in PM&Engineering meetings

Historical notes

  1. esc-tools repo used for managing the ES5 cluster

How-to guides

Administrative access/login

We've locked down Okta access to read-only for both the non-prod and prod logging clusters. Both clusters can still be accessed for read/write by the SRE on-call through the [email protected] account.

Once logged into Elastic Cloud, select 'open' for any of the clusters and you'll be logged into Kibana as a super user.

./img/elasticcloud.png

Disaster recovery

  1. Recovering lost Advanced Search updates

Upgrade checklist

Pre-flight

Upgrade Staging

Upgrade Production

Rollback steps

  • If the upgrade completed but something is not working, create a new cluster and restore an older version of Elasticsearch from the snapshot captured above. Then update the credentials in GitLab > Admin > Settings > General > Advanced Search to point to this new cluster. The original cluster should be kept for root cause analysis. Keep in mind that this is a last resort and will result in data loss.

How to verify the Elasticsearch cluster is healthy

How to verify that the Advanced Search feature is working

  • Add a comment to an issue and then search for that comment. Note that before the results show up, all jobs in the queue need to be processed, which can take a few minutes. In addition, the Elasticsearch index refresh can take another 30s (if there were no search requests in the last 30s).
  • Search for a commit that was added after indexing was paused (a command-line sketch using the search API follows this list)
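
A quick way to run the second check from a terminal is the GitLab search API; this is only a sketch, and the token and search string below are placeholders to adapt:

curl -sS --header "PRIVATE-TOKEN: <your_token>" "https://gitlab.com/api/v4/search?scope=commits&search=<commit message fragment>"

If Advanced Search is working, the recently indexed commit should appear in the response.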

Monitoring

Metric: Search overview metrics

Metric: Search controller performance

Metric: Search sidekiq indexing queues (Sidekiq Queues (Global Search))

Metric: Search sidekiq in flight jobs

Metric: Elastic Cloud outages

Performing operations on the Elastic cluster

One-time Elastic operations should be documented as api_calls in this repo. Everything else (for example, cluster config and index templates) should be managed using CI, with the exception of dashboards and visualizations created in Kibana by users.

The convention used by most scripts in api_calls is to provide cluster connection details through an env var called ES7_URL_WITH_CREDS. Its format is: https://<es_username>:<password>@<cluster_url>:<es_port>. The secret that this env var should contain can be found in 1Password.
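
For example, a one-off read-only call can reuse that variable directly; a minimal sketch (the cluster health endpoint is just an illustration):

export ES7_URL_WITH_CREDS="https://<es_username>:<password>@<cluster_url>:<es_port>"
curl -sS "${ES7_URL_WITH_CREDS}/_cluster/health?pretty"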

Estimating Log Volume and Cluster Size

If we know how much log volume we are indexing per day, how many resources we are using on our cluster, the desired retention period and how much log volume we want to add, then we can estimate the needed cluster size.

Currently, fluentd sends all logs to Stackdriver and some logs to GCP PubSub. We have pubsubbeat nodes for each topic, which send the logs into Elastic.

What is going to Stackdriver?

Stackdriver is ingesting everything - around 50TiB per month as of 17-01-2020: Resources view

haproxy logs are sent to a GCP sink instead of pubsub/elastic because of their size (10MiB/s or 850GiB/day).

What is the Volume of our PubSub topics?

Average daily pubsub volume per topic in GiB (the base unit in Prometheus is bytes/minute for this metric).

The same metric in the Stackdriver metrics explorer (bytes/s)

In total 1.3TiB/day as of 17-01-2020 (excluding nginx).

How much elastic storage are we using per day?

We have one index alias per pubsub topic, and in the ES5 cluster (gitlab-production) rolled-over indices follow a naming convention that appends the date and a counter. We can therefore query the Elastic cat API for each pubsub index alias and sum the sizes of all indices that belong to the same alias and carry the same day in their name, which gives us the daily index volume. [../api_calls/single/get-index-stats-summary.sh] does that for you; a simplified sketch of the underlying cat API call is shown below.
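
A minimal sketch of that idea (the index pattern is only a placeholder and has to be adjusted to the actual alias naming):

curl -sS "${ES7_URL_WITH_CREDS}/_cat/indices/<index_alias>-*?h=index,store.size&bytes=b&s=index"

Summing store.size for all indices that share the same alias and date then gives the daily index volume.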

The results as of 16-01-2020 are analyzed in this sheet.

We can conclude from this that the index volume (with one replica shard) is around 3 times the volume of the corresponding pubsub topic.

As of 17-01-2020 we are using about 4TiB of Elastic storage per day (pubsub topics only, excluding nginx). That means that with a 7-day retention we consume around 28TiB of storage. Adding nginx logs would increase that by 0.6TiB/day (15%), and haproxy logs by 2.5TiB/day (63%).
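
A quick back-of-the-envelope check of that estimate, purely as an illustration using the numbers above (1.3TiB/day of pubsub volume, the ~3x index multiplier, and a 7-day retention):

# 1.3 TiB/day * 3 ~= 4 TiB/day of index volume; * 7 days ~= 27-28 TiB of storage
awk 'BEGIN { pubsub=1.3; idx=pubsub*3; printf "index: %.1f TiB/day, 7-day retention: %.1f TiB\n", idx, idx*7 }'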

Analyzing index mappings

At the time of writing, we use static mappings defined in this repository. Here are a few ideas for analyzing those mappings:

# count how many fields define a type in the rails mapping
jsonnet elastic/managed-objects/lib/index_mappings/rails.jsonnet | jq -r 'leaf_paths|join(".")' | grep -E '\.type$' | wc -l
# preview the first few field paths
jsonnet elastic/managed-objects/lib/index_mappings/rails.jsonnet | jq -r 'leaf_paths|join(".")' | grep -E '\.type$' | head
# render the mapping hierarchy as a flamegraph (requires inferno-flamegraph)
jsonnet elastic/managed-objects/lib/index_mappings/rails.jsonnet | jq -r 'leaf_paths|join(";")' | grep -E ';type$' | awk '{ print $1, 1 }' | inferno-flamegraph > mapping_rails.svg

Concepts

Elastic learning materials

Design Document (Elastic at Gitlab)

https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/23545 TODO: update this link once merged

Monitoring

Because Elastic Cloud runs on infrastructure that we do not manage or have access to, we cannot use our exporters/Prometheus/Thanos/Alertmanager setup. For this reason, the best option is to use the Elasticsearch built-in x-pack monitoring, which stores monitoring metrics in Elasticsearch indices. In a production environment, it makes sense to use a separate cluster for storing monitoring metrics (if metrics were stored on the same cluster, we wouldn't know the cluster is down because monitoring would be down as well).

When monitoring is enabled and configured to send metrics to another Elastic cluster, it's the receiving cluster's responsibility to handle metrics rotation, i.e. the receiving cluster needs to have retention configured. For more details see: https://www.elastic.co/guide/en/cloud/current/ec-enable-monitoring.html#ec-monitoring-retention and https://www.elastic.co/guide/en/elasticsearch/reference/current/monitoring-settings.html
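
For reference, metrics collection on the monitored cluster can be toggled with a dynamic cluster setting; a minimal sketch reusing the ES7_URL_WITH_CREDS convention from above (the exporter/target configuration itself is handled through Elastic Cloud and is not shown here):

curl -sS -XPUT "${ES7_URL_WITH_CREDS}/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent": {"xpack.monitoring.collection.enabled": true}}'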

Apart from monitoring using x-pack metrics + watches, we are also using a blackbox exporter in our infrastructure. It's used for monitoring selected API endpoints, such as the ILM explain API.
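
The ILM explain endpoint that those probes target can also be queried manually (the index pattern is a placeholder):

curl -sS "${ES7_URL_WITH_CREDS}/<index_pattern>/_ilm/explain?pretty"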

Alerting

Since we cannot use our Alertmanager, Elasticsearch Watches have to be used for alerting. They will be configured on the Elastic cluster used for storing monitoring indices.
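
As an illustration only, a minimal watch loosely modelled on Elastic's cluster-status tutorial could look like the sketch below; the index pattern, document fields, interval, and logging action are assumptions that need to be checked against the actual monitoring indices:

curl -sS -XPUT "${ES7_URL_WITH_CREDS}/_watcher/watch/example_cluster_status" -H 'Content-Type: application/json' -d '{
  "trigger": { "schedule": { "interval": "10m" } },
  "input": { "search": { "request": {
    "indices": [ ".monitoring-es-*" ],
    "body": {
      "size": 1,
      "sort": [ { "timestamp": { "order": "desc" } } ],
      "query": { "term": { "type": "cluster_stats" } }
    }
  } } },
  "condition": { "compare": { "ctx.payload.hits.hits.0._source.cluster_state.status": { "eq": "red" } } },
  "actions": { "log_status": { "logging": { "text": "Cluster status is red" } } }
}'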

Blackbox probes cannot provide us with sufficient granularity of state reporting.