Enable slowlog on the production search ES cluster
Production Change
Change Summary
This change is similar to: gitlab-com/runbooks!3036 (merged)
Slowlog would be useful in troubleshooting problems like: gitlab-org/gitlab#292439 (closed)
For more context see: gitlab-org/gitlab#292439 (comment 463074619)
There are two parts to this change:
- enabling logs forwarding in the cluster (requires a rolling restart of all nodes in the ES cluster)
- enabling slowlog on the production index
Change Details
- Services Impacted - Advanced search
- Change Technician - @mwasilewski-gitlab
- Change Criticality - C2
- Change Type - changeunscheduled
- Change Reviewer - @DylanGriffith
- Due Date - 2020-12-14 13:00:00
- Time tracking - 1h
- Downtime Component - no downtime planned
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- enable logs forwarding in the Elastic Cloud UI:
  - trigger the change
  - snapshot taking (~10min)
  - rolling restart of nodes (~30min)
- enable slowlog using an API call:
PUT /<index_name>/_settings
{
"index.search.slowlog.threshold.query.warn": "30s",
"index.search.slowlog.threshold.query.info": "30s",
"index.search.slowlog.threshold.query.debug": "30s",
"index.search.slowlog.threshold.query.trace": "30s",
"index.search.slowlog.threshold.fetch.warn": "30s",
"index.search.slowlog.threshold.fetch.info": "30s",
"index.search.slowlog.threshold.fetch.debug": "30s",
"index.search.slowlog.threshold.fetch.trace": "30s",
"index.search.slowlog.level": "info"
}
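The request above can also be scripted. The following is a minimal sketch, not the procedure used for this change: it builds the same flat settings body and shows how it might be sent with Python's standard library. The function names, `base_url`, and `auth_header` are hypothetical; the index name stays a placeholder, as in the request above.

```python
import json
import urllib.request

PHASES = ("query", "fetch")
LEVELS = ("warn", "info", "debug", "trace")


def build_slowlog_settings(threshold="30s", level="info"):
    """Return the flat settings body shown in the PUT request above."""
    body = {
        f"index.search.slowlog.threshold.{phase}.{lvl}": threshold
        for phase in PHASES
        for lvl in LEVELS
    }
    body["index.search.slowlog.level"] = level
    return body


def put_index_settings(base_url, index_name, body, auth_header=None):
    """Send PUT <base_url>/<index_name>/_settings (hypothetical helper).

    base_url and auth_header are assumptions; they depend on how the
    cluster is reached and authenticated.
    """
    req = urllib.request.Request(
        url=f"{base_url}/{index_name}/_settings",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    if auth_header:
        req.add_header("Authorization", auth_header)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the body; nothing is sent to a cluster.
    print(json.dumps(build_slowlog_settings(), indent=2))
```

Generating the eight threshold keys from two short tuples keeps the enable and rollback bodies from drifting apart through copy-paste errors.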
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 2 min
- verify in the monitoring cluster that logs from the production search ES cluster are available
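As a complementary check (not part of the original steps), the index settings themselves could be compared against the intended values. The sketch below assumes the response was fetched with `GET /<index_name>/_settings?flat_settings=true`, which returns settings as flat dotted keys under the index name. The helper name and the sample index name `example-index` are hypothetical.

```python
def slowlog_mismatches(settings_response, index_name, expected):
    """Return {key: (expected, actual)} for every setting that differs.

    settings_response is the parsed JSON of a flat-settings GET, shaped
    like {"<index_name>": {"settings": {"index.search.slowlog...": "30s"}}}.
    """
    actual = settings_response.get(index_name, {}).get("settings", {})
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }


if __name__ == "__main__":
    # Hand-written sample response, not real cluster output.
    sample = {
        "example-index": {
            "settings": {
                "index.search.slowlog.level": "info",
                "index.search.slowlog.threshold.query.warn": "30s",
            }
        }
    }
    expected = {
        "index.search.slowlog.level": "info",
        "index.search.slowlog.threshold.query.warn": "30s",
    }
    print(slowlog_mismatches(sample, "example-index", expected))
```

An empty result means the settings match. It does not confirm that logs are reaching the monitoring cluster, so the check above is still needed.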
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30min
- Disable logs forwarding in the Elastic Cloud UI
- Disable slow log on the index:
PUT /<index_name>/_settings
{
"index.search.slowlog.threshold.query.warn": "-1",
"index.search.slowlog.threshold.query.info": "-1",
"index.search.slowlog.threshold.query.debug": "-1",
"index.search.slowlog.threshold.query.trace": "-1",
"index.search.slowlog.threshold.fetch.warn": "-1",
"index.search.slowlog.threshold.fetch.info": "-1",
"index.search.slowlog.threshold.fetch.debug": "-1",
"index.search.slowlog.threshold.fetch.trace": "-1",
"index.search.slowlog.level": "info"
}
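The rollback body above mirrors the enable body, with every threshold set to "-1" (which disables that threshold) and the level left unchanged. One way to keep the two in sync is to derive the rollback body from the enable body, as in the sketch below. The helper name is hypothetical and the example body is shortened for illustration.

```python
def rollback_body(enable_body):
    """Derive the rollback settings from the enable settings.

    Every slowlog threshold key is set to "-1"; other keys, such as
    index.search.slowlog.level, are kept as they are.
    """
    return {
        key: ("-1" if ".slowlog.threshold." in key else value)
        for key, value in enable_body.items()
    }


if __name__ == "__main__":
    # Shortened example body, for illustration only.
    enable = {
        "index.search.slowlog.threshold.query.warn": "30s",
        "index.search.slowlog.threshold.fetch.warn": "30s",
        "index.search.slowlog.level": "info",
    }
    print(rollback_body(enable))
```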
Monitoring
Key metrics to observe
Advanced search dashboard: https://dashboards.gitlab.net/d/search-main/search-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Summary of infrastructure changes
- Does this change introduce new compute instances? no
- Does this change re-size any existing compute instances? no
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? no
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue. (Note: there is no dry-run possible.)
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Michal Wasilewski