Commit f387e6de, authored Sep 12, 2019 by John Skarbek
adds documentation when GCE causes instance deaths
Parent: c52915fe
Showing 1 changed file with 48 additions and 0 deletions
troubleshooting/node-reboots.md
0 → 100644
# Node Reboots
Search tags: reboot, restart, instance, node, VM, machine
## Discovering GCE Casualties
Sometimes GCE itself has issues and forces node restarts. To validate that GCE
is the root cause, use the following search in Stackdriver:
```
resource.type="gce_instance"
protoPayload.serviceName="compute.googleapis.com"
(protoPayload.methodName="compute.instances.hostError" OR protoPayload.methodName:"automaticRestart")
```
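When chasing a specific incident it often helps to narrow this search to a time window and, optionally, a single instance. The following is a minimal sketch in Python that only assembles the filter text. The function name and parameters are hypothetical, and it makes two choices of its own: it joins the two method-name conditions with `OR` (a single log entry carries only one method name), and it uses the `:` substring operator so the match does not depend on the exact method-name prefix.

```python
from datetime import datetime, timezone


def build_gce_restart_filter(start, end, instance_id=None):
    """Build a Stackdriver filter for hostError/automaticRestart events.

    start and end are timezone-aware datetimes bounding the search.
    instance_id is optional and, when given, limits results to one VM.
    All names here are illustrative, not part of any existing tooling.
    """
    clauses = [
        'resource.type="gce_instance"',
        'protoPayload.serviceName="compute.googleapis.com"',
        # Either event is of interest, so the two conditions are OR'ed.
        '(protoPayload.methodName:"hostError" OR '
        'protoPayload.methodName:"automaticRestart")',
        f'timestamp>="{start.isoformat()}"',
        f'timestamp<="{end.isoformat()}"',
    ]
    if instance_id is not None:
        clauses.append(f'resource.labels.instance_id="{instance_id}"')
    # Separate lines in a filter are implicitly AND'ed together.
    return "\n".join(clauses)


if __name__ == "__main__":
    print(build_gce_restart_filter(
        datetime(2019, 9, 12, tzinfo=timezone.utc),
        datetime(2019, 9, 13, tzinfo=timezone.utc),
    ))
```

The printed text can be pasted into the log viewer's advanced filter box.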
With the above search, you'll see that GCE detects an error of some sort on the
physical node, which causes the `hostError`. Shortly afterward, normally within
five hundredths of a second, you'll see the `automaticRestart`.
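If you export the matching entries, the `hostError` followed by `automaticRestart` pattern can be checked mechanically. The sketch below assumes each entry has already been flattened into a dict with the hypothetical keys `timestamp`, `instance`, and `methodName`; that shape is an assumption of this example, not the raw log format. The default gap of one second is deliberately more generous than the timing described above.

```python
from datetime import datetime, timedelta


def parse_ts(ts):
    """Parse an RFC 3339 timestamp; older Pythons reject a trailing 'Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def pair_host_errors(entries, max_gap=timedelta(seconds=1)):
    """Pair each hostError with the next automaticRestart on the same instance.

    Returns a list of (instance, host_error_time, restart_time) tuples for
    pairs that occur within max_gap of each other.
    """
    events = sorted(entries, key=lambda e: parse_ts(e["timestamp"]))
    pending = {}  # instance -> time of the most recent unmatched hostError
    pairs = []
    for entry in events:
        ts = parse_ts(entry["timestamp"])
        instance = entry["instance"]
        method = entry["methodName"]
        if method.endswith("hostError"):
            pending[instance] = ts
        elif method.endswith("automaticRestart") and instance in pending:
            started = pending.pop(instance)
            if ts - started <= max_gap:
                pairs.append((instance, started, ts))
    return pairs
```

An `automaticRestart` with no preceding `hostError` on the same instance is left unpaired, which is a hint to look at the other causes covered in this document.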
If you don't see the above, the issue may not be on the GCE side. At that
point, start troubleshooting other potential causes, such as kernel panics,
user-induced reboots, or instance deletions.
## Instance Deletions
Node deletions can be found by adding the following to your search query:
```
protoPayload.methodName:"delete"
```
Normally, if the deletion was done through Terraform, the log entry shows the
engineer who ran it. If it was done via autoscaling, the entry shows a
specialized service account, normally created by Google in our project.
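The actor on a delete entry is recorded in the audit log's `protoPayload.authenticationInfo.principalEmail` field. As a rough sketch, the helper below (a hypothetical name) separates human accounts from service accounts by checking for the `.gserviceaccount.com` suffix. Treat it as a heuristic only: if Terraform itself runs under a service account, its deletions will also be classified as `service_account`.

```python
def classify_delete_actor(entry):
    """Return (kind, email) for the principal recorded on an audit log entry.

    entry is the log entry as a nested dict (for example, parsed JSON).
    kind is "user", "service_account", or "unknown".
    """
    email = (
        entry.get("protoPayload", {})
        .get("authenticationInfo", {})
        .get("principalEmail", "")
    )
    if not email:
        return ("unknown", "")
    # Google service account addresses end in a gserviceaccount.com domain.
    if email.endswith(".gserviceaccount.com"):
        return ("service_account", email)
    return ("user", email)
```

For example, an entry whose principal is `engineer@example.com` yields `("user", "engineer@example.com")`.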
## Instance Migrations
Often Google will simply live-migrate a machine. If you discover performance
issues with an instance during a particular period, add this item to your
search, since a live migration could be part of the cause:
```
protoPayload.methodName:"migrateOnHostMaintenance"
```
Live migrations may introduce network latency or an intermittent loss of the
node as the machine is brought online on a new physical host.
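To judge whether a migration plausibly contributed to a performance problem, compare the migration timestamps against the incident window. This is a minimal sketch; the function name and the five-minute margin are assumptions made for illustration, so adjust the margin to match how long the symptoms lasted.

```python
from datetime import datetime, timedelta


def migrations_near_incident(migration_times, incident_start, incident_end,
                             margin=timedelta(minutes=5)):
    """Return the migration times that fall inside the incident window.

    The window is widened by margin on both sides, since symptoms may start
    slightly before or linger slightly after the logged migration event.
    """
    earliest = incident_start - margin
    latest = incident_end + margin
    return [t for t in migration_times if earliest <= t <= latest]


if __name__ == "__main__":
    migrations = [datetime(2019, 9, 12, 9, 58), datetime(2019, 9, 12, 14, 0)]
    hits = migrations_near_incident(
        migrations,
        incident_start=datetime(2019, 9, 12, 10, 0),
        incident_end=datetime(2019, 9, 12, 10, 30),
    )
    print(hits)
```

An empty result suggests the migration and the incident are unrelated, and the search should move on to other causes.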