---
title: "How we knew it was time to leave the cloud"
author: Pablo Carranza
author_twitter: psczg
categories: engineering
image_title: '/images/unsplash/data.png'
description: "How we're solving storage and performance issues as we scale."
twitter_image: '/images/tweets/why-bare-metal.png'
---

{::options parse_block_html="true" /}

In my last [infrastructure update][infra-post], I documented our challenges with
storage as GitLab scales. We built a CephFS cluster to tackle both the capacity
and performance issues of NFS and decided to replace PostgreSQL's standard
vacuum with the pg_repack extension. Now, we're feeling the pain of running a
high-performance distributed file system in the cloud.

Over the past month, we loaded a lot of projects, users, and CI artifacts onto
CephFS. We chose CephFS because it's a reliable distributed file system that can
grow capacity to the petabyte level, making it virtually infinite, and we needed
storage. By going with CephFS, we could push the solution into the infrastructure
instead of building a complicated application. The problem with CephFS is that,
in order to work, it needs a really performant underlying infrastructure, because
it has to read and write a lot of data very quickly. If one of the hosts delays
writing to its journal, the rest of the fleet waits on that single operation, and
the whole file system is blocked. When this happens, all of the hosts halt and
you have a locked file system; no one can read or write anything, and that
basically takes everything down.

![osd-journal-latency](/images/blogimages/osd-journal-latency.png)
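
To make that failure mode concrete, here is a toy sketch in Python of the behavior
described above: a replicated write is only acknowledged once every replica has
committed it to its journal, so the slowest OSD sets the pace for everyone. The
latency figures below are invented purely for illustration.

```python
# Toy model only: latency figures are made up for illustration.
# A replicated write is acknowledged once *every* replica has committed it
# to its journal, so the slowest journal dictates the client-visible latency.

healthy_fleet_ms = {"osd.0": 4, "osd.1": 6, "osd.2": 9}
degraded_fleet_ms = {**healthy_fleet_ms, "osd.3": 42_000}  # one throttled disk

def ack_latency_ms(journal_commit_ms):
    """The write completes only when the slowest journal commit returns."""
    return max(journal_commit_ms.values())

print(ack_latency_ms(healthy_fleet_ms))    # 9     -> barely noticeable
print(ack_latency_ms(degraded_fleet_ms))   # 42000 -> every client stalls behind it
```

That is essentially what the graph above shows: one slow journal, and the tail
latency of the whole file system follows it.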

What we learned is that when you get into the consistency, availability, and
partition tolerance (CAP) properties of CephFS, it will give away availability in
exchange for consistency. We also learned that when you put a lot of pressure on
the system, it will generate hot spots. For example, in the part of the cluster
hosting the GitLab CE repo, all the reads and writes end up hitting the same spot
during high-load times. This problem is amplified because we hosted the system in
the cloud, where there is no minimum SLA for IO latency.

## Performance Issues on the Cloud

By choosing to use the cloud, we are by default sharing infrastructure with a
lot of other people. The cloud is time sharing, i.e. you share the machine with
others on the provider's resources. As such, the provider has to ensure that
everyone gets a fair slice of that time. To do this, providers place performance
limits and thresholds on the services they provide.
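
As a rough sketch of what those limits feel like from the tenant's side, here is
a toy token-bucket model in Python. The mechanism and the numbers are assumptions
(providers don't publish their throttling policies); the capacity simply echoes
the 20,000 IOPS cap described next.

```python
# Toy sketch of provider-side throttling (mechanism and numbers are assumptions;
# real clouds use their own, unpublished policies). The point: once a tenant
# exhausts its IOPS budget, the excess work turns into queueing latency.

class IopsBucket:
    def __init__(self, capacity=20_000, refill_per_second=20_000):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity

    def second_of_io(self, requested_iops):
        """Grant up to the available budget; everything else has to wait."""
        self.tokens = min(self.capacity, self.tokens + self.refill_per_second)
        granted = min(requested_iops, self.tokens)
        self.tokens -= granted
        throttled = requested_iops - granted
        return granted, throttled

bucket = IopsBucket()
print(bucket.second_of_io(35_000))  # (20000, 15000): 15k IOs queue up as latency
```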

On our servers, GitLab can perform at most 20,000 IOPS, but the guaranteed lower
limit is 0. With this performance capacity, we became the "noisy neighbors" on
the shared machines, using all of the resources. We became the neighbor who plays
their music loud and really late. So, we were punished with latencies. Providers
don't guarantee a minimum IOPS, so they can simply throttle you. Every time we
wanted the disk to do anything, we were looking at 100 ms of latency.
[That's basically telling us to wait 8 years][space-time-article]. What we found
is that the cloud was not meant to provide the level of IOPS performance we needed
to run an aggressive system like CephFS.
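
To put that "8 years" link in context, here is the back-of-the-envelope scaling
it refers to: stretch one CPU cycle out to a full human second and see how long a
100 ms disk wait becomes. The exact cycle time is an assumption, which is why the
result lands somewhere between roughly 8 and 11 years.

```python
# Back-of-the-envelope scaling behind the "wait 8 years" comparison:
# stretch one CPU cycle to one human second, then see how long a
# 100 ms I/O wait feels at that scale. The cycle time is an assumption.

io_wait_ms = 100
seconds_per_year = 60 * 60 * 24 * 365

for cycle_ns in (0.3, 0.4):
    scaled_seconds = (io_wait_ms * 1_000_000) / cycle_ns  # one "second" per cycle
    print(f"{cycle_ns} ns cycle -> {scaled_seconds / seconds_per_year:.1f} years")

# 0.3 ns cycle -> 10.6 years
# 0.4 ns cycle -> 7.9 years
```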

At a small scale, the cloud is cheaper and sufficient for many projects.
However, if you need to scale, it's not so easy. It's often sold as, "If you
need to scale and add more machines, you can spawn them because the cloud is
'infinite'". What we discovered is that yes, you can keep spawning more
machines, but there comes a point, particularly when you're adding heavy
IOPS, where it becomes less effective and very expensive. You'll still have to
pay for bigger machines. The nature of the cloud is time sharing, so you still
won't get the best performance. When it comes down to it, you're paying a lot
of money to get a subpar level of service while still needing more performance.

So, what happens when the cloud is just not enough?

## Moving to Bare Metal

At this point, moving to dedicated hardware makes sense for us. From a cost
perspective, it is more economical and reliable because of how the culture of
the cloud works and the level of performance we need. Of course, hardware comes
with its own upfront costs: components will fail and need to be replaced. This
requires services and support that we don't have today. You have to
know the hardware you are getting into and put a lot more effort into keeping it
alive. But in the long run, it will make GitLab more efficient, consistent,
and reliable as we will have more ownership of the entire infrastructure.

## How We Proactively Uncover Issues

At GitLab, we are able to proactively uncover issues like this because we are
building an observable system as a way to understand how
our system behaves. The machine is doing a lot of things, most of which we are
not even aware of. To get a deeper look at what's happening, we gather data and
metrics into Prometheus to build dashboards and observe trends.
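
As a sketch of what "gathering metrics into Prometheus" can look like, here is a
minimal exporter built with the official `prometheus_client` Python library. The
metric name and the fake value source are ours for illustration; in production
the numbers come from the exporters feeding our dashboards, not from code like
this.

```python
# Minimal exporter sketch using the prometheus_client library.
# The metric name and value source are illustrative, not our real exporter.
import random
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical gauge; real values would come from the kernel or Ceph itself.
journal_latency = Gauge(
    "osd_journal_latency_seconds",
    "Time spent committing writes to the OSD journal",
    ["osd"],
)

def collect_once():
    # Stand-in for reading a real counter, so the sketch runs on its own;
    # the range echoes the 2-12 second baseline discussed further down.
    journal_latency.labels(osd="osd.0").set(random.uniform(2.0, 12.0))

if __name__ == "__main__":
    start_http_server(9200)  # Prometheus scrapes http://<host>:9200/metrics
    while True:
        collect_once()
        time.sleep(15)       # roughly one scrape interval
```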

These metrics live deep in the kernel and are not readily visible to humans. To
see them, you need to build a system that allows you to pull, aggregate, and
graph the data. Graphs are great because you can get a lot of data on one screen
and read it with a simple glance.

For example, our fleet overview dashboard shows how our different workers are
performing, all in one view:

![workers-load](/images/blogimages/workers-load.png)

![workers-wait](/images/blogimages/workers-wait.png)

### How we used our dashboard to understand CephFS in the cloud


Below, you can see our OSD journal latency over the seven days shown, including a clear spike.

![osd-journal-latency-one-week](/images/blogimages/osd-journal-latency-one-week.png)

This is how much time we spent trying to write to this journal disk. In general,
we commit data to this journal within roughly 2 to 12 seconds. You can see where
it jumps to 42 seconds to complete -- that delay is where we were being punished.
The high spikes show when GitLab.com was down.
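
Once the data is in Prometheus, the same series can be pulled back out over its
standard HTTP API and checked programmatically. A hedged sketch follows: the
`/api/v1/query` endpoint is standard Prometheus, but the host, the metric name,
and the 12-second threshold are assumptions taken from the baseline above.

```python
# Sketch: pull the journal-latency series from Prometheus's HTTP API and flag
# the kind of spike shown above. Metric name and threshold are assumptions;
# /api/v1/query is the standard Prometheus query endpoint.
import requests

PROMETHEUS = "http://prometheus.example.com:9090"  # hypothetical host
QUERY = "max(osd_journal_latency_seconds)"         # hypothetical metric name
THRESHOLD_S = 12                                   # top of our usual 2-12 s range

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    _, value = series["value"]                     # [unix_timestamp, "value"]
    if float(value) > THRESHOLD_S:
        print(f"journal latency {value}s exceeds {THRESHOLD_S}s; expect stalls")
```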

What's great about having this dashboard is that there is a lot of data available
quickly, in one place. Non-technical people can understand it. This is the level
of insight into your system you want to aim for, and you can build your own with
[Prometheus][prometheus]. We have been building ours for the last month and it's
close to its end state; we're still working on it to add more things.

This is how we make informed decisions and understand, as best we can, what is
going on with our infrastructure. Whenever we see a service failing or performing
in an unexpected way, we pull together a dashboard to highlight the underlying
data and help us understand what's happening and how things are being impacted on
a larger scale. Usually monitoring is an afterthought, but we are changing this by
shipping more and more detailed and comprehensive monitoring with GitLab. Without
detailed monitoring, you are just guessing at what is going on within your
environment and systems.

The bottom line is that once you have moved beyond a handful of systems, it is no
longer feasible to run one-off commands to try to understand what is happening
within your infrastructure. True insight can only be gained by having enough
data to make informed decisions.


## Recap: What We Learned

1. CephFS gives us more scalability and, ostensibly, more performance, but it did not work well in the cloud on shared resources, despite tweaking and tuning to try to make it work.
1. There is a threshold of performance on the cloud and if you need more, you will have to pay a lot more, be punished with latencies, or leave the cloud.
1. Moving to dedicated hardware is more economical and reliable for the scale and performance of our application.
1. Building an observable system by pulling and aggregating performance data into understandable dashboards helps us spot non-obvious trends and correlations, so we can address issues faster.
1. Monitoring some things can be really application-specific, which is why we are [building our own gitlab-monitor Prometheus exporter][prom-exporter]. We plan to ship this with GitLab CE soon.

<!-- identifiers -->

[infra-post]: /2016/09/26/infrastructure-update/
[prom-exporter]: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481
[prometheus]: https://prometheus.io/
[space-time-article]: https://blog.codinghorror.com/the-infinite-space-between-words/