We need to move from the cloud to metal because of operations#7 (closed)
The hardest thing is finding a performant and affordable 100TB+ fileserver setup
Conversation about the hardware here and in the Google Sheet https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7Nb7Eg22twXPuzgDwsOhtdYKQ/edit#gid=1018071361
Conversation about what facility to use in #732 (closed)
If you also purchase a rack you can be sure they are colocated.
Would need a full rack 42u, in Washington DC (their only East coast location) for $999 per month https://www.leaseweb.com/colocation/private-rack
Delivery of dedicated servers takes 1 to 5 business days.
Their cloud bare metal servers don't have enough disk space to make one NFS server https://www.leaseweb.com/cloud/bare-metal-server/all-servers
Biggest normal server is R920 https://www.leaseweb.com/dedicated-servers/quad-processor?processorCount=4 but it is not available in Washington DC :(
- 24 x 1.2TB HDD or
- 16 x 1.2TB HDD + 8 x 1.6TB PowerEdge Express Flash NVMe PCIe SSDs
Hot-plug hard drive options:
- 2.5” SATA/SAS SSD, SAS HDD (15K, 10K), nearline SAS HDD (7.2K)
- 2.5” PCIe SSDs: Dell PowerEdge NVMe Express Flash PCIe SSD
- Up to 24 2.5” hot-plug 12Gb/6Gb SAS HDD or SAS/SATA SSD
- Up to 8 front-accessible Express Flash NVMe PCIe SSD (PCIe 3.0)
But on leaseweb I can get only 8 drives of 960GB each https://www.leaseweb.com/dedicated-server/configure/20654
I'll get a quote for the custom server.
Title changed from Leaseweb costs to Leaseweb and OVH costs
- You can upgrade the storage space later
- monthly contract
- dedicated servers will be in the same datacenter
- 10 servers + 2 ARIN blocks (ARIN blocks are for the virtual machines or secondary IPs created on your servers, so a max. of 10 bare metal servers)
- it is a shared NAS
- available in east coast Canada datacenter (Montreal), US coming beginning 2016
- takes 1 business day to provision
- dedicated servers need the enterprise line, with 2 network cards and more bandwidth (500Mbps or 1Gbps)
- MG-256 40 cores, MG-128 32 cores => MG-128 seems best value per CPU
- If you order a server via https://www.ovh.com/us/dedicated-servers/enterprise/ you can set the Netherlands in the account field.
- Dedicated servers are also pay by month, they are provisioned in 120 seconds
It will cost 5x219 + 869 = $1964 per month.
The MG-128 processor is not a lot worse than the MG-256 http://ark.intel.com/compare/81705,83356 but the machine is about a third cheaper.
BTW I called OVH and instantly got a knowledgeable and friendly person on the line. She knew the details about the recent downtime in the Montreal data center and assured me they now had north and south ducts of fiber.
@sytses : you can also try their servers for 1 week only, to reduce costs. https://www.ovh.com/us/dedicated-servers/one-week-rental.xml
OVH horror stories http://www.flowstopper.org/2015/06/french-ovh-vs-german-hetzner.html
LeaseWeb quote can't be made public
Word on the street is that bare metal has a lower cost per IO than virtualized servers.
@jbdelhommeau Thanks for the use case study, that seems very relevant. The FS max server can hold up to 36 disks of 800GB SSD for a total of 28TB (less after RAID losses). We need to pay for the disks up front ($18k per server, and we will need a failover server).
I propose that we have two steps:
- Move to OVH or AWS for the next 6 months to buy time, running on one big file server with SSDs or using EFS
- Work on our next generation infrastructure for the time after that. Tests will tell us the direction. Can be cloud or cage, can be CephFS or distributed GitLab.
The more I think about it the more I like OVH
- Our app servers can quickly be added and you pay per month
- You get metal performance
- You're flexible with the fileserver disks: you can add them over time, and we could add non-SSD fileservers for 'colder' repositories (is heterogeneous storage something CephFS can do automatically?)
- Not sure about the backup solution, their regular solution seems to go up to 10GB https://www.ovh.com/us/dedicated-servers/backup-ftp.xml
SSDs or spinning disks? => we do have lots of tiny files with random access
CephFS supports heterogeneous storage
I would like to know if we could get premium support from OVH if we find out we need it.
One of the "take aways" I took from the switch to Azure is that we may have been able to prevent a lot of heartburn by using their support team before actively switching, to talk architecture, io limitations, etc.
So... let's at least have such a conversation with whatever provider we intend to switch to.
- We'll try OVH
- Jeroen will be the contact person
- Discuss setup with them, preferably in an issue
- In parallel test performance and latency to Canada with one application server
- See if they offer tiered support
You are looking for US servers only? In FR, we have a great hosting provider: online.net! They have good prices with high-quality (VIP) support. They just renewed their servers; check out these two: https://www.online.net/en/dedicated-server/dedibox-wopr & https://www.online.net/en/dedicated-server/dedibox-power8 (the biggest one)
Discussed offline with @jnijhof :
- we should first finish making Postgres and Redis clustered
- @jnijhof had negative experiences with DRBD and will comment with some alternative solutions he wants to explore with our future hoster along with DRBD
- if we do go with DRBD for our next NFS server we may want to get commercial support from linbit
Status changed to closed
@jacobvosmaer Interesting that DRBD is not ideal. Softlayer comes with QuantaStor, maybe that helps http://osnexus.squarespace.com/frequently-asked-questions/general-questions/does-quantastor-support-dr-failover-remote-replication.html and http://forum.osnexus.org/forum/post/2471441, also see operations#14 (closed) and let's continue the conversation there.
Status changed to reopened
Monthly base price: $367.79
- Hardware: HP DL380e G8 (12x LFF), 2x Intel Hexa-Core Xeon E5-2420, 64GB DDR3 (incl.)
- 8x 960GB SSD: $346
- Datacenter: WDC-01 (incl.)
- Remote management: incl.
- Software: CentOS 6 (incl.)
- Network: 1x 1000Mbps full-duplex, 100 TB traffic (incl.)
- IPv4: 1 (incl.)
- Contract: Basic 24x7x24 (incl.), contract term 1 month, billing term 1 month
- Monthly total: $713.79
We need 33TB with 3x redundancy (100TB). Each server is 7TB. We would need 15 servers. That is $10k per month. That is not unreasonable for having all SSD servers.
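For reference, a minimal sketch of that arithmetic (the ~7 TB usable per server is the same rough estimate used above, not a vendor figure):

```python
import math

# Rough check of the all-SSD fleet estimate; figures come from the LeaseWeb quote above.
required_raw_tb = 33 * 3            # ~33 TB of data with 3x redundancy ≈ 100 TB raw
per_server_tb = 7                   # 8x 960 GB SSD ≈ 7 TB usable (rough estimate)
per_server_monthly = 713.79         # monthly total from the quote

servers = math.ceil(required_raw_tb / per_server_tb)
print(servers, round(servers * per_server_monthly))   # 15 servers, ~10707 USD/month
```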
Considering the above requirements and our current size (70TB) I think we need a private rack.
https://www.leaseweb.com/colocation/configure/16260 has a 42U one for $2k.
https://www.leaseweb.com/dedicated-server/configure/23417 4x4TB SATA2 ($ 43.00) in 1u.
Assume half the rack is file servers: 21 servers (1U each) * 4 disks per server * 4TB = 336 TB.
Each fileserver is $92.99 per month. One rack would be $2k + 42 * $100 ≈ $6,200 per month; 3 racks for redundancy = $18,600 per month.
- HPE ProLiant DL380 Gen9 E5-2620v4 1P 16GB-R P440ar 12LFF 2x800W PS Base Server - https://www.hpe.com/us/en/product-catalog/servers/proliant-servers/pip.overview.hpe-proliant-dl380-gen9-e5-2620v4-1p-16gb-r-p440ar-12lff-2x800w-ps-base-server.1008829994.html
- $ 3,532.74
- 40 Gb NIC
- 12 Gb/s SAS
- 12 front HDD slots (LFF), none ship standard; 12 LFF chassis, P840ar/2GB SAS controller. Supports optional rear drives.
HP Enterprise 4TB 3.5" 7.2K 6G SAS
4 drives for OS and log (2x2 flash in raid mirror)
leaves 20 drives * 4 TB (8 TB drives would decrease IO per TB) = 80 TB per server ≈ $6,000 in drives
We need 2 servers for the whole of GitLab and 3x for redundancy, so 6 total at about $10k each: $60k for fileservers. Add $30k in application servers and $10k in networking and the capex is $100k (colo and remote hands are a monthly expense). That capex is equal to what we pay for cloud hosting per month.
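A back-of-the-envelope version of that capex estimate; all figures are the rough numbers from this comment, not vendor quotes:

```python
# Capex sketch for the HPE DL380 option, using the rough numbers above.
data_drives_per_server = 20                 # drives left after 4 OS/log drives
tb_per_drive = 4
usable_tb_per_server = data_drives_per_server * tb_per_drive     # 80 TB

fileservers = 2 * 3                         # 2 servers for all of GitLab.com, 3x for redundancy
capex = fileservers * 10_000 + 30_000 + 10_000    # fileservers + app servers + networking
print(usable_tb_per_server, capex)          # 80 TB per server, ~$100,000 capex
```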
We can fit the 6 servers in 12u, 10 app servers in 10u, if we rent a 42u rack we have half spare
@sytses SuperNAP is west coast, being based out of Las Vegas for their large facility with a ton of private fiber to One Wilshire Blvd. QTS has a more diverse resource roster in terms of location, and honestly I've gotten further with them.
My criteria thus far has been:
- Degrees of AS Connectivity
- Picking a provider with a high degree of connectivity ensures that you are fewer hops (closer) from more people on the internet. This is evaluated via BGP advertisements.
- AWS DirectConnect Availability
- AWS has their finger on the pulse of on-demand computing, I want to lay the groundwork for being able to cover transient bursts or load-shedding through a hybrid-cloud approach
- Remote Hands Support (compute resources)
- We need remote hands to replace components that experience failures / need upgrades
- Remote Hands Support (dock-side & logistics)
- We need remote hands that can take delivery of packages and goods for us, store them if need be, or deploy them to the rack if need be
Changed title: Leaseweb and OVH costs → Colocation plan
Another hardware option: SuperMicro - https://www.supermicro.com/products/nfo/storage.cfm
https://www.hetzner.de/en/hosting/produkte_rootserver/sx291 Hetzner might also be a valid option. They have a very fast uplink to Frankfurt's DE-CIX. It might be a good idea to invest in some anti-DDoS measures though, since there is no active mitigation on their side.
They also offer colocation.
Is a description of the current infrastructure (type and number of servers, number and size of disks, storage accounts, etc) available to "we, the people" aka non-employees?
@elygre the current infrastructure is:
NLB -> 9 HA Proxy Servers -> 20 Worker Servers -> NLB -> 2 Redis Servers -> NLB -> 2 PostgreSQL Servers -> 3 Ceph MDS Servers -> 13 Ceph OSD Servers
- HA Proxy Server = 8 cores w/ 28 GB RAM
- Worker Server = 16 cores w/ 56 GB RAM
- Redis Server = 4 cores w/ 56 GB RAM
- PostgreSQL Server = 32 cores w/ 442 GB RAM
- MDS Server = 8 cores w/ 56 GB RAM
- OSD Server = 16 cores w/ 110 GB RAM w/ 24x 1TB disks
This does not include things like our licensing servers, landing page servers, staging env, dev env, monitoring, reporting, build hosts, CI, etc but it should serve to give you an overview of what is currently in play right now.
@elygre Yes, one storage account per server using PLRS storage; the new NFS servers are a stop-gap to bring GitLab.com into a more consistently performing state. We're deploying 8 servers, each with 16 1TB disks, LVM-striped into a 16 TB volume group that we'll present via NFS. From there we will start round-robin distributing projects across the NFS servers.
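As an illustration only, the round-robin placement could be as simple as the sketch below; the shard names and the `next_shard` helper are hypothetical, not our actual code:

```python
from itertools import cycle

# Hypothetical illustration of round-robin project placement across the 8 NFS shards.
NFS_SHARDS = [f"nfs-file-{i:02d}" for i in range(1, 9)]
_shards = cycle(NFS_SHARDS)

def next_shard() -> str:
    """Return the shard that should receive the next newly created project."""
    return next(_shards)

print(next_shard())   # -> "nfs-file-01"
```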
In my hard drive calculation I forgot that we probably want to RAID the drives. I recommend RAID 0+1 instead of RAID 5, so my calculations are a factor of 2 too optimistic.
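To make the factor of 2 concrete, a small sketch with an illustrative 16x 4 TB node (drive count and size are my assumptions, not a quote):

```python
# Usable capacity of one node under the two RAID layouts being compared.
drives, tb_each = 16, 4
raw_tb = drives * tb_each                 # 64 TB raw
raid10_tb = raw_tb / 2                    # RAID 0+1: mirrored pairs -> 32 TB
raid5_tb = (drives - 1) * tb_each         # RAID 5: one drive's worth of parity -> 60 TB
print(raw_tb, raid10_tb, raid5_tb)
```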
Regarding the app servers. 20 servers times 16 cores = 320 cores.
- HPE ProLiant DL160 Gen9 Server
- 18 cores, 1u
- we would need 18u of them (that would fit in a rack with 3x3 2u = 18u of fileservers and some networking and other gear)
- I have the feeling that 1 bare metal core will move a lot more work than an Azure core, but I don't have data for that.
FYI, we should probably add 10Gb networking support from the start: From http://docs.ceph.com/docs/jewel/start/hardware-recommendations/:
Consider starting with a 10Gbps network in your racks. Replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TBs (a typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network, the replication times would be 20 minutes and 1 hour respectively. In a petabyte-scale cluster, failure of an OSD disk should be an expectation, not an exception. System administrators will appreciate PGs recovering from a degraded state to an active + clean state as rapidly as possible, with price / performance tradeoffs taken into consideration.

Additionally, some deployment tools (e.g., Dell's Crowbar) deploy with five different networks, but employ VLANs to make hardware and network cabling more manageable. VLANs using the 802.1q protocol require VLAN-capable NICs and switches. The added hardware expense may be offset by the operational cost savings for network setup and maintenance. When using VLANs to handle VM traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack, etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for each network also need to be able to communicate with spine routers that have even faster throughput, e.g., 40Gbps to 100Gbps.
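For a rough sense of the arithmetic behind those numbers (this ignores protocol overhead, which is why the docs quote somewhat longer times):

```python
def replication_hours(data_tb: float, link_gbps: float) -> float:
    """Hours to move data_tb terabytes over a link of link_gbps gigabits/second."""
    return data_tb * 1e12 * 8 / (link_gbps * 1e9) / 3600

print(replication_hours(1, 1))    # ~2.2 h on 1 Gbps (docs: ~3 h)
print(replication_hours(3, 1))    # ~6.7 h on 1 Gbps (docs: ~9 h)
print(replication_hours(1, 10))   # ~0.2 h, i.e. ~13 min on 10 Gbps (docs: ~20 min)
```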
Yes, as listed above the goal is a 40Gb network to start.
@northrup Right, I overlooked that. I created a spreadsheet to start tracking everything we'll need: https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7Nb7Eg22twXPuzgDwsOhtdYKQ/edit#gid=0
CephFS docs also say approximately 1 GB RAM / 1 TB disk? How does that line up with 110 GB RAM for 240 TB disk?
I would like to propose 3 Redis servers. That is the minimum recommended configuration for Redis Sentinel. This would also allow us to remove the LB for Redis. In the PoC that I set up for an HA environment at Digital Ocean, I was able to set up 3 Redis+Sentinel servers and I did NOT have to use any sort of floating IP or load balancer. Additionally, since GitLab now supports Sentinel natively, even when I killed nodes to force a failover, everything worked flawlessly with just the base IPs on the system and GitLab set up correctly to use Sentinel. I feel like this is the most simple solution.
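For reference, a minimal sketch of how a client discovers the current master through Sentinel with redis-py; the hostnames and the `gitlab-redis` master group name are placeholders, not our actual configuration:

```python
from redis.sentinel import Sentinel

# Placeholder sentinel addresses; in the proposed setup each of the three
# Redis servers would also run a sentinel process on port 26379.
sentinel = Sentinel(
    [("redis1.example.com", 26379),
     ("redis2.example.com", 26379),
     ("redis3.example.com", 26379)],
    socket_timeout=0.5,
)

# Sentinel tells the client which node is currently the master, so no floating IP
# or load balancer is needed; after a failover the next call simply resolves to
# the newly promoted master.
master = sentinel.master_for("gitlab-redis", socket_timeout=0.5)
master.set("key", "value")

replica = sentinel.slave_for("gitlab-redis", socket_timeout=0.5)
print(replica.get("key"))
```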
Would we want to try a hybrid approach here where we run Redis etc. on AWS, and then for anything that requires filesystem access (e.g. application servers and CephFS cluster) on bare metal?
Oh! I think that might be very interesting indeed. We can also have a few hypervisors and deal with it that way if it would be cheaper. As long as we put no two same service on the same hypervisor of course.
@stanhu Excellent point - those are general guidelines. In our operation of CephFS we generally stay around 55 - 65 GB of RAM usage on a consistent basis, with our highest loading period seeing 90GB usage; however, it is hard to say whether we wouldn't have seen higher RAM usage if we had a hardware/IO layer that could serve the data as quickly as it was requested. Therefore, I agree: we should trend closer to the recommended guidelines on RAM rather than use our current observations, as I believe other factors have skewed operations.
With regard to running Redis on AWS, I would like to reserve the hybrid approach for truly ephemeral services to start. At a minimum I would like to keep our database, Redis, and file storage on colocated hardware. Out of the gate, hybrid could look like HAProxy and workers in the AWS cloud; however, I don't believe that if we compared the cost of having these hosts as a one-time bare metal investment versus month-over-month bandwidth, instance, and storage charges on AWS that AWS would be cost effective. I believe in this new path we're forging that we should leverage AWS for our package caching and ephemeral load scaling (we need 20 more Sidekiq workers because of an influx of project imports, etc.).
I had a call with Dell yesterday to see what they might be able to offer for the OSD servers. They pointed me to these multi-socket servers (blades): https://dellservervr.dell.com/poweredge-fc430/ (full specs: http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-FC430-Spec-Sheet.pdf). We could put up to 4 servers in 2U (8 if we didn't need PCI slots for network cards). I assume real estate and power are at a premium in a cage. What is your opinion on these types of servers?
The RedHat team pointed us to these reference hardware links:
The important charts from the PDF:
This is what @northrup discussed with Dell:
How does this compare with the RedHat tests? For highest write/read performance, they are recommending:
PowerEdge R730xd with 16 HDDs and 1 PCIe SSD, 3X data replication method and single-drive RAID0 mode.
- We're looking at 10K rpm drives vs 7.2K. I think that's great for performance, but we'll have 36 TB instead of 64 TB.
- We're looking at 4x3.84 TB SSD drives. I assume 2 will be used for the OS (don't think we need 3.84 TB) and 2 for the journal? Do we need to spec out the exact SSDs for journals (e.g. PowerEdge Express Flash NVMe PCIe SSD)?
- Do we really need 512 GB RAM per OSD node? 64 GB seems like it would work.
I find the RH-suggested config quite massive, which is indeed interesting.
I don't think we need such large drives for the OS; smaller (quite a bit smaller) ones would do.
I love the idea of the intel PCIe journal SSD drives
7.2k drives should be good enough for data, not for journals.
I think we should over-provision memory just a bit: we certainly don't need 512GB, but we could go a bit over 64GB just to have headroom, particularly if we plan to run the MDS within the CephFS OSDs/Mons fleet.
If you happen to get quotes for hardware that are confidential, please don't put them in this issue and make this issue confidential (blocking the community from seeing the progress), but instead put them in a separate confidential issue or a private Google doc.
Yep, quotes are confidential, but not our decision.
Have you put any thought into backup solutions, yet?
In a somewhat unrelated note, what is the difference between the gitlab-com/operations and gitlab-com/infrastructure projects? Most of the issues in this project were moved to infrastructure, but not this one.
We should select servers on:
- Recommendations of Ceph experts
- Order time (how fast in our configuration at data center)
- Availability (for how many years can we order this)
- Commonality (easier to swap hardware)
I think we should listen to recommendations by Ceph experts.
The spreadsheet should be public, please make any quotes we receive private and link to them.
I don't like blade servers; they seem to be on their way out because a failure impacts multiple instances.
I think we should run the app and background workers on Kubernetes in the long term. But short term we can opt to run them on metal to make the transition less complex.
It is great to have the option to spike to AWS but for now we should assume everything except CI is on metal to make the transition faster and prevent latency.
Why use the R630 instead of the R430? Both fit 2 processors and are 1U, but the R630 offers higher-speed networking.
Regarding the memory of the OSD servers, let's stick to the recommendations of Ceph. Let's have separate OSD and MDS servers as recommended. Let's go for the big 6TB disks at 7.2k rpm to keep costs down.
I've made a proposed config in the Types tab of https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7Nb7Eg22twXPuzgDwsOhtdYKQ/edit#gid=1018071361
It looks like we can barely fit it into one rack if we have only 9 application servers, 1 staging server, and no 3rd database server. I propose we rent a Ceph rack and a rack for everything else next to each other so we don't have connection problems when we grow.
Of course we should vet our configuration with the Ceph experts.
@SeanPackham consider writing a weekly blog post about 'GitLab goes metal'. I think the subject will resonate with Hacker News readers. It will also make it clear we are making GitLab.com faster. Be sure not to blame the problems on our cloud providers, we made an error to host Ceph in the cloud, our cloud provider should not be blamed.
@SeanPackham first blog post was already written in https://about.gitlab.com/2016/11/10/why-choose-bare-metal/
I don't know about this. We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se.
Maybe worth having a conversation with him?
Hey GitLab folks -- why are you going with 10k HDDs as opposed to running under-provisioned 2TB consumer-oriented SSDs at the same price per GB, such as the Samsung 850 EVO? (Or the SM863 if you want actual SAS, though the 850 EVO drives will work just fine.) I assume you will RAID these drives and have duplication of your storage nodes.
I've commented on the HackerNews thread strongly recommending New York Internet (https://www.nyi.net/) as a datacentre if you haven't chosen a place to host your hardware yet. Seriously those guys are great, you won't regret it. We (FastMail) have been there for 15 years and we've worked with a few other datacentres and always missed the level of support we get from NYI.
We're entirely remote as well.
Regarding hardware - we have a vendor we buy Supermicro kit through - we're probably similar in total size to you. Our current IMAP server config costs about $25k per machine and contains Areca RAID with 24 x 4TB drives for 80TB of storage, plus another 6 x 4TB drives for search and 4 x 800GB Intel DC3700 SSDs for the hot metadata. This works really well with Cyrus IMAPd because we built Cyrus' data models in conjunction with picking our hardware. I'd definitely talk to RedHat about what they recommend in terms of disk sizes and memory sizes per machine.
Our biggest current storage boxes "only" have 192GB RAM, because that's the point of diminishing returns for them - but they come in at about US$25k each fully loaded in a 4U case.
Feel free to email me brong at fastmail dot fm if we can help you out :) We use GitLab software for hosting our own repositories, so we like you guys.
I saw the thread on Hacker News pointing here, and someone mentioned us, so I posted two things there. Not trying to be a commercial, but do feel free to delete this if it is not appropriate. We are users of an onsite installed (small) instance of gitlab (https://gitlab.scalableinformatics.com) . Have been for a while. We host a number of things there as well as a few other places.
We build high performance storage appliances, and have a very nice Ceph system that scales out to whatever capacity/performance you need, and provides best in class performance. Pointer to that info is https://scalableinformatics.com/unison and the white paper on our Ceph benchmarks from last year https://scalableinformatics.com/assets/documents/Unison-Ceph-Performance.pdf . Compare our EC and rados numbers to what was posted above ... anywhere from 2x to 3x better on reads and writes.
Again, not trying to be a commercial or spam, just provide options ...
Someone mentioned Stemma https://medium.com/@palantir/stemma-distributed-git-server-70afbca0fc29#.o1c00tvve on first look I think it doesn't offer all the features git offers.
BTW I think we should not use raid but just let Ceph talk to JBOD.
I'd like to see the performance of both on the hardware. The cost of high performance RAID controllers is cheap enough that we can spec one with the server and then try both ways, either way I'd like the memory backed write cache of the controller sitting in-line.
@sytses Yes, RAID is a very bad idea when using something like Ceph. Historically, RAID controllers will bottleneck performance due to having to pass traffic through their embedded CPUs. These CPUs are specialized, but still far less powerful than the system CPUs in modern servers. In almost all of the testing I've done over the years, you're better off with software RAID and a bog standard JBOD chipset.
But that's for assembling large block devices. When using a distributed object system like Ceph, it provides all of the RAID functionality over the network anyway. Adding RAID makes things worse, since you want to have OSDs attached per block device.
One thing to be very aware of is ensuring that the disc controllers - whether hardware RAID or not - properly handle a failing disc being swapped out and replaced with a good disc. Hardware RAID cards tend to be better at this than onboard SATA controllers, so may be worth it for that property alone even if you use them entirely in JBOD mode.
My current favorite controller for these setups is the LSI SAS 92xx series. They support both RAID and target firmware that supports hot swap. But given that I have, and it seems like you have, seen failures even with RAID, I stand by my usual recommendation to fully drain storage nodes for any kind of physical maintenance. The cost of draining is worth mitigating the cost of a potential outage from an unplanned failure. :-)
This is why I also recommend more balanced node sizes when setting up storage clusters. With things like Kubernetes, you can balance compute and storage on the same nodes, avoiding the storage cluster becoming a bottleneck.
Is there anyone else on the Infrastructure Team who would like to volunteer to pair up with someone from the Content Team to write the next post in this series?
What to do suggestions:
- Talk to Ceph expert
- Have a look at https://scalableinformatics.com/unison
- Get initial quotes from Dell, HP, and Supermicro
- Pick a colo in #732 (closed)
As for the timeline it would be nice if we have a configuration that is vetted by Ceph experts, with a competitive quote, and an installation timeline at the end of November.
So I've noticed nobody has brought up Packet.net yet. Full Disclosure, I'm currently a Packet employee. We'd like to at the very least help you with expertise, but we can also discuss if our platform could work for your needs.
Had a great conversation with Joe at Scalable Informatics (SI) as recommended by someone on the Hacker News thread. They make a Ceph appliance that reportedly outperforms the Supermicro setup: https://scalableinformatics.com/assets/documents/Unison-Ceph-Performance.pdf.
Joe says the main advantages of the SI solution is that it is specifically designed to maximize I/O throughput for distributed filesystems:
More specifically, he says vendors such as Dell try to minimize cost by keeping the number of backplanes to a minimum. The SI solution has 15 backplanes per HBA. As stated in this RedHat document:
Many modern systems that house more than eight drives have SAS expander chips on the drive hotswap backplane. Similar to network switches, SAS expanders, often allow connection of many SAS devices to a controller with a limited number of SAS lanes. Ceph node configurations with SAS expanders are well suited for large capacity-optimized clusters. However, when selecting hardware with SAS expanders, consider the impact of:
• Adding extra latency
• Oversubscribed SAS lanes
• STP overhead of tunneling SATA over SAS
Because of backplane oversubscription or poor design, some servers used for Ceph deployments have encountered sub-par performance in systems that use SAS expanders, although it is not universally the case. The type of controller, expander, and even brand of drive and firmware all play a part in determining performance.
He'll send me a quote of different configurations so we can get a better idea of cost/performance.
@stanhu I honestly look forward to the SI quote and I think they have a novel product; however, there is more than a small part of me that does not like the 'black box' approach to solving problems. The "trust us, we've tuned it all and you don't need to see or touch what's behind the curtain" just doesn't sit well. That aside though, my real concern echoes @sytses' commentary about boring solutions for hardware that has ubiquitous reach. Dell and HPE both have 4-hour same-day parts replacement, and in fact both have common parts depots in the colocation facilities that we're looking at, bringing parts delivery down to minutes. Bespoke hardware vendors increase complexity for marginal gains and often cannot provide the level of service that global service providers can.
Really appreciate the open dialogue here. I have not used CephFS yet (really interested now), but a lot of it sounds similar to getting high performance out of ZFS when it comes to hardware. We've been using ZFS at Beanstalk for many years on Supermicro and learned a lot in the process. The hardware takeaways sound pretty similar:
- Present disks directly in JBOD mode. Common controllers are 9211, 9207, or more recent 9300 (for 12g sas).
- Focus on low latency disks for writes (in ZFS these are log devices). We used to use ZeusRAM, but HGST SAS SSDs are fantastic. NVMe should be better.
- As mentioned above, be careful with backplanes and interposers. You want the disks directly connected to the controller ports.
- Probably not as important for CephFS, but SAS SSDs (multipath) are usually preferred and should have powerloss protection.
- We use 10GbE with LACP and it has worked well - the focus is on lowering network latency and not throughput since repos are mostly small files. 40GB sounds pretty awesome.
From what I can tell, some of these things might not be needed due to the way Ceph works. With ZFS the strategy is mostly on scaling up and using things like dual head units with JBODs for failover (using RSF-1), so the hardware choices need to be scrutinized a lot more. Where with Ceph you can just focus on the raw performance of each node and let the cluster deal with the resiliency.
Really curious to see how this works out and I'm eager to try it out for some projects here at Wildbit. Thanks again for sharing.
So, I've been reviewing this thread, and I have some questions.
What are the requirements for storage:
- I see "~100TB+", but is that replicated or raw?
- How much space is needed for each of the components?
- What is the total IOPs requirements for each of components?
- What is the total throughput needed for each of the components?
- What is the current latency for the components IO?
My concern stems from discussions about technical requirements like networking (10GbE vs 40GbE), spinning disks (SAS 7.2k vs 10k), # of spindles, and SSDs. But there's no good answer to what is really needed in order to make sure we meet those specs.
For example, for spinning disks, there is barely a ~20% improvement in IOPs and effective latency between 7.2k RPM and 10k RPM drives. But there is a large impact on price and power. Not to mention, it may not matter as SSD use for metadata and/or caching may have a far greater bang for the buck.
In light of recent downtime in #748 (closed), we had a meeting today to discuss whether we wanted to go all-in with CephFS or consider some other approach. We have heard through the grapevine that the issue seen in #748 (closed) is rarely seen on bare metal, so right now we're working off the assumption that this is just an artifact of us running in the cloud.
One issue is that right now CephFS does have a single point of failure with the metadata server (MDS), but we can make the CephFS MDS backup the journal to mitigate the problem. Also, multiple MDSs is something CephFS is working on. BlueStore also looks promising, but it's not clear whether we'd be able to switch easily.
Alternative approaches include the following, but each has its own set of pitfalls:
- Multiple NFS servers: boring and may not be fast, challenging to maintain
- SAN: expensive and requires being dependent on a third-party
- Multiple CephFS clusters: reduces the blast radius, but more moving parts and more to maintain
- Hybrid approach of NFS/CephFS: lots to maintain, confusing
- Writing our own distributed Git layer: lots of work, would take a while
The bottom line is that any type of server we could buy could be used as a filesystem store, so we will continue down the purchasing path assuming we are going to use CephFS.
I also discussed with @sytses about a number of items:
- We should just use 1 main CephFS instance and make it work (as opposed to having 4 CephFS shards, even though this would reduce the blast radius)
- Backing up is hard: AWS EFS is not cost-effective (e.g. 16 TB x 40 drives).
- Restoring a backup from S3 may take too long. If we’re down for 2 weeks and people can't access their data, it's not a good solution.
- Consider using another CephFS cluster for backup, possibly in another availability zone
- Can we put the second CephFS cluster in another datacenter?
- Probably not due to filesystem latency
- But we can move CephFS cluster to another availability zone and spin up app workers on AWS if we need
- We don’t want to be beholden to a proprietary solution: we want to be using commodity hardware for a number of reasons
- Not the worst thing in the world to buy hardware that could be used to support other filesystems
- Use GitLab Geo to replicate data
- Make sure our hardware buy takes into consideration things that will optimize CephFS performance (e.g. # of backplanes etc.)
- Using a secondary cluster means we'll need 2x everything for disk storage
I'd also recommend looking into https://rook.io/, but maybe not ready for production just yet.
One thing I would like to propose is that instead of separating storage from compute, we move to a blended model where all nodes are able to participate in a distributed storage stack.
There are a number of advantages to doing this vs. creating dedicated storage cluster(s).
- You get lots of free CPU time that is usually stranded on storage clusters
- You get data locality to your applications, so some percent of IOs doesn't have to leave localhost. (improved latency/bandwidth)
- More smaller nodes means faster time to recovery
- Lower bandwidth NICs (10G vs 40G) can be used as more nodes gives you the same aggregate cross sectional bandwidth.
Remember that time to recovery for a broken storage node is governed by the aggregate recovery bandwidth of the remaining N-1 nodes. I did some quick estimations, and with a 25% recovery-bandwidth allocation on a 32-node 10GbE cluster (dual-homed, so 20Gbps per node, 640Gbps total bandwidth), recovery is under 15 minutes.
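A sketch of that estimate; the amount of data on the failed node is my own assumption, everything else follows the numbers above:

```python
# Recovery-time estimate for one failed node in a 32-node, dual-homed 10GbE cluster.
nodes = 32
node_nic_gbps = 20            # dual 10GbE per node
recovery_fraction = 0.25      # 25% of bandwidth reserved for recovery
data_on_failed_node_tb = 10   # assumption, not a measurement

recovery_gbps = (nodes - 1) * node_nic_gbps * recovery_fraction   # ~155 Gbps
seconds = data_on_failed_node_tb * 1e12 * 8 / (recovery_gbps * 1e9)
print(round(seconds / 60, 1))  # ~8.6 minutes, i.e. well under 15 minutes
```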
I've written up a design/cost estimation sheet here:
@superq Thanks--that's an interesting thought, although I think with CephFS putting other applications on the disk nodes may eat RAM/CPU that is needed during a recovery?
- RedHat does not consider CephFS production-ready yet for two reasons:
- They have not done extensive tests
- They want to have active-active MDS support; currently the MDS is a SPOF (as we have seen now).
- How can we mitigate MDS issues now?
- We can run active-standby (active-active not available yet)
- Test this by hammering MDS with lots of updates and create transient errors on OSDs
- Does it make sense to move frequently-accessed data to SSDs (NVMe specifically) via CRUSH maps?
- Haven't really tested this, but seems like it would work
- There may be contention at NVMe and CephFS may not be able to maximize performance on NVMe
- This won't be necessary with Bluestore: Stable/default mode targeted for Spring 2017.
- Are we were going to put the Ceph monitor and MDS on the same box?
- Recommendation: It's okay, but don't put the MDS on the lead monitor
- Lead monitor is chosen by lowest IP address
- What about HBAs/# of backplanes in commodity hardware?
- One HBA is more than enough for most people
- We could consider adding dual HBAs for 16-drive case
- We could write a script that pins the OSD processes to the right HBAs for improved performance
- Recommendation: use RAID0 with spinning disks.
- RedHat team had seen issues with Perc 730 RAID in JBOD mode.
- With high concurrency with small block sizes, performance degrades significantly.
- Performance is fine in RAID configuration
- 16x6 = 96 TB drive configuration seems to make sense. Do we need 96 GB RAM?
- More RAM can't hurt, but 64 GB RAM should be fine.
- Recommendations for HPE hardware?
- DL380 G9
- also see Supermicro reference
- Long-term plans for replication at RADOS layer?
- Haven’t started to think about DR here
- How do we handle DR backups?
- Most people don't backup this much data
- Does backing up to another CephFS cluster make sense? Yes
- Smarter rsync could be discussed on community
- CephFS clients communicate with OSDs via the public network.
- OSD-OSD network communication can happen on its own network
- No way to change that
- MDS can be point of contention
- May want to increase cache size there, by default 1 GB RAM & 1 million object cache, consider putting MDS Journal on NVMe
- Are 3 monitors sufficient?
- 3 is minimum number to guarantee availability. Start to recommend 5 when there are over 500 OSDs processes.
- Timeline when running multiple MDSs?
- Hope to make it available with Luminous (Spring 2017)
- Will upgrades be a problem?
- Only if you try to mix Bluestore w/ Jewel today
- Due to design of CephFS, you can have Bluestore OSDs and legacy OSDs and migrate.
- But you don't want to be in this mixed state too long due to performance
- Recommended NVMe: Intel P3700 PCIe NVMe drives
- RedHat emphasize stress testing ourselves
- Heavily test colocated MON and MDS
- Inject snowball errors, injecting transient failures like bringing network switches up/down repeatedly
- Does it make sense to use journals in RAID1? NO
- Don’t put journals on RAID, it’s unnecessary overhead
I also took away from this that RedHat/Ceph has yet to saturate an NVMe device with Ceph and the recommended approach to get the most out of the NVMe device is to create four (4) OSD devices per NVMe (and even that leaves headroom). Given that, I think mixing the journal and an OSD or two per NVMe device would be an OK thing to do/try.
This discussion, just like the initial disk discussion with RedHat/Ceph, served to drive home even more for me that we as engineers need to quit trying to out-game Ceph in certain areas. The architecture of Ceph takes the placement of data on different hosts into account, so that if an individual OSD on a host, or a given host, dies, life moves on and the Ceph infrastructure rebalances - and our attempts to add traditional engineering design (RAID 1, 5, or 6) only serve to slow things down. In short, if you're going with Ceph, you need to lean in hard and do your homework to understand what's happening under the hood.
@stanhu Yes, it will use extra resources on recovery, but by having lots of small nodes, the recovery requirements are greatly reduced as a single node is a much smaller fraction of the cluster. I'm not a Ceph expert, but the design is very similar to stuff I used to work on (Google File System, Google Colossus Filesystem).
As for your list: 1) I'm inclined to agree, however active-passive is better than nothing.
2) I would recommend having an MDS per use case. Like it was pointed out before, this limits collateral damage.
3) I was going to suggest bcache to provide IO acceleration.
4) I don't know
5,6) These are all NOOP when going with a more distributed setup, as you only need SATA connections.
7) I normally think more about how much ram you need in total across the cluster for read/write buffering. With NVMe in front of disks, you should need a lot less.
8) I've not looked at HPE in a long time.
9,10) I normally think about backups in terms of datasets. For files, some kind of sync to a cloud provider would probably be a good idea.
11) One thing I was wondering about that I'm not sure is in design. I don't see much about having a DMZ setup to connect the public internet to the backends.
12) I'm thinking having more than one MDS is a good idea. The question is can you run more than one filesystem on the same OSD pool.
13) +1, I would start with more.
15) I'd stick with the most stable track for now, which seems to be CephFS (again, I'm not an expert)
16) The S3700 series might be overkill. There is no performance advantage over the S3600; the big difference is write endurance. They cost 50% more per unit of space, and the durability of the S3600 is quite good. I did the math for some MySQL servers and we found that the ~7k write cycles (full-disk writes) on the S3600 were more than sufficient (see the endurance sketch after this list).
18) I don't know.
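A quick endurance sketch for point 16; the drive capacity and daily write volume are illustrative assumptions, only the ~7k full-drive-write figure comes from the comment above:

```python
# How long a drive rated for ~7,000 full-drive writes lasts at a given write rate.
capacity_tb = 1.6              # assumed drive size
full_drive_writes = 7_000      # endurance figure quoted above
writes_per_day_tb = 1.0        # assumed sustained writes per drive per day

endurance_tb_written = capacity_tb * full_drive_writes    # 11,200 TB written over the drive's life
years = endurance_tb_written / writes_per_day_tb / 365
print(round(years))            # ~31 years at this (assumed) write rate
```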
I find it fascinating, and very bold, that you intend to use a file system that the vendor itself does not consider production ready. While your money may come from on-premise installations, your visibility sure is on the public site, and I'm not sure what kind of downtime or even data loss you can really afford.
I understand that the multiple-NFS-server setup is considered a band-aid while waiting for the Ceph solution, but what is the experience with it? Does it work sufficiently well? And could it be developed further, for example by running on a filesystem with de-duplication, which might be very useful when there are lots of forks?
@elygre You raise the million dollar question, and that's why we are having a discussion here about the best path forward. There are tradeoffs with every approach. CephFS seems like where the puck is going, but we are a bit early to the game.
I understand that the multiple-NFS-server setup is considered a band-aid while waiting for the Ceph solution, but what is the experience with it? Does it work sufficiently well? And could it be developed further, for example by running on a filesystem with de-duplication, which might be very useful when there are lots of forks?
Right, multiple NFS servers will work in the short-term, but there are still problems with this approach:
- We'd have to find a way to rebalance (or disable a shard) if an NFS shard gets full
- We'd still have a SPOF with each NFS server
We've also considered a hybrid deployment of both CephFS and NFS. Are there other alternatives?
I think other companies use a proprietary solution (e.g. NetApp, EMC Isilon, etc.) and pay an arm and a leg for a scale-out storage solution.
Yeah, your challenges are indeed difficult :-).
I would imagine that rebalancing is perhaps hard, but not all that hard. And while each NFS shard would indeed be a SPOF, with multiple shards each failure will at least only take down a single shard.
And, of course, backup would be significantly easier in this scenario. rsync could run independently on independent shards, and I believe filesystems like ZFS have a mature snapshot-to-backup solution.
I have no experience with deduplication, but it seems to me that git-repositories would be fairly well suited for that. I ran a quick check on a small repository that was just used for training, and therefore has a few branches, and found that there is a lot of duplication to de-duplicate. I'd imagine that with 2000 branches of the gitlab-ce repository, that alone would be a significant saver.
```
find . -type f |grep kurs-infeksjonskontroll |grep pack$ |xargs sum |sort -n -k2
01683  7142 ./iknowbase-user/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
01683  7142 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-607a8bf03bf9dc217e170f1bdcc330dad0e1558f.pack
14400  7157 ./xxxx/kurs-infeksjonskontroll.git/objects/pack/pack-b41e1c8f197c9c787a5780170fbae428873008c5.pack
```
I've seen NFS-HA with DRBD. The setup is fairly reliable most of the time, but when it breaks, it's difficult to deal with in a nice automated way.
The real down-side to DRBD is that you're basically limited to pairs. This means that you're limited to the maximum performance of a single server per filesystem space. The other server is just a cold spare and doesn't contribute to the performance of the system. It also means you're still in the same situation as CephFS when it comes to only having a SPoF master with a cold backup.
With CephFS you get the aggregate capacity of all nodes participating in the cluster.
To me the big question is not the SPoF issue with the master, but how fast you can recover from failure. With the way NFS+DRBD works, it takes some time to stop, unmount, failover DRBD, mount, maybe fsck, and restart NFSd. Plus the time and fragility of NFSv4 clients. It's not a pretty setup that I would recommend to many people.
Since I don't have a lot of experience with CephFS, I can't say what failovers are like, but I hope it's better than the NFS-HA dance. :-)
We are pushing our design decisions and the assumptions behind them into this very repo: https://gitlab.com/gitlab-com/infrastructure/tree/master/design
Please open issues if you have any suggestion.
@superq we have seen failovers of the MDS server and they work pretty well.
We have also seen up to 4 OSDs going down out of 10 and CephFS locking while waiting for those nodes to come back up.
Given our experience with Ceph, it is quite reliable and it recovers gracefully and automatically. Which is awesome.
The only issue we had that we still can't explain and don't know where it came from is when we had the journal locking under no particular stress. That was a 1h outage while we executed the journal recovery tool.
I would say that this is the only scary point of CephFS
Here we went with a pure Open Compute platform for a rack with 17 nodes: 30 CPUs (424 cores), 1.8TB total RAM, 339 TB storage (29 TB of SSDs and 120 x 4 TB HDDs), with each node having a 2x10G uplink to 2 TOR switches. The switches are ONIE running Cumulus (6 switches in total, to simulate 2 racks with 2 TOR each, linked to 2 40Gb spines running OSPF for now). Without much "negotiation", as we only ordered 1 rack, the current solution cost < $100K. For logistic convenience we went with HP OCP products - Altoline switches and HPE Cloudline servers (https://www.hpe.com/us/en/product-catalog/detail/pip.1008862650.html#) - but there are many vendors around too.
We're running OpenStack (MOS 9.0) on it with Ceph storage. We had a lot of issues because of a bad storage nodes/journal setup, as we didn't allocate enough memory per node. Moving some RAM from compute to storage and adding some SSDs fixed the issue. Note that this is our lab setup and we don't have huge traffic yet.
We're also having a look at the http://openio.io/ project for storage and the 6WIND OpenStack plugin for DPDK network optimizations.
For OCP hardware to get an idea of the costs you have a simple quote tool provided by Horizon computing to help you https://sales.horizon-computing.com/dashboard (we made several simulation on their tool).
PS: have you considered a containerized approach for the Ceph setup? I would be interested in this subject :D https://www.sebastien-han.fr/blog/2016/06/06/Busy-working-on-ceph-docker/
Thanks for the comments. I agree the NFS-DRBD approach doesn't scale. At the moment, we're mostly considering a RedHat-supported configuration, and a containerized approach isn't even on a radar at the moment.
An update where we are today: We've asked Dell, HP, and Supermicro to give us quotes. We've conveyed our urgency to get this done soon. Now we are just waiting to get quotes, but things may be slow with the US holidays this week.
Quotes (GitLab team only) are in https://drive.google.com/drive/u/0/folders/0B7qPxsTi6lMOS2tfaUwwU3paRFU
@sytses Hello. Power management is on the rack itself, so power redundancy is provided by the PDU in the middle of the rack (which is of course dual power sourced). The idea, though, is that a compute node has to be considered a commodity and thus be easily replaceable without impacting production. Regarding HW issues, we have had none since setup (a year ago). As for operation, I was able to add RAM to a server in less than 10 minutes, alone and without tooling: unplug the network/SAS cables, remove the server, open it, put the memory in, put it back in the rack and plug it back in. The rack is 250 km from the offices and we manage the rest remotely (we've had more issues with our NFV router).
@davidrama Thanks. From HP:
Cloudline lacks about 15 different software/mgmt. features from our DL series. You will be responsible for updating all the firmware across all components including drives. For a lot of the customers this may be quite cumbersome and they stick with the DL line for that reason. They also have a longer lead time, about 16 weeks. Based on your timeline I didn’t think this would be a good fit.
@stanhu Hi, well, reading their answer I assume they're reluctant to sell you OCP hardware ... and prefer to stick to their best-earning hardware. What are the 15 missing features? Except for "bugs" I never had to upgrade the hardware firmware (but this might be important for you). What can't be done using standard provisioning/management tooling? (IPMI / PXE for the basics)
Yes I understand this point ... but ..
Did you check these US guys (I know time is an issue)? This might push HP to deliver faster:
There might be more, but I looked at these for my RFP as well.
Here's a quick update on what we've discussed today:
We've revised our hardware configuration to use one chassis with the general configuration:
- Dual-socket server with 3 to 4 3.5" HDDs, E5-2630v4, 128GB RAM, 800G PCIe SSDs
We'll add more memory, disk, or SSDs based on three different configurations:
- Storage (e.g 4x8 TB disks)
- Database (fast, large SSDs)
- Hardware vendors: I've updated Supermicro, HPE, and Dell with our updated configurations.
- CephFS: I've asked the CephFS experts if there are any concerns with using more, slimmer boxes as opposed to fewer, fatter boxes.
- Datacenters: This also affects our power requirements, so I've been reaching out to datacenters to provide more information. I've been in touch with QTS, NYI, and Equinix.
IPv4 space has been exhausted for non-ISP entities, so we will lease a chunk of address space from the colocation provider.
IPv4 space has been exhausted for non-ISP entities
Does ARIN allow private sales to non-ISPs? Having your own IPv4 block would give you a stable IP address for GitLab pages, in case you ever decide to move again.
They do not "officially" support such a thing. They allow third parties to "reach an agreement", the sale of IPv4 blocks outside of an acquisition is pretty shady and not guaranteed to be honored. Believe me I WANT us to own our own IPv4 block. We have our own IPv6, but that does little good in a world still ruled by v4.
Response from CephFS team regarding using smaller boxes and dual-bonded 10Gbps links:
No downside really. The smaller the boxes the more performance per OSD you get since you lower the number of disks per controller. The ref architecture mentions the 16 disks servers because they're a good compromise between cost and performance for most customers.
No issues with using dual-bonded configuration either. That's something we see very often with either LACP or Active/Backup configurations.
As @sytses is already pointing out with:
I don't like blade servers, they seem to be on their way out because a failure impact multiple instances.
Are we taking this into account? I know we will go for 4 nodes per chassis, but this still means, e.g. with our PostgreSQL cluster, that we need one DB server per chassis and preferably in a different rack.
Hi, I see there is already some mention about OCP. We work with the manufacturers supplying to Facebook for example, not with HP (since already noted, their overhead is not required). Our location is Amsterdam, we focus on Europe. Would you consider having an OCP based solution in Amsterdam?
I wanted to take a step back and summarize where we are today. We've talked to a lot of people and received a lot of feedback. The current options on the table are:
- Purchase hardware and do everything with some remote hands (e.g. order hardware, find people to rack/configure/etc., and maintain)
- Purchase hardware and go with a managed service model (e.g. ServerCentral, etc.)
- Lease hardware with a complete managed service model (e.g. SoftLayer, Packet.net, etc.)
The first option is the most complex because it means owning all the details of a deployment. We also have to get involved with network/power/etc. design and hire more people to maintain this. It looks attractive from an OpEx perspective, but many costs and headaches will add up with running this. Big upfront CapEx, unclear OpEx. At least network bandwidth is a fixed cost.
The second takes some of the burden off our hands. We can choose exactly the hardware we want and allow others to run it for us. Still big upfront CapEx, but OpEx is a bit more predictable. Network bandwidth also fixed cost?
The last option has the least CapEx spend, but high in OpEx (similar to the cloud model). Depending on the service, we may have less flexibility in what hardware we can choose. Some vendors are willing to customize the servers for us. Little CapEx, but OpEx may be large. Network bandwidth also will be charged by GB.
@stanhu I'm leaning towards the last option. As GitLab.com grows the CapEx will get heavier. We calculated that all storage (5x94TB raw) was about $5k per month, costs like that are much better than in the cloud. The performance will be bare metal although we have less control of the network. But not having to get really good at hardware avoids a lot of distraction. My only concern is what our database will do since Packet.net can't go above 512GB. But we can test that.
You overestimate the amount of work required to make Option 1 happen - yes, there is an initial expense to racking and stacking gear, however you do it once and it generally isn't something you go back and touch on a regular basis. Hardware failure of a systemic nature requiring whole-server replacement happens on the order of maybe once every three years, and component failure happens at the level of maybe three drive failures every year. We purchase the hardware, @ahanselka and I fly out to rack, cable, label, and document it all, and then you have a remote hands contract for when something breaks or needs replacing. I spent the last 11 years designing and implementing data centers and colocation cages around the globe.
We want fine level network control under Option 1 because that gives us the most direct ability to control how we handle packets and traffic to ensure uptime and smooth transitions of resources. There are things like BGP routing and anycast load balancing that we would like to do today in order to make the front-end dynamic and roll in and out servers/services without being noticed, these are things that require more control over the network than we currently have / are afforded under services.
With Option 1 the CapEx is up front, but not really: depreciation and financing allow it to be spread over a period that makes it easier on the books. OpEx is greatly reduced at a cost that is clear, as you've seen proposed pricing on it, and I'm willing to bet real money that the overhead on the Infrastructure Team won't be any more than dealing with current cloud hosting provider issues/limitations/quirks.
@jnijhof I cannot speak for @sytses but I think his concern with blade servers was about designs where the blade chassis contains a network switch, backplane aggregation, etc. for things like network, host bus, and power. In the SuperMicro "blade" format that we've been looking at, each host still has its own connections for network and host bus, and they only share a common redundant power bus bar. This in my mind makes them different from, say, an HP C-series blade chassis or a Dell M1000e type blade design.
@stanhu another option (which we did) is to order the racks fully loaded and, if possible, pre-cabled as a standard, with a "pre-testing/validation" document. In our case, because DAC cables were not available at delivery, we just had to do the NIC patching ourselves. Considering a rack as the "standard" DC unit in your deployments helps reduce the OpEx without impacting the CapEx too much (if the seller wants to sell you their stuff they'll surely make the effort). This is an in-factory pic of the rack in the FoxConn/HP factory: https://postimg.org/image/fasqyf3lt/ This is the delivery pic of the rack: https://postimg.org/image/vpsaeir4j/
I had a long talk today with someone who brought multiple companies from the cloud to metal. His advice was: don't do it unless it is absolutely needed. Even companies that define themselves as providing a hosted service should not do it. The expertise you need to do hardware right is large, expensive, and hard to get. It means employing experts in servers, networking, backup, security, power, etc. Our board members are seeing exactly the same based on companies they saw going down this path; it takes about 70% of their engineering effort. For us the priority is making a great tool that most people host themselves. We can't let hosting dominate our organization.
The move to metal was based on us running Ceph(FS). Based on the feedback so far, running Ceph is very hard and something you should only do if you can't solve things at the application layer. I think we can solve things at the application layer.
In 2017 GitLab repositories need to be made active-active to serve our customers. Our current best plan is gitlab-org/gitlab-ee#1381 (closed). This involves sharding between different servers/clusters like we currently already do.
@pcarranza made the following comment about our IO problems that indicates the cloud is not our biggest problem:

"Just accessing a git repo doing a pull is quite expensive in the filesystem as a whole; in the case of gitlab-ce, git will perform ~31k open operations to probe that all the refs actually exist. Since we are adding our own refs (keep-around, merge-requests and tmp) this will only grow, making these operations more expensive over time. gitlab-org/gitlab-ce#24329 (comment 19335246)

On the other hand, if you check IOPS in the NFS servers you will see that we are dominated by write IOPS. But checking git processes on the workers it looks like we have more read commands (also consider that a git write command will perform reads to agree on what to write). So something is missing between what we see in the front and the back end. I think that our read load is currently being masked by filesystem buffers in all the workers (42G cached memory usage in worker 1 for ex). So it's not only that NFS is slow; the filesystem is slow on writes because we are hitting the Azure 20K IOPS limit all the time only on writes, but since we have all that free memory in the workers we are taming the reads.

I don't think that there is a huge network latency here. I think that git is a system that is designed to be a database, but in the world of databases it is sqlite: a single file that works awesome when you have 1 client; the moment you add concurrency and heavy load you need a specific system that will keep all the metadata in memory (database indexes). Every time we invoke a git process we are loading all the indexes and probing them (the open syscalls); do this often enough and you will end up hammering the hosts and taking gitlab.com down (https://gitlab.com/gitlab-com/infrastructure/issues/506), and this is amplified by the fact that we have all the git access in the same place where we have the web application and doing an extremely free git access by just creating Rugged::Repository objects whenever we want to check that a branch exists.

So I don't think we can particularly blame NFS or CephFS here; the way git behaves collides with the way a filesystem works. True, by sharding a lot we could isolate the issue and take it under control, and having local drives would remove the penalty of reaching the network for every IOP we execute (remember the 31k IOPS for reading the refs), but in the long run we would be hitting another IOPS limitation.

So I think we should keep moving to bare metal, while we work on the next step to remove load, and over time we start simplifying our git access by making it a specialized system. And after this we introduce distribution, probably by following what Google is doing with Ketch.

I would also like to mention that right now we are just sampling what is happening in the git world, and that we are missing a lot of information; moving the git access to a specific system will allow us also to monitor closely what is happening at that layer, and probably downsize the worker fleet."
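As a purely illustrative Ruby sketch (hypothetical code, not GitLab's actual implementation), this is the pattern being described: opening a fresh Rugged::Repository for every branch-existence check repeats the ref probing each time, while a memoized handle amortizes that cost within a request.

```ruby
require 'rugged'

# Opening a new Rugged::Repository for every check forces libgit2 to
# re-read the ref storage each time this method is called.
def branch_exists?(repo_path, branch_name)
  !Rugged::Repository.new(repo_path).branches[branch_name].nil?
end

# A per-request cache of repository handles (hypothetical helper) keeps one
# open handle per repository so repeated checks reuse it instead of re-probing.
class RepositoryCache
  def initialize
    @repos = {}
  end

  def branch_exists?(repo_path, branch_name)
    repo = (@repos[repo_path] ||= Rugged::Repository.new(repo_path))
    !repo.branches[branch_name].nil?
  end
end
```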
We're currently struggling with IO file access. I propose the following steps to solve this:
- If we need a quick fix right now, add more NFS servers.
- Get better monitoring in place (right now we are just sampling what is happening in the git world, and we are missing a lot of information).
- Find out if there are quick fixes to reduce the read and write load (for example, don't create Rugged::Repository objects whenever we want to check that a branch exists).
- Move all git access to one interface (either gitlab_git or the git-access-daemon) for comprehensive monitoring and to allow moving to Remote Procedure Calls (Git RPC) later.
- Move to Git RPC, where the file servers become git servers, halving the amount of IOPS, reducing latency, and allowing workers to keep serving requests (see the sketch after this list).
- Move to active-active; with this we can read from any node, greatly reducing the number of IOPS on the leader.
- Move to the most performant and cost-effective cloud (easier because of active-active replication).
- Consider local storage; for example, AWS offers 24 x 2,000 GB (48 TB). This move is possible because with active-active we can replicate across availability zones.
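To make the Git RPC step more concrete, here is a hedged sketch of what a worker-side call could look like if the file servers exposed a small git service. The endpoint, port, and JSON shape are invented for illustration and are not an existing GitLab API; the point is that one round trip to the git server replaces thousands of filesystem operations over NFS.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical Git RPC client: the worker no longer opens the repository over
# the network filesystem, it asks the git server, which reads from local disk.
class GitRpcClient
  def initialize(host, port: 8075) # port is an assumption for this sketch
    @base = URI("http://#{host}:#{port}")
  end

  # One HTTP round trip instead of the many open() calls a local ref probe
  # would otherwise issue against NFS from the worker.
  def branch_exists?(repo_path, branch_name)
    uri = URI.join(@base.to_s, '/branch_exists')
    uri.query = URI.encode_www_form(repo: repo_path, branch: branch_name)
    response = Net::HTTP.get_response(uri)
    JSON.parse(response.body)['exists']
  end
end

# Usage (hypothetical host and repo path):
#   client = GitRpcClient.new('file-server-01')
#   client.branch_exists?('gitlab-org/gitlab-ce.git', 'master')
```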
There is work being done on https://gitlab.com/gitlab-org/git-access-daemon/blob/master/design/README.md which shares some of the above goals, but the indirection it introduces is something that needs to be discussed.
Three more thoughts:
- Our users are using GitLab in the cloud, so we should ensure it works there.
- Any performance optimizations will benefit everyone.
- We should have shorter feedback cycles (measure, improve, repeat)
If we do the above, I should probably write a follow-up blog post 'What we learned from 300 people vetting our proposal':
- There is a lot to learn about hosting on metal (examples)
- This stuff is sometimes more art than science (examples of opposing opinions)
- Ceph(FS) is hard to run, even if we hire the expertise to run it, it would be a burden on our largest customers (examples)
- There are options between the cloud and DIY (Server Central, Packet.net)
- Metal is harder than it looks. Even with managed hosting, hardware would have to become a core competence (networking, provisioning, security, capacity planning, disaster recovery). If you host in the cloud you can get the provider's experts to help; outside it you need to hire people who are increasingly hard to find. You need to hire for it and it changes your company. It is hard to keep making software at the same rate when you also need to think about hosting. We want to keep improving GitLab at a fast rate.
- But since we won't do Ceph we don't have to move. We can do the three points above.
- See above comment #727 (comment 20044060) for our next steps.
@sytses Cloud vs hosted/bare metal depends on the "critical" size of your service. At some point it could become far more expensive to be in the cloud than on a bare metal/hosted solution (and what would the cost of turning back be then?). Going to the cloud doesn't free you from network/system/software/architecture expertise; you just shift it to a cloud platform. You gain a certain agility when everything is running OK, but you might encounter issues when facing problems, as you'll have to deal with your CSP's SLAs and goodwill.
Great that you are reconsidering the move to bare metal. I see the value in CapEx, but it comes with a lot of strings attached. No need to rehash all the great points made.
Due to the reliance on latency and bandwidth, the cloud might get expensive in the long run; bandwidth is usually their high-margin product. As GitLab Geo is something you want to push forward anyway, I suggest staying with Azure and using NFS shards for now, and adding GitLab Geo read-only instances for runner traffic and even localized traffic. All fetch/pull requests would go through a GitLab Geo instance hosted on leased servers with a reduced bandwidth price. Pushing this idea further, having local GitLab Geo instances available for pull/fetch traffic might further reduce bandwidth and load on the main GitLab instances.
The suggestions would be:
- scale with NFS sharding on Azure (a configuration sketch follows this list)
- evaluate other cloud options with local SSD-backed filesystems (either using CephFS or other replication)
- reduce traffic to the primary GitLab instance via transparent read-only usage of secondary GitLab Geo instances (runner traffic, localized installations, customer locations)
- build on a standardized framework such as Mesos, Docker Swarm, or Kubernetes so a move between clouds and bare metal is always an easy option
- push GitLab Geo to active-active capabilities
- remove the need to go bare metal and keep the agility to improve your core competency instead of doing hosting
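For the first suggestion (NFS sharding), a minimal sketch of how repository storage could be spread over several mounts in Omnibus GitLab's /etc/gitlab/gitlab.rb. The shard names and mount paths below are made up, and the exact git_data_dirs syntax depends on the GitLab version in use.

```ruby
# /etc/gitlab/gitlab.rb (sketch): spread repository storage over several
# NFS mounts so no single export has to absorb all of the IOPS.
git_data_dirs({
  "default"     => { "path" => "/var/opt/gitlab/git-data" },
  "nfs-shard-1" => { "path" => "/mnt/nfs-shard-1/git-data" }, # hypothetical mount
  "nfs-shard-2" => { "path" => "/mnt/nfs-shard-2/git-data" }  # hypothetical mount
})
# New projects can then be assigned to a shard, and existing repositories
# moved between shards over time as load requires.
```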
One of the interesting/concerning FUD points above is the "70% of engineering effort". I'm not sure exactly what this means, but it seems way off base.
Implementing bare-metal does cost extra engineering effort, but based on my experience a modern setup should cost no more than 5% of "engineering effort".
In this setup, GitLab is only looking at 1-2 racks' worth of equipment. Even managed completely manually with "normal" sysadmin techniques, it should only take one person to keep this kind of setup running. Even if that sysadmin costs $100-200k/year, we're still talking about a payback versus the cloud on the order of less than 6 months, based on the other data posted.
Sure, there are a lot of things to consider, but Cloud vs Bare-Metal seems to be getting a lot of FUD these days.
Sid's comment above struck me:
For us the priority is making a great tool that most people host themselves.
This is a bit left-field, but expands on @stp-ip's Gitlab Geo-related suggestions above: perhaps the repo storage of gitlab.com (and maybe background tasks like runner management, etc.) might be entirely distributed and replicated across constituent users. I get that performance and trust issues are big unknowns in terms of both technical and people engineering. For the mothership, the obvious big advantage is no more repo storage -- gitlab.com becomes just a coordinator.
GitLab could push adoption by working with VPS providers to, e.g., subsidize block storage costs in return for a required level of availability to the mesh. Users' willingness to participate might be tested a bit by allowing people to share their CI runners now (on the assumption that other folks out there do as we do: host repos on gitlab.com but spin up their own runners).
Philosophically, this transfers your shared & open development ideas to the actual operation of the hosting stack, which AFAIK would be a new thing. But yeah, I also get that it's a little crazy to entrust your uptime to strangers.