The GitLab Kubernetes Agent is available on gitlab.com. Using the Agent, you can benefit from fast, pull-based deployments to your cluster, while gitlab.com manages the necessary GitLab-side components of the Agent for you. Read more about the GitLab Kubernetes Agent in our documentation.
Hi @AnthonySandoval! KAS is a new service being rolled out as part of the GitLab product. It is currently deployed in gstg, and this issue tracks the bits and pieces that need to be done before it is ready for a readiness review to go into production.
The next big part of what is needed is around observability (as noted in the description). I just wanted to make you aware that the KAS team might be reaching out to you and your team for assistance as needed.
@nicholasklick you can find the SRE Observability team in the Slack channel #sre_observability. Most of their members are in the EMEA and US timezones.
As per the Kubernetes Agent meeting last night, there was a discussion between @ash2k, @tkuah and @hfyngvason about the idea of creating a DRAFT MR to the Helm chart which would effectively promote kas into production.
Just so I fully understand, are the items listed in the description ordered steps, meaning we need to complete the following before considering the deployment?
Yes, that is correct
Infra: Rate limiting is tested and
Unsure, presume this is ~"group::configure" to write a stress test for staging. /cc @nicholasklick
Update: after discussion with @nicholasklick ~"group::configure" will own the testing of rate limiting. I have updated the issue description accordingly.
@nagyv-gitlab The Observability tasks are in progress and have a due date of Jan 20th. Once they're complete and rate limiting has been tested, we'll be able to begin the deployment tasks, which are expected to take a couple of days.
We'll be managing this deployment around the 13.8 release preparation, so at this stage I'd say the week of the 25th looks more likely.
Then after the above two are complete and any new issues that come out are resolved, we can then move to introduce this service into production: gitlab-com/gl-infra/delivery#1459 (closed)
The latter will take roughly a day, assuming nothing new comes out of the readiness review.
@ggillies can you be the DRI for the readiness review? This can certainly be done in parallel with the Observability work.
I do not see an associated issue for such. I would add this as a blocker to pushing this into production. Shall either one of you be the DRI for such? The results of testing would be a great addition to the readiness review as well.
In particular, I'm unclear whether we're planning to run the rate limit testing and whether we need full dashboards ahead of the rollout, or whether they're items we'll cover in the next iteration.
Yes, I would vote for shipping the entire kas feature and making it globally available. If issues arise we can roll back and limit traffic. - #276099 (comment 498863396)
We should make sure we have a way to know if there are issues, but after that, do you feel happy rolling this out?
@amyphillips Yes, as long as we have some basic monitoring in place. I think my strategy will be:
Enable KAS on gprd as quickly as possible. This is currently blocked because the gitlab-com repo is not tracking the master branch of the GitLab chart, due to @jarv's work on removing ingress-nginx. I believe this is slowly making its way to prod, and I have an MR waiting to move back to master when we can: gitlab-com/gl-infra/k8s-workloads/gitlab-com!666 (closed)
While KAS will be in gprd, initially no one will be able to authenticate to it and do anything useful, as we will not allow people inside the GitLab.com application to create new agent credentials. We can use a feature flag to enable it for us first, then maybe some select clients, then keep rolling it out, same as any other GitLab feature.
In parallel we will get alerting and dashboards in place, as well as wrap up the readiness review for the current state and get it to the infrastructure team.
With the dropping of Redis, we don't have any built-in rate limiting, but I don't think it will be a big issue to start with.
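For anyone following along, the staged feature-flag rollout mentioned above (us first, then select clients, then wider) could be sketched roughly like this. This is only an illustration of the gating idea, not how GitLab's feature flags are implemented; the flag name `kas_agent_tokens`, the actor IDs, and both function names are made up for the example.

```python
import hashlib


def rollout_bucket(flag: str, actor_id: str) -> int:
    """Map a (flag, actor) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{actor_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def can_create_agent_token(actor_id, allowlist, percentage=0,
                           flag="kas_agent_tokens"):
    """Decide whether an actor may create agent credentials.

    Hypothetical gate: explicitly allowed actors always pass; everyone
    else passes only if their stable bucket falls below `percentage`.
    """
    if actor_id in allowlist:
        return True
    return rollout_bucket(flag, actor_id) < percentage


# Stage 1: only an internal group is enabled.
print(can_create_agent_token("gitlab-org", {"gitlab-org"}))    # True
print(can_create_agent_token("some-customer", {"gitlab-org"}))  # False
```

Because the bucket is derived from a hash rather than a random draw, raising `percentage` only ever adds actors; nobody who already has access loses it between stages.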
After that MR, I have a branch with WIP for enabling kas in production, as well as cleaning up the configuration so it's consistent with the new chart changes and also a bit more readable. I'll use this to bring all environments up to the latest version of KAS and get KAS into gprd. The Terraform work and everything else has already been done.
@cindy has an MR to enable Sentry, but the method of doing this will change, so we will hold onto this work until my previous step is done. After that, we can do a final review of all settings across all environments and confirm all pieces are in place.
Following on from @tkuah's mention above, I do have an idea on how to use Google Cloud Armor, which GKE ingress supports, to allow us to specify an IP blocklist should we discover any bad actors. I will do the Terraform prep work for this, and then update the runbook as necessary.
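To make the blocklist idea concrete, here is a rough sketch of the check such a list performs, using Python's standard `ipaddress` module. It is not Cloud Armor configuration and says nothing about how Cloud Armor evaluates rules; the function names are hypothetical and the addresses are taken from the reserved documentation ranges.

```python
import ipaddress


def parse_blocklist(entries):
    """Turn strings such as '203.0.113.0/24' or '198.51.100.7' into networks."""
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_blocked(client_ip, blocklist):
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed-version lists are safe to scan.
    return any(addr in network for network in blocklist)


# Example blocklist: one CIDR range and one single address.
blocklist = parse_blocklist(["203.0.113.0/24", "198.51.100.7"])
print(is_blocked("203.0.113.42", blocklist))  # True
print(is_blocked("198.51.100.8", blocklist))  # False
```

A single address is parsed as a /32 network, so ranges and individual bad actors can live in the same list.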
I believe the Sentry configuration will currently be blocked until gitlab-org/charts/gitlab!1807 (merged) is merged, allowing us to do specific overrides in customConfig without completely wiping out the default settings.
After discussion with @jarv, I've now got a very clear picture of what we have left to do to make sure this is ready for a final readiness review and then a feature-flag-gated rollout to specific users outside GitLab.
After weighing up the security that Google Cloud Armor provides against the current state of Cloudflare and what it provides, I did a small spike of work today to see how hard it would be to put a basic Cloudflare configuration in front of kas.staging.gitlab.com. It turns out it wasn't that difficult, so I have opened MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2315 to get this in place for production as well.
In order to make sure that KAS is not able to communicate over the network with anything it's not supposed to, we decided to get a pod NetworkPolicy in place now. I have opened MR gitlab-org/charts/gitlab!1837 (merged) with the work and will deploy this once merged into master on the chart.
Currently the kas pods have no nodeSelector, which is fine for right now, but is something we should try to solve as quickly as possible, so I have opened gitlab-org/charts/gitlab!1836 (merged) and will deploy this once merged into master on the chart.
@cindy has the Sentry configuration ready to be added, but unfortunately it seems the master version of the GitLab chart has now made the Redis configuration for kas non-optional. I would like to avoid pulling in Redis at this stage, so once we have the upstream chart tweaked to make that optional or not have it automatically set, we can pull in the upstream chart changes needed for the Sentry configuration. @tkuah has opened gitlab-org/charts/gitlab!1835 (closed) to at least stop the Redis configuration being added automatically for the moment.
All the metrics and dashboards look correct to me. We just have such a small amount of traffic for the service that they aren't exactly showing anything useful yet.
Based on further questions and feedback from the readiness review, I have spent a bit of time today updating the readiness review and the runbook to reflect the new architecture (with Cloudflare) and to go into some parts in more detail. It was really good to double check a lot of things based on our experiences with deploying websockets into Kubernetes.
My target at this stage is to present the final state and readiness review to the larger infrastructure team at the DNA meetings scheduled for mid next week. I'm hoping all MRs can get moved through soon to meet that.
@cindy has finished provisioning Sentry, and we can see it working because errors from the load testing done in staging are being populated in Sentry: https://sentry.gitlab.net/gitlab/kas/
I still have MR gitlab-org/charts/gitlab!1837 (merged) open in the upstream chart to get networkPolicy for KAS implemented. Once merged I will apply it to our environments. If we are able to get this looked at soon, that would be much appreciated.
I have added an announcement about the KAS readiness review to the agenda for this week's DNA meeting.