Contributions wanted
Please suggest possible solutions and describe them in the most detailed way that you can.
Some ideas that have been discussed before, in no particular order:
Websockets
This would remove the need for polling, but exchange it for long-lived connections. It's unclear whether this would be any better. Websockets would also require us to replace Unicorn with e.g. Puma, as Unicorn is not suitable for this. We don't want to run an extra process just for websockets as this complicates deployments, managing infrastructure, etc. Puma is something we have looked into in the past, but we're not sure yet: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/1899 and https://gitlab.com/gitlab-org/gitlab-ce/issues/3592
Polling
This is what we currently use, and it doesn't scale. We could merge different polling endpoints/calls into a single one, but this will only work if:
This new endpoint is faster than the sum of the current ones
We can guarantee this endpoint stays fast, even when adding more data
Since item 2 somewhat violates the laws of physics (you can't add something new without it taking more time) I don't see this working out very well.
Workhorse
I don't remember the exact details, but I think somebody suggested polling workhorse and hooking up some kind of pub/sub to workhorse. We'd still have to poll, but at least we won't be hitting the Rails application. Whether workhorse can handle such load remains to be seen.
Currently for !8357 (closed) we are limiting scope to logged in users only.
In theory this can greatly reduce the number of requests made by visitors when we hit #1 on Hacker News or get a Reddit hug of death.
Should we do this for all current realtime polling to reduce load as well?
Since first and foremost we ship a product that is installed locally, I am assuming that for the vast majority of companies all users are logged in (unless the company is open source and has public URLs/projects)
@yorickpeterse Talking about Workhorse: since Go has great websocket/concurrency support, could we just use Go for the websockets and move from polling to pub/sub?
I am sure there is a great Postgres ORM in Go that could be used. The hard part will be reusing existing Rails logic, unless we can call the methods via FFI.
For issue titles, it's very easy to know when the thing we're polling has been updated, and this happens infrequently. We could cache the title on an issue page with a relatively short TTL (as people don't spend that long on these pages, and we shouldn't poll for background tabs anyway), and invalidate it when the title is updated. That way we only need to hit the DB when the title changes, or when an issue's title is not in the cache.
I don't have a good suggestion to scale this to system notes, though.
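To make the caching idea above concrete, here is a minimal sketch using the go-redis client. The key name, the 2-minute TTL, and the loadAndRenderTitle helper are illustrative assumptions; in the Rails app this would more naturally live in Rails.cache, but the shape is the same:

```go
package titlecache

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

var rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})

// issueTitleHTML returns the rendered title from cache, only falling back to
// the database / Markdown pipeline (loadAndRenderTitle, a stub here) on a miss.
func issueTitleHTML(ctx context.Context, issueID int) (string, error) {
	key := fmt.Sprintf("issue:%d:title_html", issueID) // illustrative key name
	html, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return html, nil // cache hit: no DB work at all
	}
	if err != redis.Nil {
		return "", err
	}
	html = loadAndRenderTitle(issueID)
	// Short TTL, as suggested above: people don't stay on these pages for long.
	return html, rdb.Set(ctx, key, html, 2*time.Minute).Err()
}

// invalidateIssueTitle is called from the title update path, so the next poll
// re-renders immediately instead of waiting for the TTL to lapse.
func invalidateIssueTitle(ctx context.Context, issueID int) error {
	return rdb.Del(ctx, fmt.Sprintf("issue:%d:title_html", issueID)).Err()
}

func loadAndRenderTitle(issueID int) string { return "rendered title" } // stub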
Maybe we could add a pub/sub system with the help of Redis based on Websockets that would be terminated by Workhorse and updated (notified) by Rails?
We currently think about something similar for GitLab Runner, we don't yet use Websockets (only long-polling connections), but this will use pub implemented in Rails, with sub implemented in Workhorse.
@yorickpeterse Ok, that makes sense. And yeah, I was just trying to shrink the scope for the current solution. What I meant to say, I guess, is that maybe we should only do pub/sub, even for logged-in users, if we decide to go that route.
My vote is for pub/sub too; long polling is fine as long as we have something that is actually handling this long polling properly.
What we need to run away from is sending a new request every X seconds that hits deeply into the database; we need to isolate this kind of resource from direct client access.
@adamniedzielski From a frontend perspective MessageBus seems really nice:
MessageBus.diagnostics(): Returns a log that may be used for diagnostics on the status of message bus
MessageBus.pause(): Pause all MessageBus activity
MessageBus.resume(): Resume MessageBus activity
MessageBus.stop(): Stop all MessageBus activity
MessageBus.start(): Must be called to startup the MessageBus poller
MessageBus.status(): Return status (started, paused, stopped)
From a backend perspective it sounds promising:
Messaging reliability is far more important than WebSockets. MessageBus is backed by a reliable pub/sub channel. Messages are globally sequenced. Messages are locally sequenced to a channel. This means that at any point you can "catch up" with old messages (capped). API-wise it means that when a client subscribes it has the option to tell the server what position the channel is at:

    // subscribe to the chat channel at position 7
    MessageBus.subscribe('/chat', function(msg){ alert(msg); }, 7);

Due to the reliable underpinnings of MessageBus it is immune to a class of issues that affect pure WebSocket implementations. This underpinning makes it trivial to write very efficient cross-process caches, amongst many other uses. Reliable messaging is a well understood concept. You can use Erlang, RabbitMQ, ZeroMQ, Redis, PostgreSQL or even MySQL to implement reliable messaging. With reliable messaging implemented, multiple transport mechanisms can be implemented with ease. This "unlocks" the ability to do long-polling, long-polling with chunked encoding, EventSource, polling, forever iframes etc. in your framework.
I want to stress the concept of detaching a client request from a database query. Whatever we do, we need to make sure that as we get more clients, we don't get more queries being executed.
@adamniedzielski: I wanted to verify that according to your plan in the description, you want to deploy some code outside of any regular GitLab release right? So we'll probably deploy a few feature branches as PoCs to production of GitLab.com, and when we're comfortable, then we will start rolling the chosen technical design into specific features as part of regular GitLab releases right? Is this the plan?
@adamniedzielski: Do we have any specific metrics / plan of evaluating if a test passes? Will we just look at some graphs and make a judgement call?
@adamniedzielski: Do we have a proposed feature yet to use as the test PoC? I would vote for the system note or issue title, or I guess whatever is easiest to implement. No strong opinion here. Just wanted to see what we were testing against as an example feature.
@adamniedzielski: Could we timebox at least the portion of this plan to decide on the first PoC solution? Or at least have an estimate of when this would be solved? Real-time features of GitLab are a strong focus and priority for GitLab right now and a major push for UX. At the same time, I understand that getting real-time correct technically is super important so that we don't incur any more tech debt and set ourselves up for future success. So if we need more time, we should use it, but we just need to set expectations for everyone accordingly, especially as we are also pushing for 9.0 in the coming months and I anticipate that's also another area of big engineering focus.
@yorickpeterse: You are strongly against polling right? (https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8357#note_20943988). I assume you dislike the current solution in GitLab. Is it fair to say that you don't want any development of any new features that involve getting updates in real-time until we have established a GitLab standard for this? I want to have some consensus here and set some expectations accordingly so that all of Product and UX are aware so that we don't repeat this discussion in multiple issues until we have a final solution. Are we putting a hard stop to all real-time features (such as !8357 (closed)), or are we evaluating on an issue by issue basis?
I wanted to verify that according to your plan in the description, you want to deploy some code outside of any regular GitLab release right? So we'll probably deploy a few feature branches as PoCs to production of GitLab.com, and when we're comfortable, then we will start rolling the chosen technical design into specific features as part of regular GitLab releases right? Is this the plan?
We can not deploy to GitLab.com without making it a proper release. It's also really unwise to deploy any change that introduces more polling, even if it's just for testing purposes. Dev and staging are playgrounds, GitLab.com is where serious internet business takes place and should be treated carefully.
In the future we want to support canary deployments to allow faster deploy/feedback cycles, but it will be at least several months before we are able to do so.
You are strongly against polling right?
I assume you dislike the current solution in GitLab.
Yes.
Is it fair to say that you don't want any development of any new features that involve getting updates in real-time until we have established a GitLab standard for this?
Yes. We can't implement anything that's real-time until we have a solution for sending updates in a way that scales.
Are we putting a hard stop to all real-time features (such as !8357 (closed)), or are we evaluating on an issue by issue basis?
Until we have a solution this should be a hard stop. Not doing so will lead to development/reviewing time being wasted.
Thanks @yorickpeterse for the clarifications and for helping me understand the constraint space to solve this problem. That's extremely helpful.
So it sounds like we have to create these PoCs in regular releases, and use feature flags / patch releases to make any adjustments on the fly. So in the most aggressive case we could deploy multiple different solutions in a single release to test them out. But we can't go much faster than a release. So if we don't have something ready for 8.16, we'll be blocked for all our real-time features for another release.
I'll be sure to mention this product impact in the meeting. Thanks!
So it sounds like we have to create these PoCs in regular releases, and use feature flags / patch releases to make any adjustments on the fly
Perhaps I was not entirely clear. Changes like this (the ones where we're fairly certain they can have a huge infrastructure impact) should not be merged in/released at all until a proper solution has been found, not even when using feature flags or something similar. Doing so falls in the category of "works fine in dev, ops problem now".
The only way we can stop ourselves from shooting ourselves in the foot is by not shipping loaded guns.
Some data on what people are using (based on what I could quickly find using Google):
Facebook: seems to mostly use long polling, probably using dedicated application servers of some kind for this (instead of cramming everything into something like Unicorn)
Gmail: seems to use polling, not sure if it's regular polling or long polling
Twitter: seems to be using polling, unsure if it's regular or long polling
GitHub: not entirely sure. Editing an issue in one tab triggers XHR requests in another tab (viewing the same issue), but this seems to happen instantaneously. Looking at the console for a while I don't see any periodic XHR requests, suggesting it's either websockets or something else.
Rails 5 seems to support websockets ^1 via ActionCable. Considering Rails has a history of implementing things in a way that doesn't really scale, I'm not sure if we want to use this, or something more low level that we have control over.
@yorickpeterse GitHub is using WebSockets. When you open an issue page with dev tools, in the network history tab you can see an initial GET HTTP request to live.github.com (one column says WebSocket); it contains the header Connection: keep-alive, Upgrade. This switches communication to another protocol, which the web browser (in my case Firefox) doesn't show. It also contains the Sec-WebSocket-Key and Sec-WebSocket-Extensions headers. It seems that GitHub uses WebSockets without its "protocols".
They open a WebSocket even on closed issues.
Maybe it is out of the scope of this discussion and I should mention it in another issue, but I use Nchan to distribute "realtime" messages over websockets. It is a pub/sub server which can handle all the suggested serving technologies. Per channel you can configure the set of methods that can publish and another subset of technologies that can subscribe. It is a small program developed as an nginx module, but it can also run standalone. It can use Redis to sync with other instances to create a cluster. It supports dynamic channels, scaling, and authentication.
Nchan itself doesn't answer the evaluation question, but it enables using, through one interface, whichever method is preferable for each situation. For example, logged-in users could be served by websockets and others by polling; users with Developer (and higher) rights could use websockets on the pipeline page, others only long polling, etc...
Changes like this (the ones where we're fairly certain they can have a huge infrastructure impact) should not be merged in/released at all until a proper solution has been found, not even when using feature flags or something similar.
We have to evaluate the solutions in some way to find a proper solution. So what I'm thinking about is:
implement a proof of concept
the feature is turned off by default
do not include it in our release blog post, do not use it for marketing
release it and deploy to GitLab.com
turn it on on GitLab.com
measure performance
turn it off if it causes trouble
Do you have any other way to find a proper solution? Do you think that we can find it without testing it out on GitLab.com?
@victorwu I created this issue to turn our Slack conversation into a meaningful list of possibilities. The plan in the issue description is vague right now and not approved by anybody.
I don't think that we have any specific metrics in mind right now, but we will have to pick them.
Everybody is welcome to contribute, both with solutions and with methods for evaluating them.
We are working on a form of long polling for GitLab Runners that is purely Redis-based, to see how well it behaves. This will give us some meaningful performance data, given that Runners are the biggest resource hog right now.
The basic plan is to have Pub on the Rails side and Sub on the Workhorse side, using Redis as storage. The idea is to use one Redis connection per Workhorse and register many build requests to listen on this connection.
Long polling is our first choice, as it doesn't require any changes on the Runner side; ideally a WebSocket connection would be the next iteration, as it would reduce the processing requirements even further. The long-polling connection is expected to be held open for ~50s for now.
Long polling will allow us to reduce the number of requests by 16x, and when there are no builds to hand out, requests will only complete on a Redis key change. The build register requests would not touch PostgreSQL.
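A rough sketch of what the Workhorse-side "sub" half could look like under those assumptions: one shared Redis subscription per process, with many pending Runner requests registered against it. The Watcher type, the channel name, and the go-redis usage are illustrative, not the actual Workhorse code:

```go
package longpoll

import (
	"context"
	"sync"

	"github.com/go-redis/redis/v8"
)

// Watcher multiplexes many pending long-poll requests over a single Redis
// subscription, so the number of Redis connections stays constant per process.
type Watcher struct {
	mu      sync.Mutex
	waiting map[string][]chan struct{} // notification key -> pending requests
}

func NewWatcher(ctx context.Context, rdb *redis.Client, channel string) *Watcher {
	w := &Watcher{waiting: map[string][]chan struct{}{}}
	go func() {
		// One subscription per Workhorse process; Rails PUBLISHes the
		// notification key (e.g. the Runner's build queue) when builds appear.
		sub := rdb.Subscribe(ctx, channel)
		for msg := range sub.Channel() {
			w.notify(msg.Payload)
		}
	}()
	return w
}

// Wait registers the calling request for a key and blocks until either a
// notification arrives or the caller's context (carrying the ~50s deadline)
// expires, in which case the request is answered with "no new builds".
func (w *Watcher) Wait(ctx context.Context, key string) bool {
	ch := make(chan struct{}, 1)
	w.mu.Lock()
	w.waiting[key] = append(w.waiting[key], ch)
	w.mu.Unlock()

	select {
	case <-ch:
		return true // change detected, proxy the request to Rails
	case <-ctx.Done():
		return false // timed out, respond without touching PostgreSQL
	}
}

func (w *Watcher) notify(key string) {
	w.mu.Lock()
	chans := w.waiting[key]
	delete(w.waiting, key)
	w.mu.Unlock()
	for _, ch := range chans {
		ch <- struct{}{}
	}
}
```

The point of the shared map is that adding more pending Runner requests adds goroutines, not Redis connections or database queries.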
@mardukecz Thanks for the info! Nchan looks quite interesting.
@adamniedzielski Ah, I overlooked that. Regarding testing, we should first test changes on dev/staging in some shape or form. Once we're more confident with that we could deploy it in an optional form to GitLab.com, but I'd prefer to only do so if we know it's not going to cause massive problems. It would be less problematic if we could enable such a feature only for a specific host, instead of all; reducing the potential impact it could have.
We could use Pusher! OK, that's not a real suggestion, but something open source with a similar architecture, like socket.io. This is probably effectively the same as @ayufan's suggestion where we have a separate process handle the realtime websockets connection, and the main Rails app has a regular HTTP non-persistent connection to it.
One thing I would like to understand before we build a solution to fix all the things ™ is where we are using polling besides the runners, which endpoints we are hitting, and how we are monitoring these endpoints. Do we have any data at all?
@pcarranza Besides runners we poll already on issue pages to load new comments as they are posted. Outside of that I don't think we have any other polling going on, but perhaps the frontend folks know more.
@markpundsack The advantage of @ayufan's suggestion is that it uses Workhorse which already knows how to connect to the main Rails app. I'm not sure about introducing a new component to handle the websockets or a new technology (NodeJS for socket.io) to our stack. I think that we standardised on Go for stuff that requires better performance / concurrency support, so making it in Go makes sense and we already have Workhorse written in Go, sitting in front of the Rails app.
@yorickpeterse @pcarranza We currently have polling for the notification that says the build is running or passed or failed. That little popup that comes when you accept notifications.
@pcarranza we're expecting to add much more realtime features (both to existing endpoints and new) to GitLab over this year. A solid solution that would allow us to bring that to more and more of GitLab's features is now a blocker for a number of features.
Job et al. Again: I'm not against the feature or the tech. I'm OK with whatever we decide (long poll or websocket) as long as we architect it to scale, and we take all the polling we currently have and make it go through the same pipe. That will allow us to set up GitLab production sanely, because we will only need to focus on one access pattern and not many.
That said, what I briefly read from Kamil makes sense, and I think we already discussed something like this some time ago. So that could be a path forward.
I think if @ayufan's solution is in-progress, and no one has any major objections to that, then we should go with that for now.
That does mean that we will need to block the issue titles stuff (and anything beyond that) from shipping until the pipelines element has shipped and been tested in production. IMO that's a reasonable trade-off, but if we have disagreement then we need to talk about it more. @adamniedzielski wdyt?
@smcgivern Yes, the solution that @ayufan proposed looks really promising.
The only thing that I would like to avoid is that you have to modify Workhorse each time you want to add a new realtime feature. I'm not sure if that's possible or not with the proposed solution.
I was also thinking about how we want it to work a little bit more. Let's take the issue title as an example. The issue title is a Markdown field that may contain references to other issues, potentially confidential issues. Because of that, the issue title HTML is one value for a guest user and a different value for a team member. And we will have more cases like that - content that is dependent on your access level.
If the issue title changes and we want to deliver the new content to all subscribers we have to generate it for every subscriber, because it's potentially different for every subscriber. I don't think that it's a good idea.
Again, I don't know what the established pattern here is. What I was thinking about is notifying that a change happened, without sending the changed value itself. When a subscriber receives the notification about the change, it is responsible for fetching the value.
What I was thinking about is notifying that a change happened, without sending the changed value itself. When a subscriber receives the notification about the change, it is responsible for fetching the value.
I'd be fine with that for things like titles, descriptions, and comments where:
The changes themselves are rare.
We need to show different things to different users.
We still have the problem of a bunch of clients potentially requesting a new rendered title at once, though that shouldn't be too expensive if it's a one-off.
Per discussion with @stanhu, we should work on implementing #25051 (closed) as the first feature to leverage the new non-polling solution, once it has been figured out. Currently we are aiming to ship that in 8.16.
@selfup already worked on #25051 (closed) (!8357 (closed)), so might be able to provide some background. But engineering can take over and get whoever else are the right resources to work on this.
Lots of good discussion here. Let me summarize our requirements. We want a solution that has these characteristics:
Performant: We need to be able to handle real-time updates for thousands of connections
Extensible: Support for new real-time updates shouldn't require additional changes in Workhorse etc.
Secure: Users not only have to have permission to see the changes, but the data (e.g. Markdown) that is returned to them may be specific to them. For example, if we have 1000 users subscribed to an issue, we can't broadcast the update as-is to all 1000 users.
The idea here is that Workhorse handles the long polling and uses Redis publish/subscribe channels to notify it when it should retrieve the updated data. Also note that we don't put any data in Redis to avoid leaking information through Workhorse.
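The "no data in Redis" point can be illustrated with a tiny sketch (the channel naming and the go-redis client are assumptions): Rails would publish only an empty message, and each subscriber re-fetches through the normal permission-checked endpoint.

```go
package notify

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// notifyChanged publishes an empty message on the resource's channel. Only the
// fact that something changed leaves Rails; each subscriber then re-fetches
// through the normal, permission-checked endpoint, so no user-visible data is
// ever stored in Redis or broadcast through it.
func notifyChanged(ctx context.Context, rdb *redis.Client, resource string) error {
	return rdb.Publish(ctx, "change:"+resource, "").Err()
}
```

The Rails side would call the equivalent of notifyChanged(ctx, rdb, "issue:10") after an update, and Workhorse would hold the subscriptions for the connected clients.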
For me, the focus needs to be on how we generate the change notifications in the first place, rather than how we ship them from server to browser. There are lots of good options there, but noticing when our data changes, generating a diff and sending that to all appropriate clients is actually really hard.
For issue titles specifically, we can put something together that works, but it needs to take account of changes that invalidate a subscriber, as well as transitive changes. Without careful thought, we're going to end up with something that only barely works, with piles of special-casing code to account for edge cases.
Specific examples for issue titles:
Issue is marked confidential before having its title changed
The issues.title field doesn't change, but happens to reference a user / issue / MR / project / other GitLab Markdown referenceable which is removed
We don't actually deal with that second case at all right now, and we could ignore it for live issue titles, but transitive changes are going to hurt us at some point if we're not careful.
Can we be lazy and just notify clients that the page changed, leaving it to them to reload? Rather than pre-calculating some minimal set of data to send to each client, just be wasteful and trigger 1000 refreshes when there are 1000 clients?
Being able to send minimal web page diffs to clients that refresh sounds like a separate problem to solve to me.
Edit: Or to put it differently, if we have an issue that is open in 1000 browser tabs, the system needs to be able to handle that either way. Why create an alternative 'page building system' to our Rails controllers + cache?
Regarding the delivery mechanism, MessageBus looks like a nice approach. If it works as advertised. :) It's impressive they built this in a way that it even works with Unicorn.
It looks like they hide a Thin server in each Unicorn worker that holds the long polling requests open. One thing we should ask ourselves is how this behaves when Unicorn workers are short-lived.
@nick.thomas I was also thinking about possible transitive changes that may affect issue title (issue title is just an example here) and I agree that it's a really hard problem. I am not aware of any generic solution / pattern that would allow us to express these dependencies in an organised way. However, I consider real time updates to be a progressive enhancement. If we fail to notify subscribers that the issue title changed in some specific edge cases then it's not a big deal. There is a way to get the correct data (refresh the page) and we're just showing the stale data, but it doesn't break anything. There is just no enhancement in this specific edge case.
@victorwu is that also the way the product team thinks about real-time updates - progressive enhancement?
Can we be lazy and just notify clients that the page changed, leaving it to them to reload? Rather than pre-calculating some minimal set of data to send to each client, just be wasteful and trigger 1000 refreshes when there are 1000 clients?
I agree that we should use the existing endpoints in the Rails app and trigger 1000 requests to the Rails app when there are 1000 clients. The returned data is affected by your permission level (think: issue title with a reference to a confidential issue), so I think that we cannot avoid doing this one request per client.
There is one possible improvement if we decide to use the proposal involving Workhorse. Instead of notifying the frontend that a change happened and relying on the frontend to fetch JSON, we can do this request in Workhorse and say to the frontend: a change happened and here is the JSON for you. This is already included in @stanhu's diagram.
There is a possible alternative: to use ActionCable for that. The biggest issue is that it requires Rails 5 (I think @connorshea was working on that, but I don't know the status). ActionCable is Rails, so it's the no-brainer solution. To make ActionCable scalable, there is a solution/hack described in this blog post: https://evilmartians.com/chronicles/anycable-actioncable-on-steroids.
AnyCable uses either a Go based daemon or an Erlang one, to bridge the gap and solve the multiple connections problem that ruby can't handle well.
If we decide upon AnyCable and their Go solution, it could potentially be made part of workhorse to avoid having an additional daemon.
One of the ways to solve that problem is similar to how we plan to do it in GitLab Runner:
99.9% of the time there's no change,
We don't track connected clients, as we always update a value in Redis when a thing changes,
The client will receive a notification when a change gets detected,
We will execute the original request as many times as there are clients watching the data,
In most cases, it is one client watching the change,
Every client can receive a different payload, based on permissions,
It is simple to use: you use the same request, only adding two extra headers or parameters,
The notification keys are short-lived so as not to overload Redis, assuming that a client watching a resource will create a new key if it no longer exists,
The notification handler is implemented in Workhorse; Workhorse watches a single Redis connection for keyspace changes and finishes requests that are interested in these changes,
Rails does a very cheap operation: it bumps the value of the Redis key.
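A sketch of the Rails-side half of that list, written in Go for consistency with the other sketches here; the key derivation follows the sha256(type || id || salt) idea mentioned in the thoughts below, and the 5-minute TTL is an arbitrary assumption:

```go
package notify

import (
	"context"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// notificationKey derives an unguessable key for a resource, so only clients
// that were handed the key by an authorized GET can watch it
// (sha256(type || id || salt), as described below).
func notificationKey(resourceType string, id int64, salt string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%d|%s", resourceType, id, salt)))
	return "notifications:" + hex.EncodeToString(sum[:])
}

// bump is the cheap operation performed from an after_save hook: overwrite the
// key with a fresh random value and a short TTL. Watchers only care that the
// value changed, not what it is.
func bump(ctx context.Context, rdb *redis.Client, key string) error {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return err
	}
	// Short-lived key: notifications nobody is watching expire on their own.
	return rdb.Set(ctx, key, hex.EncodeToString(buf), 5*time.Minute).Err()
}
```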
Thoughts:
Most of the current solutions are optimized for being client-aware (they need to track a list of listeners). This solution is resource-based: we do not care about the number of connected clients, we care about storing a resource change, basically reacting to the transition from one value to another. We do not care about the particular value that is stored, as long as it is random.
Workhorse does track all notification watchers, and when a Redis key is updated, Workhorse receives the event and delivers an update to any connected client.
There's a minor data leak: by knowing the notification key you could receive a notification that something did change. You would not receive the actual payload; the additional request that is then executed would pass all the normal security checks.
We use existing JSON endpoints that return serialized model data. The only addition is a generic mechanism where we respond with a notification key and notification value when a GET endpoint is requested.
We need to add an after_save callback to bump the notification key when a resource is changed.
The notification key is a hashed identifier with a salt, e.g. sha256(type || id || salt). This way only clients that are authorized would be able to watch notifications.
We use only one request, so we don't need to execute additional requests to subscribe and unsubscribe; all of that is handled by Workhorse internally, as we do not care about the number of connected clients.
It is safe from a security standpoint.
We can run into a lot of Redis keys stored for notifications that are not being watched, but since they are short-lived it is very unlikely that this would make a big performance difference.
A potential problem is that if we had to watch the full keyspace for notifications, it would increase the load for processing key changes in Workhorse. If we can watch a subset of the keyspace, this is not a problem.
The proposal uses long polling and can easily handle connection interruptions and reconnect if needed. We just pass the last received headers to resume from the last point.
We do not implement any incremental updates or differentials. We always return all the data. It can be a little hefty for "lists", but this can be optimized with more clever returning of data.
Unfortunately, this is a custom-made solution, not a boring one.
It is not ActionCable, which may be tempting to use in the future.
It works well with our current architecture: Unicorn and Workhorse.
Final word
This is not a queuing / message bus, it's more like a notification system that is optimized to deliver information about an event happening, without going into the details of what kind of event it is.
It may seem like a crazy idea, but I'm happy to discuss further :)
@ayufan This sounds like the best thing ever. Especially the 12th point in the Thoughts section
This will help a lot for both backend and ~Frontend for realtime pipelines with pagination. Unless we have to diff because so many things change (stages for example as well as pipeline statuses). But I don't think people stay on this page that long.
@ayufan I like what I see. Let me see if I fully understand. How is "watch on notification value" going to work with Redis? Are you using Redis PUBLISH/SUBSCRIBE channels, or just polling? If the former, I think there are a number of elements that we need to support this:
A key (notifications:issue:10) that associates an issue to the last state of the issue (random-value)
A Redis PUBLISH/SUBSCRIBE channel name (e.g. change:issue:10) on which the client listens for notifications of changes
Couple of questions:
Why do we need the PUT request for the change notification step? If you are concerned about expiring keys, could we just use the same notification key throughout and refresh the TTL each time a GET request arrives with indication for updates?
Can we make the second GET request simpler and avoid the race condition at the same time? For example:
a. First, send a Redis SUBSCRIBE change:issue:10 for updates.
b. Next, verify the value of notifications:issue:10 matches the expected random-value; if they differ, proxy the request immediately.
Ah, I just read https://redis.io/topics/notifications in more detail. It looks like Redis allows you to watch for values using special PUBLISH/SUBSCRIBE channel names.
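For reference, watching those special channels is a small amount of code; a sketch with the go-redis client (the key pattern is illustrative, and in production the notify-keyspace-events flag would normally be set in redis.conf rather than at runtime):

```go
package notify

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

// watchKeyspace listens for SET events on notification keys via the special
// __keyspace@<db>__ channels described at https://redis.io/topics/notifications.
func watchKeyspace(ctx context.Context, rdb *redis.Client) error {
	// Keyspace events are off by default; "K$" enables keyspace channels
	// for string commands.
	if err := rdb.ConfigSet(ctx, "notify-keyspace-events", "K$").Err(); err != nil {
		return err
	}
	sub := rdb.PSubscribe(ctx, "__keyspace@0__:notifications:*") // illustrative pattern
	for msg := range sub.Channel() {
		// msg.Channel carries the key, msg.Payload the event name (e.g. "set").
		fmt.Printf("key %s changed: %s\n", msg.Channel, msg.Payload)
	}
	return nil
}
```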
Let me see if I fully understand. How is "watch on notification value" going to work with Redis? Are you using Redis PUBLISH/SUBSCRIBE channels, or just polling?
I think that we have two options:
Use https://redis.io/topics/pubsub and have a separate queue where we would push all notifications about updates, which would be read by all Workhorse watchers,
Why do we need the PUT request for the change notification step? If you are concerned about expiring keys, could we just use the same notification key throughout and refresh the TTL each time a GET request arrives with an indication for updates?
The PUT is just an indication of someone updating the resource. Since this is resource-based, not client-based, the notification is generated for a resource change. If we started to update it on GET, all other clients that are watching for the resource change would receive a notification.
Detailed flow should probably look like this:
A GET happens,
Workhorse checks the notification value,
If it is not the same (this will also happen for expired keys) we proxy the request to Rails:
Rails gets the existing notification key; if it does not exist, it creates one with a TTL,
Rails fetches the resource,
Rails returns the serialized resource with the key and the value.
If it is the same, we wait (long poll) for a notification to be fired:
If it times out, we return NO DATA,
If it fires, we proxy the request to Rails as described in point 3.
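Sketched from the Workhorse side, that flow might look roughly like this; the header names and channel naming are assumptions rather than a settled interface, and the ~50s timeout mirrors the Runner plan mentioned earlier:

```go
package notify

import (
	"net/http"
	"net/http/httputil"
	"time"

	"github.com/go-redis/redis/v8"
)

const pollTimeout = 50 * time.Second // long wait before answering "no data"

// handle implements the flow above: compare the client's last-seen value with
// the one in Redis, proxy immediately on mismatch (or missing/expired key),
// otherwise park the request until the key changes or the timeout fires.
func handle(rdb *redis.Client, rails *httputil.ReverseProxy) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		key := r.Header.Get("X-Notification-Key") // assumed header names
		lastSeen := r.Header.Get("X-Notification-Value")

		current, err := rdb.Get(ctx, key).Result()
		if err == redis.Nil || current != lastSeen {
			rails.ServeHTTP(w, r) // value differs (or key expired): go to Rails
			return
		}

		// Subscribe first, then re-check, to avoid the race where the key
		// changes between the GET and the SUBSCRIBE.
		sub := rdb.Subscribe(ctx, "change:"+key)
		defer sub.Close()
		if v, _ := rdb.Get(ctx, key).Result(); v != lastSeen {
			rails.ServeHTTP(w, r)
			return
		}

		select {
		case <-sub.Channel():
			rails.ServeHTTP(w, r) // notification fired: proxy as in point 3
		case <-time.After(pollTimeout):
			w.WriteHeader(http.StatusNoContent) // "NO DATA"
		case <-ctx.Done():
			// client went away
		}
	})
}
```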
Can we make the second GET request simpler and avoid the race condition at the same time? For example
You have the same idea as me :) Subscribe first and then check the value. I just didn't want to complicate the diagram, only to indicate that there's a race condition here, so it doesn't go unnoticed.
Another option: Faye, which Gitter uses in production. Some of its strengths:
Client and server implementations in many languages, including JavaScript, Go, Ruby, Swift, etc
Works in a clustered environment (without the need for sticky sessions etc)
Connection-oriented: expensive authentication/authorisation operations only occur once for the duration of the connection.
Supports websockets and fallback to XHR long polling
Supports browser-server and server-server
Redis is the only dependency
Gitter runs avg ~20k-30k concurrent realtime connections through Faye
Gitter's current implementation still has a lot of capacity to scale vertically, but horizontal scaling is possible (with some work) - possibly when we get to ~100k concurrent connections...
Used by Gitter, Shopify, myspace, others
Easy to extend (plugins) on the client and the server
Graceful recovery after disconnection (eg, travelling through a tunnel on a train....)
FYI, I don't believe we'd be able to run Faye on our existing Rails app with Unicorn, which isn't designed to have lots of open requests. The author suggests running a separate server process for Faye under Rainbows: https://groups.google.com/d/msg/faye-users/QhyDk1Z1jV0/BGghzX9uLn4J That by itself gives me pause.
Longer term: implement event sourcing for publishing/subscribing to events: gitlab-org/gitlab-ce#26894
The smarter polling system seems like something we can do right away with not too much effort. I think we can change our current notes polling to use this, which would relieve a significant amount of DB load. We can also consider using this to support the title/description updates.
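For illustration, the core of such an ETag check at the proxy layer could be as small as the sketch below; the Redis key layout and the middleware placement are assumptions, not the design from the linked issue:

```go
package etagpoll

import (
	"net/http"

	"github.com/go-redis/redis/v8"
)

// middleware short-circuits polling requests whose If-None-Match header still
// matches the ETag stored in Redis, so unchanged polls never reach Rails or
// PostgreSQL. The key layout ("etag:" + request path) is an illustrative assumption.
func middleware(rdb *redis.Client, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		etag, err := rdb.Get(r.Context(), "etag:"+r.URL.Path).Result()
		if err == nil && etag != "" && r.Header.Get("If-None-Match") == etag {
			w.Header().Set("ETag", etag)
			w.WriteHeader(http.StatusNotModified) // 304: nothing changed since last poll
			return
		}
		// Cache miss or stale ETag: let Rails render and set a fresh ETag.
		next.ServeHTTP(w, r)
	})
}
```

The notes endpoint would invalidate or overwrite the stored ETag whenever a new note is added, so the next poll falls through to Rails exactly once.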
I'm going to close this issue down because I think we've concluded that the long polling approach with ETags is the right next step: https://gitlab.com/gitlab-org/gitlab-ce/issues/26926 Other proposals can be considered later.