Auto log automatically gathers and searches the logs of applications that are deployed with GitLab.
Just like our Prometheus monitoring, logging can be deployed per application. This way, we can easily scale it to millions of applications.
Considerations:
Use the Elasticsearch cluster already needed for GitLab Search? Does logging need different tuning than search? Is it a problem that CE doesn't do search via Elasticsearch?
Should we use Fluentd since it is a cloud native standard?
How do we allow people to search the logs from GitLab?
How much should we help with tuning and maintaining the Elasticsearch installation?
Should we only gather application logs or many more?
Current WIP proposal:
Before introducing search, we need to allow for the installation of a system that aggregates the logs. That will be the focus of the work in this issue for the short-term. Search will be addressed as part of a follow-up issue: #31621 (closed)
UI Updates required as part of this issue:
Introduce an "all pods" option on the pod selection dropdown:
If a user selects the "all pods" view, they will see the time stamp, the pod name and the log entry. If possible, these pieces of information will be color-coded to allow for quicker parsing of the aggregated logs. Since syntax highlighting is a user preference, ideally we would utilize the user's selected syntax highlighting theme and apply it to the logs, as well. For reference, the yellow shown below is ffd866:
@tauriedavis The mockup looks pretty good, and will likely be good enough given the time constraints, but if you've got time, it would be good to get different example logs in there. Right now you're showing runner logs, but this is supposed to represent an app running in production, which would have very different looking logs.
See https://papertrailapp.com/tour/viewer/context for an example. A bunch of lines starting with a timestamp, a service name, and then structured or unstructured text. Those happen to be Heroku logs, so not a good example, but they'd look similar for other services.
Thanks @markpundsack - I've updated it. I'm positive the log makes no sense but maybe it looks more-correct haha. Let me know if you have specific copy I should pull in.
@mnohr - I'd like to get ahead of this one. Do we need to do some Engineering Discovery around options in advance of %12.4 to ensure this can be delivered in that time-frame?
Thanks for the ping, @adriel. I haven't worked on anything log related yet, so I'm getting up-to-speed. Please correct me if I'm misunderstanding anything.
This is what I'm seeing on the current pod logs screen (which users can navigate to by clicking on environments > and by clicking on the data viz displayed within a particular environment):
Assuming I'm looking at the correct log - in referencing the mock-up, I'm seeing a few things added to the current pod log screen:
Search
Additional icon actions (download log, I presume and...not sure what the doc icon represents)?
I'm assuming this issue is focused on adding the search piece. Right now, as far as I understand, the pod log screen shows the log for a single pod (ability to show logs for multiple pods will be added as part of https://gitlab.com/gitlab-org/gitlab-ee/issues/6502). We could add search in on this screen, then, but I presume the user would then only be searching the single displayed log, rather than searching across all the available logs. Is this ultimately what we want? It seems like the issue description is hinting at a larger search functionality here?
I wanted to start by asking that question, so I can at least get a better understanding of what the objective is here. Also, I wanted to flag that we are currently conducting research on user needs with regards to logging and tracing. I'm wondering if it makes sense to get the results of this research and to map out what our ideal user interaction here is with regards to logging, and then figure out where search best fits into it?
Thanks @ameliabauerly - I think of this issue as backend heavy to allow for the installation of a system that aggregates the logs. From a UX perspective, could we just extend the selection between pods (I think that exists and is functional today) and include an "All Pods" option to view the aggregated logs?
Could we just extend the selection between pods (I think that exists and is functional today) and include an "All Pods" option to view the aggregated logs?
We could. I guess my question with that direction would be: would users ever want to view the aggregated logs, or just search through them? Further, if we did have the option to view aggregated logs, what would that look like? I'm guessing an aggregated log could be millions of lines long, and I'm not sure how that would be organized. Going back to my first question, I'm also wondering if anyone would realistically want to see that?
I believe what you are proposing is that there is a first dropdown where users can select which pod to view. We would be adding a second field where users could search within their selection in the first dropdown. Utilizing these two dropdowns, users could both select which pod they want to look at (or, potentially select all pods), and then search within whatever selection they've made. This would allow users to view specific pods and search within them. It would also allow users to search within "all pods."
The question for me, though, is what do we display in the viewing pane if a user selects "all pods"? Do you have a sense of how that might work, @kencjohnston?
would users ever want to view the aggregated logs, or just search through them?
I've worked in environments where the tailing of centralized logs was useful for triage although in a limited fashion. I think an MVC of having the logs aggregated somewhere and presenting them in the same way we do today is adding value, although not completely.
I believe what you are proposing is that there is a first dropdown where users can select which pod to view.
Yes, as the scope for this issue.
We would be adding a second field where users could search within their selection in the first dropdown.
I've worked in environments where the tailing of centralized logs was useful for triage although in a limited fashion. I think an MVC of having the logs aggregated somewhere and presenting them in the same way we do today is adding value, although not completely.
@kencjohnston, @dhershkovitch - I'm trying to imagine what we would show when the user selects "all pods." Right now, we're showing the events and the duration from a single pod. From a technical POV, can we utilize the same framework we're using currently and just have the pod name followed by all the events, one-pod-after-another? That's how I'm assuming it would work for this first iteration anyway but, do we need to do some technical investigation into this to double check that this is possible?
@dhershkovitch - I only have access to limited logs. I also haven't worked on any logging features yet, so I'm thinking we'll need additional input to get accurate answers here, in particular to questions 1 and 3. My guesses are you can scroll back in time infinitely and, yes, if a pod goes away, we'd lose its logs. Though @adriel or @mnohr, could you confirm?
With point 2 - I definitely think this would be a useful feature.
From what I'm understanding, currently, our "pod logs" feature displays single logs. It allows limited ability to switch between logs, and to scroll forwards and backwards. But it really has no analysis/analytics layer built into it at all yet. My assumption is that this issue, aiming to get all the logs into a single space, will allow us to build that analysis layer on top of it. But I don't think we've properly thought through how that would work/look yet.
@joshlambert - it strikes me you might have the best knowledge of this feature since you were PM back when pod logs were last worked on. Is my summary a correct understanding of where we're at/where we need to go? If not, can you help set us on the right path?
Sure, answering these to the best of my knowledge. Also note this is regarding the current implementation, which is per-container logs rather than any kind of aggregated solution.
How far can we scroll back in time? (Ideally to be an infinite scroll, however not sure what are the constraints)
K8s pulls from the docker logs, which are subject to rotation/cleanup. In my testing, it goes back quite far though, at least on GKE.
That said, we limit the returned logs (to 200kb?) because we don't currently support pagination, and the payload could grow large enough (multiple MB of text) to slow down the page and be too big for the browser to parse easily.
What happens when a pod goes away? Will we lose all of its logs?
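As an illustration of the size cap described here, a minimal sketch of trimming a log to its most recent bytes. The 200 KB figure follows the uncertain "200kb?" above, and `tail_bytes` is a hypothetical helper, not the code GitLab actually uses.

```python
MAX_LOG_BYTES = 200 * 1024  # assumed limit; the thread only says "200kb?"


def tail_bytes(log_text: str, limit: int = MAX_LOG_BYTES) -> str:
    """Return at most `limit` bytes from the end of the log, starting on a line boundary."""
    data = log_text.encode("utf-8")
    if len(data) <= limit:
        return log_text
    tail = data[-limit:]
    # The cut usually lands mid-line, so drop everything up to the first newline.
    newline = tail.find(b"\n")
    if newline != -1:
        tail = tail[newline + 1:]
    # A cut inside a multi-byte character (only possible without a newline) is ignored.
    return tail.decode("utf-8", errors="ignore")
```

Keeping the end of the log rather than the beginning matches the "latest entries first" use case, at the cost of hiding older lines until pagination exists.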
I think they are still accessible from docker logs, but not sure. Right now we would provide a way to try to access them once it is no longer running.
@dhershkovitch - given what Josh is saying about the current limits to returned logs and the lack of pagination, I'd agree that the time picker may need to be introduced a little further down the line. Maybe introducing pagination would be a good next step, especially as we're proposing consolidating logs into an aggregated view?
Speaking of which, looping back to Kenny's proposal here:
I think of this issue as backend heavy to allow for the installation of a system that aggregates the logs. From a UX perspective, could we just extend the selection between pods (I think that exists and is functional today) and include an "All Pods" option to view the aggregated logs?
WDYT about this as a way forward? It sounds like having an aggregated log is a necessary next step towards getting logs to a place where we can search and run queries on them. But is that a correct assumption?
My big question with this route, as mentioned above, is what the UI would display if we had an "all pods" view. Could we just have, for example Pod A log followed by Pod B log, etc? If we don't have pagination though, and we have a 200kb limit to the logs, will an All Pods view actually show much of anything?
It sounds like having an aggregated log is a necessary next step towards getting logs to a place where we can search and run queries on them. But is that a correct assumption?
I would wait with introducing pagination for now, as it might be available for "free" in a centralized logging system.
what the UI would display if we had an "all pods" view.
I agree it probably won't give much value with the 200kb limitation, but if someone chooses to view it, we would display all logs ordered by timestamp with
Adding designs to summarize the discussion so far. We're proposing adding an "all pods" option into the pod selection dropdown:
When a user selects "all pods," we'll display all logs ordered by timestamp with their pod name.
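One way to picture the "all pods" ordering is the sketch below. It assumes each pod's entries arrive already sorted by timestamp; `merge_pod_logs` and the sample pod names are hypothetical.

```python
import heapq
from datetime import datetime
from typing import Dict, Iterator, List, Tuple

LogEntry = Tuple[datetime, str]  # (timestamp, message)


def merge_pod_logs(
    logs_by_pod: Dict[str, List[LogEntry]]
) -> Iterator[Tuple[datetime, str, str]]:
    """Yield (timestamp, pod_name, message) across all pods, oldest first.

    Assumes every per-pod list is already sorted by timestamp.
    """
    streams = (
        [(ts, pod, msg) for ts, msg in entries]
        for pod, entries in logs_by_pod.items()
    )
    # heapq.merge lazily interleaves the sorted streams into one sorted stream.
    return heapq.merge(*streams)


if __name__ == "__main__":
    sample = {
        "web-1": [
            (datetime(2019, 9, 1, 12, 0, 0), "GET /"),
            (datetime(2019, 9, 1, 12, 0, 2), "GET /health"),
        ],
        "worker-1": [(datetime(2019, 9, 1, 12, 0, 1), "job started")],
    }
    for ts, pod, msg in merge_pod_logs(sample):
        print(ts.isoformat(), pod, msg)
```

If a centralized log store ends up doing the sorting, this merge step would move into the query rather than application code.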
It seems like that might be a ton of information, and that it might be good to break that information up so that it's easier to parse. I don't know what sorts of formatting options we might have available to us here. It might also be that users won't really ever look at this view so the density of information is ultimately more useful than its "readability." Here are some options, depending on the level of formatting we want/have available to us:
No formatting, all info in one line
Minimal formatting, bold timestamp
Grouped by timestamp, dense option
Grouped by timestamp, less dense
Utilizing indents
I'm guessing the simple "no formatting, all info in one line" option will be the preferred way forward here but, wanted to share these alternate formatting options in case breaking up the log information would help users better digest this view.
@akohlbecker - Thank you for the great suggestion!
I took a first pass at what this could look like:
I'm going to post this in our UX slack channels for additional feedback on color choices. But, I think this is a good way of helping parse the information quickly without losing information density.
I've also realized that the issue description hadn't been updated based on these discussions. I took a pass at summarizing the discussions so far so that the issue description is up-to-date. @dhershkovitch, if I've left anything out or captured anything incorrectly, please feel free to update
Quick update here: the UX team also suggested using a human readable logging format. This would involve "separating out the columns with a distinct character so it looks like a table." Here's an example of how that could look:
Would something like this be possible, do you think, @akohlbecker? It might be slightly better from a usability POV, just because finding colors that are properly accessible may be challenging.
Although I'm a bit worried about long pod names shifting the whole column of messages to the right, such as in your mockup above. Do we want to truncate long names?
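To make the trade-off concrete, here is a sketch of a table-like row with a fixed-width pod column that truncates long names. The separator, the column width, and the choice to keep the end of the name are all assumptions for illustration.

```python
def format_row(timestamp: str, pod: str, message: str, pod_width: int = 24) -> str:
    """Render 'timestamp | pod | message' with a fixed-width pod column.

    Assumes pod_width is at least 4 so the "..." marker plus some of the name fits.
    """
    if len(pod) > pod_width:
        # Keep the end of the name, assuming the generated suffix is what
        # tells pods of the same deployment apart.
        pod = "..." + pod[-(pod_width - 3):]
    return f"{timestamp} | {pod:<{pod_width}} | {message}"
```

With a fixed width the message column no longer shifts, but a truncated name would probably need a tooltip or similar to expose the full value.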
@akohlbecker, I think you're right, we might solve one problem and unintentionally create another. Perhaps we can move towards a human readable log format in the future but, we probably need to explore something like that further before implementing it.
In the short term, if we can introduce colors to break up the entries a bit, that seems like a good step forward. It also aligns with Taurie's original recommendation for this issue.
The UX team recommended reusing some of the colors used in the Monokai syntax highlighting theme. I couldn't find a hex value in our docs but I did grab them from the Monokai website. If we use the yellow ffd866 to highlight the pod names, it would look like this:
I'll add this image to the issue description. I think these are all the decisions you need from me in the short term so I will mark this UX Ready, for now. But, if you or @mrincon need anything else, definitely let me know!
A quick update here: Pedro pointed out that, since the syntax highlighting theme is a user preference, it would be great if we could apply these preferences to the logs, as well. I don't know if that happens currently or not (I'm guessing not, as I believe my "theme" is set to white but my logs show up in black). But, I agree, it probably would be good to have the users' preferences reflected in their logs. This could be a follow-up issue but, wanted to at least capture this information here. Perhaps it's easier to just apply the user's selected syntax highlighting theme here rather than hard coding specific color values into the display? I will leave that to you to decide, @mrincon!
Caught up with this issue. I'll check how other parts of the application are working to display logs and find where the theming comes from, it makes sense we use the same pattern.
As a side note, logs are rendered using haml and not vue.
Another suggestion since we must limit log sizes to avoid performance issues.
We can display the number of lines prominently. Users frequently use this kind of limitation when reading logs in their terminals, showing 100, 1000, 10000... lines of the latest logs, as you are showing with the date.
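A small sketch of that terminal-style limit: keep only the newest N lines and report how many were seen, so the UI could show something like "last 1000 of 2500 lines". The function name and default limit are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def last_lines(lines: Iterable[str], limit: int = 1000) -> Tuple[List[str], int]:
    """Return the newest `limit` lines plus the total number of lines seen."""
    kept: Deque[str] = deque(maxlen=limit)  # older lines fall off the left side
    total = 0
    for line in lines:
        kept.append(line)
        total += 1
    return list(kept), total
```

Because the deque is bounded, memory stays proportional to the limit even when the underlying log is very large.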
@mrincon - do you mean showing the number of lines at the bottom of the window, with the date summary information?
I don't think that summary information has been implemented yet, so I'm not sure if that work would necessarily be part of this issue. But, if we can add it and/or the summary of the number of lines in that space, I agree, that would be a nice add!
Of course, if I've misunderstood your suggestion and you were thinking of something else, def let me know :)
We should also start looking at fluentbit, which is a lightweight version of FluentD. For a centralized logging system, if we settle on Elastic, we should also consider Filebeat, as it provides an easy way to ingest structured logs into Elastic.
@akohlbecker - One other thing I wanted to flag in terms of technical considerations as we move forward - according to #6502 (closed):
Currently, we only expose the first container's logs
...in the pod log view. Since we don't yet support switching between multiple containers in a pod, we just show the first container's log only.
Which for me begs the question: should the "All Pods" view also show only the first container's logs for all the pods otherwise listed in the dropdown? Or, should it contain all the logs for all of the containers, regardless of whether they are currently visible to users? The latter obviously seems like it would be preferable in an "All Pods" view but I don't know if that's even possible yet. At any rate, the current lack of visibility into pods with multiple containers seemed like something we should be mindful of as we move forward. Maybe it's not a big deal but, wanted to flag in case it is!
Thanks for flagging this! I think the default behaviour of our log aggregator will be to collect logs for all containers. I don't know at this time if we can filter it to display only the first container.
My personal opinion would be to keep the all-containers display rather than carry over that pre-existing limitation. We could then think about how to display all the containers in other views in a further project.
I opened #31105 (closed) recently as well for exposing gitlab-managed-apps within pod logs, would that be covered by the work here or does this issue only address user application pods?
To my knowledge this would apply to application pods only, but I can definitely see a view into the GitLab-managed containers being useful! Once the application logs are collected, adding a new namespace will be easy, at least on the log aggregation side; I'm not sure how we would display them.
So I've done some early exploration on this issue. A number of things are needed:
Install elastic search as a gitlab managed application (I believe @dwilkins is working on this in #30729 (closed))
Install fluentbit as a gitlab managed application
Setup an elasticsearch client to be able to query indexes from rails
Augment logs endpoint to support getting all logs, update response format
Make use of updated endpoint in the frontend
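For the "query indexes" step in the list above, a rough sketch of the kind of Elasticsearch request body that could fetch recent logs for a namespace, optionally narrowed to one pod. It is in Python purely for illustration (the thread talks about a Rails client). The field names `kubernetes.namespace_name`, `kubernetes.pod_name`, and `@timestamp` are assumptions about how the log shipper would label records, and the right query type depends on the index mapping.

```python
from typing import Any, Dict, Optional


def build_log_query(
    namespace: str, pod_name: Optional[str] = None, size: int = 500
) -> Dict[str, Any]:
    """Build a search body returning the newest `size` log entries.

    Field names are assumed, not taken from a real index.
    """
    filters = [{"match_phrase": {"kubernetes.namespace_name": namespace}}]
    if pod_name is not None:
        # Omitting the pod filter corresponds to the "all pods" view.
        filters.append({"match_phrase": {"kubernetes.pod_name": pod_name}})
    return {
        "query": {"bool": {"filter": filters}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": size,
    }
```

The `size` cap plays the same role as the current payload limit; real pagination would need a cursor on top of this.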
@dhershkovitch Do we want to open separate issues? This one feels a bit large for everything that has to happen.
I'm currently working on the fluentbit chart, but since everything in this list is interdependent, not sure that Application logging as a whole can be done in two weeks. Thoughts?
I'll also need an assist from the frontend team to add the tile in the application list.
I would like to comment on the logs response format. Let's look at the example of color in the logs, I imagine the final result will be something like this, with different color identifying different pods or containers:
Now, our GitLab CI pipelines use Ansi2Html to pre-render the output of logs as HTML on the BE. I make the case here that we should do the same in our logs.
When our backend does pagination/search, it can do the "coloring" of pods / containers according to the situation required. Highlighting search terms, accounting for ansi color escape sequences, or interpreting the sequences in the logs themselves (when we get the "red" code, we can display it).
Pros of using Ansi style tracing
We reuse very well tested trace code from GitLab CI
We can honor special codes our containers send into our UI
Move cursors e.g. yarn install loading effect
Data transfer overhead is lower than JSON
If sent via JSON, each line of log could have a lot of overhead in metadata (like pod name, container name, timestamp, etc...)
Cons
We cannot use syntax highlighting customization as easily.
Performance impact of BE-side rendering
We must account for security, e.g. a container log may send some XSS content our way.
The code in trace might be specific to GitLab CI use cases (and there might be some legacy?).
The Verify Stage is refactoring the format to JSON instead of HTML. The related issue is #31162 (closed) and is scheduled for %12.4.
My point for reusing their format still stands! They have figured out most of the response format and performance quirks from what I could see. So we will have a good starting point.
@filipa and @fabiopitino Thanks for answering my questions. If you have any comments on reusability / timeline, they would be helpful.
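On the security point in the cons above: if log lines are pre-rendered to HTML on the backend, untrusted text has to be escaped before it is wrapped in markup. A minimal sketch follows; `render_log_line` and the `term-fg-yellow` class are hypothetical and this is not the existing trace code.

```python
import html


def render_log_line(pod: str, message: str, css_class: str = "term-fg-yellow") -> str:
    """Wrap the pod name in a styled span and escape all untrusted text."""
    safe_class = html.escape(css_class, quote=True)
    safe_pod = html.escape(pod)
    safe_message = html.escape(message)  # turns <, >, &, and quotes into entities
    return f'<span class="{safe_class}">{safe_pod}</span> {safe_message}'
```

A line such as `<script>alert(1)</script>` then reaches the browser as inert text instead of executable markup.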
If you want to overlap logs from different Pods/endpoints you would do it upfront in the backend and once you have a merged stream of logs you can use the Ansi2json. This will do a shallow parsing of the stream to a more semantic log where ANSI escape codes are translated to strings (e.g. term-fg-green, term-bold, ...). On the frontend you will have a VueJS component that @filipa is building, that will handle the JSON from the backend and generate the UI for it. All this should be reusable out of the box as long as you point it to a different API/controller endpoint.
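To give a feel for the "shallow parsing" described here, a simplified sketch that turns ANSI SGR escape codes into style class strings. The code-to-class mapping and the output shape are assumptions modeled on the `term-fg-green` and `term-bold` examples; this is not the actual Ansi2json implementation or the schema from #31162.

```python
import re
from typing import Dict, List

# Hypothetical mapping covering only a few SGR codes.
SGR_CLASSES = {
    "1": "term-bold",
    "31": "term-fg-red",
    "32": "term-fg-green",
    "33": "term-fg-yellow",
}
SGR_PATTERN = re.compile(r"\x1b\[([0-9;]*)m")


def parse_line(line: str) -> List[Dict[str, str]]:
    """Split one log line into text segments with the style classes active on each."""
    segments: List[Dict[str, str]] = []
    active: List[str] = []
    position = 0
    for match in SGR_PATTERN.finditer(line):
        text = line[position:match.start()]
        if text:
            segments.append({"text": text, "style": " ".join(active)})
        for code in match.group(1).split(";"):
            if code in ("", "0"):
                active = []  # an empty or 0 code resets all styles
            elif code in SGR_CLASSES and SGR_CLASSES[code] not in active:
                # Simplification: a new color is added rather than replacing the old one.
                active.append(SGR_CLASSES[code])
        position = match.end()
    rest = line[position:]
    if rest:
        segments.append({"text": rest, "style": " ".join(active)})
    return segments
```

For example, `parse_line("\x1b[1;32mOK\x1b[0m done")` yields one bold green segment for `OK` and one unstyled segment for ` done`.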
Regarding the timeline we are aiming for %12.4 and we are now at the final testing phase.
@fabiopitino has been very helpful in explaining the JSON format concept. One of my takeaways is that the "tagging" should be done on the backend. In other words, the backend shouldn't respond with the metadata of which log belongs to which container, but simply with an additional color class, for example.
Frontend will simply display the JSON based on the schema: #31162 (closed) so as to keep the integration effort low.
Given that most of the BE-FE integration will be done by the Verify team with the corresponding Ansi2json and Vue component, I estimate the FE effort on this issue will be much smaller than the backend's, and we will probably not have to discuss the format too much, as they have done most of the heavy lifting on this topic.
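A sketch of what backend-side "tagging" could look like: each pod gets a stable color class, and the response carries only text plus that class rather than pod or container metadata. The palette, the helper names, and the `text`/`style` keys are hypothetical.

```python
import zlib
from typing import Dict, Iterable, List, Tuple

# Hypothetical palette of classes the frontend is assumed to know how to style.
POD_COLOR_CLASSES = ["term-fg-yellow", "term-fg-green", "term-fg-cyan", "term-fg-magenta"]


def color_class_for(pod_name: str) -> str:
    """Pick a stable class so the same pod keeps its color across requests."""
    checksum = zlib.crc32(pod_name.encode("utf-8"))
    return POD_COLOR_CLASSES[checksum % len(POD_COLOR_CLASSES)]


def tag_entries(entries: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (pod_name, message) pairs into display-ready segments."""
    return [
        {"text": f"{pod} {message}", "style": color_class_for(pod)}
        for pod, message in entries
    ]
```

With a small palette two pods can share a color, so the pod name still has to appear in the text.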
@mrincon - not sure if you needed a reply from me here but, just wanted to add that, for me, the important thing was to break up the wall of text a bit. If it makes sense to use the Ansi style tracing rather than the syntax highlighting customization, that's okay with me. That was more of a suggestion than a requirement
@ameliabauerly My suggestion has some implications for UX:
The look and feel will be mostly similar to that of the pipelines logs in Gitlab CI, which is not the same as in code highlighting. Unless we want to do some extra customization, which we probably can
Color highlighting comes from the pods logs, which will depend on what's installed there (applications logs can send colors our way). And after that, they can be enhanced on BE side, by adding the container names with a given color highlight.
Okay, thanks for letting me know! Given the different dependencies here, I'm not completely sure how this will end up looking. I guess we can continue to refine as needed depending on how these dependencies play out in practice. Keep me posted and we can continue to figure out a plan as needed!