Skip to content

Rack middleware for path traversal checks

🛡 What if?

What if we check all requests received by the rails backend for path traversals?

The idea is to have something along the components that handle web requests, if possible as early as possible. That component would simply check the path that is accessed and run this function that will check for path traversals.

This would add another protection layer against path traversals located on accessed urls (like gitlab.com/api/v4/foo%2F..%2Fbar).

One good candidate for such component is a rails middleware. This MR explores that option.

Related to: #413766+

🚀 Deployment plan

We can't possibly deploy a change that impacts all web requests without some safety net. In addition, it would be good to measure in advance that impact. We're adding another handling in the web request. That handling is not free and will cost cpu time. That time will be added to the request latency time.

Before accepting the tradeoff, it would be nice to have a glimpse of what are exactly the terms of the tradeoff.

As such, we suggest deploying this change behind a feature flag and the following plan:

  1. Simulation phase (👈 this MR)
    • feature flag enabled for a low-med percentage of requests, e.g. Feature.enable_percentage_of_time(:my_feature_3, 5)
    • All requests are allowed.
    • Depending on the feature flag, the path traversals checks are executed. Any failures in those checks are logged.
    • While we're here, log the amount of time spent to execute the checks.
  2. Analysis phase
    • Gather the logs from the previous phase. Analyze if the request is a genuine one (one that we expect) or not.
    • If it's a valid request, investigate how to ignore it or make it skip the checks.
    • Iterate until no further logs are collected for valid requests.
    • Analyze the results of the amount of time spent on checks.
  3. Enforce phase
    • Start rejecting requests that don't pass the checks. Deploy this with a feature flag.
    • Let it run for a full milestone on .com.
    • Let it run for a full milestone on self-managed.
  4. Cleanup phase
    • If no feedback or issues are reported from the previous phase, remove the feature flag and logging.

This MR represents phase (1.): simulation. In other words, no request is declined with this change. We're merely logging the activity of the middleware.

🤔 What does this MR do and why?

  • Add a middleware to check path traversals on requests.
  • Add the related specs.

📺 Screenshots or screen recordings

You know... this a rails middleware. So, no UI changes 😸

So below to see the middleware in action

How to set up and validate locally

  1. Enable the feature flag:
    Feature.enable(:middleware_path_traversal)

Try different urls and check the application_json.log

url application_json.log
curl http://gdk.test:8000/foo/bar {"severity":"WARN","time":"2023-06-14T10:00:37.397Z","correlation_id":"01H2WNK6NG48YEMFVY9Z7TEMG6","duration_s":0.00089,"class_name":"Gitlab::Middleware::PathTraversalCheck"}
curl http://gdk.test:8000/foo/../bar {"severity":"WARN","time":"2023-06-14T11:56:52.031Z","correlation_id":"01H2WW81TXMJQHVFHTN7V144NJ","duration_s":0.00131,"class_name":"Gitlab::Middleware::PathTraversalCheck"}
curl http://gdk.test:8000/foo%2F../bar {"severity":"WARN","time":"2023-06-14T11:57:50.778Z","correlation_id":"01H2WW9V9GEJ2J3MN28VJNA9MW","fullpath":"/foo/%2F../bar","message":"Potential path traversal attempt detected","duration_s":0.00203,"class_name":"Gitlab::Middleware::PathTraversalCheck"}
curl "http://gdk.test:8000/foo/bar?q=%2F%2E%2E%2Ftest" {"severity":"WARN","time":"2023-06-14T11:58:35.362Z","correlation_id":"01H2WWB6TRM27HJX5BRTTVKQXA","fullpath":"/foo/bar?q=%2F%2E%2E%2Ftest","message":"Potential path traversal attempt detected","duration_s":0.00226,"class_name":"Gitlab::Middleware::PathTraversalCheck"}
curl "http://gdk.test:8000/foo%2F%2E%2E%2Fbar" {"severity":"WARN","time":"2023-06-14T11:59:43.408Z","correlation_id":"01H2WWD9938KKBR52MJBEXNSSE","fullpath":"/foo%2F%2E%2E%2Fbar","message":"Potential path traversal attempt detected","duration_s":0.00222,"class_name":"Gitlab::Middleware::PathTraversalCheck"}

† for this case, it seems that Rack itself will refuse to parse the fullpath as foo/../bar. Instead, the path used will be /bar which, from the middleware point of view, is a valid url (no path traversals).

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

🔮 Other considerations

🐎 Why not workhorse?

Excellent question 👍

Workhorse would be the ideal location to reject such invalid requests. This way, the rails backend doesn't even receive it at all (= rails resources saved).

However, given the reach of this change (all requests), we absolutely need some safety nets. feature flag s are those safety nets. In addition, we already have a function that check for path traversals in rails and that is known to be working as expected. In workhorse, we would need to find or create such function. We could be introducing risky changes (like rejecting a request that doesn't have any path traversal).

All things considered, I think it's better to start with a rails middleware. If that approach turns out to be a good one but it's consuming too many resources, we can look to port the solution to workhorse.

📊 Get me some numbers

In order to know how much time we're going to spend in checking for path traversals, I thought that running some benchmarks would be useful.

Here is the file I used. Pardon the code quality, I created this within 5 minutes and I just wanted the results. While at it, I explored some other implementation options for the check path traversal function to check if we could improve the execution time.

For the scenarios, I used 3 different urls: a short one, a medium one (~1000 chars) and a long one (~20 000 chars). The medium and long one have a path traversal attempt.

`script/benchmarks/path_traversal.rb`
# frozen_string_literal: true

require 'benchmark/ips'
require_relative "../../config/environment"

SHORT_URL = "foo%2Fbar"
LONG_URL = "#{'a' * 10_000}%2F%2E%2E%2F#{'b' * 10_000}"
MEDIUM_URL = "#{'a' * 1000}%2F%2E%2E"

RE2_REGEX = RE2::Regexp.new('(^(\.{1,2})$|^\.\.[/\\\\]|[/\\\\]\.\.$|[/\\\\]\.\.[/\\\\])').freeze
SIMPLIFIED_REGEX = %r{(/{0,1}\.\./{0,1})}.freeze
TWO_DOTS = '..'
EXISTING_PATH_TRAVERSAL_REGEX = %r{(\A(\.{1,2})\z|\A\.\.[/\\]|[/\\]\.\.\z|[/\\]\.\.[/\\]|\n)}.freeze

def check_path_traversal_with_constant!(path)
  return unless path

  path = path.to_s if path.is_a?(Gitlab::HashedPath)
  raise PathTraversalAttackError, 'Invalid path' unless path.is_a?(String)

  path = ::Gitlab::Utils.decode_path(path)

  if path.match?(EXISTING_PATH_TRAVERSAL_REGEX)
    logger.warn(message: "Potential path traversal attempt detected", path: path.to_s)
    raise PathTraversalAttackError, 'Invalid path'
  end

  path
end


Benchmark.ips do |x|
  x.report("short url existing check path traversal") do
    Gitlab::PathTraversal.check_path_traversal!(SHORT_URL)
  rescue
  end
  x.report("short url check path traversal with constant") do
    check_path_traversal_with_constant!(SHORT_URL)
  rescue
  end
  x.report("short url re2") do
    RE2_REGEX.match(CGI.unescape(SHORT_URL))
  end
  x.report("short url include?") do
    CGI.unescape(SHORT_URL).include?(TWO_DOTS)
  end
  x.report("short url simplified regex") do
    SIMPLIFIED_REGEX.match?(CGI.unescape(SHORT_URL))
  end

  x.compare!
end

Benchmark.ips do |x|
  x.report("medium url existing check path traversal") do
    Gitlab::PathTraversal.check_path_traversal!(MEDIUM_URL)
  rescue
  end
  x.report("medium url check path traversal with constant") do
    check_path_traversal_with_constant!(MEDIUM_URL)
  rescue
  end
  x.report("medium url re2") do
    RE2_REGEX.match(CGI.unescape(MEDIUM_URL))
  end
  x.report("medium url include?") do
    CGI.unescape(MEDIUM_URL).include?(TWO_DOTS)
  end
  x.report("medium url simplified regex") do
    SIMPLIFIED_REGEX.match?(CGI.unescape(MEDIUM_URL))
  end

  x.compare!
end


Benchmark.ips do |x|
  x.report("long url existing check path traversal") do
    Gitlab::PathTraversal.check_path_traversal!(LONG_URL)
  rescue
  end
  x.report("long url check path traversal with constant") do
    check_path_traversal_with_constant!(LONG_URL)
  rescue
  end
  x.report("long url re2") do
    RE2_REGEX.match(CGI.unescape(LONG_URL))
  end
  x.report("long url include?") do
    CGI.unescape(LONG_URL).include?(TWO_DOTS)
  end
  x.report("long url simplified regex") do
    SIMPLIFIED_REGEX.match?(CGI.unescape(LONG_URL))
  end

  x.compare!
end


Benchmark.bmbm do |x|
  x.report("short url existing check path traversal") do
    Gitlab::PathTraversal.check_path_traversal!(SHORT_URL)
  rescue
  end
  x.report("short url check path traversal with constant") do
    check_path_traversal_with_constant!(SHORT_URL)
  rescue
  end
  x.report("short url re2") do
    RE2_REGEX.match(CGI.unescape(SHORT_URL))
  end
  x.report("short url include?") do
    CGI.unescape(SHORT_URL).include?(TWO_DOTS)
  end
  x.report("short url simplified regex") do
    SIMPLIFIED_REGEX.match?(CGI.unescape(SHORT_URL))
  end
end

Benchmark.bmbm do |x|
  x.report("medium url existing check path traversal") do
    Gitlab::PathTraversal.check_path_traversal!(MEDIUM_URL)
  rescue
  end
  x.report("medium url check path traversal with constant") do
    check_path_traversal_with_constant!(MEDIUM_URL)
  rescue
  end
  x.report("medium url re2") do
    RE2_REGEX.match(CGI.unescape(MEDIUM_URL))
  end
  x.report("medium url include?") do
    CGI.unescape(MEDIUM_URL).include?(TWO_DOTS)
  end
  x.report("medium url simplified regex") do
    SIMPLIFIED_REGEX.match?(CGI.unescape(MEDIUM_URL))
  end
end


Benchmark.bmbm do |x|
  x.report("long url existing check path traversal") do
    Gitlab::PathTraversal.check_path_traversal!(LONG_URL)
  rescue
  end
  x.report("long url check path traversal with constant") do
    check_path_traversal_with_constant!(LONG_URL)
  rescue
  end
  x.report("long url re2") do
    RE2_REGEX.match(CGI.unescape(LONG_URL))
  end
  x.report("long url include?") do
    CGI.unescape(LONG_URL).include?(TWO_DOTS)
  end
  x.report("long url simplified regex") do
    SIMPLIFIED_REGEX.match?(CGI.unescape(LONG_URL))
  end
end

Here are the results I get:

results
$ ruby script/benchmarks/path_traversal.rb
Warming up --------------------------------------
short url existing check path traversal
                       131.495k i/100ms
short url check path traversal with constant
                       135.288k i/100ms
       short url re2   188.348k i/100ms
  short url include?   374.376k i/100ms
short url simplified regex
                       358.905k i/100ms
Calculating -------------------------------------
short url existing check path traversal
                          1.307M (± 2.9%) i/s -      6.575M in   5.036125s
short url check path traversal with constant
                          1.303M (± 3.5%) i/s -      6.629M in   5.094431s
       short url re2      1.838M (±17.5%) i/s -      8.664M in   5.073499s
  short url include?      3.607M (± 3.8%) i/s -     18.344M in   5.093497s
short url simplified regex
                          3.399M (± 3.4%) i/s -     17.227M in   5.074337s

Comparison:
  short url include?:  3606763.6 i/s
short url simplified regex:  3398969.6 i/s - same-ish: difference falls within error
       short url re2:  1838392.3 i/s - 1.96x  slower
short url existing check path traversal:  1306655.8 i/s - 2.76x  slower
short url check path traversal with constant:  1302862.5 i/s - 2.77x  slower

Warming up --------------------------------------
medium url existing check path traversal
                        31.000  i/100ms
medium url check path traversal with constant
                         2.805k i/100ms
      medium url re2    26.154k i/100ms
 medium url include?    48.485k i/100ms
medium url simplified regex
                        51.226k i/100ms
Calculating -------------------------------------
medium url existing check path traversal
                          6.534k (±46.9%) i/s -      4.061k in   5.284083s
medium url check path traversal with constant
                         29.154k (± 3.6%) i/s -    145.860k in   5.009454s
      medium url re2    247.030k (±18.7%) i/s -      1.125M in   5.006283s
 medium url include?    477.863k (±18.6%) i/s -      2.182M in   5.030746s
medium url simplified regex
                        485.396k (±18.5%) i/s -      2.254M in   5.090959s

Comparison:
medium url simplified regex:   485395.5 i/s
 medium url include?:   477862.6 i/s - same-ish: difference falls within error
      medium url re2:   247030.0 i/s - 1.96x  slower
medium url check path traversal with constant:    29153.9 i/s - 16.65x  slower
medium url existing check path traversal:     6534.4 i/s - 74.28x  slower

Warming up --------------------------------------
long url existing check path traversal
                        29.000  i/100ms
long url check path traversal with constant
                       399.000  i/100ms
        long url re2     2.367k i/100ms
   long url include?     3.518k i/100ms
long url simplified regex
                         3.936k i/100ms
Calculating -------------------------------------
long url existing check path traversal
                          2.426k (±36.5%) i/s -      3.799k in   5.005201s
long url check path traversal with constant
                          4.017k (±10.1%) i/s -     19.551k in   5.015698s
        long url re2     22.342k (±18.1%) i/s -    104.148k in   5.061759s
   long url include?     35.546k (± 1.9%) i/s -    179.418k in   5.049406s
long url simplified regex
                         39.282k (± 2.5%) i/s -    196.800k in   5.013185s

Comparison:
long url simplified regex:    39281.7 i/s
   long url include?:    35545.8 i/s - 1.11x  slower
        long url re2:    22341.9 i/s - 1.76x  slower
long url check path traversal with constant:     4016.9 i/s - 9.78x  slower
long url existing check path traversal:     2425.6 i/s - 16.19x  slower

Rehearsal --------------------------------------------------------------------------------
short url existing check path traversal        0.000040   0.000024   0.000064 (  0.000015)
short url check path traversal with constant   0.000030   0.000025   0.000055 (  0.000042)
short url re2                                  0.000012   0.000009   0.000021 (  0.000010)
short url include?                             0.000004   0.000001   0.000005 (  0.000005)
short url simplified regex                     0.000003   0.000001   0.000004 (  0.000004)
----------------------------------------------------------------------- total: 0.000149sec

                                                   user     system      total        real
short url existing check path traversal        0.000017   0.000034   0.000051 (  0.000015)
short url check path traversal with constant   0.000015   0.000002   0.000017 (  0.000013)
short url re2                                  0.000017   0.000035   0.000052 (  0.000014)
short url include?                             0.000013   0.000002   0.000015 (  0.000011)
short url simplified regex                     0.000009   0.000019   0.000028 (  0.000007)
Rehearsal ---------------------------------------------------------------------------------
medium url existing check path traversal        0.000223   0.000169   0.000392 (  0.000279)
medium url check path traversal with constant   0.000050   0.000010   0.000060 (  0.000048)
medium url re2                                  0.000039   0.000024   0.000063 (  0.000048)
medium url include?                             0.000017   0.000011   0.000028 (  0.000026)
medium url simplified regex                     0.000006   0.000000   0.000006 (  0.000006)
------------------------------------------------------------------------ total: 0.000549sec

                                                    user     system      total        real
medium url existing check path traversal        0.000232   0.000191   0.000423 (  0.000288)
medium url check path traversal with constant   0.000070   0.000038   0.000108 (  0.000069)
medium url re2                                  0.000028   0.000049   0.000077 (  0.000025)
medium url include?                             0.000013   0.000002   0.000015 (  0.000011)
medium url simplified regex                     0.000014   0.000028   0.000042 (  0.000011)
Rehearsal -------------------------------------------------------------------------------
long url existing check path traversal        0.000460   0.000146   0.000606 (  0.000497)
long url check path traversal with constant   0.000266   0.000012   0.000278 (  0.000265)
long url re2                                  0.000052   0.000008   0.000060 (  0.000053)
long url include?                             0.000034   0.000012   0.000046 (  0.000034)
long url simplified regex                     0.000026   0.000009   0.000035 (  0.000028)
---------------------------------------------------------------------- total: 0.001025sec

                                                  user     system      total        real
long url existing check path traversal        0.000459   0.000197   0.000656 (  0.000516)
long url check path traversal with constant   0.000276   0.000061   0.000337 (  0.000279)
long url re2                                  0.000073   0.000029   0.000102 (  0.000070)
long url include?                             0.000039   0.000050   0.000089 (  0.000036)
long url simplified regex                     0.000039   0.000035   0.000074 (  0.000037)

Alright, here is what I can say reading the above:

  • Obviously, the longer the url, the longer the execution is.
  • The existing function is already quite fast. On long urls, it took less than 1ms to execute.
  • Putting the existing regex as a constant and update the existing function is an interesting option. Almost 2x faster on long urls.
  • There is room for improvement with alternative implementations.
    • Among them, the simplified regex (simplified as we would expect to use it only on urls) and the .include? approaches are crazy . About 15+x faster.
    • The problem with alternative implementations is that we introduce a risk of not properly covering all the cases.

In conclusion here, we have improvements available but given the already good performance of the existing function, I think it's not worth it (nor the risks involved of using a different implementation).

The only improvement we could consider here is moving the regex to a constant. This would have 0 impact on the function's behavior. Done in Use a constant in check_path_traversal! (!123661 - merged). 😸

Edited by David Fernandez

Merge request reports