Projects API endpoint: prevent site scraping
While working on the projects API problem, @alexpooley noticed that a huge share of the requests taking over 5 seconds are made by anonymous users, who are probably scraping the projects data. Requests with a `per_page` value of 100 and large offsets tend to be very slow.
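As a rough illustration of why deep offset pagination is expensive (the parameter values below are hypothetical, not taken from the logs): for page-based pagination the database has to walk past `(page - 1) * per_page` rows before it can return anything.

```ruby
require 'uri'

# Hypothetical deep-pagination request a scraper might issue.
params = { per_page: 100, page: 2_000, order_by: 'id' }
uri = URI('https://gitlab.com/api/v4/projects')
uri.query = URI.encode_www_form(params)

# Rows the database must skip before producing this single page.
offset = (params[:page] - 1) * params[:per_page]
puts uri
puts offset # 199900 rows skipped for one page of 100 results
```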
Examples (all from the 24-hour period of Wednesday, December 21st):
- all queries of a certain type come from two IP addresses: https://log.gprd.gitlab.net/goto/993d78f0-81e5-11ed-9f43-e3784d7fe3ca
  - those two account for 17% of all anonymous requests over 5 seconds
- another IP address: https://log.gprd.gitlab.net/goto/9fa54e40-81e8-11ed-9f43-e3784d7fe3ca
  - accounts for 12% of all anonymous requests over 5 seconds
- all requests above the error budget without a user: https://log.gprd.gitlab.net/goto/b91c5170-81e8-11ed-85ed-e7557b0a598c
  - those are 51% of all requests with a duration above the error budget.
These examples clearly show that we are spending a lot of time and effort fixing problems for a small number of users whose usage of our site is atypical. We propose to simply limit the use of this API endpoint by anonymous users:
- proposal 1: forbid anonymous usage of this endpoint
- proposal 2: heavily rate-limit anonymous usage of this endpoint.
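To make proposal 2 concrete, here is a toy fixed-window rate limiter keyed by client IP. This is only a sketch of the idea, not GitLab's actual throttling code (in practice the existing Rails rate-limiting machinery would be used); the class name and limits are made up:

```ruby
# Toy fixed-window rate limiter keyed by client IP (illustrative only).
class AnonRateLimiter
  def initialize(limit:, window_seconds:)
    @limit = limit
    @window = window_seconds
    @hits = Hash.new { |h, k| h[k] = [] }
  end

  # Returns true if the request is allowed, false if it should be throttled.
  def allow?(ip, now = Time.now)
    window_start = now - @window
    @hits[ip].reject! { |t| t < window_start }
    return false if @hits[ip].size >= @limit
    @hits[ip] << now
    true
  end
end

limiter = AnonRateLimiter.new(limit: 2, window_seconds: 60)
puts limiter.allow?('203.0.113.7') # true
puts limiter.allow?('203.0.113.7') # true
puts limiter.allow?('203.0.113.7') # false: over the limit for this window
```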
Here is a Ruby script to break out the `params.keys` and `params.values` columns from the Elastic spreadsheet dump. Useful for further segmenting the scrape data:
```ruby
#!/usr/bin/ruby
require 'csv'
require 'set'

# Split "json.params.key","json.params.value" and rejoin as column/value.
PARAMS_KEY_COLUMN = 'json.params.key'
PARAMS_VAL_COLUMN = 'json.params.value'

def split_values(vals)
  vals.to_s.split(',').map(&:strip)
end

# All headers except for the param cols.
def base_headers(row)
  @base_headers ||= row.headers - [PARAMS_KEY_COLUMN, PARAMS_VAL_COLUMN]
end

# Convert a row to a hash with the params split out into columns.
def build_new_row(row)
  new_row = row.to_h
  # Split the param col values.
  keys = split_values row[PARAMS_KEY_COLUMN]
  vals = split_values row[PARAMS_VAL_COLUMN]
  # Combine param col values into a key/value hash.
  param_cols = Hash[keys.zip(vals)]
  # Add param col values back to the CSV output.
  new_row.merge(param_cols)
end

# Iterate through the CSV once to work out the full header set.
header_set = Set.new
CSV.foreach(ARGV[0], headers: true) do |row|
  header_set += row.headers
  header_set += split_values row[PARAMS_KEY_COLUMN]
end
header_set -= [PARAMS_KEY_COLUMN, PARAMS_VAL_COLUMN]
headers = header_set.to_a.sort

# Iterate again to translate each row to the new columns.
CSV($stdout) do |out|
  # Emit the header row explicitly, then the reshaped data rows.
  out << headers
  CSV.foreach(ARGV[0], headers: true) do |row|
    row_hash = build_new_row(row)
    out << headers.map { |header| row_hash[header] }
  end
end
```
Usage:

```shell
ruby split_log_params.rb elastic_dump.csv > output_file.csv
```
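To get a feel for the transformation (the column values here are invented, not from a real dump): a row whose `json.params.key` is `per_page, page` and whose `json.params.value` is `100, 57` ends up with `per_page` and `page` as their own columns. A self-contained version of the same reshaping:

```ruby
require 'csv'

# Hypothetical one-row Elastic dump with the packed param columns.
input = <<~CSV
  json.path,json.params.key,json.params.value
  /api/v4/projects,"per_page, page","100, 57"
CSV

rows = CSV.parse(input, headers: true).map do |row|
  keys = row['json.params.key'].split(',').map(&:strip)
  vals = row['json.params.value'].split(',').map(&:strip)
  # Merge the unpacked params into the row as first-class columns.
  row.to_h.merge(keys.zip(vals).to_h)
end

p rows.first['per_page'] # => "100"
p rows.first['page']     # => "57"
```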
Edited by Alex Pooley