Projects API endpoint: prevent site scraping
While working on the projects API problem, @alexpooley noticed that a huge share of the requests taking over 5 seconds are made by anonymous users, who are probably scraping the projects data. Requests with a `per_page` value of 100 and large offsets tend to be very slow.
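As a rough illustration of why deep offset pagination is expensive (the parameter values below are hypothetical, not taken from the logs): for page-based pagination the database has to walk past `(page - 1) * per_page` rows before it can return anything.

```ruby
require 'uri'

# Hypothetical deep-pagination request a scraper might issue.
params = { per_page: 100, page: 2_000, order_by: 'id' }
uri = URI('https://gitlab.com/api/v4/projects')
uri.query = URI.encode_www_form(params)

# Rows the database must skip before producing this single page.
offset = (params[:page] - 1) * params[:per_page]
puts uri
puts offset # 199900 rows skipped for one page of 100 results
```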
Examples (all from the 24-hour period of Wednesday, December 21st):
- all queries of a certain type come from two IP addresses: https://log.gprd.gitlab.net/goto/993d78f0-81e5-11ed-9f43-e3784d7fe3ca
  - those two account for 17% of all anonymous requests over 5 seconds
- another IP address: https://log.gprd.gitlab.net/goto/9fa54e40-81e8-11ed-9f43-e3784d7fe3ca
  - accounts for 12% of all anonymous requests over 5 seconds
- all requests above the error budget without a user: https://log.gprd.gitlab.net/goto/b91c5170-81e8-11ed-85ed-e7557b0a598c
  - those are 51% of all requests with a duration above the error budget.
These examples clearly show that we are spending a lot of time and effort fixing problems for a small number of users whose usage of our site is atypical. We propose to simply limit the use of this API endpoint by anonymous users:
- proposal 1: forbid anonymous usage of this endpoint
- proposal 2: heavily rate-limit anonymous usage of this endpoint.
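To make proposal 2 concrete, here is a toy fixed-window rate limiter keyed by client IP. This is only a sketch of the idea, not GitLab's actual throttling code (in practice the existing Rails rate-limiting machinery would be used); the class name and limits are made up:

```ruby
# Toy fixed-window rate limiter keyed by client IP (illustrative only).
class AnonRateLimiter
  def initialize(limit:, window_seconds:)
    @limit = limit
    @window = window_seconds
    @hits = Hash.new { |h, k| h[k] = [] }
  end

  # Returns true if the request is allowed, false if it should be throttled.
  def allow?(ip, now = Time.now)
    window_start = now - @window
    @hits[ip].reject! { |t| t < window_start }
    return false if @hits[ip].size >= @limit
    @hits[ip] << now
    true
  end
end

limiter = AnonRateLimiter.new(limit: 2, window_seconds: 60)
puts limiter.allow?('203.0.113.7') # true
puts limiter.allow?('203.0.113.7') # true
puts limiter.allow?('203.0.113.7') # false: over the limit for this window
```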
Here is a Ruby script to break out the `params.keys` and `params.values` columns from the Elastic spreadsheet dump. Useful for further segmenting the scrape data:
```ruby
#!/usr/bin/ruby
require 'csv'
require 'set'

# Split "json.params.key","json.params.value" and rejoin as column/value.
PARAMS_KEY_COLUMN = 'json.params.key'
PARAMS_VAL_COLUMN = 'json.params.value'

def split_values(vals)
  vals.to_s.split(',').map(&:strip)
end

# All headers except for the param cols.
def base_headers(row)
  @base_headers ||= row.headers - [PARAMS_KEY_COLUMN, PARAMS_VAL_COLUMN]
end

# Convert a row to a hash with the params split out into columns.
def build_new_row(row)
  new_row = row.to_h
  # Split the param col values.
  keys = split_values row[PARAMS_KEY_COLUMN]
  vals = split_values row[PARAMS_VAL_COLUMN]
  # Combine param col values into a key/value hash.
  param_cols = Hash[keys.zip(vals)]
  # Add param col values back to the CSV output.
  new_row.merge(param_cols)
end

# Iterate through the CSV once to work out the full header set.
header_set = Set.new
CSV.foreach(ARGV[0], headers: true) do |row|
  header_set += row.headers
  header_set += split_values row[PARAMS_KEY_COLUMN]
end
header_set -= [PARAMS_KEY_COLUMN, PARAMS_VAL_COLUMN]
headers = header_set.to_a.sort

# Iterate again to translate each row to the new columns.
CSV($stdout) do |out|
  # Emit the header row explicitly, then the reshaped data rows.
  out << headers
  CSV.foreach(ARGV[0], headers: true) do |row|
    row_hash = build_new_row(row)
    out << headers.map { |header| row_hash[header] }
  end
end
```
Usage:

```shell
ruby split_log_params.rb elastic_dump.csv > output_file.csv
```
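To get a feel for the transformation (the column values here are invented, not from a real dump): a row whose `json.params.key` is `per_page, page` and whose `json.params.value` is `100, 57` ends up with `per_page` and `page` as their own columns. A self-contained version of the same reshaping:

```ruby
require 'csv'

# Hypothetical one-row Elastic dump with the packed param columns.
input = <<~CSV
  json.path,json.params.key,json.params.value
  /api/v4/projects,"per_page, page","100, 57"
CSV

rows = CSV.parse(input, headers: true).map do |row|
  keys = row['json.params.key'].split(',').map(&:strip)
  vals = row['json.params.value'].split(',').map(&:strip)
  # Merge the unpacked params into the row as first-class columns.
  row.to_h.merge(keys.zip(vals).to_h)
end

p rows.first['per_page'] # => "100"
p rows.first['page']     # => "57"
```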
Edited by Alex Pooley