Querying metrics endpoint causes repeated schema loads to application settings
UPDATE: I suspect there are two issues here being conflated into one:
Issue | Merge requests |
---|---|
Original issue: Repeated calls to table_exists? in Gitlab::CurrentSettings.connect_to_db?
|
https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/17781 |
Secondary issue: Repeated calls to column_exists? in Elasticsearch columns |
https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/17780, https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/4987 |
Original issue:
I enabled full PostgreSQL statement logging on the Geo testbed, and I saw endless streams of:
2018-02-16_20:50:37.07759 postgres-01 postgresql: LOG: statement: SET application_name='unicorn worker[1] -D -E produc...service/gitlab-rails/config.ru';
2018-02-16_20:50:37.07850 postgres-01 postgresql: LOG: statement: SELECT 1
2018-02-16_20:50:37.08298 postgres-01 postgresql: LOG: statement: SELECT 1
2018-02-16_20:50:37.08512 postgres-01 postgresql: LOG: execute <unnamed>: SELECT COUNT(*)
2018-02-16_20:50:37.08520 postgres-01 postgresql: FROM pg_class c
2018-02-16_20:50:37.08523 postgres-01 postgresql: LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
2018-02-16_20:50:37.08525 postgres-01 postgresql: WHERE c.relkind IN ('r','v','m') -- (r)elation/table, (v)iew, (m)aterialized view
2018-02-16_20:50:37.08526 postgres-01 postgresql: AND c.relname = 'application_settings'
2018-02-16_20:50:37.08528 postgres-01 postgresql: AND n.nspname = ANY (current_schemas(false))
2018-02-16_20:50:37.08530 postgres-01 postgresql:
2018-02-16_20:50:38.01871 postgres-01 postgresql: LOG: statement: SELECT 1
2018-02-16_20:50:38.04939 postgres-01 postgresql: LOG: statement: SET synchronous_commit TO 'local'
2018-02-16_20:50:38.05267 postgres-01 postgresql: LOG: statement: SELECT 1
2018-02-16_20:50:38.05569 postgres-01 postgresql: LOG: statement: SET application_name='unicorn worker[3] -D -E produc...service/gitlab-rails/config.ru';
2018-02-16_20:50:38.05662 postgres-01 postgresql: LOG: statement: SELECT 1
2018-02-16_20:50:38.06043 postgres-01 postgresql: LOG: statement: SELECT 1
2018-02-16_20:50:38.06258 postgres-01 postgresql: LOG: execute <unnamed>: SELECT COUNT(*)
2018-02-16_20:50:38.06265 postgres-01 postgresql: FROM pg_class c
2018-02-16_20:50:38.06268 postgres-01 postgresql: LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
2018-02-16_20:50:38.06270 postgres-01 postgresql: WHERE c.relkind IN ('r','v','m') -- (r)elation/table, (v)iew, (m)aterialized view
2018-02-16_20:50:38.06272 postgres-01 postgresql: AND c.relname = 'application_settings'
2018-02-16_20:50:38.06274 postgres-01 postgresql: AND n.nspname = ANY (current_schemas(false))
I tracked down the source to happening whenever the metrics endpoint (/-/metrics
) is called:
105205 22:43:50.117894 write(10, "Started GET \"/-/metrics\" for 10.62.0.10 at 2018-02-16 22:43:50 +0000\n", 69) = 69 <0.000026>
105205 22:43:50.118655 sendto(12, "Q\0\0\0\rSELECT 1\0", 14, MSG_NOSIGNAL, NULL, 0) = 14 <0.000037>
105205 22:43:50.118748 poll([{fd=12, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=12, revents=POLLIN}]) <0.002338>
105205 22:43:50.121143 recvfrom(12, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0D\0\0\0\v\0\1\0\0\0\0011C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 66 <0.000016>
105205 22:43:50.121586 fcntl(13, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000013>
105205 22:43:50.121640 write(13, "*2\r\n$3\r\nget\r\n$50\r\ncache:gitlab:rack::attack:allow2ban:ban:10.62.0.10\r\n", 70) = 70 <0.000034>
105205 22:43:50.121735 fcntl(13, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000012>
105205 22:43:50.121785 read(13, 0x7fea22098800, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000013>
105205 22:43:50.121881 select(14, [13], NULL, NULL, {5, 0}) = 1 (in [13], left {4, 999198}) <0.000819>
105205 22:43:50.122771 fcntl(13, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000012>
105205 22:43:50.122825 read(13, "$-1\r\n", 1024) = 5 <0.000014>
105205 22:43:50.123128 fcntl(13, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000013>
105205 22:43:50.123180 write(13, "*2\r\n$3\r\nget\r\n$37\r\ncache:gitlab:application_setting.last\r\n", 57) = 57 <0.000031>
105205 22:43:50.123272 fcntl(13, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000012>
105205 22:43:50.123321 read(13, 0x7fea22081600, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000012>
105205 22:43:50.123406 select(14, [13], NULL, NULL, {5, 0}) = 1 (in [13], left {4, 999485}) <0.000532>
105205 22:43:50.124011 fcntl(13, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000012>
105205 22:43:50.124092 read(13, "$-1\r\n", 1024) = 5 <0.000015>
105205 22:43:50.124248 sendto(12, "Q\0\0\0\rSELECT 1\0", 14, MSG_NOSIGNAL, NULL, 0) = 14 <0.000051>
105205 22:43:50.124401 poll([{fd=12, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=12, revents=POLLIN}]) <0.001438>
105205 22:43:50.125912 recvfrom(12, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0D\0\0\0\v\0\1\0\0\0\0011C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 66 <0.000015>
105205 22:43:50.126200 recvfrom(12, 0x7fea1f85df40, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000015>
105205 22:43:50.126272 sendto(12, "P\0\0\1U\0 SELECT COUNT(*)\n FROM pg_class c\n LEFT JOIN pg_namespace n ON n.oid = c.relnamespace\n WHERE c.relkind IN ('r','v','m') -- (r)elation/table, (v)iew, (m)aterialized view\n AND c.relname = 'application_settings'\n AND n.nspname = ANY (current_schemas(false))\n\0\0\0B\0\0\0\16\0\0\0\0\0\0\0\1\0\0D\0\0\0\6P\0E\0\0\0\t\0\0\0\0\0S\0\0\0\4", 379, MSG_NOSIGNAL, NULL, 0) = 379 <0.000109>
105205 22:43:50.126853 recvfrom(12, 0x7fea1f85df40, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000013>
105205 22:43:50.126910 select(13, [12], NULL, NULL, NULL) = 1 (in [12]) <0.001994>
105205 22:43:50.129011 recvfrom(12, "1\0\0\0\0042\0\0\0\4T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D\0\0\0\v\0\1\0\0\0\0010C\0\0\0\rSELECT 1\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 73 <0.000046>
105205 22:43:50.134092 write(10, "Processing by MetricsController#index as HTML\n", 46) = 46 <0.000027>
105205 22:43:50.135742 write(10, "Completed 200 OK in 1ms (Views: 0.3ms | ActiveRecord: 0.0ms | Elasticsearch: 0.0ms)\n", 84) = 84 <0.000024>
It looks like there's a Redis cache miss for the cached application setting, which then causes Rails to reload the attributes: https://gitlab.com/gitlab-org/gitlab-ee/blob/7c9872f28058fc07d7bd96d24bd60a5e17386c76/app/models/application_setting.rb#L263
Gitlab::CurrentSettings
doesn't look like it does the right thing:
- Someone calls
Gitlab::CurrentSettings.prometheus_metrics_enabled?
- Since this is a new request,
RequestStore
is empty - This then calls
Gitlab::CurrentSettings#ensure_application_settings!
- This then calls
Gitlab::CurrentSettings#cached_application_settings
- This then calls
ApplicationSetting.cached
, which will read from Redis and reload the attributes - Since there's a cache miss, then
uncached_application_settings
is called, but the value is never cached again.
Edited by Stan Hu