Enable Pages access control

I'm opening this to track enabling Pages access control on .com. It seems we need to do some testing and checks before enabling this, as discussed in https://gitlab.slack.com/archives/C1BSEQ138/p1542369390013600.

It won't be enabled by default (impossible to do so on HA installations). I guess we'll need to do a production readiness review, etc., before turning it on.

Summary

GitLab Pages access control was introduced in https://gitlab.com/gitlab-org/gitlab-ce/issues/33422.

Here are the Omnibus changes, gitlab-org/omnibus-gitlab!2583 (merged), and the admin docs: https://docs.gitlab.com/ee/administration/pages/#access-control.

Related issues

https://gitlab.com/gitlab-org/gitlab-ce/issues/59286 - blocker, in review
https://gitlab.com/gitlab-org/gitlab-ce/issues/56386 - inconsistent settings UI; it can cause users to accidentally make pages private while updating project settings, but they can easily turn it back

Deployment plan

...

run rake gitlab:pages:make_all_public # starts a background migration that fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/59286; should take about an hour to finish

...

Architecture

Add architecture diagrams to this issue of feature components and how they interact with existing GitLab components. Include internal dependencies, ports, security policies, etc.

Couldn't find any, only these docs

Describe each component of the new feature and enumerate what it does to support customer use cases.

pages daemon
- checks if access_control is enabled for a particular project
- redirects users to gitlab.com for auth
- receives the redirect back from gitlab.com with a temporary token
- exchanges the temporary token for a permanent one and stores it in the user session
- on every request, calls "%s/api/v4/projects/%d/pages_access" to check user authorization
gitlab rails app
- updates pages configs
- performs OAuth auth (redirects and token exchange)
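The per-request check in the pages daemon list above can be sketched roughly as follows. This is only an illustration, not the daemon's actual code (the daemon is not written in Python): the path template is quoted from the list, while `decide`, `check_access`, the outcome names, and the mapping of HTTP status codes to outcomes are assumptions made for this sketch.

```python
from typing import Callable, Optional

# Path template quoted from the component list above.
PAGES_ACCESS_TEMPLATE = "%s/api/v4/projects/%d/pages_access"


def pages_access_url(gitlab_base_url: str, project_id: int) -> str:
    """Build the URL used to check a user's access to a project's pages."""
    return PAGES_ACCESS_TEMPLATE % (gitlab_base_url.rstrip("/"), project_id)


def decide(
    access_control_enabled: bool,
    session_token: Optional[str],
    project_id: int,
    gitlab_base_url: str,
    check_access: Callable[[str, str], int],
) -> str:
    """Return one of "serve", "redirect_to_auth", "deny", "error".

    check_access is a hypothetical callable that performs the API request
    with the session token and returns the HTTP status code.
    """
    if not access_control_enabled:
        return "serve"  # public site, no API call needed
    if session_token is None:
        return "redirect_to_auth"  # start the OAuth flow on gitlab.com
    url = pages_access_url(gitlab_base_url, project_id)
    try:
        status = check_access(url, session_token)
    except Exception:
        return "error"  # rendered as a 500 to the user
    if status == 200:
        return "serve"
    if status >= 500:
        return "error"
    return "deny"  # assumption: any other status means no access
```

Because `check_access` is injected, the decision logic can be exercised without a running GitLab instance, similar in spirit to testing the auth process against a mock API server.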

For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?

pages daemon
- in theory, can fail on start with an incorrect configuration, but since we restart daemons incrementally, that will just result in a reverted deploy
- if access-control does not work properly, pages sites with access-control enabled can become unavailable
- in theory, misconfiguration can result in a redirect loop between gitlab.com and gitlab.io for projects with access-control enabled
rails app
- in the worst-case scenario pages configs can stop updating (but I'm being too paranoid; this is well covered with tests and many users use it)
- if access-control is enabled on a large number of pages projects, we can hit the API rate limit and pages sites would be temporarily unavailable

If applicable, explain how this new feature will scale and any potential single points of failure in the design.
- see rate limits from the previous section
- both pages-daemon and rails api are single points of failure
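Since every request to a site with access-control enabled triggers one API call, the extra API load can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope helper only; the function names are hypothetical, and the traffic figures and the rate limit in the usage lines are made-up placeholders, not measurements from gitlab.com.

```python
def estimated_api_rps(pages_rps: float, private_fraction: float) -> float:
    """Estimate pages_access API calls per second.

    pages_rps: total Pages requests per second.
    private_fraction: share of that traffic going to projects with
    access-control enabled, between 0 and 1.
    """
    if pages_rps < 0:
        raise ValueError("pages_rps must be non-negative")
    if not 0 <= private_fraction <= 1:
        raise ValueError("private_fraction must be between 0 and 1")
    # One API call per request to an access-controlled site.
    return pages_rps * private_fraction


def exceeds_limit(pages_rps: float, private_fraction: float, api_limit_rps: float) -> bool:
    """True if the estimated API load is above the given rate limit."""
    return estimated_api_rps(pages_rps, private_fraction) > api_limit_rps


# Placeholder numbers for illustration only.
print(estimated_api_rps(1000, 0.25))
print(exceeds_limit(1000, 0.25, api_limit_rps=100))
```

Plugging in real traffic numbers and the actual API rate limit would show how much adoption the current design tolerates before sites start failing.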

Operational Risk Assessment

What are the potential scalability or performance issues that may result with this change?
- see rate limits above
List the external and internal dependencies to the application (ex: redis, postgres, etc) for this feature and how it will be impacted by a failure of that dependency.
- rails api - pages with enabled access-control will become unavailable
Were there any features cut or compromises made to make the feature launch?
- none yet
List the top three operational risks when this feature goes live.
- a spike in API requests if a lot of projects enable it; and we couldn't just turn this feature off after a month, since that would make all sites public ...
What are a few operational concerns that will not be present at launch, but may be a concern later?
Can the new product feature be safely rolled back once it is live, can it be disabled using a feature flag?
- After a while, no. But each project needs to enable it manually and can turn it off at any time.
- We may consider guarding pages access level with a feature flag for a period of testing on gitlab.com
Document every way the customer will interact with this new feature and how customers will be impacted by a failure of each interaction.
- set pages_access_level in project settings
- open pages site - user will see 500 in case auth is misconfigured, or if anything goes wrong
As a thought experiment, think of worst-case failure scenarios for this product feature, how can the blast-radius of the failure be isolated?

Database

No changes affecting database required

Security

Were the gitlab security development guidelines followed for this feature?
If this feature requires new infrastructure, will it be updated regularly with OS updates?
- does not require new infrastructure
Has effort been made to obscure or elide sensitive customer data in logging?
- yes, no tokens are present in logs
Is any potentially sensitive user-provided data persisted? If so is this data encrypted at rest?
- no user data used

Performance

Explain what validation was done following GitLab's performance guidelines. Please explain or link to the results below.
- No check was done, but the rails part basically consists of 1 API request for auth
Are there any potential performance impacts on the database when this feature is enabled at GitLab.com scale?
- I don't see any
Are there any throttling limits imposed by this feature? If so how are they managed?
- no throttling limits
If there are throttling limits, what is the customer experience of hitting a limit?
- no throttling limits
For all dependencies external and internal to the application, are there retry and back-off strategies for them?
- no; if OAuth fails or API requests fail, we render 500, but reloading the page should help
Does the feature account for brief spikes in traffic, at least 2x above the expected TPS?

Backup and Restore

Outside of existing backups, are there any other customer data that needs to be backed up for this product feature?
- no

Monitoring and Alerts

Is the service logging in JSON format and are logs forwarded to logstash?
- yes
Is the service reporting metrics to Prometheus?
- currently NO; since failures are reported to logs, we can count error log messages per minute
How is the end-to-end customer experience measured?
- There are no metrics I'm aware of
Do we have a target SLA in place for this service?
Do we know what the indicators (SLI) are that map to the target SLA?
Do we have alerts that are triggered when the SLI's (and thus the SLA) are not met?
Do we have troubleshooting runbooks linked to these alerts?
What are the thresholds for tweeting or issuing an official customer notification for an outage related to this feature?
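As a stopgap until Prometheus metrics exist, counting error log messages per minute (as suggested in the Prometheus answer above) could look roughly like this. It is a minimal sketch: the field names `level` and `time`, the value `error`, and ISO 8601 timestamps are assumptions about the JSON log format, not confirmed details of the pages daemon's logs.

```python
import json
from collections import Counter
from typing import Iterable


def errors_per_minute(
    lines: Iterable[str],
    level_key: str = "level",  # assumed field name
    time_key: str = "time",    # assumed field name
) -> Counter:
    """Count error-level JSON log lines, bucketed by minute."""
    counts: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get(level_key) != "error":
            continue
        timestamp = entry.get(time_key)
        if not isinstance(timestamp, str) or len(timestamp) < 16:
            continue
        # Assumes ISO 8601, so the first 16 chars are "YYYY-MM-DDTHH:MM".
        counts[timestamp[:16]] += 1
    return counts
```

The per-minute counts could then be compared against a threshold to drive a simple alert, which would also give a first candidate for an SLI.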

Responsibility

Which individuals are the subject matter experts and know the most about this feature?
- @nick.thomas, @nolith, @vshushlin
Which team or set of individuals will take responsibility for the reliability of the feature once it is in production?
Is someone from the team who built the feature on call for the launch? If not, why not?

Testing

Describe the load test plan used for this feature. What breaking points were validated?
For the component failures that were theorized for this feature, were they tested? If so include the results of these failure tests.
Give a brief overview of what tests are run automatically in GitLab's CI/CD pipeline for this feature?
- auth process is tested inside gitlab-pages daemon with mock api server

Edited Jul 12, 2019 by Vladimir Shushlin
