
Draft: POC Enforce Organization Isolation based on `organization_id` on every table

What does this MR do and why?

Related to #394800 (closed). This is a POC to evaluate whether we can enforce organization isolation using something like foreign keys. The idea is that every table gets an `organization_id` column, and every existing foreign key is converted to a composite foreign key that also includes `organization_id`. This effectively means both sides of every foreign key must have the same `organization_id`.

For example:

```sql
ALTER TABLE ONLY issue_links
    ADD CONSTRAINT org_fk_c900194ff2 FOREIGN KEY (organization_id, source_id) REFERENCES issues(organization_id, id) ON UPDATE CASCADE ON DELETE CASCADE;

ALTER TABLE ONLY issue_links
    ADD CONSTRAINT org_fk_e71bb44f1f FOREIGN KEY (organization_id, target_id) REFERENCES issues(organization_id, id) ON UPDATE CASCADE ON DELETE CASCADE;
```

This would ensure that issue links only exist between issues belonging to the same organization.

After deeper investigation we realised we don't need full foreign keys, because they come with 2 additional costs:

  1. They require a unique index on the referenced columns. It's expensive to build hundreds of new indexes, and all of them would be unique anyway, since each is a composite that includes an already-unique key, so building them is just wasteful.
  2. We don't need to validate all existing rows. All existing rows will have `organization_id = 1`, so it's wasteful to have Postgres validate hundreds of foreign keys that are definitely valid.

So we realised that we can implement the equivalent functionality with a Postgres trigger (foreign keys are implemented behind the scenes as a `CONSTRAINT TRIGGER` anyway).
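As an illustration, a trigger-based equivalent of the composite FK above could look roughly like the following. This is a hypothetical sketch, not code from this MR; the function and trigger names (`check_org_match`, `org_check_issue_links`) are made up:

```sql
-- Hypothetical sketch: enforce that issue_links.organization_id matches the
-- organization_id of the referenced issue, mimicking a composite FK without
-- requiring a new unique index or validation of existing rows.
CREATE OR REPLACE FUNCTION check_org_match() RETURNS trigger AS $$
BEGIN
  IF EXISTS (
    SELECT 1 FROM issues
    WHERE id = NEW.source_id
      AND organization_id IS DISTINCT FROM NEW.organization_id
  ) THEN
    RAISE EXCEPTION 'organization_id mismatch between issue_links and issues';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Constraint triggers must be AFTER ... FOR EACH ROW, like the ones Postgres
-- itself uses to implement foreign keys.
CREATE CONSTRAINT TRIGGER org_check_issue_links
  AFTER INSERT OR UPDATE ON issue_links
  DEFERRABLE INITIALLY IMMEDIATE
  FOR EACH ROW EXECUTE FUNCTION check_org_match();
```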

TODO

  1. Add organization_id column to every table
  2. Get trigger approach working
  3. Implement a service that can, given a single organization_id, find all violations of cross-organization data. This could use the foreign keys we used to create the triggers, but it could additionally use loose foreign keys.
    1. Run a periodic worker that iterates over all organization_id values (except for 1, because violations involving it will be caught via the other ids) and logs violations
    2. This service's data could also be fetched and displayed on some organization page so that organizations can tell which of their data may not be working correctly
    3. Could this service also be repurposed for "proposed" organization moves? Could be something like using the group transfer service and then detecting all the violations. Could we do it in a transaction or something?
  4. Test out implementing the parent side of the foreign key constraint. This would fail to UPDATE the organization_id when there are existing references to the current primary key. We probably don't need "ON DELETE CASCADE" because that is already covered by the original non-composite FK. We just need to block updating the organization_id unless the references are already updated. How do we solve the chicken-and-egg problem in a single transaction?
  5. Is it possible that this kind of tooling is only necessary for foreign keys that could theoretically span multiple organizations? If specific foreign keys were known to imply data belonging to the same namespace/project, then we could also be confident that both sides must be in the same organization. So maybe there is just a class of foreign keys, like issue_issue_links.source_id/target_id and merge_requests.source_project_id/target_project_id, that we need to be concerned with. Maybe the generalised solution is that only tables containing multiple foreign keys need to be considered: anything with a single foreign key is hierarchical, must "belong to" the parent record, and presumably couldn't logically have a parent in another organization.
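The per-foreign-key violation check in item 3 could be a simple anti-join, generated once per (composite or loose) foreign key. A hypothetical sketch for `issue_links.source_id`, with `:target_organization_id` as a placeholder parameter:

```sql
-- Hypothetical sketch: find issue_links rows whose source issue lives in a
-- different organization than the link itself.
SELECT issue_links.id,
       issue_links.organization_id,
       issues.organization_id AS parent_organization_id
FROM issue_links
JOIN issues ON issues.id = issue_links.source_id
WHERE issue_links.organization_id <> issues.organization_id
  AND issue_links.organization_id = :target_organization_id;
```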

Statistics

| Storage | Added | How it was calculated? |
|---|---|---|
| All new columns | 313 GiB | 8 bytes × the summed `reltuples` of all tables, since the new column adds 8 bytes for every tuple in the DB |
| All new indexes | 1.8 TiB | 2 × 3 × the above value. Each new index stores 2 × 8-byte columns, and the factor of 3 is just an observation that a few of our indexes are 2-3 times the size of the columns they store. Index size can come from bloat, the actual values, and the extra tuple data needed to maintain the index |
| Total added | 2.1 TiB | = 11% |
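The arithmetic behind these estimates can be sanity-checked directly; the 313 GiB column figure implies roughly 42 billion tuples across all tables:

```python
# Sanity-check the storage estimates in the table above.
GIB = 2**30

columns_gib = 313                  # 8 bytes * summed reltuples of all tables
tuples = columns_gib * GIB / 8     # implied total tuple count in the DB
indexes_gib = 2 * 3 * columns_gib  # 2 columns * observed 3x index overhead
total_gib = columns_gib + indexes_gib
total_tib = total_gib / 1024

print(round(tuples / 1e9))   # ~42 billion tuples
print(indexes_gib)           # 1878 GiB, i.e. ~1.8 TiB
print(round(total_tib, 1))   # ~2.1 TiB total
```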

Challenges

  1. How to convert the FK violations into standard user facing errors
  2. How to ensure the correct organization_id is set when creating every record. We need some standard way to extract it from a parent, but this may be difficult with code like FooBar.create!(project_id: project.id, some_column: some_value). We could consider adding it to SafeRequestStore and setting it on all models in a hook
    1. If we have a sticky Postgres connection we could set a value in the session. Then a trigger could rewrite the query or set the value on the tuple
    2. Or we could amend all INSERT queries in a query analyzer with organization_id = ...
  3. How to handle loose foreign key references
    1. This MR shows we have ~800 foreign keys that can enforce these constraints, while there are only 90 loose foreign keys, so maybe normal foreign keys are good enough. But the question remains whether we might want to migrate more foreign keys to LFKs in the future, since we can't validate data integrity with LFKs.
    2. Async data validation doesn't seem very useful. We can't present the error when the violation happens and have no real way of dealing with the violating data
      1. LFKs could be implemented as blocking in development to catch bugs. In production it could be a periodic job that allows us to catch data issues retrospectively and potentially catch application bugs
  4. Is the volume of data even reasonable?
  5. Is it possible to actually create migrations for ~600 new columns and indexes and ~800 foreign keys?
    1. How long would it take to run this migration using async index creation?
      1. Maybe we don't need the index at all since the lookup by id would be an index scan. Downside is that it's not an "index only scan" because we need to read the tuple to get the organization_id => more IO on create/update/delete
    2. How long would it take to run this migration if there was downtime?
      1. Is it possible to trick Postgres into just believing a foreign key is valid?
        1. The underlying foreign key implementation is just constraint triggers, so we could use a trigger instead and skip validating everything.
        2. Or we could add a NOT VALID foreign key and just leave it invalid forever
      2. How long is index building?
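Challenge 2.1 above (a session value consumed by a trigger on a sticky connection) could look roughly like this. This is a hypothetical sketch; `app.organization_id` is a made-up custom setting name, and `SET LOCAL` scopes it to the current transaction:

```sql
-- Application side, inside a transaction on a sticky connection:
SET LOCAL app.organization_id = '42';

-- Hypothetical trigger that fills in organization_id from the session value
-- when the application did not set it on the inserted row.
CREATE OR REPLACE FUNCTION set_org_from_session() RETURNS trigger AS $$
BEGIN
  IF NEW.organization_id IS NULL THEN
    NEW.organization_id := current_setting('app.organization_id')::bigint;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER issue_links_set_org
  BEFORE INSERT ON issue_links
  FOR EACH ROW EXECUTE FUNCTION set_org_from_session();
```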

Possible alternative: all read queries have a `WHERE organization_id` clause

  1. Downside is all instance wide things like periodic sidekiq workers will have to change
  2. We also might lose all index only scans because none of the indexes would have organization_id in them
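To illustrate the index-only-scan concern with a hypothetical example: with an index on `issues(project_id, id)`, the first query below can be an index-only scan, but adding the organization predicate forces heap fetches unless `organization_id` is added to the index:

```sql
-- Can be an index-only scan given an index on issues(project_id, id):
SELECT id FROM issues WHERE project_id = 5;

-- With mandatory organization scoping, Postgres must visit the heap (or a
-- wider index) to evaluate the extra predicate:
SELECT id FROM issues WHERE project_id = 5 AND organization_id = 42;
```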

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

| Before | After |
|---|---|

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Dylan Griffith
