Commit adf42647 authored by Emilie Schario's avatar Emilie Schario Committed by Taylor A Murphy, PhD

Resolve "Updates to the Data Page"

parent 9b5819c4
---
layout: markdown_page
title: "SQL Style Guide"
description: "GitLab Data Team Handbook"
---
## On this page
{:.no_toc}
- TOC
{:toc .toc-list-icons}
----
This style guide is for SQL and dbt.
## SQL Style Guide
**Since we don't have a linter, it is *our collective responsibility* to enforce this Style Guide.)) Some of the below comments apply to dbt which is the bulk of the SQL we write.
#### dbt
At Gitlab, we use dbt (data build tool) for data transformation. What follows are the conventions we use internally. *Inspired by [Fishtown Analytics](https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md)*
- Watch [this video (GitLab internal)](https://drive.google.com/open?id=1ZuieqqejDd2HkvhEZeOPd6f2Vd5JWyUn) on how to use dbt
- Use dbt for as much modeling as possible - see this [blog post](https://blog.fishtownanalytics.com/how-do-you-decide-what-to-model-in-dbt-vs-lookml-dca4c79e2304) from Fishtown Analytics
##### Model Configuration
- Model-specific attributes (like materializations) should be specified in the model.
- If a particular configuration applies to all models in a directory, it should be specified in the project.
- In-model configurations should be specified like this:
```
{{
config(
materialized = ’table’
)
}}
```
##### Base Models
- Only base models should select from source tables.
- Base models should not select from the `raw` database directly. Instead, they should reference `{{ var("database") }}`.
- Only a single base model should be able to select from a given source table.
- Base models should be placed in a `base/` directory.
- Base models should perform all necessary data type casting, using the `::` sytax when casting (You accomplish the same thing with fewer characters, and it presents as cleaner).
- Base models should perform all field naming to force field names to conform to standard field naming conventions.
- Source fields that use reserved words must be renamed in base models.
##### Field Naming Conventions
- An `id` or `name` value should always be prefixed by what it is identifying or naming, e.g. `account_id` and `account_name`, when represenating the primary value. Only in the `JOIN`s is it essential to be more explicit.
- When joining to any data from a different source, a field should be prefixed with the data source, e.g. `sfdc_account_id`, to avoid ambiguity. An example: In the `sfdc_account` model, you may have `account_id` and `account_
- All field names should be [snake-cased](https://en.wikipedia.org/wiki/Snake_case).
- Boolean field names should start with `has_`, `is_`, or `does_`.
##### CTEs (Common Table Expressions)
- All `{{ ref('...') }}` statements should be placed in CTEs at the top of the file. (Think of these as import statements.)
- This does not imply all CTE's that have a `{{ ref('...') }}` should be `SELECT *` only. It is ok to do additional manipulations in a CTE with a `ref` if it makes sense for the model.
- Where performance permits, CTEs should perform a single, logical unit of work.
- CTE names should be as verbose as needed to convey what they do.
- CTEs with confusing or noteable logic should be commented in file and documented in dbt docs.
- CTEs that are duplicated across models should be pulled out into their own models.
- CTEs should be formatted as follows:
``` sql
WITH events AS ( -- think of these select statements as your import statements.
...
), filtered_events AS ( -- CTE comments go here
...
)
SELECT * -- you should always aim to "select * from final" for your last model
FROM filtered_events
```
##### Style Guide
- Indents should be four spaces (except for predicates, which should line up with the `WHERE` keyword).
- Lines of SQL should be no longer than 80 characters.
- Field names should all be lowercased.
- Function names should all be capitalized.
- The `AS` keyword should be used when projecting a field or table name.
- Fields should be stated before aggregates / window functions.
- Ordering and grouping by a number (eg. group by 1, 2) is preferred.
- Prefer `WHERE` to `HAVING` when either would suffice.
- Be explicit when joining, e.g. use `LEFT JOIN` instead of `JOIN`. (Default joins are `INNER`)
- Follow the following convention for `JOIN`s
```
FROM source
LEFT JOIN other_source ON source.id = other_source.id
```
not
```
FROM source
LEFT JOIN other_source ON other_source.id = source.id
```
- **Never** use `USING` in joins. It will produce inaccurate results.
- Prefer `UNION ALL` to `UNION`.
- Prefer `NULLIF` TO `NVL`.
- Prefer `IFF` to a single line `CASE WHEN` statement.
- Consider performance. Understand the difference between `LIKE` vs `ILIKE`, `IS` vs `=`, and `NOT` vs `!` vs `<>`. Use appropriately.
- Familiarize yourself with [the DRY Principal](https://docs.getdbt.com/docs/design-patterns). Leverage jinja, macros, and CTEs. If you type the same line twice, it needs to be maintained in two places.
- *DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP. BRAIN TIME IS EXPENSIVE.*
##### Example Code
```sql
with my_data as (
SELECT * FROM {{ ref('my_data') }}
WHERE filter = 'my_filter'
), some_cte as (
SELECT * FROM {{ ref('some_cte') }}
)
SELECT [distinct]
field_1 AS detailed_field_1,
field_2 AS detailed_field_2,
detailed_field_3,
CASE
WHEN cancellation_date IS NULL AND expiration_date IS NOT NULL
THEN expiration_date
WHEN cancellation_date IS NULL
THEN start_date+7
ELSE cancellation_date
END AS cancellation_date
SUM(field_4) AS field_4_sum,
MAX(field_5) AS field_5_max
FROM my_data
LEFT JOIN some_cte
ON my_data.id = some_cte.id
WHERE field_1 = ‘abc’
AND (field_2 = ‘def’ OR field_2 = ‘ghi’)
GROUP BY 1, 2, 3, 4
HAVING count(*) > 1
ORDER BY 4
DESC
```
##### Testing
- Every model should be tested in a `schema.yml` file
- At minimum, unique, not nullable fields, and foreign key constraints should be tested (if applicable)
- The output of dbt test should be pasted into MRs
- Any failing tests should be fixed or explained prior to requesting a review
##### Query Naming Convention
Follow the naming convention of `analysis type, data source (in alpha order, if multiple), thing, aggregation` (e.g. `retention_sfdc_zuora_customer_count.sql`)
##### Commenting
* When making single line comments in a model use the `--` syntax.
* When making multi-line comments in a model us the `/* */` syntax.
* dbt model comments should live in the model documentation.
* Calculations made in SQL should have a brief description of what's going on and a link to the handbook defining the metric (and how it's calculated)
* Instead of leaving `TODO` comments, create new issues for improvement.
\ No newline at end of file
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment