Verified commit c1435160 authored by MrMan

Finish first draft of tokyo tech meetup talk

parent 794c5640
......@@ -7,66 +7,270 @@
- What is Postgres?
- Why Postgres?
- Why not Postgres?
- Relational Data
- SQL vs NOSQL
- Relational Data
- Key/Value Data
- Document Storage
- Graph
- Geographical Information Systems (GIS)
- K/V store
- Graph Data
- Geospatial Data
- Log Storage
- Message Queue
- Time Series Data
# Disclaimer #
I wasn't personally writing software in 1990/1997, and I'm not a computer science/software engineering historian. We're going to discuss things in very broad strokes, and there will be mistakes.
If you notice some inaccuracies, write them down/keep them in mind until the end and my email will be available so you can enlighten me (and I'll update the slides).
# What is Postgres? #
Postgres is the most advanced open source database that's ever existed. It's developed in the open, and maintained by the community. There are a few large contributors in the space:
**Postgres is the most advanced open source database that's ever existed**. It's developed in the open, driven and maintained by the community.
There are a few large contributors in the space:
- 2nd Quadrant
- Citus Data (acquired by MS in January)
A few features set postgres apart:
- MVCC
- Extensibility
- Process-per-connection
- Multi Version Concurrency Control (MVCC)
- Plugin system (indices, functionality)
- Process-per-connection model
- Elephant mascot
# Why Postgres? #
There are lots of reasons to use Postgres:
## There are lots of reasons to use Postgres: ##
**Reliability** - Postgres is rock solid
 
**Performance** - Postgres is generally "fast enough", and can even be *really fast*
 
**Cloud vendor support** - AWS RDS, Azure Database, GCP Cloud SQL
 
**Open Source** - There are a lot of resources and other people pushing the boundaries
- No vendor, hence no vendor lock in
 
Reddit got to a billion pageviews a month on a master-slave Postgres setup (scaling up rather than out, mostly)[^1]
[^1]: http://highscalability.com/blog/2013/8/26/reddit-lessons-learned-from-mistakes-made-scaling-to-1-billi.html
# Why Not Postgres? #
There are also reasons to not use Postgres:
## There are many reasons Postgres might *not* be a fit for your next project, here are a few: ##
 
No vendor, so no vendor to pay to help[^2]
 
No official scale out story[^3]
 
Structured Query Language (SQL) can be difficult
 
Transactional guarantees can eat into performance
 
[^2]: Lots of consultancies though (like 2nd Quadrant) which can help out
- No vendor, so no vendor to pay to help (you do have 2nd Quadrant and other Postgres consultancies though)
[^3]: PostgresXL does exist
# NOSQL vs SQL #
What people *normally* mean when they say "NOSQL" is a rejection of the relational structure and the transactional guarantees normally included with a Relational Database Management System (RDBMS)
**Relational Structure** as in Relational Algebra (sets, projection, unions, intersections, joins)
**Transactional Guarantees** as in ACID (Atomicity, Consistency, Isolation, Durability)
Examples of NOSQL databases:
- RethinkDB / MongoDB (documents, usually JSON)
- Redis (key/value)
- Neo4J (graphs)
- Postgres (more on this later)
# Relational Data #
99% of the data your organization needs to deal with is going to be relational -- most data isn't very useful without context.
 
```sql
SELECT company_name,amount,payment_status
FROM customers
JOIN invoices ON customers.id = invoices.customer_id
WHERE payment_status = 'not-paid';
```
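A minimal sketch of a schema the query above could run against. The table and column names come from the query; the column types, the constraints, and the `invoice_payment_status` enum (including its `'paid'` label) are assumptions for illustration:

```sql
-- Hypothetical enum: only these labels are accepted in the column below
CREATE TYPE invoice_payment_status AS ENUM ('not-paid', 'paid');

CREATE TABLE customers (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  company_name text NOT NULL
);

CREATE TABLE invoices (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  -- The foreign key rejects invoices for customers that don't exist
  customer_id bigint NOT NULL REFERENCES customers (id),
  -- The CHECK constraint rejects negative amounts
  amount numeric(12, 2) NOT NULL CHECK (amount >= 0),
  payment_status invoice_payment_status NOT NULL DEFAULT 'not-paid'
);
```

With constraints like these, invalid rows are rejected by the database itself rather than by application code.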
# Hot takes and tips #
 
A bunch of things I think that are probably right:
Want to go deeper? Postgres has `ENUM`s and custom `TYPE`s, and most of the tools you'd need to make sure the only data you store is *valid*, *correct* data.
- Use Gitlab
- Typescript (if you're using NodeJS)
- Try Lisp at least once
- Try Haskell at least once
- Try Rust more than once
- Never price by project*
- Don't build & deploy VMs on a greenfield project in 2019**
# Key/Value Data #
\* Unless you've built the thing already and you are literally going to reskin it and the client has absolutely *no* new feature requests.
\** Unless your VM in production is basically Container Linux
Postgres makes a surprisingly good simple key value store. You're not going to beat Redis, but it's *probably* going to be fast enough!
```sql
CREATE UNLOGGED TABLE kv (
id integer GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
key text NOT NULL,
value jsonb,
created_at timestamptz NOT NULL DEFAULT NOW()
);
```
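A rough sketch of Redis-style `SET`/`GET`/`DEL` against the `kv` table. The unique index `kv_key_idx` and the `'greeting'` key are assumptions added for illustration; the upsert needs *some* unique constraint on `key` to work:

```sql
-- Assumption: keys must be unique, otherwise ON CONFLICT (key) has nothing to match
CREATE UNIQUE INDEX IF NOT EXISTS kv_key_idx ON kv (key);

-- SET: insert the key, or overwrite the value if the key already exists
INSERT INTO kv (key, value)
VALUES ('greeting', '"hello"'::jsonb)
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;

-- GET: look the value up by key
SELECT value FROM kv WHERE key = 'greeting';

-- DEL: remove the key
DELETE FROM kv WHERE key = 'greeting';
```

Because the table is `UNLOGGED`, writes skip the write-ahead log, but the contents are truncated after a crash, much like a cache.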
 
Want to go deeper? Pluggable storage engines (the table access method interface)[^4] have landed, so you could *actually* put Redis in your Postgres
[^4]: https://www.postgresql.org/docs/devel/tableam.html
# Document Storage #
\small
```sql
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE docs (
id uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
data jsonb,
updated_at timestamptz NOT NULL DEFAULT NOW(),
created_at timestamptz NOT NULL DEFAULT NOW()
);
-- GIN indexes massively speed up searches like:
-- SELECT * FROM docs WHERE data @> '{"some_key": "some_value"}';
CREATE INDEX docs_data_idx ON docs USING GIN (data);
```
\normalsize
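A short sketch of working with the `docs` table. The document contents (`some_key`, `tags`) are made up for illustration:

```sql
-- Insert a document (hypothetical contents)
INSERT INTO docs (data)
VALUES ('{"some_key": "some_value", "tags": ["a", "b"]}');

-- @> (containment) can use a GIN index on data
SELECT id, data
FROM docs
WHERE data @> '{"some_key": "some_value"}';

-- ->> extracts a field as text
SELECT data ->> 'some_key' AS some_key
FROM docs;

-- ? tests whether a top-level key exists
SELECT count(*)
FROM docs
WHERE data ? 'tags';
```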
Want to go deeper? Look into Postgres's full range of JSON operators[^5]. SQL/JSON (JSONPath for SQL) is coming in 12[^6].
[^5]: https://www.postgresql.org/docs/current/functions-json.html
[^6]: https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH
# Graph Data #
Recursive Common Table Expressions (CTEs) are here to help:
\tiny
```sql
WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
SELECT g.id, g.link, g.data, 1,
ARRAY[g.id],
false
FROM graph g
UNION ALL
SELECT g.id, g.link, g.data, sg.depth + 1,
path || g.id,
g.id = ANY(path)
FROM graph g, search_graph sg
WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;
```
\normalsize
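For the query above to run, it needs a `graph` table. A minimal sketch, assuming each row is a node with a single outgoing `link` (the column types and sample rows are assumptions):

```sql
-- Hypothetical table: each row is a node with one outgoing edge (link)
CREATE TABLE graph (
  id integer PRIMARY KEY,
  link integer,
  data text
);

-- Three nodes forming a cycle: 1 -> 2 -> 3 -> 1
INSERT INTO graph (id, link, data) VALUES
  (1, 2, 'a'),
  (2, 3, 'b'),
  (3, 1, 'c');
```

On cyclic data like this, the `path` array and the `cycle` flag are what keep the recursion from running forever.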
OK, *maybe* don't do this, but maybe *do* use AgensGraph[^7], a graph database built on Postgres.
Expect more graph solutions to be built on Postgres once pluggable storage engines pick up steam.
[^7]: https://github.com/bitnine-oss/agensgraph
# Geospatial Data #
Geographic Information System (GIS) data is the bread and butter of PostGIS[^8]:
 
```sql
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham';
```
 
Want to go deeper? Actually read and understand the feature set and documentation for PostGIS -- there's a lot to master
[^8]: https://postgis.net
# Log Storage #
Declarative partitioning means Postgres can take your gobs of structured logs (they *are* structured right?)
\small
```sql
-- The partitioned table
CREATE TABLE logs (
data jsonb NOT NULL,
logged_at timestamptz NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (logged_at);
-- A partition for the month of June
SET TIME ZONE 'Asia/Tokyo';
CREATE TABLE logs_2019_06
PARTITION OF logs
FOR VALUES FROM ('2019-06-01') TO ('2019-07-01');
```
\normalsize
Some assembly/maintenance *is* required, but faster queries on smaller data sets (via partition pruning / constraint exclusion) have never been cheaper.
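A quick way to check that a query only touches the partitions it needs is `EXPLAIN`. This is a sketch against the `logs` table; the date range is arbitrary:

```sql
-- Only one day in June is requested, so the planner should only
-- need to scan the logs_2019_06 partition
EXPLAIN
SELECT data
FROM logs
WHERE logged_at >= '2019-06-10'
  AND logged_at < '2019-06-11';

-- Retention: an old month can be detached and dropped as a whole
-- partition instead of running a huge DELETE (left commented out
-- because it destroys data)
-- ALTER TABLE logs DETACH PARTITION logs_2019_06;
-- DROP TABLE logs_2019_06;
```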
# Message Queues #
If all your application instances are connected to the database, why not have them communicate?
\small
```sql
-- Listen on a channel named "virtual" (channels are created implicitly)
LISTEN virtual;
-- Notify with no payload
NOTIFY virtual;
-- Notify with a payload
NOTIFY virtual, 'This is the payload';
```
 
\normalsize
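A hedged sketch of turning this into a tiny work queue: a trigger publishes every inserted row on the `virtual` channel. The `jobs` table, the `notify_new_job` function, and the trigger name are hypothetical; `pg_notify` is the function form of `NOTIFY`:

```sql
-- Hypothetical table of work items
CREATE TABLE jobs (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  payload jsonb NOT NULL
);

-- Trigger function: publish each new row as JSON on the "virtual" channel
CREATE FUNCTION notify_new_job() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('virtual', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION needs Postgres 11+; older versions use EXECUTE PROCEDURE
CREATE TRIGGER jobs_notify
AFTER INSERT ON jobs
FOR EACH ROW EXECUTE FUNCTION notify_new_job();
```

Notifications are only delivered when the inserting transaction commits, and payloads must stay under roughly 8000 bytes in the default configuration, so for bigger messages send the row `id` and have listeners fetch the row.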
Maybe you don't need a NATS/RabbitMQ/NSQ/Kafka cluster *just* yet.
Want to go deeper? Try combining this feature with some `UNLOGGED` and `TEMPORARY` tables and build some data pipelines.
# Time Series Data #
You could build your own solution using `PARTITION`s, `UNLOGGED` tables, and some `TRIGGER`s, but don't bother. Just use TimescaleDB[^9].
![TimescaleDB insert performance on 1B inserts](timescale-vs-postgres-insert-1B.jpg){ height=50% }
# Time Series Data (continued) #
TimescaleDB also has excellent, well-reasoned technical deep dives on where and why they can beat databases like MongoDB[^10] and even purpose-built DBs like InfluxDB[^11].
![TimescaleDB vs influx](timescale-vs-influx.png){ height=60% }
[^9]: https://docs.timescale.com/v1.3/introduction
[^10]: https://blog.timescale.com/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/
[^11]: https://blog.timescale.com/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/
# The End #
......@@ -88,3 +292,21 @@ I run a couple very small consultancies to support businesses in Japan and the U
Need help getting your organization to the present/future? I can help with that.
Need help getting your organization to the past? I probably can't help with that.
# Bloopers: Hot takes and tips #
A bunch of things I think that are probably right:
- Use GitLab
- Don't write ECMAScript (AKA JavaScript) without TypeScript
- Try Lisp & Haskell (separately?) at least once
- Try Rust more than once
- Never price by project*
- Don't build & deploy VMs on a greenfield project in 2019**
 
\scriptsize
\* Unless you've built the thing already and you are literally going to reskin it and the client has absolutely *no* new feature requests.
\** Unless your VM in production is basically Container Linux
.PHONY: all 2019-04-mercari-dev-meetup
.PHONY: all \
2019-04-mercari-dev-meetup 2019-04-mercari-dev-meetup-pdf 2019-04-mercari-dev-meetup-condensed-pdf
all: 2019-04-mercari-dev-meetup
all: 2019-04-mercari-dev-meetup 2019-06-tokyo-tech-meetup
HTML_FORMAT ?= slidy
PDF_FORMAT ?= beamer
......@@ -30,3 +31,18 @@ ENTR ?= entr
--self-contained \
-s 2019/04/mercari-backend-meetup-condensed.md \
-o dist/2019/04/mercari-backend-meetup-condensed.pdf
2019-06-tokyo-tech-meetup: 2019-06-tokyo-tech-meetup-pdf
2019-06-tokyo-tech-meetup-watch:
find 2019/06/* | $(ENTR) -rc make 2019-06-tokyo-tech-meetup
2019-06-tokyo-tech-meetup-pdf:
@mkdir -p dist/2019/06/tokyo-tech-meetup
pandoc \
-t $(PDF_FORMAT)+footnotes \
--resource-path $(RESOURCE_PATH) \
--data-dir $(DATA_DIR) \
--self-contained \
-s 2019/06/tokyo-tech-meetup/just-use-postgres.md \
-o dist/2019/06/tokyo-tech-meetup/just-use-postgres.pdf