Postgres notes
March 16, 2024
Indexing tips
Postgres has a number of index types, including B-tree, hash, GiST, SP-GiST, GIN, and BRIN. The most common index type is the B-tree index, which is the default index type in Postgres.
Here's a quick overview of a few different index types:
Type | Description | Use cases | Caveats |
---|---|---|---|
B-tree | The default index type for most Postgres data types. Supports equality and range queries. | Best for general use. | Not ideal for full-text search or multidimensional data. |
Hash | Supports only equality (`=`) comparisons. | Can be faster than B-trees for exact-match lookups. | No support for sorting, ranges, or unique constraints. |
GiST | A flexible framework for building custom search trees. Supports multi-dimensional types and full-text search. | Often outperforms B-trees for geospatial data, range types, and full-text search. | Index builds can be slow. |
GIN | For indexing composite values, e.g. arrays and full-text search vectors. | Go-to for array columns like TEXT[] and nested data like JSONB. | Can slow inserts and updates significantly. |
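As an illustration, a GIN index can serve JSONB containment queries; the table and column names below are hypothetical:

```sql
-- Hypothetical table: index a JSONB column for @> containment queries
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- The index can then serve queries like:
SELECT * FROM events WHERE payload @> '{"type": "signup"}';
```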
Postgres also supports partial indexes, which cover only the rows that satisfy a predicate. This keeps the index small and speeds up queries that touch only those rows. A classic use case is indexing non-null values.
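For example, a partial index on non-null values looks like this (table and column names are illustrative):

```sql
-- Index only the rows where deleted_at is set; if most rows are NULL,
-- the index stays small and cheap to maintain
CREATE INDEX idx_orders_deleted_at ON orders (deleted_at)
WHERE deleted_at IS NOT NULL;
```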
Text search
Postgres full-text search can be extremely slow on large tables. If you only need to match short strings, the pg_trgm extension can be used to create trigram indexes, which are much faster than full-text indexes for short strings.
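A trigram index is typically built with the GIN operator class that pg_trgm provides; the table and column here are illustrative:

```sql
-- Assumes the pg_trgm extension is installed on the server
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_name_trgm ON users USING GIN (name gin_trgm_ops);

-- Speeds up pattern matches such as:
SELECT * FROM users WHERE name ILIKE '%smith%';
```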
Foreign keys
When you create a foreign key, Postgres indexes the referenced side (the primary key or unique constraint the foreign key points to), but it does not create an index on the referencing column. If you use the foreign key for lookups, joins, or cascading deletes, create that index yourself.
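With a hypothetical schema where orders.customer_id references customers(id), that means:

```sql
-- customers.id is already indexed via its primary key, but
-- orders.customer_id is not -- add an index for joins and cascade deletes
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```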
Ingesting data
The fastest way to ingest data into a table is the COPY command; it's much faster than a series of INSERT statements. Psycopg 2 exposes this via the copy_from method, which copies data from a file or file-like object; Psycopg 3 replaces it with the cursor.copy() context manager, shown in the example below.
It's possible to run multithreaded inserts by using the COPY
command with a temporary table and then inserting the data into the main table. At Village, we're able to achieve throughputs of >10k rows per second using this method on large VMs. You might run into deadlocks, which can be handled with retries.
```python
import psycopg

# DB_CREDENTIALS is a placeholder for your connection string
with psycopg.connect(DB_CREDENTIALS) as conn:
    with conn.cursor() as cur:
        # Create an empty temp table with the same schema as the target
        cur.execute(
            """
            CREATE TEMP TABLE tmp_table AS
            SELECT * FROM my_table LIMIT 0
            """
        )
        with cur.copy(
            """
            COPY tmp_table (column_a, column_b, ...) FROM STDIN
            """
        ) as copy_cur:
            # where row is a sequence of values
            for row in iterator:
                copy_cur.write_row(row)
        # Move the staged rows into the main table
        cur.execute(
            """
            INSERT INTO my_table
            SELECT * FROM tmp_table
            """
        )
        cur.execute("DROP TABLE IF EXISTS tmp_table")
```
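To handle the deadlocks mentioned above, each batch can be wrapped in a small retry helper. This is a generic sketch: retry_on and its defaults are hypothetical, and in practice you would pass psycopg.errors.DeadlockDetected as the exception type.

```python
import time

def retry_on(exceptions, attempts=3, delay=0.1):
    """Return a runner that retries fn on the given exception types,
    with a linearly growing pause between attempts."""
    def run(fn, *args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return fn(*args, **kwargs)
            except exceptions:
                if attempt == attempts:
                    raise  # out of retries, propagate the error
                time.sleep(delay * attempt)
    return run
```

Usage would look like `retry_on((psycopg.errors.DeadlockDetected,))(insert_batch)`, where insert_batch runs one COPY-and-INSERT cycle in its own transaction.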
Deleting data
If you want to delete a specific set of rows, use the DELETE statement. To delete all rows in a table, use TRUNCATE instead; it skips row-by-row processing and is much faster.
Deleting rows can be slow when foreign keys reference the table, because each deleted row triggers a constraint check. If you're sure that your rows are not referenced, disable triggers before deleting (note that disabling the internally generated constraint triggers requires superuser privileges):

```sql
ALTER TABLE my_table DISABLE TRIGGER ALL;
DELETE FROM my_table WHERE ...;
ALTER TABLE my_table ENABLE TRIGGER ALL;
```