Postgres notes

March 16, 2024

Indexing tips

Postgres offers several index types, including B-tree, hash, GiST, SP-GiST, GIN, and BRIN. B-tree is the default and the most common.

Here's a quick overview of a few different index types:

B-tree
  Description: The default index type for most Postgres data types. Supports equality and range queries.
  Use cases: Best for general use.
  Caveats: Not ideal for full-text search or multidimensional data.

Hash
  Description: Supports only equality ("=") operations.
  Use cases: Faster than B-trees for exact-match queries.
  Caveats: No support for sorting or range scans.

GiST
  Description: A flexible index type for building various search trees. Supports multidimensional types and full-text search.
  Use cases: Outperforms B-trees for geospatial data, range data, and full-text search.
  Caveats: Build time can be slow.

GIN
  Description: For indexing composite values, e.g., arrays and full-text search vectors.
  Use cases: The go-to for array data like TEXT[] and nested data like JSONB.
  Caveats: Can slow inserts and updates significantly.
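
To make this concrete, here's a minimal sketch of creating a few of these index types. The table and column names (events, created_at, user_id, payload) are hypothetical:

-- B-tree, the default
CREATE INDEX events_created_at_idx ON events (created_at);

-- Hash, for exact-match lookups only
CREATE INDEX events_user_id_hash_idx ON events USING hash (user_id);

-- GIN, for nested JSONB data
CREATE INDEX events_payload_gin_idx ON events USING gin (payload);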

Postgres also supports partial indexes, which index only a subset of a table's rows. A partial index stays smaller than a full index and can speed up queries that touch only that subset. A classic use case is indexing only the non-null values in a sparsely populated column.
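
A minimal sketch of a partial index, assuming a hypothetical users table with a mostly-NULL referral_code column:

-- Only rows with a referral code are indexed, keeping the index small;
-- queries that filter on referral_code (and exclude NULLs) can use it
CREATE INDEX users_referral_code_idx ON users (referral_code)
WHERE referral_code IS NOT NULL;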

Text search

Postgres full-text search can be extremely slow. If you only need to match short strings, the pg_trgm extension lets you build trigram indexes, which are much faster than full-text indexes for that workload.
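
A minimal sketch, assuming a hypothetical products table with a name column:

-- pg_trgm ships with Postgres as a contrib extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index accelerates LIKE/ILIKE and similarity matches
CREATE INDEX products_name_trgm_idx ON products USING gin (name gin_trgm_ops);

-- This query can now use the index despite the leading wildcard
SELECT * FROM products WHERE name ILIKE '%widget%';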

Foreign keys

Note that Postgres does not automatically create an index on the referencing column when you create a foreign key; only the referenced key, which must be a primary key or unique, is indexed. If you look up rows by the foreign key column, or delete or update rows in the referenced table (which forces Postgres to check for referencing rows), create an index on the column yourself.
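
A minimal sketch, assuming hypothetical orders and customers tables:

-- The constraint alone does not index orders.customer_id
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);

-- Add the index explicitly if you join or filter on the column
CREATE INDEX orders_customer_id_idx ON orders (customer_id);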

Ingesting data

The fastest way to ingest data into a table is the COPY command; it's much faster than a series of INSERT statements. psycopg2 exposes this as copy_from, which copies from a file or file-like object; psycopg 3 replaces it with the cursor.copy() context manager used in the example below.
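
At the SQL level, a minimal sketch; the file path is hypothetical, and the table matches the example below:

-- Server-side COPY; the file must be readable by the Postgres server process
-- (requires superuser or the pg_read_server_files role)
COPY my_table (column_a, column_b) FROM '/path/to/data.csv' WITH (FORMAT csv, HEADER true);

-- In psql, \copy does the same from a client-side file:
-- \copy my_table (column_a, column_b) FROM 'data.csv' WITH (FORMAT csv, HEADER true)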

It's possible to run multithreaded ingestion by having each worker COPY into its own temporary table (temp tables are session-local) and then INSERT from it into the main table. At Village, we're able to achieve throughputs of >10k rows per second using this method on large VMs. The final INSERT can deadlock under concurrency; handle that with retries.

import psycopg

# DB_CREDENTIALS stands in for a libpq connection string
with psycopg.connect(DB_CREDENTIALS) as conn:
    with conn.cursor() as cur:
        # Clone the target table's structure (no rows) into a session-local temp table
        cur.execute(
            """
            CREATE TEMP TABLE tmp_table AS
            SELECT * FROM my_table LIMIT 0
            """
        )

        # Stream rows into the temp table via COPY
        with cur.copy(
            """
            COPY tmp_table (column_a, column_b, ...) FROM STDIN
            """
        ) as copy_cur:
            # each row is a sequence of values matching the COPY column list
            for row in iterator:
                copy_cur.write_row(row)

        # Move the staged rows into the main table; under concurrency this
        # INSERT can deadlock, so wrap it in a retry if needed
        cur.execute(
            """
            INSERT INTO my_table
            SELECT * FROM tmp_table
            """
        )

        cur.execute("DROP TABLE IF EXISTS tmp_table")

Deleting data

Use DELETE to remove a specific set of rows. To remove all rows in a table, use TRUNCATE instead; it skips the row-by-row work and is much faster.
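
A minimal sketch; the predicate is hypothetical:

-- Remove a specific subset of rows
DELETE FROM my_table WHERE created_at < now() - interval '90 days';

-- Remove every row at once; much faster than an unqualified DELETE
TRUNCATE my_table;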

Deleting rows can be slow when foreign keys reference the table, because every deleted row triggers referential-integrity checks. If you're sure the rows are not referenced, you can disable triggers before deleting. Be aware that DISABLE TRIGGER ALL also turns off the internally generated constraint triggers that enforce foreign keys (which requires superuser privileges), so integrity is entirely on you:

ALTER TABLE my_table DISABLE TRIGGER ALL;

DELETE FROM my_table WHERE ...;

ALTER TABLE my_table ENABLE TRIGGER ALL;