External Index Precomputation
Unlike pure SQL, external index precomputation performs clustering externally before inserting centroids into a PostgreSQL table. While this process may be more complex, it significantly speeds up indexing for larger datasets (>5M). We showed some benchmarks in the blog post. It takes around 3 minutes to build an index for 1M vectors, 16x faster than standard indexing in pgvector.
To get started, you need to do a clustering of vectors using:
- faiss
- scikit-learn
- fastkmeans
- or any other clustering library
The centroids should be preset in a table of any name with 3 columns:
id(integer)
: id of each centroid, should be uniqueparent(integer, nullable)
: parent id of each centroid, could beNULL
for normal clusteringvector(vector)
: representation of each centroid,vector
type
And example could be like this:
sql
-- Create table of centroids
CREATE TABLE public.centroids (id integer NOT NULL UNIQUE, parent integer, vector vector(768));
-- Insert centroids into it
INSERT INTO public.centroids (id, parent, vector) VALUES (1, NULL, '{0.1, 0.2, 0.3, ..., 0.768}');
INSERT INTO public.centroids (id, parent, vector) VALUES (2, NULL, '{0.4, 0.5, 0.6, ..., 0.768}');
INSERT INTO public.centroids (id, parent, vector) VALUES (3, NULL, '{0.7, 0.8, 0.9, ..., 0.768}');
-- ...
-- Create index using the centroid table
CREATE INDEX ON gist_train USING vchordrq (embedding vector_l2_ops) WITH (options = $$
[build.external]
table = 'public.centroids'
$$);
To simplify the workflow, we provide end-to-end scripts for external index pre-computation, see Run External Index Precomputation Toolkit.