Show HN: DuckDB community extension for prefiltered HNSW using ACORN-1 (github.com/cigrainger)
88 points by cigrainger 1 day ago | 7 comments
Hey folks! As someone doing hybrid search daily and wishing I could have a pgvector-like experience but with actual prefiltered approximate nearest neighbours, I decided to just take a punt on implementing ACORN on a fork of the DuckDB VSS extension. I had to make some changes to (vendored) usearch that I'm thinking of submitting upstream. But this does the business. Approximate nearest neighbours with WHERE prefiltering.

Edit: Just to clarify, this has been accepted into the community extensions repo. So you can use it like:

```
INSTALL hnsw_acorn FROM community;
LOAD hnsw_acorn;
```


The prefiltering gap has been a real pain point. With standard HNSW and strict WHERE clauses you're basically doing full candidate scans before applying the filter — you get the latency of ANN but none of the selectivity benefit.

Curious about filter selectivity though: does ACORN-1 still help when the filter eliminates, say, 95% of the index? I've seen cases where tight category filters completely undermine graph traversal because the graph structure assumed a uniform distribution over the subspace.
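
To make the post-filter vs. prefilter distinction above concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: exact brute-force search stands in for the HNSW index, the `category` column and the 5% "rare" filter are made up, and none of this is ACORN-1 or the extension's actual code. It just shows why filtering after the top-k lookup can hand back fewer than k rows when the filter is selective.

```python
import random

def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter_knn(rows, query, k, keep):
    """Take the k nearest rows overall, then drop the ones failing the filter."""
    nearest = sorted(rows, key=lambda r: l2_sq(r["vec"], query))[:k]
    return [r for r in nearest if keep(r)]

def pre_filter_knn(rows, query, k, keep):
    """Apply the filter first, then take the k nearest among the survivors."""
    candidates = [r for r in rows if keep(r)]
    return sorted(candidates, key=lambda r: l2_sq(r["vec"], query))[:k]

# Hypothetical data: 1000 rows, 5% of them in the "rare" category.
random.seed(0)
rows = [
    {
        "id": i,
        "category": "rare" if i % 20 == 0 else "common",
        "vec": [random.random() for _ in range(8)],
    }
    for i in range(1000)
]
query = [0.5] * 8

def keep(row):
    # Stand-in for a selective WHERE clause.
    return row["category"] == "rare"

post = post_filter_knn(rows, query, 10, keep)
pre = pre_filter_knn(rows, query, 10, keep)
print("post-filter returned", len(post), "rows; prefilter returned", len(pre))
```

The prefiltered version always returns k rows as long as k rows pass the filter, while the post-filtered version only keeps whichever of the unfiltered top-k happen to match.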


As an aside, there's now Lance data format support in DuckDB through their extension. It has Lance's vector search support available among other things:

https://github.com/lance-format/lance-duckdb/tree/main?tab=r...

I just noticed this, and your post, and haven't yet checked either (sorry). However, I'll be doing some vector search benchmarking soon, with DuckDB's options alongside others, so your work caught my attention.


This is great for analytical workloads. I work with financial time series data (Japanese company filings) and have been using BigQuery with in-memory caching for the hot path. Curious whether DuckDB extensions like this could replace the BQ dependency for smaller datasets — the cold start + query cost model of serverless warehouses can be painful for API-serving use cases.

Nice! My most pressing request for VSS would be efficient binary vectors: is this on the table?

I haven't given binary vectors a lot of thought, but I'm exploring RaBitQ[1].

[1] https://arxiv.org/abs/2405.12497


Does your method work better than standard ANN when filters are very strict—and how does it affect speed vs accuracy?
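
For anyone wanting to measure the accuracy side of that question themselves, the usual metric is recall@k against an exact (brute-force) result for the same filtered query. A minimal sketch, assuming you already have the two ranked ID lists; the function name and the sample IDs are hypothetical:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k IDs that also appear in the approximate top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth.intersection(approx_ids[:k])) / len(truth)

# Made-up ID lists purely for illustration.
exact = [3, 7, 1, 9, 4]
approx = [3, 1, 8, 9, 2]
score = recall_at_k(approx, exact, 5)
print(score)
```

Timing the same queries alongside this gives the speed vs. accuracy trade-off at each filter selectivity.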

Please upstream it.


