Show HN: Postgres extension for BM25 relevance-ranked full-text search

<- Back

Show HN: Postgres extension for BM25 relevance-ranked full-text search

tjgreen

Comments (30)

simonw
This is really cool. I've built things on PostgreSQL ts_vector() FTS in the past which works well but doesn't have whole-index ranking algorithms so can't do BM25.It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?I found this example in the README quite confusing: SELECT * FROM documents WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0 ORDER BY content <@> 'search terms' LIMIT 10; That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance since the threshold you would want to pick varies for different datasets.
shreyssh
Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?
jascha_eng
FWIW TJ is not your average vibe coder imo: https://www.linkedin.com/in/todd-j-green/In september he burned through 3000$ in API credits though, but I think that's before we finally bought max plans for everyone that wanted it.
mattbessey
Please oh please let GCP add this to the supported managed Postgres extensions...
andai
Can you explain this in more detail? Is this for RAG, i.e. combining vector search with keyword search?My knowledge on that subject roughly begins and ends with this excellent article, so I'd love to hear how this relates to that.https://www.anthropic.com/engineering/contextual-retrievalEspecially since what Anthropic describes here is a bit of a rube Goldberg machine which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there's any "good enough" out of the box solutions for it.
hmokiguess
I hope someone from Neon is watching this thread, definitely want to play with this asap
robotswantdata
“Just use Postgres” greybeards right again. Looking forward to giving this a go soon
Unical-A
Impressive benchmarks. How does the BM25 implementation handle high-frequency updates (writes) while maintaining search latency? Usually, there's a trade-off between ingest speed and search performance in Postgres-based full-text search.
maweaver
I've been doing some RAG prototypes with hybrid search using pg_textsearch plus pgvector and have been very pleased with the results. Happy to see a 1.0 release!
gmassman
Very exciting! Congrats on the release, this will be a huge benefit to all folks building RAG/rerank systems on top of Postgres. Looking forward to testing it out myself.
zephyrwhimsy
Input quality is almost always the actual bottleneck. Teams spend months tuning retrieval while feeding HTML boilerplate into their vector stores.
jackyliang
VERY excited about this, literally just looking to build hybrid search using Postgres FTS. When will this be available on Supabase?
timedude
When is this available on AWS in Aurora? Anyone from AWS here, add it pronto
bradfox2
Thank you!! Goodbye manticore if this works.
piskov
On a tangent note it’s amazing how hard it is to have a good case-insensitive search in Postgres.In SQL Server you just use case-insensitive collation (which is a default) and add an index (it’s the only one non-clustered) and call it a day.In postgres you need to go above and beyond just for that. It’s like postgres guys were “nah dog, everybody just uses lowercase; you don’t need to worry of people writing john doe as John Doe)”.And don’t get me started with storing datetime with timezone (e.g “4/2/2007 7:23:57 PM -07:00“). In sql server you have datetimeoffset; in Postgres you fuck off :-)
devmor
This is really cool to see! I've been using BM25+sqlite-vec for contextual search projects for a little while, it's a great performance addition.
anon
undefined
gplprotects
> ParadeDB, is guarded behind AGPLWhat a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.
cucupac
[dead]
benjiro3000
[dead]
anon
undefined
zephyrwhimsy
[dead]