Comments (156)
- simonw: Don't miss how this works. It's not a server-side application: this code runs entirely in your browser using SQLite compiled to WASM, but rather than fetching a full 22GB database it uses a clever hack that retrieves just the "shards" of the SQLite database needed for the page you are viewing. I watched it in the browser network panel and saw it fetch https://hackerbook.dosaygo.com/static-shards/shard_1636.sqlite.gz https://hackerbook.dosaygo.com/static-shards/shard_1635.sqlite.gz https://hackerbook.dosaygo.com/static-shards/shard_1634.sqlite.gz as I paginated to previous days. It's reminiscent of that brilliant SQLite.js VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs (only that one used HTTP range headers; this one uses sharded files instead). The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against; there are 1636 total.
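A minimal Python sketch of the sharded-file idea described in this comment. It is not the site's code, which runs in the browser against SQLite compiled to WASM. Only the URL pattern comes from the network-panel observation; the helper names `shard_url` and `open_shard` are hypothetical, and the sketch assumes a shard is small enough to hold in memory.

```python
import gzip
import os
import sqlite3
import tempfile

# URL pattern as observed in the network panel; the constant name is an assumption.
SHARD_URL = "https://hackerbook.dosaygo.com/static-shards/shard_{n}.sqlite.gz"


def shard_url(n: int) -> str:
    """Build the URL of shard number n (hypothetical helper)."""
    return SHARD_URL.format(n=n)


def open_shard(gz_bytes: bytes) -> sqlite3.Connection:
    """Decompress a gzipped SQLite file and load it into an in-memory database."""
    raw = gzip.decompress(gz_bytes)
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        disk = sqlite3.connect(path)
        try:
            mem = sqlite3.connect(":memory:")
            disk.backup(mem)  # copy every page of the shard into memory
        finally:
            disk.close()
    finally:
        os.remove(path)
    return mem
```

The bytes could come from something like `urllib.request.urlopen(shard_url(1636)).read()`. The contrast with the range-header approach is that each request here returns a whole small database file, where sql.js-httpvfs requests byte ranges of one large file.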
- kamranjon: It'd be great if you could add it to Kiwix [1] somehow (not sure what the process is for that, but 100rabbits figured it out for their site). I use it all the time now that I have a dumb phone: I have the entirety of Wikipedia, Wiktionary and 100rabbits all offline. [1] https://kiwix.org/en/
- yread: I wonder how much smaller it could get with some compression. You could probably encode "This website hijacks the scrollbar and I don't like it" comments into just a few bits.
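One hedged way to illustrate the compression idea: zlib accepts a preset dictionary, so a comment that matches text already in the dictionary compresses to a short back-reference. This sketch uses the quoted comment itself as the dictionary; the helper names are hypothetical and nothing here describes how the archive is actually compressed.

```python
import zlib

# Preset dictionary of frequently repeated text (here just the quoted comment).
COMMON = b"This website hijacks the scrollbar and I don't like it"


def compress_with_dict(text: bytes, zdict: bytes = COMMON) -> bytes:
    """Compress text using a preset dictionary shared with the reader."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(text) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes = COMMON) -> bytes:
    """Reverse compress_with_dict; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()
```

The result is still a handful of bytes rather than a few bits, since the zlib framing and checksum carry their own overhead.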
- kristianp: I tried "select * from items limit 10" and it slowly iterated through the shards without returning. I got up to 60 shards before I stopped. Selecting just one shard makes that query return instantly. As mentioned elsewhere, I think DuckDB can work faster by reading only the part of a Parquet file it needs over HTTP. I was getting an error that the users and user_domains tables aren't available, but you just need to change the shard filter to the user stats shard.
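For a plain `LIMIT` query like the one in this comment, a front end could stop visiting shards as soon as it has enough rows. A sketch under stated assumptions: each shard is a local SQLite file with the same schema, and `query_shards` is a hypothetical helper, not the site's implementation.

```python
import sqlite3
from typing import Iterable, List, Tuple


def query_shards(paths: Iterable[str], sql: str, limit: int) -> List[Tuple]:
    """Run sql against each shard in order, stopping once limit rows are collected."""
    rows: List[Tuple] = []
    for path in paths:
        remaining = limit - len(rows)
        if remaining <= 0:
            break  # enough rows already; skip the remaining shards
        conn = sqlite3.connect(path)
        try:
            rows.extend(conn.execute(sql).fetchmany(remaining))
        finally:
            conn.close()
    return rows
```

Early exit is only valid when the query needs no global ordering or aggregation across shards; a query with `ORDER BY` or `COUNT(*)` over the whole data set would still have to visit every selected shard.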
- zkmon: Similar to single-page applications (SPA), single-table applications (STA) might become a thing. Just shard a table on multiple keys and serve the shards as static files, provided that the data is OK to share, similar to sharing static HTML content.
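A rough sketch of what "shard a table and serve the shards as static files" could look like, assuming a single shard key (the day) for simplicity. The `(id, day, text)` schema, the file naming, and `write_shards` are all illustrative assumptions.

```python
import os
import sqlite3
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str, str]  # (id, day, text): a hypothetical schema


def write_shards(rows: Iterable[Row], out_dir: str) -> Dict[str, str]:
    """Group rows by their day column and write one SQLite file per day."""
    by_day: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_day[row[1]].append(row)

    paths: Dict[str, str] = {}
    for day, group in by_day.items():
        path = os.path.join(out_dir, f"shard_{day}.sqlite")
        conn = sqlite3.connect(path)
        try:
            conn.execute(
                "CREATE TABLE items (id INTEGER PRIMARY KEY, day TEXT, text TEXT)"
            )
            conn.executemany("INSERT INTO items VALUES (?, ?, ?)", group)
            conn.commit()
        finally:
            conn.close()
        paths[day] = path
    return paths
```

The resulting directory can be put behind any static file server; a client then only needs a rule that maps the key it is looking for to a file name.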
- carbocation: That repo is throwing up a 404 for me. Question: did you consider the tradeoffs between DuckDB (or other columnar stores) and SQLite?
- m-p-3: Looks like the repo was taken down (404). That's too bad, I'd like to see the inner workings with a subset of data, even with placeholders for the posts and comments.
- Paul-E: That's pretty neat! I did something similar. I built a tool [1] to import the Project Arctic Shift dumps [2] of Reddit into SQLite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build an FTS5 index and import without WAL (--unsafe-mode), import of every Reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB. SQLite offers a lot of cool JSON features that would let you store the raw JSON and operate on that, but I eschewed them in favor of parsing only once at load time. That also lets me normalize the data a bit. I find that building the DB is pretty "fast", but queries run much faster if I immediately vacuum the DB after building it. The vacuum operation is actually slower than the original import, taking a few days to finish. [1] https://github.com/Paul-E/Pushshift-Importer [2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
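The "parse once at load time, normalize, then vacuum" flow from this comment can be sketched in Python. This is an illustration, not the Rust tool linked above: the JSON field names (`id`, `author`, `body`), the two-table schema, and `import_comments` are assumptions.

```python
import json
import sqlite3
from typing import Iterable


def import_comments(db_path: str, json_lines: Iterable[str]) -> int:
    """Parse each JSON line once, store normalized rows, then VACUUM the file."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS author "
            "(id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS comment "
            "(id TEXT PRIMARY KEY, author_id INTEGER REFERENCES author(id), body TEXT)"
        )
        count = 0
        for line in json_lines:
            obj = json.loads(line)  # the only time the raw JSON is parsed
            conn.execute(
                "INSERT OR IGNORE INTO author (name) VALUES (?)", (obj["author"],)
            )
            author_id = conn.execute(
                "SELECT id FROM author WHERE name = ?", (obj["author"],)
            ).fetchone()[0]
            conn.execute(
                "INSERT OR REPLACE INTO comment VALUES (?, ?, ?)",
                (obj["id"], author_id, obj["body"]),
            )
            count += 1
        conn.commit()
        conn.execute("VACUUM")  # rebuild the file compactly after the bulk load
    finally:
        conn.close()
    return count
```

Storing the author name once and referencing it by integer id is the kind of normalization that keeping raw JSON blobs would not give you.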
- Sn0wCoder: Site does not load on Firefox; the console error says 'Uncaught (in promise) TypeError: can't access property "wasm", sqlite3 is null'. Guess it's common knowledge that SharedArrayBuffer (SQLite WASM) does not work with FF due to cross-origin attacks (I just found out ;). Once the initial chunk of data loads, the rest loads almost instantly on Chrome. Can you please fix the GitHub link (currently 404)? I would like to peek at the code. Thank you!
- diyseguy: link no workie: https://github.com/DOSAYGO-STUDIO/HackerBook
- sieep: What a reminder of how much more efficient text is than video, it's crazy! Could you imagine the same amount of knowledge (or drivel) in video form? I wonder how large that would be.
- zX41ZdbW: The query tab looks quite complex with all these content shards: https://hackerbook.dosaygo.com/?view=query I have a much simpler database: https://play.clickhouse.com/play?user=play#U0VMRUNUIHRpbWUsI...
- abixb: Wonder if you could turn this into a .zim file for offline browsing with an offline browser like Kiwix, etc. [0] I've been taking frequent "offline-only-day" breaks to consolidate whatever I've been learning, and Kiwix has been a great tool for reference (offline Wikipedia, StackOverflow and whatnot). [0] https://kiwix.org/en/the-new-kiwix-library-is-available/
- modeless: It's really a shame that comment scores are hidden forever. Would the admins consider publishing them after stories are old enough that voting is closed? It would be great to have them for archives and search indices and projects like this.
- 3eb7988a1663: Did anyone get a copy of this before it was pulled? If GitHub is not keen, could it be uploaded to HuggingFace or some other service which hosts large assets? I have always known I could scrape HN, but I would much rather take a neat little package.
- tevon: The link seems to be down. Was it taken down?
- fouc: Suddenly occurs to me that it would be neat to pair a small LLM (3-7B) with an HN dataset.
- dspillett: Is there a public dump anywhere of the data that this is based upon, or have they scraped it themselves? Such a DB might be entertaining to play with, and the threadedness of comments would be useful for beginners to practise efficient recursive queries (more so than the StackExchange dumps, for instance).
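The kind of recursive query this comment has in mind can be sketched with a recursive common table expression, which SQLite supports. The `items(id, parent, author)` schema here is a hypothetical stand-in, not the archive's actual layout, and `fetch_thread` is an illustrative helper.

```python
import sqlite3
from typing import List, Tuple

# Walk a comment tree from a root item down through all of its descendants.
THREAD_SQL = """
WITH RECURSIVE thread(id, parent, author, depth) AS (
    SELECT id, parent, author, 0 FROM items WHERE id = ?
    UNION ALL
    SELECT i.id, i.parent, i.author, t.depth + 1
    FROM items AS i JOIN thread AS t ON i.parent = t.id
)
SELECT id, author, depth FROM thread ORDER BY depth, id
"""


def fetch_thread(conn: sqlite3.Connection, root_id: int) -> List[Tuple]:
    """Return (id, author, depth) for the root item and every reply beneath it."""
    return conn.execute(THREAD_SQL, (root_id,)).fetchall()
```

An index on the `parent` column would matter on a real data set, since the recursive step looks up children by parent id at every level.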
- spit2wind: This is pretty neat! The calendar didn't work well for me. I could only seem to navigate by month. And when I selected the earliest day (after much tapping), nothing seemed to update. Nonetheless, random-access history is cool.
- yupyupyups: 1 hour passed and it's already nuked? Thank you btw
- dmarwicke: 22GB for mostly text? Tried loading the site, it's pretty slow. Curious how the query performance is with this much data in SQLite.
- layer8: Apparently the comment counts are only the top-level comments? It would be nice for the thread pages to show a comment count.
- joshcsimmons: Link appears broken
- wslh: Is this updated regularly? 404 on GitHub, as the other comment says. With all due respect, it would be great if there were an official HN public dump available (one not requiring stuff such as BigQuery, which is expensive).
- KomoD: How do I download it? That repo is a 404.
- sirjaz: This would be awesome as a cross platform app.
- solarized: Beautiful! 2026 prayer: for all you AI junkies, please don’t pollute HN with your dirty AI gaming. Don’t bot posts, comments, or upvote/downvote just to maximize karma. Please. We can’t identify anymore who’s a bot and who’s human. I just want to hang out with real humans here.
- asdefghyk: How much space is needed for the data? I'm wondering if it would work on a tablet.
- abetusk: Alas, HN does not belong to us, and the existence of projects like this is subject to the whims of the legal owners of HN. From the terms of use [0]: """Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site. The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.""" [0] https://www.ycombinator.com/legal/#tou
- fao_: "Community, All the HN belong to you. This is an archive of hacker news that fits in your browser." "20 years of HN arguments and beauty, can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's your hands" I'm really sorry to have to ask this, but this really feels like you had an LLM write it?