Comments (289)
- c-linkageThis seems like a tragedy of the commons -- GitHub is free after all, and it has all of these great properties, so why not? -- but this kind of decision making occurs whenever externalities are present. My favorite hill to die on (externality) is user time. Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery and not user interaction time. Yet if I spend one hour making my app one second faster for my million users, I can save 277 user hours per year. But since user hours are an externality, such optimization never gets done. Externalities lead to users downloading extra gigabytes of data (wasted time) and waiting for software, all of which is waste that the developer isn't responsible for and doesn't care about.
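The estimate in c-linkage's comment can be checked with a quick sketch. The function name and the `uses_per_user` parameter are illustrative, not from the comment; the figure of roughly 277 hours corresponds to assuming each of the million users benefits from the one-second speedup once in the period counted.

```python
# Back-of-the-envelope check of the user-time estimate.
# Assumption: every user benefits from the speedup `uses_per_user` times.

SECONDS_PER_HOUR = 3600


def user_hours_saved(users: int, seconds_saved_per_use: float, uses_per_user: int = 1) -> float:
    """Return the total user time saved, in hours."""
    return users * seconds_saved_per_use * uses_per_user / SECONDS_PER_HOUR


if __name__ == "__main__":
    saved = user_hours_saved(users=1_000_000, seconds_saved_per_use=1.0)
    print(f"{saved:.1f} user hours saved for one hour of engineering time")
```

If users open the app more than once, the saving scales linearly with `uses_per_user`, so the figure in the comment is probably a lower bound.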
- dboonI’m building Cargo/UV for C. Good article. I thought about this problem very deeply. Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens. Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with. But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow track the commit a package version is in without knowing what hash git will generate before you do it. You have to push the package update and then push a second commit recording that. Obviously infeasible, obviously a nightmare. Conan’s solution is I think just about the only way. It trades the perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to a commit, every 3.x points to the same manifest, and there’s just a little logic to set that specific config field added in 3.12. If the logic gets too much, they let you map version ranges to manifests for a package. So if 3.13 rewrites the entire manifest, just remap it. I have not found another package manager that uses git as a backend that isn’t a terrible and slow tool. Conan may not be as rigorous as Nix because of this decision but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?
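The trade-off dboon describes (one shared manifest with a little version-conditional logic, plus an optional remapping of version ranges to manifests) might look roughly like the sketch below. Everything here is hypothetical: the folder names, `MANIFEST_RANGES`, `new_config_field`, and both functions are made up for illustration and are not Conan's actual file format or API.

```python
# Hypothetical sketch: many versions share one manifest, and a version-range
# table remaps to a different manifest only when a release rewrites it.
# None of these names come from Conan itself.

def parse_version(text: str) -> tuple[int, ...]:
    """Turn "3.12" into (3, 12) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Ordered list of (first version covered, manifest folder).
MANIFEST_RANGES = [
    ("3.0", "manifests/3.x"),
    ("3.13", "manifests/3.13-rewrite"),
]


def manifest_for(version: str) -> str:
    """Pick the manifest of the last range that starts at or before `version`."""
    wanted = parse_version(version)
    chosen = None
    for start, folder in MANIFEST_RANGES:
        if parse_version(start) <= wanted:
            chosen = folder
    if chosen is None:
        raise KeyError(f"no manifest covers version {version}")
    return chosen


def build_options(version: str) -> dict[str, bool]:
    """Conditional logic inside the shared manifest."""
    options = {"shared": False}
    if parse_version(version) >= (3, 12):
        # Config field that only exists from 3.12 onward (illustrative).
        options["new_config_field"] = True
    return options
```

The cost, as the comment notes, is that an old version is rebuilt from today's manifest rather than from the exact manifest it originally shipped with.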
- cesarbOne of these is not like the others... > The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies. This article is mixing two separate issues. One is using git as the master database storing the index of packages and their versions. The other is fetching the code of each package through git. They are orthogonal; you can have a package index using git but the packages being zip/tar/etc archives, you can have a package index not using git but each package is cloned from a git repository, you can have both the index and the packages being git repositories, you can have neither using git, you can even not have a package index at all (AFAIK that's the case for Go).
- ekjhgkejhgkDo the easy thing while it works, and when it stops working, fix the problem. Julia does the same thing, and from the Rust numbers on the article, Julia has about 1/7th the number of packages that Rust does [1] (95k/13k = 7.3). It works fine, Julia has some heuristics to not re-download it too often. But more importantly, there's a simple path to improve. The top Registry.toml [1] has a path to each package, and once downloading everything proves unsustainable you can just download that one file and use it to download the rest as needed. I don't think this is a difficult problem. [1] https://github.com/JuliaRegistries/General/blob/master/Regis...
- jama211“It never works out” - hmm, seems like it worked out just fine, worked great to get the operation off the ground and when scale became an issue it was solvable by moving to something else. It served its purpose, sounds like it worked out to me.
- steeleduncanThe other conclusion to draw is "Git is a fantastic choice of database for starting your package manager, almost all popular package managers began that way."
- kibwenI think there's a form of survivorship bias at work here. To use the example of Cargo, if Rust had never caught on, and thereby gotten popular enough to inflate the git-based index beyond reason, then it would never have been a problem to use git as the backing protocol for the index. Likewise, we can imagine innumerable smaller projects that successfully use git as a distributed delta-updating data distribution protocol, and never happen to outgrow it. The point being, if you're not sure whether your project will ever need to scale, then it may not make sense to reinvent the wheel when git is right there (and then invent the solution for hosting that git repo, when GitHub is right there), letting you spend time instead on other, more immediate problems.
- newswangerdIt’s always humbling when you go on the front page of HN and see an article titled “the thing you’re doing right now is a bad idea and here’s why”. This has happened to me a few times now. The last one was a fantastic article about how PG Notify locks the whole database. In this particular case it just doesn’t make a ton of sense to change course. I’m a solo dev building a thing that may never take off, so using git for plug-in distribution is just a no-brainer right now. That said, I’ll hold on to this article in case I’m lucky enough to be in a position where scale becomes an issue for me.
- quaintdevI host my own code repository using Forgejo. It's not public. In fact, it's behind mutual TLS like all the services I host. Reason? I don't want to deal with bots and other security risks that come with opening a port to the world. Turns out Go modules will not accept a package hosted on my Forgejo instance because it asks for a certificate. There are ways to make go get use ssh but even with that approach the repository needs to be accessible over https. In the end, I cloned the repository and used it in my project using a replace directive. It's really annoying.
- jarofgreenIt's not just package managers that do this - a lot of smaller projects crowd source data in git repositories. Most of these don't reach the scale where the technical limitations become a problem. Personally my view is that the main problem when they do this is that it gets much harder for non-technical people to contribute. At least that doesn't apply to package managers, where it's all technical people contributing. There are a few other small problems - but it's interesting to see that so many other projects do this. I ended up working on an open source software library to help in these cases: https://www.datatig.com/ Here's a write up of an introduction talk about it: https://www.datatig.com/2024/12/24/talk.html I'll add the scale point to future versions of this talk with a link to this post.
- dleslieGitHub is intoxicatingly free hosting, but Git itself is a terrible database. Why not maintain an _actual_ database on GitHub, with tagged releases? SQLite data is paged and so you can get away with only fetching the pages you need to resolve your query. https://phiresky.github.io/blog/2021/hosting-sqlite-database...
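To make dleslie's point about paging concrete: a SQLite file is a sequence of fixed-size pages, so a client that knows the page size can ask a static file host for just one page with an HTTP `Range` header. The sketch below only computes the header; it is not the implementation from the linked post, and the function names are illustrative. It relies on the documented file header layout (16-byte magic string, then a 2-byte big-endian page size where the value 1 means 65536).

```python
import struct

SQLITE_MAGIC = b"SQLite format 3\x00"


def page_size_from_header(header: bytes) -> int:
    """Read the page size from the first bytes of a SQLite database file."""
    if len(header) < 18 or not header.startswith(SQLITE_MAGIC):
        raise ValueError("not a SQLite database header")
    (raw,) = struct.unpack(">H", header[16:18])
    # The on-disk value 1 encodes a page size of 65536.
    return 65536 if raw == 1 else raw


def range_header_for_page(page_number: int, page_size: int) -> dict[str, str]:
    """Build the HTTP Range header that covers one page (pages start at 1)."""
    if page_number < 1:
        raise ValueError("SQLite page numbers start at 1")
    start = (page_number - 1) * page_size
    end = start + page_size - 1  # Range end offsets are inclusive
    return {"Range": f"bytes={start}-{end}"}
```

A query that touches a few B-tree pages then costs a few small range requests instead of a download of the whole database file.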
- cbondurantAdmittedly, I try and stay away from database design whenever possible at work. (Everything database is legacy for us.) But the way the term is being used here kinda makes me wonder, do modern SQL databases have enough security features and permissions management systems in place that you could just directly expose your database to the world with a "guest" user that can only make incredibly specific queries? Cut out the middle man, directly serve the query response to the package manager client. (I do immediately see issues stemming from the fact that you can't leverage features like edge caching this way, but I'm not really asking if it's a good solution, I'm more asking if it's possible at all.)
- Ericson2314The Nixpkgs example is not like the others, because it is source code. I don't get what is so bad about shallow clones either. Why should they be so performance sensitive?
- ifh-hnSo what's the answer then? That's the question I wanted answered after reading this article. With no experience with git or package management, would using a local client sqlite database and something similar on the server do?
- hogrugThe facts are interesting but the conclusion is a bit strange. These package managers have succeeded because git is better for the low trust model and GitHub has been hosting infra for free that no one in their right mind would provide for the average DB. If it didn't work we would not have these massive ecosystems upsetting GitHub's freemium model, but anything at scale is naturally going to have consequences and features that aren't so compatible with the use case.
- jupedThese are actually all problems with using Github as an ersatz CDN.
- the__alchemistThe Cargo example at the top is striking. Whenever I publish a crate, and it blocks me until I write `--allow-dirty`, I am reminded that there is a conflation between Cargo/crates.io and Git that should not exist. I will write `--allow-dirty` because I think these are two separate functionalities that should not be coupled. Crates.io should not know about or care about my project's Git usage or lack thereof.
- twoodfinWhat made git special & powerful from the start was its data model: Like the network databases of old, but embedded in a Merkle tree for independent evolution and verifiability. Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.
- themkI think git is overkill, and probably a database is as well. I quite like the hackage index, which is an append-only tar file. Incremental updates are trivial using HTTP range requests, making hosting it trivial as well.
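The incremental update themk mentions works because an append-only file never changes the bytes a client already has: the client only needs everything past its current length. Below is a minimal in-memory sketch of that idea. `serve_range` is a stand-in for a static file server, all names are hypothetical, and the sketch ignores details such as tar end-of-archive padding and integrity checks that a real client would have to handle.

```python
def range_for_update(local_size: int) -> str:
    """Range header value asking for every byte after what we already hold."""
    return f"bytes={local_size}-"


def serve_range(remote: bytes, range_value: str) -> bytes:
    """Stand-in for an HTTP server answering an open-ended Range request.

    A real server would reply 416 when the start offset equals the file
    length; this stub simply returns an empty body in that case.
    """
    unit, _, spec = range_value.partition("=")
    start_text, _, end_text = spec.partition("-")
    if unit != "bytes" or end_text:
        raise ValueError("only open-ended byte ranges are supported here")
    return remote[int(start_text):]


def update_local_index(local: bytes, remote: bytes) -> bytes:
    """Bring the local copy up to date by appending only the missing tail."""
    tail = serve_range(remote, range_for_update(len(local)))
    return local + tail
```

The bytes transferred are proportional to what was published since the last update, not to the size of the whole index.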
- ekjhgkejhgkUncertain if this is OT, but given that the CCC is a politically inspired organization, I hope not: One thing that still seems absent is awareness of the complete takeover of "gadgets" in schools. Schools these days, as early as primary school, shove screens in front of children. They're expected to look at them, and "use" them for various activities, including practicing handwriting. I wish I was joking [1]. I see two problems with this. First is that these devices are engineered to be addictive by way of constant notifications/distractions, and learning is something that requires long sustained focus. There's a lot of data showing that under certain common circumstances, you do worse learning from a screen than from paper. Second is that it implicitly trains children to expect that anything has to be done through a screen connected to a closed point-and-click platform. (Uninformed) people will say "people who work with computers make money, so I want my child to have an ipad". But interacting with a closed platform like an ipad is removing the possibilities and putting the interaction "on rails". You don't learn to think, explore and learn from mistakes, instead you learn to use the app that's put in front of you. This in turn reinforces the "computer says no" [2] approach to understanding the world. I think this is a matter of civil rights and freedom, but sadly I don't often see "civil rights" organizations talk about this. I think I heard Stallman say something along these lines once, but other than that I don't see campaigns anywhere. [1] https://www.letterjoin.co.uk/ [2] https://youtu.be/eE9vO-DTNZc
- mukundeshThough not Github, worth mentioning Huggingface, which is also using git, but managing large files with their(?) xet protocol. https://huggingface.co/docs/hub/en/xet/index
- aidenn0As far as I know, Nixpkgs doesn't use git as a package database. The package definitions are stored and developed in git, but the channels certainly are not.
- teifererAnd this my friends is the reason why (only) focusing on CPU cycles and memory hierarchies is insufficient when thinking of the performance of a system. Yes they are important. But no level of low-level optimization will get you out of the hole that a wrong choice of algorithm and/or data structure may have dug you into.
- ZambyteThe issues with using Git for Nix seem to entirely be issues with using GitHub for Nix, no?
- gethlyIf we stopped using VCS to fetch source files, we would lose the ability to get the exact commit (understood as a version that has nothing to do with the underlying VCS) of these files. Git, Mercurial, SVN..., GitHub, Bitbucket... it does not matter. Absolutely nobody will be building downloadable versions of their source files, hosted on who knows how "prestigious" domains, by copying them to another location just to serve the --->exact same content<--- that GitHub and the like already provide. This entire blog is just a waste of time for anyone reading it.
- ori_bAlternatively: Downloading the entire state of all packages when you care about just one, it never works out. O(1) beats O(n) as n gets large.
- bencornia> Grab’s engineering team went from 18 minutes for go get to 12 seconds after deploying a module proxy. That’s not a typo. Eighteen minutes down to twelve seconds. > The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies. Cloning entire repositories to get a single file. I have also had inconsistent performance with go get. Never enough to look closely at it. I wonder if I was running into the same issue?
- mikepurvisThe nix cli almost exclusively pulls GitHub as zipballs. Not perfect but certainly far faster than a real git clone.
- drzaiusx11I'd add git gemfile dependencies to the list of languages called out here as well. It supports git repos, but in general it's a bad idea unless you are diligent with git tag use and disallow git tag mutability, which also assumes you have complete control of your git dependencies...
- mikkupikkuPeople who put off learning SQL for later end up using anything other than a database as their database.
- drzaiusx11One of the first things I did at my current place of employment was to detangle the mess of gemfile git dependencies and get them to adopt semver and an actual package repo. There were so many footguns with git dependencies in ruby we were getting taken down by friendly fire on the daily...
- hk1337I like Go but its dependency management is weird and seems to be centered around GitHub a lot.
- nacozarinasuccessful things often have humble origins, it’s a feature not a bug; for every project that managed to out-grow ext4/git there were a hundred that were well-served and never needed to over-invest in something else
- PunchyHamsterThe article conclusion is just... not good. There are many benefits to using Git as a backend: you can point your project to every single commit as a version, which makes testing any fixes or changes in libs super easy, it has built-in integrity control, and technically (sadly not in practice) you could just sign commits and use that to verify whether a package is authentic. It being suboptimal bandwidth-wise is frankly just a technical hurdle to get over, with benefits well worth the drawback.
- mcnyI want to take a quick detour here if anyone is knowledgeable about this topic. > The hosting problems are symptoms. The underlying issue is that git inherits filesystem limitations, and filesystems make terrible databases. Does this mean mbox is inherently superior to maildir? I really like the idea of maildir because there is nothing to compact but if we assume we never delete emails (on the local machine anyways), does that mean mbox or similar is preferable over maildir?
- didipSo… What we need is a globally distributed network of git seeders for all open source GitHub content, then? Seems possible if every git client is also a torrent client.
- pizlonatorWhat is the alternative? "Use a database" isn't actionable advice because it's not specific enough.
- pxcLoved this article. Just enough detail to make the broad scope compatible with a reasonable length, and well-argued. I feel sometimes like package management is a relatively second-class topic in computer science (or at least among many working programmers). But a package manager's behavior can be the difference between a grotesque, repulsive experience and a delightful, beautiful one. And there aren't quite yet any package managers that do well everything that we collectively have learned how to do well, which makes it an interesting space imo. Re: Nixpkgs, interestingly, pre-flakes Nix distributes all of the needed Nix expressions as tarballs, which does play nice with CDNs. It also distributes an index of the tree as a SQLite database to obviate some of the "too many files/directories" problem with enumerating files. (In the meantime, Nixpkgs has also started bucketing package directories by name prefix, too.) So maybe there was a lesson learned here that would be useful to re-learn. On the other hand, IIRC if you use the GitHub fetcher rather than the Git one, including for fetching flakes, Nix will download tarballs from GitHub instead of doing clones. Regardless, downloading and unpacking Nixpkgs has become kinda slow. :-\
- xpressvideozThe article lists Git-based wiki engines as a bad usage of Git. Can anybody recommend alternatives? I want something that can be self-hosted, is easily modified by text editors, and has individual page history, preferably with Markdown.
- dwarduWorst thing is when you’re in an office and your PC, along with other PCs, pulls from git unauthenticated, then you get hit with API limits.
- iamwilThis sounds like a missing piece of software in the OSS world. If you have the inclination, you should write it.
- grumbelDo we have distributed databases that regular users can clone, modify and merge?
- weiwenhaoFor package management software that is rarely used, free is the biggest motivation.
- leohThe conclusion reached in this essay is 100% wrong. See "The reftable backend: What it is, where it's headed, and why should you care?" > With release 2.45, Git has gained support for the “reftable” backend to read and write references in a Git repository. While this was a significant milestone for Git, it wasn’t the end of GitLab’s journey to improve scalability in repositories with many references. In this talk you will learn what the reftable backend is, what work we did to improve it even further and why you should care. https://www.youtube.com/watch?v=0UkonBcLeAo Also see Scalar, which Microsoft used to scale their 300GiB Windows repository, https://github.com/microsoft/scalar.
- skywhopperNot sure I can agree with the takeaway. It works well at first, but doesn’t scale, so folks found workarounds. That’s how literally every working system grows. There are always bottlenecks eventually. And you address them when they become an issue, not five years earlier.
- keithgrovesWhen building https://enact.tools we considered this. I'm glad we didn't go this route.
- dromologistWe wanted to pull updated code in our undockerized instances when they were instantiated, so we decided to pull the code from GitHub. Worked out pretty well though after a thousand trials we got a 502 and now we're one step closer to being forced into a CD pipeline.
- sghiassyUse the git clone --depth option and you’ll only download the most recent commits. Yeesh
- miyuruFunnily enough, I clicked the homebrew GitHub link in the post, only to get a rate limited error page from GitHub.
- born-jrelol I see this as I plan on using Git for my thing store. https://github.com/blue-monads/potatoverse
- 0xbadcafebeeYOLO software engineering, the hallmark of the 21st century
- notoranditRepsy
- stephenlfOmarchy
- frumplestlatzSince ~2002, Macports has used svn or git, but users, by default, rsync the complete port definitions + a server-generated index + a signature. The index is used for all lookups; it can also be generated or incrementally updated client-side to accommodate local changes. This has worked fine for literally decades, starting back when bandwidth and CPU power were far more limited. The problem isn’t using SCM, and the solutions have been known for a very long time.
- BlueTemplarWait, isn't fossil based on sqlite? Or does fossil itself still have the same issues?
- holyknightIt’s basically the same thing that always happens when you choose a technology because it’s convenient rather than a great fit for your problem. Sooner or later, you’ll hit a wall. Just because you can cook a salmon in your dishwasher doesn’t mean you should.
- encom> [Homebrew] Auto-updates now run every 24 hours instead of every 5 minutes [...] That is such an insane default, I'm at a loss for words.
- gjvcsqlite seems to be ideal for a package manager
- aniouAs a side note. Maybe someone knows why Rust devs chose an already used name for language change proposals? "RFC" was already taken and well-established and I simply refuse to accept that someone wasn't aware of Request For Comments - and if that was true and the clash was created deliberately, then it was rude and arrogant. Every, ...king time, when I read something like "RFC 2789 introduced a sparse HTTP protocol." my brain suffers from a short-circuit. BTW: RFC 2789 is a "Mail Monitoring MIB".
- eviksIndeed, the seductive nature of bad tools lying close to your hand - no need to lift your butt to get them!