Comments (150)
- danlitt: I am pretty sure this article is predicated on a misunderstanding of what a "clean room" implementation means. It does not mean "as long as you never read the original code, whatever you write is yours". If you had a hermetically sealed codebase that just happened to coincide line for line with the codebase for GCC, it would still be a copy. Traditionally, a human-driven clean room implementation would have a vanishingly small probability of matching the original codebase closely enough to be considered a copy. With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

  The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly). Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

  What the chardet maintainers have done here is legally very irresponsible. There is no easy way to guarantee that their code is actually MIT and not LGPL without auditing the entire codebase. Any downstream user of the library is at risk of the license switching from underneath them. Ideally, this would burn their reputation as responsible maintainers and result in someone else taking over the project. In reality, it will probably remain MIT for a couple of years and then suddenly there will be a "supply chain issue" like there was for mimemagic a few years ago.
- kshri24: > The ownership void: If the code is truly a "new" work created by a machine, it might technically be in the public domain the moment it's generated, rendering the MIT license moot.

  How would that work? We still have no legal conclusion on whether AI-model-generated code, trained on all publicly available source (irrespective of type of license), is legal or not. IANAL, but IMHO it is totally illegal, as no permission was sought from the authors of the source code the models were trained on. So there is no way to just release code created by a machine into the public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered in the scope of "reverse engineering", and that is not specific only to humans. You can extend it to machines as well.

  EDIT: I would go so far as to say the most restrictive license the model was trained on should apply to all model-generated code. And a licensing model should be set up so that the original authors (all GitHub users who contributed code in some form) are reimbursed by AI companies. In other words, a % of profits must flow back to the community as a whole every time code-related tokens are generated. Even if everyone receives pennies, it doesn't matter. That is fair. This should also extend to artists whose art was used for training.
- nairboon: That code is still LGPL; it doesn't matter what some release engineer writes in the release notes on GitHub. All original authors and copyright holders must explicitly agree to relicense under a different license, otherwise the code stays LGPL licensed.

  Also, the mentioned SCOTUS decision is concerned with authorship of generative AI products. That's very different from this case. Here we're talking about a tool that transformed source code and somehow magically got rid of copyright through this transformation? Imagine the consequences for the US copyright industry if that were actually possible.
- abrookewood: This seems relevant: "No right to relicense this project (github.com/chardet)" https://news.ycombinator.com/item?id=47259177
- samrus: > The ownership void: If the code is truly a "new" work created by a machine, it might technically be in the public domain the moment it's generated, rendering the MIT license moot.

  I'm struggling to see where this conclusion came from. To me it sounds like the AI-written work cannot be copyrighted, so it's kind of like copy-pasting the original code. Copy-pasting the original code doesn't make it public domain. AI-generated code can't be copyrighted, or entered into the public domain, or used for purposes outside the original code's license. What's the paradox here?
- dathinab: IMHO AI can't claim authorship and as such can't copyright its work.

  This doesn't prevent any form of copyright automatically carrying over through the production of derivative code or similar. It just prevents anyone from claiming ownership of any parts unique to the derived work.

  Think about it: if a natural disaster changes (e.g. water-damages) a picture you drew, then (a) you can't claim ownership of the naturally produced changes, but (b) you still have ownership of the original picture contained in the changed/derived work. AI shouldn't change that.

  Which brings us to two further aspects:

  1. If you give an AI access to the project's code to rewrite it anew, it _is_ a copyright violation, as it's basically a side-by-side rewrite.
  2. But if you go the clean-room approach powered by AI, then it likely isn't a copyright violation; it is, however, now part of the public domain, i.e. not yours.

  So yes, doing clean-room rewrites has become incredibly cheap. But no, just because it's AI doesn't make the original code's copyright go away.

  And let's be realistic: one of the most relevant parts of many open source projects is that they are openly, collectively maintained. You don't get that with clean-room rewrites, AI or not.
- emsign: By design you can't know if the LLM doing the rewrite was exposed to the original codebase. Unless the AI company discloses its training material, which it won't, because it doesn't want to admit to breaking the law.
- shevy-java: > In traditional software law, a "clean room" rewrite requires two teams

  So, I dislike AI and wish it would disappear, BUT! The argument is strange here, because... how can a2mark ensure that the AI did NOT do a clean-room-conforming rewrite? I think in theory AI can do precisely this; you just need to make sure the model used actually does it. And this can be verified, in theory. So I don't fully understand a2mark here. Yes, AI may make use of the original source code, but it could "implement" things on its own. Ultimately this is finite complexity, not infinite complexity. I think a2mark's argument is weak here in theory, and I say this as someone who dislikes AI. The main question is: can computers do a clean rewrite, in principle? I think the answer is yes. That is not to say that Claude did this here, mind you; I really don't know the particulars. But the underlying principle? I don't see why AI could not do this. a2mark may need to reconsider the statement here.
- mfabbri77: This has the potential to kill open source, or at least the most restrictive licenses (GPL, AGPL, ...): if a license no longer protects software from unwanted use, the only possible strategy is to make development closed source.
- stuaxo: I don't see how (with current LLMs that have been trained on data under mixed licenses) you can use an LLM to rewrite to a less restrictive license. You could probably use it to output code that is GPL'd, though.
- zozbot234: If you ask an LLM to derive a spec that has no expressive element of the original code (a clean-room human team can carefully verify this), and then ask another instance of the LLM (with fresh context) to write out code from the spec, how is that different from a "clean room" rewrite? The agent that writes the new code only ever sees the spec, and by assumption (the assumption made in all clean-room rewrites) the spec is purely factual, with all copyrightable expression having been distilled out.
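  The two-stage process described in this comment can be sketched in a few lines (a minimal illustration only; `ask_llm` and both prompts are hypothetical stand-ins for whatever chat-completion client you use, not a real API):

  ```python
  # Hypothetical two-stage "clean room" pipeline. `ask_llm(prompt, content)`
  # is injected so the structure itself can be inspected and tested.

  def clean_room_rewrite(original_source, ask_llm):
      # Stage 1: one context reads the original and distills a purely
      # factual specification, with expressive elements stripped out.
      spec = ask_llm(
          "Describe the observable behavior and public API of this code. "
          "Do not reproduce identifiers, comments, or expressive text verbatim.",
          original_source,
      )
      # Stage 2: a fresh context, which never sees the original source,
      # writes a new implementation from the spec alone.
      implementation = ask_llm(
          "Write a fresh implementation satisfying this specification.",
          spec,
      )
      return spec, implementation
  ```

  The structural guarantee is only that the second call never receives the original source; whether the spec really contains no copyrightable expression is exactly the part a human clean-room process verifies by hand, and the part that is hard to guarantee here.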
- amelius: I think you should interpret it like this: you cannot copyright the alphabet, but you can copyright the way letters are put together. Now, with AI, the abstraction level goes from individual letters to functions, classes, and maybe even entire files. You can't copyright those (when written using AI), but you __can__ copyright the way they are put together.
- Retr0id: > In traditional software law, a "clean room" rewrite requires two teams

  Is the "clean room" process meaningfully backed by legal precedent?
- dessimus: Interesting to see how this plays out. Conceivably, if running an LLM over text defeats copyright, it will destroy the book publishing industry, as I could run any ebook through an LLM to make a "new" text, like the ~95% regurgitated Harry Potter.
- Tomte: > The original author, a2mark, saw this as a potential GPL violation

  Mark Pilgrim! Now that's a name I haven't read in a long time.
- anilgulecha: This is precedent-setting. In this case the rewrite was in the same language, but if there's a Python GPL project, and its tests (spec) were used to write a spec and then an implementation in Rust, could the second project legally be MIT, or any other license? If yes, this in a sense allows a path around GPL requirements. Linux's MIT version would be out within the next 1-2 years.
- DrammBA: I like the idea of AI-generated ~code~ anything being public domain. Public data in, public domain out.
- pu_pe: Licensing issues aside, the chardet rewrite seems to be clearly superior to the original in performance too. It's likely that many open source projects could benefit from a similar approach.
- gbuk2013: In my mind, if you feed code into an AI model then the output is clearly a derivative work, with all the licensing implications. This seems objectively reasonable?
- dspillett: > Accepting AI-rewriting as relicensing could spell the end of Copyleft

  The more restrictive licences, perhaps, though only if the rewriter convinces everyone that they can properly maintain the result. For ancient projects that aren't actively maintained anyway (because they are essentially done at this point) this might make little difference, but for active projects any new features and fixes might require either manual reimplementation in the rewritten version or the clean-room process being repeated completely for the whole project.

  > chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — (from the GitHub description)

  The "same name" part feels somewhat disingenuous to me. It isn't the same thing, so it should have a different name to avoid confusion, even if that name is something very similar to the original, like chardet-ng or chardet-ai.
- oytis: Is it just me, or has HN recently started picking up social media dynamics, with contributions reacting/responding to each other?
- skeledrew: Looks like copyright just died.
- foota: I think the more interesting question here would be whether someone could fine-tune an open-weight model to remove knowledge of a particular library (not sure how you'd do that, but maybe it's possible?) and then try to get it to produce a clean-room implementation.
- b65e8bee43c2ed0: At this point, every corporation in the world has AI slop in their software. Any attempt to outlaw it would attract enough funding from the oligarchs for the opposition to dethrone any party. No attempts will be made in the next three years, obviously, and by then it will be even later than it is now.

  And while particularly diehard believers in democracy may insist that if they kvetch hard enough they can get things they don't like regulated out of existence, they pointedly ignore the elephant in the room. They could succeed beyond their wildest dreams - get the West to implement a moratorium on AI, dismantle every FAGMAN, Mossad every researcher, send Yudkowskyjugend death squads to knock down doors to seize fully semiautomatic assault GPUs - and none of it will make any fucking difference, because China doesn't give a fuck.
- Cantinflas: > If "AI-rewriting" is accepted as a valid way to change licenses, it represents the end of Copyleft.

  Software in the AI era is not that important. Copyleft has already won; you can have new code in 40 seconds for $0.70 worth of tokens.
- jacquesm: If you don't understand what a 'derived work' is, then you should probably not be doing this kind of thing without a massive disclaimer and/or having your lawyer do a review.

  There is no such thing as the output of an LLM being a 'new' work for copyright purposes; if it were, then it would be copyrightable, and it is not. The term of art is 'original work', not 'new'.

  The bigger issue will be people using tools such as these and then passing off the results as their own, because they believe that their contribution to the process whitewashes the AI contributions to the point that they rise to the status of original works. "The AI only did little bits" is not a very strong defense, though.

  If you really want to own the work product, simply don't use AI during its creation. You can use it for reviews, but even then you simply do not copy-and-paste from the AI window into the text you are creating (whether code or ordinary prose isn't really a difference).

  I've seen a copyright case hinge on 10 lines of unique code that were enough of a fingerprint to clinch the 'derived work' assessment. Prize quote by the defendant: "We stole it, but not from them".

  There is a very blurry line somewhere in the contents of any large LLM: would a model be able to spit out the code that it did if it did not have access to similar samples, and to what degree does that output rely on one or more key examples without which it would not be able to solve the problem you've tasked it with?

  The lower boundary would be the most minimal training set required to do the job, and then analyzing which corresponding bits from the inputs cause the output to be non-functional if they were dropped from the training set. The upper boundary would be where completely unrelated works and general information, rather than other parties' copyrighted works, would be sufficient to do the creation.

  The easiest way to loophole this is to copyright the prompt, not the work product of the AI; after all, you should at least be able to write the prompt. Then others can re-create it too, but that's usually not the case with these AI products: they're made to be exact copies of something that already exists, and the prompt will usually reflect that.

  That's why I'm a big fan of mandatory disclosure of whether or not AI was used in the production of some piece of text. For one, it helps establish whether or not you should trust it, who is responsible for it, and whether the person publishing it has the right to claim authorship.

  Using AI as a 'copyright laundromat' is not going to end well.
- tgma: Isn't the AFC (abstraction-filtration-comparison) test applicable here?
- blamestross: Intellectual property laundering is the core and primary value of LLMs. Everything else is "bonus".
- RcouF1uZ4gsC: > The copyright vacuum: If AI-generated code cannot be copyrighted (as the courts suggest), then the maintainers may not even have the legal standing to license v7.0.0 under MIT or any license.

  I believe this is a misunderstanding of the ruling. The code can't be copyrighted by an LLM. However, the code could be copyrighted by the person running the LLM.
- gspr: > If "AI-rewriting" is accepted as a valid way to change licenses, it represents the end of Copyleft. Any developer could take a GPL-licensed project, feed it into an LLM with the prompt "Rewrite this in a different style," and release it under MIT. The legal and ethical lines are still being drawn, and the chardet v7.0.0 case is one of the first real-world tests.

  This isn't even limited to "the end of copyleft"; it's the end of all copyright! At least copyright protecting the little guy. If you have deep enough pockets to create LLMs, you can in this potential future use them to wash away anyone's copyright for any work. Why would the GPL be the only target? If it works for the GPL, it surely also works for your photographs, poetry – or hell, even proprietary software?
- duskdozer: This is such scummy behavior.
- verdverm: Interesting questions raised by the recent SCOTUS refusal to hear appeals related to AI and copyrightability, and how that may affect licensing in open source. Hoping the HN community can bring more color to this; there are some members who know about these subjects.
- est: Uh, patricide? The key leap from GPT-3 to GPT-3.5 (aka ChatGPT) was code-davinci-002, which was trained on GitHub source code after the OpenAI-Microsoft partnership. Open source code contributed much to LLMs' amazing CoT consistency. If there had been no Open Source movement, LLMs would have been developed much later.
- himata4113: In my opinion, GPL-licensed code should just "infect" models, forcing them to follow the license. You can test for this with prompts like: complete the code "<snippet from GPL-licensed code>". And if the models were then GPL-licensed, the relicensing problem would be gone, since code produced by these models should in theory also be GPL-licensed. Unfortunately, there is a dumb clause that computer-generated code cannot be copyrighted or licensed to begin with.
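  One crude way to check whether a completion of such a "complete the code" prompt actually regurgitated the GPL snippet verbatim is a sliding-window substring check (a minimal sketch; real provenance tooling would match at the token or AST level and tolerate whitespace and renaming):

  ```python
  def verbatim_overlap(model_output: str, source: str, window: int = 60) -> bool:
      """Return True if any `window`-character run of `source`
      appears verbatim in `model_output`."""
      for i in range(len(source) - window + 1):
          if source[i:i + window] in model_output:
              return True
      return False
  ```

  A hit on a long enough window is the kind of fingerprint evidence that copyright disputes over code tend to turn on; a miss proves nothing, since paraphrased copying slips straight past it.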
- spwa4: Can we do the same with Universal Music's catalog? Because that's easy and already possible. Or Microsoft Windows? Because we all know the answer: if it works, essentially every government will immediately call it illegal.

  Because if this isn't allowed, that makes all of the AI models themselves illegal. They are very much the product of taking others' copyrighted work and rewriting it.

  But of course this will be allowed, because copyright was never meant to protect anyone small. And that it's in direct contradiction with what applies to large companies? Courts won't care.