And if you then write a program that is remarkably similar to the one you read, that's copyright infringement. As another reply noted--but without anywhere near enough verbosity--this is not without risk, and people who intend to work on similar systems often try to use a strategy where they burn one engineer by having them read the original code, have them document it carefully with a lawyer to remove all expressive aspects, and then have a separate engineer develop it from the clean documents.
Copyright doesn't protect general concepts, methods, or common knowledge. So you could write a program that is remarkably similar to another one and not infringe copyright. Just like you can write a book with the same plot as another without infringing copyright.
Plus, given that most programming languages have a finite grammar and a limited number of ways to express general concepts, the individual bits of code that make up most programs are probably not sufficiently original to be copyrightable in themselves.
But the result is that you can't assume this is the case: you have to look, on a case-by-case basis, at whether the chatbot you are using -- one with no understanding of copyright as nuanced as either of us -- merely learned something general-purpose and applied it in a way that did not lead to infringement, whether the code it generated is technically infringing but qualifies as fair use, or whether what it produced isn't allowed at all.
A lot of people seem to want to believe that the output of the chatbot is somehow inherently clean in all cases, and they cite this idea that a human can read code and learn from it... but a human can -- even without realizing it!! -- infringe on copyrights, and so that analogy doesn't absolve the chatbot. If we continue to assume the chatbot's output is clean anyway, we are ascribing to it a superhuman ability to launder copyright.
> strategy where they burn one engineer by having them read the original code, have them document it carefully with a lawyer to remove all expressive aspects, and then have a separate engineer develop it from the clean documents.
Interesting. What kinds of situations is that strategy used for?
(I'm familiar with cleanroom, which I understand means you start with un-tainted engineers, who've credibly never been exposed to the proprietary IP, and they work only from unencumbered public documentation and from running the system as an opaque box. Then there's also validation, like with parallel systems and fuzzing. But I haven't thought through in what situations this might not work, and so might require the tainted-documenter approach.)
This is the full or classic version of clean-room reverse engineering. Working from unencumbered public documentation is relatively new; that kind of detailed documentation wasn't always widely available. Car manufacturers still protect their service manuals with an agreement that basically says they can't be used for this, though I think a lot of service centers have stopped making people sign them.
The classic tech story that used this technique is the IBM BIOS and the resulting spread of "IBM PC-compatible" machines. There is a little bit about it on the Wikipedia page (https://en.wikipedia.org/wiki/IBM_PC%E2%80%93compatible). Random factoid: the AMC series "Halt and Catch Fire" has a depiction of doing this IBM clone reverse engineering and did a pretty good job of it.
That sounds like a question of degree for the jury -- an evaluation of whether the facts presented support a claim of sufficiently infringing similarity. In this case, the judge felt the plaintiffs were so far from demonstrating infringement that the question never appeared in front of a jury.
If we're moving the question to one of degree, then it's up to Microsoft and others to monitor their output, because even if a model is not trained on copyrighted material, it can still accidentally infringe. Even if you have never listened to music by Lady Gaga or anything like it, that does not mean your own original inspiration can't accidentally produce songs that are too similar to hers. In other words, it's like the Ed Sheeran case.
Does that have any legal basis? It sounds a lot like what Google did for their Java engine, which essentially rewrote the entire engine with the same APIs, while referencing the original source code. Didn't the courts decide it was fine?