
Conversation

@lwasser (Member) commented Sep 16, 2025

This blog post outlines pyOpenSci's new peer review policy regarding the use of generative AI tools in scientific software, emphasizing transparency, ethical considerations, and the importance of human oversight in the review process.

It is codeveloped by the pyOpenSci community and relates to a discussion here:

pyOpenSci/software-peer-review#331

@lwasser (Member, Author) commented Sep 23, 2025

@all-contributors please add @elliesch for review, blog

@allcontributors (Contributor) replied:
@lwasser

I've put up a pull request to add @elliesch! 🎉

@lwasser (Member, Author) commented Nov 18, 2025

@all-contributors please add @elliesch for blog, review

@allcontributors (Contributor) replied:
@lwasser

@elliesch already contributed before to blog, review

@lwasser (Member, Author) commented Nov 18, 2025

cc @willingc in case you are interested in this blog post!! no pressure!!

@willingc (Collaborator) left a comment:

Love this! A few grammar suggestions.

Comment on lines 90 to 93
* Using LLM output verbatim could violate the original code's license
* You might accidentally commit plagiarism or copyright infringement by using that output verbatim in your code
* Due diligence is nearly impossible since you can't trace what the LLM "learned from" (most LLM's are black boxes)

A Collaborator suggested a change:

  * Using LLM output verbatim could violate the original code's license
  * You might accidentally commit plagiarism or copyright infringement by using that output verbatim in your code
- * Due diligence is nearly impossible since you can't trace what the LLM "learned from" (most LLM's are black boxes)
+ * Due diligence is nearly impossible since you can't trace what the LLM "learned from" (most LLMs are black boxes)

A Contributor commented:

I think "verbatim" is being leaned on too much here. An LLM can produce verbatim copies of its corpus, but the standard in copyright law is not limited to verbatim copies. If the process involved copying at any stage, refactoring can only obfuscate. The "substantial similarity" standards in copyright law are used as circumstantial evidence of process. Modifying the result by paraphrasing/refactoring is concealing the evidence (and thus reduces the likelihood of being caught), but does not make the process legal. I think we should be careful to not spread that misconception to readers.

Comment on lines 85 to 86
LLMs are trained on large amounts of open source code; most of that code has licenses that require attribution.
The problem? LLMs sometimes spit out near-exact copies of that training data, but without any attribution or copyright notices.

@lwasser (Member, Author) suggested a change:

- LLMs are trained on large amounts of open source code; most of that code has licenses that require attribution.
- The problem? LLMs sometimes spit out near-exact copies of that training data, but without any attribution or copyright notices.
+ LLMs are trained on large amounts of open source code that is bound by various licenses, many of which require attribution. When an LLM generates code, it may reproduce verbatim output, patterns, or structures from that training data—but without attribution or copyright notices.

@lwasser (Member, Author) commented:

Trying to capture more than just verbatim copying here: fundamentally, the patterns are licensed as well.

Also wondering here: let's say that I produce some code totally on my own that happens to match a pattern from some code whose license requires attribution. What happens then? (My code is legitimately developed on my own, the pattern just happens to be a great one that others use too, and maybe I've even seen it before, but I'm not intentionally copying.)

A Contributor replied:

As far as copyright law is concerned, that's exactly the scenario where the substantial similarity standard would be applied. The more substantial the copying, and the more recently you would have observed the original, the more likely your work would be found to have substantial similarity and to be infringing. Protecting against that ambiguity is why clean-room design exists.

lwasser and others added 24 commits December 16, 2025 13:46
Co-authored-by: Jed Brown <jed@jedbrown.org>
Co-authored-by: Jed Brown <jed@jedbrown.org>
Co-authored-by: Jed Brown <jed@jedbrown.org>
Co-authored-by: Jed Brown <jed@jedbrown.org>
Co-authored-by: Jed Brown <jed@jedbrown.org>
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
@lwasser (Member, Author) commented Dec 16, 2025

Ok everyone!! I believe I have addressed all of the comments, so it's time to merge this. Once merged, if you notice anything that still feels off, please feel free to open a new PR or an issue here. After seeing the direction that JOSS is going, I think we may also want to add a future discussion on reviewer burden, and on keeping communication in issues human rather than agent-based!

@lwasser lwasser merged commit ae68d20 into pyOpenSci:main Dec 16, 2025
4 checks passed
@lwasser lwasser deleted the gen-ai branch December 16, 2025 21:34
@lwasser (Member, Author) commented Dec 16, 2025

@eliotwrobson merged!! 🚀


7 participants