Is ChatGPT Stealing from Authors?
As a writer who has published literally millions of words online, both personally and professionally, I can practically guarantee that at least some of my writing has been used to train large language models (LLMs) like ChatGPT. And as a self-published author hoping to become traditionally published once my first major WIP is complete, I have some skin in the game when it comes to AI writing models.
While there are a variety of options in the AI writing game (including Sudowrite, Poe, Claude, NovelAI, and others), ChatGPT is the most accessible to amateur writers and small-time professional writers like myself, thanks to its price point of FREE. (There is also, of course, a paid version that offers greater functionality.) So I decided to do a layman's analysis of whether ChatGPT really is plagiarizing authors like George R.R. Martin, who is famously one of the authors suing OpenAI over exactly that claim.
ChatGPT was trained on roughly 570 GB (5.7e+11 bytes) of data from a dataset known as Common Crawl, plus additional data from Wikipedia (which is freely licensed) and other sources.
I’m looking in my own Documents folder, and the Scrivener file for my 100k WIP is only 108 KB. If it weren’t in Scrivener format but plain text, it would be far smaller. (It’s 100,026 words long, to be precise, which rounds comfortably to 100k.)
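To put those two numbers side by side, here is a quick back-of-the-envelope calculation in Python. It only uses the figures quoted above (the 570 GB training corpus and my 108 KB file), and the byte conversions are the usual rough ones, so treat it as an illustration rather than a precise measurement:

```python
# Back-of-the-envelope comparison of the two sizes quoted above:
# the ~570 GB training corpus versus one 108 KB manuscript file.

TRAINING_DATA_BYTES = 570 * 10**9   # ~570 GB, the reported Common Crawl figure
MANUSCRIPT_BYTES = 108 * 1024       # the 108 KB Scrivener file

fraction = MANUSCRIPT_BYTES / TRAINING_DATA_BYTES

print(f"Manuscript share of the corpus: {fraction:.2e}")
print(f"As a percentage: {fraction * 100:.8f}%")
# Roughly 1.9e-07, i.e. about 0.00002% of the training data.
```

Even if a plain-text export of the manuscript came out several times larger, its share of the training data would still be a vanishingly small fraction of a percent.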