Spampocalypse Now: Will AI-generated content break the internet?

Techtonic
7 min read · Apr 30, 2024
Image caption: So the most interesting thing about this image is trying to work out, from the cans in the foreground, what kind of animal DALL-E thinks Spam is made out of. Image credit: DALL-E 3 via Microsoft Copilot.

Back in the early 2000s, there was a period of about three years when serious thinkers were afraid that spam might make email unusable. This in turn led to predictions that the internet as a whole was going to collapse, buried beneath all the spam. An article in the New York Times in June 2002 noted that “In addition to becoming more sophisticated, spammers have become more prolific. These days, more and more junk e-mail is finding its way into In boxes [sic].”

Looking back, of course, this didn’t happen. That article was published right around peak spam, before a combination of technology and regulation pushed most spam out of our “In boxes” and into junk folders, or destroyed it entirely before it reached us.

We’re now in the early stages of a new kind of spam. Generative AI has made it trivially simple to create not just emails but blogs, LinkedIn posts, entire websites, and even Amazon listings in bulk. Platforms, afraid their users will go elsewhere, are even encouraging this, embedding “write with AI” buttons into their interfaces.

This is, of course, terrible. Generative AI has many strengths, but delivering interesting points of view and fresh perspectives is not among them. Its default voice is that of the dutiful high-school student, methodically rehashing the conventional wisdom found on the internet. If you ask it to change things up and adopt a different voice or an edgier take, it reads like that same dutiful high-school student trying to be funny, or trying to write like David Foster Wallace. This isn't likely to change any time soon; I believe it's an inherent limitation of how LLMs work. You can train an LLM to sound like Hunter S. Thompson, but it won't feel like a new work by Hunter S. Thompson. It will feel like the average of all existing works by Thompson, because it is. That's just how this goes.

Because it’s so easy to create “content” with AI, and because there are potential rewards for you if people consume it, there’s a lot of this lowest-common-denominator material, and more is coming all the time. One of the more depressing, self-referential topics available on the internet–and that’s saying a lot–is the genre in which people who make money using AI to churn out low-grade content teach other people how to make money using AI to churn out low-grade content.

Sometimes this is just irritating, a kind of digital litter that gets in the way when you’re searching for specific information. Sometimes it’s actively unhelpful, like the startup that produces stock “analysis” by scraping up bits of information and throwing them into an LLM. It’s posted a number of articles about the company where I work, and consistently gets everything wrong. This material runs along a spectrum from lazy to malicious, and you’ve probably already seen a lot of it, especially if you’re searching for low-popularity content like very local news or sports.

So in one way all of this is just like spam. It gets in the way of real content, wasting your time and everyone’s storage and bandwidth. But there are some important differences.

First, early 2000s spam (and spam today, for that matter) didn’t want to be read for its own merits. It wanted to sell you something, whether that was fake Viagra, the opportunity to work from home, or the playing cards US forces had in Iraq with the country’s 52 most wanted members of the regime. It wasn’t trying to masquerade as something else, and so if you didn’t want to buy the cards, you wouldn’t read it. Modern AI content spam only wants to be read (note that I’m not talking about AI-assisted fraud, which is an extremely serious but largely separate issue). It wants you to click on it and scroll down through the sort-of-content, so that it can serve you ads. And so while traditional spam clogged up your inbox, you could mostly avoid actually reading it; modern AI content spam, by contrast, is designed to fool you and take up far more of your time. This is not only a bigger waste of resources but also more dangerous, because you may not realize you’re reading unreliable LLM-generated material that could easily be wrong.

Second, the rise of LLMs has made the internet significantly more circular. The large foundation models all rely on masses of unstructured training data, most of which is (not surprisingly) harvested from the internet. While not all AI content spam will end up in training data, a lot of it will, creating a potential doom loop whereby LLMs are trained on large volumes of low-quality LLM content, thereby reinforcing the mediocrity of much LLM output, and so on. The internet is huge, so this effect may not be noticeable for some time, and it’s possible the model-makers will find clever ways to screen out low-grade AI-generated content, but the risk is very real.
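There’s a standard toy illustration of this doom-loop dynamic, with a Gaussian standing in for the model and every number here arbitrary: fit a distribution to samples of its own output, over and over, and diversity quietly drains away.

```python
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0   # "generation zero": the original human-written corpus
n = 20                 # each generation trains on only n samples of the last

for generation in range(1, 31):
    # Train the next model purely on the previous model's own output
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu, sigma = statistics.mean(samples), statistics.pstdev(samples)
    if generation % 5 == 0:
        print(f"generation {generation:2d}: sigma = {sigma:.3f}")

# sigma drifts toward zero: each round of training on synthetic data
# discards a little of the original diversity and never gets it back.
```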

So what does all this mean? Are we headed to the content version of the gray goo scenario, in which everything is just a rehash of OpenAI error messages?

I’m going to be optimistic and say that we aren’t. And the reason we aren’t is the same reason that the spampocalypse of the early 2000s didn’t destroy email, which is that once we all agreed on the problem, and government and the major industry participants saw eye-to-eye on the outlines of a solution, it only took a couple of years to take what felt like an escalating, intractable problem and reduce it to a minor irritant. The big email providers realized that spam was endangering their business models, and governments realized it was annoying voters. The solution included better technology for spotting and eliminating spam (ironically, using some of the same statistical techniques that went on to power generative AI), regulation that set out clear rules and penalties for violations, more controls in the deeper layers of the internet where email actually moves, and a governance structure that enlisted users and companies to identify and eliminate spam (your spam folder isn’t a one-way street; the way you interact with it is a critical piece of how spam is identified).
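If you’re curious what that statistical technology looked like, here’s a toy sketch of the naive Bayes scoring at the heart of early spam filters. The word counts are invented, and a real filter learns them from millions of labeled messages; the point is just that spam-versus-ham is a word-probability game, the same family of math the article notes went on to power generative AI.

```python
import math
from collections import Counter

# Toy word counts; a real filter would learn these from millions of labeled messages.
spam_counts = Counter({"viagra": 50, "free": 80, "offer": 60, "meeting": 2})
ham_counts = Counter({"viagra": 1, "free": 20, "offer": 10, "meeting": 90})
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())

def spam_probability(message: str) -> float:
    """Naive Bayes: combine per-word likelihood ratios into one spam score."""
    log_odds = 0.0  # assume a 50/50 prior for simplicity
    for word in message.lower().split():
        # Laplace smoothing so unseen words don't zero out the product
        p_spam = (spam_counts[word] + 1) / (spam_total + 2)
        p_ham = (ham_counts[word] + 1) / (ham_total + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

print(spam_probability("free viagra offer"))   # close to 1.0
print(spam_probability("meeting at noon"))     # close to 0.0
```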

I believe that once AI content spam moves from being a nuisance to being a serious threat, something similar will happen. I don’t think we’ll be able to control AI content generation through watermarking or guardrails, because spammers will be able to use open-source technologies without any such controls. But we couldn’t control content generation with email spam, either. What we can control, and will control with AI content spam, is dissemination. Right now search engines aren’t particularly focused on filtering out content farms churning out fake news–these sites, as far as we know, receive more or less the same treatment as others, being prioritized or deprioritized based on usage and popularity. But that can change. The big platforms will get better at noticing the tell-tale signs of AI-generated content: not just signals within a single piece of content, but where that content exists, how it came to be there, and its similarities with other suspect content found on the web. They’ll do this to protect themselves–if you can’t find anything useful on the open internet, there’s no point in using a search engine. They’ll also do it to protect the intake valves of their LLMs, avoiding quality degradation from bad training data.
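As one minimal sketch of what such a tell-tale signal could look like, consider near-duplicate detection: the same templated article republished across many sites with a few words swapped is a classic content-farm pattern. Everything below (the shingle size, the threshold, the toy texts) is an illustrative assumption; production systems use far more scalable machinery such as MinHash or SimHash, plus many other signals.

```python
def shingles(text: str, k: int = 3) -> set:
    """Break text into overlapping k-word 'shingles' for comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles / total distinct shingles."""
    return len(a & b) / len(a | b) if a | b else 0.0

page_a = "Local team wins big game in overtime thriller on Friday night in Springfield"
page_b = "Local team wins big game in overtime thriller on Friday night in Shelbyville"

# A high score across many unrelated sites is one tell-tale of templated,
# machine-generated content; the 0.5 threshold is an arbitrary illustration.
score = jaccard(shingles(page_a), shingles(page_b))
print(f"similarity: {score:.2f}", "-> flag for review" if score > 0.5 else "")
```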

Government will get in on the act as well. It’s really hard to police the internet, of course, but clear rules and penalties will help with AI content spam. The 2003 CAN-SPAM Act didn’t eliminate email spam, obviously, but it drove it underground, causing many casual spammers to drop out of the industry and forcing others to invest significant time and resources in covering their tracks. Right now AI content spammers are living in Silicon Valley and getting venture-capital funding; those people will move on when the rules change, leaving the field to the smaller rump of professional trolls in gray geographies.

And all of us will be enlisted to help out. Just as we all label emails as spam or not-spam all day, enabling the tech companies to train and update their spam filters, we’ll soon be labeling poor-quality AI content. This will start pushing it down the search rankings and out of the training-data ingestion pipelines.
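Mechanically, that feedback loop can be as simple as folding each user’s click back into the label counts a filter like the one sketched earlier learns from. This is a toy sketch, not any platform’s actual pipeline:

```python
from collections import Counter

# Running word counts for each label; in production these would live in a
# shared store updated by millions of users' reports.
counts = {"spam": Counter(), "not_spam": Counter()}

def report(message: str, label: str) -> None:
    """Fold one user's spam/not-spam click back into the model's counts."""
    counts[label].update(message.lower().split())

# Every click is a training signal; the same loop could apply to
# "low-quality AI content" flags on search results or social posts.
report("free offer click now", "spam")
report("agenda for the quarterly meeting", "not_spam")
print(counts["spam"]["free"], counts["not_spam"]["meeting"])  # 1 1
```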

Of course, this won’t solve the problem entirely, just like we haven’t solved spam entirely. We can’t solve it entirely; if we were ever fully on top of the problem, it would mean that no more labeled data was entering the system to update the models, and new types of spam would have a free pass. It’s a self-regulating system, like predators and prey in nature. For the rest of your life, you’ll occasionally see a terrible, error-ridden, bot-created piece of content on the internet. But it will be at the margins, annoying but hardly overwhelming.

So overall I’m optimistic about the future of AI content spam. The caveat, however–and this is a big caveat–is that this isn’t 2002, the year of peak email spam, right before we got serious. It’s 1998, when spam was just getting noticed as an issue, and viewed by most people as a subject for jokes rather than any kind of serious inconvenience. Almost by definition, governments and industry only mobilize to solve these kinds of complex collective-action problems when they get so bad that serious people think the underlying ecosystem, whether that’s email or the internet as a whole, is about to collapse.

We’re a few years away from peak AI content spam. It’s going to get worse–probably a lot worse–before it gets better. And at some point, you’ll see mainstream publications speculating on whether this is the end of the internet as we know it.

When you see that news story, however, treat it as good news. That will be the signal for all of us to get serious about solving the problem, and we absolutely will be able to solve the problem. It will be a bumpy ride, but as we head into the valley of AI-content-spam despair, you should be confident that we are going to come out the other side.


Techtonic

I'm a company CEO and data scientist who writes on artificial intelligence and its connection to business. Also at https://www.linkedin.com/in/james-twiss/