How Much Does ChatGPT Rely on Reddit for Its Answers?
It's pretty well-known by now that the foundation of the major LLMs like ChatGPT is the immense training dataset made up of basically every piece of written content OpenAI could get their hands on.
At first blush, that seems great. If you feed the sum total of all human knowledge into a machine and then ask it questions, it should give you the most relevant, accurate answers, right?
Key Takeaways
- Reddit accounted for roughly 40% of AI citations in a Semrush study of 150,000 queries, reaching nearly 60% for ChatGPT specifically.
- ChatGPT's Reddit citations dropped sharply in late August, falling from ~55% to around 8-10%, likely due to an intentional OpenAI weighting change.
- Google and Perplexity consistently cited Reddit far less, hovering around 10% throughout the study period without major fluctuations.
- Reddit's crowdsourced upvote system prioritizes viral or early content over accuracy, making it a questionable authoritative source for AI training.
- Some subreddits now contain 15-60%+ AI-generated content, creating a risky feedback loop where LLMs potentially train on their own outputs.
The Challenge of Accuracy
Well, look at Google. Google has had that sum of human knowledge for ages, and it's always been difficult to get to the core of any given issue. In many cases, it's not so much about the answers as about how to ask the right questions in the first place.
So, it's a vast and challenging problem.
Then, you need to think about all of that data being fed into the machine. It's not just all of human knowledge. It's also all of human lies. All of human misinformation. All of human conspiracy theories. All of human fiction and fantasy. All of humanity's outdated information and since-retracted knowledge.
Some significant work has gone into annotating that data, but there's still only so much that can be done. After all, you have a very, very small portion of the population working on data annotation, and the sum output of all of humanity to tag. It's a drop in the bucket.
LLMs also don't work in a way that can separate fact from fiction. That is fundamentally not how they're designed or how they interact with language. LLMs assign each word a variety of possible shapes based on how it interacts with other words, and when you put in a prompt, you build an extremely complex shape out of all of the words you use. The output is the corresponding shape based on what statistically likely responses exist to the input shape. Several major AIs have begun adopting llms.txt as one way to help guide what information they surface.
Obviously, it's not actually shapes. It's all a lot of very complex black-box math when you get right down to it.
There's no mechanism here for fact versus fiction. Some post-hoc annotation can try to help, but there's only so much it can do. That's why hallucinations occur and, more importantly, why they're fundamental to the technology and aren't really a solvable problem.
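To make that concrete, here's a deliberately tiny sketch (this is not a real LLM, and the word counts are invented for illustration) of the core mechanic: generation samples statistically likely continuations, and there is no step anywhere that checks whether the result is true.

```python
import random

# Invented counts standing in for "how often word B followed word A"
# in training data. A real LLM learns billions of parameters, but the
# core move is the same: sample a statistically likely continuation.
bigram_counts = {
    "treat": {"the": 8, "your": 2},
    "the": {"flu": 5, "humors": 3, "symptoms": 2},
}

def next_word(word, rng):
    """Sample a continuation in proportion to observed counts.
    Note there is no truth check: 'humors' can win if the data says so."""
    options = bigram_counts[word]
    words = list(options)
    weights = [options[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(42)
print(next_word("the", rng))
```

If the training data happens to contain more four-humors text than flu guidance, the sampler faithfully reflects that; accuracy never enters the calculation.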
How would you address this issue?
One possibility is that you weight your training data.
When you're deciding how heavily certain information should be prioritized in the training dataset, you start from your goals. If accuracy is your goal, you would want to give higher priority to more trustworthy and authoritative data sources.
You'd rather have the CDC website as a source than a 1600s book on the four humors, when someone asks how to treat their flu, right?
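A minimal sketch of what that weighting might look like (the sources and numbers here are hypothetical, purely for illustration, not anyone's actual training configuration):

```python
# Hypothetical per-source weights; the values are invented
# for illustration only.
source_weights = {
    "cdc.gov": 5.0,            # authoritative medical guidance
    "wikipedia.org": 3.0,      # broad factual reference
    "reddit.com": 1.0,         # conversational, mixed accuracy
    "four_humors_1600s": 0.1,  # outdated medical text
}

def sampling_probability(source):
    """Chance a training example from `source` is drawn,
    in proportion to its assigned weight."""
    total = sum(source_weights.values())
    return source_weights[source] / total

for src in source_weights:
    print(f"{src}: {sampling_probability(src):.1%}")
```

Under weights like these, the model sees fifty times more CDC text than four-humors text, which is the whole point of the exercise.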
So, if you were to sit down and think about possible sources of top-tier authority, for the kinds of questions that people are going to be asking ChatGPT, what sorts of websites would you choose?
- People probably ask a lot of factual questions, so a big source of facts and information is a priority. Let's choose Wikipedia.
- People might ask for geographical information, so maybe some kind of mapping site. Let's pick Mapquest (since we have an adversarial relationship with Google).
- People might want product information, so let's pull that from a few sources. Say, Amazon, eBay, maybe a retail site like Target or Walmart.
That's something like a good start, right? But there are still a lot of gaps and things to consider.
- A lot of the most recent, modern knowledge people want is frustratingly only found in video. So, let's add in YouTube.
- Google is still a good source, and we can scrape their autocomplete and even their own AI overviews, so put Google in there.
- If you want boots-on-the-ground access to information from random experts, you could pull from the places where they tend to gather. A few of them, even; say, Medium, Forbes, and LinkedIn.
Of course, a lot of you are probably thinking of websites that are big authorities that I haven't mentioned yet. And, yes, a lot of those sites are on the list, too. NIH for health information, arXiv for scholarly articles and preprints, PR Newswire for breaking news, and sites like Microsoft.com for their support forums and Q&As.
And, of course, there's Reddit.
The Paradox of Reddit
Reddit is a strange beast to consider.
On one hand, it's one of the most-used social media sites in the world, clocking in at number ten (though since four of those top ten are Facebook, WhatsApp, Instagram, and Messenger, you could probably roll those into one).
It's also incredibly popular as a source of "real" information. There was a meme for a while in marketing, and even in casual user circles, that, to get Google to give you anything worthwhile, you had to put "site:reddit.com" in your search.
And, sure, there's some truth to that. Reddit is a place with a billion little niche communities, and it's a place where people who are experts in weird, niche topics tend to gather.
Anywhere else online, two people who really love making clay sculptures of insects might never meet, but Reddit might have a whole community just for them, and if not, they can find a home on r/insects, r/pottery, r/clay, r/polymerclay, r/somethingimade, or any of a hundred other related communities.
There are very few other places on the internet where as much individual expertise is gathered, around so many different topics.
At the same time, as anyone who has spent time on Reddit knows… it's not really that authoritative.
The feedback mechanism for Reddit is upvotes and downvotes, which means crowdsourced opinion, which means a few things.
- If ten people tell the same joke, the one who told it first is likely going to get all the upvotes and float to the top, even if another had a better delivery.
- If one person has a lot of authority and also a lot of enemies, their posts can be downvoted out of sight despite holding a lot of value.
- Some communities have strict rules and moderation that tend to cut out good information that doesn't meet the criteria.
There's also a lot of partisanship on Reddit. If you use Reddit as a source of reputable data and someone asks a question about a political issue, will you prioritize r/conservative, r/liberal, r/leftist, something else? A mixture of them all?
Reddit also often prioritizes viral content and the first plausible answer over actual accuracy. You often see a post with a very highly-upvoted top answer, and then a very detailed refutation of that answer, with a tenth of the votes, buried in the comments.
People love Reddit for answers because they can comb through the comments, learn, synthesize ideas, and figure out fact from fiction.
AIs love Reddit because it's a massive site with information on nearly every conceivable topic, and because it doesn't matter quite as much if the information is wrong. If you want to get your site mentioned in AI answers, Reddit's dominance in training data is worth understanding.
Techbros who run AI companies love Reddit because they spent their formative years on it and are steeped in Reddit culture.
How Much Does ChatGPT Cite Reddit?
Now let's get to the heart of the question. We know that ChatGPT loves Reddit as a source, but how much are we talking about here?
Semrush did a pretty big study around the end of last year that has a few key data points I want to talk about.
Point #1: Reddit Accounts for 40% of ChatGPT Citations
First of all is this point, which you've probably encountered already. Reddit makes up 40.1% of citations across LLMs like ChatGPT, Perplexity, and Google's AI Mode.
There are a few details about this that are worth mentioning.
First, that's one specific conclusion from one specific study. It's a very large study, looking at 150,000 queries across the LLMs, but it's still just one study.
Second, it averages the values from three different LLMs that cite sources over time. If you go check out the actual data, it's even more stark: for ChatGPT specifically, during the timeframe of that study, Reddit was actually closer to 60% of citations.
Third, these things change over time and depending on the use of the LLMs. Certain types of prompts will generate answers citing certain sources of information preferentially. That's where the weighting comes in. You're going to get a lot more Wikipedia sourcing when you search for facts of geography than you would for facts on a video game series.
Point #2: Data Weighting Changes Over Time
The second point I want to cover is that Semrush did a follow-up to their initial study, examining 230,000 queries over a period of three months.
This study showed two types of changes.
The first is that there are natural fluctuations over time. At the start of their study, Reddit accounted for around 55% of citations. It rose to a bit over 60% within the first month, but by the second month, it was down to 40%.
The second change is one made intentionally on the back end by OpenAI. It's unclear why this change was made, but at the end of August last year, Reddit (and Wikipedia) citations dropped sharply. By mid-September, Reddit accounted for just 8% or so of citations, and has only risen back to about 10% since.
Why did this change happen? There are a couple of theories.
One theory points to a change Google made at the same time. Formerly, you could use the URL parameter "num=100" in Google search results to see the top 100 results instead of just the top 20.
Side note, remember back when you could page your way dozens or hundreds of pages back into Google results, and see all kinds of weird stuff? Turns out nearly no one ever did that, and Google just stopped allowing it to happen, and it kind of didn't matter to most people.
Google removed that parameter, which ostensibly means that the LLM bots scraping up content wouldn't be able to scrape and cite anything beyond the top 20 results.
While this can potentially account for some drop, only around 34% of Reddit's ranking pages were in the 20-100 range, so that alone wouldn't cost Reddit 90% of their citations.
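As a rough sanity check on that theory, here's the arithmetic as a short Python sketch. The figures are the ones quoted above; the assumption that citations would fall in simple proportion to the pages lost is mine:

```python
# Figures from the discussion above.
share_before = 0.55  # Reddit's citation share before the change
share_after = 0.08   # roughly where it landed afterward
pages_lost = 0.34    # Reddit ranking pages in positions 20-100

# If losing those pages were the whole story, and citations fell in
# simple proportion, the share would land around:
expected_after = share_before * (1 - pages_lost)

# What actually happened, expressed as a relative drop:
observed_drop = 1 - share_after / share_before

print(f"expected share: {expected_after:.1%}")
print(f"observed relative drop: {observed_drop:.0%}")
```

Under that simple model, Reddit's share should have landed in the mid-30s, not at 8%, which is why the num=100 change alone doesn't seem to explain the drop.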
The more likely theory is simply that it's a reaction from OpenAI. A big study came out showing that ChatGPT drew more than half of its citations from Reddit, which had a lot of people asking, "Why don't I just search Reddit directly?" Others might inherently distrust Reddit's factual accuracy and thus lose trust in ChatGPT.
Rather than cope with that potential loss of trust and usage, OpenAI pushed an update to weight Reddit much lower. In this theory, the timing lining up with the removal of num=100 was a coincidence.
Or maybe it's both. Or maybe it's neither! Without OpenAI making a clear statement about it, we can't really know.
Point #3: This Was Just ChatGPT
One thing worth mentioning here is that all of this is just ChatGPT. Other LLMs didn't cite Reddit nearly as much from the outset, and the drop brought ChatGPT more in line with them. If you're trying to get visibility in AI-generated answers, it may be worth reading up on how to get chosen as a source on Perplexity as well.
Google and Perplexity always hovered around 10% Reddit citations and didn't really change over the course of the study.
Is Over-Citation of Reddit a Problem?
So, is ChatGPT citing Reddit more than other sites a problem?
OpenAI seemed to think so, but Reddit is still pretty popular and a source of unique information not readily available anywhere else, so it still comes out ahead as a source.
Critically, Reddit is also a source of a particular kind of information you might not think of: conversational style. ChatGPT is explicitly a conversational LLM that you chat with, not a raw model that just generates completions like a lot of other systems. There's that added layer of filter, and a lot of that "how people talk to each other" data comes from Reddit.
At the same time, it might be less of a problem than you'd think. It's always worth remembering that Semrush is a marketing company, and it's likely that most of the prompts they chose to monitor are in a space where Reddit might be more of an authority, like marketing topics and discussions. Without seeing their data, I don't know for sure.
One possible issue, though, is poisoning.
The LLMs aren't static. They didn't harvest all the data they could to build themselves and then sit there. They're constantly ingesting new data, which is why you can ask ChatGPT for information on a current event and get current-ish information.
Unfortunately, several studies have recently shown that an LLM ingesting AI-generated content and feeding it back into its training data makes it notably worse. It gets generations removed from "reality" and thus less and less accurate.
At the same time, more and more people are using LLMs to create content for them. That goes for blog posts, both text and images, but it also goes for all sorts of other content. ChatGPT isn't replacing content writers anytime soon, but it is changing how content gets made.
Depending on the source you find and the place you look, some subreddits have around 15% of their responses coming from LLMs like ChatGPT. Some people just ask the AI and post their results to back up their claims. Others use LLMs to generate their posts for them.
Other subreddits, particularly the really popular ones or ones where they're being targeted by bot or astroturf campaigns, can be 60% AI responses or more.
And that's just what's obvious. There's no currently foolproof way to positively identify LLM content, so it's possible a lot more is slipping through than people realize.
And that doesn't even account for the gimmick subreddits where LLMs just chat with each other, which you would hope wouldn't be ingested, but probably are.
People are abandoning sites where AI is rampant or has taken over completely, and Reddit has seen growth because of it, but Reddit isn't really any safer. If you're wondering whether Reddit ads are worth it for your business, the quality of the platform's content is worth factoring in.
And all of that is before you even get into the groups that are actively and maliciously targeting AI datasets to sway recommendations, poison their algorithms, or protect their content by making it dangerous to ingest. It is remarkably easy to inject something into these data sets and get LLMs to say whatever you want.
LLMs, ChatGPT, Reddit, and the Future of Information
So, where does this leave us?
That's a great question. I don't really have any firm conclusions here.
ChatGPT unquestionably cites Reddit for a lot of prompts, but it does so a lot less now than it did just a few months ago. In the future, that could change. OpenAI could drop Reddit even more, or start bumping it back up. There's no way to predict the future on this kind of thing.
I'm a content marketer. I've watched how LLMs have altered my industry, in good ways and bad ways. I've seen how Reddit has been used to build links, to build authority, to earn AI citations, and even as part of negative SEO.
There's a lot of unexplored space here. What lurks in the depths? We'll just have to look to find out.