What Is The TF*IDF Algorithm for Content and SEO?
The TF*IDF algorithm is a mathematical way of analyzing content and has been part of Google's algorithm for a long time. The algorithm works by measuring TF (term frequency) and the IDF (inverse document frequency). Understanding exactly how it all works involves a bit more technical know-how than you probably need, but it can be helpful to have a bird's-eye view of what it is and what it means.
Does the name "TF*IDF" make your eyes glaze over a bit? Me too. Let's dig into it.
How Do You Define TF*IDF?
TF*IDF is an old algorithm from data science. The TF part came from the work of Hans Peter Luhn back in 1957, and the IDF part was developed in 1972 by Karen Sparck Jones. It's information science, not search engine or content optimization, and it's just something Google uses as part of their algorithm. What is it, though? Let's start by breaking down what the TF*IDF is and its doing.
Let's say you're writing a blog post, 1,000 words long, about content marketing. You use the word "content" 50 times in the article. The frequency of that word (the number of times that it was mentioned) is 50 out of 1,000, or 50/1000, which is 0.05. This number is the TF: 0.05.
In some cases, term frequency is helpful to know whether you're using a term too often or too little for a piece of content. If you're writing a blog post about content marketing, but you somehow only use the word "content" in it once, Google is unlikely to think that it's very relevant. Conversely, Google might think you're keyword stuffing if you use it too much.
Unfortunately, the TF part of TF*IDF is the most straightforward part of this equation. What is IDF?
It’s a logarithmic function that looks like this:
IDF(content) = log (x,xxx,xxx,xxx/yyy,yyy)
Here, the X and the Y are essential, and you have no way of knowing them.
X is the total number of pieces of content in the entire corpus. In other words, this is the total number of web pages Google has indexed.
There's a bit of trickery to this, though. Your initial hunch might be that this is "the total number of indexable pieces of content on the internet," which is trillions upon trillions upon trillions of pages. Google discovered 130 trillion pages in one year alone, back in 2016. That's a lot.
Well, the thing is, Google's awareness of the internet and Google's index of content are different. Google doesn't index everything they find. There are many pages that Google can see but will refuse to index. A lot of it is easy to disregard; spam sites, noindexed pages, sites that violate policies somehow, ad landing pages, and so on.
Y, meanwhile, is the total number of pages in the corpus that include the word in question, "content," in this case. It's by necessity going to be a smaller number than the total number, but depending on the word, it may or may not be much smaller. A prevalent word such as "the" or "a" or "and" will show up on most pages, while an uncommon phrase like "syzygy" might not show up in many.
The log function creates your weighted value. The more common a word is, the less value it is assigned, and the less familiar the word is, the more weight it is given. Again, this IDF score is measured across the entirety of Google's index.
It's essentially just a sophisticated equation to codify the critical relevant keywords to your website and the internet as a whole.
How Does Google Use TF*IDF?
TF*IDF is a long-known and long-understood algorithm for searching and retrieving relevant content. It's not just for Google, but the entire field of information science; Google uses it as part of their overall algorithm.
TF*IDF is one of many ways Google indexes, analyzes, and codifies the content they index. John Mueller, Google's public spokesman, said this:
"With regards to trying to understand which are the relevant words on a page, we use a ton of different techniques from information retrieval. And there are tons of these metrics that have come out over the years."
In other words, he's trying to tell people in a not-so-subtle way that it's just one of many factors. The question is, why?
"The other thing is… this is a fairly old metric, and things have evolved quite a bit over the years. There are lots of other metrics as well."
You can check out the full hangout session in this video:
TF*IDF is a very old metric in data science, and it's been part of Google since very early on. Chances are pretty good that whatever variation of TF*IDF Google uses is not the same as the core data science TF*IDF described above. Without having an inside look into how Google does their data sorting, there's absolutely no way that we can guess what algorithm they use.
"In a world where AI, neural networks, and machine learning are the norms, TF-IDF is like a kids bike on training wheels compared to a Ferrari." – Roger Montti.
The other reason is, of course, that it's just one small part of the overall Google algorithm. I'm not going to link that "200 search ranking factors" post again (how many backlinks have I sent your way, Brian?), but you know the one I'm talking about. The point is, there are way more important things to care about for page SEO than an antiquated and inaccurate mathematical model.
We create blog content that converts - not just for ourselves, but for our clients, too.
We pick blog topics like hedge funds pick stocks. Then, we create articles that are 10x better to earn the top spot.
Content marketing has two ingredients - content and marketing. We've earned our black belts in both.
Should You Focus on TF*IDF?
TF*IDF is one of those metrics that sounds scientific and important until you realize it's just one of a large number of ways a massive algorithm like Google uses to codify its content. It is a more elaborate form of keyword density when you come right down to it.
Those of you who follow my blog frequently know that I'm not a fan of keyword density. In the bad old days of the pre-Panda internet, keyword density was a primary focus for many website owners, and it led to a lot of over-optimized content. Hammering your target keyword repeatedly in your content doesn't work the way it used to and is more likely to hurt your content strategy and SEO strategy.
The biggest problem with trying to use TF*IDF for technical SEO, though, is the IDF part. Remember how it's a number generated by using the word across every page on the entire internet? Yeah, none of us have the resources to begin to estimate, let alone accurately calculate, that number. You can make all the guesses you want, but they'll always be just that: guesses.
Sure, a few companies out there claim to offer TF-IDF tools. How do they do it? My guess is statistical sampling, with a much smaller index than Google search uses.
Amazon has a massive dataset with Alexa.com, but even their data is a rough estimate. There are significant factors that skew these metrics - for example, marketers are more likely to have the Alexa toolbar installed than non-marketers, so naturally, we can expect marketing-related sites to have an artificially-high Alexa score. Is Search Engine Journal really among the top 5,000 websites in the world? Perhaps - or perhaps marketers are more likely to use marketing tools and read marketing content. Alexa's data also skews heavily towards United States users, which make up more than two-thirds of their dataset.
There's another problem with trying to use TF*IDF for your SEO. Google has been experimenting with machine learning, semantic indexing, and other natural language analysis systems for a long time. Check out this blog post they published in 2014:
You know how sometimes when you search for a term on Google, you get a ranking page that is relevant to your query but isn't using that search term at all? And you think, sure, Google understands the thesaurus, they used some synonyms, and it's all good.
How does that square up with TF*IDF? If you were a webmaster optimizing for your TF*IDF search terms, you're not capitalizing on any of that natural language or synonymous query power. You're relying on Google to do that heavy lifting for you. And you know what? Google might, but other sites that use more natural language will work better.
Take a look at this screenshot above. In Google's early AI keyword tests, they used a topic "Should Tyco's Auditors Have Told More?", and then it spit out some relevant keywords that they'd like to see in that article. Looks a little familiar to what these TF*IDF tools are doing by scraping search results, doesn't it?
So, should you focus on TF*IDF?
How Do You Use TF*IDF?
The thing about TF*IDF analysis is that the tools that use it best look for common words with similar importance. If you plug in "content marketing" as a phrase and run an analysis on it, you'll find related keywords that don't share the exact words the way you would if you were scraping Google autocomplete results. Moz uses the example of "coconut oil" and finds related phrases like "MCT oil," "capric acid," and "osteoporosis."
TF*IDF analysis can be used in a few compelling ways.
- Analysis of underperforming but high-potential content can show you keywords you missed that you might be able to include to boost the relevance of the content and get it to rank higher in the SERPs.
- Analysis of content can reveal "subject drift," where the relevance of a post degrades over time because the overall discussion of the topic has shifted, so your post is no longer as relevant as it once was. You can readjust your focus and emphasis to realign your content with current discussions.
- Analysis of content across your whole site and across the whole of your industry (if you can find such an index or a tool that has one) can help identify gaps in content coverage. It may also uncover potential posts you could write to take advantage of new keywords you may have missed.
You can certainly try to make use of TF*IDF in this way, but you're probably not going to see the miracle growth some people would claim you will.
Is There a Problem with TF*IDF for SEO?
My biggest problem with using TF*IDF is two-pronged.
1. First, far too many SEO and digital marketing people seem to think it's some magical, extra-effective secret behind how Google works when it's just a tiny part of an overall massive algorithm. We're not talking about a mac and cheese recipe where the secret ingredient is nutmeg here. It's more like we're talking about a mac and cheese recipe where many people are convinced the secret is a particular brand of pasta that isn't sold anymore and was made of sub-par ingredients.
Why is everyone so focused on it? Well, it's attractive. It's obscure and math-based and is scientific-sounding, so many people feel like it has a lot of significance. Meanwhile, the reality is, it's a primitive and simplistic model. Google left it behind over a decade ago, almost certainly.
On top of that, TF*IDF has no nuance to it. If you run a TF-IDF analysis on the word "coke," you're likely to get recommended keywords like "soda" and "Pepsi," but also "illegal" and "petroleum," because "coke" can refer to the soft drink, to the narcotic, and the coal byproduct. The TF-IDF values don't care about these differences because they all use the same word.
I'm not the only one to recognize problems with using TF*IDF (and that post goes into a lot more detail than I do), but a few surprisingly high-profile companies have bought into it, including Moz.
2. My second problem with using TF*IDF is that, well, it seems astroturfed. Virtually every pro-TF*IDF blog post you see out there (including this one from Moz, Semrush, and Onely) comes down to one thing: use Ryte. Ryte, the business content analysis platform that conveniently seems to be just about the only tool on the web offering a TF*IDF analysis of content.
Yeah, to me, that says that Ryte decided they found a unique marketing angle, paid many people to write about this secret sauce SEO technique, and are getting a ton of referrals from people who don't know what TF*IDF is or how it's used. Now, perhaps I'm wrong, and maybe Ryte has a very sophisticated version that mirrors Google, but I don't think it comes anywhere close to Google's data. I suspect they're more likely to use more traditional keyword analysis techniques and claim they use TF*IDF to dazzle.
Should you bother with it for your marketing and keyword research? Absolutely. I think some TF*IDF software is worth playing with. I'm a big believer in Clearscope and MarketMuse, and I think comparing your content and keyword usage to your competitors is very effective. It helps optimize your content, shapes your strategy, and enlightens you on tangential talking points that you may have missed. It could be interesting to play with and learn about, and there are some promising developments at these companies. At the very least, it may help you add a few hundred words to your content by speaking about those topics.
My parting advice is to take TF*IDF recommendations with a grain of salt and never let them compromise your content's quality, relevance, or user experience.