RankBrain + Penguin + The Disavow Tool = Google.CON
On Friday, 23 September 2016, Google officially announced that Penguin 4.0 was being rolled out and that the Penguin algorithm had been integrated into their core algorithm.
It took them a long time, and everyone in the SEO industry kept waiting for this to happen. Why was it delayed? What was Google up to?
In a nutshell, they needed more data to train RankBrain, and getting that data takes time and a lot of (human) effort that Google did not have the internal resources to pull off.
Google released Penguin 1.0 knowing it would do collateral damage. They then released the Disavow Tool as a (false hope and lifeline to the SEO industry) with only ONE main intention. And that ONE intention was to gather huge amounts of crowdsourced, manually reviewed link-spam data that could be fed into RankBrain as its seed data, training it via its deep learning mechanism so that it could eventually work with Penguin 4.0 and make it run in real time. And they wanted Penguin to run in real time so it could bubble sites up to their manual review team as things happen.
In short, that’s how the con took place.
I’ll get to how it was planned out in a bit, but let's first try to understand what Penguin real time is, and why and how (inaccurately) its previous versions worked.
Note: While this post is not about debating whether Penguin real time is good for SEO in general and has made things easier… it's about the unfortunate path Google took to get where they are now in combating search engine spam. If you really don’t care about this, then this post is probably not for you!
So what Penguin real time essentially means is that the SEO signals falling under the Penguin algorithm now affect your site in real time… (depending, of course, on how often your site, or the links that point to it, are crawled and re-crawled).
Let's pan back a bit…
Up until now all the previous versions of Penguin updates were NOT real time.
Google did not have the ability to do this, nor the resources or the technology to identify whether a link was toxic in “real time” with high accuracy. And they can’t afford to get the accuracy wrong, especially when things are happening in real time.
They did get it wrong on many occasions with their initial, non-real-time Penguin updates and rollouts, and that is why many innocent sites got hit. But with Penguin 4 going real time, they are (RankBrain is) very confident in link classification.
So, with the previous Penguin updates, if your site was demoted by Penguin filters – then you would have to wait for months for any hope of recovery.
You would have to wait until the next data refresh took place so you could actually see the positive effect of any fixes and tweaks you did AFTER you were hit by Penguin.
This was very frustrating, as it meant you had no way of knowing immediately whether what you were doing to fix things was actually working.
If they could achieve Penguin real time, they would solve the problem of identifying toxic links in real time with almost 100% accuracy.
And they would no longer hurt innocent sites for months at a stretch, because with a real-time rollout of Penguin, corrections would also happen in real time. Nor would they need any more discrete Penguin updates, because it would now work perfectly in congruence with RankBrain.
Penguin Real Time is a big thing for Google
With the promise of real-time Penguin, they can now simply either give you positive juice, or discard your link and devalue it (without giving you negative juice for it). This was not possible in the past because they didn’t have RankBrain trained with data. And you can now easily fix things in real time too, by removing links that get devalued if you trip a threshold in your link profile.
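As a rough illustration of the devalue-versus-penalize shift described above, here is a hypothetical sketch. The scoring model is entirely made up for illustration and is in no way Google's actual code: pre-4.0 Penguin demoted you for toxic links (negative juice), while real-time Penguin simply discards them.

```python
# Hypothetical sketch (not Google's code): toxic links used to subtract
# from your link profile score; under real-time Penguin they are simply
# devalued, i.e. they contribute nothing.

def link_profile_score(links, penalize_toxic):
    """links: list of (value, is_toxic) pairs, where value is the
    positive 'juice' a clean link would pass."""
    score = 0.0
    for value, is_toxic in links:
        if not is_toxic:
            score += value       # clean link: positive juice
        elif penalize_toxic:
            score -= value       # old Penguin: negative juice
        # real-time Penguin: toxic link is discarded (adds nothing)
    return score

links = [(1.0, False), (2.0, False), (3.0, True)]
print(link_profile_score(links, penalize_toxic=True))   # 0.0 -> site demoted
print(link_profile_score(links, penalize_toxic=False))  # 3.0 -> link merely devalued
```

Under the old behavior the single toxic link wipes out the whole profile; under the new behavior you keep the value of your clean links, which is why tripping a threshold is now fixable in real time.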
I’ve said it before in my SEO 2016 post here that RankBrain is going to be the heart of the algorithm as we move into the future.
But, is there any correlation between Penguin 4 and RankBrain?
Just like RankBrain works with Panda it also works with Penguin. They’re both part of the deep learning algorithm of search now, and they both needed large amounts of seed data to be able to function accurately.
So, if you’re not familiar with how RankBrain works, let's step back a bit and go over how it does.
RankBrain is a deep learning algorithm (based on Artificial Intelligence) that needs to be trained to learn stuff from large amounts of seed data so it can apply patterns in the data it gets to predict things and execute calculations with a very high certainty.
This seed data needs to be natural and accurate.
Were you to feed RankBrain junk data as its initial data set, you would ruin its brain (it would go “mad”).
To explain in human terms, this is exactly why, when a baby or child is growing up, you need to make sure you give it the best training in those initial years, from all aspects, if you want the child to grow up with special abilities. The brain is like a sponge, and deep learning is simply how computers mimic the way we train the brain.
Except, it happens at a much faster pace now with machines – given the advances in hardware architecture and GPUs and neural networks etc.
So, in context of Google’s algorithm – “naturally occurring” initial data has to be gathered together and fed to RankBrain.
For example, for RankBrain to learn how to play the game of Go, one simply has to feed it millions and millions of positions from real games. It doesn’t have to learn any rules (as in the case of Chess). This is a slightly different kind of pattern matching.
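To make the seed-data point concrete, here is a toy sketch, in no way Google's actual system: a nearest-centroid “toxic link” classifier whose entire decision boundary comes from human-labelled seed examples. The two features (exact-match anchor ratio and a domain spam score, both 0 to 1) are invented for illustration; the point is that the classifier is only as good as the labels it was seeded with.

```python
# Toy illustration (hypothetical, not Google's code): a nearest-centroid
# classifier trained on labelled seed data. Garbage labels in, garbage
# decision boundary out -- which is why seed data must be clean.

def centroid(points):
    """Mean of a list of 2-D feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def train(seed):
    """seed: list of ((f1, f2), label) with label 'toxic' or 'clean'."""
    return {
        "toxic": centroid([f for f, lbl in seed if lbl == "toxic"]),
        "clean": centroid([f for f, lbl in seed if lbl == "clean"]),
    }

def classify(model, features):
    """Return the label whose centroid is nearest to the features."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda lbl: sqdist(model[lbl], features))

# Clean, human-labelled seed data -> a sensible decision boundary.
seed = [
    ((0.9, 0.8), "toxic"), ((0.8, 0.9), "toxic"),
    ((0.1, 0.2), "clean"), ((0.2, 0.1), "clean"),
]
model = train(seed)
print(classify(model, (0.85, 0.9)))  # toxic
print(classify(model, (0.15, 0.1)))  # clean
```

Swap the labels in `seed` and the same code confidently misclassifies every new link: the algorithm never changes, only the seed data does.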
Google has used RankBrain very effectively in the past and continues to do so… and it drives signals to the Panda component of its search algorithm. With Panda, the initial seed training took place before they announced RankBrain, while they silently tracked everything Chrome users did on the web and on Google search.
Google has full control over Chrome and an ability to track users using it.
Google is able to feed all the Panda-related data into RankBrain by tracking the behaviour and engagement metrics of real people as they click on search results and visit sites.
However, initially with the Penguin 1.0 algorithm release – obviously Google could not do any sort of real time analysis and real time correction of data (link profiling and negative points).
So, that's why when Penguin 1.0 was initially released it devastated scores of innocent sites (a high number of false positives).
Google’s RankBrain algorithm was not around and Penguin did not have any way to accurately profile links.
Penguin did not have RankBrain, which in turn did not have the massive seed data to assist it in identifying if a new link was toxic or not.
All they could do was profile a limited number of links, as many as their team of some 10,000 employees in Bangalore could churn through daily, working off their guidelines document.
Let's step back a few years to understand the amazing game plan and trickery behind the con Google hatched to get the human data it desperately needed for Penguin real time.
Before Penguin 1.0, Google was simply devaluing toxic links. A bad link could not give your site a negative SERP ranking value or be dangerous towards your link profile.
That’s because Google could not identify if a link was toxic or not with a high accuracy ratio.
And so, when they launched Penguin 1.0 with an initial internal seed data set, probably gathered together by employees, they (the unannounced RankBrain algorithm) could not yet identify toxic link profiles with high accuracy or certainty.
Why? Because the seed data set was churned internally by their team and limited in quantity, so it was not very accurate.
For any deep learning algorithm to work with a high accuracy – the seed baseline data and training needs to be as clean (natural) as possible and in large quantities.
They knowingly let Penguin 1.0 out into the wild, aware it would hurt many innocent sites with false positives… because it was NOT yet ready to work the way it was supposed to (inadequate amounts of data from their internal team).
But why would they release it if they knew it was not ready yet and would hit a large number of innocent sites with tons of false positives?
And this my friends is where the genius behind the con lies.
Google used this fact to their advantage.
They wanted people (the power of the crowd) to help identify toxic links that they could feed into RankBrain.
But, how could they get them to voluntarily do this for free?
Why… release an inaccurate version of a half ready algorithm and get people to start freaking out all over the world as their sites got hit?
Then play innocent and throw them a lifeline in the form of the cleverly disguised Disavow Tool.
As a first step, Google had to ruthlessly lash out at sites and inject fear and desperation into SEOs IF they ever wanted them to respond to their next move: throwing a lifeline to these desperate people and teams all over the world.
Google needed to cleverly disguise a method for getting large amounts of manually crowdsourced human data, the massive data set they needed (and could not have produced with internal resources), fed right into RankBrain.
In the past (before RankBrain was trained with data from the Disavow Tool) there was no way, Google and its employees could feed massive amounts of toxic link profile data into RankBrain.
And you’ve probably guessed it by now.
The Disavow Tool was a joke.
It never really worked, and it was never really supposed to work the way Google said it would. And the core team at Google knows that. They couldn’t tell us what they were up to, just like they couldn’t tell us when they initially started recording our behaviour with Chrome.
The initial data sets that get fed into any deep learning algorithm, have to be natural and clean.
It was impossible to get out of a penalty by just using only the disavow tool without actually deleting your toxic links at the same time.
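For reference, the disavow file that SEOs submitted is just a plain-text (UTF-8) file with one URL or `domain:` entry per line and `#` for comments, per Google's documented format. The domains below are placeholders:

```text
# Links from pages we could not get removed (owner contacted, no reply)
http://spam.example.com/paid-links/page1.html
http://spam.example.com/paid-links/page2.html

# Disavow an entire domain
domain:shadyseodirectory.example
```

Notice what each submission hands over: a human-curated list of URLs and domains that a site owner has personally judged to be toxic, exactly the labelled data a link classifier would need.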
So, with all the (false) promises Google made about what the disavow tool would do for a penalized site's recovery, we SEOs went crazy and started submitting large sets of toxic links (that we examined manually and probably created ourselves) for our clients. In the process we also made more money by charging for such services… but in essence we were helping RankBrain so it could power a real-time Penguin.
SEO clients were essentially funding the seed data RankBrain needed for real time Penguin!
Google was able to gather together massive amounts of data from the power of real humans and the crowd.
They thus started collecting massive amounts of seed data. Something they needed for RankBrain to work its deep learning wonders as accurately as possible, so it could perform its AI pattern matching in the future, without telling us they would be using it for RankBrain!
After all, if they had told us they were going to use it as seed data for Penguin real time, we could have submitted large chunks of gibberish data to trick RankBrain and make real-time Penguin inaccurate… which would have taken us back to 2013, and they would have had to figure out another way to get the large data sets they needed 🙂
I take my hat off to the genius core team that devised the steps of this amazing con. I’m guessing only a few at the top knew the entire game and how it would pan out, and Matt Cutts couldn’t handle the pressure anymore, which is why he exited (or was asked to leave? who knows).
So, with the Disavow Tool, Google threw a cleverly disguised lifeline towards SEOs in the form of a magic pill that it claimed would save their penalized sites.
As we SEOs drowned, running around like headless chickens, they literally threw us this lifeline called the “Disavow Tool” that promised to work as advertised but was in reality only collecting the large droves of manual data Google desperately needed.
All along, they knew innocent people and SEOs would be shaken by Penguin and we would become desperate to recover our sites.
Google wanted us to be desperate. Because they wanted us to work manually and toil in large numbers to identify toxic links so it could feed this data into RankBrain.
Google engineers had discovered the wonders of AI and deep learning, and how they could save their engine from all the spam that was surfacing and could destroy the company itself.
But they had to execute a carefully crafted set of steps… for the media, for people, for users. And it had to make “sense”.
I’ll say it again… they had RankBrain being developed underground all this time, and they knew that it was no good without any seed data.
And data is what they needed… data in massive amounts.
Natural, authentic data… that could only be analyzed and categorized by real humans, manually.
Without this RankBrain is an empty head with no memories to act upon.
And once RankBrain was fed this baseline data that we humans profiled for it, they could use it to identify toxic links automatically with a very high probability (almost 100%).
The more authentic data, and the more amounts of data they had – the better it could get at identifying toxic links “in real time”.
So, over a few years, Google fed RankBrain with toxic link data identified by real humans (the crowd). We SEOs waited and waited for Penguin 4 to come out.
But, Google kept holding back its release.
Because they wanted more and more human submitted data. They had only one shot at this, and they couldn’t screw it up.
Finally, once they reached their accuracy tipping point, they released Penguin real time.
And all that massive seed data we submitted to them over the years now powers the AI that lets them confidently identify toxic links in real time using deep learning.
Now that Google has a far more evolved RankBrain in terms of the data it holds, they can very accurately devalue a bad backlink with almost 100% certainty in real time, and in turn allow us to fix things in real time too.
Sure, they’ve taken a giant leap forward in combating spam in a fair manner, so sites that are unfairly hit (even by negative SEO) can now get out of the sandbox faster… but let's look at the bigger picture here.
We humans have been conned into training an AI that is supposed to work for our own good in the long run – but may eventually take a shape of its own.
If you’re wondering whether any of this could potentially be harmful to us, who knows?
Google engineers have admitted that they themselves don’t understand how RankBrain comes up with a lot of its ranking data – which could be used to our advantage as we find holes before engineers do.