
AN ANONYMOUS SOURCE SHARED THOUSANDS OF LEAKED GOOGLE SEARCH API DOCUMENTS WITH
ME; EVERYONE IN SEO SHOULD SEE THEM

By Rand FishkinMay 27, 2024

On Sunday, May 5th, I received an email from a person claiming to have access to
a massive leak of API documentation from inside Google’s Search division. The
email further claimed that these leaked documents were confirmed as authentic by
ex-Google employees, and that those ex-employees and others had shared
additional, private information about Google’s search operations.

Many of their claims directly contradict public statements made by Googlers over
the years, in particular the company’s repeated denial that click-centric user
signals are employed, denial that subdomains are considered separately in
rankings, denials of a sandbox for newer websites, denials that a domain’s age
is collected or considered, and more. 

Naturally, I was skeptical. The claims made by this source (who asked to remain
anonymous) seemed extraordinary–claims like:

 * In their early years, Google’s search team recognized a need for full
   clickstream data (every URL visited by a browser) for a large percent of web
   users to improve their search engine’s result quality.
 * A system called “NavBoost” (cited by VP of Search, Pandu Nayak, in his DOJ
   case testimony) initially gathered data from Google’s Toolbar PageRank, and
   desire for more clickstream data served as the key motivation for creation of
   the Chrome browser (launched in 2008).
 * NavBoost uses the number of searches for a given keyword to identify trending
   search demand, the number of clicks on a search result (I ran several
   experiments on this from 2013-2015), and long clicks versus short clicks
   (which I presented theories about in this 2015 video).
 * Google utilizes cookie history, logged-in Chrome data, and pattern detection
   (referred to in the leak as “unsquashed” clicks versus “squashed” clicks) as
   effective means for fighting manual & automated click spam.
 * NavBoost also scores queries for user intent. For example, certain thresholds
   of attention and clicks on videos or images will trigger video or image
   features for that query and related, NavBoost-associated queries.
 * Google examines clicks and engagement on searches both during and after the
   main query (referred to as a “NavBoost query”). For instance, if many users
   search for “Rand Fishkin,” don’t find SparkToro, and immediately change their
   query to “SparkToro” and click SparkToro.com in the search result,
   SparkToro.com (and websites mentioning “SparkToro”) will receive a boost in
   the search results for the “Rand Fishkin” keyword.
 * NavBoost’s data is used at the host level for evaluating a site’s overall
   quality (my anonymous source speculated that this could be what Google and
   SEOs called “Panda”). This evaluation can result in a boost or a demotion.
 * Other minor factors such as penalties for domain names that exactly match
   unbranded search queries (e.g. mens-luxury-watches.com or
   milwaukee-homes-for-sale.net), a newer “BabyPanda” score, and spam signals
   are also considered during the quality evaluation process.
 * NavBoost geo-fences click data, taking into account country and
   state/province levels, as well as mobile versus desktop usage. However, if
   Google lacks data for certain regions or user-agents, they may apply the
   process universally to the query results.
 * During the Covid-19 pandemic, Google employed whitelists for websites that
   could appear high in the results for Covid-related searches
 * Similarly, during democratic elections, Google employed whitelists for sites
   that should be shown (or demoted) for election-related information

And these are only the tip of the iceberg.

Extraordinary claims require extraordinary evidence. And while some of these
overlap with information revealed during the Google/DOJ case (some of which you
can read about on this thread from 2020), many are novel and suggest insider
knowledge.

So, this past Friday, May 24th (following several emails), I had a video call
with the anonymous source.

An anonymized screen capture from Rand’s call with the source

Update (5/28 at 10:00am Pacific): The anonymous source has decided to come
forward. This video announces their identity, Erfan Azimi, an SEO practitioner
and the founder of EA Eagle Digital.



Prior to the email and call, I had neither met nor heard of Erfan. He asked that
his identity remain veiled, and that I merely include the quote below:

An eagle uses the storm to reach unimaginable heights.
– Matshona Dhliwayo

After the call I was able to confirm details of Erfan’s work history, mutual
people we both know from the marketing world, and several of their claims about
being at particular events with industry insiders (including Googlers), though I
cannot confirm details of the meetings nor the content of discussions they claim
to have had.

During our call, Erfan showed me the leak itself: more than 2,500 pages of API
documentation containing 14,014 attributes (API features) that appear to come
from Google’s internal “Content API Warehouse.” Based on the document’s commit
history, this code was uploaded to GitHub on Mar 27, 2024 and not removed until
May 7, 2024. (Note: because this piece was, post-publishing, edited to reflect
Erfan’s identity, he’s referred to below as “the anonymous source”).

This documentation doesn’t show things like the weight of particular elements in
the search ranking algorithm, nor does it prove which elements are used in the
ranking systems. But, it does show incredible details about data Google
collects. Here’s an example of the document format:

Screen capture of leaked data about “good” and “bad” clicks, including length of
clicks (i.e. how long a visitor spends on a web page they’ve clicked from
Google’s search results before going back to the search results)

After walking me through a handful of these API modules, the source explained
their motivations (around transparency, holding Google to account, etc.) and
their hope: that I would publish an article sharing this leak, revealing some of
the many interesting pieces of data it contained, and refuting some “lies”
Googlers “had been spreading for years.”

A sample of statements from Google representatives (Matt Cutts, Gary Illyes,
and John Mueller) denying the use of click-based user signals in rankings over
the years


IS THIS API LEAK AUTHENTIC? CAN WE TRUST IT?

A critical next step in the process was verifying the authenticity of the API
Content Warehouse documents.  So, I reached out to some ex-Googler friends,
shared the leaked docs, and asked for their thoughts. Three ex-Googlers wrote
back: one said they didn’t feel comfortable looking at or commenting on it. The
other two shared the following (off the record and anonymously):

 * “I didn’t have access to this code when I worked there. But this certainly
   looks legit.”
 * “It has all the hallmarks of an internal Google API.”
 * “It’s a Java-based API. And someone spent a lot of time adhering to Google’s
   own internal standards for documentation and naming.”
 * “I’d need more time to be sure, but this matches internal documentation I’m
   familiar with.”
 * “Nothing I saw in a brief review suggests this is anything but legit.”

Next, I needed help analyzing and deciphering the naming conventions and more
technical aspects of the documentation. I’ve worked with APIs a bit, but it’s
been 20 years since I wrote code and 6 years since I practiced SEO
professionally. So, I reached out to one of the world’s foremost technical SEOs:
Mike King, founder of iPullRank.

During a 40-minute phone call on Friday afternoon, Mike reviewed the leak and
confirmed my suspicions: this appears to be a legitimate set of documents from
inside Google’s Search division, and contains an extraordinary amount of
previously-unconfirmed information about Google’s inner workings.

2,500 technical documents is an unreasonable amount of material to ask one man
(a dad, husband, and entrepreneur, no less) to review in a single weekend. But,
that didn’t stop Mike from doing his best.
He’s put together an exceptionally detailed initial review of the Google API
leak here, which I’ll reference more in the findings below. And he’s also agreed
to join us at SparkTogether 2024 in Seattle, WA on Oct. 8, where he’ll present
the fully transparent story of this leak in far greater detail, and with the
benefit of the next few months of analysis.



--------------------------------------------------------------------------------


QUALIFICATIONS AND MOTIVATIONS FOR THIS POST

Before we go further, a few disclaimers: I no longer work in the SEO field. My
knowledge of and experience with SEO is 6+ years out of date. I don’t have the
technical expertise or knowledge of Google’s internal operations to analyze an
API documentation leak and confirm with certainty whether it’s authentic (hence
getting Mike’s help and the input of ex-Googlers).

So why publish on this topic?

Because when I spoke to the party that sent me this information, I found them
credible, thoughtful, and deeply knowledgeable. Despite going into the
conversation deeply skeptical, I could identify no red flags, nor any malicious
motivation. This person’s sole aim appeared quite aligned with my own: to hold
Google accountable for public statements that conflict with private
conversations and leaked documentation, and to bring greater transparency to the
field of search marketing. And they believed that, despite my years removed from
SEO, I was the best person to share this publicly.

These are goals I cared about deeply for almost two decades. And while my
professional life has moved on (I now run two companies: SparkToro, which makes
audience research software and Snackbar Studio, an indie video game developer),
my interest in and connections to the world of Search Engine Optimization remain
strong. I feel a deep obligation to share information about how the world’s
dominant search engine works, especially information Google would prefer to keep
quiet. And sadly, I’m not sure where else to send something this potentially
groundbreaking.

Years ago, before he left journalism to become Google’s Search Liaison, Danny
Sullivan would have been my go-to source for a leak of this magnitude. He had
the gravitas, resume, knowledge, and experience to examine a claim like this and
present it fairly in the court of public opinion. There have been so many times
in the last few years I’ve wished for Danny’s calm, even-handed,
tough-but-fair-on-Google approach to newsworthy pieces like this–pieces that
could reach as far as the company’s statements on the witness stand (e.g. his
eloquent writing on Google’s indefensible privacy claims about organic keyword
data).

Whatever Google’s paying him, it isn’t nearly enough.

Apologies that instead of Danny, dear reader, you’re stuck with me. But since
you are, I’m going to assume you may not be familiar with my background or
credentials, and briefly share those.

 * I started doing SEO for small businesses in the Seattle area in 2001, and
   co-founded the SEO consultancy that would become Moz (originally called
   SEOmoz) in 2003.
 * For the next 15 years, I worked in the search marketing industry and was
   often recognized as an influential leader in that field. I
   authored/co-authored Lost and Founder: A Painfully Honest Field Guide to the
   Startup World, The Art of SEO, and Inbound Marketing and SEO.
 * Publications including the WSJ, Inc, Forbes, and hundreds more have written
   about and quoted me on the world of SEO and Google search, many of them
   citing a popular weekly video series I hosted for a decade: Whiteboard
   Friday.
 * Moz grew to 35,000+ paying customers of its SEO software, revenues of $50M+,
   and a team of ~200 before being sold to a private equity buyer in 2021. I
   left in 2018 and started SparkToro, and in 2023, Snackbar Studio.
 * I dropped out of college at the University of Washington in 2001 and do not
   hold a degree, yet my work on Google and SEO has been cited by the United
   States Congress, the US Federal Trade Commission, the Wall Street Journal,
   New York Times, and John Oliver’s Last Week Tonight, among dozens of others.
 * I hold several patents around the design of a web scale link index, and am
   the creator of numerous link-index metrics, including Domain Authority, a
   machine-learning based score commonly used in the digital marketing world to
   assess a website’s capability to rank in Google’s search engine.

OK. Back to the Google leak.

--------------------------------------------------------------------------------


WHAT IS THE GOOGLE API CONTENT WAREHOUSE?

When looking through the massive trove of API documentation, the first
reasonable set of questions might be: “What is this? What is it used for? Why
does it exist in the first place?”



The leak appears to come from GitHub, and the most credible explanation for its
exposure matches what my anonymous source told me on our call: these documents
were inadvertently and briefly made public (many links in the documentation
point to private GitHub repositories and internal pages on Google’s corporate
site that require specific, Google-credentialed logins). During this
probably-accidental, public period between March and May of 2024, the API
documentation was spread to Hexdocs (which indexes public GitHub repos) and
found/circulated by other sources (I’m certain that others have a copy, though
it’s odd that I could find no public discourse until now).

According to my ex-Googler sources, documentation like this exists on almost
every Google team, explaining various API attributes and modules to help
familiarize those working on a project with the data elements available. This
leak matches others in public GitHub repositories and on Google’s Cloud API
documentation, using the same notation style, formatting, and even
process/module/feature names and references.

If that all sounds like a technical mouthful, think of this as instructions for
members of Google’s search engine team. It’s like an inventory of books in a
library, a card catalogue of sorts, telling those employees who need to know
what’s available and how they can get it.

But, whereas libraries are public, Google search is one of the most secretive,
closely-guarded black boxes in the world. In the last quarter century, no leak
of this magnitude or detail has ever been reported from Google’s search
division.


HOW CERTAIN CAN WE BE THAT GOOGLE’S SEARCH ENGINE USES EVERYTHING DETAILED IN
THESE API DOCS?

That’s open to interpretation. Google could have retired some of these, used
others exclusively for testing or internal projects, or may even have made API
features available that were never employed.

However, there are references in the documentation to deprecated features and
specific notes on others indicating they should no longer be used. That strongly
suggests those not marked with such details were still in active use as of the
March 2024 leak.

We also can’t say for certain whether the March leak is of the most recent
version of this documentation. The most recent date I can find referenced in the
API docs is August of 2023:



The relevant text reads:

“The domain-level display name of the website, such as “Google” for google.com.
See go/site-display-name for more details. As of Aug 2023, this field is being
deprecated in favor of
info.[AlternativeTitlesResponse].site_display_name_response field, which also
contains host-level site display names with additional information.”

A reasonable reader would conclude that the documentation was up-to-date as of
last summer (references to other changes in 2023 and earlier years, all the way
back to 2005, are also present), and possibly even up-to-date as of the March
2024 date of disclosure.

Google search obviously changes massively from year to year, and recent
introductions, like their much-maligned AI Overviews, do not make an appearance
in this leak. Which of the items mentioned are actively used today in Google’s
ranking systems? That’s open to speculation. This trove contains fascinating
references, many that will be entirely new to non-Google-search-engineers.

But, I would urge readers not to point to a particular API feature in this leak
and say: “SEE! That’s proof Google uses XYZ in their rankings.” It’s not quite
proof. It’s a strong indication, stronger than patent applications or public
statements from Googlers, but still no guarantee.

That said, it’s as close to a smoking gun as anything since Google’s execs
testified in the DOJ trial last year. And, speaking of that testimony, much of
it is corroborated and expanded on in the document leak, as Mike details in his
post.


WHAT CAN WE LEARN FROM THE DATA WAREHOUSE LEAK?

I expect that interesting and marketing-applicable insights will be mined from
this massive file set for years to come. It’s simply too big and too dense to
think that a weekend of browsing could unearth a comprehensive set of takeaways,
or even come close.

However, I will share five of the most interesting, early discoveries in my
perusal, some that shed new light on things Google has long been assumed to be
doing, and others that suggest the company’s public statements (especially those
on what they “collect”) have been erroneous. Because it would be tedious and
could be perceived as airing personal grievances (given Google’s historic
attacks on my work), I won’t show side-by-sides of what Googlers said versus
what these documents insinuate. Besides, Mike did a great job of that in his
post.

Instead, I’ll focus on interesting and/or useful takeaways, and my conclusions
from the whole of the modules I’ve been able to review, Mike’s piece on the
leak, and how this combines with other things we know to be true of Google.


#1: NAVBOOST AND THE USE OF CLICKS, CTR, LONG VS. SHORT CLICKS, AND USER DATA



A handful of modules in the documentation make reference to features like
“goodClicks,” “badClicks,” “lastLongestClicks,” impressions, squashed,
unsquashed, and unicorn clicks. These are tied to Navboost and Glue, two words
that may be familiar to folks who reviewed Google’s DOJ testimony. Here’s a
relevant excerpt from DOJ attorney Kenneth Dintzer’s cross-examination of Pandu
Nayak, VP of Search on the Search Quality team:

Q. So remind me, is navboost all the way back to 2005?
A. It’s somewhere in that range. It might even be before that.

Q. And it’s been updated. It’s not the same old navboost that it was back then?
A. No.

Q. And another one is glue, right?
A. Glue is just another name for navboost that includes all of the other
features on the page.

Q. Right. I was going to get there later, but we can do that now. Navboost does
web results, just like we discussed, right?
A. Yes.

Q. And glue does everything else that’s on the page that’s not web results,
right?
A. That is correct.

Q. Together they help find the stuff and rank the stuff that ultimately shows up
on our SERP?
A. That is true. They’re both signals into that, yes.

A savvy reader of these API documents would find they support Mr. Nayak’s
testimony (and align with Google’s patent on site quality):

 * Quality Navboost Data module
 * Geo-segmentation of Navboost Data
 * Clicks Signals in Navboost
 * Data Aging Impressions and clicks

Google appears to have ways to filter out clicks they don’t want to count in
their ranking systems, and include ones they do. They also seem to measure
length of clicks (i.e. pogo-sticking – when a searcher clicks a result and then
quickly clicks the back button, unsatisfied by the answer they found) and
impressions.

Plenty has already been written about Google’s use of click data, so I won’t
belabor the point. What matters is that Google has named and described features
for that measurement, adding even more evidence to the pile.


#2: USE OF CHROME BROWSER CLICKSTREAMS TO POWER GOOGLE SEARCH



My anonymous source claimed that way back in 2005, Google wanted the full
clickstream of billions of Internet users, and with Chrome, they’ve now got it.
The API documents suggest Google calculates several types of metrics that can be
called using Chrome views related to both individual pages and entire domains.

This document, describing the features around how Google creates Sitelinks, is
particularly interesting. It showcases a call named topUrl, which is “A list of
top urls with highest two_level_score, i.e., chrome_trans_clicks.” My read is
that Google likely uses the number of clicks on pages in Chrome browsers and
uses that to determine the most popular/important URLs on a site, which go into
the calculation of which to include in the sitelinks feature.



E.g., in the above screenshot from Google’s results, pages like “Pricing,” the
“Blog,” and the “Login” pages are our most-visited, and Google knows this
through its tracking of billions of Chrome users’ clickstreams.
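As a hedged illustration of the topUrl idea, here is a minimal sketch that picks sitelink candidates from assumed per-URL Chrome click counts. The function name, the sample data, and the cutoff of four are all invented for illustration; the leaked documentation only names the attributes (topUrl, two_level_score, chrome_trans_clicks), not how they are computed.

```python
# Hypothetical sketch: choose sitelink candidates by ranking a site's pages
# on assumed Chrome click counts. Names and numbers are illustrative only.
from collections import Counter

def sitelink_candidates(chrome_clicks: dict, k: int = 4) -> list:
    """Return the k most-clicked URLs on a site, most popular first."""
    return [url for url, _ in Counter(chrome_clicks).most_common(k)]

clicks = {
    "/pricing": 9200,
    "/blog": 7400,
    "/login": 6100,
    "/careers": 300,
    "/legal": 40,
}
print(sitelink_candidates(clicks))  # ['/pricing', '/blog', '/login', '/careers']
```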

 * Quality NSR Data module
 * Video Content Search module
 * Quality Sitemap module


#3: WHITELISTS IN TRAVEL, COVID, AND POLITICS

A module on “Good Quality Travel Sites” would lead reasonable readers to
conclude that a whitelist exists for Google in the travel sector (unclear if
this is exclusively for Google’s “Travel” search tab, or web search more
broadly). References in several places to flags for “isCovidLocalAuthority” and
“isElectionAuthority” further suggest that Google is whitelisting particular
domains that are appropriate to show for highly controversial or potentially
problematic queries.

For example, following the 2020 US Presidential election, one candidate claimed
(without evidence) that the election had been stolen, and encouraged their
followers to storm the Capitol and take potentially violent action against
lawmakers, i.e. commit an insurrection.

Google would almost certainly be one of the first places people turned to for
information about this event, and if their search engine returned propaganda
websites that inaccurately portrayed the election evidence, that could directly
lead to more contention, violence, or even the end of US democracy. Those of us
who want free and fair elections to continue should be very grateful Google’s
engineers are employing whitelists in this case.

 * Quality NSR Data Attributes
 * Assistant API Settings for Music Filters
 * Video Content Search Query Features
 * Quality Travel Sites Data module


#4: EMPLOYING QUALITY RATER FEEDBACK



Google has long had a quality rating platform called EWOK (Cyrus Shepard, a
notable leader in the SEO space, spent several years contributing to this and
wrote about it here). We now have evidence that some elements from the quality
raters are used in the search systems.

How influential these rater-based signals are, and what precisely they’re used
for is unclear to me in an initial read, but I suspect some thoughtful SEO
detectives will dig into the leak, learn, and publish more about it. What I find
fascinating is that scores and data generated by EWOK’s quality raters may be
directly involved in Google’s search system, rather than simply a training set
for experiments. Of course, it’s possible these are “just for testing,” but as
you browse through the leaked documents, you’ll find that when that’s true, it’s
specifically called out in the notes and module details.

This one calls out a “per document relevance rating” sourced from evaluations
done via EWOK. There’s no detailed notation, but it’s not much of a logic-leap
to imagine how important those human evaluations of websites really are.



This one calls out “Human Ratings (e.g. ratings from EWOK)” and notes that
they’re “typically only populated in the evaluation pipelines,” which suggests
they may be primarily training data in this module (I’d argue that’s still a
hugely important role, and marketers shouldn’t dismiss how important it is that
quality raters perceive and rate their websites well).

 * Webref Mention Ratings module
 * Webref Task Data module
 * Document Level Relevance module
 * Webref per Doc Relevance Rating module
 * Webref Entity Join


#5: GOOGLE USES CLICK DATA TO DETERMINE HOW TO WEIGHT LINKS IN RANKINGS

This one’s fascinating, and comes directly from the anonymous source who first
shared the leak. In their words: “Google has three buckets/tiers for classifying
their link indexes (low, medium, high quality). Click data is used to determine
which link graph index tier a document belongs to. See SourceType here, and
TotalClicks here.” In summary:

 * If Forbes.com/Cats/ has no clicks it goes into the low-quality index and the
   link is ignored
 * If Forbes.com/Dogs/ has a high volume of clicks from verifiable devices (all
   the Chrome-related data discussed previously), it goes into the high-quality
   index and the link passes ranking signals

Once the link becomes “trusted” because it belongs to a higher tier index, it
can flow PageRank and anchors, or be filtered/demoted by link spam systems.
Links from the low-quality link index won’t hurt a site’s ranking; they are
merely ignored.
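The three-tier bucketing the source describes could be sketched, purely hypothetically, like this. The thresholds, tier names, and functions below are mine for illustration; the leak names the attributes (SourceType, TotalClicks) but not the cutoffs.

```python
# Hypothetical sketch of a three-tier link index keyed on click volume.
# Thresholds are invented; only the low/medium/high structure comes from
# the source's description.

def link_index_tier(total_clicks: int) -> str:
    """Assign a linking document to a (hypothetical) quality tier."""
    if total_clicks == 0:
        return "low"       # link is ignored, not penalized
    if total_clicks < 1000:
        return "medium"
    return "high"          # link may pass PageRank and anchor signals

def link_passes_signals(tier: str) -> bool:
    """In this toy model, only links from trusted tiers pass ranking signals."""
    return tier in ("medium", "high")

print(link_index_tier(0), link_passes_signals(link_index_tier(0)))
print(link_index_tier(50_000), link_passes_signals(link_index_tier(50_000)))
```

Note how the sketch mirrors the key claim: a link from the low tier returns `False` from `link_passes_signals`, i.e. it is dropped rather than counted against the site.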

--------------------------------------------------------------------------------


BIG PICTURE TAKEAWAYS FOR MARKETERS WHO CARE ABOUT ORGANIC SEARCH TRAFFIC

If you care strategically about the value of organic search traffic, but don’t
have much use for the technical details of how Google works, this section’s for
you. It’s my attempt to sum up much of Google’s evolution from the period this
leak covers: 2005 – 2023, and I won’t limit myself exclusively to confirmed
elements of the leak.

 1. Brand matters more than anything else
    Google has numerous ways to identify entities, sort, rank, filter, and
    employ them. Entities include brands (brand names, their official websites,
    associated social accounts, etc.), and as we’ve seen in our clickstream
    research with Datos, they’ve been on an inexorable path toward exclusively
    ranking and sending traffic to big, powerful brands that dominate the web
    over small, independent sites and businesses.
    
    If there was one universal piece of advice I had for marketers seeking to
    broadly improve their organic search rankings and traffic, it would be:
    “Build a notable, popular, well-recognized brand in your space, outside of
    Google search.”
    
 2. Experience, expertise, authoritativeness, and trustworthiness (“E-E-A-T”)
    might not matter as directly as some SEOs think.
    The only mention of topical expertise in the leak we’ve found so far is a
    brief notation about Google Maps review contributions. The other aspects of
    E-E-A-T are either buried, indirect, labeled in hard-to-identify ways, or,
    more likely (in my opinion), correlated with things Google uses and cares
    about, but not specific elements of the ranking systems.
    
    As Mike noted in his article, there is documentation in the leak suggesting
    Google can identify authors and treats them as entities in the system.
    Building up one’s influence as an author online may indeed lead to ranking
    benefits in Google. But what exactly in the ranking systems makes up
    “E-E-A-T” and how powerful those elements are is an open question. I’m a bit
    worried that E-E-A-T is 80% propaganda, 20% substance. There are plenty of
    powerful brands that rank remarkably well in Google and have very little
    experience, expertise, authoritativeness, or trustworthiness, as
    HouseFresh’s recent, viral article details in depth.
    
 3. Content and links are secondary when user intention around navigation (and
    the patterns that intent creates) are present.
    Let’s say, for example, that many people in the Seattle area search for
    “Lehman Brothers” and scroll to page 2, 3, or 4 of the search results until
    they find the theatre listing for the Lehman Brothers stage production, then
    click that result. Fairly quickly, Google will learn that’s what searchers
    for those words in that area want.
    
    Even if the Wikipedia article about Lehman Brothers’ role in the financial
    crisis of 2008 were to invest heavily in link building and content
    optimization, it’s unlikely they could outrank the user-intent signals
    (calculated from queries and clicks) of Seattle’s theatre-goers.
    
    Extending this example to the broader web and search as a whole, if you can
    create demand for your website among enough likely searchers in the regions
    you’re targeting, you may be able to end-around the need for classic
    on-and-off-page SEO signals like links, anchor text, optimized content, and
    the like. The power of Navboost and the intent of users is likely the most
    powerful ranking factor in Google’s systems. As Google VP Alexander
    Grushetsky put it in a 2019 email to other Google execs (including Danny
    Sullivan and Pandu Nayak):
    
    “We already know, one signal could be more powerful than the whole big
    system on a given metric. For example, I’m pretty sure that NavBoost alone
    was / is more positive on clicks (and likely even on precision / utility
    metrics) by itself than the rest of ranking (BTW, engineers outside of
    Navboost team used to be also not happy about the power of Navboost, and the
    fact it was “stealing wins”)“
    
    Those seeking even more confirmation could review Google engineer Paul
    Haahr’s detailed resume, which states:
    
    “I’m the manager for logs-based ranking projects. The team’s efforts are
    currently split among four areas: 1) Navboost. This is already one of
    Google’s strongest ranking signals. Current work is on automation in
    building new navboost data;”
    
 4. Classic ranking factors: PageRank, anchors (topical PageRank based on the
    anchor text of the link), and text-matching have been waning in importance
    for years. But Page Titles are still quite important.
    This is a finding from Mike’s excellent analysis that I’d be foolish not to
    call out here. PageRank still appears to have a place in search indexing and
    rankings, but it’s almost certainly evolved from the original 1998 paper.
    The document leak insinuates multiple versions of PageRank (rawPagerank, a
    deprecated PageRank referencing “nearest seeds,” firstCoveragePageRank from
    when the document was first served, etc.) have been created and discarded
    over the years. And anchor text links, while present in the leak, don’t seem
    to be as crucial or omnipresent as I’d have expected from my earlier years
    in SEO.
    
 5. For most small and medium businesses and newer creators/publishers, SEO is
    likely to show poor returns until you’ve established credibility,
    navigational demand, and a strong reputation among a sizable audience.
    SEO is a big brand, popular domain’s game. As an entrepreneur, I’m not
    ignoring SEO, but I strongly expect that for the years ahead, until/unless
    SparkToro becomes a much larger, more popular, more searched-for and
    clicked-on brand in its industry, this website will continue to be
    outranked, even for its original content, by aggregators and publishers
    who’ve existed for 10+ years.
    
    This is almost certainly true for other creators, publishers, and SMBs. The
    content you create is unlikely to perform well in Google if competition from
    big, popular websites with well-known brands exists. Google no longer
    rewards scrappy, clever, SEO-savvy operators who know all the right tricks.
    They reward established brands, search-measurable forms of popularity, and
    established domains that searchers already know and click. From 1998 – 2018
    (or so), one could reasonably start a powerful marketing flywheel with SEO
    for Google. In 2024, I don’t think that’s realistic, at least not on the
    English-language web in competitive sectors.


NEXT STEPS FOR THE SEARCH INDUSTRY

I’m excited to see how practitioners with more recent experience and deeper
technical knowledge go about analyzing this leak. I encourage anyone curious to
dig into the documentation, attempt to connect it to other public documents,
statements, testimony, and ranking experiments, then publish their findings.

Historically, some of the search industry’s loudest voices and most prolific
publishers have been happy to uncritically repeat Google’s public statements.
They write headlines like “Google says XYZ is true,” rather than “Google Claims
XYZ; Evidence Suggests Otherwise.”

The SEO industry doesn’t benefit from these kinds of headlines

Please, do better. If this leak and the DOJ trial can create just one change, I
hope this is it.

When those new to the field read Search Engine Roundtable, Search Engine Land,
SE Journal, and the many agency blogs and websites that cover the SEO field’s
news, they don’t necessarily know how seriously to take Google’s statements.
Journalists and authors should not presume that readers are savvy enough to know
that dozens or hundreds of past public comments by Google’s official
representatives were later proven wrong.

This obligation isn’t just about helping the search industry—it’s about helping
the whole world. Google is one of the most powerful, influential forces for the
spread of information and commerce on this planet. Only recently have they been
held to some account by governments and reporters. The work of journalists and
writers in the search marketing field carries weight in the courts of public
opinion, in the halls of elected officials, and in the hearts of Google
employees, all of whom have the power to change things for the better or ignore
them at our collective peril.

--------------------------------------------------------------------------------

Thank you to Mike King for his invaluable help on this document leak story, to
Amanda Natividad for editing help, and to the anonymous source who shared this
leak with me. I expect that updates to this piece may arrive over the next few
days and weeks as it reaches more eyeballs. If you have findings that support or
contradict statements I’ve made here, please feel free to share them in the
comments below.
