The State of JavaScript and HTML Indexing in 2020

Hey guys. Today I want to talk about one of the biggest challenges in SEO, especially technical SEO, in 2020 and beyond: getting your content indexed. For years we were used to simply publishing content and having it automatically indexed by Google. That is no longer something we get for free. We actually need to do quite a lot of work to get our content indexed, and it doesn't really depend on whether your website is a JavaScript website or an HTML website; it has all become a bit more complex. That's what I want to talk about in this video, so let's dive right into it.

First of all, one of the most important things to know is that we can't really divide the web into JavaScript websites and HTML websites anymore. It has all gotten mixed up. It's very difficult nowadays to find an HTML website that doesn't build some part of itself with JavaScript: JavaScript modules, or JavaScript components, which I think is the best name for them. So keep in mind that having an HTML website no longer means your indexing is safe, and this is something I'm going to explain in depth in a minute. Throughout our research over the last few months, we found that JavaScript is a problem for HTML websites too, so we can no longer say that the two waves of JavaScript indexing are an issue only for JavaScript-powered websites.
One thing that's extremely important, something I want to stress here, is that the SEO community is used to what I call post-factum learning. Most of the time we wait for something to break completely before we start acting on it. The biggest problem with what we're observing now is that the change is happening extremely slowly. The change in how Google and other search engines index and interact with our content is constant, but very gradual. We don't have the kind of case study we're all used to, where something drops or breaks completely and we can act on it. It's happening slowly, and we as a community should keep in mind that there isn't going to be one massive drop for us to study.

I want to start with something easy and, in my opinion, quite geeky: a medium.com case study.
This isn't the biggest problem we saw, but it's quite ironic and interesting to observe. Enter irony: "The Cost of JavaScript" is one of the most popular articles, if not the most popular, across the web performance, JavaScript, and web development communities that also touches on technical SEO. It was written by Addy Osmani, who is a Googler: a very accomplished, very smart guy, and make sure to follow him. He's one of the people the technical SEO and web performance community looks up to for knowledge and insight, like this article. What's interesting is that the article was written, I think, in 2018, and if you take some of its comments and try to google them, you won't find them in Google. That's quite remarkable, especially considering how heavily linked the article is. It was very popular, as I mentioned: this one page has more than 500 referring domains and more than 2,000 backlinks. What's even more interesting, the comment in question is more than one year old at the moment of recording this video.
What we should discuss today is the timeframe of indexing some of the elements within your page, or the lack thereof: the problem of how soon some elements get indexed, and what happens when some elements of your website are never going to be found and indexed by search engines. It's 2019, and our research is showing that there are hundreds of thousands of domains that are not fully indexed, for multiple reasons. Let me start with some basics of how Google (and most likely other search engines as well, but we'll use Google as the example in this video) renders and indexes your content. Rendering is a fairly new concept in SEO; it basically means processing all of the HTML, JavaScript, and CSS into a fully rendered page.
How rendering works at Google is something I talked about with John Mueller and Martin Splitt in Zurich this year, when they invited me to join one of the Google Webmaster Hangouts. I asked them both how the two waves of JavaScript indexing work, and they explained that Google looks at the difference between the rendered and non-rendered versions of your content. It's all based on heuristics. The heuristics look for changes: does rendering the page change the content completely or not? Based on that, Google decides whether it has to come back and render that page in the future to see all of the content pushed out by the webmaster.

The problem is that, as Martin mentioned himself, even he didn't fully grasp what exactly triggers those heuristics. They are most likely built with some kind of machine learning that isn't really human-readable. This is my assumption, not Martin's fault: if the whole logic of how rendering is triggered is based on machine learning, it's probably going to be very difficult for both SEOs and, in this case, Googlers to fully understand. And the biggest issue right now is that those heuristics are still in their infancy. Google is at the very beginning of the journey of building proper heuristics for indexing content that depends on elements like JavaScript. Google also has to decide when rendering is needed at all, because rendering is expensive for a lot of pages: do we need to render that page, that website, all the time, or can we skip rendering entirely and still have access to all of the content?
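Google hasn't published how these heuristics actually work, but the core idea, comparing the visible text before and after rendering, can be sketched in a few lines. Everything below (the parser, the 0.9 threshold, the sample pages) is my own illustration, not Google's real logic:

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks)

def rendering_changes_content(raw_html, rendered_html, threshold=0.9):
    """Crude stand-in for the heuristic: if the visible text of the rendered
    page closely matches the raw HTML, rendering adds little new content."""
    ratio = SequenceMatcher(None, visible_text(raw_html),
                            visible_text(rendered_html)).ratio()
    return ratio < threshold

# Hypothetical page: the raw HTML has a heading, rendering adds the products.
raw = "<html><body><h1>Shop</h1><script>loadProducts()</script></body></html>"
rendered = "<html><body><h1>Shop</h1><ul><li>Jeans</li><li>Shirts</li></ul></body></html>"
print(rendering_changes_content(raw, rendered))  # True: rendering added content
```

If this kind of diff comes back `True` for your pages, a search engine has to render them to see everything, and that is where the delay comes in.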
One important thing is that all new websites get rendered. This is something Googlers told me in Zurich, and it changed how we look at a lot of our experiments. It's also why we're no longer going to experiment with websites on staging: most of our experiments, if not all, were built on brand-new websites, and they showed Google getting much better at indexing simply because Google renders all new websites by default. But this brings up another issue: what exactly is a new website? How do we define one? A website where we just relaunched the CMS? Where we relaunched parts of the code? Does it have to be a new domain? And if Google visits medium.com, sees that rendering doesn't change anything, but then a week or a month later comments appear on the page, is that heuristic going to be triggered again or not? There are a lot of questions to be asked.
Before we fully understood that part, we at Onely decided to test how good Google's heuristics really are. We built quite a lot of experiments in that department and repeated some of our old ones, and long story short, we found that Google is actually extremely good at indexing JavaScript content on new domains. This is very important to stress: Google indexes almost all new content on new domains. But let's go back to 2017 and look at one of our experiments that was always failing. It's the one I want to mention, without going through all of the experiments we ran, just to show you a very interesting change in how Google indexed content in 2019. Our 2017 experiment checked whether Google follows links created by JavaScript. We had a homepage, and to get to page number six you had to go through all the nested sub-pages; that's how we could see whether Google followed the link from page five to page six, and how deep Googlebot would go into the domain. And in 2017, I have to tell you, the results were creepy.
The 2017 crawl budget experiment showed that Google indexed all of the HTML, but Googlebot didn't follow the JavaScript links: it went from the homepage to one nested page by following a JavaScript link and then gave up completely. To this day, two years later, most of the content on that website is still not indexed. That was a massive problem, and we could see that Google was probably cherry-picking which JavaScript-built pages to index, presumably to save resources. But here's the fun fact: we repeated this experiment in 2019, the same setup, the same code, the same content, on a new domain, and we did it a few times. Most of the content was indexed within hours, and all of the JavaScript content got indexed; I think the longest we had to wait was around 8 to 10 hours for all of the JavaScript pages. Google was following all the links injected by JavaScript very efficiently and quickly. This was something new; it didn't work like that before. We can safely assume, and I completely agree with Martin Splitt here, that Google is rendering and indexing all new JavaScript websites without any issues. The problem is that this applies only to new websites, and we have no idea so far how long this honeymoon phase is going to last.
That's something we struggled with, so we decided to move on and start experimenting with real-life websites. That was the only way to get actionable data we could share with the SEO community. We started gathering data and experimenting with the websites of big brands: HTML websites that use JavaScript for just small bits of their content. We also began with a lot of JavaScript-powered websites, just to see whether there are any that are fully indexable by Google and ranking properly. And we found quite a few websites that are 100% JavaScript, meaning that if you switch off JavaScript most of the content disappears, that work very well with search engines. National Geographic is one example: you can look at National Geographic with JavaScript and without JavaScript and see that all of the content disappears, but Google still had no issues crawling and indexing all of that content regardless. The same story with ASOS: the website works perfectly fine with JavaScript, but without it all of the content is gone. We're only left with the top menu (which I'm not even sure works without JavaScript) and an image. Still, Google had no issues crawling and indexing all of this JavaScript content, which is a good thing. We couldn't find many examples like that, if any, two years ago, so we can see Google is making massive progress in crawling, indexing, and ranking those websites fully. Unfortunately, not every website is as lucky as National Geographic or ASOS.
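You can run a rough version of this check yourself: take the raw HTML (what a non-rendering crawler sees) and the rendered HTML (what the browser shows), and measure how much of the content exists only after rendering. A toy sketch; the regex-based tag stripping and the sample pages are purely illustrative:

```python
import re

def missing_without_js(raw_html, rendered_html):
    """Rough fraction of the rendered page's words that never appear in the
    raw HTML, i.e. how much of the content depends on JavaScript.
    (Crude: strips tags with a regex, good enough for a sanity check.)"""
    words = lambda s: set(re.findall(r"[a-z0-9]+", re.sub(r"<[^>]+>", " ", s.lower())))
    raw_w, rendered_w = words(raw_html), words(rendered_html)
    return len(rendered_w - raw_w) / len(rendered_w) if rendered_w else 0.0

# Hypothetical National Geographic-style page: an empty app shell before rendering.
raw = '<div id="app"></div>'
rendered = '<div id="app"><h1>Travel</h1><p>Stories from the field</p></div>'
print(missing_without_js(raw, rendered))  # 1.0: everything depends on JS
```

A value near 1.0 means the page is effectively 100% JavaScript, like the National Geographic example; a value near 0.0 means the content is there in plain HTML.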
We're going to get to the part where we see a lot of HTML indexing issues, but just to finish with the JavaScript examples: the percentage of JavaScript content indexed for a lot of big brands that are not JavaScript websites is genuinely shocking, and something I didn't see coming. Look at brands like Urban Outfitters: 0% of their content that relies on JavaScript is indexed, meaning the parts of the website that rely on JavaScript are simply not indexed. J.Crew as well, and Topshop: 0%. Sephora has 42% of their JavaScript-reliant content indexed; H&M, 73%. And, obviously, German efficiency: T-Mobile has 82% of the content that relies on JavaScript indexed. So we can see this is a massive issue, and we're still only talking about the parts of the content that rely on JavaScript. But it gets a little more complex. Now we're going to get a bit geeky, so let me walk you through it step by step.
I crawled H&M with and without JavaScript rendering, just to compare how different a website graph Google and other search engines see when crawling with versus without rendering. This is the screenshot from the crawl without JavaScript rendering, and we can see that something is massively wrong with the H&M website: out of the 50,000 URLs I crawled, 43,000 are canonicalized. That makes roughly 86% of the content completely non-indexable when crawling without JavaScript rendering, which is a massive issue right there. With JavaScript rendering, the crawl looks a little better, but the shape of the crawl, even with the same 50,000-URL limit, changed completely. There are two completely different website graphs depending on whether JavaScript rendering is switched on or off. Keep in mind this is happening to H&M, which is not a JavaScript website; it's something we would have called an HTML-powered website for years, and it's still a massive problem.
This is where I need to explain what usually relies on JavaScript, because different parts of websites do. Suspect number one is pagination. For quite a lot of websites, including H&M, pagination relies on JavaScript; in H&M's case, if we switch off JavaScript we only see the first small percentage of the products, and there's no other way to find the rest except maybe sitemaps. The next one is "you might also be interested in" and other forms of interlinking, like related articles, related products, and so on. The "you might also like" section very, very often relies on JavaScript to work. Top products, too, which we see in a lot of e-commerce stores, very often rely on JavaScript, and without it Google simply doesn't see those links. Reviews are another popular problem, and comments, which we talked about in the medium.com case study. Main content is less common: it doesn't happen that often, though we saw it with ASOS and National Geographic. Fortunately, in most cases the main content relies on HTML, so it's visible without rendering.
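The underlying test is simple: a crawler that doesn't render only sees links that exist as plain `<a href>` tags in the raw HTML. A small sketch of that comparison; the pagination markup below is invented for illustration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags: the links a crawler can follow
    without executing any JavaScript."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def links_in(html):
    c = LinkCollector()
    c.feed(html)
    return set(c.links)

# Page 3 only becomes a real link once JavaScript has run.
raw = '<a href="/jeans?page=2">Next</a><button onclick="loadPage(3)">3</button>'
rendered = '<a href="/jeans?page=2">Next</a><a href="/jeans?page=3">3</a>'
print(links_in(rendered) - links_in(raw))  # {'/jeans?page=3'}
```

Any link that shows up only in the rendered set depends on rendering to be discovered, which is exactly the pagination problem described above.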
The root of the problem, and this is actually very interesting for non-technical people, comes from the fact that no one writes whole websites from scratch anymore. We're past the time when a developer would sit down and hand-code an e-commerce store in HTML or PHP. Most websites are built from ready-to-go components; if you're a WordPress user, for example, you'll use plugins or templates for different parts of the website rather than coding them from scratch. In most cases the components we're talking about are things like the menu, a slider, the main content, faceted navigation, comments, related products, and so on. If those components rely on JavaScript, there's a very good chance Google won't see them well. Google may eventually index that content, and this is still something we're trying to understand better, but in most cases Google doesn't index it instantly, and that leads to a lot of issues. The rendering delay leads to two different websites, two different website graphs, which is exactly what we saw in the H&M crawl.
So take hm.com, with its homepage, categories, and subcategories: we go to Ladies or Men, then to Jeans, then to the different types of jeans. With JavaScript, everything works. We can see all of the products, all the subcategories, all the filters. But once JavaScript is not rendered by a search engine, we're stuck with category pages without pagination. We see, for example, 36 pairs of jeans out of 200, and Google has a problem finding all of your products. And this leads us to the time frames of indexing.
Okay, so we see there's some kind of indexing problem; what's the time frame for indexing that content? We looked into quite a lot of big brands and checked the percentage of JavaScript content indexed after two weeks. The New York Post did very well: 100% of their JavaScript-reliant content was indexed after two weeks. But moving to, for example, The Guardian, CNBC, or Target: Target had 70% of their JavaScript content indexed after 13 days, which is quite bad, but The Guardian had only 34% of their JavaScript content indexed after two weeks. That's really bad, because it leaves 66% of their content not indexed, and for a publisher that is definitely a big deal. CNBC I won't even go into in depth to explain how bad the problem is, but almost 100% of their JavaScript content wasn't indexed after two weeks, and that's quite a lot of time for Google to go and index content. As a community, I'm kind of sick and tired of seeing a lot of SEOs blaming Google for this, or saying that JavaScript is evil. In our experience at Onely, working on technical SEO and JavaScript SEO for a lot of e-commerce stores and large brands, every single JavaScript SEO issue we saw was 100% self-induced, caused by website owners not fully understanding the technology.
So how do you avoid these issues in the future, and make sure your website isn't struggling with this problem when it's deployed? The problem is that we found no toolset for clearly diagnosing this, especially for non-technical people. So we created one and called it OMFG, Onely Made For Geeks, to explain what's happening to your website and help you fix its problems. OMFG is a toolset that helps non-technical website owners, and also developers who need to check something quickly, see what relies on JavaScript and which elements of a website may have indexing issues. The toolset is available at onely.com/tools, and you can go and play with it. A word of warning: it's still an early alpha, maybe closer to a beta now, so some things may crash, but hopefully it will work well for you.
One of the key tools we're most excited about is WWJD, What Would JavaScript Do, where you can compare a version of a website with JavaScript enabled and disabled and see how your website looks before and after rendering. BBC is a very good example: it's a content website you wouldn't call a JavaScript website per se, but you can see there's a massive problem, because with JavaScript disabled the website looks completely different from the JavaScript-enabled version. There is different content, and other things change as well, so it's something to look into more closely. One of the features of WWJD is that you can look at the major meta tags and see what happens to the non-rendered version. In this example it's quite interesting: rendering changes the title of the website from "BBC - Home" to "BBC - Homepage," which is not that bad compared to the other problems, but the meta description changes completely. It becomes a whole other description after the website's JavaScript is processed.
But the biggest and most interesting issue we found is that the canonical tag of the website points to a different domain after JavaScript is processed. I don't think I need to explain this in depth; it's the nightmare all of us SEOs want to avoid: JavaScript is executed, the website is rendered, and it turns out your canonical points to a different domain. In BBC's case, it goes from the .co.uk version to the .com version after rendering. On top of that, JavaScript is adding quite a lot of links, in this example a lot of links to, God knows why, the .co.uk version, and it's removing quite a lot of internal links as well. This is probably an extremely big issue for BBC to investigate: why is the website changing so much after being rendered? Especially since meta descriptions, canonicals, and the like shouldn't rely on JavaScript at any point, for any website.
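Checking your own pages for this class of problem doesn't require much: extract the title, meta description, and canonical from the raw and rendered HTML and diff them. A minimal sketch; the BBC-style values here are made up to mirror the example:

```python
from html.parser import HTMLParser

class HeadTags(HTMLParser):
    """Pulls the SEO-critical head tags: <title>, meta description, canonical."""
    def __init__(self):
        super().__init__()
        self.tags = {}
        self._in_title = False
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("name") == "description":
            self.tags["description"] = a.get("content", "")
        elif tag == "link" and a.get("rel") == "canonical":
            self.tags["canonical"] = a.get("href", "")
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.tags["title"] = data.strip()

def head_tags(html):
    p = HeadTags()
    p.feed(html)
    return p.tags

def head_diff(raw_html, rendered_html):
    """Returns {tag: (raw_value, rendered_value)} for every tag that changes."""
    raw, rendered = head_tags(raw_html), head_tags(rendered_html)
    return {k: (raw.get(k), rendered.get(k))
            for k in raw.keys() | rendered.keys()
            if raw.get(k) != rendered.get(k)}

raw = '<head><title>BBC - Home</title><link rel="canonical" href="https://www.bbc.co.uk/"></head>'
rendered = '<head><title>BBC - Homepage</title><link rel="canonical" href="https://www.bbc.com/"></head>'
print(head_diff(raw, rendered))  # title and canonical both change after rendering
```

An empty diff is what you want: the head tags should be identical with and without rendering.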
TLDR, Too Long Didn't Render, is a tool for checking the cost of rendering your website. This is very important if your audience doesn't have brand-new phones: maybe your website targets people with cheaper devices, or, as in my case, older devices that aren't as efficient as, say, a brand-new iPhone. If your user base isn't that wealthy, or you're in a developing country where high-end mobile devices are expensive and not everyone has an iPhone 11 Pro, this is a must-use tool. TLDR basically measures the CPU and memory cost of rendering your website on a mobile device, because a cheaper or older phone is going to struggle with websites like the BBC or The Guardian; rendering all that JavaScript on a cheap phone is going to be very, very slow. In this example, bbc.co.uk is extremely expensive to render: there's quite a bit of JavaScript, and older phones like mine are going to struggle with it quite a lot. This is something you won't notice on, for example, an iPhone X with a very good CPU, an iPhone 11, or the other latest top-of-the-line devices. You can check this for your own domain and see whether you're in the green zone, meaning your website isn't leaning too heavily on your users' mobile CPUs.
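Measuring real render cost needs a real (ideally low-end) device or a headless browser with CPU throttling, but you can get a first rough signal from the raw HTML alone: how much script does the page carry? This is a crude proxy of my own, not how the TLDR tool works:

```python
from html.parser import HTMLParser

class ScriptWeight(HTMLParser):
    """Tallies inline script bytes and external script references: a crude
    proxy for how much JavaScript a device must fetch and execute."""
    def __init__(self):
        super().__init__()
        self.inline_bytes = 0
        self.external_srcs = []
        self._in_script = False
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external_srcs.append(src)
            else:
                self._in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
    def handle_data(self, data):
        if self._in_script:
            self.inline_bytes += len(data.encode())

def script_weight(html):
    p = ScriptWeight()
    p.feed(html)
    return {"inline_bytes": p.inline_bytes,
            "external_scripts": len(p.external_srcs)}

# Hypothetical page with one inline script and two external bundles.
page = '<script>var x=1;</script><script src="/a.js"></script><script src="/b.js"></script>'
print(script_weight(page))
```

It says nothing about execution time on a given CPU, but a page with dozens of script bundles is a strong hint that cheap phones will struggle.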
Last but not least, TGIF, The Google Indexing Forecast, which we're also very excited about. This is our way of checking, on a timeline, how well Google is dealing with JavaScript, and whether Google is getting better at indexing it. We check this manually: we picked quite a lot of big brands and check Google's indexing of the JavaScript parts of those websites every day, to see what percentage of the content is indexed after one day, one week, and two weeks. We also check HTML indexing, and this is where it gets really exciting, because we can see that for quite a few big brands, indexing their HTML pages is not that easy. That, I guess, is the whole reason for this video. If you look at the percentage of content indexed after two weeks, it's quite interesting: in a lot of cases our data shows something like 70 or 80 percent of HTML content indexed on average, meaning there are quite a lot of brands whose plain HTML pages, regular pages with products, news, whatever, are not indexed after two weeks. That's extremely bad.
Quite a lot of free tools are coming to the toolset soon, so stay tuned: we're going to launch something very exciting within the next week or two, hopefully, that's going to take this to a whole other level and be very helpful for diagnosing some of these problems. Sign up for our newsletter or subscribe to our YouTube channel to be updated when the new tools come out. And that's enough about JavaScript; I'll try not to use the word "JavaScript" until the end of this video. Let's talk about HTML.
HTML is quite important because, for example, when we took a random sample of 1,302 URLs from The Guardian, we saw that almost all of their HTML pages were indexed after one day. But not all big brands are as lucky. When we compare The Guardian with some of the other big brands, we see massive problems with indexing their HTML content, their very basic content. Eventbrite, for example, only has 58% of their content indexed after two weeks, which is extremely bad, and you can see some of the data on your screen right now. The chart also shows some interesting issues: Target is struggling, with just 3% of their content indexed after one day, which is very hard to understand for an e-commerce store. This is something for those brands to look into and address ASAP.
Just to explain the whole problem again, for people who aren't as technical or who are technical but want some of the insights from our recent experiments, we've created this "vicious cycle of indexing" diagram once we understood what leads to HTML content not being indexed for quite a lot of pages. Step one: a webmaster, let's say target.com or walmart.com, updates the website with new products; say H&M adds a lot of new jeans. But a part of the website relies on rendering, so Google crawls your website without seeing all of the links, because Google doesn't click. It only crawls part of the domain without finding all of the products, which gets Googlebot and Google's indexer confused: there is something wrong with this website, we are wasting quite a lot of our crawl budget just to find a tiny bit of the content, and we keep going in circles. Crawl budget falls because of that, the crawl budget becomes too low to render (I have to use that word) the website's JavaScript, and the circle closes, continuing the vicious cycle of indexing. That's the problem we're seeing for Eventbrite and others.
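The crawl side of this cycle is easy to simulate: model the site as a link graph where some links exist only in the rendered DOM, and crawl it with rendering on and off. The site structure below is hypothetical:

```python
from collections import deque

# Hypothetical site: each page lists (target, needs_js) links.
# The link to page 2 of the jeans category is injected by JavaScript.
SITE = {
    "/":             [("/jeans", False)],
    "/jeans":        [("/jeans/1", False), ("/jeans?page=2", True)],
    "/jeans/1":      [],
    "/jeans?page=2": [("/jeans/2", False)],
    "/jeans/2":      [],
}

def crawl(start, render_js):
    """BFS like a crawler: follow JS-injected links only when rendering."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for target, needs_js in SITE.get(page, []):
            if needs_js and not render_js:
                continue  # the link only exists in the rendered DOM
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

print(len(crawl("/", render_js=False)))  # 3 pages found without rendering
print(len(crawl("/", render_js=True)))   # 5 pages found with rendering
```

One JS-only pagination link is enough to hide an entire branch of products from a crawler that skips rendering, which is exactly how the cycle starts.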
To leave you with a little to-do at the very end: go to onely.com/tools and check your website. It's 100% free, no strings attached. We're not using any of this data, and we don't even have a sales team at Onely; we just want to share it with the community and help you make better websites. So go there and play with your website's different pages: look at a product page, a category page, whatever you have, even if it's WordPress, because we're seeing these issues for a lot of WordPress sites as well. Crawl your website with and without JavaScript. You can do that with a crawler like Ryte or DeepCrawl: enable JavaScript rendering for one crawl, disable it for a second, and compare whether you're seeing exactly the same website graph. More data is coming soon, and we're going to launch quite a lot of new tools, so make sure to subscribe and press the bell button to be updated. And here's me in front of the camera looking very confused again: if you enjoyed watching this, subscribe to our newsletter and stay updated, because we're going to share quite a lot of free tools soon to help you get your website indexed in 2020 and beyond. Thank you, and see you soon.
