Google has published a fresh installment of its educational video series “How Search Works,” explaining how its search engine discovers and accesses web pages through crawling.
Google Analyst Details Crawling Process
In the seven-minute episode hosted by Google Analyst Gary Illyes, the company provides an in-depth look at the technical aspects of how Googlebot—the software Google uses to crawl the web—functions.
Illyes outlines the steps Googlebot takes to find new and updated content across the internet’s trillions of webpages and make them searchable on Google.
Illyes explains:
“Most new URLs Google discovers are from other known pages that Google previously crawled.
You can think about a news site with different category pages that then link out to individual news articles.
Google can discover most published articles by revisiting the Category page every now and then and extracting the URLs that lead to the articles.”
How Googlebot Crawls the Web
Googlebot starts by following links from known webpages to uncover new URLs, a process called URL discovery.
It avoids overloading sites by crawling each one at a customized speed based on server response times and content quality.
Googlebot renders pages using a current version of the Chrome browser to execute any JavaScript and correctly display dynamic content loaded by scripts. It also only crawls publicly available pages, not those behind logins.
Improving Discovery & Crawlability
Illyes highlighted the usefulness of sitemaps—XML files that list a site’s URLs—to help Google find and crawl new content.
He advised developers to have their content management systems automatically generate sitemaps.
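For reference, a sitemap is a plain XML file that follows the sitemaps.org protocol. The sketch below lists a single hypothetical article URL; the URL and date are placeholders, and a real file would include every URL you want Google to discover, ideally regenerated automatically whenever content is published.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/news/example-article</loc>
    <lastmod>2024-02-20</lastmod>
  </url>
</urlset>
```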
Optimizing technical SEO factors like site architecture, speed, and crawl directives can also improve crawlability.
Here are some additional tactics for making your site more crawlable:
- Avoid crawl budget exhaustion – Large sites that update frequently can exhaust Googlebot’s crawl budget, delaying the discovery of new content. Careful CMS configuration and rel="next" / rel="prev" pagination tags can help (see the markup example after this list).
- Implement good internal linking – Linking to new content from category and hub pages enables Googlebot to discover new URLs. An effective internal linking structure aids crawlability.
- Make sure pages load quickly – Sites that respond slowly to Googlebot fetches may have their crawl rate throttled. Optimizing pages for performance can allow faster crawling.
- Eliminate soft 404 errors – Fixing soft 404s caused by CMS misconfigurations ensures URLs lead to valid pages, improving crawl success.
- Consider robots.txt tweaks – An overly restrictive robots.txt can block helpful pages. An SEO audit may uncover restrictions that can safely be removed (a sample file follows this list).
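For the pagination point above, rel="next" / rel="prev" are link elements placed in the page’s head. The snippet below is a sketch for a hypothetical paginated news archive; the URLs are placeholders.

```html
<!-- Hypothetical category page: /news?page=2 -->
<link rel="prev" href="https://www.example.com/news?page=1">
<link rel="next" href="https://www.example.com/news?page=3">
```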
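And for the robots.txt item, a minimal file might look like the sketch below. The blocked path is a placeholder; the point of an audit is to confirm that every Disallow rule is still intentional.

```
# Hypothetical robots.txt – paths are placeholders
User-agent: Googlebot
Disallow: /internal-search/
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```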
Latest In Educational Video Series
The latest video comes after Google launched the educational “How Search Works” series last week to shed light on the search and indexing processes.
The newly released episode on crawling provides insight into one of the search engine’s most fundamental operations.
In the coming months, Google will produce additional episodes exploring topics like indexing, quality evaluation, and search refinements.
The series is available on the Google Search Central YouTube channel.
FAQ
What is the crawling process as described by Google?
Google’s crawling process, as outlined in the recent “How Search Works” episode, involves the following key steps:
- Googlebot discovers new URLs by following links from known pages it has previously crawled.
- It strategically crawls sites at a customized speed to avoid overloading servers, taking into account response times and content quality.
- The crawler also renders pages using a current version of Chrome so that content loaded by JavaScript displays correctly, and it accesses only publicly available pages.
- Optimizing technical SEO factors and utilizing sitemaps can facilitate Google’s crawling of new content.
How can marketers ensure their content is effectively discovered and crawled by Googlebot?
Marketers can adopt the following strategies to enhance their content’s discoverability and crawlability for Googlebot:
- Implement automated sitemap generation within their content management systems (see the sketch after this list).
- Focus on optimizing technical SEO elements such as site architecture and load speed, and use crawl directives appropriately.
- Ensure frequent content updates do not exhaust the crawl budget by configuring the CMS efficiently and using pagination tags.
- Create an effective internal linking structure that helps discover new URLs.
- Check and optimize the website’s robots.txt file to ensure it is not overly restrictive to Googlebot.
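As a rough illustration of the first point, the sketch below shows how a CMS publish hook could regenerate a sitemap from a list of article URLs. The URLs, file path, and trigger are hypothetical; most platforms offer a plugin or built-in feature that handles this automatically.

```python
# Minimal sketch: regenerate sitemap.xml from a list of published URLs.
# The URL list and output path are placeholders; a real CMS hook would
# supply the live article URLs and run this whenever content is published.
from datetime import date
from xml.sax.saxutils import escape

published_urls = [
    "https://www.example.com/news/article-one",
    "https://www.example.com/news/article-two",
]

def build_sitemap(urls):
    # Build one <url> entry per published page, escaping special characters.
    entries = "\n".join(
        "  <url>\n"
        f"    <loc>{escape(url)}</loc>\n"
        f"    <lastmod>{date.today().isoformat()}</lastmod>\n"
        "  </url>"
        for url in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(build_sitemap(published_urls))
```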