There’s a lot of information around the internet concerning the use of XML Sitemaps for SEO purposes. In this post I’ll explain why you should have them, how you should structure them and how you can use XML Sitemaps to improve crawling and indexing of your website. I’ll answer some frequently asked questions too.
- What is the use of an XML Sitemap?
- How will the XML Sitemap influence crawling and indexing?
- How can I use XML Sitemaps to optimise crawling and indexing?
What is the use of an XML Sitemap?
The XML Sitemap is a list of pages on your website that you would like to have crawled and indexed by search engines. You can send this information to search engines to give them a clearer view of which content is on your website.
What does it consist of?
Although mistakes are commonly made here, the XML file doesn’t need to contain a lot of info.
Take for example this excerpt from Moz.com’s XML Sitemap:
These are the elements that should be in your XML Sitemap:
- Document info: At the start you declare that this is a sitemap put together as defined by the sitemaps.org protocol. You do this with the <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> line, closing it at the end of the sitemap with </urlset>. This is referred to as the XML ‘namespace’.
- URL loc: For every URL you want to include, you open up a <url> and within a <loc> you put the URL of the page. So far, this is the only thing that’s required for it to be considered an XML Sitemap.
- URL lastmod: Within the same <url>, you can include the timestamp that indicates when the page has last been modified. So if you have updated a blog article, this timestamp should change.
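To make the elements above concrete, here is a minimal sketch in Python that writes a sitemap containing only the namespace, <loc> and <lastmod>. The function name build_sitemap, the example.com URLs and the dates are my own placeholders, not taken from any real sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a sitemap string from (url, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # the only required child
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod  # optional timestamp
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Hypothetical pages, for illustration only.
example_sitemap = build_sitemap([
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/blog/xml-sitemaps", None),
])
print(example_sitemap)
```

In a real setup you would feed this from your CMS, so that the <lastmod> value changes whenever a page is actually edited.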
These are the only 3 things you really need in your XML Sitemap for SEO reasons. If you followed the link above, you might have noticed the <changefreq> and <priority> elements.
- <changefreq>: Indicates how frequently this page is updated. This ranges from ‘always’ (indicating the content on the page changes every time the page is being served) to ‘never’ (indicating this page is always the same).
- <priority>: A value between 0.0 and 1.0 that you can use to indicate how important this page is to crawlers compared to the other pages on your site.
So why not use these in your XML Sitemap? Because Google doesn’t care about them that much. After all, you could say that every page is 1.0 in priority and gets changed daily to increase the crawling of your website. So they aren’t counting on your honesty. There are other things you can do though:
- Priority: Googlebot reads XML Sitemaps top to bottom. Make sure that the pages you really need to have crawled are at the top.
- Lastmod: Googlebot checks the <lastmod> and compares it with the last time it visited the page. So make sure this is functioning correctly and that your XML Sitemap is updated automatically.
Should every page be in your XML Sitemap?
No, just the ones that have great content that is useful for searchers. After all, what you are doing is pointing at URLs that might be useful for indexing.
This is where a lot of mistakes happen, especially if your CMS is not really SEO-friendly. Often these CMSs will generate an XML Sitemap including all pages that are published. This will also include utility pages that offer no value when indexed. This defeats the purpose of an XML Sitemap and, if these pages are not properly marked with a ‘noindex’, could hurt your overall site quality metrics.
Synchronise your data!
To elaborate on my previous point: make sure your robots.txt, meta robots and XML Sitemap are in sync. You don’t want URLs in your XML Sitemap that are blocked by robots.txt or have a ‘noindex’. Seen from Googlebot’s perspective, this would make no sense.
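One way to check part of this automatically is sketched below: it uses Python’s standard urllib.robotparser to list sitemap URLs that your robots.txt disallows. The function name, the robots.txt rules and the URLs are made up for illustration. Note that Python’s parser does not implement every matching rule Google uses (wildcards, for example), and this does not check for ‘noindex’ tags, so treat the result as a first pass only.

```python
from urllib.robotparser import RobotFileParser

def blocked_sitemap_urls(robots_txt, sitemap_urls, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(user_agent, url)]

# Hypothetical robots.txt and sitemap URLs, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /cart/
"""
urls = [
    "https://www.example.com/",
    "https://www.example.com/cart/checkout",
]
conflicts = blocked_sitemap_urls(robots_txt, urls)
print(conflicts)
```

Any URL this reports is one you are asking to be crawled in the sitemap while forbidding it in robots.txt.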
When to use an ‘index sitemap’?
If you have a lot of pages or want to structure your pages into different XML Sitemaps (as I will suggest later on), make sure you have an index sitemap in place. This is a sitemap that links to all of your divided sitemaps.
This might be useful for:
- Different language versions
- Different page types (product / category / blog / …)
- Different content topics
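An index sitemap uses the same namespace, but wraps <sitemap> entries in a <sitemapindex> instead of <url> entries in a <urlset>. Here is a minimal sketch; the function name and the file names are placeholders of my own.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(sitemap_urls):
    """Build an index sitemap that links to each of the given sitemaps."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Hypothetical split by page type, for illustration only.
example_index = build_sitemap_index([
    "https://www.example.com/sitemap-products.xml",
    "https://www.example.com/sitemap-categories.xml",
    "https://www.example.com/sitemap-blog.xml",
])
print(example_index)
```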
How will the XML Sitemap influence crawling and indexing?
What happens when you submit your XML Sitemap to Google Search Console? Below are two frequently asked questions. If you have other questions, please let me know.
Will Google only crawl / index the pages in my XML Sitemap?
No, they will still crawl and possibly index other content. Consider blocking content via robots.txt if you don’t want it to be crawled, and/or adding a ‘noindex’ if you don’t want it to be indexed.
Will Google crawl / index all the pages in my XML Sitemap?
No, the XML Sitemap is just an indication of which pages you want to have crawled or indexed. If you include more URLs than Googlebot is willing to crawl, it will not crawl every URL. Consider moving the important ones to the top and making sure your <lastmod> is set correctly.
How can I use XML Sitemaps to optimise crawling and indexing?
As I have stated a couple of times above, XML Sitemaps are only used as an indication of which pages you want crawled and indexed by search engines. By cleverly using some other tools, you can get important information on how search engines are handling your website.
Using Google Search Console
If you have added your XML Sitemap to Google Search Console, it will (in time) give you feedback on how many pages have been indexed by Google. By splitting up your XML Sitemaps into different parts, you can easily trace indexing issues.
You might want to split it up by language, page type, subject or whatever else makes sense for your website.
It will also give you information on the progress:
This way you can track if changes to your website / sitemap / … have had any influence on the indexing of your website.
The more you split up your XML Sitemaps, the better the view on indexed or non-indexed pages.
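As a rough sketch of such a split, the snippet below groups URLs by their first path segment, so each group could become its own XML Sitemap. Splitting on the first path segment is my own assumption and only works if your URL structure reflects your page types; the function name and URLs are placeholders.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_section(urls):
    """Group URLs by first path segment, e.g. /blog/... ends up under 'blog'."""
    groups = defaultdict(list)
    for url in urls:
        segments = [part for part in urlparse(url).path.split("/") if part]
        section = segments[0] if segments else "root"  # homepage has no segment
        groups[section].append(url)
    return dict(groups)

# Hypothetical URLs, for illustration only.
sections = group_by_section([
    "https://www.example.com/",
    "https://www.example.com/blog/xml-sitemaps",
    "https://www.example.com/products/red-shoes",
    "https://www.example.com/products/blue-shoes",
])
print(sections)
```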
Using your own crawl data
By using tools like Screaming Frog SEO Spider, you can crawl both your website and your XML Sitemap. By comparing these two crawls, you can track down issues in crawling and indexing by Google:
- Page both in Site Crawl & Sitemap Crawl: Good!
- Page in Site Crawl but not in Sitemap Crawl: Should it be in the sitemap? If so, add it! What remains on this list should only be pages that have no value for search. Most of them should probably be ‘noindex’.
- Page in Sitemap Crawl but not in Site Crawl: Possibly there are no internal links to this page, or the URL in the sitemap is wrong.
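If you export both crawls as plain lists of URLs, the three cases above come down to simple set operations. This is a minimal sketch; the function name, the dictionary keys and the URLs are my own, and it assumes both exports use identical URL formatting (same protocol, trailing slashes and so on).

```python
def compare_crawls(site_crawl, sitemap_crawl):
    """Split URLs into the three cases: in both, site only, sitemap only."""
    site, sitemap = set(site_crawl), set(sitemap_crawl)
    return {
        "in_both": sorted(site & sitemap),        # good
        "site_only": sorted(site - sitemap),      # add to sitemap or noindex?
        "sitemap_only": sorted(sitemap - site),   # missing internal links or wrong URL?
    }

# Hypothetical crawl exports, for illustration only.
report = compare_crawls(
    site_crawl=["https://www.example.com/", "https://www.example.com/login"],
    sitemap_crawl=["https://www.example.com/", "https://www.example.com/old-page"],
)
print(report)
```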
Using Server Logs
If you want to take this a step further, you could also add your Server Logs by using something like Screaming Frog Log File Analyser. This will show you which pages were visited by Googlebot and how frequently this has happened.
If you combine this information with the Site Crawl and the XML Sitemap crawl, any crawling or indexing issues should become clear as daylight. It requires some technical skills to do this analysis, but it definitely pays off!
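If you don’t have a log analysis tool at hand, a rough first look is possible with a few lines of Python. The sketch below counts requests per path for lines that mention Googlebot. It assumes an Apache/Nginx style access log where the request appears as "GET /path HTTP/1.1"; the sample lines are made up, and since user agents can be spoofed, this is not proof that the real Googlebot visited.

```python
import re
from collections import Counter

# Matches the request part of a common/combined log line, e.g. "GET /blog/ HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]*"')

def googlebot_hits(log_lines):
    """Count requests per path for log lines whose user agent mentions Googlebot."""
    hits = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        match = REQUEST_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits

# Made-up log lines, for illustration only.
GOOGLEBOT_UA = '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
sample_log = [
    '66.249.66.1 - - [10/Jan/2024:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" ' + GOOGLEBOT_UA,
    '66.249.66.1 - - [10/Jan/2024:11:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" ' + GOOGLEBOT_UA,
    '66.249.66.1 - - [10/Jan/2024:12:00:00 +0000] "GET /about/ HTTP/1.1" 200 256 "-" ' + GOOGLEBOT_UA,
    '203.0.113.5 - - [10/Jan/2024:12:05:00 +0000] "GET /cart/ HTTP/1.1" 200 128 "-" "Mozilla/5.0"',
]
hits = googlebot_hits(sample_log)
print(hits.most_common())
```

Paths from your XML Sitemap that never show up in these counts are the ones worth investigating first.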
Still need help? Let me know in the comments below.