Using the Robots Exclusion Protocol in Spring-based Web Applications
SEO, Spring, and Java
This post explains how to integrate SEO elements into your Spring MVC web application using a simple library and a few annotations.
I've recently been working on SEO optimisations for the Betfair Site Platform, and found it a little surprising that there aren't many tools out there for Spring-based web apps to make use of the Robots Exclusion Protocol (REP). So I decided to build something that might help anyone who wants to use REP within their apps and make it publicly available. More on that at the end of this article, so please skip ahead if you already know what REP is.
Crawling vs Indexing
It's important to distinguish between these two actions, because the terms are very often misused.
- Crawling is the act of content discovery. The crawler finds pages on the web and marks them for indexing. It also looks for links within pages and attempts to crawl those too, so the process is logically recursive.
- Indexing is the act of processing the content of the pages that have been crawled and using the resulting information to drive search results.
Here's a good explanation from Google on the topic.
What is REP?
The protocol covers three main areas: robots.txt, robots meta tags, and X-Robots-Tag headers. For anyone unfamiliar with these, I'll give a very brief overview of what each is and when you might want to use it.
robots.txt
This is a file located in the domain root (e.g. http://www.example.org/robots.txt) that specifies which paths you want crawled. It can be used to allow or disallow paths, but it's important to remember that we're talking about an exclusion protocol here - so the default is for all content to be included. If you do not provide a robots.txt file, crawlers will reasonably assume that any link they find on your site is relevant for crawling. It's also important to note that robots.txt is not a contract: no crawler promises to strictly adhere to the rules you set out in this file; they only use it as a guideline.
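For example, a minimal robots.txt that lets robots crawl everything except one path might look like this (the path is purely illustrative):

```
User-agent: *
Disallow: /admin/
```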
Robots Meta Tags
Within HTML content, you can also specify <meta> tags to provide instructions to robots for that specific page. This gives you a little more control, because you can specify directives instructing the robot on how to crawl or index your site at a page level.
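For example, to ask robots not to index a page and not to follow its links, you would add the following to the page's <head> (this is standard REP syntax, not specific to any framework):

```html
<meta name="robots" content="noindex, nofollow">
```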
X-Robots-Tag Headers
HTTP response headers can also be used to give instructions to robots in a similar fashion to <meta> tags. This is particularly useful for responses that are not HTML.
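For example, to stop a PDF (or any other non-HTML response) from being indexed, the server can send:

```
X-Robots-Tag: noindex
```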
More details on robots.txt, meta tags and headers can be found here.
The spring-rep Library
As I mentioned earlier, I've been building a library to simplify how you can use REP in your Spring-based web apps. The idea is to provide simple annotations that decorate request mappings in Spring-based web apps, giving a standard mechanism for applying REP across applications.
How to use it
Firstly, you'll need to register an interceptor that does all the work for you. Note that the interceptor has two properties that allow you to specify the mode and combinator strategy. If you want more details, these are explained in the README on GitHub.
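As a sketch, registration in Spring 3.1 Java config might look like the following. Note that `RepInterceptor` and its two setters are placeholder names I've chosen for illustration - the actual class name, modes, and combinator strategies are documented in the spring-rep README.

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.EnableWebMvc;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;

@Configuration
@EnableWebMvc
public class WebConfig extends WebMvcConfigurerAdapter {

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // Hypothetical class name - check the spring-rep README for the real one.
        RepInterceptor interceptor = new RepInterceptor();
        // The two properties mentioned above; valid values are described
        // in the README on GitHub.
        // interceptor.setMode(...);
        // interceptor.setCombinator(...);
        registry.addInterceptor(interceptor);
    }
}
```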
Next, just apply the annotations to the relevant request mappings in your application.
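For instance, annotating a mapping on a controller could look something like this. The `@NoIndex` annotation name is illustrative only - see the spring-rep README for the annotations the library actually provides.

```java
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class AccountController {

    // Hypothetical annotation name - the real ones are in the README.
    @NoIndex
    @RequestMapping("/account/settings")
    public String settings() {
        return "settings";
    }
}
```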
Version 0.1 is available on Maven Central, so please give it a try and let me know if you find any bugs so I can fix them. I have a few ideas for how to make it better, but please feel free to add any issues to the list and I'll do my best to include them. This is based on Spring 3.1, so please make sure you have the correct version available before declaring this dependency in your project.
You can import this using the normal Maven dependency mechanism:
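The declaration would look roughly like this - the `groupId` shown is a placeholder, so search Maven Central for the library's actual coordinates (the artifactId and version come from the text above):

```xml
<!-- groupId is illustrative; look up the real spring-rep coordinates on Maven Central -->
<dependency>
    <groupId>com.example</groupId>
    <artifactId>spring-rep</artifactId>
    <version>0.1</version>
</dependency>
```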
The source (and some basic docs in the README) is available on GitHub if you want to submit a fix for any bugs you might find or fork it for your own purposes.