Rails, Sitemaps, and Google Webmaster Tools

I had heard many good things about Google Webmaster Tools and set out to get brockbouchard.net registered. One of its best features is that you can build a “Sitemap” for your site (an XML file listing your site’s URLs) and submit it directly to Google. At first glance, generating the Sitemap looks like an arduous task. Fortunately, some people in the Rails community have made it easier for all of us:

  1. Alastair Brunton, with improvements from Harry Love, created a means of generating your Sitemap dynamically for each model in your database.
  2. Phil Misiowiec at Webficient created a tool to generate a Sitemap for your Rails app’s static content.

While each is incredibly useful, I wanted a solution that combined both. I took the code created by all of the above and extended it to generate Sitemaps for both your dynamic and static content in one pass. Along the way, I also ran into and fixed a problem with the dynamic Sitemap generator: the XML it produced was a single line, and Google rejected it with a nondescript error.
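For reference, a Sitemap file is a small XML document: a urlset root element holding one url entry per page. The sketch below builds one with plain string formatting rather than Builder, purely to show the shape with one element per line. The helper names (xml_escape, sitemap_xml) and the example.com URLs are made up for illustration and are not part of the crawlers in this post.

```ruby
# Illustrative only: builds a minimal sitemap with one element per line.
# `xml_escape`, `sitemap_xml` and the sample URLs are hypothetical names.
def xml_escape(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

def sitemap_xml(urls, lastmod)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  urls.each do |url|
    lines << '  <url>'
    lines << "    <loc>#{xml_escape(url)}</loc>"
    lines << "    <lastmod>#{lastmod}</lastmod>"
    lines << '    <changefreq>weekly</changefreq>'
    lines << '  </url>'
  end
  lines << '</urlset>'
  lines.join("\n") + "\n"
end

puts sitemap_xml(['http://example.com/about',
                  'http://example.com/projects?page=2&sort=name'],
                 '2009-01-15')
```

Note that an ampersand in a URL has to be escaped as an XML entity inside the loc element, which Builder does for you in the real code.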

To get up and running with all of this, do the following:

1. Make sure you have the “mechanize” gem installed:

sudo gem install mechanize

2. Be sure to create a “sitemaps” subfolder in your [rails_app]/public directory.

3. Copy the two files below to your [rails_app]/lib directory:

# static_crawler.rb

require 'mechanize'

class StaticCrawler

  # EXTENSIONS_IGNORED = %w[.csv .doc .docx .gif .jpg .jpeg .js .mp3 .mp4 .mpg .mpeg .pdf .png .ppt .rss .swf .txt .xls .xlsx .xml]
  # BRB - In my case, I want to index document types like doc and pdf
  EXTENSIONS_IGNORED = %w[.csv .gif .jpg .jpeg .js .mp3 .mp4 .mpg .mpeg .png .rss .swf .xml]

  PROTOCOLS_IGNORED = %w[feed ftp itms javascript mailto]

  def initialize(starting_url, credentials = nil, quiet_mode = false, sitemap = false, debug = false)
    @bad_pages = []
    @agent = WWW::Mechanize.new
    @sitemap = sitemap
    @debug = debug
    @visited_pages = []

    if credentials
      # Split on the first colon only, so the password may itself contain colons
      user, password = credentials.split(':', 2)
      @agent.basic_auth(user, password)
    end

    @quiet_mode = quiet_mode
    @starting_url = starting_url
    @starting_url_domain = starting_url[/([a-z0-9-]+)\.([a-z.]+)/i]
    puts "domain: #{@starting_url_domain}" if @debug
    extract_and_call_urls(starting_url)
    generate_sitemap if @sitemap
  end

  def extract_and_call_urls(url)
    #get page
    puts "#{@visited_pages.size+1} #{url}" unless @quiet_mode
    begin
      page = @agent.get(url)
    rescue => exception
      @bad_pages << url
      puts "error: #{url}, #{exception.message}"
      return
    end

    #for any content types we may have missed above, exit if content type is not html
    return if page.instance_of?(WWW::Mechanize::File) || page.content_type.index('text/html') == nil

    #add to array
    @visited_pages << url

    #get links found on page
    links = page.links

    #for each link, call the url if not in history
    links.each do |link|
      next if ignore_url?(link.href) || @visited_pages.include?(link.href)
      extract_and_call_urls(link.href)
    end
  end

  private

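  def ignore_url?(url)
    # NOTE: "webficient.com" is hard-coded from the original Webficient tool.
    # On any other domain, every absolute (http...) link is skipped, including
    # links back to your own site; only relative links are crawled.
    ignored = url.nil? ||
              (url.include?('http') && !url.include?("webficient.com")) ||
              @bad_pages.include?(url) ||
              PROTOCOLS_IGNORED.any? { |prt| url =~ /#{prt}:/ } ||
              # Regexp.escape keeps the dot in each extension literal
              EXTENSIONS_IGNORED.any? { |ext| url =~ /#{Regexp.escape(ext)}$/ }
    puts "ignored: #{url}" if ignored && @debug
    ignored
  end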

  def generate_sitemap
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str, :indent => 2)

    xml.instruct!
    xml.urlset(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      @visited_pages.each do |url|
        unless @starting_url == url
          xml.url {
            xml.loc(@starting_url + url)
            xml.lastmod(Time.now.utc.strftime("%Y-%m-%dT%H:%M:%S+00:00"))
            xml.changefreq('weekly')
          }
        end
      end
    }

    save_file(xml_str)

    # BRB - don't need to call this as something similar is called at the end of ModelCrawler
    # update_google
  end

  # Saves the xml file to disc. This could also be used to ping the webmaster tools
  def save_file(xml)
    File.open(RAILS_ROOT + '/public/sitemaps/static.xml', "w+") do |f|
      f.write(xml)
    end
  end

  # Notify google of the new sitemap
  # def update_google
  #     sitemap_uri = @starting_url + '/sitemap.xml'
  #     escaped_sitemap_uri = URI.escape(sitemap_uri)
  #     Net::HTTP.get('www.google.com',
  #                   '/webmasters/tools/ping?sitemap=' +
  #                   escaped_sitemap_uri)
  # end

end
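
To make the crawler's link filtering easier to follow, here is a standalone sketch of the same idea outside the class. It is a slightly stricter variant rather than a copy of ignore_url?: it anchors the protocol and extension checks and takes the site domain as an argument. The method name skip_link?, its arguments, the shortened extension list, and the example.com links are all illustrative assumptions.

```ruby
# Standalone sketch of the link filter; `skip_link?` and its arguments are
# illustrative names, not part of StaticCrawler's interface.
EXTENSIONS_IGNORED = %w[.csv .gif .jpg .jpeg .js .png .xml]
PROTOCOLS_IGNORED  = %w[feed ftp itms javascript mailto]

def skip_link?(href, site_domain)
  return true if href.nil?
  # Absolute links are followed only when they stay on our own domain.
  return true if href.start_with?('http') && !href.include?(site_domain)
  return true if PROTOCOLS_IGNORED.any? { |proto| href.start_with?("#{proto}:") }
  # Regexp.escape keeps the dot in ".js" literal; \z anchors at the string end.
  EXTENSIONS_IGNORED.any? { |ext| href =~ /#{Regexp.escape(ext)}\z/i }
end

%w[/about /logo.png mailto:me@example.com http://other.org/ http://example.com/blog].each do |href|
  puts "#{href}: #{skip_link?(href, 'example.com') ? 'skip' : 'crawl'}"
end
```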

# model_crawler.rb

require 'net/http'
require 'uri'
require 'zlib'

# A class specific to the application which generates a google sitemap from the contents of the database.
# Author: Alastair Brunton
# Modified: Harry Love 2008-06-09
class ModelCrawler

  def initialize(base_url, sources)
    @base_url = base_url
    @sources = sources
  end

  # 1. Iterate through each model's #get_paths method
  # 2. Create sitemap file for each model
  # 3. Create sitemap index file
  # 4. Ping Google
  def generate
    @sources.each do |source|
      # Look up the model class by name (constantize comes from ActiveSupport)
      # and call its get_paths class method.
      path_ar = source.constantize.get_paths
      xml = generate_sitemap(path_ar)
      save_file(source, xml)
    end
    index = generate_sitemap_index(@sources)
    save_file('index', index)
    update_google
  end

  # Create a sitemap document for a model
  def generate_sitemap(path_ar)
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str, :indent => 2)
    xml.instruct!
    xml.urlset(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      path_ar.each do |path|
        xml.url {
          xml.loc(@base_url + path[:url])
          xml.lastmod(path[:last_mod])
          xml.changefreq('weekly')
        }
      end
    }
    xml_str
  end

  # Create a sitemap index document
  def generate_sitemap_index(sitemaps)
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str, :indent => 2)
    xml.instruct!
    xml.sitemapindex(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      xml.sitemap {
        xml.loc(@base_url + "/sitemaps/static.xml")
        xml.lastmod(Time.now.strftime('%Y-%m-%d'))
      }
      sitemaps.each do |site|
        xml.sitemap {
          xml.loc(@base_url + "/sitemaps/#{site}.xml.gz")
          xml.lastmod(Time.now.strftime('%Y-%m-%d'))
        }
      end
    }
    xml_str
  end

  # Save the xml file (gzipped) to disk
  def save_file(source, xml)
    path = RAILS_ROOT + "/public/sitemaps/#{source}.xml.gz"
    # GzipWriter.open closes the file when the block ends
    Zlib::GzipWriter.open(path) do |gz|
      gz.write xml
    end
  end

  # Notify Google of the new sitemap index file
  def update_google
    sitemap_uri = @base_url + '/sitemaps/index.xml.gz'
    escaped_sitemap_uri = URI.escape(sitemap_uri)
    Net::HTTP.get('www.google.com', '/webmasters/tools/ping?sitemap=' + escaped_sitemap_uri)
  end
end
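
ModelCrawler expects every class named in sources to respond to a get_paths class method that returns an array of hashes with :url and :last_mod keys, since those are the keys generate_sitemap reads. The listings above do not include such a method, so here is a minimal sketch of what one might look like. It uses a plain Ruby class in place of a real ActiveRecord model so it can run on its own; the fields (id, updated_at), the hard-coded records, and the /projects/:id path are assumptions for illustration.

```ruby
require 'time'

# Stand-in for an ActiveRecord model. In a real app the records would come
# from the database; the fields and the /projects/:id path are assumptions.
class Project
  attr_reader :id, :updated_at

  def initialize(id, updated_at)
    @id = id
    @updated_at = updated_at
  end

  # Hypothetical record source used only by this sketch.
  def self.all
    [new(1, Time.utc(2009, 1, 15, 9, 30)), new(2, Time.utc(2009, 2, 1, 18, 0))]
  end

  # Returns the structure ModelCrawler#generate_sitemap reads: one hash per
  # record with :url (a path) and :last_mod (a W3C datetime string).
  def self.get_paths
    all.map do |project|
      { :url => "/projects/#{project.id}", :last_mod => project.updated_at.utc.iso8601 }
    end
  end
end

Project.get_paths.each { |path| puts "#{path[:url]} #{path[:last_mod]}" }
```

The :url value is a path rather than a full URL because generate_sitemap prepends @base_url to it.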

4. Alter deploy.rb

Now you’ll need an entry in your [rails_app]/config/deploy.rb file to copy your Sitemaps over with each new release:

namespace :sitemap do
  desc "Copy the sitemap files after deploy"
  task :copy_sitemap, :roles => :app, :on_error => :continue do
    puts "copying Rails sitemap files"
    run "cp #{previous_release}/public/sitemaps/* #{current_release}/public/sitemaps/"
  end
end

after :deploy, 'sitemap:copy_sitemap'

5. Create a rake task

Now add a rake task to actually perform the Sitemap generation by creating the [rails_app]/lib/tasks/sitemap.rake file and adding the following code:

require 'static_crawler'
require 'model_crawler'

site_url = ENV['URL'] || 'http://localhost:3000'

namespace :sitemap do

  desc 'Crawl the site and create sitemap xml files for both static and dynamic content.  Set CREDS as username:password if you are hitting a password protected site.'

  task(:generate => :environment) do
    # Generate static sitemap (arguments: url, credentials, quiet_mode, sitemap, debug)
    StaticCrawler.new(site_url, ENV['CREDS'], true, true, false)

    # Generate dynamic sitemaps for each of the models listed in the array
    models = %w( Project )
    sitemap = ModelCrawler.new(site_url, models)
    sitemap.generate
  end

end

6. Set up a cron task

Finally, add an entry in your crontab to periodically run the rake task and generate Sitemaps:

30 9 * * * cd /path/to/rails/app && /path/to/rake sitemap:generate URL=http://domain.com RAILS_ENV=production

Be sure to verify the path to your rake command.  It can be different on some systems.
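
After the first run, it can be worth confirming that the gzipped files under public/sitemaps decompress to XML before submitting them. The sketch below shows one way to read a gzipped file back with Ruby's Zlib. The helper name read_gzipped_sitemap is hypothetical, and the demo writes a throwaway file in a temp directory rather than touching a real app.

```ruby
require 'zlib'
require 'tmpdir'

# Reads a gzipped sitemap and returns its XML text.
# `read_gzipped_sitemap` is an illustrative helper, not part of the crawlers.
def read_gzipped_sitemap(path)
  Zlib::GzipReader.open(path) { |gz| gz.read }
end

# Demo against a throwaway file; in a real app, point `path` at something
# like public/sitemaps/index.xml.gz instead.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'Project.xml.gz')
  Zlib::GzipWriter.open(path) do |gz|
    gz.write(%(<?xml version="1.0" encoding="UTF-8"?>\n<urlset/>\n))
  end

  xml = read_gzipped_sitemap(path)
  name = File.basename(path)
  puts xml.start_with?('<?xml') ? "#{name} looks like XML" : "#{name} is not XML"
end
```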
