Sitemaps and robots.txt

LiveWhale automatically generates a sitemap to help search engines like Google index fresh and relevant content as quickly as possible, and a robots.txt file that points search engines to the sitemap and instructs them to exclude certain URLs.

Instruct Google to use your sitemap

Google will automatically crawl your site anyway, and it should find your sitemap via robots.txt, but you can help it along by submitting your sitemap in Google Webmaster Tools (now Google Search Console). You’ll need to create an account and add a property for your website, and you may need to complete a few verification steps to prove to Google that you administer the site.

Once that’s done, select your property and find Crawl > Sitemaps in the main menu. Click Add/Test Sitemap and enter https://myschool.edu/sitemap.livewhale.xml. Once the sitemap is submitted, Google will use it when indexing your site.

Note: Google chooses how quickly and how often to re-index your site. While you can request reindexing of specific pages, it can still take several months for certain content to appear in Google. If you update your pages frequently with content Google’s algorithm deems high-quality, you should have no trouble getting indexed, but it can sometimes take a while.

How LiveWhale generates your sitemap

LiveWhale makes an effort to include your most up-to-date, relevant content in the sitemap.

What is included

  • All content meeting these criteria:
    • marked Live
    • visible to “Everyone”
  • All pages meeting these criteria:
    • marked Live
    • visible to “Everyone”
    • page is in one of your navigations

What is excluded

  • All content and pages marked Hidden
  • All content and pages visible to “This group only,” “Logged-in users,” or “Anyone with the link”
  • All content that is archived
  • Pages that are not in any navigation
  • Old content (e.g., news stories that are several years old)
  • Anything in your /_ingredients/ folder (e.g., templates)
  • Anything in a folder that starts with /_ (e.g., /_sample/index.php)
  • Any page that contains “.test.” in the filename (e.g., index.test.php)
  • Any page that contains “/test/” in the file path (e.g., /admissions/test/open-house/index.php)
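
For reference, each page or piece of content that meets the criteria above appears in the sitemap as a standard sitemap-protocol <url> entry. A minimal sketch (the URL and date are placeholders; the exact elements LiveWhale emits may vary):

<url>
        <loc>https://myschool.edu/academics/index.php</loc>
        <lastmod>2022-11-08</lastmod>
</url>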

Google News Sitemap (LiveWhale 2.7.0+)

In LiveWhale 2.7.0 and later, news stories published within the last two days receive additional markup in your sitemap XML file, in accordance with the Google News <news:news> specification.

<news:news>
        <news:publication>
                <news:name>My University</news:name>
                <news:language>en</news:language>
        </news:publication>
        <news:publication_date>2022-11-08</news:publication_date>
        <news:title>Title of Article</news:title>
</news:news>
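
Per the Google News sitemap specification, the <news:news> element is nested inside the story’s <url> entry, and the news namespace is declared on the <urlset> element. A sketch of the surrounding structure (the story URL is a placeholder):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
        <url>
                <loc>https://myschool.edu/news/title-of-article</loc>
                <news:news>
                        <!-- publication, publication_date, and title as shown above -->
                </news:news>
        </url>
</urlset>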

Adding custom items to the LiveWhale sitemap

If you have special content outside of LiveWhale that you want to include in your sitemap, create a file at /sitemap.custom.xml in your main web root. If that file exists, LiveWhale will detect it and include its entries in the main sitemap the next time the sitemap is generated.
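
The exact format LiveWhale expects for this file isn’t specified here, so a standard sitemap-protocol file is a reasonable assumption. A minimal sketch (the URLs are placeholders; only <loc> is required per entry):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
                <loc>https://myschool.edu/special-landing-page/</loc>
        </url>
        <url>
                <loc>https://myschool.edu/legacy-catalog/index.html</loc>
                <lastmod>2022-11-08</lastmod>
        </url>
</urlset>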

How LiveWhale generates your robots.txt

The robots.txt file tells search engines where your sitemap is located and instructs web crawlers not to visit

  • pages that are Hidden
  • pages visible to “This group only”
  • pages visible to “Logged-in users”

Pages visible to “Anyone with the link” are not listed in robots.txt, so you don’t have to worry about anyone discovering those URLs there. Those pages do, however, include a robots “noindex” meta tag, so search engines are still instructed not to index them.
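
As a rough sketch, a generated robots.txt following the rules above might look something like this (the Disallow paths are placeholders, and the exact directives LiveWhale writes may differ):

Sitemap: https://myschool.edu/sitemap.livewhale.xml

User-agent: *
Disallow: /admissions/hidden-page/
Disallow: /alumni/members-only/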
