Sitemaps & robots.txt: Practical SEO Guide | My Seven Stars
Why this matters (short)
Sitemaps help search engines discover URLs and metadata (lastmod, priority, hreflang links). robots.txt tells crawlers which parts of a site they may (or may not) request. Together they reduce crawler waste, speed discovery, and avoid accidental indexing of private or resource-heavy areas.
Important: robots.txt controls crawling, not indexing. To prevent indexing, use a meta robots noindex tag or an X-Robots-Tag HTTP header.
Quick definitions — plain English
XML Sitemap — an XML file (usually sitemap.xml) listing canonical URLs and optional metadata (lastmod, changefreq, priority). Useful for discovery, especially on large or complex sites.
Sitemap index — a file that points to multiple sitemap files. Use it when you exceed a sitemap file’s limits.
robots.txt — a plain-text file at the root (e.g., https://www.example.com/robots.txt) that gives crawling instructions to bots.
Key limits & behavior (what you must know)
A single sitemap file may contain up to 50,000 URLs and must be no larger than 50MB uncompressed. If you exceed either limit, split into multiple sitemaps and use a sitemap index.
A sitemap index may list up to 50,000 sitemap files.
robots.txt must live at the site root to apply (e.g., /robots.txt). It cannot control other hosts/subdomains.
Disallowing a URL in robots.txt does not reliably keep it out of search results; to prevent indexing use meta robots or HTTP headers.
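The two sitemap limits above can be checked mechanically before a file is published. The following is a minimal sketch in Python, assuming the file uses the standard sitemaps.org namespace and that 50MB means 52,428,800 bytes; the function name check_sitemap_limits and its parameters are hypothetical, not part of any tool mentioned in this guide.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB uncompressed, assumed to mean 52,428,800 bytes
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def check_sitemap_limits(xml_bytes, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Return a list of problems; an empty list means the file is within limits."""
    problems = []
    # Size is measured on the uncompressed bytes, so gunzip .xml.gz files first.
    if len(xml_bytes) > max_bytes:
        problems.append(f"file is {len(xml_bytes)} bytes, limit is {max_bytes}")
    # ET.fromstring raises ParseError if the XML is not well-formed.
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{NS}url"))
    if url_count > max_urls:
        problems.append(f"file lists {url_count} URLs, limit is {max_urls}")
    return problems
```

A non-empty result is the signal to split the file and publish a sitemap index instead.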
General workflow: the 6-step plan
Generate an XML sitemap that lists canonical URLs (and hreflang groups where applicable).
Validate the sitemap XML and ensure file size/URL count are within limits.
Place the sitemap at a logical URL (e.g., /sitemap.xml) and optionally compress to .xml.gz if you serve gzip files.
Add a reference to your sitemap in /robots.txt (optional but recommended): Sitemap: https://www.example.com/sitemap.xml.
Submit the sitemap in Google Search Console and Bing Webmaster Tools; monitor processing and errors.
Check robots.txt in Google Search Console’s robots.txt report (which replaced the older robots.txt tester) and keep the file minimal; avoid blocking important resources (CSS/JS) that affect rendering.
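Step 1 of the plan above can be sketched with the Python standard library. This is an illustrative sketch, not the output of any particular CMS plugin: the function name build_sitemap and the (url, lastmod) input shape are assumptions, and it emits only loc and lastmod.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML bytes from (url, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when serializing text.
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([
        ("https://www.example.com/", "2025-09-01"),
        ("https://www.example.com/blog/seo-sitemap-robots", "2025-09-18"),
    ])
    print(xml_bytes.decode("utf-8"))
```

Writing the returned bytes to /sitemap.xml covers step 3; the size and count checks in step 2 should still run on the final file.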
Copy/paste examples
robots.txt (basic)
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Link to sitemap(s)
Sitemap: https://www.example.com/sitemap.xml
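In the basic robots.txt above, the Allow line carves an exception out of the broader Disallow. Crawlers that follow RFC 9309 (Googlebot among them) resolve such conflicts by taking the most specific, that is longest, matching path, with Allow winning a tie. Below is a minimal Python sketch of that precedence rule. It is a simplification: it handles plain path prefixes only, ignores the * and $ wildcards and query strings, and the RULES list and is_allowed name are hypothetical.

```python
from urllib.parse import urlsplit

# Rules for "User-agent: *" copied from the basic example: (directive, path prefix).
RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]


def is_allowed(url, rules=RULES):
    """Longest-prefix match; on equal length, allow beats disallow."""
    path = urlsplit(url).path or "/"
    matches = [
        (len(prefix), kind == "allow")
        for kind, prefix in rules
        if prefix and path.startswith(prefix)  # an empty path matches nothing
    ]
    if not matches:
        return True  # no rule applies, so crawling is allowed
    # max() picks the longest prefix; for ties, True (allow) sorts above False.
    return max(matches)[1]


if __name__ == "__main__":
    for u in (
        "https://www.example.com/wp-admin/admin-ajax.php",
        "https://www.example.com/wp-admin/options.php",
        "https://www.example.com/blog/",
    ):
        print(u, is_allowed(u))
```

Parsers that apply the first matching line instead of the longest one can give a different answer for admin-ajax.php, so it is worth confirming the result in the search engine's own tooling.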
XML sitemap (small site example)
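<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2025-09-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.00</priority>
  </url>
  <url>
    <loc>https://www.example.com/blog/seo-sitemap-robots</loc>
    <lastmod>2025-09-18</lastmod>
  </url>
</urlset>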
Sitemap index (for large sites)
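<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-posts-1.xml.gz</loc>
    <lastmod>2025-10-02</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-posts-2.xml.gz</loc>
    <lastmod>2025-10-02</lastmod>
  </sitemap>
</sitemapindex>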
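For a large site, the split into numbered files and the index that points at them are usually generated together. The sketch below shows one way to do that in Python. The file naming pattern mirrors the example above, but the helper names (chunk_urls, sitemap_file_urls, build_sitemap_index) and the base URL are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def sitemap_file_urls(count, base="https://www.example.com/sitemap-posts-"):
    """Hypothetical naming scheme: sitemap-posts-1.xml.gz, sitemap-posts-2.xml.gz, ..."""
    return [f"{base}{i}.xml.gz" for i in range(1, count + 1)]


def build_sitemap_index(sitemap_urls, lastmod):
    """Return sitemap index XML bytes listing each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    all_urls = [f"https://www.example.com/post/{i}" for i in range(60_000)]
    chunks = list(chunk_urls(all_urls))
    index_xml = build_sitemap_index(sitemap_file_urls(len(chunks)), "2025-10-02")
    print(index_xml.decode("utf-8"))
```

Splitting by URL count alone does not guarantee the 50MB size limit, so very long URLs or rich metadata may call for smaller chunks.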
Best practices & tips
Canonicalize first: Sitemaps should list canonical URLs only (no duplicate /with-trailing-slash and /without variants).
Don’t include noindexed pages: If a page is noindex, remove it from sitemaps; listing it there sends mixed signals.
Use lastmod correctly: Only update lastmod when the content changes meaningfully (not on every visit or analytics update).
Keep sitemaps fresh: For sites with frequent new content, generate sitemaps automatically and notify search engines (for example, by re-submitting via the GSC API; note that Google has retired its sitemap ping endpoint).
Reference sitemaps in robots.txt: makes discovery easier for crawlers that check robots.txt first. (Robots.txt and sitemaps are complementary.)
Compress if needed: .xml.gz is supported and reduces bandwidth, but the 50MB limit applies to the uncompressed file.
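The lastmod tip above implies some way of telling a meaningful change from a cosmetic one. One possible approach, sketched here in Python, is to fingerprint the page's main content and only move lastmod when the fingerprint changes. Treating whitespace-only edits as not meaningful is an assumption of this sketch, and the function names and stored fields are hypothetical.

```python
import hashlib
from datetime import date


def content_fingerprint(main_content):
    """Hash the main content with whitespace collapsed, so reflows do not count."""
    normalized = " ".join(main_content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_lastmod(stored_hash, stored_lastmod, main_content, today=None):
    """Return (hash, lastmod); lastmod only changes when the content hash does."""
    new_hash = content_fingerprint(main_content)
    if new_hash == stored_hash:
        return stored_hash, stored_lastmod  # unchanged content keeps the old date
    today = today or date.today()
    return new_hash, today.isoformat()
```

Only the text passed in as main_content is compared, so view counters, analytics snippets, and similar page chrome should be excluded before hashing.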
Testing & verification
Submit the sitemap in Google Search Console → Sitemaps and monitor “Discovered URLs” and errors.
Use the robots.txt report in Google Search Console (the successor to the robots.txt tester), together with URL Inspection, to confirm important pages aren’t accidentally blocked.
Validate XML with an XML validator or online sitemap validator (many CMS plugins also provide validation).
Check Crawl stats in GSC to see how often Googlebot requests your site; cutting unnecessary crawling can reduce server load.
Common pitfalls & how to avoid them
Blocking resources that break rendering
Some SEOs block CSS/JS in robots.txt to save crawl budget. That often backfires because Google needs those files to render pages properly. Only disallow what truly shouldn't be crawled.
Using robots.txt to “noindex”
robots.txt only controls crawling, not indexing. If you want a page removed from search results, use a page-level noindex or a removal request in GSC. The page must stay crawlable for the noindex to work: a crawler that is blocked by robots.txt never sees the directive.
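A page-level noindex can arrive either in the HTML (a meta robots tag) or in the response (an X-Robots-Tag header), so an audit needs to look at both. The following Python sketch checks an already-fetched response for either signal. It is deliberately simplified: it ignores bot-specific forms such as a googlebot meta tag or a user-agent prefix in the header, and the has_noindex name and its inputs are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """True if the X-Robots-Tag header or a meta robots tag contains noindex."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives
```

Running this over every URL in a sitemap is a quick way to catch the mixed-signal case of a noindexed page that is still listed.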
Quick checklist before launch
Sitemap reachable at canonical URL(s) (e.g., /sitemap.xml)
Sitemap size & URL count within limits (split if needed)
robots.txt at site root, checked in the GSC robots.txt report
All important pages are crawlable and not...