Introduction
robots.txt is the first file search engine crawlers check. It tells them which areas they can access. Misconfiguration can block your entire site or waste crawl budget on unimportant pages.
Understanding robots.txt is essential — it controls crawl behavior and helps search engines focus on important pages.
This guide covers syntax, common patterns, and the important distinction between blocking crawling and blocking indexing.
Key Concepts
Basic Syntax
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
Crawling vs Indexing
robots.txt blocks CRAWLING, not INDEXING. A disallowed page can still appear in search results if other pages link to it; Google just can't read its content, so the listing shows a bare URL without a snippet. To prevent indexing, use a noindex meta tag or an X-Robots-Tag response header, and keep the page crawlable so the crawler can actually see that signal.
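To make the distinction concrete, here is a minimal TypeScript sketch of the two signals that actually prevent indexing (the function names are illustrative, not a real API):

```typescript
// Sketch: the two "don't index" signals, unlike a robots.txt Disallow.

// 1. The meta tag to place in an HTML page's <head>:
function noindexMetaTag(): string {
  return '<meta name="robots" content="noindex, follow">';
}

// 2. The equivalent HTTP response header, useful for non-HTML
//    resources like PDFs where a meta tag isn't possible:
function noindexHeader(): Record<string, string> {
  return { 'X-Robots-Tag': 'noindex, follow' };
}

console.log(noindexMetaTag());
console.log(noindexHeader()['X-Robots-Tag']); // noindex, follow
```

Either signal tells the crawler "you may read this, but don't list it", which is the opposite trade-off from a Disallow rule.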
Practical Examples
1. Next.js robots.txt
// app/robots.ts (App Router: served automatically at /robots.txt)
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [{ userAgent: '*', allow: '/', disallow: ['/admin/', '/api/'] }],
    sitemap: 'https://example.com/sitemap.xml',
  };
}
2. Environment-Aware
// app/robots.ts
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  // Block everything outside production so staging deployments never get crawled.
  if (process.env.NODE_ENV !== 'production') {
    return { rules: { userAgent: '*', disallow: '/' } };
  }
  return { rules: { userAgent: '*', allow: '/' }, sitemap: 'https://example.com/sitemap.xml' };
}
3. Block Specific Bots
User-agent: AhrefsBot
Disallow: /
User-agent: *
Disallow: /api/
Disallow: /*?sort=
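If you generate robots.txt from code, per-bot rules like the ones above can be assembled with a small helper. This is a plain-TypeScript sketch (the `RobotsRule` type and `buildRobotsTxt` name are illustrative, not a library API):

```typescript
// Sketch: serialize per-bot rules into robots.txt groups.
type RobotsRule = { userAgent: string; disallow: string | string[] };

function buildRobotsTxt(rules: RobotsRule[]): string {
  return rules
    .map((r) => {
      const paths = Array.isArray(r.disallow) ? r.disallow : [r.disallow];
      // One group per user agent: a User-agent line, then its Disallow lines.
      return [`User-agent: ${r.userAgent}`, ...paths.map((p) => `Disallow: ${p}`)].join('\n');
    })
    .join('\n\n'); // blank line between groups
}

console.log(
  buildRobotsTxt([
    { userAgent: 'AhrefsBot', disallow: '/' },
    { userAgent: '*', disallow: ['/api/', '/*?sort='] },
  ])
);
```

Generating the file keeps the bot list in one place instead of hand-editing a static asset.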
Best Practices
- ✅ Keep robots.txt simple and organized
- ✅ Include Sitemap directive
- ✅ Block admin, API, and preview routes
- ✅ Test with the robots.txt report in Google Search Console (the standalone robots.txt Tester has been retired)
- ✅ Use noindex for pages you don't want indexed
- ❌ Don't block CSS or JavaScript files
- ❌ Don't rely on robots.txt for security
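Testing can also be automated. As a rough sketch (the function name is made up, and it deliberately ignores multi-line User-agent groups), a deploy-time smoke test can fetch /robots.txt and fail the build if the wildcard group blocks the whole site:

```typescript
// Sketch: return true if the "*" group contains a bare "Disallow: /",
// i.e. the file blocks all compliant crawlers from the entire site.
// Simplified: assumes one User-agent line per group.
function blocksEverything(body: string): boolean {
  let inStarGroup = false;
  for (const raw of body.split('\n')) {
    const line = raw.trim().toLowerCase();
    if (line.startsWith('user-agent:')) {
      inStarGroup = line.slice('user-agent:'.length).trim() === '*';
    } else if (inStarGroup && line === 'disallow: /') {
      return true;
    }
  }
  return false;
}

console.log(blocksEverything('User-agent: *\nDisallow: /'));     // true
console.log(blocksEverything('User-agent: *\nDisallow: /api/')); // false
```

A check like this catches the classic accident of shipping a staging robots.txt to production.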
Common Pitfalls
- 🚫 Blocking CSS/JS — Google can't render pages
- 🚫 Using Disallow when you mean noindex
- 🚫 Forgetting robots.txt on staging
- 🚫 Overly broad Disallow rules
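The last pitfall is easy to reason about with a quick check. This is a rough prefix-only sketch (real crawlers also honor Allow rules and the `*` and `$` wildcards, and pick the longest matching rule), but it shows how a missing trailing slash widens a rule:

```typescript
// Sketch: prefix-only check of a URL path against Disallow rules.
// Simplified: no Allow rules, no wildcards, no longest-match precedence.
function isDisallowed(path: string, disallow: string[]): boolean {
  return disallow.some((rule) => rule !== '' && path.startsWith(rule));
}

// "Disallow: /admin" (no trailing slash) matches more than intended:
console.log(isDisallowed('/administration/help', ['/admin']));  // true
console.log(isDisallowed('/administration/help', ['/admin/'])); // false
```

Writing rules with trailing slashes (`/admin/` rather than `/admin`) keeps the match scoped to the directory you actually mean.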