Should Allow Come Before Disallow in robots.txt? RFC and Compatibility Best Practices
`robots.txt` controls web crawling. Have you ever been told that “`Allow` directives should be written before comprehensive `Disallow` directives”? While this practice is treated as “common sense” in many workplaces, do you truly understand its technical background?
To get straight to the point: this advice is “incorrect if you only consider crawlers that comply with the latest RFC,” but “correct from the perspective of maximizing compatibility with all crawlers.”

This article explores the interpretation rules of `robots.txt` behind this seemingly contradictory conclusion and presents best practices for real-world implementation.
Currently, major search engines such as Google and Bing follow the `robots.txt` specification standardized as RFC 9309 (Robots Exclusion Protocol). In this specification, the priority of rule application is not determined by the order of directives. Instead, it follows these rules:

- The crawler evaluates all `Allow` and `Disallow` directives within the groups matching its User-agent.
- Among the matching rules, the most specific one, that is, the one with the longest matching path, is applied.
- When an `Allow` rule and a `Disallow` rule are equally specific, `Allow` takes precedence over `Disallow`.

For example, the following two `robots.txt` files have exactly the same meaning for RFC 9309-compliant crawlers:
```
# Pattern A: Disallow first
User-agent: *
Disallow: /
Allow: /assets/
```

```
# Pattern B: Allow first
User-agent: *
Allow: /assets/
Disallow: /
```
In both patterns, when evaluating the URL `/assets/styles.css`, `Allow: /assets/` (path length: 8) is more specific than `Disallow: /` (path length: 1), so crawling under the `/assets/` directory is allowed.
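To make this evaluation logic concrete, here is a minimal Python sketch of the specificity rule: the longest matching path wins, and `Allow` beats `Disallow` on a tie. It deliberately ignores wildcards, `$` anchors, and percent-encoding, so treat it as an illustration of the precedence model rather than a full RFC 9309 parser.

```python
# Minimal sketch of RFC 9309-style precedence: the longest matching
# path wins, and on a tie Allow takes priority over Disallow.
# Wildcards, "$" anchors, and percent-encoding are intentionally omitted.

def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    best_length = -1
    best_allow = True  # no matching rule means crawling is allowed
    for directive, path in rules:
        if not url_path.startswith(path):
            continue
        allow = directive.lower() == "allow"
        # Longer path = more specific; on equal length, Allow wins.
        if len(path) > best_length or (len(path) == best_length and allow):
            best_length = len(path)
            best_allow = allow
    return best_allow

pattern_a = [("Disallow", "/"), ("Allow", "/assets/")]  # Disallow first
pattern_b = [("Allow", "/assets/"), ("Disallow", "/")]  # Allow first

print(is_allowed(pattern_a, "/assets/styles.css"))  # True
print(is_allowed(pattern_b, "/assets/styles.css"))  # True
print(is_allowed(pattern_a, "/private/page.html"))  # False
```

Running it shows that Pattern A and Pattern B produce identical results, which is exactly why order doesn’t matter for specificity-based crawlers.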
Google’s official documentation also explains this specificity-based evaluation logic without mentioning line order. In other words, for major search engines, the order of `Allow` and `Disallow` doesn’t affect crawling results.
The background of this convention lies in the history of `robots.txt` and the diversity of existing crawlers.

Historically, there have been two main approaches to interpreting `robots.txt` rules:
- **Specificity rule**: As described above, this is the current standard adopted by Google and others. Priority is determined by path length.
- **First-match rule**: An approach used by older or simply implemented custom crawlers. They read the file from top to bottom, apply the first matching rule, and stop evaluating.
For crawlers using the “first match” rule, order is critically important. Let’s look at Pattern A again:
```
# Pattern A: Disallow first
User-agent: *
Disallow: /       # ← Everything matches here
Allow: /assets/   # ← This line is never evaluated
```
When such a crawler evaluates `/assets/styles.css`, it concludes “crawling not allowed” as soon as it matches the first rule, `Disallow: /`, and the subsequent `Allow: /assets/` is never consulted. As a result, crawling of the entire site is unintentionally denied.
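For contrast, here is an equally minimal sketch of first-match evaluation under the same simplifying assumptions. With Pattern A, the catch-all `Disallow: /` matches first and the later `Allow: /assets/` never gets a chance.

```python
# Minimal sketch of first-match evaluation: rules are checked from top
# to bottom and the first one whose path matches decides the result.

def is_allowed_first_match(rules: list[tuple[str, str]], url_path: str) -> bool:
    for directive, path in rules:
        if url_path.startswith(path):
            return directive.lower() == "allow"
    return True  # no matching rule means crawling is allowed

pattern_a = [("Disallow", "/"), ("Allow", "/assets/")]  # Disallow first
pattern_b = [("Allow", "/assets/"), ("Disallow", "/")]  # Allow first

print(is_allowed_first_match(pattern_a, "/assets/styles.css"))  # False: blocked
print(is_allowed_first_match(pattern_b, "/assets/styles.css"))  # True: allowed
```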
To avoid such tragedies, the practice of “writing exceptional permissions (`Allow`) first and comprehensive denials (`Disallow`) later” emerged and spread as a defensive writing style that ensures the intended behavior across all crawlers.
Based on this background, our action plan is clear:
| Target | Order Impact | Recommended Writing Style |
|---|---|---|
| Major search engines like Google/Bing | No impact | Either order is fine |
| Old crawlers or unknown bots | High possibility of impact | `Allow` → `Disallow` order is safer |
| Code readability/maintainability | Affects human readability | `Allow` → `Disallow` (“exceptions first, general rules later”) is more intuitive |
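If you want to sanity-check a concrete `robots.txt`, one convenient option is Python’s standard `urllib.robotparser` module. Keep in mind that this is just one particular parser implementation and its interpretation isn’t guaranteed to match RFC 9309 in every corner case, so treat the output as a single data point rather than the canonical answer. The snippet below uses the safe `Allow`-first ordering, where both interpretation styles agree.

```python
from urllib.robotparser import RobotFileParser

# The "safe" Allow-first ordering: first-match and longest-match
# interpretations agree on the result for these URLs.
robots_txt = """\
User-agent: *
Allow: /assets/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/assets/styles.css"))  # expect True
print(parser.can_fetch("*", "https://example.com/private/page.html"))  # expect False
```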
Let me summarize the key points about `Allow`/`Disallow` order in `robots.txt`:

- For RFC 9309-compliant crawlers such as those of Google and Bing, order has no effect: the most specific (longest) matching path wins, and `Allow` beats `Disallow` on ties.
- Older or simply implemented crawlers may apply the first matching rule, so for them order can change the result.
- For maximum compatibility and readability, writing “specific exceptions (`Allow`) first, general rules (`Disallow`) later” makes sense.

Therefore, while the claim that “it won’t work unless you change the order” may be technically inaccurate for modern crawlers, it’s extremely valuable advice for ensuring maximum compatibility and preventing unintended behavior. Unless there’s a specific reason not to, following this safe practice is recommended.
That’s all from the field, where I’ve relearned the `robots.txt` specification.