October 27, 2025 · 7 min read

Crawl Budget for Large Sites

Crawl budget decides which pages search engines and AI crawlers ever see. A practitioner's playbook for steering crawlers to the pages that earn revenue.

Technical SEO, SEO


When the crawler runs out of patience

On a small site, crawl budget is a non-issue. The crawler shows up, reads everything, leaves. On a large site, with hundreds of thousands or millions of URLs, it becomes one of the most consequential things you manage and one of the least understood. Crawl budget is the finite attention a crawler is willing to spend on your site in a given window. Spend it on junk, and your important pages go stale, get indexed late, or never get seen at all. Steer it to the pages that earn revenue, and the rest of your technical work actually gets discovered.

I have spent fifteen years moving numbers in large programs, and on big sites I have watched genuinely good content fail to perform for one boring reason: the crawler never got to it, or got to it months late, because the site wasted its crawl budget on infinite filter combinations and dead parameter URLs. This is a problem you solve with architecture and discipline, not content. Here is how I approach it.

What actually is crawl budget?

Crawl budget is shaped by two forces. The first is how much a crawler is willing to fetch from your site without straining your servers, which Google calls the crawl capacity limit and which rises when your site responds quickly and reliably. The second is how much it wants to fetch, known as crawl demand, which depends on how important and how fresh your pages appear to be. Big, authoritative, fast sites earn more. Slow, error-prone, low-value sites earn less.

The practical consequence: on a large site you are always making a trade. Every low-value URL a crawler fetches is a high-value URL it did not. The job is to tilt that trade in your favor.

A few realities worth internalizing:

Crawlers do not see your site map of intentions. They see URLs, links, and responses. If your architecture leads them into a swamp, they will crawl the swamp.
It is not only Google anymore. AI and answer-engine crawlers now fetch your pages to build the generated answers that generative engine optimization targets. Wasting crawl budget means wasting your shot at being retrieved and cited, not just indexed.
Most waste is self-inflicted. Faceted navigation, session parameters, infinite calendars, and near-duplicate pages generate the bulk of the problem, and all of them are within your control.

Where do large sites bleed crawl budget?

Before you optimize, find the leaks. On nearly every large site I audit, the same culprits show up.

The usual suspects

Faceted navigation gone wild. Filter and sort combinations multiply into millions of crawlable URLs, most of them thin, duplicative, and worthless.
Parameter sprawl. Tracking, session, and sort parameters create endless variants of the same page.
Soft 404s and error pages. Pages that return a success status while showing nothing useful waste fetches and confuse the crawler.
Redirect chains. Each hop costs a fetch. Long chains burn budget and dilute link equity.
Orphaned and infinite spaces. Calendars, internal search results, and auto-generated pages with no end.
Slow responses. A sluggish server lowers how much a crawler is willing to fetch at all, which is one more reason the business case for speed extends well past user experience.
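To make the faceted navigation problem concrete, here is a rough count for a single category page. The facet names, option counts, and number of sort orders are invented for illustration, and the sketch assumes each facet is either unset or set to exactly one value.

```python
from math import prod

# Hypothetical facets for one category page: name -> number of selectable values.
FACETS = {"color": 12, "size": 8, "brand": 40, "price_band": 6}
SORT_ORDERS = 4  # assumed: relevance, price ascending, price descending, newest


def crawlable_variants(facets: dict[str, int], sort_orders: int = 1) -> int:
    """Count distinct URLs if each facet is either unset or set to one value."""
    # Each facet contributes (n + 1) choices: one per value, plus "not applied".
    return prod(n + 1 for n in facets.values()) * sort_orders


if __name__ == "__main__":
    total = crawlable_variants(FACETS, SORT_ORDERS)
    print(f"{total:,} URL variants for a single category")
    # prints "134,316 URL variants for a single category"
```

Four modest filters and a sort control already yield six figures of URLs for one category. Multiply by the number of categories and allow multi-select filters, and the millions arrive quickly.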

How do you steer crawlers to the pages that matter?

Once you know where the budget leaks, you plug the holes and point the flow at revenue pages. This is steering, not blocking for its own sake.

Control what gets crawled

Block low-value spaces in robots.txt. Internal search results, infinite parameter spaces, and admin areas have no business consuming crawl budget. Disallow them.
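Tame faceted navigation deliberately. Decide which filter combinations deserve indexable pages and which do not. Use canonical tags to consolidate near-duplicates, and noindex the long tail of useless combinations. A noindex tag only works on URLs the crawler is allowed to fetch, so do not also disallow those same URLs in robots.txt, or the tag will never be seen.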
Handle parameters consistently. Strip or canonicalize tracking and session parameters so they do not spawn duplicate URLs.
Fix soft 404s and error responses. Return honest status codes. A page that is gone should say so.
Collapse redirect chains. Point redirects straight at the final destination. One hop, not five.
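The parameter handling step above can be sketched as a small normalizer that maps every tracking or session variant of a URL to one canonical form. This is a minimal sketch, not a definitive implementation: the parameter names in IGNORED_PARAMS are an assumed list, and whether sort or pagination parameters belong in it depends on your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that never change page content; adjust per site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "sessionid"}


def canonicalize(url: str, ignored: set[str] = IGNORED_PARAMS) -> str:
    """Drop ignored query parameters, sort the rest, and discard the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    kept.sort()  # stable order so ?a=1&b=2 and ?b=2&a=1 collapse to one URL
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


if __name__ == "__main__":
    print(canonicalize("https://Example.com/shoes?utm_source=x&color=red&sessionid=abc#top"))
    # prints "https://example.com/shoes?color=red"
```

The same function can feed a canonical tag, a redirect rule, or a log analysis that groups fetches by canonical URL to show how many variants a crawler requested for one page.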

Point the crawler at the right pages

Keep an accurate XML sitemap. Include only canonical, indexable, valuable URLs. A sitemap full of redirects and dead pages teaches the crawler to distrust it.
Use internal linking as a priority signal. Crawlers follow links and infer importance from how you link. Treating internal linking as a growth lever is also how you tell a crawler which pages deserve its attention.
Flatten deep architectures. A page buried ten clicks from the homepage signals low importance and gets crawled rarely. Bring revenue pages closer to the surface.
Serve fast, stable responses. Performance directly raises how much a crawler will fetch. Speed is a crawl strategy, not just a UX one.
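One way to check how deep your revenue pages sit is a breadth-first search over the internal link graph. The sketch below assumes you already have that graph as a dictionary of page to outgoing links, for example from a site crawl export. The sample paths and the max_depth threshold of 3 are illustrative assumptions, not a rule.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first search from the homepage; depth is the minimum clicks to a URL."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def buried_pages(links, home, revenue_pages, max_depth=3):
    """Return revenue pages deeper than max_depth; unreachable pages map to None."""
    depths = click_depths(links, home)
    return {
        page: depths.get(page)
        for page in revenue_pages
        if page not in depths or depths[page] > max_depth
    }


if __name__ == "__main__":
    # Hypothetical link graph: each key links to the pages in its list.
    graph = {
        "/": ["/c"],
        "/c": ["/c/p2"],
        "/c/p2": ["/c/p3"],
        "/c/p3": ["/product-a"],
    }
    print(buried_pages(graph, "/", ["/c", "/product-a", "/orphan"]))
    # prints "{'/product-a': 4, '/orphan': None}"
```

A None value marks a page with no internal path from the homepage at all, which is the orphan case and usually the more urgent fix.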

How do you know it is working?

You cannot manage crawl budget by feel. The signal lives in your server logs, which is where this work separates the practitioners from the theorists.

Read the logs. Log file analysis tells you exactly which pages crawlers fetch, how often, and how much budget goes to junk. This is the ground truth, and most teams never look at it.
Watch crawl stats and coverage. Track how many of your pages are crawled and indexed versus discovered-but-not-crawled. A growing gap is a warning.
Tie it to the audit. Crawl budget is one of the highest-leverage findings in a good SEO audit that surfaces the twenty percent that matters. Treat it as a first-class technical priority on any large site.
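Reading the logs does not require special tooling to get started. The sketch below assumes access logs in the common "combined" format and identifies crawlers by user-agent substring. The tokens in BOT_TOKENS are an assumed watch list, and because user agents can be spoofed, treat the counts as a first pass rather than verified crawler traffic.

```python
import re
from collections import Counter

# Combined Log Format:
# ip - - [time] "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumed user-agent substrings to watch; extend for the crawlers you care about.
BOT_TOKENS = ("Googlebot", "bingbot", "GPTBot")


def summarize(lines):
    """Return {bot: Counter} with total, parameterized, and non-200 fetch counts."""
    stats = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match["agent"].lower()
        bot = next((t for t in BOT_TOKENS if t.lower() in agent), None)
        if bot is None:
            continue  # not a crawler on the watch list
        counts = stats.setdefault(bot, Counter())
        counts["total"] += 1
        if "?" in match["path"]:
            counts["parameterized"] += 1
        if match["status"] != "200":
            counts["non_200"] += 1
    return stats
```

Passing an open file object works, since the function only iterates over lines. If a large share of a crawler's fetches land on parameterized URLs or return something other than 200, that is where the budget is leaking.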

A short crawl budget checklist

Pull and read server logs to see where crawlers actually spend time.
Disallow internal search, infinite spaces, and admin areas in robots.txt.
Decide which faceted URLs are indexable; canonicalize or noindex the rest.
Strip or canonicalize tracking and session parameters.
Fix soft 404s and return honest status codes.
Collapse redirect chains to a single hop.
Keep the XML sitemap to canonical, valuable URLs only.
Flatten architecture so revenue pages sit near the surface.
Improve server speed to raise the crawl rate.
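Before shipping the robots.txt item on this checklist, it is worth confirming that the rules block the junk and nothing else. Here is a minimal sketch using Python's standard-library parser. The rules and URLs are hypothetical, and the sketch sticks to plain prefix rules because wildcard patterns that some crawlers support may not evaluate the same way in this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block internal search and admin areas for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /admin/
"""


def blocked_urls(robots_txt: str, urls: list[str], agent: str = "*") -> list[str]:
    """Return the URLs that the given robots.txt rules disallow for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    candidates = [
        "https://example.com/search?q=boots",
        "https://example.com/admin/users",
        "https://example.com/shoes/red",
    ]
    print(blocked_urls(ROBOTS_TXT, candidates))
    # prints "['https://example.com/search?q=boots', 'https://example.com/admin/users']"
```

Run it against your sitemap URLs as well. Any revenue page that shows up in the blocked list is a rule that needs tightening, and because rules are prefix matches, a broad rule such as /search would also block a path like /searchlight-lamps.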

The takeaway

Crawl budget is invisible until it costs you, and on a large site it always eventually does. The fix is not glamorous: find the leaks in your logs, stop the crawler from wasting fetches on junk, and point its attention at the pages that earn. Do this and everything else you build (fresh content, structured data, refreshed pages) actually gets discovered, indexed, and pulled into the answers machines are now writing. Skip it and you are publishing into a void the crawler never reaches.

Keep reading: Taming Faceted Navigation Before It Buries Your Catalog, Sitemaps at Scale: The Index Signal Most Teams Waste, Robots.txt and the Art of Crawl Control, Canonical Tags and the Duplicate Content You Did Not Know You Had, SEO Through a Merger: Consolidating Two Sites.

If you are wrestling a large site where good work is not getting seen, the channel is open by introduction. Bring your logs and we will find where the budget is going.

Written by Joseph Carroll, Carroll Consulting Services.