Fix: Avoid Common Crawl calls for sitemap-only URL seeding #1746
Summary
Fixes #1747.
Fixes an issue where AsyncUrlSeeder initialized the Common Crawl index even when source="sitemap" was specified, causing unnecessary requests to Common Crawl and ConnectTimeout failures in environments where Common Crawl is unreachable. The change ensures the Common Crawl index is initialized only when explicitly requested, and normalizes the source string to be case-insensitive. Sitemap-only URL seeding should not depend on the availability of the Common Crawl indexes.
List of files changed and why
async_url_seeder.py – Updated source parsing logic to normalize and validate the source value, and to initialize the Common Crawl index only when "cc" is explicitly included. This prevents unintended Common Crawl calls during sitemap-only seeding.
Since sources are matched by string comparison, I've added normalization of the source tokens (case-insensitive, trimmed) so that values like "CC", "cc + sitemap", " cc", or " CC + sitemap " are handled correctly.
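The normalization described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the function name `parse_source` and the `VALID_SOURCES` constant are hypothetical stand-ins for whatever names the real `async_url_seeder.py` uses.

```python
# Hypothetical sketch of case-insensitive, trimmed source parsing.
VALID_SOURCES = {"cc", "sitemap"}

def parse_source(source: str) -> set[str]:
    """Normalize a source string like ' CC + sitemap ' into {'cc', 'sitemap'}."""
    tokens = {t.strip().lower() for t in source.split("+") if t.strip()}
    invalid = tokens - VALID_SOURCES
    if invalid:
        raise ValueError(f"Invalid source value(s): {sorted(invalid)}")
    return tokens

# The Common Crawl index is initialized only when explicitly requested:
sources = parse_source(" CC + sitemap ")
if "cc" in sources:
    pass  # initialize the Common Crawl index here
```

With this approach, `parse_source("sitemap")` never touches Common Crawl, while variants such as `"CC"` or `" cc + sitemap "` still resolve to the intended sources.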
How Has This Been Tested?
Checklist: