
@ChiragBellara commented Jan 30, 2026

Summary

Fixes #1747.
AsyncUrlSeeder initialized the Common Crawl index even when source="sitemap" was specified. This caused unnecessary requests to Common Crawl and ConnectTimeout failures in environments where Common Crawl is unreachable. With this change, Common Crawl is initialized only when explicitly requested, and the source string is normalized so that matching is case-insensitive. Sitemap-only URL seeding should not depend on the availability of the Common Crawl indexes.

List of files changed and why

async_url_seeder.py – Updated source parsing logic to normalize and validate the source value, and to initialize the Common Crawl index only when "cc" is explicitly included. This prevents unintended Common Crawl calls during sitemap-only seeding.
Because sources are matched by string comparison, the source tokens are now normalized (lowercased and trimmed) so that values like "CC", "cc + sitemap", " cc", or " CC + sitemap " are handled correctly.
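For reviewers, the normalization described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in async_url_seeder.py: the names `VALID_SOURCES`, `parse_sources`, and `needs_common_crawl` are hypothetical, and the set of accepted tokens is assumed from the values mentioned in this description.

```python
# Hypothetical sketch of source-token normalization; not the actual
# implementation in async_url_seeder.py.

VALID_SOURCES = {"cc", "sitemap"}  # assumed from the values named in this PR


def parse_sources(source: str) -> set[str]:
    """Split a source string on '+', trim and lowercase each token."""
    tokens = {tok.strip().lower() for tok in source.split("+")}
    tokens.discard("")  # ignore empty pieces such as a trailing '+'
    if not tokens:
        raise ValueError("source must name at least one of: cc, sitemap")
    unknown = tokens - VALID_SOURCES
    if unknown:
        raise ValueError(f"unknown source(s): {sorted(unknown)}")
    return tokens


def needs_common_crawl(source: str) -> bool:
    """Common Crawl is only needed when 'cc' is explicitly requested."""
    return "cc" in parse_sources(source)
```

With this shape, a sitemap-only request never reaches the Common Crawl initialization path, regardless of casing or surrounding whitespace.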

How Has This Been Tested?

  • Added a regression test that simulates a Common Crawl outage by overriding _latest_index() to raise a ConnectTimeout.
  • Verified that sitemap-only seeding no longer invokes Common Crawl and completes successfully.
  • Confirmed that Common Crawl is still initialized when source includes "cc" (e.g. "cc" or "cc+sitemap").
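The idea behind the regression test can be sketched as below. This is an illustrative stand-in rather than the test added in this PR: `SketchSeeder`, its `_from_sitemap` helper, its `urls` signature, and the local `ConnectTimeout` class are all assumptions made so the example is self-contained. Only the technique (making `_latest_index()` raise to simulate an outage) comes from the description above.

```python
import asyncio


class ConnectTimeout(Exception):
    """Local stand-in for the HTTP client's connect-timeout error."""


class SketchSeeder:
    """Hypothetical seeder used only to illustrate the test strategy."""

    def __init__(self) -> None:
        self.index_calls = 0

    async def _latest_index(self) -> str:
        # Simulated Common Crawl outage: every call fails.
        self.index_calls += 1
        raise ConnectTimeout("Common Crawl unreachable")

    async def _from_sitemap(self, domain: str) -> list[str]:
        # Assumed helper; returns a fixed URL instead of fetching a sitemap.
        return [f"https://{domain}/"]

    async def urls(self, domain: str, source: str) -> list[str]:
        tokens = {t.strip().lower() for t in source.split("+")}
        found: list[str] = []
        if "cc" in tokens:
            # Only touch Common Crawl when explicitly requested.
            await self._latest_index()
        if "sitemap" in tokens:
            found.extend(await self._from_sitemap(domain))
        return found


def run_sitemap_only() -> tuple[list[str], int]:
    seeder = SketchSeeder()
    urls = asyncio.run(seeder.urls("example.com", " Sitemap "))
    return urls, seeder.index_calls


def run_with_cc() -> bool:
    """Return True if requesting 'cc' still reaches the index lookup."""
    seeder = SketchSeeder()
    try:
        asyncio.run(seeder.urls("example.com", "CC+sitemap"))
    except ConnectTimeout:
        return True
    return False
```

The sitemap-only run should complete with zero index lookups, while a source that includes "cc" should still hit the simulated outage.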

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ChiragBellara marked this pull request as ready for review January 31, 2026 01:59