Fix: Avoid Common Crawl calls for sitemap-only URL seeding #1746
Summary
Fixes #1747.
Fixes an issue where AsyncUrlSeeder initialized the Common Crawl index even when source="sitemap" was specified, causing unnecessary requests to Common Crawl and ConnectTimeout failures in environments where Common Crawl is unreachable. The change ensures the Common Crawl index is initialized only when explicitly requested, and normalizes the source string to be case-insensitive. Sitemap-only URL seeding should not depend on the availability of the Common Crawl indexes.
List of files changed and why
async_url_seeder.py – Updated source parsing logic to normalize and validate the source value, and to initialize the Common Crawl index only when "cc" is explicitly included. This prevents unintended Common Crawl calls during sitemap-only seeding.
Since sources are matched by string comparison, I've added normalization of the source tokens (case-insensitive, trimmed) so that values like "CC", "cc + sitemap", " cc", or " CC + sitemap " are handled correctly.
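The normalization described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the function name `parse_source` and the `VALID_SOURCES` constant are hypothetical stand-ins for whatever names the real `async_url_seeder.py` uses.

```python
# Hypothetical sketch of case-insensitive, trimmed source parsing.
VALID_SOURCES = {"cc", "sitemap"}

def parse_source(source: str) -> set[str]:
    """Normalize a source string like ' CC + sitemap ' into {'cc', 'sitemap'}."""
    tokens = {t.strip().lower() for t in source.split("+") if t.strip()}
    invalid = tokens - VALID_SOURCES
    if invalid:
        raise ValueError(f"Invalid source value(s): {sorted(invalid)}")
    return tokens

# The Common Crawl index is initialized only when explicitly requested:
sources = parse_source(" CC + sitemap ")
if "cc" in sources:
    pass  # initialize the Common Crawl index here
```

With this approach, `parse_source("sitemap")` never touches Common Crawl, while variants such as `"CC"` or `" cc + sitemap "` still resolve to the intended sources.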
How Has This Been Tested?
Checklist: