
[WIP] Add user scraper script for GitHub Uganda#2

Draft
Copilot wants to merge 1 commit into main from copilot/create-uganda-user-scraper

Contributor

Copilot AI commented Feb 21, 2026

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.

GitHub Uganda User Scraper Script (Ethical + Practical)

Goal: Build a script that finds GitHub users who appear to be in Uganda by using public GitHub search and profile data, then exports the results (CSV/JSON) for analysis.

Important: GitHub does not provide a guaranteed “country” field for all users. “Location” is free-text and can be missing or messy, so treat results as best-effort.


✅ What the script will do

  1. Use GitHub’s Search Users API to find accounts where:
    • location:"Uganda" OR
    • location:"Kampala" / Entebbe / Jinja / Mbarara etc.
  2. For each user found, fetch:
    • username, name
    • location
    • bio
    • followers/following
    • public repos
    • profile URL
    • email (rarely public)
  3. Deduplicate + normalize
  4. Save to:
    • uganda_users.json
    • uganda_users.csv

⚠️ Rules + rate limits (don’t get blocked)

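  • Use a GitHub personal access token (PAT) to increase rate limits: authenticated requests get 5,000 core API requests per hour instead of 60, and 30 search requests per minute instead of 10.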
  • Add delays + backoff when rate-limited.
  • Respect GitHub API Terms and avoid aggressive scraping.
  • Prefer GitHub API over HTML scraping.

Tech stack options

Option A (Recommended): Node.js + Octokit

  • Easy JSON handling
  • Nice GitHub client libraries
  • Great for scheduling + retries

Option B: Python + Requests

  • Great for data science workflows
  • Easy CSV export

Below is a Node.js plan (since you use JS a lot).


1) Project setup (Node.js)

Requirements

  • Node.js 18+
  • A GitHub Personal Access Token (classic or fine-grained)

Env

  • GITHUB_TOKEN=...

Install

  • @octokit/rest (GitHub API client)
  • dotenv
  • p-limit (limit concurrency)
  • csv-writer (CSV export)

2) Search strategy

Because location is free-text, run multiple queries and merge.

Example Uganda location queries

  • location:"Uganda"
  • location:"Kampala"
  • location:"Entebbe"
  • location:"Jinja"
  • location:"Mbarara"
  • location:"Gulu"
  • location:"Mbale"
  • location:"Mukono"
  • location:"Wakiso"

You can also include variants:

  • location:"UG"
  • location:"Uga"
  • emojis or typos (optional)

Optional filters (helps reduce noise)

  • repos:>5
  • followers:>10

Example query:

  • location:"Kampala" repos:>3
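
The query list above can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical PLACES array and a buildQueries helper (neither name comes from this plan), with the optional filters passed as plain qualifier strings:

```javascript
// Hypothetical place list taken from the examples above; extend or trim as needed.
const PLACES = [
  "Uganda", "Kampala", "Entebbe", "Jinja", "Mbarara",
  "Gulu", "Mbale", "Mukono", "Wakiso",
];

// Build one search query string per place, appending optional qualifiers
// such as "repos:>3" or "followers:>10".
function buildQueries(places, filters = []) {
  return places.map((place) => [`location:"${place}"`, ...filters].join(" "));
}

const queries = buildQueries(PLACES, ["repos:>3"]);
console.log(queries[1]); // location:"Kampala" repos:>3
```

Keeping the filters separate from the place list makes it easy to rerun the same places with and without noise-reducing qualifiers.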

3) Pagination

GitHub Search API returns up to:

  • 100 results/page
  • 1,000 results per query in total, so a broad search has to be split into several narrower queries

Plan:

  • For each query:
    • fetch pages 1..N until no results or a max cap
    • collect logins
  • Deduplicate by login
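
A rough sketch of this pagination plan follows. It assumes a page fetcher is passed in as a function so the loop can be exercised without network access; collectLogins, collectAll and fetchSearchPage are illustrative names, and the default cap of 10 pages is an assumption chosen to match 100 results per page. fetchSearchPage shows one possible fetcher built on Node 18's global fetch and the GITHUB_TOKEN variable from the setup section.

```javascript
// One possible page fetcher: GET /search/users via Node 18's global fetch.
async function fetchSearchPage(query, page) {
  const url = "https://api.github.com/search/users" +
    `?q=${encodeURIComponent(query)}&per_page=100&page=${page}`;
  const res = await fetch(url, {
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  });
  if (!res.ok) throw new Error(`Search failed with status ${res.status}`);
  return (await res.json()).items;
}

// Fetch pages 1..maxPages for one query and return the unique logins.
// `fetchPage(query, page)` must resolve to an array of objects with a `login`.
async function collectLogins(query, fetchPage, maxPages = 10) {
  const logins = new Set();
  for (let page = 1; page <= maxPages; page++) {
    const items = await fetchPage(query, page);
    if (items.length === 0) break;             // no more results
    for (const user of items) logins.add(user.login);
    if (items.length < 100) break;             // a short page is the last page
  }
  return logins;
}

// Run every query and merge the results, keeping the first query that
// surfaced each login so it can be stored as `source_query` later.
async function collectAll(queries, fetchPage, maxPages = 10) {
  const found = new Map();
  for (const query of queries) {
    for (const login of await collectLogins(query, fetchPage, maxPages)) {
      if (!found.has(login)) found.set(login, query);
    }
  }
  return found;
}
```

A call such as collectAll(queries, fetchSearchPage) would run the queries one after another, which keeps the request rate low at the cost of speed.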

4) Data normalization

Normalize location text

  • lowercase
  • trim spaces
  • map known aliases:
    • "kampala ug" → "kampala, uganda"
    • "uganda" → "uganda"

Uganda match heuristic

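A user counts as “Uganda” if the normalized location contains:

  • uganda OR any name from the city list OR ug as a standalone token (careful: never match it as a substring) OR common forms like kampala, ug

Matching whole tokens rather than substrings reduces false positives (for example, "Lugano" contains "ug" but is not in Uganda).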
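
One way the normalization and match heuristic could look in code. This is a sketch under assumptions: CITIES and ALIASES are illustrative tables to tune, and normalizeLocation and looksUgandan are hypothetical helper names. It treats ug as a whole token only, so that substrings inside other words do not match.

```javascript
// Illustrative city list and alias table; both are assumptions to tune.
const CITIES = [
  "kampala", "entebbe", "jinja", "mbarara",
  "gulu", "mbale", "mukono", "wakiso",
];
const ALIASES = { "kampala ug": "kampala, uganda" };

// Lowercase, trim, collapse repeated whitespace, then apply known aliases.
function normalizeLocation(raw) {
  const text = (raw ?? "").toLowerCase().trim().replace(/\s+/g, " ");
  return ALIASES[text] ?? text;
}

// True when the location names Uganda, a listed city, or "ug" as a whole token.
function looksUgandan(raw) {
  const loc = normalizeLocation(raw);
  if (loc.includes("uganda")) return true;
  const tokens = loc.split(/[^a-z]+/).filter(Boolean);
  return tokens.some((token) => token === "ug" || CITIES.includes(token));
}

console.log(looksUgandan("  Kampala   UG "));      // true
console.log(looksUgandan("Lugano, Switzerland"));  // false
```

Even with token matching, a bare "ug" can still be wrong, so it may be worth flagging those rows for manual review rather than accepting them outright.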


5) Concurrency + rate limit safety

  • Fetch user details in parallel, but limited to small concurrency (like 3–8).
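  • If GitHub returns 403 or 429 with rate limit headers (x-ratelimit-remaining: 0, or retry-after):
    • wait until the reset time (x-ratelimit-reset) or for the retry-after interval
    • retry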
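
A dependency-free sketch of both ideas, assuming Node 18's global fetch. The plan lists p-limit for concurrency; the small mapWithLimit helper here is a hand-rolled stand-in, not p-limit's API. All function names are hypothetical, and the one second of slack added to the reset time is an arbitrary choice.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Milliseconds to wait before retrying, or null when the response does not
// look rate-limited. `headers` needs a get(name) method (fetch Headers or Map).
function rateLimitDelayMs(status, headers, nowMs = Date.now()) {
  if (status !== 403 && status !== 429) return null;
  const retryAfter = headers.get("retry-after");
  if (retryAfter != null) return Number(retryAfter) * 1000;
  if (headers.get("x-ratelimit-remaining") === "0") {
    const resetMs = Number(headers.get("x-ratelimit-reset")) * 1000;
    return Math.max(resetMs - nowMs, 0) + 1000; // one second of slack
  }
  return null;
}

// Run task(item) for every item with at most `limit` tasks in flight.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }
  const workers = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results;
}

// Fetch one user profile, waiting and retrying when rate-limited.
async function fetchUser(login, token, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(`https://api.github.com/users/${login}`, {
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${token}`,
      },
    });
    if (res.ok) return res.json();
    const delay = rateLimitDelayMs(res.status, res.headers);
    if (delay === null || attempt >= maxRetries) {
      throw new Error(`GET /users/${login} failed with status ${res.status}`);
    }
    await sleep(delay);
  }
}
```

Usage would be along the lines of mapWithLimit(logins, 4, (login) => fetchUser(login, process.env.GITHUB_TOKEN)), with the limit kept in the 3–8 range suggested above.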

6) Output schema

JSON record example

{
  "login": "octocat",
  "name": "The Octocat",
  "location": "Kampala, Uganda",
  "bio": "Frontend dev",
  "followers": 120,
  "following": 40,
  "public_repos": 35,
  "blog": "https://...",
  "twitter_username": "....",
  "company": "....",
  "email": null,
  "html_url": "https://github.com/octocat",
  "created_at": "2017-01-01T00:00:00Z",
  "updated_at": "2026-02-01T00:00:00Z",
  "source_query": "location:\"Kampala\" repos:>3"
}
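
Records in this shape can be flattened to CSV for uganda_users.csv. The plan lists csv-writer for that; the sketch below is a hand-rolled alternative with no dependencies, so the quoting rules are visible. COLUMNS, csvField and toCsv are hypothetical names, and the column selection is an assumption.

```javascript
// Assumed column order for the CSV; any subset of the record keys works.
const COLUMNS = [
  "login", "name", "location", "bio", "followers", "following",
  "public_repos", "email", "html_url", "source_query",
];

// Quote a field when it contains a comma, double quote, or line break,
// doubling any embedded quotes (RFC 4180 style). null/undefined become empty.
function csvField(value) {
  const text = value == null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn an array of user records into CSV text with a header row.
function toCsv(records, columns = COLUMNS) {
  const rows = records.map((record) =>
    columns.map((column) => csvField(record[column])).join(",")
  );
  return [columns.join(","), ...rows].join("\n") + "\n";
}

const sample = [{ login: "octocat", location: "Kampala, Uganda", email: null }];
console.log(toCsv(sample, ["login", "location", "email"]));
// login,location,email
// octocat,"Kampala, Uganda",
```

The JSON file needs no such handling: JSON.stringify(records, null, 2) produces the uganda_users.json content directly.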


Copilot AI requested a review from amkayondo February 21, 2026 15:25
Copilot stopped work on behalf of amkayondo due to an error February 21, 2026 15:25