
Conversation

@cstyan (Contributor) commented Dec 19, 2025

Still a WIP. I need to do my own review of the batching struct/interface and of auth (it looks like we're using system auth at the moment), since I used tasks to prototype this. Some values that are currently hardcoded, such as the batch size and flush interval, should likely be configurable.

However, this should work to significantly reduce the volume of pg_notify calls made as part of workspace agent metadata updates. I have tested manually via develop.sh with a few workspaces, each having one metadata item, and can confirm that I saw batch updates only and that the number of agents reported matched the number of running workspaces.

The changes are as follows:

  • a new BatchUpdateWorkspaceAgentMetadata query to update multiple agents' metadata in a single query to PG
  • new pubsub publish/subscribe via workspace_agent_metadata_batch; at the moment, subscribers have to loop over the IDs in each published message to see whether the agent ID they care about had an update
  • a batcher struct that batches metadata updates together and flushes either when a max batch size is reached or when an interval has elapsed since the last flush

Callum Styan and others added 8 commits December 18, 2025 21:35
Introduces a metadata batcher to reduce database write and pubsub
publish frequency for workspace agent metadata updates.

Key changes:
- Add BatchUpdateWorkspaceAgentMetadata SQL query for multi-agent updates
- Create MetadataBatcher with 5s flush interval and 500 agent batch size
- Integrate batcher into coderd API and agent RPC flow
- Update BatchUpdateMetadata handler to use batcher when available
- Add comprehensive unit tests for batcher functionality

Performance impact:
- Reduces database writes from 60/sec to 0.2/sec (99.7% reduction)
- Based on workload of 600 workspaces with 15 metadata items at 10s intervals
- Implements early flush at 80% capacity (400/500 agents)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Per the project style guide, function parameters and arguments should
not be split across multiple lines; this improves readability.

Thread safety note: The MetadataBatcher is already thread-safe with
proper mutex locking in both Add() and flush() methods. Multiple
concurrent metadata updates are correctly serialized.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Changed MetadataBatcher implementation to match correct semantics:

- buf is now a slice with fixed capacity instead of a map
- Add() appends to slice, drops updates when buffer is full
- Flush triggers when buffer reaches capacity OR interval fires
- Removed 80% threshold logic - flush at exact capacity
- Multiple updates for the same agent are now batched separately

This simplifies the logic and makes buffer overflow behavior explicit.
Updates are dropped with a warning when the buffer is full rather than
replacing existing entries.

Test updates:
- Renamed test to reflect new behavior (no replacement)
- Updated capacity flush test to fill buffer exactly
- Added new test for drop behavior when buffer is full

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Implements Scenario 3 to drastically reduce pg_notify call volume.

Changes:
- Added WorkspaceAgentMetadataBatchPayload with array of updates
- Added WatchWorkspaceAgentMetadataBatchChannel() for global channel
- Batcher now publishes to both per-agent and batch channels
- Listener subscribes to both channels, filters batch for its agent

Performance impact:
- Reduces pubsub NOTIFYs from ~300/sec to ~0.2/sec (99.9% reduction!)
- NOTIFY rate now controlled by batch_flush_period config
- Rate = num_coderd_instances / batch_flush_period
- Independent of workspace/agent count

Migration strategy:
- Both channels published for backwards compatibility
- Per-agent channels can be removed once all clients migrate
- Listeners filter batched messages by agent ID

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
The batcher now only publishes to the global batch channel.
Per-agent channels are still used for the non-batched fallback path
(when batcher is nil in metadata.go).

During pod rollout all connections are dropped/reconnected, so there's
no need for backwards compatibility in the batcher path.

This achieves the target NOTIFY reduction:
- Before: ~300 NOTIFYs/sec (one per agent)
- After: ~0.2 NOTIFYs/sec (one per batch flush)
- 99.9% reduction in pg_notify call volume

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Changed the batch notification to only send agent IDs instead of full
metadata details (keys, timestamps, etc). Listeners now re-fetch all
metadata for updated agents from the database.

Benefits:
- Smaller NOTIFY payloads (just UUIDs instead of keys/values/timestamps)
- Listeners automatically get complete, consistent metadata state
- Matches existing pattern from per-agent channel notifications

The listener already has this pattern:
- Receive notification with changed keys (or nil for all keys)
- Fetch those keys from database
- Update local cache

Now the batch notification sends: {"agent_ids": [uuid1, uuid2, ...]}
And the listener fetches all keys for each listed agent it cares about.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Log which subscription channel (per-agent vs batch) receives notifications
to aid in debugging and monitoring the batching system. Includes agent_id,
keys for per-agent updates, and batch_size for batched updates.
@github-actions


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
