WIP perf: reduce pg_notify call volume by batching together agent metadata updates #21330
base: main
Conversation
Introduces a metadata batcher to reduce database write and pubsub publish frequency for workspace agent metadata updates.

Key changes:
- Add BatchUpdateWorkspaceAgentMetadata SQL query for multi-agent updates
- Create MetadataBatcher with 5s flush interval and 500 agent batch size
- Integrate batcher into coderd API and agent RPC flow
- Update BatchUpdateMetadata handler to use batcher when available
- Add comprehensive unit tests for batcher functionality

Performance impact:
- Reduces database writes from 60/sec to 0.2/sec (99.7% reduction)
- Based on a workload of 600 workspaces with 15 metadata items at 10s intervals
- Implements early flush at 80% capacity (400/500 agents)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
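Roughly, the batcher described here could be shaped like the sketch below. Every name in it (the package, `metadataUpdate`, the struct fields) is an assumption inferred from the commit message, not the PR's actual code; `Add` and `flush` are sketched after the later commits further down.

```go
// Sketch only: a rough shape for the batcher described above.
package batcher

import (
	"context"
	"sync"
	"time"

	"cdr.dev/slog"
	"github.com/google/uuid"

	"github.com/coder/coder/v2/coderd/database"
	"github.com/coder/coder/v2/coderd/database/pubsub"
)

// metadataUpdate is a hypothetical stand-in for one agent's metadata update.
type metadataUpdate struct {
	AgentID uuid.UUID
	Keys    []string
	Values  []string
}

type MetadataBatcher struct {
	logger   slog.Logger
	store    database.Store
	pubsub   pubsub.Pubsub
	interval time.Duration // e.g. 5s flush interval
	capacity int           // e.g. 500 agents per batch

	mu      sync.Mutex
	buf     []metadataUpdate
	flushCh chan struct{} // signalled by Add when the buffer fills early
}

// run flushes on every tick, or earlier when Add signals that the buffer
// has filled up.
func (b *MetadataBatcher) run(ctx context.Context) {
	ticker := time.NewTicker(b.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			b.flush(ctx) // final flush on shutdown
			return
		case <-ticker.C:
			b.flush(ctx)
		case <-b.flushCh:
			b.flush(ctx)
		}
	}
}
```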
Per project style guide, function parameters and arguments should not be split across multiple lines for better readability.

Thread safety note: The MetadataBatcher is already thread-safe with proper mutex locking in both Add() and flush() methods. Multiple concurrent metadata updates are correctly serialized.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Changed MetadataBatcher implementation to match correct semantics:
- buf is now a slice with fixed capacity instead of a map
- Add() appends to slice, drops updates when buffer is full
- Flush triggers when buffer reaches capacity OR interval fires
- Removed 80% threshold logic - flush at exact capacity
- Multiple updates for same agent are now batched separately

This simplifies the logic and makes buffer overflow behavior explicit. Updates are dropped with a warning when the buffer is full rather than replacing existing entries.

Test updates:
- Renamed test to reflect new behavior (no replacement)
- Updated capacity flush test to fill buffer exactly
- Added new test for drop behavior when buffer is full

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
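Continuing the same hypothetical sketch, the `Add` path this commit describes (append to a fixed-capacity slice, drop with a warning when full, trigger an early flush once the buffer fills) could look roughly like this:

```go
// Add queues one agent's metadata update. Sketch only: the drop-on-full and
// flush-at-capacity behavior follows the commit message above.
func (b *MetadataBatcher) Add(ctx context.Context, update metadataUpdate) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if len(b.buf) >= b.capacity {
		// Buffer is full: drop the update with a warning instead of
		// replacing an existing entry.
		b.logger.Warn(ctx, "metadata batcher buffer full, dropping update",
			slog.F("agent_id", update.AgentID))
		return
	}
	b.buf = append(b.buf, update)

	if len(b.buf) == b.capacity {
		// Ask the run loop to flush now instead of waiting for the next
		// tick. Non-blocking in case a flush signal is already pending.
		select {
		case b.flushCh <- struct{}{}:
		default:
		}
	}
}
```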
Implements Scenario 3 to drastically reduce pg_notify call volume.

Changes:
- Added WorkspaceAgentMetadataBatchPayload with array of updates
- Added WatchWorkspaceAgentMetadataBatchChannel() for global channel
- Batcher now publishes to both per-agent and batch channels
- Listener subscribes to both channels, filters batch for its agent

Performance impact:
- Reduces pubsub NOTIFYs from ~300/sec to ~0.2/sec (99.9% reduction!)
- NOTIFY rate now controlled by batch_flush_period config
- Rate = num_coderd_instances / batch_flush_period
- Independent of workspace/agent count

Migration strategy:
- Both channels published for backwards compatibility
- Per-agent channels can be removed once all clients migrate
- Listeners filter batched messages by agent ID

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
The batcher now only publishes to the global batch channel. Per-agent channels are still used for the non-batched fallback path (when batcher is nil in metadata.go). During pod rollout all connections are dropped/reconnected, so there's no need for backwards compatibility in the batcher path.

This achieves the target NOTIFY reduction:
- Before: ~300 NOTIFYs/sec (one per agent)
- After: ~0.2 NOTIFYs/sec (one per batch flush)
- 99.9% reduction in pg_notify call volume

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
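Still within the same hypothetical sketch (and assuming `encoding/json` is also imported), the flush path after this change would do one database write and one publish on the global batch channel per flush. The store method and channel name come from the commit messages above; everything else is illustrative, and a later commit in this PR slims the payload down to agent IDs only.

```go
// flush writes the buffered updates in one query and publishes a single
// NOTIFY on the global batch channel, instead of one NOTIFY per agent.
func (b *MetadataBatcher) flush(ctx context.Context) {
	b.mu.Lock()
	updates := b.buf
	b.buf = make([]metadataUpdate, 0, b.capacity)
	b.mu.Unlock()

	if len(updates) == 0 {
		return
	}

	// One database write for the whole batch. Param construction for the new
	// BatchUpdateWorkspaceAgentMetadata query is omitted here; its generated
	// types are not shown in this excerpt.
	// err := b.store.BatchUpdateWorkspaceAgentMetadata(ctx, params)

	// Stand-in for WorkspaceAgentMetadataBatchPayload at this stage of the PR.
	payload, err := json.Marshal(updates)
	if err != nil {
		b.logger.Error(ctx, "marshal agent metadata batch payload", slog.Error(err))
		return
	}
	// One pg_notify per flush, regardless of how many agents updated.
	if err := b.pubsub.Publish("workspace_agent_metadata_batch", payload); err != nil {
		b.logger.Error(ctx, "publish agent metadata batch", slog.Error(err))
	}
}
```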
Changed the batch notification to only send agent IDs instead of full
metadata details (keys, timestamps, etc). Listeners now re-fetch all
metadata for updated agents from the database.
Benefits:
- Smaller NOTIFY payloads (just UUIDs instead of keys/values/timestamps)
- Listeners automatically get complete, consistent metadata state
- Matches existing pattern from per-agent channel notifications
The listener already has this pattern:
- Receive notification with changed keys (or nil for all keys)
- Fetch those keys from database
- Update local cache
Now the batch notification sends: {"agent_ids": [uuid1, uuid2, ...]}
And the listener fetches all keys for that agent.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
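On the listener side, the behavior described in this commit might look roughly like the sketch below. The channel name and payload shape come from the commit message; the `GetWorkspaceAgentMetadata` call and `WorkspaceAgentMetadatum` row type are my best guess at the existing store API, and the function name and placement are illustrative.

```go
// Sketch only: subscribe to the global batch channel, ignore batches that
// don't mention our agent, and re-fetch all of the agent's metadata.
package agentapi

import (
	"context"
	"encoding/json"
	"slices"

	"github.com/google/uuid"

	"github.com/coder/coder/v2/coderd/database"
	"github.com/coder/coder/v2/coderd/database/pubsub"
)

func watchAgentMetadataBatch(ps pubsub.Pubsub, store database.Store, agentID uuid.UUID, onUpdate func([]database.WorkspaceAgentMetadatum)) (func(), error) {
	return ps.Subscribe("workspace_agent_metadata_batch", func(ctx context.Context, message []byte) {
		var payload struct {
			AgentIDs []uuid.UUID `json:"agent_ids"`
		}
		if err := json.Unmarshal(message, &payload); err != nil {
			return
		}
		if !slices.Contains(payload.AgentIDs, agentID) {
			return // this batch doesn't include our agent
		}
		// Re-fetch the complete metadata for this agent; an empty key filter
		// is assumed to mean "all keys", matching the per-agent pattern.
		md, err := store.GetWorkspaceAgentMetadata(ctx, database.GetWorkspaceAgentMetadataParams{
			WorkspaceAgentID: agentID,
			Keys:             nil,
		})
		if err != nil {
			return
		}
		onUpdate(md)
	})
}
```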
Log which subscription channel (per-agent vs batch) receives notifications to aid in debugging and monitoring the batching system. Includes agent_id, keys for per-agent updates, and batch_size for batched updates.
I have read the CLA Document and I hereby sign the CLA

Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Still a WIP, and I need to do my own review of the batching struct/interface and of auth (it looks like we're using system auth at the moment), since I used tasks to prototype this. Some values here should also likely be configurable, while ATM they're hardcoded, such as the batch size and flush interval.
However, this should work to significantly reduce the volume of `pg_notify` calls that happen as part of workspace agent metadata updates. I have tested it manually via `develop.sh` with a few workspaces each having one metadata item, and can confirm I saw batch updates only, and the number of agents reported appeared correct based on the number of running workspaces.

The changes are as follows:
- A new query, `BatchUpdateWorkspaceAgentMetadata`, to handle updating multiple agents' metadata in a single query to PG (see the sketch below)
- A new pubsub channel, `workspace_agent_metadata_batch`; ATM subscribers have to loop over the IDs in each published message to see if the agent ID they care about had an update
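For illustration only, flattening the buffered updates into the columnar arrays such a multi-agent query would likely accept might look like the following; the params struct fields are guesses modeled on the existing single-agent `UpdateWorkspaceAgentMetadata` query, not the PR's actual generated sqlc code.

```go
// Hypothetical sketch (continuing the batcher sketch above): the field names
// on BatchUpdateWorkspaceAgentMetadataParams are assumptions, not the PR's
// generated code.
func batchParams(updates []metadataUpdate, now time.Time) database.BatchUpdateWorkspaceAgentMetadataParams {
	var p database.BatchUpdateWorkspaceAgentMetadataParams
	for _, u := range updates {
		for i, key := range u.Keys {
			// Flatten every (agent, key, value) triple into parallel arrays,
			// so the whole batch lands in a single query to PG.
			p.WorkspaceAgentID = append(p.WorkspaceAgentID, u.AgentID)
			p.Key = append(p.Key, key)
			p.Value = append(p.Value, u.Values[i])
			p.CollectedAt = append(p.CollectedAt, now)
		}
	}
	return p
}
```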