WIP perf: reduce pg_notify call volume by batching together agent metadata updates #21330
base: main
Conversation
Introduces a metadata batcher to reduce database write and pubsub publish frequency for workspace agent metadata updates.

Key changes:
- Add BatchUpdateWorkspaceAgentMetadata SQL query for multi-agent updates
- Create MetadataBatcher with 5s flush interval and 500 agent batch size
- Integrate batcher into coderd API and agent RPC flow
- Update BatchUpdateMetadata handler to use batcher when available
- Add comprehensive unit tests for batcher functionality

Performance impact:
- Reduces database writes from 60/sec to 0.2/sec (99.7% reduction)
- Based on a workload of 600 workspaces with 15 metadata items at 10s intervals
- Implements early flush at 80% capacity (400/500 agents)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
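Roughly, the batcher described here could be shaped like the sketch below. Every name in it (the package, `metadataUpdate`, the struct fields) is an assumption inferred from the commit message, not the PR's actual code; `Add` and `flush` are sketched after the later commits further down.

```go
// Sketch only: a rough shape for the batcher described above.
package batcher

import (
	"context"
	"sync"
	"time"

	"cdr.dev/slog"
	"github.com/google/uuid"

	"github.com/coder/coder/v2/coderd/database"
	"github.com/coder/coder/v2/coderd/database/pubsub"
)

// metadataUpdate is a hypothetical stand-in for one agent's metadata update.
type metadataUpdate struct {
	AgentID uuid.UUID
	Keys    []string
	Values  []string
}

type MetadataBatcher struct {
	logger   slog.Logger
	store    database.Store
	pubsub   pubsub.Pubsub
	interval time.Duration // e.g. 5s flush interval
	capacity int           // e.g. 500 agents per batch

	mu      sync.Mutex
	buf     []metadataUpdate
	flushCh chan struct{} // signalled by Add when the buffer fills early
}

// run flushes on every tick, or earlier when Add signals that the buffer
// has filled up.
func (b *MetadataBatcher) run(ctx context.Context) {
	ticker := time.NewTicker(b.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			b.flush(ctx) // final flush on shutdown
			return
		case <-ticker.C:
			b.flush(ctx)
		case <-b.flushCh:
			b.flush(ctx)
		}
	}
}
```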
Per project style guide, function parameters and arguments should not be split across multiple lines for better readability.

Thread safety note: The MetadataBatcher is already thread-safe with proper mutex locking in both Add() and flush() methods. Multiple concurrent metadata updates are correctly serialized.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Changed MetadataBatcher implementation to match correct semantics:
- buf is now a slice with fixed capacity instead of a map
- Add() appends to slice, drops updates when buffer is full
- Flush triggers when buffer reaches capacity OR interval fires
- Removed 80% threshold logic - flush at exact capacity
- Multiple updates for same agent are now batched separately

This simplifies the logic and makes buffer overflow behavior explicit. Updates are dropped with a warning when the buffer is full rather than replacing existing entries.

Test updates:
- Renamed test to reflect new behavior (no replacement)
- Updated capacity flush test to fill buffer exactly
- Added new test for drop behavior when buffer is full

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
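Continuing the same hypothetical sketch, the `Add` path this commit describes (append to a fixed-capacity slice, drop with a warning when full, trigger an early flush once the buffer fills) could look roughly like this:

```go
// Add queues one agent's metadata update. Sketch only: the drop-on-full and
// flush-at-capacity behavior follows the commit message above.
func (b *MetadataBatcher) Add(ctx context.Context, update metadataUpdate) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if len(b.buf) >= b.capacity {
		// Buffer is full: drop the update with a warning instead of
		// replacing an existing entry.
		b.logger.Warn(ctx, "metadata batcher buffer full, dropping update",
			slog.F("agent_id", update.AgentID))
		return
	}
	b.buf = append(b.buf, update)

	if len(b.buf) == b.capacity {
		// Ask the run loop to flush now instead of waiting for the next
		// tick. Non-blocking in case a flush signal is already pending.
		select {
		case b.flushCh <- struct{}{}:
		default:
		}
	}
}
```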
Implements Scenario 3 to drastically reduce pg_notify call volume.

Changes:
- Added WorkspaceAgentMetadataBatchPayload with array of updates
- Added WatchWorkspaceAgentMetadataBatchChannel() for global channel
- Batcher now publishes to both per-agent and batch channels
- Listener subscribes to both channels, filters batch for its agent

Performance impact:
- Reduces pubsub NOTIFYs from ~300/sec to ~0.2/sec (99.9% reduction!)
- NOTIFY rate now controlled by batch_flush_period config
- Rate = num_coderd_instances / batch_flush_period
- Independent of workspace/agent count

Migration strategy:
- Both channels published for backwards compatibility
- Per-agent channels can be removed once all clients migrate
- Listeners filter batched messages by agent ID

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
The batcher now only publishes to the global batch channel. Per-agent channels are still used for the non-batched fallback path (when batcher is nil in metadata.go). During pod rollout all connections are dropped/reconnected, so there's no need for backwards compatibility in the batcher path.

This achieves the target NOTIFY reduction:
- Before: ~300 NOTIFYs/sec (one per agent)
- After: ~0.2 NOTIFYs/sec (one per batch flush)
- 99.9% reduction in pg_notify call volume

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
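Still within the same hypothetical sketch (and assuming `encoding/json` is also imported), the flush path after this change would do one database write and one publish on the global batch channel per flush. The store method and channel name come from the commit messages above; everything else is illustrative, and a later commit in this PR slims the payload down to agent IDs only.

```go
// flush writes the buffered updates in one query and publishes a single
// NOTIFY on the global batch channel, instead of one NOTIFY per agent.
func (b *MetadataBatcher) flush(ctx context.Context) {
	b.mu.Lock()
	updates := b.buf
	b.buf = make([]metadataUpdate, 0, b.capacity)
	b.mu.Unlock()

	if len(updates) == 0 {
		return
	}

	// One database write for the whole batch. Param construction for the new
	// BatchUpdateWorkspaceAgentMetadata query is omitted here; its generated
	// types are not shown in this excerpt.
	// err := b.store.BatchUpdateWorkspaceAgentMetadata(ctx, params)

	// Stand-in for WorkspaceAgentMetadataBatchPayload at this stage of the PR.
	payload, err := json.Marshal(updates)
	if err != nil {
		b.logger.Error(ctx, "marshal agent metadata batch payload", slog.Error(err))
		return
	}
	// One pg_notify per flush, regardless of how many agents updated.
	if err := b.pubsub.Publish("workspace_agent_metadata_batch", payload); err != nil {
		b.logger.Error(ctx, "publish agent metadata batch", slog.Error(err))
	}
}
```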
Changed the batch notification to only send agent IDs instead of full
metadata details (keys, timestamps, etc). Listeners now re-fetch all
metadata for updated agents from the database.
Benefits:
- Smaller NOTIFY payloads (just UUIDs instead of keys/values/timestamps)
- Listeners automatically get complete, consistent metadata state
- Matches existing pattern from per-agent channel notifications
The listener already has this pattern:
- Receive notification with changed keys (or nil for all keys)
- Fetch those keys from database
- Update local cache
Now the batch notification sends: {"agent_ids": [uuid1, uuid2, ...]}
And the listener fetches all keys for that agent.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
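On the listener side, the behavior described in this commit might look roughly like the sketch below. The channel name and payload shape come from the commit message; the `GetWorkspaceAgentMetadata` call and `WorkspaceAgentMetadatum` row type are my best guess at the existing store API, and the function name and placement are illustrative.

```go
// Sketch only: subscribe to the global batch channel, ignore batches that
// don't mention our agent, and re-fetch all of the agent's metadata.
package agentapi

import (
	"context"
	"encoding/json"
	"slices"

	"github.com/google/uuid"

	"github.com/coder/coder/v2/coderd/database"
	"github.com/coder/coder/v2/coderd/database/pubsub"
)

func watchAgentMetadataBatch(ps pubsub.Pubsub, store database.Store, agentID uuid.UUID, onUpdate func([]database.WorkspaceAgentMetadatum)) (func(), error) {
	return ps.Subscribe("workspace_agent_metadata_batch", func(ctx context.Context, message []byte) {
		var payload struct {
			AgentIDs []uuid.UUID `json:"agent_ids"`
		}
		if err := json.Unmarshal(message, &payload); err != nil {
			return
		}
		if !slices.Contains(payload.AgentIDs, agentID) {
			return // this batch doesn't include our agent
		}
		// Re-fetch the complete metadata for this agent; an empty key filter
		// is assumed to mean "all keys", matching the per-agent pattern.
		md, err := store.GetWorkspaceAgentMetadata(ctx, database.GetWorkspaceAgentMetadataParams{
			WorkspaceAgentID: agentID,
			Keys:             nil,
		})
		if err != nil {
			return
		}
		onUpdate(md)
	})
}
```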
Log which subscription channel (per-agent vs batch) receives notifications to aid in debugging and monitoring the batching system. Includes agent_id, keys for per-agent updates, and batch_size for batched updates.
I have read the CLA Document and I hereby sign the CLA

Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Still a WIP, and I need to do my own review of the batching struct/interface and of auth (it looks like we're using system auth at the moment), since I used tasks to prototype this. Some values here should also likely be configurable, while ATM they're hardcoded, such as the batch size and flush interval.
However, this should work to significantly reduce the volume of `pg_notify` calls that happen as part of workspace agent metadata updates. I have tested it manually via `develop.sh` with a few workspaces each having one metadata item, and can confirm I saw batch updates only, and the number of agents reported appeared correct based on the number of running workspaces.

The changes are as follows:
- A new query, `BatchUpdateWorkspaceAgentMetadata`, to handle updating multiple agents' metadata in a single query to PG (see the sketch below)
- A new pubsub channel, `workspace_agent_metadata_batch`; ATM subscribers have to loop over the IDs in each published message to see if the agent ID they care about had an update
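For illustration only, flattening the buffered updates into the columnar arrays such a multi-agent query would likely accept might look like the following; the params struct fields are guesses modeled on the existing single-agent `UpdateWorkspaceAgentMetadata` query, not the PR's actual generated sqlc code.

```go
// Hypothetical sketch (continuing the batcher sketch above): the field names
// on BatchUpdateWorkspaceAgentMetadataParams are assumptions, not the PR's
// generated code.
func batchParams(updates []metadataUpdate, now time.Time) database.BatchUpdateWorkspaceAgentMetadataParams {
	var p database.BatchUpdateWorkspaceAgentMetadataParams
	for _, u := range updates {
		for i, key := range u.Keys {
			// Flatten every (agent, key, value) triple into parallel arrays,
			// so the whole batch lands in a single query to PG.
			p.WorkspaceAgentID = append(p.WorkspaceAgentID, u.AgentID)
			p.Key = append(p.Key, key)
			p.Value = append(p.Value, u.Values[i])
			p.CollectedAt = append(p.CollectedAt, now)
		}
	}
	return p
}
```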