Skip to content

Conversation

@inakisoriamrf
Copy link

Introduces a new failover mode for RedisCluster that prioritizes reading from replica nodes while maintaining high availability through automatic master fallback.

Changes:

  • Add REDIS_FAILOVER_PREFER_REPLICA constant (value: 4)
  • Implement replica-first logic in cluster_sock_write() with master fallback
  • Update validation in setOption() to accept new failover mode
  • Add RedisCluster::FAILOVER_PREFER_REPLICA PHP constant
  • Update cluster.md with usage example
  • Add some test coverage in RedisClusterTest

Behavior:
When enabled, read commands are directed to replica nodes first. If no replicas are available for a given shard, the command automatically falls back to the master node, ensuring availability.

Usage:
$cluster->setOption(
RedisCluster::OPT_SLAVE_FAILOVER,
RedisCluster::FAILOVER_PREFER_REPLICA
);

Introduces a new failover mode for RedisCluster that prioritizes reading
from replica nodes while maintaining high availability through automatic
master fallback.

Changes:
- Add REDIS_FAILOVER_PREFER_REPLICA constant (value: 4)
- Implement replica-first logic in cluster_sock_write() with master fallback
- Update validation in setOption() to accept new failover mode
- Add RedisCluster::FAILOVER_PREFER_REPLICA PHP constant
- Regenerate arginfo files for both PHP 8.0+ and legacy versions
- Update cluster.md with usage example
- Add comprehensive test coverage in RedisClusterTest

Behavior:
When enabled, read commands are directed to replica nodes first. If no
replicas are available for a given shard, the command automatically falls
back to the master node, ensuring availability.

Usage:
  $cluster->setOption(
    RedisCluster::OPT_SLAVE_FAILOVER,
    RedisCluster::FAILOVER_PREFER_REPLICA
  );
@inakisoriamrf
Copy link
Author

The main motivation for this change is to improve availability under failure scenarios. In our workload, if all replicas in a shard fail, it’s preferable to continue serving reads from the master node rather than blocking all read operations entirely.

This PR introduces a new failover mode (FAILOVER_PREFER_REPLICA) that implements a “replica-first, master-fallback” strategy. It ensures that read commands are routed to replicas when available, but automatically fall back to the shard’s master if no replicas are reachable, maintaining both redundancy and read availability without manual intervention.

@michael-grunder michael-grunder self-assigned this Oct 20, 2025
@inakisoriamrf
Copy link
Author

@michael-grunder something I have to change here? Maybe it is a KeyDB configuration issue?
I see there are a lot of tests Skpped for KeyDB

@michael-grunder
Copy link
Member

Possibly. The fact that the tests are exclusively failing against KeyDB may indicate the problem is serverside.

I'd try to determine why KeyDB failes by isolatinig the root cause locally. I will also play around with it.

Ideally it's somethinig that could be tweaked in the client so KeyDB gains access to the feature too, but if it's not solveable client-side we could skip KeyDB tests and document the situation.

@michael-grunder
Copy link
Member

I replicated the KeyDB failures locally and it just had to do with replication delay.

You can use the WAIT command via rawCommand to make sure the key has had time to propagate to the replica before testing failover.

It's a bit tricky because wait takes a number of replicas and a millisecond timeout. I just went with 1 since we set up a cluster with 1 replica per node in CI. It may mean however, that tests fail locally for people who run them against a cluster with many more replicas.

A more robust solution might be to detect the number of replicas at runtime and use that number instead of a hardcoded one.

Use WAIT command after write operations to ensure keys have propagated
to replicas before testing FAILOVER_PREFER_REPLICA mode. This prevents
test failures due to replication lag, particularly with KeyDB.

The WAIT command waits for at least 1 replica (matching the CI cluster
configuration) with a 1 second timeout.
@inakisoriamrf
Copy link
Author

I've added the WAITs, hopefully this helps.

@michael-grunder
Copy link
Member

michael-grunder commented Oct 24, 2025

OK still failing but in an inexplicable place and with the failover tests.

The WAIT problem was 100% the cause of the previous failures, so I'm not sure what's going on. These tests pass every time for me locally so it must be a classic GitHub 🥔 CI issue.

I'll track it down this weekend.

Edit: On my machine keys is crashing or causing KeyDB to exit 😄

@michael-grunder
Copy link
Member

@inakisoriamrf For now we should just skip your failover test against KeyDB. Once they patch it we can re-enable.

It's just a matter of putting this at the start of your test method:

        if ( ! $this->is_keydb)
            $this->markTestSkipped();

@inakisoriamrf
Copy link
Author

@michael-grunder I forgot to mention, it’s already being skipped following your suggestion.

@michael-grunder
Copy link
Member

Yep, I saw that thanks. We'll merge this after we release 6.3.0 GA and get it iinto 6.3.1.

It should have some time to simmer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants