Closed
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)
Description
Feature or enhancement
Proposal:
First, I am not sure how important this is: even though _make_key could be optimized, AFAIK only _lru_cache_wrapper uses it, and that wrapper may be overwritten by this import:

try:
    from _functools import _lru_cache_wrapper
except ImportError:
    pass

If we do think it is worth optimizing, then this change (I can make a PR if needed):
@@ -537,18 +537,17 @@ def _make_key(args, kwds, typed,
     # Formerly, we sorted() the kwds before looping.  The new way is *much*
     # faster; however, it means that f(x=1, y=2) will now be treated as a
     # distinct call from f(y=2, x=1) which will be cached separately.
-    key = args
+    key = list(args) if kwds or typed else args
     if kwds:
-        key += kwd_mark
-        for item in kwds.items():
-            key += item
+        key.append(kwd_mark)
+        key.extend(kwds.items())
     if typed:
-        key += tuple(type(v) for v in args)
+        key.extend([type(v) for v in args])
         if kwds:
-            key += tuple(type(v) for v in kwds.values())
+            key.extend([type(v) for v in kwds.values()])
     elif len(key) == 1 and type(key[0]) in fasttypes:
         return key[0]
-    return key
+    return tuple(key)


 def lru_cache(maxsize=128, typed=False):
     """Least-recently-used cache decorator.

makes _make_key significantly faster:
from functools import _make_key as mk
timeit mk((), {}, False)
257 ns ± 0.645 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
timeit mk((1, 2, 3), {1:2, 2:3, 3:4}, False)
861 ns ± 0.597 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
timeit mk((1, 2, 3), {1:2, 2:3, 3:4}, True)
3.03 μs ± 6.85 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
to
timeit mk((), {}, False)
239 ns ± 0.429 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.08x faster
timeit mk((1, 2, 3), {1:2, 2:3, 3:4}, False)
700 ns ± 2.74 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.23x faster
timeit mk((1, 2, 3), {1:2, 2:3, 3:4}, True)
1.25 μs ± 2.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
2.42x faster

If we remove the import, then a benchmark for lru_cache, which is the public API of this code, shows a 1.01x to 1.68x improvement depending on the use case:
lru_cache_baseline.json
=======================
Python version: 3.15.0a3+ (64-bit) revision e0fb2780649
Report on macOS-14.6.1-arm64-arm-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2026-01-13 20:58:56.174556
End date: 2026-01-13 21:02:28.432336
lru_cache_optimized.json
========================
Python version: 3.15.0a3+ (64-bit) revision e0fb2780649
Report on macOS-14.6.1-arm64-arm-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2026-01-13 21:04:47.595674
End date: 2026-01-13 21:08:21.508782
### lru_cache_hit_args_and_kwargs ###
Mean +- std dev: 104 us +- 1 us -> 103 us +- 1 us: 1.01x faster
Not significant
### lru_cache_hit_args_and_multiple_kwargs ###
Mean +- std dev: 126 us +- 1 us -> 113 us +- 1 us: 1.12x faster
Significant (t=65.57)
### lru_cache_hit_multiple_args ###
Mean +- std dev: 68.5 us +- 0.9 us -> 66.1 us +- 0.6 us: 1.04x faster
Significant (t=17.21)
### lru_cache_hit_multiple_kwargs ###
Mean +- std dev: 121 us +- 1 us -> 106 us +- 1 us: 1.14x faster
Significant (t=91.32)
### lru_cache_hit_single_float ###
Mean +- std dev: 71.7 us +- 0.6 us -> 69.5 us +- 1.1 us: 1.03x faster
Significant (t=13.89)
### lru_cache_hit_single_int ###
Mean +- std dev: 69.0 us +- 0.7 us -> 65.4 us +- 0.7 us: 1.05x faster
Significant (t=27.70)
### lru_cache_hit_single_kwarg ###
Mean +- std dev: 98.3 us +- 2.7 us -> 95.1 us +- 0.9 us: 1.03x faster
Significant (t=8.68)
### lru_cache_hit_single_str ###
Mean +- std dev: 69.8 us +- 0.7 us -> 66.7 us +- 0.5 us: 1.05x faster
Significant (t=27.05)
### lru_cache_hit_typed_args_and_kwargs ###
Mean +- std dev: 247 us +- 2 us -> 147 us +- 1 us: 1.68x faster
Significant (t=367.54)
### lru_cache_hit_typed_multiple_args ###
Mean +- std dev: 143 us +- 2 us -> 103 us +- 1 us: 1.38x faster
Significant (t=158.01)
### lru_cache_hit_typed_single_arg ###
Mean +- std dev: 123 us +- 1 us -> 92 us +- 0 us: 1.34x faster
Significant (t=212.03)
### lru_cache_hit_typed_single_kwarg ###
Mean +- std dev: 213 us +- 2 us -> 128 us +- 2 us: 1.66x faster
Significant (t=233.02)
### lru_cache_miss_single_int ###
Mean +- std dev: 68.5 us +- 0.8 us -> 65.0 us +- 0.5 us: 1.05x faster
Significant (t=29.21)
### lru_cache_miss_single_kwarg ###
Mean +- std dev: 96.6 us +- 0.7 us -> 94.4 us +- 0.5 us: 1.02x faster
Significant (t=19.86)
### lru_cache_miss_typed_args_kwargs ###
Mean +- std dev: 237 us +- 3 us -> 143 us +- 1 us: 1.65x faster
Significant (t=260.03)
The benchmark was written by Claude Code, and since it is 233 lines long I am attaching it rather than inlining it.
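For review purposes, here is a self-contained sketch (not the attached benchmark) checking that the two key-building strategies are interchangeable as cache keys. kwd_mark and fasttypes are copied from Lib/functools.py; the real _make_key additionally wraps the result in _HashedSeq to memoize its hash, which is omitted here.

```python
# Sketch comparing the current and proposed pure-Python _make_key.
kwd_mark = (object(),)   # unique sentinel separating args from kwargs
fasttypes = {int, str}   # types whose instance can serve directly as the key

def make_key_current(args, kwds, typed):
    # Current implementation: builds the key by repeated tuple concatenation.
    key = args
    if kwds:
        key += kwd_mark
        for item in kwds.items():
            key += item
    if typed:
        key += tuple(type(v) for v in args)
        if kwds:
            key += tuple(type(v) for v in kwds.values())
    elif len(key) == 1 and type(key[0]) in fasttypes:
        return key[0]
    return key

def make_key_proposed(args, kwds, typed):
    # Proposed implementation: builds a list and converts to a tuple once.
    # Note that key.append(kwd_mark) stores the 1-tuple itself as the
    # separator (the current code unpacks it via +=); it is still a unique,
    # hashable sentinel, so keys remain unambiguous.
    key = list(args) if kwds or typed else args
    if kwds:
        key.append(kwd_mark)
        key.extend(kwds.items())
    if typed:
        key.extend([type(v) for v in args])
        if kwds:
            key.extend([type(v) for v in kwds.values()])
    elif len(key) == 1 and type(key[0]) in fasttypes:
        return key[0]
    return tuple(key)

# The property a cache key needs: equal calls yield equal keys, distinct
# calls yield distinct keys -- and both variants must agree on that.
calls = [
    ((1, 2, 3), {}, False),
    ((1, 2, 3), {"a": 4}, False),
    (("x",), {}, False),
    ((1,), {}, False),
    ((1,), {}, True),
    ((1, 2), {"a": 1, "b": 2}, True),
]
for i, (a1, k1, t1) in enumerate(calls):
    for j, (a2, k2, t2) in enumerate(calls):
        cur_equal = make_key_current(a1, k1, t1) == make_key_current(a2, k2, t2)
        new_equal = make_key_proposed(a1, k1, t1) == make_key_proposed(a2, k2, t2)
        assert cur_equal == new_equal == (i == j)
    hash(make_key_proposed(a1, k1, t1))  # every key must be dict-key usable
print("key semantics agree")
```

The two variants produce different key shapes when kwargs are present (sentinel element vs. sentinel tuple, flattened items vs. item tuples), but that is invisible to the cache as long as equal calls map to equal keys within one implementation.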
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
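Tangentially, the comment preserved in the diff about f(x=1, y=2) vs f(y=2, x=1) describes documented, observable behavior of the public API, which is easy to confirm:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def f(*, x, y):
    return x + y

f(x=1, y=2)
f(y=2, x=1)  # same arguments, different keyword order -> a second cache entry
print(f.cache_info().misses)  # 2: the two keyword orders are keyed separately
```

Any change to _make_key has to preserve this: keyword order stays part of the key in both the current and proposed versions.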
Linked PRs