Improve json.loads performance #6704
Conversation
📝 Walkthrough

JSON parsing was refactored to track both Python character indices and underlying byte indices, moving string/object/array parsing into Rust, adding byte-oriented utilities, richer error reporting, and public `scanstring` and `DecodeError` APIs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant JsonScanner
    participant Parser
    participant ScanOnce
    participant ScanString
    participant ObjParser
    participant ArrParser
    participant Finalizer
    Caller->>JsonScanner: call(pystr, char_idx)
    JsonScanner->>Parser: parse(pystr, char_idx, byte_idx, scan_once)
    Parser->>ScanOnce: call_scan_once(byte_idx, char_idx)
    alt string
        ScanOnce->>ScanString: scanstring(s, char_offset, strict)
        ScanString-->>ScanOnce: (decoded, chars, bytes)
    else object
        ScanOnce->>ObjParser: parse_object(s, byte_idx)
        ObjParser->>ScanString: scanstring(key, char_offset, strict)
        ScanString-->>ObjParser: (key, chars, bytes)
        ObjParser->>ScanOnce: parse value (recursive)
        ObjParser->>Finalizer: finalize_object(pairs, hooks)
        Finalizer-->>ObjParser: Python object
    else array
        ScanOnce->>ArrParser: parse_array(s, byte_idx)
        ArrParser->>ScanOnce: parse element (recursive)
    else literal/number
        ScanOnce-->>Parser: parsed literal/number
    end
    ScanOnce-->>Parser: (value, end_positions)
    Parser-->>JsonScanner: PyIterReturn
    JsonScanner-->>Caller: result
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Code has been automatically formatted

The code in this PR has been formatted using:

```
git pull origin improve-json-perf
```
Force-pushed from e7bfda5 to 4772f32
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/stdlib/src/json.rs (1)
587-617: Consider validating a non-negative return index in the fallback path.

The fallback to Python `scan_once` correctly handles the interface, but if `scan_once` returns a negative index (line 600), the `as usize` cast at line 605 would produce a very large value, causing `.nth()` to return `None` and defaulting to `s.len()`.

While negative indices from `scan_once` would indicate a bug in the Python code, an explicit check could provide a clearer error:

Optional defensive check

```diff
     let end_char_idx: isize = tuple.as_slice()[1].try_to_value(vm)?;
+    if end_char_idx < 0 {
+        return Err(vm.new_value_error("scan_once returned negative index"));
+    }
     // For fallback, we need to calculate byte_idx from char_idx
```
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Lib/test/test_json/test_decode.py` is excluded by `!Lib/**`
📒 Files selected for processing (2)
`crates/stdlib/src/json.rs`, `crates/stdlib/src/json/machinery.rs`
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Follow the default rustfmt code style, using `cargo fmt` to format Rust code
Always run clippy to lint code (cargo clippy) before completing tasks and fix any warnings or lints introduced by changes
Follow Rust best practices for error handling and memory management
Use the macro system (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Files:
`crates/stdlib/src/json/machinery.rs`, `crates/stdlib/src/json.rs`
🧬 Code graph analysis (2)
crates/stdlib/src/json/machinery.rs (1)
crates/stdlib/src/json.rs (1)
`scanstring` (702-732)
crates/stdlib/src/json.rs (1)
crates/stdlib/src/json/machinery.rs (3)
`scanstring` (140-252), `new` (112-115), `len` (123-128)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run rust tests (windows-2025)
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run snippets and cpython tests (windows-2025)
- GitHub Check: Check Rust code with clippy
- GitHub Check: Run rust tests (macos-latest)
🔇 Additional comments (13)
crates/stdlib/src/json/machinery.rs (5)
33-33: LGTM on imports and public visibility change. The `memchr2` import enables efficient byte-scanning for the fast path, and making `DecodeError::new` public is necessary for error construction in the calling code. Also applies to: 112-115
130-144: LGTM on function signature and documentation. The documentation clearly explains the contract: the input slice starts after the opening quote, and `char_offset` enables accurate error position reporting. The triple return value `(result, chars_consumed, bytes_consumed)` elegantly supports the dual-index tracking pattern used throughout the PR.
145-168: LGTM on fast path implementation. The optimization strategy is sound: use `memchr2` to quickly locate the first `"` or `\`, and if the first match is the closing quote, the entire string content can be processed without escape handling. The control character check correctly scans for bytes `< 0x20`, and the character count properly uses `code_points().count()` to handle multi-byte sequences.
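As a rough illustration of that strategy, here is a minimal sketch using the `memchr` crate. The function name, the convention that the returned counts include the closing quote, and the use of std's `chars()` in place of RustPython's WTF-8 `code_points()` are assumptions of the sketch, not the PR's actual code:

```rust
/// Sketch of a scanstring fast path. `s` starts just after the opening
/// quote; returns (content, chars_consumed, bytes_consumed) on success,
/// or None to fall back to the escape-aware slow path.
fn scan_fast_path(s: &str) -> Option<(String, usize, usize)> {
    let bytes = s.as_bytes();
    // Locate the first `"` or `\` in one SIMD-friendly scan.
    let i = memchr::memchr2(b'"', b'\\', bytes)?;
    if bytes[i] != b'"' {
        return None; // an escape comes first -> slow path
    }
    let content = &s[..i];
    // Raw control characters are invalid inside JSON strings; let the
    // slow path produce the precisely positioned error.
    if content.bytes().any(|b| b < 0x20) {
        return None;
    }
    // Character count differs from byte count for multi-byte UTF-8.
    let chars = content.chars().count();
    Some((content.to_owned(), chars + 1, i + 1)) // +1 for closing quote
}
```

On escape-free, ASCII-heavy JSON this processes the whole string content without per-character decoding, which is exactly where the profiling below says the time went.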
197-241: LGTM on escape sequence handling. All standard JSON escape sequences are correctly handled, including the optional `/` escape. The error reporting correctly uses the character-based position (`char_offset + char_i`) for accurate Python-level error messages.
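For reference, the escape table in question is small; a sketch follows, where the `unescape` helper and the `Option`-based signaling are illustrative (the real code reports a positioned `DecodeError` and handles `\u` separately):

```rust
/// The standard JSON escape table written as a match.
fn unescape(c: char) -> Option<char> {
    Some(match c {
        '"' => '"',
        '\\' => '\\',
        '/' => '/',        // optional escape permitted by RFC 8259
        'b' => '\u{0008}', // backspace
        'f' => '\u{000C}', // form feed
        'n' => '\n',
        'r' => '\r',
        't' => '\t',
        // 'u' introduces a \uXXXX sequence, handled by decode_unicode
        _ => return None, // unknown escape: error at char_offset + char_i
    })
}
```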
254-267: LGTM on decode_unicode implementation. The function correctly parses 4 hex digits using bit shifting and properly handles errors when characters are missing or non-hexadecimal. The conversion from `u16` to `CodePoint` correctly handles both BMP characters and surrogates.

crates/stdlib/src/json.rs (8)
20-40: LGTM on helper functions. The `skip_whitespace` function correctly handles all JSON whitespace characters, and the comment appropriately notes that ASCII whitespace characters are single-byte, making the byte count equal to the character count. The `starts_with_bytes` function is a clean utility for pattern matching.
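A minimal sketch of such a helper, with the dual-index signature assumed from the pattern this review describes:

```rust
/// Advance past the four JSON whitespace bytes. Because all of them are
/// ASCII (one byte == one character), the byte index and the character
/// index can be advanced in lockstep without decoding.
fn skip_whitespace(s: &str, mut byte_idx: usize, mut char_idx: usize) -> (usize, usize) {
    let bytes = s.as_bytes();
    while byte_idx < bytes.len() {
        match bytes[byte_idx] {
            b' ' | b'\t' | b'\n' | b'\r' => {
                byte_idx += 1;
                char_idx += 1; // safe: ASCII whitespace is single-byte
            }
            _ => break,
        }
    }
    (byte_idx, char_idx)
}
```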
92-188: LGTM on updated parse method. The method correctly maintains dual byte/char indices throughout. String parsing delegates to `machinery::scanstring` with the proper slice offset, and the nested object/array parsing is cleanly delegated to dedicated methods. The constant handling correctly relies on ASCII lengths equaling character counts.
226-386: LGTM on parse_object implementation. The method correctly handles all object parsing edge cases including empty objects, missing delimiters, and trailing commas. The key memoization via `HashMap<String, PyStrRef>` is an effective optimization for JSON with repeated keys. Error messages are clear and position-accurate.
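The memoization could be sketched as follows; `Rc<str>` stands in for `PyStrRef`, and `memo_key` is a hypothetical helper name:

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Return a shared allocation for a key, reusing it on repeat sightings.
/// Repeated object keys then cost a refcount bump instead of a fresh
/// string allocation per occurrence.
fn memo_key(memo: &mut HashMap<String, Rc<str>>, key: &str) -> Rc<str> {
    if let Some(existing) = memo.get(key) {
        return existing.clone(); // cheap clone, no new allocation
    }
    let interned: Rc<str> = Rc::from(key);
    memo.insert(key.to_owned(), interned.clone());
    interned
}
```

This matches the author's later observation: it does not make parsing measurably faster, but it should reduce memory for JSON with many repeated keys.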
388-469: LGTM on parse_array implementation. The array parsing correctly mirrors the object parsing structure, with proper handling for empty arrays, element delimiters, and trailing comma detection. The recursive value parsing via `call_scan_once` maintains consistent index tracking.
471-503: LGTM on finalize_object implementation. The method correctly implements the CPython JSON module's hook priority: `object_pairs_hook` takes precedence when present; otherwise a dict is built and optionally passed to `object_hook`. The behavior for duplicate keys (last value wins) matches the JSON specification.
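A behavior sketch of that hook priority, with closures and a toy `Value` type standing in for the Python callables and objects:

```rust
use std::collections::HashMap;

/// Toy stand-in for a Python object reference.
#[derive(Clone, Debug)]
#[allow(dead_code)]
enum Value {
    Str(String),
    Map(HashMap<String, Value>),
}

fn finalize_object(
    pairs: Vec<(String, Value)>,
    object_pairs_hook: Option<&dyn Fn(Vec<(String, Value)>) -> Value>,
    object_hook: Option<&dyn Fn(Value) -> Value>,
) -> Value {
    // object_pairs_hook wins outright: it sees the raw (key, value) list.
    if let Some(hook) = object_pairs_hook {
        return hook(pairs);
    }
    // Otherwise build a dict; inserting in order makes the last
    // duplicate key win, as noted in the review.
    let mut map = HashMap::new();
    for (k, v) in pairs {
        map.insert(k, v);
    }
    let obj = Value::Map(map);
    // object_hook, when set, post-processes the finished dict.
    match object_hook {
        Some(hook) => hook(obj),
        None => obj,
    }
}
```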
619-629: LGTM on make_decode_error helper. This provides a clean abstraction for creating decode errors within the `JsonScanner` methods, maintaining consistency with the existing `py_decode_error` function.
632-664: LGTM on Callable implementation. The entry point correctly validates the index, computes the byte index from the character index (accepting the O(n) cost at the entry point), and properly handles the beyond-string-length case by returning `StopIteration`. This establishes the dual-index tracking that propagates through all nested parsing.
701-732: LGTM on scanstring function. The function correctly bridges the Python API (character indices) with the Rust implementation (byte indices). The index conversion using `code_point_indices().nth()` handles multi-byte characters correctly, and the error handling for out-of-bounds indices is appropriate.
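A sketch of that boundary conversion, using std's `char_indices()` as a stand-in for `code_point_indices()` (the helper name is hypothetical):

```rust
/// Convert a Python-level character index to a byte index once, at the
/// API boundary, so all inner parsing can use O(1) byte slicing.
fn char_to_byte_idx(s: &str, char_idx: usize) -> usize {
    s.char_indices()
        .nth(char_idx)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(s.len()) // indices past the end clamp to len
}
```

For example, `char_to_byte_idx("héllo", 2)` is 3, since `é` occupies two bytes; the clamp to `s.len()` mirrors the defaulting behavior discussed in the nitpick above.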
```diff
                 if let Some(lead) = uni.to_lead_surrogate() {
                     // uni is a surrogate -- try to find its pair
                     let mut chars2 = chars.clone();
-                    if let Some(((pos2, _), (_, _))) = chars2
+                    if let Some(((_, (byte_pos2, _)), (_, _))) = chars2
                         .next_tuple()
                         .filter(|((_, (_, c1)), (_, (_, c2)))| *c1 == '\\' && *c2 == 'u')
                     {
-                        let uni2 = decode_unicode(&mut chars2, pos2)?;
+                        let uni2 =
+                            decode_unicode(&mut chars2, char_offset + next_char_i + 1)?;
                         if let Some(trail) = uni2.to_trail_surrogate() {
                             // ok, we found what we were looking for -- \uXXXX\uXXXX, both surrogates
                             uni = lead.merge(trail).into();
-                            chunk_start = pos2 + 6;
+                            chunk_start = byte_pos2 + 6;
                             chars = chars2;
                         }
                     }
```
Error position for second surrogate escape may be misleading.

When parsing a surrogate pair like `\uD800\uDC00`, if the second escape is malformed, the error position passed to `decode_unicode` appears incorrect.

At lines 219-220, the position `char_offset + next_char_i + 1` evaluates to approximately `char_offset + char_i + 2` (pointing into the first escape), whereas the actual error would be in the second escape, starting around `char_offset + char_i + 7`.

This only affects error message positions, not parsing correctness, so it's a minor issue.

Suggested position fix

The position should account for the 6 characters consumed by the first `\uXXXX` plus the `\` already consumed by `next_tuple()`:
```diff
-                        let uni2 =
-                            decode_unicode(&mut chars2, char_offset + next_char_i + 1)?;
+                        let uni2 =
+                            decode_unicode(&mut chars2, char_offset + char_i + 7)?;
```

Alternatively, extract the character index from the tuple destructuring instead of computing it.
🤖 Prompt for AI Agents
In `@crates/stdlib/src/json/machinery.rs` around lines 212 - 227, The error
position passed to decode_unicode for the second surrogate is wrong; adjust the
computed position so it points at the start of the second "\uXXXX" escape (not
inside the first). In the surrogate-pair branch around uni.to_lead_surrogate(),
change the position argument passed to decode_unicode from char_offset +
next_char_i + 1 to a value that accounts for the first "\uXXXX" (e.g.
char_offset + next_char_i + 7) or, better, use the character index/byte_pos
extracted from the tuple destructuring (the byte_pos2/tuple element) to compute
the exact offset and pass that into decode_unicode so error locations refer to
the second escape. Ensure chunk_start and chars updates remain consistent with
the new position calculation.
Parse JSON constants (null, true, false) directly in Rust within `call_scan_once()` instead of falling back to Python `scan_once`. This reduces Python-Rust boundary crossings for array/object values. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Parse JSON numbers starting with digits (0-9) directly in Rust within `call_scan_once()` by reusing the existing `parse_number()` method. This reduces Python-Rust boundary crossings for array/object values. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Parse special JSON constants (NaN, Infinity, -Infinity) and negative numbers directly in Rust within `call_scan_once()`. This handles:
- 'N' -> NaN via `parse_constant` callback
- 'I' -> Infinity via `parse_constant` callback
- '-' -> -Infinity or negative numbers via `parse_constant`/`parse_number`

This reduces Python-Rust boundary crossings for array/object values. Co-Authored-By: Claude Opus 4.5 <[email protected]>
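Taken together, these commits amount to a first-byte dispatch: peeking one byte decides which Rust-side parser handles a value, so no Python `scan_once` round-trip is needed for the common cases. A hedged sketch of the idea (names are illustrative; the real dispatch lives in `call_scan_once`):

```rust
/// Which parser handles the value starting at this byte.
enum Kind {
    Str, Obj, Arr,
    True, False, Null,          // ASCII literals, parsed in Rust
    NaN, Infinity, NegOrNegInf, // routed through parse_constant
    Number,
}

fn classify(first: u8) -> Option<Kind> {
    Some(match first {
        b'"' => Kind::Str,
        b'{' => Kind::Obj,
        b'[' => Kind::Arr,
        b't' => Kind::True,          // "true"
        b'f' => Kind::False,         // "false"
        b'n' => Kind::Null,          // "null"
        b'N' => Kind::NaN,           // "NaN" via parse_constant
        b'I' => Kind::Infinity,      // "Infinity" via parse_constant
        b'-' => Kind::NegOrNegInf,   // "-Infinity" or a negative number
        b'0'..=b'9' => Kind::Number, // digits reuse parse_number()
        _ => return None,            // anything else is a syntax error
    })
}
```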
Force-pushed from 7e51d21 to 073a24a
First, to minimize movement between Python and Rust, I made it parse in Rust whenever possible. This improved performance from 8.77s on the main branch (per my local benchmark) to 7.37s. Although still slow compared to CPython's 0.2s, it's about an 18% improvement. I ran the benchmark about 3 times, calculating based on the highest value.

Additionally, since JSON often has duplicate keys, I also implemented CPython's method of reusing them. Measured separately, this didn't show any execution-time benefit; it's only expected to help with memory efficiency.
While profiling by capturing multiple flamegraph spans, I discovered that scanstring was taking the longest time. The reason was that operations like `chars().nth()` are O(N) and severely slow. I avoided this performance bottleneck by operating on byte indices alongside character indices.
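A toy illustration of why that matters, using only std:

```rust
// O(N): every call re-decodes UTF-8 from the start of the string.
fn nth_char_slow(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i)
}

// O(1): slicing at a byte index already known to be a char boundary.
fn rest_fast(s: &str, byte_idx: usize) -> &str {
    &s[byte_idx..]
}
```

Calling `nth_char_slow` inside a parsing loop re-walks the string from the beginning on every step, making the whole parse quadratic; carrying a byte index alongside the character index makes each step constant-time.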
Compared to CPython's 0.2 seconds, it now takes around 0.4 to 0.5 seconds. That's a huge improvement: over 16 times faster than before!
For the original motivation, `pip install`: what used to take 5 seconds for the `black` package now takes 1.61 seconds. Even when built for debug targets, what used to take 171 seconds now finishes in under 20 seconds.

Summary by CodeRabbit
Bug Fixes
Improvements
New Features