-
Notifications
You must be signed in to change notification settings - Fork 1.4k
gc module internal structure and API #6910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughImplements a generational GC subsystem and exposes a Python-compatible gc module API, adds VM context fields for gc state, extends PyObject inspection for GC, and updates Cargo features and spellcheck tokens. Changes
Sequence DiagramsequenceDiagram
actor Python
participant GC as "gc module"
participant GCState as "GcState (global)"
participant Callbacks as "callbacks list"
participant VM as "VM Context"
Python->>GC: collect(generation=None)
GC->>GCState: collect(generation)
activate GCState
GCState->>Callbacks: invoke_callbacks("start", generation, 0, 0)
Callbacks->>VM: execute registered callbacks
VM-->>Callbacks: callback results
GCState->>GCState: perform collection (counts)
GCState->>Callbacks: invoke_callbacks("stop", generation, collected, uncollectable)
Callbacks->>VM: execute registered callbacks
VM-->>Callbacks: callback results
deactivate GCState
GCState-->>GC: return (collected, uncollectable)
GC-->>Python: return counts
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Code has been automatically formatted The code in this PR has been formatted using:
git pull origin gc-internal |
|
cc @discord9 |
8052f5f to
5a8b225
Compare
Add gc_state module with GcState, GcGeneration, GcDebugFlags, GcStats. Replace gc module stubs with working API backed by gc_state. Add gc_callbacks and gc_garbage to Context. Add is_gc_tracked, gc_finalized, gc_get_referents to PyObject. Collection is stubbed (returns 0) — actual algorithm to follow.
d63fba5 to
4ff9fb8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@crates/stdlib/src/gc.rs`:
- Around line 189-196: get_referrers currently unconditionally returns an empty
list but the review warns against raising NotImplementedError; to match CPython
behavior, either implement basic GC-tracking or explicitly preserve the
empty-list semantics and document the limitation. Update the
get_referrers(FuncArgs, VirtualMachine) implementation to return
vm.ctx.new_list(vec![]) for non-discoverable referrers, add/adjust the doc
comment inside the function to state that only GC-tracked references are
discoverable and stack/C-extension references are not, and do not raise
NotImplementedError; alternatively, if you choose to implement minimal tracking,
modify the scanning logic in get_referrers to consult your GC-tracking
structures and collect referrers before falling back to vm.ctx.new_list(vec![]).
In `@crates/vm/src/gc_state.rs`:
- Around line 231-287: untrack_object currently never removes a deallocated
frozen object from permanent_objects, leaving stale pointers that break unfreeze
and get_freeze_count; update unsafe fn untrack_object to also acquire a write
lock on self.permanent_objects and remove the GcObjectPtr (gc_ptr) just like
tracked_objects and finalized_objects are handled, ensuring the permanent set
does not retain freed pointers (refer to GcObjectPtr, untrack_object,
permanent_objects, unfreeze, and get_freeze_count).
In `@crates/vm/src/object/core.rs`:
- Around line 902-919: is_gc_tracked currently returns capability
(vtable.trace.is_some() || dict.is_some()) instead of actual membership; add a
public query on GcState (e.g., GcState::is_tracked(&self, obj_id: ObjectId) or
similar) that checks the tracked_objects set used by
track_object/untrack_object, then change PyObject::is_gc_tracked to call that
GcState query rather than inspecting vtable.trace or dict; ensure you use the
object's unique identifier or reference the same key used in tracked_objects so
is_gc_tracked reflects real GC membership while gc_get_referents can remain
unchanged.
🧹 Nitpick comments (1)
crates/vm/src/gc_state.rs (1)
289-305: Finalize tracking has two sources of truth.
GcState::{is_finalized, mark_finalized}uses a HashSet, butPyObject::gc_finalizedrelies on per-object bits and stdlibgc.is_finalizeduses that path. Consider consolidating on one mechanism to avoid divergence (e.g., exposeset_gc_finalizedaspub(crate)and set bits here, or drop the set).
| /// Return the list of objects that directly refer to any of the arguments. | ||
| #[pyfunction] | ||
| fn get_referrers(_args: FuncArgs, vm: &VirtualMachine) -> PyResult { | ||
| Err(vm.new_not_implemented_error("")) | ||
| fn get_referrers(args: FuncArgs, vm: &VirtualMachine) -> PyListRef { | ||
| // This is expensive: we need to scan all tracked objects | ||
| // For now, return an empty list (would need full object tracking to implement) | ||
| let _ = args; | ||
| vm.ctx.new_list(vec![]) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🌐 Web query:
CPython gc.get_referrers documentation behavior error handling
💡 Result:
gc.get_referrers(*objs) (CPython) is explicitly a debugging API with several important behavioral limitations:
- What it returns: a list of objects that directly refer to any of the given
objs, but it only finds referrers among GC-tracked containers (objects participating in cyclic-GC viatp_traverse). If an extension type holds references but does not support GC, it will not be reported. [1] - “Stale” referrers can appear: objects that have been dereferenced but are still part of uncollected cycles may show up; if you want “currently live” results, run
gc.collect()first. [1] - Returned objects may be temporarily invalid: the docs warn that some returned referrers may still be under construction and in a temporarily invalid state—avoid doing anything other than inspection/debugging. [1]
- It can miss real referrers: not all references in the runtime are represented as GC-tracked container links. For example, references held only on the VM evaluation stack (and some other internal/C-held references) may not be reported, so
gc.get_referrers(x)can be empty even whenxis obviously still reachable. [2]
Error handling / exceptions (what the docs imply)
- The function fires an auditing event
gc.get_referrerswith argumentobjs; if you have auditing hooks / security policy enabled, that hook may raise and thus makeget_referrers()fail. [1] - Otherwise, the standard docs do not specify special “semantic” errors (it’s a normal builtin; you can still get ordinary failures like
MemoryErrorif allocation of the result list fails, orTypeErrorfor incorrect calling convention). The main “gotcha” called out is invalid/unsafe returned objects, not a recoverable exception path. [1]
Sources: Python gc module docs (get_referrers) [1], explanation of missed referrers (VM stack / C-held refs) [2].
[1] (docs.python.org)
[2] (stackoverflow.com)
Citations:
- 1: https://docs.python.org/3.12/library/gc.html?utm_source=openai
- 2: https://stackoverflow.com/questions/55215649/does-any-object-tracked-in-python-gc-have-no-referrer?utm_source=openai
🏁 Script executed:
# First, let's check the actual file to confirm the current state
cat -n crates/stdlib/src/gc.rs | head -220 | tail -40Repository: RustPython/RustPython
Length of output: 1633
get_referrers returns an empty list, but raising NotImplementedError contradicts CPython behavior.
CPython's gc.get_referrers() does not raise NotImplementedError—it legitimately returns an empty list when no referrers are found (e.g., references on the VM stack or in non-GC-tracked C objects are not discoverable). The function is documented as a debugging API with inherent limitations, not as a "not implemented" operation.
The underlying concern is valid: the current implementation makes no attempt to find referrers and unconditionally returns an empty list. However, the suggested fix conflicts with CPython compatibility. Consider either:
- Implementing at least basic tracking of GC-managed object references, or
- Accepting the empty list as a known limitation while improving documentation
🤖 Prompt for AI Agents
In `@crates/stdlib/src/gc.rs` around lines 189 - 196, get_referrers currently
unconditionally returns an empty list but the review warns against raising
NotImplementedError; to match CPython behavior, either implement basic
GC-tracking or explicitly preserve the empty-list semantics and document the
limitation. Update the get_referrers(FuncArgs, VirtualMachine) implementation to
return vm.ctx.new_list(vec![]) for non-discoverable referrers, add/adjust the
doc comment inside the function to state that only GC-tracked references are
discoverable and stack/C-extension references are not, and do not raise
NotImplementedError; alternatively, if you choose to implement minimal tracking,
modify the scanning logic in get_referrers to consult your GC-tracking
structures and collect referrers before falling back to vm.ctx.new_list(vec![]).
| /// Check if this object is tracked by the garbage collector. | ||
| /// Returns true if the object has a trace function or has an instance dict. | ||
| pub fn is_gc_tracked(&self) -> bool { | ||
| if self.0.vtable.trace.is_some() { | ||
| return true; | ||
| } | ||
| self.0.dict.is_some() | ||
| } | ||
|
|
||
| /// Get the referents (objects directly referenced) of this object. | ||
| /// Uses the full traverse including dict and slots. | ||
| pub fn gc_get_referents(&self) -> Vec<PyObjectRef> { | ||
| let mut result = Vec::new(); | ||
| self.0.traverse(&mut |child: &PyObject| { | ||
| result.push(child.to_owned()); | ||
| }); | ||
| result | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for GC tracking-related code
echo "=== Searching for TRACKED bit and tracking functions ==="
rg -n "GcBits::TRACKED|track_object|untrack_object|GcState" -g '*.rs' --max-count 30Repository: RustPython/RustPython
Length of output: 1074
🏁 Script executed:
#!/bin/bash
# Examine the is_gc_tracked implementation and surrounding context
echo "=== Looking at core.rs to understand current implementation ==="
rg -B 5 -A 5 "fn is_gc_tracked" crates/vm/src/object/core.rsRepository: RustPython/RustPython
Length of output: 485
🏁 Script executed:
#!/bin/bash
# Search for how objects are added/removed from GC tracking
echo "=== Looking for gc collection and tracking mechanisms ==="
fd -e rs | xargs rg -l "gc_collect|gc_track|GcBits" | head -20Repository: RustPython/RustPython
Length of output: 175
🌐 Web query:
CPython gc.is_tracked implementation and behavior
💡 Result:
What gc.is_tracked(obj) reports (behavior)
gc.is_tracked(obj)returns whether this object instance is currently in CPython’s cyclic-GC tracking lists (not whether it is “GC-capable” in principle). Its result can change after later garbage collections due to container “untracking” optimizations. [1][2]- Objects that cannot participate in reference cycles (most “atomic” objects like
int,str, etc.) are not tracked. Containers and user-defined objects typically are tracked, but some built-in containers may be untracked when they are proven to contain only non-tracked contents. [1][2] - This behavior is version- and type-optimization-dependent; e.g., dictionaries/tuples may toggle between tracked and untracked depending on their contents and what GC cycle has run. [1][2]
How CPython implements it (core idea)
At the C level, Python exposes PyObject_GC_IsTracked(op) (Stable ABI since 3.9), documented as “analogous to gc.is_tracked()”: it returns 1 only if the object both supports the GC protocol and is currently tracked. [3]
Internally, CPython’s fast check for “tracked” is essentially:
- tracked ⇔ the object’s GC header indicates it’s linked into a GC list
- concretely:
_PyObject_GC_IS_TRACKED(op)returns true when the per-object GC header field_gc_nextis non-zero. [4] - and when an object is untracked, CPython unlinks it from the GC list and sets
gc->_gc_next = 0. [5]
Practical implications / “gotchas”
gc.is_tracked(x) == Falsedoes not necessarily mean “cannot ever be tracked”: many container objects can be tracked/untracked over time depending on contents and GC passes. [2][4]- If you’re writing/inspecting C extensions: “GC-capable” is
PyObject_IS_GC(obj), while “currently tracked” isPyObject_GC_IsTracked(obj)/_PyObject_GC_IS_TRACKED(obj). [3][4]
Sources: Python gc docs and GC design notes [1][2], CPython C-API GC support docs [3], internal header/coverage view showing _PyObject_GC_IS_TRACKED [4], and internal untrack implementation showing _gc_next = 0 [5].
[1] https://docs.python.org/3.12/library/gc.html
[2] https://daobook.github.io/devguide/garbage_collector.html
[3] https://docs.python.org/3/c-api/gcsupport.html
[4] https://cheimes.fedorapeople.org/cpython/lcov-report/Include/internal/pycore_gc.h.gcov.html
[5] https://android.googlesource.com/platform/prebuilts/python/linux-x86/+/5e62e45e850184b4a88ada6467f73f39f8b44609/include/python3.9/internal/pycore_object.h
🏁 Script executed:
#!/bin/bash
# Look at track_object and untrack_object implementations
sed -n '236,280p' crates/vm/src/gc_state.rsRepository: RustPython/RustPython
Length of output: 1872
🏁 Script executed:
#!/bin/bash
# Look for what data structures hold tracked objects
sed -n '99,160p' crates/vm/src/gc_state.rsRepository: RustPython/RustPython
Length of output: 2550
🏁 Script executed:
#!/bin/bash
# Search for how is_gc_tracked is currently used in the codebase
rg "is_gc_tracked" -g '*.rs' -B 2 -A 2Repository: RustPython/RustPython
Length of output: 751
🏁 Script executed:
#!/bin/bash
# Check the gc.rs stdlib module to see if it implements gc.is_tracked()
head -100 crates/stdlib/src/gc.rsRepository: RustPython/RustPython
Length of output: 3103
🏁 Script executed:
#!/bin/bash
# Look for how GcState is accessed from PyObject methods
rg "gc_state()" crates/vm/src/object/core.rs -B 3 -A 3Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Check if there are any methods to query if an object is in tracked_objects
rg "tracked_objects" crates/vm/src/gc_state.rs -B 2 -A 2Repository: RustPython/RustPython
Length of output: 1090
🏁 Script executed:
#!/bin/bash
# Look at the full GcState struct definition and public methods
sed -n '99,250p' crates/vm/src/gc_state.rsRepository: RustPython/RustPython
Length of output: 5303
🏁 Script executed:
#!/bin/bash
# Check if there's a way to query tracking state
rg "pub.*fn.*is_tracked|is_in_gc" crates/vm/src/gc_state.rsRepository: RustPython/RustPython
Length of output: 47
is_gc_tracked() incorrectly reports GC-capability rather than actual tracking state.
The method returns true based on trace.is_some() or dict.is_some(), which indicates the object can be tracked, not whether it currently is tracked. GcState maintains actual tracking membership in tracked_objects (populated/cleared by track_object/untrack_object), but this state is private and inaccessible from is_gc_tracked(). This diverges from CPython's gc.is_tracked(), which reports actual GC list membership.
To fix: expose a public query method on GcState to check tracking membership, and have is_gc_tracked() consult it instead of relying on static vtable/dict presence.
🤖 Prompt for AI Agents
In `@crates/vm/src/object/core.rs` around lines 902 - 919, is_gc_tracked currently
returns capability (vtable.trace.is_some() || dict.is_some()) instead of actual
membership; add a public query on GcState (e.g., GcState::is_tracked(&self,
obj_id: ObjectId) or similar) that checks the tracked_objects set used by
track_object/untrack_object, then change PyObject::is_gc_tracked to call that
GcState query rather than inspecting vtable.trace or dict; ensure you use the
object's unique identifier or reference the same key used in tracked_objects so
is_gc_tracked reflects real GC membership while gc_get_referents can remain
unchanged.
ShaharNaveh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
cec949d to
06c8270
Compare
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [ ] lib: cpython/Lib/asyncio dependencies:
dependent tests: (8 tests)
[ ] test: cpython/Lib/test/test_dict.py (TODO: 3) dependencies: dependent tests: (no tests depend on dict) [ ] test: cpython/Lib/test/test_generators.py (TODO: 5) dependencies: dependent tests: (no tests depend on generator) [ ] lib: cpython/Lib/subprocess.py dependencies:
dependent tests: (51 tests)
[x] lib: cpython/Lib/weakref.py dependencies:
dependent tests: (146 tests)
[ ] test: cpython/Lib/test/test_weakset.py (TODO: 1) dependencies: dependent tests: (no tests depend on weakset) [ ] lib: cpython/Lib/zoneinfo dependencies:
dependent tests: (3 tests)
Legend:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@crates/vm/src/gc_state.rs`:
- Around line 236-252: In track_object, a poisoned write lock on
generation_objects[0] can cause the object to be inserted into tracked_objects
but not generation_objects (and counts not incremented), creating inconsistency
with get_objects or generation counts; fix by making the operation atomic:
either return a Result from track_object (propagate the lock error from
generation_objects[0].write()) or acquire/check both locks first and only insert
into tracked_objects if generation_objects[0].write() succeeded (i.e., perform
the generation_objects[0].write() and insert into gen0 and update
generations[0].count/alloc_count before touching tracked_objects), referencing
track_object, generation_objects, tracked_objects, generations, alloc_count and
get_objects so the change is applied where those symbols are used.
- Around line 391-429: The freeze() and unfreeze() functions drain sets into
local vectors then attempt to write the destination lock, which can fail and
silently drop objects; change them to handle lock errors defensively: after
draining a generation (generation_objects.write() in freeze) or draining
permanent (permanent_objects.write() in unfreeze), check the result of acquiring
the destination lock (permanent_objects.write() in freeze,
generation_objects[2].write() in unfreeze) and if it Err, re-acquire the
original source lock (or re-insert the drained pointers back into the original
generation/permanent) or else log a fatal error/panic so objects are not lost;
reference freeze, unfreeze, generation_objects, permanent_objects,
permanent.count, generations[*].count and ensure counts are only updated after
successful insertion into the destination set.
| pub unsafe fn track_object(&self, obj: NonNull<PyObject>) { | ||
| let gc_ptr = GcObjectPtr(obj); | ||
|
|
||
| // Add to generation 0 tracking first (for correct gc_refs algorithm) | ||
| // Only increment count if we successfully add to the set | ||
| if let Ok(mut gen0) = self.generation_objects[0].write() | ||
| && gen0.insert(gc_ptr) | ||
| { | ||
| self.generations[0].count.fetch_add(1, Ordering::SeqCst); | ||
| self.alloc_count.fetch_add(1, Ordering::SeqCst); | ||
| } | ||
|
|
||
| // Also add to global tracking (for get_objects, etc.) | ||
| if let Ok(mut tracked) = self.tracked_objects.write() { | ||
| tracked.insert(gc_ptr); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential inconsistency if generation lock is poisoned.
If generation_objects[0].write() returns Err (poisoned lock), the object is skipped from generation tracking but still added to tracked_objects. This creates an inconsistent state where get_objects(None) would return the object but get_objects(Some(0)) would not, and the generation count would be wrong.
Consider either:
- Propagating the error to the caller, or
- Ensuring both operations succeed or neither does (transactional semantics).
Since this is interface-only and actual GC implementation will follow, this may be acceptable for now if documented.
🤖 Prompt for AI Agents
In `@crates/vm/src/gc_state.rs` around lines 236 - 252, In track_object, a
poisoned write lock on generation_objects[0] can cause the object to be inserted
into tracked_objects but not generation_objects (and counts not incremented),
creating inconsistency with get_objects or generation counts; fix by making the
operation atomic: either return a Result from track_object (propagate the lock
error from generation_objects[0].write()) or acquire/check both locks first and
only insert into tracked_objects if generation_objects[0].write() succeeded
(i.e., perform the generation_objects[0].write() and insert into gen0 and update
generations[0].count/alloc_count before touching tracked_objects), referencing
track_object, generation_objects, tracked_objects, generations, alloc_count and
get_objects so the change is applied where those symbols are used.
| pub fn freeze(&self) { | ||
| // Move all objects from gen0-2 to permanent | ||
| let mut objects_to_freeze: Vec<GcObjectPtr> = Vec::new(); | ||
|
|
||
| for (gen_idx, generation) in self.generation_objects.iter().enumerate() { | ||
| if let Ok(mut gen_set) = generation.write() { | ||
| objects_to_freeze.extend(gen_set.drain()); | ||
| self.generations[gen_idx].count.store(0, Ordering::SeqCst); | ||
| } | ||
| } | ||
|
|
||
| // Add to permanent set | ||
| if let Ok(mut permanent) = self.permanent_objects.write() { | ||
| let count = objects_to_freeze.len(); | ||
| for ptr in objects_to_freeze { | ||
| permanent.insert(ptr); | ||
| } | ||
| self.permanent.count.fetch_add(count, Ordering::SeqCst); | ||
| } | ||
| } | ||
|
|
||
| /// Unfreeze all objects (move from permanent to gen2) | ||
| pub fn unfreeze(&self) { | ||
| let mut objects_to_unfreeze: Vec<GcObjectPtr> = Vec::new(); | ||
|
|
||
| if let Ok(mut permanent) = self.permanent_objects.write() { | ||
| objects_to_unfreeze.extend(permanent.drain()); | ||
| self.permanent.count.store(0, Ordering::SeqCst); | ||
| } | ||
|
|
||
| // Add to generation 2 | ||
| if let Ok(mut gen2) = self.generation_objects[2].write() { | ||
| let count = objects_to_unfreeze.len(); | ||
| for ptr in objects_to_unfreeze { | ||
| gen2.insert(ptr); | ||
| } | ||
| self.generations[2].count.fetch_add(count, Ordering::SeqCst); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Objects can be lost if lock acquisition fails after draining.
In freeze(), if generation_objects are successfully drained but permanent_objects.write() returns Err (poisoned), the objects in objects_to_freeze are dropped without being added to any tracking set. Similarly in unfreeze().
Consider logging a warning or panic on lock failure here, since losing GC-tracked objects is a serious error condition.
Potential defensive handling
// Add to permanent set
- if let Ok(mut permanent) = self.permanent_objects.write() {
+ let mut permanent = self.permanent_objects.write()
+ .expect("permanent_objects lock poisoned during freeze");
let count = objects_to_freeze.len();
for ptr in objects_to_freeze {
permanent.insert(ptr);
}
self.permanent.count.fetch_add(count, Ordering::SeqCst);
- }🤖 Prompt for AI Agents
In `@crates/vm/src/gc_state.rs` around lines 391 - 429, The freeze() and
unfreeze() functions drain sets into local vectors then attempt to write the
destination lock, which can fail and silently drop objects; change them to
handle lock errors defensively: after draining a generation
(generation_objects.write() in freeze) or draining permanent
(permanent_objects.write() in unfreeze), check the result of acquiring the
destination lock (permanent_objects.write() in freeze,
generation_objects[2].write() in unfreeze) and if it Err, re-acquire the
original source lock (or re-insert the drained pointers back into the original
generation/permanent) or else log a fatal error/panic so objects are not lost;
reference freeze, unfreeze, generation_objects, permanent_objects,
permanent.count, generations[*].count and ensure counts are only updated after
successful insertion into the destination set.
Only interface, without actual GC
split from #6849
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.