Technical Details¶
This document describes emend’s internal design, the rationale behind key architectural choices, and the dependencies that underpin each subsystem.
Overview¶
emend is structured around two complementary refactoring primitives that share a common execution model:
- Structured edits — component-level surgery on symbol metadata (parameters, return types, decorators, bases, body) addressed via a selector grammar.
- Pattern transforms — code-pattern search and replace using metavariable capture syntax (`$X`, `$...ARGS`) compiled to a unified structural matcher.
Both systems parse source files into a concrete syntax tree (CST) using Tree-sitter via a Rust backend. Transformations produce a sequence of byte-range edits that are applied to the original source, ensuring that untouched code is reproduced character-for-character while maintaining high performance.
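The byte-range edit model can be sketched in a few lines. This is an illustrative stand-in, not emend's actual API: because edits never overlap, applying them in reverse byte order keeps earlier offsets valid, and every untouched byte passes through unchanged.

```python
# Illustrative sketch (not emend's real mutation engine): apply a set of
# non-overlapping (start, end, replacement) byte-range edits to source.
def apply_edits(source: bytes, edits: list[tuple[int, int, bytes]]) -> bytes:
    out = bytearray(source)
    # Apply right-to-left so earlier offsets stay valid as later
    # ranges change length.
    for start, end, replacement in sorted(edits, reverse=True):
        out[start:end] = replacement
    return bytes(out)

src = b"def old_name(x):\n    return x\n"
# Rename the function by replacing bytes 4..12 ("old_name").
edited = apply_edits(src, [(4, 12, b"new_name")])
# → b"def new_name(x):\n    return x\n"
```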
Two-layer architecture¶
Rust layer: emend_core¶
The emend_core extension is a PyO3 / maturin Rust crate compiled directly into the emend wheel.
It handles all performance-critical AST analysis and manipulation:
- File discovery — parallel directory walk via `rayon`.
- Unified Scope Resolver — a multi-language-capable engine that builds scope trees and resolves qualified names. Scoping and binding rules are driven by a language configuration file (e.g., `languages/python.toml`).
- Structural Matcher — a generic matcher that executes structural queries and captures metavariables directly on the Tree-sitter AST.
- Mutation Engine — `PyFileTransform` manages a set of non-overlapping byte-range replacements, providing the foundation for all code edits.
- Symbol collection — extracts definition trees with rich metadata (signatures, return types, visibility) for the `search --output summary` command.
- Reference analysis — resolves all identifiers and attributes to qualified names, supporting `refs`, `rename`, and `deadcode`.
Python layer: CLI and Orchestration¶
The Python layer provides the user-facing CLI and orchestrates complex refactoring workflows:
- CLI Entry Point — command definitions and argument parsing using Typer.
- Query Parsing — Lark grammars for the selector and pattern languages.
- Transform Orchestration — higher-level refactoring logic (e.g., `move`, `copy-to`) that combines primitives like symbol extraction and byte-range edits.
- Type Oracle integration — adapters for external type inference engines (Pyright, Pyrefly) that provide optional semantic metadata.
Language Configuration¶
emend’s scoping and binding logic is language-agnostic and data-driven. Each supported language is defined by a TOML configuration file that specifies:
- Scope Creators — AST node types that create new lexical scopes (functions, classes, comprehensions).
- Binding Rules — rules for how names are introduced (assignments, loop variables, parameters).
- Import Resolution — strategies for resolving cross-file qualified names.
- Visibility Rules — language-specific conventions for public vs. private API detection.
Adding a new language¶
To add support for a new language, create a directory under languages/ with
two files:
languages/<lang_name>/
├── config.toml # scope resolver configuration
└── symbols.scm # tree-sitter query for symbol extraction
config.toml¶
The configuration file drives the scope resolver and qualified-name builder.
Use languages/python/config.toml as a reference. The file contains the
following sections:
- `[language]` — Top-level metadata: language `name`, the `tree_sitter_grammar` crate to use for parsing, `file_extensions` (e.g., `[".py"]`), and `keywords` that the pattern compiler should treat as reserved.
- `[scoping]` — Declares which AST node types create new scopes (`scope_creators`) and per-kind rules. Each scope kind has an `is_closure_boundary` flag that controls whether name lookups propagate outward (e.g., Python class scopes are closure boundaries, so inner functions cannot see class-level names directly).
- `[bindings]` — Rules for how names are introduced into a scope: assignments, loop targets, function/class parameters, `with` targets, exception handler names, and definition nodes (functions, classes). Each rule maps an AST node type to the child field that holds the bound name.
- `[imports]` — Describes the structure of import statements so the resolver can extract module paths and bound names. `resolution` selects one of the built-in resolvers (see below).
- `[qualified_names]` — Controls how qualified names are assembled: `module_separator` (`"."` for Python), `class_member_prefix` and `nested_function_prefix` flags, and the `locals_marker` string for nested scopes.
- `[exports]` — Public API conventions such as `__all__` membership or naming patterns that mark a symbol as an entry point for dead-code analysis.
- `[builtins]` — A list of names that are always in scope and do not need resolution (e.g., `print`, `len`, `True` in Python).
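A hypothetical fragment illustrating the sections above. The table and key names come from the text; the specific values, and any keys beyond those named there (such as `names` under `[builtins]`), are assumptions for illustration:

```toml
[language]
name = "python"
tree_sitter_grammar = "tree-sitter-python"
file_extensions = [".py"]
keywords = ["def", "class", "return"]

[scoping]
scope_creators = ["function_definition", "class_definition"]

[qualified_names]
module_separator = "."

[builtins]
names = ["print", "len", "True"]
```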
symbols.scm¶
A tree-sitter query file that defines which AST nodes constitute symbols for the `search --output summary` and `deadcode` commands. The Python query captures function definitions, class definitions, decorated definitions, and top-level variable assignments. Each captured node should use a `@name` capture for the symbol’s identifier and a top-level capture (e.g., `@function`, `@class`, `@variable`) for the full node.
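A hypothetical sketch of what such an entry looks like; the real query lives in `languages/python/symbols.scm` and may use different node names:

```scheme
; Capture the identifier as @name and the whole node as its symbol kind.
(function_definition
  name: (identifier) @name) @function

(class_definition
  name: (identifier) @name) @class
```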
Import resolution strategies¶
The Rust backend provides built-in import resolution strategies selected by
the resolution field in [imports]:
"python"— follows importlib semantics:sys.path,src/layout detection, relative imports, and namespace packages."node"— implements Node.js module resolution for JavaScript and TypeScript:node_moduleslookup,index.jsdefaults, andpackage.jsonmain/exportsfields.
Adding a new resolution strategy requires changes to the Rust emend_core
crate.
Testing¶
After adding a new language configuration, run the full test suite to verify that the scope resolver, symbol collection, and reference analysis all behave correctly:
make test
Pay particular attention to qualified-name construction and import resolution,
as errors in those areas propagate to refs, rename, and deadcode.
Caching and indexing¶
emend maintains a cache at .emend/cache/parse.db (SQLite, WAL mode).
The cache is content-addressed — almost every key is the MD5 of the file’s
source text — so switching branches or reverting edits naturally reuses
earlier entries. A .gitignore and .dockerignore are auto-generated
inside .emend/cache/ to prevent the database from being checked in.
When running inside a git worktree the cache is stored in the main repo’s
.emend/cache/ so all worktrees share a single database (see
Git worktree support below).
Overview of cache tables¶
| Table | Key | Contents |
|---|---|---|
| `qn_index` | content MD5 (BLOB) | Compressed-pickled qualified-name set for the file. |
| `type_cache` | content MD5 (TEXT) | Compressed-pickled type engine results. |
| `file_manifest` | (worktree_id, absolute path) | Per-worktree stat data (`mtime_ns`, size, content hash, timestamp) used by the freshness check. |
| `symbol_index` | (content_hash, file_path, name, …) | One row per symbol definition (function, class, method). Stores name, qualified name, kind, line range, depth, parent, signature, return type, and decorators. |
| `reference_index` | (content_hash, target_qn, file_path, line, col) | One row per reference to a qualified name. Each row records the reference kind. |
| `import_graph` | (content_hash, imported_module) | One row per import statement, mapping the importing file to the dotted module name. |
| `index_meta` | key name (TEXT) | Key-value pairs such as `git_head:<worktree_id>`. |
How caches are populated¶
There are two population paths:
Lazy (on first use). visit_project() populates qn_index as a
side-effect of running the Rust PyScopeResolver: after each file is
resolved, the qualified-name set is stored in the cache.
Eager (`emend tool index`). warm_caches() scans the project in parallel
using a ProcessPoolExecutor. Each worker subprocess (_index_batch())
receives a batch of (file_path, source_text) tuples and performs:
1. QN resolution — `PyScopeResolver` → compressed pickle → `qn_index`.
2. Symbol collection — `emend_core.collect_symbols_from_str()` → `symbol_index` rows (name, kind, line, signature, etc.).
3. Import extraction — regex scan of `import` / `from … import` statements → `import_graph` rows.
4. Reference collection — `PyScopeResolver` reference output → `reference_index` rows (target QN, line, column, ref_kind).
Apart from the regex-based import extraction, all analysis is handled by the Rust tree-sitter backend.
After all workers finish, the main process performs three additional steps:
1. File manifest — `stat()` every indexed file and write `(worktree_id, path, mtime_ns, size, content_hash, timestamp)` to `file_manifest`. Each worktree maintains its own set of manifest rows.
2. Git HEAD — run `git rev-parse HEAD` and store the SHA in `index_meta` under the key `git_head:<worktree_id>`.
3. Type cache — run the configured type engine (pyrefly / pyright / ty) and store results in `type_cache`.
Workers write directly to the SQLite database (WAL mode permits concurrent writers across processes). Files whose content hash already appears in all relevant tables are skipped.
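The skip-by-content-hash logic for workers can be sketched as follows. This is a simplified stand-in, assuming a hypothetical set of already-indexed hashes; emend's real worker is `_index_batch()` inside `warm_caches()`:

```python
# Sketch: a worker receives (file_path, source_text) tuples and skips any
# file whose content hash is already present in the index.
import hashlib


def content_hash(source_text: str) -> str:
    return hashlib.md5(source_text.encode()).hexdigest()


def index_batch(batch: list[tuple[str, str]],
                known_hashes: set[str]) -> list[tuple[str, str]]:
    """Return (path, hash) pairs for files that actually need indexing."""
    todo = []
    for path, source in batch:
        digest = content_hash(source)
        if digest in known_hashes:
            continue  # already indexed under this content hash: skip
        # ... QN resolution, symbol/import/reference extraction would
        # run here and write rows keyed on `digest` ...
        todo.append((path, digest))
    return todo
```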
How caches are invalidated¶
Because caches are keyed on file content (MD5 hash), not file path, they are automatically correct — if a file’s content hasn’t changed, its cached data is still valid regardless of when it was written. There is no explicit “invalidation” of stale entries; old entries simply become unreachable when no file on disk has that content anymore.
For the path-indexed tables (file_manifest, symbol_index,
reference_index, import_graph), a three-tier freshness check determines
which files need re-indexing:
Tier 1 — Git HEAD (~1 ms). git rev-parse HEAD is compared against the
stored git_head:<worktree_id> in index_meta. If they match, no files
have changed since the last index in this worktree.
Tier 2 — File stat (~10–50 ms for 5 000 files). Each file is stat()-ed
and its (mtime_ns, size) compared against file_manifest. Files whose
mtime and size match are unchanged — no I/O required.
Tier 3 — Content hash (only for stat-mismatched files). Files whose mtime
or size differ are read and hashed. If the hash matches the manifest (e.g.
git stash pop touched the mtime but didn’t change content), the manifest’s
mtime is updated in-place. If the hash differs, the file is re-indexed: old
rows keyed on the previous content hash are deleted from symbol_index,
reference_index, and import_graph, then fresh rows are inserted.
This check is implemented in _scan_manifest() and exposed through
_ensure_index_fresh(), which commands call before querying the index. If
fewer than 50 files are stale, they are re-indexed inline; otherwise the caller
falls back to the cold path or advises running emend tool index.
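Tiers 2 and 3 can be sketched compactly. This is an illustrative stand-in (not `_scan_manifest()` itself), assuming a manifest dict of `path -> (mtime_ns, size, content_hash)`; the Tier 1 git HEAD comparison is omitted:

```python
# Sketch of the stat-then-hash freshness check: stat-matched files are
# skipped without I/O; hash-matched files get their manifest refreshed;
# only genuinely changed files are reported as stale.
import hashlib
import os


def stale_files(paths, manifest):
    """Return paths whose content no longer matches the manifest."""
    stale = []
    for path in paths:
        st = os.stat(path)
        entry = manifest.get(path)
        if entry and (st.st_mtime_ns, st.st_size) == entry[:2]:
            continue  # Tier 2: mtime and size match, no read needed
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if entry and digest == entry[2]:
            # Tier 3: content unchanged (e.g. touched mtime); update stat
            manifest[path] = (st.st_mtime_ns, st.st_size, digest)
        else:
            stale.append(path)  # content changed: needs re-indexing
    return stale
```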
How caches are cleaned¶
emend does not aggressively prune old entries. Content-hash keyed tables
(qn_index, type_cache) accumulate entries across branch switches, which
is intentional: switching back to an earlier branch reuses those entries.
Path-indexed rows are kept consistent by the re-index cycle described above:
when a file’s content changes, its old rows (keyed on the previous content
hash) are deleted before new rows are inserted. Deleted files are removed from
file_manifest and their derived rows are cleaned up during
_ensure_index_fresh().
To reclaim disk space or force a full rebuild:
# Delete the entire cache and rebuild from scratch:
rm -rf .emend/cache/
emend tool index
# Or just rebuild (existing entries are overwritten):
emend tool index
The emend tool index --status command reports the number of indexed files,
symbols, import edges, and references, plus how many files are stale.
Warm-path query acceleration¶
When the index is fresh, several commands bypass full-project scans:
- `find --complete <prefix>` queries `symbol_index` with a `LIKE` prefix match — typically < 5 ms.
- `analyze refs` queries `reference_index` by qualified name — typically < 10 ms.
- `_files_importing_module()` checks `import_graph` before falling back to the Rust `files_importing_module` scan.
All warm paths fall back transparently to their original (cold) implementations when the index is unavailable or stale.
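The prefix-completion query shape can be sketched against an in-memory stand-in for `symbol_index`; column names here are illustrative, not emend's schema:

```python
# Sketch of a warm-path LIKE prefix query, as used by `find --complete`.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE symbol_index (name TEXT, qualified_name TEXT)")
db.executemany(
    "INSERT INTO symbol_index VALUES (?, ?)",
    [("parse_file", "pkg.parse_file"),
     ("parse_dir", "pkg.parse_dir"),
     ("render", "pkg.render")],
)


def complete(prefix: str) -> list[str]:
    # A LIKE prefix match can use an index on `name`, so it stays fast
    # even on large projects.
    rows = db.execute(
        "SELECT qualified_name FROM symbol_index "
        "WHERE name LIKE ? ORDER BY name",
        (prefix + "%",),
    )
    return [qn for (qn,) in rows]
```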
Git worktree support¶
When emend runs inside a git worktree, the cache is automatically shared with the main repository. This means:
- Running `emend tool index` in any worktree populates the shared `parse.db`. Other worktrees immediately benefit from the cached parse trees, qualified-name indexes, type information, symbol definitions, and reference data — all of which are keyed by content hash.
- Each worktree maintains its own `file_manifest` rows (scoped by a `worktree_id` derived from the worktree’s absolute path), so stat-based freshness checks are accurate per worktree.
- Git HEAD tracking is per-worktree (`git_head:<worktree_id>` keys in `index_meta`), so branch switches in one worktree don’t invalidate another.
The mechanism works by reading the .git file in the worktree root (which
contains a gitdir: pointer) and following the commondir reference to
locate the main repository. _resolve_cache_root() in transform.py
performs this resolution and caches the result. For non-worktree repos (and
non-git projects), the project root is used directly — no behavior change.
SQLite WAL mode ensures that concurrent access from multiple worktrees (or multiple emend processes) is safe: readers never block, and writes are serialized with a configurable timeout.
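The gitdir/commondir walk can be sketched as below. This is a simplified stand-in for `_resolve_cache_root()`, using a hypothetical `resolve_main_repo()` helper:

```python
# Sketch: follow a worktree's .git file back to the main repository root.
from pathlib import Path


def resolve_main_repo(project_root: Path) -> Path:
    git_path = project_root / ".git"
    if git_path.is_dir() or not git_path.exists():
        # Normal repo or non-git project: use the project root directly.
        return project_root
    # In a worktree, .git is a file containing "gitdir: <admin dir>".
    gitdir = Path(git_path.read_text().split("gitdir:", 1)[1].strip())
    commondir_file = gitdir / "commondir"
    if commondir_file.exists():
        # commondir points (often relatively) at the main repo's .git dir.
        common = (gitdir / commondir_file.read_text().strip()).resolve()
        return common.parent  # parent of <main>/.git is the main repo root
    return project_root
```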
Selector grammar¶
Selectors are parsed by a Lark grammar
(grammars/selector.lark) into ExtendedSelector dataclasses. The grammar
handles:
- dotted symbol paths with wildcard segments (`Class.*`, `Test*`)
- component accessors (`[params]`, `[returns]`, `[decorators]`, `[bases]`, `[body]`, `[imports]`)
- by-name and by-index sub-accessors (`[ctx]`, `[0]`)
- pseudo-classes (`:KEYWORD_ONLY`, `:POSITIONAL_ONLY`, `:POSITIONAL_OR_KEYWORD`)
- line-range selectors (`file.py:42-100`)
- file glob expansion (`src/**/*.py::*[params]`)
Pattern grammar¶
Patterns are parsed by a second Lark grammar (grammars/pattern.lark) into
Pattern dataclasses containing MetaVar objects.
compile_pattern_to_rust_ir() translates a Pattern into a JSON IR
consumed by the Rust structural matcher engine, which performs matching directly
on the Tree-sitter AST.
Metavariable types:
- `$X` — captures any expression node
- `$_` — anonymous, matches any expression and discards the capture
- `$...ARGS` — variadic capture (sequence of arguments)
- `$X:str` / `$X:int` / `$X:call` / `$X:attr` — syntactic type-constrained capture
- `$X:type[T]` / `$X:returns[T]` — inferred type constraint (requires TypeOracle; see below)
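A toy recognizer makes the token shapes concrete. This is not emend's Lark parser, just a regex sketch distinguishing plain, anonymous, variadic, and constrained metavariables:

```python
# Sketch: classify metavariable tokens like $X, $_, $...ARGS, $X:type[T].
import re

METAVAR = re.compile(
    r"\$(?P<variadic>\.\.\.)?(?P<name>[A-Z_][A-Z0-9_]*)"
    r"(?::(?P<constraint>\w+(?:\[\w+\])?))?"
)


def scan_metavars(pattern: str) -> list[dict]:
    return [
        {
            "name": m["name"],
            "variadic": bool(m["variadic"]),
            "constraint": m["constraint"],   # e.g. "str" or "type[T]"
            "anonymous": m["name"] == "_",
        }
        for m in METAVAR.finditer(pattern)
    ]
```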
TypeOracle layer¶
type_oracle.py provides a pluggable type inference adapter used by find
(for :type[X] / :returns[X] pattern constraints and --returns lookup
filtering) and the analyze types command.
Abstract interface: TypeOracle¶
TypeOracle is an abstract base class with four abstract methods:
- `infer_file(path, project_root) → FileTypes` — return all type bindings for a file
- `type_at(path, line, col) → TypeBinding | None` — return the binding at a position
- `clear_cache()` — evict cached results
- `is_available() → bool` — check if the backing tool is installed
Results are returned as FileTypes (a list of TypeBinding objects with
positional and name indexes built by FileTypes.build_index()). TypeBinding
records the name, source location, raw_type string from the engine, and a
parsed TypeDescriptor tree.
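The interface shape can be sketched as an ABC. The four method names come from the text; the docstrings paraphrase it, and `NullOracle` is a hypothetical trivial implementation added here to show the contract:

```python
# Sketch of the TypeOracle interface; return types are left loose since
# FileTypes/TypeBinding stand in for emend's real dataclasses.
from abc import ABC, abstractmethod


class TypeOracle(ABC):
    @abstractmethod
    def infer_file(self, path, project_root):
        """Return all type bindings (a FileTypes) for one file."""

    @abstractmethod
    def type_at(self, path, line, col):
        """Return the TypeBinding at a position, or None."""

    @abstractmethod
    def clear_cache(self):
        """Evict cached results."""

    @abstractmethod
    def is_available(self) -> bool:
        """Check whether the backing tool is installed."""


class NullOracle(TypeOracle):
    """Trivial concrete oracle for environments with no type engine."""

    def infer_file(self, path, project_root):
        return []

    def type_at(self, path, line, col):
        return None

    def clear_cache(self):
        pass

    def is_available(self) -> bool:
        return False
```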
Backends¶
Three backends are provided:
- PyreflyAdapter — shells out to `pyrefly check --debug-info` and parses the JSON binding dump. This is the most comprehensive source of type information (full binding dump, not just diagnostics) but requires pyrefly to be installed. Supports `infer_batch()` for multi-file queries in one subprocess call.
- PyrightAdapter — starts `pyright-langserver` via the LSP protocol and queries `textDocument/hover` for each identifier collected by `_collect_symbols()`. Type strings are extracted from the hover markdown and parsed into `TypeDescriptor` trees. The LSP process is started once and reused across calls.
- TyAdapter — same approach as PyrightAdapter but using `ty lsp`.
All three cache results keyed on the MD5 of the file’s content (_FileTypeCache,
bounded at 256 entries with FIFO eviction) so unchanged files are not re-analyzed.
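The cache shape described above (MD5-of-content keys, 256-entry bound, FIFO eviction) can be sketched like this; the class is a simplified stand-in for `_FileTypeCache`, not its real implementation:

```python
# Sketch of a content-keyed, size-bounded FIFO cache.
import hashlib
from collections import OrderedDict


class FileTypeCache:
    def __init__(self, max_entries: int = 256):
        self.max_entries = max_entries
        self._entries: OrderedDict[str, object] = OrderedDict()

    @staticmethod
    def key(source_text: str) -> str:
        # Content-addressed: unchanged files hit the cache regardless
        # of when or where they were analyzed.
        return hashlib.md5(source_text.encode()).hexdigest()

    def get(self, source_text: str):
        return self._entries.get(self.key(source_text))

    def put(self, source_text: str, value) -> None:
        self._entries[self.key(source_text)] = value
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # FIFO: drop the oldest
```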
Type string parsing¶
parse_type_string(raw) converts type strings from any backend into a
TypeDescriptor tree, handling named types, parameterized generics
(list[int], dict[str, int]), union types (str | None), callable
signatures ((x: int) -> str), and Self@ClassName prefixes from Pyrefly.
TypeDescriptor.matches(constraint) performs structural matching: an unknown
constraint acts as a wildcard, a named constraint matches parameterized types by
base name, and union types match if any member satisfies the constraint.
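The matching rules can be sketched with a minimal descriptor. This is a simplified stand-in for emend's `TypeDescriptor`, covering just the three rules named above:

```python
# Sketch: unknown constraint = wildcard; named constraint matches a
# parameterized type by base name; a union matches if any member matches.
from dataclasses import dataclass, field


@dataclass
class TypeDescriptor:
    base: str                              # e.g. "list", "str", "Union"
    args: list["TypeDescriptor"] = field(default_factory=list)

    def matches(self, constraint: str) -> bool:
        if constraint == "Unknown":
            return True                    # wildcard constraint
        if self.base == "Union":
            return any(m.matches(constraint) for m in self.args)
        return self.base == constraint     # list[int] matches "list"
```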
Integration with pattern matching¶
:type[X] and :returns[X] constraint tokens are matched syntactically by
m.DoNotCare() (any node passes) and then post-filtered by
_filter_matches_by_type_oracle() in transform.py. This keeps the Rust
fast-path bypass simple: any oracle constraint is post-filtered after the Rust
structural matcher returns positional match data.
Engine autodetection¶
detect_type_engine(project_root) checks for config files in order
(pyrightconfig.json → ty.toml → pyrefly.toml → pyproject.toml
sections), then falls back to tool availability on PATH
(pyrefly → ty → pyright). create_type_oracle(engine="auto") combines
detection and instantiation in one call.
Lint engine¶
lint.py loads rules from .emend/rules.yaml (with legacy fallback to
.emend/patterns.yaml). Each rule specifies a match pattern, a
message, an optional not-within constraint, and an optional fix
pattern for --fix mode.
The lint engine applies a two-tier scan:
1. Batch Rust path — rules whose `match` pattern compiles to Rust IR are batched into a single `find_multi_patterns_in_files` call. This handles the common case of simple pattern rules (function calls, attribute accesses) with no structural scope constraint.
2. Single-file Rust path — rules with complex patterns or `not-within` constraints are evaluated per-file using `find_pattern()` with the Rust structural matcher.
`# noqa` suppression is implemented by tokenizing the source for `# noqa` comments, then mapping each comment to its enclosing statement range. A suppressed statement suppresses all matches inside it.
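A simplified per-line sketch of this mechanism: emend maps each comment to its whole enclosing statement range, but the idea is visible with line granularity alone:

```python
# Sketch: collect the lines carrying "# noqa" comments via the tokenizer,
# then drop lint matches reported on those lines.
import io
import tokenize


def noqa_lines(source: str) -> set[int]:
    lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and "noqa" in tok.string.replace(" ", ""):
            lines.add(tok.start[0])
    return lines


def filter_matches(matches: list[tuple[int, str]], source: str):
    suppressed = noqa_lines(source)
    return [(line, msg) for line, msg in matches if line not in suppressed]
```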
Free-threaded Python¶
Python 3.13 introduced an experimental free-threaded build (--disable-gil)
and 3.14 continues this as a supported configuration. emend is designed to
take full advantage:
- `emend_core` is registered `gil_used = false`, so all Rust functions release the GIL immediately and can run on multiple OS threads simultaneously.
- Cross-file operations (`rename`, `refs`) use a `ThreadPoolExecutor` on free-threaded Python, with the Rust extension performing GIL-free analysis across threads.
To enable free-threaded speedups, install emend with a free-threaded Python:
uv tool install --python 3.14t emend
Build system¶
The project uses maturin as its build backend.
pyproject.toml declares `build-backend = "maturin"` and a `[tool.maturin]` section that points at `rust/Cargo.toml` and sets `module-name = "emend.emend_core"`. maturin compiles the Rust crate and packages the resulting shared library alongside the Python source into a single platform wheel. Users receive one wheel with no additional binary dependencies.
Version numbers are stored in rust/Cargo.toml and propagated to Python
metadata automatically by maturin (dynamic = ["version"] in pyproject.toml).
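A hypothetical pyproject.toml excerpt reflecting the settings named above; keys not stated in the text (such as the `requires` pin) are assumptions:

```toml
[build-system]
requires = ["maturin>=1.0"]
build-backend = "maturin"

[project]
name = "emend"
dynamic = ["version"]   # version propagated from rust/Cargo.toml

[tool.maturin]
manifest-path = "rust/Cargo.toml"
module-name = "emend.emend_core"
```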