Technical Details
=================

This document describes emend's internal design, the rationale behind key
architectural choices, and the dependencies that underpin each subsystem.

Overview
--------

emend is structured around two complementary refactoring primitives that share
a common execution model:

- **Structured edits** — component-level surgery on symbol metadata
  (parameters, return types, decorators, bases, body) addressed via a
  selector grammar.
- **Pattern transforms** — code-pattern search and replace using metavariable
  capture syntax (``$X``, ``$...ARGS``) compiled to a unified structural
  matcher.

Both systems parse source files into a concrete syntax tree (CST) using
`Tree-sitter `_ via a Rust backend. Transformations produce a sequence of
byte-range edits that are applied to the original source, ensuring that
untouched code is reproduced character-for-character while maintaining high
performance.

Two-layer architecture
----------------------

Rust layer: ``emend_core``
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``emend_core`` extension is a `PyO3 `_ / `maturin `_ Rust crate compiled
directly into the emend wheel. It handles all performance-critical AST
analysis and manipulation:

- **File discovery** — parallel directory walk via ``rayon``.
- **Unified Scope Resolver** — a multi-language-capable engine that builds
  scope trees and resolves qualified names. Scoping and binding rules are
  driven by a language configuration file (e.g., ``languages/python.toml``).
- **Structural Matcher** — a generic matcher that executes structural queries
  and captures metavariables directly on the Tree-sitter AST.
- **Mutation Engine** — ``PyFileTransform`` manages a set of non-overlapping
  byte-range replacements, providing the foundation for all code edits.
- **Symbol collection** — extracts definition trees with rich metadata
  (signatures, return types, visibility) for the ``search --output summary``
  command.
- **Reference analysis** — resolves all identifiers and attributes to
  qualified names, supporting ``refs``, ``rename``, and ``deadcode``.

Python layer: CLI and orchestration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Python layer provides the user-facing CLI and orchestrates complex
refactoring workflows:

- **CLI entry point** — command definitions and argument parsing using
  `Typer`.
- **Query parsing** — Lark grammars for the selector and pattern languages.
- **Transform orchestration** — higher-level refactoring logic (e.g.,
  ``move``, ``copy-to``) that combines primitives like symbol extraction and
  byte-range edits.
- **Type Oracle integration** — adapters for external type inference engines
  (Pyright, Pyrefly) that provide optional semantic metadata.

Language configuration
----------------------

emend's scoping and binding logic is language-agnostic and data-driven. Each
supported language is defined by a TOML configuration file that specifies:

- **Scope creators** — AST node types that create new lexical scopes
  (functions, classes, comprehensions).
- **Binding rules** — rules for how names are introduced (assignments, loop
  variables, parameters).
- **Import resolution** — strategies for resolving cross-file qualified
  names.
- **Visibility rules** — language-specific conventions for public vs. private
  API detection.

Adding a new language
---------------------

To add support for a new language, create a directory under ``languages/``
with two files:

.. code-block:: text

   languages//
   ├── config.toml   # scope resolver configuration
   └── symbols.scm   # tree-sitter query for symbol extraction

config.toml
~~~~~~~~~~~

The configuration file drives the scope resolver and qualified-name builder.
Use ``languages/python/config.toml`` as a reference.
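Putting the sections together, a hypothetical skeleton looks like this
(section and key names follow this document's descriptions; every value is
illustrative, not emend's actual schema):

.. code-block:: toml

   # Illustrative sketch only; consult languages/python/config.toml
   # for the real schema and values.

   [language]
   name = "mylang"
   tree_sitter_grammar = "tree-sitter-mylang"
   file_extensions = [".ml"]
   keywords = ["if", "else"]

   [scoping]
   scope_creators = ["function_definition", "class_definition"]

   [bindings]
   # rules mapping AST node types to the child field holding the bound name

   [imports]
   resolution = "python"

   [qualified_names]
   module_separator = "."

   [exports]
   # public-API conventions for dead-code analysis

   [builtins]
   # always-in-scope names, e.g. print, len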
The file contains the following sections:

``[language]``
   Top-level metadata: language ``name``, the ``tree_sitter_grammar`` crate
   to use for parsing, ``file_extensions`` (e.g., ``[".py"]``), and
   ``keywords`` that the pattern compiler should treat as reserved.

``[scoping]``
   Declares which AST node types create new scopes (``scope_creators``) and
   per-kind rules. Each scope kind has an ``is_closure_boundary`` flag that
   controls whether name lookups propagate outward (e.g., Python class scopes
   are closure boundaries, so inner functions cannot see class-level names
   directly).

``[bindings]``
   Rules for how names are introduced into a scope: assignments, loop
   targets, function/class parameters, ``with`` targets, exception handler
   names, and definition nodes (functions, classes). Each rule maps an AST
   node type to the child field that holds the bound name.

``[imports]``
   Describes the structure of import statements so the resolver can extract
   module paths and bound names. ``resolution`` selects one of the built-in
   resolvers (see below).

``[qualified_names]``
   Controls how qualified names are assembled: ``module_separator`` (``"."``
   for Python), ``class_member_prefix`` and ``nested_function_prefix`` flags,
   and the ``locals_marker`` string for nested scopes.

``[exports]``
   Public API conventions such as ``__all__`` membership or naming patterns
   that mark a symbol as an entry point for dead-code analysis.

``[builtins]``
   A list of names that are always in scope and do not need resolution
   (e.g., ``print``, ``len``, ``True`` in Python).

symbols.scm
~~~~~~~~~~~

A `tree-sitter query `_ file that defines which AST nodes constitute symbols
for the ``search --output summary`` and ``deadcode`` commands. The Python
query captures function definitions, class definitions, decorated
definitions, and top-level variable assignments.
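As a hypothetical excerpt in tree-sitter query syntax (the shipped Python
query is more complete, also covering decorated definitions and top-level
assignments):

.. code-block:: scheme

   ;; Illustrative sketch, not the shipped query.
   (function_definition
     name: (identifier) @name) @function

   (class_definition
     name: (identifier) @name) @class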
Each captured node should use a ``@name`` capture for the symbol's identifier
and a top-level capture (e.g., ``@function``, ``@class``, ``@variable``) for
the full node.

Import resolution strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Rust backend provides built-in import resolution strategies selected by
the ``resolution`` field in ``[imports]``:

- ``"python"`` — follows importlib semantics: ``sys.path``, ``src/`` layout
  detection, relative imports, and namespace packages.
- ``"node"`` — implements Node.js module resolution for JavaScript and
  TypeScript: ``node_modules`` lookup, ``index.js`` defaults, and
  ``package.json`` ``main``/``exports`` fields.

Adding a new resolution strategy requires changes to the Rust ``emend_core``
crate.

Testing
~~~~~~~

After adding a new language configuration, run the full test suite to verify
that the scope resolver, symbol collection, and reference analysis all behave
correctly:

.. code-block:: bash

   make test

Pay particular attention to qualified-name construction and import
resolution, as errors in those areas propagate to ``refs``, ``rename``, and
``deadcode``.

Caching and indexing
--------------------

emend maintains a cache at ``.emend/cache/parse.db`` (SQLite, WAL mode). The
cache is content-addressed — almost every key is the MD5 of the file's source
text — so switching branches or reverting edits naturally reuses earlier
entries. A ``.gitignore`` and ``.dockerignore`` are auto-generated inside
``.emend/cache/`` to prevent the database from being checked in.

When running inside a git worktree the cache is stored in the **main repo's**
``.emend/cache/`` so all worktrees share a single database (see `Git worktree
support`_ below).

Overview of cache tables
~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 20 15 65

   * - Table
     - Key
     - Contents
   * - ``qn_index``
     - content MD5 (BLOB)
     - Compressed-pickled ``set[str]`` of every qualified name in the file.
       Used by ``visit_project()`` to skip files that cannot reference a
       target symbol.
   * - ``type_cache``
     - content MD5 (TEXT)
     - Compressed-pickled ``FileTypes`` from the type oracle. Avoids
       re-running pyrefly / pyright / ty on unchanged files.
   * - ``file_manifest``
     - (worktree_id, absolute path)
     - ``(mtime_ns, size, content_hash, indexed_at)``. Bridges path-based
       queries to the content-hash caches and enables incremental re-indexing
       via stat-only scans. Scoped by ``worktree_id`` so each worktree
       maintains its own stat cache while sharing content-hashed data.
   * - ``symbol_index``
     - (content_hash, file_path, name, ...)
     - One row per symbol definition (function, class, method). Stores name,
       qualified name, kind, line range, depth, parent, signature, return
       type, and decorators. Indexed on ``name``, ``qualified_name``,
       ``file_path``, and ``kind`` for fast lookups.
   * - ``reference_index``
     - (content_hash, target_qn, file_path, line, col)
     - One row per reference to a qualified name. Each row records the
       reference kind (``read``, ``write``, ``import``, ``call``). Indexed on
       ``target_qn`` for fast find-references.
   * - ``import_graph``
     - (content_hash, imported_module)
     - One row per import statement, mapping the importing file to the dotted
       module name. Indexed on ``imported_module`` for fast "files importing
       X" queries.
   * - ``index_meta``
     - key name (TEXT)
     - Key-value pairs: ``schema_version``, ``git_head:``, ``indexed_at:``.
       Per-worktree keys are scoped by the worktree's resolved root path.

How caches are populated
~~~~~~~~~~~~~~~~~~~~~~~~

There are two population paths:

**Lazy (on first use).** ``visit_project()`` populates ``qn_index`` as a
side-effect of running the Rust ``PyScopeResolver``: after each file is
resolved, the qualified-name set is stored in the cache.

**Eager (``emend tool index``).** ``warm_caches()`` scans the project in
parallel using a ``ProcessPoolExecutor``.
Each worker subprocess (``_index_batch()``) receives a batch of
``(file_path, source_text)`` tuples and performs:

1. **QN resolution** — ``PyScopeResolver`` → compressed pickle →
   ``qn_index``.
2. **Symbol collection** — ``emend_core.collect_symbols_from_str()`` →
   ``symbol_index`` rows (name, kind, line, signature, etc.).
3. **Import extraction** — regex scan of ``import`` / ``from … import``
   statements → ``import_graph`` rows.
4. **Reference collection** — ``PyScopeResolver`` reference output →
   ``reference_index`` rows (target QN, line, column, ref_kind).

All analysis is handled by the Rust tree-sitter backend. After all workers
finish, the main process performs three additional steps:

- **File manifest** — calls ``stat()`` on every indexed file and writes
  ``(worktree_id, path, mtime_ns, size, content_hash, timestamp)`` to
  ``file_manifest``. Each worktree maintains its own set of manifest rows.
- **Git HEAD** — runs ``git rev-parse HEAD`` and stores the SHA in
  ``index_meta`` under the key ``git_head:``.
- **Type cache** — runs the configured type engine (pyrefly / pyright / ty)
  and stores the results in ``type_cache``.

Workers write directly to the SQLite database (WAL mode permits concurrent
writers across processes). Files whose content hash already appears in all
relevant tables are skipped.

How caches are invalidated
~~~~~~~~~~~~~~~~~~~~~~~~~~

Because caches are keyed on file content (MD5 hash), not file path, they are
**automatically correct**: if a file's content hasn't changed, its cached
data is still valid regardless of when it was written. There is no explicit
"invalidation" of stale entries; old entries simply become unreachable when
no file on disk has that content anymore.

For the path-indexed tables (``file_manifest``, ``symbol_index``,
``reference_index``, ``import_graph``), a three-tier freshness check
determines which files need re-indexing:

**Tier 1 — Git HEAD (~1 ms).** ``git rev-parse HEAD`` is compared against
the stored ``git_head:`` in ``index_meta``. If they match, no files have
changed since the last index in this worktree.

**Tier 2 — File stat (~10–50 ms for 5 000 files).** Each file is
``stat()``-ed and its ``(mtime_ns, size)`` compared against
``file_manifest``. Files whose mtime and size match are unchanged — no I/O
required.

**Tier 3 — Content hash (only for stat-mismatched files).** Files whose
mtime or size differ are read and hashed. If the hash matches the manifest
(e.g. ``git stash pop`` touched the mtime but didn't change the content),
the manifest's mtime is updated in place. If the hash differs, the file is
re-indexed: old rows keyed on the previous content hash are deleted from
``symbol_index``, ``reference_index``, and ``import_graph``, then fresh rows
are inserted.

This check is implemented in ``_scan_manifest()`` and exposed through
``_ensure_index_fresh()``, which commands call before querying the index. If
fewer than 50 files are stale, they are re-indexed inline; otherwise the
caller falls back to the cold path or advises running ``emend tool index``.

How caches are cleaned
~~~~~~~~~~~~~~~~~~~~~~

emend does **not** aggressively prune old entries. Content-hash keyed tables
(``qn_index``, ``type_cache``) accumulate entries across branch switches,
which is intentional: switching back to an earlier branch reuses those
entries.

Path-indexed rows are kept consistent by the re-index cycle described above:
when a file's content changes, its old rows (keyed on the previous content
hash) are deleted before new rows are inserted. Deleted files are removed
from ``file_manifest`` and their derived rows are cleaned up during
``_ensure_index_fresh()``.

To reclaim disk space or force a full rebuild:

.. code-block:: bash

   # Delete the entire cache and rebuild from scratch:
   rm -rf .emend/cache/
   emend tool index

   # Or just rebuild (existing entries are overwritten):
   emend tool index

The ``emend tool index --status`` command reports the number of indexed
files, symbols, import edges, and references, plus how many files are stale.

Warm-path query acceleration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the index is fresh, several commands bypass full-project scans:

- ``find --complete `` queries ``symbol_index`` with a ``LIKE`` prefix match
  — typically < 5 ms.
- ``analyze refs`` queries ``reference_index`` by qualified name — typically
  < 10 ms.
- ``_files_importing_module()`` checks ``import_graph`` before falling back
  to the Rust ``files_importing_module`` scan.

All warm paths fall back transparently to their original (cold)
implementations when the index is unavailable or stale.

Git worktree support
~~~~~~~~~~~~~~~~~~~~

When emend runs inside a `git worktree `_, the cache is automatically shared
with the main repository. This means:

- Running ``emend tool index`` in **any** worktree populates the shared
  ``parse.db``. Other worktrees immediately benefit from the cached parse
  trees, qualified-name indexes, type information, symbol definitions, and
  reference data — all of which are keyed by content hash.
- Each worktree maintains its own ``file_manifest`` rows (scoped by a
  ``worktree_id`` derived from the worktree's absolute path), so stat-based
  freshness checks are accurate per worktree.
- Git HEAD tracking is per-worktree (``git_head:`` keys in ``index_meta``),
  so branch switches in one worktree don't invalidate another.

The mechanism works by reading the ``.git`` file in the worktree root (which
contains a ``gitdir:`` pointer) and following the ``commondir`` reference to
locate the main repository. ``_resolve_cache_root()`` in ``transform.py``
performs this resolution and caches the result.
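The resolution just described can be sketched as a standalone function.
``resolve_main_repo`` below is a hypothetical, simplified stand-in for
``_resolve_cache_root()``: it assumes the ``gitdir:`` pointer holds an
absolute path and skips the edge cases the real implementation handles.

.. code-block:: python

   from pathlib import Path

   def resolve_main_repo(root: Path) -> Path:
       """Follow a worktree's .git pointer back to the main repo root.

       Sketch only: a linked worktree's ``.git`` is a *file* containing
       ``gitdir: <path>``; that directory holds a ``commondir`` file
       pointing (usually relatively) at the main repo's ``.git`` directory.
       """
       dot_git = root / ".git"
       if dot_git.is_dir():
           return root  # regular repository, not a linked worktree
       gitdir = Path(dot_git.read_text().split("gitdir:", 1)[1].strip())
       commondir_file = gitdir / "commondir"
       if commondir_file.exists():
           common = (gitdir / commondir_file.read_text().strip()).resolve()
           return common.parent  # main root is the parent of its .git dir
       return root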
For non-worktree repos (and non-git projects), the project root is used
directly — no behavior change.

SQLite WAL mode ensures that concurrent access from multiple worktrees (or
multiple emend processes) is safe: readers never block, and writes are
serialized with a configurable timeout.

Selector grammar
----------------

Selectors are parsed by a `Lark `_ grammar (``grammars/selector.lark``) into
``ExtendedSelector`` dataclasses. The grammar handles:

- dotted symbol paths with wildcard segments (``Class.*``, ``Test*``)
- component accessors (``[params]``, ``[returns]``, ``[decorators]``,
  ``[bases]``, ``[body]``, ``[imports]``)
- by-name and by-index sub-accessors (``[ctx]``, ``[0]``)
- pseudo-classes (``:KEYWORD_ONLY``, ``:POSITIONAL_ONLY``,
  ``:POSITIONAL_OR_KEYWORD``)
- line-range selectors (``file.py:42-100``)
- file glob expansion (``src/**/*.py::*[params]``)

Pattern grammar
---------------

Patterns are parsed by a second Lark grammar (``grammars/pattern.lark``) into
``Pattern`` dataclasses containing ``MetaVar`` objects.
``compile_pattern_to_rust_ir()`` translates a ``Pattern`` into a JSON IR
consumed by the Rust structural matcher engine, which performs matching
directly on the Tree-sitter AST.

Metavariable types:

- ``$X`` — captures any expression node
- ``$_`` — anonymous, matches any expression and discards the capture
- ``$...ARGS`` — variadic capture (sequence of arguments)
- ``$X:str`` / ``$X:int`` / ``$X:call`` / ``$X:attr`` — syntactic
  type-constrained capture
- ``$X:type[T]`` / ``$X:returns[T]`` — inferred type constraint (requires
  TypeOracle; see below)

TypeOracle layer
----------------

``type_oracle.py`` provides a pluggable type inference adapter used by
``find`` (for ``:type[X]`` / ``:returns[X]`` pattern constraints and
``--returns`` lookup filtering) and the ``analyze types`` command.
Abstract interface: ``TypeOracle``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``TypeOracle`` is an abstract base class with four abstract methods:

- ``infer_file(path, project_root) → FileTypes`` — return all type bindings
  for a file
- ``type_at(path, line, col) → TypeBinding | None`` — return the binding at
  a position
- ``clear_cache()`` — evict cached results
- ``is_available() → bool`` — check if the backing tool is installed

Results are returned as ``FileTypes`` (a list of ``TypeBinding`` objects
with positional and name indexes built by ``FileTypes.build_index()``).
``TypeBinding`` records the name, source location, ``raw_type`` string from
the engine, and a parsed ``TypeDescriptor`` tree.

Backends
~~~~~~~~

Three backends are provided:

- **PyreflyAdapter** — shells out to ``pyrefly check --debug-info`` and
  parses the JSON binding dump. This is the most comprehensive source of
  type information (full binding dump, not just diagnostics) but requires
  pyrefly to be installed. Supports ``infer_batch()`` for multi-file queries
  in one subprocess call.
- **PyrightAdapter** — starts ``pyright-langserver`` via the LSP protocol
  and queries ``textDocument/hover`` for each identifier collected by
  ``_collect_symbols()``. Type strings are extracted from the hover markdown
  and parsed into ``TypeDescriptor`` trees. The LSP process is started once
  and reused across calls.
- **TyAdapter** — same approach as PyrightAdapter but using ``ty lsp``.

All three cache results keyed on the MD5 of the file's content
(``_FileTypeCache``, bounded at 256 entries with FIFO eviction) so unchanged
files are not re-analyzed.

Type string parsing
~~~~~~~~~~~~~~~~~~~

``parse_type_string(raw)`` converts type strings from any backend into a
``TypeDescriptor`` tree, handling named types, parameterized generics
(``list[int]``, ``dict[str, int]``), union types (``str | None``), callable
signatures (``(x: int) -> str``), and ``Self@ClassName`` prefixes from
Pyrefly.
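The core of such a parser can be sketched in a few lines. This is a
hypothetical, simplified stand-in for ``parse_type_string`` that handles only
named types, bracketed generics, and unions (no callables or ``Self@``
prefixes); ``Descriptor`` here is an illustrative substitute for
``TypeDescriptor``.

.. code-block:: python

   from dataclasses import dataclass, field

   @dataclass
   class Descriptor:
       name: str                        # base name, or "|" for a union
       args: list["Descriptor"] = field(default_factory=list)

   def _split_top(s: str, sep: str) -> list[str]:
       """Split on `sep` only at bracket depth zero."""
       parts, depth, cur = [], 0, []
       for ch in s:
           if ch == "[":
               depth += 1
           elif ch == "]":
               depth -= 1
           if ch == sep and depth == 0:
               parts.append("".join(cur))
               cur = []
           else:
               cur.append(ch)
       parts.append("".join(cur))
       return parts

   def parse(raw: str) -> Descriptor:
       raw = raw.strip()
       members = [m.strip() for m in _split_top(raw, "|")]
       if len(members) > 1:             # top-level union: str | None
           return Descriptor("|", [parse(m) for m in members])
       if raw.endswith("]") and "[" in raw:
           base, inner = raw.split("[", 1)
           args = [parse(a) for a in _split_top(inner[:-1], ",")]
           return Descriptor(base.strip(), args)
       return Descriptor(raw)           # plain named type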
``TypeDescriptor.matches(constraint)`` performs structural matching: an
unknown constraint acts as a wildcard, a named constraint matches
parameterized types by base name, and union types match if any member
satisfies the constraint.

Integration with pattern matching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``:type[X]`` and ``:returns[X]`` constraint tokens are matched syntactically
by ``m.DoNotCare()`` (any node passes) and then *post-filtered* by
``_filter_matches_by_type_oracle()`` in ``transform.py``. This keeps the
Rust fast-path bypass simple: any oracle constraint is post-filtered after
the Rust structural matcher returns positional match data.

Engine autodetection
~~~~~~~~~~~~~~~~~~~~

``detect_type_engine(project_root)`` checks for config files in order
(``pyrightconfig.json`` → ``ty.toml`` → ``pyrefly.toml`` →
``pyproject.toml`` sections), then falls back to tool availability on PATH
(pyrefly → ty → pyright). ``create_type_oracle(engine="auto")`` combines
detection and instantiation in one call.

Lint engine
-----------

``lint.py`` loads rules from ``.emend/rules.yaml`` (with legacy fallback to
``.emend/patterns.yaml``). Each rule specifies a ``match`` pattern, a
``message``, an optional ``not-within`` constraint, and an optional ``fix``
pattern for ``--fix`` mode.

The lint engine applies a two-tier scan:

1. **Batch Rust path** — rules whose ``match`` pattern compiles to Rust IR
   are batched into a single ``find_multi_patterns_in_files`` call. This
   handles the common case of simple pattern rules (function calls,
   attribute accesses) with no structural scope constraint.
2. **Single-file Rust path** — rules with complex patterns or ``not-within``
   constraints are evaluated per-file using ``find_pattern()`` with the Rust
   structural matcher.

``# noqa`` suppression is implemented by tokenizing the source for
``# noqa`` comments, then mapping each comment to its enclosing statement
range. A suppressed statement suppresses all matches inside it.
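The suppression scheme can be approximated with the standard library's
``tokenize`` and ``ast`` modules. This is a standalone sketch, not emend's
implementation (which works on the Tree-sitter CST); ``noqa_statement_ranges``
is a hypothetical name.

.. code-block:: python

   import ast
   import io
   import tokenize

   def noqa_statement_ranges(source: str) -> list[tuple[int, int]]:
       """Return (start_line, end_line) of statements carrying # noqa."""
       # 1. Tokenize to find every "# noqa" comment line.
       noqa_lines = {
           tok.start[0]
           for tok in tokenize.generate_tokens(io.StringIO(source).readline)
           if tok.type == tokenize.COMMENT and "noqa" in tok.string
       }
       # 2. Map each comment to its innermost enclosing statement.
       tree = ast.parse(source)
       ranges = []
       for ln in sorted(noqa_lines):
           enclosing = [
               n for n in ast.walk(tree)
               if isinstance(n, ast.stmt) and n.lineno <= ln <= n.end_lineno
           ]
           if enclosing:
               best = min(enclosing, key=lambda n: n.end_lineno - n.lineno)
               ranges.append((best.lineno, best.end_lineno))
       return ranges

Any match whose position falls inside a returned range would then be
discarded, which mirrors the "a suppressed statement suppresses all matches
inside it" behavior described above.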
Free-threaded Python
--------------------

Python 3.13 introduced an experimental **free-threaded** build
(``--disable-gil``), and 3.14 continues it as a supported configuration.
emend is designed to take full advantage:

- ``emend_core`` is registered with ``gil_used = false``, so all Rust
  functions release the GIL immediately and can run on multiple OS threads
  simultaneously.
- Cross-file operations (``rename``, ``refs``) use a ``ThreadPoolExecutor``
  on free-threaded Python, with the Rust extension performing GIL-free
  analysis across threads.

To enable free-threaded speedups, install emend with a free-threaded Python:

.. code-block:: bash

   uv tool install --python 3.14t emend

Build system
------------

The project uses `maturin `_ as its build backend. ``pyproject.toml``
declares ``build-backend = "maturin"`` and a ``[tool.maturin]`` section that
points at ``rust/Cargo.toml`` and sets ``module-name = "emend.emend_core"``.
maturin compiles the Rust crate and packages the resulting shared library
alongside the Python source into a single platform wheel. Users receive one
wheel with no additional binary dependencies.

Version numbers are stored in ``rust/Cargo.toml`` and propagated to Python
metadata automatically by maturin (``dynamic = ["version"]`` in
``pyproject.toml``).
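The arrangement above corresponds to a ``pyproject.toml`` along these lines.
This is a sketch covering only the fields mentioned in this document; other
project metadata is omitted and the version bound on maturin is illustrative.

.. code-block:: toml

   [build-system]
   requires = ["maturin>=1.0,<2.0"]   # illustrative bound
   build-backend = "maturin"

   [project]
   name = "emend"
   dynamic = ["version"]              # version comes from rust/Cargo.toml

   [tool.maturin]
   manifest-path = "rust/Cargo.toml"
   module-name = "emend.emend_core"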