ADR 0105: Hybrid Codebase Index (core + MCP) for C# stacks with Roslyn truth
Status: Accepted · Implemented
Date: 2026-05-06
Related ADRs
| ADR | Role |
|---|---|
| 0039 | Workspace navigation — multiple views and “current file + related” |
| 0040 | LSP (C# / Markdown) — command line in settings.toml: presets, optional keys, environment override |
| 0052 | Agent contract CLI (MCP parity) and snapshot tests |
| 0053 | Intent map and control flow on PFD |
| 0056 | Semantic map adoption of Skia composition pipeline |
| 0067 | Graph-backed surfaces — shared contract for graph screen family |
| 0069 | Markdown Preview — MFD instrument, renderer-first decoupling, no inline preview in document |
| 0079 | IDS vs CDS; AXAML index — not IDS |
| 0095 | Three Health levels — Workspace, Solution, IDE (channel taxonomy) |
| 0097 | Cockpit compute units (CCU; LRU Unit analog) — layer between transport, meaning, and channel |
| 0098 | Semantics first; document and repository as projections (Semantic-First) |
| 0099 | IDE DataBus — typed events and state projections |
| 0100 | Project constitution |
| 0101 | Licensing and commercialization strategy |
| 0102 | Data Acquisition Layer — external interfaces and adapters boundary |
| 0106 | Hybrid Codebase Index — CascadeIDE integration, freshness, Semantic Map |
Summary
- Hybrid codebase index: portable core + MCP; Roslyn is truth for C#.
- SQLite FTS5 (keyword) + optional vec (semantic); fusion α/β.
- Scope: C#, Razor, AXAML, and web stacks in one workspace; CascadeIDE integration is covered by ADR 0106.
Terms and abbreviations
Working definitions within this ADR; for algorithm details, see the SQLite documentation and the documentation of the chosen embedding provider.
| Term | Meaning here |
|---|---|
| FTS (full-text search) | Full-text search: index and queries over tokens/words inside document texts (file or chunk), not only exact field match or filename search. |
| FTS5 | Fifth SQLite full-text module: FTS5 virtual tables, inverted index “term → document occurrences”, relevance-aware queries. In this ADR — primary keyword backend of layer B. |
| Inverted index | Structure “word/term → list of documents (and positions)” backing fast FTS; not to be confused with Roslyn symbol graph. |
| BM25 (Best Matching 25, Okapi BM25 family) | Class of statistical ranking functions for full-text hits: balance “term frequent in this document” vs “term rare in corpus”. In SQLite FTS5 relevance uses auxiliary rank functions (including bm25()); in this ADR “keyword / BM25” means full-text with such ranking, not a separate engine outside SQLite. |
| Keyword search | Search by word/phrase match (via FTS), without required “understanding” of the query in different phrasings. |
| Embedding | Fixed-dimension vector from a model over text (code fragment, paragraph, query). Semantically similar texts ideally get close vectors in the chosen metric. |
| Semantic / vector search | Select fragments by embedding proximity of query and chunks (cosine similarity, etc.), not keyword match. In this ADR also vec (vector channel). |
| Vector store | Storage for vectors and metadata (chunk id, path, line range), with nearest-neighbor operations (ANN / full scan at small scale). |
| sqlite-vec | SQLite extension for vector storage and query; in this ADR — optional local vector store beside FTS, not replacing the keyword layer. |
| Fusion | Merge hit lists from two channels (here FTS and vec): score normalization, weighted sum or equivalent, final top‑N. See § fusion sketch. |
| Chunk | Continuous file fragment indexed as one FTS/vec unit (line window, logical block, etc.); see § chunking. |
| MCP | Model Context Protocol — transport and tool contract for agents/IDE; separate index MCP service in § deployment. |
| DAL | Data Acquisition Layer — layer for data from workspace and external world per 0102. |
| CCU | Cockpit Compute Unit(s) — packaging compute results into stable channel DTOs per 0097. |
Context
CascadeIDE is an MCP-first IDE: the agent must orient quickly in the codebase and assemble context in a small model window (or under a limited step/call budget).
For any .NET/C# solution we already have a “source of truth” for precise semantic operations:
- Roslyn (via roslyn-mcp and IDE wiring) for: diagnostics, go-to-definition, find-usages, rename, symbol-level navigation.
But Roslyn does not fully solve:
- fast “sense overview” and a “first map” of the solution without reading dozens of files;
- full-text search and orientation over Markdown, configs, .csproj / .sln / .slnx, YAML/TOML, the web layer (Razor/Blazor .razor, HTML/CSS), Avalonia (.axaml) markup, and other artifacts without a Roslyn semantic model for those formats;
- for a plain C# project (including CascadeIDE itself) the same hybrid layer gives fast keyword and optional semantic search over the whole repository, including .cs as text (layer B: FTS only, not symbols), while rename/impact stay on Roslyn;
- persistence across sessions: the “map” should live beside the project/IDE profile and not require re-training the agent every time.
External solutions exist (e.g. SocratiCode) with hybrid search + graph + impact, but they add infrastructure load (Docker/Qdrant/Ollama) and license risk (AGPL) for product integration.
Additionally: CascadeIDE is cross-platform (Avalonia). We do not want the critical navigation layer tied to Windows-only/drivers/Docker, but on Linux we may allow heavier backend options.
Decision in one sentence
Introduce a two-layer navigation model: Roslyn is the truth for C# semantics, and beside it sits a light hybrid index over the solution contour, covering web artifacts (.razor, MD, HTML/CSS), Avalonia .axaml (with a pairing heuristic for code-behind .cs when needed), and configuration and companion files (including optional full-text on .cs as text, without replacing symbol-level operations); the index offers keyword plus optional semantic search with minimal ops cost and cross-platform support.
Goals
- Reduce agent step count: 1–2 calls → enough relevant context to decide.
- Provide a “first map” without “read 20 files”: top files/nodes/flows, entry points — for Blazor/Web, Avalonia (AXAML + bindings/control names), and plain C#, including developing CascadeIDE itself on the same tool stack.
- Preserve semantic correctness: C# refactor-impact is Roslyn-based, not heuristic.
- Work without mandatory Docker (especially on Windows), with predictable local install/update.
- Be cross-platform (Windows/Linux/macOS), with optional backend accelerators on Linux.
Non-goals (first phase)
- Full “polyglot dependency graph” across 18+ languages.
- Replacing Roslyn MCP: Roslyn remains the truth layer for C#.
- Mandatory vector DB/containers for baseline scenarios.
- “One graph that is always right”: graph/impact outside C# allows heuristics and needs verification.
Architecture (by layer)
Layer A: Roslyn truth (C#)
Use Roslyn for:
- diagnostics / code actions;
- find usages / rename;
- symbol navigation;
- (where possible) call graph / entrypoints within a C# project.
This layer is precise but “expensive” in workflow: the agent still needs to know what to search for.
Layer B: Hybrid index (artifacts around C#, web layer, Avalonia AXAML, optional .cs text)
Index for files and fragments outside Roslyn symbolism or as text (not as a type graph):
- .razor, .razor.cs (including partial / file pairing);
- .md / .mdx;
- .html, .css, .scss (including @import, classes/selectors);
- basic configs (appsettings*.json, .editorconfig, *.props, *.targets, *.csproj, *.slnx, pipeline YAML, *.yml, *.toml, etc.);
- .axaml (and typical code-behind *.axaml.cs if present): markup and attributes as text for FTS and light heuristics (x:Name, {Binding …}, Classes=, avares: paths); not a substitute for an Avalonia XAML parser, not CDS/IDS semantics (see 0079, CDS vs IDS);
- *.cs (index option): full-text/keyword only (identifiers and strings match as text in the file); rename/find-usages/impact remain Roslyn-only. Tool responses must mark .cs hits as text-ranked so they are not mixed with symbol truth.
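The light AXAML heuristics named in the list above can be pictured as plain regular-expression scans over markup text. The following is a minimal sketch under stated assumptions: the function name axaml_hints and the three patterns are hypothetical illustrations, not the actual augment code, and the scan deliberately does not parse XAML, so it only handles the simplest attribute and binding forms.

```python
import re

# Hypothetical patterns: plain text scans, not a XAML parser.
X_NAME = re.compile(r'x:Name="([^"]+)"')
BINDING = re.compile(r'\{Binding\s+([^},\s]+)')
CLASSES = re.compile(r'Classes="([^"]+)"')


def axaml_hints(markup: str) -> dict[str, list[str]]:
    """Collect control names, first binding arguments, and style classes as text hints."""
    return {
        "names": X_NAME.findall(markup),
        "bindings": BINDING.findall(markup),
        # Classes="a b" holds several space-separated class names.
        "classes": [c for group in CLASSES.findall(markup) for c in group.split()],
    }


sample = (
    '<Button x:Name="SaveButton" Classes="primary large" '
    'Content="{Binding SaveLabel}" Command="{Binding SaveCommand, Mode=OneWay}"/>'
)
hints = axaml_hints(sample)
print(hints)
```

Such hints are only extra searchable text for layer B; anything that needs real binding or type resolution still goes through Roslyn or a proper XAML toolchain.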
The index provides:
- keyword / BM25: config strings, CSS, Razor routes, .cs / .axaml / doc fragments;
- optional semantic: “by meaning” search (embeddings), without mandatory Docker.
Index data:
- stored locally (IDE profile or beside the project);
- updated incrementally (watcher + hash);
- explicit format versioning (so migration does not break UX).
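The "watcher + hash" increment can be sketched as a comparison between stored content digests and the current file contents. This is an illustration only: plan_reindex and content_hash are hypothetical names, SHA-256 is an assumed choice of hash, and a real watcher would feed in only the paths it saw change.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable SHA-256 hex digest of the file content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(known: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored digests (path -> digest) with current contents (path -> text)."""
    added, changed = [], []
    for path, text in current.items():
        if path not in known:
            added.append(path)
        elif known[path] != content_hash(text):
            changed.append(path)
    # Paths that were indexed before but no longer exist must be dropped.
    removed = [path for path in known if path not in current]
    return {"added": sorted(added), "changed": sorted(changed), "removed": sorted(removed)}


known = {"a.md": content_hash("x"), "b.md": content_hash("y"), "c.md": content_hash("z")}
plan = plan_reindex(known, {"a.md": "x", "b.md": "y2", "d.md": "new"})
print(plan)
```

Only the added and changed paths need re-chunking and re-insertion; unchanged files keep their existing index rows.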
Storage / backend (baseline)
Recommended default (no Docker, cross-platform):
- Keyword/BM25: SQLite FTS5 (on-disk local DB) as fast full-text index.
- Semantic vectors (optional): SQLite + sqlite-vec as local vector store (enabled only when semantics are on).
The engine here is classic SQLite (e.g. Microsoft.Data.Sqlite or another provider to the same SQLite library), not WitDatabase (*.witdb): Wit stays for CascadeIDE application data; the index file is a separate on-disk SQLite.
Important: hybrid = FTS (keyword) + vec (semantic) as two independent sub-indexes merged at the service layer (ranking/fusion), not “one DB magic”.
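To make the keyword half concrete, here is a minimal sketch of an FTS5 table queried with bm25() ranking. It assumes the SQLite build bundled with Python has FTS5 compiled in; the table name chunks, its column layout, and the sample rows are hypothetical and not the actual index schema.

```python
import sqlite3


def build_demo_index() -> sqlite3.Connection:
    """Create an in-memory FTS5 table holding a few sample chunks."""
    conn = sqlite3.connect(":memory:")
    # UNINDEXED columns are stored with the row but not tokenized for search.
    conn.execute(
        "CREATE VIRTUAL TABLE chunks USING fts5("
        "path UNINDEXED, start_line UNINDEXED, end_line UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO chunks(path, start_line, end_line, body) VALUES (?, ?, ?, ?)",
        [
            ("appsettings.json", 1, 12, '{"ConnectionStrings": {"Default": "Data Source=app.db"}}'),
            ("docs/setup.md", 1, 40, "Configure ConnectionStrings in appsettings.json before first run."),
            ("Pages/Index.razor", 1, 30, '@page "/"\n<h1>Welcome</h1>'),
        ],
    )
    return conn


def keyword_search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list[tuple]:
    """Return (path, start_line, end_line, score) rows, best match first."""
    # FTS5's bm25() gives numerically lower values to better matches,
    # so ascending order puts the best hit first.
    sql = (
        "SELECT path, start_line, end_line, bm25(chunks) AS score "
        "FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


conn = build_demo_index()
hits = keyword_search(conn, "connectionstrings")
for row in hits:
    print(row)
```

The vec channel would live in a separate table and be queried independently; merging the two result lists happens in the service layer, as the paragraph above states. Raw user queries containing FTS5 syntax characters would need escaping before being passed to MATCH.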
Layer C: Composition (agent workflow, portable)
Default agent scenario (outside a specific IDE):
- Hybrid search (fast, cheap) → top-N fragments and map.
- Roslyn navigation for precise C# check/refactor.
- Point reads of files/fragments only after search.
Embedding this scenario in CascadeIDE (buttons, channels, debounced reindex, CCU/DataBus, Semantic Map) — ADR 0106.
Deployment: library + separate MCP
Package the index as a shared library (core: indexing, SQLite, request/response formats) and a separate MCP server (thin stdio layer + tool registration) so that:
- search can be used outside CascadeIDE (other MCP IDE/agents, CLI, automation);
- the heavy process (watcher, SQLite files, optional embeddings) is isolated: restarts and updates do not mix with Avalonia/UI.
CascadeIDE may use the same core in-proc or launch the same MCP binary as a child process — tool ids and contracts stay shared for both (cockpit placement details — 0106).
Configuration and UX invariants
- Off-by-default for infrastructure: if semantic embeddings need an external provider, that must be opt-in.
- Cross-platform: same tool ids/contracts in MCP; difference only in backend provider.
- Small-window operation: tool responses should be “compact by default” (top-N, with path/range/score), with a separate command to expand.
Implementation watchouts
Operational points that, if neglected, quickly disappoint in both dogfooding and production:
- Volume and noise. FTS over all *.cs inflates the index and can pollute top-N with raw string hits. Explicit defaults and filters are needed in settings.toml (or equivalent): ignores / gitignore alignment, path masks, ranking (e.g. prioritize docs/configs over “raw” .cs, or the opposite, a “code first” mode), and the ability to temporarily exclude *.cs from FTS without disabling the rest of the index.
- Freshness on saves from CascadeIDE. Cheap incremental updates and lag-free UX are covered by ADR 0106. MCP/core may use a watcher and incremental reindex; the product tie-in with the editor session lives in the IDE.
- MCP contract from the first prototype. The search response structure needs a stable hit-type field (e.g. hit_kind: text_fts / text_vector / symbol_followup_roslyn, or equivalent) so agent and human do not guess from free text. Changing field semantics later costs more than baking it into v0.
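A stable hit-type field could look like the following sketch. The three hit_kind values come from the text above; the class names, the remaining fields, and the to_compact_json helper are hypothetical, and the real contract may differ.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class HitKind(str, Enum):
    """Closed set of hit types, so clients never parse free text."""
    TEXT_FTS = "text_fts"
    TEXT_VECTOR = "text_vector"
    SYMBOL_FOLLOWUP_ROSLYN = "symbol_followup_roslyn"


@dataclass(frozen=True)
class SearchHit:
    path: str
    start_line: int
    end_line: int
    score: float
    hit_kind: HitKind


def to_compact_json(hits: list[SearchHit], top_n: int = 5) -> str:
    """Serialize the best top_n hits as a compact JSON array."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_n]
    # Emit the enum as its plain string value to keep the wire format explicit.
    return json.dumps([{**asdict(h), "hit_kind": h.hit_kind.value} for h in ranked])


hits = [
    SearchHit("src/A.cs", 1, 10, 0.4, HitKind.TEXT_FTS),
    SearchHit("docs/b.md", 5, 9, 0.9, HitKind.TEXT_VECTOR),
]
print(to_compact_json(hits, top_n=1))
```

Keeping hit_kind as a closed enumeration means a .cs text hit can never be mistaken for a Roslyn symbol result, which is the concern raised for text-ranked .cs hits in layer B.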
Alternatives and why not (for now)
A) “Roslyn + grep only”
Pros: minimal infrastructure, high C# accuracy.
Cons: too many steps and file reads for agent scenarios; poor coverage of docs/config/web and global “where mentioned” across the repo without a heavy Roslyn-only sweep.
B) Embed SocratiCode wholesale
Pros: ready hybrid+graph+impact layer, fast “orientation” on a large repo.
Cons:
- ops: Docker/Qdrant/Ollama in baseline;
- graph correctness outside C# depends on heuristics;
- AGPL license — undesirable for product embedding (see 0101).
C) LSP for everything (full polyglot)
Pros: potential semantic accuracy per language.
Cons: too large operational and integration cost; does not solve “small window/few calls” without a separate index/ranking layer.
Consequences
Positive
- The agent gets a fast “first pass” over the solution and can dogfood the same index while developing CascadeIDE and other C# repos, not limited to “Blazor only”.
- Roslyn remains “truth” for dangerous operations (rename/impact/diagnostics).
- Docker becomes optional: Windows-friendly baseline; Linux may get extended modes.
Negative / risks
- A new data layer (index) → versions, migrations, observability needed.
- Risk of false links in .razor/CSS/HTML heuristics → need “confidence” and explicit “hint” labeling.
- Indexing .cs / .axaml as text may look like “semantic find” → see § implementation watchouts (hit_kind, ranking).
- Tools must stay compact or the hybrid index may spam context and hurt UX.
Rollout plan (portable core + MCP): status
| Step | Content | ADR 0105 scope (implemented) |
|---|---|---|
| 1 | MCP contracts (search, status, reindex, explain-result, version/hit_kind); core in library | ✅ hybrid-codebase-index repo |
| 2 | Keyword index, increment, ignores; optional FTS on *.cs; watcher tool | ✅ |
| 3 | Razor / AXAML: .razor↔.razor.cs, .axaml↔.axaml.cs pairs; heuristic headers __hci_* (directives, resources, bindings, tags) | ✅ (HybridCodebaseIndex.Core augment) |
| 4 | Embeddings opt-in (settings.toml), sqlite-vec optional | ✅ |
| 5 | IDE workflow + freshness on save | → ADR 0106 |
| 6 | Scope defaults, FTS chunking, FTS+vec fusion | ✅ settings.toml + hybrid search |
Sketch: index scope, chunks, fusion (FTS + vec)
Addendum to the rollout plan: reasonable defaults for the spike, without changing the top-level architecture (layer B, storage).
Index scope
- Primary anchor: active .sln / main .csproj of the CascadeIDE profile, the same workspace contour as the Roslyn session.
- Default extension: paths under workspace root, minus aligned .gitignore (and if needed .cursorignore against agent noise) and a hard denylist: bin/, obj/, node_modules/, .git/, typical tool cache dirs.
- Monorepo: one index DB per (workspace_root, solution_path) pair; another solution in the same tree gets a separate index contour (switch by profile). The extra_include_roots field in settings.toml for sibling dirs (docs, external KB, etc.) is opt-in.
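The hard denylist part of the scope rules can be sketched as a check on directory components of a workspace-relative path. This is an illustration under assumptions: in_scope and DENYLIST_DIRS are hypothetical names, matching is case-sensitive, and .gitignore / .cursorignore alignment is not implemented here.

```python
from pathlib import PurePosixPath

# Hard denylist from the scope rules above; tool cache dirs would be added here.
DENYLIST_DIRS = {"bin", "obj", "node_modules", ".git"}


def in_scope(rel_path: str, extra_ignored: frozenset[str] = frozenset()) -> bool:
    """Return True when no directory component of rel_path is denylisted."""
    parts = PurePosixPath(rel_path).parts
    # Only directory components are checked, so a file named docs/bin.md stays in scope.
    return not any(part in DENYLIST_DIRS or part in extra_ignored for part in parts[:-1])


candidates = [
    "src/App/Program.cs",
    "src/App/bin/Debug/App.dll",
    "web/node_modules/pkg/index.js",
    "docs/bin.md",
]
print([p for p in candidates if in_scope(p)])
```

Checking whole path components rather than substrings avoids excluding files whose names merely contain a denylisted word.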
Chunking for FTS
| Type | Strategy |
|---|---|
| Compact configs, small .md, .razor within size limit | One FTS document per file; upper document size limit (e.g. 256–512 KiB), configurable. |
| Long .md, .cs, .axaml | Sliding line windows (guide: 80–120 lines, overlap 10–15); stable chunk_id: path + range (start_line / offset). |
| .razor | Prefer logical boundaries (@code, large markup blocks); if not cheap, the same line windows. |
Freshness: on edit rebuild only affected chunks; for small files whole-document rebuild is allowed. Tool response always gives path and line range (or offset) so agent and human open the point without guessing.
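The sliding line windows from the table can be sketched as follows. The function name line_windows and the exact chunk_id format (path plus 1-based line range) are assumptions for illustration; the defaults of 100 lines with an overlap of 10 sit inside the guide values given above.

```python
def line_windows(path: str, text: str, window: int = 100, overlap: int = 10) -> list[dict]:
    """Split text into overlapping line windows with 1-based inclusive ranges."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    lines = text.splitlines()
    step = window - overlap
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + window, len(lines))
        chunks.append({
            # Hypothetical id format: stable as long as path and range are unchanged.
            "chunk_id": f"{path}:{start + 1}-{end}",
            "start_line": start + 1,
            "end_line": end,
            "body": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break  # the last window reached the end of the file
        start += step
    return chunks


text = "\n".join(f"line {i}" for i in range(1, 251))
chunks = line_windows("src/A.cs", text, window=100, overlap=10)
print([c["chunk_id"] for c in chunks])
```

Because each chunk carries its path and line range, an edit only requires rebuilding the windows whose ranges intersect the changed lines, and every tool response can point at an exact location.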
Fusion of keyword (FTS) and semantic (vec), v0
- Independently get top‑K from FTS and vec (internal K, guide 20–40; outward after merge — compact top‑N).
- Normalize scores within each channel (min-max or rank-based, e.g. 1/(rank+R)).
- Merge unique chunks: final score S = α·S_fts + β·S_vec; if a chunk is missing in a channel, that channel’s contribution is 0.
- Default with vec on: α ≈ 0.65, β ≈ 0.35; with vec off, FTS only.
- Short query (1–2 tokens) or low max S_vec: boost FTS or do not mix vec (keyword-dominant mode).
In the DTO, keep both contributions (fts_score, vec_score when present) together with hit_kind and the final rank, so explainability (“why in top”) is preserved. Thresholds and weights live in settings.toml, so later iterations can tune them without breaking the response format.
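The v0 fusion steps can be sketched as below, using the rank-based normalization 1/(rank+R) and the default weights from the list above. The function names, the value R = 60, and the rule of skipping vec for queries of one or two whitespace-separated tokens are assumptions for illustration; hit_kind and the low-S_vec threshold are left out for brevity.

```python
def rank_normalize(ids: list[str], r: float = 60.0) -> dict[str, float]:
    """Map each chunk id to 1/(rank + r), with rank 1 for the best hit."""
    return {cid: 1.0 / (rank + r) for rank, cid in enumerate(ids, start=1)}


def fuse(fts_ids: list[str], vec_ids: list[str], query: str,
         alpha: float = 0.65, beta: float = 0.35, top_n: int = 5) -> list[dict]:
    """Merge two ranked id lists into one list scored by alpha*S_fts + beta*S_vec."""
    # Keyword-dominant mode: very short queries do not mix in the vec channel.
    use_vec = bool(vec_ids) and len(query.split()) > 2
    if not use_vec:
        alpha, beta = 1.0, 0.0
    s_fts = rank_normalize(fts_ids)
    s_vec = rank_normalize(vec_ids) if use_vec else {}
    fused = []
    # dict.fromkeys keeps first-seen order while removing duplicate chunk ids.
    for cid in dict.fromkeys(list(fts_ids) + list(s_vec)):
        f = s_fts.get(cid, 0.0)  # a missing channel contributes 0
        v = s_vec.get(cid, 0.0)
        fused.append({"chunk_id": cid, "fts_score": f, "vec_score": v,
                      "score": alpha * f + beta * v})
    fused.sort(key=lambda h: h["score"], reverse=True)
    return fused[:top_n]


result = fuse(["a", "b", "c"], ["c", "d"], "how to configure database connection")
print([h["chunk_id"] for h in result])
```

A chunk found by both channels (here "c") rises above chunks found by only one, and each entry keeps fts_score and vec_score so a later explain-result call can show why it ranked where it did.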