From Vectors to a Graph — Digital Twin Migration Reference

A working reference for the Digital Twin's migration from ChromaDB vector retrieval to a Neo4j-backed GraphRAG pipeline — why it was chosen, what it costs, and how to reason about both systems side by side.

Created 2026-05-16 · Maintained by Barbara Hidalgo-Sotelo · Feature branch: graphrag-migration
Canonical entities: 167 (57 skills · 42 methods · 46 techs · 22 concepts)
Sections: 121 (across 13 KB doc types)
Canonicalization compression: 1.92× (279 raw → 145 canonical skills, methods, and techs; the 22 curated concepts bring the canonical total to 167)
Zero-mention sections: 16 (13% of total)
2 quality flags pending review. Acronym Title-Casing in canonical names and cross-type pool overlap — both addressed in section 12 (Adjust later) with root cause and proposed fix.

01 · In one paragraph
The migration

The Digital Twin's retrieval layer is being rebuilt. The old system splits knowledge base documents into ~900-character chunks, embeds each one, and searches them by vector similarity. The new system stores full sections as the retrieval unit, promotes Projects, Skills, Methods, Technologies, and Concepts to first-class nodes with explicit edges between them, and ranks results with a hybrid score that combines vector similarity with graph signals. The payoff is queries that couldn't be answered before — "which projects use knowledge graphs?" — and retrieval that returns complete thoughts instead of fragments. The cost is one additional database to operate, one new query language to write, and a manual entity-curation step until the canonicalization pipeline is fully self-preserving.

02 · Why migrate
Three problems with the current system

Each problem in the table below is a real failure mode observed in production, not a hypothetical. The Neo4j move is what the new architecture does to address it.

| Problem | ChromaDB behavior | Neo4j move |
| --- | --- | --- |
| Wrong granularity | Returns 900-char chunks that often cut mid-paragraph or mid-thought. | Stores complete sections (2–3K chars) as the retrieval unit, with chunks kept as optional children for fallback. |
| Missing connections | Cannot answer "which projects use X?" because relationships between projects and skills are implicit in text only. | Promotes Skill, Method, Technology, Publication to nodes with typed edges from Project; Cypher traversal answers relationship queries directly. |
| Poor ranking | Pure L2 distance on embeddings; cannot tell whether a section describes a topic or merely mentions it. | Hybrid score combines vector similarity with graph centrality, project-described boost, and entity-richness signals. |

03 · Three moves
The architecture, mapped to the problems

Every other choice in the migration plan exists to support one of these three moves. When you get lost, come back here.

Move 1 — Make the section a first-class retrieval unit

Each H2 block in a KB document becomes a Section node carrying its complete text and its own embedding. Section boundaries match the existing ChromaDB ingestion logic exactly, which keeps the before/after evaluation honest.
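As a rough illustration of Move 1, the sketch below splits a markdown document at H2 headings and keeps nested H3 content inside its parent section. It is a hypothetical stand-in, not the project's utils.parse_markdown_sections: the function names, the returned dictionary keys, and the decision to drop any text before the first H2 are all assumptions, and it does not special-case `##` lines inside fenced code.

```python
import re
from typing import Dict, List

H2_PATTERN = re.compile(r"^##\s+(.*\S)\s*$")  # exactly two '#', so '###' never matches


def _make_section(name: str, lines: List[str]) -> Dict[str, object]:
    text = "\n".join(lines).strip()
    return {"name": name, "full_text": text, "char_count": len(text)}


def split_h2_sections(markdown_text: str) -> List[Dict[str, object]]:
    """Hypothetical H2 splitter: one dict per '## ' block, H3+ content kept nested."""
    sections: List[Dict[str, object]] = []
    current_name = None
    current_lines: List[str] = []
    for line in markdown_text.splitlines():
        match = H2_PATTERN.match(line)
        if match:
            if current_name is not None:
                sections.append(_make_section(current_name, current_lines))
            current_name, current_lines = match.group(1), []
        elif current_name is not None:
            current_lines.append(line)  # body text and deeper headings stay in this section
    if current_name is not None:
        sections.append(_make_section(current_name, current_lines))
    return sections


sample = "# Bio\nintro\n## Education\nPhD\n### Details\nmore\n## Work\njob"
for section in split_h2_sections(sample):
    print(section["name"], section["char_count"])
```

Each returned dictionary corresponds to what would become one Section node, with full_text and char_count ready to be stored alongside the embedding.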

Move 2 — Promote entities to typed nodes with typed edges

Skills, Methods, Technologies, Publications, and Concepts become nodes. Projects connect to them through named relationships: DEMONSTRATES, USES_METHOD, USES_TECHNOLOGY, DOCUMENTED_IN. Sections connect to entities through MENTIONS edges that carry a count and a contextual snippet.

Move 3 — Replace L2 distance with a multi-signal hybrid score

Vector similarity stays in the mix at 60% weight, but the final ranking also rewards sections that explicitly describe a Project (+25%), sections rich in entity mentions (+10%), and longer/more substantial sections (+5%). The full formula lives in section 6.

04 · The schema
What lives in the graph

Two tiers. The document hierarchy (Document → Section → Chunk) is where text lives. The entity network (Project at center, connecting Skill / Method / Technology / Publication) is what the text is about. The two tiers are bridged by MENTIONS edges (entities cited within sections) and DESCRIBED_IN edges (projects pointing back to the sections that describe them).

[Figure: Neo4j schema for Digital Twin GraphRAG. Top row: Document (source file) feeds HAS_SECTION into Section (full text + embedding), which feeds HAS_CHUNK into Chunk (fallback child). Middle: Project node (entity hub) with a DESCRIBED_IN edge up to Section. Bottom row: Skill (capability), Method (approach), Technology (tool), and Publication (writeup), connected by fan-out edges from Project.]
Two tiers bridged by DESCRIBED_IN (drawn) and MENTIONS (cross-cutting, not drawn for legibility)

Node types — quick reference

Note on properties: role (Skill) and stage (Method) are edge properties, not node properties. They live on the DEMONSTRATES and USES_METHOD relationships respectively, so the same Skill can play a "core" role on one Project and a "supporting" role on another.

| Node | Properties | Source · example |
| --- | --- | --- |
| Document | file_path, title, source_type, sensitivity, content_hash, last_updated | One per KB file (e.g. kb_biosketch.md) |
| Section | id, name, full_text, embedding (1536-dim), sensitivity, order, char_count | One per H2 block within a Document (e.g. kb_biosketch:Education) |
| Chunk | id, text, embedding, chunk_index, char_count | Children of Section, optional fallback |
| Project | id, title, summary, design_insight, walkthrough_context, tags, sensitivity, urls, diagram_filename | From featured_projects.py (e.g. resume-graph-explorer, digital-twin) |
| Skill | name, category, alt_labels; plus role on the DEMONSTRATES edge: core / secondary / supporting | LLM extraction from project walkthroughs (project_entities only) |
| Method | name, category, alt_labels; plus stage on the USES_METHOD edge: ingestion / retrieval / evaluation / generation | LLM extraction from project walkthroughs (project_entities only) |
| Technology | name, category, alt_labels | LLM extraction from project walkthroughs (project_entities only) |
| Concept | name, description, source, alt_labels; source tracks which KB section the concept was first extracted from | Curated from KB section mentions (manual curation) |
| Publication | title, type, year, url | From kb_publications.md |

Relationship types — quick reference

| Edge | Direction | Why it exists |
| --- | --- | --- |
| HAS_SECTION | Document → Section | Structural containment |
| HAS_CHUNK | Section → Chunk | Optional fine-grained fallback |
| NEXT_SECTION | Section → Section | Sequential order within a Document |
| DESCRIBED_IN | Project → Section | Which sections actually describe this project (feeds the +0.25 ranking boost) |
| DEMONSTRATES | Project → Skill | Carries role property (core / secondary / supporting) |
| USES_METHOD | Project → Method | Carries stage property (ingestion / retrieval / evaluation / generation) |
| USES_TECHNOLOGY | Project → Technology | Plain typed edge |
| DOCUMENTED_IN | Project → Publication | Linked write-ups |
| MENTIONS | Section → any entity | Carries count and context; feeds the +0.10 entity-richness signal |
| RELATED_TO | Project → Project | Derived from shared skills/methods (deferred; not in Phase 3) |

By the numbers — what's actually in the graph

Real counts from canonical_entities.json as of the most recent canonicalization run. These numbers are the actual order of magnitude you're operating at — useful both for tuning decisions and for talking about the system honestly.

10 Projects · 121 Sections · 167 Canonical entities · 13 KB doc types

Canonicalization in action

The three-phase canonicalization pipeline (Decision 2) compresses 279 raw extracted entity names down to 145 canonical skill, method, and technology nodes, a 1.92× reduction (279 / 145). The 22 manually curated concepts are not part of the raw 279; adding them brings the canonical total to 167. The compression happens because the same idea appears across multiple project walkthroughs under slightly different names, and the LLM batch phase collapses those into a single canonical with alt_labels.

Raw extracted: 279 (99 skills + 95 methods + 85 techs)
After canonical: 167 (57 skills + 42 methods + 46 techs + 22 concepts)

A few real merges the pipeline produced:

"Knowledge Graphs", "Knowledge Base Design" → Knowledge Graph
"NLP", "NLP / LLM Integration", "NLP Pipeline Design" → Nlp
"Data Serialization (Turtle, JSON-LD)", "RDF/Linked Data" → Rdf
"Document Chunking Strategies", "Document Management Systems", "PDF Extraction" → Document Processing

Skill roles · 57 canonical skills

The role property — carried on the DEMONSTRATES edge — distributes as:

core: 15 (e.g. Knowledge Graph, RAG, Evaluation)
secondary: 25 (e.g. NLP, Visualization, Prompt Engineering)
supporting: 17 (e.g. FastAPI, Google Cloud, Data Validation)

Method stages · 42 canonical methods

The stage property — carried on the USES_METHOD edge — distributes as:

ingestion: 21 (e.g. LLM Extraction, Source priority ordering)
generation: 10 (e.g. Chatbot, Narrative priority encoding)
evaluation: 8 (e.g. Failure mode analysis, Multi-provider benchmarking)
retrieval: 3 (Embeddings, RAG, SELECT-only SQL Querying)

The ingestion-heavy distribution is honest about where the methodological work goes in these projects — most of the design effort is in getting data into a usable shape before any retrieval or generation happens.

MENTIONS edges per Section · across 121 sections with mention data

How many entities (Projects, Skills, Concepts combined) each Section mentions:

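| Mentions per section | Sections | Note |
| --- | --- | --- |
| 0 | 16 | no graph signal at all |
| 1–2 | 5 | |
| 3–5 | 10 | at or below the cap |
| 6–10 | 40 | above the cap |
| 11–19 | 37 | above the cap |
| 20+ | 13 | far above the cap |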

Median is 9 mentions per section, mean is 9.8. About 75% of sections sit at 6+ mentions — well above the +0.10 bonus cap of 5. The signal is effectively binary in production right now: either a section has <5 mentions and gets a partial bonus, or it has 5+ and gets the full bonus. The cap likely needs to move to 10–15 to actually differentiate sections. Flagged in section 12.
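To make the saturation concrete, the sketch below recomputes the saturated share from the bucket counts above and compares the entity bonus under the current cap of 5 with a hypothetical cap of 12. The cap of 12 is only an illustrative value inside the 10–15 range suggested above, and the function name is an assumption.

```python
def entity_bonus(mentions: int, cap: int, weight: float = 0.10) -> float:
    """Entity-richness bonus: weight * min(mentions, cap) / cap."""
    return weight * min(mentions, cap) / cap


# Bucket counts copied from the distribution above.
buckets = {"0": 16, "1-2": 5, "3-5": 10, "6-10": 40, "11-19": 37, "20+": 13}
total = sum(buckets.values())
saturated = buckets["6-10"] + buckets["11-19"] + buckets["20+"]
share_saturated = saturated / total
print(f"{saturated} of {total} sections ({share_saturated:.1%}) sit above the cap of 5")

# Under cap=5 every count from 6 upward earns the same bonus;
# under the hypothetical cap=12 the bonus keeps growing until 12 mentions.
for mentions in (3, 6, 9, 15, 25):
    print(mentions, round(entity_bonus(mentions, 5), 3), round(entity_bonus(mentions, 12), 3))
```

With the cap at 5, the sections at 6, 9, 15, and 25 mentions are indistinguishable on this signal; a higher cap spreads them out at the cost of shrinking the bonus for sparse sections.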

Per-project entity density · raw counts from project_entities

Before canonicalization, this is the entity footprint each project contributed. Higher numbers don't mean better projects — they mean more methodological surface area to articulate.

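| Project | Skills | Methods | Techs | Total |
| --- | --- | --- | --- | --- |
| resume-graph-explorer | 13 | 10 | 15 | 38 |
| beehive-monitor | 12 | 12 | 11 | 35 |
| poolula-platform | 11 | 12 | 9 | 32 |
| chronoscope | 10 | 10 | 11 | 31 |
| academic-citation-platform | 11 | 12 | 7 | 30 |
| concept-cartographer | 10 | 8 | 8 | 26 |
| digital-twin | 9 | 9 | 6 | 24 |
| weaving-memories | 9 | 7 | 8 | 24 |
| fitness-tracker | 8 | 9 | 5 | 22 |
| convoscope | 6 | 6 | 5 | 17 |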

Sensitivity tier · across 121 sections with mention data

public: 109 (90% of sections)
inner_circle: 10 (passphrase-gated)
personal: 2 (private tier)

Section-level numbers: post-Phase 3 placeholder (awaiting load)

This block will fill in once Phase 3 finishes loading sections into Neo4j. Expected to cover:

  • Average section length (chars) by source_type — to validate the "2–3K chars per section" claim in the granularity argument
  • Embedding distribution sanity checks — confirm all section embeddings are valid 1536-dim vectors with reasonable norms
  • Sections per Document — mean, min, max — to flag any documents that are dramatically over- or under-sectioned
  • HAS_CHUNK edge density — how often chunks are actually retained vs. the section-only path being used; informs whether chunks can be retired (see Adjust later)
  • Vector index coverage — percentage of sections with valid embeddings indexed under section_embeddings
  • Sections with zero MENTIONS edges by document — currently 16 sections show zero mentions in _source_section_mentions; worth knowing which documents those cluster in

Run the simple inspection queries from section 8 against the loaded graph to populate these.

· · ·

05 · Same query, two paths
How a question travels through each system

The clearest way to internalize the difference is to follow a single query through both pipelines. Here are two contrasting cases — one where both systems work but the experience differs, and one where only the graph approach can answer the question at all.

Case A — "Tell me about the Resume Graph Explorer"

A factual question with a clear target. Both systems can answer it; the difference is in what comes back.

ChromaDB · vector only

Embed the query. Find the 10 nearest chunks by L2 distance. Filter by sensitivity tier. Return them in similarity order.

What comes back: 5–10 chunks of ~900 characters each, drawn from wherever the project is mentioned in the KB. Some are mid-paragraph cuts. The LLM has to stitch them.

Neo4j · hybrid graph

Embed the query. Vector-search sections. Boost sections where DESCRIBED_IN ← Project {id: "resume-graph-explorer"} applies. Return top section in full.

What comes back: The complete section that describes the project — 2–3K characters of coherent, contextualized prose, with related-project metadata attached.

Cypher · the graph-side query
CALL db.index.vector.queryNodes('section_embeddings', 10, $query_embedding)
YIELD node AS section, score AS vector_score
WHERE section.sensitivity IN $allowed_tiers

// Reward sections that DESCRIBE the project, not just mention it
OPTIONAL MATCH (section)<-[:DESCRIBED_IN]-(p:Project)
OPTIONAL MATCH (section)-[:MENTIONS]->(entity)

WITH section, vector_score,
     count(DISTINCT p) AS projects_described,
     count(DISTINCT entity) AS entities_mentioned

WITH section,
     (vector_score * 0.6 +
      CASE WHEN projects_described > 0 THEN 0.25 ELSE 0 END +
      toFloat(CASE WHEN entities_mentioned > 5 THEN 5 ELSE entities_mentioned END) / 5 * 0.10 +
      (CASE WHEN section.char_count > 2000 THEN 0.05 ELSE 0 END)) AS final_score
ORDER BY final_score DESC LIMIT 5
RETURN section.full_text, section.name, final_score

Case B — "Which projects use knowledge graphs?"

This is the kind of query that vector search structurally cannot answer well. The ChromaDB approach returns whatever section happens to mention knowledge graphs most prominently — useful, but not a list. The graph approach traverses the relationship and returns the actual answer.

ChromaDB · best it can do
# Embed "Which projects use knowledge graphs?"
# Return 10 chunks ordered by similarity to that embedding
# LLM reads chunks and tries to extract a list of projects
# Failure mode: misses projects that don't say "knowledge graph" verbatim
results = collection.query(query_embeddings=[emb], n_results=10)
Neo4j · the question rephrased as graph traversal
MATCH (skill:Skill {name: "Knowledge Graphs"})<-[:DEMONSTRATES]-(project:Project)
RETURN project.title, project.summary
ORDER BY project.title
What changed

The question went from "find similar text" to "follow a specific relationship." The graph schema makes that relationship queryable. The vector index is still there for fuzzy semantic search — but now we choose which mechanism to use based on what kind of question is being asked.
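One way to picture "choosing the mechanism" is a small router in front of the two retrieval paths. The sketch below is purely illustrative: the regular expressions, the function name, and the two path labels are hypothetical, and a real router would likely need a broader pattern set or an LLM-based classifier.

```python
import re

# Hypothetical patterns for relationship-style questions (assumptions, not project code).
RELATIONSHIP_PATTERNS = [
    re.compile(r"\bwhich projects\b.*\b(use|uses|using)\b", re.IGNORECASE),
    re.compile(r"\bprojects that (use|uses)\b", re.IGNORECASE),
    re.compile(r"\bsimilar (methods|skills)\b", re.IGNORECASE),
]


def choose_retrieval_path(question: str) -> str:
    """Return 'graph_traversal' for relationship questions, else 'hybrid_vector'."""
    if any(pattern.search(question) for pattern in RELATIONSHIP_PATTERNS):
        return "graph_traversal"
    return "hybrid_vector"


for q in (
    "Which projects use knowledge graphs?",
    "Tell me about the Resume Graph Explorer",
):
    print(q, "->", choose_retrieval_path(q))
```

Case B would take the traversal path and Case A the hybrid vector path; anything the patterns miss falls back to hybrid retrieval, which is the safer default.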

06 · Hybrid scoring
Why these four weights, and what they're tuning

Section ranking in Neo4j is a weighted sum of four signals, each normalized to [0, 1] so that count-based components can't dominate the embedding similarity. Maximum possible score ≈ 1.0.

0.60 × vector_score (cosine similarity)
+ 0.25 if section is DESCRIBED_IN by any Project (boost specificity)
+ 0.10 × min(entities_mentioned, 5) / 5 (reward entity-rich sections)
+ 0.05 if section.char_count > 2000 (reward substance over fragments)
= final_score

What each weight is doing

  • 0.60 · vector similarity — the workhorse. If the query is semantically close to the section's content, this carries most of the signal.
  • 0.25 · project-described boost — distinguishes sections that describe a project from sections that merely mention one. This is the move that solves the "specificity" problem in ranking.
  • 0.10 · entity-richness signal — capped at 5 entities. Rewards sections that sit at intersections in the graph (e.g., a section discussing both a Skill and a Project is more useful than one that mentions only a name in passing).
  • 0.05 · length tiebreaker — small bonus for substantial sections. Mostly a tiebreaker to avoid surfacing thin sections when richer alternatives exist.
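The formula can be restated as a small Python function for experimenting with weights offline. This is a sketch that mirrors the four terms above, assuming vector_score is already in [0, 1]; the function name and the entity_cap parameter are illustrative, and the production ranking runs in Cypher. The two example sections are made-up inputs.

```python
def hybrid_score(vector_score: float, projects_described: int,
                 entities_mentioned: int, char_count: int,
                 entity_cap: int = 5) -> float:
    """Weighted sum of the four ranking signals."""
    described_boost = 0.25 if projects_described > 0 else 0.0
    entity_bonus = 0.10 * min(entities_mentioned, entity_cap) / entity_cap
    length_bonus = 0.05 if char_count > 2000 else 0.0
    return 0.60 * vector_score + described_boost + entity_bonus + length_bonus


# A thin section with the higher raw similarity...
passing_mention = hybrid_score(0.80, projects_described=0, entities_mentioned=2, char_count=900)
# ...versus a full project description with slightly lower similarity.
project_section = hybrid_score(0.70, projects_described=1, entities_mentioned=9, char_count=2500)
print(round(passing_mention, 2), round(project_section, 2))
```

The describing section outranks the passing mention despite the lower vector score, which is the behavior the 0.25 boost is meant to produce.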
Tune later

These weights are educated guesses, not optimized values. Phase 5 of the migration plan calls for empirically tuning them against the evaluation harness once the graph is fully populated and queries are flowing.

07 · Architectural decisions
Four resolved choices and why they matter

These decisions are marked "do not re-open without new evidence" in the migration plan. Each one shapes what the rest of the work has to do.

Decision 1

Sections split at H2, reusing parse_markdown_sections

Decision
Section boundaries match the existing ChromaDB ingestion exactly. Reuse utils.parse_markdown_sections(raw_text, header_level=2, include_nested=True).
Why
Comparability. If sections are defined the same way in both systems, any improvement measured in evaluation is attributable to the graph architecture — not to a different chunker.
Cost
Rigidity. A single-H2 document becomes one giant section. Falls back to H3-split when needed, with a HAS_SUBSECTION edge — a special case to remember.
Decision 2

Three-phase canonicalization, with an explicit "no fuzzy matching" rule

Decision
Normalize entity names through (1) deterministic lowercase-and-prefer-tag, (2) tag-anchored alignment to featured_projects.py tags, (3) LLM batch for the remainder. Type pools stay separate.
Why
At ~40–80 entities, a single Claude call is faster, cheaper, and more accurate than fuzzy-string threshold tuning. "GA4" vs. "Google Analytics 4" scores ~0.35 on SequenceMatcher — well below any usable threshold. This lesson is documented in Resume Graph Explorer's design notes and ported here.
Cost
One Claude API call per entity type per run, plus the discipline to actually review scripts/entity_normalization_report.json before loading. Zero merges is a valid outcome.
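The fuzzy-matching problem in Decision 2 is easy to reproduce with the standard library. The sketch below compares a true synonym pair with a pair of look-alike names; the lowercasing step and the second pair ("Data Validation" vs. "Data Visualization") are illustrative choices, and the exact scores depend on how the strings are normalized.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Same entity, very different strings.
true_synonyms = similarity("GA4", "Google Analytics 4")
# Different entities, very similar strings (illustrative pair).
lookalikes = similarity("Data Validation", "Data Visualization")

print(f"GA4 vs Google Analytics 4: {true_synonyms:.2f}")
print(f"Data Validation vs Data Visualization: {lookalikes:.2f}")
```

Any threshold low enough to merge the first pair would also merge the second, which is why string similarity alone cannot drive canonicalization here.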
Decision 3

Full rebuild on every load, with content_hash tracked for the future

Decision
MATCH (n) DETACH DELETE n at the top of populate_neo4j_graph.py. Document nodes carry content_hash and last_updated from day one.
Why
At ~120 nodes, a full rebuild runs in under 10 minutes — under the threshold where incremental complexity earns its keep. content_hash costs nothing now and enables incremental updates later without a schema migration.
Cost
Manual graph corrections get blown away on rebuild. Trigger to switch is concrete: rebuild > 30 min, or you're frequently losing corrections. .chroma_db_DT/ stays intact for 72 hours post-deploy as rollback.
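If the rebuild ever crosses that threshold, content_hash is what would make an incremental load possible. The sketch below shows the general idea under stated assumptions: SHA-256 is an assumed hash choice, both function names are hypothetical, and the stored hashes are passed in as a plain dictionary rather than read from Document nodes.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Hex digest of a document's text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_needing_reload(stored_hashes: Dict[str, str],
                             current_texts: Dict[str, str]) -> List[str]:
    """File paths that are new or whose content changed since the last load."""
    return sorted(
        path for path, text in current_texts.items()
        if stored_hashes.get(path) != content_hash(text)
    )


stored = {
    "kb_biosketch.md": content_hash("old biosketch text"),
    "kb_publications.md": content_hash("publication list"),
}
current = {
    "kb_biosketch.md": "edited biosketch text",
    "kb_publications.md": "publication list",
}
print(documents_needing_reload(stored, current))
```

Only the edited file is flagged, so an incremental loader could re-ingest that one Document and its Sections instead of wiping the graph.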
Decision 4

Entity source separation — node definitions vs. relationship signals

Decision
Skills, Methods, Technologies are canonicalized only from project_entities (structured LLM extraction from project walkthroughs). Concepts are canonicalized only from section_mentions. Section mentions of skills/tech still create MENTIONS edges, but they don't feed the canonical entity pool.
Why
Discovered during Phase 2 implementation. Pulling skills from both sources produced 452 raw names instead of ~85, and the canonicalization LLM call truncated mid-JSON. Section mentions are relationship data, not node definitions.
Cost
Concepts have no structured source — they're noisy and require manual curation. See the adjust later note about overwrite-fragility.
· · ·

08 · Simple queries
For understanding what's in the graph

These are the queries to run in the Neo4j browser when you want to feel out the shape of what's loaded. None of them are evaluation queries — they're for inspection and intuition.

What sections exist, grouped by document

MATCH (d:Document)-[:HAS_SECTION]->(s:Section)
RETURN d.title AS document, collect(s.name) AS sections
ORDER BY d.title

Every project, with the sections that describe it

MATCH (p:Project)-[:DESCRIBED_IN]->(s:Section)
RETURN p.title, collect(s.name) AS describing_sections
ORDER BY p.title

Health check — projects with no DESCRIBED_IN edge

// If this returns rows, those projects are invisible to the +0.25 ranking boost
MATCH (p:Project)
WHERE NOT (p)-[:DESCRIBED_IN]->()
RETURN p.id, p.title

MENTIONS edge distribution — sanity check

// If most sections have 0-1 mentions, entity extraction is too sparse.
// If most have 20+, the +0.10 cap will saturate everywhere.
MATCH (s:Section)-[m:MENTIONS]->()
RETURN s.name, count(m) AS mention_count
ORDER BY mention_count DESC LIMIT 20

Entity inventory

MATCH (n)
WHERE n:Skill OR n:Method OR n:Technology OR n:Concept
RETURN labels(n)[0] AS type, count(*) AS total
ORDER BY total DESC

09 · Demo queries
For showing the difference between before and after

These are the queries to use when explaining the migration to someone who hasn't seen it before — in an interview, in a LinkedIn post, in a walkthrough. Each one is chosen because the graph approach does something the vector approach structurally can't.

| Query | What ChromaDB does | What Neo4j does |
| --- | --- | --- |
| "Which projects use knowledge graphs?" | Returns 5-10 chunks that mention the phrase; the LLM has to extract a list. | Returns the exact list of Project nodes connected by DEMONSTRATES to the Knowledge Graphs skill. |
| "What other projects use similar methods to Resume Graph Explorer?" | Cannot reason about "similar methods." Returns whatever chunks mention Resume Explorer. | Traverses (p1)-[:USES_METHOD]->(m)<-[:USES_METHOD]-(p2) and returns shared methods with project pairs. |
| "Show me all projects that use Neo4j" | Returns chunks mentioning Neo4j; could miss projects that use it without saying so explicitly. | Returns exactly the projects linked to the Neo4j Technology node. |
| "Tell me about Resume Graph Explorer" | Returns 5-10 fragments of the project description, often cut mid-thought. | Returns the full DESCRIBED_IN section as a single coherent block. |
| "Find sections that discuss Bayesian reasoning" | Approximates with vector similarity; quality depends on phrasing. | Returns sections with an explicit MENTIONS edge to the Bayesian reasoning Concept node, including all alt_labels. |
| "What is Barbara's educational background?" | Returns 3-5 chunks of the Education section, possibly fragmented. | Returns the complete Education section in one piece. |

The Cypher behind the demo queries

"Which projects use knowledge graphs?"
MATCH (skill:Skill {name: "Knowledge Graphs"})<-[:DEMONSTRATES]-(p:Project)
RETURN p.title, p.summary
"Similar methods to Resume Graph Explorer"
MATCH (p1:Project {id: "resume-graph-explorer"})-[:USES_METHOD]->(m:Method)
      <-[:USES_METHOD]-(p2:Project)
WHERE p1 <> p2
RETURN p2.title, collect(m.name) AS shared_methods
ORDER BY size(shared_methods) DESC
"Sections discussing Bayesian reasoning"
MATCH (s:Section)-[m:MENTIONS]->(c:Concept {name: "Bayesian reasoning"})
RETURN s.name, m.count, m.context
ORDER BY m.count DESC

10 · Metrics defined
What the evaluation harness measures

The migration plan calls for five test batteries comparing ChromaDB baseline against Neo4j. Each metric below maps to one of them.

Coherence score (manual, 1–5 scale)
Human-rated quality of retrieved context per query, evaluated across three dimensions: fragmentation (are chunks cut mid-thought?), coverage (do the chunks together contain enough to answer?), and noise (are off-topic chunks diluting the context?). 5 = complete, clean section. 1 = unusable or wrong content. Manual scoring is fine for prototype scale; tracks the granularity problem.
Top-1 precision (automated, percentage)
Of N test queries with known correct answers, what percentage have the answer in the #1 retrieved result. Tracks ranking quality.
Top-K precision (automated, percentage)
Same idea, but counts the answer if it appears anywhere in the top-K (typically K=3 or K=5). Easier metric to hit; useful for understanding how much the LLM is bailing out the retriever.
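MRR · Mean Reciprocal Rank (automated, 0–1 scale)
Average of 1 / rank across queries, where rank is the position where the correct answer first appears. Rewards getting the right answer earlier in the result list. An MRR of 0.75 corresponds to a harmonic-mean rank of 1 / 0.75 ≈ 1.33, not an average position: for example, half the queries answered at rank 1 and half at rank 2 give an MRR of 0.75 with a mean rank of 1.5.
MRR = (1/N) × Σ_i (1 / rank_i), for i = 1..N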
Relationship query success rate (automated, percentage)
Of the relationship-style queries that vector search can't answer well ("which projects use X", "similar methods to Y"), what percentage does the graph answer correctly. Baseline is 0–20%; target is 75%+. Tracks the missing-connections problem.
Latency · p50 / p95 / p99 (automated, milliseconds)
Time from query received to chunks/sections returned. p95 means "95% of queries return in this time or less" — the standard percentile to watch for user experience. p99 catches the long tail. Target: p95 under 500ms.
Eval suite pass rate (automated, percentage)
Existing 50-question evaluation harness pass rate. Used as a regression check — the graph migration must not drop this below the ChromaDB baseline of ~85-90%.
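The automated metrics above reduce to a few lines of arithmetic. The sketch below computes top-K precision, MRR, and latency percentiles from per-query results; the function names, the nearest-rank percentile method, and the sample inputs are assumptions for illustration, not the harness's actual code or measured data.

```python
import math
from typing import List, Optional


def top_k_precision(ranks: List[Optional[int]], k: int) -> float:
    """Share of queries whose correct answer appears at rank <= k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average of 1/rank; a query with no correct result contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def percentile(samples_ms: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative inputs, not measured results.
ranks = [1, 2, 1, None, 3]  # 1-based rank of the first correct result; None = not retrieved
latencies_ms = [120, 180, 210, 250, 300, 320, 410, 450, 480, 900]

print("top-1:", top_k_precision(ranks, 1))
print("top-3:", top_k_precision(ranks, 3))
print("MRR:", round(mean_reciprocal_rank(ranks), 3))
print("p50:", percentile(latencies_ms, 50), "p95:", percentile(latencies_ms, 95))
```

With only ten latency samples the p95 is simply the slowest query, which is a reminder that tail percentiles need a reasonably large sample before they mean much.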

11 · Glossary
The jargon, in one place

RAG · Retrieval-Augmented Generation
A pattern where an LLM's response is grounded in retrieved source documents rather than parametric memory alone. The retriever fetches relevant context based on the query; the LLM generates an answer using that context.
GraphRAG
A variant of RAG where the retriever sits on top of a knowledge graph instead of (or in addition to) a vector index. Retrieval can combine vector similarity with graph traversal — "find sections semantically similar to the query, AND boost the ones that describe a relevant entity."
Embedding · vector
A high-dimensional numeric representation of text (here, 1536 dimensions from OpenAI's text-embedding-3-small). Texts with similar meaning end up close to each other in this vector space. Embeddings are how "semantic search" actually works under the hood.
Vector index
A specialized data structure that lets you find the K closest vectors to a query vector quickly, without comparing against every vector in the database. Neo4j 5.11+ ships with a built-in vector index, which is what removes the need for a separate vector database here.
Cosine similarity · L2 distance
Two ways to measure how close two vectors are. Cosine measures the angle between them (1 = identical direction, 0 = orthogonal). L2 measures straight-line distance. On normalized vectors they produce the same ranking but report different numbers; ChromaDB returns L2 and the existing code converts via 1 - (dist² / 2).
Cypher
Neo4j's query language. SQL-like in spirit but designed for graph patterns. MATCH (p:Project)-[:USES_TECHNOLOGY]->(t:Technology) RETURN p, t reads as "find every Project connected to a Technology by a USES_TECHNOLOGY edge, return both."
MERGE
A Cypher operation that creates a node or relationship if it doesn't exist, or matches the existing one if it does. The graph-database equivalent of "upsert." Used throughout the populate scripts so reruns are safe.
DETACH DELETE
Cypher operation that removes a node along with any relationships attached to it. MATCH (n) DETACH DELETE n wipes the entire graph. Used at the top of populate_neo4j_graph.py per Decision 3.
Canonical name · alt_labels
The canonical name is the single official form of an entity (e.g., "Bayesian reasoning"). alt_labels are true synonyms or near-synonyms ("Bayesian thinking", "Belief updating") stored on the node so that queries on any variant still find it. Borrowed from SKOS conventions in the Resume Graph Explorer.
Sensitivity tier
Three-level audience access control: public, personal, inner_circle. Every Section and Document carries a tier; retrieval filters by what the visitor is allowed to see. Inherited from the existing ChromaDB setup unchanged.
Hybrid retrieval
Combining multiple retrieval signals (here: vector similarity + graph structure + entity richness + length) into a single ranked result set, rather than relying on any one signal alone.
Project entities · section mentions
Two distinct extraction sources. Project entities come from structured LLM extraction over project walkthroughs — they define the canonical Skill / Method / Technology nodes. Section mentions come from LLM scanning of KB sections — they produce MENTIONS edges and (separately) the Concept nodes. Decision 4 keeps these sources from contaminating each other.
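The conversion quoted in the cosine / L2 entry follows from expanding the squared distance between unit vectors: |a - b|² = |a|² + |b|² - 2(a · b) = 2 - 2 cos θ, so cos θ = 1 - dist² / 2. The sketch below checks that identity numerically on two small made-up vectors; it assumes dist is the plain (unsquared) L2 distance and that both vectors are normalized first.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

dist = l2_distance(a, b)
converted = 1 - dist ** 2 / 2  # the conversion from the glossary entry
print(round(converted, 6), round(cosine_similarity(a, b), 6))
```

The identity only holds for unit-length vectors, so the conversion is valid only if the embeddings are normalized before the distance is computed.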

12 · Adjust later
Places to revisit once the system is running

None of these is blocking. They're flagged here so they don't disappear into folklore.

Score weights

The 0.60 / 0.25 / 0.10 / 0.05 split is a starting guess. Phase 5 of the migration plan calls for tuning these against the eval harness once enough query data exists. The current values are loosely calibrated to make vector_score dominant while letting graph signals matter; that ratio may need to shift in either direction.

MENTIONS edge cap saturates in production

The +0.10 bonus saturates at 5 entities mentioned. Real data shows median 9 mentions per section, mean 9.8, with ~75% of sections at 6+ mentions. The signal is effectively binary right now: either a section has <5 mentions and gets a partial bonus, or it has 5+ and gets the full bonus — no differentiation within the 6+ population. The cap should likely move to 10–15 once the migration is running, validated against the actual ranking outputs. The "MENTIONS edge distribution sanity check" query in section 8 is the right tool to watch for this.

Concept curation is overwrite-fragile

The manually curated Concept list (20 nodes) lives in canonical_entities.json but is regenerated by canonicalize_entities.py on every run. The Decision 4 note proposes the long-term fix: save the curated list to scripts/concepts_curated.json and have the canonicalization script read-and-preserve it rather than regenerating from scratch. Until that's done, treat the canonical list as a brittle artifact and back it up before rerunning canonicalization.

Phase 2 / Phase 3 boundary is fuzzy in the script

The migration plan describes Phase 2 (Entity Extraction) and Phase 3 (Relationship Mapping) as sequential, but populate_projects() in populate_neo4j_graph.py creates Project nodes and DEMONSTRATES / USES_METHOD / USES_TECHNOLOGY edges in the same pass via MERGE. What's actually new in Phase 3 is the cross-cutting relationships: DESCRIBED_IN, MENTIONS, NEXT_SECTION. This is fine but worth noting if anyone wonders why "relationship mapping" feels lighter than the bullet list suggests.

Chunks may end up being dead code

The schema retains the Chunk node and HAS_CHUNK edge as a fallback for fine-grained retrieval. If evaluation shows that full-section retrieval is consistently better, chunks become an unused branch — simplify the schema or document explicitly why they remain.

RELATED_TO (Project → Project) is deferred

The schema defines a RELATED_TO {similarity: float} edge for project-to-project similarity derived from shared skills/methods. Not implemented in Phase 3 — it's a nice-to-have that can wait until the rest of the graph is stable.

Acronym Title-Casing in canonical names

The deterministic phase of canonicalization is Title-Casing acronyms, producing forms like Rdf, Skos, Nlp, Rag, Openai, Chromadb, Transe, Fastapi, Sqlmodel, Weather Api in canonical_entities.json. Some acronyms are preserved correctly (SHACL, JSON, CSV, HTML), but only because the extraction model happened to emit them uppercase — no normalization rule kept them that way.

Root cause. Phase 1 uses max(variants, key=len) to pick the canonical form from each lowercased group — a heuristic designed to prefer "Knowledge Graphs" over "Knowledge Graph". When the extractor emits "Rag" as the only variant in its group, "Rag" wins by default. "RAG" and "Rag" have identical length, so the length rule can't break the tie even when both appear.

Fix for next rebuild. Add a pre-pass to Phase 1 that prefers ALL-CAPS variants when they exist in a group: if any variant is fully uppercase and ≤6 characters, prefer it regardless of length. Catches the common cases without a hand-maintained allowlist. A named-acronyms list would be cleaner but the simple rule covers ~95% of real-world ambiguity. Lives in canonicalize_entities.py only — no schema or graph changes needed.
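A minimal sketch of that pre-pass, assuming each group arrives as a list of raw variant strings. The function name and the max_acronym_len parameter are hypothetical, and this is not the code in canonicalize_entities.py.

```python
from typing import List


def pick_canonical(variants: List[str], max_acronym_len: int = 6) -> str:
    """Prefer a short ALL-CAPS variant if one exists; otherwise fall back to the longest."""
    acronyms = [v for v in variants if v.isupper() and len(v) <= max_acronym_len]
    if acronyms:
        return acronyms[0]
    return max(variants, key=len)  # existing length heuristic


# The length heuristic alone cannot break an equal-length tie; it returns the first variant.
print(max(["Rag", "RAG"], key=len))
# With the pre-pass, the uppercase form wins whenever the extractor emitted it at least once.
print(pick_canonical(["Rag", "RAG"]))
# Groups without a short uppercase variant behave as before.
print(pick_canonical(["Knowledge Graph", "Knowledge Graphs"]))
```

The rule cannot help when the extractor only ever emits the title-cased form: a group containing just "Rag" still canonicalizes to "Rag", so those cases would need an allowlist or a prompt change.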

Cross-type pool overlap — design ambiguity, not a bug

Current data shows 20 names appearing as both Skill AND Method, 8 as Skill AND Technology, and 3 names (Rdf, Skos, Weather Api) in all three pools. The code is behaving exactly as written — type pools canonicalize independently with no cross-pool dedup step.
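Overlap counts like these can be derived with a single pass over the type pools. The sketch below is illustrative only: the function name is hypothetical and the three pools are a toy sample built from names mentioned in this document, not the real contents of canonical_entities.json.

```python
from typing import Dict, List, Set


def cross_pool_overlap(pools: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each name that appears in more than one type pool to the pools it appears in."""
    seen: Dict[str, List[str]] = {}
    for pool_type, names in pools.items():
        for name in names:
            seen.setdefault(name, []).append(pool_type)
    return {name: types for name, types in seen.items() if len(types) > 1}


# Toy pools for illustration (assumed membership, not real data).
pools = {
    "Skill": {"Rag", "Rdf", "Knowledge Graph", "Evaluation"},
    "Method": {"Rag", "Rdf", "LLM Extraction"},
    "Technology": {"Rdf", "Docker", "Neo4j"},
}

overlap = cross_pool_overlap(pools)
for name in sorted(overlap):
    print(name, overlap[name])
```

Names that land in all three pools are the candidates for a primary-type rule; two-pool names need a human judgment about whether both roles are legitimate.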

What Decision 2 actually says. "Keep pools separate" meant don't let NLP-the-skill accidentally merge with NLP-the-technology during normalization. It said nothing about whether the same name can legitimately hold nodes in two pools.

Where overlap is defensible. (:Skill {name: "Rag"}) = capability Barbara has. (:Method {name: "Rag"}) = retrieval approach a project uses. Two different graph roles, both correct. Same for Docker / Knowledge Graph — the technology is the tool; the skill is using it.

Where it's an extraction-quality issue. Rdf, Skos, Weather Api appearing in all three pools is the extractor seeing "uses RDF" and tagging it as a Skill rather than a Technology choice. These should pin to Technology only.

Decision. Leave it for now — the current overlap is harmless because separate nodes correctly serve different Cypher queries. Next extraction pass should enforce a primary-type rule for the three triple-pool offenders. Fixable in extract_entities.py prompt tightening — no graph rebuild required.