01 · In one paragraph
The migration
The Digital Twin's retrieval layer is being rebuilt. The old system splits knowledge base documents into ~900-character chunks, embeds each one, and searches them by vector similarity. The new system stores full sections as the retrieval unit, promotes Projects, Skills, Methods, Technologies, and Concepts to first-class nodes with explicit edges between them, and ranks results with a hybrid score that combines vector similarity with graph signals. The payoff is queries that couldn't be answered before, such as "which projects use knowledge graphs?", and retrieval that returns complete thoughts instead of fragments. The cost is one additional database to operate, one new query language to write, and a manual entity-curation step until the canonicalization pipeline preserves the curated Concept list across runs instead of regenerating it.
02 · Why migrate
Three problems with the current system
Each problem in the table below is a real failure mode observed in production, not a hypothetical. The Neo4j move is what the new architecture does to address it.
| Problem | ChromaDB behavior | Neo4j move |
|---|---|---|
| Wrong granularity | Returns 900-char chunks that often cut mid-paragraph or mid-thought. | Stores complete sections (2–3K chars) as the retrieval unit, with chunks kept as optional children for fallback. |
| Missing connections | Cannot answer "which projects use X?" — relationships between projects and skills are implicit in text only. | Promotes Skill, Method, Technology, Publication to nodes with typed edges from Project; Cypher traversal answers relationship queries directly. |
| Poor ranking | Pure L2 distance on embeddings; cannot tell whether a section describes a topic or merely mentions it. | Hybrid score combines vector similarity with graph centrality, project-described boost, and entity-richness signals. |
03 · Three moves
The architecture, mapped to the problems
Every other choice in the migration plan exists to support one of these three moves. When you get lost, come back here.
Move 1 — Make the section a first-class retrieval unit
Each H2 block in a KB document becomes a Section node carrying its complete text and its own embedding. Section boundaries match the existing ChromaDB ingestion logic exactly, which keeps the before/after evaluation honest.
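To make the H2 split concrete, here is a minimal sketch of the idea. The function `split_h2_sections` and the `Section` tuple are hypothetical stand-ins written for illustration; they are not the project's `utils.parse_markdown_sections`, whose exact behavior may differ (for example around headings inside code blocks). The sketch assumes that H3 and deeper headings stay nested inside their parent H2 block and that text before the first H2 is ignored.

```python
import re
from typing import NamedTuple

# Matches "## Heading" but not "# Title" or "### Subheading".
H2 = re.compile(r"^##\s+(.+?)\s*$")


class Section(NamedTuple):
    name: str        # heading text of the H2 block
    full_text: str   # everything under the heading, nested H3+ included
    order: int       # position of the section within its document
    char_count: int


def split_h2_sections(raw_text: str) -> list[Section]:
    """Split markdown into H2 blocks (illustrative stand-in, not the real parser)."""
    sections: list[Section] = []
    name, lines = None, []
    # The sentinel heading forces the last real section to be flushed.
    for line in raw_text.splitlines() + ["## __END__"]:
        match = H2.match(line)
        if match:
            if name is not None:
                text = "\n".join(lines).strip()
                sections.append(Section(name, text, len(sections), len(text)))
            name, lines = match.group(1), []
        elif name is not None:
            lines.append(line)
    return sections


doc = "# Title\nintro\n## Education\nPhD.\n### Details\nmore\n## Projects\nStuff"
for section in split_h2_sections(doc):
    print(section.order, section.name, section.char_count)
```

Each resulting tuple carries the fields a Section node needs before embedding: a name, the complete text, its order, and a character count.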
Move 2 — Promote entities to typed nodes with typed edges
Skills, Methods, Technologies, Publications, and Concepts become nodes. Projects connect to them through named relationships: DEMONSTRATES, USES_METHOD, USES_TECHNOLOGY, DOCUMENTED_IN. Sections connect to entities through MENTIONS edges that carry a count and a contextual snippet.
Move 3 — Replace L2 distance with a multi-signal hybrid score
Vector similarity stays in the mix at a weight of 0.60, but the final ranking also adds a bonus for sections that explicitly describe a Project (+0.25), for sections rich in entity mentions (up to +0.10), and for longer, more substantial sections (+0.05). The full formula lives in section 6.
04 · The schema
What lives in the graph
Two tiers. The document hierarchy (Document → Section → Chunk) is where text lives. The entity network (Project at center, connecting Skill / Method / Technology / Publication) is what the text is about. The two tiers are bridged by MENTIONS edges (entities cited within sections) and DESCRIBED_IN edges (projects pointing back to the sections that describe them).
[Schema diagram: DESCRIBED_IN edges are drawn; MENTIONS edges are cross-cutting and not drawn, for legibility.]
Node types — quick reference
Note on properties: role (Skill) and stage (Method) are edge properties, not node properties. They live on the DEMONSTRATES and USES_METHOD relationships respectively, so the same Skill can play a "core" role on one Project and a "supporting" role on another.
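A small sketch of what "role lives on the edge" means in practice. Everything here is illustrative: the `Demonstrates` dataclass, the Cypher string, and the role assignments are assumptions made for the example, not code or data from `populate_neo4j_graph.py`.

```python
from dataclasses import dataclass

# Hypothetical Cypher a loader could run once per edge. The role is set on
# the relationship r, never on the Skill node itself.
MERGE_DEMONSTRATES = """
MERGE (p:Project {id: $project_id})
MERGE (s:Skill {name: $skill})
MERGE (p)-[r:DEMONSTRATES]->(s)
SET r.role = $role
"""


@dataclass(frozen=True)
class Demonstrates:
    project_id: str
    skill: str
    role: str  # one of: core / secondary / supporting


# Illustrative rows only; the real roles come from project_entities.
edges = [
    Demonstrates("resume-graph-explorer", "Knowledge Graphs", "core"),
    Demonstrates("digital-twin", "Knowledge Graphs", "supporting"),
]


def roles_for(skill: str, edges: list[Demonstrates]) -> dict[str, str]:
    """Return the role a single Skill node plays on each project."""
    return {e.project_id: e.role for e in edges if e.skill == skill}


print(roles_for("Knowledge Graphs", edges))
```

Because the property sits on the relationship, one Skill node can be shared by many projects without losing per-project nuance.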
| Node | Properties | Source · example |
|---|---|---|
| Document | file_path, title, source_type, sensitivity, content_hash, last_updated | One per KB file (e.g. kb_biosketch.md) |
| Section | id, name, full_text, embedding (1536-dim), sensitivity, order, char_count | One per H2 block within a Document (e.g. kb_biosketch:Education) |
| Chunk | id, text, embedding, chunk_index, char_count | Children of Section, optional fallback |
| Project | id, title, summary, design_insight, walkthrough_context, tags, sensitivity, urls, diagram_filename | From featured_projects.py (e.g. resume-graph-explorer, digital-twin) |
| Skill | name, category, alt_labels; plus role on the DEMONSTRATES edge: core / secondary / supporting | LLM extraction from project walkthroughs (project_entities only) |
| Method | name, category, alt_labels; plus stage on the USES_METHOD edge: ingestion / retrieval / evaluation / generation | LLM extraction from project walkthroughs (project_entities only) |
| Technology | name, category, alt_labels | LLM extraction from project walkthroughs (project_entities only) |
| Concept | name, description, source, alt_labels; source tracks which KB section the concept was first extracted from | Curated from KB section mentions (manual curation) |
| Publication | title, type, year, url | From kb_publications.md |
Relationship types — quick reference
| Edge | Direction | Why it exists |
|---|---|---|
| HAS_SECTION | Document → Section | Structural containment |
| HAS_CHUNK | Section → Chunk | Optional fine-grained fallback |
| NEXT_SECTION | Section → Section | Sequential order within a Document |
| DESCRIBED_IN | Project → Section | Which sections actually describe this project (feeds the +0.25 ranking boost) |
| DEMONSTRATES | Project → Skill | Carries role property (core / secondary / supporting) |
| USES_METHOD | Project → Method | Carries stage property (ingestion / retrieval / evaluation / generation) |
| USES_TECHNOLOGY | Project → Technology | Plain typed edge |
| DOCUMENTED_IN | Project → Publication | Linked write-ups |
| MENTIONS | Section → any entity | Carries count and context; feeds the +0.10 entity-richness signal |
| RELATED_TO | Project → Project | Derived from shared skills/methods (deferred; not in Phase 3) |
By the numbers — what's actually in the graph
Real counts from canonical_entities.json as of the most recent canonicalization run. These numbers are the actual order of magnitude you're operating at — useful both for tuning decisions and for talking about the system honestly.
Canonicalization in action
The three-phase canonicalization pipeline (Decision 2) compresses 279 raw extracted entity names down to 167 canonical nodes, roughly a 1.67× reduction (279 / 167). The compression happens because the same idea appears across multiple project walkthroughs under slightly different names, and the LLM batch phase collapses those into a single canonical with alt_labels.
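[Examples of real merges produced by the pipeline: list not preserved in this copy.]
Skill roles · 57 canonical skills
The role property, carried on the DEMONSTRATES edge, takes the values core / secondary / supporting. [Distribution chart not preserved in this copy.]
Method stages · 42 canonical methods
The stage property, carried on the USES_METHOD edge, takes the values ingestion / retrieval / evaluation / generation. [Distribution chart not preserved in this copy.]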
The ingestion-heavy distribution is honest about where the methodological work goes in these projects — most of the design effort is in getting data into a usable shape before any retrieval or generation happens.
MENTIONS edges per Section · across 121 sections with mention data
[Histogram of how many entities (Projects, Skills, Concepts combined) each Section mentions: not preserved in this copy.]
Median is 9 mentions per section, mean is 9.8. About 75% of sections sit at 6+ mentions — well above the +0.10 bonus cap of 5. The signal is effectively binary in production right now: either a section has <5 mentions and gets a partial bonus, or it has 5+ and gets the full bonus. The cap likely needs to move to 10–15 to actually differentiate sections. Flagged in section 12.
Per-project entity density · raw counts from project_entities
Before canonicalization, this is the entity footprint each project contributed. Higher numbers don't mean better projects — they mean more methodological surface area to articulate.
| Project | Skills | Methods | Techs | Total |
|---|---|---|---|---|
| resume-graph-explorer | 13 | 10 | 15 | 38 |
| beehive-monitor | 12 | 12 | 11 | 35 |
| poolula-platform | 11 | 12 | 9 | 32 |
| chronoscope | 10 | 10 | 11 | 31 |
| academic-citation-platform | 11 | 12 | 7 | 30 |
| concept-cartographer | 10 | 8 | 8 | 26 |
| digital-twin | 9 | 9 | 6 | 24 |
| weaving-memories | 9 | 7 | 8 | 24 |
| fitness-tracker | 8 | 9 | 5 | 22 |
| convoscope | 6 | 6 | 5 | 17 |
Sensitivity tier · across 121 sections with mention data
[Sensitivity tier breakdown chart not preserved in this copy.]
Section-level numbers — post-Phase 3 placeholder (awaiting load)
This block will fill in once Phase 3 finishes loading sections into Neo4j. Expected to cover:
- Average section length (chars) by source_type — to validate the "2–3K chars per section" claim in the granularity argument
- Embedding distribution sanity checks — confirm all section embeddings are valid 1536-dim vectors with reasonable norms
- Sections per Document — mean, min, max — to flag any documents that are dramatically over- or under-sectioned
- HAS_CHUNK edge density — how often chunks are actually retained vs. the section-only path being used; informs whether chunks can be retired (see Adjust later)
- Vector index coverage — percentage of sections with valid embeddings indexed under section_embeddings
- Sections with zero MENTIONS edges by document — currently 16 sections show zero mentions in _source_section_mentions; worth knowing which documents those cluster in
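The embedding sanity check in the list above could look something like the sketch below. The function name, the way problems are reported, and the norm range of 0.9 to 1.1 are all assumptions; the range presumes the embedding model returns vectors of roughly unit length, which should be confirmed against real data before it is used as a gate.

```python
import math

EXPECTED_DIM = 1536  # matches the Section.embedding dimension in the schema


def check_embedding(vec, expected_dim=EXPECTED_DIM, norm_range=(0.9, 1.1)):
    """Return a list of problems; an empty list means the vector looks sane."""
    problems = []
    if len(vec) != expected_dim:
        problems.append(f"dimension {len(vec)} != {expected_dim}")
    if any(not math.isfinite(x) for x in vec):
        # A norm is meaningless with NaN or infinity present, so stop here.
        problems.append("contains NaN or infinity")
        return problems
    norm = math.sqrt(sum(x * x for x in vec))
    low, high = norm_range
    if not (low <= norm <= high):
        problems.append(f"norm {norm:.3f} outside [{low}, {high}]")
    return problems


# A unit-length vector of the right size passes; an all-zero vector does not.
unit = [1 / math.sqrt(EXPECTED_DIM)] * EXPECTED_DIM
print(check_embedding(unit))
print(check_embedding([0.0] * EXPECTED_DIM))
```

Running this over every `section.embedding` pulled from the graph gives a quick count of sections that need re-embedding.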
Run the simple inspection queries from section 8 against the loaded graph to populate these.
Browse the canonical entities
The interactive browser over canonical_entities.json (all 167 canonical entity nodes) is not reproduced in this copy. It lists each entity's Type, Name, Role / Stage / Source, Category, and Alt labels, with the full alt_labels and (for concepts) the description available per row.
05 · Same query, two paths
How a question travels through each system
The clearest way to internalize the difference is to follow a single query through both pipelines. Here are two contrasting cases — one where both systems work but the experience differs, and one where only the graph approach can answer the question at all.
Case A — "Tell me about the Resume Graph Explorer"
A factual question with a clear target. Both systems can answer it; the difference is in what comes back.
ChromaDB path: Embed the query. Find the 10 nearest chunks by L2 distance. Filter by sensitivity tier. Return them in similarity order.
What comes back: 5–10 chunks of ~900 characters each, drawn from wherever the project is mentioned in the KB. Some are mid-paragraph cuts. The LLM has to stitch them.
Neo4j path: Embed the query. Vector-search sections. Boost sections where DESCRIBED_IN ← Project {id: "resume-graph-explorer"} applies. Return top section in full.
What comes back: The complete section that describes the project, 2–3K characters of coherent, contextualized prose, with related-project metadata attached.
CALL db.index.vector.queryNodes('section_embeddings', 10, $query_embedding)
YIELD node AS section, score AS vector_score
WHERE section.sensitivity IN $allowed_tiers
// Reward sections that DESCRIBE the project, not just mention it
OPTIONAL MATCH (section)<-[:DESCRIBED_IN]-(p:Project)
OPTIONAL MATCH (section)-[:MENTIONS]->(entity)
WITH section, vector_score,
count(DISTINCT p) AS projects_described,
count(DISTINCT entity) AS entities_mentioned
WITH section,
(vector_score * 0.6 +
CASE WHEN projects_described > 0 THEN 0.25 ELSE 0 END +
toFloat(CASE WHEN entities_mentioned > 5 THEN 5 ELSE entities_mentioned END) / 5 * 0.10 +
(CASE WHEN section.char_count > 2000 THEN 0.05 ELSE 0 END)) AS final_score
ORDER BY final_score DESC LIMIT 5
RETURN section.full_text, section.name, final_score
Case B — "Which projects use knowledge graphs?"
This is the kind of query that vector search structurally cannot answer well. The ChromaDB approach returns whatever section happens to mention knowledge graphs most prominently — useful, but not a list. The graph approach traverses the relationship and returns the actual answer.
ChromaDB (Python):
# Embed "Which projects use knowledge graphs?"
# Return 10 chunks ordered by similarity to that embedding
# LLM reads chunks and tries to extract a list of projects
# Failure mode: misses projects that don't say "knowledge graph" verbatim
results = collection.query(query_embeddings=[emb], n_results=10)
Neo4j (Cypher):
MATCH (skill:Skill {name: "Knowledge Graphs"})<-[:DEMONSTRATES]-(project:Project)
RETURN project.title, project.summary
ORDER BY project.title
The question went from "find similar text" to "follow a specific relationship." The graph schema makes that relationship queryable. The vector index is still there for fuzzy semantic search — but now we choose which mechanism to use based on what kind of question is being asked.
06 · Hybrid scoring
Why these four weights, and what they're tuning
Section ranking in Neo4j is a weighted sum of four signals, each normalized to [0, 1] so that count-based components can't dominate the embedding similarity. Maximum possible score ≈ 1.0.
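Written out term by term from the Cypher in section 5, the weighted sum reads as follows. The symbol names are chosen here for readability, and the sketch assumes the vector index returns a similarity score already in [0, 1].

```latex
\[
\text{final\_score}
  = 0.60 \cdot s_{\text{vec}}
  + 0.25 \cdot \mathbf{1}[\,n_{\text{proj}} > 0\,]
  + 0.10 \cdot \frac{\min(n_{\text{ent}},\, 5)}{5}
  + 0.05 \cdot \mathbf{1}[\,\text{char\_count} > 2000\,]
\]
```

Here \(s_{\text{vec}}\) is the vector similarity, \(n_{\text{proj}}\) is the number of distinct Projects with a DESCRIBED_IN edge to the section, and \(n_{\text{ent}}\) is the number of distinct entities the section MENTIONS. With every term at its maximum, the sum is 0.60 + 0.25 + 0.10 + 0.05 = 1.0.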
What each weight is doing
- 0.60 · vector similarity — the workhorse. If the query is semantically close to the section's content, this carries most of the signal.
- 0.25 · project-described boost — distinguishes sections that describe a project from sections that merely mention one. This is the move that solves the "specificity" problem in ranking.
- 0.10 · entity-richness signal — capped at 5 entities. Rewards sections that sit at intersections in the graph (e.g., a section discussing both a Skill and a Project is more useful than one that mentions only a name in passing).
- 0.05 · length tiebreaker — small bonus for substantial sections. Mostly a tiebreaker to avoid surfacing thin sections when richer alternatives exist.
These weights are educated guesses, not optimized values. Phase 5 of the migration plan calls for empirically tuning them against the evaluation harness once the graph is fully populated and queries are flowing.
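For experimenting with the weights outside the database, the same scoring logic can be mirrored in a few lines of Python. This is a sketch that follows the Cypher in section 5; the function name and keyword parameters are illustrative, and the input values in the example are made up.

```python
def hybrid_score(vector_score, projects_described, entities_mentioned, char_count,
                 entity_cap=5, min_chars=2000):
    """Mirror of the four-signal weighted sum used to rank sections."""
    described_bonus = 0.25 if projects_described > 0 else 0.0
    # Entity richness is capped so that count-heavy sections cannot dominate.
    richness = min(entities_mentioned, entity_cap) / entity_cap * 0.10
    length_bonus = 0.05 if char_count > min_chars else 0.0
    return vector_score * 0.60 + described_bonus + richness + length_bonus


# A section that describes a project can outrank one that is semantically
# closer to the query but only mentions the project in passing.
describes = hybrid_score(0.70, projects_described=1, entities_mentioned=5, char_count=2500)
mentions = hybrid_score(0.90, projects_described=0, entities_mentioned=5, char_count=2500)
print(round(describes, 2), round(mentions, 2))
```

Keeping a pure-Python copy makes it cheap to replay logged queries under different weight splits before touching the Cypher.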
07 · Architectural decisions
Four resolved choices and why they matter
These decisions are marked do not re-open without new evidence in the migration plan. Each one shapes what the rest of the work has to do.
Decision 1: Sections split at H2, reusing parse_markdown_sections
- … utils.parse_markdown_sections(raw_text, header_level=2, include_nested=True).
- … HAS_SUBSECTION edge — a special case to remember.
Decision 2: Three-phase canonicalization, with an explicit no fuzzy matching rule
- … featured_projects.py tags, (3) LLM batch for the remainder. Type pools stay separate.
- … scripts/entity_normalization_report.json before loading. Zero merges is a valid outcome.
Decision 3: Full rebuild on every load, with content_hash tracked for the future
- … MATCH (n) DETACH DELETE n at the top of populate_neo4j_graph.py. Document nodes carry content_hash and last_updated from day one.
- … content_hash costs nothing now and enables incremental updates later without a schema migration.
- … .chroma_db_DT/ stays intact for 72 hours post-deploy as rollback.
Decision 4: Entity source separation — node definitions vs. relationship signals
- … project_entities (structured LLM extraction from project walkthroughs). Concepts are canonicalized only from section_mentions. Section mentions of skills/tech still create MENTIONS edges, but they don't feed the canonical entity pool.
08 · Simple queries
For understanding what's in the graph
These are the queries to run in the Neo4j browser when you want to feel out the shape of what's loaded. None of them are evaluation queries — they're for inspection and intuition.
What sections exist, grouped by document
MATCH (d:Document)-[:HAS_SECTION]->(s:Section)
RETURN d.title AS document, collect(s.name) AS sections
ORDER BY d.title
Every project, with the sections that describe it
MATCH (p:Project)-[:DESCRIBED_IN]->(s:Section)
RETURN p.title, collect(s.name) AS describing_sections
ORDER BY p.title
Health check — projects with no DESCRIBED_IN edge
// If this returns rows, those projects are invisible to the +0.25 ranking boost
MATCH (p:Project)
WHERE NOT (p)-[:DESCRIBED_IN]->()
RETURN p.id, p.title
MENTIONS edge distribution — sanity check
// If most sections have 0-1 mentions, entity extraction is too sparse.
// If most have 20+, the +0.10 cap will saturate everywhere.
MATCH (s:Section)-[m:MENTIONS]->()
RETURN s.name, count(m) AS mention_count
ORDER BY mention_count DESC LIMIT 20
Entity inventory
MATCH (n)
WHERE n:Skill OR n:Method OR n:Technology OR n:Concept
RETURN labels(n)[0] AS type, count(*) AS total
ORDER BY total DESC
09 · Demo queries
For showing the difference between before and after
These are the queries to use when explaining the migration to someone who hasn't seen it before — in an interview, in a LinkedIn post, in a walkthrough. Each one is chosen because the graph approach does something the vector approach structurally can't.
| Query | What ChromaDB does | What Neo4j does |
|---|---|---|
| "Which projects use knowledge graphs?" | Returns 5-10 chunks that mention the phrase; the LLM has to extract a list. | Returns the exact list of Project nodes connected by DEMONSTRATES to the Knowledge Graphs skill. |
| "What other projects use similar methods to Resume Graph Explorer?" | Cannot reason about "similar methods." Returns whatever chunks mention Resume Graph Explorer. | Traverses (p1)-[:USES_METHOD]->(m)<-[:USES_METHOD]-(p2) and returns shared methods with project pairs. |
| "Show me all projects that use Neo4j" | Returns chunks mentioning Neo4j — could miss projects that use it without saying so explicitly. | Returns exactly the projects linked to the Neo4j Technology node. |
| "Tell me about Resume Graph Explorer" | Returns 5-10 fragments of the project description, often cut mid-thought. | Returns the full DESCRIBED_IN section as a single coherent block. |
| "Find sections that discuss Bayesian reasoning" | Approximates with vector similarity; quality depends on phrasing. | Returns sections with an explicit MENTIONS edge to the Bayesian reasoning Concept node, including all alt_labels. |
| "What is Barbara's educational background?" | Returns 3-5 chunks of the Education section, possibly fragmented. | Returns the complete Education section in one piece. |
The Cypher behind the demo queries
// "Which projects use knowledge graphs?"
MATCH (skill:Skill {name: "Knowledge Graphs"})<-[:DEMONSTRATES]-(p:Project)
RETURN p.title, p.summary
// "What other projects use similar methods to Resume Graph Explorer?"
MATCH (p1:Project {id: "resume-graph-explorer"})-[:USES_METHOD]->(m:Method)
      <-[:USES_METHOD]-(p2:Project)
WHERE p1 <> p2
RETURN p2.title, collect(m.name) AS shared_methods
ORDER BY size(shared_methods) DESC
// "Find sections that discuss Bayesian reasoning"
MATCH (s:Section)-[m:MENTIONS]->(c:Concept {name: "Bayesian reasoning"})
RETURN s.name, m.count, m.context
ORDER BY m.count DESC
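To make the shared-methods traversal above less abstract, the same logic can be mimicked over a plain dictionary. This is only an illustration of what the pattern (p1)-[:USES_METHOD]->(m)<-[:USES_METHOD]-(p2) computes; the method names and project-to-method assignments below are invented for the example and are not taken from canonical_entities.json.

```python
def shared_methods(uses_method, source_id):
    """Return (project, shared method names) pairs, most overlap first."""
    source = uses_method.get(source_id, set())
    shared = []
    for project_id, methods in uses_method.items():
        if project_id == source_id:
            continue  # same role as WHERE p1 <> p2
        overlap = sorted(source & methods)
        if overlap:
            shared.append((project_id, overlap))
    # Same role as ORDER BY size(shared_methods) DESC
    return sorted(shared, key=lambda pair: len(pair[1]), reverse=True)


# Hypothetical USES_METHOD edges, keyed by project id.
toy_graph = {
    "resume-graph-explorer": {"Entity Extraction", "Graph Modeling", "Hybrid Retrieval"},
    "digital-twin": {"Graph Modeling", "Hybrid Retrieval"},
    "chronoscope": {"Entity Extraction"},
    "convoscope": {"Prompt Evaluation"},
}

for project_id, methods in shared_methods(toy_graph, "resume-graph-explorer"):
    print(project_id, methods)
```

Projects with no method in common drop out entirely, which is the behavior the vector-only path cannot reproduce without the LLM guessing at overlap.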
10 · Metrics defined
What the evaluation harness measures
The migration plan calls for five test batteries comparing ChromaDB baseline against Neo4j. Each metric below maps to one of them.
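MRR (mean reciprocal rank): the mean of 1 / rank across queries, where rank is the position where the correct answer first appears. Rewards getting the right answer earlier in the result list. An MRR of 0.75 corresponds to a typical first-hit position of about 1.33 (the harmonic mean of the ranks, not their arithmetic average). For example, first-hit ranks of 1, 2, and 4 across three queries give (1 + 0.5 + 0.25) / 3 ≈ 0.58.
11 · Glossary
The jargon, in one place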
- … (text-embedding-3-small). Texts with similar meaning end up close to each other in this vector space. Embeddings are how "semantic search" actually works under the hood.
- … 1 - (dist² / 2).
- … MATCH (p:Project)-[:USES]->(t:Technology) RETURN p, t reads as "find every Project connected to a Technology by a USES edge, return both."
- … MATCH (n) DETACH DELETE n wipes the entire graph. Used at the top of populate_neo4j_graph.py per Decision 3.
- … public, personal, inner_circle. Every Section and Document carries a tier; retrieval filters by what the visitor is allowed to see. Inherited from the existing ChromaDB setup unchanged.
- … MENTIONS edges and (separately) the Concept nodes. Decision 4 keeps these sources from contaminating each other.
12 · Adjust later
Places to revisit once the system is running
None of these is blocking. They're flagged here so they don't disappear into folklore.
The 0.60 / 0.25 / 0.10 / 0.05 split is a starting guess. Phase 5 of the migration plan calls for tuning these against the eval harness once enough query data exists. The current values are loosely calibrated to make vector_score dominant while letting graph signals matter; that ratio may need to shift either direction.
The +0.10 bonus saturates at 5 entities mentioned. Real data shows median 9 mentions per section, mean 9.8, with ~75% of sections at 6+ mentions. The signal is effectively binary right now: either a section has <5 mentions and gets a partial bonus, or it has 5+ and gets the full bonus — no differentiation within the 6+ population. The cap should likely move to 10–15 once the migration is running, validated against the actual ranking outputs. The "MENTIONS edge distribution sanity check" query in section 8 is the right tool to watch for this.
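The saturation effect is easy to see by computing the entity-richness bonus for a few mention counts under the current cap and under a candidate cap. The counts below and the candidate cap of 12 are illustrative picks inside the 10–15 range suggested above, not measured values.

```python
def richness_bonus(mentions, cap, weight=0.10):
    """Entity-richness term of the hybrid score for a given cap."""
    return min(mentions, cap) / cap * weight


def distinct_levels(counts, cap):
    """How many different bonus values a set of sections actually receives."""
    return len({round(richness_bonus(n, cap), 6) for n in counts})


sample_counts = [3, 6, 9, 14]  # hypothetical per-section mention counts

for cap in (5, 12):
    bonuses = [round(richness_bonus(n, cap), 3) for n in sample_counts]
    print(f"cap={cap}: bonuses={bonuses}, distinct levels={distinct_levels(sample_counts, cap)}")
```

With the cap at 5, every section at 6 or more mentions lands on the same bonus, so only two levels appear; raising the cap spreads the same sections across more levels, which is what lets the signal differentiate.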
The manually curated Concept list (20 nodes) lives in canonical_entities.json but is regenerated by canonicalize_entities.py on every run. The Decision 4 note proposes the long-term fix: save the curated list to scripts/concepts_curated.json and have the canonicalization script read-and-preserve it rather than regenerating from scratch. Until that's done, treat the canonical list as a brittle artifact and back it up before rerunning canonicalization.
The migration plan describes Phase 2 (Entity Extraction) and Phase 3 (Relationship Mapping) as sequential, but populate_projects() in populate_neo4j_graph.py creates Project nodes and DEMONSTRATES / USES_METHOD / USES_TECHNOLOGY edges in the same pass via MERGE. What's actually new in Phase 3 is the cross-cutting relationships: DESCRIBED_IN, MENTIONS, NEXT_SECTION. This is fine but worth noting if anyone wonders why "relationship mapping" feels lighter than the bullet list suggests.
The schema retains the Chunk node and HAS_CHUNK edge as a fallback for fine-grained retrieval. If evaluation shows that full-section retrieval is consistently better, chunks become an unused branch — simplify the schema or document explicitly why they remain.
The schema defines a RELATED_TO {similarity: float} edge for project-to-project similarity derived from shared skills/methods. Not implemented in Phase 3 — it's a nice-to-have that can wait until the rest of the graph is stable.
The deterministic phase of canonicalization is Title-Casing acronyms, producing forms like Rdf, Skos, Nlp, Rag, Openai, Chromadb, Transe, Fastapi, Sqlmodel, Weather Api in canonical_entities.json. Some acronyms are preserved correctly (SHACL, JSON, CSV, HTML), but only because the extraction model happened to emit them uppercase — no normalization rule kept them that way.
Root cause. Phase 1 uses max(variants, key=len) to pick the canonical form from each lowercased group — a heuristic designed to prefer "Knowledge Graphs" over "Knowledge Graph". When the extractor emits "Rag" as the only variant in its group, "Rag" wins by default. "RAG" and "Rag" have identical length, so the length rule can't break the tie even when both appear.
Fix for next rebuild. Add a pre-pass to Phase 1 that prefers ALL-CAPS variants when they exist in a group: if any variant is fully uppercase and ≤6 characters, prefer it regardless of length. Catches the common cases without a hand-maintained allowlist. A named-acronyms list would be cleaner but the simple rule covers ~95% of real-world ambiguity. Lives in canonicalize_entities.py only — no schema or graph changes needed.
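A minimal sketch of the proposed pre-pass, assuming the variants for one group have already been collected. The function name `pick_canonical` and the `max_acronym_len` parameter are illustrative; only the `max(variants, key=len)` fallback and the "fully uppercase and ≤6 characters" rule come from the description above.

```python
def pick_canonical(variants, max_acronym_len=6):
    """Choose the canonical form for one group of name variants."""
    # Pre-pass: prefer a short, fully uppercase variant when one exists.
    acronyms = [v for v in variants if v.isupper() and len(v) <= max_acronym_len]
    if acronyms:
        return max(acronyms, key=len)
    # Existing heuristic: the longest variant wins.
    return max(variants, key=len)


print(pick_canonical(["Rag", "RAG"]))
print(pick_canonical(["Knowledge Graph", "Knowledge Graphs"]))
print(pick_canonical(["Rag"]))
```

The last call shows the limit of the rule: when the extractor only ever emits "Rag", there is no uppercase variant to prefer, so that case still needs either a named-acronyms list or a tighter extraction prompt.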
Current data shows 20 names appearing as both Skill AND Method, 8 as Skill AND Technology, and 3 names (Rdf, Skos, Weather Api) in all three pools. The code is behaving exactly as written — type pools canonicalize independently with no cross-pool dedup step.
What Decision 4 actually says. "Keep pools separate" meant don't let NLP-the-skill accidentally merge with NLP-the-technology during normalization. It said nothing about whether the same name can legitimately hold nodes in two pools.
Where overlap is defensible. (:Skill {name: "Rag"}) = capability Barbara has. (:Method {name: "Rag"}) = retrieval approach a project uses. Two different graph roles, both correct. Same for Docker / Knowledge Graph — the technology is the tool; the skill is using it.
Where it's an extraction-quality issue. Rdf, Skos, Weather Api appearing in all three pools is the extractor seeing "uses RDF" and tagging it as a Skill rather than a Technology choice. These should pin to Technology only.
Decision. Leave it for now — the current overlap is harmless because separate nodes correctly serve different Cypher queries. Next extraction pass should enforce a primary-type rule for the three triple-pool offenders. Fixable in extract_entities.py prompt tightening — no graph rebuild required.