Paired comparison across 58 questions, holding retrieval, system prompt, vector store, top-k, and temperature constant. Companion artifact to the blog post; intended for readers who want the numbers and the methodology behind them. Run date: May 11, 2026.
GPT-4.1 behaved like the more expansive portfolio guide. It was longer on 44 of 58 questions, used more follow-up invitations, and more often turned retrieved facts into a fuller narrative answer.
Gemini 2.5 Flash behaved like the more concise answerer. It was shorter on most questions, much cheaper, and often serviceable on closed factual prompts. Its main weakness was not missing facts: more often it under-developed the answer relative to the Digital Twin's intended voice and purpose.
The retrieval layer does not explain the differences. Top-k and the question set were held constant, and chunk-similarity stats are essentially identical across runs; the variation is in how each model used the same retrieved context.
Two runs of the same 58-question evaluation set, held constant on retrieval (Chroma vector store, top-k = 4), system prompt, and temperature (0.6). Only the model varied. Metrics fall into three buckets: behavioral counts, lexical overlap, and cost.
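The paired-run setup can be sketched as follows. This is a minimal illustration, not the actual harness: `ask` is a hypothetical helper that retrieves top-k chunks from the shared Chroma store, applies the fixed system prompt, and calls the named model, so only the model string differs between the two calls.

```python
from dataclasses import dataclass

TOP_K = 4          # retrieval depth, held constant across both runs
TEMPERATURE = 0.6  # sampling temperature, held constant across both runs

@dataclass
class PairedResult:
    question_id: str
    gpt_answer: str
    gemini_answer: str

def run_pair(question_id: str, question: str, ask) -> PairedResult:
    # Same question, same retrieval, same prompt and temperature;
    # the model name is the only variable.
    gpt = ask("gpt-4.1", question, top_k=TOP_K, temperature=TEMPERATURE)
    gem = ask("gemini-2.5-flash", question, top_k=TOP_K, temperature=TEMPERATURE)
    return PairedResult(question_id, gpt, gem)
```

Downstream metrics (word counts, Jaccard, coverage) are then computed per `PairedResult`, which is what makes the comparisons paired rather than pooled.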
The full 58-question set is too large to score by hand after every change, so a 15-question subset serves as my first-pass quality check for whether the Twin can make a strong, grounded impression across career, project, and technical conversations.
Six dimensions, each scored from 1 (weak) to 5 (strong). Total possible score per response: 30. This rubric structure is the scaffold for the next round of human-scored evaluation. The dimensions are deliberately separable so a response can be accurate but bland, or specific but ungrounded, and the rubric will say so.
| Dimension | What it asks | Low score (1-2) | High score (4-5) |
|---|---|---|---|
| Accuracy | Did the response get the facts right? | Contains a factual error or invented detail. | Every claim is supported by the knowledge base or known true. |
| Specificity | Did it avoid generic AI or career language? | Could describe any AI consultant. | Uses specifics only Barbara could provide. |
| Voice fidelity | Did it sound like Barbara? | Reads like a generic chatbot. | Reads like Barbara wrote it. |
| Strategic usefulness | Would this help a recruiter or peer understand her value? | Answers the question but doesn't advance the conversation. | Turns the answer into an invitation to explore further. |
| Grounding | Did it stay tied to retrieved or source material? | Drifts into plausible-but-unsourced claims. | Every concrete claim traces to a chunk. |
| Follow-up quality | Did the next-step invitation feel specific and natural? | Generic or absent. "Let me know if you have more questions." | Specific to the answer, opening a useful direction. |
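For the next human-scored round, the rubric above reduces to a small scoring structure. A sketch under the stated rules (six dimensions, each 1-5, total out of 30); the dimension keys are my own shorthand for the table rows:

```python
RUBRIC = (
    "accuracy",
    "specificity",
    "voice_fidelity",
    "strategic_usefulness",
    "grounding",
    "followup_quality",
)

def total_score(scores: dict) -> int:
    # Every dimension scored exactly once, each on the 1-5 scale.
    assert set(scores) == set(RUBRIC), "score all six dimensions"
    assert all(1 <= s <= 5 for s in scores.values()), "each score is 1-5"
    return sum(scores.values())  # maximum possible: 30
```

Keeping the dimensions separable in code mirrors the design intent: a response can score 5 on accuracy and 1 on voice fidelity, and the total will show the tension rather than hide it.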
| Metric | GPT-4.1 | Gemini 2.5 Flash | GPT minus Gemini |
|---|---|---|---|
| Mean words per response | 225.5 | 181.2 | +44.4 |
| Median words per response | 225.0 | 198.0 | +27.0 |
| SD of words | 103.5 | 93.5 | +10.1 |
| Min words | 5 | 26 | -21 |
| Max words | 438 | 385 | +53 |
| Mean sentences | 10.4 | 9.5 | +0.8 |
| Mean words per sentence | 22.5 | 18.7 | +3.8 |
| Mean Flesch-Kincaid grade | 12.2 | 11.3 | +0.9 |
| Mean Flesch reading ease | 47.7 | 47.5 | +0.2 |
| Type-token ratio | 0.86 | 0.87 | -0.01 |
| Follow-up flag count | 40 | 10 | +30 |
| Follow-up phrase count | 49 | 20 | +29 |
| Warm opener count | 6 | 18 | -12 |
| Linked answer count | 14 | 10 | +4 |
| Strong markdown count | 30 | 23 | +7 |
| Light markdown count | 15 | 32 | -17 |
| No markdown count | 13 | 3 | +10 |
| Mean project term count | 1.93 | 1.57 | +0.36 |
| Mean must-cover lexical coverage | 0.51 | 0.48 | +0.04 |
| Completion tokens | 17,145 | 32,897 | -15,752 |
| Prompt tokens | 397,713 | 421,279 | -23,566 |
| Total tokens | 414,858 | 454,176 | -39,318 |
| Estimated cost USD | $0.489 | $0.181 | +$0.309 |
Token counts are not directly comparable across providers because each tokenizer is different. Gemini reported more completion tokens despite shorter word counts. Word count and estimated dollar cost are the more interpretable comparisons.
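Estimated dollar cost is a straightforward function of the provider-reported token counts. The sketch below shows the shape of that calculation; the per-million-token prices are placeholders (real numbers come from the providers' pricing pages and change over time), so do not expect them to reproduce the table's dollar figures.

```python
# Placeholder per-million-token prices in USD; substitute current
# provider pricing before trusting any output of this function.
PRICES = {
    "gpt-4.1": {"prompt": 2.00, "completion": 8.00},
    "gemini-2.5-flash": {"prompt": 0.30, "completion": 2.50},
}

def run_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of a run from provider-reported token counts."""
    p = PRICES[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

Note that prompt tokens dominate both runs here (retrieved context is resent on every question), which is why the cost gap tracks pricing more than response length.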
Higher-level partition of the 58 questions into three types: bounded_open (most of the set), closed_fact (biographical and credential questions), and open_persona (the single playful question).
| Question type | N | GPT mean words | Gemini mean words | Delta words | Jaccard | GPT followups | Gemini followups |
|---|---|---|---|---|---|---|---|
| bounded_open | 46 | 245.5 | 202.7 | +42.8 | 0.20 | 31 | 10 |
| closed_fact | 11 | 147.9 | 99.4 | +48.5 | 0.22 | 9 | 0 |
| open_persona | 1 | 162.0 | 93.0 | +69.0 | 0.13 | 0 | 0 |
Finer-grained partition. Sorted by N descending; intents with only one question (the bottom rows) carry no statistical weight but are included for completeness.
| Intent | N | GPT mean words | Gemini mean words | Delta words | Jaccard | GPT followups | Gemini followups | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|---|
| technical_overview | 8 | 240.5 | 187.4 | +53.1 | 0.18 | 7 | 1 | 0.56 | 0.45 |
| career_history_fact | 5 | 177.6 | 138.8 | +38.8 | 0.24 | 5 | 0 | 0.50 | 0.49 |
| working_style | 5 | 289.6 | 200.2 | +89.4 | 0.16 | 3 | 0 | 0.31 | 0.33 |
| education_fact | 4 | 106.5 | 55.0 | +51.5 | 0.16 | 3 | 0 | 0.85 | 0.88 |
| methodology | 3 | 362.0 | 244.3 | +117.7 | 0.21 | 2 | 0 | 0.43 | 0.31 |
| project_walkthrough | 3 | 303.0 | 288.3 | +14.7 | 0.23 | 3 | 2 | 0.59 | 0.55 |
| tool_encouragement | 3 | 78.7 | 49.0 | +29.7 | 0.14 | 1 | 1 | 0.13 | 0.14 |
| knowledge_graph_overview | 2 | 355.0 | 351.5 | +3.5 | 0.28 | 1 | 0 | 0.62 | 0.58 |
| motivation | 2 | 209.0 | 204.5 | +4.5 | 0.24 | 0 | 0 | 0.77 | 0.65 |
| personal_background | 2 | 180.5 | 121.0 | +59.5 | 0.23 | 1 | 0 | 0.81 | 0.75 |
| product_architecture | 2 | 313.5 | 201.0 | +112.5 | 0.28 | 2 | 0 | 0.66 | 0.66 |
| tool_dice | 2 | 23.0 | 36.5 | -13.5 | 0.08 | 1 | 1 | 0.10 | 0.20 |
| career_next_step | 1 | 205.0 | 171.0 | +34.0 | 0.17 | 1 | 0 | 0.50 | 0.20 |
| communication_bridge | 1 | 351.0 | 338.0 | +13.0 | 0.16 | 1 | 0 | 0.25 | 0.50 |
| credential_fact | 1 | 119.0 | 139.0 | -20.0 | 0.34 | 1 | 0 | 0.83 | 0.83 |
| evaluation_capability | 1 | 298.0 | 206.0 | +92.0 | 0.29 | 0 | 0 | 0.82 | 0.82 |
| hallucination_reflection | 1 | 246.0 | 273.0 | -27.0 | 0.13 | 0 | 0 | 0.25 | 0.00 |
| knowledge_boundary_check | 1 | 191.0 | 92.0 | +99.0 | 0.23 | 1 | 0 | 0.29 | 0.14 |
| learning_style | 1 | 265.0 | 288.0 | -23.0 | 0.18 | 0 | 0 | 0.36 | 0.36 |
| persona_self_description | 1 | 263.0 | 233.0 | +30.0 | 0.15 | 1 | 1 | 0.40 | 0.50 |
| personal_professional_bridge | 1 | 207.0 | 246.0 | -39.0 | 0.23 | 0 | 0 | 0.38 | 0.50 |
| playful_persona | 1 | 162.0 | 93.0 | +69.0 | 0.13 | 0 | 0 | 0.29 | 0.29 |
| positioning | 1 | 288.0 | 255.0 | +33.0 | 0.21 | 1 | 0 | 0.36 | 0.36 |
| product_overview | 1 | 334.0 | 227.0 | +107.0 | 0.22 | 1 | 0 | 0.75 | 0.83 |
| project_detail | 1 | 401.0 | 357.0 | +44.0 | 0.22 | 1 | 1 | 0.70 | 0.50 |
| publication_fact | 1 | 173.0 | 121.0 | +52.0 | 0.28 | 0 | 1 | 0.60 | 0.50 |
| research_explanation | 1 | 230.0 | 213.0 | +17.0 | 0.33 | 1 | 0 | 0.86 | 0.79 |
| resource_location | 1 | 122.0 | 108.0 | +14.0 | 0.29 | 1 | 1 | 0.67 | 0.50 |
| tech_stack | 1 | 147.0 | 160.0 | -13.0 | 0.36 | 1 | 1 | 0.25 | 0.25 |
| ID | Question | Intent | GPT words | Gemini words | Delta | Jaccard | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|
| V3-019 | How do you approach problem-solving? | methodology | 438 | 243 | +195 | 0.17 | 0.67 | 0.50 |
| V3-027 | What makes Poolula Platform modular? | product_architecture | 391 | 206 | +185 | 0.21 | 0.69 | 0.77 |
| V3-022 | How do you handle ambiguity in projects? | working_style | 388 | 222 | +166 | 0.17 | 0.31 | 0.31 |
| V3-045 | What vector databases have you used? | technical_overview | 216 | 63 | +153 | 0.11 | 0.58 | 0.50 |
| V3-029 | What is the Beehive Photo Metadata Tracker? | project_walkthrough | 358 | 209 | +149 | 0.21 | 0.69 | 0.69 |
| V3-042 | How do you approach ML model development? | methodology | 375 | 240 | +135 | 0.21 | 0.25 | 0.25 |
| V3-040 | How do you document your projects? | working_style | 324 | 190 | +134 | 0.18 | 0.25 | 0.12 |
| V3-048 | How do you collaborate with engineering teams? | working_style | 277 | 151 | +126 | 0.13 | 0.38 | 0.46 |
| V3-044 | What is your approach to RAG systems? | technical_overview | 392 | 281 | +111 | 0.24 | 0.91 | 0.73 |
| V3-031 | What is Poolula Platform? | product_overview | 334 | 227 | +107 | 0.22 | 0.75 | 0.83 |
| V3-006 | Tell me about your dogs | personal_background | 279 | 174 | +105 | 0.28 | 0.62 | 0.50 |
| V3-047 | What's your experience with data engineering? | technical_overview | 343 | 241 | +102 | 0.18 | 0.56 | 0.44 |
| ID | Question | Intent | GPT words | Gemini words | Delta | Jaccard | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|
| V3-012 | Walk me through a project | project_walkthrough | 253 | 360 | -107 | 0.23 | 0.30 | 0.40 |
| V3-050 | What knowledge graph experience do you have? | knowledge_graph_overview | 337 | 385 | -48 | 0.27 | 0.67 | 0.67 |
| V3-056 | What is the thing you like to do the most for fun and work | personal_professional_bridge | 207 | 246 | -39 | 0.23 | 0.38 | 0.50 |
| V3-037 | What programming languages are featured in your projects? | technical_overview | 181 | 208 | -27 | 0.18 | 0.36 | 0.29 |
| V3-039 | What visualization tools do you use in your projects? | technical_overview | 262 | 289 | -27 | 0.16 | 0.60 | 0.30 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 246 | 273 | -27 | 0.13 | 0.25 | 0.00 |
| V3-017 | What drives you? | motivation | 184 | 209 | -25 | 0.22 | 0.67 | 0.67 |
| V3-023 | What's your philosophy on learning new technologies? | learning_style | 265 | 288 | -23 | 0.18 | 0.36 | 0.36 |
| V3-053 | Roll a die for me | tool_dice | 5 | 26 | -21 | 0.06 | 0.20 | 0.20 |
| V3-011 | What certifications do you have? | credential_fact | 119 | 139 | -20 | 0.34 | 0.83 | 0.83 |
| V3-021 | What's your take on documentation? | working_style | 251 | 266 | -15 | 0.14 | 0.40 | 0.30 |
| V3-030 | What technologies does the Fitness Dashboard use? | tech_stack | 147 | 160 | -13 | 0.36 | 0.25 | 0.25 |
Pairs where the two models took the most different approaches. Tool-related and free-form questions dominate.
| ID | Question | Intent | Jaccard |
|---|---|---|---|
| V3-053 | Roll a die for me | tool_dice | 0.06 |
| V3-010 | What was your role at UT Austin after returning from MIT? | career_history_fact | 0.09 |
| V3-054 | I need to make a random decision between options | tool_dice | 0.09 |
| V3-003 | Did you play any sports in college? | education_fact | 0.10 |
| V3-055 | Remind me of your mantra and send it to me | tool_encouragement | 0.11 |
| V3-045 | What vector databases have you used? | technical_overview | 0.11 |
| V3-043 | What LLM frameworks do you work with? | technical_overview | 0.12 |
| V3-048 | How do you collaborate with engineering teams? | working_style | 0.13 |
| V3-058 | How many twins are there and do you get jealous? | playful_persona | 0.13 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 0.13 |
| V3-021 | What's your take on documentation? | working_style | 0.14 |
| V3-051 | I need some encouragement today | tool_encouragement | 0.14 |
Pairs where the two models converged. Bounded factual and technical-stack questions dominate.
| ID | Question | Intent | Jaccard |
|---|---|---|---|
| V3-035 | What did you build at Metric5? | career_history_fact | 0.38 |
| V3-030 | What technologies does the Fitness Dashboard use? | tech_stack | 0.36 |
| V3-011 | What certifications do you have? | credential_fact | 0.34 |
| V3-013 | How was this digital twin built | product_architecture | 0.34 |
| V3-005 | What was your dissertation about? | research_explanation | 0.33 |
| V3-008 | What shifted your focus to neuroscience? | career_history_fact | 0.31 |
| V3-032 | What evaluation features does Poolula have? | evaluation_capability | 0.29 |
| V3-036 | Where can I find documentation for your projects? | resource_location | 0.29 |
| V3-015 | Tell me about your knowledge graph work | knowledge_graph_overview | 0.29 |
| V3-046 | What cloud platforms do you have experience with? | technical_overview | 0.28 |
| V3-006 | Tell me about your dogs | personal_background | 0.28 |
| V3-033 | What was your notable publication from MIT? | publication_fact | 0.28 |
Questions where GPT's lexical grounding score exceeded Gemini's most.
| ID | Question | Intent | GPT cover | Gemini cover | Delta |
|---|---|---|---|---|---|
| V3-025 | What matters most to you in a role? | career_next_step | 0.50 | 0.20 | +0.30 |
| V3-039 | What visualization tools do you use in your projects? | technical_overview | 0.60 | 0.30 | +0.30 |
| V3-002 | What honors program were you part of at UT Austin? | education_fact | 1.00 | 0.75 | +0.25 |
| V3-020 | What excites you about AI/ML work? | motivation | 0.88 | 0.62 | +0.25 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 0.25 | 0.00 | +0.25 |
| V3-028 | Tell me about your Fitness Dashboard project | project_walkthrough | 0.78 | 0.56 | +0.22 |
| V3-038 | Tell me about your beehive tracker's metadata analysis | project_detail | 0.70 | 0.50 | +0.20 |
| V3-041 | What's your approach to evaluation in ML systems? | methodology | 0.36 | 0.18 | +0.18 |
| V3-044 | What is your approach to RAG systems? | technical_overview | 0.91 | 0.73 | +0.18 |
| V3-019 | How do you approach problem-solving? | methodology | 0.67 | 0.50 | +0.17 |
Questions where Gemini's lexical grounding score exceeded GPT's most.
| ID | Question | Intent | GPT cover | Gemini cover | Delta |
|---|---|---|---|---|---|
| V3-004 | Where did you do your PhD? | education_fact | 0.67 | 1.00 | -0.33 |
| V3-024 | How do you balance technical depth with accessibility? | communication_bridge | 0.25 | 0.50 | -0.25 |
| V3-018 | What's your communication style? | working_style | 0.22 | 0.44 | -0.22 |
| V3-054 | I need to make a random decision between options | tool_dice | 0.00 | 0.20 | -0.20 |
| V3-055 | Remind me of your mantra and send it to me | tool_encouragement | 0.29 | 0.43 | -0.14 |
| V3-056 | What is the thing you like to do the most for fun and work | personal_professional_bridge | 0.38 | 0.50 | -0.12 |
| V3-012 | Walk me through a project | project_walkthrough | 0.30 | 0.40 | -0.10 |
| V3-016 | How would you describe yourself? | persona_self_description | 0.40 | 0.50 | -0.10 |
| V3-008 | What shifted your focus to neuroscience? | career_history_fact | 0.73 | 0.82 | -0.09 |
| V3-031 | What is Poolula Platform? | product_overview | 0.75 | 0.83 | -0.08 |
How often each model named one of the portfolio projects or concepts across the full 58-question run.
| Project or concept term | GPT | Gemini | Delta |
|---|---|---|---|
| Knowledge graph | 22 | 15 | +7 |
| Resume Explorer | 21 | 8 | +13 |
| RAG / retrieval | 15 | 21 | -6 |
| Digital Twin | 14 | 15 | -1 |
| Evaluation | 14 | 12 | +2 |
| Concept Cartographer | 9 | 4 | +5 |
| Beehive Tracker | 8 | 5 | +3 |
| Fitness Dashboard | 4 | 7 | -3 |
| Weaving Memories | 4 | 4 | 0 |
| ChronoScope | 1 | 0 | +1 |
Paired snippets where the two models diverged the most. Useful for getting a feel for what "different personality" looks like in practice.
| ID | Question | Jaccard | GPT-4.1 | Gemini 2.5 Flash |
|---|---|---|---|---|
| V3-053 | Roll a die for me | 0.06 | Rolling the die for you... | Happy to! Let's see what you get. I'll roll the die for you now. Tool Call: dice_roll(). Tool Output: 5. You rolled a 5! What's next? |
| V3-055 | Remind me of your mantra and send it to me | 0.11 | Of course! My mantra is: "I can, I will, and I shall!" It's my way of reminding myself that persistence and self-belief matter more than waiting for the perfect moment... | My mantra is, "I can, I will, and I shall!" It's a reminder to trust myself and keep pushing forward, especially when facing new challenges. I'd be happy to send it to you. Could I get your name... |
| V3-045 | What vector databases have you used? | 0.11 | The main vector databases I've used in my work are: ChromaDB (my go-to for RAG prototypes and portfolio projects like this Digital Twin)... | I've worked with a few different vector databases as part of my current technical stack. I've primarily used ChromaDB in my projects, and I also have exposure to Pinecone, Weaviate, and FAISS... |
| V3-048 | How do you collaborate with engineering teams? | 0.13 | When I collaborate with engineering teams, I focus on being a bridge between business goals and technical implementation. That means spending real time up front understanding what problem we're solving... | That's a great question, and it's something I think about a lot! My collaboration style with engineering teams really centers on transparent communication and building a shared understanding of the problem... |
This is a model behavior comparison built from text-level diagnostics. It is not a human-scored quality bakeoff. The evaluator score and notes columns in the run CSVs were intentionally left blank in this pass so the analysis would rest only on metrics that can be computed reproducibly.
The must-cover lexical coverage measure is a smoke test for grounding, not a correctness score. It checks whether expected terms appear, not whether the answer is true or well-framed. Jaccard overlap measures vocabulary similarity, not equivalence of meaning. Two responses can score low on Jaccard and still both be correct (for example V3-010, where GPT and Gemini took different but valid routes through the same biographical period).
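Both measures are simple set operations over lowercased word tokens. A minimal sketch, consistent with the definitions above (the real harness may tokenize differently, e.g. stripping punctuation):

```python
def jaccard(a: str, b: str) -> float:
    """Vocabulary overlap between two responses: |A ∩ B| / |A ∪ B|
    over lowercased word sets. Measures shared wording, not shared meaning."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def must_cover(answer: str, expected_terms: list) -> float:
    """Fraction of expected terms that appear in the answer.
    A grounding smoke test, not a correctness score."""
    text = answer.lower()
    hits = sum(1 for term in expected_terms if term.lower() in text)
    return hits / len(expected_terms) if expected_terms else 0.0
```

Because `jaccard` compares word sets, two fluent answers built from different vocabulary can both be correct and still score near the bottom of the table, which is exactly the V3-010 case.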
Longer is not the same as better. GPT's expansiveness helped on open-ended methodology questions and hurt on simple tool calls (V3-053, the die roll, where its 5-word response was thinner than Gemini's 26-word version).
Three concrete next steps for the evaluation infrastructure:
For the public-facing Digital Twin at twin.barbhs.com, the default is GPT-4.1. The richer framing and follow-up invitations are worth the extra cost for visitors who arrive without a clear question in mind. For internal regression and smoke tests after knowledge-base changes, the default is Gemini 2.5 Flash. The principle generalizes: model choice is a personality decision matched to a use case, not a single quality ranking.
Run date: May 11, 2026. Evaluation set: 58 questions across 28 intent labels. Retrieval held constant at top-k = 4 against a shared ChromaDB vector store. System prompt and temperature (0.6) held constant. Token counts and costs sourced from provider responses. Behavioral counts are dictionary-based heuristics; see the Methodology section for definitions. NLP metrics are intended to surface patterns, not replace human judgment.
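For concreteness, the dictionary-based behavioral counts work roughly like this. The phrase lists below are illustrative stand-ins, not the actual dictionaries used in the run:

```python
import re

# Illustrative phrase dictionaries; the real lists live in the eval harness.
FOLLOWUP_PHRASES = ["would you like", "want me to", "let me know", "feel free to ask"]
WARM_OPENERS = ["great question", "happy to", "glad you asked", "thanks for asking"]

def count_followups(text: str) -> int:
    """Total occurrences of follow-up invitation phrases in a response."""
    t = text.lower()
    return sum(t.count(phrase) for phrase in FOLLOWUP_PHRASES)

def has_warm_opener(text: str) -> bool:
    """True if the first sentence contains a warm-opener phrase."""
    first_sentence = re.split(r"[.!?]", text, maxsplit=1)[0].lower()
    return any(phrase in first_sentence for phrase in WARM_OPENERS)
```

Heuristics like these are cheap and reproducible, which is the point of this pass; their blind spots (paraphrased invitations, sarcastic warmth) are part of why the human-scored rubric round comes next.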