# Asymmetry Inventory — `asymmetric_v1` eval pipeline

## Plain English Summary

This document lists every way the Wave 14 LoCoMo benchmark numbers on graph.mnemoverse.com are **not yet a clean head-to-head comparison**. Each memory system on that page (Mem0, Supermemory, Zep, Mnemoverse, and the rest) was run under conditions that are very close, but not identical. The differences are small individually — a reader prompt that hints at which conversation turn matters, a retrieval depth cap that lets one system see more context than another, a judge prompt inherited from a vendor's own published evaluation harness — but they add up to a few accuracy points, which is the same order of magnitude as the gaps between systems. We chose to publish the numbers with the differences fully disclosed rather than wait until every condition is uniform, because the field needs shared data to argue about and a year-long silence would help no one.

The inventory below is the per-difference record. Each entry has a stable ID (`ASYM-NNN`), a one-line label, a description of what is different and for which systems, an estimated direction and magnitude of effect where we can estimate it, and a pointer to the run logs or code where the difference is visible. The dashboard links each cell to the subset of entries that apply to that specific number, so a reader can see exactly which conditions shaped a given score. This document is the source of truth for that disclosure; if a number on the dashboard contradicts an entry here, the entry here wins and the dashboard is the bug. The next evaluation round will tighten the contract, shrink this list, and ship under a new name — `asymmetric_v2` or whatever replaces it — at which point this document is frozen as the historical record for Wave 14.

> This file enumerates the known measurement asymmetries between the
> Mnemoverse runner path (`evaluate.py` + `run_locomo.py`) and the
> competitor runner path (`_runner_main.py` + `_query_loop.py`) in the
> `asymmetric_v1` eval pipeline used for Wave 14 (and the earlier
> Phase B1 / Phase C1 cells whose `config.eval_path` field marks them
> `asymmetric_v1`). All such cells reference the findings below by
> `ASYM-NNN` id.
>
> Source: an adversarial code-path audit run on 2026-06-05. Two lenses
> are catalogued here — code-path-asymmetry and silent-failure — at
> severities `critical`, `high`, and `medium`. Three further lenses
> (config-tampering, judge-bias, comparability) ran on the same audit
> and produced 17 additional `critical`/`high` findings; those are
> not yet promoted into this inventory because their evidence is
> publication-layer (cell-set comparability rather than per-cell code
> sites) and is addressed separately in the `WAVE14_MORNING_BRIEFING.md`
> §Disclaimer and in the deferred publication-layer audit. **A reader
> of this inventory should not assume the 30 items below are the full
> set of known measurement issues; they are the code-site-anchored
> subset.**

## Summary

- total findings catalogued here: **30**
- FAVORS_MNEMOVERSE (active): **16** (inflate Mnemoverse scores; ASYM-029 added 2026-06-11 — async competitor stores queried before they settle; ASYM-030 added 2026-06-11 — in-proc read budget 120s vs production 10s, intra-Mnemoverse)
- FAVORS_COMPETITOR (active): **1** (ASYM-009 reader-max-tokens floor — small in practice)
- UNKNOWN / BOTH: **10** (ASYM-028 added 2026-06-10 — LoCoMo-tuned shared reader prompt, symmetric across systems, spun out of ASYM-002)
- closed (no longer active for new runs): **3** (ASYM-024 — closed in PR #292; ASYM-023 — zep limit cap fixed at `zep_adapter.py:519`; ASYM-025 — ingest content parity 2026-06-11; historical cells still affected)

Severity breakdown:

- critical: 4
- high: 11 (incl. ASYM-023/024 now closed for new runs; ASYM-029 added 2026-06-11)
- medium: 15

Direction by severity (count of findings in the body):

| severity | FAVORS_MNEMOVERSE | FAVORS_COMPETITOR (active) | CLOSED | UNKNOWN/BOTH |
|---|---|---|---|---|
| critical | 4 | 0 | 0 | 0 |
| high | 4 | 1 | 2 (ASYM-023, ASYM-024) | 4 |
| medium | 8 | 0 | 1 (ASYM-025) | 6 |
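
The tallies above can be cross-checked mechanically. The following is a minimal sketch, assuming the table rows are transcribed by hand into a dict; the names `TABLE`, `SEVERITY_TOTALS`, and `column_totals` are illustrative and not part of the pipeline.

```python
# Counts transcribed from the direction-by-severity table above.
# Column order: FAVORS_MNEMOVERSE, FAVORS_COMPETITOR, CLOSED, UNKNOWN/BOTH.
TABLE = {
    "critical": (4, 0, 0, 0),
    "high": (4, 1, 2, 4),
    "medium": (8, 0, 1, 6),
}
# Counts transcribed from the severity breakdown list.
SEVERITY_TOTALS = {"critical": 4, "high": 11, "medium": 15}


def column_totals(table):
    """Sum each direction column across all severities."""
    return tuple(sum(col) for col in zip(*table.values()))


# Each table row must add up to its severity total.
for severity, row in TABLE.items():
    assert sum(row) == SEVERITY_TOTALS[severity], severity

favors_mnemo, favors_comp, closed, unknown = column_totals(TABLE)
total = favors_mnemo + favors_comp + closed + unknown
print(favors_mnemo, favors_comp, closed, unknown, total)  # 16 1 3 10 30
```

A check like this is cheap to rerun whenever a finding is added or closed, which is when the summary counts tend to drift.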

The bias remains skewed: 16 of 30 findings actively tilt the published numbers
in Mnemoverse's favour, and only 1 active finding clearly disadvantages
Mnemoverse (ASYM-009, the reader `max_tokens` floor, small in practice).
ASYM-024 (HTTP adapter omits `two_pass`) was the dominant term in the
observed `mnemoverse_http` vs `mnemoverse_engine` gap and is now closed
in PR #292. The closure added ASYM-027 (HTTP `strategy="auto"` →
server-side classifier; in-proc baseline → no classifier) as a
DISCLOSED kept-algorithm-advantage per Eduard's plan-185 mandate
("get better results through the API than locally, without cheating").
The 10 UNKNOWN/BOTH-direction findings are genuinely two-sided or
unattributable. Any cross-system claim drawn from these cells must
account for the heavy directional skew above, not just the raw count
of findings. Of the 16 active Mnemoverse-favoring findings, 14 are vs
competitors; ASYM-027 (http-row vs engine-row) and ASYM-030 (in-proc
read budget vs production) are intra-Mnemoverse. ASYM-023, closed for
new runs, tilted the historical wave14d cells.

> **ASYM-024 closed in PR #292 (2026-06-10).** The HTTP adapter now
> POSTs to `/api/v1/memory/read-batch` with `two_pass=True` BY DEFAULT —
> matching the in-proc baseline on the `two_pass` axis. The I4
> `--no-two-pass` knob (PR #302) exists solely for the G2-B pre-fix
> baseline run; cells produced with it get ASYM-024 REOPENED in their
> `known_asymmetries` stamp and must never be published as headline rows.
> The body ALSO carries `strategy="auto"` as a **DISCLOSED kept-
> algorithm-advantage per ASYM-027** (in-proc baseline omits the field,
> so HTTP gets the server-side StrategyClassifier and in-proc does not).
> Per Eduard's plan-185 mandate ("get better results through the API
> than locally, without cheating"), the kept-advantage is disclosed, not hidden.
> Historical `mnemoverse_http` cells emitted BEFORE the closure commit
> still carry the asymmetry and **must not be re-cited against
> `mnemoverse_engine` without a re-run on the post-#292 adapter**. New
> cells produced after this PR merges are clean on the ASYM-024 axis
> AND carry ASYM-027 disclosure; other inventory items still apply.

## Findings

### ASYM-001 — Mnemoverse drops category=5 adversarial questions (n=152); competitors evaluated on n=199

| field | value |
|---|---|
| severity | critical |
| direction | FAVORS_MNEMOVERSE |
| evidence | `experiments/benchmarks/locomo/dataset.py:23` — `DEFAULT_CATEGORIES = [1, 2, 3, 4]  # skip adversarial`; `experiments/benchmarks/locomo/evaluate.py:1045-1046` — `qa_items = [qa for qa in conv.qa_items if qa.category in cats]`. In contrast, `experiments/benchmarks/competitors/_query_loop.py:530-531` (`_resolve_qa_items`) returns `list(conv.qa_items)` — no category filter. Cell-level: `phase-c/wave14c/cell_supermemory_locomo_conv26_n199_k10.json` contains 47 `category: 5` entries; `night-runs/cell_2c_engine_locomo_conv26_k10.json` contains 0. |
| applies_to | ALL_ASYMMETRIC_V1 (every Mnemoverse `n_questions=152` cell vs every competitor `n_questions=199` cell — supermemory, zep, mem0_v3_cloud Wave 14b/14c/14d) |
| effect | Cat=5 is LoCoMo's adversarial / unanswerable subset; competitors score ~0 on most. Honest Supermemory@k=10 ≈ 61/152 = 0.401 vs reported 0.3065 over n=199 — narrows Mnemoverse-vs-Supermemory gap by ~10pp |
| fix_next_round | Single runner over identical question set: either (a) Mnemoverse re-run with `categories=[1,2,3,4,5]` and publish n=199, or (b) competitor runner mirrors `DEFAULT_CATEGORIES`. Proper-architecture round eliminates by using one harness for all systems |
| status | documented |
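
The denominator effect in ASYM-001 can be reproduced with a few lines. This is a sketch on synthetic rows, assuming a per-row layout with `category` and `score` keys (hypothetical names) and assuming every category-5 row scores 0, which is the simplification behind the 61/152 estimate above.

```python
def accuracy(rows, categories=None):
    """Mean of per-row binary scores, optionally restricted to some categories."""
    kept = [r for r in rows if categories is None or r["category"] in categories]
    return sum(r["score"] for r in kept) / len(kept)


# Synthetic cell shaped like the Supermemory k=10 case: 61 correct and 91 wrong
# answers in categories 1-4, plus 47 category-5 rows that all score 0.
rows = (
    [{"category": 1, "score": 1.0}] * 61
    + [{"category": 1, "score": 0.0}] * 91
    + [{"category": 5, "score": 0.0}] * 47
)

full = accuracy(rows)                      # over n=199
answerable = accuracy(rows, {1, 2, 3, 4})  # over n=152
print(round(full, 4), round(answerable, 4))  # 0.3065 0.4013
```

The same function applied to real per-row scores would give the exact like-for-like number instead of the estimate.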

### ASYM-002 — Reader prompt asymmetry: Mnemoverse gets per-QA category_instructions, competitor reader gets them stripped

| field | value |
|---|---|
| severity | critical |
| direction | FAVORS_MNEMOVERSE |
| evidence | `experiments/benchmarks/competitors/_runner_main.py:681` — `answer_prompt_template = ANSWER_PROMPT.replace("{category_instructions}\n", "")` with a comment acknowledging this is a known asymmetry ("spec §4 leaves the prompt as one-template-for-all"). Mnemoverse path `evaluate.py:524-528` builds `prompt = ANSWER_PROMPT.format(context=context, question=qa.question, category_instructions=instructions)` where `instructions = _category_hint(qa.category)` (`evaluate.py:463`) — per-category hints like "combine facts from MULTIPLE memories", "Pay attention to dates [YYYY-MM-DD]", "list ALL items, comma-separated" (`evaluate.py:51-71`). Reader-answer length confirms: Mnemoverse cell avg 109.2 chars; supermemory wave14c k=100 = 44.5 chars; zep wave14d k=200 = 40.9 chars. (Line ref is audit-time; the strip now lives at `_runner_main.py:943`.) NOTE 2026-06-10: this entry covers ONLY the per-QA `category_instructions` delta between paths; the LoCoMo-tuned BASE prompt shared by both paths was spun out as a separate harness-level disclosure — see ASYM-028. |
| applies_to | ALL_ASYMMETRIC_V1 (every competitor cell vs every Mnemoverse cell; multi-hop and temporal categories most affected) |
| effect | The Mnemoverse reader is told the expected answer format (list / date / multi-fact) per question; the competitor reader is not. Reader-answer length confirms the lift in practice: Mnemoverse cell average 109.2 chars; supermemory at k=100 = 44.5 chars; zep at k=200 = 40.9 chars. Shorter, less-formatted competitor answers are penalised more by every judge. Magnitude of the boost to Mnemoverse-side accuracy is not separately quantified — it is bundled with other asymmetries — but is large enough to be visible in answer length and matches the multi-hop / temporal categories where Mnemoverse cell-level accuracy is highest |
| fix_next_round | Single reader prompt path: shared `format_reader_prompt(qa, context)` module used by both runners. The proper-architecture single-runner round eliminates the asymmetry by construction |
| status | documented |

### ASYM-003 — mem0_v3_cloud k=100/k=200 cells published with `_query_failure_rate=1.0` but judge_aggregate still presented as real numbers

| field | value |
|---|---|
| severity | critical |
| direction | FAVORS_MNEMOVERSE |
| evidence | `.wt-mem0-cloud/experiments/results/phase-c/wave14b/cell_mem0_v3_cloud_locomo_conv26_n199_k200.json` — `judge_aggregate = {'mnemoverse': 0.0704, 'mem0': 0.0905, 'mem0-4o': 0.1005, 'strict': 0.0101, '_query_failure_rate': 1.0}`. 199/199 queries returned HTTP 429 quota_exceeded; all `retrieved_atom_ids` empty; 18 reader answers empty. The four judge numbers are reader-on-empty-context baselines, NOT real measurements. Cell JSON carries no `NOT_REAL_MEASUREMENT` flag; `_query_failure_rate` sits inside `judge_aggregate` next to the four pseudo-numbers and is easy to miss |
| applies_to | `phase-c/wave14b/cell_mem0_v3_cloud_locomo_conv26_n199_k100.json`, `phase-c/wave14b/cell_mem0_v3_cloud_locomo_conv26_n199_k200.json` |
| effect | Inflates Mnemoverse-vs-Mem0 gap dramatically at k=100 and k=200 — published competitor numbers reflect a 0%-retrieval baseline, not the system under test |
| fix_next_round | Hard rule in shared runner: `_query_failure_rate > 0.5` ⇒ `headline_metric=null`, `judge_aggregate=null`, `invalidated_reason='query_failure_rate>0.5'`. Re-emit existing cells with the gate applied |
| status | documented |
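
The hard rule proposed for ASYM-003 might look like the sketch below. It assumes the cell is a plain dict with a top-level `headline_metric` key and that the failure rate sits inside `judge_aggregate` as it does today; the function name `apply_failure_gate` and the preserved top-level `query_failure_rate` key are hypothetical.

```python
import copy

FAILURE_GATE = 0.5  # threshold named in fix_next_round above


def apply_failure_gate(cell, threshold=FAILURE_GATE):
    """Return a copy of `cell` with judge numbers nulled when retrieval mostly failed."""
    gated = copy.deepcopy(cell)
    rate = (gated.get("judge_aggregate") or {}).get("_query_failure_rate", 0.0)
    if rate > threshold:
        gated["headline_metric"] = None
        gated["judge_aggregate"] = None
        gated["invalidated_reason"] = f"query_failure_rate>{threshold}"
        gated["query_failure_rate"] = rate  # hypothetical key: keep the diagnostic visible
    return gated


# Values copied from the k=200 mem0_v3_cloud cell quoted in the evidence row.
cell = {
    "headline_metric": 0.0704,
    "judge_aggregate": {
        "mnemoverse": 0.0704, "mem0": 0.0905, "mem0-4o": 0.1005,
        "strict": 0.0101, "_query_failure_rate": 1.0,
    },
}
gated = apply_failure_gate(cell)
print(gated["judge_aggregate"], gated["invalidated_reason"])
```

Working on a copy keeps the raw cell intact, so the re-emit step stays auditable against the original file.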

### ASYM-004 — `mnemoverse` judge label refers to TWO DIFFERENT prompts depending on path

| field | value |
|---|---|
| severity | critical |
| direction | FAVORS_MNEMOVERSE |
| evidence | Mnemoverse path: `evaluate.py:549` → `llm_client.judge(...)` → `llm_client.py:472` uses `JUDGE_SYSTEM_PROMPT_BINARY` → output format `Score: <0.0 or 1.0>\nReasoning: ...` (`llm_client.py:70-83`). Competitor path: `_runner_main.py:269-271` → `_LLMJudgeAdapter` → `judges.py:179-186` Judge id=`mnemoverse` with `_MNEMO_USER` (`judges.py:137-147`) — returns JSON `{reasoning, label: CORRECT|WRONG}`. Different system text, different output schema, different parser. `night-runs/cell_2c_engine_locomo_conv26_k10.json` has per-row `judge_tokens.mnemoverse` (llm_client.judge path); `phase-c/wave14c/cell_supermemory_*.json` carries `judge_prompt_hashes.mnemoverse: 703651c0...` (sha256 of `_MNEMO_SYSTEM\n_MNEMO_USER`). Both publish to `judge_aggregate.mnemoverse` under the identical column name. Empirical pattern across cells: Mnemoverse cells (binary-text prompt) tend to score systematically higher under the column than competitor cells (JSON-CORRECT/WRONG prompt) on comparable answer quality — visible in the briefing's `mnemoverse` column where Mnemoverse and naked_cosine (both binary-prompt) sit roughly 15-30 pp above what a strict-grader interpretation of competitor cells would imply |
| applies_to | ALL_ASYMMETRIC_V1 — every cell where the column `judge_aggregate.mnemoverse` is compared across systems |
| effect | Different prompts ⇒ different lenience ⇒ the column is not a comparable metric. The binary-text prompt empirically produces higher scores on the Mnemoverse path than the JSON-label prompt does on the competitor path, so the column inflates Mnemoverse and naked_cosine numbers relative to competitor numbers. Magnitude not separately ablated — to ablate, run both prompts on the same answer set via `competitors/rejudge_comparison.py`. Until that ablation runs, the column is documented as biased in the Mnemoverse direction rather than truly unknown |
| fix_next_round | Pick ONE prompt as canonical. Recommendation: deprecate `llm_client.judge()`, route Mnemoverse path through `judges.py get_judge('mnemoverse') + score_case`, then rejudge Mnemoverse night-runs cells via the existing `competitors/rejudge_comparison.py`. Proper-architecture round mandates single judge module shared by both runners |
| status | documented |
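
One way to stop two prompts from sharing a column is to compare prompt hashes before any cross-cell comparison. The sketch below assumes the hashing scheme described in the evidence row (sha256 over system text, newline, user text) and assumes both cells would carry a `judge_prompt_hashes` mapping; the prompt strings are placeholders, and `comparable` is a hypothetical helper.

```python
import hashlib


def prompt_hash(system_prompt, user_prompt):
    """sha256 over system + newline + user, the scheme described for judge_prompt_hashes."""
    return hashlib.sha256(f"{system_prompt}\n{user_prompt}".encode("utf-8")).hexdigest()


def comparable(cell_a, cell_b, judge_id):
    """True only when both cells record the same prompt hash for `judge_id`."""
    hash_a = cell_a.get("judge_prompt_hashes", {}).get(judge_id)
    hash_b = cell_b.get("judge_prompt_hashes", {}).get(judge_id)
    return hash_a is not None and hash_a == hash_b


# Placeholder prompt texts; the real prompts live in llm_client.py and judges.py.
binary = prompt_hash("binary system text", "binary user text")
json_label = prompt_hash("json system text", "json user text")

competitor_cell = {"judge_prompt_hashes": {"mnemoverse": json_label}}
mnemoverse_cell = {"judge_prompt_hashes": {"mnemoverse": binary}}
print(comparable(competitor_cell, mnemoverse_cell, "mnemoverse"))  # False
```

A missing hash is treated as not comparable, so cells that never recorded their prompt cannot pass the check by omission.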

### ASYM-005 — Context formatter asymmetry: Mnemoverse reader sees relevance % + chronological sort; competitor reader sees positional index only

| field | value |
|---|---|
| severity | high |
| direction | FAVORS_MNEMOVERSE |
| evidence | Mnemoverse `evaluate.py:475-478`: `context_items = _sort_for_context(items[:effective_context_k], qa.category)` (`evaluate.py:166-175` sorts by `[YYYY-MM-DD]` extracted from content when `category==2` temporal); formatted as `- [{int(item.relevance * 100)}%] {item.content}`. Competitor `_query_loop.py:566-572`: `for i, item in enumerate(retrieved): lines.append(f"- [{i + 1}] {item.get('text') or item.get('content') or item}")` — positional index only, no relevance, no temporal sort. Mnemoverse temporal score = 0.8649 vs single_hop = 0.8286 (temporal is the BEST category, plausibly due to the sort) |
| applies_to | ALL_ASYMMETRIC_V1 (every competitor cell vs every Mnemoverse cell; temporal category most affected) |
| effect | Relevance % is a prior signal to gpt-5-mini ("trust higher-percentage items more"); chronological sort directly aids temporal extraction. Reader has structural advantages on Mnemoverse-formatted context |
| fix_next_round | Single shared `format_reader_context(items, category)` module imported by both runners — proper-architecture round routes all cells through one formatter |
| status | documented |

### ASYM-006 — `judge_*_tokens` always 0 in every competitor cell (interceptor never increments)

| field | value |
|---|---|
| severity | high |
| direction | FAVORS_MNEMOVERSE |
| evidence | `experiments/benchmarks/competitors/_query_loop.py:331` initialises `per_k_judge_totals = {jn: {'prompt_tokens': 0, 'completion_tokens': 0} for jn in judge_names}` and writes that dict back at lines 437-439 — but a grep across `competitors/` shows **zero** writes to `per_k_judge_totals[jn]['prompt_tokens']` or `['completion_tokens']` anywhere; the counter is never incremented inside the per-question judge call. Every Wave 14b/14c/14d cell ships `judge_mnemoverse_prompt_tokens: 0` (and all four judges, both prompt and completion). Mnemoverse matrix cell `cell_mnemoverse_engine_locomo_conv26_n152_k200.json` reports `cost_usd.judge = 0.0501` from the rejudge_cell.py path. `docs/WAVE14_MORNING_BRIEFING.md:34-36` already calls this out as BLOCKER-5 |
| applies_to | ALL Wave 14b/14c/14d competitor cells (supermemory, zep, mem0_v3_cloud); cost_usd comparisons across the matrix |
| effect | Every competitor cell's `cost_usd.judge` rounds to ~$0; Mnemoverse cells carry real judge costs. Any "cost per cell" or "cost efficiency" chart structurally understates competitor cost and overstates Mnemoverse cost relative to baselines that have $0 judge |
| fix_next_round | Wire `LLMClient.last_usage` into `_query_loop.py`'s per-judge try block (after `await judge.score(...)` accumulate into `per_k_judge_totals[judge.name]`). Single runner round eliminates by sharing the token-accounting path |
| status | documented |
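
The missing increment could be as small as the helper below. This is a sketch that assumes each judge call exposes its usage as a dict with `prompt_tokens` and `completion_tokens` keys; the actual shape of `LLMClient.last_usage` is not shown in this document, and `accumulate_usage` is a hypothetical name.

```python
def accumulate_usage(totals, judge_name, usage):
    """Add one judge call's token usage into the per-judge running totals."""
    bucket = totals.setdefault(
        judge_name, {"prompt_tokens": 0, "completion_tokens": 0}
    )
    for key in ("prompt_tokens", "completion_tokens"):
        bucket[key] += int((usage or {}).get(key, 0))
    return totals


# Same initial shape as per_k_judge_totals in the evidence row.
judge_names = ["mnemoverse", "mem0"]
per_k_judge_totals = {
    jn: {"prompt_tokens": 0, "completion_tokens": 0} for jn in judge_names
}

# Two simulated judge calls; in the runner this would follow each judge.score(...).
accumulate_usage(per_k_judge_totals, "mnemoverse", {"prompt_tokens": 410, "completion_tokens": 35})
accumulate_usage(per_k_judge_totals, "mnemoverse", {"prompt_tokens": 395, "completion_tokens": 28})
print(per_k_judge_totals["mnemoverse"])  # {'prompt_tokens': 805, 'completion_tokens': 63}
```

Passing `None` as usage leaves the totals unchanged, which keeps a call with no recorded usage from raising inside the judge loop.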

### ASYM-007 — Reader-failure fallback in Mnemoverse path silently substitutes top-1 retrieved memory as the answer; competitor path returns empty string

| field | value |
|---|---|
| severity | high |
| direction | FAVORS_MNEMOVERSE |
| evidence | `experiments/benchmarks/locomo/evaluate.py:532-535` — `try: answer = await llm_client.complete(prompt, max_tokens=_ans_max_tokens); except Exception: answer = items[0].content if items else ''`. Mnemoverse silently uses top-1 retrieved memory AS the reader answer when the LLM call fails, then ships it to the judge with no `reader_failed` flag. Competitor `_query_loop.py:367-368` sets `reader_answer = ''` on the same failure ⇒ judge=0 in the competitor path |
| applies_to | All Mnemoverse night-runs cells (`cell_2b/c/d/e/f`) potentially; severity depends on actual reader-failure count which is not surfaced |
| effect | Mnemoverse pseudo-answers can pass the judge (especially on lookup-style single_hop questions where the retrieved snippet itself contains the answer); competitor empty answers always score 0. The asymmetry is one-sided in favour of Mnemoverse on every reader-failure event. Per-cell magnitude is bounded by the count of `llm_client.complete` exceptions in the run, which is not surfaced in the published cells — so the effect is documented as present-but-unquantified |
| fix_next_round | Both runners agree on a single failure policy (either both empty-string or both top-1) AND set a `reader_failed=true` flag on the row. Proper-architecture round audits Mnemoverse night-runs logs for `llm_client.complete` exceptions and either re-runs or flags affected cells |
| status | documented |
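
A shared failure policy for both runners could be sketched as follows. The example picks the empty-string option and is synchronous for brevity, whereas the quoted reader call is awaited; `answer_with_policy`, `reader_error`, and the stand-in readers are hypothetical names.

```python
def answer_with_policy(call_reader, prompt):
    """Run the reader; on any failure return an empty answer and flag the row."""
    try:
        return {"reader_answer": call_reader(prompt), "reader_failed": False}
    except Exception as exc:  # broad on purpose, like the `except Exception` quoted above
        return {
            "reader_answer": "",
            "reader_failed": True,
            "reader_error": type(exc).__name__,
        }


def flaky_reader(prompt):
    """Stand-in reader that always fails, to exercise the failure branch."""
    raise TimeoutError("simulated reader outage")


failed_row = answer_with_policy(flaky_reader, "When did they meet?")
ok_row = answer_with_policy(lambda prompt: "May 2023", "When did they meet?")
print(failed_row["reader_failed"], ok_row["reader_answer"])  # True May 2023
```

Whichever policy is chosen, the `reader_failed` flag is what makes the per-cell failure count recoverable afterwards.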

### ASYM-008 — `effective_context_k` cap is Mnemoverse-only: at k=200 Mnemoverse feeds ≤10 single_hop / ≤20 multi_hop items to reader; competitor feeds all 200

| field | value |
|---|---|
| severity | high |
| direction | BOTH |
| evidence | `evaluate.py:469-474` — `effective_context_k = cfg.context_k; if qa.category == 0: effective_context_k = min(cfg.context_k, 10); elif qa.category == 1: effective_context_k = max(cfg.context_k, 20); context_items = _sort_for_context(items[:effective_context_k], qa.category)`. For k=200 Mnemoverse cells, single_hop is capped at 10. `_query_loop.py:566` honors the FULL retrieved list (cap_50 fix removed `[:50]`). `night-runs/cell_2b_mnemoverse_locomo_conv26_full.json` has `config.context_k: 200` but only top-10 reaches the reader for single_hop (70/152 = 46% of QAs per cell). Competitor k=200 cells deliver all 200 to reader |
| applies_to | All Mnemoverse k=100/k=200 cells (cell_2e/2f at k=100; cell_2b/2f at k=200 — verify against matrix); competitor k=100/k=200 cells |
| effect | The cap has two competing effects. (a) At single_hop questions (cat=0, ~46 % of QAs), Mnemoverse's reader sees only top-10 even when `config.context_k = 200` — a direct disadvantage relative to the competitor reader which sees all 200. (b) At single_hop, top-10 is in many cases ENOUGH context (the answer is in the first few retrieved items), so a smaller well-ranked context can outperform a larger noisier one — which would favour Mnemoverse. The net direction is ambiguous and depends on per-cell retrieval quality. Classification is BOTH rather than FAVORS_COMPETITOR because the ambiguity is genuine |
| fix_next_round | Either unify both paths on the same effective_context_k logic OR publish a separate `effective_context_k_reader` column derived per cell. Proper-architecture round uses one formatter |
| status | documented |

### ASYM-009 — Reader `max_tokens` budget asymmetric: Mnemoverse passes 512 (→2048 effective), competitor passes 2048 (→6144 effective)

| field | value |
|---|---|
| severity | high |
| direction | FAVORS_COMPETITOR |
| evidence | Mnemoverse `evaluate.py:531` — `_ans_max_tokens = 512 if llm_client._is_new_api else 200`; `llm_client.py:434-435` — `if self._is_new_api: max_tokens = max(max_tokens * 3, 2048)` ⇒ 2048 effective. Competitor `_runner_main.py:89` `DEFAULT_READER_MAX_TOKENS = 2048`; `_runner_main.py:519` `max_tokens=args.max_tokens or DEFAULT_READER_MAX_TOKENS`; via `llm_client.py:435` floor ⇒ `max(2048×3, 2048) = 6144` effective |
| applies_to | ALL_ASYMMETRIC_V1 (every competitor cell vs every Mnemoverse cell) |
| effect | 3× larger reader thinking-budget for competitors. With `reasoning_effort=minimal` for gpt-5-mini the practical effect may be smaller than 3×, but the structural budget difference is real |
| fix_next_round | Align both paths on `max_tokens=2048` (matches `DEFAULT_READER_MAX_TOKENS` and `_UTILITY_LLM_MAX_TOKENS` at `run_locomo.py:66`). Proper-architecture round routes both through one reader-call helper |
| status | documented |
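
The effective budgets follow directly from the floor quoted in the evidence row. A minimal sketch, assuming the non-new-API branch passes the requested value through unchanged (that branch is not quoted here):

```python
def effective_max_tokens(requested, is_new_api=True):
    """Reproduce the floor quoted from llm_client.py:434-435."""
    if is_new_api:
        return max(requested * 3, 2048)
    return requested  # assumption: older API path applies no multiplier


mnemoverse = effective_max_tokens(512)    # evaluate.py passes 512
competitor = effective_max_tokens(2048)   # DEFAULT_READER_MAX_TOKENS
print(mnemoverse, competitor, competitor / mnemoverse)  # 2048 6144 3.0
```

Note that 512 × 3 = 1536 falls below the floor, so the Mnemoverse budget is set by the 2048 floor rather than by the multiplier.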

### ASYM-010 — Empty `reader_answer` rows scored 0 by Mnemoverse-path judge but SKIPPED by rejudged competitor judges → asymmetric per-judge denominators

| field | value |
|---|---|
| severity | high |
| direction | BOTH |
| evidence | `experiments/benchmarks/locomo/evaluate.py:547` — `if cfg.use_judge and llm_client and answer:` (empty answers are NEVER judged by the primary mnemoverse judge during original run, but the cell still ships the row with no `judge_scores` key — and the cell's `judge_scores` writeback gives `{'mnemoverse': 0.0}` for those rows). `experiments/benchmarks/_harness/rejudge_cell.py:268-270` — `if not answer: skipped_no_answer += 1; continue` when re-applying mem0/mem0-4o/strict. `_harness/rejudge_cell.py:357-360` averages over non-None scores ⇒ Mnemoverse cell `cell_2b` has mnemoverse n=152, mem0/mem0-4o/strict n=150. Inflation on `cell_2b`: mem0 0.9400 vs fair 0.9276 = +1.24pp; mem0-4o 0.9533 vs 0.9408 = +1.25pp; strict 0.4867 vs 0.4803 = +0.64pp. Competitor cells (supermemory wave14c k=100, zep wave14d k=200) have 0 empty answers; mem0_v3_cloud k=200 has 18 empty answers scored via `_query_loop.py` (as 0) — not via rejudge_cell skip path |
| applies_to | Mnemoverse cells `cell_2b/2c/2d/2e/2f` (k=10/20/50/100/200); mem0_v3_cloud k=200 cell (opposite asymmetry); any future cell with empty reader answers |
| effect | +1.25pp inflation on mem0-4o vs apples-to-apples on Mnemoverse cells; depresses mnemoverse-judge score on mem0_v3_cloud relative to mem0/mem0-4o/strict due to inconsistent empty-row handling across runners |
| fix_next_round | Standardize on always-score-empty-as-0 (preferred — what the runner does) and re-aggregate rejudge_cell to NOT skip. Add `judge_aggregate_n_per_judge: {judge_id: int}` per cell. Proper-architecture round uses single judging path |
| status | documented |
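
The inflation figures above come purely from the denominator. The sketch below reproduces the mem0 number for `cell_2b`, with the per-row counts (141 correct, 9 wrong, 2 empty) back-derived from 0.9400 over n=150 rather than read from the cell; `aggregate` and `empty_policy` are hypothetical names.

```python
def aggregate(scores, empty_policy):
    """Mean judge score; `scores` holds None for rows whose reader answer was empty."""
    if empty_policy == "skip":
        kept = [s for s in scores if s is not None]
    elif empty_policy == "zero":
        kept = [0.0 if s is None else s for s in scores]
    else:
        raise ValueError(f"unknown empty_policy: {empty_policy}")
    return sum(kept) / len(kept)


# cell_2b under the mem0 judge: 141 correct, 9 wrong, 2 empty reader answers.
scores = [1.0] * 141 + [0.0] * 9 + [None] * 2
skip = aggregate(scores, "skip")  # n=150, the published number
zero = aggregate(scores, "zero")  # n=152, the like-for-like number
print(round(skip, 4), round(zero, 4), round((skip - zero) * 100, 2))  # 0.94 0.9276 1.24
```

Making the policy an explicit argument forces every aggregate to state which denominator it used.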

### ASYM-011 — Judge failure silently dropped: failed per-row judge calls leave `judge_scores` key missing; aggregate computed over smaller denominator with no disclosure

| field | value |
|---|---|
| severity | high |
| direction | BOTH |
| evidence | `experiments/benchmarks/competitors/_query_loop.py:388-395` — judge exceptions only emit `log.warning('k_sweep_judge_failed', ...)`; the failed judge's score is NOT added to `per_k_judge_scores[judge.name]`. Lines 416-419 — `judge_aggregate[jn] = sum(scores) / len(scores)` over surviving scores. No `n_judged_per_judge` field in cell. Same pattern in `_harness/rejudge_cell.py:330-344`. For Mnemoverse k=200 `cell_2b`, denominators differ across judges (mnemoverse=152, others=150) with no disclosure |
| applies_to | ALL_ASYMMETRIC_V1 cells (both paths) where any judge fails on any QA |
| effect | A judge that silently fails on N rows publishes a mean that could be higher or lower than full-coverage truth, with no way to tell from the cell. Direction depends on which rows fail and which way they would have scored |
| fix_next_round | Add `judge_aggregate_n_per_judge: {judge_id: int}` and `_judge_failures: {judge_id: count}` to every cell. Treat any cell where any judge's n < n_questions as hold-for-disclosure. Proper-architecture round mandates this schema |
| status | documented |
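
The proposed schema fields can be produced at aggregation time. This sketch assumes per-judge scores arrive as lists containing only the rows that were actually judged; it counts every missing row as a failure, so skipped rows and failed calls are not distinguished. `aggregate_with_coverage` and `hold_for_disclosure` are hypothetical names.

```python
def aggregate_with_coverage(per_judge_scores, n_questions):
    """Build judge_aggregate plus the per-judge n and failure counts proposed above."""
    aggregate, n_per_judge, failures = {}, {}, {}
    for judge_id, scores in per_judge_scores.items():
        n_per_judge[judge_id] = len(scores)
        failures[judge_id] = n_questions - len(scores)
        aggregate[judge_id] = sum(scores) / len(scores) if scores else None
    return {
        "judge_aggregate": aggregate,
        "judge_aggregate_n_per_judge": n_per_judge,
        "_judge_failures": failures,
        "hold_for_disclosure": any(n < n_questions for n in n_per_judge.values()),
    }


# Denominators shaped like cell_2b: one judge covers 152 rows, another only 150.
summary = aggregate_with_coverage(
    {"mnemoverse": [1.0] * 152, "mem0": [1.0] * 150}, n_questions=152
)
print(summary["judge_aggregate_n_per_judge"], summary["hold_for_disclosure"])
```

With the denominators in the cell, a reader can tell a full-coverage mean from a partial one without access to the run logs.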

### ASYM-012 — Mnemoverse calls `engine.feedback(...)` (Hebbian update) between QAs within the SAME cell; competitor adapters are frozen post-ingest

| field | value |
|---|---|
| severity | medium |
| direction | FAVORS_MNEMOVERSE |
| evidence | `experiments/benchmarks/locomo/evaluate.py:365-373` — after every `engine.read()` Mnemoverse calls `await engine.feedback(FeedbackRequest(atom_ids=retrieved_ids[:5], outcome=0.5, query_concepts=read_result.query_concepts, domain=domain))`. Even with outcome=0.5 (neutral / blind, per INTEGRITY PROTOCOL at `evaluate.py:8-13`), this still updates the Hebbian graph via concept co-occurrence. `_query_loop.py` (lines 343-410) has NO equivalent call. `COMPETITOR_RUNNER_REWORK_SPEC.md` invariant #2 deliberately freezes adapter state during query loop |
| applies_to | All Mnemoverse night-runs cells (cell_2b/c/d/e/f) and any derived matrix cell |
| effect | Later QAs in a Mnemoverse cell see an evolving retrieval graph: concept↔query Hebbian edges accumulate from earlier QAs of the same evaluation pass. Competitor adapters cannot do this in `asymmetric_v1`. The asymmetry is structurally one-sided in favour of Mnemoverse: it gives Mnemoverse two advantages no competitor receives — (i) the in-cell graph compounds during evaluation rather than staying frozen post-ingest, and (ii) the evolution is driven by the actual query distribution being evaluated on. Both advantages compound across the 152 QAs of a cell. An ablation pass (feedback off vs on, same cell) has not been run in this round — the magnitude is therefore not separately quantified — but the direction and structural unfairness are unambiguous |
| fix_next_round | Gate feedback behind a flag (off by default for benchmark cells) OR ablate: re-run with feedback off and publish both numbers. Proper-architecture round runs in-engine Hebbian for both — equalizes (or removes the call from the harness) |
| status | documented |
| http_path_disclosure | PR #289 (A2): `MnemoverseHttpAdapter` keeps the Hebbian feedback call enabled (`_feedback_enabled=True` default) and uses `outcome=0.5` (neutral midpoint, matching in-proc) so the HTTP path is symmetric with the in-process baseline on this axis. The kept-algorithm-advantage is disclosed per-cell via `provenance_stamps._feedback_enabled` — competitor cells (mem0/supermemory/zep) still don't fire `engine.feedback` because their adapters have no such call site. The FAVORS_MNEMOVERSE direction therefore still applies to `mnemoverse_http` cells vs competitor cells; it does NOT apply to `mnemoverse_http` vs `mnemoverse_engine` (those two are aligned). |

### ASYM-013 — Recall metric asymmetric: Mnemoverse cells carry true `retrieval_recall` vs evidence; competitor cells have `retrieved_atom_ids = ['pos_0', 'pos_1', ...]` (no real IDs)

| field | value |
|---|---|
| severity | medium |
| direction | FAVORS_MNEMOVERSE |
| evidence | `evaluate.py:428-429` — `r_recall = retrieval_recall(retrieved_ids, qa.evidence, dia_to_atom); r_precision = retrieval_precision(...)`. Competitor `_query_loop.py:576-588` `_extract_atom_ids` returns `pos_0, pos_1, ...` because adapters (e.g. `supermemory_adapter.py:219`) return plain snippet TEXT, not IDs. Confirmed: `phase-c/wave14c/cell_supermemory_locomo_conv26_n199_k10.json:50-52` — `retrieved_atom_ids: ['pos_0', 'pos_1']`. Matrix manifest `recall_at_k: null` for all competitor cells |
| applies_to | All competitor cells (no real recall@k computable); all Mnemoverse cells (real recall@k present) |
| effect | The headline ranking metric is `judge_aggregate.mnemoverse`, not recall — so this doesn't directly bias rank. But downstream charts comparing Mnemoverse recall (real number) vs competitor recall (`null`) are structurally biased toward Mnemoverse |
| fix_next_round | Adapters expose real `memory_id` alongside the snippet so `_extract_atom_ids` gets real IDs OR hide `recall_at_k` from any cross-system comparison. Proper-architecture round mandates real-ID return from all adapters |
| http_path_disclosure | I3 (PR #302, 2026-06-10): `MnemoverseHttpAdapter` now emits **dia-translated** ids (`conv-26::D1:3` dialect, via the atom-id→dia-id map built from this run's own write-batch responses), so recall@k becomes computable for the `mnemoverse_http` row while all four competitors keep structural `pos_i`. NOTE the sharper hazard: `compute_recall` writes competitor recall as a NUMBER 0.0, not null, so cross-system recall charts must exclude positional-id rows or show them as not-computable. Translation coverage is disclosed per-cell via `config.adapter_meta._atom_id_kind` (`dia`/`mixed`/`uuid`) + `_total_unmapped_atom_ids`. |
| status | documented |
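
A recall helper that refuses to score positional ids would keep placeholder rows out of cross-system charts. This is a sketch with a simplified signature; it is not the `retrieval_recall` or `compute_recall` function referenced above, and the `pos_N` pattern is inferred from the examples in the evidence row.

```python
import re

POSITIONAL_ID = re.compile(r"^pos_\d+$")


def recall_or_none(retrieved_ids, evidence_ids):
    """Recall over evidence ids, or None when it is not computable."""
    if not evidence_ids:
        return None
    if retrieved_ids and all(POSITIONAL_ID.match(i) for i in retrieved_ids):
        return None  # positional placeholders: report not-computable, never 0.0
    wanted = set(evidence_ids)
    return len(set(retrieved_ids) & wanted) / len(wanted)


print(recall_or_none(["pos_0", "pos_1"], ["conv-26::D1:3"]))  # None
print(recall_or_none(
    ["conv-26::D1:3", "conv-26::D2:1"],
    ["conv-26::D1:3", "conv-26::D4:7"],
))  # 0.5
```

Returning `None` rather than 0.0 lets a chart layer drop or grey out the row instead of plotting a false zero.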

### ASYM-014 — `_query_failure_rate` field absent on Mnemoverse cells: absence ambiguous between "0 failures" and "crashed before publishing"

| field | value |
|---|---|
| severity | medium |
| direction | FAVORS_MNEMOVERSE |
| evidence | `experiments/benchmarks/competitors/_query_loop.py:424-431` only writes `judge_aggregate['_query_failure_rate']` when per-k failure rate > `QUERY_FAILURE_SURFACE_RATE` (0.02). Mnemoverse path `evaluate.py` has NO analogous instrumentation — `engine.read()` crashes propagate (`run_locomo.py:717` has no try/except around the QA loop). Mnemoverse cells with failures don't exist on disk; cells that DO exist can't be distinguished from cells with sub-2% silent failures |
| applies_to | All Mnemoverse night-runs cells and matrix cells |
| effect | "Mnemoverse cells have no `_query_failure_rate`" reads as "zero failures" but actually means "cells with failures never got published". This is survivorship bias on the Mnemoverse side; competitor cells carry a diagnostic field that can expose their failures, Mnemoverse cells do not |
| fix_next_round | Add `_query_failure_rate` (or `n_query_failures`) field always (set to 0 if none) in BOTH runners. Lower `QUERY_FAILURE_SURFACE_RATE` to 0 (always surface). Proper-architecture round mandates the field in the cell schema |
| status | documented |

### ASYM-015 — Mnemoverse matrix cells carry `git_sha_source: 'normalize_time'` — sha stamped at normalization, not captured at run

| field | value |
|---|---|
| severity | medium |
| direction | FAVORS_MNEMOVERSE |
| evidence | `experiments/results/night-runs/cell_2{b,c,d,e,f}_*.json` — every Mnemoverse and naked_cosine raw cell has `config.git_sha = None`. `experiments/benchmarks/matrix/cells/cell_mnemoverse_engine_locomo_conv26_n152_k200.json` then carries `config.git_sha = '1cc6512'` AND `config.git_sha_source = 'normalize_time'`. Phase-C competitor cells DO carry real run-time git_sha (wave14b=7c682b9, wave14c=9214eda, wave14d=8b47b66) |
| applies_to | All Mnemoverse matrix cells; `WAVE14_MORNING_BRIEFING.md` headline 95.3 / 88.2 numbers |
| effect | Provenance for Mnemoverse cells is stamped retroactively at normalization, not captured at run. Headlines citing "Mnemoverse 95.3" reference a SHA the actual run did not record; competitor numbers do not have this weakness |
| fix_next_round | Re-emit Mnemoverse night-runs cells with real run-time SHA capture in `locomo/evaluate.py` (mirror `_query_loop.py`'s plumbing). Proper-architecture round mandates run-time SHA in every cell |
| status | documented |
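
Run-time SHA capture can be done with a single `git rev-parse` call at the start of a run. The sketch below assumes the runner executes inside a git checkout; the source values `"run_time"` and `"unavailable"` are hypothetical labels chosen to sit alongside the existing `normalize_time` value.

```python
import subprocess


def capture_git_sha(repo_dir="."):
    """Return (sha, source); sha is None when git metadata is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git missing, directory missing, or not a repository
        return None, "unavailable"
    return out.stdout.strip(), "run_time"


sha, source = capture_git_sha()
config = {"git_sha": sha, "git_sha_source": source}
print(config)
```

Recording the source next to the SHA keeps a later normalization step from silently overwriting a value that was captured at run time.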

### ASYM-016 — Bare-int ingest return from `mem0_adapter` hardcodes `failed_turns=0` even when `mem0_add_failed` fired

| field | value |
|---|---|
| severity | medium |
| direction | FAVORS_MNEMOVERSE |
| evidence | `experiments/benchmarks/competitors/mem0_adapter.py:194-211` catches each `memory.add` exception, logs `mem0_add_failed`, increments the stored count only on success, and returns a bare `int`. `_query_loop.py:_unpack_ingest_result` (lines 614-618) then runs `try: stored_int = int(result); return stored_int, 0, fallback_total`, which hardcodes failed=0. A run where 50/419 turns failed ships `ingest_stored_turns=369, ingest_failed_turns=0, ingest_total_turns=419`, which looks like a clean filter, not a failure |
| applies_to | Any cell whose adapter returns a bare int from ingest while swallowing per-unit add failures — `failed_turns` is hardcoded 0 downstream for all of them. Verified 2026-06-10 (panel on #298): `mem0_adapter.py:182-226` (OSS); `mem0_cloud_adapter.py:260-269,337` (cloud — WORSE: returns `total_turns` even when sessions failed); `supermemory_adapter.py:165-186`; `zep_adapter.py:336-368` (stored count honest, failed count invisible). Stamped on all four systems in `_asymmetry_registry.py`. |
| effect | Hides competitor ingest failures and makes failed adapter runs look like deliberate skips. Direction is anti-competitor (Mnemoverse-favoring): the cell gives no sign that ingest failed. It understates competitor performance only when results are reported as a ratio, not when they are reported as absolute counts |
| fix_next_round | Tighten `_unpack_ingest_result` to set `failed = total - stored` when bare-int branch fires OR require all adapters to return `IngestResult`/dict. Proper-architecture round mandates structured IngestResult |
| status | documented |
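
The tightened unpacking could look roughly like this. It is a sketch under stated assumptions: `IngestResult` is a hypothetical structured return type and `unpack_ingest_result` is an illustrative stand-in for `_unpack_ingest_result`, not its current code.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class IngestResult:
    """Hypothetical structured return type for adapter ingest."""
    stored: int
    failed: int
    total: int


def unpack_ingest_result(
    result: Union[int, IngestResult], fallback_total: int
) -> Tuple[int, int, int]:
    """Return (stored, failed, total) without assuming failed == 0."""
    if isinstance(result, IngestResult):
        return result.stored, result.failed, result.total
    stored = int(result)
    # Bare-int branch: anything not stored counts as failed instead of
    # being reported as zero. Clamp so a stored count above the fallback
    # total cannot produce a negative failure count.
    failed = max(fallback_total - stored, 0)
    return stored, failed, fallback_total


print(unpack_ingest_result(369, fallback_total=419))
```

With the numbers from the evidence row this yields 50 failed turns rather than 0. Note that the bare-int branch would report a session-based count such as Supermemory's 19 as 400 failures (see ASYM-020), which is one argument for requiring the structured return instead.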

### ASYM-017 — phase-c-supermem mixed-SHA within the same K-curve: k=10/100 at 9214eda, k=20/30/50 at 98627ac

| field | value |
|---|---|
| severity | medium |
| direction | BOTH |
| evidence | `phase-c-supermem/cell_supermemory_locomo_conv26_n199_k10.json` git_sha=9214eda; k=100 git_sha=9214eda; but k=20, k=30, k=50 git_sha=98627ac. All five share `tag='phase-c-supermemory'` and present as one K-curve. `WAVE14_MORNING_BRIEFING.md` §3 does not list 98627ac at all |
| applies_to | `phase-c-supermem/cell_supermemory_locomo_conv26_n199_k{10,20,30,50,100}.json` |
| effect | Curve monotonicity (or non-monotonicity) cannot be cleanly attributed to K vs to code drift between the two SHAs. The drift direction is unknown — `98627ac` could have raised or lowered Supermemory measurements at k=20/30/50 relative to `9214eda`. Classification is BOTH (not FAVORS_COMPETITOR) because the structural unfairness is unattributable rather than one-sided |
| fix_next_round | Decide one SHA per K-curve; re-run off-SHA points; annotate K-curve plots with per-point SHA. Proper-architecture round mandates single SHA per scope |
| status | documented |
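
A single-SHA-per-curve check can be sketched as below. The flat `tag` / `k` / `git_sha` cell shape is an assumption made for brevity (real cells nest these under `config`), and `mixed_sha_curves` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def mixed_sha_curves(
    cells: Iterable[Mapping[str, Any]],
) -> Dict[str, Dict[str, List[int]]]:
    """Group cells by tag and return only tags that span more than one SHA."""
    by_tag: Dict[str, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for cell in cells:
        by_tag[cell["tag"]][cell["git_sha"]].append(int(cell["k"]))
    return {
        tag: {sha: sorted(ks) for sha, ks in shas.items()}
        for tag, shas in by_tag.items()
        if len(shas) > 1
    }


# The five phase-c-supermem points described in the evidence row.
cells = [
    {"tag": "phase-c-supermemory", "k": k, "git_sha": sha}
    for k, sha in [(10, "9214eda"), (20, "98627ac"), (30, "98627ac"),
                   (50, "98627ac"), (100, "9214eda")]
]
offenders = mixed_sha_curves(cells)
print(offenders)
```

Run as a pre-publish gate, a non-empty result would either block the curve or force the per-point SHA annotation.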

### ASYM-018 — Mnemoverse n=152 cells frozen at `1cc6512` (pre cap_50 fix); competitor cells at 7c682b9..8b47b66 — cell-by-cell SHA mismatch is invisible in manifest

| field | value |
|---|---|
| severity | medium |
| direction | BOTH |
| evidence | Mnemoverse_engine_n152 cells all at `1cc6512` (matrix manifest + cells). Supermemory cells at `9214eda` (k=10/100) and `98627ac` (k=20/30/50) — both AFTER `7c682b9 fix(bench): drop hardcoded retrieved[:50] reader-context cap`. mem0_v3_cloud at `7c682b9`. Zep at `8b47b66`. `_query_loop.py` was never touched at 1cc6512, so "run both at the same SHA" is impossible |
| applies_to | ALL_ASYMMETRIC_V1 cross-system comparisons |
| effect | Re-running competitor harness at 1cc6512 to validate the gap would hit the buggy `[:50]` cap; the asymmetry is invisible in the manifest. Direction is genuinely both — the cap_50 bug hurt competitors when present, but the SHA mismatch leaves the validation path ambiguous |
| fix_next_round | Re-run Mnemoverse cells at LATEST sha that includes all competitor harness fixes (post-8b47b66). Proper-architecture round runs all cells under one SHA |
| status | documented |

### ASYM-019 — `MnemoverseAdapter.ingest` (used in some competitor-runner smoke runs) drifts from `run_locomo.py`'s baseline ingest plumbing

| field | value |
|---|---|
| severity | medium |
| direction | UNKNOWN |
| evidence | `competitors/mnemoverse_adapter.py:113-116` — `result = await ingest_conversation_direct(engine, conversation, domain)` uses default ingest settings (no `consolidate_between_sessions` flag, no edge pruning). Direct path `run_locomo.py:690-695` passes `consolidate_between_sessions=cfg.consolidate` from baseline_cfg; lines 702-708 conditionally run `engine.hebbian.prune_weak_edges(...)` if `prune_min_weight > 0`. Headline `mnemoverse_engine` cells run through `run_locomo.py` and have different post-ingest state than the shim produces |
| applies_to | Any cell produced via the competitor-runner shim path for Mnemoverse (smoke/parity tests); does NOT affect headline night-runs cells |
| effect | Anyone re-running Mnemoverse via the competitor harness (e.g. to get an n=199 score) will measure a slightly different Mnemoverse configuration than the one behind the headline numbers. Direction depends on which ingest knob matters more |
| fix_next_round | Make `MnemoverseAdapter.ingest` accept the same flags as `run_locomo.py`'s baseline_cfg and plumb them through OR document in adapter docstring + cell metadata. Proper-architecture round uses single mnemoverse-core HTTP adapter for all cells — eliminates the drift |
| status | documented |

### ASYM-020 — Supermemory cells silently report `ingest_stored_turns=19 / ingest_total_turns=419` (sessions-vs-turns unit mismatch)

| field | value |
|---|---|
| severity | medium |
| direction | FAVORS_MNEMOVERSE |
| evidence | All 10 Supermemory cells (phase-c/wave14c and phase-c-supermem k=10/20/30/50/100/200) carry `token_totals = {'ingest_stored_turns': 19, 'ingest_failed_turns': 0, 'ingest_total_turns': 419}`. `competitors/supermemory_adapter.py:123-184` iterates `conversation.sessions` (not turns), so 19 stored = 19 sessions. `_query_loop.py:_unpack_ingest_result:614` treats the bare int as `(stored=19, failed=0, total=fallback_total=419)`. The cell therefore publishes an apples-to-oranges quotient: a reader naively comparing Supermemory's fields to Mnemoverse's `ingest_stored_turns=419 / total=419` will conclude that Supermemory dropped 95 % of turns, when in fact it ingested all 19 sessions (each containing many turns) as 19 documents |
| applies_to | All Supermemory cells (Wave 14c and phase-c-supermem) |
| effect | `stored/total = 4.5%` reads as catastrophic ingestion failure; downstream cost-per-turn arithmetic is wrong by 22× (419/19). Supermemory looks bad on both the stored/total ratio and per-turn cost, which is why the entry is classified FAVORS_MNEMOVERSE; the effect on other downstream comparisons could go either way |
| fix_next_round | Return `IngestResult(stored_turns=419, stored_units=19, unit='session')` OR set `ingest_total_turns=len(conversation.sessions)` for Supermemory. Add `config.ingest_unit` per system. Proper-architecture round mandates structured IngestResult with explicit unit |
| status | documented |
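
The unit mismatch and its arithmetic can be made concrete with a small sketch. `UnitIngestResult` and its field names are hypothetical; the proposed `IngestResult(stored_turns=419, stored_units=19, unit='session')` in the table above is the only shape the source actually names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitIngestResult:
    """Hypothetical ingest record that names the unit it counts."""
    stored_units: int
    total_units: int
    unit: str          # e.g. "turn" or "session"
    stored_turns: int  # turns covered by the stored units
    total_turns: int

    def stored_ratio(self) -> float:
        """Ratio within one unit, never sessions divided by turns."""
        return self.stored_units / self.total_units if self.total_units else 0.0


# Numbers from the cells above: 19 session documents covering 419 turns.
supermemory = UnitIngestResult(
    stored_units=19, total_units=19, unit="session",
    stored_turns=419, total_turns=419,
)
naive_ratio = 19 / 419  # what the published fields imply
turns_per_unit = supermemory.stored_turns / supermemory.stored_units
print(f"naive {naive_ratio:.1%}, unit-aware {supermemory.stored_ratio():.0%}")
print(f"turns per stored unit: {turns_per_unit:.1f}")
```

The naive quotient is the 4.5 % figure from the effect row, and the turns-per-unit value is the roughly 22× factor by which per-turn cost arithmetic goes wrong.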

### ASYM-021 — `judge_*_tokens=0` → `cost_usd.judge` computed off zero tokens but presented as real cost (compounds ASYM-006)

| field | value |
|---|---|
| severity | medium |
| direction | BOTH |
| evidence | `experiments/benchmarks/matrix/cells/cell_mnemoverse_engine_locomo_conv26_n152_k200.json` reports `cost_usd.judge=0.0501`. Competitor matrix-normalized cells (e.g. supermemory n=199 k=100) compute judge cost from `token_totals.judge_*_*_tokens=0` ⇒ `cost_usd.judge ≈ 0.0000`. Mnemoverse cells' judge cost reflects `rejudge_cell.py`-captured tokens; competitor cells' judge cost is structurally $0 because `_query_loop.py` never increments the counter (ASYM-006) |
| applies_to | All Wave 14 matrix cells; any cost-comparison chart |
| effect | Compounds ASYM-006. Cross-system cost comparisons are not apples-to-apples — competitor cost UNDERSTATED, Mnemoverse-vs-competitor cost-efficiency for retrieval MISLEADING in both directions |
| fix_next_round | Until `judge_*_tokens` captured in `_query_loop.py`, NULL out `cost_usd.judge` in competitor cells (publish "unknown" not "$0.00"). Tag matrix-normalized cells with `cost_usd_judge_provenance: 'rejudge_cell.py'` vs `'absent'`. Proper-architecture round eliminates via single token-accounting path |
| status | documented |
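
One way to publish "unknown" instead of "$0.00" is sketched below. The function name, the flat `cost_usd_judge` key, and the per-1k prices are placeholders chosen for illustration (the prices are not real model pricing); the provenance values follow the `'rejudge_cell.py'` vs `'absent'` tagging proposed in the table.

```python
from typing import Any, Dict


def judge_cost_fields(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
    provenance: str = "rejudge_cell.py",
) -> Dict[str, Any]:
    """Return judge cost plus provenance; zero tokens means unknown, not $0."""
    if prompt_tokens == 0 and completion_tokens == 0:
        # The counter was never incremented: publish None rather than 0.0.
        return {"cost_usd_judge": None, "cost_usd_judge_provenance": "absent"}
    cost = (
        prompt_tokens / 1000 * usd_per_1k_prompt
        + completion_tokens / 1000 * usd_per_1k_completion
    )
    return {"cost_usd_judge": round(cost, 4), "cost_usd_judge_provenance": provenance}


competitor = judge_cost_fields(0, 0, 0.0005, 0.0015)
captured = judge_cost_fields(100_000, 2_000, 0.0005, 0.0015)
print(competitor, captured)
```

A cost chart can then skip or grey out cells whose provenance is `absent` instead of plotting them at zero.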

### ASYM-022 — `mem0` and `mem0-4o` judges are documented-lenient and inflate every system's score

| field | value |
|---|---|
| severity | high |
| direction | BOTH |
| evidence | The `mem0` and `mem0-4o` judges replicate Mem0's published evaluation prompt (`experiments/benchmarks/judges.py` — verbatim from Mem0's `evaluation/src/llmjudge.py`). Mem0's own prompt explicitly accepts paraphrase, partial credit, and ±14-day temporal tolerance. The Penfield evaluation pass (2026-05; not in this PR) measured that the same prompt accepts ≥63 % of intentionally-wrong reference answers on a held-out probe set. The `mnemoverse` and `strict` judges do not have this lenience. Per the four judge columns in the briefing, the spread between strict and mem0-4o on the same `(system, k)` cell is typically 25-50 pp, e.g. mnemoverse_engine k=200: strict 48.7 vs mem0-4o 95.3 (+46.6 pp on the same answers) |
| applies_to | every cell that publishes a `mem0` or `mem0-4o` judge score (all Wave 14 + Phase B1 cells) |
| effect | Lenient judges inflate every system's score AND compress the gap between strong and weak systems (a weak system's "close-but-wrong" answer is accepted; a strong system's "exact-and-right" answer was already going to be accepted). Net effect on cross-system rankings is therefore BOTH-direction — but the inflation itself is real and any external citation of a `mem0` or `mem0-4o` number without disclosure of the judge's lenience is misleading. This is the highest-impact deferred finding from the publication-layer audit and is promoted here so cells stamp it explicitly |
| fix_next_round | Treat `strict` as the headline judge in any cross-system claim. Publish `mem0` / `mem0-4o` numbers only alongside the same cell's `strict` number, never alone. The proper-architecture round adopts the same convention by default |
| status | documented |
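
The "never alone" publishing rule could be enforced by a formatter along these lines. This is a sketch: `format_headline` and `LENIENT_JUDGES` are hypothetical names, and the score mapping is assumed to be keyed by judge name.

```python
from typing import Mapping

LENIENT_JUDGES = frozenset({"mem0", "mem0-4o"})


def format_headline(system: str, k: int, scores: Mapping[str, float], judge: str) -> str:
    """Render one score; lenient judges are only printed next to `strict`."""
    if judge not in scores:
        raise KeyError(f"no {judge!r} score for {system} k={k}")
    if judge not in LENIENT_JUDGES:
        return f"{system} k={k}: {judge} {scores[judge]:.1f}"
    if "strict" not in scores:
        raise ValueError(f"refusing to publish {judge!r} without a strict score")
    spread = scores[judge] - scores["strict"]
    return (
        f"{system} k={k}: strict {scores['strict']:.1f} "
        f"({judge} {scores[judge]:.1f}, {spread:+.1f} pp, lenient judge)"
    )


line = format_headline(
    "mnemoverse_engine", 200, {"strict": 48.7, "mem0-4o": 95.3}, "mem0-4o"
)
print(line)
```

Using the k=200 example from the evidence row, the rendered line leads with the strict score and carries the +46.6 pp spread in the same string, so the lenient number cannot be quoted without it.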

### ASYM-023 — Zep `graph.search` returns at most 30 atoms per query; Wave 14d k=50/k=100/k=200 cells effectively measure k=30

| field | value |
|---|---|
| severity | high |
| direction | FAVORS_MNEMOVERSE (historical wave14d cells); CLOSED for new runs (`zep_adapter.py:496`) |
| evidence | `experiments/benchmarks/competitors/zep_adapter.py:159` constructs the adapter with `limit = ZEP_DEFAULT_LIMIT = 30`, and the `query()` method passes that constant (NOT the requested `top_k`) to `client.graph.search(..., limit=self._limit)`. Verified on disk for a sample: each of the first 10 rows in `cell_zep_locomo_conv26_n199_k50.json`, `k100.json`, and `k200.json` has `len(retrieved_atom_ids) == 30`. Cells at k=10 and k=20 correctly return 10 and 20 atoms respectively because `top_k < self._limit` and the downstream `_extract_snippets` clips to `top_k`. At k=50/100/200 the clip is a no-op because Zep returned only 30 |
| applies_to | `experiments/results/phase-c/wave14d/cell_zep_locomo_conv26_n199_k{50,100,200}.json` |
| effect | Zep cells at k=50/100/200 are not measurements of Zep at those k values. They are measurements of Zep at k=30 published under k=50/100/200 column labels. The Zep k-curve shown in the briefing is therefore flat-by-construction beyond k=30. The Zep numbers themselves are valid for k=30 retrieval — the cells simply do not test Zep's behaviour at deeper retrieval. Other systems (Mnemoverse, naked_cosine, Supermemory) actually retrieved at their declared k, so cross-system comparison at k=50/100/200 puts Zep at a structural retrieval-depth disadvantage |
| fix_next_round | Fix the adapter to pass `limit=top_k` (or `limit=max(top_k, self._limit)`) and verify Zep's server actually returns more than 30 — if Zep server-side caps at 30, document the cap as a system-level constraint and run the k-curve over [10, 20, 30] only |
| status | closed_for_new_runs — `zep_adapter.py:496` `effective_limit = max(top_k, self._limit)`; guard-listed in `CLOSED_ON_SYMMETRIC_V1` (`_asymmetry_registry.py`) so it can never be stamped on symmetric_v1 cells. Historical asymmetric_v1 wave14d k=50/100/200 cells remain affected and must keep citing this id. Residual open question (panel F4 on #298): Zep's SERVER-side cap is unverified — the pilot k-sweep's per-row retrieved counts are the verification artifact; if the server caps at 30, this re-opens before publishing zep k≥50 |
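
The per-row verification mentioned for the pilot k-sweep could be automated roughly as follows. `effective_limit` mirrors the expression quoted from `zep_adapter.py:496`; `suspected_cap` is a hypothetical helper, and the row shape (a mapping with a `retrieved_atom_ids` list) is assumed from the cell field named above.

```python
from typing import Iterable, Mapping, Optional, Sequence


def effective_limit(top_k: int, default_limit: int) -> int:
    """Never ask the backend for fewer items than the requested top_k."""
    return max(top_k, default_limit)


def suspected_cap(
    rows: Iterable[Mapping[str, Sequence[str]]], requested_k: int
) -> Optional[int]:
    """Return N if every row retrieved exactly N < requested_k items, else None.

    A constant count below the requested depth suggests a client- or
    server-side cap; it does not show where the cap lives.
    """
    counts = {len(row["retrieved_atom_ids"]) for row in rows}
    if len(counts) == 1:
        (n,) = counts
        if 0 < n < requested_k:
            return n
    return None


rows_k50 = [{"retrieved_atom_ids": [f"a{i}" for i in range(30)]} for _ in range(10)]
rows_k10 = [{"retrieved_atom_ids": [f"a{i}" for i in range(10)]} for _ in range(10)]
print(effective_limit(50, 30), suspected_cap(rows_k50, 50), suspected_cap(rows_k10, 10))
```

A small corpus can legitimately return fewer items than requested, so a hit from this check is a prompt to inspect the backend, not proof of a cap.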

### ASYM-026 — `MnemoverseHttpAdapter._settle_poll` reads org-wide `total_atoms` from `/api/v1/memory/stats` (no domain filter); ingest-batch settle is therefore vulnerable to parallel-domain contamination

| field | value |
|---|---|
| severity | medium |
| direction | UNKNOWN |
| evidence | `experiments/benchmarks/competitors/mnemoverse_http_adapter.py:403-428` polls `GET /api/v1/memory/stats` (server route `src/mnemo/api/routes.py:228-242` accepts no query params) and uses the response's `total_atoms` as the stabilization signal. The endpoint is org-scoped, NOT domain-scoped — every conversation in the `bench-locomo-conv26` org contributes to the same counter. When a single LoCoMo run sequentially ingests one conv at a time the counter still reaches a stable plateau, but a parallel run (two convs ingesting simultaneously into the same org) would have one conv's settle-poll observe the other conv's atoms and exit early (or vice versa, never settle). No matrix run today does parallel ingest into the same org, so this is a latent risk, not a live measurement bias. |
| applies_to | every `mnemoverse_http` cell whose ingest path used the settle-poll (i.e. every `mnemoverse_http` cell — settle-poll is unconditional after batch ingest) |
| effect | UNKNOWN-direction in current runs (single-conv-at-a-time ingest is structurally safe). Becomes FAVORS_COMPETITOR if any future matrix run launches parallel conv ingestion into the same bench org, because either (a) settle-poll exits early and the next query runs against a partially-indexed corpus, or (b) settle-poll never reaches plateau and times out, both of which weaken HTTP cell numbers vs in-proc. |
| fix_next_round | Add a server-side `/api/v1/memory/stats?domain=...` filter (api change), OR rotate the bench org per conversation so the org-wide counter == domain counter trivially, OR replace the polling signal with an adapter-tracked "post-batch acked-count" that doesn't depend on server-side aggregation. The third option is cheapest and is the proposed direction for #291b. |
| status | documented |
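
The third option, an adapter-tracked acknowledged count, might be sketched like this. It assumes the adapter can learn how many items the server acknowledged for each batch; whether the write-batch response exposes such a count is an assumption, and `AckTracker` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AckTracker:
    """Per-domain count of items sent vs acknowledged, kept by the adapter."""
    sent: Dict[str, int] = field(default_factory=dict)
    acked: Dict[str, int] = field(default_factory=dict)

    def record_sent(self, domain: str, n: int) -> None:
        self.sent[domain] = self.sent.get(domain, 0) + n

    def record_acked(self, domain: str, n: int) -> None:
        total = self.acked.get(domain, 0) + n
        if total > self.sent.get(domain, 0):
            raise ValueError(f"{domain}: more acks than items sent")
        self.acked[domain] = total

    def settled(self, domain: str) -> bool:
        """True once every item sent for this domain has been acknowledged."""
        sent = self.sent.get(domain, 0)
        return sent > 0 and self.acked.get(domain, 0) == sent


tracker = AckTracker()
tracker.record_sent("conv-26", 100)
tracker.record_acked("conv-26", 100)
tracker.record_sent("conv-47", 100)
tracker.record_acked("conv-47", 60)
print(tracker.settled("conv-26"), tracker.settled("conv-47"))
```

Because the counters are keyed by domain and live in the adapter, a parallel ingest into the same org cannot move another conversation's settle signal.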

### ASYM-024 — `MnemoverseHttpAdapter.query` omits the engine `two_pass` flag that the in-process baseline uses by default

| field | value |
|---|---|
| severity | high |
| direction | FAVORS_COMPETITOR (pre-closure cells); CLOSED in PR #292 |
| evidence | Pre-#292 wire shape: `experiments/benchmarks/competitors/mnemoverse_http_adapter.py:439-455` posted `{"query": ..., "top_k": top_k, "domains": [domain], "min_relevance": ..., "include_associations": True, "concepts": []}` to **`/api/v1/memory/query`** — body carried NO `two_pass`. `QueryRequestSchema` (`src/mnemo/api/schemas.py:391-397`) does NOT accept `two_pass`/`strategy`; only `ReadRequestSchema` (used by `/memory/read-batch`, `routes_v1.py:284-296`) accepts them. The in-process MnemoverseAdapter (`experiments/benchmarks/competitors/mnemoverse_adapter.py:124-131`) sends `engine.read(ReadRequest(..., two_pass=True))` — `two_pass=True` BUT NOTHING for the `strategy` field (defaults to `None` → no preset). **Correction vs prior revisions of this entry: in-proc baseline does NOT request `strategy="auto"` via StrategyClassifier — earlier investigator framing was incorrect; verified 2026-06-10 by direct read of `mnemoverse_adapter.py`.** The real ASYM-024 axis is therefore `two_pass` only; the HTTP path's new `strategy="auto"` after closure is a SEPARATE disclosed advantage tracked under ASYM-027. |
| applies_to | every `mnemoverse_http` cell produced from `feat/mnemoverse-http-adapter` (2026-06-06 onward) UP TO PR #292 merge. Cells produced from PR #292 onward have ASYM-024 closed (and ASYM-027 disclosed). |
| effect | HTTP cells under-represent server-engine capability on the `two_pass` axis. Pilot numbers (judge_mnemoverse ≈ 0.27-0.33 on conv-26 vs in-proc engine 0.74) reflect this gap. Cross-row comparisons that put `mnemoverse_http` next to `mnemoverse_engine` invite the misreading "the HTTP API is materially worse than local" when the actual delta is "the bench adapter does not request the algorithm features that the in-proc bench uses". Distinct from a real server-side weakness — same engine, different invocation. |
| fix_next_round | **DONE on PR #292** (commit `<filled at merge>`). HTTP adapter now POSTs to `/api/v1/memory/read-batch` with a 1-element `queries` list carrying `two_pass=True`; ReadRequestSchema accepts the field and `routes_v1.py:284-296` forwards it to `engine.read()` exactly as the in-proc `MnemoverseAdapter.query` does. The HTTP body ALSO carries `strategy="auto"` as a DISCLOSED kept-algorithm-advantage per Eduard's Option B decision 2026-06-10 — see ASYM-027 for the disclosure and the rationale. Pilot conv-26 + conv-47 + held-out conv-30 numbers will record actual lift magnitudes in a follow-up commit. |
| status | closed_in_pr_292 |

### ASYM-027 — `MnemoverseHttpAdapter.query` sends `strategy="auto"` triggering server-side `StrategyClassifier`; in-process baseline omits the field (defaults to `None` → no preset, no classifier)

| field | value |
|---|---|
| severity | medium |
| direction | FAVORS_MNEMOVERSE (intra-Mnemoverse — `mnemoverse_http` row gains structural retrieval advantage over the `mnemoverse_engine` row) |
| evidence | After PR #292 closure of ASYM-024, the HTTP adapter posts `strategy="auto"` to `/api/v1/memory/read-batch` (`experiments/benchmarks/competitors/mnemoverse_http_adapter.py` query body). Server-side, `c:/Projects/mnemoverse/mnemoverse-core/src/mnemo/core/memory_engine.py:1769-1782` checks `if request.strategy == "auto":` → instantiates `StrategyClassifier()` → classifies the query text (regex + optional `classify_llm`) → applies a strategy preset (e.g. `multi_hop`: PPR + gap_filling + entity_chain rewrites). The in-process baseline (`experiments/benchmarks/competitors/mnemoverse_adapter.py:124-131`) sends `ReadRequest(..., two_pass=True)` — `strategy` field NOT set → defaults to `None` → no classifier, no preset. Net: HTTP path can apply `multi_hop` preset on its own retrieval; in-proc path runs without preset on the same query. |
| applies_to | every `mnemoverse_http` cell produced from PR #292 (`fix/asym-024-two-pass-strategy`) onward. Engine cells (`mnemoverse_engine` row) are unaffected — they keep `strategy=None` per the in-proc adapter contract. |
| effect | The `mnemoverse_http` row gets a STRUCTURALLY STRONGER retrieval pipeline than the `mnemoverse_engine` row on the SAME server engine. HTTP-vs-engine comparisons within the matrix are no longer apples-to-apples on retrieval algorithm. The asymmetry is intra-Mnemoverse (not vs competitors): it determines which Mnemoverse row leads the published numbers. Eduard's plan-185 mandate reads (translated from Russian) "get results with the API that are better than local, without cheating. Guards so that no cheating happens"; this asymmetry is disclosed per the mandate's "without cheating" clause and is NOT hidden in cells. Public-facing audit + landscape commentary must surface ASYM-027 alongside ASYM-012 (Hebbian feedback kept-advantage) whenever discussing HTTP-vs-engine deltas. |
| fix_next_round | Two options: (a) **align in-proc baseline** to also send `strategy="auto"` — requires re-running engine baseline cells (#281 matrix baseline) on a new SHA; gain truly-symmetric measurement at the cost of moving the engine number; OUT OF SCOPE for PR #292. (b) **Accept the asymmetry as a designed product disclosure** per plan-185 mandate (the path Eduard chose 2026-06-10) — disclosed in this inventory + in adapter comments; published numbers carry an ASYM-027 banner. Re-evaluate after pilot conv-26 + conv-47 + held-out conv-30 results land — if `multi_hop` preset is doing heavy lifting (large lift on multi_hop questions) and direction-of-effect is unambiguous, record magnitude in a follow-up commit. |
| status | disclosed_kept_algorithm_advantage |

### ASYM-025 — `MnemoverseHttpAdapter.ingest` POSTs raw turns with `concepts=[]` and relies on server concept extraction; in-process baseline calls `ingest_conversation_direct` which extracts concepts client-side before write

| field | value |
|---|---|
| severity | medium |
| direction | UNKNOWN |
| evidence | `experiments/benchmarks/competitors/mnemoverse_http_adapter.py:342-365` builds each write item as `{"content": turn.text, "concepts": [], "domain": domain, "metadata": {...}}` and POSTs to `/api/v1/memory/write-batch`. The adapter's own comment (lines 350-355) states "Concepts are left empty — server's concept-extraction path extracts them so this stays apples-to-apples with the in-process MnemoverseAdapter (concepts=[] passed there too)". The in-process path calls `ingest_conversation_direct` (`experiments/benchmarks/locomo/ingest.py:115-127, 190-201`) which builds `content = f"[{turn.session_datetime}] {turn.speaker}: {turn.text}"` (note: includes session_datetime + speaker prefix in the content itself, not just metadata) and passes `concepts=concepts_from(turn)` (line 71 — `[speaker.lower(), entity_words...]`). The two paths therefore differ in BOTH (a) the textual content stored (HTTP: raw `turn.text`; in-proc: `[datetime] speaker: text`) and (b) the concept-extraction trigger (HTTP: server-side; in-proc: client-side helper, then server extracts again if config so demands) |
| applies_to | every `mnemoverse_http` cell produced from `feat/mnemoverse-http-adapter` (2026-06-06 onward); the `mnemoverse_engine` baseline does NOT have this asymmetry because it does not use the HTTP path |
| effect | UNKNOWN-direction. Content asymmetry (datetime + speaker prefix vs raw text) likely tilts retrieval recall — datetime-prefixed content matches temporal queries better; raw content does not. Concept asymmetry (client-extracted heuristic vs server-extracted ML) tilts in either direction depending on which extractor is stronger. Net direction not measured. Could be a contributor to the observed HTTP vs engine gap, alongside ASYM-024 (which is almost certainly the dominant term). |
| fix_next_round | DONE — option (a) implemented 2026-06-11 (plan-Б): adapter sends `[{turn.session_datetime}] {turn.speaker}: {turn.text}` + `concepts=_extract_concepts(turn)` imported from locomo/ingest.py (single source). Trigger: the G2-B ablation (conv-30 pre/post two_pass, retrieval Jaccard 1.00) REFUTED ASYM-024 dominance, while raw-text ingest demonstrably killed temporal questions (cat2 +0pp at every k; smoke row-0 answered 'yesterday' to a when-question) |
| status | closed_2026-06-11 — historical mnemoverse_http cells emitted before the closure still carry the asymmetry and must not be re-cited without a re-run |

### ASYM-028 — Shared reader prompt is LoCoMo-tuned (anti-abstain + list-format + brevity rules) iterated on dev conv-26; applies to every system equally but bakes benchmark knowledge into the harness

| field | value |
|---|---|
| severity | medium |
| direction | BOTH (symmetric across systems — shifts ALL rows vs out-of-box usage; not a row-vs-row bias) |
| evidence | `experiments/benchmarks/locomo/evaluate.py:34-48` — `ANSWER_PROMPT` carries LoCoMo-failure-mode-tuned rules: `Do NOT say "I don't know"`, `NEVER refuse to answer` (anti-abstain), `list ALL items, comma-separated` (list-format gold answers), `Be BRIEF ... No filler` (LoCoMo judge brevity preference). The symmetric runner consumes the SAME template for every system (`_runner_main.py:933-943`, stripping only the per-QA `{category_instructions}` placeholder — that strip is ASYM-002's axis, not this one). Spun out of ASYM-002 on 2026-06-10: the LoCoMo-tuned base prompt was previously visible only inside ASYM-002's evidence text and never separately disclosed (gap-analysis finding: "hidden inside ASYM-002 — external comms pass the sniff test only because it is not named"). |
| applies_to | EVERY cell — `asymmetric_v1` and `symmetric_v1` alike — produced with the shared `ANSWER_PROMPT`; all systems equally. (Historical `asymmetric_v1` cells were stamped before this id existed and do not carry it in `known_asymmetries`; the id is stamped at write time on `symmetric_v1` cells only.) |
| effect | Not a row-vs-row bias: all systems share the prompt. It is a bench-tuning disclosure: the prompt rules were iterated against dev conv-26 failure modes, so ABSOLUTE scores overstate out-of-box performance for every system, and the prompt's transfer to unseen conversations (conv-47, held-out conv-30) is part of what the G2 held-out gate measures rather than a free assumption. |
| fix_next_round | None required for fairness (the prompt is symmetric). Keep the id stamped on every `symmetric_v1` cell so external readers see the harness-level tuning; re-deriving a "neutral" prompt would move all numbers and break comparability with published rows. |
| status | documented |

### ASYM-029 — Async competitor stores are queried before they settle: ingest() returning ≠ store being queryable; the harness measured an empty/partial index and published it as the system's score

| field | value |
|---|---|
| severity | high |
| direction | FAVORS_MNEMOVERSE (depresses competitor scores; our own rows ingest synchronously in-engine or settle-poll via the HTTP adapter) |
| evidence | 2026-06-11 conv-47 matrix run (`experiments/results/run-2026-06-11/conv47-matrix/quarantine/`): **supermemory** — 31 session-docs uploaded 13:18–13:19, ALL five k-sweeps 13:19–16:40 retrieved median **0** items (nonzero rows per sweep: k10 4/190, k20 6/190, k200 9/190). CAUSE CORRECTION (PR #308, same day): the PROVEN dominant cause for supermemory was **API response-shape drift** — the live v4 search moved result text into a `memory` field the extractor didn't know, so the adapter dropped every result even on a fully settled index (reproduced 17:59: raw client 30 results, adapter 0 snippets; fixed by adding the `memory` field). Settle lag for supermemory in the run window is UNPROVEN (run-time logs record only HTTP 200s, not payload sizes); the post-run live probes (committed `supermemory_settle_probe.json`) prove the index was alive after the run, not that it was empty during it. `judge_mnemoverse` ≈ 0.11 flat = reader-on-empty-context baseline. `_query_failure_rate` ≈ 0 (a single timeout across 950 queries; the k=10 cell records 1/190) because the search calls themselves succeed — the ASYM-003 outage gate cannot see this failure mode. **mem0_v3_cloud** (the settle-lag evidence proper — unaffected by the supermemory cause correction) — k=10 (the FIRST sweep after a 36 s async-accepted upload of 689 turns) scored 0.274 in its first half vs 0.484 in its second — a 0.21 gap; later sweeps' halves differ by ≤0.10 (k20 0.65/0.64, k50 0.59/0.68, k100 0.62/0.64, k200 0.67/0.67) → the LLM fact-extraction pipeline was still populating the store during the first sweep. |
| applies_to | every cell of `supermemory`, `mem0_v3_cloud`, `mem0_v2_oss` (shared extraction pipeline; OSS undocumented incident-wise, disclosed pre-emptively; requery cells keep the stamp until the settle-poll lands). zep ingests through a synchronous episode wait (623 s for conv-47) and showed nonzero retrieval from row 0 — not stamped. |
| effect | An async competitor's row (or its first sweep) measures the indexing queue, not the memory system. Publishing it UNDERSTATES the competitor — the same dishonesty class as inflating our own numbers. The 2026-06-11 supermemory row (all 5 cells) and mem0 k=10 cell were quarantined manually on discovery. |
| fix_next_round | THIS PR: (1) `INVALIDATION_EMPTY_RETRIEVAL_RATE` write-time quarantine gate — >50 % of rows with empty `retrieved_atom_ids` → `quarantine/`, mirrors the ASYM-003 gate; (2) `--requery-existing-corpus` runner flag — re-query the SAME settled corpus without re-ingesting duplicates (supermemory `reset()` is best-effort, not a guaranteed wipe; requery mode skips BOTH ingest and reset), cells stamped `config.requery_of_existing_corpus=true`. PR-B (planned): post-ingest settle-poll in the supermemory/mem0 adapters (probe-search until the store answers or a disclosed timeout aborts the system run). |
| status | active — mitigated by gates; settle-poll pending |
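
The write-time gate described in fix item (1) can be sketched as below. The constant name and the 50 % threshold come from the entry above; the helper names and the row shape are assumptions, and the real gate's handling of edge cases (for example a cell with no rows) is not documented here.

```python
from typing import Mapping, Sequence

INVALIDATION_EMPTY_RETRIEVAL_RATE = 0.5  # ">50 % of rows" per the entry above


def empty_retrieval_rate(rows: Sequence[Mapping[str, Sequence[str]]]) -> float:
    """Share of rows whose retrieved_atom_ids is missing or empty."""
    if not rows:
        return 1.0  # assumption: a cell with no rows is treated as fully empty
    empty = sum(1 for row in rows if not row.get("retrieved_atom_ids"))
    return empty / len(rows)


def destination(rows: Sequence[Mapping[str, Sequence[str]]]) -> str:
    """Route a cell to 'quarantine' when most rows retrieved nothing."""
    if empty_retrieval_rate(rows) > INVALIDATION_EMPTY_RETRIEVAL_RATE:
        return "quarantine"
    return "publish"


# Shape of the 2026-06-11 supermemory k=10 sweep: 4 nonzero rows out of 190.
supermemory_k10 = [{"retrieved_atom_ids": ["a1"]}] * 4 + [{"retrieved_atom_ids": []}] * 186
healthy = [{"retrieved_atom_ids": ["a1", "a2"]}] * 190
print(destination(supermemory_k10), destination(healthy))
```

Unlike the `_query_failure_rate` gate, this check looks at what came back rather than whether the call succeeded, which is why it can catch a search that returns HTTP 200 with nothing usable.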

### ASYM-030 — In-proc shim engine runs with `read_budget_seconds=120` while production defaults to 10.0

| field | value |
|---|---|
| severity | medium |
| direction | FAVORS_MNEMOVERSE (intra-Mnemoverse: the in-proc reference row only; the headline `mnemoverse_http` row hits the production server with its production budget) |
| evidence | `src/mnemo/config.py:1409` — `EngineConfig.read_budget_seconds: float = 10.0` (production default; Railway serves reads under it because the long-lived server amortizes the ConceptExpander cold-start build across requests). The in-proc shim pays that build INSIDE its first `engine.read()`: 2026-06-11 conv-47 row — `concept_expander_build_cancelled` + `engine_read_budget_exceeded budget_s=10.0` on EVERY query, 553/553 failures by process kill — 190 at k=10, 190 at k=20, 173 into k=50 — with empty error strings (cancelled-build `TimeoutError` has an empty `str()`), 100 % of the row lost. BEAM offline runs already used 120 s for the same reason. |
| applies_to | every `mnemoverse` (in-proc) cell produced with `MnemoverseAdapter(read_budget_seconds=120.0)` (the new default in `mnemoverse_adapter.py`). |
| effect | The in-proc row answers reads that a budget-10 production engine would refuse (503). Latency per read is still honestly recorded per row (`t_retrieval_s`); the asymmetry is availability, not speed. Without the raised budget the row cannot be measured at all on a cold per-process engine. |
| fix_next_round | None for fairness of the headline row (http measures production). Possible alternative: eager expander build during ingest (excluded from query timing) would let the shim run the production budget — engine follow-up issue. |
| status | active — disclosed on in-proc cells via `known_asymmetries` |

## Closing the inventory

This file was originally frozen as of the `asymmetric_v1` round
publish, with the expectation that the proper-architecture round
(`symmetric_v1`: single runner, identical reader prompt and context
formatter for all adapters, identical n-filter, single judge prompt
per judge name) would close every finding here and produce
`eval_path: symmetric_v1` cells that do NOT reference this inventory.

**Correction (2026-06-10, verified against the post-#291/#292/#293
code):** the symmetric harness closes the PATH asymmetries by
construction (ASYM-001/002/004/005/007/008/009/010 do not
differentiate systems when every system runs through
`_runner_main.py`), PR #291/#292 closed ASYM-006/021/024, and the zep
limit cap (ASYM-023) is fixed at `zep_adapter.py:496`
(`effective_limit = max(top_k, self._limit)`) — but a
subset of findings remains live on the symmetric path and IS stamped
onto `symmetric_v1` cells at write time by
`competitors/_asymmetry_registry.py` (gap-analysis blocker B1):
ASYM-011 and ASYM-014 (silent-failure findings in `_query_loop.py`
itself), ASYM-028 (harness-level prompt tuning), ASYM-022 (when a
lenient judge is in the set), and the per-adapter items
ASYM-012/013/016/019/020/026/027 (ASYM-025 closed 2026-06-11). The
registry module is the single source for which ids apply to which
system; its mapping is test-checked against this file.
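
A registry-versus-inventory consistency check of the kind described above could be sketched as follows. The actual structure of `_asymmetry_registry.py` and its test is not shown in this file, so the `system -> set of ids` mapping and both function names are assumptions.

```python
import re
from typing import Dict, Set

# Entry headings in this file start with "### ASYM-NNN " followed by a label.
HEADING_RE = re.compile(r"^### (ASYM-\d{3}) ", re.MULTILINE)


def inventory_ids(markdown: str) -> Set[str]:
    """Collect every ASYM id that has its own heading in the inventory."""
    return set(HEADING_RE.findall(markdown))


def unknown_registry_ids(
    registry: Dict[str, Set[str]], markdown: str
) -> Dict[str, Set[str]]:
    """Return, per system, registry ids that have no heading in the inventory."""
    known = inventory_ids(markdown)
    return {
        system: ids - known
        for system, ids in registry.items()
        if ids - known
    }


sample_inventory = """\
### ASYM-016 - Bare-int ingest return
### ASYM-020 - Supermemory unit mismatch
"""
sample_registry = {
    "supermemory": {"ASYM-016", "ASYM-020"},
    "zep": {"ASYM-016", "ASYM-099"},
}
print(unknown_registry_ids(sample_registry, sample_inventory))
```

A test would assert that the result is empty when run against the real file, so an id stamped on cells always resolves to an entry a reader can look up.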

**Out-of-scope at this severity:** the audit also flagged
`info`-severity items (failure-surface threshold semantics; Phase B1
cap_50 historical context) and a `low`-severity item (matrix vs raw
two-schema split). These are documentation / consumer-API concerns,
not asymmetries that bias cell numbers, and are not catalogued here.

**Publication-layer issues separate from this inventory:** the audit
also produced a marketing-language sweep covering headline framing,
best-k cherry-picking, and single-judge selection in earlier drafts
of `WAVE14_MORNING_BRIEFING.md`, `COMPETITIVE_LANDSCAPE.md`, and
`RUN_REGISTRY.md`. Those violations are publication-layer concerns,
not eval-pipeline asymmetries, and do not receive `ASYM-NNN` ids.
The fixes for them are visible in the briefing's current revision.

**Coverage gap to be aware of when reading this file:** the catalogued
28 items come from two of the five audit lenses (code-path-asymmetry
and silent-failure, audited 2026-06-05); ASYM-024..027 were added
2026-06-06..10 from the HTTP-adapter rounds, and ASYM-028 on
2026-06-10 from the gap-analysis pass as a spin-out of ASYM-002.
ASYM-029 and ASYM-030, which rest on 2026-06-11 run evidence, are not
included in that count. Three further lenses (config-tampering,
judge-bias, comparability) ran on the same audit and produced 17
additional `critical`/`high` items not yet promoted here. Those items
overlap conceptually with what is catalogued (e.g. the same
single-judge framing issue surfaces in both the judge-bias lens and
in the publication-layer sweep), but a future revision should either
promote them with `ASYM-NNN` ids or document the de-duplication
rationale per item.

**Reference pattern for cell metadata:**

```yaml
eval_path: asymmetric_v1
asymmetry_inventory:
  - ASYM-001
  - ASYM-002
  - ASYM-004
  # ... (cells should list every ASYM-NNN that materially affects their numbers)
inventory_doc: experiments/benchmarks/ASYMMETRY_INVENTORY.md
```