# Evaluation
Marvin ships with a reproducible retrieval benchmark so changes to chunking, embeddings, ranking, or fusion can be measured against an external reference rather than vibes.
## LongMemEval-S
LongMemEval (ICLR 2025) is a public benchmark for long-term chat memory. The "small" variant (LongMemEval-S) contains 500 questions, each paired with a haystack of ~50 prior chat sessions; the gold answer is contained in one or two of those sessions. We use the `xiaowu0162/longmemeval-cleaned` release, the same one that agentmemory reports against, so the numbers are directly comparable.
The harness measures retrieval only: for each question we build a fresh in-memory index from that question's haystack, run a search with the question text, and check whether any gold session id appears in the top-K results.
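In outline, the per-question loop is only a few lines. The sketch below is illustrative: the `build_index`/`search` helpers and the entry field names are assumptions, not the harness's actual API.

```python
# Illustrative sketch of the per-question harness loop; helper names and entry
# fields are assumptions, not the harness's actual API.
def run_question(entry, top_k=10):
    index = build_index(entry["haystack_sessions"])          # fresh in-memory index
    results = index.search(entry["question"], top_k=top_k)   # query = question text
    retrieved_ids = [r.session_id for r in results]
    gold_ids = set(entry["answer_session_ids"])
    hit = any(sid in gold_ids for sid in retrieved_ids[:5])  # recall_any@5
    return retrieved_ids, gold_ids, hit
```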
## Metrics
For each question (and aggregated across all questions):
- `recall_any@K` — 1.0 if any gold session is in the top-K results.
- `NDCG@10` — normalised DCG with binary session-level relevance.
- `MRR` — reciprocal rank of the first gold session.

The published agentmemory headline is `recall_any@5`.
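Given the ranked session ids and gold ids for a question, the three metrics reduce to a few lines. This is a sketch with binary session-level relevance, not the harness's internal code:

```python
import math

def question_metrics(ranked_session_ids, gold_ids, k=10):
    """Session-level binary relevance against the gold session ids."""
    hits = [1.0 if sid in gold_ids else 0.0 for sid in ranked_session_ids]
    recall_any_at_k = 1.0 if any(hits[:k]) else 0.0
    # MRR: reciprocal rank of the first gold session (0.0 if none retrieved).
    mrr = next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0)
    # NDCG@k with binary gains; the ideal ranking puts all gold sessions first.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits[:k]))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold_ids), k)))
    ndcg_at_k = dcg / ideal if ideal else 0.0
    return recall_any_at_k, mrr, ndcg_at_k
```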
## Quick start
```bash
# 1. Download the cleaned dataset (~270 MB, one-off).
python scripts/download_longmemeval.py

# 2. Run the BM25 baseline on all 500 questions (~1 minute).
python -m marvin.eval \
    --dataset experiments/data/longmemeval_s_cleaned.json \
    --mode bm25 \
    --output experiments/results/bm25.json

# 3. Run hybrid retrieval (BM25 + dense vectors via fastembed).
python -m marvin.eval \
    --dataset experiments/data/longmemeval_s_cleaned.json \
    --mode hybrid \
    --output experiments/results/hybrid.json

# 4. Quick sanity check without downloading any embedding model.
python -m marvin.eval --dataset PATH --mode bm25 \
    --embedding-provider hash --limit 20
```
## Modes
| Mode | Index streams | Notes |
|---|---|---|
| `bm25` | SQLite FTS5 only | Fastest; deterministic; no model. |
| `vector` | sqlite-vec only | Pure dense retrieval ablation. |
| `hybrid` | FTS5 + sqlite-vec, RRF fused | Default. Mirrors `service.search`. |
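In `hybrid` mode the two ranked streams are merged with reciprocal rank fusion. A minimal sketch of RRF, assuming the conventional k = 60 constant (Marvin's exact constant and plumbing may differ):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Merge ranked lists of ids by summing 1 / (k + rank) across streams."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf_fuse([fts5_ids, vector_ids])
```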
`--rerank` composes orthogonally with any mode. The first stage fetches a pool of chunks (controlled by `--rerank-depth`), the cross-encoder scores each (query, chunk_text) pair, and the session-level result is the max-pool of the chunk scores. See Reranking below.
## Baseline numbers
Run on the `feature/eval-longmemeval` branch, multi-core CPU, no GPU, default chunking (1200 / 200), full LongMemEval-S (500 questions).
| Mode | Embedder | R@5 | R@10 | R@20 | NDCG@10 | MRR | Wall time |
|---|---|---|---|---|---|---|---|
| BM25 | n/a (FTS5) | 95.6% | 98.2% | 99.2% | 87.3% | 89.2% | 55 s |
| Hybrid | hash (deterministic) | 88.8% | 92.2% | 98.2% | 74.7% | 77.1% | 73 s |
For reference, agentmemory reports recall_any@5 = 95.2% on the same
dataset using BM25 + dense vectors with cross-encoder reranking. Marvin's
BM25 alone matches that headline number; closing the remaining gap on the
hardest question types (especially single-session-preference) is the
target for follow-up work on dense retrieval and reranking.
The "hybrid + hash" row is informative rather than aspirational: when the
real embedder isn't available, Marvin currently falls back to a feature-
hashing backend whose vectors are essentially random. Mixing them into
RRF hurts relative to BM25-only, so production deployments should
always have fastembed (or a real embedding API) installed.
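To make the caveat concrete, a feature-hashing embedder of this kind looks roughly like the sketch below (illustrative only, not Marvin's actual fallback). Two texts only end up near each other when they share literal tokens, so these vectors contribute lexical noise rather than semantic signal when fused via RRF.

```python
import hashlib
import math

def hash_embed(text, dim=256):
    """Toy feature-hashing embedder: bucket token hashes into a fixed-width vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```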
## Cross-encoder reranker lift
Measured on a 100-question LongMemEval-S subset, BM25 first-stage (hash embedder), `--rerank-depth 50`, CPU under heavy concurrent load.
| Mode | R@5 | R@10 | NDCG@10 | MRR | Median latency |
|---|---|---|---|---|---|
| BM25 | 96.0% | 99.0% | 88.9% | 89.5% | 1.5 s |
| BM25 + rerank | 98.0% | 99.0% | 94.6% | 95.3% | 125 s |
Per question type:
| Type (n) | R@5 BM25 | R@5 +rerank | NDCG@10 BM25 | NDCG@10 +rerank | MRR BM25 | MRR +rerank |
|---|---|---|---|---|---|---|
| single-session-user (n=70) | 97.1% | 98.6% | 96.0% | 98.6% | 94.7% | 98.1% |
| multi-session (n=30) | 93.3% | 96.7% | 72.4% | 85.5% | 77.2% | 88.6% |
The reranker's value is concentrated where it should be: the harder
multi-session slice jumps +13.1 pp NDCG@10 and +11.4 pp MRR. On
the easier single-session slice it still delivers a clean +2.6 pp /
+3.4 pp. Recall@10 was already at ceiling, so most of the movement shows up in the ordering metrics (NDCG@10, MRR) rather than in recall.
The latency figure reflects the host this run was executed on (load
average ~40 during the whole run). On an idle workstation the same
depth-50 rerank runs in single-digit seconds per query, and on a GPU
it's milliseconds. Production MCP queries only pay the cost when `rerank_enabled=true`.
## Hybrid with fastembed on CPU
fastembed's ONNX backend has roughly linear-then-superlinear cost in
sequence length on CPU. With the default 512-char embedding cap, a full
500-question hybrid run takes several hours on a typical workstation.
For interactive iteration:
- Cap embedding text aggressively: `--max-embed-chars 128` is ~5× faster than the default 512 with little recall impact in our spot checks.
- Use `--limit N` to evaluate on a subset.
- Use `--mode bm25` for changes that don't touch dense retrieval.
Reducing this cost is on the roadmap (smaller models, batched embedding service, optional GPU/Metal backends).
## Reranking
Hybrid retrieval is strong at finding the right chunk but RRF only uses rank order — it ignores query-document interactions. A cross-encoder reranker reads the query and each candidate jointly and typically lifts top-K precision by 5–15 points on open-domain QA tasks at the cost of a few hundred ms per query on CPU.
The harness (and `MarvinService.search`) ships with an optional reranking pass backed by fastembed's `TextCrossEncoder`. The default model is `BAAI/bge-reranker-v2-m3`: multilingual, 568M params, Apache-2.0. BAAI does not publish ONNX weights directly, so Marvin registers the community `onnx-community/bge-reranker-v2-m3-ONNX` int8 port (~570 MB, single file) via `TextCrossEncoder.add_custom_model` the first time the reranker is constructed. Any reranker listed by `TextCrossEncoder.list_supported_models()` works too — pass e.g. `--rerank-model Xenova/ms-marco-MiniLM-L-6-v2` for a faster, English-only alternative.
```bash
# BM25 retrieval + cross-encoder reranking on the first 50 questions.
python -m marvin.eval \
    --dataset experiments/data/longmemeval_s_cleaned.json \
    --mode bm25 \
    --rerank \
    --rerank-depth 50 \
    --limit 50 \
    --output experiments/results/bm25_rerank.json
```
Flags:
- `--rerank` — enable the cross-encoder.
- `--rerank-model` — HF model id (default: `BAAI/bge-reranker-v2-m3`).
- `--rerank-depth` — first-stage chunk pool size (default: 50). Chunks, not sessions: several chunks from the same session are scored independently and max-pooled back.
- `--rerank-max-chars` — per-document truncation before tokenisation (default: 1024). Keeps CPU cost bounded.
Why chunk reranking rather than session-level? LongMemEval sessions are long conversations (commonly 10–20 KB of raw turns). The reranker's input window is effectively 512 tokens, so naively prefixing a whole session discards the very signal we need. Scoring the chunks that first-stage retrieval already matched, then max-pooling to sessions, recovers the signal cleanly.
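A sketch of that chunk-then-max-pool path using fastembed's `TextCrossEncoder` (the chunk dict shape here is an assumption, and a smaller listed model is used to sidestep the custom-model registration described above):

```python
from collections import defaultdict

from fastembed.rerank.cross_encoder import TextCrossEncoder

def rerank_sessions(query, chunks, model="Xenova/ms-marco-MiniLM-L-6-v2", max_chars=1024):
    """Score (query, chunk_text) pairs, then max-pool chunk scores per session."""
    encoder = TextCrossEncoder(model_name=model)
    texts = [c["text"][:max_chars] for c in chunks]  # bound per-document cost
    scores = list(encoder.rerank(query, texts))      # one relevance score per chunk
    best = defaultdict(lambda: float("-inf"))
    for chunk, score in zip(chunks, scores):
        best[chunk["session_id"]] = max(best[chunk["session_id"]], score)
    return sorted(best, key=best.get, reverse=True)  # session ids, best first
```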
Performance: with `bge-reranker-v2-m3` quantised to int8 on CPU, 50 (query, chunk) pairs take roughly 2–4 seconds on an idle workstation and much more on a heavily-loaded box (see the measured lift table above for a worst-case number). Budget a few minutes of wall time per 100 questions in the harness, or pick a smaller model (e.g. `Xenova/ms-marco-MiniLM-L-6-v2`) for interactive iteration. MCP gateway queries pay the reranker cost once per `search()` call and only when `rerank_enabled` is set.
## Output
The CLI prints a per-question-type breakdown and writes a JSON dump:
```json
{
  "mode": "bm25",
  "embedding_provider": "hash",
  "questions": 500,
  "recall_at_5": 0.956,
  "recall_at_10": 0.982,
  "ndcg_at_10": 0.873,
  "mrr": 0.892,
  "median_latency_ms": 111.1,
  "total_seconds": 55.4,
  "per_type": { "...": "..." },
  "per_question": [ "...", "..." ]
}
```
`per_question` includes the retrieved session ids, gold ids, and per-question metrics — useful for digging into failure cases.
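For example, to pull out the questions where no gold session made the top 5 (the exact field names inside `per_question` are assumptions based on the description above):

```python
import json

with open("experiments/results/bm25.json") as fh:
    results = json.load(fh)

misses = [
    q for q in results["per_question"]
    if not set(q["gold_session_ids"]) & set(q["retrieved_session_ids"][:5])
]
for q in misses:
    print(q["question_id"], q.get("question_type"))
```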
## Programmatic API
```python
from pathlib import Path

from marvin.embeddings import EmbeddingService
from marvin.eval.longmemeval import load_dataset, run_benchmark
from marvin.reranker import RerankerService

entries = load_dataset(Path("experiments/data/longmemeval_s_cleaned.json"))
summary = run_benchmark(
    entries[:50],
    mode="hybrid",
    embedder=EmbeddingService(),
    reranker=RerankerService(provider="fastembed"),
    rerank_depth=50,
)
print(summary.recall_at_5, summary.mrr)
```