Evaluation

What Wenlan's published retrieval numbers mean, how they are generated, and what they do not claim.

Qi-Xuan Lu · Updated Jul 2, 2026 · 6 min read

At a glance

The public numbers are retrieval metrics, not end-to-end answer-quality claims.

The current CE-reranker retrieval snapshot reports LME_Oracle at 93.6% Recall@5 / 0.857 MRR / 0.883 NDCG@10 and LME_S at 87.7% Recall@5 / 0.815 MRR / 0.822 NDCG@10.

Current snapshot

Wenlan publishes a compact retrieval snapshot for the shipped hybrid retrieval path. The snapshot uses BGE-Base-EN-v1.5-Q embeddings, FTS5, Reciprocal Rank Fusion, and the local BGE reranker when enabled.
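As a rough illustration of the fusion step named above, the following is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It is not Wenlan's implementation: the function name, the example ids, and the constant k=60 (the value commonly used in the RRF literature) are assumptions, and the shipped path may weight or truncate the lists differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single ranking.

    Each id scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is an assumed default, not a value
    taken from the Wenlan repository.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from embedding search, one from FTS5.
dense = ["m3", "m1", "m7"]
lexical = ["m1", "m9", "m3"]
fused = reciprocal_rank_fusion([dense, lexical])
```

Ids that appear in both lists accumulate two contributions, which is why they tend to rise above ids found by only one retriever.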

The numbers below should be read as retrieval-only metrics over the stated fixtures. LME_S is the deep LongMemEval substrate; the current representative snapshot uses the stratified N=90 fixture, with 84 gradeable retrieval rows.

README snapshot

Benchmark                         Scope                         Recall@5   MRR     NDCG@10
LME_Oracle                        CE-reranked, 500 Q            93.6%     0.857   0.883
LME_S                             CE-reranked, deep N=90        87.7%     0.815   0.822

What is measured

The headline metrics are Recall@5, MRR, and NDCG@10. They measure whether the retrieval layer surfaces relevant context near the top of the result set.

This is useful because Wenlan's job is to bring the right memories, pages, decisions, and graph context into the next agent session. Strong retrieval metrics do not prove that a downstream model will always answer correctly.
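To make the three headline metrics concrete, here is a minimal sketch of how each can be computed for a single query. It assumes binary relevance, unique ids in the ranked list, and Recall@k defined as the fraction of relevant ids found in the top k. These are illustrative choices; the harness in crates/wenlan-core/src/eval may define the metrics differently, and all function names here are hypothetical.

```python
import math


def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant ids that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("a row with no relevant ids cannot be graded")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the top k divided by the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical query: two relevant ids, retrieved at ranks 2 and 6.
ranked = ["a", "b", "c", "d", "e", "f"]
relevant = {"b", "f"}
recall = recall_at_k(ranked, relevant)   # only "b" is in the top 5
rr = reciprocal_rank(ranked, relevant)   # first hit is at rank 2
ndcg = ndcg_at_k(ranked, relevant)
```

MRR is the mean of the per-query reciprocal ranks, and the reported Recall@5 and NDCG@10 are likewise averages over the graded queries.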

What is not measured

The published table is not a full product-quality score, a guarantee of answer quality, or a latency benchmark. It is also not a cross-product claim unless the comparison page explicitly states the protocol.

Single-run snapshots are useful for development direction. Public claims about improvements should rest on numbers regenerated under the current schema and, for headline claims, be backed by repeated runs or documented methodology.
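One simple way to back a claim with repeated runs is to report the mean and spread of a metric across runs rather than a single value. The sketch below assumes each run yields one Recall@5 value; the function name is hypothetical and the numbers are placeholders for illustration, not Wenlan results.

```python
from statistics import mean, stdev


def summarize_runs(values):
    """Return (mean, sample standard deviation) of one metric across runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(values), stdev(values)


# Placeholder values only; these are not measured results.
recall_at_5_runs = [0.80, 0.82, 0.81]
avg, spread = summarize_runs(recall_at_5_runs)
```

A difference between two retrieval modes that is smaller than the run-to-run spread is weak evidence of a real improvement.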

Where the harness lives

The eval harness lives in crates/wenlan-core/src/eval. The workflow docs live in docs/eval in the Wenlan repository.

Slow GPU or API-backed evals are manual. Normal CI avoids running heavy model benchmarks because hosted runners do not provide the right hardware, secrets, or cost profile.

LongMemEval retrieval runs report Recall@5, MRR, and NDCG@10 as their headline fields. The LME_S row comes from docs/eval/results/lme_s_90_bge_base_pool20.summary.json in the Wenlan repo.
The README updater reads a local metrics JSON and writes the tracked README snapshot.
Raw local baseline artifacts stay outside git; the repository keeps the methodology and curated snapshot.
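The README update step described above can be pictured as reading a metrics JSON and formatting one table row per benchmark. The sketch below is an assumption-laden illustration, not the actual updater: the field names recall_at_5, mrr, and ndcg_at_10 and the function render_row are hypothetical, and the real summary file may use a different schema. The metric values are the LME_S figures from the snapshot table.

```python
import json


def render_row(benchmark, scope, metrics):
    """Format one snapshot table row from a metrics mapping.

    The keys used here are hypothetical; adjust them to the real schema.
    """
    return (
        f"{benchmark:<34}{scope:<30}"
        f"{metrics['recall_at_5'] * 100:.1f}%     "
        f"{metrics['mrr']:.3f}   {metrics['ndcg_at_10']:.3f}"
    )


# Inline JSON stands in for the contents of a local summary file.
summary_json = '{"recall_at_5": 0.877, "mrr": 0.815, "ndcg_at_10": 0.822}'
row = render_row("LME_S", "CE-reranked, deep N=90", json.loads(summary_json))
```

Generating the row from the JSON, rather than typing it by hand, keeps the published table tied to a specific metrics artifact.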

How to rerun

Clone the repository, follow the eval docs, and run the appropriate ignored eval harness command for the benchmark you want to reproduce. Expect slow runs when local models or judges are involved.

When comparing two retrieval modes, regenerate both sides under the same schema and fixture revision. Cross-schema comparisons are treated as invalid by the repository's eval discipline.
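The same-schema rule can be enforced mechanically before any delta is reported. The following sketch assumes each run summary records a schema version and a fixture revision; the RunSummary type, its field names, and the example values are all hypothetical and are not taken from the Wenlan repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    mode: str
    schema_version: str    # hypothetical field
    fixture_revision: str  # hypothetical field
    recall_at_5: float


def recall_delta(baseline, candidate):
    """Return candidate minus baseline Recall@5, refusing mismatched runs."""
    same_basis = (
        baseline.schema_version == candidate.schema_version
        and baseline.fixture_revision == candidate.fixture_revision
    )
    if not same_basis:
        raise ValueError(
            "runs differ in schema or fixture revision; regenerate both sides"
        )
    return candidate.recall_at_5 - baseline.recall_at_5


# Placeholder runs for illustration; the values are not measured results.
baseline = RunSummary("no-rerank", "v2", "rev-a", 0.80)
candidate = RunSummary("ce-rerank", "v2", "rev-a", 0.85)
delta = recall_delta(baseline, candidate)
```

Raising an error instead of returning a number makes an invalid cross-schema comparison fail loudly rather than produce a misleading delta.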

