VibeSearchBench

Proactive Search · Evolving Intent · Structured Knowledge

200 Tasks
20 Domains
Avg user turns
Avg tool turns
Best F1

A demanding, verifiable long-horizon search benchmark: 200 bilingual tasks that test proactive search in the wild, with persona-driven progressive disclosure and schema-free knowledge graph evaluation.

What is VibeSearch?

Real users rarely specify their full intent upfront. VibeSearch captures bidirectional convergence: the agent interleaves partial results with follow-up questions while the user progressively discloses needs. VibeSearchBench pairs each task with a persona simulator and scores the schema-free knowledge graph the agent produces by graph matching against a ground-truth graph (Precision / Recall / F1).
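The progressive-disclosure idea can be sketched in a few lines. This is a minimal illustration only, not the benchmark's simulator: the class name `PersonaSimulator`, its fields, and the rule "reveal one hidden need per follow-up question" are all assumptions made for the example. The actual simulator is persona-driven and presumably far richer.

```python
from dataclasses import dataclass, field


@dataclass
class PersonaSimulator:
    """Hypothetical rule-based stand-in for a persona-driven user simulator."""

    initial_query: str                 # the vague opening request
    hidden_needs: list[str]            # needs the user has not stated yet
    revealed: list[str] = field(default_factory=list)

    def first_message(self) -> str:
        # The user opens with an underspecified query.
        return self.initial_query

    def reply(self, agent_asked_follow_up: bool) -> str:
        # Assumed rule: disclose the next hidden need only when the agent asks.
        if agent_asked_follow_up and not self.fully_disclosed:
            need = self.hidden_needs[len(self.revealed)]
            self.revealed.append(need)
            return need
        return "Nothing more to add for now."

    @property
    def fully_disclosed(self) -> bool:
        return len(self.revealed) == len(self.hidden_needs)


sim = PersonaSimulator(
    initial_query="I need a laptop.",
    hidden_needs=["It should weigh under 1.5 kg.", "My budget is 1000 dollars."],
)
print(sim.first_message())
print(sim.reply(agent_asked_follow_up=True))
```

In this sketch an agent that never asks follow-up questions never learns the hidden needs, which is the behavior the progressive-disclosure setup is meant to expose.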

VibeSearch-Pro

100 professional research scenarios: literature reviews, market analysis, and technical due diligence across specialized domains.

Professional

VibeSearch-Daily

100 daily-life search tasks: shopping, travel, and lifestyle decisions that start from vague initial queries and evolving preferences.

Daily-life

Evaluation

Evaluation combines a progressive-disclosure user simulator, multi-turn tool use (search / visit / code), and LLM-as-judge graph matching between predicted and ground-truth triples.
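Triple-level Precision / Recall / F1 can be sketched as follows. This is an assumption-laden illustration rather than the benchmark's scoring code: `graph_prf`, `exact_judge`, and the greedy one-to-one matching are hypothetical choices, and the normalized exact-match judge is only a stand-in for the LLM judge, which would be passed in through the `judge` parameter.

```python
from typing import Callable, Iterable

Triple = tuple[str, str, str]  # (subject, relation, object)


def normalize(triple: Triple) -> Triple:
    # Lowercase and collapse whitespace so trivial variations still match.
    s, r, o = (" ".join(part.lower().split()) for part in triple)
    return (s, r, o)


def exact_judge(pred: Triple, gold: Triple) -> bool:
    # Placeholder for an LLM judge: normalized exact match (assumption).
    return normalize(pred) == normalize(gold)


def graph_prf(
    predicted: Iterable[Triple],
    gold: Iterable[Triple],
    judge: Callable[[Triple, Triple], bool] = exact_judge,
) -> tuple[float, float, float]:
    """Greedy one-to-one matching of predicted triples to gold triples."""
    predicted, unmatched = list(predicted), list(gold)
    n_gold = len(unmatched)
    matched = 0
    for p in predicted:
        for i, g in enumerate(unmatched):
            if judge(p, g):
                matched += 1
                del unmatched[i]  # each gold triple is matched at most once
                break
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [
    ("Hotel A", "located in", "Kyoto"),
    ("Hotel A", "price per night", "120 USD"),
    ("Hotel A", "has", "onsen"),
]
pred = [
    ("hotel a", "Located in", "kyoto"),
    ("Hotel A", "has", "pool"),
]
precision, recall, f1 = graph_prf(pred, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because the graphs are schema-free, a real judge has to decide whether two differently worded triples express the same fact, so an exact-match judge like the one above would generally underestimate the score.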

Graph F1