VibeSearch-Pro
100 professional research scenarios — literature reviews, market analysis, technical due diligence across specialized domains.
Professional
Proactive Search · Evolving Intent · Structured Knowledge
By far the hardest verifiable long-horizon search benchmark: 200 bilingual tasks that test proactive search in the wild, with persona-driven progressive disclosure and schema-free knowledge graph evaluation.
Real users rarely specify full intent upfront. VibeSearch captures bidirectional convergence: agents interleave partial results with follow-up questions while users progressively disclose needs. VibeSearchBench pairs each task with a persona simulator and evaluates schema-free knowledge graphs via graph matching (Precision / Recall / F1).
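As a rough illustration of the graph-matching scores mentioned above, the sketch below computes Precision / Recall / F1 over (head, relation, tail) triples. It is not the benchmark's implementation: the names `graph_prf`, `exact_match`, and `normalize` are hypothetical, the greedy one-to-one matching is an assumption, and the normalized exact-match comparison only stands in for the LLM judge the benchmark describes.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def normalize(triple: Triple) -> Triple:
    """Lowercase and collapse whitespace in each element of a triple."""
    return tuple(" ".join(part.lower().split()) for part in triple)


def exact_match(pred: Triple, gold: Triple) -> bool:
    """Stand-in judge (assumption): normalized string equality.

    The benchmark describes an LLM-as-judge here; any callable with the
    same signature could be swapped in.
    """
    return normalize(pred) == normalize(gold)


def graph_prf(
    predicted: Iterable[Triple],
    gold: Iterable[Triple],
    judge: Callable[[Triple, Triple], bool] = exact_match,
) -> Dict[str, float]:
    """Greedy one-to-one matching of predicted triples against gold triples."""
    pred_list: List[Triple] = list(dict.fromkeys(predicted))  # drop exact duplicates
    gold_list: List[Triple] = list(dict.fromkeys(gold))
    unmatched = list(gold_list)

    true_positives = 0
    for p in pred_list:
        for g in unmatched:
            if judge(p, g):
                unmatched.remove(g)  # each gold triple can be matched only once
                true_positives += 1
                break

    precision = true_positives / len(pred_list) if pred_list else 0.0
    recall = true_positives / len(gold_list) if gold_list else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example data, for illustration only.
gold_graph = [
    ("Laptop A", "price", "899 USD"),
    ("Laptop A", "weight", "1.3 kg"),
    ("Laptop B", "price", "1099 USD"),
    ("Laptop B", "weight", "1.6 kg"),
]
predicted_graph = [
    ("laptop a", "price", "899 USD"),   # matches after normalization
    ("Laptop A", "weight", "1.3 kg"),   # matches
    ("Laptop B", "price", "999 USD"),   # wrong value, no match
]
scores = graph_prf(predicted_graph, gold_graph)
```

With this toy data, two of three predicted triples match two of four gold triples, so precision is 2/3, recall is 1/2, and F1 is 4/7. Because the graphs are schema-free, a real judge would need to accept paraphrased relations and entities, which exact matching cannot do.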
100 daily-life search tasks: shopping, travel, lifestyle decisions with vague initial queries and evolving preferences.
Daily-life
Progressive-disclosure user simulator, multi-turn tool use (search / visit / code), and LLM-as-judge graph matching on predicted vs. ground-truth triples.
Graph F1
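To make the progressive-disclosure idea concrete, here is a minimal rule-based sketch of a user simulator that starts from a vague query and reveals one hidden need at a time, only when the agent asks about it. Everything in it is hypothetical: the class `PersonaSimulator`, its fields, and the keyword-trigger rule are illustrative assumptions, and the benchmark's actual simulator is persona-driven rather than keyword-based.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PersonaSimulator:
    """Toy simulated user (assumption, not the benchmark's simulator)."""

    initial_query: str
    # Maps a lowercase topic keyword to the need disclosed when the agent asks about it.
    hidden_needs: Dict[str, str]
    disclosed: List[str] = field(default_factory=list)

    def reply(self, agent_question: str) -> str:
        """Disclose at most one not-yet-revealed need that the question touches on."""
        question = agent_question.lower()
        for topic, need in self.hidden_needs.items():
            if topic in question and need not in self.disclosed:
                self.disclosed.append(need)
                return need
        return "No further preference on that."

    @property
    def fully_disclosed(self) -> bool:
        """True once every hidden need has been revealed."""
        return len(self.disclosed) == len(self.hidden_needs)


# Made-up persona, for illustration only.
user = PersonaSimulator(
    initial_query="I need a laptop.",
    hidden_needs={
        "budget": "Under 1000 USD.",
        "weight": "Lighter than 1.5 kg.",
    },
)
first = user.reply("What is your budget?")
second = user.reply("Any color preference?")
third = user.reply("Do you care about weight?")
```

In this sketch the agent only learns the budget and weight constraints by asking follow-up questions, which mirrors the interleaving of partial results and clarifying questions described above.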