From Noise to Knowledge: Rethinking Benchmarks for the Early Training of LLMs

Join us in building benchmarks that capture early-stage reasoning & scientific knowledge in LLMs


Register here to participate

Competition Overview

The development of Large Language Models (LLMs) typically begins with a series of ablation experiments, in which various model architectures, data mixtures, and training hyperparameters are systematically evaluated. This phase is commonly referred to as the early stage of training. During this period, researchers primarily monitor two key metrics: the training loss curve and evaluation scores. However, existing evaluation benchmarks often fail to provide meaningful or discriminative signals in these initial stages, where LLMs have been trained on relatively few tokens (around 300B), making it difficult to draw conclusive insights from ongoing experiments.
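
As an illustration of the kind of signal tracked at this stage, the sketch below computes the per-token cross-entropy of an intermediate checkpoint on a held-out scientific passage using the Hugging Face transformers library. This is not official competition code, and the checkpoint identifier is a hypothetical placeholder rather than one of the released models.

```python
# Minimal sketch (not official competition code): per-token cross-entropy of an
# intermediate checkpoint on a held-out scientific passage. The checkpoint name
# below is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "example-org/small-lm-1b-ckpt-100b-tokens"  # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).eval()

passage = "Photosynthesis converts light energy into chemical energy stored in glucose."
inputs = tokenizer(passage, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Per-token cross-entropy: {loss.item():.3f}")
```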

This competition tackles the challenge of designing scientific knowledge evaluation tasks specifically tailored to measuring early training progress of language models. Participants are invited to develop novel evaluation methodologies or adapt existing benchmarks to better capture performance differences among language models. To support this effort, we provide three pre-trained small models (0.5B, 1B, and 3B parameters), along with intermediate checkpoints sampled during training up to 200B tokens. All experiments and development work can be run on widely available free cloud-based GPU platforms, making participation accessible to researchers with modest hardware.

Submissions will be evaluated on three criteria: the quality of the performance signal they produce, the consistency of the model rankings they induce with rankings at 1 trillion tokens of training, and their alignment with the scientific knowledge domain. By promoting the design of tailored evaluation strategies for small language models (SLMs), this competition aims to attract a broad range of participants from around the world, including those who are not machine learning experts or do not have access to dedicated GPU resources. Ultimately, this initiative seeks to make foundational LLM research more systematic and benchmark-informed from the earliest phases of model development.
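
The sketch below illustrates, under stated assumptions, one way a custom task might probe such checkpoints: score each answer option of a scientific multiple-choice question by its log-likelihood under the model, use the margin of the correct option over the best distractor as a smoother signal than 0/1 accuracy, and check how well the resulting model ordering agrees with a reference ordering (for example, the one observed at 1 trillion tokens). The checkpoint identifiers, the question, and the reference ordering are illustrative placeholders, and the official scoring pipeline may differ.

```python
# Minimal sketch (not the official evaluation harness): log-likelihood scoring of
# answer options for a scientific multiple-choice question, plus a rank-correlation
# check against a placeholder reference ordering. Checkpoint names are hypothetical.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_logprob(model, tokenizer, question, option):
    """Approximate sum of log-probabilities of the option tokens given the question."""
    prompt_len = tokenizer(question, return_tensors="pt")["input_ids"].shape[1]
    full_ids = tokenizer(question + " " + option, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..N-1
    targets = full_ids[0, prompt_len:]                      # option tokens only
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(positions, targets))

question = "Q: Which gas do plants absorb during photosynthesis? A:"
options = ["carbon dioxide", "oxygen", "nitrogen"]
correct = 0  # index of the correct option

checkpoints = [  # hypothetical identifiers for checkpoints of the three released models
    "example-org/small-lm-0.5b-ckpt-100b-tokens",
    "example-org/small-lm-1b-ckpt-100b-tokens",
    "example-org/small-lm-3b-ckpt-100b-tokens",
]

margins = []
for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    logprobs = [option_logprob(model, tokenizer, question, opt) for opt in options]
    # Margin of the correct option over the best distractor: smoother than 0/1 accuracy.
    margins.append(logprobs[correct] - max(lp for i, lp in enumerate(logprobs) if i != correct))

# Ranking consistency against a placeholder reference ordering (higher = stronger at 1T tokens).
reference_quality = [1, 2, 3]  # illustrative assumption: 0.5B weakest, 3B strongest
rho, _ = spearmanr(margins, reference_quality)
print("Per-model margins:", margins, "| Spearman rho vs. reference ordering:", rho)
```

In practice, averaging such margins over many questions and many checkpoints, and examining how monotonically they improve with the number of training tokens, is one plausible way to reason about the quality of the performance signal a task produces.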


Competition Timeline

Competition kick-off 14 July 2025
Warm-up Phase 14 July 2025 - 17 August 2025 (5 weeks)
Development Phase 18 August 2025 - 26 October 2025 (10 weeks)
Final Phase 27 October 2025 - 03 November 2025 (1 week)
Results Announcement 04 November 2025
Winners' Fact Sheets & Code Release Due 22 November 2025
NeurIPS Competition Workshop Presentation 6 or 7 December 2025

Prizes

A public leaderboard showcasing the top evaluation tasks across a diverse set of LLMs will be maintained.

Special prizes

Contact and Support

For inquiries and support, reach out to the task coordinators at e2lmc@tii.ae.

Organizers

Affiliated Institutions

TII
Spotify
Caltech
Sorbonne
Oxford