The dynamics of discovery and the Heaps-Zipf relationship

When following a sequence, such as reading a text or tracking a user’s activity, one can measure how the ‘dictionary’ of distinct elements (types) grows with the number of observations (tokens). When this growth follows a power law, it is referred to as Heaps’ law, a regularity often associated with Zipf’s law and frequently used to characterize human discovery processes. While random sampling from a Zipf-like distribution can reproduce Heaps’ law, this connection relies on the assumption of temporal independence, an assumption that is common in the literature but often violated in real-world systems.
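To make the type–token curve concrete, here is a minimal Python sketch that draws tokens independently from a Zipf-like distribution and counts the distinct types seen so far. The function names, the exponent value and the sample sizes are illustrative assumptions and do not come from the paper.

```python
import random

def type_token_curve(tokens):
    """Return D(k), the number of distinct types among the first k tokens,
    for k = 1..len(tokens)."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

def sample_zipf_sequence(n_types, n_tokens, exponent=1.0, seed=0):
    """Draw tokens independently, with probability proportional to
    rank ** (-exponent) for rank = 1..n_types (hypothetical parameters)."""
    rng = random.Random(seed)
    ranks = list(range(1, n_types + 1))
    weights = [r ** (-exponent) for r in ranks]
    return rng.choices(ranks, weights=weights, k=n_tokens)

# Temporal independence holds by construction: each draw ignores the past.
tokens = sample_zipf_sequence(n_types=10_000, n_tokens=50_000)
curve = type_token_curve(tokens)
print(curve[99], curve[999], curve[9_999], curve[-1])
```

Because each draw ignores all previous ones, this sequence satisfies the temporal-independence assumption exactly. Real listening or browsing histories need not satisfy it.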

Here, Célestin Zimmerlin, Thomas Louail, Manuel Moussallam and Marc Barthelemy investigate how temporal correlations in token sequences affect the type–token curve. In human behaviors like music listening and web browsing, domain-specific correlations in token ordering lead to systematic deviations from the Zipf–Heaps framework, effectively decoupling the type–token plot from the rank–frequency distribution. Using a minimal one-parameter model, the authors reproduce a wide variety of type–token trajectories, including the extremal cases that bound all possible behaviors compatible with a given frequency distribution.
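The idea that token ordering alone can change the type–token curve can be illustrated with an elementary counting argument. For a fixed set of type frequencies, the curve grows fastest when every type's first occurrence comes at the start of the sequence, and slowest when tokens are grouped by type with the most frequent types first. The sketch below builds these two reorderings. It is not the authors' one-parameter model, which this summary does not specify, and the function names are hypothetical.

```python
from collections import Counter

def type_token_curve(tokens):
    """Number of distinct types among the first k tokens, for k = 1..len(tokens)."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

def fastest_ordering(tokens):
    """Same tokens, reordered so every type appears once before any repeat."""
    counts = Counter(tokens)
    firsts = list(counts)  # one token of each type
    repeats = [t for t, c in counts.items() for _ in range(c - 1)]
    return firsts + repeats

def slowest_ordering(tokens):
    """Same tokens, grouped by type, most frequent type first."""
    counts = Counter(tokens)
    return [t for t, c in counts.most_common() for _ in range(c)]

tokens = list("abacabad")  # toy sequence: a x4, b x2, c x1, d x1
upper = type_token_curve(fastest_ordering(tokens))
lower = type_token_curve(slowest_ordering(tokens))
observed = type_token_curve(tokens)
print(upper)     # [1, 2, 3, 4, 4, 4, 4, 4]
print(observed)  # [1, 2, 2, 3, 3, 3, 3, 4]
print(lower)     # [1, 1, 1, 1, 2, 2, 3, 4]
```

All three sequences share exactly the same rank–frequency distribution, yet their type–token curves differ. Any other ordering of the same tokens gives a curve between the two bounds.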

The results demonstrate that type–token growth reflects not only the empirical distribution of type frequencies, but also the domain-specific, temporal structure of the sequence — a factor often overlooked in empirical applications of scaling laws to characterize human behavior.

Excerpt

When observing a sequence, such as reading a text, browsing websites, or listening to music, the basic elements that make up the sequence (words, web pages, tracks) may either recur or appear for the first time, reflecting a continuous interplay between familiar elements and novel ones. Two empirical laws, Heaps’ law and Zipf’s law, have emerged as central tools for describing how novelty and frequency are distributed in these systems [1]. Heaps’ law characterizes the growth of the number of distinct types D with the number of observed tokens k in a sequence, typically following a sublinear power law,

$$D \propto k^{\alpha}, \tag{1}$$

with α ∈ [0.4, 0.7] in most empirical cases [2, 3]. It has been observed in systems ranging from natural language and source code to scientific and chemical databases [4–6], and has been considered a fitting law for capturing innovation, novelties and discovery processes [7–9].
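As an illustration of Eq. (1), the exponent α can be estimated from a measured type–token curve by a least-squares fit of log D against log k. This is one simple estimator, shown here as a sketch. It is an assumption of this example rather than the fitting procedure used in the paper, and the function name is hypothetical.

```python
import math

def fit_heaps_exponent(curve):
    """Least-squares slope of log D(k) versus log k, where curve[k - 1] = D(k).

    Requires at least two points and D(k) > 0 for all k.
    """
    xs = [math.log(k) for k in range(1, len(curve) + 1)]
    ys = [math.log(d) for d in curve]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Sanity check on an exact power law D = k ** 0.5 (synthetic, not real data).
synthetic = [k ** 0.5 for k in range(1, 1001)]
alpha = fit_heaps_exponent(synthetic)
print(round(alpha, 6))
```

On empirical curves the fitted value can depend on the range of k used, so the estimate is best reported together with that range.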

This project has received financial support from the CNRS through the MITI interdisciplinary and exploratory research program. T.L., M.M., and M.B. supervised this work. C.Z. performed the data analysis, coding, and initial draft writing. All authors contributed to the study’s conceptualization, validation, final writing, review, and editing.

Célestin Zimmerlin, Thomas Louail, Manuel Moussallam, Marc Barthelemy. Dynamics of discovery and the Heaps-Zipf relationship. Phys. Rev. E 113, 054304 (2026). DOI: https://doi.org/10.1103/543d-frbq