The Cardinality Estimation/Distinct Elements problem is to approximate the number of distinct elements in a data stream using a small probabilistic data structure called a “sketch”. This problem has been studied for 40 years, has many industrial applications, and is featured prominently in most courses on Big Data algorithmics. It is therefore a real puzzle to explain why research on this popular and fundamental problem has been unusually slow.
This talk presents a complete history of the Cardinality Estimation problem from Flajolet and Martin’s seminal 1983 paper to the present, and includes an account of how the research community became fractured, delaying many natural developments by decades. I will present our recent efforts to achieve information-theoretically optimal cardinality sketches, which draws on two notions of “information” developed in the 20th century: Fisher information (governing optimal point estimation) and Shannon entropy (governing optimal space/communication).
Joint work with Dingyu Wang:
Seth Pettie is a Professor of Computer Science and Engineering at the University of Michigan. He received his Ph.D. from UT-Austin in 2003, and was an Alexander von Humboldt Fellow at the Max Planck Institute for Computer Science from 2003-2006.