A broad range of research areas including Internet measurement,
privacy, and network security rely on lists of target domains to be
analysed; researchers make use of target lists for reasons of necessity
or efficiency. The popular Alexa list of one million domains is
a widely used example. Despite their prevalence in research papers,
the soundness of top lists has seldom been questioned by the community:
little is known about the lists' creation, representativity,
potential biases, stability, or overlap between lists.
In this study we survey the extent, nature, and evolution of top
lists used by research communities. We assess the structure and
stability of these lists, and show that rank manipulation is possible
for some lists. We also reproduce the results of several scientific
studies to assess the impact of using a top list at all, which list
specifically, and the date of list creation. We find that (i) top lists
generally overestimate results compared to the general population
by a significant margin, often even an order of magnitude, and (ii)
some top lists have surprising change characteristics, causing high
day-to-day fluctuation and leading to result instability. We conclude
our paper with specific recommendations on the use of top lists,
and how to interpret results based on top lists with caution.
Our paper has been accepted for IMC 2018.
You can obtain a preprint of our paper from arXiv.
We provide a historical and ongoing collection of top lists on an archive server.
The full dataset of scripts and raw data used for the publication is hosted at the TUM Library.
Please cite this study when using the data.
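As a rough illustration of how two daily snapshots from the archive could be compared for stability, here is a minimal Python sketch. It is not the analysis code used in the paper. The function names `daily_change` and `overlap` are hypothetical, and it assumes each list has already been loaded as a sequence of domain strings ordered by rank.

```python
def daily_change(previous, current):
    """Share of domains in `current` that did not appear in `previous`.

    Both arguments are sequences of domain names (assumed input format).
    Returns 0.0 for an empty `current` list.
    """
    if not current:
        return 0.0
    previous_set = set(previous)
    new_domains = [d for d in current if d not in previous_set]
    return len(new_domains) / len(current)


def overlap(list_a, list_b):
    """Number of domains that two lists have in common, ignoring rank."""
    return len(set(list_a) & set(list_b))


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any real top list.
    day_1 = ["a.example", "b.example", "c.example", "d.example"]
    day_2 = ["a.example", "b.example", "e.example", "f.example"]
    print(daily_change(day_1, day_2))
    print(overlap(day_1, day_2))
```

Both measures ignore rank order. A rank-aware comparison would need a different metric.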
scheitle [AT] net.in.tum.de
jelten [AT] net.in.tum.de
The Internet measurement data for this study is based on RWTH Aachen's NetRay project and TUM's GINO project.