Description (en)
Institutions that perform web crawls in order to gather
heritage collections have millions – or even billions – of
files encoded in thousands of different formats about
which they barely know anything. Many of these
heritage institutions are members of the International
Internet Preservation Consortium, whose Preservation
Working Group decided to address the issues related to
format identification in web archives.
Its first goal is to produce an overview of the formats
found in different types of collections (large-scale, small-scale, etc.)
over time. This overview shows that the web seems to be
becoming a more standardized space. A small number
of formats, frequently open ones, account for 90 to 95% of
web archive collections, and we can reasonably hope to
find preservation strategies for them.
However, this survey relies mainly on one source, the
MIME type of the file as declared in the server response, which
reveals sound statistical trends but is not fully reliable for
every individual file. It therefore appears necessary to
study how identification tools developed for other kinds of
digital assets can be applied to web archives.
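
The gap between a server-declared MIME type and the actual content of a file
can be illustrated with a minimal sketch. It assumes a tiny, hand-written
table of file signatures (magic numbers) and hypothetical helper names
(sniff_mime, normalize_mime, compare). It does not reproduce any particular
identification tool, which would rely on far larger signature registries.

```python
from typing import Optional

# Illustrative signature table (an assumption for this sketch, far from
# exhaustive): leading bytes of a file mapped to a MIME type.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_mime(payload: bytes) -> Optional[str]:
    """Return the MIME type suggested by the leading bytes, or None."""
    for signature, mime in MAGIC_SIGNATURES:
        if payload.startswith(signature):
            return mime
    return None


def normalize_mime(declared: str) -> str:
    """Drop parameters such as '; charset=UTF-8' and lowercase the type."""
    return declared.split(";", 1)[0].strip().lower()


def compare(declared: str, payload: bytes) -> str:
    """Classify a record as 'match', 'mismatch' or 'unknown'.

    'unknown' means the sketch has no signature for the payload, so the
    declared MIME type can be neither confirmed nor contradicted.
    """
    sniffed = sniff_mime(payload)
    if sniffed is None:
        return "unknown"
    return "match" if sniffed == normalize_mime(declared) else "mismatch"
```

For example, a record declared as text/html whose payload begins with %PDF-
would be reported as a mismatch, which is the kind of per-file unreliability
that aggregate MIME statistics cannot reveal.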