Title
Deduplicating Bibliotheca Alexandrina’s Web Archive
Language
English
Description (en)
Archiving web content inevitably produces datasets containing duplicates, either across time or across location. The Bibliotheca Alexandrina (BA) holds a web archive spanning 10 years and continues to expand the collection. An initial assessment of this very large store of data was conducted. Given a high enough rate of duplication, deduplication would yield sizable savings in storage requirements. The BA worked through the International Internet Preservation Consortium (IIPC) to compile best practices for recording duplicates in ISO 28500, the WARC File Format. To deduplicate legacy web archives “after the fact,” the BA is implementing the WARCrefs deduplication tools. Following implementation and testing, the BA plans to use the tools to deduplicate its one petabyte of archived web content.
Keywords (en)
web archiving, deduplication, hash algorithms, ISO 28500, WARC File Format, WARCrefs, WARCsum
ISBN
978-0-692-59881-8
Author of the digital object
Youssef Eldakar
Magdy Nagi
Format
application/pdf
Size
163.6 kB
Licence Selected
CC BY 4.0 International
Conferences
Conference 2015
Name of Publication (en)
Proceedings of the 12th International Conference on Digital Preservation
Publisher
School of Information and Library Science, University of North Carolina at Chapel Hill
Object type
PDFDocument
Created
04.03.2016 01:14:00