Title
Text File Format Identification: An Application of AI for the Curation of Digital Records
Language
English
Description (en)
File format identification is a necessary step for the effective digital preservation of records. It allows appropriate actions to be taken for the curation of, and access to, each file type. The National Archives has existing processes for dealing with binary file format types, using tools such as PRONOM and DROID. These methods rely on header information (metadata) and consistent binary sequences. However, they are not appropriate for identifying text file formats, which do not contain recognisable header information or consistent patterns. Most text formats can be opened as plain text files; however, file type information is often needed to understand a file's use and context. Automated methods are necessary for text file format identification due to the scale of digital records processed by The National Archives, UK. An Artificial Intelligence methodology was tested and implemented using representative data collected from the GitHub repositories of UK Government departments. The first prototype developed has achieved reasonably good performance in successfully detecting five file formats with similar characteristics. The results encourage us to carry out additional experiments to include further text file format types.
Keywords (en)
Text file formats, Supervised learning, digital preservation
Author of the digital object
Santhilata Kuppili Venkata (The National Archives)
Author of the digital object
Paul Young (The National Archives)
Author of the digital object
Alex Green (The National Archives (UK))
Format
application/pdf
Size
243.9 kB
Licence Selected
CC BY 4.0 International
Conferences
Conference 2021
Object type
PDFDocument
Created
23.02.2022 01:26:42