Title (en)
Text File Format Identification: An Application of AI for the Curation of Digital Records
Language
English
Description (en)
File format identification is a necessary step for the effective digital preservation of records. It allows appropriate actions to be taken for the curation and access of file types. The National Archives has existing processes for dealing with binary file format types, using tools such as PRONOM and DROID. These methods rely on header information (metadata) and consistent binary sequences. However, they are not appropriate for the identification of text file formats, as these do not contain recognisable header information or consistent patterns. Most text formats can be opened as plain text files; however, file type information is often needed to understand a file's use and context. Automated methods are necessary for text file format identification due to the scale of digital records processed by The National Archives, UK. An Artificial Intelligence methodology was tested and implemented using representative data collected from the GitHub repositories of UK Government departments. The first prototype developed has achieved reasonably good performance in successfully detecting five file formats with similar characteristics. The results encourage us to carry out additional experiments to include further text file format types.
Keywords (en)
Text file formats, Supervised learning, digital preservation
Author of the digital object
Santhilata Kuppili Venkata (The National Archives)
Author of the digital object
Paul Young (The National Archives)
Author of the digital object
Alex Green (The National Archives (UK))
Format
application/pdf
Size
243.9 kB
Licence Selected
CC BY 4.0 International
Conferences
Conference 2021
Citable links
Persistent identifier
https://phaidra.univie.ac.at/o:1424885