Description (en)
File format identification is a necessary step for the effective digital preservation of records. It allows appropriate actions to be taken for the curation of, and access to, each file type. The National Archives has existing processes
for dealing with binary file formats, using tools such as PRONOM and DROID. These methods rely on header information (metadata) and consistent binary
sequences. However, they are not appropriate for identifying text file formats, which do not contain recognisable header information or consistent patterns. Most text formats can be opened as plain text files, but
file type information is often needed to understand a file's use and context. Automated methods are necessary for text file format identification because of the scale of digital records processed by The National Archives, UK. An Artificial Intelligence methodology was tested and implemented using representative data collected from the GitHub repositories of UK Government departments. The
first prototype has achieved reasonably good performance in detecting five file formats with similar characteristics. The results encourage us to carry out additional experiments to include further text file formats.
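
As an illustration of why signature-based identification does not carry over to text formats, the following is a minimal sketch, not the implementation used by PRONOM or DROID. The signature table, the function name `identify_by_signature` and the sample file contents are assumptions made for this example; only the three leading byte sequences are the well-known ones for those binary formats.

```python
from typing import Optional

# Hypothetical, heavily simplified signature table for illustration only.
# Real signature registries hold far richer patterns than a fixed prefix.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"PK\x03\x04": "ZIP container",
}


def identify_by_signature(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known byte signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None  # no consistent leading byte sequence was found


# Made-up sample contents: one binary format and two text formats.
samples = {
    "report.pdf": b"%PDF-1.7\n...",
    "data.csv": b"id,name\n1,Alice\n",
    "script.py": b"import sys\nprint(sys.argv)\n",
}

results = {name: identify_by_signature(content) for name, content in samples.items()}
for name, fmt in results.items():
    print(name, "->", fmt)
```

In this sketch the binary sample is matched by its leading bytes, while the two text samples return no match because nothing in their first bytes is specific to their format. This is the gap that a content-based approach to text files is intended to fill.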