Fundamental Tools for Web Archive Research (FUTARC)
Corpus Creation, Completeness, and Versions

Purpose: The aim of this project is to develop a number of fundamental tools to be used when studying web archives.

Expected outcomes: The expected outcome of the project is a number of fundamental tools to be used when studying web archives, including tools to select and create a corpus and to extract that corpus from a larger collection.

Studying an entire web archive is probably the exception. Research projects are therefore very likely to delimit a certain part of a web archive as their object of study, for instance by time or space, by file type, by content, by HTML code, or by other criteria. Tools for corpus creation are needed to make such delimitations.
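As an illustration of the kind of delimitation described above, the following minimal sketch selects a corpus from a list of capture records by time span, domain, and file type. It is not one of the project's tools: the `Capture` record, its fields, and the `select_corpus` function are hypothetical names introduced here, and a real archive would supply such metadata from its own index.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Capture:
    """Hypothetical metadata record for one archived file."""
    url: str
    domain: str
    timestamp: datetime
    mime_type: str


def select_corpus(
    captures: Iterable[Capture],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    domains: Optional[Set[str]] = None,
    mime_types: Optional[Set[str]] = None,
) -> List[Capture]:
    """Keep only captures matching every criterion that was given.

    A criterion left as None is not applied. The time span is
    half-open: start is inclusive, end is exclusive.
    """
    selected = []
    for capture in captures:
        if start is not None and capture.timestamp < start:
            continue
        if end is not None and capture.timestamp >= end:
            continue
        if domains is not None and capture.domain not in domains:
            continue
        if mime_types is not None and capture.mime_type not in mime_types:
            continue
        selected.append(capture)
    return selected


# Made-up sample data, for illustration only.
collection = [
    Capture("http://example.org/", "example.org",
            datetime(2005, 3, 1), "text/html"),
    Capture("http://example.org/logo.png", "example.org",
            datetime(2005, 3, 1), "image/png"),
    Capture("http://example.com/", "example.com",
            datetime(2009, 6, 15), "text/html"),
]

# Corpus: HTML pages from example.org captured during 2005.
corpus = select_corpus(
    collection,
    start=datetime(2005, 1, 1),
    end=datetime(2006, 1, 1),
    domains={"example.org"},
    mime_types={"text/html"},
)
```

Keeping each criterion optional means the same function can express a corpus delimited by time alone, by file type alone, or by any combination of the two.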

In addition, web archives are in many cases incomplete: things are missing (files, elements on a web page, etc.). They are also in many cases too complete, in that there may be several versions of ‘the same’ web page or file. Tools are therefore needed for handling incompleteness as well as versions.
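The two problems described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's method: it assumes each capture is available as a `(url, timestamp, digest)` tuple, where the timestamp is a sortable string and the digest is a checksum of the archived content. The function names `distinct_versions` and `missing_resources` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record shape: (url, timestamp, content digest).
CaptureRecord = Tuple[str, str, str]


def distinct_versions(
    captures: Iterable[CaptureRecord],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group captures by URL and collapse consecutive identical copies.

    For each URL, a capture is kept only if its digest differs from the
    previously kept capture of that URL, so the result lists the points
    in time at which the content actually changed.
    """
    by_url: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for url, timestamp, digest in sorted(captures, key=lambda c: (c[0], c[1])):
        versions = by_url[url]
        if not versions or versions[-1][1] != digest:
            versions.append((timestamp, digest))
    return dict(by_url)


def missing_resources(
    referenced: Iterable[str], archived: Iterable[str]
) -> List[str]:
    """Return the referenced URLs that have no capture in the archive."""
    return sorted(set(referenced) - set(archived))


# Made-up sample data, for illustration only.
captures = [
    ("http://example.org/", "20100101000000", "aaa"),
    ("http://example.org/", "20100102000000", "aaa"),  # unchanged copy
    ("http://example.org/", "20100103000000", "bbb"),  # content changed
    ("http://example.org/logo.png", "20100101000000", "ccc"),
]

versions = distinct_versions(captures)

# URLs that the page is assumed to reference (hypothetical list).
missing = missing_resources(
    ["http://example.org/logo.png", "http://example.org/style.css"],
    versions.keys(),
)
```

Comparing digests treats two captures as ‘the same’ only when their content is byte-identical. A study might well need a looser notion of sameness, which is a choice the researcher has to make explicitly.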

Thus, the use of fundamental tools for corpus creation, completeness, and version handling can be considered a first step for any subsequent study of material in a web archive and a prerequisite for making informed and critical choices as to what to study.

Project team:
Niels Brügger (project lead), Professor, Head of the Centre for Internet Studies and of NetLab/DigHumLab, Aarhus University
Ulrich Have, IT-Architect, NetLab/DigHumLab, Aarhus University
(Previous participant: Niels Ole Finnemann, Professor)