Completeness and Authenticity

It should be stressed from the outset that web archives take various approaches to protecting and respecting people’s privacy. Data is only collected if it has been made public. Because data protection laws in Denmark hold that even public, older data may be considered “personal”, access to the Danish web archive, Netarkivet (http://netarkivet.dk/), is restricted to researchers with relevant projects, who must respect privacy rules and laws when working with the data.

It is also important to understand that although data is collected on a regular basis, it is impossible to collect and preserve every change that occurs (see also the previous subchapter, “Abundance”). Data in Netarkivet is primarily collected by automatically attempting to copy, four times per year, everything on the Danish national domain (.dk) as well as websites relating to or located in Denmark. Data collection on such a scale is time-consuming and challenging, so temporal differences may result in inconsistencies. For example, an event occurring during the period of data collection may still be unknown on pages stored at the beginning of the process, announced or under way on pages stored partway through, and commented upon as a past event on pages stored towards the end of the process.
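The temporal inconsistency described above can be illustrated with a minimal sketch. It assumes that each archived page carries a capture time written as a 14-digit timestamp (YYYYMMDDhhmmss); the URLs, dates and function names below are hypothetical and do not come from Netarkivet.

```python
from datetime import datetime

# Hypothetical capture records: (URL, capture time as a 14-digit timestamp).
CAPTURES = [
    ("http://example.dk/news/preview", "20230301080000"),
    ("http://example.dk/news/live", "20230310120000"),
    ("http://example.dk/news/recap", "20230322180000"),
]


def parse_ts(ts: str) -> datetime:
    """Convert a YYYYMMDDhhmmss string into a datetime object."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def harvest_window(captures):
    """Return the earliest and latest capture times in one harvest."""
    times = [parse_ts(ts) for _, ts in captures]
    return min(times), max(times)


def split_by_event(captures, event_time):
    """Separate pages stored before an event from those stored at or after it."""
    before = [url for url, ts in captures if parse_ts(ts) < event_time]
    after = [url for url, ts in captures if parse_ts(ts) >= event_time]
    return before, after


start, end = harvest_window(CAPTURES)
event = datetime(2023, 3, 10, 9, 0, 0)  # hypothetical event time
before, after = split_by_event(CAPTURES, event)

print("Harvest lasted", (end - start).days, "days")
print("Stored before the event:", before)
print("Stored after the event:", after)
```

Pages in the first group cannot mention the event, while pages in the second group may, even though all of them belong to the same round of data collection.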

Furthermore, most websites use content hosted on other websites (“hosts”). For example, a video hosted on YouTube can be a crucial part of a news article. The data collection process, however, will not gather the video; rather, it will gather the code that embeds the video in the news article. In some cases the video is stored on the same website as the article, and it may then be archived together with the article. But in many cases the archived copy of the article will not contain the video and will thus be incomplete, and if the video has since been deleted or changed, it may be difficult to fully reconstruct the original content of the article at all.
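As a rough illustration of how one might check an archived page for such dependencies, the sketch below lists embedded resources that point to a host other than the page’s own. It uses only the Python standard library; the class name, the sample HTML, the URLs and the choice of tags are assumptions made for the example, and embeds referenced through attributes other than `src` are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class EmbedFinder(HTMLParser):
    """Collect embedded resources that live on a different host than the page."""

    # Tags that commonly pull in content through a "src" attribute.
    EMBED_TAGS = {"iframe", "img", "script", "video", "audio", "source", "embed"}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.EMBED_TAGS:
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Resolve relative links against the page URL before comparing hosts.
        absolute = urljoin(self.page_url, src)
        host = urlparse(absolute).netloc
        if host and host != self.page_host:
            self.external.append(absolute)


# Hypothetical archived article with one local image and one external video.
SAMPLE_HTML = """
<article>
  <h1>Example article</h1>
  <img src="/images/photo.jpg">
  <iframe src="https://www.youtube.com/embed/VIDEO_ID"></iframe>
</article>
"""

finder = EmbedFinder("http://news.example.dk/article.html")
finder.feed(SAMPLE_HTML)
print(finder.external)
```

Each address in the resulting list is content that the archived copy of the article depends on but that may not have been stored with it.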

Finally, the process of gathering data holds its own challenges, and errors may occur, resulting in data incompleteness.

A detailed explanation of these challenges and limitations is offered in Janne Nielsen’s book “Using Web Archives in research – an Introduction, V2”. It is a NetLab publication and free to download; see “References”.

The important thing is to understand that a web archive is not, and cannot be, an exact representation of the web as it was at a previous time. But it remains a large-scale preservation of data that changes over time and would otherwise be lost. With a web archive, patterns (of use, communication, user behavior or other phenomena) can be established with the degree of certainty that comes from observing them as big data patterns.

It may also be relevant to remember that no form of archiving can ever be complete. No amount of data can hold all information about a phenomenon without actually being the phenomenon itself.