Searching a Web Archive

Web archives usually offer two kinds of search: URL search, where the user looks up stored versions of a website by its original web address (its “internet link”), and free text search. Netarkivet offers both.

Free text search may appear similar to searching the live web with an online search engine such as Google. However, the reality of archive searches is quite different:

Several (often many) versions of the same website are stored, so many “hits” in a search will be more or less identical. If the website changed between the stored versions, it may be difficult to determine which version best matches the one the user had in mind (if any).
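As an illustration of how near-identical hits might be thinned out, the following minimal sketch keeps a stored version only when its content differs from the previous version of the same URL. It assumes each hit carries a capture timestamp and a content checksum; the Capture type and its field names are hypothetical and do not describe how Netarkivet actually stores or returns results.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Capture:
    """Hypothetical search hit: one stored version of a page."""
    url: str
    timestamp: str  # capture time as YYYYMMDDhhmmss, so string order is time order
    digest: str     # checksum of the stored content (assumed to be available)


def collapse_unchanged(captures):
    """Keep a capture only if its content differs from the previous capture of the same URL."""
    ordered = sorted(captures, key=lambda c: (c.url, c.timestamp))
    kept = []
    for _, versions in groupby(ordered, key=lambda c: c.url):
        previous_digest = None
        for capture in versions:
            if capture.digest != previous_digest:
                kept.append(capture)
            previous_digest = capture.digest
    return kept


# Made-up example data: three captures of one page, two with identical content.
hits = [
    Capture("http://example.dk/", "20100101120000", "aaa"),
    Capture("http://example.dk/", "20100201120000", "aaa"),  # unchanged copy
    Capture("http://example.dk/", "20100301120000", "bbb"),  # content changed
]
distinct = collapse_unchanged(hits)
print([c.timestamp for c in distinct])
```

Collapsing in this way only removes exact duplicates. Versions that differ slightly still appear as separate hits, and the user still has to judge which one is wanted.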

In addition, because web data has been gathered over many years, with many websites stored in many copies and versions, the number of results from a free text search may be overwhelming. It is usually in the user’s best interest to restrict a search as much as possible, for example by web domain name or time period.
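What such a restriction amounts to can be sketched in a few lines. The example below filters a list of hits by domain name and capture date. The hits, the restrict function, and its parameters are all hypothetical; a real archive applies these limits through its own search interface, not through code like this.

```python
from datetime import date
from urllib.parse import urlsplit

# Hypothetical hits: (original URL, capture date) pairs with made-up values.
hits = [
    ("http://www.example.dk/nyheder", date(2009, 5, 1)),
    ("http://www.example.dk/nyheder", date(2014, 5, 1)),
    ("http://other.example.org/news", date(2009, 6, 1)),
]


def restrict(hits, domain, start, end):
    """Keep hits on `domain` (or a subdomain of it) captured from start to end, inclusive."""
    selected = []
    for url, captured in hits:
        host = urlsplit(url).hostname or ""
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and start <= captured <= end:
            selected.append((url, captured))
    return selected


# Only captures of example.dk made during 2009 remain.
narrowed = restrict(hits, "example.dk", date(2009, 1, 1), date(2009, 12, 31))
print(narrowed)
```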

Even then, the user may face another problem: results may be listed alphabetically or chronologically, but not by relevance, as users of the live web are used to. Online search engines use complex criteria (algorithms) to determine relevance and list the most relevant hits first. Offering results that are immediately perceived as relevant is crucial to the success of an online search engine.

But criteria of relevance depend on such things as popularity and exact representations of search phrases, which do not apply in a web archive in the same manner as online.

Web archive users may be accustomed to search results that are immediately useful, but in a web archive even the best possible keywords and restrictions may yield results that demand lengthy manual sorting before specific needs can be met.

Searching by URL also poses challenges, especially if the URL has changed. The website may be in the archive in an earlier or later version than those listed after a URL search, but under a different URL.
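Some URL changes are small and mechanical, for instance a switch between http and https or the addition or removal of a “www.” prefix. For those cases only, a user could try a handful of variants of the address in turn. The sketch below generates such variants; the url_variants function is a hypothetical helper, not a feature of any particular archive, and it cannot find a website that has moved to an entirely different domain or path.

```python
from urllib.parse import urlsplit, urlunsplit


def url_variants(url):
    """Return scheme and 'www.' variants of a URL, with the form given by the user first."""
    parts = urlsplit(url)
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    other_host = bare if host != bare else "www." + bare
    other_scheme = "https" if parts.scheme == "http" else "http"
    variants = []
    for scheme in (parts.scheme, other_scheme):
        for netloc in (host, other_host):
            # Fragments are dropped: they never reach the server, so they are not archived.
            variants.append(urlunsplit((scheme, netloc, parts.path, parts.query, "")))
    return variants


# Made-up address used purely for illustration.
candidates = url_variants("http://www.example.dk/om-os")
for candidate in candidates:
    print(candidate)
```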

The chapter on searching in web archives in Nielsen (2016) is recommended for further reading; see “References”.