WebData Researchers at Web Archiving Conference

  • April 27, 2026
Several researchers from the WebData project were in Brussels for the Web Archiving Conference 2026, organised by the International Internet Preservation Consortium (IIPC). They presented preliminary results from the project and shared experiences with others working to facilitate research using web archives.

Several contributions from WebData

There was great interest in the different contributions from the WebData project. WebData was represented with three posters:

Assessing the Needs of Researchers: Jon Tønnessen (NB) presented the results of the needs study we carried out among 99 researchers who wish to use web archives in their research. The study identifies the challenges they face and what they need in order to use web archives more effectively. Many participants welcomed the work, since it provides a better basis for developing services and tools that meet researchers’ needs.

Mapping duplicate images in a web archive using perceptual hashing: Marie Roald (NB) presented a method for identifying duplicate images in a web archive using perceptual hashing (pHash). This makes it far easier to find images in the archive that people perceive as similar, even though a byte-level comparison treats them as completely different. One advantage for the project is that machine learning methods can be made far more sustainable, since the amount of computation can be reduced considerably.
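To illustrate the general idea of hashing images by appearance rather than by bytes, here is a minimal sketch. It uses a simple average hash, not the DCT-based pHash named in the poster, and it is not the method presented at the conference. The function names, the tiny 4x4 grayscale grids and the distance threshold are all assumptions made for illustration.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean.

    `pixels` is a small grayscale image given as a list of rows (hypothetical
    input format; a real pipeline would first downscale the image).
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_near_duplicate(pixels_a, pixels_b, max_distance=2):
    """Treat two images as near-duplicates if their hashes differ in few bits.

    The threshold of 2 is an arbitrary choice for this sketch.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# A checkerboard, a uniformly brightened copy of it, and a different pattern.
original = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
brightened = [[min(255, value + 20) for value in row] for row in original]
different = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]

print(is_near_duplicate(original, brightened))
print(is_near_duplicate(original, different))
```

The brightened copy differs from the original in every pixel value, so a byte-level comparison would call the two images different, yet both produce the same hash because the pattern of light and dark areas is unchanged.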

Revisiting a statistical approach for measuring Solr query performance: Jørgen Antonsen (NB) presented a method for measuring and visualising performance when querying the WebData project’s test platform. The method makes it possible to compare response times across different types of search results, for example searches with very few or very many hits. This gives a better basis for assessing the effect of different technical measures, so that performance can be improved for the end user.
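As a rough illustration of comparing response times across searches with few or many hits, the sketch below groups measurements by hit count and reports the median per group. It is not the statistical approach from the poster. The bucket boundaries, the function names and the sample measurements are all invented for this example.

```python
from statistics import median


def bucket_for(hit_count):
    """Assign a query to a coarse bucket by its number of hits.

    The boundaries (100 and 100,000) are arbitrary assumptions.
    """
    if hit_count < 100:
        return "few"
    if hit_count < 100_000:
        return "medium"
    return "many"


def median_response_by_bucket(measurements):
    """Group (hit_count, response_ms) pairs by bucket and take the median time."""
    grouped = {}
    for hit_count, response_ms in measurements:
        grouped.setdefault(bucket_for(hit_count), []).append(response_ms)
    return {bucket: median(times) for bucket, times in grouped.items()}


# Made-up measurements: (number of hits, response time in milliseconds).
sample = [
    (3, 12.0),
    (40, 15.0),
    (5000, 48.0),
    (250000, 310.0),
    (900000, 290.0),
    (7, 18.0),
]

summary = median_response_by_bucket(sample)
for bucket, value in summary.items():
    print(bucket, value)
```

The median is used here because response-time distributions tend to be skewed by a few slow queries. Running the same summary before and after a configuration change gives a simple per-bucket comparison.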

In addition, WebData staff participated in several workshops and a panel:

  • Users first: (Re)designing Web Archives around real Needs (panel)
    • Tønnessen participated in a panel with representatives from The National Archives in the UK, Bibliothèque nationale de France, Ghent University and the University of Edinburgh. A key takeaway was that web archives need to move from a “collect first” mindset to a more user-oriented approach, where collections can be made more accessible to researchers, cultural heritage professionals, journalists and others.
  • Workshop on Web Archives and AI (workshop)
  • GLAM labs & Jupyter notebooks (workshop)
  • SolrWayback (workshop)

Increased focus on access, research and sustainability

There were many interesting contributions at the conference. It is impossible to mention them all, but we would like to highlight some that are particularly relevant to the WebData project:

The historian Ian Milligan presented his research on the terrorist attacks of 11 September 2001. In the presentation Web archives of tragedy: ethical, sustainable access and research use for 9/11 collections, he showed how historians depend entirely on web archives as primary sources for cultural and social history research on events after 2000.

The National Archives in the UK gave a presentation titled Unlocking the Web Archive: understanding researcher needs, about its work to better facilitate research using its web archive. Through workshops with both experienced and potential users, it has identified different barriers to access, understanding and research use, as well as the practical and ethical frameworks the institution must work within.

The language technologist Laurie Burchell presented Common Crawl’s work on improving language recognition in web archives. Existing solutions are either too inaccurate, especially for smaller languages, or too resource-intensive to be used at scale. Common Crawl is therefore developing a system specifically for web data. It can handle multilingual webpages and is designed to be very fast, which improves the prospects for multilingual support in web archives and strengthens work on language technology for underrepresented languages.

The historian Jesper Verhoef showed in his presentation, Hyperlinked homeland: A historical hyperlink analysis of 200 Dutch LGBT+ websites, how he has analysed hyperlinks in a Dutch web archive collection of LGBT+ websites. By examining which websites were linked to, he was able to uncover networks of identity and belonging in the Dutch web archive. The analysis challenges the assumption that queer web cultures are primarily transnational, among other things by identifying clear Dutch and often hyperlocal clusters of queer websites.

David Mahoney shared interesting findings from his PhD project, in which he uses web archives to study how websites have grown and changed over time. By using open metadata and access services, he can measure how digital development has contributed to greenhouse gas emissions over time. He thereby showed how metadata from libraries can be an important source of knowledge about the development of the web, resource use and sustainability.

Katy Boss also presented how NB is working to make digital preservation more sustainable, including through data minimisation, reduced energy consumption for storage, almost exclusive use of renewable energy, and plans to use waste heat from NB’s data centre in Mo i Rana for district heating.

Katy Boss presenting on sustainability at IIPC 2026