EARMA Conference Odense 2024

Data Quality through APIs

Data Quality or: How I learned to stop worrying and love APIs

Format: Oral 20 Minutes

Topic: IT Systems and tools supporting RMA now and in the future

Abstract

In this talk we discuss the migration of data from Technological University Dublin's Digital Commons institutional repository to a new CRIS (Elsevier PURE). During the migration we discovered data quality (DQ) issues; in particular, key data about articles and other outputs was held in a single unstructured citation field. The Research and Innovation Support Office and the Library Research Service team worked together to resolve the problems. Our approach leveraged two APIs, the open Crossref API and OpenAI’s GPT API, to enrich our data source and avoid what would otherwise have been a long and tedious manual DQ exercise. The work combined high-quality Crossref data with query terms extracted by generative AI. To do this we trained the AI model on two sets of data: conference proceedings citations and journal article citations. The resulting responses helped to identify key metadata such as Host Publication Title and Page Ranges, and also helped to identify mis-categorised papers. We will also discuss the appropriate use of AI, the limits and costs of the approach, the AI training process and the peril of AI hallucination.
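
As a rough illustration of the Crossref half of this workflow, the sketch below shows how query terms (assumed here to have already been extracted from a citation string by the AI step) might be turned into a request to the public Crossref `/works` endpoint, and how a returned record might be checked. It is a minimal sketch under stated assumptions, not the implementation used in the project: the function names, the local category labels and the sample record are hypothetical, while `query.bibliographic`, `rows`, `container-title`, `page` and `type` are parameters and fields of the Crossref REST API.

```python
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"

# Hypothetical mapping from local repository categories to Crossref work types.
LOCAL_TO_CROSSREF_TYPE = {
    "Journal Article": "journal-article",
    "Conference Paper": "proceedings-article",
}


def build_crossref_query(query_terms, rows=3):
    """Build a Crossref /works URL from query terms extracted from a citation."""
    params = {"query.bibliographic": query_terms, "rows": rows}
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


def extract_key_metadata(work):
    """Pull host publication title, page range and type from one Crossref record."""
    titles = work.get("container-title") or []
    return {
        "host_publication_title": titles[0] if titles else None,
        "page_range": work.get("page"),
        "crossref_type": work.get("type"),
    }


def is_miscategorised(local_category, work):
    """Flag a record whose local category disagrees with the Crossref type."""
    expected = LOCAL_TO_CROSSREF_TYPE.get(local_category)
    return expected is not None and expected != work.get("type")


# Made-up record in the shape of a Crossref /works item (illustration only).
sample_work = {
    "type": "proceedings-article",
    "container-title": ["Proceedings of an Example Conference"],
    "page": "101-110",
}

url = build_crossref_query("example paper title example author 2019")
metadata = extract_key_metadata(sample_work)
flagged = is_miscategorised("Journal Article", sample_work)
```

Fetching `url` with any HTTP client returns JSON whose `message.items` list holds candidate records. A real workflow would also need to confirm that the top match is actually the cited paper before trusting its metadata, since both a loose bibliographic search and AI-extracted terms can point at the wrong work.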