One of the core services that PA Digital offers is metadata aggregation. After evaluating our operational needs in 2018-19, PA Digital’s technical team at Temple University Libraries decided it was time to replace our original aggregator, which had become increasingly difficult to maintain, and began planning a new approach for harvesting, aggregating, and distributing records from contributing cultural heritage institutions. In rebuilding our processes, we aimed to adopt a more flexible and scalable system, focusing on the long-term sustainability of our technical infrastructure. We explored different options, including Combine, an application developed at Wayne State University as part of the Michigan DPLA Hub (see our previous blog post). After much trial and error, we ultimately decided to use a set of tools already supported by Temple University Libraries: Apache Airflow for workflow management and Apache Solr/Blacklight for internal metadata review.

From 2019 to 2021, we migrated our existing workflows to these new systems, working with individual institutions to reharvest their collections. As of March 2021, the PA Digital technical team has completed this project and phased out the old aggregator software, an important step forward for the ongoing sustainability of the initiative. Now that we are fully operating with our updated aggregator, we’d like to share more details about the new workflows and the benefits this work brings to the PA Digital community.

[Image: Snapshot of PA Digital’s Airflow dashboard]

How does the new aggregation process work? 

Tasks associated with harvesting and aggregating metadata are managed through automated workflows in the open source platform Apache Airflow. For each contributing institution, we input institution-specific information and execute a set of standardized steps in Airflow. These steps include harvesting institutional collections using OAI-PMH parameters or a CSV file, validating and transforming the records based on PA Digital metadata standards, and publishing them for metadata review and retrieval by partners such as the Digital Public Library of America (DPLA).
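
To make this concrete, here is a minimal sketch of how a pipeline like this could be expressed as an Airflow DAG. The overall shape (harvest, validate, transform, publish) follows the steps described above, but the task names, helper functions, and configuration values are hypothetical, not PA Digital’s actual code.

```python
# Hypothetical sketch of an institution-parameterized harvest DAG.
# Helper functions and config values below are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

INSTITUTION = {
    "id": "example_inst",
    "oai_endpoint": "https://repository.example.edu/oai",  # assumed URL
    "metadata_prefix": "oai_dc",
}

def harvest_oai():
    """Harvest the institution's collections via OAI-PMH (or a CSV file)."""
    ...

def validate_records():
    """Validate harvested records against PA Digital metadata standards."""
    ...

def transform_records():
    """Transform validated records into the aggregator's target schema."""
    ...

def publish_records():
    """Publish records for internal review and retrieval by partners."""
    ...

with DAG(
    dag_id=f"harvest_{INSTITUTION['id']}",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered on demand, per institution
    catchup=False,
) as dag:
    harvest = PythonOperator(task_id="harvest", python_callable=harvest_oai)
    validate = PythonOperator(task_id="validate", python_callable=validate_records)
    transform = PythonOperator(task_id="transform", python_callable=transform_records)
    publish = PythonOperator(task_id="publish", python_callable=publish_records)

    # Each run executes the same standardized sequence of steps.
    harvest >> validate >> transform >> publish
```

Because every institution runs through the same DAG shape, adding a contributor becomes largely a matter of supplying new configuration rather than writing new code.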

We also developed an internal site using Solr and Blacklight to search and view PA Digital aggregated metadata for quality assurance. Nicknamed “Funnel Cake,” this platform serves as a tool for sharing ongoing metadata feedback with our contributors. For details about the new aggregation processes, see About the PA Digital Aggregator.

[Image: Funnel Cake search interface]
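
To give a sense of how an index like this supports metadata review, here is a minimal sketch of a programmatic query against a Solr core using the pysolr client. The core URL, field names, and the specific check shown are assumptions for illustration and do not reflect Funnel Cake’s actual schema.

```python
# Hypothetical QA query against an aggregator Solr index.
# Core URL and field names (institution, rights, title) are assumed.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/funnel_cake", timeout=10)

# Find records from one contributor that lack a rights statement,
# a typical metadata-review check.
results = solr.search(
    'institution:"Example University" AND -rights:[* TO *]',
    rows=20,
    fl="id,title,collection",
)

for doc in results:
    print(doc["id"], doc.get("title"))
```

In practice, reviewers would typically run checks like this through the Blacklight search interface rather than querying Solr directly.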

How does it affect the harvest of new and updated collections?

The process for harvesting new collections remains largely unchanged for contributing institutions. One important difference with our new aggregator is that each time an institution adds or updates a collection in PA Digital, we now reharvest all of that institution’s collections rather than executing the process collection by collection.
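
For a rough idea of what a full reharvest involves, the sketch below uses the Sickle OAI-PMH client to walk every set (collection) a repository exposes, rather than targeting a single set. The endpoint URL is hypothetical, and the print call stands in for the validate/transform/publish steps of the pipeline.

```python
# Sketch of a full-institution reharvest over OAI-PMH using Sickle.
from sickle import Sickle
from sickle.oaiexceptions import NoRecordsMatch

sickle = Sickle("https://repository.example.edu/oai")  # assumed endpoint

# Iterate over every set the repository exposes, not just the one
# that was added or updated.
for oai_set in sickle.ListSets():
    try:
        records = sickle.ListRecords(
            metadataPrefix="oai_dc",
            set=oai_set.setSpec,
            ignore_deleted=True,
        )
    except NoRecordsMatch:
        continue  # some sets may have no records in this format
    for record in records:
        print(record.header.identifier)
```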

With our streamlined workflows, we will also be able to onboard new contributors more easily and update our processes efficiently as priorities or standards change. We can now accept a variety of metadata formats and develop more customized metadata mappings and transformations for individual institutions. In addition, we have the flexibility to deliver more granular metadata to partners such as DPLA rather than being limited to Simple Dublin Core.
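
As a simplified illustration of what an institution-specific crosswalk can look like, the sketch below remaps incoming Dublin Core fields to richer target fields in the spirit of the DPLA Metadata Application Profile. The mapping and field names are assumptions for demonstration, not PA Digital’s actual transformations.

```python
# Illustrative per-institution crosswalk: remap source fields to a
# richer target schema instead of flattening to Simple Dublin Core.
# All field names here are assumed for demonstration.
EXAMPLE_INST_MAPPING = {
    "dc:title": "sourceResource/title",
    "dc:date": "sourceResource/date/displayDate",
    "dc:rights": "rights",
    "dc:identifier": "isShownAt",
}

def transform_record(record: dict, mapping: dict) -> dict:
    """Apply one institution's crosswalk to a single harvested record."""
    return {
        target: record[source]
        for source, target in mapping.items()
        if source in record
    }

harvested = {
    "dc:title": "Photograph of City Hall",
    "dc:date": "circa 1910",
    "dc:rights": "http://rightsstatements.org/vocab/NoC-US/1.0/",
}
print(transform_record(harvested, EXAMPLE_INST_MAPPING))
```

Because the mapping is data rather than code, each institution can have its own crosswalk without changes to the pipeline itself.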

What does this mean for the sustainability of PA Digital?

Rather than relying on a boutique solution, PA Digital aggregation is now maintained with a well-established workflow management tool (Airflow) designed for executing complex processes across different data types and scenarios. In addition, all of the applications adopted for the new aggregator are open source platforms already known and used by the team members who currently maintain PA Digital’s technical infrastructure. The increased flexibility of our workflows also allows us to customize them without a high risk of system error. These factors will help us sustain and update these technical processes as needed over the long term.

Who was involved in doing this work at PA Digital?

This implementation and migration project was conducted by staff at Temple University Libraries in support of PA Digital. Between 2018 and 2021, the following people contributed to this work at different phases of the project: Rachel Appel, Leanne Finnigan, Christina Harlow, Chad Nelson, Stefanie Ramsay, Holly Tomren, Emily Toner, Jennifer Anton, Timothy Bieniosek, David Kinzer, and Steven Ng. 
