Gretchen Gueguen on the DPLA Archival Description White Paper

As a Pennsylvania native (Kittanning, PA), educated at our flagship institution (Penn State), and a current resident (Erie, PA), I can’t help but be a little partial to the PA Digital Hub. So I am delighted to have the opportunity to discuss collections, description, and access at DPLA on this blog, and, in particular, to talk about Aggregating and Representing Collections in the Digital Public Library of America.


I am the Data Services Coordinator at DPLA, which means that I am in charge of managing our data aggregation services. This involves working with our Content and Service Hubs in their early stages to ensure that their data will be interoperable with our data set and then managing the quality review and remediation process for the first data ingest. After that, I work closely with the technology team to schedule re-ingest and maintenance of the data. I have several ongoing initiatives to analyze and improve our overall data quality that I’ll be working on (with the help of our Hubs) over the coming months. There’s always more improvement to be made, which is a great challenge!

Putting together the Archival Description Working Group

Throughout 2015 several themes kept popping up in questions and conversations at DPLA: How do we communicate the context of materials that are closely related to each other? How do we take advantage of information that is created about collections while still keeping DPLA a library of digital materials only? How can materials that come from different descriptive traditions (libraries, archives, museums) be reconciled?

DPLA had always had metadata fields for collection name and description in our item-level records, but at that point they had never been tightly controlled. We focused far more on the basic item descriptors (title, subjects, description) and let institutions do what they wished with the collection fields. Since “collection” is actually a term with a lot of different meanings, we ended up with a hodge-podge of data that could at times seem misleading or redundant. Some Hubs organized all the content from each partner into an institution-based collection, others had very broad subject- or format-based collections, while still others had very specific provenance-based collections. In short, we felt that the collection data was not ready for prime time, so while it was retained and indexed in the record for searching, it was not featured in the website version of the metadata record.
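To make that concrete, here is a rough sketch of where collection information sits in a DPLA-style item record. The field names follow my reading of DPLA’s Metadata Application Profile, and the values are invented, so treat the shape as illustrative rather than authoritative.

```python
# Illustrative sketch of a DPLA-style item record with collection fields.
# The sourceResource/collection shape follows my reading of the DPLA
# Metadata Application Profile; names and values here are invented.
item_record = {
    "sourceResource": {
        "title": "Letter from Jane Doe to John Doe, 1862",
        "subject": [{"name": "United States--History--Civil War, 1861-1865"}],
        "description": "One handwritten letter.",
        # The loosely controlled collection block discussed above could be
        # provenance-based, subject-based, or simply the partner's name:
        "collection": {
            "title": "Doe Family Papers",
            "description": "Correspondence of the Doe family, 1850-1900.",
        },
    }
}
```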

In late 2015, DPLA also decided that, in order to continue to collaborate with the community effectively, the time had come to move on from the very broad open committees created during DPLA’s planning stages to a series of more focused working groups that could help solve specific issues. One of the first working groups we decided to put together was one related to the issues of collections data, specifically data created through archival description — hence, the Archival Description Working Group. While archives were the initial focus, the work of the group ended up being a more all-encompassing analysis of collections in DPLA regardless of the type of originating institution.

The recommendations the working group came up with were published as a white paper in 2016. They addressed five areas:

  • Recommendations for representing objects (item vs. aggregate)
  • Recommendations for relationship of object to collection
  • Recommendations for creating and sharing collection data
  • Recommendations for user interface
  • Recommendations for process

The Methodology of the Working Group

An important first step the group tackled was to define what we meant by the word “collection” in our context. From the white paper:

The term is used loosely by the working group to mean any intentionally created grouping of materials. This could include, but is not limited to: materials that are related by provenance, and described and managed using archival control; materials intentionally assembled based on a theme, era, etc.; and groupings of materials gathered together to showcase a particular topic (e.g., digital exhibits or DPLA primary source sets). Not included in this definition are assemblages of digital objects that are not the result of some sort of intentional selection. For example, all of the objects that are exposed to DPLA by a particular institution would, generally speaking, not be a collection in this sense. All of the digital objects that belong to a specific type or form/genre – maps, for instance – would also not be a collection in the context of this white paper.

In order to develop our recommendations, we took a three-phase approach. First, we did research. We read as many reports and articles as we could find on combining materials described at item- and collection- or aggregate-level, and we reviewed several digital library sites that did something innovative in this area. After the research phase, we synthesized our findings and created a list of user-based scenarios that we thought DPLA should support:

  1. It should be apparent to users, when they find an item or items, that these materials are part of a collection, where appropriate.
  2. Users should know as soon as they search that items are part of collections and should be able to act on that knowledge.
  3. Users should be able to refine and limit their searches by membership in collections.
  4. Users should understand when objects are described using traditional component-level, archival-style description, i.e., one object that represents many items.
  5. Users should be presented with appropriate metadata for objects, and this level of metadata and context may not be the same for all objects and collections. This could result in many items with the same description.
  6. Users may be presented with information that helps them make sense of where the item belongs within a collection if the collection structure or arrangement is meaningful.
  7. Collection/context information applies to different types of collections including exhibitions and primary source sets.
  8. Users should be able to go to DPLA and find a collection that interests them without doing an item search.
  9. Users should be able to find similar materials related to a retrieved item by their membership in the same collection.

We then used the scenarios to guide us through the process of making recommendations for changes to DPLA metadata, workflow, and interface.

Recommendations for item and aggregate objects

Rather than write at length here about each area of recommendations, I’d like to address just the area of biggest discussion: recommendations for representing objects (item vs. aggregate).

The question that drove this discussion was how data created about materials in the aggregate can be used in DPLA. A prime example of this kind of data is a folder-level description that an archive might create. In the past decade in particular, this kind of practice has increased in the archival community, largely inspired by the landmark publication of More Product, Less Process. DPLA has increasingly received records from contributing institutions that reflect things like folders of materials rather than the individual items within them. In the archival community, the finding aid, which contextualizes and describes an entire collection, is the norm. However, DPLA doesn’t just serve archives. We have a huge collection of books, films, reports, journals, etc., all of which are individual items. Our searching, indexing, and presentation designs are all based on the idea that each record corresponds to a single individual object. Since DPLA can’t just adopt an archival, collection-based description approach, the working group focused part of its efforts on thinking through how aggregate-level descriptions could be combined with the existing item-level paradigm.

When faced with the question of how to translate metadata for a folder to DPLA, institutions have usually adopted one of two solutions: create one description that covers a group of items in the aggregate, or create many minimal records, one per item, reusing similar data in each. Either approach might be best in particular situations. For example, in specific cases of unique visual materials, an institution may want to opt for lots of individual items with minimal metadata. Even though the metadata is similar for each, this allows the visual material to be discretely findable.

On the other hand, a search experience of record after record of textual materials with indistinct, virtually identical images would not suit the majority of those materials. In this case, the experience of finding a basic, high-level description of a folder and then following the link back to the originating institution to examine the materials in depth seemed to be the best fit.

The working group doesn’t want to recommend one style of description over another. Instead, we want to work with the kinds of descriptive practices that professionals are already using. We want DPLA to fit into accepted professional practice, not create yet another new approach that may or may not be adopted. We think that an infrastructure flexible enough to encompass both aggregate objects and item-level objects, while also communicating relationships between materials, will best serve the user scenarios we came up with. Furthermore, we wanted to rely on the judgment of librarians, archivists, and other professionals about the best way to describe and provide access to their own materials rather than dictate something to them.

For both types of description, though, the working group members agreed that the addition of collection titles, descriptions, and links back to collection-level descriptions or home pages at the original institution would greatly help communicate what these objects are to their audience. The other sections of the white paper go into detail on how that collection-level information can be gathered, stored, and displayed effectively in DPLA. Combined with the flexible approach to item description described above, the working group felt that these changes would best achieve the goals of the user scenarios.

Recommendations have context too

It’s important to remember that this and the other recommendations in the white paper are for DPLA; in other words, they pertain to the handling of collection and aggregate-object metadata in a heterogeneous, large-scale aggregated environment. They should not be read as recommendations for every cultural heritage institution everywhere. Those submitting data to DPLA would need to publish it in a way that we could use, but within their own context (their own repository or website), the data may be better served by a different presentation. I would encourage anyone involved in a DPLA contributing institution, or interested in metadata aggregation overall, to read the white paper and think about how these recommendations might or might not fit their own institution.

It’s also important to remember that these are recommendations only. DPLA is in the process of implementing a number of them, but some have turned out to be infeasible or are affected by other DPLA initiatives. In particular, the recommendations around representing objects and process are being implemented, but those around creating and sharing data and the user interface have been refined. Another working group, focused on overall revisions to DPLA’s metadata application profile, is suggesting further refinements of collection data, and the interface is being worked out through an overall DPLA website redesign. In the end, I feel the spirit of the recommendations will definitely be adopted, but with a few tweaks.

Metadata Cleanup Made Easy with OpenRefine

This is Gabe Galson. I work here at Temple University on the PA Digital Metadata team and would like to share with you some tricks of the trade. Shhhh! These are the exclusive Metadata Cleanup secrets the pros don’t want you to know about. Field value normalization life hacks that, after this exclusive blog post, will go back into the PA Digital Vault forever.

Are you frustrated because your local repository doesn’t feature expandable and collapsible facets that can be sorted either alphabetically or by frequency of occurrence, enabling effortless detection of slightly divergent values? Don’t be. Now there’s OpenRefine.

OpenRefine is a powerful metadata cleanup tool that allows you to replicate our PA Digital aggregator’s key functionalities on any standard computer. All you need is an export of your data in a tabular format, that is to say, your metadata as an Excel, tab-separated value (TSV), or comma-separated value (CSV) file. Refine will ‘Hoover’ up such data, display it as a spreadsheet, then let you view any field’s constituent values via a facet box, as in our aggregator. When sorted alphabetically, the facets Refine generates let you eyeball inconsistent values. Check out ‘Philadelphia (Pa.)’ vs. ‘Philadelphia, (PA)’ in this screenshot of a Refine facet:

[Screenshot: a Refine text facet listing place-name values, including ‘Philadelphia (Pa.)’ and ‘Philadelphia, (PA)’]

From there the values can be quickly standardized.
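If you’re curious what a text facet boils down to, here’s a minimal Python sketch of the same idea: tally the distinct values in one column and sort them. This isn’t OpenRefine code, and the file and column names are made-up examples.

```python
import csv
from collections import Counter

# Tally the distinct values in one column of a metadata export.
# This is conceptually what a Refine text facet does; the file name
# and column name ("coverage") are made-up examples.
with open("metadata_export.csv", newline="", encoding="utf-8") as f:
    counts = Counter(row["coverage"] for row in csv.DictReader(f))

# Alphabetical order puts near-duplicates like "Philadelphia (Pa.)"
# and "Philadelphia, (PA)" right next to each other, as in a sorted facet.
for value, count in sorted(counts.items()):
    print(f"{count:>6}  {value}")
```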

Refine’s clustering functionality will pull slightly divergent values, which may not be obvious from a simple browse, out of a dataset of any size. These values can then be quickly standardized en masse from within the clustering tool. All 145 Philadelphia variants detected in the screenshot below can be standardized with a single click. Wow!

[Screenshot: Refine’s clustering dialog grouping 145 ‘Philadelphia’ variants for one-click standardization]
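Under the hood, Refine’s default clustering method is key collision on a “fingerprint”: values that reduce to the same normalized key get grouped together. The sketch below reimplements that idea in plain Python so you can see why 145 variants can collapse at once; it follows the fingerprint algorithm as documented in the OpenRefine wiki, though the details here are my own simplification.

```python
import string
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Simplified version of OpenRefine's fingerprint key:
    lowercase, strip punctuation, then sort and dedupe the tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokens = value.lower().translate(table).split()
    return " ".join(sorted(set(tokens)))

def cluster(values):
    """Group values whose fingerprints collide, i.e. likely variants."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    return [group for group in clusters.values() if len(set(group)) > 1]

# All three variants reduce to the fingerprint "pa philadelphia",
# so they land in one cluster and can be standardized together.
print(cluster(["Philadelphia (Pa.)", "Philadelphia, (PA)", "philadelphia PA."]))
```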

Traditional spreadsheet programs don’t make it easy to work with multiple values contained in a single cell. With Refine it’s a breeze. Assuming the values are separated by a consistent delimiter, you can split each value into its own row, then fold the rows back into the original record when cleanup is complete (see the sketch after the screenshots below).

Go from this…

[Screenshot: one record with several values packed into a single cell]

to this…

[Screenshot: the same values split into individual rows]

… and then back again whenever you want!
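If you want to see the same round trip outside Refine, here’s a hedged pandas sketch of the split-then-rejoin pattern; the column name, record ID, and delimiter are assumptions for illustration.

```python
import pandas as pd

# A record whose "subject" cell holds several values behind one delimiter.
df = pd.DataFrame({
    "id": ["rec1"],
    "subject": ["Agriculture; Poultry; Pennsylvania"],  # example values
})

# Split into one row per value (Refine: "Split multi-valued cells").
split = df.assign(subject=df["subject"].str.split("; ")).explode("subject")

# ...clean each value individually here...

# Fold the rows back into one record (Refine: "Join multi-valued cells").
joined = split.groupby("id", as_index=False).agg({"subject": "; ".join})
print(joined)
```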

But wait, there’s more! Refine also sports an array of other features useful to anyone with messy data on their hands. It lets you structure unstructured data, converting text blocks into spreadsheets. It allows you to stack multiple facets to your heart’s content, star individual records of interest and then facet on the star marker, or isolate a particular subset of your data and perform complex operations on that subset alone. Refine offers a robust undo/redo interface, allowing you to test out complex transformations without risk.

Refine also lets you integrate regular expressions, GREL commands, and Python scripts into your basic cleanup operations, making it extremely flexible and powerful. GREL (the Google Refine Expression Language, a simple programming language native to Refine) is no more complicated than Excel’s formulas; it lets those with no coding experience perform fairly sophisticated data cleanup operations. Refine will also populate columns with data fetched from websites or APIs. For example, the Google Maps API can be called to return the geographic coordinates corresponding to individual street addresses found within your dataset.
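As a taste of that last feature, here’s a rough Python equivalent of Refine’s “Add column by fetching URLs” step, calling the Google Maps Geocoding API for a single address. The endpoint and response shape are as I understand them from Google’s documentation, you’d need your own API key, and the sample address is made up.

```python
import requests

def geocode(address: str, api_key: str):
    """Return (lat, lng) for a street address via the Google Maps
    Geocoding API, roughly what Refine's 'Add column by fetching URLs'
    plus a GREL parseJson() expression would do in two steps."""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None  # no match for this address
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# Example call (hypothetical address and key):
# print(geocode("1900 N 13th St, Philadelphia, PA", "YOUR_API_KEY"))
```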

As you can see, Refine, which is open source, free to download, and fairly easy to pick up, will put many of our aggregator’s functions at your fingertips, allowing you to independently prepare your data for ingestion into the DPLA.

If you’d like to learn more about OpenRefine, feel free to sign up for the PA Digital team’s upcoming in-person Metadata Anonymous workshop, which will feature a hands-on, in-depth intro to Refine that will take you from zero to 100 in no time at all! All of the operations illustrated or described in this blog post will be covered. Sign up here if you’re interested!

If you want to dive right in, take a look at this index of OpenRefine resources on the web. 

OpenRefine Resources:

  • OpenRefine wiki: https://github.com/OpenRefine/OpenRefine/wiki
  • Overview of GREL syntax: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Expressions
  • Comprehensive index of GREL functions by type: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions
  • OpenRefine listserv/discussion group: https://groups.google.com/forum/#!forum/openrefine (if I have an issue I can’t solve on my own, I ask for help here; many advanced Refine users monitor the board and will often be happy to help)
  • Regex cheat sheet aimed at OpenRefine users: http://arcadiafalcone.net/GoogleRefineCheatSheets.pdf
  • Good basic OpenRefine introduction: https://casci.umd.edu/wp-content/uploads/2013/12/OpenRefine-tutorial-v1.5.pdf
  • Another good OpenRefine introduction: http://www.meanboyfriend.com/overdue_ideas/wp-content/uploads/2014/11/Introduction-to-OpenRefine-handout-CC-BY.pdf
  • Another good introduction: http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
  • Treasure trove of advanced OpenRefine recipes: http://keithmaguire.net/ (the short recipes at http://keithmaguire.net/articles/open-refine-short-recipes.html are especially useful to Refine novices)
  • The “OpenRefine for Metadata Cleanup” PDF on this page is an excellent OpenRefine tutorial that includes many useful recipes, including the date transformations featured in our own cookbook: https://www.orbiscascade.org/workshop-4-metadata-cleanup
  • Book: Verborgh, R., & De Wilde, M. (2013). Using OpenRefine: The essential OpenRefine guide that takes you from data analysis and error fixing to linking your dataset to the Web. http://www.worldcat.org/oclc/859384483 or https://www.amazon.com/Using-OpenRefine-Ruben-Verborgh/dp/1783289082/ref=sr_1_1?s=books&ie=UTF8&qid=1492262092&sr=1-1&keywords=using+open+refine
  • Free Your Metadata OpenRefine reference: site: http://freeyourmetadata.org/; book: http://book.freeyourmetadata.org
  • Especially for archivists: Chaos —> Order is a great blog about data manipulation and cleanup tools for archival data, with several very useful posts about OpenRefine, especially on dates and duplicate subjects. Blog: https://icantiemyownshoes.wordpress.com/; Refine posts: https://icantiemyownshoes.wordpress.com/tag/openrefine/
  • Especially for archivists: The Bentley Historical Library at the University of Michigan maintains an excellent blog about their experiences integrating ArchivesSpace, Archivematica, and DSpace, with several very interesting posts about using OpenRefine and Python to clean their EAD files. Blog: http://archival-integration.blogspot.com/; Refine posts: http://archival-integration.blogspot.com/search/label/OpenRefine
  • Regular expression (regex) tutorials: http://www.regular-expressions.info

 

Society of American Archivists (SAA) 2017

In late July I had the pleasure of participating in the annual meeting of the Society of American Archivists (SAA) and serving as a co-panelist at a rights-oriented session: https://archives2017.sched.com/event/ABHL/504-the-rights-stuff-encouraging-appropriate-reuse-with-standardized-rights-statements. The session was moderated by Kelcy Shepherd (DPLA Network Manager, Digital Public Library of America), and my co-panelists were Laura Capell (Head of Digital Production & Electronic Records Archivist, University of Miami), MJ Han (Metadata Librarian, University of Illinois at Urbana-Champaign), and Sheila McAlister (Director, Digital Library of Georgia).

The session was quite well attended, with approximately 150 people in the room, and heavily promoted after the sessions were made available online (I’ve seen it float by on multiple mailing lists!). Although the introductory slides are available, I strongly recommend this excellent summary of the panel from Michael Barera (Archivist, Texas A&M University-Commerce). Given our audience, archivists who likely hadn’t implemented standardized rights statements to any great degree, we provided an introduction to RightsStatements.org and the standardized statements, then moved to a panel discussion of the benefits of providing standardized rights information, our implementation hurdles, and the complexities we’ve encountered in assigning rights statements, before answering questions from the community.

As for the meeting itself: it was very large, with multiple concurrent sessions; a few days of Society business meetings and workshops preceded the primary conference sessions, and an unconference followed the meeting. The meeting was hosted at the Portland Conference Center, and, to everyone’s delight, one of the conference hotels hosted a cat show the day after the conference.

As a copyright attorney working as a librarian, I can find archivists’ work somewhat opaque, so having the opportunity for informal networking and discussion of archival practices in the rights context at institutions besides my own was invaluable.


I attended sessions that were pretty unique to SAA. My favorite was Everyone’s Vested Interests: Archivists and Affinity Groups, which turns out to be a really oblique way of saying super unique corporate archives. We were regaled with tales from Coca-Cola, Levi’s, IBM (including the making of Hidden Figures), Harley-Davidson, and Disney. The session was a must-attend for anyone working in archives that preserve fandom-focused materials (and, as a Penn State employee, I found it useful for understanding how the Archivists & Special Collections team uses what we have to cultivate our fan base and obtain important scholarly collections). Plus, the exhibits hall was packed full of vendors showing off their new techy wares; I was particularly taken by the automatic microfiche scanners!

Finally, one of my favorite things was the epic ribbons that SAA provides conference-goers! A more accurate badge has never been assembled.

Meet the Developer Team – Chad Nelson


For our first edition of Meet the Developer Team, we would like to introduce Chad Nelson who works at Temple University and is a part of PA Digital. The Developer Team has worked tirelessly on PA Digital’s aggregator so that our hub can harvest over 214,000 records from 38 institutions throughout Pennsylvania and contribute them to the DPLA.

Michael Carroll, Interviewer (MC): Can you tell our readers a little about yourself and your role/association with Temple University and the work you do there?

Chad Nelson, Developer Team Member (CN): I’m Lead Technology Developer at Temple University Libraries. I’m part of the team that builds applications and services that help users get access to our resources and make staff lives easier. I’m responsible for keeping the team of developers interested and focused, thinking strategically about our infrastructure and the maintainability of our applications.

MC: How did you first get involved with PA Digital and the DPLA?

CN: Before I started at Temple, which is when I started working with PA Digital, I was already involved with DPLA as a Community Rep. As a rep, I used data from the DPLA and the DPLA API to build applications that I thought explored the data in new ways. As part of that process, I wrote a small software library making it easy to use the DPLA API from the Python programming language. I also built a few apps: Color Browse (http://colorbrowse.club/), which lets a small selection of DPLA data be searched by color, and DPLA by State and County (http://chads.space/dpla/), which shows DPLA item distribution at the state and county level.
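(Editor’s aside: for readers curious what a call to the DPLA API looks like without a wrapper library, here is a minimal sketch using the requests package. The /v2/items endpoint and its docs/sourceResource response shape follow DPLA’s public API documentation as I recall it; you would substitute your own API key.)

```python
import requests

API_KEY = "YOUR_DPLA_API_KEY"  # request a key from DPLA first

# Search DPLA items -- the kind of call a Python wrapper library tidies up.
resp = requests.get(
    "https://api.dp.la/v2/items",
    params={"q": "chicken", "page_size": 5, "api_key": API_KEY},
    timeout=10,
)
resp.raise_for_status()

# Each matching record carries its descriptive metadata in sourceResource.
for doc in resp.json().get("docs", []):
    print(doc["sourceResource"].get("title"))
```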

MC: Can you elaborate about your role in and contributions to PA Digital?

CN: I maintain the servers that power the PA Digital Aggregator and contribute to the software we use to pull in data from contributors, normalize it, and feed it through to the DPLA. This includes updating the application to perform new functions, reviewing code submissions from other developers on the PA Digital project, working with the metadata team on designing future requirements, and analyzing problems with data from new contributors.

MC: What is your favorite app for engaging with DPLA materials? How did you go about the initial stages of developing your apps for the DPLA?

CN: The Color Browse app is my favorite app that I worked on because I learned so much from building it, and it has really helped me discover items in collections I would never have otherwise found.
I started off that project by wondering how searching and classifying by color could even happen. I had never done that kind of work before, so I started off small, trying to understand how I would get a list of all the colors in an image. Once I had a good sense of how that worked, I looked around for a dataset with lots of images that I knew how to get hold of easily, and DPLA was the obvious choice.
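(Editor’s aside: if you’re wondering what “a list of all the colors in an image” might look like in code, here is a rough sketch using the Pillow library. It’s an illustration of the general approach, not the method Color Browse actually uses, and the file name is hypothetical.)

```python
from PIL import Image

def dominant_colors(path: str, n: int = 5):
    """Return the n most common colors in an image as (count, (r, g, b)).
    Quantizing to a small palette first groups near-identical shades."""
    img = Image.open(path).convert("RGB")
    # Reduce to a 64-color palette so similar pixels count as one color.
    img = img.quantize(colors=64).convert("RGB")
    counts = img.getcolors(maxcolors=64 * 64)  # list of (count, color)
    return sorted(counts, reverse=True)[:n]

# Example call on a hypothetical thumbnail:
# print(dominant_colors("item_thumbnail.jpg"))
```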

MC: Are there any apps that you are currently working on or would like to see developed for the DPLA?

CN: I’m working, very irregularly, on the Color Browse application to add more items, allow searching for multiple colors within an image, and keep it in sync with DPLA.

MC: How would you rate your experience working on PA Digital and how does it relate to the work you do at Temple University?

CN: Working on PA Digital has been challenging. Trying to find the right balance when writing applications that are general enough to handle the many different systems our contributors use is not easy. It is something the DPLA itself has struggled with, and it is pretty obvious why.

It has really made me appreciate just how diverse and varied the structure of cultural heritage data is, and what a huge undertaking by the DPLA it is to have aggregated as much as they have.

2017 Updates

2017 has already been an amazing year for PA Digital!

We began our year in January with our “Highlights of DPLA Whitepapers” webinar, which gave an overview of three complex documents for our existing and prospective contributors. During this webinar, we summarized Aggregating and Representing Collections in the Digital Public Library of America, a paper that explored the possibility of including more collection-level description within the DPLA. The second white paper, RightsStatements.org White Paper: Recommendations for Standardized International Rights Statements, serves as documentation and information for RightsStatements.org. Lastly, we spoke on the DPLA Metadata Quality Guidelines, which refresh the DPLA’s metadata requirements and recommendations for better data quality. View our slides here!

We have had two harvests so far this year. Our April harvest saw the inclusion of Bryn Mawr College, Bloomsburg University, Montgomery County Community College, Slippery Rock University, Ursinus College, Philadelphia University, and the Pennsylvania Horticultural Society. This harvest included 19 new collections and 18,480 digital objects (records).

PA Digital was well-represented at DPLAFest 2017 in Chicago. Brandy Karl, Copyright Officer, from Pennsylvania State University presented on “Implementing Rights Statements @ PSU and PA Digital” (part of Turn the Rights On: a RightsStatements.org Update and Comparison of Regional Rights Standardization Projects). View her slides here!

Delphine Khanna and I presented on “Reaching Out to Potential DPLA Hub Contributors: PA Digital’s Communication Strategy and Plan, or ‘The Accidental Public Relations Manager.’” View our slides here!

Our June 2017 harvest saw the inclusion of West Chester University, Pennsylvania State Archives, La Salle University, Millersville University, Sewickley Public Library, and Carnegie Library of Pittsburgh. This harvest also added 48 new collections and 27,780 digital objects (records).

We would like to extend warm thanks to all who worked with us to bring in new collections.

You can see all of PA Digital’s records in the DPLA by searching or faceting on our name, “PA Digital”: PA Digital Records in the DPLA.

View our progress since we went live in DPLA:

[Slideshow]

We also revamped our website recently. Check it out: https://padigital.org/


In addition to new contributors and records, we are planning:

  • Two metadata workshops,
    • Metadata Anonymous Webinar, 8/23 at 1pm
    • If You Liked it Then You Should Have Put Metadata On It: Descriptive Cataloging and Selecting Rights Statements for Digital Collections, at the 2017 Pennsylvania Library Association (PaLA) conference, 10/18 at 9am
  • Two orientation webinars, and
    • Knight Orientation Webinar, 7/20 at 1pm
    • Fall Webinar TBD
  • Three educational online modules on rights statements for this summer and fall.
    • What is Copyright?
    • What is a Rights Statement?
    • Implementing Rights Statements

We are looking forward to presenting our work, onboarding more institutions, and bringing in more content from current contributors over the coming year. Stay tuned for more details.

As usual, for information about our project, or about how you can participate in PA Digital and the DPLA, please email us anytime at info@padigital.org.

Sincerely,

Rachel Appel, Co-Manager, PA Digital, on behalf of the PA Digital Team

[Image: Pennsylvania State Archives, Chicken on Barrel with String on Leg]