Blog

Memorial Day in Pennsylvania

Observed in the United States on the last Monday of May, Memorial Day honors those who died in the armed forces. Many cities in the United States have claimed to be the originators of Memorial Day, including Boalsburg, Pennsylvania. Boalsburg declared itself the birthplace of Memorial Day in 1864, according to this pamphlet (oddly enough, a part of the Digital Library of Georgia’s collections!). Unfortunately for Boalsburg, President Lyndon Johnson declared in 1966 that Waterloo, New York was the official birthplace of this tradition. Regardless of its birthplace, we celebrate this day every year by organizing parades, decorating gravesites, wearing remembrance poppies, and thanking those who served.

Here are some items from PA Digital collections highlighting celebrations in our state’s past:

Memorial Day Parade, 1927. Courtesy of James V. Brown Public Library.

Memorial Day Parade, 1945. Courtesy of James V. Brown Public Library.

View of canoes with lead canoe pulling a smaller canoe filled with flowers, Bethlehem, Pennsylvania. Courtesy of Lehigh University.

Thomas Sovereign Gates (Penn President, 1930-1944) places a wreath at War Memorial plaque in Furness Library, 1945. Courtesy of the University of Pennsylvania.

Memorial Day also typically signals the official beginning of summer! After the prolonged winter this year, we could sure use it. The long weekend is a popular time for backyard barbecuing, spending time in the garden, or heading down the shore (and, of course, getting stuck in traffic on the way home). Take a look at how Pennsylvanians have celebrated:

Boys swim at Vare Recreation Center, 1964. Courtesy of Temple University.

Miss Mermaid celebrates Atlantic City beach opening. Courtesy of Temple University.

Memorial Day traffic, 1972. Courtesy of Temple University.

We hope you all take some inspiration from these celebrations and enjoy your Memorial Day weekend. Happy summer!

MARAC Spring 2018

This post is by Brandy Karl, Copyright Officer at the Pennsylvania State University and member of the PA Digital Metadata Team Rights Subgroup, and Rachel Appel, co-project manager of PA Digital.

Our panel! Top row from left: Doreva Belfiore, Linda Tompkins-Baldwin, Gabe Galson, Jen Palmentiero. Bottom row from left: Paul Kelly, Rachel Appel, Brandy Karl. Photo taken by Jessica Lydon.

This month, we had the pleasure of presenting at the Mid-Atlantic Regional Archives Conference (MARAC) Spring 2018 meeting in Hershey, PA, along with PA Digital colleagues Doreva Belfiore (HSLC) and Gabe Galson (Temple University), and members of other hubs: Paul Kelly (DC Public Library and District Digital), Linda Tompkins-Baldwin (Enoch Pratt Free Library/State Library Resource Center and Digital Maryland), and Jen Palmentiero (Southeastern New York Library Resources Council and Empire State Digital Network).

A highlight of the conference was the keynote by Trevor Owens, Head of Digital Content Management at the Library of Congress. Trevor’s keynote was framed around his book Theory & Craft of Digital Preservation (full preprint available via link). He discussed how we’ve been working on digital preservation for over half a century and made points about the holistic nature of digital preservation. For example, software cannot preserve anything and a repository is the work that people do with tools, workflows and processes. Hoarding is also not digital preservation and therefore appraisal is key. It was a great way to start the conference!

Our own panel was a birds-of-a-feather session on rights statements, “True Rights Statement Confessions” [slides can be found at this link]. The session, which consisted entirely of Q&A, brought together experts from mid-Atlantic DPLA Hubs who have implemented standardized rights statements for digital collections, worked on education and training for their constituent institutions’ digital collections, or carried out rights statement analyses across their home institution or constituent collections. Attendees were encouraged to ask any questions about normalized rights statements. We had some great questions and discussions, such as when to use the Public Domain license versus the No Copyright – US statement, and how rights statements are user-centric and focus on potential uses of the item rather than the repository’s risk profile.

Interestingly, we did get a number of questions that focused on copyright concerns. One great question asked about the difference between the rights of the original work and the digitized facsimile, or surrogate. Other questions included where to put in permissions information (if at all) and the notion of what is someone’s intellectual property in handwritten modern letters.

We had a total of 68 attendees and did not need to use any of the backup questions we had prepared in case folks didn’t have any. We hope the attendees enjoyed the session and MARAC Spring 2018 as much as we did.

DPLA Members Meeting 2018 Recap

Rachel Appel, Doreva Belfiore, Gabe Galson, and I attended the first DPLA Members Network Meeting held in Atlanta, GA. Including PA Digital, 23 of 27 member hubs were represented at the meeting, which provided us the opportunity to chat with other attendees about our ideas, goals, projects, questions, challenges, and successes.

The first day consisted of updates from the DPLA team, including a welcome from new DPLA Director John Bracken, who set the tone for the meeting by asking questions for us to consider around our audiences and our impact. Other members of the DPLA team provided updates on ongoing work around curatorial projects, rights statements applications, QA practices, and analytics. We learned that the DPLA has 141 Primary Source Sets available on their website, which account for 30-35% of traffic to the DPLA during the school year. DPLA also has 33 exhibits currently, which account for 15-20% of traffic to DPLA. These numbers caught my attention because we are in the midst of creating our own Primary Source Sets at PA Digital, and I wanted to know not only how the DPLA went about creating theirs, but also how they measured their impact.

Another highlight of the first day was nine lightning talks covering a variety of projects spearheaded by hubs, ranging from metadata aggregation in Michigan to geospatial mapping in Minnesota to connecting LIS students with cultural heritage institutions in Wisconsin. Rachel and I were able to present on the Primary Source Sets project too!

Using sorcery to impress everyone at the DPLA Members Meeting! (Photo courtesy of Doreva Belfiore.)

Our talk, Primary Source Set Sorcery, gave an overview of our approach to creating primary source sets (disclaimer: no sorcery was actually performed). We received positive feedback from hubs who have already created Primary Source Sets or were working towards it. We’re looking forward to updating you on this project more soon! (You can also find our slides here.)

The second day of the meeting included sessions and workshops in areas such as rights statements, outreach, networking, repository challenges, partner recruitment, and building hubs. I attended a workshop on rightsstatements.org with Greg Cram from the NYPL and Emily Gore from the DPLA who walked us through the three major categories of rightsstatements.org: In Copyright, No Copyright, and Other, with really helpful examples of what a good rights statement looks like, as well as some confusing ones. This is something we have been actively working on at PA Digital and it was great for me to see how Greg and Emily taught us so we can continue educating ourselves and our contributing institutions on how to properly apply rights information to their collections.

Some of the sessions around outreach and partner recruitment allowed hubs to share approaches that have worked for them as well as some challenges that we all face. One of the challenges that resonated with me was how to reach out to unique types of institutions and/or users and how to measure the impact we have. For example, one challenge that many hubs related to was connecting with institutions across large states. An obstacle we are working on is making connections with institutions in Central and Western Pennsylvania, while I am based at Temple University, all the way out east in Philadelphia. I heard from many other hubs whose staff are centralized in one part of the state and who don’t know where to start in reaching out to institutions further away. Many other hubs do a lot of work to keep up with local conferences and listservs and to follow up with their connections from all over their state. This is something we will continue to improve on, and if you’re reading this and interested in working with us, email us!

Having a community of peers to connect with around these issues and questions was really helpful and PA Digital is excited to continue participating in these events. Thanks for hosting us, DPLA!

Stefanie Ramsay


The Importance of Fair Use and Standardized Rights Statements for Digital Cultural Heritage

By Gabe Galson and Rachel Appel

Originally posted for Fair Use Week and Scholarly Communication @ Temple.

At Temple University Libraries several staff members work on the PA Digital project. PA Digital is the Pennsylvania service hub of the Digital Public Library of America (DPLA). The DPLA aggregates digital collections (images, photographs, text, maps, audio and video) shared by libraries and archives’ special collections all across the United States, allowing researchers to efficiently search, browse, and utilize these resources through a single interface. PA Digital is a statewide partnership that collects materials from Pennsylvania cultural heritage organizations, then transmits them to the DPLA, making them available to the widest possible audience. All of these activities are enabled, to a great degree, by Fair Use.  

Fair Use is a US legal doctrine that allows limited reuse of copyrighted materials. It is an invitation to the sort of intellectual/artistic exchange that keeps our culture vibrant, and a counterbalance against the US’s increasingly strict copyright laws. Sampling, artistic appropriation, creative or educational quotation, parody, and text mining/textual analysis are all activities that flourish under Fair Use’s protection, shielded (to a degree at least) from the threat of litigation. Likewise, libraries, archives, and museums around the country have been able to digitize their archival objects and make them freely accessible online because of the fair use doctrine. Many digital collections that are available through PA Digital and the Digital Public Library of America, for example, are in copyright; digitizing them and making them publicly discoverable through a database platform is considered fair use. However, it is important to communicate clearly to users, such as scholars and researchers, that such works remain in copyright and have use restrictions and limitations. Fair Use is a key concept that enables both digitization and reuse of digital facsimiles, and it is the rationale for making cultural heritage collections available online, in local repositories as well as the DPLA.

That’s where RightsStatements.org comes in. The site provides 12 normalized, standardized statements that cultural heritage institutions can use to describe the copyright status of online cultural heritage materials. A joint global project of Digital Public Library of America, Europeana, New York Public Library, University of Michigan, and other institutions, RightsStatements.org went live in 2016. It groups its statements into three categories (five In Copyright statements, four No Copyright statements, and three Other statements) to be used with cultural heritage materials, including some terms for use in the EU. The goal is to provide cultural heritage institutions with simple and standardized terms to summarize the copyright status of Works in their collection and how those Works may be used.

The three overall categories are In Copyright, No Copyright, and Other.

Rights Statements and Licenses are critical for digitization and data reuse. A normalized rights statement or Creative Commons license makes it much easier for a member of the public to understand how an item can be used. The Digital Public Library of America has incorporated RightsStatements.org statements into their portal to function as a facet for searching because they are all machine readable and normalized. A useful analogy is shopping through an online retailer: when we buy from online retailers, what do we look for? Ideally, items with Free Shipping, which we can filter on. In the same way, the rights facet makes it easy for scholars to look for Works that can be used in their publications and research.


Example DPLA record from Penn State University with NoC-US statement
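To illustrate why machine-readable statements work well as a search facet, here is a minimal sketch in Python. It is not DPLA's implementation: the record titles and the `facet_by_rights` helper are hypothetical, and only the two statement URIs come from RightsStatements.org. Because every record carries the same normalized URI for a given status, grouping or filtering is a simple exact match.

```python
from collections import defaultdict

# Normalized statement URIs published by RightsStatements.org.
NOC_US = "http://rightsstatements.org/vocab/NoC-US/1.0/"
IN_COPYRIGHT = "http://rightsstatements.org/vocab/InC/1.0/"

# Hypothetical records, invented purely for illustration.
records = [
    {"title": "Sample broadside", "rights": NOC_US},
    {"title": "Sample photograph", "rights": IN_COPYRIGHT},
    {"title": "Sample map", "rights": NOC_US},
]


def facet_by_rights(items):
    """Group record titles by their rights statement URI."""
    facet = defaultdict(list)
    for item in items:
        facet[item["rights"]].append(item["title"])
    return dict(facet)


facet = facet_by_rights(records)
for uri, titles in facet.items():
    print(f"{uri} ({len(titles)} records)")
```

A free-text rights note such as "public domain, we think" could not be grouped this way without manual review, which is the practical argument for normalized statements.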

Beyond traditional scholarship, normalized rights statements can also encourage creative reuse of Works if people know what they can and can’t do. Examples include DPLA’s annual GIF IT UP campaign, where users make images into gifs, and the #ColorOurCollections nationwide promotion by galleries, libraries, archives and museums, where end users are encouraged to reuse digital objects as coloring pages.


Gif made by Michael Carroll for GIF IT UP 2017. Drawing (Two Birds on Flowers) from the Free Library of Philadelphia.

Rightsstatements.org is still getting off the ground, but it promises to make the process of identifying usable Works far simpler and less time-consuming for researchers, scholars, and students. Take a look at the Europeana aggregator’s eight million plus ‘free reuse’ results for an example of what’s possible via machine-readable statements. Go forth and reuse!

More resources:
Ballinger, Linda, et al. “Providing Quality Rights Metadata for Digital Collections Through RightsStatements.org.” Pennsylvania Libraries: Research & Practice, vol. 5, no. 2, 2017, pp. 144–158. http://palrap.pitt.edu/ojs/index.php/palrap/article/view/157

Fair Use Checklist: http://copyright.psu.edu/checklist/

RightsStatements.org Resources: http://rightsstatements.org/en/

PA Digital webinars:

Menand, Louis. (2014). Crooner in Rights Spat. The New Yorker. Retrieved from https://www.newyorker.com/magazine/2014/10/20/crooner-rights-spat

2017 Knight Grant Subaward Success

We at PA Digital are wrapping up our work with our generous Knight Grant subaward from the DPLA for 2017.

The funding from the Knight Grant allowed us to expand our outreach to potential contributors in the Knight Communities of Philadelphia and State College, PA. We have seen a sizable increase in the number of records, collections, and institutions represented in DPLA as a result of these outreach efforts. 

Our numbers:
Since March, we have ingested 59,575 new records from 93 new collections, and 12 new institutions within the Knight communities. This exceeded our goals for the sub-award!


Our events:

  • PA Digital Orientation webinar targeted towards Knight institutions on July 20th [slides]
  • Pennsylvania Library Association “Connect and Communicate” webinar on May 17
  • Greater Philadelphia Law Library Association Institute in-person session on June 23
  • “Metadata Anonymous” webinar in an effort to expand the conversation about metadata quality and explain our various review processes on August 23 [slides]
  • “Metadata Anonymous” in-person workshop based on the webinar at the Free Library of Philadelphia on December 7 [slides]

Our Rights Statements Training:

We created three Rights Webinar “modules” with the aim of providing context and guidance toward implementing the RightsStatements.org recommendations.

  • Copyright 101 [video]
  • What is a Rights Statement [video]
  • Implementing Rights Statements [video]

Thank you to everyone who became a contributor, attended one of our workshops or webinars, and supported our increased outreach efforts to Knight communities. Please contact us if you have any questions, concerns, or would like to contribute additional collections: info@padigital.org.


Image: East Broad Top Railroad 14 train, Frank G. Zahn Railroad Photograph Collection, Temple University


Image: Epistle of caution against pride, &c. from the Yearly Meeting in London, 1718, Quaker Broadsides Collection, Haverford College Quaker and Special Collections & Friends Historical Library of Swarthmore College
Prior update blog post: https://padigital.org/2017/07/12/2017-updates/

“Implementing RightsStatements.org” Module Available!

The PA Digital Metadata Rights Subgroup Team is excited to present the third of three video modules on copyright and rights statements. The third module, “Implementing RightsStatements.org” provides an overview of rights statements and how to implement them as shown through examples done at Penn State University. This is a condensed version of Linda Ballinger, Brandy Karl, and Anastasia Chiu’s article, “Providing Quality Rights Metadata for Digital Collections Through RightsStatements.org” in Pennsylvania Libraries: Research and Practice.


If you have any questions, please feel free to email info@padigital.org or visit padigital.org.

Gretchen Gueguen on the DPLA Archival Description White Paper

As a Pennsylvania native (Kittanning, PA), a graduate of our flagship institution (Penn State), and a current resident (Erie, PA), I can’t help but be a little partial to the PA Digital Hub. So I am delighted to have the opportunity to discuss collections, description, and access at DPLA on this blog, and particularly to talk about Aggregating and Representing Collections in the Digital Public Library of America.


I am the Data Services Coordinator at DPLA, which means that I am in charge of managing our data aggregation services. This involves working with our Content and Service Hubs in their early stages to ensure that their data will be interoperable with our data set and then managing the quality review and remediation process for the first data ingest. After that, I work closely with the technology team to schedule re-ingest and maintenance of the data. I have several ongoing initiatives to analyze and improve our overall data quality that I’ll be working on (with the help of our Hubs) over the coming months. There’s always more improvement to be made, which is a great challenge!

Putting together the Archival Description Working Group

Throughout 2015 several themes kept popping up in questions and conversations at DPLA: How do we communicate the context of materials that are closely related to each other? How do we take advantage of information that is created about collections while still keeping DPLA a library of digital materials only? How can materials that come from different descriptive traditions (libraries, archives, museums) be reconciled?

DPLA had always had metadata fields for collection name and description in our item-level records, but at that point they had never been tightly controlled. We focused far more on the basic item descriptors (title, subjects, description) and let institutions do what they wished with collection. Since “collection” is actually a term with a lot of different meanings, we ended up with a hodge-podge of data that could at times seem misleading or redundant. Some Hubs organized all the content from each partner into an institution-based collection, others had very broad subject- or format-based collections, while still others had very specific provenance-based collections. In short, we felt that the collection data was not ready for prime time, so while it was retained and indexed in the record for searching, it was not featured in the website version of the metadata record.

In late 2015, DPLA also decided that, in order to continue collaborating with the community effectively, the time had come to move on from the very broad open committees that had been created during the planning stages of the DPLA to a series of more focused working groups that could help solve specific issues. One of the first working groups we decided to put together addressed the issues of collections data, specifically data created in archival description; hence the Archival Description Working Group. While archives were the initial focus, the work of the group ended up being a more all-encompassing analysis of collections in DPLA regardless of the type of originating institution.

The recommendations the working group came up with were published as a whitepaper in 2016. They addressed five areas:

  • Recommendations for representing objects (item vs. aggregate)
  • Recommendations for relationship of object to collection
  • Recommendations for creating and sharing collection data
  • Recommendations for user interface
  • Recommendations for process

The Methodology of the Working Group

An important first step the group tackled was to define what we meant by the word “collection” in our context. From the whitepaper:

The term is used loosely by the working group to mean any intentionally created grouping of materials. This could include, but is not limited to: materials that are related by provenance, and described and managed using archival control; materials intentionally assembled based on a theme, era, etc.; and groupings of materials gathered together to showcase a particular topic (e.g., digital exhibits or DPLA primary source sets). Not included in this definition are assemblages of digital objects that are not the result of some sort of intentional selection. For example, all of the objects that are exposed to DPLA by a particular institution would, generally speaking, not be a collection in this sense. All of the digital objects that belong to a specific type or form/genre – maps, for instance – would also not be a collection in the context of this white paper.

In order to develop our recommendations we took a three-phased approach. First, we did research. We read as many reports and articles as we could find on combining materials described at item- and collection- or aggregate-level, and we reviewed several digital library sites that did something innovative in this area. After the research phase, we synthesized our findings and created a list of user-based scenarios that we thought DPLA should support:

  1. It should be apparent to users when they find an item/s that these materials are part of a collection if appropriate.
  2. Users should know as soon as they search that items are part of collections and should be able to act on that knowledge.
  3. Users should be able to refine and limit their searches by membership in collections.
  4. Users should understand when objects are described using traditional component-level, archival-style descriptions, i.e., one object that represents many items.
  5. Users should be presented with appropriate metadata for objects, and this level of metadata and context may not be the same for all objects and collections. This could result in many items with the same description.
  6. Users may be presented with information that helps them make sense of where the item belongs within a collection if the collection structure or arrangement is meaningful.
  7. Collection/context information applies to different types of collections including exhibitions and primary source sets.
  8. Users should be able to go to DPLA and find a collection that interests them without doing an item search.
  9. Users should be able to find similar materials related to a retrieved item by their membership in the same collection.

We then used the scenarios to guide us through the process of making recommendations for changes to DPLA metadata, workflow, and interface.

Recommendations for item and aggregate objects

Rather than write at length here about each of the areas of recommendations, I’d like to just address one of the areas of biggest discussion: Recommendations for representing objects (item vs. aggregate).

The question that drove this discussion was how data created about materials in the aggregate can be used in DPLA. A prime example of this kind of data is a folder-level description that an archive might create. In the past decade in particular this kind of practice has increased in the archival community, largely inspired by the landmark publication of More Product, Less Process. DPLA has increasingly gotten records from contributing institutions that reflect things like folders of materials rather than the individual items within them. In the archival community the finding aid, which contextualizes and describes an entire collection, is the norm. However, DPLA doesn’t just serve archives. We have a huge collection of books, films, reports, journals, etc., all of which are individual items. Our searching, indexing, and presentation designs are all based on the idea that each record corresponds to a single individual object. Since DPLA can’t just adopt an archival, collection-based description approach, the working group focused part of its efforts on thinking through how aggregate-level descriptions could be combined with the existing item-level paradigm.

The two solutions usually adopted when faced with the question of how to translate metadata for a folder to DPLA were basically either to create one description that described a bunch of items in the aggregate, or to create a bunch of really minimal records for each item reusing similar data in each one. Either of these approaches might be best in particular situations. For example, in specific cases of unique visual materials an institution may want to opt for lots of individual items with minimal metadata. Even though the metadata is similar for each, this would allow the visual material to be discretely findable.

On the other hand, the search experience of seeing record after record for textual materials with indistinct images that are virtually identical would not suit the majority of those types of materials. In this case, the experience of finding a basic, high-level description of a folder and then following the link back to the originating institution to examine the materials in depth seemed to be the best fit.

The working group actually doesn’t want to recommend one style of description over another. Instead we want to work with the kinds of descriptive practices that professionals are already using. We want DPLA to fit into the accepted professional practice, not create yet another new approach that may or may not be adopted. We think that an infrastructure flexible enough to encompass both aggregated objects and item-level objects, while also communicating relationships between materials, will best serve the user scenarios we came up with. Furthermore, we wanted to rely on the judgement of the librarians, archivists, and other professionals on the best way to describe and provide access to their own materials rather than dictate something to them.

In both of these types of description, though, the working group members agreed that the addition of collection titles, descriptions, and links back to collection-level descriptions or home pages at the original institution would help greatly in communicating what these objects are to their audience. The other sections of the whitepaper go into detail on how that collection-level information can be gathered, stored, and displayed effectively in DPLA. Combined with the flexible approach to item description described above, the working group felt that these changes would best achieve the goals of the user scenarios.

Recommendations have context too

It’s important to remember that this and the other recommendations in the whitepaper are for DPLA, in other words, they pertain to the handling of collection and aggregate-object metadata in a heterogeneous, large-scale aggregated environment. They should not be read as recommendations for every cultural heritage institution everywhere. Those submitting data to DPLA would need to publish it in a way that we could use, but within their own context, their own repository or website, it may be best served by being put up another way. I would encourage anyone involved in a DPLA contributing institution or interested in metadata aggregation overall to read the whitepaper and think about how these recommendations might or might not fit in their own institution.

It’s also important to remember that these are recommendations only. DPLA is in the process of implementing a number of them, but some have turned out to be infeasible or are affected by other DPLA initiatives. In particular, recommendations around representing objects and process are being implemented, but those around creating and sharing data and user interface have been refined. Another working group working on overall revisions to DPLA’s metadata application profile is suggesting further refinements of collection data, and the interface is being worked out through an overall DPLA website redesign. In the end, I feel like the spirit of the recommendations will definitely be adopted, but with a few tweaks.

“What is a Rights Statement?” Module Available!

The PA Digital Metadata Rights Subgroup Team is excited to present the second of three video modules on copyright and rights statements. The second module, “What is a Rights Statement?” provides an overview of rights statements and their application for digitized cultural heritage collections. The video covers what rights statements are and are not, the history of RightsStatements.org, and an overview of the statements. It also covers licenses, a comparison of rights statements and licenses, the benefits of normalized rights statements, the challenges and risks, and sources for more information.

Look for our next video, “Implementing RightsStatements.org” in December.
If you have any questions, please feel free to email info@padigital.org or visit padigital.org.

Metadata Cleanup Made Easy with OpenRefine

This is Gabe Galson. I work here at Temple University on the PA Digital Metadata team and would like to share with you some tricks of the trade. Shhhh! These are the exclusive Metadata Cleanup secrets the pros don’t want you to know about. Field value normalization life hacks that, after this exclusive blog post, will go back into the PA Digital Vault forever.

Are you frustrated because your local repository doesn’t feature expandable and collapsible facets that can be sorted either alphabetically or by frequency of occurrence, enabling effortless detection of slightly divergent values? Don’t be. Now there’s OpenRefine.

OpenRefine is a powerful metadata cleanup tool that allows you to replicate our PA Digital aggregator’s key functionalities on any standard computer. All you need is an export of your data in a tabular format, that is to say, your metadata as an Excel, tab separated value [TSV], or comma separated value [CSV] file. Refine will ‘Hoover’ up such data, display it as a spreadsheet, then allow you to view any field’s constituent values via a facet box, as in our aggregator. When sorted alphabetically the facets Refine can generate will allow you to eyeball inconsistent values. Check out ‘Philadelphia (Pa.)’ vs ‘Philadelphia, (PA)’ in this screenshot of a Refine facet:

[Screenshot: OpenRefine facet showing ‘Philadelphia (Pa.)’ and ‘Philadelphia, (PA)’ as separate values]

From there the values can be quickly standardized.
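For readers who like to see the idea in code, here is a rough sketch in Python of what an alphabetically sorted facet computes: each distinct value in a column, with a count of how often it occurs. This is not OpenRefine's own code; the `text_facet` helper and the sample values are hypothetical.

```python
from collections import Counter


def text_facet(values):
    """Return (value, count) pairs sorted alphabetically, ignoring case."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda pair: (pair[0].casefold(), pair[0]))


# Hypothetical place-name column from a metadata export.
places = [
    "Philadelphia (Pa.)",
    "Pittsburgh (Pa.)",
    "Philadelphia, (PA)",
    "Philadelphia (Pa.)",
]

for value, count in text_facet(places):
    print(f"{count:>3}  {value}")
```

Sorting alphabetically puts near-duplicates next to each other, which is what makes inconsistent values easy to spot by eye.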

Refine’s clustering functionality will pull from a dataset of any size slightly divergent values that may not be obvious from a simple browse. These values can then be quickly standardized en masse from within the clustering tool. All 145 Philadelphia variants detected in the screenshot below can be standardized with a single click. Wow!

[Screenshot: OpenRefine clustering tool listing the detected Philadelphia variants]
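One of the clustering methods Refine offers is key collision with a "fingerprint" key: values that reduce to the same normalized key are grouped together. The Python sketch below is a simplified approximation of that idea, not Refine's actual implementation; the `fingerprint` and `cluster` helpers and the sample values are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a simplified fingerprint key for a value.

    Steps: trim, lowercase, strip accents, drop punctuation,
    then sort and de-duplicate the remaining tokens.
    """
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values that share a fingerprint key."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample values.
places = ["Philadelphia (Pa.)", "Philadelphia, (PA)", "Pittsburgh (Pa.)"]
print(cluster(places))
```

Each cluster is a candidate for merging into a single preferred value; a person still decides which form wins.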

Traditional spreadsheet programs don’t make it easy to work with multiple values contained in a single cell.  With Refine it’s a breeze. Assuming the values are separated by a single delimiter one can split each value into its own individual row, then fold each row back into the original record when cleanup is complete.

Go from this…

Refine blog

to this…

Refine blog1

… and then back again whenever you want!
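
The round trip above can be sketched in a few lines of Python. This is only an illustration of the split-then-rejoin idea, assuming a semicolon delimiter; the record IDs, the field name, and the subject values are invented for the example.

```python
def split_rows(records, field, sep=";"):
    """Return one (record_id, value) row per delimited value in the field."""
    rows = []
    for rec_id, record in records.items():
        for part in record[field].split(sep):
            rows.append((rec_id, part.strip()))
    return rows

def join_rows(rows, sep="; "):
    """Fold the rows back into one delimited string per record."""
    joined = {}
    for rec_id, value in rows:
        joined.setdefault(rec_id, []).append(value)
    return {rec_id: sep.join(values) for rec_id, values in joined.items()}

# Hypothetical records with a multi-valued subject field.
records = {
    "rec1": {"subject": "Parades; Memorial Day; Veterans"},
    "rec2": {"subject": "Swimming"},
}

rows = split_rows(records, "subject")
print(rows)             # one row per subject, ready for cleanup
print(join_rows(rows))  # folded back into one cell per record
```

Because each value keeps its record ID while it is split out, nothing is lost when the rows are joined again after cleanup.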

But wait, there’s more! Refine also sports an array of other features useful to anyone with messy data on their hands. It lets you structure unstructured data, converting text blocks into spreadsheets. It allows you to stack multiple facets to your heart’s content, star individual records of interest and then facet on the star marker, or isolate a particular subset of your data and perform complex operations on it alone. Refine offers a robust undo/redo interface, allowing you to test out complex transformations without risk. It lets you integrate regular expressions, GREL commands, and Python scripts into your basic cleanup operations, making it extremely flexible and powerful. GREL (the Google Refine Expression Language, since renamed the General Refine Expression Language) is a simple programming language native to Refine. It is no more complicated than Excel’s formulas, and it lets those with no coding experience perform fairly sophisticated data cleanup operations. Refine will also allow you to populate columns with data called from websites or APIs. For example, the Google Maps API can be called to return the geographic coordinates corresponding to individual street addresses found within your dataset.
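
As a taste of what a regular expression plus a little Python can do, here is a hedged sketch that rewrites place names such as ‘Philadelphia, (PA)’ into the ‘Philadelphia (Pa.)’ pattern. The pattern and function name are illustrative assumptions, and the code is written as ordinary Python 3 to be run outside Refine; Refine’s built-in Python support runs on Jython, so the same logic would need small adjustments there.

```python
import re

# A place name followed by a two-letter state code in parentheses,
# tolerating a stray comma, mixed case, and a missing trailing period.
PLACE = re.compile(r"^\s*(.+?)\s*,?\s*\(\s*([A-Za-z]{2})\.?\s*\)\s*$")

def normalize_place(value):
    """Rewrite 'City, (ST)' style values as 'City (St.)'."""
    match = PLACE.match(value)
    if not match:
        return value  # leave anything unexpected untouched
    city, state = match.groups()
    return f"{city} ({state.capitalize()}.)"

# Hypothetical sample values.
for raw in ["Philadelphia, (PA)", "Philadelphia (Pa.)", "Unknown"]:
    print(raw, "->", normalize_place(raw))
```

Values that do not fit the pattern are returned unchanged, which keeps a transformation like this safe to try on a whole column.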

As you can see, Refine, which is open source, free to download, and fairly easy to pick up, will put many of our aggregator’s functions at your fingertips, allowing you to independently prepare your data for ingestion into the DPLA.

If you’d like to learn more about OpenRefine feel free to sign up for the PA Digital team’s upcoming in-person Metadata Anonymous workshop, which will feature a hands-on, in-depth intro to Refine that will take you from zero to 100 in no time at all! All of the operations illustrated or described in this blog post will be covered. Sign up here if you’re interested!  

If you want to dive right in, take a look at this index of OpenRefine resources on the web. 

OpenRefine Resources:

OpenRefine wiki:

https://github.com/OpenRefine/OpenRefine/wiki

Overview of GREL syntax:

https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Expressions

Comprehensive index of GREL functions by type:

https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

OpenRefine listserv/discussion group. If I have an issue I can’t solve on my own, I ask for help here. Many advanced Refine users monitor the board and are usually happy to help.

https://groups.google.com/forum/#!forum/openrefine

Regex cheat sheet aimed at OpenRefine users:

http://arcadiafalcone.net/GoogleRefineCheatSheets.pdf

Good basic OpenRefine introduction:

https://casci.umd.edu/wp-content/uploads/2013/12/OpenRefine-tutorial-v1.5.pdf

Another good OpenRefine introduction:

http://www.meanboyfriend.com/overdue_ideas/wp-content/uploads/2014/11/Introduction-to-OpenRefine-handout-CC-BY.pdf

Another good introduction:

http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial

Treasure trove of advanced OpenRefine recipes:

http://keithmaguire.net/

The recipes found on this page are especially useful to Refine novices:

http://keithmaguire.net/articles/open-refine-short-recipes.html

The “OpenRefine for Metadata Cleanup” PDF that can be found on this page is an excellent OpenRefine tutorial that includes many useful recipes, including the date transformations featured in our own cookbook:

https://www.orbiscascade.org/workshop-4-metadata-cleanup

Book: Verborgh, R., & De Wilde, M. (2013). Using OpenRefine: The essential OpenRefine guide that takes you from data analysis and error fixing to linking your dataset to the Web.

http://www.worldcat.org/oclc/859384483

https://www.amazon.com/Using-OpenRefine-Ruben-Verborgh/dp/1783289082/ref=sr_1_1?s=books&ie=UTF8&qid=1492262092&sr=1-1&keywords=using+open+refine

Free Your Metadata OpenRefine Reference:

Site: http://freeyourmetadata.org/

Book: http://book.freeyourmetadata.org

Especially for Archivists: Chaos → Order is a great blog about data manipulation and cleanup tools for archival data. The authors have a few very useful posts about OpenRefine, especially on dealing with dates and duplicate subjects.

Blog: https://icantiemyownshoes.wordpress.com/

Refine posts: https://icantiemyownshoes.wordpress.com/tag/openrefine/

Especially for Archivists: The Bentley Historical Library at the University of Michigan maintains an excellent blog about their experiences integrating ArchivesSpace, Archivematica, and DSpace. The authors have a few very interesting posts about using OpenRefine and Python to clean their EAD files.

Blog: http://archival-integration.blogspot.com/

Refine posts: http://archival-integration.blogspot.com/search/label/OpenRefine

Regular expression (regex) tutorials:

http://www.regular-expressions.info

 

Copyright 101 Module Available!

The PA Digital Metadata Rights Subgroup Team is excited to present the first of three video modules on copyright and rights statements. The first module, “Copyright 101,” provides a basic introduction for library and information professionals considering copyright and rights issues in digitized cultural heritage collections.

Copyright module 1 - screenshot

Look for our next videos, “What is a Rights Statement?” in early November and “Implementing RightsStatements.org” in early December.

If you have any questions, please feel free to email info@padigital.org or visit padigital.org.