Metadata Cleanup Made Easy with OpenRefine

This is Gabe Galson. I work here at Temple University on the PA Digital Metadata team and would like to share with you some tricks of the trade. Shhhh! These are the exclusive Metadata Cleanup secrets the pros don’t want you to know about. Field value normalization life hacks that, after this exclusive blog post, will go back into the PA Digital Vault forever.

Are you frustrated because your local repository doesn’t feature expandable and collapsable facets that can be sorted either alphabetically or by frequency of occurrence, enabling effortless detection of slightly divergent values? Don’t be. Now there’s OpenRefine.

OpenRefine is a powerful metadata cleanup tool that allows you to replicate our PA Digital aggregator’s key functionalities on any standard computer. All you need is an export of your data in a tabular format, that is to say, your metadata as an Excel, tab separated value [TSV], or comma separated value [CSV] file. Refine will ‘Hoover’ up such data, display it as a spreadsheet, then allow you to view any field’s constituent values via a facet box, as in our aggregator. When sorted alphabetically the facets Refine can generate will allow you to eyeball inconsistent values. Check out ‘Philadelphia (Pa.)’ vs ‘Philadelphia, (PA)’ in this screenshot of a Refine facet:

blog image

From there the values can be quickly standardized.

Refine’s clustering functionality will pull from a dataset of any size slightly divergent values that may not be obvious from a simple browse. These values can then be quickly standardized en masse from within the clustering tool. All 145 Philadelphia variants detected in the screenshot below can be standardized with a single click. Wow!

Blog image2

Traditional spreadsheet programs don’t make it easy to work with multiple values contained in a single cell. With Refine it’s a breeze. Assuming the values are separated by a single delimiter one can split each value into its own individual row, then fold each row back into the original record when cleanup is complete.

Go from this…

Refine blog

to this…

Refine blog1

… and then back again whenever you want!

But wait, there’s more! Refine also sports an array of other features useful to anyone with messy data on their hands. It lets you structure unstructured data, converting text blocks into spreadsheets. It allows you to stack multiple facets to your heart’s content, star individual records of interest then facet on the star marker, or isolate a particular subset of your data and perform complex operations on only it. Refine offers a robust undo/redo interface, allowing you to test out complex transformations without risk. It lets you integrate regular expressions, GREL commands, and Python script into your basic cleanup operations, making it extremely flexible and powerful. GREL –the Google Refine Expression Language, a simple programming language native to Refine– is no more complicated than Excel’s formulas; it lets those with no coding experience perform fairly sophisticated data cleanup operations. Refine will also allow you to populate columns with data called from websites or APIs. For example the Google Maps API can be called to return the geo coordinates corresponding to individual street addresses found within your dataset.

As you can see, Refine, which is open source, free to download, and fairly easy to pick up, will put many of our aggregator’s functions at your fingertips, allowing you to independently prepare your data for ingestion into the DPLA.

If you’d like to learn more about OpenRefine feel free to sign up for the PA Digital team’s upcoming in-person Metadata Anonymous workshop, which will feature a hands-on, in-depth intro to Refine that will take you from zero to 100 in no time at all! All of the operations illustrated or described in this blog post will be covered. Sign up here if you’re interested!

If you want to dive right in, take a look at this index of OpenRefine resources on the web.

OpenRefine Resources:

OpenRefine wiki:

https://github.com/OpenRefine/OpenRefine/wiki

Overview of GREL syntax:

https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Expressions

Comprehensive index of GREL functions by type:

https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

OpenRefine listserve/discussion group. If I have an issue I can’t solve on my own I ask for help here. Many advanced Refine users monitor the board and will be happy to help in many cases.

https://groups.google.com/forum/#!forum/openrefine

Regex cheat sheet focusing on OpenRefine users:

http://arcadiafalcone.net/GoogleRefineCheatSheets.pdf

Good basic OpenRefine introduction:

https://casci.umd.edu/wp-content/uploads/2013/12/OpenRefine-tutorial-v1.5.pdf

Another good OpenRefine introduction:

http://www.meanboyfriend.com/overdue_ideas/wp-content/uploads/2014/11/Introduction-to-OpenRefine-handout-CC-BY.pdf

Another good introduction:

http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial

Treasure trove of advanced OpenRefine recipes:

http://keithmaguire.net/

The recipes found on this page are especially useful to Refine novices:

http://keithmaguire.net/articles/open-refine-short-recipes.html

The “OpenRefine for Metadata Cleanup” PDF that can be found on this page is an excellent openrefine tutorial that includes many useful recipes, including the date transformations featured in our own cookbook:

https://www.orbiscascade.org/workshop-4-metadata-cleanup

Book: Verborgh, R., & De, W. M. (2013). Using OpenRefine: The essential OpenRefine guide that takes you from data analysis and error fixing to linking your dataset to the Web.

http://www.worldcat.org/oclc/859384483

https://www.amazon.com/Using-OpenRefine-Ruben-Verborgh/dp/1783289082/ref=sr_1_1?s=books&ie=UTF8&qid=1492262092&sr=1-1&keywords=using+open+refine

Free Your Metadata OpenRefine Reference:

Site: http://freeyourmetadata.org/

Book: http://book.freeyourmetadata.org

Especially for Archivists: Chaos —> Order is a great blog about data manipulation and cleanup tools for archival data. The authors have a few very useful posts about Open Refine, especially dealing with dates and duplicate subjects.

Blog: https://icantiemyownshoes.wordpress.com/

Refine posts: https://icantiemyownshoes.wordpress.com/tag/openrefine/

Especially for Archivists: The Bentley Historical Library at the University of Michigan maintains an excellent blog about their experiences integrating Archives Space, Archivematica and Dspace. The authors have a few very interesting posts about using Open Refine and Python to clean their EAD files.

Blog: http://archival-integration.blogspot.com/

Refine posts: http://archival-integration.blogspot.com/search/label/OpenRefine

Regular expression (regex) tutorials:

http://www.regular-expressions.info

Metadata Cleanup Made Easy with OpenRefine

Published by michaeljcarroll on October 31, 2017October 31, 2017

OpenRefine Resources:

Like this:

0 Comments

Leave a ReplyCancel reply

Miscellaneous

PA Digital News: Aggregator Updates

Education

Introducing PA Digital Primary Source Sets

Resources

The Importance of Fair Use and Standardized Rights Statements for Digital Cultural Heritage

Metadata Cleanup Made Easy with OpenRefine

Published by michaeljcarroll on October 31, 2017October 31, 2017

OpenRefine Resources:

Share this:

Like this:

0 Comments

Leave a ReplyCancel reply

Related Posts

Miscellaneous

PA Digital News: Aggregator Updates

Education

Introducing PA Digital Primary Source Sets

Resources

The Importance of Fair Use and Standardized Rights Statements for Digital Cultural Heritage

Discover more from PA Digital