This is Gabe Galson. I work here at Temple University on the PA Digital Metadata team and would like to share with you some tricks of the trade. Shhhh! These are the exclusive Metadata Cleanup secrets the pros don’t want you to know about. Field value normalization life hacks that, after this exclusive blog post, will go back into the PA Digital Vault forever.
Are you frustrated because your local repository doesn’t feature expandable and collapsable facets that can be sorted either alphabetically or by frequency of occurrence, enabling effortless detection of slightly divergent values? Don’t be. Now there’s OpenRefine.
OpenRefine is a powerful metadata cleanup tool that allows you to replicate our PA Digital aggregator’s key functionalities on any standard computer. All you need is an export of your data in a tabular format, that is to say, your metadata as an Excel, tab separated value [TSV], or comma separated value [CSV] file. Refine will ‘Hoover’ up such data, display it as a spreadsheet, then allow you to view any field’s constituent values via a facet box, as in our aggregator. When sorted alphabetically the facets Refine can generate will allow you to eyeball inconsistent values. Check out ‘Philadelphia (Pa.)’ vs ‘Philadelphia, (PA)’ in this screenshot of a Refine facet:
From there the values can be quickly standardized.
Refine’s clustering functionality will pull from a dataset of any size slightly divergent values that may not be obvious from a simple browse. These values can then be quickly standardized en masse from within the clustering tool. All 145 Philadelphia variants detected in the screenshot below can be standardized with a single click. Wow!
Traditional spreadsheet programs don’t make it easy to work with multiple values contained in a single cell. With Refine it’s a breeze. Assuming the values are separated by a single delimiter one can split each value into its own individual row, then fold each row back into the original record when cleanup is complete.
Go from this…
… and then back again whenever you want!
But wait, there’s more! Refine also sports an array of other features useful to anyone with messy data on their hands. It lets you structure unstructured data, converting text blocks into spreadsheets. It allows you to stack multiple facets to your heart’s content, star individual records of interest then facet on the star marker, or isolate a particular subset of your data and perform complex operations on only it. Refine offers a robust undo/redo interface, allowing you to test out complex transformations without risk. It lets you integrate regular expressions, GREL commands, and Python script into your basic cleanup operations, making it extremely flexible and powerful. GREL –the Google Refine Expression Language, a simple programming language native to Refine– is no more complicated than Excel’s formulas; it lets those with no coding experience perform fairly sophisticated data cleanup operations. Refine will also allow you to populate columns with data called from websites or APIs. For example the Google Maps API can be called to return the geo coordinates corresponding to individual street addresses found within your dataset.
As you can see, Refine, which is open source, free to download, and fairly easy to pick up, will put many of our aggregator’s functions at your fingertips, allowing you to independently prepare your data for ingestion into the DPLA.
If you’d like to learn more about OpenRefine feel free to sign up for the PA Digital team’s upcoming in-person Metadata Anonymous workshop, which will feature a hands-on, in-depth intro to Refine that will take you from zero to 100 in no time at all! All of the operations illustrated or described in this blog post will be covered. Sign up here if you’re interested!
If you want to dive right in, take a look at this index of OpenRefine resources on the web.
Overview of GREL syntax:
Comprehensive index of GREL functions by type:
OpenRefine listserve/discussion group. If I have an issue I can’t solve on my own I ask for help here. Many advanced Refine users monitor the board and will be happy to help in many cases.
Regex cheat sheet focusing on OpenRefine users:
Good basic OpenRefine introduction:
Another good OpenRefine introduction:
Another good introduction:
Treasure trove of advanced OpenRefine recipes:
The recipes found on this page are especially useful to Refine novices:
The “OpenRefine for Metadata Cleanup” PDF that can be found on this page is an excellent openrefine tutorial that includes many useful recipes, including the date transformations featured in our own cookbook:
Book: Verborgh, R., & De, W. M. (2013). Using OpenRefine: The essential OpenRefine guide that takes you from data analysis and error fixing to linking your dataset to the Web.
Free Your Metadata OpenRefine Reference:
Especially for Archivists: Chaos —> Order is a great blog about data manipulation and cleanup tools for archival data. The authors have a few very useful posts about Open Refine, especially dealing with dates and duplicate subjects.
Refine posts: https://icantiemyownshoes.wordpress.com/tag/openrefine/
Especially for Archivists: The Bentley Historical Library at the University of Michigan maintains an excellent blog about their experiences integrating Archives Space, Archivematica and Dspace. The authors have a few very interesting posts about using Open Refine and Python to clean their EAD files.
Regular expression (regex) tutorials: