Monday, February 20, 2012

Cleaning Data with Google Refine and Visualization with Fusion Tables

Introduction
Google Refine is a powerful faceted browser for cleaning up data from large data sets that might contain errors and connecting that data to external databases (known as reconciliation). As an example this article will go through the steps of creating a data visualization using Google Maps to show the locations of universities for a list of graduate students, below is the final product. This process involves using many of the main data cleaning and mining features of Google Refine.
Data Clean Up
Once Google Refine is downloaded and unzipped, it runs in a browser at http://127.0.0.1:3333 and does not require installation. Google Refine can handle large data sets and the data can be imported into Google Refine using many common formats, like delimited data (e.g. comma or tab-delimited), Excel, JSON, and XML. To create the project, the list of students and their universities were imported as a tab-delimited file on the "Create Project" screen; Google Refine automatically determined that the data was tab-delimited, and it assumed that the first row included column names. Google Refine provides a preview window where users can change import settings, such as delimiters, selecting column headings, etc.You can also do some data clean up here by ignore comment rows that might be at the top of the data file.
Each column in Google Refine has a drop-down menu for filtering (known as faceting), sorting, and reconciling. It should be noted that during this entire process, it is easy to undo any changes using the "Undo / Redo" tab in the left hand sidebar. Options for reconciling are in the "Reconcile" menu shown below and the first step is to select "Start reconciling..." in the submenu.
This brings up a new window where pre-existing reconciliation services can be selected or new services can be added. Google Refine by default includes two Freebase-based services that are useful for more powerful data mining; these were used to get the latitude and longitude locations for the universities. Google Refine, again, tries to automatically determine what type of data is in the column; in this case it guessed correctly, and the first option is "College/University" with options like "Educational Institution", also, being in the list.
Once the reconciliation is done, several changes happened in the interface. Under the "Graduate University" column name, a green bar appeared to give a visual cue as to how many universities were matched. Within the column, universities that were correctly matched were highlighted in blue. Those that were not matched correctly had possible options below them with a check mark box to select the option; there was also a double check mark box that repeated the option for all identical entries.
Two new boxes then appeared in the sidebar; these are the facets (filters). The judgment box showed counts for the students who had blank entries, had their universities matched, or where there was no match. The second box was a histogram that visualizes the scores for the matching. Clicking on the "(blank)" entry in the judgment box causes only the rows with a blank entry to be shown; multiple selections can be made using the "include" option that appears on mouse over.

For the initial data cleanup, the students with blank entries were deleted by first clicking on the "(blank)" entry. Then the deleting was done in the data display window by clicking on the "Remove all matching rows" option in the "Edit rows" submenu of the "All" column drop-down menu. In the example case, this left about 30 rows not being reconciled because the university name contained a secondary affiliation acronym. These acronyms needed to be removed and reconciliation needed to be redone. There are two ways of editing the cell values: 1) Mousing over a cell in the data display brings up an edit menu to change the cell value (shown below); all cells with matching contents can be changed and 2) a transformation can be done on all the cells at once.
This example used the second option. The acronyms always appear in the form "acronym / university name", so a single transformation was done on the cells. In the "Graduate University" drop-down menu there is an the option to "Edit cells" with the submenu option "Transform..." that brings up the window shown below where the Google Refine Expression Language (GREL) is used to transform the cell values; Clojure and Jython-based expressions can also be used if you need to clean up data in a more advanced manner. The window has two main sections: 1) a text area to input an expression and 2) a preview section to see how the column will be transformed; this section also gives tabs to see past expressions.
By default, the expression is "value"; "value" is the current value of the cell, and expressions are applied to this value. To select out the university name, the "match" function was used. The match function works by using regular expressions to match some part of the string; all the accessible GREL functions can be found here. Below is the expression that will pull out just the university name.
value.match(/.*\/\s(.*)/)[0]
Parentheses surround the part of the cell value to be retrieved. Multiple selections can be made, which are returned as an array. Here, only one selection is made so it is array entry "0". After the transformation, the reconciliation was repeated, which brought down the unmatched count to 5. Another way to reconcile is to do a manual search; this is the option that appears below the check mark options and was used to eliminate the remaining few erroneous entries.

Retrieving Latitude and Longitude
With the universities cleaned up, the next step was to retrieve the latitudes and longitudes by adding columns with this information from Freebase.
Selecting the "Add columns from Freebase ..." option in the screenshot above brings up a window where various pieces of information can be pulled back; for this example, first, GeoLocation >> Longitude was selected, and then, this was repeated with Latitude. Some of the entries were missing, so, for simplicity, these rows were removed. To do this, the first step was to create facet to identify the blank entries. Facets are created from the column drop-down menu "Facet", then "Text Facet" for either the latitude or longitude columns (this assumes that the absence of one means the absence of the other). Creating this facet will create a new box in the sidebar. The last option is the "(blank)" option used to select only the blank entries and these rows were deleted using the option from the "All" column, as before. Lastly, at the top of the interface is an option to export the results into several formats; "Tab-separated value" was selected because it is easily imported into other software.

Generating a Google Map
Visual representation of data can be very useful when dealing with large data sets; here is a technique for visualizing data on Google Maps. The university locations were placed onto a Google Map using Google Fusion Tables; this is a beta feature of Google Docs. Creating a new table brings up an import dialog. Once the import is complete, Fusion Tables will have a Visualize menu at the top with a visualize as Map option; this is shown below. By selecting this option, Fusion Tables should then be able to automatically take the latitude and longitude columns and render them on a Google Map for a nice visualization; this is the first figure of the article. The map can then be exported as KML or embedded in another webpage.

No comments: