actually should be grouped together. After importing your data, you simply select edit cells --> cluster and edit and choose which algorithm you want to use. After Refine runs, you decide whether to accept or reject each suggestion. For example, you could say yes to combining Microsoft and Microsoft Corp., but no to combining Coach Inc. with CQG Inc. If it's offering too few or too many suggestions, you can change the strength of the suggestion function.
There are also numerical options that offer quick and easy overviews of data distributions. This functionality can reveal anomalies that might be the result of data input errors, such as $800,000 instead of $80,000 for a salary entry, or it could expose inconsistencies, such as differences in the way compensation data is reported from entry to entry, with some showing, say, hourly wages and others showing weekly pay or yearly salaries.
Beyond data housekeeping, Google Refine offers some useful analysis tools, such as sorting and filtering.
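To make the clustering step described earlier more concrete, here is a minimal Python sketch of one common approach, often called key-collision or "fingerprint" clustering. It is an illustration only and not necessarily what Refine runs internally; the function names `fingerprint` and `cluster` and the sample company names are assumptions made for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key (hypothetical helper)."""
    # Lowercase, trim, and decompose accented characters.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop the combining accent marks left by the decomposition.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, then sort the unique tokens.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = ["Microsoft", "microsoft", "Microsoft.", "Coach Inc.", "CQG Inc."]
clusters = cluster(names)
print(clusters)
```

Each cluster is only a suggestion for a person to accept or reject. Note that a strict key like this would not pair Microsoft with Microsoft Corp.; catching that kind of match needs a looser comparison, which is one reason a tool might offer more than one algorithm and an adjustable strength.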
What's cool:
Once you get used to which commands do what, this is a powerful tool for data manipulation and analysis that strikes a good balance between functionality and ease of use. The undo/redo list of every action you've taken lets you roll back when needed. And text functions handle Java-syntax regular expressions, allowing you to look for patterns (such as, say, three letters followed by two digits) as well as specific text strings and numbers. Finally, while this is a browser-based application, it works with files on your desktop, so your data remains local.
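As a rough illustration of the kind of pattern search mentioned above, the following Python sketch looks for a hypothetical code format of three letters followed by two digits. The pattern, the `find_codes` helper and the sample text are assumptions for this example; Refine itself uses Java-syntax regular expressions, though a pattern this simple is written the same way in both.

```python
import re

# Three ASCII letters followed by two digits, as a standalone token.
PATTERN = re.compile(r"\b[A-Za-z]{3}\d{2}\b")


def find_codes(text: str) -> list:
    """Return every substring of text that matches PATTERN."""
    return PATTERN.findall(text)


sample = "Orders ABC12 and xyz99 shipped; AB123 and ABCD12 did not."
matches = find_codes(sample)
print(matches)  # ['ABC12', 'xyz99']
```

The word boundaries (`\b`) keep the pattern from matching inside longer tokens such as ABCD12.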
Drawbacks:
Although Google Refine looks like a spreadsheet, you can't do typical spreadsheet calculations with it; for that, you must export to a conventional spreadsheet application. If you've got a large data set, carve out some time in your day to go through all of Refine's suggested changes, since it can take a while. And, depending on the data set, be prepared when looking for text items to merge: You're likely to get either a lot of false positives or missed problems, or both.
Skill level:
Advanced beginner. Knowledge of data analysis concepts is more important than technical prowess; power Excel users who understand data-cleaning needs should be comfortable with this.
Runs on:
Windows, Mac OS X (if it appears to do nothing after loading on a Mac, point a browser manually to http://127.0.0.1:3333/), Linux.
Learn more:
These three screencasts give a good overview of why and how you'd use Refine; there's also fairly detailed documentation on the Google Code project area.
Sometimes you need to combine graphical representation of your data with heftier numerical analysis.
The R Project for Statistical Computing
What it does:
R is a general statistical analysis platform (the authors call it an "environment") that runs on the command line. Need to find means, medians, standard deviations, correlations? R can handle that and much more, including "linear and generalized linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering and smoothing," according to the project website.