
Tools and Important Considerations of Working with Data in Excel (a draft handout)


Drawbacks of using Excel for data analysis

Saving your steps: unlike statistical software such as SAS, Stata, R, you can't keep track of the modifications you make to your data in Excel.

Ambiguous algorithms: statistical algorithms may not behave correctly or as expected, or the results may be hard to interpret.

Zeros and N/A: some formulas won't work with zeros or blanks as expected, you'll have to decide what to do about missing values.
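
To make the zeros and blanks issue concrete, here is a minimal Python sketch. The function names and sample values are hypothetical and not part of this handout. It mirrors the fact that Excel's AVERAGE function skips blank cells but counts zeros, so how a missing value is recorded changes the result.

```python
def mean_ignoring_blanks(values):
    """Average the recorded numbers only; None stands for a blank cell."""
    numbers = [v for v in values if v is not None]
    return sum(numbers) / len(numbers) if numbers else None


def mean_blanks_as_zero(values):
    """Average after treating every blank cell as a zero."""
    numbers = [0 if v is None else v for v in values]
    return sum(numbers) / len(numbers) if numbers else None


# Hypothetical survey answers: one respondent skipped the question (None),
# another genuinely answered 0.
hours = [2, None, 4, 0]

print(mean_ignoring_blanks(hours))  # (2 + 4 + 0) / 3 = 2.0
print(mean_blanks_as_zero(hours))   # (2 + 0 + 4 + 0) / 4 = 1.5
```

Neither result is "the right one": the choice depends on what a blank means in your data, which is why it is worth writing the decision down.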

Selected Research Data Management Practices and Excel


Survey design stage
Use text boxes with caution and specify units (when applicable). A text box that asks respondents to "Enter the number of hours spent…" might get
you responses such as "1 hour and a half", "1.5 hours", "1.5", "90 minutes", which you'll find painful to recode and standardize.
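
If you do end up with free-text answers like these, the recoding can be scripted rather than done by hand. The sketch below is only an illustration: the function name `parse_hours` and the handful of patterns it recognizes are assumptions, and it treats a bare number as hours. Anything it cannot recognize is returned as None so it can be reviewed manually.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"


def parse_hours(text):
    """Convert a free-text duration to hours (float), or None if unrecognized."""
    s = text.strip().lower()

    # "1 hour and a half"
    m = re.fullmatch(NUMBER + r"\s*hours?\s+and\s+a\s+half", s)
    if m:
        return float(m.group(1)) + 0.5

    # "1.5 hours", "1.5 h", or a bare "1.5" (assumed to mean hours)
    m = re.fullmatch(NUMBER + r"\s*(?:hours?|hrs?|h)?", s)
    if m:
        return float(m.group(1))

    # "90 minutes", "90 min"
    m = re.fullmatch(NUMBER + r"\s*(?:minutes?|mins?)", s)
    if m:
        return float(m.group(1)) / 60

    return None  # leave for manual review


for answer in ["1 hour and a half", "1.5 hours", "1.5", "90 minutes", "a while"]:
    print(repr(answer), "->", parse_hours(answer))
```

The four example responses above all come out as 1.5, while "a while" comes out as None.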

Initial preparation of your Excel file


 Save the original raw data somewhere safe using a meaningful filename.
 Create a working copy of your data in a standard Excel format (not a .csv, although you may want to archive your data in a
more universal format such as .csv at the end of your project) with a meaningful filename.
 Rename all worksheets in your working copy with short, meaningful names.
 Create a new "About" worksheet with a short description of your Excel file, including where the data comes from and
what can be found in the various worksheets. You can even add hyperlinks to your worksheets for easier navigation (click
on the cell where you want to insert a worksheet link, Insert ribbon, Links section, Hyperlink, select Place in this
document). Update this worksheet as necessary as you work on your project.

Data worksheet preparation (repeat for each data worksheet in your file)
 Create a "Notes" worksheet using a naming convention that helps preserve the link with the data worksheet (for example:
2021_survey_data and 2021_survey_data_notes). You can use this worksheet to summarize the contents of the data
worksheet, keep a written record of modifications made to the data worksheet, and/or put in metadata about data
variables.
 If applicable, move anything above or below the data table of each of your data worksheets to the "Notes" worksheet.
 If your column headings take up more than one row in your spreadsheet, collapse the headings so that there is only one
row with headings. This might mean concatenating an overarching category with each sub-heading and repeating it across multiple columns.
 Make sure that your columns have meaningful and distinct headings.
 Unmerge any merged rows. Repeat overarching categories in every row of the column rather than leaving rows blank.
 Resize columns to desired width, select the data table, and use Wrap Text (Home ribbon, Alignment section) to make the
contents of the cells display entirely on-screen.
 Use Freeze Panes to freeze your header row and any key columns in place (View ribbon, Window section).
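
The heading-collapsing and unmerging steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `collapse_headers` and `fill_down`, the `_` separator, and the sample values are hypothetical. It also assumes a blank cell in the top heading row belongs to the category to its left, which is how unmerged cells usually look.

```python
def collapse_headers(top_row, bottom_row, sep="_"):
    """Merge a two-row heading into one row, repeating each category."""
    headings = []
    category = ""
    for top, bottom in zip(top_row, bottom_row):
        if top:  # a new overarching category starts in this column
            category = top
        parts = [p for p in (category, bottom) if p]
        headings.append(sep.join(parts))
    return headings


def fill_down(values):
    """Repeat the last non-blank value in every blank row below it."""
    filled = []
    last = None
    for value in values:
        if value not in (None, ""):
            last = value
        filled.append(last)
    return filled


top = ["", "Income", "", "Expenses", ""]
bottom = ["id", "2020", "2021", "2020", "2021"]
print(collapse_headers(top, bottom))
# ['id', 'Income_2020', 'Income_2021', 'Expenses_2020', 'Expenses_2021']

print(fill_down(["East", "", "", "West", ""]))
# ['East', 'East', 'East', 'West', 'West']
```

A standalone column that sits to the right of a merged category would wrongly inherit that category here, so check the collapsed headings by eye.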

Data exploration preparation


 For each variable in your data table, make sure that units (if applicable) are either identified in the heading and/or in your Notes
worksheet (when all the data in the column is recorded in the same unit) or recorded in a separate column (when the units vary).

 Consider using a tool like OpenRefine to clean up your data. Otherwise, try the following steps:
o Use the TRIM function to remove extra spaces at the beginning or end of a cell.
o Create a new column at the end of your data table that can flag missing data. Use the COUNTA or COUNTBLANK
functions to investigate any counts that might be problematic with your dataset (the count will depend on your
data and the context – is it fine if some columns contain blank values?).
 If you want, you can 'color' any blank cell in your data table (Excel for Windows) by going to the Home
ribbon, Editing section, Find and Select, Go to Special, selecting blanks, and selecting a highlight color.
o Create a new column at the end of your table to flag any "silly" or problematic rows of data that you may need to
exclude from your analysis.
o Add a filter to your data table (click on a cell in your data table, go to Home ribbon, Editing section, Sort & Filter, Filter). Click
on the filter dropdown for each variable and look for any unusual values: look at the beginning and end of the list
of values, look for "N/A", look for silly answers.
 Examine your worksheet for "issues" identified in the Quartz guide to bad data – especially the issues flagged in the first
section (Issues that your source should solve).
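
For readers who prefer to script the trimming and missing-data steps above, here is a minimal Python sketch. The names `clean_cell`, `add_missing_count`, `n_missing` and the list of missing-value markers are all assumptions made for illustration. The cleaning step is only roughly like Excel's TRIM: it also collapses tabs and line breaks, which TRIM does not.

```python
# Assumption: these strings (after lowercasing) all mean "no answer".
MISSING_MARKERS = {"", "n/a", "na"}


def clean_cell(value):
    """Strip leading/trailing whitespace and collapse repeated spaces."""
    if value is None:
        return ""
    return " ".join(str(value).split())


def add_missing_count(rows, columns):
    """Return cleaned copies of rows with an extra 'n_missing' flag column."""
    flagged = []
    for row in rows:
        cleaned = {c: clean_cell(row.get(c)) for c in columns}
        cleaned["n_missing"] = sum(
            1 for c in columns if cleaned[c].lower() in MISSING_MARKERS
        )
        flagged.append(cleaned)
    return flagged


# Hypothetical sample rows.
rows = [
    {"id": " 1 ", "hours": "1.5", "city": "N/A"},
    {"id": "2", "hours": "", "city": " Halifax "},
]
for row in add_missing_count(rows, ["id", "hours", "city"]):
    print(row)
```

Rows with a non-zero `n_missing` are candidates for review, not automatic exclusion: whether a blank is acceptable depends on the column.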

Other considerations
 Resist any and all temptation to merge cells. Stay strong!
 If you run into any problems when using a formula or visualizing your data in Excel, double-check the format that has
been applied to the cells referenced by the formula/visualization.
 Follow the RDM practices shared in Brief guide: Research Data Management.
 Follow the practices shared in Data Carpentry's Data Organization in Spreadsheets for Social Scientists. The Formatting
Problems page is especially important to go over.
 Ask your data librarian for help if any of these steps seem like they might take a lot of manual work or a long time to do.

Learning resources

Data Carpentry: open lessons on working with data.


Excel workshop material: self-teaching Excel files from Dal's Data Librarian.
Learning resources for working with secondary datasets: Google spreadsheet containing a list of tools, tips, courses, and
horror stories for working with secondary data – some of these might also be useful for those working with their own data.
Library Carpentry: open lessons aimed at information science professionals.
OpenRefine: wonderful data cleaning tool – clean up your data, work with facets, save the resulting work in an Excel-compatible format.
To learn more and practice, try the Programming Historian OpenRefine exercise, look at this set of slides from a
(former) Data Librarian from Queen's University, or try the Data Carpentry OpenRefine for Social Science Data lesson
or the Library Carpentry OpenRefine lesson.
Unpivot Excel data: How to transform data from crosstab/wide format (useful for working with Tableau) to flat/tabular/long
format (useful for Excel pivot tables) in Excel.
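
As a rough picture of what unpivoting does, here is a minimal Python sketch. The function name `unpivot`, its parameters, and the sample data are hypothetical. It illustrates the wide-to-long idea only and is not the Excel procedure described in the linked resource.

```python
def unpivot(rows, id_columns, var_name="variable", value_name="value"):
    """Turn each non-identifier column of a wide row into its own long row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns:
                continue
            long_row = {c: row[c] for c in id_columns}
            long_row[var_name] = column
            long_row[value_name] = value
            long_rows.append(long_row)
    return long_rows


# Hypothetical crosstab: one row per region, one column per year.
wide = [
    {"region": "East", "2020": 10, "2021": 12},
    {"region": "West", "2020": 7, "2021": 9},
]
for row in unpivot(wide, ["region"], var_name="year", value_name="count"):
    print(row)
```

Two wide rows with two year columns each become four long rows, each holding one region, one year, and one count.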
