
Chapter 9

Data Wrangling Tools


• Tools for data wrangling span a number of dimensions, from general-
purpose programming languages, to commodity spreadsheet
applications, to visual transformation and profiling products.

• There are easily dozens of tools in each category, but we’re going to
focus on three: Excel, SQL, and Trifacta Wrangler.


Data Size and Infrastructure
• The first two characteristics of data wrangling tools, supported data
size and required infrastructure, are very closely related.

• After all, you wouldn’t want to use a desktop application to wrangle
terabytes or petabytes of data; imagine how slowly your computer
would run!

• If your total data size is only a few megabytes, investing in a big data
distributed-processing platform like a Hadoop cluster would be a
massively wasteful use of computing power and budget.
• So generally, smaller data corresponds to smaller infrastructure needs
and bigger data corresponds to bigger infrastructure needs.

• Excel is an application designed to run on a personal computer.

• SQL is typically deployed on a centralized infrastructure consisting of
one or more networked servers.

• Excel is primarily used on small- to medium-sized data.

• SQL is used on production transaction datasets up to the multiterabyte
range.


• Trifacta Wrangler supports transforming data of various sizes, from
megabytes to petabytes, by running on either a Hadoop cluster or on
a single server.

• Trifacta Wrangler’s execution environment is determined
automatically at runtime based on data volume and the logical
complexity of the transformations.
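The size-to-infrastructure rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name and both thresholds are assumptions chosen for the example, not limits documented for Excel, SQL, or Trifacta Wrangler, and it does not reproduce how Trifacta actually picks its execution environment.

```python
# Hypothetical thresholds, chosen only for illustration. Real limits depend
# on hardware, tool versions, and the complexity of the transformations.
DESKTOP_LIMIT_BYTES = 100 * 1024**2   # assumed: ~100 MB is comfortable on a desktop
SINGLE_SERVER_LIMIT_BYTES = 1024**4   # assumed: ~1 TB is the ceiling for one server


def suggest_infrastructure(data_size_bytes: int) -> str:
    """Map a dataset size to a rough infrastructure tier."""
    if data_size_bytes < 0:
        raise ValueError("data size cannot be negative")
    if data_size_bytes <= DESKTOP_LIMIT_BYTES:
        return "desktop application"
    if data_size_bytes <= SINGLE_SERVER_LIMIT_BYTES:
        return "single server"
    return "distributed cluster"


if __name__ == "__main__":
    for size in (5 * 1024**2, 50 * 1024**3, 3 * 1024**5):
        print(size, "->", suggest_infrastructure(size))
```

A real decision would also weigh transformation complexity, which this sketch ignores.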
Data Structures - Excel
• Excel lays data out as a grid; the grid need not be rectangular or
completely filled.

• Often, people include multiple tables in a single Excel grid, mix
descriptive text with data, or embed graphics within their
spreadsheets.

• All of these data structures roughly conform to the constraints of the
grid, but are not strictly rectangular or consistent.
• Within each cell of the grid, Excel supports a wide variety of value
types, from numbers and percentages to dates and times.

• Given the level of heterogeneity that can be present in an Excel
dataset, a single cell is the most important data element in an Excel
spreadsheet.
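One way to picture a cell-centric grid is as a sparse mapping from (row, column) positions to values of mixed types. The sketch below is a plain-Python model, not Excel's internal representation; the names `grid`, `cell`, and `row_widths` and all the sample values are made up for illustration.

```python
from datetime import date

# A sparse grid: keys are (row, column) positions, values are cell contents.
# Positions missing from the dict are simply empty cells.
# All values below are invented sample data.
grid = {
    (0, 0): "Quarterly sales report",  # descriptive text mixed in with data
    (2, 0): "Region", (2, 1): "Revenue", (2, 2): "Growth", (2, 3): "Updated",
    (3, 0): "North", (3, 1): 1250.0, (3, 2): 0.12, (3, 3): date(2024, 1, 31),
    (4, 0): "South", (4, 1): 980.5,    # growth and date left unfilled
}


def cell(grid, row, col):
    """Return the value at (row, col), or None if the cell is empty."""
    return grid.get((row, col))


def row_widths(grid):
    """Count the filled cells in each row that has at least one value."""
    widths = {}
    for (row, _col) in grid:
        widths[row] = widths.get(row, 0) + 1
    return widths


if __name__ == "__main__":
    print(cell(grid, 3, 3))   # a date value
    print(cell(grid, 4, 2))   # an empty cell
    print(row_widths(grid))   # rows have different numbers of filled cells
```

Because every value is addressed individually, nothing forces the rows to share a length or a type, which is the heterogeneity the bullets describe.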
Data Structures - SQL
• SQL expects datasets to be constructed as a set of records, in which
every record contains the same set of fields.

• Any dataset that you decide to wrangle using SQL must be rectangular
and must also conform to a specific schema.

• As with cells in Excel, the record fields in SQL can have a variety of
types.

• Different versions of SQL support different field types, but the basic
set of dates, times, strings, and numbers is universal.
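The fixed-schema requirement can be seen with Python's built-in `sqlite3` module, used here only as a convenient stand-in for a SQL database. The `orders` table, its columns, and the sample rows are hypothetical. Note that SQLite is more permissive about value types than many other SQL systems and stores the date as text, so this sketch only demonstrates that every record must supply the same set of fields.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema fixes the set of fields that every record must have.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, customer TEXT, order_date TEXT, amount REAL)"
)

# Rectangular data: each record supplies all four fields (made-up values).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "Ada", "2024-01-15", 19.99), (2, "Grace", "2024-01-16", 5.0)],
)


def try_insert_short_record(conn):
    """Attempt to insert a record with only two of the four fields."""
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (3, "Linus"))
    except sqlite3.OperationalError:
        return False  # rejected: the record does not match the schema
    return True


rows = conn.execute(
    "SELECT order_id, customer, order_date, amount FROM orders ORDER BY order_id"
).fetchall()

if __name__ == "__main__":
    print(rows)
    print("short record accepted?", try_insert_short_record(conn))
```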
Data Structures - Trifacta Wrangler
• Trifacta, unlike Excel and SQL, can handle structured, semistructured,
and unstructured data.

• When working in Trifacta, data does not need to be explicitly broken
down into rows and columns or fully populated.

• Like the other two tools, Trifacta supports a variety of different data
types, from the most basic integers, strings, and Booleans, to more
complex custom types like dates, US states, and phone numbers.
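To make the idea of a custom type such as "US state" or "phone number" concrete, the following Python sketch checks values against simple rules and reports what fraction of a column conforms. It is not Trifacta's implementation: the validator names, the deliberately partial list of state codes, and the phone-number pattern are all assumptions made for the example.

```python
import re

# Deliberately partial set of state codes, for illustration only.
US_STATE_CODES = {"CA", "NY", "TX", "WA"}

# Assumed pattern: 10-digit US numbers such as 415-555-0100 or (415) 555-0100.
PHONE_PATTERN = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")


def is_us_state(value) -> bool:
    """True if value is a two-letter code from the sample set above."""
    return isinstance(value, str) and value.strip().upper() in US_STATE_CODES


def is_phone_number(value) -> bool:
    """True if value looks like a 10-digit US phone number."""
    return isinstance(value, str) and PHONE_PATTERN.match(value.strip()) is not None


def valid_fraction(values, validator) -> float:
    """Fraction of values accepted by validator (0.0 for an empty column)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if validator(v)) / len(values)


if __name__ == "__main__":
    states = ["CA", "ny", "Ontario", None]
    phones = ["(415) 555-0100", "415.555.0100", "555-0100"]
    print(valid_fraction(states, is_us_state))
    print(valid_fraction(phones, is_phone_number))
```

Treating a type as a validation rule rather than a storage format is what lets a column contain missing or nonconforming values and still be profiled.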
Transformation Paradigms
