
Data wrangling: parsing delimited text reports

BAN 313 (Industrial analysis)


February 2020

In this week’s class you will be working with a report that was generated by the Enterprise Resource Planning
(ERP) system SAP. The type and meaning of the report is of lesser concern. Instead, the focus is on how to
import the data, and work with the different fields so that you can get to the analyses of the data. It would
be helpful if you have acquainted yourself with the Introduction to Importing Data in R course on DataCamp.
The data given to you in the RawData.txt file is a small excerpt of only 1,000 records. First, let's
make sure our workspace is clean, and we import the necessary libraries.
rm(list=ls())      # remove any objects still in the workspace
library(tidyverse) # attach the tidyverse suite of packages

## -- Attaching packages -------------------------------------------------------------------------------


## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Within the tidyverse suite we will mainly use the readr package in this course. If you do not have the
tidyverse library available, you may have to first install it.
install.packages("tidyverse")

The purpose of this document might be for you to create a sequence of tasks that is reproducible every time
you get a new report. So the first step is to get some idea of what the text file RawData.txt contains. This you
can do in a basic text editor like Notepad or Notepad++ on Windows, or BBEdit on a Mac. If you are
comfortable on the terminal/command line on a Mac or Linux/Unix-based system you can use the head or
cat commands.
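If you would rather stay within R, the base function readLines() gives the same quick preview:
readLines(con="../Data/RawData.txt", n=6)  # print the first six lines exactly as they appear in the file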
We see that the report title and date are in the first line. Not very interesting. Lines 2, 3 and 5 contain
horizontal separation lines. Also not very interesting. The column headers are in line 4:

Field            Description
Site             The store code as set up in SAP.
Del. Date        The date on which the ordered stock should be delivered.
Site description Store name.
PO No's          The number of purchase orders to be delivered on the delivery date.
Del No's         The total number of deliveries from the start of the financial year up to the delivery date.
HU No's          The total number of handling units (pallets or rolltainers, or both) represented by the order.
Org OB Del       The originally requested number of stock-keeping units (SKUs) for the order.
OB Del Qty       The number of available SKUs in stock to be delivered.
GI Qty           The number of items that were goods issued (shipped to store).
GR Qty           The number of items that were goods received by the store.

A couple of observations:
* the column delimiter is the symbol |;
* the second column seems to be a date; and
* the last column seems to only contain zeros, so we should check later whether it has any value.
Right, now let’s import the delimited text using the readr package, which is included in the tidyverse suite
of packages.
raw <- read_delim(file="../Data/RawData.txt", delim="|", col_names=FALSE, skip=5,
locale = locale(grouping_mark = ","), col_types=c("ccccnnnnnnnnc"))

## Warning: 2 parsing failures.


## row col expected actual file
## 844 -- 13 columns 1 columns '../Data/RawData.txt'
## 846 -- 13 columns 1 columns '../Data/RawData.txt'
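readr keeps a record of these parsing failures, and you can retrieve them with its problems() function:
problems(raw)  # inspect the rows that could not be parsed into 13 columns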
It is good practice to not assume R knows what argument you're passing. Instead, explicitly state each
argument name, and then its value. For example, don't assume that the first argument for read_delim is
file, but rather explicitly state the argument, file=.... It also improves the readability of your script so
that others can follow.
Also note the use of ./ to denote "in the current location", and ../ to denote "go back up one directory". So
the file argument file="../Data/RawData.txt" tells us that you have to go up one directory; there
is a folder called Data/, and in that folder you will find the file called RawData.txt. We refer to this as a
relative path, as it is relative to the current working directory.
The alternative to a relative path is an absolute path. This is where you provide the entire, complete path of
where the file is. On my MacBook that would be /Users/jwjoubert/workspace/ban313/.../RawData.txt
where the ... is all the intermediate folders as well. On a Windows machine the absolute path will be
something like C:/Program Files/.../RawData.txt.
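If you are ever unsure which directory a relative path is relative to, you can ask R directly:
getwd()  # prints the absolute path of the current working directory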
Remember, if you are not sure of the available arguments, or want to know more about a function, call its
help page using ? and the function name. So, for the function that reads delimited text, the call would be
the following.
?read_delim

Next it might be useful to just get an idea of what we’re dealing with in the data set. We can check out the
structure.
str(raw)

Or looking at the first few lines.


head(raw)

We see that the first and last columns only contain NA values. This is because there was a column delimiter
| at the start and end of each row, suggesting there should be values before the first, and after the last
delimiter. Since these are nonsensical, we can simply remove them.
clean <- raw[, -c(1,ncol(raw))]
head(clean)

A quick look at the bottom of the file suggests there are three lines with separator characters and column
totals. Since these are incomplete, we can remove these (and possibly other) observations (rows) with NA
values.
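You can confirm this from within R as well, before removing anything:
tail(clean, n=5)  # show the last five rows; the bottom three contain NA values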
clean <- na.omit(clean)

Now we are ready to rename the columns.


colnames(clean) <- c("site", "date", "descr", "pos", "del", "hu", "poQty", "obDel",
"obQty", "giQty", "grQty")

Next we want to convert the date column to an actual date type in R. But the format is again non-standard.
Luckily there is a useful function where we can specify a variety of custom formats. Make sure to have a look
at its help function.
clean$date <- as.POSIXlt(x=clean$date, tz="Africa/Johannesburg", format="%d.%m.%Y")
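As a quick sanity check of the format string, here is a sketch with a single made-up value; the real data
uses the same day.month.year pattern:
# "%d.%m.%Y" reads day.month.4-digit-year, so "28.01.2018" is 28 January 2018
as.POSIXlt(x="28.01.2018", tz="Africa/Johannesburg", format="%d.%m.%Y")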

How did we know to use Africa/Johannesburg as the timezone description? Again, read the help function
and check out the link it provides. If you don't provide the timezone argument, R will assume that you
are using Greenwich Mean Time (GMT), and this may prove troublesome when you work with data from
multiple timezones. If we now look at the structure of our clean data set, we see that we have the correct
variable types.
str(clean)

## Classes 'tbl_df', 'tbl' and 'data.frame': 843 obs. of 11 variables:


## $ site : chr " GC02" " GC04" " GC05" " GC06" ...
## $ date : POSIXlt, format: "2018-01-28" "2018-01-28" ...
## $ descr: chr " Blackheath " " Bedfordview " " Kensington " ...
## $ pos : num 10 11 10 12 9 10 10 9 8 10 ...
## $ del : num 8 27 13 23 7 16 12 11 24 14 ...
## $ hu : num 13 36 22 12 16 17 12 13 20 14 ...
## $ poQty: num 801 1861 1231 832 1061 ...
## $ obDel: num 451 1265 678 506 594 ...
## $ obQty: num 442 1261 675 487 594 ...
## $ giQty: num 442 1261 675 487 594 ...
## $ grQty: num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 5 variables:
## ..$ row : int 844 846
## ..$ col : chr NA NA
## ..$ expected: chr "13 columns" "13 columns"
## ..$ actual : chr "1 columns" "1 columns"
## ..$ file : chr "'../Data/RawData.txt'" "'../Data/RawData.txt'"
## - attr(*, "na.action")= 'omit' Named int 844 845 846
## ..- attr(*, "names")= chr "844" "845" "846"
And we can do some summary statistics.
summary(clean)

## site date descr


## Length:843 Min. :2018-01-25 00:00:00 Length:843
## Class :character 1st Qu.:2018-01-25 00:00:00 Class :character
## Mode :character Median :2018-01-26 00:00:00 Mode :character
## Mean :2018-01-26 09:32:14
## 3rd Qu.:2018-01-27 00:00:00
## Max. :2018-01-28 00:00:00
## pos del hu poQty
## Min. : 1.000 Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 3.000 1st Qu.: 3.00 1st Qu.: 7.0 1st Qu.: 321
## Median : 5.000 Median : 9.00 Median :17.0 Median : 1011
## Mean : 6.522 Mean :10.37 Mean :18.8 Mean : 1182
## 3rd Qu.:10.000 3rd Qu.:14.00 3rd Qu.:26.0 3rd Qu.: 1718
## Max. :18.000 Max. :70.00 Max. :89.0 Max. :10416
## obDel obQty giQty grQty
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0000
## 1st Qu.: 210.0 1st Qu.: 208.0 1st Qu.: 208.0 1st Qu.: 0.0000
## Median : 713.0 Median : 705.0 Median : 705.0 Median : 0.0000

## Mean : 895.2 Mean : 884.6 Mean : 884.6 Mean : 0.5611
## 3rd Qu.:1305.5 3rd Qu.:1299.5 3rd Qu.:1299.5 3rd Qu.: 0.0000
## Max. :6032.0 Max. :4487.0 Max. :4487.0 Max. :428.0000
How many unique stores were there?
length(unique(clean$site))

## [1] 399
or
length(unique(clean$descr))

## [1] 389
Why do you think there is a difference?
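One way to investigate is to look for store descriptions that are shared by more than one site code; a
sketch using dplyr verbs, which are already attached as part of the tidyverse:
# descriptions that appear under more than one site code
clean %>%
  select(site, descr) %>%
  distinct() %>%
  count(descr) %>%
  filter(n > 1)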
By inspecting the data, we see that there are mainly zeros in the grQty column, and this is indeed because of
a SAP error not populating the report correctly. For this exercise we are going to discard the column, but
only because we know that there is an assignable cause. We don't just remove values we don't like.
clean <- clean[, -which(colnames(clean)=="grQty")]
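If you prefer the tidyverse idiom, dplyr's select() achieves the same result; run one version or the other,
not both:
clean <- select(clean, -grQty)  # equivalent to the base-R column removal above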

Finally, we are ready to save our clean data set and use it for a variety of analyses. Better still, when we get
the next SAP report, we don't have to redo anything; we simply point this script of ours to the new file, and
we are good to go.
write_csv(x=clean, path="../Data/CleanData.csv")
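Better still, you could wrap all of the steps above into a single function. Here is a minimal sketch, where
the name cleanReport is our own choice and not a package function:
cleanReport <- function(file) {
  raw <- read_delim(file=file, delim="|", col_names=FALSE, skip=5,
                    locale=locale(grouping_mark=","), col_types=c("ccccnnnnnnnnc"))
  clean <- raw[, -c(1, ncol(raw))]  # drop the empty first and last columns
  clean <- na.omit(clean)           # drop the separator and total rows
  colnames(clean) <- c("site", "date", "descr", "pos", "del", "hu", "poQty",
                       "obDel", "obQty", "giQty", "grQty")
  clean$date <- as.POSIXlt(x=clean$date, tz="Africa/Johannesburg", format="%d.%m.%Y")
  clean[, -which(colnames(clean)=="grQty")]  # discard the unpopulated grQty column
}
clean <- cleanReport(file="../Data/RawData.txt")

With that in place, processing next week's report is a single function call.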

Now we can start asking questions. For example, how many orders are being delivered on 27 January?
sum(clean[clean$date == as.POSIXlt(x="2018-01-27", tz="Africa/Johannesburg"), "pos"])

## [1] 1369
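That single statement packs several steps into one line. A step-by-step sketch of the same computation,
using only functions already introduced above, looks like this:
# TRUE for each row whose delivery date is 27 January
isTarget <- clean$date == as.POSIXlt(x="2018-01-27", tz="Africa/Johannesburg")
pos27 <- clean[isTarget, "pos"]  # keep only the 'pos' column of those rows
sum(pos27)                       # add up the number of purchase orders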
Take some time to work through the sketch and convince yourself it matches the one-liner. We can also
view the distribution of the number of units issued over the entire reporting period.
hist(clean$giQty, breaks=20, xlab="Goods issued (units)", main="")

[Figure: histogram of clean$giQty, with x-axis "Goods issued (units)" and y-axis "Frequency".]
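Since ggplot2 is attached as part of the tidyverse, a roughly equivalent plot (a sketch) would be:
# histogram of goods issued, mirroring the base-R hist() call above
ggplot(data=clean, mapping=aes(x=giQty)) +
  geom_histogram(bins=20) +
  labs(x="Goods issued (units)", y="Frequency")

Note that geom_histogram() uses the exact number of bins you ask for, whereas hist() treats breaks=20
as a suggestion, so the bin edges may differ slightly between the two plots.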
