
Machine Learning for Business Analytics

Shopping Data
(databm1608) – Data Preprocessing | Clustering

Process:
1. Consider the relevant characteristics: the date of purchase, the customer's financial
situation, and so forth.
2. Change the format of the column "InvoiceDate" to Date type.
3. Examine the data to determine the most recent purchase date (December 2011) and the oldest
purchase date (December 2010).
4. Extract the Month and Year from each "InvoiceDate" value into two new columns, Month and
Year.
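5. Use the formula “=IF(YEAR=2010,12,12-MONTH)” with the corresponding cell values for Year
and Month to determine Recency: the number of months between the purchase and the most recent
purchase month, December 2011. The 2010 branch returns 12 because the only 2010 purchases fall
in December 2010.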
6. By multiplying the values in the “UnitPrice” and “Quantity” columns, create a new column to
calculate money spent (Monetary).
7. Refresh the data and create a Pivot Table with one row per "CustomerID". Calculate Frequency
as the count of "CustomerID" entries, Monetary as the sum of "Monetary Value", and Recency as
the minimum of the months values ("Min of Months").
8. Export the transformed data to a CSV file and open it in RStudio.
9. To summarise the CSV file, use R's summary() function. The minimum, maximum, and quartile
values in the summary reveal outliers.
10. Create histograms of the Frequency and Monetary values. Because most Monetary values are
below 40,000, rows with zero, negative, or above-40,000 Monetary values can be removed.
Similarly, most Frequency values fall in the 50-60 range; values outside that range can be
classified as outliers.
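
Steps 5 to 7 and the Monetary filter in step 10 can also be sketched in code. The following is a minimal Python illustration, not the Excel and R workflow described above. The sample rows, the ISO date format, and the helper names recency_months, rfm_table, and drop_monetary_outliers are assumptions made for the example; only the column names, the Recency formula, and the 40,000 cutoff come from the process.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: column names follow the process above,
# the values are invented purely for illustration.
rows = [
    {"CustomerID": "A", "InvoiceDate": "2010-12-01", "Quantity": 2, "UnitPrice": 3.0},
    {"CustomerID": "A", "InvoiceDate": "2011-11-15", "Quantity": 1, "UnitPrice": 10.0},
    {"CustomerID": "B", "InvoiceDate": "2011-03-20", "Quantity": 5, "UnitPrice": 2.0},
    {"CustomerID": "C", "InvoiceDate": "2011-12-05", "Quantity": -1, "UnitPrice": 4.0},
]


def recency_months(date):
    """Mirror =IF(YEAR=2010,12,12-MONTH): months before December 2011."""
    return 12 if date.year == 2010 else 12 - date.month


def rfm_table(rows, date_format="%Y-%m-%d"):
    """Per customer: minimum recency, number of rows, and summed spend.

    The date format is an assumption; adjust it to the exported file.
    """
    table = defaultdict(lambda: {"Recency": None, "Frequency": 0, "Monetary": 0.0})
    for row in rows:
        date = datetime.strptime(row["InvoiceDate"], date_format)
        months = recency_months(date)
        entry = table[row["CustomerID"]]
        # "Min of Months": keep the smallest recency seen for this customer.
        if entry["Recency"] is None or months < entry["Recency"]:
            entry["Recency"] = months
        entry["Frequency"] += 1
        # Monetary = UnitPrice * Quantity, summed per customer.
        entry["Monetary"] += row["Quantity"] * row["UnitPrice"]
    return dict(table)


def drop_monetary_outliers(table, max_monetary=40_000):
    """Keep customers whose Monetary value is positive and at most the cutoff."""
    return {
        customer: values
        for customer, values in table.items()
        if 0 < values["Monetary"] <= max_monetary
    }


rfm = rfm_table(rows)
cleaned = drop_monetary_outliers(rfm)
for customer, values in sorted(cleaned.items()):
    print(customer, values)
```

In this sketch, customer C is dropped because its summed Monetary value is negative. The Frequency filter is left out on purpose, since the process only says where most Frequency values fall and does not state exact cutoffs.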
