
CIN715

Assignment 2
(Sales & Shipping Dataset)
Abstract
Data mining is a set of advanced tools for handling large volumes of data, which are often
accumulated haphazardly and contain many missing values. Data on the scale of terabytes to
petabytes is now common and has radically changed science and engineering. To analyse,
manage, and make decisions on data of this size, effective methods and procedures need to
be deployed, and these are collectively called data mining. Data mining explores datasets
to uncover previously unseen patterns and to produce predictions that can be used for
better decision making in the future. It combines pattern recognition with mathematical and
statistical techniques to search data warehouses and to help analysts identify important
trends, statistics, and anomalies. This assignment focuses on a sales and shipping dataset,
which is used to produce the final outcomes.
Introduction
Almost every modern industry produces immense quantities of data that are used in decision
making. Data science provides a broad range of sophisticated and flexible methods that
transform data to help overcome challenges and reach company goals.
To begin with, each industry encounters its own set of problems, and analytical solutions
need to be devised and deployed to resolve them. RapidMiner is regarded as an open-source,
free-of-charge software tool that is normally used for data analysis and text mining.
RapidMiner has a great deal of experience with major industries, understands the specific
problems those industries face, and has a robust track record of helping corporations drive
revenue, cut costs, and avoid risks (M. Shiga, 2021).
So why use RapidMiner? RapidMiner offers an effective and easier way of improving business
operations: it delivers transformational impact for organizations, enhances the skills and
efficiency of the organization and its workers, produces in-depth analysis in a simple
manner, makes results easy to trust, tune, and explain, and delivers future-proof
innovation, mobility, and extensibility (Why RapidMiner - RapidMiner, 2021).
Lastly, the sales and shipping dataset is used in this assignment to reach the conclusion.
Since the assignment relates to sales and the shipment of goods, there are scenarios where
goods are delivered on their normal schedule and scenarios where shipments are delayed due
to unforeseen circumstances.
Problem statement and hypothesis
Problem Statement
Sales are the most important measure for any business in the modern era. When sales are
increasing, it is probable that the business is making a profit, which is good for its
future. The major problem arises when the business looks at the number of sales and the
deliveries of goods made over different time frames. It is hard to record each and every
transaction in MS Excel, because the data must be typed in according to the transactions
each customer made on particular dates. Processing that data in MS Excel is also a
challenging job for any user, and understanding records that range from one customer to
thousands of customers is a hard task. Thus, RapidMiner 9.10 will be used to solve the
above problem.

Hypothesis
Data mining is an important task when datasets are large. Using RapidMiner to process the
datasets is expected to make the work much easier and to produce results faster and in a
form that is easier to see and understand. Many problems are encountered when dealing with
a large number of records, and a similar situation was seen when analysing the Sales and
Shipping dataset for accurate and reliable results.
Methodology
A) Dataset
This dataset covers the sales made to numerous customers and whether the items were
delivered on time as instructed by the customers. It provides information about each sale
as well as how the order was delivered or whether it was cancelled.

The dataset contains the following information:

• Order Number
• Quantity Ordered
• Price Each
• Order Line Number
• Sales
• Order Date
• Status
• QTR-ID
• Month-ID
• Year-ID
• Product Line
• MSRP
• Product Code
• Customer Name
• Phone
• Address Line 1
• Address Line 2
• City
• State
• Postal Code
• Country
• Territory
• Contact Last Name
• Contact First Name
• Deal Size
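
As a minimal sketch of how records with these fields could be loaded outside RapidMiner,
the Python snippet below parses a small in-memory CSV. The upper-case column names, the
subset of fields, and the two sample rows are assumptions made for illustration; they are
not taken from the assignment's dataset.

```python
import csv
import io

# Hypothetical sample covering a subset of the fields listed above.
# Column names and values are illustrative, not taken from the real dataset.
SAMPLE_CSV = """ORDERNUMBER,QUANTITYORDERED,PRICEEACH,SALES,STATUS,PRODUCTLINE,COUNTRY,DEALSIZE
10001,30,95.70,2871.00,Shipped,Motorcycles,USA,Small
10002,34,81.35,2765.90,Cancelled,Classic Cars,France,Small
"""

def load_orders(text):
    """Parse CSV text into a list of dicts, converting the numeric fields."""
    orders = []
    for row in csv.DictReader(io.StringIO(text)):
        row["ORDERNUMBER"] = int(row["ORDERNUMBER"])
        row["QUANTITYORDERED"] = int(row["QUANTITYORDERED"])
        row["PRICEEACH"] = float(row["PRICEEACH"])
        row["SALES"] = float(row["SALES"])
        orders.append(row)
    return orders

orders = load_orders(SAMPLE_CSV)
for order in orders:
    print(order["ORDERNUMBER"], order["STATUS"], order["SALES"])
```

Keeping Status as plain text here mirrors its later use as the label that the decision
tree predicts.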
B) As mentioned in Part A of the methodology, this dataset provides information about
sales as well as the delivery of products to customers. Customer records are a way
to keep track of every individual customer and of whether a delivery service should
be provided to that customer. When goods are to be delivered to a particular
customer, the order has one of the following statuses: shipped, cancelled, disputed,
on hold, in process, or resolved. This makes it easier to get the correct
information on time.
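
The six status values can be sketched as an enumeration and counted, which gives a quick
view of how orders are distributed before any model is built. The exact spellings of the
values and the `observed` list are assumptions for illustration only.

```python
from collections import Counter
from enum import Enum

class Status(Enum):
    """The six delivery statuses named in the text (spellings are assumed)."""
    SHIPPED = "Shipped"
    CANCELLED = "Cancelled"
    DISPUTED = "Disputed"
    ON_HOLD = "On Hold"
    IN_PROCESS = "In Process"
    RESOLVED = "Resolved"

# Hypothetical Status column; not real data from the assignment.
observed = ["Shipped", "Shipped", "On Hold", "Cancelled", "Shipped", "Resolved"]

# Status(value) raises ValueError for an unknown status, which catches typos early.
counts = Counter(Status(value) for value in observed)
for status in Status:
    print(f"{status.value:<12}{counts[status]}")
```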

C) By using this dataset, it will be easy to track how delivery was made to each and
every customer. This will also help when the final results are displayed in a
decision tree diagram together with its detailed description. The diagram can be
used to examine different aspects of the data and to help solve problems.
Algorithm used

The type of algorithm used is classification, with a decision tree applied to the Sales &
Shipping dataset. Six processes made the decision tree diagram possible: the Sales and
Shipping dataset itself, the Set Role process, the Split Data process, the Decision Tree
process, the Apply Model process, and lastly the Performance process.
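
The six processes listed above can be sketched in plain Python to show what each step
contributes. This is not RapidMiner's Decision Tree operator, which has its own splitting
criteria and pruning options; it is a small ID3-style tree that splits on information
gain. The attribute names, the `toy_status` rule, the 70/30 split ratio, and the seed are
all assumptions made for illustration.

```python
import math
import random
from collections import Counter

LABEL = "status"  # Set Role: the attribute the tree should predict

def toy_status(deal_size, territory):
    # Made-up rule so the toy data has a pattern to learn (not from the real dataset).
    if deal_size == "Large":
        return "On Hold" if territory == "APAC" else "Cancelled"
    return "Shipped"

# Dataset: hypothetical records, four copies of each attribute combination.
records = [
    {"deal_size": d, "territory": t, LABEL: toy_status(d, t)}
    for d in ("Small", "Medium", "Large")
    for t in ("EMEA", "APAC", "Japan")
] * 4

def split_data(rows, train_ratio=0.7, seed=42):
    """Split Data: shuffle reproducibly, then cut into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def entropy(rows):
    total = len(rows)
    counts = Counter(row[LABEL] for row in rows)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def majority(rows):
    return Counter(row[LABEL] for row in rows).most_common(1)[0][0]

def partition(rows, attr):
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups

def build_tree(rows, attrs):
    """Decision Tree: split on the attribute with the highest information gain."""
    if len({row[LABEL] for row in rows}) == 1 or not attrs:
        return majority(rows)  # leaf: a single predicted status

    def remaining_entropy(attr):
        groups = partition(rows, attr).values()
        return sum(len(g) / len(rows) * entropy(g) for g in groups)

    best = min(attrs, key=remaining_entropy)  # lowest remainder = highest gain
    rest = [a for a in attrs if a != best]
    children = {v: build_tree(g, rest) for v, g in partition(rows, best).items()}
    return {"attr": best, "default": majority(rows), "children": children}

def predict(tree, row):
    """Apply Model: follow branches until a leaf (a status string) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

def accuracy(tree, rows):
    """Performance: share of rows whose predicted status matches the label."""
    return sum(predict(tree, row) == row[LABEL] for row in rows) / len(rows)

train, test = split_data(records)
tree = build_tree(train, ["deal_size", "territory"])
print("train accuracy:", accuracy(tree, train))
print("test accuracy:", accuracy(tree, test))
```

Measuring accuracy on the held-out test rows rather than the training rows is the reason
the split step comes before the tree is built.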

The first screenshot shows the decision tree process model, in which all the processes are
connected in order to get the final results or outcomes.
The second screenshot shows the decision tree diagram produced after all the processes have
been connected with each other.

The decision tree diagram shows the branches that indicate whether the goods were
delivered to the customer or were cancelled due to some situation.
Results and Discussion
1) Sales and Shipping dataset processes (Model)

The model above shows all the processes needed to get the final outcome, which is the
Status of each delivery. The status takes one of these values: shipped, cancelled,
disputed, on hold, in process, or resolved. The process model helps in understanding the
records, which hold a large amount of detail about each and every customer. Once these
processes are linked together, the decision tree diagram can be produced. RapidMiner
plays an important part because the processing is done faster and the result can be
calculated accurately.
2) Decision tree diagram of Sales and Shipping dataset

This is the outcome of the process model. The decision tree is produced by connecting all
the processes together to get the final result, which is the Status of the delivery. It
also shows all the possible outcomes, which are hard to determine in MS Excel sheets.
When a business sells its goods, it records all the necessary information about its
customers, but tracking the delivery of the same goods is harder. When goods are to be
delivered, the business should keep track of them until they have been delivered or, in
some cases, cancelled.
3) Tree diagram description of Sales and Shipping dataset

The screenshot above shows the detailed description of the decision tree that was
constructed in RapidMiner. It sets out the results of the decision tree diagram clearly
and identifies the Status of deliveries made to customers. This detailed information is a
quick way to track the delivery or shipping services performed by the business. It can
also be used to update important business records for future reference. With this
description, all possible outcomes can be seen if one gets confused when browsing through
the decision tree diagram.
4) Table view of Performance vector of Sales and Shipping dataset

This screenshot shows the table view of the performance vector for the Sales and Shipping
dataset. It is helpful for a sales and shipping business because it gives a clear view of
the accuracy obtained on the dataset. It shows six different outcomes for the delivery of
goods to customers. When the Shipped outcome is true, the goods have been shipped to the
customer. Otherwise, the order is disputed, cancelled, or on hold, and such cases can be
followed up and resolved when analysing the dataset.
5) Description of Performance vector of Sales and Shipping dataset

This screenshot shows the description of the performance vector for the Sales and Shipping
dataset. It reports the accuracy as well as the confusion matrix for the dataset. Using
this information, a sales and shipping business can understand its operations better and
solve problems that arise. The information can also help assess business performance in
future years. Tracking deliveries is important, and the business must therefore know all
the relevant information about them.
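
To make the two quantities in the performance vector concrete, the sketch below computes
a confusion matrix, the overall accuracy, and a per-class recall from two lists of labels.
The `actual` and `predicted` lists are hypothetical and do not reproduce the figures in
the screenshot.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of predictions that match the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def class_recall(matrix, label, labels):
    """Share of rows with this actual label that were predicted correctly."""
    row_total = sum(matrix[(label, p)] for p in labels)
    return matrix[(label, label)] / row_total if row_total else 0.0

# Hypothetical labels for six orders; not results from the assignment.
actual = ["Shipped", "Shipped", "Shipped", "Cancelled", "On Hold", "Shipped"]
predicted = ["Shipped", "Shipped", "Cancelled", "Cancelled", "Shipped", "Shipped"]

matrix = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# Rows are actual statuses, columns are predicted statuses.
for a in labels:
    print(f"{a:<10}", [matrix[(a, p)] for p in labels])
print(f"accuracy = {accuracy(actual, predicted):.2%}")
```

Cells on the diagonal are correct predictions; every other cell shows which status was
confused with which.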
Limitation
The limitations encountered while doing this assignment were:

• The dataset might be incomplete, which makes it an issue to work with.
• RapidMiner makes the operating system a little slow when it is started to process
the dataset.
• A new user who does not know how to use the software will find it hard when
compared with a professional user of RapidMiner.
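
The first limitation can be checked before modelling by counting missing values per
field. The sketch below is one simple way to do that; the tokens treated as missing and
the sample rows are assumptions for illustration.

```python
from collections import Counter

# Assumption: these tokens mark a missing value; adjust to the real export.
MISSING_TOKENS = {"", "N/A", "?"}

# Hypothetical rows using a few of the dataset's fields.
rows = [
    {"Customer Name": "Customer A", "State": "NY", "Address Line 2": "", "Status": "Shipped"},
    {"Customer Name": "Customer B", "State": "", "Address Line 2": "", "Status": "On Hold"},
    {"Customer Name": "Customer C", "State": "N/A", "Address Line 2": "Suite 5", "Status": "Shipped"},
]

def missing_counts(rows):
    """Return how many rows have a missing value in each field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip() in MISSING_TOKENS:
                counts[field] += 1
    return counts

counts = missing_counts(rows)
for field in rows[0]:
    share = counts[field] / len(rows)
    print(f"{field:<16}{counts[field]} missing ({share:.0%})")
```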
Conclusion and Recommendations
Conclusion
Data mining is not just running a number of complex queries on the data stored in a
database. Technique and analysis are very important when it comes to identifying the
format of the information that you have or need. Analysing, managing, and making decisions
on huge datasets of this kind requires data mining techniques, and the data mining tools
needed for large volumes of data are changing numerous fields. KDD (knowledge discovery in
databases) is the process by which hidden patterns in data repositories are found. The
difficulties posed by the various forms of data are substantial. Methods, tools, and
strategies for data mining are valuable in a variety of application areas (Osmer R. 2021).

Recommendations
RapidMiner is a useful piece of software that was used in this assignment and helped in
every aspect of it. The Sales and Shipping dataset was processed in this software to get
the final outcomes and all the possible results. Besides the Sales and Shipping dataset,
it can also be useful for other types of business datasets. Every business should use this
software where possible, whatever the subject matter. With this software, large amounts of
information can be analysed quickly and the correct results obtained. Data mining is
done where a business handles large amounts of data or information. Keeping information
up to date is another thing to look at. For every business, the customer is the most
important part, helping the business to function and grow. Without customers, a business
is not complete.
References

• RapidMiner. 2021. Why RapidMiner - RapidMiner. [online] Available at:
<https://rapidminer.com/why-rapidminer/> [Accessed 4 November 2021].
• M. Shiga, I. Takigawa, and H. Mamitsuka, "A spectral clustering approach to
optimally combining numerical vectors with a modular network," in KDD, 2007, pp.
647–656. Accessed on 04/11/21.
• Osmer R. Zalane, CMPUT 690 Principles of Knowledge Discovery in Databases,
"Introduction to Data Mining". Accessed on 04/11/21.
• Koperski, J. Adhikary and J. Han, "Spatial Data Mining: Progress and Challenges",
SIGMOD'96 Workshop on Research Issues in Data Mining and Knowledge Discovery
(DMKD'96), Montreal, Canada. Accessed on 04/11/21.
