Discover millions of ebooks, audiobooks, and so much more with a free trial

Only $11.99/month after trial. Cancel anytime.

Learn Data Mining Through Excel: A Step-by-Step Approach for Understanding Machine Learning Methods
Learn Data Mining Through Excel: A Step-by-Step Approach for Understanding Machine Learning Methods
Learn Data Mining Through Excel: A Step-by-Step Approach for Understanding Machine Learning Methods
Ebook306 pages2 hours

Learn Data Mining Through Excel: A Step-by-Step Approach for Understanding Machine Learning Methods

Rating: 0 out of 5 stars

()

Read preview

About this ebook

Use popular data mining techniques in Microsoft Excel to better understand machine learning methods.

Software tools and programming language packages take data input and deliver data mining results directly, presenting no insight on working mechanics and creating a chasm between input and output. This is where Excel can help.

Excel allows you to work with data in a transparent manner. When you open an Excel file, data is visible immediately and you can work with it directly. Intermediate results can be examined while you are conducting your mining task, offering a deeper understanding of how data is manipulated and results are obtained. These are critical aspects of the model construction process that are hidden in software tools and programming language packages.

This book teaches you data mining through Excel. You will learn how Excel has an advantage in data mining when the data sets are not too large. It can give you a visual representation of data mining, building confidence in your results. You will go through every step manually, which offers not only an active learning experience, but teaches you how the mining process works and how to find the internal hidden patterns inside the data.


What You Will Learn

  • Comprehend data mining using a visual step-by-step approach
  • Build on a theoretical introduction of a data mining method, followed by an Excel implementation
  • Unveil the mystery behind machine learning algorithms, making a complex topic accessible to everyone
  • Become skilled in creative uses of Excel formulas and functions
  • Obtain hands-on experience with data mining and Excel


Who This Book Is For
Anyone who is interested in learning data mining or machine learning, especially data science visual learners and people skilled in Excel, who would like to explore data science topics and/or expand their Excel skills. A basic or beginner level understanding of Excel is recommended. 

LanguageEnglish
PublisherApress
Release dateJun 13, 2020
ISBN9781484259825
Learn Data Mining Through Excel: A Step-by-Step Approach for Understanding Machine Learning Methods

Related to Learn Data Mining Through Excel

Related ebooks

Programming For You

View More

Related articles

Reviews for Learn Data Mining Through Excel

Rating: 0 out of 5 stars
0 ratings

0 ratings0 reviews

What did you think?

Tap to rate

Review must be at least 10 words

    Book preview

    Learn Data Mining Through Excel - Hong Zhou

    © Hong Zhou 2020

    H. ZhouLearn Data Mining Through Excelhttps://doi.org/10.1007/978-1-4842-5982-5_1

    1. Excel and Data Mining

    Hong Zhou¹ 

    (1)

    University of Saint Joseph, West Hartford, CT, USA

    Let’s get right to the topic. Why do we need to learn Excel in our data mining endeavor? It is true that there are quite a few outstanding data mining software tools such as RapidMiner and Tableau that make the mining process easy and straightforward. In addition, programming languages Python and R have a large number of reliable packages dedicated to various data mining tasks. What is the purpose of studying data mining or machine learning through Excel?

    Why Excel?

    If you are already an experienced data mining professional, I would say that you are asking the right question and probably you should not read this book. However, if you are a beginner in data mining, or a visual learner, or want to understand the mathematical background behind some popular data mining techniques, or an educator, then this book is right for you, and probably is the first book you should read before you start your data mining journey.

    Excel allows you to work with data in a transparent manner, meaning when an Excel file is opened, the data is visible immediately and every step of data processing is also visible. Intermediate results are contained in the Excel worksheet and can be examined while you are conducting your mining task. This allows you to obtain a deep and clear understanding of how the data are manipulated and how the results are obtained. Other software tools and programming languages hide critical aspects of the model construction process. For most data mining projects, the goal is to find the internal hidden patterns inside the data. Therefore, hiding the detailed process is beneficial to the users of the tools or packages. But it is not helpful for beginners, visual learners, or those who want to understand how the mining process works. Let me use k-nearest neighbors method (K-NN) to illustrate the learning differences between RapidMiner, R, and Excel. Before we do that, we need to understand several terminologies in data mining.

    There are two types of data mining techniques: supervised and unsupervised. Supervised methods require the use of a training dataset to train the software programs or algorithms (such programs or algorithms are often referred to as machines) first. Programs are trained to reach an optimal state called a model. This is why a training process is also called modeling. Data mining methods can also be categorized into parametric and nonparametric methods. For parametric methods, a model is just a set of parameters or rules obtained through the training process that are believed to allow the programs to work well with the training dataset. Nonparametric methods do not generate a set of parameters. Instead, they dynamically evaluate the incoming data based on the existing dataset. You may be confused by such definitions at this time. They will make sense soon.

    What is a training dataset? In a training dataset, the target variable (also called label, target, dependent variable, outcome variable, response), the value of which to be predicted, is given or known. The value of the target variable depends on the values of other variables which are usually called attributes, predictors, or independent variables. Based on the attribute values, a supervised data mining method computes (or so-called predicts) the value of the target variable. Some computed target values might not match the known target values in the training dataset. A good model indicates an optimal set of parameters or rules that can minimize the mismatches.

    A model is usually constructed to work on future datasets with unknown target values in a supervised data mining method. Such future datasets are commonly called scoring datasets. In an unsupervised data mining method , however, there is no training dataset and the model is an algorithm that can directly be applied on the scoring datasets. K-nearest neighbors method is a supervised data mining technique.

    Suppose we want to predict if a person is likely to accept a credit card offer based on the person’s age, gender, income, and number of credit cards they already have. The target variable is the response to the credit card offer (assume it is either Yes or No), while age, gender, income, and number of existing credit cards are the attributes. In the training dataset, all variables including both the target and attributes are known. In such a scenario, a K-NN model is constructed through the use of the training dataset. Based on the constructed model, we can predict the responses to the credit card offer of people whose information is stored in the scoring dataset.

    In RapidMiner, one of the best data mining tools, the prediction process is as follows: retrieve both the training data and scoring data from the repository ➤ set role for the training data ➤ apply the K-NN operator on the training data to construct the model ➤ connect the model and the scoring data to the Apply Model operator. That’s it! You can now execute the process and the result is obtained. Yes, very straightforward. This is shown in Figure 1-1. Be aware that there is no model validation in this simple process.

    ../images/494428_1_En_1_Chapter/494428_1_En_1_Fig1_HTML.jpg

    Figure 1-1

    K-NN model in RapidMiner

    Applying K-NN method is very simple in R, too. After loading the library class, read the training data and scoring data and make use of the K-NN function, and by then we have finished our job: ready to view our result. This is demonstrated in Figure 1-2. Note that lines starting with # are comments.

    ../images/494428_1_En_1_Chapter/494428_1_En_1_Fig2_HTML.jpg

    Figure 1-2

    K-NN in R

    The knowledge you have gained from the preceding tasks is enough to just be able to apply the data mining method K-NN. But if you are trying to understand, step by step, why and how K-NN works, you will need a lot more information. Excel can offer you the opportunity to go through a step-by-step analysis process on a dataset during which you can develop a solid understanding of the K-NN algorithm. With this solid understanding, you can then be more proficient in using other powerful tools or programming languages. Most importantly, you will have a better understanding of the quality and value of your data mining results. You will see that in later chapters.

    Of course, Excel is much more limited in data mining compared to R, Python, and RapidMiner. Excel can only work with data up to a smaller size limit. Meanwhile, some data mining techniques are too complicated to be practiced through Excel. Nonetheless, Excel provides us direct and visual understanding of the data mining mechanisms. In addition, Excel is naturally suitable for data preparation.

    Today, because of the software tools and other packages, most effort in a data mining task is spent on understanding the task (including the business understanding and data understanding), preparing the data, and presenting the results. Less than 10% of the effort is spent on the modeling process. The process of preparing the data for modeling is called data engineering . Excel has an advantage on data engineering when the datasets are not too large because it can give us a visual representation of data engineering, which allows us to be more confident in our data preparation process.

    As an experienced educator , I realize that students can better develop a deep understanding of data mining methods if these methods are also explained through step-by-step instructions in Excel. Studying through Excel unveils the mystery behind data mining or machine learning methods and makes students more confident in applying these methods.

    Did I just mention machine learning? Yes, I did. Machine learning is another buzz phrase today. What is machine learning ? What is the difference between data mining and machine learning?

    Most efforts to differentiate data mining and machine learning are not successful because data mining and machine learning cannot be clearly separated per se. At this moment, I would suggest that we treat them the same. But if I must tell the difference between data mining and machine learning, I would say that machine learning is more on supervised methods, while data mining includes both supervised and unsupervised methods.

    Prepare Some Excel Skills

    There are quite some Excel skills to learn in this book. I will explain some of them in detail when we need to use them. However, there are several fundamental Excel skills and functions that we need to be familiar with before we start talking about data mining.

    Formula

    Formula is the most important feature of Excel. Writing a formula is like writing a programming statement. In Excel, a formula always starts with an equal sign (= without quotation marks).

    Upon opening an Excel file, we are greeted with a table-like worksheet. Yes, every worksheet is a huge table. One reason why Excel is naturally suitable for data storage, analysis, and mining is because data are automatically arranged in a table format in Excel. Each cell in the big table has a name or so-called reference. By default, each column is labeled by an alphabet, while each row is labeled with a number. For example, the very first cell at the top-left corner is cell A1, that is, column A and row 1. The content in a cell, whatever it is, is represented by the cell reference.

    Enter number 1 in cell A1. The value of cell A1 is 1 and A1 represents 1 at this moment.

    Enter the formula =A1*10 (without the double quotation marks) in cell B1 and hit the Enter key. Note that the formula starts with =. Be aware that this is the only time a formula is presented inside a pair of double quotation marks in this book. From now on, all formulas are presented directly without quotation marks.

    Enter the text A1 * 10 in cell C1. Because the text does not start with =, it is not a formula.

    Our worksheet looks like Figure 1-3.

    ../images/494428_1_En_1_Chapter/494428_1_En_1_Fig3_HTML.jpg

    Figure 1-3

    Excel formula

    Autofill or Copy

    Autofill is another critical feature of Excel which makes Excel capable of working with a relatively large dataset. Autofill is also called copy by many people.

    Let’s learn autofill by the following experiment:

    1.

    Enter 1 in cell A1.

    2.

    Enter 2 in cell A2.

    3.

    Select both cells A1 and A2.

    4.

    Release the left mouse button.

    5.

    Move the mouse cursor to the right-below corner of cell A2 until the cursor becomes a black cross (shown in Figure 1-4).

    ../images/494428_1_En_1_Chapter/494428_1_En_1_Fig4_HTML.jpg

    Figure 1-4

    Cursor becomes a black cross

    6.

    Press down the left mouse button and drag down to cell A6.

    The cells A1:A6 are automatically filled with numbers 1, 2, 3, 4, 5, and 6. This process is called autofill. Some people call it copy, too. But more precisely, this process is autofill.

    Let’s conduct another experiment:

    1.

    Select cell B1 (make sure that B1 still has the formula =A1*10). Lift up the left mouse button.

    2.

    Move the mouse cursor to the left-down corner of cell B1 until the cursor becomes a black cross.

    3.

    Drag down the mouse cursor to cell B6. Our worksheet looks like

    Enjoying the preview?
    Page 1 of 1