If p-value ≤ alpha: Reject the null hypothesis (the data present a significant result).

Now let's consider risk again. Consider the following screenshot of a t-test; the p-value is highlighted in yellow (Exhibit 3-13):

EXHIBIT 3-13 t-Test Assessing for Significant Differences in Average Shipping Times across Categories

t-Test: Two-Sample Assuming Unequal Variances

                                 Copiers      All Other
Mean                             2.60742      2.91527
Variance                         2.697177     2.249016
Hypothesized Mean Difference     0
P(T<=t) one-tail                 0.073104
t Critical one-tail              1.663197
P(T<=t) two-tail                 0.146207
t Critical two-tail              1.98801

If our alpha is 5 percent (0.05), we would have to fail to reject the null because the p-value of 0.073104 is greater than alpha. But what if we set our alpha at 10 percent instead of 5 percent? All of a sudden, our p-value of 0.073104 is less than alpha (0.10). It is critical to decide what your significance level will be (1 percent, 5 percent, 10 percent) prior to running your statistical test. The p-value shouldn't dictate which alpha you select, as tempting as that may be!
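To make the decision rule concrete, here is a minimal sketch in Python; the shipping-time arrays are hypothetical stand-ins for the Exhibit 3-13 data, and scipy's stats.ttest_ind with equal_var=False runs the same two-sample, unequal-variances (Welch's) t-test as the Excel output shown above:

    import numpy as np
    from scipy import stats

    # Hypothetical shipping times (in days) for two product categories
    copiers = np.array([2.1, 3.0, 2.5, 2.8, 1.9, 3.2, 2.4, 2.6])
    all_other = np.array([3.1, 2.9, 3.4, 2.7, 3.0, 2.8, 3.3, 2.5])

    alpha = 0.05  # choose the significance level BEFORE running the test

    # Two-sample t-test assuming unequal variances (Welch's t-test)
    t_stat, p_value = stats.ttest_ind(copiers, all_other, equal_var=False)

    if p_value <= alpha:
        print(f"p = {p_value:.6f} <= {alpha}: reject the null hypothesis")
    else:
        print(f"p = {p_value:.6f} > {alpha}: fail to reject the null hypothesis")

Note that alpha is fixed in the code before the test statistic is ever computed, mirroring the advice above.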
PROGRESS CHECK

5. How does Benford's law provide an expectation of any set of naturally occurring collections of numbers?
6. Identify a reason the sales amount of any single product may or may not follow Benford's law.
7. Name three clusters of customers who might shop at Walmart.
8. In Exhibit 3-12, Cluster 1 of the highlighted group insurance claims has a long period from death to payment dates. Why would that cluster be of interest to internal auditors?

PREDICTIVE ANALYTICS

LO 3-4 Understand the techniques associated with predictive analytics, including regression and classification.

Before we discuss predictive analytics, we need to bring you up to speed on some data-specific terms:

• A target is an expected attribute or value that we want to evaluate. For example, if we are trying to predict whether a transaction is fraudulent, the target might be a specific "fraud score." If we're trying to predict an interest rate, the target would be "interest rate." The target is usually referred to as the dependent variable in a regression analysis.
• A class is a manually assigned category applied to a record based on an event. For example, if the credit department has rejected a credit line for a customer, the credit department assigns the class "Rejected" to the customer's master record. Likewise, if the internal auditors have confirmed that fraud has occurred, they would assign the class "Fraud" to that transaction.

On the other hand, we may ask questions with specific outcomes, such as: "Will a new vendor ship a large order on time?" When you are performing an analysis that uses historical data to predict a future outcome, you will use a supervised approach. You might use regression to predict a specific value to answer a question such as, "How many days do we predict it will take a new vendor to ship an order?" Again, the prediction is based on the activity we have observed from other vendors, shown in Exhibit 3-14. We use historical data to create the new model. Using a classification model, you can predict whether a new vendor belongs to one class or another based on the behavior of the others, shown in Exhibit 3-15. Causal modeling, similarity matching, and link prediction are additional supervised approaches in which you attempt to identify causation (which can be expensive), identify a series of characteristics that predict a model, or attempt to identify other relationships, respectively.

EXHIBIT 3-14 Regression

EXHIBIT 3-15 Classification

Predictive analytics facilitate making forecasts of accounting outcomes, including these examples:

1. Helping management accountants predict future performance, including future sales, earnings, and cash flows. This will help management accountants set budgets, plan production, and estimate available funds for loan repayments, dividend payments, and operations.
2. Helping company accountants predict which customers will be able to pay what they owe the company. This will help accountants estimate the appropriate allowance for doubtful accounts.
3. Helping auditors predict which financial statements need to be restated.
4. Helping investors and lenders predict which companies are likely to go bankrupt or be unable to continue as a going concern.
5. Helping investors and financial analysts predict future sales, earnings, and cash flows, which are critical to stock valuation.

Regression

Regressions allow the accountant to develop models to predict expected outcomes. These expected outcomes might be to predict the number of days to ship products relative to the volume of orders placed by the customer, shown in Exhibit 3-14.

Regression is a supervised method used to predict specific values. In this case, the number of days to ship is dependent on the number of items in the order. Therefore, we can use regression to predict the number of days it takes Vendor A to ship based on the volume in the order. (Vendor A is represented by the gold star in Exhibits 3-14 and 3-15.)

Regression analysis involves the following process:

1. Identify the variables that might predict an outcome (or target or dependent variable). The inputs, or explanatory variables, are called independent variables, whereas the output is the dependent variable. You will probably remember the formula for a linear equation from algebra or other math classes, y = mx + b. When we run a linear regression with only one explanatory variable, we use the same equation, with m representing the slope of the explanatory variable and b representing the y-intercept. Because linear regression models can have more than one explanatory variable, though, the formula is written slightly differently as y = b₀ + b₁x for a simple (one explanatory variable) regression, with b₀ representing the y-intercept and b₁ representing the slope of the explanatory variable. In a multiple regression model, each explanatory variable receives its own slope: y = b₀ + b₁x₁ + b₂x₂ + …, and so on until all of the explanatory variables and their respective slopes have been accounted for. When you run a regression in Excel, Tableau, or other statistical software, the values for the intercept and the slopes of each explanatory variable will be provided in a regression output. To create a predictive model, you simply plug in the values provided in the regression output along with the values for the particular scenario you are estimating or predicting (see the sketch following this list).
   A. Dummy variables: Variables in linear regression must be numerical, but sometimes we need to include categorical variables, for instance whether consumers are male or female, or whether they are from Arkansas or New York. When that is the case, we have to transform our categorical variables into numbers (we can't add a word to our formula!), and in particular into binary numbers called dummy variables. Dummy variables can take on the values of 0 or 1, where 0 represents the absence of something and 1 represents the presence of something. You will see examples of dummy variables in Comprehensive Lab 3-6.
2. Determine the functional form of the relationship. Is it a linear relationship where each input plots to another, or is the relationship nonlinear? While most accounting questions utilize a linear relationship, it is possible to consider a nonlinear relationship.
3. Identify the parameters of the model. What are the relative weights of each independent variable on the dependent variable? These are the coefficients on each of the independent variables. Statistical t-tests assess each regression coefficient one at a time to determine whether the weight is statistically different from 0 (or no weight at all). Particularly in multiple regression, it can be useful to assess the p-value for each variable. You interpret the p-value for each variable the same way you assess the p-value in a t-test: If the p-value is less than your alpha (typically 0.05), then you reject the null hypothesis. In regression, that implies that the explanatory variable is statistically significant.
4. Evaluate the goodness of fit. Calculate the adjusted R² value to determine whether the data are close to the line or not. In general, the better the fit (e.g., R² > 0.8), the more accurate the prediction will be. The adjusted R² is a value between 0 and 1. An adjusted R² value of 0 represents no ability to explain the dependent variable, and an adjusted R² value of 1 represents perfect ability to explain the dependent variable. Another statistic is the model F-test. The F-test of the overall significance of the hypothesized model (that has one or more independent variables) compared to a model with no independent variables tells us statistically whether our model is better than chance.
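Here is the promised sketch of the plug-in step, assuming a hypothetical vendor history relating order volume to days to ship (the setting of Exhibit 3-14); numpy's polyfit estimates the intercept b₀ and slope b₁, and the prediction simply plugs a new scenario into y = b₀ + b₁x:

    import numpy as np

    # Hypothetical vendor history: order volume and days to ship
    volume = np.array([10, 25, 40, 55, 70, 85, 100])
    days_to_ship = np.array([2, 3, 4, 4, 6, 7, 8])

    # Fit the simple regression y = b0 + b1*x (polyfit returns the slope first)
    b1, b0 = np.polyfit(volume, days_to_ship, deg=1)
    print(f"days_to_ship = {b0:.3f} + {b1:.3f} * volume")

    # Plug the regression output into the formula for a new scenario:
    # an order of 60 items
    print(b0 + b1 * 60)

The intercept and slope printed here play the same role as the coefficients reported in an Excel or Tableau regression output.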
Lab 3-3 and Lab 3-6 have you calculate linear regressions to predict completion rates and sales types.

Example of the Regression Approach in Cost Accounting

The following discussion primarily identifies the structure of the model (that is, the relationship between the dependent variable and the plausible independent variables) that might be useful in cost accounting:

Dependent variable = f(Independent variables)

Let's imagine a regression analysis is run to find the appropriate cost drivers for a company that makes deliveries, using these proposed dependent and independent variables:

Dependent variable: Total daily overhead costs for the company.

Independent variables: Potential overhead cost drivers include:
Deliveries: The number of deliveries made that day.
Miles: The number of miles driven to make all deliveries that day.
Time (minutes): The delivery time it takes to make all deliveries that day.
Weight: The combined weight of all deliveries made that day.
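As a minimal sketch (not from the text), this cost-driver model might be estimated in Python with the statsmodels library; the daily figures below are hypothetical:

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical daily observations of overhead and the proposed drivers
    data = pd.DataFrame({
        "overhead":   [1020, 1150, 980, 1230, 1100, 1310, 1050, 1190],
        "deliveries": [12, 15, 10, 18, 14, 20, 11, 16],
        "miles":      [140, 170, 120, 200, 160, 230, 130, 180],
        "minutes":    [310, 360, 280, 420, 340, 470, 300, 380],
        "weight":     [900, 1100, 800, 1300, 1000, 1500, 850, 1150],
    })

    # Multiple regression: overhead = b0 + b1*deliveries + b2*miles + ...
    X = sm.add_constant(data[["deliveries", "miles", "minutes", "weight"]])
    model = sm.OLS(data["overhead"], X).fit()

    # The summary reports each coefficient's p-value, the adjusted R-squared,
    # and the model F-test discussed in the process steps above
    print(model.summary())

Drivers whose coefficients are statistically significant are candidates for cost allocation bases.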
In the following sections, we provide additional examples of the implementation of regression analysis in the labs.

Example of the Regression Approach in Managerial Accounting

Accounting firms experience a great amount of employee turnover each year (between 15 and 25 percent). Understanding and predicting employee turnover is a particularly important determination for accounting firms. Each year, they must predict how many new employees might be needed to accommodate growth, to supply needed areas of expertise, and to replace employees who have left.

Accounting firms might predict employee turnover by specifying the following regression model:

Employee turnover = f(Current professional salaries, Health of the economy [GDP], Salaries offered by other accounting firms or by corporate accounting, etc.)

Using such a model, accounting firms could then begin to collect the necessary data to test their model and predict the level of employee turnover.

Example of the Regression Approach in Auditing

One of the key tasks of the auditors of a bank is to consider the amount of the allowance for loan losses, or for nonbanks, the allowance for doubtful accounts (i.e., those receivables that may never be collected). These allowances are often subject to manipulation to help manage earnings. The Financial Accounting Standards Board (FASB) issued Accounting Standards Update 2016-13, which requires that banks provide an estimate of expected credit losses (ECLs) by considering historical collection rates, current information, and reasonable and supportable forecasts, including estimates of prepayments. Using these historical and industry data, auditors may work to test a model to establish a loan loss reserve in this way:

Allowance for loan losses amount = f(Current aged loans, Loan type, Customer loan history, Collections success)

Example of the Regression Approach in Accounting or Business

For example, in Chapter 1, we worked to understand why LendingClub rejected certain loan applications. As we considered all of the possible explanations, we found that there were at least three possible indicators that a loan might be rejected, including the debt-to-income ratio, length of employment, and credit (risk) score, suggesting a model where:

Loan rejection = f(Debt-to-income ratio, Length of employment, Credit [risk] score)

Another example of the regression approach might be the approval of individual credit card transactions. Assume you go on a trip; in the morning you are in Pittsburgh, and by the very next day, you are in Shanghai. Will your credit card transaction in Shanghai automatically be rejected? Credit card companies establish models to predict fraud and decide whether to accept or reject a proposed credit card transaction. A potential model may be the following:

Transaction approval = f(Location of current transaction, Location of last transaction, Amount of current transaction, Prior history of travel of the credit card holder, etc.)

Time Series Analysis

Time series analysis is a predictive analytics technique used to predict future values based on past values of the same variable. This is a popular technique for estimating future sales, earnings, and cash flows based on past values of the same variable. Investors will want to predict future earnings and cash flows for use in valuing the stock. Management will work to predict future sales for use in planning future production. Lenders will work to predict future cash flows to evaluate whether their loan will be repaid. While there are many variants of time series analysis, all use prior performance to estimate future performance. For example:

Performanceₜ₊₁ = f(Performanceₜ, Performanceₜ₋₁, …, Performanceₜ₋ₖ)

where performance will usually be sales, earnings, or cash flows when used in an accounting context.
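Time series methods come in many variants; as one minimal sketch (with hypothetical quarterly sales figures), a first-order autoregression fits next-period performance as a linear function of current-period performance:

    import numpy as np

    # Hypothetical quarterly sales, ordered from oldest to most recent
    sales = np.array([100.0, 104.0, 109.0, 112.0, 118.0, 121.0, 127.0, 131.0])

    # First-order autoregression: sales(t+1) = b0 + b1 * sales(t)
    b1, b0 = np.polyfit(sales[:-1], sales[1:], deg=1)

    # Forecast the next quarter from the most recent observation
    forecast = b0 + b1 * sales[-1]
    print(f"Next-quarter sales forecast: {forecast:.1f}")

More sophisticated variants add trend, seasonality, or additional lags, but all follow the same logic of projecting prior performance forward.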
Classification

The goal of classification is to predict whether an individual we know very little about will belong to one class or another. For example, will a customer have their balance written off? The key here is that we are predicting whether the write-off will occur or not (in other words, there are two classes: "Write-Off" and "Good"). This is in contrast with a regression, which attempts to predict many possible values of the dependent variable, rather than just a few classes as used in classification.

Classification is a supervised method that can be used to predict the class of a new observation. In Exhibit 3-15, blue circles represent "on-time" vendors, green squares represent "delayed" vendors, and the gold star represents a new vendor with no history.

Classification is a little more involved, as we are now dealing with machine learning and complex probabilistic models. Here are the general steps:

1. Identify the classes you wish to predict.
2. Manually classify an existing set of records.
3. Select a set of classification models.
4. Divide your data into training and testing sets.
5. Generate your model.
6. Interpret the results and select the "best" model.

Classification Terminology

First, a bit of terminology to prepare us for our discussion.

Training data are existing data that have been manually evaluated and assigned a class. We know that some customer accounts have been written off, so those accounts are assigned the class "Write-Off." We will train our model to learn what it is that those customers have in common so we can predict whether a new customer will default or not.

Test data are existing data used to evaluate the model. The classification algorithm will try to predict the class of the test data and then compare its prediction to the previously assigned class. This comparison is used to evaluate the accuracy of the model, or the probability that the model will assign the correct class.

Decision trees are used to divide data into smaller groups, and decision boundaries mark the split between one class and another. Exhibit 3-16 provides an illustration of both decision trees and decision boundaries. Decision trees split the data at each branch into two or more groups. In this example, the first branch divides the vendor data by geographic distance and inserts a decision boundary through the middle of the data. Branches 2 and 3 split each of the two new groups by vendor volume. Note that the decision boundaries in the graph on the right are different for each grouping.

EXHIBIT 3-16 Example of Decision Trees and Decision Boundaries

Pruning removes branches from a decision tree to avoid overfitting the model. In other words, pruning reduces the number of times we split the groups of data into smaller groups, as shown in Exhibit 3-16. Pre-pruning occurs during model generation: the model stops creating new branches when the information usefulness of an additional branch is low. Post-pruning evaluates the complete model and discards branches after the fact. Exhibit 3-17 provides an illustration of how pruning might work in a decision tree.

EXHIBIT 3-17 Illustration of Pruning a Decision Tree
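A minimal sketch of steps 2 through 6 in Python with scikit-learn, using hypothetical, manually classified vendor records; limiting max_depth acts like pre-pruning by capping how many times the tree may split:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Hypothetical vendors: [volume, distance]; class 1 = "on time", 0 = "delayed"
    X = np.array([[80, 10], [70, 15], [90, 5], [60, 25], [85, 12],
                  [30, 60], [20, 70], [40, 55], [35, 65], [25, 75]])
    y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

    # Divide the manually classified records into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1)

    # A shallow tree: restricting depth pre-prunes the model to avoid overfitting
    tree = DecisionTreeClassifier(max_depth=2)
    tree.fit(X_train, y_train)

    # Compare test-set predictions to the previously assigned classes
    print(accuracy_score(y_test, tree.predict(X_test)))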
Linear classifiers are useful for ranking items rather than simply predicting class probability. These classifiers are used to identify a decision boundary. Exhibit 3-18 shows an illustration of linear classifiers segregating the two classes.

EXHIBIT 3-18 Illustration of Linear Classifiers

A linear discriminant uses an algebraic line to separate the two classes. In the example noted here, the classification is a function of both volume and distance:

One class if 1.0 × Volume - 1.5 × Distance + 50 > 0
The other class if 1.0 × Volume - 1.5 × Distance + 50 < 0

We don't expect linear classifiers to perfectly segregate classes. For example, the green square that appears below the line in Exhibit 3-18 would be incorrectly classified as a circle and considered an error.

A support vector machine is a discriminating classifier defined by a separating hyperplane; it works first to find the widest margin (or biggest pipe) and then works to find the middle line. Exhibits 3-19 and 3-20 provide an illustration of support vector machines and how they work to find the best decision boundary.

EXHIBIT 3-19 Support Vector Machines
With support vector machines, first find the widest margin (biggest pipe); then find the middle line.

EXHIBIT 3-20 Support Vector Machine Decision Boundaries
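A minimal sketch of a linear-kernel support vector machine in scikit-learn, reusing the hypothetical vendor data from the decision tree sketch above; the fitted model finds the widest-margin boundary and then classifies a new vendor (the gold star):

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical vendors: [volume, distance]; class 1 = "on time", 0 = "delayed"
    X = np.array([[80, 10], [70, 15], [90, 5], [60, 25], [85, 12],
                  [30, 60], [20, 70], [40, 55], [35, 65], [25, 75]])
    y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

    # A linear-kernel SVM finds the maximum-margin separating line
    clf = SVC(kernel="linear")
    clf.fit(X, y)

    # Classify a new vendor with volume 60 and distance 30
    print(clf.predict([[60, 30]]))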
