ANALYZING DATA IN EXCEL
By Andrei Okhlopkov
Published: December 2019
For the most recent version of this booklet,
accompanying Excel file (“Data analysis.xlsx”)
and other publications please visit:
www.eloquens.com/channel/andrei-okhlopkov
Materials in this booklet are distributed freely and are not intended for sale. You are
authorized and encouraged to forward this booklet and accompanying Excel file to
everyone who might be interested.
Analyzing Data in Excel
Also in this series:
• Guiding Principles of Financial Modeling
• Dates and Timelines in Financial Models
• Unlocking Full Potential of Excel Data Tables (parts 1 and 2)
• Sensitivity Analysis with Indifference Curves
• Charts and Dashboards in Excel (parts 1 and 2)
• Descriptive Statistics for Grouped (Weighted) Data
• Monte Carlo Analysis without Macros
• Portfolio Analysis and Sales Forecasting
• Form Controls and Hyperlinks in Financial Models
• VBA Interactive Collection
The above list and the publications are constantly updated. To be notified of
updates, please follow me on LinkedIn:
https://www.linkedin.com/in/andrei-okhlopkov-92a3191/
I appreciate your interest in my publications on financial
modeling, Excel and VBA. I enjoy sharing them, will keep them
coming and hope you will find them useful.
Most of my publications are free but you can support my project
through PayPal (ID: andrei.okhlopkov@hotmail.com).
And certainly feel free to let me know if you need any further
help in financial modeling.
Andrei Okhlopkov
andrei.okhlopkov@hotmail.com
TABLE OF CONTENTS
INTRODUCTION 4
LOOKING UP DATA IN LISTS 5
REGULAR LOOKUPS 5
LENIENT LOOKUPS 6
MATCHING DATA IN LISTS AND CELLS 9
MATCHING ENTIRE CELL CONTENT 9
MATCHING PARTS OF CELLS: COUNTING 10
MATCHING PARTS OF CELLS: SUMMING 11
EXTRACTING DESIRED PARTS OF TEXT FROM LISTS AND CELLS 12
PRACTICE EXAMPLES 14
DUE DILIGENCE ANALYSIS 14
ENHANCING ACCOUNTING SYSTEM 14
SPLITTING TEXT IN CELLS BY COLUMNS 15
SINGLE-COMPONENT 15
MULTI-COMPONENT 15
CONDITIONAL STATISTICS FOR DELIMITED LISTS 17
ONE CONDITION, SINGLE CRITERIA 17
SEVERAL CONDITIONS, SINGLE CRITERIA 18
MULTIPLE CRITERIA 19
STATISTICS FOR DATA WITH CRITERIA AND RANKS 21
CRITERIA, THEN RANKS 21
RANKS, THEN CRITERIA 22
SORTING DATA UNDER MULTIPLE CRITERIA 24
CONDITIONAL STATISTICS FOR NON-DELIMITED LISTS 25
ONE CONDITION, SINGLE CRITERIA 25
SEVERAL CONDITIONS, SINGLE CRITERIA 25
MULTIPLE CRITERIA 25
“FLEXIBLE” SELECTION 29
ANALYZING DATA CONTAINING DATES 31
GROUPING BY YEARS 31
GROUPING BY YEARS AND ADDITIONAL CRITERIA 32
GROUPING BY YEARS AND QUARTERS 32
GROUPING BY YEARS AND MONTHS 33
GROUPING BY WEEKDAYS 34
TRANSFORMING TWO-DIMENSIONAL TABLES 35
SINGLE CRITERIA 35
MULTIPLE CRITERIA, ONE DIMENSION 36
MULTIPLE CRITERIA, TWO DIMENSIONS 36
CREATING AUTOMATIC LISTS BASED ON CRITERIA 38
SINGLE CRITERIA 38
MULTIPLE CRITERIA (AND AND OR LOGIC) 39
WHAT’S NEXT? 40
DESCRIPTIVE STATISTICS FOR GROUPED (WEIGHTED) DATA 40
PORTFOLIO ANALYSIS AND SALES FORECASTING 41
Introduction
In this brochure I have collected a handful of techniques for working with data in Excel:
looking up values in tables, matching text or numbers in lists and extracting
corresponding values from neighboring columns, querying specific data, aggregating
amounts based on a certain condition or a number of criteria, making lists and
summaries, doing statistical analysis based on these criteria, etc.
From a financial modeling standpoint, these techniques will help collect and analyze data
at the due diligence stage (I give a practical example of how to make an extract
from a somewhat messy accounting system of a fictitious investment target). If you are
an in-house accountant or analyst, the methods I am going to describe will help you
make even an existing system more informative and analytical with no additional
resources (I will give an example of this too).
I have tied my examples to the automotive industry (in which I have spent quite some
years). But these examples are very generic and flexible and can certainly be adjusted
to any other industry or specific analytical requirements.
Regular lookups
To find out the exchange rate valid at a given date (entered into the cell B22) we use the
simple LOOKUP function. It can be written in two forms:
– short variant (cell D22) does not use the third argument (result vector) and
returns the date at which the rate was established. In our example, we are trying
to find the rate at 25-Jun-18. There is no such date in the table, so we are looking
for the date immediately preceding our given date. This is 22-Jun-18.
– full variant (cell E22) tells the rate (1.1552) which was set on 22 June 2018 and
which was valid on the date we are looking at (25-June-18).
Again, for these formulae to work properly dates must be sorted in ascending order. If,
for whatever reason, your data is not sorted, the section on the right explains how to still
get the same results. The dates and rates in cells G4:H18 are exactly the same but are
shuffled.
Let's look at the formula in cell G22 which finds the date at which the relevant exchange
rate was determined. Take the first component of it (=G5:G18<=B22), enter into any
cell, then put the cursor inside the formula and press F9. This is what you will see:
={FALSE;FALSE;FALSE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;TRUE}
As you can see, the formula checks for every date if it is less than or equal to our given
date, and returns an array of TRUE and FALSE values. For instance, 24-Aug-18 is greater
than 25-Jun-18, so the first position is FALSE. The same goes for the second and the third
positions (20-Jul-18 and 13-Jul-18 respectively). In the fourth position we have 01-Jun-18
which is less than our 25-Jun-18, so the value is TRUE for the first time, and so on.
We are then multiplying this array by the dates ( =G5:G18 * (G5:G18<=B22) ) and if we
F9 this formula again we will get:
={0;0;0;43252;43273;0;0;0;0;0;43266;0;0;43259}
First three positions are zeroes representing the fact that the first three values in the
first string are FALSE, and multiplying anything by FALSE is equivalent to multiplying
by zero. The other, seemingly strange, numbers are the dates as Excel stores them (Excel
stores dates as serial numbers, with 1 Jan 1900 being day 1).
As the result of this operation, we have zeroed out all the dates that are greater than
our target date. Thus, the effective date we are looking for will be the maximum of the
remaining dates.
We do an interim operation putting this array under the INDEX function
=INDEX(G5:G18 * (G5:G18<=B22), 0). This does not change the array in any way but it
allows us to use it further with regular functions natively, in this case with the MAX
function:
=MAX(INDEX(G5:G18 * (G5:G18<=B22), 0))
This is our final formula for the effective date.
Next, we will need to determine the exchange rate which was set on this date. In cell
H22 we start with =MAX(G5:G18 * (G5:G18<=B22)) which is the same formula we
arrived at earlier but without the INDEX component (we can exclude it here because
we will later use the SUMPRODUCT function, which can handle such arrays itself).
We then test which date in the list is equal to the date we have just found:
=G5:G18=MAX(G5:G18 * (G5:G18<=B22)). This returns the following array:
={FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE}
There is only one TRUE, in the fifth position, showing that this is the position
which has 22-Jun-2018. By multiplying this array by the exchange rates, we zero out all
the rates except the one we need:
=(G5:G18=MAX(G5:G18 * (G5:G18<=B22))) * H5:H18
which produces an array:
={0;0;0;0;1.1552;0;0;0;0;0;0;0;0;0}
If we add up all these numbers, we will get 1.1552 and our mission is completed. We do
this with the SUMPRODUCT function:
=SUMPRODUCT((G5:G18=MAX(G5:G18 * (G5:G18<=B22))) * H5:H18)
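For readers who find it easier to follow this logic outside a spreadsheet, here is a minimal Python sketch of the same two steps (zero out the later dates and take the maximum, then zero out all rates except the one on that date). The table is hypothetical: only the 22-Jun-18 rate of 1.1552 comes from the text, and the function names are illustrative assumptions, not part of the accompanying Excel file.

```python
from datetime import date

# Hypothetical shuffled table; only the 22-Jun-18 rate (1.1552) is taken from the text.
dates = [date(2018, 8, 24), date(2018, 7, 20), date(2018, 7, 13),
         date(2018, 6, 1), date(2018, 6, 22), date(2018, 6, 15)]
rates = [1.1710, 1.1690, 1.1640, 1.1660, 1.1552, 1.1600]


def effective_date(dates, target):
    # Like MAX(dates * (dates <= target)): later dates become 0, then take the maximum.
    masked = [d.toordinal() * (d <= target) for d in dates]
    latest = max(masked)
    if latest == 0:
        raise ValueError("no date on or before the target")
    return date.fromordinal(latest)


def rate_on(dates, rates, target):
    found = effective_date(dates, target)
    # Like SUMPRODUCT((dates = found) * rates): every other rate is zeroed out.
    return sum((d == found) * r for d, r in zip(dates, rates))


print(effective_date(dates, date(2018, 6, 25)))  # 2018-06-22
print(rate_on(dates, rates, date(2018, 6, 25)))  # 1.1552
```

As with the SUMPRODUCT formula, this sketch assumes every date appears in the list only once; a duplicated date would add its rates together.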
Lenient lookups
Lenient lookups provide more flexibility in looking up data. Consider an example in cell
D25 which returns the closest date from the list before or on the lookup date (this is
essentially what the formula in cell D22 does, but I want to give you a different
perspective). First we test which dates in our array are less than or equal to the lookup
date (25-Jun-2018):
=D5:D18<=B25
This operation returns the following array:
={TRUE;TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE}
Then we divide the dates array by those logical results:
=D5:D18 / (D5:D18<=B25)
Dividing a number by TRUE or FALSE logical value is equivalent to dividing by 1 or 0
respectively, so the number either remains as it was or transforms to a #DIV/0! error:
={43252;43259;43266;43273;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!}
Numbers in this array are the dates presented in general number format. By doing this
operation we have eliminated all the dates greater than the lookup date.
We use the AGGREGATE function (I will be using it in this file very extensively and will
talk a lot about it) to find the maximum date which remained after we had eliminated
all the dates greater than the lookup date:
=AGGREGATE(14, 6, D5:D18 / (D5:D18<=B25), 1)
Argument 14 means the function works as an equivalent to the LARGE function,
argument 6 means the function ignores errors, and argument 1 at the end
means the function returns the 1st largest value in the array. This returns 22-Jun-2018
as it is the latest (largest) date less than or equal to the lookup date (25-Jun-2018).
Formula in cell D26 works in the opposite way: it keeps only those dates which are
greater than or equal to the lookup date and finds the smallest of those (the first
argument being 15 corresponds to the SMALL function).
As a side note, a combination of INDEX and MATCH with a match type of -1 produces
the same effect but dates must be sorted in descending order. With the above formula
dates can be in any order.
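The divide-and-ignore-errors trick can be mimicked in Python to see why it works. This is only a sketch: the serial numbers are reconstructed from the arrays shown above (weekly dates starting at 43252), and the helper name aggregate_ignoring_errors is a hypothetical stand-in for AGGREGATE with arguments 14 or 15 and option 6.

```python
# Serial numbers of the 14 weekly dates, starting at 43252 (01-Jun-2018).
serials = [43252 + 7 * i for i in range(14)]
lookup = 43276  # 25-Jun-2018


def aggregate_ignoring_errors(values, conditions, k, largest=True):
    kept = []
    for value, condition in zip(values, conditions):
        try:
            kept.append(value / condition)  # value / True == value
        except ZeroDivisionError:           # value / False plays the role of #DIV/0!
            continue
    kept.sort(reverse=largest)
    return kept[k - 1]  # raises IndexError if fewer than k values remain


# Equivalent of AGGREGATE(14, 6, dates / (dates <= lookup), 1)
on_or_before = aggregate_ignoring_errors(serials, [s <= lookup for s in serials], 1)
# Equivalent of AGGREGATE(15, 6, dates / (dates >= lookup), 1)
on_or_after = aggregate_ignoring_errors(
    serials, [s >= lookup for s in serials], 1, largest=False)

print(on_or_before, on_or_after)
```

Here on_or_before is the serial 43273 (22-Jun-2018) and on_or_after is 43280 (29-Jun-2018); because nothing relies on the order of the list, shuffling serials gives the same answers.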
Lastly, the formula in cell D27 just finds the closest date, regardless of direction. It
deducts the lookup date from the array of dates first:
=D5:D18-B27
the result of which is:
={-24;-17;-10;-3;4;11;18;25;32;39;46;53;60;67}
Every number in this array is the number of days between the lookup date and the
corresponding date in the array. We will take the absolute values of those differences:
=ABS(D5:D18-B27)
={24;17;10;3;4;11;18;25;32;39;46;53;60;67}
we put this through the INDEX function - again, this is a technical operation to allow
other functions to deal with this array natively:
=INDEX(ABS(D5:D18-B27),0)
and then find out the minimum of it:
=MIN(INDEX(ABS(D5:D18-B27),0))
the result of which is 3. We then use the MATCH function to determine the position of
value 3 in our array:
=MATCH(MIN(INDEX(ABS(D5:D18-B27),0)), INDEX(ABS(D5:D18-B27),0), 0)
which returns 4.
In the end we will index the array of dates to determine which date is in the 4th cell
from the top:
=INDEX(D5:D18, MATCH(MIN(INDEX(ABS(D5:D18-B27),0)), INDEX(ABS(D5:D18-B27),0), 0))
this is our final formula in cell D27 and the result is 22-Jun-2018 (this date is just 3
days away from the lookup date 25-Jun-2018 and is the closest to it).
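The same chain of steps (absolute differences, minimum, position of the minimum, value at that position) reads as follows in a short Python sketch. The serial numbers are reconstructed from the arrays shown above; the variable names are illustrative.

```python
serials = [43252 + 7 * i for i in range(14)]  # weekly dates starting at 01-Jun-2018
lookup = 43276                                # 25-Jun-2018

distances = [abs(s - lookup) for s in serials]  # ABS(D5:D18-B27)
position = distances.index(min(distances)) + 1  # MATCH(MIN(...), ..., 0), 1-based
closest = serials[position - 1]                 # INDEX(D5:D18, position)

print(min(distances), position, closest)  # 3 4 43273
```

If two dates are equally close, list.index returns the first one, which is also how MATCH with a match type of 0 behaves.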
Formulae in cells D29 and D30 are slight modifications of the formulae in cells D25 and
D26 respectively. They use strict inequality when comparing the lookup date with the
dates in the array. As a result, even if we use a lookup date which exists in the array
(22-Jun-2018), the formulae will return 15-Jun-2018 and 29-Jun-2018, the dates
preceding and following the lookup date in the array.
Formulae in cells E25:E30 use the simple LOOKUP function to find out the rate on the
dates we have found out. As that date would actually exist in the list, the LOOKUP
function will return an exact match for the date.
With shuffled data, formulae to determine the dates (cells G25:G30) are the same as
those for sorted data. Formulae to find the rates use the combination of INDEX and
MATCH functions to find the exact matches (discussed in detail later), as the LOOKUP
function is not suited for unsorted data.
We just need to add absolute references and copy the formula down to cell I10 and our
vehicles are priced now.
This combination of MATCH and INDEX is one of the most powerful and popular lookup
and reference tools - I recommend memorizing it and using it in your analysis. This
combination has many advantages over the VLOOKUP and HLOOKUP functions
(which are frequently used for the same purposes).
We again use the combination of INDEX and MATCH to tie vehicle specs to their
optional equipment using the order numbers (the formulae are in the range J4:J10).
If we add this array up, we will get 2, which means we have 2 optional pieces from the
list found in cell E4. We do this addition with the SUMPRODUCT function, then make
absolute references as needed and copy the formula down to cell F10:
=SUMPRODUCT(--ISNUMBER(SEARCH($C$17:$C$23,J4)))
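As a rough illustration of what this formula counts, the sketch below tests each keyword from a list against one text cell and adds up the hits. The option names and the equipment string are hypothetical, and the sketch ignores the wildcard support that SEARCH has.

```python
def count_keywords(keywords, text):
    # SEARCH is case-insensitive, so compare lower-cased strings.
    found = [keyword.lower() in text.lower() for keyword in keywords]  # ISNUMBER(SEARCH(...))
    return sum(int(flag) for flag in found)                            # SUMPRODUCT(--...)


# Hypothetical list of options and one hypothetical equipment description.
options = ["Sunroof", "Leather seats", "Navigation", "Parking sensors"]
equipment = "navigation, sunroof, alloy wheels"

print(count_keywords(options, equipment))  # 2
```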
All formulae on this sheet use exactly the same principle - you just need to populate the
lists in the range of cells C14:F16, modify references in the formula to absolute or semi-
absolute and copy it down and across. The data is now ready for further analysis.
Practice Examples
Refers to sheet: “Pr Example”
This sheet contains two practical examples we can implement in real life which are
based on what we have learnt so far.
Multi-component
The next example is a bit more complicated as it has several delimiters (spaces) in each
text string (located in cells B10:B16). Column C contains zeroes which are used later for
support purposes (for consistency of the formulae). Starting from column D, I determine
the position of every space character. To do this, I first use the SUBSTITUTE function
which replaces a given character (in this case a space " ") with another character (I
have used the pipe character ("|") as it is not used anywhere else in the descriptions; in
your examples you can use any other character or a string of characters, e.g.
"abc987xyz"). A special and unique feature of the SUBSTITUTE function is that you can
specify which occurrence (instance number) of the character is to be replaced.
These numbers are in the header (cells D9:G9) which feed the
formulae. Once a space is substituted with the pipe, I use the SEARCH function
to find at which position this new character is located. Remember, in every column this
substituted character relates to a particular occurrence of the original space character.
Therefore, the formula returns the position number of a particular space in the text.
The last column here (H) calculates the total length of the strings which is also used
later.
Formulae in range J10:M16 use the MID function which returns a certain number of
characters from the middle of a text string starting from a given position. As the starting
position and the number of characters it uses the numbers we have calculated
previously in cells C10:G16. Formulae in the last column (N10:N16) are very slightly
different in references, and they use the VALUE function again to move from text to
numbers.
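The position-then-MID approach can be sketched in Python as follows. The sample string is hypothetical, the helper names are my own, and the sketch assumes a single-character delimiter; the leading 0 mirrors the column of zeroes and the trailing bound is derived from the total length of the string.

```python
def nth_occurrence(text, delimiter, n):
    """1-based position of the n-th delimiter, or None if there are fewer than n."""
    position = -1
    for _ in range(n):
        position = text.find(delimiter, position + 1)
        if position == -1:
            return None  # Excel's SEARCH would return #VALUE! here
    return position + 1


def split_by_positions(text, delimiter=" "):
    count = text.count(delimiter)
    # 0, then the position of every delimiter, then one past the end of the text.
    bounds = ([0]
              + [nth_occurrence(text, delimiter, i) for i in range(1, count + 1)]
              + [len(text) + 1])
    # Equivalent of MID(text, start + 1, end - start - 1) for each pair of bounds.
    return [text[start:end - 1] for start, end in zip(bounds, bounds[1:])]


print(nth_occurrence("Sedan Comfort 2.0 150", " ", 2))  # 14
print(split_by_positions("Sedan Comfort 2.0 150"))
# ['Sedan', 'Comfort', '2.0', '150']
```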
As a final remark, you can do this transformation using the 'Text to Columns'
functionality in Excel. One of the reasons I have given this alternative approach here is
to show you some text functions.
(the SMALL function) and 14 (the LARGE function). The last argument (1 in both cases)
means the functions need to find the first smallest and largest number from the array
(and if you need to find the second, third etc. smallest or largest number under the
given criteria, just replace 1 with the corresponding number). The errors, again, will be
ignored.
The population standard deviation is calculated by testing again which cars have the sedan
bodystyle (=C5:C38=I5). Another part of the formula (=((F5:F38-M7)^2)/M5) is taken
from the definition of standard deviation - we take every member's price, deduct the
average price from it, and square every difference. Then we add these squares up and
divide by the number of cars. As we need to do the calculation only for the "qualifying"
cars, we divide by the number of sedans found in the first calculation (in cell M5). We
also multiply every square by the first array which tested which of the cars are sedans,
so as to zero out those which are not. We finally add them up (using the SUMPRODUCT
function) and find the square root of the total. This is the standard deviation.
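To make the arithmetic concrete, here is a small Python sketch of the conditional population standard deviation computed from its definition. The five cars and their prices are hypothetical, and the function name is an assumption for illustration only.

```python
import math


def conditional_pstdev(values, flags):
    count = sum(flags)                                        # number of qualifying cars
    mean = sum(v * f for v, f in zip(values, flags)) / count  # their average price
    # Squared deviations, zeroed out for non-qualifying cars, divided by the count.
    variance = sum(f * (v - mean) ** 2 / count for v, f in zip(values, flags))
    return math.sqrt(variance)


# Hypothetical data: bodystyle and price of five cars.
bodystyles = ["Sedan", "Wagon", "Sedan", "Hatchback", "Sedan"]
prices = [20000, 26000, 30000, 18000, 22000]

is_sedan = [b == "Sedan" for b in bodystyles]
sedan_stdev = conditional_pstdev(prices, is_sedan)
print(round(sedan_stdev, 2))
```

The result equals the population standard deviation of the three sedan prices alone (20000, 30000 and 22000), which is what the criteria test is meant to achieve.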
Other statistical indicators (sample standard deviation, skewness, kurtosis) are
calculated below using the same techniques and the formula definitions of the
corresponding functions.
Several conditions, single criteria
In the next example we start using several criteria, and this allows us to use either the
AND logic, or the OR logic. With the AND logic, the selection must contain all the
features we select. With the OR logic, the selection may contain any of the criteria
chosen. If we want to analyze the cars with the base level and sedan bodystyle, we can
take cars which contain both of those features or the cars which contain at least one of
those features.
Analyzing under the AND logic is relatively simple - we use the SUMIFS and COUNTIFS
functions and the AGGREGATE function in a similar way as in the previous example, but
we are doing two tests this time (one checking the levels being equal to the base, and
the other checking the bodystyles for being the sedans), and then we multiply the
resulting arrays:
=(C5:C38=I19)*(B5:B38=I18)
={FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;TRUE;FALSE;TR
UE;TRUE;TRUE;TRUE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;FAL
SE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE})*({TRUE;FALSE;FALSE;FA
LSE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;FALSE;FALSE;
FALSE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE;FALSE;TRUE;FALS
E;TRUE;TRUE;TRUE;TRUE;FALSE}
={0;0;0;0;0;0;0;0;0;0;1;0;1;1;0;0;0;1;0;0;0;0;0;1;1;0;0;0;0;0;0;0;0;0}
You can see these arrays if you enter the formula, then click in the formula and press
the F9 key. To see the second line above (the results of both tests separately before
they are multiplied) you have to select what's inside the first parenthesis, and also
press F9, then do the same for the second parenthesis.
The OR logic is slightly more complicated - we need to check that at least one test result
in the pair of corresponding test results is TRUE, so we add both arrays and then test
which results are greater than zero (just to remind you, the TRUE and FALSE values
become 1 and 0 if we do any mathematical operation on them, including adding them to
or multiplying them by each other):
=(C5:C38=I19)+(B5:B38=I18)>0
=({FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE})
+({TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;FALSE})>0
={1;1;0;0;1;0;1;0;1;1;2;0;2;2;1;1;0;2;1;0;0;1;0;2;2;0;0;1;0;1;1;1;1;1}>0
={TRUE;TRUE;FALSE;FALSE;TRUE;FALSE;TRUE;FALSE;TRUE;TRUE;TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;FALSE;TRUE;TRUE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;FALSE;FALSE;TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;TRUE}
In the last step we have obtained an array of TRUE and FALSE values which tell us
which vehicles contain either of the features. We are using this array in the formulas
directly, same as in the AND logic examples, except for the first formula which counts
them: because the SUMPRODUCT function cannot add the TRUE and FALSE values
directly, we have to convert them into ones and zeroes by putting a double minus before
them (this is also a mathematical operation).
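The AND and OR combinations can be replayed in Python on the very test results printed above (1 stands for TRUE, 0 for FALSE). This is only a sketch of the logic; the variable names are illustrative.

```python
# Test results transcribed from the two arrays above (1 = TRUE, 0 = FALSE).
is_sedan = [c == "1" for c in ("0100001011" "1011110100" "0101100000" "0001")]
is_base = [c == "1" for c in ("1000100000" "1011000110" "0001100101" "1110")]

# AND logic: multiplying the tests leaves 1 only where both are TRUE.
and_flags = [s * b for s, b in zip(is_sedan, is_base)]
# OR logic: adding the tests and checking the sum is greater than zero.
or_flags = [(s + b) > 0 for s, b in zip(is_sedan, is_base)]

and_count = sum(and_flags)
# int() plays the role of the double minus that SUMPRODUCT needs.
or_count = sum(int(flag) for flag in or_flags)

print(and_count, or_count)  # 6 22
```

Six cars satisfy both conditions and 22 satisfy at least one of them, which can be checked by counting the 1s and the TRUE values in the arrays shown above.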
Multiple criteria
The last three examples (on the right) allow for multiple criteria for each feature (i.e.
the level can be comfort or luxury, and the bodystyle can be sedan or wagon, or any
other combination).
These formulae rely on a somewhat non-standard use of the MATCH function: we pass
an array of elements as the lookup value argument and match it against a short list of
criteria (the lookup array). Looking at cell T5:
=MATCH(B5:B38,P5:P7,0)
={1;2;2;#N/A;1;2;2;2;2;2;1;#N/A;1;1;2;2;#N/A;1;1;2;#N/A;2;#N/A;1;1;2;#N/A;1;2;1;1;1;1;2}
A number in this array means that the corresponding cell in the range B5:B38 contains
one of the values from the range P5:P7, and an error #N/A means it has not found any.
Using the ISNUMBER function converts these values into TRUE and FALSE logical
values:
=ISNUMBER(MATCH(B5:B38,P5:P7,0))
={TRUE;TRUE;TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;FALSE;TRUE;TRUE;TRUE;FALSE;TRUE;FALSE;TRUE;TRUE;TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE}
The latter array is then handled in a way described above.
In the very last example on this sheet ('Several conditions, multiple criteria - 2') I have
introduced a criterion which must NOT be present in the data (we are selecting the vehicles
the bodystyle of which is NOT sedan, i.e. it is a hatchback or a wagon). In this case
instead of the ISNUMBER function we are using the ISERROR function: the MATCH
function will return the #N/A errors for those values which it could not match, and
these errors will be converted into TRUEs, while the numbers (in the instances where
the MATCH function did find the match) will become FALSEs. The rest of the
formulae are unchanged.
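In plain terms, ISNUMBER(MATCH(...)) is a membership test and ISERROR(MATCH(...)) is its negation. The Python sketch below shows both on hypothetical data; the helper names are my own, and the comparison is lower-cased because MATCH does not distinguish case.

```python
def matches_any(values, criteria):
    # Equivalent of ISNUMBER(MATCH(values, criteria, 0)).
    wanted = {c.lower() for c in criteria}
    return [v.lower() in wanted for v in values]


def matches_none(values, criteria):
    # Equivalent of ISERROR(MATCH(values, criteria, 0)).
    return [not flag for flag in matches_any(values, criteria)]


# Hypothetical data for four cars.
levels = ["Base", "Comfort", "Luxury", "Comfort"]
bodystyles = ["Sedan", "Wagon", "Hatchback", "Sedan"]

print(matches_any(levels, ["Base", "Comfort"]))  # [True, True, False, True]
print(matches_none(bodystyles, ["Sedan"]))       # [False, True, True, False]
```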
Once you understand this toolset you will be able to do any other sort of analysis you
need in this format.
returning 143,050
We do other statistics in a similar way. The arrays returned by the AGGREGATE
functions can be processed by other Excel functions if we put those arrays under the
INDEX function with a second argument of zero. In this section, as we have a
sufficiently high number of vehicles for the analysis, I have added the calculation of
skewness and kurtosis using their native Excel functions.
=ISNUMBER(MATCH(F3:F36,AGGREGATE(14,6,F3:F36/(B3:B36=I21),INDEX(ROW(INDIRECT("1:"&I22)),0)),0))*(B3:B36=I21)*(C3:C36=I24)
which results in:
={0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1;0;1;1;1;1;0;0;0;0;0;0;0;0;0}
In this list every "1" corresponds to a car which satisfies all three criteria. We add
them up with the SUMPRODUCT function and get 5.
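Reading the formula above from the inside out, it first takes the N highest prices within one level and then keeps only the cars that are in that top N and also meet the other criteria. A minimal Python sketch of that reading, on hypothetical data with assumed names, could look like this:

```python
def top_n_then_criteria(prices, levels, bodystyles, level, n, bodystyle):
    # AGGREGATE(14, 6, prices / (levels = level), {1..n}): the n largest prices in the level.
    top_prices = sorted(
        (p for p, l in zip(prices, levels) if l == level), reverse=True)[:n]
    # ISNUMBER(MATCH(price, top_prices, 0)) * (level test) * (bodystyle test)
    return [int(p in top_prices and l == level and b == bodystyle)
            for p, l, b in zip(prices, levels, bodystyles)]


# Hypothetical data for six cars.
prices = [30000, 25000, 40000, 35000, 20000, 45000]
levels = ["Luxury", "Luxury", "Base", "Luxury", "Base", "Luxury"]
bodystyles = ["Sedan", "Wagon", "Sedan", "Sedan", "Sedan", "Wagon"]

flags = top_n_then_criteria(prices, levels, bodystyles, "Luxury", 3, "Sedan")
print(flags, sum(flags))  # [1, 0, 0, 1, 0, 0] 2
```

The three most expensive Luxury cars cost 45000, 35000 and 30000; two of them are sedans, hence the count of 2.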
All other statistical calculations rely on this algorithm of criteria selection and are done in
the same way as in the first example.
Multiple criteria
The three examples to the right show how to use a range of criteria for the analysis. If
you look at cell Q5, the formula relies on the SEARCH function again but this time we
will create a matrix of results consisting of 3 columns (the number of criteria in the
range M5:M7) and 34 rows (corresponding to the number of models in the range
B5:B38). As both ranges are vertical arrays, we will need to transpose one into a
horizontal one to construct such a matrix. As you can guess, we will use the
TRANSPOSE function for that.
If we take a reference =M5:M7 and F9 it ("to F9" became a verb on its own in modern
English) we will see:
={"Base";"Comfort";0}
As we are going to use this array with the SEARCH function shortly we need to do
something about the empty cell M7, which is currently not filled with any criteria but
which we want to keep for flexibility in the future. If we use this blank cell with the SEARCH
function, the latter will find it in every cell we test, so we will do a small trick:
=IF(ISBLANK(M5:M7),"|||",M5:M7)
which returns:
={"Base";"Comfort";"|||"}
We have just replaced an empty cell with three pipe symbols ("|") which I have chosen
because it is very unlikely to be found in any text; you can use your own combination of
any symbols you wish.
Notice the values are separated by semicolons, which represent end of lines in an array;
so our array has three rows. If we now use the TRANSPOSE function we will get:
=TRANSPOSE(IF(ISBLANK(M5:M7),"|||",M5:M7))
={"Base","Comfort","|||"}
Instead of semicolons we now have values separated by commas, which separate
columns in matrices; so this is now an array of one row and three columns. We will use
this array with the array of car descriptions and the SEARCH function:
=SEARCH(TRANSPOSE(IF(ISBLANK(M5:M7),"|||",M5:M7)),B5:B38)
This results in:
={1,#VALUE!,#VALUE!;#VALUE!,1,#VALUE!;#VALUE!,1,#VALUE!;#VALUE!,#VALUE!,#V
ALUE!;1,#VALUE!,#VALUE!;#VALUE!,1,#VALUE!;#VALUE!,1,#VALUE!;#VALUE!,1,#VAL
UE!;#VALUE!,1,#VALUE!;#VALUE!,1,#VALUE!;1,#VALUE!,#VALUE!;#VALUE!,#VALUE!,
#VALUE!;1,#VALUE!,#VALUE!;1,#VALUE!,#VALUE!;#VALUE!,1,#VALUE!;#VALUE!,1,#V
ALUE!;#VALUE!,#VALUE!,#VALUE!;1,#VALUE!,#VALUE!;1,#VALUE!,#VALUE!;#VALUE!
,1,#VALUE!;#VALUE!,#VALUE!,#VALUE!;#VALUE!,1,#VALUE!;#VALUE!,#VALUE!,#VAL
UE!;1,#VALUE!,#VALUE!;1,#VALUE!,#VALUE!;#VALUE!,1,#VALUE!;#VALUE!,#VALUE!,
#VALUE!;1,#VALUE!,#VALUE!;#VALUE!,1,#VALUE!;1,#VALUE!,#VALUE!;1,#VALUE!,#V
ALUE!;1,#VALUE!,#VALUE!;1,#VALUE!,#VALUE!;#VALUE!,1,#VALUE!}
It's a long array but you can see that every three members are separated by semicolons,
and there are 34 such groups; this means we have an array of 3 columns and 34 rows.
Applying our standard procedure to separate numbers from errors:
=ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(M5:M7),"|||",M5:M7)),B5:B38))
={TRUE,FALSE,FALSE;FALSE,TRUE,FALSE;FALSE,TRUE,FALSE;FALSE,FALSE,FALSE;TR
UE,FALSE,FALSE;FALSE,TRUE,FALSE;FALSE,TRUE,FALSE;FALSE,TRUE,FALSE;FALSE,T
RUE,FALSE;FALSE,TRUE,FALSE;TRUE,FALSE,FALSE;FALSE,FALSE,FALSE;TRUE,FALSE,
FALSE;TRUE,FALSE,FALSE;FALSE,TRUE,FALSE;FALSE,TRUE,FALSE;FALSE,FALSE,FALS
E;TRUE,FALSE,FALSE;TRUE,FALSE,FALSE;FALSE,TRUE,FALSE;FALSE,FALSE,FALSE;FA
LSE,TRUE,FALSE;FALSE,FALSE,FALSE;TRUE,FALSE,FALSE;TRUE,FALSE,FALSE;FALSE,T
RUE,FALSE;FALSE,FALSE,FALSE;TRUE,FALSE,FALSE;FALSE,TRUE,FALSE;TRUE,FALSE,
FALSE;TRUE,FALSE,FALSE;TRUE,FALSE,FALSE;TRUE,FALSE,FALSE;FALSE,TRUE,FALSE
}
and putting a double negative in front:
=--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(M5:M7),"|||",M5:M7)),B5:B38))
={1,0,0;0,1,0;0,1,0;0,0,0;1,0,0;0,1,0;0,1,0;0,1,0;0,1,0;0,1,0;1,0,0;0,0,0;1,0,0;1,0,0;0,1,0;0,1
,0;0,0,0;1,0,0;1,0,0;0,1,0;0,0,0;0,1,0;0,0,0;1,0,0;1,0,0;0,1,0;0,0,0;1,0,0;0,1,0;1,0,0;1,0,0;1,0
,0;1,0,0;0,1,0}
Still the same array of 3 columns and 34 rows, but this time we have brought it down to
ones and zeroes. Every 1 represents an instance where one of the criteria (in the range
M5:M7) was found in the list B5:B38. Each description can match at most one criterion (a
car cannot be both base and comfort level), so if we add up these numbers this will give
us the total number of cars satisfying the criteria:
=SUM(--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(M5:M7),"|||",M5:M7)),B5:B38)))
This is the formula in cell Q5 which returns 28. It must be array-entered
(Ctrl+Shift+Enter). This criteria selection is used in all other formulae in this example.
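The construction of this matrix, including the reason for the "|||" placeholder, can be tried out in a few lines of Python. The descriptions below are hypothetical and the function name search_matrix is an assumption; the criteria list mirrors the two filled cells and one blank cell described above.

```python
def search_matrix(descriptions, criteria, placeholder="|||"):
    # IF(ISBLANK(criteria), "|||", criteria): a blank would be "found" in every text.
    filled = [c if c else placeholder for c in criteria]
    # One row per description, one column per criterion: --ISNUMBER(SEARCH(...)).
    return [[int(c.lower() in d.lower()) for c in filled] for d in descriptions]


# Hypothetical descriptions; the third criterion is left blank on purpose.
descriptions = ["Sedan Base 1.6", "Wagon Comfort 2.0", "Hatchback Luxury 2.0"]
criteria = ["Base", "Comfort", ""]

matrix = search_matrix(descriptions, criteria)
total = sum(sum(row) for row in matrix)  # SUM over the whole matrix
print(matrix, total)  # [[1, 0, 0], [0, 1, 0], [0, 0, 0]] 2

# Without the placeholder, the blank criterion matches every description.
naive_total = sum(c.lower() in d.lower() for d in descriptions for c in criteria)
print(naive_total)  # 5
```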
Moving on to the next examples (with several conditions), we will need to deal with
several matrices in one formula. The matrices might have different dimensions and
cannot be combined directly. So we will collapse them from two-dimensional
matrices into single-column arrays using the MMULT function. In the example (developed
further in cell P18) we start with
=--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(M18:M20),"|||",M18:M20)),B5:B38))
which results in a matrix of 3 columns and 34 rows:
={1,0,0;0,0,0;0,0,0;0,0,0;1,0,0;0,0,0;0,0,0;0,0,0;0,0,0;0,0,0;1,0,0;0,0,0;1,0,0;1,0,0;0,0,0;0,0
,0;0,0,0;1,0,0;1,0,0;0,0,0;0,0,0;0,0,0;0,0,0;1,0,0;1,0,0;0,0,0;0,0,0;1,0,0;0,0,0;1,0,0;1,0,0;1,0
,0;1,0,0;0,0,0}
We have done this in previous examples. This time we will do
=MMULT(--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(M18:M20),"|||",M18:M20)),B5:B38)),{1;1;1})
Using the MMULT function to multiply an array of 34 rows and 3 columns by an array of
3 rows and 1 column results in an array of 34 rows and 1 column:
={1;0;0;0;1;0;0;0;0;0;1;0;1;1;0;0;0;1;1;0;0;0;0;1;1;0;0;1;0;1;1;1;1;0}
As our second multiplier (the small matrix) consists entirely of 1s, each value in the
resulting matrix is the sum of the corresponding row (three members) of the larger array;
here it is 1 if one of the members in the row was 1 and 0 otherwise (you can read more
about the MMULT function in the Excel Help).
Finally, to quickly generate an array of a given size consisting of 1s, I am using the ROW
function first which returns numbers of rows in a reference:
=ROW(M18:M20)
={18;19;20}
Raising these numbers to the power 0 converts them (as any other number) into 1s:
=ROW(M18:M20)^0
={1;1;1}
The final formula in cell P18 multiplies both arrays (to comply with the AND logic) and
adds up the products:
=SUM(MMULT(--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(M18:M20),"|||",M18:M20)),B5:B38)),ROW(M18:M20)^0)
*MMULT(--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(M21:M23),"|||",M21:M23)),B5:B38)),ROW(M21:M23)^0))
This approach is used in all other formulae throughout this example.
In the next example ('Several conditions, multiple criteria - 2') we are using the same
approach but, as the second condition is exclusive (NOT sedan) we are using the
ISERROR instead of the ISNUMBER function. Note this time we are not doing anything
with unused empty cells (cells M35 and M36 in our example) so that they are counted
first and then excluded by the ISERROR function.
“Flexible” selection
The last example on this sheet ("Flexible selections") allows you to choose any features, or
keywords, and performs the analysis based on the AND or the OR logic. The OR logic
formulae follow the same principle as other formulae; the AND logic formulae make use
of a particular property of matrix multiplication. Considering our example (in cell W5),
the =--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(T5:T7),"|||",T5:T7)),B5:B38))
part of it returns the following matrix (I have added line breaks after each semicolon
for the sake of legibility - it is important to understand how it all works):
={1,0,0;
0,1,0;
0,0,0;
0,0,0;
1,0,0;
0,0,0;
0,1,0;
0,0,0;
0,1,0;
0,1,0;
1,1,0;
0,0,0;
1,1,0;
1,1,0;
0,1,0;
0,1,0;
0,0,0;
1,1,0;
1,0,0;
0,0,0;
0,0,0;
0,1,0;
0,0,0;
1,1,0;
1,1,0;
0,0,0;
0,0,0;
1,0,0;
0,0,0;
1,0,0;
1,0,0;
1,0,0;
1,0,0;
0,1,0}
Every line has three elements corresponding to the three cells in the range T5:T7. A 1
means that a particular cell content is found in the corresponding cell in the range
B5:B38. If we multiply this array by ={1;1;1} (represented by the ROW(T5:T7)^0 part
of formula) we will get a one-dimensional array:
=MMULT(--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(T5:T7),"|||",T5:T7)),B5:B38)),ROW(T5:T7)^0)
={1;1;0;0;1;0;1;0;1;1;2;0;2;2;1;1;0;2;1;0;0;1;0;2;2;0;0;1;0;1;1;1;1;1}
This array is a vertical array but I have not made line breaks this time. You may notice
that essentially this operation adds up the values within each line of the original array and
makes this sum a value in the new array. For instance, the first two values being 1
means that only one of the conditions is satisfied in each of the first two cells, and so on.
We are interested in those instances where all conditions are satisfied and, as we have two
conditions, in the values equal to 2. So we modify our formula further:
=(MMULT(--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(T5:T7),"|||",T5:T7)),B5:B38)),ROW(T5:T7)^0)=COUNTA(T5:T7))
={FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;
TRUE;TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRU
E;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE}
Then we convert the logical values into 1s and 0s by putting a double minus sign in
front of the array, and sum those values up with the SUM function (the formula must be
array-entered with Ctrl+Shift+Enter):
=SUM(--(MMULT(--ISNUMBER(SEARCH(TRANSPOSE(IF(ISBLANK(T5:T7),"|||",T5:T7)),B5:B38)),ROW(T5:T7)^0)=COUNTA(T5:T7)))
and the total count is 6. Other formulae on this sheet are based on the same criteria
selection.
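The AND logic above can be restated outside Excel. The following is a minimal Python
sketch, not the workbook's formula: the sample descriptions and keywords are
hypothetical, and a plain substring test stands in for SEARCH (which, unlike this
sketch, also accepts wildcards).

```python
def count_rows_matching_all(cells, keywords):
    """Count cells that contain every non-blank keyword (case-insensitive)."""
    # Blank keywords are dropped here; the Excel formula instead swaps them
    # for "|||" so that they never match and are not counted by COUNTA.
    active = [k.lower() for k in keywords if k.strip()]
    # One row of 1s and 0s per cell, one column per keyword (the hit matrix).
    hits = [[int(k in cell.lower()) for k in active] for cell in cells]
    # Row sums play the role of MMULT(matrix, column of ones).
    row_sums = [sum(row) for row in hits]
    # A cell qualifies when its row sum equals the number of keywords.
    return sum(int(s == len(active)) for s in row_sums)


# Hypothetical sample data, not taken from the workbook.
cells = [
    "Base sedan 1.6 manual",
    "Comfort hatchback 1.6 manual",
    "Luxury hatchback 2.0 automatic",
    "Comfort sedan 2.0 manual",
]
keywords = ["hatchback", "manual", ""]
result = count_rows_matching_all(cells, keywords)
print(result)
```

Only the second description contains both keywords, so the sketch counts one match.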
Once again, all formulae in these last four examples must be array-entered
(Ctrl+Shift+Enter). Also make sure that no range, except in the last example, mixes
criteria of different kinds (e.g. range M21:M23 must contain only bodystyles;
do not mix them with other features, as this will likely lead to errors).
The total is 143,300.
To count the vehicles sold in a particular year, we just skip adding the prices; instead,
we put a double minus before the test to convert the results into 1s and 0s and add
them up again:
=SUMPRODUCT(--(YEAR(A5:A38)=I4))
=SUMPRODUCT(--({FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;
FALSE;FALSE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;
FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;TRUE}))
=SUMPRODUCT({0;0;0;1;0;0;0;1;0;0;0;0;0;0;1;1;0;0;0;0;1;0;0;0;0;0;1;0;0;0;0;1;1;1})
which returns 9.
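For comparison, here is a minimal Python sketch of the same counting idea. The sale
dates are hypothetical, and int() plays the role that the double minus plays in the
Excel formula.

```python
from datetime import date

# Hypothetical sale dates; the workbook's real dates are not reproduced here.
sale_dates = [
    date(2015, 2, 10),
    date(2016, 7, 4),
    date(2016, 11, 21),
    date(2017, 3, 9),
]
target_year = 2016

# Each comparison yields True or False; int() converts it into 1 or 0.
flags = [int(d.year == target_year) for d in sale_dates]
cars_sold = sum(flags)
print(flags, cars_sold)
```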
Grouping by years and additional criteria
In the second exercise we also get sales by years, but in addition we will break down the
cars by their equipment level (Base, Comfort, Luxury). The formula is not much
different: we just add one more criterion testing the level in each line, so the prices of
the cars of other levels are zeroed out:
=SUMPRODUCT(F5:F38*(YEAR(A5:A38)=I4)*(B5:B38=H11))
In the formula which counts the cars we no longer need a double minus: its purpose
was to perform a mathematical operation on the TRUEs and FALSEs, which converts
them into 1 and 0 respectively. But now we are multiplying two arrays of logical
values, which does the same.
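The same behaviour can be seen in Python, where multiplying two booleans also gives
1 or 0. This is only an illustrative sketch: the rows below are hypothetical and
sum_and_count is a helper invented for this example.

```python
# Hypothetical rows: (year of sale, equipment level, price).
sales = [
    (2016, "Base", 10800),
    (2016, "Comfort", 16650),
    (2017, "Base", 12100),
    (2016, "Base", 19000),
]


def sum_and_count(rows, year, level):
    """Return (total price, number of cars) for one year and one level."""
    total = 0
    count = 0
    for row_year, row_level, price in rows:
        # Multiplying two booleans gives 1 only when both tests pass.
        keep = (row_year == year) * (row_level == level)
        total += price * keep  # prices of non-matching rows are zeroed out
        count += keep          # no extra conversion is needed for counting
    return total, count


total, count = sum_and_count(sales, 2016, "Base")
print(total, count)
```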
The next steps are no different to what we have done before: we do a test of whether
the quarter number is equal to the given quarter, combine it with the test of the year,
multiply by the prices and add up the results:
=SUMPRODUCT(F5:F38*(YEAR(A5:A38)=I22)*(LOOKUP(MONTH(A5:A38),{1,4,7,10;1,2,3,4})=H23))
=SUMPRODUCT({13900;12050;10400;10800;12100;12600;18500;16650;18750;19450;18900;15700;
17900;17550;19000;14300;12750;14300;13400;15350;14850;17950;16200;15500;18300;
15300;15850;14400;14600;17450;16100;15250;18850;17750}
*({FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;
FALSE;FALSE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;
FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;TRUE})
*({TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;
TRUE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;
FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE}))
=SUMPRODUCT({0;0;0;0;0;0;0;0;0;0;0;0;0;0;19000;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0})
From the last evaluation step we can see that only one car was sold in the first quarter
of 2016, for 19,000, so the result of the formula is 19,000.
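The quarter lookup used in this formula can be sketched in Python with the standard
bisect module. This is an analogy rather than a translation of LOOKUP: the
QUARTER_STARTS list mirrors the first row of the {1,4,7,10;1,2,3,4} constant, and the
function name quarter is my own.

```python
from bisect import bisect_right

QUARTER_STARTS = [1, 4, 7, 10]  # first month of each quarter, as in the formula


def quarter(month):
    """Return the quarter number (1 to 4) for a month number (1 to 12)."""
    # Like LOOKUP, find the last start month that does not exceed `month`;
    # its 1-based position in the list is the quarter number.
    return bisect_right(QUARTER_STARTS, month)


quarters = [quarter(m) for m in range(1, 13)]
print(quarters)
```

Because the start months are sorted, the position of the last start month not
exceeding the given month is exactly the quarter number, which is what the second
row of the array constant supplies in Excel.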
For the third and the fourth quarters I have used Roman numerals. To check the
quarters we have obtained against those Roman numerals we will use the ROMAN
function, which converts Arabic numbers into Roman numerals (cell I25):
=ROMAN(LOOKUP(MONTH(A5:A38),{1,4,7,10;1,2,3,4}))
={"I";"IV";"IV";"III";"III";"IV";"III";"IV";"III";"II";"IV";"I";"I";"III";"I";"IV";"II";"III";"I";"IV";
"III";"II";"II";"IV";"I";"IV";"IV";"II";"IV";"I";"III";"II";"IV";"IV"}
The rest of the formula has not changed. Starting from Excel 2013, the ARABIC
function is available, which does the opposite: it converts Roman numerals into Arabic
numbers. So instead of converting the whole array from Arabic into Roman, you can
convert just the header of your line from Roman into Arabic.
={2;11;12;7;8;11;8;11;9;4;10;2;2;7;3;11;5;8;2;10;8;5;4;12;1;12;11;4;12;3;8;4;12;11}
We test the years in the same way as before. Finally, we multiply the product of both
arrays by the price array and get the total sales for a particular month in a particular
year:
=SUMPRODUCT((MONTH(A5:A38)=H31)*(YEAR(A5:A38)=I30)*F5:F38)
If the months are expressed as three-letter abbreviations we will use a different
approach. We will format our dates to show only the months using the TEXT function.
See how this is done in e.g. cell I35:
=TEXT(A5:A38,"mmm")
={"Feb";"Nov";"Dec";"Jul";"Aug";"Nov";"Aug";"Nov";"Sep";"Apr";"Oct";"Feb";"Feb";"Jul";
"Mar";"Nov";"May";"Aug";"Feb";"Oct";"Aug";"May";"Apr";"Dec";"Jan";"Dec";"Nov";"Apr"
;"Dec";"Mar";"Aug";"Apr";"Dec";"Nov"}
The format "mmm" means showing month names reduced to three letters, as you can
see in the second line showing formula evaluation. In the last section of this table
(which deals with full month names) we are using the format "mmmm" (in e.g. cell I39):
=TEXT(A5:A38,"mmmm")
={"February";"November";"December";"July";"August";"November";"August";
"November";"September";"April";"October";"February";"February";"July";"March";
"November";"May";"August";"February";"October";"August";"May";"April";"December";
"January";"December";"November";"April";"December";"March";"August";"April";
"December";"November"}
The rest of the formulae in both sections are the same as in the first section: we test
the year, multiply both tests by each other and by the price array, and add up the products.
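As a rough Python counterpart of the two TEXT formats, strftime offers "%b" and "%B".
This is a sketch under assumptions: the rows are hypothetical, and the English month
names rely on Python's default C locale for dates.

```python
from datetime import date

# Hypothetical rows: (sale date, price).
sales = [
    (date(2016, 2, 15), 13900),
    (date(2016, 11, 3), 12050),
    (date(2017, 2, 20), 15700),
]

# "%b" is comparable to the "mmm" format, "%B" to "mmmm".
# The names are locale-dependent; English is assumed here.
short_names = [d.strftime("%b") for d, _ in sales]
full_names = [d.strftime("%B") for d, _ in sales]

# Test the month name and the year, multiply by the price and add up.
feb_2016_total = sum(
    price * (d.strftime("%b") == "Feb") * (d.year == 2016) for d, price in sales
)
print(short_names, full_names, feb_2016_total)
```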
Grouping by weekdays
In the last example I am grouping sales by weekdays, and the approach is quite similar:
if a weekday is expressed as a number, I use the WEEKDAY function (following the
European convention under which the first day of the week is Monday). If weekdays
are given as three-letter abbreviations or full names, I use the TEXT function with
"ddd" or "dddd" respectively as the second argument.
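A small Python sketch of the three weekday representations follows. It assumes the
Monday-first numbering corresponds to WEEKDAY with a return type of 2 (the text does
not state which return type the workbook uses), and it relies on Python's default C
locale for the English day names.

```python
from datetime import date

d = date(2019, 12, 2)  # a Monday, chosen only for illustration

day_number = d.isoweekday()   # Monday = 1 ... Sunday = 7
short_name = d.strftime("%a")  # comparable to the "ddd" format
full_name = d.strftime("%A")   # comparable to the "dddd" format
print(day_number, short_name, full_name)
```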
As a last remark, if you have all the features combined in one cell (as we had in the "In-
cell" sheet), you can use the ISNUMBER(SEARCH(...)) construction again to do this
analysis by dates.
Single criteria
We will start with the simplest one, Summary 1, which aggregates the numbers by level.
We will yet again use the SUMPRODUCT function, and this time its first argument will be
the 2D range (E6:G10) filled with the revenue information. Excel treats this range as a
5x3 array, and if you enter =E6:G10 into a cell and press F9 while editing the cell
or the formula bar you will see this:
={47150,24700,70050;0,73350,0;46050,57900,45550;0,67950,0;32050,0,67950}
Notice the line items are separated with commas (","), and lines with semicolons (";").
For the sake of convenience, going forward I will be adding a line break after each
semicolon:
={47150,24700,70050;
0,73350,0;
46050,57900,45550;
0,67950,0;
32050,0,67950}
We will take this array and multiply it by a vertical array of five rows containing
the checks of whether the cars belong to the Base series:
=E6:G10*(B6:B10=B15)
This will return an array as follows:
={47150,24700,70050;
0,73350,0;
0,0,0;
0,0,0;
0,0,0}
In the original table the Base series is the first and the second line, so the formula has
kept the values in those lines (they have been multiplied by the TRUE values from the
array with checks which is equivalent to multiplying by 1) and zeroed out the other
lines (multiplying them by the FALSE values which is the same as multiplying by 0).
Lastly, we wrap this array in the SUMPRODUCT function, which adds up all the values
and returns 215,250.
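The row masking can be checked with a short Python sketch using the revenue array
shown above. Only the fact that the first two lines belong to the Base series comes
from the text; the remaining level labels are assumptions made for illustration.

```python
# Revenue array as displayed by Excel for E6:G10 (five rows, three columns).
revenue = [
    [47150, 24700, 70050],
    [0, 73350, 0],
    [46050, 57900, 45550],
    [0, 67950, 0],
    [32050, 0, 67950],
]
# Rows 1 and 2 are Base according to the text; the other labels are assumed.
levels = ["Base", "Base", "Comfort", "Comfort", "Luxury"]

# Vertical array of checks, one boolean per row.
row_mask = [level == "Base" for level in levels]
# Each row is multiplied by its own check: True keeps it, False zeroes it out.
masked = [[value * keep for value in row] for row, keep in zip(revenue, row_mask)]
total = sum(sum(row) for row in masked)
print(masked, total)
```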
Multiple criteria, one dimension
In Summary 2 we will add another column with a split by bodystyle, and the formula
will need an extra check for that. In the source data table bodystyles are located in the
header, which means they represent a horizontal array. So now we will be multiplying
our number array by a vertical and a horizontal array of logical values, and the
formula =E6:G10*(E5:G5=F15)*(B6:B10=E15) will result in:
={47150,0,0;
0,0,0;
0,0,0;
0,0,0;
0,0,0}
The operation has kept only one number (at the intersection of the tests satisfying both
conditions). So the SUMPRODUCT of this array returns 47,150.
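A minimal Python sketch of combining a vertical and a horizontal check is given below.
The revenue numbers are those shown earlier; the bodystyle header labels and the level
labels other than Base are hypothetical, and sum_where is a helper invented for this
example.

```python
revenue = [
    [47150, 24700, 70050],
    [0, 73350, 0],
    [46050, 57900, 45550],
    [0, 67950, 0],
    [32050, 0, 67950],
]
levels = ["Base", "Base", "Comfort", "Comfort", "Luxury"]  # partly assumed
bodystyles = ["Hatchback", "Sedan", "Wagon"]               # assumed header labels


def sum_where(level, bodystyle):
    """Add up the values whose row level and column bodystyle both match."""
    total = 0
    for row, row_level in zip(revenue, levels):
        for value, col_style in zip(row, bodystyles):
            # The row check and the column check are multiplied, so a value
            # survives only at the intersection of two TRUEs.
            total += value * (row_level == level) * (col_style == bodystyle)
    return total


base_first_column = sum_where("Base", "Hatchback")
print(base_first_column)
```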
={FALSE,TRUE,FALSE}
The resulting array is:
={0,24700,0;
0,0,0;
0,0,0;
0,0,0;
0,0,0}
giving a total of 24,700 (base sedans with manual transmissions) - this is the only
number located at the intersection of all the TRUEs.
Single criteria
The first example (in columns E and F) makes a selection based on a single criterion (in
cell E4). We start looking at the formula with the familiar part
=ISNUMBER(SEARCH(E4,B10:B43)), which checks which cells in the source range contain
the value in the criteria cell:
={FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;
FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;
TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;TRUE;FALSE;FALSE}
Then we take a range with ordinal numbers (in column A) corresponding to the source
range entries, and divide these values by the array above. Dividing some number by
TRUE is equivalent to dividing it by 1 and leaves it unchanged; dividing it by FALSE is
the same as dividing it by zero and returns an error. So our modified formula
=A10:A43/ISNUMBER(SEARCH(E4,B10:B43)) will produce:
={#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;8;#DIV/0!;#DIV/0!;
#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;17;#DIV/0!;#DIV/0!;#DIV/0!;21;
#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;26;#DIV/0!;28;29;30;31;32;#DIV/0!;#DIV/0!}
As you see, it has kept only the position numbers of those entries which contain
"Hatchback", according to our criteria. We will need to extract those numbers, one by
one, based on their ranking, using the AGGREGATE function with 15 as the first argument
(which makes it work as the SMALL function) and 6 as the second argument (which makes
it ignore errors). The last argument (relative position) is taken from column A which, as
we recall, contains consecutive numbers:
=AGGREGATE(15,6,A10:A43/ISNUMBER(SEARCH(E4,B10:B43)),A10)
This function returns 8, the smallest number in the above array. With the
help of the INDEX function we will retrieve the 8th entry in the source data:
=INDEX(B10:B43,AGGREGATE(15,6,A10:A43/ISNUMBER(SEARCH(E4,B10:B43)),A10))
and the answer is "Comfort hatchback 1.6 manual".
As a final step, we will wrap this expression in the IFERROR function (because once all
the valid numbers left after the division have been used up, the formula will start
returning errors), make the references absolute as needed and copy the formula down to
cell E43:
=IFERROR(INDEX($B$10:$B$43,AGGREGATE(15,6,$A$10:$A$43/ISNUMBER(SEARCH($E$4,$B$10:$B$43)),A10)),0)
Formulae in the neighboring column F are the same, except that they refer to column C
as the first argument of INDEX and so return the car prices.
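The extraction logic can be sketched in Python as follows. It is an illustration of the
idea rather than the workbook's method: the descriptions and prices are hypothetical,
extract_matches is an invented helper, and a substring test stands in for SEARCH.

```python
def extract_matches(descriptions, prices, keyword):
    """Return (description, price) pairs for entries containing the keyword."""
    # 1-based positions of matching entries, like the numbers kept in column A.
    # Non-matching entries are simply skipped, which stands in for the
    # #DIV/0! errors that AGGREGATE is told to ignore.
    positions = [
        i
        for i, text in enumerate(descriptions, start=1)
        if keyword.lower() in text.lower()
    ]
    # Positions are already ascending, so the k-th item is the k-th smallest,
    # and indexing with it plays the role of INDEX.
    return [(descriptions[p - 1], prices[p - 1]) for p in positions]


# Hypothetical sample data, not taken from the workbook.
descriptions = [
    "Base sedan 1.6 manual",
    "Comfort hatchback 1.6 manual",
    "Luxury wagon 2.0 automatic",
    "Base hatchback 1.4 manual",
]
prices = [13900, 16650, 18750, 12050]
matches = extract_matches(descriptions, prices, "Hatchback")
print(matches)
```

In the sketch the result list simply ends when the matches run out; in the worksheet
the same situation produces errors, which is why the formula needs IFERROR.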
What’s Next?
Descriptive Statistics for Grouped (Weighted) Data
Excel has a powerful set of tools to perform statistical analysis, but they apply only to
ungrouped data. As an example, the AVERAGE function calculates a simple average of
ungrouped numbers, but calculating a weighted average requires a different approach.
This review compiles the formulas to perform statistical calculations for grouped data.
They cover the same (and even greater) scope as Excel’s native statistical functions.
The file includes calculations for:
– single-array data (average, median, variance, standard deviation, percentiles,
skewness, kurtosis)
– dual-array data (covariance, correlation, standard error, linear trend).
I have also given an overview of the basics of statistics and how it applies to financial
analysis and included a method to assign weights to your numbers, depending on their
relevance to your benchmarks.
https://www.eloquens.com/tool/QklGiG1a/engineering/statistics-methods/descriptive-statistics-for-grouped-weighted-data
https://www.eloquens.com/tool/dXAVfKnZ/finance/financial-modeling-courses-tutorials/portfolio-analysis-and-sales-forecasting