
M111 Business Intelligence Technology Management 2014

What is Business Intelligence?


http://en.wikipedia.org/wiki/Business_Intelligence
Business intelligence (BI) is a set of theories, methodologies, architectures, and technologies
that transform raw data into meaningful and useful information for business purposes. BI can
handle enormous amounts of unstructured data to help identify, develop and otherwise create
new opportunities. BI, in simple words, makes interpreting voluminous data friendly. Making use
of new opportunities and implementing an effective strategy can provide a competitive market
advantage and long-term stability.
Generally, business intelligence is made up of an increasing number of components, including:

Multidimensional aggregation and allocation
Denormalization, tagging and standardization
Real-time reporting with analytical alerts
Interfaces with unstructured data sources
Group consolidation, budgeting and rolling forecasts
Statistical inference and probabilistic simulation
Key performance indicator optimization
Version control and process management
Open item management

BI technologies provide historical, current and predictive views of business operations. Common
functions of business intelligence technologies are reporting, online analytical processing,
analytics, data mining, process mining, complex event processing, business performance
management, benchmarking, text mining, predictive analytics and prescriptive analytics.
Though the term business intelligence is sometimes a synonym for competitive intelligence
(because they both support decision making), BI uses technologies, processes, and
applications to analyze mostly internal, structured data and business processes while
competitive intelligence gathers, analyzes and disseminates information with a topical focus on
company competitors. If understood broadly, business intelligence can include the subset of
competitive intelligence.

Business intelligence can be applied to the following business purposes, in order to drive
business value.
1. Measurement program that creates a hierarchy of performance metrics (see also Metrics Reference Model) and benchmarking that informs business leaders about progress towards business goals (business process management).
2. Analytics program that builds quantitative processes for a business to arrive at optimal decisions and to perform business knowledge discovery. Frequently involves: data mining, process mining, statistical analysis, predictive analytics, predictive modeling, business process modeling, complex event processing and prescriptive analytics.
3. Reporting/enterprise reporting program that builds infrastructure for strategic reporting to serve the strategic management of a business, not operational reporting. Frequently involves data visualization, executive information system and OLAP.
4. Collaboration/collaboration platform program that gets different areas (both inside and outside the business) to work together through data sharing and electronic data interchange.
5. Knowledge management program to make the company data driven through strategies and practices to identify, create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge. Knowledge management leads to learning management and regulatory compliance.
In addition to the above, business intelligence can also take a proactive approach, such as an alarm function that alerts end users immediately. There are many types of alerts; for example, if a business value exceeds a threshold, the corresponding figure in the report may be highlighted in red and the business analyst alerted, and an alert email may also be sent to the user. This end-to-end process requires data governance, which should be handled by experts.

What is Extract-Transform-Load (ETL)?


http://en.wikipedia.org/wiki/Extract,_transform,_load

In computing, extract, transform, and load (ETL) refers to a process in database usage and
especially in data warehousing that:

Extracts data from outside sources

Transforms it to fit operational needs, which can include quality levels


Loads it into the end target (database, more specifically, operational data store, data
mart, or data warehouse)
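To make the three steps concrete, here is a minimal ETL sketch in Python; the file name, table name and column names are illustrative assumptions rather than anything prescribed in these notes:

import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from an outside source (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean and reshape rows to fit the target's needs."""
    out = []
    for row in rows:
        if not row.get("salary"):          # drop records with missing salary
            continue
        out.append({
            "roll_no": int(row["roll_no"]),
            "salary": float(row["salary"]),
            "gender": {"1": "M", "2": "F"}.get(row.get("gender", ""), "U"),
        })
    return out

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into the end target (a SQLite table)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS staff (roll_no INTEGER, salary REAL, gender TEXT)")
    conn.executemany("INSERT INTO staff VALUES (:roll_no, :salary, :gender)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("staff.csv")))

Real ETL tools wrap this same extract-transform-load flow in scheduling, logging and error handling.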

ETL systems are commonly used to integrate data from multiple applications, typically
developed and supported by different vendors or hosted on separate computer hardware. The
disparate systems containing the original data are frequently managed and operated by different
employees. For example, a cost accounting system may combine data from payroll, sales and
purchasing.

Extract
The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, since extracting the data correctly sets the stage for the success of all subsequent processing.
Most data warehousing projects consolidate data from different source systems. Each separate
system may also use a different data organization and/or format. Common data source formats
are relational databases and flat files, but may include non-relational database structures such
as Information Management System (IMS) or other data structures such as Virtual Storage
Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as web spidering or screen-scraping. Streaming the extracted data and loading it on the fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format
appropriate for transformation processing.
An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an expected pattern or structure. If not, the data may be rejected entirely or in part.

Transform
The transform stage applies a series of rules or functions to the extracted data from the source
to derive the data for loading into the end target. Some data sources require very little or even
no manipulation of data. In other cases, one or more of the following transformation types may
be required to meet the business and technical needs of the target database:

Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), roll_no, age, and salary, then the selection may take only roll_no and salary. Similarly, the selection mechanism may ignore all those records where salary is not present (salary = null).
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
Encoding free-form values (e.g., mapping "Male" to "M")
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Sorting
Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region)
Generating surrogate-key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns)
Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all the data are handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.
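A short sketch of a few of these transformation types, using pandas purely as an illustration (the column names mirror the roll_no/salary/gender examples in the list above):

import pandas as pd

# Source extract with the example attributes used above.
src = pd.DataFrame({
    "roll_no":    [1, 2, 3, 3],
    "age":        [30, 41, 29, 29],
    "gender":     [1, 2, 1, 1],
    "qty":        [3, 5, 2, 2],
    "unit_price": [10.0, 4.0, 25.0, 25.0],
    "salary":     [50000, None, 62000, 62000],
})

out = (
    src
    .dropna(subset=["salary"])                               # selection: ignore records where salary is null
    .drop_duplicates()                                       # deduplication
    [["roll_no", "gender", "qty", "unit_price", "salary"]]   # select only certain columns (age is dropped)
    .assign(
        gender=lambda d: d["gender"].map({1: "M", 2: "F"}),  # translate coded values
        sale_amount=lambda d: d["qty"] * d["unit_price"],    # derive a new calculated value
    )
    .sort_values("roll_no")                                  # sorting
)
print(out)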

Load
The load phase loads the data into the end target, usually the data warehouse (DW). Depending
on the requirements of the organization, this process varies widely. Some data warehouses may
overwrite existing information with cumulative information; frequently, updating extracted data is
done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals - for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records
of the last year. This data warehouse overwrites any data older than a year with newer data.
However, the entry of data for any one year window is made in a historical manner. The timing
and scope to replace or append are strategic design choices dependent on the time available
and the business needs. More complex systems can maintain a history and audit trail of all
changes to the data loaded in the data warehouse.
As the load phase interacts with a database, the constraints defined in the database schema
as well as in triggers activated upon data load apply (for example, uniqueness, referential
integrity, mandatory fields), which also contribute to the overall data quality performance of the
ETL process.
For example, a financial institution might have information on a customer in several departments and each department might have that customer's information listed in a
different way. The membership department might list the customer by name,
whereas the accounting department might list the customer by number. ETL can
bundle all this data and consolidate it into a uniform presentation, such as for storing
in a database or data warehouse.

Another way that companies use ETL is to move information to another application
permanently. For instance, the new application might use another database vendor
and most likely a very different database schema. ETL can be used to transform the
data into a format suitable for the new application to use.

An example of this would be an Expense and Cost Recovery System (ECRS) such
as used by accountancies, consultancies and lawyers. The data usually ends up in
the time and billing system, although some businesses may also utilize the raw data
for employee productivity reports to Human Resources (personnel dept.) or
equipment usage reports to Facilities Management.

What is Online Analytical Processing (OLAP)?
http://en.wikipedia.org/wiki/Online_analytical_processing

In computing, online analytical processing (OLAP) is an approach to answering multi-dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).
OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing. Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. By contrast, drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by individual products that make up a region's sales. Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints.
Databases configured for OLAP use a multidimensional data model, allowing for complex
analytical and ad hoc queries with a rapid execution time. They borrow aspects of navigational,
hierarchical databases and relational databases.
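The three OLAP operations can be illustrated on a toy sales cube. The sketch below uses pandas only as a stand-in for an OLAP engine; the dimensions and figures are invented:

import pandas as pd

# Toy sales cube: three dimensions (region, office, quarter) and one measure (sales).
sales = pd.DataFrame({
    "region":  ["East", "East", "East", "West", "West", "West"],
    "office":  ["NYC",  "NYC",  "BOS",  "SF",   "SF",   "LA"],
    "quarter": ["Q1",   "Q2",   "Q1",   "Q1",   "Q2",   "Q1"],
    "sales":   [100,    120,    80,     90,     110,    70],
})

# Roll-up (consolidation): aggregate offices up to the region level.
rollup = sales.groupby(["region", "quarter"])["sales"].sum()

# Drill-down: from the region level back down to individual offices.
drilldown = sales.groupby(["region", "office", "quarter"])["sales"].sum()

# Slice: fix one dimension (quarter = "Q1") and keep the rest.
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: view the slice from another perspective (offices by regions).
diced = q1_slice.pivot_table(index="office", columns="region", values="sales", aggfunc="sum")

print(rollup, drilldown, diced, sep="\n\n")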

What is Data Mining?


http://en.wikipedia.org/wiki/Data_mining

Data mining (the analysis step of the "Knowledge Discovery and Data Mining" process, or
KDD), an interdisciplinary subfield of computer science, is the computational process of
discovering patterns in large data sets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems. The overall goal of the data
mining process is to extract information from a data set and transform it into an understandable
structure for further use. Aside from the raw analysis step, it involves database and data
management aspects, data pre-processing, model and inference considerations,
interestingness metrics, complexity considerations, post-processing of discovered structures,
visualization, and online updating.
The term is a buzzword, and is frequently misused to mean any form of large-scale data or
information processing (collection, extraction, warehousing, analysis, and statistics) but is also
generalized to any kind of computer decision support system, including artificial intelligence,
machine learning, and business intelligence. In the proper use of the word, the key term is
discovery, commonly defined as "detecting something new". Even the popular book "Data
mining: Practical machine learning tools and techniques with Java" (which covers mostly
machine learning material) was originally to be named just "Practical machine learning", and the
term "data mining" was only added for marketing reasons. Often the more general terms "(large
scale) data analysis", or "analytics" or when referring to actual methods, artificial intelligence
and machine learning are more appropriate.
The actual data mining task is the automatic or semi-automatic analysis of large quantities of
data to extract previously unknown interesting patterns such as groups of data records (cluster
analysis), unusual records (anomaly detection) and dependencies (association rule mining).
This usually involves using database techniques such as spatial indices. These patterns can
then be seen as a kind of summary of the input data, and may be used in further analysis or, for
example, in machine learning and predictive analytics. For example, the data mining step might
identify multiple groups in the data, which can then be used to obtain more accurate prediction
results by a decision support system. Neither the data collection, data preparation, nor result
interpretation and reporting are part of the data mining step, but do belong to the overall KDD
process as additional steps.
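As a small illustration of the pattern-discovery tasks named above (cluster analysis and a simple form of anomaly detection), here is a sketch; scikit-learn and the synthetic data are assumptions for illustration only:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic "customer" records: two natural groups plus one odd record.
data = np.vstack([
    rng.normal(loc=[20, 300], scale=5, size=(50, 2)),   # low spend, low volume
    rng.normal(loc=[80, 900], scale=5, size=(50, 2)),   # high spend, high volume
    [[200, 50]],                                         # an unusual record
])

# Cluster analysis: discover groups of similar records.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Simple anomaly detection: records far from their cluster centre are suspicious.
distances = np.linalg.norm(data - km.cluster_centers_[km.labels_], axis=1)
print("Most anomalous record:", data[distances.argmax()])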
The related terms data dredging, data fishing, and data snooping refer to the use of data mining
methods to sample parts of a larger population data set that are (or may be) too small for
reliable statistical inferences to be made about the validity of any patterns discovered. These
methods can, however, be used in creating new hypotheses to test against the larger data
populations.

What is Business Analytics?


http://en.wikipedia.org/wiki/Business_analytics
Business analytics (BA) refers to the skills, technologies, applications and practices for
continuous iterative exploration and investigation of past business performance to gain insight
and drive business planning. Business analytics focuses on developing new insights and
understanding of business performance based on data and statistical methods. In contrast,
business intelligence traditionally focuses on using a consistent set of metrics to both measure
past performance and guide business planning, which is also based on data and statistical
methods.
Business analytics makes extensive use of data, statistical and quantitative analysis,
explanatory and predictive modeling, and fact-based management to drive decision making. It is
therefore closely related to management science. Analytics may be used as input for human
decisions or may drive fully automated decisions. Business intelligence is querying, reporting,
OLAP, and "alerts."
In other words, querying, reporting, OLAP, and alert tools can answer questions such as what
happened, how many, how often, where the problem is, and what actions are needed. Business
analytics can answer questions like why is this happening, what if these trends continue, what
will happen next (that is, predict), what is the best that can happen (that is, optimize).
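As a tiny illustration of the predictive "what will happen next" side of business analytics, a trend line can be fitted to past figures and extrapolated; the monthly sales numbers below are invented:

import numpy as np

# Invented monthly sales for the past 8 months.
months = np.arange(1, 9)
sales = np.array([102, 108, 115, 113, 124, 130, 134, 141])

# Fit a straight-line trend (describing the past) ...
slope, intercept = np.polyfit(months, sales, 1)

# ... and use it to ask the predictive question: what happens next month?
next_month = 9
forecast = slope * next_month + intercept
print(f"Trend: +{slope:.1f}/month, forecast for month {next_month}: {forecast:.0f}")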

What is a Data Warehouse?


http://en.wikipedia.org/wiki/Data_warehousing
In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a
database used for reporting and data analysis. Integrating data from one or more disparate
sources creates a central repository of data, a data warehouse (DW). Data warehouses store
current and historical data and are used for creating trending reports for senior management
reporting such as annual and quarterly comparisons.
The data stored in the warehouse is uploaded from the operational systems (such as marketing and sales systems). The data may pass through an operational data
store for additional operations before it is used in the DW for reporting.
The typical extract, transform, load (ETL)-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data are then
moved to yet another database, often called the data warehouse database, where the data is
arranged into hierarchical groups often called dimensions and into facts and aggregate facts.
The combination of facts and dimensions is sometimes called a star schema. The access layer
helps users retrieve data.
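A minimal sketch of the facts-and-dimensions (star schema) idea, using SQLite from Python purely as an illustration; the table and column names are invented:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context, one row per store / date.
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE dim_date  (date_id  INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- Fact table: one row per measured event, pointing at the dimensions.
CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store(store_id),
    date_id  INTEGER REFERENCES dim_date(date_id),
    units    INTEGER,
    amount   REAL
);

INSERT INTO dim_store VALUES (1, 'Central', 'East'), (2, 'Harbour', 'West');
INSERT INTO dim_date  VALUES (1, '2014-01-15', '2014-01', 2014);
INSERT INTO fact_sales VALUES (1, 1, 10, 999.0), (2, 1, 4, 450.0);
""")

# A typical access-layer query: aggregate facts by a dimension attribute.
for row in conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
    GROUP BY s.region
"""):
    print(row)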
A data warehouse constructed from an integrated data source system does not require ETL,
staging databases, or operational data store databases. The integrated data source systems
may be considered to be a part of a distributed operational data store layer. Data federation
methods or data virtualization methods may be used to access the distributed integrated source
data systems to consolidate and aggregate data directly into the data warehouse database
tables. Unlike the ETL-based data warehouse, the integrated source data systems and the data
warehouse are all integrated since there is no transformation of dimensional or reference data.
This integrated data warehouse architecture supports the drill from the aggregate data of the
data warehouse to the transactional data of the integrated source data systems.
A data mart is a small data warehouse focused on a specific area of interest. Data warehouses
can be subdivided into data marts for improved performance and ease of use within that area.
Alternatively, an organization can create one or more data marts as first steps towards a larger
and more complex enterprise data warehouse.
This definition of the data warehouse focuses on data storage. The main source of the data is
cleaned, transformed, cataloged and made available for use by managers and other business
professionals for data mining, online analytical processing, market research and decision
support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract,
transform and load data, and to manage the data dictionary are also considered essential
components of a data warehousing system. Many references to data warehousing use this
broader context. Thus, an expanded definition for data warehousing includes business
intelligence tools, tools to extract, transform and load data into the repository, and tools to
manage and retrieve metadata.

What is a Key Performance Indicator?


http://en.wikipedia.org/wiki/Key_performance_indicators
A performance indicator or key performance indicator (KPI) is a type of performance
measurement. An organization may use KPIs to evaluate its success, or to evaluate the
success of a particular activity in which it is engaged. Sometimes success is defined in terms of
making progress toward strategic goals, but often success is simply the repeated, periodic
achievement of some level of operational goal (e.g. zero defects, 10/10 customer satisfaction,
etc.). Accordingly, choosing the right KPIs relies upon a good understanding of what is
important to the organization. 'What is important' often depends on the department measuring
the performance - e.g. the KPIs useful to finance will be quite different from the KPIs assigned
to sales. Since there is a need to understand well what is important,
various techniques to assess the present state of the business, and its key activities, are
associated with the selection of performance indicators. These assessments often lead to the
identification of potential improvements, so performance indicators are routinely associated with
'performance improvement' initiatives. A very common way to choose KPIs is to apply a
management framework such as the balanced scorecard.

What is a Balanced Scorecard?


http://en.wikipedia.org/wiki/Balanced_scorecard

The balanced scorecard (BSC) is a strategy performance management tool - a semi-standard


structured report, supported by design methods and automation tools, that can be used by
managers to keep track of the execution of activities by the staff within their control and to
monitor the consequences arising from these actions. It is perhaps the best known of several
such frameworks (it was the most widely adopted performance management framework
reported in the 2010 annual survey of management tools undertaken by Bain & Company).
The characteristic of the balanced scorecard and its derivatives is the presentation of a mixture
of financial and non-financial measures each compared to a 'target' value within a single
concise report. The report is not meant to be a replacement for traditional financial or
operational reports but a succinct summary that captures the information most relevant to those
reading it. It is the method by which this 'most relevant' information is determined (i.e., the
design processes used to select the content) that most differentiates the various versions of the
tool in circulation. The balanced scorecard also draws attention to the company's vision and mission; these two elements must always be referred to when preparing a balanced scorecard.
As a model of performance, the balanced scorecard is effective in that "it articulates the links
between leading inputs (human and physical), processes, and lagging outcomes and focuses
on the importance of managing these components to achieve the organization's strategic
priorities."
The first versions of balanced scorecard asserted that relevance should derive from the
corporate strategy, and proposed design methods that focused on choosing measures and
targets associated with the main activities required to implement the strategy. As the initial
audience for this was the readers of the Harvard Business Review, the proposal was translated
into a form that made sense to a typical reader of that journal - one relevant to a mid-sized US
business. Accordingly, initial designs were encouraged to measure three categories of non-financial measures in addition to financial outputs - those of "customer," "internal business processes" and "learning and growth." Clearly these categories were not so relevant to non-profits or units within complex organizations (which might have high degrees of internal
specialization), and much of the early literature on balanced scorecard focused on suggestions
of alternative 'perspectives' that might have more relevance to these groups.
Modern balanced scorecard thinking has evolved considerably since the initial ideas proposed
in the late 1980s and early 1990s, and modern performance management tools including the Balanced Scorecard are significantly improved - being more flexible (to suit a wider range of organisational types) and more effective (as design methods have evolved to make them easier to design and use).

What is Performance Management?


http://en.wikipedia.org/wiki/Performance_management
Performance management (PM) includes activities which ensure that goals are consistently
being met in an effective and efficient manner. Performance management can focus on the
performance of an organization, a department, an employee, or even the processes to build a product or service, as well as many other areas.
PM is also known as a process by which organizations align their resources, systems and
employees to strategic objectives and priorities.
Performance management originated as a broad term coined by Dr. Aubrey Daniels in the late 1970s to describe a technology (i.e. science embedded in application methods) for managing both behavior and results, two critical elements of what is known as performance. A formal definition of performance management, according to Daniels, is "a scientifically based, data-oriented management system. It consists of three primary elements: measurement, feedback
and positive reinforcement."
Though it is used most often in the workplace, performance management can apply wherever people interact - schools, churches, community meetings, sports teams, health settings, governmental agencies, social events and even political settings - anywhere in the world people interact with their
environments to produce desired effects. Armstrong and Baron (1998) defined it as a strategic
and integrated approach to increase the effectiveness of companies by improving the
performance of the people who work in them and by developing the capabilities of teams and
individual contributors.
It may be possible to get all employees to reconcile personal goals with organizational goals
and increase productivity and profitability of an organization using this process. It can be applied
by organizations or a single department or section inside an organization, as well as an
individual person. The performance process is appropriately named the self-propelled
performance process (SPPP).
First, a commitment analysis must be done where a job mission statement is drawn up for each
job. The job mission statement is a job definition in terms of purpose, customers, product and
scope. The aim with this analysis is to determine the continuous key objectives and
performance standards for each job position.
Following the commitment analysis is the work analysis of a particular job in terms of the
reporting structure and job description. If a job description is not available, then a systems
analysis can be done to draw up a job description. The aim with this analysis is to determine the
continuous critical objectives and performance standards for each job.

Real World Example 1: Prediction of Red Wine Quality


Here is a story about how to use data to predict the quality of red wine without expert tasting, quoted from a book called Super Crunchers -- How Anything Can Be Predicted.
Orley Ashenfelter really loves wine: "When a good red wine ages," he says, "something quite magical happens." Yet Orley isn't just obsessed with how wine tastes. He wants to know about the forces behind great and not-so-great wines.
When you buy a good red wine, he says, "you're always making an investment, in the sense that it's probably going to be better later. And what you'd like to know is not what it's worth now, but what it's going to be worth in the future. Even if you're not going to sell it -- even if you're going to drink it. If you want to know, 'How much pleasure am I going to get by delaying my gratification?' That's an endlessly fascinating topic." It's a topic that has consumed a fair amount of his life for the last twenty-five years.
In his day job, Orley crunches numbers. He uses statistics to extract hidden information from
large datasets. As an economist at Princeton, he's looked at the wages of identical twins to
estimate the impact of an extra year of school. He's estimated how much states value a
statistical life by looking at differences in speed limits. For years he edited the leading
economics journal in the United States, the American Economic Review.
Ashenfelter is a tall man with a bushy mane of white hair and a booming, friendly voice that
tends to dominate a room. No milque-toast he. He's the kind of guy who would quickly disabuse
you of any stereotype you might have that number crunchers are meek, retiring souls. I've seen
Orley stride around a classroom, gutting the reasoning behind a seminar paper with affable
exuberance. When he starts out his remarks with over-the-top praise, watch out.
What's really gotten Orley in trouble is crunching numbers to assess the quality of Bordeaux
wines. Instead of using the "swishing and spitting" approach of wine gurus like Robert Parker,
Orley has used statistics to find out what characteristics of vintage are associated with higher or
lower auction prices.
"It's really a no-brainer," he said. "Wine is an agricultural product dramatically affected by the
weather from year to year." Using decades of weather data from France's Bordeaux region,
Orley found that low levels of harvest rain and high average summer temperatures produce the
greatest wines. The statistical fit on data from 1952 through 1980 was remarkably tight for the
red wines of Burgundy as well as Bordeaux.
Bordeaux are best when the grapes are ripe and their juice is concentrated. In years when the
summer is particularly hot, grapes get ripe, which lowers their acidity. And, in years when there
is below- average rainfall, the fruit gets concentrated. So it's in the hot and dry years that you
tend to get the legendary vintages. Ripe grapes make supple (low-acid) wines. Concentrated
grapes make full-bodied wines.
He's had the temerity to reduce his theory to a formula:
Wine quality = 12.145 + 0.00117 x Winter Rainfall + 0.0614 x Average Growing
Season Temperature - 0.00386 x Harvest Rainfall
That's right. By plugging the weather statistics for any year into this equation, Ashenfelter can
predict the general quality of any vintage. With a slightly fancier equation, he can make more
precise predictions for the vintage quality at more than 100 Chateaus. "It may seem a bit
mathematical," he acknowledges, "but this is exactly the way the French ranked their vineyards
back in the famous 1855 classifications."
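As a worked illustration of plugging numbers into the quoted equation, here is a small Python sketch; the weather inputs below are made-up placeholders, not actual Bordeaux data, and the units follow whatever convention the original regression used:

def predicted_wine_quality(winter_rainfall, growing_season_temp, harvest_rainfall):
    # Ashenfelter's equation as quoted above.
    return (12.145
            + 0.00117 * winter_rainfall
            + 0.0614 * growing_season_temp
            - 0.00386 * harvest_rainfall)

# Two hypothetical vintages: a warm, dry harvest versus a cooler, wetter one.
print(predicted_wine_quality(600, 17.5, 100))
print(predicted_wine_quality(600, 15.0, 300))

The positive temperature coefficient and the negative harvest-rainfall coefficient reproduce the hot-and-dry-years-make-legendary-vintages story told above.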
Traditional wine critics have not embraced Ashenfelter's data-driven predictions. Britain's Wine
magazine said "the formula's self-evident silliness invite[s] disrespect." William Sokolin, a New
York wine merchant, said the Bordeaux wine industry's view of Ashenfelter's work ranges
"somewhere between violent and hysterical." At times, he's been scorned by trade members.
When Ashenfelter gave a wine presentation at Christie's Wine Department, dealers in the back
openly hissed at his presentation.
Maybe the world's most influential wine writer (and publisher of The Wine Advocate), Robert
Parker, colorfully called Ashenfelter "an absolute total sham." Even though Ashenfelter is one of
the most respected quantitative economists in the world, to Parker his approach "is really a
Neanderthal way of looking at wine. It's so absurd as to be laughable." Parker dismisses the
possibility that a mathematical equation could help identify wines that actually taste good: "I'd
hate to be invited to his house to drink wine."
Parker says Ashenfelter "is rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director."
Parker has a point. Just as it's more accurate to see the movie, shouldn't it be more accurate to actually taste the wine? There's just one catch: for months and months there is no wine to taste. Bordeaux and Burgundies spend eighteen to twenty-four months in oak casks before they are set aside for aging in bottles. Experts like Parker have to wait four months just to have a first taste after the wine is placed in barrels. And even then it's a rather foul, fermenting mixture. It's not clear that tasting this undrinkable early wine gives tasters very accurate information about the wine's future quality. For example, Bruce Kaiser, former director of the wine department at auctioneer Butterfield & Butterfield, has claimed, "Young wines are changing so quickly that no one, and I mean no one, can really evaluate a vintage at all accurately by taste until it is at least ten years old, perhaps older."
In sharp contrast, Orley crunches numbers to find the historical relationship between weather
and price. That's how he found out that each centimeter of winter rain tends to raise the
expected price 0.00117 dollars. Of course, it's only a tendency. But number crunching lets
Orley predict the future quality of a vintage as soon as the grapes are harvested -- months
before even the first barrel taste and years before the wine is sold. In a world where wine
futures are actively traded, Ashenfelter's predictions give wine collectors a huge leg up on the
competition.
Ashenfelter started publishing his predictions in the late eighties in a semiannual newsletter
called Liquid Assets. He advertised the newsletter at first with small ads in Wine Spectator, and
slowly built a base of about 600 subscribers. The subscribers were a geographically diverse lot
of millionaires and wine buffs -- mostly confined to the small group of wine collectors who were
comfortable with econometric techniques. The newsletter's subscription base was just a fraction
of the 30,000 people who plunked down $30 a year for Robert Parker's newsletter The Wine
Advocate.
Ashenfelter's ideas reached a much larger audience in early 1990 when the New York Times
published a front-page article about his new prediction machine. Orley openly criticized Parker's
assessment of the 1986 Bordeaux. Parker had rated the '86s as "very good and sometimes
exceptional." Ashenfelter disagreed. He felt that the below-average growing season
temperature and the above-average harvest rainfall doomed this vintage to mediocrity.
Yet the real bombshell in the article concerned Orley's prediction about the 1989 Bordeaux. While these wines were barely three months in the cask and had yet to even be tasted by critics, Orley predicted that they would be "the wine of the century." He guaranteed that they would be stunningly good. On his scale, if the great 1961 Bordeaux were 100, then the 1989 Bordeaux were a whopping 149. Orley brazenly predicted that they would sell for as high a price as any wine made in the last thirty-five years.
The wine critics were incensed. Parker now characterized Ashenfelter's quantitative estimates as "ludicrous and absurd." Sokolin said the reaction was a mixture of fury and fear: "He's really upset a lot of people." Within a few years, Wine Spectator refused to publish any more ads for his (or any other) newsletter.
The traditional experts circled the wagons and tried to discredit both Orley and his methodology. His methodology was flawed, they said, because it couldn't predict the exact future price. For example, Thomas Matthews, the tasting director of Wine Spectator, complained that Ashenfelter's price predictions were exactly true only three times in the twenty-seven vintages. Even though Orley's formula was specifically designed to fit price data, his predicted prices are both under and over the actual prices. However, to statisticians (and to anyone else who thinks about it for a moment), having predictions that are sometimes high and sometimes low is a good thing; it's the sign of an unbiased estimate. In fact, Orley showed that Parker's initial ratings of vintages had been systematically biased upward. More often than not Parker had to downgrade his initial rankings.

In 1990 Orley went even further out on a limb. After declaring 1989 the vintage of the century, the data told him that 1990 was going to be even better. And he said so. In retrospect we can now see that Liquid Assets' predictions have been astonishingly accurate. The '89s turned out to be a truly excellent vintage and the '90s were even better.
How can it be that you'd have two "vintages of the century" in two consecutive years? It turns out that since 1986 there hasn't been a year with below-average growing season temperature. The weather in France has been balmy for more than two decades. A convenient truth for wine lovers is that it's been a great time to grow truly supple Bordeaux.
The traditional experts are now paying a lot more attention to the weather. While many of them have never publicly acknowledged the power of Orley's predictions, their own predictions now correspond much more closely to his simple equation results. Orley still maintains his website, www.liquidasset.com, but he's stopped producing the newsletter. He says, "Unlike the past, the tasters no longer make any horrendous mistakes. Frankly, I kind of killed myself. I don't have as much value added anymore."
Ashenfelter's detractors see him as a heretic. He threatens to demystify the world of wine. He eschews flowery and nonsensical ("muscular," "tight," or "rakish") jargon, and instead gives you reasons for his predictions.
The industry's intransigence is not just about aesthetics. "The wine dealers and writers just don't want the public informed to the degree that Orley can provide," Kaiser notes. "It started with the '86 vintage. Orley said it was a scam, a dreadful year suffering from a lot of rain, and they were raving, insisting it was a great vintage. Orley was right, but it isn't always popular to be right."
Both the wine dealers and writers have a vested interest in maintaining their informational monopoly on the quality of wine. The dealers use the perennially inflated initial rankings as a way to stabilize prices. And Wine Spectator and The Wine Advocate have millions of dollars at stake in remaining the dominant arbiters of quality. As Upton Sinclair (and now Al Gore) has said, "It is difficult to get a man to understand something when his salary depends on his not understanding it." The same holds true for wine. "The livelihood of a lot of people depends on wine drinkers believing this equation won't work," Orley says. "They're annoyed because suddenly they're kind of obsolete."
There are some signs of change. Michael Broadbent, chairman of the International Wine Department at Christie's in London, puts it diplomatically: "Many think Orley is a crank, and I suppose he is in many ways. But I have found that, year after year, his ideas and his work are remarkably on the mark. What he does can be really helpful to someone wanting to buy wine."

Real World Example 2: Prediction of good baseball players
This story is about how to use data to predict good baseball players. It is also quoted from the same book, Super Crunchers -- How Anything Can Be Predicted.
The grand world of oenophiles seems worlds away from the grandstands of baseball. But in many ways, Ashenfelter was trying to do for wine just what writer Bill James did for baseball. In his annual Baseball Abstracts, James challenged the notion that baseball experts could judge talent simply by watching a player. Michael Lewis's Moneyball showed that James was baseball's herald of data-driven decision making. James's simple but powerful thesis was that data-based analysis in baseball was superior to observational expertise:
The naked eye was inadequate for learning what you need to know to evaluate players.
Think about it. One absolutely cannot tell, by watching, the difference between a .300
hitter and a .275 hitter. The difference is one hit every two weeks. If you see both 15
games a year, there is a 40 percent chance that the .275 hitter will have more hits than
the .300 hitter. The difference between a good hitter and average hitter is simply not
visible -- it is a matter of record.
Like Ashenfelter, James believed in formulas. He said, "A hitter should be measured by his success in that which he is trying to do, and that which he is trying to do is create runs." So James went out and created a new formula to better measure a hitter's contribution to runs created:
Runs Created = (Hits + Walks) x Total Bases/(At Bats + Walks)
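As a quick numeric illustration of the formula (the stat line below is hypothetical, not a real player's):

def runs_created(hits, walks, total_bases, at_bats):
    # Bill James's basic Runs Created formula as quoted above.
    return (hits + walks) * total_bases / (at_bats + walks)

# Hypothetical season line: 150 hits, 90 walks, 250 total bases, 500 at bats.
print(round(runs_created(150, 90, 250, 500), 1))   # ~101.7 runs created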
This equation puts much more emphasis on a player's on-base percentage and especially gives higher ratings to those players who tend to walk more often. James's number-crunching approach was particularly anathema to scouts. If wine critics like Robert Parker live by their taste and smell, then scouts live and die by their eyes. That's their value added. As described by Lewis:
In the scouts' view, you found big league ball players by driving sixty thousand miles, staying in a hundred crappy motels, and eating God knows how many meals at Denny's all so you could watch 200 high school and college baseball games inside of four months, 199 of which were completely meaningless to you. [Y]ou would walk into the ball park, find a seat on the aluminum plank in the fourth row directly behind the catcher and see something no one else had seen -- at least no one who knew the meaning of it. You only had to see him once. If you see it once, it's there.

The scouts and wine critics like Robert Parker have more in common than simply a penchant for
spitting. Just as Parker believes that he can assess the quality of a Chateau's vintage based on
a single taste, baseball scouts believe they can assess the quality of the high school prospect
based on a single viewing.
In both contexts, people are trying to predict the market value of untested and immature
products, whether they be grapes or baseball players. And in both contexts, the central dispute
is whether to rely on observational expertise or quantitative data.
Like wine critics, baseball scouts often resort to non-falsifiable euphemisms, such as "He's a real player," or "He's a tools guy."
In Moneyball, the conflict between data and traditional expertise came to a head in 2002 when Oakland A's general manager Billy Beane wanted to draft Jeremy Brown. Beane had read James and decided he was going to draft based on hard numbers. Beane loved Jeremy Brown because he had walked more frequently than any other college player. The scouts hated him because he was, well, fat. "If he tried to run in corduroys," an A's scout sneered, "he'd start a fire." The scouts thought there was no way that someone with his body could play major league ball. Beane couldn't care less how a player looked. His drafting mantra was "We're not selling jeans." Beane just wanted to win games. The scouts seem to have been wrong. Brown has progressed faster than anyone else the A's drafted that year. In September 2006, he was called up for his major league debut with the A's and batted .300 (with an on-base percentage of .364).
There are striking parallels between the ways that Ashenfelter and James originally tried to
disseminate their number-crunching results. Just like Ashenfelter, James began by placing
small ads for his first newsletter, Baseball Abstracts (which he euphemistically characterized as
a book). In the first year, he sold a total of seventy-five copies. Just as Ashenfelter was locked
out of Wine Spectator, James was given the cold shoulder by the Elias Sports Bureau when he
asked to share data.
But James and Ashenfelter have forever left their marks upon their industries. The perennial success of the Oakland A's, detailed in Moneyball, and even the first world championship of the Boston Red Sox under the data-driven management of Theo Epstein are tributes to James's lasting influence. The improved weather-driven predictions of even traditional wine writers are silent tributes to Ashenfelter's impact.
Both have even given rise to gearhead groups that revel in their brand of number crunching.
James inspired SABR, the Society for American Baseball Research. Baseball number crunching
now even has its own name, sabermetrics. In 2006, Ashenfelter in turn helped launch the
Journal of Wine Economics. There's even an Association of Wine Economists now. Unsurprisingly, Ashenfelter is its first president. By the way, Ashenfelter's first predictions in hindsight are looking pretty darn good. I looked up recent auction prices for Chateau Latour and sure enough the '89s were selling for more than twice the price of the '86s, and 1990 bottles
were priced even higher. Take that, Robert Parker.
Real World Example 3: F-Score and EJInsight


The original article appeared in Chinese in the Hong Kong Economic Journal on 20 February 2014; the Chinese text is not reproduced legibly in this extract. It analysed stock selection by F-Score with reference to Prof. Joseph D. Piotroski's paper, Value Investing: The Use of Historical Financial Statement Information to Separate Winners from Losers (http://www.chicagobooth.edu/~/media/FE874EE65F624AAEBD0166B1974FD74D.pdf). An English translation in summary form follows.

Stock Selection by F-Score


Hong Kong Economic Journal
20 February 2014
Analysis of stock selection by F-Score is made with reference to the paper of Prof. Joseph D. Piotroski, Value Investing: The Use of Historical Financial Statement Information to Separate Winners from Losers (http://www.chicagobooth.edu/~/media/FE874EE65F624AAEBD0166B1974FD74D.pdf).
Stock selection is based on the following key aspects of individual historical financial statements:
Aspect 1: Profitability
1. Return on Assets (ROA) > 0
2. Cash Flow from Operations (CFO) > 0
3. ROA increasing
4. CFO > Net Income
Aspect 2: Leverage, Liquidity and Source of Funds
5. Long-term debt to total assets decreasing
6. Current ratio increasing
7. No increase in shares outstanding
Aspect 3: Operating Efficiency
8. Gross margin increasing
9. Asset turnover increasing
Each indicator scores 1 for yes and 0 for no. The total score for each stock therefore ranges from 0 to 9, with a higher score meaning better fundamental conditions, and vice versa.
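A minimal sketch of how the nine signals could be scored in code; the field names and the example figures are invented for illustration, and details such as which year's assets to scale by are simplified:

def f_score(cur, prev):
    """Compute a Piotroski-style F-Score (0-9) from current- and prior-year figures."""
    roa      = cur["net_income"] / cur["total_assets"]
    roa_prev = prev["net_income"] / prev["total_assets"]
    signals = [
        roa > 0,                                                 # 1. ROA > 0
        cur["cfo"] > 0,                                          # 2. CFO > 0
        roa > roa_prev,                                          # 3. ROA increasing
        cur["cfo"] > cur["net_income"],                          # 4. CFO > net income
        cur["lt_debt"] / cur["total_assets"]
            < prev["lt_debt"] / prev["total_assets"],            # 5. leverage decreasing
        cur["current_ratio"] > prev["current_ratio"],            # 6. liquidity increasing
        cur["shares_out"] <= prev["shares_out"],                 # 7. no new shares issued
        cur["gross_margin"] > prev["gross_margin"],              # 8. margin increasing
        cur["sales"] / cur["total_assets"]
            > prev["sales"] / prev["total_assets"],              # 9. turnover increasing
    ]
    return sum(signals)

current = dict(net_income=120, total_assets=1000, cfo=150, lt_debt=200,
               current_ratio=1.8, shares_out=500, gross_margin=0.42, sales=900)
prior   = dict(net_income=100, total_assets=950,  cfo=90,  lt_debt=230,
               current_ratio=1.6, shares_out=500, gross_margin=0.40, sales=820)
print(f_score(current, prior))   # 9 in this made-up example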

Taking the US stock market in 1976-1996 as an example, by adopting the F-Score screening method to buy high-scoring stocks and short-sell low-scoring ones, the average annual rate of return was up to 23%. Compared with the S&P 500 Index over the same period (which showed
an increase of 14.5 percent), the F-Score stock-picking strategy outperformed the S&P 500 Index by 8.5 percentage points (not including dividends). Of course, the efficacy of this strategy in recent years is pending verification, as stock market performance in the wake of the financial crisis has been distorted by quantitative easing introduced by central banks.

Validation of the Strategy for the Hong Kong Stock Market


Validation of the strategy based on financial data of the Hong Kong stock market in the past 10
years has been made with the following two assumptions:
1. Since companies with smaller capitalization tend to have more valuation traps, the back-testing range focused only on the roughly 300 stocks with a relatively large market capitalization in the Hang Seng Composite Index.
2. Calculation is only applied to those companies with a financial year ending in November/December (and with the corresponding annual financial results announced before May of the ensuing year) so that a complete F-Score can be calculated. Measurement of efficacy is made with respect to the annual change in stock price starting from May of the ensuing year. A 30% stop-loss risk management rule is also applied.

Figure 1 - Bird's-eye View of the F-Score of Hong Kong Stocks

A bell-shaped distribution is observed, i.e. the F-Score of most of the enterprises is in the range of 3-7; only a small number of companies (generally less than one percent) have a score of 8-9 or 0-2 (so far no company has a score of 0). Focusing on the small number of stocks bearing the extreme scores of 8-9 and 0-2 may help to achieve higher efficiency in portfolio management. The following two investment strategies have been adopted for testing their efficacy against the performance of the Hang Seng Index:

Strategy 1:
Starting from early May each year, based on the F-Score of stocks in the past calendar year, buy stocks with a score of 8-9 and short-sell stocks with a score of 0-2, with equal weighting assigned for both stocks bought and stocks short-sold.
Strategy 2:
Starting from early May each year, based on the F-Score of stocks in the past calendar year, buy stocks with a score of 8-9, i.e. with no short-selling of stocks scored 0-2.
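A sketch of the yearly selection step behind Strategies 1 and 2, under one simple reading of the equal-weighting rule; the tickers, scores and returns are invented, and details such as the 30% stop-loss are omitted:

# Hypothetical F-Scores (computed on the past calendar year) and subsequent 12-month returns.
scores  = {"0001.HK": 9, "0002.HK": 8, "0003.HK": 5, "0004.HK": 1, "0005.HK": 2}
returns = {"0001.HK": 0.30, "0002.HK": 0.12, "0003.HK": 0.05, "0004.HK": -0.20, "0005.HK": -0.08}

longs  = [s for s, f in scores.items() if f >= 8]   # buy 8-9
shorts = [s for s, f in scores.items() if f <= 2]   # short-sell 0-2 (Strategy 1 only)

# Equal weighting within each basket.
long_leg  = sum(returns[s] for s in longs) / len(longs)
short_leg = -sum(returns[s] for s in shorts) / len(shorts)

strategy_1 = (long_leg + short_leg) / 2   # long and short legs combined
strategy_2 = long_leg                     # long-only
print(round(strategy_1, 3), round(strategy_2, 3))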
Figure 2 - F-Score Strategy vs the Performance of the Hang Seng Index (HSI), 2003-2012 (series plotted: Hang Seng Index, Strategy 1, Strategy 2)

In the past 10 years, the cumulative performance of Strategies 1 and 2 is superior to the HSI [Figure 4]. It is worth noting that, since the F-Score strategy can often help to pick stocks with explosive rising trends, the cumulative performance over the years can far outstrip the performance of the HSI. For Strategy 1, underperformance vs the HSI was only recorded in 2006 (i.e., early May 2006 till the end of April 2007) in the decade. For Strategy 2, underperformance was only recorded in the three years of 2007, 2008 and 2011. In other words, the effectiveness of the F-Score as applied to Hong Kong stocks is very significant.
The following two further strategies have also been attempted, based on inclusion of the EJFQ signals in the stock selection process:

Strategy 3:
On top of the short-listing methodology in Strategy 1, buying of stocks will only be made for
stocks with green EJFQ signals and short-selling will only be performed for stocks with red
EJFQ signals.
Strategy 4:
On top of the short-listing methodology in Strategy 2, only buy stocks with green EJFQ signals.
Figure 3 - Inclusion of the EJFQ Signals in Stock Selection (series plotted: Hang Seng Index, Strategy 3, Strategy 4)


Figure 4 - Cumulative Returns of the Four Strategies (series plotted: Hang Seng Index, Strategies 1-4)

Similar to Strategies 1 and 2, Strategies 3 and 4 also outperform the HSI to a significant extent. For Strategy 3, underperformance vs the HSI was only recorded in 2003. For Strategy 4, underperformance was only recorded in 2008 and 2011.
Excluding factors like transaction costs and share dividends, Strategy 3 is the best according to the above back test. Certainly, including transaction costs and share dividends would provide a more comprehensive assessment.
