Q: Define a data warehouse.
A : A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data
that supports management's decision-making process.
Q: What does a subject-oriented data warehouse signify?
A : Subject oriented signifies that the data warehouse stores the information around a particular
subject such as product, customer, sales, etc.
Q: List any five applications of data warehouse.
A : Some applications include financial services, banking services, consumer goods, retail sectors, and
controlled manufacturing.
Q: What do OLAP and OLTP stand for?
A : OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online
Transactional Processing.
Q: What is the very basic difference between data warehouse and operational databases?
A : A data warehouse contains historical information that is made available for analysis of the business
whereas an operational database contains current information that is required to run the business.
Q: List the schemas that a data warehouse system can implement.
A : A data Warehouse can implement star schema, snowflake schema, and fact constellation schema.
Q: What is Data Warehousing?
A : Data Warehousing is the process of constructing and using the data warehouse.
Q: List the processes that are involved in Data Warehousing.
A : Data Warehousing involves data cleaning, data integration, and data consolidation.
Q: List the functions of data warehouse tools and utilities.
A : The functions performed by data warehouse tools and utilities are Data Extraction, Data Cleaning,
Data Transformation, Data Loading and Refreshing.
Q: What is metadata?
A : Metadata is simply defined as data about data. In other words, we can say that metadata is the
summarized data that leads us to the detailed data.
Q: What is a data cube?
A : Data cube helps us to represent the data in multiple dimensions. The data cube is defined by
dimensions and facts.
Q: Define dimension?
A : The dimensions are the entities with respect to which an enterprise keeps the records.
Q: Explain data mart.
A : Data mart contains the subset of organization-wide data. This subset of data is valuable to specific
groups of an organization. In other words, we can say that a data mart contains data specific to a
particular group.
Q: List the stages of the data warehouse delivery process.
A : The stages are IT strategy, Education, Business Case Analysis, technical Blueprint, Build the version,
History Load, Ad hoc query, Requirement Evolution, Automation, and Extending Scope.
Q: Define load manager.
A : A load manager performs the operations required to extract and load the data. The size and
complexity of a load manager varies between specific solutions, from one data warehouse to another.
Q: Define the functions of a load manager.
A : A load manager extracts data from the source system, fast-loads the extracted data into a temporary
data store, and performs simple transformations into a structure similar to the one in the data warehouse.
Q: Define a warehouse manager.
A : A warehouse manager is responsible for the warehouse management process. The warehouse
manager consists of third-party system software, C programs, and shell scripts. The size and complexity
of warehouse manager varies between specific solutions.
Q: Define the functions of a warehouse manager.
A : The warehouse manager performs consistency and referential integrity checks; creates the indexes,
business views, and partition views against the base data; transforms and merges the source data from
the temporary store into the published data warehouse; backs up the data in the data warehouse; and
archives the data that has reached the end of its captured life.
Q: What is Summary Information?
A : Summary Information is the area in data warehouse where the predefined aggregations are kept.
Q: What is the Query Manager responsible for?
A : Query Manager is responsible for directing the queries to the suitable tables.
Q: List the types of OLAP server
A : There are four types of OLAP servers, namely Relational OLAP, Multidimensional OLAP, Hybrid
OLAP, and Specialized SQL Servers.
Q: What functions does OLAP perform?
A : OLAP performs functions such as roll-up, drill-down, slice, dice, and pivot.
Q: How many dimensions are selected in a dice operation?
A : For dice operation two or more dimensions are selected for a given cube.
Q: Why do we partition a data warehouse?
A : Partitioning is done for various reasons such as easy management, to assist backup recovery, to
enhance performance.
Q: What kind of costs are involved in Data Marting?
A : Data Marting involves hardware & software cost, network access cost, and time cost.
1. What is Tableau?
Tableau is a business intelligence software that allows anyone to connect to respective data, and then
visualize and create interactive, shareable dashboards.
2. What are the different Tableau products?
(i)Tableau Desktop:
It is a self-service business analytics and data visualization tool that anyone can use. It translates
pictures of data into optimized queries. With Tableau Desktop, you can directly connect to data from
your data warehouse for live, up-to-date data analysis. You can also perform queries without writing a
single line of code. Import all your data into Tableau's data engine from multiple sources and integrate
it by combining multiple views in an interactive dashboard.
(ii)Tableau Server:
It is an enterprise-level Tableau product. You can publish dashboards with Tableau Desktop and
share them throughout the organization with web-based Tableau server. It leverages fast databases
through live connections.
(iii)Tableau Online:
This is a hosted version of Tableau Server which makes business intelligence faster and easier than
before. You can publish Tableau dashboards with Tableau Desktop and share them with colleagues.
(iv)Tableau Reader:
It’s a free desktop application that enables you to open and view visualizations that are built in Tableau
Desktop. You can filter, drill down data but you cannot edit or perform any kind of interactions.
(v)Tableau Public:
This is free Tableau software which you can use to make visualizations, but you need to save your
workbook or worksheets to Tableau Public's server, where they can be viewed by anyone.
Measures are the numeric metrics or measurable quantities of the data, which can be analyzed against
the dimension tables. Measures are stored in a table that contains foreign keys referring uniquely to the
associated dimension tables. The table supports data storage at the atomic level and thus allows a large
number of records to be inserted at one time. For instance, a Sales table can have a product key,
customer key, promotion key, and items sold, referring to a specific event.
Dimensions are the descriptive attribute values for multiple dimensions of each attribute, defining
multiple characteristics. A dimension table, having a reference to a product key from the fact table, can
consist of product name, product type, size, color, description, etc.
A .twb is an XML document which contains all the selections and layout you have made in
your Tableau workbook. It does not contain any data.
A .twbx is a ‘zipped’ archive containing a .twb and any external files such as extracts and
background images.
Tableau provides easy-to-use, best-in-class visual analytics capabilities but has nothing to do with the
data foundation or plumbing. But with an integration with a SQL server it can be a complete package.
On the other hand, traditional BI tools have the aforementioned capabilities, but then you have to deal
with a significant amount of upfront costs. The cost of consulting, software, and hardware is
comparatively quite high.
The joins in Tableau are the same as SQL joins.
9. What are the different connections you can make with your dataset?
We can either connect live to our data set or extract data onto Tableau.
Live: Connecting live to a data set leverages its computational processing and storage. New
queries will go to the database and will be reflected as new or updated within the data.
Extract: An extract will make a static snapshot of the data to be used by Tableau’s data engine.
The snapshot of the data can be refreshed on a recurring schedule as a whole or incrementally
append data. One way to set up these schedules is via the Tableau server.
The benefit of Tableau extract over live connection is that extract can be used anywhere without any
connection and you can build your own visualization without connecting to database.
Shelves are named areas to the left and top of the view. You build views by placing fields onto the shelves.
Some shelves are available only when you select certain mark types.
Sets are custom fields that define a subset of data based on some conditions. A set can be based on a
computed condition, for example, a set may contain customers with sales over a certain threshold.
Computed sets update as your data changes. Alternatively, a set can be based on specific data points in
your view.
A group is a combination of dimension members that make higher level categories. For example, if you
are working with a view that shows average test scores by major, you may want to group certain majors
together to create major categories.
A hierarchical field in Tableau is used for drilling down data. It means viewing your data at a more
granular level.
Tableau Server acts as a middleman between Tableau users and the data. Tableau Data Server allows you
to upload and share data extracts, preserve database connections, as well as reuse calculations and field
metadata. This means any changes you make to the data-set, calculated fields, parameters, aliases, or
definitions, can be saved and shared with others, allowing for a secure, centrally managed and
standardized dataset. Additionally, you can leverage your server’s resources to run queries on extracts
without having to first transfer them to your local machine.
Tableau Data Engine is a really cool feature in Tableau. It's an analytical database designed to achieve
instant query response and predictive performance, to integrate seamlessly into existing data
infrastructure, and it is not limited to loading entire data sets into memory.
If you work with a large amount of data, it does take some time to import, create indexes, and sort data
but after that everything speeds up. Tableau Data Engine is not really in-memory technology. The data is
stored in disk after it is imported and the RAM is hardly utilized.
16. What are the different filters in Tableau and how are they different from each other?
The different filters in Tableau are Quick, Context, and Normal/Traditional filters:
Normal Filter is used to restrict the data from database based on selected dimension or
measure. A Traditional Filter can be created by simply dragging a field onto the ‘Filters’ shelf.
Quick filter is used to view the filtering options and filter each worksheet on a dashboard while
changing the values dynamically (within the range defined) during the run time.
Context Filter is used to filter the data that is transferred to each individual worksheet. When a
worksheet queries the data source, it creates a temporary, flat table that it uses to compute the
chart. This temporary table includes all values that are not filtered out by either the Custom SQL
or the Context Filter.
17. How do you create a calculated field in Tableau?
Click the drop-down to the right of Dimensions on the Data pane and select "Create > Calculated
Field" to open the calculation editor.
Name the new field and create a formula.
18. What is dual axis?
Dual axis is an excellent feature supported by Tableau that helps users view two scales of two
measures in the same graph. Many websites like Indeed.com make use of dual axes to show the
comparison between two measures and their growth rate over a specific set of years. Dual axes let you
compare multiple measures at once, having two independent axes layered on top of one another.
19. What is the difference between a tree map and heat map?
A heat map can be used for comparing categories with color and size. With heat maps, you can compare
two different measures together.
A tree map also does the same except it is considered a very powerful visualization as it can be used for
illustrating hierarchical data and part-to-whole relationships.
The process of viewing numeric values or measures at higher and more summarized levels of the data is
called aggregation. When you place a measure on a shelf, Tableau automatically aggregates the data,
usually by summing it. You can easily determine the aggregation applied to a field because the function
always appears in front of the field’s name when it is placed on a shelf. For example, Sales becomes
SUM(Sales). You can aggregate measures using Tableau only for relational data sources.
Multidimensional data sources contain aggregated data only. In Tableau, multidimensional data sources
are supported only in Windows.
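The automatic aggregation described above, where Sales becomes SUM(Sales) per dimension member, can be sketched in JavaScript. The sample rows here are made up for illustration:

```javascript
// A sketch with made-up sample data: what Tableau's automatic
// aggregation does when a measure is placed on a shelf - Sales
// becomes SUM(Sales) for each member of the State dimension.
const rows = [
  { state: "CA", sales: 10 },
  { state: "CA", sales: 15 },
  { state: "NY", sales: 20 },
];

// Sum the measure per dimension member.
const sumSales = {};
for (const r of rows) {
  sumSales[r.state] = (sumSales[r.state] || 0) + r.sales;
}
```

Here `sumSales` ends up holding one aggregated value per state, which is the shape of data Tableau draws on a shelf.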
According to Tableau, Disaggregating your data allows you to view every row of the data source which
can be useful when you are analyzing measures that you may want to use both independently and
dependently in the view. For example, you may be analyzing the results from a product satisfaction
survey with the Age of participants along one axis. You can aggregate the Age field to determine the
average age of participants or disaggregate the data to determine at what age participants were most
satisfied with the product.
The term joining is used when you are combining data from the same source, for example,
worksheets in an Excel file or tables in an Oracle database, while blending requires two completely
defined data sources in your report.
Data extracts are the first copies or subdivisions of the actual data from original data sources. The
workbooks using data extracts instead of those using live DB connections are faster since the extracted
data is imported into the Tableau engine. After this extraction of data, users can publish the workbook, which
also publishes the extracts in Tableau Server. However, the workbook and extracts won’t refresh unless
users apply a scheduled refresh on the extract. Scheduled Refreshes are the scheduling tasks set for data
extract refresh so that they get refreshed automatically while publishing a workbook with data extract.
This also removes the burden of republishing the workbook every time the concerned data gets
updated.
Create a performance recording to record performance information about the main events as you
interact with a workbook. Users can view the performance metrics in a workbook created by
Tableau.
Help -> Settings and Performance -> Start Performance Recording
Help -> Setting and Performance -> Stop Performance Recording.
Reviewing the Tableau Desktop logs located at C:\Users\\My Documents\My Tableau
Repository. For a live connection to a data source, you can check the log.txt and tabprotosrv.txt files.
For an extract, check tdeserver.txt file.
Performance testing is again an important part of implementing Tableau. This can be done by load
testing Tableau Server with TabJolt, which is a "point and run" load generator created to perform QA.
While TabJolt is not supported by Tableau directly, it has to be installed using other open source
products.
Horizontal – Horizontal layout containers allow the designer to group worksheets and
dashboard components left to right across your page and edit the height of all elements at once.
Vertical – Vertical containers allow the user to group worksheets and dashboard components
top to bottom down your page and edit the width of all elements at once.
Text – All textual fields.
Image Extract – A Tableau workbook is in XML format. In order to extract images, Tableau
applies some code to extract an image which can be stored in XML.
Web [URL ACTION] – A URL action is a hyperlink that points to a Web page, file, or other web-
based resource outside of Tableau. You can use URL actions to link to more information about
your data that may be hosted outside of your data source. To make the link relevant to your
data, you can substitute field values of a selection into the URL as parameters.
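The field-value substitution a URL action performs can be sketched like this; the URL is hypothetical, not a real Tableau endpoint:

```javascript
// A sketch with a hypothetical URL: a URL action substitutes the
// selected field value into the link as a query parameter.
const selected = { state: "California" };

// encodeURIComponent keeps the substituted value URL-safe.
const url =
  "https://example.com/search?q=" + encodeURIComponent(selected.state);
```

When the user clicks the action, a link built this way opens with the current selection filled in.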
29. Mention whether you can create relational joins in Tableau without creating a new table?
Yes, one can create relational joins in Tableau without creating a new table.
In some cases, you can improve query performance by selecting the option to Assume Referential
Integrity from the Data menu. When you use this option, Tableau will include the joined table in the
query only if it is specifically referenced by fields in the view.
32. Explain when would you use Joins vs. Blending in Tableau?
If data resides in a single source, it is always desirable to use Joins. When your data is not in one place
blending is the most viable way to create a left join like the connection between your primary and
secondary data sources.
Data blending is the ability to bring data from multiple data sources into one Tableau view, without the
need for any special coding. A default blend is equivalent to a left outer join. However, by switching
which data source is primary, or by filtering nulls, it is possible to emulate left, right and inner joins.
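The left-outer-join behavior of a default blend can be sketched in JavaScript with made-up sample data:

```javascript
// A sketch with made-up sample data: a default Tableau blend behaves
// like a left outer join from the primary data source.
const primary = [
  { region: "East", sales: 100 },
  { region: "West", sales: 80 },
];
const secondary = [{ region: "East", target: 120 }];

// Every primary row is kept; rows with no match in the secondary
// source get null for the secondary measure.
const blended = primary.map((row) => {
  const match = secondary.find((s) => s.region === row.region);
  return { ...row, target: match ? match.target : null };
});
```

Swapping which array plays the role of `primary` is what "switching which data source is primary" does to the result.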
In Tableau, measures can share a single axis so that all the marks are shown in a single pane. Instead of
adding rows and columns to the view, when you blend measures there is a single row or column and all
of the values for each measure are shown along one continuous axis. We can blend multiple measures by
simply dragging one measure or axis and dropping it onto an existing axis.
A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey
information. You can create stories to show how facts are connected, provide context, demonstrate
how decisions relate to outcomes, or simply make a compelling case. Each individual sheet in a story is
called a story point.
There are two types of data roles in Tableau – discrete and continuous.
Discrete data roles are values that are counted as distinct and separate and can only take
individual values within a range. Examples: number of threads in a sheet, customer name or row
ID or State. Discrete values are shown as blue pills on the shelves and blue icons in the data
window.
Continuous data roles are used to measure continuous data and can take on any value within a
finite or infinite interval. Examples: unit price, time and profit or order quantity. Continuous
variables behave in a similar way in that they can take on any value. Continuous values are
shown as green pills.
There are many ways to create a story in Tableau. Each story point can be based on a different view or
dashboard, or the entire story can be based on the same visualization, just seen at different stages, with
different marks filtered and annotations added. You can use stories to make a business case or to simply
narrate a sequence of events.
By default, your story gets its title from its sheet name. To edit it, double-click the title. You can
also change your title’s font, color, and alignment. Click Apply to view your changes.
To start building your story, drag a sheet from the Story tab on the left and drop it into the
center of the view
To highlight a key takeaway for your viewers, drag a text object over to the story worksheet and
type your comment.
To further highlight the main idea of this story point, you can change a filter or sort on a field in
the view, then save your changes by clicking Update above the navigator box.
Tableau Drive is a methodology for scaling out self-service analytics. Drive is based on best practices
from successful enterprise deployments. The methodology relies on iterative, agile methods that are
faster and more effective than traditional long-cycle deployment.
A cornerstone of this approach is a new model of partnership between business and IT.
By adding the same calculation to ‘Group By’ clause in SQL query or creating a Calculated Field in the
Data Window and using that field whenever you want to group the fields.
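The idea of grouping by the same calculation can be sketched with made-up data, where a computed key plays the role of the calculated field:

```javascript
// A sketch with made-up sample data: grouping rows by a computed key,
// analogous to putting the same calculation in a SQL GROUP BY clause
// or using a calculated field for grouping in Tableau.
const orders = [
  { product: "A", sales: 40 },
  { product: "B", sales: 120 },
  { product: "C", sales: 90 },
];

// The computed key (a size band derived from sales) is the
// "calculated field" used for grouping.
const band = (o) => (o.sales >= 100 ? "large" : "small");

const totalsByBand = orders.reduce((acc, o) => {
  acc[band(o)] = (acc[band(o)] || 0) + o.sales;
  return acc;
}, {});
```

Because the key is a calculation rather than a raw column, the same grouping can be reused anywhere the field is referenced.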
40. Mention what is the difference between published data sources and embedded data sources in
Tableau?
The difference between published data source and embedded data source is that,
Published data source: It contains connection information that is independent of any workbook
and can be used by multiple workbooks.
Embedded data source: It contains connection information and is associated with a workbook.
You can embed interactive Tableau views and dashboards into web pages, blogs, wiki pages, web
applications, and intranet portals. Embedded views update as the underlying data changes, or as their
workbooks are updated on Tableau Server. Embedded views follow the same licensing and permission
restrictions used on Tableau Server. That is, to see a Tableau view that’s embedded in a web page, the
person accessing the view must also have an account on Tableau Server.
Alternatively, if your organization uses a core-based license on Tableau Server, a Guest account is
available. This allows people in your organization to view and interact with Tableau views embedded in
web pages without having to sign in to the server. Contact your server or site administrator to find out if
the Guest user is enabled for the site you publish to.
You can do the following to embed views and adjust their default appearance:
Get the embed code provided with a view:The Share button at the top of each view includes
embed code that you can copy and paste into your webpage. (The Share button doesn’t appear
in embedded views if you change the showShareOptions parameter to false in the code.)
Customize the embed code: You can customize the embed code using parameters that control
the toolbar, tabs, and more. For more information, see Parameters for Embed Code.
Use the Tableau JavaScript API: Web developers can use Tableau JavaScript objects in web
applications. To get access to the API, documentation, code examples, and the Tableau
developer community, see the Tableau Developer Portal.
43. Design a view in a map such that if user selects any state, the cities under that state has to show
profit and sales.
According to your question you must have state, city, profit and sales fields in your dataset.
Step 6: Right click on state field and select show quick filter.
44. Suppose I am using Tableau Desktop and have a live connection to Cloudera Hadoop data. I need to
press F5 to refresh the visualization. Is there any way to automatically refresh the visualization every 'x'
seconds instead of pressing F5?
All you need to do is replace the API src and server URL with yours in the page below.
<!DOCTYPE html>
<html lang="en">
<head>
<title>Tableau JavaScript API</title>
<script type="text/javascript" src="http://servername/javascripts/api/tableau_v8.js"></script>
</head>
<body>
<div id="tableauViz"></div>
<script type="text/javascript">
// Point the placeholder div at the published view.
var placeholderDiv = document.getElementById("tableauViz");
var url = "http://servername/t/311/views/Mayorscreenv5/Mayorscreenv2";
var options = {
    hideTabs: true,
    width: "100%",
    height: "1000px"
};
var viz = new tableauSoftware.Viz(placeholderDiv, url, options);
// Refresh the underlying data every 5000 ms (5 seconds).
setInterval(function() { viz.refreshDataAsync(); }, 5000);
</script>
</body>
</html>
45. Suppose my license expires today, will users be able to view dashboards or workbooks which I
published in the server earlier?
If your server license expires today, your username on the server will have the role 'unlicensed', which
means you cannot access the server, but others still can. The site admin can change the ownership to
another person so that the extracts do not fail.
46. Is Tableau software good for a strategic acquisition?
Yes, for sure! It gives you data insight to an extent that other tools can't. Moreover, it also helps you to
plan and pinpoint anomalies and improve your process for the betterment of your company.
47. Can we place an Excel file in a shared location and use it to develop a report and refresh it at
regular intervals?
Yes, we can do it. But for better performance we should use Extract.
48. Can Tableau Desktop be installed on a Mac?
Yes, Tableau Desktop can be installed on both Mac and Windows operating systems.
49. What is the maximum number of rows Tableau can utilize at one time?
Tableau is not restricted by the number of rows in the table. Customers use Tableau to access petabytes of
data because it only retrieves the rows and columns needed to answer your questions.
50. When publishing workbooks on Tableau Online, sometimes an error about needing to extract
appears. Why does it happen occasionally?
This happens when a user is trying to publish a workbook that is connected to an internal server or a file
stored on a local drive, such as a SQL server that is within a company's network.
When you add a table calculation, you must use all dimensions in the level of detail either for
partitioning (scoping) or for addressing (direction):
The dimensions that define how to group the calculation, that is, define the scope of data it is
performed on, are called partitioning fields. The table calculation is performed separately within each
partition.
The remaining dimensions, upon which the table calculation is performed, are called addressing fields,
and determine the direction of the calculation.
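The split between partitioning and addressing fields can be sketched with made-up data, using a running-total table calculation:

```javascript
// A sketch with made-up sample data: a running-total table calculation.
// "region" acts as the partitioning field (the total restarts for each
// region), and the row order within each region is the addressing
// direction along which the calculation runs.
const rows = [
  { region: "East", month: 1, sales: 10 },
  { region: "East", month: 2, sales: 20 },
  { region: "West", month: 1, sales: 5 },
  { region: "West", month: 2, sales: 7 },
];

const totals = {};
const withRunning = rows.map((r) => {
  // Accumulate separately within each partition.
  totals[r.region] = (totals[r.region] || 0) + r.sales;
  return { ...r, runningSales: totals[r.region] };
});
```

Changing which dimension keys the `totals` object is exactly the choice Tableau offers when you move a field between partitioning and addressing.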
2. What is the difference between sets and groups?
Sets: 1) A set is a grouping based on some condition. 2) Calculated fields can use sets.
Groups: 1) A group simply combines dimension members into higher-level categories. 2) Groups cannot be referenced in calculations.
No
Joining is a SQL term that refers to combining two data sources into a single data source. Blending is a
Tableau term that refers to combining two data sources into a single chart. The main difference
between them is that a join is done once at the data source and used for every chart, while a blend is
done individually for each chart.
Order of execution
1. LOD
2. Table Calculations
3. Reference lines
4. Can we draw 3 reference lines in a single chart?
No
The context filter is not frequently changed by the user – if the filter is changed, the database must
recompute and rewrite the temporary table, slowing performance.
No
11. How can we combine database and flat file data in Tableau Desktop?
You can combine them by connecting to the data twice, once for the database tables and once for the
flat file. Then go to Data -> Edit Relationships and give a join condition on a common column from the
database tables to the flat file.
Fact table consists of the measurements, metrics or facts of a business process. It is located at the
center of a star schema or a snowflake schema surrounded by dimension tables.
Steps to automate the reports: while publishing the report to Tableau server, you will find the option to
schedule reports. Click on this to select the time when you want to refresh the data.
14. Do parameters have dropdown lists?
Yes, parameters do have their independent dropdown lists enabling users to view the data entries
available in the parameter during its creation.
The Pages shelf lets you break a view into a series of pages so you can better analyze how a specific field
affects the rest of the data in a view.
Tree maps – Display data in nested rectangles. We use dimensions to define the structure of the tree
map and measures to define the size or color of the individual rectangles. We cannot add trend lines in
tree maps.
Scatter plot – provides an easy way to visualize relationships between numerical variables. We can add
trend lines.
Edit the quick filter from the pull-down arrow. Go to “Customize” and uncheck the “Show “All” Value”
checkbox.
By adding the same calculation to ‘Group By’ clause in SQL query or creating a Calculated Field in the
Data Window and using that field whenever you want to group the fields.
• Using groups in a calculation. You cannot reference ad-hoc groups in a calculation
• Blend data using groups created in the secondary data source: Only calculated groups can be used in
data blending if the group was created in the secondary data source.
• Use a group in another workbook. You can easily replicate a group in another workbook by copy and
pasting a calculation.
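The last point, replicating a group as a calculation, can be sketched with hypothetical majors (reusing the test-scores example from earlier in this document):

```javascript
// A sketch with hypothetical majors: replicating an ad-hoc group as a
// calculation. Unlike the group itself, this calculated mapping can be
// referenced in other calculations or copied into another workbook.
const majorCategory = (major) =>
  ["Physics", "Chemistry", "Biology"].includes(major)
    ? "Science"
    : "Other";
```

In Tableau the same mapping would be written as an IF or CASE calculated field rather than JavaScript; the point is that the grouping lives in a reusable formula.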
What is Tableau?
Tableau is business intelligence software that allows anyone to connect to their data, and then
visualize and create interactive, shareable dashboards.
What is a data Source page?
A page where you can set up your data source. The Data Source page generally consists of four main
areas: left pane, join area, preview area, and metadata area.
What is an extract in Tableau?
A saved subset of a data source that you can use to improve performance and analyze offline.
What is the Format pane in Tableau?
A pane that contains formatting settings that control the entire worksheet, as well as individual fields in
the view.
What is LOD expression in Tableau?
A syntax that supports aggregation at dimensionalities other than the view level. With level of detail
expressions, you can attach one or more dimensions to any aggregate expression.
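The idea behind a FIXED level-of-detail expression can be sketched with made-up data: an aggregate is computed at a dimensionality other than the view's, then attached to every detailed row.

```javascript
// A sketch with made-up sample data: an aggregate fixed at the region
// level, regardless of the other fields (here, product) in the view -
// the idea behind a FIXED level-of-detail expression.
const rows = [
  { region: "East", product: "A", sales: 10 },
  { region: "East", product: "B", sales: 20 },
  { region: "West", product: "A", sales: 5 },
];

// First, aggregate at the fixed dimensionality (region).
const fixedRegionSales = {};
for (const r of rows) {
  fixedRegionSales[r.region] = (fixedRegionSales[r.region] || 0) + r.sales;
}

// Then attach the region-level total to every detailed row.
const annotated = rows.map((r) => ({
  ...r,
  regionSales: fixedRegionSales[r.region],
}));
```

Each row now carries both its own detail and the coarser aggregate, which is what lets LOD expressions mix dimensionalities in one view.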
What is the difference between Quick Filter and Normal filter?
Normal Filter is used to restrict the data from database based on selected dimension or measure. But
Quick Filters are used to give a chance to user for dynamically changing data members at run time.
What is Tableau Reader?
Tableau Reader is a free viewing application that lets anyone read and interact with packaged
workbooks created by Tableau Desktop.
Can we have multiple value selection in parameter?
No
Which join is used in data blending?
There won't be any joins as such; we just give the column references, like a primary and foreign
key relation.
What is required to blend data sources?
There should be a common dimension to blend the data sources into a single worksheet.
What is a Dimension?
Tableau treats any field containing qualitative, categorical information as a dimension. This includes
any field with text or date values.
What is a Measure?
A measure is a field that is dependent on the value of one or more dimensions. Tableau treats any field
containing numeric (quantitative) information as a measure.
What does the extension .twbx represent in Tableau?
It is a file which represents a Tableau Packaged Workbook, in which the .twb file is grouped together
with its data sources.
What is the Marks card in Tableau?
A card to the left of the view where you can drag fields to control mark properties such as type, color,
size, shape, label, tooltip, and detail.
What are shelves in Tableau?
They are Named areas to the left and top of the view. You build views by placing fields onto the
shelves. Some shelves are available only when you select certain mark types.
What is a Tableau workbook?
It is a file with a .twb extension that contains one or more worksheets (and possibly also dashboards
and stories).
In Tableau what is a worksheet?
A sheet where you build views of your data by dragging fields onto shelves.
What is a context filter?
In a context filter the filter condition is applied first to the data source and then some other filters are
applied only to the resulting records.
What is Dual Axis?
You can compare multiple measures using dual axes, which are two independent axes that are layered
on top of each other.
What is a page shelf in Tableau?
The Pages shelf is used to control the display of output by choosing the sequence of display.
What are quick table calculations in Tableau?
These are inbuilt calculations in Tableau which we normally use, for example, to calculate percentage
changes.
What is data blending?
Data blending is used to blend data from multiple data sources on a single worksheet. The data is
joined on common dimensions.
Can we have multiple value selection in parameter?
No
What does the Extract option do when connecting to data?
It imports the entire data source into Tableau's fast data engine as an extract and saves it in the
workbook.
What are parameters and when do you use it?
Parameters are dynamic values that can replace constant values in calculations.
What is TDE file in Tableau?
It refers to the file that contains data extracted from external sources like MS Excel, MS Access or CSV
file.
What is a story in Tableau?
A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey
information.
What is a Published data source?
It contains connection information that is independent of any workbook and can be used by multiple
workbooks.
When do we use Joins versus Blending?
If data resides in a single source, we use joins; but when your data is not in one place, blending is used.
How to automate reports using Tableau software?
You need to publish the report to Tableau Server; while publishing you will find an option to schedule
reports. You just need to select the time when you want to refresh the data.
What is Tableau Show Me?
Show Me is used to apply a required view to the existing data in the worksheet. Those views can be a
pie chart, scatter plot or a line chart.
What is a Tableau data pane?
A pane on the left side of the workbook that displays the fields of the data sources to which Tableau is
connected.
What is a calculated field in Tableau?
A new field that you create by using a formula to modify the existing fields in your data source.
What is crosstab chart?
It is a text table view. Use text tables to display the numbers associated with dimension members.
How to check the metadata of a table?
In the menu Data -> New connection, drag the table to the data pane to view its metadata.
How to create a column Alias?
In the menu Data -> New connection, open the table metadata and click on the column name to create
an alias.
What does the REPLACE function do in Tableau?
The REPLACE function searches a given string for a substring and replaces it with a replacement string.
Which function returns the number of items in a group?
COUNT. The COUNT function returns the number of items (non-null values) in a group.
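Both behaviors are easy to mirror in Python. This is a rough sketch of the semantics only, not Tableau's implementation; the helper names simply echo the Tableau function names.

```python
def replace(string, substring, replacement):
    # REPLACE("Version8.5", "8.5", "9.0") yields "Version9.0"
    return string.replace(substring, replacement)

def count(values):
    # COUNT returns the number of non-null items in the group
    return sum(1 for v in values if v is not None)

print(replace("Version8.5", "8.5", "9.0"))  # Version9.0
print(count([10, None, 30, 40]))            # 3
```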
What is a Gantt Chart?
A Gantt chart shows the progress of the value of a task or resource over a period of time, so for a
Gantt chart a time dimension is an essential field.
What is Forecasting in Tableau?
Forecasting is about predicting the future value of a measure. There are many mathematical models for
forecasting; Tableau uses the model known as exponential smoothing.
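The idea behind exponential smoothing can be sketched in a few lines of Python. Note this is only the simplest member of the family: Tableau fits the smoothing parameters and seasonality automatically, whereas here alpha is fixed by hand.

```python
def simple_exponential_smoothing(series, alpha=0.5):
    """Each new level is a weighted blend of the latest observation and
    the previous level; the final level serves as the one-step forecast.
    (A sketch only; Tableau chooses alpha and seasonality itself.)"""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

print(simple_exponential_smoothing([10, 12, 14, 16], alpha=0.5))  # 14.25
```

Recent observations dominate the forecast because older values are discounted geometrically, which is why the method reacts to trends without being thrown off by old history.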
What is a Trendline in tableau?
Trend lines are used to predict the continuation of a certain trend of a variable. They also help to identify
the correlation between two variables by observing the trend in both of them simultaneously.
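The correlation a trend line hints at can also be computed directly. A minimal Pearson coefficient in Python (the sample points are made up):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient: +1 and -1 indicate perfectly
    linear trends, 0 indicates no linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 for a perfectly linear trend
```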
9) Why Tableau?
Whether your data is in an on-premise database, a database, a data warehouse, a cloud application or
an Excel file, you can analyze it with Tableau. You can create views of your data and share it with
colleagues, customers, and partners. You can use Tableau to blend it with other data. And you can keep
your data up to date automatically.
11) What are the differences between Tableau Software GoodData and Traditional BI (Business
Objects, etc.)?
At a high level there are four major differences: speed, analysis layer, data layer, and enterprise readiness.
13) What is the difference between heat map and tree map?
A heat map is a great way to compare categories using color and size. In this, you can compare two
different measures. Tree map is a very powerful visualization, particularly for illustrating hierarchical
(tree – structured) data and part – to – whole relationships.
15) How will you publish and schedule workbook in tableau server?
First create a schedule for a particular time, then create an extract for the data source and publish the
workbook to the server. Before you publish, there is an option called Scheduling and Authentication;
click on that, select the schedule you created from the drop-down, and publish. Also publish the data
source and assign the schedule. This schedule will automatically run at the assigned time and the
workbook is refreshed.
22) What is benefit of Tableau extract file over the live connection?
Extract can be used anywhere without any connection and you can build your own visualizations
without connecting to Database.
23) How to combine two excel files with same fields but different data (different years)?
I have 5 different excel files (2007.xls, 2008.xls..2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). How can I combine the film name,
genre and profitability so that I can see the visualization of 2007 to 2011 in a single chart?
27) How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file. Then use Data -> Edit
Relationships to give a join condition on a common column from the db tables to the flat file.
30) How to create cascading filters without context filter ?
I have filter1 and filter2. Based on filter1 I need to filter filter2's data.
Ex: Filter1 is Country and Filter2 is States.
I have chosen country as USA and filter2 should display only USA states.
Choose the options of Filter2 (states):
select the option "Only relevant values".
34) What are the differences between Tableau desktop and Tableau Server?
While Tableau desktop performs data visualization and workbook creation, Tableau server is used to
distribute these interactive workbooks and/or reports to the right audience. Users can edit and update
the workbooks and dashboards online on the server but cannot create new ones. However, there are
limited editing options when compared to Desktop.
Tableau Public is again a free tool consisting of Desktop and Server components accessible to anyone.
Creating one or more context filters improves performance as users do not have to create extra filters
on large data source, reducing the query-execution time.
You can create one by dragging a field into the 'Filters' tab and then right-clicking that field and selecting
"Add to Context".
43) What are the limitations of context filters?
Tableau takes time to place a filter in context. When a filter is set as a context filter, the software creates
a temporary table for that particular context filter. This table will reload each time and consists of all
values that are not filtered by either the Context or Custom SQL filter.
• Horizontal- Horizontal containers allow the user to group worksheets and dashboard components left to right across your page and edit the height of all elements at once.
• Vertical- Vertical containers allow the user to group worksheets and dashboard components top to
bottom down your page and edit the width of all elements at once.
• Text
• Image Extract: A Tableau workbook is in XML format. In order to extract images, Tableau applies
codes to extract an image which can be stored in XML.
• Web [URL ACTION]:- A URL action is a hyperlink that points to a Web page, file, or other web-based
resource outside of Tableau. You can use URL actions to link to more information about your data that
may be hosted outside of your data source. To make the link relevant to your data, you can substitute
field values of a selection into the URL as parameters.
• Use a group in another workbook. You can easily replicate a group in another workbook by copy and
pasting a calculation.
Q3. What are some of the new features introduced in Tableau 9.1?
Ans.
Visual analytics
Mobile
Data
Enterprise
Q4. Can you create relational joins in Tableau without creating a new table?
Ans. Yes, you can create relational joins without creating a new table.
Q7. What are parameters in Tableau?
Ans. They are dynamic values that can replace constant values in calculations, reference lines and filters.
Q8. Mention whether you can have multiple value selection in parameter?
Ans. No
Web
Text
Image Extract
Horizontal
Vertical
Tableau Desktop
Tableau Reader
Tableau Public
Tableau Server
Use extracts
Limit the amount of data you bring in – both rows and columns
Switch data source using the “extract function”
Pre-aggregate your data before bringing it into Tableau.
Left
Right
Inner
Full outer
Q. What is the current latest version of Tableau Desktop(as of Sep, 25th 2017)?
Current versions: Tableau Desktop version 10.4
Q. Why tableau?
Whether your data is in an on-premise database, a database, a data warehouse, a cloud application or
an Excel file, you can analyze it with Tableau. You can create views of your data and share it with
colleagues, customers, and partners. You can use Tableau to blend it with other data. And you can keep
your data up to date automatically.
Q. What are Filters? How many types of filters are there in Tableau?
A filter restricts unnecessary data and shows only the exact data needed. Basically, filters are of 3 types:
1. Quick filter
2. Context filter
3. Datasource filter
Q. Can we use non-used columns (columns which are not used in reports but which the data source
has) in Tableau Filters?
Yes!
Ex. In data source I have column like
empID, EmpName, EmpDept,EmpDsignation, EmpSalary
In reports I am using EmpName on columns and EmpSalary on rows.
I can use EmpDesignation on Filters.
Q. How to combine two excel files with same fields but different data (different years)?
I have 5 different excel files (2007.xls, 2008.xls..2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). How can I combine the film name,
genre and profitability so that I can see the visualization of 2007 to 2011 in a single chart?
1. Inner Join: An inner join loads only the matching records from both tables. Inner join condition:
TableA.id = TableB.id
2. Outer Join:
The outer join is divided into 3 types:
a) Left Outer Join
b) Right Outer Join
c) Full Outer Join
Left outer join: displays the complete data from the left table + the matching records from the right.
Condition: tablea.id = tableb.id(+)
Right outer join: displays the complete data from the right table + the matching records from the left.
Condition: tablea.id(+) = tableb.id
Full outer join: loads the complete data from both the left table and the right table. Condition: TableA
FULL OUTER JOIN TableB ON tablea.id = tableb.id
3. Self-Join: if we perform a join of a table to itself, such a join is called a self-join.
4. Non-Equi Join: if the join condition uses operators other than equality ("="), such joins are called
non-equi joins.
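As a runnable sketch of the inner and left outer joins described above, using SQLite (the table and column names are made up for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a(id INTEGER, name TEXT);
    CREATE TABLE b(id INTEGER, dept TEXT);
    INSERT INTO a VALUES (1, 'ANN'), (2, 'BOB'), (3, 'CAT');
    INSERT INTO b VALUES (1, 'HR'), (2, 'IT'), (4, 'QA');
""")

# Inner join: only ids present in both tables survive.
inner = con.execute(
    "SELECT a.name, b.dept FROM a JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()
print(inner)  # [('ANN', 'HR'), ('BOB', 'IT')]

# Left outer join: every row of a, with NULL where b has no match.
left = con.execute(
    "SELECT a.name, b.dept FROM a LEFT JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()
print(left)   # [('ANN', 'HR'), ('BOB', 'IT'), ('CAT', None)]
```

A right outer join is the mirror image (every row of b), and a full outer join is the union of the left and right results.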
Q. Can we place an excel file in a shared location and use it to develop a report and refresh it in
regular intervals?
Yes, you can do it… but for better performance use an extract.
Q. What is the difference between twb and twbx file extensions? Please explain.
Twb is a live connection; it points to the data source, and the user receiving the twb needs permission to
that data source since no data is included. .twbx takes data offline, storing the data as a packaged
(zip-like) file, thereby eliminating the need for permissions from the end user; it is a snapshot in time of
the data as of the time it was saved as .twbx.
Q. Can you get values from two different sources as a single input into parameter?
No you cannot. Each data source corresponds to a Tableau workbook. If you include both data variables
in the same data source you can input them in the same workbook.
Q. What are the similarities and differences between Tableau software and Palantir?
Palantir and Tableau are very different. Palantir has its roots in large data computer science problems
involving security, payments, fraud detection and the likes. Customers/Investors include Paypal, CIA and
others.
Tableau is a visualization player, with roots in Stanford University research. Its Visual Query Language
(VizQL) allows users to build visualizations on top of standard data warehouses or spreadsheets.
Q. Design a view to show region-wise profit and sales. I did not want line or bar charts to be used for
profit and sales. How will you design it? Please explain.
Generate the Map using cities –>then Drag the Profit and sales to the Details–>Add the state as Quick
filter
Q. Design a view in a map such that if user selects any state the cities under that state has to show
profit and sales.
If you want to show the sales and profit in each and every city under the states in the same worksheet,
you should have State, City, Sales and Profit fields in your dataset.
1. Double click on the State field.
2. Drag the City and drop it into the Marks card (under the State field).
Q. How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file. Then use Data -> Edit
Relationships to give a join condition on a common column from the db tables to the flat file.
Q. What are the major differences between tableau version 7.0 and tableau version 8.0?
1. New visualizations are introduced like treemap, bubble chart and box and whisker plot
2. We can copy worksheet directly from one workbook to another workbook
3. Introduced R script
Q. Suppose my license expires today. Will users be able to view the dashboards or workbooks which I
published to the server earlier?
If your server license expires today, your user name on the server will have the role ‘unlicensed’ which
means you cannot access, but others can. The Site Admin can ‘Change Ownership’ to another person, so
extracts if enabled do not fail.
Q. Think that I am using Tableau desktop and have a live connection to Cloud era hadoop data. I need
to press F5 to refresh the visualization. Is there any way to automatically refresh the visualization
every x minutes instead of pressing F5 every time?
Here is the example of refreshing dashboard in every 3 seconds, Replace api src and server url with
yours. The interval below is for 3 seconds.
Q. What are the differences between Tableau Software, GoodData and Traditional BI (Business
Objects, etc.)?
You could talk feature-functionality for days, but at a high level there are four major differences.
1. Speed: How fast can you get up and running with the system, answer questions, design and share
dashboards and then change them? This is where systems like Tableau and GoodData are far better
than old-school business intelligence like Business Objects or Cognos. Traditional systems took months
or years to implement, with costs running to millions. Tableau has a free trial that installs in minutes and
GoodData is cloud-based, so they are faster to implement by orders of magnitude. They are also faster
to results: traditional BI requires IT and developers to make any changes to reports, so business users
are stuck in a queue waiting to get anything done. Tableau and GoodData provide more of a self-service
experience.
2. Analysis layer: This is where Tableau excels. It has a powerful and flexible drag & drop visualization
engine based on some technology from Stanford. GoodData and traditional BI typically provide some
canned reports but changing them requires significant time and money.
3. Data layer: This is where the three options are most different:
GoodData requires you to move your data to its cloud. Traditional BI typically requires you to move your
data to its data warehouse system. Tableau connects to a variety of existing data sources and also
provides a fast in – memory data engine, essentially a local database. Since most enterprises have their
data stored all over the place, this provides the most choice and lets companies use the investment
they’ve already made.
4. Enterprise readiness: Traditional BI and Tableau do well here, with enterprise – level security and
high scalability.
Q. What is the Difference between quick filter and Normal filter in tableau?
A quick filter is used to view the filtering options and to select an option directly in the view. A normal
filter is something where you can limit the options from the list or use some conditions to limit the data
by field or value.
The company was founded in Mountain View, California in January, 2003 by Chris Stolte, Christian
Chabot and Pat Hanrahan.
Tableau is business intelligence software that allows anyone to easily connect to data, then visualize and
create interactive, shareable dashboards. It’s easy enough that any Excel user can learn it, but powerful
enough to satisfy even the most complex analytical problems. Securely sharing your findings with others
only takes seconds.
Tableau offers five main products: Tableau Desktop, Tableau Server, Tableau Online, Tableau Reader and
Tableau Public.
Data visualization refers to the techniques used to communicate data or information by encoding it as
visual objects (e.g. points, lines or bars) contained in graphics.
Tableau Desktop is based on breakthrough technology from Stanford University that lets you drag &
drop to analyze data. It is a great data visualization tool; you can connect to data in a few clicks, then
visualize and create interactive dashboards with a few more.
Tableau Server is browser- and mobile-based insight anyone can use. Publish dashboards with Tableau
Desktop and share them throughout your organization. It’s easy to set up and even easier to run.
Tableau Public is a free service that lets anyone publish interactive data to the web. Once on the web,
anyone can interact with the data, download it, or create their own visualizations of it. No programming
skills are required. Be sure to look at the gallery to see some of the things people have been doing with
it.
Tableau performance is based on data source performance. If the data source takes more time to
execute a query, then Tableau must wait for that time.
What are dimensions and facts?
Dimensions are the descriptive text columns and facts are measures (numerical values). Dimension ex:
Product Name, City. Facts: Sales, Profit.
By adding the same calculation to the 'Group By' clause in the SQL query, or by creating a Calculated
Field in the Data Window and using that field whenever you want to group the fields.
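The first option can be tried directly with SQLite; the calculation appears in both the SELECT and the GROUP BY clause. Table and column names here are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(city TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('NY', 100), ('ny', 50), ('LA', 70);
""")

# The same calculation (UPPER(city)) is used in SELECT and GROUP BY,
# so 'NY' and 'ny' fall into one group.
rows = con.execute("""
    SELECT UPPER(city) AS city_group, SUM(amount)
    FROM orders
    GROUP BY UPPER(city)
    ORDER BY city_group
""").fetchall()
print(rows)  # [('LA', 70), ('NY', 150)]
```

A calculated field in Tableau plays the same role as the `UPPER(city)` expression: group on the computed value rather than the raw column.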
Blend data using groups created in the secondary data source: Only calculated groups can be used in
data blending if the group was created in the secondary data source.
The Tableau Desktop Log files are located in C:\Users\\My Documents\My Tableau Repository. If you
have a live connection to the data source, check the log.txt and tabprotosrv.txt files. If you are using an
extract, check the tdeserver.txt file. The tabprotosrv.txt file often shows detailed information about
queries.
While Tableau lets you analyze databases and spreadsheets like never before, you don’t need to know
anything about databases to use Tableau. In fact, Tableau is designed to allow business people with no
technical training to analyze their data efficiently. Tableau is based on three simple concepts:
Connect: Connect Tableau to your data source. Note that Tableau does not import the data; instead it
queries the database directly.
Analyze: Analyzing data means viewing it, filtering it, sorting it, performing calculations on it,
reorganizing it, summarizing it, and so on. Using Tableau you can do all of these things by simply
arranging fields of your data source on a Tableau worksheet. When you drop a field on a worksheet,
Tableau queries the data using standard drivers and query languages (like SQL and MDX) and presents a
visual analysis of the data.
Share: You can share results with others either by sharing workbooks with other Tableau users, by
pasting results into applications such as Microsoft Office, printing to PDF or by using Tableau Server to
publish or embed your views across your organization.
What are the difference between tableau 7.0 and 8.0 versions?
New visualizations are introduced like tree map, bubble chart and box-and-whisker plot
Introduced R script
With Kerberos support, Tableau 8.3 advances enterprise-grade data analysis with these enhancements:
Provides seamless, single sign-on experience from Tableau client to back-end data sources
Protects sensitive data with delegated access and viewer credential management
Explain the relationship difference between Tableau Workbook, Story, Dashboard, and Worksheets.
Tableau uses a WORKBOOK and SHEET file structure, much like Microsoft Excel. A WORKBOOK contains
SHEETS, which can be a WORKSHEET , a DASHBOARD , or a STORY .
A WORKSHEET contains a single view along with shelves, legends, and the Data pane.
A STORY contains a sequence of worksheets or dashboards that work together to convey information.
Parameters are dynamic values that can replace constant values in calculations and can serve as filters
What are Filters? How many types of filters are there in Tableau?
A filter restricts unnecessary data and shows only the exact data needed. Basically, filters are of 3 types:
Quick filter
Context filter
Datasource filter
Whenever we create a context filter, Tableau will create a temporary table for this particular filter set,
and the other filters will be applied on the context filter data, like cascading parameters. Suppose we
have created a context filter on countries and have chosen USA and India: Tableau will create a
temporary table for these two countries' data, and any other filters will be applied on this two-country
data. If we don't have any context filter, each individual record is checked against all filters.
The context filter should not be frequently changed by the user: if the filter is changed, the database
must recompute and rewrite the temporary table, slowing performance.
When you set a dimension to context, Tableau creates a temporary table that will require a reload each
time the view is initiated. For Excel, Access and text data sources, the temporary table created is in an
Access table format. For SQL Server, MySQL and Oracle data sources, you must have permission to
create a temporary table on your server. For multidimensional data sources, or cubes, temporary tables
are not created, and context filters only define which filters are independent and dependent.
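The ordering described above can be sketched in Python: the context filter runs first and materializes a smaller "temporary table", and every other filter then scans only that subset. The row values are made up for the demo.

```python
rows = [
    {"country": "USA",   "state": "CA", "sales": 100},
    {"country": "USA",   "state": "NY", "sales": 40},
    {"country": "India", "state": "KA", "sales": 70},
    {"country": "UK",    "state": "LN", "sales": 90},
]

# Context filter: materialize the subset once (the "temporary table").
context = [r for r in rows if r["country"] in ("USA", "India")]

# Remaining filters only scan the smaller subset.
filtered = [r for r in context if r["sales"] > 50]

print([r["state"] for r in filtered])  # ['CA', 'KA']
```

Without the context filter, every filter would scan all four rows; with it, the sales filter sees only three.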
What is the Difference between quick filter and Normal filter in tableau?
A quick filter is used to view the filtering options and to select an option directly in the view. A normal
filter is something where you can limit the options from the list or use some conditions to limit the data
by field or value.
How to combine two excel files with same fields but different data (different years)?
I have 5 different excel files (2007.xls, 2008.xls..2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). How can I combine the film name,
genre and profitability so that I can see the visualization of 2007 to 2011 in a single chart?
We can join a maximum of 32 tables; it is not possible to combine more than 32 tables.
R is a popular open-source environment for statistical analysis. Tableau Desktop can now connect to R
through calculated fields and take advantage of R functions, libraries, and packages and even saved
models. These calculations dynamically invoke the R engine and pass values to R via the Rserve package,
and are returned back to Tableau.
Tableau Server can also be configured to connect to an instance of Rserve through the tabadmin utility,
allowing anyone to view a dashboard containing R functionality.
Combining R with Tableau gives you the ability to bring deep statistical analysis into a drag-and-drop
visual analytics environment.
The Pages shelf is a powerful part of Tableau that you can use to control the display of output as well as
the printed results of the output.
The difference lies in the application. Parameters allow users to insert their own values, which can be
integers, floats, dates or strings, and these can be used in calculations. However, filters receive only
values users choose to 'filter by' from the list, which cannot be used to perform calculations. Users can
dynamically change measures and dimensions with a parameter, but filters do not support this feature.
How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file. Then use Data -> Edit
Relationships to give a join condition on a common column from the db tables to the flat file.
The concept of context filter in Tableau makes the process of filtering smooth and straightforward. It
establishes a filtering hierarchy where all other filters present refer to the context filter for their
subsequent operations. The other filters now process data that has been passed through the context
filter.
Creating one or more context filters improves performance as users do not have to create extra filters
on large data source, reducing the query-execution time.
You can create one by dragging a field into the 'Filters' tab and then right-clicking that field and selecting
"Add to Context".
Add a custom color code. Note: In Tableau 9.0 we have a color picker option.
TDE is a Tableau desktop file that contains a .tde extension. It refers to the file that contains data
extracted from external sources like MS Excel, MS Access or CSV file.
There are two aspects of TDE design that make them ideal for supporting analytics and data discovery.
One is how they are structured, which impacts how they are loaded into memory and used by Tableau.
This is an important aspect of how TDEs are "architecture aware": architecture-awareness means that
TDEs use all parts of your computer memory, from RAM to hard disk, putting each part to work on what
best fits its characteristics.
How to design a view to show region-wise profit and sales, if line and bar charts should not be used for
profit and sales?
Generate the Map using cities –>then Drag the Profit and sales to the Details–>Add the state as Quick
filter
Multiple Measures are shown in single axis and also all the marks shown in single pane
Http://onlinehelp.tableau.com/current/pro/online/mac/en-Us/multiplemeasures_blendedaxes.html
Unlike data joining, data blending in Tableau allows combining of data from different sources and
platforms. For instance, you can blend data present in an Excel file with that of an Oracle DB to create a
new dataset.
Sample data (id, ename, salary, dept): 6, ASHLEY, 25000, HR
Drag ename onto Columns and salary onto Rows; we get SUM(salary) for each individual employee.
When you look at the aggregated data in the view above, each bar represents all transactions for a
specific employee, summed up or averaged into a single value. Now say that you want to see the
individual salary transactions for each employee. You can create a view like that by clearing
Analysis > Aggregate Measures.
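The aggregate-versus-disaggregate behaviour described above can be sketched outside Tableau; the transaction data below is a hypothetical stand-in for the employee table:

```python
from collections import defaultdict

# Hypothetical salary transactions: (employee, amount). In Tableau's default
# view, each employee appears as one aggregated bar (SUM of salary).
transactions = [
    ("krishna", 5000), ("bbc", 13000),
    ("vamshi", 19000), ("vamshi", 2000), ("krishna", 1000),
]

# Aggregated view: one value per employee, like SUM(salary) on the Rows shelf.
aggregated = defaultdict(int)
for name, amount in transactions:
    aggregated[name] += amount

# Disaggregated view (Aggregate Measures cleared): every individual
# transaction remains its own mark.
disaggregated = transactions

print(aggregated["vamshi"])   # 21000
print(len(disaggregated))     # 5
```

Clearing Aggregate Measures in Tableau corresponds to plotting `disaggregated` directly instead of the per-employee sums.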
Tableau desktop: desktop environment to create and publish standard and packaged workbooks.
Tableau Public: workbooks available publicly online for users to download and access the included data.
Whenever we create a context filter, Tableau creates a temporary table for that particular filter set, and
the other filters are applied to the context-filtered data, like cascading parameters. Suppose we have
created a context filter on Country and chosen USA and India: Tableau creates a temporary table for
these two countries' data, and any other filters are applied to that data. If we don't have any context
filter, each individual record is checked against all filters.
The context filter should not be changed frequently by the user: if the filter is changed, the database
must recompute and rewrite the temporary table, slowing performance.
When you set a dimension to context, Tableau creates a temporary table that requires a reload each
time the view is initiated. For Excel, Access and text data sources, the temporary table is created in an
Access table format. For SQL Server, MySQL and Oracle data sources, you must have permission to
create a temporary table on your server. For multidimensional data sources (cubes), temporary tables
are not created, and context filters only define which filters are independent and which are dependent.
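The cascade described above can be sketched as materialising a "temporary table" for the context filter first, then running the remaining filters against only those rows (sample rows are hypothetical):

```python
# Hypothetical rows: (country, state, sales).
rows = [
    ("USA", "California", 100), ("USA", "Texas", 80),
    ("India", "Goa", 60), ("UK", "London", 90),
]

# Context filter: build the temporary table for USA and India only.
context = [r for r in rows if r[0] in ("USA", "India")]

# Ordinary filters now run against the context table, not the full data.
high_sales = [r for r in context if r[2] >= 80]

print(high_sales)  # [('USA', 'California', 100), ('USA', 'Texas', 80)]
```

Without a context filter, the `high_sales` condition would instead be checked against every record in `rows`, which is why context filters can cut query time on large sources.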
Tableau offers five main products: Tableau Desktop, Tableau Server, Tableau Online, Tableau Reader and
Tableau Public.
Q. What is the current latest version of Tableau Desktop (as of Sep 25th, 2017)?
Data visualization refers to the techniques used to communicate data or information by encoding it as
visual objects (e.g. points, lines or bars) contained in graphics.
Q. Why tableau?
Whether your data is in an on-premise database, a data warehouse, a cloud application or an Excel file,
you can analyze it with Tableau. You can create views of your data and share them with colleagues,
customers, and partners. You can use Tableau to blend it with other data. And you can keep your data
up to date automatically.
Q. What are Filters? How many types of filters are there in Tableau?
A filter restricts the view to exactly the data needed by excluding unnecessary records. Basically, filters are of 3 types:
1. Quick filter
2. Context filter
3. Datasource filter
Sample data (id, ename, salary, dept):
madhu, 300
3, krishna, 5000, .net
2, bbc, 13000, testing
5, vamshi, 19000, .net
Drag ename onto Columns and salary onto Rows; we get SUM(salary) for each individual employee.
Q. Can we use non-used columns (columns which are not used in reports but which exist in the data source)
in Tableau filters?
Yes!
Extract can be used anywhere without any connection and you can build your own visualizations
without connecting to Database.
Q. How to combine two excel files with same fields but different data (different years)?
I have 5 different excel files (2007.xls, 2008.xls ... 2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). Can someone tell me how I can
combine the film name, genre and profitability so that I can see the visualization of 2007 to 2011 in a
single chart?
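Newer Tableau versions can union such files directly when connecting; another option is to pre-combine the files with a small script before connecting. A minimal sketch of that pre-combining step (the file contents and the added year tag are illustrative assumptions, not part of Tableau itself):

```python
import csv
import io

# Stand-ins for 2007.xls ... 2011.xls, already exported to CSV text.
yearly_files = {
    2007: "film name,genre,profitability\nFilm A,Drama,1.2\n",
    2008: "film name,genre,profitability\nFilm B,Comedy,0.9\n",
}

combined = []
for year, text in yearly_files.items():
    for row in csv.DictReader(io.StringIO(text)):
        row["year"] = year          # tag each row with its source year
        combined.append(row)

print(len(combined))  # 2 rows: one chart-ready table covering all years
```

Once stacked like this, a single "year" column lets one chart span 2007 to 2011 instead of five separate sources.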
We can join a maximum of 32 tables; it's not possible to combine more than 32 tables.
Joins in Tableau:
For e.g.: your client is in the Healthcare domain and uses SQL Server as their database. In SQL Server there
may be many tables, like a Claims table, a Rejected Claims table, a Customer table. Now the client wants to
know customer-wise claims and customer-wise rejected claims, using joins. A join is a query that
combines the data from 2 or more tables by making use of a join condition.
If we create joins on the fields in Tableau, all the table names are suffixed with $. While performing
joins on multiple tables, always start with the tables having the least amount of data, so that we can
improve performance.
Based on the operator used in the join condition, joins are of two kinds:
1. Equi Join: if the join condition uses the equality operator "=", such a join is called an Equi Join.
2. Non Equi Join: if the join condition uses any operator other than "=", such as <, >, <=, >= or !=, such a
join is called a Non Equi Join.
1. Inner Join,
2. Outer Join,
3. Self-Join.
1. Inner Join: an inner join loads only the matching records from both tables. Inner join condition:
TableA.id = TableB.id
2.Outer Join:
Left outer join: displays the complete data from the left table plus the matching records from the right.
Condition (Oracle syntax): tablea.id = tableb.id(+)
Right outer join: displays the complete data from the right table plus the matching records from the left.
Condition (Oracle syntax): tablea.id(+) = tableb.id
Full outer join: a full outer join loads the complete data from both the left and the right table. Condition:
TableA FULL OUTER JOIN TableB ON tablea.id = tableb.id
3. Self-Join: if we join a table to itself, such a join is called a self-join.
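The join types above can be checked against a real SQL engine. The sketch below uses SQLite with hypothetical Customer and Claims tables echoing the healthcare example (table names and rows are made up):

```python
import sqlite3

# Hypothetical Customer / Claims tables mirroring the healthcare example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE claims   (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO claims   VALUES (1, 250.0);
""")

# Inner join: only the matching records from both tables.
inner = con.execute("""
    SELECT c.name, cl.amount
    FROM customer c
    JOIN claims cl ON c.id = cl.customer_id
    ORDER BY c.id
""").fetchall()

# Left outer join: all customers, plus matching claims (NULL where none).
left = con.execute("""
    SELECT c.name, cl.amount
    FROM customer c
    LEFT JOIN claims cl ON c.id = cl.customer_id
    ORDER BY c.id
""").fetchall()

print(inner)  # [('Asha', 250.0)]
print(left)   # [('Asha', 250.0), ('Ravi', None)]
```

Ravi appears only in the left outer join, with a NULL claim amount, which is exactly the difference between the two join types described above.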
For ex: your client is the same Healthcare client. They operate their services in Asia, Europe, NA and so
on, and they maintain the Asia data in SQL, the Europe data in SQL Server and the NA data in MySQL.
Now your client wants to analyze their business across the world in a single worksheet, so you can't
perform a join here.
Normally in Tableau we perform analysis on a single data server. If we want to perform analysis from
multiple data sources in a single sheet, then we have to use a new concept called data blending.
Data blending mixes the data from different data sources and allows users to perform the analysis in
a single sheet. Blending means mixing: if we are mixing data sources, it is called data blending.
1. If we are performing data blending on 2 data sources, these 2 data sources should have at least 1
common dimension.
1. Automatic way
2. Custom way
1. Automatic way: in the automatic way, Tableau automatically defines the relationship between the 2
data sources based on the common dimensions and matching values, and the relationship is indicated
with an orange color.
2. Custom or manual way: in the manual (custom) way, the user needs to define the relationship
manually.
1. All the primary data sources and the secondary data sources are linked by a specific relationship.
2. While performing data blending, each worksheet has a primary connection and, optionally, several
secondary connections.
3. All the primary connections are indicated in blue in the worksheet and all the secondary data
sources with an orange tick mark.
4. In data blending, one sheet contains one primary data source, and one sheet can contain any number
of secondary data sources.
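A data blend can be pictured as a lookup from each primary row into the secondary source on the common (linking) dimension. The sources below are hypothetical stand-ins for, say, an Excel sales file and a database of targets:

```python
# Primary source (e.g. Excel): sales by region.
primary = [
    {"region": "Asia", "sales": 120},
    {"region": "Europe", "sales": 90},
]

# Secondary source (e.g. a database): targets keyed on the common
# dimension "region", which plays the role of the orange-linked field.
secondary = {"Asia": 100, "Europe": 110}

# Blend: for each primary row, look up the secondary value on the
# linking dimension and attach it to the row.
blended = [dict(row, target=secondary.get(row["region"])) for row in primary]

print(blended[0])  # {'region': 'Asia', 'sales': 120, 'target': 100}
```

Unlike a join, the primary rows are never multiplied or dropped: every primary row survives, and the secondary source only contributes looked-up values.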
Dimensions are the descriptive text columns, and facts are the measures (numerical values).
Dimension examples: Product Name, City. Fact examples: Sales, Profit.
Q. Can we place an excel file in a shared location and use it to develop a report and refresh it in regular
intervals?
Yes, you can do it, but for better performance use an extract.
A heat map is a great way to compare categories using color and size. In this, you can compare two
different measures. Tree map is a very powerful visualization, particularly for illustrating hierarchical
(tree – structured) data and part – to – whole relationships.
Q. What is the difference between the twb and twbx file extensions? Please explain.
.twb is a live connection: it points to the data source; the user receiving the .twb needs permission to
that data source, and no data is included. .twbx takes the data offline: it stores the data as a packaged,
zip-like file, thereby removing the need for end-user permissions; it is a snapshot in time of the data as
of the moment it was saved as .twbx.
First, most of the BI tools out there are pricey, but Tableau has a free offering (Tableau Public) as
well as a very popular (also free) academic distribution. Tableau is recognized by firms like Forrester
Research as one of the easiest to use, most agile products currently available (see "Tableau Ranks #1 in
The Forrester Wave: Advanced Data Visualization (ADV) Platforms"). That makes it easy to pick up and
try new things with, which is what data visualization people love about it.
On the other hand, unlike some of the other BI tools, Tableau is not a complete technology stack; it is
most useful for visualization and analytics. You will need other products in addition to Tableau for
heavier enterprise data ETL, maintenance, storage, etc.
https://www.tableau.com/about/blog/2012/7/tableau-ranks-1-forrester-wave-advanced-data-
visualization-adv-platforms-1852
Q. Can you get values from two different sources as a single input into parameter?
No, you cannot. Each data source corresponds to a Tableau workbook. If you include both data variables
in the same data source, you can input them in the same workbook.
We can use parameters with filters, calculated fields, actions, measure swaps, changing views and auto
updates.
A custom SQL query is written after connecting to the data, to pull the data in a structured view. One
simple example: you have 50 columns in a table but need just 10 of them. Instead of taking all 50
columns you can write a SQL query; performance will increase.
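As a sketch of that column-pruning idea (the table and column names here are made up), the custom SQL simply selects the handful of columns the report needs, so less data crosses the wire:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A wide table: imagine 50 columns, of which the report needs only a few.
con.execute("CREATE TABLE orders (id, customer, region, amount, col5, col6)")
con.execute("INSERT INTO orders VALUES (1, 'Asha', 'Asia', 250.0, 'x', 'y')")

# Custom SQL used as the data source: pull only the needed columns
# instead of SELECT * over the full width of the table.
needed = con.execute("SELECT customer, region, amount FROM orders").fetchall()

print(needed)  # [('Asha', 'Asia', 250.0)]
```

The same narrowing query, pasted into Tableau's custom SQL box, is what keeps the extract or live connection from dragging along unused columns.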
Q. What are the differences between Tableau Software and Traditional BI tools?
Tableau provides easy to use, best in class, Visual Analytic capabilities, but it does not help with the
plumbing (data foundation). You could, for example, marry SQL Server with Tableau to get the complete
package. Tableau licenses are relatively expensive if you are looking to scale.
Traditional BI can handle it all but with significant upfront costs. Higher consulting, hardware and
software costs. Among the mega-vendors, only Microsoft can provide a reasonable value proposition.
Open source vendors like Pentaho and JasperSoft do not have an abundant enough talent pool, yet.
Q. What are the similarities and differences between Tableau software and Palantir?
Palantir and Tableau are very different. Palantir has its roots in large data computer science problems
involving security, payments, fraud detection and the likes. Customers/Investors include Paypal, CIA and
others.
Tableau is a visualization player, with roots in Stanford University research. Its Visual Query Language
(VizQL) allows users to build visualizations on top of standard data warehouses or spreadsheets.
Suppose we have chosen INDIA as the country in filter 1; filter 2 should then display only INDIA states.
Yes, for sure! It gives you data insights to an extent that other tools don't. It helps you plan, pinpoint
anomalies and improve your processes.
Using filters or calculated fields, we can display the top 5 and bottom 5 sales in the same view.
Q. Design a view in a map such that if user selects any state the cities under that state has to show profit
and sales.
If you want to show the sales and profit in each city under the states in the same worksheet: according
to the question, you should have State, City, Sales and Profit fields in your dataset.
2. Drag the City and drop it into the Marks card (under the State field).
6. Right-click on the State field and select Show Quick Filter.
7. Select any state and check whether you get the required view. In this view, size indicates the
amount of sales and color indicates the profit values.
Q. How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file, then use Data -> Edit Relationships to link them.
Our goal is to help people see and understand data. Our software products put the power of data into
the hands of everyday people, allowing a broad population of business users to engage with their data,
ask questions, solve problems and create value.
Tableau Public is a free service that lets anyone publish interactive data to the web. Once on the web,
anyone can interact with the data, download it, or create their own visualizations of it. No programming
skills are required. Be sure to look at the gallery to see some of the things people have been doing with
it.
Data modelling is the analysis of the data objects that are used in a business or other context and the
identification of the relationships among these data objects. Data modelling is a first step in doing
object-oriented programming.
I think we all work on different projects using Tableau, so the work begins from understanding the
requirement getting the required data, story boarding then creating visualizations in tableau and then
presenting it to the client for review.
Parameters are dynamic values that can replace constant values in calculations and can serve as filters
Tableau performance depends on data source performance. If the data source takes more time to execute a
query, then Tableau must wait for that time.
Parameters are dynamic values that can replace constant values in calculations and can serve as
filters. Filters are used to restrict the data based on the condition you have mentioned on the Filters shelf.
The Tableau Desktop log files are located in C:\Users\<user>\My Documents\My Tableau Repository. If you
have a live connection to the data source, check the log.txt and tabprotosrv.txt files. If you are using an
extract, check the tdeserver.txt file. The tabprotosrv.txt file often shows detailed information about queries.
The Pages shelf is a powerful part of Tableau that you can use to control the display of output, as well as
the printed results of output.
Q. What are the major differences between tableau version 7.0 and tableau version 8.0?
1. New visualizations are introduced, like treemap, bubble chart and box-and-whisker plot
3. Introduced R script
Step 1: Build a map view. Double-click a geographic field such as State, Area Code, Zip Code, etc.
Step 2: Select the Filled Map mark type. The Automatic mark type shows this type of view as circles
over a map; on the Marks card, select Filled Map to color the geographic areas.
Step 3: Drag a field to the Color shelf. Define how the locations are colored by dragging another field to
the Color shelf.
Yes, it may have its own drop-down list; the entries you make in the parameter while creating it can be
viewed as a drop-down list.
After creating dashboards, if we find a problem on the SQL side (that is, in the custom SQL), how do we
rectify the SQL performance of the custom SQL?
Q. Suppose my license expires today; will users be able to view the dashboards or workbooks which I
published on the server earlier?
If your server license expires today, your user name on the server will get the role 'unlicensed', which
means you cannot access the content, but others can. The Site Admin can change ownership to another
person, so that extracts, if enabled, do not fail.
Q. Suppose I am using Tableau Desktop with a live connection to Cloudera Hadoop data. I need to
press F5 to refresh the visualization. Is there any way to refresh the visualization automatically every x
minutes instead of pressing F5 every time?
Here is an example of refreshing a dashboard every 3 seconds: replace the API src and server URL with
yours. The interval below is for 3 seconds.
Tableau Desktop is based on breakthrough technology from Stanford University that lets you drag and
drop to analyze data. It is a great data visualization tool: you can connect to data in a few clicks, then
visualize and create interactive dashboards with a few more.
Q. What are the differences between Tableau Software, GoodData and Traditional BI (Business Objects,
etc.)?
You could talk feature-functionality for days, but at a high level there are four major differences.
1. Speed: How fast can you get up and running with the system, answer questions, design and share
dashboards, and then change them? This is where systems like Tableau and GoodData are far better
than old-school business intelligence like Business Objects or Cognos. Traditional systems took months
or years to implement, with costs running into millions. Tableau has a free trial that installs in minutes
and GoodData is cloud-based, so they are faster to implement by orders of magnitude. They are also
faster to results: traditional BI requires IT and developers to make any changes to reports, so business
users are stuck in a queue waiting to get anything done. Tableau and GoodData provide more of a
self-service experience.
2. Analysis layer: This is where Tableau excels. It has a powerful and flexible drag & drop visualization
engine based on some technology from Stanford. GoodData and traditional BI typically provide some
canned reports but changing them requires significant time and money.
3. Data layer: This is where the three options are most different:
GoodData requires you to move your data to its cloud. Traditional BI typically requires you to move your
data to its data warehouse system. Tableau connects to a variety of existing data source and also
provides a fast in – memory data engine, essentially a local database. Since most enterprises have their
data stored all over the place, this provides the most choice and lets companies use the investment
they’ve already made.
4. Enterprise readiness: Traditional BI and Tableau do well here, with enterprise – level security and high
scalability.
Tableau is business intelligence software that allows anyone to easily connect to data, then visualize and
create interactive, sharable dashboards. It’s easy enough that any Excel user can learn it, but powerful
enough to satisfy even the most complex analytical problems. Securely sharing your findings with others
only takes seconds.
Tableau Server is browser- and mobile-based insight anyone can use. Publish dashboards with Tableau
Desktop and share them throughout your organization. It’s easy to set up and even easier to run.
R is a popular open-source environment for statistical analysis. Tableau Desktop can now connect to R
through calculated fields and take advantage of R functions, libraries, and packages and even saved
models. These calculations dynamically invoke the R engine and pass values to R via the Rserve package,
and are returned back to Tableau.
1. Tableau Server can also be configured to connect to an instance of Rserve through the tabadmin
utility, allowing anyone to view a dashboard containing R functionality.
2. Combining R with Tableau gives you the ability to bring deep statistical analysis into a drag-and-drop
visual analytics environment.
Q. What is the difference between a quick filter and a normal filter in Tableau?
A quick filter is used to view the filtering options and to select an option directly in the view. A normal
filter is one where you limit the options from the list or use conditions to limit the data by field or
value.
You need to publish report to tableau server, while publishing you will find one option to schedule
reports.You just need to select the time when you want to refresh data.
Tableau compiles the elements of your visual canvas into a SQL or MDX query for the remote database
to process. Since a database typically runs on more powerful hardware than the laptops or workstations
used by analysts, you should generally expect the database to handle queries much faster than most
in-memory BI applications limited by end-user hardware. Tableau's ability to push computation
(queries) close to the data is increasingly important for large data sets, which may reside on a fast
cluster and may be too large to bring into memory.
Another factor in performance relates to data transfer, or in Tableau's case result-set transfer. Since
Tableau visualizations are designed for human consumption, they are tailored to the capabilities and
limits of the human perception system. This generally means that the amount of data in a query result
set is small relative to the size of the underlying data, and visualizations focus on aggregation and
filtering to identify trends and outliers. The small result sets require little network bandwidth, so
Tableau is able to fetch and render the result set very quickly. And, as Ross mentioned, Tableau will
cache query results for fast reuse.
The last factor involves Tableau's ability to use in-memory acceleration as needed (for example, when
working with very slow databases, text files, etc.). Tableau's Data Engine uses memory-mapped I/O, so
while it takes advantage of in-memory acceleration it can easily work with large data sets that cannot
fit in memory. The Data Engine works only with the subsets of data on disk that are needed for a given
query, and the data subsets are mapped into memory as needed.
Tableau Desktop is a data visualization application that lets you analyze virtually any type of structured
data and produce highly interactive, beautiful graphs, dashboards, and reports in just minutes. After a
quick installation, you can connect to virtually any data source from spreadsheets to data warehouses
and display information in multiple graphic perspectives. Designed to be easy to use, you’ll be working
faster than ever before.
While Tableau lets you analyze databases and spreadsheets like never before, you don't need to know
anything about databases to use Tableau. In fact, Tableau is designed to allow business people with no
technical training to analyze their data efficiently. Tableau is based on three simple concepts:
– Connect: Connecting means pointing Tableau to your data source. Note that Tableau does not import
the data; instead it queries the database directly.
– Analyze: Analyzing data means viewing it, filtering it, sorting it, performing calculations on it,
reorganizing it, summarizing it, and so on. Using Tableau you can do all of these things by simply
arranging fields of your data source on a Tableau worksheet. When you drop a field on a worksheet,
Tableau queries the data using standard drivers and query languages (like SQL and MDX) and presents a
visual analysis of the data.
– Share: You can share results with others either by sharing workbooks with other Tableau users,
by pasting results into applications such as Microsoft Office, printing to PDF, or by using Tableau Server
to publish or embed your views across your organization.
1. New visualizations are introduced like tree map bubble chart and box and whisker plot
3. Introduced R script
– With Kerberos support, Tableau 8.3 advances enterprise-grade data analysis with these
enhancements:
1. Provides seamless, single sign-on experience from Tableau client to back-end data sources
2. Protects sensitive data with delegated access and viewer credential management
The company was founded in Mountain View, California in January, 2003 by Chris Stolte, Christian
Chabot and Pat Hanrahan.
13) What is the difference between heat map and tree map?
A heat map is a great way to compare categories using color and size. In this, you can compare two
different measures. Tree map is a very powerful visualization, particularly for illustrating hierarchical
(tree – structured) data and part – to – whole relationships.
15) How will you publish and schedule workbook in tableau server?
First create a schedule for particular time and then create extract for the data source and publish the
workbook for the server. Before you publish, there is a option called Scheduling and Authentication,
click on that and select the schedule from the drop down which is created and publish. Also publish data
source and assign the schedule. This schedule will automatically run for the assigned time and the
workbook is refreshed.
Whenever we create a context filter, Tableau will create a temporary table for that particular filter set,
and the other filters will be applied to the context-filtered data, like cascading parameters. Suppose we have
created a context filter on countries and have chosen USA and India: Tableau will create a
temporary table for these two countries' data, and if you have any other filters, they will be applied to
these two countries' data. If we don't have any context filter, each individual record is checked against all
filters.
The context filter should not be frequently changed by the user: if the filter is changed, the database must
recompute and rewrite the temporary table, slowing performance.
When you set a dimension to context, Tableau creates a temporary table that will require a reload each
time the view is initiated. For Excel, Access, and text data sources, the temporary table created is in an
Access table format. For SQL Server, MySQL, and Oracle data sources, you must have permission to
create a temporary table on your server. For multidimensional data sources, or cubes, temporary tables
are not created, and context filters only define which filters are independent and dependent.
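The cascade described above can be sketched in plain Python. This is only an illustration of the logic, not Tableau's actual engine: the context filter builds a "temporary table" first, and the remaining filters scan only that subset (all names and sample rows here are made up).

```python
# Illustrative sketch of context-filter cascading (not Tableau internals).
rows = [
    {"country": "USA", "city": "New York", "sales": 120},
    {"country": "India", "city": "Mumbai", "sales": 80},
    {"country": "UK", "city": "London", "sales": 95},
    {"country": "USA", "city": "Chicago", "sales": 60},
]

def apply_context_filter(rows, countries):
    """Builds the 'temporary table': only rows matching the context filter."""
    return [r for r in rows if r["country"] in countries]

def apply_other_filters(rows, min_sales):
    """Subsequent filters scan only the context-filtered subset."""
    return [r for r in rows if r["sales"] >= min_sales]

context = apply_context_filter(rows, {"USA", "India"})  # the temporary table
result = apply_other_filters(context, 70)               # cascaded onto the subset
```

Without a context filter, `apply_other_filters` would have to scan every row; with one, later filters only check the pre-filtered subset, which is exactly why context filters can speed up large sources.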
What is the Difference between quick filter and Normal filter in tableau?
A quick filter is used to view the filtering options and select among them directly in the view. A normal filter is
something you use to limit the options from a list or apply conditions to limit the data by field or
value.
22) What is the benefit of a Tableau extract file over a live connection?
An extract can be used anywhere without any connection, and you can build your own visualizations
without connecting to the database.
23) How to combine two Excel files with the same fields but different data (different years)?
I have 5 different Excel files (2007.xls, 2008.xls ... 2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). Can someone tell me how I can
combine the film name, genre, and profitability so that I can see the visualization of 2007 to 2011 in a
single chart?
We can join a maximum of 32 tables; it's not possible to combine more than 32 tables.
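Outside of Tableau, the kind of stacking the question asks for can be sketched in plain Python: read each year's records and append them into one dataset tagged with the year. The file names and field values below are stand-ins for the Excel files in the question (in Tableau itself this row-wise stacking is typically done with a union rather than a join).

```python
# Stand-in for reading 2007.xls ... 2011.xls: one record list per year.
data_by_year = {
    2007: [{"film": "A", "genre": "Drama", "profitability": 1.2}],
    2008: [{"film": "B", "genre": "Comedy", "profitability": 2.5}],
    2009: [{"film": "C", "genre": "Action", "profitability": 0.8}],
}

combined = []
for year, records in sorted(data_by_year.items()):
    for rec in records:
        # Tag each row with its source year so one chart can span all years.
        combined.append({**rec, "year": year})
```

With the `year` column added, a single chart can plot profitability across 2007 to 2011 from the one combined table.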
R is a popular open-source environment for statistical analysis. Tableau Desktop can now connect to R
through calculated fields and take advantage of R functions, libraries, and packages and even saved
models. These calculations dynamically invoke the R engine and pass values to R via the Rserve package,
and are returned back to Tableau.
Tableau Server can also be configured to connect to an instance of Rserve through the tabadmin utility,
allowing anyone to view a dashboard containing R functionality.
Combining R with Tableau gives you the ability to bring deep statistical analysis into a drag-and-drop
visual analytics environment.
The Pages shelf is a powerful part of Tableau that you can use to control the display of output as well as
the printed results of output.
27) How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file, and then use Data -> Edit Relationships to define how they relate.
Add a custom color code. Note: In Tableau 9.0, a color picker option is available.
29) How to design a view to show region-wise profit and sales? I do not want line and bar charts to be
used for profit and sales.
Generate the map using cities, then drag Profit and Sales to Details, and add State as a quick
filter.
I have chosen country as USA and filter2 should display only USA states
Multiple Measures are shown in single axis and also all the marks shown in single pane
http://onlinehelp.tableau.com/current/pro/online/mac/en-us/multiplemeasures_blendedaxes.html
A much more advanced, direct, precise, and ordered way of viewing large volumes of data is called data
visualization. It is the visual representation of data in the form of graphs and charts, especially when you
can’t define it textually. You can show trends, patterns, and correlations through various data visualization
software and tools; Tableau is one such data visualization software used by businesses and corporates.
34) What are the differences between Tableau desktop and Tableau Server?
While Tableau Desktop performs data visualization and workbook creation, Tableau Server is used to
distribute these interactive workbooks and/or reports to the right audience. Users can edit and update
the workbooks and dashboards online on the Server but cannot create new ones. However, there are limited
editing options when compared to Desktop.
Tableau Public is again a free tool consisting of Desktop and Server components accessible to anyone.
Tableau parameters are dynamic variables/values that replace the constant values in data calculations
and filters. For instance, you can create a calculated field value returning true when the score is greater
than 80, and otherwise false. Using parameters, one can replace the constant value of 80 and control it
dynamically in the formula.
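The score example above can be sketched in Python, with the parameter playing the role of the Tableau parameter that replaces the hard-coded 80 (the function and field names are illustrative only):

```python
# The constant 80 replaced by a parameter the user can control dynamically.
def passes(score, threshold=80):
    """Stand-in for a calculated field: True when score exceeds the threshold."""
    return score > threshold

default_result = passes(85)            # uses the default cutoff of 80
stricter = passes(85, threshold=90)    # the user raises the parameter to 90
```

Changing `threshold` changes the result without editing the calculation itself, which is exactly the benefit of a parameter over a hard-coded constant.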
The difference actually lies in the application. Parameters allow users to insert their own values, which can be
integer, float, date, or string values that can be used in calculations. However, filters receive only the values users
choose to ‘filter by’ from the list, which cannot be used to perform calculations.
Users can dynamically change measures and dimensions with a parameter, but filters do not support this
feature.
—>Facts are the numeric metrics or measurable quantities of the data, which can be analyzed by the
dimension table. Facts are stored in a fact table that contains foreign keys referring uniquely to the
associated dimension tables. The fact table supports data storage at the atomic level and thus allows a larger
number of records to be inserted at one time. For instance, a Sales fact table can have product key,
customer key, promotion key, and items sold, referring to a specific event.
—>Dimensions are the descriptive attribute values for multiple dimensions of each attribute, defining
multiple characteristics. A dimension table, having a reference to a product key from the fact table, can
consist of product name, product type, size, color, description, etc.
Global quick filters are a way to filter each worksheet on a dashboard, as long as each of them contains the
filtered dimension. They are very useful for worksheets using the same data source, but this sometimes proves to
be a disadvantage and generates slow results. Thus, parameters are more useful.
Parameters facilitate only four ways to represent data on a dashboard (which are seven in quick filters).
Further, parameters do not allow multiple selections in a filter.
Aggregation and disaggregation in Tableau are the ways to develop a scatterplot to compare and
measure data values. As the name suggests, an aggregation is a calculation over a set of values that
returns a single numeric value. For instance, a measure with values 1, 3, 5, 7 returns 16 when summed. You can also set a
default aggregation for any measure, which is not user-defined. Tableau supports various default
aggregations for a measure, such as Sum, Average, Median, Count, and others.
Disaggregating data refers to viewing each data source row, while analyzing data both independently
and dependently.
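The distinction can be shown with the standard library: aggregation collapses the values 1, 3, 5, 7 into a single number per function, while disaggregation keeps every underlying row visible.

```python
import statistics

values = [1, 3, 5, 7]

# Aggregation: each function collapses the whole set into one number.
aggregations = {
    "sum": sum(values),
    "average": statistics.mean(values),
    "median": statistics.median(values),
    "count": len(values),
}

# Disaggregation: every source row is kept and shown individually.
disaggregated = list(values)
```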
Unlike Data Joining, Data Blending in tableau allows combining of data from different sources and
platforms. For instance, you can blend data present in an Excel file with that of an Oracle DB to create a
new dataset.
The concept of context filter in Tableau makes the process of filtering smooth and straightforward. It
establishes a filtering hierarchy where all other filters present refer to the context filter for their
subsequent operations. The other filters now process data that has been passed through the context
filter.
Creating one or more context filters improves performance as users do not have to create extra filters
on large data source, reducing the query-execution time.
You can create one by dragging a field into the ‘Filters’ tab, then right-clicking that field and selecting “Add to
Context”.
Tableau takes time to place a filter in context. When a filter is set as a context filter, the software creates a
temporary table for that particular context filter. This table will reload each time and consists of all
values that are not filtered by either the Context or Custom SQL filter.
.twb is the most common file extension used in Tableau, which presents an XML format file and
comprises all the information present in each dashboard and sheet like what fields are used in the
views, styles and formatting applied to a sheet and dashboard.
But this workbook does not contain any data. The Packaged workbook merges the information in a
Tableau workbook with the local data available (which is not on server). .twbx serves as a zip file, which
will include custom images if any. Packaged Workbook allows users to share their workbook information
with other Tableau Desktop users and let them open it in Tableau Reader.
Data extracts are the first copies or subdivisions of the actual data from original data sources. Workbooks
using data extracts instead of live DB connections are faster since the extracted
data is imported into the Tableau data engine.
After this extraction of data, users can publish the workbook, which also publishes the extracts in
Tableau Server. However, the workbook and extracts won’t refresh unless users apply a scheduled
refresh on the extract. Scheduled Refreshes are the scheduling tasks set for data extract refresh so that
they get refreshed automatically while publishing a workbook with data extract. This also removes the
burden of republishing the workbook every time the concerned data gets updated.
• Horizontal- Horizontal layout containers allow the designer to group worksheets and dashboard
components left to right across your page and edit the height of all elements at once.
• Vertical- Vertical containers allow the user to group worksheets and dashboard components top to
bottom down your page and edit the width of all elements at once.
• Text
• Image Extract: – A Tableau workbook is in XML format. In order to extract images, Tableau applies
some codes to extract an image, which can be stored in the XML.
• Web [URL ACTION]:- A URL action is a hyperlink that points to a Web page, file, or other web-based
resource outside of Tableau. You can use URL actions to link to more information about your data that
may be hosted outside of your data source. To make the link relevant to your data, you can substitute
field values of a selection into the URL as parameters.
• Create a Performance Recording to record performance information about the main events as you
interact with the workbook. Users can view the performance metrics in a workbook created by Tableau.
• Reviewing the Tableau Desktop Logs located at C:\Users\\My Documents\My Tableau Repository. For
live connection to data source, you can check log.txt and tabprotosrv.txt files. For an extract, check
tdeserver.txt file.
Tableau provides a distinct and powerful tool to control the output display known as Page shelf. As the
name suggests, the page shelf fragments the view into a series of pages, presenting a different view on
each page, making it more user-friendly and minimizing scrolling to analyze and view data and
information. You can flip through the pages using the specified controls and compare them at a
common axle.
Performance testing is again an important part of implementing Tableau. This can be done by load
testing Tableau Server with TabJolt, which is a “point and run” load generator created to perform QA.
While TabJolt is not supported by Tableau directly, it has to be installed using other open-source
products.
Dual axis is an excellent feature supported by Tableau that helps users view two scales of two
measures in the same graph. Many websites like Indeed.com and others make use of dual axes to show
the comparison between two measures and their growth rate over a specific set of years. Dual axes let you
compare multiple measures at once, having two independent axes layered on top of one another.
The maximum number of 32 tables can be joined in Tableau. A table size must also be limited to 255
columns (fields).
The auto-filter provides a feature of removing ‘All’ options by simply clicking the down arrow in the
auto-filter heading. You can scroll down to ‘Customize’ in the dropdown and then uncheck the ‘Show
“All” Value’ attribute. It can be activated by checking the field again.
• Tableau desktop: desktop environment to create and publish standard and packaged workbooks.
• Tableau Public: workbooks available publicly online for users to download and access the included
data.
55) How can you display the top five and bottom five sales in the same view?
Create two sets, one for the top 5 and another for the bottom 5, and then join these two sets, displaying a unique set
of 10 rows in total.
TDE is a Tableau desktop file that contains a .tde extension. It refers to the file that contains data
extracted from external sources like MS Excel, MS Access or CSV file.
There are two aspects of TDE design that make them ideal for supporting analytics and data discovery.
• The first is that a TDE is a columnar store, which reduces the input/output required to access and aggregate the values.
• The second is how they are structured, which impacts how they are loaded into memory and used by
Tableau. This is an important aspect of how TDEs are “architecture aware”. Architecture-awareness
means that TDEs use all parts of your computer memory, from RAM to hard disk, and put each part to
work on what best fits its characteristics.
By adding the same calculation to ‘Group By’ clause in SQL query or creating a Calculated Field in the
Data Window and using that field whenever you want to group the fields.
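The calculated-field approach can be sketched in Python: a function derives a group label per row (the "calculated field"), and totals are accumulated per label, which is what adding the same expression to a SQL `GROUP BY` clause achieves. All names and sample rows here are illustrative.

```python
from collections import defaultdict

orders = [
    {"product": "Pen", "amount": 5},
    {"product": "Pencil", "amount": 3},
    {"product": "Desk", "amount": 120},
]

def price_band(order):
    """The 'calculated field': derives a group label from each row."""
    return "low" if order["amount"] < 50 else "high"

# Aggregate by the derived label, as a GROUP BY on the same expression would.
totals = defaultdict(int)
for o in orders:
    totals[price_band(o)] += o["amount"]
```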
• Blend data using groups created in the secondary data source: Only calculated groups can be used in
data blending if the group was created in the secondary data source.
• Use a group in another workbook. You can easily replicate a group in another workbook by copy and
pasting a calculation.
Yes, parameters do have their independent dropdown lists enabling users to view the data entries
available in the parameter during its creation.
Tableau is business intelligence software that allows anyone to connect to their data, and then
visualize and create interactive, sharable dashboards.
Data: With new web data connector, it makes data accessible from anywhere
Mobile: The new Tableau comes with high-resolution thumbnails, the ability to take snapshots offline, and high-
level security for the data
Visual Analytics: View proximity in the radial selection tool, also provides features like creating filter
formulas and Zoom control on your data
Tableau Public is a free service that allows anyone to publish interactive data to the web. Once it is on the
web, anyone can interact with the data, download it, or create their own visualizations.
4) Mention whether you can create relational joins in Tableau without creating a new table?
Yes, one can create relational joins in tableau without creating a new table.
Bookmarks: A bookmark contains a single worksheet and is an easy way to quickly share your work
Packaged Workbooks: It contains a workbook along with any supporting local file data and background
images
Data Extraction Files: Extract files are a local copy of a subset or entire data source
Data Connection Files: It’s a small XML file with various connection information
6) Mention what is the difference between published data sources and embedded data sources in
Tableau?
The difference between published data source and embedded data source is that,
Published data source: It contains connection information that is independent of any workbook and can
be used by multiple workbooks.
Embedded data source: It contains connection information and is associated with a workbook.
Icon/Name
Connection Type
Connects to
If the data resides in a single source, it is always desirable to use joins. When your data is not in one place,
blending is the most viable way to create a left-join-like connection between your primary and
secondary data sources.
A Tableau data extract is a compressed snapshot of data stored on disk and loaded into memory as
required to render a Tableau viz. A TDE is a columnar store and reduces the input/output required to access
and aggregate the values.
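The columnar layout can be illustrated in a few lines of Python. This is only a toy model of the idea behind a TDE, not its actual format: when each field is stored as its own array, aggregating one measure reads a single column instead of touching every whole row.

```python
# Row store: every query touches whole rows.
row_store = [
    {"region": "East", "sales": 100, "profit": 20},
    {"region": "West", "sales": 150, "profit": 35},
]

# Columnar store (the idea behind a TDE): one array per field,
# so aggregating a measure reads only that column.
column_store = {
    "region": ["East", "West"],
    "sales": [100, 150],
    "profit": [20, 35],
}

total_sales = sum(column_store["sales"])  # touches a single column only
row_total = sum(r["sales"] for r in row_store)  # must walk every full row
```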
10) Explain what is the difference between blending and joining in Tableau?
Joining term is used when you are combining data from the same source, for example, worksheet in an
Excel file or tables in Oracle database
While blending requires two completely defined data sources in your report.
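The difference can be sketched in Python: blending behaves like a left join between two independently defined sources, keeping every row of the primary source even when the secondary has no match. The sources and field names below are made-up illustrations.

```python
# Primary source, e.g. rows from an Excel worksheet.
primary = [
    {"region": "East", "sales": 100},
    {"region": "West", "sales": 150},
]

# Secondary source, e.g. a lookup from an Oracle table: region -> target.
secondary = {"East": 90}

# Blend: left-join-like combination; unmatched primary rows keep None.
blended = [
    {**row, "target": secondary.get(row["region"])}
    for row in primary
]
```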
An operational database undergoes frequent changes on a daily basis on account of the transactions
that take place. Suppose a business executive wants to analyze previous feedback on any data such as a
product, a supplier, or any consumer data, then the executive will have no data available to analyze
because the previous data has been updated due to transactions.
A data warehouse provides us generalized and consolidated data in a multidimensional view. Along with
this generalized and consolidated view of data, a data warehouse also provides us Online Analytical
Processing (OLAP) tools. These tools help us in interactive and effective analysis of data in a
multidimensional space. This analysis results in data generalization and data mining.
Data mining functions such as association, clustering, classification, prediction can be integrated with
OLAP operations to enhance the interactive mining of knowledge at multiple level of abstraction. That's
why data warehouse has now become an important platform for data analysis and online analytical
processing.
It possesses consolidated historical data, which helps the organization to analyze its business.
A data warehouse helps executives to organize, understand, and use their data to take strategic
decisions.
An operational database is constructed for well-known tasks and workloads such as searching
particular records, indexing, etc. In contrast, data warehouse queries are often complex and
they present a general form of data.
An operational database query allows read and modify operations, while an OLAP query
needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data warehouse
maintains historical data.
Time Variant − The data collected in a data warehouse is identified with a particular time
period. The data in a data warehouse provides information from the historical point of view.
Non-volatile − Non-volatile means the previous data is not erased when new data is added to it.
A data warehouse is kept separate from the operational database, and therefore frequent
changes in the operational database are not reflected in the data warehouse.
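The non-volatile, time-variant properties can be sketched as an append-only store: new period snapshots are added, and existing history is never updated or erased (the field names here are illustrative).

```python
from datetime import date

# Append-only warehouse: loads add rows; history is never overwritten.
warehouse = []

def load_snapshot(snapshot_date, balance):
    """Each load appends a time-stamped record rather than updating in place."""
    warehouse.append({"date": snapshot_date, "balance": balance})

load_snapshot(date(2023, 1, 31), 500)
load_snapshot(date(2023, 2, 28), 650)   # a new period arrives...

history = [row["balance"] for row in warehouse]  # ...and old data survives
```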
Note − A data warehouse does not require transaction processing, recovery, and concurrency controls,
because it is physically stored and separate from the operational database.
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data warehouse
applications that are discussed below −
Information Processing − A data warehouse allows the data stored in it to be processed. The data
can be processed by means of querying, basic statistical analysis, and reporting using crosstabs,
tables, charts, or graphs.
Data Mining − Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and prediction. These
mining results can be presented using the visualization tools.
2. OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
12. The OLAP database size is from 100 GB to 100 TB; the OLTP database size is from 100 MB to 100 GB.
Tuning Production Strategies − The product strategies can be well tuned by repositioning the
products and managing the product portfolios by comparing the sales quarterly or yearly.
Customer Analysis − Customer analysis is done by analyzing the customer's buying preferences,
buying time, budget cycles, etc.
Operations Analysis − Data warehousing also helps in customer relationship management, and
making environmental corrections. The information also allows us to analyze business
operations.
Query-driven Approach
Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was used to build
wrappers and integrators on top of multiple heterogeneous databases. These integrators are also
known as mediators.
Now these queries are mapped and sent to the local query processor.
The results from heterogeneous sites are integrated into a global answer set.
Disadvantages
Query-driven approach needs complex integration and filtering processes.
This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-
driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the
information from multiple heterogeneous sources is integrated in advance and stored in a
warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
The data is copied, processed, integrated, annotated, summarized, and restructured in the semantic
data store in advance.
Query processing does not require an interface to process data at local sources.
Data Transformation − Involves converting the data from legacy format to warehouse format.
Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building
indices and partitions.
Note − Data cleaning and data transformation are important steps in improving the quality of data and
data mining results.
Metadata
Metadata is simply defined as data about data. The data that are used to represent other data is known
as metadata. For example, the index of a book serves as a metadata for the contents in the book. In
other words, we can say that metadata is the summarized data that leads us to the detailed data.
Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It contains the following metadata
−
Business metadata − It contains the data ownership information, business definition, and
changing policies.
Operational metadata − It includes currency of data and data lineage. Currency of data refers
to the data being active, archived, or purged. Lineage of data means history of data migrated
and transformation applied on it.
Data for mapping from operational environment to data warehouse − This metadata includes the
source databases and their contents, data extraction, data partitioning, cleaning, transformation
rules, and data refresh and purging rules.
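A repository entry covering those three kinds of metadata can be sketched as a plain dictionary. The structure and field names below are hypothetical, chosen only to mirror the categories listed above:

```python
# Illustrative metadata-repository entry (all field names are hypothetical).
metadata_entry = {
    "business": {                       # business metadata
        "owner": "Sales Ops",
        "definition": "Monthly gross sales per region",
    },
    "operational": {                    # operational metadata
        "currency": "active",           # active / archived / purged
        "lineage": ["crm.orders -> staging.orders -> dw.fact_sales"],
    },
    "mapping": {                        # operational-to-warehouse mapping
        "source_database": "crm",
        "transformation_rules": ["trim names", "normalize currency to USD"],
        "refresh_rule": "nightly",
    },
}
```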
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The
dimensions are the entities with respect to which an enterprise preserves the records.
The following table represents the 2-D view of Sales Data for a company with respect to time, item, and
location dimensions.
But here in this 2-D table, we have records with respect to time and item only. The sales for New Delhi
are shown with respect to time, and item dimensions according to type of items sold. If we want to
view the sales data with one more dimension, say, the location dimension, then the 3-D view would be
useful. The 3-D view of the sales data with respect to time, item, and location is shown in the table
below −
The above 3-D table can be represented as 3-D data cube as shown in the following figure −
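The relationship between the 3-D cube and its 2-D views can be sketched in Python: facts keyed by (time, item, location), with a 2-D view obtained by fixing one dimension. The sample figures are invented for illustration.

```python
# A tiny data cube: sales facts keyed by (time, item, location) dimensions.
cube = {
    ("Q1", "keyboard", "New Delhi"): 620,
    ("Q1", "keyboard", "Mumbai"): 410,
    ("Q2", "keyboard", "New Delhi"): 580,
    ("Q1", "mouse", "New Delhi"): 300,
}

def slice_2d(cube, location):
    """Project the 3-D cube to a 2-D (time, item) view for one location."""
    return {
        (t, i): sales
        for (t, i, loc), sales in cube.items()
        if loc == location
    }

delhi_view = slice_2d(cube, "New Delhi")  # the 2-D table for New Delhi
```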
Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an
organization. In other words, a data mart contains only the data that is specific to a particular group.
For example, the marketing data mart may contain only data related to items, customers, and sales.
Data marts are confined to subjects.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks
rather than months or years.
The life cycle of data marts may be complex in the long run, if their planning and design are not
organization-wide.
Virtual Warehouse
The view over an operational data warehouse is known as virtual warehouse. It is easy to build a virtual
warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
A data warehouse is never static; it evolves as the business expands. As the business evolves, its
requirements keep changing and therefore a data warehouse must be designed to ride with these
changes. Hence a data warehouse system needs to be flexible.
Ideally there should be a delivery process to deliver a data warehouse. However, data warehouse
projects normally suffer from various issues that make it difficult to complete tasks and deliverables in
the strict and ordered fashion demanded by the waterfall method. Most of the time, the requirements
are not understood completely. The architectures, designs, and build components can be completed
only after gathering and studying all the requirements.
Delivery Method
The delivery method is a variant of the joint application development approach adopted for the
delivery of a data warehouse. We have staged the data warehouse delivery process to minimize risks.
The approach that we will discuss here does not reduce the overall delivery time-scales but ensures the
business benefits are delivered incrementally through the development process.
Note − The delivery process is broken into phases to reduce the project and delivery risk.
IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits. An IT
strategy is required to procure and retain funding for the project.
Business Case
The objective of business case is to estimate business benefits that should be derived from using a data
warehouse. These benefits may not be quantifiable but the projected benefits need to be clearly
stated. If a data warehouse does not have a clear business case, then the business tends to suffer from
credibility problems at some stage during the delivery process. Therefore in data warehouse projects,
we need to understand the business case for investment.
The prototype can be thrown away after the feasibility concept has been shown.
The activity addresses a small subset of eventual data content of the data warehouse.
The following points are to be kept in mind to produce an early release and deliver business benefits.
Limit the scope of the first build phase to the minimum that delivers business benefits.
Business Requirements
To provide quality deliverables, we should make sure the overall requirements are understood. If we
understand the business requirements for both short-term and medium-term, then we can design a
solution to fulfil short-term requirements. The short-term solution can then be grown to a full solution.
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term requirements. This phase
also delivers the components that must be implemented in the short term to derive any business benefit.
The blueprint needs to identify the following.
History Load
This is the phase where the remainder of the required history is loaded into the data warehouse. In this
phase, we do not add new entities, but additional physical tables would probably be created to store
increased data volumes.
Let us take an example. Suppose the build version phase has delivered a retail sales analysis data
warehouse with 2 months’ worth of history. This information will allow the user to analyze only the
recent trends and address the short-term issues. The user in this case cannot identify annual and
seasonal trends. To help him do so, last 2 years’ sales history could be loaded from the archive. Now
the 40GB data is extended to 400GB.
Note − The backup and recovery procedures may become complex, therefore it is recommended to
perform this activity within a separate phase.
Ad hoc Query
In this phase, we configure an ad hoc query tool that is used to operate a data warehouse. These tools
can generate the database query.
Note − It is recommended not to use these access tools when the database is being substantially
modified.
Automation
In this phase, operational management processes are fully automated. These would include −
Extending Scope
In this phase, the data warehouse is extended to address a new set of business requirements. The
scope can be extended in two ways −
Note − This phase should be performed separately, since it involves substantial efforts and complexity.
Requirements Evolution
From the perspective of delivery process, the requirements are always changeable. They are not static.
The delivery process must support this and allow these changes to be reflected within the system.
This issue is addressed by designing the data warehouse around the use of data within business
processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs, the process operates as
a pseudo-application development process, where the new requirements are continually fed into the
development activities and the partial deliverables are produced. These partial deliverables are fed
back to the users and then reworked ensuring that the overall system is continually updated to meet
the business needs.
In this chapter, we will discuss how to build data warehousing solutions on top of open-system
technologies like Unix and relational databases.
Note − Before loading the data into the data warehouse, the information extracted from the external
sources must be reconstructed.
Note − Consistency checks are executed only when all the data sources have been loaded into the
temporary data store.
Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the
data consistent −
within itself.
with other data within the same data source.
with the data in other source systems.
with the existing data present in the warehouse.
Transforming involves converting the source data into a structured form. Structuring the data increases
query performance and decreases the operational cost. The data contained in a data warehouse must
be transformed to support performance requirements and control the ongoing operational costs.
Aggregation
Aggregation is required to speed up common queries. Aggregation relies on the fact that most common
queries will analyze a subset or an aggregation of the detailed data.
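The speed-up can be illustrated with a pre-computed aggregate table: common month-level queries read one summarized row instead of scanning all detail rows (the sample figures are invented).

```python
# Detail rows: (month, amount) for every individual sale.
detail = [
    ("2023-01", 100), ("2023-01", 150),
    ("2023-02", 200), ("2023-02", 50),
]

# Pre-compute the aggregation once, ahead of query time.
monthly_totals = {}
for month, amount in detail:
    monthly_totals[month] = monthly_totals.get(month, 0) + amount

# A month-level query now reads one aggregated row, not every detail row.
jan_total = monthly_totals["2023-01"]
```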
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years,
with the latest 6 months of data being kept online. In such a scenario, there is often a requirement to be able
to do month-on-month comparisons for this year and last year. In this case, we require some data to be
restored from the archive.
ensures that all the system sources are used in the most effective way.
The information generated in this process is used by the warehouse management process to determine
which aggregations to generate. This process does not generally operate during the regular load of
information into data warehouse.
Since a data warehouse can gather information quickly and efficiently, it can enhance business
productivity.
A data warehouse provides us a consistent view of customers and items; hence, it helps us
manage customer relationships.
A data warehouse also helps in bringing down the costs by tracking trends, patterns over a long
period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the business
needs and construct a business analysis framework. Each person has different views regarding the
design of a data warehouse. These views are as follows −
The top-down view − This view allows the selection of relevant information needed for a data
warehouse.
The data source view − This view presents the information being captured, stored, and
managed by the operational system.
The data warehouse view − This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.
The business query view − It is the view of the data from the viewpoint of the end-user.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either of
the following ways.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting
tools, analysis tools and data mining tools.
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a
virtual warehouse. Building a virtual warehouse requires excess capacity on operational database
servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups
of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For example,
the marketing data mart may contain data related to items, customers, and sales. Data marts are
confined to subjects.
Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
The life cycle of a data mart may be complex in long run, if its planning and design are not
organization-wide.
Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an entire organization.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager varies between specific solutions from one data warehouse to another.
Perform simple transformations into structure similar to the one in the data warehouse.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
It is more effective to load the data into a relational database prior to applying transformations and checks.
Gateway technology proves to be unsuitable, since gateways tend not to be performant when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
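These two checks can be sketched as follows; the column names are illustrative assumptions, not a real EPOS layout.

```python
# Sketch of the two load-time checks above: drop columns the warehouse does
# not need, then coerce the remaining values to their required types.
REQUIRED = {"item_key": int, "units": int, "price": float}

def clean(raw: dict) -> dict:
    # strip out columns not required within the warehouse, and convert types
    return {col: typ(raw[col]) for col, typ in REQUIRED.items()}

record = clean({"item_key": "30", "units": "5", "price": "3.67", "till_id": "9"})
```

Only the required, correctly typed columns reach the warehouse; unwanted columns such as the hypothetical till_id are dropped at load time.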
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of third-party
system software, C programs, and shell scripts.
The size and complexity of warehouse managers varies between specific solutions.
A warehouse manager analyzes the data to perform consistency and referential integrity
checks.
Creates indexes, business views, partition views against the base data.
Transforms and merges the source data into the published data warehouse.
Archives the data that has reached the end of its captured life.
Note − A warehouse manager also analyzes query profiles to determine whether the indexes and aggregations are appropriate.
Query Manager
Query manager is responsible for directing the queries to the suitable tables.
By directing the queries to appropriate tables, the speed of querying and response generation
can be increased.
Query manager is responsible for scheduling the execution of the queries posed by the user.
Detailed Information
Detailed information is not kept online, rather it is aggregated to the next level of detail and then
archived to tape. The detailed information part of data warehouse keeps the detailed information in
the starflake schema. Detailed information is loaded into the data warehouse to supplement the
aggregated data.
The following diagram shows a pictorial impression of where detailed information is stored and how it
is used.
Note − If detailed information is held offline to minimize disk storage, we should make sure that the
data has been extracted, cleaned up, and transformed into starflake schema before it is archived.
Summary Information
Summary Information is a part of data warehouse that stores predefined aggregations. These
aggregations are generated by the warehouse manager. Summary Information must be treated as
transient. It changes on-the-go in order to respond to the changing query profiles.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not have been backed up, since it can be generated fresh from the detailed information.
access to information. This chapter covers the types of OLAP, operations on OLAP, the difference between OLAP and statistical databases, and the difference between OLAP and OLTP.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed data. The aggregations are stored separately in a MOLAP store.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP operations in
multidimensional data.
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city
to the level of country.
When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter to the level of
month.
When drill-down is performed, one or more dimensions from the data cube are added.
It navigates the data from less detailed data to highly detailed data.
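Roll-up along a concept hierarchy can be sketched as follows; the hierarchy and figures are illustrative assumptions. Drill-down is simply the reverse, reading the finer city-level cells.

```python
# Sketch: roll-up ascends the location hierarchy (city -> country) by
# re-aggregating the cells; the cities and amounts are illustrative.
from collections import defaultdict

city_to_country = {"Vancouver": "Canada", "Victoria": "Canada", "Delhi": "India"}
sales_by_city = {"Vancouver": 100, "Victoria": 50, "Delhi": 75}

# roll-up: aggregate city-level cells up to the country level
sales_by_country = defaultdict(int)
for city, amount in sales_by_city.items():
    sales_by_country[city_to_country[city]] += amount
```

After the roll-up, the city dimension level has effectively been removed; drilling down means returning to sales_by_city for the detailed view.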
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the
following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three dimensions.
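Representing the cube as tuples, slice and dice can be sketched as follows; the dimension values are illustrative assumptions.

```python
# Sketch: a tiny cube as (time, item, location, units) tuples.
cube = [
    ("Q1", "mobile", "Delhi", 10), ("Q1", "modem", "Mumbai", 4),
    ("Q2", "mobile", "Delhi", 7),  ("Q2", "modem", "Delhi", 3),
]

# slice: select on one dimension, e.g. time = "Q1"
q1_slice = [row for row in cube if row[0] == "Q1"]

# dice: select on two or more dimensions, e.g. time and location
dice = [row for row in cube if row[0] in ("Q1", "Q2") and row[2] == "Delhi"]
```

Both operations return a sub-cube; slice fixes a single dimension value, while dice applies criteria on two or more dimensions at once.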
OLAP vs OLTP
OLAP systems are used by knowledge workers such as executives, managers, and analysts, whereas OLTP systems are used by clerks, DBAs, or database professionals.
OLAP provides summarized and consolidated data, whereas OLTP provides primitive and highly detailed data.
Relational OLAP servers are placed between relational back-end server and client front-end tools. To
store and manage the warehouse data, the relational OLAP uses relational or extended-relational
DBMS.
ROLAP tools store and analyze highly volatile and changeable data.
Database server
ROLAP server
Front-end tool.
Advantages
Points to Remember −
MOLAP tools process information with consistent response time regardless of the level of summarization or calculations selected.
MOLAP tools need to avoid many of the complexities of creating a relational database to store
data for analysis.
MOLAP servers adopt two levels of storage representation to handle dense and sparse data sets.
MOLAP Architecture
MOLAP includes the following components −
Database server.
MOLAP server.
Front-end tool.
Advantages
MOLAP is best suited for inexperienced users, since it is very easy to use, whereas ROLAP is best suited for experienced users.
MOLAP maintains a separate database for data cubes, whereas ROLAP may not require space other than that available in the data warehouse.
Star Schema
Each dimension in a star schema is represented with only one dimension table.
The following diagram shows the sales data of a company with respect to the four dimensions,
namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes. For
example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state,country}. This constraint may cause data redundancy. For example, "Vancouver" and
"Victoria" both the cities are in the Canadian province of British Columbia. The entries for such cities
may cause data redundancy along the attributes province_or_state and country.
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
Unlike in the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and
supplier-key.
The supplier key is linked to the supplier dimension table. The supplier dimension table contains
the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, the redundancy is reduced; therefore, it becomes easy to maintain and saves storage space.
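The normalization step described above can be sketched as follows; the rows and attribute values are illustrative assumptions.

```python
# Sketch: normalize the star-schema item table by splitting the supplier
# attributes into their own table. All data values are illustrative.
star_item = [
    {"item_key": 1, "item_name": "pen", "brand": "A", "type": "stationery",
     "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "ink", "brand": "B", "type": "stationery",
     "supplier_key": 10, "supplier_type": "wholesale"},
]

# after normalization: item keeps only supplier_key; supplier_type moves out
item = [{k: r[k] for k in ("item_key", "item_name", "brand", "type", "supplier_key")}
        for r in star_item]
supplier = {r["supplier_key"]: {"supplier_key": r["supplier_key"],
                                "supplier_type": r["supplier_type"]}
            for r in star_item}
```

The repeated supplier_type value is now stored once in the supplier table, which is exactly the redundancy reduction the snowflake schema provides.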
The following diagram shows two fact tables, namely sales and shipping.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
The shipping fact table also contains two measures, namely dollars cost and units shipped.
It is also possible to share dimension tables between fact tables. For example, time, item, and
location dimension tables are shared between the sales and shipping fact table.
Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives,
cube definition and dimension definition, can be used for defining the data warehouses and data
marts.
define cube < cube_name > [ < dimension_list > ]: < measure_list >
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state,country)
define cube shipping [time, item, shipper, from location, to location]:
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data.
Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to
load and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as
read-only. We can then put these partitions into a state where they cannot be modified. Then they can
be backed up. It means only the current partition is to be backed up.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query
performance is enhanced because now the query scans only those partitions that are relevant. It does
not have to scan the whole data.
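Partition pruning can be sketched as follows: with the fact table partitioned by month, a one-month query touches only the matching partition. The data is an illustrative assumption.

```python
# Sketch: a fact table horizontally partitioned by month; a query for one
# month scans only the relevant partition instead of the whole table.
partitions = {
    "2013-08": [("2013-08-03", 5), ("2013-08-20", 3)],
    "2013-09": [("2013-09-03", 4), ("2013-09-15", 7)],
}

def units_in_month(month: str) -> int:
    # only the matching partition is scanned; others are never touched
    return sum(units for _, units in partitions.get(month, []))
```

Real database engines do this pruning automatically from the partition key in the query predicate; the point is that the scan cost is proportional to the relevant partitions, not to the whole fact table.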
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to
keep in mind the requirements for manageability of the data warehouse.
Points to Note
The detailed information remains available online.
The number of physical tables is kept relatively small, which reduces the operating cost.
This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.
This technique is not useful where the partitioning profile changes on a regular basis, because
repartitioning will increase the operation cost of data warehouse.
Suppose a market function has been structured into distinct regional departments, for example on a state-by-state basis. If each region wants to query information captured within its region, it would prove more effective to partition the fact table into regional partitions. This will speed up the queries, because they do not require scanning information that is not relevant.
Points to Note
The query does not have to scan irrelevant data which speeds up the query process.
This technique is not appropriate where the dimensions are likely to change in the future. So, it is worth determining that the dimension will not change before partitioning on it.
If the dimension changes, then the entire fact table would have to be repartitioned.
Note − We recommend performing the partition only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
Points to Note
This partitioning is complex to manage.
Partitioning Dimensions
If a dimension contains a large number of entries, then it is required to partition the dimensions. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply
comparisons, that dimension may be very large. This would definitely affect the response time.
In the round-robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
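The round-robin rotation can be sketched as follows; the table names and the metadata mapping are illustrative assumptions.

```python
# Sketch: round-robin time partitions. When a new period starts, the oldest
# partition is archived and metadata re-points queries at the right table.
from collections import deque

MAX_PARTITIONS = 3
live = deque(["sales_2013_07", "sales_2013_08", "sales_2013_09"])
archived = []

def add_partition(name: str) -> None:
    if len(live) == MAX_PARTITIONS:
        archived.append(live.popleft())  # archive the oldest partition
    live.append(name)

add_partition("sales_2013_10")
# metadata maps each month to the physical table currently holding it
metadata = {month: table for month, table in
            zip(["2013-08", "2013-09", "2013-10"], live)}
```

Because user access tools resolve table names through the metadata, the archive/create cycle can be fully automated without changing any queries.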
Vertical Partition
Vertical partitioning splits the data vertically. The following image depicts how vertical partitioning is done.
Normalization
Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space. Take a look at the following tables that show how normalization is performed.
Store table (inferred columns: store_id, store_name, location, region) −
16   sunny   Bangalore   W
64   san     Mumbai      S
Sales table (inferred columns; the last column references store_id) −
30   5   3.67   3-Aug-13   16
35   4   5.33   3-Sep-13   16
40   5   2.50   3-Sep-13   64
45   7   5.66   3-Sep-13   16
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to
speed up the access to large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major
join operation between two partitions.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
region
transaction_date
Suppose the business is organized into 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transaction from every region will
be in one partition. Now the user who wants to look at data within his own region has to query across
multiple partitions.
In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of data warehouse, we can define metadata as follows.
Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business Metadata − It has the data ownership information, business definition, and changing
policies.
Technical Metadata − It includes database system names, table and column names and sizes,
data types and allowed values. Technical metadata also includes structural information such as
primary and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migration and the transformations applied on it.
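The three categories can be illustrated as one metadata record for a single warehouse table; all field values here are assumptions for illustration.

```python
# Sketch: the three metadata categories expressed as one record for a single
# (hypothetical) warehouse table. All values are illustrative assumptions.
sales_fact_metadata = {
    "business": {"owner": "marketing",
                 "definition": "one row per till receipt"},
    "technical": {"table": "sales_fact",
                  "columns": {"units_sold": "INTEGER"}},
    "operational": {"currency": "active",   # active / archived / purged
                    "lineage": ["extracted from EPOS", "cleaned", "loaded"]},
}
```

A real metadata repository stores many such records and makes them queryable, but the three-way split above is the essential shape.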
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is
different from the warehouse data, yet it plays an important role. The various roles of metadata are
explained below.
This directory helps the decision support system to locate the contents of the data warehouse.
Metadata helps in decision support system for mapping of data when data is transformed from
operational environment to data warehouse environment.
Metadata helps in summarization between current detailed data and highly summarized data.
Metadata also helps in summarization between lightly detailed data and highly summarized
data.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following metadata −
Definition of data warehouse − It includes the description of structure of data warehouse. The
description is defined by schema, view, hierarchies, derived data definitions, and data mart
locations and contents.
Business metadata − It contains the data ownership information, business definition, and changing policies.
Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migration and the transformations applied on it.
Data for mapping from operational environment to data warehouse − It includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, data refresh and purging rules.
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformation, and ensures the accuracy of calculations. Metadata also enforces the definitions of business terms for business end-users. With all these uses, metadata also has its challenges. Some of the challenges are discussed below.
Metadata in a big organization is scattered across the organization. This metadata is spread in
spreadsheets, databases, and applications.
Metadata could be present in text files or multimedia files. To use this data for information
management solutions, it has to be correctly defined.
There are no industry-wide accepted standards. Data management solution vendors have
narrow focus.
Note − Do not create a data mart for any other reason, since the operational cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.
Consider a retail organization, where each merchant is accountable for maximizing the sales of a group
of products. For this, the following are the valuable information −
Given below are the issues to be taken into account while determining the functional split −
The merchant could query the sales trend of other products to analyze what is happening to the
sales.
Note − We need to determine the business benefits and technical feasibility of using a data mart.
There are some tools that populate directly from the source system, but some cannot. Therefore, additional requirements outside the scope of the tool need to be identified for the future.
Note − In order to ensure consistency of data across all access tools, the data should not be directly
populated from the data warehouse, rather each tool must have its own data mart.
Data marts allow us to build a complete wall by physically separating data segments within the data
warehouse. To avoid possible privacy problems, the detailed data can be removed from the data
warehouse. We can create data mart for each legal entity and load it via data warehouse, with detailed
account data.
The summaries are data marted in the same way as they would have been designed within the data
warehouse. Summary tables help to utilize all dimension data in the starflake schema.
data and the data mart exist within the data warehouse, then we would face additional cost to store
and manage replicated data.
Note − Data marting is more expensive than aggregations, therefore it should be used as an additional
strategy and not as an alternative strategy.
Network Access
A data mart could be on a different location from the data warehouse, so we should ensure that the
LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load
process.
Network capacity.
Time window available
Volume of data being transferred
Mechanisms being used to insert data into a data mart
The structure of configuration manager varies from one operating system to another.
The interface of configuration manager allows us to control all aspects of the system.
Some important jobs that a scheduler must be able to handle are as follows −
Data load
Data processing
Index creation
Backup
Aggregation creation
Data transformation
Note − If the data warehouse is running on a cluster or MPP architecture, then the system scheduling
manager must be capable of running across the architecture.
Note − The Event manager monitors the events occurrences and deals with them. The event manager
also tracks the myriad of things that can go wrong on this complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.
Hardware failure
Running out of space on certain key disks
A process dying
A process returning an error
CPU usage exceeding an 80% threshold
Internal contention on database serialization points
Buffer cache hit ratios exceeding or falling below the threshold
A table reaching the maximum of its size
Scheduling
Backup data tracking
Database awareness
Backups are taken only to protect against data loss. Following are the important points to remember −
The backup software will keep some form of database of where and when the piece of data was
backed up.
The backup recovery manager must have a good front-end to that database.
Being aware of the database, the software then can be addressed in database terms, and will
not perform backups that would not be viable.
Load manager
Warehouse manager
Query manager
Data Warehouse Load Manager
Load manager performs the operations required to extract and load the data into the database. The
size and complexity of a load manager varies between specific solutions from one data warehouse to
another.
Perform simple transformations into structure similar to the one in the data warehouse.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in
the fastest possible time.
It is more effective to load the data into a relational database prior to applying transformations
and checks.
Gateway technology is not suitable, since gateways are inefficient when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After completing the simple transformations, we can do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks −
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.
Creates indexes, business views, partition views against the base data.
Generates normalizations.
Transforms and merges the source data of the temporary store into the published data
warehouse.
Archives the data that has reached the end of its captured life.
Note − A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing the queries
to appropriate tables, it speeds up the query request and response process. In addition, the query
manager is responsible for scheduling the execution of the queries posted by the user.
It stores query profiles to allow the warehouse manager to determine which indexes and
aggregations are appropriate.
The objective of a data warehouse is to make large amounts of data easily accessible to the users,
hence allowing the users to extract information about the business as a whole. But we know that there
could be some security restrictions applied on the data that can be an obstacle for accessing the
information. If the analyst has a restricted view of data, then it is impossible to capture a complete
picture of the trends within the business.
The data from each analyst can be summarized and passed on to management, where the different summaries can be aggregated. As an aggregation of summaries need not be the same as an aggregation over the data as a whole, it is possible to miss some information trends in the data unless someone is analyzing the data as a whole.
Security Requirements
Adding security features affects the performance of the data warehouse; therefore, it is important to determine the security requirements as early as possible. It is difficult to add security features after the data warehouse has gone live.
During the design phase of the data warehouse, we should keep in mind what data sources may be
added later and what would be the impact of adding those data sources. We should consider the
following possibilities during the design phase.
Will the new data sources require new security and/or audit restrictions to be implemented?
Will new users be added who have restricted access to data that is already generally available?
This situation arises when the future users and the data sources are not well known. In such a situation,
we need to use the knowledge of business and the objective of data warehouse to know likely
requirements.
User access
Data load
Data movement
Query generation
User Access
We need to first classify the data and then classify the users on the basis of the data they can access. In
other words, the users are classified according to the data they can access.
Data Classification
Data can be classified according to its sensitivity. Highly-sensitive data is classified as highly restricted and less-sensitive data is classified as less restricted.
Data can also be classified according to the job function. This restriction allows only specific
users to view particular data. Here we restrict the users to view only that part of the data in
which they are interested and are responsible for.
There are some issues in the second approach. To understand, let's have an example. Suppose you are
building the data warehouse for a bank. Consider that the data being stored in the data warehouse is
the transaction data for all the accounts. The question here is, who is allowed to see the transaction
data. The solution lies in classifying the data according to the function.
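Classification by job function can be sketched as follows; the functions and rows are illustrative assumptions.

```python
# Sketch: classify account transactions by job function so that each user
# sees only the part they are responsible for. Rows are illustrative.
rows = [
    {"account": 1, "function": "loans",   "value": 500},
    {"account": 2, "function": "savings", "value": 200},
    {"account": 3, "function": "loans",   "value": 300},
]

def visible_to(user_function: str):
    # a user sees only the transactions for their own job function
    return [r for r in rows if r["function"] == user_function]

loans_view = visible_to("loans")
```

In a real warehouse this filter would be enforced by views or row-level security rather than application code, but the classification principle is the same.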
User classification
Users can be classified as per the hierarchy of users in an organization, i.e., users can be
classified by departments, sections, groups, and so on.
Users can also be classified according to their role, with people grouped across departments
based on their role.
Let's have an example of a data warehouse where the users are from the sales and marketing departments.
We can have security by top-to-down company view, with access centered on the different
departments. But there could be some restrictions on users at different levels. This structure is shown
in the following diagram.
But if each department accesses different data, then we should design the security access for each
department separately. This can be achieved by departmental data marts. Since these data marts are
separated from the data warehouse, we can enforce separate security restrictions on each data mart.
This approach is shown in the following figure.
If the data is generally available to all the departments, then it is useful to follow the role access
hierarchy. In other words, if the data is generally accessed by all the departments, then apply security
restrictions as per the role of the user. The role access hierarchy is shown in the following figure.
Audit Requirements
Auditing is a subset of security and a costly activity. Auditing can cause heavy overheads on the system. To complete an audit in time, we require more hardware; therefore, it is recommended that, wherever possible, auditing should be switched off. Audit requirements can be categorized as follows −
Connections
Disconnections
Data access
Data change
Note − For each of the above-mentioned categories, it is necessary to audit success, failure, or both. From a security perspective, the auditing of failures is very important, because failures can highlight unauthorized or fraudulent access.
Network Requirements
Network security is as important as other types of security. We cannot ignore the network security requirement. We need to consider the following issues −
Are there restrictions on which network routes the data can take?
These restrictions need to be considered carefully. Following are the points to remember −
The process of encryption and decryption will increase overheads. It will require more processing power and processing time.
The cost of encryption can be high if the system is already heavily loaded, because the cost of encryption is borne by the source system.
Data Movement
There exist potential security implications while moving the data. Suppose we need to transfer some
restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the
following questions are raised −
Documentation
The audit and security requirements need to be properly documented. This will be treated as a part of
justification. This document can contain all the information gathered from −
Data classification
User classification
Network requirements
Data movement and storage requirements
All auditable actions
Impact of Security on Design
Security affects the application code and the development timescales. Security affects the following
areas −
Application development
Database design
Testing
Application Development
Security affects the overall application development and it also affects the design of the important
components of the data warehouse such as load manager, warehouse manager, and query manager.
The load manager may require checking code to filter records and place them in different locations.
More transformation rules may also be required to hide certain data, and extra metadata may be
needed to handle any extra objects.
To create and maintain extra views, the warehouse manager may require extra code to enforce
security. Extra checks may have to be coded into the data warehouse to prevent it from being fooled
into moving data into a location where it should not be available. The query manager requires changes
to handle any access restrictions, and it will need to be aware of all extra views and aggregations.
Database Design
The database layout is also affected because when security measures are implemented, there is an
increase in the number of views and tables. Adding security increases the size of the database and
hence increases the complexity of the database design and management. It will also add complexity to
the backup management and recovery plan.
Testing
Testing the data warehouse is a complex and lengthy process. Adding security to the data warehouse
also affects the testing time complexity. It affects the testing in the following two ways −
It will increase the time required for integration and system testing.
There is added functionality to be tested which will increase the size of the testing suite.
Backup Terminologies
Before proceeding further, you should know some of the backup terminologies discussed below.
Complete backup − It backs up the entire database at the same time. This backup includes all
the database files, control files, and journal files.
Partial backup − As the name suggests, it does not create a complete backup of the database.
Partial backups are very useful for large databases because they allow a strategy whereby various
parts of the database are backed up in a round-robin fashion on a day-to-day basis, so that the
whole database is backed up effectively once a week.
Cold backup − Cold backup is taken while the database is completely shut down. In a multi-
instance environment, all the instances should be shut down.
Hot backup − Hot backup is taken when the database engine is up and running. The
requirements of hot backup vary from RDBMS to RDBMS.
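The round-robin partial backup strategy described above can be sketched as a simple schedule; the database "part" names and the seven-day cycle are assumptions for illustration, not from the text.

```python
# Sketch of a round-robin partial backup schedule. Each day backs up
# one part of the database; a full rotation covers the whole database.
from itertools import cycle

def build_schedule(parts, days):
    """Assign one database part to each day, cycling through the parts."""
    rotation = cycle(parts)
    return {day: next(rotation) for day in days}

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
parts = ["fact_sales", "dim_customer", "dim_product", "dim_store",
         "dim_time", "aggregates", "metadata"]

schedule = build_schedule(parts, days)
# With seven parts and seven days, every part is backed up exactly once
# a week, matching the "effectively once a week" goal described above.
```

With more parts than days in the list, the rotation would simply take more than one week to cover the whole database.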
Hardware Backup
It is important to decide which hardware to use for the backup. The speed of processing the backup
and restore depends on the hardware being used, how the hardware is connected, bandwidth of the
network, backup software, and the speed of server's I/O system. Here we will discuss some of the
hardware choices that are available and their pros and cons. These choices are as follows −
Tape Technology
Disk Backups
Tape Technology
The tape choice can be categorized as follows −
Tape media
Standalone tape drives
Tape stackers
Tape silos
Tape Media
There exist several varieties of tape media. Some tape media standards are listed below −
DLT − 40 GB capacity, 3 MB/s transfer rate
8 mm − 14 GB capacity, 1 MB/s transfer rate
Standalone Tape Drives
A tape drive can be connected directly to a server node or made available over the network. Either
choice raises issues −
Suppose the server is a 48-node MPP machine. We may not know which node to connect the
tape drive to, nor how to spread the drives over the server nodes to get optimal
performance with the least disruption of the server and the least internal I/O latency.
Connecting the tape drive as a network-available device requires the network to be up to the
job of the huge data transfer rates. Make sure that sufficient bandwidth is available during the
time you require it.
Tape Stackers
A tape stacker is a device that loads multiple tapes into a single tape drive. The stacker
dismounts the current tape when it has finished with it and loads the next tape, so only one tape is
accessible at a time. Prices and capabilities may vary, but the common ability is
that they can perform unattended backups.
Tape Silos
Tape silos provide large storage capacities. Tape silos can store and manage thousands of tapes. They can
integrate multiple tape drives. They have the software and hardware to label and store the tapes they
store. It is very common for the silo to be connected remotely over a network or a dedicated link. We
should ensure that the bandwidth of the connection is up to the job.
Disk Backups
Methods of disk backups are −
Disk-to-disk backups
Mirror breaking
These methods are used in the OLTP system. These methods minimize the database downtime and
maximize the availability.
Disk-to-Disk Backups
Here the backup is taken on disk rather than on tape. Disk-to-disk backups are done for the following
reasons −
Mirror Breaking
The idea is to have disks mirrored for resilience during the working day. When backup is required, one
of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.
Note − The database may need to be shutdown to guarantee consistency of the backup.
Optical Jukeboxes
Optical jukeboxes allow the data to be stored near-line. This technique allows a large number of optical
disks to be managed in the same way as a tape stacker or a tape silo. The drawback of this technique is
that optical disks have slower write speeds than magnetic disks, but the optical media provides the
long life and reliability that make them a good choice of medium for archiving.
Software Backups
There are software tools available that help in the backup process. These software tools come as a
package. These tools not only take backup, they can effectively manage and control the backup
strategies. There are many software packages available in the market. Some of them are listed in the
following table −
Networker − Legato
ADSM − IBM
Omniback II − HP
Alexandria − Sequent
It is very difficult to predict what query the user is going to post in the future.
Performance Assessment
Here is a list of objective measures of performance −
It is of no use trying to tune response times if they are already better than those required.
To hide the complexity of the system from the user, aggregations and views should be used.
It is also possible that the user can write a query you had not tuned for.
Note − If there is a delay in transferring the data, or in arrival of data then the entire system is affected
badly. Therefore it is very important to tune the data load first.
There are various approaches of tuning data load that are discussed below −
The very common approach is to insert data using the SQL Layer. In this approach, normal
checks and constraints need to be performed. When the data is inserted into the table, the
code will run to check for enough space to insert the data. If sufficient space is not available,
then more space may have to be allocated to these tables. These checks take time to perform
and are costly in CPU.
The second approach is to bypass all these checks and constraints and place the data directly
into the preformatted blocks. These blocks are later written to the database. It is faster than
the first approach, but it can work only with whole blocks of data. This can lead to some space
wastage.
The third approach is to maintain the indexes while loading the data into a table that already
contains data.
The fourth approach says that to load the data in tables that already contain data, drop the
indexes & recreate them when the data load is complete. The choice between the third and
the fourth approach depends on how much data is already loaded and how many indexes need
to be rebuilt.
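The fourth approach can be sketched with SQLite standing in for the warehouse RDBMS; the table, index, and sample rows are invented for the example.

```python
# Sketch of the drop-indexes / bulk-load / rebuild-indexes approach.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, store TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_day ON sales (day)")

rows = [("2024-01-01", "S1", 100.0), ("2024-01-01", "S2", 80.0)]

conn.execute("DROP INDEX idx_sales_day")                      # drop first
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)  # bulk load
conn.execute("CREATE INDEX idx_sales_day ON sales (day)")     # rebuild once

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Rebuilding once after the load avoids maintaining the index on every inserted row, which is the trade-off against the third approach described above.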
Integrity Checks
Integrity checking highly affects the performance of the load. Following are the points to remember −
Integrity checks need to be limited because they require heavy processing power.
Integrity checks should be applied on the source system to avoid performance degradation of the
data load.
Tuning Queries
We have two kinds of queries in data warehouse −
Fixed queries
Ad hoc queries
Fixed Queries
Fixed queries are well defined. Following are the examples of fixed queries −
Regular reports
Canned queries
Common aggregations
Tuning the fixed queries in a data warehouse is the same as in a relational database system. The only
difference is that the amount of data to be queried may be different. It is good to store the most
successful execution plan while testing fixed queries. Storing these execution plans will allow us to spot
changes in data size and data skew, as these will cause the execution plan to change.
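One way to store and compare execution plans can be sketched with SQLite's EXPLAIN QUERY PLAN, standing in for whatever plan facility the warehouse RDBMS provides; the table and index names are assumptions.

```python
# Capture a query's execution plan so later captures can be diffed
# against it; a changed plan hints at changed data size or skew.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (day TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_day ON fact_sales (day)")

def capture_plan(conn, sql):
    """Return the plan detail lines for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

baseline = capture_plan(
    conn, "SELECT SUM(amount) FROM fact_sales WHERE day = '2024-01-01'")
# Persist `baseline` alongside the test results; re-capture after each
# load and compare the two to spot a changed plan.
```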
Note − We cannot do much on the fact table, but while dealing with dimension tables or the
aggregations, the usual collection of SQL tweaking, storage mechanisms, and access methods can be
used to tune these queries.
Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For
each user or group of users, you need to know the following −
It is important to track the user's profiles and identify the queries that are run on a regular
basis.
It is also important that the tuning performed for these queries does not degrade the performance of others.
If these queries are identified, then the database will change and new indexes can be added for
those queries.
If these queries are identified, then new aggregations can be created specifically for those
queries that would result in their efficient execution.
Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is separately tested.
Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested individually.
Integration Testing
In integration testing, the various modules of the application are brought together and then
tested against a number of inputs.
System Testing
In system testing, the whole data warehouse application is tested together.
The purpose of system testing is to check whether the entire system works correctly together
or not.
Since the size of the whole data warehouse is very large, it is usually only possible to perform
minimal system testing before the test plan can be enacted.
Test Schedule
First of all, the test schedule is created in the process of developing the test plan. In this schedule, we
predict the estimated time required for the testing of the entire data warehouse system.
There are different methodologies available to create a test schedule, but none of them are perfect
because the data warehouse is very complex and large. Also the data warehouse system is evolving in
nature. One may face the following issues while creating a test schedule −
A simple problem may involve a very large query that takes a day or more to complete, i.e.,
the query does not complete in the desired time scale.
There may be hardware failures such as losing a disk or human errors such as accidentally
deleting a table or overwriting a large table.
Note − Due to the above-mentioned difficulties, it is recommended to always double the amount of
time you would normally allow for testing.
The backup recovery strategy should be tested against failures such as −
Media failure
Loss or damage of table space or data file
Loss or damage of redo log file
Loss or damage of control file
Instance failure
Loss or damage of archive file
Loss or damage of table
Failure during data movement
Testing Operational Environment
There are a number of aspects that need to be tested. These aspects are listed below.
Security − A separate security document is required for security testing. This document contains
a list of disallowed operations and tests devised for each of them.
Disk Configuration − Disk configuration also needs to be tested to identify I/O bottlenecks. The
test should be performed multiple times with different settings.
Management Tools − It is required to test all the management tools during system testing.
Here is the list of tools that need to be tested.
o Event manager
o System manager
o Database manager
o Configuration manager
o Backup recovery manager
Testing the Database
The database is tested in the following three ways −
Testing the database manager and monitoring tools − To test the database manager and the
monitoring tools, they should be used in the creation, running, and management of test
database.
Testing database features − Here is the list of features that we have to test −
o Querying in parallel
Testing database performance − Query execution plays a very important role in data
warehouse performance measures. There are sets of fixed queries that need to be run
regularly and they should be tested. To test ad hoc queries, one should go through the user
requirement document and understand the business completely. Take time to test the most
awkward queries that the business is likely to ask against different index and aggregation
strategies.
Scheduling software
Day-to-day operational procedures
Backup recovery strategy
Management and scheduling tools
Overnight processing
Query performance
Note − The most important point is to test the scalability. Failure to do so will leave us with a system
design that does not work when the system grows.
The size of a typical open database has approximately doubled in magnitude in the last few years,
which shows the significant value that it contains.
As the size of the databases grow, the estimates of what constitutes a very large database
continues to grow.
The hardware and software that are available today do not make it easy to keep a large amount of
data online. For example, a Telco call record store requires 10 TB of data to be kept online, which is
just the size of one month's records. If records of sales, marketing, customers,
employees, etc. also need to be kept, then the size will be more than 100 TB.
The record contains textual information and some multimedia data. Multimedia data cannot be
manipulated as easily as text data. Searching multimedia data is not an easy task, whereas
textual information can be retrieved by the relational software available today.
Apart from size planning, it is complex to build and run data warehouse systems that are ever
increasing in size. As the number of users increases, the size of the data warehouse also
increases. These users will also require access to the system.
With the growth of the Internet, there is a requirement of users to access data online.
Several concepts are of particular importance to data warehousing. They are discussed in detail in this
section.
Dimensional Data Model: Dimensional data model is commonly used in data warehousing systems. This
section describes this modeling technique, and the two common schema types, star
schema and snowflake schema.
Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This section
explains the problem, and describes the three ways of handling this problem with examples.
Conceptual Data Model: What is a conceptual data model, its features, and an example of this type of
data model.
Logical Data Model: What is a logical data model, its features, and an example of this type of data
model.
Physical Data Model: What is a physical data model, its features, and an example of this type of data
model.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This
section compares and contrasts the three different types of data models.
Data Integrity: What is data integrity and how it is enforced in data warehousing.
MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses
how they are different from the other, and the advantages and disadvantages of each.
Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a different view of the
role between data warehouse and data mart.
Factless Fact Table: A fact table without any fact may sound silly, but there are real life instances when a
factless fact table is useful in data warehousing.
Junk Dimension: Discusses the concept of a junk dimension: When to use it and why is it useful.
Conformed Dimension: Discusses the concept of a conformed dimension: What is it and why is it
important.
The dimensional data model is most often used in data warehousing systems. This is different from the 3rd
normal form, commonly used for transactional (OLTP) systems. As you can imagine, the same data
would then be stored differently in a dimensional model than in a 3rd normal form model.
To understand dimensional data modeling, let's define some of the terms commonly used in this type of
modeling:
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
Hierarchy: The specification of levels that represents relationship between different attributes within a
dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month →
Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount
would be such a measure. This measure is stored in the fact table with the appropriate granularity. For
example, it can be sales amount by store by day. In this case, the fact table would contain three
columns: A date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the
lookup table for the Quarter attribute would include a list of all of the quarters available in the data
warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the
quarter, and one or more additional fields that specify how that particular quarter is represented on a
report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup
tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are
represented by lookup tables. Attributes are the non-key columns in the lookup tables.
In designing data models for data warehouses / data marts, the most commonly used schema types
are Star Schema and Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and business needs.
Personally, I am partial to snowflakes, when there is a business case to analyze the information at that
particular level.
In the star schema design, a single object (the fact table) sits in the middle and is radially connected to
other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a
single table. The primary key in each dimension table is related to a foreign key in the fact table.
All measures in the fact table are related to all the dimensions that the fact table is related to. In other
words, they all have the same level of granularity.
A star schema can be simple or complex. A simple star consists of one fact table; a complex star can
have more than one fact table.
Let's look at an example: Assume our data warehouse keeps store sales data, and the different
dimensions are time, store, product, and customer. In this case, the figure on the left represents our star
schema. The lines between two tables indicate that there is a primary key / foreign key relationship
between the two tables. Note that different dimensions are not related to one another.
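The store-sales star schema just described can be sketched as DDL, run here through SQLite; all table and column names are assumptions for the example.

```python
# A minimal star schema: one fact table whose foreign keys point at
# four dimension lookup tables. Dimensions relate only to the fact
# table, never to one another.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, day  TEXT);
CREATE TABLE dim_store    (store_id    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    time_id      INTEGER REFERENCES dim_time(time_id),
    store_id     INTEGER REFERENCES dim_store(store_id),
    product_id   INTEGER REFERENCES dim_product(product_id),
    customer_id  INTEGER REFERENCES dim_customer(customer_id),
    sales_amount REAL
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```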
The snowflake schema is an extension of the star schema, where each point of the star explodes into
more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a
snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a
level in the dimensional hierarchy.
For example, to snowflake the Time dimension, we will have 4 lookup tables: a lookup table for year, a
lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month,
which is then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating
the above relationships in the Time dimension is shown to the right.
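The four Time-dimension lookup tables above can be sketched as DDL in SQLite (column names are assumptions): Day references both Month and Week, while Month references Year.

```python
# Normalized Time dimension for a snowflake schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lu_year  (year_id  INTEGER PRIMARY KEY, year  INTEGER);
CREATE TABLE lu_month (month_id INTEGER PRIMARY KEY, month INTEGER,
                       year_id  INTEGER REFERENCES lu_year(year_id));
CREATE TABLE lu_week  (week_id  INTEGER PRIMARY KEY, week  INTEGER);
CREATE TABLE lu_day   (day_id   INTEGER PRIMARY KEY, day   TEXT,
                       month_id INTEGER REFERENCES lu_month(month_id),
                       week_id  INTEGER REFERENCES lu_week(week_id));
""")
lookup_tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
```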
The main advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the
snowflake schema is the additional maintenance effort needed due to the increased number of lookup
tables.
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity,
we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
determining which dimensions will be included, and determining where along the hierarchy of each
dimension the information will be kept.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography,
and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket
with a Rewards Card program, where customers provide some personal information in exchange for a
rewards card, and the supermarket would offer lower prices for certain items for customers who
present a rewards card at checkout, will also have the ability to track the customer dimension. Whether
the data warehousing system includes the customer dimension will then be a decision that needs to be
made.
Determining which part of the hierarchy the information is stored along each dimension is not an exact
science. This is where user requirements (both stated and possibly future) play a major role.
In the above example, will the supermarket want to do analysis at the hourly level? (i.e.,
looking at how certain products may sell by different hours of the day.) If so, it makes sense to use 'hour'
as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be
used as the lowest level of granularity. Since the lower the level of detail, the larger the data amount in
the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between
detailed level of analysis and data storage.
Note that sometimes the users will not specify certain requirements, but based on the industry
knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that
may result in the need of additional details. In such cases, it is prudent for the data warehousing team to
design the fact table such that lower-level information is included. This will avoid possibly needing to re-
design the fact table in the future. On the other hand, trying to anticipate all future requirements is an
impossible and hence futile exercise, and the data warehousing team needs to fight the urge of the
"dumping the lowest level of detail into the data warehouse" symptom, and include only what is
practically needed. Sometimes this can be more of an art than a science, and prior experience will become
invaluable here.
Data Warehousing > Concepts > Fact And Fact Table Types
Types of Facts
Additive: Additive facts are facts that can be summed up through all of the dimensions in the
fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions
in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions
present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we are
a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a daily
basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up
this fact along any of the three dimensions present in the fact table -- date, store, and product. For
example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that
week.
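The additive behaviour can be checked directly; the sample rows below are invented for illustration.

```python
# Additive fact: Sales_Amount can be summed along any of the three
# dimensions (date, store, product) and the totals stay meaningful.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (date TEXT, store TEXT, product TEXT, sales_amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("2024-01-01", "S1", "P1", 10.0),
    ("2024-01-01", "S2", "P1", 20.0),
    ("2024-01-02", "S1", "P2", 30.0),
])

# Summing along the store dimension...
by_store = dict(conn.execute(
    "SELECT store, SUM(sales_amount) FROM sales GROUP BY store"))
# ...or across everything, both give meaningful sales totals.
total = conn.execute("SELECT SUM(sales_amount) FROM sales").fetchone()[0]
```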
The second example assumes that we are a bank, and our fact table has the following columns:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as
well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the
facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's
the total current balance for all accounts in the bank?), but it does not make sense to add them up
through time (adding up all current balances for a given account for each day of the month does not give
us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them
up for the account level or the day level.
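A small sketch with invented figures shows why Current_Balance is only semi-additive.

```python
# Semi-additive fact: Current_Balance sums meaningfully across accounts
# on a single day, but summing one account across days is not a balance.
rows = [  # (date, account, current_balance) -- invented sample data
    ("2024-01-01", "A", 100.0), ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0), ("2024-01-02", "B", 50.0),
]

def total_balance_on(day):
    """Valid aggregation: bank-wide balance on one day."""
    return sum(bal for d, _, bal in rows if d == day)

# Invalid aggregation: account A summed over time gives 220.0,
# which is not a balance of anything.
meaningless = sum(bal for _, acct, bal in rows if acct == "A")
```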
Based on the above classifications, there are two types of fact tables:
Cumulative: This type of fact table describes what has happened over a period of time. For
example, this fact table may describe the total sales by product by store by day. The facts for
this type of fact tables are mostly additive facts. The first example presented here is a
cumulative fact table.
Snapshot: This type of fact table describes the state of things in a particular instance of time,
and usually includes more semi-additive and non-additive facts. The second example presented
here is a snapshot fact table.
Data Warehousing > Concepts > Slowly Changing Dimensions
The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a
nutshell, this applies to cases where the attribute for a record varies over time. We give an
example below:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in
the customer lookup table has the following record:
At a later date, she moved to Los Angeles, California in January 2003. How should ABC Inc. now
modify its customer table to reflect this change? This is the "Slowly Changing Dimension"
problem.
There are in general three ways to solve this type of problem, and they are categorized as
follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.
Type 3: The original record is modified to reflect the change.
We next take a look at each of the scenarios and how the data model and the data looks like for
each of them. Finally, we compare and contrast among the three alternatives.
Data Warehousing > Concepts > Type 1 Slowly Changing Dimension
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.
In our example, recall we originally have the following table:
After Christina moved from Illinois to California, the new information replaces the old record,
and we have the following table:
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no
need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois
before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data
warehouse to keep track of historical changes.
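A minimal sketch of the Type 1 overwrite; the record layout is an assumption, not from the text.

```python
# Type 1: the new value overwrites the old one in place. The surrogate
# key stays the same and no history survives.
customer = {"customer_key": 1001, "name": "Christina", "state": "Illinois"}

def scd_type1_update(record, **changes):
    """Overwrite attributes in place (Type 1 slowly changing dimension)."""
    record.update(changes)
    return record

scd_type1_update(customer, state="California")
# The Illinois value is now gone; the company can no longer tell
# that Christina ever lived there.
```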
Data Warehousing > Concepts > Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new
information. Therefore, both the original and the new record will be present. The new record
gets its own primary key.
In our example, recall we originally have the following table:
After Christina moved from Illinois to California, we add the new information as a new row into
the table:
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the
table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse
to track historical changes.
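A minimal sketch of the Type 2 insert, assuming integer surrogate keys; the key scheme and record layout are assumptions.

```python
# Type 2: a new row with its own surrogate key is appended, so both the
# Illinois and California versions of the customer remain queryable.
table = [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]

def scd_type2_insert(table, name, **new_attrs):
    """Add a new version of the customer under a fresh surrogate key."""
    next_key = max(row["customer_key"] for row in table) + 1
    table.append({"customer_key": next_key, "name": name, **new_attrs})

scd_type2_insert(table, "Christina", state="California")
states = [row["state"] for row in table]
# The table now holds two rows for Christina, i.e. she is treated
# essentially as two people.
```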
Data Warehousing > Concepts > Type 3 Slowly Changing Dimension
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of
interest, one indicating the original value, and one indicating the current value. There will also be a
column that indicates when the current value becomes active.
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we have the
following table (assuming the effective date of change is January 15, 2003):
Advantages:
- This does not increase the size of the table, since new information is updated.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example,
if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to
track historical changes, and when such changes will only occur a finite number of times.
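A minimal sketch of the Type 3 update using the columns listed above; the field names mirror that column list but the record layout is otherwise an assumption.

```python
# Type 3: the current value moves into Current State and the Effective
# Date is stamped. Only one prior value (Original State) is ever kept.
customer = {"customer_key": 1001, "name": "Christina",
            "original_state": "Illinois", "current_state": "Illinois",
            "effective_date": None}

def scd_type3_update(record, new_state, effective_date):
    """Update the current value in place, preserving the original."""
    record["current_state"] = new_state
    record["effective_date"] = effective_date
    return record

scd_type3_update(customer, "California", "2003-01-15")
# A second move (e.g. to Texas) would overwrite "California",
# which is exactly the limitation described above.
```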
A conceptual data model identifies the highest-level relationships between the different entities.
Features of conceptual data model include:
From the figure above, we can see that the only information shown via the conceptual data model is the
entities that describe the data and the relationships between those entities. No other information is
shown through the conceptual data model.
A logical data model describes the data in as much detail as possible, without regard to how they will be
physically implemented in the database. Features of a logical data model include:
The steps for designing the logical data model are as follows:
Comparing the logical data model shown above with the conceptual data model diagram, we see the
main differences between the two:
In a logical data model, primary keys are present, whereas in a conceptual data model, no
primary key is present.
In a logical data model, all attributes are specified within an entity. No attributes are specified in
a conceptual data model.
Relationships between entities are specified using primary keys and foreign keys in a logical data
model. In a conceptual data model, the relationships are simply stated, not specified, so we
simply know that two entities are related, but we do not specify what attributes are used for
this relationship.
Physical data model represents how the model will be built in the database. A physical database model
shows all table structures, including column name, column data type, column constraints, primary key,
foreign key, and relationships between tables. Features of a physical data model include:
Comparing the physical data model shown above with the logical data model diagram, we see the main
differences between the two:
Entity names are now table names.
Attributes are now column names.
The data type for each column is specified.
Below we show the conceptual, logical, and physical versions of a single data model.
We can see that the complexity increases from conceptual to logical to physical. This is why we
always first start with the conceptual data model (so we understand at high level what are the
different entities in our data and how they relate to one another), then move on to the logical
data model (so we understand the details of our data without worrying about how they will
actually be implemented), and finally the physical data model (so we know exactly how to
implement our data model in the database of choice). In a data warehousing project, sometimes
the conceptual data model and the logical data model are considered as a single deliverable.
Data Warehousing > Concepts > Data Integrity
Data integrity refers to the validity of data, meaning data is consistent and correct. In the data
warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no data
integrity in the data warehouse, any resulting report and analysis will not be useful.
In a data warehouse or a data mart, there are three areas where data integrity needs to be
enforced:
Database level
We can enforce data integrity at the database level. Common ways of enforcing data integrity
include:
Referential integrity
The relationship between the primary key of one table and the foreign key of another table
must always be maintained. For example, a primary key cannot be deleted if there is still a
foreign key that refers to this primary key.
Primary key / Unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a table can be
uniquely identified.
Not NULL vs. NULL-able
Columns identified as NOT NULL may not contain a NULL value.
Valid Values
Only allowed values are permitted in the database. For example, if a column can only have
positive integers, a value of '-1' cannot be allowed.
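The database-level checks above can be sketched as follows, using SQLite from Python with a hypothetical product dimension and sales fact table; each kind of bad row is rejected by the corresponding constraint:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
con.execute("""
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,         -- primary key
        product_code TEXT NOT NULL UNIQUE,        -- NOT NULL + unique constraint
        unit_price   REAL CHECK (unit_price > 0)  -- valid values only
    )
""")
con.execute("""
    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES product_dim (product_key),  -- foreign key
        qty         INTEGER
    )
""")
con.execute("INSERT INTO product_dim VALUES (1, 'A100', 9.99)")

violations = 0
for bad_sql in (
    "INSERT INTO product_dim VALUES (2, NULL, 1.0)",     # NOT NULL violated
    "INSERT INTO product_dim VALUES (3, 'A100', 2.0)",   # UNIQUE violated
    "INSERT INTO product_dim VALUES (4, 'B200', -1.0)",  # CHECK: -1 not allowed
    "INSERT INTO sales_fact VALUES (99, 5)",             # FK: product 99 missing
):
    try:
        con.execute(bad_sql)
    except sqlite3.IntegrityError:
        violations += 1  # the database rejected the bad row

print(violations)  # 4 -- all four bad rows were rejected
```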
ETL process
For each step of the ETL process, data integrity checks should be put in place to ensure that
source data is the same as the data in the destination. Most common checks include record
counts or record sums.
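A minimal sketch of such a reconciliation check, assuming hypothetical in-memory source and target row sets, compares record counts and record sums between the two sides of an ETL step:

```python
# Hypothetical ETL step: rows extracted from the source, rows loaded into the target.
source_rows = [("2024-01-01", 120.0), ("2024-01-02", 75.5), ("2024-01-03", 10.0)]
loaded_rows = [("2024-01-01", 120.0), ("2024-01-02", 75.5), ("2024-01-03", 10.0)]

def integrity_check(source, target, amount_index=1):
    """Record-count and record-sum reconciliation between source and target."""
    count_ok = len(source) == len(target)
    sum_ok = abs(sum(r[amount_index] for r in source)
                 - sum(r[amount_index] for r in target)) < 1e-9
    return count_ok and sum_ok

print(integrity_check(source_rows, loaded_rows))  # True -- destination matches source
```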
Access level
We need to ensure that data is not altered by any unauthorized means either during the ETL
process or in the data warehouse. To do this, there needs to be safeguards against unauthorized
access to data (including physical access to the servers), as well as logging of all data access
history. Data integrity can only be ensured if there is no unauthorized access to the data.
Data Warehousing > Concepts > What Is OLAP
OLAP stands for On-Line Analytical Processing. The first attempt to define OLAP was made by Dr.
Codd, who proposed 12 rules for OLAP. It was later discovered that this particular white paper had
been sponsored by one of the OLAP tool vendors, causing it to lose objectivity.
The OLAP Report has proposed the FASMI test, Fast Analysis
of Shared Multidimensional Information. For a more detailed description of both Dr. Codd's
rules and the FASMI test, please visit The OLAP Report.
For people on the business side, the key part of this test is "Multidimensional," in other
words, the ability to analyze metrics across different dimensions such as time, geography,
gender, product, etc. For example, sales for the company are up. What region is most
responsible for this increase? Which store in this region is most responsible for the increase?
What particular product category or categories contributed the most to the increase? Answering
these types of questions in order means that you are performing an OLAP analysis.
Depending on the underlying technology used, OLAP can be broadly divided into two different
camps: MOLAP and ROLAP. A discussion of the different OLAP types can be found in the MOLAP,
ROLAP, and HOLAP section.
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational
OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube.
The storage is not in the relational database, but in proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing
and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the
cube is built, it is not possible to include a large amount of data in the cube itself. This is not to
say that the data in the cube cannot be derived from a large amount of data. Indeed, this is
possible. But in this case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not already exist
in the organization. Therefore, to adopt MOLAP technology, chances are additional investments
in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and
dicing is equivalent to adding a "WHERE" clause in the SQL statement.
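The idea can be sketched with SQLite from Python: given a hypothetical sales fact table (names and values invented for illustration), each slice or dice is just a WHERE clause on a SQL query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_fact (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('East', 'Widget', 100.0),
        ('East', 'Gadget',  50.0),
        ('West', 'Widget',  80.0);
""")

# "Slicing" on the region dimension is a single WHERE condition.
east_total = con.execute(
    "SELECT SUM(amount) FROM sales_fact WHERE region = 'East'").fetchone()[0]

# "Dicing" narrows more than one dimension at once.
east_widget = con.execute(
    "SELECT SUM(amount) FROM sales_fact "
    "WHERE region = 'East' AND product = 'Widget'").fetchone()[0]

print(east_total, east_widget)  # 150.0 100.0
```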
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation
on data size of the underlying relational database. In other words, ROLAP itself places no
limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, a relational database
already comes with a host of functionalities. ROLAP technologies, since they sit on top of the
relational database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL
queries) in the relational database, the query time can be long if the underlying data size is
large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for
example, it is difficult to perform complex calculations using SQL), ROLAP technologies are
therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by
building into the tool out-of-the-box complex functions as well as the ability to allow users to
define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type
information, HOLAP leverages cube technology for faster performance. When detail information is
needed, HOLAP can "drill through" from the cube into the underlying relational data.
Data Warehousing > Concepts > Bill Inmon vs. Ralph Kimball
In the data warehousing field, we often hear discussions about whether a person's or organization's
philosophy falls into Bill Inmon's camp or Ralph Kimball's camp. We describe below the difference
between the two.
Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An
enterprise has one data warehouse, and data marts source their information from the data warehouse.
In the data warehouse, information is stored in 3rd normal form.
Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise.
Information is always stored in the dimensional model.
There is no right or wrong between these two ideas, as they represent different data warehousing
philosophies. In reality, the data warehouse systems in most enterprises are closer to Ralph Kimball's
idea. This is because most data warehouses started out as a departmental effort, and hence they
originated as a data mart. Only when more data marts are built later do they evolve into a data
warehouse.
A factless fact table is a fact table that does not have any measures. It is essentially an intersection of
dimensions. On the surface, a factless fact table does not make sense, since a fact table is, after all,
about facts. However, there are situations where having this kind of relationship makes sense in data
warehousing.
For example, think about a record of student attendance in classes. In this case, the fact table would
consist of 3 dimensions: the student dimension, the time dimension, and the class dimension. This
factless fact table would look like the following:
The only measure that you can possibly attach to each combination is "1" to show the presence of that
particular combination. However, adding a fact that always shows 1 is redundant because we can simply
use the COUNT function in SQL to answer the same questions.
Factless fact tables offer great flexibility in data warehouse design. For example, with the
factless fact table above, one can easily answer questions such as how many students attended a
particular class on a particular day, or how many classes a particular student attended on a given
day. Without a factless fact table, we would need two separate fact tables to answer these two
questions; with the factless fact table above, a single fact table suffices.
In data warehouse design, frequently we run into a situation where there are yes/no indicator fields in
the source system. Through business analysis, we know it is necessary to keep such information in the
fact table. However, if we keep all those indicator fields in the fact table, not only do we need to build many
small dimension tables, but the amount of information stored in the fact table also increases
tremendously, leading to possible performance and management issues.
A junk dimension is a way to solve this problem. In a junk dimension, we combine these indicator fields
into a single dimension. This way, we'll only need to build a single dimension table, and the number of
fields in the fact table, as well as the size of the fact table, can be decreased. The content in the junk
dimension table is the combination of all possible values of the individual indicator fields.
Let's look at an example. Assuming that we have the following fact table:
In this example, TXN_CODE, COUPON_IND, and PREPAY_IND are all indicator fields. In this existing
format, each one of them is a dimension. Using the junk dimension principle, we can combine them into
a single junk dimension, resulting in the following fact table:
Note that now the number of dimensions in the fact table went from 7 to 5.
The content of the junk dimension table would look like the following:
In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND
field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the
junk dimension table.
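The cross-product construction can be sketched in Python; the specific indicator values below are invented for illustration, but the counts (3 transaction codes, 2 coupon values, 2 prepay values) match the example:

```python
from itertools import product

# Hypothetical indicator values; the junk dimension table is their
# Cartesian product, one row per combination.
txn_codes  = ("PURCHASE", "REFUND", "VOID")  # 3 values
coupon_ind = ("Y", "N")                      # 2 values
prepay_ind = ("Y", "N")                      # 2 values

junk_dim = [
    {"junk_key": i, "txn_code": t, "coupon_ind": c, "prepay_ind": p}
    for i, (t, c, p) in enumerate(product(txn_codes, coupon_ind, prepay_ind),
                                  start=1)
]
print(len(junk_dim))  # 12 rows: 3 x 2 x 2
```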
By using a junk dimension to replace the 3 indicator fields, we have decreased the number of
dimensions by 2 and also decreased the number of fields in the fact table by 2. This will result in a data
warehousing environment that offers better performance and is easier to manage.
A conformed dimension is a dimension that has exactly the same meaning and content when being
referred from different fact tables. A conformed dimension can refer to multiple tables in multiple data
marts within the same organization. For two dimension tables to be considered as conformed, they
must either be identical or one must be a subset of another. There cannot be any other type of
difference between the two tables. For example, two dimension tables that are exactly the same except
for the primary key are not considered conformed dimensions.
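The "identical or a subset" rule can be sketched in Python, treating each dimension table as a set of rows (the row values are invented for illustration); note that two tables whose rows differ only in their keys fail the test:

```python
def is_conformed(dim_a, dim_b):
    """Two dimension tables (as sets of rows) are conformed only if they are
    identical or one is a proper subset of the other."""
    return dim_a == dim_b or dim_a < dim_b or dim_b < dim_a

full_time_dim = {(20240101, "Mon"), (20240102, "Tue"), (20240103, "Wed")}
subset_dim    = {(20240101, "Mon"), (20240102, "Tue")}
other_keys    = {(1, "Mon"), (2, "Tue")}  # same content, different keys

print(is_conformed(full_time_dim, subset_dim))  # True: subset
print(is_conformed(full_time_dim, other_keys))  # False: rows differ
```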
Why is conformed dimension important? This goes back to the definition of data warehouse being
"integrated." Integrated means that even if a particular entity had different meanings and different
attributes in the source systems, there must be a single version of this entity once the data flows into
the data warehouse.
The time dimension is a common conformed dimension in an organization. Usually the only issues to
consider with the time dimension are whether there is a fiscal year in addition to the calendar
year, and how a week is defined. Fortunately, both are relatively easy to resolve. In the case of fiscal vs. calendar
year, one may go with either fiscal or calendar, or an alternative is to have two separate conformed
dimensions, one for fiscal year and one for calendar year. The definition of a week is also something that
can be different in large organizations: Finance may use Saturday to Friday, while marketing may use
Sunday to Saturday. In this case, we should decide on a definition and move on. The nice thing about the
time dimension is once these rules are set, the values in the dimension table will never change. For
example, October 16th will never become the 15th day in October.
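A time dimension carrying both calendar-year and fiscal-year attributes can be sketched as follows; the July 1 fiscal-year start is an illustrative assumption, not a rule from the text:

```python
from datetime import date, timedelta

def time_dimension(start, days):
    """Generate time dimension rows with calendar and fiscal year attributes.
    Assumes (for illustration) a fiscal year that starts on July 1."""
    rows = []
    for i in range(days):
        d = start + timedelta(days=i)
        fiscal_year = d.year + 1 if d.month >= 7 else d.year
        rows.append({
            "date_key": d.strftime("%Y%m%d"),
            "calendar_year": d.year,
            "fiscal_year": fiscal_year,
            "weekday": d.strftime("%A"),
        })
    return rows

# Four days straddling the assumed fiscal-year boundary.
dim = time_dimension(date(2024, 6, 29), 4)
print([r["fiscal_year"] for r in dim])  # [2024, 2024, 2025, 2025]
```

Once rules like these are fixed, the generated rows never change, which is what makes the time dimension the easiest conformed dimension to maintain.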
Not all conformed dimensions are as easy to produce as the time dimension. An example is the
customer dimension. In any organization with some history, there is a high likelihood that different
customer databases exist in different parts of the organization. To achieve a conformed customer
dimension means those data must be compared against each other, rules must be set, and data must be
cleansed. In addition, when we are doing incremental data loads into the data warehouse, we'll need to
apply the same rules to the new values to make sure we are only adding truly new customers to the
customer dimension.
Building a conformed dimension is also part of the process in master data management, or MDM. In
MDM, one must not only make sure the master data dimensions are conformed, but that conformity
needs to be brought back to the source systems.