Q: Define a data warehouse.
A : A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data
that supports management's decision-making process.
Q: What does a subject-oriented data warehouse signify?
A : Subject oriented signifies that the data warehouse stores the information around a particular
subject such as product, customer, sales, etc.
Q: List any five applications of data warehouse.
A : Some applications include financial services, banking services, consumer goods, retail sectors, and
controlled manufacturing.
Q: What do OLAP and OLTP stand for?
A : OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online
Transactional Processing.
Q: What is the very basic difference between data warehouse and operational databases?
A : A data warehouse contains historical information that is made available for analysis of the business
whereas an operational database contains current information that is required to run the business.
Q: List the schemas that a data warehouse system can implement.
A : A data Warehouse can implement star schema, snowflake schema, and fact constellation schema.
Q: What is Data Warehousing?
A : Data Warehousing is the process of constructing and using the data warehouse.
Q: List the processes that are involved in Data Warehousing.
A : Data Warehousing involves data cleaning, data integration, and data consolidation.
Q: List the functions of data warehouse tools and utilities.
A : The functions performed by data warehouse tools and utilities are Data Extraction, Data Cleaning,
Data Transformation, Data Loading and Refreshing.
Q: What is metadata?
A : Metadata is simply defined as data about data. In other words, we can say that metadata is the
summarized data that leads us to the detailed data.
Q: What is a data cube?
A : Data cube helps us to represent the data in multiple dimensions. The data cube is defined by
dimensions and facts.
Q: Define dimension?
A : The dimensions are the entities with respect to which an enterprise keeps the records.
Q: Explain data mart.
A : Data mart contains the subset of organization-wide data. This subset of data is valuable to specific
groups of an organization. In other words, we can say that a data mart contains data specific to a
particular group.
Q: List the stages of the data warehouse delivery process.
A : The stages are IT strategy, Education, Business Case Analysis, technical Blueprint, Build the version,
History Load, Ad hoc query, Requirement Evolution, Automation, and Extending Scope.
Q: Define load manager.
A : A load manager performs the operations required to extract and load the data. The size and
complexity of a load manager varies between specific solutions, from one data warehouse to another.
Q: Define the functions of a load manager.
A : A load manager extracts data from the source system, fast-loads the extracted data into a temporary
data store, and performs simple transformations into a structure similar to the one in the data warehouse.
Q: Define a warehouse manager.
A : A warehouse manager is responsible for the warehouse management process. The warehouse
manager consists of third-party system software, C programs, and shell scripts. The size and complexity
of warehouse manager varies between specific solutions.
Q: Define the functions of a warehouse manager.
A : The warehouse manager performs consistency and referential integrity checks; creates the indexes,
business views, and partition views against the base data; transforms and merges the source data from
the temporary store into the published data warehouse; backs up the data in the data warehouse; and
archives the data that has reached the end of its captured life.
Q: What is Summary Information?
A : Summary Information is the area in data warehouse where the predefined aggregations are kept.
Q: What is the Query Manager responsible for?
A : Query Manager is responsible for directing the queries to the suitable tables.
Q: List the types of OLAP server
A : There are four types of OLAP servers, namely Relational OLAP, Multidimensional OLAP, Hybrid
OLAP, and Specialized SQL Servers.
Q: What functions does OLAP perform?
A : OLAP performs functions such as roll-up, drill-down, slice, dice, and pivot.
Q: How many dimensions are selected in a dice operation?
A : For dice operation two or more dimensions are selected for a given cube.
Q: Why do we partition a data warehouse?
A : Partitioning is done for various reasons such as easy management, to assist backup recovery, to
enhance performance.
Q: What kind of costs are involved in Data Marting?
A : Data Marting involves hardware & software cost, network access cost, and time cost.
1. What is Tableau?
Tableau is a business intelligence software that allows anyone to connect to respective data, and then
visualize and create interactive, shareable dashboards.
2. What are the different Tableau products?
(i)Tableau Desktop:
It is a self-service business analytics and data visualization tool that anyone can use. It translates
pictures of data into optimized queries. With Tableau Desktop, you can directly connect to data from
your data warehouse for live, up-to-date data analysis. You can also perform queries without writing a
single line of code. Import all your data into Tableau's data engine from multiple sources and integrate
it by combining multiple views in an interactive dashboard.
(ii)Tableau Server:
It is an enterprise-level Tableau product. You can publish dashboards with Tableau Desktop and
share them throughout the organization with web-based Tableau server. It leverages fast databases
through live connections.
(iii)Tableau Online:
This is a hosted version of Tableau Server which makes business intelligence faster and easier than
before. You can publish Tableau dashboards with Tableau Desktop and share them with colleagues.
(iv)Tableau Reader:
It’s a free desktop application that enables you to open and view visualizations that are built in Tableau
Desktop. You can filter, drill down data but you cannot edit or perform any kind of interactions.
(v)Tableau Public:
This is free Tableau software which you can use to make visualizations, but you need to save your
workbook or worksheets to Tableau Public's server, where they can be viewed by anyone.
Measures are the numeric metrics or measurable quantities of the data, which can be analyzed against
the dimension tables. Measures are stored in a table that contains foreign keys referring uniquely to the
associated dimension tables. The table supports data storage at the atomic level and thus allows a large
number of records to be inserted at one time. For instance, a Sales table can have a product key,
customer key, promotion key, and items sold, referring to a specific event.
Dimensions are the descriptive attribute values for multiple dimensions of each attribute, defining
multiple characteristics. A dimension table, having a reference to a product key from the fact table, can
consist of product name, product type, size, color, description, etc.
A .twb is an XML document which contains all the selections and layout you have made in
your Tableau workbook. It does not contain any data.
A .twbx is a ‘zipped’ archive containing a .twb and any external files such as extracts and
background images.
Tableau provides easy-to-use, best-in-class visual analytics capabilities but has nothing to do with the
data foundation or plumbing. But with an integration with a SQL server it can be a complete package.
On the other hand, traditional BI tools have the aforementioned capabilities, but then you have to deal
with a significant amount of upfront costs. The cost of consulting, software, and hardware is
comparatively quite high.
The joins in Tableau are the same as SQL joins.
9. What are the different connections you can make with your dataset?
We can either connect live to our data set or extract data onto Tableau.
Live: Connecting live to a data set leverages its computational processing and storage. New
queries will go to the database and will be reflected as new or updated within the data.
Extract: An extract will make a static snapshot of the data to be used by Tableau’s data engine.
The snapshot of the data can be refreshed on a recurring schedule as a whole or incrementally
append data. One way to set up these schedules is via the Tableau server.
The benefit of Tableau extract over live connection is that extract can be used anywhere without any
connection and you can build your own visualization without connecting to database.
Shelves are named areas to the left and top of the view. You build views by placing fields onto the shelves.
Some shelves are available only when you select certain mark types.
Sets are custom fields that define a subset of data based on some conditions. A set can be based on a
computed condition, for example, a set may contain customers with sales over a certain threshold.
Computed sets update as your data changes. Alternatively, a set can be based on specific data points in
your view.
A group is a combination of dimension members that make higher level categories. For example, if you
are working with a view that shows average test scores by major, you may want to group certain majors
together to create major categories.
A hierarchical field in Tableau is used for drilling down data. It means viewing your data at a more
granular level.
Tableau Server acts as a middleman between Tableau users and the data. Tableau Data Server allows you
to upload and share data extracts, preserve database connections, as well as reuse calculations and field
metadata. This means any changes you make to the data-set, calculated fields, parameters, aliases, or
definitions, can be saved and shared with others, allowing for a secure, centrally managed and
standardized dataset. Additionally, you can leverage your server’s resources to run queries on extracts
without having to first transfer them to your local machine.
Tableau Data Engine is a really cool feature in Tableau. It's an analytical database designed to achieve
instant query response and predictive performance, to integrate seamlessly into existing data
infrastructure, and it is not limited to loading entire data sets into memory.
If you work with a large amount of data, it does take some time to import, create indexes, and sort data
but after that everything speeds up. Tableau Data Engine is not really in-memory technology. The data is
stored in disk after it is imported and the RAM is hardly utilized.
16. What are the different filters in Tableau and how are they different from each other?
The different filters in Tableau are Quick, Context, and Normal/Traditional filters:
Normal Filter is used to restrict the data from database based on selected dimension or
measure. A Traditional Filter can be created by simply dragging a field onto the ‘Filters’ shelf.
Quick filter is used to view the filtering options and filter each worksheet on a dashboard while
changing the values dynamically (within the range defined) during the run time.
Context Filter is used to filter the data that is transferred to each individual worksheet. When a
worksheet queries the data source, it creates a temporary, flat table that it uses to compute the
chart. This temporary table includes all values that are not filtered out by either the Custom SQL
or the Context Filter.
17. How do you create a calculated field in Tableau?
Click the drop-down to the right of Dimensions on the Data pane and select "Create > Calculated
Field" to open the calculation editor.
Name the new field and create a formula.
18. What is dual axis?
Dual axis is an excellent feature supported by Tableau that helps users view two scales of two
measures in the same graph. Many websites like Indeed.com make use of dual axes to show the
comparison between two measures and their growth rate over a specific set of years. Dual axes let you
compare multiple measures at once, having two independent axes layered on top of one another.
19. What is the difference between a tree map and heat map?
A heat map can be used for comparing categories with color and size. With heat maps, you can compare
two different measures together.
A tree map also does the same except it is considered a very powerful visualization as it can be used for
illustrating hierarchical data and part-to-whole relationships.
The process of viewing numeric values or measures at higher and more summarized levels of the data is
called aggregation. When you place a measure on a shelf, Tableau automatically aggregates the data,
usually by summing it. You can easily determine the aggregation applied to a field because the function
always appears in front of the field’s name when it is placed on a shelf. For example, Sales becomes
SUM(Sales). You can aggregate measures using Tableau only for relational data sources.
Multidimensional data sources contain aggregated data only. In Tableau, multidimensional data sources
are supported only in Windows.
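The automatic aggregation described above, where Sales becomes SUM(Sales) per dimension member, can be sketched in JavaScript. The sample rows here are made up for illustration:

```javascript
// A sketch with made-up sample data: what Tableau's automatic
// aggregation does when a measure is placed on a shelf - Sales
// becomes SUM(Sales) for each member of the State dimension.
const rows = [
  { state: "CA", sales: 10 },
  { state: "CA", sales: 15 },
  { state: "NY", sales: 20 },
];

// Sum the measure per dimension member.
const sumSales = {};
for (const r of rows) {
  sumSales[r.state] = (sumSales[r.state] || 0) + r.sales;
}
```

Here `sumSales` ends up holding one aggregated value per state, which is the shape of data Tableau draws on a shelf.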
According to Tableau, Disaggregating your data allows you to view every row of the data source which
can be useful when you are analyzing measures that you may want to use both independently and
dependently in the view. For example, you may be analyzing the results from a product satisfaction
survey with the Age of participants along one axis. You can aggregate the Age field to determine the
average age of participants or disaggregate the data to determine at what age participants were most
satisfied with the product.
The term joining is used when you are combining data from the same source, for example,
worksheets in an Excel file or tables in an Oracle database, while blending requires two completely
defined data sources in your report.
Data extracts are the first copies or subdivisions of the actual data from original data sources. The
workbooks using data extracts instead of those using live DB connections are faster since the extracted
data is imported into the Tableau engine. After this extraction of data, users can publish the workbook, which
also publishes the extracts in Tableau Server. However, the workbook and extracts won’t refresh unless
users apply a scheduled refresh on the extract. Scheduled Refreshes are the scheduling tasks set for data
extract refresh so that they get refreshed automatically while publishing a workbook with data extract.
This also removes the burden of republishing the workbook every time the concerned data gets
updated.
Create a performance recording to record performance information about the main events as you
interact with a workbook. Users can view the performance metrics in a workbook created by
Tableau.
Help -> Settings and Performance -> Start Performance Recording
Help -> Setting and Performance -> Stop Performance Recording.
Reviewing the Tableau Desktop logs located at C:\Users\\My Documents\My Tableau
Repository. For a live connection to a data source, you can check the log.txt and tabprotosrv.txt files.
For an extract, check tdeserver.txt file.
Performance testing is again an important part of implementing Tableau. This can be done by load
testing Tableau Server with TabJolt, which is a "point and run" load generator created to perform QA.
While TabJolt is not supported by Tableau directly, it has to be installed using other open source
products.
Horizontal – Horizontal layout containers allow the designer to group worksheets and
dashboard components left to right across your page and edit the height of all elements at once.
Vertical – Vertical containers allow the user to group worksheets and dashboard components
top to bottom down your page and edit the width of all elements at once.
Text – All textual fields.
Image Extract – A Tableau workbook is in XML format. In order to extract images, Tableau
applies some code to extract an image which can be stored in XML.
Web [URL ACTION] – A URL action is a hyperlink that points to a Web page, file, or other web-
based resource outside of Tableau. You can use URL actions to link to more information about
your data that may be hosted outside of your data source. To make the link relevant to your
data, you can substitute field values of a selection into the URL as parameters.
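The field-value substitution a URL action performs can be sketched like this; the URL is hypothetical, not a real Tableau endpoint:

```javascript
// A sketch with a hypothetical URL: a URL action substitutes the
// selected field value into the link as a query parameter.
const selected = { state: "California" };

// encodeURIComponent keeps the substituted value URL-safe.
const url =
  "https://example.com/search?q=" + encodeURIComponent(selected.state);
```

When the user clicks the action, a link built this way opens with the current selection filled in.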
29. Mention whether you can create relational joins in Tableau without creating a new table?
Yes, one can create relational joins in Tableau without creating a new table.
In some cases, you can improve query performance by selecting the option to Assume Referential
Integrity from the Data menu. When you use this option, Tableau will include the joined table in the
query only if it is specifically referenced by fields in the view.
32. Explain when would you use Joins vs. Blending in Tableau?
If data resides in a single source, it is always desirable to use Joins. When your data is not in one place
blending is the most viable way to create a left join like the connection between your primary and
secondary data sources.
Data blending is the ability to bring data from multiple data sources into one Tableau view, without the
need for any special coding. A default blend is equivalent to a left outer join. However, by switching
which data source is primary, or by filtering nulls, it is possible to emulate left, right and inner joins.
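The left-outer-join behavior of a default blend can be sketched in JavaScript with made-up sample data:

```javascript
// A sketch with made-up sample data: a default Tableau blend behaves
// like a left outer join from the primary data source.
const primary = [
  { region: "East", sales: 100 },
  { region: "West", sales: 80 },
];
const secondary = [{ region: "East", target: 120 }];

// Every primary row is kept; rows with no match in the secondary
// source get null for the secondary measure.
const blended = primary.map((row) => {
  const match = secondary.find((s) => s.region === row.region);
  return { ...row, target: match ? match.target : null };
});
```

Swapping which array plays the role of `primary` is what "switching which data source is primary" does to the result.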
In Tableau, measures can share a single axis so that all the marks are shown in a single pane. Instead of
adding rows and columns to the view, when you blend measures there is a single row or column and all
of the values for each measure are shown along one continuous axis. We can blend multiple measures by
simply dragging one measure or axis and dropping it onto an existing axis.
A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey
information. You can create stories to show how facts are connected, provide context, demonstrate
how decisions relate to outcomes, or simply make a compelling case. Each individual sheet in a story is
called a story point.
There are two types of data roles in Tableau – discrete and continuous.
Discrete data roles are values that are counted as distinct and separate and can only take
individual values within a range. Examples: number of threads in a sheet, customer name or row
ID or State. Discrete values are shown as blue pills on the shelves and blue icons in the data
window.
Continuous data roles are used to measure continuous data and can take on any value within a
finite or infinite interval. Examples: unit price, time and profit or order quantity. Continuous
variables behave in a similar way in that they can take on any value. Continuous values are
shown as green pills.
There are many ways to create a story in Tableau. Each story point can be based on a different view or
dashboard, or the entire story can be based on the same visualization, just seen at different stages, with
different marks filtered and annotations added. You can use stories to make a business case or to simply
narrate a sequence of events.
By default, your story gets its title from its sheet name. To edit it, double-click the title. You can
also change your title’s font, color, and alignment. Click Apply to view your changes.
To start building your story, drag a sheet from the Story tab on the left and drop it into the
center of the view
To highlight a key takeaway for your viewers, drag a text object over to the story worksheet and
type your comment.
To further highlight the main idea of this story point, you can change a filter or sort on a field in
the view, then save your changes by clicking Update above the navigator box.
Tableau Drive is a methodology for scaling out self-service analytics. Drive is based on best practices
from successful enterprise deployments. The methodology relies on iterative, agile methods that are
faster and more effective than traditional long-cycle deployment.
A cornerstone of this approach is a new model of partnership between business and IT.
By adding the same calculation to ‘Group By’ clause in SQL query or creating a Calculated Field in the
Data Window and using that field whenever you want to group the fields.
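The idea of grouping by the same calculation can be sketched with made-up data, where a computed key plays the role of the calculated field:

```javascript
// A sketch with made-up sample data: grouping rows by a computed key,
// analogous to putting the same calculation in a SQL GROUP BY clause
// or using a calculated field for grouping in Tableau.
const orders = [
  { product: "A", sales: 40 },
  { product: "B", sales: 120 },
  { product: "C", sales: 90 },
];

// The computed key (a size band derived from sales) is the
// "calculated field" used for grouping.
const band = (o) => (o.sales >= 100 ? "large" : "small");

const totalsByBand = orders.reduce((acc, o) => {
  acc[band(o)] = (acc[band(o)] || 0) + o.sales;
  return acc;
}, {});
```

Because the key is a calculation rather than a raw column, the same grouping can be reused anywhere the field is referenced.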
40. Mention what is the difference between published data sources and embedded data sources in
Tableau?
The difference between published data source and embedded data source is that,
Published data source: It contains connection information that is independent of any workbook
and can be used by multiple workbooks.
Embedded data source: It contains connection information and is associated with a workbook.
You can embed interactive Tableau views and dashboards into web pages, blogs, wiki pages, web
applications, and intranet portals. Embedded views update as the underlying data changes, or as their
workbooks are updated on Tableau Server. Embedded views follow the same licensing and permission
restrictions used on Tableau Server. That is, to see a Tableau view that’s embedded in a web page, the
person accessing the view must also have an account on Tableau Server.
Alternatively, if your organization uses a core-based license on Tableau Server, a Guest account is
available. This allows people in your organization to view and interact with Tableau views embedded in
web pages without having to sign in to the server. Contact your server or site administrator to find out if
the Guest user is enabled for the site you publish to.
You can do the following to embed views and adjust their default appearance:
Get the embed code provided with a view:The Share button at the top of each view includes
embed code that you can copy and paste into your webpage. (The Share button doesn’t appear
in embedded views if you change the showShareOptions parameter to false in the code.)
Customize the embed code: You can customize the embed code using parameters that control
the toolbar, tabs, and more. For more information, see Parameters for Embed Code.
Use the Tableau JavaScript API: Web developers can use Tableau JavaScript objects in web
applications. To get access to the API, documentation, code examples, and the Tableau
developer community, see the Tableau Developer Portal.
43. Design a view in a map such that if user selects any state, the cities under that state has to show
profit and sales.
According to your question you must have state, city, profit and sales fields in your dataset.
Step 6: Right click on state field and select show quick filter.
44. Suppose I am using Tableau Desktop and have a live connection to Cloudera Hadoop data. I need to
press F5 to refresh the visualization. Is there any way to automatically refresh the visualization every 'x'
seconds instead of pressing F5?
All you need to do is replace the API src and server URL with yours in the page below.
<!DOCTYPE html>
<html lang="en">
<head>
<title>Tableau JavaScript API</title>
<script type="text/javascript" src="http://servername/javascripts/api/tableau_v8.js"></script>
</head>
<body>
<div id="tableauViz"></div>
<script type="text/javascript">
// Point the placeholder div at the published view.
var placeholderDiv = document.getElementById("tableauViz");
var url = "http://servername/t/311/views/Mayorscreenv5/Mayorscreenv2";
var options = {
    hideTabs: true,
    width: "100%",
    height: "1000px"
};
var viz = new tableauSoftware.Viz(placeholderDiv, url, options);
// Refresh the underlying data every 5000 ms (5 seconds).
setInterval(function() { viz.refreshDataAsync(); }, 5000);
</script>
</body>
</html>
45. Suppose my license expires today, will users be able to view dashboards or workbooks which I
published in the server earlier?
If your server license expires today, your username on the server will have the role 'unlicensed', which
means you cannot access the server, but others still can. The site admin can change the ownership to
another person so that the extracts do not fail.
46. Is Tableau software good for a strategic acquisition?
Yes, for sure! It gives you data insight to an extent that other tools can't. Moreover, it also helps you to
plan and pinpoint anomalies and improve your process for the betterment of your company.
47. Can we place an Excel file in a shared location and use it to develop a report and refresh it at
regular intervals?
Yes, we can do it. But for better performance we should use Extract.
48. Can Tableau Desktop be installed on a Mac?
Yes, Tableau Desktop can be installed on both Mac and Windows operating systems.
49. What is the maximum number of rows Tableau can utilize at one time?
Tableau is not restricted by the number of rows in the table. Customers use Tableau to access petabytes of
data because it only retrieves the rows and columns needed to answer your questions.
50. When publishing workbooks on Tableau Online, sometimes an error about needing to extract
appears. Why does it happen occasionally?
This happens when a user is trying to publish a workbook that is connected to an internal server or a file
stored on a local drive, such as a SQL server that is within a company's network.
When you add a table calculation, you must use all dimensions in the level of detail either for
partitioning (scoping) or for addressing (direction):
The dimensions that define how to group the calculation, that is, define the scope of data it is
performed on, are called partitioning fields. The table calculation is performed separately within each
partition.
The remaining dimensions, upon which the table calculation is performed, are called addressing fields,
and determine the direction of the calculation.
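The split between partitioning and addressing fields can be sketched with made-up data, using a running-total table calculation:

```javascript
// A sketch with made-up sample data: a running-total table calculation.
// "region" acts as the partitioning field (the total restarts for each
// region), and the row order within each region is the addressing
// direction along which the calculation runs.
const rows = [
  { region: "East", month: 1, sales: 10 },
  { region: "East", month: 2, sales: 20 },
  { region: "West", month: 1, sales: 5 },
  { region: "West", month: 2, sales: 7 },
];

const totals = {};
const withRunning = rows.map((r) => {
  // Accumulate separately within each partition.
  totals[r.region] = (totals[r.region] || 0) + r.sales;
  return { ...r, runningSales: totals[r.region] };
});
```

Changing which dimension keys the `totals` object is exactly the choice Tableau offers when you move a field between partitioning and addressing.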
2. What is the difference between sets and groups?
Sets: 1) A set is a grouping based on some condition. 2) Calculated fields can use sets.
Groups: 1) A group simply combines dimension members into higher-level categories. 2) Groups cannot be referenced in calculations.
No
Joining is a SQL term that refers to combining two data sources into a single data source. Blending is a
Tableau term that refers to combining two data sources into a single chart. The main difference
between them is that a join is done once at the data source and used for every chart, while a blend is
done individually for each chart.
Order of execution
1. LOD
2. Table Calculations
3. Reference lines
4. Can we draw 3 reference lines in a single chart?
No
The context filter is not frequently changed by the user – if the filter is changed, the database must
recompute and rewrite the temporary table, slowing performance.
No
11. How can we combine database and flat file data in Tableau Desktop?
You can combine them by connecting to the data twice, once for the database tables and once for the
flat file. Then go to Data -> Edit Relationships and give a join condition on a common column from the
database tables to the flat file.
Fact table consists of the measurements, metrics or facts of a business process. It is located at the
center of a star schema or a snowflake schema surrounded by dimension tables.
Steps to automate the reports: while publishing the report to Tableau server, you will find the option to
schedule reports. Click on this to select the time when you want to refresh the data.
14. Do parameters have dropdown lists?
Yes, parameters do have their independent dropdown lists enabling users to view the data entries
available in the parameter during its creation.
The Pages shelf lets you break a view into a series of pages so you can better analyze how a specific field
affects the rest of the data in a view.
Tree maps – Display data in nested rectangles. We use dimensions to define the structure of the tree
map and measures to define the size or color of the individual rectangles. We cannot add trend lines in
tree maps.
Scatter plot – provides an easy way to visualize relationships between numerical variables. We can add
trend lines.
Edit the quick filter from the pull-down arrow. Go to “Customize” and uncheck the “Show “All” Value”
checkbox.
By adding the same calculation to ‘Group By’ clause in SQL query or creating a Calculated Field in the
Data Window and using that field whenever you want to group the fields.
• Using groups in a calculation. You cannot reference ad-hoc groups in a calculation
• Blend data using groups created in the secondary data source: Only calculated groups can be used in
data blending if the group was created in the secondary data source.
• Use a group in another workbook. You can easily replicate a group in another workbook by copy and
pasting a calculation.
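The last point, replicating a group as a calculation, can be sketched with hypothetical majors (reusing the test-scores example from earlier in this document):

```javascript
// A sketch with hypothetical majors: replicating an ad-hoc group as a
// calculation. Unlike the group itself, this calculated mapping can be
// referenced in other calculations or copied into another workbook.
const majorCategory = (major) =>
  ["Physics", "Chemistry", "Biology"].includes(major)
    ? "Science"
    : "Other";
```

In Tableau the same mapping would be written as an IF or CASE calculated field rather than JavaScript; the point is that the grouping lives in a reusable formula.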
What is Tableau?
Tableau is business intelligence software that allows anyone to connect to their data, and then
visualize and create interactive, shareable dashboards.
What is a data Source page?
A page where you can set up your data source. The Data Source page generally consists of four main
areas: left pane, join area, preview area, and metadata area.
What is an extract in Tableau?
A saved subset of a data source that you can use to improve performance and analyze offline.
What is the Format pane in Tableau?
A pane that contains formatting settings that control the entire worksheet, as well as individual fields in
the view.
What is LOD expression in Tableau?
A syntax that supports aggregation at dimensionalities other than the view level. With level of detail
expressions, you can attach one or more dimensions to any aggregate expression.
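The idea behind a FIXED level-of-detail expression can be sketched with made-up data: an aggregate is computed at a dimensionality other than the view's, then attached to every detailed row.

```javascript
// A sketch with made-up sample data: an aggregate fixed at the region
// level, regardless of the other fields (here, product) in the view -
// the idea behind a FIXED level-of-detail expression.
const rows = [
  { region: "East", product: "A", sales: 10 },
  { region: "East", product: "B", sales: 20 },
  { region: "West", product: "A", sales: 5 },
];

// First, aggregate at the fixed dimensionality (region).
const fixedRegionSales = {};
for (const r of rows) {
  fixedRegionSales[r.region] = (fixedRegionSales[r.region] || 0) + r.sales;
}

// Then attach the region-level total to every detailed row.
const annotated = rows.map((r) => ({
  ...r,
  regionSales: fixedRegionSales[r.region],
}));
```

Each row now carries both its own detail and the coarser aggregate, which is what lets LOD expressions mix dimensionalities in one view.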
What is the difference between Quick Filter and Normal filter?
Normal Filter is used to restrict the data from database based on selected dimension or measure. But
Quick Filters are used to give a chance to user for dynamically changing data members at run time.
What is Tableau Reader?
Tableau Reader is a free viewing application that lets anyone read and interact with packaged
workbooks created by Tableau Desktop.
Can we have multiple value selection in parameter?
No
Which join is used in data blending?
There won't be any joins as such; we just give the column references, like a primary and foreign
key relation.
What is required to blend data sources?
There should be a common dimension to blend the data sources into a single worksheet.
What is a Dimension?
Tableau treats any field containing qualitative, categorical information as a dimension. This includes
any field with text or date values.
What is a Measure?
A measure is a field that is dependent on the value of one or more dimensions. Tableau treats any field
containing numeric (quantitative) information as a measure.
What does the extension .twbx represent in Tableau?
It is a file which represents a Tableau Packaged Workbook, in which the .twb file is grouped together
with its data sources.
What is the Marks card in Tableau?
A card to the left of the view where you can drag fields to control mark properties such as type, color,
size, shape, label, tooltip, and detail.
What are shelves in Tableau?
They are Named areas to the left and top of the view. You build views by placing fields onto the
shelves. Some shelves are available only when you select certain mark types.
What is a Tableau workbook?
It is a file with a .twb extension that contains one or more worksheets (and possibly also dashboards
and stories).
In Tableau what is a worksheet?
A sheet where you build views of your data by dragging fields onto shelves.
What is a context filter?
In a context filter the filter condition is applied first to the data source and then some other filters are
applied only to the resulting records.
What is Dual Axis?
You can compare multiple measures using dual axes, which are two independent axes that are layered
on top of each other.
What is a page shelf in Tableau?
The Pages shelf is used to control the display of output by choosing the sequence of display.
What are quick table calculations in Tableau?
These are inbuilt calculations in Tableau which we normally use, for example, to calculate percentage
changes.
What is data blending?
Data blending is used to blend data from multiple data sources on a single worksheet. The data is
joined on common dimensions.
Can we have multiple value selection in parameter?
No
What does the Extract option do when connecting to data?
It imports the entire data source into Tableau's fast data engine as an extract and saves it in the
workbook.
What are parameters and when do you use it?
Parameters are dynamic values that can replace constant values in calculations.
What is TDE file in Tableau?
It refers to the file that contains data extracted from external sources like MS Excel, MS Access or CSV
file.
What is a story in Tableau?
A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey
information.
What is a Published data source?
It contains connection information that is independent of any workbook and can be used by multiple
workbooks.
When do we use Joins versus Blending?
If data resides in a single source, we use joins; but when your data is not in one place, blending is used.
How to automate reports using Tableau software?
You need to publish the report to Tableau Server; while publishing you will find an option to schedule
reports. You just need to select the time when you want to refresh the data.
What is Tableau Show Me?
Show Me is used to apply a required view to the existing data in the worksheet. Those views can be a
pie chart, scatter plot or a line chart.
What is a Tableau data pane?
A pane on the left side of the workbook that displays the fields of the data sources to which Tableau is
connected.
What is a calculated field in Tableau?
A new field that you create by using a formula to modify the existing fields in your data source.
What is crosstab chart?
It is a text table view. Use text tables to display the numbers associated with dimension members.
How to check the metadata of a table?
In the menu Data -> New connection, drag the table to the data pane to view its metadata.
How to create a column Alias?
In the menu Data -> New connection, open the table metadata and click on the column name to create
an alias.
What does the REPLACE function do in Tableau?
The REPLACE function searches a given string for a substring and replaces it with a replacement string.
Which function returns the number of items in a group?
COUNT. The COUNT function returns the number of items (non-null values) in a group.
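Both behaviors are easy to mirror in Python. This is a rough sketch of the semantics only, not Tableau's implementation; the helper names simply echo the Tableau function names.

```python
def replace(string, substring, replacement):
    # REPLACE("Version8.5", "8.5", "9.0") yields "Version9.0"
    return string.replace(substring, replacement)

def count(values):
    # COUNT returns the number of non-null items in the group
    return sum(1 for v in values if v is not None)

print(replace("Version8.5", "8.5", "9.0"))  # Version9.0
print(count([10, None, 30, 40]))            # 3
```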
What is a Gantt Chart?
A Gantt chart shows the progress of the value of a task or resource over a period of time, so for a
Gantt chart a time dimension is an essential field.
What is Forecasting in Tableau?
Forecasting is about predicting the future value of a measure. There are many mathematical models for
forecasting; Tableau uses the model known as exponential smoothing.
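The idea behind exponential smoothing can be sketched in a few lines of Python. Note this is only the simplest member of the family: Tableau fits the smoothing parameters and seasonality automatically, whereas here alpha is fixed by hand.

```python
def simple_exponential_smoothing(series, alpha=0.5):
    """Each new level is a weighted blend of the latest observation and
    the previous level; the final level serves as the one-step forecast.
    (A sketch only; Tableau chooses alpha and seasonality itself.)"""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

print(simple_exponential_smoothing([10, 12, 14, 16], alpha=0.5))  # 14.25
```

Recent observations dominate the forecast because older values are discounted geometrically, which is why the method reacts to trends without being thrown off by old history.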
What is a Trendline in tableau?
Trend lines are used to predict the continuation of a certain trend of a variable. They also help to identify
the correlation between two variables by observing the trend in both of them simultaneously.
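The correlation a trend line hints at can also be computed directly. A minimal Pearson coefficient in Python (the sample points are made up):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient: +1 and -1 indicate perfectly
    linear trends, 0 indicates no linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 for a perfectly linear trend
```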
9) Why Tableau?
Whether your data is in an on-premise database, a database, a data warehouse, a cloud application or
an Excel file, you can analyze it with Tableau. You can create views of your data and share it with
colleagues, customers, and partners. You can use Tableau to blend it with other data. And you can keep
your data up to date automatically.
11) What are the differences between Tableau Software GoodData and Traditional BI (Business
Objects, etc.)?
At a high level there are four major differences: speed, analysis layer, data layer, and enterprise readiness.
13) What is the difference between heat map and tree map?
A heat map is a great way to compare categories using color and size. In this, you can compare two
different measures. Tree map is a very powerful visualization, particularly for illustrating hierarchical
(tree – structured) data and part – to – whole relationships.
15) How will you publish and schedule workbook in tableau server?
First create a schedule for a particular time, then create an extract for the data source and publish the
workbook to the server. Before you publish, there is an option called Scheduling and Authentication;
click on that, select the schedule you created from the drop-down, and publish. Also publish the data
source and assign the schedule. This schedule will automatically run at the assigned time and the
workbook is refreshed.
22) What is benefit of Tableau extract file over the live connection?
Extract can be used anywhere without any connection and you can build your own visualizations
without connecting to Database.
23) How to combine two excel files with same fields but different data (different years)?
I have 5 different excel files (2007.xls, 2008.xls..2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). How can I combine the film name,
genre and profitability so that I can see the visualization of 2007 to 2011 in a single chart?
27) How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file. Then use Data -> Edit
Relationships to give a join condition on a common column from the db tables to the flat file.
30) How to create cascading filters without context filter ?
I have filter1 and filter2. Based on filter1 I need to filter filter2's data.
Ex: Filter1 is Country and Filter2 is States.
I have chosen country as USA and filter2 should display only USA states.
Choose the options of Filter2 (states):
select the option "Only relevant values".
34) What are the differences between Tableau desktop and Tableau Server?
While Tableau desktop performs data visualization and workbook creation, Tableau server is used to
distribute these interactive workbooks and/or reports to the right audience. Users can edit and update
the workbooks and dashboards online on the server but cannot create new ones. However, there are
limited editing options when compared to Desktop.
Tableau Public is again a free tool consisting of Desktop and Server components accessible to anyone.
Creating one or more context filters improves performance as users do not have to create extra filters
on large data source, reducing the query-execution time.
You can create one by dragging a field into the 'Filters' tab and then right-clicking that field and selecting
"Add to Context".
43) What are the limitations of context filters?
Tableau takes time to place a filter in context. When a filter is set as a context filter, the software creates
a temporary table for that particular context filter. This table will reload each time and consists of all
values that are not filtered by either the Context or Custom SQL filter.
• Horizontal- Horizontal containers allow the user to group worksheets and dashboard components left to right across your page and edit the height of all elements at once.
• Vertical- Vertical containers allow the user to group worksheets and dashboard components top to
bottom down your page and edit the width of all elements at once.
• Text
• Image Extract: A Tableau workbook is in XML format. In order to extract images, Tableau applies
codes to extract an image which can be stored in XML.
• Web [URL ACTION]:- A URL action is a hyperlink that points to a Web page, file, or other web-based
resource outside of Tableau. You can use URL actions to link to more information about your data that
may be hosted outside of your data source. To make the link relevant to your data, you can substitute
field values of a selection into the URL as parameters.
• Use a group in another workbook. You can easily replicate a group in another workbook by copy and
pasting a calculation.
Q3. What are some of the new features introduced in Tableau 9.1?
Ans.
Visual analytics
Mobile
Data
Enterprise
Q4. Can you create relational joins in Tableau without creating a new table?
Ans. Yes, you can create relational joins without creating a new table.
Q7. What are parameters in Tableau?
Ans. They are dynamic values that can replace constant values in calculations, reference lines and filters.
Q8. Mention whether you can have multiple value selection in parameter?
Ans. No
Web
Text
Image Extract
Horizontal
Vertical
Tableau Desktop
Tableau Reader
Tableau Public
Tableau Server
Use extracts
Limit the amount of data you bring in – both rows and columns
Switch data source using the “extract function”
Pre-aggregate your data before bringing it into Tableau.
Left
Right
Inner
Full outer
Q. What is the current latest version of Tableau Desktop(as of Sep, 25th 2017)?
Current versions: Tableau Desktop version 10.4
Q. Why tableau?
Whether your data is in an on-premise database, a database, a data warehouse, a cloud application or
an Excel file, you can analyze it with Tableau. You can create views of your data and share it with
colleagues, customers, and partners. You can use Tableau to blend it with other data. And you can keep
your data up to date automatically.
Q. What are Filters? How many types of filters are there in Tableau?
A filter restricts unnecessary data and shows only the exact data needed. Basically, filters are of 3 types:
1. Quick filter
2. Context filter
3. Datasource filter
Q. Can we use non-used columns (columns which are not used in reports but which the data source
has) in Tableau Filters?
Yes!
Ex. In data source I have column like
empID, EmpName, EmpDept,EmpDsignation, EmpSalary
In reports I am using EmpName on columns and EmpSalary on rows.
I can use EmpDesignation on Filters.
Q. How to combine two excel files with same fields but different data (different years)?
I have 5 different excel files (2007.xls, 2008.xls..2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). How can I combine the film name,
genre and profitability so that I can see the visualization of 2007 to 2011 in a single chart?
1. Inner Join: An inner join loads only the matching records from both tables. Inner join condition:
TableA.id = TableB.id
2. Outer Join:
The outer join is divided into 3 types:
a) Left Outer Join
b) Right Outer Join
c) Full Outer Join
Left outer join: displays the complete data from the left table + the matching records from the right.
Condition: tablea.id = tableb.id(+)
Right outer join: displays the complete data from the right table + the matching records from the left.
Condition: tablea.id(+) = tableb.id
Full outer join: loads the complete data from both the left table and the right table. Condition: TableA
FULL OUTER JOIN TableB ON tablea.id = tableb.id
3. Self-Join: if we perform a join of a table to itself, such a join is called a self-join.
4. Non-Equi Join: if the join condition uses operators other than equality ("="), such joins are called
non-equi joins.
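As a runnable sketch of the inner and left outer joins described above, using SQLite (the table and column names are made up for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a(id INTEGER, name TEXT);
    CREATE TABLE b(id INTEGER, dept TEXT);
    INSERT INTO a VALUES (1, 'ANN'), (2, 'BOB'), (3, 'CAT');
    INSERT INTO b VALUES (1, 'HR'), (2, 'IT'), (4, 'QA');
""")

# Inner join: only ids present in both tables survive.
inner = con.execute(
    "SELECT a.name, b.dept FROM a JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()
print(inner)  # [('ANN', 'HR'), ('BOB', 'IT')]

# Left outer join: every row of a, with NULL where b has no match.
left = con.execute(
    "SELECT a.name, b.dept FROM a LEFT JOIN b ON a.id = b.id ORDER BY a.id"
).fetchall()
print(left)   # [('ANN', 'HR'), ('BOB', 'IT'), ('CAT', None)]
```

A right outer join is the mirror image (every row of b), and a full outer join is the union of the left and right results.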
Q. Can we place an excel file in a shared location and use it to develop a report and refresh it in
regular intervals?
Yes, you can do it… but for better performance use an extract.
Q. What is the difference between twb and twbx file extensions? Please explain.
Twb is a live connection; it points to the data source, and the user receiving the twb needs permission to
that data source since no data is included. .twbx takes data offline, storing the data as a packaged
(zip-like) file, thereby eliminating the need for permissions from the end user; it is a snapshot in time of
the data as of the time it was saved as .twbx.
Q. Can you get values from two different sources as a single input into parameter?
No you cannot. Each data source corresponds to a Tableau workbook. If you include both data variables
in the same data source you can input them in the same workbook.
Q. What are the similarities and differences between Tableau software and Palantir?
Palantir and Tableau are very different. Palantir has its roots in large data computer science problems
involving security, payments, fraud detection and the likes. Customers/Investors include Paypal, CIA and
others.
Tableau is a visualization player, with roots in Stanford University research. Its Visual Query Language
(VizQL) allows users to build visualizations on top of standard data warehouses or spreadsheets.
Q. Design a view to show region-wise profit and sales. I did not want line or bar charts to be used for
profit and sales. How will you design it? Please explain.
Generate the Map using cities –>then Drag the Profit and sales to the Details–>Add the state as Quick
filter
Q. Design a view in a map such that if user selects any state the cities under that state has to show
profit and sales.
If you want to show the sales and profit in each and every city under the states in the same worksheet,
you should have State, City, Sales and Profit fields in your dataset.
1. Double click on the State field.
2. Drag the City and drop it into the Marks card (under the State field).
Q. How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file. Then use Data -> Edit
Relationships to give a join condition on a common column from the db tables to the flat file.
Q. What are the major differences between tableau version 7.0 and tableau version 8.0?
1. New visualizations are introduced like treemap, bubble chart and box and whisker plot
2. We can copy worksheet directly from one workbook to another workbook
3. Introduced R script
Q. Suppose my license expires today. Will users be able to view the dashboards or workbooks which I
published to the server earlier?
If your server license expires today, your user name on the server will have the role ‘unlicensed’ which
means you cannot access, but others can. The Site Admin can ‘Change Ownership’ to another person, so
extracts if enabled do not fail.
Q. Think that I am using Tableau desktop and have a live connection to Cloud era hadoop data. I need
to press F5 to refresh the visualization. Is there any way to automatically refresh the visualization
every x minutes instead of pressing F5 every time?
Here is the example of refreshing dashboard in every 3 seconds, Replace api src and server url with
yours. The interval below is for 3 seconds.
Q. What are the differences between Tableau Software, GoodData and Traditional BI (Business
Objects, etc.)?
You could talk feature-functionality for days, but at a high level there are four major differences.
1. Speed: How fast can you get up and running with the system, answer questions, design and share
dashboards and then change them? This is where systems like Tableau and GoodData are far better
than old-school business intelligence like Business Objects or Cognos. Traditional systems took months
or years to implement, with costs running to millions. Tableau has a free trial that installs in minutes and
GoodData is cloud-based, so they are faster to implement by orders of magnitude. They are also faster
to results: traditional BI requires IT and developers to make any changes to reports, so business users
are stuck in a queue waiting to get anything done. Tableau and GoodData provide more of a self-service
experience.
2. Analysis layer: This is where Tableau excels. It has a powerful and flexible drag & drop visualization
engine based on some technology from Stanford. GoodData and traditional BI typically provide some
canned reports but changing them requires significant time and money.
3. Data layer: This is where the three options are most different:
GoodData requires you to move your data to its cloud. Traditional BI typically requires you to move your
data to its data warehouse system. Tableau connects to a variety of existing data sources and also
provides a fast in – memory data engine, essentially a local database. Since most enterprises have their
data stored all over the place, this provides the most choice and lets companies use the investment
they’ve already made.
4. Enterprise readiness: Traditional BI and Tableau do well here, with enterprise – level security and
high scalability.
Q. What is the Difference between quick filter and Normal filter in tableau?
A quick filter is used to view the filtering options and to select an option directly in the view. A normal
filter is something where you can limit the options from the list or use some conditions to limit the data
by field or value.
The company was founded in Mountain View, California in January, 2003 by Chris Stolte, Christian
Chabot and Pat Hanrahan.
Tableau is business intelligence software that allows anyone to easily connect to data, then visualize and
create interactive, shareable dashboards. It’s easy enough that any Excel user can learn it, but powerful
enough to satisfy even the most complex analytical problems. Securely sharing your findings with others
only takes seconds.
Tableau offers five main products: Tableau Desktop, Tableau Server, Tableau Online, Tableau Reader and
Tableau Public.
Data visualization refers to the techniques used to communicate data or information by encoding it as
visual objects (e.g. points, lines or bars) contained in graphics.
Tableau Desktop is based on breakthrough technology from Stanford University that lets you drag &
drop to analyze data. It is a great data visualization tool; you can connect to data in a few clicks, then
visualize and create interactive dashboards with a few more.
Tableau Server is browser- and mobile-based insight anyone can use. Publish dashboards with Tableau
Desktop and share them throughout your organization. It’s easy to set up and even easier to run.
Tableau Public is a free service that lets anyone publish interactive data to the web. Once on the web,
anyone can interact with the data, download it, or create their own visualizations of it. No programming
skills are required. Be sure to look at the gallery to see some of the things people have been doing with
it.
Tableau performance is based on data source performance. If the data source takes more time to
execute a query, then Tableau must wait for that time.
What are dimensions and facts?
Dimensions are the descriptive text columns and facts are measures (numerical values). Dimension ex:
Product Name, City. Facts: Sales, Profit.
By adding the same calculation to the 'Group By' clause in the SQL query, or by creating a Calculated
Field in the Data Window and using that field whenever you want to group the fields.
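The first option can be tried directly with SQLite; the calculation appears in both the SELECT and the GROUP BY clause. Table and column names here are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(city TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('NY', 100), ('ny', 50), ('LA', 70);
""")

# The same calculation (UPPER(city)) is used in SELECT and GROUP BY,
# so 'NY' and 'ny' fall into one group.
rows = con.execute("""
    SELECT UPPER(city) AS city_group, SUM(amount)
    FROM orders
    GROUP BY UPPER(city)
    ORDER BY city_group
""").fetchall()
print(rows)  # [('LA', 70), ('NY', 150)]
```

A calculated field in Tableau plays the same role as the `UPPER(city)` expression: group on the computed value rather than the raw column.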
Blend data using groups created in the secondary data source: Only calculated groups can be used in
data blending if the group was created in the secondary data source.
The Tableau Desktop Log files are located in C:\Users\\My Documents\My Tableau Repository. If you
have a live connection to the data source, check the log.txt and tabprotosrv.txt files. If you are using an
extract, check the tdeserver.txt file. The tabprotosrv.txt file often shows detailed information about
queries.
While Tableau lets you analyze databases and spreadsheets like never before, you don’t need to know
anything about databases to use Tableau. In fact, Tableau is designed to allow business people with no
technical training to analyze their data efficiently. Tableau is based on three simple concepts:
Connect: Connect Tableau to your data source. Note that Tableau does not import the data; instead it
queries the database directly.
Analyze: Analyzing data means viewing it, filtering it, sorting it, performing calculations on it,
reorganizing it, summarizing it, and so on. Using Tableau you can do all of these things by simply
arranging fields of your data source on a Tableau worksheet. When you drop a field on a worksheet,
Tableau queries the data using standard drivers and query languages (like SQL and MDX) and presents a
visual analysis of the data.
Share: You can share results with others either by sharing workbooks with other Tableau users, by
pasting results into applications such as Microsoft Office, printing to PDF or by using Tableau Server to
publish or embed your views across your organization.
What are the difference between tableau 7.0 and 8.0 versions?
New visualizations are introduced like tree map, bubble chart and box-and-whisker plot
Introduced R script
With Kerberos support, Tableau 8.3 advances enterprise-grade data analysis with these enhancements:
Provides seamless, single sign-on experience from Tableau client to back-end data sources
Protects sensitive data with delegated access and viewer credential management
Explain the relationship difference between Tableau Workbook, Story, Dashboard, and Worksheets.
Tableau uses a WORKBOOK and SHEET file structure, much like Microsoft Excel. A WORKBOOK contains
SHEETS, which can be a WORKSHEET , a DASHBOARD , or a STORY .
A WORKSHEET contains a single view along with shelves, legends, and the Data pane.
A STORY contains a sequence of worksheets or dashboards that work together to convey information.
Parameters are dynamic values that can replace constant values in calculations and can serve as filters
What are Filters? How many types of filters are there in Tableau?
A filter restricts unnecessary data and shows only the exact data needed. Basically, filters are of 3 types:
Quick filter
Context filter
Datasource filter
Whenever we create a context filter, Tableau will create a temporary table for this particular filter set,
and the other filters will be applied on the context filter data, like cascading parameters. Suppose we
have created a context filter on countries and have chosen USA and India: Tableau will create a
temporary table for these two countries' data, and any other filters will be applied on this two-country
data. If we don't have any context filter, each individual record is checked against all filters.
The context filter should not be frequently changed by the user: if the filter is changed, the database
must recompute and rewrite the temporary table, slowing performance.
When you set a dimension to context, Tableau creates a temporary table that will require a reload each
time the view is initiated. For Excel, Access and text data sources, the temporary table created is in an
Access table format. For SQL Server, MySQL and Oracle data sources, you must have permission to
create a temporary table on your server. For multidimensional data sources, or cubes, temporary tables
are not created, and context filters only define which filters are independent and dependent.
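The ordering described above can be sketched in Python: the context filter runs first and materializes a smaller "temporary table", and every other filter then scans only that subset. The row values are made up for the demo.

```python
rows = [
    {"country": "USA",   "state": "CA", "sales": 100},
    {"country": "USA",   "state": "NY", "sales": 40},
    {"country": "India", "state": "KA", "sales": 70},
    {"country": "UK",    "state": "LN", "sales": 90},
]

# Context filter: materialize the subset once (the "temporary table").
context = [r for r in rows if r["country"] in ("USA", "India")]

# Remaining filters only scan the smaller subset.
filtered = [r for r in context if r["sales"] > 50]

print([r["state"] for r in filtered])  # ['CA', 'KA']
```

Without the context filter, every filter would scan all four rows; with it, the sales filter sees only three.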
What is the Difference between quick filter and Normal filter in tableau?
A quick filter is used to view the filtering options and to select an option directly in the view. A normal
filter is something where you can limit the options from the list or use some conditions to limit the data
by field or value.
How to combine two excel files with same fields but different data (different years)?
I have 5 different excel files (2007.xls, 2008.xls..2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). How can I combine the film name,
genre and profitability so that I can see the visualization of 2007 to 2011 in a single chart?
We can join a maximum of 32 tables; it is not possible to combine more than 32 tables.
R is a popular open-source environment for statistical analysis. Tableau Desktop can now connect to R
through calculated fields and take advantage of R functions, libraries, and packages and even saved
models. These calculations dynamically invoke the R engine and pass values to R via the Rserve package,
and are returned back to Tableau.
Tableau Server can also be configured to connect to an instance of Rserve through the tabadmin utility,
allowing anyone to view a dashboard containing R functionality.
Combining R with Tableau gives you the ability to bring deep statistical analysis into a drag-and-drop
visual analytics environment.
The Pages shelf is a powerful part of Tableau that you can use to control the display of output as well as
the printed results of the output.
The difference lies in the application. Parameters allow users to insert their own values, which can be
integers, floats, dates or strings, and these can be used in calculations. However, filters receive only
values users choose to 'filter by' from the list, which cannot be used to perform calculations. Users can
dynamically change measures and dimensions with a parameter, but filters do not support this feature.
How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file. Then use Data -> Edit
Relationships to give a join condition on a common column from the db tables to the flat file.
The concept of context filter in Tableau makes the process of filtering smooth and straightforward. It
establishes a filtering hierarchy where all other filters present refer to the context filter for their
subsequent operations. The other filters now process data that has been passed through the context
filter.
Creating one or more context filters improves performance as users do not have to create extra filters
on large data source, reducing the query-execution time.
You can create one by dragging a field into the 'Filters' tab and then right-clicking that field and selecting
"Add to Context".
Add a custom color code. Note: In Tableau 9.0 we have a color picker option.
TDE is a Tableau desktop file that contains a .tde extension. It refers to the file that contains data
extracted from external sources like MS Excel, MS Access or CSV file.
There are two aspects of TDE design that make them ideal for supporting analytics and data discovery.
One is how they are structured, which impacts how they are loaded into memory and used by Tableau.
This is an important aspect of how TDEs are "architecture aware": architecture-awareness means that
TDEs use all parts of your computer memory, from RAM to hard disk, putting each part to work on what
best fits its characteristics.
How to design a view to show region-wise profit and sales, if line and bar charts should not be used for
profit and sales?
Generate the Map using cities –>then Drag the Profit and sales to the Details–>Add the state as Quick
filter
Multiple Measures are shown in single axis and also all the marks shown in single pane
Http://onlinehelp.tableau.com/current/pro/online/mac/en-Us/multiplemeasures_blendedaxes.html
Unlike data joining, data blending in Tableau allows combining of data from different sources and
platforms. For instance, you can blend data present in an Excel file with that of an Oracle DB to create a
new dataset.
Sample data (id, ename, salary, dept): 6, ASHLEY, 25000, HR
Drag ename onto Columns and salary onto Rows; we get SUM(salary) for each individual employee.
When you look at the aggregated data in the view above, each bar represents all transactions for a
specific employee, summed up or averaged into a single value. Now say that you want to see the
individual salary transactions for each employee. You can create a view like that by clearing
Analysis > Aggregate Measures.
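The aggregate-versus-disaggregate behaviour described above can be sketched outside Tableau; the transaction data below is a hypothetical stand-in for the employee table:

```python
from collections import defaultdict

# Hypothetical salary transactions: (employee, amount). In Tableau's default
# view, each employee appears as one aggregated bar (SUM of salary).
transactions = [
    ("krishna", 5000), ("bbc", 13000),
    ("vamshi", 19000), ("vamshi", 2000), ("krishna", 1000),
]

# Aggregated view: one value per employee, like SUM(salary) on the Rows shelf.
aggregated = defaultdict(int)
for name, amount in transactions:
    aggregated[name] += amount

# Disaggregated view (Aggregate Measures cleared): every individual
# transaction remains its own mark.
disaggregated = transactions

print(aggregated["vamshi"])   # 21000
print(len(disaggregated))     # 5
```

Clearing Aggregate Measures in Tableau corresponds to plotting `disaggregated` directly instead of the per-employee sums.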
Tableau desktop: desktop environment to create and publish standard and packaged workbooks.
Tableau Public: workbooks available publicly online for users to download and access the included data.
Whenever we create a context filter, Tableau creates a temporary table for that particular filter set, and
the other filters are applied to the context-filtered data, like cascading parameters. Suppose we have
created a context filter on Country and chosen USA and India: Tableau creates a temporary table for
these two countries' data, and any other filters are applied to that data. If we don't have any context
filter, each individual record is checked against all filters.
The context filter should not be changed frequently by the user: if the filter is changed, the database
must recompute and rewrite the temporary table, slowing performance.
When you set a dimension to context, Tableau creates a temporary table that requires a reload each
time the view is initiated. For Excel, Access and text data sources, the temporary table is created in an
Access table format. For SQL Server, MySQL and Oracle data sources, you must have permission to
create a temporary table on your server. For multidimensional data sources (cubes), temporary tables
are not created, and context filters only define which filters are independent and which are dependent.
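The cascade described above can be sketched as materialising a "temporary table" for the context filter first, then running the remaining filters against only those rows (sample rows are hypothetical):

```python
# Hypothetical rows: (country, state, sales).
rows = [
    ("USA", "California", 100), ("USA", "Texas", 80),
    ("India", "Goa", 60), ("UK", "London", 90),
]

# Context filter: build the temporary table for USA and India only.
context = [r for r in rows if r[0] in ("USA", "India")]

# Ordinary filters now run against the context table, not the full data.
high_sales = [r for r in context if r[2] >= 80]

print(high_sales)  # [('USA', 'California', 100), ('USA', 'Texas', 80)]
```

Without a context filter, the `high_sales` condition would instead be checked against every record in `rows`, which is why context filters can cut query time on large sources.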
Tableau offers five main products: Tableau Desktop, Tableau Server, Tableau Online, Tableau Reader and
Tableau Public.
Q. What is the current latest version of Tableau Desktop (as of Sep 25th, 2017)?
Data visualization refers to the techniques used to communicate data or information by encoding it as
visual objects (e.g. points, lines or bars) contained in graphics.
Q. Why tableau?
Whether your data is in an on-premise database, a data warehouse, a cloud application or an Excel file,
you can analyze it with Tableau. You can create views of your data and share them with colleagues,
customers, and partners. You can use Tableau to blend it with other data. And you can keep your data
up to date automatically.
Q. What are Filters? How many types of filters are there in Tableau?
A filter restricts the view to exactly the data needed by excluding unnecessary records. Basically, filters are of 3 types:
1. Quick filter
2. Context filter
3. Datasource filter
Sample data (id, ename, salary, dept):
madhu, 300
3, krishna, 5000, .net
2, bbc, 13000, testing
5, vamshi, 19000, .net
Drag ename onto Columns and salary onto Rows; we get SUM(salary) for each individual employee.
Q. Can we use non-used columns (columns which are not used in reports but which exist in the data source)
in Tableau filters?
Yes!
Extract can be used anywhere without any connection and you can build your own visualizations
without connecting to Database.
Q. How to combine two excel files with same fields but different data (different years)?
I have 5 different excel files (2007.xls, 2008.xls ... 2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). Can someone tell me how I can
combine the film name, genre and profitability so that I can see the visualization of 2007 to 2011 in a
single chart?
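Newer Tableau versions can union such files directly when connecting; another option is to pre-combine the files with a small script before connecting. A minimal sketch of that pre-combining step (the file contents and the added year tag are illustrative assumptions, not part of Tableau itself):

```python
import csv
import io

# Stand-ins for 2007.xls ... 2011.xls, already exported to CSV text.
yearly_files = {
    2007: "film name,genre,profitability\nFilm A,Drama,1.2\n",
    2008: "film name,genre,profitability\nFilm B,Comedy,0.9\n",
}

combined = []
for year, text in yearly_files.items():
    for row in csv.DictReader(io.StringIO(text)):
        row["year"] = year          # tag each row with its source year
        combined.append(row)

print(len(combined))  # 2 rows: one chart-ready table covering all years
```

Once stacked like this, a single "year" column lets one chart span 2007 to 2011 instead of five separate sources.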
We can join a maximum of 32 tables; it's not possible to combine more than 32 tables.
Joins in Tableau:
For e.g.: your client is in the Healthcare domain and uses SQL Server as their database. In SQL Server there
may be many tables, like a Claims table, a Rejected Claims table, a Customer table. Now the client wants to
know customer-wise claims and customer-wise rejected claims, using joins. A join is a query that
combines the data from 2 or more tables by making use of a join condition.
If we create joins on the fields in Tableau, all the table names are suffixed with $. While performing
joins on multiple tables, always start with the tables having the least amount of data, so that we can
improve performance.
Based on the operator used in the join condition, joins are of two kinds:
1. Equi Join: if the join condition uses the equality operator "=", such a join is called an Equi Join.
2. Non Equi Join: if the join condition uses any operator other than "=", such as <, >, <=, >= or !=, such a
join is called a Non Equi Join.
1. Inner Join,
2. Outer Join,
3. Self-Join.
1. Inner Join: an inner join loads only the matching records from both tables. Inner join condition:
TableA.id = TableB.id
2.Outer Join:
Left outer join: displays the complete data from the left table plus the matching records from the right.
Condition (Oracle syntax): tablea.id = tableb.id(+)
Right outer join: displays the complete data from the right table plus the matching records from the left.
Condition (Oracle syntax): tablea.id(+) = tableb.id
Full outer join: a full outer join loads the complete data from both the left and the right table. Condition:
TableA FULL OUTER JOIN TableB ON tablea.id = tableb.id
3. Self-Join: if we join a table to itself, such a join is called a self-join.
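The join types above can be checked against a real SQL engine. The sketch below uses SQLite with hypothetical Customer and Claims tables echoing the healthcare example (table names and rows are made up):

```python
import sqlite3

# Hypothetical Customer / Claims tables mirroring the healthcare example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE claims   (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO claims   VALUES (1, 250.0);
""")

# Inner join: only the matching records from both tables.
inner = con.execute("""
    SELECT c.name, cl.amount
    FROM customer c
    JOIN claims cl ON c.id = cl.customer_id
    ORDER BY c.id
""").fetchall()

# Left outer join: all customers, plus matching claims (NULL where none).
left = con.execute("""
    SELECT c.name, cl.amount
    FROM customer c
    LEFT JOIN claims cl ON c.id = cl.customer_id
    ORDER BY c.id
""").fetchall()

print(inner)  # [('Asha', 250.0)]
print(left)   # [('Asha', 250.0), ('Ravi', None)]
```

Ravi appears only in the left outer join, with a NULL claim amount, which is exactly the difference between the two join types described above.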
For ex: your client is the same Healthcare client. They operate their services in Asia, Europe, NA and so
on, and they maintain the Asia data in SQL, the Europe data in SQL Server and the NA data in MySQL.
Now your client wants to analyze their business across the world in a single worksheet, so you can't
perform a join here.
Normally in Tableau we perform analysis on a single data server. If we want to perform analysis from
multiple data sources in a single sheet, then we have to use a new concept called data blending.
Data blending mixes the data from different data sources and allows users to perform the analysis in
a single sheet. Blending means mixing: if we are mixing data sources, it is called data blending.
1. If we are performing data blending on 2 data sources, these 2 data sources should have at least 1
common dimension.
1. Automatic way
2. Custom way
1. Automatic way: in the automatic way, Tableau automatically defines the relationship between the 2
data sources based on the common dimensions and matching values, and the relationship is indicated
with an orange color.
2. Custom or manual way: in the manual (custom) way, the user needs to define the relationship
manually.
1. All the primary data sources and the secondary data sources are linked by a specific relationship.
2. While performing data blending, each worksheet has a primary connection and, optionally, several
secondary connections.
3. All the primary connections are indicated in blue in the worksheet and all the secondary data
sources with an orange tick mark.
4. In data blending, one sheet contains one primary data source, and one sheet can contain any number
of secondary data sources.
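A data blend can be pictured as a lookup from each primary row into the secondary source on the common (linking) dimension. The sources below are hypothetical stand-ins for, say, an Excel sales file and a database of targets:

```python
# Primary source (e.g. Excel): sales by region.
primary = [
    {"region": "Asia", "sales": 120},
    {"region": "Europe", "sales": 90},
]

# Secondary source (e.g. a database): targets keyed on the common
# dimension "region", which plays the role of the orange-linked field.
secondary = {"Asia": 100, "Europe": 110}

# Blend: for each primary row, look up the secondary value on the
# linking dimension and attach it to the row.
blended = [dict(row, target=secondary.get(row["region"])) for row in primary]

print(blended[0])  # {'region': 'Asia', 'sales': 120, 'target': 100}
```

Unlike a join, the primary rows are never multiplied or dropped: every primary row survives, and the secondary source only contributes looked-up values.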
Dimensions are the descriptive text columns, and facts are the measures (numerical values).
Dimension examples: Product Name, City. Fact examples: Sales, Profit.
Q. Can we place an excel file in a shared location and use it to develop a report and refresh it in regular
intervals?
Yes, you can do it, but for better performance use an extract.
A heat map is a great way to compare categories using color and size. In this, you can compare two
different measures. Tree map is a very powerful visualization, particularly for illustrating hierarchical
(tree – structured) data and part – to – whole relationships.
Q. What is the difference between the twb and twbx file extensions? Please explain.
.twb is a live connection: it points to the data source; the user receiving the .twb needs permission to
that data source, and no data is included. .twbx takes the data offline: it stores the data as a packaged,
zip-like file, thereby removing the need for end-user permissions; it is a snapshot in time of the data as
of the moment it was saved as .twbx.
First, most of the BI tools out there are pricey, but Tableau has a free offering (Tableau Public) as
well as a very popular (also free) academic distribution. Tableau is recognized by firms like Forrester
Research as one of the easiest to use, most agile products currently available (see "Tableau Ranks #1 in
The Forrester Wave: Advanced Data Visualization (ADV) Platforms"). That makes it easy to pick up and
try new things with, which is what data visualization people love about it.
On the other hand, unlike some of the other BI tools, Tableau is not a complete technology stack; it is
most useful for visualization and analytics. You will need other products in addition to Tableau for
heavier enterprise data ETL, maintenance, storage, etc.
https://www.tableau.com/about/blog/2012/7/tableau-ranks-1-forrester-wave-advanced-data-
visualization-adv-platforms-1852
Q. Can you get values from two different sources as a single input into parameter?
No, you cannot. Each data source corresponds to a Tableau workbook. If you include both data variables
in the same data source, you can input them in the same workbook.
We can use parameters with filters, calculated fields, actions, measure swaps, changing views and auto
updates.
A custom SQL query is written after connecting to the data, to pull the data in a structured view. One
simple example: you have 50 columns in a table but need just 10 of them. Instead of taking all 50
columns you can write a SQL query; performance will increase.
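As a sketch of that column-pruning idea (the table and column names here are made up), the custom SQL simply selects the handful of columns the report needs, so less data crosses the wire:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A wide table: imagine 50 columns, of which the report needs only a few.
con.execute("CREATE TABLE orders (id, customer, region, amount, col5, col6)")
con.execute("INSERT INTO orders VALUES (1, 'Asha', 'Asia', 250.0, 'x', 'y')")

# Custom SQL used as the data source: pull only the needed columns
# instead of SELECT * over the full width of the table.
needed = con.execute("SELECT customer, region, amount FROM orders").fetchall()

print(needed)  # [('Asha', 'Asia', 250.0)]
```

The same narrowing query, pasted into Tableau's custom SQL box, is what keeps the extract or live connection from dragging along unused columns.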
Q. What are the differences between Tableau Software and Traditional BI tools?
Tableau provides easy to use, best in class, Visual Analytic capabilities, but it does not help with the
plumbing (data foundation). You could, for example, marry SQL Server with Tableau to get the complete
package. Tableau licenses are relatively expensive if you are looking to scale.
Traditional BI can handle it all but with significant upfront costs. Higher consulting, hardware and
software costs. Among the mega-vendors, only Microsoft can provide a reasonable value proposition.
Open source vendors like Pentaho and JasperSoft do not have an abundant enough talent pool, yet.
Q. What are the similarities and differences between Tableau software and Palantir?
Palantir and Tableau are very different. Palantir has its roots in large data computer science problems
involving security, payments, fraud detection and the likes. Customers/Investors include Paypal, CIA and
others.
Tableau is a visualization player, with roots in Stanford University research. Its Visual Query Language
(VizQL) allows users to build visualizations on top of standard data warehouses or spreadsheets.
Suppose we have chosen INDIA as the country in filter 1; filter 2 should then display only INDIA states.
Yes, for sure! It gives you data insights to an extent that other tools don't. It helps you plan, pinpoint
anomalies and improve your processes.
Using filters or calculated fields, we can display the top 5 and bottom 5 sales in the same view.
Q. Design a view in a map such that if user selects any state the cities under that state has to show profit
and sales.
If you want to show the sales and profit in each city under the states in the same worksheet: according
to the question, you should have State, City, Sales and Profit fields in your dataset.
2. Drag the City and drop it into the Marks card (under the State field).
6. Right-click on the State field and select Show Quick Filter.
7. Select any state and check whether you get the required view. In this view, size indicates the
amount of sales and color indicates the profit values.
Q. How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file, then use Data -> Edit Relationships to link them.
Our goal is to help people see and understand data. Our software products put the power of data into
the hands of everyday people, allowing a broad population of business users to engage with their data,
ask questions, solve problems and create value.
Tableau Public is a free service that lets anyone publish interactive data to the web. Once on the web,
anyone can interact with the data, download it, or create their own visualizations of it. No programming
skills are required. Be sure to look at the gallery to see some of the things people have been doing with
it.
Data modelling is the analysis of the data objects that are used in a business or other context and the
identification of the relationships among these data objects. Data modelling is a first step in doing
object-oriented programming.
I think we all work on different projects using Tableau, so the work begins from understanding the
requirement getting the required data, story boarding then creating visualizations in tableau and then
presenting it to the client for review.
Parameters are dynamic values that can replace constant values in calculations and can serve as filters
Tableau performance depends on data source performance. If the data source takes more time to execute a
query, then Tableau must wait for that time.
Parameters are dynamic values that can replace constant values in calculations and can serve as
filters. Filters are used to restrict the data based on the condition you have mentioned on the Filters shelf.
The Tableau Desktop log files are located in C:\Users\<user>\My Documents\My Tableau Repository. If you
have a live connection to the data source, check the log.txt and tabprotosrv.txt files. If you are using an
extract, check the tdeserver.txt file. The tabprotosrv.txt file often shows detailed information about queries.
The Pages shelf is a powerful part of Tableau that you can use to control the display of output, as well as
the printed results of output.
Q. What are the major differences between tableau version 7.0 and tableau version 8.0?
1. New visualizations are introduced, like treemap, bubble chart and box-and-whisker plot
3. Introduced R script
Step 1: Build a map view. Double-click a geographic field such as State, Area Code, Zip Code, etc.
Step 2: Select the Filled Map mark type. The Automatic mark type shows this type of view as circles
over a map; on the Marks card, select Filled Map to color the geographic areas.
Step 3: Drag a field to the Color shelf. Define how the locations are colored by dragging another field to
the Color shelf.
Yes, it may have its own drop-down list; the entries you make in the parameter while creating it can be
viewed as a drop-down list.
After creating dashboards, if we find a problem on the SQL side (that is, in the custom SQL), how do we
rectify the SQL performance of the custom SQL?
Q. Suppose my license expires today; will users be able to view the dashboards or workbooks which I
published on the server earlier?
If your server license expires today, your user name on the server will get the role 'unlicensed', which
means you cannot access the content, but others can. The Site Admin can change ownership to another
person, so that extracts, if enabled, do not fail.
Q. Suppose I am using Tableau Desktop with a live connection to Cloudera Hadoop data. I need to
press F5 to refresh the visualization. Is there any way to refresh the visualization automatically every x
minutes instead of pressing F5 every time?
Here is an example of refreshing a dashboard every 3 seconds: replace the API src and server URL with
yours. The interval below is for 3 seconds.
Tableau Desktop is based on breakthrough technology from Stanford University that lets you drag and
drop to analyze data. It is a great data visualization tool: you can connect to data in a few clicks, then
visualize and create interactive dashboards with a few more.
Q. What are the differences between Tableau Software, GoodData and Traditional BI (Business Objects,
etc.)?
You could talk feature-functionality for days, but at a high level there are four major differences.
1. Speed: How fast can you get up and running with the system, answer questions, design and share
dashboards, and then change them? This is where systems like Tableau and GoodData are far better
than old-school business intelligence like Business Objects or Cognos. Traditional systems took months
or years to implement, with costs running into millions. Tableau has a free trial that installs in minutes
and GoodData is cloud-based, so they are faster to implement by orders of magnitude. They are also
faster to results: traditional BI requires IT and developers to make any changes to reports, so business
users are stuck in a queue waiting to get anything done. Tableau and GoodData provide more of a
self-service experience.
2. Analysis layer: This is where Tableau excels. It has a powerful and flexible drag & drop visualization
engine based on some technology from Stanford. GoodData and traditional BI typically provide some
canned reports but changing them requires significant time and money.
3. Data layer: This is where the three options are most different:
GoodData requires you to move your data to its cloud. Traditional BI typically requires you to move your
data to its data warehouse system. Tableau connects to a variety of existing data source and also
provides a fast in – memory data engine, essentially a local database. Since most enterprises have their
data stored all over the place, this provides the most choice and lets companies use the investment
they’ve already made.
4. Enterprise readiness: Traditional BI and Tableau do well here, with enterprise – level security and high
scalability.
Tableau is business intelligence software that allows anyone to easily connect to data, then visualize and
create interactive, sharable dashboards. It’s easy enough that any Excel user can learn it, but powerful
enough to satisfy even the most complex analytical problems. Securely sharing your findings with others
only takes seconds.
Tableau Server is browser- and mobile-based insight anyone can use. Publish dashboards with Tableau
Desktop and share them throughout your organization. It’s easy to set up and even easier to run.
R is a popular open-source environment for statistical analysis. Tableau Desktop can now connect to R
through calculated fields and take advantage of R functions, libraries, and packages and even saved
models. These calculations dynamically invoke the R engine and pass values to R via the Rserve package,
and are returned back to Tableau.
1. Tableau Server can also be configured to connect to an instance of Rserve through the tabadmin
utility, allowing anyone to view a dashboard containing R functionality.
2. Combining R with Tableau gives you the ability to bring deep statistical analysis into a drag-and-drop
visual analytics environment.
Q. What is the difference between a quick filter and a normal filter in Tableau?
A quick filter is used to view the filtering options and to select an option directly in the view. A normal
filter is one where you limit the options from the list or use conditions to limit the data by field or
value.
You need to publish report to tableau server, while publishing you will find one option to schedule
reports.You just need to select the time when you want to refresh data.
Tableau compiles the elements of your visual canvas into a SQL or MDX query for the remote database
to process. Since a database typically runs on more powerful hardware than the laptops or workstations
used by analysts, you should generally expect the database to handle queries much faster than most
in-memory BI applications limited by end-user hardware. Tableau's ability to push computation
(queries) close to the data is increasingly important for large data sets, which may reside on a fast
cluster and may be too large to bring into memory.
Another factor in performance relates to data transfer, or in Tableau's case result-set transfer. Since
Tableau visualizations are designed for human consumption, they are tailored to the capabilities and
limits of the human perception system. This generally means that the amount of data in a query result
set is small relative to the size of the underlying data, and visualizations focus on aggregation and
filtering to identify trends and outliers. The small result sets require little network bandwidth, so
Tableau is able to fetch and render the result set very quickly. And, as Ross mentioned, Tableau will
cache query results for fast reuse.
The last factor involves Tableau's ability to use in-memory acceleration as needed (for example, when
working with very slow databases, text files, etc.). Tableau's Data Engine uses memory-mapped I/O, so
while it takes advantage of in-memory acceleration it can easily work with large data sets that cannot
fit in memory. The Data Engine works only with the subsets of data on disk that are needed for a given
query, and the data subsets are mapped into memory as needed.
Tableau Desktop is a data visualization application that lets you analyze virtually any type of structured
data and produce highly interactive, beautiful graphs, dashboards, and reports in just minutes. After a
quick installation, you can connect to virtually any data source from spreadsheets to data warehouses
and display information in multiple graphic perspectives. Designed to be easy to use, you’ll be working
faster than ever before.
While Tableau lets you analyze databases and spreadsheets like never before, you don't need to know
anything about databases to use Tableau. In fact, Tableau is designed to allow business people with no
technical training to analyze their data efficiently. Tableau is based on three simple concepts:
– Connect: Connecting means pointing Tableau to your data source. Note that Tableau does not import
the data; instead it queries the database directly.
– Analyze: Analyzing data means viewing it, filtering it, sorting it, performing calculations on it,
reorganizing it, summarizing it, and so on. Using Tableau you can do all of these things by simply
arranging fields of your data source on a Tableau worksheet. When you drop a field on a worksheet,
Tableau queries the data using standard drivers and query languages (like SQL and MDX) and presents a
visual analysis of the data.
– Share: You can share results with others either by sharing workbooks with other Tableau users,
by pasting results into applications such as Microsoft Office, printing to PDF, or by using Tableau Server
to publish or embed your views across your organization.
1. New visualizations are introduced like tree map bubble chart and box and whisker plot
3. Introduced R script
– With Kerberos support, Tableau 8.3 advances enterprise-grade data analysis with these
enhancements:
1. Provides seamless, single sign-on experience from Tableau client to back-end data sources
2. Protects sensitive data with delegated access and viewer credential management
The company was founded in Mountain View, California in January, 2003 by Chris Stolte, Christian
Chabot and Pat Hanrahan.
13) What is the difference between heat map and tree map?
A heat map is a great way to compare categories using color and size. In this, you can compare two
different measures. Tree map is a very powerful visualization, particularly for illustrating hierarchical
(tree – structured) data and part – to – whole relationships.
15) How will you publish and schedule workbook in tableau server?
First create a schedule for particular time and then create extract for the data source and publish the
workbook for the server. Before you publish, there is a option called Scheduling and Authentication,
click on that and select the schedule from the drop down which is created and publish. Also publish data
source and assign the schedule. This schedule will automatically run for the assigned time and the
workbook is refreshed.
Whenever we create a context filter, Tableau will create a temporary table for that particular filter set,
and the other filters will be applied to the context-filtered data, like cascading parameters. Suppose we have
created a context filter on countries and have chosen USA and India: Tableau will create a
temporary table for these two countries' data, and if you have any other filters, they will be applied to
these two countries' data. If we don't have any context filter, each individual record is checked against all
filters.
The context filter should not be frequently changed by the user: if the filter is changed, the database must
recompute and rewrite the temporary table, slowing performance.
When you set a dimension to context, Tableau creates a temporary table that will require a reload each
time the view is initiated. For Excel, Access, and text data sources, the temporary table created is in an
Access table format. For SQL Server, MySQL, and Oracle data sources, you must have permission to
create a temporary table on your server. For multidimensional data sources, or cubes, temporary tables
are not created, and context filters only define which filters are independent and dependent.
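The cascade described above can be sketched in plain Python. This is only an illustration of the logic, not Tableau's actual engine: the context filter builds a "temporary table" first, and the remaining filters scan only that subset (all names and sample rows here are made up).

```python
# Illustrative sketch of context-filter cascading (not Tableau internals).
rows = [
    {"country": "USA", "city": "New York", "sales": 120},
    {"country": "India", "city": "Mumbai", "sales": 80},
    {"country": "UK", "city": "London", "sales": 95},
    {"country": "USA", "city": "Chicago", "sales": 60},
]

def apply_context_filter(rows, countries):
    """Builds the 'temporary table': only rows matching the context filter."""
    return [r for r in rows if r["country"] in countries]

def apply_other_filters(rows, min_sales):
    """Subsequent filters scan only the context-filtered subset."""
    return [r for r in rows if r["sales"] >= min_sales]

context = apply_context_filter(rows, {"USA", "India"})  # the temporary table
result = apply_other_filters(context, 70)               # cascaded onto the subset
```

Without a context filter, `apply_other_filters` would have to scan every row; with one, later filters only check the pre-filtered subset, which is exactly why context filters can speed up large sources.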
What is the Difference between quick filter and Normal filter in tableau?
A quick filter is used to view the filtering options and select among them directly in the view. A normal filter is
something you use to limit the options from a list or apply conditions to limit the data by field or
value.
22) What is the benefit of a Tableau extract file over a live connection?
An extract can be used anywhere without any connection, and you can build your own visualizations
without connecting to the database.
23) How to combine two Excel files with the same fields but different data (different years)?
I have 5 different Excel files (2007.xls, 2008.xls ... 2011.xls) with the same fields (film name, genre, budget,
rating, profitability) but with data from different years (2007 to 2011). Can someone tell me how I can
combine the film name, genre, and profitability so that I can see the visualization of 2007 to 2011 in a
single chart?
We can join a maximum of 32 tables; it's not possible to combine more than 32 tables.
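Outside of Tableau, the kind of stacking the question asks for can be sketched in plain Python: read each year's records and append them into one dataset tagged with the year. The file names and field values below are stand-ins for the Excel files in the question (in Tableau itself this row-wise stacking is typically done with a union rather than a join).

```python
# Stand-in for reading 2007.xls ... 2011.xls: one record list per year.
data_by_year = {
    2007: [{"film": "A", "genre": "Drama", "profitability": 1.2}],
    2008: [{"film": "B", "genre": "Comedy", "profitability": 2.5}],
    2009: [{"film": "C", "genre": "Action", "profitability": 0.8}],
}

combined = []
for year, records in sorted(data_by_year.items()):
    for rec in records:
        # Tag each row with its source year so one chart can span all years.
        combined.append({**rec, "year": year})
```

With the `year` column added, a single chart can plot profitability across 2007 to 2011 from the one combined table.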
R is a popular open-source environment for statistical analysis. Tableau Desktop can now connect to R
through calculated fields and take advantage of R functions, libraries, and packages and even saved
models. These calculations dynamically invoke the R engine and pass values to R via the Rserve package,
and are returned back to Tableau.
Tableau Server can also be configured to connect to an instance of Rserve through the tabadmin utility,
allowing anyone to view a dashboard containing R functionality.
Combining R with Tableau gives you the ability to bring deep statistical analysis into a drag-and-drop
visual analytics environment.
The Pages shelf is a powerful part of Tableau that you can use to control the display of output as well as
the printed results of output.
27) How can we combine database and flat file data in tableau desktop?
Connect to the data twice, once for the database tables and once for the flat file, and then use Data -> Edit Relationships to define how they relate.
Add a custom color code. Note: In Tableau 9.0, a color picker option is available.
29) How to design a view to show region-wise profit and sales? I do not want line and bar charts to be
used for profit and sales.
Generate the map using cities, then drag Profit and Sales to Details, and add State as a quick
filter.
I have chosen country as USA and filter2 should display only USA states
Multiple Measures are shown in single axis and also all the marks shown in single pane
http://onlinehelp.tableau.com/current/pro/online/mac/en-us/multiplemeasures_blendedaxes.html
A much more advanced, direct, precise, and ordered way of viewing large volumes of data is called data
visualization. It is the visual representation of data in the form of graphs and charts, especially when you
can’t define it textually. You can show trends, patterns, and correlations through various data visualization
software and tools; Tableau is one such data visualization software used by businesses and corporates.
34) What are the differences between Tableau desktop and Tableau Server?
While Tableau Desktop performs data visualization and workbook creation, Tableau Server is used to
distribute these interactive workbooks and/or reports to the right audience. Users can edit and update
the workbooks and dashboards online on the Server but cannot create new ones. However, there are limited
editing options when compared to Desktop.
Tableau Public is again a free tool consisting of Desktop and Server components accessible to anyone.
Tableau parameters are dynamic variables/values that replace the constant values in data calculations
and filters. For instance, you can create a calculated field value returning true when the score is greater
than 80, and otherwise false. Using parameters, one can replace the constant value of 80 and control it
dynamically in the formula.
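The score example above can be sketched in Python, with the parameter playing the role of the Tableau parameter that replaces the hard-coded 80 (the function and field names are illustrative only):

```python
# The constant 80 replaced by a parameter the user can control dynamically.
def passes(score, threshold=80):
    """Stand-in for a calculated field: True when score exceeds the threshold."""
    return score > threshold

default_result = passes(85)            # uses the default cutoff of 80
stricter = passes(85, threshold=90)    # the user raises the parameter to 90
```

Changing `threshold` changes the result without editing the calculation itself, which is exactly the benefit of a parameter over a hard-coded constant.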
The difference actually lies in the application. Parameters allow users to insert their own values, which can be
integer, float, date, or string values that can be used in calculations. However, filters receive only the values users
choose to ‘filter by’ from the list, which cannot be used to perform calculations.
Users can dynamically change measures and dimensions with a parameter, but filters do not support this
feature.
—>Facts are the numeric metrics or measurable quantities of the data, which can be analyzed by the
dimension table. Facts are stored in a fact table that contains foreign keys referring uniquely to the
associated dimension tables. The fact table supports data storage at the atomic level and thus allows a larger
number of records to be inserted at one time. For instance, a Sales fact table can have product key,
customer key, promotion key, and items sold, referring to a specific event.
—>Dimensions are the descriptive attribute values for multiple dimensions of each attribute, defining
multiple characteristics. A dimension table, having a reference to a product key from the fact table, can
consist of product name, product type, size, color, description, etc.
Global quick filters are a way to filter each worksheet on a dashboard, as long as each of them contains the
filtered dimension. They are very useful for worksheets using the same data source, but this sometimes proves to
be a disadvantage and generates slow results. Thus, parameters are more useful.
Parameters facilitate only four ways to represent data on a dashboard (which are seven in quick filters).
Further, parameters do not allow multiple selections in a filter.
Aggregation and disaggregation in Tableau are the ways to develop a scatterplot to compare and
measure data values. As the name suggests, an aggregation is a calculation over a set of values that
returns a single numeric value. For instance, a measure with values 1, 3, 5, 7 returns 16 when summed. You can also set a
default aggregation for any measure, which is not user-defined. Tableau supports various default
aggregations for a measure, such as Sum, Average, Median, Count, and others.
Disaggregating data refers to viewing each data source row, while analyzing data both independently
and dependently.
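The distinction can be shown with the standard library: aggregation collapses the values 1, 3, 5, 7 into a single number per function, while disaggregation keeps every underlying row visible.

```python
import statistics

values = [1, 3, 5, 7]

# Aggregation: each function collapses the whole set into one number.
aggregations = {
    "sum": sum(values),
    "average": statistics.mean(values),
    "median": statistics.median(values),
    "count": len(values),
}

# Disaggregation: every source row is kept and shown individually.
disaggregated = list(values)
```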
Unlike Data Joining, Data Blending in tableau allows combining of data from different sources and
platforms. For instance, you can blend data present in an Excel file with that of an Oracle DB to create a
new dataset.
The concept of context filter in Tableau makes the process of filtering smooth and straightforward. It
establishes a filtering hierarchy where all other filters present refer to the context filter for their
subsequent operations. The other filters now process data that has been passed through the context
filter.
Creating one or more context filters improves performance as users do not have to create extra filters
on large data source, reducing the query-execution time.
You can create one by dragging a field into the ‘Filters’ tab, then right-clicking that field and selecting “Add to
Context”.
Tableau takes time to place a filter in context. When a filter is set as a context filter, the software creates a
temporary table for that particular context filter. This table will reload each time and consists of all
values that are not filtered by either the Context or Custom SQL filter.
.twb is the most common file extension used in Tableau, which presents an XML format file and
comprises all the information present in each dashboard and sheet like what fields are used in the
views, styles and formatting applied to a sheet and dashboard.
But this workbook does not contain any data. The Packaged workbook merges the information in a
Tableau workbook with the local data available (which is not on server). .twbx serves as a zip file, which
will include custom images if any. Packaged Workbook allows users to share their workbook information
with other Tableau Desktop users and let them open it in Tableau Reader.
Data extracts are the first copies or subdivisions of the actual data from original data sources. Workbooks
using data extracts instead of live DB connections are faster since the extracted
data is imported into the Tableau data engine.
After this extraction of data, users can publish the workbook, which also publishes the extracts in
Tableau Server. However, the workbook and extracts won’t refresh unless users apply a scheduled
refresh on the extract. Scheduled Refreshes are the scheduling tasks set for data extract refresh so that
they get refreshed automatically while publishing a workbook with data extract. This also removes the
burden of republishing the workbook every time the concerned data gets updated.
• Horizontal- Horizontal layout containers allow the designer to group worksheets and dashboard
components left to right across your page and edit the height of all elements at once.
• Vertical- Vertical containers allow the user to group worksheets and dashboard components top to
bottom down your page and edit the width of all elements at once.
• Text
• Image Extract: – A Tableau workbook is in XML format. In order to extract images, Tableau applies
some codes to extract an image, which can be stored in the XML.
• Web [URL ACTION]:- A URL action is a hyperlink that points to a Web page, file, or other web-based
resource outside of Tableau. You can use URL actions to link to more information about your data that
may be hosted outside of your data source. To make the link relevant to your data, you can substitute
field values of a selection into the URL as parameters.
• Create a Performance Recording to record performance information about the main events as you
interact with the workbook. Users can view the performance metrics in a workbook created by Tableau.
• Reviewing the Tableau Desktop Logs located at C:\Users\\My Documents\My Tableau Repository. For
live connection to data source, you can check log.txt and tabprotosrv.txt files. For an extract, check
tdeserver.txt file.
Tableau provides a distinct and powerful tool to control the output display known as Page shelf. As the
name suggests, the page shelf fragments the view into a series of pages, presenting a different view on
each page, making it more user-friendly and minimizing scrolling to analyze and view data and
information. You can flip through the pages using the specified controls and compare them at a
common axle.
Performance testing is again an important part of implementing Tableau. This can be done by load
testing Tableau Server with TabJolt, which is a “point and run” load generator created to perform QA.
While TabJolt is not supported by Tableau directly, it has to be installed using other open-source
products.
Dual axis is an excellent feature supported by Tableau that helps users view two scales of two
measures in the same graph. Many websites like Indeed.com and others make use of dual axes to show
the comparison between two measures and their growth rate over a specific set of years. Dual axes let you
compare multiple measures at once, having two independent axes layered on top of one another.
The maximum number of 32 tables can be joined in Tableau. A table size must also be limited to 255
columns (fields).
The auto-filter provides a feature of removing ‘All’ options by simply clicking the down arrow in the
auto-filter heading. You can scroll down to ‘Customize’ in the dropdown and then uncheck the ‘Show
“All” Value’ attribute. It can be activated by checking the field again.
• Tableau desktop: desktop environment to create and publish standard and packaged workbooks.
• Tableau Public: workbooks available publicly online for users to download and access the included
data.
55) How can you display the top five and bottom five sales in the same view?
Create two sets, one for the top 5 and another for the bottom 5, and then join these two sets, displaying a unique set
of 10 rows in total.
TDE is a Tableau desktop file that contains a .tde extension. It refers to the file that contains data
extracted from external sources like MS Excel, MS Access or CSV file.
There are two aspects of TDE design that make them ideal for supporting analytics and data discovery.
• The first is that a TDE is a columnar store, which reduces the input/output required to access and aggregate the values.
• The second is how they are structured, which impacts how they are loaded into memory and used by
Tableau. This is an important aspect of how TDEs are “architecture aware”. Architecture-awareness
means that TDEs use all parts of your computer memory, from RAM to hard disk, and put each part to
work on what best fits its characteristics.
By adding the same calculation to ‘Group By’ clause in SQL query or creating a Calculated Field in the
Data Window and using that field whenever you want to group the fields.
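The calculated-field approach can be sketched in Python: a function derives a group label per row (the "calculated field"), and totals are accumulated per label, which is what adding the same expression to a SQL `GROUP BY` clause achieves. All names and sample rows here are illustrative.

```python
from collections import defaultdict

orders = [
    {"product": "Pen", "amount": 5},
    {"product": "Pencil", "amount": 3},
    {"product": "Desk", "amount": 120},
]

def price_band(order):
    """The 'calculated field': derives a group label from each row."""
    return "low" if order["amount"] < 50 else "high"

# Aggregate by the derived label, as a GROUP BY on the same expression would.
totals = defaultdict(int)
for o in orders:
    totals[price_band(o)] += o["amount"]
```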
• Blend data using groups created in the secondary data source: Only calculated groups can be used in
data blending if the group was created in the secondary data source.
• Use a group in another workbook. You can easily replicate a group in another workbook by copy and
pasting a calculation.
Yes, parameters do have their independent dropdown lists enabling users to view the data entries
available in the parameter during its creation.
Tableau is business intelligence software that allows anyone to connect to their data, and then
visualize and create interactive, sharable dashboards.
Data: With new web data connector, it makes data accessible from anywhere
Mobile: The new Tableau comes with high-resolution thumbnails, the ability to take snapshots offline, and high-
level security for the data
Visual Analytics: View proximity in the radial selection tool, also provides features like creating filter
formulas and Zoom control on your data
Tableau Public is a free service that allows anyone to publish interactive data to the web. Once it is on the
web, anyone can interact with the data, download it, or create their own visualizations.
4) Mention whether you can create relational joins in Tableau without creating a new table?
Yes, one can create relational joins in tableau without creating a new table.
Bookmarks: A bookmark contains a single worksheet and is an easy way to quickly share your work
Packaged Workbooks: It contains a workbook along with any supporting local file data and background
images
Data Extraction Files: Extract files are a local copy of a subset or entire data source
Data Connection Files: It’s a small XML file with various connection information
6) Mention what is the difference between published data sources and embedded data sources in
Tableau?
The difference between published data source and embedded data source is that,
Published data source: It contains connection information that is independent of any workbook and can
be used by multiple workbooks.
Embedded data source: It contains connection information and is associated with a workbook.
Icon/Name
Connection Type
Connects to
If the data resides in a single source, it is always desirable to use joins. When your data is not in one place,
blending is the most viable way to create a left-join-like connection between your primary and
secondary data sources.
A Tableau data extract is a compressed snapshot of data stored on disk and loaded into memory as
required to render a Tableau viz. A TDE is a columnar store and reduces the input/output required to access
and aggregate the values.
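The columnar layout can be illustrated in a few lines of Python. This is only a toy model of the idea behind a TDE, not its actual format: when each field is stored as its own array, aggregating one measure reads a single column instead of touching every whole row.

```python
# Row store: every query touches whole rows.
row_store = [
    {"region": "East", "sales": 100, "profit": 20},
    {"region": "West", "sales": 150, "profit": 35},
]

# Columnar store (the idea behind a TDE): one array per field,
# so aggregating a measure reads only that column.
column_store = {
    "region": ["East", "West"],
    "sales": [100, 150],
    "profit": [20, 35],
}

total_sales = sum(column_store["sales"])  # touches a single column only
row_total = sum(r["sales"] for r in row_store)  # must walk every full row
```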
10) Explain what is the difference between blending and joining in Tableau?
Joining term is used when you are combining data from the same source, for example, worksheet in an
Excel file or tables in Oracle database
While blending requires two completely defined data sources in your report.
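The difference can be sketched in Python: blending behaves like a left join between two independently defined sources, keeping every row of the primary source even when the secondary has no match. The sources and field names below are made-up illustrations.

```python
# Primary source, e.g. rows from an Excel worksheet.
primary = [
    {"region": "East", "sales": 100},
    {"region": "West", "sales": 150},
]

# Secondary source, e.g. a lookup from an Oracle table: region -> target.
secondary = {"East": 90}

# Blend: left-join-like combination; unmatched primary rows keep None.
blended = [
    {**row, "target": secondary.get(row["region"])}
    for row in primary
]
```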
An operational database undergoes frequent changes on a daily basis on account of the transactions
that take place. Suppose a business executive wants to analyze previous feedback on any data such as a
product, a supplier, or any consumer data, then the executive will have no data available to analyze
because the previous data has been updated due to transactions.
A data warehouse provides us generalized and consolidated data in a multidimensional view. Along with
this generalized and consolidated view of data, a data warehouse also provides us Online Analytical
Processing (OLAP) tools. These tools help us in interactive and effective analysis of data in a
multidimensional space. This analysis results in data generalization and data mining.
Data mining functions such as association, clustering, classification, prediction can be integrated with
OLAP operations to enhance the interactive mining of knowledge at multiple level of abstraction. That's
why data warehouse has now become an important platform for data analysis and online analytical
processing.
It possesses consolidated historical data, which helps the organization to analyze its business.
A data warehouse helps executives to organize, understand, and use their data to take strategic
decisions.
An operational database is constructed for well-known tasks and workloads such as searching
particular records, indexing, etc. In contrast, data warehouse queries are often complex and
they present a general form of data.
An operational database query allows read and modify operations, while an OLAP query
needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data warehouse
maintains historical data.
Time Variant − The data collected in a data warehouse is identified with a particular time
period. The data in a data warehouse provides information from the historical point of view.
Non-volatile − Non-volatile means the previous data is not erased when new data is added to it.
A data warehouse is kept separate from the operational database, and therefore frequent
changes in the operational database are not reflected in the data warehouse.
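The non-volatile, time-variant properties can be sketched as an append-only store: new period snapshots are added, and existing history is never updated or erased (the field names here are illustrative).

```python
from datetime import date

# Append-only warehouse: loads add rows; history is never overwritten.
warehouse = []

def load_snapshot(snapshot_date, balance):
    """Each load appends a time-stamped record rather than updating in place."""
    warehouse.append({"date": snapshot_date, "balance": balance})

load_snapshot(date(2023, 1, 31), 500)
load_snapshot(date(2023, 2, 28), 650)   # a new period arrives...

history = [row["balance"] for row in warehouse]  # ...and old data survives
```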
Note − A data warehouse does not require transaction processing, recovery, and concurrency controls,
because it is physically stored and separate from the operational database.
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Types of Data Warehouse
Information processing, analytical processing, and data mining are the three types of data warehouse
applications that are discussed below −
Information Processing − A data warehouse allows the data stored in it to be processed. The data
can be processed by means of querying, basic statistical analysis, and reporting using crosstabs,
tables, charts, or graphs.
Data Mining − Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and prediction. These
mining results can be presented using the visualization tools.
2. OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
12. The OLAP database size is from 100 GB to 100 TB; the OLTP database size is from 100 MB to 100 GB.
Tuning Production Strategies − The product strategies can be well tuned by repositioning the
products and managing the product portfolios by comparing the sales quarterly or yearly.
Customer Analysis − Customer analysis is done by analyzing the customer's buying preferences,
buying time, budget cycles, etc.
Operations Analysis − Data warehousing also helps in customer relationship management, and
making environmental corrections. The information also allows us to analyze business
operations.
Query-driven Approach
Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was used to build
wrappers and integrators on top of multiple heterogeneous databases. These integrators are also
known as mediators.
Now these queries are mapped and sent to the local query processor.
The results from heterogeneous sites are integrated into a global answer set.
Disadvantages
Query-driven approach needs complex integration and filtering processes.
This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-
driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the
information from multiple heterogeneous sources is integrated in advance and stored in a
warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
The data is copied, processed, integrated, annotated, summarized, and restructured in the semantic
data store in advance.
Query processing does not require an interface to process data at local sources.
Data Transformation − Involves converting the data from legacy format to warehouse format.
Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building
indices and partitions.
Note − Data cleaning and data transformation are important steps in improving the quality of data and
data mining results.
Metadata
Metadata is simply defined as data about data. The data that are used to represent other data is known
as metadata. For example, the index of a book serves as a metadata for the contents in the book. In
other words, we can say that metadata is the summarized data that leads us to the detailed data.
Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It contains the following metadata
−
Business metadata − It contains the data ownership information, business definition, and
changing policies.
Operational metadata − It includes currency of data and data lineage. Currency of data refers
to the data being active, archived, or purged. Lineage of data means history of data migrated
and transformation applied on it.
Data for mapping from operational environment to data warehouse − This metadata includes the
source databases and their contents, data extraction, data partitioning, cleaning, transformation
rules, and data refresh and purging rules.
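A repository entry covering those three kinds of metadata can be sketched as a plain dictionary. The structure and field names below are hypothetical, chosen only to mirror the categories listed above:

```python
# Illustrative metadata-repository entry (all field names are hypothetical).
metadata_entry = {
    "business": {                       # business metadata
        "owner": "Sales Ops",
        "definition": "Monthly gross sales per region",
    },
    "operational": {                    # operational metadata
        "currency": "active",           # active / archived / purged
        "lineage": ["crm.orders -> staging.orders -> dw.fact_sales"],
    },
    "mapping": {                        # operational-to-warehouse mapping
        "source_database": "crm",
        "transformation_rules": ["trim names", "normalize currency to USD"],
        "refresh_rule": "nightly",
    },
}
```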
Data Cube
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The
dimensions are the entities with respect to which an enterprise preserves the records.
The following table represents the 2-D view of Sales Data for a company with respect to time, item, and
location dimensions.
But here in this 2-D table, we have records with respect to time and item only. The sales for New Delhi
are shown with respect to time, and item dimensions according to type of items sold. If we want to
view the sales data with one more dimension, say, the location dimension, then the 3-D view would be
useful. The 3-D view of the sales data with respect to time, item, and location is shown in the table
below −
The above 3-D table can be represented as 3-D data cube as shown in the following figure −
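The relationship between the 3-D cube and its 2-D views can be sketched in Python: facts keyed by (time, item, location), with a 2-D view obtained by fixing one dimension. The sample figures are invented for illustration.

```python
# A tiny data cube: sales facts keyed by (time, item, location) dimensions.
cube = {
    ("Q1", "keyboard", "New Delhi"): 620,
    ("Q1", "keyboard", "Mumbai"): 410,
    ("Q2", "keyboard", "New Delhi"): 580,
    ("Q1", "mouse", "New Delhi"): 300,
}

def slice_2d(cube, location):
    """Project the 3-D cube to a 2-D (time, item) view for one location."""
    return {
        (t, i): sales
        for (t, i, loc), sales in cube.items()
        if loc == location
    }

delhi_view = slice_2d(cube, "New Delhi")  # the 2-D table for New Delhi
```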
Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an
organization. In other words, a data mart contains only the data that is specific to a particular group.
For example, the marketing data mart may contain only data related to items, customers, and sales.
Data marts are confined to subjects.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks
rather than months or years.
The life cycle of data marts may be complex in the long run, if their planning and design are not
organization-wide.
Virtual Warehouse
The view over an operational data warehouse is known as virtual warehouse. It is easy to build a virtual
warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
A data warehouse is never static; it evolves as the business expands. As the business evolves, its
requirements keep changing and therefore a data warehouse must be designed to ride with these
changes. Hence a data warehouse system needs to be flexible.
Ideally there should be a delivery process to deliver a data warehouse. However, data warehouse
projects normally suffer from various issues that make it difficult to complete tasks and deliverables in
the strict and ordered fashion demanded by the waterfall method. Most of the time, the requirements
are not understood completely. The architectures, designs, and build components can be completed
only after gathering and studying all the requirements.
Delivery Method
The delivery method is a variant of the joint application development approach adopted for the
delivery of a data warehouse. We have staged the data warehouse delivery process to minimize risks.
The approach that we will discuss here does not reduce the overall delivery time-scales but ensures the
business benefits are delivered incrementally through the development process.
Note − The delivery process is broken into phases to reduce the project and delivery risk.
IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits. An IT
strategy is required to procure and retain funding for the project.
Business Case
The objective of business case is to estimate business benefits that should be derived from using a data
warehouse. These benefits may not be quantifiable but the projected benefits need to be clearly
stated. If a data warehouse does not have a clear business case, then the business tends to suffer from
credibility problems at some stage during the delivery process. Therefore in data warehouse projects,
we need to understand the business case for investment.
The prototype can be thrown away after the feasibility concept has been shown.
The activity addresses a small subset of eventual data content of the data warehouse.
The following points are to be kept in mind to produce an early release and deliver business benefits.
Limit the scope of the first build phase to the minimum that delivers business benefits.
Business Requirements
To provide quality deliverables, we should make sure the overall requirements are understood. If we
understand the business requirements for both short-term and medium-term, then we can design a
solution to fulfil short-term requirements. The short-term solution can then be grown to a full solution.
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term requirements. This phase
also delivers the components that must be implemented in the short term to derive any business benefit.
The blueprint needs to identify the following.
History Load
This is the phase where the remainder of the required history is loaded into the data warehouse. In this
phase, we do not add new entities, but additional physical tables would probably be created to store
increased data volumes.
Let us take an example. Suppose the build version phase has delivered a retail sales analysis data
warehouse with 2 months’ worth of history. This information will allow the user to analyze only the
recent trends and address the short-term issues. The user in this case cannot identify annual and
seasonal trends. To help him do so, last 2 years’ sales history could be loaded from the archive. Now
the 40GB data is extended to 400GB.
Note − The backup and recovery procedures may become complex, therefore it is recommended to
perform this activity within a separate phase.
Ad hoc Query
In this phase, we configure an ad hoc query tool that is used to operate a data warehouse. These tools
can generate the database query.
Note − It is recommended not to use these access tools when the database is being substantially
modified.
Automation
In this phase, operational management processes are fully automated. These would include −
Extending Scope
In this phase, the data warehouse is extended to address a new set of business requirements. The
scope can be extended in two ways −
Note − This phase should be performed separately, since it involves substantial efforts and complexity.
Requirements Evolution
From the perspective of delivery process, the requirements are always changeable. They are not static.
The delivery process must support this and allow these changes to be reflected within the system.
This issue is addressed by designing the data warehouse around the use of data within business
processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs, the process operates as
a pseudo-application development process, where the new requirements are continually fed into the
development activities and the partial deliverables are produced. These partial deliverables are fed
back to the users and then reworked ensuring that the overall system is continually updated to meet
the business needs.
In this chapter, we will discuss how to build data warehousing solutions on top of open-system
technologies like Unix and relational databases.
Note − Before loading the data into the data warehouse, the information extracted from the external
sources must be reconstructed.
Note − Consistency checks are executed only when all the data sources have been loaded into the
temporary data store.
Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the
data consistent −
within itself.
with other data within the same data source.
with the data in other source systems.
with the existing data present in the warehouse.
Transforming involves converting the source data into a structured form. Structuring the data increases
query performance and decreases the operational cost. The data contained in a data warehouse must
be transformed to support performance requirements and control the ongoing operational costs.
Aggregation
Aggregation is required to speed up common queries. Aggregation relies on the fact that most common
queries will analyze a subset or an aggregation of the detailed data.
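The speed-up can be illustrated with a pre-computed aggregate table: common month-level queries read one summarized row instead of scanning all detail rows (the sample figures are invented).

```python
# Detail rows: (month, amount) for every individual sale.
detail = [
    ("2023-01", 100), ("2023-01", 150),
    ("2023-02", 200), ("2023-02", 50),
]

# Pre-compute the aggregation once, ahead of query time.
monthly_totals = {}
for month, amount in detail:
    monthly_totals[month] = monthly_totals.get(month, 0) + amount

# A month-level query now reads one aggregated row, not every detail row.
jan_total = monthly_totals["2023-01"]
```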
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years,
with the latest 6 months of data being kept online. In such a scenario, there is often a requirement to be able
to do month-on-month comparisons for this year and last year. In this case, we require some data to be
restored from the archive.
ensures that all the system sources are used in the most effective way.
The information generated in this process is used by the warehouse management process to determine
which aggregations to generate. This process does not generally operate during the regular load of
information into data warehouse.
Since a data warehouse can gather information quickly and efficiently, it can enhance business
productivity.
A data warehouse provides us a consistent view of customers and items; hence, it helps us
manage customer relationships.
A data warehouse also helps in bringing down the costs by tracking trends, patterns over a long
period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the business
needs and construct a business analysis framework. Each person has different views regarding the
design of a data warehouse. These views are as follows −
The top-down view − This view allows the selection of relevant information needed for a data
warehouse.
The data source view − This view presents the information being captured, stored, and
managed by the operational system.
The data warehouse view − This view includes the fact tables and dimension tables. It
represents the information stored inside the data warehouse.
The business query view − It is the view of the data from the viewpoint of the end-user.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either of
the following ways.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting
tools, analysis tools and data mining tools.
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a
virtual warehouse. Building a virtual warehouse requires excess capacity on operational database
servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups
of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For example,
the marketing data mart may contain data related to items, customers, and sales. Data marts are
confined to subjects.
Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
The life cycle of a data mart may be complex in long run, if its planning and design are not
organization-wide.
Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an entire organization.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager varies between specific solutions from one data warehouse to another.
Perform simple transformations into structure similar to the one in the data warehouse.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
It is more effective to load the data into a relational database prior to applying transformations and checks.
Gateway technology proves to be unsuitable, since gateways tend not to be performant when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
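These two checks can be sketched as follows; the column names are illustrative assumptions, not a real EPOS layout.

```python
# Sketch of the two load-time checks above: drop columns the warehouse does
# not need, then coerce the remaining values to their required types.
REQUIRED = {"item_key": int, "units": int, "price": float}

def clean(raw: dict) -> dict:
    # strip out columns not required within the warehouse, and convert types
    return {col: typ(raw[col]) for col, typ in REQUIRED.items()}

record = clean({"item_key": "30", "units": "5", "price": "3.67", "till_id": "9"})
```

Only the required, correctly typed columns reach the warehouse; unwanted columns such as the hypothetical till_id are dropped at load time.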
Warehouse Manager
A warehouse manager is responsible for the warehouse management process. It consists of third-party
system software, C programs, and shell scripts.
The size and complexity of warehouse managers varies between specific solutions.
A warehouse manager analyzes the data to perform consistency and referential integrity
checks.
Creates indexes, business views, partition views against the base data.
Transforms and merges the source data into the published data warehouse.
Archives the data that has reached the end of its captured life.
Note − A warehouse manager also analyzes query profiles to determine whether the indexes and aggregations are appropriate.
Query Manager
Query manager is responsible for directing the queries to the suitable tables.
By directing the queries to appropriate tables, the speed of querying and response generation
can be increased.
Query manager is responsible for scheduling the execution of the queries posed by the user.
Detailed Information
Detailed information is not kept online, rather it is aggregated to the next level of detail and then
archived to tape. The detailed information part of data warehouse keeps the detailed information in
the starflake schema. Detailed information is loaded into the data warehouse to supplement the
aggregated data.
The following diagram shows a pictorial impression of where detailed information is stored and how it
is used.
Note − If detailed information is held offline to minimize disk storage, we should make sure that the
data has been extracted, cleaned up, and transformed into starflake schema before it is archived.
Summary Information
Summary Information is a part of data warehouse that stores predefined aggregations. These
aggregations are generated by the warehouse manager. Summary Information must be treated as
transient. It changes on-the-go in order to respond to the changing query profiles.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not have been backed up, since it can be generated fresh from the detailed information.
access to information. This chapter covers the types of OLAP, operations on OLAP, the difference between OLAP and statistical databases, and the difference between OLAP and OLTP.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed data. The aggregations are stored separately in a MOLAP store.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP operations in
multidimensional data.
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city
to the level of country.
When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter to the level of
month.
When drill-down is performed, one or more dimensions from the data cube are added.
It navigates the data from less detailed data to highly detailed data.
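Roll-up along a concept hierarchy can be sketched as follows; the hierarchy and figures are illustrative assumptions. Drill-down is simply the reverse, reading the finer city-level cells.

```python
# Sketch: roll-up ascends the location hierarchy (city -> country) by
# re-aggregating the cells; the cities and amounts are illustrative.
from collections import defaultdict

city_to_country = {"Vancouver": "Canada", "Victoria": "Canada", "Delhi": "India"}
sales_by_city = {"Vancouver": 100, "Victoria": 50, "Delhi": 75}

# roll-up: aggregate city-level cells up to the country level
sales_by_country = defaultdict(int)
for city, amount in sales_by_city.items():
    sales_by_country[city_to_country[city]] += amount
```

After the roll-up, the city dimension level has effectively been removed; drilling down means returning to sales_by_city for the detailed view.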
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the
following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three dimensions.
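Representing the cube as tuples, slice and dice can be sketched as follows; the dimension values are illustrative assumptions.

```python
# Sketch: a tiny cube as (time, item, location, units) tuples.
cube = [
    ("Q1", "mobile", "Delhi", 10), ("Q1", "modem", "Mumbai", 4),
    ("Q2", "mobile", "Delhi", 7),  ("Q2", "modem", "Delhi", 3),
]

# slice: select on one dimension, e.g. time = "Q1"
q1_slice = [row for row in cube if row[0] == "Q1"]

# dice: select on two or more dimensions, e.g. time and location
dice = [row for row in cube if row[0] in ("Q1", "Q2") and row[2] == "Delhi"]
```

Both operations return a sub-cube; slice fixes a single dimension value, while dice applies criteria on two or more dimensions at once.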
OLAP vs OLTP
OLAP systems are used by knowledge workers such as executives, managers, and analysts, whereas OLTP systems are used by clerks, DBAs, or database professionals.
OLAP provides summarized and consolidated data, whereas OLTP provides primitive and highly detailed data.
Relational OLAP servers are placed between relational back-end server and client front-end tools. To
store and manage the warehouse data, the relational OLAP uses relational or extended-relational
DBMS.
ROLAP tools store and analyze highly volatile and changeable data.
Database server
ROLAP server
Front-end tool.
Advantages
Points to Remember −
MOLAP tools process information with consistent response time regardless of the level of summarization or calculations selected.
MOLAP tools need to avoid many of the complexities of creating a relational database to store
data for analysis.
MOLAP servers adopt two levels of storage representation to handle dense and sparse data sets.
MOLAP Architecture
MOLAP includes the following components −
Database server.
MOLAP server.
Front-end tool.
Advantages
MOLAP is best suited for inexperienced users, since it is very easy to use, whereas ROLAP is best suited for experienced users.
MOLAP maintains a separate database for data cubes, whereas ROLAP may not require space other than that available in the data warehouse.
Star Schema
Each dimension in a star schema is represented with only one dimension table.
The following diagram shows the sales data of a company with respect to the four dimensions,
namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes. For
example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state,country}. This constraint may cause data redundancy. For example, "Vancouver" and
"Victoria" both the cities are in the Canadian province of British Columbia. The entries for such cities
may cause data redundancy along the attributes province_or_state and country.
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
Unlike in the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and
supplier-key.
The supplier key is linked to the supplier dimension table. The supplier dimension table contains
the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, the redundancy is reduced; therefore, it becomes easy to maintain and saves storage space.
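The normalization step described above can be sketched as follows; the rows and attribute values are illustrative assumptions.

```python
# Sketch: normalize the star-schema item table by splitting the supplier
# attributes into their own table. All data values are illustrative.
star_item = [
    {"item_key": 1, "item_name": "pen", "brand": "A", "type": "stationery",
     "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "ink", "brand": "B", "type": "stationery",
     "supplier_key": 10, "supplier_type": "wholesale"},
]

# after normalization: item keeps only supplier_key; supplier_type moves out
item = [{k: r[k] for k in ("item_key", "item_name", "brand", "type", "supplier_key")}
        for r in star_item]
supplier = {r["supplier_key"]: {"supplier_key": r["supplier_key"],
                                "supplier_type": r["supplier_type"]}
            for r in star_item}
```

The repeated supplier_type value is now stored once in the supplier table, which is exactly the redundancy reduction the snowflake schema provides.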
The following diagram shows two fact tables, namely sales and shipping.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
The shipping fact table also contains two measures, namely dollars cost and units shipped.
It is also possible to share dimension tables between fact tables. For example, time, item, and
location dimension tables are shared between the sales and shipping fact table.
Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives,
cube definition and dimension definition, can be used for defining the data warehouses and data
marts.
define cube < cube_name > [ < dimension_list > ]: < measure_list >
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state,country)
define cube shipping [time, item, shipper, from location, to location]:
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data.
Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to
load and also enhances the performance of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as
read-only. We can then put these partitions into a state where they cannot be modified. Then they can
be backed up. It means only the current partition is to be backed up.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query
performance is enhanced because now the query scans only those partitions that are relevant. It does
not have to scan the whole data.
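Partition pruning can be sketched as follows: with the fact table partitioned by month, a one-month query touches only the matching partition. The data is an illustrative assumption.

```python
# Sketch: a fact table horizontally partitioned by month; a query for one
# month scans only the relevant partition instead of the whole table.
partitions = {
    "2013-08": [("2013-08-03", 5), ("2013-08-20", 3)],
    "2013-09": [("2013-09-03", 4), ("2013-09-15", 7)],
}

def units_in_month(month: str) -> int:
    # only the matching partition is scanned; others are never touched
    return sum(units for _, units in partitions.get(month, []))
```

Real database engines do this pruning automatically from the partition key in the query predicate; the point is that the scan cost is proportional to the relevant partitions, not to the whole fact table.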
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to
keep in mind the requirements for manageability of the data warehouse.
Points to Note
The detailed information remains available online.
The number of physical tables is kept relatively small, which reduces the operating cost.
This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.
This technique is not useful where the partitioning profile changes on a regular basis, because
repartitioning will increase the operation cost of data warehouse.
Suppose a market function has been structured into distinct regional departments, for example on a state-by-state basis. If each region wants to query information captured within its region, it would prove more effective to partition the fact table into regional partitions. This will speed up the queries, because they do not require scanning information that is not relevant.
Points to Note
The query does not have to scan irrelevant data which speeds up the query process.
This technique is not appropriate where the dimensions are likely to change in the future. So, it is worth determining that the dimension will not change before partitioning on it.
If the dimension changes, then the entire fact table would have to be repartitioned.
Note − We recommend performing the partition only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
Points to Note
This partitioning is complex to manage.
Partitioning Dimensions
If a dimension contains a large number of entries, then it is required to partition the dimensions. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply
comparisons, that dimension may be very large. This would definitely affect the response time.
In the round-robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow the user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
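The round-robin rotation can be sketched as follows; the table names and the metadata mapping are illustrative assumptions.

```python
# Sketch: round-robin time partitions. When a new period starts, the oldest
# partition is archived and metadata re-points queries at the right table.
from collections import deque

MAX_PARTITIONS = 3
live = deque(["sales_2013_07", "sales_2013_08", "sales_2013_09"])
archived = []

def add_partition(name: str) -> None:
    if len(live) == MAX_PARTITIONS:
        archived.append(live.popleft())  # archive the oldest partition
    live.append(name)

add_partition("sales_2013_10")
# metadata maps each month to the physical table currently holding it
metadata = {month: table for month, table in
            zip(["2013-08", "2013-09", "2013-10"], live)}
```

Because user access tools resolve table names through the metadata, the archive/create cycle can be fully automated without changing any queries.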
Vertical Partition
Vertical partitioning splits the data vertically. The following image depicts how vertical partitioning is done.
Normalization
Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduces space. Take a look at the following tables that show how normalization is performed.
Store table (inferred columns: store_id, store_name, location, region) −
16   sunny   Bangalore   W
64   san     Mumbai      S
Sales table (inferred columns; the last column references store_id) −
30   5   3.67   3-Aug-13   16
35   4   5.33   3-Sep-13   16
40   5   2.50   3-Sep-13   64
45   7   5.66   3-Sep-13   16
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to
speed up the access to large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major
join operation between two partitions.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. The two possible keys could be
region
transaction_date
Suppose the business is organized into 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transaction from every region will
be in one partition. Now the user who wants to look at data within his own region has to query across
multiple partitions.
In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of data warehouse, we can define metadata as follows.
Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business Metadata − It has the data ownership information, business definition, and changing
policies.
Technical Metadata − It includes database system names, table and column names and sizes,
data types and allowed values. Technical metadata also includes structural information such as
primary and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migration and the transformations applied on it.
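The three categories can be illustrated as one metadata record for a single warehouse table; all field values here are assumptions for illustration.

```python
# Sketch: the three metadata categories expressed as one record for a single
# (hypothetical) warehouse table. All values are illustrative assumptions.
sales_fact_metadata = {
    "business": {"owner": "marketing",
                 "definition": "one row per till receipt"},
    "technical": {"table": "sales_fact",
                  "columns": {"units_sold": "INTEGER"}},
    "operational": {"currency": "active",   # active / archived / purged
                    "lineage": ["extracted from EPOS", "cleaned", "loaded"]},
}
```

A real metadata repository stores many such records and makes them queryable, but the three-way split above is the essential shape.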
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is
different from the warehouse data, yet it plays an important role. The various roles of metadata are
explained below.
This directory helps the decision support system to locate the contents of the data warehouse.
Metadata helps in decision support system for mapping of data when data is transformed from
operational environment to data warehouse environment.
Metadata helps in summarization between current detailed data and highly summarized data.
Metadata also helps in summarization between lightly detailed data and highly summarized
data.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following metadata −
Definition of data warehouse − It includes the description of structure of data warehouse. The
description is defined by schema, view, hierarchies, derived data definitions, and data mart
locations and contents.
Business metadata − It contains the data ownership information, business definition, and changing policies.
Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migration and the transformations applied on it.
Data for mapping from operational environment to data warehouse − It includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, data refresh and purging rules.
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformation, and ensures the accuracy of calculations. Metadata also enforces the definitions of business terms for business end-users. With all these uses, metadata also has its challenges. Some of the challenges are discussed below.
Metadata in a big organization is scattered across the organization. This metadata is spread in
spreadsheets, databases, and applications.
Metadata could be present in text files or multimedia files. To use this data for information
management solutions, it has to be correctly defined.
There are no industry-wide accepted standards. Data management solution vendors have
narrow focus.
Note − Do not create a data mart for any other reason, since the operational cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.
Consider a retail organization, where each merchant is accountable for maximizing the sales of a group
of products. For this, the following are the valuable information −
Given below are the issues to be taken into account while determining the functional split −
The merchant could query the sales trend of other products to analyze what is happening to the
sales.
Note − We need to determine the business benefits and technical feasibility of using a data mart.
There are some tools that populate directly from the source system, but some cannot. Therefore, additional requirements outside the scope of the tool need to be identified for the future.
Note − In order to ensure consistency of data across all access tools, the data should not be directly
populated from the data warehouse, rather each tool must have its own data mart.
Data marts allow us to build a complete wall by physically separating data segments within the data
warehouse. To avoid possible privacy problems, the detailed data can be removed from the data
warehouse. We can create data mart for each legal entity and load it via data warehouse, with detailed
account data.
The summaries are data marted in the same way as they would have been designed within the data
warehouse. Summary tables help to utilize all dimension data in the starflake schema.
data and the data mart exist within the data warehouse, then we would face additional cost to store
and manage replicated data.
Note − Data marting is more expensive than aggregations, therefore it should be used as an additional
strategy and not as an alternative strategy.
Network Access
A data mart could be on a different location from the data warehouse, so we should ensure that the
LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load
process.
Network capacity.
Time window available
Volume of data being transferred
Mechanisms being used to insert data into a data mart
The structure of configuration manager varies from one operating system to another.
The interface of configuration manager allows us to control all aspects of the system.
Some important jobs that a scheduler must be able to handle are as follows −
Data load
Data processing
Index creation
Backup
Aggregation creation
Data transformation
Note − If the data warehouse is running on a cluster or MPP architecture, then the system scheduling
manager must be capable of running across the architecture.
Note − The Event manager monitors the events occurrences and deals with them. The event manager
also tracks the myriad of things that can go wrong on this complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.
Hardware failure
Running out of space on certain key disks
A process dying
A process returning an error
CPU usage exceeding an 80% threshold
Internal contention on database serialization points
Buffer cache hit ratios exceeding or falling below the threshold
A table reaching the maximum of its size
Scheduling
Backup data tracking
Database awareness
Backups are taken only to protect against data loss. Following are the important points to remember −
The backup software will keep some form of database of where and when the piece of data was
backed up.
The backup recovery manager must have a good front-end to that database.
Being aware of the database, the software then can be addressed in database terms, and will
not perform backups that would not be viable.
Load manager
Warehouse manager
Query manager
Data Warehouse Load Manager
Load manager performs the operations required to extract and load the data into the database. The
size and complexity of a load manager varies between specific solutions from one data warehouse to
another.
Perform simple transformations into structure similar to the one in the data warehouse.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in
the fastest possible time.
It is more effective to load the data into a relational database prior to applying transformations
and checks.
Gateway technology is not suitable, since gateways are inefficient when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. After completing the simple transformations, we can do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks −
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.
Creates indexes, business views, partition views against the base data.
Generates normalizations.
Transforms and merges the source data of the temporary store into the published data
warehouse.
Archives the data that has reached the end of its captured life.
Note − A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing the queries
to appropriate tables, it speeds up the query request and response process. In addition, the query
manager is responsible for scheduling the execution of the queries posted by the user.
It stores query profiles to allow the warehouse manager to determine which indexes and
aggregations are appropriate.
The objective of a data warehouse is to make large amounts of data easily accessible to the users,
hence allowing the users to extract information about the business as a whole. But we know that there
could be some security restrictions applied on the data that can be an obstacle for accessing the
information. If the analyst has a restricted view of data, then it is impossible to capture a complete
picture of the trends within the business.
The data from each analyst can be summarized and passed on to management, where the different summaries can be aggregated. As an aggregation of summaries need not be the same as an aggregation over the data as a whole, it is possible to miss some information trends in the data unless someone is analyzing the data as a whole.
Security Requirements
Adding security features affects the performance of the data warehouse; therefore, it is important to determine the security requirements as early as possible. It is difficult to add security features after the data warehouse has gone live.
During the design phase of the data warehouse, we should keep in mind what data sources may be
added later and what would be the impact of adding those data sources. We should consider the
following possibilities during the design phase.
Will the new data sources require new security and/or audit restrictions to be implemented?
Will new users be added who have restricted access to data that is already generally available?
This situation arises when the future users and the data sources are not well known. In such a situation,
we need to use the knowledge of business and the objective of data warehouse to know likely
requirements.
User access
Data load
Data movement
Query generation
User Access
We need to first classify the data and then classify the users on the basis of the data they can access. In
other words, the users are classified according to the data they can access.
Data Classification
Data can be classified according to its sensitivity. Highly-sensitive data is classified as highly restricted and less-sensitive data is classified as less restricted.
Data can also be classified according to the job function. This restriction allows only specific
users to view particular data. Here we restrict the users to view only that part of the data in
which they are interested and are responsible for.
There are some issues in the second approach. To understand, let's have an example. Suppose you are
building the data warehouse for a bank. Consider that the data being stored in the data warehouse is
the transaction data for all the accounts. The question here is, who is allowed to see the transaction
data. The solution lies in classifying the data according to the function.
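Classification by job function can be sketched as follows; the functions and rows are illustrative assumptions.

```python
# Sketch: classify account transactions by job function so that each user
# sees only the part they are responsible for. Rows are illustrative.
rows = [
    {"account": 1, "function": "loans",   "value": 500},
    {"account": 2, "function": "savings", "value": 200},
    {"account": 3, "function": "loans",   "value": 300},
]

def visible_to(user_function: str):
    # a user sees only the transactions for their own job function
    return [r for r in rows if r["function"] == user_function]

loans_view = visible_to("loans")
```

In a real warehouse this filter would be enforced by views or row-level security rather than application code, but the classification principle is the same.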
User classification
Users can be classified as per the hierarchy of users in an organization, i.e., users can be
classified by departments, sections, groups, and so on.
Users can also be classified according to their role, with people grouped across departments
based on their role.
Let's have an example of a data warehouse where the users are from the sales and marketing departments.
We can have security by top-to-down company view, with access centered on the different
departments. But there could be some restrictions on users at different levels. This structure is shown
in the following diagram.
But if each department accesses different data, then we should design the security access for each
department separately. This can be achieved by departmental data marts. Since these data marts are
separated from the data warehouse, we can enforce separate security restrictions on each data mart.
This approach is shown in the following figure.
If the data is generally available to all the departments, then it is useful to follow the role access
hierarchy. In other words, if the data is generally accessed by all the departments, then apply security
restrictions as per the role of the user. The role access hierarchy is shown in the following figure.
Audit Requirements
Auditing is a subset of security and a costly activity. Auditing can cause heavy overheads on the system. To complete an audit in time, we require more hardware; therefore, it is recommended that, wherever possible, auditing should be switched off. Audit requirements can be categorized as follows −
Connections
Disconnections
Data access
Data change
Note − For each of the above-mentioned categories, it is necessary to audit success, failure, or both. From a security perspective, the auditing of failures is very important, because failures can highlight unauthorized or fraudulent access.
Network Requirements
Network security is as important as other types of security. We cannot ignore the network security requirement. We need to consider the following issues −
Are there restrictions on which network routes the data can take?
These restrictions need to be considered carefully. Following are the points to remember −
The process of encryption and decryption will increase overheads. It will require more processing power and processing time.
The cost of encryption can be high if the system is already heavily loaded, because the cost of encryption is borne by the source system.
Data Movement
There exist potential security implications while moving the data. Suppose we need to transfer some
restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the
following questions are raised −
Documentation
The audit and security requirements need to be properly documented. This will be treated as a part of
justification. This document can contain all the information gathered from −
Data classification
User classification
Network requirements
Data movement and storage requirements
All auditable actions
Impact of Security on Design
Security affects the application code and the development timescales. Security affects the following
areas −
Application development
Database design
Testing
Application Development
Security affects the overall application development and it also affects the design of the important
components of the data warehouse such as load manager, warehouse manager, and query manager.
The load manager may require checking code to filter records and place them in different locations.
More transformation rules may also be required to hide certain data, and extra metadata may be
needed to handle any extra objects.
To create and maintain extra views, the warehouse manager may require extra code to enforce
security. Extra checks may have to be coded into the data warehouse to prevent it from being fooled
into moving data into a location where it should not be available. The query manager requires changes
to handle any access restrictions, and it will need to be aware of all extra views and aggregations.
Database Design
The database layout is also affected because when security measures are implemented, there is an
increase in the number of views and tables. Adding security increases the size of the database and
hence increases the complexity of the database design and management. It will also add complexity to
the backup management and recovery plan.
Testing
Testing the data warehouse is a complex and lengthy process. Adding security to the data warehouse
also affects the testing time complexity. It affects the testing in the following two ways −
It will increase the time required for integration and system testing.
There is added functionality to be tested which will increase the size of the testing suite.
Backup Terminologies
Before proceeding further, you should know some of the backup terminologies discussed below.
Complete backup − It backs up the entire database at the same time. This backup includes all
the database files, control files, and journal files.
Partial backup − As the name suggests, it does not create a complete backup of the database.
Partial backups are very useful for large databases because they allow a strategy whereby various
parts of the database are backed up in a round-robin fashion on a day-to-day basis, so that the
whole database is backed up effectively once a week.
Cold backup − Cold backup is taken while the database is completely shut down. In a multi-
instance environment, all the instances should be shut down.
Hot backup − Hot backup is taken when the database engine is up and running. The
requirements of hot backup vary from RDBMS to RDBMS.
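The round-robin partial backup strategy described above can be sketched as a simple schedule; the database "part" names and the seven-day cycle are assumptions for illustration, not from the text.

```python
# Sketch of a round-robin partial backup schedule. Each day backs up
# one part of the database; a full rotation covers the whole database.
from itertools import cycle

def build_schedule(parts, days):
    """Assign one database part to each day, cycling through the parts."""
    rotation = cycle(parts)
    return {day: next(rotation) for day in days}

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
parts = ["fact_sales", "dim_customer", "dim_product", "dim_store",
         "dim_time", "aggregates", "metadata"]

schedule = build_schedule(parts, days)
# With seven parts and seven days, every part is backed up exactly once
# a week, matching the "effectively once a week" goal described above.
```

With more parts than days in the list, the rotation would simply take more than one week to cover the whole database.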
Hardware Backup
It is important to decide which hardware to use for the backup. The speed of processing the backup
and restore depends on the hardware being used, how the hardware is connected, bandwidth of the
network, backup software, and the speed of server's I/O system. Here we will discuss some of the
hardware choices that are available and their pros and cons. These choices are as follows −
Tape Technology
Disk Backups
Tape Technology
The tape choice can be categorized as follows −
Tape media
Standalone tape drives
Tape stackers
Tape silos
Tape Media
There exist several varieties of tape media. Some tape media standards are listed below −
DLT − 40 GB capacity, 3 MB/s transfer rate
8 mm − 14 GB capacity, 1 MB/s transfer rate
Standalone Tape Drives
A tape drive can be connected directly to a server node or made available over the network. Either
choice raises issues −
Suppose the server is a 48-node MPP machine. We may not know which node to connect the
tape drive to, nor how to spread the drives over the server nodes to get optimal
performance with the least disruption of the server and the least internal I/O latency.
Connecting the tape drive as a network-available device requires the network to be up to the
job of the huge data transfer rates. Make sure that sufficient bandwidth is available during the
time you require it.
Tape Stackers
A tape stacker is a device that loads multiple tapes into a single tape drive. The stacker
dismounts the current tape when it has finished with it and loads the next tape, so only one tape is
accessible at a time. Prices and capabilities may vary, but the common ability is
that they can perform unattended backups.
Tape Silos
Tape silos provide large storage capacities. Tape silos can store and manage thousands of tapes. They can
integrate multiple tape drives. They have the software and hardware to label and store the tapes they
store. It is very common for the silo to be connected remotely over a network or a dedicated link. We
should ensure that the bandwidth of the connection is up to the job.
Disk Backups
Methods of disk backups are −
Disk-to-disk backups
Mirror breaking
These methods are used in the OLTP system. These methods minimize the database downtime and
maximize the availability.
Disk-to-Disk Backups
Here the backup is taken on disk rather than on tape. Disk-to-disk backups are done for the following
reasons −
Mirror Breaking
The idea is to have disks mirrored for resilience during the working day. When backup is required, one
of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.
Note − The database may need to be shutdown to guarantee consistency of the backup.
Optical Jukeboxes
Optical jukeboxes allow the data to be stored near-line. This technique allows a large number of optical
disks to be managed in the same way as a tape stacker or a tape silo. The drawback of this technique is
that optical disks have slower write speeds than magnetic disks, but the optical media provides the
long life and reliability that make them a good choice of medium for archiving.
Software Backups
There are software tools available that help in the backup process. These software tools come as a
package. These tools not only take backup, they can effectively manage and control the backup
strategies. There are many software packages available in the market. Some of them are listed in the
following table −
Networker − Legato
ADSM − IBM
Omniback II − HP
Alexandria − Sequent
It is very difficult to predict what query the user is going to post in the future.
Performance Assessment
Here is a list of objective measures of performance −
It is of no use trying to tune response times if they are already better than those required.
To hide the complexity of the system from the user, aggregations and views should be used.
It is also possible that the user can write a query you had not tuned for.
Note − If there is a delay in transferring the data, or in arrival of data then the entire system is affected
badly. Therefore it is very important to tune the data load first.
There are various approaches of tuning data load that are discussed below −
The very common approach is to insert data using the SQL Layer. In this approach, normal
checks and constraints need to be performed. When the data is inserted into the table, the
code will run to check for enough space to insert the data. If sufficient space is not available,
then more space may have to be allocated to these tables. These checks take time to perform
and are costly in CPU.
The second approach is to bypass all these checks and constraints and place the data directly
into the preformatted blocks. These blocks are later written to the database. It is faster than
the first approach, but it can work only with whole blocks of data. This can lead to some space
wastage.
The third approach is to maintain the indexes while loading the data into a table that already
contains data.
The fourth approach says that to load the data in tables that already contain data, drop the
indexes & recreate them when the data load is complete. The choice between the third and
the fourth approach depends on how much data is already loaded and how many indexes need
to be rebuilt.
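The fourth approach can be sketched with SQLite standing in for the warehouse RDBMS; the table, index, and sample rows are invented for the example.

```python
# Sketch of the drop-indexes / bulk-load / rebuild-indexes approach.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, store TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_day ON sales (day)")

rows = [("2024-01-01", "S1", 100.0), ("2024-01-01", "S2", 80.0)]

conn.execute("DROP INDEX idx_sales_day")                      # drop first
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)  # bulk load
conn.execute("CREATE INDEX idx_sales_day ON sales (day)")     # rebuild once

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Rebuilding once after the load avoids maintaining the index on every inserted row, which is the trade-off against the third approach described above.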
Integrity Checks
Integrity checking highly affects the performance of the load. Following are the points to remember −
Integrity checks need to be limited because they require heavy processing power.
Integrity checks should be applied on the source system to avoid performance degradation of the
data load.
Tuning Queries
We have two kinds of queries in data warehouse −
Fixed queries
Ad hoc queries
Fixed Queries
Fixed queries are well defined. Following are the examples of fixed queries −
Regular reports
Canned queries
Common aggregations
Tuning the fixed queries in a data warehouse is the same as in a relational database system. The only
difference is that the amount of data to be queried may be different. It is good to store the most
successful execution plan while testing fixed queries. Storing these execution plans will allow us to spot
changes in data size and data skew, as these will cause the execution plan to change.
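One way to store and compare execution plans can be sketched with SQLite's EXPLAIN QUERY PLAN, standing in for whatever plan facility the warehouse RDBMS provides; the table and index names are assumptions.

```python
# Capture a query's execution plan so later captures can be diffed
# against it; a changed plan hints at changed data size or skew.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (day TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_day ON fact_sales (day)")

def capture_plan(conn, sql):
    """Return the plan detail lines for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

baseline = capture_plan(
    conn, "SELECT SUM(amount) FROM fact_sales WHERE day = '2024-01-01'")
# Persist `baseline` alongside the test results; re-capture after each
# load and compare the two to spot a changed plan.
```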
Note − We cannot do much on the fact table, but while dealing with dimension tables or the
aggregations, the usual collection of SQL tweaking, storage mechanisms, and access methods can be
used to tune these queries.
Ad hoc Queries
To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For
each user or group of users, you need to know the following −
It is important to track the user's profiles and identify the queries that are run on a regular
basis.
It is also important that the tuning performed for these queries does not degrade the performance of others.
If these queries are identified, then the database will change and new indexes can be added for
those queries.
If these queries are identified, then new aggregations can be created specifically for those
queries that would result in their efficient execution.
Unit testing
Integration testing
System testing
Unit Testing
In unit testing, each component is separately tested.
Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested individually.
Integration Testing
In integration testing, the various modules of the application are brought together and then
tested against a number of inputs.
System Testing
In system testing, the whole data warehouse application is tested together.
The purpose of system testing is to check whether the entire system works correctly together
or not.
Since the size of the whole data warehouse is very large, it is usually only possible to perform
minimal system testing before the test plan can be enacted.
Test Schedule
First of all, the test schedule is created in the process of developing the test plan. In this schedule, we
predict the estimated time required for the testing of the entire data warehouse system.
There are different methodologies available to create a test schedule, but none of them are perfect
because the data warehouse is very complex and large. Also the data warehouse system is evolving in
nature. One may face the following issues while creating a test schedule −
A simple problem may involve a very large query that takes a day or more to complete, i.e.,
the query does not complete in the desired time scale.
There may be hardware failures such as losing a disk or human errors such as accidentally
deleting a table or overwriting a large table.
Note − Due to the above-mentioned difficulties, it is recommended to always double the amount of
time you would normally allow for testing.
The backup recovery strategy should be tested against failures such as −
Media failure
Loss or damage of table space or data file
Loss or damage of redo log file
Loss or damage of control file
Instance failure
Loss or damage of archive file
Loss or damage of table
Failure during data movement
Testing Operational Environment
There are a number of aspects that need to be tested. These aspects are listed below.
Security − A separate security document is required for security testing. This document contains
a list of disallowed operations and tests devised for each of them.
Disk Configuration − Disk configuration also needs to be tested to identify I/O bottlenecks. The
test should be performed multiple times with different settings.
Management Tools − It is required to test all the management tools during system testing.
Here is the list of tools that need to be tested.
o Event manager
o System manager
o Database manager
o Configuration manager
o Backup recovery manager
Testing the Database
The database is tested in the following three ways −
Testing the database manager and monitoring tools − To test the database manager and the
monitoring tools, they should be used in the creation, running, and management of test
database.
Testing database features − Here is the list of features that we have to test −
o Querying in parallel
Testing database performance − Query execution plays a very important role in data
warehouse performance measures. There are sets of fixed queries that need to be run
regularly and they should be tested. To test ad hoc queries, one should go through the user
requirement document and understand the business completely. Take time to test the most
awkward queries that the business is likely to ask against different index and aggregation
strategies.
Scheduling software
Day-to-day operational procedures
Backup recovery strategy
Management and scheduling tools
Overnight processing
Query performance
Note − The most important point is to test the scalability. Failure to do so will leave us with a system
design that does not work when the system grows.
The size of a typical open database has approximately doubled in magnitude in the last few years,
which shows the significant value that it contains.
As the size of the databases grow, the estimates of what constitutes a very large database
continues to grow.
The hardware and software that are available today do not make it easy to keep a large amount of
data online. For example, a Telco call record store requires 10 TB of data to be kept online, which is
just the size of one month's records. If records of sales, marketing, customers,
employees, etc. also need to be kept, then the size will be more than 100 TB.
The record contains textual information and some multimedia data. Multimedia data cannot be
manipulated as easily as text data. Searching multimedia data is not an easy task, whereas
textual information can be retrieved by the relational software available today.
Apart from size planning, it is complex to build and run data warehouse systems that are ever
increasing in size. As the number of users increases, the size of the data warehouse also
increases. These users will also require access to the system.
With the growth of the Internet, there is a requirement of users to access data online.
Several concepts are of particular importance to data warehousing. They are discussed in detail in this
section.
Dimensional Data Model: Dimensional data model is commonly used in data warehousing systems. This
section describes this modeling technique, and the two common schema types, star
schema and snowflake schema.
Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This section
explains the problem, and describes the three ways of handling this problem with examples.
Conceptual Data Model: What is a conceptual data model, its features, and an example of this type of
data model.
Logical Data Model: What is a logical data model, its features, and an example of this type of data
model.
Physical Data Model: What is a physical data model, its features, and an example of this type of data
model.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This
section compares and contrasts the three different types of data models.
Data Integrity: What is data integrity and how it is enforced in data warehousing.
MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses
how they are different from the other, and the advantages and disadvantages of each.
Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a different view of the
role between data warehouse and data mart.
Factless Fact Table: A fact table without any fact may sound silly, but there are real life instances when a
factless fact table is useful in data warehousing.
Junk Dimension: Discusses the concept of a junk dimension: When to use it and why is it useful.
Conformed Dimension: Discusses the concept of a conformed dimension: What is it and why is it
important.
The dimensional data model is most often used in data warehousing systems. This is different from the 3rd
normal form, commonly used for transactional (OLTP) systems. As you can imagine, the same data
would then be stored differently in a dimensional model than in a 3rd normal form model.
To understand dimensional data modeling, let's define some of the terms commonly used in this type of
modeling:
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
Hierarchy: The specification of levels that represents relationship between different attributes within a
dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month →
Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount
would be such a measure. This measure is stored in the fact table with the appropriate granularity. For
example, it can be sales amount by store by day. In this case, the fact table would contain three
columns: A date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the
lookup table for the Quarter attribute would include a list of all of the quarters available in the data
warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the
quarter, and one or more additional fields that specify how that particular quarter is represented on a
report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup
tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are
represented by lookup tables. Attributes are the non-key columns in the lookup tables.
In designing data models for data warehouses / data marts, the most commonly used schema types
are Star Schema and Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and business needs.
Personally, I am partial to snowflakes, when there is a business case to analyze the information at that
particular level.
In the star schema design, a single object (the fact table) sits in the middle and is radially connected to
other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a
single table. The primary key in each dimension table is related to a foreign key in the fact table.
All measures in the fact table are related to all the dimensions that the fact table is related to. In other
words, they all have the same level of granularity.
A star schema can be simple or complex. A simple star consists of one fact table; a complex star can
have more than one fact table.
Let's look at an example: Assume our data warehouse keeps store sales data, and the different
dimensions are time, store, product, and customer. In this case, the figure on the left represents our star
schema. The lines between two tables indicate that there is a primary key / foreign key relationship
between the two tables. Note that different dimensions are not related to one another.
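The store-sales star schema just described can be sketched as DDL, run here through SQLite; all table and column names are assumptions for the example.

```python
# A minimal star schema: one fact table whose foreign keys point at
# four dimension lookup tables. Dimensions relate only to the fact
# table, never to one another.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, day  TEXT);
CREATE TABLE dim_store    (store_id    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    time_id      INTEGER REFERENCES dim_time(time_id),
    store_id     INTEGER REFERENCES dim_store(store_id),
    product_id   INTEGER REFERENCES dim_product(product_id),
    customer_id  INTEGER REFERENCES dim_customer(customer_id),
    sales_amount REAL
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```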
The snowflake schema is an extension of the star schema, where each point of the star explodes into
more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a
snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a
level in the dimensional hierarchy.
For example, to snowflake the Time dimension, we will have 4 lookup tables: a lookup table for year, a
lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month,
which is then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating
the above relationships in the Time dimension is shown to the right.
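The four Time-dimension lookup tables above can be sketched as DDL in SQLite (column names are assumptions): Day references both Month and Week, while Month references Year.

```python
# Normalized Time dimension for a snowflake schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lu_year  (year_id  INTEGER PRIMARY KEY, year  INTEGER);
CREATE TABLE lu_month (month_id INTEGER PRIMARY KEY, month INTEGER,
                       year_id  INTEGER REFERENCES lu_year(year_id));
CREATE TABLE lu_week  (week_id  INTEGER PRIMARY KEY, week  INTEGER);
CREATE TABLE lu_day   (day_id   INTEGER PRIMARY KEY, day   TEXT,
                       month_id INTEGER REFERENCES lu_month(month_id),
                       week_id  INTEGER REFERENCES lu_week(week_id));
""")
lookup_tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
```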
The main advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the
snowflake schema is the additional maintenance effort needed due to the increased number of lookup
tables.
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity,
we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
determining which dimensions will be included, and determining where along the hierarchy of each
dimension the information will be kept.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography,
and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket
with a Rewards Card program, where customers provide some personal information in exchange for a
rewards card, and the supermarket would offer lower prices for certain items for customers who
present a rewards card at checkout, will also have the ability to track the customer dimension. Whether
the data warehousing system includes the customer dimension will then be a decision that needs to be
made.
Determining which part of the hierarchy the information is stored along each dimension is not an exact
science. This is where user requirements (both stated and possibly future) play a major role.
In the above example, will the supermarket want to do analysis at the hourly level? (i.e.,
looking at how certain products may sell by different hours of the day.) If so, it makes sense to use 'hour'
as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be
used as the lowest level of granularity. Since the lower the level of detail, the larger the data amount in
the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between
detailed level of analysis and data storage.
Note that sometimes the users will not specify certain requirements, but based on the industry
knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that
may result in the need of additional details. In such cases, it is prudent for the data warehousing team to
design the fact table such that lower-level information is included. This will avoid possibly needing to re-
design the fact table in the future. On the other hand, trying to anticipate all future requirements is an
impossible and hence futile exercise, and the data warehousing team needs to fight the urge of the
"dumping the lowest level of detail into the data warehouse" symptom, and include only what is
practically needed. Sometimes this can be more of an art than a science, and prior experience will become
invaluable here.
Data Warehousing > Concepts > Fact And Fact Table Types
Types of Facts
Additive: Additive facts are facts that can be summed up through all of the dimensions in the
fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions
in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions
present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we are
a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a daily
basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up
this fact along any of the three dimensions present in the fact table -- date, store, and product. For
example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that
week.
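The additive behaviour can be checked directly; the sample rows below are invented for illustration.

```python
# Additive fact: Sales_Amount can be summed along any of the three
# dimensions (date, store, product) and the totals stay meaningful.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (date TEXT, store TEXT, product TEXT, sales_amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("2024-01-01", "S1", "P1", 10.0),
    ("2024-01-01", "S2", "P1", 20.0),
    ("2024-01-02", "S1", "P2", 30.0),
])

# Summing along the store dimension...
by_store = dict(conn.execute(
    "SELECT store, SUM(sales_amount) FROM sales GROUP BY store"))
# ...or across everything, both give meaningful sales totals.
total = conn.execute("SELECT SUM(sales_amount) FROM sales").fetchone()[0]
```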
The second example assumes that we are a bank, and our fact table has the following columns:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as
well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the
facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's
the total current balance for all accounts in the bank?), but it does not make sense to add them up
through time (adding up all current balances for a given account for each day of the month does not give
us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them
up for the account level or the day level.
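A small sketch with invented figures shows why Current_Balance is only semi-additive.

```python
# Semi-additive fact: Current_Balance sums meaningfully across accounts
# on a single day, but summing one account across days is not a balance.
rows = [  # (date, account, current_balance) -- invented sample data
    ("2024-01-01", "A", 100.0), ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0), ("2024-01-02", "B", 50.0),
]

def total_balance_on(day):
    """Valid aggregation: bank-wide balance on one day."""
    return sum(bal for d, _, bal in rows if d == day)

# Invalid aggregation: account A summed over time gives 220.0,
# which is not a balance of anything.
meaningless = sum(bal for _, acct, bal in rows if acct == "A")
```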
Based on the above classifications, there are two types of fact tables:
Cumulative: This type of fact table describes what has happened over a period of time. For
example, this fact table may describe the total sales by product by store by day. The facts for
this type of fact tables are mostly additive facts. The first example presented here is a
cumulative fact table.
Snapshot: This type of fact table describes the state of things in a particular instance of time,
and usually includes more semi-additive and non-additive facts. The second example presented
here is a snapshot fact table.
Data Warehousing > Concepts > Slowly Changing Dimensions
The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a
nutshell, this applies to cases where the attribute for a record varies over time. We give an
example below:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in
the customer lookup table has the following record:
At a later date, she moved to Los Angeles, California in January 2003. How should ABC Inc. now
modify its customer table to reflect this change? This is the "Slowly Changing Dimension"
problem.
There are in general three ways to solve this type of problem, and they are categorized as
follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.
Type 3: The original record is modified to reflect the change.
We next take a look at each of the scenarios and how the data model and the data looks like for
each of them. Finally, we compare and contrast among the three alternatives.
Data Warehousing > Concepts > Type 1 Slowly Changing Dimension
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.
In our example, recall we originally have the following table:
After Christina moved from Illinois to California, the new information replaces the old record,
and we have the following table:
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no
need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois
before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data
warehouse to keep track of historical changes.
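A minimal sketch of the Type 1 overwrite; the record layout is an assumption, not from the text.

```python
# Type 1: the new value overwrites the old one in place. The surrogate
# key stays the same and no history survives.
customer = {"customer_key": 1001, "name": "Christina", "state": "Illinois"}

def scd_type1_update(record, **changes):
    """Overwrite attributes in place (Type 1 slowly changing dimension)."""
    record.update(changes)
    return record

scd_type1_update(customer, state="California")
# The Illinois value is now gone; the company can no longer tell
# that Christina ever lived there.
```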
Data Warehousing > Concepts > Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new
information. Therefore, both the original and the new record will be present. The new record
gets its own primary key.
In our example, recall we originally have the following table:
After Christina moved from Illinois to California, we add the new information as a new row into
the table:
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the
table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse
to track historical changes.
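A minimal sketch of the Type 2 insert, assuming integer surrogate keys; the key scheme and record layout are assumptions.

```python
# Type 2: a new row with its own surrogate key is appended, so both the
# Illinois and California versions of the customer remain queryable.
table = [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]

def scd_type2_insert(table, name, **new_attrs):
    """Add a new version of the customer under a fresh surrogate key."""
    next_key = max(row["customer_key"] for row in table) + 1
    table.append({"customer_key": next_key, "name": name, **new_attrs})

scd_type2_insert(table, "Christina", state="California")
states = [row["state"] for row in table]
# The table now holds two rows for Christina, i.e. she is treated
# essentially as two people.
```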
Data Warehousing > Concepts > Type 3 Slowly Changing Dimension
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of
interest, one indicating the original value, and one indicating the current value. There will also be a
column that indicates when the current value becomes active.
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we have the
following table (assuming the effective date of change is January 15, 2003):
Advantages:
- This does not increase the size of the table, since new information is updated.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example,
if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to
track historical changes, and when such changes will only occur a finite number of times.
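A minimal sketch of the Type 3 update using the columns listed above; the field names mirror that column list but the record layout is otherwise an assumption.

```python
# Type 3: the current value moves into Current State and the Effective
# Date is stamped. Only one prior value (Original State) is ever kept.
customer = {"customer_key": 1001, "name": "Christina",
            "original_state": "Illinois", "current_state": "Illinois",
            "effective_date": None}

def scd_type3_update(record, new_state, effective_date):
    """Update the current value in place, preserving the original."""
    record["current_state"] = new_state
    record["effective_date"] = effective_date
    return record

scd_type3_update(customer, "California", "2003-01-15")
# A second move (e.g. to Texas) would overwrite "California",
# which is exactly the limitation described above.
```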
A conceptual data model identifies the highest-level relationships between the different entities.
Features of conceptual data model include:
From the figure above, we can see that the only information shown via the conceptual data model is the
entities that describe the data and the relationships between those entities. No other information is
shown through the conceptual data model.
A logical data model describes the data in as much detail as possible, without regard to how they will be
physically implemented in the database. Features of a logical data model include:
The steps for designing the logical data model are as follows:
Comparing the logical data model shown above with the conceptual data model diagram, we see the
main differences between the two:
In a logical data model, primary keys are present, whereas in a conceptual data model, no
primary key is present.
In a logical data model, all attributes are specified within an entity. No attributes are specified in
a conceptual data model.
Relationships between entities are specified using primary keys and foreign keys in a logical data
model. In a conceptual data model, the relationships are simply stated, not specified, so we
simply know that two entities are related, but we do not specify what attributes are used for
this relationship.
Physical data model represents how the model will be built in the database. A physical database model
shows all table structures, including column name, column data type, column constraints, primary key,
foreign key, and relationships between tables. Features of a physical data model include:
Comparing the physical data model shown above with the logical data model diagram, we see the main
differences between the two:
Entity names are now table names.
Attributes are now column names.
The data type for each column is specified.
Below we show the conceptual, logical, and physical versions of a single data model.
We can see that the complexity increases from conceptual to logical to physical. This is why we
always first start with the conceptual data model (so we understand at high level what are the
different entities in our data and how they relate to one another), then move on to the logical
data model (so we understand the details of our data without worrying about how they will
actually be implemented), and finally the physical data model (so we know exactly how to
implement our data model in the database of choice). In a data warehousing project, sometimes
the conceptual data model and the logical data model are considered as a single deliverable.
Data Warehousing > Concepts > Data Integrity
Data integrity refers to the validity of data, meaning data is consistent and correct. In the data
warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no data
integrity in the data warehouse, any resulting report and analysis will not be useful.
In a data warehouse or a data mart, there are three areas where data integrity needs to be
enforced:
Database level
We can enforce data integrity at the database level. Common ways of enforcing data integrity
include:
Referential integrity
The relationship between the primary key of one table and the foreign key of another table
must always be maintained. For example, a primary key cannot be deleted if there is still a
foreign key that refers to this primary key.
Primary key / Unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a table can be
uniquely identified.
Not NULL vs. NULL-able
Columns identified as NOT NULL may not contain a NULL value.
Valid Values
Only allowed values are permitted in the database. For example, if a column can only have
positive integers, a value of '-1' cannot be allowed.
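The database-level checks above can be sketched as follows, using SQLite from Python with a hypothetical product dimension and sales fact table; each kind of bad row is rejected by the corresponding constraint:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
con.execute("""
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,         -- primary key
        product_code TEXT NOT NULL UNIQUE,        -- NOT NULL + unique constraint
        unit_price   REAL CHECK (unit_price > 0)  -- valid values only
    )
""")
con.execute("""
    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES product_dim (product_key),  -- foreign key
        qty         INTEGER
    )
""")
con.execute("INSERT INTO product_dim VALUES (1, 'A100', 9.99)")

violations = 0
for bad_sql in (
    "INSERT INTO product_dim VALUES (2, NULL, 1.0)",     # NOT NULL violated
    "INSERT INTO product_dim VALUES (3, 'A100', 2.0)",   # UNIQUE violated
    "INSERT INTO product_dim VALUES (4, 'B200', -1.0)",  # CHECK: -1 not allowed
    "INSERT INTO sales_fact VALUES (99, 5)",             # FK: product 99 missing
):
    try:
        con.execute(bad_sql)
    except sqlite3.IntegrityError:
        violations += 1  # the database rejected the bad row

print(violations)  # 4 -- all four bad rows were rejected
```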
ETL process
For each step of the ETL process, data integrity checks should be put in place to ensure that
source data is the same as the data in the destination. Most common checks include record
counts or record sums.
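A minimal sketch of such a reconciliation check, assuming hypothetical in-memory source and target row sets, compares record counts and record sums between the two sides of an ETL step:

```python
# Hypothetical ETL step: rows extracted from the source, rows loaded into the target.
source_rows = [("2024-01-01", 120.0), ("2024-01-02", 75.5), ("2024-01-03", 10.0)]
loaded_rows = [("2024-01-01", 120.0), ("2024-01-02", 75.5), ("2024-01-03", 10.0)]

def integrity_check(source, target, amount_index=1):
    """Record-count and record-sum reconciliation between source and target."""
    count_ok = len(source) == len(target)
    sum_ok = abs(sum(r[amount_index] for r in source)
                 - sum(r[amount_index] for r in target)) < 1e-9
    return count_ok and sum_ok

print(integrity_check(source_rows, loaded_rows))  # True -- destination matches source
```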
Access level
We need to ensure that data is not altered by any unauthorized means either during the ETL
process or in the data warehouse. To do this, there needs to be safeguards against unauthorized
access to data (including physical access to the servers), as well as logging of all data access
history. Data integrity can only be ensured if there is no unauthorized access to the data.
Data Warehousing > Concepts > What Is OLAP
OLAP stands for On-Line Analytical Processing. The first attempt to define OLAP was made by Dr.
Codd, who proposed 12 rules for OLAP. It was later discovered that this particular white paper had
been sponsored by one of the OLAP tool vendors, causing it to lose objectivity.
The OLAP Report has proposed the FASMI test, Fast Analysis
of Shared Multidimensional Information. For a more detailed description of both Dr. Codd's
rules and the FASMI test, please visit The OLAP Report.
For people on the business side, the key part of this test is "Multidimensional," in other
words, the ability to analyze metrics across different dimensions such as time, geography,
gender, product, etc. For example, sales for the company are up. What region is most
responsible for this increase? Which store in this region is most responsible for the increase?
What particular product category or categories contributed the most to the increase? Answering
these types of questions in order means that you are performing an OLAP analysis.
Depending on the underlying technology used, OLAP can be broadly divided into two different
camps: MOLAP and ROLAP. A discussion of the different OLAP types can be found in the MOLAP,
ROLAP, and HOLAP section.
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational
OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube.
The storage is not in the relational database, but in proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing
and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the
cube is built, it is not possible to include a large amount of data in the cube itself. This is not to
say that the data in the cube cannot be derived from a large amount of data. Indeed, this is
possible. But in this case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not already exist
in the organization. Therefore, to adopt MOLAP technology, chances are additional investments
in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and
dicing is equivalent to adding a "WHERE" clause in the SQL statement.
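The idea can be sketched with SQLite from Python: given a hypothetical sales fact table (names and values invented for illustration), each slice or dice is just a WHERE clause on a SQL query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_fact (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('East', 'Widget', 100.0),
        ('East', 'Gadget',  50.0),
        ('West', 'Widget',  80.0);
""")

# "Slicing" on the region dimension is a single WHERE condition.
east_total = con.execute(
    "SELECT SUM(amount) FROM sales_fact WHERE region = 'East'").fetchone()[0]

# "Dicing" narrows more than one dimension at once.
east_widget = con.execute(
    "SELECT SUM(amount) FROM sales_fact "
    "WHERE region = 'East' AND product = 'Widget'").fetchone()[0]

print(east_total, east_widget)  # 150.0 100.0
```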
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation
on data size of the underlying relational database. In other words, ROLAP itself places no
limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, a relational database
already comes with a host of functionalities. ROLAP technologies, since they sit on top of the
relational database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL
queries) in the relational database, the query time can be long if the underlying data size is
large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for
example, it is difficult to perform complex calculations using SQL), ROLAP technologies are
therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by
building into the tool out-of-the-box complex functions as well as the ability to allow users to
define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type
information, HOLAP leverages cube technology for faster performance. When detail information is
needed, HOLAP can "drill through" from the cube into the underlying relational data.
Data Warehousing > Concepts > Bill Inmon vs. Ralph Kimball
In the data warehousing field, we often hear discussions about whether a person's or organization's
philosophy falls into Bill Inmon's camp or Ralph Kimball's camp. We describe below the difference
between the two.
Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An
enterprise has one data warehouse, and data marts source their information from the data warehouse.
In the data warehouse, information is stored in 3rd normal form.
Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise.
Information is always stored in the dimensional model.
There is no right or wrong between these two ideas, as they represent different data warehousing
philosophies. In reality, the data warehouse systems in most enterprises are closer to Ralph Kimball's
idea. This is because most data warehouses started out as a departmental effort, and hence they
originated as a data mart. Only when more data marts are built later do they evolve into a data
warehouse.
A factless fact table is a fact table that does not have any measures. It is essentially an intersection of
dimensions. On the surface, a factless fact table does not make sense, since a fact table is, after all,
about facts. However, there are situations where having this kind of relationship makes sense in data
warehousing.
For example, think about a record of student attendance in classes. In this case, the fact table would
consist of 3 dimensions: the student dimension, the time dimension, and the class dimension. This
factless fact table would look like the following:
The only measure that you can possibly attach to each combination is "1" to show the presence of that
particular combination. However, adding a fact that always shows 1 is redundant because we can simply
use the COUNT function in SQL to answer the same questions.
Factless fact tables offer great flexibility in data warehouse design. For example, with the
factless fact table above, one can easily answer questions such as how many students attended a
particular class on a particular day, or how many classes a particular student attended on a given
day. Without a factless fact table, we would need two separate fact tables to answer these two
questions; with the factless fact table above, a single fact table suffices.
In data warehouse design, frequently we run into a situation where there are yes/no indicator fields in
the source system. Through business analysis, we know it is necessary to keep such information in the
fact table. However, if we keep all those indicator fields in the fact table, not only do we need to build many
small dimension tables, but the amount of information stored in the fact table also increases
tremendously, leading to possible performance and management issues.
A junk dimension is a way to solve this problem. In a junk dimension, we combine these indicator fields
into a single dimension. This way, we'll only need to build a single dimension table, and the number of
fields in the fact table, as well as the size of the fact table, can be decreased. The content in the junk
dimension table is the combination of all possible values of the individual indicator fields.
Let's look at an example. Assuming that we have the following fact table:
In this example, TXN_CODE, COUPON_IND, and PREPAY_IND are all indicator fields. In this existing
format, each one of them is a dimension. Using the junk dimension principle, we can combine them into
a single junk dimension, resulting in the following fact table:
Note that now the number of dimensions in the fact table went from 7 to 5.
The content of the junk dimension table would look like the following:
In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND
field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the
junk dimension table.
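The cross-product construction can be sketched in Python; the specific indicator values below are invented for illustration, but the counts (3 transaction codes, 2 coupon values, 2 prepay values) match the example:

```python
from itertools import product

# Hypothetical indicator values; the junk dimension table is their
# Cartesian product, one row per combination.
txn_codes  = ("PURCHASE", "REFUND", "VOID")  # 3 values
coupon_ind = ("Y", "N")                      # 2 values
prepay_ind = ("Y", "N")                      # 2 values

junk_dim = [
    {"junk_key": i, "txn_code": t, "coupon_ind": c, "prepay_ind": p}
    for i, (t, c, p) in enumerate(product(txn_codes, coupon_ind, prepay_ind),
                                  start=1)
]
print(len(junk_dim))  # 12 rows: 3 x 2 x 2
```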
By using a junk dimension to replace the 3 indicator fields, we have decreased the number of
dimensions by 2 and also decreased the number of fields in the fact table by 2. This will result in a data
warehousing environment that offers better performance and is easier to manage.
A conformed dimension is a dimension that has exactly the same meaning and content when being
referred from different fact tables. A conformed dimension can refer to multiple tables in multiple data
marts within the same organization. For two dimension tables to be considered as conformed, they
must either be identical or one must be a subset of another. There cannot be any other type of
difference between the two tables. For example, two dimension tables that are exactly the same except
for the primary key are not considered conformed dimensions.
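The "identical or a subset" rule can be sketched in Python, treating each dimension table as a set of rows (the row values are invented for illustration); note that two tables whose rows differ only in their keys fail the test:

```python
def is_conformed(dim_a, dim_b):
    """Two dimension tables (as sets of rows) are conformed only if they are
    identical or one is a proper subset of the other."""
    return dim_a == dim_b or dim_a < dim_b or dim_b < dim_a

full_time_dim = {(20240101, "Mon"), (20240102, "Tue"), (20240103, "Wed")}
subset_dim    = {(20240101, "Mon"), (20240102, "Tue")}
other_keys    = {(1, "Mon"), (2, "Tue")}  # same content, different keys

print(is_conformed(full_time_dim, subset_dim))  # True: subset
print(is_conformed(full_time_dim, other_keys))  # False: rows differ
```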
Why is conformed dimension important? This goes back to the definition of data warehouse being
"integrated." Integrated means that even if a particular entity had different meanings and different
attributes in the source systems, there must be a single version of this entity once the data flows into
the data warehouse.
The time dimension is a common conformed dimension in an organization. Usually the only issues to
consider with the time dimension are whether there is a fiscal year in addition to the calendar
year, and how a week is defined. Fortunately, both are relatively easy to resolve. In the case of fiscal vs. calendar
year, one may go with either fiscal or calendar, or an alternative is to have two separate conformed
dimensions, one for fiscal year and one for calendar year. The definition of a week is also something that
can be different in large organizations: Finance may use Saturday to Friday, while marketing may use
Sunday to Saturday. In this case, we should decide on a definition and move on. The nice thing about the
time dimension is once these rules are set, the values in the dimension table will never change. For
example, October 16th will never become the 15th day in October.
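A time dimension carrying both calendar-year and fiscal-year attributes can be sketched as follows; the July 1 fiscal-year start is an illustrative assumption, not a rule from the text:

```python
from datetime import date, timedelta

def time_dimension(start, days):
    """Generate time dimension rows with calendar and fiscal year attributes.
    Assumes (for illustration) a fiscal year that starts on July 1."""
    rows = []
    for i in range(days):
        d = start + timedelta(days=i)
        fiscal_year = d.year + 1 if d.month >= 7 else d.year
        rows.append({
            "date_key": d.strftime("%Y%m%d"),
            "calendar_year": d.year,
            "fiscal_year": fiscal_year,
            "weekday": d.strftime("%A"),
        })
    return rows

# Four days straddling the assumed fiscal-year boundary.
dim = time_dimension(date(2024, 6, 29), 4)
print([r["fiscal_year"] for r in dim])  # [2024, 2024, 2025, 2025]
```

Once rules like these are fixed, the generated rows never change, which is what makes the time dimension the easiest conformed dimension to maintain.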
Not all conformed dimensions are as easy to produce as the time dimension. An example is the
customer dimension. In any organization with some history, there is a high likelihood that different
customer databases exist in different parts of the organization. To achieve a conformed customer
dimension means those data must be compared against each other, rules must be set, and data must be
cleansed. In addition, when we are doing incremental data loads into the data warehouse, we'll need to
apply the same rules to the new values to make sure we are only adding truly new customers to the
customer dimension.
Building a conformed dimension is also part of the process in master data management, or MDM. In
MDM, one must not only make sure the master data dimensions are conformed, but that conformity
needs to be brought back to the source systems.