Data analysis is the process of inspecting, cleaning, transforming and modeling raw data into useful information in order to reach a conclusion or decision. It also covers organizing the data to interpret trends with the help of charts, graphs or textual write-ups.

Staging
What is a staging area? Do we need it? What is the purpose of a staging area?
A staging area is a place where you hold temporary tables on the data warehouse server. Staging tables are connected to the work area or fact tables. We need a staging area to hold the data and to perform data cleansing and merging before loading the data into the warehouse.

How are data marts and data warehouses related?
A data mart is a subset of a data warehouse: a repository of data that holds information on a specific business area. The data warehouse is made up of a number of data marts.

Data dictionary
What is data dictionary?
It is a metadata repository; it contains information about data such as meaning, relationship to other data, origin, usage and format. It provides information about the database, including:
• The definitions of all schema objects in the database (tables, views, indexes, clusters, synonyms, sequences, procedures, functions, packages, triggers, and so on)
• How much space has been allocated for, and is currently used by, the schema objects
• Default values for columns
• Integrity constraint information
• The names of Oracle users
• Privileges and roles each user has been granted
• Auditing information, such as who has accessed or updated various schema objects
• Other general database information
So, it could simply be an MS Word document that describes each table we have, the columns of the various tables, and the description of each column (why is that column there, what is its purpose? Sometimes columns are deprecated, so the data dictionary will contain information about that too).

How is it different from a database schema?
A data dictionary is a document that describes the various tables, columns and relationships, and the reasons for the columns to exist.

Why is it required?
So that users of the database (programmers, end users, business teams) can get the most out of the database, and can truly understand the layout of the tables and what each column means. Say there were two columns, TStamp1 and TStamp2. These column names are not very descriptive or intuitive in what they mean (or what kind of data they hold). In the data dictionary, the creator of this table can say “TStamp1 is used to store the last login date for the user. TStamp2 is used to store the last logout date for the user”. Metadata is data about the data! For example, MS Excel files are a type of database. They are given specific names, date modified and date last accessed. This is the metadata of the actual MS Excel database. So, a data dictionary is the way you give the world information about metadata!

How does one create data dictionary?
One can do it manually! But the problem with this approach, if the database schema is changing fairly often, is that one can end up spending a lot of time doing this. I have used a tool called StoneField to create a data dictionary (you can elaborate it once you get the basic dictionary).

Data definition
What is data definition?
Data definition is used to describe sets of variables that are passed to a template. It's essentially just a sequence of assignments, where you can create strings, integers, arrays, and hashes. E.g. you can create a string as string = "a String". Arrays can hold any type of variable in their elements. Hashes can use anything for their values, but only strings for their keys.

Operational data store
What is Operational data store?
An operational data store (ODS) is an integrated database of operational data. An ODS integrates data from multiple sources to make analysis and reporting easier. Since the data comes from different sources, the integration involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS contains highly granular data with limited history, holding current or near-current data. A typical ODS may contain 30-60 days of information, while a data warehouse typically contains years of data. An ODS acts as a staging area for data warehouses and data marts used for data analysis.
Data flow: Operational System -> ODS -> data warehouse
• An ODS is based on a two-dimensional model.
• An ODS is a flat structure. It is just one table that contains all data. Most of the time you use an ODS for line item data.
• An ODS has an option to overwrite or add a single record.

What are the various tables used in data warehouse?
Fact – a table type that contains atomic (source) data
Dimension – a table type that contains referential data needed by the fact tables
Aggregate – a table type used to aggregate data, forming a pre-computed answer to a business question (ex. Totals by day)
Staging – tables used to store data during ELT processing but the data is not removed immediately
Temp – tables used during ELT processing that can immediately be truncated afterwards (ex. storing order ids for lookup)
Audit – tables used to keep track of the ELT process (ex. Processing times by job)
Each type of table is kept in a separate schema to decrease maintenance work and the time spent looking for a specific table.

Data cleansing
How will you perform data cleansing?
Before starting the cleaning process, I do a quality assessment of the source data. I write aggregate queries or use an ETL/ELT tool to find errors. Then I analyze the query results or transformation reports to measure the impact. Typical data errors include missing data (NULL values for important columns, negative or zero values, corrupted rows), unwanted junk (apostrophes, commas and white space), numeric data errors (a negative value that should be positive), badly formatted phone numbers, database errors and business rule errors.

Data mining
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information.

Fact table
A fact table contains the factual or quantitative data about the business, with many columns and billions of rows (e.g. number of transformations run for a particular month, total number of active customers). It contains foreign keys for the dimension tables.

Dimension tables
A dimension table contains textual attributes of the measurements stored in fact tables. It is a collection of hierarchies, categories, and logic which the user can use to traverse hierarchy nodes. Dimension tables are smaller and hold descriptive data that reflects the dimensions, or attributes, of a business.

Lookup table
A lookup table is the table placed on the target table based on the primary key of the target. It updates the table by allowing only modified records based on lookup conditions.

Star schema vs. snowflake schema
Both are ways of organizing the tables so that results can be retrieved from the database quickly in a warehouse environment.
Star schema:
• A single fact table with N dimensions.
• All dimensions are linked directly with the fact table.
• This schema is de-normalized and results in simple joins and less complex queries as well as faster results.
Snowflake schema:
• Any dimensions with extended dimensions are known as a snowflake schema.
• Dimensions may be interlinked or may have one-to-many relationships with other tables.
• Each dimension has a primary dimension table, to which one or more additional dimensions can join. The primary dimension table is the only table that can join to the fact table.
• This schema is normalized and results in complex joins and very complex queries as well as slower results.
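
To make the star schema concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table and column names (fact_sales, dim_product, dim_date) and the sample rows are hypothetical and chosen only for illustration; they do not come from any real warehouse. It shows one fact table holding foreign keys to two dimension tables, and a query that needs only one direct join per dimension.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny star schema: one fact table, two dimension tables.

    All table names, column names and rows are hypothetical examples.
    """
    conn.executescript(
        """
        CREATE TABLE dim_product (
            product_id   INTEGER PRIMARY KEY,
            product_name TEXT NOT NULL,
            category     TEXT NOT NULL
        );
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY,
            month   TEXT NOT NULL
        );
        -- The fact table holds the measures plus a foreign key per dimension.
        CREATE TABLE fact_sales (
            sale_id    INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
            amount     REAL NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?)",
        [(1, "2024-01"), (2, "2024-02")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 10.0), (2, 2, 1, 20.0), (3, 3, 1, 5.0), (4, 1, 2, 15.0)],
    )
    conn.commit()


def sales_by_category_and_month(conn):
    """Total sales per category and month: one direct join per dimension."""
    return conn.execute(
        """
        SELECT p.category, d.month, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date    AS d ON d.date_id    = f.date_id
        GROUP BY p.category, d.month
        ORDER BY p.category, d.month
        """
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    for row in sales_by_category_and_month(connection):
        print(row)
```

In a snowflake variant of the same sketch, category would move out of dim_product into its own table joined to dim_product, so the same report would need an extra join to reach it.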