You can configure the Lookup transformation to perform different types of lookups. You can configure the transformation to be connected or unconnected, cached or uncached: Connected or unconnected - Connected and unconnected transformations receive input and send output in different ways. Cached or uncached - Sometimes you can improve session performance by caching the lookup table. If you cache the lookup table, you can choose to use a dynamic or static cache. By default, the lookup cache remains static and does not change during the session. With a dynamic cache, the Informatica Server inserts or updates rows in the cache during the session. When you cache the target table as the lookup, you can look up values in the target and insert them if they do not exist, or update them if they do exist.
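The target-as-lookup case described above is essentially an upsert. As a rough set-based SQL analogy only (this is not Informatica syntax, and the CUSTOMER_STG staging table and CUSTOMER_DIM target used here are hypothetical names), a dynamic lookup cache on the target combined with an Update Strategy behaves like a MERGE:

-- Insert rows that are not yet in the target, update the ones that are.
MERGE INTO customer_dim t
USING customer_stg s
ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET t.customer_name = s.customer_name,
             t.customer_city = s.customer_city
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name, customer_city)
  VALUES (s.customer_id, s.customer_name, s.customer_city);

The session does the same thing row by row: each source row is looked up in the cache, flagged as an insert if it is missing and as an update if it is found.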
informatica: What is the Data Transformation Manager (DTM) process? How many threads does it create to process data? Explain each thread in brief.
When the workflow reaches a session, the Load Manager starts the DTM process. The DTM process is the process associated with the session task. The Load Manager creates one DTM process for each session in the workflow. The DTM process performs the following tasks: Reads session information from the repository. Expands the server and session variables and parameters. Creates the session log file. Validates source and target code pages. Verifies connection object permissions. Runs pre-session shell commands, stored procedures and SQL. Creates and runs mapping, reader, writer, and transformation threads to extract, transform, and load data. Runs post-session stored procedures, SQL, and shell commands. Sends post-session email. The DTM allocates process memory for the session and divides it into buffers. This is also known as buffer memory. The default memory allocation is 12,000,000 bytes. The DTM uses multiple threads to process data. The main DTM thread is called the master thread. The master thread creates and manages other threads. The master thread for a session can create mapping, pre-session, post-session, reader, transformation, and writer threads. Mapping Thread -One thread for each session. Fetches session and mapping information. Compiles the mapping. Cleans up after session execution. Pre- and Post-Session Threads- One thread each to perform pre- and post-session operations. Reader Thread -One thread for each partition for each source pipeline. Reads from sources. Relational sources use relational reader threads, and file sources use file reader threads .
Transformation Thread -One or more transformation threads for each partition. Processes data according to the transformation logic in the mapping. Writer Thread- One thread for each partition, if a target exists in the source pipeline. Writes to targets. Relational targets use relational writer threads, and file targets use file writer threads.
informatica: Suppose a session is configured with a commit interval of 10,000 rows and the source has 50,000 rows. Explain the commit points for source-based commit and target-based commit.
Suppose a session is configured with a commit interval of 10,000 rows and the source has 50,000 rows. Explain the commit points for source-based commit and target-based commit. Assume appropriate values wherever required. a) Target-based commit: For example, a session is configured with a target-based commit interval of 10,000. The writer buffers fill every 7,500 rows. When the Informatica Server reaches the commit interval of 10,000, it continues processing data until the writer buffer is filled. The second buffer fills at 15,000 rows, and the Informatica Server issues a commit to the target. If the session completes successfully, the Informatica Server issues commits after 15,000, 22,500, 30,000, and 40,000 rows. b) Source-based commit: The Informatica Server might commit fewer rows to the target than the number of rows produced by the active source. For example, you have a source-based commit session that passes 10,000 rows through an active source, and 3,000 rows are dropped due to transformation logic. The Informatica Server issues a commit to the target when the 7,000 remaining rows reach the target. The number of rows held in the writer buffers does not affect the commit point for a source-based commit session. For example, you have a source-based commit session that passes 10,000 rows through an active source. When those 10,000 rows reach the targets, the Informatica Server issues a commit. If the session completes successfully, the Informatica Server issues commits after 10,000, 20,000, 30,000, and 40,000 source rows.
How to capture performance statistics of individual transformation in the mapping and explain some important statistics that can be captured?
informatica : How to capture performance statistics of individual transformation in the mapping and explain some important statistics that can be captured?
Ans: a) Before using performance details to improve session performance, you must do the following: enable monitoring, increase Load Manager shared memory, and understand the performance counters. To view performance details in the Workflow Monitor: while the session is running, right-click the session in the Workflow Monitor and choose Properties, click the Performance tab in the Properties dialog box, and click OK. To view the performance details file: locate the performance details file. The Informatica Server names the file session_name.perf and stores it in the same directory as the session log. If there is no session-specific directory for the session log, the Informatica Server saves the file in the default log files directory. Open the file in any text editor. b) Source Qualifier and Normalizer transformations: BufferInput_efficiency - Percentage reflecting how seldom the reader waited for a free buffer when passing data to the DTM. BufferOutput_efficiency - Percentage reflecting how seldom the DTM waited for a full buffer of data from the reader.
Target: BufferInput_efficiency - Percentage reflecting how seldom the DTM waited for a free buffer when passing data to the writer. BufferOutput_efficiency - Percentage reflecting how seldom the Informatica Server waited for a full buffer of data from the writer. For Source Qualifiers and targets, a high value is considered 80-100 percent. Low is considered 0-20 percent. However, any dramatic difference in a given set of BufferInput_efficiency and BufferOutput_efficiency counters indicates inefficiencies that may benefit from tuning.
informatica: What is the Load Manager? Ans: The Load Manager is the primary Informatica Server process. It performs the following tasks: a. Manages session and batch scheduling. b. Locks the session and reads session properties. c. Reads the parameter file. d. Expands the server and session variables and parameters. e. Verifies permissions and privileges.
f. Validates source and target code pages. g. Creates the session log file. h. Creates the Data Transformation Manager (DTM) process, which executes the session.
When to reinitialize the aggregate caches?
Scenario: The Informatica Server and Client are on different machines. You run a session from the Server Manager by specifying the source and target databases. It displays an error. You are confident that everything is correct. Then why is it displaying the error? The connect strings for the source and target databases are not configured on the machine containing the server, though they may be on the client machine.
Unlike other transformations, we cannot override the Sequence Generator transformation properties at the session level. This protects the integrity of the sequence values generated.
informatica : What is the difference between connected lookup and unconnected lookup? Ans: Differences between Connected and Unconnected Lookups:
Connected Lookup:
- Receives input values directly from the pipeline.
- We can use a dynamic or static cache.
- Supports user-defined default values.
Unconnected Lookup:
- Receives input values from the result of a :LKP expression in another transformation.
- We can use a static cache.
- Does not support user-defined default values.
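As a loose SQL analogy (the ORDERS and CUSTOMERS tables below are made up for illustration), a connected lookup behaves like an outer join that enriches every row in the pipeline, while an unconnected lookup behaves like a scalar subquery called from an expression only when it is needed:

-- Connected lookup: every row is enriched, like a left outer join.
SELECT o.order_id, o.cust_id, c.cust_name
FROM orders o
LEFT OUTER JOIN customers c ON c.cust_id = o.cust_id;

-- Unconnected lookup: invoked conditionally, like a scalar subquery in a CASE.
SELECT o.order_id,
       CASE WHEN o.cust_id IS NOT NULL
            THEN (SELECT c.cust_name FROM customers c WHERE c.cust_id = o.cust_id)
            ELSE 'UNKNOWN'
       END AS cust_name
FROM orders o;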
informatica : Where do you define update strategy? Ans: We can set the Update strategy at two different levels: Within a session. When you configure a session, you can instruct the Informatica Server to either treat all records in the same way (for example, treat all records as inserts), or use instructions coded into the session mapping to flag records for different database operations. Within a mapping. Within a mapping, you use the Update Strategy transformation to flag records for insert, delete, update, or reject.
informatica: What is a Lookup transformation? Ans: Use a Lookup transformation in a mapping to look up data in a relational table, view, or synonym. You can use multiple Lookup transformations in a mapping. The Informatica Server queries the lookup table based on the lookup ports in the transformation. It compares Lookup transformation port values to lookup table column values based on the lookup condition. Use the result of the lookup to pass to other transformations and the target.
What is a transformation?
informatica: What is a transformation? A transformation is a repository object that generates, modifies, or passes data. You configure logic in a transformation that the Informatica Server uses to transform data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data. Each transformation has rules for configuring and connecting in a mapping. For more information about working with a specific transformation, refer to the chapter in this book that discusses that particular transformation. You can create transformations to use once in a mapping, or you can create reusable transformations to use in multiple mappings.
Ans: When you use event-based scheduling, the Informatica Server starts a session when it locates the specified indicator file. To use event-based scheduling, you need a shell command, script, or batch file to create an indicator file when all sources are available. The file must be created or sent to a directory local to the Informatica Server. The file can be of any format recognized by the Informatica Server operating system. The Informatica Server deletes the indicator file once the session starts.
Use the following syntax to ping the Informatica Server on a UNIX system:
pmcmd ping [{user_name | %user_env_var} {password | %password_env_var}] [hostname:]portno
Use the following syntax to start a session or batch on a UNIX system:
pmcmd start {user_name | %user_env_var} {password | %password_env_var} [hostname:]portno [folder_name:]{session_name | batch_name} [:pf=param_file] session_flag wait_flag
Use the following syntax to stop a session or batch on a UNIX system:
pmcmd stop {user_name | %user_env_var} {password | %password_env_var} [hostname:]portno [folder_name:]{session_name | batch_name} session_flag
Use the following syntax to stop the Informatica Server on a UNIX system:
pmcmd stopserver {user_name | %user_env_var} {password | %password_env_var} [hostname:]portno
When you refresh the DDS, you can also capture only incremental changes to sources. For example, rather than reading all the product data each time you update the DDS, you can improve performance by capturing only the inserts, deletes, and updates that have occurred in the PRODUCTS table since the last time you updated the DDS. The DDS has one additional advantage beyond performance: when you move data into the DDS, you can format it in a standard fashion. For example, you can prune sensitive employee data that should not be stored in any data mart. Or you can display date and time values in a standard format. You can perform these and other data cleansing tasks when you move data into the DDS instead of performing them repeatedly in separate data marts.
When should you create the dynamic data store? Do you need a DDS at all?
informatica: When should you create the dynamic data store? Do you need a DDS at all? To decide whether you should create a dynamic data store (DDS), consider the following issues: How much data do you need to store in the DDS? The one principal advantage of data marts is the selectivity of information included in it. Instead of a copy of everything potentially relevant from the OLTP database and flat files, data marts contain only the information needed to answer specific questions for a specific audience (for example, sales performance data used by the sales division). A dynamic data store is a hybrid of the galactic warehouse and the individual data mart,
since it includes all the data needed for all the data marts it supplies. If the dynamic data store contains nearly as much information as the OLTP source, you might not need the intermediate step of the dynamic data store. However, if the dynamic data store includes substantially less than all the data in the source databases and flat files, you should consider creating a DDS staging area. What kind of standards do you need to enforce in your data marts? Creating a DDS is an important technique in enforcing standards. If data marts depend on the DDS for information, you can provide that data in the range and format you want everyone to use. For example, if you want all data marts to include the same information on customers, you can put all the data needed for this standard customer profile in the DDS. Any data mart that reads customer data from the DDS should include all the information in this profile. How often do you update the contents of the DDS? If you plan to frequently update data in data marts, you need to update the contents of the DDS at least as often as you update the individual data marts that the DDS feeds. You may find it easier to read data directly from source databases and flat file systems if it becomes burdensome to update the DDS fast enough to keep up with the needs of individual data marts. Or, if particular data marts need updates significantly faster than others, you can bypass the DDS for these fast update data marts. Is the data in the DDS simply a copy of data from source systems, or do you plan to reformat this information before storing it in the DDS? One advantage of the dynamic data store is that, if you plan on reformatting information in the same fashion for several data marts, you only need to format it once for the dynamic data store. Part of this question is whether you keep the data normalized when you copy it to the DDS. How often do you need to join data from different systems? On occasion, you may need to join records queried from different databases or read from different flat file systems. The more frequently you need to perform this type of heterogeneous join, the more advantageous it would be to perform all such joins within the DDS, then make the results available to all data marts that use the DDS as a source.
Shortcuts that reference objects in the global repository are called global shortcuts. We use the Designer to create shortcuts.
What is metadata?
Designing a data mart involves writing and storing a complex set of instructions. You need to know where to get data (sources), how to change it, and where to write the information (targets). PowerMart and PowerCenter call this set of instructions metadata. Each piece of metadata (for example, the description of a source table in an operational database) can contain comments about it. In summary, Metadata can include information such as mappings describing how to transform source data, sessions indicating when you want the Informatica Server to perform the transformations, and connect strings for sources and targets.
What is an ER Diagram?
ER stands for entity relationship diagram. It is the first step in the design of a data model, which will later lead to the physical database design of, possibly, an OLTP or OLAP database.
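As a small illustration of how an ER model turns into a physical design (the CUSTOMER and ORDERS entities here are hypothetical), each entity becomes a table and the relationship between them becomes a foreign key:

-- One customer places many orders (one-to-many relationship).
CREATE TABLE customer (
  customer_id   NUMBER PRIMARY KEY,
  customer_name VARCHAR2(100)
);

CREATE TABLE orders (
  order_id    NUMBER PRIMARY KEY,
  customer_id NUMBER REFERENCES customer(customer_id),
  order_date  DATE
);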
19. Is it possible to execute workflows in different repositories at the same time using the same Informatica Server?
21. How do you parse characters using functions in the Expression transformation? For example, if a column has a value like mgr=a and I have to parse out the characters 'mgr=', which function should I use?
25. What is an ODS? What is the purpose of an ODS? Is it a logical database that stores extracted data from source systems?
27. We can insert or update the rows without using the update strategy. Then what is the necessity of the update strategy?
29. What is the purpose of using UNIX commands in Informatica? Which UNIX commands are generally used?
Starting with Oracle 8.1.5, introduced in March 1999, you can have a materialized view, also known as a summary. Like a regular view, a materialized view can be used to build a black-box abstraction for the programmer. In other words, the view might be created with a complicated JOIN, or an expensive GROUP BY with sums and averages. With a regular view, this expensive operation would be done every time you issued a query. With a materialized view, the expensive operation is done when the view is created and thus an individual query need not involve substantial computation. Materialized views consume space because Oracle is keeping a copy of the data or at least a copy of information derivable from the data. More importantly, a materialized view does not contain up-to-the-minute information. When you query a regular view, your results includes changes made up to the last committed transaction before your SELECT. When you query a materialized view, you're getting results as of the time that the view was created or refreshed. Note that Oracle lets you specify a refresh interval at which the materialized view will automatically be refreshed. At this point, you'd expect an experienced Oracle user to say "Hey, these aren't new. This is the old CREATE SNAPSHOT facility that we used to keep semi-up-to-date copies of tables on machines across the network!" What is new with materialized views is that you can create them with the ENABLE QUERY REWRITE option. This authorizes the SQL parser to look at a query involving aggregates or JOINs and go to the materialized view instead. Consider the following query, from the ArsDigita Community System's /admin/users/registration-history.tcl page:
select to_char(registration_date,'YYYYMM') as sort_key,
       rtrim(to_char(registration_date,'Month')) as pretty_month,
       to_char(registration_date,'YYYY') as pretty_year,
       count(*) as n_new
from users
group by to_char(registration_date,'YYYYMM'),
         to_char(registration_date,'Month'),
         to_char(registration_date,'YYYY')
order by 1;

SORT_K PRETTY_MO PRET      N_NEW
------ --------- ---- ----------
199805 May       1998        898
199806 June      1998        806
199807 July      1998        972
199808 August    1998        849
199809 September 1998       1023
199810 October   1998       1089
199811 November  1998       1005
199812 December  1998       1059
199901 January   1999       1488
199902 February  1999       2148

For each month, we have a count of how many users registered at photo.net. To execute the query, Oracle must sequentially scan the users table. If the users table grew large and you wanted the query to be instant, you'd sacrifice some timeliness in the stats with

create materialized view users_by_month
enable query rewrite
refresh complete
start with 1999-03-28 next sysdate + 1
as
select to_char(registration_date,'YYYYMM') as sort_key,
       rtrim(to_char(registration_date,'Month')) as pretty_month,
       to_char(registration_date,'YYYY') as pretty_year,
       count(*) as n_new
from users
group by to_char(registration_date,'YYYYMM'),
         to_char(registration_date,'Month'),
         to_char(registration_date,'YYYY')
order by 1

Oracle will build this view just after midnight on March 28, 1999. The view will be refreshed every 24 hours after that. Because of the enable query rewrite clause, Oracle will feel free to grab data from the view even when a user's query does not mention the view. For example, given the query

select count(*)
from users
where rtrim(to_char(registration_date,'Month')) = 'January'
and to_char(registration_date,'YYYY') = '1999'

Oracle would ignore the users table altogether and pull information from users_by_month. This would give the same result with much less work. Suppose that the current month is March 1999, though. The query

select count(*)
from users
where rtrim(to_char(registration_date,'Month')) = 'March'
and to_char(registration_date,'YYYY') = '1999'
will also hit the materialized view rather than the users table and hence will miss anyone who has registered since midnight (i.e., the query rewriting will cause a different result to be returned).
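One way to check whether a particular query is actually being rewritten against the materialized view is to look at its execution plan. A sketch (this assumes a PLAN_TABLE already exists, for example one created by Oracle's utlxplan.sql script, and the exact plan operation names vary by Oracle release):

EXPLAIN PLAN FOR
select count(*)
from users
where rtrim(to_char(registration_date,'Month')) = 'January'
and to_char(registration_date,'YYYY') = '1999';

-- If query rewrite kicked in, the plan references USERS_BY_MONTH
-- instead of a full scan of USERS.
SELECT operation, options, object_name
FROM plan_table;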
If we have 5 flat files in different locations on the server and we need to load them into a single target table, which transformations will you use? This can be handled by using the file list in Informatica. In the session properties we need to change the source file type to Indirect. (Choose Direct if the source file contains the source data. Choose Indirect if the source file contains a list of files. When you select Indirect, the PowerCenter Server finds the file list and then reads each listed file when it executes the session.) Take a notepad file, list the following paths and file names in it, and save it as emp_source.txt in the directory /ftp_data/webrep/:
/ftp_data/webrep/SrcFiles/abc.txt
/ftp_data/webrep/bcd.txt
/ftp_data/webrep/srcfilesforsessions/xyz.txt
/ftp_data/webrep/SrcFiles/uvw.txt
/ftp_data/webrep/pqr.txt
In the session properties, give /ftp_data/webrep/ as the source file directory, emp_source.txt as the file name, and Indirect as the source file type.
Which one is faster, and which one is best in Informatica PowerCenter 8.1/8.5? I guess you are asking about the tracing level. When you configure a transformation, you can set the amount of detail the Integration Service writes in the session log. PowerCenter 8.x supports 4 tracing levels: 1. Normal: the Integration Service logs initialization and status information, errors encountered, and rows skipped due to transformation row errors. It summarizes session results, but not at the level of individual rows. 2. Terse: the Integration Service logs initialization information, error messages, and notification of rejected data. 3. Verbose Initialization: in addition to normal tracing, the Integration Service logs additional initialization details, names of index and data files used, and detailed transformation statistics. 4. Verbose Data: in addition to verbose initialization tracing, the Integration Service logs each row that passes into the mapping. It also notes where the Integration Service truncates string data to fit the precision of a column and provides detailed transformation statistics. It allows the Integration Service to write errors to both the session log and error log when you enable row error logging. When you configure the tracing level to Verbose Data, the Integration Service writes row data for all rows in a block when it processes a transformation. By default, the tracing level for every transformation is Normal. In which situation do we use an unconnected lookup? An unconnected lookup should be used when we need to call the same lookup multiple times in one mapping. For example, in a parent-child relationship you need to pass multiple child IDs to get the respective parent IDs. One can argue that this can be achieved by creating a reusable lookup as well. That's true, but reusable components are created when the need is across mappings and not within one mapping. Also, if we use a connected lookup multiple times in a mapping, by default the cache would be persistent.
How do you handle error logic in Informatica? What are the transformations that you used while handling errors? How did you reload those error records into the target? The bad files contain column indicators and row indicators. Row indicator: it generally comes into play when working with the Update Strategy transformation; the writer/target rejects the rows going to the target. Column indicators: D - valid, O - overflow, N - null, T - truncated. When the data contains nulls or overflows, the rows are rejected instead of being written to the target. The rejected data is stored in reject files. You can check the data and reload it into the target using the reject-reload utility. What happens if you turn off versioning? You would not be able to track the changes made to the respective mappings/sessions/workflows. What are the DTM buffer size and the default buffer block size? If a performance issue happens in a session, which one do we have to increase and which one do we have to decrease? The DTM buffer size is the memory you allocate to the DTM process (default 12 MB). The buffer block size is based on the size of the largest source/target row times the number of rows that can be moved at a time (default 64 KB), and by default Informatica sizes the buffer memory to handle about 83 sources and targets. So we should increase or decrease the sizes accordingly: if there are more than 83 sources and targets, we should increase the DTM buffer size, and if the source or target rows are heavy (wide), we should increase the buffer block size. What is the difference between source-based and target-based commit? If we set the target-based commit interval to 1,000, the Informatica Server will issue a commit for every 1,000 rows written to the target table. If we set a source-based commit interval of 1,000, and due to transformation logic 500 rows are dropped, then only 500 rows will be inserted into the target table; the Informatica Server will issue the commit on those 500 rows. What are the transformations not used in a mapplet and why?
A mapplet can't be used in another mapplet. If you try to drag and drop a mapplet from the mapplet subfolder on the left-hand side into the Mapplet Designer workspace, it won't allow you to do so, but if you try to drag and drop a mapplet into a mapping, i.e., in the Mapping Designer, then it does come into the workspace. This means a mapplet can only be used in a mapping but can't be used in another mapplet. That's why a mapplet is known as the reusable form of a mapping. For the SQ transformation, when I am writing a custom query, do I need to have all the FROM tables as part of the mapping? That is, say I have 3 FROM tables in the custom query, do I need to import all 3 tables into the mapping? All 3 tables are from the same database schema. There is no need to import all the tables; just take care that the field names, field order, lengths, and datatypes match, and define the join conditions properly between the tables as part of the custom query.
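For illustration only (the EMPLOYEES, DEPARTMENTS and LOCATIONS tables and their columns are hypothetical), a Source Qualifier SQL override joining three tables from the same schema might look like the sketch below; the point is that the SELECT list must line up, in order and datatype, with the Source Qualifier ports:

SELECT e.emp_id,
       e.emp_name,
       d.dept_name,
       l.city
FROM employees e,
     departments d,
     locations l
WHERE e.dept_id = d.dept_id
AND d.location_id = l.location_id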
Global Repository: The global repository is used to share objects across the repositories in a domain. The objects are shared through global shortcuts. Local Repository: A local repository is within a domain and is not a global repository. A local repository can connect to a global repository using global shortcuts and can use objects in its shared folders. Versioned Repository: This can be either a local or a global repository, but it allows version control for the repository. A versioned repository can store multiple copies, or versions, of an object. This feature allows you to efficiently develop, test and deploy metadata into the production environment. Q. What is a code page? A. A code page contains the encoding to specify characters in a set of one or more languages. The code page is selected based on the source of the data. For example, if the source contains Japanese text, then the code page should be selected to support Japanese text. When a code page is chosen, the program or application for which the code page is set refers to a specific set of data that describes the characters the application recognizes. This influences the way that application stores, receives, and sends character data. Q. Which databases can PowerCenter Server on Windows connect to? A. PowerCenter Server on Windows can connect to the following databases: IBM DB2, Informix, Microsoft Access, Microsoft Excel, Microsoft SQL Server, Oracle, Sybase, Teradata. Q. Which databases can PowerCenter Server on UNIX connect to? A. PowerCenter Server on UNIX can connect to the following databases: IBM DB2, Informix, Oracle, Sybase, Teradata.
Informatica Mapping Designer
Q. How to execute a PL/SQL script from an Informatica mapping? A. The Stored Procedure (SP) transformation can be used to execute PL/SQL scripts. In the SP transformation the PL/SQL procedure name can be specified. Whenever the session is executed, the session will call the PL/SQL procedure. Q. How can you define a transformation? What are the different types of transformations available in Informatica? A. A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data. Below are the various transformations available in Informatica: Aggregator, Application Source Qualifier, Custom, Expression, External Procedure, Filter, Input, Joiner, Lookup, Normalizer, Output, Rank, Router,
Sequence Generator, Sorter, Source Qualifier, Stored Procedure, Transaction Control, Union, Update Strategy, XML Generator, XML Parser, XML Source Qualifier. Q. What is a Source Qualifier? What is meant by Query Override? A. The Source Qualifier represents the rows that the PowerCenter Server reads from a relational or flat file source when it runs a session. When a relational or a flat file source definition is added to a mapping, it is connected to a Source Qualifier transformation. The PowerCenter Server generates a query for each Source Qualifier transformation whenever it runs the session. The default query is a SELECT statement containing all the source columns. The Source Qualifier has the capability to override this default query by changing the default settings of the transformation properties. The list of selected ports, or the order in which they appear in the default query, should not be changed in the overridden query. Q. What is an Aggregator transformation? A. The Aggregator transformation allows performing aggregate calculations, such as averages and sums. Unlike the Expression transformation, the Aggregator transformation can only be used to perform calculations on groups. The Expression transformation permits calculations on a row-by-row basis only. The Aggregator transformation contains group-by ports that indicate how to group the data. While grouping the data, the Aggregator transformation outputs the last row of each group unless otherwise specified in the transformation properties. The various group-by functions available in Informatica are: AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM, VARIANCE. Q. What is Incremental Aggregation? A. Whenever a session is created for a mapping with an Aggregator transformation, the session option for Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform new aggregation calculations incrementally. Q. How is the Union transformation used? A. The Union transformation is a multiple input group transformation that can be used to merge data from various sources (or pipelines). This transformation works just like a UNION ALL statement in SQL that is used to combine the result sets of two SELECT statements (see the SQL sketch at the end of this set of questions). Q. Can two flat files be joined with the Joiner transformation? A. Yes, the Joiner transformation can be used to join data from two flat file sources. Q. What is a Lookup transformation? A. This transformation is used to look up data in a flat file or a relational table, view, or synonym. It compares Lookup transformation ports (input ports) to the source column values based on the lookup condition. Later, the returned values can be passed to other transformations. Q. Can a lookup be done on flat files? A. Yes. Q. What is the difference between a connected lookup and an unconnected lookup? A. A connected lookup takes input values directly from other transformations in the pipeline. An unconnected lookup doesn't take inputs directly from any other transformation, but it can be used in any transformation (like an Expression) and can be invoked as a function using a :LKP expression. So, an unconnected lookup can be called multiple times in a mapping.
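In plain SQL terms (hypothetical table names), the Aggregator corresponds to a GROUP BY query and the Union transformation to a UNION ALL, roughly as follows:

-- Aggregator: one output row per group, with aggregate functions.
SELECT dept_id, AVG(salary) AS avg_sal, SUM(salary) AS total_sal
FROM employees
GROUP BY dept_id;

-- Union transformation: merges pipelines like UNION ALL (duplicates are kept).
SELECT cust_id, cust_name FROM customers_us
UNION ALL
SELECT cust_id, cust_name FROM customers_eu;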
Q. What is the main difference between Data Warehousing and Business Intelligence? The differences are: DW is a way of storing data and creating information through leveraging data marts. Data marts (DMs) are segments or categories of information and/or data that are grouped together to provide 'information' into that segment or category. DW does not require BI to work; reporting tools can generate reports from the DW. BI is the leveraging of the DW to help make business decisions and recommendations. Information and data rules engines are leveraged here to help make these decisions, along with statistical analysis tools and data mining tools. Q. What is data modeling? Q. What are the different steps for data modeling? Q. What are the data modeling tools you have used? (Polaris) Q. What is a Physical data model? During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database, including tables and constraints. Q. What is a Logical data model? A logical design is a conceptual and abstract design. We do not deal with the physical implementation details yet; we deal only with defining the types of information that we need. The process of logical design involves arranging data into a series of logical relationships called entities and attributes. Q. What are an Entity, an Attribute and a Relationship? An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity and helps define the uniqueness of the entity. In relational databases, an attribute maps to a column. The entities are linked together using relationships. Q. What are the different types of Relationships? Entity-Relationship. Q. What is the difference between Cardinality and Nullability? Q. What is Forward, Reverse and Re-engineering? Q. What is meant by Normalization and De-normalization? Q. What are the different forms of Normalization? Q. What is an ETL or ETT? And what are the different types? ETL is the data warehouse acquisition process of Extracting, Transforming (or Transporting) and Loading (ETL) data from source systems into the data warehouse.
E.g. Oracle Warehouse Builder, PowerMart. Q. Explain the Extraction process. (Polaris, Mascot) Q. How do you extract data from different data sources? Explain with an example. (Polaris) Q. What are the reporting tools you have used? What is the difference between them? (Polaris) Q. How do you automate the Extraction process? (Polaris) Q. Without using an ETL tool, can you prepare and maintain a Data Warehouse? (Polaris) Q. How do you identify the changed records in operational data? (Polaris) Q. What is a Star Schema? A star schema is a set of tables comprised of a single, central fact table surrounded by denormalized dimensions. Each dimension is represented in a single table. Star schemas implement dimensional data structures with de-normalized dimensions. The snowflake schema is an alternative to the star schema. A star schema is a relational database schema for representing multidimensional data: the data is stored in a central fact table, with one or more tables holding information on each dimension. Dimensions have levels, and all levels are usually shown as columns in each dimension table. Q. What is a Snowflake Schema? A snowflake schema is a set of tables comprised of a single, central fact table surrounded by normalized dimension hierarchies. Each dimension level is represented in a table. Snowflake schemas implement dimensional data structures with fully normalized dimensions. The star schema is an alternative to the snowflake schema. An example would be to break down the Time dimension and create tables for each level: years, quarters, months, weeks, days. These additional branches on the ERD create more of a snowflake shape than a star. Q. What is a Very Large Database? Q. What are SMP and MPP? Symmetric multi-processors (SMP). Q. What is data mining? Data Mining is the process of automated extraction of predictive information from large databases. It predicts future trends and finds behaviour that the experts may miss as it lies beyond their expectations. Data Mining is part of a larger process called knowledge discovery; specifically, the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships. Data mining can be defined as "a decision support process in which we search for patterns of information in data." This search may be done just by the user, i.e. just by performing queries, in which case it is quite hard and in most cases not comprehensive enough to reveal intricate patterns. Data mining uses sophisticated statistical analysis and modeling techniques to uncover such patterns and relationships hidden in organizational databases - patterns that ordinary methods might miss. Once found, the information needs to be presented in a suitable form, with graphs, reports, etc. Q. What is OLAP? (Mascot) OLAP is software for manipulating multidimensional data from a variety of sources. The data is often stored in a data warehouse. OLAP software helps a user create queries, views,
representations and reports. OLAP tools can provide a "front-end" for a data-driven DSS. On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end-user analytical and navigational activities. Q. What are the different types of OLAP? What are their differences? (Mascot) OLAP - Desktop OLAP (Cognos), ROLAP, MOLAP (Oracle Discoverer). ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Processing) applications. ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a Relational Database (RDBMS) like Oracle. The RDBMS stores data at a fine-grain level, so response times are usually slow. MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is stored in a Multi-dimensional database (MDBMS) like Oracle Express Server. In a MOLAP system a lot of queries have a finite answer and performance is usually critical and fast. HOLAP stands for Hybrid OLAP; it is a combination of both worlds. Seagate Software's Holos is an example HOLAP environment. In a HOLAP system one will find queries on aggregated data as well as on detailed data. DOLAP stands for Desktop OLAP. Q. What is the difference between data warehousing and OLAP? The terms data warehousing and OLAP are often used interchangeably. As the definitions suggest, warehousing refers to the organization and storage of data from a variety of sources so that it can be analyzed and retrieved easily. OLAP deals with the software and the process of analyzing data, managing aggregations, and partitioning information into cubes for in-depth analysis, retrieval and visualization. Some vendors are replacing the term OLAP with the terms analytical software and business intelligence. Q. What are the facilities provided by a data warehouse to analytical users? Q. What are the facilities provided by OLAP to analytical users? Q. What is a Histogram? How to generate statistics? Q. In Erwin, what are the different types of models? (Honeywell) Q. Many suppliers, many products: model the above scenario in Erwin. How many tables are there and what do they contain? (Honeywell) Q. What are the options available in the Erwin toolbox? (Honeywell) Q. Aggregate navigation. Q. What are the Data Warehouse Center administration functions? The functions of Visual Warehouse administration are: Creating Data Warehouse Center security groups. Defining Data Warehouse Center privileges for that group. Registering Data Warehouse Center users. Adding Data Warehouse Center users to security groups. Registering data sources.
Registering warehouses (targets). Creating subjects. Registering agents. Registering Data Warehouse Center programs. Q. How do I set the log level higher for more detailed information within Data Warehouse Center 7.2? Within DWC, log level capability can be set from 0 to 4. There is a log level 5, yet it cannot be turned on using the GUI, but must be turned on manually. A command line trace can be used for any trace level, and this is the only way to turn on a level 5 trace: Go to start, programs, IBM DB2, command line processor. Connect to the control database: db2 => connect to Control_Database_name Update the configuration table: db2 => update iwh.configuration set value_int = 5 where name = 'TRACELVL' and (component = '') Valid components are: Logger trace = log Agent trace = agent Server trace = RTK DDD = DDD ODBC = VWOdbc For multiple traces the format is: db2 => update iwh.configuration set value_int = 5 where name = 'TRACELVL' and (component = '' or component = '') Reset the connection: db2 => connect reset Stop and restart the Warehouse server and logger. Perform the failing operation. Be sure to reset the trace level to 0 using the command line when you are done: db2 => update iwh.configuration set value_int = 0 where name = 'TRACELVL' and (component = '') When you run a trace, the Data Warehouse Center writes information to text files. Data Warehouse Center programs that are called from steps also write any trace information to this directory. These files are located in the directory specified by the VWS_LOGGING environment variable. The default value of VWS_LOGGING is: Windows and OS/2 = x:\sqllib\logging UNIX = /var/IWH AS/400 = /QIBM/UserData/IWH For additional information, see basic logging function in the Data Warehouse Center
administration guide. Q. What types of data sources does Data Warehouse Center support? The Data Warehouse Center supports a wide variety of relational and non-relational data sources. You can populate your Data Warehouse Center warehouse with data from the following databases and files: Any DB2 family database Oracle Sybase Informix Microsoft SQL Server IBM DataJoiner Multiple Virtual Storage (OS/390), Virtual Machine (VM), and local area network (LAN) files IMS and Virtual Storage Access Method (VSAM) (with DataJoiner Classic Connect) Q. What is the Data Warehouse Center control database? When you install the warehouse server, the warehouse control database that you specify during installation is initialized. Initialization is the process in which the Data Warehouse Center creates the control tables that are required to store Data Warehouse Center metadata. If you have more than one warehouse control database, you can use the Data Warehouse Center --> Control Database Management window to initialize the second warehouse control database. However, only one warehouse control database can be active at a time. Q. What databases need to be registered as system ODBC data sources for the Data Warehouse Center? The databases that need to be registered as system ODBC data sources for the Data Warehouse Center are the source, target, and control databases. 1. What was the original business problem that led you to do this project? Whether the consultant is being hired to gather requirements or to customize an OLAP application, this question indicates that she's interested in the big picture. She'll keep the answer in mind as she does her work, which is a measure of quality assurance. 2. Where are you in your current implementation process? A consultant who asks this question knows not to make any assumptions about how much progress you've made. She probably also understands that you might be wrong. There are plenty of clients who have begun application development without having gathered requirements. Understanding where the client thinks he is is just as important as understanding where he wants to be. It also helps the consultant in making improvement suggestions or recommendations for additional skills or technologies. 3. How long do you see this position being filled by an external resource? While the question might seem self-serving at first, a good consultant is ever mindful of his responsibility to render himself dispensable over time. Your answer will give him a good idea of how much time he has to perform the work as well as to cross-train permanent staff within your organization. A variation on this question is: "Is there a dedicated person or group targeted for knowledge transfer in this area?" 4. What deliverables do you expect from this engagement? The consultant who doesn't ask about deliverables is the consultant who expects to sit around giving advice. Beware of the "ivory tower" consultants, who are too light for heavy work and too heavy for light work. Every consultant you talk to should expect to produce some sort of deliverable, be it a requirements document, a data model, HTML, a project plan, test procedures
or a mission statement. 5. Would you like to talk to a past client or two? The fact that a consultant would offer references is testimony that she knows her stuff. Many do not. Those consultants who hide behind nondisclosures for not giving references should be avoided. While it's often valid to deny prospective clients work samples because of confidentiality agreements, there's no good reason not to offer the name and phone number of someone who will sing the consultant's praises. Don't be satisfied with a reference for the entire firm. Many good firms can employ below-average consultants. Ask to talk to someone who's worked with the person or team you're considering. Once you've hired that consultant and are happy with his work, offer to be a reference. It comes around.
Implementing a concrete DWS is a complex task comprising two major phases. In the DWS configuration phase, a conceptual view of the warehouse is first specified according to user requirements (data warehouse design). Then, the involved data sources and the way data will be extracted and loaded into the warehouse (data acquisition) is determined. Finally, decisions about persistent storage of the warehouse using database technology and the various ways data will be accessed during analysis are made. After the initial load (the first load of the DWH according to the DWH configuration), during the DWS operation phase, warehouse data must be regularly refreshed, i.e., modifications of operational data since the last DWH refreshment must be propagated into the warehouse such that data stored in the DWH reflect the state of the underlying operational systems. Besides DWH refreshment, DWS operation includes further tasks like archiving and purging of DWH data or DWH monitoring. Q. What are the functional requirements for a data warehouse? A data warehouse must be able to support various types of information applications. Decision support processing is the principle type of information application in a data warehouse, but the use of a data warehouse is not restricted to a decision support system. It is possible that each information application has its own set of requirements in terms of data, the way that data is modeled, and the way it is used. The data warehouse is where these applications get their "consolidated data." A data warehouse must consolidate primitive data and it must provide all facilities to derive information from it, as required by the end-users. Detailed primitive data is of prime importance, but data volumes tend to be big and users usually require information derived from the primitive data. Data in a data warehouse must be organized such that it can be analyzed or explored from different angles. Analysis of the historical context (the time dimension) is of prime importance. Examples of other important contextual dimensions are geography, organization, products, suppliers, customers, and so on. Q. What are the characteristics of a data warehouse? Data in a data warehouse is organized as subject oriented rather than application oriented. It is designed and constructed as a non-volatile store of business data, transactions and events. Data warehouse is a logically integrated store of data originating from disparate operational sources. It is the only source for deriving information needed by the end users. Several temporal modeling styles are usually used in different areas of the data warehouse. Q. What are the characteristics of the data in a data warehouse? Data in the DWH is integrated from various, heterogeneous operational systems (like database systems, flat files, etc.) and further external data sources (like demographic and statistical databases, WWW, etc.). Before the integration, structural and semantic differences have to be reconciled, i.e., data have to be homogenized according to a uniform data model. Furthermore, data values from operational systems have to be cleaned in order to get correct data into the data warehouse. The need to access historical data (i.e., histories of warehouse data over a prolonged period of time) is one of the primary incentives for adopting the data warehouse approach. 
Historical data are necessary for business trend analysis which can be expressed in terms of understanding the differences between several views of the real-time data (e.g., profitability at the end of each month). Maintaining historical data means that periodical snapshots of the corresponding operational data are propagated and stored in the warehouse without overriding previous
warehouse states. However, the potential volume of historical data and the associated storage costs must always be considered in relation to their potential business benefits. Furthermore, warehouse data is mostly non-volatile, i.e., access to the DWH is typically readoriented. Modifications of the warehouse data takes place only when modifications of the source data are propagated into the warehouse. Finally, a data warehouse contains usually additional data, not explicitly stored in the operational sources, but derived through some process from operational data (called also derived data). For example, operational sales data could be stored in several aggregation levels (weekly, monthly, quarterly sales) in the warehouse. Q. When should a company consider implementing a data warehouse? Data warehouses or a more focused database called a data mart should be considered when a significant number of potential users are requesting access to a large amount of related historical information for analysis and reporting purposes. So-called active or real-time data warehouses can provide advanced decision support capabilities. Q. What data is stored in a data warehouse? In general, organized data about business transactions and business operations is stored in a data warehouse. But, any data used to manage a business or any type of data that has value to a business should be evaluated for storage in the warehouse. Some static data may be compiled for initial loading into the warehouse. Any data that comes from mainframe, client/server, or webbased systems can then be periodically loaded into the warehouse. The idea behind a data warehouse is to capture and maintain useful data in a central location. Once data is organized, managers and analysts can use software tools like OLAP to link different types of data together and potentially turn that data into valuable information that can be used for a variety of business decision support needs, including analysis, discovery, reporting and planning. Q. Database administrators (DBAs) have always said that having non-normalized or denormalized data is bad. Why is de-normalized data now okay when it's used for Decision Support? Normalization of a relational database for transaction processing avoids processing anomalies and results in the most efficient use of database storage. A data warehouse for Decision Support is not intended to achieve these same goals. For Data-driven Decision Support, the main concern is to provide information to the user as fast as possible. Because of this, storing data in a denormalized fashion, including storing redundant data and pre-summarizing data, provides the best retrieval results. Also, data warehouse data is usually static so anomolies will not occur from operations like add, delete and update a record or field. Q. How often should data be loaded into a data warehouse from transaction processing and other source systems? It all depends on the needs of the users, how fast data changes and the volume of information that is to be loaded into the data warehouse. It is common to schedule daily, weekly or monthly dumps from operational data stores during periods of low activity (for example, at night or on weekends). The longer the gap between loads, the longer the processing times for the load when it does run. A technical IS/IT staffer should make some calculations and consult with potential users to develop a schedule to load new data. Q. What are the benefits of data warehousing? 
Some of the potential benefits of putting data into a data warehouse include: 1. Improving turnaround time for data access and reporting;
2. Standardizing data across the organization so there will be one view of the "truth"; 3. Merging data from various source systems to create a more comprehensive information source; 4. Lowering costs to create and distribute information and reports; 5. Sharing data and allowing others to access and analyze the data; 6. Encouraging and improving fact-based decision making. Q. What are the limitations of data warehousing? The major limitations associated with data warehousing are related to user expectations, lack of data and poor data quality. Building a data warehouse creates some unrealistic expectations that need to be managed. A data warehouse doesn't meet all decision support needs. If needed data is not currently collected, transaction systems need to be altered to collect the data. If data quality is a problem, the problem should be corrected in the source system before the data warehouse is built. Software can provide only limited support for cleaning and transforming data. Missing and inaccurate data can not be "fixed" using software. Historical data can be collected manually, coded and "fixed", but at some point source systems need to provide quality data that can be loaded into the data warehouse without manual clerical intervention. Q. How does my company get started with data warehousing? Build one! The easiest way to get started with data warehousing is to analyze some existing transaction processing systems and see what type of historical trends and comparisons might be interesting to examine to support decision making. See if there is a "real" user need for integrating the data. If there is, then IS/IT staff can develop a data model for a new schema and load it with some current data and start creating a decision support data store using a database management system (DBMS). Find some software for query and reporting and build a decision support interface that's easy to use. Although the initial data warehouse/data-driven DSS may seem to meet only limited needs, it is a "first step". Start small and build more sophisticated systems based upon experience and successes.
Fact tables which do not have any facts are called factless fact tables. They may consist of nothing but keys. There are two kinds of fact tables that do not have any facts at all. The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless. E.g. a student tracking system that records each student attendance event each day. The second type of factless fact table is called a coverage table. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse. E.g. a sales fact table that records the sales of products in stores on particular days under each promotion condition. The sales fact table does answer many interesting questions but cannot answer questions about things that did not happen. For instance, it cannot answer the question "Which products were on promotion but did not sell?" because it contains only the records of products that did sell. In this case the coverage table comes to the rescue. A record is placed in the coverage table for each product in each store that is on promotion in each time period. Q. What is a causal dimension? A causal dimension is a kind of advisory dimension that should not change the fundamental grain of a fact table. E.g. why did the customer buy the product? It can be due to a promotion, a sale, etc. Q. What is meant by Drill Through? (Mascot) Operating Data Source - directly connects to the application database. Q. What is an Operational Data Store? (Mascot) Q. What is BI? And why do we need BI? Business Intelligence is an ongoing process of using various integration packages to analyze data. Q. What is slicing and dicing? How can we do it in Impromptu? (We cannot; it is done only in PowerPlay.) GENERAL Q. Explain the project. (Polaris) Explain the various projects (MIDAS2/VIP). Why was MIDAS2 or VIP or SCI developed? Q. What is the size of the database in your project? (Polaris) Approximately 900 GB. Q. What is the daily data volume (in GB/records)? Or what is the size of the data extracted in the extraction process? (Polaris) Q. How many data marts are there in your project? Q. How many Fact and Dimension tables are there in your project?
Q. What is the size of the Fact table in your project?
Q. How many dimension tables did you have in your project, and name some dimensions (columns)? (Mascot)
Q. Name some measures in your fact table. (Mascot)
Q. Why couldn't you go for a Snowflake schema? (Mascot)
Q. How many measures have you created? (Mascot)
Q. How many Fact & Dimension tables are there in your project? (Mascot)
Q. Have you created data marts? (Mascot)

Q. What is the difference between OLTP and OLAP?
OLAP - Online Analytical Processing: mainly required for DSS; data is kept in a denormalized manner, is mainly non-volatile, and is highly indexed to improve query response time.
OLTP - Online Transaction Processing: DML-heavy; highly normalized to reduce deadlocks and increase concurrency.

Q. What is the difference between OLTP and a data warehouse?
Operational System vs. Data Warehouse:
Transaction processing vs. query processing
Time sensitive vs. history oriented
Operator view vs. managerial view
Organized by transactions (Order, Input, Inventory) vs. organized by subject (Customer, Product)
Relatively smaller database vs. large database size
Many concurrent users vs. relatively few concurrent users
Volatile data vs. non-volatile data
Stores all data vs. stores relevant data
Not flexible vs. flexible

Q. Explain the DW life cycle.
Data warehouses can have many different types of life cycles with independent data marts. The following is an example of a data warehouse life cycle; in this example, four important steps are involved.
Extraction - As a first step, heterogeneous data from different online transaction processing systems is extracted. This data becomes the data source for the data warehouse.
Cleansing/transformation - The source data is sent into the populating systems where the data is cleansed, integrated, consolidated, secured and stored in the corporate or central data warehouse.
Distribution - From the central data warehouse, data is distributed to independent data marts specifically designed for the end user.
Analysis - From these data marts, data is sent to the end users, who access the data stored in the data mart depending upon their requirements.

Q. What is the life cycle of DW?
Getting data from OLTP systems from different data sources
Analysis & staging - putting in a staging layer: cleaning, purging, assigning surrogate keys, SCM,
dimensional modeling
Loading
Writing of metadata

Q. What are the different Reporting and ETL tools available in the market?

Q. What is a data warehouse?
A data warehouse is a database designed to support a broad range of decision tasks in a specific organization. It is usually batch updated and structured for rapid online queries and managerial summaries. Data warehouses contain large amounts of historical data, derived mainly from transaction data, but they can include data from other sources as well. A data warehouse is designed for query and analysis rather than for transaction processing. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. The term data warehousing is often used to describe the process of creating, managing and using a data warehouse.

Q. What is a data mart?
A data mart is a selected part of the data warehouse which supports the specific decision support application requirements of a company's department or geographical region. It usually contains simple replicates of warehouse partitions or data that has been further summarized or derived from base warehouse data. Instead of running ad hoc queries against a huge data warehouse, data marts allow the efficient execution of predicted queries over a significantly smaller database.

Q. How do I differentiate between a data warehouse and a data mart? (KPIT Infotech Pune, Mascot)
A data warehouse is for very large databases (VLDBs) and a data mart is for smaller databases. The difference lies in the scope of the things with which they deal. A data mart is an implementation of a data warehouse with a small, more tightly restricted scope of data and data warehouse functions. A data mart serves a single department or part of an organization; in other words, the scope of a data mart is smaller than that of the data warehouse. It is a data warehouse for a smaller group of end users, and its smaller scope also
improves performance.

Q. Dimensions and Facts.
Dimensional modeling begins by dividing the world into measurements and context. Measurements are usually numeric and taken repeatedly. Numeric measurements are facts. Facts are always surrounded by mostly textual context that is true at the moment the fact is recorded. Facts are very specific, well-defined numeric attributes. By contrast, the context surrounding the facts is open-ended and verbose. It is not uncommon for the designer to add context to a set of facts partway through the implementation. Dimensional modeling divides the world of data into two major types: measurements, and descriptions of the context surrounding those measurements. The measurements, which are typically numeric, are stored in fact tables, and the descriptions of the context, which are typically textual, are stored in the dimension tables. A fact table in a pure star schema consists of multiple foreign keys, each paired with a primary key in a dimension, together with the facts containing the measurements. Every foreign key in the fact table has a match to a unique primary key in the respective dimension (referential integrity). This allows a dimension table to possess primary keys that aren't found in the fact table; therefore, a product dimension table might be paired with a sales fact table in which some of the products are never sold. Dimensional models are full-fledged relational models, where the fact table is in third normal form and the dimension tables are in second normal form. The main difference between second and third normal form is that repeated entries are removed from a second normal form table and placed in their own "snowflake". Thus the act of removing the context from a fact record and creating dimension tables places the fact table in third normal form.
E.g. facts: Sales, Cost, Profit. E.g. dimensions: Customer, Product, Store, Time.

Q. What are Additive Facts? Or what is meant by an Additive Fact?
Fact tables are mostly very large, and we almost never fetch a single record into our answer set. We fetch a very large number of records on which we then do adding, counting, averaging, or taking the min or max; the most common of these is adding. Applications are simpler if they store facts in an additive format as often as possible. Thus, in the grocery example, we don't need to store the unit price; we compute the unit price by dividing the dollar sales by the unit sales whenever necessary.

Q. What is meant by averaging over time?
Some facts, like bank balances and inventory levels, represent intensities that are awkward to express in an additive format. We can treat these semi-additive facts as if they were additive, but just before presenting the results to the end user we divide the answer by the number of time periods to get the right result. This technique is called averaging over time. (A SQL sketch of both of these calculations follows at the end of this block.)

Q. What is a Conformed Dimension?
When the enterprise decides to create a set of common labels across all the sources of data, the separate data mart teams (or a single centralized team) must sit down to create master dimensions that everyone will use for every data source. These master dimensions are called Conformed Dimensions. Two dimensions are conformed if the fields that you use as row headers have the same domain.
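The following is a minimal SQL sketch of the two calculations above, using hypothetical table and column names (a sales_fact table with additive dollar_sales and unit_sales, and an account_balance_fact table with a semi-additive monthly balance):

SELECT product_key,
       SUM(dollar_sales) AS total_dollar_sales,                -- additive fact: safe to SUM
       SUM(unit_sales) AS total_unit_sales,                    -- additive fact: safe to SUM
       SUM(dollar_sales) / SUM(unit_sales) AS avg_unit_price   -- unit price derived at query time, not stored
FROM sales_fact
GROUP BY product_key;

-- Semi-additive fact: a balance can be summed across accounts but not across time,
-- so divide by the number of time periods ("averaging over time").
SELECT account_key,
       SUM(balance) / COUNT(DISTINCT month_key) AS avg_monthly_balance
FROM account_balance_fact
GROUP BY account_key;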
Q. What is a Conformed Fact?
If the definitions of measurements (facts) are highly consistent, we call them Conformed Facts.

Q. What are the 3 important fundamental themes in a data warehouse?
The 3 most important fundamental themes are: 1. Drilling Down, 2. Drilling Across and 3. Handling Time.

Q. What is meant by Drilling Down?
Drilling down means nothing more than "give me more detail". Drilling down in a relational database means adding a row header to an existing SELECT statement. For instance, if you are analyzing the sales of products at a manufacturer level, the select list of the query reads: SELECT MANUFACTURER, SUM(SALES). If you wish to drill down on the list of manufacturers to show the brands sold, you add the BRAND row header: SELECT MANUFACTURER, BRAND, SUM(SALES). Now each manufacturer row expands into multiple rows listing all the brands sold. This is the essence of drilling down. We often call a row header a grouping column, because everything in the select list that is not aggregated with an operator such as SUM must be mentioned in the SQL GROUP BY clause. So the GROUP BY clause in the second query reads: GROUP BY MANUFACTURER, BRAND.

Q. What is meant by Drilling Across?
Drilling across adds more data to an existing row. If drilling down is requesting ever finer and more granular data from the same fact table, then drilling across is the process of linking two or more fact tables at the same granularity, or, in other words, tables with the same set of grouping columns and dimensional constraints. A drill-across report can be created by using grouping columns that apply to all the fact tables used in the report. The new fact table called for in the drill-across operation must share certain dimensions with the fact table in the original query; all fact tables in a drill-across query must use conformed dimensions. (A SQL sketch of a drill-across query follows at the end of this block.)

Q. What is the significance of handling time?
For example, when a customer moves from a property, we might want to know: 1. who the new customer is, 2. when the old customer moved out, 3. when the new customer moved in, 4. how long the property was empty, etc.

Q. What is meant by Drilling Up?
If drilling down is adding grouping columns from the dimension tables, then drilling up is subtracting grouping columns.
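A minimal SQL sketch of a drill-across query, assuming two hypothetical fact tables (orders_fact and shipments_fact) that share a conformed product dimension; each fact table is aggregated separately at the same grain, and the two answer sets are then joined on the common row header:

SELECT o.product_name, o.total_ordered, s.total_shipped
FROM (SELECT p.product_name, SUM(f.order_qty) AS total_ordered
      FROM orders_fact f
      JOIN product_dim p ON p.product_key = f.product_key
      GROUP BY p.product_name) o
LEFT JOIN
     (SELECT p.product_name, SUM(f.ship_qty) AS total_shipped
      FROM shipments_fact f
      JOIN product_dim p ON p.product_key = f.product_key
      GROUP BY p.product_name) s
  ON s.product_name = o.product_name;

In practice the client tool often runs the two aggregate queries separately and outer joins the answer sets; the single statement above expresses the same idea.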
Q. What is meant by Drilling Around?
The final variant of drilling is drilling around a value circle. This is similar to the linear value chain shown in the previous example, but it occurs in a data warehouse where the related fact tables that share common dimensions are not arranged in a linear order. The best example is from health care, where as many as 10 separate entities process patient encounters and share this information with one another - a typical health care value circle with 10 separate entities surrounding the patient. When the common dimensions are conformed and the requested grouping columns are drawn from dimensions that tie to all the fact tables in a given report, you can generate really powerful drill-around reports by performing separate queries on each fact table and outer joining the answer sets in the client tool.

Q. What are the important fields in a recommended Time dimension table?
Time_key, Day_of_week, Day_number_in_month, Day_number_overall, Month, Month_number_overall, Quarter, Fiscal_period, Season, Holiday_flag, Weekday_flag, Last_day_in_month_flag.

Q. Why have a timestamp as a surrogate key rather than a real date?
The timestamp in a fact table should be a surrogate key instead of a real date because:
the rare timestamp that is inapplicable, corrupted, or hasn't happened yet needs a value that cannot be a real date;
most end-user calendar navigation constraints, such as fiscal periods, end-of-periods, holidays, day numbers and week numbers, aren't supported by database timestamps;
integer time keys take up much less disk space than full dates.

Q. Why have more than one fact table instead of a single fact table?
We cannot combine all of the business processes into a single fact table because:
the separate fact tables in the value chain do not share all the dimensions - you simply can't put the customer ship-to dimension on the finished goods inventory data;
each fact table possesses different facts, and the fact table records are recorded at different times along the value chain.

Q. What is meant by Slowly Changing Dimensions, and what are the different types of SCDs? (Mascot)
Dimensions don't change in predictable ways. Individual customers and products evolve slowly and episodically. Some of the changes are true physical changes: customers change their addresses because they move; a product is manufactured with different packaging. Other changes are actually corrections of mistakes in the data. And finally, some changes are changes in how we label a product or customer and are more a matter of opinion than physical reality. We call these variations Slowly Changing Dimensions (SCDs).
The 3 fundamental choices for handling a slowly changing dimension are:
Overwrite the changed attribute, thereby destroying previous history (useful when correcting an error).
Issue a new record for the customer, keeping the customer natural key but creating a new surrogate primary key.
Create an additional field in the existing customer record, and store the old value of the attribute in the additional field.

Overwrite the original attribute field - A Type 1 SCD is an overwrite of a dimensional attribute; history is definitely lost. We overwrite when we are correcting an error in the data or when we truly don't want to save history.

A Type 2 SCD creates a new dimension record and requires a generalized or surrogate key for the dimension. We create surrogate keys when a true physical change occurs in a dimension entity at a specific point in time, such as the customer address change or the product packaging change. We often add a timestamp and a reason code in the dimension record to precisely describe the change. The Type 2 SCD records changes in the values of dimensional entity attributes over time. The technique requires adding a new row to the dimension each time there is a change in the value of an attribute (or group of attributes) and assigning a unique surrogate key to the new row.

A Type 3 SCD adds a new field in the dimension record but does not create a new record. We might change the designation of the customer's sales territory because we redraw the sales territory map, or we arbitrarily change the category of the product from confectionary to candy. In both cases, we augment the original dimension attribute with an "old" attribute so we can switch between these alternate realities.

Q. What are the techniques for handling SCDs?
Overwriting; creating another dimension record; creating a current value field.

Q. What is a Surrogate Key and where do you use it? (Mascot)
A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key. It is just a unique identifier or number for each row that can be used as the primary key of the table. It is useful because the natural primary key (e.g. Customer Number in the Customer table) can change, and this makes updates more difficult. Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.
Another benefit you get from surrogate keys (SIDs) is in tracking SCDs - Slowly Changing Dimensions. A classical example: on the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that is what would be in your Employee dimension), and this employee has turnover allocated to him on Business Unit 'BU1'. But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the
old one should belong to Business Unit 'BU1'. If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'. If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee dimension with a new surrogate key. This way, in your fact table, your old data (before the 2nd of June) carries the SID of the Employee 'E1' + 'BU1', and all new data (after the 2nd of June) takes the SID of the Employee 'E1' + 'BU2'. You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the Employee was Employee Code 'E1', but for you it becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2' - so you need another id. The difference with the natural key enlargement process is that you might not have all parts of your new key within your fact table, so you might not be able to do the join on the new enlarged key. Every join between dimension tables and fact tables in a data warehouse environment should be based on surrogate keys, not natural keys.
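A minimal SQL sketch of this Type 2 / surrogate-key approach, using hypothetical table and column names (an emp_dim dimension keyed by a surrogate emp_sid, and a turnover_fact table that stores emp_sid rather than the natural employee code):

-- Two dimension rows for employee 'E1', one per version, each with its own surrogate key
INSERT INTO emp_dim (emp_sid, emp_code, business_unit, effective_from, effective_to)
VALUES (101, 'E1', 'BU1', DATE '2002-01-01', DATE '2002-06-01');
INSERT INTO emp_dim (emp_sid, emp_code, business_unit, effective_from, effective_to)
VALUES (102, 'E1', 'BU2', DATE '2002-06-02', NULL);

-- Fact rows reference the surrogate key, so turnover loaded before the 2nd of June
-- (emp_sid = 101) stays with 'BU1', and turnover loaded afterwards (emp_sid = 102)
-- is allocated to 'BU2'.
SELECT d.business_unit, SUM(f.turnover) AS total_turnover
FROM turnover_fact f
JOIN emp_dim d ON d.emp_sid = f.emp_sid
WHERE d.emp_code = 'E1'
GROUP BY d.business_unit;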
What are the Tracking levels in Informatica transformations? Which one is efficient, which one is faster, and which one is best in Informatica PowerCenter 8.1/8.5?
Answer: If you are asking about the tracing level: when you configure a transformation, you can set the amount of detail the Integration Service writes in the session log. PowerCenter 8.x supports 4 tracing levels:
1. Normal: the Integration Service logs initialization and status information, errors encountered, and rows skipped due to transformation row errors. It summarizes session results, but not at the level of individual rows.
2. Terse: the Integration Service logs initialization information, error messages, and notifications of rejected data.
3. Verbose Initialization: in addition to normal tracing, the Integration Service logs additional initialization details, the names of index and data files used, and detailed transformation statistics.
4. Verbose Data: in addition to verbose initialization tracing, the Integration Service logs each row that passes into the mapping. It also notes where the Integration Service truncates string data to fit the precision of a column, and provides detailed transformation statistics. It allows the Integration Service to write errors to both the session log and the error log when you enable row error logging. When you configure the tracing level to Verbose Data, the Integration Service writes row data for all rows in a block when it processes a transformation.
In general, the less detail a level logs, the lower its overhead: Terse and Normal are the fastest, while Verbose Data is the slowest and is normally used only for debugging small volumes of data.
Explain why it is bad practice to place a Transaction Control transformation upstream from a SQL transformation.
Answer: The SQL transformation drops any incoming transaction boundaries.

Why should the input pipelines to the Joiner not contain an Update Strategy transformation?
Answer: The Update Strategy flags each row for Insert, Update, Delete or Reject. When it is used before a Joiner, the Joiner drops all the flagging details.
This is a curious question, but it is hard to imagine how one would deal with a scenario in which both of the Joiner's incoming pipelines contain an Update Strategy. In that scenario it would be very complicated to join rows flagged for different database operations (Update, Insert, Delete) and then decide which operation to perform. To avoid this, Informatica prohibits the Update Strategy transformation from being used before a Joiner transformation.

The Router is a passive transformation, but one may argue that it is passive only because, if we use the default group alone, there is no change in the number of rows. What explanation will you give?

In which situation do we use an Update Strategy?
Answer: An Update Strategy can be used whenever we need to update some existing value in the database. For an Update Strategy we need to have a primary key.

In which situation do we use an unconnected lookup?
Answer: An unconnected lookup should be used when we need to call the same lookup multiple times in one mapping. For example, in a parent-child relationship you need to pass multiple child ids to get the respective parent ids. One can argue that this can also be achieved by creating a reusable lookup. That's true, but reusable components are created when the need is across mappings, not within one mapping. Also, if we use a connected lookup multiple times in a mapping, by default the cache would be persistent. (A sketch of the unconnected lookup call syntax follows after this set of questions.)

What are the possible dependency problems while running a session?
Answer: Dependency problems arise when the output of one process is the input to another process. If the first process stops, it causes a problem for, or stops, the dependent process: one process depends on the other, so if one process is affected, the other is affected as well.
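For the unconnected-lookup question above, here is a minimal sketch of how such a lookup is typically called from an Expression transformation, assuming a hypothetical unconnected Lookup transformation named lkp_get_parent that takes a child id and returns the matching parent id:

-- expression for output port PARENT_ID_1
:LKP.lkp_get_parent(CHILD_ID_1)
-- expression for output port PARENT_ID_2
:LKP.lkp_get_parent(CHILD_ID_2)

Because the lookup is unconnected, the same transformation (and its cache) can be invoked from as many expressions as needed within the one mapping.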
How do you handle error logic in Informatica? What are the transformations that you used while handling errors? How did you reload those error records into the target?
Answer: Bad files contain a row indicator and column indicators.
Row indicator: it generally comes into play when working with an Update Strategy transformation, when the writer/target rejects the rows going to the target.
Column indicator: D - valid, O - overflow, N - null, T - truncated.
When the data contains nulls or overflows, it is rejected instead of being written to the target. The rejected data is stored in the reject files. You can check the data and reload it into the target using the reject reload utility.
1. What is Data Driven?
Answer: The Informatica Server follows the instructions coded into Update Strategy transformations within the session mapping to determine how to flag records for insert, update, delete or reject. If you do not choose the Data Driven setting, the Informatica Server ignores all Update Strategy transformations in the mapping. In other words, if the Data Driven option is selected in the session properties, the server follows the instructions in the Update Strategy transformations in the mapping; otherwise it follows the treatment specified in the session.

2. What is the difference between the IIF and DECODE functions?
You can use nested IIF statements to test multiple conditions. The following example tests for various conditions and returns 0 if sales is zero or negative:
IIF( SALES > 0, IIF( SALES < 50, SALARY1, IIF( SALES < 100, SALARY2, IIF( SALES < 200, SALARY3, BONUS))), 0 )
You can use DECODE instead of IIF in many cases; DECODE may improve readability. The following shows how you can use DECODE instead of IIF:
DECODE( TRUE,
SALES > 0 and SALES < 50, SALARY1,
SALES > 49 AND SALES < 100, SALARY2,
SALES > 99 AND SALES < 200, SALARY3,
SALES > 199, BONUS)
In addition, the DECODE function can be used in a SQL statement, whereas an IIF statement cannot be used within a SQL statement.

What is the difference between Informatica 7.0 and 8.0?
The basic difference between Informatica 8.0 and Informatica 7.0 is that the 8.0 series introduced the PowerExchange concept; version 8 also added around 10 advanced transformations, of which the Java transformation is one of the main ones.
Features of Informatica 8 - the architecture of PowerCenter 8 has changed a lot:
1. PC8 is service-oriented for modularity, scalability and flexibility.
2. The Repository Service and Integration Service (as replacements for the Repository Server and Informatica Server) can be run on different computers in a network (so-called nodes), even redundantly.
3. Management is centralized, meaning services can be started and stopped on nodes via a central web interface.
4. Client tools access the repository via that centralized machine; resources are distributed dynamically.
5. Running all services on one machine is still possible, of course.
6. It has support for unstructured data, which includes spreadsheets, email, Microsoft Word files, presentations and PDF documents. It provides high availability and seamless failover, eliminating single points of failure.
7. It has added performance improvements (to bump up system performance, Informatica has added "pushdown optimization", which moves data transformation processing to the native relational database I/O engine whenever it is most appropriate).
8. Informatica has now added more tightly integrated data profiling, cleansing, and matching capabilities.
9. Informatica has added a new web-based administration console.
10. Ability to write a Custom transformation in C++ or Java.
11. The midstream SQL transformation was added in 8.1.1 (not in 8.1).
12. Dynamic configuration of caches and partitioning.
13. The Java transformation is introduced.
14. User-defined functions.
15. The PowerCenter 8 release has an "Append to Target file" feature.

What is the use of incremental aggregation? Explain briefly with an example.
It is a session option. When the Informatica Server performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform the new aggregation calculations incrementally. We use it for performance.

Explain error handling in Informatica.
Go to the session log file; there we will find information about the session initialization process, the errors encountered and the load summary, so by looking at the errors encountered while the session was running we can resolve them.
There is also a file called the bad file, which generally has the format *.bad and contains the records rejected by the Informatica Server. It has two kinds of indicators: one for the type of row and the other for the types of columns. The row indicator signifies what operation was going to take place (i.e. insertion, deletion, updation, etc.). The column indicators contain information regarding why the column has been rejected (such as violation of a not-null constraint, value error, overflow, etc.). If one rectifies the errors in the data present in the bad file and then reloads the data into the target, the table will contain only valid data.

What is the default join that the Source Qualifier provides?
Answer: An inner equijoin. (A sketch of the generated SQL appears at the end of this block.)

Can you start a batch within a batch?
Answer: No, you cannot. If you want to start a batch that resides within another batch, create a new independent batch and copy the necessary sessions into the new batch.
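For the default-join question above, a minimal sketch of the kind of SQL a Source Qualifier typically generates when two related relational sources are joined; the table and column names (ORDERS, CUSTOMERS, CUSTOMER_ID) are hypothetical:

SELECT ORDERS.ORDER_ID, ORDERS.CUSTOMER_ID, ORDERS.AMOUNT, CUSTOMERS.CUSTOMER_NAME
FROM ORDERS, CUSTOMERS
WHERE CUSTOMERS.CUSTOMER_ID = ORDERS.CUSTOMER_ID   -- default join: an equijoin on the primary key / foreign key columns

Rows with no matching key on either side are dropped, which is what makes the default join an inner equijoin; for other join types you would override it with a user-defined join or a SQL override.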