
Planning & Design: Requirement Analysis guidelines:

Allow 80% of the task effort to focus on immediate requirements. Spend the remaining 20% of the effort trying to understand what the future requirements are.
The source systems (accounting, ERP, sales, inventory…) that will provide information to the data warehouse need to be identified.
The business rules that need to be applied to the data need to be well understood.
If medium-term requirements are not well understood, the candidate architecture may be limited in its ability to evolve and satisfy longer-term requirements.
Produce the logical model for the data warehouse and an initial query profile. Expect both to change on an ongoing basis.
Identify initial sizing estimates for the database (a rough sizing calculation is sketched after this list).
Understand the existing IT infrastructure and identify hardware preferences.
Refine and rework initial conclusions. Always document the present state.
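
The sizing-estimate step above is usually back-of-the-envelope arithmetic: rows times average row width, inflated for indexes and working space. A minimal Python sketch follows; every table name and figure in it is a hypothetical placeholder, not a real measurement.

    # Rough initial database sizing. All figures are hypothetical placeholders.
    TABLES = {                    # table name: (estimated rows, avg row bytes)
        "sales_fact": (50_000_000, 120),
        "customer_dim": (2_000_000, 400),
        "product_dim": (100_000, 350),
    }
    INDEX_OVERHEAD = 0.5          # assume indexes add ~50% of raw data size
    WORKSPACE_FACTOR = 2.0        # sort/temp space, staging, summary tables

    raw_bytes = sum(rows * width for rows, width in TABLES.values())
    total_bytes = raw_bytes * (1 + INDEX_OVERHEAD) * WORKSPACE_FACTOR
    print(f"raw data       : {raw_bytes / 1e9:6.1f} GB")
    print(f"sizing estimate: {total_bytes / 1e9:6.1f} GB")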

 Risks:

Because data warehousing requirements are never fully understood, beware of the tendency to extend this task. A common comment is “we still don’t fully understand the requirements.” A time box is required, as such requirements will be impossible to define completely.
Focus on designing flexibility into the data warehouse and business rule execution. Failure to do so may lead to substantial cost penalties in the future.
Since a logical model of the enterprise may not exist, spend some time brainstorming what a potential one might look like.
If midpoint feedback from the right people does not take place, the initial decisions may be suspect or risky, and further work on them may be inappropriate.
 

Architecture and design guidelines:

The following inputs are definitely required to start the architecture and blueprint of a data warehousing system:
o Identification of the mechanism for the data transfer and load.
o Database sizing and query performance expectations.
o Access control, backup and recovery guidelines.
o Overall scope definition (not just the first-build definition).
o IT infrastructure.

Data extraction might be done in one of two ways (a flat-file sketch follows this list):
o Using flat files.
o Using a data management tool (EAI tools).
The worst-case scenarios must be defined in terms of the size of the feed data and the processing time allowed for the maximum load (the throughput arithmetic is sketched after this list).
It is advisable to design a preliminary set of extraction and load scripts and test their performance. Expect the statistics to change as the scripts evolve and the database grows.
Design a process controller on the source system to initiate and track the load and validation process (a minimal controller sketch follows this list).
 Design physical databases. Convert the logical model to a detailed database design within
the overall technical blueprint.
Design and build scripts to generate indexes (sketched after this list).
Design and build basic data validation functions, which need to be performed within the data warehouse schema.
Design and build scripts to generate summary and error tables. A summary table contains the process and status information regarding extraction and load. Error tables record dirty data rejected by the data warehouse schema validations (a combined validation and summary sketch follows this list).
 Document the technical blueprint and decisions on which load and validation tools will
be used.
The access control facility can sometimes be very complex, so the elapsed time on these steps could increase.
Design and management of the data warehouse on an MPP system is considerably more difficult than on an SMP or cluster.
 Do not design the data warehouse around a specific tool or tool type.
 Ensure that the data warehousing design scales as the data scales.
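
As a concrete illustration of the flat-file extraction option above, the sketch below streams a source table into a delimited feed file using only Python's standard library. sqlite3 stands in for the real source system, and the database, table, and column names are assumptions for illustration.

    import csv
    import sqlite3

    SOURCE_DB = "source_erp.db"      # hypothetical source system
    EXTRACT_FILE = "sales_feed.csv"  # flat file handed to the load step

    conn = sqlite3.connect(SOURCE_DB)
    cursor = conn.execute(
        "SELECT order_id, customer_id, amount, order_date FROM sales")

    with open(EXTRACT_FILE, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor)                                 # stream the rows
    conn.close()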
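
The worst-case definition above reduces to simple arithmetic: the required load throughput is the maximum feed size divided by the allowed processing window. A short sketch with hypothetical figures:

    # Worst-case load check; all figures are hypothetical placeholders.
    max_feed_gb = 40.0           # largest expected feed
    window_hours = 6.0           # batch window allowed for the load
    measured_gb_per_hour = 5.5   # throughput observed in load tests

    required = max_feed_gb / window_hours
    print(f"required {required:.1f} GB/h, measured {measured_gb_per_hour:.1f} GB/h")
    if measured_gb_per_hour < required:
        print("worst case does not fit the window - revisit the design")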
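
The process-controller guideline above can be sketched as a small driver that runs each load and validation step in order and records its status, so runs are trackable. The step functions and log file here are hypothetical stubs, not a prescribed interface.

    import json
    import time

    def extract():  ...   # hypothetical step stubs
    def load():     ...
    def validate(): ...

    STEPS = [("extract", extract), ("load", load), ("validate", validate)]

    def run_controlled(log_path="load_run.log"):
        with open(log_path, "a") as log:
            for name, step in STEPS:
                started = time.time()
                try:
                    step()
                    status = "ok"
                except Exception as exc:
                    status = f"failed: {exc}"
                log.write(json.dumps({"step": name, "status": status,
                                      "seconds": round(time.time() - started, 1)}) + "\n")
                if status != "ok":
                    break   # do not run later steps after a failure

    run_controlled()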
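
Index generation, mentioned above, is often scripted so that indexes can be dropped before a bulk load and rebuilt afterwards. A minimal sketch that emits the DDL from a simple specification; the table and column names are hypothetical.

    # Generate CREATE INDEX statements from a spec; names are hypothetical.
    INDEX_SPEC = {
        "sales_fact": ["customer_id", "product_id", "order_date"],
        "customer_dim": ["region"],
    }

    def index_ddl(spec):
        for table, columns in spec.items():
            for col in columns:
                yield f"CREATE INDEX idx_{table}_{col} ON {table} ({col});"

    for stmt in index_ddl(INDEX_SPEC):
        print(stmt)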
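
Finally, the validation and summary/error-table guidelines above fit together: each feed row passes through the validation functions, clean rows go on to the warehouse, rejected rows are recorded as dirty data, and a summary record captures the run. A minimal sketch; the rules and field names are hypothetical.

    # Minimal validation sketch; rules and field names are hypothetical.
    def validate_row(row):
        errors = []
        if not row.get("order_id"):
            errors.append("missing order_id")
        try:
            if float(row.get("amount", "")) < 0:
                errors.append("negative amount")
        except ValueError:
            errors.append("amount not numeric")
        return errors

    def run_validation(rows):
        clean, rejected = [], []
        for row in rows:
            errs = validate_row(row)
            (rejected if errs else clean).append((row, errs))
        # summary record: process/status information for the run
        summary = {"total": len(rows), "loaded": len(clean),
                   "rejected": len(rejected)}
        return clean, rejected, summary

    clean, rejected, summary = run_validation([
        {"order_id": "1001", "amount": "25.50"},
        {"order_id": "", "amount": "-3"},
    ])
    print(summary)   # {'total': 2, 'loaded': 1, 'rejected': 1}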

Risks:

Very often the complexity of extraction is underestimated.
 It can be very difficult to determine exactly when the extraction should be initiated.
It is critical that a correct balance is achieved between loads into the starflake schema and the queries accessing it. The step to generate the summary and error tables tends to grow continually in response to new user requirements.
Very often the dirtiness of the data is underestimated, which results in neglecting performance considerations for the validation functions.
It may be impossible to tell the difference between an acceptable query that takes up to 72 hours to run and one that never finishes (a timeout sketch follows these risks).
You need to be aware of any limitations imposed by either the operating system or the hardware on striping and mirroring disks.
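
One practical response to the runaway-query risk above is to give every long query an explicit time budget, so a query that would never finish is cut off instead of being indistinguishable from a slow one. The sketch below uses sqlite3's progress handler purely for illustration; a production warehouse would rely on the engine's own timeout or resource-governor facilities.

    import sqlite3
    import time

    def run_with_timeout(conn, sql, max_seconds):
        deadline = time.time() + max_seconds
        # The handler runs periodically; a non-zero return aborts the query.
        conn.set_progress_handler(
            lambda: 1 if time.time() > deadline else 0, 10_000)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.OperationalError:
            return None   # query exceeded its time budget
        finally:
            conn.set_progress_handler(None, 0)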

MPP: An acronym for massively parallel processing. These are large multi-node machines with a large number of CPUs.

NUMA: An acronym for non-uniform memory architecture. A NUMA machine is composed of multiple nodes, but unlike cluster or MPP machines, it has a shared memory address space.

SMP: An acronym for symmetric multiprocessing. An SMP machine consists of many CPUs, which share memory and disk.

Cluster: A tightly coupled group of SMP machines, with shared disks. 
