
Data Warehousing and Business Intelligence

DS-3003

Assignment # 1

Fall 2023

Submitted By:
Eeman Ijaz 21i-1381
DS-M

Submitted To:
Dr. Asif Naeem
1.
a. In a geographically distributed data warehousing setup, each region has
its own dedicated data storage infrastructure, so data for the sales,
inventory, customer relations, marketing, and other departments specific
to that region is stored locally. In addition, a centralized data warehouse
at the headquarters holds all data related to the entire organization,
including financial data, high-level analytics, and aggregated data.
b. Local vendors, inventory, shipping details, customer preferences, market
trends.
c. Relevant data can be acquired from the operational systems through data
replication into temporary (staging) tables. These tables ensure that data
already sent to ETL is not sent again. Because they are not linked to the
other tables in the database, database locking is minimal. Triggers copy
new entries in the database into the temporary tables. Once these entries
have been sent to ETL, the temporary tables are overwritten, so their size
does not grow.
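The trigger-based staging approach in (c) can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not a production design. The table names (sales, sales_staging), the trigger name, and the helper functions are assumptions made for the example.

```python
import sqlite3

def build_demo_db():
    """Create a hypothetical operational table plus an unlinked staging table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product TEXT, amount REAL);
        -- Staging table has no foreign keys, so it does not lock other tables.
        CREATE TABLE sales_staging (sale_id INTEGER, product TEXT, amount REAL);
        -- The trigger mirrors every new operational row into the staging table.
        CREATE TRIGGER trg_sales_insert AFTER INSERT ON sales
        BEGIN
            INSERT INTO sales_staging VALUES (NEW.sale_id, NEW.product, NEW.amount);
        END;
    """)
    return conn

def extract_batch(conn):
    """Hand the staged rows to ETL, then empty the staging table."""
    with conn:  # commit the read-and-clear step as one transaction
        rows = conn.execute(
            "SELECT sale_id, product, amount FROM sales_staging ORDER BY sale_id"
        ).fetchall()
        conn.execute("DELETE FROM sales_staging")
    return rows
```

Because the staging table is emptied after each extraction, a second call returns only rows inserted since the previous call, which is what keeps already-sent data from being sent again.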
d. Inconsistencies: Establish data cleaning and validation checks in the
integration process to identify and resolve inconsistencies during
conversion.
Example: Delivery_date should be after Order_date.
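A validation check of this kind might look like the sketch below. The record layout (dictionaries with order_id, order_date, and delivery_date keys) and the function name are assumptions for illustration only.

```python
from datetime import date

def find_invalid_orders(orders):
    """Return the orders whose delivery_date is not after order_date."""
    return [o for o in orders if o["delivery_date"] <= o["order_date"]]

# Hypothetical sample records
sample_orders = [
    {"order_id": 1, "order_date": date(2023, 9, 1), "delivery_date": date(2023, 9, 4)},
    {"order_id": 2, "order_date": date(2023, 9, 5), "delivery_date": date(2023, 9, 3)},
]
invalid = find_invalid_orders(sample_orders)
```

Rows flagged this way can be routed to a review queue or corrected before loading, depending on the cleaning policy chosen.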
Data Handling: Data from different regions might be stored in different
formats.
Example: Branch A records customer addresses in the "Street, City, State,
Zip" format, while Branch B uses "Street, Zip, City, State". Standardize
the address format across all branches so that all records follow the
"Street, City, State, Zip" format.
Completeness: Data might be missing for some fields, or some fields
might not even exist in some regions. These gaps will generate null values
when the data is combined. Data cleaning techniques must be used to solve
this issue.
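The Data Handling and Completeness points above can be sketched together. The target field list, the branch label "B", and all function names below are hypothetical. The address parser also assumes that the street part contains no commas.

```python
# Hypothetical target schema shared by all branches
TARGET_FIELDS = ("customer_id", "name", "address", "phone")

def reorder_branch_b_address(address):
    """Convert 'Street, Zip, City, State' to 'Street, City, State, Zip'."""
    parts = [p.strip() for p in address.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated parts, got {len(parts)}")
    street, zip_code, city, state = parts
    return ", ".join((street, city, state, zip_code))

def harmonize(record, branch):
    """Map a regional record onto the target schema; absent fields become None."""
    out = {field: record.get(field) for field in TARGET_FIELDS}
    if branch == "B" and out["address"] is not None:
        out["address"] = reorder_branch_b_address(out["address"])
    return out

def missing_fields(record):
    """List the target fields that are null after harmonization."""
    return [f for f in TARGET_FIELDS if record[f] is None]
```

Reporting the null fields per record makes it easier to decide whether to impute a value, fetch it from another source, or reject the record.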
Data integration: Branch A uses a SQL database, Branch B relies on
NoSQL, and Branch C has data stored in flat files. Integrating these
diverse data sources is challenging. Use a standardized schema and an ETL
tool that supports a wide range of data sources.
Storage cost: Local storage might not be enough for regional data,
which increases storage costs. A better option is to use cloud storage.
e. The store is open on weekdays from 10am to 8pm. The best time to
upload the data would therefore be on weekdays after 8pm or over the
weekend.
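A scheduler could apply this rule with a check like the one below. This is a sketch with assumed names. It also treats weekday hours before 10am as part of the load window, which is an assumption that goes slightly beyond the answer above.

```python
from datetime import datetime, time

STORE_OPEN = time(10, 0)   # 10am, from the answer above
STORE_CLOSE = time(20, 0)  # 8pm, from the answer above

def in_load_window(ts):
    """Return True if ts falls on a weekend or outside weekday store hours."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (STORE_OPEN <= ts.time() < STORE_CLOSE)
```

For example, a Monday at noon is outside the window, while a Monday at 9pm or any time on Saturday is inside it.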
f.
● How have sales performed across different regions, departments,
and product categories over time?
● What were our total sales for the past quarter, and how do they
compare to the same period last year?
● Which products have been our top sellers in the last month, and in
which regions are they most popular?
● What are the age groups and geographic locations of our most
frequent customers?
● Do we have enough stock of our bestselling items?

2. 5-Tier Architecture

Data Stream Tier: In this tier, we gather data from different sources and bring it
into our data warehouse. Sources can be databases, files, or external feeds.
The data collected here is often raw and unprocessed.
Processing Tier (ETL): Here, we clean, organize, and prepare the data so that it
fits the data warehouse. This process includes extraction, transformation,
and loading.
Data Storage Tier: This is where we store all the cleaned and organized
data in the warehouse for further processing. Data is stored in a way that
optimizes queries.
Data Access Tier: This tier allows users to interact with the data. Users can
run queries and retrieve data from the warehouse.
Data Presentation Tier: The last tier includes reports, data visualizations, and
dashboards, where the focus is on insights from the data.
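The flow of data through the five tiers can be illustrated with a toy pipeline. Every function name and sample record here is hypothetical. A real warehouse would use a database and a reporting tool rather than in-memory dictionaries.

```python
def data_stream_tier():
    """Gather raw, unprocessed records from the sources."""
    return [
        {"product": " Book ", "amount": "12.50"},
        {"product": "pen", "amount": "1.00"},
        {"product": "Book", "amount": "7.50"},
    ]

def processing_tier(raw):
    """ETL: clean product names and convert amounts to numbers."""
    return [
        {"product": r["product"].strip().lower(), "amount": float(r["amount"])}
        for r in raw
    ]

def storage_tier(clean):
    """Store rows grouped by product so that lookups are cheap."""
    warehouse = {}
    for r in clean:
        warehouse.setdefault(r["product"], []).append(r["amount"])
    return warehouse

def access_tier(warehouse, product):
    """Answer a user query: total sales amount for one product."""
    return sum(warehouse.get(product, []))

def presentation_tier(warehouse):
    """Produce a simple text report, one line per product."""
    return [f"{p}: {access_tier(warehouse, p):.2f}" for p in sorted(warehouse)]

report = presentation_tier(storage_tier(processing_tier(data_stream_tier())))
```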
3.
a. Time, Books and Customers.

b.

Facts       Attributes

Sales       Time_id, customer_id, book_id, quantity_sold, unit_price,
            revenue

Inventory   Time_id, book_id, location, supplier_id, quantity_sold,
            quantity_in_stock

Shipping    Shipping_id, basket_id, time_id, tracking_no, carrier_id,
            book_id, basket_number

c.
Dimensions      Attributes

Time            Date, month, quarter, year

Books           book_id, name, genre, author, pages, language

Customer        Customer_id, name, address, number, age

Author          Author_id, name, no_of_books

Publisher       Publisher_id, name, location

Shopping cart   Basket_id, no_of_items, amount

Shipping        Shipping_id, ship_date, method, address, tracking_no,
                shipping_status, shipping_zone, carrier_id
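As a sketch of how the Sales fact and three of the dimensions above could be laid out as a star schema, the following uses Python's built-in sqlite3 module. The table prefixes (dim_, fact_), the column types, and the time_id surrogate key on the Time dimension are assumptions, since the tables above do not specify them.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY,  -- assumed surrogate key
    date TEXT, month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE dim_books (
    book_id INTEGER PRIMARY KEY,
    name TEXT, genre TEXT, author TEXT, pages INTEGER, language TEXT);
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT, address TEXT, number TEXT, age INTEGER);
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    book_id INTEGER REFERENCES dim_books(book_id),
    quantity_sold INTEGER, unit_price REAL, revenue REAL);
"""

def build_star_schema():
    """Create the fact and dimension tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def revenue_by_genre(conn):
    """Join the fact table to the Books dimension and total revenue per genre."""
    return conn.execute("""
        SELECT b.genre, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_books AS b ON b.book_id = f.book_id
        GROUP BY b.genre
        ORDER BY b.genre
    """).fetchall()
```

The fact table holds only keys and measures, while descriptive attributes such as genre stay in the dimensions, so a typical query is a join followed by an aggregation. SQLite does not enforce the foreign keys unless that feature is switched on.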
4.
a.

b.
● What is the overall trend in profits over time?
● Which stores are generating the highest and lowest profits?
● Is there any effect of promotions?
● How do individual store performances compare?
● Which products had the lowest sales in each store over the past
month?
● Is there a regional factor affecting profits?

c. Extraction: Retrieve the data that has been chosen for monitoring.
Compress the extensive dataset. You can opt for one of the following
methods:
● Periodic (set fixed times by identifying similar closing times for each
store).
● Event-driven.

Transformation:

● Convert encoding.
● Standardize data.
● Handle date-related operations.
● Store computed values.
● Enhance data quality by eliminating duplicates, managing missing
data, and ensuring data accuracy.
● Aggregate data across different dimensions.
● Create calculated metrics.
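Several of the transformation steps listed above (standardizing values, eliminating duplicates, managing missing data, creating a calculated metric, and aggregating) can be sketched in a few lines. The field names are hypothetical, and replacing a missing quantity with 0 is only one possible policy, chosen here for illustration.

```python
def transform(rows):
    """Deduplicate, standardize, fill missing values, and compute revenue."""
    seen = set()
    cleaned = []
    for r in rows:
        if r["sale_id"] in seen:  # eliminate duplicates by key
            continue
        seen.add(r["sale_id"])
        store = r["store"].strip().upper()  # standardize store codes
        # Assumed policy: a missing quantity counts as 0
        qty = r["quantity"] if r["quantity"] is not None else 0
        cleaned.append({
            "sale_id": r["sale_id"],
            "store": store,
            "quantity": qty,
            "unit_price": r["unit_price"],
            "revenue": qty * r["unit_price"],  # calculated metric
        })
    return cleaned

def revenue_by_store(cleaned):
    """Aggregate revenue along the store dimension."""
    totals = {}
    for r in cleaned:
        totals[r["store"]] = totals.get(r["store"], 0.0) + r["revenue"]
    return totals
```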
Loading: The data is then transferred into the data warehouse with a
focus on maintaining security and data integrity. Loading typically occurs
on weekends or during periods of low user activity to minimize disruption.
Techniques such as partitioning, parallelization, and incremental updates
can improve the loading process.
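An incremental update can be sketched with a simple high-water mark: only rows with a key greater than the largest key already in the warehouse are loaded. The table layout and function name are assumptions, and the approach presumes that sale_id values only ever increase.

```python
import sqlite3

def incremental_load(conn, source_rows):
    """Insert only rows newer than the highest sale_id already loaded.

    source_rows is a list of (sale_id, revenue) tuples.
    Returns the number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(sale_id INTEGER PRIMARY KEY, revenue REAL)"
    )
    # High-water mark: 0 when the table is still empty
    last_loaded = conn.execute(
        "SELECT COALESCE(MAX(sale_id), 0) FROM fact_sales"
    ).fetchone()[0]
    new_rows = [r for r in source_rows if r[0] > last_loaded]
    with conn:  # commit the batch as one transaction to protect integrity
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", new_rows)
    return len(new_rows)
```

Running the load twice over an overlapping extract inserts each row once, which keeps the load window short compared with a full reload.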
