
Main Characteristics of a Data Warehouse

- Stores large quantities of historical data, so old data is not erased when new data is loaded
- Captures data from multiple, disparate databases
- Works with an operational data store (ODS) to house normalized, cleaned data
- Organized by subject
- Serves as an OLAP (online analytical processing) application
- Acts as the primary data source for data analytics
- Reports and dashboards use data from the data warehouse

Main Characteristics of a Data Mart

- Focuses on one subject matter or business unit
- Acts as a mini data warehouse, holding aggregated data
- Data is limited in scope
- Often uses a star schema or similar structure (a small sketch follows this list)
- Reports and dashboards use the data from the data mart
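
As a small illustration of the star-schema idea, the sketch below (Python with the built-in sqlite3 module) creates one fact table surrounded by two dimension tables and runs a typical aggregate query. The sales, product, and date tables and all column names are hypothetical examples, not taken from these notes.

    import sqlite3

    # Minimal star schema: a central fact table referencing two dimension tables.
    # All table and column names below are illustrative.
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    CREATE TABLE dim_date (
        date_id       INTEGER PRIMARY KEY,
        calendar_date TEXT,
        month         TEXT,
        year          INTEGER
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        quantity   INTEGER,
        revenue    REAL
    );
    """)

    # A typical data-mart query: aggregate the fact table, joining out to the dimensions.
    query = """
    SELECT d.year, p.category, SUM(f.revenue) AS total_revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY d.year, p.category;
    """
    for row in con.execute(query):
        print(row)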

Main Characteristics of a Data Lake

- Collects all data from many disparate data sources over an extended period
- Meets the needs of various users in the organization
- Data is uploaded without an established methodology
- Data is processed and cleaned and then stored in the data lake

ETL (Extract, Transform, and Load) Process in a Data Warehouse

What is ETL?
ETL is a process that extracts data from different source systems, then transforms the data (applying calculations, concatenations, etc.), and finally loads the data into the Data Warehouse system. The full form of ETL is Extract, Transform and Load.
It is tempting to think that creating a Data Warehouse is simply a matter of extracting data from multiple sources and loading it into the database of a Data Warehouse. This is far from the truth: it requires a complex ETL process. The ETL process requires active input from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.

To maintain its value as a tool for decision-makers, a Data Warehouse system needs to change with the business. ETL is a recurring activity (daily, weekly, monthly) of a Data Warehouse system and needs to be agile, automated, and well documented.
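
As a rough sketch of the three steps, the Python example below extracts rows from an assumed source file named sales_source.csv (with made-up columns first, last, and amount), transforms them with a simple concatenation and calculation, and loads them into a table; the built-in sqlite3 module merely stands in for a real warehouse database.

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw rows from a source system (here, a CSV file).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: apply calculations, concatenations, and basic cleanup.
        cleaned = []
        for row in rows:
            cleaned.append({
                "full_name": row["first"].strip().title() + " " + row["last"].strip().title(),
                "amount": round(float(row["amount"]), 2),
            })
        return cleaned

    def load(rows, db_path="warehouse.db"):
        # Load: write the transformed rows into the warehouse table.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS fact_sales (full_name TEXT, amount REAL)")
        con.executemany("INSERT INTO fact_sales VALUES (:full_name, :amount)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("sales_source.csv")))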

Why do you need ETL?

There are many reasons for adopting ETL in an organization:

- It helps companies analyze their business data for making critical business decisions.
- Transactional databases cannot answer the complex business questions that an ETL-fed data warehouse can.
- A Data Warehouse provides a common data repository.
- ETL provides a method of moving the data from various sources into a data warehouse.
- As data sources change, the Data Warehouse will automatically update.
- A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
- It allows verification of data transformation, aggregation, and calculation rules.
- The ETL process allows sample data comparison between the source and the target system.
- The ETL process can perform complex transformations and requires a staging area to store the data.
- ETL helps migrate data into a Data Warehouse, converting it to the various formats and types needed to adhere to one consistent system.
- ETL is a predefined process for accessing and manipulating source data into the target database.
- ETL in a data warehouse offers deep historical context for the business.
- It helps improve productivity because it codifies and reuses processes without requiring deep technical skills.

Differences between Cloud Storage and Traditional Storage:

Performance
  Cloud Storage: performs better due to the use of NoSQL.
  Traditional Storage: performs a bit slower compared to cloud storage.

Maintenance
  Cloud Storage: easy to maintain, as the service provider takes care of maintenance.
  Traditional Storage: heavier to manage, as you need to run maintenance tools manually.

Reliability
  Cloud Storage: highly reliable, as it takes less time to get up and functioning.
  Traditional Storage: requires high initial effort and is less reliable.

File Sharing
  Cloud Storage: supports file sharing dynamically, as files can be shared anywhere with network access.
  Traditional Storage: requires physical drives to share data, and a network has to be established between both parties.

File Access Time
  Cloud Storage: file access time depends on the network speed.
  Traditional Storage: fast access time compared to cloud storage.

Security
  Cloud Storage: more secure, as it integrates with many security tools.
  Traditional Storage: can be attacked easily through viruses and malware.

Applications
  Cloud Storage: Amazon Drive, Dropbox, and AutoSync are some applications of cloud storage.
  Traditional Storage: HDD, SSD, and pen drives are some examples of traditional storage.
Foundation of Big Data:

- Although the concept of big data itself is relatively new, the origins of large data sets go back to the 1960s and '70s, when the world of data was just getting started with the first data centers and the development of the relational database.

- Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. Hadoop (an open-source framework created specifically to store and analyze big data sets) was developed that same year.

- NoSQL also began to gain popularity during this time. The development of open-source frameworks such as Hadoop (and, more recently, Spark) was essential for the growth of big data because they make big data easier to work with and cheaper to store.

- In the years since then, the volume of big data has skyrocketed. Users are still generating huge amounts of data, but it is not just humans who are doing it. With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance.

- The emergence of machine learning has produced still more data. While big data has come far, its usefulness is only just beginning. Cloud computing has expanded big data possibilities even further: the cloud offers truly elastic scalability, where developers can simply spin up ad hoc clusters to test a subset of data.

Benefits of Big Data and Data Analytics

- Big data makes it possible for you to gain more complete answers because you have more information.

- More complete answers mean more confidence in the data, which means a completely different approach to tackling problems.

Types of Big Data:

a) Structured - Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search engine algorithms. For instance, the employee table in a company database will be structured: the employee details, their job positions, their salaries, etc. will be present in an organized manner.

b) Unstructured - Unstructured data refers to data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze unstructured data. Email is an example of unstructured data. Structured and unstructured are two important types of big data.

c) Semi-structured - Semi-structured data is the third type of big data. It pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), still contains vital information or tags that segregate individual elements within the data. These three categories cover the main types of big data.
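
As a small illustration of the three forms (all field names below are made up), the following Python snippet shows a fixed-format structured record, a semi-structured JSON document whose tags still segregate the elements, and an unstructured free-text value:

    import json

    # Structured: a fixed set of fields, like one row of an employee table.
    employee_row = ("E101", "A. Sharma", "Analyst", 55000)

    # Semi-structured: no fixed schema, but the tags (keys) still mark individual elements.
    employee_doc = '{"id": "E102", "name": "B. Rao", "skills": ["SQL", "Python"]}'
    record = json.loads(employee_doc)
    print(record["name"], record["skills"])

    # Unstructured: free text such as an e-mail body; nothing marks the fields.
    email_body = "Hi team, please find attached the quarterly summary. Regards, B. Rao"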
V's of Big Data

- Value is one of the foremost properties of big data. It is significant for an IT infrastructure system to store enormous amounts of valuable data in its databases.
- Velocity is concerned with the high speed of data generation. The rate at which data is generated influences the potential of the data. The flow of data is massive and continuous.
- Variety refers to the various types of sources, and to the fact that data is presented in both structured and unstructured forms. Data in the form of video files, e-mails, audio files, word-processing files, etc. is now also being considered.
- Volume is the aspect of "Big Data" concerned with extremely large amounts of data. It is volume that decides whether data is big data or not, so "Volume" is the most important of the various parameters to consider when dealing with "Big Data".
- Veracity deals with the trustworthiness of the data, that is, how accurate and correct it is.
- Variability deals with the inconsistency of data; data is not 100% correct when dealing with large volumes. Validity deals with the accuracy and correctness of data.
- Vulnerability is concerned with the security of the data. After all, a data breach involving big data is a big breach.
- Volatility is a parameter of Big Data concerned with a statistical measure of the dispersion of a given set of values.
- Visualization is a recent characteristic of big data that deals with presenting the data visually.
- Variability, in big data's context, is also the inconsistency of the speed at which data is stored into our system.

There are a number of tools and techniques used to analyse big data.
Big Data Tools

Data collection tools: Semantria, Opinion Crawl, OpenText, Trackur, SAS Sentiment Analysis
Data storage tools and frameworks: Apache HBase (Hadoop database), CouchDB, MongoDB, Apache Spark, Oracle NoSQL Database
Data filtering and extraction tools: Scraper, OctoParse, ParseHub, Mozenda, Content Grabber
Data cleaning and validation tools: DataCleaner, MapReduce, RapidMiner, OpenRefine, Talend
Data collection tools:
There is no doubt that a number of Big Data tools are available in the market today. Semantria, Opinion Crawl, OpenText, and Trackur are some of the most commonly used.

Semantria:
Semantria provides a unique service that starts by collecting various kinds of information from various clients. The process of analyzing that information is then applied diligently to produce the most valuable and desirable insights. It is very helpful for finding trends and identifying patterns that are useful for success. Semantria is a tool that powerfully combines various text analytics capabilities; with it, users can collect more reliable, actionable insights. Some features of Semantria are fast customization, a comprehensive system, entity extraction, classification, clustering, visualization, support for 10+ languages, etc.

OpenText:
OpenText is a Sentiment Analysis module. It is a special type of engine used for classification, to find various subjective patterns. It is also used to evaluate expressions of sentiment present in text form. The analysis work is done at the topic level, sentence level, and document level. Its prime function is to recognize whether parts of the text are factual.

Trackur:
Trackur is a tool used to collect information. It uses automated sentiment analysis to look at the specific keywords that users are monitoring, and decisions are then carried out. The sentiment may be positive, negative, or neutral with respect to the related document. The Trackur algorithm can be used to observe social sites and scan news in order to collect information through trends and automated sentiment analysis.

SAS Sentiment Analysis

SAS is also a sentiment analysis tool that automatically extracts sentiments in real time. It performs this task with the help of a combination of various statistical modeling techniques and processing techniques based on natural language rules. Built-in automated reports show the various patterns and reactions in detail. With the help of ongoing evaluations, users can properly refine their models and make adjustments to their classifications.

Opinion Crawl
Opinion Crawl is an online tool used for sentiment analysis of current affairs. It permits visitors to evaluate Web sentiment on a particular topic, such as a company. A user can input a topic that he wants to assess and get an ad hoc sentiment assessment of it. For every topic, the user gets a pie chart showing the real-time sentiment, a list of the current news headlines, and some images related to the topic. Together these allow the user to see what kinds of issues are driving the sentiment, whether in a positive or a negative way.
Data storage and framework tools:
The captured data, which may be structured or unstructured, needs to be stored in databases, so databases that can accommodate Big Data are required. A lot of frameworks have been developed by organizations like Apache, Oracle, etc. that are used as analytics tools to fetch and process the data stored in these repositories. Some of these are as follows:

Apache Hadoop
Apache Hadoop is one of the technologies designed to process Big Data; it unifies huge volumes of structured and unstructured data. Apache Hadoop is an open-source platform and processing framework that exclusively provides batch processing. Hadoop was originally influenced by Google's MapReduce. In the MapReduce software framework, the whole program is divided into a number of small parts, also called fragments, and these fragments can be executed on any system in the cluster.
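
As a rough, single-machine imitation of that idea (plain Python, not actual Hadoop or MapReduce API code), the word-count sketch below splits the input into fragments, maps over each fragment independently, and then reduces the intermediate results:

    from collections import defaultdict

    def map_fragment(fragment):
        # Map phase: emit a (word, 1) pair for every word in one fragment.
        return [(word.lower(), 1) for word in fragment.split()]

    def reduce_pairs(pairs):
        # Reduce phase: sum the counts for each word across all fragments.
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    fragments = ["big data needs big tools", "hadoop splits work into small fragments"]
    intermediate = [pair for frag in fragments for pair in map_fragment(frag)]
    print(reduce_pairs(intermediate))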
