Data Analysis
is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
Data Mining
is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for
predictive rather than purely descriptive purposes, while business intelligence covers data analysis that
relies heavily on aggregation, focusing mainly on business information.
Data collection
Data are collected from a variety of sources. They may be collected from sensors in the environment, including traffic cameras, satellites, recording devices, etc. They may also be obtained through interviews, downloads from online sources, reading documentation, or from a data hub in an organization.
Data processing
Data, when initially obtained, must be processed or organized for analysis. For instance, this may involve placing the data into rows and columns in a table format (known as structured data) for further analysis, often through the use of spreadsheet or statistical software.
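As a minimal sketch of this step (the comma-separated sensor feed and its field names are hypothetical), raw text can be placed into rows and columns using Python's standard library:

```python
import csv
import io

# Hypothetical raw readings as comma-separated text (assumed format).
raw = """timestamp,sensor_id,reading
2024-01-01T00:00,s1,20.5
2024-01-01T01:00,s2,21.0
"""

# Parse the text into a table: a list of rows, each a dict of column -> value.
rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())
```

Once the data are in this row-and-column form, they can be loaded into a spreadsheet or statistical package for further analysis.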
Data Cleaning
Once processed and organized, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning arises from problems in the way the data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccurate data, and assessing the overall quality of the existing data.
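The cleaning tasks above can be sketched in plain Python (the records and field names below are invented for illustration; real pipelines typically use dedicated cleaning tools):

```python
# Hypothetical records containing a duplicate and an incomplete entry.
records = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 1, "name": "Ann", "age": 34},    # duplicate record
    {"id": 2, "name": "Bob", "age": None},  # incomplete record
    {"id": 3, "name": "Cal", "age": 29},
]

def clean(recs):
    seen, out = set(), []
    for r in recs:
        key = r["id"]                            # record matching on the id field
        if key in seen:                          # drop duplicates
            continue
        if any(v is None for v in r.values()):   # drop incomplete rows
            continue
        seen.add(key)
        out.append(r)
    return out

cleaned = clean(records)
```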
INTRODUCTION
2 - BASIC DATA ANALYSIS METHODS
Big Data
Is a collection of data that is huge in volume, yet growing exponentially with time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently.
Types Of Big Data
• Structured :
• Any data that can be stored, accessed, and processed in a fixed format is termed structured data.
• Unstructured :
• Any data with an unknown form or structure. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc., such as the results returned by a Google search.
• Semi-structured :
• Semi-structured data contains elements of both forms. It may appear structured, but it is not defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
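A brief sketch of reading such an XML file with Python's standard library (the tag names are hypothetical) shows why this is semi-structured: each record describes its own fields rather than conforming to a fixed table schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment: the second record omits <age>, which a
# relational table definition would not allow without a NULL column.
doc = """<people>
  <person><name>Ann</name><age>34</age></person>
  <person><name>Bob</name></person>
</people>"""

root = ET.fromstring(doc)
# Tags inside each <person> describe that record's own structure.
names = [p.findtext("name") for p in root.findall("person")]
```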
Characteristics Of Big Data
• Volume – whether a particular data set can actually be considered Big Data depends on its volume.
• Variety – refers to heterogeneous sources and the nature of the data, both structured and unstructured. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also considered in analysis applications.
• Velocity – the speed at which data are generated. Big Data velocity deals with the rate at which data flow in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.
• Variability – the inconsistency that the data can show at times, which hampers the process of handling and managing the data effectively.
4 - DATA ANALYSIS PROCESS ON BIGDATA
Data analysts collect, process, clean, and analyze growing volumes of structured transaction data.
Data professionals collect data from a variety of different sources. Often, it is a mix of semi-structured and unstructured data. Some common sources include:
web server logs; cloud applications; mobile applications; social media content; text from customer emails and survey responses; mobile phone records; and machine data captured by sensors connected to the internet of things (IoT).
Data is processed. After data is collected and stored in a data warehouse, data professionals must organize, configure, and partition the data properly for analytical queries.
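As an illustrative sketch of this step (the event records and the choice of date as the partition key are assumptions), partitioning data for analytical queries might look like:

```python
from collections import defaultdict

# Hypothetical event records to be partitioned by date.
events = [
    {"date": "2024-01-01", "value": 10},
    {"date": "2024-01-02", "value": 7},
    {"date": "2024-01-01", "value": 3},
]

# Partition into per-date buckets, the way warehouse tables are often
# partitioned so that a query scans only the relevant slice of the data.
partitions = defaultdict(list)
for e in events:
    partitions[e["date"]].append(e)

# A query for one day now touches only that day's partition.
day_total = sum(e["value"] for e in partitions["2024-01-01"])
```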
Data is cleansed for quality. Data professionals scrub the data using scripting tools or enterprise software.
They look for any errors or inconsistencies, such as duplications or formatting mistakes, and organize and
tidy up the data.
The collected, processed and cleaned data is analyzed with analytics software. This includes tools for:
• data mining, which sifts through data sets in search of patterns and relationships
• predictive analytics, which builds models to forecast customer behavior and other future developments
• machine learning, which taps algorithms to analyze large data sets
• deep learning, which is a more advanced offshoot of machine learning
• text mining and statistical analysis software
• artificial intelligence (AI)
• data visualization tools
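As a toy illustration of the pattern search that data mining performs (the transaction data below is invented), the following counts which pairs of items most often appear together:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each is a set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

# Count how often each pair of items co-occurs across transactions –
# a tiny instance of the pattern-and-relationship search described above.
pairs = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pairs[pair] += 1

top_pair, count = pairs.most_common(1)[0]
```

Production data mining tools apply far more scalable versions of this idea (e.g. frequent-itemset algorithms) across millions of records.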