
BIG DATA, DATA MINING,

AND MACHINE LEARNING

Data Mining Platform Architecture


Computing Environment Specifications from a Hardware and Distributed Systems
perspective (Ch. 1, Ch. 2), with a technical focus on Hadoop
INTRODUCTION
1 - DATA ANALYSIS CONVENTIONS

Data Analysis
is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful
information, informing conclusions, and supporting decision-making.
Data Mining
is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for
predictive rather than purely descriptive purposes, while business intelligence covers data analysis that
relies heavily on aggregation, focusing mainly on business information.
Data collection
Data are collected from a variety of sources. The data may be collected from sensors in the environment,
including traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews,
downloads from online sources, reading documentation, or from a data hub in an organization.
Data processing
Data, when initially obtained, must be processed or organized for analysis. For instance, this may
involve placing data into rows and columns in a table format (known as structured data) for further
analysis, often through the use of spreadsheet or statistical software.
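The step above can be sketched in a few lines of Python: raw, loosely formatted records are parsed into rows that all share the same named columns. The record format and field names here are hypothetical, purely for illustration.

```python
# Hypothetical raw sensor records, one string per reading.
raw_records = [
    "2024-01-05;temp=21.3;sensor=A",
    "2024-01-05;temp=19.8;sensor=B",
]

def parse_record(line):
    """Split one raw line into a dict of named columns."""
    date, *fields = line.split(";")
    row = {"date": date}
    for field in fields:
        key, value = field.split("=")
        row[key] = value
    return row

# The result is table-shaped: every row has the same columns.
table = [parse_record(line) for line in raw_records]
```

In practice this structuring step is usually done with spreadsheet or statistical software rather than hand-written parsers, but the idea is the same: impose rows and columns on raw input.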
Data Cleaning
Once processed and organized, the data may be incomplete, contain duplicates, or contain errors. The
need for data cleaning arises from problems in the way the data are entered and stored. Data
cleaning is the process of preventing and correcting these errors. Common tasks include record matching,
identifying inaccurate data, and assessing the overall quality of existing data.
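As a minimal sketch of the cleaning tasks just listed, the function below drops exact duplicates and incomplete records; the field names (`id`, `value`) are illustrative assumptions, not from any particular dataset.

```python
def clean(rows):
    """Remove exact duplicate records and records with missing values."""
    seen, cleaned = set(), []
    for row in rows:
        key = (row.get("id"), row.get("value"))
        if key in seen:
            continue              # duplicate record: skip
        if row.get("value") is None:
            continue              # incomplete record: skip
        seen.add(key)
        cleaned.append(row)
    return cleaned

rows = [
    {"id": 1, "value": 10},
    {"id": 1, "value": 10},       # duplicate of the first record
    {"id": 2, "value": None},     # incomplete record
]
```

Real cleaning pipelines also handle fuzzy record matching and validation rules; this only shows the shape of the task.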
INTRODUCTION
2 - BASIC DATA ANALYSIS METHODS

Descriptive analysis - What happened


The descriptive analysis method is the starting point of any analytic process, and it aims to answer the
question: what happened? It does this by ordering, manipulating, and interpreting raw data from various
sources to turn it into valuable insights for your business.
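The ordering-and-aggregating step above can be sketched as a simple group-by summary; the sales records and region names are hypothetical.

```python
from statistics import mean

# Hypothetical raw sales records.
sales = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 100},
]

# Group the amounts by region...
by_region = {}
for sale in sales:
    by_region.setdefault(sale["region"], []).append(sale["amount"])

# ...then aggregate: the summary describes "what happened" per region.
summary = {region: {"total": sum(amounts), "average": mean(amounts)}
           for region, amounts in by_region.items()}
```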

Exploratory analysis - How to explore data relationships.


As its name suggests, the main aim of exploratory analysis is to explore. Prior to it, there is still no
notion of the relationship between the data and the variables. Once the data is investigated, the
exploratory analysis enables you to find connections and generate hypotheses and solutions for specific
problems. A typical area of application for exploratory analysis is data mining.
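One common exploratory move is to check whether two variables move together, for example with a Pearson correlation coefficient. The sketch below computes it from first principles; the ad-spend and revenue figures are invented for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ad_spend = [10, 20, 30, 40]          # hypothetical
revenue  = [110, 190, 310, 405]      # hypothetical
r = pearson(ad_spend, revenue)
# r near +1 suggests a strong positive relationship worth a hypothesis.
```

A high correlation found this way is only a lead to investigate further, not proof of a causal link.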

Diagnostic analysis - Why it happened.


Diagnostic data analytics empowers analysts and business executives by helping them gain an
understanding of why something happened. If you know why something happened, as well as how it
happened, you will be able to pinpoint the exact ways of tackling the issue or challenge.
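A typical diagnostic move is to break an aggregate metric down by segment to see which one drove a change. The conversion-rate figures and channel names below are hypothetical.

```python
# Hypothetical conversion rates per marketing channel, two weeks apart.
last_week = {"email": 0.05, "social": 0.04, "search": 0.06}
this_week = {"email": 0.05, "social": 0.01, "search": 0.06}

# Compute the per-channel change and find the largest decline.
changes = {ch: this_week[ch] - last_week[ch] for ch in last_week}
culprit = min(changes, key=changes.get)
# The channel with the biggest drop explains "why" the overall metric fell.
```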

Predictive analysis - What will happen.


The predictive method allows you to look into the future to answer the question: what will happen? In
order to do this, it uses the results of the previously mentioned descriptive, exploratory, and diagnostic
analysis, in addition to machine learning (ML) and artificial intelligence (AI). In this way, you can uncover
future trends, potential problems or inefficiencies, connections, and causal relationships in your data.
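The simplest predictive model is a least-squares trend line fitted to past observations and extrapolated forward. The sketch below implements ordinary least squares by hand; the monthly revenue history is invented for illustration, and real predictive work would use an ML library and validate the model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

months  = [1, 2, 3, 4]
revenue = [100, 120, 140, 160]     # hypothetical history
a, b = fit_line(months, revenue)

# "What will happen": extrapolate the fitted trend to month 5.
forecast_month_5 = a * 5 + b
```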
INTRODUCTION
3 - THE BIG DATA CONVENTION

Big Data
Is a collection of data that is huge in volume and growing exponentially with time. Its size and
complexity are so large that no traditional data management tool can store or process it efficiently.
Types Of Big Data
• Structured:
• Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
• Unstructured:
• Any data with unknown form or structure. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc., such as the results returned by a Google search.
• Semi-structured:
• Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it
is not actually defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data
represented in an XML file.
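The XML example can be made concrete with Python's standard-library parser. The customer records below are invented; note that the second record lacks an `email` element, which is exactly the schema looseness that makes the data semi-structured rather than structured.

```python
import xml.etree.ElementTree as ET

# Semi-structured: the tags give form, but no relational table
# definition forces every record to have the same fields.
doc = """
<customers>
  <customer id="1"><name>Ada</name><email>ada@example.com</email></customer>
  <customer id="2"><name>Alan</name></customer>
</customers>
"""

root = ET.fromstring(doc)
names = [c.findtext("name") for c in root.iter("customer")]
emails = [c.findtext("email") for c in root.iter("customer")]
# The second customer's email is None: fields vary per record.
```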
Characteristics Of Big Data
• Volume – whether particular data can actually be considered Big Data depends on the volume of the
data.
• Variety – Variety refers to heterogeneous sources and the nature of the data, both structured and unstructured. Nowadays,
data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in
analysis applications.
• Velocity – the speed of generation of data. Big Data velocity deals with the speed at which data flows in from sources
such as business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of
data is massive and continuous.
• Variability – This refers to the inconsistency which can be shown by the data at times, thus hampering the process of
handling and managing the data effectively.
INTRODUCTION
4 - DATA ANALYSIS PROCESS ON BIG DATA
Data analysts collect, process, clean, and analyze growing volumes of structured transaction data.
Data professionals collect data from a variety of different sources. Often, it is a mix of semi-structured and
unstructured data. Some common sources include:
web server logs; cloud applications; mobile applications; social media content; text from customer
emails and survey responses; mobile phone records; and machine data captured by sensors connected
to the Internet of Things (IoT).
Data is processed. After data is collected and stored in a data warehouse, data professionals must
organize, configure, and partition the data properly for analytical queries.
Data is cleansed for quality. Data professionals scrub the data using scripting tools or enterprise software.
They look for errors or inconsistencies, such as duplications or formatting mistakes, and organize and
tidy up the data.
The collected, processed, and cleaned data is analyzed with analytics software. This includes tools for:
• data mining, which sifts through data sets in search of patterns and relationships
• predictive analytics, which builds models to forecast customer behavior and other future developments
• machine learning, which taps algorithms to analyze large data sets
• deep learning, which is a more advanced offshoot of machine learning
• text mining and statistical analysis software
• artificial intelligence (AI)
• data visualization tools
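The collect, process, clean, and analyze steps above can be strung together in a minimal end-to-end sketch. The log lines and field names are hypothetical stand-ins for a real source such as a web server log.

```python
from collections import Counter

# Collect: hypothetical raw log lines, as they might arrive from a source.
raw_logs = [
    "user=1 action=view",
    "user=1 action=view",      # exact duplicate
    "user=2 action=buy",
    "bad line",                # malformed record
]

def process(line):
    """Process: impose key=value structure on a raw line, or fail."""
    try:
        return dict(kv.split("=") for kv in line.split())
    except ValueError:
        return None            # could not be structured

records = [process(line) for line in raw_logs]

# Clean: drop malformed records and duplicates.
cleaned = []
for record in records:
    if record is not None and record not in cleaned:
        cleaned.append(record)

# Analyze: a simple descriptive count of actions.
actions = Counter(record["action"] for record in cleaned)
```

At Big Data scale each of these stages runs on distributed infrastructure (e.g. Hadoop) rather than in a single script, but the logical pipeline is the same.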
