
BIG DATA

BIG DATA DEFINITION (TN 1.1)


1. Artificial intelligence and machine learning

The most widely used type of machine learning is a type of AI that learns an A-to-B, or input-to-output,
mapping. This is called supervised learning.
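A minimal sketch of this A-to-B idea, assuming scikit-learn is installed; the toy inputs, labels and the choice of logistic regression are purely illustrative, not part of the notes.

    # Learn an input (A) to output (B) mapping from labeled examples: supervised learning.
    from sklearn.linear_model import LogisticRegression

    # A: inputs (e.g., hours of activity per week), B: labels (e.g., churned or not)
    A = [[1.0], [2.0], [8.0], [9.0]]
    B = [0, 0, 1, 1]

    model = LogisticRegression()
    model.fit(A, B)               # learn the A -> B mapping
    print(model.predict([[7.5]])) # predict B for a new, unseen input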

2. Tasks we can perform

It is relevant to understand what types of tasks these algorithms can perform when enough data is
available, in other words, what the Big Data capabilities are. They can be summarized as:

• Descriptions

• Predictions

• Inferences

• Classifications

• Clustering
• Recommendations

• Cognitive systems

Descriptions: integrating all the data into a single dashboard or map and making it available, even if it
only shows a description of a phenomenon from different perspectives, can trigger a better
decision-making process.

• Supervised learning:
- Predictions
- Inferences
- Classifications
• Unsupervised learning
- Clustering
- Recommendations
- Cognitive systems
3. Big data characteristics

We can distinguish some characteristics of Big Data that are called the Vs of Big Data: the three
classical Vs of Big Data (Volume, Variety and Velocity), plus two more (Veracity and Valence), and
finally one more (Value).

• Volume: The amount of data matters. The challenge of these volumes is how to store, acquire,
retrieve, distribute and process the data, and at what cost. In general, more data means better
predictions, classifications...
• Velocity: The fast rate at which data is received and acted on.
• Variety: The many types of data that are available.
• Veracity: How accurate or truthful a data set may be.
• Valence: The connectedness between data items.
• Value: The final product has to have value for it to be useful.

4. Strategic view: the Five Ps of Big data and data projects

We can distinguish five elements that determine value creation in a strategic view of a data
project.

1. Purpose: A goal must be formulated to complete a project.


2. People: The necessity of having people with different skillsets
3. Processes: the main idea is that the process is the way of capturing data from its
sources and transforming it into value together with all the other elements or Ps.
4. Platforms: Besides the above-mentioned factors, fundamental and strategical questions
that are concerned with what platforms you will use for your analytics and products are
also critical for successfully managing a data science project.
5. Programmability: Tools or apps you will use
• Programming languages: SQL, Python, R
• Big Data tools: Hadoop, Google’s Cloud Storage & Big Query, AWS Redshift and S3
• Streaming software: Kafka, Spark, Talend
• BI tools: Tableau, Qlik, Google Data Studio
BIG DATA PLATFORM AND PROGRAMMABILITY (TN 3.0)
1. HADOOP ENVIRONMENT

Hadoop is a whole environment of tools, an ecosystem of tools for storing and processing
data of every type and speed, based on parallelism.

Hadoop allowed big problems to be broken down into smaller elements so that analysis could be
done quickly and cost-effectively. By breaking the big data problem into small pieces that could be
processed in parallel, you can process the information and regroup the small pieces to present
results.

Hadoop is designed to parallelize data processing across computing nodes to speed computations
and hide latency. At its core, Hadoop has two primary components:

• Hadoop Distributed File System: A reliable, high-bandwidth, low-cost, data storage cluster
that facilitates the management of related files across machines.
• MapReduce engine: A high-performance parallel/distributed data processing
implementation of the MapReduce algorithm.

2. HDFS

The Hadoop Distributed File System is a versatile, resilient, clustered approach to managing files in
a big data environment.

HDFS works by breaking large files into smaller pieces called blocks. The blocks are stored on data
nodes, and it is the responsibility of the NameNode to know what blocks on which data nodes
make up the complete file. The NameNode also acts as a “traffic cop,” managing all access to the
files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes.
The complete collection of all the files in the cluster is sometimes referred to as the file system
namespace.
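A rough, simplified sketch of this idea rather than the real HDFS implementation: a file is split into fixed-size blocks, the blocks are spread over data nodes, and a NameNode-style index records which node holds which block. The block size, node names and placement policy are invented for illustration, and real HDFS also replicates each block.

    # Toy illustration of the HDFS idea (not the real implementation).
    BLOCK_SIZE = 4                                         # tiny block size, for illustration only
    data_nodes = {"node1": {}, "node2": {}, "node3": {}}   # block storage per data node
    namespace = {}                                         # NameNode metadata: file -> [(block_id, node)]

    def store_file(name, content):
        blocks = [content[i:i + BLOCK_SIZE] for i in range(0, len(content), BLOCK_SIZE)]
        namespace[name] = []
        for i, block in enumerate(blocks):
            node = list(data_nodes)[i % len(data_nodes)]   # naive placement, no replication
            data_nodes[node][(name, i)] = block            # the data node stores the block itself
            namespace[name].append((i, node))              # the NameNode only keeps the metadata

    def read_file(name):
        return "".join(data_nodes[node][(name, i)] for i, node in namespace[name])

    store_file("example.txt", "0123456789ABCDEF")
    print(namespace["example.txt"])   # which blocks live on which data nodes
    print(read_file("example.txt"))   # the blocks reassembled into the complete file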

3. MAPREDUCE
Hadoop MapReduce is an implementation of the algorithm developed and maintained by the
Apache Hadoop project.

The algorithm works as follows (a minimal word-count sketch in Python follows the list):

• Input: the file is read and broken into pieces for processing
• Map: key-value pairs are assigned to the elements in each piece
• Sort and shuffle: the pieces are organized so that, at the same time,
o the number of pieces in each computer is balanced
o the pieces in each computer are homogeneous (grouped by key)
• Reduce: the task is performed on each group and the results are combined
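As referenced above, a minimal word-count sketch of the MapReduce pattern in plain Python (not Hadoop itself); the input pieces are invented for the example.

    # Map emits key-value pairs, sort/shuffle groups them by key, reduce aggregates each group.
    from itertools import groupby

    def map_phase(piece):
        return [(word, 1) for word in piece.split()]

    def reduce_phase(key, values):
        return key, sum(values)

    pieces = ["big data big", "data tools data"]                       # the split input
    mapped = [pair for piece in pieces for pair in map_phase(piece)]   # map
    mapped.sort(key=lambda kv: kv[0])                                  # sort and shuffle
    result = [reduce_phase(k, [v for _, v in g])
              for k, g in groupby(mapped, key=lambda kv: kv[0])]       # reduce per key
    print(result)   # [('big', 2), ('data', 3), ('tools', 1)]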
DATA PROCESS, DATA MODELS AND DATA MANAGEMENT (TN 4.0)
1. DATA PROCESS

Data processing occurs when data is collected and translated into usable information. We can
distinguish several steps in this process:

Data munging: Transform data from erroneous or unusable forms into useful and use-case-specific
ones. This concept includes:

• Data exploration: Munging usually begins with data exploration. This initial
exploration can be done with some initial graphs, correlations, histograms, or
descriptive statistics (Mean, Median, Mode, Range...) and visualize them in maps or
dashboards.
• Data transformation and integration: Once a sense of the raw data’s contents and
structure has been established, the data must be transformed into new formats appropriate for
downstream processing. This step involves the pure restructuring of data.
• Data enrichment (and integration): This involves finding external sources of information
to expand the scope or content of existing records.
• Data validation: This step allows users to discover typos, incorrect mappings, problems
with transformation steps, and even the rare corruption caused by computational failure or
error. No matter how well the data management system has prepared the data, in the
development of a new project this task is essential (a short pandas sketch of exploration
and validation follows this list).
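As referenced above, a minimal pandas sketch of the exploration and validation steps; it assumes pandas is installed, and the DataFrame, its column names and its values are invented purely for illustration.

    import pandas as pd

    df = pd.DataFrame({"age": [34, 29, None, 41],
                       "city": ["Madrid", "madrid", "Paris", "Paris"]})

    print(df.describe())                 # exploration: descriptive statistics
    print(df["city"].value_counts())     # exploration: spot inconsistent categories

    df["city"] = df["city"].str.title()  # transformation: standardize a text field
    print(df["age"].isna().sum())        # validation: count missing values before going further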

Data exploitation is the last group of tasks needed to complete our R&D project. In the project process
we will build a design of the desired output and possibly a beta version. In an operational
implementation we differentiate between the front end (user interface) of the system and the
back end (server) of the system. We include analysis and visualization in data exploitation.

Analysis: Data mining is the process of finding anomalies, patterns, and correlations within large
data sets to predict outcomes.

Some include data preparation here, but we consider it part of data munging. Some also include
data warehousing, which involves storing structured data in relational database management
systems so it can be analyzed for business intelligence, reporting, and basic dashboarding
capabilities. However, this is an operational task, not part of the data project process, so we prefer
to consider it within the data management process.
Reporting and visualization are the last step of the project process and involve designing the
output so that it is meaningful to the end user.

Finally, we need to connect with the purpose and iterate the process to obtain a correct and valuable
output.
2. DATA MODELS AND TYPES OF DATA.

Data models describe data characteristics. We can distinguish:

• Operations
• Constraints
• Structures
Operations: The possible operations can be summarized as follows (a short pandas sketch follows the list):

• Subsetting: given a data set and a condition, find the subset that fulfils the condition.
• Substructure extraction: given a data set, extract a part of its structure together with its
elements.
• Union: given two data sets, create a new data set with the elements from both,
erasing duplicates.
• Join: given two data sets with complementary structures, create a new data set with
elements of both data sets, erasing duplicates.
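As referenced above, a hedged pandas sketch of these operations; the tables, column names and values are invented for illustration.

    import pandas as pd

    people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Luis", "Eva"], "age": [30, 25, 40]})
    cities = pd.DataFrame({"id": [1, 2, 3], "city": ["Madrid", "Lyon", "Porto"]})

    subset = people[people["age"] > 28]                    # subsetting: rows meeting a condition
    names = people[["id", "name"]]                         # substructure extraction: some columns
    union = pd.concat([people, people]).drop_duplicates()  # union, erasing duplicates
    joined = people.merge(cities, on="id")                 # join on a common key
    print(subset, names, union, joined, sep="\n\n")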

Constraints are the logical propositions that the data must satisfy. For example: each person
has only one name. Different models have different ways of expressing constraints and support
different types of constraints.

The types of data can be classified as:

• Structured
• Semi-structured
• Unstructured
Structured: generally refers to data that has a defined length and format.

The relational model is still in wide usage today and plays an important role in the evolution of big
data. Understanding the relational database is important because other types of databases are
used with big data. In a relational model, the data is stored in tables; this is often managed in a
relational model using a structured query language (SQL).

Another aspect of the relational model using SQL is that tables can be queried using a common key
(that is, the relationship). The related tables use that key to make the relation possible; in those
tables it is called the foreign key, and duplicate values of it may appear.
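A small sketch using Python's built-in sqlite3 module to show two related tables queried through a common key; the table and column names are illustrative, and customer_id plays the role of the foreign key (note that it can repeat).

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
    con.execute("INSERT INTO customers VALUES (1, 'Ana'), (2, 'Luis')")
    con.execute("INSERT INTO orders VALUES (10, 1, 99.9), (11, 1, 15.0), (12, 2, 42.0)")

    # customer_id is the foreign key relating orders to customers; it may appear many times.
    rows = con.execute("""
        SELECT customers.name, orders.amount
        FROM orders JOIN customers ON orders.customer_id = customers.id
    """).fetchall()
    print(rows)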

Semi-structured data is a type of data that has some consistent and definite characteristics. It
does not conform to a rigid structure such as that needed for relational databases.

Unstructured data is data that does not follow a specified format.

3. DATA MANAGEMENT

Data management is the practice of collecting, keeping, and using data securely, efficiently,
and cost-effectively. The goal of data management is to help people, organizations, and
connected things optimize the use of data.

From another point of view, data management consists in how we answer the issues that appear
when making a given data project operational. So we can consider the following:

• Data storage
• Data ingestion
• Data integration
• Data retrieval
• Data quality
• Data Security

4. DATA MANAGEMENT: DATA STORAGE

One of the most important services provided by operational databases (also called data stores)
is persistence. Persistence guarantees that the data stored in a database won’t be changed
without permission and that it will be available if it is important to the business.

Given this most important requirement, you must then think about what kind of data you want
to persist, how you can access and update it, and how you can use it to make business decisions.
At this most fundamental level, the choice of your database engine is critical.

The forefather of persistent data stores is the relational database management system, or
RDBMS. The relational model is still in wide usage today and has an important role to play in the
evolution of big data.

Relational databases are built on one or more relations and are represented by tables. As the
name implies, normalized data has been converted from its native format into a shared, agreed-upon
format. To achieve a consistent view of the information, fields need to be normalized to
one form or the other.
Over the years, the structured query language (SQL) has evolved in lock step with RDBMS
technology and is the most widely used mechanism for creating, querying, maintaining, and
operating relational databases. These tasks are referred to as CRUD: Create, retrieve, update, and
delete are common, related operations you can use directly on a database or through an
application programming interface (API).
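A compact sketch of the CRUD operations through Python's built-in sqlite3 module; the products table and its contents are invented for the example.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
    con.execute("INSERT INTO products VALUES (1, 'sensor', 9.5)")        # Create
    print(con.execute("SELECT * FROM products").fetchall())              # Retrieve
    con.execute("UPDATE products SET price = 11.0 WHERE id = 1")         # Update
    con.execute("DELETE FROM products WHERE id = 1")                     # Delete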

Nonrelational databases do not rely on the table/key model endemic to RDBMSs. One
emerging, popular class of nonrelational database is called NoSQL (“not only SQL”). Nonrelational
database technologies have the following characteristics in common:

• Scalability: capability to write data across multiple data stores simultaneously without
regard to physical limitations of the underlying infrastructure.
• Data and query model: Instead of the row, column, key structure, nonrelational
databases use specialty frameworks to store data, with a requisite set of specialty query
APIs to intelligently access the data.
• Persistence design: Persistence is still a critical element in nonrelational databases.
• Interface diversity: Although most of these technologies support RESTful APIs as their
“go-to” interface, they also offer a wide variety of connection mechanisms for
programmers and database managers, including analysis tools and reporting/visualization.
• Eventual consistency: While RDBMSs use ACID (Atomicity, Consistency, Isolation,
Durability) as a mechanism for ensuring the consistency of data, nonrelational DBMSs
use BASE (Basically Available, Soft state, Eventual consistency).

Distributed data storage is a computer network where data or information is stored (or
replicated) on more than one node or computer.

Distributed databases are databases that quickly retrieve data over many nodes. Distributed
data stores offer increased availability of data at the expense of consistency. We will come back
to these ideas when speaking about the big data platform called Hadoop and, specifically, HDFS
(Hadoop Distributed File System).
5. DATA MANAGEMENT: DATA INGESTION

Data ingestion is the process of acquiring and importing data into a data store or a database. If the
data is ingested in real time, each record is pushed into the database as it is emitted. We therefore
distinguish data-in-motion (real time), analyzed as it is generated, from data-at-rest (batch
process), collected prior to analysis.

Streaming data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously, and in small sizes. This data needs to be
processed sequentially and incrementally on a record-by-record basis or over sliding time
windows.

Data streaming systems compute one data element, or a small window of data elements, at a
time, in real time.
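A minimal sketch of record-by-record processing over a sliding window in plain Python (not Kafka, Spark or Talend); the stream values and the window length are invented for illustration.

    from collections import deque

    window = deque(maxlen=3)                      # sliding window of the last 3 records
    stream = [10, 12, 11, 50, 13, 12]             # stands in for records arriving one by one

    for record in stream:
        window.append(record)
        moving_avg = sum(window) / len(window)    # incremental computation per record
        print(f"record={record} moving_avg={moving_avg:.1f}")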

Elements of data ingestion:

7. DATA MANAGEMENT: DATA RETRIEVAL

Data retrieval is the process of searching for, identifying, and extracting required data from a
database. An operational database is one designed to make transactional systems run efficiently.

A data warehouse is a type of database that integrates copies of transaction data from disparate
source systems and provisions them for analytical use. In a traditional data warehouse, the data is
loaded into the warehouse after transforming it into a well-defined and structured format: this is
called schema on write.

A data lake is a massive storage repository with huge processing power and the ability to handle a
very large number of concurrent data management and analytical tasks. In a data lake, data is not
moved into a warehouse structure unless there is a use for it. Data lakes ensure all data is stored
for a potentially unknown later use: schema on read.
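A toy contrast between schema on write and schema on read; the JSON records, field names and validation rule are invented for illustration.

    import json

    raw_events = ['{"user": "ana", "amount": "12.5"}', '{"user": "luis"}']

    # Schema on write: transform and validate before loading; malformed records are rejected now.
    warehouse = []
    for line in raw_events:
        record = json.loads(line)
        if "user" in record and "amount" in record:
            warehouse.append({"user": record["user"], "amount": float(record["amount"])})

    # Schema on read: the lake keeps everything as-is; structure is applied only when queried.
    data_lake = list(raw_events)
    parsed_later = [json.loads(line) for line in data_lake]
    print(warehouse, parsed_later, sep="\n")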
8. DATA MANAGEMENT: DATA STORAGE AND RETRIEVAL INFRASTRUCTURE. SCALING

Depending on the type of physical storage system, the storage capacity and the access time
increase. Using these criteria, we can define a memory hierarchy:

• Internal register is for holding the temporary results and variables.


• Cache is used by the CPU for memory which is being accessed repeatedly.
• Main memory or RAM (Random Access Memory): It is a type of the computer memory
and is a hardware component.
• Hard disk: A hard disk is a hardware component in a computer. Data is kept permanently
in this memory.
• Magnetic tape: Magnetic tape memory is usually used for backing up large data.

9. DATA MANAGEMENT: DATA QUALITY

The concept of data quality includes the following:

• Data Profiling
• Data Parsing and Standardization
• Data Matching and Data Cleansing

Data profiling provides the metrics and reports that business information owners need to
continuously measure, monitor, track, and improve data quality at multiple points across the
organization.

Data parsing and standardization typically provides data standardization capabilities, enabling
data analysts to standardize and validate their customer data.

Data cleansing is used to correct data and make data consistent.

Data matching is the identification of potential duplicates for account, contact, and prospect
records.
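A small pandas sketch of parsing/standardization, cleansing and duplicate matching on an invented contact table; the columns and values are purely illustrative.

    import pandas as pd

    contacts = pd.DataFrame({
        "name": [" Ana Ruiz", "ana ruiz", "Luis Gil"],
        "email": ["ANA@MAIL.COM", "ana@mail.com", "luis@mail.com"],
    })

    contacts["name"] = contacts["name"].str.strip().str.title()   # parsing and standardization
    contacts["email"] = contacts["email"].str.lower()              # cleansing: consistent format
    duplicates = contacts[contacts.duplicated(subset=["email"], keep=False)]  # matching
    print(duplicates)
    deduped = contacts.drop_duplicates(subset=["email"])           # keep one record per contact
    print(deduped)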

10. DATA MANAGEMENT: DATA SECURITY

We need to secure:

• Machines
• Data transfer across different phases of data operation

Data safeguarding techniques:

• Encrypting: Encrypting everything in a comprehensive way reduces your exposure


• Data anonymization: When data is anonymized, you remove all data that can be uniquely
tied to an individual.
• Tokenization: This technique protects sensitive data by replacing it with random tokens or
alias values that mean nothing to someone who gains unauthorized access to this data.
• Cloud database controls: In this technique, access controls are built into the database to
protect the whole database so that each piece of data doesn’t need to be encrypted.
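An illustrative sketch of tokenization and anonymization using only Python's standard library; the record, the in-memory token vault and the hashed identifier are assumptions for the example. A real deployment would use a secured token service and proper key management, and hashing a name is closer to pseudonymization than full anonymization.

    import hashlib
    import uuid

    record = {"name": "Ana Ruiz", "card": "4111111111111111", "purchase": 42.0}

    # Tokenization: the sensitive value is replaced by a random token that means nothing
    # to anyone who gains unauthorized access; the mapping is kept separately.
    token_vault = {}
    token = uuid.uuid4().hex
    token_vault[token] = record["card"]
    tokenized = {**record, "card": token}

    # Anonymization (simplified): drop direct identifiers, keep only a hash and the metric.
    anonymized = {"purchase": record["purchase"],
                  "user_hash": hashlib.sha256(record["name"].encode()).hexdigest()}
    print(tokenized, anonymized, sep="\n")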
