Big data analysis tools and their applications

Department of Refrigeration, Air Conditioning and Energy Engineering, National Chin-Yi University of Technology

Course outline

- Big Data Analysis and Big Data Talents

- Big Data Architecture and Architecture Optimization

- Big data tools and their applications

Big Data Analysis and Big Data Talents

- Data analysis has become a defining trend of the information age, which has increased both the number of openings for data talent and the pay on offer. Data talent is even more sought-after in Europe and North America.

- As big data is collected, processed, analyzed, and applied, the specialists needed at each key step can gradually become independent roles, or evolve into important occupations, as the work grows in importance and volume.

- However, because enterprises are still working out how to position and define data talent, job content and job nature are often confused or the division of labor is unclear. Many positions are simply called data analyst or data engineer even though the actual work differs greatly.

- For example, data engineers (Data Engineer) mostly come from software engineering, but they focus on building and applying data analysis systems and on automating processing steps such as data collection, cleaning, transformation, and integration. Sometimes they also assist with part of the analysis.

- Other common occupational types include data scientist (Data Scientist), data analyst (Data Analyst), and data architect (Data Architect).

Big Data Analysis and Big Data Talents

- The industry's demand for data talent can be classified into three levels/directions:

- Big data architecture: emphasizes infrastructure and data architecture
  - Focuses on the implementation principles, deployment, optimization, and stability of various open-source frameworks, and on building the base environment for big data applications with data flow tools and visualization tools
  - Role: data architect (Data Architect)

- Big data analysis: emphasizes system modeling and data analysis
  - Focuses on statistical analysis, indicator design, data correlation, deep mining, and machine learning applied to the collected data and business content, in order to extract information, reason about it, draw conclusions or predict outcomes, and make recommendations, including domain-specific suggestions and proposals
  - Roles: data scientist (Data Scientist), data analyst (Data Analyst)

- Big data development: emphasizes data application and implementation
  - Focuses on data application systems, server-side and database development, and related application software (data operation interfaces, data carrier connections, data processing, client applications, etc.), using familiarity with the data application to build what is needed quickly
  - Role: data engineer (Data Engineer)

Big Data Analysis and Big Data Talents

- The data architect is mainly responsible for establishing and maintaining the equipment and technical standards for company data storage, planning how hardware and software operate together, and ensuring that the overall data storage system can support future data volumes and analysis needs

- Builds the infrastructure and architecture of the overall system

- The main development directions and common technical tools of big data architecture include:
  - Architecture theory: parallel computing theory, distributed architecture, use of MapReduce, Spark, etc.
  - Data flow applications: Flume, Fluentd, Kafka, ZMQ, etc.
  - Distributed storage: HDFS (Hadoop Distributed File System), Ceph, etc.
  - Application software: Hive, HBase, Cassandra, PrestoDB, MongoDB, etc.
  - Visualization applications: FineReport, Tableau, ECharts, HTML5+CSS3, Highcharts, Chart.js, D3.js, etc.

Big Data Analysis and Big Data Talents

- The data analyst mainly analyzes and interprets organized data, trying to extract information from it in order to draw conclusions that support decisions or to identify trends for prediction

- Handles the data analysis and application stage

- The work is exploratory: it attempts to solve unknown problems, may not find a correct solution, and cannot guarantee results (which are often hard even to predict) or business output, yet the results can strongly affect how the company operates

- This is the core business of big data processing, so when a team is short-staffed, data analysts usually form the core and take on the other aspects of the work as well

Big Data Analysis and Big Data Talents

- The main development directions and common technical tools of big data analysis include:

- Database applications: RDBMS, NoSQL, MySQL, Hive, Cassandra, etc.

- Data processing: DataPipeline, Kettle, Informatica, GoldenGate, etc.; use ETL (extract, transform, load) tools and programs written in Python to clean, transform, and integrate data (the pre-processing that precedes analysis)

- Statistics: summarizing the data with probability and statistics

- Data analysis: analysis through data mining, clustering, classification, regression analysis, data modeling, machine learning, business knowledge, domain knowledge, etc.

- General-purpose tools commonly used from pre-processing through statistics to analysis include R, SAP, SPSS, SSAS, SSRS, Excel, etc.
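
The ETL (extract, transform, load) flow named in the data processing item can be sketched in a few lines of Python. This is only a minimal illustration built on the standard library, not a replacement for tools such as Kettle or Informatica. The sample readings, the column names, and the `readings` table are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, stray spaces, one missing value.
RAW_CSV = """site,temp_c
 Plant-A ,21.5
plant-a,22.5
Plant-B,
Plant-B,30.0
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize site names, convert types, drop incomplete rows."""
    cleaned = []
    for row in rows:
        temp = row["temp_c"].strip()
        if not temp:
            continue  # cleaning: skip rows with a missing reading
        cleaned.append((row["site"].strip().upper(), float(temp)))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (site TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# A first analysis step on the integrated data: average reading per site.
averages = dict(conn.execute(
    "SELECT site, AVG(temp_c) FROM readings GROUP BY site ORDER BY site"
))
print(averages)
```

Keeping the three steps as separate functions mirrors how ETL tools let each stage be replaced or rerun independently.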

Big Data Analysis and Big Data Talents

- Data analysts can be roughly divided into two categories: technically oriented and business oriented. Their capabilities and work content differ considerably, and so do their tool requirements.

- Technically oriented analysts lean toward applying big data across all its stages, including collection, storage, management, modeling, and deployment

- Business-oriented analysts focus on applying data to their own business or professional field, including project analysis, cost calculation, business planning, etc.

- The two differ in how they apply data and in the processing stages they work on, but both can usually handle the data pre-processing work of a general data engineer

Big Data Analysis and Big Data Talents

- Technical data analysts usually work in information technology (IT) departments, data centers, etc.

- The role is close to an analyst who combines a statistician's skills with a software engineering background, and who must make good use of tools to support statistical analysis

- Depending on the stage of work, the role is usually divided into database engineers, ETL engineers, web crawler engineers, algorithm engineers, etc.

- Teams are usually subdivided by data processing stage into groups such as data warehousing, thematic analysis, modeling analysis, and data governance

- The work combines data acquisition, data organization, database management, algorithm development, and report design, so that data scattered across different places can be collected, computed into commonly used indicators, and presented as easy-to-understand charts

Big Data Analysis and Big Data Talents

- Business data analysts usually work in operations, marketing, sales, and similar departments

- Closer to a traditional analyst or researcher, but requires a deep understanding of the business domain

- Depending on the department served, the role can be roughly divided into data operations, business analysis, membership analysis, business analyst, etc.

- The specific problems, analytical approaches, and frameworks differ between business areas, but in general the logic used to process the data is derived by analogy from the analysis logic of the business itself

- The work mainly consists of preparing business reports, performing special analyses for specific business lines, and measuring, planning, and applying data for business growth

Big Data Analysis and Big Data Talents

- Data scientist
  - A statistician who can extract information from large data sets and make statistics and inferences through big data analysis
  - A data analyst who is skilled at analysis and can derive conclusions or evaluate recommendations
  - In effect, a general term for a senior data analyst

- Domain expert
  - Has sufficient knowledge and experience in a professional field and a certain ability to solve problems in that field
  - Usually assists or advises data analysts during analysis, proposing solutions or representing the thinking of likely users (proxy opinion)
  - Data analysts and project managers need to communicate and coordinate with the domain expert so that the expertise is translated into knowledge and rules in the data system

Big Data Analysis and Big Data Talents

- The data engineer is mainly responsible for securing data sources, collecting, importing, and pre-processing data, and confirming and integrating how data systems and frameworks in the enterprise are established, structured, and configured

- Handles cleaning, transformation, integration, and other processing during data collection

- Focuses more on developing and maintaining server-side and database-related functions; the nature of the work is basically similar to that of a software engineer

- The main development directions and common technical tools of big data development include:
  - Database development: RDBMS, NoSQL, MySQL, Hive, Cassandra, etc.
  - Data flow development: Flume, Fluentd, Kafka, ZMQ, etc.
  - Front-end development: HTML5+CSS3, Highcharts, Chart.js, D3.js, etc.
  - Data acquisition development: Python, embedded controller development languages (C/C++, Wiring), etc., applied to web crawlers, word segmentation, semantic analysis, natural language processing, and similar applications
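
As a rough illustration of the web crawler work mentioned under data acquisition, the sketch below shows only the parsing step: pulling links out of one page with Python's standard `html.parser`. The page content and URLs are hypothetical, and a real crawler would also download pages, respect robots.txt, and track visited URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))

# Hypothetical page content; a real crawler would fetch this over HTTP.
PAGE = (
    '<html><body>'
    '<a href="/news">News</a> '
    '<a href="https://example.org/a">A</a>'
    '</body></html>'
)

parser = LinkExtractor("https://example.com/index.html")
parser.feed(PAGE)
print(parser.links)
```

The extracted links would normally be fed back into a queue of pages to visit, which is where the crawl loop and rate limiting come in.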

Big Data Analysis and Big Data Talents

- The demand for data talent in the information age is quite diverse. Facing larger, more varied, and more complex information requires not only the ability to process huge amounts of data quickly but also a command of data structures, programming languages, applied statistics, data mining, and processing and analysis skills

- However, each field involves a great deal of specialized knowledge. Even if individual members are recruited according to the talent classification, communication gaps in expressing their respective needs can easily prevent the team from integrating smoothly, let alone operating effectively.

- A project manager with professional management knowledge and a basic understanding of data science management and application can serve as team leader, building a professional data team whose members divide the work, cooperate, and each perform their own duties

- This mirrors today's typical software engineering team, which grew from a few members handling a vast and complex workload into a complete team through business break-in, work refinement, and reorganization.

- Most companies expect to draw on the software team's experience to quickly build a new data team, or split one off from the existing software team, and have it operate independently. However, optimizing the team structure still takes time and continued break-in.

Big Data Analysis and Big Data Talents

Job title: basic abilities

- Chief Information Officer / Project Manager: strategic analysis, team management, corporate communication, direction planning, basic management and application knowledge of data science, etc.

- Data Scientist / Domain Expert: problem definition and clarification, logical thinking, cross-domain integration and cooperation, programming and system development, machine learning and artificial intelligence, etc.

- Data Analyst: statistical analysis, database application, programming and development, data mining, big data processing, etc.

- Data Architect: data warehouse management, relational database systems, distributed data storage systems, architecture planning and integration, etc.

- Data Engineer: application of open-source software frameworks/systems (Hadoop, Spark, etc.), programming and system development, developing ETL (extract/transform/load) processes, building data pipelines (Data Pipelines), system integration, etc.

Big Data Architecture and Architecture Optimization

- Building a commercial big data system involves the following processing steps, each with the software ([Soft]) and hardware ([Hard]) systems that must be built:

- Data collection: [Soft] web data collection (crawlers), operational data collection (logs)
  [Hard] enterprise network (intranet/extranet), data storage servers

- Data storage: [Soft] data warehousing, relational databases (structured/unstructured)
  [Hard] distributed data systems, data storage servers, in-memory storage

- Data processing: [Soft] batch processing, message queues, real-time processing
  [Hard] in-memory storage, in-memory computing, computing servers

- Data retrieval: [Soft] query matching, data correlation, distributed search
  [Hard] distributed data systems, data storage and computing servers, in-memory storage and computing

- Data mining: [Soft] data mining, machine learning
  [Hard] data storage and computing servers, in-memory storage and computing
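
The difference between the batch processing and real-time processing listed under data processing can be shown with a small Python sketch. It is only an illustration under simple assumptions: the readings are hypothetical, and a `deque` stands in for a real message queue such as Kafka.

```python
from collections import deque

def batch_average(values):
    """Batch processing: the whole data set is available before computing."""
    return sum(values) / len(values)

class RunningAverage:
    """Real-time processing: update the result as each message arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings

# A deque stands in for the message queue between collector and processor.
message_queue = deque(readings)
stream = RunningAverage()
running = []
while message_queue:
    running.append(stream.update(message_queue.popleft()))

print(batch_average(readings))  # one answer after all data is collected
print(running)                  # an updated answer after every message
```

The batch version needs the full data set in storage, while the streaming version keeps only two numbers in memory, which is one reason the hardware needs of the two differ.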

Big Data Architecture and Architecture Optimization

- Optimizing the overall system is one of the data architect's important tasks. Besides adjusting the cluster structure of storage and computing servers to match demand, it includes deciding whether in-memory storage and computing servers (In-memory Server) should be separate machines, adjusting network bandwidth, separating internal and external networks for data security, and analyzing software, resources, and links to find what to optimize

- The main considerations fall into three trade-offs:
  - Speed versus resources
  - Stability versus resources
  - Scalability versus resources

Big Data Architecture and Architecture Optimization

- Speed vs. resource trade-off

- Is the processing structure unreasonable?
  - Large tasks take too long to execute: try splitting them and processing the parts in parallel
  - Small tasks respond too slowly: try improving the program structure or data storage speed

- Is the processing logic unreasonable?
  - For example, when distributed parallel computing is introduced for batch processing, check whether causal or ordering dependencies between the data force workers to sit idle waiting for the previous stage's results, wasting computing resources

- Is resource allocation unreasonable or insufficient?
  - For example, with virtual machines (KVM) or containers (Docker), server resources may be split among too many virtual servers or configured redundantly

- Is the data storage and access structure unreasonable?
  - Decide whether to adopt in-memory technology based on how frequently the data is used or needed
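
The advice to split a large task and process the parts in parallel can be sketched as follows. This is a minimal example using Python's `concurrent.futures`; the task (a sum of squares) and the helper names are hypothetical. A thread pool keeps the sketch simple, although for CPU-bound work in CPython a process pool is usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, n_chunks):
    """Split one large task into roughly equal, independent chunks."""
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(chunk):
    """Work on one chunk; it must not depend on any other chunk's result."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(items, workers=4):
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)  # combine the partial results

data = list(range(1000))
result = parallel_sum_of_squares(data)
print(result == sum(x * x for x in data))
```

The split only pays off because the chunks are independent. If one chunk needed another's output, workers would wait on each other, which is the idle-resource problem described in the processing logic item.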

Big Data Architecture and Architecture Optimization

- Stability vs. resource trade-off

- A fast server is better than a slow server
  - Best user experience

- A slow server is better than a crashed server
  - Use distributed queuing, resource locking, and similar mechanisms to avoid system errors

- Server downtime is better than server damage
  - Assess and spread system load to avoid damage, since rebuilding a system costs a great deal of manpower, time, and money

- Evaluate mixing local servers with cloud services

- Plan data system backups, backup servers, and standby servers

- Heat dissipation and stability of server hardware
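
The queuing and resource locking mechanisms mentioned above can be illustrated on a single machine with Python's `queue` and `threading` modules. This is a sketch under simplifying assumptions: the 50 requests and the shared counter are hypothetical, and a distributed system would use a message broker and distributed locks instead.

```python
import queue
import threading

task_queue = queue.Queue(maxsize=100)  # bounded queue absorbs bursts of requests
counter_lock = threading.Lock()        # lock guards the shared resource
processed = {"count": 0}

def worker():
    while True:
        item = task_queue.get()
        if item is None:               # sentinel: no more work for this worker
            task_queue.task_done()
            break
        with counter_lock:             # only one thread updates at a time
            processed["count"] += item
        task_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for i in range(1, 51):                 # 50 hypothetical requests
    task_queue.put(i)
for _ in threads:
    task_queue.put(None)               # one sentinel per worker
for t in threads:
    t.join()

print(processed["count"])
```

Because the queue is bounded, producers slow down when workers fall behind, so the system degrades to a slow server rather than a crashed one.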

Big Data Architecture and Architecture Optimization

- Scalability vs. resource trade-off

- Selection of system architecture
  - The expandable data nodes of a distributed data system make it possible to configure server clusters quickly

- How relational databases are created and how files are stored
  - How the database is created affects how index tables are built (one overall table or chained tables), which in turn affects query speed, distributed computing speed, and the difficulty of adding data later
  - When unstructured data is stored in a distributed manner, check whether the placement of file blocks takes server response speed and health into account

- Creation and selection of data/server nodes
  - Building server rooms and server clusters
  - Bringing in external nodes to expand data storage (partner or cloud servers)
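
One way to picture block placement that takes server response speed and health into account is the toy scheduler below. It is an assumption-laden sketch, not how HDFS or Ceph actually place data: the `Node` class, the latency figures, and the "least loaded first, then lowest latency" rule are all hypothetical.

```python
class Node:
    """A storage node with the status a scheduler would monitor."""

    def __init__(self, name, healthy, latency_ms):
        self.name = name
        self.healthy = healthy
        self.latency_ms = latency_ms  # recent average response time
        self.blocks = 0               # blocks already stored on this node

def place_block(nodes, replicas=2):
    """Pick healthy nodes, preferring lightly loaded ones, then lower latency."""
    candidates = [n for n in nodes if n.healthy]
    if len(candidates) < replicas:
        raise RuntimeError("not enough healthy nodes for the requested replicas")
    chosen = sorted(candidates, key=lambda n: (n.blocks, n.latency_ms))[:replicas]
    for node in chosen:
        node.blocks += 1
    return [node.name for node in chosen]

cluster = [
    Node("node-a", True, 5.0),
    Node("node-b", True, 12.0),
    Node("node-c", False, 3.0),  # unhealthy: never selected
    Node("node-d", True, 8.0),
]
placements = [place_block(cluster) for _ in range(3)]
print(placements)
```

Adding a node to `cluster` immediately makes it a candidate for new blocks, which is the kind of quick expansion the scalability discussion is about.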

Big data tools and their applications

- The application of data, and its movement through its life cycle, can be roughly divided into the following stages

- Data collection: data is gathered through manual import, programs that automatically capture and record information, parsing of streams or network connections, parsing or building tables, connections to other databases, and so on, and is then pre-processed (cleaned, transformed, and integrated) to form the raw data

- Data storage: after the collected data has been organized, structured and unstructured data are stored separately (in data tables or data directories). The more common practice today is to build a data warehouse on an open-source distributed architecture that can be expanded flexibly.

- Data modeling: a model is formed by working out the mathematical relationships in the data and establishing a calculation method or data indicator. Additional filtering and formatting is sometimes needed before modeling to ensure the model is reliable

- Information analysis: seeks causal or influence relationships between data, or makes appropriate interpretations and inferences about what the data shows

- Visual delivery: creates charts that make analysis or presentation easier
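
The modeling, analysis, and visual delivery stages can be walked through on a tiny data set. The sketch below fits a straight line by ordinary least squares, uses it for one inference, and prints a crude text chart. The temperature and power figures are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Collected and cleaned data (hypothetical): outdoor temperature (C) vs. power (kW).
samples = [(20.0, 3.0), (25.0, 4.0), (30.0, 5.0), (35.0, 6.0)]

def fit_line(points):
    """Data modeling: least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(samples)

# Information analysis: use the model to infer a value that was not observed.
predicted_kw = slope * 40.0 + intercept

# Visual delivery: a crude text bar chart of the observed values.
for temp, kw in samples:
    print(f"{temp:5.1f} C | {'#' * round(kw * 2)} {kw} kW")
print(f"predicted at 40.0 C: {predicted_kw:.2f} kW")
```

In practice each stage would use the dedicated tools named on the following slide; the point here is only how the output of one stage becomes the input of the next.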

Data analysis and processing stages and corresponding tools

- Visualization
  - Offline interface/report tools: Excel, PowerPoint, Tableau, ...
  - Offline code-based tools: R, SAS, Python, Processing, ...
  - Online interfaces: ECharts, Tagxedo, cloud services, ...

- Information analysis
  - Interface-based tools: Excel, SPSS, ...
  - Code-based tools: VBA, Python, R, SAS, ...

- Data modeling
  - Interface-based tools: SPSS, ...
  - Code-based tools (specialized language): R, SAS, ...
  - Code-based tools (general-purpose language): Python, ...

- Data storage
  - Databases: SQL, Hadoop, Hive, ...
  - Interface-based tools: Excel, SPSS, ...
  - Code-based tools: VBA, Python, R, SAS, ...

- Data collection
  - Databases: SQL, Hadoop, Hive, ...
  - Web crawlers: Python, Java, PHP, C/C++, ...

Summary of common tools

- File storage: Hadoop HDFS, Tachyon, KFS

- Distributed computing: Hadoop MapReduce, Spark

- Streaming/real-time computing: Storm, Spark Streaming, S4, Heron

- Key-value (KV) and NoSQL databases: HBase, Redis, MongoDB

- Resource management: YARN, Mesos

- Log collection: Flume, Scribe, Logstash, Kibana

- Message queues: Kafka, StormMQ, ZeroMQ, RabbitMQ

- Query analysis: Hive, Impala, Pig, Presto, Phoenix, Spark SQL, Drill, Flink, Kylin, Druid

Summary of common tools

- In-memory storage/computing: Memcached, Berkeley DB, Redis

- Distributed coordination service: ZooKeeper

- Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager

- Data mining and machine learning: Mahout, Spark MLlib

- Data synchronization: Sqoop

- Task scheduling: Oozie

- Visualization tools: FineReport, Tableau

- JavaScript libraries: ECharts, Highcharts, Chart.js, D3.js

Apache open-source projects/frameworks/tools

- Hadoop: an open-source cluster computing framework that uses the distributed file system HDFS together with MapReduce, a distributed (parallel) computing framework built on the concepts of map and reduce

- Spark: an open-source cluster computing framework based on Resilient Distributed Datasets (RDDs) and in-memory computing; it includes Spark Streaming (a streaming analysis engine), GraphX (a distributed graph processing framework), and other modules for processing unstructured data

- Nutch: an open-source, extensible web crawler (Crawler) and search (Searcher) engine

- Flume: an open-source distributed log collection system that can aggregate data from a variety of sources, including network traffic, social media, email, and event data

- Kafka: an open-source streaming platform that can be viewed as a large-scale publish/subscribe message queue built on a distributed transaction log architecture

- HDFS: an open-source distributed file storage system
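
The map and reduce concepts behind Hadoop MapReduce can be illustrated with a word count in plain Python. This sketch runs in a single process and does not use any Hadoop API; the function names and the input splits are hypothetical and chosen only to mirror the three phases.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits; on a cluster each would live on a different node.
splits = ["big data tools", "big data analysis", "data"]

mapped = [pair for doc in splits for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` and `reduce_phase` touches only its own input, a framework can run them on different machines, which is what makes the model suitable for distributed computing.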

Apache open-source projects/frameworks/tools

- Hive: an open-source data warehousing and analysis suite that maps structured data files to database tables, which can then be queried with SQL-like statements that are quickly translated into MapReduce jobs

- HBase: an open-source non-relational distributed database (NoSQL) modeled on the Google BigTable architecture; it implements compression, in-memory operation, and so on, and can store the output of MapReduce tasks

- Cassandra: an open-source distributed NoSQL database system with its own data model and a fully decentralized architecture

- PrestoDB: a high-performance distributed SQL query engine that can query several types of data source at the same time, such as relational databases (MySQL, PostgreSQL, AWS Redshift, MS SQL Server, Teradata) and non-relational sources (HDFS, AWS S3, Cassandra, Kafka, MongoDB)

- ECharts: an open-source visualization library implemented in JavaScript; it is cross-platform and compatible with most browsers, and is used to visualize charts and geographic maps
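
The idea of one SQL statement spanning several data sources, as described for PrestoDB, can be imitated on a small scale with SQLite's `ATTACH DATABASE`. This is only an analogy using Python's standard `sqlite3` module, not the PrestoDB API. The two in-memory databases, the table names, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                   # first source: "main"
conn.execute("ATTACH DATABASE ':memory:' AS sales")  # second, separate source

conn.execute("CREATE TABLE main.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO main.customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO sales.orders VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 50.0)])

# One SQL statement joins tables that live in two different databases.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM main.customers AS c
    JOIN sales.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(totals)
```

A federated engine applies the same principle across different database products, with connectors translating the query for each source.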

Visualization application tools

- At present, most visualization applications are built for the web environment.

- They mainly use HTML5 (markup syntax) and CSS3 (cascading style sheets) together with various JavaScript libraries, and build the visual interface with HTML DOM (Document Object Model) containers or API-based support modules

- They are usually deployed together with Apache HTTP Server (web server) and an SQL server (database server), and usually use JSON (JavaScript Object Notation) to convert structured data into JavaScript objects

- Open-source tools mostly take the form of JavaScript libraries whose APIs let users quickly create charts, update them automatically, and animate the updates

- JavaScript can also access control elements written in other programming languages (such as chart elements authored in C/C++ in Visual Studio)

- There is also a lot of business intelligence (BI) visual analysis software that creates visual analysis reports/dashboards interactively through a GUI; most of it also supports publishing a web version for sharing
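
The JSON step in this web stack can be sketched from the server side. The example below turns query rows into a JSON document that a JavaScript chart library could consume. The rows, the series name, and the `labels`/`series` layout are hypothetical, since each chart library defines its own expected structure.

```python
import json

# Hypothetical rows as they might come back from a database query.
rows = [("Jan", 120), ("Feb", 150), ("Mar", 90)]

def rows_to_chart_json(rows, series_name):
    """Serialize (label, value) rows into a JSON document for a chart library."""
    payload = {
        "labels": [label for label, _ in rows],
        "series": [{"name": series_name, "data": [value for _, value in rows]}],
    }
    return json.dumps(payload)

chart_json = rows_to_chart_json(rows, "Energy use (kWh)")
print(chart_json)

# The browser would parse this text into a JavaScript object; json.loads
# plays that role here to confirm the structure round-trips.
decoded = json.loads(chart_json)
```

Keeping the payload independent of any one library makes it easier to swap the front-end chart component later.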

Glossary

- Data Architect

- Data Scientist

- Domain Expert

- Data Analyst

- Data Engineer

Glossary

- Apache open-source projects/frameworks/tools
  - Hadoop
  - Spark
  - HDFS
  - Nutch
  - Flume
  - Hive
  - HBase
  - ECharts

Discussion questions

- What are the three directions/levels of data talent demand in the industry?

- Which four emerging occupations arise from these three directions/levels?

- What are the two broad categories of data analysts?

- List the five job titles of big data analysis talent and the basic abilities each requires.

Discussion questions

- What are the main considerations when optimizing a big data architecture?

- What are the trade-offs between system speed and resources?

- What are the trade-offs between system stability and resources?

- What are the trade-offs between system scalability and resources?

Discussion questions

- List the software and hardware system components that correspond to each processing step of a big data system.

- List the data analysis and processing stages and their corresponding tools.

- Describe the environment and components used to build data visualization tools.

Q&A
