
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF SCIENCE

FACULTY OF INFORMATION TECHNOLOGY

Final presentation
Topic: KNIME Analytics Platform

Course: Big Data

Students:                          Supervisors:


Đào Tiến Hưng (21127051)           Prof. Dr. Nguyễn Ngọc Thảo
Nguyễn Thành Luân (21127102)       Teaching Assistant Bùi Huỳnh Trung Nam
Nguyễn Hữu Khánh (21127072)
Hà Huy Hoàng (21127610)

April 21, 2024


University of Science - VNU-HCM
Final presentation Big Data

Table of Contents

1 INTRODUCTION
   1.1 What is KNIME analytics platform?
   1.2 Why is KNIME recommended for use?
   1.3 Historical overview

2 KNIME’S MAIN FEATURES
   2.1 I/O operators
   2.2 Data processing
   2.3 Visualization
   2.4 Data Analytics And Building Model
   2.5 Reporting
   2.6 KNIME Node

3 SOLVE REAL PROBLEM: BIGMART SALE DATA

4 KNIME, RAPID MINER, AND POWER BI

5 TASK ASSIGNMENT AND EVALUATION

References


1 INTRODUCTION
1.1 What is KNIME analytics platform?
KNIME[5][8], initially developed at the University of Konstanz, Germany, is a versatile open-source
platform for advanced data analytics, integration, and reporting. Built on a modular, node-based
architecture, KNIME enables users to construct sophisticated data workflows by orchestrating nodes
representing diverse data processing tasks within an intuitive graphical interface.

A distinguishing facet of KNIME lies in its adaptability to various data formats and origins, facilitat-
ing seamless integration with databases, flat files, web services, and prominent big data frameworks. This
versatility extends to its compatibility with major programming languages like R, Python, and Java, en-
hancing its utility for comprehensive data analysis and custom functionality deployment.

Moreover, KNIME’s extensibility is underscored by its expansive repository of extensions and plugins,
encompassing advanced machine learning algorithms, statistical analysis tools, and integration capabili-
ties with cloud services and big data ecosystems. This extensible framework enables users to tailor KNIME
to their specific analytical requirements and industry contexts.

Additionally, KNIME prioritizes collaborative knowledge sharing through centralized repositories for
storing and disseminating workflows, components, and best practices. This collaborative ethos fosters
a conducive environment for collective learning and expertise exchange among users, thereby accelerating
data-driven insights generation and decision-making processes.

In the realm of data analysis, the "building blocks of analytics" are the fundamental constituents of
analytical endeavors. They encompass a spectrum of tools, methodologies, and components that enable
data analysts to access, process, and interpret data. Within the context of KNIME, a prominent platform
for data analytics, these building blocks typically comprise nodes, workflows, and models for data analysis.
The integration of these foundational elements establishes a robust framework for the execution of intricate
data analysis tasks. They furnish data analysts with the requisite resources and capabilities to engage in data
processing, conduct comprehensive analyses, and distill salient insights from the data corpus. Through the
utilization of these building blocks, KNIME users can streamline the process of data access and manipu-
lation, thereby fostering the generation of profound insights and the informed formulation of data-driven
decisions.


Figure 1: Building blocks of analytics

The conceptual framework of analytics building blocks[11] encompasses the foundational elements and
methodologies integral to the data analytics lifecycle. These components serve as the fundamental units
through which insights and actionable intelligence are extracted from raw data. Below, we examine each
building block in detail:

1. Data Collection and Acquisition: This initial phase involves the systematic aggregation of data
from diverse sources, encompassing databases, repositories, APIs, streaming platforms, sensors,
and external interfaces. The data, varying in structure and format, may originate from internal orga-
nizational systems, external collaborators, or publicly available repositories.

2. Data Cleaning and Preprocessing: Raw data typically exhibits inconsistencies, errors, missing val-
ues, and noise, necessitating preprocessing before analysis. Techniques encompassed in data clean-
ing and preprocessing include deduplication, missing data handling, standardization, normalization,
and outlier detection.

3. Exploratory Data Analysis (EDA): EDA constitutes a pivotal stage wherein analysts meticulously
examine and visualize the dataset to gain comprehensive insights into its structural attributes, distri-
butions, interrelationships, and anomalous patterns. Descriptive statistical measures, alongside data
visualization methodologies and correlation analyses, are commonly employed during EDA.

4. Feature Engineering and Selection: Feature engineering entails the transformation of raw data
into informative features conducive to enhancing the efficacy of machine learning models. This
process encompasses the creation of novel features, curation of pertinent features, and encoding of
categorical variables.

5. Statistical Analysis and Hypothesis Testing: Statistical methodologies are deployed to scrutinize
data and validate hypotheses regarding underlying patterns or relationships. Techniques such as re-
gression analysis, ANOVA, t-tests, and chi-square tests are employed for inferential statistical anal-
ysis.


6. Machine Learning Modeling: In this phase, machine learning algorithms are trained on the pre-
pared dataset to develop predictive models or uncover latent patterns. Supervised, unsupervised, and
ensemble learning algorithms are commonly leveraged for modeling purposes.

7. Model Evaluation and Validation: Following model training, rigorous evaluation and validation
protocols are employed to assess model performance and generalization capabilities. Performance
metrics including accuracy, precision, recall, F1-score, and AUC are utilized for evaluation, with
techniques such as cross-validation employed for validation.

8. Model Deployment and Monitoring: Deploying models into production environments facilitates
real-time predictions on new data. Continuous monitoring of model performance is imperative to
detect concept drift, ensure reliability, and recalibrate models as necessary.

9. Interpretation and Communication of Results: Finally, the insights gleaned from the analysis are
meticulously interpreted within the domain context and effectively communicated to stakeholders.
Visualization techniques, interactive dashboards, comprehensive reports, and succinct presentations
are employed to disseminate findings and articulate actionable recommendations.

By methodically adhering to these building blocks, organizations can harness the potency of data analytics
to extract invaluable insights, underpin evidence-based decision-making, and engender a competitive edge
within the contemporary data-centric landscape.
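
As a concrete, deliberately tiny illustration, the flow from cleaning through evaluation above can be sketched in plain Python. The dataset and the threshold "model" below are invented for the example and are not KNIME-specific code:

```python
from statistics import mean

# Toy dataset: (hours_studied, passed); one observation is missing (None).
rows = [(2.0, 0), (4.0, 0), (None, 1), (8.0, 1), (9.0, 1)]

# Blocks 1-2: acquisition is simulated; cleaning = mean imputation.
observed = [h for h, _ in rows if h is not None]
fill = mean(observed)                                 # 5.75
cleaned = [(h if h is not None else fill, y) for h, y in rows]

# Block 3: a descriptive statistic, the kind EDA starts from.
avg = mean(h for h, _ in cleaned)                     # 5.75

# Blocks 4-6: "model" = predict pass iff hours >= the mean (a toy rule).
def predict(hours):
    return 1 if hours >= avg else 0

# Block 7: evaluation by accuracy.
acc = sum(predict(h) == y for h, y in cleaned) / len(cleaned)
print(acc)                                            # 1.0 on this toy data
```

In KNIME each of these steps would be a node (e.g. a Missing Value node for the imputation), but the underlying computation is the same.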

In summary, KNIME epitomizes a comprehensive data analytics platform, distinguished by its modu-
lar architecture, compatibility with diverse data sources, extensibility through plugins, and emphasis on
collaborative knowledge sharing. These attributes position KNIME as a preferred tool for academic re-
search, facilitating advanced data analysis and insights derivation across various disciplines and research
domains.

1.2 Why is KNIME recommended for use?


In the dynamic realm of data analytics, the selection of appropriate tools holds paramount importance, sig-
nificantly influencing the efficacy, adaptability, and triumph of analytical endeavors. Amidst the plethora
of options available, KNIME emerges as a preeminent solution, offering a robust platform that empowers
users to extract valuable insights from their data reservoirs. This discourse delves comprehensively into
the compelling rationale behind considering KNIME and advocates its adoption as a preferred choice for
data analytics endeavors spanning diverse domains[6].

Foremost, KNIME’s unwavering commitment to open-source tenets underscores its accessibility and transparency[9],
rendering it an enticing option for both organizational entities and individual practitioners. The platform’s
open-source ethos not only fosters community-driven collaboration and innovation but also instills confi-
dence in its reliability and security posture. KNIME’s open-source architecture affords users the liberty to
tailor, expand, and refine the platform to suit their bespoke needs, fostering a culture of communal knowl-
edge sharing and collective advancement within the data analytics ecosystem.

Moreover, KNIME’s versatility extends across its comprehensive functionality spectrum, encompassing
the entire gamut of data analytics tasks from initial data preprocessing and exploratory analysis to in-
tricate modeling, sophisticated analysis, and insightful visualization. This holistic approach obviates the
necessity for disparate toolsets and workflows, streamlining the analytical journey and augmenting oper-
ational efficiency. Irrespective of whether handling structured or unstructured data, KNIME provides a
unified platform where users can seamlessly amalgamate diverse data sources, deploy advanced analytical
methodologies, and distill actionable insights with remarkable ease and efficacy.

A distinguishing hallmark of KNIME is its cross-platform compatibility, ensuring seamless operability
across heterogeneous operating environments such as Windows, macOS, and Linux distributions. This
flexibility empowers users to harness KNIME’s capabilities across a gamut of computing infrastructures,
spanning desktop workstations, laptop devices, and server deployments, sans encumbrance from compat-
ibility constraints. By transcending platform barriers, KNIME fosters collaboration and knowledge ex-
change across multifarious teams and organizational strata, catalyzing innovation and expediting decision-
making processes.

Central to KNIME’s appeal is its user-centric interface, which democratizes data analytics by enabling
participation from users spanning diverse technical acumen levels. The intuitively designed drag-and-drop
workflow editor, coupled with visually informative node configurations and interactive visualization fea-
tures, empowers domain experts, data analysts, and business stakeholders to explore, analyze, and interpret
data without necessitating extensive programming prowess. This democratization of data science engen-
ders a culture of data-driven decision-making, empowering organizations to harness the full potential of
their data assets in pursuit of organizational objectives.

Furthermore, KNIME boasts an extensive repository of pre-built components, plugins, and extensions
that furnish users with a plethora of functionalities and capabilities. With over 1000 modules encom-
passing data manipulation techniques, machine learning algorithms, and seamless integration with exter-
nal tools and technologies, KNIME empowers users to rapidly assemble bespoke workflows tailored to
their specific requirements. This expansive ecosystem shortens the development lifecycle, enables rapid
prototyping, and facilitates experimentation with cutting-edge analytical methodologies.

In addition to its rich feature set, KNIME underscores scalability and performance as pivotal considera-
tions, ensuring optimal operational efficiency even when confronted with large-scale data analytics under-
takings. Harnessing the prowess of parallel processing, distributed computing paradigms, and in-memory
data storage mechanisms, KNIME optimizes performance and scalability, enabling users to tackle datasets
spanning gigabytes or terabytes with consummate efficiency. Whether executing intricate analytics work-
flows or deploying machine learning models at scale, KNIME delivers the performance and responsiveness
requisite for contemporary data-intensive applications.

Moreover, KNIME’s seamless interoperability with external tools and technologies accentuates its ver-
satility and extensibility[12], affording users the liberty to leverage existing investments in programming
languages, libraries, and platforms. With native support for Java, R, Python, JavaScript, and seamless in-
tegration with sophisticated analytics platforms such as Apache Spark, TensorFlow, and Apache Kafka,
KNIME empowers users to harness the collective potency of these technologies alongside its intuitive
interface and workflow management capabilities. This interoperability augments data integration capabil-
ities, amplifies analytical prowess, and unlocks novel avenues for innovation and discovery within the data
analytics landscape.


In summation, KNIME stands as a paragon of comprehensiveness and versatility in the realm of data
analytics, offering a compelling amalgamation of open-source flexibility, comprehensive functionality,
cross-platform compatibility, user-centric design, extensive module repository, scalability, and seamless
integration with external tools and technologies. By embracing KNIME, organizations and individuals
can embark on a transformative expedition aimed at unlocking the latent potential of their data reservoirs,
propelling innovation, and orchestrating data-driven decisions that propel them towards triumph in an
increasingly data-centric milieu.

1.3 Historical overview

Figure 2: A brief history of KNIME

• Founding (2004): KNIME originated as an academic endeavor spearheaded by scholars at the Uni-
versity of Konstanz, Germany, under the leadership of Michael Berthold. The project’s inception
aimed to furnish a scholarly platform facilitating data preprocessing, analysis, and visualization,
prioritizing user-friendliness and adaptability.

• Initial Releases (2006-2007): The nascent stages witnessed the public dissemination of KNIME,
characterized by foundational features such as a graphical workflow editor, a diverse array of data
manipulation and analytical tools, and compatibility with multiple data formats. During this epoch,
KNIME gained swift traction among scholars and data practitioners seeking an open-source alter-
native to proprietary software solutions.

• Expansion and Community Growth (2008-2012): KNIME underwent robust expansion both func-
tionally and in terms of its user constituency. Augmentations encompassed advanced statistical ana-
lytics, seamless integration with external tools and databases, and facilitation for scripting languages
such as R and Python. Concurrently, the KNIME academic community flourished, fostering the cre-
ation of plugins, workflows, and instructional materials, thereby enriching the platform’s ecosystem.


• Enterprise Edition (2013): As KNIME’s adoption burgeoned in commercial settings, KNIME AG
emerged as a commercial entity, offering professional services including support, training, and con-
sultancy. The advent of the Enterprise Edition catered to the exigencies of enterprise users, fur-
nishing augmented features encompassing heightened security measures, collaborative utilities, and
bespoke support services.
• Integration with Big Data Technologies (2014-2016): Recognizing the imperative of grappling
with voluminous datasets, KNIME expanded its repertoire to encompass integration with promi-
nent big data technologies like Apache Hadoop and Apache Spark. This integration facilitated the
harnessing of distributed computing paradigms for scalable data processing and analysis within the
KNIME milieu.
• Advanced Analytics and Machine Learning (2017-2019): With the burgeoning demand for so-
phisticated analytics and machine learning capabilities, KNIME embarked on augmenting its ana-
lytics and predictive modeling functionalities. Rollouts of new extensions tailored to tasks such as
text mining, image processing, and deep learning ensued, endowing users with the wherewithal to
address multifaceted data analytical challenges.
• Recent Developments (2020-Present): Recent epochs have witnessed continued innovation and
evolution within KNIME, underscored by enhancements targeting usability, scalability, and perfor-
mance metrics. Enhancements to the user interface, workflow management features, and cloud ser-
vice integration epitomize these advancements. Moreover, KNIME has embraced emerging frontiers
like artificial intelligence and automated machine learning, cementing its standing as a comprehen-
sive platform for contemporary data analytics workflows.

2 KNIME’S MAIN FEATURES


2.1 I/O operators
1. Data Acquisition: KNIME encompasses an extensive array of Input/Output (I/O) operators de-
signed to facilitate the retrieval and dissemination of data within analytical workflows. These op-
erators play a pivotal role in procuring data from diverse sources and exporting analysis results to
various destinations.
2. Data Importation: KNIME boasts a diverse repertoire of I/O operators tailored for importing data
from multifarious origins. These encompass:
• File Readers: Operators devised for parsing data from prevalent file formats such as CSV,
Excel, JSON, XML, etc., with users afforded the capacity to tailor import parameters like file
paths and column delimiters.
• Database Connectors: Operators furnishing connectivity to relational databases, wherein
users can configure authentication credentials and SQL query statements to extract data.
• Web Services Access: Operators enabling access to data from web services and APIs, empow-
ering users to specify API endpoints, request parameters, and authentication tokens.
• Big Data Integration: Operators facilitating interaction with expansive data ecosystems like
Apache Hadoop and Apache Spark, enabling users to extract data from distributed file systems
and Spark RDDs.


3. Data Exportation: Correspondingly, KNIME presents an array of I/O operators for exporting anal-
ysis outcomes to diverse destinations. These encompass:

• File Writers: Operators for inscribing analysis results to files in assorted formats, with users
empowered to specify output configurations like file paths and encoding options.
• Database Writers: Operators for committing analysis results to relational databases, allowing
users to specify database connection settings, table structures, and transaction management
criteria.
• Web Services Publication: Operators enabling the publication of analysis results to web ser-
vices and RESTful APIs, affording users the flexibility to define API endpoints, request meth-
ods, and response formats.
• Cloud Storage Integration: Operators for persisting analysis results in cloud storage plat-
forms, granting users control over parameters like bucket names, access keys, and encryption
methods.

4. Configuration Flexibility: Each I/O operator within KNIME provides a plethora of configurable
parameters, allowing users to tailor the import/export processes to their precise specifications. Pa-
rameters encompass file paths, database connection details, query statements, data formats, com-
pression options, and error handling strategies.

5. Error Management: KNIME’s I/O operators are equipped with robust error handling mechanisms
to contend with exceptions that may arise during data import/export operations. Users are empow-
ered to configure error handling protocols to delineate the appropriate responses to contingencies
such as missing files, connection failures, or data format incongruities.

6. Parallel Execution: Certain I/O operators in KNIME support parallel execution paradigms, facili-
tating concurrent import/export operations across multiple threads or nodes. This parallelization en-
hances performance and scalability, particularly in scenarios characterized by voluminous datasets
or distributed computing environments.

7. Integration with Analytical Workflows: I/O operators harmoniously integrate with KNIME’s vi-
sual workflow environment, enabling users to seamlessly incorporate data import/export tasks into
their analytical workflows alongside other data processing and analysis endeavors. This symbiotic
integration fosters the construction of intricate workflows that orchestrate data movement and ex-
change processes, thereby enhancing overall workflow efficiency and efficacy in data analysis en-
deavors.
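
As a rough analogy, the reader/writer pairing described above amounts to parsing input with a configurable delimiter and serialising results back out. The sketch below uses Python's standard csv module on invented in-memory data (io.StringIO stands in for a file path); it illustrates the idea, not KNIME's actual node implementation:

```python
import csv
import io

# What a "File Reader" node does in miniature: parse delimited text
# into rows, honouring a configurable column delimiter.
raw = "id;name;score\n1;alice;0.9\n2;bob;0.7\n"
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
rows = list(reader)

# ...and the "File Writer" counterpart: serialise the rows back out,
# this time comma-delimited.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

In KNIME the delimiter, file path, and encoding are node configuration dialogs rather than function arguments, but the same parameters are being set.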


Figure 3: KNIME’s I/O operators

2.2 Data processing


The data processing capabilities of KNIME are extensive and versatile, catering to a wide range of analyt-
ical needs. Here’s an overview of the data processing features in KNIME:

1. Data Preparation:

• Data Cleansing and Imputation: KNIME offers a diverse array of tools for data cleansing
tasks, enabling users to manage missing values, outliers, and inconsistencies within datasets.
Imputation techniques, such as mean, median, or mode, can be applied to address missing data.
• Data Transformation: Users can execute various data transformations, encompassing scal-
ing, encoding categorical variables, binning, and deriving new features through mathematical
operations or domain-specific functions.
• Textual Analysis: KNIME provides nodes for textual analysis tasks, including tokenization,
stemming, lemmatization, and sentiment analysis, which are crucial for analyzing unstructured
textual data.

2. Data Integration:

• Connectivity Options: KNIME boasts extensive connectivity capabilities for integrating data
from diverse sources, such as databases (SQL, NoSQL), file systems (CSV, Excel), web ser-
vices (RESTful APIs), and cloud platforms (e.g., Amazon S3, Google Cloud Storage).


• Blending and Joining: Users can blend and join datasets based on common keys or criteria,
amalgamating data from disparate sources into a unified dataset. Different types of joins (e.g.,
inner, outer, left, right) are supported to accommodate diverse merging scenarios.

3. Data Exploration:

• Visual Exploration: KNIME’s interactive visualization tools enable users to explore data vi-
sually through histograms, scatter plots, box plots, and other graphical representations, facili-
tating comprehension of data distribution, patterns, and relationships.
• Statistical Analysis: KNIME provides nodes for generating descriptive statistics, including
measures of central tendency, dispersion, and correlation coefficients, thereby supporting hy-
pothesis testing and inferential statistics to draw conclusions from data samples.

4. Data Analysis:

• Predictive Modeling: KNIME encompasses a comprehensive library of machine learning algorithms
for constructing predictive models, spanning classification, regression, clustering,
and association rule mining. Users can experiment with various algorithms and model config-
urations to identify optimal models for their data.
• Text Mining and NLP: Specialized nodes for text mining tasks, such as document preprocess-
ing, feature extraction, and sentiment analysis, empower users to extract actionable insights
from unstructured textual data.
• Time Series Analysis: KNIME offers nodes for conducting time series analysis tasks, such as
trend analysis, seasonality detection, and forecasting, thereby enabling users to visualize time
series data and develop predictive models for future values.

5. Workflow Automation:

• Visual Workflow Design: KNIME’s visual workflow editor facilitates the creation and or-
chestration of complex data processing pipelines, enabling users to automate repetitive tasks
and foster collaboration among team members.
• Parameterization and Iteration: KNIME supports parameterization and iteration within work-
flows, enabling dynamic and iterative data processing pipelines. Users can define parameters
to control node settings and utilize loops to iterate over data subsets or execute repetitive tasks.

6. Scalability and Performance:

• Parallel Execution: KNIME supports parallel execution of workflows, distributing data pro-
cessing tasks across multiple CPU cores or nodes to enhance performance and scalability,
especially in scenarios involving large datasets or computationally intensive tasks.
• Integration with Big Data Platforms: Seamless integration with distributed computing frame-
works, such as Apache Spark and Hadoop, enables users to process and analyze massive
datasets in distributed environments, with nodes available for reading and writing data from/to
HDFS, Spark RDDs, and Hive tables.

7. Integration with External Systems:


• Custom Scripting: KNIME facilitates the incorporation of custom scripts written in languages
such as R, Python, SQL, or Java within workflows, extending KNIME’s functionality, leverag-
ing external libraries, or integrating with third-party systems.
• Database Connectivity: KNIME provides nodes for connecting to external databases and ex-
ecuting SQL queries within workflows, enabling data retrieval, transformation, and storage
directly from databases.
• Web Service Integration: KNIME supports integration with web services and RESTful APIs,
facilitating data retrieval from online sources and publication of analysis results to web end-
points, with nodes available for HTTP requests, JSON/XML parsing, and authentication han-
dling.
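
Two of the operations above, missing-value imputation and an inner join, can be sketched in a few lines of Python. The data is invented for the illustration; in KNIME these steps would be performed by nodes such as Missing Value and Joiner rather than by code:

```python
from statistics import median

# Cleansing (cf. Data Preparation): median imputation for a numeric column.
ages = [34, None, 29, 41, None, 38]
med = median(a for a in ages if a is not None)        # 36.0
ages = [a if a is not None else med for a in ages]

# Blending/joining (cf. Data Integration): an inner join on a common key.
customers = {1: "alice", 2: "bob", 3: "carol"}
orders = [(101, 1), (102, 3), (103, 9)]               # (order_id, customer_id)
joined = [(oid, cid, customers[cid])
          for oid, cid in orders if cid in customers]
print(joined)  # order 103 is dropped: no matching customer (inner-join semantics)
```

Switching the comprehension's filter to keep unmatched rows (with a placeholder name) would give left-join semantics, mirroring the join-type option in KNIME's Joiner node.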

Figure 4: KNIME’s data processing

2.3 Visualization
Visualization in KNIME is a crucial aspect of data analysis, aiding users in understanding and interpreting
their data more effectively. Here’s an overview of visualization capabilities within KNIME:

1. Interactive Visualizations: KNIME offers an extensive suite of interactive visualization nodes, en-
abling users to generate a diverse array of charts, plots, and graphs within their workflows. Users
can produce scatter plots, line graphs, bar charts, histograms, box plots, pie charts, heatmaps, and
more, tailored to their data characteristics and analytical objectives. These visualizations are inter-
active, permitting users to dynamically explore their data by manipulating zoom levels, panning,
and examining individual data points or categories. Users possess the ability to customize visual-
ization attributes such as color schemes, markers, line styles, axis labels, and legends to refine the
appearance of their visualizations.

2. Advanced Visualization Techniques: KNIME facilitates the utilization of sophisticated visualization
methodologies including 3D plots, parallel coordinate plots, treemaps, chord diagrams, Sankey
diagrams, and network visualizations. Users have the capability to craft intricate visualizations
aimed at uncovering intricate patterns, relationships, and structures inherent within their datasets,
fostering deeper insights and comprehension.

3. Data Exploration and Analysis Tools: KNIME’s visualizations are endowed with robust data ex-
ploration and analysis utilities, empowering users to interactively scrutinize their data. The imple-
mentation of brushing and linking functionality enables users to select data points within one visu-
alization and observe corresponding data highlighted across other visualizations, facilitating cross-
referencing and pattern identification. Interactive functionalities for data filtering, sorting, grouping,
and aggregation are available within visualizations, allowing users to focus on specific data subsets
or explore diverse facets of their datasets.

4. Integration with External Libraries and Tools: KNIME seamlessly integrates with external visu-
alization libraries and tools, affording users the opportunity to leverage supplementary visualization
capabilities beyond the native offerings. Users can integrate customized JavaScript, Python, or R
scripts to produce highly tailored and interactive visualizations utilizing libraries such as Plotly,
Matplotlib, D3.js, Google Charts, and more. This adaptability empowers users to develop advanced
visualizations tailored to their unique requirements, incorporating intricate interactive features and
visual components.

5. Dashboarding and Reporting: KNIME facilitates the creation of interactive dashboards and re-
ports amalgamating multiple visualizations, analytical outputs, and textual content into cohesive
presentations. Users can design interactive dashboards equipped with drill-down functionalities, fil-
ters, sliders, dropdown menus, checkboxes, and other interactive elements to enhance user engage-
ment and exploration. The reporting capabilities of KNIME enable the generation of professionally
formatted reports integrating formatted text, images, tables, and visualizations, ideal for disseminat-
ing insights to stakeholders or inclusion in presentations and publications.

6. Publication-Quality Output: KNIME empowers users to export visualizations in various formats
suitable for dissemination and presentation purposes, including PNG, PDF, SVG, and HTML. Users
can tailor output parameters such as resolution, dimensions, fonts, and other attributes to ensure the
delivery of high-quality and visually appealing outputs, aligning with the requirements of diverse
publishing platforms and document formats.

7. Workflow Integration and Automation: Visualizations within KNIME seamlessly integrate with
the visual workflow environment, enabling users to embed data visualization tasks within their ana-
lytical workflows alongside other data processing and analysis activities. Users can design sophisti-
cated workflows automating data visualization tasks, facilitating reproducible and scalable analytical
workflows that are conducive to reuse, sharing, and adaptation across various datasets and analytical
scenarios.
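
Underneath any histogram node lies a simple binning step. The standard-library sketch below (toy values, an arbitrarily chosen bin width) shows the computation such a visualization is built on, rendered here as a crude text bar chart:

```python
from collections import Counter

values = [1.2, 1.9, 2.3, 2.8, 3.1, 3.3, 3.4, 4.0, 4.7, 4.9]
width = 1.0  # bin width: a free parameter, just as in a histogram node's dialog

# Assign each value to a bin index and count occupancy per bin.
bins = Counter(int(v // width) for v in values)

for b in sorted(bins):
    lo, hi = b * width, (b + 1) * width
    print(f"[{lo:.1f}, {hi:.1f}): {'#' * bins[b]}")
```

A graphical histogram replaces the `#` runs with rectangles, and interactive features such as brushing simply map a selected bar back to the rows whose values fell into that bin.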


Figure 5: KNIME’s visualization

2.4 Data Analytics And Building Model


Data analytics and model building in KNIME involve a series of steps aimed at extracting insights from
data and constructing predictive models. Here’s an overview of these processes:

1. Data Preparation: Data preprocessing encompasses tasks such as rectifying missing data, identi-
fying outliers, and addressing inconsistencies through techniques like imputation, outlier detection,
and data validation. Data transformation involves standardizing features, encoding categorical vari-
ables, and generating new features via methods like feature scaling, one-hot encoding, and feature
engineering.
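As a hedged illustration of these preparation steps outside KNIME, the pandas sketch below performs mean imputation, one-hot encoding, and min-max scaling on an invented toy frame (column names are arbitrary):

```python
import pandas as pd

# Toy frame with a missing value and a categorical column (illustrative only).
df = pd.DataFrame({
    "weight": [9.3, None, 17.5, 19.2],
    "outlet": ["Tier 1", "Tier 2", "Tier 1", "Tier 3"],
})

# Imputation: fill the missing weight with the column mean.
df["weight"] = df["weight"].fillna(df["weight"].mean())

# Encoding: one-hot encode the categorical variable.
df = pd.get_dummies(df, columns=["outlet"])

# Feature scaling: min-max normalize the numeric column to [0, 1].
w = df["weight"]
df["weight"] = (w - w.min()) / (w.max() - w.min())
```

In a KNIME workflow the same steps roughly correspond to nodes such as Missing Value, One to Many, and Normalizer.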

2. Exploratory Data Analysis (EDA): Descriptive statistics furnish insights into data distribution,
centrality, and dispersion, including measures such as mean, median, standard deviation, skewness,
and kurtosis. Visualizations such as histograms, scatter plots, and box plots are employed to scruti-
nize relationships, patterns, and anomalies within the dataset. Correlation analysis serves to elucidate
the strength and direction of associations among variables, guiding subsequent feature selection and
model development.
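A minimal EDA sketch in pandas, using an invented five-row frame, computes the same summary statistics and correlation described above:

```python
import pandas as pd

# Invented numeric data standing in for two dataset columns.
df = pd.DataFrame({
    "visibility": [0.02, 0.05, 0.07, 0.10, 0.16],
    "sales": [120.0, 340.0, 410.0, 560.0, 900.0],
})

# Descriptive statistics: count, mean, std, quartiles, min/max.
summary = df.describe()

# Higher moments often reported in EDA.
skewness = df["sales"].skew()
kurtosis = df["sales"].kurt()

# Pearson correlation matrix to gauge strength and direction of association.
corr = df.corr()
print(corr.loc["visibility", "sales"])
```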

3. Feature Selection and Engineering: Feature selection techniques encompass filter, wrapper, and embedded methods aimed at identifying the most salient predictors. Feature engineering entails the creation of novel features from existing ones via mathematical transformations, domain expertise, or automated approaches like principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).
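As a sketch of the automated approach mentioned above, the scikit-learn snippet below applies PCA to synthetic data in which two features are nearly collinear, so two components retain almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic matrix: x2 is nearly collinear with x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 * 2.0 + rng.normal(scale=0.01, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# PCA re-expresses the data along orthogonal directions of maximal variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Most variance survives in two components because x1 and x2 are redundant.
print(pca.explained_variance_ratio_.sum())
```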

4. Model Building: Model selection hinges on the problem type (classification, regression, clustering)
and data characteristics (linear, non-linear, imbalanced). KNIME offers a diverse array of machine
learning algorithms, including decision trees, random forests, support vector machines (SVM), k-
nearest neighbors (KNN), gradient boosting, and deep learning models. Models are trained using su-
pervised learning with labeled data and unsupervised learning for tasks like clustering and anomaly
detection.
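A minimal supervised model-building sketch with scikit-learn, with synthetic data standing in for a prepared feature table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic labeled data (illustrative substitute for a real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Supervised learning: fit a random forest on the training split.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # mean accuracy on held-out data
```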

5. Model Evaluation and Validation: Model assessment entails the use of performance metrics such
as accuracy, precision, recall, F1-score, ROC curve, and AUC. Cross-validation techniques like k-
fold, stratified, and leave-one-out cross-validation are employed to validate models and estimate their
generalization capacity. Hyperparameter optimization methods such as grid search, random search,
and Bayesian optimization are utilized to fine-tune model performance and mitigate overfitting.
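The evaluation and tuning ideas above can be sketched with scikit-learn's cross-validation and grid-search utilities (again on synthetic data; the parameter grid is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# k-fold cross-validation estimates generalization performance.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# Grid search tunes hyperparameters against cross-validated accuracy.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```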

6. Deployment and Operationalization: Upon obtaining a satisfactory model, deployment entails ex-
porting models as PMML or deploying them as web services to production environments. Models
are integrated into existing systems to facilitate real-time predictions or automate decision-making
processes. Continuous model monitoring and updating ensure sustained performance and adaptabil-
ity to evolving data and business requirements.
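KNIME's own route is PMML export or deployment as web services; as a rough Python analogue of persisting a trained model for later serving, the sketch below uses joblib (an illustrative substitute, not KNIME's mechanism):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the trained model to disk, then reload it as a "deployed" artifact.
joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")

# The restored model serves predictions exactly like the original.
assert (restored.predict(X) == model.predict(X)).all()
```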

7. Iterative Process: Data analytics and model building in KNIME constitute iterative endeavors in-
volving ongoing refinement, performance monitoring, and adaptation to changing circumstances.
Iterative experimentation with diverse algorithms, features, and hyperparameters fosters model en-
hancement and optimization, enabling continuous learning and alignment with evolving data dy-
namics and organizational needs.


Figure 6: KNIME's data analytics and model building

By adhering to these meticulous steps within KNIME, practitioners can effectively leverage data assets to
derive actionable insights, construct robust predictive models, and foster evidence-based decision-making
across diverse domains and sectors.

2.5 Reporting
Reporting in KNIME involves the generation of informative and visually appealing reports to communicate
data-driven insights and analysis results. Here’s an overview of reporting capabilities within KNIME:

1. Report Composition: KNIME provides a versatile framework for assembling reports, enabling
users to arrange and tailor report elements to suit their specific needs and preferences. Users have the
flexibility to create multi-section reports with distinct layouts, allowing for organized and structured
presentation of information. The option to save and reuse report templates promotes consistency in
formatting across various reports.

2. Visualizations in Reports: KNIME offers an extensive array of visualization choices for incorporation into reports, spanning from fundamental charts (e.g., bar charts, line plots) to more sophisticated visual representations (e.g., heatmaps, Sankey diagrams). Visualizations can be fine-tuned with diverse attributes such as color schemes, markers, labels, and axes configurations, thereby enhancing interpretability and clarity. Interactive features embedded within visualizations empower users to engage directly with the data within the report, fostering exploration and analysis.

3. Text and Annotations: Users have the capacity to integrate descriptive narratives, annotations, and
interpretive commentary within reports to furnish context, elucidate findings, and provide insights
into the data and analytical outcomes. Textual content encompasses various formatting options, in-
cluding headings, paragraphs, bullet points, and stylized text styles (e.g., bold, italic, underline), en-
hancing readability and comprehension. Annotations serve to underscore salient discoveries, trends,
or irregularities within the data, emphasizing noteworthy insights for further consideration.

4. Tables and Data Summaries: KNIME facilitates the inclusion of tabular data and concise data
summaries within reports, offering detailed presentations of datasets, statistical synopses, and ag-
gregated outcomes. Tables can be tailored with functionalities such as column sorting, filtering,
grouping, and subtotaling, fostering data exploration and examination. Data summaries, comprising
statistical metrics like mean, median, standard deviation, and percentiles, afford succinct represen-
tations of the data’s distribution and characteristics.

5. Dynamic and Interactive Elements: Reports can integrate dynamic and interactive components to
heighten user engagement and interactivity. Interactive chart functionalities empower users to delve
into specific data points, dynamically filter datasets, and modify visualization parameters (e.g., axes,
scales, categories) to refine their analytical focus. Interactive elements such as dropdown menus,
sliders, and checkboxes enable users to personalize their viewing experience and concentrate on
specific facets of the analysis.

6. Export and Sharing: Upon completion, reports can be exported in diverse formats, encompassing
PDF, HTML, Excel, PowerPoint, and image formats (e.g., PNG, JPEG). Exported reports retain the
interactive features and formatting established within KNIME, ensuring consistency and fidelity
across disparate viewing platforms. Reports can be disseminated to stakeholders and collaborators
via various channels, including email, file-sharing platforms, or integration into presentations and
documents for broader distribution.

7. Automation and Scheduling: KNIME supports automated report generation and scheduling, affording users the capability to schedule reports for generation at predefined intervals (e.g., daily, weekly, monthly) or triggered by specific events. Automated reporting workflows streamline the process of data retrieval, analysis, report generation, and distribution, minimizing manual intervention and guaranteeing timely delivery of actionable insights.


Figure 7: KNIME's reporting

Through the utilization of these advanced reporting functionalities within KNIME, users can construct
sophisticated and informative reports that effectively communicate data-driven insights and analytical
findings to stakeholders and decision-makers. The flexibility, interactivity, and automation capabilities
provided by KNIME empower users to streamline the reporting process and deliver actionable insights
efficiently.

2.6 KNIME Node


KNIME nodes are fundamental building blocks within the KNIME Analytics Platform, serving various
functions to enable data processing, analysis, and visualization. Here’s an overview of KNIME nodes and
their roles:


Figure 8: KNIME's nodes

1. Reader Nodes: Reader nodes serve as data ingestion mechanisms, facilitating the importation of
data from various sources into the KNIME workflow. They support the retrieval of data from file
formats (e.g., CSV, Excel, JSON), relational databases, web services, and application programming
interfaces (APIs).

2. Data Manipulation Nodes: Data manipulation nodes encompass a diverse array of operations
aimed at preprocessing and cleansing data for subsequent analysis. These operations include data
filtering, attribute selection, data sorting, data merging, data splitting, data aggregation, and data
type transformation.

3. Analytics Nodes: Analytics nodes constitute a comprehensive suite of machine learning and statisti-
cal algorithms tailored for data analysis tasks. This repertoire encompasses classification algorithms
(e.g., decision trees, random forests, support vector machines), regression algorithms (e.g., linear
regression, logistic regression), clustering algorithms (e.g., k-means, hierarchical clustering), asso-
ciation rule mining algorithms, and text mining algorithms.

4. Visualization Nodes: Visualization nodes facilitate the creation of graphical representations of data
within the KNIME environment. These nodes offer functionalities for generating diverse visual-
ization types, including scatter plots, bar charts, line plots, histograms, network graphs, heatmaps,
treemaps, and interactive visualizations.

5. Workflow Control Nodes: Workflow control nodes govern the data flow and execution logic within
the workflow. They provide mechanisms for constructing loops, conditional branches, error handling
routines, and other control structures to orchestrate the sequence of node execution and manage
exceptional scenarios.

6. Data Mining Nodes: Data mining nodes encompass advanced algorithms for uncovering patterns,
anomalies, and predictive models within datasets. These algorithms include techniques for frequent
itemset mining, sequence mining, time series analysis, ensemble modeling, and other exploratory
data analysis tasks.

7. Utility Nodes: Utility nodes furnish supplementary functionalities for data manipulation and work-
flow management. They encompass operations such as data sampling, partitioning, formatting, and
importing/exporting data in various formats to facilitate data preparation and workflow orchestra-
tion.


8. External Tool Integration Nodes: External tool integration nodes facilitate seamless interoperabil-
ity with external tools, libraries, and platforms from within the KNIME environment. These nodes
enable the execution of Python scripts, R scripts, SQL queries, Hadoop jobs, Spark jobs, and other
external processes to leverage additional computational resources and functionalities.

9. Database Nodes: Database nodes enable interactions with relational databases for data retrieval,
manipulation, and querying tasks. They encompass functionalities for establishing database con-
nections, executing SQL queries, and fetching data into the KNIME environment for subsequent
analysis.

10. Workflow Management Nodes: Workflow management nodes provide utilities for organizing, doc-
umenting, and versioning workflows to enhance reproducibility and collaboration. These nodes sup-
port functionalities such as adding annotations, comments, and descriptions to workflows, as well
as version control and documentation features for workflow governance.

11. Metanodes and Sub-Workflows: Metanodes and sub-workflows facilitate the encapsulation of
groups of nodes into reusable components, promoting modularity and efficiency in workflow design.
These constructs enable the creation of custom nodes and workflows for specific tasks, enhancing
workflow scalability and maintainability.

12. Executor Nodes: Executor nodes enable the distributed execution of workflows on remote servers
or cloud platforms to enhance computational scalability and performance. These nodes facilitate
the deployment of workflows on KNIME Server, cloud infrastructure (e.g., AWS, GCP, Azure), or
other distributed computing environments for efficient resource utilization and parallel processing
capabilities.

Figure 9: Node states

In the KNIME Analytics Platform [3], each node carries a status indicator represented by a traffic-light icon. When a node is first added to the workflow editor, its status is "not configured", indicated by a red traffic light. Once configured, the status changes to "configured", denoted by a yellow traffic light.


Nodes can be configured by accessing their configuration dialog through various methods such as double-
clicking the node, clicking the Configure button, right-clicking the node and selecting Configure, or press-
ing F6. Executing a node is achieved by clicking the Execute button, right-clicking the node and selecting
Execute, or pressing F7. Successful execution results in the node’s status changing to "executed" repre-
sented by a green traffic light. Failure to execute will display an error sign, prompting necessary adjust-
ments to the node settings and inputs.
Cancellation of node execution can be done by clicking the Cancel button, right-clicking the node and
selecting Cancel, or pressing F9. Resetting a node to its default settings can be accomplished by clicking
the Reset button, right-clicking the node and selecting Reset, or pressing F8. Resetting a node also resets
subsequent nodes in the workflow, changing their status from "executed" to "configured" and clearing their
outputs.
Nodes may have multiple input and output ports, facilitating data flow within the workflow. Input ports
consume data from predecessor nodes, while output ports provide data to successor nodes. The types
of inputs and outputs are visually distinguished, and tooltips provide explanations. Mandatory inputs are
indicated by filled input ports, while optional inputs can be empty.

Figure 10: Node ports

These nodes constitute the foundational elements of KNIME workflows, providing users with a versa-
tile toolkit for constructing and executing complex data analytics pipelines for diverse applications and
domains. Through the strategic combination and configuration of nodes within workflows, users can con-
duct comprehensive data processing, analysis, and visualization tasks to derive actionable insights and
support evidence-based decision-making processes.


3 SOLVE REAL PROBLEM: BIGMART SALE DATA

The workflow proceeds through the stages shown in Figures 11-17: reading the data, handling missing values, visualizing the numeric and categorical attributes, post-visualization processing, conversion to numeric form with standardization, and model building.

Figure 11: Reading the data

Figure 12: Missing-value processing

Figure 13: Visualizing numeric data

Figure 14: Visualizing categorical data

Figure 15: Processing data after visualization

Figure 16: Converting to numeric data and standardization

Figure 17: Building the model
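A condensed pandas/scikit-learn sketch may make the KNIME workflow concrete: impute missing values, encode and standardize features, then fit a regression model. The frame below is a small invented stand-in with BigMart-style column names; all values are illustrative only:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in rows with BigMart-style columns; the real workflow reads the CSV.
df = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, 19.2, 8.9, 10.4, None, 13.6],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 51.4, 57.7, 107.8],
    "Outlet_Size": ["Medium", "Medium", "Small", None,
                    "High", "Small", "Medium", "Small"],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4,
                          994.7, 556.6, 343.6, 4022.8],
})

# Impute missing values: mean for numeric, mode for categorical.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

# Convert categories to numbers and standardize the numeric features.
df = pd.get_dummies(df, columns=["Outlet_Size"])
for col in ("Item_Weight", "Item_MRP"):
    df[col] = (df[col] - df[col].mean()) / df[col].std()

# Train a regression model on the prepared table.
X = df.drop(columns="Item_Outlet_Sales")
y = df["Item_Outlet_Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
```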

4 KNIME, RAPID MINER, AND POWER BI


Below is a comparison table highlighting key features and capabilities of KNIME, RapidMiner, and Power
BI [2][1][4][7][10]:

1. Data Integration: KNIME: Provides comprehensive support for diverse data sources and formats,
facilitating seamless integration through a variety of connectors. RapidMiner: Offers connectors for
various data sources, ensuring versatile integration capabilities. Power BI: Integrates with a broad
spectrum of data sources, including databases and cloud services, through connectors and APIs.

2. Data Preparation: KNIME: Equips users with a comprehensive suite of data manipulation tools
within a visual workflow environment. RapidMiner: Empowers users with data cleansing and trans-
formation functionalities via an intuitive visual interface. Power BI: Facilitates data shaping and
cleansing operations for enhanced data preparation.

3. Machine Learning: KNIME: Furnishes a diverse library of machine learning algorithms and supports integration with popular ML libraries. RapidMiner: Encompasses a broad range of machine learning algorithms for predictive analytics. Power BI: Offers built-in machine learning models and algorithms for predictive analytics.


Feature | KNIME | RapidMiner | Power BI
Data Integration | Supports diverse data sources and formats | Offers connectors for various data sources | Integrates with a wide range of data sources
Data Preparation | Provides extensive data manipulation tools | Offers data cleansing and transformation | Includes data shaping and cleansing tools
Machine Learning | Offers a rich library of ML algorithms | Includes a wide range of ML algorithms | Provides built-in ML models and algorithms
Visual Workflow | Features a visual and modular workflow design | Employs a visual drag-and-drop interface | Offers a visual interface for workflow creation
Visualization | Provides interactive visualization options | Offers basic visualization capabilities | Includes robust visualization capabilities
Scalability | Scalable for both small and large datasets | Suitable for scalable data analysis tasks | Scales well for enterprise-level deployments
Community Support | Large and active community of users and contributors | Active community and online resources | Strong user community and support resources
Enterprise Features | Offers enterprise-grade features and support | Provides enterprise-level features and support | Includes enterprise-ready features and support
Cost | Open-source with commercial support options | Offers free and paid versions with support options | Available as part of Office 365 subscription

Table 1: Comparison of KNIME, RapidMiner, and Power BI

4. Visual Workflow: KNIME: Features a modular workflow design, facilitating the creation of com-
plex data analysis pipelines. RapidMiner: Utilizes a visual drag-and-drop interface for designing
data analysis workflows. Power BI: Employs a visual interface for creating interactive dashboards
and reports.

5. Visualization: KNIME: Provides interactive visualization options for comprehensive data explo-
ration and analysis. RapidMiner: Offers basic visualization capabilities for data analysis and inter-
pretation. Power BI: Includes robust visualization capabilities for creating interactive reports and
dashboards.

6. Scalability: KNIME: Scalable for handling both small and large datasets, with support for dis-
tributed computing. RapidMiner: Suitable for scalable data analysis tasks with options for distributed
data processing. Power BI: Scales well for enterprise-level deployments and large datasets.

7. Community Support: KNIME: Benefits from a large and active community, providing extensive
documentation and online resources. RapidMiner: Enjoys an active community and offers online
resources for user support and collaboration. Power BI: Benefits from a strong user community and
offers comprehensive documentation and support resources.

8. Enterprise Features: KNIME: Offers enterprise-grade features, including security, workflow automation, and collaboration tools. RapidMiner: Provides enterprise-level features such as deployment options and role-based access control. Power BI: Includes enterprise-ready features like data governance and security capabilities.
9. Cost: KNIME: Open-source with commercial support options, offering free access to the platform
with additional features through subscription plans. RapidMiner: Offers free and paid versions with
flexible pricing plans based on usage and deployment requirements. Power BI: Available as part of
the Office 365 subscription with different pricing tiers for individual users and enterprises.
This comprehensive comparison elucidates the strengths and capabilities of KNIME, RapidMiner, and
Power BI, facilitating informed decision-making for users.

In conclusion, the comparison reveals that KNIME, RapidMiner, and Power BI each offer unique strengths
and capabilities across various aspects of data analytics and visualization. KNIME stands out for its com-
prehensive support for diverse data sources, extensive data manipulation tools, and rich library of machine
learning algorithms. RapidMiner excels in providing a user-friendly visual interface for designing data
analysis workflows and offers a broad range of machine learning algorithms. Power BI distinguishes itself
with its robust visualization capabilities, seamless integration with various data sources, and enterprise-
ready features such as data governance and security.
While KNIME, RapidMiner, and Power BI cater to different user requirements and preferences, they all
contribute to advancing data-driven decision-making and insights generation. Users can leverage these
platforms based on their specific needs, whether it’s for exploratory data analysis, predictive modeling,
or creating interactive reports and dashboards. Ultimately, the choice between KNIME, RapidMiner, and
Power BI depends on factors such as the complexity of data analysis tasks, scalability requirements, com-
munity support, and budget considerations.

5 TASK ASSIGNMENT AND EVALUATION

Task Performer | Task Assignment | Completion
Đào Tiến Hưng | Section 1: research plan; INTRODUCTION TO KNIME ANALYTICS PLATFORM presentation and slides; Introduction report | 100%
Nguyễn Thành Luân | Section 3: research plan; DEMO USING KNIME WITH BIGMART SALE DATA presentation and slides; BigMart sale data demo report | 100%
Nguyễn Hữu Khánh | Section 4: research plan; ANALYZE AND COMPARE presentation and slides; KNIME, RapidMiner, and Power BI report | 100%
Hà Huy Hoàng | Section 2: research plan; KNIME'S MAIN FEATURES presentation and slides; Main features of KNIME report | 100%

Table 2: Task assignment and evaluation


References

[1] KNIME Analytics Platform User Review. https://www.trustradius.com/reviews/knime-analytics-platform-2023-09-24-02-50-48, 2023. Accessed 1 Apr. 2024.

[2] Compare KNIME vs RapidMiner. https://www.peerspot.com/products/comparisons/knime_vs_rapidminer, n.d. Accessed 26 Mar. 2024.

[3] Getting Started Guide. https://www.knime.com/getting-started-guide, n.d. Accessed 21 Mar. 2024.

[4] KNIME Community Hub. https://hub.knime.com/knime/spaces/Examples/00_Components/Data%20Manipulation~hjR-MWGEtsyRJgCV/, n.d. Accessed 1 Apr. 2024.

[5] KNIME Documentation. https://docs.knime.com/?pk_vid=1de43c59d1af5b0217122678235d49eb, n.d. Accessed 21 Mar. 2024.

[6] Davide Ganzaroli. The Best kept Secret in Data Science is KNIME. Low Code for Data Science, 2023. Accessed 26 Mar. 2024.

[7] Ana Gelevska. KNIME vs RapidMiner: Which Solution Is Better for Your Business? Redfield AI, 2023. Accessed 31 Mar. 2024.

[8] Daniel Ihrmark and Juuso Tyrkkö. Learning text analytics without coding? An introduction to KNIME. Education for Information, 39(2):121–137, 2023.

[9] Ángel M. Laguna. Why should you learn KNIME? Low Code for Data Science, 2023.

[10] Venkateswarlu Pynam, R Roje Spanadna, and Kolli Srikanth. An extensive study of data analysis tools (Rapid Miner, Weka, R tool, KNIME, Orange). Int. J. Comput. Sci. Eng., 5(9):4–11, 2018.

[11] S. Shah. Download Diagram – The Building Blocks of Analytics | SCALE 123. https://www.scale123.com/analytics-architecture-diagram-property-management/, n.d. Accessed 4 Apr. 2024.

[12] Ruochen Wang. Want to do Data Analysis without coding? Use KNIME! SFU Professional Computer Science, 2020.
