
CHAPTER 3 MATERIALS AND METHODS

3.1 Overview
This chapter explains the strategies adopted to obtain the required outcomes and the techniques used to address each of the research problems. It also describes the materials used throughout the process. The data collection method and the dataset are explained in detail. The machine learning algorithms (MLA) and their techniques are discussed, namely Support Vector Machine (SVM), Multilayer Perceptron (MLP), K-Nearest Neighbor (KNN) and Random Forest (RF). The machine learning (ML) tools used throughout the procedure, Weka and RapidMiner, are also described in this section with theoretical and graphical detail.
3.2 Data
Data is used in scientific research, finance, business management, governance and practically every kind of human organization and activity. Data is reported, measured and collected, and can be viewed in the form of images, graphs and similar representations. Before it has been collected and cleaned by researchers, the gathered numbers and figures are known as raw or unprocessed data. Raw data needs to be cleaned and corrected to remove data-entry errors and outliers. Data processing commonly happens in phases, and the processed data of one stage can be treated as the raw data of the next stage. Field data is raw data gathered in an uncontrolled environment. Data has been described as the new fuel of the digital economy (Cresci et al., 2015).
At present, data is growing rapidly day by day and is examined from many different viewpoints. The volume of data now exceeds what can easily be measured or controlled according to client requirements, and technology is contributing as much as possible in this space. Experts and scientists contribute models to organize the available information, while many practitioners propose models for business improvement; these models help customers analyze the behaviour captured in the data. Specialists apply many procedures to measure trends and forecast from the data, and data retrieval is used to separate the significant information from a dataset. Data is material that describes something: measurements, reports and observations. Metadata, in turn, is a description of the data itself.
3.3 Types of Data
There are three types of data, which are given below.

 Structured Data
 Semi-Structured Data
 Unstructured Data

3.3.1 Structured Data


Structured data is usually organized as quantitative data, and it is the type of data most people are used to working with. Think of data that fits neatly into the fixed fields and columns found in relational databases and spreadsheets. Examples of structured data include names, dates, addresses, credit card numbers, stock information, geolocation, and much more. Structured data is highly organized and easily understood by machine language. Those working with relational databases can generally input, search and manipulate structured data rapidly; this is one of its most attractive features. The programming language used for managing structured data is called Structured Query Language, otherwise known as SQL. This language was created by IBM in the early 1970s and is particularly helpful for handling relationships in databases. Structured data resembles the paper-based record-keeping systems that organizations depended on for business intelligence decades ago. While structured data is still valuable, more and more organizations are looking to exploit unstructured data for future opportunities (Laorden et al., 2014).
3.3.2 Semi-Structured Data
Semi-structured data is data that does not conform to a formal data model but nevertheless has some structure. It does not have a fixed or rigid schema. The data does not reside in a relational database, yet it has some organizational properties that make it easier to analyze, and with some processing it can be stored in a relational database. Semi-structured data does not have the same level of organization and regularity as structured data: it does not reside in fixed fields or records, but it contains elements that can separate the data into different hierarchies. Semi-structured data is neither raw data nor data stored in a conventional database system; it is organized to some degree, but not according to a formal model such as a table or an object-based graph. A great deal of the information on the Internet can be described as semi-structured. This kind of data does not fit the formal structure of the data models associated with relational databases or other data tables, yet it contains tags or other markers that separate semantic elements and enforce hierarchies of records and fields within the data. For this reason it is also called a self-describing structure. JSON and XML are typical examples of semi-structured data (Alsaleh et al., 2014).
3.3.3 Unstructured Data
Unstructured data is data that either does not have a predefined data model or is not organized in a predefined manner. Unstructured data is typically text-heavy, but it may also contain information such as dates, figures and facts. This results in irregularities and ambiguities that make it hard to interpret using traditional programs, compared with data stored in structured databases. Examples of unstructured data include audio, video files and No-SQL databases. Unstructured data represents around 80% of all data and frequently includes text, video and audio content. Examples include e-mails, word-processing documents, recordings, photographs, audio files, presentations, web pages and many other kinds of business documents. Note that while these kinds of files may have an internal structure, they are still regarded as "unstructured" because the data they contain does not fit neatly into a database. Unstructured data is everywhere; in fact, most organizations and individuals conduct their activities around unstructured information. As with structured data, unstructured data is either machine-generated or human-generated (Asdaghi and Soleimani, 2019).
Some examples of machine-generated unstructured data are given below.
Satellite photographs: This includes weather data or the imagery that governments capture through satellite surveillance. Think of Google Earth, and you get the picture.
Scientific data: This includes seismic imagery, atmospheric data and high-energy physics data.
Photos and video: This includes security, surveillance and traffic video.
Radar data: This includes vehicular, meteorological and oceanographic seismic profiles.
Some examples of human-generated unstructured data are given below.
Social media data: This data is generated on social media platforms such as YouTube, Facebook, Twitter, LinkedIn and Flickr.
Mobile data: This includes information such as text messages and location data.
Website content: This comes from any site delivering unstructured content, such as YouTube, Flickr or Instagram.
Content internal to your company: Think of all the content in internal documents, logs, survey results and communications. Enterprise content actually represents a huge share of the text data in the world today.
3.4 Big Data
Big data is data of very large size. The term is used to describe large volumes of structured, semi-structured and unstructured data. Organizations accumulate big data in their systems for better analysis and increased profitability. Big data can be used to refine marketing strategy and increase customer engagement with the help of better insights into customers. Big data is the discipline that deals with ways to analyze, systematically extract information from, or otherwise handle data sets that are too big and complex to be processed by traditional data-processing software. Data with many rows offers greater statistical power, whereas data with higher complexity can lead to a higher false-discovery rate. The challenges of big data include data sourcing, information privacy, data storage, data capture, data analysis, sharing, search, transfer, querying, visualization and updating. Three main concepts were originally associated with big data: variety, velocity and volume (Yang, 2019).
When handling big data, we may simply track and observe what happens rather than sample. Thus big data often includes data whose size exceeds what traditional software can process at an acceptable cost and time. Data sets are growing quickly, partly because data is increasingly collected by cheap and numerous information-sensing IoT devices such as cameras, mobile devices, microphones, wireless sensor networks and remote-sensing devices. Since 2012, about 2.5 exabytes of data have been produced every day, and IDC predicts that there will be about 163 zettabytes of data by 2025. One of the main questions for big enterprises is who should own the big-data initiatives that affect the whole organization. Software packages used for data visualization, desktop statistics and relational database management systems often have problems handling big data, and the work may instead require massively parallel software running on many servers. What qualifies as big data depends on the tools and capabilities of the users, and expanding capabilities make big data a moving target. Facing hundreds of gigabytes of data may be enough to make some organizations reconsider their data-management choices (Fürnkranz et al., 2012).
3.4.1 Six V’s of Big Data
There are six V’s of Big Data, which are given below.

3.4.1.1 Volume
Volume refers to the capacity to store huge data banks. Traditional business intelligence (BI) solutions contain a typical and steady volume of information, reaching storage sizes no larger than gigabytes. As the need to incorporate new emerging sources arises, the amount of data grows at a rapid pace, and the data warehouse must be able to support the storage and processing of such data for further analysis. There are many emerging data sources that produce a great deal of data in a short time and clearly exceed the basic storage capacities of traditional BI solutions. New emerging data sources in BI include social networks, motion sensors, system sensors, web pages, blogs, applications and georeferencing, among others. Many other social media platforms are used by people all over the world, such as Instagram, YouTube, Snapchat, Twitter and LinkedIn, and all of them generate a huge amount of data. Data has increased enormously in recent years, and about 90% of it was created in just the last couple of years. It was estimated that by 2020 the data volume would be up to 50 times larger than in 2011 (Pence, 2014).
3.4.1.2 Variety
Data warehouses currently hold structured information, data defined for customers, products and other entities, whose purpose is to make it easy to integrate new, readily adapted sources. With the new sources now available, however, we begin to encounter kinds of data we did not think possible before, including images or photographs, video, text, XML, JSON, key-value data, audio, sensor signals, time series, blogs, HTML and even human genome data. The transactional databases used in a data warehouse could well store these sorts of data, but they would not be of great help, since they are not optimized for them and would not allow significant information to be extracted. The storage technologies currently in use do not have the capacity or the flexibility to host these kinds of data, so it is necessary to consider a database that provides flexibility and diversity in this respect. Nowadays data is generated in many different forms and in huge quantities, and most of it is unstructured data such as images, audio and video. In fact, about 80% of the data generated today is unstructured, spanning categories such as social media posts, status updates, images, likes and comments on images, and many other forms. The latest big data technologies are able to harvest, store and use structured and unstructured data at the same time (B and Chattopadhyay, 2017).
3.4.1.3 Value
Having included new data sources, considered the use of new technologies and generated value by adding new metrics and KPIs to the traditional BI platform, it is worth thinking about how to exploit this information and produce even more benefit from it. This certainly means using techniques, algorithms and optimizations that give more weight to the information in decision-making, for example predicting the behaviour of customers, identifying the exact moment to launch a new product, or detecting transactional fraud. This is possible if we have people or tools that help the organization discover what it does not know, obtain predictive insights, deliver relevant data stories and create much more confidence in decisions made from the data; in particular, we are talking about people with the profile of data scientists. At present, data is one of the most significant assets for all kinds of organizations, government or private. Organizations use data to make decisions and for future planning, which increases its importance, because any bad decision can have a negative impact on the organization. As a result, the worth of data has increased exponentially (Faroukhi et al., 2020).
3.4.1.4 Visibility
So far, all of the V's complement one another: a huge database that provides us with reliable, varied, up-to-date data and also generates great value for us. It is likewise necessary to start having visualization tools that offer an easy way to read the new analyses, which may well be statistical and which would otherwise strain the reporting tools we already have (Luna-Romera et al., 2018).
3.4.1.5 Variability
In the context of big data, variability refers to a few different things. One is the number of inconsistencies in the data; these have to be found by anomaly and outlier detection techniques before any meaningful analysis can take place. Big data is also variable because of the huge number of data dimensions resulting from multiple disparate data types and sources. Variability can likewise refer to the inconsistent speed at which big data is loaded into a database (Pendyala, 2018).
3.4.1.6 Validity
Like veracity, validity refers to how accurate and correct the data is for its intended use. According to Forbes, an estimated 60 percent of a data scientist's time is spent cleaning data before being able to do any analysis. The benefit of big data analytics is only as good as its underlying data, so good data governance practices are needed to ensure consistent data quality (Luna-Romera et al., 2018).
3.5 Data Collection
In an age where data is the most important thing in every field of life, one of the major concerns is gathering data from other sources. Data collection plays a very important role in the field of data mining. Data collection is the process of gathering facts and figures on variables of interest; without proper structure and planning it yields only random facts and figures. As the focus is on social media, there are two basic ways to collect public data from user profiles: the primary data collection method and the secondary data collection method.
3.5.1 Collection Approach
Primary data is first-hand data gathered by the researcher through direct effort and experience. In this study, the primary data collection method is adopted. Primary data is real-time data, whereas secondary data is data collected by other researchers or organizations in the past. The sources of primary data are observations, manual collection, experiments, surveys, questionnaires and personal interviews. The primary data collection method is an expensive way to collect data and requires more time; the secondary data collection method, on the other hand, is economical and requires little time. However, primary data is specific to the researcher's needs and is more accurate than secondary data.
To collect data, we use the primary data collection technique. Manual data collection is time-consuming, but it gives better results because the researcher collects exactly the attributes that are required. In this technique, we visit user profiles one by one on the Facebook social media platform and collect their public data; Microsoft Excel is the tool used to save the data. First of all, we visit the user profile and observe the number of friends, because this attribute is very important for data processing (Stringhini et al., 2010).
3.6 Dataset
Figure 3.1 shows the view of the dataset in the Excel sheet. The details of our dataset are as follows (an illustrative CSV header built from these attributes is shown after the list):
 No. of Friends: the first attribute observed on the Facebook user profile; it is very important in the data labeling section.
 No. of Profile Pics: also an important attribute of the Facebook user profile, used in the data labeling section; it is observed in the Photos section of the user profile.
 Profile Pic Likes: the third important attribute used in the data labeling section; the number of likes on the user's profile pictures.
 Profile Pic Comments: the fourth important attribute used in the data labeling section.
 Profile Pic Address: also collected from the user profile.
 Name: an attribute of the user profile that is observed, saved in the Excel sheet and used in data labeling.
 Profile Url: also collected from the user profile.
 Gender: observed from the user profile.
 No. of Cover Photos: an attribute observed in the Photos section of the user profile.
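As a point of reference for the export steps in Section 3.8.1.1, the header row of the resulting CSV file would resemble the line below. This is an illustrative sketch only; the column order is arbitrary, and the class label (Spam / Not Spam) is appended later during data labeling (Section 3.7).

No. of Friends,No. of Profile Pics,Profile Pic Likes,Profile Pic Comments,Profile Pic Address,Name,Profile Url,Gender,No. of Cover Photos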

Figure 3.1: Dataset in Excel


3.7 Data Labeling
This technique is used to label the data, i.e. to record whether a user profile is a spam profile or not. Three criteria are adopted for data labeling: Engagement Rate (ER), profile picture duplication check on TinEye, and non-human name. These three criteria are explained below.
Table 3.1: Features

Sr. Features
1 Engagement Rate (ER)
2 Profile Pic Duplication Check on TinEye
3 Not Human Name
3.7.1 Engagement Rate (ER)
The Engagement Rate (ER) is a metric that estimates the degree of engagement that a piece of created content receives from an audience. It shows how much people interact with the content. Factors that influence engagement include users' comments, shares, likes, number of friends and more. This is an important metric to keep an eye on, because higher consumer engagement is a sign of great content. Engagement is very useful for assessing social media advertising campaigns and can be applied to Facebook, Twitter and any other social media platform. It is also worth monitoring the feedback received from customers, since they may offer good suggestions for improvement. Figure 3.2 shows the data labeling based on the Engagement Rate (ER).

Figure 3.2: Data Label Based on ER


The Engagement Rate (ER) is a metric that is used heavily in social media analysis. Here, the ER is used for data labeling, i.e. to decide whether a user profile is spam or not. Four attributes of the user profile are used in the ER: No. of Friends, No. of Profile Pics, Profile Pic Likes and Profile Pic Comments. The Engagement Rate (ER) formula is given below.

ER = ((Profile Pic Likes + Profile Pic Comments) * No. of Profile Pics / No. of Friends) * 100

The Engagement Rate (ER) of each user profile is calculated and saved in the Microsoft Excel sheet. To label a profile as spam or not spam, a minimum value of the Engagement Rate (ER) is set at 0.01%. If the ER of a profile is less than 0.01%, the profile is labeled "Spam"; otherwise, if the ER is 0.01% or higher, the profile is labeled "Not Spam".
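To make the rule concrete, the sketch below expresses the labeling logic in Java. It is a minimal illustration only: the class and method names are hypothetical, and the attribute values are assumed to have been read from the Excel sheet already.

// Minimal sketch of the Engagement Rate (ER) labeling rule.
public class ErLabeler {

    // ER = ((likes + comments) * profilePics / friends) * 100
    static double engagementRate(double likes, double comments,
                                 double profilePics, double friends) {
        return ((likes + comments) * profilePics / friends) * 100.0;
    }

    // Profiles with ER below 0.01% are labeled "Spam", otherwise "Not Spam".
    static String label(double likes, double comments,
                        double profilePics, double friends) {
        return engagementRate(likes, comments, profilePics, friends) < 0.01
                ? "Spam" : "Not Spam";
    }

    public static void main(String[] args) {
        // Hypothetical profile: 12 likes, 3 comments, 5 profile pictures, 4800 friends.
        System.out.println(label(12, 3, 5, 4800)); // prints "Not Spam" (ER = 1.5625%)
    }
}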
3.7.2 Profile Pic Duplication Check
TinEye was the first website ever to use image identification technology and to this day remains one of the most well-known and widely used reverse image search engines. It is ideal for professional photographers or creatives who have work on the web and want to check whether any of it has been taken, altered or reused. At the time of writing, TinEye reported 38 billion indexed pictures; if you are skeptical, TinEye makes its release and update data openly available at tineye.com/updates.

Figure 3.3: TinEye


There are numerous reasons to use a reverse image search engine. Perhaps you are a photographer hoping to see who has been using your photographs without permission, a graphic designer searching for a larger version of a photo, or simply curious whether somebody is using your image from Facebook or Instagram. In digital marketing, reverse image search is sometimes relied on for finding unauthorized product pictures being used by competitors, which can lead to a cease-and-desist; alternatively, if an unauthorized photograph is being used, a reference back to the owned picture may be negotiated.
In this work, TinEye is the image search engine used to label the user profile data by checking for profile picture duplication. There are three ways to search: first, upload a picture from your PC or phone by tapping the upload button and selecting the picture you wish to search for; second, search by URL by pasting a picture's URL into the search box; third, drag a picture from a tab in your browser and drop it into a browser tab where TinEye is open. The data labeled with TinEye is shown in Figure 3.4 below.

Figure 3.4: Data Label by TinEye


3.7.3 Not Human Name
Facebook is quite clear on what counts as a genuine name. It should be "the name that your friends call you in everyday life" and it should "also appear on [an official] ID." Facebook even has a list of acceptable ID types, which includes things like passports and driving licenses. Essentially, if it is not the name the government knows you by, it is most likely not going to be accepted by Facebook.
There are also several other rules; your name must avoid certain elements, which are given below.
 Symbols, numbers, unusual capitalization and similar characters are not allowed in your name.
 A mixture of characters from different languages is not allowed in your name.
 A title such as Doctor or Father is not allowed in your name.
 Words that are not your name; for example, one could not register "Magnificent Harry Guinness", however much one might want to.
 Offensive words are not allowed in your name.
Facebook's real-name policy is a major part of the reason it has a productive advertising business, while Twitter and Reddit do not. The whole foundation of Facebook is that it is a place where genuine users interact with one another without hiding behind anonymous usernames and blank avatars. That is why, in spite of Facebook's numerous issues, it has never had the same degree of abuse and trolling that Twitter and Reddit get. People still quarrel and argue over everything, but at least they know that it is their bigoted uncle they are arguing with and not SpottyTeenager64.
3.8 Tools
Three tools are used in this work, which are the following:
 Microsoft Excel
 Weka
 RapidMiner
3.8.1 Microsoft Excel
Microsoft Excel is a spreadsheet program developed by Microsoft for Windows, macOS, Android and iOS. It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications. We visit user profiles one by one on the Facebook social media platform and collect their public profile data. Microsoft Excel 365, shown in Figure 3.5, is the tool used for saving the data.

Figure 3.5: Microsoft Excel


3.8.1.1 Excel File Convert to CSV
Comma-separated values (CSV) is a widely used file format that stores tabular data (numbers and text) as plain text. Its popularity and suitability are due to the fact that a great many programs and applications support CSV files, at least as an alternative import/export format. In addition, the CSV format allows users to look at the file and immediately diagnose any problems with the data, change the CSV delimiter, adjust the quoting rules, and so on. This is possible because a CSV file is plain text, and an average user or even a beginner can easily understand it with no learning curve.
In the Excel workbook, switch to the File tab, and then click Save As. Alternatively, you can press F12 to open the same Save As dialog. In the Save as type box, choose to save your Excel file as CSV (Comma delimited).
Figure 3.6: CSV File Save
3.8.2 WEKA
Weka (Waikato Environment for Knowledge Analysis) provides a wide range of machine learning algorithms for finding patterns and mining details in a dataset. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization, and is a collection of algorithms used for tackling real-world data mining problems. It is written in Java and runs on practically any platform. Weka supports many mining tasks such as data preprocessing, classification, clustering, regression, feature selection and data visualization. All of Weka's techniques are based on the assumption that the given data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally nominal or numeric attributes, although some other attribute types are also supported).
Figure 3.7: WEKA
3.8.2.1 CSV File Convert to ARFF File in Weka
Next, the data is converted to the ARFF (Attribute-Relation File Format) format for Weka. An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were created by the Machine Learning Project at the University of Waikato for use with Weka.

Figure 3.8: ARFF File Save


Weka provides a convenient tool to load CSV files and save them in ARFF; this only needs to be done once per dataset. Using the steps below, you can convert your dataset from CSV format to ARFF format and use it with the Weka workbench. Open the ARFF-Viewer by clicking "Tools" in the menu and selecting "Arff Viewer". You will be presented with an empty ARFF-Viewer window. Open your CSV file in the ARFF-Viewer by clicking the "File" menu and selecting "Open". Navigate to your current working directory, change the "Files of Type:" filter to "CSV data files (*.csv)", select your file and click the "Open" button. Save your dataset in ARFF format by clicking the "File" menu and choosing "Save as…". Enter a file name with an .arff extension and click the "Save" button.
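The same conversion can also be performed programmatically through Weka's Java API, which can be convenient if the dataset is regenerated often. The sketch below uses Weka's CSVLoader and ArffSaver classes; the file names are placeholders and should be replaced with the actual dataset paths.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV file exported from Excel (placeholder file name).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("dataset.csv"));
        Instances data = loader.getDataSet();

        // Save the instances in ARFF format for use in the Weka workbench.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("dataset.arff"));
        saver.writeBatch();
    }
}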
3.8.3 RapidMiner
RapidMiner is a code-free data analysis platform that enables the user to design data processes in a plug-and-play style by wiring operators together. In addition, functionality can be added to RapidMiner by creating extensions, which are made available on the RapidMiner Marketplace. The RapidMiner Linked Open Data extension, for example, includes operators for loading information from datasets within Linked Open Data, as well as autonomously following RDF links to other datasets and collecting additional information from them. Moreover, the extension supports schema matching for the information gathered from different datasets.
Figure 3.9: RapidMiner

3.9 Machine Learning Algorithms in WEKA


Four machine learning algorithms (MLA), Support Vector Machine (SVM), Multilayer Perceptron (MLP), K-Nearest Neighbor (KNN) and Random Forest (RF), are applied in the machine learning tool Weka to detect spam profiles on the Facebook social media platform and classify the data as spam or not spam.
Table 3.2: Machine Learning Algorithms (MLA)
Sr. Machine Learning Algorithms (MLA)
1 Support Vector Machine (SVM)
2 Multilayer Perceptron (MLP)
3 K Nearest Neighbor (KNN)
4 Random Forest (RF)

3.9.1 Support Vector Machine (SVM)


In the Weka Explorer, on the 'Preprocess' tab, open the file containing the data. On the "Classify" tab, press the "Choose" button and select the classifier weka -> classifiers -> functions -> SMO (SMO is the sequential minimal optimization algorithm used to train an SVM on a dataset), as shown in the figure below.
Figure 3.10: SVM in WEKA
A Support Vector Machine (SVM) is an algorithm that can be used for both classification and regression problems. It is a supervised machine learning algorithm, but it is mostly used for classification. In this algorithm, each data item is plotted as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyper-plane that separates the two classes (Trivedi, 2016).
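As mentioned in Section 3.8.2, Weka's classifiers can also be called from Java code instead of the Explorer. The sketch below is illustrative rather than the exact configuration used in this study: it loads the ARFF dataset (placeholder file name), trains SMO with its default settings, and evaluates it with 10-fold cross-validation.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmExample {
    public static void main(String[] args) throws Exception {
        // Load the ARFF dataset and mark the last attribute (Spam / Not Spam) as the class.
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // SMO trains a support vector machine with Weka's default kernel settings.
        SMO svm = new SMO();

        // 10-fold cross-validation, as in the Explorer's "Classify" tab.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}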
3.9.2 Multilayer Perceptron (MLP)
In the Weka Explorer, on the 'Preprocess' tab, open the file containing the data. On the "Classify" tab, press the "Choose" button and select the classifier weka -> classifiers -> functions -> MultilayerPerceptron, as shown in the figure below.
Figure 3.11: MLP in WEKA
A multilayer perceptron (MLP) is a feed-forward artificial neural network, shown in Figure 3.11. It is composed of more than one perceptron: an input layer that receives the signal, an output layer that makes a decision or prediction about the input, and, in between, an arbitrary number of hidden layers that are the true computational engine of the MLP. MLPs with one hidden layer are capable of approximating any continuous function. Multilayer perceptrons are frequently applied to supervised learning problems: they train on a set of input-output pairs and learn to model the relationship (or dependencies) between those inputs and outputs. Training involves adjusting the parameters, i.e. the weights and biases, of the model in order to minimize error. Backpropagation is used to make those weight and bias adjustments relative to the error, and the error itself can be measured in a variety of ways, including by root mean squared error.
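An equivalent Java sketch for Weka's MultilayerPerceptron classifier is given below. The parameter values shown are Weka's defaults and are included only to make the main training options visible; the file name is a placeholder.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Backpropagation-trained network; "a" creates one hidden layer with
        // (attributes + classes) / 2 units. The values shown are Weka's defaults.
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("a");
        mlp.setLearningRate(0.3);
        mlp.setMomentum(0.2);
        mlp.setTrainingTime(500);   // number of training epochs

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(1));
        System.out.println(eval.pctCorrect() + " % correctly classified");
    }
}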
3.9.3 K Nearest Neighbor (KNN)
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. In the Weka Explorer, on the 'Preprocess' tab, open the file containing the data. On the "Classify" tab, press the "Choose" button and select the classifier weka -> classifiers -> lazy -> IBk, as shown in the figure below.

Figure 3.12: KNN in WEKA
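For completeness, the sketch below shows IBk (Weka's KNN implementation) being applied from Java code. The choice of k = 3 is illustrative only (the default is k = 1), and the file name is a placeholder.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // IBk is Weka's k-nearest-neighbour classifier; k = 3 is an illustrative choice.
        IBk knn = new IBk();
        knn.setKNN(3);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println(eval.toMatrixString());   // confusion matrix for Spam / Not Spam
    }
}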


3.9.4 Random Forest (RF)
Random forest is a flexible, easy-to-use machine learning algorithm that produces a good result most of the time, even without hyper-parameter tuning. It is also one of the most widely used algorithms because of its simplicity and versatility. In the Weka Explorer, on the 'Preprocess' tab, open the file containing the data. On the "Classify" tab, press the "Choose" button and select the classifier weka -> classifiers -> trees -> RandomForest, as shown in the figure below.
Figure 3.13: RF in WEKA
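Finally, the sketch below shows Weka's RandomForest classifier being applied from Java code with its default settings (a bagged ensemble of random trees); as before, the file name is a placeholder.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // RandomForest with Weka's default ensemble settings.
        RandomForest rf = new RandomForest();

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));
        System.out.println(eval.toClassDetailsString());  // precision, recall and F-measure per class
    }
}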
3.10 Machine Learning Algorithms (MLA) in RapidMiner
Four machine learning algorithms (MLA), SVM, MLP, KNN and RF, are applied in RapidMiner as described below.

3.10.1 Support Vector Machine (SVM)


To apply the Support Vector Machine (SVM) in RapidMiner, we have to build a process. First of all, we import the dataset into RapidMiner and drag and drop it into the process window. We search for the "Select Attributes" operator, drop it into the process window and connect the dataset to it. We then search for the "Set Role" operator, drop it into the process window, set its values, and connect "Select Attributes" to "Set Role". Next, we search for the SVM operator, drop it into the process window, and connect "Set Role" to "SVM". We search for "Apply Model", drop it into the process window, and connect "Select Attributes" and "SVM" to "Apply Model". We then drop a second "Set Role" operator into the process window and connect "Apply Model" to it. Finally, we search for "Performance", drop it into the process window, connect "Set Role" to "Performance", and connect "Performance" to the result port, as shown in the figure below.
Figure 3.14: SVM in RapidMiner
3.10.2 Multilayer Perceptron (MLP)
To apply the Multilayer Perceptron (MLP) in RapidMiner, we have to build a process. First of all, we import the dataset into RapidMiner and drag and drop it into the process window. We search for the "Select Attributes" operator, drop it into the process window and connect the dataset to it. We then search for the "Set Role" operator, drop it into the process window, set its values, and connect "Select Attributes" to "Set Role". Next, we search for the "Perceptron" operator, drop it into the process window, and connect "Set Role" to "Perceptron". We search for "Apply Model", drop it into the process window, and connect "Select Attributes" and "Perceptron" to "Apply Model". We then drop a second "Set Role" operator into the process window and connect "Apply Model" to it. Finally, we search for "Performance", drop it into the process window, connect "Set Role" to "Performance", and connect "Performance" to the result port, as shown in the figure below.
Figure 3.15: MLP in RapidMiner
3.10.3 K Nearest Neighbor (KNN)
To apply K-Nearest Neighbor (KNN) in RapidMiner, we have to build a process. First of all, we import the dataset into RapidMiner and drag and drop it into the process window. We search for the "Select Attributes" operator, drop it into the process window and connect the dataset to it. We then search for the "Set Role" operator, drop it into the process window, set its values, and connect "Select Attributes" to "Set Role". Next, we search for the KNN operator and drop it into the process window, as shown in the figure below.

Figure 3.16: KNN in RapidMiner


We connect "Set Role" to "KNN". We then search for "Apply Model", drop it into the process window, and connect "Select Attributes" and "KNN" to "Apply Model". We drop a second "Set Role" operator into the process window and connect "Apply Model" to it. Finally, we search for "Performance", drop it into the process window, connect "Set Role" to "Performance", and connect "Performance" to the result port.
3.10.4 Random Forest (RF)
To apply Random Forest (RF) in RapidMiner, we have to build a process. First of all, we import the dataset into RapidMiner and drag and drop it into the process window. We search for the "Select Attributes" operator, drop it into the process window and connect the dataset to it. We then search for the "Set Role" operator, drop it into the process window, set its values, and connect "Select Attributes" to "Set Role". Next, we search for the Random Forest operator, drop it into the process window, and connect "Set Role" to "Random Forest". We search for "Apply Model", drop it into the process window, and connect "Select Attributes" and "Random Forest" to "Apply Model". We then drop a second "Set Role" operator into the process window and connect "Apply Model" to it. Finally, we search for "Performance", drop it into the process window, connect "Set Role" to "Performance", and connect "Performance" to the result port, as shown in the figure below.

Figure 3.17: RF in RapidMiner
