
Swarming the High-Dimensional Datasets Using Ensemble Classification Algorithm

Thulasi Bikku, A. Peda Gopi and R. Laxmi Prasanna

Abstract In dealing with the typical issues associated with a high-dimensional search space, conventional optimization algorithms fail to offer a workable solution, because the search space grows exponentially with the problem size; tackling these problems with exact techniques is therefore impractical. At the same time, the corresponding data has proven to be a vital asset for business entities and government organizations, enabling prompt and well-informed decisions through the assessment of relevant records. As the number of features (attributes) grows, the computational cost of running the induction task grows exponentially. This curse of dimensionality affects supervised as well as unsupervised learning algorithms. Attributes within the dataset may also be irrelevant to the task under study, thereby affecting the reliability of the results, and correlations between attributes may further degrade classification performance. Therefore, a novel methodology known as the ensemble classification algorithm, built on feature selection, is proposed. We show that our algorithm compares favorably to existing algorithms, thus providing state-of-the-art performance. The algorithm is designed to reduce computational overheads, improve scalability, and handle data imbalance in Big Data.

Keywords Clustering · Big data · Classification · Feature selection · Accuracy


T. Bikku (B) · A. P. Gopi · R. L. Prasanna
Vignan’s Nirula Institute of Technology and Science for Women, Palakalur, Guntur, India
e-mail: thulasi.bikku@gmail.com
A. P. Gopi
e-mail: gopiarepalli2@gmail.com
R. L. Prasanna
e-mail: happy.prasanna44@gmail.com


1 Introduction

Big data research addresses what constitutes big data, what metrics describe the magnitude and diverse characteristics of such data, and what tools and technologies exist to harness its potential. From the corporate sector to city planners and academics, big data is a subject of attention and, to some extent, apprehension [1].
The sudden growth of massive data has left traditional data mining tools and techniques ill-prepared. Until recently, new technical developments appeared first in specialist and academic outlets; authors and practitioners then turned to books and other electronic media for the rapid and wide dissemination of their work on big data. Accordingly, one finds several books on big data, including Big Data for Dummies [2], which nevertheless lack the critical discussion found in academic treatments. One of the success factors in data mining projects is choosing the correct algorithm for the query at hand. Among the more popular data mining tasks are clustering of datasets [3] and classification. For this study, we have chosen to use several classification algorithms, as our goal is to classify the data into two labels, referred to as binary classification. Many types of classification algorithms are available nowadays; for the purposes of this investigation, we selected C4.5 (a decision tree classifier), k-nearest neighbor (k-NN), and support vector machine (SVM) algorithms [4].
Big data can be characterized by three V's, namely volume, variety, and velocity. Volume refers to the size and quantity of data: big data sizes are reported in multiple terabytes, petabytes, and even zettabytes to store and process. Ever larger storage capacities are being built, permitting considerably larger datasets; likewise, the type of data, discussed under variety, shapes what is meant by "big" [5]. Variety refers to the structural heterogeneity of a dataset. Technological advances allow firms to use different kinds of data: structured, unstructured, and the combination of the two known as semi-structured [6]. Structured data, which constitutes only 5% of all present data, refers to the tabular data found in spreadsheets or relational databases; a high degree of integration and analysis of massive data is not, at a fundamental level, new. Velocity refers to the rate of data creation: the proliferation of electronic devices, such as mobile phones and sensors, has led to an unprecedented rate of data creation and is driving a growing need for real-time analytics and evidence-based planning [7].

2 Related Work

C. L. Philip Chen et al. (2014) present an overview of big data and data-intensive applications. It is now evident that big data has drawn enormous attention from researchers in the information sciences and from decision makers in governments and enterprises [8].

As the rate of data growth outpaces Moore's law at the start of this new century, excessive data is placing an enormous burden on people. At the same time, great potential and highly valuable knowledge are hidden in these huge volumes of data. A new scientific paradigm has emerged, known as data-intensive scientific discovery (DISD), also called the Big Data problem [9]. Many fields, from public administration and national security to scientific research, experience big data problems. On the one hand, big data is extremely important for delivering productivity in businesses and evolutionary breakthroughs in scientific disciplines, and it offers many opportunities to make great advances in various fields. On the other hand, big data also comes with many challenges, for instance in finding the relevant data, data storage, data analysis, and data representation [10]. As shown in Fig. 1, 50% of the enterprises surveyed think big data will help them increase operational efficiency, among other benefits.
Chakraborty et al. (2014) describe how to improve job performance using MapReduce on Hadoop clusters [11]. Real-world datasets contain many irrelevant, redundant, or misleading features that serve no purpose. In such cases, feature selection (FS) is used to overcome the curse of dimensionality. The purpose of FS is to reduce complexity and increase the performance of a system by selecting distinctive features; in other words, the goal of FS is to choose a relevant subset of R features from a total set of I features (R < I) in a given dataset. Attribute selection is an active topic in many areas, for example data processing, data mining, pattern recognition, machine learning, classification problems, and computer vision. A dimension reduction method is applied in the first phase to reduce the dimensionality of the dataset, and the relevant data is classified in the next phase. Unler et al. introduced a hybrid filter-wrapper attribute subset selection algorithm based on particle swarm optimization (PSO) for SVM classification. The best feature subset is chosen from a set of low-level image properties, including local, intensity, color, and textual features, using the genetic algorithm. This approach performs well on various optimization problems and has been examined on problems such as channel modeling, unconstrained optimization, model classification, feature (attribute) selection, and multi-objective optimization. For swarming to work, the data, which is clustered with k-means, must be of high quality with respect to an appropriately chosen homogeneity measure. K-means is a partitioning-based clustering procedure that attempts to find a user-specified number of clusters represented by their centroids. Its major drawbacks are that it is difficult to assess the quality of the clusters produced, that poor initial partitions can result in poor final clusters, that the number of clusters (the k value) is very hard to predict, and that the method is sensitive to scale.
Another approach builds on the Gravitational Search Algorithm (GSA) as an optimization procedure to tune both the optimal feature subset and the SVM parameters at the same time. Two kinds of GSA are used within a single algorithm: a continuous (real-valued) version and a discrete (binary-valued) version [12]. The continuous-valued variant is used to optimize the SVM model parameters, while the discrete version is used to search for the optimal feature subset.

Fig. 1 Survey chart of big data opportunities

Rashedi et al. developed the binary variant of the original GSA for binary optimization problems, in which updating a position means switching between 0 and 1 rather than moving through a continuum; the problem is solved in binary space instead of real space. In this paper, we therefore propose a new algorithm that overcomes these computational issues and improves performance and accuracy.
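To make the difference between the two GSA variants concrete, the following Python sketch contrasts their position updates. It is a minimal illustration written for this discussion, not code from [12]; the transfer function |tanh(v)| and the function names are our assumptions.

import numpy as np

def continuous_gsa_position_update(positions, velocities):
    """Continuous (real-valued) GSA: the position moves along the velocity vector."""
    return positions + velocities

def binary_gsa_position_update(positions, velocities, rng=None):
    """Binary GSA: the velocity is mapped to a bit-flip probability,
    so updating a position means switching between 0 and 1."""
    rng = rng or np.random.default_rng()
    flip_probability = np.abs(np.tanh(velocities))        # transfer function in [0, 1)
    flips = rng.random(positions.shape) < flip_probability
    return np.where(flips, 1 - positions, positions)      # toggle the selected bits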

3 Proposed Algorithm

Our aim is to represent the data effectively while reducing computational overheads and improving scalability and data balancing. The database may be a DBMS or an RDBMS. Big data can be classified into three data structures, namely structured, unstructured, and semi-structured data. An advanced methodology, based on the law of gravity, is proposed for data stored in large volumes. In the proposed algorithm, agents are considered as objects and their performance is measured by their masses. The objects attract each other through their gravitational force and move toward the heavier masses. In the proposed algorithm, named the elector-based algorithm for high-dimensional datasets (EHD), every mass has four attributes: position, inertial mass, active gravitational mass, and passive gravitational mass. Every mass represents a solution, and the algorithm is steered by properly adjusting the gravitational and inertial masses. With the lapse of time, we expect the masses to be attracted by the heaviest mass, which represents an optimal solution in the search space. The steps of the proposed algorithm (EHD) are given below.
Here, t = 0 denotes the present state (neither the past nor the future).
B(t) = B0 · e^(−α·t/T), where T is the total number of iterations;
Create a random initial population and consider a system with N agents:
Y_i = (y_i^1, ..., y_i^d, ..., y_i^n) for i = 1, 2, ..., N;
where y_i^d denotes the position of the ith agent in the dth dimension.
Mass_active,i = Mass_passive,i = Mass_inertial,i = Mass_i, for i = 1, 2, ..., N;
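A minimal Python sketch of this initialization step is given below. The decay parameters B0 and alpha are illustrative defaults (the text does not fix their values), and the search bounds are assumed to come from the dataset at hand.

import numpy as np

def gravitational_constant(t, T, B0=100.0, alpha=20.0):
    """B(t) = B0 * exp(-alpha * t / T); B0 and alpha are assumed defaults."""
    return B0 * np.exp(-alpha * t / T)

def initialise_agents(N, n, lower, upper, rng=None):
    """Random initial population: N agents, each a position Y_i in n dimensions,
    with zero starting velocity. Masses are computed later from fitness."""
    rng = rng or np.random.default_rng()
    positions = rng.uniform(lower, upper, size=(N, n))
    velocities = np.zeros((N, n))
    return positions, velocities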

The following calculations are repeated until the stopping criterion is met.

Obtain the total force exerted on the ith solution:
Force_ij = B · (active mass_i × passive mass_j) / (square of the distance between the two masses);
The total force acting on agent i in dimension d, Force_i^d(t), is the sum of these forces over the set of best agents j, weighted by random coefficients.
All agents in the group are evaluated, and the best and worst agents are determined:
Mass_i(t) = [fit_i(t) − worst(t)] / [best(t) − worst(t)];
best(t) = min_j fit_j(t), where j = 1, 2, ..., N;
worst(t) = max_j fit_j(t);
For a maximization problem, the above equations change to:
best(t) = max_j fit_j(t);
worst(t) = min_j fit_j(t);
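The mass computation above can be sketched as follows. The final normalisation, dividing each mass by the sum of all masses, follows the standard GSA formulation and is our assumption, since the text only gives the un-normalised expression.

import numpy as np

def compute_masses(fitness, minimise=True):
    """Mass_i(t) = (fit_i(t) - worst(t)) / (best(t) - worst(t)),
    then normalised so that the masses sum to one (standard GSA step)."""
    fitness = np.asarray(fitness, dtype=float)
    best = fitness.min() if minimise else fitness.max()
    worst = fitness.max() if minimise else fitness.min()
    if best == worst:                       # all agents equally fit
        return np.full_like(fitness, 1.0 / len(fitness))
    raw = (fitness - worst) / (best - worst)
    return raw / raw.sum()                  # heavier mass = better solution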

In this way, all agents exert force as time goes on; Kbest is decreased gradually, and at the end only one agent exerts force on the others. Kbest is the set of the first K agents with the best fitness values and the greatest mass. Kbest is a function of time, with an initial value K0 at the beginning that decreases over time.
Calculate the acceleration of the ith solution. According to Newton's second law:
Acceleration = Force / mass of the object;
According to the proposed algorithm (EHD):
Acceleration_i^d(t) = Σ over the best agents j of B · [mass_j / (distance between the agents + ε)] · (displacement between the agents);
Compute the new velocity of the ith solution; the velocity of an object is the rate of change of its position with respect to a frame of reference and is a function of time:
Velocity_i^d(t + 1) = random_i · Velocity_i^d(t) + Acceleration_i^d(t);
where random_i is a random coefficient associated with the ith agent.

Obtain the new position of the ith solution:
Y_i(t + 1) = Y_i(t) + random_i · Velocity_i^d(t + 1);
Now consider two distinct random solutions, Y_r1 and Y_r2, drawn from the population. The elector operation is calculated as:
Elector_(i,j)(t) = Y_j + random(0, 1) · (Y_r1(t) − Y_r2(t));
Y_i(t + 1) = Y_(i,j)(t) + Velocity_i^d(t + 1), if random(0, 1) < elector();
Y_i(t + 1) = Y_i + random(0, 1) · [Y_r1(t) − Y_r2(t)], otherwise;
where elector() is the rate that controls the probability of inheriting the new position.
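The force, acceleration, velocity, and elector steps can be combined into a single update routine. The sketch below is our reading of the equations above; in particular, the Kbest handling, the random coefficients, and the use of (distance + ε) in the force term are assumptions rather than the authors' exact implementation.

import numpy as np

def ehd_step(positions, velocities, masses, B, kbest, elector=0.3, eps=1e-9, rng=None):
    """One EHD iteration: GSA-style force, acceleration and velocity updates,
    followed by the elector operation that forms each new position."""
    rng = rng or np.random.default_rng()
    N, n = positions.shape
    heavy = np.argsort(masses)[-kbest:]                     # the Kbest heaviest (best) agents
    force = np.zeros((N, n))
    for i in range(N):
        for j in heavy:
            if i == j:
                continue
            displacement = positions[j] - positions[i]
            distance = np.linalg.norm(displacement)
            # Force = B * (mass_i * mass_j) / (distance + eps) * displacement
            force[i] += rng.random() * B * masses[i] * masses[j] * displacement / (distance + eps)
    acceleration = force / (masses[:, None] + eps)          # Newton's second law
    velocities = rng.random((N, 1)) * velocities + acceleration
    new_positions = positions.copy()
    for i in range(N):
        r1, r2 = rng.choice(N, size=2, replace=False)       # two distinct random solutions
        if rng.random() < elector:                          # inherit the gravitational move
            new_positions[i] = positions[i] + velocities[i]
        else:                                               # elector-style recombination
            new_positions[i] = positions[i] + rng.random() * (positions[r1] - positions[r2])
    return new_positions, velocities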
The advantages of the proposed algorithm are that a heavy inertial mass moves slowly, which makes the search space easier to explore; computing the gravity of the data improves the accuracy of the search, which reduces the time complexity; a more precise search of the data leads to better scalability; and data balancing is maintained through the inertia and mass calculations.
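Putting the pieces together, an illustrative driver loop might look as follows. The fitness function, the population size, and the linear shrinking of Kbest are all assumptions made for this example; it simply reuses the helper sketches given above.

import numpy as np

def run_ehd(fitness_fn, N=30, n=10, T=100, lower=0.0, upper=1.0):
    """Illustrative EHD driver reusing gravitational_constant, initialise_agents,
    compute_masses and ehd_step from the sketches above."""
    positions, velocities = initialise_agents(N, n, lower, upper)
    for t in range(T):
        fitness = np.array([fitness_fn(agent) for agent in positions])
        masses = compute_masses(fitness)
        B = gravitational_constant(t, T)
        kbest = max(1, int(N * (1.0 - t / T)))   # Kbest shrinks from N towards 1
        positions, velocities = ehd_step(positions, velocities, masses, B, kbest)
    fitness = np.array([fitness_fn(agent) for agent in positions])
    return positions[np.argmin(fitness)]          # best (minimum-fitness) solution found

# Example usage with a toy objective (sphere function):
# best = run_ehd(lambda y: float(np.sum(y ** 2)), lower=-5.0, upper=5.0)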

4 Experimental Analysis

In the proposed model, medical documents are collected from each data source in structured or unstructured format. In this system, relationships among documents, based on a novel gene-based representation, are analyzed using the MapReduce framework to discover textual patterns in a large collection of medical documents. This section presents the results, the accuracy, and the performance of the proposed algorithm with reference to the k-means and GSA algorithms. The proposed method is assessed in terms of precision and error rate and compared with the other algorithms. Four different datasets were collected and used with the proposed model for result analysis, and the main results of the framework are given below.
The average separation index measures the magnitude of the gaps between any two clusters in a partition by projecting the data of a pair of clusters into a one-dimensional space in which they have the maximum separation. Table 1 shows the comparison between the different algorithms and the proposed EHD model; the proposed model gives good results. The separation index is measured for the various algorithms on the different datasets, and the results show that the proposed algorithm is more efficient than conventional algorithms such as k-means and GSA.

Entropy measures the difference between the original class labels and the predicted class labels. A low cluster entropy indicates a better cluster, and the entropy increases as the diversity of the ground-truth labels of the objects in the cluster increases; higher entropy therefore means poorer clustering. The proposed algorithm achieves lower entropy than the other traditional algorithms. Table 2 shows the comparison of the different algorithms and the proposed EHD model on the different datasets based on entropy; the proposed model gives good results.
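As a reference for how the entropy figures of Table 2 can be computed, a small sketch is given below; it follows the standard weighted cluster-entropy definition, which we assume is the one used here.

import numpy as np

def average_cluster_entropy(true_labels, cluster_ids):
    """Weighted average entropy of the class labels inside each cluster;
    lower values indicate clusters that align better with the true classes."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    total, entropy = len(true_labels), 0.0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        _, counts = np.unique(members, return_counts=True)
        probs = counts / len(members)
        cluster_h = -np.sum(probs * np.log2(probs))    # entropy of this cluster
        entropy += (len(members) / total) * cluster_h  # weighted by cluster size
    return entropy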
The k-means algorithm is widely used for clustering large datasets. However, the standard algorithm does not always guarantee good outcomes, since the accuracy, that is, the closeness of the final clusters to the original evaluation, depends on the choice of the initial centroids.

Table 1 Comparing different instances based on average separation index

                Average separation index                Average separation index
                (with five instances)                   (with ten instances)
Dataset         K-means   GSA      Proposed (EHD)       K-means    GSA       Proposed (EHD)
Abalone         0.25722   0.250    0.1798               0.280546   0.24801   0.176145
Breast cancer   0.25101   0.2492   0.1761               0.278084   0.25612   0.173817
Forest fire     0.21245   0.201    0.1744               0.276339   0.23789   0.172168
Iris            0.26159   0.254    0.1789               0.280957   0.2187    0.176533

Table 2 Comparing different algorithms on different datasets based on entropy

                Average cluster entropy
Dataset         K-means   GSA       Proposed method (EHD)
Abalone         0.618     0.52375   0.2128
Breast cancer   0.629     0.52755   0.2548
Forest fire     0.632     0.5304    0.2884
Iris            0.641     0.53895   0.3392

Table 3 Comparing different algorithms on different datasets based on accuracy

                Accuracy (%)
Dataset         K-means   GSA     Proposed method (EHD)
Abalone         65.27     93.33   95.63330165
Breast cancer   64.89     91.23   94.39843083
Forest fire     63.68     91.32   94.98560788
Iris            65.71     93.74   94.33791411

Table 3 shows the comparison of the different algorithms and the proposed EHD model on the different datasets based on accuracy; the proposed model gives good results.
Sensitivity measures the proportion of positives that are correctly identified. All experiments, no matter how carefully planned and executed, have some level of error or uncertainty. The accuracy can be evaluated by computing the percentage error, which can be calculated when the true value is known. Although the percent error is an absolute value, it can be expressed with a sign to indicate the direction of the error from the true value. Table 4 shows the comparison between the different algorithms and the proposed EHD model based on true positive rate (sensitivity), error, and outlier percentage; the proposed model gives good results.

Table 4 Comparing different algorithms based on sensitivity and error rate

Algorithm               True positive (sensitivity)   Error (%)   Outlier (%)
K-means                 72.921                        27.91       16.2
GSA                     82.415                        22.64       12.4
Proposed method (EHD)   84.624                        19.85       9.92
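For reference, the sensitivity and error-rate figures of Table 4 can be derived from a confusion matrix as sketched below; the counts in the usage comment are invented purely for illustration and are not taken from the paper's experiments.

def sensitivity_and_error(tp, fn, fp, tn):
    """Sensitivity (true positive rate) = TP / (TP + FN);
    error rate = misclassified samples / all samples."""
    sensitivity = tp / (tp + fn)
    error_rate = (fp + fn) / (tp + fn + fp + tn)
    return 100.0 * sensitivity, 100.0 * error_rate

# Example (hypothetical counts): sens, err = sensitivity_and_error(tp=846, fn=154, fp=120, tn=880)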

Table 5 Accuracy (%) measured at different values of the elector rate

Dataset         Elector = 0.1   Elector = 0.2   Elector = 0.3
Abalone         94.225          95.101          95.63330165
Breast cancer   94.2849         94.2585         94.39843083
Forest fire     93.1258         93.5265         94.98560788
Iris            94.1232         94.2459         94.33791411

The computational complexity of the typical algorithm is very high because the data points are reassigned several times during each iteration of the loop. The proposed algorithm was run at different elector values, and the accuracies computed for the different datasets are shown in Table 5.

5 Conclusion

High-dimensional document clustering and classification is one of the essential machine learning models for the knowledge extraction process of real-time user recommendation systems. As the amount of information in high-dimensional data repositories increases, many organizations face the unprecedented issue of how to process the available huge volumes of data efficiently. This model is used as a user recommendation system on large document sets using the MapReduce component of the Hadoop framework. Experimental results show that the proposed algorithm (EHD) gives better results than traditional document clustering and classification models. The computational complexity of the Mapper phase is O(n log n) and that of the Reducer phase is O(log n). In future, this work can be extended to protein clustering and classification on biomedical datasets using the Hadoop framework.

References

1. Kruger, Andries F. Machine learning, data mining, and the World Wide Web: design of special-
purpose search engines. Diss. Stellenbosch: Stellenbosch University, 2003.
2. Hurwitz, Judith, et al. Big data for dummies. John Wiley & Sons, 2013.

3. Wu, A. H., et al. “Soy intake and breast cancer risk in Singapore Chinese Health Study.” British Journal of Cancer 99.1 (2008): 196–200.
4. Wu, Xindong, et al. “Top 10 algorithms in data mining.” Knowledge and information systems
14.1 (2008): 1–37.
5. Zikopoulos, Paul, and Chris Eaton. Understanding big data: Analytics for enterprise class
hadoop and streaming data. McGraw-Hill Osborne Media, 2011.
6. Wang, Lei, et al. “Bigdatabench: A big data benchmark suite from internet services.” High
Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on.
IEEE, 2014.
7. Jagadish, H. V., et al. “Big data and its technical challenges.” Communications of the ACM
57.7 (2014): 86–94.
8. Chen, CL Philip, and Chun-Yang Zhang. “Data-intensive applications, challenges, techniques
and technologies: A survey on Big Data.” Information Sciences 275 (2014): 314–347.
9. Tolle, Kristin M., D. Stewart W. Tansley, and Anthony JG Hey. “The fourth paradigm:
Data-intensive scientific discovery [point of view].” Proceedings of the IEEE 99.8 (2011):
1334–1337.
10. Kaisler, Stephen, et al. “Big data: Issues and challenges moving forward.” System Sciences
(HICSS), 2013 46th Hawaii International Conference on. IEEE, 2013.
11. Chakraborty, Suryadip. Data Aggregation in Healthcare Applications and BIGDATA set in a
FOG based Cloud System. Diss. University of Cincinnati, 2016.
12. Sarafrazi, Soroor, and Hossein Nezamabadi-pour. “A New Class of Hybrid Algorithms Based
on Gravitational Search Algorithms: Proposal and Empirical Comparison”.
