
This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.3048610, IEEE Access

Predicting Influential Blogger's by a novel, hybrid and optimized Case Based Reasoning
approach with Balanced Random Forest using Imbalanced data
Yousra Asim1, Ahmad Kamran Malik1,*, Basit Raza1, Ahmad R Shahaid1, Nafees Qamar2
1Department of Computer Science, COMSATS University Islamabad (CUI), Islamabad, 45550, Pakistan.
2Department of Health Administration, Governors State University, Illinois 60484, United States
*Corresponding author: Ahmad Kamran Malik (ahmad.kamran@comsats.edu.pk)

This research was supported in part by the Department of Health Administration, Governors State University, USA. This study is also
supported by COMSATS University Islamabad (CUI), Islamabad, Pakistan, under research productivity funds CUI/ORIC-PD/20.

ABSTRACT

Bloggers are able to understand and influence the psychology of a wide community of fans and
followers through the content they post online. Their sway over an audience can help the
corporate world, which wants to disseminate its products or services among diverse people in
varying localities and is always on the lookout for suitable and quick ways to reach the public.
For this reason, influential bloggers are preferred in the online market to initiate marketing
campaigns, but identifying them is a challenging task because of the large number of blogger
communities. The novelty of this paper lies in the proposed Framework for Influential Blogger
Prediction based on Blogger and Blog Features (IBP-BBF) using Case-Based Reasoning (CBR), which
is capable of handling not only labeled data but also unstructured data (blogs) and imbalanced
data in an optimized way. Detailed labeled and unstructured data are collected through an online
survey of 129 bloggers and text mining of their 32,200 blogs, respectively. The classification
results are compared with and validated against state-of-the-art machine learning techniques by
using standard evaluation measures for imbalanced data. The results show that the proposed
IBP-BBF framework, through CBR modeling, outperforms existing techniques in classifying and
adapting the influential blogger prediction, and performs better than baseline imbalanced data
classification techniques. It is found that the Balanced Random Forest contributes more to the
performance of the CBR approach than the Balanced Bagging Classifier and the RUSBoost classifier.
By using the CBR approach, baseline techniques can be optimized for better influential blogger
identification.

INDEX TERMS: Blogger Classification, Case Based Reasoning (CBR), Machine Learning (ML), Imbalanced data, Text
Mining

I. INTRODUCTION
People from diverse backgrounds and different geographical regions come together to interact and
share on online social networks (OSNs) [1]. Some of these users become bloggers and start sharing
their ideas with the rest of the OSN users. They gradually gain followers based on the quality of
the content they present, and they encourage their followers to participate and engage with their
blogs. Some bloggers gain more popularity than others for various reasons, for example the better
quality of their content, the wider impact of their videos, their general appeal, and a more
concise style of presentation. These bloggers with a wide range of influence are called
influential nodes. Their influence is used by corporate companies for viral marketing and
information flow [4], targeted marketing [5], propagation of brand information [2], influence
maximization [3], and many other purposes as highlighted in [4].


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Author Name: Preparation of Papers for IEEE Access (November 2020)

Such influential nodes exert a subtle yet powerful impact on the psychology of their followers
[5]. Corporate companies can use this impact for their own benefit, thereby attracting customers
to their brand [6]. Because of the dynamic nature of OSNs, it is a considerable task for any
company to hire the most suitable and skilled influential node, such as a blogger [7]. It also
requires an understanding of the blogging world and skill in distinguishing professional from
amateur bloggers. A professional blogger can prove far more valuable in reaching out to the
masses than an amateur [8].
As the community of bloggers in OSNs has expanded with the continually growing activity of
blogging, it has become difficult to measure the impact of individual bloggers in a huge online
blogger community and to identify the influential bloggers among them. Other relevant challenges
have also been noted, namely the quantitative measurement of influence, lack of ground truth,
temporal aspects of blogging platforms, missing links in blogger datasets, and identification of
consistent and active bloggers [9]. In this research work, however, we are only interested in the
identification of influential bloggers who have adopted blogging as their profession (that is, in
classifying bloggers as professional or otherwise).
Previously, the relevant studies have widely focused on network-based models and feature-based
models (using features of blog posts, such as the number of likes or comments) to identify
influential bloggers [9, 10], and on a few machine learning (ML) techniques (based on blogger
features) [11-13] for classifying bloggers as professional or non-professional. This study is
focused on the latter only. Furthermore, authors have tried to identify the factors (important
features) affecting the blogging behavior of a blogger [12, 14-16] based on questionnaires. Blog
content features have also been examined for several purposes, such as content analysis and
opinion mining [17]; however, we are only interested in identifying influential bloggers in this
context. Authors have also used blogger features along with blog content features for influential
blogger prediction [18].
During a comprehensive analysis of the literature, we observed that the ML techniques previously
used for blogger classification are not adaptive [11-13, 18], which makes them incapable of
automatically identifying unseen influential bloggers with varying input features and previously
unobserved blogger instances. Most of them construct a general and explicit description of a
target concept (a global approximation) from the bloggers’ training data, which decreases their
problem-solving capability with each newly presented blogger instance and leads to poor
classifier performance. Although the number of factors that affect the blogging behavior of a
blogger is not limited [12, 14-16], existing predictive models [11-13] have not considered the
handling of varying features of unseen bloggers and their blogs. Besides, there is room for
improvement in their performance measures for blogger classification. Furthermore, the
construction of ontologies from blog content in [18] can suffer from contradictions and
misinterpretations of the target concept due to the unstructured nature of blogs, eventually
compromising the performance of a classifier. Without handling such issues, influential bloggers
cannot be identified correctly, and an adaptive prediction framework that can support companies
in today’s e-business environment cannot be designed.
The main objective of this research is to model, develop, and implement an adaptive framework for
the identification of influential bloggers by using blogger features and blog content features,
which can boost prediction results. This effort extends our previous investigations [19-21] in
this domain. It differs from the prior works by incorporating unstructured data handling along
with labeled data in the suggested framework when the data is imbalanced. The investigation of
imbalanced data classification also led us to a different view of the problem in terms of
imbalanced data classification techniques and performance measures. The imbalanced data problem
means that important cases (positive instances of the target output class) are few or rare in the
training dataset [22, 23]. In other words, in a binary classification problem, a dataset is
imbalanced if one output class has fewer instances than the other [24]. This can lead to poor
performance of standard classifiers, because they assume balanced class distributions, which does
not hold for imbalanced data. Such data contains underrepresented classes, which require new
tools and algorithms to process raw data into useful knowledge representations and to learn
patterns in the data [25]. Authors have suggested a measure, the Imbalanced Ratio (IR), to
quantify the distribution of instances in an imbalanced binary dataset [26]. It is calculated by
dividing the number of majority-class instances by the number of minority-class instances. Once
the IR is determined, an imbalanced dataset can be categorized as low imbalanced (IR between 1.5
and 3), medium imbalanced (IR between 3 and 9), or high imbalanced (IR greater than 9) [27].
In addition, the inclusion of adaptive capability in the proposed framework, which is one of the
autonomic characteristics of an algorithm, gives it the strength to work independently, so that
the need for human involvement is minimized and adaptive behavior is managed [28]. Autonomous
systems (AS) handle unseen situations in a natural way, just like humans, who live in a changing
environment throughout their lives and have to handle unpredictable and unseen situations. A
shift in a person’s experience brings with it a shift in his/her perspective, attitudes,
likes/dislikes, and problem-solving skills. An AS exhibits a variety of characteristics that
enable it to wisely manage random situations. Such systems can solve a problem after going
through a learning process. They can adapt over time to changing trends through a process


of self-* properties such as self-monitoring, self-prediction, self-evaluation, self-correction,
and self-adaptation [29].
In this work, Case-Based Reasoning (CBR) modeling incorporates a few characteristics of an AS for
the classification task. Conversely, traditional machine learning techniques build an explicit
model with no adaptation when suggesting solutions to unseen problems, which is why they cannot
hold autonomic properties. Previously, CBR has been successfully investigated as an option for
designing systems with self-management characteristics [28, 30]. We do not propose a complete AS,
but rather a classifier with a few AS properties. For this purpose, we assume that adaptation
based on CBR is suitable for the proposed prediction framework for the identification of
influential bloggers.
Furthermore, this study uses a hybrid research design (quantitative as well as experimental). In
quantitative research, surveys are regarded as the best way to collect a large amount of data
from a large number of people in a short amount of time [31]. In this research, collecting a
survey-based dataset is crucial because no available dataset captures both the features of a
blogger and his/her blogging, which are needed together to identify a blogger’s influence. This
collection of data from online bloggers for the evaluation of the proposed framework makes the
research design survey-based. We used the internet for data collection because of its low cost
and larger sampling frame compared with other data gathering techniques. Furthermore, the
investigation of existing state-of-the-art machine learning techniques for imbalanced data that
can perform well for influential blogger classification, and the use of standard performance
measures such as Specificity, F-measure, Geometric Mean, and ROC area under the curve to evaluate
the effectiveness of the proposed framework, make this work an experimental research design.
This work contributes theoretically and practically by proposing and implementing an adaptive
Framework for Influential Blogger Prediction based on Blogger and Blog Features (IBP-BBF) using
CBR for influential blogger classification. It is a classifier that partially holds some
characteristics of an AS, such as self-prediction and self-adaptation. It can not only identify
the most influential bloggers based on their features, but also monitor and identify the changing
aptitudes of bloggers. It is capable of devising solutions in unseen situations without any human
help. It can update itself according to future data to improve its performance. Additionally, it
can handle imbalanced data more efficiently than the baseline methods under investigation.
Likewise, the finding that the CBR approach can enhance the classification capabilities of
baseline methods is another theoretical contribution to the existing body of knowledge, and it is
consistent with our prior research outcomes [19].
The paper is organized as follows. Section II presents the previous related work in the domain of
influential blogger identification. Section III provides the complete methodology, including
dataset collection and description, the baseline methods used for comparison, and the standard
evaluation metrics used for evaluating results. Section IV presents the proposed IBP-BBF
framework for influential blogger identification. Section V describes the results and discussion
and additionally provides insight into the proof of concept. Section VI concludes this study.

II. RELATED WORK
In this section, an overview of closely related prior studies on the identification of
influential bloggers is offered. The existing models are classified into network-based models and
feature-based models [9, 10]. The former find influential bloggers by using different network
centrality measures and the connections between users. In contrast, the latter calculate the
influence of a blog directly by analyzing the attributes of blog posts, such as comments
received, the length of a blog post, the number of references used in a blog post, and the
significance of the website where the blog has been posted, which contribute indirectly to the
blogger’s influence. Such models are further grouped into temporal and non-temporal feature-based
models: the former also consider the recency of blog post features for influential blogger
identification, whereas the latter ignore the recency of the features under consideration.
However, this research work does not examine network structure or blog post features to find
influential bloggers.
Besides, a few studies have identified influential bloggers by using labeled data. They performed
classification by using a few ML techniques, implemented in the Weka tool, on the standard
BLOGGER dataset collected by [12]. After collecting the dataset, the authors used the C4.5
decision tree algorithm, which obtained 82% accuracy for blogger classification. They discussed
factors contributing to the professionalism of a blogger, but without validating the results.
Apart from this, K-Nearest Neighbor (KNN) and Artificial Neural Networks (ANNs) were used and
achieved 84% and 90% accuracy, respectively. Further, a Random Forest (RF) classifier was compared
with the C4.5 and KNN algorithms of [11, 12] and provided an 8% gain in accuracy, without cross
validation [13]. Although these studies have experimentally evaluated a few ML techniques for
blogger classification, they have ignored the standard practice of verifying results by k-fold
cross validation, which is necessary to estimate the skill of a trained ML model on unseen future
data. As a result, they are unable to provide generalized results. They have also ignored
unstructured data for finding influential bloggers, and they do not propose any new method in
this problem domain.
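
To make the role of k-fold cross validation concrete, the following is a minimal sketch of a stratified k-fold split in plain Python. The function name `stratified_kfold`, the fold count, and the toy label vector are illustrative assumptions and are not taken from the studies discussed above. Stratification is used here because it keeps the class ratio roughly constant across folds, which matters when one class is rare.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a stratified k-fold split.

    Hypothetical helper for illustration: the indices of each class are
    shuffled and dealt round-robin into k folds, so every fold keeps roughly
    the same class ratio as the full label list.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train_idx, test_idx


# Toy labels: 1 = professional, 0 = non-professional (made-up example data).
toy_labels = [1] * 10 + [0] * 25
splits = list(stratified_kfold(toy_labels, k=5))
for train_idx, test_idx in splits:
    positives_in_test = sum(toy_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), positives_in_test)
```

In use, a model would be trained on `train_idx` and scored on `test_idx` for each of the k splits, and the k scores would be averaged to estimate performance on unseen bloggers.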


To overcome this research gap, [20] was an initiative in this problem domain. By using a number
of ML techniques, the authors observed an 85% accuracy gain for blogger classification with the
RF classifier and the Nearest Neighbor classifier, using standard cross validation. In this chain
of experiments, a 2% gain in accuracy was obtained with an Artificial Neural Network (ANN) [21].
After investigating already available ML techniques on a number of standard datasets and
highlighting a few among them for better blogger classification, a new classifier named
Influential Blogger based Case-Based Reasoning (IB-CBR) was offered with adaptive capabilities
[19]. It iteratively outperformed the relevant previously used techniques in terms of standard
performance measures and achieved 95% accuracy, a 97% True Positive Rate, an 11% False Positive
Rate, a 96% F-measure, and a 98% ROC area under the curve in the third iteration.
In addition, authors have presented some context-specific works in this domain. For instance, a
dataset of fashion bloggers from Spain was collected to find influential bloggers among them. The
study concluded that influential bloggers keep their blogs updated, actively participate in
online social circles, contribute to organizing fashion events with media, and habitually read
fashion magazines [32]. A Weibo platform-dependent framework has been offered for information
spread by finding influential bloggers. The authors extract all blogs matching pre-selected
keywords from Weibo via a web crawler and collect bloggers’ information (such as their sex,
number of followers/followees, number of tweets/retweets, and type of blogger) to determine their
influence as opinion leaders [18]. For this purpose, they construct two directed graphs, of the
blog post-repost relationship and the blogger post-repost relationship, respectively, and use
them for information diffusion. However, our work focuses on suggesting a classifier without
using network structure and without focusing on any specific platform. Also, they do not handle
the data imbalance problem. We work on traditional blogs as well as microblogs instead of only
the latter, which shows that this work is not blogging-platform dependent. Moreover, instead of
building ontologies, we used ML techniques for blogger classification. Although the
identification of influential bloggers has previously been carried out using the aforesaid
methodologies, no existing work handles labeled as well as unstructured data simultaneously by
using ML techniques.
As far as studies related to blog mining (pattern extraction from unstructured data) are
concerned, a very recent study has provided insights into the literature in this context so far
[17]. As an illustration, earlier studies followed a qualitative approach to blog content
analysis based on gender and language to find out the participation of bloggers in blogging.
Further, as discussed before, authors have examined metadata (features) of blogs, such as the
blog author, the blog’s publication date and time, and the blog’s length. It is also stated that
recent studies in this domain focus on news-related content and micro-blogging. Different studies
have been conducted to identify the latest trends/topics in blogging by investigating how
blogging content evolves over time. Researchers are interested in several methods, such as
classification, clustering, sentiment analysis, and time series. Earlier blog mining studies used
dimensionality reduction methods, such as spectral dimensionality reduction techniques, for mono-
or bi-directional blog data. Other studies have used non-linear dimension reduction techniques,
but focus on a specific type of blogs, such as business blogs. Several solutions in blog mining
are static and incapable of handling dynamic, complex, and huge blog data [17]. This indicates
that, in spite of a number of studies relevant to influential blogger identification (using
labeled data) and several blog mining studies, there is still a need for an adaptive framework
that is capable of handling both labeled and unstructured (blog text) data simultaneously. Such a
framework should also be able to predict efficiently in unseen scenarios without human
intervention.

III. RESEARCH METHODOLOGY
In this work, a dataset is collected from Facebook- and Instagram-based online bloggers by
collecting their responses through an online questionnaire. The data obtained from blogger
responses has 35 input features (i.e., answers to 35 questions) and one binary target output,
professional or non-professional blogger (as stated by the blogger himself/herself). There are 37
positive instances (professional bloggers) and 92 negative instances (non-professional bloggers).
The value of IR, calculated as 92/37 ≈ 2.49, indicates that the extracted dataset is imbalanced,
lying just below the boundary of 3 between low and medium imbalance. We extracted from the
dataset the URLs given by the participants (bloggers), on which they post their blogs.
Afterwards, we extracted a total of 32,200 blogs from these URLs. The blogs dataset comprises
13,751 blogs from Instagram and 18,449 from other blogging platforms such as Blogspot and
WordPress. Details of blog extraction are given in Section IV. Further, we selected the blogs of
those categories in which blogs of both professional and non-professional bloggers were present.
Six categories had both types of bloggers, namely: Activities, private thoughts and reflections;
Human rights issues and development; Self experiences; Tourism; Business and digital marketing;
and Poetry, literature and art (details are shown in Table 3). We labeled the features of these
blogs as belonging to a professional blogger or otherwise according to the obtained responses.
The selected categories comprise a total of 6,037 blogs, with 1,483 blogs of professional
bloggers and 4,554 blogs of non-professional bloggers (IR = 4,554/1,483 ≈ 3.07), which indicates
a medium imbalanced dataset in the case of unstructured data. This way, we used labeled as well
as unstructured data for experimental analysis.
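
As a worked illustration of the Imbalanced Ratio used in this section, the following minimal sketch computes IR as the majority count divided by the minority count and maps it to the low/medium/high categories quoted from [27]. The function names, the treatment of a value of exactly 3 as medium, and the label used for IR below 1.5 are assumptions made for this example only.

```python
def imbalance_ratio(class_a_count, class_b_count):
    """Return IR = (majority class size) / (minority class size)."""
    majority = max(class_a_count, class_b_count)
    minority = min(class_a_count, class_b_count)
    if minority <= 0:
        raise ValueError("both classes need at least one instance")
    return majority / minority


def imbalance_category(ir):
    """Map an IR value to a category using the thresholds 1.5, 3, and 9.

    Assumptions: exactly 3 counts as medium, high means strictly above 9,
    and anything below 1.5 is reported as "balanced or near-balanced".
    """
    if ir < 1.5:
        return "balanced or near-balanced"
    if ir < 3:
        return "low"
    if ir <= 9:
        return "medium"
    return "high"


# Class counts reported in this section.
labeled_ir = imbalance_ratio(37, 92)      # survey responses
blog_ir = imbalance_ratio(1483, 4554)     # blogs in the selected categories

print(round(labeled_ir, 2), imbalance_category(labeled_ir))  # 2.49 low
print(round(blog_ir, 2), imbalance_category(blog_ir))        # 3.07 medium
```

Taking the larger count as the numerator regardless of argument order avoids the easy mistake of dividing the minority count by the majority count, which would yield a value below 1.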


TABLE 1. Demographic details of participants (Bloggers)

Gender: Male (51.2%); Female (48.8%)
Marital status: Single (65.1%); Married (25.58%); Engaged/In a Relationship (7.75%); Prefer Not to say (1.5%)
Religion: Atheist (3.1%); Christian (10.07%); Hindu (11.6%); Muslim (72.09%); Prefer Not to Say (3.1%)
Education/Degree: School Education (15.55%); Bachelor (51.93%); Master (26.35%); Ph.D. (2.3%); Technical Qualification (2.32%); Did not complete school education (0.77%)
Age in years (y): < 20 y (10.01%); 21-25 y (40.3%); 26-30 y (31.7%); 31-35 y (9.3%); 36-40 y (5.4%); 41-45 y (1.55%); > 46 y (1.55%)
Occupation: Businessman (6.2%); Doctor (2.3%); Engineer (6.2%); Govt. Servant (1.5%); Private job (14.7%); Professional bloggers (24.03%); Students (27.13%); Teachers (5.42%); Others (12.40%)
Country: Pakistan (65.11%); India (13.95%); Nigerian (4.65%); American (2.32%); Filipino (2.30%); U.K. (2.3%); U.A.E. (1.5%); Others (6.976%)

Further, we used both types of data (labeled and unstructured) to evaluate our proposed framework
for influential blogger identification. The efficiency of the proposed framework was investigated
by using standard performance measures commonly used for imbalanced binary classification
problems, such as Specificity, F-measure, Geometric Mean, and ROC area under the curve.
Furthermore, a comparative analysis was performed between the outputs of the proposed framework
and state-of-the-art ML techniques previously used for imbalanced dataset classification. Several
distance and similarity measures, namely Cosine, Euclidean, Bray-Curtis, Canberra, Correlation,
Chebyshev, Minkowski, and City Block, are used to find the cases in the repository of previous
cases that are most similar to new unseen cases.

A. DATA COLLECTION
We prepared a questionnaire to conduct a survey of online bloggers on the Instagram and Facebook
platforms. To achieve theoretical soundness (validity), the survey was prepared by following the
research questions already used in [12, 15, 33]. This dataset was collected for the following
reasons.
Firstly, the relevant standard online dataset (the BLOGGER dataset [12]) is small, and we need
more data to effectively evaluate the proposed framework for influential blogger classification.
Secondly, the studies [15, 33], carried out in other disciplines to identify the factors behind
the blogging of bloggers, used questionnaires to collect data from bloggers and reported results
by using statistical methods. Unfortunately, the data collected in these studies are small and
not available online.
The survey was posted on the aforesaid social network platforms and responses were collected.
Data collection started on February 20, 2018 and lasted until February 02, 2019. The collected
data consists of 129 bloggers and their 32,200 blogs. The participants were from different
countries, with diverse religious backgrounds, age groups, interests, and educational
backgrounds. Likewise, their blogging platforms were also different from each other (and were not
limited to Facebook and Instagram).
Our dataset consists of 129 bloggers, of whom 63 (48.8%) were female and 66 (51.2%) were male.
There were 84 (65.1%) single bloggers, 33 (25.58%) married bloggers, and 10 (7.75%) bloggers
whose marital status was “engaged/in a relationship”, while 2 (1.5%) bloggers did not state their
marital status. Regarding religion, 4 (3.1%) bloggers were atheist, 13 (10.07%) were Christian,
15 (11.6%) were Hindu, and 93 (72.09%) were Muslim, while 4 (3.1%) bloggers preferred not to
mention their religion. With respect to educational background, 67 (51.93%) bloggers held a
“Bachelor/License” degree, 34 (26.35%) were Master degree holders, 3 (2.3%) were Ph.D. degree
holders, and 4 (2.32%) had acquired technical qualifications. Likewise, 20 (15.55%) bloggers had
completed only school education, and only one blogger had not completed school education.
By age group, 52 (40.3%) bloggers were in the 21-25 years age group, 41 (31.7%) in the 26-30
years group, 13 (10.01%) were younger than 20 years, 12 (9.3%) were in the 31-35 years group, 7
(5.4%) in the 36-40 years group, and 2 in the 41-45 years group. The 46-50 years and 51 years and
above age groups had only one blogger each. Among the survey participants, 8 (6.2%) bloggers gave
their occupation as “businessman”, 3 (2.3%) were doctors, 8 (6.2%) were engineers, 2 (1.5%) were
government servants, 19 (14.7%) had private jobs, 31 (24.03%) were bloggers by profession only,
35 (27.13%) were students, 7 (5.42%) were teachers, and 16 (12.40%) fell in the category of
“others”. The participants of this survey belong to different countries. Most of them, 84
bloggers (65.11%), were Pakistani. There were 18 (13.95%) bloggers from India, 6 (4.65%) were
Nigerian, 4 (2.32%) were American, 3 (2.3%) were Filipino, 3 (2.3%) were U.K. residents, and 2
(1.5%) were from the U.A.E. Similarly, among the participants, there was only one participant
from each of the following


countries: Afghanistan, Canada, Ghana, Indonesia, Malaysia, Mauritius, South Africa, South Arab, and Poland. The details of the dataset can be seen in Table 1.

B. ALGORITHMS USED FOR COMPARISON
This work explores the significance of the proposed framework for classifying influential bloggers as professional or otherwise. We used the Balanced Random Forest (BRF) classifier, the Balanced Bagging Classifier (BBC), and the RUSBoost classifier, and compared their results against the outcomes of the proposed IBP-BBF framework. First, the framework outcomes are evaluated and compared against the algorithms used in [34-36] for imbalanced datasets in binary classification problems. Additionally, we examined and compared the results of a previously proposed algorithm named IB-CBR [19] in our experimental chain in this domain. The classifiers used for results comparison are briefly described as follows:

1) BALANCED RANDOM FOREST CLASSIFIER
BRF [34] is an ensemble classifier that generates many decision trees, each built by picking a number of attributes at random. It overcomes the deficiency of classic RF, which is biased towards the majority class when predicting for accuracy. In an extremely imbalanced dataset, the training set used to build a decision tree is likely to contain only rare instances of the minority class, which can lead to unsatisfactory performance of a classifier for minority class prediction. BRF joins random undersampling with the ensemble idea, which also makes it computationally more efficient. At each iteration, it randomly extracts a training set by undersampling the majority class instances and selecting the same number of minority class instances with replacement (i.e., some instances may be used multiple times in generating a single decision tree). This strategy enables BRF to arrive at a better hypothesis by generating several models for future prediction, each trained on a balanced training set.

2) BALANCED BAGGING CLASSIFIER
BBC uses random undersampling to balance the training set at training time, coupled with the bagging idea for making predictions. Just like RF, classic bagging builds several versions of a base classifier to obtain the final prediction [35]. However, it uses all features of the dataset to split a node, which makes it different from RF. Because balanced bagging and BRF work in a comparable way, we selected this algorithm for the experimental design.

3) RUSBOOST CLASSIFIER
RUSBoost is recommended by [36] as outperforming non-ensemble classifiers, classic ensembles, cost-sensitive boosting ensemble classifiers, SMOTEBoost, bagging-based classifiers, and hybrid ensembles, provided the data are imbalanced and have two output classes. The authors demonstrated this by conducting experiments on 44 standard imbalanced datasets from the KEEL repository. The motivations for investigating RUSBoost for professional blogger classification are its outstanding performance, simplicity, and low computational cost. This hybrid algorithm undersamples the dataset at each iteration to reduce the instances of the majority class and balance the dataset, and uses AdaBoost.M2 for training [37]. It modifies the weight distribution of the instances after each classifier is trained, which provides the diversity in training data needed to classify data more accurately.

4) IB-CBR CLASSIFIER
An adaptive model named IB-CBR was proposed previously for the identification of influential bloggers [19]. It is a hybrid of case-based reasoning and RF. This automated system is capable of improving its performance iteratively by drawing on previous experiences. It outperforms the classic ensemble method RF, non-ensemble methods such as the Nearest Neighbor algorithm, and adaptive models such as ANN. These competitor algorithms were themselves the outcomes of extensive experimental studies, which showed their outstanding performance over several ML techniques on multiple standard datasets [20, 21]. In this work, we compare the IB-CBR outcomes with the proposed model in the imbalanced dataset domain.

C. EVALUATION METRICS
Because the dataset is imbalanced, we used Precision (Pre), Recall (Rec), Specificity (Spec), F-measure, Geometric Mean (G.M.), and ROC area under the curve (ROC AUC) to report classification results. Each of these metrics varies from 0 to 1, where 0 indicates the minimum value (poor classification) and 1 indicates the maximum value (excellent classification).
Pre quantifies the fraction of examples predicted as positive by a classifier that are actually positive. It focuses on the accuracy of minority class predictions and therefore does not distort the performance evaluation of a classifier. It is an alternative to classification accuracy in the case of imbalanced data. It is calculated by Eq. (1):

$$\mathrm{Pre} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \qquad (1)$$

Pre does not account for false negatives, i.e., how many positive examples (professional bloggers) are predicted as negative class examples (non-professional bloggers) by a classifier. To overcome this deficiency, Rec (sensitivity) is used, which indicates how many positive instances (professional bloggers) are missed by a classifier. It provides insight into the coverage of the minority class in imbalanced datasets. A low Rec reveals that a large share of professional bloggers is perceived as non-professional by the framework. It is calculated by the following Eq. (2).


$$\mathrm{Rec} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \qquad (2)$$

Interpreting the classification results for these two metrics together is a little tricky. When both metrics have high values, the framework handles the output class predictions correctly, and when both are low it does not. If Rec is low and Pre is high, the output class is not detected well by the framework, but a prediction of that class can be trusted when it is made. Likewise, high Rec and low Pre indicate that the framework finds an output class very well but also includes points of the other class in it (in a binary classification problem). A single score named F-measure encompasses both Pre and Rec. As discussed above, a classifier can have outstanding Pre but poor Rec, or vice versa. In such a case, F-measure represents both scores in a single value once they are calculated. F-measure tries to ensure that each output class contains the points of only one class (professional blogger/non-professional blogger). Eq. (3) is used for the calculation of F-measure:

$$F\text{-}\mathrm{measure} = 2 \cdot \frac{\mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}} \qquad (3)$$

Spec (True Negative Rate) examines how good our framework is at avoiding false positive predictions. It denotes the fraction of actual non-professional bloggers that are predicted as non-professional. It is calculated by Eq. (4):

$$\mathrm{Spec} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \qquad (4)$$

G.M. gives weight to both output classes by trying to enhance the accuracy on each of them. It produces a low value even if a single output class is not well predicted by the model. It is calculated by Eq. (5), where sensitivity is Rec:

$$\mathrm{G.M.} = \sqrt{\mathrm{sensitivity} \cdot \mathrm{Spec}} \qquad (5)$$

ROC AUC investigates the capability of the proposed framework to distinguish between professional and non-professional bloggers. Its values can be read as follows: 1 shows a perfect classifier, 0.9 denotes an excellent classifier, 0.8 depicts a good classifier, 0.7 shows a mediocre classifier, 0.6 indicates a poor classifier, and a value below 0.6 indicates a classifier close to random guessing.
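To make Eqs. (1)-(5) concrete, the following minimal Python sketch computes the metrics from confusion-matrix counts. The function name `imbalance_metrics` and the choice to return 0.0 for an undefined ratio are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from math import sqrt


def imbalance_metrics(tp, fp, tn, fn):
    """Compute Eqs. (1)-(5) from confusion-matrix counts.

    Assumption of this sketch: a ratio with a zero denominator is reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    pre = ratio(tp, tp + fp)                     # Eq. (1)
    rec = ratio(tp, tp + fn)                     # Eq. (2), also called sensitivity
    f_measure = ratio(2 * pre * rec, pre + rec)  # Eq. (3)
    spec = ratio(tn, tn + fp)                    # Eq. (4)
    g_mean = sqrt(rec * spec)                    # Eq. (5)
    return {"Pre": pre, "Rec": rec, "Spec": spec,
            "F-measure": f_measure, "G.M.": g_mean}


# A classifier that labels every blogger as non-professional:
all_negative = imbalance_metrics(tp=0, fp=0, tn=90, fn=10)
# A classifier that finds most professional bloggers:
balanced = imbalance_metrics(tp=8, fp=2, tn=88, fn=2)
```

In the first call Spec is 1.0 while F-measure and G.M. are 0.0, which is the pattern of the rows in Table 4 that report 100% Spec together with 0% F-measure and 0% G.M. This illustrates why Spec alone is not sufficient on imbalanced data.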
TABLE 2. Functions used to find the similarity between cases

| Name of Similarity Measure | Definition |
|---|---|
| Cosine distance similarity | $Sim_{cd} = \dfrac{\sum_{k=1}^{n} x_{ck}\, x_{dk}}{\sqrt{\sum_{k=1}^{n} x_{ck}^{2}}\ \sqrt{\sum_{k=1}^{n} x_{dk}^{2}}}$ |
| Euclidean distance | $Dist_{cd} = \sqrt{\sum_{k=1}^{n} w_{k}\,(x_{ck} - x_{dk})^{2}}$ |
| Braycurtis distance | $Dist_{cd} = \dfrac{\sum_{k=1}^{n} \lvert x_{ck} - x_{dk} \rvert}{\sum_{k=1}^{n} \lvert x_{ck} + x_{dk} \rvert}$ |
| Canberra distance | $Dist_{cd} = \sum_{k=1}^{n} \dfrac{\lvert x_{ck} - x_{dk} \rvert}{\lvert x_{ck} \rvert + \lvert x_{dk} \rvert}$ |
| Correlation distance | $Dist_{cd} = 1 - \dfrac{(c - \bar{c}) \cdot (d - \bar{d})}{\lVert c - \bar{c} \rVert_{2}\ \lVert d - \bar{d} \rVert_{2}}$ |
| Chebyshev distance | $Dist_{cd} = \max_{k} \lvert x_{ck} - x_{dk} \rvert$ |
| Minkowski distance | $Dist_{cd} = \left( \sum_{k=1}^{n} \lvert x_{ck} - x_{dk} \rvert^{p} \right)^{1/p}$ |
| CityBlock distance (Manhattan distance) | $Dist_{cd} = \sum_{k=1}^{n} \lvert x_{ck} - x_{dk} \rvert$ |

In every row, $Sim_{cd}$ or $Dist_{cd}$ is computed between the cth and dth cases with reference to all $n$ features of the blogger feature vector, and $x_{ck}$ is the kth feature of case c. In the correlation distance, $\bar{c}$ is the mean of the elements of c.
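The measures listed in Table 2 can be sketched in plain Python as follows. This is an illustrative reading of the table rather than the authors' code: the absolute values and square roots follow the standard definitions of these measures, the optional weight list `w` in `euclidean` defaults to all ones, and no guard is included for zero or constant vectors.

```python
from math import sqrt


def cosine_similarity(c, d):
    dot = sum(x * y for x, y in zip(c, d))
    return dot / (sqrt(sum(x * x for x in c)) * sqrt(sum(y * y for y in d)))


def euclidean(c, d, w=None):
    w = w or [1.0] * len(c)  # assumption: unweighted unless weights are given
    return sqrt(sum(wk * (x - y) ** 2 for wk, x, y in zip(w, c, d)))


def braycurtis(c, d):
    return (sum(abs(x - y) for x, y in zip(c, d))
            / sum(abs(x + y) for x, y in zip(c, d)))


def canberra(c, d):
    # Terms where both features are zero are skipped to avoid 0/0.
    return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in zip(c, d) if x or y)


def correlation(c, d):
    mean_c, mean_d = sum(c) / len(c), sum(d) / len(d)
    centred_c = [x - mean_c for x in c]
    centred_d = [y - mean_d for y in d]
    return 1.0 - cosine_similarity(centred_c, centred_d)


def chebyshev(c, d):
    return max(abs(x - y) for x, y in zip(c, d))


def minkowski(c, d, p=2):
    return sum(abs(x - y) ** p for x, y in zip(c, d)) ** (1.0 / p)


def cityblock(c, d):
    return minkowski(c, d, p=1)
```

Note that Minkowski distance reduces to CityBlock distance for p = 1 and to unweighted Euclidean distance for p = 2, so its behaviour in the experiments depends on the chosen p, which the text does not state.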


Fig 1: Adaptive Influential Blogger Predictor (Block Diagram)

Algorithm 1: Pseudocode for Instagram Crawler

Input: List named 'Insta_list' containing the Instagram ID 'BloggerID' of each blogger
Output: Post_url, Post_text, Number_of_likes, Number_of_comments, Post_comments for each post 'p', and Total_posts of each BloggerID
for each BloggerID in Insta_list
    Posts_list ← Get each Post_url of BloggerID
    Total_posts ← Length of Posts_list
    for each post p in Posts_list
        Extract Post_text, Number_of_likes, Number_of_comments, Post_comments
        Save above-mentioned attributes against each post p in Posts_list including Total_posts of each BloggerID
    end for
end for

D. SIMILARITY MEASURES
In this paper, eight commonly used distance and similarity measures, such as Jaccard, Cosine, Euclidean, Braycurtis, and Canberra, are applied to calculate the similarity between the new instance and the training data in Algorithms 5 and 6. These measures are compared on the basis of the performance measures discussed in Section III-C. The definitions of these similarity functions are provided in Table 2.

IV. PROPOSED IBP-BBF FRAMEWORK
In this section, we present in detail the proposed framework for Influential Blogger Prediction based on Blogger and Blog Features, namely IBP-BBF, with respect to its modules, input/output, and adaptive functionality based on CBR.
The IBP-BBF comprises three modules, namely the Feature Extraction Module, the Influential Blogger Predictor Module, and the Adaptation Module, as shown in Fig. 1. Module 1 handles the feature extraction of a blogger. The incoming labeled blogger data (such as personal information) and the blogger's blogs (unstructured data) are input to the system, and different features are extracted from these data. Module 2 is the prediction component, which predicts whether a blogger is influential or not. The features extracted by Module 1 are used to represent a blogger as a case in the case base. Afterwards, a similarity measure is calculated between the incoming blogger features and the available cases in the case base for future predictions. Module 3 performs adaptation. When new, unseen blogger data enter the system, it enables the IBP-BBF to adapt to the changing behavior of a blogger. Modules 2 and 3 include CBR, which learns from experience through reasoning and provides adaptivity in the sense that if a new case is seen during prediction, the approach updates its rules for future cases [38]. CBR has been used extensively by the research community because it can suggest solutions to unseen problems based on past problem-solving experience.
First, Algorithm 1 is used to crawl blog data from Instagram blogger IDs. Afterwards, Algorithm 2 is used to crawl the blog data of bloggers on platforms other than Instagram. The Feedparser Python library is used, which is capable of parsing feeds in several formats such as RSS and RDF. A feed (RSS document) consists of text (whether full or summarized) and metadata (blog posting date and author name).

Algorithm 2: Pseudocode for Blogspot and Wordpress Crawler
Input: List named 'Blogs_list' containing the blog URLs of each blogger 'BloggerID'
Output: BlogPost_url, BlogPost_title, BlogPost_text, Blog_created_on for each blog post 'p', and Total_blogposts of each BloggerID
for each BloggerID in Blogs_list
    while feed is not empty
        Parse each RSS feed and extract BlogPost_title, BlogPost_text, Blog_created_on
        Total_blogposts ← Length of feed
        Save above-mentioned attributes against each post p, including Total_blogposts of each BloggerID
    end while
end for

A. Feature Extraction
We used the Continuous Bag of Words (CBOW) Word2vec model of the open-source Gensim library for Python. This model provides vectors (numerical form) of the given text by using a neural network of two layers. It takes a raw text corpus as input and returns a set of vectors (feature vectors) as output. It can provide accurate estimates of a word's meaning depending on the previous appearances of that word in a corpus. Moreover, these estimates can highlight that word's association with other words. It provides a vocabulary whose items have vectors attached to them, which is used to find the associations between words.
The Numpy library handles large multi-dimensional arrays and provides high-level mathematical functions that operate on these arrays. Pandas is a Python library for data manipulation and analysis. It is used to import data from several file formats in the form of dataframes.
The Natural Language Toolkit (NLTK) platform comprises text processing libraries with several built-in methods. It enabled us to use well-known methods such as WordNetLemmatizer( ) and PorterStemmer( ). The former is used in the lemmatization process, which removes inflectional endings. It reduces a word to its root or lexical word, namely the lemma, by linking words with similar meaning to one word (based on its context) using a vocabulary. The latter is used for the stemming process, which attempts to chop off word endings by removing derivational affixes to obtain a stem, without the use of a glossary (whether or not the result is a valid contextual word). For example, professionals is reduced to profession by the former process, which cannot be further analyzed. Likewise, it is reduced to professional by the latter. Moreover, the former reduces funfairs to the two root words fun and fair, while the latter removes only the s from the end, i.e. funfair. Algorithm 3 is used to extract, for each blog category, a file containing the vector values of the words belonging to that category after the data cleaning process.

Algorithm 3: Pseudocode for Data cleaning and Word2Vec

from gensim.models import Word2Vec
from gensim.parsing.preprocessing import split_alphanum
from .utils.util_functions import strip_all_entities, remove_stop_words
import pandas as pd
import numpy as np
from nltk.stem import WordNetLemmatizer, PorterStemmer
Input: file of all blogs of bloggers 'b_file'
Output: csv file of each blog category 'category_name.csv'
data_frame ← read data from b_file
blog_cat_list ← extract each unique category from data_frame
for each category 'c' in blog_cat_list
    cat_df ← extract blogs data frame of category 'c' from data_frame
    for each row 'r' of cat_df
        blog ← get blog text from 'r'
        blog ← remove blogger name from blog
        blog ← blog.lower()
        blog ← split_alphanum(blog)
        blog ← remove_stop_words(blog)
        blog ← strip_all_entities(blog)
        bl ← blog.split()
        for each word 'w' in bl
            if w ends with letter 'e'
                lemmatize w and append to list clean_blog_words
            else
                stem w and append to list clean_blog_words
        end for
        univ_blog_list ← append clean_blog_words
    end for
    cat_voc ← vocabulary of unique words from univ_blog_list
    Write each word of cat_voc as column name 'col' in the file 'category_name.csv'
    for each index i of univ_blog_list
        single_blog_model ← Word2Vec(univ_blog_list[i])
        for each blog word 'bw' in univ_blog_list[i]
            score ← np.mean(single_blog_model.wv[bw])
            write score of bw against its col at row i in category_name.csv
        end for
    end for
end for
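The cleaning steps of Algorithm 3, up to tokenisation, can be sketched with the Python standard library alone. This is a hedged illustration, not the authors' code: `split_letters_digits` mimics what Gensim's `split_alphanum` is documented to do, while `STOP_WORDS`, `remove_stop_words`, and `strip_all_entities` are hypothetical stand-ins, because the authors' helper module is not shown. The lemmatization/stemming and Word2Vec steps are omitted.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def split_letters_digits(text):
    """Insert a space wherever a letter touches a digit, e.g. 'top10' -> 'top 10'."""
    text = re.sub(r"([A-Za-z])([0-9])", r"\1 \2", text)
    return re.sub(r"([0-9])([A-Za-z])", r"\1 \2", text)


def remove_stop_words(text):
    """Drop whitespace-separated tokens that appear in STOP_WORDS."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def strip_all_entities(text):
    """Assumed behaviour: remove URLs, @mentions, #hashtags and punctuation."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    return re.sub(r"[^\w\s]", " ", text)


def clean_blog(text, blogger_name):
    """Apply the cleaning steps in the order listed in Algorithm 3."""
    text = text.replace(blogger_name, " ")
    text = text.lower()
    text = split_letters_digits(text)
    text = remove_stop_words(text)
    text = strip_all_entities(text)
    return text.split()


tokens = clean_blog(
    "Jane Doe visited the top10 cafes in Lahore! See https://example.com #travel",
    "Jane Doe",
)
```

Because stop words are removed before punctuation is stripped, a stop word that is directly followed by punctuation would survive in this ordering; whether the authors' helpers behave the same way cannot be determined from the text.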

Fig. 2 The proposed IBP-BBF framework for influential blogger prediction

With high-dimensional data, a classifier starts focusing on irrelevant features of the input data, which may result in poor performance [39]. Also, the availability of a huge amount of data makes it impractical to examine everything in a dataset because of resource constraints [6]. Keeping this in mind, and because each category has a huge number of unique words, we used Principal Component Analysis (PCA) to reduce the feature dimensions. We applied PCA with different variance values (85%, 90%, 95%, and 99%) in order to select the value at which the results are best. Algorithm 4 is used to extract the principal components of each category with respect to each variance value. Hence, four PCA-generated files are produced for each category.
After this step, each instance of each category is tagged based on the ground truth obtained from the survey (whether the blogger is professional or otherwise). Afterwards, each file goes through blogger classification in a non-adaptive and an adaptive manner. For each new instance of blogger data, the similarity between the new case and all the cases (training data) already stored in the Case Repository (CR) is determined by using Algorithm 5. Every stored case whose similarity value is greater than or equal to 80% contributes its solution to a list of candidate solutions named LIST1. After traversing all training data in the CR, the most frequently occurring solution is reused to predict the blogger as professional or non-professional.
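The non-adaptive retrieve-and-reuse step described above (collect the solutions of all stored cases that are at least 80% similar, then take a majority vote) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity is used as the example measure, the Case Repository is modelled as a list of `(features, label)` pairs, and the names `predict_non_adaptive` and `case_repository` are hypothetical.

```python
from collections import Counter


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_non_adaptive(new_case, case_repository, threshold=0.80):
    """Return the majority label among sufficiently similar cases, or None."""
    list1 = [label for features, label in case_repository
             if cosine_similarity(new_case, features) >= threshold]
    if not list1:
        return None  # no similar case: the non-adaptive variant cannot predict
    return Counter(list1).most_common(1)[0][0]


case_repository = [
    ([1.0, 0.0], "IB"),
    ([0.9, 0.1], "IB"),
    ([0.0, 1.0], "NIB"),
]
label = predict_non_adaptive([1.0, 0.05], case_repository)
no_match = predict_non_adaptive([-1.0, -1.0], case_repository)
```

The `None` result for a case with no sufficiently similar neighbour is the situation that the adaptive variant addresses by falling back to a trained model.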


Algorithm 4: Pseudocode for Principal Component Analysis
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
Input: variance value var_val, category file category_name.csv
Output: file PCA_category_var_val.csv having principal components with var_val amount
data_frame ← read data from category_name.csv
X ← StandardScaler().fit_transform(data_frame)
pca ← PCA(var_val)
Principal_components ← pca.fit_transform(X)
Principal_dataframe ← pd.DataFrame(data=Principal_components)
Final_df ← pd.concat(join principal components and final class)
Write Final_df in PCA_category_var_val.csv

In the IBP-BBF framework, blogger features and blog post features are the input, and the professional/non-professional label of a blogger is the output that we want to predict. Fig. 2 represents the proposed IBP-BBF.

Algorithm 5: Pseudocode of IBP-BBF in non-adaptive manner
Input: Blogger Feature Vector (BFV)
Output: Prediction of Influential/non-influential Blogger (IB/NIB)
Extract the incoming blogger features
Calculate the similarity between the new blogger and training data in Case Repository (CR)
if similarity is >= 80%
    Maintain list LIST1
if LIST1 is not empty
    Reuse solution and predict based on majority voting
else
    "Can't predict solution"

Likewise, Algorithm 6 is designed as an adaptive algorithm for influential blogger identification. The cleaned and tagged input files of blogger feature vectors are the input of Algorithm 6. The training data are kept in the CR. First, a Balanced Random Forest (BRF) model is trained on the cases available in the CR. Afterwards, the similarity between the new unseen problem and the training data is calculated. If the calculated similarity is greater than or equal to 80%, the list LIST1 is maintained by retrieving the solutions of all previous cases having at least 80% similarity with the unseen problem. Otherwise, if the similarity is greater than or equal to 60%, the list LIST2 is maintained, which contains the solutions of those cases that are at least 60% similar to the new problem. If the former list is not empty and holds more solutions than the latter list, the solution to the unseen problem is devised based on the most frequently occurring solution in LIST1. On the other hand, if half of the solutions in LIST1 classify the unseen problem as Influential Blogger and the other half classify it as Non-Influential Blogger, the very first solution of LIST2, if it exists, is added to LIST1 to make the number of solutions odd. The most frequently occurring solution of LIST1 is then reused to classify the unseen problem. However, if LIST1 is empty, which denotes that no previous solution is available for the new problem, the prediction is performed by the already trained BRF model. This is the Revise phase of IBP-BBF, in which the proposed system automatically predicts the solution in unseen situations without human intervention. Finally, the Retain phase adds the newly predicted solution of the unseen problem to the CR for future use.

Algorithm 6: Pseudocode of IBP-BBF in an adaptive manner
Input: Blogger Feature Vector (BFV)
Output: Prediction of Influential/non-influential Blogger (IB/NIB)
Extract the incoming blogger features BFV
Train BRF model based on CR
Calculate similarity between new blogger and training data in CR
if similarity is >= 80%
    Retrieve and Maintain list LIST1
else
    if similarity is >= 60%
        Retrieve and Maintain list LIST2
    end if
if LIST1 is not empty
    if no. of elements of LIST1 > LIST2 with 80% matching
        if LIST1 has an odd number of occurrences
            Reuse solution and predict based on majority voting
        else
            if LIST1 has even no. of occurrences & LIST2 != empty
                Add LIST2[0] in LIST1
                Predict based on majority voting
            else
                Reuse solution & predict by first occurrence of LIST1
    else
        Revise solution and predict by BRF model
        Retain new blogger features and solution by adding in CR
else
    Revise solution and predict by BRF model
    Retain new blogger features and solution by adding in CR
end if

V. EXPERIMENTAL DESIGN AND DISCUSSION
This section presents the complete series of steps carried out to evaluate the proposed IBP-BBF framework. For the experiments, an HP laptop with an Intel® Core(TM) i7-7500U CPU @ 2.7 GHz processor, 8 GB of RAM, and the Windows 10 (64 bit) operating system is used. The IBP-BBF framework and all algorithms are implemented in Python 2.7 using an Integrated Development Environment (IDE), namely PyCharm 4.0.7. We used 10-fold cross validation for results evaluation in order to indicate the efficiency of the


TABLE 3. Different categories extracted from unstructured data

| S. No. | Category Name | Total instances | +ve instances | -ve instances | Principal components at 0.99 variance | at 0.95 variance | at 0.90 variance | at 0.85 variance |
|---|---|---|---|---|---|---|---|---|
| 1. | Activities, private thoughts, and reflections | 1525 | 112 | 1413 | 1274 | 936 | 714 | 583 |
| 2. | Human rights issue and development | 302 | 120 | 182 | 270 | 228 | 196 | 170 |
| 3. | Self experiences | 2267 | 638 | 1629 | 1864 | 1429 | 1129 | 929 |
| 4. | Tourism | 758 | 494 | 264 | 656 | 529 | 430 | 355 |
| 5. | Business and digital marketing | 423 | 56 | 367 | 254 | 171 | 151 | 135 |
| 6. | Poetry, literature and art | 762 | 63 | 699 | 660 | 510 | 403 | 328 |
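The component counts in Table 3 shrink as the variance target is lowered because PCA keeps only as many leading components as are needed to reach the requested share of total variance. The sketch below illustrates that selection rule in plain Python. The eigenvalues are made up for illustration, the function name `components_for_variance` is hypothetical, and the "reaches the target" rule is a simplification that may differ from a library's exact threshold handling.

```python
def components_for_variance(eigenvalues, target):
    """Smallest number of leading components whose variance share reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return count
    return len(ordered)


# Made-up eigenvalues of a six-feature covariance matrix (illustration only).
eigenvalues = [4.0, 2.5, 1.7, 1.0, 0.5, 0.3]
counts = {target: components_for_variance(eigenvalues, target)
          for target in (0.85, 0.90, 0.95, 0.99)}
```

With these made-up values, the 0.99 target keeps all six components while the 0.85 target keeps four, mirroring the trend in each row of Table 3.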

proposed framework for influential blogger identification. Furthermore, the results of the proposed IBP-BBF framework are compared with existing top-notch machine learning techniques for imbalanced data classification, such as BRF, RUSBoost, and BBC, using their default parameters. The IB-CBR algorithm is also used for results comparison to investigate it on imbalanced data, although it was not designed or tested for this purpose. The reason for using this algorithm is its outstanding performance in our former research works [19-21]. In this work, standard performance evaluation metrics for imbalanced data, such as Spec, F-measure, G.M., and ROC area under the curve, are used to evaluate the experimental outcomes. The performance of the IBP-BBF framework is investigated on all datasets (falling into different categories) obtained from the online survey of bloggers.
To reduce the data dimensions, the PCA technique is used. Different variance values (85%, 90%, 95%, and 99%) are examined to select the best value for each category dataset. As there are six categories (datasets), a total of 24 data files are used for the experiments (each category has four dataset files with the aforesaid variance values and different numbers of principal components). Table 3 shows these categories, their number of instances, and the varying number of components generated by PCA at the different variance values. Afterwards, each data file is given as input to Algorithm 5 and Algorithm 6 to compare the results of both algorithms for influential blogger classification in a non-adaptive and an adaptive manner, respectively. The results highlight that the suggested adaptive algorithm beats its competitor in terms of performance measures for influential blogger classification. Table 4 shows the results of both algorithms for the Activities, private thoughts and reflections dataset with 85% variance. It is clear from the results that the adaptive algorithm adds to the predictive capabilities of the suggested framework.
Table 5 shows the results for influential blogger classification with respect to the Human rights issue and development dataset in a non-adaptive and an adaptive manner. The outcomes indicate that the results obtained from Algorithm 5 (before adaptation) are worst with respect to F-measure and G.M., which drop to 0 for some similarity measures. This indicates a poor ability to distinguish the output classes and a classification failure of the non-adaptive algorithm. On the other hand, the high values of all performance metrics after adaptation, with respect to all similarity measures, highlight the success of influential blogger classification (by using Algorithm 6). For the other four datasets, a similar observation is made, and the adaptive algorithm is found to be far better than Algorithm 5 in the very first iteration. However, these results are not shown here due to paper length limitations.
To evaluate the performance of the aforementioned similarity measures, we examined the eight similarity measures on each category's data files with varying variances. Afterwards, the similarity measures with higher performance in all varying-variance files of each category are selected by using IBP-BBF and the merging candidates (CBR-BBC and CBR-RB), which combine BBC and RUSBoost, respectively, with the case-based reasoning methodology. The same process is repeated for all


TABLE 4. Adaptation comparison through IBP-BBF Framework for influential blogger prediction in the case of Activities, private thoughts and reflections dataset

| Similarity measure | Before Adaptation: Spec. | Before Adaptation: F-measure | Before Adaptation: G.M. | After Adaptation: Spec. | After Adaptation: F-measure | After Adaptation: G.M. |
|---|---|---|---|---|---|---|
| Cosine Distance | 97% | 65% | 68% | 92.7% | 76.8% | 80.0% |
| Euclidean Distance | 100% | 0% | 4% | 91.0% | 79.4% | 83.1% |
| Braycurtis Distance | 98% | 43% | 52% | 91.5% | 79.1% | 82.8% |
| Canberra Distance | 100% | 0% | 0% | 89.1% | 80.6% | 83.1% |
| Correlation Distance | 97% | 68% | 71% | 92.1% | 79.0% | 81.5% |
| Chebyshev Distance | 100% | 0% | 0% | 90.1% | 80.9% | 83.3% |
| Minkowski Distance | 100% | 0% | 0% | 91.4% | 80.0% | 84.0% |
| CityBlock Distance | 100% | 0% | 0% | 93.1% | 80.9% | 83.4% |

TABLE 5. Adaptation comparison through IBP-BBF Framework for influential blogger prediction in the case of Human rights issue and development dataset

| Similarity measure | Before Adaptation: Spec. | Before Adaptation: F-measure | Before Adaptation: G.M. | After Adaptation: Spec. | After Adaptation: F-measure | After Adaptation: G.M. |
|---|---|---|---|---|---|---|
| Cosine Distance | 100.0% | 51.1% | 57.7% | 97.2% | 96.7% | 97.0% |
| Euclidean Distance | 100.0% | 51.1% | 57.8% | 97.5% | 96.7% | 97.1% |
| Braycurtis Distance | 100.0% | 38.1% | 45.1% | 97.0% | 96.4% | 96.7% |
| Canberra Distance | 100.0% | 52.9% | 59.1% | 97.2% | 96.7% | 97.0% |
| Correlation Distance | 100.0% | 52.9% | 59.1% | 98.0% | 97.4% | 97.6% |
| Chebyshev Distance | 100.0% | 38.1% | 45.1% | 97.3% | 96.4% | 96.8% |
| Minkowski Distance | 100.0% | 0.0% | 0.0% | 97.2% | 96.7% | 97.0% |
| CityBlock Distance | 100.0% | 0.0% | 0.0% | 97.3% | 96.4% | 96.8% |
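The "after adaptation" columns in Tables 4 and 5 come from the adaptive decision logic of Algorithm 6. One possible reading of that logic is sketched below in plain Python. Several points are assumptions of this sketch rather than facts from the paper: the nesting of the branches follows our interpretation of the pseudocode, `toy_similarity` is a hypothetical similarity in [0, 1] used only to exercise the code, and `fallback_model` is a stand-in callable for the trained BRF model.

```python
from collections import Counter


def majority(labels):
    """Most frequent label; on a tie, the label encountered first wins."""
    return Counter(labels).most_common(1)[0][0]


def toy_similarity(a, b):
    """Hypothetical similarity: 1 / (1 + Manhattan distance)."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


def predict_adaptive(new_case, case_repository, similarity, fallback_model):
    """One prediction following our reading of Algorithm 6.

    case_repository is a list of (features, label) pairs and is extended
    in place during the Retain phase.
    """
    list1, list2 = [], []
    for features, label in case_repository:
        score = similarity(new_case, features)
        if score >= 0.80:
            list1.append(label)
        elif score >= 0.60:
            list2.append(label)

    if list1 and len(list1) > len(list2):
        if len(list1) % 2 == 1:
            return majority(list1)                # Reuse by majority voting
        if list2:
            return majority(list1 + [list2[0]])   # add LIST2[0], then vote
        return list1[0]                           # Reuse first occurrence

    label = fallback_model(new_case)                 # Revise with the model
    case_repository.append((list(new_case), label))  # Retain the solved case
    return label


repository = [([0.0, 0.0], "IB"), ([0.1, 0.0], "IB"),
              ([0.0, 0.1], "NIB"), ([5.0, 5.0], "NIB")]
brf_stub = lambda features: "NIB"  # stand-in for a trained BRF model

reused = predict_adaptive([0.0, 0.0], repository, toy_similarity, brf_stub)
revised = predict_adaptive([10.0, 10.0], repository, toy_similarity, brf_stub)
```

After the second call the repository holds five cases, because the case solved by the fallback model is retained. This is how later predictions can benefit from earlier unseen problems.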

Fig. 3 Performance comparison by using different algorithms for influential blogger prediction

categories. After completing the whole process, the following outcomes are obtained. For the Activities, private thoughts and reflections dataset, the higher values of the performance measures are achieved with 85% variance under the Minkowski similarity measure in three iterations. This holds for all CBR-based classifiers, namely the suggested IBP-BBF, CBR-BBC, and CBR-RB. In the first iteration, many similarity measures are on the same level in terms of performance for this dataset. However, over three iterations it becomes clear that the Minkowski similarity keeps learning from new problems and contributes positively to the performance of the IBP-BBF framework. Moreover, the


outcomes show the strength of the proposed IBP-BBF framework compared with CBR-BBC and CBR-RB in terms of all performance measures: 91.4% Spec., 80% F-measure, 84% G.M., and 93.6% ROC area under the curve. Compared with the baseline methods, IBP-BBF performs almost the same as BRF for influential blogger classification; however, the former is slightly better than the latter in terms of Spec. and G.M. Furthermore, BRF is the best classifier among the aforesaid baseline methods. IB-CBR, by contrast, achieved the worst Spec. (43.7%) and G.M. (62.3%), which shows its weakness on an imbalanced dataset. Fig. 3 shows all these results.

In the case of the Human rights and Development dataset, the Minkowski similarity measure again performed well relative to the other similarity measures in terms of the performance measures over three iterations. The proposed IBP-BBF framework shows outstanding performance compared with the baseline methods as well as the competing merged classifiers for influential blogger classification. It obtained 97.2% Spec., 96.7% F-measure, 97% G.M., and 99.4% ROC area under the curve. Further, IB-CBR ranks second to last in terms of performance but still outperforms RUSBoost.

Likewise, high performance is seen on the Self-experience dataset with 85% variance and Minkowski distance similarity over three iterations. IBP-BBF beats its competitor algorithms with 95.8% Spec., 96.2% F-measure, 96.3% G.M., and 98.9% ROC area under the curve. Also, BRF remains at the top among the baseline methods for influential blogger identification. However, IB-CBR obtained the lowest values for the performance measures. Its performance is lower than that of the other methods, which is likely because this classifier is not designed for imbalanced datasets.

In the case of the Tourism dataset, we again achieve higher performance for influential blogger classification with 85% variance and Minkowski distance similarity. IBP-BBF outperforms the other methods with 99.8% Spec., 99.6% F-measure, 99.7% G.M., and 100% ROC area under the curve. In this case, IB-CBR and BBC performed similarly. BRF remains at the top among the baseline methods with respect to classifying bloggers as professional or otherwise. The results can be seen in Fig. 4. Likewise, Fig. 5 compares the prediction results of IBP-BBF with those of the baseline methods IB-CBR, RUSBoost, BRF, and BBC for the aforesaid three datasets.

On the other hand, in the case of Business and Digital Marketing, Minkowski similarity with 99% variance produces the better overall prediction results for IBP-BBF, with 94.8% Spec., 92.9% F-measure, 91.1% G.M., and 95.2% ROC area under the curve. Moreover, IBP-BBF outperforms the other baseline methods in terms of the performance measures for influential blogger classification. Again, BRF performed well among the baseline methods. Likewise, in the case of Poetry, Literature and Arts, the Minkowski similarity measure is again found to be better with 85% variance, and it keeps enhancing the performance of the proposed framework. IBP-BBF shows excellent performance for influential blogger identification, with 79.7% Spec., 86.6% F-measure, 81.6% G.M., and 91.6% ROC area under the curve. Among the baseline methods, BBC is found to be effective for influential blogger classification. The CBR-RB merger shows its worst performance on this dataset, as shown in Table 6.
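The performance measures reported above (Spec., F-measure, G.M.) can be illustrated with a short sketch. It assumes the standard confusion-matrix definitions, with G.M. taken as the geometric mean of sensitivity and specificity; the exact definitions used by the authors are not given in this section, and the function names `confusion_counts` and `imbalance_metrics` are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Specificity, F-measure and G-mean under the standard definitions
    (an assumption; the paper's own formulas are not shown in this section)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "specificity": specificity,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }
```

Under these definitions, a classifier that favors the majority class can still reach a high F-measure while its specificity and G-mean stay low, which is the pattern the IB-CBR rows show on the imbalanced datasets.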

Fig. 4 Performance measures comparison of merging contestants in the revise phase of the proposed IBP-BBF framework for influential blogger
identification.

Fig. 5 Performance Comparison of the proposed IBP-BBF framework with IB-CBR, RUSBoost, BRF, and BBC for influential blogger identification

TABLE 6. Performance measure comparison by using different algorithms

Classifier   Business and Digital Marketing            Poetry, Literature and Art
             Spec.    F-Measure   G.M.     ROC         Spec.    F-Measure   G.M.     ROC
IB-CBR       62.7%    93.0%       75.3%    94.7%       38.6%    91.2%       56.9%    89.0%
RUSBoost     58.4%    85.7%       69.4%    84.9%       39.6%    87.3%       56.4%    78.2%
BRF          92.5%    91.4%       91.5%    96.6%       79.7%    86.4%       81.5%    91.0%
BBC          81.1%    86.7%       83.0%    91.2%       79.8%    86.7%       81.7%    89.0%
IBP-BBF      94.8%    92.9%       91.1%    95.2%       79.7%    86.6%       81.6%    91.6%
CBR-RB       67.1%    92.1%       77.9%    90.7%       59.9%    89.2%       72.0%    85.0%
CBR-BBC      87.5%    88.7%       87.5%    93.7%       81.2%    86.3%       82.2%    87.5%

To further examine the performance of each similarity measure over a larger number of iterations, we investigated the efficiency of the proposed IBP-BBF framework for up to three iterations. This helps in determining a similarity measure that keeps improving the performance of the IBP-BBF framework on future instances of influential blogger prediction. The results show that the Minkowski similarity measure ranks above the other similarity measures over more iterations of the proposed framework. This holds for all datasets; as an illustration, Table 7 shows the outcomes for the Poetry, Literature, and Art dataset.

In the case of labeled data, the blogger features dataset (containing the questionnaire answers and the output class, professional blogger or non-professional blogger) is given directly as input to the proposed IBP-BBF framework. The results indicate that Minkowski maintained its top position in terms of the performance measures compared with the other similarity measures. It achieved the maximum values for Spec. (100%), F-measure (100%), G.M. (100%), and ROC area under the curve (100%) in all iterations. Although the values of the performance metrics for the other similarity measures increase in the second and third iterations of IBP-BBF compared with the first iteration, those results remain lower than what the Minkowski similarity measure achieves even in the very first iteration of the proposed framework. The results can be seen in Fig. 6.

Regarding the experimental outcomes of this research work in general, the proposed IBP-BBF remained at the top overall compared with the baseline methods and the CBR-based merging candidates, namely CBR-RB and CBR-BBC. Besides, the overall results on each dataset are better in terms of all performance measures with 85% variance, except for Business and Digital Marketing. It was stated earlier that the performance of a similarity measure can be related to the data dimensions [19], which indicates that different similarity measures may behave differently for different numbers of input features. Perhaps this is the reason behind the outstanding performance of the Minkowski similarity measure as compared

TABLE 7. Performance measures comparison for influential blogger identification by using different similarity measures with one, two and three iterations using Poetry, Literature, and Art dataset

Similarity Measure     No. of iterations   Spec.    F-measure   G.M.     ROC AUC
Cosine Distance        One iteration       81.1%    86.1%       82.0%    91.0%
                       Two iterations      55.1%    83.1%       66.0%    92.8%
                       Three iterations    57.9%    88.1%       70.2%    95.6%
Euclidean Distance     One iteration       79.7%    86.5%       81.6%    91.4%
                       Two iterations      57.1%    84.4%       67.9%    93.1%
                       Three iterations    54.7%    83.5%       65.8%    93.6%
Braycurtis Distance    One iteration       78.3%    86.4%       80.7%    91.0%
                       Two iterations      57.6%    84.6%       68.4%    92.8%
                       Three iterations    60.3%    89.3%       72.3%    95.9%
Canberra Distance      One iteration       79.7%    86.3%       81.4%    90.3%
                       Two iterations      55.3%    84.0%       66.5%    92.7%
                       Three iterations    53.4%    83.1%       64.8%    93.2%
Correlation Distance   One iteration       79.8%    86.7%       81.7%    91.8%
                       Two iterations      66.4%    85.8%       74.6%    93.4%
                       Three iterations    73.6%    90.7%       81.2%    95.7%
Chebyshev Distance     One iteration       81.1%    86.1%       82.0%    91.6%
                       Two iterations      54.6%    83.6%       65.8%    92.8%
                       Three iterations    54.2%    84.0%       65.7%    93.7%
Minkowski Distance     One iteration       79.7%    86.6%       81.6%    91.6%
                       Two iterations      89.5%    89.2%       88.8%    93.2%
                       Three iterations    96.1%    93.4%       94.6%    96.3%
CityBlock Distance     One iteration       79.8%    86.9%       81.8%    91.8%
                       Two iterations      55.9%    84.7%       67.2%    93.1%
                       Three iterations    55.2%    84.4%       66.6%    93.5%
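Several of the distance measures compared in Table 7 are closely related. The sketch below illustrates, under standard textbook definitions, how the Minkowski distance of order p reduces to the CityBlock (Manhattan) distance for p = 1 and to the Euclidean distance for p = 2, and approaches the Chebyshev distance as p grows. The function names are hypothetical, and the order p used in the experiments is not stated in this section, so nothing here should be read as the authors' configuration.

```python
import math


def minkowski(u, v, p=2.0):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def cityblock(u, v):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chebyshev(u, v):
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))
```

Because p is a free parameter, a Minkowski-based retrieval step can behave differently from both the Euclidean and CityBlock rows of the table whenever p is neither 1 nor 2.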

Fig. 6 Performance comparison of different algorithms for influential blogger identification by using Labeled dataset
to all other candidate similarity measures in three iterations of IBP-BBF. It has also been suggested previously that the Minkowski measure can be a useful candidate when the true distance metric is unknown [40]. Moreover, it is the

generalization of the Euclidean and Manhattan distance measures, and in the case of unstructured data such as blogs, where it seems difficult to identify a suitable similarity measure in advance, this increases the applicability of this measure. In this experimental study, it appears effective at learning data patterns and making future predictions, which is the reason for the gradual increase in the performance of the proposed IBP-BBF framework.

Additionally, CBR-BBC and CBR-RB rank second and third, respectively. It is important to highlight that previous authors have concluded that, in general, hybrid bagging (bagging combined with data sampling) is better than hybrid boosting (boosting combined with data sampling) for classification when the data is imbalanced and noisy [41]. In this research work, we have also found that the performance of BBC (the hybrid of bagging with random under-sampling) is comparatively better than that of RUSBoost (the hybrid of AdaBoost with random under-sampling) on all imbalanced datasets except Tourism, which shows that our results are consistent with previous findings. Furthermore, the integration of bagging and boosting with the CBR methodology (CBR-BBC and CBR-RB) follows the same pattern on all datasets except Tourism: the performance of the former remains better than that of the latter for influential blogger identification. This indicates that when hybrid bagging and hybrid boosting are each merged with case-based reasoning, the former is likely to be more effective than the latter in the focused domain. This is probably because bagging can avoid over-fitting, which is its strength over boosting. It is clearly stated in [42] that boosting may suffer from over-fitting as the number of iterations increases. These iterations are used to adjust an observation's weight with respect to the trees built during prior iterations. The generalization ability of an over-fitted boosting model decreases because it fits the training data too closely [43]. Bagging, on the other hand, uses several learners (unpruned trees) in the training phase to contribute to the final prediction, which seems less likely to over-fit. The selection of subsets in the training phase of these ensemble classifiers, and the method used to make predictions from the outputs of the N classifiers, contribute to their predictive capabilities. Bagging assigns equal weight to every classifier and predicts unseen instances from the responses of these N classifiers by majority voting. It repeatedly draws random subsets with replacement from the full training set, giving each instance an equal chance of being selected for a new subset. It keeps each model obtained from the training process for future data classification. Boosting, on the other hand, assigns performance-based weights to the N classifiers in the training phase by using the weighted average method. Instead of selecting subsets of the training data at random, it chooses new subsets made up of the instances that were misclassified by previous models. Because of this biased selection of subsets, the output of new models is affected, which leads to over-fitting.

Another finding of this research work is the excellent performance of BRF, which ranks at the top among the baseline methods. It is also the best-performing candidate for merging with CBR, compared with CBR-BBC and CBR-RB. It down-samples the majority class in the training phase when selecting subsets of the full training set, which prevents the majority class from dominating the classifier by removing majority class examples [34]. It focuses on the equal representation of each class in each decision tree by artificially modifying the class distribution, which is likely to contribute positively to its predictive capabilities.

The most important finding to highlight is that CBR-BRF, CBR-RB, and CBR-BBC perform better than the baseline methods, i.e., the BRF, RUSBoost, and BBC classifiers, in the revise phase of IBP-BBF. This indicates that merging case-based reasoning with these baseline methods contributes positively to the efficiency of a classifier for influential blogger identification, provided the data is imbalanced. The working strategy of CBR, i.e., learning from previous experiences and storing former problem-solution pairs for future prediction, adds to its significance for predictions.

As stated earlier, the performance of the adaptive algorithm IB-CBR is also explored, since it was at the top in our previous experiments on influential blogger identification [19]. Unfortunately, its overall performance is not satisfactory relative to the other classifiers in the context of imbalanced data. Because the authors embed RF in the revise phase of IB-CBR, there is a significant probability that it badly affects the learning process in the case of imbalanced data. The reason is that the subsets used to train a classifier on an imbalanced dataset may contain few or none of the minority class examples [34]. Since the decision trees in RF are built on such training subsets [44], this can lead to poor performance in classifying unseen examples of the minority class, which negatively affects the efficiency of IB-CBR.

A. Proof of concept
In this section, we discuss the possible reasons behind the performance of the algorithms under investigation. It was seen earlier that RUSBoost is a simpler and faster algorithm with favorable classification performance [37]. However, BRF beats RUSBoost for influential blogger identification in this study. RUSBoost, being a forest of stumps (i.e., decision trees with only one split), gives more weight to the stumps with fewer prediction errors and does not treat all stumps equally in the final prediction. This biased working nature can lead RUSBoost to over-fit, which is probably the cause of its lower classification performance compared with BRF, which gives equal importance to each decision tree in the final prediction.

In the literature, RF is not considered a suitable option for data classification if the data is imbalanced [34], because it minimizes the overall error rate and puts more emphasis on accurately classifying the majority class examples, leading to poor accuracy on minority class examples. The poor performance of IB-CBR is evidence of this. On the other hand, BRF produced outstanding results for influential blogger identification owing to its selection of balanced down-sampled data and its modified node-splitting criterion. At each iteration, BRF draws a bootstrap sample from the minority class and randomly draws the same number of cases, with replacement, from the majority class. Moreover, it uses the CART algorithm, in which each node is split using the best variable among a randomly selected subset of variables rather than the best variable among all of them (while building decision trees). After repeating the above process as many times as desired, the final prediction is made by aggregating the predictions of the ensemble. In this study, we found the BRF classifier to be better for influential blogger classification when merged with the CBR approach (i.e., the IBP-BBF framework) than all previously highlighted leading baseline methods for imbalanced data classification. Also, the success of standard BRF relative to its competitor classifiers in this problem domain probably confirms the value of the modification made to standard RF by the authors of [34].

In bagging, instead of choosing the best node-split feature from a subset of randomly selected features, all features are considered for this purpose. The features have different levels of information gain for correct data classification and cannot be equally important [45]. Most probably, this working nature of bagging is the reason behind the lower performance of BBC for influential blogger prediction compared with BRF.

VI. CONCLUSION AND FUTURE WORK
This study presents an influential blogger prediction framework based on CBR using BRF, which provides adaptive capabilities for identifying influential bloggers when the data is imbalanced. Such leading individuals can be hired by companies to market their products to the relevant audiences, which can lead to their success. This study attempts to map standard autonomic system characteristics, such as self-prediction and self-adaptation, onto the influential blogger identification problem. In this way, both the labeled and the unstructured data of bloggers can be used to predict influential bloggers using the autonomic characteristics of IBP-BBF. The outcome of the suggested framework is compared with state-of-the-art imbalanced data classification techniques such as BRF, BBC, and RUSBoost. The results indicate that the CBR approach of the proposed framework outperformed the other competing classifiers for influential blogger prediction and adaptation. Further, this study demonstrates the successful merging of the CBR approach with the aforesaid classifiers, instead of using them in a standard way, for better classification results. Because unseen future problems can be solved by IBP-BBF without an explicitly trained non-adaptive model, the CBR approach is a better fit for our problem.

This study is an initiative towards handling labeled and unstructured data collectively for the prediction of influential bloggers. However, the scope of the proposed method in terms of performance evaluation is limited to imbalanced data, and the results may differ for balanced data. Also, the selection of another dimensionality reduction method can influence the experimental outcomes.

In the future, the scope of this research can be extended to evaluate the efficiency of the proposed solution in terms of classification speed. This may include adaptive maintenance of the case repository as its size grows when new unseen problems are saved with their prescribed solutions. A comparison of feature selection techniques other than PCA with respect to the performance gain of IBP-BBF is also under investigation. The investigation of Stanford GloVe and Google BERT instead of Word2vec for the vector representation of words is also in the pipeline. We also aim to explore the significance of deep neural networks in this problem domain.

Conflict of interest
None.

Acknowledgments
The authors would like to thank Mr. Abdul Majeed from Systems Limited, Pakistan, for his useful technical discussion and for providing language help.

REFERENCES
[1] Y. Asim, A. K. Malik, B. Raza, W. Naeem, and S. Rathore, "Community-centric brokerage-aware access control for online social networks," Future Generation Computer Systems, vol. 109, pp. 469-478, 2020.
[2] T. Araujo, P. Neijens, and R. Vliegenthart, "Getting the word out on Twitter: the role of influentials, information brokers and strong ties in building word-of-mouth for brands," International Journal of Advertising, vol. 36, no. 3, pp. 496-513, 2017.
[3] M. Han, M. Yan, Z. Cai, Y. Li, X. Cai, and J. Yu, "Influence maximization by probing partial communities in dynamic online social networks," Transactions on Emerging Telecommunications Technologies, vol. 28, no. 4, 2017.
[4] Y. Yang and J. Pei, "Influence Analysis in Evolving Networks: A Survey," IEEE Transactions on Knowledge and Data Engineering, 2019.
[5] Y. Asim, A. K. Malik, B. Raza, and A. R. Shahid, "A trust model for analysis of trust, influence and their relationship in social network communities," Telematics and Informatics, vol. 36, pp. 94-116, 2019.
[6] R. Bian, Y. S. Koh, G. Dobbie, and A. Divoli, "Identifying Top-k Nodes in Social Networks: A Survey," ACM Computing Surveys (CSUR), vol. 52, no. 1, pp. 22, 2019.
[7] N. HAFIENE, W. KAROUI, and L. B. ROMDHANE, "Influential nodes detection in dynamic social networks: A Survey," Expert Systems with Applications, pp. 113642, 2020.

[8] K. Zhao and A. Kumar, "Who blogs what: understanding the publishing behavior of bloggers," World Wide Web, vol. 16, no. 5-6, pp. 621-644, 2013.
[9] H. U. Khan, A. Daud, U. Ishfaq, T. Amjad, N. Aljohani, R. A. Abbasi, et al., "Modelling to identify influential bloggers in the blogosphere: A survey," Computers in Human Behavior, vol. 68, pp. 64-82, 2017.
[10] H. U. Khan and A. Daud, "Finding the top influential bloggers based on productivity and popularity features," New Review of Hypermedia and Multimedia, pp. 1-18, 2016.
[11] F. S. Gharehchopogh, S. R. Khaze, and I. Maleki, "A new approach in bloggers classification with hybrid of k-nearest neighbor and artificial neural network algorithms," Indian Journal of Science and Technology, vol. 8, no. 3, pp. 237-246, 2015.
[12] F. S. Gharehchopogh and S. R. Khaze, "Data mining application for cyber space users tendency in blog writing: a case study," International Journal of Computer Applications, vol. 47, no. 18, pp. 40-46, 2013.
[13] N. A. Samsudin, A. Mustapha, and M. H. A. Wahab, "Ensemble classification of cyber space users tendency in blog writing using random forest," in Innovations in Information Technology (IIT), 2016 12th International Conference on, Al-Ain, United Arab Emirates, 2016, pp. 169-172.
[14] C. Fullwood, K. Melrose, N. Morris, and S. Floyd, "Sex, blogs, and baring your soul: factors influencing UK blogging strategies," Journal of the Association for Information Science and Technology, vol. 64, no. 2, pp. 345-355, 2013.
[15] B. Quadir and N.-S. Chen, "The effects of reading and writing habits on blog adoption," Behaviour & Information Technology, vol. 34, no. 9, pp. 893-901, 2015.
[16] H.-M. Lai and C.-P. Chen, "Factors influencing secondary school teachers’ adoption of teaching blogs," Computers & Education, vol. 56, no. 4, pp. 948-960, 2011.
[17] H. Hassani, C. Beneki, S. Unger, M. T. Mazinani, and M. R. Yeganegi, "Text Mining in Big Data Analytics," Big Data and Cognitive Computing, vol. 4, no. 1, pp. 1, 2020.
[18] F. Li and T. C. Du, "Maximizing micro-blog influence in online promotion," Expert Systems with Applications, vol. 70, pp. 52-66, 2017.
[19] Y. Asim, B. Raza, A. K. Malik, A. R. Shahaid, and H. Alquhayz, "An Adaptive Model for Identification of Influential Bloggers Based on Case-Based Reasoning Using Random Forest," IEEE Access, vol. 7, pp. 87732-87749, 2019.
[20] Y. Asim, A. R. Shahid, A. K. Malik, and B. Raza, "Significance of machine learning algorithms in professional blogger's classification," Computers & Electrical Engineering, vol. 65, pp. 461-473, 2018.
[21] Y. Asim, B. Raza, A. K. Malik, S. Rathore, and A. Bilal, "Improving the Performance of Professional Blogger’s Classification," presented at the International Conference on Computing, Mathematics and Engineering Technologies, Sukkar IBA, Pakistan, 2018.
[22] P. Branco, L. Torgo, and R. P. Ribeiro, "A survey of predictive modeling on imbalanced domains," ACM Computing Surveys (CSUR), vol. 49, no. 2, pp. 1-50, 2016.
[23] M. A. U. H. Tahir, S. Asghar, A. Manzoor, and M. A. Noor, "A classification model for class imbalance dataset using genetic programming," IEEE Access, vol. 7, pp. 71013-71037, 2019.
[24] S. Cateni, V. Colla, and M. Vannucci, "A method for resampling imbalanced datasets in binary classification tasks for real-world problems," Neurocomputing, vol. 135, pp. 32-41, 2014.
[25] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Transactions on knowledge and data engineering, vol. 21, no. 9, pp. 1263-1284, 2009.
[26] A. Orriols-Puig, E. Bernadó-Mansilla, D. E. Goldberg, K. Sastry, and P. L. Lanzi, "Facetwise analysis of XCS for problems with class imbalances," IEEE Transactions on Evolutionary Computation, vol. 13, no. 5, pp. 1093-1119, 2009.
[27] S. García and F. Herrera, "Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy," Evolutionary computation, vol. 17, no. 3, pp. 275-306, 2009.
[28] M. J. Khan, M. M. Awais, S. Shamail, and I. Awan, "An empirical study of modeling self-management capabilities in autonomic systems using case-based reasoning," Simulation Modelling Practice and Theory, vol. 19, no. 10, pp. 2256-2275, 2011.
[29] B. Raza, A. Aslam, A. Sher, A. K. Malik, and M. Faheem, "Autonomic performance prediction framework for data warehouse queries using lazy learning approach," Applied Soft Computing, pp. 106216, 2020.
[30] N. Shaheen, B. Raza, A. R. Shahid, and H. Alquhayz, "A Novel Optimized Case-based Reasoning approach with K-means Clustering and Genetic Algorithm for Predicting Multi-class Workload Characterization in Autonomic Database and Data Warehouse System," IEEE Access, vol. 8, pp. 105713-105727, 2020.
[31] S. W. VanderStoep and D. D. Johnson, Research methods for everyday life: Blending qualitative and quantitative approaches, 1 ed. vol. 32. San Francisco: John Wiley & Sons, 2008.
[32] P. SanMiguel and T. Sádaba, "Nice to be a fashion blogger, hard to be influential: An analysis based on personal characteristics, knowledge criteria, and social factors," Journal of global fashion marketing, vol. 9, no. 1, pp. 40-58, 2018.
[33] M. Taki, "Bloggers and the Blogosphere in Lebanon & Syria: meanings and activities," Ph.D., Media, Art and Design department, University of Westminster, UK, 2010.
[34] C. Chen, A. Liaw, and L. Breiman, "Using Random Forest to Learn Imbalanced Data," 2004.
[35] L. Breiman, "Bagging predictors," Machine learning, vol. 24, no. 2, pp. 123-140, 1996.
[36] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 4, pp. 463-484, 2011.
[37] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 40, no. 1, pp. 185-197, 2009.
[38] I. Watson, "Case-based reasoning is a methodology not a technology," Knowledge-based systems, vol. 12, no. 5, pp. 303-308, 1999.
[39] M. Al‐Maitah, "Text analytics for big data using rough–fuzzy soft computing techniques," Expert Systems, vol. 36, no. 6, pp. e12463, 2019.
[40] B. Lu, M. Charlton, C. Brunsdon, and P. Harris, "The Minkowski approach for choosing the distance metric in geographically weighted regression," International Journal of Geographical Information Science, vol. 30, no. 2, pp. 351-368, 2016.
[41] T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "Comparing boosting and bagging techniques with noisy and imbalanced data," IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 41, no. 3, pp. 552-568, 2010.
[42] P. Bühlmann and T. Hothorn, "Boosting algorithms: Regularization, prediction and model fitting," Statistical Science, vol. 22, no. 4, pp. 477-505, 2007.
[43] Y. Ganjisaffar, R. Caruana, and C. V. Lopes, "Bagging gradient-boosted trees for high precision, low variance ranking models," in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, Beijing, China, 2011, pp. 85-94.
[44] I. Ullah, B. Raza, A. K. Malik, M. Imran, S. U. Islam, and S. W. Kim, "A churn prediction model using random forest: analysis of machine learning techniques for churn prediction and factor identification in telecom sector," IEEE Access, vol. 7, pp. 60134-60149, 2019.

[45] T. M. Mitchell, "Decision Tree learning," in Machine Learning, ed: McGraw-Hill, 1999, pp. 52–78.

YOUSRA ASIM received the BS degree in software engineering from Fatima Jinnah Women University, Rawalpindi, Pakistan, in 2006 and the MSCS degree from Kinnaird College, Lahore, Pakistan, in 2008. She has been teaching as a Lecturer since 2009 and has been working as an Assistant Professor of Computer Science at Govt. College for Women Sihal since 2016. She is a Ph.D. student in the CS department at COMSATS University Islamabad (CUI), Islamabad, Pakistan. Her research interests are Social Networks, Influential Nodes, Data Mining, Machine Learning and Privacy. engryousraasim@gmail.com

AHMAD KAMRAN MALIK is working as an Assistant Professor at COMSATS University Islamabad (CUI), Islamabad, Pakistan. He received his Ph.D. (Computer Sc.) from the Vienna University of Technology (TU-Wien), Austria. He is the author of a book and has published a number of research papers in international journals and conferences. His research interests are focused on Data Science, Social Network Analysis, and Information Security. He is interested in data analysis and prediction using Data Science techniques, particularly on graph and network data. ahmad.kamran@comsats.edu.pk

BASIT RAZA is working as an Assistant Professor in the Department of Computer Science, COMSATS University Islamabad (CUI), Islamabad, Pakistan. He received his Ph.D. (Computer Science) degree in 2014 from International Islamic University, Islamabad, Pakistan. He has published a number of conference and journal papers of international repute. His research interests are Database Management Systems, Security and Privacy, Data Mining, Data Warehousing, Machine Learning and Artificial Intelligence. basit.raza@comsats.edu.pk

AHMAD R. SHAHID is currently working as an Assistant Professor at COMSATS University Islamabad (CUI), Islamabad, Pakistan. He received his Ph.D. in Computer Science from York, UK, in 2012. During his Ph.D. he worked on automatically building a WordNet for four languages, namely English, German, French and Greek. Since his Ph.D., he has been working in the areas of Computer Vision and Pattern Recognition, Machine Learning, and Natural Language Processing. A few of the problems that he has worked on include cancer detection, pedestrian detection, driver fatigue detection, and data mining. ahmadrshahid@comsats.edu.pk

NAFEES QAMAR leads the Health Informatics programs as the Program Director and tenure-track faculty at GSU. Dr. Qamar holds a Ph.D. in computer science from University Grenoble Alpes, France. He has worked as a visiting assistant professor in the Biomedical and Health Informatics program, Department of Computer Science at State University of New York at Oswego. He was also a Postdoctoral Research Fellow and Project Manager at United Nations University, and an Academic Researcher at Vanderbilt University, Nashville. He holds an honorary Research Fellow position at Southwest University, China. Dr. Qamar's broad research and teaching interests include computer science applications, software and data security. He particularly investigates the applications of computing science in the areas of biomedical, health informatics and patient privacy. He has published several research articles in health informatics journals and conferences. mqamar@govst.edu
