Chapter 63
Analysis for Malicious URLs Using Machine Learning and Deep Learning Approaches

Santosh Kumar Birthriya and Ankit Kumar Jain
Department of Computer Engineering, National Institute of Technology, Kurukshetra, India
e-mail: santoshOxb@gmail.com; ankitjain2407@gmail.com

1 Introduction

Malicious URLs are a common and serious cyber security threat. Misleading URLs host unsolicited content and cause unsuspecting users to suffer losses running into millions of dollars every year, so these threats must be identified and dealt with in a timely manner. URL detection has traditionally been done mostly with blacklists; however, blacklists can never be comprehensive and cannot detect previously unseen malicious URLs. In recent years, machine learning techniques have therefore been studied increasingly in order to improve the generality of malicious URL detectors. A huge phishing attack in May 2017 hit Google and involved millions of Gmail users, in which the attackers gained access to users' email histories. The attackers could then send emails that appeared to come from a known source and ask the recipients to review an attached file.

A URL is the worldwide address of a document or resource. Its two major parts are (1) the protocol identifier and (2) the resource address, specified as an IP address or domain name. The protocol identifier and the resource name are separated by a colon and two forward slashes, as illustrated in Fig. 1. Compromised URLs used for cyber-attacks are considered malicious URLs. A malicious URL or Web site hosts a range of unsolicited content such as spam, phishing pages or drive-by downloads used to launch attacks [1]. The popular types of malicious attacks include drive-by downloads, social media attacks, phishing and spamming [2].

Fig. 1 Uniform resource locator: protocol, hostname with top-level domain (primary domain) and path, e.g., http://www.google.com/webhp?gh=39=89 or http://208.252.0.4/webhp?gh=39=89

Although machine learning can be used instead of blacklisting, several problems remain [3]:

* Traditional URL approaches do not identify sequential patterns and interactions between the characters of a URL.
* Traditional machine learning models depend on manual feature engineering, and feature-based methods are not safe in an adversarial environment.
* Unseen features cannot be handled, and the models do not generalize to new test data.

Deep learning uses a number of hidden layers in which every layer learns a different level of abstraction. The contributions of this paper are as follows:

* Detailed research and analysis are carried out for malicious URL detection on numerous benchmark deep learning architectures.
* In the experimental analysis, different types of datasets are used to determine the generalizability of the models.
* Experiments with different deep learning models are shown for character-level embedding and n-gram representation.

The paper is structured as follows: Sect. 2 presents the related work on URL detection. Section 3 discusses malicious URL detection using machine learning, and Sect. 4 reviews the machine learning classifiers used. Section 5 discusses malicious URL detection using deep learning, Sect. 6 discusses the open problems, and Sect. 7 concludes the paper.
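For illustration only, the following minimal Python sketch (not part of the original chapter) shows how a URL string such as those in Fig. 1 can be decomposed into the protocol, host and path components described above, and how a few simple lexical features can be derived from them; the specific features chosen here are illustrative assumptions rather than the feature set used by any of the surveyed systems.

```python
# Hypothetical illustration: split a URL into the components described above
# and derive a few simple lexical features. The specific features are
# assumptions for demonstration, not the feature set of any surveyed system.
from urllib.parse import urlparse
import re

def lexical_features(url: str) -> dict:
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "protocol": parsed.scheme,                 # identifying protocol
        "host": host,                              # domain name or IP address
        "path": parsed.path,
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_path_tokens": len([t for t in parsed.path.split("/") if t]),
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
    }

print(lexical_features("http://www.google.com/webhp?gh=39=89"))
print(lexical_features("http://208.252.0.4/webhp?gh=39=89"))
```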
2 Related Work

Numerous studies in the literature approach this problem from a machine learning point of view. That is to say, they compile a list of URLs categorized as malicious or benign and describe each URL with the aid of a set of attributes. De las Cuevas et al. [4] reported classification rates of about 96%, which rose to 97% after a preprocessing step of rough-set feature selection that reduced the original 12 features to 9. This resulted in an imbalanced classification problem, which was solved by under-sampling. In total, after eliminating duplicates, 57,000 URL instances were used.

Kan and Thi [5] categorized Web pages not by their content but through their URLs, which is much faster because there is no delay in fetching or reading the text of the Web site. The URL is split into several tokens, from which classification features are extracted. The authors showed that combining high-quality URL segmentation with feature extraction improved classification over various baseline techniques.

Ma et al. [6] developed an approach to identify malicious Web sites in light of the wealth of information that the lexical and host-based features of their URLs provide about the nature of a site. Their system screens and classifies essential URL components and metadata across hundreds of thousands of examples without the need for high-level domain expertise. The technique was tested on up to 30,000 cases, with promising results, namely a very high detection rate (95-99%) and a minimal false positive rate.

Zhao and Hoi [7] address the (typical) class imbalance in the problem of malicious URL detection by using their Cost-Sensitive Online Active Learning (CSOAL) framework. CSOAL queries only a small fraction of the available data (about 0.5% of one million cases) and explicitly optimizes two cost-sensitive measures to address the class imbalance.

3 Malicious URL Detection Using Machine Learning

Machine learning approaches take the URL as input and decide, from its statistical characteristics, whether it should be marked as malicious or benign. The availability of training data is a main requirement for training a machine learning model. In general, machine learning can be divided into supervised, unsupervised and semi-supervised settings, which refer to training data that is fully labeled, unlabeled, or labeled only for a small part of the data; a label indicates whether a URL is harmful or benign. After the training data is collected, the next step is to obtain informative features that adequately describe the URL while representing it mathematically so that machine learning models can interpret it. For example, as Fig. 2 suggests, using the raw URL string alone would prevent us from learning a good prediction model. Therefore, appropriate features based on certain values or heuristics must be extracted to obtain a good representation of URLs. These can include lexical and host-based characteristics, among others. Having relevant features is essential, because the underlying assumption of machine learning methods is that harmful and benign URLs have distinct feature distributions. Therefore, the accuracy of the features representing URLs is important for the quality of the predicted model. The next step in the development of the prediction model is to actually train the model on the properly represented training data. There are several well-known machine learning algorithms that can be used to address malicious URLs.

Fig. 2 Malicious URL detection using machine learning (model training)
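As a sketch of the pipeline just described (collect labeled URLs, extract features, train a classifier), the following minimal Python example uses scikit-learn; the tiny URL list, its labels and the choice of character n-gram features are assumptions made purely for illustration.

```python
# Hypothetical sketch of the pipeline above: labeled URLs -> features -> model.
# The URL list, labels and character n-gram features are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

urls = [
    "http://www.google.com/webhp",
    "http://en.wikipedia.org/wiki/URL",
    "http://208.252.0.4/webhp",
    "http://paypal.secure-login.example/verify",
]
labels = [0, 0, 1, 1]  # 0 = benign, 1 = malicious (toy labels)

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
model.fit(urls, labels)
print(model.predict(["http://203.0.113.7/login"]))
```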
This section classifies and reviews the learning methods that have been employed for this task and allows appropriate machine learning methods to be chosen to solve particular problems. The machine learning classifiers are categorized as follows.

4 K-Nearest Neighbor (KNN) Classifier

The k-nearest neighbor (KNN) algorithm is an instance-based machine learning algorithm. It is a lazy classifier that generalizes only when an instance is required to be classified. KNN is based on the concept of Euclidean distance: each instance of the dataset is represented as a point in a D-dimensional space. When an instance needs to be classified, the nearest points in this space are identified and the instance is assigned the class of its nearest neighbors. In general, more than one neighbor is used for classification; in KNN, k is the number of neighbors that are considered [8]. The equation for computing the distance d(x_p, x_q) between two instances x_p and x_q is shown in Eq. (1):

d(x_p, x_q) = \sqrt{\sum_{k=1}^{D} (a_k(x_p) - a_k(x_q))^2}    (1)

Here, a_k(x_p) is the kth attribute of instance x_p [8]. This is the concept of the k-nearest neighbor algorithm.

4.1 Naive Bayes

Naive Bayes is a set of classification algorithms based on Bayes' theorem. It is not a particular algorithm but a family of algorithms that all share a common assumption, namely that every pair of features is conditionally independent. The Naive Bayes classifier belongs to the probabilistic family and is used to categorize sample data.

Bayes' theorem: a Bayesian classifier analyzes dependent events, so the probability of a future event can be predicted from previous events of the same nature. For example, if certain words appear in phished emails but not in ham, then an email containing them is probably phished [9]. Naive Bayes is a group of classification algorithms based on Bayes' theorem, given by:

P(M|N) = P(N|M) P(M) / P(N)    (2)

P(M|N) = P(N_1|M) \times P(N_2|M) \times P(N_3|M) \cdots P(N_n|M) P(M)    (3)

where P(M|N) is the posterior probability of a class (M, phished or ham) given the predictor (N, word vectors), P(M) is the prior probability of the class, P(N|M) is the likelihood of the predictor given the class, and P(N) is the prior probability of the predictor.

The Naive Bayes procedure is:

* For each attribute, calculate the probabilities conditioned on the class value, and use the product rule to obtain the joint conditional probability of the attributes.
* Use Bayes' rule to determine the conditional probabilities of the class variable.
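The short Python sketch below (an illustrative assumption, not code from the chapter) implements the Euclidean distance of Eq. (1) and fits scikit-learn's k-nearest neighbor and Gaussian Naive Bayes classifiers on a toy feature matrix; the feature values and labels are invented for demonstration.

```python
# Hypothetical toy example for Eq. (1) and the two classifiers above.
# Feature values (e.g. URL length, dot count, uses-IP flag) and labels are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def euclidean_distance(x_p, x_q):
    # Eq. (1): d(x_p, x_q) = sqrt(sum_k (a_k(x_p) - a_k(x_q))^2)
    return np.sqrt(np.sum((np.asarray(x_p) - np.asarray(x_q)) ** 2))

X = np.array([[54, 3, 0], [23, 1, 0], [88, 6, 1], [95, 7, 1]], dtype=float)
y = np.array([0, 0, 1, 1])  # 0 = benign, 1 = malicious

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
nb = GaussianNB().fit(X, y)

query = np.array([[90.0, 5.0, 1.0]])
print(euclidean_distance(X[0], query[0]))  # distance of query to first instance
print(knn.predict(query), nb.predict(query))
```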
4.2 Support Vector Machines (SVM)

Support vector machines (SVMs) are supervised algorithms that are popular as fast and efficient text classification algorithms. Based on the given training set, an SVM produces a hyperplane that best separates the classes; this decision boundary is called the hyperplane. For phishing detection, the input is expressed by a number of features, for example whether a certain word is present, and the output of 1 or -1 indicates whether the email is phished or not phished [9].

The goal is to find the optimum separating hyperplane between two classes by maximizing the margin between the closest points of the classes. Suppose we have a linear discriminant function and two linearly separable groups with target values +1 and -1 [10]. The hyperplane equation is w^T x_i + c = 0, and the discriminating hyperplane will satisfy

w^T x_i + c \ge 0  if y_i = +1    (4)

w^T x_i + c < 0  if y_i = -1    (5)

Fig. 3 Support vector machine: the separating hyperplane with its margin and support vectors

Now, the distance from any point x_i to the hyperplane is |w^T x_i + c| / ||w||, and the distance of the hyperplane from the origin is |c| / ||w||. As shown in Fig. 3, the boundary points are referred to as support vectors, and the hyperplane in the center of the margin is the optimum separating hyperplane, maximizing the space between the classes. Even though SVMs are very strong and often used in classification, they have several disadvantages: training requires heavy computation, and they are also vulnerable to noisy data and therefore prone to overfitting.

4.3 Linear Regression

A linear regression algorithm attempts to create an equation that estimates the value of a real-valued attribute [11]. This is done using the least-squares fitting method. It is also possible to adapt this real-valued approach to predict discrete classes by applying a threshold to the output.

4.4 Decision Tree

Decision trees are among the most widely used supervised techniques and can be represented graphically. The attributes over which the tree splits can be either discrete or continuous, and each split is selected to reduce the variance of the target variable. The input of a decision tree is typically an object or scenario described by a certain set of properties, and the output is generally a decision that says either YES or NO [12].

4.5 Logistic Regression

Logistic regression is an effective regression analysis to conduct when the dependent variable is binary. Like other regression analyses, logistic regression is a predictive analysis [9]. It can be applied for modeling and solving binary classification problems, and it outputs a score that measures the probability of an event.

4.6 Random Forests

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They construct a multitude of decision trees at training time and output the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct the tendency of decision trees to overfit their training set [9].

A random forest is a classifier combining a number of tree predictors, in which every tree is grown from independently sampled random vector values. Random forests can manage a large number of variables in a dataset. They also provide an internal, unbiased estimate of the generalization error during the forest construction process. The lack of reproducibility is the major disadvantage of random forests, since the forest creation process is random [9].

4.7 Adaptive Boosting (AdaBoost)

AdaBoost is a sequential learning algorithm whose main aim is to improve the performance of a base learning algorithm; it is mostly used for classification. This is achieved by building a strong classifier from a sequence of many weak classifiers. AdaBoost combined with decision trees is often regarded as the best out-of-the-box classifier [9].
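To make the classifiers of this section concrete, the following Python sketch (illustrative only, with synthetic data standing in for real URL features) trains several of them with scikit-learn and compares cross-validated accuracy; nothing here reproduces results from the cited studies.

```python
# Hypothetical comparison of the classifiers from Sect. 4 on synthetic data
# standing in for URL feature vectors; the scores illustrate the API only.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # placeholder URL feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic benign/malicious labels

models = {
    "SVM": SVC(kernel="linear"),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(max_depth=5),
    "Random forest": RandomForestClassifier(n_estimators=100),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
}
for name, clf in models.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```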
5 Malicious URL Detection Using Deep Learning

In recent years, the field of deep learning [13] has become the method of choice for many data mining projects and machine learning applications. Deep learning attempts to learn discriminative features directly from (often unstructured) data. This can help us reduce painful feature engineering for malicious URL detection and develop models without any domain expertise.

As a natural extension of NLP-based deep learning models, malicious URL detection models are successful because they can draw features from unstructured text data. Consequently, this task has been carried out with convolutional neural networks (CNNs) [14]. In order to identify patterns of characters that occur together, eXpose applied character-level convolutional networks to the URL string. The model did not employ word encoding, so, in contrast with previous methods that relied heavily on bag-of-words features, it did not suffer from an explosion in the number of features. Nevertheless, the model performed better than SVMs trained on bag-of-words features [15].

URLNet established an end-to-end system that applies both character-level and word-level convolutions to the URL and demonstrated empirically that combining word- and character-level information improves performance. Many parallel efforts with CNNs have made further improvements on this problem, for example using CNNs to identify inconsistencies in a session in which compromised benign sites forward users to malicious URLs, and attempting to encode character information using advanced vector embeddings. Following this, deep learning models (LSTMs and CNNs) were used for the detection of algorithmically generated domain names. Several LSTM extensions have recently been studied, in which CNN- and LSTM-based models were fitted with an attention mechanism and bidirectional LSTMs were used [16, 17].

A malicious URL detection system using deep learning at the Ethernet LAN level is shown in Fig. 4; it is referred to as DeepURLDetect. DeepURLDetect is an in-house hybrid model combining convolutional and long short-term memory layers. This module can be added to an existing modular cyber-threat situational awareness system to increase the detection rate of malicious URLs. The architecture is composed of three main modules: (1) data collection, (2) malicious URL identification and (3) continuous monitoring.

A distributed log collector passively collects URL logs from various sources within an Ethernet LAN and passes them on to a distributed database. Using a distributed log parser, the URLs are then interpreted and fed into the deep learning framework, which classifies each URL as either malicious or legitimate. A backup of the preprocessed URLs is saved in the distributed database for further use. A front-end broker exposes the deep learning module so that comprehensive URL analysis information can be viewed. The framework also provides a continuous monitoring module to track the malicious URLs detected [18] (Fig. 4).

Fig. 4 Malicious URL detection using deep learning [18]: raw logs from the Ethernet LAN flow through a distributed log collector and a distributed log parser into DeepURLDetect, with a front-end broker and a continuous monitoring module
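For readers who want a concrete starting point, here is a minimal character-level CNN sketch in Keras, loosely in the spirit of the character-level models discussed above; it is not the eXpose, URLNet or DeepURLDetect architecture, and the vocabulary size, layer sizes and toy data are assumptions made for illustration.

```python
# Hypothetical character-level CNN sketch (Keras); not the eXpose, URLNet or
# DeepURLDetect architecture. Vocabulary, layer sizes and data are placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, VOCAB = 200, 128  # pad/truncate URLs to 200 chars; ASCII vocabulary

def encode(url: str) -> np.ndarray:
    ids = [min(ord(c), VOCAB - 1) for c in url[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

urls = ["http://www.google.com/webhp", "http://208.252.0.4/login.php"]
X = np.stack([encode(u) for u in urls])
y = np.array([0, 1])  # toy labels: 0 = benign, 1 = malicious

model = keras.Sequential([
    layers.Embedding(input_dim=VOCAB, output_dim=32),    # character embeddings
    layers.Conv1D(filters=64, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                # malicious probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)
print(model.predict(X, verbose=0))
```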
6 Open Problems

Real-world URL data is a large-volume, high-velocity data stream. As of August 2012, search engines indexed over 30 trillion distinct URLs and crawled on the order of 20 billion Web sites each day, so a malicious URL detection model is nearly impossible to train on all URL data.

Many current machine learning methods for malicious URL detection use supervised training, which requires labeled training data containing both benign and malicious URLs. Labels may be collected by asking human experts to mark URLs or by drawing on blacklists/whitelists (which are themselves often labeled manually). Unfortunately, compared with all URLs available on the Web, the size of this labeled data is small. In addition to the high volume and speed of URLs and the very high-dimensional features, this poses a practical challenge to training a classification model. Some specific learning methods, such as feature reduction and limited learning, have been studied, but they do not solve the problem effectively.

Another difficulty is concept drift, where malicious URLs change over time as new threats and attacks emerge; machine learning techniques are required to handle the drifting concept whenever new attacks appear.

An important research objective is to understand what makes a URL malicious and to verify which URL characteristics contribute toward a URL being malicious or benign. This is particularly difficult when using deep learning models, which often function as black boxes.
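One common way to cope with the data volume and concept drift mentioned above is incremental (online) learning, where the model is updated batch by batch as newly labeled URLs arrive. The sketch below shows this with scikit-learn's partial_fit; the streaming batches are synthetic placeholders, and this is not the CSOAL method or any system from the cited work.

```python
# Hypothetical sketch of incremental (online) updating with partial_fit, one way
# to keep a model current as newly labeled URLs arrive; batches are synthetic.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # 0 = benign, 1 = malicious

rng = np.random.default_rng(0)
for day in range(5):  # pretend each batch is one day of newly labeled URLs
    X_batch = rng.normal(size=(100, 10))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(rng.normal(size=(3, 10))))
```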
7 Conclusion

Malicious URL identification plays a major role in several cybersecurity applications, and machine learning clearly points to a promising direction. In this paper, the identification of malicious URLs using machine learning and deep learning techniques was thoroughly and systematically surveyed. In particular, we provided a systematic formulation of malicious URL detection from a machine learning point of view and then discussed the existing URL detection studies, particularly in the form of the development of new feature representations and the creation of new learning algorithms to solve malicious URL detection tasks.

References

1. Liang B, Huang J, Liu F, Wang D, Dong D, Liang Z (2009) Malicious web pages detection based on abnormal visibility recognition. In: 2009 international conference on e-business and information system security. IEEE, pp 1-5
2. Patil DR, Patil JB (2015) Survey on malicious web pages detection techniques. Int J u- and e-Serv Sci Technol 8(5):195-206
3. Sahoo D, Liu C, Hoi SCH (2017) Malicious URL detection using machine learning: a survey. arXiv preprint arXiv:1701.07179
4. Prakash P, Kumar M, Kompella RR, Gupta M (2010) PhishNet: predictive blacklisting to detect phishing attacks. In: 2010 proceedings IEEE INFOCOM. IEEE, pp 1-5
5. Kan MY, Thi HON (2005) Fast webpage classification using URL features. In: Proceedings of the 14th ACM international conference on information and knowledge management, pp 325-326
6. Ma J, Saul LK, Savage S, Voelker GM (2009) Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1245-1254
7. Zhao P, Hoi SCH (2013) Cost-sensitive online active learning with application to malicious URL detection. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 919-927
8. Toolan F, Carthy J (2009) Phishing detection using classifier ensembles. In: 2009 eCrime researchers summit. IEEE, pp 1-9
9. Rawal S, Rawal B, Shaheen A, Malik S (2017) Phishing detection in E-mails using machine learning. Int J Appl Inf Syst 12:21-24
10. Abu-Nimeh S, Nappa D, Wang X, Nair S (2007) A comparison of machine learning techniques for phishing detection. In: Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, pp 60-69
11. Senagi K, Jouandeau N, Kamoni P (2017) Machine learning algorithms applied to soil analysis for crop production optimization in precision farming. In: 2017 EFITA WCCA congress
12. Unnithan NA, Harikrishnan NB, Vinayakumar R, Soman KP, Sundarakrishna S (2018) Detecting phishing E-mail using machine learning techniques. In: Proceedings of the 1st anti-phishing shared task pilot at 4th ACM IWSPA, vol 2124, pp 51-57
13. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436-444
14. Benavides E, Fuertes W, Sanchez S, Sanchez M (2020) Classification of phishing attack solutions by employing deep learning techniques: a systematic literature review. In: Developments and advances in defense and security. Springer, pp 51-64
15. Le H, Pham Q, Sahoo D, Hoi SCH (2018) URLNet: learning a URL representation with deep learning for malicious URL detection. In: Proceedings of ACM conference, Washington, DC, USA, pp 1-13
16. Saxe J, Berlin K (2017) eXpose: a character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys. arXiv preprint arXiv:1702.08568
17. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735-1780
18. Vanhoenshoven F, Nápoles G, Falcon R, Vanhoof K, Köppen M (2016) Detecting malicious URLs using machine learning techniques. In: 2016 IEEE symposium series on computational intelligence (SSCI)

Chapter 64
Engaging Smartphones and Social Data for Curing Depressive Disorders: An Overview and Survey

Srishti Bhatia, Yash Kesarwani, Ashish Basantani, and Sarika Jain
National Institute of Technology, Kurukshetra, India
e-mail: srishti_51810002@nitkkr.ac.in; yash_51810038@nitkkr.ac.in; abasantani_51810032@nitkkr.ac.in; jasarika@nitkkr.ac.in

1 Introduction

According to the World Health Organization (WHO), more than 300 million people suffer from depression worldwide [1]. A study reported by the WHO states that at least 7.5% of the Indian population suffers from some form of mental disorder. In 2015, India had the highest number of cases of depressive disorders in the world (Fig. 1). Depression is one of the primary reasons for suicide, and the severity of this disease is an issue of concern worldwide. Proper treatment of depression is the need of the hour, but various reasons lead to depression remaining untreated. The authors in [2] show that the social stigma around mental illnesses is the most prominent reason for untreated depression; other reasons include unawareness of the disease, and the lack of medical professionals in India also contributes to depression going untreated. There is therefore a need for alternative cures for depression.

The rapid advancement in mobile phone technology has presented an opportunity for providing easily accessible and affordable health care. Various studies have explored the idea of using mobile phones for providing patient-centric health care.
Fig. 1 Number of cases of depressive disorders in 2015 according to WHO [1]

In the traditional healthcare system, a patient has to be physically present at a clinic to receive a physician's recommendations about health. The physician treats the patient based on the description of the symptoms, the findings of laboratory tests, and his own observations. The symptoms may at times be misleading or not well described, due to which the physician may not be able to assess the patient's state of health accurately and precisely. The situation is even worse when the patient has a mental disorder. We anticipate and advocate that best practices need to be exploited in this domain for a still better world to live in.

Along with mobile phones, social media has grown to such an extent that it now affects the lives of millions around the globe. After being tech-savvy, it is being social media savvy that has become synonymous with today's generation. In India alone, by November 2019, 525 million people had Internet access [3], with over 300 million users of Facebook [4], more than 33 million users of Twitter and more than 26 million users of LinkedIn [5]. On average, people in India spent 2.4 h a day on social media, which is slightly below the global average of 2.5 h a day [6, 7]. Though several people still believe in socializing in real life, others prefer socializing in the virtual world of social media, be it making a new friend or sharing common ideas and interests. Social media also serves as an outlet for emotions for some people. Thus, analyzing social media data to determine the mental state of a patient can be advantageous [8]. De Choudhury et al. confirmed this by using social media data to understand depression in individuals [9] and populations [10].

We study the existing systems that aim to predict and cure depression and provide a broad overview of the research that has been done. We then propose our system, which performs prediction of depression as well as monitoring and activity recommendation for helping patients suffering from depression. Our system has the following applications:
* A patient suffering from depression might not be able to receive treatment from a psychiatrist due to various issues such as shortage of time, social stigma, cost, or unavailability of doctors. The application can help such people in monitoring and regulating their mood by recommending appropriate activities.
* The application provides an alternative to clinic-centric health care. It helps psychiatrists in monitoring their patients on a daily basis and hence provides better treatment.
* An institute or organization that wants to keep its members mentally healthy can use the application to monitor its members and recommend them some course of action to recover from depression.
* Relatives of a prospective patient suffering from depression can use the application to monitor the patient's activities and mood, to understand the situation the patient is going through and to help him/her in coping with depression.

2 Overview

To provide a broad overview of the previous research, we categorize it as follows.

2.1 Inputs

For depression prediction and activity recommendations, three types of data have been used: wearable sensor data, smartphone usage patterns, and social media data.

Wearable sensors. Wearable sensors such as Fitbit describe the physical activity of the users, i.e., the number of steps they have walked, how much exercise they have done, heart rate, etc. Smartphones are also equipped with sensors such as GPS sensors that describe the places the user has visited in a day [1].

Smartphone usage patterns. Smartphone usage patterns such as call logs and text messages can describe how much the user interacts with people. They provide insights into the daily activity of the users [12].

Social media. Social media posts describe the online activity of the user. Various studies achieve significant accuracy in predicting depression using social media posts [10].

2.2 Social Media Data Extraction

Most studies analyze data from the most used social media Web sites: Facebook, Twitter, or Reddit. APIs are available for all these sites and can be used to extract data from them. Twitter Firehose is one such API that can be used to extract Twitter data [9]. Various Python libraries are also capable of doing this task; Tweepy is one such library that interacts with the Twitter API and extracts tweets. The authors in [13] developed their NCapture tool to extract data from Reddit. The PushShift Reddit API is also used to retrieve posts from Reddit.

2.3 Sentiment Analysis Technique

One of the most crucial tasks in the prediction of depression via social media posts is the sentiment analysis of these posts. Different techniques have been used for sentiment analysis.

Ontologies. Ontologies facilitate the representation, storage, and searching of contextual health information in order to better integrate it. The authors in [14] developed an ontology comprising 443 classes and 60 relationships that predicts depression based on the presence of words such as "fear," "restless," "impulse," "guilty," "sad," "academic stresses," and "suicide" in posts.

Machine learning. Machine learning is another widely used method for sentiment analysis. Various studies have shown that machine learning classifiers such as SVM [9], Naive Bayes [15], and decision tree [16] provide accurate depression prediction.

2.4 Prediction/Recommendation Technique

Most studies use machine learning techniques such as SVM, linear regression, CNN, and decision tree for depression prediction. Activity recommendations can be made using clustering and the conditional combinations cube (CCC) [1]. Similar users are placed in a single cluster and recommended activities that are preferred by people in the same cluster.
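As an illustration of the machine-learning route to sentiment analysis mentioned in Sect. 2.3 and the prediction techniques in Sect. 2.4, the short Python sketch below trains a bag-of-words Naive Bayes classifier on a handful of invented posts; the posts, labels and feature choices are placeholders and do not come from any of the cited studies.

```python
# Hypothetical bag-of-words sentiment/depression classifier in the spirit of
# Sect. 2.3; posts and labels are invented and not from any cited study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = [
    "had a great day with friends, feeling good",
    "cannot sleep again, everything feels hopeless",
    "excited about the new project at work",
    "so tired and worthless lately, no energy for anything",
]
labels = [0, 1, 0, 1]  # 0 = non-depressive, 1 = depressive (toy labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(posts, labels)
print(clf.predict(["feeling restless and guilty all the time"]))
```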
3 Related Work

With the advent of technology, mobile phones have become such a significant part of our lives that we have them with us 24 x 7. Hence, these devices provide the opportunity of delivering patient-centric health care. The authors in [12] introduce the concept of Augmented Personalized Healthcare, i.e., the idea of providing better healthcare using mobile phones and IoT devices. They have developed kHealth, a system for asthma management, which provides continuous monitoring of asthma patients using wearable sensors.

The idea of continuous monitoring through mobile phones makes them ideal for predicting as well as treating mental illnesses such as depression. The key symptoms of depression, such as isolation, decreased activity, and sleeping problems, can be detected by continuous monitoring of a person's activities through wearable sensors and mobile phones. Hence, mobile phone usage patterns and sensor data can be effectively used to predict the onset of depression in an individual. Apart from prediction, continuous monitoring of a patient's activities through mobile phones can prove to be very useful in recommending activities that help to cope with depression. For example, if a person has not gone out of her house for days, she should be recommended to take a walk.

Apart from using mobile phones, people nowadays are quite active on social media Web sites such as Facebook, Twitter, Reddit, and Instagram. People tend to share their thoughts and feelings on these sites more often than in real life. Analysis of a person's social media data can therefore be quite effective input for predicting their mental state. Research has shown that using social media data leads to better accuracy in depression prediction.

3.1 Prediction Through Mobile Phone Usage Patterns

MoodScope [17], developed by Microsoft, is smartphone software that predicts the mood of the user using text messages, emails, call logs, application usage, Web browsing, and location. The authors developed a mood logger application that is used by participants to self-report emotions; the app also collects mobile phone usage data. This data is used to train a personalized mood model for each user using multi-linear regression, with Sequential Forward Selection (SFS) for choosing relevant features. The personalized model achieved an accuracy of 93% with two months of training. The study also develops an all-user mood model that can be used initially as a general model for all users; this model achieves an accuracy of 66%. A hybrid model, which combines the personalized and the all-user mood models, achieves an accuracy of 72%.

Hung et al. [18] use a similar approach to predict negative emotions. They used mobile phone usage patterns to train four classifiers and found that the Naive Bayes classifier with greedy forward feature selection had the highest accuracy, achieving 86.71%.

3.2 Prediction Through Social Media

De Choudhury et al. [9] use crowdsourcing to collect Twitter data of individuals who report that they have been diagnosed with clinical depression. They extracted attributes such as the number of posts per day, the number of replies, the pattern of posting, and linguistic style to train a support vector machine (SVM) classifier. LIWC software is used to determine the linguistic style of the individual and to characterize the Twitter posts as negative or positive. The SVM classifier achieves about 70% accuracy with 0.74 precision. A similar approach is used by Eichstaedt et al. [16] to predict depression in medical records using Facebook posts; they use logistic regression models to achieve high accuracy (AUC = 0.72).
The study [19] aims to develop a Web application that provides an easy and convenient way for users to check their depression levels using their social media posts and that provides the location of the doctor nearest to the user. The study uses the Naive Bayes classifier to predict depression from Facebook posts along with questionnaires filled in by the user, which help in getting an understanding of the person's behavioral attributes.

The authors of [20] detect depression using relevant social media data. They used publicly available Facebook data from bipolar, depression, and anxiety pages that contained users' comments; the data was collected using the NCapture tool. The processing of the depressive data is based on four factors: emotional process, temporal process, linguistic process, and all combined. The Linguistic Inquiry and Word Count (LIWC2015) dictionary is used for text analysis. They train various classifiers and show that the decision tree classifier gives the best accuracy on their dataset.

Gaur et al. [13] used information from Reddit to examine suicidal tendencies and other mental health issues afflicting depressed users. The authors built a suicide risk severity lexicon using medical knowledge bases and a suicide ontology to detect suicidal thoughts and actions. A C-SSRS-based five-label (supportive, indicator, ideation, behavior, and attempt) classification scheme is used that helps mental health psychiatrists determine an actionable measure for an individual's care. A convolutional neural network is used for predicting suicidality, outperforming SVM, linear and rule-based models. For the five-label confusion matrix, the CNN correctly classifies users with an accuracy of 80%; for a four-label confusion matrix, it has 92% accuracy.

3.3 Activity Recommendation

Activity recommender systems for patients suffering from depression can prove to be very beneficial for emotion regulation and can help patients manage their recurring depressive episodes. Hung et al. [15] developed an activity recommender system based on smartphone usage patterns, personal information, and environmental sensor information. The system uses clustering and a combinational cube matrix (CCC) to recommend similar activities to similar users.

IntelliCare is a suite of apps that aims to treat depression using interactive skill-based training based on cognitive behavioral therapy, problem-solving therapy, etc. The apps are available on the Google Play Store. The study [21] evaluates the efficiency of IntelliCare in reducing depressive symptoms; the results showed significant improvements in the PHQ-9 scores of the participants after eight weeks of using the applications.

Yang et al. [22] develop emHealth, an intelligent activity recommender system with depression prediction for emotion regulation. They first predict depression using decision tree and SVM as the prediction models and compare their accuracies. After prediction, they recommend activities based on external factors of depression and the level of depression.

Wahle et al. [3] developed Mobile Sensing and Support (MOSS), a smartphone app that provides just-in-time interventions derived from cognitive behavior therapy.
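To give a flavor of the clustering step used by recommender systems like the one in [15], the following Python sketch groups users by invented usage/sensor features and suggests the activity most common within the new user's cluster; this is a simplified stand-in and not the combinational cube matrix (CCC) method itself.

```python
# Hypothetical, highly simplified stand-in for the clustering step in Sect. 3.3:
# group users by usage/sensor features and suggest the activity most common in
# the new user's cluster. Feature vectors and the activity log are invented.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
user_features = rng.normal(size=(50, 6))  # e.g. steps, sleep, calls, screen time
activities = rng.choice(["take a walk", "call a friend", "meditate"], size=50)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(user_features)

def recommend(new_user: np.ndarray) -> str:
    cluster = kmeans.predict(new_user.reshape(1, -1))[0]
    in_cluster = activities[kmeans.labels_ == cluster]
    return Counter(in_cluster).most_common(1)[0][0]

print(recommend(rng.normal(size=6)))
```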
