Data Mining Review - 1

Drug Consumption Risk Evaluation
A PROJECT REPORT
for
DATA MINING TECHNIQUES (ITE2006)
in
B.Tech – Information Technology and Engineering
by
NAME Reg.no SLOT
M.LIKITH CHOWDARY 18BIT0039 D2
A.SAIDA REDDY 18BIT0299 D2
D.SAI SUCHETAN REDDY 18BIT0094 D1
Under the Guidance of
RANICHANDRA .C
Associate Professor, SITE
1
DRUG CONSUMPTION RISK EVALUATION
I. Abstract
This project presents different approaches to classify whether a person is

consuming a specific drug or not. Various Machine Learning Classification
techniques have been implemented involving the manipulation in the output
labels and also various categorical input features. These techniques have been
implemented assuring the major factors being age, gender, nationality, education
and Ethnicity. The proposed methods have been tested on the test set and we
achieved a good accuracy for the predication. The parameters used for the
evaluation included Accuracy, Precision and Recall The experimental results
suggest that using a particular method is subjective to its application.
Keywords – SVM, Decision Trees, Artificial Neural Networks, Drug Risk.
II Literature Survey
TITLE METHOD PERFORMANCE

The usefulness of artificial The artificial neural network
A 10-fold cross validation
neural networks for the model predicting the
method was adopted for
prediction of adverse drug vancomycin -induced
evaluating the resultant
reactions and focused on nephrotoxicity predictive
artificial neural network.
vancomycin -induced performance Among these patients, 179
nephrotoxicity (15.7%) developed
vancomycin -induced
nephrotoxicity
Popular data mining algorithm Bagging, REP Tree and This method 12 helps to
Sequential minimal decision table (DT) extracted identify the students who need
optimization from a decision tree or rule- special advising or counseling
based classifier by the councilors/teachers to
2
understand the danger of
consuming alcohol.
The developed MEDICASCY, Chemical structure for the They demonstrate that their
a multilabel-based boosted drug side effect, indication, method (TargeTox) can
random forest machine efficacy, and probable mode distinguish potentially
learning method that only of action target predictions. idiosyncratically toxic drugs
requires the small molecule’s from safe drugs and is also
suitable for speculative
evaluation of different target
sets to support the design of
optimal low-toxicity
combinations.
Pharmacovigilance is defined The availability of various Important textual data
as the science and activities sources of healthcare data for from patient social media
relating to the detection, analysis in recent years opens and clinical documents in
assessment, understanding new opportunities for the data- HER. f such systems is
driven pharmacovigilance hampered by the bias in
researchMost studies in data and the pitfalls of the
pharmacovigilance focus on data mining algorithms
structure. adopted.
DOI-HOI (Drug of interest- Efficient ADR signal The tentative results suggest
Health outcome of interest) refinement method that filters that the method has the
instances of a DOI-HOI (Drug capability to efficiently refine
of interest-Health outcome of ADR signals but each signal
interest) signal and does not may require specific tuning to
require knowledge of possible determine the optimal support
confounders. and confidence values to be
implemented
Drug consumption and misuse To collect data including Sensitivity and specificity
personality traits (NEO-FFI- greater than 70% were
R), impulsivity (BIS-11), achieved for amphetamines,
sensation seeking (ImpSS), amyl nitrite, benzodiazepines,
and demographic information chocolate, caffeine, heroin,
ketamine, methadone, and
nicotine
To analyse the predictive We worked with a sample of

The models analysed
power of different 2666 teenagers, 1378 of whom
were able to discriminate
psychosocial and personality do not consume nicotine while
correctly between both
variables 1288 are nicotine consumers
types of subjects within a
range of 77.39% to
3
78.20%, achieving
91.29% sensitivity and
74.32% specificity .This
study was carried out
with the help of the
National Plan on Drugs.
Sequential minimal Bayesian logistic regression This method helps to

optimization (SMO), Bagging, and unsupervised machine identify the students who
REP Tree learning approaches are used need special advising or
to find multiple adverse events counselling by the
counsellors/teachers to
understand the danger of
consuming alcohol.
Predicting successful They used a dataset commonly

Measured by the area
treatment using a real life used in the SUD field and
under the receiver
large database. hypothesized that SL will
operating characteristic
generate the best prediction.
curve (AUC) evaluated in
a test sample not included
in the training sample used
to fit all prediction models.
ANND-D to predict whether a Input features are given to ANN-C module predicts
person is using VSA or not the ANN-D module to the use of VSA in terms
and ANN-C to predict the time predict if volatile of time such as day, week,
of use substance abuse (VSA) month, year, decade,
has been done by the before a decade, etc.
person or not.
Identity of drug targets and They proposed an original They demonstrate that
off- targets, functional impact machine learning method that their method (TargeTox)
score computed from Gene leverages identity of drug can distinguish potentially
Ontology annotations targets and off- targets, idiosyncratically toxic
functional impact score drugs from safe drugs and
computed from Gene is also suitable for
Ontology annotations, and speculative evaluation of
biological network data to different target sets to
predict drug toxicity. support the design of
optimal low-toxicity
combinations.
AUPLC-TOF-MS analytical Correlation analysis was used Moreover, MRM

4
strategy. to investigate the internal screening method was
relationship among the employed via UPLC-
metabolites and seven MS/MS to validate the
respective metabolites were practicability and
selected into one group as reliability of the identified
“Targetmetabolites”. metabolites
Specific Type A Behavior The consumption of legal and These results will prove useful
Pattern (TABP) profiles exist. illegal substances (alcohol, to drug consumption
tobacco, psychoactive drugs, prevention and treatment
cannabis, cocaine and programs focused on the
hallucinogens) by young abovementioned personality
Spaniards.Specific personality profiles.
profiles were identified which
constitute either a risk factor
or a protective factor for
substance abuse.
Analysed drug use motives Two classical machine A random cluster sampling of
through a series of learning techniques, Decision schools was conducted in the
classification techniques. Trees (DT) and Artificial island of Mallorca.
Neural Networks (ANN); two Participants were 9,300
modern statistical techniques, students.
k-Nearest Neighbours (K-NN)
and Naïve Bayes (NB); and a
classical statistical technique,
Logistic Regression (LogR).
MEDICASCY, a multilabel- A multilabel-based boosted It has comparable or even
based boosted random forest random forest machine significantly better
machine. learning method that only performance than existing
requires the small molecule’s approaches requiring far more
chemical structure for the drug information. Kmeans, KNN,
side effect, indication, Random forest, multi layer
efficacy, and probable mode perceptron are used.
of action target predictions. MEDICASCY shows about
78% precision and recall for
predicting at least one severe
side effect and 72% precision
drug efficacy
iii. EXISTING SYSTEMS
wherever the coaching knowledge is analyzed and classification rules square measure generated.
Subsequent part is the classification, wherever check knowledge is classified into categories in
step with the generated rules. Since classification algorithms need that categories be outlined
5
primarily based on knowledge attribute values, we tend to had created associate degree attribute
class for each knowledge that will have a worth of either level of addiction or level of extra
tablets .A UPLC-TOF-MS analytical strategy combined with online data acquiring and post-
acquisition data-mining methods was established for heroin metabolite identification .Correlation
analysis was used to investigate the internal relationship among the metabolites and seven
respective metabolites were selected into one group as “Targetmetabolites”. Moreover, MRM
screening method was employed via UPLC-MS/MS to validate the practicability and reliability
of the identified metabolites.
iv. GAP IDENTIFIED
In this section we will be the preliminary concepts used for our project work
We have to get the dataset form trusted sources. We are working on the Drug consumption
dataset prepared by University of California, Irvine.
We are not only predicting whether a person consumes a specific drug or not, but also for how
long he might have used the drug.
We are predicting based on 13 attributes related to a person rather than 4-5 attributes.
V. PROPOSED METHOD
a.) Data Extraction:

We will import the dataset from University of California, Irvine Repository. We will download
the zip file and extract the excel format of drugconsumption.csv dataset. Using libraries such as
pandas we will import the dataset into the jupyter notebook environment.
b.) Data Visualization:

Based on all the attributes present in the dataset, to get better understanding about the dataset and
to obtain various useful insights from the data, we will perform various types of visualization
such as Box plot, Violin Plot, Sub plots, Spatial Analysis using matplotlib and seaborn libraries.
c.) Pre- Processing and Feature Engineering:
6
Before applying any algorithm or making any classifications, We need to clean the data, i.e, Pre-
Processing Must be done. In the dataset mentioned above there are some attributes which stores
categorical data, and these must be converted into numerical data. We will be using techniques
such as One Hot Encoding to carry out the task.
Not all the Features are essential for our final goal of Classification. There will be some features
which are more essential, i.e, have more impact on the output when compared to others. To
know the correlation between attributes a heat map can be designed which some the amount of
correlation between different attributes. Based on the results obtained in the above heat map we
will be selecting the attributes.
d.) Model Selection:

The next important step is to design/ select an model to fit on our attributes. We will divide the
attributes into attributes into two categories x & y. Where x contains all the attributes
7
which are required for the classification, and y represent the predicted label attribute. We shoulf
fit our model on x&y using a classification based machine algorithm such as SVM, Decision
Trees and Neural Networks.
e.) Testing the model on the test set:

For any model designed it is very essential to test our model. So we should divide the data into
train and test data. Firstly, we need to train the data on our training set and we need to test the
model on the testing data to check out the performance of the model.
f.) Evaluating the Model:

Finally, we before we deploy our model we need to know the accuracy of the model. To find the
accuracy of our model after it predicting the test data we can use various evaluation metrics such
as Accuracy score. Based on all the evaluation metrics we will select the model which gives the
highest accuracy and low error.
vi. DATASET DESCRIPTION AND SAMPLE DATA
Database contains records for 1885 respondents. For each respondent 12 attributes are
known: Personality measurements which include NEO-FFI-R (neuroticism, extraversion,
openness to experience, agreeableness, and conscientiousness), BIS-11 (impulsivity), and
ImpSS (sensation seeking), level of education, age, gender, country of residence and
ethnicity. All input attributes are originally categorical and are quantified. After
quantification values of all input features can be considered as real-valued. In addition,
participants were questioned concerning their use of 18 legal and illegal drugs (alcohol,
amphetamines, amyl nitrite, benzodiazepine, cannabis, chocolate, cocaine, caffeine, crack,
ecstasy, heroin, ketamine, legal highs, LSD, methadone, mushrooms, nicotine and volatile
substance abuse and one fictitious drug (Semeron) which was introduced to identify over-
claimers. For each drug they have to select one of the answers: never used the drug, used it
over a decade ago, or in the last decade, year, month, week, or day.
8
9

Data Mining Review - 1

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Mining Review - 1

Uploaded by

Copyright:

Available Formats

Drug Consumption Risk Evaluation

DATA MINING TECHNIQUES (ITE2006)

B.Tech – Information Technology and Engineering

Under the Guidance of

Associate Professor, SITE

This project presents different approaches to classify whether a person is

Keywords – SVM, Decision Trees, Artificial Neural Networks, Drug Risk.

TITLE METHOD PERFORMANCE

To analyse the predictive We worked with a sample of

Sequential minimal Bayesian logistic regression This method helps to

Predicting successful They used a dataset commonly

AUPLC-TOF-MS analytical Correlation analysis was used Moreover, MRM

iii. EXISTING SYSTEMS

iv. GAP IDENTIFIED

a.) Data Extraction:

b.) Data Visualization:

c.) Pre- Processing and Feature Engineering:

d.) Model Selection:

e.) Testing the model on the test set:

f.) Evaluating the Model:

vi. DATASET DESCRIPTION AND SAMPLE DATA

You might also like