
Hi, this is Gani. Today I am going to present my work on online customer intention prediction.

This work can be used to model user behavior based on interactions with an e-commerce website, to measure
which website actions correlate with revenue and sales, and to identify seasonality and trends in buying behavior.

3- The dataset was obtained from Kaggle because of its availability and ease of use. It consists of 12330 samples and 18
features, of which 8 are categorical and 10 are numeric. "Administrative", "Administrative Duration", "Informational",
"Informational Duration", "Product Related" and "Product Related Duration" represent the number of different types of pages
visited by the visitor in that session and total time spent in each of these page categories. The values of these features are
derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g.
moving from one page to another.

The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by "Google Analytics" for each page
in the e-commerce site. The value of the "Bounce Rate" feature for a web page is the percentage of visitors who enter the
site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.
The value of the "Exit Rate" feature for a specific web page is the percentage of all pageviews of that page that were
the last in the session. The "Page Value" feature represents the average value of a web page that a user visited before
completing an e-commerce transaction.
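To make the two rate definitions above concrete, here is a minimal sketch on made-up sessions. The session data is hypothetical, and a bounce is simplified to "a session with exactly one pageview", which is an assumption and not how Google Analytics computes it internally.

```python
from collections import Counter

# Hypothetical toy sessions: each one is the ordered list of pages viewed.
sessions = [
    ["home"],                     # single-page session: a bounce on "home"
    ["home", "product", "cart"],
    ["product", "home"],
    ["home", "product"],
    ["product"],                  # single-page session: a bounce on "product"
]

entrances = Counter()  # sessions that started on the page
bounces = Counter()    # single-page sessions that started on the page
pageviews = Counter()  # all views of the page
exits = Counter()      # views of the page that were the last in a session

for pages in sessions:
    entrances[pages[0]] += 1
    if len(pages) == 1:
        bounces[pages[0]] += 1
    pageviews.update(pages)
    exits[pages[-1]] += 1


def bounce_rate(page):
    """Share of sessions entering on `page` that left without another pageview."""
    return bounces[page] / entrances[page] if entrances[page] else 0.0


def exit_rate(page):
    """Share of all pageviews of `page` that were the last in their session."""
    return exits[page] / pageviews[page] if pageviews[page] else 0.0


for page in sorted(pageviews):
    print(page, round(bounce_rate(page), 3), round(exit_rate(page), 3))
```

Note the different denominators: bounce rate divides by entrances on the page, exit rate divides by all pageviews of the page.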

The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's
Day) in which the sessions are more likely to be finalized with a transaction. The value of this attribute is determined by
considering the dynamics of e-commerce such as the duration between the order date and delivery date. For example, for
Valentine's Day, this value is nonzero between February 2 and February 12, zero before and after this range unless it
is close to another special day, and reaches its maximum value of 1 on February 8.

The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean
value indicating whether the date of the visit is weekend, and month of the year.

4- When we look at the statistical summary, we see that some duration values are negative, suggesting outliers or missing values.

The numerical features are highly skewed.

The prediction classes are imbalanced, with 1908 True and 10422 False samples.

5- The target variable Revenue is imbalanced. (Techniques like oversampling, undersampling or SMOTE can be
used along with stratified cross-validation to develop a robust model.)
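As a rough sketch of what stratification means here, the snippet below deals the sample indices of each class round-robin into 5 folds, so every fold keeps roughly the same True/False ratio as the full dataset. The function name `stratified_folds` and the seed are my own choices for illustration; only the class counts come from the dataset.

```python
import random


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal round-robin so each fold gets a near-equal share of this class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Class counts from the dataset: 1908 True and 10422 False sessions.
labels = [True] * 1908 + [False] * 10422
folds = stratified_folds(labels)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives, round(positives / len(fold), 4))
```

If oversampling or SMOTE is combined with this, it should be applied to the training folds only, so that the held-out fold still reflects the original class ratio. In practice a library splitter such as scikit-learn's `StratifiedKFold` does the same job.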

No data was collected in the months of January and April.

The collected data is not uniformly distributed across months and regions.

OperatingSystem, Browser and TrafficType appear to have some categories missing.

6-7- Since the distributions are skewed, enhanced boxplots are used to visualize the conditional
distributions; an enhanced boxplot can show more quantiles than a standard one. The outliers in the columns are
clipped. To fix the skewness problem, various transforms can be applied. To see whether a transformation is
useful, the multivariate analysis of the transformed data can be compared with that of the original dataset.
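The clipping step can be sketched as follows. The notes do not say which bounds were used, so the 1st and 99th percentiles here are an assumption, and the duration values are made up for illustration.

```python
import statistics


def skewness(values):
    """Population skewness: the third standardized moment."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)


def clip_to_percentiles(values, lower_pct=1, upper_pct=99):
    """Clip values to the given percentile bounds (assumed bounds, not from the notes)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical durations in seconds: one invalid negative value and one extreme value.
durations = [-1.0] + [float(v) for v in range(0, 200, 2)] + [5000.0]
clipped = clip_to_percentiles(durations)

print("skewness before clipping:", round(skewness(durations), 2))
print("skewness after clipping: ", round(skewness(clipped), 2))
```

Comparing the skewness before and after is one quick way to check whether clipping (or a later transform) actually helped.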

The "Exit Rate" feature is bimodal. There is a high correlation between "Product Related" and
"Product Related Duration" (0.88) and between "Exit Rate" and "Bounce Rate" (0.91). An analysis of variance (ANOVA) test suggests
that the numeric columns are related to the target variable.
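For reference, both statistics mentioned above can be computed by hand. This is a minimal sketch on made-up numbers, not the analysis code used for the slides: `pearson` gives the correlation coefficient between two numeric columns, and `anova_f` gives the one-way ANOVA F statistic for a numeric column split by the Revenue classes.

```python
import math
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def anova_f(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical values of one numeric feature, split by the target class.
buyers = [10.0, 12.0, 14.0]
non_buyers = [0.0, 1.0, 2.0]

print("F statistic:", anova_f([buyers, non_buyers]))
print("correlation:", pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

A large F value means the class means differ a lot relative to the spread inside each class. Turning F into a p-value needs the F distribution, which a library routine such as `scipy.stats.f_oneway` handles.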

Pairwise scatter plots suggest non-linear relationships.

Three different feature sets were generated:


No scaling

Scaled using MinMax scaler

Yeo Johnson transformation

8- In this graph we can see how the Yeo-Johnson transformation changes the distribution.
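The two transformed feature sets can be illustrated with a small sketch. The Yeo-Johnson formula below is the standard definition, but the lambda value is fixed by hand here as an assumption; a library implementation such as scikit-learn's `PowerTransformer(method="yeo-johnson")` estimates lambda per feature from the data. The input values are made up.

```python
import math


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


def yeo_johnson(x, lam):
    """Yeo-Johnson power transform of a single value for a given lambda."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)


# Hypothetical right-skewed duration values.
durations = [0.0, 1.0, 3.0, 7.0, 1023.0]

print("min-max:    ", [round(v, 4) for v in min_max_scale(durations)])
print("yeo-johnson:", [round(yeo_johnson(v, 0.0), 4) for v in durations])
```

Min-max scaling only changes the range and leaves the skewed shape intact, while Yeo-Johnson with a small lambda compresses the long right tail, which is the change in shape the graph shows.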

• 9- Model performance indicator: to handle the class imbalance, the balanced accuracy score was used.
• Balanced accuracy is calculated as the average of the proportion of correct predictions for each class individually.
• 10- The goal is to identify which features are important and influence the buying intent of customers. There
are 3 feature sets, and 2 models are trained on each set. The tree-based model (gradient boosted trees)
is used to determine feature importance. The model with the highest balanced accuracy score is
selected. The neural-network-based model serves as a reference for how good the tree-based model is. Ideally,
the two models should perform equally well.

• Model selection

The hyperparameters of each model are optimized using Bayesian optimization and 5-fold stratified cross-validation.
Each model is trained on three different transformations of the original dataset. Since the dataset
is imbalanced, we use balanced accuracy as the metric.
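To show why balanced accuracy is the safer metric on this dataset, here is a minimal sketch that scores a trivial classifier which always predicts False. Only the class counts come from the dataset; the classifier is a made-up baseline.

```python
def accuracy(y_true, y_pred):
    """Plain accuracy: share of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Average of the per-class proportion of correct predictions (per-class recall)."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Class counts from the dataset: 1908 True and 10422 False sessions.
y_true = [True] * 1908 + [False] * 10422
always_false = [False] * len(y_true)  # hypothetical baseline that never predicts a purchase

print("accuracy:         ", round(accuracy(y_true, always_false), 4))
print("balanced accuracy:", balanced_accuracy(y_true, always_false))
```

The baseline looks good on plain accuracy simply because most sessions do not end in a purchase, but its balanced accuracy is only 0.5, the same as guessing. The library equivalent is scikit-learn's `balanced_accuracy_score`.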

11- The best results for both the tree-based model and the neural network model are obtained on the feature set that
is not scaled or transformed. Both models perform almost equally well, with the neural network performing 0.5
percent better than the gradient boosted trees.

12- The top 10 features affecting buying intent were identified. These features can be isolated further
depending on the application. For example, the exit rate would be a good measure of how well personalized webpages are working for
users. A simple A/B test can be carried out with and without personalization, monitoring exit rates as well as other features.
A change in these features would indicate a change in buying intent and tell us whether the test was successful.
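One way to judge such an A/B test is a two-proportion z-test on the exit rates of the two variants. This is only a sketch of one possible evaluation, not something carried out in this work; the counts are hypothetical, and treating pageviews as independent observations is a simplifying assumption.

```python
import math


def two_proportion_z_test(exits_a, views_a, exits_b, views_b):
    """Return (z, two-sided p-value) for the difference between two exit rates."""
    rate_a = exits_a / views_a
    rate_b = exits_b / views_b
    pooled = (exits_a + exits_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: A = without personalization, B = with personalization.
z, p_value = two_proportion_z_test(exits_a=400, views_a=2000, exits_b=340, views_b=2000)
print("z =", round(z, 3), "p =", round(p_value, 4))
```

A small p-value would suggest that the drop in exit rate with personalization is unlikely to be due to chance alone.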

13- To conclude: the dataset was cleaned, explored and visualized

Three transformations were tested and applied to find the best fit

Two models were trained (a tree-based model and a neural network)

Top 10 correlating features were isolated

FRAUD DETECTION: SUPERVISED/UNSUPERVISED

CUSTOMER SEGMENTATION=UNSUPERVISED/ KMEANS

RECOMMENDATION= CONTENT BASED / COLLABORATIVE / REINFORCEMENT

RISK ASSESSMENT MATRIX MODEL

CLAIMS PREDICTION= SVM, RF, DT

ADVERTISEMENT=REINFORCEMENT

PRICE OPTIMIZATION
