DECLARATION

We certify that:

a. The work contained in this project has been done by us under the guidance of our supervisor.
b. The work has not been submitted to any other institute for any degree or diploma.
c. We have followed the guidelines provided by the Institute in preparing the project report.
d. We have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute.
e. Whenever we have used materials (data, theoretical analysis, figures, and text) from other sources, we have given due credit to them by citing them in the text of the report and giving their details in the references. Further, we have taken permission from the copyright owners of the sources whenever necessary.

Name of Student: Ameya Udapure
Signature of Student: ____________

Table of Contents

Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and Multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify.
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using a dendrogram and briefly describe them.
1.4 Apply K-Means clustering on scaled data and determine the optimum number of clusters. Apply the elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.

Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and Multivariate analysis).
2.2 Data Split: Split the data into test and train; build classification models CART, Random Forest, and Artificial Neural Network.
2.3 Performance Metrics: Check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve, ROC_AUC score, and classification reports for each model.
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?

Problem 1: Clustering

A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They collected a sample that summarizes the activities of users during the past few months. You are given the task to identify the segments based on credit card usage.

Data Dictionary for Market Segmentation:
- spending: Amount spent by the customer per month (in 1000s)
- advance_payments: Amount paid by the customer in advance by cash (in 100s)
- probability_of_full_payment: Probability of payment done in full by the customer to the bank
- current_balance: Balance amount left in the account to make purchases (in 1000s)
- credit_limit: Limit of the amount in the credit card (in 1000s)
- min_payment_amt: Minimum amount paid by the customer while making payments for purchases made monthly (in 100s)
- max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).

Solution:

Based on the descriptive summary, the data looks good.
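The steps below are a minimal sketch of how the data was read and summarized; the file name (bank_marketing_part1_data.csv) is an assumption, not something confirmed by the report.

import pandas as pd

# Load the credit card usage sample (file name is an assumption)
df = pd.read_csv("bank_marketing_part1_data.csv")

# Initial checks: shape, types, missing values, duplicates
print(df.shape)
df.info()
print(df.isnull().sum())
print(df.duplicated().sum())

# Univariate summary; adding the 90th percentile to see the variation
print(df.describe(percentiles=[0.25, 0.5, 0.75, 0.9]).T)

# Bivariate/multivariate view: pairwise correlations between the variables
print(df.corr())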
Key observations from the descriptive summary:
- For most of the variables, the mean and median are nearly equal.
- Including a 90th percentile in the summary shows the variation; the data looks evenly distributed.
- The standard deviation is highest for the spending variable.

Univariate view of spending:
- Minimum spending: 10.59
- Maximum spending: 21.18
- Mean value: 14.8475
- Median value: 14.355
- Standard deviation: 2.9097
- Null values: False

Strong positive correlations stand out between:
- spending & advance_payments (0.994)
- spending & credit_limit (0.971)
- spending & current_balance (0.950)
- advance_payments & max_spent_in_single_shopping (0.891)
- spending & max_spent_in_single_shopping (0.864)
- credit_limit & advance_payments
- spending & probability_of_full_payment (0.608)
min_payment_amt is weakly negatively correlated with the other variables (for example, -0.230 with spending and -0.217 with advance_payments).

Strategy to treat outliers: we replace the outlier values of an attribute with its median instead of dropping the records, since dropping them would lose the information in the other columns, and the outliers are present in only two variables and within about 5 records.

1.2 Do you think scaling is necessary for clustering in this case? Justify.

Solution:

Scaling is necessary because the variables are on different scales (for example, spending is in 1000s while advance_payments is in 100s); without scaling, variables with larger magnitudes would get more weight in the distance computation. Scaling brings all values into the same relative range. We used the z-score to standardize the data to a relative scale of roughly -3 to +3. The plots below show the data before and after scaling.

[Figure: distribution of the variables before scaling]
[Figure: distribution of the variables after scaling]

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using the dendrogram and briefly describe them.

Solution:

[Figure: dendrogram, truncated to the last 10 merged clusters; leaf counts 19, 15, 12, 24, 24, 26, 17, 24, 20, 29]

Both linkage methods give almost similar cluster means, with minor variation, which we know can occur. For cluster grouping based on the dendrogram, 3 clusters look good. After further analysis, and based on the dataset, we went with a 3-group cluster solution from the hierarchical clustering.

In a real-world setting, more variables could have been captured: tenure, balance frequency, balance, purchases, installments of purchases, and others. The three-group cluster solution gives a pattern based on high/medium/low spending, together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).

1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.

Solution:

Within-cluster sum of squares (inertia) for k = 1 to 10:
[1459.99, 680.17, 420.66, 374.66, 327.67, 289.75, 261.99, 239.09, 221.00, 209.55]

[Figure: elbow curve — inertia vs. number of clusters]

silhouette_score(clean_dataset_Scaled, labels_4)
0.3292

Silhouette scores for k = 2 to 10:
[0.4658, 0.4007, 0.3292, 0.2832, 0.2898, 0.2695, 0.2544, 0.2624, 0.2674]

[Figure: silhouette coefficient vs. number of clusters]

From the elbow curve and the silhouette score, the optimal number of clusters is 3: the silhouette coefficient is still reasonably high at k = 3 (0.40), and the inertia curve flattens beyond 3 clusters.

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.

Solution:

Cluster profiles (mean values per cluster for the 3-cluster solution):
- Group 1 – High Spending (Freq 70): spending 18.37, advance_payments 16.15, probability_of_full_payment 0.8844, current_balance 6.16, credit_limit 3.68, max_spent_in_single_shopping 6.02
- Group 2 – Low Spending (Freq 67): spending 11.87, advance_payments 13.26, probability_of_full_payment 0.8481, current_balance 5.24, credit_limit 2.85, min_payment_amt 4.95
- Group 3 – Medium Spending (Freq 73): spending 14.20, advance_payments 14.23, probability_of_full_payment 0.8792, max_spent_in_single_shopping 5.09
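The pipeline below is a minimal sketch of the steps behind 1.2 to 1.5 (z-score scaling, hierarchical clustering cut at 3 clusters, the K-Means elbow/silhouette scan, and the group-wise profile). The file name carries over from 1.1 and is an assumption, as is the choice of Ward linkage.

import pandas as pd
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("bank_marketing_part1_data.csv")  # as in 1.1; name assumed

# 1.2: z-score standardization so no variable dominates the distances
scaled = df.apply(zscore)

# 1.3: hierarchical clustering (Ward linkage assumed), cut at 3 clusters
wardlink = linkage(scaled, method="ward")
h_labels = fcluster(wardlink, 3, criterion="maxclust")

# 1.4: inertia (elbow curve) and silhouette score for k = 2..10
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=1).fit(scaled)
    print(k, round(km.inertia_, 2),
          round(silhouette_score(scaled, km.labels_), 4))

# 1.5: final 3-cluster solution and the group-wise profile (means + Freq)
km3 = KMeans(n_clusters=3, random_state=1).fit(scaled)
profile = df.assign(cluster=km3.labels_).groupby("cluster").mean()
profile["Freq"] = pd.Series(km3.labels_).value_counts().sort_index()
print(profile)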
Group 1: High Spending Group
- Giving reward points might increase their purchases.
- max_spent_in_single_shopping is high for this group, so they can be offered a discount/offer on the next transaction upon full payment.
- Increase their credit limit and thereby increase their spending habits.
- Give loans against the credit card, as these are customers with a good repayment record.
- Tie up with luxury brands, which will drive more one-time maximum spending.

Group 3: Medium Spending Group
- These are potential target customers who are paying bills, making purchases, and maintaining a comparatively good credit score. So we can increase their credit limit or lower the interest rate.
- Promote premium cards/loyalty cards to increase transactions.
- Increase their spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.

Group 2: Low Spending Group
- These customers should be given payment reminders. Offers can be provided on early payments to improve their payment rate.
- Increase their spending habits by tying up with grocery stores and utilities (electricity, phone, gas, and others).

Problem 2: CART-RF-ANN

An insurance firm providing tour insurance is facing a higher claim frequency. The management decides to collect data from the past few years. You are assigned the task to build a model that predicts the claim status and to provide recommendations to management. Use CART, RF & ANN and compare the models' performance on the train and test sets.

2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).

Solution:

RangeIndex: 3000 entries, 0 to 2999
Data columns (total 10 columns):
 #  Column        Non-Null Count  Dtype
 0  Age           3000 non-null   int64
 1  Agency_Code   3000 non-null   object
 2  Type          3000 non-null   object
 3  Claimed       3000 non-null   object
 4  Commision     3000 non-null   float64
 5  Channel       3000 non-null   object
 6  Duration      3000 non-null   int64
 7  Sales         3000 non-null   float64
 8  Product Name  3000 non-null   object
 9  Destination   3000 non-null   object
dtypes: float64(2), int64(2), object(6)
memory usage: 234.5+ KB

- 10 variables.
- Age, Commision, Duration, and Sales are numeric variables; the rest are categorical variables.
- 3000 records, no missing values.
- 9 independent variables and one target variable: Claimed.

A null-value check across Age, Agency_Code, Type, Claimed, Commision, Channel, Duration, Sales, Product Name, and Destination confirms there are no missing values.

Descriptive statistics of the numeric variables:

           count    mean      std       min    25%    50%     75%      max
Age        3000.0   38.0910   10.4635    8.0   32.0   36.00   42.000    84.00
Commision  3000.0   14.5282   25.4815    0.0    0.0    4.63   17.235   210.21
Duration   3000.0   70.0016  134.0599   -1.0   11.0   26.50   63.000  4880.00
Sales      3000.0   60.2499   70.7340    0.0   20.0   33.00   69.000   539.00

Observations:
- Duration has a negative value, which is not possible; it is a wrong entry.
- For Commision and Sales, the mean and median vary significantly.
- The categorical variables have at most 5 unique values.
- There are 139 duplicate rows. Although these show as 139 duplicate records, they can belong to different customers; since there is no customer ID or any other unique identifier, we do not drop them.
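A minimal sketch of these initial checks; the file name (insurance_part2_data.csv) is an assumption.

import pandas as pd

ins = pd.read_csv("insurance_part2_data.csv")  # file name is an assumption

ins.info()                          # dtypes and non-null counts
print(ins.describe().T)             # numeric summary; note Duration min = -1
print(ins.duplicated().sum())       # 139 duplicate rows, kept (no unique ID)

# Data-quality flags noted above
print((ins["Duration"] < 0).sum())  # count of impossible negative durations
for col in ins.select_dtypes(include="object").columns:
    print(col, ins[col].nunique())  # unique levels of each categorical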
Univariate analysis of Age:
- Range of values: 76
- Minimum Age: 8
- Maximum Age: 84
- Mean value: 38.091
- Median value: 36.0
- Standard deviation: 10.4635
- Null values: False
- Age – 1st Quartile (Q1): 32.0
- Age – 3rd Quartile (Q3): 42.0
- Interquartile range (IQR) of Age: 10.0
- Lower outlier cutoff for Age: 17.0
- Upper outlier cutoff for Age: 57.0
- Number of upper outliers in Age: 198
- Number of lower outliers in Age: 6
- % of upper outliers in Age: 7.0%
- % of lower outliers in Age: 0.2%

2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural Network.

Solution:

After extracting the target column into separate vectors for the training and test sets (70:30 split):

X_train: (2100, 9)
X_test: (900, 9)
train_labels: (2100,)
test_labels: (900,)

Best parameters for CART from the grid search:
{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_samples_split': 450}
DecisionTreeClassifier(max_depth=10, min_samples_leaf=50, min_samples_split=450, random_state=1)

Best parameters for the Random Forest from the grid search:
{'max_depth': 6, 'max_features': 3, 'min_samples_leaf': 8, 'min_samples_split': 46, 'n_estimators': 350}
RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=8, min_samples_split=46, n_estimators=350, random_state=1)

Artificial Neural Network:
MLPClassifier(hidden_layer_sizes=200, max_iter=2500, tol=0.01, random_state=1)

Variable importances:
Agency_Code     0.276018
Product Name    0.235583
Sales           0.152733
Commision       0.135997
Duration        0.077475
Type            0.071019
Age             0.039503
Destination     0.008971
Channel         0.002705

2.3 Performance Metrics: Comment and check the performance of predictions on train and test sets using accuracy, confusion matrix, ROC curve, ROC_AUC score, and classification reports for each model.

Solution:

CART – AUC and ROC for the training data: AUC = 0.823
[Figure: ROC curve for CART on the training data]

CART – AUC and ROC for the test data: AUC = 0.801
[Figure: ROC curve for CART on the test data]

CART confusion matrix and classification report for the training data:

array([[1309,  144],
       [ 307,  340]])

Train Data Accuracy = 0.7852

              precision    recall  f1-score   support
           0       0.81      0.90      0.85      1453
           1       0.70      0.53      0.60       647
    accuracy                           0.79      2100
   macro avg       0.76      0.71      0.73      2100
weighted avg       0.78      0.79      0.78      2100

CART train precision: 0.70, recall: 0.53, f1: 0.60

CART confusion matrix and classification report for the testing data:

array([[553,  70],
       [136, 141]])

Test Data Accuracy = 0.7711

              precision    recall  f1-score   support
           0       0.80      0.89      0.84       623
           1       0.67      0.51      0.58       277
    accuracy                           0.77       900
   macro avg       0.74      0.70      0.71       900
weighted avg       0.76      0.77      0.76       900

CART test precision: 0.67, recall: 0.51, f1: 0.58

Summary for CART (positive class):
- Train data: AUC 82%, Accuracy 79%, Precision 70%, f1-score 60%
- Test data: AUC 80%, Accuracy 77%, Precision 67%, f1-score 58%

The training and test set results are similar, and with the overall measures reasonably high, the model is a good model. Agency_Code is the most important variable for predicting the claim status.
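A minimal sketch of building the three tuned models and producing the metrics above: the encoded feature matrices X_train/X_test and label vectors train_labels/test_labels come from the split in 2.2, the variable names (cart, rf, nn) are illustrative, and the hyperparameters are the ones reported by the grid search.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# Models with the tuned hyperparameters reported above
cart = DecisionTreeClassifier(criterion="gini", max_depth=10,
                              min_samples_leaf=50, min_samples_split=450,
                              random_state=1)
rf = RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=8,
                            min_samples_split=46, n_estimators=350,
                            random_state=1)
nn = MLPClassifier(hidden_layer_sizes=200, max_iter=2500, tol=0.01,
                   random_state=1)

def evaluate(model, X, y, label):
    # Accuracy, confusion matrix, classification report, and ROC AUC
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]  # probability of the positive class
    print(label, "accuracy:", round(accuracy_score(y, pred), 4))
    print(confusion_matrix(y, pred))
    print(classification_report(y, pred))
    print(label, "AUC:", round(roc_auc_score(y, prob), 3))

for name, model in [("CART", cart), ("RF", rf), ("NN", nn)]:
    model.fit(X_train, train_labels)
    evaluate(model, X_train, train_labels, name + " train")
    evaluate(model, X_test, test_labels, name + " test")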
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.

Solution:

[Figure: ROC curves for the CART, RF, and NN models on the training data]
[Figure: ROC curves for the CART, RF, and NN models on the test data]

Conclusion: I am selecting the RF model, as its accuracy, precision, recall, and f1 score are better than those of the other two models (CART and NN).

2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?

Solution:

- We strongly recommend collecting more real-time unstructured data, and more past data if possible.
- Such insight comes from looking at the insurance data, drawing relations between variables such as day of the incident, time, and age group, and associating them with external information such as location, behavior patterns, weather information, airline/vehicle types, etc.
- Streamlining online experiences has benefitted customers, leading to an increase in conversions, which subsequently raised profits.
- As per the data, 90% of insurance is sold through the online channel.
- Another interesting fact: almost all of the offline business has a claim associated with it; we need to find out why.
- The JZI agency resources need to be trained to pick up sales, as they are at the bottom; we should run a promotional marketing campaign for them or evaluate whether we need to tie up with an alternate agency.
- Based on the model, we are getting about 80% accuracy, so when a customer books airline tickets or plans a trip, we can cross-sell the insurance based on the claim data patterns.
- Another interesting fact is that more sales happen via the Agency channel than via Airlines, yet the trend shows that claims are processed more for Airline sales. We may need to deep-dive into the process to understand the workflow and the reasons why.

Key performance indicators (KPIs) for insurance claims:
- Reduce claims cycle time
- Increase customer satisfaction
- Combat fraud
- Optimize claims recovery
- Reduce claim handling costs

Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend existing products, and give rise to new risk transfer solutions in areas like non-damage business interruption and reputational damage.
