
task2

March 25, 2024

[2]: from google.colab import drive
drive.mount('/content/drive')

/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283:
DeprecationWarning: `should_run_async` will not call `transform_cell`
automatically in the future. Please pass the result to `transformed_cell`
argument and any exception that happen during the transform in
`preprocessing_exc_tuple` in IPython 7.17 and above.
and should_run_async(code)
Drive already mounted at /content/drive; to attempt to forcibly remount, call
drive.mount("/content/drive", force_remount=True).

[3]: %cd "/content/drive/MyDrive/Interview-AI Engineer-VNPay/dataset"

/content/drive/MyDrive/Interview-AI Engineer-VNPay/dataset

[4]: !ls

anscombe.csv data.csv US_Stores.xlsx



[5]: import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
from scipy.stats import chi2_contingency


[6]: pd.set_option('display.max_columns', None)


[7]: order_df = pd.read_excel("US_Stores.xlsx", sheet_name="Orders")
return_df = pd.read_excel("US_Stores.xlsx", sheet_name="Returns")
user_df = pd.read_excel("US_Stores.xlsx", sheet_name="Users")


[8]: order_df


[8]: Row ID Order Priority Discount Unit Price Shipping Cost Customer ID \
0 20847 High 0.01 2.84 0.93 3
1 20228 Not Specified 0.02 500.98 26.00 5
2 21776 Critical 0.06 9.48 7.29 11
3 24844 Medium 0.09 78.69 19.99 14
4 24846 Medium 0.08 3.28 2.31 14
… … … … … … …
1947 19842 High 0.01 10.90 7.46 3397
1948 19843 High 0.10 7.99 5.03 3397

1949 26208 Not Specified 0.08 11.97 5.81 3399
1950 24911 Medium 0.10 9.38 4.93 3400
1951 25914 High 0.10 105.98 13.99 3403

Customer Name Ship Mode Customer Segment Product Category \


0 Bonnie Potter Express Air Corporate Office Supplies
1 Ronnie Proctor Delivery Truck Home Office Furniture
2 Marcus Dunlap Regular Air Home Office Furniture
3 Gwendolyn F Tyson Regular Air Small Business Furniture
4 Gwendolyn F Tyson Regular Air Small Business Office Supplies
… … … … …
1947 Andrea Shaw Regular Air Small Business Office Supplies
1948 Andrea Shaw Regular Air Small Business Technology
1949 Marvin Reid Regular Air Small Business Office Supplies
1950 Florence Gold Express Air Small Business Furniture
1951 Tammy Buckley Express Air Consumer Furniture

Product Sub-Category Product Container \


0 Pens & Art Supplies Wrap Bag
1 Chairs & Chairmats Jumbo Drum
2 Office Furnishings Small Pack
3 Office Furnishings Small Box
4 Pens & Art Supplies Wrap Bag
… … …
1947 Storage & Organization Small Box
1948 Telephones and Communication Medium Box
1949 Pens & Art Supplies Small Pack
1950 Office Furnishings Small Box
1951 Office Furnishings Medium Box

Product Name Product Base Margin \


0 SANFORD Liquid Accent� Tank-Style Highlighters 0.54
1 Global Troy� Executive Leather Low-Back Tilter 0.60
2 DAX Two-Tone Rosewood/Black Document Frame, De… 0.45
3 Howard Miller 12-3/4 Diameter Accuwave DS � Wa… 0.43
4 Newell 321 0.56
… … …
1947 Crate-A-Files� 0.59
1948 Bell Sonecor JB700 Caller ID 0.60
1949 Staples SlimLine Pencil Sharpener 0.60
1950 Eldon Expressions Punched Metal & Wood Desk Ac… 0.57
1951 Tenex 46" x 60" Computer Anti-Static Chairmat,… 0.65

Country Region State or Province City Postal Code \


0 United States West Washington Anacortes 98221
1 United States West California San Gabriel 91776
2 United States East New Jersey Roselle 7203

3 United States Central Minnesota Prior Lake 55372
4 United States Central Minnesota Prior Lake 55372
… … … … … …
1947 United States Central Illinois Danville 61832
1948 United States Central Illinois Danville 61832
1949 United States Central Illinois Des Plaines 60016
1950 United States East West Virginia Fairmont 26554
1951 United States West Wyoming Cheyenne 82001

Order Date Ship Date Profit Quantity ordered new Sales Order ID
0 2015-01-07 2015-01-08 4.5600 4 13.01 88522
1 2015-06-13 2015-06-15 4390.3665 12 6362.85 90193
2 2015-02-15 2015-02-17 -53.8096 22 211.15 90192
3 2015-05-12 2015-05-14 803.4705 16 1164.45 86838
4 2015-05-12 2015-05-13 -24.0300 7 22.23 86838
… … … … … … …
1947 2015-03-11 2015-03-12 -116.7600 18 207.31 87536
1948 2015-03-11 2015-03-12 -160.9520 22 143.12 87536
1949 2015-03-29 2015-03-31 -41.8700 5 59.98 87534
1950 2015-04-04 2015-04-04 -24.7104 15 135.78 87537
1951 2015-02-08 2015-02-11 349.4850 5 506.50 87530

[1952 rows x 25 columns]

[9]: return_df


[9]: Order ID Status


0 65 Returned
1 612 Returned
2 614 Returned
3 678 Returned
4 710 Returned
… … …
1629 182681 Returned
1630 182683 Returned
1631 182750 Returned
1632 182781 Returned
1633 182906 Returned

[1634 rows x 2 columns]

[10]: user_df


[10]: Region Manager


0 Central Chris
1 East Erin
2 South Sam
3 West William

[11]: order_df['Order ID'].is_unique


[11]: False

[12]: order_df['Order ID'][order_df['Order ID'].duplicated()]


[12]: 4 86838
5 86838
6 86838
10 86836
15 42949

1935 88838
1936 88838
1940 88745
1942 88746
1948 87536
Name: Order ID, Length: 587, dtype: int64
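
Note: `duplicated()` defaults to `keep='first'`, so the 587 rows listed above are only the second and later occurrences of each repeated Order ID. The sketch below, on a small illustrative frame (`toy` and its rows are assumptions, not the real sheet), shows one way to count every row that shares an Order ID and to see how many rows each order has.

```python
import pandas as pd

# Illustrative frame only; the real data lives in order_df.
toy = pd.DataFrame({"Order ID": [86838, 86838, 86838, 86836, 88522]})

# Default keep='first': the first occurrence is NOT flagged.
later_repeats = toy["Order ID"].duplicated()

# keep=False flags every row whose Order ID occurs more than once.
all_repeats = toy["Order ID"].duplicated(keep=False)

# Number of rows (line items) per Order ID.
rows_per_order = toy.groupby("Order ID").size()

print(int(later_repeats.sum()), int(all_repeats.sum()))
print(rows_per_order.to_dict())
```

Repeated Order IDs here look like multiple line items within one order rather than bad data, so a per-order row count is usually more informative than the raw duplicate flag.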

[13]: order_df[order_df['Order ID'] == 182683]


[13]: Empty DataFrame


Columns: [Row ID, Order Priority, Discount, Unit Price, Shipping Cost, Customer
ID, Customer Name, Ship Mode, Customer Segment, Product Category, Product Sub-
Category, Product Container, Product Name, Product Base Margin, Country, Region,
State or Province, City, Postal Code, Order Date, Ship Date, Profit, Quantity
ordered new, Sales, Order ID]
Index: []

[14]: set(return_df['Status'].values)


[14]: {'Returned'}

[15]: order_df[order_df['Order ID'] == 65]


[15]: Empty DataFrame


Columns: [Row ID, Order Priority, Discount, Unit Price, Shipping Cost, Customer
ID, Customer Name, Ship Mode, Customer Segment, Product Category, Product Sub-
Category, Product Container, Product Name, Product Base Margin, Country, Region,
State or Province, City, Postal Code, Order Date, Ship Date, Profit, Quantity
ordered new, Sales, Order ID]
Index: []

[16]: order_df = pd.merge(order_df, user_df, on='Region', how='left')


[17]: order_df


[17]: Row ID Order Priority Discount Unit Price Shipping Cost Customer ID \
0 20847 High 0.01 2.84 0.93 3
1 20228 Not Specified 0.02 500.98 26.00 5
2 21776 Critical 0.06 9.48 7.29 11
3 24844 Medium 0.09 78.69 19.99 14
4 24846 Medium 0.08 3.28 2.31 14
… … … … … … …
1947 19842 High 0.01 10.90 7.46 3397
1948 19843 High 0.10 7.99 5.03 3397
1949 26208 Not Specified 0.08 11.97 5.81 3399
1950 24911 Medium 0.10 9.38 4.93 3400
1951 25914 High 0.10 105.98 13.99 3403

Customer Name Ship Mode Customer Segment Product Category \


0 Bonnie Potter Express Air Corporate Office Supplies
1 Ronnie Proctor Delivery Truck Home Office Furniture
2 Marcus Dunlap Regular Air Home Office Furniture
3 Gwendolyn F Tyson Regular Air Small Business Furniture
4 Gwendolyn F Tyson Regular Air Small Business Office Supplies
… … … … …
1947 Andrea Shaw Regular Air Small Business Office Supplies
1948 Andrea Shaw Regular Air Small Business Technology
1949 Marvin Reid Regular Air Small Business Office Supplies
1950 Florence Gold Express Air Small Business Furniture
1951 Tammy Buckley Express Air Consumer Furniture

Product Sub-Category Product Container \


0 Pens & Art Supplies Wrap Bag
1 Chairs & Chairmats Jumbo Drum
2 Office Furnishings Small Pack
3 Office Furnishings Small Box
4 Pens & Art Supplies Wrap Bag

… … …
1947 Storage & Organization Small Box
1948 Telephones and Communication Medium Box
1949 Pens & Art Supplies Small Pack
1950 Office Furnishings Small Box
1951 Office Furnishings Medium Box

Product Name Product Base Margin \


0 SANFORD Liquid Accent� Tank-Style Highlighters 0.54
1 Global Troy� Executive Leather Low-Back Tilter 0.60
2 DAX Two-Tone Rosewood/Black Document Frame, De… 0.45
3 Howard Miller 12-3/4 Diameter Accuwave DS � Wa… 0.43
4 Newell 321 0.56
… … …
1947 Crate-A-Files� 0.59
1948 Bell Sonecor JB700 Caller ID 0.60
1949 Staples SlimLine Pencil Sharpener 0.60
1950 Eldon Expressions Punched Metal & Wood Desk Ac… 0.57
1951 Tenex 46" x 60" Computer Anti-Static Chairmat,… 0.65

Country Region State or Province City Postal Code \


0 United States West Washington Anacortes 98221
1 United States West California San Gabriel 91776
2 United States East New Jersey Roselle 7203
3 United States Central Minnesota Prior Lake 55372
4 United States Central Minnesota Prior Lake 55372
… … … … … …
1947 United States Central Illinois Danville 61832
1948 United States Central Illinois Danville 61832
1949 United States Central Illinois Des Plaines 60016
1950 United States East West Virginia Fairmont 26554
1951 United States West Wyoming Cheyenne 82001

Order Date Ship Date Profit Quantity ordered new Sales \


0 2015-01-07 2015-01-08 4.5600 4 13.01
1 2015-06-13 2015-06-15 4390.3665 12 6362.85
2 2015-02-15 2015-02-17 -53.8096 22 211.15
3 2015-05-12 2015-05-14 803.4705 16 1164.45
4 2015-05-12 2015-05-13 -24.0300 7 22.23
… … … … … …
1947 2015-03-11 2015-03-12 -116.7600 18 207.31
1948 2015-03-11 2015-03-12 -160.9520 22 143.12
1949 2015-03-29 2015-03-31 -41.8700 5 59.98
1950 2015-04-04 2015-04-04 -24.7104 15 135.78
1951 2015-02-08 2015-02-11 349.4850 5 506.50

Order ID Manager

0 88522 William
1 90193 William
2 90192 Erin
3 86838 Chris
4 86838 Chris
… … …
1947 87536 Chris
1948 87536 Chris
1949 87534 Chris
1950 87537 Erin
1951 87530 William

[1952 rows x 26 columns]

[18]: order_df = pd.merge(order_df, return_df, how='left', left_on='Order ID', right_on='Order ID')
order_df['Returned'] = order_df['Status'].apply(lambda x: 1 if pd.notnull(x) else 0)
order_df.drop(columns=['Status'], inplace=True)


[19]: order_df


[19]: Row ID Order Priority Discount Unit Price Shipping Cost Customer ID \
0 20847 High 0.01 2.84 0.93 3
1 20228 Not Specified 0.02 500.98 26.00 5
2 21776 Critical 0.06 9.48 7.29 11
3 24844 Medium 0.09 78.69 19.99 14
4 24846 Medium 0.08 3.28 2.31 14
… … … … … … …
1947 19842 High 0.01 10.90 7.46 3397
1948 19843 High 0.10 7.99 5.03 3397
1949 26208 Not Specified 0.08 11.97 5.81 3399
1950 24911 Medium 0.10 9.38 4.93 3400

1951 25914 High 0.10 105.98 13.99 3403

Customer Name Ship Mode Customer Segment Product Category \


0 Bonnie Potter Express Air Corporate Office Supplies
1 Ronnie Proctor Delivery Truck Home Office Furniture
2 Marcus Dunlap Regular Air Home Office Furniture
3 Gwendolyn F Tyson Regular Air Small Business Furniture
4 Gwendolyn F Tyson Regular Air Small Business Office Supplies
… … … … …
1947 Andrea Shaw Regular Air Small Business Office Supplies
1948 Andrea Shaw Regular Air Small Business Technology
1949 Marvin Reid Regular Air Small Business Office Supplies
1950 Florence Gold Express Air Small Business Furniture
1951 Tammy Buckley Express Air Consumer Furniture

Product Sub-Category Product Container \


0 Pens & Art Supplies Wrap Bag
1 Chairs & Chairmats Jumbo Drum
2 Office Furnishings Small Pack
3 Office Furnishings Small Box
4 Pens & Art Supplies Wrap Bag
… … …
1947 Storage & Organization Small Box
1948 Telephones and Communication Medium Box
1949 Pens & Art Supplies Small Pack
1950 Office Furnishings Small Box
1951 Office Furnishings Medium Box

Product Name Product Base Margin \


0 SANFORD Liquid Accent� Tank-Style Highlighters 0.54
1 Global Troy� Executive Leather Low-Back Tilter 0.60
2 DAX Two-Tone Rosewood/Black Document Frame, De… 0.45
3 Howard Miller 12-3/4 Diameter Accuwave DS � Wa… 0.43
4 Newell 321 0.56
… … …
1947 Crate-A-Files� 0.59
1948 Bell Sonecor JB700 Caller ID 0.60
1949 Staples SlimLine Pencil Sharpener 0.60
1950 Eldon Expressions Punched Metal & Wood Desk Ac… 0.57
1951 Tenex 46" x 60" Computer Anti-Static Chairmat,… 0.65

Country Region State or Province City Postal Code \


0 United States West Washington Anacortes 98221
1 United States West California San Gabriel 91776
2 United States East New Jersey Roselle 7203
3 United States Central Minnesota Prior Lake 55372
4 United States Central Minnesota Prior Lake 55372

… … … … … …
1947 United States Central Illinois Danville 61832
1948 United States Central Illinois Danville 61832
1949 United States Central Illinois Des Plaines 60016
1950 United States East West Virginia Fairmont 26554
1951 United States West Wyoming Cheyenne 82001

Order Date Ship Date Profit Quantity ordered new Sales \


0 2015-01-07 2015-01-08 4.5600 4 13.01
1 2015-06-13 2015-06-15 4390.3665 12 6362.85
2 2015-02-15 2015-02-17 -53.8096 22 211.15
3 2015-05-12 2015-05-14 803.4705 16 1164.45
4 2015-05-12 2015-05-13 -24.0300 7 22.23
… … … … … …
1947 2015-03-11 2015-03-12 -116.7600 18 207.31
1948 2015-03-11 2015-03-12 -160.9520 22 143.12
1949 2015-03-29 2015-03-31 -41.8700 5 59.98
1950 2015-04-04 2015-04-04 -24.7104 15 135.78
1951 2015-02-08 2015-02-11 349.4850 5 506.50

Order ID Manager Returned


0 88522 William 0
1 90193 William 0
2 90192 Erin 0
3 86838 Chris 0
4 86838 Chris 0
… … … …
1947 87536 Chris 0
1948 87536 Chris 0
1949 87534 Chris 0
1950 87537 Erin 0
1951 87530 William 0

[1952 rows x 27 columns]
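
Note: the `Returned` flag from cell [18] could also be built without a merge. The sketch below uses `Series.isin` on two small illustrative frames (`orders` and `returns` are stand-ins, not the real sheets). Unlike a left merge, `isin` cannot add rows if an Order ID were ever listed more than once on the returns side; in the output above the row count stayed at 1952, so that did not happen here.

```python
import pandas as pd

# Stand-in frames for illustration; column names follow the notebook.
orders = pd.DataFrame({"Order ID": [13959, 13959, 88522, 90193]})
returns = pd.DataFrame({"Order ID": [65, 13959],
                        "Status": ["Returned", "Returned"]})

# 1 if the order appears in the returns sheet, else 0.
orders["Returned"] = orders["Order ID"].isin(returns["Order ID"]).astype(int)

print(orders["Returned"].tolist())
```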

[20]: order_df[order_df['Returned'] == 1]


[20]: Row ID Order Priority Discount Unit Price Shipping Cost Customer ID \
68 1950 Medium 0.01 4.91 0.50 117
69 1951 Medium 0.09 4.00 1.30 117

171 5302 High 0.01 8.33 1.99 308
256 1147 Medium 0.08 2.94 0.96 491
294 2368 Medium 0.00 6.88 2.00 553
346 7893 Not Specified 0.00 236.97 59.24 640
588 6711 High 0.00 6.68 5.66 1044
689 7632 Medium 0.09 130.98 30.00 1217
692 7810 Medium 0.00 7.10 6.05 1228
693 7811 Medium 0.01 4.98 4.62 1228
694 7812 Medium 0.06 5.68 1.39 1228
968 8389 High 0.02 30.98 17.08 1733
1205 1008 High 0.09 16.98 12.39 2189
1513 5338 High 0.05 165.20 19.99 2670
1514 5339 High 0.09 17.99 8.65 2670

Customer Name Ship Mode Customer Segment Product Category \


68 Linda Weiss Regular Air Home Office Office Supplies
69 Linda Weiss Express Air Home Office Office Supplies
171 Glen Caldwell Regular Air Small Business Technology
256 Toni Swanson Regular Air Consumer Office Supplies
294 Kristine Connolly Express Air Home Office Office Supplies
346 Neal Wolfe Delivery Truck Consumer Furniture
588 Erin Ballard Regular Air Home Office Office Supplies
689 Billy Perry Browning Delivery Truck Small Business Furniture
692 Hazel Jennings Regular Air Small Business Office Supplies
693 Hazel Jennings Express Air Small Business Technology
694 Hazel Jennings Regular Air Small Business Office Supplies
968 Nina Horne Kelly Regular Air Small Business Office Supplies
1205 Frank Cross Regular Air Corporate Office Supplies
1513 Yvonne Mann Regular Air Home Office Office Supplies
1514 Yvonne Mann Regular Air Home Office Office Supplies

Product Sub-Category Product Container \


68 Labels Small Box
69 Paper Wrap Bag
171 Computer Peripherals Small Pack
256 Pens & Art Supplies Wrap Bag
294 Paper Wrap Bag
346 Tables Jumbo Box
588 Paper Small Box
689 Chairs & Chairmats Jumbo Drum
692 Binders and Binder Accessories Small Box
693 Computer Peripherals Small Pack
694 Envelopes Small Box
968 Paper Small Box
1205 Envelopes Small Box
1513 Storage & Organization Small Box
1514 Pens & Art Supplies Small Box

Product Name Product Base Margin \
68 Avery 493 0.36
69 EcoTones® Memo Sheets 0.37
171 80 Minute Slim Jewel Case CD-R , 10/Pack - Sta… 0.52
256 Newell 343 0.58
294 Adams Phone Message Book, 200 Message Capacity… 0.39
346 Chromcraft Rectangular Conference Tables 0.61
588 Xerox 1923 0.37
689 Office Star - Contemporary Task Swivel chair w… 0.78
692 Wilson Jones Hanging View Binder, White, 1" 0.39
693 Imation 3.5", DISKETTE 44766 HGHLD3.52HD/FM, 1… 0.64
694 Staples Standard Envelopes 0.38
968 Xerox 197 0.40
1205 Brown Kraft Recycled Envelopes 0.35
1513 Economy Rollaway Files 0.59
1514 Model L Table or Wall-Mount Pencil Sharpener 0.57

Country Region State or Province City Postal Code \


68 United States West Washington Seattle 98103
69 United States West Washington Seattle 98103
171 United States West Washington Seattle 98115
256 United States East New York New York City 10154
294 United States West California Los Angeles 90008
346 United States West Washington Seattle 98119
588 United States West California Los Angeles 90004
689 United States East Massachusetts Boston 2112
692 United States East Pennsylvania Philadelphia 19140
693 United States East Pennsylvania Philadelphia 19140
694 United States East Pennsylvania Philadelphia 19140
968 United States East District of Columbia Washington 20012
1205 United States East New York New York City 10177
1513 United States West California Los Angeles 90049
1514 United States West California Los Angeles 90049

Order Date Ship Date Profit Quantity ordered new Sales \


68 2015-04-04 2015-04-06 112.060 47 228.46
69 2015-04-04 2015-04-06 16.790 19 77.61
171 2015-02-14 2015-02-15 10.740 32 280.62
256 2015-05-15 2015-05-17 -2.120 23 66.70
294 2015-01-28 2015-01-29 34.068 36 267.53
346 2015-02-14 2015-02-15 1192.040 34 6686.34
588 2015-02-27 2015-02-28 -76.940 90 617.40
689 2015-04-28 2015-05-01 -421.760 41 5258.94
692 2015-02-16 2015-02-17 -60.145 28 208.83
693 2015-02-16 2015-02-18 -111.720 41 228.30
694 2015-02-16 2015-02-16 33.010 24 129.53

968 2015-06-28 2015-06-29 -32.280 13 438.25
1205 2015-05-08 2015-05-10 -48.570 22 381.91
1513 2015-05-29 2015-05-29 2008.710 167 27587.55
1514 2015-05-29 2015-05-29 -80.530 71 1191.58

Order ID Manager Returned


68 13959 William 1
69 13959 William 1
171 37760 William 1
256 8353 Erin 1
294 17155 William 1
346 56452 William 1
588 47813 William 1
689 54595 Erin 1
692 55874 Erin 1
693 55874 Erin 1
694 55874 Erin 1
968 59937 Erin 1
1205 7364 Erin 1
1513 37924 William 1
1514 37924 William 1
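
Note: only 15 rows carry `Returned == 1`. One way to check how the Order IDs in the two sheets overlap is an outer merge with `indicator=True`, which labels each key as `left_only`, `right_only`, or `both`. The frames below are small illustrative stand-ins, not the real sheets.

```python
import pandas as pd

# Stand-in frames for illustration only.
orders = pd.DataFrame({"Order ID": [13959, 13959, 88522]})
returns = pd.DataFrame({"Order ID": [65, 13959]})

# Compare at the Order ID level, so drop repeated IDs first.
overlap = pd.merge(
    orders.drop_duplicates(subset="Order ID"),
    returns.drop_duplicates(subset="Order ID"),
    on="Order ID",
    how="outer",
    indicator=True,  # adds a `_merge` column
)

counts = overlap["_merge"].value_counts().to_dict()
print(counts)
```

On the real data, the three counts would correspond to Order IDs found only in `order_df`, only in `return_df`, and in both.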

[21]: order_ids_in_order_df = set(order_df['Order ID'])
order_ids_in_return_df = set(return_df['Order ID'])

common_order_ids = order_ids_in_order_df.intersection(order_ids_in_return_df)
unique_order_ids_in_order_df = order_ids_in_order_df - common_order_ids
unique_order_ids_in_return_df = order_ids_in_return_df - common_order_ids

print(f'{len(unique_order_ids_in_order_df)} unique order IDs in order_df but not in return_df:', unique_order_ids_in_order_df)
print(f'{len(unique_order_ids_in_return_df)} unique order IDs in return_df but not in order_df:', unique_order_ids_in_return_df)
print(f'{len(common_order_ids)} common order IDs:', common_order_ids)

1354 unique order IDs in order_df but not in return_df: {90114, 90115, 90120,
90121, 86041, 90145, 90146, 90147, 90148, 40997, 86050, 86051, 86052, 86053,
90154, 86054, 86063, 90160, 86064, 90166, 90167, 86075, 86076, 86077, 8257,
90178, 86085, 86086, 90185, 90186, 90187, 86092, 90189, 90190, 90192, 90193,
86101, 86102, 86103, 86104, 90201, 32869, 86118, 86119, 90218, 86122, 86123,
86124, 90236, 90237, 90238, 90239, 86144, 86145, 90244, 90248, 86153, 90258,
86163, 86164, 86165, 86166, 90264, 90265, 86173, 90270, 90271, 53410, 16547,
86181, 86184, 86189, 86190, 86191, 86192, 90291, 90292, 90296, 90301, 90303,
12480, 90309, 90314, 86220, 86221, 86222, 90322, 86227, 90327, 86233, 86234,
90333, 90334, 90335, 90337, 90338, 90339, 53476, 86250, 90353, 90354, 86258,
90359, 90360, 90361, 90362, 86263, 86264, 86267, 86268, 86279, 90378, 86283,
86284, 90385, 90386, 90387, 86297, 86307, 90404, 90405, 24869, 41253, 90408,

16676, 86308, 86309, 86310, 86311, 90414, 90415, 86327, 86331, 90430, 90431,
90432, 86338, 45380, 90437, 90438, 90439, 86346, 90449, 86356, 86357, 90460,
90461, 90462, 86368, 86369, 90469, 86373, 359, 90473, 86382, 90479, 90480,
86383, 86384, 90488, 90491, 90492, 90493, 86397, 90500, 90501, 90502, 86409,
86410, 86411, 90513, 90514, 86422, 86427, 90524, 90525, 86432, 90530, 90531,
90532, 90533, 90538, 90539, 90540, 86447, 86448, 86454, 90551, 86459, 86460,
90557, 86465, 57794, 86466, 90568, 90577, 90578, 86486, 90583, 86489, 86490,
86491, 90588, 90589, 90593, 90594, 45539, 90596, 90597, 86500, 90600, 90601,
90602, 86507, 86508, 86509, 86514, 90612, 90613, 86520, 90621, 86527, 90624,
86528, 86529, 90630, 90631, 86534, 86535, 86536, 86544, 90641, 86545, 86546,
86547, 86548, 90646, 86555, 86556, 90653, 548, 86565, 90662, 86566, 86567,
90669, 86573, 86574, 86575, 90674, 90675, 90678, 90685, 86591, 86592, 90695,
86599, 86600, 90706, 86610, 86611, 86612, 90710, 90714, 86621, 90724, 90725,
86629, 86633, 90731, 90735, 86639, 90739, 86645, 86646, 90750, 90751, 90752,
90753, 86654, 86655, 646, 86662, 29319, 86668, 90766, 90767, 90771, 86686,
86687, 86688, 37537, 90786, 90787, 41636, 86693, 86694, 29350, 86699, 90796,
90800, 90806, 90814, 90815, 53953, 90818, 90819, 90820, 90821, 86722, 86723,
86724, 86725, 86734, 86735, 90832, 90833, 90837, 90844, 86750, 86751, 86752,
86753, 90850, 86754, 90853, 90854, 90855, 4839, 90859, 90860, 90861, 86767,
86768, 90867, 90871, 90880, 90881, 45824, 86789, 86790, 86791, 90888, 86792,
86793, 90891, 86794, 86795, 86796, 90899, 90905, 90908, 90909, 90910, 86812,
86813, 86814, 8994, 86815, 90917, 90922, 86826, 86827, 86828, 90927, 90932,
86836, 90934, 86837, 86838, 86839, 86846, 86847, 41793, 90951, 90952, 86860,
90961, 90962, 86867, 90964, 86868, 86869, 86870, 86874, 90973, 90977, 37729,
33635, 86883, 86884, 86885, 86886, 86887, 90985, 90986, 90987, 86898, 86899,
86900, 86901, 86902, 91000, 86913, 86914, 91017, 86925, 86926, 86927, 91025,
86933, 91030, 91036, 91041, 91042, 91043, 86949, 86950, 86951, 86952, 91049,
86956, 91053, 91054, 86957, 86958, 91057, 86959, 91059, 91060, 86960, 91062,
91063, 86966, 86973, 962, 91076, 91077, 91078, 86989, 91086, 91087, 91088,
91089, 91090, 87002, 87003, 87004, 87005, 91108, 91109, 91110, 87015, 87016,
91115, 91116, 87020, 91122, 91123, 87029, 87030, 91127, 87031, 87032, 91130,
91131, 87033, 87041, 87042, 87043, 91144, 87057, 91166, 91167, 87071, 87072,
87076, 87077, 17446, 91174, 91175, 87078, 87079, 91180, 87086, 87087, 91194,
91195, 91200, 91201, 21572, 9285, 87109, 87110, 91209, 91212, 91213, 87117,
91219, 91228, 91229, 87134, 87135, 13408, 54369, 91235, 91236, 37987, 87146,
87147, 91244, 91245, 87148, 87160, 87161, 91258, 87162, 91261, 91262, 91263,
21636, 87175, 87176, 87177, 87178, 91277, 87186, 87187, 91285, 91286, 87193,
87194, 87195, 91296, 91297, 91298, 91304, 91305, 91306, 87208, 91310, 87214,
91316, 87221, 87222, 91321, 91328, 38080, 29889, 87234, 38087, 87240, 87243,
87244, 87245, 91344, 91354, 91355, 87258, 87259, 87260, 34017, 91362, 91363,
17636, 91365, 91366, 87272, 91371, 87277, 91376, 87285, 87286, 87287, 91386,
91388, 91389, 87296, 87297, 87298, 87299, 58628, 91397, 91398, 87306, 91407,
91408, 87316, 87317, 91414, 91415, 91416, 91417, 91424, 13606, 54567, 91432,
91433, 91435, 91436, 91437, 91438, 87342, 87345, 87347, 91447, 91451, 87356,
87357, 91454, 87364, 87365, 87366, 91466, 87374, 87378, 87382, 87383, 91480,
91481, 91482, 91488, 91492, 46436, 87396, 91495, 91496, 91502, 87406, 87407,
87408, 91513, 87424, 87425, 91522, 87426, 5509, 9606, 87435, 87436, 91543,
87451, 87452, 91550, 91555, 87463, 87464, 13735, 87473, 87474, 91571, 91575,

91576, 87484, 91581, 87485, 91583, 91584, 87486, 91586, 87487, 87488, 21958,
87511, 50656, 87520, 87525, 87530, 87534, 87535, 87536, 87537, 87552, 87553,
87554, 87555, 87556, 87569, 87570, 87579, 87583, 87584, 87585, 58914, 87586,
87587, 87602, 87603, 87611, 87617, 87618, 87619, 87620, 87630, 87631, 87632,
87633, 87634, 87651, 87652, 42599, 87671, 87672, 87676, 87677, 87678, 87679,
38529, 34435, 22147, 87695, 87696, 87700, 54949, 87720, 87721, 87725, 87726,
87727, 87747, 87748, 87749, 87757, 87765, 87772, 87773, 50917, 26342, 87790,
87795, 87804, 87811, 87812, 87813, 46853, 87823, 87824, 87830, 87831, 87832,
5920, 14115, 46884, 87846, 87847, 87853, 87862, 87877, 87884, 87885, 87888,
87889, 87899, 87900, 5984, 87905, 42852, 87908, 87909, 87915, 87916, 87917,
87933, 87934, 87935, 51072, 87940, 87946, 87947, 87952, 87953, 87954, 87962,
87963, 87964, 87965, 87977, 87978, 87979, 87980, 87993, 87994, 87995, 38852,
42949, 88004, 88014, 88015, 88016, 88017, 88023, 88028, 88029, 88030, 59365,
88039, 88040, 88041, 88048, 88060, 88061, 47108, 55300, 88075, 88083, 88084,
88085, 88093, 88094, 10277, 88101, 88102, 88103, 88104, 88105, 88114, 30785,
34882, 43079, 88135, 88136, 88137, 88151, 88152, 88156, 88157, 55392, 88163,
88164, 88165, 39015, 88173, 88174, 88184, 88185, 88191, 88192, 18561, 88196,
88197, 88198, 88204, 88205, 88212, 88213, 88219, 88220, 39076, 88232, 88233,
88234, 88239, 88240, 88241, 88256, 88265, 88266, 88267, 88268, 88278, 88279,
88280, 88281, 88282, 10464, 22755, 88296, 88297, 88298, 88319, 88320, 14596,
88329, 88330, 88348, 88360, 88361, 88367, 88368, 88371, 88372, 88380, 88387,
88388, 88389, 88390, 88391, 88403, 88404, 88405, 88406, 88410, 88411, 88418,
88425, 88426, 88443, 88444, 88447, 35200, 2433, 88448, 88449, 27013, 47493,
88460, 88461, 88474, 88475, 88479, 88480, 55713, 6562, 14756, 88487, 88502,
88503, 88504, 88511, 14785, 88522, 88527, 88534, 88543, 88544, 88545, 88546,
88547, 88548, 88554, 88555, 88556, 88557, 88558, 88568, 88569, 88570, 88571,
23042, 88579, 88580, 39430, 88587, 88588, 88589, 88590, 88598, 88599, 88600,
88610, 88611, 88612, 88626, 88627, 88632, 88633, 88634, 88644, 88645, 88646,
88656, 88657, 88658, 88666, 88667, 88668, 19042, 88677, 88678, 88679, 88685,
88686, 88692, 88701, 88702, 88713, 88714, 88721, 88722, 88726, 88727, 88728,
88729, 88730, 88731, 88745, 88746, 88753, 88758, 88766, 88781, 88782, 88783,
88784, 88794, 88798, 88814, 88815, 90109, 88819, 90110, 88824, 88825, 88826,
88836, 11013, 88837, 88838, 88839, 88840, 88852, 88857, 88870, 88871, 88879,
88880, 88881, 88882, 88889, 88890, 27456, 88899, 11077, 88905, 88906, 88907,
88908, 88921, 88928, 88929, 88940, 88941, 88942, 88958, 88959, 88971, 88972,
88974, 88975, 88998, 89004, 89005, 89006, 89007, 89008, 89017, 89018, 89019,
89025, 11206, 89039, 89040, 89041, 89047, 89053, 89054, 89055, 3042, 44002,
89059, 89071, 89076, 89077, 89083, 89084, 89092, 89093, 89095, 89096, 89097,
89102, 89106, 89112, 89128, 89129, 89130, 89139, 89140, 89146, 89147, 89148,
3138, 89166, 89174, 89175, 89176, 89184, 89193, 89194, 89199, 89200, 89201,
89202, 89203, 89209, 89211, 48257, 89218, 89219, 89240, 89251, 40101, 56486,
89257, 89258, 89259, 89278, 89279, 89284, 44231, 23751, 89291, 89292, 89293,
89299, 89300, 89301, 89314, 89315, 89316, 36069, 89319, 89320, 89327, 89333,
89334, 89344, 3332, 11527, 89355, 89356, 89360, 89361, 89375, 89376, 40224,
32037, 89389, 89394, 89401, 89402, 89406, 89407, 89408, 3397, 23877, 89414,
89415, 89426, 89431, 89432, 89433, 89434, 89440, 28001, 48483, 89448, 89449,
89450, 89456, 89465, 89481, 89497, 89503, 89504, 89505, 89514, 89515, 89520,
89521, 89522, 89523, 89524, 89525, 11712, 89536, 89537, 7623, 89564, 89571,

89572, 44517, 89579, 89583, 89584, 89585, 89595, 89596, 3585, 89601, 89602,
89608, 89609, 89610, 89611, 89631, 20007, 89639, 89647, 89657, 89658, 89664,
28225, 89665, 89666, 89679, 89680, 89686, 89697, 40547, 36452, 89704, 89705,
89706, 89716, 89726, 24193, 89730, 89729, 89743, 89761, 89762, 32420, 89770,
89775, 89776, 89777, 89787, 89789, 48836, 89801, 89805, 89810, 89818, 89819,
89820, 7909, 57061, 24294, 89835, 89836, 89847, 89848, 89849, 89856, 3841,
89857, 89858, 89869, 89872, 89873, 89874, 89879, 89880, 89885, 20261, 36647,
89897, 89909, 89910, 89915, 85826, 85827, 85828, 89928, 85833, 85834, 85835,
89939, 89940, 89941, 89942, 89943, 89944, 85850, 85857, 85858, 89957, 85865,
85866, 85867, 85868, 89961, 89970, 85880, 89981, 89982, 89983, 89984, 89988,
85893, 85894, 85895, 85896, 85897, 85898, 24455, 89993, 89994, 89999, 90000,
90001, 90002, 90003, 85914, 85915, 85916, 90011, 53153, 85928, 85929, 90026,
90027, 90031, 90032, 85938, 85939, 85940, 90040, 85947, 85948, 85949, 85950,
90043, 12224, 90044, 90048, 32710, 90058, 90059, 85964, 85965, 85966, 90069,
85979, 85980, 85981, 90078, 90079, 8165, 85990, 28647, 85991, 49125, 86002,
86003, 90099, 90103, 90104, 86010, 86011, 86012, 86013, 86014}
1623 unique order IDs in return_df but not in order_df: {20480, 20486, 143384,
135194, 139291, 155675, 131101, 147486, 167965, 53285, 135224, 131130, 151611,
131133, 139326, 163902, 180287, 65, 176190, 36932, 36934, 45127, 151641, 147546,
143450, 155741, 163933, 159838, 57440, 41059, 8292, 8293, 12389, 49255, 176248,
159866, 36992, 36994, 24707, 32901, 4230, 36999, 36998, 155800, 135320, 155804,
180381, 143519, 41120, 32931, 12451, 4261, 57510, 176698, 155834, 139451,
147643, 180412, 147646, 151738, 135356, 12483, 49349, 16582, 32966, 131288,
147672, 143577, 155868, 135389, 131294, 20704, 41186, 32996, 32998, 159992,
151802, 172283, 123132, 176378, 135420, 159997, 41216, 16641, 57600, 176382,
49412, 28928, 20743, 155929, 160025, 180507, 176410, 123166, 147743, 53536,
177534, 12580, 57638, 16679, 4391, 164152, 172344, 123194, 135480, 176443,
172349, 147775, 143679, 12613, 24902, 164184, 143706, 164187, 147804, 127324,
151901, 155999, 160095, 53600, 49510, 178362, 164216, 123258, 131450, 139643,
151933, 156031, 160127, 20864, 176511, 37250, 176536, 180633, 172442, 147869,
164254, 147871, 180638, 180639, 143773, 12704, 12706, 20899, 12710, 29095,
156090, 123323, 164282, 135612, 135613, 160188, 176572, 20934, 172504, 172505,
180698, 160218, 180700, 135643, 156126, 123359, 16864, 152028, 168413, 143838,
164345, 139775, 156159, 4610, 33283, 49668, 37380, 12806, 25095, 53767, 156184,
147993, 139802, 156186, 147996, 156189, 131614, 139807, 180760, 180767, 143898,
135707, 41508, 33317, 45605, 37414, 127545, 131642, 156219, 156220, 164412,
172602, 148031, 25152, 16961, 180798, 160315, 45632, 25157, 139864, 123481,
148056, 123483, 172634, 168539, 160348, 123487, 49762, 612, 12900, 614, 12903,
123512, 143995, 148092, 180861, 127612, 148095, 156287, 160380, 57986, 135806,
168571, 49797, 4738, 45698, 29318, 164504, 152219, 152221, 180894, 160414,
17058, 678, 49830, 152249, 180922, 160441, 135867, 131773, 168635, 176825,
135871, 29376, 160447, 29380, 33477, 710, 37572, 45767, 172761, 131802, 172762,
172764, 135897, 123614, 152282, 168668, 152286, 152287, 29410, 740, 45794,
33510, 21222, 140024, 127737, 144121, 131835, 155647, 172797, 164606, 172799,
144125, 8961, 152317, 49924, 33541, 4864, 775, 127769, 152346, 127773, 148254,
127774, 41760, 144159, 160543, 168733, 168734, 13091, 21286, 45863, 148280,
176955, 131901, 156477, 156478, 181054, 833, 144190, 9027, 49988, 152382,
168766, 29505, 29506, 54086, 181084, 4960, 21346, 33637, 13158, 17255, 54119,

131960, 123769, 164728, 148347, 172921, 156541, 127870, 50048, 17282, 9093, 902,
25478, 25479, 41861, 21383, 54151, 177656, 156568, 123801, 123802, 140187,
172952, 181147, 131998, 123807, 160667, 17313, 50081, 50083, 160668, 144286,
177053, 50087, 13218, 5028, 132024, 152504, 136121, 177081, 156604, 123837,
156605, 156606, 9152, 181181, 144318, 152510, 5059, 5061, 54215, 173016, 123870,
132062, 140255, 136158, 160734, 50147, 54243, 13284, 37860, 37862, 46052, 54245,
169533, 169534, 173048, 160762, 132091, 163839, 132093, 123902, 132095, 58368,
140286, 144382, 9219, 58372, 168958, 177151, 123928, 140312, 164888, 168984,
128028, 168988, 181278, 132127, 164895, 177180, 132152, 156729, 148536, 160825,
144442, 136252, 164926, 173118, 128061, 144444, 144445, 169023, 177214, 54339,
50246, 5189, 138619, 140376, 123994, 173147, 152667, 140381, 181342, 160860,
160861, 169052, 144479, 54368, 17508, 169053, 58470, 13410, 54371, 140409,
177274, 160891, 181372, 132221, 132222, 124031, 177279, 33921, 50307, 58500,
13444, 25735, 132248, 173208, 148634, 124059, 124060, 160924, 38050, 29861,
177336, 132281, 181434, 152761, 136378, 148669, 156862, 144573, 144574, 46276,
50374, 25799, 58566, 148697, 140506, 140507, 169177, 177371, 140510, 144605,
144606, 21729, 25828, 46311, 172029, 136440, 177401, 124154, 132347, 181500,
152826, 132350, 173310, 9472, 50432, 144634, 136444, 17668, 144636, 177407,
5381, 46341, 140568, 181528, 173338, 140571, 128284, 136476, 144670, 148767,
165151, 169244, 54563, 5414, 29991, 46375, 161080, 177464, 156987, 156988,
173373, 181566, 136507, 58688, 128316, 144702, 152895, 21824, 34117, 50501,
161087, 38210, 13638, 140632, 165209, 176120, 136537, 140636, 177501, 146812,
25952, 58720, 38240, 58725, 9574, 42342, 132472, 124282, 165243, 136572, 128381,
140670, 124286, 124287, 144766, 152957, 38272, 50564, 152958, 50566, 42375,
21890, 5511, 140698, 173466, 136606, 157087, 136607, 34209, 169374, 13729,
46497, 181688, 124345, 144824, 153016, 161210, 144827, 161212, 54721, 17858,
58818, 42436, 13765, 148952, 180216, 169434, 144859, 132573, 144862, 9696,
30176, 54755, 9701, 50663, 128504, 153081, 144890, 173563, 157180, 173565,
181755, 140799, 173567, 181759, 128509, 128510, 144894, 38400, 169469, 54787,
181784, 161304, 173594, 165403, 124444, 161310, 157215, 169503, 50721, 9762,
34338, 153144, 149050, 169531, 165436, 149053, 149054, 132669, 140862, 17985,
165437, 42563, 17988, 58949, 153149, 153151, 5699, 46662, 162331, 165465,
161369, 124507, 149084, 181851, 136795, 169565, 9829, 50789, 136825, 124538,
177786, 132733, 26240, 59009, 50818, 38530, 42628, 54914, 50823, 149144, 157337,
157338, 173721, 181914, 136861, 173726, 161437, 169631, 13984, 50850, 13986,
22181, 9895, 59047, 149176, 140985, 132794, 136889, 161466, 177850, 136894,
169662, 59072, 9923, 30403, 38596, 9927, 18119, 157400, 157402, 173786, 145117,
157406, 153310, 128735, 177886, 50914, 34532, 132857, 141049, 157435, 161531,
128765, 157438, 132863, 173822, 173823, 182015, 59139, 26372, 169726, 128767,
46852, 30469, 38661, 178648, 178650, 178651, 149272, 173849, 136988, 124701,
132894, 145181, 169756, 145183, 59171, 42788, 18215, 182072, 157497, 145208,
157499, 165691, 165693, 182075, 153403, 145212, 137021, 128830, 161599, 10054,
42823, 141144, 124761, 153433, 141147, 178011, 173917, 14176, 34658, 42850,
34661, 149368, 178040, 157562, 141178, 141179, 157565, 161657, 124799, 161661,
34689, 145279, 51075, 169855, 22402, 38787, 55172, 173977, 182170, 137113,
169881, 157597, 169884, 149407, 18336, 42912, 173983, 128925, 177051, 161693,
14242, 55203, 6054, 145338, 133052, 141244, 165820, 165823, 145342, 42945,
55235, 10183, 165848, 133081, 182233, 157659, 124892, 149469, 165852, 174047,

128984, 145371, 153564, 153567, 145375, 47078, 47079, 157689, 124921, 169977,
129018, 161786, 178170, 169981, 161791, 47109, 133144, 170011, 170012, 129053,
170015, 47138, 55330, 51239, 165944, 124985, 157753, 137272, 124988, 161848,
178234, 124991, 18496, 141375, 165951, 137275, 178235, 14406, 51271, 47174,
174169, 174171, 182365, 157791, 170079, 6241, 22627, 34916, 18533, 51302,
141432, 174201, 182392, 149627, 170106, 161917, 157822, 133247, 153726, 6272,
43138, 22656, 43140, 39043, 22661, 161944, 141465, 149657, 149658, 149660,
133277, 149661, 141471, 26784, 18593, 157853, 174239, 145563, 161948, 153757,
129182, 153759, 14497, 47265, 39075, 47271, 153784, 133305, 133306, 149691,
141500, 157882, 133310, 174269, 170169, 59585, 153786, 43203, 137406, 170174,
14528, 14534, 178392, 174297, 129241, 141531, 174300, 125149, 125150, 141533,
182492, 145626, 170201, 129246, 26852, 18661, 55526, 35047, 129272, 133369,
166137, 166138, 178425, 153851, 162043, 170236, 129279, 26881, 10498, 18689,
59652, 43269, 137471, 39169, 178431, 22787, 137497, 166170, 157979, 141595,
170265, 141598, 182559, 59680, 137500, 129310, 59683, 162078, 22820, 35110,
35111, 125240, 166200, 182586, 166203, 174395, 129336, 137530, 153915, 129339,
35137, 18753, 170301, 137534, 137535, 55616, 55618, 55623, 129368, 129369,
178520, 149852, 141661, 158047, 51553, 51554, 31073, 6498, 47457, 26982, 51559,
6500, 6502, 170360, 137596, 59776, 18822, 47494, 166296, 182680, 182681, 125339,
182683, 129433, 158110, 166302, 145818, 137630, 170398, 22947, 39333, 10662,
22950, 158136, 166329, 166330, 149947, 158139, 125373, 125374, 141758, 166332,
154040, 162235, 178621, 162238, 55747, 149976, 166360, 174553, 166363, 154072,
149981, 158173, 174559, 43488, 182750, 162265, 162266, 129501, 55776, 43494,
18919, 59879, 14820, 158200, 125433, 166392, 133627, 174584, 182781, 141822,
166398, 145916, 27137, 170494, 55808, 31232, 47620, 6661, 47621, 162330, 125467,
141852, 127516, 133662, 141855, 137755, 137756, 145947, 145950, 162335, 35366,
23076, 6695, 174648, 129592, 137785, 154171, 141884, 178749, 125503, 43585,
19010, 39490, 55877, 31303, 178776, 150105, 133722, 178777, 158300, 141917,
150109, 158302, 166493, 137819, 6757, 14951, 125560, 174713, 125562, 141946,
150138, 150141, 182906, 154233, 154237, 170621, 23168, 39555, 19078, 178840,
150169, 146076, 178846, 55968, 15009, 35492, 10917, 51876, 51879, 178874,
133819, 178875, 150205, 125631, 129727, 43713, 19138, 39619, 150232, 129753,
125659, 133851, 174813, 158430, 142047, 35554, 51940, 150264, 150265, 142073,
146170, 133884, 170747, 137981, 129791, 15106, 35588, 47876, 174872, 138009,
162586, 150299, 158492, 133917, 158494, 133919, 142111, 166686, 174876, 174879,
162590, 56101, 47910, 138040, 146232, 125754, 142139, 129850, 129854, 133951,
170815, 56128, 179006, 52035, 6978, 6979, 174937, 174938, 133979, 150364,
174940, 162650, 142175, 138075, 138078, 27490, 146271, 52068, 162654, 162655,
35687, 15202, 15206, 177243, 158584, 146297, 125818, 134011, 166779, 174971,
158590, 174975, 162683, 129918, 146328, 129947, 179101, 138142, 35744, 7079,
154552, 162744, 125882, 125883, 158652, 125885, 175035, 162745, 129979, 23488,
39872, 31682, 7107, 56257, 15303, 158684, 134111, 130015, 39904, 158713, 166906,
125947, 130042, 138234, 146426, 150527, 154619, 154620, 162812, 23557, 11271,
23559, 39943, 125976, 175128, 175130, 154650, 166940, 142365, 150557, 130077,
130079, 52258, 7203, 35877, 146489, 134202, 166972, 154684, 130110, 154687,
27712, 52288, 44098, 19523, 138303, 23616, 35910, 23619, 56387, 167000, 175195,
179291, 150622, 126046, 27744, 35936, 31844, 7269, 27750, 52327, 162937, 167036,
130174, 134271, 154751, 162943, 11396, 150680, 171161, 158875, 126108, 158878,

19616, 11425, 11426, 40097, 31907, 48293, 48295, 142521, 142523, 158908, 175292,
138428, 163005, 163006, 48321, 56514, 23748, 36038, 40132, 40134, 126169,
175321, 179418, 175324, 142557, 175326, 130267, 179420, 154845, 40160, 36067,
3300, 48353, 142584, 158968, 150780, 154876, 134398, 146685, 154878, 179452,
44292, 19718, 56582, 48391, 171288, 134425, 142617, 126235, 146713, 167197,
163101, 134431, 163102, 32036, 56612, 52518, 175416, 146745, 159034, 150843,
150844, 171322, 163131, 179514, 36160, 48448, 167256, 126297, 175448, 171353,
126300, 159068, 146778, 171357, 146782, 15712, 7521, 28003, 15718, 48486, 48487,
150904, 138616, 163193, 159099, 142716, 150909, 142718, 159103, 52608, 3456,
11648, 52611, 11652, 28037, 130429, 130430, 142744, 134553, 150936, 150938,
167322, 167325, 175512, 171420, 11682, 15778, 40354, 36262, 152121, 167352,
126393, 130489, 163259, 155069, 163261, 138687, 56768, 56769, 3525, 44486,
52678, 175576, 138712, 175578, 138714, 159196, 175580, 146906, 167391, 171483,
138719, 11748, 48615, 126456, 134648, 142840, 126459, 134651, 138744, 171512,
155131, 171515, 138751, 15872, 24066, 3589, 126488, 146971, 167452, 163357,
15904, 44579, 56868, 44583, 175672, 126521, 142905, 175673, 159292, 175676,
167486, 171576, 179768, 130623, 7744, 20036, 52805, 56901, 48710, 134745,
147033, 167516, 151135, 36449, 56930, 56931, 3687, 138872, 126585, 147066,
134779, 179837, 134782, 28291, 7812, 11909, 48773, 11911, 7815, 48775, 134808,
143003, 159387, 147099, 147102, 155295, 7841, 7845, 20134, 143032, 130746,
155323, 151228, 143036, 143038, 126654, 134846, 3777, 179900, 163518, 163519,
171710, 3783, 126680, 130776, 163544, 161950, 179931, 167646, 28387, 12005,
155641, 161340, 151290, 159484, 134908, 171772, 147197, 147199, 36609, 28419,
16134, 139165, 130840, 163608, 155418, 155420, 130845, 134943, 147231, 12067,
48931, 28455, 126777, 151354, 139065, 180025, 155451, 147260, 126783, 12096,
155455, 163647, 171839, 36676, 44869, 57157, 36679, 32582, 143193, 130905,
151387, 163673, 135005, 126814, 180058, 139100, 36705, 180059, 36707, 180061,
8034, 40802, 40806, 57190, 130936, 139128, 151420, 135037, 139132, 175999,
28544, 155519, 49026, 49027, 36743, 159640, 176024, 176026, 143259, 139160,
143261, 159646, 151455, 143263, 126877, 44962, 176029, 36772, 36773, 4006,
20389, 171935, 57248, 57253, 147384, 126905, 126907, 4037, 8133, 24519, 171994,
143325, 171998, 151519, 131039, 139231, 136765, 49123, 20453, 12262, 12263,
135160, 163833, 172026, 147452, 126973, 167935}
11 common order IDs: {37760, 8353, 59937, 17155, 37924, 54595, 55874, 13959,
47813, 56452, 7364}
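The three groups printed above (IDs only in one table, IDs only in the other, and IDs in both) can be produced with plain set operations on the two `Order ID` columns. The following is a minimal sketch, not the notebook's own cell: `compare_ids` is a hypothetical helper, and the two small frames are illustrative stand-ins for `order_df` and `return_df`.

```python
import pandas as pd

def compare_ids(left: pd.DataFrame, right: pd.DataFrame, key: str = "Order ID"):
    """Return (only_in_left, only_in_right, common) as sets of key values."""
    left_ids = set(left[key].tolist())
    right_ids = set(right[key].tolist())
    return left_ids - right_ids, right_ids - left_ids, left_ids & right_ids

# Illustrative stand-ins; the IDs are made up, not taken from the dataset.
orders = pd.DataFrame({"Order ID": [1, 2, 2, 3]})
returns = pd.DataFrame({"Order ID": [3, 4], "Status": ["Returned", "Returned"]})

only_orders, only_returns, common = compare_ids(orders, returns)
print(f"{len(only_orders)} only in orders, "
      f"{len(only_returns)} only in returns, {len(common)} common")
```

Because sets discard duplicates, an order that spans several rows counts once, which matches the "unique order IDs" wording of the output.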

[22]: order_df[order_df["Returned"] == 1]


[22]: Row ID Order Priority Discount Unit Price Shipping Cost Customer ID \
68 1950 Medium 0.01 4.91 0.50 117
69 1951 Medium 0.09 4.00 1.30 117
171 5302 High 0.01 8.33 1.99 308
256 1147 Medium 0.08 2.94 0.96 491
294 2368 Medium 0.00 6.88 2.00 553
346 7893 Not Specified 0.00 236.97 59.24 640
588 6711 High 0.00 6.68 5.66 1044
689 7632 Medium 0.09 130.98 30.00 1217
692 7810 Medium 0.00 7.10 6.05 1228
693 7811 Medium 0.01 4.98 4.62 1228
694 7812 Medium 0.06 5.68 1.39 1228
968 8389 High 0.02 30.98 17.08 1733
1205 1008 High 0.09 16.98 12.39 2189
1513 5338 High 0.05 165.20 19.99 2670
1514 5339 High 0.09 17.99 8.65 2670

Customer Name Ship Mode Customer Segment Product Category \


68 Linda Weiss Regular Air Home Office Office Supplies
69 Linda Weiss Express Air Home Office Office Supplies
171 Glen Caldwell Regular Air Small Business Technology
256 Toni Swanson Regular Air Consumer Office Supplies
294 Kristine Connolly Express Air Home Office Office Supplies
346 Neal Wolfe Delivery Truck Consumer Furniture
588 Erin Ballard Regular Air Home Office Office Supplies
689 Billy Perry Browning Delivery Truck Small Business Furniture
692 Hazel Jennings Regular Air Small Business Office Supplies
693 Hazel Jennings Express Air Small Business Technology
694 Hazel Jennings Regular Air Small Business Office Supplies
968 Nina Horne Kelly Regular Air Small Business Office Supplies
1205 Frank Cross Regular Air Corporate Office Supplies
1513 Yvonne Mann Regular Air Home Office Office Supplies
1514 Yvonne Mann Regular Air Home Office Office Supplies

Product Sub-Category Product Container \


68 Labels Small Box
69 Paper Wrap Bag
171 Computer Peripherals Small Pack
256 Pens & Art Supplies Wrap Bag
294 Paper Wrap Bag
346 Tables Jumbo Box
588 Paper Small Box
689 Chairs & Chairmats Jumbo Drum

692 Binders and Binder Accessories Small Box
693 Computer Peripherals Small Pack
694 Envelopes Small Box
968 Paper Small Box
1205 Envelopes Small Box
1513 Storage & Organization Small Box
1514 Pens & Art Supplies Small Box

Product Name Product Base Margin \


68 Avery 493 0.36
69 EcoTones® Memo Sheets 0.37
171 80 Minute Slim Jewel Case CD-R , 10/Pack - Sta… 0.52
256 Newell 343 0.58
294 Adams Phone Message Book, 200 Message Capacity… 0.39
346 Chromcraft Rectangular Conference Tables 0.61
588 Xerox 1923 0.37
689 Office Star - Contemporary Task Swivel chair w… 0.78
692 Wilson Jones Hanging View Binder, White, 1" 0.39
693 Imation 3.5", DISKETTE 44766 HGHLD3.52HD/FM, 1… 0.64
694 Staples Standard Envelopes 0.38
968 Xerox 197 0.40
1205 Brown Kraft Recycled Envelopes 0.35
1513 Economy Rollaway Files 0.59
1514 Model L Table or Wall-Mount Pencil Sharpener 0.57

Country Region State or Province City Postal Code \


68 United States West Washington Seattle 98103
69 United States West Washington Seattle 98103
171 United States West Washington Seattle 98115
256 United States East New York New York City 10154
294 United States West California Los Angeles 90008
346 United States West Washington Seattle 98119
588 United States West California Los Angeles 90004
689 United States East Massachusetts Boston 2112
692 United States East Pennsylvania Philadelphia 19140
693 United States East Pennsylvania Philadelphia 19140
694 United States East Pennsylvania Philadelphia 19140
968 United States East District of Columbia Washington 20012
1205 United States East New York New York City 10177
1513 United States West California Los Angeles 90049
1514 United States West California Los Angeles 90049

Order Date Ship Date Profit Quantity ordered new Sales \


68 2015-04-04 2015-04-06 112.060 47 228.46
69 2015-04-04 2015-04-06 16.790 19 77.61
171 2015-02-14 2015-02-15 10.740 32 280.62
256 2015-05-15 2015-05-17 -2.120 23 66.70

294 2015-01-28 2015-01-29 34.068 36 267.53
346 2015-02-14 2015-02-15 1192.040 34 6686.34
588 2015-02-27 2015-02-28 -76.940 90 617.40
689 2015-04-28 2015-05-01 -421.760 41 5258.94
692 2015-02-16 2015-02-17 -60.145 28 208.83
693 2015-02-16 2015-02-18 -111.720 41 228.30
694 2015-02-16 2015-02-16 33.010 24 129.53
968 2015-06-28 2015-06-29 -32.280 13 438.25
1205 2015-05-08 2015-05-10 -48.570 22 381.91
1513 2015-05-29 2015-05-29 2008.710 167 27587.55
1514 2015-05-29 2015-05-29 -80.530 71 1191.58

Order ID Manager Returned


68 13959 William 1
69 13959 William 1
171 37760 William 1
256 8353 Erin 1
294 17155 William 1
346 56452 William 1
588 47813 William 1
689 54595 Erin 1
692 55874 Erin 1
693 55874 Erin 1
694 55874 Erin 1
968 59937 Erin 1
1205 7364 Erin 1
1513 37924 William 1
1514 37924 William 1

[23]: return_df[return_df["Order ID"] == 20480]


[23]: Order ID Status


172 20480 Returned

[24]: order_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1952 entries, 0 to 1951
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----

0 Row ID 1952 non-null int64
1 Order Priority 1952 non-null object
2 Discount 1952 non-null float64
3 Unit Price 1952 non-null float64
4 Shipping Cost 1952 non-null float64
5 Customer ID 1952 non-null int64
6 Customer Name 1952 non-null object
7 Ship Mode 1952 non-null object
8 Customer Segment 1952 non-null object
9 Product Category 1952 non-null object
10 Product Sub-Category 1952 non-null object
11 Product Container 1952 non-null object
12 Product Name 1952 non-null object
13 Product Base Margin 1936 non-null float64
14 Country 1952 non-null object
15 Region 1952 non-null object
16 State or Province 1952 non-null object
17 City 1952 non-null object
18 Postal Code 1952 non-null int64
19 Order Date 1952 non-null datetime64[ns]
20 Ship Date 1952 non-null datetime64[ns]
21 Profit 1952 non-null float64
22 Quantity ordered new 1952 non-null int64
23 Sales 1952 non-null float64
24 Order ID 1952 non-null int64
25 Manager 1952 non-null object
26 Returned 1952 non-null int64
dtypes: datetime64[ns](2), float64(6), int64(6), object(13)
memory usage: 427.0+ KB

[25]: order_df['Country'].unique()


[25]: array(['United States'], dtype=object)

[26]: order_df['Delivery Time'] = pd.to_datetime(order_df['Ship Date']) - pd.to_datetime(order_df['Order Date'])

order_df = order_df.drop(columns=['Row ID', 'Customer Name', 'Ship Date', 'Order Date', 'Country'])

order_df = order_df[['Order ID', 'Order Priority', 'Discount', 'Unit Price', 'Shipping Cost',
                     'Quantity ordered new', 'Sales', 'Profit', 'Delivery Time',
                     'Ship Mode', 'Customer Segment', 'Product Category', 'Product Sub-Category',
                     'Product Container', 'Product Name', 'Product Base Margin',
                     'Region', 'State or Province', 'City', 'Postal Code', 'Manager', 'Returned']]

order_df = order_df.sort_values(by='Order ID')
order_df.reset_index(drop=True, inplace=True)
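Subtracting one datetime column from another yields a timedelta column, which is why `Delivery Time` is displayed as values such as `1 days`. If a plain integer is more convenient for later aggregation, `.dt.days` extracts it. A small sketch, using three order/ship date pairs that appear in the returned-orders table above; the column name `Delivery Days` is an assumption introduced here for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "Order Date": pd.to_datetime(["2015-04-04", "2015-02-14", "2015-04-28"]),
    "Ship Date": pd.to_datetime(["2015-04-06", "2015-02-15", "2015-05-01"]),
})

# Timedelta column, same construction as in the notebook cell
demo["Delivery Time"] = demo["Ship Date"] - demo["Order Date"]

# Hypothetical extra column: whole days as integers
demo["Delivery Days"] = demo["Delivery Time"].dt.days
print(demo)
```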


[27]: order_df


[27]: Order ID Order Priority Discount Unit Price Shipping Cost \


0 359 Medium 0.08 124.49 51.94
1 548 Critical 0.04 3.08 0.99
2 548 Critical 0.02 6.48 5.90
3 548 Critical 0.04 125.99 4.20
4 646 High 0.01 9.31 3.98
… … … … … …
1947 91576 Low 0.04 880.98 44.55
1948 91581 Not Specified 0.01 145.45 17.85
1949 91583 High 0.01 28.99 8.59
1950 91584 Not Specified 0.10 5.18 5.74
1951 91586 Medium 0.03 85.99 0.99

Quantity ordered new Sales Profit Delivery Time Ship Mode \


0 56 6831.37 -500.38000 1 days Delivery Truck
1 75 236.87 36.02000 1 days Regular Air

2 53 370.91 -50.64000 1 days Regular Air
3 47 4976.92 510.48900 2 days Regular Air
4 61 586.96 -10.90000 1 days Regular Air
… … … … … …
1947 8 6901.25 4233.25880 4 days Delivery Truck
1948 8 1214.03 837.68070 1 days Delivery Truck
1949 21 556.61 196.52328 1 days Regular Air
1950 2 10.96 -29.00300 2 days Regular Air
1951 20 1503.05 1037.10450 1 days Regular Air

Customer Segment Product Category Product Sub-Category \


0 Corporate Furniture Tables
1 Home Office Office Supplies Labels
2 Home Office Office Supplies Paper
3 Home Office Technology Telephones and Communication
4 Small Business Office Supplies Scissors, Rulers and Trimmers
… … … …
1947 Consumer Furniture Bookcases
1948 Corporate Technology Office Machines
1949 Home Office Technology Telephones and Communication
1950 Corporate Office Supplies Binders and Binder Accessories
1951 Home Office Technology Telephones and Communication

Product Container Product Name \


0 Jumbo Box Bevis 36 x 72 Conference Tables
1 Small Box Avery 481
2 Small Box Xerox 1976
3 Small Box V3682
4 Small Pack Acme® Forged Steel Scissors with Black Enamel …
… … …
1947 Jumbo Box Riverside Palais Royal Lawyers Bookcase, Royal…
1948 Jumbo Drum Panasonic KX-P1150 Dot Matrix Printer
1949 Medium Box SouthWestern Bell FA970 Digital Answering Mach…
1950 Small Box Wilson Jones Impact Binders
1951 Wrap Bag Accessory34

Product Base Margin Region State or Province City \


0 0.63 West California Los Angeles
1 0.37 Central Texas Houston
2 0.37 Central Texas Houston
3 0.59 Central Texas Houston
4 0.56 Central Texas Dallas
… … … … …
1947 0.62 West Nevada Las Vegas
1948 0.56 Central Texas Burleson
1949 0.56 West New Mexico Clovis
1950 0.36 West California Dublin

1951 0.55 West Idaho Coeur D Alene

Postal Code Manager Returned


0 90008 William 0
1 77041 Chris 0
2 77041 Chris 0
3 77041 Chris 0
4 75220 Chris 0
… … … …
1947 89115 William 0
1948 76028 Chris 0
1949 88101 William 0
1950 94568 William 0
1951 83814 William 0

[1952 rows x 22 columns]

[28]: order_df['Order Priority'].unique()


[28]: array(['Medium', 'Critical', 'High', 'Low', 'Not Specified', 'Critical '],
      dtype=object)

[29]: order_df['Order Priority'] = order_df['Order Priority'].str.strip()


[30]: order_df['Order Priority'].unique()


[30]: array(['Medium', 'Critical', 'High', 'Low', 'Not Specified'], dtype=object)
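The stray `'Critical '` label differs from `'Critical'` only by a trailing space, yet pandas treats the two as distinct categories, which would split that priority level in any later `groupby` or count. A short sketch on a made-up Series shows the effect of `str.strip()`:

```python
import pandas as pd

# Made-up labels; one carries a trailing space
priority = pd.Series(["Critical", "Critical ", "High", "Low"])
n_before = priority.nunique()

cleaned = priority.str.strip()
n_after = cleaned.nunique()

print(n_before, "distinct labels before,", n_after, "after stripping")
print(cleaned.value_counts())
```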

[31]: order_df['Sales'].corr(order_df['Profit'], method='spearman')


[31]: 0.27474678109322304

[32]: # Spearman correlation on the full data, as a baseline
correlation = order_df['Sales'].corr(order_df['Profit'], method='spearman')
print("Correlation between Sales and Profit before removing outliers:", correlation)

# Standardize Sales and Profit (z-scores)
z_scores_sales = (order_df['Sales'] - order_df['Sales'].mean()) / order_df['Sales'].std()
z_scores_profit = (order_df['Profit'] - order_df['Profit'].mean()) / order_df['Profit'].std()

# Remove outliers: drop rows where either |z| exceeds 1.96
threshold = 1.96
outliers_mask = (abs(z_scores_sales) > threshold) | (abs(z_scores_profit) > threshold)

filtered_order_df = order_df[~outliers_mask]
filtered_correlation = filtered_order_df['Sales'].corr(filtered_order_df['Profit'], method='spearman')
print("Correlation between Sales and Profit after removing outliers:", filtered_correlation)

Correlation between Sales and Profit before removing outliers: 0.27474678109322304
Correlation between Sales and Profit after removing outliers: 0.2405031747411026
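The z-score filter in cell [32] can be wrapped in a small helper so the same threshold applies to any set of numeric columns. This is only a sketch: `zscore_mask` is a hypothetical name, and the toy frame is made up so that exactly one row stands out.

```python
import pandas as pd

def zscore_mask(df: pd.DataFrame, columns, threshold: float = 1.96) -> pd.Series:
    """True for rows where any listed column has |z-score| above threshold."""
    z = (df[columns] - df[columns].mean()) / df[columns].std()
    return (z.abs() > threshold).any(axis=1)

# Made-up data: the last Sales value is far from the rest
toy = pd.DataFrame({
    "Sales": [10.0, 12.0, 11.0, 13.0, 9.0, 11.0, 10.0, 12.0, 500.0],
    "Profit": [1.0, 2.0, 1.5, 2.5, 0.5, 1.5, 1.0, 2.0, 3.0],
})

mask = zscore_mask(toy, ["Sales", "Profit"])
filtered = toy[~mask]
print(filtered["Sales"].corr(filtered["Profit"], method="spearman"))
```

Spearman correlation works on ranks, so it is already less sensitive to extreme values than Pearson; that may be why the coefficient in the notebook changes only modestly after filtering.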

[33]: order_df


[33]: Order ID Order Priority Discount Unit Price Shipping Cost \


0 359 Medium 0.08 124.49 51.94
1 548 Critical 0.04 3.08 0.99
2 548 Critical 0.02 6.48 5.90
3 548 Critical 0.04 125.99 4.20
4 646 High 0.01 9.31 3.98
… … … … … …
1947 91576 Low 0.04 880.98 44.55
1948 91581 Not Specified 0.01 145.45 17.85
1949 91583 High 0.01 28.99 8.59
1950 91584 Not Specified 0.10 5.18 5.74
1951 91586 Medium 0.03 85.99 0.99

Quantity ordered new Sales Profit Delivery Time Ship Mode \


0 56 6831.37 -500.38000 1 days Delivery Truck
1 75 236.87 36.02000 1 days Regular Air
2 53 370.91 -50.64000 1 days Regular Air
3 47 4976.92 510.48900 2 days Regular Air
4 61 586.96 -10.90000 1 days Regular Air
… … … … … …
1947 8 6901.25 4233.25880 4 days Delivery Truck
1948 8 1214.03 837.68070 1 days Delivery Truck
1949 21 556.61 196.52328 1 days Regular Air
1950 2 10.96 -29.00300 2 days Regular Air
1951 20 1503.05 1037.10450 1 days Regular Air

Customer Segment Product Category Product Sub-Category \


0 Corporate Furniture Tables
1 Home Office Office Supplies Labels
2 Home Office Office Supplies Paper
3 Home Office Technology Telephones and Communication
4 Small Business Office Supplies Scissors, Rulers and Trimmers
… … … …
1947 Consumer Furniture Bookcases
1948 Corporate Technology Office Machines
1949 Home Office Technology Telephones and Communication
1950 Corporate Office Supplies Binders and Binder Accessories
1951 Home Office Technology Telephones and Communication

Product Container Product Name \


0 Jumbo Box Bevis 36 x 72 Conference Tables

1 Small Box Avery 481
2 Small Box Xerox 1976
3 Small Box V3682
4 Small Pack Acme® Forged Steel Scissors with Black Enamel …
… … …
1947 Jumbo Box Riverside Palais Royal Lawyers Bookcase, Royal…
1948 Jumbo Drum Panasonic KX-P1150 Dot Matrix Printer
1949 Medium Box SouthWestern Bell FA970 Digital Answering Mach…
1950 Small Box Wilson Jones Impact Binders
1951 Wrap Bag Accessory34

Product Base Margin Region State or Province City \


0 0.63 West California Los Angeles
1 0.37 Central Texas Houston
2 0.37 Central Texas Houston
3 0.59 Central Texas Houston
4 0.56 Central Texas Dallas
… … … … …
1947 0.62 West Nevada Las Vegas
1948 0.56 Central Texas Burleson
1949 0.56 West New Mexico Clovis
1950 0.36 West California Dublin
1951 0.55 West Idaho Coeur D Alene

Postal Code Manager Returned


0 90008 William 0
1 77041 Chris 0
2 77041 Chris 0
3 77041 Chris 0
4 75220 Chris 0
… … … …
1947 89115 William 0
1948 76028 Chris 0
1949 88101 William 0
1950 94568 William 0
1951 83814 William 0

[1952 rows x 22 columns]

Report 1: Manager/Regional Report

[34]: region_groups = order_df.groupby('Region')

      region_summary = region_groups.agg({'Sales': 'sum', 'Profit': 'sum',
                                          'Shipping Cost': 'mean', 'Delivery Time': 'mean'})


[35]: sales = region_summary['Sales']


profits = region_summary['Profit']
avg_shipping_cost = region_summary['Shipping Cost']
avg_delivery_time = region_summary['Delivery Time']


[36]: sales.plot(kind='bar', color='skyblue', position=0.5, width=0.4, label='Total Sales')
      profits.plot(kind='bar', color='orange', position=-0.5, width=0.4, label='Total Profits')

      # Annotate each bar with its value; .iloc gives positional access
      # because both Series are indexed by region name
      for i in range(len(sales)):
          plt.text(i, sales.iloc[i], f'{sales.iloc[i]:,.0f}', ha='center', va='top', color='black')
          plt.text(i + 0.38, profits.iloc[i], f'{profits.iloc[i]:,.0f}', ha='center', va='top', color='black')

      plt.title('Total Sales and Profits by Region')
      plt.xlabel('Region')
      plt.ylabel('Amount')
      plt.xticks(rotation=45)
      plt.legend()
      plt.show()

Summarize product categories, profit, discount, delivery time, and order priority for each region
[37]: regions = list(order_df["Region"].unique())

      for region in regions:
          print(f"Region: {region}")
          print("################")

          # Filter DataFrame for the current region
          region_df = order_df[order_df['Region'] == region]

          # Print only the top 3 occurrences of each product category
          product_category_counts = region_df['Product Category'].value_counts().head(3)
          print("Top 3 occurrences of each Product Category:")
          print(product_category_counts)
          print("################")

          # Calculate average profits associated with each product category
          avg_profits_by_category = region_df.groupby('Product Category')['Profit'].mean()
          print("Average Profits for each Product Category:")
          print(avg_profits_by_category)
          print("################")

          # Calculate average discount associated with each product category
          avg_discount_by_category = region_df.groupby('Product Category')['Discount'].mean()
          print("Average Discount for each Product Category:")
          print(avg_discount_by_category)
          print("################")

          # Calculate average product base margin associated with each product category
          avg_margin_by_category = region_df.groupby('Product Category')['Product Base Margin'].mean()
          print("Average Product Base Margin for each Product Category:")
          print(avg_margin_by_category)
          print("################")

          # Count occurrence of each Order Priority
          order_priority_counts = region_df['Order Priority'].value_counts()
          print("Occurrence of each Order Priority:")
          print(order_priority_counts)
          print("################")

          # Calculate average delivery time for each Order Priority
          average_delivery_time = region_df.groupby('Order Priority')['Delivery Time'].mean()
          print("Average Delivery Time for each Order Priority:")
          print(average_delivery_time)
          print("################")

          # Calculate average shipping cost for the region
          average_shipping_cost = region_df['Shipping Cost'].mean()
          print("Average Shipping Cost for the region:", average_shipping_cost)
          print("################")

          # Calculate average profit per order
          total_profit = region_df['Profit'].sum()
          total_orders = len(region_df['Order ID'].unique())
          average_profit_per_order = total_profit / total_orders
          print("Average Profit per Order:", average_profit_per_order)
          print("################")

          # Count the number of rows where 'Returned' column has a value of 1
          returned_count = (region_df['Returned'] == 1).sum()
          print(f"Number of rows where 'Returned' column has a value of 1: {returned_count}/{total_orders}")
          print("#####################################")

          # Add blank lines for readability between regions
          print()
          print()

Region: West
################
Top 3 occurrences of each Product Category:
Office Supplies 253
Technology 129
Furniture 88
Name: Product Category, dtype: int64
################
Average Profits for each Product Category:
Product Category
Furniture 608.474841
Office Supplies 48.236487
Technology 78.257155
Name: Profit, dtype: float64
################
Average Discount for each Product Category:
Product Category
Furniture 0.049659
Office Supplies 0.048577
Technology 0.044264
Name: Discount, dtype: float64
################
Average Product Base Margin for each Product Category:
Product Category
Furniture 0.602706
Office Supplies 0.468571
Technology 0.559457
Name: Product Base Margin, dtype: float64
################
Occurrence of each Order Priority:
Low 112
Not Specified 105
High 93
Medium 89
Critical 71
Name: Order Priority, dtype: int64
################
Average Delivery Time for each Order Priority:

Order Priority
Critical 1 days 06:45:38.028169014
High 1 days 10:19:21.290322580
Low 4 days 06:12:51.428571428
Medium 1 days 09:42:28.314606741
Not Specified 1 days 06:51:25.714285714
Name: Delivery Time, dtype: timedelta64[ns]
################
Average Shipping Cost for the region: 12.733872340425533
################
Average Profit per Order: 225.7285419702381
################
Number of rows where 'Returned' column has a value of 1: 8/336
#####################################

Region: Central
################
Top 3 occurrences of each Product Category:
Office Supplies 315
Technology 131
Furniture 120
Name: Product Category, dtype: int64
################
Average Profits for each Product Category:
Product Category
Furniture 41.845431
Office Supplies 89.946268
Technology 335.961425
Name: Profit, dtype: float64
################
Average Discount for each Product Category:
Product Category
Furniture 0.050583
Office Supplies 0.047968
Technology 0.049466
Name: Discount, dtype: float64
################
Average Product Base Margin for each Product Category:
Product Category
Furniture 0.583590
Office Supplies 0.457238
Technology 0.576489
Name: Product Base Margin, dtype: float64
################
Occurrence of each Order Priority:
Critical 118
High 116

Not Specified 114
Low 114
Medium 104
Name: Order Priority, dtype: int64
################
Average Delivery Time for each Order Priority:
Order Priority
Critical 1 days 10:34:34.576271186
High 1 days 09:55:51.724137931
Low 3 days 20:25:15.789473684
Medium 1 days 11:46:09.230769230
Not Specified 1 days 10:56:50.526315789
Name: Delivery Time, dtype: timedelta64[ns]
################
Average Shipping Cost for the region: 12.575618374558303
################
Average Profit per Order: 193.41368167150003
################
Number of rows where 'Returned' column has a value of 1: 0/400
#####################################

Region: East
################
Top 3 occurrences of each Product Category:
Office Supplies 265
Technology 110
Furniture 99
Name: Product Category, dtype: int64
################
Average Profits for each Product Category:
Product Category
Furniture -4.676507
Office Supplies 192.858727
Technology 314.971045
Name: Profit, dtype: float64
################
Average Discount for each Product Category:
Product Category
Furniture 0.048889
Office Supplies 0.051170
Technology 0.046818
Name: Discount, dtype: float64
################
Average Product Base Margin for each Product Category:
Product Category
Furniture 0.606224
Office Supplies 0.468365

Technology 0.557636
Name: Product Base Margin, dtype: float64
################
Occurrence of each Order Priority:
Critical 106
Medium 101
Not Specified 94
High 89
Low 84
Name: Order Priority, dtype: int64
################
Average Delivery Time for each Order Priority:
Order Priority
Critical 1 days 12:00:00
High 1 days 08:37:45.168539325
Low 3 days 16:51:25.714285714
Medium 1 days 09:44:33.267326732
Not Specified 1 days 13:01:16.595744680
Name: Delivery Time, dtype: timedelta64[ns]
################
Average Shipping Cost for the region: 13.79957805907173
################
Average Profit per Order: 264.0600725882353
################
Number of rows where 'Returned' column has a value of 1: 7/323
#####################################

Region: South
################
Top 3 occurrences of each Product Category:
Office Supplies 238
Technology 111
Furniture 93
Name: Product Category, dtype: int64
################
Average Profits for each Product Category:
Product Category
Furniture 12.313790
Office Supplies -8.905288
Technology -121.169173
Name: Profit, dtype: float64
################
Average Discount for each Product Category:
Product Category
Furniture 0.049892
Office Supplies 0.049370
Technology 0.050721

Name: Discount, dtype: float64
################
Average Product Base Margin for each Product Category:
Product Category
Furniture 0.592299
Office Supplies 0.465294
Technology 0.557568
Name: Product Base Margin, dtype: float64
################
Occurrence of each Order Priority:
Critical 96
High 93
Low 88
Not Specified 83
Medium 82
Name: Order Priority, dtype: int64
################
Average Delivery Time for each Order Priority:
Order Priority
Critical 1 days 08:45:00
High 1 days 04:38:42.580645161
Low 4 days 09:00:00
Medium 1 days 08:11:42.439024390
Not Specified 1 days 10:07:13.734939759
Name: Delivery Time, dtype: timedelta64[ns]
################
Average Shipping Cost for the region: 12.828303167420815
################
Average Profit per Order: -44.24556558113497
################
Number of rows where 'Returned' column has a value of 1: 0/326
#####################################

• The South region's order count (326) is comparable to that of the other regions (323 to 400).
• Low-priority orders have the longest average delivery time (about 4 days); the other priorities
differ little from one another (roughly 1.2 to 1.5 days).
• The South region loses heavily on Technology (about -$121 per line item on average), yet has no returned orders.
• The South region gives a slightly higher average discount on Technology (about 5.1%) than the other regions (4.4% to 4.9%).
• On average, the South region loses about $44 per order.
• There are cases where sales are significantly smaller than profit.
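The returned figures printed per region divide a row count by a count of unique orders, so the ratio is not quite an order-level return rate. A minimal sketch of the difference follows, using a small hypothetical list of `(Order ID, Returned)` pairs rather than the real `order_df`; the helper name `return_rate_by_order` is an assumption made for illustration.

```python
# Hypothetical (Order ID, Returned) rows; one order can span several rows.
rows = [
    (548, 0), (548, 0), (548, 0),
    (646, 1), (646, 1),
    (962, 0),
]


def return_rate_by_order(rows):
    """Share of distinct orders that have at least one returned row."""
    returned_by_order = {}
    for order_id, returned in rows:
        already = returned_by_order.get(order_id, False)
        returned_by_order[order_id] = already or returned == 1
    return sum(returned_by_order.values()) / len(returned_by_order)


# Row-level numerator over order-level denominator, as in the per-region loop
returned_rows = sum(1 for _, returned in rows if returned == 1)
unique_orders = len({order_id for order_id, _ in rows})
print(f"returned rows / unique orders: {returned_rows}/{unique_orders}")

# Order-level rate: both numerator and denominator count distinct orders
print(f"order-level return rate: {return_rate_by_order(rows):.3f}")
```

In this toy data two rows are flagged as returned but they belong to a single order, so the row-based ratio overstates the share of returned orders.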

Report 2: Best ship mode
[38]: order_df['Delivery Time Seconds'] = order_df['Delivery Time'].dt.total_seconds()


[39]: average_shipping_cost = order_df.groupby('Ship Mode')['Shipping Cost'].mean()
      average_delivery_time = order_df.groupby('Ship Mode')['Delivery Time Seconds'].mean()

      plt.figure(figsize=(10, 6))

      # Left panel: average shipping cost per ship mode
      plt.subplot(1, 2, 1)
      average_shipping_cost.plot(kind='bar', color='skyblue')
      plt.title('Average Shipping Cost by Shipping Mode')
      plt.xlabel('Shipping Mode')
      plt.ylabel('Average Shipping Cost')

      # Right panel: average delivery time per ship mode
      plt.subplot(1, 2, 2)
      average_delivery_time.plot(kind='bar', color='lightgreen')
      plt.title('Average Delivery Time by Shipping Mode')
      plt.xlabel('Shipping Mode')
      plt.ylabel('Average Delivery Time (seconds)')

      plt.tight_layout()
      plt.show()

• Delivery Truck is by far the most expensive shipping mode.
• Regular Air has the longest average delivery time.
• Express Air is the best shipping mode available.
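The delivery-time panel is plotted in seconds, which makes the bars hard to read. As a small sketch, the values can be expressed in days before plotting; the sample values below are copied from the `Delivery Time Seconds` column shown in this notebook, and the helper name `seconds_to_days` is an assumption made for illustration.

```python
from datetime import timedelta

# Number of seconds in one day, derived rather than hard-coded
SECONDS_PER_DAY = timedelta(days=1).total_seconds()


def seconds_to_days(seconds):
    """Convert a duration in seconds to fractional days."""
    return seconds / SECONDS_PER_DAY


# Sample values as they appear in the 'Delivery Time Seconds' column
sample_seconds = [86400.0, 172800.0, 345600.0]
sample_days = [seconds_to_days(s) for s in sample_seconds]
mean_days = sum(sample_days) / len(sample_days)

print(sample_days)
print(f"mean delivery time: {mean_days:.2f} days")
```

Applied to the grouped means, the same division by the number of seconds per day would let the y-axis be labelled in days instead of seconds.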
Report 3: Association Rule in Purchased Items
[40]: order_df


[40]: Order ID Order Priority Discount Unit Price Shipping Cost \


0 359 Medium 0.08 124.49 51.94
1 548 Critical 0.04 3.08 0.99
2 548 Critical 0.02 6.48 5.90
3 548 Critical 0.04 125.99 4.20
4 646 High 0.01 9.31 3.98
… … … … … …
1947 91576 Low 0.04 880.98 44.55
1948 91581 Not Specified 0.01 145.45 17.85
1949 91583 High 0.01 28.99 8.59
1950 91584 Not Specified 0.10 5.18 5.74

1951 91586 Medium 0.03 85.99 0.99

Quantity ordered new Sales Profit Delivery Time Ship Mode \


0 56 6831.37 -500.38000 1 days Delivery Truck
1 75 236.87 36.02000 1 days Regular Air
2 53 370.91 -50.64000 1 days Regular Air
3 47 4976.92 510.48900 2 days Regular Air
4 61 586.96 -10.90000 1 days Regular Air
… … … … … …
1947 8 6901.25 4233.25880 4 days Delivery Truck
1948 8 1214.03 837.68070 1 days Delivery Truck
1949 21 556.61 196.52328 1 days Regular Air
1950 2 10.96 -29.00300 2 days Regular Air
1951 20 1503.05 1037.10450 1 days Regular Air

Customer Segment Product Category Product Sub-Category \


0 Corporate Furniture Tables
1 Home Office Office Supplies Labels
2 Home Office Office Supplies Paper
3 Home Office Technology Telephones and Communication
4 Small Business Office Supplies Scissors, Rulers and Trimmers
… … … …
1947 Consumer Furniture Bookcases
1948 Corporate Technology Office Machines
1949 Home Office Technology Telephones and Communication
1950 Corporate Office Supplies Binders and Binder Accessories
1951 Home Office Technology Telephones and Communication

Product Container Product Name \


0 Jumbo Box Bevis 36 x 72 Conference Tables
1 Small Box Avery 481
2 Small Box Xerox 1976
3 Small Box V3682
4 Small Pack Acme® Forged Steel Scissors with Black Enamel …
… … …
1947 Jumbo Box Riverside Palais Royal Lawyers Bookcase, Royal…
1948 Jumbo Drum Panasonic KX-P1150 Dot Matrix Printer
1949 Medium Box SouthWestern Bell FA970 Digital Answering Mach…
1950 Small Box Wilson Jones Impact Binders
1951 Wrap Bag Accessory34

Product Base Margin Region State or Province City \


0 0.63 West California Los Angeles
1 0.37 Central Texas Houston
2 0.37 Central Texas Houston
3 0.59 Central Texas Houston
4 0.56 Central Texas Dallas

… … … … …
1947 0.62 West Nevada Las Vegas
1948 0.56 Central Texas Burleson
1949 0.56 West New Mexico Clovis
1950 0.36 West California Dublin
1951 0.55 West Idaho Coeur D Alene

Postal Code Manager Returned Delivery Time Seconds


0 90008 William 0 86400.0
1 77041 Chris 0 86400.0
2 77041 Chris 0 86400.0
3 77041 Chris 0 172800.0
4 75220 Chris 0 86400.0
… … … … …
1947 89115 William 0 345600.0
1948 76028 Chris 0 86400.0
1949 88101 William 0 86400.0
1950 94568 William 0 172800.0
1951 83814 William 0 86400.0

[1952 rows x 23 columns]

[41]: # Collapse line items into one row per order
      order_products_df = order_df.groupby('Order ID').agg({
          'Product Name': lambda x: x.tolist(),
          'Product Category': lambda x: set(x),
          'Sales': 'sum',
          'Profit': 'sum',
      }).reset_index()

      order_products_df['Num Items'] = order_products_df['Product Name'].apply(len)


[42]: order_products_df

[42]: Order ID Product Name \
0 359 [Bevis 36 x 72 Conference Tables]
1 548 [Avery 481, Xerox 1976, V3682]
2 646 [Acme® Forged Steel Scissors with Black Enamel…
3 962 [Holmes Replacement Filter for HEPA Air Cleane…
4 2433 [Lexmark 4227 Plus Dot Matrix Printer]
… … …
1360 91576 [Electrix 20W Halogen Replacement Bulb for Zoo…
1361 91581 [Panasonic KX-P1150 Dot Matrix Printer]
1362 91583 [SouthWestern Bell FA970 Digital Answering Mac…
1363 91584 [Wilson Jones Impact Binders]
1364 91586 [Accessory34]

Product Category Sales Profit Num Items


0 {Furniture} 6831.37 -500.380000 1
1 {Technology, Office Supplies} 5584.70 495.869000 3
2 {Office Supplies} 586.96 -10.900000 1
3 {Office Supplies} 10561.20 338.466500 2
4 {Technology} 43046.20 4073.250000 1
… … … … …
1360 {Technology, Furniture} 7250.87 4299.355856 3
1361 {Technology} 1214.03 837.680700 1
1362 {Technology} 556.61 196.523280 1
1363 {Office Supplies} 10.96 -29.003000 1
1364 {Technology} 1503.05 1037.104500 1

[1365 rows x 6 columns]

[43]: apriori_df = order_products_df[order_products_df["Num Items"] > 1]


[44]: apriori_df

[44]: Order ID Product Name \
1 548 [Avery 481, Xerox 1976, V3682]
3 962 [Holmes Replacement Filter for HEPA Air Cleane…
8 3397 [Belkin 105-Key Black Keyboard, Avery Durable …
10 3841 [Fellowes PB500 Electric Punch Plastic Comb Bi…
12 5509 [Staples Brown Kraft Recycled Clasp Envelopes,…
… … …
1344 91466 [Xerox 1928, Tripp Lite Isotel 8 Ultra 8 Outle…
1354 91522 [Accessory37, Self-Adhesive Address Labels for…
1356 91550 [Global Leather Executive Chair, Peel & Seel® …
1357 91555 [Avery Flip-Chart Easel Binder, Black, G.E. Ha…
1360 91576 [Electrix 20W Halogen Replacement Bulb for Zoo…

Product Category Sales Profit \


1 {Technology, Office Supplies} 5584.70 495.869000
3 {Office Supplies} 10561.20 338.466500
8 {Technology, Office Supplies} 1280.63 -23.631000
10 {Office Supplies} 47856.87 8565.705600
12 {Furniture, Office Supplies} 953.69 23.018200
… … … …
1344 {Office Supplies} 344.64 43.276000
1354 {Technology, Office Supplies} 472.19 11.365560
1356 {Technology, Office Supplies, Furniture} 2248.32 1126.418540
1357 {Furniture, Office Supplies} 188.08 -82.694200
1360 {Technology, Furniture} 7250.87 4299.355856

Num Items
1 3
3 2
8 2
10 2
12 2
… …
1344 2
1354 2
1356 3
1357 2
1360 3

[451 rows x 6 columns]

[45]: from mlxtend.frequent_patterns import apriori, association_rules

      # Work on a copy so the assignment below does not raise SettingWithCopyWarning
      apriori_df = apriori_df.copy()
      apriori_df['Product Name'] = apriori_df['Product Name'].apply(lambda x: ', '.join(x))

      # One-hot encode each order as booleans. Note: splitting on ', ' also splits product
      # names that themselves contain ', ' (e.g. "Avery Flip-Chart Easel Binder, Black")
      encoded_df = apriori_df['Product Name'].str.get_dummies(sep=', ').astype(bool)

      frequent_itemsets = apriori(encoded_df, min_support=0.008, use_colnames=True)
      rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
      rules.sort_values(by='confidence', ascending=False, inplace=True)
      top_10_rules = rules.head(10)

[46]: top_10_rules


[46]: antecedents consequents \


0 (DISKETTE 44766 HGHLD3.52HD/FM) (10/Pack)
2 (Imation Neon Mac Format Diskettes) (10/Pack)
5 (Avery Flip-Chart Easel Binder) (Black)
7 (Global High-Back Leather Tilter) (Burgundy)
8 (Tennsco Lockers) (Gray)
6 (Burgundy) (Global High-Back Leather Tilter)
9 (Gray) (Tennsco Lockers)
1 (10/Pack) (DISKETTE 44766 HGHLD3.52HD/FM)
3 (10/Pack) (Imation Neon Mac Format Diskettes)
4 (Black) (Avery Flip-Chart Easel Binder)

antecedent support consequent support support confidence lift \

0 0.008869 0.031042 0.008869 1.000000 32.214286
2 0.008869 0.031042 0.008869 1.000000 32.214286
5 0.008869 0.044346 0.008869 1.000000 22.550000
7 0.008869 0.013304 0.008869 1.000000 75.166667
8 0.011086 0.017738 0.008869 0.800000 45.100000
6 0.013304 0.008869 0.008869 0.666667 75.166667
9 0.017738 0.011086 0.008869 0.500000 45.100000
1 0.031042 0.008869 0.008869 0.285714 32.214286
3 0.031042 0.008869 0.008869 0.285714 32.214286
4 0.044346 0.008869 0.008869 0.200000 22.550000

leverage conviction zhangs_metric


0 0.008594 inf 0.977629
2 0.008594 inf 0.977629
5 0.008476 inf 0.964206
7 0.008751 inf 0.995526
8 0.008673 4.911308 0.988789
6 0.008751 2.973392 1.000000
9 0.008673 1.977827 0.995485
1 0.008594 1.387583 1.000000
3 0.008594 1.387583 1.000000
4 0.008476 1.238914 1.000000
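Several of the top rules pair a product with a fragment such as `Black`, `Gray`, or `10/Pack`. This appears to be a side effect of joining the product names with `', '` and then splitting on the same separator, which also cuts product names that contain a comma. The sketch below reproduces the effect in plain Python for one order from this dataset and shows one possible way to one-hot encode baskets without joining strings. The helper `one_hot` is hypothetical and is not part of the original notebook; `TransactionEncoder` from `mlxtend.preprocessing` is another option for the same job.

```python
# One order from this dataset; the first product name contains ", ".
order_items = ["Avery Flip-Chart Easel Binder, Black", "G.E. Halogen Desk Lamp Bulbs"]

# What joining and then splitting on ", " effectively does to the basket
joined = ", ".join(order_items)
split_items = joined.split(", ")
print(split_items)  # the binder name is now two separate "items"


def one_hot(baskets):
    """One-hot encode baskets (lists of item names) without joining strings."""
    columns = sorted({item for basket in baskets for item in basket})
    matrix = [[item in basket for item in columns] for basket in baskets]
    return columns, matrix


# Encoding the lists directly keeps each product name intact
columns, matrix = one_hot([order_items, ["G.E. Halogen Desk Lamp Bulbs"]])
print(columns)
print(matrix)
```

With an encoding of this kind, a rule such as (Avery Flip-Chart Easel Binder) → (Black) could not arise, because the full product name would be a single item.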

[47]: # Match literal substrings; regex=False avoids the regex match-group warning
      filtered_orders = order_df[order_df['Product Name'].str.contains("Black", regex=False) &
                                 order_df['Product Name'].str.contains("Avery Flip-Chart Easel Binder", regex=False)]

      # Keep every line item of the orders that contain the matched product
      filtered_order_ids = filtered_orders['Order ID'].unique()
      filtered_df = order_df[order_df['Order ID'].isin(filtered_order_ids)]

[48]: filtered_df


[48]: Order ID Order Priority Discount Unit Price Shipping Cost \


68 21636 Low 0.00 22.38 15.10
69 21636 Low 0.07 5.98 4.69
70 21636 Low 0.02 55.99 3.30
119 36452 High 0.04 6.98 2.83
120 36452 High 0.10 22.38 15.10
1887 91355 Low 0.06 17.78 5.03
1888 91355 Low 0.04 22.38 15.10
1941 91555 High 0.10 22.38 15.10
1942 91555 High 0.04 6.98 2.83

Quantity ordered new Sales Profit Delivery Time Ship Mode \


68 29 682.68 -52.6470 7 days Express Air
69 11 73.44 -24.4400 5 days Regular Air
70 63 2997.07 366.5070 0 days Regular Air
119 18 129.48 46.0100 2 days Regular Air
120 26 564.98 -107.5135 1 days Regular Air
1887 3 55.17 38.0673 3 days Regular Air
1888 18 403.53 16.0218 8 days Regular Air
1941 7 152.11 -107.5135 1 days Regular Air
1942 5 35.97 24.8193 2 days Regular Air

Customer Segment Product Category Product Sub-Category \


68 Home Office Office Supplies Binders and Binder Accessories
69 Home Office Office Supplies Storage & Organization
70 Home Office Technology Telephones and Communication
119 Home Office Furniture Office Furnishings
120 Home Office Office Supplies Binders and Binder Accessories
1887 Small Business Furniture Office Furnishings
1888 Small Business Office Supplies Binders and Binder Accessories
1941 Home Office Office Supplies Binders and Binder Accessories
1942 Home Office Furniture Office Furnishings

Product Container Product Name \


68 Small Box Avery Flip-Chart Easel Binder, Black
69 Small Box Perma STOR-ALL™ Hanging File Box, 13 1/8"W x 1…
70 Small Pack Accessory24
119 Small Pack G.E. Halogen Desk Lamp Bulbs
120 Small Box Avery Flip-Chart Easel Binder, Black
1887 Small Box Seth Thomas 13 1/2" Wall Clock

1888 Small Box Avery Flip-Chart Easel Binder, Black
1941 Small Box Avery Flip-Chart Easel Binder, Black
1942 Small Pack G.E. Halogen Desk Lamp Bulbs

Product Base Margin Region State or Province City \


68 0.38 East New York New York City
69 0.68 East New York New York City
70 0.59 East New York New York City
119 0.37 East New York New York City
120 0.38 East New York New York City
1887 0.54 East New York Coram
1888 0.38 East New York Coram
1941 0.38 Central Texas Leander
1942 0.37 Central Texas Leander

Postal Code Manager Returned Delivery Time Seconds


68 10170 Erin 0 604800.0
69 10170 Erin 0 432000.0
70 10170 Erin 0 0.0
119 10009 Erin 0 172800.0
120 10009 Erin 0 86400.0
1887 11727 Erin 0 259200.0
1888 11727 Erin 0 691200.0
1941 78641 Chris 0 86400.0
1942 78641 Chris 0 172800.0

With a minimum support of 0.8%, we keep only itemsets that occur in at least 0.8% of the
transactions passed to apriori. Here those transactions are the 451 orders containing more than
one item, so an itemset must appear in at least 4 orders. The threshold filters out rare
combinations and lets us discover items with a particular characteristic or items that are
frequently bought together. We can then use this information to promote items that normally go
together to customers, which should help boost profits.
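To make the columns of the rules table concrete, here is a worked sketch of how support, confidence, and lift follow from simple counts. The counts are not printed in the notebook; they are inferred from the reported supports for the rule (Avery Flip-Chart Easel Binder) → (Black), assuming 451 transactions (0.008869 ≈ 4/451 and 0.044346 ≈ 20/451). The function name `rule_metrics` is an assumption made for illustration.

```python
def rule_metrics(n_transactions, n_antecedent, n_consequent, n_both):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    support = n_both / n_transactions              # P(A and C)
    confidence = n_both / n_antecedent             # P(C | A)
    consequent_support = n_consequent / n_transactions
    lift = confidence / consequent_support         # P(C | A) / P(C)
    return support, confidence, lift


# Inferred counts: 451 orders, antecedent in 4, consequent in 20, both in 4
support, confidence, lift = rule_metrics(451, 4, 20, 4)
print(f"support={support:.6f} confidence={confidence:.6f} lift={lift:.2f}")
```

Under these assumed counts the values agree with the table row for this rule (support 0.008869, confidence 1.0, lift 22.55). A lift well above 1 means the consequent appears with the antecedent far more often than its overall frequency would suggest.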
Report 4: Product Category Focus
[49]: order_df

[49]: Order ID Order Priority Discount Unit Price Shipping Cost \
0 359 Medium 0.08 124.49 51.94
1 548 Critical 0.04 3.08 0.99
2 548 Critical 0.02 6.48 5.90
3 548 Critical 0.04 125.99 4.20
4 646 High 0.01 9.31 3.98
… … … … … …
1947 91576 Low 0.04 880.98 44.55
1948 91581 Not Specified 0.01 145.45 17.85
1949 91583 High 0.01 28.99 8.59
1950 91584 Not Specified 0.10 5.18 5.74
1951 91586 Medium 0.03 85.99 0.99

Quantity ordered new Sales Profit Delivery Time Ship Mode \


0 56 6831.37 -500.38000 1 days Delivery Truck
1 75 236.87 36.02000 1 days Regular Air
2 53 370.91 -50.64000 1 days Regular Air
3 47 4976.92 510.48900 2 days Regular Air
4 61 586.96 -10.90000 1 days Regular Air
… … … … … …
1947 8 6901.25 4233.25880 4 days Delivery Truck
1948 8 1214.03 837.68070 1 days Delivery Truck
1949 21 556.61 196.52328 1 days Regular Air
1950 2 10.96 -29.00300 2 days Regular Air
1951 20 1503.05 1037.10450 1 days Regular Air

Customer Segment Product Category Product Sub-Category \


0 Corporate Furniture Tables
1 Home Office Office Supplies Labels
2 Home Office Office Supplies Paper
3 Home Office Technology Telephones and Communication
4 Small Business Office Supplies Scissors, Rulers and Trimmers
… … … …
1947 Consumer Furniture Bookcases
1948 Corporate Technology Office Machines
1949 Home Office Technology Telephones and Communication
1950 Corporate Office Supplies Binders and Binder Accessories
1951 Home Office Technology Telephones and Communication

Product Container Product Name \


0 Jumbo Box Bevis 36 x 72 Conference Tables
1 Small Box Avery 481
2 Small Box Xerox 1976
3 Small Box V3682
4 Small Pack Acme® Forged Steel Scissors with Black Enamel …
… … …
1947 Jumbo Box Riverside Palais Royal Lawyers Bookcase, Royal…

1948 Jumbo Drum Panasonic KX-P1150 Dot Matrix Printer
1949 Medium Box SouthWestern Bell FA970 Digital Answering Mach…
1950 Small Box Wilson Jones Impact Binders
1951 Wrap Bag Accessory34

Product Base Margin Region State or Province City \


0 0.63 West California Los Angeles
1 0.37 Central Texas Houston
2 0.37 Central Texas Houston
3 0.59 Central Texas Houston
4 0.56 Central Texas Dallas
… … … … …
1947 0.62 West Nevada Las Vegas
1948 0.56 Central Texas Burleson
1949 0.56 West New Mexico Clovis
1950 0.36 West California Dublin
1951 0.55 West Idaho Coeur D Alene

Postal Code Manager Returned Delivery Time Seconds


0 90008 William 0 86400.0
1 77041 Chris 0 86400.0
2 77041 Chris 0 86400.0
3 77041 Chris 0 172800.0
4 75220 Chris 0 86400.0
… … … … …
1947 89115 William 0 345600.0
1948 76028 Chris 0 86400.0
1949 88101 William 0 86400.0
1950 94568 William 0 172800.0
1951 83814 William 0 86400.0

[1952 rows x 23 columns]

[50]: profit_by_category = order_df.groupby('Product Category')['Profit'].sum()
      margin_by_category = order_df.groupby('Product Category')['Product Base Margin'].mean()
      shipping_cost_by_category = order_df.groupby('Product Category')['Shipping Cost'].mean()
      quantity_by_category = order_df.groupby('Product Category')['Quantity ordered new'].sum()

      fig, axs = plt.subplots(2, 2, figsize=(12, 8))

      profit_by_category.plot(kind='bar', ax=axs[0, 0], color='skyblue')
      axs[0, 0].set_title('Sum of Profit by Product Category')
      axs[0, 0].set_ylabel('Sum of Profit')

      margin_by_category.plot(kind='bar', ax=axs[0, 1], color='lightgreen')
      axs[0, 1].set_title('Average Product Base Margin by Product Category')
      axs[0, 1].set_ylabel('Average Product Base Margin')

      shipping_cost_by_category.plot(kind='bar', ax=axs[1, 0], color='salmon')
      axs[1, 0].set_title('Average Shipping Cost by Product Category')
      axs[1, 0].set_ylabel('Average Shipping Cost')

      quantity_by_category.plot(kind='bar', ax=axs[1, 1], color='orange')
      axs[1, 1].set_title('Sum of Quantity Ordered by Product Category')
      axs[1, 1].set_ylabel('Sum of Quantity Ordered')

      plt.tight_layout()
      plt.show()


• Office Supplies: This category generates the highest total profit. Although its profit
margin is smaller than that of the other categories, Office Supplies leads in quantity
ordered. This indicates that Office Supplies are popular among customers and contribute
significantly to the company’s revenue stream.
• Furniture: In contrast, Furniture has the lowest total profit and incurs very high
shipping costs. This suggests that while Furniture may contribute to sales volume, its
profitability is compromised by high associated expenses. The high shipping costs may be a
deterrent for customers or may reflect the logistical challenges of transporting bulky items.
• Recommendations:
Focus on maintaining or increasing sales volume in Office Supplies while optimizing costs. This
could involve negotiating better deals with suppliers or streamlining operations.
For Furniture, consider strategies to lower shipping costs (for example, opening additional
storage locations) and to attract more buyers, since the category has a large profit margin.
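The four per-category aggregates discussed above can also be collected into one table, which may be easier to compare than four separate bar charts. The following is a minimal sketch: `demo_df` holds made-up values (not the real `order_df`), and the helper name `summarize_by_category` is hypothetical.

```python
import pandas as pd

def summarize_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per product category with the four aggregates plotted above."""
    return df.groupby('Product Category').agg(
        total_profit=('Profit', 'sum'),
        mean_base_margin=('Product Base Margin', 'mean'),
        mean_shipping_cost=('Shipping Cost', 'mean'),
        total_quantity=('Quantity ordered new', 'sum'),
    )

# Hypothetical toy data, used only to show the shape of the result.
demo_df = pd.DataFrame({
    'Product Category': ['Furniture', 'Furniture', 'Office Supplies', 'Office Supplies'],
    'Profit': [10.0, -4.0, 3.0, 5.0],
    'Product Base Margin': [0.6, 0.7, 0.4, 0.5],
    'Shipping Cost': [30.0, 50.0, 2.0, 4.0],
    'Quantity ordered new': [2, 1, 10, 20],
})

summary = summarize_by_category(demo_df)
print(summary)
```

On the real data, the same helper could be called as `summarize_by_category(order_df)`, assuming the column names shown in `order_df.info()`.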
Report 5: Correlation test
[52]: order_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1952 entries, 0 to 1951
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 1952 non-null int64
1 Order Priority 1952 non-null object
2 Discount 1952 non-null float64
3 Unit Price 1952 non-null float64
4 Shipping Cost 1952 non-null float64
5 Quantity ordered new 1952 non-null int64
6 Sales 1952 non-null float64
7 Profit 1952 non-null float64
8 Delivery Time 1952 non-null timedelta64[ns]
9 Ship Mode 1952 non-null object
10 Customer Segment 1952 non-null object
11 Product Category 1952 non-null object
12 Product Sub-Category 1952 non-null object
13 Product Container 1952 non-null object
14 Product Name 1952 non-null object
15 Product Base Margin 1936 non-null float64
16 Region 1952 non-null object
17 State or Province 1952 non-null object
18 City 1952 non-null object
19 Postal Code 1952 non-null int64
20 Manager 1952 non-null object
21 Returned 1952 non-null int64
22 Delivery Time Seconds 1952 non-null float64
dtypes: float64(7), int64(4), object(11), timedelta64[ns](1)
memory usage: 350.9+ KB


[53]: num_df = order_df.select_dtypes(include=['int', 'float'])


[54]: num_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1952 entries, 0 to 1951
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 1952 non-null int64
1 Discount 1952 non-null float64
2 Unit Price 1952 non-null float64
3 Shipping Cost 1952 non-null float64
4 Quantity ordered new 1952 non-null int64
5 Sales 1952 non-null float64
6 Profit 1952 non-null float64
7 Product Base Margin 1936 non-null float64
8 Postal Code 1952 non-null int64
9 Returned 1952 non-null int64
10 Delivery Time Seconds 1952 non-null float64
dtypes: float64(7), int64(4)
memory usage: 167.9 KB

[55]: num_df = num_df.drop(["Order ID", "Postal Code"], axis=1)


[56]: correlation_matrix = num_df.corr(method='spearman')

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Spearman Rank Correlation Heatmap')
plt.show()

• Sales is strongly positively correlated with Unit Price and Shipping Cost. This is plausible
because Furniture has a high Unit Price, which leads to high Sales, as well as a high Shipping Cost.
• Unit Price and Shipping Cost are also correlated.
• Sales and Profit are not noticeably correlated.
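The heatmap reports only the correlation coefficients. As a complement, the sketch below shows how a p-value could be obtained for a single pair with `scipy.stats.spearmanr`, and checks that the Spearman coefficient is the Pearson coefficient computed on ranks. The data are synthetic and the names `demo_num_df` and `spearman_with_pvalue` are hypothetical; the real analysis would pass `num_df` instead.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical toy data: Sales is built as Unit Price * Quantity.
rng = np.random.default_rng(0)
n = 200
unit_price = rng.lognormal(mean=3.0, sigma=1.0, size=n)
quantity = rng.integers(1, 50, size=n)
demo_num_df = pd.DataFrame({
    'Unit Price': unit_price,
    'Quantity': quantity,
    'Sales': unit_price * quantity,
})

def spearman_with_pvalue(df, col_x, col_y):
    """Spearman rho and its p-value for two columns, dropping rows with NaN in either."""
    pair = df[[col_x, col_y]].dropna()
    rho, p_value = spearmanr(pair[col_x], pair[col_y])
    return float(rho), float(p_value)

rho, p_value = spearman_with_pvalue(demo_num_df, 'Unit Price', 'Sales')
print(f"rho={rho:.2f}, p={p_value:.3g}")

# Spearman correlation equals Pearson correlation applied to the ranked columns.
spearman_matrix = demo_num_df.corr(method='spearman')
pearson_on_ranks = demo_num_df.rank().corr(method='pearson')
print(np.allclose(spearman_matrix, pearson_on_ranks))
```

Dropping missing rows per pair mirrors how `DataFrame.corr` treats the 16 missing `Product Base Margin` values, which it excludes pairwise.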
[77]: cat_df = order_df.select_dtypes(include=['object'])


[78]: cat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1952 entries, 0 to 1951
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order Priority 1952 non-null object
1 Ship Mode 1952 non-null object
2 Customer Segment 1952 non-null object
3 Product Category 1952 non-null object
4 Product Sub-Category 1952 non-null object
5 Product Container 1952 non-null object
6 Product Name 1952 non-null object
7 Region 1952 non-null object
8 State or Province 1952 non-null object
9 City 1952 non-null object
10 Manager 1952 non-null object
dtypes: object(11)
memory usage: 167.9+ KB

[79]: cat_df = cat_df.drop(["Product Name", "Manager", "State or Province"], axis=1)


[80]: cat_df


[80]: Order Priority Ship Mode Customer Segment Product Category \


0 Medium Delivery Truck Corporate Furniture
1 Critical Regular Air Home Office Office Supplies
2 Critical Regular Air Home Office Office Supplies
3 Critical Regular Air Home Office Technology
4 High Regular Air Small Business Office Supplies
… … … … …
1947 Low Delivery Truck Consumer Furniture
1948 Not Specified Delivery Truck Corporate Technology
1949 High Regular Air Home Office Technology
1950 Not Specified Regular Air Corporate Office Supplies
1951 Medium Regular Air Home Office Technology

Product Sub-Category Product Container Region City


0 Tables Jumbo Box West Los Angeles
1 Labels Small Box Central Houston
2 Paper Small Box Central Houston
3 Telephones and Communication Small Box Central Houston
4 Scissors, Rulers and Trimmers Small Pack Central Dallas
… … … … …
1947 Bookcases Jumbo Box West Las Vegas
1948 Office Machines Jumbo Drum Central Burleson
1949 Telephones and Communication Medium Box West Clovis
1950 Binders and Binder Accessories Small Box West Dublin
1951 Telephones and Communication Wrap Bag West Coeur D Alene

[1952 rows x 8 columns]

[81]: label_encoded_df = cat_df.copy()

label_encoder = LabelEncoder()
for column in label_encoded_df.columns:
    label_encoded_df[column] = label_encoder.fit_transform(label_encoded_df[column])


[82]: label_encoded_df


[82]: Order Priority Ship Mode Customer Segment Product Category \


0 3 0 1 0
1 0 2 2 1
2 0 2 2 1
3 0 2 2 2
4 1 2 3 1
… … … … …
1947 2 0 0 0
1948 4 0 1 2
1949 1 2 2 2
1950 4 2 1 1
1951 3 2 2 2

Product Sub-Category Product Container Region City


0 15 0 3 436
1 7 4 0 341
2 10 4 0 341
3 16 4 0 341
4 13 5 0 168
… … … … …
1947 2 0 3 407
1948 9 1 0 87
1949 16 3 3 137
1950 1 4 3 191
1951 16 6 3 140

[1952 rows x 8 columns]

[83]: chi2_matrix = pd.DataFrame(index=label_encoded_df.columns, columns=label_encoded_df.columns)
p_values = pd.DataFrame(index=label_encoded_df.columns, columns=label_encoded_df.columns)

for col1 in label_encoded_df.columns:
    for col2 in label_encoded_df.columns:
        if col1 != col2:
            contingency_table = pd.crosstab(label_encoded_df[col1], label_encoded_df[col2])
            chi2, p, _, _ = chi2_contingency(contingency_table)
            chi2_matrix.loc[col1, col2] = chi2
            p_values.loc[col1, col2] = p

plt.figure(figsize=(12, 10))
sns.heatmap(chi2_matrix.astype(float), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Chi-Squared Test for Categorical Variables')
plt.show()

• Product Sub-Category is strongly associated with City (this pair has the largest chi-squared statistic).
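One caveat when reading raw chi-squared statistics: they grow with the sample size and with the number of cells in the contingency table, so a column with hundreds of levels, such as City, tends to produce large values regardless of how strong the association is. A common normalization is Cramér's V, which rescales the statistic to the range 0 to 1. The sketch below is an illustration rather than part of the original analysis; the helper name `cramers_v` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), without bias correction."""
    table = pd.crosstab(x, y)
    # correction=False disables the Yates correction that SciPy applies to 2x2 tables.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        return float('nan')  # undefined when one variable has a single level
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy check 1: y is a relabeling of x, so the association is perfect.
x_perfect = pd.Series(['a', 'b', 'c'] * 20)
y_perfect = x_perfect.map({'a': 'u', 'b': 'v', 'c': 'w'})
v_perfect = cramers_v(x_perfect, y_perfect)

# Toy check 2: every combination occurs equally often, so there is no association.
x_indep = pd.Series(['a', 'a', 'b', 'b'] * 10)
y_indep = pd.Series(['u', 'v', 'u', 'v'] * 10)
v_indep = cramers_v(x_indep, y_indep)

print(v_perfect, v_indep)
```

On the notebook's data this could be applied pairwise to the columns of `cat_df`. With many sparsely populated cells, as is likely for City, both the chi-squared approximation and Cramér's V should be interpreted with caution.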
[ ]:
