Professional Documents
Culture Documents
Submitted by : R K SUJAN
3. Keep iterating until there is no change to the centroids. i.e assignment of data
points to clusters isn’t changing.
4. Compute the sum of the squared distance between data points and all
centroids.
6. Compute the centroids for the clusters by taking the average of the all data
points that belong to each cluster.
ALGORITHM FOR K-MEANS CLUSTERING
CODE FOR K-MEANS
#import required libraries
import numpy as np
dataset =
make_blobs(n_samples=200,centers=4,n_features=2,cluster_std=1.6,random_state
=50)
points = dataset[0]
#import kmeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(points)
KMeans(n_clusters=4)
plt.scatter(dataset[0][:,0], dataset[0][:,1])
<matplotlib.collections.PathCollection at 0x23668c26790>
clusters = kmeans.cluster_centers_
y_km = kmeans.fit_predict(points)
plt.show()
DATASET
(array([[-1.06705283e+00, 9.24306355e+00],
[ 1.08654318e+00, -6.94815805e+00],
[-2.30970265e+00, 5.84969440e+00],
[-1.59728965e+00, 8.45369045e+00],
[-5.79816711e+00, -3.78405528e+00],
[-2.29833776e+00, -4.66005371e+00],
[-1.68210799e+00, 1.12843423e+01],
[-7.38296450e-01, -5.20135260e+00],
[-1.92048345e-01, -6.45318764e+00],
[-6.37920287e+00, -2.17672355e+00],
[-5.12054803e+00, -2.91449176e+00],
[-4.22297649e+00, 9.04719945e+00],
[-2.67815731e+00, -2.32828819e+00],
[-3.02994109e+00, 1.86959382e+00],
[-5.73938121e+00, -7.25648845e-01],
[-1.99427571e+00, 4.28616868e+00],
[-2.89522086e+00, 1.10222354e+01],
[-1.11286937e+00, 1.03086380e+01],
[-5.19987051e+00, -1.59879364e+00],
[-4.75916597e+00, -1.97047767e+00],
[-6.76865308e+00, -3.56907573e+00],
[-2.65715781e+00, 3.33763714e+00],
[-4.61722463e+00, 1.06373187e+01],
[-1.43729337e+00, 1.12137736e+01],
[-1.98068787e+00, 9.73142838e+00],
[-1.47838268e+00, 4.02156675e+00],
[-3.74580344e+00, 9.15545625e+00],
[-5.51090509e-01, -2.19802594e+00],
[-2.68015629e+00, 5.58489786e+00],
[-1.18697121e-01, 1.04950260e+01],
[ 7.08946126e-02, 1.27161487e+01],
[-6.03097685e+00, -1.01668649e+00],
[-6.43543481e+00, 1.19165025e-01],
[-7.91271326e-01, -5.63231066e+00],
[ 9.02189228e-02, -4.24988128e+00],
[-3.72960397e+00, -2.40552410e+00],
[-5.47201497e+00, -1.29098281e+00],
[-4.50400179e+00, -1.29552557e+00],
[-1.59604970e+00, 7.08952891e+00],
[-2.45285170e+00, 6.35814471e+00],
[-1.02481236e+00, 1.34548122e+01],
[-7.16917808e+00, -3.68305685e+00],
[-2.09444877e+00, 6.60308885e+00],
[-3.08549983e+00, 6.22161479e+00],
[-6.82140576e-01, -4.83269360e+00],
[ 1.64436813e+00, -3.29688399e+00],
[-1.81165386e+00, 9.57286673e+00],
[-1.21769584e-01, 6.51275284e+00],
[-1.65143884e+00, 6.38316168e+00],
[-4.18721798e+00, 8.93800061e+00],
[-1.10703455e+00, -4.83713152e+00],
[-7.62627421e+00, -4.60727232e+00],
[-3.06568887e-01, 5.25844092e+00],
[-1.23581275e+00, 8.35805290e+00],
[-1.85807535e-01, 2.57718893e+00],
[ 7.28797198e-01, 6.06528632e+00],
[-1.70400879e+00, -2.88008464e+00],
[-5.02706384e+00, 7.61298431e-01],
[-6.22443225e+00, -6.57162467e-01],
[-2.90807981e+00, 5.27669491e+00],
[-1.37711368e+00, -5.50047455e+00],
[-5.57986277e-01, -2.70088621e+00],
[-5.68833947e+00, 7.94601173e+00],
[-2.77413056e+00, -5.78872960e+00],
[-1.53159637e+00, -5.42990953e+00],
[-3.22848472e+00, 9.44642918e+00],
[ 9.86777496e-01, -7.30690762e+00],
[-4.42661936e+00, 3.35071015e+00],
[-3.17162516e+00, 1.10347610e+01],
[-4.74516474e+00, 7.89837755e+00],
[ 1.02471465e+00, -4.64795418e+00],
[-6.13566432e+00, -2.93094035e+00],
[-3.42672033e+00, 7.64284207e+00],
[ 1.27831270e+00, -6.29519484e+00],
[-3.16483095e+00, 6.35636403e+00],
[ 1.13910574e-02, 5.46235123e+00],
[-5.41232378e+00, -2.68666494e+00],
[ 4.61164125e-01, 4.69143186e+00],
[-2.41469662e+00, 4.66269862e+00],
[-3.77686363e-01, -5.75177620e+00],
[-6.10691421e+00, -5.98494706e+00],
[-4.87535312e-01, 6.36669314e+00],
[-5.73193316e+00, -1.81425052e+00],
[-4.88797474e+00, -2.96226761e+00],
[-5.91551686e+00, -1.39463278e+00],
[-7.44500073e+00, -1.82470952e+00],
[-3.39008216e+00, 1.09563447e+01],
[ 2.47622860e-01, -5.03543616e+00],
[-3.10260432e+00, 1.09469609e+01],
[-5.15417920e+00, -4.12796457e+00],
[-4.28633194e-01, -4.24947701e+00],
[-4.27501504e+00, 1.08359469e+01],
[ 4.55976021e-02, -4.59883918e+00],
[-5.04804825e+00, 4.27765336e+00],
[-2.40612947e+00, 5.07809235e+00],
[-2.27451380e+00, -1.54186053e+00],
[-1.57744641e-01, -1.15341625e+01],
[-2.19532828e+00, 4.52009408e+00],
[-5.01209756e-01, -3.66534438e+00],
[-2.55093474e+00, 5.07808929e+00],
[-7.89434801e+00, -3.17030594e+00],
[-1.53349447e+00, -5.87137205e+00],
[-3.69177238e+00, 2.87620370e+00],
[-1.31024459e+00, 1.19798893e+01],
[-1.49167744e+00, 7.45001320e+00],
[ 2.24563558e+00, -6.37052906e+00],
[-2.93581723e+00, 4.37099430e+00],
[-2.45885784e+00, -3.47646132e+00],
[-9.37207745e+00, -2.04265047e+00],
[-1.85324174e+00, 1.15343543e+01],
[-4.55544644e-02, -5.77956461e+00],
[-4.81350458e+00, -4.29442383e+00],
[-2.83977728e+00, 1.05836834e+01],
[-3.25189078e+00, 8.58382453e+00],
[-5.78104717e+00, -3.22180679e+00],
[-1.35072701e+00, 4.38388826e+00],
[-2.54760385e+00, 1.23266492e+01],
[-1.83963385e+00, 1.17304073e+01],
[-3.56940146e+00, 3.97719844e+00],
[ 5.19455346e+00, -3.85790517e+00],
[ 1.26866610e+00, 8.69129038e+00],
[-3.63664996e+00, 7.23811254e+00],
[-1.55079863e+00, 8.16118375e+00],
[-1.75136566e+00, 1.01798622e+01],
[ 4.22044090e+00, -7.82455952e+00],
[-1.01845204e+00, 1.08561916e+01],
[-3.09538208e+00, 9.04263837e+00],
[-2.75853245e+00, 5.71712591e+00],
[-1.69955192e+00, 7.60084115e+00],
[ 1.00681205e+00, -5.97364221e+00],
[-3.63618643e+00, -4.01910949e+00],
[ 1.05766953e+00, -2.84354513e+00],
[-5.21005358e-01, -5.36288806e+00],
[ 4.74333018e-01, 2.91649791e+00],
[-1.16095485e+00, 9.30443737e+00],
[ 7.72592657e-01, 3.34757221e+00],
[ 1.15283270e-01, -4.98158692e+00],
[-6.17063348e-01, 1.04101088e+01],
[-2.76847604e+00, 8.52320320e+00],
[-5.25173430e+00, -2.08429857e+00],
[-3.85525653e+00, 9.54219399e+00],
[-8.01851943e-01, 5.95676894e+00],
[-2.36271016e+00, 6.81776964e+00],
[-1.99764975e+00, -3.85128758e+00],
[-6.65130512e+00, -3.92501387e+00],
[-5.57724115e+00, 1.14034957e+01],
[ 1.19709771e+00, -5.35592862e+00],
[-3.25011945e+00, 5.37703143e+00],
[ 1.18033537e+00, -7.97895365e+00],
[-6.91252565e+00, -4.45298216e+00],
[-1.76815267e+00, 9.19196787e+00],
[-6.65058496e+00, -2.11819191e+00],
[-3.70764352e+00, 6.74162691e+00],
[-3.71255665e-01, -4.99321884e+00],
[ 1.12056494e-01, -6.58921181e+00],
[-2.33425004e+00, 7.05562607e+00],
[-6.96784964e-01, 1.00164565e+01],
[-7.67542214e-01, -5.69548201e+00],
[-6.88656858e-01, -9.55180953e+00],
[ 3.19734410e+00, -3.69780369e+00],
[-1.66854762e+00, 4.66869475e+00],
[-3.40729232e-01, 5.72252744e+00],
[-7.63340475e-01, -2.76069256e+00],
[-2.15963524e+00, -7.56230415e+00],
[-5.10916044e+00, -4.59492642e+00],
[-4.57024715e+00, -7.15787278e-01],
[-4.87469044e+00, -2.18916044e+00],
[-1.42220382e+00, 5.20840822e+00],
[ 1.86144971e+00, -9.23586332e+00],
[-4.26446596e-01, -4.87764872e+00],
[-1.35671783e+00, -3.72590953e+00],
[ 9.41731341e-02, -4.05882797e+00],
[-6.95635379e+00, -8.81553313e-01],
[ 1.08172044e+00, 5.81661034e+00],
[ 3.14164337e-02, -5.38350852e+00],
[-2.85096028e+00, 7.51512826e+00],
[-4.36021918e+00, -2.94957772e+00],
[-3.47593712e-01, -4.33384716e+00],
[-2.70720258e+00, 1.05857295e+01],
[-3.77539609e+00, 4.13414806e+00],
[ 1.26381204e+00, -7.84824077e+00],
[-1.44884409e+00, 3.75963327e+00],
[-2.25521451e+00, 5.57096900e+00],
[-3.09117088e+00, 9.37957142e+00],
[-6.36790963e+00, -5.30288810e-02],
[-6.11503859e-01, -3.24108804e+00],
[-6.24195183e+00, -2.19627952e+00],
[-5.34255894e+00, 1.28888667e+01],
[ 1.14391114e+00, 4.35267793e+00],
[ 2.32669251e+00, -7.52917540e+00],
[-2.96275801e+00, 2.59217754e+00],
[-3.02730359e+00, 3.52590749e+00],
[-3.49785697e+00, -8.75045274e-01],
[-5.23835667e-01, 7.48498444e+00],
[ 1.09766760e+00, -4.85679456e+00],
[-1.10000365e+00, 1.10130763e+01],
[-3.98648663e+00, -1.98177808e+00],
[-3.29604652e+00, 6.38490461e+00],
[-3.75526942e+00, -1.56756272e+00],
[-7.10483937e-01, 1.18869578e+01]]), array([2, 0, 3, 3, 1, 0, 2, 0, 0, 1, 1, 2, 1, 1, 1, 3, 2, 2,
1, 1, 1, 3,
2, 2, 2, 3, 2, 1, 3, 2, 2, 1, 1, 0, 0, 1, 1, 1, 3, 3, 2, 1, 3, 3,
0, 0, 2, 3, 3, 2, 0, 1, 3, 2, 3, 3, 0, 1, 1, 3, 0, 0, 2, 0, 0, 2,
0, 3, 2, 2, 0, 1, 2, 0, 3, 3, 1, 3, 3, 0, 1, 3, 1, 1, 1, 1, 2, 0,
2, 1, 0, 2, 0, 3, 3, 1, 0, 3, 0, 3, 1, 0, 3, 2, 3, 0, 3, 1, 1, 2,
0, 1, 2, 2, 1, 3, 2, 2, 3, 0, 2, 2, 2, 2, 0, 2, 2, 3, 2, 0, 1, 0,
0, 3, 2, 3, 0, 2, 2, 1, 2, 3, 2, 1, 1, 2, 0, 3, 0, 1, 2, 1, 3, 0,
0, 3, 2, 0, 0, 0, 3, 3, 1, 0, 1, 1, 1, 3, 0, 0, 0, 0, 1, 3, 0, 2,
1, 0, 2, 3, 0, 3, 3, 2, 1, 0, 1, 2, 3, 0, 3, 3, 1, 3, 0, 2, 1, 3,
1, 2]))
APRIORI ALGORITHM
Agarwal and Srikant proposed the Apriori algorithm in 1994. Apriori uses a
"bottom up" approach, where frequent subsets are extended one item at a time
(a step known as candidate generation), and groups of candidates are tested
against the data. The algorithm terminates when no further successful
extensions are found. Apriori uses breadth-first search to count candidate item
sets efficiently. This algorithm uses downward closure property, which states
that, “Any subset of a frequent itemset must be frequent”. It is called as
apriori because it uses prior knowledge of frequent item set properties.
It uses level-wise search, where k-itemsets (an itemset containing k number
of items is called as a k-itemset) are used to explore (k+1) itemsets to mine
frequent itemsets from transactional database. First, the set of frequent 1-
temset (L1) is found. L1 is used to find L2, which is used to find L3 and so on,
until no more frequent k-itemsets can be found.
The candidate-gen function takes Lk-1 and returns a superset (called the
candidates) of the set of all frequent
k- itemsets. It has two steps
Apriori Property –
All non-empty subset of frequent itemset must be frequent. The key concept
of Apriori algorithm is its anti-monotonicity of support measure. Apriori
assumes that
Apriori Algorithm can be slow. The main limitation is time required to hold a
vast number of candidate sets with much frequent itemsets, low minimum support
or large itemsets i.e. it is not an efficient approach for large number of datasets.
For example, if there are 10^4 from frequent 1- itemsets, it need to generate more
than 10^7 candidates into 2-length which in turn they will be tested and
accumulate. Furthermore, to detect frequent pattern in size 100 i.e. v1, v2…
v100, it have to generate 2^100 candidate itemsets that yield on costly and
wasting of time of candidate generation. So, it will check for many sets from
candidate itemsets, also it will scan database many times repeatedly for finding
candidate itemsets. Apriori will be very low and inefficiency when memory
capacity is limited with large number of transactions.
PSEUDOCODE OF THE ALGORITHM
APRIORI CODE
# Loading neccesary packages
import numpy as np
import pandas as pd
#myretaildata = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-
databases/00352/Online%20Retail.xlsx')
myretaildata.head ()
#Data Cleaning
myretaildata['InvoiceNo'] = myretaildata['InvoiceNo'].astype('str')#converting
invoice number to type string
myretaildata = myretaildata[~myretaildata['InvoiceNo'].str.contains('C')]#remove
credit transactions
myretaildata.head()
myretaildata['Country'].value_counts()
#myretaildata.shape
.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))
mybasket.head()
def my_encode_units(x):
if x <= 0;
return 0
if x >=1;
return 1
my_basket_sets = mybasket.applymap(my_encode_units)
#Generating rules
my_rules.head(100)
(my_rules['confidence'] >=0.3)]