
Experiment No 1:

Program on visualization of data using Matplotlib in Python.


Aim: To study visualization of data using Matplotlib in Python.
Theory:
- Matplotlib is the most popular data visualization library in Python. It allows us to create
figures and plots, and makes it very easy to produce static raster or vector files without the
need for any GUIs.
- If you have Anaconda, you can simply install Matplotlib from your terminal or command
prompt using: conda install matplotlib
- If you do not have Anaconda on your computer, install Matplotlib from your terminal
using: pip install matplotlib

Program:

from matplotlib import pyplot as plt

years = [2014, 2015, 2016, 2017, 2018]
country1 = [200, 500, 550, 600, 700]
country2 = [100, 200, 300, 400, 500]

# draw one line per country
plt.plot(years, country1, label="country1")
plt.plot(years, country2, label="country2")
plt.ylabel("population")
plt.xlabel("year")
plt.legend()
plt.show()

BAR CHART
from matplotlib import pyplot as plt

years = [2014, 2015, 2016, 2017, 2018]
country1 = [200, 500, 550, 600, 700]
country2 = [100, 200, 300, 400, 500]

# draw one bar series per country (the second series is drawn on top of the first)
plt.bar(years, country1, label="country1")
plt.bar(years, country2, label="country2")
plt.ylabel("population")
plt.xlabel("year")
plt.legend()
plt.show()
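Note that the two plt.bar calls above draw the second series directly over the first, hiding part of it. A minimal sketch of one common alternative, stacking the second series on the first with the bottom parameter (variable names reused from the program above):

plt.bar(years, country1)
plt.bar(years, country2, bottom=country1)   # stack country2 on top of country1
plt.show()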

Output:
Conclusion: Thus, we have studied data visualization using Matplotlib.
Experiment No 2:
Program on linear algebra and matrix operations.
Aim: To study linear algebra and matrix operations.

Theory:
- The Python programming language has no built-in support for linear algebra, but it is
fairly straightforward to write code which will implement as much as you need. The
most obvious way to represent vectors and matrices is as lists and nested lists. For
serious numerical linear algebra, the best option is to install and use
the NumPy package. A more flexible solution is to use SAGE, a Python-based
symbolic algebra system which includes NumPy.
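To illustrate the lists-and-nested-lists representation before moving to NumPy, here is a minimal sketch of matrix-vector multiplication written only with built-in types (all names and values are illustrative):

m = [[1, 2], [3, 4]]   # a 2x2 matrix as a nested list
v = [5, 6]             # a vector as a plain list

# multiply each row of m with v and sum the products
result = [sum(row[i] * v[i] for i in range(len(v))) for row in m]
print(result)          # [17, 39]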

Program:
 Part1:
import numpy as np

a = [1, 2]
b = [3, 3]
c = [1, 2]
x = np.array(a)
y = np.array(b)
z = np.array(c)

vector_sum = x + y + z                    # element-wise sum of the three vectors
difference = x - y                        # element-wise difference
scalar = x * 10                           # scalar multiplication
mean = np.mean([x, y])                    # mean of all elements of x and y
dot_product = np.sum(x * y)               # dot product of x and y
sum_of_squares = np.sum(x ** 2)
magnitude = np.sum(x ** 2) ** 0.5         # Euclidean norm of x
distance = np.sum((x - y) ** 2) ** 0.5    # Euclidean distance between x and y

print(vector_sum)
print(difference)
print(scalar)
print(mean)
print(dot_product)
print(sum_of_squares)
print(magnitude)
print(distance)

 Part2:
import numpy as np

x = [[1, 2, 3], [1, 2, 1], [1, 3, 1]]
y = [[1], [2], [3]]
a = [[1, 2], [2, 1]]
b = [[1, 1], [2, 2]]
a = np.array(a)
b = np.array(b)

matrix_sum = a + b              # element-wise matrix addition
product = np.dot(a, b)          # matrix multiplication
matrix = np.array(x)
transpose = np.transpose(matrix)
inverse = np.linalg.inv(matrix)
solve = np.dot(inverse, y)      # solves the linear system matrix . v = y

print(np.array(y))
print(transpose)
print(inverse)
print(solve)
print(matrix_sum)
print(product)

Output:
Part1:

Part2:

Conclusion: Thus, we have studied linear algebra and matrix operations.
Experiment No: 3
Program on data manipulation with DataFrames using pandas.
Aim: To study data manipulation with DataFrames using pandas.
Theory:
Pandas is a popular Python package for data science, and with good reason: it offers
powerful, expressive and flexible data structures that make data manipulation and analysis
easy, among many other things. The DataFrame is one of these structures. Making your
DataFrames is your first step in almost anything that you want to do when it comes to data
munging in Python. Sometimes you will want to start from scratch, but you can also convert
other data structures, such as lists or NumPy arrays, to pandas DataFrames.
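Since the program below builds its DataFrame from zipped lists, here is a minimal sketch of the other conversion mentioned above, a DataFrame built from a NumPy array (the column names are illustrative):

import numpy as np
import pandas as pd

data = np.array([[2014, 200], [2015, 500], [2016, 550]])
df = pd.DataFrame(data, columns=['year', 'population'])
print(df)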
Program:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import sys

print(sys.version)
print(pd.__version__)
print(np.__version__)

names = ['Bob', 'jessica', 'Mary', 'John', 'Mel', 'Bob']
births = [968, 240, 77, 953, 973, 900]

# merge the two lists into (name, births) pairs
dataset = list(zip(names, births))
print(dataset)

# build the DataFrame
df = pd.DataFrame(dataset, columns=['Names', 'births'])
print(df)

# export the dataset
df.to_csv(r'/home/umi/Desktop/Dataset.csv')

# to import a dataset:
# df = pd.read_csv(location)
print(df)

print("unique name", df["Names"].unique())

# check datatypes of the columns
x = df.dtypes
print(x)

# analyze data: sort by births, highest first
sorted_df = df.sort_values(['births'], ascending=False)
print(sorted_df)
highest = sorted_df.head(1)
print(highest)

# plot graph
graph = df['births'].plot()

# maximum value in the dataset
max_births = df['births'].max()
print("highest is", max_births)

# name associated with the maximum value
max_name = df['Names'][df['births'] == max_births].values[0]
text = str(max_births) + '-' + max_name

# add text to graph
plt.annotate(text, xy=(1, max_births), xytext=(8, 0),
             xycoords=('axes fraction', 'data'),
             textcoords='offset points')
plt.show()

Output:

Conclusion: Thus, we have studied data manipulation with DataFrames using pandas.
Experiment No 4:
Program on working with files.
Aim: To study working with files in Python.
Theory:
When you’re working with Python, you don’t need to import a library in order to read and
write files. It’s handled natively in the language, albeit in a unique manner. The first thing
you’ll need to do is use Python’s built-in open function to get a file object. The open function
opens a file. It’s simple. When you use the open function, it returns something called a file
object. File objects contain methods and attributes that can be used to collect information
about the file you opened. They can also be used to manipulate said file. For example,
the mode attribute of a file object tells you which mode a file was opened in. And
the name attribute tells you the name of the file that the file object has opened. You must
understand that a file and file object are two wholly separate – yet related – things.
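A minimal sketch of the two attributes mentioned above (the file name is illustrative):

f = open("dhara1.txt", "w")
print(f.mode)   # 'w'  - the mode the file was opened in
print(f.name)   # 'dhara1.txt' - the name of the opened file
f.close()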

Program:
- Read File:
with open("dhara", "r") as f:
    file_content = f.read()
    print(file_content)

- Write File:
with open("dhara1.txt", "w") as f1:
    f1.write("DSBA")
# no explicit close is needed: the with-block closes the file automatically

- Copy File:
with open("dhara", "r") as f:
    with open("dhara1.txt", "a") as f1:
        for line in f:
            f1.write(line)
Output:
Conclusion: Thus, we have studied working with files in Python.
Experiment No:5
Program on the Naive Bayes algorithm
Aim: To study the Naive Bayes algorithm
Theory:
- It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes
that the presence of a particular feature in a class is unrelated to the presence of any
other feature. For example, a fruit may be considered to be an apple if it is red, round,
and about 3 inches in diameter. Even if these features depend on each other or upon
the existence of the other features, all of these properties independently contribute to
the probability that this fruit is an apple and that is why it is known as ‘Naive’.
- The Naive Bayes model is easy to build and particularly useful for very large data sets.
Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
Example:
Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method of posterior probability discussed above.
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has the higher probability.
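A quick way to check this arithmetic is to compute the posterior directly in Python; a minimal sketch using the frequencies quoted above:

# frequencies from the weather example above
p_sunny_given_yes = 3 / 9     # P(Sunny | Yes)
p_sunny = 5 / 14              # P(Sunny)
p_yes = 9 / 14                # P(Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6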

Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and in problems
having multiple classes.

Applications:

 Real time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast.
Thus, it could be used for making predictions in real time.
 Multi class Prediction: This algorithm is also well known for its multi-class prediction
feature. Here we can predict the probability of multiple classes of the target variable.
 Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers,
mostly used in text classification (due to better results in multi-class problems and the
independence rule), have a higher success rate compared to other algorithms. As a
result, they are widely used in spam filtering (identifying spam e-mail) and sentiment
analysis (in social media analysis, to identify positive and negative customer
sentiments).
 Recommendation System: A Naive Bayes classifier and collaborative filtering
together build a recommendation system that uses machine learning and data mining
techniques to filter unseen information and predict whether a user would like a given
resource or not.

Program:

#Import Library of Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
import numpy as np
#assigning predictor and target variables
x= np.array([[15,20,8],[11,15,18], [7,20,8], [12,8,20], [22,8,4],
[14,20,12], [11,11,11], [17,16,15], [12,20,4], [12,17,23], [12,6,7],
[12,16,12]])
y= np.array(['f','p','f','f','f','p','p','p', 'f','p','f', 'p'])
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(x,y)

#Predict Output
prediction= model.predict([[17,16,7]])
print(prediction)

Output:

Conclusion: Thus, we have studied the Naive Bayes model, which is easy to build and
particularly useful for very large data sets.
Experiment No: 6
Program on shortest path
Aim: To study the implementation of shortest path in graph.

Theory:

Graphs are networks consisting of nodes connected by edges or arcs. In directed graphs, the
connections between nodes have a direction, and are called arcs; in undirected graphs, the
connections have no direction and are called edges. We mainly discuss directed graphs.
Algorithms in graphs include finding a path between two nodes, finding the shortest path
between two nodes, determining cycles in the graph (a cycle is a non-empty path from a node
to itself), finding a path that reaches all nodes (the famous "traveling salesman problem"),
and so on. Sometimes the nodes or arcs of a graph have weights or costs associated with
them, and we are interested in finding the cheapest path.
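A minimal sketch of how such weights could be attached to a graph (the program below uses the unweighted form; node names and costs here are illustrative):

# each node maps to a dict of neighbour -> edge cost
weighted_graph = {'A': {'B': 4, 'C': 2},
                  'B': {'D': 5},
                  'C': {'D': 1},
                  'D': {}}
# the cheapest path A -> D is A -> C -> D, with total cost 2 + 1 = 3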
There's considerable literature on graph algorithms, which are an important part of discrete
mathematics. Graphs also have much practical use in computer algorithms. Obvious
examples can be found in the management of networks, but examples abound in many other
areas. For instance, caller-callee relationships in a computer program can be seen as a graph
(where cycles indicate recursion, and unreachable nodes represent dead code).

Program:
#to find all paths and the shortest path in a bidirectional graph
#the graph is a dict mapping each node to its list of neighbours
graph = {'A': ['B', 'C'],
         'B': ['C', 'A', 'D'],
         'C': ['D', 'A', 'B', 'F'],
         'D': ['C', 'B'],
         'E': ['F', 'E'],
         'F': ['C']}

#all paths
def find_all_paths(graph, start, end, path=[]):
    path = path + [start]
    if start == end:
        return [path]
    if start not in graph:
        return []
    paths = []
    for node in graph[start]:
        if node not in path:
            newpaths = find_all_paths(graph, node, end, path)
            for newpath in newpaths:
                paths.append(newpath)
    return paths

print("all paths:", find_all_paths(graph, 'F', 'A'))

#shortest path
def find_shortest_path(graph, start, end, path=[]):
    path = path + [start]
    if start == end:
        return path
    if start not in graph:
        return None
    shortest = None
    for node in graph[start]:
        if node not in path:
            newpath = find_shortest_path(graph, node, end, path)
            if newpath:
                if not shortest or len(newpath) < len(shortest):
                    shortest = newpath
    return shortest

print("shortest path:", find_shortest_path(graph, 'F', 'A'))

Output:

Conclusion: Thus, we have studied how to find the shortest path in a graph.
Experiment No:7
Aim: To Implement K-Means Clustering.
Theory:
K-means clustering:
K-means clustering is a type of unsupervised learning, which is used when you have
unlabeled data (i.e., data without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the number of groups represented by the
variable K. The algorithm works iteratively to assign each data point to one of K groups
based on the features that are provided. Data points are clustered based on feature
similarity.
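Before handing the work to scikit-learn, here is a minimal NumPy sketch of one k-means iteration, i.e. the assign-then-update cycle described above (data points and starting centroids are illustrative):

import numpy as np

# toy data and two initial centroids
X = np.array([[1.0, 1.5], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

# assignment step: give each point the label of its nearest centroid
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(distances, axis=1)

# update step: move each centroid to the mean of its assigned points
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(labels, centroids)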

Program:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use("ggplot")

x = [1, 5, 1.5, 8, 1, 9]
y = [8, 8, 1.8, 8, 0.6, 11]

plt.scatter(x, y)
plt.show()

X = np.array([[1, 8], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["g.", "r.", "c.", "y."]

for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()

Output:
Conclusion: Thus, we have studied the implementation of K-means clustering using python.
Experiment No:8
Aim: To study web scraping
Theory:
- Web scraping is extracting meaningful, structured data from web pages.
When performing data science tasks, it's common to want to use data found on the
internet. You'll usually be able to access this data in CSV format, or via an Application
Programming Interface (API). However, there are times when the data you want can
only be accessed as part of a web page. In cases like this, you'll want to use a
technique called web scraping to get the data from the web page into a format you can
work with in your analysis.
- Pages on the Web are written in HTML, in which text is (ideally) marked up into
elements and their attributes:
<html>
<head>
<title>A web page</title>
</head>
<body>
<p id="author">Joel Grus</p>
<p id="subject">Data Science</p>
</body>
</html>
In a perfect world, where all web pages are marked up semantically for our benefit, we
would be able to extract data using rules like "find the <p> element whose id is subject
and return the text it contains." In the actual world, HTML is not generally well-formed,
let alone annotated. This means we'll need help making sense of it.
To get data out of HTML, we will use the Beautiful Soup library, which builds a tree out
of the various elements on a web page and provides a simple interface for accessing
them. We will be using Beautiful Soup 4 (pip install beautifulsoup4). We'll also be
using the requests library (pip install requests), which is a much nicer way of making
HTTP requests than anything that's built into Python.
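As a minimal sketch of the rule quoted above, Beautiful Soup can extract the subject paragraph from the small HTML page shown earlier:

from bs4 import BeautifulSoup

html = """<html>
<head><title>A web page</title></head>
<body>
<p id="author">Joel Grus</p>
<p id="subject">Data Science</p>
</body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')
# find the <p> element whose id is subject and return the text it contains
print(soup.find('p', id='subject').get_text())   # Data Science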

Program:
import requests
from bs4 import BeautifulSoup

url='http://www.wordsforlife.org.uk/songs/jack-and-jill-went-hill'

page=requests.get(url)
ps=page.status_code

pt=page.text

#print(ps)

#print(pt)

soup = BeautifulSoup(page.text, 'html.parser')

#print(soup.prettify())

ptag=soup.find_all('p')

ptag3=soup.find_all('p')[0].get_text()

print(ptag3)

#print(ptag)

with open("wb.txt",'w') as w1:


w1.write(ptag3)
with open("wb.txt",'r') as r:
with open("New_wb.txt",'w') as w:
for line in r:
w.write(line)
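Output:

Conclusion: Thus, we have studied web scraping using the requests and Beautiful Soup libraries.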
Experiment No:9
Aim: Create a function to determine the probability of
a) drawing a heart
b) drawing a face card
c) drawing the queen of hearts

Theory:
- Probability is a way of quantifying the uncertainty associated with events chosen
from some universe of events; think of rolling a die. The universe consists of all
possible outcomes, and any subset of these outcomes is an event; for example, "the
die rolls a one" or "the die rolls an even number." Notationally, we write P(E) to mean
"the probability of the event E." We'll use probability theory to build models, and
we'll use probability theory to evaluate them.
- Dependence and independence: events E and F are dependent if knowing something
about whether E happens gives us information about whether F happens (and vice
versa). Otherwise, they are independent.
- Mathematically, we say that two events E and F are independent if the probability that
they both happen is the product of the probabilities that each one happens:
P(E, F) = P(E) * P(F)
- If they are not necessarily independent (and if the probability of F is not zero), then
we define the probability of E "conditional on F" as
P(E|F) = P(E, F) / P(F)
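A minimal sketch of the conditional probability formula, applied to the deck of cards used in the program below (the helper name is illustrative):

def conditional_probability(p_e_and_f, p_f):
    # P(E|F) = P(E, F) / P(F)
    return p_e_and_f / p_f

# P(queen of hearts | heart) = (1/52) / (13/52) = 1/13
print(conditional_probability(1 / 52, 13 / 52))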

Program:

def event_probability(event_outcomes, sample_space):
    probability = (event_outcomes / sample_space) * 100
    return round(probability, 1)

# Sample Space
cards = 52

# Determine the probability of drawing a heart
hearts = 13
heart_probability = event_probability(hearts, cards)

# Determine the probability of drawing a face card
face_cards = 12
face_card_probability = event_probability(face_cards, cards)

# Determine the probability of drawing the queen of hearts
queen_of_hearts = 1
queen_of_hearts_probability = event_probability(queen_of_hearts, cards)

# Print each probability
print(str(heart_probability) + '%')
print(str(face_card_probability) + '%')
print(str(queen_of_hearts_probability) + '%')
Output:

Conclusion: Thus, we have created a function to determine probability, which is a way of
quantifying the uncertainty associated with events chosen from some universe of events.
Experiment No:10
Program on MapReduce (word count)
Aim: To study MapReduce.

Theory:
- MapReduce is a programming model and an associated implementation for
processing and generating big data sets with a parallel, distributed algorithm on a
cluster. A MapReduce program is composed of a map procedure (or method), which
performs filtering and sorting (such as sorting students by first name into queues, one
queue for each name), and a reduce method, which performs a summary operation
(such as counting the number of students in each queue, yielding name frequencies).
- The "MapReduce System" (also called "infrastructure" or "framework") orchestrates
the processing by marshalling the distributed servers, running the various tasks in
parallel, managing all communications and data transfers between the various parts of
the system, and providing for redundancy and fault tolerance. The model is a
specialization of the split-apply-combine.
- It is inspired by the map and reduce functions commonly used in functional
programming, although their purpose in the MapReduce framework is not the same as
in their original forms (a small functional-style sketch follows this list). The key
contributions of the MapReduce framework are not the actual map and reduce
functions (which, for example, resemble the 1995 Message Passing Interface
standard's reduce and scatter operations), but the scalability and fault tolerance
achieved for a variety of applications by optimizing the execution engine.
- The use of this model is beneficial only when the optimized distributed shuffle
operation (which reduces network communication cost) and fault tolerance features of
the MapReduce framework come into play. Optimizing the communication cost is
essential to a good MapReduce algorithm. MapReduce libraries have been written in
many programming languages, with different levels of optimization. A popular
open-source implementation is Apache Hadoop.
- The name MapReduce originally referred to the proprietary Google technology, but
has since been genericized. By 2014, Google was no longer using MapReduce as their
primary big data processing model, and development on Apache Mahout had moved
on to more capable and less disk-oriented mechanisms that incorporated full map and
reduce capabilities.
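To make the functional-programming inspiration concrete, here is a minimal in-memory sketch of a word count using Python's built-in map and functools.reduce (an illustration only, not the distributed MapReduce program below):

from functools import reduce

words = "map reduce map".split()

# map step: emit a (word, 1) pair for every word
pairs = map(lambda w: (w, 1), words)

# reduce step: sum the counts per word into a dict
def combine(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

print(reduce(combine, pairs, {}))   # {'map': 2, 'reduce': 1}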

Program:

A) For map (mapper.py):
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

B) To reduce:
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN


for line in sys.stdin:
# remove leading and trailing whitespace
line = line.strip()

# parse the input we got from mapper.py


word, count = line.split('\t', 1)

# convert count (currently a string) to int


try:
count = int(count)
except ValueError:
# count was not a number, so silently
# ignore/discard this line
continue

# this IF-switch only works because Hadoop sorts map output


# by key (here: word) before it is passed to the reducer
if current_word == word:
current_count += count
else:
if current_word:
# write result to STDOUT
print('%s\t%s' % (current_word, current_count))
current_count = count
current_word = word

# do not forget to output the last word if needed!


if current_word == word:
print('%s\t%s' % (current_word, current_count))
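The pair of scripts can be tested locally by simulating Hadoop's shuffle-and-sort phase with the Unix sort command. Assuming the mapper and reducer are saved as mapper.py and reducer.py and input.txt is any text file (illustrative names), a typical invocation would be:

cat input.txt | python mapper.py | sort | python reducer.py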

Conclusion: Thus, we have studied the implementation of MapReduce using word count.
