
globsyn

Python additional libraries and projects
Additional libraries and projects

Topics to be covered
Theano
TensorFlow
Keras
Projects

5 www.globsynfinishingschool.com
Theano
• Theano is a Python library for efficiently handling mathematical expressions
involving multi-dimensional arrays (also known as tensors).
• Common choice for implementing neural network models.
• Features of Theano:
• Automatic differentiation – you only have to implement the forward (prediction) part
of the model, and Theano will automatically figure out how to calculate the gradients at
various points, allowing you to perform gradient descent for model training.
• Transparent use of a GPU – you can write the same code and run it either on CPU or
GPU. Theano will figure out which parts of the computation should be moved to the
GPU.
• Speed and stability optimizations – Theano will internally reorganize and optimize your
computations, in order to make them run faster and be more numerically stable. It will
also try to compile some operations into C code, in order to speed up the computation

Theano basic example
import theano
import numpy

# x is a vector of 32-bit floats
x = theano.tensor.fvector('x')

# W is a Theano shared variable to which the vector is assigned
W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')

# y is the sum of all elements in the element-wise multiplication of x and W
y = (x * W).sum()

# Theano function f takes x as input and outputs y
f = theano.function([x], y)

# The summed product of [0.2, 0.7] and [1.0, 1.0] is 0.2*1.0 + 0.7*1.0 = 0.9
output = f([1.0, 1.0])
print(output)

Symbolic graphs in Theano
• When we create a model with Theano, we first define a symbolic
graph of all variables and operations that need to be performed
• Then we can apply this graph on specific inputs to get outputs.
• By chaining up various operations, we create a graph of all the variables
and functions that are needed to reach the output
values.
• This symbolic graph is also the reason why we can only use Theano-
specific operations when defining our models.
• If we tried to integrate functions from some random Python library
into our network, they would attempt to perform the calculations
immediately, instead of returning a Theano variable as needed.
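The define-first, run-later idea can be illustrated without Theano at all. The sketch below is plain Python and is not how Theano is implemented; the classes `Node`, `Var`, `Const`, `Mul` and `Sum` and the `evaluate` method are hypothetical names invented for this illustration. Building the expression only creates node objects; numbers appear only when the graph is evaluated on concrete inputs.

```python
# Toy illustration of a symbolic graph (NOT Theano's real implementation).
# All class and method names here are hypothetical.

class Node(object):
    """Base class: building expressions returns new nodes, not numbers."""
    def __mul__(self, other):
        return Mul(self, other)

    def sum(self):
        return Sum(self)


class Var(Node):
    """A named placeholder with no value until evaluation time."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]


class Const(Node):
    """A node holding a fixed list of numbers."""
    def __init__(self, values):
        self.values = list(values)

    def evaluate(self, env):
        return self.values


class Mul(Node):
    """Element-wise product of two vector-valued nodes."""
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, env):
        a = self.left.evaluate(env)
        b = self.right.evaluate(env)
        return [p * q for p, q in zip(a, b)]


class Sum(Node):
    """Sum of all elements of a vector-valued node."""
    def __init__(self, child):
        self.child = child

    def evaluate(self, env):
        return sum(self.child.evaluate(env))


# Step 1: define the graph. Nothing is computed yet.
x = Var('x')
W = Const([0.2, 0.7])
y = (x * W).sum()

# Step 2: apply the graph to specific inputs.
result = y.evaluate({'x': [1.0, 1.0]})
print(result)
```

A function from an ordinary numeric library would need real numbers at step 1, which is why it cannot be dropped into the middle of such a graph.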

Theano Variables
• We can define variables which don’t have any values yet. These are later
used as inputs to the network.
• The variable type has to be defined; it can be left at the default or specified
explicitly:
• x = theano.tensor.fvector('x') # variable x is a vector of 32-bit floats, and its name is
'x' (Python variable names are not visible to Theano)
• x = theano.tensor.vector('x', dtype='float32') # variable type specified explicitly

• Theano variables need explicit names. Here are some important types:
Constructor    dtype      ndim
fvector        float32    1
ivector        int32      1
fscalar        float32    0
fmatrix        float32    2
ftensor3       float32    3
dtensor3       float64    3
Theano functions
• Theano functions are hooks for interacting with the symbolic graph, used
for passing input into our network and collecting the resulting output.
• Here a Theano function f takes x as input and returns y as output:
• f = theano.function([x], y)
The first parameter is the list of input variables, and the second parameter is the list of
output variables (in this case, single value).
• When we construct a function, Theano builds the computational graph
and optimizes it as much as possible. It restructures mathematical
operations to make them faster and more stable, compiles some parts to
C, moves some tensors to the GPU, etc.
• Theano compilation can be controlled by setting the value of mode in the
environment variable THEANO_FLAGS:
• FAST_COMPILE – Fast to compile, slow to run. Python implementations only, minimal graph
optimisation.
• FAST_RUN – Slow to compile, fast to run. C implementations where available, full range of
optimisations

Theano minimal training example
• Here’s a minimal script for actually training something in Theano. We will be
training the weights in W using gradient descent, so that the result from the
model would be 20 instead of the original 0.9.
import theano
import numpy

x = theano.tensor.fvector('x')
target = theano.tensor.fscalar('target')  # target value we use for training
W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')
y = (x * W).sum()

# cost function is a simple squared distance from the target
cost = theano.tensor.sqr(target - y)

# grad takes the real-valued cost and a list of variables we want gradients for,
# and returns the list of gradients
gradients = theano.tensor.grad(cost, [W])

# update rule: subtract the gradient, multiplied by the learning rate
W_updated = W - (0.1 * gradients[0])

# list of tuples: the first element is the variable to update, the second is a
# variable containing the values the first should hold after the update
updates = [(W, W_updated)]

# two input arguments: the input vector and the target value used for training
f = theano.function([x, target], y, updates=updates)

# repeatedly call the function in order to train the parameters
for i in range(10):
    output = f([1.0, 1.0], 20.0)
    print(output)
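To see what the automatic gradient in the training script above stands for, the same loop can be written with the derivative worked out by hand. For cost = (target - y)^2 with y = sum(x * W), the gradient with respect to W is -2 * (target - y) * x. The NumPy sketch below is an illustration under that derivation, not Theano code; the names `learning_rate`, `grad_W` and `outputs` are chosen here for readability.

```python
import numpy as np

x = np.array([1.0, 1.0])
target = 20.0
W = np.array([0.2, 0.7])
learning_rate = 0.1

outputs = []
for i in range(10):
    y = (x * W).sum()                  # forward pass
    grad_W = -2.0 * (target - y) * x   # d/dW of (target - y)**2, derived by hand
    outputs.append(y)                  # record y as computed before the update
    W = W - learning_rate * grad_W     # same update rule as in the Theano script

print(outputs)
```

With these values each step shrinks the error (target - y) by a constant factor of 0.6, so the recorded outputs start at 0.9 and approach 20 without overshooting.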

TensorFlow
• TensorFlow is an open source software library for numerical computation using
data flow graphs.
• Nodes in the graph represent mathematical operations, while the graph edges
represent the multidimensional data arrays (tensors) communicated between
them.
• The flexible architecture allows you to deploy computation to one or more
CPUs or GPUs in a desktop, server, or mobile device with a single API.
• TensorFlow follows a lazy programming paradigm. It first builds a graph of all
the operations to be done, and then when a “session” is called, it “runs” the
graph. It’s built to be scalable, by changing internal data representation to
tensors (aka multi-dimensional arrays).
• TensorFlow is not only a neural network library; it can also be used for other ML
algorithms such as decision trees or k-Nearest Neighbors

TensorFlow (Contd.)
• The central unit of data in TensorFlow is the tensor.
• A tensor consists of a set of primitive values shaped into an array of any number of dimensions.
The number of dimensions is called the tensor's rank. Examples:
• 3 # a rank 0 tensor; a scalar with shape []
• [1., 2., 3.] # a rank 1 tensor; a vector with shape [3]
• [[1., 2., 3.], [4., 5., 6.]] # a rank 2 tensor; a matrix with shape [2, 3]
• [[[1., 2., 3.]], [[7., 8., 9.]]] # a rank 3 tensor with shape [2, 1, 3]
• A computational graph is a series of TensorFlow operations arranged into a graph of nodes.
• Each node in the graph takes zero or more tensors as inputs and produces a tensor as an output.
A TensorFlow program basically (a) builds the computational graph and (b) executes the graph.
• Example: Build a graph with 3 nodes: 2 constant nodes which take no inputs and 1 operation node
import tensorflow as tf
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0, dtype=tf.float32)
node3 = tf.add(node1, node2)
sess = tf.Session()  # tf.Session is the TensorFlow 1.x API
print(sess.run([node1, node2]))
print(sess.run(node3))

Output:
[3.0, 4.0]
7.0
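The rank and shape examples listed on this slide can be checked with NumPy, whose `ndim` and `shape` attributes correspond to what the slide calls rank and shape. This is a NumPy sketch for illustration only, not TensorFlow code; the names `examples` and `ranks_and_shapes` are made up here.

```python
import numpy as np

# The four example values from the slide, from rank 0 to rank 3.
examples = [
    3,
    [1., 2., 3.],
    [[1., 2., 3.], [4., 5., 6.]],
    [[[1., 2., 3.]], [[7., 8., 9.]]],
]

ranks_and_shapes = []
for value in examples:
    arr = np.array(value)
    ranks_and_shapes.append((arr.ndim, arr.shape))
    print(arr.ndim, arr.shape)
```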

Keras
• Keras is a minimalist Python library for deep learning that can run on top of Theano or
TensorFlow. Models in Keras are defined as a sequence of layers
• Developed to make implementing deep learning models as fast and easy as possible for
research and development.
• Runs on Python 2.7 or 3.5 and can seamlessly execute on GPUs and CPUs given the
underlying frameworks.
• 4 guiding principles:
• Modularity: A model can be understood as a sequence or a graph alone. All the
concerns of a deep learning model are discrete components that can be combined in
arbitrary ways.
• Minimalism: The library provides just enough to achieve an outcome, no frills and
maximizing readability.
• Extensibility: New components are intentionally easy to add and use within the
framework, intended for researchers to trial and explore new ideas.
• Python: No separate model files with custom file formats. Everything is native Python.

Keras steps
• Steps
• Define your model, e.g. Sequential Model creates a sequence.
• Add layers as building blocks of the model. The layer needs to know what input shape
it should expect, e.g. Dense.
• Compile your model. Specify loss functions and optimizers (e.g. rmsprop, adagrad).
• Training - Fit your model. Use Numpy arrays of input data and labels.
• Make predictions.
• Basic example.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))
model.fit(data, labels, epochs=10, batch_size=32)
predictions = model.predict(data)
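To make concrete what the two Dense layers, the loss and the metric in the example above compute, here is a rough NumPy sketch of a single forward pass. It is not Keras code and performs no training; the weight names `W1`, `b1`, `W2`, `b2` and their random initialisation are assumptions made for illustration (Keras chooses its own initial weights).

```python
import numpy as np

rng = np.random.RandomState(0)

data = rng.random_sample((1000, 100))      # same shapes as in the Keras example
labels = rng.randint(2, size=(1000, 1))

# Hypothetical, randomly initialised parameters (Keras picks its own).
W1 = rng.normal(scale=0.1, size=(100, 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 1))
b2 = np.zeros(1)

hidden = np.maximum(0.0, data.dot(W1) + b1)            # Dense(32, activation='relu')
probs = 1.0 / (1.0 + np.exp(-(hidden.dot(W2) + b2)))   # Dense(1, activation='sigmoid')

# Binary cross-entropy, the loss named in model.compile().
eps = 1e-7
clipped = np.clip(probs, eps, 1 - eps)
loss = -np.mean(labels * np.log(clipped) + (1 - labels) * np.log(1 - clipped))

# Accuracy at a 0.5 threshold, the metric named in model.compile().
accuracy = np.mean((probs > 0.5) == (labels == 1))
print(probs.shape, loss, accuracy)
```

Because the labels are random and the weights untrained, the accuracy here should hover around chance; fitting the model is what adjusts the weights to reduce the loss.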

Project 1: Predict Credit Card Acceptance
• A small credit card dataset for simple econometric analysis (taken from Kaggle,
originally from William Greene's book Econometric Analysis)
• Content
• card: Dummy variable, 1 if application for credit card accepted, 0 if not
• reports: Number of major derogatory reports
• age: Age in years plus twelfths of a year
• income: Yearly income (divided by 10,000)
• share: Ratio of monthly credit card expenditure to yearly income
• expenditure: Average monthly credit card expenditure
• owner: 1 if owns their home, 0 if rent
• selfempl: 1 if self employed, 0 if not.
• dependents: 1 + number of dependents
• months: Months living at current address
• majorcards: Number of major credit cards held
• active: Number of active credit accounts
• Goal: Predict whether a credit card application will be accepted based upon various
data about the applicant.
• Split into train and test data and create 4 different types of models from the data -
Decision Trees, Linear Regression, Naive Bayes and K-NN
• Do performance evaluation of each model
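The split, fit and evaluate workflow asked for here can be sketched in plain Python. The example below uses a 1-nearest-neighbour classifier as a stand-in for the K-NN model and accuracy as the performance measure. The helper names (`train_test_split`, `predict_1nn`, `accuracy`) are hypothetical, and the rows are made-up toy values, not records from the credit card dataset.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, features):
    """Return the label of the training row closest in squared Euclidean distance."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train, key=dist)[1]


def accuracy(train, test):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(1 for features, label in test
               if predict_1nn(train, features) == label)
    return hits / float(len(test))


# Made-up toy rows: ((reports, income), card). NOT taken from the real dataset.
rows = [((0, 4.5), 1), ((0, 3.0), 1), ((1, 2.5), 1), ((0, 6.0), 1),
        ((4, 2.0), 0), ((5, 1.5), 0), ((3, 2.2), 0), ((6, 3.0), 0)]

train, test = train_test_split(rows)
print(len(train), len(test), accuracy(train, test))
```

On the real data the features would usually be scaled before computing distances, and the same train and test split would be reused for all four model types so that their scores are comparable.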
Project 2: Predict Loan Application Status
• A small loan application dataset taken from Kaggle (contains training and test data)
• Content
• Application Id
• Gender
• Married
• Dependent
• Education
• Self-employed
• Applicant Income
• Co-applicant income
• Loan amount
• Credit History (1 if present, 0 or blank if not)
• Property area
• Goal: Predict whether a loan application will be accepted based upon applicant data.
• Create 4 different types of models from the data - Decision Trees, Linear Regression,
Naive Bayes and K-NN based on the training data set
• Apply on the test data set and compare the differences in the results

Project 3: Predict Disease
• A small medical diagnosis dataset taken from Kaggle
• Content : 3 different files
• Symptom ids versus names
• Diagnosis ids versus names
• Matrix linking symptom ids to diagnosis ids with a weight. "wei" means weight
(common = 1, life-threatening = 2, and common pediatrics = 3; 0 means no data)
• Goal: Predict (diagnose) diseases and their severity based upon symptoms.
• Split data from the matrix into train and test data and create 4 different types of
models from the training data - Decision Trees, Linear Regression, Naive Bayes and
K-NN
• Perform diagnosis based on the test data and see how it matches up with the actual
results

[Figures: sample rows of the Symptoms, Diagnosis, and Symptom vs. Diagnosis files]

Project 4: Predict Outcome of Tennis match
• Results for the men's ATP tour date back to January 2000, including Grand Slams,
Masters Series, Masters Cup and International Series competitions. Metadata can be
found here: http://www.tennis-data.co.uk/notes.txt
• Goal: Predict who will be the match winner
• Split into train and test data and create 4 different types of models from the data -
Decision Trees, Linear Regression, Naive Bayes and K-NN
• Do performance evaluation of each model

Project 5: Predict Party Affiliation of Congressmen
• Goal: To use Naive Bayes and nearest neighbors learning methods to predict the party affiliation of congressmen from their voting records.
• The file vote.dat contains, in tabular form, the voting record of the current House of Representatives on 4 important votes in 2003
• The first line of the file is a header, with the labels of each column.
• Each of the following lines corresponds to one congressman:
Chars. 1-2: Abbreviation of state name.
Chars. 7-24: Name of congressman.
Char 26: Party affiliation: R or D (Republican or Democrat).
Chars 31, 39, 45, 51: Votes on four issues. These are each "Y", "N", or "-" (no vote).
• It must be possible to easily set the size of the training set, NTRAIN, and the size of the test set NTEST. The training set will then be the
first NTRAIN lines of the data file and the test set will be the last NTEST lines.
• In the learning phase, you compute Freq(Party) and Freq(VoteI | Party) over the training set. In the prediction phase, you combine these
using the Naive Bayes formula to predict the party affiliation from the votes. In evaluation, you compare the predicted party affiliation to
the recorded affiliation, and note whether the prediction was correct or not.
• Note that while the Naive Bayes method elegantly handles null values, nearest neighbors does not, so for NN any congressman with any null
votes should be ignored.
• For each of two learning methods (Naive Bayes and nearest neighbors) you should output:
• The accuracy over the test set of the method learned over the training set.
• The accuracy over the training set of the method learned over the training set.
• For Naive Bayes, you should output the probability assigned to each outcome for the four congressmen Raymond Green, Denis
Majette, Clifford Stearns, and Michael Castle (the last four lines of the data file.)
• For nearest neighbors, you should output the "vote" on the outcome for these four congressmen.
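The learning and prediction phases for the Naive Bayes part can be sketched as follows, using the fixed-width layout given above (character positions are 1-based in the brief, 0-based in the code). This is one possible reading of the task, not a reference solution. The function names and `make_line` helper are hypothetical, the add-one smoothing is an assumption that the brief does not ask for, and the records are made up rather than taken from vote.dat.

```python
from collections import defaultdict

VOTE_COLUMNS = (30, 38, 44, 50)   # 0-based indexes of chars 31, 39, 45, 51
PARTY_COLUMN = 25                 # 0-based index of char 26


def parse(line):
    """Extract (party, votes) from one fixed-width record."""
    return line[PARTY_COLUMN], [line[c] for c in VOTE_COLUMNS]


def train(lines):
    """Count Freq(Party) and Freq(Vote_i | Party), skipping '-' (no vote)."""
    party_counts = defaultdict(int)
    vote_counts = defaultdict(int)    # key: (issue index, vote, party)
    voted_counts = defaultdict(int)   # key: (issue index, party)
    for line in lines:
        party, votes = parse(line)
        party_counts[party] += 1
        for i, vote in enumerate(votes):
            if vote != '-':
                vote_counts[(i, vote, party)] += 1
                voted_counts[(i, party)] += 1
    return party_counts, vote_counts, voted_counts


def predict(model, votes):
    """Return normalised Naive Bayes probabilities for each party."""
    party_counts, vote_counts, voted_counts = model
    total = float(sum(party_counts.values()))
    scores = {}
    for party, count in party_counts.items():
        score = count / total
        for i, vote in enumerate(votes):
            if vote == '-':
                continue   # a missing vote contributes no evidence
            # Add-one smoothing over {Y, N}: an assumption, not in the brief.
            score *= ((vote_counts[(i, vote, party)] + 1.0)
                      / (voted_counts[(i, party)] + 2.0))
        scores[party] = score
    norm = sum(scores.values())
    return dict((p, s / norm) for p, s in scores.items())


def make_line(state, name, party, votes):
    """Build a record in the documented layout (hypothetical helper)."""
    chars = [' '] * 51
    chars[0:2] = state
    chars[6:6 + len(name)] = name
    chars[PARTY_COLUMN] = party
    for column, vote in zip(VOTE_COLUMNS, votes):
        chars[column] = vote
    return ''.join(chars)


# Made-up records for illustration only; not from vote.dat.
training = [
    make_line('AA', 'Member One', 'R', 'YYNY'),
    make_line('BB', 'Member Two', 'R', 'YY-Y'),
    make_line('CC', 'Member Three', 'D', 'NNYN'),
    make_line('DD', 'Member Four', 'D', 'NNY-'),
]
model = train(training)
probabilities = predict(model, ['Y', 'Y', 'N', '-'])
print(probabilities)
```

Skipping "-" both when counting and when multiplying is what lets Naive Bayes use records with missing votes; a nearest-neighbour distance has no such natural way to skip a coordinate, which is why the brief drops those records for NN.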

Project 6: Analysis of Flipkart Data
• This is a subset of data created by extracting data from Flipkart.com.
• The dataset has the following fields
• pid: product id
• product_name
• product_category_tree
• retail_price
• discounted_price
• is_FK_Advantage_product (whether or not Flipkart advantage product)
• product_rating
• brand
• product_specifications
• Use Python Code to do the following:
• Perform analysis of the data and use Naive Bayes and Nearest Neighbor models to predict the following:
• What is the discount percentage for a given product category and brand
• Whether or not the product is a Flipkart advantage product given product category and brand
• Determine the specifications keys corresponding to each product category from the product_specifications column
• Create lists of specification value pairs for each product. The keys should be determined by the product category of that product
• For any particular product category, is there an association rule between some of the product specification values and the product rating?
Remember that product rating is not available in many cases
• Is the discount percent related to the brand and product rating?
Data source: https://www.kaggle.com/PromptCloudHQ/flipkart-products
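The dataset lists retail_price and discounted_price but no discount column, so the discount percentage has to be derived. The sketch below assumes the usual definition, 100 * (retail_price - discounted_price) / retail_price, and averages it per (product_category_tree, brand) pair. The function names are hypothetical and the rows are made-up values, not records from the dataset.

```python
from collections import defaultdict


def discount_percent(retail_price, discounted_price):
    """Discount relative to the retail price, in percent; None if not computable."""
    if retail_price is None or discounted_price is None or retail_price <= 0:
        return None
    return 100.0 * (retail_price - discounted_price) / retail_price


def mean_discount_by_group(rows):
    """Average discount percent per (product_category_tree, brand)."""
    groups = defaultdict(list)
    for row in rows:
        pct = discount_percent(row['retail_price'], row['discounted_price'])
        if pct is not None:
            groups[(row['product_category_tree'], row['brand'])].append(pct)
    return dict((key, sum(v) / len(v)) for key, v in groups.items())


# Made-up rows for illustration; not taken from the Flipkart data.
rows = [
    {'product_category_tree': 'Clothing', 'brand': 'BrandA',
     'retail_price': 1000, 'discounted_price': 500},
    {'product_category_tree': 'Clothing', 'brand': 'BrandA',
     'retail_price': 1000, 'discounted_price': 750},
    {'product_category_tree': 'Clothing', 'brand': 'BrandA',
     'retail_price': None, 'discounted_price': 300},
]
means = mean_discount_by_group(rows)
print(means)
```

Rows with a missing or zero retail price are skipped rather than counted as a 0% discount, so they do not distort the group averages.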

Project 7: Global Terrorism data
• This is a subset of data about global terrorist events taking place between 2012 and 2016.
• The dataset has the following fields
• iyear, imonth and iday: date of event
• country and city
• attacktype1_txt: attack type
• targtype1_txt: target institution type
• corp1: target institution
• target1: target population type
• natlty1_txt: nationality of attacker
• gname: name of attacking organization/terrorist group
• weaptype1_txt, weapsubtype1_txt and weapdetail: describes the weapon of attack
• nkill and nwound: number of people killed and wounded
• propextent_txt: extent of property damage
• ransomamt: ransom amount (e.g. in case of hostage situations): values like 0 or negative indicate that either this was not a ransom
case or the amount was unknown.
• Use Python Code to do the following:
• Perform analysis of the data and use Naive Bayes and Nearest Neighbor models to predict the following:
• Predict the terrorist group, given other data fields. What are the attributes that best correlate to terrorist group?
• Predict the weapon type, given the extent of damage
• Use classifiers to determine if the weapons used to carry out attacks have changed over the years. Similarly, have the target sites changed
over the years?
Data source: https://www.kaggle.com/START-UMD/gtd

Project 8: Detecting fraud in financial transactions
• Feature description :

step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).
type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount - amount of the transaction in local currency.
nameOrig - customer who started the transaction
oldbalanceOrg - initial balance before the transaction
newbalanceOrig - new balance after the transaction
nameDest - customer who is the recipient of the transaction
oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers
whose names start with M (Merchants).
newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers
whose names start with M (Merchants).

Project 8: Detecting fraud in financial transactions (Contd.)
isFraud - Marks the transactions made by the fraudulent agents. In this specific dataset the fraudulent behavior
of the agents aims to profit by taking control of customers' accounts and trying to empty the funds by transferring
them to another account and then cashing out of the system.
isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags
illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single
transaction.
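The flagging rule described for isFlaggedFraud can be written as a simple check. The sketch below is one reading of that description, not the dataset's actual flagging code. It assumes that "200.000" means two hundred thousand, that the rule applies only to transactions of type TRANSFER, and that a transaction is represented as a dict with `type` and `amount` keys; the function name is hypothetical.

```python
FLAG_THRESHOLD = 200000  # "200.000" in the description, read as two hundred thousand


def is_flagged_fraud(transaction):
    """True for a single transfer above the threshold (assumed rule)."""
    return (transaction['type'] == 'TRANSFER'
            and transaction['amount'] > FLAG_THRESHOLD)


# Made-up transactions for illustration only.
print(is_flagged_fraud({'type': 'TRANSFER', 'amount': 250000.0}))
print(is_flagged_fraud({'type': 'CASH-OUT', 'amount': 250000.0}))
```

Comparing such a rule against the isFraud column is a useful baseline before training the models.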

• Goal: Predict whether a transaction is fraudulent (isFraud)


• Split into train and test data and create 4 different types of models from the data - Decision Trees,
Linear Regression, Naive Bayes and K-NN

Project 9: Predicting Interview Attendance
This is data collected by a recruitment agency.
The agency tries to predict whether a candidate will turn up for an interview.
The dataset contains about two years of data, from September 2014 to January 2017.
Recruiters deal with candidates (job seekers) and clients (companies that are recruiting).

The features are:


• Date of Interview,
• Client name,
• Industry: This refers to the vertical the client belongs to (note that candidates can jump across verticals in their job
hunt)
• Location,
• Position to be closed: Niche (rare skill sets) or routine (common skill sets)
• Nature of Skillset:
This refers to the skill set that the client has specified
• Interview Type,

Project 9: Predicting Interview Attendance (Contd.)
There are three types of interview:

Walk-in drives: These are unscheduled. Candidates are either contacted or come to the interview of
their own volition.

Scheduled: Here the candidates' profiles are screened by the client and, subsequent to this, the vendor fixes
an appointment between the client and the candidate.

Scheduled walk-in: Here the number of candidates is larger and the candidates are informed beforehand of
a tentative date to ascertain their availability. The profiles are screened as in a scheduled interview. In a
sense it bears features of both a walk-in and a scheduled interview.

Project 9: Predicting Interview Attendance (Contd.)
• Name(Cand ID)
• Gender
• Candidate Current Location
• Candidate Job Location
• Interview Venue
• Candidate Native location
• Have you obtained the necessary permission to start at the required time
• Hope there will be no unscheduled meetings
• Can I Call you three hours before the interview and follow up on your attendance for the interview
• Can I have an alternative number/ desk number

Project 9: Predicting Interview Attendance (Contd.)

• I assure you that I will not trouble you too much,


• Have you taken a printout of your updated resume?
• Have you read the JD and understood the same?
• Are you clear with the venue details and the landmark?
• Has the call letter been shared
• Expected Attendance:
Whether the candidate was expected to attend the interview. Here it is either yes, no, or uncertain
• Observed Attendance:
Whether the candidate attended the interview. This is binary and will form our dependent variable
• Marital Status

Project 10: Predicting Customer Churn
Churning means leaving one network and joining another (for example, a customer switching from Vodafone to
Airtel). A telecom operator (here, Vodafone) needs to predict which customers are about to churn so that it can
prepare a customized plan for each such customer to prevent churning.

The columns of the dataset are described below:

st - state
acclen - account length
arcode - area code
phnum - phone number
intplan - internet plan (yes/no)
voice - voice
nummailmes - no of email messages

Project 10: Predicting Customer Churn (Contd.)
tdmin - total day time minutes
tdcal - total day time calls
tdchar - total day time charges
temin - total evening time minutes
tecal - total evening time calls
tecahr - total evening time charges
tnmin - total night time minutes
tncal - total night time calls
tnchar - total night time charges
timin - total international minutes
tical - total international calls
tichar - total international charges
ncsc - no. of customer services calls
label - Churned? (True/False)

Project 11: Predict burned area of Forest Fires using Meteorological Data
Feature description:

1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
3. month - month of the year: "jan" to "dec"
4. day - day of the week: "mon" to "sun"
5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
6. DMC - DMC index from the FWI system: 1.1 to 291.3
7. DC - DC index from the FWI system: 7.9 to 860.6
8. ISI - ISI index from the FWI system: 0.0 to 56.10
9. temp - temperature in Celsius degrees: 2.2 to 33.30
10. RH - relative humidity in %: 15.0 to 100
11. wind - wind speed in km/h: 0.40 to 9.40
12. rain - outside rain in mm/m2 : 0.0 to 6.4
13. area - the burned area of the forest (in ha): 0.00 to 1090.84 (this is the output feature)
