
IJISET - International Journal of Innovative Science, Engineering & Technology, Vol. 2 Issue 5, May 2015.

www.ijiset.com
ISSN 2348 – 7968

STOCHASTICALLY REDUCING OVERFITTING IN DEEP NEURAL NETWORK USING DROPOUT

Nishtha Tripathi1, Avani Jadeja2


1 M.E. Student, Computer Engineering, Hashmukh Goswami College of Engineering, GTU, India
2 Assistant Professor, Computer Engineering, Hashmukh Goswami College of Engineering, GTU, India

Abstract
Deep neural networks are trained with a large number of parameters, which are likely to co-adapt and overfit. Overfitting is a challenging problem in deep neural networks. Dropout training has shown a significant effect in improving deep neural networks. The aim of this work is to study dropout and the regularization methods that are built on it. Real-world data is noisy: features may be missing, unlabeled or unstructured. We study a method that distorts the data prior to training so that the distortion acts as a regularizer, creating data that is correlated with real-world data. The Restricted Boltzmann Machine is a probabilistic, energy-based graphical model with no connections between hidden units or between visible units. RBMs can be stacked to form a Deep RBM for training.
Keywords: Deep neural networks, Regularization, Overfitting, Distorted Distribution.

1. Introduction and Motivation

Neural networks are powerful computational models that are used extensively for solving problems in vision, speech, natural language processing and many other areas. The Deep Neural Network is a fascinating and complex computational model inspired by the human visual cortex. The main aim of machine learning is to build models that generalize well on unseen data. A wide gap between training and test error is known as overfitting.

The proficiency of a model (e.g. a neural network) is measured on data it has not been trained on. Training the model is the most important phase for deciding its parameters, and the model should be trained with the problems of the dataset in mind. Here we performed experiments with a distorted-distribution method, which distorts both the training and the test data to provide variety during training; the model is then evaluated on unseen data. It is a simple and effective method that gives a significant improvement in training time as well as a reduction in error.

2. Deep Restricted Boltzmann Machine

Figure 1: RBM. Visible units b1, ..., bn and hidden units c1, ..., cm form a bipartite graph with no intra-layer connections.

• Energy function

$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j$

• Probability of a visible and hidden vector

$p(v, h) = \frac{1}{Z} e^{-E(v,h)}, \quad \text{where } Z = \sum_{v,h} e^{-E(v,h)}$

→ Visible-to-hidden probabilities

$p(h_j = 1 \mid v) = a_f\Big(b_j + \sum_i v_i w_{ij}\Big)$, where $a_f$ is the activation function.

→ Hidden-to-visible probabilities

$p(v_i = 1 \mid h) = a_f\Big(a_i + \sum_j h_j w_{ij}\Big)$
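Because the RBM graph is bipartite, both conditional distributions factorize over units and can be computed with a single matrix product. The following is a minimal NumPy sketch of the energy and of the two conditionals; the variable names (W, a, b) and the choice of the logistic function for a_f are illustrative assumptions, not the authors' implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, a, b):
    # E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i w_ij h_j
    return -np.dot(a, v) - np.dot(b, h) - np.dot(v, W @ h)

def p_hidden_given_visible(v, W, b):
    # p(h_j = 1 | v) = a_f(b_j + sum_i v_i w_ij), with a_f taken as the logistic function
    return sigmoid(b + v @ W)

def p_visible_given_hidden(h, W, a):
    # p(v_i = 1 | h) = a_f(a_i + sum_j h_j w_ij)
    return sigmoid(a + W @ h)

# Example with 6 visible and 3 hidden units
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 3))        # weight matrix, visible x hidden
a, b = np.zeros(6), np.zeros(3)               # visible and hidden biases
v = rng.integers(0, 2, size=6).astype(float)  # a binary visible vector
print(p_hidden_given_visible(v, W, b))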

The Restricted Boltzmann Machine is an undirected, energy-based graphical model. In energy-based models the energy function defines the distribution of the data over the hidden and visible units; the energy function is as given in the equation above.

RBMs can be stacked to form a DRBM, which can be trained using CD-k for faster learning. DRBMs can be used for various machine learning tasks such as computer vision, natural language processing and handwriting recognition.

Figure 2: Workflow for training the DRBM. The workflow covers the dataset, state activation and pre-training (compute energy; explore; prepare; train the layers one by one unsupervised; fix the parameters; stack each new layer on the previous layer's features), the learning phase (positive phase with clamped weights; negative phase with CD-k; learn the weights; weight updates), fine-tuning (add an output layer with labels; update the weights; save the weights) and prediction (load the weight matrix; use the model for prediction). The data is first distorted as shown in Fig. 3, pre-trained using the RBM, and then the CD-k algorithm is applied to learn the weights.

Figure 3: Workflow for distorting the data. The data is first pre-processed (import the data, convert it to the required format, explore it), then a noise type is selected at random (Gaussian, Laplace or dropout noise) and the distortion is applied (select the data, apply the distortion, generate the new data).

3. Contrastive Divergence K-steps

Input: RBM (V_1, ..., V_m, H_1, ..., H_n), training batch S
Output: gradient approximations Δw_ij, Δb_j and Δc_i for i = 1, ..., n, j = 1, ..., m

  init Δw_ij = Δb_j = Δc_i = 0 for i = 1, ..., n, j = 1, ..., m
  for all v ∈ S do
    v^(0) ← v
    for t = 0, ..., k-1 do
      for i = 1, ..., n do sample h_i^(t) ~ p(h_i | v^(t))
      for j = 1, ..., m do sample v_j^(t+1) ~ p(v_j | h^(t))
    for i = 1, ..., n, j = 1, ..., m do
      Δw_ij ← Δw_ij + p(H_i = 1 | v^(0)) · v_j^(0) - p(H_i = 1 | v^(k)) · v_j^(k)
    for j = 1, ..., m do
      Δb_j ← Δb_j + v_j^(0) - v_j^(k)
    for i = 1, ..., n do
      Δc_i ← Δc_i + p(H_i = 1 | v^(0)) - p(H_i = 1 | v^(k))
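A compact NumPy sketch of the CD-k procedure above for a binary RBM is given next. It follows the pseudocode but keeps the Section 2 naming (a for visible biases, b for hidden biases, W of shape visible x hidden); it is an illustrative sketch under those assumptions, not the authors' implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(batch, W, a, b, k=1, rng=None):
    # Approximate the gradients (dW, da, db) for one training batch with CD-k.
    rng = np.random.default_rng(0) if rng is None else rng
    dW, da, db = np.zeros_like(W), np.zeros_like(a), np.zeros_like(b)
    for v0 in batch:                                        # for all v in S
        v = v0.copy()
        for _ in range(k):                                  # k Gibbs steps
            ph = sigmoid(b + v @ W)                         # p(h = 1 | v^(t))
            h = (rng.random(ph.shape) < ph).astype(float)   # sample h^(t)
            pv = sigmoid(a + W @ h)                         # p(v = 1 | h^(t))
            v = (rng.random(pv.shape) < pv).astype(float)   # sample v^(t+1)
        ph0 = sigmoid(b + v0 @ W)                           # p(H = 1 | v^(0))
        phk = sigmoid(b + v @ W)                            # p(H = 1 | v^(k))
        dW += np.outer(v0, ph0) - np.outer(v, phk)          # weight update term
        da += v0 - v                                        # visible-bias update term
        db += ph0 - phk                                     # hidden-bias update term
    return dW, da, db

In the learning phase of Fig. 2 these approximations would then be scaled by a learning rate and added to W, a and b.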

4. Distorted Distribution

The data will be distorted by different distributions, and the chosen distribution defines the characteristics of the new features. The new distribution has the form

$P(x' \mid x) = \prod_{n=1}^{N} P_D(x'_n \mid x_n; \theta_n)$

where $x'$ is the new (distorted) feature vector, $x$ is the original feature vector, $\theta_n$ are the model parameters and $P_D$ is the type of distribution.

$P_D$ can take one of the following forms:
1. Dropout, in which the n-th feature is randomly set to zero with probability $p_n$;
2. Gaussian noise on the n-th feature with variance $\sigma$;
3. Laplace noise on the n-th feature with variance $\lambda$.
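As an illustration of these three choices of P_D, the sketch below draws x' feature-wise given x. It assumes NumPy; the function name distort and the parameters p, sigma and lam (standing in for p_n, σ and λ above) are hypothetical placeholders rather than the paper's implementation.

import numpy as np

def distort(X, kind="dropout", p=0.2, sigma=0.1, lam=0.1, rng=None):
    # Return a distorted copy x' ~ P(x'|x) = prod_n P_D(x'_n | x_n; theta_n)
    rng = np.random.default_rng(0) if rng is None else rng
    if kind == "dropout":
        # each feature is set to zero independently with probability p
        return X * (rng.random(X.shape) >= p)
    if kind == "gaussian":
        # additive Gaussian noise on each feature, scale sigma
        return X + rng.normal(0.0, sigma, size=X.shape)
    if kind == "laplace":
        # additive Laplace noise on each feature, scale lam
        return X + rng.laplace(0.0, lam, size=X.shape)
    raise ValueError("unknown distribution: " + str(kind))

# Example: distort a batch of 784-dimensional vectors with a randomly selected
# noise type, mirroring the "randomize noise selection" step of Fig. 3
rng = np.random.default_rng(1)
X = rng.random((100, 784))
X_dist = distort(X, kind=rng.choice(["dropout", "gaussian", "laplace"]), rng=rng)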



Figure 4: Random images taken after applying the distortions; from left, the first is dropout, then Laplace and Gaussian distribution respectively.

5. Experimental Results and Analysis

The MNIST database consists of 60,000 training and 10,000 test images of handwritten digits of size 28 by 28 pixels. The digit data are taken from the MNIST data set [26, 31], which was constructed by modifying a subset of the much larger dataset produced by NIST (the National Institute of Standards and Technology). Some of the data was collected from Census Bureau employees and the rest from high-school children, and care was taken to ensure that the test examples were written by different individuals to the training examples. We used the training set to train the RBM.

We trained a 2-layer DRBM with 500 and 1,000 hidden units in its two layers and 784 visible units. The dataset was divided into mini-batches of 100 to reduce training time. The model was trained using CD-k steps, with a logistic activation function, and then fine-tuned using gradient calculation [39].

Evaluation criteria:
Error % = (misclassified images × 100) / total images
Test images: 10,000
Train images: 60,000

Figure 5: Error for the different layers and numbers of epochs shown in Table I, with fine-tuning.

Model         Layer 1        Layer 2        DRBM           Train-Error %   Test-Error %
              (100 epochs)   (200 epochs)   (500 epochs)
DBM           28.31 min      3.23 hours     20.0 hours     0               1.3
DBM_Dist.     29 min         3.26 hours     20.4 hours     0               0.87
DBM_Dropout   28.21 min      3.27 hours     20.0 hours     0.11            1.03

Table I: Recorded time and error for the different layers, with fine-tuning.

In the experiment with the distorted-distribution model, both the test and training data are distorted at a rate of 0.2 %, which gives a significant improvement in error rate and also reduces training time. Dropout takes more time than our method to reach the same error rate.
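For concreteness, the sketch below spells out the evaluation criterion and the experiment configuration described above. The config dictionary and the hypothetical predict() call are assumptions for illustration (reading "500-1000 hidden units" as 500 units in the first layer and 1000 in the second); they are not the MATLAB implementation used in the paper [39].

import numpy as np

config = {
    "visible_units": 784,         # 28 x 28 MNIST images
    "hidden_units": [500, 1000],  # assumed sizes of the two DRBM layers
    "batch_size": 100,            # mini-batches used to reduce training time
    "activation": "logistic",     # activation used for the calculations
}

def error_percent(predicted_labels, true_labels):
    # Error % = (misclassified images * 100) / total images
    misclassified = np.sum(np.asarray(predicted_labels) != np.asarray(true_labels))
    return 100.0 * misclassified / len(true_labels)

# Usage with hypothetical predictions on the 10,000-image test set:
# print(error_percent(predict(model, X_test), y_test))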
6. Conclusions

The Deep Neural Network is a powerful, complex computational model for non-linear processing. The problem with DNNs is that they deal with millions or billions of parameters, which are likely to co-adapt. Randomly distorting the features and the data can itself act as a regularizer. This removes the need to add model parameters, so possibly no extra computation is added for regularization, which in turn saves time.

We performed experiments using the distorted distribution on the MNIST dataset. The distorted distribution can be extended to use different types of distribution function.

For training our model on a wider variety of data, we increase the dataset size by applying different distributions. A comparison with GPU performance would also be worthwhile and could explore the distributed aspect as well. These methods can be extended to NLP, computer vision, as well as multimodal learning.

The DRBM model is implemented by stacking different RBMs. It would be interesting to observe the results of applying the method to other models such as the convolutional neural network (CNN) and the Deep Belief Net (DBN).

References
[1] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp. 1929-1958, Jul. 2014.
[2] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, Jul. 2012.
[3] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, Sep. 2013.
[4] J. T. Springenberg and M. Riedmiller. Improving Deep Neural Networks with Probabilistic Maxout Units. arXiv preprint arXiv:1312.6116, Jul. 2013.
[5] Y. Miao, F. Metze, and S. Rawat. Deep maxout networks for low-resource speech recognition. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2013.
[6] M. Cai, Y. Shi, and J. Liu. Deep maxout neural networks for speech recognition. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2013.
[7] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058-1066, 2013.
[8] X. Frazão and L. A. Alexandre. DropAll: Generalization of two convolutional neural network regularization methods. Image Analysis and Recognition, Springer International Publishing, pp. 282-289, 2014.
[9] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, Jan. 2013.
[10] M. Bertero. Regularization methods for linear inverse problems. Inverse Problems, Springer Berlin Heidelberg, pp. 52-112, 1986.
[11] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 3084-3092, 2013.
[12] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), pp. 1-127, Jan. 2009.
[13] A. Fischer and C. Igel. Training restricted Boltzmann machines: An introduction. Pattern Recognition, 47(1), pp. 25-39, 2014.
[14] G. E. Hinton. A practical guide to training restricted Boltzmann machines. Neural Networks: Tricks of the Trade, pp. 599-619, 2010.
[15] L. Prechelt. Early stopping - but when? In Neural Networks: Tricks of the Trade, Springer Berlin Heidelberg, pp. 55-69, 1998.
[16] R. Caruana, S. Lawrence, and L. Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pp. 402-408, 2001.
[17] K. Ha, S. Cho, and D. MacLachlan. Response models based on bagging neural networks. Journal of Interactive Marketing, 19(1), pp. 17-30, 2005.
[18] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1), pp. 1-58, 1992.
[19] L. Deng. An overview of deep-structured learning for information processing. In Proceedings of the Asian-Pacific Signal and Information Processing Annual Summit and Conference (APSIPA ASC), 2011.
[20] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, 19, p. 153, 2007.
[21] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, Cambridge, pp. 282-317, 1986.
[22] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), pp. 1527-1554, 2006.
[23] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11, pp. 3371-3408, 2010.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[25] N. Srivastava. Improving neural networks with dropout. Master's thesis, University of Toronto, Dec. 2013.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp. 2278-2324, Nov. 1998.
[27] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Apr. 2009.
[28] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[29] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. CVPR, 2004.
[30] S. Haykin. Neural Networks: A Comprehensive Foundation. Maxwell Macmillan International, New York, 1994.

[31] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[32] G. Hinton. Deep belief networks. Scholarpedia, vol. 4, no. 5, p. 5947.
[33] T. G. Slatton. A comparison of dropout and weight decay for regularizing deep neural networks. Undergraduate Honors Thesis, University of Arkansas Libraries, 2014.
[34] J. A. Koziol, E. M. Tan, L. Dai, P. Ren, and J. Y. Zhang. Restricted Boltzmann Machines for classification of hepatocellular carcinoma. Computational Biology Journal, 2014.
[35] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, p. 78, ACM, Jul. 2004.
[36] N. Tripathi and A. Jadeja. A survey of regularization methods for deep neural network. International Journal of Computer Science and Mobile Computing, Vol. 3, Issue 10, pp. 429-436, Nov. 2014.
[37] I. V. Tetko, D. J. Livingstone, and A. I. Luik. Neural network studies: Comparison of overfitting and overtraining. Journal of Chemical Information and Computer Sciences, 35(5), pp. 826-833, 1995.
[38] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), pp. 1771-1800, 2002.
[39] In.mathworks.com. MATLAB - The Language of Technical Computing. [Online]. Available: http://in.mathworks.com/products/matlab/. [Accessed: 05-May-2014].
