
University of Technology – Department of Computer Science

Final Exam Report for the Second Semester


2019-2020

Large Data Files for Machine Learning: Image data preprocessing

Student name: Mohammed Abbas Sabbar Ghali

Stage: Third (evening classes)
Branch: Software
Subject: Data Mining and Data Warehouses
Instructor: Dr. Khalil Ibrahim
Submission date:
Introduction
Data loading is the process of copying and loading data or data sets from a source
file, folder, or application into a database or similar application. It is usually
implemented by copying digital data from a source and pasting or loading it into a
data storage or processing utility. Exploring and applying machine learning
algorithms to datasets that are too large to fit into memory is quite common.
Extract, Transform, Load (ETL) operations aggregate, pre-process, and save data.
However, traditional ETL solutions cannot handle the volume, speed, and diversity
of big data sets. The Hadoop platform stores and processes big data in a distributed
environment, which makes it possible to divide incoming data streams into
fragments for the parallel processing of large data sets. The built-in scalability of
the Hadoop architecture speeds up ETL tasks, significantly reducing analysis time
[1]. During the early days of data integration, the driving force behind it was
wrapper-mediator schemes; the construction of the wrappers is a primitive form of
ETL scripting [2]. In the mid-1990s, data warehousing took center stage in database
research, and ETL was still there, but hidden behind the scenes and covered mainly
by popular books [3].
Working with such out-of-memory datasets raises several questions, the most
important of which are: How do I load my multi-gigabyte data file? What should I
do if the algorithm crashes when I try to run it on my dataset? And how can I deal
with out-of-memory errors?
In this report, I present an algorithm for progressively loading image files.
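Before detailing the algorithm, the following is a minimal sketch of the
progressive-loading idea, assuming the images live in a hypothetical images/
folder; an imageDatastore reads one file at a time instead of loading the whole
collection into memory:

% Minimal sketch: iterate over a folder of images one file at a time.
% The folder name 'images' is an assumption for illustration.
ds = imageDatastore('images');
while hasdata(ds)
    img = read(ds);  % loads a single image into memory
    % ... preprocess img here (see Algorithm 1 and Appendix 1) ...
end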
This report is organized as follows: Section II briefly reviews some applications
related to the created algorithm. Section III states the problem. The steps of the
created algorithm are presented in Section IV, and the conclusion and future work
are offered in Section V.
II. Applications of the created algorithm
This section reviews some applications from the literature of the last decade
that are related to the created algorithm.
The authors of [4] present an algorithm to validate the correctness of the
extract, transform, load (ETL) process applied to the data extracted into
West Virginia Clinical and Translational Science Institute's Integrated Data
Repository (IDR), a clinical data warehouse that includes data extracted from
two EHR systems. Four hundred ninety-eight observations were randomly
selected from the integrated data repository and compared with the two
source EHR systems. Of the 498 observations, 479 were concordant and 19
discordant. The discordant observations fell into three general categories:
(a) design decision differences between the IDR and the source EHRs, (b)
timing differences, and (c) user interface settings. After resolving the
apparent discordances, the integrated data repository was found to be 100%
accurate relative to its source EHR systems. The authors conclude that any
institution using a clinical data warehouse populated by extraction processes
from operational databases, such as EHRs, employs some form of ETL
process, and that as secondary use of EHR data begins to transform the
research landscape, the importance of basic validation of the extracted EHR
data cannot be overstated and should start with the validation of the
extraction process itself.

In 2013, Guillaume Lavoué, Laurent Chevalier, and Florent Dupont [5]
proposed a technical solution for streaming and visualizing compressed 3D
data on the web. Their approach relies on three strong features: (1) a
dedicated progressive compression algorithm for 3D graphic data with
colors, producing a binary compressed format that allows progressive
decompression with several levels of detail; (2) a JavaScript halfedge data
structure allowing complex geometrical and topological operations on a 3D
mesh; and (3) a multi-threaded JavaScript/WebGL implementation of the
decompression scheme allowing 3D data streaming in a web browser.
Experiments and comparisons with existing solutions show promising
results in terms of latency, adaptability, and quality of user experience. The
authors implemented complex geometry processing operations directly in
JavaScript, which provides useful insights into the benefits and limitations
of this scripting language; JavaScript showed unexpectedly impressive
performance in their case.
III. Problem statement
Storing big data is one thing, but processing it and developing machine
learning algorithms that work with it is another. This report addresses one
part of that problem: progressively loading and preprocessing image files,
as a step toward a scalable and parallelized machine learning platform for
large-scale data.
IV. Steps of the created algorithm
This section presents the steps used to create the algorithm for
progressively loading image files, as shown in Algorithm 1.
---------------------------------------------------------------------------------------
Algorithm 1 Image preprocessing
---------------------------------------------------------------------------------------
Procedure INVERT THE INPUT IMAGE
 - input_image = 255 - input_image.
End procedure
Procedure IMAGE BINARIZATION
 - Convert the image to binary (im2bw).
End procedure
Procedure BOUNDING BOX COMPUTATION
 - Label connected components in the 2-D binary image (bwlabel).
 - Measure properties of the image regions (regionprops).
 - Collect the bounding boxes of the regions.
 - Reshape the array (reshape).
End procedure
Procedure EXTRACTION OF THE BOUNDED IMAGE
 - Sort the array elements (sort), followed by cropping steps.
End procedure
Procedure CREATE AN EDGE IMAGE
 - Find edges in the intensity image (edge).
End procedure
Procedure CREATE A SKELETON IMAGE
 - Apply morphological operations to the binary image (bwmorph).
End procedure
Procedure COMPUTE THE BOUNDING BOX OF THE SKELETON IMAGE
 - The labeling and bounding box steps above are executed again on the
skeleton image.
End procedure
Procedure EXTRACT THE BOUNDED SKELETON IMAGE
 - Sort the array elements (sort) and crop.
 - Trace region boundaries in the binary image (bwboundaries).
 - Keep the unique rows of the boundary array (unique).
 - Convert the boundary points to (x, y) coordinates.
End procedure
Procedure ALTERNATE FINDING OF BOUNDARY PIXELS
 - Find indices and values of nonzero elements (find).
End procedure
End algorithm
--------------------------------------------------------------------------------------------
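As a hedged usage sketch of Algorithm 1 (the file name test.png is hypothetical;
preprocimg is the MATLAB function listed in Appendix 1):

% Hypothetical single-image usage of the Appendix 1 function.
img = imread('test.png');            % assumed test image
if size(img, 3) == 3
    img = rgb2gray(img);             % preprocimg expects grayscale input
end
[skel, bx, by] = preprocimg(img);    % skeleton image and boundary coordinates
figure; imshow(~skel);               % display the skeleton (inverted for clarity)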

Figure 1 below shows the input image and the output image, which is its minimum
bounding box skeleton image, together with the x-y coordinates of the boundary points.
Note: the MATLAB code of the created algorithm is presented in Appendix 1 below.

Figure 1. (a) The original input image; (b) the output image.
V. Conclusion and future work
In this report, a number of tactics that may be used when dealing with
very large data files for machine learning were presented. An algorithm
for progressively loading image files was created with different steps of
MATLAB code. Most of the program is based on built-in MATLAB
functions; this report shows that using these built-in functions may save
time and reduce the complexity of the proposed program. As future work,
we aim to develop a robust, high-performance parallel cluster on the
cloud (which could also be used on a local machine for performance
enhancement).
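As a rough sketch of this future direction, local parallel preprocessing could
look like the following, assuming the Parallel Computing Toolbox is available and
using a hypothetical images/ folder; a cloud cluster would use the same pattern:

% Minimal sketch of the future-work idea: preprocess images in parallel.
% The images/ folder and *.png extension are assumptions for illustration.
files = dir(fullfile('images', '*.png'));
skels = cell(1, numel(files));
parfor k = 1:numel(files)
    img = imread(fullfile('images', files(k).name));
    if size(img, 3) == 3
        img = rgb2gray(img);     % preprocimg expects a grayscale image
    end
    skels{k} = preprocimg(img);  % keep the skeleton output per image
end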
References
[1] Shu N.C., Housel B.C., Taylor R.W., Ghosh S.P., and Lum V.Y., "EXPRESS: a
data extraction, processing, and restructuring system", ACM Trans. Database Syst.,
Vol. 2(2), pp. 134–174, 1977.
[2] Roth M.T. and Schwarz P.M., "Don't scrap it, wrap it! A wrapper architecture for
legacy data sources", In Proc. 23rd Int. Conf. on Very Large Data Bases, pp. 266–275,
1997.
[3] Labio W., Wiener J.L., Garcia-Molina H., and Gorelik V., "Efficient resumption
of interrupted warehouse loads", In Proc. ACM SIGMOD Int. Conf. on
Management of Data, pp. 46–57, 2000.
[4] Denney M.J., Long D.M., Armistead M.G., Anderson J.L., and Conway B.N.,
"Validating the extract, transform, load process used to populate a large clinical
research database", International Journal of Medical Informatics, Vol. 94, pp. 271–274, 2016.
[5] Lavoué G., Chevalier L., and Dupont F., "Streaming Compressed 3D Data on the
Web using JavaScript and WebGL", In Proc. 18th Int. Conf. on 3D Web Technology,
pp. 1–9, 2013.
Appendix 1
function [bounded_skel_image, boundary_x, boundary_y] = preprocimg(input_image)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This function takes a gray scale image and return its minimum
% bounded box skeleton image and the x-y coordinates of boundary points
% here boundary_x and boundary_y are coordinates with reference to lower
% left hand side of image, i.e. analogous to geometric (cartesian)
% coordinate system
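% Note (assumption): the input is expected to be a uint8 grayscale image in
% the 0-255 range, with a dark object on a bright background; the inversion
% below makes the object bright before binarization.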
% size_image=size(input_image);
input_image_inv = 255 - input_image;        % invert the image
input_image_bin1 = im2bw(input_image_inv);  % binarize with the default threshold
% to compute the bounding box
[label_input_image_bin1, num] = bwlabel(input_image_bin1, 8);
Iprops = regionprops(label_input_image_bin1);
Ibox = [Iprops.BoundingBox];
Ibox = reshape(Ibox,[4 num]);
% to extract bounded image
% select the connected component with the largest bounding box
[~, ind] = sort(Ibox(3:4,:), 2, 'descend');
Iboxfinal = Ibox(:, ind(1));
bounded_image = input_image_bin1(round(Iboxfinal(2)):round(Iboxfinal(2))+Iboxfinal(4)-1, ...
    round(Iboxfinal(1)):round(Iboxfinal(1))+Iboxfinal(3)-1);
bounded_image_scale = bwmorph(bounded_image, 'dilate', 2); % thicken strokes by dilating twice
figure,imshow(bounded_image_scale) % FIGURE 2
% % create an edge image
%input_image_bin2 = edge(bounded_image_scale);
%figure; imshow(input_image_bin2,[]);
% create a skeleton image
skel_image = bwmorph(bounded_image_scale,'thin',Inf);
% figure;imshow(~skel_image,[]) % FIGURE 3
% compute the bounding box of a skeleton image
[label_input_image_bin1, num] = bwlabel(skel_image, 8);
Iprops = regionprops(label_input_image_bin1);
Ibox = [Iprops.BoundingBox];
Ibox = reshape(Ibox,[4 num]);
% to extract bounded image
[~, ind] = sort(Ibox(3:4,:), 2, 'descend');
Iboxfinal = Ibox(:, ind(1));
bounded_skel_image = skel_image(round(Iboxfinal(2)):round(Iboxfinal(2))+Iboxfinal(4)-1, ...
    round(Iboxfinal(1)):round(Iboxfinal(1))+Iboxfinal(3)-1);
%figure,imshow(~bounded_skel_image)
[B,L]=bwboundaries(bounded_skel_image);
boundary=B{1};
boundary=unique(boundary,'rows');
boundary_x=boundary(:,2);
boundary_y=size(L,1)-boundary(:,1)+1;
%%% alternate finding of boundary pixels
%%% here col_edge is x coordinate and row_edge is y
[row_edge, col_edge] = find(bounded_skel_image == 1);
hold on, plot(boundary_x, boundary(:,1), 'r.'); hold off % overlay boundary pixels (image row coordinates)
