
PROCESSING & ANALYSING LARGE & COMPLEX DATA STREAMS USING BIG DATA

ABSTRACT

The emergence of large datasets has made efficient data processing a much more difficult task for traditional methodologies. Invariably, datasets continue to grow rapidly in size over time. The purpose of this research is to give an overview of some of the tools and techniques that can be utilized to manage and analyze large datasets. We propose a faster way to catalogue and retrieve data by creating a directory file – more specifically, an improved method that allows a file to be retrieved based on its time and date. This method eliminates the need to search the entire content of files and reduces the time it takes to locate the selected data. We also implement a nearest search algorithm for the event in which the searched query is not found. The algorithm sorts through the data to find the points closest to the searched query. We also offer an efficient data reduction method that condenses the amount of data. The algorithm enables users to store the desired amount of data in a file and decrease the time in which observations are retrieved for processing. This is achieved by using a reduced standard deviation range to minimize the original data, keeping the dataset to a significantly smaller size.


ACKNOWLEDGEMENT

First and foremost, I would like to acknowledge the efforts of those who have dedicated their
time to assist me with this proposal. I would like to express my gratitude to my advisor, Dr.
Claude Turner, for his valuable guidance and support during my work on this proposal. Much
appreciation also goes to Dr. Irving Linares for his help in providing excellent ideas on search
and retrieval methods. I would also like to thank Dr. Lethia Jackson and Dr. Jie Yan for their
helpful comments on the structure of my thesis. Likewise, I would like to recognize Dr. Bo
Yang for his encouragement and assistance during the development of my proposal. I would
also like to thank Dr. Hoda El-Sayed, the director of the Doctor of Science Program, for her
assistance during my work on this proposal.

Furthermore, I would like to thank the administrators of Bowie State University and the
Department of Computer Science for providing a great environment for their doctoral students. The resources offered at Bowie State University assisted me in completing
this project. A special thanks goes to Dr. Cosmas U. Nwokeafor, Dean of the Graduate
School, for his guidance during this process.

Moreover, an honorable mention goes to my family and friends for their understanding and
support. Finally, my deepest gratitude is to my dear wife, Roughiyeh A. Teymourlouei, for
her love and full support during my studies.


CONTENTS

Chapter 1. Introduction

1.1 Overview

1.2 Background of the Problem

1.3 Statement of the Problem

1.4 Purpose of the Study

1.5 Research Hypothesis

1.6 Research Questions

Chapter 2. Literature Review

2.1 Data Processing Levels

2.2 What is Raw Data?

2.3 Data Processing

2.4 Data Retrieval

2.5 Retrieval Methods

Chapter 3. Methodology

3.1 Creating a Directory File


Chapter 4. Experiments, Results and Discussion

Chapter 5. Conclusions and Recommendations

5.1 Conclusions

5.2 Recommendations and Future Work

References


CHAPTER 1

INTRODUCTION

1.1 Overview

Data search and retrieval methods allow researchers and scientists to process terabytes or
petabytes of data, perform sophisticated analysis, and quickly generate intelligent results.
These methods also provide a systematic way for the scientific community to find data that is
of interest. In addition, they allow the transformation, organization, and presentation of data
in a prescribed manner. Data presented in this way can be viewed from a variety of
perspectives by selecting the appropriate criteria.

Invariably, data has played a central role in scientific research, and scientists use data for a variety of reasons. Researchers must be able to retrieve data in order to confirm their hypotheses through empirical analysis. For instance, researchers in various fields of computing often access large data archives when studying DNA sequences that require full-text search and identification of similarities, scores, etc. These scientists also utilize data obtained from DNA sequences to study the different groups of plants and their relationships.

1.2 Background of the Problem

Data has taken the form of continuous data streams rather than finite stored data sets, posing barriers to users who wish to obtain results at a preferred time. Data presented in this manner has no bounds or limitations; thus, a delay in the retrieval of data can be expected. For this reason, a search for the selected data in vast amounts of unsorted data is a time-consuming process. Furthermore, the size of the data itself becomes part of the problem.

Generally, three main issues are involved in the data retrieval process. The first is deciding which type of information is worth retrieving. In essence, the application must be able to differentiate between relevant data and data that are not essential to the user. The frequency of a term is not enough to infer the quality of the file that contains it; instead, the best way to accomplish this goal is to pull information that follows the user's preference. The second issue concerns the methods that will be used to acquire the data. When deciding on the methods one can use to retrieve data, time must be considered. Indeed, effective data retrieval methods are in demand because scientists need algorithms that are sensitive to time; the relevance of data often rests on accessing it in time. The third issue lies in the methods
that will be used to analyze the data. Data must be analyzed fairly quickly in order to provide
users with the ability to pull data in a reasonable amount of time.

Currently, search engines are offered as a way to retrieve information. However, search engines also face a number of difficult issues in maintaining or enhancing the quality of their performance. One of the most prevalent problems is spamming. As a result of this interference, the quality of the rankings suffers severely.

The spam problem is largely absent in a local or restricted communications network, such as the web that exists within a corporation. Content quality is another barrier to good performance: even if spam did not exist, there would still be many troubling issues concerning the quality of the content on the web, which is full of noisy, low-quality, and unreliable material. Quality evaluation is yet another challenge that web search engines face. It is problematic because users do not provide explicit feedback; instead, implicit feedback is inferred from the results on which the user clicked during his or her search. Exploiting implicit feedback to evaluate different ranking strategies is not efficient.

Data interferences and/or information retrieval barriers restrict users from obtaining accurate and reliable results, and the essence of data retrieval is to acquire reliable results. For instance, accurate results assist a scientist in identifying anomalies; if the algorithm fails to provide accurate results, anomalies will not be identified effectively.

1.3 Statement of the Problem

Because of the escalation in the complexity of scientific data and its growing volume,
efficient management of such data has become increasingly important. Indeed, a
comprehensive end-to-end approach to data management needs to be undertaken in order to
generate, manage, and analyze this information effectively. This method would encompass all
the steps of data processing--from the initial gathering of the data to its final analysis. An
article titled “As We May Think,” by Vannevar Bush established the main ideas of automatic
access to large amounts of stored data files (Rouff, 2003). As the importance of efficient
search and retrieval algorithms continues to grow, especially at NASA and NOAA, search
and retrieval analysis will take on an increasingly important role. The following are examples
of some of the domains where these techniques play a significant role:

• NASA's space agent – studying evolution, etc.;
• NOAA – used to detect weather claims;
• Libraries – used to organize books;
• File indexing – automatic file identification in multimedia databases.

Currently, the development of retrieval and search algorithms has not kept pace with data-
gathering capabilities. Thus, systems are being overloaded with vast amounts of data (Rouff,
Data continually streams from spacecraft on Earth and in space faster than it can
be stored, organized, and interpreted. In addition, it is very expensive to transfer just one bit
of data down from a spacecraft. In essence, more effective/improved algorithms are crucial to
collect what is most important. In order to access data rapidly and select the relevant content,
search and retrieval algorithms must keep pace with the constant expansion of data gathering.
Without an ability to effectively process such vast amounts of data, a backlog of data will
accumulate and hinder efficiency. One of the primary requirements of these types of
algorithms is efficiency. For example, NASA launches satellites that help study the Earth’s
ecosystems (Tsang & Jackson, 2010). The data-gathering capability of the Earth Observing
System (EOS) is such that about a terabyte of data per day can be generated (Esfandiari,
Ramapriyan, Behnke & Sofinowski, 2007). Scientists must then analyze the generated data
fairly quickly to ensure its integrity by identifying anomalies (e.g. calibration problems with
instruments). Another example of space anomalies is temperature variances. For instance,
scientists sometimes require global temperature readings to view the changes that transpired
and to effectively identify temperature anomalies. However, such tasks cannot be executed
successfully without an advanced algorithm that is able to retrieve data within a reasonable amount of time. Data anomalies are quite common, and they are often an outcome of data inconsistency that results from data redundancy. Cosmic anomalies are another type of
abnormality that scientists combat. A cosmic anomaly is a hidden site in space, and it can be
found using a ship's on-board scanner. Nevertheless, this goal must be completed in a
reasonable amount of time to ensure its accuracy.

Data management systems for remote sensing organize satellite data and information about
those data. This grants scientists the ability to determine what data are available for their
particular research requirements quickly (Goldberg and Smith, 1987).

However, remote sensing data management systems that are present today do not support a
wide enough variety of sensors to satisfy the needs of scientists. The earliest of these systems
was set up only to manage data archives for particular satellites or sets of sensors (Goldberg
and Smith, 1987). Recent systems contain information about the atmosphere, land surface,
oceans, etc. (Smith, 2012). Data management problems encountered by most scientific
domains are common and can be addressed through shared technology solutions. Our review
of the literature reveals that the following two major requirements can effectively address
data management problems (Beebe, 2010):


• An increase in the processing of raw data and an improvement in the efficiency of access to storage systems are needed. In particular, improvements to parallel file systems are needed for reading large volumes of data without delaying the retrieval and analysis of the data.
• Scientists must have access to complex data search and retrieval algorithms to better
understand large, complex data sets.

In the case of spacecraft data, each spacecraft (e.g., satellites in the example provided by the
EOS) separately generates many large data files from its many instruments, and each scientist
must sort through the data and recognize any irregularities or anomalies. One “sorting
method” is to index multiple data files to address the search and retrieval issues. The indexing
of data at high speed for query and retrieval is achieved using a streaming index based on
signature files (Faloutsos & Christodoulakis, 1984).

Another proposed “sorting method” is to search through all of the file names to locate the
data of interest; however, this cannot apply to all spacecraft data.

Typically, there are two major concerns involved with analyzing massive data sets: (i) the
period of time required to search through data files and (ii) how to effectively identify
relationships between data. Analyzing massive data files requires knowledge of what the
scientist is looking for (i.e., what is considered to be an anomaly). Therefore, the algorithms
must be able to search through and efficiently manipulate massive datasets.

1.4 Purpose of the Study

The primary goal of this dissertation is to propose effective search and retrieval methods for sorting voluminous data. This study proposes a novel approach to using file header
information that is entity specific (e.g., to spacecraft, instrument, or experiment) to improve
the efficiency in the processing and identification of anomalies in large spacecraft data. More
specifically, to help address the two requirements and the two concerns mentioned in the
previous section, the goal of this project is to develop algorithms that effectively identify
anomalies in large, complex spacecraft data. We will accomplish this goal through the
following objectives:

• Develop algorithms that effectively translate the decomposition of data so that anomalies can be more rapidly and efficiently identified. Regions of rapid change could then be more quickly and easily recognized and presented to the user, along with a relationship of parameters and the generation of a script to drive a visualization program such as the Flow Analysis Software Toolkit (FAST) or IBM's Data Explorer (Brodlie, Duce, Gallop, Sagar, Walton & Wood, 2004).

• Adapt multi-level data processing algorithms to remove unnecessary information from spacecraft data through the exploitation of file header parameters (e.g., time block start and end time).

Since data is very large and highly complex, scientists are not able to access the desired data
in a timely manner. Without such ability, scientists will encounter problems when assembling
their research. In 2004, the International Space Station (ISS) completely lost power, which
placed many people in imminent danger. In 2011, the ISS faced the same issue; to approach the problem, scientists had to refer back to the log files from 2004-2011 to identify the anomaly. This involves massive data sets that scientists must search through in order to detect the glitch. In addition, the data has to be retrieved fairly quickly to resolve the problem.
If scientists do not find the data of interest in a timely fashion, there is a possibility that the
spacecraft will crash and endanger many people. Therefore, the algorithms must effectively
translate the decomposition of data rapidly. Rapid decomposition of data will initiate a better
identification of anomalies because with the ability to decompose data fairly quickly, we are
then able to process a complex problem or system and break it down into parts that are easier
to understand, program, and maintain. Multi-level data processing algorithms also distribute a
complex system into manageable portions.

1.5 Research Hypothesis

To meet the first research objective, the proposed approach would be domain independent so
that any dataset can be analyzed. Correlation computations could then be more easily
processed and areas of interest more quickly displayed without any prior information or
narrowing of the data. The suggested technique allows improvements in the accessibility and
the ability to search files and directories, reducing the time a researcher spends finding data.

The solution to address the second research objective encompasses the use of file headers that
contain only essential and instrument-specific information. This would allow a search through files to be done more quickly. It would also provide scientists with the ability to evaluate
files for matches to the searched query without searching the entire file itself by having data
that is already stripped down to bare essentials to generate useful search results. This would
involve summarizing the relevant data in the header, reducing it so that it is more accessible
and searchable.

We will also consider how multi-level data processing methods may be exploited as a way to
process files. This method allows us to process files by removing unnecessary information.
This requires the addition of header information (e.g., time block start and end time). In
general, there are two classes of data. The first type of data is identified as housekeeping data.
Housekeeping data provides information about a satellite. For instance, it reveals “its
temperature, the functionality of the parts, and similar status information" (http://imagine.gsfc.nasa.gov/docs/sats_n_data/sat_to_grnd.html).

Housekeeping data is used by the ground crew to ensure that every part of the satellite is
functioning properly. Scientists also utilize housekeeping data to analyze the performance of
instruments during the gathering of their science data. The second type of data is science
data. Instruments have their own data termed as science data. Science data consists of images,
spectra, count rates, and other measurements. Multi-level data processing allows the removal
of unnecessary data in data types such as housekeeping and science data. Each one of these
data blocks has its own specification. Therefore, housekeeping and/or science data can be accessed easily given each file's start and end time.

1.6 Research Questions

The main goal of this dissertation is to develop an effective, well-organized set of search and
retrieval algorithms that can be used to locate data files in massive amounts of scientific data
(big data), and lead to improvements over existing algorithms.

To address this problem, three main issues need to be addressed:

1. How to effectively find the selected data files in vast amounts of unsorted data

2. How to process incomplete data sets in order to locate the desired data files

3. How to efficiently reduce vast amounts of scientific data to a significantly smaller size


CHAPTER 2

LITERATURE REVIEW

Before discussing the problems associated with storing and retrieving data, it is useful to
review how data is generated and processed. What does this involve?

2.1 Data Processing Levels

Large datasets usually refer to voluminous data beyond the capabilities of the current
database technology. The term data is typically described as information. Data is used to refer
to vast amounts of information in a standardized format. Generally, data include numbers,
letters, equations, images, dates, and other materials. Data is first gathered and then
processed. Particularly, data processing is a distinctive step in the information processing
cycle. In information processing, “data is acquired, entered, validated and processed, stored
and outputted, either in response to queries or in the form of routine reports. Data processing
refers to the act of recording or otherwise handling one or more sets of data” (Leung, 2009).
Nonetheless, large datasets are transforming the way research is carried out, creating demand for fast-paced algorithms.

Image data obtained from satellites, for instance, can be confusing. Terminology such as "Raw," "Level 1B," and "Path-Oriented Plus" can be bewildering. Despite attempts at standardization by most of the image distribution agencies around the world, there is still some confusion and lack of consensus. Some clarification is necessary in order to have a meaningful discussion about such data (Shoshani, 2009). These terms describe the amount of processing applied to imagery between the time it is received and the time it is loaded into the computer. Most remotely sensed data requires similar steps of basic pre-processing before it can be used. Image distribution agencies use a common set of "levels" to describe the types of processing work done to their images.

Only after this processing is done are the images ready to be passed on to scientists for utilization. The descriptions in Table 2.1 show that the processing levels are hierarchical; Level 2 data starts with the processing included in Level 1 imagery and adds more features (Shoshani, 2009). This processing is then recursively applied to each successive data level, as shown in Table 2.1 and further explained in Figure 2.3.


Table 2.1: Standard Satellite Image Processing Levels (Remote Sensing Product Levels)

Level 1 data processing operates on the raw data (which comes from the sensors attached to the instrument). In other words, Level 1 processing converts the raw sensor outputs to scientific units, calculates any additional oceanographic parameters of interest, and reduces the data set to a tractable size (Johnston, 2006). Given this process, it is always best to archive raw data, because once data has been processed, the processing cannot be reversed. On the other hand, Level 2 data conversion "starts with the raw data (i.e., .dat or .hex) file. It takes the information contained in the configuration (.con or .xmlcon) file and converts it to scientific units" (Johnston, 2006).

The other data levels, as shown in Table 2.1, are further explained in Figure 2.3. Data has to
be interpreted in a way in which information makes sense to the reader and, therefore, these
successive steps are used to translate such data.

2.2 What is Raw Data?

Raw data is the unprocessed/unorganized source data that has not been processed to be
displayed in any sort of presentable form. The raw form may look disorganized and
meaningless without any processing procedures. The information entered into a database is
referred to as raw data. Data is considered raw if it has not been processed by the computer in any way. Typically, raw data can be anything from a series of numbers, to the way those numbers are sequenced, to even the way they are spaced, yet it can yield very important information. A computer interprets this information in a way that attempts to make sense to the reader. Binary code is a good example of raw data; it can be very confusing and problematic for the user to interpret. See Figure 2.1 for additional details.

Figure 2.1: Example Of Data In Binary Codes


2.3 Data Processing

Data has to be effectively processed in order to convert raw data into meaningful information.
See Figure 2.2 below for details.

Figure 2.2: Data Management

Converting raw data into an easily usable form involves a great deal of data processing (Member, Ortega & Shen, 2010). Computers conduct this processing, accepting the raw data as input and providing information as output. The systems that perform this task are a vital component of satellite operations.

Step 1 - Raw telemetry and level 1 data

Raw telemetry is downlinked to NASA ground stations (Facts, 2002). It is then forwarded to
the quality control and processing centers. Next, the telemetry is processed to obtain level 1
data. This is the data that is timed and located, expressed in the appropriate units, and
checked for quality.

Step 2 - Level 1 data and level 2 geophysical data

Level 1 data are modified to eliminate instrument errors and errors resulting from
atmospheric signal propagation and perturbations caused by surface reflection.

Step 3 - Data validation and qualification

Data validation requires the monitoring of instrument flow and precise quality controls.
Before the final product is delivered to the end user, it has to be checked.

Step 4 - Level 3 and level 4 data

Level 3 data are validated along-track data (off-record data are edited); further computations are performed on the Level 2 geophysical data. There may be cross-calibration between missions. Level 4 is multi-satellite (cross-calibrated) gridded data.

Figure 2.3 shows how data is collected, processed, and separated by the instrument from
spacecraft.


Figure 2.3: Data Generation And Processing

Once data has been generated, processed, and stored, it can then be made available in a more
useful form for scientists’ research.

Phase I: Access data is removed from data directly received from the spacecraft through the
spacecraft's zero level processor (this is the command processor that the spacecraft relies on
when the main processor is offline).

Phase II: Data is separated according to instrument and header information. When data is
uploaded from the instrument, the header for internally recorded data is written. The header
tells the program how many bytes of data should be read for each variable that is defined in
the header information (e.g., float, double, integer, etc.).


Phase III: A different level of data processing is applied to the data file before the data is
analyzed. This data processing allows the files to be processed for the removal of
unnecessary information.

This series of steps makes up the complete data processing activity that validates data for researchers' purposes.

2.4 Data Retrieval

Retrieved data must be ranked according to the relevance of terms or queried parameters. In
the beginning, ranking was done according to how search terms were related to each file or
candidate (Haveliwala, 2003). Currently, we are faced with the problem of how to access and
manage data. Tackling such a challenge will require the processing of vast amounts of data
effectively.

The ability to gather data has outstripped our ability to use it. For example, data accessibility
has been affected by data storage. There were appreciably fewer data storage devices and
fewer data storage centers in the past than there are now. As a result, there is not only an
increase in the overall stored data, but an associated and unnecessary increase in the capacity
and the number of computers and facilities where it is being stored. In other words, since the
amounts of data to be stored by today’s applications can easily outmatch the capabilities of
single computers, distributed storage systems provide a means to distribute the burden of
storing and retrieving this data onto multiple different computers (Yianilos, 2001; Tolksdorf
& Walther, 2011). However, this further complicates data search and retrieval because data
retrieval needs to keep pace with these developments. Improvements in computing need to
keep pace with these developments as well. In addition, high performance grid computing
and improvements in instrumentation also improve and accelerate data gathering.
Furthermore, the amassed data adds to the burden of storing or accessing data.

Thanks to computation, scientists are now able to not only read each other’s papers, but
access each other’s raw data. This makes it easier to reproduce and further advance others’
research. As a result, large amounts of data are not only being received and processed
through instruments, but are also being generated and accessed as the outcome of research.
The ability to access and search through such data effectively is again important for a
scientist in order to make the best use of it. At the level of collaboration and sharing data, raw
data must be accessible for scientists before it is processed through a bottleneck and delivered
in streamlined fashion through an article or paper. Thus, access to raw unprocessed data can
be useful; however, there still needs to be a way to easily access and retrieve it (Witt,
2009).


Another level of complexity exists with data gathering. Different scientific fields gather and
store different types of data. For instance, “astronomers are producing Flexible Image
Transport System (FITS) files and VOTable compliant XML” (Williams, 2002).
“Geoscientists are producing Arc/Info SHP files and GeoTIFF images. Bio-scientists are
producing huge genomic databases and associated EST (expressed sequence tags), GSS
(genome survey sequence) and HTGS (high throughput genomic sequence) files. Medical
scientists are generating vast sets of DICOM images; weather researchers are generating
HDF5 datasets" (Choudhury & Hunter, 2004). Across these various files, file formats are also subject to change. In the world of data storage, formats become obsolete, and legacy files need to be converted to newer formats. The preservation of metadata is in itself an issue due to the constant changes in how data is being stored. Attempts have been made to address these issues involving the storage and access of data (Choudhury & Hunter, 2005). Others have concentrated on improving storage systems with parallel file systems, on the packaging of data to make it more obtainable, and on the automation of storing processed data (Shoshani, 2009).

Another approach to addressing these issues has been taken at the access level, where attempts have been made to improve data access itself. An XML-based Distributed Metadata Server (DIMES) has been conceived (Deng, Kafatos, Wang & Wang, Yang, 2001). This broke the data search down into three levels: data transmission, metadata, and data interoperability. Data transmission provides data in a general form; metadata allows users to sift through it for some subset or format and to integrate those results with other data; data interoperability allows the user to extract the data in great detail. This is complemented by a "nearest-neighbor" search mechanism that exploits XML (Doty, Kafatos, & Kinter, 2003).

Yet another solution has been proposed to manage and access multiple file types: the File Object Method (FOM). This solution uses the file content rather than the extension for associating a file with a particular application. Files appear as objects, which helps to ensure that the correct methods are executed on the correct file format. The FOM organizes data by associating related code with files; using this framework, the data can be retrieved and manipulated. The FOM solution also helps to save resources by storing calculations and keeping them available as a reference for scientists. The caching of method results helps to speed up calculations that invoke the same parameters through the use of the cached results. Also, the FOM solution helps users to preserve the original directory structure. However, users are unable to use it for searching through file content, e.g. using it to locate all .txt files with a specific phrase (Shoshani, 2009).

Furthermore, the rapid progression in technology has allowed space agencies like NASA to establish more satellites, instruments, and monitoring systems. Therefore, there is more instrumentation on those
satellites. In addition, the instrumentation itself is more accurate and is able to deliver more
data. This advancement has created a huge amount of data. Data must then be stored and
processed. The processing of data is important despite the sheer amount of it because the
scientist still needs to be able to quickly and efficiently search through the data in a timely
fashion (Broder, Dean, Deshmukh, Do, Henzinger & Sarawagi, 2000; Authoria & James,
2005). As a result, timeliness is critical because the relevance and value of data may depend
on accessing it on time. Currently, the development of methods for managing and accessing stored data has lagged behind the actual collection of data, possibly creating a backlog of
data that may never be looked at unless the scientist has a system that allows him or her to
retrieve it and to quickly look at or be notified about relevant data. Since data is only useful
when it is analyzed and interpreted, simply collecting it without having a system to efficiently
store and retrieve it is not useful at all.

2.5 Retrieval Methods

This section analyzes several existing retrieval methods. The strategies reviewed here have simplified the ways of retrieving information. Three types of retrieval methods will be discussed: web,
multimedia information, and structured text retrieval. This section provides an overview of
previous work to show some of the methods used in the past to retrieve information.


CHAPTER 3

METHODOLOGY

The motivation behind the proposed algorithm is to present the user with the ability to search for specific data in a timely fashion. This method is efficient for obtaining only a selected partition of the scientific data. In other words, after data has been reduced to selected files, the algorithm carries out a search on these files. Figure 3.1 illustrates the general approach. This algorithm is also vital for other types of searches, such as searches performed on DNA or protein sequences.

Figure 3.1: General Approach To Proposed Algorithm

The goal is to extract the most relevant information from a potentially overwhelming quantity
of data to facilitate the underlying data analysis. Figure 3.1 reveals data in its original state
and the portion of data that has been extracted. The proposed solution is an approach that
would only involve vital and instrument-specific information, which could be used to
significantly reduce the burden on the current search and retrieval systems. Searches through
files can be done more quickly and efficiently if they involve the ability to access
summarized data of every file without having to directly search its entire contents.

Suppose one is searching for particular information through multiple files where each
particular item is stored as a separate file. After receiving feedback from reviewers, one
wishes to inspect all occurrences of that specific item as it appears in multiple files. If there is
a directory file that points to the file index with its start and end time, the information could be found more easily. A faster way to catalogue and retrieve data is to create a directory file for all of the data files in the archive center. A file directory is a location in a computer where files are stored. The directory file would allow retrieval of a file based on its time and date. This simplifies the data retrieval process and reduces the amount of time it takes to locate the selected file. Figure 3.2 below shows the proper format of directory files. Each data file consists of a start time and an end time.

Figure 3.2: Structures Of Directory Files

3.1 Creating a Directory File

The algorithm called Direct will create the directory file, add header information, and place
the header at the beginning of each data file. The header for internally recorded data is
written when the data is uploaded from the instrument. Table 3.1 shows a sample of the directory file.

Table 3.1: Sample Of A Directory File

The steps below present the procedures that the program will perform.

Steps to processing data:

➢ Collect raw data;
➢ Remove unnecessary data;
➢ Separate each instrument's data;
➢ Process the data from level 1 to level 4, based on the scientist's algorithm for how the data should be processed and the instrumentation or specification of the scientist's data;
➢ Retain information about the variables, such as the specific type of spacecraft or instrumentation.

This series of steps forms a complete data processing activity, ensuring data access. Searches through files are done more quickly and effectively. This method will provide scientists with an improved capability to find and access data in the file.

How To Effectively Develop A Directory File That Catalogues Data:

➢ Step 1: Read the data file to locate the start and end time of the file. Then, count the chapter number of each file.
➢ Step 2: Increment the file index and chapter number and write the start and end time.
➢ Step 3: Continue this process for all of the selected data files (a minimal sketch of this procedure follows).
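A minimal Python sketch of the cataloguing steps above. It assumes, purely for illustration, that each data file begins with a header line whose first two fields are its start and end times, and it writes the directory as comma-separated rows (file index, file name, start time, end time); it is not the Direct program itself.

    import csv
    from pathlib import Path

    def build_directory_file(data_dir, directory_path, pattern="*.dat"):
        """Scan every data file, read its start and end times from the assumed
        header line, and write one directory row per file (Steps 1-3 above)."""
        with open(directory_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["file_index", "file_name", "start_time", "end_time"])
            for index, path in enumerate(sorted(Path(data_dir).glob(pattern)), start=1):
                with open(path) as data_file:
                    header = data_file.readline().split()  # assumed: "start end ..." header
                writer.writerow([index, path.name, header[0], header[1]])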

Table 3.2 shows how the directory file is structured.

Table 3.2: Structure Of A Directory File

The directory file points to each data file with specific start and end times, which help to
locate a particular data file. See Figure 3.3 below for additional details.


Figure 3.3: Diagram Of A Directory File

Table 3.3 below shows a sample of a directory file.


Table 3.3: Sample Of A Directory File


More specifically, this study proposes the development of an algorithm that generates a
directory file that contains the following information:

• File index;
• File name;
• Start and end time for every file;
• Start address and end address (position of the file).

This will provide a tool for rapidly accessing data within these parameters. A pseudo-code for
searching and retrieving file time and file index is provided below for clarification of the
proposed algorithm:

Algorithm 1: Proposed Search Algorithm
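The pseudocode listing itself is not reproduced in this copy of the document. The sketch below illustrates the described behaviour under two assumptions: the directory file has the comma-separated layout sketched earlier, and times are stored as ISO-8601 strings so that plain string comparison orders them correctly. It is an illustration, not the author's exact listing.

    import csv

    def load_directory(directory_path):
        """Read the directory file into a list of row dictionaries."""
        with open(directory_path, newline="") as f:
            return list(csv.DictReader(f))

    def find_file_by_time(rows, query_time):
        """Return the entry whose [start_time, end_time] interval contains the
        queried time; None means the nearest-search step (Algorithm 2) applies."""
        for row in rows:
            if row["start_time"] <= query_time <= row["end_time"]:
                return row
        return None

For example, find_file_by_time(load_directory("directory.csv"), "1999-09-01T12:00:00") would return the entry for the file covering that time, if any (the file name here is hypothetical).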



Basically, the proposed technique means that the selected file can be located for a preferred time. The user can simply enter the desired time, and the file will be easily located. However, in the event that the entered time is not found, we then apply the nearest search algorithm to locate the closest points in the original data that relate to the searched query.

The nearest search algorithm is widely recognized, and its methods have been used extensively in computing, for instance in algorithms for information retrieval (Rijsbergen, 1979; Salton, 1989; Buckley, Mitra, Salton, Singhal, 1995; Deerwester, Dumais, Furnas, Harshman, Landauer, 1990), pattern recognition (Cover, Hart, 1967; Duda, Hart, 1973), statistics and data analysis (National Research Council, 1988; Devroye, Wagner, 1982), data compression (Gersho, Gray, 1991), and multimedia databases (Pentland, Picard, Sclaroff, 1994; Ashley, Dom, Flickner, Gorkani, Hafner, Huang, Lee, Niblack, Petkovic, Sawhney, Steele, Yanker, 1995; Jain, Smeulders, 1996). The nearest search algorithm has played a major role in the examples presented, assisting researchers in constructing suitable retrieval techniques. Nearest neighbor searching is a relevant problem in a variety of applications.

Here, however, the algorithm is used to estimate the difference between the values implied by the user and the true values. It manages a given set of data points in a metric space. The task of this algorithm is to preprocess these points so that any given query point can be answered quickly. The points of interest can be specified as a matrix of points. The way we choose to measure distances can drastically affect the accuracy of the system; therefore, effective and reliable measurements are important. For this reason, the nearest search algorithm is well suited to such a sensitive case. It implies that we have a way to measure distances between the query and the database.

This algorithm is effective in reporting back to the user a value that is close to the desired time. However, if the searched time falls outside the boundary of the preceding file, then the algorithm reports the next file back to the user. Pseudocode for a fast nearest neighbor search is provided below.


Algorithm 2: Fast Nearest Search Algorithm
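As with Algorithm 1, the listing itself is not reproduced here. The sketch below shows one way to fall back to the nearest directory entry when no file covers the queried time; the time-parsing helper and the distance rule are assumptions made for the example, not details taken from this work.

    from datetime import datetime

    def to_seconds(time_string, fmt="%Y-%m-%dT%H:%M:%S"):
        """Convert an assumed ISO-like time string to seconds since the epoch."""
        return datetime.strptime(time_string, fmt).timestamp()

    def nearest_file(rows, query_seconds):
        """Return the directory entry whose start or end time is closest to the
        queried time, measured in seconds."""
        def distance(row):
            start = to_seconds(row["start_time"])
            end = to_seconds(row["end_time"])
            return min(abs(query_seconds - start), abs(query_seconds - end))
        return min(rows, key=distance)

A caller would typically run find_file_by_time first and fall back to nearest_file only when it returns None.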

The nearest neighbor search algorithm operates at the proximity level. It is used to discover proximate relationships between data sets. It sorts through data lying in a huge (possibly infinite) metric space to determine where two or more separately matching data points are within a specified distance. The application of the nearest search algorithm ensures that the selected data is within close proximity to the searched query. The proposed methods aim to retrieve the information that the user requests more easily and effectively. Large amounts of data, such as the data from some spacecraft-carried instruments, have to be reduced and presented in understandable quantities. Upon obtaining spacecraft data, there is a series of steps that must be performed. A highly organized program is desired for capturing and playing back information associated with the data collected. Data obtained from spacecraft arrives in binary form and, in that form, cannot be read or interpreted directly. Therefore, to capture data, a program is employed that converts the actual binary codes into the HEX or ASCII format. The program called Direct adds header information and places the header at the beginning of each data file, which further assists in providing data in understandable quantities. We assume that the data files are in the HEX or ASCII format. If they are not, the data has to be converted into one of these forms. Table 3.4 below shows a data block in the HEX format.


Table 3.4: Example Of A Data Block In Hex Format
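The table contents are not reproduced in this copy. As a small, self-contained illustration of the binary-to-HEX conversion described above (this is not the Direct program itself, and the dump layout is arbitrary):

    def to_hex_dump(raw_bytes, width=16):
        """Render a raw binary block as hexadecimal text, width bytes per line,
        with the byte offset at the start of each line."""
        lines = []
        for offset in range(0, len(raw_bytes), width):
            chunk = raw_bytes[offset:offset + width]
            hex_part = " ".join(f"{b:02X}" for b in chunk)
            lines.append(f"{offset:08X}  {hex_part}")
        return "\n".join(lines)

    # Example: to_hex_dump(b"\x00\x1f\xa3") returns "00000000  00 1F A3".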

We proceed by strategizing how to effectively find the selected data files in vast amounts of unsorted data. Unsorted data is highly unorganized and difficult to interpret. Such data does not have additional header information, which makes it problematic for those trying to decipher the data. In addition, access data has not been removed, preventing researchers from being able to analyze the data effectively. For an efficient search through unsorted data, the addition of header information is necessary in order to organize such data. See Table 3.5 below for an example of unsorted data.


Table 3.5: Example Of Unsorted Data

Table 3.6 shows the addition of header information.


Table 3.6: Example Of Binary Unsorted Data

After the data has been corrected, it will be displayed as shown in Table 3.7. The header information separates each data file and organizes it based on start and end times.


CHAPTER 4

EXPERIMENTS, RESULTS, AND DISCUSSION

Often, many scientists and researchers find themselves combating vast amounts of data
without an effective or a fast algorithm to process the data efficiently and in a timely manner.
However, scientists are still required to effectively acquire desired results in a reasonable
amount of time to conduct their studies properly. In this chapter, we provide examples of data
that are constantly used by scientists as well as results of the proposed algorithm.

The graph below shows the instances in which a voltage dropped below a desired threshold.
More specifically, on 09/08/31 at 22:00:00, all of the voltages are below the threshold;
however, from 09/08/31 23:00:00 to 09/09/01 02:00:00, only a few of the voltage readings are below the threshold.

Figure 4.1: Diagram Shows The Voltages That Are Below The Threshold
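The plot itself is not reproduced in this copy. As a purely illustrative sketch of the kind of threshold check the figure depicts (the readings, timestamps, and threshold below are invented, and this is not the proposed algorithm itself):

    def readings_below_threshold(readings, threshold):
        """Return the (timestamp, voltage) pairs whose voltage falls below the
        threshold, i.e. the points flagged in a plot such as Figure 4.1."""
        return [(t, v) for t, v in readings if v < threshold]

    # Example with invented values:
    readings = [("09/08/31 22:00:00", 3.1), ("09/08/31 23:00:00", 4.2),
                ("09/09/01 02:00:00", 3.9)]
    low = readings_below_threshold(readings, threshold=4.0)  # two flagged readings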


Figure 4.1 shows how the actual data is plotted and the specific voltage readings. However, detecting anomalies is often difficult without an improved algorithm to manage such data, as seen in Figure 4.1. Figure 4.2 shows the voltage measurements. Based on this graph, there is an indication that some of the measurements are below a minimum. This specific diagram reveals that voltages 14, 15, and 16 are out of family.

Figure 4.2: This Particular Data Exemplifies A Sample Of Out-Of-Family Ranges For A Typical Spacecraft

These types of anomalies can easily be located and made retrievable by using the proposed methods. Figure 4.3 is an example where such methods can be effective.


Figure 4.3 is an example of plotted data showing nearly a year's worth of data. In this plot, a problem in September 1999 can be seen. The temperature shown for this reading is out of range. However, it is unclear which data file in September 1999 has the problem. A whole year's worth of data would have to be plotted to find this anomaly. Instead, we can use the algorithm that we have proposed, which allows researchers to search the directory file by time and locate the file that has the problem.

Figure 4.3: An Expanded View Of The Data That Contains More Specific Detail Than Figure 4.2

The graph below shows the September 1, 1999 file data. It is evident that this particular file contains voluminous data, which makes it difficult for the user to locate the anomaly.


Figure 4.4: The Data From The Sept. 1, 1999 File Which Contains The Anomaly

It is understood that scientists must be able to analyze the generated data fairly quickly in order to ensure its integrity by identifying anomalies. Temperature variances are among the most common space anomalies. Most often, scientists require global temperature readings to view the changes that transpired so that they can effectively identify temperature anomalies. Identifying temperature anomalies is quite important; for instance, global warming is always measured using temperature anomalies. As another example, Figure 4.5 shows the temperature variance as the annual and five-year running mean temperature changes relative to the 1951-1980 base period.


Figure 4.5: Hemispheric Temperature Change

This temperature analysis was conducted by scientists at NASA's Goddard Institute for Space Studies (GISS). The plot shows that the average global temperature on Earth has increased by about 0.8° Celsius (1.4° Fahrenheit) since 1880. Two-thirds of the warming has occurred since 1975, at a rate of roughly 0.15-0.20°C per decade. Figure 4.6 shows the annual and five-year running mean surface air temperature in the contiguous 48 United States relative to the 1951-1980 mean.
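As a brief, illustrative sketch of how such curves are computed in general (this is not the GISS processing code; the function names and the dictionary input are assumptions), anomalies are deviations from the 1951-1980 base-period mean and the smoothed curve is a five-year running mean:

    def temperature_anomalies(yearly_means, base_start=1951, base_end=1980):
        """Anomaly for each year = that year's mean minus the base-period mean.
        'yearly_means' maps year -> mean temperature for that year."""
        base = [t for year, t in yearly_means.items() if base_start <= year <= base_end]
        baseline = sum(base) / len(base)
        return {year: t - baseline for year, t in yearly_means.items()}

    def running_mean(values, window=5):
        """Five-year running mean used to smooth the annual anomaly curve."""
        return [sum(values[i:i + window]) / window
                for i in range(len(values) - window + 1)]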


Figure 4.6: U.S. Temperature

Figure 4.6 shows temperature readings from 1880 to the present with the base period 1951-1980. The black line is the annual mean and the red line is the five-year running mean. See Figure 4.7 to view the actual plotted data of the specified anomaly. In the early 1920s, there is an indication of a low average global temperature within the 1880-2013 period. High average global temperatures have increasingly predominated, with the ratio now about two-to-one for the 48 states as a whole.


Figure 4.7: Temperature Anomaly Plot

Such tasks cannot be executed successfully without an advanced algorithm that is able to retrieve data in a significantly short amount of time. The relevance of data lies with the researcher's ability to access such results in a timely manner. Therefore, the proposed algorithm is well suited for such a case because it is fast and efficient. The retrieval of spacecraft data is crucial in determining the accuracy of the results; however, this task is complicated by the extremely large amount of available data. Nevertheless, the proposed algorithm is able to present accurate results even when operating with such complexities. Researchers now have the ability to locate the desired file from large datasets at a preferred time. The practical value of the proposed algorithm is presented through empirical results. See Figure 4.8 for additional details.


Figure 4.8: Results Of Proposed Algorithm

This algorithm was able to locate the selected file from large datasets. Essentially, keying in the desired time provides the user with the file index, its corresponding start and end times, and the file name, as shown in Figure 4.8. However, if the time is not found, then the algorithm locates a time that is within close proximity to the searched query, which is achieved through the use of a nearest search algorithm. See Figure 4.9 for additional details.


Figure 4.9: Nearest Search Results

Figure 4.9 shows that the algorithm was able to effectively locate a time that was within close proximity to the search query because the time entered was not present in the specified data. Also, if the searched time falls outside the boundary of the preceding file, the algorithm reports the next file back to the user.

Figure 4.10 further displays how the nearest search algorithm located the data points that matched the searched query; the algorithm located data points that were in close proximity to the searched query.


Figure 4.10: Selected Data Points

The results presented by the algorithm show that the selected files were found easily and
effectively. The plotted data were taken from actual spacecraft data collected during the
period 2006 through 2013. This involves vast amounts of data; nevertheless, the algorithm
easily located the selected data within the range of the specified time.

The algorithm was instructed to locate the points closest to the user's searched query within
vast amounts of data. The data points were preprocessed in advance so that the point nearest
to a given query could be reported quickly.
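
One common way to carry out such preprocessing, offered here only as an illustrative sketch
rather than the implementation used in this study, is to sort the data points once and answer
each query with a binary search; the nearest_point helper and the sample values below are
assumptions.

import bisect

def nearest_point(sorted_points, query):
    # Preprocessing: sorted_points must already be sorted in ascending order.
    # Each query is then answered with a binary search.
    i = bisect.bisect_left(sorted_points, query)
    # The nearest value is either the neighbor just below or just above the query.
    candidates = sorted_points[max(0, i - 1):i + 1]
    return min(candidates, key=lambda p: abs(p - query))

points = sorted([17.2, 3.5, 42.0, 8.9, 23.4])  # one-time preprocessing step
print(nearest_point(points, 20.0))             # -> 17.2, the closest point to 20.0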


CHAPTER 5

CONCLUSIONS AND RECOMMENDATIONS

5.1 Conclusions

The existing algorithms were carefully examined in order to identify their weaknesses. The
sole purpose was to ensure that the proposed algorithm does not share the same weaknesses.
One of the main and most common weaknesses in the majority of existing algorithms is the
length of time required to search for and retrieve the data of interest. Time is crucial to
scientists and researchers; therefore, suitable methods to reduce processing time are in high
demand. Hence, when compared with the existing algorithms, the proposed algorithm has
proven to be the most promising for scientists as well as other researchers in the field of
computing. Because processing time is reduced and accurate results are presented, scientists
can further investigate anomalies and arrive at desirable outcomes.

The suggested algorithm generates accurate and reliable results. The advantages and benefits
of this algorithm include the following:

o Extracts, transforms, and loads data onto the system at an ideal time
o Stores and manages data effectively in a database system
o Provides data access to researchers and scientists
o Analyzes data through the application software
o Presents data in a useful format
o Condenses a large amount of data to a significantly smaller size, typically by one to two
orders of magnitude (see the sketch following this list)
o Processes incomplete data effectively
o Sorts through unsorted data
o Provides fast, prompt access to data along with data retrieval processing techniques
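
As an illustration of the data-condensation point above, the following is a minimal sketch that
keeps only observations falling within a reduced standard-deviation band around the mean; the
band width k, the function name, and the sample values are assumptions rather than the
study's exact reduction procedure.

import statistics

def reduce_by_std_range(values, k=1.0):
    # Keep only observations within k standard deviations of the mean,
    # discarding outlying points to shrink the dataset.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    lower, upper = mean - k * std, mean + k * std
    return [v for v in values if lower <= v <= upper]

raw = [10.2, 10.4, 55.0, 10.1, 9.9, 10.3, -40.0, 10.0]
print(reduce_by_std_range(raw, k=1.0))  # the extreme values 55.0 and -40.0 are removed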

5.2 Recommendations and Future Work

Future research can apply the methods presented in this dissertation to develop algorithms for
searching and retrieving scientific data. The experimental and theoretical work presented here
will help researchers develop a range of tools for searching, retrieving, and processing data.
Owing to the significant reduction in processing time achieved by the proposed algorithm,
researchers can manage and obtain the desired data at a preferred time. This algorithm is not
limited to studies conducted by NASA or scientists in general. It can also be utilized in
several data centers as well as in the medical field. For instance, in medicine the processing
of medical data plays an increasingly important role, e.g., computed tomography, magnetic
resonance imaging, and so forth. These data types are produced continually in hospitals and
are growing at a very high rate. Therefore, the need for systems that can efficiently retrieve
medical data of particular interest is becoming very high. The suggested algorithm can be
utilized in this case to ease the burden of data retrieval and to support the retrieval of
relevant data. The algorithm can manage data in all of its aspects, including data in ASCII
format, binary code, compressed form, uncompressed form, and so forth. Researchers who
wish to advance this study should be aware of several suggestions noted here. The first is
reducing space limitations. This algorithm uses dynamic arrays; a dynamic array occupies an
amount of memory proportional to its allocated size, independent of the number of elements
that are actually of interest. Determining ways to constrain this condition can therefore
further improve this study (a lazy-loading alternative is sketched after this paragraph). Those
who wish to further advance this study should also consider the type of systems compatible
with conducting it. In other words, the machine used to test or perform these methods must
have sufficient processing speed and an up-to-date database system. Basic machines will not
suffice and will not generate accurate results.
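
As a purely illustrative contrast with the dynamic-array approach described above, the sketch
below streams only the records of interest with a generator, so memory use does not grow with
the size of the underlying file; the "timestamp,value" line format and the function name are
hypothetical.

def records_of_interest(path, start, end):
    # Yield only records whose timestamp falls within [start, end].
    # Unlike a preallocated dynamic array, a generator holds one record
    # at a time, so memory use is independent of the file size.
    with open(path) as f:
        for line in f:
            timestamp, value = line.strip().split(",")
            if start <= float(timestamp) <= end:
                yield float(timestamp), float(value)

# Usage (assuming a hypothetical comma-separated data file exists):
# for t, v in records_of_interest("sc_2006.csv", 100.0, 200.0):
#     print(t, v)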

A further recommendation is to optimize the directory of large data files by splitting directory
files across multiple directories. For very large directory file jobs, using multiple machines to
simultaneously build directory files on different portions of a data collection is generally
much faster than creating the directory files on a single machine. Splitting up the directory
file creation job (parallel processing) is also a good strategy if disk space is insufficient to
create the directory files all at once; a sketch of this strategy is given below.

Furthermore, future work should expand this research to quantify the algorithm's time
complexity using an analytical approach, and may also consider data in higher dimensions.
The analytical approach seeks to reduce a system to its elementary components in order to
study the system in detail and understand the types of interactions that exist between these
components. We will perform additional tests for the proposed algorithms to demonstrate
their time complexity when working with big, complex data.
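
Returning to the parallel directory-file strategy above, a minimal sketch of splitting index
creation across worker processes might look as follows; the chunking scheme, the entry format,
and the helper names are assumptions rather than the procedure used in this study.

from multiprocessing import Pool

def build_partial_index(file_chunk):
    # Build directory entries for one portion of the data collection.
    # Here an entry is just (file name, position in chunk); a real directory
    # file would also record each file's start and end timestamps.
    return [(name, idx) for idx, name in enumerate(file_chunk)]

def build_directory(all_files, workers=2):
    # Split the collection into roughly equal chunks, one per worker,
    # and build each partial directory file in parallel.
    chunks = [all_files[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partial_indexes = pool.map(build_partial_index, chunks)
    # Merge the partial directory files into a single directory.
    return [entry for partial in partial_indexes for entry in partial]

if __name__ == "__main__":
    files = ["sc_2006.dat", "sc_2007.dat", "sc_2008.dat", "sc_2009.dat"]
    print(build_directory(files))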

Finally, index pre-processing is required for the implementation of these families of
algorithms. The one-time offline file pre-processing overhead is a small tradeoff that
significantly reduces the real-time search complexity, typically by two orders of magnitude.

