
Digital Forensics Evidence Mining Tool

Khaled Almakadmeh, & Mhammed Almakadmeh

Concordia Institute for Information Systems Engineering, Faculty of Engineering and Computer Science,
Concordia University, Montreal, QC, Canada H3G 2W1

The Internet has created new forms of human interaction through its services,
such as e-mail, Internet forums, and online banking. On the other hand,
it has provided countless opportunities for crimes to be committed. Many
digital techniques have been developed and used to help cybercrime
investigators in the process of evidence collection. In this paper, we developed
an efficient digital forensics mining tool to help cybercrime investigators in
evidence collection and analysis by providing various forensically important

Keywords: Evidence, Digital Forensics, Semantic Search, Cybercrime Investigation.

1 INTRODUCTION

The Internet has provided many solutions that make life easier for people all over the world, including e-mail, instant messaging (IM), online banking services, and many other services that most people cannot stop using. However, according to published statistics, thousands of businesses and government departments, such as Western Union and CD Universe, have been hacked, resulting in over a billion dollars of damages per year, and these losses are climbing. This makes the job of law enforcement officers, including cybercrime investigators, more difficult and complicated because of the large amount of data that has to be collected and analyzed. Most cyber criminals use high-technology devices; this requires law enforcement agencies to have efficient tools and utilities to gather and analyze data from these devices. These reasons were the primary motivation behind conducting our research in computer forensics to develop our Digital Forensics Evidence Mining Tool. It is dedicated to helping cybercrime investigators in the process of collecting and analyzing evidence from suspects' devices. We have provided features that are highly needed, helpful, and supportive toward evidence collection.

Search engines like Google, Yahoo, and many others perform keyword search. However, what cybercrime investigators need is to be able to do a semantically oriented search. Semantic search [1] provides great flexibility during the investigation process. For example, the word "cocaine" is not going to be mentioned frequently in a drug dealer's communications; instead, when an investigator searches for a word like "cocaine", (s)he expects to get results that contain the term cocaine or any other related terms. Table 1 shows some examples of terms and their synonyms/hyponyms.

Table 1: Examples of terms & their synonyms/hyponyms

Term            Synonyms/Hyponyms
Cocaine         Blow, Nose Candy, Snow, Crack, Tornado
Bank            Depository, Reserve, Backlog, Stockpile, Deposit, Container, Money Resource, Money Box
Investigation   Probe, Inquiry, Enquiry, Research, Investigating
Internet        Net, Cyberspace, System, Electronic Net, Computer

Our tool is able to enrich the search with various semantic suggestions that the investigator can use. While developing our tool we faced many challenges: we had to take into consideration the tool's efficiency, robust functionality, and visualization during the whole development cycle. Besides these challenges, our solution should scale to a large number of files and be ready to accommodate new features. In addition, the tool needs to be very responsive; within a few seconds the search results need to be displayed and ready to be processed.

2 RELATED WORK

In this section, we focus on previous tools and solutions that have been proposed to help cybercrime investigators. First, we discuss stand-alone utilities used in this field, and in subsequent sections we mention how our tool takes advantage by integrating
them and providing more customized features that will help cybercrime investigators in performing their jobs.

The first utility we use is Google Desktop Search (GDS) [2], provided by Google. GDS is a desktop search engine that provides full-text search for a wide range of file types, such as emails, documents of all types, audio files, images, chat logs, and the history of web pages that the user has visited. What makes it efficient is that after the initial setup and building the index for the first time, indexing occurs only when the machine is idle. Thus, the machine's performance is not affected. GDS also stays up to date by monitoring changes to existing files and newly added files. Last but not least, GDS can find deleted files: it creates cached copies (snapshots) of all files. These copies are returned in the search results and can be viewed even if the files have been deleted.

The other utility we use is WordNet [3], a large English lexical database. It provides nouns, verbs, adjectives, and adverbs that are grouped into sets of cognitive synonyms called "synsets". Synsets are interlinked by means of conceptual-semantic and lexical relations [3]. In [4], indexing with WordNet synsets is used to improve text retrieval. We take advantage of this utility to show the investigator a broad collection of suggestions that (s)he could pass to GDS. Further discussion of our solution is provided in subsequent sections.

3 PROBLEM STATEMENT

A good problem statement should answer the following questions:

• What is the problem?
  The investigator needs to be able to query the criminals' devices to build knowledge about what information they contain. This knowledge can be used to provide evidence and/or to prevent future incidents.

• Who has the problem?
  The intended clients for this solution are cybercrime investigators; they face a problem when performing an effective and efficient search on the information in criminals' devices.

• What is the solution?
  A full-featured desktop tool that uses GDS and WordNet to provide semantic search in a suspect's computer.

4 PROPOSED SOLUTION

In this section, we show an overview of our tool's architecture. Then, we discuss how each component in the tool contributes to the overall functionality. After that, we show the use-case and activity diagrams of our tool.

4.1 System Architecture

The system architecture provides a comprehensive overview of the tool and its supporting infrastructure. Figure 1 shows the architecture of our tool.

Figure 1: Tool Architecture

4.2 Tool Components

The system components are:

• Graphical User Interface
• WordNet API
• Google Desktop SDK
• Business Layer

We describe each component from a technical perspective and explain how they communicate with each other to handle the submitted task. Then, we present the implemented features that are of great use to cybercrime investigations.

4.3 Graphical User Interface
The Graphical User Interface (GUI) was designed to be simple, intuitive, and yet very practical. It presents all of our tool's functionality in a clear and standard layout to minimize the user's learning curve. The GUI also provides menus that accomplish the same functions as the main window components; these menus are intended to help users who are more menu-oriented. Figure 2 shows a screenshot of our tool.

Figure 2: Digital Forensics Evidence Mining Tool

4.4 WordNet

For the semantic search functionality, we decided not to automatically search for all synonyms of the desired term, since this approach would overload the tool and overwhelm the investigator with a large number of results. Instead, we designed our tool to search only for the desired term. Figure 3 shows a more practical, feature-rich suggestion panel. When the forensic investigator enters a term and hits Enter, the suggestion panel shows a list in the form of a tree view that contains synonyms, acronyms, sister

Figure 3: Panel shows suggestions for "Cocaine"

In addition, the investigator has more options, such as specifying whether (s)he wants to look for nouns, verbs, or adjectives that are related to the term previously searched. Below that panel there is a definition window that shows the definition of the word selected in the suggestion panel and an example of its use. Double-clicking on a term in the suggestion panel initiates a new request to search for that term, and the results are displayed in a new tab. Because only the selected term is searched, this approach keeps the tool's performance high.

4.5 Google Desktop SDK

The Google Desktop Search SDK consists of the following:

• Event Schemas: The GDS engine processes event objects sent to it by other components (the business layer, or even the GUI). An event object consists of the content data the investigator wants the engine to index and store, as well as additional meta-information and properties about that content or the event object. The event schemas specify the allowed event types and the relevant properties for each event.

• Developer Indexing API: The Developer Indexing API consists of interfaces used to construct event objects and send them to the GDS engine.

• Developer Search API: We only use the Developer Search API. It sends an HTTP request containing the investigator's search query term to the Google Desktop Search engine. The HTTP response contains the desktop search results in XML format.

When the investigator submits a search query, (s)he actually generates an HTTP request that includes a &format=xml parameter.

For example, to search for "Google" you would send something like:

To break this down:

• The first component is the localhost address and GDS port.
• search&s=1ftR7c_hVZKYvuYS-RWnFHk91Z0 is the search command and a security token.
• ?q=Google is the query term(s) parameter.
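As a rough illustration of the request format described in this section, the Python sketch below assembles a search URL from its parts and parses an XML reply. The function names, the base address and placeholder token used in the usage lines, and the `<results>`/`<result>` element names in the sample reply are assumptions made for illustration; they are not taken from the tool's source code or confirmed by the text.

```python
from urllib.parse import quote_plus
import xml.etree.ElementTree as ET


def build_search_url(base_url, token, query, max_results=None):
    """Assemble a GDS-style search URL from the components named in the text.

    base_url    -- localhost address and GDS port, ending in "/" (assumed value)
    token       -- the security token that follows "search&s="
    query       -- raw query text; wrap a phrase in double quotes
    max_results -- optional value for the &num= parameter
    """
    # quote_plus turns spaces into '+' and double quotes into '%22',
    # which matches the term and phrase encoding used by the query parameter.
    url = f"{base_url}search&s={token}?q={quote_plus(query)}&format=xml"
    if max_results is not None:
        url += f"&num={int(max_results)}"
    return url


def parse_results(xml_text):
    """Return one dict per <result> element (element names are assumed)."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "") for child in result}
        for result in root.iter("result")
    ]


if __name__ == "__main__":
    # Hypothetical base address and token, for illustration only.
    print(build_search_url("http://127.0.0.1:4664/", "TOKEN", "Google"))
    print(build_search_url("http://127.0.0.1:4664/", "TOKEN",
                           '"Google Desktop Search"', max_results=25))

    # Hypothetical reply; the real schema may use different element names.
    sample = (
        '<results count="1">'
        "<result><title>notes.txt</title><url>C:\\docs\\notes.txt</url></result>"
        "</results>"
    )
    print(parse_results(sample))
```

Keeping URL construction separate from XML parsing means either half can be adjusted on its own if the engine's actual address, token handling, or response schema differs from what is assumed here.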

If the investigator wants to search for more than one term, separate the terms with +s. For example, to search for both "Google" and "GDS", use: ?q=Google+GDS

If the investigator wants to search for a specific phrase, separate the terms with +s and surround the phrase with %22s. For example, to search for the phrase "Google Desktop Search", use: ?q=%22Google+Desktop+Search%22

To search for the two phrases "Google Desktop Search" and "Copyright 2005", use: ?q=%22Google+Desktop+Search%22+%22Copyright+2005%22

&format=xml specifies that the HTTP response returns the search results in XML format. By default, an HTTP search response will only return the first ten results. The developer can specify a different number by appending the &num= parameter, followed by the maximum number of results to be returned for the query. There is no problem if the maximum number is greater than the total number of search results; only the available results are returned, with no null "results".

4.6 Business Layer

This component is at the core of our tool. It receives the search terms from the GUI and interacts with the WordNet component when the investigator wants to search for a keyword from the suggestion panel. It also sends the search term with the search preferences to the GDS engine. The business layer processes the results and sends them back to the GUI to be shown to the investigator. This layer resembles the brain of our tool, where all the processing complexity is hidden and kept separate from the GUI. It is composed of classes and functions that communicate with the rest of the components.

4.7 Use Case Diagram

The use case diagram [5] gives an abstract view of the functionality the investigator can use when working with our tool.

Figure 3: Use case Diagram

4.8 Activity Diagram

The activity diagram [5] shows the flow of the program when a search task is submitted to the tool. As shown in the diagram, the user can specify advanced search options before executing the search and can also choose a keyword from WordNet to run the search again. After the results are shown, the user can generate a report and save it to be used later when presenting the evidence in a court of law.

Figure 4: Activity Diagram

4.9 Applicability

Our tool runs on Windows XP, Vista, and Windows 7, and by using the Google Desktop Search engine it can access all file types, such as MS Office files, Outlook files, archive files (such as .zip, .rar), email, and web history files.

5 TOOL FEATURES

Our tool provides a feature-rich environment for the investigator. We provide many features that help the investigator in evidence analysis and report generation. Below is a description of all the functionality our tool provides:

5.1 Result Display: By default, search results are displayed in groups of twenty per page; the previous and next buttons allow the investigator to navigate through the result pages. The total number of results found is shown at the top of the results

5.2 Access All File Types: Using the Google Desktop Search engine, our tool can access all file types, such as MS Office files, Outlook files, archive files (such as .zip, .rar), and web history files.

5.3 Semantic Search: A feature-rich panel that suggests many variations of the keyword, including a small panel that shows the meaning of each word and a sample sentence showing how it is used.

5.5 Multiple Tabs: For each keyword searched, a new tab opens, allowing the investigator to conduct more searches and close any unneeded tab.

5.6 Advanced Search: Provides more options that allow the tool to filter the results.

    A. Choose various file types for a more refined search, including the most common file types, such as text, images, audio/video, archive (zip), and HTML files.
    B. Choose a specific file category, such as email or web, to search only the specified type of files.
    C. Choose the number of results per page.
    D. Sort the results by relevance: when checked, related files (within the same directory) are displayed one after another.

5.7 Display File Snippet: Allows the investigator to see the searched term within the file it's been

5.8 Display Detailed File Information: such as creation date, last access date, last write date, file attributes, and MD5 checksum value.

5.9 Opening the File in the Appropriate Application: when the file name is double-clicked in the graphical user interface.

5.10 Comprehensive Menu: provides the same functionality to the user if (s)he is more accustomed to using menus.

5.11 Report Generation: allows the selection of multiple files from multiple search result tabs, to be added and used to generate a report in HTML format. This report shows for each file: the file title, path, MD5 checksum value, and file size.

5.12 Set the Search Path: to search within a specific directory only.

5.13 Calculate & Display the Hash (MD5) of the file to prove the integrity of the seized evidence.

5.14 Help Menu: provides the user with a user manual explaining how to use the tool's functionality.

6 CONCLUSIONS

In this paper, we developed a Digital Forensics Evidence Mining Tool that is dedicated to helping cybercrime investigators in the process of collecting and analyzing data from a suspect's computer. We have provided in this solution features that are highly needed, helpful, and supportive towards evidence collection. We took advantage of some already developed APIs, such as the Google Desktop Search API and the WordNet API, to enrich our application. Due to recurring requirements in this active topic, our solution is scalable and can be adjusted to accommodate future requirements and features to provide a unique and essential tool for cybercrime investigators.

7 REFERENCES

[1] R. Guha, Rob McCool, Eric Miller, Semantic search, International World Wide Web Conference, Proceedings of the 12th international conference on World Wide Web.

[2] Benjamin Turnbull, Barry Blundell, Jill Slay, Google Desktop as a Source of Digital Evidence.

[3] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller, Introduction to WordNet: An On-line Lexical Database.

[4] Julio Gonzalo, Felisa Verdejo, Irina Chugur, Juan Cigarrain, Indexing with WordNet synsets can improve text retrieval, UNED, Ciudad Universitaria.

[5] G. Booch, J. Rumbaugh, I. Jacobson, Unified Modeling Language User Guide.