
CHAPTER 1

INTRODUCTION

The use of software is steadily increasing. People use software for education, for personal tasks, and for services such as bank transactions and e-commerce. Because it saves money, time, and human resources, most companies have computerized their business operations.

A computer stores so much information that whenever a user asks for a particular item, it has to search through its files and memory to locate the data and make it available to the user. The computer has its own procedures for searching its memory quickly.

According to an article on figures published by Google, roughly 40% of people search only on a smartphone. More people are searching Google on smartphones than ever before, the company says, with the most prevalent categories revolving around health, parenting, and beauty. Other findings from Google's study are that 80% of people who search Google use a smartphone, 67% use a desktop computer, 16% use a tablet, 57% use more than one type of device, 27% use only a smartphone, and 14% use only a desktop computer.

An IDC white paper published in 2012 reports that a worldwide study of 1,200 information workers and IT professionals found that they spend an average of 4.5 hours a week trying to find documents on their computers. These are the people who need to find things the most and who ought to be the best at finding them. Instead, they spend half of those 4.5 hours searching for, and not finding, the documents they need, and the other half recreating what they could not find.

https://www.searchenginejournal.com/mobile-search-rise-almost-half-people-search-smartphones-study/175544/

Nowadays search engines are essential for writing school papers, and most people who use a search engine do so for research. They are usually looking for answers, or at least for data on which to base a decision. Searching is one of the simplest things to do on the internet, and the same is true of finding files on a computer.

Moreover, the computer is one of the most important tools of this era, and searching is among the most straightforward tasks that people perform on it. People do their jobs on their personal computers and store their finished work there, and some download large files from the internet and save them to local storage. Digital storage has therefore become essential.

According to makeuseof.com, many people save their files on the desktop, some for an understandable reason: it provides instant access with a single click. These people may not be aware of the risks, such as hackers gaining access to desktop files. People who keep saving files to the desktop often lose track of the data they have saved; some sort their files, while others use the search bar to find the information they need. The search bar is significant because it lets users easily find what they are looking for.

The search algorithm is fundamental to finding files: it is the step-by-step procedure used to locate specific data within a collection of data, and it is considered a fundamental procedure of computing. When searching for data, the difference between a fast application and a slow one often lies in the choice of search algorithm. Search algorithms can be classified by their mechanism of searching. Linear search algorithms, for example, check each record in turn for the one associated with a target key.
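As a minimal sketch of the linear search just described, the Java snippet below checks each record in turn for a target key. The class name LinearSearchSketch, the method name, and the sample file names are illustrative assumptions and are not part of the study's application.

```java
import java.util.List;

final class LinearSearchSketch {

    /** Returns the index of the first record equal to key, or -1 if no record matches. */
    static int linearSearch(List<String> records, String key) {
        for (int i = 0; i < records.size(); i++) {
            if (records.get(i).equals(key)) {
                return i; // found the target key at position i
            }
        }
        return -1; // every record was checked without a match
    }

    public static void main(String[] args) {
        // Hypothetical file names used only for illustration.
        List<String> files = List.of("notes.docx", "thesis.docx", "budget.docx");
        System.out.println(linearSearch(files, "thesis.docx"));  // prints 1
        System.out.println(linearSearch(files, "missing.docx")); // prints -1
    }
}
```

In the worst case every record is inspected once, so the running time grows linearly with the number of records.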

String-searching algorithms, also called string-matching algorithms, are an important class of string algorithms that try to find the places where one or several strings (also known as patterns) occur within a larger string or text.

Naïve string matching is a simple string search algorithm that works well for short patterns. The application built on it can help people who cannot fully organize their computer desktops, so that they no longer have to spend a long time looking for a file. It can search for a word inside an MS-Word (.docx) document in a short time. The naïve string matching algorithm is the simplest of the string matching algorithms: it requires no preprocessing, and its worst-case matching time is Θ((n − m + 1)m) for a text of length n and a pattern of length m.
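To make the Θ((n − m + 1)m) worst case concrete, the following sketch counts the character comparisons that a naïve matcher performs. The class and method names are hypothetical, and the inputs are chosen only for illustration: a text of ten a's and the pattern "aaab" force a full pattern-length comparison at every shift.

```java
final class NaiveCostSketch {

    /** Counts the character comparisons made by a naive matcher on txt and pat. */
    static long countComparisons(String txt, String pat) {
        int n = txt.length();
        int m = pat.length();
        long comparisons = 0;
        for (int s = 0; s <= n - m; s++) {       // n - m + 1 candidate shifts
            for (int j = 0; j < m; j++) {        // at most m comparisons per shift
                comparisons++;
                if (txt.charAt(s + j) != pat.charAt(j)) {
                    break;                       // mismatch: move on to the next shift
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // n = 10, m = 4: 7 shifts, each needing 4 comparisons.
        System.out.println(countComparisons("aaaaaaaaaa", "aaab")); // prints 28
        // A pattern that fails on its first character costs one comparison per shift.
        System.out.println(countComparisons("abcdefghij", "xyz"));  // prints 8
    }
}
```

The first call reaches the bound (n − m + 1)m = 7 × 4 = 28, while the second shows why the algorithm is often fast in practice on short patterns that mismatch early.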

PURPOSE AND DESCRIPTION

Search applications are very prevalent nowadays. Unorganized files, or the lack of proper file management, are among the most common problems faced by computer users. The purpose of this study is to apply the naïve string matching algorithm to find the files that contain the text being sought, and to add features that report how many matches were found in each file and when the text, word, or phrase was last searched. These features are not present in the current search facility of the Windows OS.
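The two added features, a match count and a last-searched time, can be pictured with a small in-memory log. This is only a sketch under stated assumptions: the class SearchLogSketch, its Entry type, and its method names are hypothetical, and it keeps entries in a map instead of a database table.

```java
import java.time.LocalDateTime;
import java.util.LinkedHashMap;
import java.util.Map;

final class SearchLogSketch {

    /** One log entry: how many matches were found and when the word was last searched. */
    static final class Entry {
        final int matchCount;
        final LocalDateTime lastSearched;

        Entry(int matchCount, LocalDateTime lastSearched) {
            this.matchCount = matchCount;
            this.lastSearched = lastSearched;
        }
    }

    // Insertion-ordered so the log can be listed in the order words were first searched.
    private final Map<String, Entry> entries = new LinkedHashMap<>();

    /** Records a search, replacing any earlier entry for the same word. */
    void record(String word, int matchCount, LocalDateTime when) {
        entries.put(word, new Entry(matchCount, when));
    }

    /** Returns the latest entry for word, or null if it was never searched. */
    Entry lookup(String word) {
        return entries.get(word);
    }

    public static void main(String[] args) {
        SearchLogSketch log = new SearchLogSketch();
        log.record("sunflower", 6, LocalDateTime.now());
        Entry e = log.lookup("sunflower");
        System.out.println(e.matchCount + " matches, last searched " + e.lastSearched);
    }
}
```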

This study includes the creation of an application that applies the naïve string matching algorithm to search text inside .docx documents. While the program is running, the user selects the folder to search with the Browse button and enters the desired word in the search box. Once the Search button is clicked, the application searches the MS-Word documents and displays the results in the rich text box.

After the search, the user can click one of the listed MS-Word documents, and the application shows in its panel box the text from that document that matches the word entered in the search box.

OBJECTIVES:

The objectives of this study are the following:

• To develop a search engine that can search the contents of MS-Word documents using the naïve string matching algorithm and display the matching documents.

• To determine the accuracy of the naïve string matching algorithm in finding the files that contain the text, word, or phrase being searched, by running several sets of test data.

SCOPE AND LIMITATION

SCOPE

The study covers the naïve string matching algorithm as applied to searching text in MS-Word documents. The study also includes the creation of a simulator with the following functionalities:

• Can search the content of MS-Word documents.

• Can generate a search log that records each recently searched word together with the time and date of the search.

• Can search MS-Word documents through sub-folders.

• Can exclude or include sub-directories in searching.
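The option to include or exclude sub-directories can be sketched with the standard java.nio.file API. The class WordFileLister and its methods are hypothetical names used for illustration; the study's own application walks directories with a different file library, so this is an alternative sketch and not its implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class WordFileLister {

    /** True if the file name ends with .doc or .docx, ignoring case. */
    static boolean isWordFile(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".doc") || name.endsWith(".docx");
    }

    /** Lists Word files under root; sub-folders are visited only when requested. */
    static List<Path> listWordFiles(Path root, boolean includeSubFolders) {
        // Depth 1 means "only the files directly inside root".
        int depth = includeSubFolders ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(root, depth)) {
            return paths.filter(Files::isRegularFile)
                        .filter(WordFileLister::isWordFile)
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Lists Word files in the current directory and all of its sub-folders.
        for (Path p : listWordFiles(Paths.get("."), true)) {
            System.out.println(p);
        }
    }
}
```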

LIMITATION

• The simulation can search for text with a maximum length of 100 characters composed of letters of the alphabet in upper and lower case (a, b, c, A, B, C).

• The simulation program cannot read PDF, XLSM, PPTX, or other document file formats; only .doc and .docx files are included.
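The first limitation can be expressed as a small input check. The sketch below is an assumption about how such a check might look: the class QueryValidator and its constant are hypothetical, and it reads the limitation strictly, so spaces, digits, and punctuation are rejected along with queries longer than 100 characters.

```java
final class QueryValidator {

    /** Maximum query length stated in the limitation. */
    static final int MAX_LENGTH = 100;

    /** True if query is 1 to 100 characters long and contains only a-z or A-Z. */
    static boolean isValidQuery(String query) {
        if (query == null || query.isEmpty() || query.length() > MAX_LENGTH) {
            return false;
        }
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            boolean isLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (!isLetter) {
                return false; // strict reading: anything but a basic letter is rejected
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidQuery("sunflower")); // prints true
        System.out.println(isValidQuery("flood2019")); // prints false
    }
}
```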

CHAPTER 2

REVIEW OF RELATED LITERATURE

According to Christian Charras and Thierry Lecroq of Université de Rouen (2015), string-matching algorithms are essential components used in implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design).

Furthermore, Yanggon Kim and Sun Kim of Towson University (1999), the authors of A Fast Multiple String-pattern Matching Algorithm, proposed a simple and efficient multiple string pattern matching algorithm based on a compact encoding scheme. Their algorithm scans the text from left to right while encoding its characters based on the alphabet that occurs in the input patterns. They conclude that their algorithm can handle a very large number of patterns simultaneously and runs faster than grep-based utilities in many cases. Hashing techniques are used in other multiple-string matching algorithms to handle a large number of patterns.

Also, in Algorithms for String Searching, Ricardo A. Baeza-Yates surveys several algorithms for searching for a string in a piece of text. The survey includes theoretical and empirical results as well as the algorithms themselves. The author concludes that the choice of string matching algorithm depends on the alphabet size and the pattern size. If the pattern is small (1 to 3 characters long), it is better to use the naïve algorithm. If the alphabet size is large, the Knuth-Morris-Pratt algorithm is a good choice. In all other cases, in particular for long texts, the Boyer-Moore algorithm is better. Finally, the Horspool version of the Boyer-Moore algorithm is the best algorithm, in terms of execution time, for almost all pattern lengths. The shift-or algorithm has a running time similar to that of the KMP algorithm; its main advantage is that it can search for more general patterns.
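For readers unfamiliar with the Horspool variant mentioned in the survey, the following is a minimal textbook-style sketch of it in Java. It is not taken from the surveyed paper or from this study's application; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HorspoolSketch {

    /** Returns every 0-based shift at which pattern occurs in text. */
    static List<Integer> search(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) {
            return shifts;
        }
        // For each character except the last one in the pattern, remember the
        // distance from its rightmost occurrence to the end of the pattern.
        Map<Character, Integer> skip = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            skip.put(pattern.charAt(i), m - 1 - i);
        }
        int s = 0;
        while (s <= n - m) {
            int j = m - 1;
            // Compare the window with the pattern from right to left.
            while (j >= 0 && text.charAt(s + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) {
                shifts.add(s);
            }
            // Shift by the skip value of the text character under the pattern's last
            // position; characters that do not occur in the pattern skip the full length m.
            s += skip.getOrDefault(text.charAt(s + m - 1), m);
        }
        return shifts;
    }

    public static void main(String[] args) {
        System.out.println(search("abracadabra", "abra")); // prints [0, 7]
    }
}
```

The skip table is the preprocessing cost that the naïve algorithm avoids, which is one reason the survey still recommends the naïve algorithm for very short patterns.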

Furthermore, in Evaluation of String Matching Algorithms, Simon Wahlström (2013) evaluates five string searching algorithms: Brute Force, Boyer-Moore, Knuth-Morris-Pratt, Karp-Rabin, and Horspool. Each algorithm is presented with an explanation of how it works. The algorithms examined each have their own strengths and weaknesses. For a small pattern and alphabet, the brute force (naïve) algorithm is an excellent choice since it is easy to implement.

Moreover, in TARA: An Algorithm for Fast Searching of Multiple Patterns on Text Files, M. Oguzhan Külekci (2007) introduced a new multi-pattern matching algorithm, named TARA, that searches for fixed-length strings in text files very quickly by exploiting bit-parallelism. Bounded gaps and character classes in keywords are also supported. The experimental results on natural language text indicate that, for a small number of patterns, the unoptimized implementation of the algorithm is approximately 1.5 times faster than the grep software and 5 times faster than its nearest successor among the AC and CW variants. The author believes that, given its simplicity and speed, the algorithm is a very convenient way of searching in practice, since modern computers are sufficient for everyday problems.

Moreover, in Study of Different Algorithms for Pattern Matching, Rahul B. Diwate and Prof. Satish J. Alaspurkar (2013) note that search engines use different search algorithms for handling different types of data, and that full-search algorithms improve the pattern matching process. The paper analyzes and compares different algorithms for full-search-equivalent pattern matching in terms of complexity, efficiency, and the techniques they use. The authors conclude that each algorithm has its own characteristics and that the Boyer-Moore and Knuth-Morris-Pratt algorithms are the more useful ones for searching. Focusing on complexity, they report that the Knuth-Morris-Pratt algorithm has lower time complexity, while the Boyer-Moore algorithm has lower preprocessing time complexity. The Fast DTW algorithm, which has linear time and space complexity, is described as best for image, audio, and video pattern processing. The time performance of exact string pattern matching can be significantly improved if an efficient algorithm is used.

In Multithreaded Implementation of Hybrid String Matching Algorithm, Akhtar Rasool, Dr. Nilay Khare, Himanshu Arora, Amit Varshney, and Gaurav Kumar of Maulana Azad National Institute of Technology (2012) describe a hybrid pattern matching algorithm created by combining the KMP and Boyer-Moore string searching algorithms. It is also a pattern searching algorithm that scans the string from left to right. To reduce processing time in the worst and average cases, the goal is to combine the best- and average-case advantages of Boyer-Moore with the worst-case guarantees of KMP. The comparison shows that the hybrid algorithm significantly improves matching efficiency. The main drawback of Boyer-Moore type algorithms is the preprocessing time and the space required, which depend on the alphabet size and the pattern size. For this reason, if the pattern is small (1 to 3 characters long), it is better to use the naïve algorithm.

In Importance of String Matching in Real-World Problems, Kapil Kumar Soni, Rohit Vyas, and Amit Singhal explain that in string matching, pattern strings are searched for within a larger string or text. Let "p" be the pattern string and "S" the text string. The string matching problem is to determine whether "p" occurs in "S" and, if it does, to report the position in "S" where "p" occurs. There are two types of string matching: exact string matching and approximate string matching. String matching has greatly influenced the field of computer science and will play an essential role in various real-world problems. As time goes on, more and more efficient string matching algorithms will be used. Since 1950, many single- and multiple-pattern string matching algorithms have been proposed, and there are many more areas in which string matching can play a crucial role.

CHAPTER 3

THEORETICAL BACKGROUND

Finding all occurrences of a pattern in a text is a problem that frequently arises in text-editing programs. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user. Efficient algorithms for this problem can greatly aid the responsiveness of the text-editing program. The idea of the naive solution is simply to compare, character by character, the text window T[s . . s + m − 1] with the pattern P[0 . . m − 1] for every s ∈ {0, . . . , n − m}. It returns all the valid shifts found.

In the 1-based notation of the pseudocode below, the naive algorithm finds all valid shifts using a loop that checks the condition P[1 . . m] = T[s + 1 . . s + m] for each of the n − m + 1 possible values of s.

NAIVE-STRING-MATCHER (T, P)

n ← length[T]

m ← length[P]

for s ← 0 to n − m

    do if P[1 . . m] = T[s + 1 . . s + m]

        then print "Pattern occurs with shift" s
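The matcher described above can be sketched in Java so that it returns the valid shifts instead of printing them. The class and method names are hypothetical; the shifts are 0-based, matching the T[s . . s + m − 1] notation.

```java
import java.util.ArrayList;
import java.util.List;

final class NaiveMatcherSketch {

    /** Returns every shift s with text[s .. s+m-1] equal to pattern. */
    static List<Integer> findShifts(String text, String pattern) {
        List<Integer> shifts = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        for (int s = 0; s <= n - m; s++) {
            // regionMatches compares the m characters starting at s with the pattern.
            if (text.regionMatches(s, pattern, 0, m)) {
                shifts.add(s);
            }
        }
        return shifts;
    }

    public static void main(String[] args) {
        System.out.println(findShifts("abababa", "aba")); // prints [0, 2, 4]
    }
}
```

The size of the returned list is the number of occurrences, which is the kind of count a search application can report per document.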

According to Zvi Galil and Joel Seiferas in Time-Space-Optimal String Matching (1987), earlier string-matching algorithms follow a single general scheme. Writing $x$ for the pattern and $y$ for the text, the scheme considers prospective positions $p$ for the pattern in the text in increasing order, and it maintains the length $q \ge 0$ of a pattern prefix known to match the text following position $p$, that is, $[0, q]_x = [p, p + q]_y$. The algorithms extend $q$ in a while-loop for as long as $y(p + q + 1) = x(q + 1)$, and then continue with appropriately calculated $p' > p$ and $q'$.

Each time $q$ reaches the pattern length $|x|$, a full instance of the pattern has been found following position $p$ in the text ($x = [p, p + |x|]_y$); the search can be continued by dropping out of the while-loop. (We consider $y(p + q + 1) = x(q + 1)$ to be false whenever $p + q + 1 > |y|$ or $q + 1 > |x|$, so this will be automatic.) Of course, the algorithms should halt when the end of the text is reached ($p = |y|$).

The previous algorithms differ only in how they calculate $p'$ and $q'$. The naive algorithm conservatively calculates $p' = p + 1$ and $q' = 0$. Since $[0, q]_x = [p, p + q]_y$, however, consideration of $p' = p + \mathit{shift}$ is futile unless $[0, q - \mathit{shift}]_x = [\mathit{shift}, q]_x$; so the Knuth-Morris-Pratt algorithm calculates $p' = p + \mathit{shift}_x(q)$, where

$$\mathit{shift}_x(q) = \min\{\mathit{shift} > 0 \mid [\mathit{shift}, q]_x = [0, q - \mathit{shift}]_x\}.$$
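The shift function can be evaluated directly from its definition. The sketch below does so by brute force, reading $[i, j]_x$ as the pattern characters at 0-based positions i to j − 1 (an interpretation assumed here). It is meant only to illustrate the definition: the class and method names are hypothetical, and the actual Knuth-Morris-Pratt algorithm precomputes these values in linear time.

```java
final class KmpShiftSketch {

    /**
     * Smallest shift > 0 such that the pattern prefix of length q, moved right by
     * that shift, still agrees with itself on the overlap. Requires 1 <= q <= x.length().
     */
    static int shift(String x, int q) {
        for (int shift = 1; shift < q; shift++) {
            // Compare x[shift .. q-1] with x[0 .. q-shift-1].
            if (x.regionMatches(shift, x, 0, q - shift)) {
                return shift;
            }
        }
        return q; // shifting by q leaves an empty overlap, which always agrees
    }

    public static void main(String[] args) {
        System.out.println(shift("abab", 4)); // prints 2
        System.out.println(shift("aaaa", 3)); // prints 1
        System.out.println(shift("abcd", 3)); // prints 3
    }
}
```

For the pattern "abcd" the shift equals q, so the whole matched prefix can be skipped, whereas the naive choice of p' = p + 1 would re-examine positions that cannot match.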

In TARA: An Algorithm for Fast Searching of Multiple Patterns on Text Files (2007), M. Oguzhan Külekci reports that the algorithm performs very fast in practice. Experiments were conducted to compare the performance of the proposed algorithm with the widely used GNU grep file search utility and with nine variants of the Aho-Corasick and Commentz-Walter algorithms on natural language text.

The TARA algorithm is set up as follows. Let P = {p_0, p_1, . . . , p_{m−1}} be the set of m patterns to be searched for in the text T[0 . . n − 1] of n characters, and let LP = {lp_0, lp_1, . . . , lp_{m−1}} be the corresponding lengths of the patterns in P. The alphabet is denoted by Σ. The maximum and minimum values of LP are stored in the variables maxlen and minlen. The algorithm is explained on an example in which P = {bal, peynir, re[cç]el}, LP = {3, 6, 5}, maxlen = 6, and minlen = 3.

In Multithreaded Implementation of Hybrid String Matching Algorithm, Akhtar Rasool, Dr. Nilay Khare, Himanshu Arora, Amit Varshney, and Gaurav Kumar (2012) describe an algorithm that came into existence by combining the KMP and Boyer-Moore string searching algorithms. It is also a pattern searching algorithm that scans the text from left to right. To reduce processing time in the worst and average cases, the goal is to combine the best- and average-case advantages of Boyer-Moore with the worst-case guarantees of KMP. According to the experiments the authors conducted, the new algorithm is among the fastest in practice for computing all occurrences of a pattern p = p[1..m] in a text string s = s[1..n] on an alphabet of size n, giving a time complexity of O(m + n). [6,7]

Given a string S and a pattern P of size m and n respectively, Step 1 takes the string S of size m and breaks it into two parts (i.e., S1 and S2).

CHAPTER 4

METHODOLOGY

This chapter presents the design of the application, which was developed using the NetBeans IDE. It also discusses the functions used and the algorithm applied in the application. The overall flow of the simulator is presented here for a clear understanding.

The diagram below elaborates on the primary process of the simulator and gives the reader an idea of the functions used in the application.

Fig. 1. Use Case Diagram

ACTIVITY DIAGRAM

The figure below presents the workflow of the program application in a

graphical way.

Fig. 2. Activity Diagram

SEQUENCE DIAGRAM

The diagram below shows how the objects interact with each other and the order of those interactions. The process is represented vertically, and the interactions are shown as arrows.

Fig. 3. Sequence Diagram

NAÏVE STRING MATCHING ALGORITHM FLOWCHART

Figure 4. Naïve String Matching Algorithm Flowchart

Source Code:

public static void search(String txt, String pat) {
    int M = pat.length();
    int N = txt.length();
    // Try every shift i at which the pattern could start in the text.
    for (int i = 0; i <= N - M; i++) {
        int j;
        // Compare the pattern with the text window character by character.
        for (j = 0; j < M; j++) {
            if (txt.charAt(i + j) != pat.charAt(j)) {
                break;
            }
        }
        if (j == M) { // pat[0...M-1] = txt[i, i+1, ...i+M-1]
            System.out.println("Pattern found at index " + i);
        }
    }
}

// Call site: the document text is lower-cased before matching.
String txt = ex.getText().toString();
String pat = TextFinder.textSearchText.getText().toString();
search(txt.toLowerCase(), pat);

NAÏVE STRING MATCHING PSEUDOCODE

NAIVE-STRING-MATCHER (T, P)

1. n ← length[T]

2. m ← length[P]

3. for s ← 0 to n − m

4.     do if P[1 . . m] = T[s + 1 . . s + m]

5.         then print "Pattern occurs with shift" s

CHAPTER 5

RESULT AND DISCUSSION,


CONCLUSION AND RECOMMENDATION

This chapter shows the results of the study. The discussion, conclusion, and recommendation are found at the end of the chapter.

RESULT AND DISCUSSION

The following table shows the results of the Naïve String Matching Algorithm in searching the content of MS-word documents: each pattern searched, the number of matches found, and the files in which the matches occurred.

Table 1. MS-word String Match

Sub-directories in searching MS-word document

MS-word documents pattern   Number of matched patterns   MS-word filename from selected path
sunflower                   6                            Sample3.docx
shallow                     16                           Sample7.docx
tell                        4                            Sample7.docx, Sample3.docx
nevertheless                4                            Sample3.docx
needless                    4                            Sample6.docx
bad                         15                           Sample1.docx, Sample3.docx, Sample6.docx, Sample7.docx
deep                        2                            Sample7.docx
flood                       5                            Sample9.docx

Figure 5. Text “nevertheless” result

Figure 6. Text “sunflower” results

Figure 7. Text “needless” result

Figure 8. Text “deep” result

Figure 9. Text “flood” result

Table 1 presents the results of using the naïve string matching algorithm to search MS-word content. As the figures above show, each string searched has several pattern indexes, which means that the pattern occurs several times inside the MS-word documents.

Conclusion

After analyzing the results of the text searches, the following findings were made:

The MS-Word content search is the method used by the researcher to search for text inside MS-Word documents with the naïve string matching algorithm. When the search runs, the application finds matches by detecting pattern indexes. These indexes determine how many times the pattern was matched within a particular MS-Word document.

Recommendation

Based on the results of running the test data in the application, the researcher recommends that future researchers add the following feature:

The application should be able to search across a local area network, that is, to search MS-Word documents in a specific shared path or folder on another computer through the Local Area Network.

BIBLIOGRAPHY

Page Title: WHY PEOPLE USE SEARCH ENGINES: RESEARCH, SHOPPING, AND ENTERTAINMENT
Address: https://www.dummies.com/web-design-development/search-engine-optimization/why-people-use-search-engines-research-shopping-and-entertainment/

Page Title: Multithreaded Implementation of Hybrid String Matching Algorithm
Address: http://www.enggjournals.com/ijcse/doc/IJCSE12-04-03-032.pdf

Page Title: Evaluation of String Searching Algorithms
Address: https://pdfs.semanticscholar.org 878c943735e75935e995b58.pdf

Page Title: Tara: An algorithm for fast searching of multiple patterns on text files
Address: https://www.researchgate.net/publication/4321078_Tara_An_algorithm_for_fast_searching_of_multiple_patterns_on_text_files

Page Title: We assume that the text is an array T 1n of length n
Address: https://www.coursehero.com/file/p5d1hjv/We-assume-that-the-text-is-an-array-T-1n-of-length-n-and-that-the-pattern-is-an/

Page Title: TR 87 February 1981 - urresearch.rochester.edu
Address: https://urresearch.rochester.edu/fileDownloadForInstitutionalItem.action?itemId=10186&itemFileId=22371

Page Title: Omegaexpression is the set of functions that grow
Address: https://www.coursehero.com/file/p2kq6vr/Omegaexpression-is-the-set-of-functions-that-grow-faster-than-or-at-the-same/

Page Title: A very fast string matching algorithm for small alphabets and long patterns
Address: https://www.researchgate.net/publication/225725083_A_very_fast_string_matching_algorithm_for_small_alphabets_and_long_patterns

Page Title: A Fast Multiple String-Pattern Matching Algorithm
Address: https://www.google.com/search?q=of+A+Fast+Multiple+String-pattern+Matching+Algorithm,+Yangon+Kim+and+Sun+Kim+of+Towson+University+(1999)&spell=1&sa=X&ved=0ahUKEwj66YyU5I_hAhUNcCsKHYi0DrcQBQgpKAA&biw=1517&bih=730

Page Title: Algorithms for string searching Ricardo A. Baeza-Yates
Address: https://www.semanticscholar.org/paper/Algorithms-for-String-Searching%3A-A-Survey-Baeza-Yates/bc2f8507f00a419aebe9d9ccb56a68919cc19b46

Page Title: Evaluation of String Matching Algorithms
Address: https://pdfs.semanticscholar.org/8afc/6c601aa4ae2e0878c943735e75935e995b58.pdf

Page Title: Time-space-optimal string matching
Address: https://dl.acm.org/citation.cfm?id=802463

APPENDIX A

Source Code:

MS-word search

package thesis;

import de.schlichtherle.io.File;

import java.awt.event.KeyAdapter;

import java.awt.event.KeyEvent;

import java.awt.event.KeyListener;

import java.sql.Connection;

import java.sql.DriverManager;

import java.sql.PreparedStatement;

import java.sql.ResultSet;

import java.sql.SQLException;

import java.sql.Statement;

import java.util.Calendar;

import java.util.Date;

import java.util.GregorianCalendar;

import java.util.HashMap;

import java.util.Iterator;

import java.util.List;

import java.util.Map;

import java.util.regex.Matcher;

import java.util.regex.Pattern;

import javax.swing.DefaultListModel;

import javax.swing.SwingUtilities;

import org.apache.commons.io.FilenameUtils;

import javax.swing.JFileChooser;

import javax.swing.JOptionPane;

import static thesis.DOCXParser.search;

public class TextFinder extends javax.swing.JFrame {

public String query;

public Connection con;

public Statement state;

    public TextFinder() {
        initComponents();
        CurrentDate();

        // Force typed upper-case letters to lower case so queries match the lower-cased text.
        textSearchText.addKeyListener(new KeyAdapter() {
            @Override
            public void keyTyped(KeyEvent e) {
                char keyChar = e.getKeyChar();
                if (Character.isUpperCase(keyChar)) {
                    e.setKeyChar(Character.toLowerCase(keyChar));
                }
            }
        });

        // Open the MySQL connection used for the search log.
        try {
            Class.forName("com.mysql.jdbc.Driver");
            con = DriverManager.getConnection("jdbc:mysql://localhost:3306/searchlog", "root", "");
            state = con.createStatement();
        } catch (Exception dummy) {
            JOptionPane.showMessageDialog(null, dummy);
        }
    }

    // Show today's date as day/month/year in the date field.
    public void CurrentDate() {
        Calendar cal = new GregorianCalendar();
        int month = cal.get(Calendar.MONTH); // Calendar months are 0-based
        int year = cal.get(Calendar.YEAR);
        int day = cal.get(Calendar.DAY_OF_MONTH);
        date_text.setText(day + "/" + (month + 1) + "/" + year);
    }

    public void update() {
        // No update logic is implemented yet.
    }

    // Add the searched word to the dupword table unless it is already in the logs table.
    public void check() {
        String word = textSearchText.getText();
        // Parameterized queries keep the search box text from being run as SQL.
        try (PreparedStatement select = con.prepareStatement("SELECT * FROM logs WHERE word = ?")) {
            select.setString(1, word);
            try (ResultSet rs = select.executeQuery()) {
                // Call rs.next() only once: every call advances the cursor by one row.
                if (!rs.next()) {
                    try (PreparedStatement insert =
                            con.prepareStatement("INSERT INTO dupword (word) VALUES (?)")) {
                        insert.setString(1, word);
                        insert.execute();
                    }
                }
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }

    // Add a row to the search log: the searched word with the current time and date.
    public void insert() {
        // "H:m:s" gives unpadded hours:minutes:seconds without the deprecated Date getters.
        String time = new java.text.SimpleDateFormat("H:m:s").format(new Date());
        try (PreparedStatement ps =
                con.prepareStatement("INSERT INTO logs (word, time, date) VALUES (?, ?, ?)")) {
            ps.setString(1, textSearchText.getText());
            ps.setString(2, time);
            ps.setString(3, date_text.getText());
            ps.executeUpdate();
        } catch (Exception dummy) {
            JOptionPane.showMessageDialog(null, dummy);
        }
    }

    private void buttonStopActionPerformed(java.awt.event.ActionEvent evt) {
        worker.interrupt();
        buttonStop.setEnabled(false);
    }

    private void buttonSearchActionPerformed(java.awt.event.ActionEvent evt) {
        // First make sure the search path is a valid directory and a query was entered.
        try {
            matches = new HashMap(); // clear any old results
            listResults.setModel(new DefaultListModel()); // clear any old results
            File searchDir = new File(textSearchPath.getText());
            if (!searchDir.isDirectory()) {
                throw new Exception("The Search Path value does not appear to be a valid directory.");
            }
            if (textSearchText.getText().length() == 0) {
                throw new Exception("Please enter text to search for in the Containing field.");
            }
        } catch (Exception e) {
            // Report the problem and do not start a search with invalid criteria.
            JOptionPane.showMessageDialog(this, e.getMessage());
            return;
        }

        // Valid search criteria: run the search on a background thread.
        worker = new Thread() {
            private String fileNamePattern = null;
            private boolean interrupted = false;

            @Override
            public void run() {
                SwingUtilities.invokeLater(new Runnable() {
                    public void run() {
                        setEnableStates(true);
                    }
                });
                fileNamePattern = textFileName.getText();
                searchDirectory(textSearchPath.getText());
                SwingUtilities.invokeLater(new Runnable() {
                    public void run() {
                        setEnableStates(false);
                        labelSearching.setText("");
                    }
                });
                // check();
                insert();
            }

            public void searchDirectory(String directory) {
                File currentDir = new File(directory);
                File[] files = (de.schlichtherle.io.File[]) currentDir.listFiles();
                for (int i = 0; i < files.length && !interrupted; i++) {
                    // Update the search location visual cue.
                    final String fileName = files[i].getAbsolutePath();
                    SwingUtilities.invokeLater(new Runnable() {
                        public void run() {
                            labelSearching.setText(fileName);
                            labelSearching.setToolTipText(fileName);
                        }
                    });
                    if (files[i].isDirectory() && !files[i].isArchive() && checkRecursive.isSelected()) {
                        searchDirectory(files[i].getAbsolutePath());
                    } else if (files[i].isDirectory() && files[i].isEntry()) {
                        searchDirectory(files[i].getAbsolutePath());
                    } else if (!files[i].isDirectory()) { // a plain, ordinary file
                        checkIfMatch(files[i]);
                    } else {
                        // A normal directory with recursion off: ignore it.
                    }
                    if (Thread.interrupted()) {
                        interrupted = true;
                    }
                }
            }

            private void checkIfMatch(final File file) {
                if (FilenameUtils.wildcardMatchOnSystem(file.getName(), fileNamePattern)) {
                    // File name matches; parse the file according to its extension.
                    List matchingLines = null;
                    String path = file.getAbsolutePath().toLowerCase();
                    if (checkWord.isSelected() && path.endsWith(".doc")) {
                        matchingLines = MSWordParser.findMatches(file, textSearchText.getText());
                    } else if (checkWord.isSelected() && path.endsWith(".docx")) {
                        matchingLines = DOCXParser.findMatches(file, textSearchText.getText());
                    }
                    if (matchingLines != null) {
                        synchronized (matches) { // could be performing a concurrent read operation
                            matches.put(file.getAbsolutePath(), matchingLines);
                        }
                        SwingUtilities.invokeLater(new Runnable() {
                            public void run() {
                                DefaultListModel listModel = (DefaultListModel) listResults.getModel();
                                listModel.addElement(file.getAbsolutePath());
                            }
                        });
                    }
                }
            }
        };
        worker.start();
    }

public void jButtonAction(java.awt.event.ActionEvent evt) {

buttonSearchActionPerformed(evt);

}

private void buttonBrowseActionPerformed(java.awt.event.ActionEvent evt) {

fileChooser.setDialogTitle("Select Directory Search Path");

fileChooser.setFileSelectionMode(javax.swing.JFileChooser.DIRECTORIES_ONLY);

//APPROVE_OPTION is a static constant, so refer to it through the class

if (fileChooser.showOpenDialog(this) == javax.swing.JFileChooser.APPROVE_OPTION) {

textSearchPath.setText(fileChooser.getSelectedFile().getAbsolutePath());

}

}

private void listResultsValueChanged(javax.swing.event.ListSelectionEvent evt) {

if (listResults.getSelectedIndex() >= 0) {

List lines = null;

synchronized (matches) { //the search thread may be writing to the map at the same time

lines = (List) matches.get((String) listResults.getSelectedValue());

}

if (lines != null) {

Iterator it = lines.iterator();

StringBuffer text = new StringBuffer("");

while (it.hasNext()) {

text.append((String) it.next());

text.append("\n");

}

textLines.setText(text.toString());

} else {

textLines.setText("");

}

} else {

textLines.setText("");

}

}

private void checkWordActionPerformed(java.awt.event.ActionEvent evt) {

// TODO add your handling code here:

}

private void checkWord1ActionPerformed(java.awt.event.ActionEvent evt) {

// TODO add your handling code here:

}

private void jButton1ActionPerformed(java.awt.event.ActionEvent evt) {

logsearch search = new logsearch();

search.setVisible(true);

}

private void textSearchPathActionPerformed(java.awt.event.ActionEvent evt) {

// TODO add your handling code here:

}

private void setEnableStates(boolean searching) {
textFileName.setEnabled(!searching);

textSearchPath.setEnabled(!searching);

textSearchText.setEnabled(!searching);

checkRecursive.setEnabled(!searching);

checkArchives.setEnabled(!searching);

checkPDF.setEnabled(!searching);

// checkPlainText.setEnabled(!searching);

checkWord1.setEnabled(!searching);

checkWord.setEnabled(!searching);

// checkWord1.setEnabled(!searching);

buttonBrowse.setEnabled(!searching);

buttonSearch.setEnabled(!searching);

buttonStop.setEnabled(searching);

// Finder.getInstance().setBusy(searching);

}
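The directory traversal performed by searchDirectory and checkIfMatch in the listing above can also be sketched with the standard library alone. The following is a minimal illustration, not the program's actual implementation: the class name DirectoryWalkSketch and the helper findFiles are hypothetical, and java.nio.file is swapped in for the TrueZIP File class and FilenameUtils used in the listing, so archive entries are not searched here.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DirectoryWalkSketch {

    // Hypothetical helper: lists the regular files under root whose
    // file name matches a glob pattern such as "*.docx".
    static List<Path> findFiles(Path root, String glob, boolean recursive) throws IOException {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        // Depth 1 visits only the direct children of root (recursion switched off).
        int depth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(root, depth)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> matcher.matches(p.getFileName()))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```

Passing false for recursive corresponds to leaving the recursive check-box unselected: subdirectories are skipped and only files directly inside the chosen folder are tested against the pattern.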

DOCX parse

package thesis;

import de.schlichtherle.io.File;

import de.schlichtherle.io.FileInputStream;

import java.util.LinkedList;

import java.util.List;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;

import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DOCXParser {

TextFinder f = new TextFinder();

public static void search(String txt, String pat) {

long startTime = System.nanoTime() / 1000000; //start time in milliseconds

//NAIVE STRING ALGORITHM START

int M = pat.length();

int N = txt.length();

for (int i = 0; i <= N - M; i++) {

int j;

/* For current index i, check for pattern match */

for (j = 0; j < M; j++) {

if (txt.charAt(i + j) != pat.charAt(j)) {

break;

}

}

if (j == M) { //if pat[0...M-1] = txt[i, i+1, ...i+M-1]

System.out.println("Word found at index " + i);

}

}

long endTime = System.nanoTime() / 1000000;

long duration = (endTime - startTime); //elapsed time in milliseconds

System.out.println(duration);

}

public static List findMatches(File file, String text) {

// DOCXParser d = new DOCXParser();

List matchingLines = new LinkedList();

XWPFDocument doc = null;

XWPFWordExtractor ex = null;

String docText = null;

String line = null;

try {

doc = new XWPFDocument(new FileInputStream(file));

ex = new XWPFWordExtractor(doc);

docText = ex.getText();// + " " + header.getText() + " " + footer.getText();// + " " + ex.getHeaderText() + " " + ex.getFooterText();

//docText = docText.replaceAll("\\s", " ");

//lower-case both the text and the pattern so the search is case-insensitive

String lowerText = docText.toLowerCase();

String pat = text.toLowerCase();

search(lowerText, pat);

//NAIVE STRING MATCHING ALGORITHM END

//an empty pattern matches everywhere and would never advance the loop below

int index = pat.isEmpty() ? -1 : lowerText.indexOf(pat);

while (index >= 0) {

int start = index >= 20 ? index - 20 : 0;

int end = index + 20 < docText.length() ? index + 20 : docText.length();

line = docText.substring(start, end);

matchingLines.add(line);

index = lowerText.indexOf(pat, index + pat.length());

}

} catch (Exception e) {

//fall through to return, could be because file is not UTF-8 readable, or some other IOException

} finally {

//no cleanup

}

return matchingLines.size() > 0 ? matchingLines : null;

}

public static void main(String[] args) {

}

}
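The search method above only prints the positions it finds. As a minimal sketch of the same naive string matching technique in a form that is easier to reuse and test, the version below returns the match positions instead. The class name NaiveSearchSketch and the method findAll are hypothetical names introduced for illustration, and treating an empty pattern as having no matches is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSearchSketch {

    // Returns every index at which pat occurs in txt, overlapping matches included.
    static List<Integer> findAll(String txt, String pat) {
        List<Integer> hits = new ArrayList<>();
        int n = txt.length();
        int m = pat.length();
        if (m == 0) {
            return hits; // assumption: an empty pattern yields no matches
        }
        // Slide the pattern over the text one position at a time.
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            // Compare character by character until a mismatch or the pattern ends.
            while (j < m && txt.charAt(i + j) == pat.charAt(j)) {
                j++;
            }
            if (j == m) {
                hits.add(i); // pat[0..m-1] equals txt[i..i+m-1]
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("abababa", "aba")); // prints [0, 2, 4]
    }
}
```

In the worst case the inner comparison runs m times for each of the n - m + 1 starting positions, so the running time grows with the product of the text length and the pattern length.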

DOC parse

/*

* To change this license header, choose License Headers in Project Properties.

* To change this template file, choose Tools | Templates

* and open the template in the editor.

*/

package thesis;

/**

* @author Lenovo

*/

import de.schlichtherle.io.File;

import de.schlichtherle.io.FileInputStream;

import java.util.LinkedList;

import java.util.List;

import org.apache.poi.hwpf.HWPFDocument;

import org.apache.poi.hwpf.extractor.WordExtractor;

import static thesis.DOCXParser.search;

public class MSWordParser {

TextFinder f = new TextFinder();

public static void search(String txt, String pat) {

//NAIVE STRING ALGORITHM START

int M = pat.length();

int N = txt.length();

for (int i = 0; i <= N - M; i++) {

int j;

/* For current index i, check for pattern match */

for (j = 0; j < M; j++) {

if (txt.charAt(i + j) != pat.charAt(j)) {

break;

}

}

if (j == M) { // if pat[0...M-1] = txt[i, i+1, ...i+M-1]

System.out.println("Word found at index " + i);

}

}

}

public static List findMatches(File file, String text) {

List matchingLines = new LinkedList();

HWPFDocument doc = null;

WordExtractor ex = null;

String docText = null;

String line = null;

try {

doc = new HWPFDocument(new FileInputStream(file));

ex = new WordExtractor(doc);

docText = ex.getText();// + " " + ex.getHeaderText() + " " + ex.getFooterText();

// docText = docText.replaceAll("\\s", " ");

//lower-case both the text and the pattern so the search is case-insensitive

String lowerText = docText.toLowerCase();

String pat = text.toLowerCase();

search(lowerText, pat);

//NAIVE STRING MATCHING ALGORITHM END

//an empty pattern matches everywhere and would never advance the loop below

int index = pat.isEmpty() ? -1 : lowerText.indexOf(pat);

while (index >= 0) {

int start = index >= 20 ? index - 20 : 0;

int end = index + 20 < docText.length() ? index + 20 : docText.length();

line = docText.substring(start, end);

matchingLines.add(line);

index = lowerText.indexOf(pat, index + pat.length());

}

} catch (Exception e) {

//fall through to return, could be because file is not UTF-8 readable, or some other IOException

} finally {

//no cleanup

}

return matchingLines.size() > 0 ? matchingLines : null;

}

}
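Both findMatches methods collect a short excerpt of text around each hit. The sketch below isolates that step so it can be run without Apache POI. The class name SnippetSketch, the method snippets and the context parameter are hypothetical. Unlike the listings, it compares with String.regionMatches in ignore-case mode, so the original text is never lower-cased, and it extends each excerpt past the end of the match rather than a fixed 20 characters from its start.

```java
import java.util.ArrayList;
import java.util.List;

class SnippetSketch {

    // Hypothetical helper: returns the text surrounding each case-insensitive
    // occurrence of needle, with up to `context` characters on either side.
    static List<String> snippets(String text, String needle, int context) {
        List<String> out = new ArrayList<>();
        int m = needle.length();
        if (m == 0) {
            return out; // assumption: an empty search term yields no excerpts
        }
        int i = 0;
        while (i + m <= text.length()) {
            // Naive comparison at position i, ignoring case.
            if (text.regionMatches(true, i, needle, 0, m)) {
                int start = Math.max(0, i - context);
                int end = Math.min(text.length(), i + m + context);
                out.add(text.substring(start, end));
                i += m; // continue after this match, as the listings do
            } else {
                i++;
            }
        }
        return out;
    }
}
```

For example, snippets("The quick brown fox", "QUICK", 4) returns a single excerpt, "The quick bro".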

Database Connections

package databes;

import java.sql.Connection;

import javax.swing.*;

import java.sql.DriverManager;

public class DatabaseConnection {

Connection con = null;

public static Connection ConnecrDb(){

try{

Class.forName("com.mysql.jdbc.Driver");

Connection con =
DriverManager.getConnection("jdbc:mysql://localhost/searchlog","root","");

// JOptionPane.showMessageDialog(null, "Connected");

return con;

} catch(Exception e){

JOptionPane.showMessageDialog(null, e);

return null;

}

}

}

Search Log

package thesis;

import java.sql.*;

import javax.swing.*;

import net.proteanit.sql.DbUtils;

import databes.DatabaseConnection;

import java.util.Calendar;

import java.util.GregorianCalendar;

public class logsearch extends javax.swing.JFrame {

Connection con = null;

ResultSet rs = null;

PreparedStatement pst = null;

public logsearch() {

initComponents();

con = DatabaseConnection.ConnecrDb();

update_table();

CurrentDate();

}

private void update_table() {

try {

String sql = "select * from logs order by time desc";

pst = con.prepareStatement(sql);

rs = pst.executeQuery();

logtable.setModel(DbUtils.resultSetToTableModel(rs));

} catch (Exception e) {

JOptionPane.showMessageDialog(null, e);

}

}

public void CurrentDate() {

Calendar cal = new GregorianCalendar();

int month = cal.get(Calendar.MONTH); //zero-based: January is 0

int year = cal.get(Calendar.YEAR);

int day = cal.get(Calendar.DAY_OF_MONTH);

SLdate.setText(day + "/" + (month + 1) + "/" + year);

}

private void jButton1ActionPerformed(java.awt.event.ActionEvent evt) {

String sql = "delete from logs where word=?";

try {

pst = con.prepareStatement(sql);

pst.setString(1, jLabel1.getText());

pst.execute();

// JOptionPane.showMessageDialog(null, "Log History Deleted");

} catch (Exception e) {

JOptionPane.showMessageDialog(null, e);

}

update_table();

}

private void logtableMouseClicked(java.awt.event.MouseEvent evt) {

try {

int row = logtable.getSelectedRow();

String Table_click = (logtable.getModel().getValueAt(row, 0).toString());

//bind the clicked value as a parameter instead of concatenating it into the SQL text

String sql = "select * from logs where word=?";

pst = con.prepareStatement(sql);

pst.setString(1, Table_click);

rs = pst.executeQuery();

if (rs.next()) {

String add = rs.getString("word");

jLabel1.setText(add);

}

} catch (Exception e) {

JOptionPane.showMessageDialog(null, e);

}

}
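The CurrentDate method in the listing above assembles its day/month/year label by hand from Calendar fields, which requires remembering that Calendar.MONTH is zero-based. As a minimal alternative sketch, assuming Java 8 or later is available, the java.time API produces the same label from a format pattern. The class name DateLabelSketch and the method format are hypothetical names used only for this illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DateLabelSketch {

    // "d/M/yyyy" prints day and month without zero padding,
    // matching the hand-built string in CurrentDate.
    private static final DateTimeFormatter LABEL = DateTimeFormatter.ofPattern("d/M/yyyy");

    static String format(LocalDate date) {
        return date.format(LABEL);
    }

    public static void main(String[] args) {
        System.out.println(format(LocalDate.of(2019, 3, 18))); // prints 18/3/2019
    }
}
```

In the form itself, a call such as SLdate.setText(DateLabelSketch.format(LocalDate.now())) would play the role of the current method body.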

APPENDIX B

SCREENSHOTS

The following figures show the screen layouts of the designed program.

Main Screen

Browsing Path

Display of pattern matches in an MS-Word document

Display of the document containing the pattern match

Search Log

APPENDIX C

SOFTWARE SPECIFICATION

SOFTWARE SPECIFICATION

-Windows 10 Operating System, 64-bit

-Java

-Netbeans 8.0.1

-XAMPP

HARDWARE SPECIFICATION

-1TB HDD

-4GB DDR4 RAM

-Intel(R) Core(TM) i3-6006U CPU @ 2.00 GHz

-Intel(R) HD Graphics 520 (2112 MB)

RAMON MAGSAYSAY MEMORIAL COLLEGES


Office of the Program Director
INFORMATION TECHNOLOGY EDUCATION PROGRAM
General Santos City, Philippines
Document Type: THESIS / CAPSTONE PROJECT REPORTS
Document Title: Adviser-Advisee MOU
Document No.: DAP–03-01-29-B
Issue No.: SY20
Revision No.:
Effective Date: June 01, 2017
Page 1 of 1

Adviser – Advisee Memorandum of Understanding

All students in thesis programs must complete this form together with the
submission of a thesis topic for approval. The signatures of the student and
the adviser indicate that they intend to abide by the terms and provisions of
this agreement. A copy of the signed Memorandum should be submitted to
the Program Director.

Date MARCH 18, 2019


Student’s Name RUDJIE Q. CARILLO
Adviser’s Name HANZEL GRACE L. JARIOL
Degree MASTER IN INFORMATION TECHNOLOGY

Title of Project:
AN APPLICATION OF NAÏVE STRING MATCHING ALGORITHM IN SEARCHING MS-WORD
DOCUMENT CONTENT IN WINDOWS PLATFORM

Role and Responsibilities of the Adviser


All advisers are expected to have good knowledge of the research discipline.
The Thesis Adviser has the overall responsibility for guiding the student through
the process of the successful completion of a thesis that fulfills the expectations of
scholarly work at the appropriate level as well as meets the requirements of the
Department and the School. The following conditions have been read and agreed
upon by student and adviser:
• be able and willing to assume principal responsibility for advising the
student;
• have adequate time available for this work and be accessible to the student;
• provide adequate and timely feedback to both the student and the
Committee regarding student progress toward degree completion;
• guide and provide continuing feedback on the student's development of a
research project by providing input on the intellectual appropriateness of the
proposed activities, the reasonableness of project scope, acquisition of
necessary resources and expertise, necessary laboratory or computer
facilities, etc.;
• establish key academic milestones and communicate these to the student
and appropriately evaluate the student on meeting these milestones.
• Ensures that the study proposed by the student conforms to the standard of
the College and has immediate or potential impact on the research thrust of
the school.
• Guides the students in their Research / Capstone Project in the following
tasks while in the proposal stage:
  ◦ Defining the research problems/objectives in clear, specific terms
  ◦ Building a working bibliography for the research
  ◦ Identifying variables and formulating hypotheses, if any
  ◦ Determining the research design, population to be studied, research
    environment, instruments to be used and the data collection
    procedures
• Meets the advisee regularly (at least twice a month; note that the researcher
must seek a proper appointment) to answer questions and help resolve
impasses and conflicts.
• Points out errors in the development work, in the analysis, or in the
documentation. The adviser must remind the Proponents/Researchers to do
their work properly.
• Reviews thoroughly all deliverables at every stage of the Research /
Capstone Project, to ensure that they meet the department's standards. The
adviser may also require his/her Proponents/Researchers to submit progress
reports regularly.
• Recommends the Proponents/Researchers for Proposal Hearing and Oral
Defense. The adviser should not sign the Proposal Hearing Notice and the
Oral Defense Notice if he/she believes that the Proponents/Researchers are
not yet ready for Proposal Hearing and Oral Defense, respectively. Thus, if
the Proponents/Researchers fail to appear in the Proposal Hearing or Oral
Defense, it is partially the adviser's fault.
• Clarifies points during the Proposal Hearing and Oral Defense.
• Ensures that all required revisions are incorporated into the appropriate
documents and/or software.
• Keeps informed of the schedule of Research / Capstone Project activities,
required deliverables and deadlines.
• Recommends to the Proposal Hearing and Oral Defense panel the
nomination of his/her advisee’s Research / Capstone Project for an award.

Role and Responsibilities of the Student/Advisee


While it is expected that students receive guidance and support from their
adviser and all members of the Thesis Committee, the student is responsible for
actually defining and carrying out the program approved by the Thesis Committee
and completing the thesis/capstone project. As such, it is expected that the student
assumes a leadership role in defining and carrying out all aspects of his/her degree
program and thesis/capstone project. Within this context, students have the
following responsibilities:
• Keep informed of the Capstone Project Guidelines and Policies.
• Keep informed of the schedule of Research / Capstone Project activities,
required deliverables and deadlines posted by the Adviser and Dean.
• Submit on time all deliverables specified in this document as well as those to
be specified by the Adviser and Dean.
• Submit on time all requirements identified by the Capstone Project Oral
Defense Panel during the Oral Defense.
• Submit on time the requirements identified by the adviser throughout the
duration of the Capstone Project.
• Schedule regular meetings (at least once a week) with the Adviser
throughout the duration of the Capstone Project. The meetings serve as a
venue for the proponent to report the progress of the work, as well as raise
any issues or concerns.
• Schedule regular meetings (at least once in a semester) with the Dean
throughout the duration of the Capstone Project.
• Pay promptly the monetary obligation, namely the adviser’s fee of
P1,000.00 per semester, from school year 2014-2015.
• Comply with the deliverables required by the adviser. Failure to do so
subjects the advisee to exclusion from the research project, in which case
the capstone he/she has initiated will no longer qualify for oral defense.

In addition, the student and adviser should discuss/define:
• Ownership and use of data
• A plan for presentations and publications based on the thesis
• Authorship protocols for presentations and publications

CONFORME:

Signatures

RUDJIE Q. CARILLO MARCH 20, 2019


Student Date

HANZEL GRACE JARIOL, MIT MARCH 20, 2019


Adviser Date

RAMON MAGSAYSAY MEMORIAL COLLEGES

Office of the Program Director

INFORMATION TECHNOLOGY EDUCATION PROGRAM

General Santos City, Philippines

Document Type: THESIS / CAPSTONE PROJECT REPORTS

Document Title: Project Working Title

Document No.: DAP–03-01-29-D

Issue No.: SY20

Revision No.:

Effective Date: June 01, 2017

Page 1 of 1

PROJECT WORKING TITLE FORM


Proponent/Researcher:

RUDJIE Q. CARILLO
Proposed Project Title:
AN APPLICATION OF NAÏVE STRING MATCHING ALGORITHM IN SEARCHING MS-WORD DOCUMENT
CONTENT IN WINDOWS PLATFORM

Submitted by: Noted:

RUDJIE Q. CARILLO HANZEL GRACE JARIOL, MIT


(Signature of Researcher over printed name) (Signature of adviser over printed name)

Date: ______________________ Date: ______________________

Recommending Approval: Approved:

JIM JAMERO, MIT ETHEL L. OCZON, MSIS


(Panelist Signature over printed name) (Signature of the Dean over printed name)

Date: ______________________ Date: ______________________

RAMON MAGSAYSAY MEMORIAL COLLEGES


Office of the Program Director
INFORMATION TECHNOLOGY EDUCATION PROGRAM
General Santos City, Philippines
Document Type: THESIS / CAPSTONE PROJECT REPORTS
Document Title: Research / Capstone Title Hearing Notice
Document No.: DAP–03-01-29-C
Issue No.: SY20
Revision No.:
Effective Date: June 01, 2017
Page 1 of 1

RESEARCH / CAPSTONE TITLE HEARING NOTICE

Date filed: ____________________


Ref. Code: ____________________
Date: ____________________
Time: ____________________
Venue: ____________________

COLLEGE/ INSTITUTE/ DEPARTMENT: College of Information Technology Education
Research Title:

AN APPLICATION OF NAÏVE STRING MATCHING ALGORITHM IN SEARCHING MS-WORD DOCUMENT CONTENT
IN WINDOWS PLATFORM
Proponent:

RUDJIE Q. CARILLO___________________________________________________

CERTIFICATION
The undersigned members comprising the panel for oral examination hereby agree
to the schedule of hearing for the above research.

HANZEL GRACE L. JARIOL, MIT JIM JAMERO, MIT


RESEARCH ADVISER PANEL MEMBER 1

ERECKJUN E. CASTAÑO ETHEL L. OCZON, MSIS


PANEL MEMBER 2 PANEL CHAIR

CERTIFICATION

This is to certify that the undersigned has edited the manuscript of RUDJIE

Q. CARILLO entitled “AN APPLICATION OF NAÏVE STRING MATCHING

ALGORITHM IN SEARCHING MS-WORD DOCUMENT CONTENT IN WINDOWS

PLATFORM” as to its content and grammar.

Done this 21st day of March 2019.

This certification is issued upon the request of the student mentioned earlier

for whatever purpose it may serve.

VANGELINE O. ERUM, PhD


English Critic

CURRICULUM VITAE

PERSONAL INFORMATION

NAME: Rudjie Q. Carillo

City Address: Sarangani Homes P-1 Prk. Malakas Brgy San Isidro, General Santos City

Contact Number: 09354061270

Email Address: redjexcarillo11@gmail.com

Gender: Male

Age: 21 years old

Birthday: October 10, 1997

Place of Birth: General Santos City

EDUCATIONAL BACKGROUND

COLLEGE: Ramon Magsaysay Memorial Colleges, Pioneer Avenue, General Santos City

Course: Bachelor of Science in Computer Science

SECONDARY: Lagao National High School Aparente St. General Santos City

ELEMENTARY: Dadiangas West Central Elementary School, General Santos City
