


Automatic Discovery of Personal Name Aliases from the Web

1. Introduction:
An individual is typically referred to by numerous name aliases on the web. Accurate identification of the aliases of a given person name is useful in various web-related tasks such as information retrieval, sentiment analysis, personal name disambiguation, and relation extraction. We propose a method to extract aliases of a given personal name from the web. Given a personal name, the proposed method first extracts a set of candidate aliases. Second, we rank the extracted candidates according to the likelihood of a candidate being a correct alias of the given name. We propose a novel, automatically extracted lexical pattern-based approach to efficiently extract a large set of candidate aliases from snippets retrieved from a web search engine. We define numerous ranking scores to evaluate candidate aliases using three approaches: lexical pattern frequency, word co-occurrences in an anchor text graph, and page counts on the web. To construct a robust alias detection system, we integrate the different ranking scores into a single ranking function using ranking support vector machines. We evaluate the proposed method on three data sets: an English personal names data set, an English place names data set, and a Japanese personal names data set. The proposed method outperforms numerous baselines and previously proposed name alias extraction methods, achieving a statistically significant mean reciprocal rank (MRR) of 0.67. Experiments carried out using location names and Japanese personal names suggest the possibility of extending the proposed method to extract aliases for different types of named entities, and for different languages. The aliases extracted using the proposed method are successfully utilized in an information retrieval task and improve recall by 20 percent in a relation-detection task.
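For reference, mean reciprocal rank averages, over all query names, the reciprocal of the rank at which the first correct alias appears. The sketch below is only an illustration of that definition: the class and method names are hypothetical, and the convention that a rank of 0 means "no correct alias returned" is an assumption of this example, not something stated in the evaluation.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of the proposed system.
class MeanReciprocalRank {
    /**
     * @param firstCorrectRanks for each query name, the 1-based rank of the
     *        first correct alias, or 0 if no correct alias was returned
     * @return the mean of the reciprocal ranks (a miss contributes 0)
     */
    static double compute(List<Integer> firstCorrectRanks) {
        if (firstCorrectRanks.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int rank : firstCorrectRanks) {
            if (rank > 0) {
                sum += 1.0 / rank;
            }
        }
        return sum / firstCorrectRanks.size();
    }
}
```

With this definition, an MRR close to 1 means the correct alias is usually ranked first, while lower values mean it tends to appear further down the candidate list.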

2. System Analysis
2.1 Existing System:
Accurate identification of aliases of a given person name is useful in various web-related tasks such as information retrieval, sentiment analysis, personal name disambiguation, and relation extraction. We propose a method to extract aliases of a given personal name from the web. Given a personal name, the proposed method first extracts a set of candidate aliases. Second, we rank the extracted candidates according to the likelihood of a candidate being a correct alias of the given name. We propose a novel, automatically extracted lexical pattern-based approach to efficiently extract a large set of candidate aliases from snippets retrieved from a web search engine.

2.2 Proposed System:
The proposed method outperforms numerous baselines and previously proposed name alias extraction methods. The aliases extracted using the proposed method are successfully utilized in an information retrieval task and improve recall by 20 percent in a relation-detection task. The proposed method first extracts a set of candidate aliases. Second, we rank the extracted candidates according to the likelihood of a candidate being a correct alias of the given name.

2.2.1 Product Functionality

According to another embodiment of the present invention, a method comprises: generating a set of pre-computed materialized sub-graphs; receiving a search query having a search query term; in response to the receiving, accessing a set of pre-computed materialized sub-graphs; wherein the accessing comprises: accessing the text index based on the search query term to retrieve the corresponding term group identifier; and accessing the corresponding materialized sub-graph based on the term group identifier; executing a dynamic random-walk based search on only the corresponding materialized sub-graph; based on the executing, retrieving nodes in the dataset; and transmitting the nodes as results of the query.

Fig.2.2.1 Architecture of a BinRank system

Fig.2.2.1 shows a flowchart of a method 40 for query processing in accordance with an embodiment of the invention. In block 42, materialized sub-graphs are pre-computed. A search query is then received in block 44, and one of the pre-computed materialized sub-graphs is accessed using a text index in block 46. In block 48, an authority-based keyword search is executed on the materialized sub-graph. In block 50, nodes are retrieved from the dataset based on the keyword search. The retrieved nodes are transmitted as the results of the query in block 52.
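The query flow of blocks 44 to 52 might be sketched roughly as follows. This is a minimal illustration, not the system's implementation: all class, field and method names are hypothetical, and the search step is replaced by sorting on a stored score per node, an assumption made here to keep the example short where the described method would run a random walk on the sub-graph.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; every name here is hypothetical.
class QueryProcessorSketch {
    // Text index: search term -> term group identifier.
    private final Map<String, Integer> textIndex = new HashMap<>();
    // Pre-computed materialized sub-graphs: group identifier -> (node -> score).
    private final Map<Integer, Map<String, Double>> subGraphs = new HashMap<>();

    /** Registers one term group together with its pre-computed sub-graph. */
    void addGroup(int groupId, List<String> terms, Map<String, Double> nodeScores) {
        for (String term : terms) {
            textIndex.put(term, groupId);
        }
        subGraphs.put(groupId, nodeScores);
    }

    /** Returns the nodes of the matching sub-graph, highest score first. */
    List<String> search(String term) {
        Integer groupId = textIndex.get(term);   // block 46: look up the term group
        if (groupId == null) {
            return new ArrayList<>();            // unknown term: no results
        }
        Map<String, Double> graph = subGraphs.get(groupId);
        List<String> nodes = new ArrayList<>(graph.keySet());
        // Stand-in for blocks 48-50: rank by stored score instead of a random walk.
        nodes.sort((a, b) -> Double.compare(graph.get(b), graph.get(a)));
        return nodes;                            // block 52: results of the query
    }
}
```

The point of the design is that only one small sub-graph is touched per query; the full dataset is never searched at query time.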

Fig.2.2.2 flowchart for generating pre-computed materialized sub-graphs

According to one embodiment of the present invention, a method comprises: generating a set of pre-computed materialized sub-graphs from a dataset by: grouping all terms in said dataset; and for each group, executing a dynamic random-walk based search over the full dataset using terms in said group as random walk starting points; based on said executing, identifying important nodes; and using said nodes to construct a corresponding sub-graph; receiving a search query having at least one search query term; accessing a particular one of said pre-computed materialized sub-graphs; executing a dynamic authority-based keyword search on said particular one of said pre-computed materialized sub-graphs; retrieving nodes in said dataset based on said executing; and responding to said search query with results including said retrieved nodes.

Fig.2.2.2 shows a flowchart of a process 54 for generating pre-computed materialized sub-graphs in accordance with an embodiment of the invention. In block 56, all terms in the dataset are partitioned. A partition identifier is stored for each term in block 58. A random walk is then executed over each partition in block 60. In block 62, important nodes are identified for each partition based on the random walk. The important nodes are used to construct a corresponding sub-graph for each partition in block 64.
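Blocks 60 to 64 could be approximated by a random walk with restart followed by a top-k selection. The following is a sketch under stated assumptions, not the patented procedure: the class and method names are hypothetical, the damping factor and iteration count are free parameters chosen by the caller, and score mass that reaches a node without outgoing edges is simply dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and parameters are assumptions.
class SubGraphBuilderSketch {
    /**
     * Random walk with restart, computed by power iteration.
     *
     * @param graph      adjacency list: node -> outgoing neighbours (every node,
     *                   including every neighbour and start node, must be a key)
     * @param startNodes nodes containing the terms of one partition
     */
    static Map<String, Double> randomWalk(Map<String, List<String>> graph,
                                          List<String> startNodes,
                                          double damping, int iterations) {
        Map<String, Double> score = new HashMap<>();
        for (String node : graph.keySet()) {
            score.put(node, 0.0);
        }
        for (String s : startNodes) {
            score.put(s, 1.0 / startNodes.size());
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String node : graph.keySet()) {
                next.put(node, 0.0);
            }
            // Restart mass returns to the start nodes.
            for (String s : startNodes) {
                next.put(s, next.get(s) + (1 - damping) / startNodes.size());
            }
            // The remaining mass flows along outgoing edges.
            for (Map.Entry<String, List<String>> e : graph.entrySet()) {
                List<String> out = e.getValue();
                if (out.isEmpty()) {
                    continue; // mass at dangling nodes is dropped in this sketch
                }
                double share = damping * score.get(e.getKey()) / out.size();
                for (String target : out) {
                    next.put(target, next.get(target) + share);
                }
            }
            score = next;
        }
        return score;
    }

    /** Keeps the k highest-scoring nodes as the sub-graph nodes of a partition. */
    static List<String> importantNodes(Map<String, Double> score, int k) {
        List<String> nodes = new ArrayList<>(score.keySet());
        nodes.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
        return new ArrayList<>(nodes.subList(0, Math.min(k, nodes.size())));
    }
}
```

Running `randomWalk` once per partition and storing the result of `importantNodes` under the partition identifier would give one materialized sub-graph per partition.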

Requirement specification plays an important role in creating a quality software solution. Requirements are refined and analyzed to assess their clarity, and are represented in a manner that ultimately leads to successful software implementation. Each requirement must be consistent with the overall objective. The development of this project deals with the following requirements:


The selection of hardware is very important in the existence and proper working of any software. In the selection of hardware, the size and the capacity requirements are also important.

Content      Description
Processor    Pentium 4
Hard Disk    40 GB
RAM          1 GB


The software requirements specification is produced at the culmination of the analysis tasks. Once the system requirements are known, one of the most difficult tasks is selecting the software, that is, determining whether a particular software package fits the requirements.

Content        Description
OS             Windows XP with SP2 or Windows Vista
Database       MySQL
Technologies   Core Java, Advanced Java, HTML, Servlets, JSP, XML
IDE            MyEclipse
Browser        Mozilla Firefox, IE 6


Preliminary investigation examines project feasibility, that is, the likelihood that the system will be useful to the organization. The main objective of the feasibility study is to test the technical, operational and economic feasibility of adding new modules and debugging the old running system. All systems are feasible given unlimited resources and infinite time. There are three aspects in the feasibility study portion of the preliminary investigation: technical feasibility, operational feasibility, and economic feasibility.


The technical issues usually raised during the feasibility stage of the investigation include the following: Does the necessary technology exist to do what is suggested? Does the proposed equipment have the technical capacity to hold the data required to use the new system? Will the proposed system provide adequate response to inquiries, regardless of the number or location of users? Can the system be upgraded once developed? Are there technical guarantees of accuracy, reliability, ease of access and data security?

Earlier, no system existed to cater to the needs of the Secure Infrastructure Implementation System. The current system is technically feasible. It is a web-based user interface for audit workflow at NIC-CSD, and thus provides easy access to the users. The database's purpose is to create, establish and maintain a workflow among various entities in order to facilitate all concerned users in their various capacities or roles. Permission would be granted to users based on the roles specified. Therefore, it provides the technical guarantee of accuracy, reliability and security. The software and hardware requirements for the development of this project are few, and are either already available in-house at NIC or available free as open source. The work for the project is done with the current equipment and existing software technology. The necessary bandwidth exists for providing fast feedback to the users irrespective of the number of users using the system.


Proposed projects are beneficial only if they can be turned into information systems that will meet the organization's operating requirements. Operational feasibility is an important part of project implementation. Some of the important issues raised to test the operational feasibility of a project include the following: Is there sufficient support for the project from the management and the users? Will the system be used and work properly once it is developed and implemented? Will there be any resistance from the users that will undermine the possible application benefits?

This system is targeted to be in accordance with the above-mentioned issues. The management issues and user requirements have been taken into consideration beforehand, so user resistance that could undermine the possible application benefits is not expected. The well-planned design would ensure the optimal utilization of the computer resources and would help improve performance.


A system that can be developed technically, and that will be used if installed, must still be a good investment for the organization. In the economic feasibility study, the development cost of creating the system is evaluated against the ultimate benefit derived from the new system. Financial benefits must equal or exceed the costs. The system is economically feasible. It does not require any additional hardware or software. Since the interface for this system is developed using the existing resources and technologies available at NIC, the expenditure is nominal and the economic feasibility is certain.

5. SOFTWARE PROCESS MODEL:
Various software development approaches have been defined and designed for use during the software development process; these approaches are also referred to as "Software Development Process Models". Each process model follows a particular life cycle in order to ensure success in the process of software development. One such approach used in software development is "The Waterfall Model". The Waterfall approach was the first process model to be introduced and followed widely in software engineering to ensure the success of a project. In the Waterfall approach, the whole process of software development is divided into separate process phases. The phases in the Waterfall model are: Requirement Specifications, Software Design, Implementation, and Testing & Maintenance. All these phases are cascaded so that the second phase is started only when the defined set of goals has been achieved for the first phase and it is signed off; hence the name "Waterfall Model". All the methods and processes undertaken in the Waterfall Model are more visible.


STAGES:

Requirement Analysis & Definition: All possible requirements of the system to be developed are captured in this phase. Requirements are the set of functionalities and constraints that the end user (who will be using the system) expects from the system. The requirements are gathered from the end user by consultation. They are analyzed for their validity, and the possibility of incorporating them in the system to be developed is also studied. Finally, a Requirement Specification document is created, which serves as a guideline for the next phase of the model.

System & Software Design: Before starting the actual coding, it is highly important to understand what we are going to create and what it should look like. The requirement specifications from the first phase are studied in this phase and the system design is prepared. System design helps in specifying hardware and system requirements and also helps in defining the overall system architecture. The system design specifications serve as input for the next phase of the model.

Implementation & Unit Testing: On receiving the system design documents, the work is divided into modules/units and the actual coding is started. The system is first developed in small programs called units, which are integrated in the next phase. Each unit is developed and tested for its functionality; this is referred to as Unit Testing. Unit testing mainly verifies whether the modules/units meet their specifications.

Integration & System Testing: As specified above, the system is first divided into units, which are developed and tested for their functionalities. These units are integrated into a complete system during the Integration phase and tested to check whether all modules/units coordinate with each other and whether the system as a whole behaves as per the specifications. After the software has been tested successfully, it is delivered to the customer.

Operations & Maintenance: This phase of the Waterfall Model is a virtually never-ending (very long) phase. Generally, problems with the system that were not found during the development life cycle come up after its practical use starts, so issues related to the system are solved after its deployment. Not all problems appear immediately; they arise from time to time and need to be solved. Hence this process is referred to as Maintenance.

INTRODUCTION
Software design sits at the technical kernel of the software engineering process and is applied regardless of the development paradigm and area of application. Design is the first step in the development phase for any engineered product or system. The designer's goal is to produce a model or representation of an entity that will later be built. Once system requirements have been specified and analyzed, system design is the first of the three technical activities (design, code and test) that are required to build and verify software. The importance of design can be stated with a single word: quality. Design is the place where quality is fostered in software development. Design provides us with representations of software that can be assessed for quality. Design is the only way that we can accurately translate a customer's view into a finished software product or system. Software design serves as a foundation for all the software engineering steps that follow. Without a strong design we risk building an unstable system: one that will be difficult to test, and one whose quality cannot be assessed until the last stage. During design, progressive refinements of data structure, program structure, and procedural details are developed, reviewed and documented. System design can be viewed from either a technical or a project management perspective. From the technical point of view, design comprises four activities: architectural design, data structure design, interface design and procedural design.

6.1 CLASS DIAGRAM
A class is a description of a set of objects that share the same attributes, operations, relationships and semantics. Graphically it is rendered as a rectangle. An attribute is a named property of a class that defines a range of values that instances of that class may hold. An attribute represents some property of the thing you are modeling that is shared by all the objects of that class. An operation is the implementation of a service that can be requested from any object of the class to affect behavior. A class diagram shows a set of classes, interfaces, collaborations and their relationships.


Class            Attributes                          Operations
Controller       PageInfo                            doget()
SearchManager    KeyTerm, List                       doget(), dopost()
LoginManager     UserName, Password                  doget()
Learner          MateralisedSudGraphs, Documenid     doget(), dopost()
DbHandler        documentid, Binid, userdetails      insert(), update(), select(), delete(), isvalid()
Dbconstants      Drivername, url, Username, password querys()

Fig 1

6.2 USECASE DIAGRAM
A use case specifies the behavior of a system or a part of the system and is a description of a set of sequences of actions, including variants, which a system performs to yield an observable result of value to an actor. Use cases provide a way for the developers to come to a common understanding with the system's end users and domain experts. Graphically a use case is rendered as an ellipse. A use case diagram is just a special kind of diagram and shares the same common properties as all other diagrams: a name and graphical contents that are a projection into a model. In a use case diagram there is a system boundary; the actors stay outside the boundary and the use cases are kept inside it. Use case diagrams commonly contain use cases, actors, and dependency, generalization and association relationships.

Use case diagram labels: admin, user, data base, login, start learning, stop learning


Fig 2

6.3 SEQUENCE DIAGRAM
An interaction is a behavior that comprises a set of messages exchanged among a set of objects within a context to accomplish a purpose. We use interactions to model the dynamic aspects of the model. When an object passes a message to another object, the receiving object might in turn send a message to another object, which might send a message to yet another object, and so on. This stream of messages forms a sequence. Any sequence must have a beginning; the start of every sequence is rooted in some process or thread. Each process or thread within a system defines a distinct flow of control, and messages are ordered in sequence by time.
Sequence diagram participants: Search.jsp, Search Result.jsp, BackHand, Db
Sequence diagram messages: Enter Keyword; Forward the keyterm; Connect to database; Retrieve relevant documents; Forward list; Forward result; Forward posting list; Display the result

Fig 3

6.4 COMPONENT DIAGRAM
A component diagram is a special kind of diagram in UML. It describes the components used to implement the system's functionality, so component diagrams are used to visualize the physical components in a system. These components are libraries, packages, files, etc.

Component diagrams can also be described as a static implementation view of a system. Static implementation represents the organization of the components at a particular moment. The purpose of the component diagram can be summarized as follows:

1. Visualize the components of a system.
2. Construct executables by using forward and reverse engineering.
3. Describe the organization and relationships of the components.


Component diagram labels: search mechanism, search key word, view result, admin, preprocessor, query processor


4.4 TABLES

1. binrank1
ATTRIBUTE NAME   DATATYPE      CONSTRAINTS
Eid              INT(45)       Primary Key, Auto increment
Documenturl      VARCHAR(45)
Title
Description
priority
keyterm

2. bintable
ATTRIBUTE NAME   DATATYPE
SNo              INT(45)
binId            INT(45)
Documenturl      VARCHAR(45)
Title            VARCHAR(45)
Description      VARCHAR(45)
priority         INT(20)
keyterm          VARCHAR(45)

3. userdetails
ATTRIBUTE NAME   DATATYPE      CONSTRAINTS
Username         VARCHAR(45)   Foreign Keys
Password         VARCHAR(45)   Foreign Keys

7. IMPLEMENTATION:

7.1 MODULE DESCRIPTION:
After careful analysis, the system has been divided into the following modules:

1. Search Module: In this module we create a web page using JSP that allows the user and the administrator to enter their desired key term. The key term is forwarded to the controller, which handles the rest of the processing.

2. Admin Module: Whenever the administrator logs in to his account, the controller checks whether he is a valid user. If he is, the controller calls the file in which the materialized sub-graphs are created. Key terms from the initial database are placed into their respective bins, so that co-related key terms are kept under one bin, and the binId of each bin is stored using hash mapping. This thread runs every time the administrator logs in to the system, and the materialized sub-graphs are created.

3. Learner Module: In this module we update the priority of a particular key term. The priority is incremented according to the number of hits a particular document has received, and it is incremented in both the initial database and the materialized sub-graphs.

4. Preprocessing Module: In this module we implement the BinRank algorithm, creating bins according to the algorithm mentioned in this document.

5. Query Module: Whenever a user enters a keyword, the controller calls Dbhandler, which checks for that keyword in the materialized sub-graphs and displays the result.

We write two Java files, Dbhandler and Dbconstants. Dbhandler contains the code for connecting to the database. The main reason for using Dbconstants is that if we want to change our database in the future, it can be done by just changing the driver name, username and password.
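The split between Dbconstants and Dbhandler could look roughly like the sketch below. It is an illustration only: the class names carry a Sketch suffix to mark them as hypothetical, and the driver class, URL, schema name and credentials are placeholder assumptions rather than values taken from the project.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical constants; change these to point at a different database.
final class DbConstantsSketch {
    static final String DRIVER_NAME = "com.mysql.jdbc.Driver";        // assumed MySQL driver class
    static final String URL = "jdbc:mysql://localhost:3306/binrank";  // assumed host, port and schema
    static final String USERNAME = "root";                            // placeholder
    static final String PASSWORD = "secret";                          // placeholder

    private DbConstantsSketch() { }
}

// Hypothetical handler; it reads every connection setting from the constants class.
class DbHandlerSketch {
    /** Loads the configured driver and opens a connection. */
    static Connection connect() throws SQLException, ClassNotFoundException {
        Class.forName(DbConstantsSketch.DRIVER_NAME);
        return DriverManager.getConnection(
                DbConstantsSketch.URL,
                DbConstantsSketch.USERNAME,
                DbConstantsSketch.PASSWORD);
    }
}
```

Because the handler never hard-codes a connection setting, moving to another database would only require editing the constants class (and supplying the matching JDBC driver on the classpath).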

7.2 SAMPLE CODE:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.itp.binrank.dbhandler.DbHandler;
import com.itp.binrank.postingList.PostingList;

/** Background thread that regroups key terms into bins and stores the bins. */
public class BackHandThread extends Thread {
    boolean value;

    public BackHandThread(boolean value) {
        this.value = value;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void run() {
        try {
            // Work load: key term -> posting list.
            Map workload = DbHandler.totalWorkLoad();
            Map binList = packTermsIntoBins(workload, 10);
            if (binList != null) {
                DbHandler.deleteBins();
                DbHandler.insertBins(binList);
            } else {
                System.out.println("An error occurred while packing terms into bins; please try again later");
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }

    /**
     * Packs key terms into bins.
     *
     * @param w          work load: key term -> posting list (emptied by this method)
     * @param maxBinSize maximum number of key terms per bin
     * @return bin index -> bin, where each bin maps key term -> posting list
     */
    public static Map<Integer, Map<String, List<?>>> packTermsIntoBins(
            Map<String, List<?>> w, int maxBinSize) {
        Map<Integer, Map<String, List<?>>> bins = new HashMap<>();
        int index = 0;
        while (!w.isEmpty()) {
            Map<String, List<?>> bin = new HashMap<>();
            // Seed the bin with the term that has the longest posting list.
            String t = largestTerm(w, Integer.MAX_VALUE);
            // Cache the remaining terms that contain the seed term.
            List<String> cache = new ArrayList<>();
            for (String candidate : w.keySet()) {
                if (candidate.contains(t) && !candidate.equals(t)) {
                    cache.add(candidate);
                }
            }
            while (t != null) {
                bin.put(t, w.remove(t));
                // Drop cached terms that no longer fit into the bin;
                // they stay in the work load for a later bin.
                while (!cache.isEmpty() && bin.size() + cache.size() > maxBinSize) {
                    cache.remove(cache.size() - 1);
                }
                if (!cache.isEmpty()) {
                    t = cache.remove(cache.size() - 1);
                } else if (bin.size() < maxBinSize) {
                    // No related term left: take the longest posting list whose
                    // size does not exceed the remaining capacity of the bin.
                    t = largestTerm(w, maxBinSize - bin.size());
                } else {
                    t = null;
                }
            }
            bins.put(index++, bin);
        }
        return bins;
    }

    /** Returns the term with the longest posting list of at most limit entries, or null. */
    private static String largestTerm(Map<String, List<?>> w, int limit) {
        String best = null;
        int max = -1;
        for (Map.Entry<String, List<?>> e : w.entrySet()) {
            int size = (e.getValue() == null) ? 0 : e.getValue().size();
            if (size > max && size <= limit) {
                max = size;
                best = e.getKey();
            }
        }
        return best;
    }
}


7.3.1 Introduction to JAVA

Initially the language was called Oak, but it was renamed Java in 1995. The primary motivation for this language was the need for a platform-independent (i.e., architecture-neutral) language that could be used to create software to be embedded in various consumer electronic devices. Java is a programmer's language. Java is cohesive and consistent. Except for those constraints imposed by the Internet environment, Java gives the programmer full control. Finally, Java is to Internet programming what C was to system programming.

Importance of Java to the Internet
Java has had a profound effect on the Internet, because Java expands the universe of objects that can move about freely in cyberspace. In a network, two categories of objects are transmitted between the server and the personal computer: passive information and dynamic, active programs. Dynamic, self-executing programs cause serious problems in the areas of security and portability. Java addresses those concerns and, by doing so, has opened the door to an exciting new form of program.

Java can be used to create two types of programs: applications and applets. An application is a program that runs on our computer under the operating system of that computer. It is more or less like one created using C or C++. Java's ability to create applets makes it important. An applet is an application designed to be transmitted over the Internet and executed by a Java-compatible web browser. An applet is actually a tiny Java program, dynamically downloaded across the network, just like an image. The difference is that it is an intelligent program, not just a media file. It can react to user input and change dynamically.

Features of Java

Security
Every time that you download a normal program, you are risking a viral infection. Prior to Java, most users did not download executable programs frequently, and those who did scanned them for viruses prior to execution. Even so, most users still worried about the possibility of infecting their systems with a virus. In addition, another type of malicious program exists that must be guarded against. This type of program can gather private information, such as credit card numbers, bank account balances, and passwords. Java answers both these concerns by providing a firewall between a network application and your computer. When you use a Java-compatible web browser, you can safely download Java applets without fear of virus infection or malicious intent.

Portability
For programs to be dynamically downloaded to all the various types of platforms connected to the Internet, some means of generating portable executable code is needed. As you will see, the same mechanism that helps ensure security also helps create portability. Indeed, Java's solution to these two problems is both elegant and efficient.

The Byte code
The key that allows Java to solve the security and portability problems is that the output of the Java compiler is byte code. Byte code is a highly optimized set of instructions designed to be executed by the Java run-time system, which is called the Java Virtual Machine (JVM). That is, in its standard form, the JVM is an interpreter for byte code. Translating a Java program into byte code makes it much easier to run a program in a wide variety of environments. The reason is that once the run-time package exists for a given system, any Java program can run on it. Although Java was designed for interpretation, there is technically nothing about Java that prevents on-the-fly compilation of byte code into native code. Sun has just completed its Just in Time (JIT) compiler for byte code. When the JIT compiler is a part of the JVM, it compiles byte code into executable code in real time, on a piece-by-piece, demand basis. It is not possible to compile an entire Java program into executable code all at once, because Java performs various checks that can be done only at run time. The JIT compiles code as it is needed, during execution.


Java Virtual Machine (JVM)
Beyond the language, there is the Java virtual machine. The Java virtual machine is an important element of the Java technology. The virtual machine can be embedded within a web browser or an operating system. Once a piece of Java code is loaded onto a machine, it is verified. As part of the loading process, a class loader is invoked and performs byte code verification, which makes sure that the code that has been generated by the compiler will not corrupt the machine that it is loaded on. Byte code verification takes place after compilation, when the class is loaded, to make sure that the code is accurate and correct. So byte code verification is integral to the compiling and executing of Java code.

Overall Description

Picture showing the development process of a Java program (figure label: Java byte code)

Java programming produces byte codes and executes them. The first box indicates that the Java source code is located in a .java file that is processed with a Java compiler called javac. The Java compiler produces a file called a .class file, which contains the byte code. The .class file is then loaded across the network or loaded locally on your machine into the execution environment, the Java virtual machine, which interprets and executes the byte code.

Java Architecture
Java architecture provides a portable, robust, high-performing environment for development. Java provides portability by compiling source code into byte codes for the Java Virtual Machine, which are then interpreted on each platform by the run-time environment. Java is a dynamic system, able to load code when needed from a machine in the same room or across the planet.

Compilation of code
When you compile the code, the Java compiler creates machine code (called byte code) for a hypothetical machine called the Java Virtual Machine (JVM). The JVM is supposed to execute the byte code. The JVM was created to overcome the issue of portability. The code is written and compiled for one machine and interpreted on all machines. This machine is called the Java Virtual Machine.

Compiling and interpreting Java Source Code

Diagram: source code is compiled by a platform-specific Java compiler (PC, Macintosh) into Java byte code (platform independent), which is then run by the Java interpreter for each platform (PC, Macintosh, SPARC).


During run-time the Java interpreter tricks the byte code file into thinking that it is running on a Java Virtual Machine. In reality this could be an Intel Pentium running Windows 95, a Sun SPARC station running Solaris, or an Apple Macintosh running its own operating system, and all could receive code from any computer through the Internet and run the applets.

Simple
Java was designed to be easy for the professional programmer to learn and to use effectively. If you are an experienced C++ programmer, learning Java will be even easier, because Java inherits the C/C++ syntax and many of the object-oriented features of C++. Most of the confusing concepts from C++ are either left out of Java or implemented in a cleaner, more approachable manner. In Java there are a small number of clearly defined ways to accomplish a given task.

Object-Oriented
Java was not designed to be source-code compatible with any other language. This allowed the Java team the freedom to design with a blank slate. One outcome of this was a clean, usable, pragmatic approach to objects. The object model in Java is simple and easy to extend, while simple types, such as integers, are kept as high-performance non-objects.

Robust
The multi-platform environment of the Web places extraordinary demands on a program, because the program must execute reliably in a variety of systems. The ability to create robust programs was given a high priority in the design of Java. Java is a strictly typed language; it checks your code at compile time and at run time. Java virtually eliminates the problems of memory management, because deallocation is completely automatic. In a well-written Java program, all run-time errors can and should be managed by your program.
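As a small illustration of a program managing its own run-time errors, the sketch below handles malformed input instead of letting the exception terminate the program. The class and method names are hypothetical and chosen only for this example.

```java
// Illustrative only; the class and method names are hypothetical.
class SafeParseSketch {
    /** Parses a number typed by a user, falling back to a default on bad input. */
    static int parseOrDefault(String text, int fallback) {
        if (text == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // The run-time error is handled here rather than ending the program.
            return fallback;
        }
    }
}
```

The compiler checks that `parseOrDefault` is always called with a `String` and an `int`, while the `catch` block deals with the one failure that can only be detected at run time.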

7.3.2 Hyper Text Markup Language

Hypertext Markup Language (HTML), the language of the World Wide Web (WWW), allows users to produce Web pages that include text, graphics and pointers to other Web pages (hyperlinks). HTML is not a programming language; it is an application of ISO Standard 8879, SGML (Standard Generalized Markup Language), specialized to hypertext and adapted to the Web. The idea behind hypertext is that instead of reading text in a rigid linear structure, we can easily jump from one point to another. We can navigate through the information based on our interest and preference. A markup language is simply a series of elements, each delimited with special characters, that define how text or other items enclosed within the elements should be displayed. Hyperlinks are underlined or emphasized words that lead to other documents or to some portion of the same document.

HTML can be used to display any type of document on the host computer, which can be at a geographically different location. It is a versatile language and can be used on any platform or desktop. HTML provides tags (special codes) to make the document look attractive. HTML tags are not case-sensitive. Using graphics, fonts, different sizes, color, etc., can enhance the presentation of the document. Anything that is not a tag is part of the document itself.

Basic HTML Tags:

<!-- ... -->            Specifies comments
<A>...</A>              Creates hypertext links
<B>...</B>              Formats text as bold
<BIG>...</BIG>          Formats text in large font
<BODY>...</BODY>        Contains all tags and text in the HTML document
<CENTER>...</CENTER>    Centers text
<DD>...</DD>            Definition of a term
<DL>...</DL>            Creates definition list
<FONT>...</FONT>        Formats text with a particular font
<FORM>...</FORM>        Encloses a fill-out form
<FRAME>...</FRAME>      Defines a particular frame in a set of frames
<H#>...</H#>            Creates headings of different levels
<HEAD>...</HEAD>        Contains tags that specify information about a document
<HR>...</HR>            Creates a horizontal rule
<HTML>...</HTML>        Contains all other HTML tags
<META>...</META>        Provides meta-information about a document
<SCRIPT>...</SCRIPT>    Contains client-side or server-side script
<TABLE>...</TABLE>      Creates a table
<TD>...</TD>            Indicates table data in a table
<TR>...</TR>            Designates a table row
<TH>...</TH>            Creates a heading in a table


Advantages
An HTML document is small and hence easy to send over the net. It is small because it does not include formatted information.
HTML is platform independent.
HTML tags are not case-sensitive.

7.3.3 SERVLETS:

What are Java Servlets?
Servlets are server-side components that provide a powerful mechanism for developing server-side programs. Servlets provide component-based, platform-independent methods for building Web-based applications, without the performance limitations of CGI programs. Unlike proprietary server extension mechanisms (such as the Netscape Server API or Apache modules), servlets are server independent as well as platform independent. This leaves you free to select a "best of breed" strategy for your servers, platforms, and tools. Using servlets, web developers can create fast and efficient server-side applications that can run on any servlet-enabled web server.

Servlets run entirely inside the Java Virtual Machine. Because a servlet runs on the server side, it does not depend on browser compatibility. Servlets can access the entire family of Java APIs, including the JDBC API to access enterprise databases. Servlets can also access a library of HTTP-specific calls and receive all the benefits of the mature Java language, including portability, performance, reusability, and crash protection.

Today servlets are a popular choice for building interactive web applications. Third-party servlet containers are available for Apache Web Server, Microsoft IIS, and others. Servlet containers are usually components of web and application servers, such as BEA WebLogic Application Server, IBM WebSphere, Sun Java System Web Server, Sun Java System Application Server and others.

Servlets are not designed for a specific protocol, although they are most commonly used with HTTP. Servlets use the classes in the Java packages javax.servlet and javax.servlet.http. Servlets provide a way of creating sophisticated server-side extensions in a server, as they follow a standard framework and use the highly portable Java language.

A generic servlet contains the following five methods:

1. init()
public void init(ServletConfig config) throws ServletException
The init() method is called only once by the servlet container throughout the life of a servlet. Through this call the servlet learns that it has been placed into service. The servlet cannot be put into service if the init() method does not return within a fixed time set by the web server, or if it throws a ServletException.
Parameters - The init() method takes a ServletConfig object that contains the initialization parameters and the servlet's configuration, and throws a ServletException if an exception has occurred.

2. service()
public void service(ServletRequest req, ServletResponse res) throws ServletException, IOException
Once the servlet starts receiving requests, the service() method is called by the servlet container to respond. The servlet services the client's request with the help of two objects, javax.servlet.ServletRequest and javax.servlet.ServletResponse, which are passed by the servlet container. The status code of the response should always be set for a servlet that throws or sends an error.
Parameters - The service() method takes the ServletRequest object, which contains the client's request, and the ServletResponse object, which contains the servlet's response. The service() method throws ServletException and IOException.

3. getServletConfig()
public ServletConfig getServletConfig()
This method returns a ServletConfig object, which contains the parameters for initialization and startup of the servlet. This is the object that was passed to the init() method. A class that implements this interface stores the ServletConfig object in order to return it; the generic servlet class that implements the interface does this.
Returns - the ServletConfig object

4. getServletInfo()
public String getServletInfo()
This method returns information about the servlet, such as version and author. The returned string should be plain text and not any kind of markup.
Returns - a string that contains the information about the servlet

5. destroy()
public void destroy()
This method is called when the servlet needs to be closed. That is, before removing a servlet instance from service, the servlet container calls the destroy() method. The destroy() method is called after all the threads running in the servlet have exited, and once the servlet container has called destroy(), no service methods are called again. The servlet thus gets a chance to clean up all the resources it holds, such as memory and threads.


Life cycle of Servlet:
The life cycle of a servlet can be categorized into four parts:

1. Loading and Instantiation: The servlet container loads the servlet during startup or when the first request is made. The loading of the servlet depends on the <load-on-startup> element of the web.xml file. If <load-on-startup> has a value of zero or greater, the servlet is loaded when the container starts; otherwise it is loaded when the first request comes in for service. After loading the servlet, the container creates the instance of the servlet.

2. Initialization: After creating the instance, the servlet container calls the init() method and passes the servlet initialization parameters to it. The init() method must be called by the servlet container before the servlet can service any request. The initialization parameters persist until the servlet is destroyed. The init() method is called only once throughout the life cycle of the servlet. The servlet is available for service if it is initialized successfully; otherwise the servlet container unloads it.

3. Servicing the Request: After successfully completing the initialization process, the servlet is available for service. Each request is handled in a separate thread. The servlet container calls the service() method for servicing any request. The service() method determines the kind of request, calls the appropriate method (doGet() or doPost()) for handling it, and sends the response to the client using the methods of the response object.

4. Destroying the Servlet: If the servlet is no longer needed for servicing any request, the servlet container calls the destroy() method. Like the init() method, this method is called only once throughout the life cycle of the servlet. Once destroy() has been called, the servlet container sends no further requests to the servlet, and the servlet releases all the resources associated with it. The Java Virtual Machine then reclaims the memory associated with those resources through garbage collection.
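The ordering of the life-cycle calls (init once, service once per request, destroy once) can be sketched without a servlet container. The MiniServlet interface below is a hypothetical stand-in with simplified signatures; it is NOT the javax.servlet API, and the LifecycleDemo class only imitates what a container would do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the servlet contract (not javax.servlet).
interface MiniServlet {
    void init(String configParam);
    String service(String request);
    void destroy();
}

class HelloMiniServlet implements MiniServlet {
    // Records the order in which the life-cycle methods are invoked.
    final List<String> events = new ArrayList<>();
    private String greeting;

    public void init(String configParam) {
        greeting = configParam;          // keep the initialization parameter
        events.add("init");
    }

    public String service(String request) {
        events.add("service");
        return greeting + ", " + request;
    }

    public void destroy() {
        greeting = null;                 // release held resources
        events.add("destroy");
    }
}

class LifecycleDemo {
    // Plays the role of the container: init once, service per request, destroy once.
    static List<String> run(HelloMiniServlet servlet, String... requests) {
        List<String> responses = new ArrayList<>();
        servlet.init("Hello");
        for (String request : requests) {
            responses.add(servlet.service(request));
        }
        servlet.destroy();
        return responses;
    }

    public static void main(String[] args) {
        HelloMiniServlet servlet = new HelloMiniServlet();
        System.out.println(run(servlet, "first request", "second request"));
        System.out.println(servlet.events);
    }
}
```

In a real deployment the container, not application code, makes these calls, and the servlet would extend a class such as HttpServlet from the servlet API.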


Life Cycle of a Servlet

Several web.xml conveniences:
Servlet 2.5 introduces several small changes to the web.xml file to make it more convenient to use. For example, while writing a <filter-mapping>, we can now use an asterisk in a <servlet-name>, which will represent all servlets as well as JSPs.

Previously:
<filter-mapping>
  <filter-name>FilterName</filter-name>
  <servlet-name>ServletName</servlet-name>
</filter-mapping>

Now:
<filter-mapping>
  <filter-name>FilterName</filter-name>
  <servlet-name>*</servlet-name>
</filter-mapping>

Previously a <servlet-mapping> or <filter-mapping> could contain only one <url-pattern>, but now it can contain multiple <url-pattern> elements, for example:

<servlet-mapping>
  <servlet-name>abc</servlet-name>
  <url-pattern>/abc/*</url-pattern>
  <url-pattern>/abc/*</url-pattern>
</servlet-mapping>

Advantages of Java Servlets
1. Portability
2. Powerful
3. Efficiency
4. Safety
5. Integration
6. Extensibility
7. Inexpensive

Each of these points is described below.

Portability
Because servlets are written in Java and follow well-known standardized APIs, they are highly portable across operating systems and server implementations. We can develop a servlet on a Windows machine running the Tomcat server or any other server, and later deploy that servlet effortlessly on another operating system, such as a Unix server running the iPlanet/Netscape Application Server. Servlets are therefore write once, run anywhere (WORA) programs.

Powerful
We can do several things with servlets that were difficult or even impossible to do with CGI. For example, servlets can talk directly to the web server, while CGI programs cannot. Servlets can share data with each other, and they make database connection pools easy to implement. They can maintain a session by using the session tracking mechanism, which helps them keep information from request to request. They can do many other things that are difficult to implement in CGI programs.

Efficiency
Compared to CGI, servlet invocation is highly efficient. When a servlet is loaded in the server, it remains in the server's memory as a single object instance. With CGI, each request typically starts a new process; with servlets there are N threads but only a single copy of the servlet class. Multiple concurrent requests are handled by separate threads, so servlets are highly scalable.

Safety
Because servlets are written in Java, they inherit the strong type safety of the Java language. Java's automatic garbage collection and lack of pointers mean that servlets are generally safe from memory management problems. In servlets we can easily handle errors through Java's exception handling mechanism: if an error occurs, an exception is thrown and can be caught and handled.

Integration
Servlets are tightly integrated with the server. A servlet can use the server to translate file paths, perform logging, check authorization, perform MIME type mapping, and so on.

Extensibility
The servlet API is designed so that it can be easily extended. As it stands today, the servlet API supports HTTP servlets, but at a later date it can be extended for other types of servlets.

Inexpensive
A number of free web servers are available for personal or commercial use, whereas commercial web servers are relatively expensive. By using a freely available web server, you can add servlet support at little or no cost.
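The efficiency argument (one object instance, many request threads) can be sketched with plain Java threads. The names SharedHandler and SingleInstanceDemo are hypothetical and stand in for a servlet instance and its container; the thread pool size and request count are arbitrary values chosen for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a servlet: one instance shared by all request threads.
class SharedHandler {
    // Shared state must be thread-safe because many threads call handle() at once.
    private final AtomicInteger served = new AtomicInteger();

    void handle() {
        served.incrementAndGet();
    }

    int served() {
        return served.get();
    }
}

class SingleInstanceDemo {
    // Dispatches `requests` calls to ONE handler instance using `threads` worker threads.
    static int serve(int requests, int threads) {
        SharedHandler handler = new SharedHandler();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < requests; i++) {
            pool.execute(handler::handle);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("requests did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return handler.served();
    }

    public static void main(String[] args) {
        System.out.println("requests served: " + serve(100, 8));
    }
}
```

The same sharing is why instance fields of a real servlet need care: any field written during request handling is visible to every concurrent request.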

7.4 JDBC
Java Database Connectivity, or JDBC for short, is a technology that enables a Java program to manipulate data stored in a database.

1. What is JDBC?
JDBC is a Java application programming interface that allows Java programmers to access a database management system from Java code. It was developed by JavaSoft, a subsidiary of Sun Microsystems.

JDBC has four components:
1. The JDBC API.
2. The JDBC Driver Manager.
3. The JDBC Test Suite.
4. The JDBC-ODBC Bridge.

1. The JDBC API.
The JDBC application programming interface provides the facility for accessing a relational database from the Java programming language. The API provides an industry standard for independently connecting the Java programming language to a wide range of databases. The user can not only execute SQL statements, retrieve results, and update the data, but can also access the data from anywhere within a network because of Java's "Write Once, Run Anywhere" (WORA) capabilities. Through the JDBC API, the user can also access other tabular data sources, such as spreadsheets or flat files, even in a heterogeneous environment. The JDBC API is part of the Java platform, which includes the Java Standard Edition (Java SE) and the Java Enterprise Edition (Java EE). The JDBC API has four main interfaces. The JDBC 4.0 API is divided into two packages:
i) java.sql
ii) javax.sql
Both packages are included in the Java SE and Java EE platforms.

2. The JDBC Driver Manager.
The JDBC Driver Manager is a very important class that defines objects which connect Java applications to a JDBC driver. The Driver Manager is usually considered the backbone of the JDBC architecture. It is a small and simple class that provides a means of managing the different types of JDBC database drivers running in an application. The main responsibility of the Driver Manager is to load all the drivers found in the system properly and to select the most appropriate driver for opening a connection to a database. The Driver Manager also selects the most appropriate driver from the previously loaded drivers when a new database connection is opened.

3. The JDBC Test Suite.
The function of the JDBC driver test suite is to check whether JDBC drivers will run the user's program. The test suite is very useful for testing a JDBC-based driver during the testing period. It helps ensure that drivers meet the requirements of the Java 2 Platform, Enterprise Edition (J2EE).

4. The JDBC-ODBC Bridge.
The JDBC-ODBC bridge, also known as the JDBC Type 1 driver, is a database driver that uses an ODBC driver to connect to the database. This driver translates JDBC method calls into ODBC function calls. The bridge implements JDBC for any database for which an ODBC driver is available. The bridge is implemented as the sun.jdbc.odbc Java package and contains a native library used to access ODBC.

To conclude this topic: the first two components of JDBC, the JDBC API and the JDBC Driver Manager, are used to connect to the database and to build a Java program that uses SQL commands to communicate with any RDBMS. The last two components are used to communicate with ODBC or to test a web application in a specialized environment.
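A minimal sketch of how the JDBC API and the Driver Manager are used together is given below. The JDBC URL jdbc:hypothetical:aliasdb, the table name aliases, and its columns are assumptions made only for this illustration; they are not taken from the project. The classes and methods from java.sql (DriverManager.getConnection, PreparedStatement, ResultSet) are the standard ones.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class JdbcConnectDemo {

    // Asks the Driver Manager for a connection. If no registered driver accepts
    // the URL, DriverManager throws an SQLException and this method returns false.
    static boolean canConnect(String url) {
        try (Connection conn = DriverManager.getConnection(url)) {
            return conn != null;
        } catch (SQLException e) {
            return false;
        }
    }

    // Runs a parameterized query. The table and column names are hypothetical.
    static List<String> findAliases(Connection conn, String name) throws SQLException {
        String sql = "SELECT alias FROM aliases WHERE name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            try (ResultSet rs = ps.executeQuery()) {
                List<String> aliases = new ArrayList<>();
                while (rs.next()) {
                    aliases.add(rs.getString("alias"));
                }
                return aliases;
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical URL: no driver is registered for it, so no connection is made.
        System.out.println(canConnect("jdbc:hypothetical:aliasdb"));
    }
}
```

With a real database, the URL would follow the format documented by the chosen driver, and the driver's jar file would have to be on the classpath.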


JDBC Architecture
1. Database connections
2. SQL statements
3. Result Set
4. Database metadata
5. Prepared statements
6. Binary Large Objects (BLOBs)
7. Character Large Objects (CLOBs)
8. Callable statements
9. Database drivers
10. Driver manager

The JDBC API uses a Driver Manager and database-specific drivers to provide transparent connectivity to heterogeneous databases. The JDBC Driver Manager ensures that the correct driver is used to access each data source. The Driver Manager is capable of supporting multiple concurrent drivers connected to multiple heterogeneous databases. The location of the Driver Manager with respect to the JDBC drivers and the servlet is shown in the figure "Layers of the JDBC Architecture".

Layers of the JDBC Architecture


A JDBC driver translates standard JDBC calls into a network or database protocol, or into a database library API call, that facilitates communication with the database. This translation layer provides JDBC applications with database independence. If the back-end database changes, only the JDBC driver needs to be replaced, with few code modifications required. There are four distinct types of JDBC drivers.

JDBC Driver and Its Types

Type 1 JDBC-ODBC Bridge.
Type 1 drivers act as a "bridge" between JDBC and another database connectivity mechanism such as ODBC. The JDBC-ODBC bridge provides JDBC access using most standard ODBC drivers. This driver is included in the Java 2 SDK within the sun.jdbc.odbc package. In this driver the Java statements are converted to JDBC statements. A JDBC statement calls ODBC by using the JDBC-ODBC Bridge, and finally the query is executed by the database. This driver has serious limitations for many applications.

Type 1 JDBC Architecture


Type 2 Java to Native API.
Type 2 drivers use the Java Native Interface (JNI) to make calls to a local database library API. This driver converts the JDBC calls into database-specific calls for databases such as SQL, Oracle, etc. This driver communicates directly with the database server. It requires some native code to connect to the database. Type 2 drivers are usually faster than Type 1 drivers. Like Type 1 drivers, Type 2 drivers require native database client libraries to be installed and configured on the client machine.

Type 2 JDBC Architecture

Type 3 Java to Network Protocol Or All- Java Driver. Type 3 drivers are pure Java drivers that use a proprietary network protocol to communicate with JDBC middleware on the server. The middleware then translates the network protocol to database-specific function calls. Type 3 drivers are the most flexible JDBC solution because they do not require native database libraries on the client and can connect to many different databases on the back end. Type 3 drivers can be deployed over the Internet without client installation.

Java-------> JDBC statements------> SQL statements ------> databases. Type 3 JDBC Architecture

Type 4 Java to Database Protocol.
Type 4 drivers are pure Java drivers that implement a proprietary database protocol (like Oracle's SQL*Net) to communicate directly with the database. Like Type 3 drivers, they do not require native database libraries and can be deployed over the Internet without client installation. One drawback of Type 4 drivers is that they are database specific. Unlike with Type 3 drivers, if your back-end database changes, you may have to purchase and deploy a new Type 4 driver (some Type 4 drivers are available free of charge from the database manufacturer). However, because Type 4 drivers communicate directly with the database engine rather than through middleware or a native library, they are usually the fastest JDBC drivers available. This driver directly converts the Java statements to SQL statements.

Type 4 JDBC Architecture

So, you may be asking yourself, "Which is the right type of driver for my application?"

Well, that depends on the requirements of your particular project. If you do not have the opportunity or inclination to install and configure software on each client, you can rule out Type 1 and Type 2 drivers. However, if the cost of Type 3 or Type 4 drivers is prohibitive, Type 1 and Type 2 drivers may become more attractive because they are usually available free of charge. Price aside, the debate will often boil down to whether to use a Type 3 or a Type 4 driver for a particular application. In this case, you may need to weigh the benefits of flexibility and interoperability against performance. Type 3 drivers offer your application the ability to transparently access different types of databases, while Type 4 drivers usually exhibit better performance and, like Type 1 and Type 2 drivers, may be available free of charge from the database manufacturer.


Testing is one of the most important phases of the software development activity. In the software development life cycle (SDLC), the main aim of the testing process is quality: the developed software is tested to check that it attains the required functionality and performance. During the testing process the software is run with particular test cases, and the output of the test cases is analyzed to determine whether the software is working according to expectations. The success of the testing process in finding errors depends mostly on the test case criteria. To test any software we need a description of the expected behavior of the system and a method of determining whether the observed behavior conforms to the expected behavior.

Since errors can be introduced into the software at any stage, we have to carry out the testing process at different levels during development. The basic levels of testing are Unit, Integration, System and Acceptance Testing. Unit testing is carried out on the code: the different modules are tested against the specifications produced for them during design. In integration testing, the tested modules are combined into subsystems and tested. In system testing the full software is tested, and in the next level, acceptance testing, the system is tested against the user requirements document prepared during the SRS phase. There are two basic approaches to testing: functional testing and structural testing.

In Functional Testing, test cases are decided solely on the basis of the requirements of the program or module; the internals of the program or module are not considered when selecting test cases. This is also called Black Box Testing.

In Structural Testing test cases are generated on actual code of the program or module to be tested. This is called White Box Testing.

A number of activities must be performed for testing software. Testing starts with a test plan. The test plan identifies all testing-related activities that need to be performed, along with the schedule and guidelines for testing. The plan also specifies the levels of testing that need to be done by identifying the different testing units. For each unit specified in the plan, first the test cases and reports are produced. These reports are then analyzed.

TEST PLAN:
The test plan is a general document for the entire project, which defines the scope, the approach to be taken, and the personnel responsible for the different testing activities. The inputs for forming the test plan are:
Project plan
Requirements document
System design

TEST CASE SPECIFICATION:
Although there is one test plan for the entire project, test cases have to be specified separately for each unit. The test case specification gives, for each item to be tested, all test cases and the outputs expected for those test cases.

TEST CASE EXECUTION AND ANALYSIS:
The steps to be performed for executing the test cases are specified in a separate document called the test procedure specification. This document specifies any special requirements that exist for setting up the test environment and describes the methods and formats for reporting the results of testing.

UNIT TESTING:
Unit testing focused first on the smallest, lowest-level modules, proceeding one at a time. Bottom-up testing was performed on each module. A driver program that exercises the modules had to be developed or reused. For the purpose of testing, however, the modules themselves were used as stubs to print verification of the actions performed. After the lower-level modules were tested, the modules at the next higher level that make use of them were tested. Each module was tested against the required functionality, and test cases were developed to test the boundary values.

INTEGRATION TESTING:
Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing. As the system consists of a number of modules, the interfaces tested were those between pairs of modules. The software was tested using an incremental bottom-up approach. The bottom-up integration strategy was implemented with the following step: low-level modules were combined into clusters that perform specific software sub-functions.

SYSTEM TESTING:
System testing is a series of different tests whose primary purpose is to fully exercise the computer-based system. It also looks for discrepancies between the system and its original objectives and current specifications.



Test case name     | Test procedure  | Precondition                  | Expected result    | Result  | Specification document
ADMIN Sign in Form | Update Priority |                               | Login Successfully | Success |
Search for keyword | Enter Keyword   | Enter valid Keyword           | Display Result     |         | SeachResult.jsp
Search for keyword | Enter Keyword   | Enter at Least Single letter  | Display Result     | Success | SeachResult.jsp
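The keyword test cases above, together with the boundary-value testing described under unit testing, can be sketched as a small check. The class KeywordValidator and the rule it implements (a keyword is accepted if it contains at least one letter) are assumptions derived from the precondition "Enter at Least Single letter"; the project's actual validation code is not shown in this report.

```java
// Hypothetical validator; the rule is an assumption based on the test case table.
class KeywordValidator {

    // Accepts a keyword only if it contains at least one letter.
    static boolean isValid(String keyword) {
        if (keyword == null) {
            return false;
        }
        return keyword.chars().anyMatch(Character::isLetter);
    }

    public static void main(String[] args) {
        // Boundary values: empty input, whitespace only, and a single letter.
        String[] inputs = {"", "   ", "a", "Hideki Matsui"};
        for (String input : inputs) {
            System.out.println("\"" + input + "\" -> " + isValid(input));
        }
    }
}
```

The single-letter and empty-string inputs sit on either side of the boundary, which is where such a check is most likely to fail.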


8. INPUT & OUTPUT SCREENS

8.1 Search Page:


8.2 Login Page:



10. CONCLUSION
It has been a great pleasure for us to work on this exciting and challenging project. The project proved valuable to us, as it provided practical knowledge not only of programming web-based applications in Java and servlets on the Tomcat server (and, to some extent, Windows applications), but also of all the handling procedures related to Automatic Discovery of Personal Name Aliases from the Web. It also provided knowledge of the latest technology used in developing web-enabled applications and of client-server technology, which will be in great demand in the future. This will provide better opportunities and guidance for developing projects independently in the future. Now we have demonstrated that BinRank can achieve subsecond query execution time on the English Wikipedia data set, while producing high-quality search results that closely approximate the results of ObjectRank on the original graph.


11. USER MANUAL:
To use our software you need to have the following software components installed on your PC:
1. JAVA
2. Apache Tomcat server 6.0
3. XAMPP Control Panel (default MySQL database)

1. JAVA:
To download Java, go to the following link, select your OS type and version, and install it on your system. After installing Java, set the following paths in your system environment options.

Setting PATH and CLASSPATH
Determining the current values of PATH and CLASSPATH:

1. Unix
Type these commands in a command window:
echo $PATH
echo $CLASSPATH
If you get a blank command line in response to either of these, then that particular variable has no value (it has not yet been set).

2. Windows
Type these commands in a command window:
echo %PATH%
echo %CLASSPATH%
If you get the message "echo is on" for either of these, then that particular variable has no value (it has not yet been set).

3. Windows 98
First, try the instructions for "Other Versions of Windows." If you are able to set PATH via the System window, great! Otherwise, you will need to modify a line of text in the c:\autoexec.bat file and restart your computer.
Start Notepad (Start > Program Files > Accessories > Notepad).
Open c:\autoexec.bat (File > Open, change to the c: folder and look for and open autoexec.bat. If you don't find one, create one.)
You might find one or more lines that start with "set path". Look for one. If you do not find any lines starting with "set path", then add this new line to the end of the autoexec.bat file:
set path=c:\j2sdk1.4.1_01\bin
(This assumes that you really do have such a folder after you installed Java SDK 1.4. Please verify as needed.)
If you already have one or more lines starting with "set path", go to the last one. If it does not currently include "c:\j2sdk1.4.1_01\bin", then add a semicolon to the right-hand end of the "set path" expression and then add this:
c:\j2sdk1.4.1_01\bin
Save the changes to autoexec.bat.
You might find one or more lines that start with "set classpath". Look for one. If you do not find any lines starting with "set classpath", exit Notepad and any other programs and restart your computer. Try the java and javac commands again to see if they work now. You can ignore the remaining instructions below.
If you already have one or more lines starting with "set classpath", go to the last one. Add this to the right-hand end of the "set classpath" expression:
Save the changes to autoexec.bat, exit Notepad and any other programs, and restart your computer. Try the java and javac commands again.

2. APACHE:
Download the Apache Tomcat server from the following website and install it by assigning a port number, username and password.


3. XAMPP Control Panel (Default MySQL Database):

Download XAMPP from the following web site:

STANDARD INSTALLATION:
1. Double-click on the Windows installation icon. The installation will commence, and standard options are presented. XAMPP installs by default to C:\Program Files\Xampp. Check the relevant boxes to install Apache, MySQL, and FileZilla as a service on NT-type systems such as NT4, W2K and XP (recommended). This means they start up with Windows, and Windows closes them when it shuts down.
2. Run the program by clicking the Start Menu item.
3. Start or stop the individual applications via the XAMPP Control Panel, in the Windows Start menu.
4. To uninstall: Windows - Control Panel - Add/Remove Programs - click the XAMPP entry.
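Once Java is installed, the PATH and CLASSPATH settings described under step 1 (JAVA) can also be checked from a small Java program, as a cross-check on the echo commands. This is only a sketch: the class name EnvCheck is hypothetical, while System.getenv and System.getProperty are standard Java calls.

```java
// Hypothetical helper for checking the environment after installation.
class EnvCheck {

    // Formats one variable the way the echo commands would be interpreted:
    // either its value, or a note that it has not been set.
    static String describe(String name, String value) {
        return name + (value == null || value.isEmpty() ? " is not set" : " = " + value);
    }

    public static void main(String[] args) {
        System.out.println(describe("PATH", System.getenv("PATH")));
        System.out.println(describe("CLASSPATH", System.getenv("CLASSPATH")));
        // The running JVM version confirms which Java installation was found.
        System.out.println(describe("java.version", System.getProperty("java.version")));
    }
}
```

If the program compiles with javac and runs with java from a command window, the PATH entry for the Java bin folder is working.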


12. BIBLIOGRAPHY

JAVA Technologies:
JAVA Complete Reference.
Java Script Programming by Yehuda Shiran.
JAVA Server Pages by Larne Pekowsky.
T.H. Haveliwala, "Topic-Sensitive PageRank," Proc. 2002.
G. Jeh and J. Widom, "Scaling Personalized Web Search," Proc.
A. Balmin, V. Hristidis, and Y. Papakonstantinou, "ObjectRank: Authority-Based Keyword Search in Databases," Proc. Int'l Conf. Very Large Data Bases (VLDB), 2004.

DATA MINING:
Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber.