
Project Report on

Custom Search App

ENGI 981B
Mehakdeep Singh Chhina (202193857)

Instructor: Professor Cheng Li


Project Supervisor: Professor Thumeera R. Wanasinghe

Table of Contents

Abstract
Introduction
Methodologies
Architecture
Design Details
Working Details
Conclusion
Future Work

1. Abstract
The 'Custom Web Crawlers' application is a dynamic and efficient solution designed to collect,
organize, and analyze data using web crawlers, Elasticsearch, and a robust search
engine. Leveraging the power of Node.js and its event-driven, non-blocking I/O model, the
application seamlessly executes web crawlers and enables users to retrieve extensive information
from diverse online sources.
Through a user-friendly frontend interface, users can easily register, log in, and initiate web
crawling processes. The backend, driven by Node.js child processes, efficiently manages the
concurrent execution of multiple web crawlers, ensuring optimal data retrieval without
compromising performance.
The application's core functionality includes crawling data from LinkedIn profiles and extends
its capabilities to incorporate diverse web crawlers, including YouTube, websites, and more. The
use of Elasticsearch as the data repository ensures efficient storage and retrieval of the crawled
data, providing users with a powerful search engine to explore and analyze the collected
information.
Moreover, the application's AI-driven search relevance enhances the user experience by studying
users' past search patterns. This feature empowers users to discover personalized and meaningful
insights, making data-driven decision-making an effortless process.
As the 'Custom Search App' evolves, its range of potential use cases continues to grow. From
competitive analysis and market research to sentiment analysis and social media monitoring,
users can leverage the application's versatile web crawlers and Elasticsearch capabilities to gain
strategic insights across various domains.
By focusing on search engine optimization, web crawlers, and Elasticsearch integration, the
'Custom Search App' delivers efficient data collection, seamless search experiences, and
extensive data analysis possibilities to users. This application paves the way for transformative
data exploration, enabling users to harness the vast information available online to make
informed decisions and gain a competitive edge in their endeavors.

2. Introduction
The 'Custom Search App' has become an indispensable tool in our data-driven world,
revolutionizing the way we gather and process information from diverse online sources. As the
use of web apps and mobile devices continues to surge, this powerful application plays a crucial
role in aggregating data and making it easily accessible for businesses and individuals alike.
The primary purpose of the 'Custom Search App' is to collect data from various internet locations
and offer intelligent search functions tailored to specific business needs. Whether integrated as a
search bar on a website or utilized for personal data handling, this versatile app offers a wide
range of options to cater to individual requirements.
By empowering users with multiple web crawler options, the 'Custom Search App' effectively
consolidates data from various sources into a unified repository. The search client interface
further enhances the user experience, allowing selective search and application of filters to the
collected data. For instance, businesses seeking competitive insights can leverage the app to
compile and analyze competitors' advertising strategies across multiple platforms, gaining a
significant competitive edge.
The seamless integration of concurrent programming and web crawlers has been instrumental in
the success of the 'Custom Search App.' Leveraging concurrent processes optimizes data
gathering from multiple sources simultaneously, enhancing application performance and
scalability.
As businesses and individuals increasingly rely on data to drive their decisions, the 'Custom
Search App' plays a pivotal role in unlocking valuable information and empowering users with
actionable insights. With its efficient web crawling capabilities and seamless integration with
Elasticsearch, this advanced application continues to shape the landscape of data acquisition and
analysis, opening up endless possibilities for data exploration and strategic decision-making.

3. Methodologies
The development of the Custom Search App application involved a meticulous and well-planned
approach to effectively implement web crawling and concurrent programming functionalities.
Careful consideration was given to technology selection, web crawling strategies, and the secure
handling of data. The incorporation of concurrent programming using Node child processes
enabled efficient data collection from multiple sources, contributing to the application's overall
speed and performance.

Tools and Technologies

React.js
React.js, developed and maintained by Facebook, is a widely adopted free, open-source
JavaScript library renowned for its versatility and robustness. As the main frontend technology for
our project, React provides a rich set of tools and components that facilitate the creation of
interactive and dynamic user interfaces. Its component-based architecture promotes code

4
reusability and simplifies the management of complex UI elements, making it an ideal choice for
building the frontend of the Custom Web Crawlers application. Additionally, the vibrant React
community and vast collection of third-party libraries contribute to the continuous growth and
innovation within the React ecosystem, further enhancing the development process and overall
user experience.

Node.js
Node.js (Node) is an open-source runtime environment for executing JavaScript code
server-side. We use Node.js for our backend processes due to its event-driven, non-blocking I/O
model, which ensures efficient handling of concurrent tasks, making it an excellent fit for our
web crawlers. With Node's ability to manage multiple connections simultaneously, we can
efficiently gather data from various internet sources, providing users with a seamless and
responsive crawling experience. Its lightweight and scalable nature also ensures that our
application can handle a large number of crawling requests with ease.

Elasticsearch
Elasticsearch, being a highly capable full-text search engine, offers advanced search
functionalities that greatly enhance the search experience for our Custom Web Crawlers
application. Its ability to handle large volumes of data and perform complex searches at high
speeds ensures quick and accurate retrieval of relevant information for users. Moreover,
Elasticsearch's integration with AI technologies allows us to implement intelligent ranking
algorithms, making search results more tailored and personalized to individual user preferences.
By seamlessly passing the user's search query from the query engine to Elasticsearch, we provide
users with a smooth and efficient search process, enabling them to obtain meaningful insights
and make data-driven decisions with ease.

Child Processes (Node.js)

Node.js Child Processes enable concurrent programming by allowing the execution of multiple
processes simultaneously, leveraging the system's multi-core architecture efficiently. By creating
child processes, each process operates independently, executing specific tasks without blocking
the main application's event loop. This concurrency approach enhances the performance of
Custom Web Crawlers, enabling the application to crawl multiple web pages concurrently,
significantly reducing data retrieval time and improving overall scalability. Additionally, Node
child processes facilitate seamless communication between parent and child processes, enabling
efficient data sharing and synchronization, further enhancing the web crawling process and
optimizing resource utilization.
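
A minimal sketch of this pattern is shown below. It is illustrative only: the worker file name, the message shape, and the list of target sites are assumptions rather than the application's actual code.

// run-crawlers.js - sketch of concurrent crawling with Node.js child processes.
// The worker script name and message format below are hypothetical.
const { fork } = require('child_process');

const sites = ['https://example.com', 'https://example.org'];

sites.forEach((url) => {
  // fork() starts an independent Node.js process, so each crawl runs in parallel
  // without blocking the parent application's event loop.
  const worker = fork('./crawler-worker.js');

  // Send the target URL to the child over the built-in IPC channel.
  worker.send({ url });

  // Receive crawled results back from the child when it finishes.
  worker.on('message', (result) => {
    console.log(`Crawled ${url}: ${result.pages} pages collected`);
  });

  worker.on('exit', (code) => {
    console.log(`Worker for ${url} exited with code ${code}`);
  });
});

Each worker would implement the actual page fetching and send its results back with process.send(), keeping the heavy crawling work out of the main process.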

4. Architecture

The block diagram showcases the core architecture of the Custom Web Crawlers application,
divided into frontend and backend components. In the frontend, end users interact with the
Search UI, submitting their search queries. These queries are then processed by the Query
Engine, which communicates with the backend for data retrieval. The backend consists of the
WEB and Crawler modules, responsible for web crawling tasks and data collection from various
online sources. The gathered data is stored in the Repository, leveraging Elasticsearch for
efficient storage and retrieval. The ranked search results are delivered back to the frontend via
the Query Engine, enabling end users to access valuable information seamlessly.
The backend's interconnected modules, WEB and Crawler, work in synergy to efficiently gather
and process data from the internet. The Crawler module actively navigates through web pages,
collecting relevant information, while the Repository ensures proper storage in Elasticsearch.
This robust backend infrastructure enables the application to manage a vast amount of data
efficiently. When end users submit search queries through the frontend, the Query Engine
interacts with the Repository to retrieve relevant data, delivering the search results to the Search
UI. Overall, the block diagram illustrates the well-coordinated flow of information, enabling
Custom Web Crawlers to provide users with an effective and user-friendly search experience.
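
To make this flow concrete, the following sketch shows one way the backend could expose the Crawler and Query Engine modules to the frontend. It assumes an Express-style HTTP server and hypothetical module paths; the real application's wiring may differ.

// server.js - hypothetical wiring of the backend modules described above.
const express = require('express');
const { startCrawler } = require('./crawlers/linkedin'); // assumed module path
const { search } = require('./queryEngine');             // assumed module path

const app = express();
app.use(express.json());

// Crawler module: the frontend asks the backend to start collecting data from a source.
app.post('/api/crawler/start', startCrawler);

// Query Engine: the Search UI submits queries here; results are fetched from the
// Elasticsearch repository and returned to the frontend.
app.post('/api/search', async (req, res) => {
  const hits = await search(req.body);
  res.json(hits);
});

app.listen(3000, () => console.log('Custom Web Crawlers backend listening on port 3000'));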

5. Design Details

Flow Chart

Upon registration, users gain access to the Custom Web Crawlers application and can log in
using their credentials. During the login process, the system validates the user's credentials to
ensure their authenticity. If the provided credentials are valid, the user is granted access to the
application, and they can proceed to utilize its functionalities. However, if the credentials are
invalid or do not match the registered information, the system denies access, safeguarding the
application from unauthorized usage.
Once a valid user gains access to the application, they are presented with two primary options:
utilizing the web crawler functionality or employing the search engine. Users interested in web
crawling can initiate the process by adding specific websites they wish to crawl. The
user-friendly interface enables users to easily input the URLs of websites they want to scrape
data from. Additionally, the application allows users to customize the crawling process further by
selecting additional options, such as login credentials for websites that require account access to
share data. This flexibility empowers users to tailor the web crawling process to meet their
unique requirements.
Alternatively, users can choose to utilize the search engine aspect of the application. In this
scenario, the user can simply download sample HTML and JavaScript files provided by the
application. By running these files on their system, the user can quickly obtain search results
from the existing data available within the application. This user-friendly approach makes it
convenient for users to employ the search engine without the need for additional data gathering.
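
As an illustration, the JavaScript portion of such a sample file could look like the sketch below. The endpoint path, payload, and response shape are assumptions for the example and would need to match the deployed backend.

// search-client.js - hypothetical sketch of the downloadable sample client script.
// Assumes the backend exposes a JSON search endpoint at /api/search and that the
// accompanying sample HTML page contains a list element with id="results".
async function runSearch(query, companyName) {
  const response = await fetch('http://localhost:3000/api/search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, companyName, size: 10 }),
  });
  const hits = await response.json();

  // Render each hit's caption into the results list on the sample HTML page.
  const list = document.getElementById('results');
  list.innerHTML = hits
    .map((hit) => `<li>${(hit._source && hit._source.caption) || 'Untitled result'}</li>`)
    .join('');
}

// Example: triggered from a search button on the sample HTML page.
// runSearch('advertising', 'Acme Corp');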
Overall, the Custom Web Crawlers application offers users a seamless experience by combining
web crawling and search engine capabilities. The ability to add new websites for crawling and
choose various options, along with the option to directly utilize the search engine, provides users
with versatile tools to access and extract valuable insights from the vast pool of data available on
the internet.

6. Working Details

LinkedIn Crawler

In addition to its core functionality, the LinkedIn Crawler feature offers flexibility for future
enhancements. As the application evolves, the potential to incorporate additional data fields and
fine-tune the crawling process opens up possibilities for more in-depth analysis. The extensible
design allows for the inclusion of supplementary data points such as skill endorsements,
connection details, and post engagement metrics, providing users with a comprehensive
understanding of their target LinkedIn profiles.

Furthermore, the concurrent programming paradigm remains a cornerstone of the application's
scalability and responsiveness. With the ability to handle a growing number of concurrent
crawling requests, the Custom Web Crawlers application is well-equipped to meet the demands
of an expanding user base without compromising performance.
The real-time prompt provided to users upon initiating the crawling process ensures transparency
and instills confidence in the application's reliability. Users can rest assured that the data retrieval
is underway, allowing them to focus on other tasks while the application diligently collects the
desired information.
As the application continues to harness the power of Elasticsearch, the seamless integration of
AI-driven search relevance remains an exciting prospect. By implementing machine learning
algorithms, the application can further elevate the search experience, offering personalized
search results tailored to individual user preferences and behaviors.
In conclusion, the LinkedIn Crawler functionality within the Custom Web Crawlers application
exemplifies the synergy between web crawling, concurrent programming, and Elasticsearch.
Through this innovative approach, users can effortlessly leverage the rich data available on
LinkedIn profiles, empowering them to make well-informed decisions, uncover strategic
insights, and gain a competitive edge in their respective domains. The application's potential for
continuous improvement and adaptability ensures its position at the forefront of web crawling
advancements, driving data exploration and discovery for users across diverse industries.

Code -

The `startCrawler` function is a crucial part of the backend implementation for the LinkedIn
Crawler feature in the Custom Web Crawlers application. This function is responsible for
initiating the crawling process and handling the data retrieval from the LinkedIn profile specified
by the user.
Upon receiving a request, the `startCrawler` function first extracts the LinkedIn profile URL and
the optional Company Name provided by the user. If the URL is not found or if it ends with '/',
the function appropriately handles the URL and ensures it is well-formed for the crawling
process. The application uses Node.js' `child_process` module to spawn a Python process
responsible for executing the actual web crawler script.
The Python script, located at `./crawlers/linkedin/linkedin.py`, performs the actual web crawling
from the specified LinkedIn URL, collecting relevant data. The optional Company Name
provided by the user may be used by the Python script to further customize the crawling process
for tailored data retrieval.
As the Python process executes, it emits output data, which is received by the Node.js process
via the `process.stdout.on('data', ...)` event listener. The output data contains the name of the file
created by the Python script during the crawling process. This file contains the data collected
from the LinkedIn profile.
The Node.js process captures the file name from the output data, and if successfully obtained,
proceeds to call the `bulkImport` function from the `../elastic.js` module. The `bulkImport`
function is responsible for adding the data from the file to Elasticsearch, making it accessible for
efficient storage and retrieval.
Any error messages generated during the Python process execution are captured and logged by
the `process.stderr.on('data', ...)` event listener.
Once the crawling process and data import to Elasticsearch are complete, the `startCrawler`
function responds with a confirmation message to the user, indicating that the crawling process
has started.
This backend code efficiently handles the initiation of the LinkedIn Crawler, the communication
with the Python crawler script, and the subsequent storage of the collected data in Elasticsearch.
The integration of concurrent programming through Node child processes enables the application
to handle multiple crawling requests concurrently, providing users with a seamless and
responsive crawling experience.
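
Since the original listing is not reproduced here, the sketch below reconstructs the described behaviour. The exact request fields, variable names, and the bulkImport signature are assumptions based on the explanation above.

// Sketch of the startCrawler handler, reconstructed from the description above.
const { spawn } = require('child_process');
const { bulkImport } = require('../elastic.js');

function startCrawler(req, res) {
  // Extract the LinkedIn profile URL and the optional company name from the request.
  let { url, companyName } = req.body;
  if (!url) {
    return res.status(400).send('A LinkedIn profile URL is required.');
  }
  // Trim a trailing '/' so the crawler receives a well-formed URL.
  if (url.endsWith('/')) {
    url = url.slice(0, -1);
  }

  // Spawn the Python crawler as a separate process so the Node.js event loop stays free.
  const crawler = spawn('python', ['./crawlers/linkedin/linkedin.py', url, companyName || '']);

  // The Python script prints the name of the file containing the crawled data.
  crawler.stdout.on('data', (data) => {
    const fileName = data.toString().trim();
    if (fileName) {
      // Push the records from that file into Elasticsearch.
      bulkImport(fileName);
    }
  });

  // Log any error output from the Python process.
  crawler.stderr.on('data', (data) => {
    console.error(`linkedin.py error: ${data}`);
  });

  // Confirm to the user that the crawling process has started.
  res.send('Crawling process started.');
}

module.exports = { startCrawler };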

Search Client
The Search Client tab is a user interface within the Custom Web Crawlers application that
empowers users to access and interact with the data they have concurrently crawled from
different sources using the web crawlers. This intuitive interface allows users to search and filter
the collected data based on the company name, providing a seamless and efficient way to explore
and analyze the crawled information.
Upon accessing the Search Client tab, users are presented with a comprehensive view of the
crawled data. The interface displays search results in a user-friendly format, presenting each data
entry with a caption and an attached picture, where applicable. This visual representation
enhances the user experience and facilitates quick data understanding.
The Search Client tab offers a robust filtering functionality that enables users to narrow down
their search results based on the company name. By entering specific company names in the
search field, users can instantly filter the displayed data, quickly locating relevant information
related to their selected companies.

One of the key features of the Search Client is its support for concurrent data retrieval from
multiple web crawlers. The data collected simultaneously from different web sources is
consolidated and presented in a unified manner, streamlining the user's access to a diverse range
of information from various websites.
The intuitive and interactive nature of the Search Client ensures that users can effortlessly
explore the crawled data, gaining valuable insights for various purposes, such as competitor
analysis, market research, and business intelligence. This user-friendly interface, combined with
the power of concurrent web crawling, empowers users to efficiently navigate through the vast
amount of collected data and extract meaningful information that can drive data-driven
decision-making processes.

Code -

The provided code represents a backend function called `search` responsible for handling search
queries in the Custom Web Crawlers application. The function utilizes Elasticsearch to perform
the search operation efficiently.
The function takes a `req` object as an input parameter, which contains user-defined search
parameters like the size of the search results, search query, and the company name for filtering.
It constructs a `searchObj` that defines the index to search in (`linkedin` in this case) and
specifies highlight fields for displaying search results with highlighted keywords. The `body`
object defines the search query using Elasticsearch's bool query to accommodate multiple filters
based on the user's input.
The function then checks the `req` object for specific search parameters and appends
corresponding filters to the `body` object accordingly. For instance, if a search query or company
name is provided, the function adds matching filters to the bool query.

Once the search query is built, the function executes the search operation using `client.search`
provided by Elasticsearch's API. It returns the search results in the form of hits containing
relevant data from the index.
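
A reconstruction of this logic is sketched below. Because the original listing is not shown here, the field names, index mapping, and client configuration are assumptions consistent with the description (it assumes the version 7 Elasticsearch JavaScript client, where the query is wrapped in a body object).

// Sketch of the search function, reconstructed from the description above.
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function search(req) {
  // Base request: search the 'linkedin' index and highlight matched terms.
  const searchObj = {
    index: 'linkedin',
    size: req.size || 10,
    body: {
      query: { bool: { must: [] } },
      highlight: { fields: { caption: {} } },
    },
  };

  // Add a full-text filter when the user supplied a search query.
  if (req.query) {
    searchObj.body.query.bool.must.push({ match: { caption: req.query } });
  }

  // Add a company-name filter when the user is narrowing results to one company.
  if (req.companyName) {
    searchObj.body.query.bool.must.push({ match: { company: req.companyName } });
  }

  // Execute the search and return the matching documents (hits).
  const { body } = await client.search(searchObj);
  return body.hits.hits;
}

module.exports = { search };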
In summary, the `search` function efficiently processes user-defined search parameters,
constructs the Elasticsearch query, performs the search operation, and returns the search results
from the backend to be displayed on the frontend search client.

7. Conclusion
The 'Custom Search App' demonstrates the power of search engine optimization, web crawlers,
and Elasticsearch integration in significantly enhancing the efficiency and scalability of data
retrieval processes. Leveraging Node.js' event-driven, non-blocking I/O platform, the application
achieves seamless execution of web crawlers, revolutionizing data collection from diverse online
sources. Thanks to concurrent programming, users can initiate multiple crawling tasks
concurrently, reducing data retrieval time and improving overall application responsiveness. The
application's extensible design enables the integration of diverse web crawlers, including the
LinkedIn Crawler, YouTube, and website crawlers, further enriching its data collection
capabilities. Looking ahead, the 'Custom Search App' holds great promise for integrating
machine learning algorithms into Elasticsearch to generate tags and personalized search results.
This enhancement will deliver more relevant and tailored information to users, enriching their
search experience.
In conclusion, the 'Custom Search App' showcases the transformative impact of search engine
optimization, web crawlers, and Elasticsearch integration, opening new possibilities for seamless
data collection, dynamic search experiences, and data-driven insights. As advancements in these
areas continue, the application is poised to revolutionize data acquisition and search capabilities,
empowering users with a comprehensive and real-time understanding of the digital landscape.
With its efficient and scalable architecture, the 'Custom Search App' sets the stage for a new era
of data exploration and analysis, catering to diverse use cases and guiding businesses and
individuals towards data-driven success.

8. Future Work
The Custom Web Crawlers application exhibits immense potential for future expansion, with a
primary focus on enhancing concurrent programming capabilities to further improve web
crawling efficiency and scalability. The concurrent programming paradigm has already proven
its significance in the application's current functionality, allowing the simultaneous execution of
multiple web crawlers for diverse data collection tasks. Building on this foundation, additional
features and advancements can be introduced to take the concurrent programming aspect to new
heights.
1. Diverse Web Crawlers: As the application evolves, the capability to add different types of
web crawlers can be explored. For instance, implementing web crawlers for platforms like
YouTube, various websites, and other social media platforms can extend data collection
capabilities, catering to a broader range of user requirements.
2. Advanced Search Options: Future work can include introducing advanced search options
within the final search page to enhance the user experience. Incorporating filters, sorting options,
and additional search criteria can empower users to pinpoint precisely relevant data, ensuring a
seamless and targeted search experience.
3. AI-Driven Search Relevance: Integrating artificial intelligence (AI) to analyze user search
patterns can significantly enhance the relevance of search results. Machine learning algorithms
can be incorporated into Elasticsearch to generate personalized tags and prioritize search results
based on user preferences and past behavior, delivering more meaningful insights.
4. User-Customizable Search Settings: Providing users with search settings to customize the
rank of results, manage stop words, and fine-tune search behavior can further improve the
application's adaptability and cater to individual preferences.
5. Open API Integration: Future development can focus on implementing Open APIs using
OAuth authentication to allow other developers to access and leverage the application's search
services. This expansion fosters collaboration and integration with external systems, promoting
innovation and data sharing across various applications.
With the continuous advancements in concurrent programming, the Custom Web Crawlers
application is poised to deliver even more exceptional data collection and search capabilities.
The focus on expanding concurrent programming aspects will unlock new possibilities for
efficient web crawling, allowing users to harness diverse and extensive data resources
effortlessly. As these future enhancements are introduced, the application will continue to
empower users with an unparalleled web crawling and search experience, facilitating data-driven
decision-making and strategic insights for businesses and individuals alike.
