
Statistical Topic Modeling for Local News Using Unsupervised Machine Learning
Submitted By: Hassan Ali (160551), BSCS (VII) C
Wasif Ali (160949), BSCS (VII) A

Submitted To: Ma’am Asra Masood


Table of Contents
1. Executive Summary
1.1 Product
1.2 Market
1.3 Competitors
1.4 Management
1.5 Funding Needed
1.6 Financial Projections
2. Industry Analysis
3. Company Description
3.1. Company History
3.2. Mission Statement
3.3. Services and Target Market
4. Market Analysis
4.1. Segmentation and Target Market
5.1. Revenue Model
5.2. Cost and Expenses
6. Marketing Plan
6.1. Marketing Vision
6.2. Target Customers
6.3. Unique Selling Proposition
6.4. Pricing & Positioning Strategy
6.5. Distribution Plan
6.6. Offers
6.7. Promotional Strategy
6.8. Joint Ventures and Partnership
7. Design and Development Plan
User Side
7.1. How to use Topic Extraction tool Services
Admin Side
7.3. Projected Development Costs
8. Operations Plan
8.1. Operations
8.1.1. Backstage Operations
9. Management Team and Company’s Structure
9.1. Ownership Structure
9.2. Internal Management Team
9.2.1. Hassan Ali, Co-founder and Chief Executive Officer
9.2.2. Wasif Ali, Co-founder and Managing Director
9.3. External Management Team
9.3.1. Advisory Board
10. Overall Schedule
10.1. Key Milestones for Statistical Topic Modeling for Local News
11. Financial Projections
11.1. Cost Structure
11.2. Cash Flow Projections


1. Executive Summary
With the rapid and seemingly limitless growth of data published online on news sites and news
blogs, manual inspection of the topics being discussed is no longer feasible. This project
therefore provides a machine-learning-based approach to the problem. The tool developed in
the project gives the user the probability of each topic discussed in a document that is
either provided by the user or extracted from an online source.

1.1 Product
Statistical Topic Modeling For Local News Using Unsupervised Machine Learning provides
two main functionalities. The first extracts the topics being discussed on the news websites
supported by the tool, together with the percentage of occurrence of words in those topics.
The second extracts the topics, and the percentage of words in them, from a document
provided by the user.

1.2 Market

The tool is built for the National Centre for Cyber Security (NCCS). It will be used as a
module in products created by NCCS, or it can be treated as a standalone tool. The target
users are researchers and law enforcement agencies. Any other users will be decided by the
National Centre for Cyber Security, as the tool belongs to it.

1.3 Competitors

We have many competitors, as various companies provide tools with topic modeling
functionality. The major ones currently in use are the Mallet topic modeling tool and the
Themacroscope topic modeling tool. Even so, the Statistical Topic Modeling For Local News
Using Unsupervised Machine Learning tool has two main competitive advantages. First, it
extracts topics both from documents on the online news sites it supports and from news
documents provided by the user. Second, the topic models offered by other providers are not
trained or built for news of Pakistani origin, which our product caters to.

1.4 Management

The Statistical Topic Modeling For Local News Using Unsupervised Machine Learning tool is
developed by Hassan Ali and Wasif Ali. The two of us and the National Centre for Cyber
Security are behind the idea of a topic modeling tool for local news: the Centre needed one,
and no such tool existed for local news. That need, and the fact that the tool was being
built for a national institute, was the main motivation behind its development.

1.5 Funding Needed

The Statistical Topic Modeling For Local News Using Unsupervised Machine Learning tool is
seeking 20,000 in capital to pay for services used during the development process and to
outsource some sub-modules of the tool in order to speed up development.

1.6 Financial Projections

The Statistical Topic Modeling For Local News Using Unsupervised Machine Learning tool is at
an early stage. However, as a product of the National Centre for Cyber Security, it will be
used directly by the Centre as a module or as an individual project. A significant amount of
revenue is expected in a short span of time, as the product will be embedded in other
products or sold separately to users nationwide.

2. Industry Analysis

2.1 Industry Size and Growth Rate

Since the emergence of topic models, researchers have introduced this approach into the
field of online document mining. Because of its strength in analysing large-scale
document collections, better results have been obtained in fields such as news text
mining. Most of these studies, however, follow the classic text-mining method of a
topic model.

In recent years, we have witnessed exponential growth in news published online through
various sources, whether news sites, blogs, or social media. This situation poses a great
challenge, namely how to extract hidden knowledge and relations from these data. That
challenge was the motivation behind the project, which specifically targets news published
on local news sites. These are growing every minute, both in the volume of data and in the
number of sources.

3. Company Description
The name of our product is Statistical Topic Modeling For Local News Using Unsupervised
Machine Learning. It is developed by Hassan Ali and Wasif Ali and is being created as
a product for the National Centre for Cyber Security, which is a national institute. It is
headquartered in Islamabad, Pakistan.

3.1. Company History


The establishment of the National Centre for Cyber Security (NCCS) was initiated by the
Government of Pakistan in June 2018. The NCCS project is a joint initiative of the Higher
Education Commission (HEC) and the Planning Commission. The Centre comprises Research
and Development (R&D) labs in reputed universities of Pakistan, which were shortlisted after
an open call for proposals made by HEC in early 2018. After meticulous scrutiny in two
rounds of evaluation, technical proposals from 11 universities of Pakistan were shortlisted.
These universities were given the mandate to establish NCCS-affiliated labs in different
specialized areas of cyber security under the Secretariat of the Centre. It is an honour for
Air University that it has been entrusted with the status of NCCS Secretariat, in addition
to two affiliated labs in the domains of Cyber Crime Forensics and Smart Devices and
Networks Security.

3.2. Mission Statement


‘Providing functionality of Topic Extraction from a Large document without manual
inspection of the document’

3.3. Services and Target Market


The target market of the Statistical Topic Modeling For Local News Using Unsupervised
Machine Learning tool consists of researchers and law enforcement agencies. The tool will
provide the following two main functionalities to the user:

1) Extraction of topics from documents published online on the websites supported by the
tool, along with the probability of the words comprising each topic.
2) Extraction of topics from a document provided by the user, along with the probability of
the words comprising each topic.

3.4 Objectives

NCCS plans to demonstrate the news topic modeling tool and explain its functionalities to
national institutes that require them, and to sell the tool to those interested in the
product.

3.5 Vision Statement

The Statistical Topic Modeling For Local News Using Unsupervised Machine Learning tool
envisions not only catering to the needs of law enforcement agencies and national
institutes, but also creating a version of the tool for general users looking for
a local topic modeling tool.

3.6 Key Partnership

As developers, we do not have to counter the competition from other topic modeling tools on
our own. Because the tool is a product of the National Centre for Cyber Security, the
partnership elevates the authenticity of the tool. It will also significantly help us
overcome the most challenging obstacle the business faces, i.e. the task of establishing
trustworthiness and credibility among users.

4. Market Analysis

4.1. Segmentation and Target Market


Statistical Topic Modeling For Local News Using Unsupervised Machine Learning is a
topic modeling tool built for topic extraction from documents. The Minimum Viable Product
(MVP) will be launched initially for law enforcement agencies and national institutes. The
user need that motivates the tool is that users would otherwise have to read every
document manually to infer its meaning, which is a hassle and takes a lot of time.

4.2 Competitor Analysis

The biggest competition is the existing topic modeling tools. The introduction of Statistical
Topic Modeling For Local News Using Unsupervised Machine Learning or “Topic Extraire”
will be a potential threat to the existing system, which can lead to considerable opposition.
However, establishing credibility is what will determine the success of this venture.

4.2.1 Collaborators

The collaborators consist of the suppliers of Statistical Topic Modeling For Local News
Using Unsupervised Machine Learning, namely the National Centre for Cyber Security, and the
users of the tool.

5. Economics of Statistical Topic Modeling For Local News Using Unsupervised Machine Learning

5.1. Revenue Model


5.1.1 Monthly Fee

Users of the tool will need to pay a monthly fee in order to use the functionalities
provided by the tool.

5.1.2 One-Time Subscription

Users can also choose a one-time subscription to the tool and will be charged accordingly.

5.2. Cost and Expenses

5.2.1. Start-up Costs


• Development of the Statistical Topic Modeling For Local News Using Unsupervised
Machine Learning tool, to be installed on users’ systems.
• Marketing costs.

5.2.2. Fixed Costs


• Tool development (one-time cost)
• Developers’ salaries (for fixing bugs and updates)
• Office rent

5.2.3. Profit Margins
The profit of Statistical Topic Modeling For Local News Using Unsupervised Machine
Learning will be generated from the subscription fees for using the functionalities
provided by the tool, both monthly fees and one-time direct sales. The fixed costs are
subtracted from this revenue, and the result is our actual profit from the product.
The profit can be expressed as follows:

Profit Margin = Revenue (Subscription) - Costs (Fixed Costs)
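
The expression above can be sketched in a few lines of Python. This is only an illustration: the function name `profit` and the figures passed to it are hypothetical and are not taken from this plan's projections.

```python
def profit(monthly_fee_revenue: float, one_time_revenue: float, fixed_costs: float) -> float:
    """Profit = subscription revenue (monthly + one-time) minus fixed costs."""
    revenue = monthly_fee_revenue + one_time_revenue
    return revenue - fixed_costs


# Hypothetical figures, for illustration only.
example = profit(monthly_fee_revenue=1200.0, one_time_revenue=800.0, fixed_costs=650.0)
print(example)  # 1350.0
```

A negative result simply means that fixed costs exceed subscription revenue for the period considered.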

6. Marketing Plan

6.1. Marketing Vision


‘Statistical Topic Modeling For Local News Using Unsupervised Machine Learning’ is on a
path to becoming the topic modeling tool of choice for people looking for topic extraction
from Pakistani news. Our weekly meetings with the National Centre for Cyber Security
Research Assistants (RAs) always address ways in which we can further improve the
extracted topics so that a better service is provided to the end user.

6.2. Target Customers


Our target market comprises law enforcement agencies and researchers interested in
topic extraction.

6.2.1. Demographic Profile


• Age: 18 to 50 years.
• Gender: Both.

6.2.2. Psychological Profile


The main issue faced by the research community and law enforcement agencies is the hassle
of reading entire news articles and research papers to grasp the meaning of a published
document and identify the prominent topics discussed in it. Law enforcement agencies need
this in order to stay aware of what is being published online, while researchers have to
read the whole document to get an idea of the topics it discusses.

This issue has been discussed in many research and scientific communities, as well as by
national security agencies, but to date no one has come up with an efficient solution that
meets the benchmarks.
One very basic solution is the introduction of the Statistical Topic Modeling For Local
News Using Unsupervised Machine Learning method. With this tool, users can get the desired
topics without the hassle of reading the whole news document. We have designed it with
national privacy and security in mind.

6.3. Unique Selling Proposition


“Topic Extraction made easy”.

6.4. Pricing & Positioning Strategy


We aim to position ‘Statistical Topic Modeling’ as a cost-effective tool for researchers
and law enforcement agencies. For this purpose we will adopt a low-cost strategy in
order to provide researchers with a convenient and flexible service at the lowest possible
price and so capture more users.

6.5. Distribution Plan


Distribution of a finished product is one of the most important steps of the product cycle,
and it can decide whether the product will be a hit or a flop. After careful consideration,
we have finalized our distribution plan for the tool.

We have collaborated with different universities and research institutes on our own, while
the major collaborations are being handled by the National Centre for Cyber Security, which
deals with the law enforcement agencies that will use our topic extraction tool. The second
phase of distribution will include providing an open-source version of our product to
general users interested in topic extraction. As our product serves a specific segment, we
are not aiming to achieve a huge number of users.

6.6. Offers
In order to attract new customers and retain existing ones, we plan to provide special
deals. These deals will consist of the following offers:

• Free Trials: First-time users will be able to use the ‘Statistical Topic Modeling’ service
free of cost and extract up to 1000 topics from online sources.
(Note: This is a one-time offer. Users will be charged thereafter.)
• Referral Offers: A discount will be provided to users who buy a one-month
subscription and successfully refer 1 more user who subscribes.

6.7. Promotional Strategy
Although a number of promotional and advertising tactics are available, keeping our
target market in mind, Statistical Topic Modeling For Local News Using Unsupervised
Machine Learning will adopt the following methods:

• NCCS: The National Centre for Cyber Security is itself a promoter of the tool,
as the tool will be used at NCCS sites around Pakistan, from where other
potential users will initially get the rights to use it.
• Word of Mouth: This will be not only the cheapest way to promote ‘Statistical Topic
Modeling’ but also the most effective way to build trust and credibility within the
research community.
• Seminars: We will ask the administrations of research institutes for
permission to conduct short seminars in order to promote our tool and make
potential users aware of it.

6.8. Joint Ventures and Partnership


We and the National Centre for Cyber Security aim to work as sole proprietors of the
product. In today's market, however, it is very difficult to gain a foothold without
promotion. Our strategy is to choose related, established names that can play a helpful
role in promoting our product. Another option is joint ventures with different institutes
and research journals. These are the best options so far for promoting our toolkit.

7. Design and Development Plan

User Side

7.1. How to use Topic Extraction tool Services

7.1.1. Extraction of news from Local News Site

On opening the tool, the user can select the option to extract news from local news sites.
These sites are provided by the tool itself.

7.1.2. Write Own News Blog

The user can also provide his or her own document, on which topic extraction will be
performed.

7.1.3. Extraction of news of specific date

The user will be able to extract topics for a single date. The user provides the date, and
the tool returns the topics.

7.1.4. Extraction of news between two specific dates

The user will be able to extract topics between two dates. The user provides both dates,
and the tool returns the topics.
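
A minimal sketch of how the single-date and two-date selections could share one code path is shown below. The `Article` record and the `select_articles` function are hypothetical names introduced for illustration; the plan does not describe how articles are actually stored, and the article texts are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical record for one scraped news article."""
    published: date
    text: str


def select_articles(articles: List[Article], start: date,
                    end: Optional[date] = None) -> List[Article]:
    """Return articles published on `start`, or from `start` to `end` inclusive."""
    if end is None:
        end = start  # single-date case (7.1.3)
    if end < start:
        raise ValueError("end date must not precede start date")
    return [a for a in articles if start <= a.published <= end]


# Placeholder corpus, for illustration only.
corpus = [
    Article(date(2019, 10, 1), "placeholder article one"),
    Article(date(2019, 10, 3), "placeholder article two"),
    Article(date(2019, 10, 7), "placeholder article three"),
]

single_day = select_articles(corpus, date(2019, 10, 3))
date_range = select_articles(corpus, date(2019, 10, 1), date(2019, 10, 3))
print(len(single_day), len(date_range))  # 1 2
```

Treating a single date as a range whose start and end coincide keeps the two features consistent; the selected articles would then be passed on to topic extraction.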

Admin Side

7.1.5. Enter Record


The admin will insert records fetched from the user side into the database.

7.1.6. Update Record


The admin will edit existing records in the database on user request.

7.1.7. Delete Record


The admin will delete existing records in the database on user request.

7.1.8. Manage Accounts

The admin will maintain and manage user accounts and history.

7.2. Challenges and Risks


With events happening around Pakistan every minute, large volumes of news are being
published online on social media, in news articles, and on blogs. A large share of this
data is beyond our requirements. The challenge is to analyse and model these documents to
understand the topics being discussed, which is very difficult to do manually.

A topic (title) or summary concisely and readily conveys the main idea described in a
longer text. It captures the main idea and removes the need to read whole texts and
unstructured data. It can give an overview of a large collection of text and make things
easier for the reader by improving efficiency and saving time and effort.

7.3. Projected Development Costs

These include:

• Developers’ cost
• Marketing costs
• Third-party cost

8. Operations Plan

8.1. Operations
8.1.1. Backstage Operations

These are the operations that are not seen by customers. They consist of the following
activities:

• Extraction of data from online news sites
• Training our model
   - Pre-processing
   - Training
   - Scoring and evaluating the trained model
• Managing accounts
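
The pre-processing and training steps listed above can be sketched as follows. This plan names LDA as the model (Section 9.2.4) but does not specify the inference method, library, or hyperparameters, so everything in this sketch is an assumption: a small collapsed Gibbs sampler written in plain Python, a tiny hand-picked stopword list, and made-up sample headlines. The function names `preprocess`, `train_lda`, and `top_words` are hypothetical. Scoring and evaluation of the trained model are not shown.

```python
import random
import re

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def train_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (vocab, topic_word, doc_topic)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][index[w]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assignments)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = index[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1
    # Smoothed estimates of P(word | topic) and P(topic | document).
    topic_word = [
        [(c + beta) / (topic_totals[t] + vocab_size * beta) for c in topic_word_counts[t]]
        for t in range(num_topics)
    ]
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    return vocab, topic_word, doc_topic


def top_words(vocab, word_probs, n=5):
    """Return the n most probable (word, probability) pairs for one topic."""
    return sorted(zip(vocab, word_probs), key=lambda p: p[1], reverse=True)[:n]


# Made-up sample headlines, for illustration only.
raw_news = [
    "The cricket team won the match in Lahore",
    "Cricket fans celebrate the match victory",
    "The budget raises tax on fuel prices",
    "Fuel prices and tax dominate the budget debate",
]
docs = [preprocess(text) for text in raw_news]
vocab, topic_word, doc_topic = train_lda(docs, num_topics=2)
for t, row in enumerate(topic_word):
    print(t, top_words(vocab, row, n=3))
```

Each row of `topic_word` is a probability distribution over the vocabulary, which corresponds to the word probabilities per topic that the tool reports to the user. A production system would more likely rely on an established topic modeling library than on a hand-written sampler.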

8.1.2. Front-stage Operations


These are the operations that are seen by customers. They consist of the following
activities:

• Extraction of Topics from Online Site: Extract news from the news services
provided by the tool itself and get the topic probabilities as well.
• Extraction of Topics from Own Document: Provide input text or a file. The user will be
able to enter input in a text area or upload a text file in one of several formats for
topic extraction.
• Select Category: The user will choose the domain category in which the topic will lie.
• Search the Topic: Finally, the user will perform a search to get the topic based on
the information provided above.
• Save the Topic: The user will be able to save searched topics and maintain records for
future use.

9. Management Team and Company’s Structure

9.1. Ownership Structure


The legal structure of ‘Statistical Topic Modeling For Local News Using Unsupervised
Machine Learning’ will be that of a partnership among the co-founders, Hassan Ali and
Wasif Ali, and NCCS.

9.2. Internal Management Team


The management team consists of two co-founders and three divisions.
9.2.1. Hassan Ali, Co-founder and Chief Executive Officer

Hassan Ali is pursuing his Bachelor's in Computer Science at Air University. This degree
has helped him develop the skills required for the creation and management of our tool. He
deals with clients to drive the company’s growth and development.

9.2.2. Wasif Ali, Co-founder and Managing Director


Wasif Ali is pursuing his Bachelor's in Computer Science at Air University. Owing to his
excellent programming skills, he plays a key role in our start-up.

9.2.3. Marketing Team


The co-founders themselves will undertake the task of promoting and marketing ‘Statistical
Topic Modeling’, as they are currently studying at Air University. To get their marketing
material printed, they will contact X con Digital, an Islamabad-based advertising company
with its office at Street 70, G 14/3 G-14, Islamabad, Islamabad Capital Territory. In
addition, the project will be marketed and promoted by NCCS, as the product belongs to it,
and NCCS will hold meetings with institutes interested in the product.
9.2.4. Operations Team


‘Statistical Topic Modeling using LDA’ will be operated from NCCS in E-9, Islamabad. The
office will have an operations team; regardless of its size, its members will have to
be highly efficient in order to ensure smooth processing of all the necessary operations. It
will consist of the following sub-divisions:

• Development Department: The development department consists mostly of the software
engineers who developed the tool and who will be responsible for making changes and
upgrades to it.
• Query Department: The query department is where users submit their queries and are
updated about their status.

9.2.5. Finance Team


The finances of the product will be handled by the National Centre for Cyber Security. For
this purpose, it will ensure utmost transparency and hold meetings at the end of each month
in order to have a clear understanding of the finances, including all the expenses and
revenue generated by the business. It will then decide how to allocate the resources at
hand.

9.3. External Management Team

9.3.1. Advisory Board


The co-founders have personal ties with a member of our university faculty, Dr. Mehdi
Hassan, Head of Department (HOD) of the Computer Science department. His expertise is in
the field of Data Science. He has managed to carve out a great name for himself through his
creativity, innovation and sheer hard work. Given the utmost respect and admiration both
co-founders hold for him, they believe that his knowledge and experience far exceed
those of an entire board of advisers.

10. Overall Schedule

10.1. Key Milestones for Statistical Topic Modeling for Local News
• Enable researchers to get the desired topics easily and efficiently.
• Help security agencies stay aware of the text being published online.
• Provide useful graphical visualizations of data.
• Allow topic extraction from past news.
• Allow extraction of topics from a certain span of time.
• Help users get topics from the latest news.
• Allow users to embed our model in their products.

11. Financial Projections

11.1. Cost Structure

11.1.1. Cost for Application Development


$100 (Software Cost) + $150 (Online Services) + $50 (Misc.) = $300

11.1.2. Marketing Costs


The cost of printed marketing material, i.e. flyers, brochures and standees, will amount
to $85.

11.1.3. Office Space


• Islamabad Office:
NCCS Office
Rent = $350/month

Total Rent = $350

Office Overhead Costs = $100

11.1.4. Staff Salary


• Number of Operators in Office = 2
• Salary per Operator = $100

Total Salary of Operators = $200 ($100 * 2)

11.2. Cash Flow Projections

11.2.1. Cash Outflow


• Application Development = $300
• Marketing Costs = $85
• Office Space = $350
• Staff Salary = $200

Total Cash Outflow = $935


11.2.2. Cash Inflow
• Expected Number of Users per Month = 100
• Subscription Fee per Registration = $300 per Year

Total Subscription Revenue = $30,000 (300 * 100)

• Expected Users of Pay-per-Download Feature per Month = 50
• Pay-per-Download Fee = $25

Total Pay-per-Download Revenue per Month = $1,250 (50 * 25)

Total Pay-per-Download Revenue per Year = $15,000 (1,250 * 12)

Total Revenues = $45,000 [30,000 + 15,000]
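
The cash flow arithmetic in Sections 11.2.1 and 11.2.2 can be re-derived with a short script. The figures are the ones stated in this plan; the variable names are chosen here for illustration. The yearly pay-per-download figure is the monthly revenue of $1,250 multiplied by 12.

```python
# Cash outflow items from Section 11.2.1 (USD).
outflow = {
    "Application Development": 300,
    "Marketing Costs": 85,
    "Office Space": 350,
    "Staff Salary": 200,
}
total_outflow = sum(outflow.values())

# Cash inflow from Section 11.2.2 (USD).
subscribers = 100
subscription_fee = 300
subscription_revenue = subscribers * subscription_fee

downloads_per_month = 50
download_fee = 25
download_revenue_month = downloads_per_month * download_fee
download_revenue_year = download_revenue_month * 12

total_revenue = subscription_revenue + download_revenue_year

print(total_outflow)           # 935
print(subscription_revenue)    # 30000
print(download_revenue_month)  # 1250
print(download_revenue_year)   # 15000
print(total_revenue)           # 45000
```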
