
Lecture 15:

If a periodic signal is decomposed into five sine waves with frequencies of 100, 300, 500, 700, and 900 Hz, what is its bandwidth?

Explanation: The bandwidth is 900 - 100 = 800 Hz.
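Stated as a formula (a minimal LaTeX restatement of the calculation above, where f_max and f_min are simply the highest and lowest component frequencies):

```latex
B = f_{\max} - f_{\min} = 900\ \text{Hz} - 100\ \text{Hz} = 800\ \text{Hz}
```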

Composite Signals:

For data communication, a simple sine wave is not useful; what is used is a composite signal, which is a combination of many simple sine waves. According to the French mathematician Jean-Baptiste Fourier, any composite signal is a combination of simple sine waves with different amplitudes, frequencies, and phases.

Bandwidth and spectrum are common terms in disciplines such as telecommunication and networking. The difference between bandwidth and spectrum is that bandwidth is the maximum rate of data transfer within a certain period of time, while a spectrum is a collection of waves with particular frequencies arranged in order.

Bandwidth describes the maximum data transfer rate of a network or Internet connection. It measures how much data can be sent over a specific connection in a given amount of time. For example, a gigabit Ethernet connection has a bandwidth of 1,000 Mbps (125 megabytes per second). An Internet connection via cable modem may provide 25 Mbps of bandwidth.

Bandwidth measures how much data can flow through a specific connection at one time.

When visualizing bandwidth, it may help to think of a network connection as a tube and each bit of data as a grain of sand. If you pour a large amount of sand into a skinny tube, it will take a long time for the sand to flow through it. If you pour the same amount of sand through a wide tube, the sand will finish flowing through the tube much faster. Similarly, a download will finish much faster when you have a high-bandwidth connection rather than a low-bandwidth connection.

Lecture 16:

1) _______________________ consists of the policies and practices adopted to prevent and monitor unauthorized access, misuse, modification, or denial of a computer network and network-accessible resources

Network security

Network security is the protection of access to files and directories in a computer network against hacking, misuse, and unauthorized changes to the system. An example of network security is an antivirus system. Network security begins with the authorization of access to data in a network, which is controlled by the network administrator.
A total of three types of network security components can be called upon: hardware, software, and cloud security components.

Types of Network Security:

 Firewall
 Network Segmentation
 Remote Access VPN
 Zero Trust Network Access (ZTNA)
 Email Security
 Data Loss Prevention (DLP)
 Intrusion Prevention Systems (IPS)
 Sandboxing
 Hyperscale Network Security
 Cloud Network Security

Lecture 17:

1) __________________ are essential as basic building blocks for a system that will organize
recorded information that is collected by libraries, archives, museums, etc
Retrieval tools

An information retrieval (IR) system is a set of algorithms that facilitate the relevance of displayed
documents to searched queries. In simple words, it works to sort and rank documents based on the
queries of a user. There is uniformity with respect to the query and text in the document to enable
document accessibility.

There are three types of Information Retrieval (IR) models:

1. Classical IR Model — It is designed upon basic mathematical concepts and is the most widely used of the IR models. Classical information retrieval models can be implemented with ease; examples include the Vector-space, Boolean, and Probabilistic IR models. In this system, the retrieval of information depends on documents containing the defined set of query terms, and there is no ranking or grading of any kind (see the Boolean-retrieval sketch after this list). The different classical IR models take document representation, query representation, and the retrieval/matching function into account in their modelling.

2. Non-Classical IR Model — They differ from classic models in that they are built upon propositional
logic. Examples of non-classical IR models include Information Logic, Situation Theory, and Interaction
models.

3. Alternative IR Model — These take the principles of the classical IR model and enhance them to create more functional models, such as the Cluster model, alternative set-theoretic models (e.g., the Fuzzy Set model), the Latent Semantic Indexing (LSI) model, and alternative algebraic models (e.g., the Generalized Vector Space Model).
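As a concrete illustration of the classical (Boolean) IR model described in item 1, here is a minimal Python sketch; the documents, query, and function name are invented for illustration and are not part of the lecture material:

```python
# Minimal Boolean retrieval sketch: a document matches a query only if it
# contains every query term (AND semantics); there is no ranking or grading.
documents = {
    1: "gene expression analysis in yeast",
    2: "protein sequence retrieval from databases",
    3: "boolean retrieval of gene sequence data",
}

def boolean_and_search(query, docs):
    """Return the ids of documents containing all query terms."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in docs.items():
        if terms.issubset(set(text.lower().split())):  # every term must be present
            hits.append(doc_id)
    return hits

print(boolean_and_search("gene sequence", documents))   # -> [3]
```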

Lecture 18:

1) _______________ is/are examples of IR model

All
2) __________________ are essential as basic building blocks for a system that will organize
recorded information that is collected by libraries, archives, museums, etc
Information Retrieval tools
3) ____________________ is a discipline that deals with the representation, storage, organization,
and access to information items
Information Retrieval

Information retrieval (IR) is the field of computer science that deals with the processing of documents
containing free text, so that they can be rapidly retrieved based on keywords specified in a user's query.

Components of a traditional information retrieval system experiment include:

1. indexing system – indexing and searching methods and procedures (an indexing system can be human or automated);

2. collection of documents – text, image or multimedia documents, or document surrogates (for example, bibliographical records);

3. defined set of queries – which are input into the system, with or without the involvement of a human
searcher; and

4. evaluation criteria – specified measures by which each system is evaluated, for example ‘precision’
and ‘recall’ as measures of relevance. Recall is the proportion of relevant documents in the collection
retrieved in response to the query. Precision is the proportion of relevant documents amongst the set of
documents retrieved in response to the query.

Lecture 19:

Information Retrieval Model

Set-theoretic models: represent documents as sets of words or phrases. Similarities are usually derived
from set-theoretic operations on those sets. Common models are:

 Standard Boolean
 Extended Boolean
 Fuzzy retrieval

Algebraic models: represent documents & queries as vectors, matrices, or tuples. Similarity of query
vector & document vector is represented as a scalar value.

 Vector space
 Extended Boolean
 LSI
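To make the vector-space idea concrete, here is a minimal cosine-similarity sketch in Python using raw term counts; the example query, document, and helper name are assumptions for illustration only:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Represent two texts as term-count vectors and return their cosine similarity."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)                      # scalar product of the two vectors
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "gene expression"
document = "analysis of gene expression data"
print(round(cosine_similarity(query, document), 3))       # similarity is a single scalar value
```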

Probabilistic Models: Treat the process of document retrieval as a probabilistic inference. Similarities
are computed as probabilities that a document is relevant for a given query.

 Probabilistic theorems
 Bayes' theorem

Lecture 20:
Precision and recall are the measures used in the information retrieval domain to measure how well an
information retrieval system retrieves the relevant documents requested by a user. The measures are
defined as follows:

Precision = Total number of documents retrieved that are relevant/Total number of documents that
are retrieved.

Recall = Total number of documents retrieved that are relevant/Total number of relevant documents in
the database.

Fallout: the proportion of non-relevant items that have been retrieved in a given search.

Generality: the proportion of items in the collection that are relevant to a given search.
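These measures can be computed directly from the sets of retrieved and relevant documents. The following Python sketch mirrors the definitions above; the document ids are invented for illustration:

```python
def evaluate(retrieved, relevant, collection):
    """Compute precision, recall, and fallout from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                      # relevant documents that were retrieved
    non_relevant = set(collection) - relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fallout = len(retrieved & non_relevant) / len(non_relevant) if non_relevant else 0.0
    return precision, recall, fallout

collection = range(1, 11)       # 10 documents in the database
relevant = {1, 2, 3, 4}         # 4 of them are relevant to the query
retrieved = {1, 2, 5, 6}        # the system retrieved these 4
print(evaluate(retrieved, relevant, collection))   # -> (0.5, 0.5, 0.333...)
```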

1) Bayesian model is an example of ………………

Naïve Bayes classifiers

Lecture 21:

Information Retrieval Biological Databases:

These systems allow text searching of multiple molecular biology databases and provide links to relevant information for entries that match the search criteria. The three systems differ in the databases they search and the links they have to other information.

These are databases consisting of biological data such as protein sequences, molecular structures, DNA sequences, etc., in an organized form.

Several computer tools exist to manipulate this biological data, such as update, delete, and insert operations. Scientists and researchers from all over the world enter their experimental data and results in a biological database so that it is available to a wider audience.

Biological databases are free to use and contain a huge collection of a variety of biological data.

The most widely used interface for the retrieval of information from biological databases is the NCBI Entrez system. Entrez capitalizes on the fact that there are preexisting, logical relationships between the individual entries found in numerous public databases.

SRS (EBI and DDBJ): SRS is a data retrieval system that integrates heterogeneous databanks in molecular
biology and genome analysis. There are currently several dozen servers worldwide that provide access
to over 300 different databanks via the World Wide Web.

Lecture 22:

1) BioMed Central is the first ------- publisher

Largest and open access

IR in Bioinformatics

 Discovering & accessing appropriate bioinformatics resources
 Interesting publications with explicit classification of the relevant topics
 Heterogeneous & several kinds of data and information

Data Stores

MedLine: a bibliographic database of life sciences and biomedical information, covering medicine, nursing, pharmacy, and dentistry

PubMed: free search engine accessing MEDLINE DB of references & abstracts on life sciences &
biomedical topics

PubMed Central: archives publicly accessible full-text scholarly articles within the biomedical & life sciences journal literature

BioMed Central: publishes over 200 scientific journals and describes itself as the first & largest open-access science publisher

Lecture 23:

Search Engines:

To find scientific resources, such as journals & conference proceedings, systems have been developed to retrieve scientific publications.

CiteSeer: a well-known automatic generator of digital libraries of scientific literature

 database creation
 personalized filtering of new publications
 personalized adaptation and discovery of interesting research and trends

RefMed:

A search engine for PubMed that provides relevance ranking; it induces a new ranking according to the user's relevance judgments.

MedlineRanker:

A search engine for Medline that learns the most discriminative words by comparing a set of abstracts provided by the user with Medline, and ranks abstracts according to the learned discriminative words.

Lecture 24:

1) ----------- includes patient medical information from multiple sources

EMR and EHR

2) Formula of Sensitivity = ________ (Hint: TP = True Positive)

TP/(TP+FN)

When performing classification predictions, there are four types of outcomes that could occur.
 True positives are when you predict an observation belongs to a class and it actually does
belong to that class.
 True negatives are when you predict an observation does not belong to a class and it
actually does not belong to that class.
 False positives occur when you predict an observation belongs to a class when in reality it
does not.
 False negatives occur when you predict an observation does not belong to a class when in
fact it does

These four outcomes are often plotted on a confusion matrix.

The three main metrics used to evaluate a classification model are accuracy, precision, and recall.

Accuracy is defined as the percentage of correct predictions for the test data. It can be calculated easily
by dividing the number of correct predictions by the number of total predictions.

Precision is defined as the fraction of relevant examples (true positives) among all of the examples
which were predicted to belong in a certain class.

Recall is defined as the fraction of examples which were predicted to belong to a class with respect to all
of the examples that truly belong in the class.
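Assuming a binary classification setting, these three metrics reduce to simple ratios of the four confusion-matrix counts. The short Python sketch below uses made-up counts purely for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall (sensitivity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions over all predictions
    precision = tp / (tp + fp)                   # true positives among predicted positives
    recall = tp / (tp + fn)                      # true positives among actual positives
    return accuracy, precision, recall

# Invented example counts: 40 TP, 45 TN, 5 FP, 10 FN
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))   # -> (0.85, 0.888..., 0.8)
```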

Lecture 25:

Categories of Search Engine:

GoPubMed:

 Extracts Gene Ontology terms from the retrieved abstracts
 Supplies the user with the relevant ontology for browsing
 Indexes PubMed search results with ontological background knowledge

XplorMed:

 Filters PubMed results according to main MeSH categories
 Extracts topic keywords and their co-occurrences, with the goal of extracting related abstracts

EBIMed:

 IR + IE from Medline
 Analyzes retrieved Medline abstracts to highlight associations
 Results are displayed in tables, and all terms are linked to their entries in biomedical DBs

Lecture 26:

1) eTBLAST is a search engine for ________

Citations; Medline & full text articles

2) EMR stands for

Electronic medical records
3) Major Electronic Health Record issues include Data Integrity , Privacy and security of patient
data, EHR implementation cost
True
4) Knowledge discovery is largely data-driven, starting with the many existing large-scale data
collections such as EHRs
True
5) Which one of the following is not a part of the Knowledge Discovery Steps?
6) For heart failure selection, the following classifier can be used ..................
Decision Tree
7) Bayes’ theorem and naïve bayes classifier is widely used for decision making in which field of
science?
Medical
8) In Association Rule Mining, a "consequent" is an item found in _________
Combination with antecedents

Lecture 34:

Clustering is a method that groups data points having similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features.

Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group the
unlabeled datasets into a cluster and also known as hierarchical cluster analysis or HCA. In this
algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is
known as the dendrogram.

Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data points as single clusters and merging them until one cluster is left.

How does Agglomerative Hierarchical Clustering work?

The working of the AHC algorithm can be explained using the steps below:

Step-1: Treat each data point as a single cluster. Let's say there are N data points, so the number of clusters will also be N.

Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.

Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.

Step-4: Repeat Step 3 until only one cluster is left.

Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as per the problem.
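A minimal sketch of these steps, using SciPy's hierarchical-clustering routines (this assumes SciPy is installed; the sample points are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six 2-D data points; each one starts as its own cluster (Step 1).
points = np.array([[1.0, 1.0], [1.2, 0.9],
                   [5.0, 5.0], [5.1, 4.8],
                   [9.0, 1.0], [9.2, 1.1]])

# linkage() repeatedly merges the two closest clusters (Steps 2-4)
# until a single cluster remains, recording the whole merge history.
merges = linkage(points, method="single")

# The merge history is the dendrogram (Step 5); no_plot=True returns its
# structure without drawing it (drawing would require matplotlib).
tree = dendrogram(merges, no_plot=True)

# Cutting the tree at, say, 3 clusters assigns a label to every point.
labels = fcluster(merges, t=3, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2 3 3]
```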

Lecture 35:

1) Human speech / voice can be recognized by using ________

Machine learning

Machine learning is an application of AI that enables systems to learn and improve from experience
without being explicitly programmed. Machine learning focuses on developing computer programs that
can access data and use it to learn for themselves.

How Does Machine Learning Work?


Similar to how the human brain gains knowledge and understanding, machine learning relies on input, such as training data or knowledge graphs, to understand entities, domains and the connections between them. With entities defined, deep learning can begin. The machine learning process begins with observations or data, such as examples, direct experience or instruction. It looks for patterns in data so it can later make inferences based on the examples provided. The primary aim of ML is to allow computers to learn autonomously, without human intervention or assistance, and adjust actions accordingly.

Applications of Machine Learning:

Lecture 36:

1) The software/tool that is used to manage the database and its users is called  

DBMS

What is supervised learning?

Supervised learning is a machine learning approach that’s defined by its use of labeled datasets. These
datasets are designed to train or “supervise” algorithms into classifying data or predicting outcomes
accurately. Using labeled inputs and outputs, the model can measure its accuracy and learn over time.

Supervised learning can be separated into two types of problems when data mining: classification and
regression:

 Classification problems use an algorithm to accurately assign test data into specific categories, such as separating apples from oranges. Or, in the real world, supervised learning algorithms can be used to classify spam in a separate folder from your inbox. Linear classifiers, support vector machines, decision trees and random forest are all common types of classification algorithms (a minimal decision-tree sketch appears after this list).
 Regression is another type of supervised learning method that uses an algorithm to understand
the relationship between dependent and independent variables. Regression models are helpful
for predicting numerical values based on different data points, such as sales revenue projections
for a given business. Some popular regression algorithms are linear regression, logistic
regression and polynomial regression.
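For instance, here is a minimal classification sketch (assuming scikit-learn is available; the tiny fruit dataset and its feature values are invented):

```python
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: [weight_in_grams, smoothness_score] -> fruit label.
X_train = [[150, 8], [160, 9], [170, 8], [120, 3], [130, 2], [125, 4]]
y_train = ["apple", "apple", "apple", "orange", "orange", "orange"]

# The labels "supervise" the algorithm while it learns a decision rule.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Predict the class of a new, unlabeled observation.
print(model.predict([[155, 7]]))   # expected: ['apple']
```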

What is unsupervised learning?

Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled data sets.
These algorithms discover hidden patterns in data without the need for human intervention (hence,
they are “unsupervised”).

Unsupervised learning models are used for three main tasks: clustering, association and dimensionality
reduction:

 Clustering is a data mining technique for grouping unlabeled data based on their similarities or differences. For example, K-means clustering algorithms assign similar data points into groups, where the K value represents the number of groups and thus the granularity of the clustering (a minimal K-means sketch appears below). This technique is helpful for market segmentation, image compression, etc.
 Association is another type of unsupervised learning method that uses different rules to find
relationships between variables in a given dataset. These methods are frequently used for
market basket analysis and recommendation engines, along the lines of “Customers Who
Bought This Item Also Bought” recommendations.
 Dimensionality reduction is a learning technique used when the number of features (or
dimensions) in a given dataset is too high. It reduces the number of data inputs to a manageable
size while also preserving the data integrity. Often, this technique is used in the preprocessing
data stage, such as when autoencoders remove noise from visual data to improve picture
quality.
Reinforcement learning, by contrast, is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones. In general, a reinforcement learning agent is able to perceive and interpret its environment, take actions, and learn through trial and error.
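Returning to clustering, the first of the three unsupervised tasks listed above, here is a minimal K-means sketch (assuming scikit-learn is available; the customer figures are made up). Setting K = 3 asks for three customer segments:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: [annual_spend, visits_per_month] for ten customers (invented values).
customers = np.array([[200, 2], [220, 3], [210, 2],
                      [800, 9], [790, 10], [820, 8],
                      [450, 5], [470, 6], [430, 5], [460, 5]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster id assigned to each customer
print(kmeans.cluster_centers_)  # centre (prototype) of each segment
```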

Lecture 37:

Now, let's have a look at some of the popular applications of Supervised Learning:

 Predictive analytics (house prices, stock exchange prices, etc.)
 Text recognition
 Spam detection
 Customer sentiment analysis
 Object detection (e.g. face detection)

Some use cases for unsupervised learning — more specifically, clustering — include:

 Customer segmentation, or understanding different customer groups around which to build marketing or other business strategies.
 Genetics, for example clustering DNA patterns to analyze evolutionary biology.
 Recommender systems, which involve grouping together users with similar viewing patterns in
order to recommend similar content.
 Anomaly detection, including fraud detection or detecting defective mechanical parts (i.e.,
predictive maintenance).

Lecture 38:

1) Objectives of Data Integration include all of the following except?

Homogeneity

Data integration is the practice of consolidating data from disparate sources into a single dataset with
the ultimate goal of providing users with consistent access and delivery of data across the spectrum of
subjects and structure types, and to meet the information needs of all applications and business
processes. The data integration process is one of the main components in the overall data management
process, employed with increasing frequency as big data integration and the need to share existing data
continues to grow.

Database Management Systems (DBMS) are software systems used to store, retrieve, and run queries
on data. A DBMS serves as an interface between an end-user and a database, allowing users to create,
read, update, and delete data in the database.
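As a small illustration of those create/read/update/delete operations, here is a sketch using Python's built-in sqlite3 module; the table name and values are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory database managed by the DBMS
cur = conn.cursor()

cur.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, symbol TEXT)")    # create
cur.execute("INSERT INTO genes (symbol) VALUES (?)", ("BRCA1",))           # insert (create data)
cur.execute("UPDATE genes SET symbol = ? WHERE id = ?", ("TP53", 1))       # update
print(cur.execute("SELECT id, symbol FROM genes").fetchall())              # read -> [(1, 'TP53')]
cur.execute("DELETE FROM genes WHERE id = ?", (1,))                        # delete
conn.commit()
conn.close()
```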

Many heterogeneous data sources:

 Experimental data produced by chip-based techniques
 Genome-wide measurement of gene activity under different conditions (e.g., normal vs. different disease states)
 Experimental annotations (metadata about experiments)
 Clinical data
 Lots of inter-connected web data sources and ontologies
 Sequence data, annotation data, vocabularies, …
 Publications (knowledge in text documents)
 Private vs. public data

Different kinds of analysis

 Gene expression analysis
 Transcription analysis
 Functional profiling
 Pathway analysis and reconstruction
 Text mining

Lecture 39:

1) Which one of the following formats is emerging as a standard for data transfer?

XML
The Most Common Data Integration Challenges

1. Data is Not Available Where it Should Be:

One of the most common business integration challenges is that data is not where it should be. When
data is scattered throughout the enterprise, it gets hard to bring it all together in one place. The risk of
missing a crucial part of data is always present. It could be hidden in secret files. An ex-employee could
have saved data in a different location and left without informing the peers. Or it could be any other
reason that results in the data being elsewhere.

It is suggested to use a data integration platform to gather and compile data in one place to overcome
the problem of not finding data where expected. Asking developers to work on it is time-consuming,
which leads to the next issue.

2. Data Collection Latency and Delays:

In today’s world, data needs to be processed in real-time if you want to get accurate and meaningful
insights. But if the developers manually complete the data integration steps, this is just not possible. It
will lead to a delay in data collection. By the time developers collect data from last week, there will be
this week’s left to deal with, and so on.

Automated data integration tools solve this problem effectively. These tools have been developed to
collect data in real-time without letting enterprises waste their valuable resources in the process.

3. Wrong and Multiple Formats:

Another of the common challenges of system integration is the multiple formats of data. The data saved by the finance department will be in a format that's different from how the sales team presents its data. Comparing and combining unstructured data from different formats is neither effective nor useful.

An easy solution to this is to use data transformation tools. These tools analyze the formats of data and
change them to a unified format before adding data to the central database. Some data integration and
business analytics tools already have this as a built-in feature. This reduces the number of errors you will
need to manually check and solve when collecting data.

4. Lack of Quality Data:

We have an abundance of data. But how much of it is even worth processing? Is all of it useful for the
business? What if you process wrong data and make decisions based on it? These are some challenges
of integration that every organization faces when it starts data integration.

Using low-quality data can result in long-term losses for an enterprise. How can this issue be solved?
There’s something called data quality management that lets you validate data much before it is added
to the warehouse. This saves you from moving unwanted data from its actual location to the data
warehouse. Your database will only house high-quality data that has been validated as genuine.
5. Numerous Duplicates in Data Pipeline:

Duplicate data is something no business can avoid. But having duplicates in the data warehouse will lead to long-term problems that will impact your business decisions. Hiring data integration consulting services will help you eliminate data silos by creating a comprehensive communication channel between the departments.

When the employees share data across the departments, it will naturally reduce the need to create and save duplicate data. Standardized, validated data will also ensure that the employees know which data to consider. Investing in technology is vital. But ensuring transparency in the entire system is equally important.

6. Lack of Understanding of Available Data:

What use is data if the employees don’t understand it or know what to do with it? Not every employee
will have the same skills. That makes it hard for some of them to understand data. For example, the IT
department would be proficient in discussing data using technical terms. The same cannot be said for
employees from the finance or HR departments. They use different terms related to their fields of
expertise.

The consulting companies that are into data integration service offerings help create a common
vocabulary that can be used throughout the enterprise. It’s like a glossary shared with every employee
to help them understand what a certain term or phrase means. This will reduce miscommunication and
mistakes caused due to the wrong understanding of existing data.

7. Existing System Customizations:

It’s most likely that your existing systems have already been customized to suit the specific business
needs. Now, bringing more tools and software can complicate things if they are not compatible with
each other. One of the data integration features you should invest in is the ability to provide multiple
deployment options.

Whether it is on-premises or on the cloud platforms, whether it is linking with an existing system or
building a new one to suit the data-driven model, data integration services can include ways to combine
different systems and bring them together on the same platform.

8. No Proper Planning and Approach to Data Integration :

Data integration is not something you decide and start implementing overnight. You will first need to
understand your business processes, create an environment for employees to communicate and learn,
and then start integrating data from different corners of the enterprise.

Lack of planning is one of the common data integration challenges in healthcare as data has to come
from numerous sources that include a lot of third-party entities. Everyone involved in the process needs
to know why data integration is taking place and how they can use the analytics to improve their
efficiency and productivity. Transparency and communication will solve the problem.

9. The Number of Systems and Tools Used in the Enterprise :

Most enterprises use multiple platforms based on the type of software the employees need. The same
goes for systems and tools that are used in different departments. The marketing team relies on
software that’s not used by the HR team. With so many systems to deal with, gathering data can be a
complex task. It needs cooperation from every employee.

An easy way to collect data from multiple systems and tools is by using a pre-configured integration
software that can work with almost any business setup. You don’t need to invest in different tools to
extract data from numerous sources.

10. No Data Security:

When anyone asks what all the challenges in data integration are, you will need to include data security
or the lack of it in the list. How many businesses have been attacked by cybercriminals in recent times?
Neither the industry giants nor small startups have been spared. Data leaks, data breaches, and data
corruption can make the enterprise vulnerable to any kind of cyberattack. And it could be weeks and
months before you recognize it.

Data integration services that offer end-to-end solutions will solve data security problems. They will
enhance the security systems in the business. This ensures that only authorized employees can access
the data warehouse to add, delete, or edit the information stored.

11. Extracting Valuable Insights from Data :

A complaint several businesses make is that they are unable to extract valuable insights after data
integration. How can data integration issues be avoided when there is no proper planning? There’s an
effective solution that enterprises don’t consider before investing in data integration. What’s that?
Planning.

We’ve mentioned this in our previous points. SMEs need to know what they want to achieve before
investing in any system. Unless the long-term goals are clear, you cannot decide the right strategy to
achieve the goals. You will need to choose analytical tools that can be integrated with the data
warehouse. This will ensure a continuous cycle in organizations where data is collected, processed,
analyzed, and reports are generated to help you improve your business.

Lecture 40:

Pattern Finding in a Genome:

Vocabulary:

 A pattern (keyword) is an ordered sequence of symbols
 Symbols of the pattern and the searched text are chosen from a predetermined finite set, called an alphabet (Σ)

Four Cases of Pattern Finding:

 Look for a perfect match
 Allow errors due to substitutions
 Allow errors due to insertions-deletions (InDels)
 Rank possible matches according to a weight function and keep matches above a certain threshold (e.g., p1 and p2 are two patterns of length 5, and W is the weight of complete patterns defined via a nucleotide-nucleotide weight function w( ))

Generalized Algorithm:

Goal: Finding all occurrences of a pattern in a text

Input: a pattern p = [p1…pn] of length n and a text t = [t1…tm] of length m

Output: an indication that pattern p exists in text t, or that it does not

Pattern searching algorithms search for specific sequences in strands of DNA, RNA, and proteins that have important biological meaning.
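A brute-force, perfect-match version of this generalized algorithm can be sketched in a few lines of Python; the DNA text and pattern below are invented for illustration:

```python
def naive_pattern_search(pattern, text):
    """Return every shift in `text` at which `pattern` occurs exactly."""
    n, m = len(pattern), len(text)
    occurrences = []
    for shift in range(m - n + 1):               # try every possible alignment of p against t
        if text[shift:shift + n] == pattern:
            occurrences.append(shift)
    return occurrences

# Example over the DNA alphabet {A, C, G, T}
print(naive_pattern_search("ACG", "TACGACGT"))   # -> [1, 4]
```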

Lecture 41:

Brute force approach :

A brute force approach is an approach that tries all the possible solutions in order to find a satisfactory solution to a given problem. The brute force algorithm tries out all the possibilities until a satisfactory solution is found.

Such an algorithm can be of two types:

Optimizing: In this case, the best solution is found. To find the best solution, it may either find all the
possible solutions to find the best solution or if the value of the best solution is known, it stops finding
when the best solution is found. For example: Finding the best path for the travelling salesman problem.
Here best path means that travelling all the cities and the cost of travelling should be minimum.

Satisficing: It stops finding the solution as soon as a satisfactory solution is found. For example, finding a travelling salesman path which is within 10% of optimal.

Brute-force algorithms often require exponential time. Various heuristics and optimizations can be used:

Heuristic: A rule of thumb that helps you decide which possibilities to look at first.

Optimization: Certain possibilities are eliminated without exploring all of them.

Advantages of a brute-force algorithm :

The following are the advantages of the brute-force algorithm:

 This algorithm finds all the possible solutions, and it also guarantees that it finds the correct
solution to a problem.
 This type of algorithm is applicable to a wide range of domains.
 It is mainly used for solving simpler and small problems.
 It can be considered a comparison benchmark to solve a simple problem and does not require
any particular domain knowledge.

Disadvantages of a brute-force algorithm:

The following are the disadvantages of the brute-force algorithm:


 It is an inefficient algorithm as it requires solving each and every state.
 It is a very slow algorithm to find the correct solution as it solves each state without considering
whether the solution is feasible or not.
 The brute force algorithm is neither constructive nor creative as compared to other algorithms.

Lecture 42:

1) Knuth-Morris-Pratt Algorithm avoids ________

Backtracking on the string S

Knuth-Morris-Pratt Algorithm:

Introduction :

A linear-time algorithm for string matching. It does not involve backtracking on the string S, i.e., repetitive comparison of nucleotide residues.

Components:

 The Prefix Function Π
 The KMP Matcher

The Prefix Function Π:

Encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern 'p', which in turn enables avoiding backtracking on the string 'S'.

The KMP Matcher:

Given: string ‘S’, pattern ‘p’ and prefix function ‘Π’

Find: the occurrence of ‘p’ in ‘S’

Return: the number of shifts of ‘p’ after which occurrence is found
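A compact Python sketch of both components follows; it is an illustrative implementation of the standard KMP technique, with `pi` playing the role of the prefix function Π and the example DNA strings invented:

```python
def prefix_function(p):
    """pi[i] = length of the longest proper prefix of p[:i+1] that is also a suffix of it."""
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[k] != p[i]:
            k = pi[k - 1]            # fall back to the next shorter matching border
        if p[k] == p[i]:
            k += 1
        pi[i] = k
    return pi

def kmp_matcher(s, p):
    """Return the shifts of p at which p occurs in s, never backtracking on s."""
    pi, matches, k = prefix_function(p), [], 0
    for i, ch in enumerate(s):
        while k > 0 and p[k] != ch:
            k = pi[k - 1]            # reuse prefix knowledge instead of re-comparing residues of s
        if p[k] == ch:
            k += 1
        if k == len(p):              # a full occurrence of p has been found
            matches.append(i - len(p) + 1)
            k = pi[k - 1]
    return matches

print(kmp_matcher("TACGACGT", "ACG"))   # -> [1, 4]
```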
