If a periodic signal is decomposed into five sine waves with frequencies of 100, 300, 500, 700, and 900
Hz, what is its bandwidth? Explanation: The bandwidth is the difference between the highest and lowest
frequencies: 900 - 100 = 800 Hz.
Composite Signals:
For data communication a simple sine wave is not useful; what is used is a composite signal, which is a
combination of many simple sine waves. According to the French mathematician Jean-Baptiste Fourier,
any composite signal is a combination of simple sine waves with different amplitudes, frequencies, and
phases.
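As a sketch of the idea above, the composite signal from the bandwidth question can be built by summing its five sine components. The amplitudes below are illustrative assumptions, not values from the lecture:

```python
import math

freqs = [100, 300, 500, 700, 900]          # Hz, from the question above
amplitudes = [1.0, 0.5, 0.3, 0.2, 0.1]     # assumed for illustration

def composite(t):
    """Value of the composite signal at time t (seconds): a sum of sine waves."""
    return sum(a * math.sin(2 * math.pi * f * t)
               for a, f in zip(amplitudes, freqs))

# Bandwidth of the composite signal = highest frequency - lowest frequency
bandwidth = max(freqs) - min(freqs)   # 900 - 100 = 800 Hz
```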
Bandwidth and spectrum are common terms in disciplines such as telecommunications and networking.
The difference between bandwidth and spectrum is that bandwidth is the maximum rate of data transfer
within a certain period of time, while a spectrum is a collection of waves with particular frequencies
arranged in order.
Bandwidth measures how much data can flow through a specific connection at one time.
When visualizing bandwidth, it may help to think of a network connection as a tube and each bit
of data as a grain of sand. If you pour a large amount of sand into a skinny tube, it will take a long time
for the sand to flow through it. If you pour the same amount of sand through a wide tube, the sand will
finish flowing through the tube much faster. Similarly, a download will finish much faster when you have
a high-bandwidth connection rather than a low-bandwidth connection.
Lecture 16:
Network security
Network security is the protection of access to files and directories in a computer network against
hacking, misuse, and unauthorized changes to the system. An example of a network security measure is
an antivirus system. Network security also covers the authorization of access to data in a network,
which is controlled by the network administrator.
Three types of network security components can be called upon: hardware, software, and cloud
security components.
Firewall
Network Segmentation
Remote Access VPN
Zero Trust Network Access (ZTNA)
Email Security
Data Loss Prevention (DLP)
Intrusion Prevention Systems (IPS)
Sandboxing
Hyperscale Network Security
Cloud Network Security
Lecture 17:
1) __________________ are essential as basic building blocks for a system that will organize
recorded information collected by libraries, archives, museums, etc.
Retrieval tools
An information retrieval (IR) system is a set of algorithms that determine how relevant stored
documents are to searched queries. In simple words, it works to sort and rank documents based on the
queries of a user. Query and document text are represented uniformly to enable document
accessibility.
1. Classical IR Model — It is designed upon basic mathematical concepts and is the most widely used of
the IR models. Classical information retrieval models can be implemented with ease. Examples include
the Boolean, vector-space, and probabilistic IR models. In the simplest (Boolean) form, retrieval
depends only on whether documents contain the defined set of query terms, with no ranking or grading
of any kind. The different classical IR models take document representation, query representation, and
the retrieval/matching function into account in their modelling.
2. Non-Classical IR Model — These differ from classical models in that they are built upon principles
such as propositional logic. Examples of non-classical IR models include Information Logic, Situation
Theory, and Interaction models.
3. Alternative IR Model — These take the principles of the classical IR model and enhance them to
create more functional models, such as the Cluster model, alternative set-theoretic models (e.g., the
Fuzzy Set model), the Latent Semantic Indexing (LSI) model, and alternative algebraic models (e.g., the
Generalized Vector Space Model).
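The Boolean variant of the classical model described above can be sketched in a few lines: a document matches only if it contains every query term, and results are unranked. The toy documents and query are made up for illustration:

```python
# Classical Boolean IR sketch: a document is retrieved iff it contains
# all query terms; there is no ranking of the results.
docs = {
    1: "gene expression in yeast",
    2: "protein structure prediction",
    3: "yeast protein interaction networks",
}

def boolean_and_search(query, documents):
    """Return ids of documents containing every query term (no ranking)."""
    terms = set(query.lower().split())
    return sorted(doc_id for doc_id, text in documents.items()
                  if terms <= set(text.lower().split()))

print(boolean_and_search("yeast protein", docs))  # [3]
```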
Lecture 18:
Information retrieval (IR) is the field of computer science that deals with the processing of documents
containing free text, so that they can be rapidly retrieved based on keywords specified in a user's query.
1. indexing system – indexing and searching methods and procedures (an indexing system can be
human or automated);
2. defined set of queries – which are input into the system, with or without the involvement of a human
searcher; and
3. evaluation criteria – specified measures by which each system is evaluated, for example ‘precision’
and ‘recall’ as measures of relevance. Recall is the proportion of relevant documents in the collection
retrieved in response to the query. Precision is the proportion of relevant documents amongst the set of
documents retrieved in response to the query.
Lecture 19:
Set-theoretic models: represent documents as sets of words or phrases. Similarities are usually derived
from set-theoretic operations on those sets. Common models are:
Standard Boolean
Extended Boolean
Fuzzy retrieval
Algebraic models: represent documents & queries as vectors, matrices, or tuples. Similarity of query
vector & document vector is represented as a scalar value.
Vector space
Extended Boolean
LSI
Probabilistic Models: Treat the process of document retrieval as a probabilistic inference. Similarities
are computed as probabilities that a document is relevant for a given query.
Probabilistic theorems
Bayes' theorem
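A minimal sketch of the algebraic (vector space) idea above: documents and queries become term-count vectors, and their similarity is a scalar value, the cosine of the angle between them. The toy texts are assumptions for illustration:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

query = "yeast protein"
doc = "yeast protein interaction networks"
score = cosine_similarity(query, doc)   # a scalar similarity value in [0, 1]
```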
Lecture 20:
Precision and recall are the measures used in the information retrieval domain to measure how well an
information retrieval system retrieves the relevant documents requested by a user. The measures are
defined as follows:
Precision = Total number of documents retrieved that are relevant/Total number of documents that
are retrieved.
Recall = Total number of documents retrieved that are relevant/Total number of relevant documents in
the database.
Fallout: the proportion of non-relevant items that have been retrieved in a given search.
Generality: the proportion of relevant items in the collection, i.e., the fraction of all documents in the
database that are relevant to a given query.
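The measures above can be computed directly from sets of document ids. The collection size, relevant set, and retrieved set below are made-up illustration values:

```python
# Toy illustration of the Lecture 20 measures (all id sets are invented).
collection = set(range(1, 21))   # 20 documents in the database
relevant   = {1, 2, 3, 4, 5}     # documents relevant to the query
retrieved  = {1, 2, 3, 10, 11}   # documents returned by the system

tp = relevant & retrieved                         # relevant AND retrieved
precision  = len(tp) / len(retrieved)             # 3/5 = 0.6
recall     = len(tp) / len(relevant)              # 3/5 = 0.6
fallout    = len(retrieved - relevant) / len(collection - relevant)  # 2/15
generality = len(relevant) / len(collection)      # 5/20 = 0.25
```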
Lecture 21:
These systems allow text searching of multiple molecular biology databases and provide links to
relevant information for entries that match the search criteria. The three systems differ in the
databases they search and the links they have to other information.
These are databases consisting of biological data such as protein sequences, molecular structures, DNA
sequences, etc., in an organized form.
Several computer tools exist to manipulate the biological data (update, delete, insert, etc.).
Scientists and researchers from all over the world enter their experimental data and results in a
biological database so that they are available to a wider audience.
Biological databases are free to use and contain a huge collection of a variety of biological data.
The most widely used interface for the retrieval of information from biological databases is the NCBI
Entrez system. Entrez capitalizes on the fact that there are preexisting, logical relationships between the
individual entries found in numerous public databases.
SRS (EBI and DDBJ): SRS is a data retrieval system that integrates heterogeneous databanks in molecular
biology and genome analysis. There are currently several dozen servers worldwide that provide access
to over 300 different databanks via the World Wide Web.
Lecture 22:
IR in Bioinformatics
Data Stores
PubMed: a free search engine accessing the MEDLINE database of references & abstracts on life
sciences & biomedical topics
PubMed Central: archives publicly accessible full-text scholarly articles within the biomedical & life
sciences journal literature
BioMed Central: publishes over 200 scientific journals and describes itself as the first & largest open
access science publisher
Lecture 23:
Search Engines:
database creation
personalized filtering of new publications
personalized adaptation and discovery of interesting research and trends
RefMed:
A search engine for PubMed that provides relevance ranking; it induces a new ranking according to the
user's relevance judgments.
MedlineRanker:
A search engine for Medline; it learns the most discriminative words by comparing a set of abstracts
provided by the user with the rest of Medline.
Lecture 24:
When performing classification predictions, there are four types of outcomes that can occur.
True positives are when you predict an observation belongs to a class and it actually does
belong to that class.
True negatives are when you predict an observation does not belong to a class and it
actually does not belong to that class.
False positives occur when you predict an observation belongs to a class when in reality it
does not.
False negatives occur when you predict an observation does not belong to a class when in
fact it does.
The three main metrics used to evaluate a classification model are accuracy, precision, and recall.
Accuracy is defined as the percentage of correct predictions for the test data. It can be calculated easily
by dividing the number of correct predictions by the number of total predictions.
Precision is defined as the fraction of relevant examples (true positives) among all of the examples
which were predicted to belong in a certain class.
Recall is defined as the fraction of examples which were predicted to belong to a class with respect to all
of the examples that truly belong in the class.
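The three metrics above can be computed directly from the four outcome counts; the counts below are invented for illustration:

```python
# Example confusion-matrix counts (made up for illustration).
tp, tn, fp, fn = 40, 45, 5, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                   # true positives / predicted positives
recall    = tp / (tp + fn)                   # true positives / actual positives
print(accuracy, precision, recall)
```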
Lecture 25:
GoPubMed:
XplorMed:
EBIMed:
Lecture 26:
Lecture 34:
Clustering is a method which groups data points having similar properties and/or features, while data
points in different groups should have highly dissimilar properties and/or features.
Hierarchical clustering is an unsupervised machine learning algorithm used to group unlabeled datasets
into clusters; it is also known as hierarchical cluster analysis (HCA). In this algorithm, we develop the
hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
The working of the AHC algorithm can be explained using the below steps:
Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number of
clusters will also be N.
Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there will now
be N-1 clusters
Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be
N-2 clusters.
Step-4: Repeat Step 3 until only one cluster is left.
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the
clusters as per the problem.
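The steps above can be sketched as a small merge loop. This toy version works on 1-D points with single linkage (cluster distance = closest pair of points), which is one of several possible linkage choices:

```python
def agglomerate(points):
    """Agglomerative clustering on 1-D points, single linkage."""
    clusters = [[p] for p in points]        # Step 1: one cluster per point
    merges = []
    while len(clusters) > 1:                # Steps 2-4: repeatedly merge closest pair
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((sorted(merged), d))  # record merge order/height for the dendrogram
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerate([1.0, 1.5, 5.0])       # nearby points merge first
```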
Lecture 35:
Machine learning is an application of AI that enables systems to learn and improve from experience
without being explicitly programmed. Machine learning focuses on developing computer programs that
can access data and use it to learn for themselves.
Lecture 36:
1) The software/tool that is used to manage the database and its users is called
DBMS
Supervised learning is a machine learning approach that’s defined by its use of labeled datasets. These
datasets are designed to train or “supervise” algorithms into classifying data or predicting outcomes
accurately. Using labeled inputs and outputs, the model can measure its accuracy and learn over time.
Supervised learning can be separated into two types of problems when data mining: classification and
regression:
Classification problems use an algorithm to accurately assign test data into specific categories,
such as separating apples from oranges. Or, in the real world, supervised learning algorithms can
be used to classify spam in a separate folder from your inbox. Linear classifiers, support vector
machines, decision trees and random forest are all common types of classification algorithms.
Regression is another type of supervised learning method that uses an algorithm to understand
the relationship between dependent and independent variables. Regression models are helpful
for predicting numerical values based on different data points, such as sales revenue projections
for a given business. Some popular regression algorithms are linear regression, logistic
regression and polynomial regression.
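As a sketch of the regression idea above, an ordinary least squares fit of a line y = w*x + b on a tiny made-up dataset (think ad spend vs. sales revenue):

```python
# Toy 1-D linear regression via ordinary least squares (data is invented).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.0, 9.1]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# Slope: covariance of x,y divided by variance of x
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x                     # intercept

def predict(x):
    """Numerical prediction for a new data point."""
    return w * x + b
```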
Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled data sets.
These algorithms discover hidden patterns in data without the need for human intervention (hence,
they are “unsupervised”).
Unsupervised learning models are used for three main tasks: clustering, association and dimensionality
reduction:
Clustering is a data mining technique for grouping unlabeled data based on their similarities or
differences. For example, K-means clustering algorithms assign similar data points into groups,
where the K value represents the number of groups (and thus the granularity). This technique is helpful
for market segmentation, image compression, etc.
Association is another type of unsupervised learning method that uses different rules to find
relationships between variables in a given dataset. These methods are frequently used for
market basket analysis and recommendation engines, along the lines of “Customers Who
Bought This Item Also Bought” recommendations.
Dimensionality reduction is a learning technique used when the number of features (or
dimensions) in a given dataset is too high. It reduces the number of data inputs to a manageable
size while also preserving the data integrity. Often, this technique is used in the preprocessing
data stage, such as when autoencoders remove noise from visual data to improve picture
quality.
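The K-means assign-then-update cycle mentioned above can be sketched as follows; the 1-D data and K = 2 are illustrative assumptions:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Toy K-means on 1-D data: assign points to nearest centroid, recompute, repeat."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:                    # assignment step
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        centroids = [sum(g) / len(g) if g else centroids[i]   # update step
                     for i, g in enumerate(groups)]
    return sorted(centroids)

# Two obvious groups of points around 1 and around 9
centroids = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
```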
Reinforcement learning is a machine learning training method based on rewarding desired
behaviors and/or punishing undesired ones. In general, a reinforcement learning agent is able to
perceive and interpret its environment, take actions, and learn through trial and error.
Lecture 37:
Now, let's have a look at some of the popular applications of Supervised Learning:
Some use cases for unsupervised learning — more specifically, clustering — include:
Lecture 38:
Data integration is the practice of consolidating data from disparate sources into a single dataset. The
ultimate goal is to provide users with consistent access and delivery of data across the spectrum of
subjects and structure types, and to meet the information needs of all applications and business
processes. The data integration process is one of the main components of overall data management,
employed with increasing frequency as big data integration and the need to share existing data
continue to grow.
Database Management Systems (DBMS) are software systems used to store, retrieve, and run queries
on data. A DBMS serves as an interface between an end-user and a database, allowing users to create,
read, update, and delete data in the database.
Lecture 39:
1) Which one of the following formats is emerging as a standard for data transfer?
XML
The Most Common Data Integration Challenges
One of the most common business integration challenges is that data is not where it should be. When
data is scattered throughout the enterprise, it gets hard to bring it all together in one place. The risk of
missing a crucial part of data is always present. It could be hidden in secret files. An ex-employee could
have saved data in a different location and left without informing the peers. Or it could be any other
reason that results in the data being elsewhere.
It is suggested to use a data integration platform to gather and compile data in one place to overcome
the problem of not finding data where expected. Asking developers to work on it is time-consuming,
which leads to the next issue.
In today’s world, data needs to be processed in real-time if you want to get accurate and meaningful
insights. But if the developers manually complete the data integration steps, this is just not possible. It
will lead to a delay in data collection. By the time developers collect data from last week, there will be
this week’s left to deal with, and so on.
Automated data integration tools solve this problem effectively. These tools have been developed to
collect data in real-time without letting enterprises waste their valuable resources in the process.
Another of the common challenges of system integration is the multiple formats of data. The data saved
by the finance department will be in a format that's different from how the sales team presents its
data. Comparing and combining unstructured data from different formats is neither effective nor useful.
An easy solution to this is to use data transformation tools. These tools analyze the formats of data and
change them to a unified format before adding data to the central database. Some data integration and
business analytics tools already have this as a built-in feature. This reduces the number of errors you will
need to manually check and solve when collecting data.
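As an illustrative sketch only (the field names and formats are hypothetical, not from any specific tool), a transformation step might normalize records from two departments into one unified schema before loading them into the central database:

```python
# Hypothetical records in two departments' differing formats.
finance_row = {"TXN_DATE": "2023/04/01", "AMT": "1,200.50"}
sales_row   = {"date": "01-04-2023", "revenue": 1200.5}

def normalize_finance(row):
    """Convert a finance record to the unified {date, amount} schema."""
    y, m, d = row["TXN_DATE"].split("/")
    return {"date": f"{y}-{m}-{d}", "amount": float(row["AMT"].replace(",", ""))}

def normalize_sales(row):
    """Convert a sales record (day-month-year) to the unified schema."""
    d, m, y = row["date"].split("-")
    return {"date": f"{y}-{m}-{d}", "amount": float(row["revenue"])}

unified = [normalize_finance(finance_row), normalize_sales(sales_row)]
# Both records now share one schema and can be compared or combined.
```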
We have an abundance of data. But how much of it is even worth processing? Is all of it useful for the
business? What if you process wrong data and make decisions based on it? These are some challenges
of integration that every organization faces when it starts data integration.
Using low-quality data can result in long-term losses for an enterprise. How can this issue be solved?
There’s something called data quality management that lets you validate data much before it is added
to the warehouse. This saves you from moving unwanted data from its actual location to the data
warehouse. Your database will only house high-quality data that has been validated as genuine.
5. Numerous Duplicates in Data Pipeline:
Duplicate data is something no business can avoid. But having duplicates in the data warehouse will lead
to long-term problems that will impact your business decisions. Hiring data integration consulting
services will help you eliminate data silos by creating a comprehensive communication channel between
the departments.
When the employees share data across the departments, it will naturally reduce the need to create and
save duplicate data. Standardizing validated data will also ensure that the employees know which data
to consider. Investing in technology is vital. But ensuring transparency in the entire system is equally
important.
What use is data if the employees don’t understand it or know what to do with it? Not every employee
will have the same skills. That makes it hard for some of them to understand data. For example, the IT
department would be proficient in discussing data using technical terms. The same cannot be said for
employees from the finance or HR departments. They use different terms related to their fields of
expertise.
Consulting companies that offer data integration services help create a common
vocabulary that can be used throughout the enterprise. It’s like a glossary shared with every employee
to help them understand what a certain term or phrase means. This will reduce miscommunication and
mistakes caused due to the wrong understanding of existing data.
It’s most likely that your existing systems have already been customized to suit the specific business
needs. Now, bringing more tools and software can complicate things if they are not compatible with
each other. One of the data integration features you should invest in is the ability to provide multiple
deployment options.
Whether it is on-premises or on the cloud platforms, whether it is linking with an existing system or
building a new one to suit the data-driven model, data integration services can include ways to combine
different systems and bring them together on the same platform.
Data integration is not something you decide and start implementing overnight. You will first need to
understand your business processes, create an environment for employees to communicate and learn,
and then start integrating data from different corners of the enterprise.
Lack of planning is one of the common data integration challenges in healthcare as data has to come
from numerous sources that include a lot of third-party entities. Everyone involved in the process needs
to know why data integration is taking place and how they can use the analytics to improve their
efficiency and productivity. Transparency and communication will solve the problem.
Most enterprises use multiple platforms based on the type of software the employees need. The same
goes for systems and tools that are used in different departments. The marketing team relies on
software that’s not used by the HR team. With so many systems to deal with, gathering data can be a
complex task. It needs cooperation from every employee.
An easy way to collect data from multiple systems and tools is by using a pre-configured integration
software that can work with almost any business setup. You don’t need to invest in different tools to
extract data from numerous sources.
When anyone asks what all the challenges in data integration are, you will need to include data security
or the lack of it in the list. How many businesses have been attacked by cybercriminals in recent times?
Neither the industry giants nor small startups have been spared. Data leaks, data breaches, and data
corruption can make the enterprise vulnerable to any kind of cyberattack. And it could be weeks and
months before you recognize it.
Data integration services that offer end-to-end solutions will solve data security problems. They will
enhance the security systems in the business. This ensures that only authorized employees can access
the data warehouse to add, delete, or edit the information stored.
A complaint several businesses make is that they are unable to extract valuable insights after data
integration. How can data integration issues be avoided when there is no proper planning? There’s an
effective solution that enterprises don’t consider before investing in data integration. What’s that?
Planning.
We’ve mentioned this in our previous points. SMEs need to know what they want to achieve before
investing in any system. Unless the long-term goals are clear, you cannot decide the right strategy to
achieve the goals. You will need to choose analytical tools that can be integrated with the data
warehouse. This will ensure a continuous cycle in organizations where data is collected, processed,
analyzed, and reports are generated to help you improve your business.
Lecture 40:
Vocabulary:
Symbols of the pattern and the searched text are chosen from a predetermined finite set, called an
alphabet (Σ).
Insertions and deletions (InDels).
Rank possible matches according to a weight function and keep matches above a certain threshold.
Generalized Algorithm:
Input:
Pattern searching algorithms search for specific sequences in strands of DNA, RNA, and proteins that
have important biological meaning.
Lecture 41:
A brute force approach finds all the possible solutions in order to reach a satisfactory solution to a
given problem. The brute force algorithm tries out all the possibilities until a satisfactory solution is
found.
Optimizing: In this case, the best solution is found. To find the best solution, it may either find all the
possible solutions to find the best solution or if the value of the best solution is known, it stops finding
when the best solution is found. For example: Finding the best path for the travelling salesman problem.
Here best path means that travelling all the cities and the cost of travelling should be minimum.
Satisficing: It stops searching as soon as a satisfactory solution is found. For example, finding a
travelling salesman path which is within 10% of optimal.
Often brute force algorithms require exponential time. Various heuristics and optimizations can be
used. Heuristic: a rule of thumb that helps you decide which possibilities to look at first.
This algorithm finds all the possible solutions, and it also guarantees that it finds the correct
solution to a problem.
This type of algorithm is applicable to a wide range of domains.
It is mainly used for solving simpler and small problems.
It can be considered a comparison benchmark to solve a simple problem and does not require
any particular domain knowledge.
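A classic instance of the brute force approach, in the spirit of the pattern searching theme of these lectures, is naive string matching: try every shift of the pattern against the text and compare character by character (the DNA strings below are made up):

```python
def brute_force_search(text, pattern):
    """Return every index where pattern occurs in text, trying all shifts."""
    matches = []
    for shift in range(len(text) - len(pattern) + 1):
        # Compare the pattern against the text at this shift, one symbol at a time
        if all(text[shift + k] == pattern[k] for k in range(len(pattern))):
            matches.append(shift)
    return matches

print(brute_force_search("ACGTACGT", "ACG"))  # [0, 4]
```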
Lecture 42:
Knuth-Morris-Pratt Algorithm:
Introduction:
Does not involve backtracking on the text string s, i.e., it avoids repetitive comparison of nucleotide
residues.
Components:
A prefix (failure) function encapsulates knowledge about how the pattern matches against shifts of
itself.
This information can be used to avoid useless shifts of the pattern ‘p’.
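A sketch of the components described above: the prefix (failure) function encodes how the pattern matches against shifts of itself, and the search then never moves backwards in the text string s. The example DNA strings are made up:

```python
def prefix_function(p):
    """pi[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = pi[k - 1]          # fall back using the pattern's self-matches
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(s, p):
    """Return all indices where pattern p occurs in text s, without backtracking on s."""
    pi, matches, k = prefix_function(p), [], 0
    for i, c in enumerate(s):
        while k > 0 and c != p[k]:
            k = pi[k - 1]          # useless shifts of p are skipped here
        if c == p[k]:
            k += 1
        if k == len(p):
            matches.append(i - len(p) + 1)
            k = pi[k - 1]
    return matches

print(kmp_search("ACACGTACACG", "ACACG"))  # [0, 6]
```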