
Data Mining

Oral Questions LP-II


1. What are the different Data Warehousing Schemas?
A schema is a logical description of a database in which fact
and dimension tables are joined in a logical manner. A data
warehouse is typically modeled using the Star, Snowflake, or Fact
Constellation schema.

2. Explain Star Schema vs. Snowflake Schema.


S.NO | Star Schema | Snowflake Schema
1. | Contains fact tables and dimension tables. | Contains fact tables, dimension tables, and sub-dimension tables.
2. | It is a top-down model. | It is a bottom-up model.
3. | Uses more space. | Uses less space.
4. | Queries take less time to execute. | Queries take more time to execute than in a star schema.
5. | Normalization is not used. | Both normalization and denormalization are used.
6. | Its design is very simple. | Its design is complex.
7. | Query complexity is low. | Query complexity is higher than in a star schema.
8. | It is very simple to understand. | It is difficult to understand.
9. | It has fewer foreign keys. | It has more foreign keys.
10. | It has high data redundancy. | It has low data redundancy.
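
To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product_star, dim_product_snow, dim_category) and the sample rows are hypothetical, chosen only for illustration. It suggests why a snowflake layout usually needs more joins (and foreign keys) than a star layout for the same question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star schema: the product dimension is denormalized (category stored inline).
CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, name TEXT, category_name TEXT);

-- Snowflake schema: the same dimension is normalized into a sub-dimension.
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id));

-- One fact table, shared by both layouts in this sketch.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);

INSERT INTO dim_product_star VALUES (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery');
INSERT INTO dim_category VALUES (10, 'Stationery');
INSERT INTO dim_product_snow VALUES (1, 'Pen', 10), (2, 'Notebook', 10);
INSERT INTO fact_sales VALUES (100, 1, 2.5), (101, 2, 4.0), (102, 1, 2.5);
""")

# Star: one join from the fact table to the dimension table.
star_total = conn.execute("""
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_id = f.product_id
    GROUP BY p.category_name
""").fetchall()

# Snowflake: an extra join through the sub-dimension table.
snow_total = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
""").fetchall()

print(star_total, snow_total)
```

Both queries return the same total; the star version repeats the category name on every product row (more redundancy), while the snowflake version stores it once but needs a second join.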

3. What are the responsibilities of a data analyst?
Data analysts work with data to help their organizations make
better business decisions. Using techniques from a range of
disciplines, including computer programming, mathematics, and
statistics, data analysts draw conclusions from data to describe,
predict, and improve business performance. They form the core of
any analytics team and tend to be generalists versed in the methods
of mathematical and statistical analysis.

4. List some best practices for data cleaning.

• Removal of Unwanted Observations

One of the main goals of data cleansing is to make sure the dataset is free of
unwanted observations, so this is usually the first step of data cleaning. Unwanted
observations are of two types: duplicates and irrelevant observations.

• Duplicate Observations

A record is a duplicate if it occurs more than once in a dataset. This usually
arises when the dataset is created by combining data from two or more sources.

It can also occur in other cases, such as when a respondent makes more than one
submission to a survey, or through an error during data entry.

• Irrelevant Observations

Irrelevant observations are those that don’t fit the specific problem you’re
trying to solve, such as price data when you are only dealing with quantity.

For example, if you were building a model for prices of apartments in an estate, you
don’t need data showing the number of occupants of each house. Irrelevant
observations mostly occur when data is generated by scraping from another data
source.

• Fix Data Structure

After removing unwanted observations, the next step is to make sure that the
remaining observations are well structured. Structural errors may occur during data
transfer, through a slight human mistake or carelessness in data entry.

Things to look out for when fixing data structure include typographical errors,
grammatical mistakes, and so on. Structural fixes are mostly
concerned with categorical data.
Here, we correct misspelled words and shorten category headings that are too long.
This matters because long category headings may not be fully shown on a
graph.
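
The cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration: the rows, the field names (id, city, quantity, price), and the corrections table are hypothetical, and the function hard-codes "city" as the one categorical field to tidy.

```python
# Hypothetical survey rows; field names and values are illustrative only.
rows = [
    {"id": 1, "city": "pune ", "quantity": 3, "price": 10.0},
    {"id": 1, "city": "pune ", "quantity": 3, "price": 10.0},  # exact duplicate
    {"id": 2, "city": "Pune", "quantity": 5, "price": 12.0},
    {"id": 3, "city": "Mumbay", "quantity": 2, "price": 9.0},  # misspelled label
]

def clean(rows, keep_fields, corrections):
    """Drop duplicate rows, drop irrelevant fields, and fix category labels."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:          # duplicate observation: skip it
            continue
        seen.add(key)
        kept = {field: row[field] for field in keep_fields}  # drop irrelevant fields
        city = kept["city"].strip().title()                  # normalize spacing and case
        kept["city"] = corrections.get(city, city)           # fix known misspellings
        cleaned.append(kept)
    return cleaned

cleaned = clean(
    rows,
    keep_fields=["id", "city", "quantity"],  # price is irrelevant to a quantity analysis
    corrections={"Mumbay": "Mumbai"},
)
for row in cleaned:
    print(row)
```

Three rows remain: the duplicate is gone, the price field is dropped, and the city labels are consistent.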

5. What is data cleansing?


Data cleansing or data cleaning is the process of detecting and correcting (or
removing) corrupt or inaccurate records from a record set, table, or database and
refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the
data and then replacing, modifying, or deleting the dirty or coarse data.

5. List some common problems faced by data analysts.

Data Understanding - There are normally multiple databases, maintained by
data engineers for different purposes and with different access levels. As a
data analyst you should have a sound understanding of the data fields. A
database can contain several tables with hundreds of fields/columns, and you
should be able to fetch the required information.

Data Security - You have to ensure that only parties entitled to view the
data have permission and access to it. You don’t want sales data to be
published widely, or any unnecessary access granted to anybody within or
outside the organization.

Resistance/Interference from senior management - As a data analyst you
have to work closely with the top management of the organization, and it can
be difficult to convey the message or intent of your findings. There is
often resistance to accepting unpleasant findings, because your
analysis may prompt changes in the way things have been working, which
makes many people uncomfortable. Before publishing your recommendations and
findings, always be prepared for detailed questioning. It
can be tiring and may lead to nothing.

Technical challenges - Many times a data analyst does not have enough
access to the data to work with. Working with the data
engineering team or the DB owner can be challenging,
as several justifications are required to get the necessary access.
There are many other technical challenges as well.

Visual Changes - This has happened to me a few times. There are certain
color palettes that are specific to your company or brand. You may have
created a great-looking dashboard or graphs, but sometimes you will have to
tone them down so that they align with the existing template that is already
familiar. Yes, sometimes others will tell you what colors to use, and that is all.
6. List some of the best tools for data analysis.

Microsoft Power BI. ...


SAP BusinessObjects. ...
Sisense. ...
TIBCO Spotfire. ...
Thoughtspot. ...
Qlik. ...
SAS Business Intelligence. ...
Tableau.

7. What is the difference between Supervised and Unsupervised Learning?

Supervised and unsupervised learning are machine learning paradigms used to
solve a class of tasks by learning from experience, as judged by a performance
measure. They differ mainly in that supervised learning learns a mapping from
inputs to known (labeled) outputs. Unsupervised learning, by contrast, does not
aim to produce a given output for a particular input; instead, it discovers
patterns in unlabeled data.

Both techniques are used in various applications, such as artificial neural
networks, which are data processing systems containing a huge number of highly
interconnected processing elements.

7. What are the differences and similarities between the K-means and KNN
algorithms?

K-NN is a supervised machine learning algorithm while K-means is
an unsupervised machine learning algorithm.

K-NN is a classification or regression algorithm while K-means is a
clustering algorithm.

K-NN is a lazy learner while K-means is an eager learner. An eager learner
fits a model in a training step, but a lazy learner has no training phase.

Both rely on a distance measure (typically Euclidean distance), so both
generally perform better when all features are on the same scale.
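
The lazy/eager and supervised/unsupervised contrast can be sketched in plain Python. This is a simplified illustration, not a production implementation: the function names, the sample points, and the starting centroids are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """K-NN (supervised, lazy): no training step; labeled points are scanned per query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans(points, centroids, iterations=10):
    """K-means (unsupervised, eager): fits centroids to unlabeled points up front."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

labeled = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((8.0, 8.0), "b"), ((9.0, 9.5), "b")]
unlabeled = [point for point, _ in labeled]

predicted = knn_predict(labeled, (2.0, 1.0), k=3)                      # uses the labels
fitted = kmeans(unlabeled, centroids=[(0.0, 0.0), (10.0, 10.0)])       # ignores the labels
print(predicted, fitted)
```

Note that both functions call the same Euclidean distance (math.dist), which is why feature scaling matters for both.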

8. What is Euclidean distance? Explain with a suitable example.
It is simply the ordinary straight-line distance between two points.
It is one of the most widely used distance measures in cluster analysis;
K-means is one algorithm that uses it. Mathematically, it is the square
root of the sum of squared differences between the coordinates of two
objects.
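
As a worked example, the definition can be written directly in Python. The function name euclidean and the sample points are chosen here for illustration only.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance between (1, 2) and (4, 6): sqrt(3^2 + 4^2) = sqrt(25) = 5
d = euclidean((1, 2), (4, 6))
print(d)
```

The standard library's math.dist (Python 3.8 and later) computes the same quantity.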

9. What is Hamming distance? Explain with a suitable example.
The Hamming distance between two strings of equal length is the number of
positions at which the corresponding symbols differ. For example, take the
text string “hello world” and contrast it with another text string, “herra poald.”
There are five places along the corresponding strings where the letters are
different.
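
The count in this example can be checked with a short Python sketch (the function name hamming is an illustrative choice):

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

distance = hamming("hello world", "herra poald")
print(distance)  # 5: the letters differ at positions 2, 3, 4, 6 and 8 (counting from 0)
```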

11. What is Chi-Square Distance? Explain with a suitable example.

Chi-square distance is a statistical measure generally used to quantify how
similar two feature vectors or matrices (for example, histograms) are. It is
used in many applications, such as similar-image retrieval, image texture
analysis, and feature extraction.
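
The notes give no formula, so the sketch below assumes one common definition for histograms, d(x, y) = 1/2 * sum((x_i - y_i)^2 / (x_i + y_i)); other variants exist. The function name and the sample histograms are illustrative.

```python
def chi_square_distance(x, y):
    """Chi-square distance between two non-negative histograms (assumed definition)."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

same = chi_square_distance([1, 2, 3], [1, 2, 3])   # identical histograms
far = chi_square_distance([4, 0], [0, 4])          # no overlap at all
print(same, far)
```

Identical histograms give 0, and the value grows as the histograms diverge; here the second pair gives 0.5 * (16/4 + 16/4) = 4.0.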

10. What are different types of Clustering?

Data Mining Clustering Methods

• Partitioning Clustering Method. In this method, let us say that “m” partitions are made on
the “p” objects of the database. ...
• Hierarchical Clustering Methods. ...
• Density-Based Clustering Method. ...
• Grid-Based Clustering Method. ...
• Model-Based Clustering Methods. ...
• Constraint-Based Clustering Method.

11. What is the Weka tool? Explain the steps to perform clustering
on a sample data set.

WEKA (Waikato Environment for Knowledge Analysis) is an open-source suite of
machine learning and data mining tools. Its SimpleKMeans algorithm uses the
Euclidean distance measure to compute distances between instances and
clusters. To perform clustering, select the "Cluster" tab in the Explorer and
click on the "Choose" button. This opens a drop-down list of
available clustering algorithms.

12. Explain Association Rule


Association rule learning is a rule-based machine learning method for
discovering interesting relations between variables in large
databases. It is intended to identify strong rules discovered in
databases using some measures of interestingness.

13. What is the Application of A-Priori algorithm?


The Apriori algorithm is a classical algorithm in data mining. It is used
for mining frequent itemsets and relevant association rules. It is
designed to operate on a database containing a large number of transactions,
for instance, items bought by customers in a store.
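
The level-wise idea behind Apriori can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' optimized algorithm: the function name, the baskets, and the 0.5 support threshold are all made up for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return frequent

baskets = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "jelly"},
    {"shampoo", "conditioner"},
]
frequent = apriori(baskets, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these baskets, five itemsets are frequent; {peanut butter, jelly} appears in only one of four baskets, so it is dropped and no three-item candidate survives the prune step.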

14. What is Market Basket Analysis? Explain with a suitable example.
In market basket analysis (also called association analysis or
frequent itemset mining), you analyze purchases that commonly
happen together. For example, people who buy bread and peanut
butter also tend to buy jelly, and people who buy shampoo might also buy
conditioner.

15. Who proposed the Apriori algorithm?

Rakesh Agrawal and Ramakrishnan Srikant proposed the Apriori algorithm in
their 1994 paper "Fast Algorithms for Mining Association Rules".

16. What is minimum support and minimum confidence?


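Minimum support is the threshold used to find all frequent itemsets in a data set:
an itemset is frequent if the fraction of transactions containing it is at least
the minimum support. These frequent itemsets and the minimum confidence constraint
are then used to form the rules: a rule X -> Y is kept only if its confidence,
support(X and Y together) / support(X), is at least the minimum confidence.
Finding all frequent itemsets in a data set is a complex
procedure since it can involve analyzing all possible itemsets.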
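
A small Python sketch shows how the two measures are computed for a single rule. The baskets and the function names are hypothetical, used only to illustrate the arithmetic.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "jelly"},
    {"shampoo", "conditioner"},
]

pair_support = support(baskets, {"bread", "peanut butter"})       # 2 of 4 baskets
pb_to_bread = confidence(baskets, {"peanut butter"}, {"bread"})   # 0.5 / 0.5
bread_to_pb = confidence(baskets, {"bread"}, {"peanut butter"})   # 0.5 / 0.75
print(pair_support, pb_to_bread, bread_to_pb)
```

With a minimum confidence of, say, 0.8, the rule peanut butter -> bread (confidence 1.0) would be kept, while bread -> peanut butter (about 0.67) would be rejected, even though both rules share the same support.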

17. What is the use of the Tokenize operator?

Tokenize is an operator that splits the text of a document
into a sequence of words [14]. The purpose of this subprocess is
to separate the words of a document so that the list of words can be used
by the next subprocess.
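
The notes do not name the tool; an operator called Tokenize exists in RapidMiner's text processing extension, so that is assumed here. The Python sketch below is not that operator, only an illustration of the idea, splitting on every non-letter character; the function name and sample sentence are made up.

```python
import re

def tokenize(text):
    """Split text into word tokens at every run of non-letter characters."""
    return [token for token in re.split(r"[^A-Za-z]+", text) if token]

tokens = tokenize("Data mining finds patterns; text mining finds them in documents.")
print(tokens)
```

The resulting list of words is what a later step, such as stop-word filtering, would consume.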
20. What are the different modes of the Tokenize operator?

21. How to use the Read Document operator?

22. Why do we use Filter token and Filter stop word?

23. How to use the Filter Class operator?

STQA Questions and Answers


Q 1) What is the difference between Quality Assurance, Quality Control and
testing?
Quality Assurance is the process of planning and defining the way of monitoring and
implementing the quality (test) processes within a team and the organization. This method
basically defines and sets the quality standards of the projects.
Quality Control is the process of finding defects and providing suggestions to improve the
quality of the software. The methods used by Quality Control are usually established by
quality assurance. It is the primary responsibility of the testing team to implement quality
control.
Testing is the process of finding defects/bugs. It validates whether the software built by
the development team meets the requirements set by the user and the standards set by
the organization. In testing the main focus is on finding bugs and the testing team acts as
a quality gatekeeper.
Q 2) When do you think QA activities should start?
QA activities should start at the beginning of the project. The earlier they start, the
easier it is to set the standard for achieving quality. Cost, time, and
effort become much harder to manage when QA activities are delayed.

Q 3) What is the difference between Test Plan and Test Strategy?


A Test Plan describes how testing should be performed for a particular
application within a project, whereas a Test Strategy sits at a higher level. It is mostly
created by the Project Manager and lays out the overall approach to testing
for the entire project.

Q 4) Can you explain the software testing life cycle?


Software Testing Life Cycle refers to a testing process that has specific steps to be
executed in a definite sequence to ensure that the quality goals have been met. It includes
the following phases of testing:

• Requirement Analysis
• Test Planning
• Test Design
• Test Environment Setup
• Test Execution
• Test Closure

Q 5) How do you define a format of writing a good test case?


1. Keep things simple and transparent.
2. Make test cases reusable.
3. Keep test case IDs unique.
4. Peer review is important.
5. Test cases should have the end user or defined requirements in mind.
6. Specify expected results and assumptions

Q 6) What is a good test case?


In simple words, a good test case is one that finds a defect. But all test cases will not
find defects, so a good test case can also be one that has all the prescribed details and
coverage.

Q 7) What would you do if you have a large suite to execute in very little
time?
In that case, we should first prioritize the test cases, execute the high-priority
test cases first, and then move on to the lower-priority ones. This way we can make sure
that the important aspects of the software are tested.
Alternatively, we may ask the customer which functions of the software are most
important to them, start testing from those areas,
and then gradually move to the areas of less importance.

Q 8) Do you think QAs can also participate in resolving production issues?


Yes. Participating in resolving production issues is a good learning
opportunity for QAs. Production issues can often be resolved by clearing the logs, changing
some registry settings, or restarting the services. These kinds of environmental
issues can very well be fixed by the QA team. Also, if QAs have insight into
resolving production issues, they can take them into account while writing test
cases, and in this way contribute to enhancing quality and minimizing
production defects.

Q 9) Suppose you find a bug in production, how would you make sure that
the same bug is not introduced again?
The best way is to immediately write a test case for the production defect and include it
in the regression suite. We can often also think of alternate or
similar test cases and include them in our planned execution.

Q 10) What is the difference between functional and non-functional testing?


Functional Testing: Functional testing deals with the functional aspect of the application.
It tests that the system behaves according to the requirements and specification.
These are directly linked with customer requirements. We validate the test cases against the
specified requirements and mark each test as passed or failed accordingly. Functional testing includes
regression, integration, system, smoke, etc.
Non-Functional Testing: Non-functional testing tests the non-functional aspects of the application.
It tests not the functional requirements but factors like performance, load, and stress.
These are often not explicitly specified in the requirements but are prescribed in the quality standards.
So, as QA we have to make sure that these tests are also given sufficient time and priority.

Q 11) What is negative testing? How is it different from positive testing?


Negative testing is a technique that validates that the system behaves gracefully in case
of any invalid inputs. For example, in case the user enters any invalid data in a text box,
the system should display a proper message instead of the technical message which the
user does not understand.
Negative testing is different from positive testing in a way that positive testing validates
that our system works as expected and compares the test results with the expected
results. Most of the time scenarios for negative testing are not mentioned in the functional
requirement documents. As a QA we have to identify the negative scenarios and should
have provisions to test those.
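
The difference can be illustrated with Python's unittest module. The function parse_age and its messages are hypothetical, invented for this sketch: the positive test checks the expected result for valid input, while the negative tests check that invalid input produces a clear, user-readable message.

```python
import unittest

def parse_age(text):
    """Hypothetical input handler for an 'age' text box."""
    try:
        age = int(text)
    except ValueError:
        # Replace the technical error with a message the user can understand.
        raise ValueError("Please enter your age as a whole number.") from None
    if not 0 <= age <= 150:
        raise ValueError("Please enter an age between 0 and 150.")
    return age

class ParseAgeTests(unittest.TestCase):
    def test_positive_valid_input(self):
        # Positive test: valid input gives the expected result.
        self.assertEqual(parse_age("42"), 42)

    def test_negative_non_numeric_input(self):
        # Negative test: invalid input is rejected gracefully.
        with self.assertRaisesRegex(ValueError, "whole number"):
            parse_age("forty-two")

    def test_negative_out_of_range_input(self):
        with self.assertRaisesRegex(ValueError, "between 0 and 150"):
            parse_age("-5")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseAgeTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```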

Q 12) How would you ensure that your testing is complete and has good
coverage?
Requirements traceability matrix and Test coverage matrices will help us to determine
that our test cases have good coverage. Requirement traceability matrices will help us
to determine that the test conditions are enough so that all the requirements are
covered. Coverage matrices will help us to determine that the test cases are enough to
satisfy all the identified test conditions in RTM.
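
A requirements traceability matrix can be as simple as a mapping from each requirement to the test cases that cover it. The sketch below uses made-up identifiers (REQ-1, TC-1, and so on) and a hypothetical helper, build_rtm, to show how coverage gaps become visible.

```python
def build_rtm(requirements, test_cases):
    """Map every requirement to the list of test cases that cover it."""
    rtm = {req: [] for req in requirements}
    for test_id, covered in test_cases.items():
        for req in covered:
            if req in rtm:  # ignore references to unknown requirements
                rtm[req].append(test_id)
    return rtm

requirements = ["REQ-1", "REQ-2", "REQ-3"]
test_cases = {
    "TC-1": ["REQ-1"],
    "TC-2": ["REQ-1", "REQ-2"],
}

rtm = build_rtm(requirements, test_cases)
uncovered = [req for req, tests in rtm.items() if not tests]
print(rtm)
print("Uncovered requirements:", uncovered)
```

Here REQ-3 has no test case, so the matrix shows that coverage is incomplete.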

Q 13) What are the different artifacts you refer to when you write the test
cases?
The main artifacts used are:

• Functional requirement specification
• Requirement understanding document
• Use Cases
• Wireframes
• User Stories
• Acceptance criteria
• Often, UAT test cases
Q 14) Have you ever managed writing the test cases without having any
documents?
Yes, we often have to write test cases without having
any existing documents. In that case, the best way is to

• Collaborate with the BA and development team.
• Find emails that contain relevant information.
• Find older test cases or the regression suite.
• If the feature is new, read the wiki pages or the application help to get an
idea.
• Sit with the developer and try to understand the changes being made.
• Based on your understanding, identify the test conditions and send them to the BA or
stakeholders for review.

Q 15) What is meant by Verification and Validation?


Validation is the process of evaluating the final product to check whether the software
meets the business needs. The test execution which we do in our day-to-day life is
actually the validation activity which includes smoke testing, functional testing, regression
testing, systems testing, etc.
Verification is a process of evaluating the intermediary work products of a software
development lifecycle to check if we are on the right track to creating the final product.

Q 16) What are the different verification techniques you know?


There are 3 verification techniques:
Review, Inspection, and Walk-through
1) Review – It is a method by which the code/test cases are examined by an individual
other than the author who produced them. It is one of the easiest and best ways to ensure
coverage and quality.
2) Inspection – It is a technical and disciplined way to examine and correct the defects
in the test artifact or code. Because it is disciplined, it has various roles:

• Moderator – Who facilitates the entire inspection meeting.
• Recorder – Who records the minutes of the meeting, defects that were found, and
other points discussed.
• Reader – The one who will read out the document/code. The reader also leads the
entire inspection meeting.
• Producer – The author. They are ultimately responsible for updating their
document/code as per the comments.
• Reviewer – All the team members can be considered reviewers. This role can also be
played by a group of experts if the project demands.
3) Walk-through – It is a process in which the author of the document/code reads the
content and gets feedback.

Q 17) What is the difference between Load and Stress testing?


Stress Testing is a technique that validates the behavior of the system when it executes
under stress. To explain, we reduce the resources and check the behavior of the system.
We first understand the upper limit of the system and gradually reduce the resources and
check the system behavior.
In Load testing, we validate the system behavior under the expected load. The load can
be of concurrent users or resources, accessing the system at the same time.

Q 18) In case you have any doubts regarding your project, how do you
approach them?
In case of any doubts, first try to clear them up by reading the available application help.
If the doubts still persist, ask your supervisor or a senior member of your
team.

Q 19) Have you used any Automation tools?


The different automation tools are:

• Telerik Test Studio
• Selenium
• Robotium
• TestComplete
• Watir
• Visual Studio Test Professional. …
• QTP (UFT)

Q 20) How do you determine how much testing each piece of software
requires?
We can determine this by finding the Cyclomatic Complexity.
The technique helps to answer the following 3 questions for the programs/features:

• Is the feature/program testable?
• Is the feature/program understood by everyone?
• Is the feature/program reliable enough?
As a QA we can use this technique to identify the “level” of our testing.
In practice, if the Cyclomatic Complexity is a large number, we consider
that piece of functionality to be complex and conclude, as testers,
that the code/functionality requires in-depth testing. On the other hand, if the
Cyclomatic Complexity is a small number, the functionality is less
complex and we decide the scope accordingly.

As a QA it’s very important that we
understand the entire testing lifecycle and are able to suggest changes to our
process if required. The goal is to deliver high-quality software, and thus a QA should
take all the necessary measures to improve the process and the way the testing team
executes the tests.
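
Returning to the Cyclomatic Complexity measure from Q 20: it is usually defined on the control-flow graph as M = E - N + 2P (edges, nodes, connected components), which for a single routine equals the number of decision points plus one. The sketch below approximates it for Python source with the standard ast module; the set of node types counted and the two sample functions are assumptions made for illustration, and real tools count more constructs.

```python
import ast

# Node types treated as decision points in this simplified count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity as 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b" adds one extra path, "a and b and c" adds two, and so on.
            complexity += len(node.values) - 1
    return complexity

simple = """
def f(x):
    return x + 1
"""

branchy = """
def g(x):
    if x > 0 and x < 10:
        return "small"
    for i in range(x):
        if i % 2:
            return "odd"
    return "other"
"""

simple_score = cyclomatic_complexity(simple)
branchy_score = cyclomatic_complexity(branchy)
print(simple_score, branchy_score)
```

The straight-line function scores 1, while the branching one scores 5 (two ifs, one boolean operator, one loop, plus one), suggesting it needs more test cases to cover its paths.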
