
Data Engineer

Python Pandas Interview Questions:-


1. What are the different types of data structures in Pandas?
2. What are the different file formats supported by Pandas for data input and output?
3. Differentiate between a Pandas DataFrame and a Series.
4. Explain the different ways of creating a Pandas series.
5. What is reindexing in Pandas, and why is it necessary?
6. How to create a copy of the Pandas series?
7. How to convert a Pandas series to DataFrame?
8. How do you delete rows and columns in a Pandas DataFrame?
9. State the difference between join() and merge() in Pandas.
10. State the difference between concat() and merge() in Pandas.
11. How can you obtain the frequency count of unique items in a Pandas DataFrame?
12. How to Read Data from Pandas DataFrame?
13. Explain the different ways of creating a Pandas DataFrame.
14. How do you select specific columns from a DataFrame?
15. How do you add a new row to a DataFrame in Pandas?
16. How do you add a new column to a DataFrame in Pandas?
17. How do you drop duplicate rows from a DataFrame?
18. Explain the concept of multi-indexing in Pandas.
19. How can you handle outliers in a numerical column in a Pandas DataFrame?
20. How can you filter rows in a DataFrame based on a condition?
21. What is the difference between interpolate() and fillna() in Pandas?
22. How can you normalize and standardize data using Pandas?
23. Explain the difference between loc[] and iloc[] in the context of DataFrame indexing.
24. How can you check the memory usage of a Pandas DataFrame?
25. Discuss techniques to reduce memory usage when working with large datasets in Pandas.
26. When would you prefer using a Series over a DataFrame or vice versa?
27. What is the purpose of the pivot_table function in Pandas?
28. How can you find the items of series A that are not present in series B?
29. Explain how to extract the year, month, and day from a datetime column in a Pandas DataFrame.
30. Demonstrate how to export a Pandas DataFrame to an Excel file.
31. How do you merge two DataFrames in Pandas?
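
Many of these can be practiced directly in a notebook. Below is a minimal, hedged Pandas sketch (the DataFrame and column names are made up for illustration) touching several of the topics above: loc vs iloc, value_counts, fillna vs interpolate, drop_duplicates, and merge.

import pandas as pd
import numpy as np

# Small illustrative DataFrame (hypothetical data)
df = pd.DataFrame({
    "emp_id": [1, 2, 2, 3, 4],
    "dept":   ["IT", "HR", "HR", "IT", "FIN"],
    "salary": [50000, np.nan, 42000, 61000, np.nan],
})

# loc[] is label-based, iloc[] is position-based (Q23)
first_dept_by_label = df.loc[0, "dept"]
first_dept_by_position = df.iloc[0, 1]

# Frequency count of unique items (Q11)
print(df["dept"].value_counts())

# fillna() replaces missing values with a constant or statistic,
# interpolate() estimates them from neighbouring values (Q21)
filled = df["salary"].fillna(df["salary"].mean())
interpolated = df["salary"].interpolate()

# Drop duplicate rows (Q17) and add a new column (Q16)
deduped = df.drop_duplicates()
df["bonus"] = df["salary"] * 0.1

# merge() joins on keys, concat() stacks DataFrames (Q9, Q10)
dept_names = pd.DataFrame({"dept": ["IT", "HR", "FIN"],
                           "dept_name": ["Tech", "People", "Finance"]})
merged = df.merge(dept_names, on="dept", how="left")
print(merged.head())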

Azure Data Engineer Interview Questions:-

1. Explain the difference between Azure Data Lake Storage Gen1 and Gen2.
2. How do you optimize data storage costs in Azure?
3. Describe the process of setting up Azure Data Factory and its key components.
4. What are the differences between Azure SQL Database and Azure SQL Data Warehouse?
5. How do you handle data ingestion in Azure?
6. Explain the concept of data partitioning in Azure Cosmos DB.
7. What is Azure Databricks and how does it relate to big data processing?
8. Discuss the advantages and disadvantages of using Azure Data Lake Analytics.
9. How do you ensure data security and compliance in Azure?
10. Explain the role of Azure Stream Analytics in real-time data processing.
11. How do you design a scalable data pipeline in Azure?
12. Describe the process of data transformation in Azure Data Factory.
13. What is Azure Synapse Analytics and how does it integrate with other Azure services?
14. How do you monitor and troubleshoot data pipelines in Azure?
15. Explain the concept of PolyBase in Azure SQL Data Warehouse.
16. What are the different storage options available in Azure for structured and unstructured data?
17. How do you handle data lineage and auditing in Azure?
18. Discuss the best practices for optimizing query performance in Azure SQL Database.
19. Explain the role of Azure Data Explorer in handling large volumes of time-series data.
20. How do you implement disaster recovery and high availability for Azure data solutions?

Spark DataFrame interview questions asked in my interviews…

1. What is the difference between RDDs and DataFrames in Spark?


2. Explain the concept of Catalyst optimizer in Spark SQL.
3. How does Spark handle schema inference for DataFrames?
4. What is the purpose of partitioning in Spark DataFrames? How does it affect performance?
5. Explain the concept of DataFrame lineage in Spark.
6. How does Spark handle schema evolution?
7. What are the benefits of using DataFrames over RDDs in Spark?
8. What is a Catalyst expression in Spark SQL?
9. Explain the concept of column pruning in Spark SQL optimization.
10. How does Spark handle data skewness in DataFrames?
11. What is the significance of caching and persistence in Spark DataFrames?
12. Explain the concept of lazy evaluation in Spark DataFrames.
13. How does Spark handle null values in DataFrames?
14. What are the different types of joins supported by Spark DataFrames? Explain each.
15. What is the purpose of the DataFrame API's `explain()` method?
16. How does Spark perform shuffle operations in DataFrames?
17. Explain the concept of window functions in Spark SQL.
18. What are the differences between `cache()` and `persist()` methods in Spark DataFrames?
19. How does Spark handle memory management for DataFrames?
20. Explain the concept of broadcast joins and when to use them in Spark.
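
A few of these are easier to remember with a short PySpark session. The sketch below (table and column names are illustrative, not from any real project) shows cache() vs persist(), a broadcast-join hint, and explain() printing the Catalyst plan.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("df-interview-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "IT", 100.0), (2, "HR", 50.0), (3, "IT", 75.0)],
    ["order_id", "dept", "amount"],
)
depts = spark.createDataFrame([("IT", "Tech"), ("HR", "People")], ["dept", "dept_name"])

# cache() uses the default storage level; persist() lets you pick one explicitly.
orders.cache()
depts.persist(StorageLevel.MEMORY_ONLY)

# Broadcast join hint: ship the small table to every executor and avoid a shuffle.
joined = orders.join(F.broadcast(depts), on="dept", how="left")

# explain() prints the logical/physical plan produced by the Catalyst optimizer.
joined.explain()

# Transformations are lazy; this action triggers execution.
joined.groupBy("dept").agg(F.sum("amount").alias("total")).show()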

Explain the various types of Joins?


Explain Equi join?
Explain the Right Outer Join?
How do you alter the name of a column?
How can you build a stored procedure?
Distinguish between MongoDB and MySQL?
Compare the 'Having' and 'Where' clauses in detail?
What are the differences between COALESCE() and ISNULL()?
What is the difference between “Stored Procedure” and “Function”?
What is the difference between the “DELETE” and “TRUNCATE” commands?
What is the difference between a “Clustered Index” and a “Non-Clustered Index”?
What is the difference between “Primary Key” and “Unique Key”?
What is the difference between a “Local Temporary Table” and “Global Temporary Table”?
What is the difference between primary key and unique constraints?
What are the differences between DDL, DML and DCL in SQL?
What is a view in SQL? How to create view?
What is a Trigger?
What is the difference between Trigger and Stored Procedure?
What are indexes?
What are Primary Keys and Foreign Keys?
What are wildcards used in database for Pattern Matching?
What are the UNION, MINUS and INTERSECT commands?
What is RDBMS?
What is OLTP?
What are Aggregate Functions?
What is the difference between UNION and UNION ALL?
What is a foreign key, and what is it used for?

Scenario Based Questions--


❗ SQL Query to find the second highest salary of an Employee?
❗ SQL Query to find the Max Salary from each department?
❗ Write an SQL Query to display the current date?
❗ Write an SQL Query to check whether the date passed to the query is in the given format or not?
❗ Write an SQL Query to print the names of distinct employees whose DOB is between 01/01/1960 and 31/12/1975?
❗ Write an SQL Query to find employees whose salary is equal to or greater than 10000?
❗ Write an SQL Query to find the names of employees whose name starts with ‘M’?
❗ Find the 3rd MAX salary in the emp table?
❗ Suppose annual salary information is provided by the emp table. How do you fetch the monthly salary of each and every employee?
❗ Display the list of employees who joined the company before 30th June '90 or after 31st Dec '90?
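
For practice, here is a hedged sketch of a few of these, run through spark.sql so the example stays in Python; the emp table and its columns are assumptions made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-scenarios").getOrCreate()

# Hypothetical employee data registered as a temp view named emp.
spark.createDataFrame(
    [("Meena", "IT", 12000), ("Arun", "IT", 9000),
     ("Maya", "HR", 15000), ("Ravi", "HR", 15000)],
    ["ename", "dept", "salary"],
).createOrReplaceTempView("emp")

# Second highest salary (DENSE_RANK handles ties cleanly).
spark.sql("""
    SELECT DISTINCT salary AS second_highest
    FROM (SELECT salary,
                 DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
          FROM emp) ranked
    WHERE ranked.rnk = 2
""").show()

# Max salary from each department.
spark.sql("SELECT dept, MAX(salary) AS max_salary FROM emp GROUP BY dept").show()

# Employees whose name starts with 'M' and whose salary is >= 10000.
spark.sql("SELECT ename FROM emp WHERE ename LIKE 'M%' AND salary >= 10000").show()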

You can expect….

1. SQL queries…
— basic to in-depth
— advanced SQL
— windowing functions, CTEs

2. Python….
—functions
—data scraping
—lambda functions
—basics

3. Pyspark…
— RDD, dataframes, datasets
—spark sql
—advanced windowing function
—rank, dense rank, row number
—transformations, actions(advanced)

4. Spark…
—architecture
—internal execution
—jobs issues, spark submit
—cluster issues, optimisations
—drivers, nodes, joins

5. Hadoop
—in depth architecture
— jobs
—application manager
—tools used

Practice these concepts in depth; when I say in depth, you must learn them in depth…

Interview patterns have changed post-COVID; you can expect a lot of questions… prepare well.

Spark Questions:-
These include --
What is spark? Explain Architecture
Explain where did you use spark in your project?
What optimization techniques have you used in Spark?
Explain the transformations and actions you have used?
What happens when you use shuffle in Spark?
Difference between ReduceByKey vs GroupByKey?
Explain the issues you resolved when working with Spark?
Compare Spark vs Hadoop MapReduce?
Difference between narrow & wide transformations?
What is a partition, and how does Spark partition the data?
What is RDD?
What is a broadcast variable?
Difference between SparkContext vs SparkSession?
Explain transformations and actions in Spark?
What is executor memory in Spark?
What is lineage graph?
What is DAG?
Explain libraries that Spark Ecosystem supports?
What is a DStream?
What is the Catalyst optimizer? Explain it.
Why is the Parquet file format best for Spark?
Difference between dataframe Vs Dataset Vs RDD?
Explain features of Apache Spark?
Explain lazy evaluation and why it is needed?
Explain Pair RDD?
What is Spark Core?
What is the difference between persist() and cache()?
What are the various levels of persistence in Apache Spark?
Does Apache Spark provide check pointing?
How can you achieve high availability in Apache Spark?
Explain executor memory in Spark?
What are the disadvantages of using Apache Spark?
What is the default level of parallelism in Apache Spark?
Compare map() and flatMap() in Spark?
Difference between repartition Vs coalesce?
Explain Spark Streaming?
Explain accumulators?
What is the use of broadcast join?
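
Several of these (map vs flatMap, reduceByKey vs groupByKey, repartition vs coalesce) stick better with a tiny RDD example. The sketch below uses made-up data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is lazy"])

# flatMap() flattens each record into many outputs; map() is strictly 1-to-1.
tokens = lines.flatMap(lambda line: line.split(" "))
pairs = tokens.map(lambda w: (w, 1))

# reduceByKey combines values map-side before the shuffle (usually preferred);
# groupByKey ships every value across the network and is typically slower.
counts = pairs.reduceByKey(lambda a, b: a + b)
grouped = pairs.groupByKey().mapValues(lambda vals: sum(vals))
print(counts.collect())
print(grouped.collect())

# repartition(n) can increase partitions but always shuffles;
# coalesce(n) only merges partitions, avoiding a full shuffle when decreasing.
more_parts = counts.repartition(8)
fewer_parts = more_parts.coalesce(2)
print(more_parts.getNumPartitions(), fewer_parts.getNumPartitions())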

Interview Questions at PwC:-

Generally there were 3 Rounds of Interviews --

Since it isn't a product-based company, don't expect DS & Algo questions (of course, I don't know them well either).

Round 1 --
Very crucial and nerve-wracking.

Current Project Explanation.


Any Major issues resolved in the current project.
What all optimization techniques have you used in your project
Hadoop Architecture
Three SQL questions, mainly on joins, subqueries, GROUP BY, inline views, the WITH clause, and timestamps.
Coding questions in PySpark or Scala Spark.
What are the different constraints in SQL, and how do they differ?

Round 2 --
Mostly focused on scenario-based questions. These include--

How do you increase mappers?


What if a running job fails in Sqoop?
How do you update with the latest data in Sqoop?
What do you do if data skew happens?
What kind of file formats do you use? And where?
Why do we need RDD?
What is the use of the driver in Spark?
How do you tackle memory exception errors in Spark?
What kind of join do you use if one partition has more data and others have less data?
Why do we need to use containers in Spark?
What happens if we increase the number of partitions in Spark?
Write down the command to increase the number of partitions?
Write down UDF query?

Round 3 --
Important round: you either get through or get rejected.

Architecture of MapReduce?
What are outliers in MapReduce?
What are partition, shuffle & sort?
What is block report in Hadoop?
Partition By Vs Bucketing in Hive?
Difference between Cassandra and HBase?
What is Catalyst optimizer?
Explain the ETL tools you have used?
Explain basic cloud concepts, plus more questions on the specific cloud we worked with.
DataFrame Vs Dataset?
Explain Broadcast join?

ADF interview Questions:-


💥 Beginner-
What is Azure Data Factory?
Is Azure Data Factory an ETL or an ELT tool?
Why is ADF needed?
What sets Azure Data Factory apart from conventional ETL tools?
What are the major components of a Data Factory?
What are the ways to execute pipelines in Azure Data Factory?
What is the purpose of Linked services in Azure Data Factory?
Explain more on Data Factory Integration Runtime?
What is required to execute an SSIS package in Data Factory?
What is the limit on the number of Integration Runtimes, if any?
What are ARM Templates in Azure Data Factory?
How can we deploy code to higher environments in Data Factory?
Can we pass parameters to a pipeline run?
What are some useful constructs available in Data Factory?
Is it possible to push code and have CI/CD in ADF?
What are mapping data flows?

⚡ 2-5 Years Experienced--


What are the different activities you have used in Azure Data Factory?
How can I schedule a pipeline?
What is the difference between mapping and wrangling data flow?
How is lookup activity useful in the Azure Data Factory?
Elaborate more on the Get Metadata activity in Azure Data Factory?
How to debug an ADF pipeline?
What is meant by a breakpoint in an ADF pipeline?
Is it possible to have nested looping in Azure Data Factory?
❗ More than 5 years of experience --
How to copy multiple tables from one datastore to another?
What are some performance tuning techniques for Mapping Data Flow activity?
What are some of the limitations of ADF?
How are all the components of Azure Data Factory combined to complete the purpose?
Can we integrate Data Factory with Machine learning data?
Explain the Azure Data Factory Architecture?
What is Azure Data Lake Analytics?

1. Explain the basic concepts in Databricks.

2. What does the caching process involve?

3. What are the different types of caching?

4. Should you ever remove and clean up leftover data frames in Databricks?

5. How do you create a Databricks personal access token?

6. What steps should you take to revoke a private access token?

7. What are the benefits of using Databricks?

8. Can you use Databricks along with Azure Notebooks?

9. Do you need to store an action’s outcome in a different variable?

10. What is autoscaling?

11. Can you run Databricks on private cloud infrastructure?

12. What are some issues you can face in Databricks?

13. Why is it necessary for us to use the DBU framework?

14. Explain what workspaces are in Databricks.

15. Is it possible to manage Databricks using PowerShell?

16. What is Kafka for?

17. What is a Delta table?

18. Which cloud service category does Databricks belong to: SaaS, PaaS, or IaaS?

19. Explain the differences between a control plane and a data plane.

20. What are widgets used for in Databricks?

1. What are the major features of Databricks?


2. What is the difference between an instance and a cluster?

3. Name some of the key use cases of Kafka in Databricks.

4. How would you use Databricks to process big data?

5. Give an example of a data analysis project you’ve worked on.

6. How would you ensure the security of sensitive data in a Databricks environment?

7. What is the management plane in Databricks?

8. How do you import third-party JARs or dependencies in Databricks?

9. Define data redundancy.

10. What is a job in Databricks?

11. How do you capture streaming data in Databricks?

12. How can you connect your ADB cluster to your favorite IDE?

1. What is a job in Databricks?

A job in Databricks is a way to manage your data processing and applications in a workspace. It can consist of one
task or be a multi-task workflow that relies on complex dependencies.

Databricks does most of the work by monitoring clusters, reporting errors, and completing task orchestration. The
easy-to-use scheduling system enables programmers to keep jobs running without having to move data to different
locations.

2. What is the difference between an instance and a cluster?

An instance represents a single virtual machine used to run an application or service. A cluster refers to a set of
instances that work together to provide a higher level of performance or scalability for an application or service.


3. How would you ensure the security of sensitive data in a Databricks environment?

Databricks has network protections that help users secure information in a workspace environment. This process
prevents sensitive data from getting lost or ending up in the wrong storage system.

To ensure proper security, the user can access IP lists to show the network location of important information in
Databricks. Then they should restrict outbound network access using a virtual private cloud.

4. What is the management plane in Databricks?


The management plane is a set of tools and services used to manage and control the Databricks environment. It
includes the Databricks workspace, which provides a web-based interface for managing data, notebooks, and clusters.
It also offers security, compliance, and governance features.


5. Define data redundancy.

Data redundancy occurs when the same data is stored in multiple locations in the same database or dataset.
Redundancy should be minimized since it is usually unnecessary and can lead to inconsistencies and inefficiencies.
Therefore, it’s usually best to identify and remove redundancies to avoid using up storage space.

15 challenging Databricks interview questions to ask experienced coders


Below is a list of 15 challenging Databricks interview questions to ask expert candidates. Choose questions that will
help you learn more about their programming knowledge and experience using data analytics.

1. What is a Databricks cluster?

2. Describe a dataflow map.

3. List the stages of a CI/CD pipeline.

4. What are the different applications for Databricks table storage?

5. Define serverless data processing.

6. How will you handle Databricks code while you work with Git or TFS in a team?

7. Write the syntax to connect the Azure storage account and Databricks.

8. Explain the difference between data analytics workloads and data engineering workloads.

9. What do you know about SQL pools?

10. What is a Recovery Services Vault?

11. Can you cancel an ongoing job in Databricks?

12. Name some rules of a secret scope.

13. Write the syntax to delete the IP access list.

14. How do you set up a DEV environment in Databricks?

15. What can you accomplish using APIs?

1. Define serverless data processing.


Serverless data processing is a way to process data without needing to worry about the underlying infrastructure. You
can save time and reduce costs by having a service like Databricks manage the infrastructure and allocate resources as
needed.

Databricks can provide the necessary resources on demand and scale them as needed to simplify the management of
data processing infrastructure.

2. How would you handle Databricks code while working with Git or TFS in a team?

Git and Team Foundation Server (TFS) are version control systems that help programmers manage code. TFS is not supported in Databricks, so programmers can only use Git-based repositories for source control there.

Candidates should also know that Git is an open-source, distributed version control system, whereas TFS is a
centralized version control system offered by Microsoft.

Since Databricks integrates with Git, data engineers and programmers can easily manage code without constantly
updating the software or reducing storage because of low capacity.


3. Explain the difference between data analytics workloads and data engineering workloads.

Data analytics workloads involve obtaining insights, trends, and patterns from data. Meanwhile, data engineering
workloads involve building and maintaining the infrastructure needed to store, process, and manage data.

4. Name some rules of a secret scope in Databricks.

A secret scope is a collection of secrets identified by a name. Programmers and developers can use this feature to store
and manage sensitive information, including secret identities or application programming interface (API)
authentication information, while protecting it from unauthorized access.

One rule candidates could mention is that a Databricks workspace can only hold a maximum of 100 secret scopes.


5. What is a Recovery Services vault?

A recovery services vault is an Azure management function that performs backup-related operations. It enables users
to restore important information and copy data to adhere to backup regulations. The service can also help users arrange
data in a more organized and manageable way.

1. What is Azure Data Factory?


Azure Data Factory is a cloud-based, fully managed, serverless data integration service offered by Microsoft Azure for automating data movement from its native place to, say, a data lake or data warehouse, using either extract-transform-load (ETL) or extract-load-transform (ELT) patterns. It lets you create, schedule, and run data pipelines that move and transform data.

2. Is Azure Data Factory ETL or ELT tool?


It is a cloud-based Microsoft tool that provides a cloud-based integration service for data analytics at scale and supports ETL and ELT paradigms.
3. Why is ADF needed?
With an increasing amount of big data, there is a need for a service like ADF that can orchestrate and operationalize processes to refine the enormous stores of
raw business data into actionable business insights.

4. What sets Azure Data Factory apart from conventional ETL tools?
Azure Data Factory stands out from other ETL tools as it provides: -

• Enterprise Readiness: Data integration at cloud scale for big data analytics!
• Enterprise Data Readiness: 90+ connectors are supported to get your data from any disparate source into the Azure cloud!
• Code-Free Transformation: UI-driven mapping data flows.
• Ability to run code on any Azure compute: hands-on data transformations.
• Ability to rehost on-prem SSIS services on the Azure cloud in 3 steps: many SSIS packages run on the Azure cloud.
• Making DataOps seamless: with source control, automated deployment & simple templates.
• Secure Data Integration: managed virtual networks protect against data exfiltration, which, in turn, simplifies your networking.

Data Factory contains a series of interconnected systems that together provide a complete end-to-end platform for data engineers.
5. What are the major components of a Data Factory?
To work with Data Factory effectively, one must be aware of below concepts/components associated with it: -

Pipelines: A Data Factory can contain one or more pipelines, each of which is a logical grouping of activities that together perform a task. For example, an activity can read data from Azure Blob storage and load it into Cosmos DB or Synapse for analytics while transforming the data according to business logic. This way, one can work with a set of activities as one entity rather than dealing with several tasks individually.

Activities: Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data between data stores. Data Factory
supports data movement, transformations, and control activities.

Datasets: Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or
outputs.

Linked Service: This is more like a connection string; it holds the information Data Factory needs to connect to various sources. In the case of reading from Azure Blob storage, the storage linked service specifies the connection string to connect to the blob, and the Azure Blob dataset selects the container and folder containing the data.

Integration Runtime: Integration runtime instances bridge the activity and the linked service. The linked service or activity references it, and it provides the computing environment where the activity runs or gets dispatched. This way, the activity can be performed in the region closest to the target data stores or compute services in the most performant way, while meeting security (no publicly exposed data) and compliance needs.

Data Flows: These are objects you build visually in Data Factory, which transform data at scale on backend Spark services. You do not need to understand
programming or Spark internals. Design your data transformation intent using graphs (Mapping) or spreadsheets (Power query activity).
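
To make these components concrete, here is a minimal sketch of how a pipeline with a single Copy activity is expressed in ADF's JSON model, written as a Python dict so it can be dumped with json. The pipeline, dataset, and parameter names are illustrative assumptions, not a real deployment.

import json

# Hypothetical pipeline: one Copy activity reading a Blob dataset and writing
# to an Azure SQL dataset; each dataset is backed by its own linked service.
pipeline = {
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopySalesData",
                "type": "Copy",
                "inputs":  [{"referenceName": "SalesBlobDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SalesSqlDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "AzureSqlSink"},
                },
            }
        ],
        "parameters": {"loadDate": {"type": "String"}},
    },
}

print(json.dumps(pipeline, indent=2))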
6. What are the different ways to execute pipelines in Azure Data Factory?
There are three ways in which we can execute a pipeline in Data Factory:

• Debug mode can be helpful when trying out pipeline code and acts as a tool to test and troubleshoot our code.
• Manual execution is what we do by clicking on the ‘Trigger now’ option in a pipeline. This is useful if you want to run your pipelines on an ad-hoc basis.
• We can schedule our pipelines at predefined times and intervals via a Trigger. As we will see later in this article, there are three types of triggers available in Data Factory.

7. What is the purpose of Linked services in Azure Data Factory?


Linked services are used majorly for two purposes in Data Factory:

For a Data Store representation, i.e., any storage system like Azure Blob storage account, a file share, or an Oracle DB/ SQL Server instance.

For Compute representation, i.e., the underlying VM will execute the activity defined in the pipeline.

8. Can you elaborate more on the Data Factory Integration Runtime?


The Integration Runtime, or IR, is the compute infrastructure for Azure Data Factory pipelines. It is the bridge between activities and linked services. The
linked Service or Activity references it and provides the computing environment where the activity is run directly or dispatched. This allows the activity to be
performed in the closest region to the target data stores or computing Services.

Azure Data Factory supports three types of integration runtime, and one should choose based on their data integration capabilities and network environment
requirements.

• Azure Integration Runtime: Used to copy data between cloud data stores and to dispatch activities to various computing services such as SQL Server, Azure HDInsight, etc.
• Self-Hosted Integration Runtime: Used for running copy activities between cloud data stores and data stores in private networks. The self-hosted integration runtime is software with the same code as the Azure Integration Runtime, but installed on your local system or machine inside a virtual network.
• Azure-SSIS Integration Runtime: Lets you run SSIS packages in a managed environment. So, when we lift and shift SSIS packages to Data Factory, we use the Azure-SSIS Integration Runtime.
9. What is required to execute an SSIS package in Data Factory?
We must create an SSIS integration runtime and an SSISDB catalog hosted in the Azure SQL server database or Azure SQL-managed instance before
executing an SSIS package.

10. What is the limit on the number of Integration Runtimes, if any?


Within a Data Factory, the default limit on any entities is set to 5000, including pipelines, data sets, triggers, linked services, Private Endpoints, and
integration runtimes. If required, one can create an online support ticket to raise the limit to a higher number.

Refer to the documentation for more details: https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#azure-data-factory-limits

11. What are ARM Templates in Azure Data Factory? What are they used for?
An ARM template is a JSON (JavaScript Object Notation) file that defines the infrastructure and configuration for the data factory pipeline, including pipeline
activities, linked services, datasets, etc. The template will contain essentially the same code as our pipeline.

ARM templates are helpful when we want to migrate our pipeline code to higher environments, say Production or Staging from Development, after we are
convinced that the code is working correctly.


12. How can we deploy code to higher environments in Data Factory?


At a very high level, we can achieve this with the below set of steps:

• Create a feature branch that will store our code base.
• Create a pull request to merge the code into the Dev branch once we are sure about it.
• Publish the code from Dev to generate the ARM templates.
• This can trigger an automated CI/CD DevOps pipeline to promote the code to higher environments like Staging or Production.

📢 Sharing one ADF interview question:

📓 Pull the four CSV files from the web link (https) and copy them to ADLS Gen2 storage.

Step 1️⃣: Create a Config JSON File
Store the base URL, relative URLs, and CSV filenames in a JSON file named config.json.

Step 2️⃣: Set Up Linked Services
Configure linked services for the HTTP server (for downloading the CSV files) and ADLS Gen2 (for storage).

Step 3️⃣: Create the Azure Data Factory Pipeline
Lookup Activity: Read config.json.
ForEach Activity: Iterate through each CSV filename.
Copy Activity: Copy the CSV files from the web link to ADLS Gen2.

Step 4️⃣: Parameterize Linked Services and Datasets
Parameterize linked services to dynamically supply values from config.json.
Parameterize datasets for flexibility.

Step 5️⃣: Dynamic Mapping of Source and Sink
Use dynamic mapping to adjust the source (web link) and sink (ADLS Gen2) dynamically.

Step 6️⃣: Execute the Pipeline ▶️
Trigger pipeline execution to pull the CSV files and copy them to ADLS Gen2.

Step 7️⃣: Monitor and Validate
Monitor pipeline execution in Azure Data Factory and validate the successful file copy.
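
A hedged sketch of what the config.json from Step 1 could look like (the URLs and file names are placeholders), written with Python so it stays runnable:

import json

# Illustrative config that the Lookup activity in Step 3 would read.
config = {
    "baseURL": "https://example.com/data/",  # placeholder HTTP source
    "files": [
        {"relativeURL": "2024/sales_q1.csv", "fileName": "sales_q1.csv"},
        {"relativeURL": "2024/sales_q2.csv", "fileName": "sales_q2.csv"},
        {"relativeURL": "2024/sales_q3.csv", "fileName": "sales_q3.csv"},
        {"relativeURL": "2024/sales_q4.csv", "fileName": "sales_q4.csv"},
    ],
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# The ForEach activity would then iterate over the files array from the Lookup
# output, and the Copy activity would build its source URL from
# baseURL + the current item's relativeURL.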

Preparing for an Azure Data Factory interview? Let's delve into crucial questions to sharpen your expertise:

Introduction to Azure Data Factory:
↳ Explore the foundational concepts of Azure Data Factory, Microsoft's cloud-based data integration service.
Data Integration Basics:
↳ Understand the core principles of data integration, including data ingestion, transformation, and loading.
Pipeline Orchestration and Execution:
↳ Delve into the orchestration and execution of data pipelines in Azure Data Factory, ensuring seamless data movement.
Data Movement Activities:
↳ Explore various data movement activities available in Azure Data Factory, such as Copy Activity and Data Flow.
Data Transformation Techniques:
↳ Understand how Azure Data Factory facilitates data transformation using Mapping Data Flows and Wrangling Data Flows.
Integration with Azure Services:
↳ Navigate the integration capabilities of Azure Data Factory with other Azure services like Azure Synapse Analytics and Azure Databricks.
Monitoring and Management:
↳ Explore methods for monitoring and managing Azure Data Factory pipelines and activities, ensuring optimal performance.
Security and Compliance Measures:
↳ Understand the security features and compliance considerations in Azure Data Factory, safeguarding sensitive data.
Error Handling and Fault Tolerance:
↳ Delve into error handling mechanisms and fault tolerance strategies in Azure Data Factory, ensuring data integrity and reliability.
Best Practices and Optimization Tips:
↳ Explore best practices and optimization techniques to maximize the efficiency and performance of Azure Data Factory.
Ready to showcase your Azure Data Factory prowess?

1. What is Azure Data Factory (ADF)?

Azure Data Factory (ADF) is a cloud-based data integration service that enables you to create data-driven
workflows for orchestrating and automating data movement and data transformation. ADF provides a
managed service that is continuously monitored and updated, and it provides built-in security features, such
as data encryption, identity and access management, and data privacy.

2. What are the core components of Azure Data Factory (ADF)?

The core components of Azure Data Factory (ADF) are:

• Pipelines: A pipeline is a logical grouping of activities that perform a specific task.
• Activities: An activity is a unit of work within a pipeline.
• Datasets: A dataset is a named view of data that is used by activities.
• Linked services: A linked service is a connection to a data store or a compute service.
• Triggers: A trigger is a mechanism that starts a pipeline run.

3. What are the different types of activities in Azure Data Factory (ADF)?

The different types of activities in Azure Data Factory (ADF) are:

• Data movement activities: Data movement activities move data from one location to another.
• Data transformation activities: Data transformation activities transform data from one format to another.
• Control activities: Control activities control the flow of a pipeline.

4. What is a linked service in Azure Data Factory (ADF)?

A linked service in Azure Data Factory (ADF) is a connection to a data store or a compute service. Linked
services are used to connect to various data stores, such as Azure Blob Storage, Azure Data Lake Storage,
and Azure SQL Database.

5. What is a dataset in Azure Data Factory (ADF)?

A dataset in Azure Data Factory (ADF) is a named view of data that is used by activities. A dataset represents
the input or output of an activity.

6. What is a pipeline in Azure Data Factory (ADF)?

A pipeline in Azure Data Factory (ADF) is a logical grouping of activities that perform a specific task. A
pipeline can contain one or more activities, and it can be triggered manually or scheduled to run at a
specific time.

7. What is a trigger in Azure Data Factory (ADF)?


A trigger in Azure Data Factory (ADF) is a mechanism that starts a pipeline run. Triggers can be scheduled to
run at a specific time, or they can be triggered by an event, such as the arrival of a new file in a data store.

8. What is the difference between a tumbling window and sliding window trigger in Azure Data
Factory (ADF)?

A tumbling window trigger in Azure Data Factory (ADF) triggers a pipeline run at a fixed interval, while a
sliding window trigger triggers a pipeline run at a sliding interval. For example, a tumbling window trigger
might trigger a pipeline run every hour, while a sliding window trigger might trigger a pipeline run every 30
minutes.
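
For reference, here is a hedged sketch of what a tumbling window trigger definition looks like in JSON, expressed as a Python dict; the trigger, pipeline, and parameter names are illustrative.

import json

# Hypothetical tumbling window trigger: fires for fixed, non-overlapping hourly
# windows and passes the window boundaries to the pipeline as parameters.
tumbling_trigger = {
    "name": "HourlyLoadTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2024-01-01T00:00:00Z",
            "maxConcurrency": 4,
        },
        "pipeline": {
            "pipelineReference": {"type": "PipelineReference", "referenceName": "LoadSalesPipeline"},
            "parameters": {
                "windowStart": "@trigger().outputs.windowStartTime",
                "windowEnd": "@trigger().outputs.windowEndTime",
            },
        },
    },
}

print(json.dumps(tumbling_trigger, indent=2))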

9. What is the difference between a single-node and an integrated runtime in Azure Data Factory
(ADF)?

A single-node runtime in Azure Data Factory (ADF) is a standalone runtime that is used for data integration
tasks, while an integrated runtime is a runtime that is integrated with Azure Data Factory. An integrated
runtime provides additional features, such as support for custom activities and integration with Azure
DevOps.

10. What is the difference between a tumbling window and a sliding window in Azure Data Factory
(ADF)?

A tumbling window in Azure Data Factory (ADF) is a fixed-size window that moves data at regular intervals,
while a sliding window is a moving window that moves data based on a specific time interval.

For example, a tumbling window might move data every hour, while a sliding window might move data
every 30 minutes, but also include the previous 30 minutes of data.

11. What is the difference between a dataset and a linked service in Azure Data Factory (ADF)?

A dataset in Azure Data Factory (ADF) is a named view of data that is used by activities, while a linked
service is a connection to a data store or a compute service. A dataset represents the input or output of an
activity, while a linked service is used to connect to various data stores or compute services.

12. What is the difference between a pipeline and a trigger in Azure Data Factory (ADF)?

A pipeline in Azure Data Factory (ADF) is a logical grouping of activities that perform a specific task, while a
trigger is a mechanism that starts a pipeline run. A pipeline can contain one or more activities, and it can be
triggered manually or scheduled to run at a specific time, while a trigger starts a pipeline run based on a
specific event or schedule.

13. What is the difference between a data flow and a mapping data flow in Azure Data Factory
(ADF)?

In Azure Data Factory (ADF), "data flows" is the umbrella term for visually designed transformations that execute on managed Spark clusters. A mapping data flow is the graph-based, code-free variety, while a wrangling data flow (Power Query) offers a spreadsheet-like data preparation experience. In both cases you design the transformation visually instead of writing Spark code yourself.

14. What is the difference between a tumbling window and a tumbling window trigger in Azure Data
Factory (ADF)?

A tumbling window in Azure Data Factory (ADF) is a fixed-size window that moves data at regular intervals,
while a tumbling window trigger is a trigger that starts a pipeline run at regular intervals. A tumbling
window trigger might start a pipeline run every hour, while a tumbling window moves data every hour.

15. What is the difference between a sliding window and a sliding window trigger in Azure Data
Factory (ADF)?

A sliding window in Azure Data Factory (ADF) is a moving window that moves data based on a specific time
interval, while a sliding window trigger is a trigger that starts a pipeline run based on a specific time interval.
A sliding window trigger might start a pipeline run every 30 minutes, while a sliding window moves data
every 30 minutes, but also include the previous 30 minutes of data.


13. Which three activities can you run in Microsoft Azure Data Factory?
Azure Data Factory supports three activities: data movement, transformation, and control activities.

Data movement activities: As the name suggests, these activities help move data from one place to another.
e.g., Copy Activity in Data Factory copies data from a source to a sink data store.

Data transformation activities: These activities help transform the data while we load it into the data's target or destination.
e.g., Stored Procedure, U-SQL, Azure Functions, etc.

Control flow activities: Control (flow) activities help control the flow of any activity in a pipeline.

e.g., wait activity makes the pipeline wait for a specified time.

14. What are the two types of compute environments supported by Data Factory to execute the transform activities?
Below are the types of computing environments that Data Factory supports for executing transformation activities: -

i) On-Demand Computing Environment: This is a fully managed environment provided by ADF. This type of compute creates a cluster to perform the transformation activity and automatically deletes it when the activity is complete.

ii) Bring Your Own Environment: Here you manage your own computing environment with ADF, which is useful if you already have the infrastructure for on-premises services.

15. What are the steps involved in an ETL process?


The ETL (Extract, Transform, Load) process follows four main steps:

i) Connect and Collect: Connect to the data source(s) and move the data to a centralized (local or cloud) data store.

ii) Transform: Transform the data using compute services such as HDInsight, Hadoop, Spark, etc.

iii) Publish: Load the data into Azure Data Lake Storage, Azure SQL Data Warehouse (Synapse), Azure SQL Database, Azure Cosmos DB, etc.

iv) Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, APIs, PowerShell, Azure Monitor logs, and health panels on the Azure portal.

16. If you want to use the output by executing a query, which activity shall you use?
The Lookup activity can return the result of executing a query or stored procedure.
The output can be a singleton value or an array of attributes, which can be consumed in a subsequent Copy Data activity, or in any transformation or control flow activity such as a ForEach activity.

17. Can we pass parameters to a pipeline run?


Yes, parameters are a first-class, top-level concept in Data Factory. We can define parameters at the pipeline level and pass arguments as you execute the
pipeline run on demand or using a trigger.

18. Have you used Execute Notebook activity in Data Factory? How to pass parameters to a notebook activity?
We can use the Notebook activity to run code on our Databricks cluster. We can pass parameters to a notebook activity using the baseParameters property. If the parameters are not defined/specified in the activity, the default values from the notebook are used.

19. What are some useful constructs available in Data Factory?


parameter: Each activity within the pipeline can consume the parameter value passed to the pipeline and run with the @parameter construct.

coalesce: We can use the @coalesce construct in the expressions to handle null values gracefully.

activity: An activity output can be consumed in a subsequent activity with the @activity construct.
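
A small hedged illustration of where these constructs show up inside activity JSON (shown as a Python dict; the activity and parameter names are made up). ADF expressions are just strings that begin with @.

# Hypothetical stored-procedure activity parameters using the constructs above.
stored_proc_parameters = {
    # consume a pipeline parameter (fully qualified form of the @parameter idea)
    "LoadDate": {"value": "@pipeline().parameters.loadDate", "type": "String"},
    # @coalesce - fall back gracefully when an optional parameter is null
    "TargetFolder": {"value": "@coalesce(pipeline().parameters.folder, 'landing')", "type": "String"},
    # @activity - consume the output of a previous activity (here a Lookup)
    "RowCount": {"value": "@activity('LookupConfig').output.count", "type": "Int32"},
}
print(stored_proc_parameters)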

20. Can we push code and have CI/CD (Continuous Integration and Continuous Delivery) in ADF?
Data Factory fully supports CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to develop and deliver your ETL processes incrementally before publishing the finished product. After the raw data has been refined into a business-ready, consumable form, load it into Azure Synapse Analytics (SQL Data Warehouse), Azure SQL Database, Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your business intelligence tools point to.

PYSPARK INTERVIEW QUESTIONS:-


Top 20 #pyspark interview questions you should learn…

1. What is PySpark, and how does it relate to Apache Spark?


2. Explain the difference between RDDs, DataFrames, and Datasets in PySpark.
3. How do you create a SparkSession in PySpark, and why is it important?
4. What is lazy evaluation in PySpark, and why is it beneficial?
5. How can you read data from different file formats using PySpark?
6. Explain the concept of partitioning in PySpark and its significance.
7. What are transformations and actions in PySpark? Provide examples of each.
8. How do you handle missing or null values in PySpark DataFrames?
9. Explain the concept of shuffling in PySpark and its impact on performance.
10. What are caching and persistence in PySpark, and when would you use them?
11. How do you handle skewed data in PySpark?
12. Explain the difference between repartition and coalesce in PySpark.
13. How do you perform aggregation operations on PySpark DataFrames?
14. What is a broadcast variable, and when would you use it in PySpark?
15. How do you write data to different file formats using PySpark?
16. Explain the concept of window functions in PySpark and provide an example.
17. How do you perform joins between PySpark DataFrames?
18. What are the different ways to optimize PySpark jobs for performance?
19. How do you handle large datasets that don't fit into memory in PySpark?
20. Explain the concept of UDFs (User-Defined Functions) in PySpark and provide an example use case.
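
The sketch below (made-up sales data) touches questions 8, 16, and 20 in one go: null handling, a window function, and a small UDF.

from pyspark.sql import SparkSession, functions as F, Window
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pyspark-sketch").getOrCreate()

sales = spark.createDataFrame(
    [("IT", "Arun", 9000), ("IT", "Meena", None), ("HR", "Maya", 15000)],
    ["dept", "name", "salary"],
)

# Q8: handle nulls by filling (or dropping) them.
cleaned = sales.fillna({"salary": 0})

# Q16: window function - rank employees by salary within each department.
w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
ranked = cleaned.withColumn("rank", F.dense_rank().over(w))

# Q20: a simple UDF (prefer built-in functions when they exist).
shout = F.udf(lambda s: s.upper(), StringType())
ranked.withColumn("name_upper", shout("name")).show()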

Top 10 SQL interview questions for your interviews…

❗ Provide a SQL query that finds the second-highest salary from an "Employee" table.
❗ How can you use window functions to calculate the cumulative sum of a column in a given order?
❗ Explain the difference between 1NF, 2NF, and 3NF, and provide an example illustrating each.
❗ Write a query that joins three tables, filters results based on multiple conditions, and includes aggregated values.
❗ Discuss strategies for optimizing the performance of a slow-performing SQL query.
📌 Design a database schema to store and efficiently query time-series data.
📌 Implement a recursive SQL query to traverse a hierarchical structure in an "Organization" table.
📌 Explain how indexing works in databases, and discuss scenarios where composite indexes are beneficial.
📌 Compare star schema and snowflake schema in the context of data warehousing, and provide use cases for each.
📌 Describe the purpose of isolation levels in database transactions and discuss potential issues related to concurrent transactions.
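
For the window-function question, here is a hedged sketch of a cumulative sum, run through spark.sql so the example stays in Python; the orders table and its columns are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cumulative-sum").getOrCreate()

spark.createDataFrame(
    [("2024-01-01", 100), ("2024-01-02", 150), ("2024-01-03", 50)],
    ["order_date", "amount"],
).createOrReplaceTempView("orders")

# Running total ordered by date; with an ORDER BY, the default frame is
# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
spark.sql("""
    SELECT order_date,
           amount,
           SUM(amount) OVER (ORDER BY order_date) AS running_total
    FROM orders
""").show()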

Questions:-
Sometimes when I write SQL queries I mess up at a very basic level, so I learned this approach to get my queries running successfully.

1. Choose the specific columns or expressions to be included in the “SELECT” clause.

2. Specify the tables or views to be used in the “FROM” clause.

3. Cautiously filter the rows based on specific conditions in the “WHERE” clause.


4. In “GROUP BY”, group the rows with the same values into summary rows.

5. Similar to “WHERE”, but filter the rows based on a condition in the “HAVING” clause after the group by.

6. Carefully sort the order using “ORDER BY”, otherwise it gives errors.

7. Use only the required tables in the “JOIN” clause; avoid joining all tables.

8. Be careful while using the “LIMIT” clause to limit the query results.

9. Use “OFFSET” to skip rows.

10. Use the “DISTINCT” clause well to remove duplicates.

Writing an executable SQL query does require a lot of analysis of the data. Make sure you do that.

Follow these 10 rules to write SQL queries in your interviews.
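
A hedged end-to-end example of that clause order (SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT), again run through spark.sql on a made-up emp table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clause-order").getOrCreate()

spark.createDataFrame(
    [("IT", "Arun", 9000), ("IT", "Meena", 12000),
     ("HR", "Maya", 15000), ("HR", "Ravi", 7000)],
    ["dept", "name", "salary"],
).createOrReplaceTempView("emp")

spark.sql("""
    SELECT dept, COUNT(*) AS headcount, AVG(salary) AS avg_salary  -- 1. columns/expressions
    FROM emp                                                       -- 2. source table
    WHERE salary > 5000                                            -- 3. row filter
    GROUP BY dept                                                  -- 4. grouping
    HAVING AVG(salary) > 8000                                      -- 5. group filter
    ORDER BY avg_salary DESC                                       -- 6. sorting
    LIMIT 10                                                       -- 7. cap the results
""").show()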

AWM Technical Functional Analyst and Data Analyst JD is already published below.

Here is my interview experience with Tesco for you..


Remember, I got the call through a LinkedIn InMail.

Round 1: Technical (1.5 hrs)


Project explanation, roles and responsibilities.
SQL advanced coding, PySpark advanced coding.
Spark scenario-based questions.
How do you do code optimisation? Job optimisation questions.
What optimisations were used in the project?
Azure scenario-based questions.
Data Factory optimisation questions.
Pipeline design basics.
Databricks scenario-based questions.
Kafka intermediate questions.
Airflow experience? How do you work with it?

Round 2: Techno manager(1 hr)


Project explanation, what is my role?
How to build a pipeline?
Gave scenario, asked to build a pipeline?
What tech stack to use? Why did you choose that?
Why only this solution? Do you know any?
How do you implement security in this solution?
How do you optimise your code?
How long do you take to complete this project?
How do you schedule your jobs? What specific tools have you used?
Have you ever designed a pipeline?
How do you deploy the code? Design that.
How do you build the final report?

It was all about one system design question, plus a lot of cross-questions.

Result: failed ❌ (reason: I should have had a better understanding of security as a data engineer)

The feedback makes sense to me: if you are into data, you must know how to keep it safe.

If you are working in #dataengineering, learning Snowflake is very important for your interviews and your job....

This document consists of the following modules.....


1. Loading data
2. Writing queries, Cloning
3. Working with different data,
4. RBAC Controls
5. Data sharing
6. Practice labs
7. Real time scenarios.

My recent interview experience with Walmart India 👇

Just to tell you, I got the call through Instahyre.

Round 1: (Tech 1)
Introduced ourselves to each other.
Overall experience, skillset, project explanation.
Questions on my role in the project.
Frequency of data, issues faced in the project, and the resolutions provided.
Questions on Spark, real-time scenarios.
PySpark coding, SQL coding, Python coding.
Azure questions.
Delta Lake, Databricks.
Data modeling, CI/CD pipelines.
Snowflake schema vs star schema.
Basics of Kafka.

Round 2: (Tech 2)
Project explanation.
System design on a data use case, cross-questions.
Internals of Spark; in depth on Hive and HBase.
PySpark advanced coding, SQL advanced.
Real-time scenarios for an Azure data engineer.
CI/CD in depth.
Slowly changing dimensions, handling CDC data.
Code optimisation in Databricks, Delta Lake.
Basics of Snowflake.
In-depth questions on Airflow.
A few questions on Kafka.

Round 3: (Techno manager)


Project explanation, project architecture.
How to create optimised pipelines.
Experience working in e-commerce; what domains have you worked in before?
Questions on client-facing skills.
Questions on leadership skills.
How do you handle very urgent use cases and scenarios?
Questions on streaming experience.

Result: failed 🤝

Though I got very good feedback in the 2 technical rounds, I still had gaps in my skills.

I could have prepared a bit better on management skills and real scenarios, and I shouldn't have got tense.

On a real note, technical interviews are much easier than managerial interviews 😉

So be prepared well.

Many of my colleagues at Accenture wanted to crack the Big 4 or any major Product giant.

However, only a few were able to switch jobs successfully.

Which strategy did they use? Over the years, I've talked to them, and here's what they told me:

🔹 The interview rounds are not the main hurdle, it is our laziness.

Take my case: when I was preparing for my switch, there were days I wanted to relax after coming home from work, but I upskilled and practiced SQL, Python, etc. instead.

I learned early that the better I utilize my free time, the better my chances of cracking any company.

🔹 Basics are the key ingredient for your success, but remember other topics as well.
My colleagues who skipped SQL failed most interviews because it's a must-prepare topic for any working professional.

So, make sure you have a plan to cover every tool timely. Get into the habit of applying your theoretical knowledge to practical problems.

And if you want to check where you’re right now in preparation, just try a Mock interview.

I gave multiple mock interviews before the real PwC or Brillio interviews.

If you want an All-In-One package for preparation, then I’ll suggest Bosscoder Academy.

Check them here: https://bit.ly/3I0VcWJ

On-premises data migration to Azure Data Factory (ADF)!!

It's crucial to consider the implications of data skew, which can affect the performance and efficiency of data
processing pipelines. Here's a detailed overview:
1. What is Data Skew
▪️Data skew refers to the uneven distribution of data across processing resources.
▪️It occurs when certain data partitions or keys contain significantly more data than others.

2. What are the Implications of Data Skew in an Azure Data Factory Migration
▪️Data skew can lead to inefficient resource utilization, causing some processing units to be overloaded while others remain underutilized.
▪️It can result in performance degradation, longer processing times, and increased costs due to inefficient resource usage.

3. What are the Challenges


▪️Migrating from an on-premises environment to Azure Data Factory may exacerbate existing data skew issues or introduce new ones due to differences in
infrastructure and configuration.

4. What is the Strategy


▪️Review and optimize data partitioning strategies to ensure even distribution of data across processing units.
▪️Utilize hash partitioning or range partitioning techniques to evenly distribute data based on key attributes.
▪️Utilize Azure Data Factory's built-in capabilities for data skew detection and management.
▪️Leverage features such as dynamic partitioning and parallel execution to handle skewed data more efficiently.
▪️Implement monitoring and performance tuning mechanisms to identify and address data skew issues proactively.
▪️Use Azure Monitor and Azure Data Factory performance insights to monitor pipeline performance and identify bottlenecks.
▪️Review and optimize data processing logic to minimize the impact of data skew on pipeline performance.
▪️Implement parallel processing techniques and distributed computing frameworks to handle skewed data more effectively.
▪️Analyze data distribution patterns and adjust data processing strategies accordingly.
▪️Consider factors such as data volume, skewness, and processing requirements when designing data pipelines.

5. What are the Best Practices


▪️Regularly review and optimize data partitioning strategies based on evolving data patterns and processing requirements.
▪️Implement automated monitoring and alerting mechanisms to detect and address data skew issues in real-time.
▪️Continuously iterate and refine data processing pipelines based on performance feedback and insights gathered from monitoring tools.

By implementing these strategies and best practices, organizations can effectively manage data skew challenges during the migration process and optimize the
performance and efficiency of their data processing workflows in Azure Data Factory.
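
As a concrete illustration of the parallel-processing point above, here is a hedged PySpark salting sketch for one skewed join key; the data, column names, and salt factor are all illustrative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()

SALT_BUCKETS = 8  # illustrative; tune to the observed skew

# events is heavily skewed on customer_id; customers is small.
events = spark.createDataFrame(
    [("C1", 10.0)] * 1000 + [("C2", 5.0)] * 10, ["customer_id", "amount"]
)
customers = spark.createDataFrame([("C1", "Gold"), ("C2", "Silver")], ["customer_id", "tier"])

# Add a random salt to the skewed side so one hot key spreads across partitions...
salted_events = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("long"))

# ...and replicate the small side once per salt value so every salted key matches.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
salted_customers = customers.crossJoin(salts)

joined = salted_events.join(salted_customers, ["customer_id", "salt"]).drop("salt")
joined.groupBy("tier").agg(F.sum("amount").alias("total")).show()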

Data Engineer JD is as below:

• Experience in developing AWS/Azure based data pipelines and processing data at scale using the latest cloud technologies
• Experience in design and development of applications using Python (must have) or Java, and in Big Data technologies and tools such as PySpark/Spark, Hadoop, Hive, Kafka, etc.
• Good understanding of data warehousing concepts and expert-level skills in writing and optimizing SQL
• Experience in low-code development and metadata-configuration-based development, e.g., metadata-based data ingestion, schema drift detection, event-based development, etc.
• Develop and unit test the functional aspects of the required data solution leveraging the core frameworks (foundational code framework)
• Experience with working in an Agile environment and a CI/CD-driven testing culture
• Prioritize your work in conjunction with multiple teams, as you will be working with different components
• Able to set up and take on design tasks if necessary
• Deal with incidents in a timely manner, coming up with rapid resolutions
• The ability to keep current with the constantly changing technology industry
• Present to clients with due supervision
• Efficiently manage junior team members
• Training, supervision and providing guidance to junior staff
• Various administrative duties, including recruiting

Good To have:

• Knowledge of the Insurance Industry
• Finance and/or Actuarial project experience
• Basic knowledge of the financial sector and capital markets
• Proficient in written and spoken English
• PowerPoint, Word, Excel, MS Access, etc.

Data Engineering Roadmap..


https://lnkd.in/g_2BVCFp

What is Big Data....
https://lnkd.in/gCutYTjV

Hadoop Architecture...
https://lnkd.in/g7cmKcdp

Hadoop commands..
https://lnkd.in/gTMVKYUn

Learn SQL...
https://lnkd.in/gAGY-vX3

Learn Hive...
https://lnkd.in/g_HbE6jJ

Learn Python...
https://lnkd.in/ggMZRfpf

Learn Pyspark...
https://lnkd.in/gz82NHGp

Learn Scala Spark...


https://lnkd.in/gSzSiW6h

Learn Spark..
https://lnkd.in/gc4cViiQ

My interview experience...
https://lnkd.in/gXRmkFtP

SQL interview questions...


https://lnkd.in/gHB_bq48

Spark Interview questions..


https://lnkd.in/ggpduwRv

Pyspark interview questions...


https://lnkd.in/geKBdxBn

Azure Data Factory Interview questions..


https://lnkd.in/geaxscBs

Learn DSA....
https://lnkd.in/gz2PP8Z7

Complete Interview Guide..


https://lnkd.in/gy-nQ28W

Learn Power Bi..


https://lnkd.in/gV2Kn9M8

Learn Snowflake..
https://lnkd.in/gfmkej2B

SQL projects.....
https://lnkd.in/gPd_jkGU

data engineering projects..


https://lnkd.in/g9hbFGUU
https://lnkd.in/g-WBDKVd

500+ data engineering interview questions here...


https://lnkd.in/gNgnVNPg

