
Topic 1 question 1 discussion

nkav
:: Highly Voted 3 months, 3 weeks ago
ProductKey is a surrogate key, as it is an identity column
upvoted 24 times
111222333
:: 3 months, 1 week ago
Agree on the surrogate key, exactly. "In data warehousing, IDENTITY functionality is particularly important as it makes easier the creation of surrogate keys."
Why ProductKey is certainly not a business key: "The IDENTITY value in Synapse is not guaranteed to be unique if the user explicitly inserts a duplicate value with 'SET IDENTITY_INSERT ON' or reseeds IDENTITY". A business key is an index that identifies the uniqueness of a row, and here Microsoft says that IDENTITY doesn't guarantee uniqueness.
References:
https://azure.microsoft.com/en-us/blog/identity-now-available-with-azure-sql-data-warehouse/
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
upvoted 3 times
...
...
sagga
:: Highly Voted 3 months, 2 weeks ago
Type 2 because there are start and end columns, and ProductKey is a surrogate key. ProductNumber seems to be a business key.
upvoted 14 times
DrC
:: 2 months, 4 weeks ago
The start and end columns indicate from when to when the product was being sold; they are not metadata about the row. That makes it Type 1 (no history): the record is updated directly, so there is no record of historical values, only the current state.
upvoted 13 times
captainbee
:: 2 months, 3 weeks ago
Exactly how I saw it
upvoted 1 times
...
...
...
SatyamKishore
:: Most Recent 22 hours, 40 minutes ago
This is a divided discussion; I'm still confused whether it is SCD Type 1 or 2.
upvoted 1 times
...
YipingRuan
:: 3 days, 3 hours ago
Type 2 and surrogate key.
"The table must also define a surrogate key because the business key (in this instance, employee ID) won't be unique."
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
So IDENTITY suggests [ProductKey] is a surrogate key.
upvoted 1 times
...
anarvekar
:: 2 weeks, 2 days ago
I guess the answer Type 2 is valid because RowInsertedDateTime and RowUpdatedDateTime are being used as Type 2 effective dates, where the inserted date is the effective_from date and the updated date is the effective_to date, which will be set to some far-future date or NULL for the currently active records. So I'm convinced that it is Type 2.
However, ProductKey has to be a surrogate key. An identity column can never be a business/natural key, as that's what we import from the source as-is, and the column is supposed to contain duplicates in the case of Type 2.
upvoted 1 times
...
Akki0120
:: 1 month ago
For all questions from contributor access 9403778084
upvoted 1 times
...
noone_a
:: 1 month, 2 weeks ago
SCD Type 1 is correct. There is no start/end date to show when the record is valid from/to; SellStart/SellEnd do not fulfill this role. A product might have a limited sales run, say of one month, and that is what these columns show. They don't show that the row has been replaced.
The key is a surrogate key. Identity fields generate unique values in most cases. Of course, this can be overridden using IDENTITY_INSERT, but that is usually only used to fix issues, not in day-to-day operations.
upvoted 3 times
...
Balaji1003
:: 1 month, 2 weeks ago
Type 1 and surrogate key.
Type 1 because SellStartDate and SellEndDate have business meaning and are not SCD columns.
Surrogate key because the ID is incremented for every insert.
upvoted 1 times
...
Steviyke
:: 2 months ago
The answer is Type 2 SCD and surrogate key. There is an [ETLAuditID] column that's an INT and tracks changes like 1 or 0 for history. Also, you cannot have a Type 1 SCD with a surrogate key.
upvoted 2 times
...
eng1
:: 2 months, 2 weeks ago
Type 2 doesn't need the insert and update fields, so it's Type 1 and a surrogate key.
upvoted 6 times
...
ThiruthuvaRajan
:: 2 months, 2 weeks ago
The SCD is Type 2. It has both start and end information, with which we can easily say which row is the current one. The "current" one refers to Type 2.
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
And the key is a unique identifier for each row, so it is a surrogate key.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
upvoted 1 times
captainbee
:: 2 months, 1 week ago
It really isn't Type 2. The start and end columns apply to the product being sold, not to the entry in the table. There is no IsActive column either. Type 1 all the way.
upvoted 3 times
...
...
DragonBlake
:: 2 months, 3 weeks ago
ProductKey is a surrogate key.
upvoted 2 times
...
clguy
:: 2 months, 3 weeks ago
ProductKey is the SK, SourceProductId is the BK, and it is a Type 1 SCD.
upvoted 2 times
...
dmnantilla9
:: 3 months ago
It is Type 2 and a surrogate key.
upvoted 1 times
...
wfrf92
:: 3 months, 2 weeks ago
Type 1
Surrogate Key
upvoted 11 times
baobabko
:: 2 months, 4 weeks ago
Type 1 as there is no obvious versioning, just latest value and the time of record creation and update.
upvoted 3 times
...
...
bananawu
:: 3 months, 2 weeks ago
Correct Answer, "In Azure Synapse Analytics, the IDENTITY value increases on its own in each distribution and does
not overlap with IDENTITY values in other distributions. The IDENTITY value in Synapse is not guaranteed to be
unique if the user explicitly inserts a duplicate value with “SET IDENTITY_INSERT ON” or reseeds IDENTITY. For
details, see CREATE TABLE (Transact-SQL) IDENTITY (Property)."
upvoted 1 times
baobabko
:: 2 months, 4 weeks ago
IDENTITY is assigned by the system. It has no business meaning. Hence it cannot be a business key.
Automatically generated and assigned keys are called Surrogate Keys.
upvoted 1 times
...
...
neerajkrjain
:: 3 months, 2 weeks ago
It should be a type 1 dimension.
upvoted 3 times
...
malakosan
:: 3 months, 3 weeks ago
I agree
upvoted 1 times
malakosan
:: 3 months, 3 weeks ago
With Arindamb
upvoted 1 times
...
...
Arindamb
:: 3 months, 3 weeks ago
An identity column holds a natural number, which is different from a natural key such as an SSN, mobile number, etc. Hence the answer should be surrogate key.
upvoted 4 times
malakosan
:: 3 months, 3 weeks ago
I Agree
upvoted 2 times
...
...
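For reference, a minimal sketch of the pattern discussed above, with hypothetical table and column names (not taken from the question): an IDENTITY column generates the surrogate key, the business key is carried over from the source system, and the sell dates remain ordinary business attributes rather than SCD validity columns.
CREATE TABLE dbo.DimProduct
(
    ProductKey    INT IDENTITY(1,1) NOT NULL,  -- surrogate key, generated by the system
    ProductNumber NVARCHAR(25)      NOT NULL,  -- business (natural) key from the source
    ProductName   NVARCHAR(100)     NOT NULL,
    SellStartDate DATE              NULL,      -- business attribute, not an SCD validity column
    SellEndDate   DATE              NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);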
Topic 1 question 2 discussion
AugustineUba
:: Highly Voted 2 weeks, 2 days ago
From the documentation the answer is clear enough. B is the right answer. When choosing a distribution column, select a distribution column that: "Is not a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work."
upvoted 6 times
...
waterbender19
:: Most Recent 2 weeks, 3 days ago
I think the answer should be D for that specific query. If you look at the datatypes, DateKey is an INT datatype, not a DATE datatype.
upvoted 2 times
waterbender19
:: 2 weeks, 2 days ago
And the statement that the fact table will have 1 million rows added daily means that each DateKey value has an equal number of rows associated with it.
upvoted 1 times
...
...
andimohr
:: 3 weeks, 6 days ago
The reference given in the answer is precise: choose a distribution column with data that a) distributes evenly, b) has many unique values, c) does not have NULLs or has few NULLs, and d) IS NOT A DATE COLUMN... Definitely the best choice for the hash distribution is the identity column.
upvoted 4 times
...
noone_a
:: 1 month, 2 weeks ago
Although it's a fact table, replicated is the correct distribution in this case. Each row is 141 bytes in size x 1,000,000 records = 135 MB total size. Microsoft recommends replicated distribution for anything under 2 GB. We have no further information regarding table growth, so this answer is based only on the info provided.
upvoted 1 times
noone_a
:: 1 month, 2 weeks ago
Edit: this is incorrect, as it will have 1 million records added daily for 3 years, putting it over 2 GB.
upvoted 2 times
...
...
vlad888
:: 1 month, 3 weeks ago
Yes, do not use a date column; there is such a recommendation in the Synapse docs. But here we have a range search, so potentially several nodes will be used.
upvoted 1 times
...
vlad888
:: 1 month, 3 weeks ago
Actually it is clear that it should be hash-distributed. But ProductKey brings no benefit for this query; it doesn't participate in it at all. So: DateKey, although it is unusual for Synapse.
upvoted 3 times
...
savin
:: 2 months ago
I don't think there is enough information to decide this, and we cannot decide it by just looking at one query. Considering only this query, and assuming no other dimensions are connected to this fact table, a good answer would be D.
upvoted 2 times
...
ChandrashekharDeshpande
:: 2 months, 2 weeks ago
My answer goes with D. In most cases data is partitioned on a date column that is closely tied to the order in which the data is loaded into the SQL pool. Partitioning improves query performance: a query that applies a filter to partitioned data can limit the scan to only the qualifying partitions, improving performance dramatically, since filtering can avoid a full table scan and only scan a smaller subset of data. It also seems that data partitioned on date will get distributed uniformly across the nodes, thereby avoiding a hot partition.
upvoted 1 times
vlad888
:: 1 month, 3 weeks ago
Avoiding a partition (a compute node, to be precise) is the least desirable thing; it is an MPP system. 60 nodes perform the work faster than 5.
upvoted 1 times
...
...
bc5468521
:: 2 months, 4 weeks ago
Agree with B.
upvoted 3 times
...
Ritab
:: 3 months, 2 weeks ago
Round robin looks to be the best fit
upvoted 1 times
baobabko
:: 2 months, 4 weeks ago
The question is about this exact query. To minimize the time for this query you should distribute the work. But if we do hash distribution on the date column, this will utilize at most 30 distributions. Round robin would be a good choice if this were really the only query we run, but we probably want to join with other tables on the primary key. So hash distribution on the primary key might be the better choice. If we assume a uniform primary key distribution, hashing on the PK will have the effect of round robin; hence B is the correct answer.
upvoted 7 times
DrC
:: 2 months, 4 weeks ago
Also: 1 million rows of data added daily, and the table will contain three years of data. It will have over a billion rows when loaded. That will put it over the 2 GB recommendation for hash-distributed tables.
Consider using a hash-distributed table when:
* The table size on disk is more than 2 GB.
* The table has frequent insert, update, and delete operations.
upvoted 1 times
lsdudi
:: 1 month, 1 week ago
Only round robin will use all 60 distributions; there is no join key.
upvoted 1 times
...
...
...
...
Pradip_valens
:: 3 months, 2 weeks ago
"Not D: Do not use a date column. . All data for the same date lands in the same distribution. If several users are all
filtering on the same date, then only 1 of the 60 distributions do all the processing work." ???
the same implies for
ProductKey, now forgiven query we may need to check every record for the date, so checking all 60 distribution ???
upvoted 2 times
freerider
:: 3 months, 2 weeks ago
According to the reference, there are multiple things that make it inappropriate to use the date column:
"Is not used in WHERE clauses. This could narrow the query to not run on all the distributions."
"Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions."
Replicated is unlikely to be correct since it's too much data (a million rows per day for the last 3 years). They also use the product key in the reference example.
upvoted 3 times
...
baobabko
:: 2 months, 4 weeks ago
The question is about this exact query. To minimize the time for this query you should distribute the work. But if we do hash distribution on the date column, this will utilize at most 30 distributions. Round robin would be a good choice if this were really the only query we run, but we probably want to join with other tables on the primary key. So hash distribution on the primary key might be the better choice. If we assume a uniform primary key distribution, hashing on the PK will have the effect of round robin.
upvoted 1 times
...
...
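To make the trade-off concrete, a sketch (table and column names assumed, not from the question) of the fact table hash-distributed on ProductKey rather than on the date key, so work spreads across all 60 distributions even when queries filter on a date range:
CREATE TABLE dbo.FactSales
(
    ProductKey  INT           NOT NULL,
    DateKey     INT           NOT NULL,   -- integer date surrogate, e.g. 20210315
    SalesAmount DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),   -- many unique values, not a date column
    CLUSTERED COLUMNSTORE INDEX
);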
Topic 1 question 3 discussion
uther
:: Highly Voted 3 months, 3 weeks ago
It should be ManagerEmployeeKey; in dimensions we use surrogates to create the hierarchy, so the answer IMO is C.
upvoted 26 times
baobabko
:: 2 months, 4 weeks ago
Agree. The purpose of a surrogate key is to encapsulate the business key, which might change unexpectedly or can have duplicates if data comes from different systems. The business key is preserved only for lineage/traceability purposes and should not be used for linking inside the data warehouse. In addition, as the table is defined, it is not a unique key.
upvoted 4 times
...
malakosan
:: 3 months, 2 weeks ago
I agree, it is C.
upvoted 5 times
...
...
TorbenS
:: Highly Voted 3 months, 1 week ago
I think the correct answer is [ManagerEmployeeID] (A), because at the time of the insert we can't guarantee that the manager has already been inserted, and thus we can't resolve the EmployeeKey of the manager, because it is an identity.
upvoted 7 times
DragonBlake
:: 2 months, 3 weeks ago
If you use ManagerEmployeeID, it is not unique. Correct answer is C
upvoted 3 times
...
...
YipingRuan
:: Most Recent 3 days, 2 hours ago
"Provide fast lookup of the manager" and surrogate key [ManagerEmployeeKey] is unique.
upvoted 1 times
...
angelato
:: 2 weeks ago
Explanation from Udemy: [ManagerEmployeeKey] [int] NULL is the correct line to add to the table. In dimensions we use surrogates. If [ManagerEmployeeID] [int] NULL is used to create a hierarchy, at the time of the insert we can't guarantee that the manager is already inserted and thus we can't resolve the EmployeeKey of the manager, because it is an identity.
Hierarchies, in tabular models, are metadata that define relationships between two or more columns in a table. Hierarchies can appear separate from other columns in a reporting client field list, making them easier for client users to navigate and include in a report.
upvoted 1 times
...
andimohr
:: 3 weeks, 6 days ago
The correct answer is A: [ManagerEmployeeID] [int] NULL.
Follow the given reference: "Hierarchies are... meant to be... used as a tool for providing a better user experience."
We are data engineers. The key point is that we should create a new column to "support creating an employee reporting hierarchy for your entire company". The entire company (data analysts, report consumers) will not be aware of the technically created surrogate "EmployeeKey". Naming the column with a reference to EmployeeId, and using the business value EmployeeId for this reference, will give most individuals in the company the best experience building data models, looking at sample data, etc.
My impression is that most discussions here have possible performance issues in mind. Both EmployeeId and EmployeeKey are integers and will perform similar if the .
upvoted 2 times
...
Akki0120
:: 1 month ago
For all questions from contributor access 9403778084
upvoted 1 times
...
EddyRoboto
:: 1 month, 2 weeks ago
What if we had an update in the manager table? The surrogate key would be incremented and we would lose the current manager information (if the manager table is an SCD Type 2). So I think the correct answer is A.
upvoted 5 times
EddyRoboto
:: 17 hours, 40 minutes ago
Please disregard; I misunderstood the question. The correct answer is C, as stated above.
upvoted 1 times
...
...
meswapnilspal
:: 2 months ago
What's the difference between ManagerEmployeeKey and ManagerEmployeeID? I am new to data warehousing concepts.
upvoted 2 times
...
Steviyke
:: 2 months ago
If you use [ManagerEmployeeKey] [int] NULL, how are you going to implement the hierarchy in your design? That is why A is the only logical option.
upvoted 2 times
...
bc5468521
:: 2 months, 4 weeks ago
Agree with C.
upvoted 1 times
...
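A small sketch of why the surrogate reference works (table and column names assumed): the new column stores the manager's surrogate key, and the reporting hierarchy is resolved with a self-join on EmployeeKey.
SELECT  e.EmployeeName AS Employee,
        m.EmployeeName AS Manager
FROM    dbo.DimEmployee AS e
LEFT JOIN dbo.DimEmployee AS m
        ON e.ManagerEmployeeKey = m.EmployeeKey;   -- a NULL ManagerEmployeeKey marks the top of the hierarchy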
Topic 1 question 4 discussion
kruukp
:: Highly Voted 3 months, 2 weeks ago
B is the correct answer. There is a column 'name' in the WHERE clause which doesn't exist in the table.
upvoted 47 times
knarf
:: 2 months ago
I agree B is correct, not because the column 'name' in the query is invalid, but because the table reference itself is invalid, as the table was created as CREATE TABLE mytestdb.myParquetTable and not mytestdb.dbo.myParquetTable.
upvoted 3 times
anarvekar
:: 2 weeks, 2 days ago
Isn't dbo the default schema the objects are created in, if the schema name is not explicitly specified in the DDL?
upvoted 1 times
...
AugustineUba
:: 4 weeks, 1 day ago
I agree with this.
upvoted 1 times
...
...
baobabko
:: 2 months, 4 weeks ago
Even if the column name were correct: when I tried the example, it threw an error that the table doesn't exist (as expected - after all, it is a Spark table, not SQL; there is no external or any other table which could be queried in the SQL pool).
upvoted 2 times
knarf
:: 2 months ago
See my post above and comment?
upvoted 1 times
...
Alekx42
:: 2 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table
"Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for querying by any of the Azure Synapse workspace Spark pools. The Spark created, managed, and external tables are also made available as external tables with the same name in the corresponding synchronized database in serverless SQL pool."
I think the reason you got the error was because the query had to use the lower case names. See the example in the same link; they create a similar table and use the lowercase letters to query it from the serverless SQL pool. Anyway, this confirms that B is the correct answer here.
upvoted 2 times
...
...
...
ast3roid
:: Most Recent 2 weeks, 2 days ago
The question is wrong. It looks like it was created referring to this example: https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table#examples
The table creation query is updated according to the question, but the SELECT query looks the same. The answer is B with `name` in the WHERE clause, and the answer is A with `EmployeeId` in the WHERE clause.
upvoted 1 times
...
knarf
:: 2 months ago
I vote for B. The table was inadvertently created with the schema 'mytestdb' and not the intended 'dbo' schema. The query refers to the three-part name mytestdb.dbo.myParquetTable, which is invalid.
upvoted 2 times
...
Steviyke
:: 2 months ago
The query will throw an ERROR, as name != EmployeeName. There is no column "Name" or "name" in the Spark pool table. If the table were queried with "employeename", it would return the right answer.
upvoted 1 times
...
savin
:: 2 months ago
The answer is B, since the column name is not "name".
upvoted 1 times
...
terajuana
:: 2 months, 2 weeks ago
From the documentation: "Azure Synapse Analytics allows the different workspace computational engines to share databases and Parquet-backed tables between its Apache Spark pools and serverless SQL pool."
upvoted 1 times
...
dmnantilla9
:: 3 months ago
The response would be A only if the column name were "EmployeeName", not "name".
upvoted 2 times
AndrewThePandrew
:: 2 months, 3 weeks ago
Agree. This is what threw me off.
upvoted 1 times
...
...
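For context, a sketch of the shared-metadata behaviour described in the linked article, with assumed column names: the table is created from a Spark pool and then queried from the serverless SQL pool using the lower-case name in the synchronized dbo schema.
-- Spark SQL, run on a Spark pool:
CREATE TABLE mytestdb.myParquetTable (EmployeeID INT, EmployeeName STRING) USING PARQUET;

-- Serverless SQL pool: the synchronized table appears lower-cased under dbo
SELECT EmployeeID
FROM   mytestdb.dbo.myparquettable
WHERE  EmployeeName = 'Alice';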
Topic 1 question 5 discussion
AvithK
:: 2 weeks ago
Truncating the partition is even quicker; why isn't that the answer, if the data is dropped anyway?
upvoted 1 times
BlackMal
:: 1 week, 3 days ago
This. I think it should be the answer.
upvoted 1 times
...
...
poornipv
:: 3 weeks, 5 days ago
what is the correct answer for this?
upvoted 2 times
...
AnonAzureDataEngineer
:: 4 weeks, 1 day ago
Seems like it should be:
1. E
2. A
3. C
upvoted 1 times
...
dragos_dragos62000
:: 1 month, 3 weeks ago
Correct!
upvoted 1 times
...
Dileepvikram
:: 2 months, 3 weeks ago
The data copy to the backup table is not mentioned in the answer.
upvoted 1 times
savin
:: 2 months ago
The partition switching part covers it, so it's correct I think.
upvoted 1 times
...
...
wfrf92
:: 3 months, 2 weeks ago
Is this correct ????
upvoted 1 times
alain2
:: 3 months, 1 week ago
Yes, it is.
https://www.cathrinewilhelmsen.net/table-partitioning-in-sql-server-partition-switching/
upvoted 3 times
...
TorbenS
:: 3 months, 1 week ago
yes, I think so
upvoted 4 times
...
...
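A minimal sketch of the partition-switching step being discussed (table names and partition number are assumed): the switch is a metadata-only operation, which is why it is fast, and the switched-out data can then be truncated or archived separately.
-- Move the oldest partition out of the fact table into an empty table with the same structure
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old PARTITION 1;

-- The switched-out data can now be archived or removed without touching the fact table
TRUNCATE TABLE dbo.FactSales_Old;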
Topic 1 question 6 discussion
Chillem1900
:: Highly Voted 3 months, 3 weeks ago
I believe the answer should be B. In the case of a serverless pool, a wildcard should be added to the location.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
upvoted 31 times
...
alain2
:: Highly Voted 3 months, 1 week ago
"Serverless SQL pool can recursively traverse folders only if you specify /** at the end of path."
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
upvoted 9 times
Preben
:: 2 months, 2 weeks ago
When you are quoting from Microsoft documentation, do not ADD in words to the sentence. 'Only' is not used.
upvoted 5 times
...
...
Akki0120
:: Most Recent 1 month ago
For all questions from contributor access 9403778084
upvoted 2 times
...
elimey
:: 1 month ago
The answer is B
upvoted 2 times
...
AKC11
:: 1 month, 1 week ago
Answer is B. C can be the answer only if there are wildcards in the path: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
upvoted 1 times
...
InvisibleShadow
:: 1 month, 3 weeks ago
Answer should be B. Please fix in the exam question.
upvoted 2 times
...
bc5468521
:: 2 months, 4 weeks ago
Go for B
upvoted 4 times
...
wfrf92
:: 3 months, 2 weeks ago
Unlike Hadoop external tables, native external tables don't return subfolders unless you specify /** at the end of the path. In this example, if LOCATION='/webdata/', a serverless SQL pool query will return rows from mydata.txt. It won't return mydata2.txt and mydata3.txt because they're located in a subfolder. Hadoop tables will return all files within any sub-folder.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop
upvoted 4 times
...
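A sketch of the LOCATION wildcard the thread refers to, assuming a pre-created external data source and file format (the names below are placeholders):
CREATE EXTERNAL TABLE dbo.WebData
(
    UserId    INT,
    VisitedOn DATETIME2
)
WITH
(
    LOCATION    = '/webdata/**',        -- /** lets a native (serverless) external table traverse subfolders
    DATA_SOURCE = MyDataLake,           -- placeholder external data source
    FILE_FORMAT = MyParquetFormat       -- placeholder external file format
);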
Topic 1 question 7 discussion
alain2
:: Highly Voted 3 months, 1 week ago
1: Parquet - column-oriented binary file format
2: AVRO - Row based format, and has logical type timestamp
https://youtu.be/UrWthx8T3UY
upvoted 27 times
terajuana
:: 2 months, 2 weeks ago
The web is full of old information; timestamp support has been added to Parquet.
upvoted 3 times
vlad888
:: 1 month, 3 weeks ago
OK, but in the first case we need only 3 of 50 columns, and Parquet is a columnar format. In the second case Avro, because it is ideal for reading full rows.
upvoted 4 times
...
...
...
Himlo24
:: Highly Voted 3 months, 2 weeks ago
Shouldn't the answer for Report 1 be Parquet? The Parquet format is columnar and should be best for reading only a few columns.
upvoted 7 times
...
elimey
:: Most Recent 1 month ago
https://luminousmen.com/post/big-data-file-formats
upvoted 1 times
...
elimey
:: 1 month ago
Report 1 definitely Parquet
upvoted 1 times
...
noone_a
:: 1 month, 2 weeks ago
Report 1 - Parquet, as it is columnar.
Report 2 - Avro, as it is row-based and can be compressed further than CSV.
upvoted 1 times
...
bsa_2021
:: 2 months ago
The answer provided and the answer from the discussion differ. Which one should we follow for the actual exam?
upvoted 1 times
...
bc5468521
:: 2 months, 4 weeks ago
1 - Parquet
2 - Parquet
Since they are all querying: Avro is good for writing (OLTP), Parquet is good for querying/reading.
upvoted 4 times
...
szpinat
:: 3 months, 1 week ago
For Report 2 - why not csv?
upvoted 1 times
...
ehnw
:: 3 months, 2 weeks ago
There is no mention of Avro in the learning materials provided by Microsoft; not sure about it.
upvoted 1 times
...
Topic 1 question 8 discussion
sagga
:: Highly Voted 3 months, 2 weeks ago
D is correct
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#batch-jobs-structure
upvoted 17 times
...
Sunnyb
:: Most Recent 2 months, 2 weeks ago
D is absolutely correct
upvoted 2 times
...
Topic 1 question 9 discussion
elimey
:: 1 month ago
correct
upvoted 2 times
...
Krishna_Kumar__
:: 2 months ago
The answer seems correct: 1: Parquet, 2: Avro.
upvoted 2 times
...
Topic 1 question 10 discussion
alain2
:: Highly Voted 3 months, 1 week ago
1. Merge Files
2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
upvoted 29 times
Ameenymous
:: 3 months ago
The smaller the files, the worse the performance, so Merge and Parquet seem to be the right answer.
upvoted 7 times
...
...
captainbee
:: Highly Voted 1 month, 3 weeks ago
It's frustrating just how many questions ExamTopics gets wrong. It can't be helpful.
upvoted 11 times
RyuHayabusa
:: 1 month ago
At least it helps in learning, as you have to research and think for yourself. On another note, having these questions in the first place is immensely helpful.
upvoted 5 times
...
...
elimey
:: Most Recent 1 month ago
1. Merge Files: because the question says the data is initially ingested as 10 small JSON files
2. Parquet
upvoted 3 times
...
Erte
:: 1 month, 3 weeks ago
Box 1: Preserve hierarchy
Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.
Box 2: Parquet
The Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2. Parquet supports the schema property.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
upvoted 1 times
...
ThiruthuvaRajan
:: 2 months, 3 weeks ago
It should be 1) Merge Files - the question clearly says "initially ingested as 10 small json files". There is no hint on hierarchy or partition information, so clearly we need to merge these files for better performance.
2) Parquet - always gives better performance for columnar-based data.
upvoted 5 times
...
Topic 1 question 12 discussion
yobllip
:: Highly Voted 2 months, 3 weeks ago
The answer should be:
1 - Cool
2 - Archive
The comparison table shows that the access time (time to first byte) for the cool tier is milliseconds.
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers#comparing-block-blob-storage-options
upvoted 16 times
...
ssitb
:: Most Recent 2 months, 3 weeks ago
The answer should be: 1 - Hot, 2 - Archive.
https://www.bmc.com/blogs/cold-vs-hot-data-storage/
Cold storage data retrieval can take much longer than hot storage. It can take minutes to hours to access cold storage data.
upvoted 2 times
captainbee
:: 2 months, 2 weeks ago
Cold storage takes milliseconds to retrieve
upvoted 3 times
...
syamkumar
:: 2 months, 2 weeks ago
I also suspect it's hot storage and archive, because it's mentioned that 5-year-old data has to be retrieved within seconds, which is not possible via cold storage.
upvoted 1 times
savin
:: 2 months ago
But the cost factor is also there: keeping the data in the hot tier for 5 years vs. the cool tier for 5 years would add a significant amount.
upvoted 1 times
...
...
...
DrC
:: 2 months, 4 weeks ago
Answer is correct
upvoted 4 times
...
Topic 1 question 13 discussion
Sunnyb
:: Highly Voted 2 months, 3 weeks ago
Answer is correct
upvoted 12 times
...
Topic 1 question 14 discussion
bc5468521
:: Highly Voted 2 months, 4 weeks ago
Answer D; a temporal table is better than SCD2, but it is not supported in Synapse yet.
upvoted 8 times
Preben
:: 2 months, 2 weeks ago
Here's the documentation for how to implement temporal tables in Synapse from 2019.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-temporary
upvoted 1 times
mbravo
:: 2 months, 2 weeks ago
Temporal tables and temporary tables are two very distinct concepts. Your link has absolutely nothing to do with this question.
upvoted 5 times
Vaishnav
:: 1 month, 2 weeks ago
https://docs.microsoft.com/en-us/azure/azure-sql/temporal-tables
Answer : A Temporal Tables
upvoted 1 times
Vaishnav
:: 1 month, 2 weeks ago
Sorry, the answer is D: SCD 2. According to the Microsoft docs, "Temporal tables keep data closely related to time context so that stored facts can be interpreted as valid only within the specific period." Since the question mentions "from a given point in time", D seems to be correct.
upvoted 1 times
...
...
...
...
...
dd1122
:: Most Recent 1 week, 5 days ago
Answer D is correct. The temporal tables mentioned in the link below are supported in Azure SQL Database (PaaS) and Azure SQL Managed Instance, whereas this question mentions dedicated SQL pools, so temporal tables cannot be used. SCD Type 2 is the answer.
https://docs.microsoft.com/en-us/azure/azure-sql/temporal-tables
upvoted 2 times
...
escoins
:: 1 month, 4 weeks ago
Definitely answer D.
upvoted 1 times
...
[Removed]
:: 2 months, 1 week ago
The answer is A - temporal tables.
"Temporal tables enable you to restore row versions from any point in time."
https://docs.microsoft.com/en-us/azure/azure-sql/database/business-continuity-high-availability-disaster-recover-hadr-overview
upvoted 1 times
...
Dileepvikram
:: 2 months, 3 weeks ago
The requirement says that the table should store the latest information, so the answer should be temporal table, right? Because SCD Type 2 will store the complete history.
upvoted 1 times
captainbee
:: 2 months, 2 weeks ago
Also needs to return employee information from a given point in time? Full history needed for that.
upvoted 6 times
...
...
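To illustrate the point-in-time requirement that drives answer D, a sketch (column names assumed, not from the question) of how a Type 2 dimension answers an "as of" lookup from its validity columns:
SELECT  EmployeeKey, EmployeeID, Department
FROM    dbo.DimEmployee
WHERE   EmployeeID = 1234
  AND   ValidFrom <= '2020-06-01'
  AND   (ValidTo > '2020-06-01' OR ValidTo IS NULL);   -- the open-ended row is the current version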
Topic 1 question 15 discussion
Diane
:: Highly Voted 3 months, 2 weeks ago
The correct answer is ABF: https://www.examtopics.com/discussions/microsoft/view/41207-exam-dp-200-topic-1-question-56-discussion/
upvoted 22 times
AvithK
:: 1 week, 6 days ago
Yes, but the order is different; it is FAB.
upvoted 1 times
KingIlo
:: 1 week, 2 days ago
The question didn't specify order or sequence
upvoted 1 times
...
...
...
AvithK
:: Most Recent 2 weeks ago
I don't get why it doesn't start with F. The managed identity should be created first, right?
upvoted 2 times
...
IDKol
:: 1 month ago
The correct answer should be:
F. Create a managed identity.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
...
MonemSnow
:: 1 month, 2 weeks ago
A, C, F is the correct answer
upvoted 1 times
...
savin
:: 2 months ago
We need to configure Synapse to be able to access the data lake, so we need to create a managed identity and add it to the Sales group, since that group can already access the data lake. Adding our AD credentials to the Sales group would allow us to access the storage using those credentials, but we would not be able to load the data into Synapse.
upvoted 1 times
...
Krishna_Kumar__
:: 2 months ago
The correct answer should be:
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
F. Create a managed identity.
upvoted 2 times
...
jikilim858
:: 2 months, 1 week ago
ADF = Azure Data Factory
upvoted 4 times
...
savin
:: 2 months, 2 weeks ago
ABF should be correct
upvoted 3 times
...
AndrewThePandrew
:: 2 months, 3 weeks ago
The answer should be F: create a managed identity, A: add the managed identity to the group, D: use the managed identity for the load process via Azure Active Directory. How can you add a managed identity to something if it is not created first? Maybe others are seeing this in a different order?
upvoted 4 times
...
wfrf92
:: 3 months, 2 weeks ago
it should be A,B,F
upvoted 4 times
...
Topic 1 question 19 discussion
steeee
:: 16 hours, 35 minutes ago
The correct answer should be A.
upvoted 2 times
...
Topic 1 question 20 discussion
JohnMasipa
:: Highly Voted 1 day, 3 hours ago
This can't be correct. Should be D.
upvoted 5 times
...
Topic 1 question 21 discussion
Blueko
:: Highly Voted 1 day, 4 hours ago
Request: "The solution must minimize how long it takes to load the data to the staging table" The distribution should be
Round-Robin, not Hash, as in the answer's motivations: "Round-robin tables are useful for improving loading speed"
upvoted 5 times
...
A1000
:: Most Recent 18 hours, 39 minutes ago
Round-Robin
Heap
None
upvoted 2 times
...
viper16752
:: 21 hours, 43 minutes ago
Answers should be:
Distribution - Round Robin (see https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute)
Indexing - Heap (see https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index)
Partitioning - None (it's a staging table, no sense in partitioning here)
upvoted 2 times
...
Gopinath601
:: 23 hours, 18 minutes ago
I feel that the answer is: Distribution = Hash, Indexing = Heap, Partitioning = Date.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index
upvoted 1 times
...
Nilay95
:: 1 day, 2 hours ago
I think the answer should be:
1. Round Robin
2. Clustered Columnstore
3. None
Is partitioning allowed with round-robin distribution? Please someone confirm and modify the answer accordingly if needed.
upvoted 2 times
steeee
:: 16 hours, 29 minutes ago
Totally agree with you. Thanks.
upvoted 1 times
...
...
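A sketch of the staging-table definition the comments converge on (table and column names assumed): round-robin distribution and a heap, with no partitioning, to keep the load as fast as possible.
CREATE TABLE dbo.Stage_Sales
(
    ProductKey  INT,
    DateKey     INT,
    SalesAmount DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- even spread, no hash computation during ingest
    HEAP                          -- no index maintenance while loading
);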
Topic 2 question 1 discussion
Miris
:: Highly Voted 2 months, 2 weeks ago
correct
upvoted 5 times
...
mdalorso
:: Most Recent 3 weeks, 6 days ago
This is Stream Analytics Query Language, which is a little different from T-SQL: https://docs.microsoft.com/en-us/stream-analytics-query/last-azure-stream-analytics
upvoted 2 times
AvithK
:: 1 week, 3 days ago
so is the answer DATEDIFF+LAST incorrect then?
upvoted 1 times
...
...
vlad888
:: 1 month, 3 weeks ago
The query makes no sense, at least if it is T-SQL. Look: each row is either an end event or a start event. How can the window function (LAST() over a partition) get the start event if there is a WHERE condition that filters to end events only???
upvoted 2 times
...
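On vlad888's question: in Stream Analytics the WHEN clause inside LAST looks back over the incoming stream within the LIMIT DURATION window, so the outer WHERE that keeps only end events does not stop it from finding the matching start event. A sketch of that documented pattern, with event and column names assumed:
SELECT
    DeviceId,
    DATEDIFF(second,
             LAST(EventTime) OVER (PARTITION BY DeviceId
                                   LIMIT DURATION(hour, 1)
                                   WHEN EventType = 'start'),   -- last start event seen for this device
             EventTime) AS DurationSeconds
INTO Output
FROM Input TIMESTAMP BY EventTime
WHERE EventType = 'end'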
Topic 2 question 2 discussion
Francesco1985
:: Highly Voted 2 months, 1 week ago
correct
upvoted 8 times
...
AvithK
:: Most Recent 1 week, 6 days ago
Bad rows go to 'folder out' and the good rows to the junk table? How come?
upvoted 1 times
...
Topic 2 question 3 discussion
mayank
:: Highly Voted 2 months, 3 weeks ago
As per the link provided in the explanation, disjoint: false looks correct. I believe you should go through the link https://docs.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split and choose your answer for disjoint wisely. I will go with "False".
upvoted 14 times
...
Alekx42
:: Highly Voted 2 months, 3 weeks ago
I think "disjoint" should be True, so that data can be sent to all matching conditions. In this way the "all" output can get
the data from every department, which ensures that "data can also be processed by the entire company".
upvoted 9 times
Steviyke
:: 2 months ago
I concur with @Alekx42's thought. Since we want to process for each dept (3 streams), we must ensure we can still process for ALL depts at the same time (4th or default stream), hence DISJOINT: TRUE. Otherwise, DISJOINT: FALSE.
upvoted 1 times
...
...
brendy
:: Most Recent 1 week, 2 days ago
The top votes are split, any consensus?
upvoted 1 times
...
Vaishnav
:: 1 month, 2 weeks ago
The answer is correct. Refer to the Microsoft doc below:
https://docs.microsoft.com/en-us/azure/data-factory/data-flow-conditional-split
upvoted 1 times
...
escoins
:: 1 month, 4 weeks ago
The provided link deals with "all other"; here we have the situation with "all". Therefore I think disjoint: true should be correct.
upvoted 1 times
...
Topic 2 question 4 discussion
sagga
:: Highly Voted 3 months, 1 week ago
I think the correct order is:
1) Mount onto DBFS
2) Read into a data frame
3) Transform the data frame
4) Specify a temporary folder
5) Write to a table in SQL data warehouse
About the temporary folder, there is a note explaining this:
https://docs.microsoft.com/en-us/azure/databricks/scenarios/databricks-extract-load-sql-data-warehouse#load-data-into-azure-synapse
Discussions about this question:
https://www.examtopics.com/discussions/microsoft/view/11653-exam-dp-200-topic-2-question-30-discussion/
upvoted 41 times
andylop04
:: 1 month, 3 weeks ago
Today I received this question in my exam. Only appeared the 5 options of this response. I only had to order, not
choice. This solutions is the correct. Thanks sagga.
upvoted 9 times
...
labasmuse
:: 3 months, 1 week ago
Hi sagga! Thank you. I do agree....
upvoted 2 times
InvisibleShadow
:: 2 months ago
fix solution on site
upvoted 2 times
...
...
...
Miris
:: Highly Voted 2 months, 2 weeks ago
1) Mount the data onto DBFS
2) Read the file into a data frame
3) Perform transformations on the file
4) Specify a temporary folder to stage the data
5) Write the results to a table in Azure Synapse
upvoted 8 times
...
steeee
:: Most Recent 13 hours, 56 minutes ago
The given answer is correct, after read the link provided carefully several times. There's already a service principal. With
that, it's no need to mount. You do need to drop the dataframe as the last step.
upvoted 1 times
...
labasmuse
:: 3 months, 1 week ago
Correct solution: Read the file into a data frame
Perform transformations on the file
Specify a temporary folder to stage
the data
Write the results to a table in Azure synapse
Drop the data frame
upvoted 4 times
ThiruthuvaRajan
:: 2 months, 3 weeks ago
You should not perform transformations on the file, and you don't need to drop the data frame. sagga's options are correct.
upvoted 2 times
...
Wisenut
:: 3 months, 1 week ago
I believe you perform transformation on the data frame and not on the file
upvoted 5 times
...
...
Topic 2 question 5 discussion
Puneetgupta003
:: Highly Voted 2 months, 1 week ago
Answers are correct.
upvoted 8 times
...
belha
:: Most Recent 1 month, 3 weeks ago
Not a schedule trigger?
upvoted 1 times
captainbee
:: 1 month, 2 weeks ago
As the solution says, you cannot use the Delay with Schedule.
upvoted 1 times
...
...
escoins
:: 1 month, 4 weeks ago
why not schedule trigger?
upvoted 1 times
...
Topic 2 question 6 discussion
Sunnyb
:: Highly Voted 2 months, 2 weeks ago
Answer is correct
upvoted 9 times
captainbee
:: 2 months, 2 weeks ago
Agreed. So easy that even ExamTopics got it right.
upvoted 17 times
...
...
Palee
:: Most Recent 1 month, 1 week ago
Right Answer. Answer to 3rd drop down is already in the question.
upvoted 1 times
...
Topic 2 question 7 discussion
zarga
:: Highly Voted 1 month, 2 weeks ago
The third one is wrong because the Stream Analytics application already exists in the project. The goal is to modify the current Stream Analytics application in order to read protobuf data. I think the right answer is the first one in the list (update the input.json file and reference the DLL).
upvoted 6 times
...
steeee
:: Most Recent 13 hours, 5 minutes ago
Third one should be the first action listed: Change file format in input.json
upvoted 1 times
...
Gowthamr02
:: 2 months, 2 weeks ago
Correct!
upvoted 1 times
...
Topic 2 question 8 discussion
zarga
:: 1 month, 2 weeks ago
A is the right answer (don't use autoresolve region)
upvoted 4 times
...
kishorenayak
:: 2 months ago
Shouldn't this be option A?
https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
"If you have strict data compliance requirements and need ensure that data do not leave a certain geography, you can explicitly create an Azure IR in a certain region and point the Linked Service to this IR using ConnectVia property. For example, if you want to copy data from Blob in UK South to Azure Synapse Analytics in UK South and want to ensure data do not leave UK, create an Azure IR in UK South and link both Linked Services to this IR."
upvoted 1 times
Dicupillo
:: 1 month, 3 weeks ago
Yes it's option A
upvoted 1 times
...
...
saty_nl
:: 2 months ago
Correct answer.
upvoted 2 times
...
damaldon
:: 2 months, 1 week ago
fully agree
upvoted 1 times
...
Sunnyb
:: 2 months, 3 weeks ago
A is correct
upvoted 2 times
...
Topic 2 question 9 discussion
Sunnyb
:: Highly Voted 2 months, 3 weeks ago
Answer is correct
upvoted 10 times
...
Topic 2 question 10 discussion
saty_nl
:: 2 months ago
Correct answer.
upvoted 3 times
...
damaldon
:: 2 months, 1 week ago
Correct, Tumbling Window is needed to use periodic time intervals
upvoted 2 times
...
Gowthamr02
:: 2 months, 2 weeks ago
Correct!
upvoted 2 times
...
Topic 2 question 11 discussion
Travel_freak
:: 1 week, 4 days ago
correct answer
upvoted 1 times
...
trungngonptit
:: 1 month, 4 weeks ago
correct answer
upvoted 3 times
...
Topic 2 question 12 discussion
Miris
:: Highly Voted 2 months, 2 weeks ago
correct
upvoted 5 times
...
damaldon
:: Most Recent 2 months, 1 week ago
Fully agree
upvoted 2 times
...
Topic 2 question 13 discussion
damaldon
:: Highly Voted 2 months, 1 week ago
Correct!
upvoted 7 times
...
Gowthamr02
:: Highly Voted 2 months, 2 weeks ago
Answer is correct!
upvoted 5 times
...
Topic 2 question 14 discussion
trungngonptit
:: 1 month, 4 weeks ago
Correct: Blob storage or Azure SQL Database.
upvoted 3 times
...
saty_nl
:: 2 months, 1 week ago
This is correct.
upvoted 4 times
...
Topic 2 question 15 discussion
Whiz_01
:: Highly Voted 3 months ago
This is hopping. It is overlapping
upvoted 32 times
AugustineUba
:: 2 weeks, 1 day ago
100% Hopping
upvoted 3 times
...
...
saty_nl
:: Highly Voted 2 months, 1 week ago
The correct answer is hopping, as we need to calculate a running average, which means the windows will overlap.
upvoted 12 times
...
Kbruv
:: Most Recent 4 days, 2 hours ago
It's hopping.
upvoted 1 times
...
arvind05
:: 1 month, 1 week ago
Hopping
upvoted 2 times
...
NithyaSara
:: 1 month, 1 week ago
I think the correct answer is hopping, because of the overlapping time periods.
upvoted 3 times
...
escoins
:: 1 month, 4 weeks ago
Go for hopping
upvoted 1 times
...
damaldon
:: 2 months, 1 week ago
Why is it overlapping?
upvoted 1 times
captainbee
:: 2 months ago
Because it wants to calculate the average costs for the last 15 minutes, every 5 minutes. The diagram is massively unhelpful.
upvoted 1 times
...
...
xig
:: 2 months, 2 weeks ago
The correct answer is hopping. Reference: https://docs.microsoft.com/en-us/stream-analytics-query/hopping-window-azure-stream-analytics
upvoted 2 times
...
Miris
:: 2 months, 2 weeks ago
hopping - https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
upvoted 3 times
...
nas28
:: 2 months, 2 weeks ago
Hopping bro
upvoted 1 times
...
captainbee
:: 2 months, 2 weeks ago
Hopping mad with this one
upvoted 1 times
...
Ameenymous
:: 2 months, 3 weeks ago
Should be Hopping !
upvoted 4 times
...
ThiruthuvaRajan
:: 2 months, 3 weeks ago
It is hopping window
upvoted 2 times
...
S5e
:: 2 months, 3 weeks ago
It should be Hopping
upvoted 3 times
...
Himlo24
:: 3 months, 2 weeks ago
Agree, this should be hopping
upvoted 4 times
...
stefanos
:: 3 months, 2 weeks ago
I am pretty sure it should be hopping.
upvoted 2 times
...
newuser995
:: 3 months, 2 weeks ago
Shouldn't it be hopping?
upvoted 2 times
...
Diane
:: 3 months, 2 weeks ago
Shouldn't this be hopping?
upvoted 2 times
...
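A sketch of the hopping-window query the thread converges on (stream and column names assumed): a 15-minute average recomputed every 5 minutes, so consecutive windows overlap.
SELECT
    DeviceId,
    AVG(Cost) AS AverageCost
INTO Output
FROM Input TIMESTAMP BY EventTime
GROUP BY DeviceId, HoppingWindow(minute, 15, 5)   -- window size 15 minutes, hop 5 minutes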
Topic 2 question 16 discussion
Alekx42
:: Highly Voted 2 months, 3 weeks ago
You do not need a window function. You just process the data and perform the geospatial check as it arrives. See the same example here:
https://docs.microsoft.com/en-us/azure/stream-analytics/geospatial-scenarios
upvoted 19 times
captainbee
:: 2 months, 2 weeks ago
That's what I thought, there's no reporting over time periods. It's just a case of when this happens, ping it off.
upvoted 2 times
...
...
JackArmitage
:: Highly Voted 2 months, 1 week ago
1. Azure Stream Analytics
2. No Window
3. Point within Polygon
upvoted 13 times
...
Amalbenrebai
:: Most Recent 5 days, 5 hours ago
The answers are correct; hopping is correct.
SELECT count(*) as NumberOfRequests, RegionsRefDataInput.RegionName
FROM UserRequestStreamDataInput
JOIN RegionsRefDataInput ON st_within(UserRequestStreamDataInput.FromLocation, RegionsRefDataInput.Geofence) = 1
GROUP BY RegionsRefDataInput.RegionName, hoppingwindow(minute, 1, 15)
upvoted 1 times
...
hs28974
:: 1 month, 2 weeks ago
I would say tumbling window, as minimizing cost is a requirement as well. No window means you will recalculate whether the point is inside the polygon every time a car moves; a tumbling window will only perform the calculation once every 30 seconds.
upvoted 2 times
GeneralZhukov
:: 3 days, 7 hours ago
The question says data from the vehicles is sent to Azure Event Hubs only once every minute, so this isn't valid reasoning.
upvoted 1 times
...
...
Newfton
:: 1 month, 2 weeks ago
The explanation for the hopping window only states what a hopping window is, not why it is the correct answer here. It does not make sense in this question; I think it should be No Window.
upvoted 1 times
...
Peterlustig2049
:: 1 month, 3 weeks ago
How will the CSV file be read, though? I thought Azure Stream Analytics can only load reference data from Blob storage or Azure SQL?
upvoted 1 times
...
eng1
:: 2 months ago
1. Azure Stream Analytics
2. No Window
3. Point within Polygon
No Window, because you can write a query that joins the device stream with the geofence reference data and generates an alert every time a device is outside of an allowed building.
SELECT DeviceStreamInput.DeviceID, SiteReferenceInput.SiteID, SiteReferenceInput.SiteName INTO Output
FROM DeviceStreamInput JOIN SiteReferenceInput
ON st_within(DeviceStreamInput.GeoPosition, SiteReferenceInput.Geofence) = 0
WHERE DeviceStreamInput.DeviceID = SiteReferenceInput.AllowedDeviceID
https://docs.microsoft.com/en-us/azure/stream-analytics/geospatial-scenarios#generate-alerts-with-geofence
upvoted 5 times
...
nas28
:: 2 months, 2 weeks ago
I would say No Window, because Azure Stream Analytics will have to respond when a vehicle is outside an area (by event). No window, since we don't want it to calculate a metric here: no mean, no sum.
upvoted 2 times
...
ThiruthuvaRajan
:: 2 months, 2 weeks ago
Answers:
1) Azure Stream Analytics
2) Hopping window
3) Point within Polygon
Geofencing is explained clearly here: https://docs.microsoft.com/en-us/azure/stream-analytics/geospatial-scenarios
upvoted 5 times
...
Whiz_01
:: 3 months ago
Hopping is in the answer. The event is only triggered when a condition is met, which means we will have overlapping events.
upvoted 6 times
captainbee
:: 2 months ago
But hopping is for reporting at set intervals? Not for when an event happens.
upvoted 1 times
...
...
sagga
:: 3 months, 1 week ago
isn't it tumbling window?
upvoted 8 times
alain2
:: 3 months, 1 week ago
yes, tumbling window makes more sense
upvoted 2 times
...
...
Topic 2 question 17 discussion
bc5468521
:: Highly Voted 2 months, 4 weeks ago
The ABS-AQS source is deprecated. For new streams, we recommend using Auto Loader instead.
upvoted 5 times
...
belha
:: Most Recent 1 month, 3 weeks ago
TRUE ???
upvoted 1 times
...
Topic 2 question 18 discussion
Sunnyb
:: Highly Voted 3 months ago
1/14 = 0.07
6% = 0.06
should be lowered.
upvoted 8 times
...
MirandaL
:: Highly Voted 2 months, 1 week ago
"We recommend that you increase the concurrent jobs limit only when you see low resource usage with the default
values on each node."
https://docs.microsoft.com/en-us/azure/data-factory/monitor-integration-runtime
upvoted 5 times
...
Jacob_Wang
:: Most Recent 1 month, 3 weeks ago
It might be about the ratio. For instance, 2/14 might need to be lowered to 2/20.
upvoted 1 times
...
saty_nl
:: 2 months ago
Concurrent jobs limit must be raised, as we are under-utilizing the provisioned capacity.
upvoted 2 times
...
damaldon
:: 2 months, 1 week ago
A) is correct because HA is set to FALSE.
https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime#high-availability-and-scalability
upvoted 1 times
...
terajuana
:: 2 months, 2 weeks ago
The limit should be left as-is to allow capacity for more jobs; a single job could use 20% CPU if it is running intensive work. The pricing model isn't by concurrency, so there is no budget rationale to lower it.
upvoted 1 times
...
bc5468521
:: 2 months, 4 weeks ago
2 jobs/node, but the CPU is not fully utilized; based on the workload, we don't need that many concurrent jobs, so lower it to 1 job/node.
upvoted 1 times
...
dfdsfdsfsd
:: 3 months, 1 week ago
I might be misunderstanding this, but the way I look at it is that if 2 concurrent jobs use 6% of the CPU, then 1 job requires 3% CPU and you could have approximately 100/3 = 33 concurrent jobs. So you can raise the limit. What makes me unsure is that I imagine not every job would be equal in CPU load.
upvoted 3 times
Alekx42
:: 2 months, 3 weeks ago
I agree with your explanation. I think lowering the limit makes no sense: the system is underloaded, so why should you limit the parallelism that you could have when many jobs eventually get executed at the same time? Maintaining the current value could be an option: there are no issues with the current configuration with respect to the maximum concurrent jobs value. Increasing the value is good if we take as true your hypothesis that every job requires the same CPU %.
upvoted 2 times
...
...
AssilAbdulrahim
:: 3 months, 1 week ago
✑ CPU Utilization: 6%
✑ Concurrent Jobs (Running/Limit): 2/14
I am also confused, but I tend to accept the explanation because the system still has very low utilization (6%) and only 2 out of 14 concurrent jobs are running... Hence I think it should be lowered...
Can you please explain why both of you think it should be raised?
upvoted 1 times
AssilAbdulrahim
:: 3 months, 1 week ago
I meant the scalability of nodes should be lowered...
upvoted 1 times
...
...
tanza
:: 3 months, 1 week ago
The concurrent jobs limit should be raised, no?
upvoted 5 times
Preben
:: 2 months, 2 weeks ago
If you eat 1 ice cream a day, but you buy 5 new ones every day, should you increase the amount of ice cream you buy, or lower it? This is the same. You are paying for 14 concurrent jobs, but you are only using 2. You are only using 6% of the CPU you have purchased, so you are paying for 94% that you do not use.
upvoted 5 times
bsa_2021
:: 2 months ago
The question is about the action w.r.t. the concurrent jobs value. Concurrent jobs should be raised to make full use of the resources. Also, (if possible) the resources should be lowered so that they are not wasted. I think the choice of raised/lowered should be based on the context, and the context here is about the concurrent jobs, not the resources. Hence, I think raised would be correct.
upvoted 2 times
Banach
:: 1 month, 2 weeks ago
I understand your point of view, and I understood the question the same way you did at first. But after reading the sentence carefully, it asks (as you said) about the limit value (or the settings) of concurrent jobs, knowing that you only use 6% of your CPU with only 2 concurrent jobs. Therefore, considering the waste of resources, "lowered" is, IMO, the correct answer here (although the formulation of the question is a bit confusing, I admit).
upvoted 1 times
...
...
terajuana
:: 2 months, 2 weeks ago
data factory pricing is based on activity runs and not concurrency
upvoted 2 times
...
...
alain2
:: 3 months, 1 week ago
IMO, it should be lowered because:
. Concurrent Jobs (Running/Limit): 2/14
. CPU Utilization: 6%
upvoted 1 times
...
MacronfromFrance
:: 3 months, 1 week ago
for me, it should be raised. I don't find explanation in the given link... :(
upvoted 2 times
...
...
Topic 2 question 19 discussion
brendy
:: 1 week, 2 days ago
Is this correct?
upvoted 1 times
...
husseyn
:: 2 months, 2 weeks ago
Concurrent jobs should be raised; there is low CPU utilization.
upvoted 1 times
husseyn
:: 2 months, 2 weeks ago
please ignore this, it was meant for the question before
upvoted 6 times
...
...
Topic 2 question 20 discussion
Prabagar
:: Highly Voted 2 months, 2 weeks ago
correct answer
upvoted 11 times
...
damaldon
:: Most Recent 2 months, 1 week ago
Fully agree
upvoted 2 times
...
Topic 2 question 21 discussion
Ati1362
:: Highly Voted 2 months, 2 weeks ago
answer correct
upvoted 6 times
...
dragos_dragos62000
:: Most Recent 1 month, 3 weeks ago
I think you can use a session window with a 10-second timeout... it is like a tumbling window with a 10-second window size.
upvoted 2 times
TedoG
:: 1 month ago
I Disagree. The session could be extended if the maximum duration is set longer than the timeout.
upvoted 2 times
...
RyuHayabusa
:: 1 month ago
The important thing to remember with a session window is the maximum duration. So theoretically a 10-second timeout can still result in a window of 20 minutes, for example (if every 9 seconds a new event comes in and the window never "closes"). If the maximum duration were 10 seconds, I would agree. But as the question is worded right now, the answer is NO.
https://docs.microsoft.com/en-us/stream-analytics-query/session-window-azure-stream-analytics
upvoted 3 times
...
EddyRoboto
:: 1 month, 1 week ago
Agree, because it doesn't overlap any events; it just groups them in a given time that we can define.
upvoted 1 times
...
...
Topic 2 question 22 discussion
Ati1362
:: Highly Voted 2 months, 2 weeks ago
answer is correct
upvoted 7 times
...
saty_nl
:: Most Recent 2 months, 1 week ago
The answer is A; the same result can be achieved via a hopping window, see below:
https://docs.microsoft.com/en-us/stream-analytics-query/hopping-window-azure-stream-analytics
upvoted 2 times
captainbee
:: 2 months ago
As eng1 says, it "can" be used to achieve the same affect as a tumbling window, but as they've set it to 5 and 10,
it won't be.
upvoted 3 times
...
eng1
:: 2 months ago
No, the hop size is not equal to the window size; to make a hopping window the same as a tumbling window, specify the hop size to be the same as the window size.
upvoted 8 times
...
...
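eng1's point above in sketch form (stream names assumed): a hopping window only behaves like a tumbling window when the hop size equals the window size.
-- Fixed, non-overlapping 10-second windows
SELECT COUNT(*) AS EventCount INTO Output1 FROM Input GROUP BY TumblingWindow(second, 10)

-- Equivalent hopping window: hop size equals window size, so windows do not overlap
SELECT COUNT(*) AS EventCount INTO Output2 FROM Input GROUP BY HoppingWindow(second, 10, 10)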
Topic 2 question 23 discussion
111222333
:: Highly Voted 3 months, 1 week ago
Correct is A
upvoted 13 times
dfdsfdsfsd
:: 3 months, 1 week ago
Agree. Jobs cannot use a high-concurrency cluster because it does not support Scala.
upvoted 3 times
...
...
Wisenut
:: Highly Voted 3 months, 1 week ago
I too agree with the comment by 111222333. As per the requirement, "A workload for jobs that will run notebooks that use Python, Scala, and SQL", Scala is only supported by Standard clusters.
upvoted 5 times
...
damaldon
:: Most Recent 2 months, 1 week ago
Answer: A
- Data scientists should have their own cluster and it should terminate after 120 minutes - STANDARD
- The cluster for jobs should support Scala - STANDARD
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 1 times
...
Sunnyb
:: 2 months, 2 weeks ago
A is the right answer because a Standard cluster supports Scala.
upvoted 1 times
...
Topic 2 question 24 discussion
alain2
:: Highly Voted 3 months, 1 week ago
B because: "High Concurrency clusters work only for SQL, Python, and R. The performance and security of High
Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala."
upvoted 10 times
...
111222333
:: Highly Voted 3 months, 1 week ago
The correct answer is B. Jobs use Scala, which is not supported in a High Concurrency cluster.
upvoted 6 times
...
damaldon
:: Most Recent 2 months, 1 week ago
Answer: B
- Data scientists should have their own cluster and it should terminate after 120 minutes - STANDARD
- The cluster for jobs should support Scala - STANDARD
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 4 times
...
Sunnyb
:: 2 months, 2 weeks ago
B is the correct answer
Link below:
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 3 times
...
Topic 2 question 25 discussion
dfdsfdsfsd
:: Highly Voted 3 months, 1 week ago
High-concurrency clusters do not support Scala. So the answer is still 'No' but the reasoning is wrong.
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 8 times
Preben
:: 2 months, 2 weeks ago
I agree that High Concurrency does not support Scala. But they specified using a Standard cluster for the jobs, which does support Scala. Why is the answer 'No'?
upvoted 2 times
eng1
:: 2 months, 1 week ago
Because a High Concurrency cluster for each data scientist is not correct; it should be Standard for a single user!
upvoted 2 times
...
...
...
FRAN__CO_HO
:: Most Recent 2 months, 1 week ago
The answer should be NO:
Data scientists: STANDARD, as they need to run Scala
Jobs: STANDARD, as they need to run Scala
Data engineers: High Concurrency clusters, for better resource sharing
upvoted 4 times
...
damaldon
:: 2 months, 1 week ago
Answer: NO
- Data scientists should have their own cluster and it should terminate after 120 minutes - STANDARD
- The cluster for jobs should support Scala - STANDARD
https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 1 times
...
nas28
:: 2 months, 2 weeks ago
The answer is correct: No, but the reason given is wrong. They want the data scientists' clusters to shut down automatically after 120
minutes, so Standard clusters, not High Concurrency.
upvoted 2 times
...
Sunnyb
:: 2 months, 2 weeks ago
Answer is correct - NO
upvoted 1 times
...
Topic 2 question 31 discussion
JohnMasipa
:: 1 day, 1 hour ago
Can someone please explain why the answer is A?
upvoted 1 times
...
Topic 2 question 37 discussion
fbraza
:: 1 day, 2 hours ago
Delta Lake is only available from Scala 2.12 onward, but the JSON data shows a Scala version of 2.11.
upvoted 1 times
...
Topic 3 question 1 discussion
Sunnyb
:: Highly Voted 2 months, 2 weeks ago
Step 1: Create a Log Analytics workspace that has Data Retention set to 120 days.
Step 2: From Azure Portal, add a diagnostic setting.
Step 3: Select the PipelineRuns category.
Step 4: Send the data to a Log Analytics workspace.
upvoted 22 times
...
Amalbenrebai
:: Most Recent 1 week ago
In this case we will not save the diagnostic logs to a storage account; we will send them to Log Analytics:
1: Create a Log Analytics workspace that has Data Retention set to 120 days.
2: From Azure Portal, add a diagnostic setting.
3: Select the PipelineRuns category.
4: Send the data to a Log Analytics workspace.
upvoted 2 times
...
mss1
:: 2 weeks, 5 days ago
If you create a diagnostic setting from the Data Factory, you will notice that you can only set the retention days when you select a
storage account for the PipelineRuns category. So you need a storage account first. You do not have an option in the selection to
create the diagnostic setting from the Data Factory, and thus "Select the PipelineRuns category" is not an option. I agree with the
current selection.
upvoted 2 times
mss1
:: 2 weeks, 3 days ago
To complete my answer: I also agree with Sunnyb. There is more than one valid solution to this question.
upvoted 2 times
...
...
herculian_effort
:: 1 month, 1 week ago
Step 1. From Azure Portal, add a diagnostic setting.
Step 2. Send data to a Log Analytics workspace.
Step 3. Create a Log Analytics workspace that has Data Retention set to 120 days.
Step 4. Select the PipelineRuns category.
The video in the link below walks you through the process step by step; start watching at the 2 min 30 sec mark:
https://docs.microsoft.com/en-us/azure/data-factory/monitor-using-azure-monitor#keeping-azure-data-factory-metrics-and-pipeline-run-data
upvoted 2 times
Armandoo
:: 3 weeks, 1 day ago
This is the correct answer
upvoted 1 times
...
...
mric
:: 2 months ago
According to the linked article, it's: first Storage Account, then Event Hub, and finally Log Analytics.
So I would say:
1 - Create an Azure Storage account with a lifecycle policy
2 - Stream to an Azure Event Hub
3 - Create a Log Analytics workspace that has Data Retention set to 120 days
4 - Send the data to a Log Analytics workspace
Source:
https://docs.microsoft.com/en-us/azure/data-factory/monitor-using-azure-monitor#keeping-azure-data-factory-metrics-and-pipeline-run-data
upvoted 3 times
...
det_wizard
:: 2 months, 4 weeks ago
Take out the storage account; after "add a diagnostic setting" it would be "Select the PipelineRuns category", then "Send the data to a Log Analytics workspace".
upvoted 2 times
...
teofz
:: 3 months, 1 week ago
regarding the storage account, what is it for?!
upvoted 1 times
sagga
:: 3 months, 1 week ago
I don't know if you need to, see this discussion: https://www.examtopics.com/discussions/microsoft/view/49811-
exam-dp-200-topic-3-question-19-discussion/
upvoted 2 times
...
...
Topic 3 question 2 discussion
damaldon
:: Highly Voted 2 months, 1 week ago
Correct!
upvoted 7 times
...
Topic 3 question 3 discussion
Rob77
:: Highly Voted 3 months, 1 week ago
1. Create a user from the external provider for Group1.
2. Create Role1 with SELECT on schema1.
3. Add the user to Role1.
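A minimal T-SQL sketch of those steps (Group1, Role1, and schema1 come from the question; run it against dw1):
-- 1. Map the Azure AD group to a database user
CREATE USER [Group1] FROM EXTERNAL PROVIDER;
-- 2. Create the role and scope SELECT to the schema only
CREATE ROLE Role1;
GRANT SELECT ON SCHEMA::schema1 TO Role1;
-- 3. Add the group's database user to the role
ALTER ROLE Role1 ADD MEMBER [Group1];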
upvoted 24 times
...
patricka95
:: Most Recent 1 month, 1 week ago
The suggested answer is wrong. As others have identified, the correct steps are:
1. Create user <> from external provider
2. Create role <> with SELECT permission on the schema
3. Add the user to the role
upvoted 2 times
...
eng1
:: 2 months, 1 week ago
It should be D-E-A
upvoted 1 times
eng1
:: 2 months ago
Please ignore my previous answer; it should be
D: Create a database user in dw1 that represents Group1 and uses the FROM EXTERNAL PROVIDER clause
A: Create a database role named Role1 and grant Role1 SELECT permissions to schema1
E: Assign Role1 to the Group1 database user
upvoted 4 times
...
...
eng1
:: 2 months, 1 week ago
It should be C-A-E
upvoted 1 times
...
SG1705
:: 2 months, 1 week ago
Is the answer correct ??
upvoted 1 times
Marcello83
:: 1 month, 3 weeks ago
No, in my opinion it is D, A, E. If you give a reader role to the group, the users will be able to query
all the tables, not only the selected schema.
upvoted 4 times
...
...
Topic 3 question 4 discussion
Francesco1985
:: Highly Voted 2 months, 1 week ago
Guys, the answers are correct: https://docs.microsoft.com/en-us/azure/azure-sql/database/transparent-data-encryption-byok-overview
upvoted 8 times
...
terajuana
:: Most Recent 2 months, 2 weeks ago
TDE doesn't use customer-managed keys, therefore the answer is:
1) Always Encrypted
2) Key Vault in 2 regions
upvoted 1 times
Alekx42
:: 2 months, 1 week ago
TDE can be configured with customer-managed keys:
https://docs.microsoft.com/en-us/azure/azure-sql/database/transparent-data-encryption-tde-overview?tabs=azure-portal#customer-managed-transparent-data-encryption---bring-your-own-key
Key Vault is replicated across regions by Microsoft itself. I also double-checked by creating a key vault and there are no
geo-redundancy options. Also see here:
https://docs.microsoft.com/en-us/azure/key-vault/general/disaster-recovery-guidance
upvoted 3 times
...
Alekx42
:: 2 months, 1 week ago
Moreover, Always Encrypted is NOT a TDE option. The question asks to enable TDE.
upvoted 1 times
...
...
Alekx42
:: 2 months, 2 weeks ago
The first answer is correct. You need to enable TDE with customer-managed keys in order to track key usage in Azure Key
Vault. The second answer seems wrong, as pointed out by Rob77: AKV does have replication to 2 additional regions by
default. So I guess it makes more sense to use a Microsoft .NET Framework data provider.
https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/data-providers
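As a small, hedged T-SQL sketch (database name hypothetical; the customer-managed TDE protector, i.e. the Key Vault key, is configured at the server level through the portal, PowerShell, or CLI rather than T-SQL):
-- Enable TDE on the database; encryption uses the server's current TDE protector
ALTER DATABASE [dw1] SET ENCRYPTION ON;
-- Verify: encryption_state = 3 means encrypted
SELECT DB_NAME(database_id) AS database_name, encryption_state
FROM sys.dm_database_encryption_keys;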
upvoted 1 times
terajuana
:: 2 months, 2 weeks ago
TDE doesn't operate with customer keys, but Always Encrypted does.
upvoted 1 times
...
...
Rob77
:: 3 months, 1 week ago
The second answer does not seem to be correct - AKV is already replicated within the region locally (and also to paired
regions). Therefore, if the datacentre fails (or even the whole region), the traffic will be redirected.
https://docs.microsoft.com/en-us/azure/key-vault/general/disaster-recovery-guidance
upvoted 2 times
...
Topic 3 question 5 discussion
damaldon
:: 2 months, 1 week ago
Correct!
upvoted 4 times
...
saty_nl
:: 2 months, 1 week ago
Answer is correct. Dynamic data masking will limit the exposure of sensitive data.
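A small T-SQL sketch of dynamic data masking, assuming a hypothetical dbo.Customers table with an Email column and a hypothetical role name:
-- Non-privileged users see masked values such as aXXX@XXXX.com
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
-- Grant unmasked access only to principals that need it
GRANT UNMASK TO DataStewardRole;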
upvoted 2 times
...
Topic 3 question 6 discussion
Alekx42
:: Highly Voted 2 months, 1 week ago
C is the right answer. Check the discussion here:
https://www.examtopics.com/discussions/microsoft/view/18788-exam-
dp-201-topic-3-question-12-discussion/
upvoted 5 times
Tracy_Anderson
:: 1 month ago
The link below shows how you can infer a column that is data masked. It is also referenced in the DP-201 topic:
https://docs.microsoft.com/nl-nl/sql/relational-databases/security/dynamic-data-masking?view=sql-server-ver15
upvoted 1 times
...
mikerss
:: 2 months ago
The key word is 'infer'. As stated in the documentation, data masking is not intended to protect against malicious
attempts to infer the underlying data. I would therefore choose C.
upvoted 1 times
...
...
patricka95
:: Most Recent 1 month, 1 week ago
Column level security is the correct answer. It is obvious based on "The solution must prevent all the salespeople from
viewing or inferring the credit card information.". If masking was used, they could still view or infer the credit card data.
Also, I interpret "Entries" to imply rows.
upvoted 1 times
...
Himlo24
:: 3 months, 1 week ago
Shouldn't the answer be C? Because the salesperson will get an error when trying to query credit card info.
upvoted 3 times
mvisca
:: 3 months, 1 week ago
Nope, the salesperson generally uses the last 4 digits of the card to validate, during a pickup for example. They don't
need to know all the other numbers, so data masking is correct.
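A sketch of that idea with a partial() mask (table and column names are hypothetical); only the last four digits stay visible:
-- Non-privileged users see values such as XXXX-XXXX-XXXX-1234
ALTER TABLE dbo.Customers
ALTER COLUMN CreditCard ADD MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-XXXX-",4)');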
upvoted 10 times
mbravo
:: 2 months, 2 weeks ago
It is not because there is a requirement that the data should be protected not only from viewing but also
inferring. Masked data can still be inferred using brute force techniques. The only option in this case is C
(Column level encryption).
upvoted 4 times
terajuana
:: 2 months, 2 weeks ago
Nope - the question contains "You need to recommend a solution to provide salespeople with the
ability to view all the entries in Customers". If you implement column-level security, then they
cannot view all items, i.e. SELECT * from the table will give them an error. The only way
to fulfil the requirement is therefore masking.
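For contrast, a column-level security sketch (all names hypothetical): SELECT is granted only on permitted columns, so SELECT * fails for the role:
-- CreditCard is deliberately excluded from the grant
GRANT SELECT ON dbo.Customers (CustomerId, Name, City) TO SalesRole;
-- For members of SalesRole:
--   SELECT CustomerId, Name, City FROM dbo.Customers;  -- succeeds
--   SELECT * FROM dbo.Customers;                       -- fails with a permission error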
upvoted 6 times
captainbee
:: 1 month, 2 weeks ago
Ironically DP-200 has the exact same question and everyone was leaning toward Column
Level Security. I think being able to look at all entries means looking at all ROWS, rather
than columns. They're able to do that still with CLS, just can't see all columns. You can
still infer when there's data masking.
upvoted 1 times
...
escoins
:: 1 month, 4 weeks ago
absolutely right. The key word is "all the entries"
upvoted 1 times
...
...
Preben
:: 2 months, 2 weeks ago
"You need to recommend a solution to provide salespeople with the ability to view all the entries
in Customers."
Credit card data is an entry in the Customers table. How can they view that entry
if you use column level encryption?
upvoted 2 times
...
...
...
...
Topic 4 question 1 discussion
Preben
:: Highly Voted 2 months, 2 weeks ago
Correct.
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
See "Embarrassingly parallel jobs", steps 3 and 4.
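A minimal sketch of such an embarrassingly parallel query (input/output names are hypothetical; with compatibility level 1.2 the input PARTITION BY is largely implicit):
-- Input and output share the same partition key and partition count,
-- so every partition flows through the job without repartitioning.
SELECT *
INTO EventHubOutput
FROM EventHubInput PARTITION BY PartitionId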
upvoted 5 times
...
nichag
:: Most Recent 4 weeks ago
Shouldn't the number of partitions only be 8, since the question only asks about the output?
upvoted 1 times
...
rumosgf
:: 2 months, 3 weeks ago
Why 16? Don't understand...
upvoted 2 times
mbravo
:: 2 months, 2 weeks ago
Embarrassingly parallel jobs
upvoted 6 times
captainbee
:: 2 months ago
It's not THAT embarrassing
upvoted 2 times
...
...
...
Topic 4 question 2 discussion
lara_mia1
:: Highly Voted 2 months, 3 weeks ago
1. Hash distributed on ProductKey, because the table is > 2 GB and ProductKey is extensively used in joins.
2. Hash distributed on RegionKey, because "The table size on disk is more than 2 GB." and you have to choose a distribution
column which "Is not used in WHERE clauses. This could narrow the query to not run on all the distributions."
Source: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choosing-a-distribution-column
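A hedged sketch of the resulting DDL for the Sales table (columns beyond the keys are placeholders):
CREATE TABLE dbo.Sales
(
    ProductKey  INT            NOT NULL,
    RegionKey   INT            NOT NULL,
    SalesAmount DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),  -- large fact table, joined on ProductKey
    CLUSTERED COLUMNSTORE INDEX
);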
upvoted 18 times
niceguy0371
:: 1 week ago
Disagree on nr. 1 because of the reason you give for nr. 2 (choose a distribution column that is not used in WHERE
clauses; a join is also a WHERE clause).
upvoted 1 times
...
vblessings
:: 3 weeks, 6 days ago
i agree
upvoted 1 times
...
Marcello83
:: 1 month, 3 weeks ago
I agree with lara_mia1
upvoted 1 times
...
...
Rob77
:: Highly Voted 3 months, 1 week ago
Both hash, as both are > 2 GB. In the second table, RegionKey cannot be used with round-robin distribution, as round-robin
does not take a distribution key...
upvoted 15 times
...
DarioEtna
:: Most Recent 1 week, 6 days ago
As for me, I guess this is the right choice:
1. Hash distributed on RegionKey
2. Hash distributed on RegionKey
because "When two large fact tables have frequent joins, query performance improves when you distribute both tables on
one of the join columns" [Microsoft documentation].
If we use ProductKey for one and RegionKey for the other, maybe data movement would increase... or not?
upvoted 1 times
DarioEtna
:: 1 week, 6 days ago
But we cannot use ProductKey in both because in Invoice table it is used in WHERE condition
upvoted 1 times
...
...
Amalbenrebai
:: 4 weeks, 1 day ago
Regarding the Invoices table, we can use round-robin distribution because there is no obvious joining key in the table.
upvoted 1 times
...
zarga
:: 1 month, 2 weeks ago
1. Hash on ProductKey
2. Hash on RegionKey (used in GROUP BY and has 65 unique values)
upvoted 2 times
...
BrennaFrenna
:: 2 months, 2 weeks ago
The Sales table makes sense with hash distribution on ProductKey, and since there is no obvious joining key for
Invoices, you should use round-robin distribution. If it were a smaller table, you should use
replicated.
upvoted 3 times
...
tubis
:: 2 months, 2 weeks ago
When it says 75% of records relate to one of the 40 regions, if we partition Sales by region, wouldn't that improve
read performance drastically compared to ProductKey?
upvoted 1 times
patricka95
:: 1 month, 1 week ago
No, if 75% relate to one region and we hash on region, that means that those will all be on one node and there
will be skew. Correct answers are Hash, Product, Hash, Region.
upvoted 1 times
...
Preben
:: 2 months, 2 weeks ago
That's 75 % of 61 % of the regions that will be done effectively. That's only efficient for 45 % of the queries. Not
a whole lot.
upvoted 2 times
...
...
bc5468521
:: 2 months, 4 weeks ago
I AGREE WITH BOTH HASH WITH PRODUCT KEY
upvoted 5 times
...
Topic 4 question 3 discussion
SG1705
:: Highly Voted 2 months, 1 week ago
Why ??
upvoted 6 times
okechi
:: 2 months ago
Why? Because when you add the WHERE clause to your T-SQL query, it allows the query optimizer to access only the
relevant partitions to satisfy the filter criteria of the query - which is what partition elimination
is all about.
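A generic T-SQL illustration of partition elimination in a dedicated SQL pool (all names and boundary values are hypothetical):
CREATE TABLE dbo.FactSales
(
    OrderDateKey INT            NOT NULL,
    ProductKey   INT            NOT NULL,
    SalesAmount  DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20210101, 20210201, 20210301))
);
-- Filtering on the partitioning column lets the optimizer read just one partition.
SELECT SUM(SalesAmount)
FROM dbo.FactSales
WHERE OrderDateKey >= 20210201 AND OrderDateKey < 20210301;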
upvoted 5 times
...
IgorLacik
:: 2 months ago
Maybe this? https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
I think I
read somewhere in the docs that you cannot apply complex queries on partition filtering, cannot find it though
(not much help I guess, but hopefully better than nothing)
upvoted 1 times
...
...
elimey
:: Most Recent 1 month ago
correct
upvoted 1 times
...
Topic 4 question 4 discussion
rjile
:: Highly Voted 1 month, 2 weeks ago
correct B
upvoted 5 times
...
Avinash75
:: Most Recent 1 month, 2 weeks ago
Incoming queries use the primary key SaleKey column to retrieve data as displayed in the following table. Doesn't this
mean SaleKey will be used in the WHERE clause, which makes SaleKey not suitable for hash distribution?
Choosing a distribution column that helps minimize data movement is one of the most important strategies for optimizing
performance of your dedicated SQL pool: it should not be used in WHERE clauses, as this could narrow the query to not run on all
the distributions.
With no obvious choice, I feel it should be round robin with a clustered columnstore index, i.e. D.
upvoted 1 times
...
erssiws
:: 2 months, 1 week ago
I understand that hash distribution is mainly for improving joins and GROUP BYs to reduce data shuffling. In this case,
no join or GROUP BY is mentioned. I think round robin would be a better option.
upvoted 1 times
...
Yatoom
:: 2 months, 2 weeks ago
If the answer is hash distributed, then what would be the key? If there is no obvious joining key, round-robin should be
chosen (https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-
distribute#round-robin-distributed)
upvoted 1 times
Preben
:: 2 months, 2 weeks ago
It says it uses the SaleKey. Round robin is generally not effective for tables at this scale. The 10 TB was a very
important hint here.
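Under that reading, a hedged sketch of a hash-distributed, clustered columnstore table (columns other than SaleKey are placeholders):
CREATE TABLE dbo.FactSale
(
    SaleKey     BIGINT         NOT NULL,
    SalesAmount DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SaleKey),   -- spreads the ~10 TB fact table evenly across distributions
    CLUSTERED COLUMNSTORE INDEX     -- strong compression and scan performance at this size
);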
upvoted 9 times
...
...
Topic 4 question 5 discussion
Marcello83
:: 1 month, 3 weeks ago
Why not a non-clustered columnstore index? I'm not clear on the different use cases of clustered and non-clustered
columnstore indexes...
upvoted 1 times
lsdudi
:: 1 month, 1 week ago
A non-clustered columnstore index isn't supported in a dedicated SQL pool (the options are clustered columnstore index, clustered index, or heap).
upvoted 3 times
...
...
damaldon
:: 2 months, 1 week ago
correct!
upvoted 3 times
...
Miris
:: 2 months, 2 weeks ago
correct
upvoted 3 times
...
Topic 4 question 6 discussion
dragos_dragos62000
:: 1 month, 3 weeks ago
Correct
upvoted 2 times
...
Topic 4 question 7 discussion
erssiws
:: 2 months, 1 week ago
Activity logs show only activities, e.g., triggering the pipeline, stopping the pipeline, ...
Resource health checks show only the health of the resource.
The Monitor app does contain the pipeline run failure information, but it keeps the data only for 45 days.
upvoted 3 times
...
damaldon
:: 2 months, 1 week ago
Correct!
upvoted 2 times
...
Topic 4 question 8 discussion
MinionVII
:: 1 month, 2 weeks ago
Correct.
"Backlogged Input Events Number of input events that are backlogged. A non-zero value for this metric implies
that your job isn't able to keep up with the number of incoming events. If this value is slowly increasing or consistently
non-zero, you should scale out your job."
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-
monitoring
upvoted 2 times
...
