
DWH

https://www.stitchdata.com/resources/analytic-vs-transactional-database/

Your data warehouse will become the single source of truth for your organization. All organizational
data will be loaded into this warehouse and it will power all reporting and analysis…

Greenplum https://scalegrid.io/blog/what-is-greenplum-database/

Vertica (free up to 1 TB)

Amazon released Redshift, the first cloud analytic database, in 2012. It can be deployed for as little as
$100 or so per month and is provisioned completely online, which allows companies to avoid capital
expenditures and the complex process of installing, configuring, and maintaining their own hardware.

Redshift is the dominant player in cloud analytic databases, but Snowflake, Google BigQuery, and
Microsoft Azure Synapse are prominent competitors.

https://cloud.google.com/bigquery/pricing
https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/
What's under the hood of an analytic database

How exactly do analytic databases deliver this 100-1,000x performance improvement for
analytical query processing? Without getting too technical, there are a few key traits that allow
analytic databases to perform their stunning feats of speed.

- Columnar data storage (organized within star or snowflake schemas; see the sketch after this list)
- Efficient data compression
- Distributed workloads
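
To make the first trait concrete, here is a minimal star-schema sketch in standard SQL. All table and column names (fact_sales, dim_product, and so on) are illustrative assumptions, not taken from any particular product:

-- Hypothetical star schema: one wide fact table surrounded by
-- small dimension tables. In a columnar engine, an aggregate
-- query reads only the columns it touches, not whole rows.
CREATE TABLE dim_customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    region        VARCHAR(50)
);

CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

CREATE TABLE fact_sales (
    sale_id     BIGINT,
    customer_id INTEGER REFERENCES dim_customer (customer_id),
    product_id  INTEGER REFERENCES dim_product (product_id),
    sale_date   DATE,
    amount      NUMERIC(12, 2)
);

-- A typical analytic query: only three columns of fact_sales are
-- scanned, which is exactly where columnar storage pays off.
SELECT d.category, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_product d ON d.product_id = f.product_id
WHERE f.sale_date >= DATE '2020-01-01'
GROUP BY d.category;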

What about Hadoop?

We've talked extensively about the benefits of analytic databases for analyzing large datasets, but
there's an elephant in the room. How does Hadoop fit into this picture?

Hadoop is a framework that allows for the distributed processing of large data sets across
clusters of computers. It has two primary components: HDFS, a distributed file system, and
MapReduce, a system for parallel processing of large datasets in HDFS.

Hadoop is a critical component of the data stack for many organizations, but it's not an
appropriate choice of technology for your organization's data warehouse. Its use cases tend to be
more heavily statistical and algorithmic, and more focused on data science than business
analytics. Here are a few examples of where Hadoop might be a better choice than an analytic
database:

- Recommendations. Recommendation algorithms, from movies to music to ecommerce, are commonly powered by algorithms running on top of Hadoop.
- Classifications. From facial recognition to recognizing what song is playing on the radio (a la Shazam), classification algorithms are a popular usage of Hadoop.
- Search. The precursor of Hadoop was actually built at Google to power Google's web search algorithm. Hadoop today is still used in many search applications.

These are just a few examples, and they're all incredibly important applications of data analysis.
What's most important in this list of examples, though, are the things that aren't included.
Hadoop is not commonly used for the types of tasks we're primarily interested in—analysis and visualization of data about businesses and their customers. For those tasks, analytic databases are not only more usable because of their SQL-based interface; they also deliver far higher performance.

If processing data in Hadoop is a priority for your organization, it will be important that your
data pipeline output data to both a data warehouse and to HDFS, as depicted below. If you're just
getting off the ground, we'd recommend steering clear of Hadoop until you find a clear and
compelling need. Until then, Hadoop will just be a distraction.
Data lake

https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Data lakes allow you to store relational data from operational databases and line-of-business applications, as well as non-relational data from mobile apps, IoT devices, and social media. They also give you the ability to understand what data is in the lake through crawling, cataloging, and indexing of data. Finally, data must be secured to ensure your data assets are protected.

Data Lakes allow various roles in your organization like data scientists, data developers, and business
analysts to access data with their choice of analytic tools and frameworks. This includes open source
frameworks such as Apache Hadoop, Presto, and Apache Spark, and commercial offerings from data
warehouse and business intelligence vendors. Data Lakes allow you to run analytics without the need to
move your data to a separate analytics system.
Postgres
https://www.postgresql.org/about/news/postgresql-anonymizer-10-privacy-by-design-for-postgres-2452/

Though Postgres is a great choice, keep in mind that a cloud-based warehouse like Snowflake will (in the
long run) be easier to manage and maintain.

Postgres – a row-oriented relational database. It speeds up lookups with indexes but has no column-oriented storage for analytical calculations.

When is a Postgres data warehouse a bad idea?

Building a data warehouse in Postgres stops being a great idea at scale. If you expect to have tens or hundreds of millions of objects / events / users to analyze pretty soon, I would invest in a more scalable solution, like BigQuery or Snowflake.
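
As a rough illustration of the row-store limitation (the events table and its columns are hypothetical), here is Postgres-flavored SQL: a B-tree index makes single-row lookups fast, but a warehouse-style aggregation still has to read every row.

-- Hypothetical events table in Postgres.
CREATE TABLE events (
    event_id   BIGSERIAL PRIMARY KEY,
    user_id    INTEGER,
    event_type TEXT,
    created_at TIMESTAMPTZ
);

CREATE INDEX idx_events_user ON events (user_id);

-- OLTP-style lookup: the index finds a handful of rows quickly.
SELECT * FROM events WHERE user_id = 42;

-- OLAP-style aggregation: the index does not help; Postgres reads
-- entire rows even though only one column is needed, which is where
-- columnar engines pull ahead at tens of millions of rows.
SELECT event_type, COUNT(*) FROM events GROUP BY event_type;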

Comparisons
- OLTP: using a database to run your business
- OLAP: using a database to understand your business

With OLTP, you run things like ‘record a sales transaction: one Honda Civic by Jane Doe in the London
branch on the 1st of January, 2020’.

With OLAP, your queries can become incredibly complex: 'give me the total sales of green Honda Civics in the UK for the past 6 months', 'tell me how many cars Jane Doe sold last month', or 'tell me how well Honda cars did this quarter compared to the previous quarter'. The queries in the latter category aggregate data across many more elements than the queries in the former.
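
To make the contrast concrete, here is a minimal SQL sketch of both query styles for the dealership example; the sales table and its columns are assumptions for illustration, and the date arithmetic is Postgres-flavored.

-- Hypothetical dealership table:
-- sales (model, color, salesperson, branch, country, sale_date, price)

-- OLTP: record one sales transaction (touches a single row).
INSERT INTO sales (model, color, salesperson, branch, country, sale_date, price)
VALUES ('Honda Civic', 'green', 'Jane Doe', 'London', 'UK', DATE '2020-01-01', 19500.00);

-- OLAP: total sales of green Honda Civics in the UK for the past 6 months
-- (aggregates across many rows).
SELECT SUM(price) AS total_sales
FROM sales
WHERE model = 'Honda Civic'
  AND color = 'green'
  AND country = 'UK'
  AND sale_date >= CURRENT_DATE - INTERVAL '6 months';

-- OLAP: how many cars did Jane Doe sell last month?
SELECT COUNT(*) AS cars_sold
FROM sales
WHERE salesperson = 'Jane Doe'
  AND sale_date >= date_trunc('month', CURRENT_DATE) - INTERVAL '1 month'
  AND sale_date <  date_trunc('month', CURRENT_DATE);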

In our example of a car dealership, you might be able to get away with running both OLTP and OLAP
query-types on a normal relational database. But if you deal with huge amounts of data — if you're
querying a global database of car sales over the past decade, for instance — it becomes important to
structure your data for analysis separately from the business application. Not doing so would result in
severe performance problems.

https://www.holistics.io/blog/the-rise-and-fall-of-the-olap-cube/

https://habr.com/ru/post/567078/

https://habr.com/ru/company/lesta_studio/blog/540686/

Summary

To wrap up: BigQuery is a perfectly respectable cloud data warehouse. For small amounts of data you can stay within the free tier, but if your data runs to terabytes and you work with it frequently, the costs will be high. At first glance, 1 TB of queries per month seems like a lot, but the catch is that BigQuery counts all the data processed while a query runs. If you work with ordinary tables and try to trim the data by adding WHERE or LIMIT, we are sorry to say that BigQuery will consume the same amount of traffic as a plain SELECT FROM. However, if you design your database structure sensibly, you can save enormously on your BigQuery budget.
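
A quick illustration of that pitfall (the table name is hypothetical): under on-demand pricing, neither LIMIT nor a filter on an ordinary column reduces the bytes BigQuery bills for on a non-partitioned, non-clustered table.

-- Both queries are billed for scanning every byte of the selected
-- columns across the whole (non-partitioned) table:
SELECT * FROM mydataset.events LIMIT 10;
SELECT * FROM mydataset.events WHERE user_id = 42;

-- Selecting only the columns you need is what actually cuts cost:
SELECT event_type FROM mydataset.events WHERE user_id = 42;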

Our recommendations:

- Avoid SELECT *. Always query only the fields you actually need.
- Avoid fields with the record and array (repeated record) data types in your tables. Queries that touch these columns consume more traffic, because BigQuery has to process all of the column's data. Thanks to @ekoblov for the helpful comment.
- Try to create partitioned tables (see the sketch after this list). If a table is partitioned sensibly, queries that filter on the partitioning field can consume far less traffic, because BigQuery processes only the table partition named in the query's filter.
- Try to add clustering to your partitioned tables. Clustering sorts the data in your tables by the columns you specify, which also cuts traffic: when a query filters on clustered columns, BigQuery processes only the range of data that contains the values from your filter.
- Always use the BigQuery Cloud Console to count processed data. When you type a query into the Cloud Console, the query validator checks its syntax and provides an estimate of the number of bytes read. You can use this estimate to work out the query's cost in the pricing calculator.
- Use the calculator to estimate the cost of storing data and running queries: https://cloud.google.com/products/calculator/. To estimate query cost in the calculator, enter the number of bytes processed by the query as B, KB, MB, GB, or TB. If a query processes less than 1 TB, the estimate is $0, because BigQuery provides 1 TB of on-demand query processing per month for free. You can do the same to estimate data storage costs.
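
As a minimal sketch of the partitioning and clustering recommendations above, here is BigQuery DDL; the dataset, table, and column names are assumptions for illustration.

-- Hypothetical partitioned and clustered table in BigQuery.
-- Partitioning by day on event_date means a query that filters on
-- event_date scans only the matching partitions; clustering by
-- user_id further narrows the scanned range within each partition.
CREATE TABLE mydataset.events
(
    event_date DATE,
    user_id    INT64,
    event_type STRING,
    payload    STRING
)
PARTITION BY event_date
CLUSTER BY user_id;

-- Scans only the partitions for the given week, and within them
-- only the blocks that can contain user_id = 42.
SELECT event_type, COUNT(*) AS events
FROM mydataset.events
WHERE event_date BETWEEN '2020-01-01' AND '2020-01-07'
  AND user_id = 42
GROUP BY event_type;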
