Snowpro Advanced Architect: Performance and Tuning


  • Clustering
    • Outline clustering best practices and appropriate usage
    • Evaluate overlap and depth
    • Data types
    • Cardinality
    • Policies
    • Auto-clustering
  • Materialized Views
    • Outline the benefits of Materialized Views, their corresponding properties, downsides, and limitations.
    • Identify use-cases when Materialized views should be used.
    • Explain the use-case for Materialized views with External Tables.
  • Query Profiles & Tuning
    • Retrieve and interpret query profiles, identify bottlenecks, and outline recommendations.
    • Given a scenario, determine methods to cut down query processing time.
    • Outline caching.
  • Warehouses
    • Identify performance guidelines and best practices for using warehouses in Snowflake to process queries.
    • Using Warehouses to optimize query performance.
  • ACCOUNT_USAGE Views

This is the part of the exam where things start to get quite theoretical, a word known to daunt and intimidate even the most intrepid student. I would like to say, without a hint of irony, that the theoretical side of Snowflake is perhaps the most exciting. Concepts like clustering and warehousing illuminate the magic going on behind the scenes – there are many reasons why Snowflake is steaming ahead to industry-leader status (by the time you’re reading this blog, it might already be there), and in this section a lot of those reasons will become much clearer.

Clustering

To truly ‘get’ clustering in Snowflake, let’s first look at how it’s not done. In many alternative databases, clustering is done via a process called ‘static partitioning’.

Static partitioning means that each partition is a unit of management, manipulated independently using specialised DDL and syntax. In Redshift, for example, one uses a distribution key (DISTKEY) to decide how data is distributed across nodes. This type of partitioning puts a limit on the number of partitions that is hard to increase, and it can lead to data skew.

In contrast, Snowflake’s micro-partitioning avoids these drawbacks whilst maintaining the advantages (such as increased performance and scaling).

All data ingested into Snowflake is automatically divided into micro-partitions: groups of 50 – 500 MB of uncompressed data, organised in columnar format. For each micro-partition, Snowflake collects metadata including:

  • The range of values for each of the micro-partition’s columns
  • The number of distinct values
  • Additional properties used for speeding up queries

The benefits of micro-partitions:

  • They are derived automatically – no need to define them up front.
  • They are small, and thus allow efficient DML and fine-grained pruning for faster queries.
  • They can overlap in their ranges of values, preventing skew.
  • Columns are stored independently of one another, meaning only the columns referenced by a query are actually scanned.
  • Columns are compressed individually.

Impacts of micro-partitioning:

DML

All DML operations (e.g. DELETE, UPDATE, MERGE) take advantage of micro-partition metadata to simplify table maintenance.

E.g. An operation like deleting all rows is a metadata-only operation.

Query Pruning

Snowflake uses micro-partition metadata to prune queries: if only 10% of the data needs to be scanned, only 10% will be.

E.g. for a query like SELECT * FROM sales WHERE year = 2020, only those micro-partitions whose range of year values includes 2020 will be scanned.

Image Credit: Snowflake

One illustration of how cool micro-partitioning is comes in the example of Time Travel. The intricacies of Time Travel will be explored in the final section of this blog series; suffice it to say for now that Snowflake keeps a historical record of your tables, meaning you can explore and restore previous versions after you’ve made changes to your data. Now, if Snowflake kept a whole new copy of your table for every single change you ever made, you’d soon see your bill rocket – even with a small number of partitions this could still be expensive. With micro-partitioning, only copies of the specific micro-partitions that changed are created, representing a very small subset of your data.

Say you have 20 micro-partitions for a table (call the last one #20.a), and you perform DML on the data in that micro-partition a few times, creating micro-partition #20.b on Day 2, #20.c on Day 3 and #20.d on Day 4. If you use Time Travel to query the data as it was on Day 1, you will get micro-partitions #1 – #19, plus #20.a. If you query Day 3, you’ll get #1 – #19, plus #20.c, and so on…
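The Time Travel syntax itself is covered later in the series, but as a quick, hedged sketch (example_table and the timestamps are purely illustrative), querying a previous version looks like this:

SELECT *
FROM example_table
  AT(TIMESTAMP => '2020-01-01 00:00:00'::TIMESTAMP_LTZ);  -- the table as it was at that moment

SELECT *
FROM example_table
  AT(OFFSET => -60*60*24);  -- the table as it was 24 hours ago

Behind the scenes, Snowflake simply resolves each of these to the set of micro-partition versions that were current at the requested point in time.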

Clustering Best Practices

Thus far, we’ve spoken about automatic clustering rather than clustering using a user-defined cluster key. Unless a user-defined cluster key is specified for a table (and this need not happen during table creation), Snowflake will automatically decide how to cluster it. As DML occurs on large tables (measured by the amount of data in a table, not the number of rows), a table might become inefficiently clustered. One way of solving this is manually sorting rows on key table columns and then re-inserting them into the table; however, this can be very expensive.

Instead, Snowflake recommends establishing a clustering key. This can be done on any table or materialized view. A clustering key is a subset of columns (or expressions on columns) in a table, used to co-locate related data in the same micro-partitions. These can be defined at creation or afterwards. Setting a cluster key is very simple – you just set the key in the CLUSTER BY clause of either a CREATE TABLE or an ALTER TABLE statement – however, deciding which key(s) to actually use deserves a little thought and consideration.
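As a minimal sketch (the table and column names are hypothetical), setting and changing a clustering key looks like this:

-- Define a clustering key at creation time
CREATE TABLE sales (
  sale_date DATE,
  region    VARCHAR,
  amount    NUMBER
)
CLUSTER BY (sale_date, region);

-- Add or change the key on an existing table
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Remove it again
ALTER TABLE sales DROP CLUSTERING KEY;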

When choosing clustering keys, Snowflake recommends the following prioritisation:

  1. Columns most actively used in selective filters
  2. Columns most actively used in join predicates

In terms of cardinality (remember: low cardinality means a low level of uniqueness (many duplicates) and high cardinality means a high level of uniqueness), your clustering key should have cardinality that is:

  • high enough to enable effective pruning
  • low enough to allow Snowflake to effectively group rows in the same micro-partitions

…essentially meaning that you want mid-level cardinality. A common trick, sketched below, is to cluster on an expression that reduces a column’s cardinality rather than on the raw column itself.
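One hedged example of hitting that sweet spot: assuming a hypothetical events table with a near-unique event_ts timestamp column, you would cluster on a coarser expression of the column rather than on the raw value:

-- event_ts is too high-cardinality to cluster on directly,
-- so cluster on its date instead
ALTER TABLE events CLUSTER BY (TO_DATE(event_ts));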

Clustering Depth and Re-Clustering

As mentioned in the previous section, when large tables endure a large amount of DML, clustering quality can degrade and queries become inefficient. This is where ‘clustering depth’ comes in. Clustering depth is the average depth (1 or greater) of the overlapping micro-partitions in a table – the lower this number, the better clustered the table. Ultimately, the best measure of how well-clustered a table is will be query performance, but clustering depth can help diagnose slow performance.

Let’s look at the diagram below:

Image Source: Snowflake

This is how Snowflake’s documentation explains this scenario:

As this diagram illustrates:

    1. At the beginning, the range of values in all the micro-partitions overlap.

    2. As the number of overlapping micro-partitions decreases, the overlap depth decreases.

    3. When there is no overlap in the range of values across all micro-partitions, the micro-partitions are considered to be in a constant state (i.e. they cannot be improved by clustering).

Source

To see the clustering depth of a table, you can use the SYSTEM$CLUSTERING_DEPTH() function, passing the table name in quotes, e.g. SYSTEM$CLUSTERING_DEPTH('example_table'). If you want more detail, you can use SYSTEM$CLUSTERING_INFORMATION(), which returns a JSON object containing name/value pairs documented here.
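For example (example_table and the column list are illustrative):

-- Average overlap depth for the whole table
SELECT SYSTEM$CLUSTERING_DEPTH('example_table');

-- Depth calculated for specific columns
SELECT SYSTEM$CLUSTERING_DEPTH('example_table', '(sale_date, region)');

-- Full JSON output: depth histogram, clustering key, notes, etc.
SELECT SYSTEM$CLUSTERING_INFORMATION('example_table');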

Reclustering is an automatic process that needs no user overhead: affected records are deleted and then re-inserted, grouped according to the clustering key. This process consumes credits, and it also increases storage costs – re-inserting the data means Snowflake produces new micro-partitions for the table, while the old micro-partitions are retained for the duration of the Time Travel retention and Fail-safe periods.
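If you need to control that spend, Automatic Clustering can be paused and resumed per table (sales is again a hypothetical name):

ALTER TABLE sales SUSPEND RECLUSTER;  -- stop background re-clustering for this table
ALTER TABLE sales RESUME RECLUSTER;   -- switch it back on

Credits consumed by the service are reported in the AUTOMATIC_CLUSTERING_HISTORY view in ACCOUNT_USAGE.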

Materialized Views

Materialized views are a kind of ready-made view. While normal views are computed each time you query them, a materialized view comes in a bit of a ‘Blue Peter’ style. For those not familiar with Blue Peter, it’s a children’s show that a lot of Brits grew up with. Normally in an episode there’s an arts and crafts section where the presenter will guide children through the process of making something. Often this might involve glue or paint drying, so instead of hours of excruciating boredom as the audience watched paint dry, the presenter would say ‘here’s one I made earlier’ and reveal a dry and ready version of their art piece so the audience could see the finished product.

What Snowflake have done with the introduction of materialized views is allow you to produce your own here’s-one-I-made-earlier view. Instead of your view-users having to wait hours while your super-complex view, with all its aggregations, projections and selection operations, computes from scratch, you can present them with something ready-made that they can just leisurely query.
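A minimal ‘here’s one I made earlier’ sketch, using hypothetical sales data (note that materialized views require Enterprise Edition):

CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date,
       region,
       SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date, region;

-- Users then query it like any other view, without waiting for the aggregation
SELECT * FROM daily_sales WHERE sale_date = '2020-01-01';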

When Should You Use a Materialized View?

  • Query results return a smaller number of rows/columns than the base table

  • Query results require significant processing, e.g.:
    • Analysing semi-structured data
    • Aggregates that take a while to calculate

  • Query is on an external table (i.e. over data in an external stage), which will have slower performance than data stored natively in Snowflake

  • Base table doesn’t change much

Advantages

  • Can improve the performance of queries that use the same subquery results repeatedly

  • Automatically and transparently maintained by Snowflake – a background service updates the view when changes are made to the base table, which is more efficient and less error-prone than maintaining it manually

  • Data accessed through a materialized view is always current, no matter how much DML has happened to the base table – if the view isn’t yet up-to-date, Snowflake will retrieve any missing data from the base table

Use Cases

Snowflake provide the following use cases for materialized views:

1:

Suppose that, every day, you run a query Q that includes a subquery S. If S is resource-intensive and queries data that changes only once a week, then you could improve performance of the outer query Q by running S and caching the results in a table named CT:

    - You would update the table only once a week.
    - The rest of the time, when you run Q, it would reference the subquery results of S that were stored in the table. This would work well as long as the results of subquery S change predictably (e.g. at the same time every week).

However, if the results of S change unpredictably then caching the results in a table is risky; sometimes your main query Q will return out-of-date results if the results of subquery S are out of date (and thus the results of cached table CT are out of date).

Ideally, you’d like a special type of cache for results that change rarely, but for which the timing of the change is unpredictable. Looking at it another way, you’d like to force your subquery S to be re-run (and your cache table CT to be updated) when necessary.

A materialized view implements an approximation of the best of both worlds. You define a query for your materialized view, and the results of the query are cached (as though they were stored in an internal table), but Snowflake updates the cache when the table that the materialized view is defined on is updated. Thus, your subquery results are readily available for fast performance.

Source

2:

As a less abstract example, suppose that you run a small branch of a large pharmacy, and your branch stocks hundreds of medications out of a total of tens of thousands of FDA-approved medications.

Suppose also that you have a complete list of all medications that each of your customers takes, and that almost all of those customers order only medicines that are in stock (i.e. special orders are rare).

In this scenario, you could create a materialized view that lists only the interactions among medicines that you keep in stock. When a customer orders a medicine that she has never used before, if both that medicine and all of the other medicines that she takes are covered by your materialized view, then you don’t need to check the entire FDA database for drug interactions; you can just check the materialized view, so the search is faster.

Source

Materialized Views vs Regular Views

Use a materialized view when all of the following are true:

  • Query results don’t change often
  • View results are used often
  • Query consumes a lot of resources

Use a regular view when any of the following are true:

  • Results of the view often change
  • Results are not often used
  • Query is not resource-intensive

Limitations

  • Can only query a single table (no joins)
  • Cannot query:
    • Another view
    • A user-defined table function
  • Cannot include:
    • UDFs
    • Window functions
    • HAVING, ORDER BY or LIMIT clauses
    • GROUP BY keys that are not in the SELECT list (or GROUPING SETS, ROLLUP and CUBE)
    • Many aggregate functions (only a limited set is supported)

Query Profile & Tuning

I don’t find this part of the exam very interesting, and I doubt you will either. Nevertheless, query profiling can be incredibly useful for optimising your queries and – most importantly – saving you lots of money, acting as a constructive friend that identifies inefficiencies and bottlenecks. The Query Profile is accessed via either History or Worksheets by clicking on a query ID.

Image Credit: Snowflake

The Query Profile consists of the following main elements:

  • Steps – if the query was processed in multiple steps, you can toggle between them

  • Operator tree – a graphical representation of all the operator nodes for the selected step, including relationships with other operator nodes. Operators are the functional building blocks of a query. They are responsible for different aspects of data management and processing, including data access, transformations and updates. Each operator node in the tree includes some basic attributes:
    • <Type> [#]: Operator type and ID number. ID can be used to uniquely identify an operator within a query profile
    • Percentage: Fraction of time that this operator consumed within the query step. This information is also reflected in the orange bar at the bottom of the operator node, allowing for easy visual identification of performance-critical operators.
    • Label: Operator-specific additional information.

  • Node list – collapsible list of operator nodes by execution time

  • Overview – when no operator is selected, you will see an overview. The overview/detail pane is divided into 3 sections:
    • Execution Time: Provides information about which processing tasks consumed query time (described in Query/Operator Details below). Additionally, for step-level information, it shows the state of the given step, and its execution time.
    • Statistics: Provides detailed information about various statistics (described in Query/Operator Details below).
    • Attributes: Provides component-specific information (described in Operator Types below).

In order to help you analyse query performance, the detail panel provides two classes of profiling information:

  1. Execution Time, in categories
    • Processing – time spent on data processing by the CPU
    • Local Disk IO – time when the processing was blocked by local disk access
    • Remote Disk IO – time when the processing was blocked by remote disk access
    • Network Communication – time when processing was waiting for the network data transfer
    • Initialization – time spent setting up the query processing

  2. Detailed Statistics
    • IO – information about the input-output operations performed during the query
    • DML – statistics for DML queries
    • Pruning – information on the effects of table pruning
    • Spilling – info about disk usage for operations where intermediate results do not fit in memory
    • Network – network communication
    • External Functions – info about calls to external functions

Snowflake’s documentation lists some common problems that the Query Profile can help identify:

‘Exploding’ Joins

When a join condition is not specified, or records from one table match multiple records in another, this can cause an ‘exploding’ join – the number of output rows balloons far beyond the inputs.

UNION without ALL

UNION ALL simply concatenates its inputs, whereas UNION also eliminates duplicates and thus takes more time. If you know there are no duplicates to remove, use UNION ALL.

Query Too Large to Fit in Memory

Sometimes an operation is too large for the amount of memory available in the compute resources used to execute it, which means the query processing engine starts spilling data to disk. Using a larger warehouse (with more memory and local disk) can prevent this.

Inefficient Pruning

For Snowflake’s pruning to be leveraged, the data storage order needs to be correlated with the query filter attributes; if it isn’t, queries can perform very slowly.
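As a hedged illustration of the first two problems (orders, customers and the yearly tables are hypothetical):

-- 'Exploding' join: the join condition was forgotten,
-- so every order matches every customer (a cartesian product)
SELECT o.order_id, c.customer_name
FROM orders o, customers c;

-- Fixed: state the join predicate explicitly
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;

-- UNION vs UNION ALL: if duplicates are impossible (or acceptable),
-- UNION ALL skips the costly de-duplication step
SELECT order_id FROM orders_2019
UNION ALL
SELECT order_id FROM orders_2020;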

Caching

Whenever you execute a query in Snowflake, the results of that query are persisted, meaning that they are cached. If you re-execute the same query, Snowflake can return the cached result without incurring new processing charges. Likewise, you can also use those query results for post-processing (i.e. query the query).

Snowflake offers an extensive list of conditions that must all be met for persisted query results to be reused (note that meeting all these conditions isn’t a guarantee):

  • The new query syntactically matches the previously-executed query.

  • The query does not include functions that must be evaluated at execution time (e.g. CURRENT_TIMESTAMP() and UUID_STRING()). Note that the CURRENT_DATE() function is an exception to this rule; even though CURRENT_DATE() is evaluated at execution time, queries that use CURRENT_DATE() can still use the query reuse feature.

  • The query does not include user-defined functions (UDFs) or external functions.

  • The table data contributing to the query result has not changed.

  • The persisted result for the previous query is still available.

  • The role accessing the cached results has the required privileges.

  • If the query was a SELECT query, the role executing the query must have the necessary access privileges for all the tables used in the cached query.

  • If the query was a SHOW query, the role executing the query must match the role that generated the cached results.

  • Any configuration options that affect how the result was produced have not changed.

  • The table’s micro-partitions have not changed (e.g. been reclustered or consolidated) due to changes to other data in the table.

Source
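A related, hedged tip: when you’re benchmarking a query or a warehouse change, the result cache can skew your timings, so you can switch it off for the current session using the USE_CACHED_RESULT parameter:

ALTER SESSION SET USE_CACHED_RESULT = FALSE;  -- every run now does real work

-- ...run the queries being measured...

ALTER SESSION SET USE_CACHED_RESULT = TRUE;   -- switch the cache back on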

You can use the persisted query results for post-processing if you so wish. I like doing this when I’m building really big queries in a step-by-step fashion. You can do this by using the RESULT_SCAN table function. For example:

SELECT col1, col2, SUM(col3) AS sum_col3
FROM example_table
GROUP BY col1, col2;

SELECT col1, COUNT(col2) AS count_col2, SUM(sum_col3) AS sum_col3
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
GROUP BY col1;


Warehouses

Warehouses are the ‘doers’ in Snowflake: they perform the operations that ingest your data and give you your query results. They come in a variety of sizes (which you can change even while a warehouse is running). Think of them as the machinery on the factory floor. If you’re doing the SnowPro Advanced Architect then warehouses shouldn’t be new to you (otherwise, how on earth did you pass SnowPro Core?). Nevertheless, here’s a recap of warehouse sizes.

Image Credit: Snowflake

Warehouses are used:

  • to perform SELECT statements that require compute resources (e.g. retrieving rows from tables and views)
  • to perform DML operations such as updating, loading and unloading

Each time you increase the size of your warehouse, credit usage doubles; for multi-cluster warehouses (an Enterprise Edition feature) the number of clusters is also taken into consideration. Note that increasing warehouse size does not always improve data loading performance – this is influenced more by the number and size of the files being loaded.
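Resizing itself is a one-liner; as a sketch (my_wh is a hypothetical warehouse name), the change applies only to queries submitted after the resize:

ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';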

Quick Notes on Warehouses:

  • By default, auto-suspension is enabled, causing the warehouse to suspend after a specified period of inactivity.

  • By default, auto-resume is enabled meaning it is resumed whenever a query needing it is submitted.

  • Multi-cluster warehouses can run in either maximised mode (a fixed number of clusters is always running) or auto-scale mode (clusters are started and stopped according to demand) – see the sketch after this list.

  • Query processing is impacted more by table size than number of rows.

  • Filter predicates, the number of joins and the number of tables in a query also impact processing time.

  • When running, a warehouse maintains a cache of table data accessed by the warehouse, cache size is determined by the compute resources in the warehouse. The cache is dropped when the warehouse is suspended.

  • Resizing has no effect on currently running queries, only on subsequent ones, and only takes effect while a warehouse is running; if the warehouse is suspended, resizing instructs Snowflake to provision the new resources when the warehouse resumes.

  • On the web interface, click on Warehouses and then the name of a warehouse to see its Load Monitoring Chart.

Image Credit: Snowflake

  • Query load is calculated by taking the execution time in seconds and dividing it by the total time in seconds for the interval
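Pulling several of these notes together, here is a hedged sketch of creating a warehouse (reporting_wh and the values are illustrative); setting MIN_CLUSTER_COUNT equal to MAX_CLUSTER_COUNT would give a maximised warehouse, while different values give auto-scale:

CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE      = 'MEDIUM'
  AUTO_SUSPEND        = 300     -- seconds of inactivity before suspending
  AUTO_RESUME         = TRUE    -- wake up when a query needs it
  MIN_CLUSTER_COUNT   = 1       -- multi-cluster settings require Enterprise Edition
  MAX_CLUSTER_COUNT   = 3       -- auto-scale between 1 and 3 clusters
  INITIALLY_SUSPENDED = TRUE;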

Common Issues And Solutions:

In the exam spec, you might have noticed that you’ll be given some scenarios and asked how to change the configuration of your warehouses to solve them. Luckily, Snowflake’s documentation mentions some examples and I’ve decided to simplify them here; a sketch of the corresponding ALTER WAREHOUSE statements follows the list.

1

Problem: Query load is high or queries are queuing

Solution:

  • Start a separate warehouse and move the queued queries to it
  • If you are using multi-cluster warehouses, change the multi-cluster settings to add more clusters

2

Problem: Recurring usage spikes

Solution:

  • Move some of the peak workload to its own warehouse
  • Increase the maximum cluster count of your multi-cluster warehouse

3

Problem: Warehouse is running, and consuming credits, but query load is less than 1 for long periods of time

Solution:

  • Decrease the warehouse size
  • Decrease the MIN_CLUSTER_COUNT parameter for multi-cluster warehouses
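As a hedged sketch of the statements behind these solutions (etl_wh is an illustrative warehouse name):

-- Problems 1 & 2: allow more clusters to absorb queuing or peak load
ALTER WAREHOUSE etl_wh SET MAX_CLUSTER_COUNT = 4;

-- Problem 3: shrink an under-used warehouse
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'SMALL';
ALTER WAREHOUSE etl_wh SET MIN_CLUSTER_COUNT = 1;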

Account Usage

There have been many times when I’ve found myself staring at Snowflake, clicking run on a simple SELECT * FROM query, wondering why nothing seems to have worked. Somewhere, somehow, something has gone wrong, and I needed someone to tell me what. (This has happened more times than I care to admit.)

ACCOUNT_USAGE is where I eventually found my answers. Think of it simply as logs in SQL table format. Here you can find object metadata, usage metrics, and troubleshooting data.

What you may be surprised to find out is that ACCOUNT_USAGE is provided via a share: SNOWFLAKE is the shared database, and ACCOUNT_USAGE is a schema within it.

You may notice that the ACCOUNT_USAGE views are very similar to those in INFORMATION_SCHEMA, and they mostly are, but there are some subtle differences. ACCOUNT_USAGE has:

  • Records for dropped objects included in each view.
  • A longer retention time for historical usage data.
  • More data latency.
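Coming back to my why-did-nothing-work story above, here is a hedged example of the kind of troubleshooting query I end up running against ACCOUNT_USAGE (remember the extra data latency, typically up to 45 minutes for QUERY_HISTORY):

SELECT query_id,
       query_text,
       error_code,
       error_message,
       start_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE execution_status = 'FAIL'
  AND start_time > DATEADD('day', -1, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;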

Here is a list of views:

Image Credit: Snowflake

Further Reading

Databases, Tables & Views

Virtual Warehouses

Queries