SnowPro Advanced Architect: Snowflake Storage & Protection


Time Travel

  • Define Time Travel features:
    • Data retention periods
    • Differences between Snowflake Editions
    • Querying historical data
    • Impact of cloning and restoring objects
    • Underlying Storage concepts
  • Snowflake’s Continuous Data Protection Lifecycle
  • Data Replication and Failover
    • Explain Replication and outline features:
      • Cross cloud
      • Cross region
      • Objects that are replicated
    • How it can be used for Business Continuity
  • Clones
    • Working with clones
    • Creating new tables
    • Compare command usage:
      • CREATE TABLE LIKE
      • CREATE TABLE AS SELECT
      • CLONE
      • COPY GRANTS
  • Semi-Structured Data
    • Working with semi-structured data

Time Travel

Mistakes happen. Big mistakes happen. While I would like to state, for the record, that I’ve never been in a situation where I’m responsible for the loss of thousands, hundreds of thousands or millions of rows of data – I’ve definitely heard stories of this blood-curdling moment happening to others. Time Travel allows you to quickly undo these alarm-raising mistakes. Whenever you make changes to data-related objects in Snowflake, a copy of the data before the changes were made is retained for a period of time.

Data retention comes in two layers:

  1. Time Travel: where data is recoverable using user operations (which will be covered shortly) – this period can be anywhere from 0 to 90 days depending on the edition of Snowflake you are using and what you decide your data retention period should be.

  2. Fail-Safe: where data is not recoverable using user operations but can be retrieved by Snowflake if you contact support. This period lasts 7 days and starts immediately after the data retention period ends. Note that even when data is in fail-safe you are still charged for storing that data. ACCOUNTADMINS can see total data storage for their account in the ‘Account’ section of the web-interface.


If you wish to query time travel data, you can do so using the following syntax:

  • AT | BEFORE
    • Can be used in one of two ways:
      • Specified in SELECT statements after FROM, e.g. SELECT * FROM example_table BEFORE ( STATEMENT => '<query_id>' )
      • Specified in CREATE … CLONE commands, e.g. CREATE TABLE example_table_clone CLONE example_table BEFORE ( STATEMENT => '<query_id>' )
    • Accepts the following parameters:
      • TIMESTAMP
      • OFFSET (time difference in seconds from the present time; use a negative number to go back)
      • STATEMENT (identifier for a statement, e.g. a query ID)
  • UNDROP
    • Command for tables, schemas or databases
    • Will restore the object as it was immediately prior to being dropped (see the examples after this list)
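
As a quick sketch (the object names and query ID below are placeholders), the parameters and UNDROP look like this in practice:

-- Query the table as it was an hour ago
SELECT * FROM example_table AT ( OFFSET => -60*60 );

-- Query the table as it was at a specific point in time
SELECT * FROM example_table AT ( TIMESTAMP => '2021-06-01 12:00:00'::TIMESTAMP_TZ );

-- Query the table as it was immediately before a given statement ran
SELECT * FROM example_table BEFORE ( STATEMENT => '<query_id>' );

-- Restore a dropped object
UNDROP TABLE example_table;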

Data Retention Period

As you saw above, the data retention period can be set to anywhere between 0 and 90 days depending on edition. The way this works is as follows:

  • For Snowflake Standard Edition, the data retention period is 0 or 1 day (default 1 day).

  • For Snowflake Enterprise Edition (and higher) it can be 0 or 1 day for transient objects and anywhere from 0 to 90 days for permanent objects.

There are also some quick facts to take note of:

  • ACCOUNTADMINs can set the default data retention period for the entire account (see the sketch after this list).

  • If DATA_RETENTION_TIME_IN_DAYS is 0 then time travel is deactivated.

  • If you increase the retention period, objects are retained for the longer period from that point onwards, but data that has already moved into Fail-safe stays there.

  • If you decrease the retention period, data older than the new retention period is moved into Fail-safe.

  • When a database is dropped, the data retention period for schemas and tables within that database – if set to be different – is not honoured. They’re retained for the same period as the database.

  • The same is true for schemas and the tables within.

  • If you wish for the data retention period of tables within schemas or schemas within databases to be honoured, drop them first.

  • A user must have OWNERSHIP privileges for an object to restore it.

  • Tables and schemas can only be restored in the current schema or database, even if a fully-qualified object name is specified.

  • Storage fees are incurred, calculated for each 24-hour period from the time the data changed.
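
By way of illustration (object names here are placeholders), the retention period can be set when an object is created or changed afterwards:

-- Account-wide default (requires the ACCOUNTADMIN role)
ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 30;

-- Per table, at creation time or afterwards
CREATE TABLE example_table_retention (id INT) DATA_RETENTION_TIME_IN_DAYS = 90;
ALTER TABLE example_table_retention SET DATA_RETENTION_TIME_IN_DAYS = 0;  -- deactivates Time Travel for this table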

Under the Hood

You might remember that in the performance and tuning blog I cited Time Travel as an example of how cool Snowflake’s micro-partitions are (isn’t it nice when everything comes together?). I think it is worth, rather self-indulgently, quoting myself:

"
One illustration of how cool micro-partitioning is comes in the example of time travel. The intricacies of time travel will be explored in the final section of this blog series; suffice to say for now that Snowflake keeps a historical record of your tables, meaning you can explore and restore previous versions after you’ve made changes to your data. Now, if Snowflake had a new copy of your table for every single change you ever made, you’d soon see your bill rocket. Even if you had a small number of partitions this could still be expensive. With micro-partitioning, only copies of the specific micro-partition are created, representing a very small subset of your data.

Say you have 20 micro-partitions for a table, and you perform DML on the data in micro-partition #20 (call it #20.a) a few times, creating micro-partitions #20.b on Day 2, #20.c on Day 3 and #20.d on Day 4. If you use time travel to query the data as of Day 1 you will get micro-partitions #1 – #19, plus #20.a. If you query Day 3, you’ll get #1 – #19, plus #20.c and so on…
"


The implication of the above is that when you restore old data, you’re not creating something new; you’re taking something old out of time travel. Essentially, when you query the table from then on, the pointer will simply point at what was once there.

Data Replication & Failover

Databases can be replicated and synced across multiple Snowflake accounts in different regions and on different cloud platforms. If you are using Business Critical Edition (or higher) you can also enable failover across multiple accounts in different regions. When replicating, the original database is known as the primary database and the replicas are secondary databases.
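
As a rough sketch of the basic workflow (the organisation, account and database names below are placeholders):

-- On the source account: make the database a primary and permit replication to another account
ALTER DATABASE example_db ENABLE REPLICATION TO ACCOUNTS myorg.target_account;

-- With Business Critical (or higher), failover can also be enabled
ALTER DATABASE example_db ENABLE FAILOVER TO ACCOUNTS myorg.target_account;

-- On the target account: create the secondary database as a replica and refresh it
CREATE DATABASE example_db AS REPLICA OF myorg.source_account.example_db;
ALTER DATABASE example_db REFRESH;

-- To fail over, promote the secondary database to serve as the primary
ALTER DATABASE example_db PRIMARY;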

The following cannot be replicated:

  • Users
  • Roles
  • Warehouses
  • Resource monitors
  • Shares

In terms of security:

  • Privileges granted on a database are not replicated to a secondary database
  • Snowflake encrypts all replicated database files in transit from the source account to the target account
  • If Tri-Secret Secure is enabled for both the source and target accounts, the extra Tri-Secret Secure protection also applies to the replicated data

One thing you should be wary of: if the primary database is in a Business Critical (or higher) edition account but one or more of the accounts you are replicating to is on a lower edition, you will get an error message. This is to prevent account administrators on Business Critical editions from inadvertently replicating sensitive data to accounts that are unsuitable for such data. The behaviour can be overridden with the IGNORE EDITION CHECK clause.
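
If you do need to override the edition check, the clause is appended to the replication statement (names again placeholders):

-- Explicitly acknowledge that the target account is on a lower edition
ALTER DATABASE example_db ENABLE REPLICATION TO ACCOUNTS myorg.lower_edition_account IGNORE EDITION CHECK;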

Charges for replication are divided into two categories:

  • Data Transfer: Initial database replication and subsequent synchronisation operations transfer data between regions. Cloud providers charge for these sorts of transfers. The location of the source account determines the price of the data transfer.

  • Compute Resources: Replication also uses compute resources, and these are charged for.

Unfortunately, although the exam spec says cross-cloud and cross-region features will be tested, there are very limited resources out there apart from this blog.

Remember:

  • Clustering operations are performed on the primary database only; reclustering is not performed on the secondary database.

  • For materialized views in a secondary database, only the definitions are replicated, not the data, so after a refresh operation their data can be out of date. To keep them up to date, set AUTO_REFRESH_MATERIALIZED_VIEWS_ON_SECONDARY = TRUE on the secondary database (as shown after this list).

  • Replicating a database that contains external tables can fail with an error; move external tables to a database that is not replicated.

  • For masking and row access policies, if either of the following conditions is true, then the initial replication operation or a subsequent refresh operation fails:
    • The primary database is in an Enterprise (or higher) account and contains a policy but one or more of the accounts approved for replication are on lower editions.
    • A policy contained in the primary database has a reference to a policy in another database.
  • Only a primary database can have change tracking and streams enabled.

  • Historical data is not replicated to secondary databases.

  • Data retention in a secondary database begins at the same time as that in the primary.

  • When DML touches one or a few rows in a table in the primary database, all affected micro-partitions must be synchronised (not just the affected rows), which can make data transfer costs high.

  • To DROP a primary database, you must first promote a secondary database to primary.
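
For the materialized view point above, a minimal sketch, assuming the parameter is set directly on the secondary database (name is a placeholder):

-- Run against the secondary database so its materialized views are refreshed automatically
ALTER DATABASE example_db SET AUTO_REFRESH_MATERIALIZED_VIEWS_ON_SECONDARY = TRUE;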

Clones

Cloning works pretty much as its sci-fi-styled name would suggest. A snapshot of your table/schema/database is taken to create a derived copy which, until you make changes, shares the same underlying storage. Once you start making changes, though, new clone-specific micro-partitions are created, meaning you can quickly end up with hybrid clones that share a percentage of the underlying storage with the original and have another percentage of independent storage.

Every table has an ID and a CLONE_GROUP_ID. For a table with no clone ancestry, the two are identical; when a table is cloned, the clone gets a new ID but keeps the CLONE_GROUP_ID of its source, so the group ID ties together the original table, the clone and any subsequent clones (clones can be cloned to make clone-clones, which can be cloned to make clone-clone-clones, ad infinitum).
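
You can see these IDs, along with how much storage is being retained for clones, in the TABLE_STORAGE_METRICS view; a quick sketch, assuming a database called example_db:

-- ID = CLONE_GROUP_ID means the table has no clone ancestry
SELECT table_name, id, clone_group_id, active_bytes, retained_for_clone_bytes
FROM example_db.information_schema.table_storage_metrics
ORDER BY clone_group_id;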

Whether a cloned object inherits the source object’s privileges is a little bit complicated – basically, no. But objects lower in the hierarchy will inherit the privileges of their sources. What does that mean? Well, if you clone a table, nothing is inherited. If you clone a schema, the tables and views within it inherit the privileges of their source tables and views, but the schema itself won’t. Likewise with databases: everything inside the database inherits the privileges of its source, but the database itself will not.

Note that you can create a clone of an historical object by combining cloning with time travel.

Notes on Syntax

CREATE TABLE AS SELECT (CTAS)

This type of query creates a table based upon the result of a query. If your query is a simple SELECT * FROM <table> statement then you will get a carbon copy of your table, but normally this kind of query is used to generate new tables derived from one or more other tables. Note that, unlike CLONE, CTAS physically writes out the query result, so even a simple SELECT * FROM <table> produces brand new micro-partitions rather than sharing storage with the source.

CREATE OR REPLACE TABLE example_table_ctas (item VARCHAR, item_count INT)
  COPY GRANTS
  AS SELECT item, COUNT(item)
     FROM example_table
     GROUP BY item;


Note that in a plain CTAS statement, COPY GRANTS only has an effect when the OR REPLACE clause is present: the grants are copied from the existing table being replaced.

CREATE TABLE LIKE

This command creates an empty table with all the same column definitions as the table stated in the LIKE clause.

CREATE OR REPLACE TABLE example_table_like LIKE example_table
  COPY GRANTS;


CREATE TABLE … CLONE

Now this is the one we’re interested in: this is how we create clones, which will have both the source table’s column definitions and the data within that table.

CREATE OR REPLACE TABLE example_table_clone CLONE example_table
  COPY GRANTS;


CREATE OR REPLACE TABLE example_table_clone_tt CLONE example_table
  BEFORE ( STATEMENT => '<query_id>' )
  COPY GRANTS;


COPY GRANTS

When using the above CREATE TABLE variants, you can specify COPY GRANTS. As the name suggests, this means the table you are creating retains the access privileges of the original. The only privilege that won’t be transferred, for obvious reasons, is OWNERSHIP. Future grants also won’t be inherited.
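
If you want to check what was actually carried across, SHOW GRANTS is a quick way to verify (using one of the tables created above):

-- Lists the privileges granted on the newly created table
SHOW GRANTS ON TABLE example_table_clone;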

Semi-Structured Data

Leaving one of the best topics to last, the final part of the exam spec tells us we need to know how to work with semi-structured data. Semi-structured data is data that has no fixed schema and can contain nested data structures such as arrays and objects.

Before diving in, let’s explain a few data types we come across when looking at semi-structured data:

  • VARIANT: a tagged universal type that can store values of any other type, up to 16 MB per value.
  • OBJECT: used to represent collections of key-value pairs.
  • ARRAY: used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (see the sketch after this list).
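
To get a feel for these types, here is a small sketch using the standard construction functions:

-- OBJECT, ARRAY and VARIANT values built inline
SELECT
    OBJECT_CONSTRUCT('color', 'red', 'value', '#f00') AS an_object,
    ARRAY_CONSTRUCT(1, 2, 3) AS an_array,
    TO_VARIANT(42) AS a_variant;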

When you ingest semi-structured data, it is stored in the VARIANT type explained above. From that point, you will want to perform some ELT on the VARIANT column. If you know what your data looks like, then you can simply query the VARIANT column to extract the fields within. Let’s look at an example:

[
	{
		"color": "red",
		"value": "#f00"
	},
	{
		"color": "green",
		"value": "#0f0"
	},
	{
		"color": "blue",
		"value": "#00f"
	},
	{
		"color": "cyan",
		"value": "#0ff"
	},
	{
		"color": "magenta",
		"value": "#f0f"
	},
	{
		"color": "yellow",
		"value": "#ff0"
	},
	{
		"color": "black",
		"value": "#000"
	}
]

Source

If we were to query the data above in Snowflake, we’d want to do something like this, assuming our VARIANT column was called varexample:
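
A sketch of what that query could look like, assuming the JSON array above has been loaded into the varexample column of a (placeholder) table called colours_raw:

-- Each array element becomes a row; colon notation then pulls out the individual keys
SELECT
    f.value:color::STRING AS color,
    f.value:value::STRING AS hex_value
FROM colours_raw,
     LATERAL FLATTEN(input => varexample) f;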

LATERAL FLATTEN is the function we use to make this data look a little more normal. Now, it is far beyond me to adequately explain LATERAL FLATTEN; it is an incredibly complex function. All you need to know for now is that it can split up your JSON into rows. If that’s not a satisfying enough explanation, then I would definitely recommend reading this blog.

Suppose we have something slightly more complicated. Here we would need to dig in.

{
	"id": "0001",
	"type": "donut",
	"name": "Cake",
	"ppu": 0.55,
	"batters":
		{
			"batter":
				[
					{ "id": "1001", "type": "Regular" },
					{ "id": "1002", "type": "Chocolate" },
					{ "id": "1003", "type": "Blueberry" },
					{ "id": "1004", "type": "Devil's Food" }
				]
		},
	"topping":
		[
			{ "id": "5001", "type": "None" },
			{ "id": "5002", "type": "Glazed" },
			{ "id": "5005", "type": "Sugar" },
			{ "id": "5007", "type": "Powdered Sugar" },
			{ "id": "5006", "type": "Chocolate with Sprinkles" },
			{ "id": "5003", "type": "Chocolate" },
			{ "id": "5004", "type": "Maple" }
		]
}

Source
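
Assuming the JSON above sits in the varexample VARIANT column of a (placeholder) table called donuts_raw, digging into the nested batters could look something like this:

-- varexample:batters.batter is the nested array we want to turn into rows
SELECT
    varexample:id::STRING   AS id,
    varexample:name::STRING AS name,
    b.value:id::STRING      AS batter_id,
    b.value:type::STRING    AS batter_type
FROM donuts_raw,
     LATERAL FLATTEN(input => varexample:batters.batter) b;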

As you can see in the LATERAL FLATTEN statement, we’ve dug into the nested JSON in order to get the data into rows, but in doing so we’ve lost the ‘topping’ section. Whilst I could go on into the intricacies of this multi-layered ELT, it’s not really necessary here, so I’ll leave it with you. If you want to insert the data above into a table in Snowflake, INSERT it using SELECT PARSE_JSON('<insert JSON here>'), as sketched below.
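
A minimal sketch of that insert, with donuts_raw and varexample again as placeholder names (and the JSON abridged):

-- PARSE_JSON turns the string literal into a VARIANT value; note the INSERT ... SELECT form
CREATE OR REPLACE TABLE donuts_raw (varexample VARIANT);

INSERT INTO donuts_raw
    SELECT PARSE_JSON('{ "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55 }');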

If you don’t know your data well enough to write the queries above, you can use two functions, INFER_SCHEMA and GENERATE_COLUMN_DESCRIPTION, to help perform your ELT.
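
Both operate over staged files rather than an existing VARIANT column; a rough sketch, with the stage and file format names as placeholders:

-- Detect column names and types from files in a stage
SELECT *
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@my_stage/donuts/',
        FILE_FORMAT => 'my_file_format'
    )
);

-- Turn the inferred schema into column definitions for a CREATE TABLE statement
SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table')
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@my_stage/donuts/',
        FILE_FORMAT => 'my_file_format'
    )
);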

Remember that JSON is not the only semi-structured file format supported; you can also import Avro, ORC, Parquet and XML.

Further Reading

Snowflake Time Travel & Fail-safe

Database Replication and Failover/Failback

Cloning Considerations

Semi-structured Data

Querying Semi-structured Data