- Outline key components of Snowflake’s architecture
- Differentiate Snowflake and other paradigms
- Shared disk
- Shared nothing
- Snowflake’s securable object hierarchy
- SQL Syntax:
- Querying metadata
- Cluster information
- Data types
Snowflake likes to brag that it is the ultimate SaaS (software as a service) offering. What Snowflake means by this is that when you purchase usage of Snowflake, there is no hardware you need to maintain, no software to install and no management overhead to take care of. All you have to do is tailor the global settings to your needs.
Every part of Snowflake’s service is cloud-based (except those small things that are run locally: connectors, drivers and command line clients).
As you probably know from doing SnowPro Core, Snowflake splits storage and compute – virtual compute instances take care of the computation required for querying, ingesting and unloading data, and virtual storage takes care of storing your data in Snowflake’s proprietary format.
Image Credit: Snowflake
As mentioned above, storage stands independently from compute. To ensure that data is stored as efficiently and as cheaply as possible, Snowflake converts your data on load into its proprietary format, in which the data is compressed and organised in a columnar layout (a common technique for improving data efficiency). The raw data is thereafter managed by Snowflake; it is not accessible to customers except through SQL queries.
Cloud object storage comes with many complications – lifecycle policies, access policies, redundancy and so on – but all of this is taken care of and managed by Snowflake. Customers are only responsible for configuring Snowflake-specific features such as the Time Travel retention period, for example.
This is the ‘compute’ layer we’ve been talking so much about. The query processing layer is made up of a collection of virtual warehouses, each a cluster of compute nodes provisioned from a cloud provider (Google, Amazon or Microsoft).
Virtual warehouses are completely independent of one another; nothing is shared between them. For that reason, what is going on in one virtual warehouse has no impact on what’s going on in another.
Again, these compute nodes are managed and maintained by Snowflake – you simply need to tell Snowflake how many you want and at what size.
Think of this as the brains of the operation – while separating storage and compute is definitely a good idea, you need something that’s going to tie them both back together again and enable that coordination. More than that, this layer also processes all the extra things that make Snowflake tick, such as:
- Authentication
- Infrastructure management
- Metadata management
- Query parsing and optimization
- Access control
Along with having three architectural layers, Snowflake has three cache layers that make queries much more efficient and much cheaper to re-run.
Result Cache: this layer holds the query result itself (not the underlying data – so if you have performed any aggregations, the unaggregated data won’t be available here). The reason this is its own layer is that you may want to access the cache from a different virtual warehouse than the one originally used; the result cache layer makes those query results accessible. Query results are held here for 24 hours, so long as the underlying data hasn’t changed since query execution.
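You can watch the result cache at work by running the same query twice. A minimal sketch (the `USE_CACHED_RESULT` session parameter is standard Snowflake; the table name is a placeholder):

```sql
-- Result-cache reuse is on by default; it can be toggled per session.
ALTER SESSION SET USE_CACHED_RESULT = TRUE;

-- First execution: computed by the virtual warehouse.
SELECT COUNT(*) FROM my_table;

-- Second, identical execution (within 24 hours, data unchanged):
-- served from the result cache, consuming no warehouse compute.
SELECT COUNT(*) FROM my_table;
```

The second execution can even come from a different virtual warehouse, since the result cache sits in the cloud services layer rather than in any one warehouse.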
Local Disk Cache: this stores the data used by SQL queries on the local SSD and in the memory of the virtual warehouse. Thanks to this layer, data that has already been queried is likely to be retrieved more quickly.
Remote Disk: where long-term storage is held.
Snowflake v Other Paradigms
“Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.”
Let’s look a little closer at these two concepts
Shared disk
A single storage layer is accessible by all cluster nodes: nodes with their own CPU and memory but no disk storage of their own communicate with the central storage layer to access and process the data.
Image Credit: Hevodata
Shared nothing
Distributed cluster nodes each have their own disk storage, CPU and memory. In this model, data can be partitioned and distributed across the cluster nodes, as each has its own disk storage.
Image Credit: Hevodata
Snowflake’s securable object hierarchy
This topic was covered in SNOWPRO ADVANCED ARCHITECT: SECURITY which you can access here
I’ve spoken a little about this concept here but this blog delves into some of the fancier tricks one can use to partially automate table creation using the ability to query staged files. While not necessary to know for the exam, it does give you a bit of an idea for the syntax required.
What you do need to know for the exam is the basic way in which this operates. Any file in an internal stage or a named external stage can be queried using Snowflake. Listing the files is easy enough, just use the following command:
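The listing command itself is Snowflake's `LIST` (the stage name `@mystage1` below is just a placeholder, matching the query example further down):

```sql
-- List all files in the named stage, with size, MD5 and last-modified time.
LIST @mystage1;

-- Optionally filter the listing with a regular expression.
LIST @mystage1 PATTERN = '.*data.*[.]csv[.]gz';
```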
In order to pinpoint an exact file, you should use the PATTERN option in the FROM clause. To help Snowflake interpret your files, you should also include the FILE_FORMAT option. For example:
```sql
select t.$1, t.$2
from @mystage1 (file_format => 'myformat', pattern => '.*data.*[.]csv.gz') t;
```
As you can see above, columns are specified in the SELECT statement by position ($1, $2 and so on), preceded by the alias of what’s being queried in the FROM clause. Aliasing the stage in the FROM clause (here, t) makes referencing the columns easier.
The following metadata columns can be queried or copied into tables:
METADATA$FILENAME: Name of the staged file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER: Row number for each record in the staged data file.
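These metadata columns are selected alongside the positional data columns; a small sketch against the same hypothetical stage and file format as above:

```sql
-- Query file-level metadata next to the file's contents.
select metadata$filename, metadata$file_row_number, t.$1, t.$2
from @mystage1 (file_format => 'myformat') t;
```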
Snowflake provides the following commands for querying clustering:
SYSTEM$CLUSTERING_INFORMATION: Returns clustering info, including average clustering depth, for a table based on one or more columns in the table.
SYSTEM$CLUSTERING_DEPTH: Returns the average clustering depth of a table according to the specified columns.
SYSTEM$CLUSTERING_RATIO: Calculates a clustering ratio between 0 and 100; the higher the ratio, the better the clustering.
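A quick sketch of the first two functions (the table and column names are placeholders; both functions take the table name and, optionally, a parenthesised column list as strings):

```sql
-- Full clustering information, including average depth, for the stated columns.
select system$clustering_information('my_table', '(col1, col2)');

-- Just the average clustering depth for the same columns.
select system$clustering_depth('my_table', '(col1, col2)');
```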
Writing SQL is much easier when you’re able to reference time. For this purpose, Snowflake provides a set of date and time context functions.
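As an illustration (this is a selection of Snowflake's standard date/time context functions, not necessarily the full list intended here):

```sql
-- Current date, time and timestamp in the session time zone.
select current_date, current_time, current_timestamp;

-- Local (wallclock) variants.
select localtime, localtimestamp;
```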
Snowflake defines an identifier as “a string of characters (up to 255 characters in length) used to identify first-class Snowflake ‘named’ objects”. The metadata for identifiers can be queried using SHOW and DESCRIBE.
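For example (the object names below are placeholders):

```sql
-- List databases whose names match a pattern.
show databases like 'MY_%';

-- Describe the columns of a specific table.
describe table my_database.my_schema.my_table;
```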
Snowflake offers the following for querying data types:
TYPEOF: Reports the type of a value stored in a VARIANT column.
IS_<object_type>: Assesses whether a VARIANT value is of a given data type (e.g. IS_INTEGER, IS_VARCHAR, IS_ARRAY).
SHOW COLUMNS: Shows the columns of a table along with their data types.
DESCRIBE TABLE: Can be used to inspect the columns of a table (or, with TYPE = STAGE, its stage properties).
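A small sketch tying these together (the table and column names are hypothetical):

```sql
-- Inspect the underlying type of each VARIANT value.
select v, typeof(v), is_integer(v)
from my_variant_table;

-- Column names and data types for the same table.
show columns in table my_variant_table;
```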