SnowPro Advanced Architect: Connectors & Drivers

Domain Objectives

Connectors

  • Kafka
  • Basic configuration and access privileges for the Kafka connector

Drivers

  • Identifying the various client drivers available to connect to Snowflake

A data warehouse would be of little use without connectors and drivers: getting data in would be much harder, and actually using the data would be needlessly difficult.

Simply put, connectors take data from an application and put it into a database, and drivers take the data from a database and put it into an application. You will see above the example of the Kafka Connector – the purpose of this connector is to take data from Kafka and ingest it into Snowflake. Likewise, a driver might be used by software such as Tableau to visualise the data in Snowflake.

Now, what I’ve said above isn’t strictly true: it is possible to use an ODBC driver – for example – to send data to Snowflake. If you have a couple of centuries spare, be my guest, but this isn’t really good practice.

This blog focuses on connectors and drivers: what’s available and how they are configured. Together these constitute the material for Domain 2 of the SnowPro Advanced Architect exam.

What Is Apache Kafka and Kafka Connector?

I don’t think I can really offer a better explanation of what Apache Kafka and its connector are than the one Snowflake offers in their documentation:

“Apache Kafka software uses a publish and subscribe model to write and read streams of records, similar to a message queue or enterprise messaging system. Kafka allows processes to read and write messages asynchronously. A subscriber does not need to be connected directly to a publisher; a publisher can queue a message in Kafka for the subscriber to receive later.
An application publishes messages to a topic, and an application subscribes to a topic to receive those messages. Kafka can process, as well as transmit, messages; however, that is outside the scope of this document. Topics can be divided into partitions to increase scalability.

Kafka Connect is a framework for connecting Kafka with external systems, including databases. A Kafka Connect cluster is a separate cluster from the Kafka cluster. The Kafka Connect cluster supports running and scaling out connectors (components that support reading and/or writing between external systems).

The Kafka connector is designed to run in a Kafka Connect cluster to read data from Kafka topics and write the data into Snowflake tables.”

Source

AWS offers a service called MSK (Managed Streaming for Apache Kafka) that essentially allows you to use AWS infrastructure to run Kafka clusters. In their documentation, the following two diagrams help to explain how Kafka works:

Image Credit: AWS

Image Credit: AWS

If you want to read a little more, I suggest reading this page of their documentation.

The Kafka Connector bit is much easier to understand: essentially, it runs in a Kafka Connect cluster (which is separate from the Kafka cluster itself) and acts as a bridge between Kafka and Snowflake.

Kafka Connector in Snowflake

Snowflake provides two versions of the connector:

  • A version for the Confluent package
  • A version for the OSS (Open Source Software) package

Whichever version you use, there are a few concepts worth understanding:

In Snowflake, a topic produces a stream of rows into a table and, in general, each message contains one row. While in Kafka one application can subscribe to many topics, in Snowflake each topic is typically mapped to a single table.

If you do not map a Kafka topic to a table, the Kafka Connector will create a new table in which to insert the incoming data. However, the table name may differ slightly from the topic name:

  • Lowercase topic names become uppercase
  • If the first character is not a letter or an underscore, then the topic name will be prefixed with an underscore
  • ‘Illegal’ characters will be replaced with underscores.

These adjustments mean that two different topics could end up mapping to the same table name in a schema – for example, topics named data-topic and data.topic would both become the table DATA_TOPIC.

Kafka messages arrive in either JSON or Avro format, so the connector loads each message into a table with two VARIANT columns (a sample query is shown below):

  • RECORD_CONTENT – the Kafka message; all the row values will be in here
  • RECORD_METADATA – the message’s metadata including:
    • topic – the topic name
    • partition – the number of the Kafka partition within the topic (not the Snowflake partition)
    • offset – the offset in the partition
    • CreateTime/LogAppendTime – timestamp associated with the message recorded as milliseconds since midnight 1st January 1970, UTC.
    • key – if the message is a Kafka KeyedMessage, that key will be shown here
    • schema_id – when using Avro with a schema registry, this will show the schema’s ID in that registry
    • headers – user-defined key-value pairs associated with the record

Image Credit: Snowflake
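
As an illustration, here is a minimal sketch of querying a table the connector has populated. The table name (sensor_data) and the payload field (device_id) are hypothetical; since both columns are VARIANT, fields are accessed with the colon notation and cast as needed.

-- Minimal sketch: query a table populated by the Kafka connector.
-- The table name (sensor_data) and payload field (device_id) are hypothetical.
select
    record_metadata:topic::string      as kafka_topic,
    record_metadata:partition::integer as kafka_partition,
    record_metadata:offset::integer    as kafka_offset,
    record_metadata:CreateTime::number as create_time_ms,  -- milliseconds since the Unix epoch
    record_content:device_id::string   as device_id,       -- example payload field
    record_content                     as full_message
from sensor_data
limit 10;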

What the Kafka Connector does behind the scenes in order to ingest this data is: it subscribes to a topic and creates an internal stage, a pipe and a table (if one does not already exist). Messages are buffered until they are big enough, or until enough time has passed, and are then written to a temporary file in the internal stage. This triggers Snowpipe, which copies a pointer to the file into a queue; a Snowflake-provided virtual warehouse then loads the data from the staged file into the target table. Once loaded, the temporary files are deleted from the stage. The objects the connector creates can be inspected, as sketched below.

Image Credit: Snowflake
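
If you want to see these objects for yourself, the sketch below shows one way they could be inspected. The SNOWFLAKE_KAFKA_CONNECTOR prefix is an assumption about the connector’s default naming convention, and the pipe name passed to SYSTEM$PIPE_STATUS is hypothetical.

-- Inspect the stage and pipes the connector has created in the target schema.
-- The 'SNOWFLAKE_KAFKA_CONNECTOR%' prefix is an assumed naming convention.
show stages like 'SNOWFLAKE_KAFKA_CONNECTOR%' in schema mydb.myschema;
show pipes like 'SNOWFLAKE_KAFKA_CONNECTOR%' in schema mydb.myschema;

-- Check whether a connector-created pipe is still receiving files (the pipe name is hypothetical).
select system$pipe_status('mydb.myschema.SNOWFLAKE_KAFKA_CONNECTOR_XYZCOMPANYSENSORDATA_PIPE_TABLE1_0');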

Kafka and the Kafka Connector are both fault-tolerant, meaning that messages are neither duplicated nor silently dropped. If an error is detected, the affected row is not loaded into the target table but is instead moved to the table stage.

Both have their limits, though: Kafka topics have a retention time for messages (7 days by default), and if messages are deleted from a Kafka topic this may not be reflected in the Snowflake table. While unlikely, it is theoretically possible for Kafka to send messages faster than Snowflake can ingest them. If this occurs, it can be addressed by tuning the number of nodes in the Kafka Connect cluster, the number of tasks allocated to the connector, or the network bandwidth between the connector and Snowflake.

Kafka Connector Access Privileges

It would be a security nightmare if Kafka Connector could simply be plugged into Snowflake and instantly allowed to do as it wished. Before Kafka Connector can load data into your tables, you must first grant it the requisite privileges.

Snowflake suggests that each Kafka Connector instance should have its own separate user account and role (which becomes the default role for that user) so that privileges can be granted and revoked individually as and when needed.

The Kafka Connector role should be granted USAGE on the database; at the schema level it should be granted USAGE, CREATE TABLE, CREATE STAGE and CREATE PIPE.

If a table and/or stage you want the Kafka Connector to use already exists, the role should be granted OWNERSHIP on the table, and READ and WRITE on the stage. If the Kafka Connector creates these objects itself, the privileges are granted automatically to the connector account’s default role.

Snowflake’s documentation contains the following example code:

-- Use a role that can create and manage roles and privileges.
use role securityadmin;

-- Create a Snowflake role with the privileges to work with the connector.
create role kafka_connector_role_1;

-- Grant privileges on the database.
grant usage on database kafka_db to role kafka_connector_role_1;

-- Grant privileges on the schema.
grant usage on schema kafka_schema to role kafka_connector_role_1;
grant create table on schema kafka_schema to role kafka_connector_role_1;
grant create stage on schema kafka_schema to role kafka_connector_role_1;
grant create pipe on schema kafka_schema to role kafka_connector_role_1;

-- Only required if the Kafka connector will load data into an existing table.
grant ownership on table existing_table1 to role kafka_connector_role_1;

-- Only required if the Kafka connector will stage data files in an existing internal stage: (not recommended).
grant read, write on stage existing_stage1 to role kafka_connector_role_1;

-- Grant the custom role to an existing user.
grant role kafka_connector_role_1 to user kafka_connector_user_1;

-- Set the custom role as the default role for the user.
-- If you encounter an 'Insufficient privileges' error, verify the role that has the OWNERSHIP privilege on the user.
alter user kafka_connector_user_1 set default_role = kafka_connector_role_1;

Source

Kafka Configuration

The Kafka Connector is configured by creating a configuration file that contains the parameters it needs to log in to Snowflake and map topics to tables. These parameters are broadcast from the master node to the worker nodes – therefore, it is important that the communication channels between the Kafka Connect nodes are as secure as possible. You can create the configuration file in either distributed mode (JSON):

{
  "name": "XYZCompanySensorData",
  "config": {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "tasks.max": "8",
    "topics": "topic1,topic2",
    "snowflake.topic2table.map": "topic1:table1,topic2:table2",
    "buffer.count.records": "10000",
    "buffer.flush.time": "60",
    "buffer.size.bytes": "5000000",
    "snowflake.url.name": "myorganization-myaccount.snowflakecomputing.com:443",
    "snowflake.user.name": "jane.smith",
    "snowflake.private.key": "xyz123",
    "snowflake.private.key.passphrase": "jkladu098jfd089adsq4r",
    "snowflake.database.name": "mydb",
    "snowflake.schema.name": "myschema",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "com.snowflake.kafka.connector.records.SnowflakeAvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081",
    "value.converter.basic.auth.credentials.source": "USER_INFO",
    "value.converter.basic.auth.user.info": "jane.smith:MyStrongPassword"
  }
}

Source
Or standalone mode (a properties file):

connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
tasks.max=8
topics=topic1,topic2
snowflake.topic2table.map=topic1:table1,topic2:table2
buffer.count.records=10000
buffer.flush.time=60
buffer.size.bytes=5000000
snowflake.url.name=myorganization-myaccount.snowflakecomputing.com:443
snowflake.user.name=jane.smith
snowflake.private.key=xyz123
snowflake.private.key.passphrase=jkladu098jfd089adsq4r
snowflake.database.name=mydb
snowflake.schema.name=myschema
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=com.snowflake.kafka.connector.records.SnowflakeAvroConverter
value.converter.schema.registry.url=http://localhost:8081
value.converter.basic.auth.credentials.source=USER_INFO
value.converter.basic.auth.user.info=jane.smith:MyStrongPassword

Source

A list of Kafka configuration properties can be found here

It is important to remember that the Kafka Connector user must be authenticated via key pair authentication. To set the key pair for the user, you can use ALTER USER <user> SET RSA_PUBLIC_KEY='<insert key here>'.
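
For example, a minimal sketch assuming the user created in the earlier grant script and a placeholder key value:

-- Assign the public key to the connector's user. The key value is a placeholder;
-- paste the body of your PEM-encoded public key without the BEGIN/END lines.
alter user kafka_connector_user_1 set rsa_public_key='MIIBIjANBgkqh...';

-- Verify the key was registered by checking the RSA_PUBLIC_KEY_FP property.
desc user kafka_connector_user_1;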

Drivers

What drivers do has already been discussed above; this section will merely expand upon which drivers are available and their various idiosyncrasies.

Node.js Driver

The Node.js Driver is written completely in JavaScript. Using the Node.js driver typically means:

  1. Establishing a Connection with Snowflake
  2. Executing statements (e.g. queries)
  3. Consuming the results
  4. Terminating the connection

It is important to note that the driver cannot be used for file upload or download; SnowSQL or the JDBC driver will have to be used instead.

Notes:

  • This driver supports all five types of Snowflake authentication.
  • You can use SnowCD to evaluate and troubleshoot connectivity.
  • When the driver connects, Snowflake sends a certificate to confirm that the connection is legitimately to Snowflake. The driver then checks this certificate against an OCSP (Online Certificate Status Protocol) responder; if the check fails, the driver will either ‘fail open’ or ‘fail closed’ (see the OCSP section below).
  • When Snowflake returns results, the driver automatically maps SQL data types to JavaScript equivalents.

Go Snowflake Driver

The Go Snowflake Driver allows you to connect to Snowflake using the Go programming language by implementing the database/sql package.

It is important to note that the driver cannot be used for file upload or download; SnowSQL or the JDBC driver will have to be used instead.

Notes:

  • You can use SnowCD to evaluate and troubleshoot connectivity.

.NET Driver

Installed through Visual Studio, the .NET driver provides Microsoft .NET applications with an interface to Snowflake.

It is important to note that the driver cannot be used for file upload or download; SnowSQL or the JDBC driver will have to be used instead.

Notes:

  • You can use SnowCD to evaluate and troubleshoot connectivity.

JDBC Driver

JDBC is a very common driver and can be used with most clients/applications that support JDBC.

Notes:

  • When used to upload files, the JDBC driver encrypts them, and decrypts them again when they are downloaded (see the sketch after this list).
  • You can use SnowCD to evaluate and troubleshoot connectivity.
  • SSO, MFA and Key Pair authentication are supported.
  • A list of connection parameters can be found here
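
As a rough illustration of what file upload and download look like through SnowSQL or the JDBC driver, here is a minimal sketch; the stage, file and path names are hypothetical:

-- Upload a local file to a named internal stage (the driver encrypts the file before upload).
put file:///tmp/sensor_data.json @my_internal_stage auto_compress=true;

-- Download the compressed file again (the driver decrypts it on download).
get @my_internal_stage/sensor_data.json.gz file:///tmp/downloads/;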

ODBC Driver

ODBC is another very common driver and, again, can be used with most clients/applications that support ODBC.

Notes

  • You can use SnowCD to evaluate and troubleshoot connectivity.
  • SSO, MFA and Key Pair authentication are supported.
  • A list of connection parameters can be found here

PHP PDO Driver for Snowflake

The PHP PDO driver provides a way of connecting to Snowflake with PHP PDO.

Notes:

  • You can use SnowCD to evaluate and troubleshoot connectivity.

SnowCD

Although SnowCD is not specifically part of this section, it has been mentioned above, so it may be useful to know a little about what it does.

SnowCD (the Snowflake Connectivity Diagnostic Tool) helps users diagnose and troubleshoot their network connectivity to Snowflake. It uses the Snowflake hostnames and ports listed by either SYSTEM$WHITELIST() or SYSTEM$WHITELIST_PRIVATELINK() to run a series of checks, returning either ‘all checks passed’ or a message describing the failures along with some troubleshooting suggestions (the sketch below shows how to retrieve this list).
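
A minimal sketch of retrieving that list in SQL is shown below; flattening the JSON into rows is optional, but makes the output easier to read.

-- Return the hostnames, ports and types that SnowCD needs to check, as a JSON array.
select system$whitelist();

-- Optionally flatten the JSON into one row per endpoint.
select t.value:type::varchar as type,
       t.value:host::varchar as host,
       t.value:port          as port
from table(flatten(input => parse_json(system$whitelist()))) as t;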

It can be used within automated scripts or as part of periodic checks on running machines, and it works either directly or through proxy machines. It checks access to the Snowflake database and to stages, and verifies that an HTTP response was returned from the host. The latter allows you to detect problems such as:

  • No HTTP server is running at the specified IP and port
  • DNS lookup failures
  • MITM attacks (Man in the Middle – meaning interception of a connection by a rogue intermediary)
  • Network failures below HTTP level

SnowCD does not, however, check access policies on S3 buckets, Azure Blob Storage or Google Cloud Storage; only basic stage access is tested (not the HTTP response), and it does not detect 403 errors returned by a customer’s proxy server.

OCSP

Again, while not specifically included in this section, this is another nice-to-know.

The Online Certificate Status Protocol helps keep connections secure by determining whether a certificate has been revoked when a client attempts to connect to an endpoint over HTTPS. Each certificate in the chain of trust is evaluated.

Fail-Open is the default behaviour: the connection only fails if the OCSP response indicates that the certificate has been revoked; any other error or status still allows the connection, but a WARNING is written to the logs.

Fail-Close is more restrictive: if the driver cannot obtain a valid OCSP response for any reason, or the certificate is revoked, the connection fails. This must be set manually for each driver or connector.

Sources and Further Reading

Connecting to Snowflake