Like most things in the data world, partitioning isn't new, but Snowflake's take on it is a bit different.
In any traditional RDBMS, you are expected to handle partitioning yourself via SQL commands or the administrative interface (static partitioning). A newly created table simply exists within the database's data file as one large object; if you want to split it over multiple files or disks, you have to configure that yourself.
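To make the contrast concrete, here's what static partitioning typically looks like in a traditional RDBMS. This sketch uses PostgreSQL's declarative range partitioning; the table and column names are purely illustrative:

```sql
-- PostgreSQL: the partitioning scheme must be declared up front
CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
) PARTITION BY RANGE (sale_date);

-- ...and each partition must be created (and maintained) manually
CREATE TABLE sales_2023 PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE sales_2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```

Every new date range means another manual DDL statement - exactly the kind of housekeeping Snowflake removes.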
Snowflake does away with all that. When you create a table, Snowflake automatically splits it up into micro-partitions behind the scenes (dynamic partitioning). The key elements of micro-partitions are:
- Contiguous units of storage
- Each micro-partition holds between 50MB and 500MB of uncompressed data
- The stored size of each micro-partition is smaller than this, as the data are kept in a compressed format
- Snowflake automatically maps groups of rows into individual micro-partitions
- Within each micro-partition, rows are organised in a columnar format
- Tables are partitioned as rows are inserted
- The range of values held within a micro-partition can overlap with the ranges within other micro-partitions. This prevents data skew, where some partitions would be full of data while others sit virtually empty
- If you drop a column, the column is removed from the metadata but its data remain within micro-partition storage. The same happens if you delete all rows in a table
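In Snowflake, by contrast, there is no partitioning clause at all. A plain CREATE TABLE is all that's needed, and micro-partitions are created automatically as rows arrive (the table and columns below are illustrative):

```sql
-- Snowflake: no PARTITION BY clause exists for tables;
-- micro-partitioning happens automatically behind the scenes
CREATE TABLE Sales (
    Sale_ID   NUMBER,
    Sale_Date DATE,
    Country   VARCHAR,
    Amount    NUMBER(10,2)
);
```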
Efficient Querying
Most modern data platforms try to reduce the amount of data queries have to interrogate. This leads to faster queries and reduced costs, as only the data the query is interested in are inspected. In a classic example, you may want to filter for the UK from this particular country list:
- UK (200 records)
- USA (2000 records)
- China (1800 records)
The UK's records make up just 5% of the total number of records in this example. If a query like this were executed:
SELECT * FROM Sales WHERE Country = 'UK';
We would only want to read the 5% of records for the UK; otherwise the query is going to be mightily expensive, and in a system that has not been partitioned, it might well be. In Snowflake, only the micro-partitions that contain UK rows are likely to be scanned, thanks to micro-partitioning. This process is known as query pruning.
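You can see pruning in action by comparing the number of partitions scanned against the total number of partitions for a query. One way to do this, assuming you have access to the SNOWFLAKE.ACCOUNT_USAGE share, is via the QUERY_HISTORY view:

```sql
-- Compare partitions scanned vs. total for recent queries;
-- a large gap between the two indicates effective pruning
SELECT query_text,
       partitions_scanned,
       partitions_total
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%Country = ''UK''%'
ORDER BY start_time DESC
LIMIT 10;
```

The same figures appear in the Query Profile in the Snowflake UI, under the table scan node.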
Clustering
Clustering is an important element of the micro-partition story. Snowflake creates clustering metadata as micro-partitions are created or updated, including the total number of micro-partitions for a table and the depth of overlap between them. A newly created table has a clustering depth of 0 - meaning the table is empty! As rows are added and micro-partitions are created, this number changes. You want it to stay small: the smaller the depth, the better clustered the data. Snowflake has an automatic reclustering feature, which replaces the now-deprecated manual reclustering option, so you no longer need to monitor the state of your clustered tables. Just ensure your table has a clustering key and Snowflake will take care of the rest.
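To illustrate, a clustering key is defined with ALTER TABLE, and the clustering metadata described above can be inspected with Snowflake's system functions (the table and column choice here are illustrative):

```sql
-- Define a clustering key so automatic reclustering can maintain it
ALTER TABLE Sales CLUSTER BY (Country);

-- Inspect clustering metadata: depth, partition counts, overlap
SELECT SYSTEM$CLUSTERING_INFORMATION('Sales');

-- Or return just the average clustering depth for a column
SELECT SYSTEM$CLUSTERING_DEPTH('Sales', '(Country)');
```

SYSTEM$CLUSTERING_INFORMATION returns a JSON document containing the total partition count and average depth, which is a convenient way to check how well clustered a large table is.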