Amazon Redshift is a columnar data warehouse solution on AWS that supports PostgreSQL queries. As a managed service, it simplifies automation of tasks such as backups, replication, and auto-scaling. Amazon Redshift Spectrum extends Redshift to allow querying of S3 buckets so that data does not need to be loaded directly into a Redshift database.
A key difference between Redshift and relational databases is that it uses a columnar architecture, which means that data is stored by column rather than by row.
A Redshift cluster consists of a leader node and one or more compute nodes. Queries to Redshift are sent to the leader node's SQL endpoint, which breaks the work down into jobs that are forwarded to the compute nodes where they are processed in parallel.
Redshift table design
A well designed Redshift table will ensure even distribution of data across nodes so that jobs executed by each node will complete at roughly the same time. Data should also be sorted to minimize the need for table scans. Data may also be compressed to reduce disk space requirements and minimize I/O.
Table distribution styles
|Each node contains the full table||Maximizes performance||Requires excessive storage|
|Use ||Balances trade-offs between ||Can be the "worst of both worlds"|
|Data is round-robin partitioned||Ensures even distribution of data across nodes||Poor join performance|
|Data is distributed based on a table colum||Balances performance with storage||Requires evenly distributed data|
ALL distribution style
ALL distribution style stores the full table on each node. While this can improve performance dramatically, it can also increase storage needs dramatically.
ALL distribution style for:
- Smaller, frequently used dimension tables
- Tables where performance is critical
AUTO distribution style
AUTO distribution style is the default style used by Redshift and uses
ALL initially, then switches to
EVEN when the table gets too large. Once the table is switched to
EVEN, it will not revert to
ALL even if the table size decreases.
EVEN distribution style
EVEN distribution style distributes data evenly across all nodes in round-robin fashion, ensuring an even but essentially random distribution.
EVEN distribution style for:
- Tables that aren't usually joined with other tables
- When you can't decide between other distribution styles
KEY distribution style
KEY distribution style distributes data across nodes based on values from one column, referred to as the distribution key column. All records with the same value for the distribution column will be stored on the same node (the column names do not need to match, only the values). The same distribution pattern applies for other tables with the same distribution key column value which can improve performance through a reduced need for communication across nodes.
KEY distribution styles for:
- Keys that are evenly distributed
In Redshift, a sort key specifies the order in which data is stored on disk. Each table can specify one of three sort options:
|Data is sorted by a single column|
|Data is hierarchically sorted by multiple columns|
|Data is independently sorted by multiple columns (this requires additional disk space)|
Interleaved sorted tables tend to perform worse for queries that filter by the simple or compound sort-keys, as long as the filters make use of the hierarchy defined by compound keys, due to it's more complex storage structure. However, virtually all other filters will perform better with an interleaved sort-key with the trade-off that additional storage space is required.
Amazon Redshift Resources
- Amazon Redshift Homepage
- Amazon Redshift Documentation
- Redshift high-level System Architecture
- Redshift Pricing Guide
- Redshift Best Practices
Broader Topics Related to Amazon Redshift
Amazon Web Services (AWS)
The hosted cloud platform provided by Amazon
Columnar databases Columnar databases , also knowns as column-oriented databases , are databases that store data on disk as columns, as…
Data warehouses: Architecture, data flow, and related topics