Amazon Redshift

Amazon Redshift is a columnar data warehouse solution on AWS that supports PostgreSQL queries. As a managed service, it simplifies automation of tasks such as backups, replication, and auto-scaling. Amazon Redshift Spectrum extends Redshift to allow querying of S3 buckets so that data does not need to be loaded directly into a Redshift database.

A key difference between Redshift and relational databases is that it uses a columnar architecture, which means that data is stored by column rather than by row.

Redshift architecture

A Redshift cluster consists of a leader node and one or more compute nodes. Queries to Redshift are sent to the leader node's SQL endpoint, which breaks the work down into jobs that are forwarded to the compute nodes where they are processed in parallel.

flowchart TD Client -->|SQL Query| Leader[Leader Node] subgraph Redshift Cluster Leader --> C1[Compute Node 1] Leader --> CDot[...] Leader --> CN[Compute Node N] C1 --> Leader CDot --> Leader CN --> Leader end Leader -->|Data| Client

Redshift table design

A well designed Redshift table will ensure even distribution of data across nodes so that jobs executed by each node will complete at roughly the same time. Data should also be sorted to minimize the need for table scans. Data may also be compressed to reduce disk space requirements and minimize I/O.

Table distribution styles

StyleDescriptionBenefitsDrawbacks
ALLEach node contains the full tableMaximizes performanceRequires excessive storage
AUTOUse ALL initially, switch to EVEN when the table gets too largeBalances trade-offs between ALL and EVENCan be the "worst of both worlds"
EVENData is round-robin partitionedEnsures even distribution of data across nodesPoor join performance
KEYData is distributed based on a table columBalances performance with storageRequires evenly distributed data

The ALL distribution style

The ALL distribution style stores the full table on each node. While this can improve performance dramatically, it can also increase storage needs dramatically.

Use the ALL distribution style for:

  • Smaller, frequently used dimension tables
  • Tables where performance is critical

The AUTO distribution style

The AUTO distribution style is the default style used by Redshift and uses ALL initially, then switches to EVEN when the table gets too large. Once the table is switched to EVEN, it will not revert to ALL even if the table size decreases.

The EVEN distribution style

The EVEN distribution style distributes data evenly across all nodes in round-robin fashion, ensuring an even but essentially random distribution.

Use the EVEN distribution style for:

  • Tables that aren't usually joined with other tables
  • When you can't decide between other distribution styles

The KEY distribution style

The KEY distribution style distributes data across nodes based on values from one column, referred to as the distribution key column. All records with the same value for the distribution column will be stored on the same node (the column names do not need to match, only the values). The same distribution pattern applies for other tables with the same distribution key column value which can improve performance through a reduced need for communication across nodes.

Use the KEY distribution styles for:

  • Keys that are evenly distributed

Sort strategies

In Redshift, a sort key specifies the order in which data is stored on disk. Each table can specify one of three sort options:

StrategyDescription
sortkey (simple)Data is sorted by a single column
compound sortkeyData is hierarchically sorted by multiple columns
interleaved sortkeyData is independently sorted by multiple columns (this requires additional disk space)

Interleaved sorted tables tend to perform worse for queries that filter by the simple or compound sort-keys, as long as the filters make use of the hierarchy defined by compound keys, due to it's more complex storage structure. However, virtually all other filters will perform better with an interleaved sort-key with the trade-off that additional storage space is required.

Amazon Redshift Resources

Broader Topics Related to Amazon Redshift

Amazon Web Services (AWS)

Amazon Web Services (AWS) Amazon Web Services (AWS) is a hosted cloud platform , provided by Amazon. AWS Resources Amazon Web Services…

Columnar Databases

Columnar databases Columnar databases , also knowns as column-oriented databases , are databases that store data on disk as columns, as…

Data warehouses

Data warehouses: Architecture, data flow, and related topics

Amazon Redshift Knowledge Graph