Amazon Redshift

Redshift is a fully managed, scalable data warehouse (stores structured data) designed and optimized for running complex analytical queries on large structured data. It is not just storage, but also a powerful platform for data processing and querying, enabling complex analytics on vast amounts of data.

It more expensive than Athena for intermittent or small workloads.

1. Key Points

  1. Redshift Stores structured data, optimized for fast retrieval and complex analytics.
  2. Redshift supports complex SQL queries, allowing for advanced analytics on petabytes of data.
  3. Redshift can integrate with the following AWS services for streamlined workflows, data loading, and visualization.
    1. Amazon S3(via Redshift Spectrum or COPY)
    2. Amazon DynamoDB
    3. Amazon QuickSight
    4. AWS Glue
    5. Kinesis
  4. Advanced Features:
    1. Machine Learning: Native ML integration for predictive analytics.
    2. Data Sharing: Enables cross-cluster analytics.
    3. Automatic Scaling: Adjusts to varying workloads, ensuring cost efficiency.
  5. Cost Consideration: You pay for provisioned clusters (compute + storage), even when idle — unless using Redshift Serverless, which charges per query/runtime.

2. Redshift COPY and Redshift Spectrum

  1. Redshift COPY Command
    • The COPY command is used to load data into a Redshift table from external sources like Amazon S3, DynamoDB, or local files.
    • Use when you want to bring external data into Redshift for structured analysis and long-term storage.
  2. Redshift Spectrum
    • Redshift Spectrum enables direct querying of data stored in Amazon S3 without the need to load it into Redshift tables.
    • Use when You want to analyze data in S3 without moving it into Redshift or You're performing ad hoc queries on data that doesn’t need to reside permanently in Redshift.

alt text

3. Data Warehouse vs Data Lake vs Data Lakehouse

Data Warehouse

  1. Stores structured data in fixed schemas (tables, columns).
  2. Holds current and historical data from multiple systems.
  3. Optimized for analytics and reporting.
  4. Example: Amazon Redshift, Snowflake.

Data Lake

  1. Stores raw, unstructured or semi-structured data (logs, images, JSON, CSV).
  2. Can hold current and historical data.
  3. Schema is applied on read (“schema-on-read”), not on write.
  4. Ideal for big data, machine learning, and real-time analytics.
  5. Example: Amazon S3 + AWS Lake Formation.

Data Lakehouse

  1. Combines data warehouse + data lake features.
  2. Allows raw and refined data to coexist in the same platform.
  3. Avoids moving data between lake and warehouse for analytics.
  4. Example: Amazon Athena + S3 + Glue catalog, Databricks Lakehouse.

Athena: Amazon Athena is neither a data warehouse nor a data lake itself — but it’s closely associated with data lakes. It is a serverless, interactive query service used to analyze data directly in Amazon S3 using standard SQL. It works on top of your data lake (usually stored in S3).