Amazon Athena

Athena is an interactive query service for analyzing data stored in Amazon S3 using standard SQL. It is serverless, so there is no infrastructure to set up or manage, and customers pay only for the queries they run.

It supports structured and semi-structured data formats such as CSV, JSON, Parquet, and more, making it ideal for flexible exploration of data.

Features

  1. Works directly on files in S3.
  2. Does not modify data in Amazon S3 during analysis; the schema is applied when a query is executed (schema-on-read).
  3. Well suited to large volumes of raw or semi-structured data in S3, but less suited to complex, high-performance analytics on massive datasets.
  4. Use Redshift for complex, high-performance analytics on structured data.
  5. Use Athena to process logs, perform ad-hoc analysis, and run interactive queries.
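
Schema-on-read (feature 2) means the files stay untouched and a schema is applied only when a query reads them. The following is a minimal local sketch of that idea, using an in-memory CSV string as a stand-in for an S3 object. The sample data, column names, and the `read_with_schema` helper are hypothetical, and this is not a description of how Athena works internally.

```python
import csv
import io

# Raw file contents, standing in for an object in S3 (hypothetical sample data).
RAW_CSV = "order_id,amount,country\n1001,19.99,DE\n1002,5.00,US\n1003,42.50,DE\n"

# The schema lives outside the file and is applied only at read time.
SCHEMA = {"order_id": int, "amount": float, "country": str}


def read_with_schema(raw_text, schema):
    """Yield typed rows from raw CSV text without modifying the text."""
    reader = csv.DictReader(io.StringIO(raw_text))
    for row in reader:
        yield {name: cast(row[name]) for name, cast in schema.items()}


rows = list(read_with_schema(RAW_CSV, SCHEMA))

# An ad-hoc "query": total amount for one country.
total_de = sum(r["amount"] for r in rows if r["country"] == "DE")
print(round(total_de, 2))
```

Changing `SCHEMA` changes how the same raw text is interpreted, which is the point of schema-on-read: the stored file never has to be rewritten.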

1. What does Athena do?

Imagine you have a spreadsheet (CSV file) stored in Amazon S3. Athena lets you write SQL queries to analyze or fetch specific information from that spreadsheet without setting up a database or server.
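
As a rough sketch of what that looks like in practice, the Python below builds a table definition over CSV files in S3 and a query request for boto3's Athena client. The bucket, database, table, and column names are all made up for illustration, and `run_query` is only defined, not called, because it needs boto3 and AWS credentials.

```python
def create_table_ddl(table, s3_location):
    """Return DDL that maps a table onto CSV files already in S3.

    The three columns are hypothetical; adjust them to the real file.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  order_id string,\n"
        "  amount double,\n"
        "  country string\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


def query_request(sql, database, output_location):
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID (not called in this sketch)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical bucket and database names.
ddl = create_table_ddl("sales", "s3://example-bucket/sales/")
request = query_request(
    "SELECT country, SUM(amount) AS revenue FROM sales GROUP BY country",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
print(ddl)
print(request["QueryString"])
```

The table definition only points at the files; no data is loaded or copied, and query results are written to the separate output location.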

2. Athena and Glue together

  1. Athena: For querying and analyzing already-prepared data in S3.
  2. Glue: For organizing, cleaning, and preparing messy or scattered data before analysis.

Both services work well together:

  1. Use Glue to prepare and clean your data.
  2. Use Athena to analyze it.

3. Can Athena be used on DynamoDB and Amazon RDS?

Athena primarily works with data stored in Amazon S3 and does not query DynamoDB or RDS out of the box. With some additional configuration it can still analyze data from these services, for example by first exporting that data to S3:

  1. Use AWS Glue to extract data from DynamoDB/RDS and store it in S3 in a queryable format like Parquet or JSON.
  2. Once the DynamoDB/RDS data is in S3, you can use Athena to query it just like any other data in S3.
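
The export step can be pictured with a small local sketch: DynamoDB items in their attribute-value format are flattened into one plain JSON object per line, the layout Athena's JSON SerDes expect. In a real pipeline a Glue job would do this work; the sample items and the `plain_value` and `to_json_lines` helpers here are hypothetical, and only a few attribute types are handled.

```python
import json

# Sample items in DynamoDB's attribute-value format (hypothetical data).
ITEMS = [
    {"user_id": {"S": "u1"}, "score": {"N": "42"}, "active": {"BOOL": True}},
    {"user_id": {"S": "u2"}, "score": {"N": "7.5"}, "active": {"BOOL": False}},
]


def plain_value(attr):
    """Convert one DynamoDB attribute value to a plain Python value.

    Only the S, N, BOOL and NULL types are handled in this sketch.
    """
    type_tag, value = next(iter(attr.items()))
    if type_tag == "S":
        return value
    if type_tag == "N":
        number = float(value)
        return int(number) if number.is_integer() else number
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    raise ValueError(f"unsupported attribute type: {type_tag}")


def to_json_lines(items):
    """Return the items as newline-delimited JSON, one object per line."""
    return "\n".join(
        json.dumps({key: plain_value(attr) for key, attr in item.items()}, sort_keys=True)
        for item in items
    )


json_lines = to_json_lines(ITEMS)
print(json_lines)
```

The resulting text would be uploaded to an S3 prefix and exposed to Athena through a table definition, after which it is queried like any other S3 data.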

4. Athena and Redshift together in a modern data architecture

  1. Use Athena for initial exploration and analysis on raw data in S3.
  2. Once the data is cleaned and transformed, load it into Redshift for advanced, high-performance analytics and dashboards.
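
One way to picture the hand-off in step 2 is a Redshift `COPY` statement that loads cleaned Parquet files from S3. The sketch below only builds the statement text; the table name, S3 prefix, and IAM role ARN are placeholders, and running the statement would need a Redshift connection that is not shown.

```python
def copy_statement(table, s3_prefix, iam_role_arn):
    """Return a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Placeholder names: none of these refer to real resources.
statement = copy_statement(
    table="analytics.sales_clean",
    s3_prefix="s3://example-bucket/clean/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

The raw data that Athena explored stays in S3; only the cleaned, transformed output is loaded into Redshift.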

5. Athena, Redshift and Kinesis Data Analytics

Amazon Kinesis Data Analytics is designed for real-time data processing and analysis, which makes it fundamentally different from Athena and Redshift. All three services are used for data analysis, but they serve different purposes.

| Feature | Kinesis Data Analytics | Athena | Redshift |
| --- | --- | --- | --- |
| Nature of Data | Real-time, streaming data | Static data stored in S3 | Structured, relational data |
| Use Case | Real-time analytics (e.g., monitoring) | Ad-hoc or batch querying | Complex, high-performance analytics |
| Data Source | Kinesis Streams, Kafka, Firehose | Files in S3 (CSV, JSON, Parquet) | Redshift tables or data from S3/RDS |
| Processing Speed | Millisecond/second-level latency | On-demand, batch processing | Scheduled, batch analytics |
| Cost Model | Pay for compute and processing time | Pay-per-query | Pay for compute and storage |
| Example | Detect fraudulent transactions as they occur | Summarize static logs in S3 | Generate dashboards/reports |