Amazon EMR (Elastic MapReduce) (23rd Nov)
Amazon EMR (previously called Amazon Elastic MapReduce) is a fully managed cluster platform that simplifies running big data frameworks (like Apache Hadoop, Apache Spark, Hive, Presto) on AWS.
- Architecture:
  - Runs open-source frameworks such as Hadoop and Spark on a resizable cluster of Amazon EC2 instances.
  - Cluster nodes are categorized as Master (called "primary" in current AWS documentation), Core, and Task nodes.
- Network Access:
  - EMR clusters are launched into a VPC.
  - Use Security Groups to control inbound and outbound traffic between nodes.
- Primary Storage:
  - Can use Amazon S3 (through EMRFS) for scalable, durable storage that persists after the cluster terminates.
  - Optionally, use HDFS (Hadoop Distributed File System) for storage local to the cluster; HDFS data is lost when the cluster terminates.
- Cost Management (Important):
  - Use Spot Instances for Task nodes to reduce cost. Task nodes store no HDFS data, so a Spot interruption costs compute capacity but not data.
  - Use Managed Scaling to automatically adjust node capacity based on utilization.
  - Use Transient Clusters for episodic workloads to avoid paying for idle clusters.
- Limitations:
  - Storage and performance: EMR does not provide the same low-latency, high-performance storage as Amazon FSx for Lustre, which is better suited to high-performance workflows.
  - Workload focus: Primarily designed for distributed data processing, not high-performance computing (HPC).
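The cost-management points above (Spot for Task nodes, Managed Scaling, a transient cluster) can be sketched as one request for boto3's EMR `run_job_flow` call. This is a minimal sketch, not a recommended production setup: the function name, instance types, node counts, scaling limits, and release label are illustrative assumptions, and the two role names assume the default EMR roles already exist in the account.

```python
def build_transient_cluster_request(name, log_uri, subnet_id, script_uri):
    """Build keyword arguments for boto3's EMR run_job_flow (illustrative values)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",          # assumed release label
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "ServiceRole": "EMR_DefaultRole",      # assumes default roles exist
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "Instances": {
            "Ec2SubnetId": subnet_id,          # cluster is launched into a VPC subnet
            # Transient cluster: terminate once the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 2},
                # Task nodes run on Spot to reduce cost.
                {"Name": "task", "InstanceRole": "TASK",
                 "Market": "SPOT", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
        },
        # Managed Scaling: EMR resizes the cluster within these limits.
        "ManagedScalingPolicy": {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 2,
                "MaximumCapacityUnits": 10,
                "MaximumOnDemandCapacityUnits": 2,  # capacity above this is Spot
                "MaximumCoreCapacityUnits": 2,      # growth happens in Task nodes
            }
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_uri],
            },
        }],
    }
```

With AWS credentials configured, the request could be submitted as `boto3.client("emr").run_job_flow(**request)`. Reading input from and writing output to S3 rather than HDFS is what makes the transient pattern safe, since the data outlives the cluster.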
1. When not to use EMR
If the goal is to minimize infrastructure management, EMR might not be the best choice, because you still have to manage the underlying clusters.
- Amazon EMR is highly scalable and powerful for data processing tasks, but it comes with the overhead of provisioning and managing clusters, which can add operational complexity and cost.
- Alternative Services: To reduce infrastructure management, consider serverless services like AWS Lambda, AWS Glue, or Amazon Athena, which abstract away the infrastructure and scale automatically based on demand.