Amazon Kinesis – platform handling massive streaming data
Amazon Elastic MapReduce (Amazon EMR) – managed Hadoop framework on AWS
AWS Data Pipeline – move data between AWS services
AWS Import/Export – move large amounts of data into and out of AWS using physical storage devices
Amazon Kinesis
A platform for handling massive streaming data on AWS. It consists of three services: Amazon Kinesis Firehose, Amazon Kinesis Streams, and Amazon Kinesis Analytics
Use cases: Data ingestion, real-time processing of massive data streams
Amazon Kinesis Firehose
- Load massive volumes of streaming data into Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service
- No need to write code
- Create a delivery stream and configure the destination for your data
- Clients write data to the stream using an AWS API call, and the data is automatically sent to the proper destination
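A minimal producer sketch using boto3; the delivery stream name and record contents are hypothetical assumptions, and the stream with its configured destination (e.g. an S3 bucket) is assumed to already exist.

```python
import json
import boto3

# Assumes a delivery stream named "clickstream-to-s3" already exists
# and is configured with its destination (e.g. an S3 bucket).
firehose = boto3.client("firehose", region_name="us-east-1")

record = {"user_id": "u-123", "page": "/home", "ts": "2016-01-01T00:00:00Z"}

# Firehose buffers the record and delivers it to the configured
# destination automatically -- no consumer code is required.
response = firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
print(response["RecordId"])
```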
Amazon Kinesis Streams
- Allows you to collect and process large streams of data in real time
- Using the AWS SDKs, you can create an Amazon Kinesis Streams application that processes the data as it moves through the stream
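A minimal producer/consumer sketch against a Kinesis stream with boto3; the stream name and partition key are hypothetical, and the single-shard consumer is a simplification (a real application would typically use the Kinesis Client Library and iterate over all shards).

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "orders"  # hypothetical, pre-created stream

# Producer: records with the same partition key go to the same shard.
kinesis.put_record(StreamName=STREAM, Data=b'{"order_id": 1}', PartitionKey="customer-42")

# Consumer: read from a single shard, starting at the oldest record.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=25)["Records"]:
    print(record["SequenceNumber"], record["Data"])
```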
Amazon Kinesis Analytics
- Analyze streaming data in real time using standard SQL queries against data arriving from Amazon Kinesis Streams or Amazon Kinesis Firehose
Amazon Elastic MapReduce (EMR)
A fully managed, on-demand Hadoop framework that reduces the complexity and up-front costs of setting up Hadoop.
Use cases: log processing, clickstream analysis, genomics and life sciences
When you launch an Amazon EMR cluster, you specify the following options (a minimal boto3 sketch follows the list):
- The instance type of the nodes in the cluster
- The number of nodes in the cluster
- The version of Hadoop to run
- Additional tools or applications like Hive, Pig, Spark or Presto
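These options map directly onto the EMR API; the sketch below launches a small cluster with boto3. The cluster name, instance types, release label, and log bucket are placeholder assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="log-processing-cluster",                        # hypothetical cluster name
    ReleaseLabel="emr-5.30.0",                            # release label determines the Hadoop version
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],   # additional tools/applications
    Instances={
        "MasterInstanceType": "m5.xlarge",                # instance type of the nodes
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                               # number of nodes in the cluster
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/",                           # hypothetical log bucket
)
print(response["JobFlowId"])
```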
Two types of storage can be used with EMR:
Hadoop Distributed File System (HDFS)
- The standard Hadoop file system
- All data is replicated across multiple instances
- EMR can use EC2 instance storage or EBS for HDFS
EMR File System (EMRFS)
- An implementation of HDFS that allows clusters to store data on Amazon S3
- Durable and low cost
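A short PySpark sketch of how the two storage options look to a job running on the cluster: hdfs:// paths resolve to instance-storage/EBS-backed HDFS, while s3:// paths go through EMRFS to Amazon S3. The bucket and paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# HDFS: data lives on the cluster's EC2 instance storage or EBS volumes.
logs_hdfs = spark.read.text("hdfs:///data/raw-logs/")

# EMRFS: same read API, but the data is stored durably and cheaply in S3.
logs_s3 = spark.read.text("s3://my-analytics-bucket/raw-logs/")  # hypothetical bucket

print(logs_hdfs.count(), logs_s3.count())
```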
AWS Data Pipeline
A web service to help you process and move data between different AWS compute and storage services
- Also supports on-premises data sources; pipelines run at specified intervals
- The pipeline interacts with the data stored in data nodes
- The pipeline executes activities, such as moving data or running Hive queries
- AWS Data Pipeline supports preconditions that must be satisfied before an activity can run
AWS Data Pipeline is best suited for regular batch processes rather than continuous data streams; use Amazon Kinesis for data streams
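A minimal boto3 sketch of creating and activating a pipeline; the names below are hypothetical and only the required Default object is shown (a real definition would add data nodes, activities such as CopyActivity or HiveActivity, and optionally preconditions as further pipeline objects).

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(
    name="daily-log-copy", uniqueId="daily-log-copy-001"  # hypothetical names
)["pipelineId"]

# Skeletal definition: data nodes, activities, and preconditions would be
# appended to this list as additional pipeline objects.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            ],
        }
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```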
AWS Import/Export
A service to accelerate transferring large amounts of data into and out of AWS using physical storage appliances, bypassing the internet.
The data is copied to a device at the source (your data center or an AWS region), shipped via standard shipping carriers, and copied to the destination
AWS Import/Export has 2 features:
- AWS Snowball – uses Amazon-provided shippable storage appliances shipped through UPS; each Snowball is protected by encryption keys managed with AWS KMS
- AWS Import/Export Disk – transfers data onto or off of storage devices that you own; these jobs cannot be managed via the AWS Snowball console.
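A minimal boto3 sketch of creating a Snowball import job; the bucket ARN, address ID, KMS key, and IAM role are placeholder assumptions that would come from your own account (the shipping address is created beforehand, e.g. with create_address).

```python
import boto3

snowball = boto3.client("snowball", region_name="us-east-1")

response = snowball.create_job(
    JobType="IMPORT",  # ship data into AWS
    Resources={
        "S3Resources": [
            {"BucketArn": "arn:aws:s3:::my-import-bucket"}  # hypothetical destination bucket
        ]
    },
    AddressId="ADID00000000-0000-0000-0000-000000000000",                 # placeholder address ID
    KmsKeyARN="arn:aws:kms:us-east-1:111122223333:key/placeholder",       # KMS key protecting the appliance
    RoleARN="arn:aws:iam::111122223333:role/snowball-import-role",        # placeholder IAM role
    SnowballCapacityPreference="T80",
    ShippingOption="SECOND_DAY",
)
print(response["JobId"])
```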