
Chapter 11 (3/4) – Analytics Services

Amazon Kinesis – platform handling massive streaming data

Amazon Elastic MapReduce (Amazon EMR) – managed Hadoop on AWS

AWS Data Pipeline – move data between AWS services

AWS Import/Export – move data by physically shipping storage devices

Amazon Kinesis

A platform for handling massive streaming data on AWS. It consists of three services: Amazon Kinesis Firehose, Amazon Kinesis Streams, and Amazon Kinesis Analytics

Use cases: Data ingestion, real-time processing of massive data streams

Amazon Kinesis Firehose

  • Load massive volumes of streaming data into Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service
  • No need to write code
  • Create a delivery stream and configure the destination for your data
  • Clients write data to the stream using an AWS API call, and the data is automatically delivered to the configured destination (see the sketch below)
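
A minimal sketch of that single client-side API call, using the boto3 SDK; the delivery stream name, region, and record fields are assumed placeholders, and the stream is assumed to already exist with its destination configured:

```python
import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

record = {"user_id": 42, "event": "click", "ts": "2017-01-01T00:00:00Z"}

# Firehose buffers the record and delivers it to the configured destination
# (S3, Redshift, or Elasticsearch) with no consumer code on our side.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # assumed to exist already
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```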

Amazon Kinesis Streams

  • Allows you to collect and process large streams of data in real time
  • Using the AWS SDKs, you can create an Amazon Kinesis Streams application that processes the data as it moves through the stream (see the sketch below)
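
A minimal sketch of a producer and a simple consumer, assuming a stream named my-stream with a single shard already exists (stream name, region, and payload are placeholders). A real consumer application would typically use the Kinesis Client Library rather than the raw GetRecords API:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer: the partition key determines which shard receives the record.
kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps({"event": "click", "user_id": 42}).encode("utf-8"),
    PartitionKey="42",
)

# Consumer: read the single shard from the oldest available record.
shard_id = kinesis.describe_stream(StreamName="my-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator)["Records"]:
    print(record["Data"])
```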

Amazon Kinesis Analytics

  • Enables you to analyze streaming data in real time using standard SQL queries

Amazon Elastic MapReduce (EMR)

A fully managed, on-demand Hadoop framework that reduces the complexity and up-front costs of setting up Hadoop.

Use cases: log processing, clickstream analysis, genomics and life sciences

When you launch an Amazon EMR cluster, you specify the following options:

  1. The instance type of the nodes in the cluster
  2. The number of nodes in the cluster
  3. The version of Hadoop to run
  4. Additional tools or applications such as Hive, Pig, Spark, or Presto (see the launch sketch below)
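
A minimal launch sketch with boto3 covering the options above; every value (cluster name, release label, instance types, node count, IAM roles) is an assumed placeholder, not a recommendation:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-5.36.0",  # selects the Hadoop and application versions
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],  # additional tools
    Instances={
        "MasterInstanceType": "m5.xlarge",  # instance type of the nodes
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                 # number of nodes in the cluster
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",      # assumed default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```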

Two types of storage can be used with Amazon EMR:

Hadoop Distributed File System (HDFS)

  • The standard Hadoop file system
  • All data is replicated across multiple instances
  • EMR can use EC2 instance storage or EBS for HDFS

EMR File System (EMRFS)

  • An implementation of HDFS that allows clusters to store data on Amazon S3
  • Durable and low cost
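
A minimal sketch of how the two storage options appear to jobs on a running cluster: EMRFS data is addressed with s3:// URIs, while cluster-local HDFS uses hdfs:// paths. The cluster ID, bucket, and prefixes are assumed placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # assumed existing cluster ID
    Steps=[{
        "Name": "copy-logs-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",                          # EMR's distributed copy tool
                "--src", "s3://my-bucket/raw-logs/",   # EMRFS: durable, low cost
                "--dest", "hdfs:///raw-logs/",         # cluster-local HDFS
            ],
        },
    }],
)
```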

AWS Data Pipeline

A web service to help you process and move data between different AWS compute and storage services

  • Also supports on-premises data sources, moving data at specified intervals
  • The pipeline interacts with the data stored in data nodes
  • The pipeline will execute activities, such as moving data, running Hive queries
  • AWS Data Pipeline supports preconditions, conditional checks that must pass before an activity runs (see the sketch below)
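
A minimal sketch of defining and activating a pipeline with boto3, assuming placeholder names, IAM roles, and a worker group; a production definition would also include data nodes (for example S3 or DynamoDB data nodes) and precondition objects:

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Register an empty pipeline; uniqueId makes the call idempotent.
pipeline_id = dp.create_pipeline(
    name="daily-copy", uniqueId="daily-copy-001")["pipelineId"]

# Attach a minimal definition: a daily schedule plus one shell-command
# activity executed by a worker in the assumed group "my-worker-group".
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startDateTime", "stringValue": "2017-01-01T00:00:00"},
        ]},
        {"id": "CopyLogs", "name": "CopyLogs", "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo copying data"},
            {"key": "workerGroup", "stringValue": "my-worker-group"},
        ]},
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```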

AWS Data Pipeline is best for regular batch processes instead of continuous data streams; use Amazon Kinesis for data streams

AWS Import/Export

A service to accelerate transferring large amounts of data into and out of AWS using physical storage appliances, bypassing the internet.

The data is copied to a device at the source (your data center or an AWS Region), shipped via standard shipping services, and copied to the destination.

AWS Import/Export has two features:

  • AWS Snowball – uses Amazon-provided shippable storage appliances shipped through UPS; data on each Snowball is encrypted with keys managed through AWS KMS
  • AWS Import/Export Disk – transfers data onto or off of storage devices that you own; these jobs cannot be managed from the AWS Snowball console
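
A minimal sketch of creating a Snowball import job with boto3; the bucket, address ID, IAM role, and KMS key ARNs are all assumed placeholders (the address ID would normally come from a prior create_address call):

```python
import boto3

snowball = boto3.client("snowball", region_name="us-east-1")

# Import job: AWS ships a Snowball to the address, data is loaded on site,
# and AWS copies it into the named S3 bucket when the device is returned.
response = snowball.create_job(
    JobType="IMPORT",
    Resources={"S3Resources": [{"BucketArn": "arn:aws:s3:::my-import-bucket"}]},
    AddressId="ADID00000000-0000-0000-0000-000000000000",
    RoleARN="arn:aws:iam::123456789012:role/snowball-import-role",
    KmsKeyARN="arn:aws:kms:us-east-1:123456789012:key/example-key-id",  # protects the device
    SnowballCapacityPreference="T80",
    ShippingOption="SECOND_DAY",
)
print(response["JobId"])
```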
