Big Data Analytics Options on AWS
Erik Swensson December 2014
Introduction
The AWS Advantage in Big Data Analytics
Amazon Redshift
Amazon Kinesis
Amazon Elastic MapReduce
Amazon DynamoDB
Application on Amazon EC2
Solving Big Data Problems
Example 1: Enterprise Data Warehouse
Example 2: Capturing and Analyzing Sensor Data
Conclusion
Further Reading
Amazon Web Services (AWS) is a flexible, cost-effective, easy-to-use cloud computing platform. The AWS Cloud delivers a comprehensive portfolio of secure and scalable cloud computing services in a self-service, pay-as-you-go model, with zero capital expense, to handle your big data analytics workloads. These services include real-time streaming analytics, data warehousing, NoSQL and relational databases, object storage, analytics tools, and data workflow services. This whitepaper provides an overview of the different big data options available in the AWS Cloud for architects, data scientists, and developers. For each of the big data analytics options, this paper describes the following:
Ideal usage patterns
Performance
Durability and availability
Cost model
Scalability
Elasticity
Interfaces
Anti-patterns
This paper describes two scenarios showcasing the analytics options in use and provides additional resources to get started with big data analytics on AWS.
Introduction
As we become a more digital society, the amount of data being created and collected is accelerating significantly. Analyzing this ever-growing data set is a challenge with traditional analytical tools. Innovation is required to bridge the gap between the amount of data that is being generated and the amount of data that can be analyzed effectively. Big data tools and technologies offer ways to efficiently analyze data to better understand customer preferences, to gain a competitive advantage in the marketplace, and to use as a lever to grow your business. The AWS ecosystem of analytical solutions is specifically designed to handle this growing amount of data and provide insight into ways your business can collect and analyze it.
The AWS Advantage in Big Data Analytics
Analyzing large data sets requires significant compute capacity that can vary in size based on the amount of input data and the analysis required. This characteristic of big data workloads is ideally suited to the pay-as-you-go cloud computing model, where applications can easily scale up and down based on demand. As requirements change you can easily resize your environment (horizontally or vertically) on AWS to meet your
needs without having to wait for additional hardware or over-invest to provision enough capacity. With more traditional infrastructure, system designers of mission-critical applications have no choice but to over-provision, because the system must be able to handle a surge in data driven by increased business need. By contrast, on AWS you can provision more capacity and compute in a matter of minutes, meaning that your big data applications grow and shrink as demand dictates, and your system runs as close to optimal efficiency as possible. In addition, you get flexible computing on a world-class infrastructure with access to the many different geographic regions that AWS offers,1 along with the ability to use other scalable services that AWS offers, such as Amazon Simple Storage Service (S3)2 and AWS Data Pipeline.3 These capabilities of the AWS platform make it an extremely good fit for solving big data problems. You can read about many customers that have implemented successful big data analytics workloads on AWS on the AWS case studies web page.4
Amazon Redshift
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data efficiently using your existing business intelligence tools.5 It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more, and is designed to cost less than a tenth of the cost of most traditional data warehousing solutions. Amazon Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and by parallelizing and distributing queries across multiple nodes. As a managed service, it automates most of the common administrative tasks associated with provisioning, configuring, monitoring, backing up, and securing a data warehouse, making it easy and inexpensive to manage and maintain. This automation allows you to build a petabyte-scale data warehouse in minutes, a task that has traditionally taken weeks or months to complete in an on-premises implementation.
Ideal Usage Pattern
Amazon Redshift is ideal for online analytical processing (OLAP) using your existing business intelligence tools. Organizations are using Amazon Redshift to do the following:
Analyze global sales data for multiple products
Store historical stock trade data
Analyze ad impressions and clicks
Aggregate gaming data
Analyze social trends
1 http://aws.amazon.com/about-aws/globalinfrastructure/
2 http://aws.amazon.com/s3/
3 http://aws.amazon.com/datapipeline/
4 http://aws.amazon.com/solutions/case-studies/big-data/
5 http://aws.amazon.com/redshift/
Measure clinical quality, operational efficiency, and financial performance in health care
Cost Model
An Amazon Redshift data warehouse cluster requires no long-term commitments or upfront costs. This frees you from the capital expense and complexity of planning and purchasing data warehouse capacity ahead of your needs. Charges are based on the size and number of nodes in your cluster. There is no additional charge for backup storage up to 100% of your provisioned storage. For example, if you have an active cluster with 2 XL nodes for a total of 4 TB of storage, AWS provides up to 4 TB of backup storage in Amazon S3 at no additional charge. Backup storage beyond the provisioned storage size, and backups stored after your cluster is terminated, are billed at standard Amazon S3 rates.6 There is no data transfer charge for communication between Amazon S3 and Amazon Redshift. For details about Amazon Redshift pricing, go to the pricing web page.7
Performance
Amazon Redshift uses a variety of innovations to obtain very high performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. It uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries. Amazon Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take advantage of all available resources. The underlying hardware is designed for high-performance data processing, using locally attached storage to maximize throughput between the Intel Xeon E5 processors and drives, and a 10 GigE mesh network to maximize throughput between nodes. Performance can be tuned based on your data warehousing needs; AWS offers high-density compute with solid-state drives (SSDs) as well as high-density storage options.
Durability and Availability
An Amazon Redshift cluster will automatically detect and replace a failed node in your data warehouse cluster. The data warehouse cluster will be read-only until a replacement node is provisioned and added to the database, which typically takes only a few minutes. Amazon Redshift makes your replacement node available immediately and streams your most frequently accessed data from Amazon S3 first to allow you to resume querying your data as quickly as possible. Additionally, your Amazon Redshift data warehouse cluster will remain available in the event of a drive failure. Because Amazon Redshift mirrors your data across the cluster, it will use the data from another node to rebuild failed drives. Amazon Redshift clusters reside within one Availability Zone (AZ).8 If you want a multi-AZ setup for Amazon Redshift, you can set up a mirror and then self-manage replication and failover.
6 http://aws.amazon.com/s3/pricing/
7 http://aws.amazon.com/redshift/pricing/
Scalability and Elasticity
With a few clicks in the AWS Management Console,9 or a simple API call,10 you can easily change the number or type of nodes in your data warehouse as your performance or capacity needs change. Using Amazon Redshift you can start with as little as a single 160 GB node and scale up all the way to a petabyte or more of compressed user data using many nodes. To learn more about Amazon Redshift node types go to the Amazon Redshift documentation.11 While resizing, Amazon Redshift places your existing cluster into read-only mode, provisions a new cluster of your chosen size, and then copies data from your old cluster to your new one in parallel. During this process you only pay for the active Amazon Redshift cluster. You can continue running queries against your old cluster while the new one is being provisioned. After your data has been copied to your new cluster, Amazon Redshift will automatically redirect queries to your new cluster and remove the old cluster.
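As an illustration of the API-based resize described above, the following sketch uses the AWS SDK for Python (Boto3); the cluster identifier, node type, and node count are hypothetical placeholders, not values from this paper.

```python
import boto3

# Hypothetical example: resize an existing cluster named "analytics-dw"
# to four dc1.large nodes. Identifier and sizing are placeholders.
redshift = boto3.client("redshift", region_name="us-east-1")

redshift.modify_cluster(
    ClusterIdentifier="analytics-dw",
    NodeType="dc1.large",
    NumberOfNodes=4,
)

# The resize runs in the background; the cluster stays in read-only mode
# while data is copied to the new set of nodes, as described above.
```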
Interfaces
Amazon Redshift uses industry-standard SQL and is accessed using standard JDBC or ODBC drivers (for more information on Amazon Redshift drivers, go to the documentation).12 There are numerous examples of validated integrations with many popular business intelligence (BI) and extract, transform, and load (ETL) vendors.13 Data loads and unloads run in parallel across the compute nodes to maximize the rate at which data can move into and out of your data warehouse cluster, to and from Amazon S3 and Amazon DynamoDB. Metrics for compute utilization, memory utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available at no additional charge via the AWS Management Console or Amazon CloudWatch APIs.14
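Because Amazon Redshift is accessed with standard drivers, any PostgreSQL-compatible client can issue queries against it. The sketch below assumes the psycopg2 library; the endpoint, credentials, and table name are hypothetical placeholders.

```python
import psycopg2

# Placeholder endpoint and credentials -- substitute your own cluster values.
conn = psycopg2.connect(
    host="analytics-dw.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,                  # default Amazon Redshift port
    dbname="dev",
    user="awsuser",
    password="example-password",
)

with conn.cursor() as cur:
    # A simple aggregate query; "sales" is a hypothetical table.
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
    for region, total in cur.fetchall():
        print(region, total)

conn.close()
```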
Anti-Patterns
Small data sets—Amazon Redshift is built for parallel processing across a cluster. If your data set is less than a hundred gigabytes, you are not going to get all the benefits that Amazon Redshift has to offer; Amazon Relational Database Service (Amazon RDS) may be a better solution.15
8 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
9 http://aws.amazon.com/console/
10 http://docs.aws.amazon.com/redshift/latest/APIReference/Welcome.html
11 http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes
12 http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html
13 http://aws.amazon.com/redshift/partners/
14 http://aws.amazon.com/cloudwatch/
Online transaction processing (OLTP)—Amazon Redshift is designed for data warehouse workloads, delivering extremely fast and inexpensive analytic capabilities. If you require a fast transactional system, then a traditional relational database system built on Amazon RDS, or a NoSQL database offering such as Amazon DynamoDB, is a better choice.
Unstructured data—Data in Amazon Redshift must be structured by a defined schema, rather than supporting arbitrary schema structure for each row. If your data are unstructured you can perform extract, transform, and load (ETL) processes on Amazon Elastic MapReduce (Amazon EMR) to get the data ready for loading into Amazon Redshift.
Binary large object (BLOB) data—If you plan on storing large data files (such as BLOB data, which includes digital video, images, or music), you’ll want to consider storing the data in Amazon S3 and referencing its location in Amazon Redshift. In this scenario, Amazon Redshift keeps track of metadata (e.g., item name, size, date created, owner, and location) about your binary objects, but the large objects themselves would be stored in Amazon S3.
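One way to apply the BLOB pattern described above is to upload the large object to Amazon S3 and store only its metadata, including the S3 location, in Amazon Redshift. The following is a minimal sketch using Boto3 and psycopg2; the bucket, table, connection details, and file sizes are hypothetical.

```python
import boto3
import psycopg2

s3 = boto3.client("s3")

# Upload the large binary object to Amazon S3 (bucket and key are placeholders).
bucket, key = "example-media-bucket", "videos/ad-spot-001.mp4"
s3.upload_file("ad-spot-001.mp4", bucket, key)

# Record only the metadata and S3 location in Amazon Redshift
# ("media_assets" is a hypothetical table).
conn = psycopg2.connect(
    host="analytics-dw.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="example-password",
)
with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO media_assets (item_name, size_bytes, s3_location) VALUES (%s, %s, %s)",
        ("ad-spot-001.mp4", 10485760, "s3://{}/{}".format(bucket, key)),
    )
conn.commit()
conn.close()
```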
Amazon Kinesis
Amazon Kinesis provides an easy way to process fast-moving data in real time.16 You simply provision the level of input and output required for your data stream, in blocks of one megabyte per second (MB/sec), using the AWS Management Console, API,17 or SDKs.18 The size of your stream can be adjusted up or down at any time without restarting the stream and without any impact on the data sources pushing data to Amazon Kinesis. Within 10 seconds, data put into an Amazon Kinesis stream is available for analysis. Stream data is stored across multiple Availability Zones in a region for 24 hours. During that window, data is available to be read, re-read, backfilled, and analyzed, or moved to long-term storage (e.g., Amazon S3 or Amazon Redshift). The Amazon Kinesis client library enables developers to focus on creating their business applications while removing the undifferentiated heavy lifting associated with load-balancing streaming data, coordinating distributed services, and handling fault-tolerant data processing.
15 http://aws.amazon.com/rds/
16 http://aws.amazon.com/kinesis/
17 http://docs.aws.amazon.com/kinesis/latest/APIReference/Welcome.html
18 http://docs.aws.amazon.com/aws-sdk-php/guide/latest/service-kinesis.html
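To make the provisioning step above concrete, the sketch below creates a stream with a chosen number of shards using Boto3; the stream name and shard count are placeholders.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a stream with 2 shards: roughly 2 MB/sec of write capacity and
# 4 MB/sec of read capacity, per the shard limits described under Performance below.
kinesis.create_stream(StreamName="clickstream", ShardCount=2)

# Stream creation is asynchronous; wait until it becomes ACTIVE before writing.
waiter = kinesis.get_waiter("stream_exists")
waiter.wait(StreamName="clickstream")
```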
Ideal Usage Pattern
Amazon Kinesis is useful wherever there is a need to move data rapidly off producers (i.e., data sources) as it is produced, and then to continuously process it. There are many different purposes for continuous processing, including transforming the data before emitting it into another data store, driving real-time metrics and analytics, or deriving and aggregating multiple Amazon Kinesis streams into either more complex streams or downstream processing. The following are typical scenarios for using Amazon Kinesis for analytics.
Real-time data analytics: Amazon Kinesis enables real-time data analytics on streaming data, such as analyzing website clickstream data and customer engagement analytics.
Log and data feed intake and processing: With Amazon Kinesis producers can push data directly into an Amazon Kinesis stream. For example, system and application logs can be submitted to Amazon Kinesis and be available for processing in seconds. This prevents the log data from being lost if the front-end, or application server, fails and reduces local log storage on the source. Amazon Kinesis provides accelerated data intake because you are not batching up the data on the servers before you submit the data for intake.
Real-time metrics and reporting: You can use data ingested into Amazon Kinesis to extract metrics and generate KPIs that power reports and dashboards at real-time speeds. This enables data-processing application logic to work on data as it streams in continuously, rather than waiting for data batches to be sent to the data-processing applications.
Performance
An Amazon Kinesis stream is made up of one or more shards; each shard gives you a capacity of 5 read transactions per second, up to a maximum total of 2 MB of data read per second. Each shard can support up to 1,000 write transactions per second and up to a maximum total of 1 MB of data written per second. The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards. You can provision as many shards as you need to get the throughput capacity you want; for instance, a 1 gigabyte per second write stream would require 1,024 shards.
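The shard arithmetic above can be expressed directly. The following is a simple illustration (plain Python, no AWS calls) that computes the shard count needed for a target throughput.

```python
import math

# Per-shard limits described above.
WRITE_MB_PER_SHARD = 1   # 1 MB/sec and 1,000 records/sec of writes
READ_MB_PER_SHARD = 2    # 2 MB/sec and 5 transactions/sec of reads

def shards_required(write_mb_per_sec, read_mb_per_sec):
    """Return the number of shards needed to satisfy both write and read throughput."""
    for_writes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    return max(for_writes, for_reads, 1)

# A 1 GB/sec (1,024 MB/sec) write stream needs 1,024 shards, matching the text above.
print(shards_required(write_mb_per_sec=1024, read_mb_per_sec=1024))  # -> 1024
```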
Cost Model
Amazon Kinesis has simple pay-as-you-go pricing, with no up-front costs and no minimum fees. You’ll only pay for the resources you consume. There are just two pricing components, an hourly charge per shard and a charge for each 1 million PUT transactions. For detailed pricing information, go to the Amazon Kinesis pricing page.19
19 http://aws.amazon.com/kinesis/pricing/
Applications running on Amazon Elastic Compute Cloud (EC2) that process the Amazon Kinesis stream would also incur standard Amazon EC2 pricing.
Durability and Availability
Amazon Kinesis synchronously replicates data across three Availability Zones in an AWS region, providing high availability and data durability. Additionally, a cursor stored in Amazon DynamoDB can be used to durably track what has been read from an Amazon Kinesis stream. If a system that is in the middle of reading fails, the cursor preserves the spot where reading left off so that a healthy system can pick up from exactly where the failed system stopped.
Scalability and Elasticity
You can increase or decrease the capacity of the stream at any time according to your business or operational needs, without any interruption to ongoing stream processing. By utilizing simple API calls or development tools you can automate scaling of your Amazon Kinesis environment to meet demand and ensure that you only pay for what you need.
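Capacity changes are made by splitting or merging shards. The sketch below splits the first shard of a hypothetical stream at the midpoint of its hash key range using Boto3; the stream name is a placeholder.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Look up the first shard of the (hypothetical) "clickstream" stream.
shard = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"][0]
start = int(shard["HashKeyRange"]["StartingHashKey"])
end = int(shard["HashKeyRange"]["EndingHashKey"])

# Split it at the midpoint of its hash key range, doubling that shard's capacity.
kinesis.split_shard(
    StreamName="clickstream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((start + end) // 2),
)
```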
Interfaces
There are two interfaces to Amazon Kinesis: the first is the input interface, which data producers use to put data into Amazon Kinesis; the second is the output interface, which is used to process and analyze the data that comes in. Producers write data using a simple PUT call, either directly via the API or abstracted by an AWS Software Development Kit (SDK) or toolkit.20
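A minimal producer sketch using Boto3 follows; the stream name, partition key, and payload are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Each record needs a partition key, which determines the shard it lands on.
event = {"user_id": "u-123", "page": "/checkout", "ts": "2014-12-01T12:00:00Z"}

kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```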
For processing data that has already been put into Amazon Kinesis, client libraries are provided to build and operate real-time streaming data processing applications. The Amazon Kinesis Client Library (KCL) acts as an intermediary between your business applications, which contain the specific logic needed to process Amazon Kinesis stream data, and the Amazon Kinesis service itself.21 There is also integration to read from an Amazon Kinesis stream into Apache Storm via the Kinesis Storm Spout.22
Anti-Patterns
Small-scale consistent throughput—If you are streaming data at a consistent rate of a couple hundred kilobytes per second or less, Amazon Kinesis will technically function just fine, but it is designed and optimized for much larger volumes of data.
20 http://aws.amazon.com/tools/
21 https://github.com/awslabs/amazon-kinesis-client
22 https://github.com/awslabs/kinesis-storm-spout
Long-term data storage and analytics—Amazon Kinesis is a robust ingestion platform, not a storage platform; data is kept only for a rolling 24-hour period. The design pattern for data that you want to keep or analyze for longer than 24 hours is to move the data into another durable storage service such as Amazon S3,23 Amazon Glacier,24 Amazon Redshift,25 or Amazon DynamoDB.26
Amazon Elastic MapReduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data.27 Amazon EMR uses Apache Hadoop, an open source framework, to distribute your data and processing across a resizable cluster of Amazon EC2 instances. Hadoop provides a framework to run big data processing and analytics; Amazon EMR does all the heavy lifting involved with provisioning, managing, and maintaining the infrastructure and software of a Hadoop cluster.
Ideal Usage Pattern
Amazon EMR’s flexible framework reduces large processing problems and data sets into smaller jobs and distributes them across many compute nodes in a Hadoop cluster. This capability lends itself to many usage patterns with big data analytics. Here are a few examples:
Log processing and analytics
Large extract, transform, and load (ETL) and data movement
Risk modeling and threat analytics
Ad targeting and click stream analytics
Genomics
Predictive analytics
Ad-hoc data mining and analytics
For a more in-depth discussion on Amazon EMR, go to the Amazon EMR Best Practices whitepaper.28
23 http://aws.amazon.com/s3/
24 http://aws.amazon.com/glacier/
25 http://aws.amazon.com/redshift/
26 http://aws.amazon.com/dynamodb/
27 http://aws.amazon.com/elasticmapreduce/
28 https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf
Cost Model
With Amazon EMR you can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is complete. In either scenario you only pay for the hours the cluster is up. Amazon EMR supports a variety of Amazon EC2 instance types (standard, high CPU, high memory, high I/O, etc.) and all Amazon EC2 instance pricing options (On-Demand, Reserved, and Spot). When you launch an Amazon EMR cluster (also called a “job flow”), you choose how many, and what type, of Amazon EC2 Instances to provision. The Amazon EMR price is in addition to the Amazon EC2 price. For more detailed information, go to the Amazon EMR pricing page.29
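To illustrate choosing the number and type of instances when launching a cluster, here is a sketch using Boto3. The release label, instance types, counts, and IAM roles are placeholder assumptions (this paper predates EMR release labels, which later replaced AMI versions).

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small transient cluster; all names, versions, and sizes are placeholders.
response = emr.run_job_flow(
    Name="nightly-log-analysis",
    ReleaseLabel="emr-4.7.2",                  # newer clusters use release labels instead of AMI versions
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 5,                    # 1 master + 4 core nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the work is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```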
Performance
Amazon EMR performance is driven by the type of EC2 instances you choose to run your cluster on. You should choose an instance type suitable for your processing requirements—memory, storage, or processing power. For more information on EC2 instance specifications, go to the Amazon EC2 instance types web page.30
Durability and Availability
By default, Amazon EMR is fault tolerant for core node failures and continues job execution if a slave node goes down. Currently, Amazon EMR does not automatically provision another node to take over for a failed slave node, but customers can use Amazon CloudWatch to monitor the health of nodes and replace failed nodes.31
To help handle the unlikely event of a master node failure we recommend that you back up your data to a persistent store such as Amazon S3. Optionally you can choose to run Amazon EMR with the MapR distribution, which provides a no-NameNode architecture that can tolerate multiple simultaneous failures with automatic failover and fallback.32 The metadata is distributed and replicated, just like the data. With a no-NameNode architecture, there is no practical limit to how many files can be stored, and also no dependency on any external network-attached storage.
Scalability and Elasticity
With Amazon EMR, it is easy to resize a running cluster.33 You can add core nodes, which hold the Hadoop Distributed File System (HDFS), at any time to increase your processing power and your HDFS storage capacity (and throughput). Additionally, you can use Amazon S3 natively, or you can use the Elastic MapReduce File System (EMRFS) along with or instead of local HDFS. This allows you to decouple your memory and compute from your storage, providing greater flexibility and cost efficiency.34
29 http://aws.amazon.com/elasticmapreduce/pricing/
30 http://aws.amazon.com/ec2/instance-types/
31 http://aws.amazon.com/cloudwatch/
32 http://aws.amazon.com/elasticmapreduce/mapr/
33 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage-resize.html
At any time, you can add and remove task nodes, which can process Hadoop jobs but do not maintain HDFS. Some customers add hundreds of instances to their clusters when their batch processing occurs, and remove the extra instances when processing completes. For example, you might not know how much data your clusters will be handling in six months, or you might have spiky processing needs. With Amazon EMR you don't need to guess your future requirements or provision for peak demand, because you can easily add or remove capacity at any time.
Additionally, you can add whole new clusters of various sizes and remove them at any time with a few clicks in the AWS Management Console or by a programmatic API call.35
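A sketch of adding temporary task capacity to a running cluster via the API; the cluster ID, instance type, and sizing are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Add 20 task nodes (no HDFS) to an existing cluster for a batch window.
emr.add_instance_groups(
    JobFlowId="j-EXAMPLEID",        # placeholder cluster ID
    InstanceGroups=[{
        "Name": "batch-task-nodes",
        "InstanceRole": "TASK",     # task nodes process jobs but do not store HDFS
        "InstanceType": "m3.xlarge",
        "InstanceCount": 20,
    }],
)
```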
Interfaces
Amazon EMR supports many tools on top of Hadoop that can be used for big data analytics and each tool has its own interface. Here is a brief summary of the most popular options:
Hive is an open source data warehouse and analytics package that runs on top of Hadoop. Hive is operated by HiveQL, a SQL-based language which allows users to structure, summarize, and query data. HiveQL goes beyond standard SQL, adding first-class support for map/reduce functions and complex extensible user-defined data types like JSON and Thrift. This capability allows processing of complex and unstructured data sources, such as text documents and log files. Hive allows user extensions via user-defined functions written in Java. Amazon EMR has made numerous improvements to Hive, including direct integration with Amazon DynamoDB and Amazon S3. For example, with Amazon EMR you can load table partitions automatically from Amazon S3, write data to tables in Amazon S3 without using temporary files, and access resources in Amazon S3, such as scripts for custom map and/or reduce operations and additional libraries. See the Amazon EMR documentation to learn more about analyzing data with Hive.36
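As one way to run the Hive workflows described above without logging in to the cluster, a HiveQL script stored in Amazon S3 can be submitted as a step. This sketch assumes Boto3 and the command-runner step pattern used on recent EMR release labels (older AMI-based clusters used script-runner.jar); the cluster ID, bucket, and script path are hypothetical.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a HiveQL script stored in S3 as a step on an existing cluster.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLEID",        # placeholder cluster ID
    Steps=[{
        "Name": "daily-clickstream-report",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://example-bucket/queries/report.q"],
        },
    }],
)
```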
Pig is an open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language that allows users to structure, summarize, and query data. In addition to SQL-like operations, Pig Latin also adds first-class support for map/reduce functions and complex extensible user-defined data types. This capability allows processing of complex and unstructured data sources, such as text documents and log files. Pig allows user extensions via user-defined functions written in Java. Amazon EMR has made numerous improvements to Pig, including the ability to use multiple file systems (normally Pig can only access one remote file system), the ability to load custom JARs and scripts from Amazon S3 (e.g., “REGISTER s3://my-bucket/piggybank.jar”), and additional functionality for String and DateTime processing. See the Amazon EMR documentation to learn more about processing data with Pig.37
34 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
35 http://docs.aws.amazon.com/ElasticMapReduce/latest/API/Welcome.html
36 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive.html
Spark is an open source data analytics engine built on Hadoop with the fundamentals for in-memory MapReduce. Spark provides additional speed for certain analytics and is the foundation for other powerful tools such as Shark (SQL-driven data warehousing), Spark Streaming (streaming applications), GraphX (graph systems), and MLlib (machine learning). See the Amazon blogs to learn how to run Spark on Amazon EMR.38
HBase is an open source, non-relational, distributed database modeled after Google’s BigTable. It was developed as part of Apache Software Foundation’s Hadoop project and runs on top of HDFS to provide BigTable-like capabilities for Hadoop. HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because data is stored in-memory instead of on disk. HBase is optimized for sequential write operations, and it is highly efficient for batch inserts, updates, and deletes. HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC). With Amazon EMR, you can back up HBase to Amazon S3 (full or incremental, manual or automated) and you can restore from a previously created backup. See the Amazon EMR documentation to learn more about HBase and Amazon EMR.39
Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS). With this architecture, you can query your data in HDFS or HBase tables very quickly, and leverage Hadoop’s ability to process diverse data types and provide schema at runtime. This makes Impala a great tool to perform interactive, low-latency analytics. Impala also has user-defined functions in Java and C++, and can connect to business intelligence (BI) tools through ODBC and JDBC drivers. Impala uses the Hive metastore to hold information about the input data, including the partition names and data types. See the Amazon EMR documentation to learn more about Impala and Amazon EMR.40
Hunk was developed by Splunk to make machine data accessible, usable, and valuable to everyone. With Hunk you can interactively explore, analyze, and visualize data stored in Amazon EMR and Amazon S3, harnessing Splunk analytics
37 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig.html
38 http://blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/Installing-Apache-Spark-on-an-Amazon-EMR-
39 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hbase.html
40 http://docs.aws.amazon.com/ElasticMapReduce/lates