You may have read a recent post by my colleague Chris Bunch covering his thoughts on public vs. private cloud computing. If not, I suggest giving it a read before diving into this post.

Chris’ article led to some discussion (great!) through a few channels. One of the comments questioned the viability of larger data sets in the public cloud because of the cost of transferring data in and out (specifically for AWS in this instance, and for a few terabytes per month).

I’d like to counter this viewpoint, as I believe public cloud (and notably AWS) still to be the winner of the fight even in this scenario. I’ll outline my thoughts below.

I’ve set up an S3 bucket for hosting media related to this blog, and since this media is by definition public, I wanted a simple way for everyone to list the contents of that bucket publicly, similar to Apache’s Options +Indexes.

My initial thoughts immediately turned to the AWS SDK for JavaScript, which was released not too long ago. But then I came across this article, where a simple JavaScript and jQuery script is used, and I finally ended up using the excellent GitHub project s3-bucket-listing.
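As a rough illustration of what such a listing script works with: a bucket that allows public listing answers a plain GET request with an XML document containing one `<Key>` element per object. The sketch below pulls the keys out of that XML with standard text tools. The function name `list_keys` and the `BUCKET` placeholder are assumptions for illustration, and the text matching is deliberately crude (it does not decode XML entities), so treat it as a sketch rather than a replacement for s3-bucket-listing.

```shell
# Print one object key per line from an S3 bucket listing (XML) read on stdin.
# This is plain text matching, not a real XML parser.
list_keys() {
    tr -d '\n' \
        | grep -o '<Key>[^<]*</Key>' \
        | sed -e 's/<Key>//' -e 's/<\/Key>//'
}

# Hypothetical usage, assuming the bucket policy allows public listing
# and BUCKET holds your bucket name:
#   curl -s "https://${BUCKET}.s3.amazonaws.com/?list-type=2" | list_keys
```

A browser-side script does essentially the same thing, then renders the keys as links.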

Windows Server 2012 introduced some noticeable improvements to storage in Windows. While in previous versions of Windows Desktop and Server operating systems you had very limited options, such as mirroring, striping and RAID5, Windows Server 2012 has introduced some new fundamental concepts:

  • Storage pools. A collection of physical disks that enables you to aggregate disks and expand capacity in a flexible manner.
  • Storage spaces. Virtual disks created from free space in a storage pool. Storage spaces have attributes such as resiliency level and fixed or thin provisioning.

Windows Server 2012 R2 brings further improvements: Microsoft added the ability to use different tiers of storage with storage spaces. For example, you can use an SSD tier as a read and write cache to significantly improve performance.

This part is pretty straightforward. We will effectively use the ephemeral SSD as the primary storage device for our EC2 instance, and at the same time use lsyncd to asynchronously replicate files to another directory mounted on an EBS volume, where you get the persistence of EBS and the durability of snapshots stored in S3.

One downside to this approach is that when you recover from a failure, a manual step is required to copy the data from EBS back onto the SSD. Note, however, that if you are planning to use something like this in production, it is recommended to use a high availability solution in which you synchronise to another node in a different availability zone as well as locally to EBS.
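The replication and the manual recovery step can be sketched roughly as follows. The mount points `/mnt/ssd` and `/mnt/ebs` and the function names are assumptions for illustration, not taken from a real setup. The commands are only printed by default; set `DRY_RUN=0` to actually run them.

```shell
# Assumed mount points (hypothetical): ephemeral SSD and EBS volume.
SSD_DIR="${SSD_DIR:-/mnt/ssd}"
EBS_DIR="${EBS_DIR:-/mnt/ebs}"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Normal operation: lsyncd watches the SSD directory and replicates
# changes to the EBS directory asynchronously using rsync.
start_replication() {
    run lsyncd -rsync "$SSD_DIR" "$EBS_DIR"
}

# Recovery: copy the last replicated state from EBS back onto the fresh,
# empty SSD before starting replication again.
restore_from_ebs() {
    run rsync -a "$EBS_DIR/" "$SSD_DIR/"
}
```

After a failure you would call `restore_from_ebs` first and `start_replication` second, so that lsyncd does not push the empty SSD directory over the surviving copy.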

ZFS is the best thing since sliced bread. It is both a file system and a volume manager combined into complete awesomeness. It has features that will blow any competition out of the water. Here are just a few:

  • 128-bit file system, so it could store 256 quadrillion ZB (a ZB is a billion TB).
  • Checksums are stored with metadata, allowing ZFS to scrub volumes and fix silent data corruption.
  • Copy on Write, meaning that when data is changed it is not overwritten — it is always written to a new block. Think EBS snapshots.
  • Pooled Data Storage: ZFS takes available storage drives and pools them together as a single resource allowing efficient use of capacity available.
  • Compression.
  • Inline block level deduplication – this one is particularly magnificent.
  • ZFS send/receive for replicating data between systems.
  • NFS, CIFS and iSCSI sharing of volumes directly out of ZFS.
  • SSD Hybrid Storage Pools, allowing ZFS to use SSDs as L2ARC (read cache) and as a ZIL device (the intent log, which speeds up synchronous writes). This is what we will do here.
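A minimal sketch of such a hybrid pool, assuming an EBS volume at `/dev/xvdd` and two ephemeral SSDs at `/dev/xvdb` and `/dev/xvdc`. The device names, the pool name `tank` and the function name are placeholders for illustration. The commands are only printed by default; set `DRY_RUN=0` to actually run them.

```shell
# Hypothetical names: adjust to your own pool and devices.
POOL="${POOL:-tank}"
EBS_DEV="${EBS_DEV:-/dev/xvdd}"      # persistent data lives here
CACHE_DEV="${CACHE_DEV:-/dev/xvdb}"  # SSD used as L2ARC
LOG_DEV="${LOG_DEV:-/dev/xvdc}"      # SSD used as separate ZIL device

# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_hybrid_pool() {
    # Create the pool on the EBS volume.
    run zpool create "$POOL" "$EBS_DEV"
    # Add one SSD as a cache (L2ARC) device.
    run zpool add "$POOL" cache "$CACHE_DEV"
    # Add the other SSD as a log (ZIL) device.
    run zpool add "$POOL" log "$LOG_DEV"
}
```

The L2ARC only ever holds copies of data that also lives in the pool, which is why an ephemeral disk is a comfortable fit for the cache role.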

I could spend the rest of 2014 writing about ZFS and I still wouldn’t do it justice. Thankfully, there are people much smarter than me who have done that already, so I’ll point you in their direction.

bcache is a block layer cache in the Linux kernel (hence the name, block cache). It allows one or more fast storage devices such as SSDs to act as a cache for one or more slower drives, effectively creating a hybrid drive. Sounds like just the right tool for the job.

bcache has a few interesting features. The following are worth noting:

  • A single cache device can be used to cache multiple devices.
  • Recovers from unclean shutdown.
  • Several cache modes: writethrough, writeback and writearound.
  • Designed for SSDs: it never performs random writes and turns them into sequential writes instead.
  • It was merged into the Linux kernel mainline in kernel version 3.10.
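A minimal sketch of a bcache setup along these lines, assuming the slow backing device is `/dev/xvdd` and the SSD is `/dev/xvdb`. The device names, the resulting `bcache0` name and the function names are assumptions for illustration. The commands are only printed by default; set `DRY_RUN=0` to actually run them, bearing in mind that formatting destroys existing data on both devices.

```shell
# Hypothetical devices: adjust to your own layout.
BACKING_DEV="${BACKING_DEV:-/dev/xvdd}"  # slow, persistent device
CACHE_DEV="${CACHE_DEV:-/dev/xvdb}"      # fast SSD
CACHE_MODE="${CACHE_MODE:-writeback}"    # writethrough, writeback or writearound

# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Select the cache mode of a bcache device through sysfs.
set_cache_mode() {
    mode_file="/sys/block/$1/bcache/cache_mode"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ echo $2 > $mode_file"
    else
        echo "$2" > "$mode_file"
    fi
}

setup_bcache() {
    # Formatting the backing and cache devices in one call also
    # attaches the cache to the backing device.
    run make-bcache -B "$BACKING_DEV" -C "$CACHE_DEV"
    # Assumes this is the first bcache device on the system (bcache0).
    set_cache_mode bcache0 "$CACHE_MODE"
}
```

The resulting `/dev/bcache0` is then formatted and mounted like any other block device.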

flashcache and dm-cache are similar to bcache and offer more or less the same functionality.

In this part we will discuss using software RAID in Linux and setting up a special mdadm mirror between the ephemeral SSD and the EBS volume. Setting the --write-mostly flag on the EBS device in the mirror ensures that the md driver avoids reading from EBS if at all possible and sends all reads to the SSD. This option was originally added for mirroring over a slow network interface, but it works equally well for concentrating reads on an SSD.
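A minimal sketch of creating such a mirror, assuming the ephemeral SSD is `/dev/xvdb`, the EBS volume is `/dev/xvdd` and the array is `/dev/md0`. These names and the function name are placeholders for illustration. The command is only printed by default; set `DRY_RUN=0` to actually run it.

```shell
# Hypothetical devices: adjust to your own layout.
MD_DEV="${MD_DEV:-/dev/md0}"
SSD_DEV="${SSD_DEV:-/dev/xvdb}"
EBS_DEV="${EBS_DEV:-/dev/xvdd}"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_mirror() {
    # --write-mostly applies to the devices listed after it, so only
    # the EBS volume is flagged and reads are served from the SSD.
    run mdadm --create "$MD_DEV" --level=1 --raid-devices=2 \
        "$SSD_DEV" --write-mostly "$EBS_DEV"
}
```

The position of the flag matters: placing it before both devices would mark the SSD as write-mostly too and defeat the purpose.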

In our test environment we have a c3.8xlarge instance running. This is the disk configuration:

$ lsblk
xvda    202:0    0     8G  0 disk
└─xvda1 202:1    0     8G  0 part /
xvdb    202:16   0   320G  0 disk
xvdc    202:32   0   320G  0 disk
xvdd    202:48   0   400G  0 disk

/dev/xvdb and /dev/xvdc are the 320GB ephemeral SSD disks available to a c3.8xlarge instance, while /dev/xvdd is a provisioned IOPS EBS volume with 4000 IOPS.

When you launch an instance in Amazon EC2, the instance type that you specify determines the hardware of the host computer used for your instance. Instance types include varying combinations of CPU, memory, storage, and networking capacity so that you can choose the appropriate mix of resources for your needs.

A noticeable improvement in the current generation of instance types is the introduction of SSD storage for the instance store. The disadvantage of the instance store is that it is ephemeral, meaning it persists only during the lifetime of its associated instance. To keep valuable data safe, it should be stored in Amazon EBS, which is persistent and can be backed up using snapshots stored in Amazon S3. However, with the right technique, this extremely fast ephemeral SSD storage can still be leveraged to significantly improve performance. In this series of posts we will discuss exactly that.

This is the simplest definition of the term Big Data that I could find.

What is Big Data?

Big Data refers to a collection of tools and technologies that help you work productively with data at any scale.

Why not just use a database server?

  • Due to advances in technology, data is continuously being generated from more sources and more devices.
  • Much of that data, such as videos, photos, comments and reviews on websites and social media, is unstructured. That means the data is not stored in predefined structured tables. Instead, it is often made up of volumes of text, dates, numbers and facts that are typically free form by nature.
  • Certain types of data arrive so fast that there is not even time to store them before applying analytics.
  • That’s why traditional data management and analytics tools alone don’t enable IT to store, manage, process and analyse the data.