Analyzing CloudTrail Logs Using Hive/Hadoop

Disclaimer

This is simply a blog post record for myself, as I had great difficulty finding information on the subject. It’s not meant to be a very informative guide to either CloudTrail or Hive/Hadoop.

Intro

Recently at work we’ve had an issue where some security group ingress rules were being modified (either automatically or manually), and it has been affecting our test runs that rely on those rules. In order to try and track down the source of the modifications we enabled CloudTrail. CloudTrail is part of the AWS family of web services; it records the AWS API calls made in your account and places those logs in an S3 bucket that you can access.

The recorded information includes the identity of the API caller, the time of the API call, the source IP address of the API caller, the request parameters, and the response elements returned by the AWS service.

Hive

My experience with Hive has been very limited (simple exposure from running tutorials), however I was aware that it was a SQL-ish query engine that transforms queries into MapReduce jobs executed on Hadoop. Since it is built on Hadoop, it has native support for using S3 in place of HDFS.

With the little knowledge of Hive I had, I thought there should exist a very prominent white paper describing how to consume CloudTrail logs using Hive (using some custom SerDe). A co-worker was simply consuming the JSON log files via Python, however I was on a mission to see if I could solve the problem (querying relevant data from the logs) with an easy Hive setup! The benefit of setting up a Hadoop/Hive cluster for this is that it is persistent and can easily be reused to query additional information.

Solution

After contacting some people from the EMR team (I was unable to find anything myself on the internet) I was finally pointed to some relevant information! I’ve included the reference link and a sketch of the example code below in case the link ever breaks.
reference: http://www.emrsandbox.com/beeswax/execute/design/4#query
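
The gist of the example is an external table definition along these lines. Treat this as a sketch rather than the verbatim code from the link: the table name, column list and S3 location are placeholders to adapt to your own logs, while the SerDe and input format classes are the ones bundled with EMR’s Hadoop distribution.

    -- Sketch of the EMR example. LOCATION uses a placeholder bucket and account ID;
    -- the SerDe and input format below ship with EMR's Hadoop clusters.
    CREATE EXTERNAL TABLE cloudtrail_logs (
      eventVersion STRING,
      userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        userName: STRING>,
      eventTime STRING,
      eventSource STRING,
      eventName STRING,
      awsRegion STRING,
      sourceIpAddress STRING,
      userAgent STRING,
      requestParameters STRING,
      responseElements STRING
    )
    ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
    STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://your-cloudtrail-bucket/AWSLogs/123456789012/CloudTrail/';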

The key thing to note from the example is that it uses a custom SerDe that is included with the Hadoop clusters created by AWS Elastic MapReduce. The SerDe provides the input format and deserializer which properly consume the nested JSON records. With this you can now easily query CloudTrail logs!
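
For instance, to chase down the security group changes mentioned in the intro, a query along these lines (against the sketch table above) would do the trick; the event names are the EC2 API actions for modifying ingress rules.

    -- Who has been touching security group ingress rules, and from where?
    SELECT eventTime, userIdentity.userName, sourceIpAddress, requestParameters
    FROM cloudtrail_logs
    WHERE eventName IN ('AuthorizeSecurityGroupIngress', 'RevokeSecurityGroupIngress');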

SQS, Apache Camel & Akka

Akka Apache-Camel Via SQS

This is an example project showing how to set up Akka, Apache Camel and SQS together. Never heard of them, or curious how they interact with each other?

Apache-Camel

Apache Camel is a rule-based routing and mediation engine.

What that basically means is that Apache Camel provides a common API for exchanging messages across a variety of platforms/protocols, such as HTTP, SQS and AMQP.
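
As a rough sketch of the idea (the queue name, credentials and HTTP endpoint below are made up), a Camel route simply says “from this endpoint URI, to that one”, and the component behind each URI takes care of the protocol:

    import org.apache.camel.builder.RouteBuilder
    import org.apache.camel.impl.DefaultCamelContext

    // A minimal Camel route: poll an SQS queue and forward every message to an HTTP endpoint.
    object CamelRouteSketch extends App {
      val context = new DefaultCamelContext()
      context.addRoutes(new RouteBuilder {
        override def configure(): Unit = {
          from("aws-sqs://my-queue?accessKey=XXX&secretKey=YYY") // placeholder queue/credentials
            .to("http://localhost:8080/receive")                 // placeholder HTTP endpoint
        }
      })
      context.start()
    }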

Akka

Akka is a toolkit and runtime for building highly concurrent, distributed, and fault tolerant event-driven applications on the JVM.

What that basically means is that Akka is a framework for writing code as small units known as actors, which lend themselves to being easily distributed.
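
For a taste of what that looks like (the names here are made up), a local actor is just a class with a mailbox that reacts to messages:

    import akka.actor.{Actor, ActorSystem, Props}

    // A minimal actor: it reacts to messages delivered to its mailbox.
    class Greeter extends Actor {
      def receive = {
        case name: String => println(s"Hello, $name!")
      }
    }

    object GreeterApp extends App {
      val system = ActorSystem("example")
      val greeter = system.actorOf(Props[Greeter], "greeter")
      greeter ! "world" // fire-and-forget message send
    }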

SQS

Amazon Simple Queue Service (SQS) is a fast, reliable, scalable & affordable message queuing service.

Why should you care!?

I’ve recently had the pleasure of releasing some code on Heroku using the Play Framework. Although deployment and initial setup were a breeze, I was bummed out to find that using Akka’s remoting protocol was not doable, as only standard ports are allowed on Heroku (80, 443). This leads to being unable to use Akka actors in a proper distributed model (i.e. they can’t talk to each other!).

Heroku has made a post where they outline using RabbitMQ instead of the default Akka protocol, however I did not find that approach simple or ideal.

This brings us to this sample project! Leveraging Apache Camel & SQS, it was very straightforward to send messages to distributed actors, as sketched below.
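
To give an idea of what the project boils down to (the queue name and credentials are placeholders; see the repo below for the real thing), akka-camel’s Consumer and Producer traits let an actor sit on either end of an SQS queue:

    import akka.actor.{ActorSystem, Props}
    import akka.camel.{CamelMessage, Consumer, Oneway, Producer}

    // Consumes messages from an SQS queue; Camel does the polling behind the endpoint URI.
    class SqsConsumer extends Consumer {
      def endpointUri = "aws-sqs://my-queue?accessKey=XXX&secretKey=YYY" // placeholder

      def receive = {
        case msg: CamelMessage => println(s"Received: ${msg.bodyAs[String]}")
      }
    }

    // Publishes any message it receives to the same SQS queue (fire-and-forget).
    class SqsProducer extends Producer with Oneway {
      def endpointUri = "aws-sqs://my-queue?accessKey=XXX&secretKey=YYY" // placeholder
    }

    object Main extends App {
      val system = ActorSystem("camel-sqs-example")
      system.actorOf(Props[SqsConsumer], "sqs-consumer")
      val producer = system.actorOf(Props[SqsProducer], "sqs-producer")

      // Actors running on different machines (or dynos) can now talk through the queue.
      producer ! "hello from another dyno"
    }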

Please check out the project on my GitHub page: https://github.com/fzakaria/Akka-Camel-SQS