Analyzing CloudTrail Logs Using Hive/Hadoop

Disclaimer

This post is mainly a record for myself, as I had great difficulty finding information on the subject. It’s not meant to be a comprehensive guide to either CloudTrail or Hive/Hadoop.

Intro

Recently at work we’ve had an issue where some security group ingress rules were being modified (either by automation or manually), and this has been affecting our test runs that rely on those rules. To track down the source of the modifications we enabled CloudTrail. CloudTrail is part of the AWS family of web services; it records the AWS API calls made in your account and delivers those logs to an S3 bucket that you can access.

The recorded information includes the identity of the API caller, the time of the API call, the source IP address of the API caller, the request parameters, and the response elements returned by the AWS service.
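To make the record layout concrete, here is a minimal Python sketch of reading one log file and picking out security group ingress changes. It assumes the usual CloudTrail layout (a gzip-compressed JSON file with a top-level "Records" array) and the field names eventTime, eventName, sourceIPAddress, userIdentity and requestParameters. The function names and the summary keys are my own, made up for illustration, and the groupId lookup assumes the call identified the group by ID.

```python
import gzip
import json

# EC2 API calls that add or remove security group ingress rules.
INGRESS_EVENTS = {"AuthorizeSecurityGroupIngress", "RevokeSecurityGroupIngress"}


def load_records(path):
    """Read one CloudTrail log file (gzip-compressed JSON) and return its records."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle).get("Records", [])


def ingress_changes(records):
    """Yield a short summary for every record that modified ingress rules."""
    for record in records:
        if record.get("eventName") not in INGRESS_EVENTS:
            continue
        # These nested objects can be missing or null, so fall back to {}.
        identity = record.get("userIdentity") or {}
        params = record.get("requestParameters") or {}
        yield {
            "time": record.get("eventTime"),
            "event": record.get("eventName"),
            "source_ip": record.get("sourceIPAddress"),
            "caller": identity.get("arn"),
            "group_id": params.get("groupId"),
        }
```

Calling list(ingress_changes(load_records(path))) on a downloaded log file (path being wherever you saved it) gives one small dictionary per matching call. This works fine for a handful of files, but it has to be re-run over every file for each new question, which is part of why I wanted something queryable.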

Hive

My experience with Hive has been very limited (simple exposure from running tutorials). However, I was aware that it is a SQL-like execution engine that transforms queries into MapReduce jobs and runs them on Hadoop. Because it is built on Hadoop, it can read data directly from S3 through Hadoop’s S3 filesystem support, much as it would from HDFS.

With the little knowledge of Hive I had, I assumed there would be a prominent white paper describing how to consume CloudTrail logs using Hive (with some custom SerDe). A co-worker was simply consuming the JSON log files via Python, but I was on a mission to see if I could solve the problem (querying relevant data from the logs) with an easy Hive setup! The benefit of setting up a Hadoop/Hive cluster for this is that it would be persistent and could easily be reused to query additional information later.

Solution

After contacting some people from the EMR team (I was unable to find anything myself on the internet), I was finally pointed to some relevant information! I’ve included the reference link and the original example code in case the link ever breaks.
Reference: http://www.emrsandbox.com/beeswax/execute/design/4#query

The key thing to note from the example is that it uses a custom SerDe that ships with the Hadoop clusters created by AWS Elastic MapReduce (EMR). The SerDe provides the input format and the deserializer that the table needs to properly consume the nested JSON records. With this you can now easily query CloudTrail logs!
