WaterFlow – SWF Framework

SWF Framework

I’ve had pretty good exposure to SWF (Simple Workflow Service) through my last stint at Amazon/AWS, and I grew to love the service. Once you get past some of the confusing aspects of programming in a very stateless/distributed manner, you begin to appreciate the true power that is available to you.

My previous team, even though it was within AWS, had created its own SWF framework, mostly because it pre-dated the AWS Flow Framework. There I was exposed to some interesting concepts that were necessary in the custom framework and that were lacking in Flow.

Personally, although Flow is a great framework, I never loved its use of annotation processing through AspectJ. It makes the code hard to debug in your IDE, hard to reason about mentally, and difficult to set up on anything other than Eclipse.

Recently I came across https://bitbucket.org/clarioanalytics/services-swift/, a very minimal SWF framework that targets Java 1.6. It gave me a good idea of how to achieve something pretty robust with SWF with minimal code. However, I found it to be too far toward the other extreme: whereas Flow was overly complicated and too magical, Swift felt lacking when writing the workflow/decider.

WaterFlow

I decided to take the best parts of Flow and the best parts of Swift and make WaterFlow. It’s a relatively small SWF framework in the same vein as Swift, but brought into the world of JDK8 and with a strong asynchronous programming story (for orchestrating the decider). I’d love to help get someone bootstrapped on it and help them with onboarding! Please contact me.
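To make the asynchronous programming story concrete, here is a minimal sketch of the JDK8 style of decider orchestration I mean. The class and activity stubs below are purely illustrative (they are not WaterFlow’s actual API); the point is composing activities with plain CompletableFuture calls instead of AspectJ-woven @Asynchronous methods:

import java.util.concurrent.CompletableFuture;

// Illustrative sketch only; the activity stubs are hypothetical stand-ins
// for scheduled SWF activity tasks.
public class OrderDecider {

    CompletableFuture<String> reserveInventory(String orderId) {
        return CompletableFuture.completedFuture("reservation-1");
    }

    CompletableFuture<String> chargeCustomer(String orderId) {
        return CompletableFuture.completedFuture("charge-1");
    }

    CompletableFuture<Void> shipOrder(String reservationId, String chargeId) {
        return CompletableFuture.completedFuture(null);
    }

    public CompletableFuture<Void> decide(String orderId) {
        // The two activities below are independent, so they run in parallel;
        // shipping is scheduled only once both have completed.
        CompletableFuture<String> reservation = reserveInventory(orderId);
        CompletableFuture<String> charge = chargeCustomer(orderId);
        return reservation
                .thenCombine(charge, (res, chg) -> new String[] { res, chg })
                .thenCompose(ids -> shipOrder(ids[0], ids[1]));
    }
}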

AWS Lambda – Sending CloudTrail notifications to CloudSearch

Lambda

Amazon recently announced AWS Lambda, a pretty cool new service that runs your code in response to events. The service manages all the compute resources for you and is a nice hands-off approach to running things in the cloud (how much easier can it get!). At the moment only a few event sources are supported by AWS Lambda; however, one of them is S3 Put notifications (creation/update of keys/objects).

CloudTrail & Inspiration

Recently at work I wanted more insight into some of the API calls made on our AWS accounts (occasionally mysterious actions have occurred, and tracing them through CloudTrail could prove fruitful). I’ve recently written about setting up an EMR cluster connected to your CloudTrail S3 bucket to perform easy queries against your dataset; however, I find that to be too much power for most cases and thought there should be a simpler way.

I had come across this blog post, which outlines sending CloudTrail events to CloudSearch with the help of SQS and SNS. Now that AWS Lambda exists, can it be simpler? You bet!

I’ve created the following gist, which you can upload to AWS Lambda to start sending your S3 CloudTrail notifications to CloudSearch.

In order to utilize the script, make sure you’ve created a CloudSearch domain and added the index fields listed in the MAPPINGS variable (the blog post linked above includes a helpful script for this).
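The gist itself isn’t reproduced here, but its flow is simple: when a CloudTrail log file lands in S3, read the gzipped file, pull out each record, keep the fields named in MAPPINGS, and post the batch to your CloudSearch domain’s document endpoint. Here is a rough sketch of that flow (shown in Java for illustration; the class name, endpoint, and MAPPINGS contents are placeholders, not the gist’s actual code):

import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;

import com.amazonaws.services.cloudsearchdomain.AmazonCloudSearchDomainClient;
import com.amazonaws.services.cloudsearchdomain.model.UploadDocumentsRequest;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3Client;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Sketch only; not the original gist. DOC_ENDPOINT and MAPPINGS are placeholders.
public class CloudTrailToCloudSearch implements RequestHandler<S3Event, Void> {

    // Placeholder: your CloudSearch domain's document endpoint.
    private static final String DOC_ENDPOINT =
            "doc-YOUR-DOMAIN.us-east-1.cloudsearch.amazonaws.com";

    // Placeholder: the CloudTrail record fields to index; each needs a
    // matching index field in your CloudSearch domain.
    private static final String[] MAPPINGS = {
            "eventName", "eventSource", "eventTime", "awsRegion", "sourceIPAddress"
    };

    private final AmazonS3Client s3 = new AmazonS3Client();
    private final ObjectMapper json = new ObjectMapper();

    @Override
    public Void handleRequest(S3Event event, Context context) {
        AmazonCloudSearchDomainClient cloudSearch = new AmazonCloudSearchDomainClient();
        cloudSearch.setEndpoint(DOC_ENDPOINT);

        event.getRecords().forEach(rec -> {
            String bucket = rec.getS3().getBucket().getName();
            String key = rec.getS3().getObject().getKey();
            try (GZIPInputStream in = new GZIPInputStream(
                    s3.getObject(bucket, key).getObjectContent())) {
                // Each CloudTrail log file is a JSON document: {"Records": [...]}.
                JsonNode records = json.readTree(in).get("Records");
                ArrayNode batch = json.createArrayNode();
                for (JsonNode r : records) {
                    ObjectNode fields = json.createObjectNode();
                    for (String name : MAPPINGS) {
                        if (r.has(name)) {
                            fields.put(name.toLowerCase(), r.get(name).asText());
                        }
                    }
                    // CloudSearch document batch entry:
                    // {"type": "add", "id": ..., "fields": {...}}
                    ObjectNode add = json.createObjectNode();
                    add.put("type", "add");
                    add.put("id", r.get("eventID").asText());
                    add.set("fields", fields);
                    batch.add(add);
                }
                byte[] payload = json.writeValueAsBytes(batch);
                cloudSearch.uploadDocuments(new UploadDocumentsRequest()
                        .withContentType("application/json")
                        .withDocuments(new ByteArrayInputStream(payload))
                        .withContentLength((long) payload.length));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        return null;
    }
}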

Analyzing CloudTrail Logs Using Hive/Hadoop

Disclaimer

This is simply a blog post record for myself, as I had great difficulty finding information on the subject. It’s not meant to be a very informative guide on either CloudTrail or Hive/Hadoop.

Intro

Recently at work we’ve had an issue where some security group ingress rules were being modified (either automatically or manually), and it has been affecting our test runs that rely on those rules. In order to try and track down the source of the modification, we enabled CloudTrail. CloudTrail is part of the AWS family of web services; it records the AWS API calls made in your account and places those logs in an S3 bucket that you can access.

The recorded information includes the identity of the API caller, the time of the API call, the source IP address of the API caller, the request parameters, and the response elements returned by the AWS service.
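For reference, each log file CloudTrail delivers is a gzipped JSON document containing a Records array. An abbreviated record looks roughly like this (the values are placeholders):

{
  "Records": [
    {
      "eventVersion": "1.01",
      "userIdentity": { "type": "IAMUser", "userName": "alice" },
      "eventTime": "2014-11-20T18:32:21Z",
      "eventSource": "ec2.amazonaws.com",
      "eventName": "AuthorizeSecurityGroupIngress",
      "awsRegion": "us-east-1",
      "sourceIPAddress": "192.0.2.10",
      "requestParameters": { "groupId": "sg-12345678" },
      "responseElements": { "_return": true }
    }
  ]
}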

Hive

My experience with Hive has been very limited (simple exposure from running tutorials); however, I was aware that it is a SQL-ish execution engine that transforms its queries into MapReduce jobs for Hadoop to execute. Since it is built on Hadoop, it has native support for using S3 in place of HDFS.

With the little knowledge of Hive I had, I thought there should exist a very prominent white paper describing how to consume CloudTrail logs using Hive (via some custom SerDe). A co-worker was simply consuming the JSON log files via Python; however, I was on a mission to see if I could solve the problem (querying relevant data from the logs) with an easy Hive setup! The benefit of setting up a Hadoop/Hive cluster for this is that it would be persistent and could easily be used to query additional information.

Solution

After contacting some people from the EMR team (I was unable to find anything myself on the internet), I was finally pointed to some relevant information! I’ve included the reference link and the example code in case the link ever breaks.
reference: http://www.emrsandbox.com/beeswax/execute/design/4#query

The key thing to note from the example is that it uses a custom SerDe that is included with Hadoop clusters created by AWS Elastic MapReduce. The SerDe provides the input format and deserializer that properly consume the nested JSON records. With this you can now easily query CloudTrail logs!
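For reference, a table definition in the spirit of that example looks like the following (a reconstruction based on the standard EMR CloudTrail SerDe usage; the field list is abbreviated, and the LOCATION is a placeholder for your own trail bucket and account ID):

-- Reconstruction of the EMR CloudTrail SerDe example; abbreviated field
-- list, placeholder LOCATION.
CREATE EXTERNAL TABLE cloudtrail_logs (
  eventVersion STRING,
  userIdentity STRUCT<
    type: STRING,
    principalId: STRING,
    arn: STRING,
    accountId: STRING,
    userName: STRING>,
  eventTime STRING,
  eventSource STRING,
  eventName STRING,
  awsRegion STRING,
  sourceIpAddress STRING,
  userAgent STRING,
  requestParameters STRING,
  responseElements STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-cloudtrail-bucket/AWSLogs/YOUR_ACCOUNT_ID/CloudTrail/';

-- Example: find who has been touching security group ingress rules.
SELECT eventTime, userIdentity.userName, sourceIpAddress
FROM cloudtrail_logs
WHERE eventName LIKE '%SecurityGroupIngress%';

Because the SerDe handles the nested JSON, fields like userIdentity.userName are directly queryable, which is exactly what makes this nicer than scripting over the raw log files.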