Building a scraper for recreation.gov

The start of a new project

A friend recently asked if I could look into building a tool or site to scrape https://recreation.gov, with the end goal of building a system that automatically reserves a desired permit.

This piqued my interest, so let’s take a look at what I can do! At a high level I imagined building:
register desired site -> continuously scrape -> reserve -> notify via text

This looks like a good chance to put together some interesting technologies: a web framework (Django?) and Twilio to send notifications.

Alternatives

Before beginning any project, I look at the current space to see whether there are any open source alternatives, or even a paid platform, to leverage.

I found the following:

I could not find a paid service, and the OSS options seemed very difficult for non-technical people to use.

Can I haz API?

Browsing online, I was ecstatic when I came across ridb.recreation.gov, which is a REST API for the recreation.gov website. Unfortunately it doesn’t let you perform reservations, and I haven’t yet deciphered how to link its data to the reservation portion of the site. Perhaps it can be leveraged in the future!

Time to use our favorite reverse engineering tools: Wireshark & Charles. I ended up using Charles specifically because I find it easier to set up as a man-in-the-middle HTTPS proxy.

You can follow the simple guide on how to set up Charles as an HTTPS proxy here

Here is the raw request from Charles when searching locations matching whitney at https://www.recreation.gov/unifSearch.do
(unimportant parts stripped out)


POST /unifSearch.do HTTP/1.1
Host: www.recreation.gov
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://www.recreation.gov/unifSearch.do
Content-Type: application/x-www-form-urlencoded
Content-Length: 275
Connection: keep-alive
Upgrade-Insecure-Requests: 1

currentMaximumWindow=12&locationCriteria=whitney&interest=&locationPosition=&selectedLocationCriteria=&resetAllFilters=true&filtersFormSubmitted=false&glocIndex=0&googleLocations=Whitney+Place+Northwest%2C+Seattle%2C+WA%2C+USA%7C-122.39853319999997%7C47.6974492%7C%7CLOCALITY

The important part is that the body is application/x-www-form-urlencoded and carries locationCriteria=whitney.
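As a rough sketch, the same kind of request could be assembled with the JDK's built-in HTTP client types (Java 11+). The class and method names below are hypothetical, and the assumption that this reduced set of form fields is enough for the server to answer is untested; only the field names and the URL come from the capture above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any recreation.gov client library.
class UnifSearchRequest {

    // Encodes key/value pairs as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds (but does not send) the search POST for a location keyword.
    // Assumption: this subset of the captured fields is sufficient.
    static HttpRequest searchRequest(String location) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("currentMaximumWindow", "12");
        fields.put("locationCriteria", location);
        fields.put("resetAllFilters", "true");
        fields.put("filtersFormSubmitted", "false");
        return HttpRequest.newBuilder(URI.create("https://www.recreation.gov/unifSearch.do"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(fields)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = searchRequest("whitney");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Building the request separately from sending it keeps the encoding logic easy to check without touching the network.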

The response is HTML; however, we can use an HTML parser to extract the desired list of results.

Dropwizard Reservoir Concurrency

Dropwizard Metrics

A senior colleague was recently writing some code that required some sliding window book-keeping. Worried about throughput and concurrency, the colleague opted for a home-grown solution following the single-writer principle.

From prior experience with Dropwizard Metrics, my quick quip was “meh, just use Dropwizard’s SlidingTimeWindowReservoir”, as I had come to expect the library to provide robust & highly concurrent data structures for metrics.

He ended up diving into the implementation and, sure enough, found it to be quite ingenious. It took me a while to understand, so I thought I would explain it here for my future self.

Underlying Data Structure

There are various data structures one could use to implement a SlidingTimeWindowReservoir; Dropwizard opts for a ConcurrentSkipListMap, which is a lock-free NavigableMap.

The map is sorted on tick (time), and the NavigableMap interface allows for easy trimming of entries that have fallen out of the window.
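To illustrate why a sorted map makes trimming cheap, here is a minimal sketch using only the JDK. The class name, method signature and sample values are made up for illustration; the idea is simply that headMap returns a live view of all keys older than a cutoff, and clearing that view removes them from the backing map.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative only; names and values are hypothetical.
class WindowTrimExample {

    // Drops every measurement whose tick is strictly older than (now - window).
    static void trim(ConcurrentSkipListMap<Long, Long> measurements, long now, long window) {
        // headMap(toKey) is a view of entries with keys < toKey; clear() removes them
        // from the underlying map.
        measurements.headMap(now - window).clear();
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Long, Long> measurements = new ConcurrentSkipListMap<>();
        measurements.put(100L, 7L); // tick -> recorded value
        measurements.put(200L, 8L);
        measurements.put(300L, 9L);

        trim(measurements, 350L, 100L); // cutoff is 250
        System.out.println(measurements); // {300=9}
    }
}
```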

Concurrency

The key in the ConcurrentSkipListMap is the clock tick.

How do we solve the scenario where multiple writers try to record a value at the same clock granularity?

This is where the implementation gets neat: it introduces a COLLISION_BUFFER.

Original source

In the unlikely case where multiple writers try to add to the map at the same clock granularity (i.e. clock.getTick() returns the exact same value), each tick is first multiplied by COLLISION_BUFFER, which leaves room for that many distinct keys per clock tick. A CAS loop then lets a writer that lost the race retry, incrementing the last tick value by 1 within that buffer.

Consider the simple case where clock.getTick() returns 2, so tick is 512 (2 * 256), and lastTick currently holds 256 (1 * 256).

The first writer reads oldTick as 256, sees that tick - oldTick is positive, and assigns newTick as tick. The compareAndSet is successful and lastTick is set to 512.

The second writer, which also read oldTick as 256, fails the CAS and loops again, but now lastTick is 512.
Since tick - oldTick is no longer positive, newTick becomes oldTick + 1 = 513, and the CAS sets it.
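The walkthrough above can be reproduced with a small standalone sketch. It mirrors the shape of the tick logic as I understand it but is not the library's code: the class name is invented, a LongSupplier stands in for Dropwizard's Clock, and the buffer size of 256 is taken from the example.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Simplified illustration of a CAS-based, strictly increasing tick source.
class CollisionBufferTicker {
    // Room for this many distinct keys per raw clock tick.
    static final int COLLISION_BUFFER = 256;

    private final LongSupplier clock; // hypothetical stand-in for the metrics Clock
    private final AtomicLong lastTick;

    CollisionBufferTicker(LongSupplier clock, long initialRawTick) {
        this.clock = clock;
        this.lastTick = new AtomicLong(initialRawTick * COLLISION_BUFFER);
    }

    long nextTick() {
        for (;;) {
            long oldTick = lastTick.get();
            long tick = clock.getAsLong() * COLLISION_BUFFER;
            // Use the scaled clock value if it moved forward, otherwise bump by one.
            long newTick = tick - oldTick > 0 ? tick : oldTick + 1;
            if (lastTick.compareAndSet(oldTick, newTick)) {
                return newTick;
            }
            // Another writer won the race; retry against the updated lastTick.
        }
    }

    public static void main(String[] args) {
        // The clock is frozen at raw tick 2; lastTick starts at 1 * 256.
        CollisionBufferTicker ticker = new CollisionBufferTicker(() -> 2L, 1L);
        System.out.println(ticker.nextTick()); // 512
        System.out.println(ticker.nextTick()); // 513
    }
}
```

Because every returned value is unique, each writer gets its own map key even when the clock has not advanced.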

Large Scale Guice Projects – Module Deduplication

If you come from a large Java shop, you’ve likely heard of or encountered Guice — Google’s lightweight dependency injection framework; analogous to Spring.

First-time users of Guice will usually be starry-eyed at the ability to get type-safe dependency injection and the seemingly modular way in which to bundle up dependencies into Modules.

Many of the tutorials, guides and best practices found online, though, are targeted towards smaller codebases. Anyone in a large Java shop will likely have hit the spaghetti of configuration and AbstractModule dependencies needed to get the code to boot up, seemingly ruining the modularity that a dependency injection framework promises.
This post covers some best practices I’ve found for keeping AbstractModule composable and easy to understand.

If you don’t want to read to the end, just check out my project for solving Guice deduplication: guice-dedupe

Composition

The biggest hurdle large projects will face in Guice is keeping Modules reusable and, more importantly, self-contained.

Consider the following example where I have a JSON class that is bound in a module, and two other modules want to make use of it.

We’d like to make use of the install option, so that a consumer can use either ModuleA or ModuleB and the necessary bindings are self-contained. The problem arises if both ModuleA and ModuleB are used: you’ll be greeted with a multiple binding exception.

Many codebases simply remove the installation of Module dependencies and push the work of figuring out the right final set of modules to the moment you try to create the injector. What a mess!

The way to solve this is to use Guice’s built-in de-duplication code. For the most part it works out of the box, unless you are using @Provides in your modules.
Simply change all your existing AbstractModule subclasses to extend SingletonModule from the guice-dedupe library, and you’ll get modules that are fully self-contained, even with providers.

WaterFlow – SWF Framework

SWF Framework

I’ve had pretty good exposure to SWF through my last stint at Amazon/AWS and I grew to love the service. Once you get past some of the confusing aspects of programming in a very stateless, distributed manner, you begin to appreciate the true power that is available to you.

At my previous team, even though it was within AWS, the team had created their own SWF Framework – mostly because they pre-dated the AWS Flow Framework. I was exposed to some interesting concepts that were necessary in the custom framework and that were lacking in Flow.

Personally, although Flow is a great framework, I never loved its use of annotation processing through AspectJ. It makes the code hard to debug in your IDE, hard to reason about mentally, and difficult to set up on anything other than Eclipse.

Recently I came across https://bitbucket.org/clarioanalytics/services-swift/, which is a very minimal SWF framework that targets Java 1.6. It gave me a good idea of how you can achieve something pretty robust with SWF with minimal code. However, I found it to be too far toward the other extreme. Whereas Flow was overly complicated and too magical, I found Swift to be lacking when writing the workflow/decider.

WaterFlow

I decided to take the best parts of Flow and the best parts of Swift and make WaterFlow. It’s a relatively small SWF framework in the same vein as Swift, but brought into the world of JDK 8 and with a strong asynchronous programming story (for orchestrating the decider). I’d love to help someone get bootstrapped on it and help with onboarding! Please contact me.

Learning Netty – HTTP Echo

Learning Netty – Part 1

I recently picked up a copy of Netty in Action, which has been a great way to learn more about Netty. I’ve become more fascinated with Netty as I’ve delved further into reactive & asynchronous programming. Netty is a very powerful framework that simplifies a lot of the challenges of NIO programming; however, it is still difficult to find resources and examples for everyday use cases.

Echo HTTP

Many examples online start with building an echo server & client. However, they use a simple TCP echo server, which, although probably a more reasonable starting point for a server, doesn’t show a barebones HTTP setup.

The following is a Gist of a barebones HTTP server and client. The example shows a few simple defaults such as the use of HttpObjectAggregator, HttpServerCodec and SimpleChannelInboundHandler.

PhotoReflect Unwatermarked image

Reverse Engineering

I recently had a special event where a photographer took some photos of the occasion and used PhotoReflect to host them and sell them to me.

The price is $29 per photo, or $699 for the whole set (90). I consider that an egregious amount to charge, considering that her services were supposedly already included with the event.

The following are some instructions you can follow for downloading the non-watermarked medium-quality images. They assume that you are running on a Mac; however, the commands can easily be modified for any setup.

The real magic here is in the s=-3 portion of the request, which I’ve found returns the image without the watermark.

I still haven’t figured out how to download the high-quality images, but if you find out please share!

EDIT: I have also found a helpful Gist that can help inject a download button onto the store.

HypeMachine Chrome Extension – Still going

Short and Sweet

This will be a short and sweet post. It’s been a while since I’ve used my HypeMachine Chrome Extension, and just as long since I’ve used HypeMachine (my music taste has shifted). However, I’m always surprised by the community of people still using the extension.

I’ve recently merged in some new code, graciously written by Scott Clayton, that allows the user to download multiple songs at once. The code is now in master, so feel free to re-install the extension and give the new functionality a whirl!

Pagination in Clojure

Luminus – Pagination

I’ve recently been working on a fun side project using the Luminus web framework as my first foray into Clojure (which I’m absolutely falling in love with).

One thing I find missing from the documentation, and online in general, is an idiomatic way to paginate in Clojure. I’m sure there is some sexy pagination strategy that uses lazy-seqs, macros, protocols and records; however, I was not able to come up with anything (myself or via Google).

I’m dumping the small helper functions that I ended up writing in the hope that someone finds a use for them:

Ultimately one would use the create function to include a structured Pagination map in their context/response.

If you have anything better please share!

AWS Lambda – Sending CloudTrail notifications to CloudSearch

Lambda

Amazon has just recently announced AWS Lambda, which is a pretty cool new service that runs your code in response to events. The service manages all the compute resources for you and is a nice hands-off approach to running things in the cloud (how much easier can it get!). At the moment only a few event sources are supported by AWS Lambda; however, one of them is S3 Put notifications (creation/update of keys/objects).

CloudTrail & Inspiration

Recently at work I wanted more insight into some of the API calls that were made on our AWS accounts (occasionally mysterious actions have occurred, and digging through the CloudTrail logs could prove fruitful). I’ve recently written about setting up an EMR cluster connected to your CloudTrail S3 bucket to perform easy queries against your dataset; however, I find that to be too much power in most cases and thought there should be a simpler way.

I had come across this blog post, which outlines sending CloudTrail events to CloudSearch with the help of SQS & SNS. Now that AWS Lambda exists, can it be simpler?
You bet!

I’ve created the following gist, which you can upload to AWS Lambda to start sending your S3 CloudTrail notifications to CloudSearch.

In order to utilize the script, make sure you’ve created a CloudSearch domain and added the index fields in the MAPPINGS variable (you can use the helpful script in the linked blog post here).

Scalatron Build.sbt file to the rescue

Scalatron

I’ve recently been playing around with writing a bot for Scalatron; however, I didn’t find any great explanation of how to set up a nice development process with SBT. The closest I could find was this blog post, but it left a lot to the imagination. I hope you find my annotated Build.sbt below better and clearer!

If you launch sbt and run play, you should see the Scalatron server start up and pick up your bot!