Casey Robinson


Software Engineer

rampantmonkey.com

Docker, Git, Go, Javascript, Nginx, PostgreSQL, SQLite, GraphViz, Linux, ZeroMQ, Redis, Logic Programming, Expert Systems, Testing

Experience

WireWheel Inc. - Lead Architect - 11/2020-Present

WireWheel offers products for achieving compliance with privacy laws while preparing your organization to respond to upcoming privacy standards.

Highlights

Privacy Operations Management

I am leading a team of engineers to build the next generation of our product for managing collection and reporting of privacy compliance data.

Till Inc. - Senior Software Engineer - 04/2020-11/2020

Till is a company that works with landlords and renters to provide customized payment schedules and budgeting tools to help renters pay on time and improve their financial outlook.

Highlights

Rent Roll Data Pipeline

The ingestion pipeline for daily updates of renter status and balance designed to support multiple vendor APIs and manually created excel files. I designed the pipeline as a streaming system with support for change detection notifications and replayable historical input data.

Matching Service

Each vendor provides their own identifier for a user. The matching service is responsible for translating external identifiers into internal identifiers. The service also provided a fuzzy matching endpoint to find data for new users during the enrollment flow. I built this service with go, redis, and bloom filters.

Upside Business Travel - Senior Software Engineer - 04/2019-04/2020

Upside Business Travel (now defunct - business travel and global pandemics don't mix) was a business travel booking platform focused on planning an entire trips for frequent travelers.

Highlights

Sabre Session Token Pool

Sabre, the primary source of data for the airline industry, uses session tokens to handle authentication and authorization of their API. Each company purchases a fixed number of tokens and is charged, without warning, an overage fee if they try to use more than the number. Thus, an accurate count of the tokens in use is critical to controlling cost. I designed a distributed session pool manager to accurately enforce the limit while providing tokens to the 180 microservices in our infrastructure. I used TLA+ to design the management algorithm and implemented the service using redis, lua, and nodejs.

Booking Engine

Sabre is the source of truth for bookings across the airline industry. Sabre was originally created in 1960 to make American Airlines booking system available to other airlines. As you can imagine, there are many idiosyncracies which introduce an impedance mismatch with HTTP/JSON/NodeJS microservices. A unique character encoding (the service predates ASCII), 6-bit integers, and a terminal based UI which travel agents use to manage all bookings. Sabre has been working to modernize their APIs. However, many features still only exist through the low-level Sabre Send Command, SabreCommandLLSRQ. To use this command you need to send the keys that a travel agent would type in the terminal UI.

My project was to build a booking engine service which accepted booking requests from the rest of our services, reserved the requested inventory, and provided actionable error messages for users when the request could not be fulfilled. I built the initial proof of concept and led the implementation of the service across four engineering teams.

Fugue Inc. - Senior Software Engineer - 10/2017-02/2019

Fugue provides infrastructure as code with an emphasis on security, auditability, and remediation.

Highlights

Cloud Objects

List every item in an AWS account to create a snapshot or diff with a previous snapshot. This was implemented with Python and fully tested using property based testing with Hypothesis. This project also introduced property based testing to many of the engineers.

Pertubation Testing and Monitoring

One of the primary goals of the product is to detect unintended changes to infrastructure, revert to the last known state, and alert the end user. To test this feature I created a fuzz testing framework which introduced arbitrary changes into an AWS account and measured the remediation latency. The largest constraint on this project is the cost of running complex AWS infrastructure. Thus the perturbation selection criteria was critical and ultimately derived from analyzing historical data about bugs and feature changes.

Distil Networks - Platform Architect - 08/2014-10/2017

Protect websites, mobile applications, and APIs from automated attacks without affecting the flow of business-critical traffic.

Highlights

Growing the team

When I joined Distil, the platform team consisted of one co-founder and one manager. When I left there were nine engineers across four offices and two continents. I was heavily involved in the recruiting, hiring process for all of them. This included organizing an ongoing rust meetup, working with recruiting to define job descriptions, and working with my manager to define onboarding and training plans.

During my time at Distil I was involved in talks to acquire three different companies. Two were successful and I led the architecture effort to integrate their products into our platform.

Development Environment and Test Suite

As part of onboarding new engineers I built an automated test suite which centralized the ad hoc scripts engineers had created up until that point. The test runner was built primarily using rust and selenium with support for a variety other headless browsers (when I left there were 12 different browsers in active rotation).

Our product was deployed to multiple machines in a customer's data center. This was an environment most engineers were not familiar with replicating locally. I built a tool using go, python, and lxc (eventually replaced with docker and docker compose) to replicate the deployment environment locally. This allowed the team to focus on processes and messages instead of file paths and virtual machines. It also allowed teams to accurately run the test suite locally.

Fault Tolerance

Multi-node deployments were less reliable than single node deployments despite being sold to customers for the redundancy (critical since our product was directly in line with their production traffic). I determined that the root cause was errors being propagated between machines due to the design of our ZeroMQ messaging system.

I redesigned the data flow to improve reliability. In two node deployments we replicated input to tolerate a single node failure. In larger deployments we use data replication, service discovery, and service migration to tolerate n/2 - 1 failures.

As part of this work I created a test suite which introduces faults into the network and observes the failover.

Nginx

We used nginx as our primary http request handling process. I was responsible for maintaining the build process for our custom plugins and any patches we created for nginx itself.

Visitor Identification

We used browser fingerprinting as the primary component in identifying visitors. This code was untouched for three years and lacked data validation. I collaborated with the data science team to define the set of features which provided the most uniqueness and validate new proposed features. This resulted in 2.5x the number of identifiers after normalizing for traffic volume.

However, with the initial approach it was trivial for a bot writer to obtain a new identifier and thus evade our tracking by appearing as a new visitor on each request. I added HashCash to increase the cost of requesting an identifier. This resulted in a 30% increase in the number of bots detected with no customer complaints about false positives.

To prevent bot writers from resubmitting the same token we required that each token is unique. Effectively, deduplicating an infinite steram of data. For this, I implemented a stable bloom filter with lua inside of redis.

Data Pipelines

Our product had two primary components: edge nodes which filtered requests in the customer's live producton traffic and a reporting cluster which provided machine learning and customer dashboards. The edge nodes were located in customer's data centers or cloud infrastructure while the reporting cluster was hosted on our own infrastructure. Many features required sending data between the two components. Configuration and ML models were sent to the edge nodes and logs were returned.

To provide visibility and control over this communication we built a data pipeline daemon. This process controlled all data flow from the edge to our reporting cluster. It provided a web interface for customers to view and manage all aspects of this data transfer. We chose to use go and kafka for this project.

We also decided to improve the usability of our log stream for internal customers as part of this project. We used protocol buffers to provide the data science team with a strict schema. We also introduced counters for our operations team to quickly diagnose issues.

Customization

Our product had many interconnected features and providing a direct list of settings to end users and support engineers proved to be confusing and ineffective. Similarly, adding new features required understanding every existing and in-development feature to avoid conflicts. This cost us excessive downtime downtime and multiple customers.

To increase understandability of the product I designed a rule engine and worked with the engineers, support team, and analysts to encode our knowledge of the platform into a digestible format and provide a customizable tool for end users.

Latency was the primary external constraint on the design of this system. We had contractual obligations to introduce no more than 20ms of latency into the request flow. I decided to build the rule engine as an nginx module written in rust to meet the latency requirment. Rust also allowed us to onboard engineers who only previously had experience with application level programming in ruby. Rust also had the unexpected benefit of compiling to webassembly which allowed the library to be reused in the web configuration interface.

The rule engine was implemented as an abstract virtual machine for executing rules. This virtual machine explicitly limited memory usage and execution time of each rule so that we could accurately predict the impact of any given configuration before deploying to production. Rules were defined in boolean logic and translated into disjunctive normal form to evaluate rules in constant space. The compiler was designed to provide user feedback on the rules and esitmate resource usage.

The rule engine was shipped to production and meets the latency and resource requirments. When I left we were continuing to collect usage and performance data to further improve execution.

Education

Side Projects

Escape Room Design

A small subset of the puzzle hunt team also works together to design and build escape rooms. This has provided the opportunity to design and program embedded systems as well as organize user testing.

DCPHR Puzzle Hunt

I lead a team of volunteers to design, construct, and facilitate a day long puzzle hunt in which teams of four people solve a series of puzzles. Through this event I have gotten the opportunity to develop mobile, progressive web applications, augmented reality puzzles, and use Prolog to prove that a puzzle has only one solution. In addition to the technical projects I have had the opportunity to work on the non-technical aspects of organizing a team including recruting, peer review, and project management.