Casey Robinson


Full Stack Data Scientist

Experience

Cognition Controls - Full Stack Data Scientist - 02/2022-Present

Purpose-built smart HVAC controls engineered for light commercial, industrial, and retail spaces.

Highlights

Thermodynamic Model

I developed a thermodynamic model of multi-zone spaces using Python and PyTorch. This model forms the basis of the rest of the product, including control, anomaly detection, and runtime estimation.
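
For illustration, a minimal sketch of what a learnable multi-zone thermal model can look like in PyTorch; the RC-style coupling structure and parameter names here are assumptions, not the production model:

    import torch
    import torch.nn as nn

    class MultiZoneThermalModel(nn.Module):
        """Illustrative RC-style model: heat flows between zones, to outdoors, and from HVAC."""

        def __init__(self, n_zones: int):
            super().__init__()
            # Learnable (log) capacitance per zone and conductances between zones,
            # to the outdoors, and from the HVAC input.
            self.log_capacitance = nn.Parameter(torch.zeros(n_zones))
            self.zone_coupling = nn.Parameter(torch.zeros(n_zones, n_zones))
            self.outdoor_coupling = nn.Parameter(torch.zeros(n_zones))
            self.hvac_gain = nn.Parameter(torch.ones(n_zones))

        def forward(self, temps, outdoor, hvac_power, dt=300.0):
            # temps: (batch, n_zones), outdoor: (batch, 1), hvac_power: (batch, n_zones)
            coupling = torch.relu(self.zone_coupling)  # keep conductances non-negative
            inter_zone = ((temps.unsqueeze(1) - temps.unsqueeze(2)) * coupling).sum(dim=2)
            heat_flow = inter_zone + torch.relu(self.outdoor_coupling) * (outdoor - temps)
            heat_flow = heat_flow + self.hvac_gain * hvac_power
            # Forward Euler step: dT = dt * Q / C
            return temps + dt * heat_flow / torch.exp(self.log_capacitance)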

Control Algorithm

Reusing the thermodynamic model, I implemented model predictive control with a genetic algorithm in the firmware of our building control device. This controller continually seeks to achieve tenant comfort in the least energy-intensive way.
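
A minimal sketch of the genetic-algorithm MPC loop, assuming a single-zone helper predict_temp as a stand-in for the thermodynamic model above; the population size, mutation rate, and cost weighting are illustrative assumptions:

    import random

    def simulate_cost(schedule, predict_temp, setpoint, comfort_weight=10.0):
        """Roll the thermal model forward over a candidate schedule and score it."""
        temp, cost = 21.0, 0.0                    # start from an assumed indoor temperature
        for power in schedule:
            temp = predict_temp(temp, power)      # one step of the thermodynamic model
            cost += power + comfort_weight * abs(temp - setpoint)  # energy + discomfort
        return cost

    def genetic_mpc(predict_temp, setpoint, horizon=12, pop_size=50, generations=40):
        population = [[random.random() for _ in range(horizon)] for _ in range(pop_size)]
        for _ in range(generations):
            ranked = sorted(population, key=lambda s: simulate_cost(s, predict_temp, setpoint))
            elite = ranked[: pop_size // 5]
            children = []
            while len(elite) + len(children) < pop_size:
                a, b = random.sample(elite, 2)
                cut = random.randrange(1, horizon)
                child = a[:cut] + b[cut:]                  # crossover
                if random.random() < 0.2:                  # mutation
                    child[random.randrange(horizon)] = random.random()
                children.append(child)
            population = elite + children
        best = min(population, key=lambda s: simulate_cost(s, predict_temp, setpoint))
        return best[0]  # MPC: apply only the first action, then re-plan next step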

Anomaly Detection and Classification

To detect when HVAC equipment is running at degraded capacity, I built an anomaly detection and alerting system. Candidates are first flagged with a straightforward regression to quickly highlight areas that require deeper analysis. These candidates are then clustered using soft dynamic time warping, and labels are applied to the clusters automatically from past issues confirmed through an HVAC service visit. This allows our product to send alerts tailored to the most likely issue, such as equipment operating at its limit or an outright malfunction.
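
A minimal sketch of the two stages, assuming fixed-length candidate windows; tslearn's soft-DTW k-means stands in here for the clustering step, and the threshold is an illustrative assumption:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from tslearn.clustering import TimeSeriesKMeans
    from tslearn.utils import to_time_series_dataset

    def candidate_indices(runtime, cooling_delta, threshold=3.0):
        """Stage 1: flag windows whose cooling response deviates from the fitted trend."""
        model = LinearRegression().fit(runtime.reshape(-1, 1), cooling_delta)
        residuals = cooling_delta - model.predict(runtime.reshape(-1, 1))
        z_scores = (residuals - residuals.mean()) / residuals.std()
        return np.where(np.abs(z_scores) > threshold)[0]

    def cluster_candidates(candidate_windows, n_clusters=3):
        """Stage 2: group the flagged windows by shape using soft-DTW k-means."""
        dataset = to_time_series_dataset(candidate_windows)
        km = TimeSeriesKMeans(n_clusters=n_clusters, metric="softdtw", random_state=0)
        return km.fit_predict(dataset)  # cluster labels are then mapped to confirmed issues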

Time Series Feature Store

At the core of our product is a time series feature store. This data store is an online cache of our data, resampled and aligned to a standard time step for consistent downstream consumption by our machine learning pipelines and user dashboard. I built it with Python, Starlette, and Redis and deployed it to an EC2 autoscaling group managed by Terraform.

Data is ingested from a variety of sources - SNS, MQTT, Kinesis, and S3. The service can currently serve 2k requests per second per core in the EC2 autoscaling group with further optimizations in the works.
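
A minimal sketch of the resample-and-cache idea behind the feature store; the five-minute step, key names, and TTL are illustrative assumptions rather than the production configuration:

    import json
    import pandas as pd
    import redis

    STEP = "5min"                        # the standard time step all consumers see
    cache = redis.Redis(host="localhost", port=6379)

    def write_features(device_id: str, raw: pd.Series) -> None:
        """Resample a raw sensor series (DatetimeIndex) to the standard step and cache it."""
        aligned = raw.resample(STEP).mean().interpolate(limit=2)
        payload = json.dumps({ts.isoformat(): value for ts, value in aligned.items()})
        cache.set(f"features:{device_id}", payload, ex=3600)  # online cache, 1 hour TTL

    def read_features(device_id: str) -> pd.Series:
        cached = cache.get(f"features:{device_id}")
        if cached is None:
            return pd.Series(dtype=float)
        data = json.loads(cached)
        return pd.Series(list(data.values()), index=pd.to_datetime(list(data.keys())))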

Temperature Compensation

Our thermostat contains many heat-generating components - cellular radio, processor, and power management. These components introduce heat into the enclosure, which can cause inaccuracies in the reading from our on-board temperature sensor. In collaboration with our electrical engineer, I designed an ongoing experiment to capture data from reference temperature sensors and internal temperature readings from each component across a variety of locations and power usage scenarios. I then developed a pipeline on top of the time series feature store to build a model that compensates for this heat and produces a more accurate temperature reading.
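
A minimal sketch of the compensation step, assuming a ridge regression over the on-board reading and internal component temperatures; the feature set and model choice are illustrative, not the production pipeline:

    import numpy as np
    from sklearn.linear_model import Ridge

    def fit_compensation(onboard, radio_temp, cpu_temp, reference):
        """Learn a correction from internal heat sources to the reference sensor reading."""
        features = np.column_stack([onboard, radio_temp - onboard, cpu_temp - onboard])
        return Ridge(alpha=1.0).fit(features, reference)

    def compensated_reading(model, onboard, radio_temp, cpu_temp):
        features = np.column_stack([onboard, radio_temp - onboard, cpu_temp - onboard])
        return model.predict(features)  # estimate of the true ambient temperature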

WireWheel Inc. - Lead Architect - 11/2020-02/2022

WireWheel offers products for achieving compliance with privacy laws while preparing your organization to respond to upcoming privacy standards.

Highlights

Privacy Operations Management

I led a team of engineers to build a new product for managing the collection and reporting of privacy compliance data within large organizations. Previously, many privacy officers used spreadsheets and other manual processes to gather the required information throughout their organization. This was error-prone and expensive, and as more privacy laws are enacted the burden will only increase. Our solution was to build a custom task and information management system for gathering compliance information.

The ultimate goal of the privacy officers is to produce a compliance document for each asset in their company and each regulation they are obligated to follow. To address this combinatorial nature and reduce disruption to the rest of the organization, our product focused on three key areas: templating, automation, and inventory.

Privacy compliance lawyers, either our employees or the customer's, could define a template for the compliance report they would like to produce. This template was then translated into a series of tasks distributed to their team. As the tasks were completed the template was populated, and a final compliance report document could be produced.

We built automation to prefill answers and to conditionally skip questions determined to be irrelevant - e.g. not in the EU, then GDPR is not relevant. We also built automation to check for conflicting answers and ask the participants to reconcile them - e.g. encryption required yet not enabled.

Compliance is always assessed against an asset, and most standards require overlapping information. To remove this redundant work we introduced an asset inventory, populated from task answers. If the inventory already contained a piece of information we would prefill that answer for the user - e.g. if two standards ask for a data retention policy, it only needs to be provided once since it is a property of the data storage system.
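
A minimal sketch (in Python for readability, although the production logic lived in PL/pgSQL) of how a question can be skipped or prefilled from the asset inventory; the field names are illustrative assumptions:

    def resolve_question(question, asset_inventory, prior_answers):
        """Return a skipped or prefilled answer where possible, otherwise ask the user."""
        # Skip rule: e.g. GDPR questions are irrelevant if the asset holds no EU data.
        skip_if = question.get("skip_if")
        if skip_if and prior_answers.get(skip_if["question"]) == skip_if["equals"]:
            return {"status": "skipped"}

        # Prefill rule: reuse an inventory property already captured for another standard.
        inventory_field = question.get("inventory_field")
        if inventory_field and inventory_field in asset_inventory:
            return {"status": "prefilled", "answer": asset_inventory[inventory_field]}

        return {"status": "ask_user"}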

All of the business logic and automation was implemented in PostgreSQL using PL/pgSQL. The API layer was implemented using Go. The template system was implemented using Go compiled to WebAssembly. The UI was implemented using PureScript.

Till Inc. - Senior Software Engineer - 04/2020-11/2020

Till is a company that works with landlords and renters to provide customized payment schedules and budgeting tools to help renters pay on time and improve their financial outlook.

Highlights

Rent Roll Data Pipeline

The ingestion pipeline delivers daily updates of renter status and balance and supports multiple vendor APIs as well as manually created Excel files. I designed the pipeline as a streaming system with support for change detection notifications and replayable historical input data.

Matching Service

Each vendor provides its own identifier for a user. The matching service is responsible for translating external identifiers into internal identifiers. The service also provides a fuzzy matching endpoint to find data for new users during the enrollment flow. I built this service with Go, Redis, and Bloom filters.
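
A minimal sketch (in Python for readability, though the service itself was written in Go) of the lookup path: a Bloom filter cheaply rules out unknown external IDs before the Redis mapping is consulted; the key names and filter sizing are assumptions:

    import hashlib
    import redis

    class RedisBloomFilter:
        """Toy Bloom filter backed by a Redis bitmap."""

        def __init__(self, client, key="match:bloom", size=2**20, hashes=5):
            self.client, self.key, self.size, self.hashes = client, key, size, hashes

        def _positions(self, value):
            for i in range(self.hashes):
                digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, value):
            for pos in self._positions(value):
                self.client.setbit(self.key, pos, 1)

        def might_contain(self, value):
            return all(self.client.getbit(self.key, pos) for pos in self._positions(value))

    def internal_id(client, bloom, vendor, external_id):
        if not bloom.might_contain(f"{vendor}:{external_id}"):
            return None  # definitely unknown: skip the Redis lookup entirely
        value = client.get(f"match:{vendor}:{external_id}")
        return value.decode() if value else None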

Upside Business Travel - Senior Software Engineer - 04/2019-04/2020

Upside Business Travel (now defunct - business travel and global pandemics don't mix) was a business travel booking platform focused on planning entire trips for frequent travelers.

Highlights

Sabre Session Token Pool

Sabre, the primary source of data for the airline industry, uses session tokens to handle authentication and authorization for their API. Each company purchases a fixed number of tokens and is charged, without warning, an overage fee if it tries to use more than that number. Thus, an accurate count of the tokens in use is critical to controlling cost. I designed a distributed session pool manager to accurately enforce the limit while providing tokens to the 180 microservices in our infrastructure. I used TLA+ to design the management algorithm and implemented the service using Redis, Lua, and Node.js.
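
A minimal sketch of the leasing idea (in Python with an embedded Lua script, whereas the production service ran on Node.js); the key names and the crash-protection TTL are illustrative assumptions:

    import redis

    # Atomically pop an idle token and record a lease with a TTL so a crashed
    # service cannot leak tokens forever.
    ACQUIRE = """
    local token = redis.call('SPOP', KEYS[1])
    if not token then return nil end
    redis.call('SET', ARGV[3] .. token, ARGV[1], 'EX', ARGV[2])
    return token
    """

    client = redis.Redis()
    acquire = client.register_script(ACQUIRE)

    def lease_token(service_name: str, ttl_seconds: int = 900):
        """Returns a token, or None if the fixed-size pool is exhausted."""
        return acquire(keys=["sabre:idle"], args=[service_name, ttl_seconds, "sabre:lease:"])

    def release_token(token: bytes) -> None:
        client.delete(b"sabre:lease:" + token)
        client.sadd("sabre:idle", token)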

Booking Engine

Sabre is the source of truth for bookings across the airline industry. Sabre was originally created in 1960 to make American Airlines' booking system available to other airlines. As you can imagine, there are many idiosyncrasies which introduce an impedance mismatch with HTTP/JSON/Node.js microservices: a unique character encoding (the service predates ASCII), 6-bit integers, and a terminal-based UI which travel agents use to manage all bookings. Sabre has been working to modernize their APIs; however, many features still only exist through the low-level Sabre Send Command, SabreCommandLLSRQ. To use this command you need to send the keys that a travel agent would type into the terminal UI.

My project was to build a booking engine service which accepted booking requests from the rest of our services, reserved the requested inventory, and provided actionable error messages for users when the request could not be fulfilled. I built the initial proof of concept and led the implementation of the service across four engineering teams and our data science team.

Fugue Inc. - Senior Software Engineer - 10/2017-02/2019

Fugue provides infrastructure as code with an emphasis on security, auditability, and remediation.

Highlights

Cloud Objects

This service lists every resource in an AWS account to create a snapshot or to diff against a previous snapshot. It was implemented in Python and fully tested using property-based testing with Hypothesis. The project also introduced property-based testing to many of the engineers.
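
Two properties in the spirit of that test suite, as a minimal sketch; the toy diff_snapshots function is an assumption standing in for the real snapshot diffing code:

    from hypothesis import given, strategies as st

    snapshots = st.dictionaries(st.text(min_size=1, max_size=20), st.integers())

    def diff_snapshots(old: dict, new: dict) -> dict:
        """Toy diff: keys added, removed, or changed between two snapshots."""
        return {
            "added": sorted(new.keys() - old.keys()),
            "removed": sorted(old.keys() - new.keys()),
            "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        }

    @given(snapshots)
    def test_diff_with_self_is_empty(snapshot):
        assert diff_snapshots(snapshot, snapshot) == {"added": [], "removed": [], "changed": []}

    @given(snapshots, snapshots)
    def test_diff_recovers_added_keys(old, new):
        combined = {**old, **new}
        assert set(diff_snapshots(old, combined)["added"]) == new.keys() - old.keys()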

Perturbation Testing and Monitoring

One of the primary goals of the product is to detect unintended changes to infrastructure, revert to the last known state, and alert the end user. To test this feature I created a fuzz testing framework which introduced arbitrary changes into an AWS account and measured the remediation latency. The largest constraint on this project was the cost of running complex AWS infrastructure, so the perturbation selection criteria were critical and were ultimately derived from analyzing historical data about bugs and feature changes.
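
A minimal sketch of the perturbation loop, assuming perturbations are supplied as callables and weighted by how often their class of change has caused bugs historically; the timeout and poll interval are illustrative:

    import random
    import time

    def run_perturbation(perturbations, weights, is_remediated, timeout=1800, poll=15):
        """Apply one weighted-random perturbation and time how long remediation takes."""
        perturb = random.choices(perturbations, weights=weights, k=1)[0]
        perturb()                                  # introduce the change into the AWS account
        started = time.monotonic()
        while time.monotonic() - started < timeout:
            if is_remediated():                    # the product has reverted the change
                return time.monotonic() - started  # remediation latency in seconds
            time.sleep(poll)
        raise TimeoutError(f"{perturb.__name__} was not remediated within {timeout}s")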

Distil Networks - Platform Architect - 08/2014-10/2017

Protect websites, mobile applications, and APIs from automated attacks without affecting the flow of business-critical traffic.

Highlights

Growing the Team

When I joined Distil, the platform team consisted of one co-founder and one manager. When I left there were nine engineers across four offices and two continents. I was heavily involved in the recruiting and hiring process for all of them, including organizing an ongoing Rust meetup, working with recruiting to define job descriptions, and working with my manager to define onboarding and training plans.

During my time at Distil I was involved in talks to acquire three different companies. Two were successful and I led the architecture effort to integrate their products into our platform.

Development Environment and Test Suite

As part of onboarding new engineers I built an automated test suite which centralized the ad hoc scripts engineers had created up until that point. The test runner was built primarily with Rust and Selenium, with support for a variety of other headless browsers (when I left there were 12 different browsers in active rotation).

Our product was deployed to multiple machines in a customer's data center, an environment most engineers were not familiar with replicating locally. I built a tool using Go, Python, and LXC (eventually replaced with Docker and Docker Compose) to replicate the deployment environment locally. This allowed the team to focus on processes and messages instead of file paths and virtual machines, and allowed teams to accurately run the test suite locally.

Fault Tolerance

Multi-node deployments were less reliable than single-node deployments despite being sold to customers for their redundancy (critical since our product was directly in line with their production traffic). I determined that the root cause was errors being propagated between machines due to the design of our ZeroMQ messaging system.

I redesigned the data flow to improve reliability. In two-node deployments we replicated input to tolerate a single node failure. In larger deployments we used data replication, service discovery, and service migration to tolerate n/2 - 1 failures.

As part of this work I created a test suite which introduces faults into the network and observes the failover.

Nginx

We used nginx as our primary HTTP request handling process. I was responsible for maintaining the build process for our custom plugins and any patches we created for nginx itself.

Visitor Identification

We used browser fingerprinting as the primary component in identifying visitors. This code had gone untouched for three years and lacked data validation. I collaborated with the data science team to define the set of features which provided the most uniqueness and to validate newly proposed features. This resulted in 2.5x the number of identifiers after normalizing for traffic volume.

However, with the initial approach it was trivial for a bot writer to obtain a new identifier and thus evade our tracking by appearing as a new visitor on each request. I added Hashcash to increase the cost of requesting an identifier. This resulted in a 30% increase in the number of bots detected with no customer complaints about false positives.

To prevent bot writers from resubmitting the same token, we required that each token be unique - effectively deduplicating an infinite stream of data. For this, I implemented a stable Bloom filter in Lua inside Redis.
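
A minimal sketch of the two checks on an identifier request (in Python for readability; a plain SET NX stands in here for the stable Bloom filter that actually ran in Lua inside Redis); the difficulty and key names are assumptions:

    import hashlib
    import redis

    client = redis.Redis()
    DIFFICULTY_BITS = 20  # leading zero bits required of the token hash

    def has_proof_of_work(token: str) -> bool:
        """Hashcash-style check: minting a valid token costs the bot writer CPU time."""
        digest = int.from_bytes(hashlib.sha256(token.encode()).digest(), "big")
        return digest >> (256 - DIFFICULTY_BITS) == 0

    def accept_token(token: str) -> bool:
        if not has_proof_of_work(token):
            return False
        # SET NX fails if the token was already seen, i.e. a replayed submission.
        return client.set(f"seen:{token}", 1, nx=True, ex=86400) is not None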

Data Pipelines

Our product had two primary components: edge nodes which filtered requests in the customer's live production traffic, and a reporting cluster which provided machine learning and customer dashboards. The edge nodes were located in customers' data centers or cloud infrastructure while the reporting cluster was hosted on our own infrastructure. Many features required sending data between the two components: configuration and ML models were sent to the edge nodes, and logs were returned.

To provide visibility and control over this communication we built a data pipeline daemon. This process controlled all data flow from the edge to our reporting cluster and provided a web interface for customers to view and manage all aspects of this data transfer. We chose Go and Kafka for this project.

We also decided to improve the usability of our log stream for internal customers as part of this project. We used protocol buffers to provide the data science team with a strict schema. We also introduced counters for our operations team to quickly diagnose issues.

Customization

Our product had many interconnected features, and providing a flat list of settings to end users and support engineers proved to be confusing and ineffective. Similarly, adding a new feature required understanding every existing and in-development feature to avoid conflicts. This cost us excessive downtime and multiple customers.

To increase understandability of the product I designed a rule engine and worked with the engineers, support team, and analysts to encode our knowledge of the platform into a digestible format and provide a customizable tool for end users.

Latency was the primary external constraint on the design of this system: we had contractual obligations to introduce no more than 20ms of latency into the request flow. I decided to build the rule engine as an nginx module written in Rust to meet the latency requirement. Rust also allowed us to onboard engineers who previously only had experience with application-level programming in Ruby, and it had the unexpected benefit of compiling to WebAssembly, which allowed the library to be reused in the web configuration interface.

The rule engine was implemented as an abstract virtual machine for executing rules. This virtual machine explicitly limited the memory usage and execution time of each rule so that we could accurately predict the impact of any given configuration before deploying it to production. Rules were defined in boolean logic and translated into disjunctive normal form so they could be evaluated in constant space. The compiler was designed to provide user feedback on the rules and to estimate resource usage.
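
A minimal sketch of DNF evaluation (in Python for readability, while the production engine was an nginx module written in Rust); the operators and field names are illustrative assumptions:

    # A rule is a list of clauses (OR); a clause is a list of (field, op, value) literals (AND).
    OPERATORS = {
        "eq": lambda a, b: a == b,
        "gt": lambda a, b: a is not None and a > b,
        "contains": lambda a, b: a is not None and b in a,
    }

    def evaluate_dnf(rule, request):
        """Stream over clauses and literals, so evaluation runs in constant space."""
        return any(
            all(OPERATORS[op](request.get(field), value) for field, op, value in clause)
            for clause in rule
        )

    # Example: match if (country is XX and the path contains /login) or the rate exceeds 100.
    rule = [
        [("country", "eq", "XX"), ("path", "contains", "/login")],
        [("requests_per_minute", "gt", 100)],
    ]
    print(evaluate_dnf(rule, {"country": "XX", "path": "/login/submit", "requests_per_minute": 3}))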

The rule engine shipped to production and met the latency and resource requirements. When I left we were continuing to collect usage and performance data to further improve execution.

Education