Software engineer at Google, initially in the datacenter group and later in
geo.
- Hardware Systems, 2007 to 2010
This group wrote software to support datacenter repairs. We formed the
interface between hardware repair techs and the lower levels of the engineering
stack, such as Borg and GFS.
- Touchpad: This was an application for a handheld scanner and later laptops
for datacenter repair work. I was sole author of the initial version, and
later tech lead for ongoing maintenance. It consisted of a web application
used by techs to locate machines, diagnose them, and scan swapped parts, and a
database to coordinate and record all of that. It was used for all machine
repairs in Google datacenters from 2007 until sometime around 2009.
This was a standard web application built on the Quixote framework and
extensively tested via a custom test framework that parsed and analyzed
HTML in order to follow links and simulate session state. The other interesting feature
was that stateful operations were strictly isolated and instrumented so that
on any error report I could replay the complete session in a debugger,
optionally rewinding time, resetting breakpoints, and comparing expected and
actual values. Due to the absence of JavaScript, pages were otherwise
stateless and easy to debug.
- R3: This was a machine management system. It would accept suspected
broken machines and coordinate automatic testing, reinstallation, automatic
diagnosis, scheduling for manual repairs, etc. Since it delegated the actual
work to other systems, it was basically a distributed state machine, with
facilities to define workflows with complicated control flow and local
customization, along with global structured logging, monitoring, manual override
for exceptions, and the like. Individual steps were required to be stateless,
except for database transactions at start and end. This provided robustness
against crashing code and failing or hung servers.
I was the designer and author of much of the original code, and tech
lead for two or three other developers working on separate components. I later
became a consultant for the software engineering team that took over its
maintenance. It was responsible for the repairs workflow for almost all Google
servers in all datacenters from 2009 until at least 2015.
- Travel to various datacenters for requirements gathering, local liaison, and
training. This included 6 months in Taiwan, assisting local operations and
giving talks in English and Chinese.
- GroundTruth, 2010 to 2017: This group created the map data used by Google
Maps.
- Work on Atlas, the tool used to edit maps. This was a large Java
application that used Swing and OpenGL to visualize and edit road and business
data for the entire world. I was not the original architect, but added
features and fixed bugs at all levels, from OpenGL to Swing layout and event
handling, to Bigtable and Spanner backends, to editing tool implementation and
workflows.
Some examples:
- Implemented live QC, which was a kind of real-time lint for map data.
This was a real-time local version of the existing offline batch-oriented
lints based on global (literally) analysis. This involved both the
implementation and coordination with ops to make it mandatory and automatic
for the applicable workflows. This could short-circuit a lengthy manual QC
cycle by requiring operators to fix automatic lints before committing their
work.
- Implemented some autochecks, which were heuristics to find map
problems. For example, analyze road networks for components with an
entrance but no exit, or inconsistent road priorities.
- Added support for multiple turn restrictions, e.g. bicycles can't
pass at any time, cars can't pass during a certain time range.
- Added support for restriction groups, so a special event can
close off entire sections.
- Implemented QC scoring, a system to prioritize QC work by ranking
map problems by severity.
This work involved meeting with Operations in India and coordinating with local
management to decide on new features, gather requirements, stage rollouts,
monitor performance, and the like.
- Implemented issue search, which converted issues (map metadata used to
track work) to a Dremel database. It did a lot of complicated ad-hoc
aggregation and analysis, and was used both directly by ops and as a foundation
for further analysis and dashboards. This was used to track and distribute
work, and also to make decisions about employee performance and pay.
- Design consultant for contractor projects in Seattle and Hyderabad. This
was mostly database design for Django apps.
- Lots of bug fixing, backend migrations, and miscellaneous small tasks.
During my time at Google, I interviewed around 250 candidates, served as a
Python readability reviewer, and created and maintained some foundational
libraries for parallelism and immutable records in Python.