Ottawa, ON, Canada
Employee - Full Time
Mar 12, 2018

Production Engineering - Service Patterns

Production Engineering at Shopify encompasses the disciplines of site reliability engineering, infrastructure engineering, and developer productivity. Our team ensures that Shopify infrastructure is able to scale massively, while also delivering resilient systems, amazing performance, and impactful tools for our entire engineering team.

The objective of the Service Patterns group is to share with the rest of the organization the tooling, lessons and patterns we’ve used to reliably scale Shopify to over 80,000 requests per second on Rails. Today, new applications are spinning up around the company, and from day one they run into many of the same scaling challenges. In this group, we extract and evolve the tools that have allowed Shopify to scale and provide them to every developer in the company. We want Shopify developers to focus on making commerce better, not on worrying about infrastructure. Providing scalability is our job.

You’ll be responsible for designing tools that help build scalable, maintainable and resilient applications. These tools will be consumed by hundreds of developers and applications across the organization, and will allow them to abstract away the pain points of scale. You will be able to ship changes to production multiple times a day, affecting developers and merchants across the entire platform. Developer productivity is one of our key success criteria, so it’s important that you write high-quality documentation as well as good code.

Some of the challenges our group works on:

  • Evolving our sharding abstraction to allow other applications around the organization to take advantage of the architecture that’s allowed Shopify Core to scale
  • Building resiliency tooling to automatically generate resiliency matrices, and improving Toxiproxy to make creating resilient applications a breeze
  • Moving shop data between shards with minimal disruption for the customer to improve data locality, resiliency, and performance for our merchants
  • Designing the RPC layer that makes talking between our 100s of internal applications a joy, setting up a service mesh to provide circuit breakers to everyone, and enabling Chaos Engineering
  • Failing over shards between datacenters without losing requests
  • Building the tools to make refactoring data at scale easier: traversing billions of records across 100s of Pods without a hitch

You’ll need to have experience with:

  • Building backend web services using several languages and frameworks (some tools we use include Ruby, UNIX commands, Go, Kafka, Python, …)
  • Working with relational databases and SQL
  • Working with web frameworks, or the desire to learn them quickly
  • Linux and systems, enough to be comfortable navigating production infrastructure
  • Digging deep into problems on your own, always hungry to answer another “why”

It’d be amazing if you have:

  • Experience building resilient, scalable services (with tools like Toxiproxy), where concepts like SLAs, fault tolerance, and circuit breakers ring a bell
  • Experience with development on a leading cloud provider (GCE, AWS, Azure, …)
  • Experience understanding and working with the lower levels of relational databases (binlog, topology, performance optimization, replication)
  • Experience reasoning about and working with distributed systems (consensus algorithms like Raft/Paxos, 2PC, ACID, …)
  • Experience operating infrastructure, debugging production systems, and being on-call
  • Experience with concurrent programming (optimistic/pessimistic locking, semaphores, deadlocks)

Our team has spoken at conferences around the world about the work that we’re doing.

How to Apply