About the Job
• Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that services, both internally critical and externally visible, have reliability and uptime appropriate to users' needs and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on the capacity and performance of our systems. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation.
• On the SRE team, you'll have the opportunity to manage the complex challenges of scale that are unique to us, while using your expertise in coding, algorithms, complexity analysis, and large-scale system design.
• SRE's culture of diversity, intellectual curiosity, problem-solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
• Lead sustainable incident response, blameless postmortems, and production improvements that result in direct business opportunities for the organization.
• Manage individual project priorities, deadlines, and deliverables.
• Provide guidance to other team members on managing end-to-end availability and performance of mission-critical services, on building automation to prevent problem recurrence, and on building automated responses for non-exceptional service conditions.
• Bachelor's degree in Computer Science, a related technical field involving software/systems engineering, or equivalent practical experience.
• Experience programming in at least one of the following languages: C, C++, Java, Python, or Go.
• Experience with algorithms and data structures.
• 3-5 years of experience in computing, distributed systems, storage, or networking.
• Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
• Ability to debug, optimize code, and automate routine tasks.
• Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
• Experience with algorithms and data structures and/or Unix/Linux systems internals (e.g., filesystems, system calls) and administration.