Senior Site Reliability Engineer

Walmart Labs
10 yrs exp
About opportunity

Job Summary

As a leader within the Global Technical Engineering Operations (GTEO) SRO team, you will work with other SRO, TDO, SRE, DevOps and Engineering practitioners to manage mission-critical infrastructure, tools, and processes that ensure the highest levels of availability and reliability for all our websites. As a senior member of the team, you will be expected to work with management, peers, and customers to define and implement SRO’s technical vision.

You're right for the job if you are comfortable handling major incident response, leading a technical team of engineers to resolve and restore service across complex distributed architectures. You will work cross-functionally with a variety of teams and be a core contributor to every significant engineering service or solution that we deliver to our stakeholders. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization. You will work directly with our SRE and DevOps teams to manage our next-generation “always up” cloud-based e-commerce platform.

The SRO Manager is responsible for managing a team of engineers through daily proactive monitoring and issue detection, resolving issues before they impact service. You will manage engineer workloads and performance and deliver clear direction to the SRO team. You are responsible for customer follow-up, controlling difficult calls, providing performance metrics, and demonstrating expertise in Service Management processes and procedures to manage service-impacting incidents. Our goal is to build, scale and guard the systems that delight our customers.

To do so, you will need strong skills in the following areas:

  • Deep understanding of incident management processes and procedures.
  • Focus on internal and external customer requirements (SLAs & KPIs).
  • Demonstrate advanced understanding of business processes being supported by assigned system(s)
  • Develop clear tactical and strategic goals for the SRO related to function, capabilities and capacities.
  • Make recommendations regarding improving situational awareness and alerting to potential business impacts, either internal or external influencers.
  • Responsible for immediate coordinated response of critical incidents to reduce impact and increase availability.
  • Responsible for leadership and communications between the business customer and technology teams.
  • Identify and recommend processes or system enhancements for the SRO.
  • Leads the resolution of high complexity Incidents as required.
  • Manages the analysis, communication and resolution of incidents.
  • Manages others in researching and recommending alternative actions for incident resolution.
  • Analyze trends to proactively prevent incidents and to provide historical summary reports.
  • Mentor and grow talent within your team to build a best in class SRO function.
  • Calm under pressure when orchestrating major incident response for mission-critical systems.
  • Function as part of a global SRO management team to deliver continuous improvement.
  • Excellent communication and stakeholder management skills.
  • Technically strong within infrastructure or software engineering.
  • Ability to assess system impact and formulate accurate problem statements to distribute across the management and technical communities.

Additional responsibilities may include

  • Develop a deep understanding of the various services and applications that come together to deliver Walmart e-commerce products
  • Monitor and discover failures/issues in a timely fashion and work with engineers to identify root cause and fix issues
  • Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, software and cloud technologies.
  • High focus on collecting and inferring metrics.
  • Identify and drive the automation of systems that maintain system health.
  • Drives standardization and service focused instrumentation to resolve break/fix scenarios, engaging broader teams where necessary. Contributes to command and control related activities focused on restoration of complex outages. May work independently or as part of a team on more complex projects. Provides mentoring and guidance to more junior team members.
  • Networking responsibilities: Understanding and performing TCP dumps, snoop, and other network sniffers. Understands and applies knowledge of most protocols (TCP/IP, HTTP, UDP, etc.)
  • Application Technologies: Provides recommendations and advice to the team and/or department in the areas of web services, OS, and storage, including being an active liaison to Development, QA and the Business.
  • Analyzes systems and makes recommendations to prevent possible incidents using knowledge of complex and company-wide systems.
  • Lead end-to-end audit of monitors and alarms based on subsystem knowledge.
  • Utilizes time management and project management skills to lead the resolution of incidents in a timely and organized manner, effectively communicating necessary information. May consult directly with developers or third party vendors; provides subject matter expertise.
  • Other duties and responsibilities as assigned.


  • 10+ years in an infrastructure or systems environment delivering operational excellence to highly complex distributed systems.
  • Experience in leading and troubleshooting service-impacting incidents across large-scale enterprise systems.
  • Methodical and systematic problem solving approach, combined with a solid awareness of ownership, initiative and drive.
  • Experience controlling and leading a team to deliver in high-pressure situations, with clear and concise communication to partners and stakeholders.
  • Experience of command and control tools in a production environment.
  • Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
  • Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way. Experience administering Linux systems in a production environment.
  • Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell
  • Bachelor's Degree in Computer Science or a related field, or relevant work experience
  • Experience with cloud technologies such as AWS, Azure, and OpenStack.
  • Experience with enterprise monitoring solutions like AppDynamics, New Relic, Prometheus, Graphite, Nagios, Sensu, Splunk, Grafana and Graylog.
About Walmart Labs

A culture of curiosity and exploration awaits you @WalmartLabs. Our Technology Center in Bengaluru is chartered with building brand new platforms and services on the latest technology stack to support both our e-commerce and stores businesses, worldwide. We bring together online, mobile, and social with Walmart's 11,000+ stores around the world, creating a seamless experience for customers to shop in the way that’s most convenient for them, anytime and anywhere. You can read more about us here.

It's the perfect time for you to join us in our journey to build the next generation of Commerce and change the way millions of people shop. We take immense pride in our women in tech who changed the way the world shops. Here’s what we have to say about their journey and their life amidst the technology they create. :)
