Site Reliability Engineer

About the role:

We are seeking a bright, results-driven, hands-on individual to lead our Web/eCommerce Site Reliability Engineering team. This role will be a point of contact for Site Reliability Engineering and production support of the Web/eCommerce platform, which includes the Adobe Experience Manager (AEM) suite, SAP Commerce Cloud, Solr search, MuleSoft-based service APIs, and related components. The focus is on making the platform more stable and reliable, including providing intermediate to advanced technical support and problem resolution for Web and eCommerce applications. It is a hands-on role that requires a technical skillset, service provider oversight, and cross-functional team relationship management. The role is also responsible for delivering clear, concise, and timely communication to our customers to ensure their confidence in our team’s passion for providing the best customer experience possible.

Responsibilities


  • Lead and mentor the Site Reliability Engineering (SRE) team
  • Establish and improve industry best practices for SRE
  • Own the production platform and its uptime, availability, and stability
  • Manage baselines for uptime, performance, and error rate of the web/eCommerce platform and drive efforts to improve them with the help of other teams, as needed
  • Work on feature requests, defects, and other development tasks, particularly those related to monitoring, reliability, and scalability
  • Use tools to understand customers’ friction points and drive efforts to address them with the help of other teams, as needed
  • Gather customer feedback and friction points from other teams and tools, incorporate them into the product backlog, and drive those items to closure
  • Own incident management, problem management, and change management
  • Implement monitoring and alerting for all technical components and create dashboards and visualizations using monitoring tools (New Relic APM or similar products)
  • Improve observability using monitoring solutions, manage baselines of technical KPIs, and ensure they improve over time with the help of other teams, as needed
  • Act as a primary escalation point for major incidents on the web/eCommerce platform
  • Be available during non-core hours, especially during releases and critical incidents, and participate in 24x7 on-call support
  • Run SWAT for critical incident resolution by collaborating with other teams (IT and/or business) and third-party providers. Make critical decisions to resolve the situation and keep everyone informed throughout the process
  • Lead the root cause investigation and drive the permanent fix with the help of other teams, as needed
  • Participate in architectural reviews and provide feedback to make sure the platform has built-in redundancy and is scalable and fault-tolerant
  • Participate in design reviews and make recommendations to improve the reliability and maintainability of the system. Ensure non-functional requirements (NFRs) are covered and unhappy-path scenarios are coded well to aid potential debugging needs
  • Participate in code reviews and provide feedback to ensure good logging and diagnostics
  • Document knowledge articles and create/maintain operational runbooks
  • Maintain third-party vendor scorecards and provide feedback to their account executives, as needed
  • Collaborate with the product owners, business, developers, and QE, as needed

Education


  • Bachelor’s degree in computer science, information technology, or equivalent experience

Required Skills

  • 10+ years of hands-on experience in providing 24x7x365 L2 production support
  • Recent 5+ years of hands-on experience in providing L2 production support for B2C or B2B eCommerce websites
  • Recent 5+ years of hands-on experience in diagnosing production issues and debugging code
  • Recent 5+ years of experience with the ITIL framework, including incident management, problem management, and change management
  • Recent 5+ years of experience as a Site Reliability Engineer or Senior/Lead Production Support Engineer
  • Recent 5+ years of hands-on experience in Java programming and shell scripting
  • Recent 5+ years of hands-on experience working with relational databases, including an understanding of relational table designs and running SQL
  • Recent 5+ years of experience in driving SWAT and managing SLA/SLO
  • Recent 5+ years of experience in driving root cause analysis and driving closure on a permanent fix
  • Recent 3+ years of hands-on experience in API validation
  • Recent 3+ years of hands-on experience with monitoring tools and log aggregation tools (such as New Relic or Datadog)
  • Recent 3+ years of experience in managing technical KPIs - availability, performance, error rate, etc.
  • Background in web/eCommerce application software development
  • Knowledge of the technology landscape of eCommerce platforms and associated integrations
  • A sense of ownership, along with analytical and problem-solving skills
  • Ability to learn new skills quickly as needed
  • Good communicator, both written and spoken, able to explain complex IT issues in everyday language that the business can understand

Preferred Skills

  • Prior software development experience as an AEM and/or Hybris developer is a big plus
  • Experience in automating routine tasks or processes
  • Experience with CDN and WAF
  • Certifications in cloud, monitoring tools, and programming languages
  • Experience working in agile/scrum environments
  • Experience with Atlassian suite including Bitbucket, Bamboo, Jira, and Confluence
  • Good documentation skills, with experience in building a knowledge base repository
  • Good mentoring skills, with experience in training and mentoring less experienced team members
  • Ability to effectively learn and use new concepts, tools, and methodologies to support the needs of the business
  • Experience leading offshore team members







