Senior Software Engineer - Site Reliability
Why?
Collaborate with top-tier clients, contribute to innovative projects, and leverage cutting-edge technologies to create exceptional digital solutions. Be part of a supportive, inclusive work culture that prioritizes continuous learning, professional growth, and work-life balance.
Introduction
We are seeking an accomplished Senior Site Reliability Engineer (SRE) to lead the design, implementation, and evolution of highly available, scalable, and resilient systems across our multi-cloud infrastructure. In this senior role, you will drive architectural decisions, establish reliability standards, and mentor teams while ensuring operational excellence across complex distributed systems. You will partner with engineering leadership, development teams, and product stakeholders to shape infrastructure strategy, implement sophisticated automation, and champion a culture of reliability engineering.
Key Responsibilities
- Architect and implement highly reliable, scalable, and cost-effective infrastructure solutions for mission-critical applications across multi-cloud environments (AWS and Azure).
- Lead the definition and refinement of service level objectives (SLOs), service level indicators (SLIs), and error budgets, establishing reliability standards across the organization.
- Design and implement sophisticated Infrastructure as Code (IaC) solutions using Terraform, Ansible, and Azure Resource Manager (ARM) templates or Bicep.
- Drive automation strategies to eliminate toil, improve operational efficiency, and enable self-service capabilities for development teams.
- Lead incident response efforts, conduct thorough post-incident reviews, and implement systemic improvements to prevent recurrence.
- Champion cloud-native architectures and modern reliability practices, serving as a technical advisor for infrastructure and platform decisions.
- Mentor junior SREs and engineers, fostering a culture of reliability, observability, and continuous improvement.
- Participate in and help optimize the on-call rotation, ensuring sustainable practices and effective escalation procedures.
- Establish and maintain comprehensive documentation standards, runbooks, and knowledge repositories that enable team autonomy and effective incident response.
- Design and implement advanced monitoring, logging, and alerting strategies using observability platforms to enable proactive issue detection and resolution.
- Lead container orchestration initiatives using Kubernetes (AKS, EKS) and implement sophisticated deployment strategies including blue-green, canary, and progressive delivery patterns.
- Ensure security, compliance, and governance standards are embedded throughout the infrastructure lifecycle, implementing security-as-code practices.
- Drive capacity planning, performance optimization, and cost management initiatives across cloud platforms.
- Collaborate with architecture and security teams to establish platform standards, reference architectures, and best practices.
Skills, Knowledge and Expertise
- 5+ years of proven experience as a Site Reliability Engineer or in a similar role, with demonstrated expertise in designing, implementing, and operating large-scale distributed systems.
- Deep expertise in Infrastructure as Code (IaC) with Terraform and Ansible, including module development, state management, and multi-environment orchestration.
- Extensive hands-on experience with both AWS and Azure cloud platforms, including advanced services, networking, and security features in both environments.
- Expert-level knowledge of container orchestration with Kubernetes, including architecture, custom resource definitions (CRDs), operators, service mesh implementations, and production-scale cluster management.
- Advanced proficiency in Linux system administration, performance tuning, and troubleshooting complex system-level issues.
- Proven experience implementing GitOps workflows using ArgoCD, Flux, or similar tools, including advanced deployment patterns and progressive delivery.
- Deep understanding of observability principles and hands-on experience with tools such as Prometheus, Grafana, Datadog, Azure Monitor, or the ELK stack.
- Expert knowledge of networking concepts, including load balancing, CDNs, DNS, VPNs, service mesh architectures, and distributed systems communication patterns.
- Strong programming and scripting capabilities in Python, Bash, Go, or PowerShell, with the ability to develop custom tooling and automation frameworks.
- Extensive experience designing and optimizing CI/CD pipelines using Jenkins, GitLab CI, Azure DevOps, GitHub Actions, or CircleCI.
- Demonstrated ability to lead incident response, conduct root cause analysis, and drive systemic reliability improvements.
- Excellent communication and leadership skills with proven ability to influence technical decisions and collaborate with stakeholders at all levels.
- Current certification in AWS (Solutions Architect Associate/Professional or equivalent) and Azure (Azure Administrator or Azure Solutions Architect), with practical experience managing production workloads on both platforms.
- Experience with hybrid and multi-cloud networking strategies, including ExpressRoute, Direct Connect, and cloud interconnects.
- Knowledge of serverless architectures on AWS (Lambda) and Azure (Functions, Logic Apps) and their operational considerations.
- Proven experience with disaster recovery planning, business continuity, and implementing multi-region active-active architectures.
- Understanding of machine learning operations (MLOps), data pipeline orchestration, and supporting ML workloads in production.
- Experience with service mesh technologies such as Istio, Linkerd, or Consul.
- Familiarity with chaos engineering principles and tools like Chaos Monkey or Gremlin.
- Experience with configuration management at scale and policy-as-code tools like Open Policy Agent (OPA).
- Knowledge of FinOps principles and cloud cost optimization strategies.
Why Work At Axelerant?
- Excellent work exposure - Our recent clients include the UN, the University of East London, and Doctors Without Borders.
- Meaningful projects to contribute back - Most of our projects are in the education, government, healthcare, and not-for-profit sectors. We also encourage and support team members for open-source contributions.
- Work-life flexibility and remote work - You decide when and where to work. This has allowed many team members, who couldn’t have held a regular job otherwise, to have thriving careers.
- Eight-hour workdays - We don't say 8 hours and expect 12 hours minimum.
- No micromanagement - Micromanagement makes us grunt like the Hulk, so nobody will be looking over your shoulder. But help is always available when you ask.
- No discrimination - We believe in equal pay for equal work. Personal decisions like planning to have children will not stop you from getting promoted.
- Championing inclusivity - We like diversity. It enriches our lives and products. If you see something that is wrong or could be better, even on day 1, share it through established channels to bring positive change. We listen.
- Meaningful time off - 52 weekends and 40 days per year of consolidated leave, plus maternity, paternity, adoption, and sabbatical allowances. We also have Kindness leave for emergencies.
- Family Medical Insurance - You want your family’s health secured. So do we. We’ve got you, your spouse, and your little ones covered. You also get free doctor, health, and wellness consultations from medical experts whenever you need them.
- Performance coaching - Our professional, empathetic coaches will help you become your best version through career and personal development.
- Event sponsorship - If your session at any event is selected and aligns with sponsorship guidelines, we cover all expenses for the trip, whether domestic or international.
- Continuing education allowance - We’ll cover up to 2% of your annual salary each year for classes, certifications, or books that further your capabilities.
- Health and wellness allowance
- Generous home office set-up allowance
- Sponsored team meet-ups
- Co-working space allowance
- Event allowance
Growth can't be one-sided.
When you grow, we grow.
About Axelerant
Axelerant is a global company that puts care into employee happiness, engineering excellence, and customer success. We bring together top talent, success management as our service framework, and an unconventional, empowering work environment to deliver transformational outcomes for our clients and team members alike.
