Description:
monday.com is looking for a Reliability Engineer to join our Reliability team. This role will be integral in ensuring the robustness and dependability of our platform, impacting millions of users globally.
About The Role:
- Maintain a comprehensive understanding of our service architecture and its dependencies.
- Identify and mitigate risks associated with tightly coupled services and complex interconnections.
- Lead service re-architecture initiatives to improve reliability and scalability.
- Review new services and ensure they meet our reliability standards.
- Advocate for Chaos Engineering: collaborate with R&D teams, build tools and environments, and improve system resilience.
- Manage the full lifecycle of reliability tools and services, adhering to our architectural guidelines.
- Collaborate with teams to define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that align with business goals and user expectations.
- Our Stack: Kubernetes, Datadog, Chaos Mesh, AWS, Terraform, CDKTF
Requirements:
- Proven experience with Kubernetes (k8s) and Linux administration and internals.
- Proven experience with microservice architectures and reliability engineering.
- Deep understanding of reliability concepts (e.g., SLOs, SLIs, and service interconnections).
- Strong background in incident response and resilience efforts.
- Ability to collaborate across teams to drive reliability improvements.
- Nice to have: prior experience with chaos engineering.