Description:
We are looking for an exceptional MLOps Team Lead to own, build, and scale the infrastructure and automation that powers AI21 Labs’ state-of-the-art Large Language Models (LLMs) and AI systems.
This is a technical leadership role that blends hands-on engineering with strategic vision. You will define MLOps best practices, build high-performance ML infrastructure, and lead a world-class team working at the intersection of AI research and production-grade ML systems.
You will work closely with LLM Algorithm Researchers, ML Engineers, and Data Scientists to enable fast, scalable, and reliable ML workflows – covering everything from distributed training to real-time inference optimization.
If you have deep technical expertise, thrive in high-scale AI environments, and want to lead the next generation of MLOps, we want to hear from you.
Role and Responsibilities:
MLOps Infrastructure & Automation
- Architect and maintain scalable, self-service ML pipelines, CI/CD workflows, and orchestration frameworks (Kubeflow, MLflow, Airflow).
- Design high-scale distributed training environments, leveraging multi-GPU/TPU clusters and parallelization strategies.
- Optimize ML workflows for speed, scalability, and cost efficiency across cloud (AWS/GCP) and on-prem environments.
Model Deployment & Real-Time Inference
- Build ultra-low-latency, high-throughput inference architectures optimized for LLMs at scale.
- Implement A/B testing, canary releases, and rollback mechanisms for model deployment.
- Develop robust monitoring, logging, and alerting solutions for model performance, drift detection, and reliability.
Cloud & Compute Optimization
- Lead the design and scaling of multi-cloud ML infrastructure using Kubernetes, Terraform, and ArgoCD.
- Optimize GPU/TPU utilization, autoscaling, and resource allocation to maximize efficiency.
- Build and manage feature stores, data pipelines, and large-scale storage solutions.
Leadership & Cross-Team Collaboration
- Work closely with LLM researchers, ML engineers, and platform teams to align MLOps infrastructure with cutting-edge AI research and real-world deployment needs.
- Define and enforce best practices for model governance, security, and compliance.
- Mentor and grow a high-performing MLOps team, driving a culture of technical excellence, automation, and continuous improvement.
Requirements:
- 3+ years of experience in MLOps, ML infrastructure, or AI platform engineering.
- 2+ years of hands-on experience in ML pipeline automation, large-scale model deployment, and infrastructure scaling.
- Expertise in deep learning frameworks (such as PyTorch, TensorFlow, JAX) and MLOps platforms (such as Kubeflow, MLflow, TFX).
- Proven track record of building production-grade ML systems that scale to billions of predictions daily.
- Deep knowledge of Kubernetes, cloud-native architectures (AWS/GCP), and infrastructure as code (Terraform, Helm, ArgoCD).
- Strong software engineering skills in Python, Bash, and Go, with a focus on writing clean, maintainable, and scalable code.
- Experience with observability & monitoring stacks (Prometheus, Grafana, Datadog, OpenTelemetry).
- Strong background in security, compliance, and model governance for AI/ML systems.
Leadership & Execution
- Proven ability to lead high-impact engineering teams in a fast-paced AI environment.
- Ability to drive technical strategy while remaining hands-on in critical areas.
- Strong cross-functional collaboration skills, working closely with research and engineering teams.
- Passion for automation, efficiency, and designing scalable self-service MLOps solutions.
- Experience in mentoring and coaching engineers, fostering a culture of innovation and continuous learning.
It Would Be Great If You Have:
- Experience working with LLMs and large-scale generative AI models in production.
- Expertise in optimizing model inference latency and cost at scale.
- Contributions to open-source MLOps tools or AI infrastructure projects.