Senior AI Engineer, SRE & LLM Infrastructure

Confidential

Tel Aviv-Yafo

Full-time, remote
₪30,000-45,000 (AI-based estimate, not a salary figure provided by the employer)

Keep the AI running. Through every spike, every release, every Monday morning.

Somewhere on the planet right now, someone is paying real money to use software built by the company hiring for this role. Some of that software is powered by large language models — and when those models go down, get slow, or burn budget, customers feel it instantly. We're looking for a senior production engineer with the SRE instincts to keep AI dependable at scale — someone who's run hardware-accelerated workloads in production and has shipped LLM-powered systems through the highs and lows of real traffic.

This isn't a research role. It isn't a model-training role. It's the SRE / production-reliability side of AI Engineering: GPUs, Kubernetes, SLOs, observability, incident response — applied to large language models in production. You've kept high-scale systems alive on hardware before; we're hiring you to do it for the AI layer of our product.

What you'll own:

The reliability and scalability of LLM serving in production — uptime, latency percentiles, cost per million tokens
Operating modern AI infrastructure on GPUs and Kubernetes with real SRE discipline (capacity planning, autoscaling, blast-radius control)
SLOs, observability (TTFT, tokens/sec, error budgets), load testing, and incident response for AI workloads
Hardening the serving stack (vLLM / TensorRT-LLM / Triton) against traffic spikes, noisy neighbors, and the rough edges of GPU operations
Partnering with product and engineering teams to ship AI features that stay up under real load

What we're looking for:

Senior SRE / production-engineering background — strong track record of running services at scale through the messy reality of incidents and growth
Hands-on experience with hardware-accelerated workloads in production — GPUs (NVIDIA), distributed training/serving infrastructure, or equivalent (TPUs, accelerators)
Real LLM context — you've shipped or operated LLM-powered systems and you understand how they fail differently from a normal service
Production cloud + Kubernetes at scale, with the observability and capacity-planning chops to match
Judgment — you've made the calls on architecture, SLOs, and trade-offs before and you've been right more than wrong

Nice to have:

Direct experience with modern LLM serving stacks (vLLM, TensorRT-LLM, Triton, Ray Serve)
Multi-GPU / multi-node serving experience; familiarity with quantization, batching, and inference cost optimization at the operational layer
Prior staff/lead SRE or platform-leadership experience
Open-source contributions to AI infrastructure

Come keep the production layer of the AI revolution running. Real users. Real scale. Real pagers (and real on-call rotation discipline).

Questions and answers about the Senior AI Engineer, SRE & LLM Infrastructure position

What is the core role of a Senior AI Engineer, SRE & LLM Infrastructure in maintaining AI systems at large scale?

The core role of a Senior AI Engineer, SRE & LLM Infrastructure is to ensure the reliability and scalability of large language model (LLM) services in production. This includes responsibility for system availability, latency, and cost per million tokens, and for operating modern AI infrastructure on GPUs and Kubernetes in line with SRE principles.

What skills does a candidate for the Senior AI Engineer, SRE & LLM Infrastructure role need in order to succeed at maintaining hardware-accelerated workloads?

How does the Senior AI Engineer, SRE & LLM Infrastructure role contribute to optimizing the cost and performance of large language models?
