Lightbits is seeking an exceptional Senior Inference Systems Engineer to build advanced infrastructure that improves LLM inference performance through KV cache optimization, offloading, streaming, compression, and scheduling.
In this role, you will work at the intersection of CUDA, GPU architecture, transformer inference, Rust systems programming, and large-scale AI serving platforms. You will design and build systems that intelligently manage KV cache placement across GPU, CPU, storage, and remote memory tiers while maximizing throughput, minimizing latency, and reducing infrastructure costs.
This is a highly hands-on position for someone who enjoys solving deep performance challenges, optimizing every layer of the inference stack, and turning low-level innovations into customer-facing product value. Position based in Israel.
Responsibilities
- Design and implement KV cache offloading, streaming, and memory management infrastructure for large-scale LLM serving.
- Build cache-aware scheduling systems that determine when to keep, evict, prefetch, stream, compress, decompress, or recompute KV cache blocks.
- Optimize inference runtimes such as vLLM and SGLang, including paged attention, prefix caching, schedulers, and cache management systems.
- Develop mechanisms that overlap IO operations with attention execution to maximize GPU utilization and minimize latency.
- Build high-performance components in Rust, C++, and CUDA for scheduling, cache coordination, telemetry, and inference optimization.
- Profile and eliminate bottlenecks across GPU, CPU, memory, networking, storage, and runtime layers.
- Design benchmark frameworks and performance tests for long-context, streaming, multi-turn, and high-concurrency workloads.
- Measure and improve key inference metrics, including time to first token (TTFT), time between tokens / inter-token latency (TBT/ITL), GPU utilization, cache hit rates, and cost per token.
- Collaborate closely with Product, Platform, ML, and Engineering teams to deliver production-ready optimization capabilities.
Qualifications and Experience
- Strong hands-on experience with CUDA programming and GPU performance optimization.
- Deep understanding of transformer inference, attention mechanisms, KV cache architecture, batching, streaming generation, prefill, and decode.
- Experience with vLLM, SGLang, TensorRT-LLM, Triton Inference Server, or similar LLM serving frameworks.
- Experience designing or optimizing KV cache systems, including cache reuse, eviction, prefix caching, radix caching, or cache offloading.
- Strong systems programming skills in Rust, C++, or both.
- Strong Python skills for experimentation, benchmarking, and performance analysis.
- Experience building performance-sensitive schedulers, async IO systems, or distributed infrastructure.
- Strong debugging and profiling skills using tools such as Nsight, CUDA profiling tools, or custom telemetry systems.
- Experience with GPUDirect, RDMA, NVMe, cache compression, FlashAttention, paged attention, or distributed inference architectures is a strong advantage.
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, Electrical Engineering, or a related field.