Design scalable multi-node serving architectures, including prefill/decode disaggregation, distributed KV-cache, cache eviction strategies, and scheduler design for fairness and efficiency
Analyze and optimize memory usage, device utilization, communication overhead (PCIe, RDMA), and runtime behavior in real-world serving environments
Build and maintain benchmarks and simulators for AI serving workloads, and drive data-informed architectural decisions
Work closely with infrastructure, compiler, and hardware teams to co-design end-to-end AI serving systems
Actively contribute to and collaborate with open-source communities (e.g., vLLM, PyTorch, Triton, SGLang), including upstream contributions, bug fixes, and design discussions
Key Qualifications
Master's or higher degree (or equivalent experience) in Computer Science, Electrical Engineering, or a related field
Strong experience with Python, C++, and PyTorch, including model execution and runtime internals
Hands-on experience with inference serving or high-performance ML systems
Familiarity with Linux systems, profiling tools, and debugging performance bottlenecks
Strong problem-solving skills and the ability to reason about system-level trade-offs
Clear communication skills and the ability to collaborate in a fast-paced engineering environment
Ideal Qualifications
Experience with vLLM, SGLang, TensorRT-LLM, or similar LLM serving frameworks
Deep understanding of KV-cache management, attention mechanisms, and memory-efficient inference
Experience with multi-node inference, including tensor/pipeline parallelism