We are seeking a highly skilled NPU Runtime Software Engineer to join our team. You will design and implement the software layer that bridges high-level ML frameworks with our proprietary NPU hardware, enabling the next generation of real-time AI applications. Your work will ensure that state-of-the-art models, with a heavy focus on LLMs, run with industry-leading efficiency, low latency, and high throughput. You will work at the intersection of compilers, system drivers, and distributed inference frameworks, spanning the full runtime stack from graph execution and compiler integration to inference serving.
Responsibilities and Opportunities
- Design and implement the RBLN runtime module that interfaces with compiler and driver components, including the graph executor and runtime APIs, to enable ML model deployment through the RBLN SDK
- Architect and maintain native PyTorch execution support within the runtime, including torch.compile integration and RBLN compiler toolchains, to enable seamless NPU acceleration with minimal user-side code changes
- Design and implement a user-facing profiler that provides actionable performance insights, delivered as part of the RBLN SDK
- Develop and extend vLLM to enhance inference performance on NPUs, including support for key vLLM features such as advanced memory management, parallelism, and dynamic batching
- Design and optimize distributed inference across multi-NPU setups, including collective communication library (CCL) operations to support various parallelism strategies
- Conduct benchmarking and profiling to evaluate runtime system performance and implement optimizations to improve overall system efficiency
- Collaborate with ML engineers and infrastructure teams to deploy and scale inference services
Key Qualifications
- Bachelor's degree or higher in Computer Science, Electrical Engineering, or a related field
- Strong proficiency in C++ and Python
- Strong understanding of deep learning fundamentals and LLM architectures, including Transformer-based models, generative AI, and inference optimization techniques
- Hands-on experience with LLM serving frameworks (e.g., vLLM, TensorRT-LLM)
- Solid understanding of model optimization techniques (tensor parallelism, KV cache optimizations, memory-efficient execution)
- Familiarity with system software components, including compilers, runtimes, drivers, and firmware
- Familiarity with hardware acceleration (GPUs, NPUs, TPUs) and efficient memory management techniques
- Strong debugging and performance profiling skills for high-throughput inference environments
- Ability to work effectively across compiler, driver, and ML engineering teams
- Excellent written and verbal communication skills
Ideal Qualifications
- Practical experience with AI accelerator runtimes and driver APIs, such as those for GPUs
- Direct contributions to, or production experience with, ML frameworks and serving systems such as PyTorch, vLLM, SGLang, TensorRT, and TensorRT-LLM
- Understanding of torch.compile and graph optimizations
- Strong understanding of operating systems, resource management, and high-performance computing concepts
- Advanced proficiency in modern C++ for developing efficient, high-performance systems
- Experience with multithreading and parallel programming
- Experience deploying LLMs in distributed environments
Hiring Process
- Document screening > Online interview > On-site interview (including an assignment) > Culture-fit interview > Offer negotiation > Final acceptance
- The hiring process may differ by role and is subject to change depending on scheduling and circumstances.
- Interview schedules and results will be communicated individually via the email address provided in your application.
Notes
- This posting may close early once the position is filled.
- Any false information in your application may result in the cancellation of an offer.
- Employment may be restricted if you do not hold the qualifications legally required for hiring and for performing the role.
- Veteran and disability status will not result in any disadvantage in the hiring process.
- The scope of responsibilities may be adjusted in light of the candidate's overall career and experience. If such a change is needed, it will be communicated with the candidate at an appropriate time before the final offer is extended.
- For hiring-related inquiries, please contact the email address below.
- [email protected]