We are seeking a highly skilled NPU Runtime Software Engineer to join our team. You will design and implement the software layer that bridges high-level ML frameworks with our proprietary NPU hardware, enabling the next generation of real-time AI applications. Your work will ensure that state-of-the-art models, with a heavy focus on LLMs, run with industry-leading efficiency, low latency, and high throughput. You will work at the intersection of compilers, system drivers, and distributed inference frameworks, spanning the full runtime stack from graph execution and compiler integration to inference serving.
Responsibilities and Opportunities
- Design and implement the RBLN runtime module that interfaces with compiler and driver components, including the graph executor and runtime APIs, to enable ML model deployment through the RBLN SDK
- Architect and maintain native PyTorch execution support within the runtime, including torch.compile integration and RBLN compiler toolchains, to enable seamless NPU acceleration with minimal user-side code changes
- Design and implement a user-facing profiler that provides actionable performance insights, delivered as part of the RBLN SDK
- Develop and extend vLLM to enhance inference performance on NPUs, including support for key vLLM features such as advanced memory management, parallelism, and dynamic batching
- Design and optimize distributed inference across multi-NPU setups, including collective communication library (CCL) operations to support various parallelism strategies
- Conduct benchmarking and profiling to evaluate runtime system performance and implement optimizations to improve overall system efficiency
- Collaborate with ML engineers and infrastructure teams to deploy and scale inference services
Key Qualifications
- Bachelor's degree or higher in Computer Science, Electrical Engineering, or a related field
- Strong proficiency in C++ and Python
- Strong understanding of deep learning fundamentals and LLM architectures, including Transformer-based models, generative AI, and inference optimization techniques
- Hands-on experience with LLM serving frameworks (e.g., vLLM, TensorRT-LLM)
- Solid understanding of model optimization techniques (tensor parallelism, KV cache optimizations, memory-efficient execution)
- Familiarity with system software components, including compilers, runtimes, drivers, and firmware
- Familiarity with hardware acceleration (GPUs, NPUs, TPUs) and efficient memory management techniques
- Strong debugging and performance profiling skills for high-throughput inference environments
- Ability to work effectively across compiler, driver, and ML engineering teams
- Excellent written and verbal communication skills
Ideal Qualifications
- Practical experience with AI accelerator runtimes and driver APIs, such as those for GPUs
- Direct contributions to, or production experience with, ML frameworks and serving systems such as PyTorch, vLLM, SGLang, TensorRT, and TensorRT-LLM
- Understanding of torch.compile and graph optimizations
- Strong understanding of operating systems, resource management, and high-performance computing concepts
- Advanced proficiency in modern C++ for developing efficient, high-performance systems
- Experience with multithreading and parallel programming
- Experience deploying LLMs in distributed environments
Hiring Process
- Document screening > Online interview > On-site interview (including an assignment) > Culture-fit interview > Offer negotiation > Final acceptance
- The hiring process may differ by role and is subject to change depending on scheduling and circumstances.
- Interview schedules and results will be communicated individually via the email address provided in your application.
Notes
- This posting may close early once the position is filled.
- Any false information in your application may result in the cancellation of an offer.
- Employment may be restricted if you do not hold the qualifications legally required for hiring and for performing the role.
- Veteran and disability status will not result in any disadvantage in the hiring process.
- The scope of responsibilities may be adjusted in light of the candidate's overall career and experience. If such a change is needed, it will be communicated with the candidate at an appropriate time before the final offer is extended.
- For hiring-related inquiries, please contact the email address below.
- [email protected]