Requirements:
2+ years in MLOps, DevOps, or backend engineering for AI workloads
Proficient in DeepStream 7.x and containerization with Docker
Strong programming skills in Python and bash, with CI/CD scripting experience
Experience deploying and optimizing CNNs and LLMs in production environments.
Key responsibilities:
Build and automate inference pipelines for computer vision models
Migrate and optimize Triton workloads to DeepStream with minimal downtime
Serve and optimize large language models using quantization and pruning techniques
Automate build/test/release processes and support model lifecycle management.
Dicetek is a global IT solutions and services company, incorporated
in 2006 and headquartered in Singapore. We continue to expand
our global network while providing high value-added consulting
services that assist our clients in expanding their business
operations globally.
DICETEK has established offices in India, UAE, Singapore & the USA.
As a world-class company with a regional focus, we specialize in
providing IT Solutions, IT Consulting, and Professional Services
across verticals such as Banking & Financial Services, Telecom,
Government, Oil & Gas, Logistics, Supply Chain, Manufacturing,
and Sales Automation.
We have a solid performance and reputation in the technology
industry for providing excellent services to our clients.
Our values are represented by our integrity, thought leadership, and
commitment to maintaining a high level of excellence in the constantly
evolving world of information technology.
With more than 16 years in the industry, we have established a
successful track record of consulting services delivery across a
variety of technical roles in the public and private sectors.
Dicetek has a specialist team of IT Consultants who offer both
international experience and a deep understanding of the local
market.
To find the most up-to-date job opportunities, please review the link below.
https://dicetek.talentrecruit.com/career-page
Solid grasp of containerization (Docker) & GPU scheduling
Proven track record squeezing latency/throughput on NVIDIA GPUs (TensorRT, mixed precision, CUDA toolkit)
Hands-on experience deploying YOLO or comparable CNNs in production
Experience self-hosting and serving LLMs (vLLM, TensorRT-LLM, or similar), plus quantization/pruning/distillation
Strong Python & bash; confidence with CI/CD scripting
Nice To Have
Exposure to cloud GPUs (AWS / GCP / Azure)
Experience with edge devices (Jetson, Xavier, Orin)
Performance profiling with Nsight Systems / DCGM
Knowledge of Triton Inference Server internals
Familiarity with distributed training (PyTorch DDP, DeepSpeed)
Basic frontend/REST gRPC API design skills
What You Will Do
Build & automate inference pipelines
Design, containerize, and deploy CV models (YOLOv8/v11, custom CNNs) with DeepStream 7.x, optimizing for lowest latency and highest throughput on NVIDIA GPUs.
Migrate existing Triton workloads to DeepStream with minimal downtime.
Serve and optimize large language models
Self-host Llama 3.2, Llama 4, and future LLMs/VLMs on the cluster using best-practice quantization, pruning, and distillation techniques.
Expose fast, reliable APIs and monitoring for downstream teams.
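As a toy illustration of the quantization techniques this responsibility mentions, the sketch below shows symmetric post-training INT8 quantization in pure Python. It is for intuition only: the function names are illustrative, and real serving stacks such as vLLM or TensorRT-LLM implement far more sophisticated per-channel and activation-aware schemes.

```python
# Toy sketch of symmetric per-tensor INT8 quantization (illustrative only;
# not the vLLM or TensorRT-LLM implementation).

def quantize_int8(weights):
    """Map float weights to int8 values plus a per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.98, -1.27, 0.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, approx))
```

The key trade-off this illustrates: weights shrink 4x (float32 to int8), at the cost of a bounded rounding error per weight.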
Continuous delivery & observability
Automate build/test/release steps and set up health metrics, logs and alerts so models stay stable in production.
Allocate GPU resources efficiently across CV and LLM services.
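To make the "health metrics" part of this responsibility concrete, here is a minimal latency-tracking sketch. The class and method names are illustrative; in a real deployment these numbers would be exported to a monitoring stack (e.g. Prometheus dashboards, or DCGM for GPU-level counters) rather than held in memory.

```python
# Minimal sketch of per-request latency tracking for an inference service
# (names are illustrative, not from any specific monitoring library).
import math

class LatencyTracker:
    def __init__(self):
        self.samples_ms = []

    def observe(self, latency_ms):
        """Record one request's end-to-end latency in milliseconds."""
        self.samples_ms.append(latency_ms)

    def p95(self):
        """95th-percentile latency, a common SLO metric for inference APIs."""
        xs = sorted(self.samples_ms)
        idx = max(0, math.ceil(0.95 * len(xs)) - 1)
        return xs[idx]

tracker = LatencyTracker()
for ms in [10, 12, 11, 50, 9, 13, 11, 12, 10, 11]:
    tracker.observe(ms)
```

Tracking a tail percentile rather than the mean matters here: a single slow request (50 ms above) dominates p95 while barely moving the average, which is exactly the kind of regression alerting should catch.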
Model lifecycle support (10–20%)
Assist data scientists with occasional fine-tuning or retraining runs and package models for production.
Required profile
Spoken language(s):
English