Offer summary

Qualifications:

Bachelor's or Master's degree in Computer Science, Data Sciences, or related fields.

7+ years of experience in deploying ML/DL and LLM solutions in large-scale environments.

Strong experience with ML Ops tools and LLM-specific frameworks.

Proficient in containerization and CI/CD practices.

Key responsibilities:

Develop and manage scalable deployment strategies for LLMs.

Optimize LLM inference performance and manage vector databases.

Design and maintain CI/CD pipelines for ML model workflows.

Collaborate with Data Scientists to streamline model development.

Job description

Summary

Gainwell is seeking LLM Ops Engineers and ML Ops Engineers to join our growing AI/ML team. This role is responsible for developing, deploying, and maintaining scalable infrastructure and pipelines for Machine Learning (ML) models and Large Language Models (LLMs). You will play a critical role in ensuring smooth model lifecycle management, performance monitoring, version control, and compliance while collaborating closely with Data Scientists, DevOps, and

Role Description:

Core LLM Ops Responsibilities:

•   Develop and manage scalable deployment strategies specifically tailored for LLMs (GPT, Llama, Claude, etc.).
•   Optimize LLM inference performance, including model parallelization, quantization, pruning, and fine-tuning pipelines.
•   Integrate prompt management, version control, and retrieval-augmented generation (RAG) pipelines.
•   Manage vector databases, embedding stores, and document stores used in conjunction with LLMs.
•   Monitor hallucination rates, token usage, and overall cost optimization for LLM APIs or on-prem deployments.
•   Continuously monitor model performance and ensure an alerting system is in place.
•   Ensure compliance with ethical AI practices, privacy regulations, and responsible AI guidelines in LLM workflows.

Core ML Ops Responsibilities:

•   Design, build, and maintain robust CI/CD pipelines for ML model training, validation, deployment, and monitoring.
•   Implement version control, model registry, and reproducibility strategies for ML models.
•   Automate data ingestion, feature engineering, and model retraining workflows.
•   Monitor model performance and drift, and ensure proper alerting systems are in place.
•   Implement security, compliance, and governance protocols for model deployment.
•   Collaborate with Data Scientists to streamline model development and experimentation.
•   Leadership skills: work as a team lead, interface with team leads of other functions and departments, understand business requirements and cost sensitivity, and translate them into a solution that is feasible to develop and deploy.

What We’re Looking For

•   Bachelor's or Master's degree or higher in Computer Science, Data Science, Machine Learning, Engineering, or a related field.
•   Strong experience with ML Ops tools (Kubeflow, MLflow, TFX, SageMaker, etc.).
•   Experience with LLM-specific tools and frameworks (LangChain, LangGraph, LlamaIndex, Hugging Face, OpenAI APIs, and vector databases such as Pinecone, FAISS, Weaviate, and Chroma).
•   Solid experience in deploying models in cloud (AWS, Azure, GCP) and on-prem environments.
•   Proficient in containerization (Docker, Kubernetes) and CI/CD practices.
•   Familiarity with monitoring tools like Prometheus, Grafana, and ML observability platforms.
•   Strong coding skills in Python and Bash, and familiarity with infrastructure-as-code tools (Terraform, Helm, etc.).
•   Knowledge of healthcare AI applications and regulatory compliance (HIPAA, CMS) is a plus.
•   Strong skills in LLM evaluation and testing tools such as Giskard and DeepEval.
•   Understanding of business use cases and cost sensitivity, strong interpersonal and architecture skills, and the ability to persuade multiple stakeholders.

Qualifications

•   Bachelor's or Master's degree or higher in Computer Science, Data Science, or a related field.
•   7+ years (up to 10-11 years) of experience deploying ML/DL and LLM-based solutions in large-scale environments, or related experience.

•   Experience with fine-tuning LLMs and serving them in production at scale.
•   Knowledge of model compression techniques for LLMs (LoRA, QLoRA, quantization-aware training).
•   Experience with distributed systems and high-performance computing for large-scale model serving.
•   Awareness of AI fairness, explainability, and governance frameworks.

What You Should Expect in This Role

•   Fully Remote Opportunity – Work from anywhere in the U.S. or India.
•   Minimal Travel Required – Occasional travel opportunities (0-10%).
•   Opportunity to Work on Cutting-Edge AI Solutions in a mission-driven healthcare technology environment.

Required profile