Offer published on 2026-04-06
Python Developer (AI Evaluation Frameworks)
Location: Pune, India
Contract Type: Regular
Job summary:
We are seeking a seasoned Python Engineer with 5–7 years of professional experience and exposure to QA practices to join our team focused on the development of AI evaluation frameworks. The ideal candidate combines hands‑on Python engineering skills, a QA mindset, and practical familiarity with GenAI/LLM concepts and Azure cloud services. You will design, build, and maintain scalable evaluation systems and work closely with QA teams and stakeholders to ensure robust, repeatable assessment of AI components.
Key responsibilities
Design and implement AI evaluation frameworks and tooling for model assessment, benchmarking, and automated testing of LLMs, agents, and GenAI features.
Build production‑grade Python applications and APIs to support evaluation pipelines and integrations.
Collaborate with QA teams to brainstorm current evaluation challenges and build reproducible evaluation workflows.
Implement end‑to‑end evaluation pipelines including data preprocessing, metric computation, test orchestration, and reporting.
Ensure code quality and maintain coding standards through static analysis, unit/integration tests, code reviews, and tooling (e.g., SonarQube).
Contribute to design and implementation of APIs and services.
Deploy and operate evaluation components on Azure, leveraging platform services and following infrastructure‑as‑code practices.
Instrument monitoring, logging, and alerting for evaluation pipelines; capture audit trails and results for compliance and reproducibility.
Partner with data scientists, ML engineers, and product stakeholders to gather requirements, validate evaluation approaches, and incorporate feedback.
Support peers in troubleshooting and resolving issues across development and QA; mentor junior developers and share best practices.
Maintain documentation for evaluation frameworks, runbooks, and related materials.
Build, execute, optimize, and monitor unit tests and test plans to ensure quality, security, and consistency; detect, analyze, report, and resolve malfunctions, incidents, and bugs.
Required qualifications
5–7 years of professional Python development experience with strong, demonstrable hands‑on skills.
Solid understanding of OOP concepts, software design principles, and coding best practices.
Experience with test‑driven development, writing unit and integration tests, and collaborating with QA teams on automated testing.
Familiarity with the full project lifecycle: requirements, design, development, code review, deployment, maintenance, and deprecation.
Experience building RESTful APIs using FastAPI, Flask, or Django.
Practical experience with Azure cloud services and deployment patterns (App Services, AKS, Azure Functions, Blob/Storage, DevOps pipelines).
Exposure to CI/CD tooling and code quality tools such as SonarQube.
Working knowledge of AI/DS concepts—particularly GenAI, LLMs, RAG patterns, and agent architectures.
Strong problem‑solving and debugging skills, and the ability to work across distributed systems.
Excellent communication skills and demonstrated ability to work closely with QA, data science, and product teams.
Desirable (good‑to‑have)
Experience with LLM frameworks such as LangChain, LlamaIndex, or similar.
Familiarity with observability tools and ML/LLM monitoring.
Prior experience designing evaluation metrics for NLP/LLM tasks (e.g., BLEU/ROUGE, embeddings‑based similarity, human evaluation orchestration).
Prior knowledge of and experience working on traditional AI/ML systems.
Behavioral competencies
Mindset: attention to detail, focus on testability and reproducibility, and a strong commitment to accuracy, quality, and safety.
Collaborative: able to partner effectively with QA, ML, and product stakeholders.
Proactive communicator: gathers feedback, surfaces risks early, and drives adoption of evaluation tooling.
Mentorship orientation: supports and uplifts team members through knowledge sharing.