About the Role
We are seeking a highly skilled Site Reliability Engineer with experience applying Generative AI
(GenAI) to automate and enhance the reliability of complex data platforms. You will be responsible for building self-healing infrastructure and AI-powered observability, and for automating incident response across data pipelines (e.g., Databricks, Glue, Kafka, Flink).
This is a high-impact role where you will shape the future of data reliability at the company,
mentor engineers, and lead initiatives that span multiple teams and domains.
Key Responsibilities
Platform Reliability & Automation
• Design, implement, and operate reliable, scalable, and observable data platforms.
• Automate incident triage, remediation, and postmortems using GenAI-powered tools.
• Develop intelligent runbooks and self-healing workflows using LLMs.
GenAI-Enabled SRE Practices
• Build and integrate GenAI copilots for on-call support, anomaly detection, and RCA
(root cause analysis).
• Fine-tune or prompt-engineer LLMs for specific use cases such as summarizing logs,
interpreting metrics, or generating remediation steps.
• Leverage vector databases (e.g., FAISS, Weaviate) to retrieve telemetry and incident
history for GenAI prompts.
Observability & Anomaly Detection
• Integrate GenAI with observability tools (e.g., Datadog, Prometheus, Grafana,
OpenTelemetry).
• Build systems for natural language querying of platform health and pipeline performance.
• Collaborate with data engineers to monitor SLIs/SLOs across ingestion, transformation,
and delivery layers.
CI/CD & Risk Management
• Integrate GenAI into CI/CD pipelines to generate blast radius analyses and deployment
guardrails.
• Use LLMs to assess the risk of configuration or schema changes before production
rollout.
• Automate validation and rollback strategies based on historical outcomes.
Qualifications
• 5+ years in SRE, DevOps, or Data Engineering roles with a strong focus on automation and
observability.
• Solid experience in cloud-native data platforms (e.g., Databricks, Glue, Kafka, Flink, S3,
Lambda).
• Proven experience using or integrating GenAI tools (OpenAI, Claude, Hugging Face
Transformers).
• Proficiency in Python or Scala; experience with Spark and Airflow a plus.
• Familiarity with LLM techniques: prompt engineering, embeddings, retrieval-augmented
generation (RAG).
• Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana,
Datadog).
• Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
Preferred Qualifications
• Experience fine-tuning LLMs or integrating GenAI agents into production systems.
• Familiarity with vector databases (e.g., Pinecone, Qdrant, FAISS).
• Knowledge of data quality frameworks and lineage tools (e.g., Deequ, Great Expectations, Amundsen, Unity Catalog).
• Understanding of ITIL/incident management frameworks.
• Strong communication and documentation skills, particularly for on-call handoffs and
postmortem reviews.