Staff Site Reliability Engineer

About the Role

We are seeking a highly skilled Site Reliability Engineer with experience applying Generative AI (GenAI) to automate and enhance the reliability of complex data platforms. You will be responsible for building self-healing infrastructure and AI-powered observability, and for automating incident response across data pipelines (e.g., Databricks, Glue, Kafka, Flink).

This is a high-impact role where you will shape the future of data reliability at the company, mentor engineers, and lead initiatives that span multiple teams and domains.

Key Responsibilities

Platform Reliability & Automation

• Design, implement, and operate reliable, scalable, and observable data platforms.

• Automate incident triage, remediation, and postmortems using GenAI-powered tools.

• Develop intelligent runbooks and self-healing workflows using LLMs.

GenAI-Enabled SRE Practices

• Build and integrate GenAI copilots for on-call support, anomaly detection, and root cause analysis (RCA).

• Fine-tune or prompt-engineer LLMs for specific use cases such as summarizing logs, interpreting metrics, or generating remediation steps.

• Leverage vector databases (e.g., FAISS, Weaviate) to retrieve telemetry and incident history for GenAI prompts.

Observability & Anomaly Detection

• Integrate GenAI with observability tools (e.g., Datadog, Prometheus, Grafana, OpenTelemetry).

• Build systems for natural language querying of platform health and pipeline performance.

• Collaborate with data engineers to monitor SLIs/SLOs across ingestion, transformation, and delivery layers.

CI/CD & Risk Management

• Integrate GenAI into CI/CD pipelines to generate blast radius analyses and deployment guardrails.

• Use LLMs to assess the risk of configuration or schema changes before production rollout.

• Automate validation and rollback strategies based on historical outcomes.

Qualifications

• 5+ years in SRE, DevOps, or Data Engineering roles with a strong focus on automation and observability.

• Solid experience with cloud-native data platforms and services (e.g., Databricks, Glue, Kafka, Flink, S3, Lambda).

• Proven experience using or integrating GenAI tools (e.g., OpenAI, Claude, Hugging Face Transformers).

• Proficiency in Python or Scala; experience with Spark and Airflow is a plus.

• Familiarity with LLM techniques: prompt engineering, embeddings, retrieval-augmented generation (RAG).

• Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog).

• Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).

Preferred:

• Experience fine-tuning LLMs or integrating GenAI agents into production systems.

• Familiarity with vector databases (e.g., Pinecone, Qdrant, FAISS).

• Knowledge of data quality frameworks and lineage tools (e.g., Deequ, Great Expectations, Amundsen, Unity Catalog).

• Understanding of ITIL/incident management frameworks.

• Strong communication and documentation skills, especially in on-call and postmortem environments.

About the Job

Location: Ho Chi Minh City
Contract Details: Headhunt
Created On: 2026-03-18
Working Model: WFO (work from office)
Job Level: Senior