Staff Site Reliability Engineer

About the Role

We are seeking a highly skilled Site Reliability Engineer with experience applying Generative AI (GenAI) to automate and enhance the reliability of complex data platforms. You will be responsible for building self-healing infrastructure and AI-powered observability, and for automating incident response across data pipelines (e.g., Databricks, Glue, Kafka, Flink).

This is a high-impact role where you will shape the future of data reliability at the company, mentor engineers, and lead initiatives that span multiple teams and domains.

Key Responsibilities

Platform Reliability & Automation

• Design, implement, and operate reliable, scalable, and observable data platforms.

• Automate incident triage, remediation, and postmortems using GenAI-powered tools.

• Develop intelligent runbooks and self-healing workflows using LLMs.

GenAI-Enabled SRE Practices

• Build and integrate GenAI copilots for on-call support, anomaly detection, and RCA (root cause analysis).

• Fine-tune or prompt-engineer LLMs for specific use cases such as summarizing logs, interpreting metrics, or generating remediation steps.

• Leverage vector databases (e.g., FAISS, Weaviate) to retrieve telemetry and incident history for GenAI prompts.

Observability & Anomaly Detection

• Integrate GenAI with observability tools (e.g., Datadog, Prometheus, Grafana, OpenTelemetry).

• Build systems for natural language querying of platform health and pipeline performance.

• Collaborate with data engineers to monitor SLIs/SLOs across ingestion, transformation, and delivery layers.

CI/CD & Risk Management

• Integrate GenAI into CI/CD pipelines to generate blast radius analyses and deployment guardrails.

• Use LLMs to assess the risk of configuration or schema changes before production rollout.

• Automate validation and rollback strategies based on historical outcomes.

Qualifications

• 5+ years in SRE, DevOps, or Data Engineering roles with a strong focus on automation and observability.

• Solid experience in cloud-native data platforms (e.g., Databricks, Glue, Kafka, Flink, S3, Lambda).

• Proven experience using or integrating GenAI tools (OpenAI, Claude, HuggingFace Transformers).

• Proficiency in Python or Scala; experience with Spark and Airflow a plus.

• Familiarity with LLM techniques: prompt engineering, embeddings, retrieval-augmented generation (RAG).

• Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog).

• Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).

Preferred:

• Experience fine-tuning LLMs or integrating GenAI agents into production systems.

• Familiarity with vector databases (e.g., Pinecone, Qdrant, FAISS).

• Knowledge of data quality frameworks and lineage tools (e.g., Deequ, Great Expectations, Amundsen, Unity Catalog).

• Understanding of ITIL/incident management frameworks.

• Strong communication and documentation skills, especially in on-call and postmortem environments.


Contract Details: Headhunt


About the job

Location: Thành phố Hồ Chí Minh
Created On: 2026-03-18
Working Model: WFO (work from office)
Job Level: Senior
