Tag: SRE

Antigravity DevOps SRE July 13, 2026

Building an Autonomous SRE Agent with Google ADK and the Antigravity SDK - A walk through building an SRE agent that works autonomously in your cloud.

Antigravity Google Kubernetes Engine Kubernetes SRE July 6, 2026

100x SRE: Building an Agentic GKE Capacity Optimizer with Google Antigravity 2.0 - This article details how to build an agentic GKE capacity optimizer using Google Antigravity 2.0 and Custom Compute Classes. This solution addresses the problem of "Pending Pods" and manual resource management by dynamically generating optimal GKE manifests based on workload demands and market conditions. It provides a human-in-the-loop approval process, shifting cognitive load away from SREs and making Kubernetes cluster provisioning more efficient and cost-effective.

DevOps GKE Autopilot SRE June 22, 2026

Activepieces on GKE Autopilot: Six Engineering Disciplines for Production Workflow Automation - This article details deploying Activepieces, a workflow automation platform, on Google Cloud's GKE Autopilot for a robust production environment. It highlights how integrating six key engineering disciplines—Platform Engineering, GitOps, SRE, DevSecOps, CI/CD, and FinOps—creates a secure, reliable, and cost-effective solution. This "golden path" approach simplifies operations, enabling self-service provisioning of Activepieces instances with built-in security, observability, and automatic cost attribution.

AI DevOps SRE June 15, 2026

Google Published Their AI SRE Blueprint. Here's the Line-by-Line Mapping to What the Community Has Been Building

DevOps SRE June 15, 2026

Choosing the Right GCP Infrastructure & Architecture for a Startup - Part 1 of the series: “How We Built Production Infrastructure for a Startup from Scratch”.

Kubernetes SRE June 15, 2026

Your Kubernetes Cluster Is Probably Dropping Logs — Mine Was

DevOps Official Blog SRE June 1, 2026

AI in SRE: Where and how Google is deploying agentic AI to improve operations - With SRE AI, Google plans to fully adopt AI and agentic technologies, leveraging AI as a force multiplier while also maintaining control.

DevOps Infrastructure Official Blog SRE May 25, 2026

How Google Does It: Fleet-wide, large-scale A/B experimentation - Learn how Google validates critical changes to the fleet by performing A/B experimentation on the infrastructure itself.

Cloud Dataproc Serverless Spark SRE May 11, 2026

F1 Telemetry and Tuning for your Spark cluster: the BigQuery Log Analytics setup that costs nothing - This article outlines a cost-effective method for gathering detailed telemetry from Spark clusters on Google Cloud. By leveraging a small custom Spark listener and five GCP Log Analytics queries, it extracts crucial job and cluster events for data-driven tuning and resource allocation. This approach provides actionable insights for autoscaled Dataproc clusters at a significantly lower cost than traditional BigQuery export pipelines.

DevOps SRE May 4, 2026

When DevSecOps Met SRE: How We Hunted Down a GCP Security Incident and Made Our System Bulletproof - A real-world story of applied SRE principles, shift-left security, and blameless postmortems on Google Cloud Platform.

AI Databricks SRE April 27, 2026

【AI Deployment】MLOps#5:LGTM Stack for LLMOps - Background.

MCP SRE April 20, 2026

From Incident to Pull Request: Building an AI-Powered SRE Agent on GCP - Automating detection, root cause analysis, and code fixes using Google Cloud, MCP, and LLMs.

DevOps LLM SRE April 13, 2026

Deploying AI Models on GKE Autopilot with Ollama: A Hands-On Lab for SRE Engineers - How I ran Gemma 2 (9B and 27B) on an NVIDIA L4 GPU in Kubernetes — for the price of a coffee.

DevOps Kubernetes SRE March 23, 2026

Scaling SRE Systems with GCP + Kubernetes: Lessons from Running at 10x Traffic

DevOps SRE March 23, 2026

AI Can Build Your App — But Can It Keep It Reliable? Intro to SLI, SLO, and SLA - In today’s engineering landscape, building applications has become dramatically faster. With AI-powered tools like Claude, Copilot, and….

SRE March 9, 2026

👍DON’T PANIC: Why AI is the Secret to Antifragile Systems - Escaping the “Golgafrincham Trap” by moving beyond mere efficiency to build systems that actually thrive under stress.

DevOps Official Blog SRE March 9, 2026

Unified Maintenance: A new, unified way to manage maintenance across Google Cloud - Unified Maintenance, now GA, is a centralized dashboard that lets you view and manage maintenance events across your Google Cloud services.

DevOps Kubernetes Paywall SRE March 2, 2026

From Zero to Disaster Recovery: Backing Up PostgreSQL on Kubernetes to Amazon S3 (With a Real… - Most teams say they have backups.

DevOps SRE Feb. 23, 2026

Stop Deploying Manually: Build a Clean, Production-Ready CI/CD Pipeline in GCP - There’s a moment in every engineer’s journey when manual deployment starts to feel wrong.

Kubernetes Prometheus SRE Feb. 2, 2026

Building Reliable Distributed Storage with MinIO (6) - Prometheus Integration and Metric Scraping (Docker + Kubernetes + Prometheus).

Gemini CLI Generative AI Official Blog SRE Jan. 26, 2026

How Google SREs Use Gemini CLI to Solve Real-World Outages - See how Google SREs use Gemini CLI and Gemini 3 to automate incident response, from paging to postmortem. Learn how AI helps eliminate toil and reduce Bad Customer Minutes safely.

DevOps Paywall SRE Jan. 18, 2026

Google Cloud for DevOps: The Essential Knowledge Stack - Transitioning to Google Cloud Platform (GCP) requires more than just knowing where the buttons are; it requires a shift toward Google’s….

DevOps Official Blog SRE Dec. 15, 2025

Is your DR plan just wishful thinking? Prove your resilience with chaos engineering - Controlled chaos engineering experiments that simulate real-world disasters quantitatively measure the impact of failures on system performance.

DevOps SRE Oct. 27, 2025

Google Cloud Support: The Insurance Policy Your Cloud Workloads Need - A 5 Minute Guide to Selecting the Right Google Cloud Support Coverage.

Billing DevOps Paywall SRE Oct. 20, 2025

Stop Google Cloud Bills Before They Spiral: Build Your Kill Switch - The article discusses the need for a "kill switch" to automatically disable billing in Google Cloud when budgets are exceeded, as budgets only provide alerts and don't prevent overspending.

DevOps Official Blog SRE Oct. 20, 2025

Chaos engineering on Google Cloud: Principles, practices, and getting started - By deliberately introducing failures into production systems, chaos engineering helps you face production incidents calmly and confidently.

Cloud Operations DevOps SRE Sept. 29, 2025

How to Configure Essential Contacts in Google Cloud to Receive Important Email Notifications - Essential Contacts defines who receives important notifications from Google so nothing critical gets missed.

DevOps Official Blog SRE Aug. 18, 2025

How Google does it: Your guide to platform engineering - At PlatformCon 2025, we talked about “shift down” — one of the guiding principles behind Google’s approach to platform engineering.

DevOps Official Blog SRE Aug. 18, 2025

Beyond guardrails: A taxonomy of platform engineering control mechanisms - Learn how to control the platform engineering application lifecycle with golden paths, guardrails, safety nets, and manual checkpoints and reviews.

Monitoring Official Blog SRE July 21, 2025

Application monitoring in Google Cloud: Bridging manual and AI-assisted troubleshooting - Cloud Observability’s curated Application Monitoring dashboards improve troubleshooting with best practices from Google SREs.

GCP Experience Google Kubernetes Engine Official Blog SRE June 30, 2025

Using Platform Engineering to simplify the developer experience - part one - Learn how John Lewis transformed its e-commerce platform with Google Cloud, GKE, and a simplified developer platform engineering approach.

Cloud SQL SRE June 30, 2025

PostgreSQL hygiene at Plum - Plum's Site Reliability Engineering team describes how it improved the performance and reduced the costs of its Google Cloud SQL PostgreSQL databases.

DevOps FinOps SRE May 12, 2025

How We Cut Our GCP Bill by 30% - How We Turned Cost Optimization into a Team Superpower.

AWS Infrastructure SRE April 21, 2025

Migration from Amazon Web Services provider to Google Cloud Platform - A Tale of Two Cloud Providers: How we made the pivotal decision to migrate several of our major European workloads from AWS to GCP.

DevOps Gemini Official Blog SRE April 14, 2025

Delivering an application-centric, AI-powered cloud for developers and operators - Google Cloud introduces an application-centric, AI-powered cloud experience for developers and operators. New features in Gemini Code Assist and Gemini Cloud Assist provide AI assistance throughout the application lifecycle, from design and development to deployment and management.

Machine Learning Official Blog SRE Feb. 24, 2025

An SRE’s guide to optimizing ML systems with MLOps pipelines - This article discusses how to apply Site Reliability Engineering (SRE) principles to optimize machine learning (ML) systems and pipelines. It covers various aspects such as training ML models, ensuring data freshness, optimizing serving efficiency, achieving cost efficiency, and implementing automation for scale.

DevOps Official Blog SRE Jan. 27, 2025

Is your platform ready for 2025? New research on platform engineering reveals the secret to success - A recent research study by Google Cloud and Enterprise Strategy Group (ESG) reveals that 55% of global organizations have already adopted platform engineering, and 90% of those plan to expand its reach to more developers. The study identifies three critical components that are central to the success of mature platform engineering leaders: fostering close collaboration, adopting a "platform as a product" approach, and defining success by measuring performance through clear metrics.

SRE Jan. 20, 2025

Are you doing Google Cloud Site Reliability Engineering (SRE) Wrong? Part 2— Core Concepts - Explanation of key SRE concepts.

SRE Jan. 13, 2025

Are you doing Google Cloud Site Reliability Engineering (SRE) Wrong? Part 1 — Core Principles - Google Cloud Architecture Framework — Reliability Core Principles.

DevOps Official Blog SRE Jan. 13, 2025

Avoid global outages by partitioning cloud applications to reduce blast radius - To reduce the risk of global outages, Google Cloud recommends partitioning the serving stack. Partitioning involves running isolated instances of application servers and storage. When making changes to the application code, new changes are deployed to one partition at a time, limiting the blast radius of an outage.

Cloud Run NodeJS SRE Nov. 25, 2024

2x Faster, 40% less RAM: The Cloud Run stdout logging hack

DevOps Paywall Python SRE Nov. 18, 2024

Building Resilient Systems on Google Cloud Platform: An Engineer’s Guide - This guide provides key strategies and coding practices for building resilient systems on GCP, covering topics such as regions and zones, platform availability, and infrastructure stack. Real-world scenarios illustrate the application of these principles in practice.

Billing DevOps FinOps SRE Oct. 21, 2024

How We Reduced Costs on GCP by Optimizing a Single SKU: Network Inter Zone Data Transfer Out - The team significantly reduced GCP costs by optimizing a single SKU: Network Inter Zone Data Transfer Out. By analyzing data transfer patterns, consolidating workloads, and leveraging zonal resources, they achieved a 45% reduction in inter-zone data transfer, resulting in monthly savings of approximately 20% on the GCP bill.

Official Blog SRE Sept. 30, 2024

Project management à la SRE: How to juggle the needs of your project and production - Site Reliability Engineering (SRE) teams at Google face unique challenges in project management due to their dual responsibilities of supporting production and delivering infrastructure projects. To address this, SRE teams employ enhanced planning, including reserving engineering hours and carefully timing project starts.

Cloud Logging Cloud Monitoring Official Blog SRE Sept. 16, 2024

Cut through the noise with new log scopes for Cloud Observability - Log scopes in Cloud Observability allow you to manage and analyze your organization's logs more efficiently. They are named collections of logs of interest within the same or different projects, made up of groups of log views that control and grant permissions to a subset of logs in a log bucket. Log scopes can be used to correlate metrics with logs from the same application or isolated environments, providing a more focused and relevant view of your telemetry data.

DevOps Official Blog SRE Aug. 19, 2024

Hakuhodo Technologies: The transformative impact of SRE - Hakuhodo Technologies, a specialized technology company, transformed its organization with Site Reliability Engineering (SRE) practices to enhance software development, deliver new value, and improve collaboration within the Hakuhodo DY Group. By implementing the "SRE Core" program, they revitalized communication between application and infrastructure teams, established critical user journeys, and learned the importance of observability.

Cloud Logging DevOps Official Blog SRE Aug. 5, 2024

Best practices for streamlining log centralization with Cloud Logging - Centralize log management with Cloud Logging for unified visibility, efficient management, enhanced security, and streamlined operations. Utilize aggregated sinks for efficient routing, establish a central observability project as a management hub, customize log storage for optimal retention and cost efficiency, and manage log storage access control for data security. Monitor log volume to proactively manage storage costs and investigate anomalies.

DevOps Official Blog SRE July 1, 2024

Free to be SRE — how to use generative AI to code, test and troubleshoot your systems - Generative AI, including Google's Gemini for developers, offers a toolkit that can help streamline SRE operational tasks and boost efficiency. This curated list of resources provides a foundational understanding of generative AI concepts and how to leverage them to enhance operational efficiency. Start with the basics of generative AI and progress to advanced techniques through videos and hands-on labs. Discover how generative AI can revolutionize SRE workflows and unlock a new era of operational excellence.

DevOps Official Blog SRE June 17, 2024

Free to be SRE, with this systems engineering syllabus - Systems engineering is a discipline used by Google site reliability engineers (SREs) to create and implement reliable systems. To help you learn more about systems engineering, Google has assembled some resources, including a paper on the systems engineering side of site reliability engineering, a chapter on non-abstract large system design in the SRE Workbook, a self-guided workshop on a distributed image server, a YouTube talk on Google's production environment, and a research paper on reliable data processing with minimal toil.

Monitoring Paywall SRE June 17, 2024

Google Cloud Engineers Need This App — But Few Realize It Exists - Reviewing a Google Cloud mobile solution for on-the-go data infrastructure management, monitoring and incident response.

DevOps Official Blog SRE June 10, 2024

5 more myths about platform engineering: how it’s built, what it does, and what it doesn’t - Platform engineering is a new approach to managing IT infrastructure and software development that aims to streamline the software development process by providing developers with self-service tools and platforms, abstracting away complex infrastructure details, and automating repetitive tasks. It's not a one-size-fits-all solution and requires a tailored approach to meet your organization's specific needs. Start small with a minimal viable platform, prioritize high-value tasks, and iterate based on feedback to build a platform that truly delivers value to your developers and organization.

DevOps SRE Terraform June 3, 2024

Landing Zone Deployment (Google Cloud Adoption Series) - Step-by-step guidance for how to actually deploy our LZ, either using Cloud Setup “Click-Ops” in the console, or with Terraform.

DevOps Official Blog SRE June 3, 2024

5 myths about platform engineering: what it is and what it isn’t - Platform engineering is a relatively new approach to software delivery that aims to reduce friction and cognitive overload for developers by abstracting away the complexity of modern software systems. It involves creating an internal developer platform that provides developer self-service through golden paths, codifying DevOps practices into software, and taking a holistic approach to automation. Platform engineering is not just advanced DevOps or automation, and it is not a fad but a response to the growing complexity of modern software systems.

Cloud Monitoring Official Blog SRE May 27, 2024

Understand the change in Cloud Monitoring service discovery and how to adapt - Cloud Monitoring has changed the way services are defined. Now, all services in the Services Overview dashboard must be explicitly created. To simplify this, a list of candidates based on auto-discovered services is provided when defining a new service in the console UI. Auto-detected services come with predefined SLIs for availability and latency, while custom services require explicit definition of these SLIs.

Infrastructure Official Blog SRE VMware Engine May 27, 2024

Sharing details on a recent incident impacting one of our customers - Google Cloud experienced an incident that impacted one customer's use of Google Cloud VMware Engine (GCVE) in a single cloud region. The incident was caused by an inadvertent misconfiguration during deployment using an internal tool, leading to the automatic deletion of the customer's GCVE Private Cloud after a system-assigned 1-year period. Google Cloud has since taken steps to prevent such incidents from happening again, including deprecating the internal tool and reviewing all GCVE deployments. The customer's data backups stored in Google Cloud Storage were not affected and assisted in the rapid restoration of services.

SRE May 13, 2024

Google Cloud SLO demystified: Uncovering metrics behind predefined SLOs - Unveiling Google Cloud SLO Secrets. This is a guided tour to predefined SLOs of monitored services.

SRE May 13, 2024

Google Cloud accidentally deletes UniSuper’s online account due to ‘unprecedented misconfiguration’ - More than half a million UniSuper fund members went a week with no access to their superannuation accounts after a “one-of-a-kind” Google Cloud “misconfiguration” led to the financial services provider’s private cloud account being deleted, Google and UniSuper have revealed.

DevOps Official Blog SRE April 29, 2024

2024 DORA survey now live: share your thoughts on AI, DevEx, and platform engineering - An invitation to participate in DORA's annual survey.

DevOps GCP Experience Google Kubernetes Engine Official Blog SRE April 29, 2024

Ninja Van: delivering flexibility, stability and scalability to core applications with a cloud container platform - Ninja Van, a fast-growing logistics company in Southeast Asia, uses Google Cloud's Kubernetes Engine (GKE) to manage its microservices architecture. GKE's scalability and ease of use enable Ninja Van to deliver a seamless development experience and improve its CI/CD pipeline.

DevOps Infrastructure Monitoring SRE Stackdriver April 29, 2024

Stay Ahead of the Storm: Comprehensive Insights into Google Cloud Personalized Service Health - Personalized Service Health from Google Cloud monitors your cloud projects and proactively notifies you of potential issues. It provides customizable alerts and leverages past incidents to improve reliability, making it a valuable tool for managing your cloud environment.

DevOps SRE April 8, 2024

Design your Landing Zone — Design Considerations Part 4— IaC, GitOps and CI/CD (Google Cloud Adoption Series) - LZ design considerations and decisions you need to make, relating to IaC, GitOps and CI/CD.

DevOps SRE Feb. 12, 2024

Google Cloud Adoption: Site Reliability Engineering (SRE), and Best Practices for SLI / SLO/ SLA - The best practices of Site Reliability Engineering.

API DevOps Official Blog SRE Jan. 29, 2024

5 ways platform engineers can help developers create winning APIs - How can platform engineers influence API development?

DevOps Official Blog SRE Jan. 22, 2024

Personalized Service Health is now generally available: Get started today - Personalized Service Health begins processing and publishing relevant incidents to your Service Health dashboard in the Google Cloud console.

Duet AI Official Blog SRE Jan. 22, 2024

Get your services back online quickly with Duet AI - Duet AI, an assistive AI tool, can help you make sense of error messages and speed up your investigation.

Google Cloud Platform Monitoring Official Blog SRE Jan. 22, 2024

Google Cloud mobile app: A troubleshooting and management companion for your cloud applications - With the Google Cloud mobile app, you can easily monitor the status and access services.

Monitoring Networking Official Blog SRE Jan. 22, 2024

Get timely networking health updates with Personalized Service Health emerging incidents - Emerging incidents are machine-driven alerts that are communicated simultaneously to you and internal Google SRE teams, significantly reducing the time-to-first-meaningful post about an incident.

Monitoring SRE Dec. 25, 2023

Personalized Service Health: Early Warning System for Disruptive Events Impacting Your Google Cloud Services - Google Cloud's Personalized Service Health (PSH) is a valuable service that lets you identify Google Cloud service disruptions relevant to your projects so you can manage and respond to them efficiently. With PSH, you can proactively identify and address potential issues before they cause a significant impact on your operations.

DevOps Official Blog Partners SRE Vertex AI Dec. 4, 2023

Nobl9's Reliability AI, Powered by Google - Customers who want to leverage AI technology in Google Cloud to define and understand SLOs can now do so through Vertex AI, thanks to Nobl9 and the new tool they developed, SLOgpt.ai.

DevOps Official Blog SRE Dec. 4, 2023

Driving success through open communication - Distilling years of Google research into five dimensions that you can apply to drive success within your own organization.

DevOps Official Blog SRE Terraform Workforce Identity Federation Sept. 25, 2023

Manage infrastructure with Workload Identity Federation and Terraform Cloud - Terraform Cloud workspaces integrate with Workload Identity Federation to authenticate and then impersonate Google Cloud service accounts.

Cloud SQL DevOps SRE Terraform Sept. 25, 2023

How to connect to GCP Private Cloud SQL instance in your local machine using a Bastion and Terraform. - A Terraform snippet that creates a bastion VM for accessing a Cloud SQL instance that has a private IP.

Kubernetes SRE Sept. 18, 2023

Google Kubernetes Engine Troubleshooting Made Simple with Interactive Playbooks - Using GKE interactive playbooks for troubleshooting guidance for common issues.

DevOps Official Blog SRE Aug. 28, 2023

Calling all DevOps, IT Ops, Platform Engineers and SREs: 5 can’t-miss breakout sessions at Next ‘23 - There’s no lack of breakout sessions for DevOps, IT Ops, Platform Engineers, and Site Reliability Engineers (SREs) at Google Cloud Next 2023.

Cloud Run Monitoring SRE Aug. 21, 2023

How to create a SLO for Cloud Run programatically

Google Kubernetes Engine Official Blog SRE Aug. 21, 2023

How to set up observability for a multi-tenant GKE solution - Learn how to set up observability for a multi-tenant GKE solution using Log Router, and how to set up a sink that routes a tenant’s logs to their dedicated GCP project.

Cloud Bigtable Official Blog SRE Aug. 7, 2023

What's new in Bigtable observability - Learn about new tools and metrics for Cloud Bigtable including query stats, high-granularity metrics, and table stats.

CI GCP Experience Official Blog SRE July 24, 2023

Vodafone: A DevOps approach to AI/ML through cloud-native CI/CD pipelines - How Vodafone improved the performance of its ML pipelines by using DevOps principles of automation, code mirroring and CI/CD.

DevOps Official Blog SRE July 17, 2023

DevOps Awards winner Sabre on nurturing team culture - Sabre worked closely with Google Cloud to transform its system and company culture to make better use of the cloud.

DevOps Official Blog SRE June 19, 2023

2022 State of DevOps Report data deep dive: Documentation is like sunshine - The State of DevOps Report finds a clear link between documentation quality and an organization’s ability to meet its performance goals.

Cloud Monitoring Official Blog SRE June 19, 2023

New in Cloud Monitoring: Better tools for analysis, uptime checks, and alerts - We recently launched several new Cloud Monitoring features to improve your visualization and troubleshooting experience.

DevOps Official Blog SRE June 12, 2023

The Modernization Imperative: Shifting left is for suckers. Shift down instead - Instead of developers “shifting left,” they need to “shift down” and push more workloads down onto the platforms they’re already using.

DevOps Monitoring Official Blog SRE May 15, 2023

Uptime checks for availability - Monitor the availability of public and private resources, and get alerted when there are problems.

AI DevOps Official Blog SRE May 15, 2023

Introducing Duet AI for Google Cloud – an AI-powered collaborator

DevOps Official Blog SRE Terraform April 24, 2023

Running Infrastructure-as-Code with the least privilege possible - Google service account impersonation lets you run your terraform code and manage resources without overly broad access.

Billing Cloud Monitoring DevOps Official Blog SRE April 17, 2023

How to identify and reduce costs of your Google Cloud observability in Cloud Monitoring - A cost savings guide for Cloud Monitoring.

Compute Engine Official Blog SRE April 10, 2023

Monitor the health of your VM fleets in the Compute Engine console - The new Observability tab in the Compute Engine console provides insights into CPU, memory, network, disk, live processes, and system events.

Cloud SQL Database Migration Service SRE March 27, 2023

Upgrade Your MySQL Version with Minimal Downtime: Our Journey with Google Cloud’s Data Migration Service - Google Cloud provides a reliable and efficient tool for upgrading your MySQL instance: its Database Migration Service (DMS).

Monitoring Prometheus SRE March 27, 2023

Scaling Observability Reliably and Frugally at Magicpin - A process of creating an observability platform on GCP.

Cloud Monitoring Official Blog SRE March 20, 2023

Verify POST endpoint availability with Uptime Checks - Google Cloud Monitoring can now handle any kind of request bodies for POST requests, giving you better REST resource tracking.

Official Blog SRE March 13, 2023

Adopting SRE: Standardizing your SLO design process - Designing SLOs is a key SRE competency which requires careful consideration and a consistent approach to implementation.

DevOps Google Kubernetes Engine Official Blog SRE Feb. 6, 2023

Scalability testing on Google Kubernetes Engine: Know before you go - Getting ready to scale up a Kubernetes-based workload? Learn about the benefits of scalability testing on GKE, how to set goals, and best practices.

DevOps Official Blog SRE Jan. 23, 2023

Reliability and SRE in the 2022 State of DevOps Report - Learn more about the connection between SRE, DevOps and reliability.

DevOps SRE Dec. 26, 2022

Disaster Recovery — locality-restricted workloads on GCP - This post discusses how you can use Google Cloud to architect for disaster recovery (DR) to meet location-specific requirements.

DevOps Official Blog SRE Dec. 19, 2022

Why Focus on Symptoms, Not Causes? - Why aren’t we monitoring what users care about? How did we get here? What do users care about?

DevOps Official Blog SRE Nov. 21, 2022

Composite availability: calculating the overall availability of cloud infrastructure - Understand how to calculate the composite reliability of your cloud infrastructure to help design Cloud architectures with an optimal SLA.

DevOps GCP Experience Networking Security SRE VPC Service Controls Nov. 7, 2022

How we secured our data on the Cloud - Challenges and solutions while enforcing VPC Service Controls.

DevOps Official Blog Skaffold SRE Oct. 31, 2022

Skaffold v2 GA: Further enhancing developer productivity - With Skaffold V2, you can now build and manage container images on Cloud Run and on ARM architectures.

Billing Cloud Logging Official Blog SRE Oct. 24, 2022

Cloud Logging pricing for Cloud Admins: How to approach it & save cost - How, where, and when costs are incurred in Cloud Logging, Google’s observability solution for managing logs, plus recommendations for saving and optimizing cost.

CI DevOps Official Blog SRE Sept. 19, 2022

Building a secure CI/CD pipeline using Google Cloud built-in services - Build a secure CI/CD pipeline with Google Cloud's built-in services: Cloud Build, Cloud Deploy, Artifact Registry, Binary Authorization, and GKE.

Cloud Operations OpenTelemetry SRE Aug. 29, 2022

Ultimate Google Cloud Operations configuration for external services - Monitoring Elasticsearch service deployed on Elastic Cloud with OpenTelemetry and Cloud Operations.

Security SRE Aug. 15, 2022

Gremlin Chaos Engineering On Google Cloud - How to implement chaos engineering experiments using Gremlin on Google Cloud.

Cloud Monitoring DevOps Official Blog SRE Aug. 15, 2022

Snooze your alert policies in Cloud Monitoring - Snooze alert policies to prevent the creation of alerts and notifications. This is useful during maintenance windows, non-business hours, and more.

Data Analytics DevOps Looker Official Blog SRE Aug. 1, 2022

Managing the Looker ecosystem at scale with SRE and DevOps practices - Following DevOps and SRE best practices can help organizations bring order to distributed Looker environments.

DevOps Official Blog SRE July 4, 2022

Incorporating quota regression detection into your release pipeline - Check quotas across cloud environments before promoting images to prevent outages due to inconsistent API quota limits.

DevOps Official Blog SRE May 30, 2022

Enterprise DevOps Guidebook - Chapter 1 - Learn more about how to implement DORA best practices with our DevOps Enterprise Guidebook.

DevOps Official Blog SRE May 30, 2022

Application Rationalization through Google Cloud’s CAMP Framework - Application Rationalization through CAST Highlight (automated source code scan with business context) and mFit (VM workload assessment & automated migration).

Cloud Monitoring Kubernetes Monitoring SRE May 16, 2022

Metrics Management with Google Cloud Managed Service for Prometheus - Maisons du Monde is a furniture and home decor company that was founded in France over 25 years ago. We have 360 stores across France….

DevOps Official Blog SRE May 9, 2022

Are your SLOs realistic? How to analyze your risks like an SRE - Before committing to an SLO, Site Reliability Engineering practices recommend that you evaluate the risks to a given service.

DevOps Infrastructure Official Blog SRE April 25, 2022

The SRE book turns 6! - Site Reliability Engineering with Google’s SRE team! Since the publication of the SRE Book, we’ve learned and shared a lot. Come explore SRE with us!

Official Blog SRE April 18, 2022

Introducing the Google SRE Prodcast - Discover Prodcast, Google’s Site Reliability Engineering Podcast. This limited-edition series explores fundamental topics in reliability engineering from the perspective of experienced Google SREs.

SRE Terraform April 11, 2022

GCP integration with PagerDuty using Terraform - This article shows how, in 2022, Storytel went from a basic setup with a single global on-call team to a Full Service Ownership setup.

DevOps Monitoring Official Blog SRE April 4, 2022

Add severity levels to your alert policies in Cloud Monitoring - Add static and dynamic severity levels to your alert policies for easier triaging and include these in notifications when sent to 3rd party services.

Security SRE March 21, 2022

Forensics - Ever wondered what you need to do to collect evidence when you have an incident?

Official Blog Security SRE Feb. 14, 2022

Achieving Autonomic Security Operations: Automation as a Force Multiplier - Your Security Operations Center (SOC) can learn a lot from what IT operations learned during the SRE revolution. In this post of the series, we plan to extract the lessons for your SOC centered on another SRE principle - automation as a force multiplier.

DevOps Official Blog SRE Dec. 13, 2021

Postmortems at Loon: a guiding force for rapid development - Discover how Loon Site Reliability Engineers used postmortems to iterate on their stratospheric software-defined network.

DevOps SRE Dec. 6, 2021

Part-5: Google DevOps-Observability with SRE principles

DevOps GCP Experience Official Blog SRE Dec. 6, 2021

Shopify engineers deliver on peak performance during Black Friday Cyber Monday 2021 - Shopify just experienced a record-breaking Black Friday Cyber Monday. Learn how Shopify works with Google Cloud to handle unprecedented peak moments with ease.

DevOps Official Blog SRE Dec. 6, 2021

Want to supercharge your DevOps practice? Research says try SRE - The 2021 DORA State of DevOps Report found interesting trends in DevOps shops that use SRE best practices.

Cloud Operations GCP Experience Official Blog SRE Nov. 29, 2021

How Sabre is using SRE to lead a successful digital transformation - Sabre Corporation joined forces with Google Cloud as their preferred cloud provider to accelerate their digital transformation following SRE principles.

DevOps NoSQL Official Blog SRE Nov. 29, 2021

Empowering DevOps to foster customer loyalty in modern retail with MongoDB Atlas on Google Cloud - MongoDB Atlas on Google Cloud can enhance DevOps performance in today’s retail market.

DevOps Kubernetes SRE Oct. 25, 2021

Google Cloud DevOps Series - Google Cloud Compute options for Kubernetes.

Official Blog SRE Sept. 27, 2021

What’s your org’s reliability mindset? Insights from Google SREs - An organization’s approach to product reliability is a function of its mindset.

Cloud Operations SRE Tutorial Sept. 13, 2021

Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox - In this step-by-step guide, I will demonstrate how to configure SLOs in Cloud Operations using our learning environment, Cloud Operation Sandbox.

Anthos Official Blog SRE Terraform Aug. 23, 2021

Deploy Anthos on GKE with Terraform part 1: GitOps with Config Sync - It is now simple to use Terraform to configure Anthos features on your GKE clusters. This is the first part of the 3 part series that describes using Terraform to enable Config Sync.

Anthos DevOps Official Blog SRE Aug. 9, 2021

Get in sync: Consistent Kubernetes with new Anthos Config Management features - Anthos Config Management and Config Controller bring Kubernetes-style declarative policy and config management to GKE environments.

CI Cloud Build Official Blog SRE Aug. 2, 2021

Introducing Cloud Build private pools: Secure CI/CD for private networks - With new private pools, you can use Google Cloud’s hosted Cloud Build CI/CD service on resources in your private network or in other clouds.

DevOps Official Blog SRE Aug. 2, 2021

Securing the software development lifecycle with Cloud Build and SLSA - Google’s proposed SLSA framework provides guidance on how to build a more secure software supply chain.

DevOps Official Blog SRE Aug. 2, 2021

Let's migrate: why lifting and shifting is simply too easy to ignore - Maximise the velocity and success of your cloud migration by starting with lift and shift.

DevOps Official Blog SRE July 5, 2021

Announcing the 2021 State of DevOps Report Sponsors

Cloud CDN DevOps SRE July 5, 2021

Google Cloud CDN Custom Dashboard - An example of a custom Dashboard in Cloud Monitoring for Cloud CDN.

DevOps Official Blog SRE June 21, 2021

Are we there yet? Thoughts on assessing an SRE team’s maturity - Examining the key indicators that signal a mature SRE team.

Cloud Operations GCP Experience Official Blog SRE June 14, 2021

How Lowe’s meets customer demand with Google SRE practices - Lowe’s has adopted Google SRE practices to help developer and operations teams keep up with ecommerce demand.

DevOps Official Blog SRE June 7, 2021

DevOps on Google Cloud: tools to speed up software development velocity - Google Cloud’s application development and continuous integration/continuous delivery (CI/CD) tools help ForgeRock developers stay productive.

Official Blog SRE May 31, 2021

Four steps to jumpstarting your SRE practice - Once you have leadership buy-in, there are some things you can do to get the SRE ball rolling, fast.

DevOps SRE May 24, 2021

Book - Implementing DevOps on Google Cloud - Achieving Google’s Professional Cloud DevOps Engineer Certification.

Cloud Operations DevOps Official Blog SRE May 10, 2021

SRE fundamentals 2021: SLIs vs SLAs vs SLOs - What’s the difference between an SLI, an SLO and an SLA? Google Site Reliability Engineers (SRE) explain.

DevOps Official Blog SRE May 3, 2021

SRE at Google: Our complete list of CRE life lessons - Find links to blog posts that share Google’s SRE best practices in one handy location.

DevOps Official Blog SRE April 26, 2021

5 resources to help you get started with SRE - Here are the top five Google Cloud resources for getting started on your SRE journey.

Cloud Operations SRE Stackdriver April 5, 2021

SRE Public Resources for GCP Customers - A list of articles, videos and courses related to SRE.

Official Blog SRE March 15, 2021

How do you eat an elephant? Google SREs talk digital transformation - It’s not just about technology. Google Cloud SREs touch on the human and organizational side of a cloud migration.

Cloud Operations Official Blog SRE March 1, 2021

With SRE, failing to plan is planning to fail - The process of becoming a successful Site Reliability Engineering shop starts well before you take your first class or read your first manual.

Cloud Operations DevOps Official Blog SRE Jan. 25, 2021

Take the first step toward SRE with Cloud Operations Sandbox - Spin up the Cloud Operations Sandbox to see how Google’s logging, monitoring, tracing, profiling and debugging can kickstart your SRE practice.

Cloud Operations Monitoring SRE Stackdriver Jan. 25, 2021

Operation Suite GCP - Monitoring Logging and Error Reporting - An overview of Operation Suite in GCP: Monitoring, Logging, Error Reporting.

Cloud Build DevOps SRE Oct. 12, 2020

Gitflow with Github and Cloud Build - Implementing Gitflow using Github and Cloud Build.

DevOps Monitoring SRE Oct. 5, 2020

How to alert on SLOs - How to use SLO error budget alerts in Monitoring.

DevOps Official Blog SRE Sept. 28, 2020

SRE Classroom: exercises for non-abstract large systems design - Learn how to apply SRE principles in this series of workshops on non-abstract large systems design (NALSD) with Google engineers.

DevOps Official Blog SRE Sept. 28, 2020

Are you an Elite DevOps performer? Find out with the Four Keys Project - Learn how the Four Keys open source project lets you gauge your DevOps performance according to DORA metrics.

Cloud Monitoring DevOps SRE Terraform Sept. 7, 2020

Creating SLOs with Terraform - Example of creating SLO for Cloud Monitoring using Terraform.

GCP Experience Official Blog SRE Aug. 10, 2020

Three months, 30x demand: How we scaled Google Meet during COVID-19 - Learn how Google's SRE team ramped up to handle high demand for Google Meet in response to COVID-19.

Monitoring Official Blog SRE July 13, 2020

Setting SLOs: observability using custom metrics - See how you can set service-level objectives (SLOs) for complex services for better cloud monitoring. Part of SRE tips series.

Cloud Monitoring Official Blog SRE July 13, 2020

Setting SLOs: a step-by-step guide - See how to use SRE principles to keep customers happy with your service, using the right service-level objectives (SLOs).

Official Blog SRE June 29, 2020

How maintenance windows affect your error budget — SRE tips - See how maintenance windows can impact your error budget when using SRE practices, and get tips on how and when to use them.

Official Blog SRE June 15, 2020

Building resilient systems to weather the unexpected - See how SRE teams at Google apply principles in practice to build resilient systems and prepare for any type of business continuity needs.

DevOps Official Blog SRE June 1, 2020

Meeting reliability challenges with SRE principles - Following SRE principles can help you build reliable production systems. When getting started, you may encounter three common challenges. Here’s how to solve them.

Official Blog SRE May 4, 2020

Designing distributed systems using NALSD flashcards - Get to know the SRE-inspired principles and numbers, plus handy flashcards, to help you design non-abstract large scale design (NALSD) distributed systems.

DevOps Official Blog SRE April 13, 2020

Learn to build secure and reliable systems with a new book from Google - Engineers across Google's security and SRE organizations share best practices to help you design scalable and reliable systems that are fundamentally secure.

Official Blog SRE March 16, 2020

Finding a problem at the bottom of the Google stack - See a real-world example of how Google’s SRE practices can identify and help fix issues, even at the bottom of the hardware stack.

Monitoring Official Blog SRE March 16, 2020

Use SRE principles to monitor pipelines with Cloud Monitoring dashboards - Try SRE principles and the four golden signals as the metrics to build a monitoring dashboard for your data pipelines.

AWS DevOps GCP Experience SRE March 9, 2020

Our migration journey from AWS to Google Cloud — Part 1 - Description of infrastructure migration from AWS to GCP, part 1.

AWS DevOps GCP Experience SRE March 9, 2020

Our migration journey from AWS to Google Cloud — Part 2 - Description of infrastructure migration from AWS to GCP, part 2.

Google Kubernetes Engine Official Blog SRE Jan. 20, 2020

Using deemed SLIs to measure customer reliability - Following SRE principles involves reliability metrics like SLOs and SLIs. See how CRE teams and customers at Google use deemed SLIs.

Cloud Storage SRE Stackdriver Storage Dec. 23, 2019

Monitoring bytes sent from Google Cloud Storage buckets - The article describes how to set up monitoring and create alerts based on data transferred from Cloud Storage.

SRE Dec. 23, 2019

Warm Disaster recovery for applications in Google Cloud - The article explains how to set up a Warm Disaster Recovery pattern for applications.

Official Blog SRE Dec. 16, 2019

Learning—and teaching—the art of service-level objectives -- CRE Life Lessons - Host your own Art of SLOs workshop with Google SRE materials, now available to anyone.

DevOps Official Blog SRE Dec. 9, 2019

Shrinking the time to mitigate production incidents - CRE life lessons - See how you can use SRE and CRE principles and tests from Google, including Wheel of Misfortune and DiRT, to reduce the time needed to mitigate production incidents.

SRE Nov. 18, 2019

SRE Best Practices, For People in a Hurry - 20 simple rules for building a Google-Grade Site Reliability Engineering (SRE) practice.

SRE Nov. 18, 2019

Hot Disaster recovery on Google Cloud for applications running on-premises - The article goes through the process of setting up Hot Disaster Recovery on GCP for on-premises applications.

SRE Nov. 11, 2019

Warm Disaster recovery on Google Cloud for applications running on-premises - The article explains the Warm Disaster Recovery pattern.

DevOps Official Blog SRE Nov. 4, 2019

How to integrate Policy Intelligence recommendations into an IaC pipeline - Learn how to incorporate recommendations from Policy Intelligence into an infrastructure as code pipeline.

Official Blog SRE Oct. 7, 2019

Transitioning a typical engineering ops team into an SRE powerhouse - Moving a network operations team to an SRE-driven model took some time, but was well worth the effort, as teams can focus on reliability rather than hardware.

DevOps Official Blog SRE Sept. 16, 2019

Shrinking the impact of production incidents using SRE principles—CRE Life Lessons - SRE principles can help you shrink the impact of production incidents through use of SLOs, writing postmortems, and promoting a blameless culture.

DevOps Official Blog SRE Terraform July 1, 2019

GCP DevOps tricks: Create a custom Cloud Shell image that includes Terraform and Helm - Learn how to add DevOps tools like Helm and Terraform to Cloud Shell, GCP’s browser-based management tool.

DevOps Official Blog SRE July 1, 2019

How SRE teams are organized, and how to get started - Getting started with SRE often starts with understanding SRE principles and how teams are organized. Find tips here on which SRE team implementation to use.

DevOps Infrastructure Official Blog SRE April 8, 2019

Want repeatable scale? Adopt infrastructure as code on GCP - The article describes the concepts and motivation behind the Infrastructure as Code approach.

DevOps Official Blog SRE March 25, 2019

Introducing a new Coursera course on Site Reliability Engineering - The new course, Site Reliability Engineering: Measuring and Managing Reliability, distills years of collective Google SRE experience with designing and managing complex systems that meet their reliability targets.

DevOps Official Blog SRE March 18, 2019

Make your voice heard! Take the 2019 Accelerate State of DevOps survey - By contributing to the survey, you will help shape the narrative of the rapidly growing DevOps industry. Your insights will help drive conversations on how as an industry we can develop software faster with less risk.

Istio Kubernetes Official Blog SRE March 11, 2019

The service mesh era: Using Istio and Stackdriver to build an SRE service - Demonstrating how to use Istio to level up SRE practices for workloads running in Kubernetes.

Official Blog SRE Feb. 4, 2019

Tune up your SLI metrics: CRE life lessons - How you can tune your existing SLIs to be a better representation of what your customers are experiencing.

Official Blog SRE Jan. 28, 2019

Do you have an SRE team yet? How to start and assess your journey - The Site Reliability Workbook is available in HTML now!

DevOps Official Blog SRE Jan. 21, 2019

Canary analysis: Lessons learned and best practices from Google and Waze - How Waze is using Spinnaker (continuous delivery system) to do canary deployments.

Official Blog Security SRE Sept. 17, 2018

Trust through transparency: incident response in Google Cloud - White paper which explains how Google Cloud manages incidents.

Official Blog SRE Aug. 6, 2018

Repairing network hardware at scale with SRE principles - Google’s SRE principles to guide developers and operations teams toward better systems reliability.

Official Blog SRE July 23, 2018

SRE fundamentals: SLIs, SLAs and SLOs - Learn about SRE fundamentals: SLIs, SLAs and SLOs.

SRE July 2, 2018

Understanding error budget overspend - part one - CRE life lessons - Questions to consider when deciding whether to recalibrate your error budget because your applications' downtime exceeds what your service level objectives allow.

SRE July 2, 2018

Good housekeeping for error budgets - part two - CRE life lessons - Fixing the root causes of error budget overspend.

SRE July 2, 2018

Kubernetes podcast - #9 SRE, with Tina Zhang and Fred van den Driessche.

Official Blog SRE June 4, 2018

Troubleshooting tips: Help your cloud provider help you - Tips for communicating with your cloud provider's support team.

Official Blog SRE June 4, 2018

Troubleshooting tips: How to talk so your cloud provider will listen (and understand) - Practical tips on communicating with cloud providers since cloud presents a new way of working for IT teams shifting away from legacy systems.

Official Blog SRE May 14, 2018

Defining SLOs for services with dependencies - CRE life lessons - How to define and manage SLOs for services with dependencies.

DevOps Official Blog SRE May 14, 2018

SRE vs. DevOps: competing standards or close friends? - What exactly is SRE and how does it relate to DevOps?

SRE March 19, 2018

Risk and Error Budgets - How the SRE discipline reduces tension over velocity/stability between product teams and system operators by quantifying risk and employing error budgets.

Official Blog SRE Feb. 12, 2018

Applying the Escalation Policy — CRE life lessons - CRE Life Lessons: Explore some scenarios to apply the Escalation Policy.

SRE Jan. 22, 2018

An example escalation policy — CRE life lessons - This post presents a lightly edited SLO escalation policy and its associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.

SRE Jan. 8, 2018

Consequences of SLO violations — CRE life lessons - The article explains the importance of creating a policy to handle Service Level Objective (SLO) violations, the role of Site Reliability Engineers (SREs) and Devs in responding to SLO violations, and the structure of such a policy.

SRE Dec. 11, 2017

Getting the most out of shared postmortems — CRE life lessons - This post considers how to review a postmortem with your affected customer(s) to get better actionable data and to help customers improve their systems and practices.

SRE Oct. 30, 2017

Building good SLOs - CRE life lessons - Practical tips on how to formulate Service Level Objectives for Service Level Indicators.

SRE Aug. 14, 2017

CRE life lessons: The practicalities of dark launching - How to deal with some circumstances that can come up with dark launching.

SRE Aug. 7, 2017

CRE life lessons: What is a dark launch, and what does it do for me? - Dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user.

SRE July 10, 2017

Making the most of an SRE service takeover - CRE life lessons - In Part 2 of this blog post we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for take-over. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.

SRE June 26, 2017

Why should your app get SRE support? - CRE life lessons - Practical tips on how to organize a Site Reliability Engineering team.

SRE May 29, 2017

Know thy enemy: how to prioritize and communicate risks - CRE life lessons - This time: how to identify and mitigate risks in your system.

SRE April 3, 2017

How release canaries can save your bacon - CRE life lessons - A description of the canary (gradual) release process from the Site Reliability Engineering team.

SRE March 27, 2017

Reliable releases and rollbacks - CRE life lessons - Life lessons from SRE (Site Reliability Engineering) on what to do when a new release is deployed but something goes wrong.

SRE March 6, 2017

Incident management at Google — adventures in SRE-land - How engineers at Google handle incidents in their data centres.

Contact

Zdenko Hrček
Třebanická 183
Prague, Czech Republic
Phone: +420 777 283 075
Email: [email protected]
