Tag: SRE

DevOps Official Blog SRE Sept. 28, 2020

SRE Classroom: exercises for non-abstract large systems design - Learn how to apply SRE principles in this series of workshops on non-abstract large systems design (NALSD) with Google engineers.

DevOps Official Blog SRE Sept. 28, 2020

Are you an Elite DevOps performer? Find out with the Four Keys Project - Learn how the Four Keys open source project lets you gauge your DevOps performance according to DORA metrics.

Cloud Monitoring DevOps SRE Terraform Sept. 7, 2020

Creating SLOs with Terraform - An example of creating an SLO for Cloud Monitoring using Terraform.

GCP Experience Official Blog SRE Aug. 10, 2020

Three months, 30x demand: How we scaled Google Meet during COVID-19 - Learn how Google's SRE team ramped up to handle high demand for Google Meet in response to COVID-19.

Monitoring Official Blog SRE July 13, 2020

Setting SLOs: observability using custom metrics - See how you can set service-level objectives (SLOs) for complex services for better cloud monitoring. Part of SRE tips series.

Cloud Monitoring Official Blog SRE July 13, 2020

Setting SLOs: a step-by-step guide - See how to use SRE principles to keep customers happy with your service, using the right service-level objectives (SLOs).

Official Blog SRE June 29, 2020

How maintenance windows affect your error budget — SRE tips - See how maintenance windows can impact your error budget when using SRE practices, and get tips on how and when to use them.

Official Blog SRE June 15, 2020

Building resilient systems to weather the unexpected - See how SRE teams at Google apply principles in practice to build resilient systems and prepare for any type of business continuity needs.

DevOps Official Blog SRE June 1, 2020

Meeting reliability challenges with SRE principles - Following SRE principles can help you build reliable production systems. When getting started, you may encounter three common challenges. Here’s how to solve them.

Official Blog SRE May 4, 2020

Designing distributed systems using NALSD flashcards - Get to know the SRE-inspired principles and numbers, plus handy flashcards, to help you design distributed systems using non-abstract large systems design (NALSD).

DevOps Official Blog SRE April 13, 2020

Learn to build secure and reliable systems with a new book from Google - Engineers across Google's security and SRE organizations share best practices to help you design scalable and reliable systems that are fundamentally secure.

Official Blog SRE March 16, 2020

Finding a problem at the bottom of the Google stack - See a real-world example of how Google’s SRE practices can identify and help fix issues, even at the bottom of the hardware stack.

Monitoring Official Blog SRE March 16, 2020

Use SRE principles to monitor pipelines with Cloud Monitoring dashboards - Try SRE principles and the four golden signals as the metrics to build a monitoring dashboard for your data pipelines.

AWS DevOps GCP Experience SRE March 9, 2020

Our migration journey from AWS to Google Cloud — Part 1 - Description of infrastructure migration from AWS to GCP, part 1.

AWS DevOps GCP Experience SRE March 9, 2020

Our migration journey from AWS to Google Cloud — Part 2 - Description of infrastructure migration from AWS to GCP, part 2.

Google Kubernetes Engine Official Blog SRE Jan. 20, 2020

Using deemed SLIs to measure customer reliability - Following SRE principles involves reliability metrics like SLOs and SLIs. See how CRE teams and customers at Google use deemed SLIs.

Cloud Storage SRE Stackdriver Storage Dec. 23, 2019

Monitoring bytes sent from Google Cloud Storage buckets - The article describes how to set up monitoring and create alerts based on data transferred from Cloud Storage.

SRE Dec. 23, 2019

Warm Disaster recovery for applications in Google Cloud - The article explains how to set up a Warm Disaster Recovery pattern for an application.

Official Blog SRE Dec. 16, 2019

Learning—and teaching—the art of service-level objectives -- CRE Life Lessons - Host your own Art of SLOs workshop with Google SRE materials, now available to anyone.

DevOps Official Blog SRE Dec. 9, 2019

Shrinking the time to mitigate production incidents - CRE life lessons - See how you can use SRE and CRE principles and tests from Google, including Wheel of Misfortune and DiRT, to reduce the time needed to mitigate production incidents.

SRE Nov. 18, 2019

SRE Best Practices, For People in a Hurry - 20 simple rules for building a Google-Grade Site Reliability Engineering (SRE) practice.

SRE Nov. 18, 2019

Hot Disaster recovery on Google Cloud for applications running on-premises - The article goes through the process of creating a Hot Disaster Recovery setup on GCP for on-premises applications.

SRE Nov. 11, 2019

Warm Disaster recovery on Google Cloud for applications running on-premises - The article explains the Warm Disaster Recovery pattern.

DevOps Official Blog SRE Nov. 4, 2019

How to integrate Policy Intelligence recommendations into an IaC pipeline - Learn how to incorporate recommendations from Policy Intelligence into an infrastructure as code pipeline.

Official Blog SRE Oct. 6, 2019

Transitioning a typical engineering ops team into an SRE powerhouse - Moving a network operations team to an SRE-driven model took some time, but was well worth the effort, as teams can focus on reliability rather than hardware.

DevOps Official Blog SRE Sept. 16, 2019

Shrinking the impact of production incidents using SRE principles—CRE Life Lessons - SRE principles can help you shrink the impact of production incidents through use of SLOs, writing postmortems, and promoting a blameless culture.

DevOps Official Blog SRE Terraform July 1, 2019

GCP DevOps tricks: Create a custom Cloud Shell image that includes Terraform and Helm - Learn how to add DevOps tools like Helm and Terraform to Cloud Shell, GCP’s browser-based management tool.

DevOps Official Blog SRE July 1, 2019

How SRE teams are organized, and how to get started - Getting started with SRE often starts with understanding SRE principles and how teams are organized. Find tips here on which SRE team implementation to use.

DevOps Infrastructure Official Blog SRE April 8, 2019

Want repeatable scale? Adopt infrastructure as code on GCP - The article describes the concepts and motivation behind the Infrastructure as Code approach.

DevOps Official Blog SRE March 25, 2019

Introducing a new Coursera course on Site Reliability Engineering - The new course, Site Reliability Engineering: Measuring and Managing Reliability, distills years of collective Google SRE experience with designing and managing complex systems that meet their reliability targets.

DevOps Official Blog SRE March 18, 2019

Make your voice heard! Take the 2019 Accelerate State of DevOps survey - By contributing to the survey, you will help shape the narrative of the rapidly growing DevOps industry. Your insights will help drive conversations on how as an industry we can develop software faster with less risk.

Istio Kubernetes Official Blog SRE March 11, 2019

The service mesh era: Using Istio and Stackdriver to build an SRE service - Demonstrating how to use Istio to level up SRE practices for workloads running in Kubernetes.

Official Blog SRE Feb. 4, 2019

Tune up your SLI metrics: CRE life lessons - How you can tune your existing SLIs to be a better representation of what your customers are experiencing.

Official Blog SRE Jan. 28, 2019

Do you have an SRE team yet? How to start and assess your journey - The Site Reliability Workbook is available in HTML now!

DevOps Official Blog SRE Jan. 21, 2019

Canary analysis: Lessons learned and best practices from Google and Waze - How Waze is using Spinnaker (a continuous delivery system) to do canary deployments.

Official Blog Security SRE Sept. 17, 2018

Trust through transparency: incident response in Google Cloud - A white paper that explains how Google Cloud manages incidents.

Official Blog SRE Aug. 6, 2018

Repairing network hardware at scale with SRE principles - Using Google’s SRE principles to guide developers and operations teams toward better systems reliability.

Official Blog SRE July 23, 2018

SRE fundamentals: SLIs, SLAs and SLOs - Learn about SRE fundamentals: SLIs, SLAs and SLOs.

SRE July 2, 2018

Understanding error budget overspend - part one - CRE life lessons - Questions to consider to see if you need to recalibrate your error budget when the downtime of your applications exceeds what your service level objectives allow.

SRE July 2, 2018

Good housekeeping for error budgets - part two - CRE life lessons - Fixing the root causes of error budget overspend.

SRE July 2, 2018

Kubernetes podcast - #9 SRE, with Tina Zhang and Fred van den Driessche.

Official Blog SRE June 4, 2018

Troubleshooting tips: Help your cloud provider help you - Tips for communicating with your cloud provider's support team.

Official Blog SRE June 4, 2018

Troubleshooting tips: How to talk so your cloud provider will listen (and understand) - Practical tips on communicating with cloud providers since cloud presents a new way of working for IT teams shifting away from legacy systems.

Official Blog SRE May 14, 2018

Defining SLOs for services with dependencies - CRE life lessons - How to define and manage SLOs for services with dependencies.

DevOps Official Blog SRE May 14, 2018

SRE vs. DevOps: competing standards or close friends? - What exactly is SRE and how does it relate to DevOps?

SRE March 19, 2018

Risk and Error Budgets - How the SRE discipline reduces tension over velocity/stability between product teams and system operators by quantifying risk and employing error budgets.

Official Blog SRE Feb. 12, 2018

Applying the Escalation Policy — CRE life lessons - CRE Life Lessons: Explore some scenarios for applying the Escalation Policy.

SRE Jan. 22, 2018

An example escalation policy — CRE life lessons - This post presents a lightly edited SLO escalation policy and its associated rationales from a Google SRE team, to illustrate the trade-offs that particular teams make to maintain a high development velocity.

SRE Jan. 8, 2018

Consequences of SLO violations — CRE life lessons - The article explains the importance of creating a policy to handle Service Level Objective (SLO) violations, the role of Site Reliability Engineers (SREs) and developers in responding to SLO violations, and the structure of such a policy.

SRE Dec. 11, 2017

Getting the most out of shared postmortems — CRE life lessons - This post considers how to review a postmortem with your affected customer(s) to get better actionable data and to help customers improve their systems and practices.

SRE Oct. 30, 2017

Building good SLOs - CRE life lessons - Practical tips on how to formulate Service Level Objectives for Service Level Indicators.

SRE Aug. 14, 2017

CRE life lessons: The practicalities of dark launching - How to deal with some circumstances that can come up with dark launching.

SRE Aug. 7, 2017

CRE life lessons: What is a dark launch, and what does it do for me? - Dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user.

SRE July 10, 2017

Making the most of an SRE service takeover - CRE life lessons - In Part 2 of this blog post we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for take-over. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.

SRE June 26, 2017

Why should your app get SRE support? - CRE life lessons - Practical tips on how to organize a Site Reliability Engineering team.

SRE May 29, 2017

Know thy enemy: how to prioritize and communicate risks - CRE life lessons - How to identify and mitigate risks in your system.

SRE April 3, 2017

How release canaries can save your bacon - CRE life lessons - A description of the release process using canary (gradual) releases, from the Site Reliability Engineering team.

SRE March 27, 2017

Reliable releases and rollbacks - CRE life lessons - Life lessons from SREs (Site Reliability Engineers) for when a new release is deployed but something goes wrong.

SRE March 6, 2017

Incident management at Google — adventures in SRE-land - How engineers at Google handle incidents in their data centres.



