Tag: SRE

DevOps Kubernetes SRE Oct. 25, 2021

Google Cloud DevOps Series - Google Cloud Compute options for Kubernetes.

Official Blog SRE Sept. 27, 2021

What’s your org’s reliability mindset? Insights from Google SREs - An organization’s approach to product reliability is a function of its mindset.

Cloud Operations SRE Tutorial Sept. 13, 2021

Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox - In this step-by-step guide, I will demonstrate how to configure SLOs in Cloud Operations using our learning environment, Cloud Operation Sandbox.

Anthos Official Blog SRE Terraform Aug. 23, 2021

Deploy Anthos on GKE with Terraform part 1: GitOps with Config Sync - It is now simple to use Terraform to configure Anthos features on your GKE clusters. This is the first part of a three-part series that describes using Terraform to enable Config Sync.

Anthos DevOps Official Blog SRE Aug. 9, 2021

Get in sync: Consistent Kubernetes with new Anthos Config Management features - Anthos Config Management and Config Controller bring Kubernetes-style declarative policy and config management to GKE environments.

CI Cloud Build Official Blog SRE Aug. 2, 2021

Introducing Cloud Build private pools: Secure CI/CD for private networks - With new private pools, you can use Google Cloud’s hosted Cloud Build CI/CD service on resources in your private network or in other clouds.

DevOps Official Blog SRE Aug. 2, 2021

Securing the software development lifecycle with Cloud Build and SLSA - Google’s proposed SLSA framework provides guidance on how to build a more secure software supply chain.

DevOps Official Blog SRE Aug. 2, 2021

Let's migrate: why lifting and shifting is simply too easy to ignore - Maximise the velocity and success of your cloud migration by starting with lift and shift.

DevOps Official Blog SRE July 5, 2021

Announcing the 2021 State of DevOps Report Sponsors

Cloud CDN DevOps SRE July 5, 2021

Google Cloud CDN Custom Dashboard - An example of a custom Dashboard in Cloud Monitoring for Cloud CDN.

DevOps Official Blog SRE June 22, 2021

Are we there yet? Thoughts on assessing an SRE team’s maturity - Examining the key indicators that signal a mature SRE team.

Cloud Operations GCP Experience Official Blog SRE June 14, 2021

How Lowe’s meets customer demand with Google SRE practices - Lowe’s has adopted Google SRE practices to help developer and operations teams keep up with ecommerce demand.

DevOps Official Blog SRE June 7, 2021

DevOps on Google Cloud: tools to speed up software development velocity - Google Cloud’s application development and continuous integration/continuous delivery (CI/CD) tools help ForgeRock developers stay productive.

Official Blog SRE May 31, 2021

Four steps to jumpstarting your SRE practice - Once you have leadership buy-in, there are some things you can do to get the SRE ball rolling, fast.

DevOps SRE May 24, 2021

Book - Implementing DevOps on Google Cloud - Achieving Google’s Professional Cloud DevOps Engineer Certification.

Cloud Operations DevOps Official Blog SRE May 10, 2021

SRE fundamentals 2021: SLIs vs SLAs vs SLOs - What’s the difference between an SLI, an SLO and an SLA? Google Site Reliability Engineers (SRE) explain.

DevOps Official Blog SRE May 3, 2021

SRE at Google: Our complete list of CRE life lessons - Find links to blog posts that share Google’s SRE best practices in one handy location.

DevOps Official Blog SRE April 25, 2021

5 resources to help you get started with SRE - Here are the top five Google Cloud resources for getting started on your SRE journey.

Cloud Operations SRE Stackdriver April 5, 2021

SRE Public Resources for GCP Customers - A list of articles, videos and courses related to SRE.

Official Blog SRE March 15, 2021

How do you eat an elephant? Google SREs talk digital transformation - It’s not just about technology. Google Cloud SREs touch on the human and organizational side of a cloud migration.

Cloud Operations Official Blog SRE March 1, 2021

With SRE, failing to plan is planning to fail - The process of becoming a successful Site Reliability Engineering shop starts well before you take your first class or read your first manual.

Cloud Operations DevOps Official Blog SRE Jan. 25, 2021

Take the first step toward SRE with Cloud Operations Sandbox - Spin up the Cloud Operations Sandbox to see how Google’s logging, monitoring, tracing, profiling and debugging can kickstart your SRE practice.

Cloud Operations Monitoring SRE Stackdriver Jan. 25, 2021

Operation Suite GCP - Monitoring Logging and Error Reporting - An overview of Operation Suite in GCP: Monitoring, Logging, and Error Reporting.

Cloud Build DevOps SRE Oct. 12, 2020

Gitflow with Github and Cloud Build - Implementing Gitflow using Github and Cloud Build.

DevOps Monitoring SRE Oct. 5, 2020

How to alert on SLOs - How to use SLO error budget alerts in Monitoring.

DevOps Official Blog SRE Sept. 28, 2020

SRE Classroom: exercises for non-abstract large systems design - Learn how to apply SRE principles in this series of workshops on non-abstract large systems design (NALSD) with Google engineers.

DevOps Official Blog SRE Sept. 28, 2020

Are you an Elite DevOps performer? Find out with the Four Keys Project - Learn how the Four Keys open source project lets you gauge your DevOps performance according to DORA metrics.

Cloud Monitoring DevOps SRE Terraform Sept. 7, 2020

Creating SLOs with Terraform - An example of creating an SLO for Cloud Monitoring using Terraform.

GCP Experience Official Blog SRE Aug. 10, 2020

Three months, 30x demand: How we scaled Google Meet during COVID-19 - Learn how Google's SRE team ramped up to handle high demand for Google Meet in response to COVID-19.

Monitoring Official Blog SRE July 13, 2020

Setting SLOs: observability using custom metrics - See how you can set service-level objectives (SLOs) for complex services for better cloud monitoring. Part of SRE tips series.

Cloud Monitoring Official Blog SRE July 13, 2020

Setting SLOs: a step-by-step guide - See how to use SRE principles to keep customers happy with your service, using the right service-level objectives (SLOs).

Official Blog SRE June 29, 2020

How maintenance windows affect your error budget — SRE tips - See how maintenance windows can impact your error budget when using SRE practices, and get tips on how and when to use them.

Official Blog SRE June 15, 2020

Building resilient systems to weather the unexpected - See how SRE teams at Google apply principles in practice to build resilient systems and prepare for any type of business continuity need.

DevOps Official Blog SRE June 1, 2020

Meeting reliability challenges with SRE principles - Following SRE principles can help you build reliable production systems. When getting started, you may encounter three common challenges. Here’s how to solve them.

Official Blog SRE May 4, 2020

Designing distributed systems using NALSD flashcards - Get to know the SRE-inspired principles and numbers, plus handy flashcards, to help you design non-abstract large scale design (NALSD) distributed systems.

DevOps Official Blog SRE April 13, 2020

Learn to build secure and reliable systems with a new book from Google - Engineers across Google's security and SRE organizations share best practices to help you design scalable and reliable systems that are fundamentally secure.

Official Blog SRE March 16, 2020

Finding a problem at the bottom of the Google stack - See a real-world example of how Google’s SRE practices can identify and help fix issues, even at the bottom of the hardware stack.

Monitoring Official Blog SRE March 16, 2020

Use SRE principles to monitor pipelines with Cloud Monitoring dashboards - Try SRE principles and the four golden signals as the metrics to build a monitoring dashboard for your data pipelines.

AWS DevOps GCP Experience SRE March 9, 2020

Our migration journey from AWS to Google Cloud — Part 1 - Description of infrastructure migration from AWS to GCP, part 1.

AWS DevOps GCP Experience SRE March 9, 2020

Our migration journey from AWS to Google Cloud — Part 2 - Description of infrastructure migration from AWS to GCP, part 2.

Google Kubernetes Engine Official Blog SRE Jan. 20, 2020

Using deemed SLIs to measure customer reliability - Following SRE principles involves reliability metrics like SLOs and SLIs. See how CRE teams and customers at Google use deemed SLIs.

Cloud Storage SRE Stackdriver Storage Dec. 23, 2019

Monitoring bytes sent from Google Cloud Storage buckets - The article describes how to set up monitoring and create alerts based on data transferred from Cloud Storage.

SRE Dec. 23, 2019

Warm Disaster recovery for applications in Google Cloud - The article explains how to set up a Warm Disaster Recovery pattern for applications.

Official Blog SRE Dec. 16, 2019

Learning—and teaching—the art of service-level objectives - CRE Life Lessons - Host your own Art of SLOs workshop with Google SRE materials, now available to anyone.

DevOps Official Blog SRE Dec. 9, 2019

Shrinking the time to mitigate production incidents - CRE life lessons - See how you can use SRE and CRE principles and tests from Google, including Wheel of Misfortune and DiRT, to reduce the time needed to mitigate production incidents.

SRE Nov. 18, 2019

SRE Best Practices, For People in a Hurry - 20 simple rules for building a Google-Grade Site Reliability Engineering (SRE) practice.

SRE Nov. 18, 2019

Hot Disaster recovery on Google Cloud for applications running on-premises - The article goes through the process of creating a Hot Disaster Recovery setup on GCP for on-premises applications.

SRE Nov. 11, 2019

Warm Disaster recovery on Google Cloud for applications running on-premises - The article explains the Warm Disaster Recovery pattern.

DevOps Official Blog SRE Nov. 4, 2019

How to integrate Policy Intelligence recommendations into an IaC pipeline - Learn how to incorporate recommendations from Policy Intelligence into an infrastructure as code pipeline.

Official Blog SRE Oct. 6, 2019

Transitioning a typical engineering ops team into an SRE powerhouse - Moving a network operations team to an SRE-driven model took some time, but was well worth the effort, as teams can focus on reliability rather than hardware.

DevOps Official Blog SRE Sept. 16, 2019

Shrinking the impact of production incidents using SRE principles—CRE Life Lessons - SRE principles can help you shrink the impact of production incidents through use of SLOs, writing postmortems, and promoting a blameless culture.

DevOps Official Blog SRE Terraform July 1, 2019

GCP DevOps tricks: Create a custom Cloud Shell image that includes Terraform and Helm - Learn how to add DevOps tools like Helm and Terraform to Cloud Shell, GCP’s browser-based management tool.

DevOps Official Blog SRE July 1, 2019

How SRE teams are organized, and how to get started - Getting started with SRE often starts with understanding SRE principles and how teams are organized. Find tips here on which SRE team implementation to use.

DevOps Infrastructure Official Blog SRE April 8, 2019

Want repeatable scale? Adopt infrastructure as code on GCP - The article describes the concepts and motivation behind the Infrastructure as Code approach.

DevOps Official Blog SRE March 25, 2019

Introducing a new Coursera course on Site Reliability Engineering - The new course, Site Reliability Engineering: Measuring and Managing Reliability, distills years of collective Google SRE experience with designing and managing complex systems that meet their reliability targets.

DevOps Official Blog SRE March 18, 2019

Make your voice heard! Take the 2019 Accelerate State of DevOps survey - By contributing to the survey, you will help shape the narrative of the rapidly growing DevOps industry. Your insights will help drive conversations on how as an industry we can develop software faster with less risk.

Istio Kubernetes Official Blog SRE March 11, 2019

The service mesh era: Using Istio and Stackdriver to build an SRE service - Demonstrating how to use Istio to level up SRE practices for workloads running in Kubernetes.

Official Blog SRE Feb. 4, 2019

Tune up your SLI metrics: CRE life lessons - How you can tune your existing SLIs to be a better representation of what your customers are experiencing.

Official Blog SRE Jan. 28, 2019

Do you have an SRE team yet? How to start and assess your journey - The Site Reliability Workbook is available in HTML now!

DevOps Official Blog SRE Jan. 21, 2019

Canary analysis: Lessons learned and best practices from Google and Waze - How Waze is using Spinnaker (a continuous delivery system) to do canary deployments.

Official Blog Security SRE Sept. 17, 2018

Trust through transparency: incident response in Google Cloud - A white paper that explains how Google Cloud manages incidents.

Official Blog SRE Aug. 6, 2018

Repairing network hardware at scale with SRE principles - Google’s SRE principles to guide developers and operations teams toward better systems reliability.

Official Blog SRE July 23, 2018

SRE fundamentals: SLIs, SLAs and SLOs - Learn about SRE fundamentals: SLIs, SLAs and SLOs.

SRE July 2, 2018

Understanding error budget overspend - part one - CRE life lessons - Questions to consider to see if you need to recalibrate your error budget when the downtime of your applications exceeds what your service level objectives allow.

SRE July 2, 2018

Good housekeeping for error budgets - part two - CRE life lessons - Fixing the root causes of error budget overspend.

SRE July 2, 2018

Kubernetes podcast - #9 SRE, with Tina Zhang and Fred van den Driessche.

Official Blog SRE June 4, 2018

Troubleshooting tips: Help your cloud provider help you - Tips for communicating with your cloud provider's support team.

Official Blog SRE June 4, 2018

Troubleshooting tips: How to talk so your cloud provider will listen (and understand) - Practical tips on communicating with cloud providers since cloud presents a new way of working for IT teams shifting away from legacy systems.

Official Blog SRE May 14, 2018

Defining SLOs for services with dependencies - CRE life lessons - How to define and manage SLOs for services with dependencies.

DevOps Official Blog SRE May 14, 2018

SRE vs. DevOps: competing standards or close friends? - What exactly is SRE and how does it relate to DevOps?

SRE March 19, 2018

Risk and Error Budgets - How the SRE discipline reduces tension over velocity/stability between product teams and system operators by quantifying risk and employing error budgets.

Official Blog SRE Feb. 12, 2018

Applying the Escalation Policy — CRE life lessons - CRE Life Lessons: Explore some scenarios to apply the Escalation Policy.

SRE Jan. 22, 2018

An example escalation policy — CRE life lessons - This post presents a lightly edited SLO escalation policy and its associated rationales from a Google SRE team, illustrating the trade-offs that particular teams make to maintain a high development velocity.

SRE Jan. 8, 2018

Consequences of SLO violations — CRE life lessons - The article explains the importance of creating a policy to handle Service Level Objective (SLO) violations, the role of Site Reliability Engineers (SREs) and developers in responding to them, and the structure of such a policy.

SRE Dec. 11, 2017

Getting the most out of shared postmortems — CRE life lessons - This post considers how to review a postmortem with your affected customer(s) to get better actionable data and to help customers improve their systems and practices.

SRE Oct. 30, 2017

Building good SLOs - CRE life lessons - Practical tips on how to formulate Service Level Objectives for Service Level Indicators.

SRE Aug. 14, 2017

CRE life lessons: The practicalities of dark launching - How to deal with some circumstances that can come up with dark launching.

SRE Aug. 7, 2017

CRE life lessons: What is a dark launch, and what does it do for me? - Dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user.

SRE July 10, 2017

Making the most of an SRE service takeover - CRE life lessons - In Part 2 of this blog post we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for take-over. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.

SRE June 26, 2017

Why should your app get SRE support? - CRE life lessons - Practical tips on how to organize a Site Reliability Engineering team.

SRE May 29, 2017

Know thy enemy: how to prioritize and communicate risks - CRE life lessons - This time: how to identify and mitigate risks in your system.

SRE April 3, 2017

How release canaries can save your bacon - CRE life lessons - A description of the release process using canary (gradual) releases, from the Site Reliability Engineering team.

SRE March 27, 2017

Reliable releases and rollbacks - CRE life lessons - Life lessons from Site Reliability Engineers (SREs) for when a new release is deployed but something goes wrong.

SRE March 6, 2017

Incident management at Google — adventures in SRE-land - How engineers at Google handle incidents in their data centres.

