Tag: SRE
DevOps Official Blog SRE Terraform Workforce Identity Federation Sept. 25, 2023Manage infrastructure with Workload Identity Federation and Terraform Cloud - Terraform Cloud workspaces integrate with Workload Identity Federation to authenticate and then impersonate Google Cloud service accounts.
Cloud SQL DevOps SRE Terraform Sept. 25, 2023How to connect to GCP Private Cloud SQL instance in your local machine using a Bastion and Terraform. - A Terraform snippet to create a bastion VM to access a Cloud SQL instance that has a private IP.
Kubernetes SRE Sept. 18, 2023Google Kubernetes Engine Troubleshooting Made Simple with Interactive Playbooks - Using GKE interactive playbooks for troubleshooting guidance for common issues.
DevOps Official Blog SRE Aug. 28, 2023Calling all DevOps, IT Ops, Platform Engineers and SREs: 5 can’t-miss breakout sessions at Next ‘23 - There’s no lack of breakout sessions for DevOps, IT Ops, Platform Engineers, and Site Reliability Engineers (SREs) at Google Cloud Next 2023.
Cloud Run Monitoring SRE Aug. 21, 2023How to create a SLO for Cloud Run programatically
Google Kubernetes Engine Official Blog SRE Aug. 21, 2023How to set up observability for a multi-tenant GKE solution - Learn how to set up a GKE multi-tenant solution for observability using Log Router, and setting up a sink to route a tenant’s logs to their dedicated GCP project.
Cloud Bigtable Official Blog SRE Aug. 7, 2023What's new in Bigtable observability - Learn about new tools and metrics for Cloud Bigtable including query stats, high-granularity metrics, and table stats.
CI GCP Experience Official Blog SRE July 24, 2023Vodafone: A DevOps approach to AI/ML through cloud-native CI/CD pipelines - How Vodafone improved the performance of its ML pipelines by using DevOps principles of automation, code mirroring and CI/CD.
DevOps Official Blog SRE July 17, 2023DevOps Awards winner Sabre on nurturing team culture - Sabre worked closely with Google Cloud to transform its system and company culture to make better use of the cloud.
DevOps Official Blog SRE June 19, 20232022 State of DevOps Report data deep dive: Documentation is like sunshine - The State of DevOps Report finds a clear link between documentation quality and an organization’s ability to meet its performance goals.
Cloud Monitoring Official Blog SRE June 19, 2023New in Cloud Monitoring: Better tools for analysis, uptime checks, and alerts - We recently launched several new Cloud Monitoring features to improve your visualization and troubleshooting experience.
DevOps Official Blog SRE June 12, 2023The Modernization Imperative: Shifting left is for suckers. Shift down instead - Instead of developers “shifting left,” they need to “shift down” and push more workloads down onto the platforms they’re already using.
DevOps Monitoring Official Blog SRE May 15, 2023Uptime checks for availability - Monitor the availability of public and private resources, and get alerted when there are problems.
AI DevOps Official Blog SRE May 15, 2023Introducing Duet AI for Google Cloud – an AI-powered collaborator
DevOps Official Blog SRE Terraform April 24, 2023Running Infrastructure-as-Code with the least privilege possible - Google service account impersonation lets you run your Terraform code and manage resources without overly broad access.
Billing Cloud Monitoring DevOps Official Blog SRE April 17, 2023How to identify and reduce costs of your Google Cloud observability in Cloud Monitoring - A cost savings guide for Cloud Monitoring.
Compute Engine Official Blog SRE April 10, 2023Monitor the health of your VM fleets in the Compute Engine console - The new Observability tab in the Compute Engine console provides insights into CPU, memory, network, disk, live processes, and system events.
Cloud SQL Database Migration Service SRE March 27, 2023Upgrade Your MySQL Version with Minimal Downtime: Our Journey with Google Cloud’s Data Migration Service - Google Cloud provides a reliable and efficient tool for upgrading your MySQL instance using its Data Migration Service (DMS).
Monitoring Prometheus SRE March 27, 2023Scaling Observability Reliably and Frugally at Magicpin - A process of creating an observability platform on GCP.
Cloud Monitoring Official Blog SRE March 20, 2023Verify POST endpoint availability with Uptime Checks - Google Cloud Monitoring can now handle any kind of request bodies for POST requests, giving you better REST resource tracking.
Official Blog SRE March 13, 2023Adopting SRE: Standardizing your SLO design process - Designing SLOs is a key SRE competency which requires careful consideration and a consistent approach to implementation.
DevOps Google Kubernetes Engine Official Blog SRE Feb. 6, 2023Scalability testing on Google Kubernetes Engine: Know before you go - Getting ready to scale up a Kubernetes-based workload? Learn about the benefits, how to set goals and best practices of scalability testing on GKE.
DevOps Official Blog SRE Jan. 23, 2023Reliability and SRE in the 2022 State of DevOps Report - Learn more about the connection between SRE, DevOps and reliability.
DevOps SRE Dec. 26, 2022Disaster Recovery — locality-restricted workloads on GCP - This post discusses how you can use Google Cloud to architect for disaster recovery (DR) to meet location-specific requirements.
DevOps Official Blog SRE Dec. 19, 2022Why Focus on Symptoms, Not Causes? - Why aren’t we monitoring what users care about? How did we get here? What do users care about?
DevOps Official Blog SRE Nov. 21, 2022Composite availability: calculating the overall availability of cloud infrastructure - Understand how to calculate the composite reliability of your cloud infrastructure to help design Cloud architectures with an optimal SLA.
DevOps GCP Experience Networking Security SRE VPC Service Controls Nov. 7, 2022How we secured our data on the Cloud - Challenges and solutions while enforcing VPC Service Controls.
DevOps Official Blog Skaffold SRE Oct. 31, 2022Skaffold v2 GA: Further enhancing developer productivity - With Skaffold V2, you can now build and manage container images on Cloud Run and on ARM architectures.
Billing Cloud Logging Official Blog SRE Oct. 24, 2022Cloud Logging pricing for Cloud Admins: How to approach it & save cost - How, where and when pricing is incurred in Cloud Logging, Google’s observability solution to manage Logs. It also covers our recommendations to save and optimize cost.
CI DevOps Official Blog SRE Sept. 19, 2022Building a secure CI/CD pipeline using Google Cloud built-in services - Build a secure CI/CD pipeline using Google Cloud's built-in services using Cloud Build, Cloud Deploy, Artifact Registry, Binary Authorization and GKE.
Cloud Operations OpenTelemetry SRE Aug. 29, 2022Ultimate Google Cloud Operations configuration for external services - Monitoring Elasticsearch service deployed on Elastic Cloud with OpenTelemetry and Cloud Operations.
Security SRE Aug. 15, 2022Gremlin Chaos Engineering On Google Cloud - How to implement chaos engineering experiments using Gremlin on Google Cloud.
Cloud Monitoring DevOps Official Blog SRE Aug. 15, 2022Snooze your alert policies in Cloud Monitoring - Snooze alert policies to prevent the creation of alerts and notifications. This is useful during maintenance windows, non-business hours, and more.
Data Analytics DevOps Looker Official Blog SRE Aug. 1, 2022Managing the Looker ecosystem at scale with SRE and DevOps practices - Following DevOps and SRE best practices can help organizations bring order to distributed Looker environments.
DevOps Official Blog SRE July 4, 2022Incorporating quota regression detection into your release pipeline - Check quotas across cloud environments before promoting images to prevent outages due to inconsistent API quota limits.
DevOps Official Blog SRE May 30, 2022Enterprise DevOps Guidebook - Chapter 1 - Learn more about how to implement DORA best practices with our DevOps Enterprise Guidebook.
DevOps Official Blog SRE May 30, 2022Application Rationalization through Google Cloud’s CAMP Framework - Application Rationalization through CAST Highlight (automated source code scan with business context) and mFit (VM workload assessment & automated migration).
Cloud Monitoring Kubernetes Monitoring SRE May 16, 2022Metrics Management with Google Cloud Managed Service for Prometheus - Maisons du Monde is a furniture and home decor company that was founded in France over 25 years ago. We have 360 stores across France….
DevOps Official Blog SRE May 9, 2022Are your SLOs realistic? How to analyze your risks like an SRE - Before committing to an SLO, Site Reliability Engineering practices recommend that you evaluate the risks to a given service.
DevOps Infrastructure Official Blog SRE April 25, 2022The SRE book turns 6! - Site Reliability Engineering with Google’s SRE team! Since the publication of the SRE Book, we’ve learned and shared a lot —come explore SRE with us!
Official Blog SRE April 18, 2022Introducing the Google SRE Prodcast - Discover Prodcast, Google’s Site Reliability Engineering Podcast. This limited-edition series explores fundamental topics in reliability engineering from the perspective of experienced Google SREs.
SRE Terraform April 11, 2022GCP integration with PagerDuty using Terraform - This article shows how Storytel 2022 went from a basic setup with a single global on-call team to a Full Service Ownership setup.
DevOps Monitoring Official Blog SRE April 4, 2022Add severity levels to your alert policies in Cloud Monitoring - Add static and dynamic severity levels to your alert policies for easier triaging and include these in notifications when sent to 3rd party services.
Security SRE March 21, 2022Forensics - Ever wondered what you need to do to collect evidence when you have an incident?
Official Blog Security SRE Feb. 14, 2022Achieving Autonomic Security Operations: Automation as a Force Multiplier - Your Security Operations Center (SOC) can learn a lot from what IT operations learned during the SRE revolution. In this post of the series, we plan to extract the lessons for your SOC centered on another SRE principle - automation as a force multiplier.
DevOps Official Blog SRE Dec. 13, 2021Postmortems at Loon: a guiding force for rapid development - Discover how Loon Site Reliability Engineers used postmortems to iterate on their stratospheric software-defined network.
DevOps SRE Dec. 6, 2021Part-5: Google DevOps-Observability with SRE principles
DevOps GCP Experience Official Blog SRE Dec. 6, 2021Shopify engineers deliver on peak performance during Black Friday Cyber Monday 2021 - Shopify just experienced a record-breaking Black Friday Cyber Monday. Learn how Shopify works with Google Cloud to handle unprecedented peak moments with ease.
DevOps Official Blog SRE Dec. 6, 2021Want to supercharge your DevOps practice? Research says try SRE - The 2021 DORA State of DevOps Report found interesting trends in DevOps shops that use SRE best practices.
Cloud Operations GCP Experience Official Blog SRE Nov. 29, 2021How Sabre is using SRE to lead a successful digital transformation - Sabre Corporation joined forces with Google Cloud as their preferred cloud provider to accelerate their digital transformation following SRE principles.
DevOps NoSQL Official Blog SRE Nov. 29, 2021Empowering DevOps to foster customer loyalty in modern retail with MongoDB Atlas on Google Cloud - MongoDB Atlas on Google Cloud can enhance DevOps performance in today’s retail market.
DevOps Kubernetes SRE Oct. 25, 2021Google Cloud DevOps Series - Google Cloud Compute options for Kubernetes.
Official Blog SRE Sept. 27, 2021What’s your org’s reliability mindset? Insights from Google SREs - An organization’s approach to product reliability is a function of its mindset.
Cloud Operations SRE Tutorial Sept. 13, 2021Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox - In this step-by-step guide, I will demonstrate how to configure SLOs in Cloud Operations using our learning environment, Cloud Operation Sandbox.
Anthos Official Blog SRE Terraform Aug. 23, 2021Deploy Anthos on GKE with Terraform part 1: GitOps with Config Sync - It is now simple to use Terraform to configure Anthos features on your GKE clusters. This is the first part of the 3 part series that describes using Terraform to enable Config Sync.
Anthos DevOps Official Blog SRE Aug. 9, 2021Get in sync: Consistent Kubernetes with new Anthos Config Management features - Anthos Config Management and Config Controller bring Kubernetes-style declarative policy and config management to GKE environments.
CI Cloud Build Official Blog SRE Aug. 2, 2021Introducing Cloud Build private pools: Secure CI/CD for private networks - With new private pools, you can use Google Cloud’s hosted Cloud Build CI/CD service on resources in your private network or in other clouds.
DevOps Official Blog SRE Aug. 2, 2021Securing the software development lifecycle with Cloud Build and SLSA - Google’s proposed SLSA framework provides guidance on how to build a more secure software supply chain.
DevOps Official Blog SRE Aug. 2, 2021Let's migrate: why lifting and shifting is simply too easy to ignore - Maximise the velocity and success of your cloud migration by starting with lift and shift.
DevOps Official Blog SRE July 5, 2021Announcing the 2021 State of DevOps Report Sponsors
Cloud CDN DevOps SRE July 5, 2021Google Cloud CDN Custom Dashboard - An example of a custom Dashboard in Cloud Monitoring for Cloud CDN.
DevOps Official Blog SRE June 22, 2021Are we there yet? Thoughts on assessing an SRE team’s maturity - Examining the key indicators that signal a mature SRE team.
Cloud Operations GCP Experience Official Blog SRE June 14, 2021How Lowe’s meets customer demand with Google SRE practices - Lowe’s has adopted Google SRE practices to help developer and operations teams keep up with ecommerce demand.
DevOps Official Blog SRE June 7, 2021DevOps on Google Cloud: tools to speed up software development velocity - Google Cloud’s application development and continuous integration/continuous delivery (CI/CD) tools help ForgeRock developers stay productive.
Official Blog SRE May 31, 2021Four steps to jumpstarting your SRE practice - Once you have leadership buy-in, there are some things you can do to get the SRE ball rolling, fast.
DevOps SRE May 24, 2021Book - Implementing DevOps on Google Cloud - Achieving Google’s Professional Cloud DevOps Engineer Certification.
Cloud Operations DevOps Official Blog SRE May 10, 2021SRE fundamentals 2021: SLIs vs SLAs vs SLOs - What’s the difference between an SLI, an SLO and an SLA? Google Site Reliability Engineers (SRE) explain.
DevOps Official Blog SRE May 3, 2021SRE at Google: Our complete list of CRE life lessons - Find links to blog posts that share Google’s SRE best practices in one handy location.
DevOps Official Blog SRE April 25, 20215 resources to help you get started with SRE - Here are the top five Google Cloud resources for getting started on your SRE journey.
Cloud Operations SRE Stackdriver April 5, 2021SRE Public Resources for GCP Customers - A list of articles, videos and courses related to SRE.
Official Blog SRE March 15, 2021How do you eat an elephant? Google SREs talk digital transformation - It’s not just about technology. Google Cloud SREs touch on the human and organizational side of a cloud migration.
Cloud Operations Official Blog SRE March 1, 2021With SRE, failing to plan is planning to fail - The process of becoming a successful Site Reliability Engineering shop starts well before you take your first class or read your first manual.
Cloud Operations DevOps Official Blog SRE Jan. 25, 2021Take the first step toward SRE with Cloud Operations Sandbox - Spin up the Cloud Operations Sandbox to see how Google’s logging, monitoring, tracing, profiling and debugging can kickstart your SRE practice.
Cloud Operations Monitoring SRE Stackdriver Jan. 25, 2021Operation Suite GCP - Monitoring Logging and Error Reporting - An overview of Operation Suite in GCP: Monitoring, Logging, Error Reporting.
Cloud Build DevOps SRE Oct. 12, 2020Gitflow with Github and Cloud Build - Implementing Gitflow using Github and Cloud Build.
DevOps Monitoring SRE Oct. 5, 2020How to alert on SLOs - How to use SLO error budget alerts in Monitoring.
DevOps Official Blog SRE Sept. 28, 2020SRE Classroom: exercises for non-abstract large systems design - Learn how to apply SRE principles in this series of workshops on non-abstract large systems design (NALSD) with Google engineers.
DevOps Official Blog SRE Sept. 28, 2020Are you an Elite DevOps performer? Find out with the Four Keys Project - Learn how the Four Keys open source project lets you gauge your DevOps performance according to DORA metrics.
Cloud Monitoring DevOps SRE Terraform Sept. 7, 2020Creating SLOs with Terraform - Example of creating SLO for Cloud Monitoring using Terraform.
GCP Experience Official Blog SRE Aug. 10, 2020Three months, 30x demand: How we scaled Google Meet during COVID-19 - Learn how Google's SRE team ramped up to handle high demand for Google Meet in response to COVID-19.
Monitoring Official Blog SRE July 13, 2020Setting SLOs: observability using custom metrics - See how you can set service-level objectives (SLOs) for complex services for better cloud monitoring. Part of SRE tips series.
Cloud Monitoring Official Blog SRE July 13, 2020Setting SLOs: a step-by-step guide - See how to use SRE principles to keep customers happy with your service, using the right service-level objectives (SLOs).
Official Blog SRE June 29, 2020How maintenance windows affect your error budget — SRE tips - See how maintenance windows can impact your error budget when using SRE practices, and get tips on how and when to use them.
Official Blog SRE June 15, 2020Building resilient systems to weather the unexpected - See how SRE teams at Google apply principles in practice to build resilient systems and prepare for any type of business continuity needs.
DevOps Official Blog SRE June 1, 2020Meeting reliability challenges with SRE principles - Following SRE principles can help you build reliable production systems. When getting started, you may encounter three common challenges. Here’s how to solve them.
Official Blog SRE May 4, 2020Designing distributed systems using NALSD flashcards - Get to know the SRE-inspired principles and numbers, plus handy flashcards, to help you design distributed systems with non-abstract large systems design (NALSD).
DevOps Official Blog SRE April 13, 2020Learn to build secure and reliable systems with a new book from Google - Engineers across Google's security and SRE organizations share best practices to help you design scalable and reliable systems that are fundamentally secure.
Official Blog SRE March 16, 2020Finding a problem at the bottom of the Google stack - See a real-world example of how Google’s SRE practices can identify and help fix issues, even at the bottom of the hardware stack.
Monitoring Official Blog SRE March 16, 2020Use SRE principles to monitor pipelines with Cloud Monitoring dashboards - Try SRE principles and the four golden signals as the metrics to build a monitoring dashboard for your data pipelines.
AWS DevOps GCP Experience SRE March 9, 2020Our migration journey from AWS to Google Cloud — Part 1 - Description of infrastructure migration from AWS to GCP, part 1.
AWS DevOps GCP Experience SRE March 9, 2020Our migration journey from AWS to Google Cloud — Part 2 - Description of infrastructure migration from AWS to GCP, part 2.
Google Kubernetes Engine Official Blog SRE Jan. 20, 2020Using deemed SLIs to measure customer reliability - Following SRE principles involves reliability metrics like SLOs and SLIs. See how CRE teams and customers at Google use deemed SLIs
Cloud Storage SRE Stackdriver Storage Dec. 23, 2019Monitoring bytes sent from Google Cloud Storage buckets - The article describes how to set up monitoring and create alerts based on data transferred from Cloud Storage.
SRE Dec. 23, 2019Warm Disaster recovery for applications in Google Cloud - The article explains how to set up a Warm Disaster Recovery pattern for applications.
Official Blog SRE Dec. 16, 2019Learning—and teaching—the art of service-level objectives -- CRE Life Lessons - Host your own Art of SLOs workshop with Google SRE materials, now available to anyone.
DevOps Official Blog SRE Dec. 9, 2019Shrinking the time to mitigate production incidents - CRE life lessons - See how you can use SRE and CRE principles and tests from Google, including Wheel of Misfortune and DiRT, to reduce the time needed to mitigate production incidents.
SRE Nov. 18, 2019SRE Best Practices, For People in a Hurry - 20 simple rules for building a Google-Grade Site Reliability Engineering (SRE) practice.
SRE Nov. 18, 2019Hot Disaster recovery on Google Cloud for applications running on-premises - The article goes through the process of creating a Hot Disaster Recovery setup on GCP for on-premises applications.
SRE Nov. 11, 2019Warm Disaster recovery on Google Cloud for applications running on-premises - The article explains the Warm Disaster Recovery pattern.
DevOps Official Blog SRE Nov. 4, 2019How to integrate Policy Intelligence recommendations into an IaC pipeline - Learn how to incorporate recommendations from Policy Intelligence into an infrastructure as code pipeline
Official Blog SRE Oct. 6, 2019Transitioning a typical engineering ops team into an SRE powerhouse - Moving a network operations team to an SRE-driven model took some time, but was well worth the effort, as teams can focus on reliability rather than hardware.
DevOps Official Blog SRE Sept. 16, 2019Shrinking the impact of production incidents using SRE principles—CRE Life Lessons - SRE principles can help you shrink the impact of production incidents through use of SLOs, writing postmortems, and promoting a blameless culture.
DevOps Official Blog SRE Terraform July 1, 2019GCP DevOps tricks: Create a custom Cloud Shell image that includes Terraform and Helm - Learn how to add DevOps tools like Helm and Terraform to Cloud Shell, GCP’s browser-based management tool
DevOps Official Blog SRE July 1, 2019How SRE teams are organized, and how to get started - Getting started with SRE often starts with understanding SRE principles and how teams are organized. Find tips here on which SRE team implementation to use.
DevOps Infrastructure Official Blog SRE April 8, 2019Want repeatable scale? Adopt infrastructure as code on GCP - The article describes the concepts and motivation behind the Infrastructure as Code approach.
DevOps Official Blog SRE March 25, 2019Introducing a new Coursera course on Site Reliability Engineering - The new course, Site Reliability Engineering: Measuring and Managing Reliability, distills years of collective Google SRE experience with designing and managing complex systems that meet their reliability targets.
DevOps Official Blog SRE March 18, 2019Make your voice heard! Take the 2019 Accelerate State of DevOps survey - By contributing to the survey, you will help shape the narrative of the rapidly growing DevOps industry. Your insights will help drive conversations on how as an industry we can develop software faster with less risk.
Istio Kubernetes Official Blog SRE March 11, 2019The service mesh era: Using Istio and Stackdriver to build an SRE service - Demonstrating how to use Istio to level up SRE practices for workloads running in Kubernetes.
Official Blog SRE Feb. 4, 2019Tune up your SLI metrics: CRE life lessons - How you can tune your existing SLIs to be a better representation of what your customers are experiencing.
Official Blog SRE Jan. 28, 2019Do you have an SRE team yet? How to start and assess your journey - The Site Reliability Workbook is available in HTML now!
DevOps Official Blog SRE Jan. 21, 2019Canary analysis: Lessons learned and best practices from Google and Waze - How Waze is using Spinnaker (continuous delivery system) to do canary deployments.
Official Blog Security SRE Sept. 17, 2018Trust through transparency: incident response in Google Cloud - White paper which explains how Google Cloud manages incidents.
Official Blog SRE Aug. 6, 2018Repairing network hardware at scale with SRE principles - Google’s SRE principles to guide developers and operations teams toward better systems reliability.
Official Blog SRE July 23, 2018SRE fundamentals: SLIs, SLAs and SLOs - Learn about SRE fundamentals: SLIs, SLAs and SLOs.
SRE July 2, 2018Understanding error budget overspend - part one - CRE life lessons - Questions to consider to see if you need to recalibrate your error budget when the downtime of your applications exceeds what your service level objectives allow.
SRE July 2, 2018Good housekeeping for error budgets - part two - CRE life lessons - Fixing the root causes of error budget overspend.
SRE July 2, 2018Kubernetes podcast - #9 SRE, with Tina Zhang and Fred van den Driessche.
Official Blog SRE June 4, 2018Troubleshooting tips: Help your cloud provider help you - Tips for communicating with your cloud provider's support team.
Official Blog SRE June 4, 2018Troubleshooting tips: How to talk so your cloud provider will listen (and understand) - Practical tips on communicating with cloud providers since cloud presents a new way of working for IT teams shifting away from legacy systems.
Official Blog SRE May 14, 2018Defining SLOs for services with dependencies - CRE life lessons - How to define and manage SLOs for services with dependencies.
DevOps Official Blog SRE May 14, 2018SRE vs. DevOps: competing standards or close friends? - What exactly is SRE and how does it relate to DevOps?
SRE March 19, 2018Risk and Error Budgets - How the SRE discipline reduces tension over velocity/stability between product teams and system operators by quantifying risk and employing error budgets.
Official Blog SRE Feb. 12, 2018Applying the Escalation Policy — CRE life lessons - CRE Life Lessons: Explore some scenarios to apply the Escalation Policy
SRE Jan. 22, 2018An example escalation policy — CRE life lessons - This post presents a lightly edited SLO escalation policy and associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.
SRE Jan. 8, 2018Consequences of SLO violations — CRE life lessons - The article explains the importance of creating a policy to handle Service Level Objective (SLO) violations, the role of Site Reliability Engineers (SREs) and Devs in responding to SLO violations, and the structure of such a policy.
SRE Dec. 11, 2017Getting the most out of shared postmortems — CRE life lessons - This post considers how to review a postmortem with your affected customer(s) for better actionable data and also to help customers improve their systems and practices.
SRE Oct. 30, 2017Building good SLOs - CRE life lessons - Practical tips on how to formulate Service Level Objectives for Service Level Indicators
SRE Aug. 14, 2017CRE life lessons: The practicalities of dark launching - How to deal with some circumstances that can come up with dark launching.
SRE Aug. 7, 2017CRE life lessons: What is a dark launch, and what does it do for me? - Dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user.
SRE July 10, 2017Making the most of an SRE service takeover - CRE life lessons - In Part 2 of this blog post we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for take-over. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.
SRE June 26, 2017Why should your app get SRE support? - CRE life lessons - Practical tips on how to organize a Site Reliability Engineering team
SRE May 29, 2017Know thy enemy: how to prioritize and communicate risks - CRE life lessons - How to identify and mitigate risks in your system
SRE April 3, 2017How release canaries can save your bacon - CRE life lessons - A description of the release process using canary (gradual) releases, from the Site Reliability Engineering team
SRE March 27, 2017Reliable releases and rollbacks - CRE life lessons - Life lessons from SREs (Site Reliability Engineers) for when a new release is deployed but something goes wrong
SRE March 6, 2017Incident management at Google — adventures in SRE-land - How engineers at Google handle incidents in their data centres