Dealing with PII Data in Dataflow with Cloud DLP API - In this guide, we walk through creating a Dataflow pipeline that reads data from Google Cloud Storage (GCS), applies transformations and data masking using the Cloud DLP API, and then writes the transformed data to a BigQuery table.
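A minimal sketch of that shape of pipeline, calling the Cloud DLP API from a DoFn to mask email addresses; the bucket, project, and table names are hypothetical, not the guide's:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import dlp_v2

class MaskWithDlp(beam.DoFn):
    """Calls the Cloud DLP API to de-identify each record."""
    def __init__(self, project):
        self.project = project

    def setup(self):
        self.client = dlp_v2.DlpServiceClient()

    def process(self, line):
        response = self.client.deidentify_content(
            request={
                "parent": f"projects/{self.project}",
                # Mask anything DLP classifies as an email address.
                "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
                "deidentify_config": {
                    "info_type_transformations": {
                        "transformations": [{
                            "primitive_transformation": {
                                "character_mask_config": {"masking_character": "#"}
                            }
                        }]
                    }
                },
                "item": {"value": line},
            }
        )
        yield {"masked_text": response.item.value}

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | beam.io.ReadFromText("gs://my-bucket/input/*.csv")   # hypothetical path
     | beam.ParDo(MaskWithDlp(project="my-project"))        # hypothetical project
     | beam.io.WriteToBigQuery(
           "my-project:my_dataset.masked",                  # hypothetical table
           schema="masked_text:STRING"))
```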
Kafka migration from on-prem to Confluent - This technical blog guides readers through migrating from on-prem Kafka to Confluent Kafka and sheds light on the ingestion and transformation processes carried out by data pipelines that ultimately store the data in BigQuery.
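A minimal sketch of the Kafka-to-BigQuery hop with Beam's cross-language Kafka connector; broker, topic, and table names are hypothetical (a Confluent Cloud cluster would also need SASL settings in the consumer config):

```python
import json
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    (p
     | ReadFromKafka(
           consumer_config={"bootstrap.servers": "broker:9092"},
           topics=["events"])
     # Kafka records arrive as (key, value) byte pairs; decode the value.
     | beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
     | beam.io.WriteToBigQuery(
           "my-project:my_dataset.events",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```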
Protobuf to BigQuery with Apache Beam - A few days ago Apache released Beam 2.50, which adds support for writing protocol buffer objects into BigQuery tables, thanks to the writeProtos method.
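writeProtos itself belongs to the Java SDK's BigQueryIO; a common Python equivalent is converting each protobuf message to a dict before writing. A sketch, where my_protos_pb2 and MyEvent are a hypothetical generated module and message, not the article's code:

```python
import apache_beam as beam
from google.protobuf.json_format import MessageToDict
from my_protos_pb2 import MyEvent  # hypothetical generated protobuf module

def proto_to_row(msg):
    # preserving_proto_field_name keeps the snake_case field names so they
    # match the BigQuery column names.
    return MessageToDict(msg, preserving_proto_field_name=True)

with beam.Pipeline() as p:
    (p
     | beam.Create([MyEvent(user_id="a"), MyEvent(user_id="b")])
     | beam.Map(proto_to_row)
     | beam.io.WriteToBigQuery("my-project:my_dataset.events"))  # hypothetical
```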
Measuring climate and land changes with AI - In this People & Planet AI episode, we celebrate the launch of a geospatial project called Dynamic World, which maps the entire planet into different categories to track changes in ecosystems with precision. We then explore how to build an AI model like Dynamic World using Google Cloud.
Extend your Dataflow template with UDFs - Learn how to easily extend a Cloud Dataflow template with user-defined functions (UDFs) to transform messages in-flight, without modifying or maintaining Apache Beam code.
3x Dataflow Throughput with Auto Sharding for BigQuery - Google is launching Dataflow auto sharding, a new capability that gives users increased performance when writing to BigQuery from Dataflow. With auto sharding, Dataflow automatically sets the number of shards for the BigQuery sink, with no manual tuning.
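A minimal sketch of opting in from the Python SDK, assuming the with_auto_sharding flag on WriteToBigQuery (the Java counterpart is BigQueryIO.write().withAutoSharding()); topic and table names are hypothetical:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | beam.Map(lambda b: {"payload": b.decode("utf-8")})
     | beam.io.WriteToBigQuery(
           "my-project:my_dataset.events",
           schema="payload:STRING",
           # Let Dataflow pick and adapt the shard count at runtime.
           with_auto_sharding=True))
```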
Beam College - Improve your data processing skills through flexible hands-on training and practical tips provided by experts. Join the free workshops and learn how to use Apache Beam, from concepts to common use cases and best practices.
Cloud Composer launching Dataflow pipelines - A step-by-step tutorial that walks you through setting up a Cloud Composer solution that reads a comma-separated values text file and inserts each of its rows into a BigQuery table.
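A minimal sketch of a Composer (Airflow) DAG launching the Google-provided GCS-text-to-BigQuery Dataflow template; all bucket, project, and table names here are hypothetical rather than the tutorial's:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with DAG("csv_to_bigquery", start_date=datetime(2024, 1, 1),
         schedule_interval=None) as dag:
    start_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow",
        # Google-provided template that loads GCS text files into BigQuery.
        template="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",
        parameters={
            "inputFilePattern": "gs://my-bucket/input.csv",
            "JSONPath": "gs://my-bucket/schema.json",
            "javascriptTextTransformGcsPath": "gs://my-bucket/transform.js",
            "javascriptTextTransformFunctionName": "transform",
            "outputTable": "my-project:my_dataset.rows",
            "bigQueryLoadingTemporaryDirectory": "gs://my-bucket/tmp",
        },
        location="us-central1",
    )
```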
Computing Time Series metrics at scale in Google Cloud - This blog post shows how data scientists and engineers can use GCP Dataflow to compute time-series metrics in real time, or in batch to backfill data at scale, for example to detect anomalies in market data or IoT devices.
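A minimal sketch of one such metric, a per-key mean over a sliding window, built from Beam's windowing primitives; the topic and field names are hypothetical:

```python
import json
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/ticks")
     | beam.Map(json.loads)
     | beam.Map(lambda t: (t["symbol"], float(t["price"])))
     # 60-second windows of data, recomputed every 10 seconds.
     | beam.WindowInto(window.SlidingWindows(size=60, period=10))
     | beam.combiners.Mean.PerKey())
```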
Turn any Dataflow pipeline into a reusable template - Flex Templates allow you to create templates from any Dataflow pipeline with additional flexibility to decide who can run jobs, where to run the jobs, and what steps to take based on input and output parameters.
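A minimal sketch of the kind of parameterized pipeline a Flex Template wraps: runtime inputs arrive as ordinary pipeline options. The option names here are illustrative, not the article's:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--input", help="GCS glob of input files")
        parser.add_argument("--output_table", help="BigQuery table spec")

options = MyOptions()
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromText(options.input)
     | beam.Map(lambda line: {"line": line})
     | beam.io.WriteToBigQuery(options.output_table, schema="line:STRING"))
```

The pipeline is then packaged into a container image and registered with `gcloud dataflow flex-template build`, after which anyone with the right permissions can launch it via `gcloud dataflow flex-template run`.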
Streaming analytics on Google Cloud for regulated industries - This blog demonstrates how a streaming analytics pipeline on Google Cloud using Pub/Sub, Apache Beam (on the Dataflow runner), Cloud Storage, and BigQuery can be executed in a single region and protected end to end with customer-managed encryption keys (CMEK).
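A minimal sketch of attaching a customer-managed key to a Dataflow job from the Python SDK via the dataflow_kms_key option; the project, region, and key names are hypothetical:

```python
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
)

options = PipelineOptions(
    project="my-project",
    region="europe-west1",              # keep the job in a single region
    temp_location="gs://my-bucket/tmp",
)
# Encrypt the job's state and outputs with a Cloud KMS key.
options.view_as(GoogleCloudOptions).dataflow_kms_key = (
    "projects/my-project/locations/europe-west1/"
    "keyRings/my-ring/cryptoKeys/my-key"
)
```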
Developing interactively with Apache Beam notebooks - Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and parse individual PCollections in a read-eval-print-loop (REPL) workflow.
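A minimal sketch of that notebook workflow using the interactive runner from the Beam Python SDK; the toy data is mine, not the article's:

```python
import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

p = beam.Pipeline(InteractiveRunner())
words = p | beam.Create(["cat", "dog", "cat"])
counts = words | beam.combiners.Count.PerElement()

ib.show(counts)          # render the PCollection inline in the notebook
df = ib.collect(counts)  # materialize it as a pandas DataFrame for inspection
```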
Aggregated Audit Logging With Google Cloud and Python - Taking Apache2 access logs from a web server, converting the log file line by line to JSON, publishing that JSON to a Google Pub/Sub topic, transforming the data with Dataflow, and storing the results in BigQuery for long-term storage.
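A minimal sketch of the first hop, parsing access-log lines into JSON and publishing them to Pub/Sub; the regex covers only the common log format, and the project and topic names are hypothetical:

```python
import json
import re
from google.cloud import pubsub_v1

# Fields of the Apache common log format: client IP, timestamp,
# request line, status code, and response size.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)')

publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path("my-project", "apache-logs")

with open("/var/log/apache2/access.log") as f:
    for line in f:
        m = LOG_RE.match(line)
        if m:
            publisher.publish(topic, json.dumps(m.groupdict()).encode("utf-8"))
```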
Life of a Cloud Dataflow service-based shuffle - The service-based shuffle implementation (currently in beta) ships with the Cloud Dataflow SDK for Java version 2.0. This post explains and demonstrates the practical impact of the new shuffle on data pipelines, using the Opinion Analysis project as an example.
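During the beta, the service-based shuffle was opt-in; a sketch of enabling it, shown here with the Python SDK's experiments option rather than the post's Java flags:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    # Opt in to the service-based shuffle instead of worker-based shuffle.
    experiments=["shuffle_mode=service"],
)
```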
Cloud Dataflow 2.0 SDK goes GA - The new release brings better handling of large BigQuery sinks, the ability to write streaming data to text or Apache Avro files on Cloud Storage, support for writing into multiple BigQuery tables based on incoming user data, and more.
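One of those features, routing records to multiple BigQuery tables based on their contents, sketched here with the Python SDK's callable table destination (the release itself is the Java SDK); the names are hypothetical:

```python
import apache_beam as beam

def route(row):
    # Send each event type to its own table.
    return f"my-project:my_dataset.events_{row['type']}"

with beam.Pipeline() as p:
    (p
     | beam.Create([{"type": "click", "user": "a"},
                    {"type": "view", "user": "b"}])
     | beam.io.WriteToBigQuery(
           table=route,
           schema="type:STRING,user:STRING",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```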
Apache Beam publishes the first stable release - Apache Beam (an open source project providing a unified programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing, originally based on Dataflow) made its first stable release since entering the Apache incubator.
Example to Integrate Spark Streaming with Google Cloud at Scale - A GitHub repository containing an example of integrating Spark Streaming with Google Cloud products. The streaming application pulls messages from Google Pub/Sub directly, without Kafka, using custom receivers. While running, it can read entities from Google Datastore and write entities back to Datastore.