Summer 2020 Remote Software Engineering Internship at Google — Google Cloud Platform

Metrics Transform Processor in OpenTelemetry Collector

Jingbo Wang
6 min read · Sep 18, 2020

Internship Introduction

This summer, I was on the Core Compute Observability team under Google Cloud Platform (GCP) Cloud Monitoring; the internship ran from 5/26/2020 to 8/14/2020. Due to this year's special situation, all Google internships were conducted virtually.

Project Background

What is OpenTelemetry and Collector

OpenTelemetry is a new set of telemetry libraries and tools that make monitoring in the cloud easy and portable. It aims to establish an industry standard for cloud monitoring through the joint efforts of major cloud and monitoring providers, including Google, Microsoft, Amazon, and Splunk. One notable feature of OpenTelemetry is its Collector. The Collector is a standalone service that runs as a centralized spot to collect, process, and export metrics and traces from a VM or container to any monitoring backend (e.g., Google Cloud Monitoring). The Collector's pipeline has three main components: receivers, processors, and exporters.

Metrics Transform Processor

One of the goals of my internship was to determine whether we could use the OpenTelemetry Collector to record the VM host metrics (CPU, disk, memory, etc.) for use on Google Compute Engine. In order to do that with backward compatibility with the Cloud Monitoring agent, the metrics needed to be sent using the existing Cloud Monitoring agent metric schemas.

The metrics transform processor addresses this metric schema conversion need. It handles renaming of metrics, labels, and label values in a configurable way. It also enables aggregations across labels and label values, which are needed for metrics to conform to the format of specific monitoring backends (in this case, Cloud Monitoring). For instance, the Collector currently exports the CPU usage of each core separately, but the Cloud Monitoring agent by default only sends CPU usage aggregated across cores. To preserve compatibility with that default, we need to aggregate the metrics before writing. Moreover, if the Collector exports metrics with an unnecessary amount of detail, customers face higher costs. With the aggregation feature, users can define which labels or label values to ignore, so unnecessary detail is aggregated away while no essential information is lost.

Approach

The approach includes two major parts: configuration design and algorithm implementation. The configuration is a user-defined YAML file that instructs the processor at build time about which operations to perform on which metrics. I designed this configuration to optimize the user experience for the most common use cases of the processor.

Configuration Structure

Configuration file structure
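As a rough sketch, the configuration follows the shape below. Field names reflect the open-source metricstransform processor in opentelemetry-collector-contrib and may differ slightly from the exact snippet in the image:

```yaml
processors:
  metricstransform:
    transforms:
        # which metric to transform
      - include: <metric name>
        # update the metric in place (or insert a transformed copy)
        action: update
        # optional rename of the metric itself
        new_name: <new metric name>
        # operations applied in order to the matched metric
        operations:
          - action: update_label        # rename a label or its values
            label: <label>
            new_label: <new label>
          - action: aggregate_labels    # aggregate away unwanted labels
            label_set: [<labels to keep>]
            aggregation_type: sum       # sum | mean | max | min
```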

Examples

Configuration to transform the OpenTelemetry memory usage metric into the Cloud Monitoring agent one

This configuration means that for all data points with label values slab_reclaimable and slab_unreclaimable under the label state, combine them by taking the sum and give them a new label value slab.
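Concretely, this transformation could be written as follows, using the field names of the open-source metricstransform processor (the exact snippet in the image above may differ):

```yaml
processors:
  metricstransform:
    transforms:
      - include: system.memory.usage
        action: update
        operations:
            # merge the two slab_* values of the "state" label into a
            # single "slab" value by summing their data points
          - action: aggregate_label_values
            label: state
            aggregated_values: [slab_reclaimable, slab_unreclaimable]
            new_value: slab
            aggregation_type: sum
```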

As for algorithm implementation, this entire project is written in Golang. The aggregation algorithm uses hash maps to group time-series data points based on specifications from the configuration file, which includes grouping based on labels and label values. Then the specified operation (sum, mean, max, or min) is performed on each group. We prioritized performance so that the process itself does not consume much of the machine's resources. For example, to ensure optimal performance, the initialization of some maps and sets is moved from runtime to build time, so that it doesn't get re-run every time the pipeline consumes metrics.
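The grouping step can be sketched as follows. This is a simplified, self-contained illustration, not the actual processor code; the `DataPoint` type and function names are made up for the example:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// DataPoint is a simplified stand-in for a numeric metric data point:
// a set of label values plus a single value.
type DataPoint struct {
	Labels map[string]string
	Value  float64
}

// groupKey builds a deterministic key from the labels that survive
// aggregation (i.e. all labels except those being aggregated away).
func groupKey(labels map[string]string, drop map[string]bool) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		if !drop[k] {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

// aggregateAcrossLabels sums data points that share the same values on
// every label except the dropped ones; mean/max/min follow the same
// grouping step with a different combine function.
func aggregateAcrossLabels(points []DataPoint, dropLabels []string) map[string]float64 {
	drop := make(map[string]bool, len(dropLabels))
	for _, l := range dropLabels {
		drop[l] = true
	}
	out := make(map[string]float64)
	for _, p := range points {
		out[groupKey(p.Labels, drop)] += p.Value
	}
	return out
}

func main() {
	points := []DataPoint{
		{Labels: map[string]string{"cpu": "0", "state": "user"}, Value: 10},
		{Labels: map[string]string{"cpu": "1", "state": "user"}, Value: 20},
		{Labels: map[string]string{"cpu": "0", "state": "system"}, Value: 5},
	}
	// Aggregate CPU usage across cores: the "cpu" label is dropped,
	// so points that differ only in "cpu" are summed together.
	fmt.Println(aggregateAcrossLabels(points, []string{"cpu"}))
	// prints map[state=system:5 state=user:30]
}
```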

Results

Monitoring dashboard for metrics transformation

Here is a custom monitoring dashboard with memory usage data coming from a VM running on Google Compute Engine (GCE):

  • Left: metric from the current Cloud Monitoring agent
  • Middle: unprocessed metric from the Collector
  • Right: processed metric from the Collector via metrics transform processor

Problem: there are formatting differences between the raw Collector metric in the middle and the Cloud Monitoring metric on the left.

  1. The metric names are different: the Collector metric is called system.memory.usage, whereas the Cloud Monitoring metric is called memory/bytes_used.
  2. The label values are also different: the Collector metric has separate label values slab_reclaimable and slab_unreclaimable, whereas the Cloud Monitoring metric only has the single label value slab in their place.

Solution: the metrics transform processor solves these problems in the processed metric on the right:

  1. The metric name is updated from system.memory.usage to memory/bytes_used by the renaming feature.
  2. The label values slab_reclaimable and slab_unreclaimable are aggregated together by taking the sum to form the single slab label value. This is achieved by the aggregation functionality.

In the end, the metric is transformed to be fully compatible with the current Cloud Monitoring agent metrics. As for the final product, the processor is already included in the current release of the OpenTelemetry Collector, where it will impact millions of OpenTelemetry and GCP users.

Build Tooling and Third-Party Support

Since I still had a few weeks left after finishing my core project, to ensure the best experience for GCP operations engineers and users, I developed an automated build tool that packages the Collector binary with a preloaded configuration and a detailed README, making the Collector readily compatible with Google Cloud Monitoring. The build tool is a combination of a Makefile, shell scripts, and a Docker container, which work together behind the scenes to build a distributable tarball. Furthermore, I collaborated with my co-intern so that this final distribution package also includes his receiver, which accepts Prometheus metrics from MySQL, Apache, and JVM.
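The packaging step can be sketched as a short shell script. This is an illustrative sketch only, with made-up file names; the real tool builds the binary inside a Docker container, while here placeholder files stand in so the example is self-contained:

```shell
#!/bin/sh
# Bundle a Collector binary with a preloaded Cloud Monitoring
# configuration and a README into one distributable tarball.
set -e

DIST=otelcol-google-cloud
mkdir -p "$DIST"

# Placeholders for the Docker-built binary, the preloaded config,
# and the README (names are illustrative, not the actual tool's).
: > "$DIST/otelcol"
: > "$DIST/config.yaml"
: > "$DIST/README.md"

# Produce the distributable tarball and list its contents.
tar -czf "$DIST.tar.gz" "$DIST"
tar -tzf "$DIST.tar.gz"
```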

Future Work

This processor holds a great amount of potential, as it provides some essential functionality, and it can definitely be extended to become more powerful. In fact, other developers added features to the processor even during its development. I would like to see smarter features such as selecting metrics by regex, or aggregation over time instead of just within a batch. Aggregation over time, specifically, would be useful because it would enable ephemeral clients like web browsers or cloud functions to write individual data points to the Collector; these data points would then be aggregated into counters/gauges before being sent to a monitoring backend.

Learning Experience and Conclusion

It was indeed a dream come true to work at Google! I am grateful to have the opportunity to work with such talented engineers and leaders, and to make such a large-scale impact as a Google intern. I also felt fortunate to work on OpenTelemetry, where I learned so much about the monitoring/metrics space, Golang and build tools, which I had no exposure to before.

I have also polished my technical communication skills significantly. Throughout this experience, I drafted a design document (publicly viewable as a GitHub issue) to present my idea and approach, and I stayed active in advocating for my work among the stakeholders.

There were also some events that made this experience more interesting. First of all, our work was featured on the internal Eng News, which gave us a stage to showcase our projects to a wider audience. Since OpenTelemetry also has contributors from Amazon, we had the chance to meet with a team from Amazon to help define their project scope. Last but not least, I had the opportunity to perform code reviews on incoming features to my project, which was a unique and thrilling experience as an intern.

Acknowledgment

I am glad that we interns got to get together for virtual hangouts. I made great friends and connections that made the experience even more incredible! Thanks to our host Dave Raffensperger, co-host Quentin Smith, fellow Googler James Bebbington, and my co-intern Nicolas MacBeth. Special thanks to Dave Raffensperger and other Googlers working on OpenTelemetry for reviewing this article!
