Introduction
Monitoring is part of maintaining healthy, performant systems. Whether it is a simple application or a complex microservice architecture, keeping tabs on system performance, metrics, and errors is a core piece of the monitoring puzzle. One of the most powerful and widely used monitoring tools today is Prometheus, an open-source system designed to collect, store, and analyze metrics, and particularly well suited to modern cloud-native environments.
In this blog post, we’ll explore the fundamental aspects of Prometheus, focusing on its architecture and the different types of metrics it collects. This is the first post in a series that will dive deeper into setting up Prometheus, creating monitoring dashboards, and configuring alerts.
What is Prometheus?
Prometheus is a widely known open-source tool for monitoring and alerting. It was first developed at SoundCloud in 2012 and was accepted into the Cloud Native Computing Foundation (CNCF) in 2016. It works very well in dynamic, container-based environments, which is why it is so widely used in microservices, Kubernetes, and cloud-native systems.
With Prometheus you can collect metrics, query them with a rich query language called PromQL (Prometheus Query Language), and define alerts on specific thresholds. Its time-series data model makes it a natural fit for watching how system performance changes over time.
What Does Prometheus Monitor?
Prometheus is extremely flexible and can be used to track a vast range of components spread across many systems. Among the most important domains that Prometheus typically monitors are:
1 . Infrastructure and Hardware
CPU Usage : Prometheus monitors CPU load, usage percentage, and server temperature.
Memory Usage : It monitors available memory and used memory along with swap usage and memory usage for every process.
Disk Usage : Prometheus can monitor disk space, disk I/O, and the health of disks in your system.
Network Traffic : Monitor network throughput, packet loss, error rates, and other network-related metrics.
2 . Services and Applications
HTTP Requests : Prometheus can monitor web servers, track request rates, response times, and error rates.
Databases : It can keep track of database health, query performance, and connection counts for systems like PostgreSQL, MySQL, and others.
Queues and Jobs : For systems like message queues (RabbitMQ, Kafka), Prometheus can track the number of pending messages, processing times, and consumer behavior.
3 . Containerized and Cloud Environments
Docker and Kubernetes : Prometheus integrates well with Docker and Kubernetes to monitor container health, resource usage, pod status, and cluster performance.
Cloud Services : Using the appropriate exporters, Prometheus can monitor cloud services such as AWS, Google Cloud, and Azure, collecting metrics on CPU usage, storage, and network performance.
4 . Application Metrics
Custom Application Metrics : Developers expose specific application metrics with custom instrumentation, such as transaction rates, error counts, or performance metrics related to their application logic.
Java, Go, and Node.js Applications : Prometheus can collect JVM metrics for Java-based applications and work with Go or Node.js applications on specific performance metrics.
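To make the custom-instrumentation point above concrete, here is a minimal sketch using the Python prometheus_client library to expose a couple of hypothetical application metrics over HTTP. The metric names, labels, and port are assumptions made up for the example, not a prescribed convention.

```python
# A minimal sketch of custom application instrumentation with the Python
# prometheus_client library. Metric names, labels, and the port are
# hypothetical; adapt them to your own application.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Cumulative count of processed transactions, partitioned by outcome.
TRANSACTIONS = Counter(
    "app_transactions_total", "Total transactions processed", ["status"]
)
# Current number of in-flight requests (can go up and down).
IN_FLIGHT = Gauge("app_inflight_requests", "Requests currently being handled")

def handle_request():
    IN_FLIGHT.inc()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # simulate work
        TRANSACTIONS.labels(status="success").inc()
    except Exception:
        TRANSACTIONS.labels(status="error").inc()
        raise
    finally:
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes metrics at http://localhost:8000/metrics
    while True:
        handle_request()
```

Prometheus would then scrape the /metrics endpoint served on port 8000 like any other target.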
Prometheus Architecture
Image Credit : Prometheus Official Documentation
1 . Prometheus Server
The core of the Prometheus monitoring system is the server, which gathers metrics, stores them in its database, and serves queries on them afterwards. A full Prometheus server is composed of the retrieval process, the time series database (TSDB), and the HTTP server. Let us look at each one.
Retrieval : Prometheus gathers metrics from its targets using a pull model. It periodically scrapes data from endpoints exposed by exporters or instrumented applications, typically at /metrics. Targets range from system resources such as CPU and memory to application-specific data such as HTTP requests or errors. The scrape configuration defines how often Prometheus pulls the data.
Time Series Database (TSDB) : The Time Series Database is where Prometheus stores all the scraped metrics. It is a highly optimized, high-performance database for time-series data, i.e. sequences of data points indexed by time.
The key feature of Prometheus' data model is the use of time series. A time series is a specific metric, for example http_requests_total, identified by a set of labels, for example {method="GET", status="200"}. Each sample in the time series is a combination of a timestamp and a value, the metric value at that point in time.
Example of a Time Series:
http_requests_total{method="GET", status="200"} 1000
This line indicates that there have been 1,000 total HTTP requests with the following characteristics:
method="GET" : The requests used the GET HTTP method.
status="200" : The requests resulted in a 200 OK status, indicating successful responses.
HTTP Server : The HTTP Server provides access to the Prometheus web interface and API endpoints. This allows users to execute PromQL queries to retrieve metrics, visualize data, and monitor the status of Prometheus itself. The server also exposes Prometheus' own internal metrics for self-monitoring and provides API access for integration with other systems and dashboards.
Storage : Prometheus primarily uses local storage for its time-series data, and the choice between HDD (Hard Disk Drive) and SSD (Solid State Drive) depends on your performance and capacity needs.
2 . Service Discovery Layer
Prometheus can discover monitoring targets in several ways; two of the most commonly used mechanisms are:
Kubernetes Integration:
Automatically discovers and monitors k8s objects (pods, services, nodes)
Understands k8s labels and annotations for configuration
Dynamically updates as containers scale up/down or move between nodes
File-based Service Discovery (file_sd):
Reads target configurations from JSON or YAML files
Supports dynamic updates without Prometheus restart
Useful for static targets or custom integration scenarios
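To make the file-based discovery idea concrete, here is a hedged sketch that writes a target file in the JSON shape file_sd expects. The host addresses and label values are placeholders.

```python
# Sketch: generate a file_sd target file for Prometheus. The JSON structure
# (a list of {"targets": [...], "labels": {...}} objects) is what file_sd
# expects; the hosts and labels below are placeholders.
import json

targets = [
    {
        "targets": ["10.0.0.5:9100", "10.0.0.6:9100"],
        "labels": {"env": "production", "job": "node"},
    },
    {
        "targets": ["10.0.0.7:9187"],
        "labels": {"env": "production", "job": "postgres"},
    },
]

with open("targets.json", "w") as f:
    json.dump(targets, f, indent=2)

# Prometheus references this file from a scrape config's file_sd_configs
# section and picks up edits without a restart.
```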
3 . Pushgateway
Some short-lived jobs complete too quickly for Prometheus to scrape their metrics directly. These jobs push their metrics to the Pushgateway, which acts as a buffer; Prometheus then scrapes the Pushgateway to collect the metrics.
Designed specifically for short-lived jobs:
Batch jobs
Serverless functions
Cron jobs
Acts as a metrics cache:
Accepts pushed metrics from jobs
Retains last pushed values
Exposes metrics for Prometheus to scrape
Provides job completion tracking
Can persist metrics across restarts when configured with a persistence file
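Below is a rough sketch of how a short-lived batch job might push its metrics with the Python prometheus_client library. The gateway address, job name, and metric names are assumptions for the example.

```python
# Sketch: a short-lived batch job pushing metrics to a Pushgateway.
# The gateway address, job name, and metric names are placeholders.
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

registry = CollectorRegistry()

records_processed = Counter(
    "batch_records_processed_total", "Records processed by the batch job",
    registry=registry,
)
last_success = Gauge(
    "batch_last_success_timestamp_seconds",
    "Unix time of the last successful batch run",
    registry=registry,
)

def run_batch_job():
    for _ in range(1000):  # pretend to process 1000 records
        records_processed.inc()
    last_success.set_to_current_time()

if __name__ == "__main__":
    run_batch_job()
    # Push the final values; Prometheus later scrapes them from the gateway.
    push_to_gateway("localhost:9091", job="nightly_batch", registry=registry)
```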
4 . Targets
In Prometheus, targets are the endpoints or instances that expose metrics for Prometheus to collect (scrape). They are central to how Prometheus gathers the data used to monitor systems, services, and applications.
Jobs/Exporters:
These are long-running processes that expose metrics to Prometheus.
Exporters are special-purpose tools that expose metrics from various services (e.g., databases, OS metrics) in a format Prometheus understands.
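As a hedged sketch of how targets are registered with Prometheus, the snippet below generates a minimal prometheus.yml scrape configuration. The job names, endpoints, and interval are illustrative only.

```python
# Sketch: generate a minimal prometheus.yml scrape configuration.
# Requires PyYAML; job names, targets, and the interval are illustrative.
import yaml

config = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {
            "job_name": "node_exporter",
            "static_configs": [{"targets": ["localhost:9100"]}],
        },
        {
            "job_name": "my_app",
            "metrics_path": "/metrics",
            "static_configs": [{"targets": ["localhost:8000"]}],
        },
    ],
}

with open("prometheus.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```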
5 . Alert Manager
Prometheus can define alerting rules based on collected metrics. When a condition specified in these rules is met, it generates an alert.
These alerts are sent to the Alertmanager, which handles the logic of deduplication, grouping, and routing of alerts.
The Alertmanager can notify various channels such as PagerDuty, Email, and other integrations (e.g., Slack, SMS).
This ensures that the right people are notified when a specific alert condition is triggered.
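As one hedged example of what an alerting rule can look like, the sketch below writes a simple rule file; the expression, threshold, and labels are illustrative, and the resulting alerts would be routed by the Alertmanager.

```python
# Sketch: write a simple Prometheus alerting rule file. The PromQL
# expression, threshold, and labels are illustrative only.
ALERT_RULES = """\
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status="500"}[5m]) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx error rate on {{ $labels.instance }}"
"""

with open("alert_rules.yml", "w") as f:
    f.write(ALERT_RULES)
```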
6 . Visualization
Prometheus Web UI:
Prometheus has a built-in web interface that allows users to run queries, view metrics, and create basic visualizations.
This is useful for quick exploration and troubleshooting.
Grafana:
Grafana is a powerful dashboarding tool that integrates with Prometheus to provide more advanced visualizations.
Users can create complex and interactive dashboards, combining data from Prometheus and other sources.
API Clients:
Custom applications or scripts can query Prometheus data using its HTTP API.
This allows for integration with other systems, automated reporting, and custom visualizations or alerting mechanisms.
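For instance, a script might hit the HTTP API's instant-query endpoint with a PromQL expression. The sketch below assumes a Prometheus server at localhost:9090 and uses the requests library.

```python
# Sketch: query Prometheus over its HTTP API with a PromQL expression.
# Assumes a Prometheus server at localhost:9090; requires requests.
import requests

PROMETHEUS_URL = "http://localhost:9090"

def instant_query(expr: str):
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    # Per-second HTTP request rate over the last 5 minutes, by status code.
    for series in instant_query('sum by (status) (rate(http_requests_total[5m]))'):
        print(series["metric"], series["value"])
```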
Types of Metrics in Prometheus
Prometheus uses a powerful data model to collect and store various types of metrics. Each metric type has a specific purpose and helps monitor different aspects of a system or application. Here are the primary types of metrics in Prometheus:
1. Counters
Counters are metrics that represent a cumulative value that only increases or resets to zero. They are used to track counts over time, such as the number of requests served, tasks completed, or errors encountered. Since counters only increase, they are ideal for monitoring things that should be accumulated over time.
Use Cases:
Number of HTTP requests received.
Number of errors in a system.
Total number of processed jobs.
Example:
http_requests_total{status="200"} 1000
This metric shows that 1000 HTTP requests with a status code of 200 have been served.
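A counter like this takes only a few lines of instrumentation. The sketch below uses the Python client's Counter and is illustrative only.

```python
# Sketch: a Counter in the Python prometheus_client. Counters only go up
# (or reset to zero when the process restarts).
from prometheus_client import Counter

HTTP_REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])

def on_request(status_code: int) -> None:
    # Increment the series matching the response status, e.g. status="200".
    HTTP_REQUESTS.labels(status=str(status_code)).inc()
```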
2. Gauges
Gauges are metrics that represent a value that can go up and down. They are used to track values that fluctuate, such as memory usage, CPU usage, or temperature. Gauges are suitable for monitoring real-time values that can increase or decrease over time.
Use Cases:
Current memory usage.
Temperature of a server.
Number of active users.
Example:
memory_usage_bytes 204800
active_users 15
These metrics indicate that the current memory usage is 204800 bytes and that there are 15 active users.
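In the Python client, gauges can be set, incremented, or decremented. The hedged sketch below mirrors the example above with made-up values.

```python
# Sketch: Gauges in the Python prometheus_client. Values can move in
# either direction; the numbers here are made up.
from prometheus_client import Gauge

MEMORY_USAGE = Gauge("memory_usage_bytes", "Current memory usage in bytes")
ACTIVE_USERS = Gauge("active_users", "Number of currently active users")

MEMORY_USAGE.set(204800)   # set to an absolute reading
ACTIVE_USERS.inc()         # a user signs in
ACTIVE_USERS.dec()         # a user signs out
```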
3. Histograms
Histograms are metrics that allow you to observe the distribution of values over time. They are used to measure and bucket data points into configurable ranges, or buckets. Histograms are particularly useful for understanding the latency or duration of requests and the distribution of response sizes.
Use Cases:
Request durations (e.g., HTTP request latency).
Response size distributions.
Packet size in network monitoring.
Components of Histograms:
Count: Total number of observations.
Sum: Total sum of all observed values.
Buckets: Predefined ranges that count how many observations fall into each range.
Example:
http_request_duration_seconds_bucket{le="0.5"} 200
http_request_duration_seconds_bucket{le="1.0"} 350
http_request_duration_seconds_count 500
http_request_duration_seconds_sum 300
This histogram shows that 200 requests were completed within 0.5 seconds, 350 within 1 second, with a total of 500 requests having a cumulative duration of 300 seconds.
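Data like this can be produced by instrumenting request handling with a Histogram. In the sketch below, the bucket boundaries are assumptions chosen to roughly mirror the example.

```python
# Sketch: a Histogram in the Python prometheus_client. Bucket boundaries
# are chosen to mirror the example above and are otherwise arbitrary.
import time
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def handle_request():
    start = time.monotonic()
    time.sleep(0.2)  # simulate work
    # Each observation increments every bucket whose upper bound (le) it fits.
    REQUEST_DURATION.observe(time.monotonic() - start)
```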
4. Summaries
Summaries are similar to histograms but provide a different way of aggregating data. They focus on calculating configurable quantiles (percentiles) over a sliding time window. Summaries are useful for tracking latency, request durations, or any metric where percentiles are significant.
Use Cases:
Monitoring the 95th percentile of request latency.
Tracking the median response time.
Observing error rates over time.
Components of Summaries:
Count: Total number of observations.
Sum: Total sum of all observed values.
Quantiles: Specific percentiles (e.g., 0.5, 0.95) that provide insights into the distribution.
Example:
http_request_duration_seconds_sum 1200
http_request_duration_seconds_count 300
http_request_duration_seconds{quantile="0.5"} 0.2
http_request_duration_seconds{quantile="0.95"} 1.5
This summary indicates that the median request duration is 0.2 seconds, and 95% of the requests were completed within 1.5 seconds.
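A Summary is declared much like a Histogram, as in the sketch below. One caveat worth flagging: the official Python client's Summary exposes only the count and sum, not quantiles, so quantile series like those in the example above typically come from clients in other languages.

```python
# Sketch: a Summary in the Python prometheus_client. Note: the Python
# client's Summary exposes only _count and _sum, not quantiles.
import time
from prometheus_client import Summary

REQUEST_DURATION = Summary(
    "http_request_duration_seconds", "HTTP request latency in seconds"
)

@REQUEST_DURATION.time()   # observe how long the decorated function takes
def handle_request():
    time.sleep(0.2)  # simulate work
```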
Choosing the Right Metric Type
Selecting the appropriate metric type depends on the nature of the data and the kind of analysis you want to perform:
Counters are ideal for accumulating events over time.
Gauges are best for monitoring fluctuating values.
Histograms are suited for analyzing distributions and ranges.
Summaries are great for observing percentiles and quantiles.
Each metric type provides unique insights and can be used in combination to get a comprehensive view of system performance.
In the next part of this series, we will dive into the practical implementation of Prometheus. This will include setting up Prometheus, configuring exporters, and visualizing metrics with Grafana. Stay tuned for a hands-on approach to mastering Prometheus monitoring.