Prometheus collects metrics from your infrastructure, and Grafana visualizes them in real-time dashboards. Together, they form the industry-standard open-source monitoring solution used by teams managing everything from microservices to Kubernetes clusters. This guide walks you through a production-ready setup in under an hour.
Most DevOps teams start with basic monitoring—maybe a few log files and manual checks. It doesn't scale. You need automatic metric collection, long-term storage, and visual insights into system behavior. That's where Prometheus steps in. It scrapes metrics from applications and infrastructure at regular intervals, stores them efficiently, and lets you query them with PromQL.
Grafana does something different: it's a visualization layer. You point it at Prometheus as a data source, and suddenly you're building dashboards that update in real-time. You can see CPU usage, memory consumption, request latency, and custom application metrics all in one place.
The combination is powerful because they're built to work together. Prometheus handles the heavy lifting of collection and storage. Grafana handles the presentation and alerting. Neither depends on the other, so you can swap components out if needed.
Before you start, you'll need:
The architecture is straightforward: Prometheus scrapes metrics from exporters and applications, stores them in its time-series database, Grafana queries Prometheus for data, and you access Grafana's web UI to view dashboards. You'll also want a Node Exporter running on servers you want to monitor—it's a small agent that exposes system metrics.
Start by downloading the latest Prometheus binary. At the time of writing, v2.53+ is recommended. Visit the official Prometheus download page to grab the latest version.
cd /opt
sudo wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
sudo tar xvfz prometheus-2.53.0.linux-amd64.tar.gz
sudo mv prometheus-2.53.0.linux-amd64 prometheus
sudo chown -R nobody:nogroup /opt/prometheus
Next, create a systemd service file so Prometheus runs automatically:
sudo tee /etc/systemd/system/prometheus.service > /dev/null <
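As a minimal sketch of what such a unit file might contain, assuming the /opt/prometheus layout and the nobody:nogroup ownership from the previous step: the data directory (/opt/prometheus/data) and the Restart policy below are assumptions, so adjust them to your environment. The block writes to a local file first so you can review it before installing.

```shell
# Sketch only: write the unit to a local file for review before installing it.
# Assumed (not from the guide): the data directory and the Restart policy.
cat > prometheus.service <<'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Install it once it looks right:
# sudo install -m 644 prometheus.service /etc/systemd/system/prometheus.service
```

Running as an unprivileged user keeps the blast radius small; a dedicated prometheus user would be a reasonable alternative to nobody.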
Before starting the service, you need to configure Prometheus. The main configuration file is prometheus.yml. Let's create a basic one that scrapes itself and a Node Exporter:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
Save this to /opt/prometheus/prometheus.yml. Now enable and start the service:
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
Prometheus should now be running on http://localhost:9090. You can see the status page, query metrics, and check what's being scraped.
Node Exporter exposes system metrics like CPU, memory, disk, and network usage. Install it on any host you want to monitor. Here's the quickest way:
cd /opt
sudo wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
sudo tar xvfz node_exporter-1.8.0.linux-amd64.tar.gz
sudo mv node_exporter-1.8.0.linux-amd64 node_exporter
sudo chown -R nobody:nogroup /opt/node_exporter
Create a systemd service for Node Exporter:
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <
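A unit file for Node Exporter can be very small. The following is a sketch under the same assumptions as the install step (binary in /opt/node_exporter, owned by nobody:nogroup); the Restart policy is an assumption.

```shell
# Sketch only: write the unit to a local file for review before installing it.
# Assumed (not from the guide): the Restart policy.
cat > node_exporter.service <<'EOF'
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
ExecStart=/opt/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Install it once it looks right:
# sudo install -m 644 node_exporter.service /etc/systemd/system/node_exporter.service
```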
Enable and start it:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Verify it's working by visiting http://localhost:9100/metrics. You'll see hundreds of system metrics in Prometheus exposition format. The node scrape job is already in the Prometheus config from earlier, so the target should switch to UP on the Status → Targets page within one scrape interval. If you are adding the job only now, restart Prometheus (sudo systemctl restart prometheus) so it picks up the new target.
Grafana's installation varies by platform. On Ubuntu, the easiest approach is using the official repository:
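sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana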
Start Grafana:
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server
Grafana runs on port 3000. Open http://localhost:3000 in your browser. The default credentials are admin/admin. You'll be prompted to change the password on first login—do that immediately.
Now add Prometheus as a data source. Go to Configuration → Data Sources → Add Data Source (in recent Grafana versions the same page lives under Connections → Data sources). Select Prometheus. Set the URL to http://localhost:9090. Click Save & Test. If everything's connected, you'll see a green message confirming the link works.
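If you'd rather keep the data source in version control than click through the UI, Grafana can also load it from a provisioning file at startup. Here's a minimal sketch; the file name is arbitrary, and the target directory is the default for the deb package install, so treat the path as an assumption if you installed differently.

```shell
# Sketch: describe the Prometheus data source as a provisioning file.
# The file name is arbitrary; the URL matches the one entered in the UI.
cat > prometheus-datasource.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF

# Assumed default provisioning path for the deb package install:
# sudo cp prometheus-datasource.yaml /etc/grafana/provisioning/datasources/
# sudo systemctl restart grafana-server
```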
With Prometheus feeding data into Grafana, you're ready to build dashboards. Let's start simple: create a panel showing CPU usage.
Click the + icon in the sidebar and select Dashboard. Add a new panel. In the query editor, enter a PromQL query. Here's a useful one for CPU:
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
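This computes the percentage of CPU time spent in non-idle modes, averaged across all cores of each instance. Note that irate only looks at the last two samples inside the 5-minute window, so the graph is responsive but spiky; swap in rate if you want a smoothed 5-minute average. Name the panel "CPU Usage", set the units to percent, and save. You now have a working dashboard panel.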
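To make the arithmetic behind that query concrete: the per-second increase of the idle counter is the fraction of time a core spent idle, so subtracting it from 100% gives the busy share. The sample values below are invented for illustration.

```shell
# Hypothetical samples of node_cpu_seconds_total{mode="idle"} for one core,
# taken 15 seconds apart (values invented for illustration).
idle_prev=1000.0
idle_now=1012.0
interval_seconds=15

# Per-second increase of the idle counter = fraction of time spent idle.
cpu_busy_pct=$(awk -v a="$idle_prev" -v b="$idle_now" -v s="$interval_seconds" \
  'BEGIN { printf "%.1f", 100 - ((b - a) / s) * 100 }')

echo "CPU busy: ${cpu_busy_pct}%"
```

Here the core was idle for 12 of the 15 seconds, so the busy share works out to 20%.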
Want to add more panels? Try memory usage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Or disk usage:
(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"})) * 100
Grafana has thousands of pre-built dashboards available on Grafana's dashboard repository. You can import them by ID in seconds. Dashboard 1860 (Node Exporter Full) is particularly popular for system monitoring—it's comprehensive and well-maintained.
Metrics alone aren't enough. You need to be notified when something's wrong. Prometheus and Grafana both support alerting, though they work differently.
In Prometheus, create alert rules by adding a rules file. Create /opt/prometheus/rules.yml:
groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"
Update prometheus.yml to include these rules:
rule_files:
  - "rules.yml"
Restart Prometheus (sudo systemctl restart prometheus) to load the rules. They'll appear on the Alerts page of the web UI. To actually get notified, you need an Alertmanager instance. For now, focus on getting the alerts firing correctly in the UI.
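For reference, here's a sketch of how the full config might look once the rules file is wired in, assuming the two scrape jobs from earlier. rule_files is a top-level key, and a relative path like rules.yml is resolved against the directory of the config file. The output file name is deliberately different so the block can't overwrite a live prometheus.yml.

```shell
# Sketch: the complete config after adding the rules file.
# Written under a different name so it cannot clobber a live prometheus.yml.
cat > prometheus.with-rules.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Top-level key; relative paths resolve against this file's directory.
rule_files:
  - "rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF

# promtool ships in the same tarball as the prometheus binary:
# /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml
# /opt/prometheus/promtool check rules /opt/prometheus/rules.yml
```

Running the two promtool checks before a restart catches YAML and PromQL mistakes without taking the server down.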
If you prefer containerization, here's a docker-compose.yml that brings everything up in one command:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: always
  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    restart: always
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
    restart: always
volumes:
  prometheus_data:
  grafana_data:
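Run docker-compose up -d (or docker compose up -d with Compose v2) and everything starts together. Inside the Compose network, containers reach each other by service name, so the node target in prometheus.yml must be node_exporter:9100 rather than localhost:9100, and the Grafana data source URL becomes http://prometheus:9090. This is great for testing and development. For production, pin image versions instead of latest, set a real admin password, and plan your storage.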
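The Compose file mounts ./prometheus.yml into the container, so that file has to exist next to docker-compose.yml. A minimal sketch follows, assuming the service names from the Compose file above; the only real difference from the bare-metal config is the node target.

```shell
# Sketch: prometheus.yml for the Compose setup, saved next to docker-compose.yml.
# "node_exporter" is the Compose service name, resolved by Docker's internal DNS.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']   # Prometheus scraping itself, inside its own container
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
EOF
```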
Once you're up and running, here's what separates a hobby setup from a production-grade monitoring stack:
Retention Policy: Prometheus keeps metrics in memory and on disk. By default, it retains 15 days of data. For production, decide based on your needs. High-traffic systems might drop this to 7 days to save disk space. Use --storage.tsdb.retention.time=7d when starting Prometheus.
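To pick a retention value, it helps to estimate disk usage first. The Prometheus storage docs give a rough formula: retention in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes after compression). The sketch below applies it; the series count and bytes-per-sample figure are illustrative assumptions, not measurements from a real system.

```shell
# Rough disk estimate: retention_seconds * samples_per_second * bytes_per_sample.
# All inputs are illustrative assumptions, not measurements.
estimate_tsdb_bytes() {
  retention_days=$1; active_series=$2; scrape_interval_s=$3; bytes_per_sample=$4
  echo $(( retention_days * 86400 * active_series / scrape_interval_s * bytes_per_sample ))
}

# 10,000 active series scraped every 15s at 2 bytes per sample.
bytes_7d=$(estimate_tsdb_bytes 7 10000 15 2)
bytes_15d=$(estimate_tsdb_bytes 15 10000 15 2)
echo "7d:  $(( bytes_7d / 1000000 )) MB"
echo "15d: $(( bytes_15d / 1000000 )) MB"
```

With these inputs the estimate is about 806 MB for 7 days and 1,728 MB for 15 days. Whatever value you settle on goes into the ExecStart line of the service as --storage.tsdb.retention.time.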
Remote Storage: Local storage doesn't scale forever. For long-term retention, consider remote storage backends like Thanos, Cortex, or VictoriaMetrics. They're designed to handle very large metric volumes and retention periods well beyond what a single Prometheus server is meant for.
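Scrape Interval: The 15-second scrape interval used in this guide is fine for most setups (Prometheus's built-in default, if you don't set one, is 1 minute). Lower it to 5 seconds only if you need high granularity and can handle the storage overhead: a 5-second interval produces three times as many samples as a 15-second one.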
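You don't have to lower the interval globally. As a sketch, a single job can override it while everything else keeps the global value; the job name and target below are hypothetical placeholders.

```shell
# Sketch: override the interval for one job only; other jobs keep the global 15s.
# The job name and target below are hypothetical.
cat > scrape-interval-fragment.yml <<'EOF'
scrape_configs:
  - job_name: 'latency-sensitive-app'   # hypothetical job
    scrape_interval: 5s                 # per-job override of global.scrape_interval
    static_configs:
      - targets: ['localhost:8080']     # hypothetical target
EOF
```

Merge the job entry into the scrape_configs list of your existing prometheus.yml rather than loading the fragment on its own.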