Setting Up Grafana to Monitor Servers, VMs, ESXi, UPS, FreeNAS and More

I recently posted my own Grafana dashboard on reddit and was met with many great questions about how to do some of the things I did. I found a lot of good resources that helped me out along the way, but I had to figure some of it out on my own, so I wanted to share the knowledge I’ve acquired with everyone else who may run into some of the same issues. You may have found some of these other resources, and maybe they don’t fit your scenario, or maybe you’re looking for more of a step-by-step process. Let’s jump in.

 

There are several pieces to the puzzle when setting up Grafana to monitor your homelab:

 

  • collectd: This is a daemon that collects various system metrics. It’s been around for a long time. It is very fast, and very useful, especially if you’re running FreeNAS.
  • Telegraf: Like collectd, it collects many of the same metrics. It’s also very fast, and extremely easy to set up. We’ll be using this on almost all of our servers.
  • SNMP: Originally designed to be a (simple) network management protocol, it can be used to report all sorts of metrics. We’ll get into the gritty details later in the ESXi section. It works very differently from collectd and Telegraf, but it’s just as useful.
  • Graphite: A time-series database for collected data. We won’t be installing Graphite itself, but we’ll be using InfluxDB’s graphite support to store the data from collectd.
  • InfluxDB: Another time-series database. This is the central location for all of our collected metrics.
  • Grafana: Data is great, but we need something to make it look pretty! That’s where Grafana comes in. You can create custom dashboards to show the data that’s most important to you.

 

Initial Setup

There are many ways to set this up, and this tutorial should have enough information to cover most of them, but you may need to modify things to fit your needs. If you’re running ESXi, I highly recommend creating individual Ubuntu Server VMs for Grafana and InfluxDB, but you can do it all on one VM if you prefer. If you’re using Proxmox, I’d recommend using containers for these. For example, here is my server configuration:

 

  • {subnet prefix}: 10.0.0.0/8
  • {ESXi IP}: 10.0.0.10
    • {grafana IP}: 10.0.0.23
    • {database IP}: 10.0.0.28

 

Your setup may be much different, so I’ll refer to these addresses by their tags from here on. Simply replace each “tag” with the IP of your server or your subnet prefix. My VMs are running Ubuntu Server 16.04, which is probably your best bet for compatibility as of this writing (May 2018). 18.04 may be supported, but I haven’t tested any of this on that version.

 

InfluxDB

Connect to your database server (through SSH or other means), and let’s get InfluxDB set up.

 

Installation

These instructions set up InfluxDB on an Ubuntu Server VM. If you’re using a different distro, follow the instructions here.

 

Add the Repository
curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
source /etc/lsb-release
echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list

 

Install and Start InfluxDB
sudo apt update && sudo apt install influxdb
sudo systemctl start influxdb

 

Allow Access through Firewall
sudo ufw allow from {subnet prefix} to any port 8086

 

Configuring Databases and Users

 

Let’s bring up the InfluxDB CLI:

influx

 

We need to do a few things here. First, we need to create a database for Telegraf to use. Then we need to create an authenticated user for InfluxDB, and users for Telegraf and Grafana to be able to access this data.

create database telegraf
create user "admin" with password 'yourpassword' with all privileges
create user "grafana" with password 'yourpassword'
create user "telegraf" with password 'yourpassword'

 

The grafana user will need to read the Telegraf data, and the telegraf user will need to write the data, so let’s set up the permissions accordingly:

grant read on "telegraf" to "grafana"
grant write on "telegraf" to "telegraf"
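
If you want to double-check the users and permissions you just created, you can list them from the same CLI:

show users
show grants for "grafana"
show grants for "telegraf"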

 

We’re done here.

exit

 

InfluxDB Configuration

 

Now we need to enable HTTP authentication in InfluxDB. Use your favorite editor, and edit /etc/influxdb/influxdb.conf. Example:

sudo vim /etc/influxdb/influxdb.conf

 

We want to find the [http] section and enable authentication. Look for auth-enabled in that section and set it to true:

auth-enabled = true
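
For reference, the relevant part of the [http] section should end up looking roughly like this; the commented lines are the defaults, and only auth-enabled needs to change:

[http]
  # enabled = true
  # bind-address = ":8086"
  auth-enabled = true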

 

Save the file, then restart InfluxDB:

sudo systemctl restart influxdb
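
If you want to verify that authentication is actually being enforced, you can hit the HTTP API directly from the database server. The first request below should be rejected, and the second (with the admin credentials you created) should return your databases:

curl -G http://localhost:8086/query --data-urlencode "q=SHOW DATABASES"
curl -G http://localhost:8086/query -u admin:yourpassword --data-urlencode "q=SHOW DATABASES"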

 

When you need to access the InfluxDB CLI in the future, you can authenticate in one of two ways:

influx -username admin -password yourpassword

or, the method I prefer, which keeps the password off the command line (and out of your shell history) by prompting you for a username and password:

influx
auth

 

That’s it! InfluxDB is now ready to go. Let’s start getting some data in there.

 

Telegraf

With InfluxDB set up and our telegraf database created, now we need to install Telegraf to send metrics (or “measurements”) to InfluxDB.

 

Installing Telegraf

Telegraf uses the same repository as InfluxDB, so if you’re installing this on the InfluxDB server, you don’t need to add the repository. For other servers, go ahead and add it:

curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
source /etc/lsb-release
echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list

 

You want to complete this entire process for your database server and every server that you want to collect metrics on, except ESXi, pfSense, and FreeNAS. So, if you’re running 6 Ubuntu Server VMs for different services, complete this process on all 6 VMs.

 

Install
sudo apt update && sudo apt install telegraf

 

Configure

Open /etc/telegraf/telegraf.conf for editing. Find [[outputs.influxdb]] and add (or edit) the following lines:

urls = ["http://{database IP}:8086"]
database = "telegraf"
username = "telegraf"
password = "yourpassword"

 

If you’re configuring Telegraf on your database server, also find [[inputs.net]] in the config. It’s commented out with a # by default; remove the # to uncomment it. This input collects per-interface network metrics, which are especially worth having on the database server since all of the other servers will be sending their metrics to it.
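
Once uncommented, that section can be as simple as the header itself; with interfaces left commented out, Telegraf gathers stats from every active, non-loopback interface:

[[inputs.net]]
  # interfaces = ["eth0"]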

 

If you’re going to be collecting data via SNMP for ESXi, you may want to head over to the ESXi section now to add the values to Telegraf, and return here when you’re done.

 

We’re finished configuring Telegraf. Save the configuration and close it.

 

Restart the Service
sudo systemctl restart telegraf
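
Before (or after) restarting, it can also be handy to have Telegraf do a single test collection and print the results to the terminal, which quickly catches typos in the config:

telegraf --config /etc/telegraf/telegraf.conf --test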

 

Testing it Out

We’ve finally reached a point where we can see if things are working. Let’s check it out:

influx
use telegraf
show measurements

 

Remember, the first command gets us into the InfluxDB CLI. You will need to authenticate with one of the two methods I mentioned earlier. use telegraf tells influx you’re working with the telegraf database. show measurements shows what measurements are available in the database. You should see several here, including cpu, disk, mem, system, and many others. If these show up, we’re ready to move on to configuring Grafana!
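
You can also peek at a few raw data points for one of those measurements while you’re in the CLI, for example:

select * from cpu limit 5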

 

Grafana

Finally, some fun! We’ll get to see our data presented beautifully in graphs when we’re done here. If you’d prefer to add ESXi, pfSense, or FreeNAS metrics before configuring Grafana, you can feel free to skip this section and come back to it later.

 

Installation

This guide will show you how to install and configure Grafana on an Ubuntu Server 16.04+ installation. Grafana is available for many other platforms. If you’re installing on something else, visit this link.

 

Add the Repository and Key

Connect to your Grafana server. Edit /etc/apt/sources.list and add this line to the bottom of the file:

deb https://packagecloud.io/grafana/stable/debian/ stretch main

 

Save the file and close it. Now we need to add the key:

curl https://packagecloud.io/gpg.key | sudo apt-key add -

 

Install Grafana

We’re ready to install Grafana. Run the following:

sudo apt update && sudo apt install grafana

 

Let’s set up our server so that Grafana starts at boot.

sudo systemctl daemon-reload
sudo systemctl enable grafana-server.service

 

As we did with InfluxDB, and assuming your Grafana installation is on a different machine, add the UFW exception to allow local network access:

sudo ufw allow from {subnet prefix} to any port 3000

 

Start Grafana!

sudo systemctl start grafana-server

 

Working With the Grafana Web UI

Open up a web browser and navigate to the following address:

http://{grafana IP}:3000

 

You’ll be taken to the login page, which will look something like this:

 

 

Log in with the default credentials of admin/admin.

 

Add a Data Source for Telegraf

Set the name as “telegraf”, and the type as “InfluxDB”. Make sure the URL points to your InfluxDB server at http://{database IP}:8086. In the details, set the database as “telegraf”, and the user as “grafana” with the password you set up for the “grafana” user in the database.

 

When finished, click “Save & Test”. You should see the “Data source is working” popup above the button.

 

 

Creating Our Dashboard and First Panel

Click the + sign on the left side toolbar and create a new dashboard. We’re just going to create a simple singlestat panel to make sure our data is working, and as a primer for how Grafana works.

 

Click on Singlestat to add the panel.

 

Click on the panel title and click Edit:

 

Since we just want a simple panel, let’s build a basic query first: select the total field from the mem measurement, where host is your database server.
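
Under the hood, Grafana’s query builder generates InfluxQL that looks roughly like this (assuming your database VM shows up in the host dropdown as "database"; use whatever hostname yours reports):

SELECT mean("total") FROM "mem" WHERE ("host" = 'database') AND $timeFilter GROUP BY time($__interval) fill(null)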

Awesome, I have 1040351232 total mems. That sounds cool, but let’s change the format so it can be read a little better. Click the Options tab, and change the unit to “bytes”:

 

Go to the General tab and set the title to “Database Total RAM”

 

In the top right-hand corner, click the back button to go back to the dashboard. Spiffy! We now have a singlestat panel that shows the installed RAM on our database server.

 

That’s all that’s needed for now to install and configure Grafana. Check out my other post (Link coming soon!) for a more in-depth guide on how to configure and use Grafana panels.

 

Let’s add some metrics for our other servers.

 

pfSense

Good news! This one is a quick setup, and it uses Telegraf. Log in to your pfSense web configuration, and head over to System > Package Manager. Look at your Installed Packages and see if Telegraf is installed. If not, click the Available Packages tab and install it.

 

Configuration

Now that we have the Telegraf package in pfSense, head up to the menu and click Services > Telegraf. Make sure you enable the service, set the output to InfluxDB, point it to your database server IP, and use the telegraf credentials for InfluxDB.

 

Click Save. The Telegraf service is now running on pfSense. You can configure Grafana panels with the same measurements as your other servers. See? I told you that one would be quick!

 

FreeNAS

This one is pretty easy to set up as well, assuming you’re on FreeNAS 11+. I don’t know if it’s the exact same process on previous versions, as you may have to install a graphite plugin.

 

Configuration

First, we need to let our InfluxDB service know that it’s going to be collecting Graphite data. Luckily, this functionality is built-in to InfluxDB, but it’s not enabled by default.

 

Connect to your database server, and open /etc/influxdb/influxdb.conf for editing. Find the [[graphite]] tag, uncomment the enabled = false line, and set it to true:

[[graphite]]
   enabled = true
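
If you scroll a bit further in the [[graphite]] block, you’ll see the rest of the settings commented out at their defaults. The two that matter here are the database name and the listener port, which roughly look like this:

  # database = "graphite"
  # bind-address = ":2003"
  # protocol = "tcp"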

 

Save and close the file, then restart the InfluxDB service.

sudo systemctl restart influxdb
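
If you enabled UFW on the database server earlier, you’ll also want to allow your local network to reach the graphite listener port (2003 by default), otherwise FreeNAS won’t be able to send anything:

sudo ufw allow from {subnet prefix} to any port 2003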

 

Creating the Database

Now we need to create a database for graphite data. Open the InfluxDB CLI and auth in:

influx
auth

 

Create the database and give grafana read access:

create database graphite
grant read on "graphite" to "grafana"

 

Exit the CLI.

exit

 

FreeNAS Graphite Configuration

Log in to your FreeNAS web configuration UI, click System, then Advanced. Scroll down to the bottom of the Advanced settings, and set the Remote Graphite Server Hostname to {database IP}. I recommend you click the checkbox to “Report CPU in percentage” as well, but that’s completely up to you. FreeNAS reports CPU usage in “jiffies”. If you don’t know what those are, or if you’re unsure, check the box.

 

 

Click Save.

 

Tying Into Grafana

Now head over to your Grafana web UI at http://{grafana IP}:3000 and add a new datasource.

 

Let’s call the datasource “graphite”. Even though there is a type of “Graphite”, do not choose it; we still want the InfluxDB type, pointed at our InfluxDB IP with the same 8086 port we used for our telegraf datasource.

 

Set the database as graphite, and the user as the grafana user in InfluxDB.

 

That’s it for setup! But…

 

There’s a catch.

 

FreeNAS uses collectd, and they use a LOT of metrics from it. This is great, because you can monitor damn near everything on the system, but it’s also frustrating, because Grafana won’t show you all of them in the measurement dropdown when configuring a panel. The official collectd documentation can be helpful, but these are the basics to find what you’re looking for:

  • Each measurement is usually in the following format: servers.[servername].[category].[subcategory](.[subcategory])
  • If you’re looking for a particular measurement that isn’t in the dropdown, start typing “servers.[freenas-host].[category]”. For example: “servers.freenas_local.cpu” would show you measurements starting with “cpu”.
  • Beware that some measurements aren’t easily found this way. An example would be disk stats, many of which are under the geom_stat category. Sometimes I had good luck just typing a single letter and seeing what was in there.

 

So, let’s look at what to do in Grafana to get our FreeNAS measurements, and take a peek at an example.

 

When you create a new panel, you’ll need to set the data source to “graphite”, unless you already set this as default:

 

This example shows how to get disk write latency from 5 disks from FreeNAS:
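
In raw InfluxQL form, the first query looks roughly like the one below. The geom_stat metric path is illustrative, so use the typing trick above to find the exact names on your FreeNAS box:

SELECT mean("value") FROM "servers.freenas_local.geom_stat.geom_latency-da0.write" WHERE $timeFilter GROUP BY time($__interval) fill(null)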

 

After creating the first query, I made 4 copies of it and changed only the disk number in each. Together they generate the beautiful graph below:

 

OK, so maybe I shouldn’t have said it was “easy”, but the initial setup isn’t bad. Now you have FreeNAS stats in Grafana.

 

ESXi

This one probably gave me the most headache. It’s not that it’s difficult, but finding information about it can be an exercise in futility. Hopefully this provides great value to anyone who visits this page, because I sure could have used it when I started.

 

Let’s begin with what’s covered elsewhere, then we’ll get to the goodies.

 

Setting up SNMP on ESXi

SSH into your ESXi server and run these commands:

esxcli system snmp set --communities public
esxcli system snmp set --enable true

 

The public community doesn’t have to be used. You can use private, or any other string you’d like. The reason for using public is that Telegraf is already configured to use public by default. If your ESXi server is open to the web (it shouldn’t be), set this community to something else.

 

Set the firewall rules to allow SNMP:

esxcli network firewall ruleset set --ruleset-id snmp --allowed-all false
esxcli network firewall ruleset allowedip add --ruleset-id snmp --ip-address {subnet prefix}
esxcli network firewall ruleset set --ruleset-id snmp --enabled true

 

You can restart the service by issuing the following command (or do this in the next step):

/etc/init.d/snmpd restart

 

Now that SNMP is set up for ESXi, log in to your ESXi web interface and under Host, navigate to Manage, then Services. Scroll down to snmpd, select it, and click Restart (if you didn’t do it through SSH). Then right click, and make sure the policy is set so that the service starts and stops with the host.
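
Before wiring this into Telegraf, it’s worth a quick sanity check from the database server to make sure ESXi is answering SNMP queries. Assuming you have the command-line SNMP tools installed (sudo apt install snmp), walking the system subtree should return a handful of values:

snmpwalk -v2c -c public {ESXi IP} 1.3.6.1.2.1.1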

 

Configuring Telegraf for ESXi SNMP

Log in to your database server and open /etc/telegraf/telegraf.conf for editing. Find [[inputs.snmp]] and uncomment it, along with the agents line below it and the community line below that. Then set agents to your ESXi IP on port 161, and community to the string you specified above:

[[inputs.snmp]]
  agents = [ "{ESXi IP}:161" ]
  community = "public"

 

Go down a bit in the file and find where the measurements are. This will usually be indicated by a line like this:

# ## measurement name

 

Just below that line, let’s add the name of the system we’re using for the upcoming SNMP fields. How about “esxi”?

name = "esxi"

Below this, you will need to add the metrics for ESXi that you’d like to monitor over SNMP. These take OID values from a MIB. I won’t get into too much detail about MIBs or OIDs, but basically, they identify the specific metric you’re pulling from SNMP. For example, the OID for uptime in ESXi is .1.3.6.1.2.1.25.1.1.0. It returns [ms of uptime]/10 because the value is an SNMP TimeTicks counter, which counts in hundredths of a second.

 

Crazy Time!

This is where I think you will get the most value, as I couldn’t find information about this anywhere myself. This is some cool stuff ahead!

 

Over at the zabbix forums, a user posted a handy OID list for ESXi. This is important to find your metrics, unless you only want the ones I’ve included below. To browse the values of these, and for other MIB features, I use a program called qtmib for Ubuntu, and ManageEngine MIB Browser for Windows. You will need one of these, or a tool like them, to go further. For this guide, I’ll be using qtmib for demonstration, but the information is easily translated to ManageEngine if you’re using Windows.

 

Let’s Start Grabbing Data

Open up your preferred MIB browser from above and let’s get started. We’ll begin with CPU stats. On the OID list, we can see there is a value for HOST-RESOURCES-MIB::hrProcessorLoad.1, which reports the CPU load in percentage for CPU 1. There’s also a .2, but what do we do if we have more than 2 CPUs in ESXi? Well, the OID for hrProcessorLoad.1 is .1.3.6.1.2.1.25.3.3.1.2.1, so let’s take the root of that, .1.3.6.1.2.1.25.3.3.1.2.0, put it into our MIB browser, and execute a Bulk Get.

 

 

As you can see, there are 8 values under .1.3.6.1.2.1.25.3.3.1.2.x. qtmib grabbed 10 values in bulk and updated the OID field to the last one in the sequence. (When an OID starts with .1, it seems to be synonymous with iso; you can use either .1 or iso.) Having 8 values means we have 8 logical CPUs: ESXi reports each thread, not just each core, and this is a quad-core, hyper-threaded Xeon, so that makes sense. We’re going to need .1.3.6.1.2.1.25.3.3.1.2.1 through .1.3.6.1.2.1.25.3.3.1.2.8. If you have more CPUs than the default bulk size of 10, change your Bulk Get settings, or run Bulk Get again once the OID in the MIB browser updates to the last value. You can also run an SNMP Walk, which does what it sounds like: it starts at the value you provide and sequentially “walks” through the OIDs, providing their values.
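
If you’d rather do this from the command line instead of a MIB browser, the same walk works with snmpwalk from the database server; every OID returned under this subtree is one logical CPU:

snmpwalk -v2c -c public {ESXi IP} 1.3.6.1.2.1.25.3.3.1.2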

 

That was easy! Let’s try getting our total RAM installed, and RAM used. ESXi reports total RAM in hrMemorySize.0 (.1.3.6.1.2.1.25.2.2.0), but we won’t be using that. It also reports RAM metrics in hrStorage values. Let’s find out what we have. In the OID list, there’s an OID of .1.3.6.1.2.1.25.2.3.1.3.1 for HOST-RESOURCES-MIB::hrStorageDescr.1. We want the descriptions of all of our storage options, so let’s take the root of that, and plug it into the MIB browser and run a Bulk Get again. Here’s what mine returned:

iso.3.6.1.2.1.25.2.3.1.3.1 = STRING: "/vmfs/volumes/589d3023-f84baacc-e60c-0cc47ae1ff40"
iso.3.6.1.2.1.25.2.3.1.3.2 = STRING: "/vmfs/volumes/0195ab76-8368c3bc-e91e-90ef1b2635e8"
iso.3.6.1.2.1.25.2.3.1.3.3 = STRING: "/vmfs/volumes/589ca53d-eda88052-71f3-0cc47ae1ff40"
iso.3.6.1.2.1.25.2.3.1.3.4 = STRING: "/vmfs/volumes/1197dd72-970c8709-ee7e-9aa406489801"
iso.3.6.1.2.1.25.2.3.1.3.5 = STRING: "/vmfs/volumes/589d302c-cdfbf3e2-6be3-0cc47ae1ff40"
iso.3.6.1.2.1.25.2.3.1.3.6 = STRING: "/vmfs/volumes/91fbe77c-72f36e78-a5e0-b76048a0b702"
iso.3.6.1.2.1.25.2.3.1.3.7 = STRING: "/vmfs/volumes/0563db71-543ade94-49f3-2c0cde7cd562"
iso.3.6.1.2.1.25.2.3.1.3.8 = STRING: "/vmfs/volumes/589ca548-c90a211a-e157-0cc47ae1ff40"
iso.3.6.1.2.1.25.2.3.1.3.9 = STRING: "/vmfs/volumes/589d3387-f09d9e1a-54a3-0cc47ae1ff40"
iso.3.6.1.2.1.25.2.3.1.3.10 = STRING: "/vmfs/volumes/5aa58afd-a74e6b98-3f62-0cc47ae1ff40"

 

I don’t see anything about RAM there; that must be other storage in ESXi. Let’s run it again, starting from the last OID returned (the one ending in .10). Here’s what came back:

iso.3.6.1.2.1.25.2.3.1.3.11 = STRING: "/vmfs/volumes/software"
iso.3.6.1.2.1.25.2.3.1.3.12 = STRING: "Real Memory"
...

 

There it is! ESXi reports RAM as storage called “Real Memory”. Jot down that last number (12 in this case). You’re going to use that to get values.

 

We’re going to want to use hrStorageSize and hrStorageUsed to get our memory values. Let’s take a look at the first OIDs for those two on the OID list:

HOST-RESOURCES-MIB::hrStorageSize.1 .1.3.6.1.2.1.25.2.3.1.5.1
HOST-RESOURCES-MIB::hrStorageUsed.1 .1.3.6.1.2.1.25.2.3.1.6.1

 

We know that’s the first index. All we need to do is change the .1 to a .12 to get our memory values. Let’s try it out:

 

Given there are 32 GiB of RAM installed (31.8 GiB available to ESXi), these numbers look accurate, as almost all of my RAM is allocated to my VMs. Note that the numbers reported are in KiB, not bytes; you’ll need to know this to display them properly in Grafana (for example, by choosing a kibibyte-based unit instead of bytes).

 

So, using the knowledge we’ve acquired throughout this section, let’s go back to editing our telegraf.conf, and plug these SNMP values in:

 

[[inputs.snmp]]
 agents = [ "10.0.0.10:161" ]
 community = "public"
# ## measurement name
 name = "esxi"
 [[inputs.snmp.field]]
   name = "uptime"
   oid = ".1.3.6.1.2.1.25.1.1.0"
 [[inputs.snmp.field]]
   name = "cpu1-load"
   oid = ".1.3.6.1.2.1.25.3.3.1.2.1"
 [[inputs.snmp.field]]
   name = "cpu2-load"
   oid = ".1.3.6.1.2.1.25.3.3.1.2.2"
 [[inputs.snmp.field]]
   name = "cpu3-load"
   oid = ".1.3.6.1.2.1.25.3.3.1.2.3"
 [[inputs.snmp.field]]
   name = "cpu4-load"
   oid = ".1.3.6.1.2.1.25.3.3.1.2.4"
 [[inputs.snmp.field]]
   name = "cpu5-load"
   oid = ".1.3.6.1.2.1.25.3.3.1.2.5"
 [[inputs.snmp.field]]
   name = "cpu6-load"
   oid = ".1.3.6.1.2.1.25.3.3.1.2.6"
 [[inputs.snmp.field]]
   name = "cpu7-load"
   oid = ".1.3.6.1.2.1.25.3.3.1.2.7"
 [[inputs.snmp.field]]
   name = "cpu8-load"
   oid = ".1.3.6.1.2.1.25.3.3.1.2.8"
 [[inputs.snmp.field]]
   name = "ram-total"
   oid = ".1.3.6.1.2.1.25.2.3.1.5.12"
 [[inputs.snmp.field]]
   name = "ram-used"
   oid = ".1.3.6.1.2.1.25.2.3.1.6.12"

 

With the information provided, you should now be able to grab any metric you desire from the OID list and add it to telegraf.conf. When you’re done, restart Telegraf:

sudo systemctl restart telegraf

 

All of your values will now be available in the telegraf data source in Grafana, under the “esxi” measurement. Congratulations! You now have ESXi monitoring in Grafana.

 

UPS

How you monitor a UPS depends on whether or not it has a network management card installed. If it does, you can simply use SNMP to grab the values. I’d recommend checking out the ESXi guide above and performing an SNMP walk against the IP address of your UPS management card.

 

What if you don’t have a management card, and you’re using a USB connection? Well, you’ll need to have it plugged into a NUT (Network UPS Tools) server (or VM). Details on how to set this up are all over the web, so this guide will assume you already have your NUT server up and running.

 

Creating a Script to Pull upsc Values

We need to create a script to pull the upsc values first, then we’ll have Telegraf execute this script. Create a new script and place it into /opt:

sudo vim /opt/ups_metrics.sh

 

Here is a copy of the ups_metrics.sh I’m running. You’ll need to change UPS_NAME to your UPS’s name in the NUT config and HOSTNAME to the hostname of the server, then add or remove any metrics you like (check which ones are available with the upsc command).

#!/bin/bash
 
UPS_NAME="SMX1500"
HOSTNAME="nut-server"
GATHER_COMMAND="upsc ${UPS_NAME}"

(
 M_CHARGE=$(${GATHER_COMMAND} | grep battery.charge: | cut -d' ' -f2)
 M_RUNTIME=$(${GATHER_COMMAND} | grep battery.runtime: | cut -d' ' -f2)
 M_VOLTAGE=$(${GATHER_COMMAND} | grep battery.voltage: | cut -d' ' -f2)
 M_STATUS=$(${GATHER_COMMAND} | grep ups.status | cut -d' ' -f2)

 METRICS="charge=${M_CHARGE},runtime=${M_RUNTIME},voltage=${M_VOLTAGE},status=\"$M_STATUS\""

 echo ups,host=$HOSTNAME,ups=${UPS_NAME} ${METRICS}
) 2>/dev/null

 

Save the file and close it.
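
One thing to watch out for: Telegraf has to be able to execute the script, so make sure it’s executable (it lives in /opt, so it should already be readable by the telegraf user):

sudo chmod +x /opt/ups_metrics.sh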

 

Configure Telegraf to Use the New Script

Open up /etc/telegraf/telegraf.conf for editing and find the [[inputs.exec]] section. Uncomment [[inputs.exec]] if it’s commented out, then add the following below it:

 commands = [
   "/opt/ups_metrics.sh"
 ]
 data_format="influx"

 

Save the file, close it, and restart the Telegraf service:

sudo systemctl restart telegraf

 

You’re finished setting up UPS reporting. The values can now be accessed from the telegraf datasource in Grafana by selecting the “ups” measurement.

 

Happy Graphing!

 

 
