
Observability is where my profession and my hobbies overlap. If I can get metrics from it and graph it, I do. I taught myself Go a number of years back while playing with the API for my thermostat, so I could see how my apartment was handling New England winters. I have multiple monitoring servers of different types tracking the health of my fleet. I open Grafana when I first wake up in the morning and start snitching on people pentesting my website. So naturally, when a new gaming PC comes into my home, I want to know how it’s performing.

Why bother?

I had a Raspberry Pi 5 hooked up as a TV computer before this, and it was monitored like any other Linux system in the house. The Legion Go is effectively replacing it, sitting on a docking station on a shelf next to the TV. In its little cabinet, I added an exhaust fan out the back and put another fan behind the device. Now I want to know:

  • How hot does it get during gameplay?
  • How does it perform docked vs handheld, or even traveling?
  • Which games are more stressful than others?
  • Long shot, what do the FPS look like?
  • If games start crashing, what did the logs say at the time?

This is the only computer I own running Windows. I’m a little out of my element; I haven’t had to administer a Windows server in many years and definitely never set one up to ship metrics to a TSDB. It’s a learning experience, but maybe not a useful one because I haven’t seen Windows Server crop up in my particular corner of tech in quite a while. This also runs Windows 11 Home Edition, which has limitations I didn’t expect (no BitLocker To Go?!). We’ll work through it.

The Observability Stack

For this purpose, I’m using the following:

  • InfluxDB 2 for metrics
  • Telegraf as the agent
  • Loki for logs
  • Grafana for dashboards

A little about this design decision. I do have a Prometheus server running, but I use that for monitoring and alerting on my servers. InfluxDB is where I push various other metrics, like the readings from my Bluetooth thermometers. Another reason for Influx over Prometheus in this case is that it’s push-based; I want this to keep recording data while I’m connected to wifi on a plane or a train, where nothing at home can reach it to scrape. We’re using InfluxDB 2 in this case as I have not migrated to 3 yet. A separate blog post will come soon covering that. The Telegraf output plugin is the same for v2 and v3 though, so we’re good.

Telegraf is our agent of choice here, and if you’d prefer to use Prometheus, it can actually serve a /metrics endpoint like any other exporter. It can also ship Windows Event Logs to Loki with minimal setup.

Loki just happens to be my log server of choice. Telegraf does support LogStash as well, if that’s your thing.

Finally, Grafana is what everyone uses. It’s great, it works.

I’m not going to cover setting up the various servers here, but we will go over the Telegraf config I’m using.

Set up Telegraf

Creating a user

Some of what we’re going to do later in this demo requires elevated privileges. Chances are you logged into your new device with your Microsoft account. This makes it tricky to run the service as yourself because it wants a conventional username and password. We’ll create an admin user just for Telegraf; the commands later on assume it is named telegraf.

  1. Go to Settings > Accounts > Other Users
  2. Start to add an account and click “I don’t have this person’s sign-in information.”
  3. Click “Add a user without a Microsoft account.”
  4. Fill out the form including the archaic security questions. If you use a password manager, just use gibberish.
  5. Go back and edit the account, change the account type to “Administrator.”

Installing Telegraf

We’re going to want to get the binary and create a Service for it. First, create C:\Program Files\Telegraf. Then we’ll need to get the latest Telegraf Windows binary from InfluxData.

The GitHub Releases page for Telegraf has various packages. We’ll want to download the Windows AMD64 zip file here.

Copy telegraf.exe and telegraf.conf into the directory you made. Run the following command from the new directory to install it as a Windows service, but we won’t be starting it just yet.

.\telegraf.exe --service install --config "C:\Program Files\telegraf\telegraf.conf"

Go to the search bar and enter “Services”. Right click on telegraf, select “Properties”, and on the “Log On” tab change the account to the new user you created.

Rename telegraf.conf to telegraf.conf.default and create a new, empty telegraf.conf in its place. The default file includes an example of every plugin, so it’s useful to keep around, and we can check the documentation for further details.

Configuring Telegraf

We’ll be putting the following blocks into telegraf.conf. Note that Telegraf’s configuration is in TOML, so configure your preferred editor accordingly to make it pretty.

Some basic config options first:

[global_tags]
# You might not need anything here.

[agent]
  # Make longer if you want.
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"

  # Windows specific logging setting.
  logformat = "eventlog"

  # Windows makes different assumptions than UNIX-likes.
  # Set to your FQDN
  hostname = "legiongo2.yourdomain.tld"
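
Since this device will spend time away from the network, it is worth a rough estimate of how long Telegraf can hold metrics before the buffer fills and the oldest ones are dropped. The sketch below is back-of-the-envelope arithmetic only: 10000 and 10 seconds come from the [agent] block above, while the 150 metrics per interval is a made-up number you would replace with your own count.

```python
def offline_buffer_seconds(metric_buffer_limit: int,
                           metrics_per_interval: int,
                           interval_seconds: float) -> float:
    """Rough time until the buffer fills while the output is unreachable."""
    if metrics_per_interval <= 0:
        raise ValueError("metrics_per_interval must be positive")
    # Each collection interval adds metrics_per_interval entries to the buffer.
    intervals_until_full = metric_buffer_limit / metrics_per_interval
    return intervals_until_full * interval_seconds


# 10000 and 10 come from the [agent] block; 150 is a hypothetical metric count.
seconds = offline_buffer_seconds(10000, 150, 10)
print(f"~{seconds / 60:.1f} minutes of buffered data")
```

With those numbers the buffer covers roughly eleven minutes, so a longer interval or a larger metric_buffer_limit may be worth considering if you expect to be offline for a whole train ride.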

Output config

Example InfluxDB v2/v3 config. This will write all our metrics except the Windows Event Log.

[[outputs.influxdb_v2]]
  urls = ["https://influxdb.yourdomain.tld"]
  token = "yourtokenhere"
  organization = "your-org-from-influx"
  bucket = "your-bucket"

  # Get metrics, ignore logs.
  namedrop = [ "win_eventlog" ]
  content_encoding = "gzip"
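
If the output misbehaves, it can help to know roughly what Telegraf is sending. The sketch below builds (but does not send) a request against the InfluxDB v2 write API using the same placeholder URL, org, bucket and token as the block above. The helper name build_write_request and the manual_test measurement are my own inventions for illustration, not anything Telegraf provides.

```python
import gzip
import urllib.parse
import urllib.request


def build_write_request(base_url, org, bucket, token, lines):
    """Build (but do not send) a gzip-compressed InfluxDB v2 write request."""
    query = urllib.parse.urlencode(
        {"org": org, "bucket": bucket, "precision": "ns"}
    )
    # The body is line protocol, one metric per line, compressed with gzip.
    body = gzip.compress("\n".join(lines).encode("utf-8"))
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v2/write?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Encoding": "gzip",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


# Placeholder values, same as the Telegraf config above.
req = build_write_request(
    "https://influxdb.yourdomain.tld",
    "your-org-from-influx",
    "your-bucket",
    "yourtokenhere",
    ["manual_test,host=legiongo2 value=1"],
)
print(req.get_method(), req.full_url)
```

Sending it for real would be a call to urllib.request.urlopen(req, timeout=10). An authorization error there points at the token or org rather than at Telegraf.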

Want to set it up as a Prometheus exporter instead? Try this. See the plugin documentation for more options.

[[outputs.prometheus_client]]
  listen = ":9273"
  namedrop = [ "win_eventlog" ]

Example Loki config, this will write only the Windows Event Log.

[[outputs.loki]]
  domain = "https://loki.yourdomain.tld"
  endpoint = "/loki/api/v1/push"
  # Assuming basic auth. That's what I do.
  username = "loki"
  password = "yourpassword"
  gzip_request = true
  # Only logs
  namepass = [ "win_eventlog" ]
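
The namepass and namedrop filters are what split the stream between outputs. As a mental model only (this is not Telegraf's code), the routing behaves roughly like the sketch below, where the accepts function is a hypothetical stand-in and patterns are treated as globs.

```python
from fnmatch import fnmatchcase


def accepts(name, namepass=(), namedrop=()):
    """Return True if a measurement name gets through the filters."""
    # namepass: if set, the name must match at least one pattern.
    if namepass and not any(fnmatchcase(name, p) for p in namepass):
        return False
    # namedrop: any match rejects the metric.
    if any(fnmatchcase(name, p) for p in namedrop):
        return False
    return True


# Filters copied from the InfluxDB and Loki output blocks above.
outputs = {
    "influxdb_v2": {"namedrop": ["win_eventlog"]},
    "loki": {"namepass": ["win_eventlog"]},
}

for measurement in ["cpu", "win_mem", "win_eventlog"]:
    targets = [o for o, f in outputs.items() if accepts(measurement, **f)]
    print(f"{measurement} -> {targets}")
```

Every measurement lands in exactly one place: event logs go to Loki and everything else goes to InfluxDB.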

Input config

Start with the essentials.

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  core_tags = true

[[inputs.disk]]

[[inputs.diskio]]

[[inputs.mem]]

[[inputs.net]]

[[inputs.swap]]

[[inputs.system]]

[[inputs.temp]]

Next, let’s add some of these Windows specific metrics. You may find some of them useful.

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    Measurement = "win_cpu"
    ObjectName = "Processor"
    Instances = ["*"]
    UseRawValues = true
    Counters = [
      "% Idle Time",
      "% Interrupt Time",
      "% Privileged Time",
      "% User Time",
      "% Processor Time",
      "% DPC Time",
    ]

  [[inputs.win_perf_counters.object]]
    Measurement = "win_disk"
    ObjectName = "LogicalDisk"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
      "% Disk Time",
      "% Disk Read Time",
      "% Disk Write Time",
      "% User Time",
      "% Free Space",
      "Current Disk Queue Length",
      "Free Megabytes",
    ]

  [[inputs.win_perf_counters.object]]
    Measurement = "win_diskio"
    ObjectName = "PhysicalDisk"
    Instances = ["*"]
    Counters = [
      "Disk Read Bytes/sec",
      "Disk Write Bytes/sec",
      "Current Disk Queue Length",
      "Disk Reads/sec",
      "Disk Writes/sec",
      "% Disk Time",
      "% Disk Read Time",
      "% Disk Write Time",
    ]

  [[inputs.win_perf_counters.object]]
    Measurement = "win_net"
    ObjectName = "Network Interface"
    Instances = ["*"]
    Counters = [
      "Bytes Received/sec",
      "Bytes Sent/sec",
      "Packets Received/sec",
      "Packets Sent/sec",
      "Packets Received Discarded",
      "Packets Outbound Discarded",
      "Packets Received Errors",
      "Packets Outbound Errors",
    ]

  [[inputs.win_perf_counters.object]]
    Measurement = "win_system"
    ObjectName = "System"
    Instances = ["------"]
    Counters = [
      "Context Switches/sec",
      "System Calls/sec",
      "Processor Queue Length",
      "System Up Time",
    ]

  [[inputs.win_perf_counters.object]]
    Measurement = "win_mem"
    ObjectName = "Memory"
    Instances = ["------"]
    Counters = [
      "Available Bytes",
      "Cache Faults/sec",
      "Demand Zero Faults/sec",
      "Page Faults/sec",
      "Pages/sec",
      "Transition Faults/sec",
      "Pool Nonpaged Bytes",
      "Pool Paged Bytes",
      "Standby Cache Reserve Bytes",
      "Standby Cache Normal Priority Bytes",
      "Standby Cache Core Bytes",
    ]

  [[inputs.win_perf_counters.object]]
    Measurement = "win_swap"
    ObjectName = "Paging File"
    Instances = ["_Total"]
    Counters = [
      "% Usage",
    ]
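
One thing that tripped me up when querying: the counter names above are not the field names you will see in InfluxDB. The sketch below reflects my understanding of how the plugin rewrites them (spaces to underscores, % to Percent, /sec to _persec, and a _Raw suffix when UseRawValues is on). Treat the replacement table as an assumption and verify it against what actually lands in your bucket.

```python
# Assumed replacement rules; check them against your own data.
REPLACEMENTS = [
    ("/sec", "_persec"),
    ("/Sec", "_persec"),
    (" ", "_"),
    ("%", "Percent"),
    ("\\", ""),
]


def field_name(counter, use_raw_values=False):
    """Guess the field name Telegraf writes for a Windows counter."""
    name = counter
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name + "_Raw" if use_raw_values else name


for counter in ["% Idle Time", "Bytes Received/sec", "Free Megabytes"]:
    print(f"{counter!r} -> {field_name(counter)}")

# win_cpu above sets UseRawValues = true, so its fields get the suffix.
print(field_name("% Processor Time", use_raw_values=True))
```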

Lastly, let’s configure it to ship the Windows Event Log. This example is a firehose. I’m still deciding how to narrow it down myself, so adjust it for your needs.

[[inputs.win_eventlog]]
  xpath_query = '''
    <QueryList>
      <Query Id="0" Path="Security">
        <Select Path="Security">*</Select>
      </Query>
      <Query Id="1" Path="Application">
        <Select Path="Application">*</Select>
      </Query>
      <Query Id="2" Path="Windows PowerShell">
        <Select Path="Windows PowerShell">*</Select>
      </Query>
      <Query Id="3" Path="System">
        <Select Path="System">*</Select>
      </Query>
      <Query Id="4" Path="Setup">
        <Select Path="Setup">*</Select>
      </Query>
    </QueryList>
  '''
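
One way to tame the firehose is to filter by event level. The sketch below generates a QueryList that keeps only Critical, Error and Warning events (levels 1 to 3) for the channels you name. The build_query_list helper is hypothetical, it needs Python 3.9 or newer for ET.indent, and I have not tuned it for the Security channel, where audit events may log at level 0 and would be filtered out.

```python
import xml.etree.ElementTree as ET


def build_query_list(channels, max_level=3):
    """Build an Event Log QueryList keeping levels 1..max_level."""
    # Windows levels: 1 = Critical, 2 = Error, 3 = Warning, 4 = Information.
    levels = " or ".join(f"Level={n}" for n in range(1, max_level + 1))
    xpath = f"*[System[({levels})]]"
    root = ET.Element("QueryList")
    for i, channel in enumerate(channels):
        query = ET.SubElement(root, "Query", Id=str(i), Path=channel)
        select = ET.SubElement(query, "Select", Path=channel)
        select.text = xpath
    ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


print(build_query_list(["Application", "System"]))
```

The printed XML can be pasted between the triple quotes of xpath_query in place of the catch-all version.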

Test and start the service

Let’s make sure the user you created can run Telegraf with this config. From a command prompt:

runas /user:telegraf "cmd.exe /k"

Then in the new window:

cd "C:\Program Files\Telegraf"
.\telegraf.exe --test

If you see metrics and no errors, go ahead and start the service.

net start telegraf

Checking metrics and logs

If you’re using InfluxDB v2, open the Influx portal and go to the Data Explorer. This will quickly show you whether you’re receiving metrics. If you’re on v3, use Grafana Explore.

For logs, hop into Grafana Explore if you’re using Loki. Depending on how you configure your Event Log output, you might see a lot of stuff here.

Is this good enough?

Not really. I’m using a Lenovo Legion Go 2, and these metrics don’t cover what I was looking for. There is a thermal metric, but when I open Task Manager in Windows, I see a GPU temperature that is significantly higher. I want that, and I also want all the other stuff I see in the AMD Adrenalin software. How do I get that?

Our quest for more metrics

Telegraf includes an input plugin that leverages AMD’s CLI tools to report GPU metrics. Those CLI tools only exist for Linux. For whatever reason, AMD’s official software for Windows assumes you just want to play video games and you’re not wasting precious gaming time on a project like this. Psh. Don’t tell me what’s fun.

I searched high and low for a way to get this data. I found some things.

  • AMD ADLX SDK: this development kit gives a great deal of access to the GPU including visibility into what’s happening, but they expect you to fire up Visual C++, compile it, etc. I’m not doing all that. Is that even free? Why isn’t there a DLL I can just download?
  • Adrenalin Software logs: you can actually create log files in Adrenalin that are in CSV format. This is… oh my god this is terrible. Could it work? Yeah, actually. The file input plugin for Telegraf can parse CSV data into metrics. This involves activating the logging in the app every time you play and setting up good log rotation so you don’t have Telegraf parsing a 300MB file every 10 seconds. That’ll be great for gameplay, right?
  • LibreHardwareMonitor: popular open source project which gives you a lot more metrics than Windows exposes on its own. It runs a local HTTP server which serves JSON output at /data.json. Telegraf has an http input plugin and a json parser. Perfect! Wait… it can’t bind to localhost? Nope. It finds your default IPv4 address and binds to that. I tried hacking the config file and it just overwrote my changes. This means I’d have to reconfigure Telegraf every time that IP address changes, and given that this thing is going to spend a lot of time on public wifi, that’s a deal breaker. Might raise a GitHub issue on that one.
  • LibreHardwareMonitorLib.dll: just yank the DLL out of the aforementioned project and write a small script Telegraf can execute with it.

Yeah, that last one.

Writing a collector script for Telegraf

Telegraf’s exec plugin will execute a script or binary file and parse the output into metrics. This means we can expose pretty much anything.
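
To make that contract concrete: with data_format set to influx, the script only has to print one line of InfluxDB line protocol per metric to stdout. Here is a minimal sketch of a formatter that handles the escaping rules for spaces, commas and equals signs. The to_line helper and the room_temp example are made up for illustration, and it only handles float fields.

```python
def _escape(value, chars):
    """Backslash-escape the given characters in a name or tag value."""
    value = str(value)
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def to_line(measurement, tags, fields):
    """Format one metric as InfluxDB line protocol (no timestamp)."""
    # Measurement names escape commas and spaces.
    head = _escape(measurement, ", ")
    # Tag keys, tag values and field keys also escape equals signs.
    for key, val in sorted(tags.items()):
        head += f",{_escape(key, ',= ')}={_escape(val, ',= ')}"
    body = ",".join(
        f"{_escape(k, ',= ')}={float(v)}" for k, v in fields.items()
    )
    return f"{head} {body}"


# Hypothetical metric; Telegraf adds the timestamp when none is given.
print(to_line("room_temp", {"sensor": "shelf fan"}, {"celsius": 31.5}))
```

That prints room_temp,sensor=shelf\ fan celsius=31.5, which is the shape every line of output needs to follow.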

Being an old curmudgeonly Linux admin, I don’t know .NET or PowerShell. You can do this however you want, but my example involves installing Python.

Download a recent stable version of Python for Windows and install for ALL USERS. Once that’s good, do the following:

runas /user:telegraf "pip install pythonnet"

These are .NET bindings for Python, which will allow us to use that aforementioned DLL. You’re probably thinking, “wouldn’t it be more resource efficient to do this in PowerShell, .NET or C++?” Yeah, go make a GitHub gist and share it.

Download the latest LibreHardwareMonitor package from GitHub and copy the LibreHardwareMonitorLib.dll file out of it into C:\Program Files\Telegraf. Then, right click on it, select Properties, and click “Unblock” under Security.

Also in this folder, I created lhm-to-telegraf.py. This is based on something snip3rnick already figured out and I thank them for their cross platform knowledge.

import clr

clr.AddReference(r"C:\Program Files\telegraf\LibreHardwareMonitorLib.dll")
from LibreHardwareMonitor.Hardware import (
    Computer,
    IVisitor,
    IHardware,
    ISensor,
    IParameter,
)


class UpdateVisitor(IVisitor):
    __namespace__ = "MyVisitor"

    def VisitComputer(self, computer):
        computer.Traverse(self)

    def VisitHardware(self, hardware: IHardware):
        hardware.Update()
        for sub in hardware.SubHardware:
            sub.Update()

    def VisitParameter(self, parameter: IParameter):
        pass

    def VisitSensor(self, sensor: ISensor):
        pass


c = Computer()
# There are more metrics that you can enable.
# I'm only using this for GPU.
c.IsGpuEnabled = True

c.Open()

visitor = UpdateVisitor()
c.Accept(visitor)

for h in c.Hardware:
    h.Update()
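    for s in h.Sensors:
        # A sensor with no reading yet reports None, which is not
        # valid line protocol, so skip it.
        if s.Value is None:
            continue
        sensor_type = s.SensorType.ToString().lower()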
        if sensor_type == "smalldata" and "Memory" in s.Name:
            measurement = "gpu_mem"
        elif sensor_type == "load" and "D3D" in s.Name:
            measurement = "gpu_d3d"
        else:
            measurement = f"gpu_{sensor_type}"
        hardware = s.Hardware.Name.replace(" ", "_")
        name = s.Name.replace(" ", "_")

        print(f"{measurement},hardware={hardware} {name}={s.Value}")


c.Close()

Now we have to add that to telegraf.conf.

[[inputs.exec]]
  commands = ["python.exe \"C:\\Program Files\\telegraf\\lhm-to-telegraf.py\""]
  timeout = "10s"
  # We output influx line protocol, no custom parsing needed.
  data_format = "influx"
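
Before handing the script to Telegraf, it can be handy to run it the way the exec plugin would and look for lines that will not parse, such as a field whose value is not a number. The sketch below is a rough local check, not what Telegraf does internally. The run_collector and suspect_lines helpers are hypothetical, and the demo command is a stand-in so the example runs anywhere.

```python
import math
import subprocess
import sys


def run_collector(command, timeout=10):
    """Run a collector command and return its non-empty stdout lines."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=True
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def suspect_lines(lines):
    """Return lines whose final field value is not a finite number."""
    bad = []
    for line in lines:
        value = line.rpartition("=")[2]
        try:
            ok = math.isfinite(float(value))
        except ValueError:
            ok = False
        if not ok:
            bad.append(line)
    return bad


# Stand-in command; swap in python.exe plus the path to the collector.
demo = [
    sys.executable,
    "-c",
    "print('gpu_load,hardware=demo GPU_Core=42.0');"
    "print('gpu_fan,hardware=demo GPU_Fan=None')",
]
lines = run_collector(demo)
print(suspect_lines(lines))
```

The timeout argument plays the same role as the timeout setting in the config above: a collector that takes longer than that per run is going to cause problems.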

Run the test again and make sure you see GPU metrics.

runas /user:telegraf "cmd.exe /k"
cd "C:\Program Files\Telegraf"
.\telegraf.exe --test

If all looks good, go ahead and restart Telegraf.

net stop telegraf
net start telegraf

Conclusion

I got what I wanted. This hopefully serves well as a rough example of how to monitor a modern Windows-based game console. If you’re running SteamOS you can probably treat it like any Linux server, but it’s nice to see that Telegraf was a valid option for Windows.
