Vibration Monitoring Architecture: From Sensor to Dashboard

The first time I tried to stream raw vibration data to a dashboard, I managed to crash my MQTT broker in under ten minutes. I had a high-frequency accelerometer spitting out samples at 5 kHz, and I thought the “correct” way to handle this was to publish every single reading as a discrete MQTT message, which works out to 5,000 messages per second from a single sensor. I quickly learned that MQTT is great for state changes and telemetry, but it is not a high-speed data bus for raw waveforms.

If you’re building a vibration monitoring system, you’re dealing with a fundamental conflict: the physics of vibration require high-frequency sampling to be useful, but the networking constraints of industrial environments hate high-frequency traffic.

This is a problem for anyone moving beyond simple “is it vibrating?” thresholds into actual predictive maintenance. If you want to do FFT (Fast Fourier Transform) analysis to find a bearing failure before the machine explodes, you can’t just send an average value every ten seconds. You need the raw data, but you can’t send it all.

What I tried first

My initial approach was the “naive cloud” pattern. I used a Raspberry Pi as a gateway, connected a digital accelerometer via SPI, and wrote a Python script to push every sample to a Mosquitto broker.

The logic was simple:

Read sensor.
Wrap value in JSON.
Publish to vibration/sensor/1.
Repeat.

It failed miserably. First, the JSON overhead was insane. Sending {"value": 0.123} for a single float is a waste of bytes. You’re sending 16 bytes of text to communicate 4 bytes of actual data, and that is before any MQTT or TCP framing. Second, the broker couldn’t keep up with the packet rate. I saw massive memory spikes on the broker and a growing lag in the dashboard. By the time the “real-time” graph updated, the data was 45 seconds old.
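To put numbers on the payload side of that, here is a minimal sketch using only the standard library. It is not the script I ran at the time. The 1024-sample window is an illustrative size, and the sketch counts payload bytes only, ignoring MQTT and TCP framing.

```python
import json
import struct

WINDOW_SIZE = 1024  # samples per window (illustrative value)

# Naive approach: one JSON document per sample
json_sample = json.dumps({"value": 0.123}).encode("utf-8")
json_bytes_per_window = len(json_sample) * WINDOW_SIZE

# Binary approach: the whole window packed as little-endian float32 values
samples = [0.123] * WINDOW_SIZE
binary_window = struct.pack(f"<{WINDOW_SIZE}f", *samples)

print(f"JSON, one sample:   {len(json_sample)} bytes")
print(f"JSON, whole window: {json_bytes_per_window} bytes")
print(f"Binary window:      {len(binary_window)} bytes")
print(f"Payload ratio:      {json_bytes_per_window / len(binary_window):.1f}x")
```

The payload saving is real but modest. The bigger cost of the naive approach is the number of messages, not the size of each one.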

I also ignored the “noise” problem. I was sampling everything, including the electrical hum of the nearby VFDs, and my dashboard looked like a chaotic mess of static. I had no windowing, no filtering, and no understanding of the Nyquist frequency. I was just throwing data at a wall and hoping it looked like a vibration. I assumed that “more data is better,” but in IIoT, unmanaged data is just noise that costs you bandwidth and CPU.

The actual solution

To make this work in production, you have to move the heavy lifting to the edge. You don’t send samples; you send processed packets or aggregated features.

1. The Edge Collector (Python/C++)

Instead of streaming, I implemented a buffering strategy. The edge device collects a “window” of data (e.g., 1024 samples), performs a basic DC offset removal, and then sends the window as a binary blob or a compressed array.

Here is a simplified version of how I handled the collection and publishing. I switched from JSON to a more compact format and implemented a sampling window. This example assumes you’re using a Linux-based gateway with Python 3.11.

import struct
import time

import numpy as np
import paho.mqtt.client as mqtt

# Configuration
BROKER = "broker.example.com"
TOPIC = "vibration/sensor/1/raw"
SAMPLING_RATE = 5000  # 5 kHz
WINDOW_SIZE = 1024    # Number of samples per packet

# paho-mqtt 1.x style constructor. On paho-mqtt 2.x, pass
# mqtt.CallbackAPIVersion.VERSION2 as the first argument.
client = mqtt.Client()
client.connect(BROKER, 1883)
# Run the network loop in a background thread so keepalives
# and incoming packets are actually processed
client.loop_start()

def get_raw_samples(count):
    # In reality, this would be an SPI/I2C read from the accelerometer
    # Using a dummy signal: 50Hz sine wave + noise
    # arange keeps the sample spacing at exactly 1/SAMPLING_RATE
    t = np.arange(count) / SAMPLING_RATE
    signal = np.sin(2 * np.pi * 50 * t) + np.random.normal(0, 0.2, count)
    return signal.astype(np.float32)

try:
    while True:
        # Collect a window of data
        data = get_raw_samples(WINDOW_SIZE)

        # Remove DC offset (center the signal around 0)
        # This prevents a constant bias from skewing the FFT later
        data = data - np.mean(data)

        # Pack as binary to save bandwidth (f = float 32-bit)
        # '<' pins the byte order to little-endian so consumers
        # on other platforms decode the same values
        binary_payload = struct.pack(f'<{WINDOW_SIZE}f', *data)

        # QoS 0 is critical here. We don't want the broker to
        # buffer old vibration packets if the network hiccups.
        # If we miss a window, we just move to the next one.
        client.publish(TOPIC, binary_payload, qos=0)

        # Control the publishing rate to avoid flooding the broker
        # 1024 samples at 5kHz means one window every ~205ms
        time.sleep(WINDOW_SIZE / SAMPLING_RATE)
except KeyboardInterrupt:
    client.loop_stop()
    client.disconnect()

2. The Data Pipeline

I moved the broker to a dedicated instance to avoid resource contention. For the broker, I spent some time comparing HiveMQ and Mosquitto (I wrote a more detailed breakdown of that in /posts/mqtt-broker-selection-hivemq-vs-mosquitto-for-industrial-use/). For this scale, Mosquitto v2.0 is fine, but the configuration needs to be tuned for high throughput.

The pipeline looks like this: Sensor → Edge Gateway (Windowing/Filtering) → MQTT Broker → Telegraf → InfluxDB → Grafana.

Telegraf is the unsung hero here. I used a custom Telegraf plugin to decode the binary blobs back into floats before writing them to InfluxDB. If you try to write the raw binary blob directly into a database, you’re just storing an opaque string, which makes querying impossible.
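The decoding step itself is small. The sketch below shows the idea in Python rather than as a Telegraf plugin, so treat it as an illustration of the logic and not as the plugin's code. It assumes the payload is a flat array of little-endian float32 values, and `window_start_ns` is a hypothetical timestamp supplied by the consumer, since the payload carries no timestamp of its own. Each sample needs its own timestamp because points that share a measurement, tag set, and timestamp overwrite each other in InfluxDB.

```python
import struct

SAMPLING_RATE = 5000   # Hz, must match the edge collector
BYTES_PER_SAMPLE = 4   # float32


def decode_window(payload: bytes, window_start_ns: int):
    """Turn a packed float32 window into (timestamp_ns, value) pairs."""
    if len(payload) % BYTES_PER_SAMPLE != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(payload) // BYTES_PER_SAMPLE
    samples = struct.unpack(f"<{count}f", payload)
    period_ns = 1_000_000_000 // SAMPLING_RATE  # 200,000 ns at 5 kHz
    return [
        (window_start_ns + i * period_ns, value)
        for i, value in enumerate(samples)
    ]


# Usage: three samples starting at t = 1 s
payload = struct.pack("<3f", 0.5, -0.25, 1.0)
for timestamp_ns, value in decode_window(payload, 1_000_000_000):
    print(timestamp_ns, value)
```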

3. The Storage Layer (InfluxDB)

Vibration data is high-volume: a single sensor at 5 kHz produces 5,000 points per second. If you store every raw sample in a relational database, you’ll kill your disk I/O. I used InfluxDB v2.7 because it handles time-series data natively and its TSM (Time-Structured Merge Tree) storage engine is built for this kind of append-heavy write load.

To keep the database from bloating, I implemented a retention policy. Raw waveforms are kept for 7 days. After that, I only keep the calculated RMS (Root Mean Square) and Peak-to-Peak values. This is where the concept of “downsampling” becomes a requirement, not a suggestion.
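Both features are cheap to compute per window. The following is a minimal Python sketch of the two calculations, with hypothetical function names. It shows the math only and says nothing about where the downsampling job should run.

```python
import math


def rms(samples):
    """Root Mean Square: square root of the mean of the squared samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def peak_to_peak(samples):
    """Distance between the highest and lowest sample in the window."""
    return max(samples) - min(samples)


def summarize_window(samples):
    """Reduce a raw window to the two long-term features."""
    return {"rms": rms(samples), "peak_to_peak": peak_to_peak(samples)}


# Usage: ten full cycles of a unit-amplitude 50 Hz sine sampled at 5 kHz
window = [math.sin(2 * math.pi * 50 * n / 5000) for n in range(1000)]
print(summarize_window(window))
```

For a pure sine of amplitude A, the RMS is A/√2 and the peak-to-peak value is 2A, which makes a handy sanity check for the pipeline.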

4. The Dashboard (Grafana)

In Grafana, you don’t want to plot 5,000 points per second. It’s visually useless and slows the browser to a crawl. Instead, I created two views:

The Health View: A time-series graph of the RMS value. This tells you if the machine is getting “louder” over time.
The Analysis View: A snapshot of the raw waveform and its FFT.

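For the FFT, I wanted the transformation to happen on the server side rather than in the browser. One caveat: the Flux query below does not compute an FFT by itself. It only selects the sensor’s series and averages it into 1-second windows, so the spectrum still has to be computed in a separate step before Grafana can plot it.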

from(bucket: "vibration_data")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "vibration" and r["sensor_id"] == "1")
  |> aggregateWindow(every: 1s, fn: mean) 
  |> yield(name: "mean_vibration")
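For reference, this is roughly what computing a spectrum from one raw window looks like in NumPy. It is a sketch under my own assumptions (a Hann window, a single 1024-sample window, a synthetic 50 Hz test tone), not the query or job behind the dashboard.

```python
import numpy as np

SAMPLING_RATE = 5000  # Hz
WINDOW_SIZE = 1024


def amplitude_spectrum(samples, sampling_rate):
    """Return (frequencies in Hz, approximate amplitudes) for one window."""
    samples = np.asarray(samples, dtype=np.float64)
    samples = samples - samples.mean()      # drop any remaining DC bias
    window = np.hanning(len(samples))       # taper the edges to reduce leakage
    spectrum = np.fft.rfft(samples * window)
    # Scale so a sine of amplitude A shows up as roughly A at its peak bin
    amplitudes = 2.0 * np.abs(spectrum) / window.sum()
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sampling_rate)
    return freqs, amplitudes


# Usage: a clean 50 Hz test tone
t = np.arange(WINDOW_SIZE) / SAMPLING_RATE
signal = np.sin(2 * np.pi * 50 * t)
freqs, amps = amplitude_spectrum(signal, SAMPLING_RATE)
peak_hz = freqs[np.argmax(amps)]
print(f"Peak near {peak_hz:.1f} Hz (bin width {SAMPLING_RATE / WINDOW_SIZE:.2f} Hz)")
```

With 1024 samples at 5 kHz the bins are about 4.9 Hz wide, so the reported peak lands on the bin nearest to 50 Hz rather than on 50 Hz exactly.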

Why this works: The Engineering Logic

The shift from “streaming samples” to “windowed packets” is the difference between a system that crashes and a system that scales.

Network Overhead

In the naive approach, every single float was wrapped in an MQTT header (fixed header, topic name, etc.). For a 4-byte float, you might be sending 40-60 bytes of overhead. By grouping 1024 samples into a single binary blob, the overhead is amortized across the entire window. You’ve reduced the packet count by a factor of 1024, which is the only way to keep a standard MQTT broker stable.
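The MQTT part of that overhead can be computed directly from the packet layout. The sketch below assumes MQTT 3.1.1 at QoS 0 (no packet identifier, no MQTT 5.0 properties) and deliberately leaves out TCP/IP headers, so the real on-the-wire cost is higher. The function names are my own.

```python
def remaining_length_size(remaining_length: int) -> int:
    """Bytes used by MQTT's variable-length 'remaining length' field."""
    if not 0 <= remaining_length <= 268_435_455:
        raise ValueError("remaining length out of range for MQTT")
    size = 1
    while remaining_length > 127:
        remaining_length //= 128
        size += 1
    return size


def publish_overhead(topic: str, payload_len: int) -> int:
    """Non-payload bytes in an MQTT 3.1.1 QoS 0 PUBLISH packet."""
    variable_header = 2 + len(topic.encode("utf-8"))  # length prefix + topic
    remaining = variable_header + payload_len
    return 1 + remaining_length_size(remaining) + variable_header


SAMPLING_RATE = 5000
WINDOW_SIZE = 1024

naive = publish_overhead("vibration/sensor/1", 16)          # one JSON sample
windowed = publish_overhead("vibration/sensor/1/raw", 4096)  # one float32 window

print(f"Naive:    {naive} B/msg, {naive * SAMPLING_RATE} B/s of MQTT overhead")
print(f"Windowed: {windowed} B/msg, "
      f"{windowed * SAMPLING_RATE / WINDOW_SIZE:.0f} B/s of MQTT overhead")
```

The per-message overhead barely changes between the two approaches. The gain comes almost entirely from sending about five messages per second instead of five thousand.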

The Nyquist Limit and Aliasing

One of the biggest traps in vibration monitoring is aliasing. If you sample at 1kHz, you can only accurately see frequencies up to 500Hz. If your motor has a bearing defect vibrating at 800Hz, it will “alias” and appear as a 200Hz signal in your data. I set the sampling rate to 5kHz to ensure I could capture the 2nd and 3rd harmonics of the machinery I was monitoring.
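The folding rule behind that 800Hz to 200Hz example fits in a few lines. This is a small sketch for an ideal sampler with no anti-aliasing filter in front of it, and `alias_frequency` is a hypothetical helper name.

```python
def alias_frequency(signal_hz: float, sampling_hz: float) -> float:
    """Apparent frequency of a pure tone after sampling at sampling_hz."""
    if sampling_hz <= 0:
        raise ValueError("sampling rate must be positive")
    folded = signal_hz % sampling_hz          # wrap into [0, fs)
    return min(folded, sampling_hz - folded)  # reflect into [0, fs/2]


print(alias_frequency(800.0, 1000.0))   # the bearing defect sampled at 1 kHz
print(alias_frequency(800.0, 5000.0))   # the same defect sampled at 5 kHz
print(alias_frequency(3000.0, 5000.0))  # content above 2.5 kHz still folds back
```

The last line is the reminder that a higher sampling rate moves the problem rather than removing it: anything above half the new rate still lands somewhere it does not belong.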

Edge Processing vs. Cloud Processing

By removing the DC offset at the edge, I ensured that the signal is centered around zero before it ever reaches the FFT. A large DC offset creates a massive spike at 0Hz in the frequency domain, which often masks the actual low-frequency vibration components you’re looking for. Doing this at the edge saves the database from storing useless bias and saves the dashboard from having to calculate it on the fly.
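The 0Hz spike is easy to reproduce. The sketch below uses a made-up signal (a 0.1-amplitude 50Hz vibration riding on a constant bias of 1.0) and checks which FFT bin dominates before and after subtracting the mean.

```python
import numpy as np

SAMPLING_RATE = 5000  # Hz
WINDOW_SIZE = 1024

t = np.arange(WINDOW_SIZE) / SAMPLING_RATE
# Hypothetical signal: small 50 Hz vibration on top of a large constant bias
biased = 1.0 + 0.1 * np.sin(2 * np.pi * 50 * t)


def dominant_frequency(samples):
    """Frequency (Hz) of the largest-magnitude FFT bin."""
    magnitudes = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / SAMPLING_RATE)
    return float(freqs[np.argmax(magnitudes)])


print("With bias:   ", dominant_frequency(biased), "Hz")
print("Bias removed:", dominant_frequency(biased - biased.mean()), "Hz")
```

With the bias in place the largest bin is the DC bin. Once the mean is subtracted, the largest bin is the one nearest to 50Hz.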

Troubleshooting the “Ghost in the Machine”

Even with this architecture, I hit several walls. The most frustrating was the “intermittent lag” where the dashboard would freeze for 10 seconds and then fast-forward through 10 seconds of data.

The Error: MQTT Buffer Overflow

I checked the Mosquitto logs and saw:

mosquitto.log: Socket error on client <id>, disconnecting
mosquitto.log: Client <id> already connected, force disconnecting

The issue was the QoS level. At that point I was still publishing with qos=1 (at least once delivery). When the network had a momentary dip, the edge gateway started buffering messages. Once the connection resumed, the gateway flooded the broker with a backlog of thousands of packets. Because it was qos=1, the broker spent all its CPU cycles acknowledging packets instead of routing them to Telegraf.

The fix was switching to qos=0. In vibration monitoring, a missed window is better than a delayed window. If you’re doing real-time monitoring, data from 30 seconds ago is useless.
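The same “newest data wins” rule can be enforced on the gateway itself, independent of the QoS setting. The class below is a hypothetical helper, not part of paho-mqtt and not something from my original script: a bounded buffer that silently discards the oldest windows, so a reconnect can never release more than a couple of packets.

```python
from collections import deque


class LatestWindowBuffer:
    """Keeps at most max_windows pending windows; the oldest are dropped first."""

    def __init__(self, max_windows: int = 2):
        self._windows = deque(maxlen=max_windows)
        self.dropped = 0

    def push(self, window: bytes) -> None:
        if len(self._windows) == self._windows.maxlen:
            self.dropped += 1  # deque evicts the oldest entry on append
        self._windows.append(window)

    def drain(self):
        """Yield pending windows, oldest first, emptying the buffer."""
        while self._windows:
            yield self._windows.popleft()


# Usage: simulate a network dip lasting ten windows
buffer = LatestWindowBuffer(max_windows=2)
for i in range(10):
    buffer.push(bytes([i]))

print("sent after reconnect:", list(buffer.drain()))
print("dropped during outage:", buffer.dropped)
```

The trade-off is explicit here: eight of the ten windows are gone for good, which is acceptable for a live view and not acceptable if you need a complete record.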

The Error: InfluxDB Write Timeouts

As the number of sensors grew, I started seeing:

429 Too Many Requests: write request failed

This happened because Telegraf was trying to push too many small points. I had to adjust the flush_interval and the batch size (metric_batch_size in the agent section) in telegraf.conf:

[agent]
  interval = "10s"
  round_interval = true
  precision = "1ms"
  flush_interval = "5s"
  metric_batch_size = 5000

Increasing metric_batch_size reduced the number of HTTP requests to InfluxDB, significantly lowering the CPU load on the database node.

Lessons Learned

If I had to build this from scratch today, I’d make a few changes.

First, I would use Protobuf instead of struct.pack. While binary blobs are efficient, they are fragile. If you change the sensor precision from float32 to float64, every single downstream consumer breaks without warning. Protobuf provides a schema that allows for evolution without breaking the pipeline.
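To show the failure mode and the kind of protection a schema buys you, here is a sketch that stays with the standard library instead of Protobuf. The 12-byte header layout (magic, version, dtype code, sample rate, sample count) is entirely made up for illustration. Protobuf would give you the same safety with proper tooling on top.

```python
import struct

# Hypothetical header: magic, format version, dtype code, sample rate (Hz), count
HEADER = struct.Struct("<4sBBIH")
MAGIC = b"VIB1"
DTYPES = {1: ("f", 4), 2: ("d", 8)}  # code -> (struct format char, bytes/sample)


def encode_window(samples, sampling_rate, dtype_code=1):
    fmt, _ = DTYPES[dtype_code]
    header = HEADER.pack(MAGIC, 1, dtype_code, sampling_rate, len(samples))
    return header + struct.pack(f"<{len(samples)}{fmt}", *samples)


def decode_window(payload):
    if len(payload) < HEADER.size:
        raise ValueError("payload shorter than header")
    magic, version, dtype_code, sampling_rate, count = HEADER.unpack_from(payload)
    if magic != MAGIC or version != 1 or dtype_code not in DTYPES:
        raise ValueError("unknown payload format")
    fmt, width = DTYPES[dtype_code]
    body = payload[HEADER.size:]
    if len(body) != count * width:
        raise ValueError("payload length does not match header")
    return sampling_rate, list(struct.unpack(f"<{count}{fmt}", body))


# The fragile case: a float64 sample decoded as two float32 values, no error
print(struct.unpack("<2f", struct.pack("<d", 0.5)))

# The self-describing case: the consumer learns the dtype from the header
print(decode_window(encode_window([0.5, -0.25], 5000, dtype_code=2)))
```

The first print is the scary one: the bytes decode cleanly into numbers that have nothing to do with the original sample.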

Second, I would integrate this with a higher-level health scoring system. Raw vibration data is a means to an end. An operator doesn’t care about the amplitude of the 120Hz peak; they care if the machine is going to fail. I’ve written about this in /posts/equipment-health-scoring-one-number-your-operators-actually-check/, where I explain how to boil these complex waveforms down into a single health score.

The biggest surprise? The physical installation. I spent weeks on the software architecture only to find out that the sensors were giving me garbage data because they weren’t bolted down correctly. In IIoT, the “air gap” between the sensor and the machine is just as important as the air gap between your gateway and the cloud. If the sensor is loose, you’re just monitoring the vibration of the sensor itself, not the machine.

Summary of the stack for reference:

Sensors: Digital Accelerometers (SPI)
Gateway: Raspberry Pi / Industrial PC (Python 3.11)
Transport: MQTT v5.0 (Mosquitto v2.0)
Ingestion: Telegraf v1.2x
Storage: InfluxDB v2.7 (TSM Engine)
Visualization: Grafana v10.x

For those looking to implement this at scale, especially in environments where you can’t manage every single gateway manually, I’d suggest looking into automating the deployment via OpenTofu and GitHub Actions, as I detailed in /posts/automating-infrastructure-with-opentofu-and-github-actions/. Managing five gateways is easy; managing fifty is a full-time job if you’re doing it via SSH.
