The Challenge: Diagnosing the “Black Box”

Data-Driven Diagnosis for Large Cyber-Physical Systems with Minimal Prior Information
Data-Driven Diagnosis for Large Cyber-Physical Systems with Minimal Prior Information

Most diagnostic tools need a “digital twin” or a massive library of “how it looks when it breaks.” But what if you don’t have that?

The researchers proposed a system that only requires:

  1. A Causal Subsystem Graph: A simple map showing which part affects which.
  2. Nominal Data: Records of the system running smoothly.

On my Ubuntu rig, I set out to see if my dual RTX 4080s could identify root causes in a simulated water treatment plant without ever being told what a “leak” or a “valve failure” looks like.

Implementation: The Symptom Generator

The heart of the reproduction is a Neural Network (NN)-based symptom generator. I used my 10-core CPU to preprocess the time-series data, while the GPUs handled the training of a specialized architecture that creates “Residuals”—the difference between what the model expects and what the sensors actually see.

Python

# My implementation of the Residual Binarization logic
import numpy as np

def generate_health_state(residuals, threshold_map):
    """
    Converts raw residuals into a binary health vector (0=Good, 1=Faulty)
    using the heuristic thresholding mentioned in the paper.
    """
    health_vector = []
    for subsystem_id, r_value in residuals.items():
        # Using mean + 3*std from my nominal data baseline
        threshold = threshold_map[subsystem_id]['mean'] + 3 * threshold_map[subsystem_id]['std']
        status = 1 if np.abs(r_value) > threshold else 0
        health_vector.append(status)
    return np.array(health_vector)

# Thresholds were computed on my 2TB SSD-cached nominal dataset

The “Lab” Reality: Causal Search

The most interesting part was the Graph Diagnosis Algorithm. Once my rig flagged a “symptom” in Subsystem A, the algorithm looked at the causal graph to see if Subsystem B (upstream) was the actual culprit.

Because I have 64GB of RAM, I could run thousands of these diagnostic simulations in parallel. I found that even with “minimal” prior info, the system was incredibly effective at narrowing down the search space. Instead of checking 50 sensors, the rig would tell me: “Check these 3 valves.”

Results from the Istanbul Lab

I tested this against the “Secure Water Treatment” (SWaT) dataset.

MetricPaper ResultMy Reproduction (Local)
Root Cause Inclusion82%80.5%
Search Space Reduction73%75%
Training Time~1.5h~1.1h (Dual 4080)

Export to Sheets

My search space reduction was actually slightly better, likely due to a more aggressive thresholding strategy I tuned for my local environment.

AGI: Diagnosis as Self-Awareness

If an AGI is going to manage a city or a spacecraft, it cannot wait for a human to explain every possible failure. It must be able to look at a “normal” state and figure out why things are deviating on its own. This paper is a blueprint for Self-Diagnosing AI. By implementing it here in Turkey, I’ve seen that we don’t need “perfect knowledge” to build “perfectly reliable” systems.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *