---
license: apache-2.0
datasets:
  - honicky/hdfs-logs-encoded-blocks
  - Kingslayer5437/BGL
language:
  - en
metrics:
  - f1
  - precision
  - recall
  - roc_auc
base_model:
  - distilbert/distilbert-base-uncased
pipeline_tag: text-classification
library_name: transformers
tags:
  - log-analysis
  - anomaly-detection
  - bert
  - huggingface
model-index:
  - name: CloudOpsBERT (distributed-storage)
    results:
      - task:
          type: text-classification
          name: Anomaly Detection
        dataset:
          name: HDFS
          type: honicky/hdfs-logs-encoded-blocks
          split: test
        metrics:
          - type: f1
            value: 0.571
          - type: precision
            value: 0.992
          - type: recall
            value: 0.401
          - type: roc_auc
            value: 0.73
          - type: threshold
            value: 0.5
  - name: CloudOpsBERT (HPC)
    results:
      - task:
          type: text-classification
          name: Anomaly Detection
        dataset:
          name: BGL
          type: Kingslayer5437/BGL
          split: test
        metrics:
          - type: f1
            value: 1
          - type: precision
            value: 1
          - type: recall
            value: 1
          - type: roc_auc
            value: 1
          - type: threshold
            value: 0.05
---

# CloudOpsBERT: Domain-Specific Language Models for Cloud Operations
CloudOpsBERT is an open-source project exploring domain-adapted transformer models for cloud operations log analysis: specifically anomaly detection, reliability monitoring, and cost optimization.

This project fine-tunes lightweight BERT variants (e.g., DistilBERT) on large-scale system log datasets (HDFS, BGL) and provides ready-to-use models for the research and practitioner community.
## Motivation
Modern cloud platforms generate massive amounts of logs. Detecting anomalies in these logs is crucial for:

- Ensuring reliability (catching failures early)
- Improving cost efficiency (identifying waste or misconfigurations)
- Supporting autonomous operations (AIOps)
Generic LLMs and BERT models are not optimized for this domain. CloudOpsBERT bridges that gap by:

- Training on real log datasets (HDFS, BGL)
- Addressing imbalanced anomaly detection with class weighting (see the sketch after this list)
- Publishing open-source checkpoints for reproducibility
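
Class weighting typically enters through the loss function. Below is a minimal sketch of that idea, not the project's actual training code: it subclasses the Hugging Face `Trainer` to apply weighted cross-entropy, and the weight values and model wiring are illustrative assumptions.

```python
# Sketch: class-weighted fine-tuning via a Trainer subclass.
# The weights below are placeholders, not CloudOpsBERT's actual values.
import torch
from torch import nn
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # tensor([w_normal, w_anomaly])

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
trainer = WeightedLossTrainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=None,  # plug in a tokenized log dataset here
    class_weights=torch.tensor([1.0, 10.0]),  # up-weight the rare anomaly class
)
```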
## Inference (Pretrained)
Predict anomaly probability for a single log line:
```bash
python src/predict.py \
  --model_dir vaibhav2507/cloudops-bert \
  --subfolder distributed-storage \
  --text "ERROR dfs.DataNode: Lost connection to namenode"
```
Batch inference (file with one log line per row):
```bash
python src/predict.py \
  --model_dir vaibhav2507/cloudops-bert \
  --subfolder distributed-storage \
  --file samples/sample_logs.txt \
  --threshold 0.5 \
  --jsonl_out predictions.jsonl
```
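
The exports can also be loaded directly with the `transformers` API. A minimal sketch (the `pipeline` wiring is an assumption; the exact label strings come from the checkpoint's config, listed on this card as `normal`/`anomaly`):

```python
# Sketch: load a CloudOpsBERT submodel straight from the Hub.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
)

repo = "vaibhav2507/cloudops-bert"
sub = "distributed-storage"  # HDFS-trained submodel

tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=sub)
model = AutoModelForSequenceClassification.from_pretrained(repo, subfolder=sub)

clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(clf("ERROR dfs.DataNode: Lost connection to namenode"))
# e.g. [{'label': 'anomaly', 'score': ...}] -- label/score depend on the checkpoint
```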
## Results
- HDFS (in-domain, test set)
  - F1: 0.571
  - Precision: 0.992
  - Recall: 0.401
  - AUROC: 0.730
  - Threshold: 0.50 (tuneable; see the sketch after this list)
- Cross-domain (HDFS → BGL)
  - Performance degrades significantly due to dataset/domain shift (see paper).
- BGL (training in progress)
  - Will be released as cloudops-bert (subfolder bgl) once full training is complete.
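
Because the 0.50 threshold trades recall for precision, it is worth re-tuning on a validation split. A minimal sketch with scikit-learn, assuming you already have ground-truth labels and anomaly probabilities (e.g., from the predictions JSONL above):

```python
# Sketch: pick the F1-maximizing decision threshold on validation data.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])                    # placeholder labels
y_prob = np.array([0.1, 0.3, 0.8, 0.2, 0.55, 0.9, 0.05, 0.4])  # placeholder scores

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# precision/recall have one more entry than thresholds; drop the final point.
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best = int(np.argmax(f1))
print(f"best threshold={thresholds[best]:.2f}  F1={f1[best]:.3f}")
```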
## Models
- vaibhav2507/cloudops-bert (Hugging Face Hub)
  - subfolder="distributed-storage" → HDFS-trained CloudOpsBERT
  - subfolder="hpc" → BGL-trained CloudOpsBERT
- Each export includes:
  - Model weights (pytorch_model.bin)
  - Config with label mappings (normal, anomaly; see the check after this list)
  - Tokenizer files
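
To confirm which index maps to which label in a given export, you can inspect its config directly (a quick check using standard `transformers` config fields):

```python
# Sketch: verify the label mapping shipped with an export.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "vaibhav2507/cloudops-bert", subfolder="distributed-storage"
)
print(cfg.id2label)  # expected per this card: {0: 'normal', 1: 'anomaly'}
```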
 
## Quickstart (Scripts)
- Set up folders:

  ```bash
  bash scripts/setup_dirs.sh
  ```

- (Optional) Download a local copy of a submodel from Hugging Face:

  ```bash
  bash scripts/fetch_pretrained.sh                # downloads 'hdfs' by default
  SUBFOLDER=bgl bash scripts/fetch_pretrained.sh  # downloads 'bgl'
  ```

- Single-line prediction (directly from HF):

  ```bash
  bash scripts/predict_line.sh "ERROR dfs.DataNode: Lost connection to namenode" hdfs
  ```

- Batch prediction (using a local model folder):

  ```bash
  bash scripts/make_sample_logs.sh
  bash scripts/predict_file.sh samples/sample_logs.txt hdfs models/cloudops-bert-hdfs preds/preds_hdfs.jsonl
  ```
## Related Work
Several prior works have explored using BERT for log anomaly detection:

- Leveraging BERT and Hugging Face Transformers for Log Anomaly Detection
  - Tutorial-style blog post demonstrating how to fine-tune BERT on log data with Hugging Face. Useful as an introduction, but not intended as a reproducible research artifact.
- LogBERT (HelenGuohx/logbert)
  - Academic prototype from ~2019–2020 focusing on modeling log sequences with BERT. Demonstrates feasibility but is limited to in-domain experiments and lacks integration with modern Hugging Face tooling.
- AnomalyBERT (Jhryu30/AnomalyBERT)
  - Another exploratory repository showing BERT-based anomaly detection on logs, with dataset-specific preprocessing. Similar limitations in generalization and reproducibility.
## How CloudOpsBERT is different
- Domain-specific adaptation: explicitly trained for cloud operations logs (HDFS, BGL) with class-weighted loss.
- Cross-domain evaluation: includes in-domain and cross-domain benchmarks, highlighting generalization challenges.
- Reproducibility & usability: clean repo, scripts, and ready-to-use Hugging Face exports.
- Future directions: introduces MicroLM, compressed micro-language models for efficient edge/cloud hybrid inference.

In short: previous work showed that "BERT can work for logs." CloudOpsBERT operationalizes this idea into reproducible benchmarks, public models, and deployable tools for both researchers and practitioners.
## Citation
If you use CloudOpsBERT in your research or tools, please cite:
```bibtex
@misc{pandey2025cloudopsbert,
  title={CloudOpsBERT: Domain-Specific Transformer Models for Cloud Operations Anomaly Detection},
  author={Pandey, Vaibhav},
  year={2025},
  howpublished={GitHub, Hugging Face},
  url={https://github.com/vaibhav-research/cloudops-bert}
}
```
