---
license: apache-2.0
datasets:
- honicky/hdfs-logs-encoded-blocks
- Kingslayer5437/BGL
language:
- en
metrics:
- f1
- precision
- recall
- roc_auc
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
library_name: transformers
tags:
- log-analysis
- anomaly-detection
- bert
- huggingface
model-index:
- name: CloudOpsBERT (distributed-storage)
  results:
  - task:
      type: text-classification
      name: Anomaly Detection
    dataset:
      name: HDFS
      type: honicky/hdfs-logs-encoded-blocks
      split: test
    metrics:
    - type: f1
      value: 0.571
    - type: precision
      value: 0.992
    - type: recall
      value: 0.401
    - type: auroc
      value: 0.73
    - type: threshold
      value: 0.5
- name: CloudOpsBERT (HPC)
  results:
  - task:
      type: text-classification
      name: Anomaly Detection
    dataset:
      name: BGL
      type: Kingslayer5437/BGL
      split: test
    metrics:
    - type: f1
      value: 1.00
    - type: precision
      value: 1.00
    - type: recall
      value: 1.00
    - type: auroc
      value: 1.00
    - type: threshold
      value: 0.05
---
# CloudOpsBERT: Domain-Specific Language Models for Cloud Operations

CloudOpsBERT is an open-source project exploring **domain-adapted transformer models** for **cloud operations log analysis**: specifically anomaly detection, reliability monitoring, and cost optimization.

This project fine-tunes lightweight BERT variants (e.g., DistilBERT) on large-scale system log datasets (HDFS, BGL) and provides ready-to-use models for the research and practitioner community.
---
## Motivation
Modern cloud platforms generate massive amounts of logs. Detecting anomalies in these logs is crucial for:
- Ensuring **reliability** (catching failures early),
- Improving **cost efficiency** (identifying waste or misconfigurations),
- Supporting **autonomous operations** (AIOps).
Generic LLMs and BERT models are not optimized for this domain. CloudOpsBERT bridges that gap by:
- Training on **real log datasets** (HDFS, BGL),
- Addressing **imbalanced anomaly detection** with class weighting,
- Publishing **open-source checkpoints** for reproducibility.
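
The class-weighting idea can be sketched in a few lines. The label counts below are illustrative placeholders, not the actual HDFS/BGL statistics:

```python
# Inverse-frequency class weights for an imbalanced log dataset.
# Counts are illustrative placeholders, not the real HDFS/BGL statistics.
counts = {"normal": 97_000, "anomaly": 3_000}

total = sum(counts.values())
num_classes = len(counts)

# weight_c = total / (num_classes * count_c): rare classes get larger weights
weights = {label: total / (num_classes * n) for label, n in counts.items()}

# During fine-tuning these would be passed to the loss, e.g.
# torch.nn.CrossEntropyLoss(weight=torch.tensor([weights["normal"], weights["anomaly"]]))
print(weights)  # anomaly weight ≈ 16.67, normal ≈ 0.52
```

With weighting, each missed anomaly contributes roughly 32x more to the loss than a missed normal line, which counteracts the skewed label distribution.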
---
## Inference (Pretrained)
Predict anomaly probability for a single log line:
```bash
python src/predict.py \
  --model_dir vaibhav2507/cloudops-bert \
  --subfolder distributed-storage \
  --text "ERROR dfs.DataNode: Lost connection to namenode"
```
Batch inference (file with one log line per row):
```bash
python src/predict.py \
  --model_dir vaibhav2507/cloudops-bert \
  --subfolder distributed-storage \
  --file samples/sample_logs.txt \
  --threshold 0.5 \
  --jsonl_out predictions.jsonl
```
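
Because the `--jsonl_out` records store probabilities, you can re-apply a different decision threshold afterwards without re-running the model. A minimal sketch, assuming each record carries an anomaly-probability field (the `p_anomaly` and `text` field names here are illustrative; check the actual `predict.py` output schema):

```python
import json

def relabel(jsonl_text: str, threshold: float = 0.5):
    """Re-apply a decision threshold to saved prediction records."""
    results = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        rec["label"] = "anomaly" if rec["p_anomaly"] >= threshold else "normal"
        results.append(rec)
    return results

# Two fake records standing in for predictions.jsonl
sample = "\n".join([
    json.dumps({"text": "INFO dfs.DataNode: block served", "p_anomaly": 0.02}),
    json.dumps({"text": "ERROR dfs.DataNode: Lost connection", "p_anomaly": 0.91}),
])
print([r["label"] for r in relabel(sample)])            # ['normal', 'anomaly']
print([r["label"] for r in relabel(sample, 0.01)])      # ['anomaly', 'anomaly']
```

Lowering the threshold trades precision for recall, which matters for the HDFS results below where recall is the limiting metric.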
## Results
* HDFS (in-domain, test set)
  * F1: 0.571
  * Precision: 0.992
  * Recall: 0.401
  * AUROC: 0.730
  * Threshold: 0.50 (tuneable)
* Cross-domain (HDFS → BGL)
  * Performance degrades significantly due to dataset/domain shift (see paper).
* BGL (training in progress)
  * Will be released as `cloudops-bert` (subfolder `bgl`) once full training is complete.
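
The reported F1 follows directly from the precision/recall pair via the harmonic mean, which is a quick sanity check on the numbers above and on any threshold you tune:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# HDFS test-set numbers from the list above
print(round(f1_score(0.992, 0.401), 3))  # 0.571
```

The harmonic mean is dominated by the smaller of the two values, which is why the low recall (0.401) pulls F1 down to 0.571 despite near-perfect precision.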
## Models
* vaibhav2507/cloudops-bert (Hugging Face Hub)
  * subfolder="distributed-storage" → HDFS-trained CloudOpsBERT
  * subfolder="hpc" → BGL-trained CloudOpsBERT
* Each export includes:
  * Model weights (pytorch_model.bin)
  * Config with label mappings (normal, anomaly)
  * Tokenizer files
## Quickstart (Scripts)
 1) Setup folders
```bash
bash scripts/setup_dirs.sh
```
 2) (optional) Download a local copy of a submodel from Hugging Face
```bash
bash scripts/fetch_pretrained.sh                # downloads 'hdfs' by default
SUBFOLDER=bgl bash scripts/fetch_pretrained.sh  # downloads 'bgl'
```
 3) Single-line prediction (directly from HF)
```bash
bash scripts/predict_line.sh "ERROR dfs.DataNode: Lost connection to namenode" hdfs
```
 4) Batch prediction (using local model folder)
```bash
bash scripts/make_sample_logs.sh
bash scripts/predict_file.sh samples/sample_logs.txt hdfs models/cloudops-bert-hdfs preds/preds_hdfs.jsonl
```
## Related Work
Several prior works have explored using BERT for log anomaly detection:
- **Leveraging BERT and Hugging Face Transformers for Log Anomaly Detection**
  - Tutorial-style blog post demonstrating how to fine-tune BERT on log data with Hugging Face. Useful as an introduction, but not intended as a reproducible research artifact.
- **LogBERT** (HelenGuohx/logbert)
  - Academic prototype from ~2019–2020 focusing on modeling log sequences with BERT. Demonstrates feasibility but limited to in-domain experiments and lacks integration with modern Hugging Face tooling.
- **AnomalyBERT** (Jhryu30/AnomalyBERT)
  - Another exploratory repository showing BERT-based anomaly detection on logs, with dataset-specific preprocessing. Similar limitations in generalization and reproducibility.
## How CloudOpsBERT is different
- Domain-specific adaptation: explicitly trained for cloud operations logs (HDFS, BGL) with class-weighted loss.
- Cross-domain evaluation: includes in-domain and cross-domain benchmarks, highlighting generalization challenges.
- Reproducibility & usability: clean repo, scripts, and ready-to-use Hugging Face exports.
- Future directions: introduces MicroLM, compressed micro-language models for efficient edge/cloud hybrid inference.

In short: previous work showed that "BERT can work for logs." CloudOpsBERT operationalizes this idea into reproducible benchmarks, public models, and deployable tools for both researchers and practitioners.
## Citation
If you use CloudOpsBERT in your research or tools, please cite:
```bibtex
@misc{pandey2025cloudopsbert,
  title={CloudOpsBERT: Domain-Specific Transformer Models for Cloud Operations Anomaly Detection},
  author={Pandey, Vaibhav},
  year={2025},
  howpublished={GitHub, Hugging Face},
  url={https://github.com/vaibhav-research/cloudops-bert}
}
```
