HailBERT-de-v1
Hail Phase Classification Model (H0 / H1 / H2) — DeBERTa v3 (German)
Description
HailBERT-de-v1 is a German-language transformer model for the automatic classification of text passages describing hail events. It was developed at the Chair of Physical Geography at the Institute of Environmental Social Sciences and Geography to systematically screen historical written sources for hail descriptions and thereby support climatological analyses.
The model provides an AI-based tool for the quantitative evaluation of historical climate reports. By automatically identifying and categorising hail descriptions in German-language documents, it contributes to the reconstruction of extreme weather events and supports interdisciplinary research bridging climatology, environmental history, and artificial intelligence.
HailBERT-de-v1 was trained on the dataset "Historical Climate Observations of Thunderstorms and Hail in Central Europe (1000–1900)" (DOI: 10.60493/834bd-mww13), which contains manually annotated historical climate observations from the years 1000 to 1900.
The model predicts three categories:
- H0 = Passage without a hail event.
- H1 = Passage describing hail without hail impact (hailstones smaller than 2 cm).
- H2 = Passage describing hail impact or large accumulation (hailstones 2 cm or larger). H2 includes size comparisons (e.g., as large as a walnut or pigeon's egg), documented damage (e.g., broken branches, damaged crops, shattered windowpanes, roof damage, stripped bark), and quantity descriptions (e.g., hail knee-deep in the streets). Relative and comparative descriptions are also recognized.
Model Details
- Type: Transformer encoder with classification head
- Architecture: Based on microsoft/deberta-v3-base
- Language: German
- Task: Multi-class text classification
- Classes: H0 – no hail / H1 – hail without impact / H2 – hail with impact
- License: Apache-2.0
- Developer: Franck Schätz (2025)
Training Data
- Manually annotated German historical weather descriptions (see DOI 10.60493/834bd-mww13)
- Labels: H0, H1, H2.
- Training/validation/test split: 80% / 10% / 10%.
- Cases labeled "-99" in the dataset represent ambiguous descriptions that cannot be assigned clearly to one of the three categories and were removed before training.
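The filtering and split described above can be sketched in pure Python. The record layout (`text`/`label` keys) and the seed value are assumptions for illustration, not part of the published dataset schema:

```python
import random

def prepare_splits(records, seed=42):
    """Drop ambiguous '-99' cases, then split 80/10/10 into train/val/test."""
    # Keep only passages with a definite H0/H1/H2 label.
    clean = [r for r in records if r["label"] != -99]
    rng = random.Random(seed)
    rng.shuffle(clean)
    n_train = int(len(clean) * 0.8)
    n_val = int(len(clean) * 0.1)
    train = clean[:n_train]
    val = clean[n_train:n_train + n_val]
    test = clean[n_train + n_val:]
    return train, val, test
```

Fixing the seed makes the split reproducible across runs, which matters when comparing checkpoints trained on the same data.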
Training Procedure and Hyperparameters
- Mixed precision (fp16)
- Learning rate 3e-5
- max sequence length 256
- batch size 16
- epochs 5
- weight decay 0.01
- warmup ratio 0.06
- cosine learning rate scheduler
- label smoothing 0.05
- seed 42.
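The warmup-plus-cosine schedule listed above can be reproduced in a few lines of pure Python. This is an illustrative reimplementation of the schedule's shape, not the exact scheduler code used in training:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-5, warmup_ratio=0.06):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: learning rate rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: cosine curve from peak_lr down to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 5 epochs and batch size 16, `total_steps` is simply the number of training batches times 5; the first 6% of those steps are spent warming up.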
Evaluation Results
- Validation / Test performance:
- Accuracy: 0.956 / 0.964
- F1 macro: 0.910 / 0.933 ± 0.003
- F1 weighted: 0.956 / 0.956
- Loss: 0.33 / 0.29
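The gap between macro and weighted F1 above reflects class imbalance (passages without hail dominate the corpus). The two averages can be checked with a stdlib-only implementation such as this sketch:

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels=(0, 1, 2)):
    """Per-class F1, plus macro (unweighted) and support-weighted averages."""
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); define as 0 if the class never occurs.
        per_class[c] = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return per_class, macro, weighted
```

Macro F1 treats the rare H1/H2 classes as equally important as H0, which is why it sits below the weighted score here.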
Intended Use
- The model is designed for event extraction pipelines, historical climate reconstruction, and preprocessing for environmental humanities research.
Out-of-Scope Use
- Not suitable for modern social media texts, creative writing or generative applications, or semantic search without further adaptation.
Bias, Risks, and Limitations
- The model reflects patterns in the historical corpus used for training.
- It may confuse heavy rain, sleet, or snowstorms with hail.
- Spelling variability (especially before 1900) can reduce accuracy.
- Manual validation is recommended for high-stakes scientific work.
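One way to mitigate the spelling-variability issue is light orthographic normalization before inference. The substitutions below are illustrative examples of common pre-1900 German conventions, not a tested preprocessing step of this model, and crude string replacement like this can overcorrect:

```python
def normalize_historical(text):
    """Map a few common pre-1900 German spellings to modern forms (illustrative)."""
    replacements = {
        "ſ": "s",    # long s, e.g. "Waſſer" -> "Wasser"
        "Th": "T",   # e.g. "Thal" -> "Tal"
        "th": "t",
        "ey": "ei",  # e.g. "bey" -> "bei"
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
```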
How to Get Started
Load using Hugging Face Transformers:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Stickmu/HailBERT-de-v1")
model = AutoModelForSequenceClassification.from_pretrained("Stickmu/HailBERT-de-v1")

text = "Am 12. Juli 1845 fiel in der Gegend um Ulm ein heftiges Hagelwetter."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
label_id = outputs.logits.argmax(-1).item()

labels = {
    0: "H0 – no hail",
    1: "H1 – hail without impact (<2 cm)",
    2: "H2 – hail with impact (≥2 cm)",
}
print(labels[label_id])
```
The tokenizer encodes the text, and the model's logits determine the predicted label (0 = H0, 1 = H1, 2 = H2).
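To report a confidence alongside the predicted label, the raw logits can be converted to probabilities with a softmax. Here is a stdlib-only version; the logit values are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw classifier logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (H0, H1, H2):
probs = softmax([-1.2, 0.3, 2.9])
```

A low maximum probability is a useful trigger for routing a passage to manual review, as recommended above for high-stakes work.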
Files
- HailBERT-de-v1_FreiData.zip (831.7 MB, md5:9de5e52039b2513a8c2a41fb06618e32)
Additional details
Related works
- Is derived from: Dataset 10.60493/834bd-mww13 (DOI)
- Is identical to: Software 10.57967/hf/6989 (DOI)
- Is supplemented by: Software 10.60493/4jphm-mv506 (DOI)