HailBERT-de-v1
Hail Phase Classification Model (H0 / H1 / H2) — DeBERTa v3 (German)
Description
HailBERT-de-v1 is a German-language transformer model for the automatic classification of text passages describing hail events. It was developed at the Chair of Physical Geography at the Institute of Environmental Social Sciences and Geography to systematically screen historical written sources for hail descriptions and thereby support climatological analyses.
The model provides an AI-based tool for the quantitative evaluation of historical climate reports. By automatically identifying and categorising hail descriptions in German-language documents, it contributes to the reconstruction of extreme weather events and supports interdisciplinary research bridging climatology, environmental history, and artificial intelligence.
HailBERT-de-v1 was trained on the dataset "Historical Climate Observations of Thunderstorms and Hail in Central Europe (1000–1900)" (DOI: 10.60493/834bd-mww13), which contains manually annotated historical climate observations from the years 1000 to 1900.
The model predicts three categories:
- H0 = Passage without a hail event.
- H1 = Passage describing hail without hail impact (hailstones smaller than 2 cm).
- H2 = Passage describing hail impact or large accumulation (hailstones 2 cm or larger). H2 includes size comparisons (e.g., as large as a walnut or pigeon's egg), documented damage (e.g., broken branches, damaged crops, shattered windowpanes, roof damage, stripped bark), and quantity descriptions (e.g., hail knee-deep in the streets). Relative and comparative descriptions are also recognized.
Model Details
- Type: Transformer encoder with classification head
- Architecture: Based on microsoft/deberta-v3-base
- Language: German
- Task: Multi-class text classification
- Classes: H0 – no hail / H1 – hail without impact / H2 – hail with impact
- License: Apache-2.0
- Developer: Franck Schätz (2025)
Training Data
- Manually annotated German historical weather descriptions (see DOI 10.60493/834bd-mww13)
- Labels: H0, H1, H2.
- Training/validation/test split: 80% / 10% / 10%.
- Cases labeled "-99" in the dataset represent ambiguous descriptions that cannot be assigned clearly to one of the three categories and were removed before training.
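The filtering and split described above can be sketched in pure Python. The record layout (`text`/`label` keys) and the seed value are assumptions for illustration, not part of the published dataset schema:

```python
import random

def prepare_splits(records, seed=42):
    """Drop ambiguous '-99' cases, then split 80/10/10 into train/val/test."""
    # Keep only passages with a definite H0/H1/H2 label.
    clean = [r for r in records if r["label"] != -99]
    rng = random.Random(seed)
    rng.shuffle(clean)
    n_train = int(len(clean) * 0.8)
    n_val = int(len(clean) * 0.1)
    train = clean[:n_train]
    val = clean[n_train:n_train + n_val]
    test = clean[n_train + n_val:]
    return train, val, test
```

Fixing the seed makes the split reproducible across runs, which matters when comparing checkpoints trained on the same data.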
Training Procedure and Hyperparameters
- Mixed precision (fp16)
- Learning rate 3e-5
- max sequence length 256
- batch size 16
- epochs 5
- weight decay 0.01
- warmup ratio 0.06
- cosine learning rate scheduler
- label smoothing 0.05
- seed 42.
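The warmup-plus-cosine schedule listed above can be reproduced in a few lines of pure Python. This is an illustrative reimplementation of the schedule's shape, not the exact scheduler code used in training:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-5, warmup_ratio=0.06):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: learning rate rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: cosine curve from peak_lr down to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 5 epochs and batch size 16, `total_steps` is simply the number of training batches times 5; the first 6% of those steps are spent warming up.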
Evaluation Results
- Validation / Test performance:
- Accuracy: 0.956 / 0.964
- F1 macro: 0.910 / 0.933 ± 0.003
- F1 weighted: 0.956 / 0.956
- Loss: 0.33 / 0.29
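The gap between macro and weighted F1 above reflects class imbalance (passages without hail dominate the corpus). The two averages can be checked with a stdlib-only implementation such as this sketch:

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels=(0, 1, 2)):
    """Per-class F1, plus macro (unweighted) and support-weighted averages."""
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); define as 0 if the class never occurs.
        per_class[c] = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return per_class, macro, weighted
```

Macro F1 treats the rare H1/H2 classes as equally important as H0, which is why it sits below the weighted score here.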
Intended Use
- The model is designed for event extraction pipelines, historical climate reconstruction, and preprocessing for environmental humanities research.
Out-of-Scope Use
- Not suitable for modern social media texts, creative writing or generative applications, or semantic search without further adaptation.
Bias, Risks, and Limitations
- The model reflects patterns in the historical corpus used for training.
- It may confuse heavy rain, sleet, or snowstorms with hail.
- Spelling variability (especially before 1900) can reduce accuracy.
- Manual validation is recommended for high-stakes scientific work.
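One way to mitigate the spelling-variability issue is light orthographic normalization before inference. The substitutions below are illustrative examples of common pre-1900 German conventions, not a tested preprocessing step of this model, and crude string replacement like this can overcorrect:

```python
def normalize_historical(text):
    """Map a few common pre-1900 German spellings to modern forms (illustrative)."""
    replacements = {
        "ſ": "s",    # long s, e.g. "Waſſer" -> "Wasser"
        "Th": "T",   # e.g. "Thal" -> "Tal"
        "th": "t",
        "ey": "ei",  # e.g. "bey" -> "bei"
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
```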
How to Get Started
Load using Hugging Face Transformers:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Stickmu/HailBERT-de-v1")
model = AutoModelForSequenceClassification.from_pretrained("Stickmu/HailBERT-de-v1")

text = "Am 12. Juli 1845 fiel in der Gegend um Ulm ein heftiges Hagelwetter."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
label_id = outputs.logits.argmax(-1).item()

labels = {
    0: "H0 – no hail",
    1: "H1 – hail without impact (<2 cm)",
    2: "H2 – hail with impact (≥2 cm)",
}
print(labels[label_id])
```
The tokenizer encodes the text, and the model's logits determine the predicted label (0 = H0, 1 = H1, 2 = H2).
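To report a confidence alongside the predicted label, the raw logits can be converted to probabilities with a softmax. Here is a stdlib-only version; the logit values are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw classifier logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (H0, H1, H2):
probs = softmax([-1.2, 0.3, 2.9])
```

A low maximum probability is a useful trigger for routing a passage to manual review, as recommended above for high-stakes work.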
Files
- HailBERT-de-v1_FreiData.zip (831.7 MB, md5:9de5e52039b2513a8c2a41fb06618e32)
Additional details
Related works
- Is derived from: Dataset 10.60493/834bd-mww13 (DOI)
- Is identical to: Software 10.57967/hf/6989 (DOI)
- Is supplemented by: Software 10.60493/4jphm-mv506 (DOI)