Published November 12, 2025 | Version 1.0
Software | Open Access

HailBERT-de-v1

  • University of Freiburg

Description

Hail Phase Classification Model (H0 / H1 / H2) — DeBERTa v3 (German)

HailBERT-de-v1 is a German-language transformer model for the automatic classification of text passages describing hail events. It was developed at the Chair of Physical Geography at the Institute of Environmental Social Sciences and Geography to systematically screen historical written sources for hail descriptions and thereby support climatological analyses.

The model provides an AI-based tool for the quantitative evaluation of historical climate reports. By automatically identifying and categorising hail descriptions in German-language documents, it contributes to the reconstruction of extreme weather events and supports interdisciplinary research bridging climatology, environmental history, and artificial intelligence.

HailBERT-de-v1 was trained on the dataset "Historical Climate Observations of Thunderstorms and Hail in Central Europe (1000–1900)" (DOI: 10.60493/834bd-mww13), which contains manually annotated historical climate observations from the years 1000 to 1900.


The model predicts three categories:

  1. H0 = Passage without a hail event.
  2. H1 = Passage describing hail without hail impact (hailstones smaller than 2 cm).
  3. H2 = Passage describing hail impact or large accumulation (hailstones 2 cm or larger). H2 includes size comparisons (e.g., as large as a walnut or pigeon's egg), documented damage (e.g., broken branches, damaged crops, shattered windowpanes, roof damage, stripped bark), and quantity descriptions (e.g., hail knee-deep in the streets). Relative and comparative descriptions are also recognized.

Model Details

  • Type: Transformer encoder with classification head
  • Architecture: Based on microsoft/deberta-v3-base
  • Language: German
  • Task: Multi-class text classification
  • Classes: H0 – no hail / H1 – hail without impact / H2 – hail with impact
  • License: Apache-2.0
  • Developer: Franck Schätz (2025)

Training Data

  • Manually annotated German historical weather descriptions (see DOI 10.60493/834bd-mww13)
  • Labels: H0, H1, H2
  • Training/validation/test split: 80% / 10% / 10%
  • Cases labeled "-99" in the dataset represent ambiguous descriptions that cannot be assigned clearly to one of the three categories and were removed before training.
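The removal of ambiguous "-99" cases can be sketched as a simple pre-filter. The record structure below is illustrative only; the actual dataset schema may differ:

```python
# Hypothetical sketch: drop ambiguous "-99" cases before training.
# The field names and example texts are illustrative, not the dataset's schema.
records = [
    {"text": "Hagel so groß wie Taubeneier", "label": 2},
    {"text": "unklare Beschreibung", "label": -99},
    {"text": "leichter Hagel ohne Schaden", "label": 1},
    {"text": "starker Regen ohne Hagel", "label": 0},
]

# Keep only passages with a definite H0/H1/H2 annotation
clean = [r for r in records if r["label"] != -99]
print(len(clean))  # 3
```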

Training Procedure and Hyperparameters

  • Mixed precision (fp16)
  • Learning rate 3e-5
  • Max sequence length 256
  • Batch size 16
  • Epochs 5
  • Weight decay 0.01
  • Warmup ratio 0.06
  • Cosine learning rate scheduler
  • Label smoothing 0.05
  • Seed 42
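The hyperparameters above map onto a Hugging Face `TrainingArguments` configuration roughly as follows. This is a sketch, not the authors' actual training script; `output_dir` is an illustrative addition, and the maximum sequence length is applied on the tokenizer side rather than here:

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters as TrainingArguments;
# output_dir is illustrative, not part of the published configuration.
args = TrainingArguments(
    output_dir="hailbert-de-v1",
    fp16=True,                       # mixed precision
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.06,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.05,
    seed=42,
)

# The max sequence length of 256 belongs in tokenization, e.g.:
# tokenizer(texts, truncation=True, max_length=256)
```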

Evaluation Results (Test Set)

  • Validation / Test performance:
    • Accuracy: 0.956 / 0.964
    • F1 macro: 0.910 / 0.933 ± 0.003
    • F1 weighted: 0.956 / 0.956
    • Loss: 0.33 / 0.29
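For readers unfamiliar with the reported metrics: macro F1 is the unweighted mean of the per-class F1 scores, while weighted F1 weights each class by its support. A toy sketch with invented labels (not the model's actual predictions) illustrates the computation:

```python
# Toy example of accuracy, macro F1, and weighted F1 for three classes
# (0 = H0, 1 = H1, 2 = H2). The labels below are invented for illustration.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

def f1(cls):
    # Per-class F1 from true positives, false positives, false negatives
    tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
    fp = sum(p == cls and t != cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1(c) for c in (0, 1, 2)) / 3          # unweighted mean
support = [y_true.count(c) for c in (0, 1, 2)]
weighted_f1 = sum(f1(c) * s
                  for c, s in zip((0, 1, 2), support)) / len(y_true)
print(accuracy)  # 0.75
```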

Intended Use

  • The model is designed for event extraction pipelines, historical climate reconstruction, and preprocessing for environmental humanities research.

Out-of-Scope Use

  • Not suitable for modern social media texts, creative writing or generative applications, or semantic search without further adaptation.

Bias, Risks, and Limitations

  • The model reflects patterns in the historical corpus used for training.
  • It may confuse heavy rain, sleet, or snowstorms with hail.
  • Spelling variability (especially before 1900) can reduce accuracy.
  • Manual validation is recommended for high-stakes scientific work.

How to Get Started


Load using Hugging Face Transformers:


from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained("Stickmu/HailBERT-de-v1")
model = AutoModelForSequenceClassification.from_pretrained("Stickmu/HailBERT-de-v1")

text = "Am 12. Juli 1845 fiel in der Gegend um Ulm ein heftiges Hagelwetter."

# Tokenize (matching the training max length) and classify
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
label_id = outputs.logits.argmax(-1).item()

labels = {
    0: "H0 – no hail",
    1: "H1 – hail without impact (<2 cm)",
    2: "H2 – hail with impact (≥2 cm)",
}
print(labels[label_id])


The tokenizer encodes the text, and the index of the largest logit determines the predicted label (0 = H0, 1 = H1, 2 = H2).
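Since manual validation is recommended for high-stakes work, it can help to attach a confidence score to each prediction and flag low-confidence passages for review. A minimal sketch, using a softmax over illustrative logit values rather than real model output:

```python
import math

# Sketch: turn raw logits into a label plus a confidence score so that
# low-confidence passages can be flagged for manual review.
# These logit values are illustrative, not actual model output.
logits = [0.2, 2.9, 0.4]  # one passage, classes H0 / H1 / H2

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]          # softmax
label_id = max(range(3), key=probs.__getitem__)
confidence = probs[label_id]

print(label_id)  # 1 (H1)
needs_review = confidence < 0.9  # threshold is an arbitrary example
```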

 

Files

Files (831.7 MB)

  • HailBERT-de-v1_FreiData.zip (831.7 MB, md5:9de5e52039b2513a8c2a41fb06618e32)

Additional details

Related works

Is derived from
Dataset: 10.60493/834bd-mww13 (DOI)
Is identical to
Software: 10.57967/hf/6989 (DOI)
Is supplemented by
Software: 10.60493/4jphm-mv506 (DOI)