Data Deduplication
Removing 386 duplicate support tickets for evaluation integrity.
Noise Reduction
Cleaning IT slang and standardizing mixed-case inputs.
BIO Tag Alignment
Mapping word-level NER tags to DistilBERT subword offsets.
Label Multi-Mapping
Encoding Category, Sub-Category, and Priority labels to unique IDs.
Stratified 80/20 Split
Balanced training/test sets locked by label distribution.
Tensor Conversion
Input formatting into PyTorch Tensors for GPU fine-tuning.
Output Representation View
// Original Input
"wifi is entirely dead in conference room B"
// Tokenized State
input_ids: [101, 7523, 2003, ... 3154, 102]
attention_mask: [1, 1, 1, ... 1, 1]
// Resulting Labels
label_id: 8 NETWORK | WIFI
ner_labels: [-100, 3, 0, ... 7, -100] BIO-TAGGED