A writeup on how to set up filtering during model pretraining.
Method
- Set up 6 classification models to detect CBRN content:
- Fine-tuned constitutional classifier - a language model fine-tuned on data labelled as harmful or harmless
- Prompted constitutional classifier - an LM that is given a prompt to determine CBRN violations - uses Claude 3.5 Haiku
- Holdout Loss - a special canary model for harmful CBRN content
- FastText - a bag-of-words classifier with learnable embeddings
- Named Entities String Match - a list of harmful CBRN entities that are matched against a document
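
As an illustration of the simplest method in the list, here is a minimal sketch of named-entity string matching. It is not the implementation used in this writeup: the entity list is a placeholder, and the function names, case-insensitive matching, and the `min_matches` threshold are all assumptions made for the example.

```python
import re
from typing import Iterable, List

# Hypothetical placeholder entries; a real list would be curated by domain experts.
PLACEHOLDER_ENTITIES = ["placeholder agent a", "placeholder toxin b"]


def build_entity_pattern(entities: Iterable[str]) -> "re.Pattern[str]":
    """Compile one case-insensitive pattern that matches any listed entity."""
    cleaned = [e.strip() for e in entities if e.strip()]
    if not cleaned:
        raise ValueError("entity list is empty")
    # Longest entries first so multi-word entities win over shorter overlaps.
    cleaned.sort(key=len, reverse=True)
    alternatives = "|".join(re.escape(e) for e in cleaned)
    # The lookarounds stop a match from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def match_entities(document: str, pattern: "re.Pattern[str]") -> List[str]:
    """Return the distinct entities found in the document, lowercased and sorted."""
    return sorted({m.group(0).lower() for m in pattern.finditer(document)})


def should_filter(document: str, pattern: "re.Pattern[str]", min_matches: int = 1) -> bool:
    """Flag a document when it contains at least `min_matches` distinct entities."""
    return len(match_entities(document, pattern)) >= min_matches
```

A document would be flagged with `should_filter(text, build_entity_pattern(PLACEHOLDER_ENTITIES))`. Raising `min_matches` trades recall for precision, since a single entity mention can also appear in benign text.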
