A writeup on how to set up filtering during model pretraining.

Method

  • Set up 6 classification models to detect CBRN content:
    • Fine-tuned constitutional classifier - a language model fine-tuned on data labelled harmful vs harmless
    • Prompted constitutional classifier - an LM that is given a prompt to determine CBRN violations - uses Claude 3.5 Haiku
    • Holdout Loss - a special canary model for harmful CBRN content
    • FastText - a Bag of Words classifier with learnable embeddings
    • Named Entities String Match - a list of harmful CBRN entities that are matched against a document
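As a rough illustration of the named entities string match, here is a minimal Python sketch. It assumes case-insensitive, whole-word matching and a threshold on the number of distinct entities found. The entity list, the function names (build_matcher, match_entities, is_flagged) and the min_hits parameter are all hypothetical placeholders and are not taken from the actual pipeline.

```python
import re

# Hypothetical placeholder terms; the real entity list is not given in this writeup.
ENTITY_LIST = ["example agent a", "example toxin b"]


def build_matcher(entities):
    """Compile one case-insensitive regex that matches any entity as a whole word or phrase."""
    # Longest terms first so that a longer phrase wins over a shorter one it contains.
    escaped = sorted((re.escape(e) for e in entities), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def match_entities(document, matcher):
    """Return the sorted, lowercased set of distinct entities found in the document."""
    return sorted({m.group(0).lower() for m in matcher.finditer(document)})


def is_flagged(document, matcher, min_hits=1):
    """Flag the document if at least min_hits distinct entities appear (assumed rule)."""
    return len(match_entities(document, matcher)) >= min_hits
```

A caller would build the matcher once with build_matcher(ENTITY_LIST) and reuse it across documents, since compiling the pattern is the expensive step. Whether a single hit or several distinct hits should trigger a flag is a design choice this sketch leaves open through min_hits.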