A course offered by BlueDot.
Notes
Week 1
- AI Safety Camps
- Stockfish AI Cheating
- AI Alignment
- AI Impacts
- Specification Gaming
- Young Business Person Alignment Analogy
- MechaHitler
- Multi-Agent
- Agent Miscoordination
- Ecosystem Graph
Week 2
- Input Data Filtration
- Machine Unlearning
- Deep Ignorance
- Anthropic Pretraining Filtering Guidelines
- LLM Training Data Poisoning
- RLHF
- Reinforcement Learning from AI Feedback
- CICERO
- CoastRunners RL Failure
- Scalable Oversight
- DeepMind Frontier Safety Framework
Week 3
- AI Evaluations
- Responsible Scaling Policy
- Model Cards
- Algorithmic Auditing
- Anthropic AI Blackmail
- OpenAI Shutdown Avoidance
- 3CB
- HonestCyberBenchmark
- Autonomous Replication and Adaptation
- Anthropic Responsible Scaling Policy
- Confessions
Week 4
- Model Interpretability
- Transformation Robustness
- Polysemanticity
- Feature Steering
- Model Circuit
- Universality Hypothesis
- Swiss Cheese Defense
- Practical Mechanistic Interpretability
- Reverse Engineer Complexity Pitfall
- Polanyi’s Paradox
- Feature Visualization
- Saliency Map
- BERT Interpretability Illusion
- Chinchilla Circuit Analysis
- SAE
- Top-Down Interpretability
- MRI Scan for AI
- Probe
- Chain of Thought Monitorability
- Model Organisms of Misalignment
- Hallucination Probe
Week 5
- Attack Policy
- P(Doom)
- AI Control
- AI Firewall
- Constitutional Classifier
- AI-Enabled Coups
- Gradual Disempowerment