Models struggle on long-term week-long or month-long tasks.