A method to benchmark AI systems against automated code patching