Distributed training with PyTorch FSDP vs DeepSpeed - practical comparison?