[2024-06-13 11:10:25,167] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus
[2024-06-13 11:10:25,260] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-13 11:10:25,260] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-13 11:10:25,260] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-13 11:10:25,260] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-13 11:10:29,461] INFO: using dtype=torch.float32
[2024-06-13 11:10:30,596] INFO: using attention_type=math
[2024-06-13 11:10:30,614] INFO: using attention_type=math
[2024-06-13 11:10:30,632] INFO: using attention_type=math
[2024-06-13 11:10:30,650] INFO: using attention_type=math
[2024-06-13 11:10:30,669] INFO: using attention_type=math
[2024-06-13 11:10:30,687] INFO: using attention_type=math
[2024-06-13 11:10:33,259] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-06-13 11:10:33,260] INFO: Backbone Trainable parameters: 11671568
[2024-06-13 11:10:33,260] INFO: Backbone Non-trainable parameters: 0
[2024-06-13 11:10:33,260] INFO: Backbone Total parameters: 11671568
[2024-06-13 11:10:33,263] INFO: Modules Trainable parameters Non-tranable parameters
module.nn0_id.0.weight 8704 0
module.nn0_id.0.bias 512 0
module.nn0_id.2.weight 512 0
module.nn0_id.2.bias 512 0
module.nn0_id.4.weight 262144 0
module.nn0_id.4.bias 512 0
module.nn0_reg.0.weight 8704 0
module.nn0_reg.0.bias 512 0
module.nn0_reg.2.weight 512 0
module.nn0_reg.2.bias 512 0
module.nn0_reg.4.weight 262144 0
module.nn0_reg.4.bias 512 0
module.conv_id.0.mha.in_proj_weight 786432 0
module.conv_id.0.mha.in_proj_bias 1536 0
module.conv_id.0.mha.out_proj.weight 262144 0
module.conv_id.0.mha.out_proj.bias 512 0
module.conv_id.0.norm0.weight 512 0
module.conv_id.0.norm0.bias 512 0
module.conv_id.0.norm1.weight 512 0
module.conv_id.0.norm1.bias 512 0
module.conv_id.0.seq.0.weight 262144 0
module.conv_id.0.seq.0.bias 512 0
module.conv_id.0.seq.2.weight 262144 0
module.conv_id.0.seq.2.bias 512 0
module.conv_id.1.mha.in_proj_weight 786432 0
module.conv_id.1.mha.in_proj_bias 1536 0
module.conv_id.1.mha.out_proj.weight 262144 0
module.conv_id.1.mha.out_proj.bias 512 0
module.conv_id.1.norm0.weight 512 0
module.conv_id.1.norm0.bias 512 0
module.conv_id.1.norm1.weight 512 0
module.conv_id.1.norm1.bias 512 0
module.conv_id.1.seq.0.weight 262144 0
module.conv_id.1.seq.0.bias 512 0
module.conv_id.1.seq.2.weight 262144 0
module.conv_id.1.seq.2.bias 512 0
module.conv_id.2.mha.in_proj_weight 786432 0
module.conv_id.2.mha.in_proj_bias 1536 0
module.conv_id.2.mha.out_proj.weight 262144 0
module.conv_id.2.mha.out_proj.bias 512 0
module.conv_id.2.norm0.weight 512 0
module.conv_id.2.norm0.bias 512 0
module.conv_id.2.norm1.weight 512 0
module.conv_id.2.norm1.bias 512 0
module.conv_id.2.seq.0.weight 262144 0
module.conv_id.2.seq.0.bias 512 0
module.conv_id.2.seq.2.weight 262144 0
module.conv_id.2.seq.2.bias 512 0
module.conv_reg.0.mha.in_proj_weight 786432 0
module.conv_reg.0.mha.in_proj_bias 1536 0
module.conv_reg.0.mha.out_proj.weight 262144 0
module.conv_reg.0.mha.out_proj.bias 512 0
module.conv_reg.0.norm0.weight 512 0
module.conv_reg.0.norm0.bias 512 0
module.conv_reg.0.norm1.weight 512 0
module.conv_reg.0.norm1.bias 512 0
module.conv_reg.0.seq.0.weight 262144 0
module.conv_reg.0.seq.0.bias 512 0
module.conv_reg.0.seq.2.weight 262144 0
module.conv_reg.0.seq.2.bias 512 0
module.conv_reg.1.mha.in_proj_weight 786432 0
module.conv_reg.1.mha.in_proj_bias 1536 0
module.conv_reg.1.mha.out_proj.weight 262144 0
module.conv_reg.1.mha.out_proj.bias 512 0
module.conv_reg.1.norm0.weight 512 0
module.conv_reg.1.norm0.bias 512 0
module.conv_reg.1.norm1.weight 512 0
module.conv_reg.1.norm1.bias 512 0
module.conv_reg.1.seq.0.weight 262144 0
module.conv_reg.1.seq.0.bias 512 0
module.conv_reg.1.seq.2.weight 262144 0
module.conv_reg.1.seq.2.bias 512 0
module.conv_reg.2.mha.in_proj_weight 786432 0
module.conv_reg.2.mha.in_proj_bias 1536 0
module.conv_reg.2.mha.out_proj.weight 262144 0
module.conv_reg.2.mha.out_proj.bias 512 0
module.conv_reg.2.norm0.weight 512 0
module.conv_reg.2.norm0.bias 512 0
module.conv_reg.2.norm1.weight 512 0
module.conv_reg.2.norm1.bias 512 0
module.conv_reg.2.seq.0.weight 262144 0
module.conv_reg.2.seq.0.bias 512 0
module.conv_reg.2.seq.2.weight 262144 0
module.conv_reg.2.seq.2.bias 512 0
module.nn_id.0.weight 270848 0
module.nn_id.0.bias 512 0
module.nn_id.2.weight 512 0
module.nn_id.2.bias 512 0
module.nn_id.4.weight 3072 0
module.nn_id.4.bias 6 0
module.nn_pt.nn.0.weight 273920 0
module.nn_pt.nn.0.bias 512 0
module.nn_pt.nn.2.weight 512 0
module.nn_pt.nn.2.bias 512 0
module.nn_pt.nn.4.weight 1024 0
module.nn_pt.nn.4.bias 2 0
module.nn_eta.nn.0.weight 273920 0
module.nn_eta.nn.0.bias 512 0
module.nn_eta.nn.2.weight 512 0
module.nn_eta.nn.2.bias 512 0
module.nn_eta.nn.4.weight 1024 0
module.nn_eta.nn.4.bias 2 0
module.nn_sin_phi.nn.0.weight 273920 0
module.nn_sin_phi.nn.0.bias 512 0
module.nn_sin_phi.nn.2.weight 512 0
module.nn_sin_phi.nn.2.bias 512 0
module.nn_sin_phi.nn.4.weight 1024 0
module.nn_sin_phi.nn.4.bias 2 0
module.nn_cos_phi.nn.0.weight 273920 0
module.nn_cos_phi.nn.0.bias 512 0
module.nn_cos_phi.nn.2.weight 512 0
module.nn_cos_phi.nn.2.bias 512 0
module.nn_cos_phi.nn.4.weight 1024 0
module.nn_cos_phi.nn.4.bias 2 0
module.nn_energy.nn.0.weight 273920 0
module.nn_energy.nn.0.bias 512 0
module.nn_energy.nn.2.weight 512 0
module.nn_energy.nn.2.bias 512 0
module.nn_energy.nn.4.weight 1024 0
module.nn_energy.nn.4.bias 2 0
[2024-06-13 11:10:33,322] INFO: DistributedDataParallel(
  (module): DeepMET(
    (nn): Sequential(
      (0): Linear(in_features=535, out_features=256, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0, inplace=False)
      (4): Linear(in_features=256, out_features=2, bias=True)
    )
  )
)
[2024-06-13 11:10:33,322] INFO: DeepMET Trainable parameters: 138242
[2024-06-13 11:10:33,322] INFO: DeepMET Non-trainable parameters: 0
[2024-06-13 11:10:33,322] INFO: DeepMET Total parameters: 138242
[2024-06-13 11:10:33,323] INFO: Modules Trainable parameters Non-tranable parameters
module.nn.0.weight 136960 0
module.nn.0.bias 256 0
module.nn.2.weight 256 0
module.nn.2.bias 256 0
module.nn.4.weight 512 0
module.nn.4.bias 2 0
[2024-06-13 11:10:33,344] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_ReinitializeBackbone_20240613_111024_992400
[2024-06-13 11:10:33,344] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_ReinitializeBackbone_20240613_111024_992400
[2024-06-13 11:10:33,452] INFO: train_dataset: clic_edm_ttbar_pf, 800800
[2024-06-13 11:10:33,681] INFO: valid_dataset: clic_edm_ttbar_pf, 200200
[2024-06-13 11:10:33,755] INFO: Initiating epoch #1 train run on device rank=0
[2024-06-13 11:26:00,841] INFO: Initiating epoch #1 valid run on device rank=0
[2024-06-13 11:27:07,409] INFO: Rank 0: epoch=1 / 200 train_loss=17.1556 valid_loss=14.0936 stale=0 time=16.56m eta=3295.6m
[2024-06-13 11:27:07,421] INFO: Initiating epoch #2 train run on device rank=0
[2024-06-13 11:42:28,645] INFO: Initiating epoch #2 valid run on device rank=0
[2024-06-13 11:43:33,969] INFO: Rank 0: epoch=2 / 200 train_loss=14.4419 valid_loss=14.5019 stale=1 time=16.44m eta=3267.4m
[2024-06-13 11:43:34,042] INFO: Initiating epoch #3 train run on device rank=0
[2024-06-13 11:58:55,003] INFO: Initiating epoch #3 valid run on device rank=0
[2024-06-13 12:00:00,847] INFO: Rank 0: epoch=3 / 200 train_loss=13.7517 valid_loss=13.6502 stale=0 time=16.45m eta=3247.3m
[2024-06-13 12:00:00,902] INFO: Initiating epoch #4 train run on device rank=0
[2024-06-13 12:15:21,765] INFO: Initiating epoch #4 valid run on device rank=0
[2024-06-13 12:16:28,161] INFO: Rank 0: epoch=4 / 200 train_loss=12.9085 valid_loss=12.9348 stale=0 time=16.45m eta=3229.4m
[2024-06-13 12:16:28,241] INFO: Initiating epoch #5 train run on device rank=0
[2024-06-13 12:31:48,825] INFO: Initiating epoch #5 valid run on device rank=0
[2024-06-13 12:32:54,578] INFO: Rank 0: epoch=5 / 200 train_loss=12.3233 valid_loss=12.3942 stale=0 time=16.44m eta=3211.5m
[2024-06-13 12:32:54,646] INFO: Initiating epoch #6 train run on device rank=0
[2024-06-13 12:48:15,142] INFO: Initiating epoch #6 valid run on device rank=0
[2024-06-13 12:49:20,751] INFO: Rank 0: epoch=6 / 200 train_loss=11.8885 valid_loss=12.0355 stale=0 time=16.44m eta=3194.0m
[2024-06-13 12:49:20,802] INFO: Initiating epoch #7 train run on device rank=0
[2024-06-13 13:04:41,294] INFO: Initiating epoch #7 valid run on device rank=0
[2024-06-13 13:05:47,006] INFO: Rank 0: epoch=7 / 200 train_loss=11.5544 valid_loss=11.7054 stale=0 time=16.44m eta=3176.8m
[2024-06-13 13:05:47,070] INFO: Initiating epoch #8 train run on device rank=0
[2024-06-13 13:21:08,077] INFO: Initiating epoch #8 valid run on device rank=0
[2024-06-13 13:22:14,702] INFO: Rank 0: epoch=8 / 200 train_loss=11.3238 valid_loss=11.3833 stale=0 time=16.46m eta=3160.4m
[2024-06-13 13:22:14,954] INFO: Initiating epoch #9 train run on device rank=0
[2024-06-13 13:37:37,247] INFO: Initiating epoch #9 valid run on device rank=0
[2024-06-13 13:38:42,761] INFO: Rank 0: epoch=9 / 200 train_loss=11.1157 valid_loss=11.0942 stale=0 time=16.46m eta=3144.1m
[2024-06-13 13:38:42,804] INFO: Initiating epoch #10 train run on device rank=0
[2024-06-13 13:54:05,664] INFO: Initiating epoch #10 valid run on device rank=0
[2024-06-13 13:55:11,363] INFO: Rank 0: epoch=10 / 200 train_loss=10.9472 valid_loss=10.9661 stale=0 time=16.48m eta=3127.9m
[2024-06-13 13:55:11,398] INFO: Initiating epoch #11 train run on device rank=0
[2024-06-13 14:10:33,590] INFO: Initiating epoch #11 valid run on device rank=0
[2024-06-13 14:11:39,251] INFO: Rank 0: epoch=11 / 200 train_loss=10.7993 valid_loss=10.8268 stale=0 time=16.46m eta=3111.5m
[2024-06-13 14:11:39,314] INFO: Initiating epoch #12 train run on device rank=0
[2024-06-13 14:27:00,253] INFO: Initiating epoch #12 valid run on device rank=0
[2024-06-13 14:28:05,686] INFO: Rank 0: epoch=12 / 200 train_loss=10.6789 valid_loss=10.6633 stale=0 time=16.44m eta=3094.7m
[2024-06-13 14:28:05,722] INFO: Initiating epoch #13 train run on device rank=0
[2024-06-13 14:43:28,191] INFO: Initiating epoch #13 valid run on device rank=0
[2024-06-13 14:44:33,745] INFO: Rank 0: epoch=13 / 200 train_loss=10.5778 valid_loss=10.6430 stale=0 time=16.47m eta=3078.3m
[2024-06-13 14:44:33,790] INFO: Initiating epoch #14 train run on device rank=0
[2024-06-13 14:59:56,555] INFO: Initiating epoch #14 valid run on device rank=0
[2024-06-13 15:01:02,157] INFO: Rank 0: epoch=14 / 200 train_loss=10.4909 valid_loss=10.5902 stale=0 time=16.47m eta=3062.0m
[2024-06-13 15:01:02,192] INFO: Initiating epoch #15 train run on device rank=0
[2024-06-13 15:16:23,897] INFO: Initiating epoch #15 valid run on device rank=0
[2024-06-13 15:17:28,894] INFO: Rank 0: epoch=15 / 200 train_loss=10.4165 valid_loss=10.6016 stale=1 time=16.45m eta=3045.3m
[2024-06-13 15:17:28,929] INFO: Initiating epoch #16 train run on device rank=0
[2024-06-13 15:32:49,538] INFO: Initiating epoch #16 valid run on device rank=0
[2024-06-13 15:33:54,617] INFO: Rank 0: epoch=16 / 200 train_loss=10.3464 valid_loss=10.6026 stale=2 time=16.43m eta=3028.5m
[2024-06-13 15:33:54,660] INFO: Initiating epoch #17 train run on device rank=0
[2024-06-13 15:49:15,742] INFO: Initiating epoch #17 valid run on device rank=0
[2024-06-13 15:50:20,642] INFO: Rank 0: epoch=17 / 200 train_loss=10.2851 valid_loss=10.5912 stale=3 time=16.43m eta=3011.8m
[2024-06-13 15:50:20,679] INFO: Initiating epoch #18 train run on device rank=0
[2024-06-13 16:05:42,699] INFO: Initiating epoch #18 valid run on device rank=0
[2024-06-13 16:06:48,868] INFO: Rank 0: epoch=18 / 200 train_loss=10.2309 valid_loss=10.6175 stale=4 time=16.47m eta=2995.4m
[2024-06-13 16:06:49,223] INFO: Initiating epoch #19 train run on device rank=0
[2024-06-13 16:22:11,433] INFO: Initiating epoch #19 valid run on device rank=0
[2024-06-13 16:23:16,084] INFO: Rank 0: epoch=19 / 200 train_loss=10.1796 valid_loss=10.6312 stale=5 time=16.45m eta=2978.9m
[2024-06-13 16:23:16,086] INFO: Initiating epoch #20 train run on device rank=0
[2024-06-13 16:38:38,022] INFO: Initiating epoch #20 valid run on device rank=0
[2024-06-13 16:39:47,366] INFO: Rank 0: epoch=20 / 200 train_loss=10.1344 valid_loss=10.5265 stale=0 time=16.52m eta=2963.0m
[2024-06-13 16:39:47,831] INFO: Initiating epoch #21 train run on device rank=0
[2024-06-13 16:55:09,483] INFO: Initiating epoch #21 valid run on device rank=0
[2024-06-13 16:56:17,931] INFO: Rank 0: epoch=21 / 200 train_loss=10.0923 valid_loss=10.4307 stale=0 time=16.5m eta=2947.0m
[2024-06-13 16:56:18,409] INFO: Initiating epoch #22 train run on device rank=0
[2024-06-13 17:11:39,901] INFO: Initiating epoch #22 valid run on device rank=0
[2024-06-13 17:12:45,424] INFO: Rank 0: epoch=22 / 200 train_loss=10.0548 valid_loss=10.3392 stale=0 time=16.45m eta=2930.5m
[2024-06-13 17:12:45,558] INFO: Initiating epoch #23 train run on device rank=0
[2024-06-13 17:28:07,857] INFO: Initiating epoch #23 valid run on device rank=0
[2024-06-13 17:29:12,788] INFO: Rank 0: epoch=23 / 200 train_loss=10.0193 valid_loss=10.3442 stale=1 time=16.45m eta=2914.0m
[2024-06-13 17:29:12,829] INFO: Initiating epoch #24 train run on device rank=0
[2024-06-13 17:44:34,937] INFO: Initiating epoch #24 valid run on device rank=0
[2024-06-13 17:45:40,563] INFO: Rank 0: epoch=24 / 200 train_loss=9.9836 valid_loss=10.3005 stale=0 time=16.46m eta=2897.5m
[2024-06-13 17:45:40,605] INFO: Initiating epoch #25 train run on device rank=0
[2024-06-13 18:01:02,206] INFO: Initiating epoch #25 valid run on device rank=0
[2024-06-13 18:02:07,997] INFO: Rank 0: epoch=25 / 200 train_loss=9.9513 valid_loss=10.2546 stale=0 time=16.46m eta=2881.0m
[2024-06-13 18:02:08,050] INFO: Initiating epoch #26 train run on device rank=0
[2024-06-13 18:17:29,490] INFO: Initiating epoch #26 valid run on device rank=0
[2024-06-13 18:18:35,324] INFO: Rank 0: epoch=26 / 200 train_loss=9.9202 valid_loss=10.1905 stale=0 time=16.45m eta=2864.5m
[2024-06-13 18:18:35,390] INFO: Initiating epoch #27 train run on device rank=0
[2024-06-13 18:33:56,865] INFO: Initiating epoch #27 valid run on device rank=0
[2024-06-13 18:35:02,686] INFO: Rank 0: epoch=27 / 200 train_loss=9.8903 valid_loss=10.1332 stale=0 time=16.45m eta=2848.0m
[2024-06-13 18:35:02,746] INFO: Initiating epoch #28 train run on device rank=0
[2024-06-13 18:50:24,102] INFO: Initiating epoch #28 valid run on device rank=0
[2024-06-13 18:51:30,153] INFO: Rank 0: epoch=28 / 200 train_loss=9.8610 valid_loss=10.1012 stale=0 time=16.46m eta=2831.5m
[2024-06-13 18:51:30,232] INFO: Initiating epoch #29 train run on device rank=0
[2024-06-13 19:06:51,482] INFO: Initiating epoch #29 valid run on device rank=0
[2024-06-13 19:07:57,219] INFO: Rank 0: epoch=29 / 200 train_loss=9.8310 valid_loss=10.0657 stale=0 time=16.45m eta=2815.0m
[2024-06-13 19:07:57,263] INFO: Initiating epoch #30 train run on device rank=0
[2024-06-13 19:23:18,640] INFO: Initiating epoch #30 valid run on device rank=0
[2024-06-13 19:24:24,402] INFO: Rank 0: epoch=30 / 200 train_loss=9.8026 valid_loss=10.0366 stale=0 time=16.45m eta=2798.5m
[2024-06-13 19:24:24,486] INFO: Initiating epoch #31 train run on device rank=0
[2024-06-13 19:39:45,675] INFO: Initiating epoch #31 valid run on device rank=0
[2024-06-13 19:40:51,814] INFO: Rank 0: epoch=31 / 200 train_loss=9.7738 valid_loss=10.0158 stale=0 time=16.46m eta=2782.0m
[2024-06-13 19:40:51,863] INFO: Initiating epoch #32 train run on device rank=0
[2024-06-13 19:56:13,287] INFO: Initiating epoch #32 valid run on device rank=0
[2024-06-13 19:57:21,343] INFO: Rank 0: epoch=32 / 200 train_loss=9.7467 valid_loss=10.0011 stale=0 time=16.49m eta=2765.7m
[2024-06-13 19:57:21,628] INFO: Initiating epoch #33 train run on device rank=0
[2024-06-13 20:12:43,195] INFO: Initiating epoch #33 valid run on device rank=0
[2024-06-13 20:13:49,849] INFO: Rank 0: epoch=33 / 200 train_loss=9.7208 valid_loss=9.9925 stale=0 time=16.47m eta=2749.3m
[2024-06-13 20:13:49,983] INFO: Initiating epoch #34 train run on device rank=0
[2024-06-13 20:29:11,594] INFO: Initiating epoch #34 valid run on device rank=0
[2024-06-13 20:30:17,186] INFO: Rank 0: epoch=34 / 200 train_loss=9.6949 valid_loss=9.9740 stale=0 time=16.45m eta=2732.8m
[2024-06-13 20:30:17,227] INFO: Initiating epoch #35 train run on device rank=0
[2024-06-13 20:45:39,532] INFO: Initiating epoch #35 valid run on device rank=0
[2024-06-13 20:46:44,704] INFO: Rank 0: epoch=35 / 200 train_loss=9.6714 valid_loss=9.9813 stale=1 time=16.46m eta=2716.3m
[2024-06-13 20:46:44,735] INFO: Initiating epoch #36 train run on device rank=0
[2024-06-13 21:02:04,254] INFO: Initiating epoch #36 valid run on device rank=0
[2024-06-13 21:03:09,630] INFO: Rank 0: epoch=36 / 200 train_loss=9.6473 valid_loss=9.9797 stale=2 time=16.41m eta=2699.6m
[2024-06-13 21:03:09,683] INFO: Initiating epoch #37 train run on device rank=0
[2024-06-13 21:18:29,584] INFO: Initiating epoch #37 valid run on device rank=0
[2024-06-13 21:19:37,252] INFO: Rank 0: epoch=37 / 200 train_loss=9.6246 valid_loss=10.0098 stale=3 time=16.46m eta=2683.1m
[2024-06-13 21:19:37,764] INFO: Initiating epoch #38 train run on device rank=0
[2024-06-13 21:34:57,754] INFO: Initiating epoch #38 valid run on device rank=0
[2024-06-13 21:36:05,546] INFO: Rank 0: epoch=38 / 200 train_loss=9.6016 valid_loss=10.0260 stale=4 time=16.46m eta=2666.7m
[2024-06-13 21:36:06,228] INFO: Initiating epoch #39 train run on device rank=0
[2024-06-13 21:51:26,074] INFO: Initiating epoch #39 valid run on device rank=0
[2024-06-13 21:52:33,210] INFO: Rank 0: epoch=39 / 200 train_loss=9.5780 valid_loss=10.0230 stale=5 time=16.45m eta=2650.3m
[2024-06-13 21:52:33,673] INFO: Initiating epoch #40 train run on device rank=0
[2024-06-13 22:07:54,022] INFO: Initiating epoch #40 valid run on device rank=0
[2024-06-13 22:08:59,283] INFO: Rank 0: epoch=40 / 200 train_loss=9.5557 valid_loss=10.0234 stale=6 time=16.43m eta=2633.7m
[2024-06-13 22:08:59,351] INFO: Initiating epoch #41 train run on device rank=0
[2024-06-13 22:24:20,752] INFO: Initiating epoch #41 valid run on device rank=0
[2024-06-13 22:25:26,008] INFO: Rank 0: epoch=41 / 200 train_loss=9.5327 valid_loss=10.0420 stale=7 time=16.44m eta=2617.2m
[2024-06-13 22:25:26,184] INFO: Initiating epoch #42 train run on device rank=0
[2024-06-13 22:40:47,981] INFO: Initiating epoch #42 valid run on device rank=0
[2024-06-13 22:41:53,491] INFO: Rank 0: epoch=42 / 200 train_loss=9.5100 valid_loss=10.0623 stale=8 time=16.46m eta=2600.7m
[2024-06-13 22:41:53,551] INFO: Initiating epoch #43 train run on device rank=0
[2024-06-13 22:57:16,480] INFO: Initiating epoch #43 valid run on device rank=0
[2024-06-13 22:58:22,295] INFO: Rank 0: epoch=43 / 200 train_loss=9.4878 valid_loss=10.1121 stale=9 time=16.48m eta=2584.3m
[2024-06-13 22:58:22,546] INFO: Initiating epoch #44 train run on device rank=0
[2024-06-13 23:13:44,214] INFO: Initiating epoch #44 valid run on device rank=0
[2024-06-13 23:14:49,623] INFO: Rank 0: epoch=44 / 200 train_loss=9.4654 valid_loss=10.1103 stale=10 time=16.45m eta=2567.8m
[2024-06-13 23:14:49,690] INFO: Initiating epoch #45 train run on device rank=0
[2024-06-13 23:30:11,124] INFO: Initiating epoch #45 valid run on device rank=0
[2024-06-13 23:31:16,418] INFO: Rank 0: epoch=45 / 200 train_loss=9.4409 valid_loss=10.1177 stale=11 time=16.45m eta=2551.3m
[2024-06-13 23:31:16,471] INFO: Initiating epoch #46 train run on device rank=0
[2024-06-13 23:46:37,527] INFO: Initiating epoch #46 valid run on device rank=0
[2024-06-13 23:47:42,756] INFO: Rank 0: epoch=46 / 200 train_loss=9.4167 valid_loss=10.1303 stale=12 time=16.44m eta=2534.8m
[2024-06-13 23:47:42,820] INFO: Initiating epoch #47 train run on device rank=0
[2024-06-14 00:03:03,908] INFO: Initiating epoch #47 valid run on device rank=0
[2024-06-14 00:04:09,107] INFO: Rank 0: epoch=47 / 200 train_loss=9.3904 valid_loss=10.1231 stale=13 time=16.44m eta=2518.3m
[2024-06-14 00:04:09,158] INFO: Initiating epoch #48 train run on device rank=0
[2024-06-14 00:19:30,568] INFO: Initiating epoch #48 valid run on device rank=0
[2024-06-14 00:20:35,939] INFO: Rank 0: epoch=48 / 200 train_loss=9.3679 valid_loss=10.0771 stale=14 time=16.45m eta=2501.8m
[2024-06-14 00:20:36,054] INFO: Initiating epoch #49 train run on device rank=0
[2024-06-14 00:35:57,723] INFO: Initiating epoch #49 valid run on device rank=0
[2024-06-14 00:37:02,818] INFO: Rank 0: epoch=49 / 200 train_loss=9.3431 valid_loss=10.0526 stale=15 time=16.45m eta=2485.3m
[2024-06-14 00:37:02,856] INFO: Initiating epoch #50 train run on device rank=0
[2024-06-14 00:52:24,668] INFO: Initiating epoch #50 valid run on device rank=0
[2024-06-14 00:53:30,151] INFO: Rank 0: epoch=50 / 200 train_loss=9.3177 valid_loss=10.0629 stale=16 time=16.45m eta=2468.8m
[2024-06-14 00:53:30,199] INFO: Initiating epoch #51 train run on device rank=0
[2024-06-14 01:08:52,761] INFO: Initiating epoch #51 valid run on device rank=0
[2024-06-14 01:09:57,984] INFO: Rank 0: epoch=51 / 200 train_loss=9.2929 valid_loss=10.0795 stale=17 time=16.46m eta=2452.4m
[2024-06-14 01:09:58,036] INFO: Initiating epoch #52 train run on device rank=0
[2024-06-14 01:25:19,754] INFO: Initiating epoch #52 valid run on device rank=0
[2024-06-14 01:26:24,850] INFO: Rank 0: epoch=52 / 200 train_loss=9.2661 valid_loss=10.0761 stale=18 time=16.45m eta=2435.9m
[2024-06-14 01:26:24,899] INFO: Initiating epoch #53 train run on device rank=0
[2024-06-14 01:41:46,077] INFO: Initiating epoch #53 valid run on device rank=0
[2024-06-14 01:42:51,363] INFO: Rank 0: epoch=53 / 200 train_loss=9.2384 valid_loss=10.1003 stale=19 time=16.44m eta=2419.4m
[2024-06-14 01:42:51,456] INFO: Initiating epoch #54 train run on device rank=0
[2024-06-14 01:58:12,244] INFO: Initiating epoch #54 valid run on device rank=0
[2024-06-14 01:59:17,530] INFO: Rank 0: epoch=54 / 200 train_loss=9.2127 valid_loss=10.1138 stale=20 time=16.43m eta=2402.9m
[2024-06-14 01:59:17,582] INFO: Initiating epoch #55 train run on device rank=0
[2024-06-14 02:14:39,036] INFO: Initiating epoch #55 valid run on device rank=0
[2024-06-14 02:15:44,255] INFO: Rank 0: epoch=55 / 200 train_loss=9.1860 valid_loss=10.1304 stale=21 time=16.44m eta=2386.4m
[2024-06-14 02:15:44,294] INFO: Initiating epoch #56 train run on device rank=0
[2024-06-14 02:31:05,592] INFO: Initiating epoch #56 valid run on device rank=0
[2024-06-14 02:32:10,846] INFO: Rank 0: epoch=56 / 200 train_loss=9.1608 valid_loss=10.1383 stale=22 time=16.44m eta=2369.9m
[2024-06-14 02:32:10,918] INFO: Initiating epoch #57 train run on device rank=0
[2024-06-14 02:47:32,106] INFO: Initiating epoch #57 valid run on device rank=0
[2024-06-14 02:48:37,263] INFO: Rank 0: epoch=57 / 200 train_loss=9.1346 valid_loss=10.1655 stale=23 time=16.44m eta=2353.4m
[2024-06-14 02:48:37,334] INFO: Initiating epoch #58 train run on device rank=0
[2024-06-14 03:03:58,183] INFO: Initiating epoch #58 valid run on device rank=0
[2024-06-14 03:05:03,321] INFO: Rank 0: epoch=58 / 200 train_loss=9.1093 valid_loss=10.1892 stale=24 time=16.43m eta=2336.9m
[2024-06-14 03:05:03,372] INFO: Initiating epoch #59 train run on device rank=0
[2024-06-14 03:20:24,669] INFO: Initiating epoch #59 valid run on device rank=0
[2024-06-14 03:21:30,255] INFO: Rank 0: epoch=59 / 200 train_loss=9.0808 valid_loss=10.2646 stale=25 time=16.45m eta=2320.4m
[2024-06-14 03:21:30,298] INFO: Initiating epoch #60 train run on device rank=0
[2024-06-14 03:36:50,783] INFO: Initiating epoch #60 valid run on device rank=0
[2024-06-14 03:37:56,003] INFO: Rank 0: epoch=60 / 200 train_loss=9.0539 valid_loss=10.2881 stale=26 time=16.43m eta=2303.9m
[2024-06-14 03:37:56,067] INFO: Initiating epoch #61 train run on device rank=0
[2024-06-14 03:53:17,367] INFO: Initiating epoch #61 valid run on device rank=0
[2024-06-14 03:54:22,645] INFO: Rank 0: epoch=61 / 200 train_loss=9.0261 valid_loss=10.2956 stale=27 time=16.44m eta=2287.4m
[2024-06-14 03:54:22,686] INFO: Initiating epoch #62 train run on device rank=0
[2024-06-14 04:09:43,770] INFO: Initiating epoch #62 valid run on device rank=0
[2024-06-14 04:10:49,273] INFO: Rank 0: epoch=62 / 200 train_loss=8.9985 valid_loss=10.3739 stale=28 time=16.44m eta=2270.9m
[2024-06-14 04:10:49,328] INFO: Initiating epoch #63 train run on device rank=0
[2024-06-14 04:26:10,526] INFO: Initiating epoch #63 valid run on device rank=0
[2024-06-14 04:27:15,785] INFO: Rank 0: epoch=63 / 200 train_loss=8.9696 valid_loss=10.4069 stale=29 time=16.44m eta=2254.4m
[2024-06-14 04:27:15,851] INFO: Initiating epoch #64 train run on device rank=0
[2024-06-14 04:42:37,062] INFO: Initiating epoch #64 valid run on device rank=0
[2024-06-14 04:43:42,180] INFO: Rank 0: epoch=64 / 200 train_loss=8.9369 valid_loss=10.4111 stale=30 time=16.44m eta=2237.9m
[2024-06-14 04:43:42,212] INFO: Initiating epoch #65 train run on device rank=0
[2024-06-14 04:59:03,790] INFO: Initiating epoch #65 valid run on device rank=0
[2024-06-14 05:00:09,103] INFO: Rank 0: epoch=65 / 200 train_loss=8.9091 valid_loss=10.4702 stale=31 time=16.45m eta=2221.5m
[2024-06-14 05:00:09,178] INFO: Initiating epoch #66 train run on device rank=0
[2024-06-14 05:15:29,918] INFO: Initiating epoch #66 valid run on device rank=0
[2024-06-14 05:16:35,079] INFO: Rank 0: epoch=66 / 200 train_loss=8.8788 valid_loss=10.4848 stale=32 time=16.43m eta=2205.0m
[2024-06-14 05:16:35,117] INFO: Initiating epoch #67 train run on device rank=0
[2024-06-14 05:31:56,338] INFO: Initiating epoch #67 valid run on device rank=0
[2024-06-14 05:33:01,409] INFO: Rank 0: epoch=67 / 200 train_loss=8.8505 valid_loss=10.4525 stale=33 time=16.44m eta=2188.5m
[2024-06-14 05:33:01,446] INFO: Initiating epoch #68 train run on device rank=0
[2024-06-14 05:48:22,374] INFO: Initiating epoch #68 valid run on device rank=0
[2024-06-14 05:49:27,652] INFO: Rank 0: epoch=68 / 200 train_loss=8.8239 valid_loss=10.4750 stale=34 time=16.44m eta=2172.0m
[2024-06-14 05:49:27,693] INFO: Initiating epoch #69 train run on device rank=0
[2024-06-14 06:04:48,993] INFO: Initiating epoch #69 valid run on device rank=0
[2024-06-14 06:05:54,131] INFO: Rank 0: epoch=69 / 200 train_loss=8.7947 valid_loss=10.4440 stale=35 time=16.44m eta=2155.5m
[2024-06-14 06:05:54,178] INFO: Initiating epoch #70 train run on device rank=0
[2024-06-14 06:21:15,083] INFO: Initiating epoch #70 valid run on device rank=0
[2024-06-14 06:22:20,276] INFO: Rank 0: epoch=70 / 200 train_loss=8.7667 valid_loss=10.4738 stale=36 time=16.43m eta=2139.0m
[2024-06-14 06:22:20,316] INFO: Initiating epoch #71 train run on device rank=0
[2024-06-14 06:37:41,407] INFO: Initiating epoch #71 valid run on device rank=0
[2024-06-14 06:38:46,739] INFO: Rank 0: epoch=71 / 200 train_loss=8.7372 valid_loss=10.4455 stale=37 time=16.44m eta=2122.5m
[2024-06-14 06:38:46,822] INFO: Initiating epoch #72 train run on device rank=0
[2024-06-14 06:54:07,787] INFO: Initiating epoch #72 valid run on device rank=0
[2024-06-14 06:55:12,908] INFO: Rank 0: epoch=72 / 200 train_loss=8.7095 valid_loss=10.4270 stale=38 time=16.43m eta=2106.0m
[2024-06-14 06:55:12,950] INFO: Initiating epoch #73 train run on device rank=0
[2024-06-14 07:10:33,916] INFO: Initiating epoch #73 valid run on device rank=0
[2024-06-14 07:11:39,547] INFO: Rank 0: epoch=73 / 200 train_loss=8.6709 valid_loss=10.4293 stale=39 time=16.44m eta=2089.6m
[2024-06-14 07:11:39,590] INFO: Initiating epoch #74 train run on device rank=0
[2024-06-14 07:27:00,717] INFO: Initiating epoch #74 valid run on device rank=0
[2024-06-14 07:28:06,207] INFO: Rank 0: epoch=74 / 200 train_loss=8.6425 valid_loss=10.4529 stale=40 time=16.44m eta=2073.1m
[2024-06-14 07:28:06,244] INFO: Initiating epoch #75 train run on device rank=0
[2024-06-14 07:43:27,590] INFO: Initiating epoch #75 valid run on device rank=0
[2024-06-14 07:44:32,863] INFO: Rank 0: epoch=75 / 200 train_loss=8.6043 valid_loss=10.4577 stale=41 time=16.44m eta=2056.6m
[2024-06-14 07:44:32,905] INFO: Initiating epoch #76 train run on device rank=0
[2024-06-14 08:00:01,103] INFO: Initiating epoch #76 valid run on device rank=0
[2024-06-14 08:01:06,452] INFO: Rank 0: epoch=76 / 200 train_loss=8.5649 valid_loss=10.4509 stale=42 time=16.56m eta=2040.4m
[2024-06-14 08:01:06,485] INFO: Initiating epoch #77 train run on device rank=0
[2024-06-14 08:16:27,220] INFO: Initiating epoch #77 valid run on device rank=0
[2024-06-14 08:17:32,433] INFO: Rank 0: epoch=77 / 200 train_loss=8.5263 valid_loss=10.4429 stale=43 time=16.43m eta=2023.9m
[2024-06-14 08:17:32,477] INFO: Initiating epoch #78 train run on device rank=0
[2024-06-14 08:32:54,149] INFO: Initiating epoch #78 valid run on device rank=0
[2024-06-14 08:33:59,152] INFO: Rank 0: epoch=78 / 200 train_loss=8.4941 valid_loss=10.3942 stale=44 time=16.44m eta=2007.4m
[2024-06-14 08:33:59,205] INFO: Initiating epoch #79 train run on device rank=0
[2024-06-14 08:49:20,553] INFO: Initiating epoch #79 valid run on device rank=0
[2024-06-14 08:50:25,899] INFO: Rank 0: epoch=79 / 200 train_loss=8.4655 valid_loss=10.4029 stale=45 time=16.44m eta=1990.9m
[2024-06-14 08:50:25,966] INFO: Initiating epoch #80 train run on device rank=0
[2024-06-14 09:05:47,015] INFO: Initiating epoch #80 valid run on device rank=0
[2024-06-14 09:06:52,259] INFO: Rank 0: epoch=80 / 200 train_loss=8.4339 valid_loss=10.4285 stale=46 time=16.44m eta=1974.5m
[2024-06-14 09:06:52,323] INFO: Initiating epoch #81 train run on device rank=0
[2024-06-14 09:22:13,254] INFO: Initiating epoch #81 valid run on device rank=0
[2024-06-14 09:23:18,568] INFO: Rank 0: epoch=81 / 200 train_loss=8.4038 valid_loss=10.4361 stale=47 time=16.44m eta=1958.0m
[2024-06-14 09:23:18,611] INFO: Initiating epoch #82 train run on device rank=0
[2024-06-14 09:38:39,080] INFO: Initiating epoch #82 valid run on device rank=0
[2024-06-14 09:39:44,503] INFO: Rank 0: epoch=82 / 200 train_loss=8.3642 valid_loss=10.4329 stale=48 time=16.43m eta=1941.5m
[2024-06-14 09:39:44,569] INFO: Initiating epoch #83 train run on device rank=0
[2024-06-14 09:55:05,607] INFO: Initiating epoch #83 valid run on device rank=0
[2024-06-14 09:56:10,900] INFO: Rank 0: epoch=83 / 200 train_loss=8.3232 valid_loss=10.4731 stale=49 time=16.44m eta=1925.0m
[2024-06-14 09:56:10,969] INFO: Initiating epoch #84 train run on device rank=0
[2024-06-14 10:11:32,110] INFO: Initiating epoch #84 valid run on device rank=0
[2024-06-14 10:12:37,332] INFO: Rank 0: epoch=84 / 200 train_loss=8.2777 valid_loss=10.4811 stale=50 time=16.44m eta=1908.6m
[2024-06-14 10:12:37,386] INFO: Initiating epoch #85 train run on device rank=0
[2024-06-14 10:27:58,587] INFO: Initiating epoch #85 valid run on device rank=0