[2024-06-17 11:23:42,715] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus [2024-06-17 11:23:42,817] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-17 11:23:42,817] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-17 11:23:42,817] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-17 11:23:42,817] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-17 11:23:47,398] INFO: using dtype=torch.float32 [2024-06-17 11:23:48,438] INFO: using attention_type=math [2024-06-17 11:23:48,457] INFO: using attention_type=math [2024-06-17 11:23:48,476] INFO: using attention_type=math [2024-06-17 11:23:48,498] INFO: using attention_type=math [2024-06-17 11:23:48,517] INFO: using attention_type=math [2024-06-17 11:23:48,535] INFO: using attention_type=math [2024-06-17 11:23:52,642] INFO: mlpf_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'math'} [2024-06-17 11:23:52,642] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/best_weights.pth [2024-06-17 11:23:53,815] INFO: DistributedDataParallel( (module): MLPF( (nn0_id): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (nn0_reg): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (conv_id): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (conv_reg): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (nn_id): Sequential( (0): Linear(in_features=529, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=6, bias=True) ) (nn_pt): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_eta): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_sin_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_cos_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_energy): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) ) ) [2024-06-17 11:23:53,816] INFO: Backbone Trainable parameters: 11671568 [2024-06-17 11:23:53,816] INFO: Backbone Non-trainable parameters: 0 [2024-06-17 11:23:53,817] INFO: Backbone Total parameters: 11671568 [2024-06-17 11:23:53,820] INFO: Modules Trainable parameters Non-tranable parameters module.nn0_id.0.weight 8704 0 module.nn0_id.0.bias 512 0 module.nn0_id.2.weight 512 0 module.nn0_id.2.bias 512 0 module.nn0_id.4.weight 262144 0 module.nn0_id.4.bias 512 0 module.nn0_reg.0.weight 8704 0 module.nn0_reg.0.bias 512 0 module.nn0_reg.2.weight 512 0 module.nn0_reg.2.bias 512 0 module.nn0_reg.4.weight 262144 0 module.nn0_reg.4.bias 512 0 module.conv_id.0.mha.in_proj_weight 786432 0 module.conv_id.0.mha.in_proj_bias 1536 0 module.conv_id.0.mha.out_proj.weight 262144 0 module.conv_id.0.mha.out_proj.bias 512 0 module.conv_id.0.norm0.weight 512 0 module.conv_id.0.norm0.bias 512 0 module.conv_id.0.norm1.weight 512 0 module.conv_id.0.norm1.bias 512 0 module.conv_id.0.seq.0.weight 262144 0 module.conv_id.0.seq.0.bias 512 0 module.conv_id.0.seq.2.weight 262144 0 module.conv_id.0.seq.2.bias 512 0 module.conv_id.1.mha.in_proj_weight 786432 0 module.conv_id.1.mha.in_proj_bias 1536 0 module.conv_id.1.mha.out_proj.weight 262144 0 module.conv_id.1.mha.out_proj.bias 512 0 module.conv_id.1.norm0.weight 512 0 module.conv_id.1.norm0.bias 512 0 module.conv_id.1.norm1.weight 512 0 module.conv_id.1.norm1.bias 512 0 module.conv_id.1.seq.0.weight 262144 0 module.conv_id.1.seq.0.bias 512 0 module.conv_id.1.seq.2.weight 262144 0 module.conv_id.1.seq.2.bias 512 0 module.conv_id.2.mha.in_proj_weight 786432 0 module.conv_id.2.mha.in_proj_bias 1536 0 module.conv_id.2.mha.out_proj.weight 262144 0 module.conv_id.2.mha.out_proj.bias 512 0 module.conv_id.2.norm0.weight 512 0 module.conv_id.2.norm0.bias 512 0 module.conv_id.2.norm1.weight 512 0 module.conv_id.2.norm1.bias 512 0 module.conv_id.2.seq.0.weight 262144 0 module.conv_id.2.seq.0.bias 512 0 module.conv_id.2.seq.2.weight 262144 0 module.conv_id.2.seq.2.bias 512 0 module.conv_reg.0.mha.in_proj_weight 786432 0 module.conv_reg.0.mha.in_proj_bias 1536 0 module.conv_reg.0.mha.out_proj.weight 262144 0 module.conv_reg.0.mha.out_proj.bias 512 0 module.conv_reg.0.norm0.weight 512 0 module.conv_reg.0.norm0.bias 512 0 module.conv_reg.0.norm1.weight 512 0 module.conv_reg.0.norm1.bias 512 0 module.conv_reg.0.seq.0.weight 262144 0 module.conv_reg.0.seq.0.bias 512 0 module.conv_reg.0.seq.2.weight 262144 0 module.conv_reg.0.seq.2.bias 512 0 module.conv_reg.1.mha.in_proj_weight 786432 0 module.conv_reg.1.mha.in_proj_bias 1536 0 module.conv_reg.1.mha.out_proj.weight 262144 0 module.conv_reg.1.mha.out_proj.bias 512 0 module.conv_reg.1.norm0.weight 512 0 module.conv_reg.1.norm0.bias 512 0 module.conv_reg.1.norm1.weight 512 0 module.conv_reg.1.norm1.bias 512 0 module.conv_reg.1.seq.0.weight 262144 0 module.conv_reg.1.seq.0.bias 512 0 module.conv_reg.1.seq.2.weight 262144 0 module.conv_reg.1.seq.2.bias 512 0 module.conv_reg.2.mha.in_proj_weight 786432 0 module.conv_reg.2.mha.in_proj_bias 1536 0 module.conv_reg.2.mha.out_proj.weight 262144 0 module.conv_reg.2.mha.out_proj.bias 512 0 module.conv_reg.2.norm0.weight 512 0 module.conv_reg.2.norm0.bias 512 0 module.conv_reg.2.norm1.weight 512 0 module.conv_reg.2.norm1.bias 512 0 module.conv_reg.2.seq.0.weight 262144 0 module.conv_reg.2.seq.0.bias 512 0 module.conv_reg.2.seq.2.weight 262144 0 module.conv_reg.2.seq.2.bias 512 0 module.nn_id.0.weight 270848 0 module.nn_id.0.bias 512 0 module.nn_id.2.weight 512 0 module.nn_id.2.bias 512 0 module.nn_id.4.weight 3072 0 module.nn_id.4.bias 6 0 module.nn_pt.nn.0.weight 273920 0 module.nn_pt.nn.0.bias 512 0 module.nn_pt.nn.2.weight 512 0 module.nn_pt.nn.2.bias 512 0 module.nn_pt.nn.4.weight 1024 0 module.nn_pt.nn.4.bias 2 0 module.nn_eta.nn.0.weight 273920 0 module.nn_eta.nn.0.bias 512 0 module.nn_eta.nn.2.weight 512 0 module.nn_eta.nn.2.bias 512 0 module.nn_eta.nn.4.weight 1024 0 module.nn_eta.nn.4.bias 2 0 module.nn_sin_phi.nn.0.weight 273920 0 module.nn_sin_phi.nn.0.bias 512 0 module.nn_sin_phi.nn.2.weight 512 0 module.nn_sin_phi.nn.2.bias 512 0 module.nn_sin_phi.nn.4.weight 1024 0 module.nn_sin_phi.nn.4.bias 2 0 module.nn_cos_phi.nn.0.weight 273920 0 module.nn_cos_phi.nn.0.bias 512 0 module.nn_cos_phi.nn.2.weight 512 0 module.nn_cos_phi.nn.2.bias 512 0 module.nn_cos_phi.nn.4.weight 1024 0 module.nn_cos_phi.nn.4.bias 2 0 module.nn_energy.nn.0.weight 273920 0 module.nn_energy.nn.0.bias 512 0 module.nn_energy.nn.2.weight 512 0 module.nn_energy.nn.2.bias 512 0 module.nn_energy.nn.4.weight 1024 0 module.nn_energy.nn.4.bias 2 0 [2024-06-17 11:23:53,877] INFO: DistributedDataParallel( (module): DeepMET( (nn): Sequential( (0): Linear(in_features=535, out_features=256, bias=True) (1): ELU(alpha=1.0) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0, inplace=False) (4): Linear(in_features=256, out_features=2, bias=True) ) ) ) [2024-06-17 11:23:53,877] INFO: DeepMET Trainable parameters: 138242 [2024-06-17 11:23:53,878] INFO: DeepMET Non-trainable parameters: 0 [2024-06-17 11:23:53,878] INFO: DeepMET Total parameters: 138242 [2024-06-17 11:23:53,878] INFO: Modules Trainable parameters Non-tranable parameters module.nn.0.weight 136960 0 module.nn.0.bias 256 0 module.nn.2.weight 256 0 module.nn.2.bias 256 0 module.nn.4.weight 512 0 module.nn.4.bias 2 0 [2024-06-17 11:23:53,879] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_FloatBackbone_20240617_112342_613393 [2024-06-17 11:23:53,879] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_FloatBackbone_20240617_112342_613393 [2024-06-17 11:23:53,907] INFO: train_dataset: clic_edm_ttbar_pf, 800800 [2024-06-17 11:23:54,244] INFO: valid_dataset: clic_edm_ttbar_pf, 200200 [2024-06-17 11:23:54,263] INFO: Initiating epoch #1 train run on device rank=0 [2024-06-17 11:43:49,444] INFO: Initiating epoch #1 valid run on device rank=0 [2024-06-17 11:44:58,192] INFO: Rank 0: epoch=1 / 400 train_loss=18.3925 valid_loss=18.3640 stale=0 time=21.07m eta=8405.1m [2024-06-17 11:44:58,203] INFO: Initiating epoch #2 train run on device rank=0 [2024-06-17 12:04:44,854] INFO: Initiating epoch #2 valid run on device rank=0 [2024-06-17 12:05:53,238] INFO: Rank 0: epoch=2 / 400 train_loss=13.6729 valid_loss=10.1164 stale=0 time=20.92m eta=8354.6m [2024-06-17 12:05:53,304] INFO: Initiating epoch #3 train run on device rank=0 [2024-06-17 12:25:39,758] INFO: Initiating epoch #3 valid run on device rank=0 [2024-06-17 12:26:47,889] INFO: Rank 0: epoch=3 / 400 train_loss=9.5564 valid_loss=9.3610 stale=0 time=20.91m eta=8322.9m [2024-06-17 12:26:47,940] INFO: Initiating epoch #4 train run on device rank=0 [2024-06-17 12:46:34,547] INFO: Initiating epoch #4 valid run on device rank=0 [2024-06-17 12:47:43,096] INFO: Rank 0: epoch=4 / 400 train_loss=9.0040 valid_loss=8.9759 stale=0 time=20.92m eta=8297.6m [2024-06-17 12:47:43,232] INFO: Initiating epoch #5 train run on device rank=0 [2024-06-17 13:07:29,782] INFO: Initiating epoch #5 valid run on device rank=0 [2024-06-17 13:08:37,490] INFO: Rank 0: epoch=5 / 400 train_loss=8.6682 valid_loss=8.7237 stale=0 time=20.9m eta=8272.9m [2024-06-17 13:08:37,506] INFO: Initiating epoch #6 train run on device rank=0 [2024-06-17 13:28:23,954] INFO: Initiating epoch #6 valid run on device rank=0 [2024-06-17 13:29:32,086] INFO: Rank 0: epoch=6 / 400 train_loss=8.4439 valid_loss=8.5552 stale=0 time=20.91m eta=8249.7m [2024-06-17 13:29:32,125] INFO: Initiating epoch #7 train run on device rank=0 [2024-06-17 13:49:18,687] INFO: Initiating epoch #7 valid run on device rank=0 [2024-06-17 13:50:26,641] INFO: Rank 0: epoch=7 / 400 train_loss=8.2744 valid_loss=8.4338 stale=0 time=20.91m eta=8227.2m [2024-06-17 13:50:26,696] INFO: Initiating epoch #8 train run on device rank=0 [2024-06-17 14:10:13,241] INFO: Initiating epoch #8 valid run on device rank=0 [2024-06-17 14:11:21,176] INFO: Rank 0: epoch=8 / 400 train_loss=8.1414 valid_loss=8.3458 stale=0 time=20.91m eta=8205.0m [2024-06-17 14:11:21,220] INFO: Initiating epoch #9 train run on device rank=0 [2024-06-17 14:31:07,575] INFO: Initiating epoch #9 valid run on device rank=0 [2024-06-17 14:32:15,540] INFO: Rank 0: epoch=9 / 400 train_loss=8.0297 valid_loss=8.2674 stale=0 time=20.91m eta=8183.0m [2024-06-17 14:32:15,571] INFO: Initiating epoch #10 train run on device rank=0 [2024-06-17 14:52:02,481] INFO: Initiating epoch #10 valid run on device rank=0 [2024-06-17 14:53:10,423] INFO: Rank 0: epoch=10 / 400 train_loss=7.9334 valid_loss=8.2017 stale=0 time=20.91m eta=8161.5m [2024-06-17 14:53:10,459] INFO: Initiating epoch #11 train run on device rank=0 [2024-06-17 15:12:57,078] INFO: Initiating epoch #11 valid run on device rank=0 [2024-06-17 15:14:05,820] INFO: Rank 0: epoch=11 / 400 train_loss=7.8477 valid_loss=8.1497 stale=0 time=20.92m eta=8140.4m [2024-06-17 15:14:06,123] INFO: Initiating epoch #12 train run on device rank=0 [2024-06-17 15:33:52,989] INFO: Initiating epoch #12 valid run on device rank=0 [2024-06-17 15:35:00,806] INFO: Rank 0: epoch=12 / 400 train_loss=7.7734 valid_loss=8.0968 stale=0 time=20.91m eta=8119.2m [2024-06-17 15:35:00,872] INFO: Initiating epoch #13 train run on device rank=0 [2024-06-17 15:54:47,195] INFO: Initiating epoch #13 valid run on device rank=0 [2024-06-17 15:55:54,878] INFO: Rank 0: epoch=13 / 400 train_loss=7.7065 valid_loss=8.0516 stale=0 time=20.9m eta=8097.5m [2024-06-17 15:55:54,922] INFO: Initiating epoch #14 train run on device rank=0 [2024-06-17 16:15:41,185] INFO: Initiating epoch #14 valid run on device rank=0 [2024-06-17 16:16:50,094] INFO: Rank 0: epoch=14 / 400 train_loss=7.6416 valid_loss=8.0090 stale=0 time=20.92m eta=8076.5m [2024-06-17 16:16:50,263] INFO: Initiating epoch #15 train run on device rank=0 [2024-06-17 16:36:36,470] INFO: Initiating epoch #15 valid run on device rank=0 [2024-06-17 16:37:44,973] INFO: Rank 0: epoch=15 / 400 train_loss=7.5809 valid_loss=7.9727 stale=0 time=20.91m eta=8055.4m [2024-06-17 16:37:45,155] INFO: Initiating epoch #16 train run on device rank=0 [2024-06-17 16:57:31,450] INFO: Initiating epoch #16 valid run on device rank=0 [2024-06-17 16:58:39,257] INFO: Rank 0: epoch=16 / 400 train_loss=7.5231 valid_loss=7.9363 stale=0 time=20.9m eta=8034.0m [2024-06-17 16:58:39,308] INFO: Initiating epoch #17 train run on device rank=0 [2024-06-17 17:18:25,727] INFO: Initiating epoch #17 valid run on device rank=0 [2024-06-17 17:19:33,450] INFO: Rank 0: epoch=17 / 400 train_loss=7.4665 valid_loss=7.9047 stale=0 time=20.9m eta=8012.7m [2024-06-17 17:19:33,503] INFO: Initiating epoch #18 train run on device rank=0 [2024-06-17 17:39:19,670] INFO: Initiating epoch #18 valid run on device rank=0 [2024-06-17 17:40:27,485] INFO: Rank 0: epoch=18 / 400 train_loss=7.4124 valid_loss=7.8739 stale=0 time=20.9m eta=7991.3m [2024-06-17 17:40:27,538] INFO: Initiating epoch #19 train run on device rank=0 [2024-06-17 18:00:13,898] INFO: Initiating epoch #19 valid run on device rank=0 [2024-06-17 18:01:21,555] INFO: Rank 0: epoch=19 / 400 train_loss=7.3592 valid_loss=7.8392 stale=0 time=20.9m eta=7970.0m [2024-06-17 18:01:21,586] INFO: Initiating epoch #20 train run on device rank=0 [2024-06-17 18:21:07,972] INFO: Initiating epoch #20 valid run on device rank=0 [2024-06-17 18:22:16,063] INFO: Rank 0: epoch=20 / 400 train_loss=7.3091 valid_loss=7.8062 stale=0 time=20.91m eta=7948.9m [2024-06-17 18:22:16,101] INFO: Initiating epoch #21 train run on device rank=0 [2024-06-17 18:42:02,405] INFO: Initiating epoch #21 valid run on device rank=0 [2024-06-17 18:43:10,303] INFO: Rank 0: epoch=21 / 400 train_loss=7.2566 valid_loss=7.7689 stale=0 time=20.9m eta=7927.7m [2024-06-17 18:43:10,348] INFO: Initiating epoch #22 train run on device rank=0 [2024-06-17 19:02:56,548] INFO: Initiating epoch #22 valid run on device rank=0 [2024-06-17 19:04:04,551] INFO: Rank 0: epoch=22 / 400 train_loss=7.2029 valid_loss=7.7375 stale=0 time=20.9m eta=7906.6m [2024-06-17 19:04:04,611] INFO: Initiating epoch #23 train run on device rank=0 [2024-06-17 19:23:50,959] INFO: Initiating epoch #23 valid run on device rank=0 [2024-06-17 19:24:58,609] INFO: Rank 0: epoch=23 / 400 train_loss=7.1504 valid_loss=7.6993 stale=0 time=20.9m eta=7885.4m [2024-06-17 19:24:58,647] INFO: Initiating epoch #24 train run on device rank=0 [2024-06-17 19:44:44,762] INFO: Initiating epoch #24 valid run on device rank=0 [2024-06-17 19:45:52,687] INFO: Rank 0: epoch=24 / 400 train_loss=7.0986 valid_loss=7.6668 stale=0 time=20.9m eta=7864.3m [2024-06-17 19:45:52,728] INFO: Initiating epoch #25 train run on device rank=0 [2024-06-17 20:05:39,360] INFO: Initiating epoch #25 valid run on device rank=0 [2024-06-17 20:06:47,024] INFO: Rank 0: epoch=25 / 400 train_loss=7.0483 valid_loss=7.6320 stale=0 time=20.9m eta=7843.2m [2024-06-17 20:06:47,070] INFO: Initiating epoch #26 train run on device rank=0 [2024-06-17 20:26:33,485] INFO: Initiating epoch #26 valid run on device rank=0 [2024-06-17 20:27:41,401] INFO: Rank 0: epoch=26 / 400 train_loss=6.9978 valid_loss=7.5936 stale=0 time=20.91m eta=7822.1m [2024-06-17 20:27:41,448] INFO: Initiating epoch #27 train run on device rank=0 [2024-06-17 20:47:27,922] INFO: Initiating epoch #27 valid run on device rank=0 [2024-06-17 20:48:36,205] INFO: Rank 0: epoch=27 / 400 train_loss=6.9471 valid_loss=7.5589 stale=0 time=20.91m eta=7801.2m [2024-06-17 20:48:36,292] INFO: Initiating epoch #28 train run on device rank=0 [2024-06-17 21:08:22,923] INFO: Initiating epoch #28 valid run on device rank=0 [2024-06-17 21:09:31,087] INFO: Rank 0: epoch=28 / 400 train_loss=6.8975 valid_loss=7.5256 stale=0 time=20.91m eta=7780.3m [2024-06-17 21:09:31,167] INFO: Initiating epoch #29 train run on device rank=0 [2024-06-17 21:29:18,271] INFO: Initiating epoch #29 valid run on device rank=0 [2024-06-17 21:30:26,043] INFO: Rank 0: epoch=29 / 400 train_loss=6.8498 valid_loss=7.4971 stale=0 time=20.91m eta=7759.4m [2024-06-17 21:30:26,098] INFO: Initiating epoch #30 train run on device rank=0 [2024-06-17 21:50:13,654] INFO: Initiating epoch #30 valid run on device rank=0 [2024-06-17 21:51:21,194] INFO: Rank 0: epoch=30 / 400 train_loss=6.8044 valid_loss=7.4724 stale=0 time=20.92m eta=7738.5m [2024-06-17 21:51:21,236] INFO: Initiating epoch #31 train run on device rank=0 [2024-06-17 22:11:08,453] INFO: Initiating epoch #31 valid run on device rank=0 [2024-06-17 22:12:18,062] INFO: Rank 0: epoch=31 / 400 train_loss=6.7626 valid_loss=7.4507 stale=0 time=20.95m eta=7718.0m [2024-06-17 22:12:18,111] INFO: Initiating epoch #32 train run on device rank=0 [2024-06-17 22:32:05,016] INFO: Initiating epoch #32 valid run on device rank=0 [2024-06-17 22:33:12,965] INFO: Rank 0: epoch=32 / 400 train_loss=6.7224 valid_loss=7.4287 stale=0 time=20.91m eta=7697.1m [2024-06-17 22:33:13,027] INFO: Initiating epoch #33 train run on device rank=0 [2024-06-17 22:53:00,453] INFO: Initiating epoch #33 valid run on device rank=0 [2024-06-17 22:54:08,457] INFO: Rank 0: epoch=33 / 400 train_loss=6.6845 valid_loss=7.4118 stale=0 time=20.92m eta=7676.3m [2024-06-17 22:54:08,498] INFO: Initiating epoch #34 train run on device rank=0 [2024-06-17 23:13:55,981] INFO: Initiating epoch #34 valid run on device rank=0 [2024-06-17 23:15:03,878] INFO: Rank 0: epoch=34 / 400 train_loss=6.6491 valid_loss=7.3963 stale=0 time=20.92m eta=7655.4m [2024-06-17 23:15:03,924] INFO: Initiating epoch #35 train run on device rank=0 [2024-06-17 23:34:52,265] INFO: Initiating epoch #35 valid run on device rank=0 [2024-06-17 23:36:00,700] INFO: Rank 0: epoch=35 / 400 train_loss=6.6151 valid_loss=7.3829 stale=0 time=20.95m eta=7634.8m [2024-06-17 23:36:00,767] INFO: Initiating epoch #36 train run on device rank=0 [2024-06-17 23:55:48,488] INFO: Initiating epoch #36 valid run on device rank=0 [2024-06-17 23:56:56,523] INFO: Rank 0: epoch=36 / 400 train_loss=6.5829 valid_loss=7.3699 stale=0 time=20.93m eta=7614.0m [2024-06-17 23:56:56,563] INFO: Initiating epoch #37 train run on device rank=0 [2024-06-18 00:16:44,243] INFO: Initiating epoch #37 valid run on device rank=0 [2024-06-18 00:17:52,220] INFO: Rank 0: epoch=37 / 400 train_loss=6.5518 valid_loss=7.3605 stale=0 time=20.93m eta=7593.2m [2024-06-18 00:17:52,275] INFO: Initiating epoch #38 train run on device rank=0 [2024-06-18 00:37:39,471] INFO: Initiating epoch #38 valid run on device rank=0 [2024-06-18 00:38:47,672] INFO: Rank 0: epoch=38 / 400 train_loss=6.5223 valid_loss=7.3541 stale=0 time=20.92m eta=7572.4m [2024-06-18 00:38:48,011] INFO: Initiating epoch #39 train run on device rank=0 [2024-06-18 00:58:35,610] INFO: Initiating epoch #39 valid run on device rank=0 [2024-06-18 00:59:43,862] INFO: Rank 0: epoch=39 / 400 train_loss=6.4937 valid_loss=7.3448 stale=0 time=20.93m eta=7551.6m [2024-06-18 00:59:43,921] INFO: Initiating epoch #40 train run on device rank=0 [2024-06-18 01:19:31,970] INFO: Initiating epoch #40 valid run on device rank=0 [2024-06-18 01:20:40,246] INFO: Rank 0: epoch=40 / 400 train_loss=6.4660 valid_loss=7.3388 stale=0 time=20.94m eta=7530.9m [2024-06-18 01:20:40,409] INFO: Initiating epoch #41 train run on device rank=0 [2024-06-18 01:40:28,735] INFO: Initiating epoch #41 valid run on device rank=0 [2024-06-18 01:41:38,951] INFO: Rank 0: epoch=41 / 400 train_loss=6.4391 valid_loss=7.3318 stale=0 time=20.98m eta=7510.5m [2024-06-18 01:41:39,937] INFO: Initiating epoch #42 train run on device rank=0 [2024-06-18 02:01:27,570] INFO: Initiating epoch #42 valid run on device rank=0 [2024-06-18 02:02:35,865] INFO: Rank 0: epoch=42 / 400 train_loss=6.4133 valid_loss=7.3287 stale=0 time=20.93m eta=7489.8m [2024-06-18 02:02:35,999] INFO: Initiating epoch #43 train run on device rank=0 [2024-06-18 02:22:23,659] INFO: Initiating epoch #43 valid run on device rank=0 [2024-06-18 02:23:32,371] INFO: Rank 0: epoch=43 / 400 train_loss=6.3883 valid_loss=7.3264 stale=0 time=20.94m eta=7469.1m [2024-06-18 02:23:32,465] INFO: Initiating epoch #44 train run on device rank=0 [2024-06-18 02:43:19,780] INFO: Initiating epoch #44 valid run on device rank=0 [2024-06-18 02:44:27,883] INFO: Rank 0: epoch=44 / 400 train_loss=6.3651 valid_loss=7.3236 stale=0 time=20.92m eta=7448.2m [2024-06-18 02:44:27,937] INFO: Initiating epoch #45 train run on device rank=0 [2024-06-18 03:04:15,535] INFO: Initiating epoch #45 valid run on device rank=0 [2024-06-18 03:05:24,334] INFO: Rank 0: epoch=45 / 400 train_loss=6.3414 valid_loss=7.3212 stale=0 time=20.94m eta=7427.4m [2024-06-18 03:05:24,414] INFO: Initiating epoch #46 train run on device rank=0 [2024-06-18 03:25:12,105] INFO: Initiating epoch #46 valid run on device rank=0 [2024-06-18 03:26:20,628] INFO: Rank 0: epoch=46 / 400 train_loss=6.3188 valid_loss=7.3190 stale=0 time=20.94m eta=7406.6m [2024-06-18 03:26:21,021] INFO: Initiating epoch #47 train run on device rank=0 [2024-06-18 03:46:09,565] INFO: Initiating epoch #47 valid run on device rank=0 [2024-06-18 03:47:17,702] INFO: Rank 0: epoch=47 / 400 train_loss=6.2967 valid_loss=7.3184 stale=0 time=20.94m eta=7385.9m [2024-06-18 03:47:17,765] INFO: Initiating epoch #48 train run on device rank=0 [2024-06-18 04:07:05,603] INFO: Initiating epoch #48 valid run on device rank=0 [2024-06-18 04:08:14,137] INFO: Rank 0: epoch=48 / 400 train_loss=6.2751 valid_loss=7.3164 stale=0 time=20.94m eta=7365.1m [2024-06-18 04:08:14,181] INFO: Initiating epoch #49 train run on device rank=0 [2024-06-18 04:28:02,404] INFO: Initiating epoch #49 valid run on device rank=0 [2024-06-18 04:29:10,118] INFO: Rank 0: epoch=49 / 400 train_loss=6.2542 valid_loss=7.3175 stale=1 time=20.93m eta=7344.2m [2024-06-18 04:29:10,226] INFO: Initiating epoch #50 train run on device rank=0 [2024-06-18 04:48:58,163] INFO: Initiating epoch #50 valid run on device rank=0 [2024-06-18 04:50:05,931] INFO: Rank 0: epoch=50 / 400 train_loss=6.2332 valid_loss=7.3176 stale=2 time=20.93m eta=7323.4m [2024-06-18 04:50:05,990] INFO: Initiating epoch #51 train run on device rank=0 [2024-06-18 05:09:54,117] INFO: Initiating epoch #51 valid run on device rank=0 [2024-06-18 05:11:02,652] INFO: Rank 0: epoch=51 / 400 train_loss=6.2136 valid_loss=7.3194 stale=3 time=20.94m eta=7302.6m [2024-06-18 05:11:02,833] INFO: Initiating epoch #52 train run on device rank=0 [2024-06-18 05:30:50,905] INFO: Initiating epoch #52 valid run on device rank=0 [2024-06-18 05:31:58,207] INFO: Rank 0: epoch=52 / 400 train_loss=6.1938 valid_loss=7.3195 stale=4 time=20.92m eta=7281.7m [2024-06-18 05:31:58,240] INFO: Initiating epoch #53 train run on device rank=0 [2024-06-18 05:51:46,680] INFO: Initiating epoch #53 valid run on device rank=0 [2024-06-18 05:52:54,154] INFO: Rank 0: epoch=53 / 400 train_loss=6.1746 valid_loss=7.3233 stale=5 time=20.93m eta=7260.8m [2024-06-18 05:52:54,192] INFO: Initiating epoch #54 train run on device rank=0 [2024-06-18 06:12:41,738] INFO: Initiating epoch #54 valid run on device rank=0 [2024-06-18 06:13:49,290] INFO: Rank 0: epoch=54 / 400 train_loss=6.1558 valid_loss=7.3208 stale=6 time=20.92m eta=7239.8m [2024-06-18 06:13:49,328] INFO: Initiating epoch #55 train run on device rank=0 [2024-06-18 06:33:37,098] INFO: Initiating epoch #55 valid run on device rank=0 [2024-06-18 06:34:44,440] INFO: Rank 0: epoch=55 / 400 train_loss=6.1370 valid_loss=7.3222 stale=7 time=20.92m eta=7218.9m [2024-06-18 06:34:44,493] INFO: Initiating epoch #56 train run on device rank=0 [2024-06-18 06:54:31,438] INFO: Initiating epoch #56 valid run on device rank=0 [2024-06-18 06:55:38,582] INFO: Rank 0: epoch=56 / 400 train_loss=6.1189 valid_loss=7.3251 stale=8 time=20.9m eta=7197.8m [2024-06-18 06:55:38,624] INFO: Initiating epoch #57 train run on device rank=0 [2024-06-18 07:15:25,606] INFO: Initiating epoch #57 valid run on device rank=0 [2024-06-18 07:16:32,694] INFO: Rank 0: epoch=57 / 400 train_loss=6.1014 valid_loss=7.3238 stale=9 time=20.9m eta=7176.8m [2024-06-18 07:16:32,743] INFO: Initiating epoch #58 train run on device rank=0 [2024-06-18 07:36:19,861] INFO: Initiating epoch #58 valid run on device rank=0 [2024-06-18 07:37:27,371] INFO: Rank 0: epoch=58 / 400 train_loss=6.0839 valid_loss=7.3285 stale=10 time=20.91m eta=7155.8m [2024-06-18 07:37:27,421] INFO: Initiating epoch #59 train run on device rank=0 [2024-06-18 07:57:15,687] INFO: Initiating epoch #59 valid run on device rank=0 [2024-06-18 07:58:23,471] INFO: Rank 0: epoch=59 / 400 train_loss=6.0665 valid_loss=7.3311 stale=11 time=20.93m eta=7134.9m [2024-06-18 07:58:23,567] INFO: Initiating epoch #60 train run on device rank=0 [2024-06-18 08:18:11,412] INFO: Initiating epoch #60 valid run on device rank=0 [2024-06-18 08:19:18,945] INFO: Rank 0: epoch=60 / 400 train_loss=6.0496 valid_loss=7.3314 stale=12 time=20.92m eta=7114.0m [2024-06-18 08:19:18,971] INFO: Initiating epoch #61 train run on device rank=0 [2024-06-18 08:39:07,567] INFO: Initiating epoch #61 valid run on device rank=0 [2024-06-18 08:40:15,021] INFO: Rank 0: epoch=61 / 400 train_loss=6.0327 valid_loss=7.3331 stale=13 time=20.93m eta=7093.1m [2024-06-18 08:40:15,051] INFO: Initiating epoch #62 train run on device rank=0 [2024-06-18 09:00:03,058] INFO: Initiating epoch #62 valid run on device rank=0 [2024-06-18 09:01:10,843] INFO: Rank 0: epoch=62 / 400 train_loss=6.0161 valid_loss=7.3373 stale=14 time=20.93m eta=7072.2m [2024-06-18 09:01:10,879] INFO: Initiating epoch #63 train run on device rank=0 [2024-06-18 09:20:59,782] INFO: Initiating epoch #63 valid run on device rank=0 [2024-06-18 09:22:07,544] INFO: Rank 0: epoch=63 / 400 train_loss=5.9994 valid_loss=7.3382 stale=15 time=20.94m eta=7051.4m [2024-06-18 09:22:07,784] INFO: Initiating epoch #64 train run on device rank=0