[2024-06-14 08:27:38,729] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus
[2024-06-14 08:27:38,819] INFO: NVIDIA TITAN Xp
[2024-06-14 08:27:38,819] INFO: NVIDIA TITAN Xp
[2024-06-14 08:27:38,819] INFO: NVIDIA TITAN Xp
[2024-06-14 08:27:38,819] INFO: NVIDIA TITAN Xp
[2024-06-14 08:27:44,708] INFO: using dtype=torch.float32
[2024-06-14 08:27:45,231] INFO: using attention_type=math
[2024-06-14 08:27:45,250] INFO: using attention_type=math
[2024-06-14 08:27:45,266] INFO: using attention_type=math
[2024-06-14 08:27:45,283] INFO: using attention_type=math
[2024-06-14 08:27:45,299] INFO: using attention_type=math
[2024-06-14 08:27:45,315] INFO: using attention_type=math
[2024-06-14 08:27:49,899] INFO: mlpf_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'math'}
[2024-06-14 08:27:49,900] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/best_weights.pth
[2024-06-14 08:27:52,127] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-06-14 08:27:52,128] INFO: Backbone Trainable parameters: 0
[2024-06-14 08:27:52,129] INFO: Backbone Non-trainable parameters: 11671568
[2024-06-14 08:27:52,129] INFO: Backbone Total parameters: 11671568
[2024-06-14 08:27:52,134] INFO:
Modules                                Trainable parameters  Non-tranable parameters
module.nn0_id.0.weight                 0                     8704
module.nn0_id.0.bias                   0                     512
module.nn0_id.2.weight                 0                     512
module.nn0_id.2.bias                   0                     512
module.nn0_id.4.weight                 0                     262144
module.nn0_id.4.bias                   0                     512
module.nn0_reg.0.weight                0                     8704
module.nn0_reg.0.bias                  0                     512
module.nn0_reg.2.weight                0                     512
module.nn0_reg.2.bias                  0                     512
module.nn0_reg.4.weight                0                     262144
module.nn0_reg.4.bias                  0                     512
module.conv_id.0.mha.in_proj_weight    0                     786432
module.conv_id.0.mha.in_proj_bias      0                     1536
module.conv_id.0.mha.out_proj.weight   0                     262144
module.conv_id.0.mha.out_proj.bias     0                     512
module.conv_id.0.norm0.weight          0                     512
module.conv_id.0.norm0.bias            0                     512
module.conv_id.0.norm1.weight          0                     512
module.conv_id.0.norm1.bias            0                     512
module.conv_id.0.seq.0.weight          0                     262144
module.conv_id.0.seq.0.bias            0                     512
module.conv_id.0.seq.2.weight          0                     262144
module.conv_id.0.seq.2.bias            0                     512
module.conv_id.1.mha.in_proj_weight    0                     786432
module.conv_id.1.mha.in_proj_bias      0                     1536
module.conv_id.1.mha.out_proj.weight   0                     262144
module.conv_id.1.mha.out_proj.bias     0                     512
module.conv_id.1.norm0.weight          0                     512
module.conv_id.1.norm0.bias            0                     512
module.conv_id.1.norm1.weight          0                     512
module.conv_id.1.norm1.bias            0                     512
module.conv_id.1.seq.0.weight          0                     262144
module.conv_id.1.seq.0.bias            0                     512
module.conv_id.1.seq.2.weight          0                     262144
module.conv_id.1.seq.2.bias            0                     512
module.conv_id.2.mha.in_proj_weight    0                     786432
module.conv_id.2.mha.in_proj_bias      0                     1536
module.conv_id.2.mha.out_proj.weight   0                     262144
module.conv_id.2.mha.out_proj.bias     0                     512
module.conv_id.2.norm0.weight          0                     512
module.conv_id.2.norm0.bias            0                     512
module.conv_id.2.norm1.weight          0                     512
module.conv_id.2.norm1.bias            0                     512
module.conv_id.2.seq.0.weight          0                     262144
module.conv_id.2.seq.0.bias            0                     512
module.conv_id.2.seq.2.weight          0                     262144
module.conv_id.2.seq.2.bias            0                     512
module.conv_reg.0.mha.in_proj_weight   0                     786432
module.conv_reg.0.mha.in_proj_bias     0                     1536
module.conv_reg.0.mha.out_proj.weight  0                     262144
module.conv_reg.0.mha.out_proj.bias    0                     512
module.conv_reg.0.norm0.weight         0                     512
module.conv_reg.0.norm0.bias           0                     512
module.conv_reg.0.norm1.weight         0                     512
module.conv_reg.0.norm1.bias           0                     512
module.conv_reg.0.seq.0.weight         0                     262144
module.conv_reg.0.seq.0.bias           0                     512
module.conv_reg.0.seq.2.weight         0                     262144
module.conv_reg.0.seq.2.bias           0                     512
module.conv_reg.1.mha.in_proj_weight   0                     786432
module.conv_reg.1.mha.in_proj_bias     0                     1536
module.conv_reg.1.mha.out_proj.weight  0                     262144
module.conv_reg.1.mha.out_proj.bias    0                     512
module.conv_reg.1.norm0.weight         0                     512
module.conv_reg.1.norm0.bias           0                     512
module.conv_reg.1.norm1.weight         0                     512
module.conv_reg.1.norm1.bias           0                     512
module.conv_reg.1.seq.0.weight         0                     262144
module.conv_reg.1.seq.0.bias           0                     512
module.conv_reg.1.seq.2.weight         0                     262144
module.conv_reg.1.seq.2.bias           0                     512
module.conv_reg.2.mha.in_proj_weight   0                     786432
module.conv_reg.2.mha.in_proj_bias     0                     1536
module.conv_reg.2.mha.out_proj.weight  0                     262144
module.conv_reg.2.mha.out_proj.bias    0                     512
module.conv_reg.2.norm0.weight         0                     512
module.conv_reg.2.norm0.bias           0                     512
module.conv_reg.2.norm1.weight         0                     512
module.conv_reg.2.norm1.bias           0                     512
module.conv_reg.2.seq.0.weight         0                     262144
module.conv_reg.2.seq.0.bias           0                     512
module.conv_reg.2.seq.2.weight         0                     262144
module.conv_reg.2.seq.2.bias           0                     512
module.nn_id.0.weight                  0                     270848
module.nn_id.0.bias                    0                     512
module.nn_id.2.weight                  0                     512
module.nn_id.2.bias                    0                     512
module.nn_id.4.weight                  0                     3072
module.nn_id.4.bias                    0                     6
module.nn_pt.nn.0.weight               0                     273920
module.nn_pt.nn.0.bias                 0                     512
module.nn_pt.nn.2.weight               0                     512
module.nn_pt.nn.2.bias                 0                     512
module.nn_pt.nn.4.weight               0                     1024
module.nn_pt.nn.4.bias                 0                     2
module.nn_eta.nn.0.weight              0                     273920
module.nn_eta.nn.0.bias                0                     512
module.nn_eta.nn.2.weight              0                     512
module.nn_eta.nn.2.bias                0                     512
module.nn_eta.nn.4.weight              0                     1024
module.nn_eta.nn.4.bias                0                     2
module.nn_sin_phi.nn.0.weight          0                     273920
module.nn_sin_phi.nn.0.bias            0                     512
module.nn_sin_phi.nn.2.weight          0                     512
module.nn_sin_phi.nn.2.bias            0                     512
module.nn_sin_phi.nn.4.weight          0                     1024
module.nn_sin_phi.nn.4.bias            0                     2
module.nn_cos_phi.nn.0.weight          0                     273920
module.nn_cos_phi.nn.0.bias            0                     512
module.nn_cos_phi.nn.2.weight          0                     512
module.nn_cos_phi.nn.2.bias            0                     512
module.nn_cos_phi.nn.4.weight          0                     1024
module.nn_cos_phi.nn.4.bias            0                     2
module.nn_energy.nn.0.weight           0                     273920
module.nn_energy.nn.0.bias             0                     512
module.nn_energy.nn.2.weight           0                     512
module.nn_energy.nn.2.bias             0                     512
module.nn_energy.nn.4.weight           0                     1024
module.nn_energy.nn.4.bias             0                     2
[2024-06-14 08:27:52,224] INFO: DistributedDataParallel(
  (module): DeepMET(
    (nn): Sequential(
      (0): Linear(in_features=535, out_features=256, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0, inplace=False)
      (4): Linear(in_features=256, out_features=2, bias=True)
    )
  )
)
[2024-06-14 08:27:52,225] INFO: DeepMET Trainable parameters: 138242
[2024-06-14 08:27:52,225] INFO: DeepMET Non-trainable parameters: 0
[2024-06-14 08:27:52,225] INFO: DeepMET Total parameters: 138242
[2024-06-14 08:27:52,226] INFO:
Modules             Trainable parameters  Non-tranable parameters
module.nn.0.weight  136960                0
module.nn.0.bias    256                   0
module.nn.2.weight  256                   0
module.nn.2.bias    256                   0
module.nn.4.weight  512                   0
module.nn.4.bias    2                     0
[2024-06-14 08:27:52,227] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_FreezeBackbone_20240614_082738_634358
[2024-06-14 08:27:52,227] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_FreezeBackbone_20240614_082738_634358
[2024-06-14 08:27:52,295] INFO: train_dataset: clic_edm_ttbar_pf, 800800
[2024-06-14 08:27:52,420] INFO: valid_dataset: clic_edm_ttbar_pf, 200200
[2024-06-14 08:27:52,465] INFO: Initiating epoch #1 train run on device rank=0
[2024-06-14 08:34:38,793] INFO: Initiating epoch #1 valid run on device rank=0
[2024-06-14 08:36:10,619] INFO: Rank 0: epoch=1 / 400 train_loss=8.2990 valid_loss=8.3075 stale=0 time=8.3m eta=3312.7m
[2024-06-14 08:36:10,631] INFO: Initiating epoch #2 train run on device rank=0
[2024-06-14 08:42:49,667] INFO: Initiating epoch #2 valid run on device rank=0
[2024-06-14 08:44:21,163] INFO: Rank 0: epoch=2 / 400 train_loss=7.9741 valid_loss=8.1842 stale=0 time=8.18m eta=3279.2m
[2024-06-14 08:44:21,230] INFO: Initiating epoch #3 train run on device rank=0
[2024-06-14 08:51:00,025] INFO: Initiating epoch #3 valid run on device rank=0
[2024-06-14 08:52:31,170] INFO: Rank 0: epoch=3 / 400 train_loss=7.8952 valid_loss=8.1265 stale=0 time=8.17m eta=3261.4m
[2024-06-14 08:52:31,235] INFO: Initiating epoch #4 train run on device rank=0
[2024-06-14 08:59:10,425] INFO: Initiating epoch #4 valid run on device rank=0
[2024-06-14 09:00:41,938] INFO: Rank 0: epoch=4 / 400 train_loss=7.8446 valid_loss=8.0880 stale=0 time=8.18m eta=3249.6m
[2024-06-14 09:00:42,021] INFO: Initiating epoch #5 train run on device rank=0
[2024-06-14 09:07:20,353] INFO: Initiating epoch #5 valid run on device rank=0
[2024-06-14 09:08:52,170] INFO: Rank 0: epoch=5 / 400 train_loss=7.8073 valid_loss=8.0417 stale=0 time=8.17m eta=3238.6m
[2024-06-14 09:08:52,213] INFO: Initiating epoch #6 train run on device rank=0
[2024-06-14 09:15:31,858] INFO: Initiating epoch #6 valid run on device rank=0
[2024-06-14 09:17:03,452] INFO: Rank 0: epoch=6 / 400 train_loss=7.7774 valid_loss=8.0008 stale=0 time=8.19m eta=3229.7m
[2024-06-14 09:17:03,541] INFO: Initiating epoch #7 train run on device rank=0
[2024-06-14 09:23:42,468] INFO: Initiating epoch #7 valid run on device rank=0
[2024-06-14 09:25:13,701] INFO: Rank 0: epoch=7 / 400 train_loss=7.7525 valid_loss=7.9749 stale=0 time=8.17m eta=3220.0m
[2024-06-14 09:25:13,763] INFO: Initiating epoch #8 train run on device rank=0
[2024-06-14 09:31:52,748] INFO: Initiating epoch #8 valid run on device rank=0
[2024-06-14 09:33:25,829] INFO: Rank 0: epoch=8 / 400 train_loss=7.7308 valid_loss=7.9557 stale=0 time=8.2m eta=3212.2m
[2024-06-14 09:33:25,918] INFO: Initiating epoch #9 train run on device rank=0
[2024-06-14 09:40:05,478] INFO: Initiating epoch #9 valid run on device rank=0
[2024-06-14 09:41:37,081] INFO: Rank 0: epoch=9 / 400 train_loss=7.7112 valid_loss=7.9376 stale=0 time=8.19m eta=3203.7m
[2024-06-14 09:41:37,212] INFO: Initiating epoch #10 train run on device rank=0
[2024-06-14 09:48:16,616] INFO: Initiating epoch #10 valid run on device rank=0
[2024-06-14 09:49:48,290] INFO: Rank 0: epoch=10 / 400 train_loss=7.6925 valid_loss=7.9210 stale=0 time=8.18m eta=3195.3m
[2024-06-14 09:49:48,354] INFO: Initiating epoch #11 train run on device rank=0
[2024-06-14 09:56:26,428] INFO: Initiating epoch #11 valid run on device rank=0
[2024-06-14 09:57:57,889] INFO: Rank 0: epoch=11 / 400 train_loss=7.6754 valid_loss=7.9052 stale=0 time=8.16m eta=3185.9m
[2024-06-14 09:57:57,928] INFO: Initiating epoch #12 train run on device rank=0
[2024-06-14 10:04:36,613] INFO: Initiating epoch #12 valid run on device rank=0
[2024-06-14 10:06:07,954] INFO: Rank 0: epoch=12 / 400 train_loss=7.6591 valid_loss=7.8911 stale=0 time=8.17m eta=3177.0m
[2024-06-14 10:06:07,987] INFO: Initiating epoch #13 train run on device rank=0
[2024-06-14 10:12:46,279] INFO: Initiating epoch #13 valid run on device rank=0
[2024-06-14 10:14:17,795] INFO: Rank 0: epoch=13 / 400 train_loss=7.6436 valid_loss=7.8780 stale=0 time=8.16m eta=3168.1m
[2024-06-14 10:14:17,902] INFO: Initiating epoch #14 train run on device rank=0
[2024-06-14 10:20:57,059] INFO: Initiating epoch #14 valid run on device rank=0
[2024-06-14 10:22:28,717] INFO: Rank 0: epoch=14 / 400 train_loss=7.6289 valid_loss=7.8665 stale=0 time=8.18m eta=3159.8m
[2024-06-14 10:22:28,807] INFO: Initiating epoch #15 train run on device rank=0
[2024-06-14 10:29:08,560] INFO: Initiating epoch #15 valid run on device rank=0
[2024-06-14 10:30:39,572] INFO: Rank 0: epoch=15 / 400 train_loss=7.6151 valid_loss=7.8561 stale=0 time=8.18m eta=3151.5m
[2024-06-14 10:30:39,642] INFO: Initiating epoch #16 train run on device rank=0
[2024-06-14 10:37:19,292] INFO: Initiating epoch #16 valid run on device rank=0
[2024-06-14 10:38:51,267] INFO: Rank 0: epoch=16 / 400 train_loss=7.6021 valid_loss=7.8472 stale=0 time=8.19m eta=3143.5m
[2024-06-14 10:38:51,309] INFO: Initiating epoch #17 train run on device rank=0
[2024-06-14 10:45:31,299] INFO: Initiating epoch #17 valid run on device rank=0
[2024-06-14 10:47:03,310] INFO: Rank 0: epoch=17 / 400 train_loss=7.5891 valid_loss=7.8371 stale=0 time=8.2m eta=3135.7m
[2024-06-14 10:47:03,343] INFO: Initiating epoch #18 train run on device rank=0
[2024-06-14 10:53:43,250] INFO: Initiating epoch #18 valid run on device rank=0
[2024-06-14 10:55:14,538] INFO: Rank 0: epoch=18 / 400 train_loss=7.5763 valid_loss=7.8298 stale=0 time=8.19m eta=3127.5m
[2024-06-14 10:55:14,591] INFO: Initiating epoch #19 train run on device rank=0
[2024-06-14 11:01:54,791] INFO: Initiating epoch #19 valid run on device rank=0
[2024-06-14 11:03:25,650] INFO: Rank 0: epoch=19 / 400 train_loss=7.5644 valid_loss=7.8223 stale=0 time=8.18m eta=3119.2m
[2024-06-14 11:03:25,697] INFO: Initiating epoch #20 train run on device rank=0
[2024-06-14 11:10:05,372] INFO: Initiating epoch #20 valid run on device rank=0
[2024-06-14 11:11:36,701] INFO: Rank 0: epoch=20 / 400 train_loss=7.5523 valid_loss=7.8165 stale=0 time=8.18m eta=3111.0m
[2024-06-14 11:11:36,787] INFO: Initiating epoch #21 train run on device rank=0
[2024-06-14 11:18:16,351] INFO: Initiating epoch #21 valid run on device rank=0
[2024-06-14 11:19:47,198] INFO: Rank 0: epoch=21 / 400 train_loss=7.5413 valid_loss=7.8105 stale=0 time=8.17m eta=3102.6m
[2024-06-14 11:19:47,246] INFO: Initiating epoch #22 train run on device rank=0
[2024-06-14 11:26:26,362] INFO: Initiating epoch #22 valid run on device rank=0
[2024-06-14 11:27:57,545] INFO: Rank 0: epoch=22 / 400 train_loss=7.5307 valid_loss=7.8047 stale=0 time=8.17m eta=3094.2m
[2024-06-14 11:27:57,776] INFO: Initiating epoch #23 train run on device rank=0
[2024-06-14 11:34:37,492] INFO: Initiating epoch #23 valid run on device rank=0
[2024-06-14 11:36:07,964] INFO: Rank 0: epoch=23 / 400 train_loss=7.5207 valid_loss=7.8009 stale=0 time=8.17m eta=3085.8m
[2024-06-14 11:36:08,013] INFO: Initiating epoch #24 train run on device rank=0
[2024-06-14 11:42:47,784] INFO: Initiating epoch #24 valid run on device rank=0
[2024-06-14 11:44:19,539] INFO: Rank 0: epoch=24 / 400 train_loss=7.5113 valid_loss=7.7975 stale=0 time=8.19m eta=3077.7m
[2024-06-14 11:44:19,617] INFO: Initiating epoch #25 train run on device rank=0
[2024-06-14 11:50:58,622] INFO: Initiating epoch #25 valid run on device rank=0
[2024-06-14 11:52:30,361] INFO: Rank 0: epoch=25 / 400 train_loss=7.5018 valid_loss=7.7948 stale=0 time=8.18m eta=3069.5m
[2024-06-14 11:52:30,425] INFO: Initiating epoch #26 train run on device rank=0
[2024-06-14 11:59:09,923] INFO: Initiating epoch #26 valid run on device rank=0
[2024-06-14 12:00:41,390] INFO: Rank 0: epoch=26 / 400 train_loss=7.4929 valid_loss=7.7927 stale=0 time=8.18m eta=3061.3m
[2024-06-14 12:00:41,493] INFO: Initiating epoch #27 train run on device rank=0
[2024-06-14 12:07:20,688] INFO: Initiating epoch #27 valid run on device rank=0
[2024-06-14 12:08:52,499] INFO: Rank 0: epoch=27 / 400 train_loss=7.4843 valid_loss=7.7900 stale=0 time=8.18m eta=3053.1m
[2024-06-14 12:08:52,558] INFO: Initiating epoch #28 train run on device rank=0
[2024-06-14 12:15:31,720] INFO: Initiating epoch #28 valid run on device rank=0
[2024-06-14 12:17:02,872] INFO: Rank 0: epoch=28 / 400 train_loss=7.4760 valid_loss=7.7872 stale=0 time=8.17m eta=3044.7m
[2024-06-14 12:17:02,919] INFO: Initiating epoch #29 train run on device rank=0
[2024-06-14 12:23:41,833] INFO: Initiating epoch #29 valid run on device rank=0
[2024-06-14 12:25:13,038] INFO: Rank 0: epoch=29 / 400 train_loss=7.4678 valid_loss=7.7861 stale=0 time=8.17m eta=3036.4m
[2024-06-14 12:25:13,084] INFO: Initiating epoch #30 train run on device rank=0
[2024-06-14 12:31:52,065] INFO: Initiating epoch #30 valid run on device rank=0
[2024-06-14 12:33:23,111] INFO: Rank 0: epoch=30 / 400 train_loss=7.4600 valid_loss=7.7845 stale=0 time=8.17m eta=3028.0m
[2024-06-14 12:33:23,166] INFO: Initiating epoch #31 train run on device rank=0
[2024-06-14 12:40:02,589] INFO: Initiating epoch #31 valid run on device rank=0
[2024-06-14 12:41:33,373] INFO: Rank 0: epoch=31 / 400 train_loss=7.4521 valid_loss=7.7850 stale=1 time=8.17m eta=3019.6m
[2024-06-14 12:41:33,410] INFO: Initiating epoch #32 train run on device rank=0
[2024-06-14 12:48:13,411] INFO: Initiating epoch #32 valid run on device rank=0
[2024-06-14 12:49:44,064] INFO: Rank 0: epoch=32 / 400 train_loss=7.4439 valid_loss=7.7849 stale=2 time=8.18m eta=3011.4m
[2024-06-14 12:49:44,105] INFO: Initiating epoch #33 train run on device rank=0
[2024-06-14 12:56:23,400] INFO: Initiating epoch #33 valid run on device rank=0
[2024-06-14 12:57:53,972] INFO: Rank 0: epoch=33 / 400 train_loss=7.4375 valid_loss=7.7859 stale=3 time=8.16m eta=3003.0m
[2024-06-14 12:57:54,035] INFO: Initiating epoch #34 train run on device rank=0
[2024-06-14 13:04:33,735] INFO: Initiating epoch #34 valid run on device rank=0
[2024-06-14 13:06:04,422] INFO: Rank 0: epoch=34 / 400 train_loss=7.4291 valid_loss=7.7842 stale=0 time=8.17m eta=2994.7m
[2024-06-14 13:06:04,470] INFO: Initiating epoch #35 train run on device rank=0
[2024-06-14 13:12:44,242] INFO: Initiating epoch #35 valid run on device rank=0
[2024-06-14 13:14:15,052] INFO: Rank 0: epoch=35 / 400 train_loss=7.4215 valid_loss=7.7828 stale=0 time=8.18m eta=2986.5m
[2024-06-14 13:14:15,089] INFO: Initiating epoch #36 train run on device rank=0
[2024-06-14 13:20:54,731] INFO: Initiating epoch #36 valid run on device rank=0
[2024-06-14 13:22:25,415] INFO: Rank 0: epoch=36 / 400 train_loss=7.4148 valid_loss=7.7863 stale=1 time=8.17m eta=2978.2m
[2024-06-14 13:22:25,470] INFO: Initiating epoch #37 train run on device rank=0
[2024-06-14 13:29:04,493] INFO: Initiating epoch #37 valid run on device rank=0
[2024-06-14 13:30:35,397] INFO: Rank 0: epoch=37 / 400 train_loss=7.4078 valid_loss=7.7869 stale=2 time=8.17m eta=2969.9m
[2024-06-14 13:30:35,436] INFO: Initiating epoch #38 train run on device rank=0
[2024-06-14 13:37:13,822] INFO: Initiating epoch #38 valid run on device rank=0
[2024-06-14 13:38:44,200] INFO: Rank 0: epoch=38 / 400 train_loss=7.4005 valid_loss=7.7891 stale=3 time=8.15m eta=2961.4m
[2024-06-14 13:38:44,254] INFO: Initiating epoch #39 train run on device rank=0
[2024-06-14 13:45:22,712] INFO: Initiating epoch #39 valid run on device rank=0
[2024-06-14 13:46:53,759] INFO: Rank 0: epoch=39 / 400 train_loss=7.3932 valid_loss=7.7876 stale=4 time=8.16m eta=2953.0m
[2024-06-14 13:46:53,832] INFO: Initiating epoch #40 train run on device rank=0
[2024-06-14 13:53:32,094] INFO: Initiating epoch #40 valid run on device rank=0
[2024-06-14 13:55:02,374] INFO: Rank 0: epoch=40 / 400 train_loss=7.3873 valid_loss=7.7921 stale=5 time=8.14m eta=2944.5m
[2024-06-14 13:55:02,467] INFO: Initiating epoch #41 train run on device rank=0
[2024-06-14 14:01:40,026] INFO: Initiating epoch #41 valid run on device rank=0
[2024-06-14 14:03:10,531] INFO: Rank 0: epoch=41 / 400 train_loss=7.3810 valid_loss=7.7921 stale=6 time=8.13m eta=2935.9m
[2024-06-14 14:03:10,583] INFO: Initiating epoch #42 train run on device rank=0
[2024-06-14 14:09:49,012] INFO: Initiating epoch #42 valid run on device rank=0
[2024-06-14 14:11:19,923] INFO: Rank 0: epoch=42 / 400 train_loss=7.3743 valid_loss=7.7907 stale=7 time=8.16m eta=2927.6m
[2024-06-14 14:11:19,975] INFO: Initiating epoch #43 train run on device rank=0
[2024-06-14 14:17:57,838] INFO: Initiating epoch #43 valid run on device rank=0
[2024-06-14 14:19:28,648] INFO: Rank 0: epoch=43 / 400 train_loss=7.3680 valid_loss=7.7890 stale=8 time=8.14m eta=2919.1m
[2024-06-14 14:19:28,692] INFO: Initiating epoch #44 train run on device rank=0
[2024-06-14 14:26:06,442] INFO: Initiating epoch #44 valid run on device rank=0
[2024-06-14 14:27:36,798] INFO: Rank 0: epoch=44 / 400 train_loss=7.3622 valid_loss=7.7931 stale=9 time=8.14m eta=2910.6m
[2024-06-14 14:27:36,831] INFO: Initiating epoch #45 train run on device rank=0
[2024-06-14 14:34:15,554] INFO: Initiating epoch #45 valid run on device rank=0
[2024-06-14 14:35:45,904] INFO: Rank 0: epoch=45 / 400 train_loss=7.3564 valid_loss=7.7949 stale=10 time=8.15m eta=2902.2m
[2024-06-14 14:35:45,997] INFO: Initiating epoch #46 train run on device rank=0
[2024-06-14 14:42:24,544] INFO: Initiating epoch #46 valid run on device rank=0
[2024-06-14 14:43:55,815] INFO: Rank 0: epoch=46 / 400 train_loss=7.3504 valid_loss=7.7892 stale=11 time=8.16m eta=2894.0m
[2024-06-14 14:43:55,862] INFO: Initiating epoch #47 train run on device rank=0
[2024-06-14 14:50:33,924] INFO: Initiating epoch #47 valid run on device rank=0
[2024-06-14 14:52:05,794] INFO: Rank 0: epoch=47 / 400 train_loss=7.3453 valid_loss=7.8007 stale=12 time=8.17m eta=2885.8m
[2024-06-14 14:52:05,861] INFO: Initiating epoch #48 train run on device rank=0
[2024-06-14 14:58:44,210] INFO: Initiating epoch #48 valid run on device rank=0
[2024-06-14 15:00:15,075] INFO: Rank 0: epoch=48 / 400 train_loss=7.3392 valid_loss=7.8045 stale=13 time=8.15m eta=2877.4m
[2024-06-14 15:00:15,098] INFO: Initiating epoch #49 train run on device rank=0
[2024-06-14 15:06:52,878] INFO: Initiating epoch #49 valid run on device rank=0
[2024-06-14 15:08:23,312] INFO: Rank 0: epoch=49 / 400 train_loss=7.3334 valid_loss=7.7991 stale=14 time=8.14m eta=2869.0m
[2024-06-14 15:08:23,366] INFO: Initiating epoch #50 train run on device rank=0
[2024-06-14 15:15:01,284] INFO: Initiating epoch #50 valid run on device rank=0
[2024-06-14 15:16:32,294] INFO: Rank 0: epoch=50 / 400 train_loss=7.3290 valid_loss=7.7980 stale=15 time=8.15m eta=2860.6m
[2024-06-14 15:16:32,338] INFO: Initiating epoch #51 train run on device rank=0
[2024-06-14 15:23:11,014] INFO: Initiating epoch #51 valid run on device rank=0
[2024-06-14 15:24:42,252] INFO: Rank 0: epoch=51 / 400 train_loss=7.3235 valid_loss=7.8024 stale=16 time=8.17m eta=2852.4m
[2024-06-14 15:24:42,306] INFO: Initiating epoch #52 train run on device rank=0
[2024-06-14 15:31:20,499] INFO: Initiating epoch #52 valid run on device rank=0
[2024-06-14 15:32:50,971] INFO: Rank 0: epoch=52 / 400 train_loss=7.3189 valid_loss=7.8082 stale=17 time=8.14m eta=2844.1m
[2024-06-14 15:32:51,008] INFO: Initiating epoch #53 train run on device rank=0
[2024-06-14 15:39:28,915] INFO: Initiating epoch #53 valid run on device rank=0
[2024-06-14 15:40:59,991] INFO: Rank 0: epoch=53 / 400 train_loss=7.3143 valid_loss=7.8077 stale=18 time=8.15m eta=2835.7m
[2024-06-14 15:41:00,063] INFO: Initiating epoch #54 train run on device rank=0
[2024-06-14 15:47:37,903] INFO: Initiating epoch #54 valid run on device rank=0
[2024-06-14 15:49:08,818] INFO: Rank 0: epoch=54 / 400 train_loss=7.3083 valid_loss=7.8082 stale=19 time=8.15m eta=2827.4m
[2024-06-14 15:49:08,865] INFO: Initiating epoch #55 train run on device rank=0
[2024-06-14 15:55:46,816] INFO: Initiating epoch #55 valid run on device rank=0
[2024-06-14 15:57:17,009] INFO: Rank 0: epoch=55 / 400 train_loss=7.3035 valid_loss=7.8132 stale=20 time=8.14m eta=2819.0m
[2024-06-14 15:57:17,041] INFO: Initiating epoch #56 train run on device rank=0
[2024-06-14 16:03:54,859] INFO: Initiating epoch #56 valid run on device rank=0
[2024-06-14 16:05:24,995] INFO: Rank 0: epoch=56 / 400 train_loss=7.2985 valid_loss=7.8176 stale=21 time=8.13m eta=2810.6m
[2024-06-14 16:05:25,048] INFO: Initiating epoch #57 train run on device rank=0
[2024-06-14 16:12:03,163] INFO: Initiating epoch #57 valid run on device rank=0
[2024-06-14 16:13:33,503] INFO: Rank 0: epoch=57 / 400 train_loss=7.2941 valid_loss=7.8200 stale=22 time=8.14m eta=2802.3m
[2024-06-14 16:13:33,541] INFO: Initiating epoch #58 train run on device rank=0
[2024-06-14 16:20:11,559] INFO: Initiating epoch #58 valid run on device rank=0
[2024-06-14 16:21:42,288] INFO: Rank 0: epoch=58 / 400 train_loss=7.2890 valid_loss=7.8168 stale=23 time=8.15m eta=2794.0m
[2024-06-14 16:21:42,330] INFO: Initiating epoch #59 train run on device rank=0
[2024-06-14 16:28:20,407] INFO: Initiating epoch #59 valid run on device rank=0
[2024-06-14 16:29:51,728] INFO: Rank 0: epoch=59 / 400 train_loss=7.2842 valid_loss=7.8171 stale=24 time=8.16m eta=2785.7m
[2024-06-14 16:29:51,816] INFO: Initiating epoch #60 train run on device rank=0
[2024-06-14 16:36:29,612] INFO: Initiating epoch #60 valid run on device rank=0
[2024-06-14 16:37:59,614] INFO: Rank 0: epoch=60 / 400 train_loss=7.2796 valid_loss=7.8209 stale=25 time=8.13m eta=2777.3m
[2024-06-14 16:37:59,657] INFO: Initiating epoch #61 train run on device rank=0
[2024-06-14 16:44:37,059] INFO: Initiating epoch #61 valid run on device rank=0
[2024-06-14 16:46:08,101] INFO: Rank 0: epoch=61 / 400 train_loss=7.2754 valid_loss=7.8299 stale=26 time=8.14m eta=2769.0m
[2024-06-14 16:46:08,136] INFO: Initiating epoch #62 train run on device rank=0
[2024-06-14 16:52:46,477] INFO: Initiating epoch #62 valid run on device rank=0
[2024-06-14 16:54:16,909] INFO: Rank 0: epoch=62 / 400 train_loss=7.2713 valid_loss=7.8259 stale=27 time=8.15m eta=2760.7m
[2024-06-14 16:54:17,207] INFO: Initiating epoch #63 train run on device rank=0
[2024-06-14 17:00:55,612] INFO: Initiating epoch #63 valid run on device rank=0
[2024-06-14 17:02:26,655] INFO: Rank 0: epoch=63 / 400 train_loss=7.2659 valid_loss=7.8274 stale=28 time=8.16m eta=2752.5m
[2024-06-14 17:02:26,699] INFO: Initiating epoch #64 train run on device rank=0
[2024-06-14 17:09:04,607] INFO: Initiating epoch #64 valid run on device rank=0
[2024-06-14 17:10:35,556] INFO: Rank 0: epoch=64 / 400 train_loss=7.2622 valid_loss=7.8255 stale=29 time=8.15m eta=2744.3m
[2024-06-14 17:10:35,612] INFO: Initiating epoch #65 train run on device rank=0
[2024-06-14 17:17:14,835] INFO: Initiating epoch #65 valid run on device rank=0
[2024-06-14 17:18:46,706] INFO: Rank 0: epoch=65 / 400 train_loss=7.2578 valid_loss=7.8282 stale=30 time=8.18m eta=2736.2m
[2024-06-14 17:18:46,757] INFO: Initiating epoch #66 train run on device rank=0
[2024-06-14 17:25:25,266] INFO: Initiating epoch #66 valid run on device rank=0