[2024-06-17 11:23:40,848] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus [2024-06-17 11:23:40,939] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-17 11:23:40,939] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-17 11:23:40,939] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-17 11:23:40,939] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-17 11:23:45,079] INFO: using dtype=torch.float32 [2024-06-17 11:23:46,151] INFO: using attention_type=math [2024-06-17 11:23:46,162] INFO: using attention_type=math [2024-06-17 11:23:46,173] INFO: using attention_type=math [2024-06-17 11:23:46,183] INFO: using attention_type=math [2024-06-17 11:23:46,194] INFO: using attention_type=math [2024-06-17 11:23:46,204] INFO: using attention_type=math [2024-06-17 11:23:50,390] INFO: mlpf_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'math'} [2024-06-17 11:23:50,390] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/best_weights.pth [2024-06-17 11:23:51,679] INFO: DistributedDataParallel( (module): MLPF( (nn0_id): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (nn0_reg): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (conv_id): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (conv_reg): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (nn_id): Sequential( (0): Linear(in_features=529, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=6, bias=True) ) (nn_pt): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_eta): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_sin_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_cos_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_energy): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) ) ) [2024-06-17 11:23:51,680] INFO: Backbone Trainable parameters: 11671568 [2024-06-17 11:23:51,680] INFO: Backbone Non-trainable parameters: 0 [2024-06-17 11:23:51,680] INFO: Backbone Total parameters: 11671568 [2024-06-17 11:23:51,683] INFO: Modules Trainable parameters Non-tranable parameters module.nn0_id.0.weight 8704 0 module.nn0_id.0.bias 512 0 module.nn0_id.2.weight 512 0 module.nn0_id.2.bias 512 0 module.nn0_id.4.weight 262144 0 module.nn0_id.4.bias 512 0 module.nn0_reg.0.weight 8704 0 module.nn0_reg.0.bias 512 0 module.nn0_reg.2.weight 512 0 module.nn0_reg.2.bias 512 0 module.nn0_reg.4.weight 262144 0 module.nn0_reg.4.bias 512 0 module.conv_id.0.mha.in_proj_weight 786432 0 module.conv_id.0.mha.in_proj_bias 1536 0 module.conv_id.0.mha.out_proj.weight 262144 0 module.conv_id.0.mha.out_proj.bias 512 0 module.conv_id.0.norm0.weight 512 0 module.conv_id.0.norm0.bias 512 0 module.conv_id.0.norm1.weight 512 0 module.conv_id.0.norm1.bias 512 0 module.conv_id.0.seq.0.weight 262144 0 module.conv_id.0.seq.0.bias 512 0 module.conv_id.0.seq.2.weight 262144 0 module.conv_id.0.seq.2.bias 512 0 module.conv_id.1.mha.in_proj_weight 786432 0 module.conv_id.1.mha.in_proj_bias 1536 0 module.conv_id.1.mha.out_proj.weight 262144 0 module.conv_id.1.mha.out_proj.bias 512 0 module.conv_id.1.norm0.weight 512 0 module.conv_id.1.norm0.bias 512 0 module.conv_id.1.norm1.weight 512 0 module.conv_id.1.norm1.bias 512 0 module.conv_id.1.seq.0.weight 262144 0 module.conv_id.1.seq.0.bias 512 0 module.conv_id.1.seq.2.weight 262144 0 module.conv_id.1.seq.2.bias 512 0 module.conv_id.2.mha.in_proj_weight 786432 0 module.conv_id.2.mha.in_proj_bias 1536 0 module.conv_id.2.mha.out_proj.weight 262144 0 module.conv_id.2.mha.out_proj.bias 512 0 module.conv_id.2.norm0.weight 512 0 module.conv_id.2.norm0.bias 512 0 module.conv_id.2.norm1.weight 512 0 module.conv_id.2.norm1.bias 512 0 module.conv_id.2.seq.0.weight 262144 0 module.conv_id.2.seq.0.bias 512 0 module.conv_id.2.seq.2.weight 262144 0 module.conv_id.2.seq.2.bias 512 0 module.conv_reg.0.mha.in_proj_weight 786432 0 module.conv_reg.0.mha.in_proj_bias 1536 0 module.conv_reg.0.mha.out_proj.weight 262144 0 module.conv_reg.0.mha.out_proj.bias 512 0 module.conv_reg.0.norm0.weight 512 0 module.conv_reg.0.norm0.bias 512 0 module.conv_reg.0.norm1.weight 512 0 module.conv_reg.0.norm1.bias 512 0 module.conv_reg.0.seq.0.weight 262144 0 module.conv_reg.0.seq.0.bias 512 0 module.conv_reg.0.seq.2.weight 262144 0 module.conv_reg.0.seq.2.bias 512 0 module.conv_reg.1.mha.in_proj_weight 786432 0 module.conv_reg.1.mha.in_proj_bias 1536 0 module.conv_reg.1.mha.out_proj.weight 262144 0 module.conv_reg.1.mha.out_proj.bias 512 0 module.conv_reg.1.norm0.weight 512 0 module.conv_reg.1.norm0.bias 512 0 module.conv_reg.1.norm1.weight 512 0 module.conv_reg.1.norm1.bias 512 0 module.conv_reg.1.seq.0.weight 262144 0 module.conv_reg.1.seq.0.bias 512 0 module.conv_reg.1.seq.2.weight 262144 0 module.conv_reg.1.seq.2.bias 512 0 module.conv_reg.2.mha.in_proj_weight 786432 0 module.conv_reg.2.mha.in_proj_bias 1536 0 module.conv_reg.2.mha.out_proj.weight 262144 0 module.conv_reg.2.mha.out_proj.bias 512 0 module.conv_reg.2.norm0.weight 512 0 module.conv_reg.2.norm0.bias 512 0 module.conv_reg.2.norm1.weight 512 0 module.conv_reg.2.norm1.bias 512 0 module.conv_reg.2.seq.0.weight 262144 0 module.conv_reg.2.seq.0.bias 512 0 module.conv_reg.2.seq.2.weight 262144 0 module.conv_reg.2.seq.2.bias 512 0 module.nn_id.0.weight 270848 0 module.nn_id.0.bias 512 0 module.nn_id.2.weight 512 0 module.nn_id.2.bias 512 0 module.nn_id.4.weight 3072 0 module.nn_id.4.bias 6 0 module.nn_pt.nn.0.weight 273920 0 module.nn_pt.nn.0.bias 512 0 module.nn_pt.nn.2.weight 512 0 module.nn_pt.nn.2.bias 512 0 module.nn_pt.nn.4.weight 1024 0 module.nn_pt.nn.4.bias 2 0 module.nn_eta.nn.0.weight 273920 0 module.nn_eta.nn.0.bias 512 0 module.nn_eta.nn.2.weight 512 0 module.nn_eta.nn.2.bias 512 0 module.nn_eta.nn.4.weight 1024 0 module.nn_eta.nn.4.bias 2 0 module.nn_sin_phi.nn.0.weight 273920 0 module.nn_sin_phi.nn.0.bias 512 0 module.nn_sin_phi.nn.2.weight 512 0 module.nn_sin_phi.nn.2.bias 512 0 module.nn_sin_phi.nn.4.weight 1024 0 module.nn_sin_phi.nn.4.bias 2 0 module.nn_cos_phi.nn.0.weight 273920 0 module.nn_cos_phi.nn.0.bias 512 0 module.nn_cos_phi.nn.2.weight 512 0 module.nn_cos_phi.nn.2.bias 512 0 module.nn_cos_phi.nn.4.weight 1024 0 module.nn_cos_phi.nn.4.bias 2 0 module.nn_energy.nn.0.weight 273920 0 module.nn_energy.nn.0.bias 512 0 module.nn_energy.nn.2.weight 512 0 module.nn_energy.nn.2.bias 512 0 module.nn_energy.nn.4.weight 1024 0 module.nn_energy.nn.4.bias 2 0 [2024-06-17 11:23:51,742] INFO: DistributedDataParallel( (module): DeepMET( (nn): Sequential( (0): Linear(in_features=535, out_features=256, bias=True) (1): ELU(alpha=1.0) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0, inplace=False) (4): Linear(in_features=256, out_features=2, bias=True) ) ) ) [2024-06-17 11:23:51,742] INFO: DeepMET Trainable parameters: 138242 [2024-06-17 11:23:51,742] INFO: DeepMET Non-trainable parameters: 0 [2024-06-17 11:23:51,742] INFO: DeepMET Total parameters: 138242 [2024-06-17 11:23:51,743] INFO: Modules Trainable parameters Non-tranable parameters module.nn.0.weight 136960 0 module.nn.0.bias 256 0 module.nn.2.weight 256 0 module.nn.2.bias 256 0 module.nn.4.weight 512 0 module.nn.4.bias 2 0 [2024-06-17 11:23:51,752] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_FloatBackbone_20240617_112340_776429 [2024-06-17 11:23:51,752] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_FloatBackbone_20240617_112340_776429 [2024-06-17 11:23:51,778] INFO: train_dataset: clic_edm_ttbar_pf, 800800 [2024-06-17 11:24:00,261] INFO: valid_dataset: clic_edm_ttbar_pf, 200200 [2024-06-17 11:24:00,303] INFO: Initiating epoch #1 train run on device rank=0 [2024-06-17 11:38:40,454] INFO: Initiating epoch #1 valid run on device rank=0 [2024-06-17 11:39:46,717] INFO: Rank 0: epoch=1 / 400 train_loss=32.0535 valid_loss=25.9647 stale=0 time=15.77m eta=6293.7m [2024-06-17 11:39:46,718] INFO: Initiating epoch #2 train run on device rank=0 [2024-06-17 11:54:27,347] INFO: Initiating epoch #2 valid run on device rank=0 [2024-06-17 11:55:33,039] INFO: Rank 0: epoch=2 / 400 train_loss=23.0435 valid_loss=21.1719 stale=0 time=15.77m eta=6277.6m [2024-06-17 11:55:33,100] INFO: Initiating epoch #3 train run on device rank=0 [2024-06-17 12:10:12,947] INFO: Initiating epoch #3 valid run on device rank=0 [2024-06-17 12:11:18,692] INFO: Rank 0: epoch=3 / 400 train_loss=20.9330 valid_loss=20.6972 stale=0 time=15.76m eta=6260.2m [2024-06-17 12:11:18,775] INFO: Initiating epoch #4 train run on device rank=0 [2024-06-17 12:25:58,699] INFO: Initiating epoch #4 valid run on device rank=0 [2024-06-17 12:27:06,176] INFO: Rank 0: epoch=4 / 400 train_loss=19.3160 valid_loss=18.7242 stale=0 time=15.79m eta=6246.7m [2024-06-17 12:27:06,366] INFO: Initiating epoch #5 train run on device rank=0 [2024-06-17 12:41:46,364] INFO: Initiating epoch #5 valid run on device rank=0 [2024-06-17 12:42:52,720] INFO: Rank 0: epoch=5 / 400 train_loss=16.8997 valid_loss=16.0923 stale=0 time=15.77m eta=6231.0m [2024-06-17 12:42:52,774] INFO: Initiating epoch #6 train run on device rank=0 [2024-06-17 12:57:32,754] INFO: Initiating epoch #6 valid run on device rank=0 [2024-06-17 12:58:38,862] INFO: Rank 0: epoch=6 / 400 train_loss=15.0103 valid_loss=14.3500 stale=0 time=15.77m eta=6214.9m [2024-06-17 12:58:38,925] INFO: Initiating epoch #7 train run on device rank=0 [2024-06-17 13:13:18,902] INFO: Initiating epoch #7 valid run on device rank=0 [2024-06-17 13:14:24,798] INFO: Rank 0: epoch=7 / 400 train_loss=13.3200 valid_loss=12.7585 stale=0 time=15.76m eta=6198.6m [2024-06-17 13:14:24,902] INFO: Initiating epoch #8 train run on device rank=0 [2024-06-17 13:29:05,603] INFO: Initiating epoch #8 valid run on device rank=0 [2024-06-17 13:30:11,033] INFO: Rank 0: epoch=8 / 400 train_loss=12.1691 valid_loss=11.9001 stale=0 time=15.77m eta=6182.8m [2024-06-17 13:30:11,080] INFO: Initiating epoch #9 train run on device rank=0 [2024-06-17 13:44:51,494] INFO: Initiating epoch #9 valid run on device rank=0 [2024-06-17 13:45:57,287] INFO: Rank 0: epoch=9 / 400 train_loss=11.4694 valid_loss=11.2105 stale=0 time=15.77m eta=6166.9m [2024-06-17 13:45:57,341] INFO: Initiating epoch #10 train run on device rank=0 [2024-06-17 14:00:37,663] INFO: Initiating epoch #10 valid run on device rank=0 [2024-06-17 14:01:43,433] INFO: Rank 0: epoch=10 / 400 train_loss=10.8330 valid_loss=10.5733 stale=0 time=15.77m eta=6151.0m [2024-06-17 14:01:43,482] INFO: Initiating epoch #11 train run on device rank=0 [2024-06-17 14:16:24,544] INFO: Initiating epoch #11 valid run on device rank=0 [2024-06-17 14:17:30,271] INFO: Rank 0: epoch=11 / 400 train_loss=10.3061 valid_loss=10.0955 stale=0 time=15.78m eta=6135.6m [2024-06-17 14:17:30,319] INFO: Initiating epoch #12 train run on device rank=0 [2024-06-17 14:32:15,843] INFO: Initiating epoch #12 valid run on device rank=0 [2024-06-17 14:33:21,369] INFO: Rank 0: epoch=12 / 400 train_loss=9.8750 valid_loss=9.7613 stale=0 time=15.85m eta=6122.4m [2024-06-17 14:33:21,404] INFO: Initiating epoch #13 train run on device rank=0 [2024-06-17 14:48:06,559] INFO: Initiating epoch #13 valid run on device rank=0 [2024-06-17 14:49:12,081] INFO: Rank 0: epoch=13 / 400 train_loss=9.5395 valid_loss=9.4588 stale=0 time=15.84m eta=6108.5m [2024-06-17 14:49:12,138] INFO: Initiating epoch #14 train run on device rank=0 [2024-06-17 15:03:57,345] INFO: Initiating epoch #14 valid run on device rank=0 [2024-06-17 15:05:02,511] INFO: Rank 0: epoch=14 / 400 train_loss=9.2814 valid_loss=9.2345 stale=0 time=15.84m eta=6094.3m [2024-06-17 15:05:02,536] INFO: Initiating epoch #15 train run on device rank=0 [2024-06-17 15:19:47,593] INFO: Initiating epoch #15 valid run on device rank=0 [2024-06-17 15:20:52,991] INFO: Rank 0: epoch=15 / 400 train_loss=9.0826 valid_loss=9.0731 stale=0 time=15.84m eta=6079.9m [2024-06-17 15:20:53,025] INFO: Initiating epoch #16 train run on device rank=0 [2024-06-17 15:35:38,053] INFO: Initiating epoch #16 valid run on device rank=0 [2024-06-17 15:36:43,516] INFO: Rank 0: epoch=16 / 400 train_loss=8.9212 valid_loss=8.9485 stale=0 time=15.84m eta=6065.3m [2024-06-17 15:36:43,559] INFO: Initiating epoch #17 train run on device rank=0 [2024-06-17 15:51:28,891] INFO: Initiating epoch #17 valid run on device rank=0 [2024-06-17 15:52:34,617] INFO: Rank 0: epoch=17 / 400 train_loss=8.7881 valid_loss=8.8341 stale=0 time=15.85m eta=6050.8m [2024-06-17 15:52:34,668] INFO: Initiating epoch #18 train run on device rank=0 [2024-06-17 16:07:20,323] INFO: Initiating epoch #18 valid run on device rank=0 [2024-06-17 16:08:27,807] INFO: Rank 0: epoch=18 / 400 train_loss=8.6757 valid_loss=8.7432 stale=0 time=15.89m eta=6036.8m [2024-06-17 16:08:27,870] INFO: Initiating epoch #19 train run on device rank=0 [2024-06-17 16:23:13,377] INFO: Initiating epoch #19 valid run on device rank=0 [2024-06-17 16:24:19,126] INFO: Rank 0: epoch=19 / 400 train_loss=8.5811 valid_loss=8.6615 stale=0 time=15.85m eta=6022.1m [2024-06-17 16:24:19,170] INFO: Initiating epoch #20 train run on device rank=0 [2024-06-17 16:39:05,079] INFO: Initiating epoch #20 valid run on device rank=0 [2024-06-17 16:40:14,643] INFO: Rank 0: epoch=20 / 400 train_loss=8.5005 valid_loss=8.6035 stale=0 time=15.92m eta=6008.5m [2024-06-17 16:40:14,695] INFO: Initiating epoch #21 train run on device rank=0 [2024-06-17 16:54:57,108] INFO: Initiating epoch #21 valid run on device rank=0 [2024-06-17 16:56:04,341] INFO: Rank 0: epoch=21 / 400 train_loss=8.4338 valid_loss=8.5539 stale=0 time=15.83m eta=5993.0m [2024-06-17 16:56:04,429] INFO: Initiating epoch #22 train run on device rank=0 [2024-06-17 17:10:49,756] INFO: Initiating epoch #22 valid run on device rank=0 [2024-06-17 17:11:55,452] INFO: Rank 0: epoch=22 / 400 train_loss=8.3749 valid_loss=8.4977 stale=0 time=15.85m eta=5977.9m [2024-06-17 17:11:55,497] INFO: Initiating epoch #23 train run on device rank=0 [2024-06-17 17:26:41,085] INFO: Initiating epoch #23 valid run on device rank=0 [2024-06-17 17:27:47,676] INFO: Rank 0: epoch=23 / 400 train_loss=8.3218 valid_loss=8.4591 stale=0 time=15.87m eta=5963.0m [2024-06-17 17:27:48,223] INFO: Initiating epoch #24 train run on device rank=0 [2024-06-17 17:42:33,851] INFO: Initiating epoch #24 valid run on device rank=0 [2024-06-17 17:43:39,300] INFO: Rank 0: epoch=24 / 400 train_loss=8.2687 valid_loss=8.4220 stale=0 time=15.85m eta=5947.8m [2024-06-17 17:43:39,347] INFO: Initiating epoch #25 train run on device rank=0 [2024-06-17 17:58:25,215] INFO: Initiating epoch #25 valid run on device rank=0 [2024-06-17 17:59:30,918] INFO: Rank 0: epoch=25 / 400 train_loss=8.2228 valid_loss=8.3968 stale=0 time=15.86m eta=5932.7m [2024-06-17 17:59:30,971] INFO: Initiating epoch #26 train run on device rank=0 [2024-06-17 18:14:17,166] INFO: Initiating epoch #26 valid run on device rank=0 [2024-06-17 18:15:22,669] INFO: Rank 0: epoch=26 / 400 train_loss=8.1805 valid_loss=8.3561 stale=0 time=15.86m eta=5917.4m [2024-06-17 18:15:22,710] INFO: Initiating epoch #27 train run on device rank=0 [2024-06-17 18:30:08,494] INFO: Initiating epoch #27 valid run on device rank=0 [2024-06-17 18:31:14,176] INFO: Rank 0: epoch=27 / 400 train_loss=8.1453 valid_loss=8.3184 stale=0 time=15.86m eta=5902.1m [2024-06-17 18:31:14,244] INFO: Initiating epoch #28 train run on device rank=0 [2024-06-17 18:46:00,974] INFO: Initiating epoch #28 valid run on device rank=0 [2024-06-17 18:47:06,309] INFO: Rank 0: epoch=28 / 400 train_loss=8.1053 valid_loss=8.2865 stale=0 time=15.87m eta=5886.9m [2024-06-17 18:47:06,344] INFO: Initiating epoch #29 train run on device rank=0 [2024-06-17 19:01:52,300] INFO: Initiating epoch #29 valid run on device rank=0 [2024-06-17 19:02:58,070] INFO: Rank 0: epoch=29 / 400 train_loss=8.0644 valid_loss=8.2552 stale=0 time=15.86m eta=5871.6m [2024-06-17 19:02:58,123] INFO: Initiating epoch #30 train run on device rank=0 [2024-06-17 19:17:42,092] INFO: Initiating epoch #30 valid run on device rank=0 [2024-06-17 19:18:47,823] INFO: Rank 0: epoch=30 / 400 train_loss=8.0225 valid_loss=8.2300 stale=0 time=15.83m eta=5855.8m [2024-06-17 19:18:47,887] INFO: Initiating epoch #31 train run on device rank=0 [2024-06-17 19:33:32,754] INFO: Initiating epoch #31 valid run on device rank=0 [2024-06-17 19:34:38,684] INFO: Rank 0: epoch=31 / 400 train_loss=7.9862 valid_loss=8.2138 stale=0 time=15.85m eta=5840.2m [2024-06-17 19:34:38,723] INFO: Initiating epoch #32 train run on device rank=0 [2024-06-17 19:49:25,031] INFO: Initiating epoch #32 valid run on device rank=0 [2024-06-17 19:50:31,001] INFO: Rank 0: epoch=32 / 400 train_loss=7.9466 valid_loss=8.1761 stale=0 time=15.87m eta=5824.9m [2024-06-17 19:50:31,147] INFO: Initiating epoch #33 train run on device rank=0 [2024-06-17 20:05:17,421] INFO: Initiating epoch #33 valid run on device rank=0 [2024-06-17 20:06:22,945] INFO: Rank 0: epoch=33 / 400 train_loss=7.9141 valid_loss=8.1542 stale=0 time=15.86m eta=5809.5m [2024-06-17 20:06:22,996] INFO: Initiating epoch #34 train run on device rank=0 [2024-06-17 20:21:09,154] INFO: Initiating epoch #34 valid run on device rank=0 [2024-06-17 20:22:14,741] INFO: Rank 0: epoch=34 / 400 train_loss=7.8759 valid_loss=8.1185 stale=0 time=15.86m eta=5794.0m [2024-06-17 20:22:14,794] INFO: Initiating epoch #35 train run on device rank=0 [2024-06-17 20:37:01,268] INFO: Initiating epoch #35 valid run on device rank=0 [2024-06-17 20:38:06,641] INFO: Rank 0: epoch=35 / 400 train_loss=7.8435 valid_loss=8.0885 stale=0 time=15.86m eta=5778.5m [2024-06-17 20:38:06,683] INFO: Initiating epoch #36 train run on device rank=0 [2024-06-17 20:52:53,492] INFO: Initiating epoch #36 valid run on device rank=0 [2024-06-17 20:53:59,549] INFO: Rank 0: epoch=36 / 400 train_loss=7.8140 valid_loss=8.0585 stale=0 time=15.88m eta=5763.2m [2024-06-17 20:53:59,688] INFO: Initiating epoch #37 train run on device rank=0 [2024-06-17 21:08:46,605] INFO: Initiating epoch #37 valid run on device rank=0 [2024-06-17 21:09:52,612] INFO: Rank 0: epoch=37 / 400 train_loss=7.7846 valid_loss=8.0302 stale=0 time=15.88m eta=5747.9m [2024-06-17 21:09:52,766] INFO: Initiating epoch #38 train run on device rank=0 [2024-06-17 21:24:39,552] INFO: Initiating epoch #38 valid run on device rank=0 [2024-06-17 21:25:44,971] INFO: Rank 0: epoch=38 / 400 train_loss=7.7553 valid_loss=8.0128 stale=0 time=15.87m eta=5732.4m [2024-06-17 21:25:45,021] INFO: Initiating epoch #39 train run on device rank=0 [2024-06-17 21:40:31,391] INFO: Initiating epoch #39 valid run on device rank=0 [2024-06-17 21:41:36,720] INFO: Rank 0: epoch=39 / 400 train_loss=7.7310 valid_loss=7.9968 stale=0 time=15.86m eta=5716.8m [2024-06-17 21:41:36,767] INFO: Initiating epoch #40 train run on device rank=0 [2024-06-17 21:56:19,611] INFO: Initiating epoch #40 valid run on device rank=0 [2024-06-17 21:57:25,304] INFO: Rank 0: epoch=40 / 400 train_loss=7.7046 valid_loss=7.9685 stale=0 time=15.81m eta=5700.8m [2024-06-17 21:57:25,359] INFO: Initiating epoch #41 train run on device rank=0 [2024-06-17 22:12:11,115] INFO: Initiating epoch #41 valid run on device rank=0 [2024-06-17 22:13:16,989] INFO: Rank 0: epoch=41 / 400 train_loss=7.6809 valid_loss=7.9532 stale=0 time=15.86m eta=5685.1m [2024-06-17 22:13:17,080] INFO: Initiating epoch #42 train run on device rank=0 [2024-06-17 22:28:02,516] INFO: Initiating epoch #42 valid run on device rank=0 [2024-06-17 22:29:08,361] INFO: Rank 0: epoch=42 / 400 train_loss=7.6574 valid_loss=7.9384 stale=0 time=15.85m eta=5669.5m [2024-06-17 22:29:08,422] INFO: Initiating epoch #43 train run on device rank=0 [2024-06-17 22:43:53,643] INFO: Initiating epoch #43 valid run on device rank=0 [2024-06-17 22:44:59,024] INFO: Rank 0: epoch=43 / 400 train_loss=7.6364 valid_loss=7.9267 stale=0 time=15.84m eta=5653.7m [2024-06-17 22:44:59,060] INFO: Initiating epoch #44 train run on device rank=0 [2024-06-17 22:59:44,894] INFO: Initiating epoch #44 valid run on device rank=0 [2024-06-17 23:00:50,476] INFO: Rank 0: epoch=44 / 400 train_loss=7.6162 valid_loss=7.9199 stale=0 time=15.86m eta=5638.0m [2024-06-17 23:00:50,532] INFO: Initiating epoch #45 train run on device rank=0 [2024-06-17 23:15:35,856] INFO: Initiating epoch #45 valid run on device rank=0 [2024-06-17 23:16:41,907] INFO: Rank 0: epoch=45 / 400 train_loss=7.5943 valid_loss=7.8997 stale=0 time=15.86m eta=5622.4m [2024-06-17 23:16:41,959] INFO: Initiating epoch #46 train run on device rank=0 [2024-06-17 23:31:27,790] INFO: Initiating epoch #46 valid run on device rank=0 [2024-06-17 23:32:33,527] INFO: Rank 0: epoch=46 / 400 train_loss=7.5740 valid_loss=7.8907 stale=0 time=15.86m eta=5606.7m [2024-06-17 23:32:33,611] INFO: Initiating epoch #47 train run on device rank=0 [2024-06-17 23:47:19,305] INFO: Initiating epoch #47 valid run on device rank=0 [2024-06-17 23:48:24,906] INFO: Rank 0: epoch=47 / 400 train_loss=7.5530 valid_loss=7.8759 stale=0 time=15.85m eta=5591.0m [2024-06-17 23:48:24,954] INFO: Initiating epoch #48 train run on device rank=0 [2024-06-18 00:03:10,427] INFO: Initiating epoch #48 valid run on device rank=0 [2024-06-18 00:04:16,152] INFO: Rank 0: epoch=48 / 400 train_loss=7.5324 valid_loss=7.8642 stale=0 time=15.85m eta=5575.3m [2024-06-18 00:04:16,319] INFO: Initiating epoch #49 train run on device rank=0 [2024-06-18 00:19:01,941] INFO: Initiating epoch #49 valid run on device rank=0 [2024-06-18 00:20:07,849] INFO: Rank 0: epoch=49 / 400 train_loss=7.5113 valid_loss=7.8465 stale=0 time=15.86m eta=5559.6m [2024-06-18 00:20:07,914] INFO: Initiating epoch #50 train run on device rank=0 [2024-06-18 00:34:50,134] INFO: Initiating epoch #50 valid run on device rank=0 [2024-06-18 00:35:56,342] INFO: Rank 0: epoch=50 / 400 train_loss=7.4917 valid_loss=7.8404 stale=0 time=15.81m eta=5543.5m [2024-06-18 00:35:56,376] INFO: Initiating epoch #51 train run on device rank=0 [2024-06-18 00:50:42,067] INFO: Initiating epoch #51 valid run on device rank=0 [2024-06-18 00:51:47,794] INFO: Rank 0: epoch=51 / 400 train_loss=7.4735 valid_loss=7.8275 stale=0 time=15.86m eta=5527.8m [2024-06-18 00:51:47,909] INFO: Initiating epoch #52 train run on device rank=0 [2024-06-18 01:06:33,173] INFO: Initiating epoch #52 valid run on device rank=0 [2024-06-18 01:07:47,146] INFO: Rank 0: epoch=52 / 400 train_loss=7.4545 valid_loss=7.8157 stale=0 time=15.99m eta=5513.0m [2024-06-18 01:07:48,249] INFO: Initiating epoch #53 train run on device rank=0 [2024-06-18 01:22:33,988] INFO: Initiating epoch #53 valid run on device rank=0 [2024-06-18 01:23:39,939] INFO: Rank 0: epoch=53 / 400 train_loss=7.4359 valid_loss=7.8138 stale=0 time=15.86m eta=5497.4m [2024-06-18 01:23:39,999] INFO: Initiating epoch #54 train run on device rank=0 [2024-06-18 01:38:25,071] INFO: Initiating epoch #54 valid run on device rank=0 [2024-06-18 01:39:30,592] INFO: Rank 0: epoch=54 / 400 train_loss=7.4181 valid_loss=7.7960 stale=0 time=15.84m eta=5481.6m [2024-06-18 01:39:30,630] INFO: Initiating epoch #55 train run on device rank=0 [2024-06-18 01:54:15,615] INFO: Initiating epoch #55 valid run on device rank=0 [2024-06-18 01:55:21,228] INFO: Rank 0: epoch=55 / 400 train_loss=7.4007 valid_loss=7.7870 stale=0 time=15.84m eta=5465.7m [2024-06-18 01:55:21,284] INFO: Initiating epoch #56 train run on device rank=0 [2024-06-18 02:10:06,455] INFO: Initiating epoch #56 valid run on device rank=0 [2024-06-18 02:11:12,295] INFO: Rank 0: epoch=56 / 400 train_loss=7.3841 valid_loss=7.7761 stale=0 time=15.85m eta=5449.9m [2024-06-18 02:11:12,446] INFO: Initiating epoch #57 train run on device rank=0 [2024-06-18 02:25:57,811] INFO: Initiating epoch #57 valid run on device rank=0 [2024-06-18 02:27:03,550] INFO: Rank 0: epoch=57 / 400 train_loss=7.3682 valid_loss=7.7655 stale=0 time=15.85m eta=5434.2m [2024-06-18 02:27:03,624] INFO: Initiating epoch #58 train run on device rank=0 [2024-06-18 02:41:48,928] INFO: Initiating epoch #58 valid run on device rank=0 [2024-06-18 02:42:55,504] INFO: Rank 0: epoch=58 / 400 train_loss=7.3528 valid_loss=7.7623 stale=0 time=15.86m eta=5418.5m [2024-06-18 02:42:55,755] INFO: Initiating epoch #59 train run on device rank=0 [2024-06-18 02:57:39,619] INFO: Initiating epoch #59 valid run on device rank=0 [2024-06-18 02:58:55,219] INFO: Rank 0: epoch=59 / 400 train_loss=7.3364 valid_loss=7.7496 stale=0 time=15.99m eta=5403.5m [2024-06-18 02:58:55,352] INFO: Initiating epoch #60 train run on device rank=0 [2024-06-18 03:13:38,210] INFO: Initiating epoch #60 valid run on device rank=0 [2024-06-18 03:14:46,595] INFO: Rank 0: epoch=60 / 400 train_loss=7.3211 valid_loss=7.7456 stale=0 time=15.85m eta=5387.7m [2024-06-18 03:14:46,851] INFO: Initiating epoch #61 train run on device rank=0 [2024-06-18 03:29:31,960] INFO: Initiating epoch #61 valid run on device rank=0 [2024-06-18 03:30:38,297] INFO: Rank 0: epoch=61 / 400 train_loss=7.3074 valid_loss=7.7441 stale=0 time=15.86m eta=5371.9m [2024-06-18 03:30:38,574] INFO: Initiating epoch #62 train run on device rank=0 [2024-06-18 03:45:24,054] INFO: Initiating epoch #62 valid run on device rank=0 [2024-06-18 03:46:29,290] INFO: Rank 0: epoch=62 / 400 train_loss=7.2952 valid_loss=7.7414 stale=0 time=15.85m eta=5356.1m [2024-06-18 03:46:29,350] INFO: Initiating epoch #63 train run on device rank=0 [2024-06-18 04:01:14,388] INFO: Initiating epoch #63 valid run on device rank=0 [2024-06-18 04:02:19,962] INFO: Rank 0: epoch=63 / 400 train_loss=7.2802 valid_loss=7.7372 stale=0 time=15.84m eta=5340.3m [2024-06-18 04:02:20,013] INFO: Initiating epoch #64 train run on device rank=0 [2024-06-18 04:17:05,753] INFO: Initiating epoch #64 valid run on device rank=0 [2024-06-18 04:18:11,425] INFO: Rank 0: epoch=64 / 400 train_loss=7.2666 valid_loss=7.7314 stale=0 time=15.86m eta=5324.5m [2024-06-18 04:18:11,479] INFO: Initiating epoch #65 train run on device rank=0 [2024-06-18 04:32:56,601] INFO: Initiating epoch #65 valid run on device rank=0 [2024-06-18 04:34:10,797] INFO: Rank 0: epoch=65 / 400 train_loss=7.2532 valid_loss=7.7244 stale=0 time=15.99m eta=5309.4m [2024-06-18 04:34:10,876] INFO: Initiating epoch #66 train run on device rank=0 [2024-06-18 04:48:55,794] INFO: Initiating epoch #66 valid run on device rank=0 [2024-06-18 04:50:02,690] INFO: Rank 0: epoch=66 / 400 train_loss=7.2398 valid_loss=7.7224 stale=0 time=15.86m eta=5293.6m [2024-06-18 04:50:02,819] INFO: Initiating epoch #67 train run on device rank=0 [2024-06-18 05:04:48,116] INFO: Initiating epoch #67 valid run on device rank=0 [2024-06-18 05:05:55,649] INFO: Rank 0: epoch=67 / 400 train_loss=7.2277 valid_loss=7.7219 stale=0 time=15.88m eta=5277.9m [2024-06-18 05:05:55,718] INFO: Initiating epoch #68 train run on device rank=0 [2024-06-18 05:20:40,595] INFO: Initiating epoch #68 valid run on device rank=0 [2024-06-18 05:21:48,352] INFO: Rank 0: epoch=68 / 400 train_loss=7.2163 valid_loss=7.7158 stale=0 time=15.88m eta=5262.2m [2024-06-18 05:21:48,689] INFO: Initiating epoch #69 train run on device rank=0 [2024-06-18 05:36:30,343] INFO: Initiating epoch #69 valid run on device rank=0 [2024-06-18 05:37:35,889] INFO: Rank 0: epoch=69 / 400 train_loss=7.2032 valid_loss=7.7138 stale=0 time=15.79m eta=5246.1m [2024-06-18 05:37:35,948] INFO: Initiating epoch #70 train run on device rank=0 [2024-06-18 05:52:20,837] INFO: Initiating epoch #70 valid run on device rank=0 [2024-06-18 05:53:26,381] INFO: Rank 0: epoch=70 / 400 train_loss=7.1913 valid_loss=7.7083 stale=0 time=15.84m eta=5230.2m [2024-06-18 05:53:26,432] INFO: Initiating epoch #71 train run on device rank=0 [2024-06-18 06:08:12,436] INFO: Initiating epoch #71 valid run on device rank=0 [2024-06-18 06:09:18,135] INFO: Rank 0: epoch=71 / 400 train_loss=7.1795 valid_loss=7.7033 stale=0 time=15.86m eta=5214.4m [2024-06-18 06:09:18,207] INFO: Initiating epoch #72 train run on device rank=0 [2024-06-18 06:24:03,128] INFO: Initiating epoch #72 valid run on device rank=0 [2024-06-18 06:25:08,700] INFO: Rank 0: epoch=72 / 400 train_loss=7.1684 valid_loss=7.6948 stale=0 time=15.84m eta=5198.5m [2024-06-18 06:25:08,753] INFO: Initiating epoch #73 train run on device rank=0 [2024-06-18 06:39:53,948] INFO: Initiating epoch #73 valid run on device rank=0 [2024-06-18 06:40:58,835] INFO: Rank 0: epoch=73 / 400 train_loss=7.1567 valid_loss=7.6979 stale=1 time=15.83m eta=5182.6m [2024-06-18 06:40:58,869] INFO: Initiating epoch #74 train run on device rank=0 [2024-06-18 06:55:43,841] INFO: Initiating epoch #74 valid run on device rank=0 [2024-06-18 06:56:48,711] INFO: Rank 0: epoch=74 / 400 train_loss=7.1462 valid_loss=7.6974 stale=2 time=15.83m eta=5166.7m [2024-06-18 06:56:48,728] INFO: Initiating epoch #75 train run on device rank=0 [2024-06-18 07:11:33,697] INFO: Initiating epoch #75 valid run on device rank=0 [2024-06-18 07:12:39,775] INFO: Rank 0: epoch=75 / 400 train_loss=7.1351 valid_loss=7.6891 stale=0 time=15.85m eta=5150.9m [2024-06-18 07:12:40,035] INFO: Initiating epoch #76 train run on device rank=0 [2024-06-18 07:27:25,142] INFO: Initiating epoch #76 valid run on device rank=0 [2024-06-18 07:28:30,126] INFO: Rank 0: epoch=76 / 400 train_loss=7.1246 valid_loss=7.6947 stale=1 time=15.83m eta=5135.0m [2024-06-18 07:28:30,164] INFO: Initiating epoch #77 train run on device rank=0 [2024-06-18 07:43:15,635] INFO: Initiating epoch #77 valid run on device rank=0 [2024-06-18 07:44:21,342] INFO: Rank 0: epoch=77 / 400 train_loss=7.1127 valid_loss=7.6852 stale=0 time=15.85m eta=5119.1m [2024-06-18 07:44:21,387] INFO: Initiating epoch #78 train run on device rank=0 [2024-06-18 07:59:06,575] INFO: Initiating epoch #78 valid run on device rank=0 [2024-06-18 08:00:11,130] INFO: Rank 0: epoch=78 / 400 train_loss=7.1024 valid_loss=7.6882 stale=1 time=15.83m eta=5103.2m [2024-06-18 08:00:11,155] INFO: Initiating epoch #79 train run on device rank=0 [2024-06-18 08:14:52,775] INFO: Initiating epoch #79 valid run on device rank=0 [2024-06-18 08:15:57,660] INFO: Rank 0: epoch=79 / 400 train_loss=7.0923 valid_loss=7.6902 stale=2 time=15.78m eta=5087.1m [2024-06-18 08:15:57,709] INFO: Initiating epoch #80 train run on device rank=0 [2024-06-18 08:30:42,746] INFO: Initiating epoch #80 valid run on device rank=0 [2024-06-18 08:31:48,415] INFO: Rank 0: epoch=80 / 400 train_loss=7.0820 valid_loss=7.6778 stale=0 time=15.85m eta=5071.2m [2024-06-18 08:31:48,448] INFO: Initiating epoch #81 train run on device rank=0 [2024-06-18 08:46:33,607] INFO: Initiating epoch #81 valid run on device rank=0 [2024-06-18 08:47:39,105] INFO: Rank 0: epoch=81 / 400 train_loss=7.0716 valid_loss=7.6768 stale=0 time=15.84m eta=5055.3m [2024-06-18 08:47:39,165] INFO: Initiating epoch #82 train run on device rank=0 [2024-06-18 09:02:24,234] INFO: Initiating epoch #82 valid run on device rank=0 [2024-06-18 09:03:29,096] INFO: Rank 0: epoch=82 / 400 train_loss=7.0625 valid_loss=7.6777 stale=1 time=15.83m eta=5039.4m [2024-06-18 09:03:29,138] INFO: Initiating epoch #83 train run on device rank=0 [2024-06-18 09:18:14,026] INFO: Initiating epoch #83 valid run on device rank=0 [2024-06-18 09:19:19,045] INFO: Rank 0: epoch=83 / 400 train_loss=7.0543 valid_loss=7.6787 stale=2 time=15.83m eta=5023.5m [2024-06-18 09:19:19,053] INFO: Initiating epoch #84 train run on device rank=0 [2024-06-18 09:34:03,695] INFO: Initiating epoch #84 valid run on device rank=0