[2024-06-13 14:43:58,635] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus [2024-06-13 14:43:58,738] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-13 14:43:58,738] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-13 14:43:58,738] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-13 14:43:58,738] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-13 14:44:05,071] INFO: using dtype=torch.float32 [2024-06-13 14:44:06,190] INFO: using attention_type=math [2024-06-13 14:44:06,201] INFO: using attention_type=math [2024-06-13 14:44:06,212] INFO: using attention_type=math [2024-06-13 14:44:06,223] INFO: using attention_type=math [2024-06-13 14:44:06,234] INFO: using attention_type=math [2024-06-13 14:44:06,244] INFO: using attention_type=math [2024-06-13 14:44:11,478] INFO: mlpf_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'math'} [2024-06-13 14:44:11,479] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/best_weights.pth [2024-06-13 14:44:12,792] INFO: DistributedDataParallel( (module): MLPF( (nn0_id): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (nn0_reg): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) 
(conv_id): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (conv_reg): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (nn_id): Sequential( (0): Linear(in_features=529, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=6, bias=True) ) (nn_pt): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_eta): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_sin_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, 
inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_cos_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_energy): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) ) ) [2024-06-13 14:44:12,793] INFO: Backbone Trainable parameters: 11671568 [2024-06-13 14:44:12,793] INFO: Backbone Non-trainable parameters: 0 [2024-06-13 14:44:12,793] INFO: Backbone Total parameters: 11671568 [2024-06-13 14:44:12,798] INFO: Modules Trainable parameters Non-tranable parameters module.nn0_id.0.weight 8704 0 module.nn0_id.0.bias 512 0 module.nn0_id.2.weight 512 0 module.nn0_id.2.bias 512 0 module.nn0_id.4.weight 262144 0 module.nn0_id.4.bias 512 0 module.nn0_reg.0.weight 8704 0 module.nn0_reg.0.bias 512 0 module.nn0_reg.2.weight 512 0 module.nn0_reg.2.bias 512 0 module.nn0_reg.4.weight 262144 0 module.nn0_reg.4.bias 512 0 module.conv_id.0.mha.in_proj_weight 786432 0 module.conv_id.0.mha.in_proj_bias 1536 0 module.conv_id.0.mha.out_proj.weight 262144 0 module.conv_id.0.mha.out_proj.bias 512 0 module.conv_id.0.norm0.weight 512 0 module.conv_id.0.norm0.bias 512 0 module.conv_id.0.norm1.weight 512 0 module.conv_id.0.norm1.bias 512 0 module.conv_id.0.seq.0.weight 262144 0 module.conv_id.0.seq.0.bias 512 0 module.conv_id.0.seq.2.weight 262144 0 module.conv_id.0.seq.2.bias 512 0 module.conv_id.1.mha.in_proj_weight 786432 0 module.conv_id.1.mha.in_proj_bias 1536 0 module.conv_id.1.mha.out_proj.weight 262144 0 module.conv_id.1.mha.out_proj.bias 512 0 module.conv_id.1.norm0.weight 512 0 module.conv_id.1.norm0.bias 512 0 
module.conv_id.1.norm1.weight 512 0 module.conv_id.1.norm1.bias 512 0 module.conv_id.1.seq.0.weight 262144 0 module.conv_id.1.seq.0.bias 512 0 module.conv_id.1.seq.2.weight 262144 0 module.conv_id.1.seq.2.bias 512 0 module.conv_id.2.mha.in_proj_weight 786432 0 module.conv_id.2.mha.in_proj_bias 1536 0 module.conv_id.2.mha.out_proj.weight 262144 0 module.conv_id.2.mha.out_proj.bias 512 0 module.conv_id.2.norm0.weight 512 0 module.conv_id.2.norm0.bias 512 0 module.conv_id.2.norm1.weight 512 0 module.conv_id.2.norm1.bias 512 0 module.conv_id.2.seq.0.weight 262144 0 module.conv_id.2.seq.0.bias 512 0 module.conv_id.2.seq.2.weight 262144 0 module.conv_id.2.seq.2.bias 512 0 module.conv_reg.0.mha.in_proj_weight 786432 0 module.conv_reg.0.mha.in_proj_bias 1536 0 module.conv_reg.0.mha.out_proj.weight 262144 0 module.conv_reg.0.mha.out_proj.bias 512 0 module.conv_reg.0.norm0.weight 512 0 module.conv_reg.0.norm0.bias 512 0 module.conv_reg.0.norm1.weight 512 0 module.conv_reg.0.norm1.bias 512 0 module.conv_reg.0.seq.0.weight 262144 0 module.conv_reg.0.seq.0.bias 512 0 module.conv_reg.0.seq.2.weight 262144 0 module.conv_reg.0.seq.2.bias 512 0 module.conv_reg.1.mha.in_proj_weight 786432 0 module.conv_reg.1.mha.in_proj_bias 1536 0 module.conv_reg.1.mha.out_proj.weight 262144 0 module.conv_reg.1.mha.out_proj.bias 512 0 module.conv_reg.1.norm0.weight 512 0 module.conv_reg.1.norm0.bias 512 0 module.conv_reg.1.norm1.weight 512 0 module.conv_reg.1.norm1.bias 512 0 module.conv_reg.1.seq.0.weight 262144 0 module.conv_reg.1.seq.0.bias 512 0 module.conv_reg.1.seq.2.weight 262144 0 module.conv_reg.1.seq.2.bias 512 0 module.conv_reg.2.mha.in_proj_weight 786432 0 module.conv_reg.2.mha.in_proj_bias 1536 0 module.conv_reg.2.mha.out_proj.weight 262144 0 module.conv_reg.2.mha.out_proj.bias 512 0 module.conv_reg.2.norm0.weight 512 0 module.conv_reg.2.norm0.bias 512 0 module.conv_reg.2.norm1.weight 512 0 module.conv_reg.2.norm1.bias 512 0 module.conv_reg.2.seq.0.weight 262144 0 
module.conv_reg.2.seq.0.bias 512 0 module.conv_reg.2.seq.2.weight 262144 0 module.conv_reg.2.seq.2.bias 512 0 module.nn_id.0.weight 270848 0 module.nn_id.0.bias 512 0 module.nn_id.2.weight 512 0 module.nn_id.2.bias 512 0 module.nn_id.4.weight 3072 0 module.nn_id.4.bias 6 0 module.nn_pt.nn.0.weight 273920 0 module.nn_pt.nn.0.bias 512 0 module.nn_pt.nn.2.weight 512 0 module.nn_pt.nn.2.bias 512 0 module.nn_pt.nn.4.weight 1024 0 module.nn_pt.nn.4.bias 2 0 module.nn_eta.nn.0.weight 273920 0 module.nn_eta.nn.0.bias 512 0 module.nn_eta.nn.2.weight 512 0 module.nn_eta.nn.2.bias 512 0 module.nn_eta.nn.4.weight 1024 0 module.nn_eta.nn.4.bias 2 0 module.nn_sin_phi.nn.0.weight 273920 0 module.nn_sin_phi.nn.0.bias 512 0 module.nn_sin_phi.nn.2.weight 512 0 module.nn_sin_phi.nn.2.bias 512 0 module.nn_sin_phi.nn.4.weight 1024 0 module.nn_sin_phi.nn.4.bias 2 0 module.nn_cos_phi.nn.0.weight 273920 0 module.nn_cos_phi.nn.0.bias 512 0 module.nn_cos_phi.nn.2.weight 512 0 module.nn_cos_phi.nn.2.bias 512 0 module.nn_cos_phi.nn.4.weight 1024 0 module.nn_cos_phi.nn.4.bias 2 0 module.nn_energy.nn.0.weight 273920 0 module.nn_energy.nn.0.bias 512 0 module.nn_energy.nn.2.weight 512 0 module.nn_energy.nn.2.bias 512 0 module.nn_energy.nn.4.weight 1024 0 module.nn_energy.nn.4.bias 2 0 [2024-06-13 14:44:12,855] INFO: DistributedDataParallel( (module): DeepMET( (nn): Sequential( (0): Linear(in_features=535, out_features=256, bias=True) (1): ELU(alpha=1.0) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0, inplace=False) (4): Linear(in_features=256, out_features=2, bias=True) ) ) ) [2024-06-13 14:44:12,855] INFO: DeepMET Trainable parameters: 138242 [2024-06-13 14:44:12,855] INFO: DeepMET Non-trainable parameters: 0 [2024-06-13 14:44:12,855] INFO: DeepMET Total parameters: 138242 [2024-06-13 14:44:12,856] INFO: Modules Trainable parameters Non-tranable parameters module.nn.0.weight 136960 0 module.nn.0.bias 256 0 module.nn.2.weight 256 0 module.nn.2.bias 256 0 
module.nn.4.weight 512 0 module.nn.4.bias 2 0
[2024-06-13 14:44:12,877] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_FloatBackbone_20240613_144358_467020
[2024-06-13 14:44:12,877] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_latentX_FloatBackbone_20240613_144358_467020
[2024-06-13 14:44:13,043] INFO: train_dataset: clic_edm_ttbar_pf, 800800
[2024-06-13 14:44:13,185] INFO: valid_dataset: clic_edm_ttbar_pf, 200200
[2024-06-13 14:44:13,253] INFO: Initiating epoch #1 train run on device rank=0
[2024-06-13 14:58:58,065] INFO: Initiating epoch #1 valid run on device rank=0
[2024-06-13 15:00:05,585] INFO: Rank 0: epoch=1 / 200 train_loss=8.5014 valid_loss=7.3970 stale=0 time=15.87m eta=3158.6m
[2024-06-13 15:00:05,596] INFO: Initiating epoch #2 train run on device rank=0
[2024-06-13 15:14:47,610] INFO: Initiating epoch #2 valid run on device rank=0
[2024-06-13 15:15:54,634] INFO: Rank 0: epoch=2 / 200 train_loss=11.5930 valid_loss=15.3761 stale=1 time=15.82m eta=3137.3m
[2024-06-13 15:15:54,702] INFO: Initiating epoch #3 train run on device rank=0
[2024-06-13 15:30:36,266] INFO: Initiating epoch #3 valid run on device rank=0
[2024-06-13 15:31:43,293] INFO: Rank 0: epoch=3 / 200 train_loss=10.8359 valid_loss=9.2851 stale=2 time=15.81m eta=3119.2m
[2024-06-13 15:31:43,338] INFO: Initiating epoch #4 train run on device rank=0
[2024-06-13 15:46:20,274] INFO: Initiating epoch #4 valid run on device rank=0
[2024-06-13 15:47:26,883] INFO: Rank 0: epoch=4 / 200 train_loss=8.7755 valid_loss=8.5338 stale=3 time=15.73m eta=3098.1m
[2024-06-13 15:47:26,988] INFO: Initiating epoch #5 train run on device rank=0
[2024-06-13 16:02:07,346] INFO: Initiating epoch #5 valid run on device rank=0
[2024-06-13 16:03:14,289] INFO: Rank 0: epoch=5 / 200 train_loss=8.2664 valid_loss=8.1970 stale=4 time=15.79m eta=3081.7m
[2024-06-13 16:03:14,357] INFO: Initiating epoch #6 train run on device rank=0
[2024-06-13 16:17:52,451] INFO: Initiating epoch #6 valid run on device rank=0
[2024-06-13 16:18:59,095] INFO: Rank 0: epoch=6 / 200 train_loss=7.9612 valid_loss=7.9982 stale=5 time=15.75m eta=3064.0m
[2024-06-13 16:18:59,141] INFO: Initiating epoch #7 train run on device rank=0
[2024-06-13 16:33:37,737] INFO: Initiating epoch #7 valid run on device rank=0
[2024-06-13 16:34:46,456] INFO: Rank 0: epoch=7 / 200 train_loss=7.7490 valid_loss=7.8539 stale=6 time=15.79m eta=3048.1m
[2024-06-13 16:34:47,033] INFO: Initiating epoch #8 train run on device rank=0
[2024-06-13 16:49:24,668] INFO: Initiating epoch #8 valid run on device rank=0
[2024-06-13 16:50:34,290] INFO: Rank 0: epoch=8 / 200 train_loss=7.5921 valid_loss=7.8666 stale=7 time=15.79m eta=3032.4m
[2024-06-13 16:50:34,784] INFO: Initiating epoch #9 train run on device rank=0
[2024-06-13 17:05:13,041] INFO: Initiating epoch #9 valid run on device rank=0
[2024-06-13 17:06:19,415] INFO: Rank 0: epoch=9 / 200 train_loss=7.4815 valid_loss=7.7478 stale=8 time=15.74m eta=3015.7m
[2024-06-13 17:06:19,461] INFO: Initiating epoch #10 train run on device rank=0
[2024-06-13 17:20:56,974] INFO: Initiating epoch #10 valid run on device rank=0
[2024-06-13 17:22:03,165] INFO: Rank 0: epoch=10 / 200 train_loss=7.3862 valid_loss=7.7158 stale=9 time=15.73m eta=2998.8m
[2024-06-13 17:22:03,213] INFO: Initiating epoch #11 train run on device rank=0
[2024-06-13 17:36:41,376] INFO: Initiating epoch #11 valid run on device rank=0
[2024-06-13 17:37:48,322] INFO: Rank 0: epoch=11 / 200 train_loss=7.3090 valid_loss=7.6863 stale=10 time=15.75m eta=2982.5m
[2024-06-13 17:37:48,671] INFO: Initiating epoch #12 train run on device rank=0
[2024-06-13 17:52:26,198] INFO: Initiating epoch #12 valid run on device rank=0
[2024-06-13 17:53:32,855] INFO: Rank 0: epoch=12 / 200 train_loss=7.2380 valid_loss=7.6718 stale=11 time=15.74m eta=2966.1m
[2024-06-13 17:53:32,898] INFO: Initiating epoch #13 train run on device rank=0
[2024-06-13 18:08:10,988] INFO: Initiating epoch #13 valid run on device rank=0
[2024-06-13 18:09:17,521] INFO: Rank 0: epoch=13 / 200 train_loss=7.1850 valid_loss=7.6619 stale=12 time=15.74m eta=2949.9m
[2024-06-13 18:09:17,566] INFO: Initiating epoch #14 train run on device rank=0
[2024-06-13 18:23:55,683] INFO: Initiating epoch #14 valid run on device rank=0
[2024-06-13 18:25:02,625] INFO: Rank 0: epoch=14 / 200 train_loss=7.1336 valid_loss=7.6442 stale=13 time=15.75m eta=2933.8m
[2024-06-13 18:25:02,740] INFO: Initiating epoch #15 train run on device rank=0
[2024-06-13 18:39:40,748] INFO: Initiating epoch #15 valid run on device rank=0
[2024-06-13 18:40:47,591] INFO: Rank 0: epoch=15 / 200 train_loss=7.0936 valid_loss=7.6450 stale=14 time=15.75m eta=2917.7m
[2024-06-13 18:40:47,632] INFO: Initiating epoch #16 train run on device rank=0
[2024-06-13 18:55:25,085] INFO: Initiating epoch #16 valid run on device rank=0
[2024-06-13 18:56:31,571] INFO: Rank 0: epoch=16 / 200 train_loss=7.0683 valid_loss=7.7528 stale=15 time=15.73m eta=2901.5m
[2024-06-13 18:56:31,642] INFO: Initiating epoch #17 train run on device rank=0
[2024-06-13 19:11:08,337] INFO: Initiating epoch #17 valid run on device rank=0
[2024-06-13 19:12:15,354] INFO: Rank 0: epoch=17 / 200 train_loss=7.0506 valid_loss=7.7817 stale=16 time=15.73m eta=2885.3m
[2024-06-13 19:12:15,475] INFO: Initiating epoch #18 train run on device rank=0
[2024-06-13 19:26:52,380] INFO: Initiating epoch #18 valid run on device rank=0
[2024-06-13 19:27:59,136] INFO: Rank 0: epoch=18 / 200 train_loss=7.0315 valid_loss=7.8167 stale=17 time=15.73m eta=2869.2m
[2024-06-13 19:27:59,169] INFO: Initiating epoch #19 train run on device rank=0
[2024-06-13 19:42:36,808] INFO: Initiating epoch #19 valid run on device rank=0
[2024-06-13 19:43:43,369] INFO: Rank 0: epoch=19 / 200 train_loss=7.0157 valid_loss=7.8094 stale=18 time=15.74m eta=2853.1m
[2024-06-13 19:43:43,415] INFO: Initiating epoch #20 train run on device rank=0
[2024-06-13 19:58:20,437] INFO: Initiating epoch #20 valid run on device rank=0
[2024-06-13 19:59:27,489] INFO: Rank 0: epoch=20 / 200 train_loss=7.0789 valid_loss=7.8717 stale=19 time=15.73m eta=2837.1m
[2024-06-13 19:59:27,642] INFO: Initiating epoch #21 train run on device rank=0
[2024-06-13 20:14:05,593] INFO: Initiating epoch #21 valid run on device rank=0
[2024-06-13 20:15:12,381] INFO: Rank 0: epoch=21 / 200 train_loss=7.0949 valid_loss=7.8567 stale=20 time=15.75m eta=2821.3m
[2024-06-13 20:15:12,441] INFO: Initiating epoch #22 train run on device rank=0
[2024-06-13 20:29:49,963] INFO: Initiating epoch #22 valid run on device rank=0
[2024-06-13 20:30:56,901] INFO: Rank 0: epoch=22 / 200 train_loss=7.1424 valid_loss=8.2802 stale=21 time=15.74m eta=2805.3m
[2024-06-13 20:30:56,997] INFO: Initiating epoch #23 train run on device rank=0
[2024-06-13 20:45:33,997] INFO: Initiating epoch #23 valid run on device rank=0
[2024-06-13 20:46:39,278] INFO: Rank 0: epoch=23 / 200 train_loss=7.1989 valid_loss=7.9359 stale=22 time=15.7m eta=2789.2m
[2024-06-13 20:46:39,309] INFO: Initiating epoch #24 train run on device rank=0
[2024-06-13 21:01:05,445] INFO: Initiating epoch #24 valid run on device rank=0
[2024-06-13 21:02:10,097] INFO: Rank 0: epoch=24 / 200 train_loss=7.2391 valid_loss=8.1021 stale=23 time=15.51m eta=2771.6m
[2024-06-13 21:02:10,174] INFO: Initiating epoch #25 train run on device rank=0
[2024-06-13 21:16:37,514] INFO: Initiating epoch #25 valid run on device rank=0
[2024-06-13 21:17:45,463] INFO: Rank 0: epoch=25 / 200 train_loss=7.5218 valid_loss=8.3456 stale=24 time=15.59m eta=2754.8m
[2024-06-13 21:17:45,943] INFO: Initiating epoch #26 train run on device rank=0
[2024-06-13 21:32:12,673] INFO: Initiating epoch #26 valid run on device rank=0
[2024-06-13 21:33:20,239] INFO: Rank 0: epoch=26 / 200 train_loss=7.5834 valid_loss=8.2058 stale=25 time=15.57m eta=2737.9m
[2024-06-13 21:33:20,834] INFO: Initiating epoch #27 train run on device rank=0
[2024-06-13 21:47:47,621] INFO: Initiating epoch #27 valid run on device rank=0
[2024-06-13 21:48:54,977] INFO: Rank 0: epoch=27 / 200 train_loss=7.7725 valid_loss=8.7602 stale=26 time=15.57m eta=2721.2m
[2024-06-13 21:48:55,494] INFO: Initiating epoch #28 train run on device rank=0
[2024-06-13 22:03:21,827] INFO: Initiating epoch #28 valid run on device rank=0
[2024-06-13 22:04:26,533] INFO: Rank 0: epoch=28 / 200 train_loss=11.9784 valid_loss=16.8950 stale=27 time=15.52m eta=2704.2m
[2024-06-13 22:04:26,584] INFO: Initiating epoch #29 train run on device rank=0
[2024-06-13 22:18:51,208] INFO: Initiating epoch #29 valid run on device rank=0
[2024-06-13 22:19:55,832] INFO: Rank 0: epoch=29 / 200 train_loss=17.8466 valid_loss=14.6596 stale=28 time=15.49m eta=2687.1m
[2024-06-13 22:19:55,903] INFO: Initiating epoch #30 train run on device rank=0
[2024-06-13 22:34:17,318] INFO: Initiating epoch #30 valid run on device rank=0
[2024-06-13 22:35:22,066] INFO: Rank 0: epoch=30 / 200 train_loss=37.7285 valid_loss=40.7227 stale=29 time=15.44m eta=2669.8m
[2024-06-13 22:35:22,139] INFO: Initiating epoch #31 train run on device rank=0
[2024-06-13 22:49:42,994] INFO: Initiating epoch #31 valid run on device rank=0
[2024-06-13 22:50:47,447] INFO: Rank 0: epoch=31 / 200 train_loss=40.8026 valid_loss=40.7263 stale=30 time=15.42m eta=2652.6m
[2024-06-13 22:50:47,482] INFO: Initiating epoch #32 train run on device rank=0
[2024-06-13 23:05:08,571] INFO: Initiating epoch #32 valid run on device rank=0
[2024-06-13 23:06:13,697] INFO: Rank 0: epoch=32 / 200 train_loss=40.8061 valid_loss=40.7263 stale=31 time=15.44m eta=2635.5m
[2024-06-13 23:06:13,758] INFO: Initiating epoch #33 train run on device rank=0
[2024-06-13 23:20:34,942] INFO: Initiating epoch #33 valid run on device rank=0
[2024-06-13 23:21:39,875] INFO: Rank 0: epoch=33 / 200 train_loss=40.8062 valid_loss=40.7263 stale=32 time=15.44m eta=2618.6m
[2024-06-13 23:21:39,895] INFO: Initiating epoch #34 train run on device rank=0
[2024-06-13 23:36:00,388] INFO: Initiating epoch #34 valid run on device rank=0
[2024-06-13 23:37:05,066] INFO: Rank 0: epoch=34 / 200 train_loss=40.8062 valid_loss=40.7263 stale=33 time=15.42m eta=2601.6m
[2024-06-13 23:37:05,109] INFO: Initiating epoch #35 train run on device rank=0
[2024-06-13 23:51:25,681] INFO: Initiating epoch #35 valid run on device rank=0
[2024-06-13 23:52:30,736] INFO: Rank 0: epoch=35 / 200 train_loss=40.8062 valid_loss=40.7263 stale=34 time=15.43m eta=2584.8m
[2024-06-13 23:52:30,795] INFO: Initiating epoch #36 train run on device rank=0
[2024-06-14 00:06:51,565] INFO: Initiating epoch #36 valid run on device rank=0
[2024-06-14 00:07:56,232] INFO: Rank 0: epoch=36 / 200 train_loss=40.7712 valid_loss=40.7263 stale=35 time=15.42m eta=2568.0m
[2024-06-14 00:07:56,280] INFO: Initiating epoch #37 train run on device rank=0
[2024-06-14 00:22:17,174] INFO: Initiating epoch #37 valid run on device rank=0
[2024-06-14 00:23:21,782] INFO: Rank 0: epoch=37 / 200 train_loss=40.8062 valid_loss=40.7263 stale=36 time=15.43m eta=2551.4m
[2024-06-14 00:23:21,819] INFO: Initiating epoch #38 train run on device rank=0
[2024-06-14 00:37:42,440] INFO: Initiating epoch #38 valid run on device rank=0
[2024-06-14 00:38:47,001] INFO: Rank 0: epoch=38 / 200 train_loss=40.8062 valid_loss=40.7263 stale=37 time=15.42m eta=2534.7m
[2024-06-14 00:38:47,037] INFO: Initiating epoch #39 train run on device rank=0
[2024-06-14 00:53:07,794] INFO: Initiating epoch #39 valid run on device rank=0
[2024-06-14 00:54:12,576] INFO: Rank 0: epoch=39 / 200 train_loss=40.8062 valid_loss=40.7263 stale=38 time=15.43m eta=2518.2m
[2024-06-14 00:54:12,618] INFO: Initiating epoch #40 train run on device rank=0
[2024-06-14 01:08:33,607] INFO: Initiating epoch #40 valid run on device rank=0
[2024-06-14 01:09:38,276] INFO: Rank 0: epoch=40 / 200 train_loss=40.8062 valid_loss=40.7263 stale=39 time=15.43m eta=2501.7m
[2024-06-14 01:09:38,329] INFO: Initiating epoch #41 train run on device rank=0
[2024-06-14 01:23:59,220] INFO: Initiating epoch #41 valid run on device rank=0
[2024-06-14 01:25:03,667] INFO: Rank 0: epoch=41 / 200 train_loss=40.8062 valid_loss=40.7263 stale=40 time=15.42m eta=2485.2m
[2024-06-14 01:25:03,700] INFO: Initiating epoch #42 train run on device rank=0
[2024-06-14 01:39:24,622] INFO: Initiating epoch #42 valid run on device rank=0
[2024-06-14 01:40:29,129] INFO: Rank 0: epoch=42 / 200 train_loss=40.8062 valid_loss=40.7263 stale=41 time=15.42m eta=2468.8m
[2024-06-14 01:40:29,181] INFO: Initiating epoch #43 train run on device rank=0
[2024-06-14 01:54:49,887] INFO: Initiating epoch #43 valid run on device rank=0
[2024-06-14 01:55:54,477] INFO: Rank 0: epoch=43 / 200 train_loss=40.8062 valid_loss=40.7263 stale=42 time=15.42m eta=2452.4m
[2024-06-14 01:55:54,521] INFO: Initiating epoch #44 train run on device rank=0
[2024-06-14 02:10:15,415] INFO: Initiating epoch #44 valid run on device rank=0
[2024-06-14 02:11:20,047] INFO: Rank 0: epoch=44 / 200 train_loss=40.8062 valid_loss=40.7263 stale=43 time=15.43m eta=2436.1m
[2024-06-14 02:11:20,118] INFO: Initiating epoch #45 train run on device rank=0
[2024-06-14 02:25:40,682] INFO: Initiating epoch #45 valid run on device rank=0
[2024-06-14 02:26:45,650] INFO: Rank 0: epoch=45 / 200 train_loss=40.8062 valid_loss=40.7263 stale=44 time=15.43m eta=2419.9m
[2024-06-14 02:26:45,719] INFO: Initiating epoch #46 train run on device rank=0
[2024-06-14 02:41:06,246] INFO: Initiating epoch #46 valid run on device rank=0
[2024-06-14 02:42:11,461] INFO: Rank 0: epoch=46 / 200 train_loss=40.8062 valid_loss=40.7263 stale=45 time=15.43m eta=2403.6m
[2024-06-14 02:42:11,538] INFO: Initiating epoch #47 train run on device rank=0
[2024-06-14 02:56:32,371] INFO: Initiating epoch #47 valid run on device rank=0
[2024-06-14 02:57:37,042] INFO: Rank 0: epoch=47 / 200 train_loss=40.8062 valid_loss=40.7263 stale=46 time=15.43m eta=2387.4m
[2024-06-14 02:57:37,123] INFO: Initiating epoch #48 train run on device rank=0
[2024-06-14 03:11:57,835] INFO: Initiating epoch #48 valid run on device rank=0
[2024-06-14 03:13:02,518] INFO: Rank 0: epoch=48 / 200 train_loss=40.8062 valid_loss=40.7263 stale=47 time=15.42m eta=2371.3m
[2024-06-14 03:13:02,570] INFO: Initiating epoch #49 train run on device rank=0
[2024-06-14 03:27:23,285] INFO: Initiating epoch #49 valid run on device rank=0
[2024-06-14 03:28:27,894] INFO: Rank 0: epoch=49 / 200 train_loss=40.8062 valid_loss=40.7263 stale=48 time=15.42m eta=2355.1m
[2024-06-14 03:28:27,939] INFO: Initiating epoch #50 train run on device rank=0
[2024-06-14 03:42:49,253] INFO: Initiating epoch #50 valid run on device rank=0
[2024-06-14 03:43:53,847] INFO: Rank 0: epoch=50 / 200 train_loss=40.8062 valid_loss=40.7263 stale=49 time=15.43m eta=2339.0m
[2024-06-14 03:43:53,888] INFO: Initiating epoch #51 train run on device rank=0
[2024-06-14 03:58:14,989] INFO: Initiating epoch #51 valid run on device rank=0
[2024-06-14 03:59:19,553] INFO: Rank 0: epoch=51 / 200 train_loss=40.8062 valid_loss=40.7263 stale=50 time=15.43m eta=2323.0m
[2024-06-14 03:59:19,616] INFO: Initiating epoch #52 train run on device rank=0
[2024-06-14 04:13:40,322] INFO: Initiating epoch #52 valid run on device rank=0