[2024-06-18 09:39:07,165] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus [2024-06-18 09:39:07,250] INFO: NVIDIA GeForce GTX 1080 Ti [2024-06-18 09:39:07,250] INFO: NVIDIA GeForce GTX 1080 Ti [2024-06-18 09:39:07,250] INFO: NVIDIA GeForce GTX 1080 Ti [2024-06-18 09:39:07,250] INFO: NVIDIA GeForce GTX 1080 Ti [2024-06-18 09:39:13,900] INFO: using dtype=torch.float32 [2024-06-18 09:39:14,732] INFO: using attention_type=math [2024-06-18 09:39:14,749] INFO: using attention_type=math [2024-06-18 09:39:14,767] INFO: using attention_type=math [2024-06-18 09:39:14,789] INFO: using attention_type=math [2024-06-18 09:39:14,807] INFO: using attention_type=math [2024-06-18 09:39:14,824] INFO: using attention_type=math [2024-06-18 09:39:18,654] INFO: mlpf_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'math'} [2024-06-18 09:39:18,654] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/best_weights.pth [2024-06-18 09:39:19,800] INFO: DistributedDataParallel( (module): MLPF( (nn0_id): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (nn0_reg): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) 
(conv_id): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (conv_reg): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (nn_id): Sequential( (0): Linear(in_features=529, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=6, bias=True) ) (nn_pt): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_eta): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_sin_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, 
inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_cos_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_energy): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) ) ) [2024-06-18 09:39:19,802] INFO: Backbone Trainable parameters: 11671568 [2024-06-18 09:39:19,802] INFO: Backbone Non-trainable parameters: 0 [2024-06-18 09:39:19,802] INFO: Backbone Total parameters: 11671568 [2024-06-18 09:39:19,807] INFO: Modules Trainable parameters Non-tranable parameters module.nn0_id.0.weight 8704 0 module.nn0_id.0.bias 512 0 module.nn0_id.2.weight 512 0 module.nn0_id.2.bias 512 0 module.nn0_id.4.weight 262144 0 module.nn0_id.4.bias 512 0 module.nn0_reg.0.weight 8704 0 module.nn0_reg.0.bias 512 0 module.nn0_reg.2.weight 512 0 module.nn0_reg.2.bias 512 0 module.nn0_reg.4.weight 262144 0 module.nn0_reg.4.bias 512 0 module.conv_id.0.mha.in_proj_weight 786432 0 module.conv_id.0.mha.in_proj_bias 1536 0 module.conv_id.0.mha.out_proj.weight 262144 0 module.conv_id.0.mha.out_proj.bias 512 0 module.conv_id.0.norm0.weight 512 0 module.conv_id.0.norm0.bias 512 0 module.conv_id.0.norm1.weight 512 0 module.conv_id.0.norm1.bias 512 0 module.conv_id.0.seq.0.weight 262144 0 module.conv_id.0.seq.0.bias 512 0 module.conv_id.0.seq.2.weight 262144 0 module.conv_id.0.seq.2.bias 512 0 module.conv_id.1.mha.in_proj_weight 786432 0 module.conv_id.1.mha.in_proj_bias 1536 0 module.conv_id.1.mha.out_proj.weight 262144 0 module.conv_id.1.mha.out_proj.bias 512 0 module.conv_id.1.norm0.weight 512 0 module.conv_id.1.norm0.bias 512 0 
module.conv_id.1.norm1.weight 512 0 module.conv_id.1.norm1.bias 512 0 module.conv_id.1.seq.0.weight 262144 0 module.conv_id.1.seq.0.bias 512 0 module.conv_id.1.seq.2.weight 262144 0 module.conv_id.1.seq.2.bias 512 0 module.conv_id.2.mha.in_proj_weight 786432 0 module.conv_id.2.mha.in_proj_bias 1536 0 module.conv_id.2.mha.out_proj.weight 262144 0 module.conv_id.2.mha.out_proj.bias 512 0 module.conv_id.2.norm0.weight 512 0 module.conv_id.2.norm0.bias 512 0 module.conv_id.2.norm1.weight 512 0 module.conv_id.2.norm1.bias 512 0 module.conv_id.2.seq.0.weight 262144 0 module.conv_id.2.seq.0.bias 512 0 module.conv_id.2.seq.2.weight 262144 0 module.conv_id.2.seq.2.bias 512 0 module.conv_reg.0.mha.in_proj_weight 786432 0 module.conv_reg.0.mha.in_proj_bias 1536 0 module.conv_reg.0.mha.out_proj.weight 262144 0 module.conv_reg.0.mha.out_proj.bias 512 0 module.conv_reg.0.norm0.weight 512 0 module.conv_reg.0.norm0.bias 512 0 module.conv_reg.0.norm1.weight 512 0 module.conv_reg.0.norm1.bias 512 0 module.conv_reg.0.seq.0.weight 262144 0 module.conv_reg.0.seq.0.bias 512 0 module.conv_reg.0.seq.2.weight 262144 0 module.conv_reg.0.seq.2.bias 512 0 module.conv_reg.1.mha.in_proj_weight 786432 0 module.conv_reg.1.mha.in_proj_bias 1536 0 module.conv_reg.1.mha.out_proj.weight 262144 0 module.conv_reg.1.mha.out_proj.bias 512 0 module.conv_reg.1.norm0.weight 512 0 module.conv_reg.1.norm0.bias 512 0 module.conv_reg.1.norm1.weight 512 0 module.conv_reg.1.norm1.bias 512 0 module.conv_reg.1.seq.0.weight 262144 0 module.conv_reg.1.seq.0.bias 512 0 module.conv_reg.1.seq.2.weight 262144 0 module.conv_reg.1.seq.2.bias 512 0 module.conv_reg.2.mha.in_proj_weight 786432 0 module.conv_reg.2.mha.in_proj_bias 1536 0 module.conv_reg.2.mha.out_proj.weight 262144 0 module.conv_reg.2.mha.out_proj.bias 512 0 module.conv_reg.2.norm0.weight 512 0 module.conv_reg.2.norm0.bias 512 0 module.conv_reg.2.norm1.weight 512 0 module.conv_reg.2.norm1.bias 512 0 module.conv_reg.2.seq.0.weight 262144 0 
module.conv_reg.2.seq.0.bias 512 0 module.conv_reg.2.seq.2.weight 262144 0 module.conv_reg.2.seq.2.bias 512 0 module.nn_id.0.weight 270848 0 module.nn_id.0.bias 512 0 module.nn_id.2.weight 512 0 module.nn_id.2.bias 512 0 module.nn_id.4.weight 3072 0 module.nn_id.4.bias 6 0 module.nn_pt.nn.0.weight 273920 0 module.nn_pt.nn.0.bias 512 0 module.nn_pt.nn.2.weight 512 0 module.nn_pt.nn.2.bias 512 0 module.nn_pt.nn.4.weight 1024 0 module.nn_pt.nn.4.bias 2 0 module.nn_eta.nn.0.weight 273920 0 module.nn_eta.nn.0.bias 512 0 module.nn_eta.nn.2.weight 512 0 module.nn_eta.nn.2.bias 512 0 module.nn_eta.nn.4.weight 1024 0 module.nn_eta.nn.4.bias 2 0 module.nn_sin_phi.nn.0.weight 273920 0 module.nn_sin_phi.nn.0.bias 512 0 module.nn_sin_phi.nn.2.weight 512 0 module.nn_sin_phi.nn.2.bias 512 0 module.nn_sin_phi.nn.4.weight 1024 0 module.nn_sin_phi.nn.4.bias 2 0 module.nn_cos_phi.nn.0.weight 273920 0 module.nn_cos_phi.nn.0.bias 512 0 module.nn_cos_phi.nn.2.weight 512 0 module.nn_cos_phi.nn.2.bias 512 0 module.nn_cos_phi.nn.4.weight 1024 0 module.nn_cos_phi.nn.4.bias 2 0 module.nn_energy.nn.0.weight 273920 0 module.nn_energy.nn.0.bias 512 0 module.nn_energy.nn.2.weight 512 0 module.nn_energy.nn.2.bias 512 0 module.nn_energy.nn.4.weight 1024 0 module.nn_energy.nn.4.bias 2 0 [2024-06-18 09:39:19,899] INFO: DistributedDataParallel( (module): DeepMET( (nn): Sequential( (0): Linear(in_features=11, out_features=256, bias=True) (1): ELU(alpha=1.0) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0, inplace=False) (4): Linear(in_features=256, out_features=2, bias=True) ) ) ) [2024-06-18 09:39:19,899] INFO: DeepMET Trainable parameters: 4098 [2024-06-18 09:39:19,899] INFO: DeepMET Non-trainable parameters: 0 [2024-06-18 09:39:19,900] INFO: DeepMET Total parameters: 4098 [2024-06-18 09:39:19,901] INFO: Modules Trainable parameters Non-tranable parameters module.nn.0.weight 2816 0 module.nn.0.bias 256 0 module.nn.2.weight 256 0 module.nn.2.bias 256 0 module.nn.4.weight 
512 0 module.nn.4.bias 2 0 [2024-06-18 09:39:19,902] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_MLPFCands_FloatBackbone_20240618_093906_891861 [2024-06-18 09:39:19,902] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_MLPFCands_FloatBackbone_20240618_093906_891861 [2024-06-18 09:39:19,943] INFO: train_dataset: clic_edm_ttbar_pf, 800800 [2024-06-18 09:39:20,063] INFO: valid_dataset: clic_edm_ttbar_pf, 200200 [2024-06-18 09:39:20,112] INFO: Initiating epoch #1 train run on device rank=0 [2024-06-18 10:01:25,397] INFO: Initiating epoch #1 valid run on device rank=0 [2024-06-18 10:03:03,291] INFO: Rank 0: epoch=1 / 400 train_loss=23.8286 valid_loss=17.1102 stale=0 time=23.72m eta=9464.1m [2024-06-18 10:03:03,293] INFO: Initiating epoch #2 train run on device rank=0 [2024-06-18 10:25:00,302] INFO: Initiating epoch #2 valid run on device rank=0 [2024-06-18 10:26:37,339] INFO: Rank 0: epoch=2 / 400 train_loss=15.1906 valid_loss=13.5771 stale=0 time=23.57m eta=9410.1m [2024-06-18 10:26:37,461] INFO: Initiating epoch #3 train run on device rank=0 [2024-06-18 10:48:34,345] INFO: Initiating epoch #3 valid run on device rank=0 [2024-06-18 10:50:11,419] INFO: Rank 0: epoch=3 / 400 train_loss=12.9544 valid_loss=11.9481 stale=0 time=23.57m eta=9376.5m [2024-06-18 10:50:11,450] INFO: Initiating epoch #4 train run on device rank=0 [2024-06-18 11:12:07,884] INFO: Initiating epoch #4 valid run on device rank=0 [2024-06-18 11:13:44,966] INFO: Rank 0: epoch=4 / 400 train_loss=11.3839 valid_loss=10.7932 stale=0 time=23.56m eta=9347.0m [2024-06-18 11:13:45,009] INFO: Initiating epoch #5 train run on device rank=0 [2024-06-18 11:35:41,818] INFO: Initiating epoch #5 valid run on device rank=0 [2024-06-18 11:37:18,696] INFO: Rank 0: epoch=5 / 400 train_loss=10.3164 valid_loss=10.0647 stale=0 time=23.56m eta=9320.1m [2024-06-18 11:37:18,727] INFO: Initiating epoch 
#6 train run on device rank=0 [2024-06-18 11:59:15,394] INFO: Initiating epoch #6 valid run on device rank=0 [2024-06-18 12:00:52,017] INFO: Rank 0: epoch=6 / 400 train_loss=9.6930 valid_loss=9.5622 stale=0 time=23.55m eta=9293.9m [2024-06-18 12:00:52,033] INFO: Initiating epoch #7 train run on device rank=0 [2024-06-18 12:22:48,884] INFO: Initiating epoch #7 valid run on device rank=0 [2024-06-18 12:24:25,732] INFO: Rank 0: epoch=7 / 400 train_loss=9.2712 valid_loss=9.2283 stale=0 time=23.56m eta=9268.8m [2024-06-18 12:24:25,751] INFO: Initiating epoch #8 train run on device rank=0 [2024-06-18 12:46:22,656] INFO: Initiating epoch #8 valid run on device rank=0 [2024-06-18 12:47:59,885] INFO: Rank 0: epoch=8 / 400 train_loss=8.9626 valid_loss=8.9566 stale=0 time=23.57m eta=9244.5m [2024-06-18 12:47:59,973] INFO: Initiating epoch #9 train run on device rank=0 [2024-06-18 13:09:57,341] INFO: Initiating epoch #9 valid run on device rank=0 [2024-06-18 13:11:34,449] INFO: Rank 0: epoch=9 / 400 train_loss=8.6892 valid_loss=8.6716 stale=0 time=23.57m eta=9220.6m [2024-06-18 13:11:34,472] INFO: Initiating epoch #10 train run on device rank=0 [2024-06-18 13:33:30,761] INFO: Initiating epoch #10 valid run on device rank=0 [2024-06-18 13:35:08,067] INFO: Rank 0: epoch=10 / 400 train_loss=8.4612 valid_loss=8.5019 stale=0 time=23.56m eta=9196.2m [2024-06-18 13:35:08,333] INFO: Initiating epoch #11 train run on device rank=0 [2024-06-18 13:57:04,642] INFO: Initiating epoch #11 valid run on device rank=0 [2024-06-18 13:58:41,603] INFO: Rank 0: epoch=11 / 400 train_loss=8.3197 valid_loss=8.3940 stale=0 time=23.55m eta=9171.8m [2024-06-18 13:58:41,612] INFO: Initiating epoch #12 train run on device rank=0 [2024-06-18 14:20:38,717] INFO: Initiating epoch #12 valid run on device rank=0 [2024-06-18 14:22:15,518] INFO: Rank 0: epoch=12 / 400 train_loss=8.2151 valid_loss=8.3044 stale=0 time=23.57m eta=9147.9m [2024-06-18 14:22:15,538] INFO: Initiating epoch #13 train run on device rank=0 
[2024-06-18 14:44:13,145] INFO: Initiating epoch #13 valid run on device rank=0 [2024-06-18 14:45:50,555] INFO: Rank 0: epoch=13 / 400 train_loss=8.1301 valid_loss=8.2294 stale=0 time=23.58m eta=9124.5m [2024-06-18 14:45:50,582] INFO: Initiating epoch #14 train run on device rank=0 [2024-06-18 15:07:48,240] INFO: Initiating epoch #14 valid run on device rank=0 [2024-06-18 15:09:25,113] INFO: Rank 0: epoch=14 / 400 train_loss=8.0566 valid_loss=8.1692 stale=0 time=23.58m eta=9100.9m [2024-06-18 15:09:25,136] INFO: Initiating epoch #15 train run on device rank=0 [2024-06-18 15:31:22,421] INFO: Initiating epoch #15 valid run on device rank=0 [2024-06-18 15:32:58,984] INFO: Rank 0: epoch=15 / 400 train_loss=7.9927 valid_loss=8.1174 stale=0 time=23.56m eta=9077.0m [2024-06-18 15:32:59,071] INFO: Initiating epoch #16 train run on device rank=0 [2024-06-18 15:54:55,517] INFO: Initiating epoch #16 valid run on device rank=0 [2024-06-18 15:56:31,859] INFO: Rank 0: epoch=16 / 400 train_loss=7.9367 valid_loss=8.0713 stale=0 time=23.55m eta=9052.7m [2024-06-18 15:56:31,863] INFO: Initiating epoch #17 train run on device rank=0 [2024-06-18 16:18:28,652] INFO: Initiating epoch #17 valid run on device rank=0 [2024-06-18 16:20:05,997] INFO: Rank 0: epoch=17 / 400 train_loss=7.8854 valid_loss=8.0296 stale=0 time=23.57m eta=9029.0m [2024-06-18 16:20:06,199] INFO: Initiating epoch #18 train run on device rank=0 [2024-06-18 16:42:02,189] INFO: Initiating epoch #18 valid run on device rank=0 [2024-06-18 16:43:38,866] INFO: Rank 0: epoch=18 / 400 train_loss=7.8394 valid_loss=7.9945 stale=0 time=23.54m eta=9004.9m [2024-06-18 16:43:38,908] INFO: Initiating epoch #19 train run on device rank=0 [2024-06-18 17:05:35,899] INFO: Initiating epoch #19 valid run on device rank=0 [2024-06-18 17:07:12,321] INFO: Rank 0: epoch=19 / 400 train_loss=7.7973 valid_loss=7.9636 stale=0 time=23.56m eta=8981.0m [2024-06-18 17:07:12,344] INFO: Initiating epoch #20 train run on device rank=0 [2024-06-18 
17:29:08,668] INFO: Initiating epoch #20 valid run on device rank=0 [2024-06-18 17:30:52,956] INFO: Rank 0: epoch=20 / 400 train_loss=7.7586 valid_loss=7.9336 stale=0 time=23.68m eta=8959.4m [2024-06-18 17:30:53,044] INFO: Initiating epoch #21 train run on device rank=0 [2024-06-18 17:52:48,142] INFO: Initiating epoch #21 valid run on device rank=0 [2024-06-18 17:54:46,919] INFO: Rank 0: epoch=21 / 400 train_loss=7.7228 valid_loss=7.9086 stale=0 time=23.9m eta=8941.6m [2024-06-18 17:54:47,465] INFO: Initiating epoch #22 train run on device rank=0 [2024-06-18 18:16:42,717] INFO: Initiating epoch #22 valid run on device rank=0 [2024-06-18 18:18:19,095] INFO: Rank 0: epoch=22 / 400 train_loss=7.6896 valid_loss=7.8842 stale=0 time=23.53m eta=8917.1m [2024-06-18 18:18:20,602] INFO: Initiating epoch #23 train run on device rank=0 [2024-06-18 18:40:17,021] INFO: Initiating epoch #23 valid run on device rank=0 [2024-06-18 18:42:19,580] INFO: Rank 0: epoch=23 / 400 train_loss=7.6582 valid_loss=7.8620 stale=0 time=23.98m eta=8900.3m [2024-06-18 18:42:23,801] INFO: Initiating epoch #24 train run on device rank=0 [2024-06-18 19:04:19,159] INFO: Initiating epoch #24 valid run on device rank=0 [2024-06-18 19:05:55,902] INFO: Rank 0: epoch=24 / 400 train_loss=7.6282 valid_loss=7.8402 stale=0 time=23.54m eta=8876.7m [2024-06-18 19:05:56,017] INFO: Initiating epoch #25 train run on device rank=0 [2024-06-18 19:27:52,297] INFO: Initiating epoch #25 valid run on device rank=0 [2024-06-18 19:29:29,175] INFO: Rank 0: epoch=25 / 400 train_loss=7.6004 valid_loss=7.8215 stale=0 time=23.55m eta=8852.3m [2024-06-18 19:29:29,204] INFO: Initiating epoch #26 train run on device rank=0 [2024-06-18 19:51:25,274] INFO: Initiating epoch #26 valid run on device rank=0 [2024-06-18 19:53:14,078] INFO: Rank 0: epoch=26 / 400 train_loss=7.5740 valid_loss=7.8059 stale=0 time=23.75m eta=8830.7m [2024-06-18 19:53:14,162] INFO: Initiating epoch #27 train run on device rank=0 [2024-06-18 20:15:09,587] INFO: 
Initiating epoch #27 valid run on device rank=0 [2024-06-18 20:16:47,721] INFO: Rank 0: epoch=27 / 400 train_loss=7.5490 valid_loss=7.7883 stale=0 time=23.56m eta=8806.4m [2024-06-18 20:16:47,738] INFO: Initiating epoch #28 train run on device rank=0 [2024-06-18 20:38:43,255] INFO: Initiating epoch #28 valid run on device rank=0 [2024-06-18 20:42:14,425] INFO: Rank 0: epoch=28 / 400 train_loss=7.5251 valid_loss=7.7745 stale=0 time=25.44m eta=8807.2m [2024-06-18 20:42:15,804] INFO: Initiating epoch #29 train run on device rank=0 [2024-06-18 21:04:04,941] INFO: Initiating epoch #29 valid run on device rank=0 [2024-06-18 21:07:14,691] INFO: Rank 0: epoch=29 / 400 train_loss=7.5021 valid_loss=7.7605 stale=0 time=24.98m eta=8800.5m [2024-06-18 21:07:20,819] INFO: Initiating epoch #30 train run on device rank=0 [2024-06-18 21:29:10,161] INFO: Initiating epoch #30 valid run on device rank=0 [2024-06-18 21:31:39,708] INFO: Rank 0: epoch=30 / 400 train_loss=7.4800 valid_loss=7.7476 stale=0 time=24.31m eta=8785.4m [2024-06-18 21:31:40,864] INFO: Initiating epoch #31 train run on device rank=0 [2024-06-18 21:53:29,854] INFO: Initiating epoch #31 valid run on device rank=0 [2024-06-18 21:55:50,418] INFO: Rank 0: epoch=31 / 400 train_loss=7.4589 valid_loss=7.7354 stale=0 time=24.16m eta=8766.8m [2024-06-18 21:55:51,958] INFO: Initiating epoch #32 train run on device rank=0 [2024-06-18 22:17:40,941] INFO: Initiating epoch #32 valid run on device rank=0 [2024-06-18 22:19:46,307] INFO: Rank 0: epoch=32 / 400 train_loss=7.4385 valid_loss=7.7232 stale=0 time=23.91m eta=8745.0m [2024-06-18 22:19:48,394] INFO: Initiating epoch #33 train run on device rank=0 [2024-06-18 22:41:37,658] INFO: Initiating epoch #33 valid run on device rank=0 [2024-06-18 22:43:18,706] INFO: Rank 0: epoch=33 / 400 train_loss=7.4189 valid_loss=7.7124 stale=0 time=23.51m eta=8718.8m [2024-06-18 22:43:19,739] INFO: Initiating epoch #34 train run on device rank=0 [2024-06-18 23:05:10,140] INFO: Initiating epoch 
#34 valid run on device rank=0 [2024-06-18 23:06:46,892] INFO: Rank 0: epoch=34 / 400 train_loss=7.4003 valid_loss=7.7024 stale=0 time=23.45m eta=8691.9m [2024-06-18 23:06:47,175] INFO: Initiating epoch #35 train run on device rank=0 [2024-06-18 23:28:36,859] INFO: Initiating epoch #35 valid run on device rank=0 [2024-06-18 23:30:19,367] INFO: Rank 0: epoch=35 / 400 train_loss=7.3821 valid_loss=7.6932 stale=0 time=23.54m eta=8666.0m [2024-06-18 23:30:19,990] INFO: Initiating epoch #36 train run on device rank=0 [2024-06-18 23:52:09,641] INFO: Initiating epoch #36 valid run on device rank=0 [2024-06-18 23:56:45,651] INFO: Rank 0: epoch=36 / 400 train_loss=7.3646 valid_loss=7.6822 stale=0 time=26.43m eta=8669.5m [2024-06-18 23:57:45,551] INFO: Initiating epoch #37 train run on device rank=0 [2024-06-19 00:19:35,592] INFO: Initiating epoch #37 valid run on device rank=0 [2024-06-19 00:22:17,305] INFO: Rank 0: epoch=37 / 400 train_loss=7.3475 valid_loss=7.6741 stale=0 time=24.53m eta=8662.5m [2024-06-19 00:22:18,427] INFO: Initiating epoch #38 train run on device rank=0 [2024-06-19 00:44:07,769] INFO: Initiating epoch #38 valid run on device rank=0 [2024-06-19 00:47:22,539] INFO: Rank 0: epoch=38 / 400 train_loss=7.3312 valid_loss=7.6657 stale=0 time=25.07m eta=8650.3m [2024-06-19 00:47:44,587] INFO: Initiating epoch #39 train run on device rank=0 [2024-06-19 01:09:34,473] INFO: Initiating epoch #39 valid run on device rank=0 [2024-06-19 01:12:13,600] INFO: Rank 0: epoch=39 / 400 train_loss=7.3152 valid_loss=7.6581 stale=0 time=24.48m eta=8635.2m [2024-06-19 01:12:21,228] INFO: Initiating epoch #40 train run on device rank=0 [2024-06-19 01:34:11,147] INFO: Initiating epoch #40 valid run on device rank=0 [2024-06-19 01:36:02,456] INFO: Rank 0: epoch=40 / 400 train_loss=7.2995 valid_loss=7.6523 stale=0 time=23.69m eta=8610.4m [2024-06-19 01:36:05,672] INFO: Initiating epoch #41 train run on device rank=0 [2024-06-19 01:57:54,801] INFO: Initiating epoch #41 valid run on 
device rank=0 [2024-06-19 01:59:41,281] INFO: Rank 0: epoch=41 / 400 train_loss=7.2845 valid_loss=7.6432 stale=0 time=23.59m eta=8584.1m [2024-06-19 01:59:43,658] INFO: Initiating epoch #42 train run on device rank=0 [2024-06-19 02:21:33,035] INFO: Initiating epoch #42 valid run on device rank=0 [2024-06-19 02:23:18,802] INFO: Rank 0: epoch=42 / 400 train_loss=7.2698 valid_loss=7.6386 stale=0 time=23.59m eta=8557.7m [2024-06-19 02:23:19,939] INFO: Initiating epoch #43 train run on device rank=0 [2024-06-19 02:45:08,720] INFO: Initiating epoch #43 valid run on device rank=0 [2024-06-19 02:46:58,434] INFO: Rank 0: epoch=43 / 400 train_loss=7.2552 valid_loss=7.6309 stale=0 time=23.64m eta=8531.8m [2024-06-19 02:47:00,086] INFO: Initiating epoch #44 train run on device rank=0 [2024-06-19 03:08:48,758] INFO: Initiating epoch #44 valid run on device rank=0 [2024-06-19 03:10:36,288] INFO: Rank 0: epoch=44 / 400 train_loss=7.2408 valid_loss=7.6256 stale=0 time=23.6m eta=8505.7m [2024-06-19 03:10:37,211] INFO: Initiating epoch #45 train run on device rank=0 [2024-06-19 03:32:26,049] INFO: Initiating epoch #45 valid run on device rank=0 [2024-06-19 03:34:11,300] INFO: Rank 0: epoch=45 / 400 train_loss=7.2270 valid_loss=7.6188 stale=0 time=23.57m eta=8479.4m [2024-06-19 03:34:12,130] INFO: Initiating epoch #46 train run on device rank=0 [2024-06-19 03:56:01,368] INFO: Initiating epoch #46 valid run on device rank=0 [2024-06-19 03:57:44,738] INFO: Rank 0: epoch=46 / 400 train_loss=7.2133 valid_loss=7.6132 stale=0 time=23.54m eta=8453.0m [2024-06-19 03:57:45,611] INFO: Initiating epoch #47 train run on device rank=0 [2024-06-19 04:19:34,893] INFO: Initiating epoch #47 valid run on device rank=0 [2024-06-19 04:21:16,659] INFO: Rank 0: epoch=47 / 400 train_loss=7.2000 valid_loss=7.6077 stale=0 time=23.52m eta=8426.5m [2024-06-19 04:21:19,101] INFO: Initiating epoch #48 train run on device rank=0 [2024-06-19 04:43:08,149] INFO: Initiating epoch #48 valid run on device rank=0 
[2024-06-19 04:44:49,871] INFO: Rank 0: epoch=48 / 400 train_loss=7.1869 valid_loss=7.6034 stale=0 time=23.51m eta=8400.3m [2024-06-19 04:44:51,911] INFO: Initiating epoch #49 train run on device rank=0 [2024-06-19 05:06:40,912] INFO: Initiating epoch #49 valid run on device rank=0 [2024-06-19 05:08:22,642] INFO: Rank 0: epoch=49 / 400 train_loss=7.1736 valid_loss=7.5978 stale=0 time=23.51m eta=8374.2m [2024-06-19 05:08:23,353] INFO: Initiating epoch #50 train run on device rank=0 [2024-06-19 05:30:11,658] INFO: Initiating epoch #50 valid run on device rank=0 [2024-06-19 05:31:53,464] INFO: Rank 0: epoch=50 / 400 train_loss=7.1608 valid_loss=7.5930 stale=0 time=23.5m eta=8347.9m [2024-06-19 05:31:54,307] INFO: Initiating epoch #51 train run on device rank=0 [2024-06-19 05:53:43,838] INFO: Initiating epoch #51 valid run on device rank=0 [2024-06-19 05:55:31,535] INFO: Rank 0: epoch=51 / 400 train_loss=7.1484 valid_loss=7.5892 stale=0 time=23.62m eta=8322.6m [2024-06-19 05:55:32,510] INFO: Initiating epoch #52 train run on device rank=0 [2024-06-19 06:17:21,435] INFO: Initiating epoch #52 valid run on device rank=0 [2024-06-19 06:19:01,659] INFO: Rank 0: epoch=52 / 400 train_loss=7.1361 valid_loss=7.5854 stale=0 time=23.49m eta=8296.4m [2024-06-19 06:19:02,241] INFO: Initiating epoch #53 train run on device rank=0 [2024-06-19 06:40:50,965] INFO: Initiating epoch #53 valid run on device rank=0 [2024-06-19 06:42:30,744] INFO: Rank 0: epoch=53 / 400 train_loss=7.1237 valid_loss=7.5804 stale=0 time=23.48m eta=8270.2m [2024-06-19 06:42:31,166] INFO: Initiating epoch #54 train run on device rank=0 [2024-06-19 07:04:19,446] INFO: Initiating epoch #54 valid run on device rank=0 [2024-06-19 07:05:56,110] INFO: Rank 0: epoch=54 / 400 train_loss=7.1119 valid_loss=7.5773 stale=0 time=23.42m eta=8243.8m [2024-06-19 07:05:56,190] INFO: Initiating epoch #55 train run on device rank=0 [2024-06-19 07:27:46,287] INFO: Initiating epoch #55 valid run on device rank=0 [2024-06-19 
07:29:22,676] INFO: Rank 0: epoch=55 / 400 train_loss=7.1000 valid_loss=7.5740 stale=0 time=23.44m eta=8217.5m [2024-06-19 07:29:22,732] INFO: Initiating epoch #56 train run on device rank=0 [2024-06-19 07:51:12,421] INFO: Initiating epoch #56 valid run on device rank=0 [2024-06-19 07:52:49,194] INFO: Rank 0: epoch=56 / 400 train_loss=7.0884 valid_loss=7.5701 stale=0 time=23.44m eta=8191.4m [2024-06-19 07:52:49,259] INFO: Initiating epoch #57 train run on device rank=0 [2024-06-19 08:14:38,903] INFO: Initiating epoch #57 valid run on device rank=0 [2024-06-19 08:16:15,943] INFO: Rank 0: epoch=57 / 400 train_loss=7.0767 valid_loss=7.5664 stale=0 time=23.44m eta=8165.4m [2024-06-19 08:16:15,951] INFO: Initiating epoch #58 train run on device rank=0 [2024-06-19 08:38:05,772] INFO: Initiating epoch #58 valid run on device rank=0 [2024-06-19 08:39:46,356] INFO: Rank 0: epoch=58 / 400 train_loss=7.0653 valid_loss=7.5639 stale=0 time=23.51m eta=8139.8m [2024-06-19 08:39:46,901] INFO: Initiating epoch #59 train run on device rank=0 [2024-06-19 09:01:35,422] INFO: Initiating epoch #59 valid run on device rank=0 [2024-06-19 09:03:11,716] INFO: Rank 0: epoch=59 / 400 train_loss=7.0540 valid_loss=7.5619 stale=0 time=23.41m eta=8113.8m [2024-06-19 09:03:11,767] INFO: Initiating epoch #60 train run on device rank=0 [2024-06-19 09:25:01,658] INFO: Initiating epoch #60 valid run on device rank=0 [2024-06-19 09:26:42,773] INFO: Rank 0: epoch=60 / 400 train_loss=7.0427 valid_loss=7.5593 stale=0 time=23.52m eta=8088.5m [2024-06-19 09:26:43,252] INFO: Initiating epoch #61 train run on device rank=0 [2024-06-19 09:48:31,766] INFO: Initiating epoch #61 valid run on device rank=0 [2024-06-19 09:50:08,192] INFO: Rank 0: epoch=61 / 400 train_loss=7.0317 valid_loss=7.5560 stale=0 time=23.42m eta=8062.7m [2024-06-19 09:50:08,312] INFO: Initiating epoch #62 train run on device rank=0 [2024-06-19 10:11:58,627] INFO: Initiating epoch #62 valid run on device rank=0 [2024-06-19 10:13:35,385] 
INFO: Rank 0: epoch=62 / 400 train_loss=7.0212 valid_loss=7.5540 stale=0 time=23.45m eta=8037.1m [2024-06-19 10:13:35,562] INFO: Initiating epoch #63 train run on device rank=0 [2024-06-19 10:35:25,242] INFO: Initiating epoch #63 valid run on device rank=0 [2024-06-19 10:37:01,848] INFO: Rank 0: epoch=63 / 400 train_loss=7.0103 valid_loss=7.5510 stale=0 time=23.44m eta=8011.5m [2024-06-19 10:37:01,908] INFO: Initiating epoch #64 train run on device rank=0 [2024-06-19 10:58:51,599] INFO: Initiating epoch #64 valid run on device rank=0 [2024-06-19 11:00:28,232] INFO: Rank 0: epoch=64 / 400 train_loss=6.9994 valid_loss=7.5485 stale=0 time=23.44m eta=7986.0m [2024-06-19 11:00:28,323] INFO: Initiating epoch #65 train run on device rank=0 [2024-06-19 11:22:17,836] INFO: Initiating epoch #65 valid run on device rank=0 [2024-06-19 11:23:54,436] INFO: Rank 0: epoch=65 / 400 train_loss=6.9892 valid_loss=7.5457 stale=0 time=23.44m eta=7960.5m [2024-06-19 11:23:54,499] INFO: Initiating epoch #66 train run on device rank=0 [2024-06-19 11:45:44,167] INFO: Initiating epoch #66 valid run on device rank=0 [2024-06-19 11:47:21,203] INFO: Rank 0: epoch=66 / 400 train_loss=6.9784 valid_loss=7.5440 stale=0 time=23.45m eta=7935.1m [2024-06-19 11:47:21,246] INFO: Initiating epoch #67 train run on device rank=0 [2024-06-19 12:09:10,989] INFO: Initiating epoch #67 valid run on device rank=0 [2024-06-19 12:10:47,490] INFO: Rank 0: epoch=67 / 400 train_loss=6.9680 valid_loss=7.5429 stale=0 time=23.44m eta=7909.8m [2024-06-19 12:10:47,535] INFO: Initiating epoch #68 train run on device rank=0 [2024-06-19 12:32:37,368] INFO: Initiating epoch #68 valid run on device rank=0 [2024-06-19 12:34:13,820] INFO: Rank 0: epoch=68 / 400 train_loss=6.9581 valid_loss=7.5408 stale=0 time=23.44m eta=7884.5m [2024-06-19 12:34:13,856] INFO: Initiating epoch #69 train run on device rank=0 [2024-06-19 12:56:03,589] INFO: Initiating epoch #69 valid run on device rank=0 [2024-06-19 12:57:40,177] INFO: Rank 0: 
epoch=69 / 400 train_loss=6.9475 valid_loss=7.5395 stale=0 time=23.44m eta=7859.3m [2024-06-19 12:57:40,226] INFO: Initiating epoch #70 train run on device rank=0 [2024-06-19 13:19:29,953] INFO: Initiating epoch #70 valid run on device rank=0 [2024-06-19 13:21:06,367] INFO: Rank 0: epoch=70 / 400 train_loss=6.9373 valid_loss=7.5378 stale=0 time=23.44m eta=7834.1m [2024-06-19 13:21:06,406] INFO: Initiating epoch #71 train run on device rank=0 [2024-06-19 13:42:56,082] INFO: Initiating epoch #71 valid run on device rank=0 [2024-06-19 13:44:38,034] INFO: Rank 0: epoch=71 / 400 train_loss=6.9276 valid_loss=7.5367 stale=0 time=23.53m eta=7809.3m [2024-06-19 13:44:38,634] INFO: Initiating epoch #72 train run on device rank=0 [2024-06-19 14:06:27,397] INFO: Initiating epoch #72 valid run on device rank=0 [2024-06-19 14:08:09,289] INFO: Rank 0: epoch=72 / 400 train_loss=6.9174 valid_loss=7.5352 stale=0 time=23.51m eta=7784.6m [2024-06-19 14:08:09,920] INFO: Initiating epoch #73 train run on device rank=0 [2024-06-19 14:29:58,407] INFO: Initiating epoch #73 valid run on device rank=0 [2024-06-19 14:31:40,987] INFO: Rank 0: epoch=73 / 400 train_loss=6.9077 valid_loss=7.5341 stale=0 time=23.52m eta=7760.0m [2024-06-19 14:31:41,469] INFO: Initiating epoch #74 train run on device rank=0 [2024-06-19 14:53:29,920] INFO: Initiating epoch #74 valid run on device rank=0 [2024-06-19 14:55:12,777] INFO: Rank 0: epoch=74 / 400 train_loss=6.8979 valid_loss=7.5329 stale=0 time=23.52m eta=7735.4m [2024-06-19 14:55:13,562] INFO: Initiating epoch #75 train run on device rank=0 [2024-06-19 15:17:02,416] INFO: Initiating epoch #75 valid run on device rank=0 [2024-06-19 15:18:44,135] INFO: Rank 0: epoch=75 / 400 train_loss=6.8881 valid_loss=7.5318 stale=0 time=23.51m eta=7710.7m [2024-06-19 15:18:44,904] INFO: Initiating epoch #76 train run on device rank=0 [2024-06-19 15:40:34,690] INFO: Initiating epoch #76 valid run on device rank=0 [2024-06-19 15:42:16,661] INFO: Rank 0: epoch=76 / 400 
train_loss=6.8785 valid_loss=7.5312 stale=0 time=23.53m eta=7686.2m
[2024-06-19 15:42:17,450] INFO: Initiating epoch #77 train run on device rank=0
[2024-06-19 16:04:06,628] INFO: Initiating epoch #77 valid run on device rank=0
[2024-06-19 16:05:46,552] INFO: Rank 0: epoch=77 / 400 train_loss=6.8693 valid_loss=7.5315 stale=1 time=23.49m eta=7661.6m
[2024-06-19 16:05:47,363] INFO: Initiating epoch #78 train run on device rank=0
[2024-06-19 16:27:35,374] INFO: Initiating epoch #78 valid run on device rank=0
[2024-06-19 16:29:16,945] INFO: Rank 0: epoch=78 / 400 train_loss=6.8596 valid_loss=7.5303 stale=0 time=23.49m eta=7637.0m
[2024-06-19 16:29:17,720] INFO: Initiating epoch #79 train run on device rank=0
[2024-06-19 16:51:06,577] INFO: Initiating epoch #79 valid run on device rank=0
[2024-06-19 16:52:49,633] INFO: Rank 0: epoch=79 / 400 train_loss=6.8503 valid_loss=7.5300 stale=0 time=23.53m eta=7612.5m
[2024-06-19 16:52:50,715] INFO: Initiating epoch #80 train run on device rank=0
[2024-06-19 17:14:38,818] INFO: Initiating epoch #80 valid run on device rank=0
[2024-06-19 17:16:22,632] INFO: Rank 0: epoch=80 / 400 train_loss=6.8410 valid_loss=7.5294 stale=0 time=23.53m eta=7588.2m
[2024-06-19 17:16:23,172] INFO: Initiating epoch #81 train run on device rank=0
[2024-06-19 17:38:10,638] INFO: Initiating epoch #81 valid run on device rank=0
[2024-06-19 17:39:54,743] INFO: Rank 0: epoch=81 / 400 train_loss=6.8316 valid_loss=7.5292 stale=0 time=23.53m eta=7563.8m
[2024-06-19 17:39:55,494] INFO: Initiating epoch #82 train run on device rank=0
[2024-06-19 18:01:42,178] INFO: Initiating epoch #82 valid run on device rank=0
[2024-06-19 18:03:26,335] INFO: Rank 0: epoch=82 / 400 train_loss=6.8223 valid_loss=7.5278 stale=0 time=23.51m eta=7539.3m
[2024-06-19 18:03:27,309] INFO: Initiating epoch #83 train run on device rank=0
[2024-06-19 18:25:13,586] INFO: Initiating epoch #83 valid run on device rank=0
[2024-06-19 18:26:55,855] INFO: Rank 0: epoch=83 / 400 train_loss=6.8131 valid_loss=7.5279 stale=1 time=23.48m eta=7514.8m
[2024-06-19 18:26:56,674] INFO: Initiating epoch #84 train run on device rank=0
[2024-06-19 18:48:43,218] INFO: Initiating epoch #84 valid run on device rank=0
[2024-06-19 18:50:26,247] INFO: Rank 0: epoch=84 / 400 train_loss=6.8039 valid_loss=7.5283 stale=2 time=23.49m eta=7490.3m
[2024-06-19 18:50:27,808] INFO: Initiating epoch #85 train run on device rank=0
[2024-06-19 19:12:14,243] INFO: Initiating epoch #85 valid run on device rank=0
[2024-06-19 19:14:19,022] INFO: Rank 0: epoch=85 / 400 train_loss=6.7947 valid_loss=7.5277 stale=0 time=23.85m eta=7467.3m
[2024-06-19 19:14:19,726] INFO: Initiating epoch #86 train run on device rank=0
[2024-06-19 19:36:06,786] INFO: Initiating epoch #86 valid run on device rank=0
[2024-06-19 19:37:55,822] INFO: Rank 0: epoch=86 / 400 train_loss=6.7859 valid_loss=7.5285 stale=1 time=23.6m eta=7443.2m
[2024-06-19 19:37:57,706] INFO: Initiating epoch #87 train run on device rank=0
[2024-06-19 19:59:45,530] INFO: Initiating epoch #87 valid run on device rank=0
[2024-06-19 20:01:26,066] INFO: Rank 0: epoch=87 / 400 train_loss=6.7769 valid_loss=7.5279 stale=2 time=23.47m eta=7418.8m
[2024-06-19 20:01:27,541] INFO: Initiating epoch #88 train run on device rank=0
[2024-06-19 20:23:15,712] INFO: Initiating epoch #88 valid run on device rank=0
[2024-06-19 20:26:25,928] INFO: Rank 0: epoch=88 / 400 train_loss=6.7681 valid_loss=7.5290 stale=3 time=24.97m eta=7399.7m
[2024-06-19 20:26:27,700] INFO: Initiating epoch #89 train run on device rank=0
[2024-06-19 20:48:15,390] INFO: Initiating epoch #89 valid run on device rank=0
[2024-06-19 20:49:57,674] INFO: Rank 0: epoch=89 / 400 train_loss=6.7592 valid_loss=7.5289 stale=4 time=23.5m eta=7375.3m
[2024-06-19 20:49:58,790] INFO: Initiating epoch #90 train run on device rank=0
[2024-06-19 21:11:46,662] INFO: Initiating epoch #90 valid run on device rank=0
[2024-06-19 21:13:27,640] INFO: Rank 0: epoch=90 / 400 train_loss=6.7503 valid_loss=7.5283 stale=5 time=23.48m eta=7350.9m
[2024-06-19 21:13:28,463] INFO: Initiating epoch #91 train run on device rank=0
[2024-06-19 21:35:16,080] INFO: Initiating epoch #91 valid run on device rank=0
[2024-06-19 21:37:01,971] INFO: Rank 0: epoch=91 / 400 train_loss=6.7415 valid_loss=7.5291 stale=6 time=23.56m eta=7326.7m
[2024-06-19 21:37:02,789] INFO: Initiating epoch #92 train run on device rank=0
[2024-06-19 21:58:50,873] INFO: Initiating epoch #92 valid run on device rank=0
[2024-06-19 22:01:32,549] INFO: Rank 0: epoch=92 / 400 train_loss=6.7326 valid_loss=7.5298 stale=7 time=24.5m eta=7305.7m
[2024-06-19 22:01:46,866] INFO: Initiating epoch #93 train run on device rank=0
[2024-06-19 22:23:34,718] INFO: Initiating epoch #93 valid run on device rank=0
[2024-06-19 22:27:28,548] INFO: Rank 0: epoch=93 / 400 train_loss=6.7238 valid_loss=7.5304 stale=8 time=25.69m eta=7289.2m
[2024-06-19 22:27:43,930] INFO: Initiating epoch #94 train run on device rank=0
[2024-06-19 22:49:31,424] INFO: Initiating epoch #94 valid run on device rank=0
[2024-06-19 22:53:12,481] INFO: Rank 0: epoch=94 / 400 train_loss=6.7151 valid_loss=7.5311 stale=9 time=25.48m eta=7272.0m
[2024-06-19 22:53:20,785] INFO: Initiating epoch #95 train run on device rank=0
[2024-06-19 23:15:08,439] INFO: Initiating epoch #95 valid run on device rank=0
[2024-06-19 23:18:43,584] INFO: Rank 0: epoch=95 / 400 train_loss=6.7062 valid_loss=7.5314 stale=10 time=25.38m eta=7253.8m
[2024-06-19 23:18:54,554] INFO: Initiating epoch #96 train run on device rank=0
[2024-06-19 23:40:41,965] INFO: Initiating epoch #96 valid run on device rank=0
[2024-06-19 23:46:21,333] INFO: Rank 0: epoch=96 / 400 train_loss=6.6977 valid_loss=7.5320 stale=11 time=27.45m eta=7242.2m
[2024-06-19 23:46:23,729] INFO: Initiating epoch #97 train run on device rank=0
[2024-06-20 00:08:11,252] INFO: Initiating epoch #97 valid run on device rank=0
[2024-06-20 00:15:23,330] INFO: Rank 0: epoch=97 / 400 train_loss=6.6891 valid_loss=7.5332 stale=12 time=28.99m eta=7234.7m
[2024-06-20 00:16:26,463] INFO: Initiating epoch #98 train run on device rank=0
[2024-06-20 00:38:13,924] INFO: Initiating epoch #98 valid run on device rank=0
[2024-06-20 00:42:03,219] INFO: Rank 0: epoch=98 / 400 train_loss=6.6808 valid_loss=7.5336 stale=13 time=25.61m eta=7219.4m
[2024-06-20 00:42:04,716] INFO: Initiating epoch #99 train run on device rank=0
[2024-06-20 01:03:52,075] INFO: Initiating epoch #99 valid run on device rank=0
[2024-06-20 01:10:34,079] INFO: Rank 0: epoch=99 / 400 train_loss=6.6721 valid_loss=7.5340 stale=14 time=28.49m eta=7209.5m
[2024-06-20 01:10:54,302] INFO: Initiating epoch #100 train run on device rank=0
[2024-06-20 01:32:40,746] INFO: Initiating epoch #100 valid run on device rank=0
[2024-06-20 01:35:13,269] INFO: Rank 0: epoch=100 / 400 train_loss=6.6636 valid_loss=7.5341 stale=15 time=24.32m eta=7187.7m
[2024-06-20 01:36:06,197] INFO: Initiating epoch #101 train run on device rank=0
[2024-06-20 01:57:52,148] INFO: Initiating epoch #101 valid run on device rank=0
[2024-06-20 01:59:31,565] INFO: Rank 0: epoch=101 / 400 train_loss=6.6552 valid_loss=7.5358 stale=16 time=23.42m eta=7164.7m
[2024-06-20 01:59:32,474] INFO: Initiating epoch #102 train run on device rank=0
[2024-06-20 02:21:20,733] INFO: Initiating epoch #102 valid run on device rank=0
[2024-06-20 02:23:00,836] INFO: Rank 0: epoch=102 / 400 train_loss=6.6469 valid_loss=7.5367 stale=17 time=23.47m eta=7139.4m
[2024-06-20 02:23:01,728] INFO: Initiating epoch #103 train run on device rank=0
[2024-06-20 02:44:49,388] INFO: Initiating epoch #103 valid run on device rank=0
[2024-06-20 02:46:29,183] INFO: Rank 0: epoch=103 / 400 train_loss=6.6384 valid_loss=7.5376 stale=18 time=23.46m eta=7114.0m
[2024-06-20 02:46:29,830] INFO: Initiating epoch #104 train run on device rank=0
[2024-06-20 03:08:18,070] INFO: Initiating epoch #104 valid run on device rank=0
[2024-06-20 03:09:58,595] INFO: Rank 0: epoch=104 / 400 train_loss=6.6302 valid_loss=7.5390 stale=19 time=23.48m eta=7088.7m
[2024-06-20 03:09:59,252] INFO: Initiating epoch #105 train run on device rank=0
[2024-06-20 03:31:47,404] INFO: Initiating epoch #105 valid run on device rank=0
[2024-06-20 03:33:26,315] INFO: Rank 0: epoch=105 / 400 train_loss=6.6218 valid_loss=7.5410 stale=20 time=23.45m eta=7063.4m
[2024-06-20 03:33:26,968] INFO: Initiating epoch #106 train run on device rank=0
[2024-06-20 03:55:15,105] INFO: Initiating epoch #106 valid run on device rank=0
[2024-06-20 03:56:54,164] INFO: Done with training. Total training time on device 0 is 2537.568min