[2024-06-14 08:32:08,247] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus
[2024-06-14 08:32:08,352] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-14 08:32:08,352] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-14 08:32:08,352] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-14 08:32:08,352] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-14 08:32:13,044] INFO: using dtype=torch.float32
[2024-06-14 08:32:14,016] INFO: using attention_type=math
[2024-06-14 08:32:14,027] INFO: using attention_type=math
[2024-06-14 08:32:14,038] INFO: using attention_type=math
[2024-06-14 08:32:14,051] INFO: using attention_type=math
[2024-06-14 08:32:14,062] INFO: using attention_type=math
[2024-06-14 08:32:14,073] INFO: using attention_type=math
[2024-06-14 08:32:17,637] INFO: mlpf_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'math'}
[2024-06-14 08:32:17,638] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/best_weights.pth
[2024-06-14 08:32:18,967] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-06-14 08:32:18,968] INFO: Backbone Trainable parameters: 0
[2024-06-14 08:32:18,968] INFO: Backbone Non-trainable parameters: 11671568
[2024-06-14 08:32:18,968] INFO: Backbone Total parameters: 11671568
[2024-06-14 08:32:18,972] INFO:
Modules  Trainable parameters  Non-tranable parameters
module.nn0_id.0.weight  0  8704
module.nn0_id.0.bias  0  512
module.nn0_id.2.weight  0  512
module.nn0_id.2.bias  0  512
module.nn0_id.4.weight  0  262144
module.nn0_id.4.bias  0  512
module.nn0_reg.0.weight  0  8704
module.nn0_reg.0.bias  0  512
module.nn0_reg.2.weight  0  512
module.nn0_reg.2.bias  0  512
module.nn0_reg.4.weight  0  262144
module.nn0_reg.4.bias  0  512
module.conv_id.0.mha.in_proj_weight  0  786432
module.conv_id.0.mha.in_proj_bias  0  1536
module.conv_id.0.mha.out_proj.weight  0  262144
module.conv_id.0.mha.out_proj.bias  0  512
module.conv_id.0.norm0.weight  0  512
module.conv_id.0.norm0.bias  0  512
module.conv_id.0.norm1.weight  0  512
module.conv_id.0.norm1.bias  0  512
module.conv_id.0.seq.0.weight  0  262144
module.conv_id.0.seq.0.bias  0  512
module.conv_id.0.seq.2.weight  0  262144
module.conv_id.0.seq.2.bias  0  512
module.conv_id.1.mha.in_proj_weight  0  786432
module.conv_id.1.mha.in_proj_bias  0  1536
module.conv_id.1.mha.out_proj.weight  0  262144
module.conv_id.1.mha.out_proj.bias  0  512
module.conv_id.1.norm0.weight  0  512
module.conv_id.1.norm0.bias  0  512
module.conv_id.1.norm1.weight  0  512
module.conv_id.1.norm1.bias  0  512
module.conv_id.1.seq.0.weight  0  262144
module.conv_id.1.seq.0.bias  0  512
module.conv_id.1.seq.2.weight  0  262144
module.conv_id.1.seq.2.bias  0  512
module.conv_id.2.mha.in_proj_weight  0  786432
module.conv_id.2.mha.in_proj_bias  0  1536
module.conv_id.2.mha.out_proj.weight  0  262144
module.conv_id.2.mha.out_proj.bias  0  512
module.conv_id.2.norm0.weight  0  512
module.conv_id.2.norm0.bias  0  512
module.conv_id.2.norm1.weight  0  512
module.conv_id.2.norm1.bias  0  512
module.conv_id.2.seq.0.weight  0  262144
module.conv_id.2.seq.0.bias  0  512
module.conv_id.2.seq.2.weight  0  262144
module.conv_id.2.seq.2.bias  0  512
module.conv_reg.0.mha.in_proj_weight  0  786432
module.conv_reg.0.mha.in_proj_bias  0  1536
module.conv_reg.0.mha.out_proj.weight  0  262144
module.conv_reg.0.mha.out_proj.bias  0  512
module.conv_reg.0.norm0.weight  0  512
module.conv_reg.0.norm0.bias  0  512
module.conv_reg.0.norm1.weight  0  512
module.conv_reg.0.norm1.bias  0  512
module.conv_reg.0.seq.0.weight  0  262144
module.conv_reg.0.seq.0.bias  0  512
module.conv_reg.0.seq.2.weight  0  262144
module.conv_reg.0.seq.2.bias  0  512
module.conv_reg.1.mha.in_proj_weight  0  786432
module.conv_reg.1.mha.in_proj_bias  0  1536
module.conv_reg.1.mha.out_proj.weight  0  262144
module.conv_reg.1.mha.out_proj.bias  0  512
module.conv_reg.1.norm0.weight  0  512
module.conv_reg.1.norm0.bias  0  512
module.conv_reg.1.norm1.weight  0  512
module.conv_reg.1.norm1.bias  0  512
module.conv_reg.1.seq.0.weight  0  262144
module.conv_reg.1.seq.0.bias  0  512
module.conv_reg.1.seq.2.weight  0  262144
module.conv_reg.1.seq.2.bias  0  512
module.conv_reg.2.mha.in_proj_weight  0  786432
module.conv_reg.2.mha.in_proj_bias  0  1536
module.conv_reg.2.mha.out_proj.weight  0  262144
module.conv_reg.2.mha.out_proj.bias  0  512
module.conv_reg.2.norm0.weight  0  512
module.conv_reg.2.norm0.bias  0  512
module.conv_reg.2.norm1.weight  0  512
module.conv_reg.2.norm1.bias  0  512
module.conv_reg.2.seq.0.weight  0  262144
module.conv_reg.2.seq.0.bias  0  512
module.conv_reg.2.seq.2.weight  0  262144
module.conv_reg.2.seq.2.bias  0  512
module.nn_id.0.weight  0  270848
module.nn_id.0.bias  0  512
module.nn_id.2.weight  0  512
module.nn_id.2.bias  0  512
module.nn_id.4.weight  0  3072
module.nn_id.4.bias  0  6
module.nn_pt.nn.0.weight  0  273920
module.nn_pt.nn.0.bias  0  512
module.nn_pt.nn.2.weight  0  512
module.nn_pt.nn.2.bias  0  512
module.nn_pt.nn.4.weight  0  1024
module.nn_pt.nn.4.bias  0  2
module.nn_eta.nn.0.weight  0  273920
module.nn_eta.nn.0.bias  0  512
module.nn_eta.nn.2.weight  0  512
module.nn_eta.nn.2.bias  0  512
module.nn_eta.nn.4.weight  0  1024
module.nn_eta.nn.4.bias  0  2
module.nn_sin_phi.nn.0.weight  0  273920
module.nn_sin_phi.nn.0.bias  0  512
module.nn_sin_phi.nn.2.weight  0  512
module.nn_sin_phi.nn.2.bias  0  512
module.nn_sin_phi.nn.4.weight  0  1024
module.nn_sin_phi.nn.4.bias  0  2
module.nn_cos_phi.nn.0.weight  0  273920
module.nn_cos_phi.nn.0.bias  0  512
module.nn_cos_phi.nn.2.weight  0  512
module.nn_cos_phi.nn.2.bias  0  512
module.nn_cos_phi.nn.4.weight  0  1024
module.nn_cos_phi.nn.4.bias  0  2
module.nn_energy.nn.0.weight  0  273920
module.nn_energy.nn.0.bias  0  512
module.nn_energy.nn.2.weight  0  512
module.nn_energy.nn.2.bias  0  512
module.nn_energy.nn.4.weight  0  1024
module.nn_energy.nn.4.bias  0  2
[2024-06-14 08:32:19,029] INFO: DistributedDataParallel(
  (module): DeepMET(
    (nn): Sequential(
      (0): Linear(in_features=11, out_features=256, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0, inplace=False)
      (4): Linear(in_features=256, out_features=2, bias=True)
    )
  )
)
[2024-06-14 08:32:19,029] INFO: DeepMET Trainable parameters: 4098
[2024-06-14 08:32:19,029] INFO: DeepMET Non-trainable parameters: 0
[2024-06-14 08:32:19,029] INFO: DeepMET Total parameters: 4098
[2024-06-14 08:32:19,030] INFO:
Modules  Trainable parameters  Non-tranable parameters
module.nn.0.weight  2816  0
module.nn.0.bias  256  0
module.nn.2.weight  256  0
module.nn.2.bias  256  0
module.nn.4.weight  512  0
module.nn.4.bias  2  0
[2024-06-14 08:32:19,052] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_MLPFCands_20240614_083208_086253
[2024-06-14 08:32:19,052] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_MLPFCands_20240614_083208_086253
[2024-06-14 08:32:19,265] INFO: train_dataset: clic_edm_ttbar_pf, 800800
[2024-06-14 08:32:19,437] INFO: valid_dataset: clic_edm_ttbar_pf, 200200
[2024-06-14 08:32:19,515] INFO: Initiating epoch #1 train run on device rank=0
[2024-06-14 08:37:04,027] INFO: Initiating epoch #1 valid run on device rank=0
[2024-06-14 08:38:11,230] INFO: Rank 0: epoch=1 / 400 train_loss=8.3462 valid_loss=8.2710 stale=0 time=5.86m eta=2338.9m
[2024-06-14 08:38:11,241] INFO: Initiating epoch #2 train run on device rank=0
[2024-06-14 08:42:51,125] INFO: Initiating epoch #2 valid run on device rank=0
[2024-06-14 08:43:58,235] INFO: Rank 0: epoch=2 / 400 train_loss=7.9970 valid_loss=8.2375 stale=0 time=5.78m eta=2317.4m
[2024-06-14 08:43:58,272] INFO: Initiating epoch #3 train run on device rank=0
[2024-06-14 08:48:38,620] INFO: Initiating epoch #3 valid run on device rank=0
[2024-06-14 08:49:45,951] INFO: Rank 0: epoch=3 / 400 train_loss=7.9659 valid_loss=8.2167 stale=0 time=5.79m eta=2308.0m
[2024-06-14 08:49:46,003] INFO: Initiating epoch #4 train run on device rank=0
[2024-06-14 08:54:25,778] INFO: Initiating epoch #4 valid run on device rank=0
[2024-06-14 08:55:33,086] INFO: Rank 0: epoch=4 / 400 train_loss=7.9433 valid_loss=8.2007 stale=0 time=5.78m eta=2299.4m
[2024-06-14 08:55:33,127] INFO: Initiating epoch #5 train run on device rank=0
[2024-06-14 09:00:14,020] INFO: Initiating epoch #5 valid run on device rank=0
[2024-06-14 09:01:21,079] INFO: Rank 0: epoch=5 / 400 train_loss=7.9253 valid_loss=8.1879 stale=0 time=5.8m eta=2293.1m
[2024-06-14 09:01:21,117] INFO: Initiating epoch #6 train run on device rank=0
[2024-06-14 09:06:01,085] INFO: Initiating epoch #6 valid run on device rank=0
[2024-06-14 09:07:07,823] INFO: Rank 0: epoch=6 / 400 train_loss=7.9101 valid_loss=8.1778 stale=0 time=5.78m eta=2285.5m
[2024-06-14 09:07:07,861] INFO: Initiating epoch #7 train run on device rank=0
[2024-06-14 09:11:48,021] INFO: Initiating epoch #7 valid run on device rank=0
[2024-06-14 09:12:55,093] INFO: Rank 0: epoch=7 / 400 train_loss=7.8981 valid_loss=8.1698 stale=0 time=5.79m eta=2279.0m
[2024-06-14 09:12:55,133] INFO: Initiating epoch #8 train run on device rank=0
[2024-06-14 09:17:34,955] INFO: Initiating epoch #8 valid run on device rank=0
[2024-06-14 09:18:41,792] INFO: Rank 0: epoch=8 / 400 train_loss=7.8883 valid_loss=8.1621 stale=0 time=5.78m eta=2272.2m
[2024-06-14 09:18:41,853] INFO: Initiating epoch #9 train run on device rank=0
[2024-06-14 09:23:21,619] INFO: Initiating epoch #9 valid run on device rank=0
[2024-06-14 09:24:29,152] INFO: Rank 0: epoch=9 / 400 train_loss=7.8801 valid_loss=8.1546 stale=0 time=5.79m eta=2266.1m
[2024-06-14 09:24:29,303] INFO: Initiating epoch #10 train run on device rank=0
[2024-06-14 09:29:09,283] INFO: Initiating epoch #10 valid run on device rank=0
[2024-06-14 09:30:16,145] INFO: Rank 0: epoch=10 / 400 train_loss=7.8730 valid_loss=8.1478 stale=0 time=5.78m eta=2259.8m
[2024-06-14 09:30:16,193] INFO: Initiating epoch #11 train run on device rank=0
[2024-06-14 09:34:56,393] INFO: Initiating epoch #11 valid run on device rank=0
[2024-06-14 09:36:03,408] INFO: Rank 0: epoch=11 / 400 train_loss=7.8668 valid_loss=8.1414 stale=0 time=5.79m eta=2253.8m
[2024-06-14 09:36:03,451] INFO: Initiating epoch #12 train run on device rank=0
[2024-06-14 09:40:42,983] INFO: Initiating epoch #12 valid run on device rank=0
[2024-06-14 09:41:49,317] INFO: Rank 0: epoch=12 / 400 train_loss=7.8611 valid_loss=8.1353 stale=0 time=5.76m eta=2247.1m
[2024-06-14 09:41:49,353] INFO: Initiating epoch #13 train run on device rank=0
[2024-06-14 09:46:28,792] INFO: Initiating epoch #13 valid run on device rank=0
[2024-06-14 09:47:35,724] INFO: Rank 0: epoch=13 / 400 train_loss=7.8559 valid_loss=8.1296 stale=0 time=5.77m eta=2240.7m
[2024-06-14 09:47:35,788] INFO: Initiating epoch #14 train run on device rank=0
[2024-06-14 09:52:15,528] INFO: Initiating epoch #14 valid run on device rank=0
[2024-06-14 09:53:22,299] INFO: Rank 0: epoch=14 / 400 train_loss=7.8510 valid_loss=8.1243 stale=0 time=5.78m eta=2234.6m
[2024-06-14 09:53:22,350] INFO: Initiating epoch #15 train run on device rank=0
[2024-06-14 09:58:02,394] INFO: Initiating epoch #15 valid run on device rank=0
[2024-06-14 09:59:08,984] INFO: Rank 0: epoch=15 / 400 train_loss=7.8464 valid_loss=8.1188 stale=0 time=5.78m eta=2228.5m
[2024-06-14 09:59:09,046] INFO: Initiating epoch #16 train run on device rank=0
[2024-06-14 10:03:48,769] INFO: Initiating epoch #16 valid run on device rank=0
[2024-06-14 10:04:55,315] INFO: Rank 0: epoch=16 / 400 train_loss=7.8420 valid_loss=8.1135 stale=0 time=5.77m eta=2222.3m
[2024-06-14 10:04:55,377] INFO: Initiating epoch #17 train run on device rank=0
[2024-06-14 10:09:35,219] INFO: Initiating epoch #17 valid run on device rank=0
[2024-06-14 10:10:42,385] INFO: Rank 0: epoch=17 / 400 train_loss=7.8378 valid_loss=8.1085 stale=0 time=5.78m eta=2216.5m
[2024-06-14 10:10:42,435] INFO: Initiating epoch #18 train run on device rank=0
[2024-06-14 10:15:21,767] INFO: Initiating epoch #18 valid run on device rank=0
[2024-06-14 10:16:28,867] INFO: Rank 0: epoch=18 / 400 train_loss=7.8337 valid_loss=8.1037 stale=0 time=5.77m eta=2210.4m
[2024-06-14 10:16:28,913] INFO: Initiating epoch #19 train run on device rank=0
[2024-06-14 10:21:08,784] INFO: Initiating epoch #19 valid run on device rank=0
[2024-06-14 10:22:16,039] INFO: Rank 0: epoch=19 / 400 train_loss=7.8297 valid_loss=8.0990 stale=0 time=5.79m eta=2204.6m
[2024-06-14 10:22:16,157] INFO: Initiating epoch #20 train run on device rank=0
[2024-06-14 10:26:55,620] INFO: Initiating epoch #20 valid run on device rank=0
[2024-06-14 10:28:02,563] INFO: Rank 0: epoch=20 / 400 train_loss=7.8258 valid_loss=8.0947 stale=0 time=5.77m eta=2198.6m
[2024-06-14 10:28:02,607] INFO: Initiating epoch #21 train run on device rank=0
[2024-06-14 10:32:42,848] INFO: Initiating epoch #21 valid run on device rank=0
[2024-06-14 10:33:49,834] INFO: Rank 0: epoch=21 / 400 train_loss=7.8219 valid_loss=8.0906 stale=0 time=5.79m eta=2192.9m
[2024-06-14 10:33:49,873] INFO: Initiating epoch #22 train run on device rank=0
[2024-06-14 10:38:29,472] INFO: Initiating epoch #22 valid run on device rank=0
[2024-06-14 10:39:36,037] INFO: Rank 0: epoch=22 / 400 train_loss=7.8182 valid_loss=8.0867 stale=0 time=5.77m eta=2186.8m
[2024-06-14 10:39:36,093] INFO: Initiating epoch #23 train run on device rank=0
[2024-06-14 10:44:15,714] INFO: Initiating epoch #23 valid run on device rank=0
[2024-06-14 10:45:21,861] INFO: Rank 0: epoch=23 / 400 train_loss=7.8146 valid_loss=8.0826 stale=0 time=5.76m eta=2180.7m
[2024-06-14 10:45:21,914] INFO: Initiating epoch #24 train run on device rank=0
[2024-06-14 10:50:01,518] INFO: Initiating epoch #24 valid run on device rank=0
[2024-06-14 10:51:07,857] INFO: Rank 0: epoch=24 / 400 train_loss=7.8109 valid_loss=8.0788 stale=0 time=5.77m eta=2174.6m
[2024-06-14 10:51:07,901] INFO: Initiating epoch #25 train run on device rank=0
[2024-06-14 10:55:47,169] INFO: Initiating epoch #25 valid run on device rank=0
[2024-06-14 10:56:53,911] INFO: Rank 0: epoch=25 / 400 train_loss=7.8075 valid_loss=8.0748 stale=0 time=5.77m eta=2168.6m
[2024-06-14 10:56:53,953] INFO: Initiating epoch #26 train run on device rank=0
[2024-06-14 11:01:33,115] INFO: Initiating epoch #26 valid run on device rank=0
[2024-06-14 11:02:39,700] INFO: Rank 0: epoch=26 / 400 train_loss=7.8040 valid_loss=8.0709 stale=0 time=5.76m eta=2162.5m
[2024-06-14 11:02:39,750] INFO: Initiating epoch #27 train run on device rank=0
[2024-06-14 11:07:18,876] INFO: Initiating epoch #27 valid run on device rank=0
[2024-06-14 11:08:25,488] INFO: Rank 0: epoch=27 / 400 train_loss=7.8006 valid_loss=8.0672 stale=0 time=5.76m eta=2156.5m
[2024-06-14 11:08:25,544] INFO: Initiating epoch #28 train run on device rank=0
[2024-06-14 11:13:04,770] INFO: Initiating epoch #28 valid run on device rank=0
[2024-06-14 11:14:11,221] INFO: Rank 0: epoch=28 / 400 train_loss=7.7973 valid_loss=8.0636 stale=0 time=5.76m eta=2150.4m
[2024-06-14 11:14:11,284] INFO: Initiating epoch #29 train run on device rank=0
[2024-06-14 11:18:50,198] INFO: Initiating epoch #29 valid run on device rank=0
[2024-06-14 11:19:56,843] INFO: Rank 0: epoch=29 / 400 train_loss=7.7940 valid_loss=8.0600 stale=0 time=5.76m eta=2144.4m
[2024-06-14 11:19:56,885] INFO: Initiating epoch #30 train run on device rank=0
[2024-06-14 11:24:36,037] INFO: Initiating epoch #30 valid run on device rank=0
[2024-06-14 11:25:42,673] INFO: Rank 0: epoch=30 / 400 train_loss=7.7906 valid_loss=8.0566 stale=0 time=5.76m eta=2138.4m
[2024-06-14 11:25:42,762] INFO: Initiating epoch #31 train run on device rank=0
[2024-06-14 11:30:21,865] INFO: Initiating epoch #31 valid run on device rank=0
[2024-06-14 11:31:28,516] INFO: Rank 0: epoch=31 / 400 train_loss=7.7873 valid_loss=8.0531 stale=0 time=5.76m eta=2132.5m
[2024-06-14 11:31:28,658] INFO: Initiating epoch #32 train run on device rank=0
[2024-06-14 11:36:08,032] INFO: Initiating epoch #32 valid run on device rank=0
[2024-06-14 11:37:14,610] INFO: Rank 0: epoch=32 / 400 train_loss=7.7841 valid_loss=8.0498 stale=0 time=5.77m eta=2126.6m
[2024-06-14 11:37:14,659] INFO: Initiating epoch #33 train run on device rank=0
[2024-06-14 11:41:53,967] INFO: Initiating epoch #33 valid run on device rank=0
[2024-06-14 11:43:00,445] INFO: Rank 0: epoch=33 / 400 train_loss=7.7809 valid_loss=8.0466 stale=0 time=5.76m eta=2120.6m
[2024-06-14 11:43:00,556] INFO: Initiating epoch #34 train run on device rank=0
[2024-06-14 11:47:39,943] INFO: Initiating epoch #34 valid run on device rank=0
[2024-06-14 11:48:46,493] INFO: Rank 0: epoch=34 / 400 train_loss=7.7777 valid_loss=8.0434 stale=0 time=5.77m eta=2114.7m
[2024-06-14 11:48:46,557] INFO: Initiating epoch #35 train run on device rank=0
[2024-06-14 11:53:26,051] INFO: Initiating epoch #35 valid run on device rank=0
[2024-06-14 11:54:32,869] INFO: Rank 0: epoch=35 / 400 train_loss=7.7745 valid_loss=8.0403 stale=0 time=5.77m eta=2108.9m
[2024-06-14 11:54:32,920] INFO: Initiating epoch #36 train run on device rank=0
[2024-06-14 11:59:12,239] INFO: Initiating epoch #36 valid run on device rank=0
[2024-06-14 12:00:19,417] INFO: Rank 0: epoch=36 / 400 train_loss=7.7715 valid_loss=8.0373 stale=0 time=5.77m eta=2103.1m
[2024-06-14 12:00:19,473] INFO: Initiating epoch #37 train run on device rank=0
[2024-06-14 12:04:58,611] INFO: Initiating epoch #37 valid run on device rank=0
[2024-06-14 12:06:05,116] INFO: Rank 0: epoch=37 / 400 train_loss=7.7684 valid_loss=8.0343 stale=0 time=5.76m eta=2097.2m
[2024-06-14 12:06:05,181] INFO: Initiating epoch #38 train run on device rank=0
[2024-06-14 12:10:44,330] INFO: Initiating epoch #38 valid run on device rank=0
[2024-06-14 12:11:51,577] INFO: Rank 0: epoch=38 / 400 train_loss=7.7654 valid_loss=8.0315 stale=0 time=5.77m eta=2091.4m
[2024-06-14 12:11:51,609] INFO: Initiating epoch #39 train run on device rank=0
[2024-06-14 12:16:31,158] INFO: Initiating epoch #39 valid run on device rank=0
[2024-06-14 12:17:37,945] INFO: Rank 0: epoch=39 / 400 train_loss=7.7624 valid_loss=8.0287 stale=0 time=5.77m eta=2085.5m
[2024-06-14 12:17:37,995] INFO: Initiating epoch #40 train run on device rank=0
[2024-06-14 12:22:17,546] INFO: Initiating epoch #40 valid run on device rank=0
[2024-06-14 12:23:24,908] INFO: Rank 0: epoch=40 / 400 train_loss=7.7595 valid_loss=8.0261 stale=0 time=5.78m eta=2079.8m
[2024-06-14 12:23:24,949] INFO: Initiating epoch #41 train run on device rank=0
[2024-06-14 12:28:04,552] INFO: Initiating epoch #41 valid run on device rank=0
[2024-06-14 12:29:11,635] INFO: Rank 0: epoch=41 / 400 train_loss=7.7566 valid_loss=8.0235 stale=0 time=5.78m eta=2074.0m
[2024-06-14 12:29:11,685] INFO: Initiating epoch #42 train run on device rank=0
[2024-06-14 12:33:51,443] INFO: Initiating epoch #42 valid run on device rank=0
[2024-06-14 12:34:58,414] INFO: Rank 0: epoch=42 / 400 train_loss=7.7538 valid_loss=8.0211 stale=0 time=5.78m eta=2068.3m
[2024-06-14 12:34:58,466] INFO: Initiating epoch #43 train run on device rank=0
[2024-06-14 12:39:37,846] INFO: Initiating epoch #43 valid run on device rank=0
[2024-06-14 12:40:44,615] INFO: Rank 0: epoch=43 / 400 train_loss=7.7510 valid_loss=8.0187 stale=0 time=5.77m eta=2062.4m
[2024-06-14 12:40:44,654] INFO: Initiating epoch #44 train run on device rank=0
[2024-06-14 12:45:23,987] INFO: Initiating epoch #44 valid run on device rank=0
[2024-06-14 12:46:31,011] INFO: Rank 0: epoch=44 / 400 train_loss=7.7485 valid_loss=8.0165 stale=0 time=5.77m eta=2056.6m
[2024-06-14 12:46:31,037] INFO: Initiating epoch #45 train run on device rank=0
[2024-06-14 12:51:10,269] INFO: Initiating epoch #45 valid run on device rank=0
[2024-06-14 12:52:17,282] INFO: Rank 0: epoch=45 / 400 train_loss=7.7460 valid_loss=8.0144 stale=0 time=5.77m eta=2050.8m
[2024-06-14 12:52:17,337] INFO: Initiating epoch #46 train run on device rank=0
[2024-06-14 12:56:56,702] INFO: Initiating epoch #46 valid run on device rank=0
[2024-06-14 12:58:03,927] INFO: Rank 0: epoch=46 / 400 train_loss=7.7437 valid_loss=8.0125 stale=0 time=5.78m eta=2045.0m
[2024-06-14 12:58:03,985] INFO: Initiating epoch #47 train run on device rank=0
[2024-06-14 13:02:43,558] INFO: Initiating epoch #47 valid run on device rank=0
[2024-06-14 13:03:50,675] INFO: Rank 0: epoch=47 / 400 train_loss=7.7413 valid_loss=8.0105 stale=0 time=5.78m eta=2039.3m
[2024-06-14 13:03:50,748] INFO: Initiating epoch #48 train run on device rank=0
[2024-06-14 13:08:29,928] INFO: Initiating epoch #48 valid run on device rank=0
[2024-06-14 13:09:37,086] INFO: Rank 0: epoch=48 / 400 train_loss=7.7392 valid_loss=8.0087 stale=0 time=5.77m eta=2033.5m
[2024-06-14 13:09:37,138] INFO: Initiating epoch #49 train run on device rank=0
[2024-06-14 13:14:16,454] INFO: Initiating epoch #49 valid run on device rank=0
[2024-06-14 13:15:23,727] INFO: Rank 0: epoch=49 / 400 train_loss=7.7369 valid_loss=8.0068 stale=0 time=5.78m eta=2027.7m
[2024-06-14 13:15:23,761] INFO: Initiating epoch #50 train run on device rank=0
[2024-06-14 13:20:02,977] INFO: Initiating epoch #50 valid run on device rank=0
[2024-06-14 13:21:10,592] INFO: Rank 0: epoch=50 / 400 train_loss=7.7350 valid_loss=8.0049 stale=0 time=5.78m eta=2022.0m
[2024-06-14 13:21:10,667] INFO: Initiating epoch #51 train run on device rank=0
[2024-06-14 13:25:50,754] INFO: Initiating epoch #51 valid run on device rank=0
[2024-06-14 13:26:58,335] INFO: Rank 0: epoch=51 / 400 train_loss=7.7330 valid_loss=8.0031 stale=0 time=5.79m eta=2016.3m
[2024-06-14 13:26:58,422] INFO: Initiating epoch #52 train run on device rank=0
[2024-06-14 13:31:34,621] INFO: Initiating epoch #52 valid run on device rank=0
[2024-06-14 13:32:39,878] INFO: Rank 0: epoch=52 / 400 train_loss=7.7309 valid_loss=8.0013 stale=0 time=5.69m eta=2010.0m
[2024-06-14 13:32:39,903] INFO: Initiating epoch #53 train run on device rank=0
[2024-06-14 13:37:15,739] INFO: Initiating epoch #53 valid run on device rank=0
[2024-06-14 13:38:20,997] INFO: Rank 0: epoch=53 / 400 train_loss=7.7292 valid_loss=7.9994 stale=0 time=5.68m eta=2003.6m
[2024-06-14 13:38:21,048] INFO: Initiating epoch #54 train run on device rank=0
[2024-06-14 13:42:56,687] INFO: Initiating epoch #54 valid run on device rank=0
[2024-06-14 13:44:01,966] INFO: Rank 0: epoch=54 / 400 train_loss=7.7272 valid_loss=7.9977 stale=0 time=5.68m eta=1997.2m
[2024-06-14 13:44:02,020] INFO: Initiating epoch #55 train run on device rank=0
[2024-06-14 13:48:37,676] INFO: Initiating epoch #55 valid run on device rank=0
[2024-06-14 13:49:44,179] INFO: Rank 0: epoch=55 / 400 train_loss=7.7256 valid_loss=7.9961 stale=0 time=5.7m eta=1991.0m
[2024-06-14 13:49:44,317] INFO: Initiating epoch #56 train run on device rank=0
[2024-06-14 13:54:20,321] INFO: Initiating epoch #56 valid run on device rank=0
[2024-06-14 13:55:25,787] INFO: Rank 0: epoch=56 / 400 train_loss=7.7239 valid_loss=7.9946 stale=0 time=5.69m eta=1984.8m
[2024-06-14 13:55:25,845] INFO: Initiating epoch #57 train run on device rank=0
[2024-06-14 14:00:01,444] INFO: Initiating epoch #57 valid run on device rank=0
[2024-06-14 14:01:06,859] INFO: Rank 0: epoch=57 / 400 train_loss=7.7224 valid_loss=7.9931 stale=0 time=5.68m eta=1978.5m
[2024-06-14 14:01:06,926] INFO: Initiating epoch #58 train run on device rank=0
[2024-06-14 14:05:42,405] INFO: Initiating epoch #58 valid run on device rank=0
[2024-06-14 14:06:47,722] INFO: Rank 0: epoch=58 / 400 train_loss=7.7208 valid_loss=7.9916 stale=0 time=5.68m eta=1972.2m
[2024-06-14 14:06:47,771] INFO: Initiating epoch #59 train run on device rank=0
[2024-06-14 14:11:23,435] INFO: Initiating epoch #59 valid run on device rank=0
[2024-06-14 14:12:28,414] INFO: Rank 0: epoch=59 / 400 train_loss=7.7193 valid_loss=7.9901 stale=0 time=5.68m eta=1965.9m
[2024-06-14 14:12:28,479] INFO: Initiating epoch #60 train run on device rank=0
[2024-06-14 14:17:04,288] INFO: Initiating epoch #60 valid run on device rank=0
[2024-06-14 14:18:08,727] INFO: Rank 0: epoch=60 / 400 train_loss=7.7178 valid_loss=7.9886 stale=0 time=5.67m eta=1959.6m
[2024-06-14 14:18:08,761] INFO: Initiating epoch #61 train run on device rank=0
[2024-06-14 14:22:44,573] INFO: Initiating epoch #61 valid run on device rank=0
[2024-06-14 14:23:49,614] INFO: Rank 0: epoch=61 / 400 train_loss=7.7165 valid_loss=7.9872 stale=0 time=5.68m eta=1953.4m
[2024-06-14 14:23:49,647] INFO: Initiating epoch #62 train run on device rank=0
[2024-06-14 14:28:25,089] INFO: Initiating epoch #62 valid run on device rank=0
[2024-06-14 14:29:30,488] INFO: Rank 0: epoch=62 / 400 train_loss=7.7149 valid_loss=7.9858 stale=0 time=5.68m eta=1947.2m
[2024-06-14 14:29:30,528] INFO: Initiating epoch #63 train run on device rank=0
[2024-06-14 14:34:05,904] INFO: Initiating epoch #63 valid run on device rank=0
[2024-06-14 14:35:11,123] INFO: Rank 0: epoch=63 / 400 train_loss=7.7138 valid_loss=7.9844 stale=0 time=5.68m eta=1941.0m
[2024-06-14 14:35:11,169] INFO: Initiating epoch #64 train run on device rank=0
[2024-06-14 14:39:46,538] INFO: Initiating epoch #64 valid run on device rank=0
[2024-06-14 14:40:51,678] INFO: Rank 0: epoch=64 / 400 train_loss=7.7123 valid_loss=7.9831 stale=0 time=5.68m eta=1934.8m
[2024-06-14 14:40:51,729] INFO: Initiating epoch #65 train run on device rank=0
[2024-06-14 14:45:27,371] INFO: Initiating epoch #65 valid run on device rank=0
[2024-06-14 14:46:32,863] INFO: Rank 0: epoch=65 / 400 train_loss=7.7113 valid_loss=7.9817 stale=0 time=5.69m eta=1928.7m
[2024-06-14 14:46:32,908] INFO: Initiating epoch #66 train run on device rank=0
[2024-06-14 14:51:08,330] INFO: Initiating epoch #66 valid run on device rank=0
[2024-06-14 14:52:13,715] INFO: Rank 0: epoch=66 / 400 train_loss=7.7098 valid_loss=7.9805 stale=0 time=5.68m eta=1922.5m
[2024-06-14 14:52:13,828] INFO: Initiating epoch #67 train run on device rank=0
[2024-06-14 14:56:49,248] INFO: Initiating epoch #67 valid run on device rank=0
[2024-06-14 14:57:54,387] INFO: Rank 0: epoch=67 / 400 train_loss=7.7088 valid_loss=7.9792 stale=0 time=5.68m eta=1916.4m
[2024-06-14 14:57:54,439] INFO: Initiating epoch #68 train run on device rank=0
[2024-06-14 15:02:30,330] INFO: Initiating epoch #68 valid run on device rank=0
[2024-06-14 15:03:35,577] INFO: Rank 0: epoch=68 / 400 train_loss=7.7074 valid_loss=7.9780 stale=0 time=5.69m eta=1910.3m
[2024-06-14 15:03:35,658] INFO: Initiating epoch #69 train run on device rank=0
[2024-06-14 15:08:11,398] INFO: Initiating epoch #69 valid run on device rank=0
[2024-06-14 15:09:16,229] INFO: Rank 0: epoch=69 / 400 train_loss=7.7064 valid_loss=7.9766 stale=0 time=5.68m eta=1904.2m
[2024-06-14 15:09:16,279] INFO: Initiating epoch #70 train run on device rank=0
[2024-06-14 15:13:51,708] INFO: Initiating epoch #70 valid run on device rank=0
[2024-06-14 15:14:57,596] INFO: Rank 0: epoch=70 / 400 train_loss=7.7050 valid_loss=7.9754 stale=0 time=5.69m eta=1898.1m
[2024-06-14 15:14:57,933] INFO: Initiating epoch #71 train run on device rank=0
[2024-06-14 15:19:32,657] INFO: Initiating epoch #71 valid run on device rank=0
[2024-06-14 15:20:37,434] INFO: Rank 0: epoch=71 / 400 train_loss=7.7038 valid_loss=7.9743 stale=0 time=5.66m eta=1892.0m
[2024-06-14 15:20:37,481] INFO: Initiating epoch #72 train run on device rank=0
[2024-06-14 15:25:12,778] INFO: Initiating epoch #72 valid run on device rank=0
[2024-06-14 15:26:17,447] INFO: Rank 0: epoch=72 / 400 train_loss=7.7029 valid_loss=7.9731 stale=0 time=5.67m eta=1885.8m
[2024-06-14 15:26:17,501] INFO: Initiating epoch #73 train run on device rank=0
[2024-06-14 15:30:52,886] INFO: Initiating epoch #73 valid run on device rank=0
[2024-06-14 15:31:57,299] INFO: Rank 0: epoch=73 / 400 train_loss=7.7018 valid_loss=7.9719 stale=0 time=5.66m eta=1879.7m
[2024-06-14 15:31:57,331] INFO: Initiating epoch #74 train run on device rank=0
[2024-06-14 15:36:32,830] INFO: Initiating epoch #74 valid run on device rank=0
[2024-06-14 15:37:37,489] INFO: Rank 0: epoch=74 / 400 train_loss=7.7006 valid_loss=7.9708 stale=0 time=5.67m eta=1873.6m
[2024-06-14 15:37:37,653] INFO: Initiating epoch #75 train run on device rank=0
[2024-06-14 15:42:13,055] INFO: Initiating epoch #75 valid run on device rank=0
[2024-06-14 15:43:17,490] INFO: Rank 0: epoch=75 / 400 train_loss=7.6996 valid_loss=7.9697 stale=0 time=5.66m eta=1867.5m
[2024-06-14 15:43:17,545] INFO: Initiating epoch #76 train run on device rank=0
[2024-06-14 15:47:52,954] INFO: Initiating epoch #76 valid run on device rank=0
[2024-06-14 15:48:57,789] INFO: Rank 0: epoch=76 / 400 train_loss=7.6985 valid_loss=7.9688 stale=0 time=5.67m eta=1861.5m
[2024-06-14 15:48:57,835] INFO: Initiating epoch #77 train run on device rank=0
[2024-06-14 15:53:33,472] INFO: Initiating epoch #77 valid run on device rank=0
[2024-06-14 15:54:37,454] INFO: Rank 0: epoch=77 / 400 train_loss=7.6975 valid_loss=7.9678 stale=0 time=5.66m eta=1855.4m
[2024-06-14 15:54:37,461] INFO: Initiating epoch #78 train run on device rank=0
[2024-06-14 15:59:12,968] INFO: Initiating epoch #78 valid run on device rank=0
[2024-06-14 16:00:17,414] INFO: Rank 0: epoch=78 / 400 train_loss=7.6967 valid_loss=7.9667 stale=0 time=5.67m eta=1849.3m
[2024-06-14 16:00:17,452] INFO: Initiating epoch #79 train run on device rank=0
[2024-06-14 16:04:52,954] INFO: Initiating epoch #79 valid run on device rank=0
[2024-06-14 16:05:58,479] INFO: Rank 0: epoch=79 / 400 train_loss=7.6957 valid_loss=7.9661 stale=0 time=5.68m eta=1843.3m
[2024-06-14 16:05:58,809] INFO: Initiating epoch #80 train run on device rank=0
[2024-06-14 16:10:33,794] INFO: Initiating epoch #80 valid run on device rank=0
[2024-06-14 16:11:38,306] INFO: Rank 0: epoch=80 / 400 train_loss=7.6949 valid_loss=7.9651 stale=0 time=5.66m eta=1837.3m
[2024-06-14 16:11:38,365] INFO: Initiating epoch #81 train run on device rank=0
[2024-06-14 16:16:13,810] INFO: Initiating epoch #81 valid run on device rank=0
[2024-06-14 16:17:18,346] INFO: Rank 0: epoch=81 / 400 train_loss=7.6939 valid_loss=7.9646 stale=0 time=5.67m eta=1831.2m
[2024-06-14 16:17:18,411] INFO: Initiating epoch #82 train run on device rank=0
[2024-06-14 16:21:54,100] INFO: Initiating epoch #82 valid run on device rank=0
[2024-06-14 16:22:58,664] INFO: Rank 0: epoch=82 / 400 train_loss=7.6931 valid_loss=7.9637 stale=0 time=5.67m eta=1825.2m
[2024-06-14 16:22:58,702] INFO: Initiating epoch #83 train run on device rank=0
[2024-06-14 16:27:34,426] INFO: Initiating epoch #83 valid run on device rank=0
[2024-06-14 16:28:38,964] INFO: Rank 0: epoch=83 / 400 train_loss=7.6922 valid_loss=7.9631 stale=0 time=5.67m eta=1819.2m
[2024-06-14 16:28:39,013] INFO: Initiating epoch #84 train run on device rank=0
[2024-06-14 16:33:14,667] INFO: Initiating epoch #84 valid run on device rank=0
[2024-06-14 16:34:19,456] INFO: Rank 0: epoch=84 / 400 train_loss=7.6914 valid_loss=7.9626 stale=0 time=5.67m eta=1813.2m
[2024-06-14 16:34:19,517] INFO: Initiating epoch #85 train run on device rank=0
[2024-06-14 16:38:54,929] INFO: Initiating epoch #85 valid run on device rank=0
[2024-06-14 16:39:59,355] INFO: Rank 0: epoch=85 / 400 train_loss=7.6905 valid_loss=7.9617 stale=0 time=5.66m eta=1807.2m
[2024-06-14 16:39:59,413] INFO: Initiating epoch #86 train run on device rank=0
[2024-06-14 16:44:34,821] INFO: Initiating epoch #86 valid run on device rank=0
[2024-06-14 16:45:39,235] INFO: Rank 0: epoch=86 / 400 train_loss=7.6896 valid_loss=7.9609 stale=0 time=5.66m eta=1801.2m
[2024-06-14 16:45:39,273] INFO: Initiating epoch #87 train run on device rank=0
[2024-06-14 16:50:14,703] INFO: Initiating epoch #87 valid run on device rank=0
[2024-06-14 16:51:19,104] INFO: Rank 0: epoch=87 / 400 train_loss=7.6887 valid_loss=7.9605 stale=0 time=5.66m eta=1795.2m
[2024-06-14 16:51:19,138] INFO: Initiating epoch #88 train run on device rank=0
[2024-06-14 16:55:54,479] INFO: Initiating epoch #88 valid run on device rank=0
[2024-06-14 16:56:59,023] INFO: Rank 0: epoch=88 / 400 train_loss=7.6879 valid_loss=7.9596 stale=0 time=5.66m eta=1789.2m
[2024-06-14 16:56:59,102] INFO: Initiating epoch #89 train run on device rank=0
[2024-06-14 17:01:34,488] INFO: Initiating epoch #89 valid run on device rank=0
[2024-06-14 17:02:38,917] INFO: Rank 0: epoch=89 / 400 train_loss=7.6870 valid_loss=7.9592 stale=0 time=5.66m eta=1783.3m
[2024-06-14 17:02:38,965] INFO: Initiating epoch #90 train run on device rank=0
[2024-06-14 17:07:14,427] INFO: Initiating epoch #90 valid run on device rank=0
[2024-06-14 17:08:18,965] INFO: Rank 0: epoch=90 / 400 train_loss=7.6862 valid_loss=7.9586 stale=0 time=5.67m eta=1777.3m
[2024-06-14 17:08:19,009] INFO: Initiating epoch #91 train run on device rank=0
[2024-06-14 17:12:54,466] INFO: Initiating epoch #91 valid run on device rank=0
[2024-06-14 17:13:59,923] INFO: Rank 0: epoch=91 / 400 train_loss=7.6854 valid_loss=7.9579 stale=0 time=5.68m eta=1771.4m
[2024-06-14 17:13:59,966] INFO: Initiating epoch #92 train run on device rank=0
[2024-06-14 17:18:35,141] INFO: Initiating epoch #92 valid run on device rank=0
[2024-06-14 17:19:39,567] INFO: Rank 0: epoch=92 / 400 train_loss=7.6845 valid_loss=7.9571 stale=0 time=5.66m eta=1765.4m
[2024-06-14 17:19:39,604] INFO: Initiating epoch #93 train run on device rank=0
[2024-06-14 17:24:15,318] INFO: Initiating epoch #93 valid run on device rank=0
[2024-06-14 17:25:19,763] INFO: Rank 0: epoch=93 / 400 train_loss=7.6837 valid_loss=7.9567 stale=0 time=5.67m eta=1759.5m
[2024-06-14 17:25:19,825] INFO: Initiating epoch #94 train run on device rank=0
[2024-06-14 17:29:55,356] INFO: Initiating epoch #94 valid run on device rank=0
[2024-06-14 17:31:00,207] INFO: Rank 0: epoch=94 / 400 train_loss=7.6829 valid_loss=7.9559 stale=0 time=5.67m eta=1753.6m
[2024-06-14 17:31:00,281] INFO: Initiating epoch #95 train run on device rank=0
[2024-06-14 17:35:35,760] INFO: Initiating epoch #95 valid run on device rank=0
[2024-06-14 17:36:40,212] INFO: Rank 0: epoch=95 / 400 train_loss=7.6820 valid_loss=7.9550 stale=0 time=5.67m eta=1747.6m
[2024-06-14 17:36:40,267] INFO: Initiating epoch #96 train run on device rank=0
[2024-06-14 17:41:15,766] INFO: Initiating epoch #96 valid run on device rank=0
[2024-06-14 17:42:20,546] INFO: Rank 0: epoch=96 / 400 train_loss=7.6812 valid_loss=7.9543 stale=0 time=5.67m eta=1741.7m
[2024-06-14 17:42:20,629] INFO: Initiating epoch #97 train run on device rank=0
[2024-06-14 17:46:56,275] INFO: Initiating epoch #97 valid run on device rank=0
[2024-06-14 17:48:00,931] INFO: Rank 0: epoch=97 / 400 train_loss=7.6803 valid_loss=7.9537 stale=0 time=5.67m eta=1735.8m
[2024-06-14 17:48:01,186] INFO: Initiating epoch #98 train run on device rank=0
[2024-06-14 17:52:36,845] INFO:
Initiating epoch #98 valid run on device rank=0 [2024-06-14 17:53:41,733] INFO: Rank 0: epoch=98 / 400 train_loss=7.6796 valid_loss=7.9532 stale=0 time=5.68m eta=1729.9m [2024-06-14 17:53:42,083] INFO: Initiating epoch #99 train run on device rank=0 [2024-06-14 17:58:17,649] INFO: Initiating epoch #99 valid run on device rank=0 [2024-06-14 17:59:22,011] INFO: Rank 0: epoch=99 / 400 train_loss=7.6788 valid_loss=7.9527 stale=0 time=5.67m eta=1724.0m [2024-06-14 17:59:22,057] INFO: Initiating epoch #100 train run on device rank=0 [2024-06-14 18:03:57,609] INFO: Initiating epoch #100 valid run on device rank=0 [2024-06-14 18:05:02,152] INFO: Rank 0: epoch=100 / 400 train_loss=7.6781 valid_loss=7.9523 stale=0 time=5.67m eta=1718.1m [2024-06-14 18:05:02,185] INFO: Initiating epoch #101 train run on device rank=0 [2024-06-14 18:09:37,849] INFO: Initiating epoch #101 valid run on device rank=0 [2024-06-14 18:10:42,295] INFO: Rank 0: epoch=101 / 400 train_loss=7.6774 valid_loss=7.9519 stale=0 time=5.67m eta=1712.2m [2024-06-14 18:10:42,433] INFO: Initiating epoch #102 train run on device rank=0 [2024-06-14 18:15:17,883] INFO: Initiating epoch #102 valid run on device rank=0 [2024-06-14 18:16:23,421] INFO: Rank 0: epoch=102 / 400 train_loss=7.6767 valid_loss=7.9513 stale=0 time=5.68m eta=1706.4m [2024-06-14 18:16:23,639] INFO: Initiating epoch #103 train run on device rank=0 [2024-06-14 18:20:58,797] INFO: Initiating epoch #103 valid run on device rank=0 [2024-06-14 18:22:03,467] INFO: Rank 0: epoch=103 / 400 train_loss=7.6760 valid_loss=7.9511 stale=0 time=5.66m eta=1700.5m [2024-06-14 18:22:03,499] INFO: Initiating epoch #104 train run on device rank=0 [2024-06-14 18:26:39,075] INFO: Initiating epoch #104 valid run on device rank=0 [2024-06-14 18:27:43,495] INFO: Rank 0: epoch=104 / 400 train_loss=7.6754 valid_loss=7.9508 stale=0 time=5.67m eta=1694.6m [2024-06-14 18:27:43,541] INFO: Initiating epoch #105 train run on device rank=0 [2024-06-14 18:32:18,986] INFO: 
Initiating epoch #105 valid run on device rank=0 [2024-06-14 18:33:23,601] INFO: Rank 0: epoch=105 / 400 train_loss=7.6748 valid_loss=7.9503 stale=0 time=5.67m eta=1688.7m [2024-06-14 18:33:23,643] INFO: Initiating epoch #106 train run on device rank=0 [2024-06-14 18:37:59,543] INFO: Initiating epoch #106 valid run on device rank=0 [2024-06-14 18:39:04,343] INFO: Rank 0: epoch=106 / 400 train_loss=7.6742 valid_loss=7.9501 stale=0 time=5.68m eta=1682.9m [2024-06-14 18:39:04,417] INFO: Initiating epoch #107 train run on device rank=0 [2024-06-14 18:43:40,191] INFO: Initiating epoch #107 valid run on device rank=0 [2024-06-14 18:44:44,675] INFO: Rank 0: epoch=107 / 400 train_loss=7.6737 valid_loss=7.9498 stale=0 time=5.67m eta=1677.0m [2024-06-14 18:44:44,769] INFO: Initiating epoch #108 train run on device rank=0 [2024-06-14 18:49:20,526] INFO: Initiating epoch #108 valid run on device rank=0 [2024-06-14 18:50:25,157] INFO: Rank 0: epoch=108 / 400 train_loss=7.6732 valid_loss=7.9493 stale=0 time=5.67m eta=1671.1m [2024-06-14 18:50:25,226] INFO: Initiating epoch #109 train run on device rank=0 [2024-06-14 18:55:01,091] INFO: Initiating epoch #109 valid run on device rank=0 [2024-06-14 18:56:06,226] INFO: Rank 0: epoch=109 / 400 train_loss=7.6727 valid_loss=7.9492 stale=0 time=5.68m eta=1665.3m [2024-06-14 18:56:06,280] INFO: Initiating epoch #110 train run on device rank=0 [2024-06-14 19:00:41,798] INFO: Initiating epoch #110 valid run on device rank=0 [2024-06-14 19:01:46,779] INFO: Rank 0: epoch=110 / 400 train_loss=7.6722 valid_loss=7.9487 stale=0 time=5.67m eta=1659.5m [2024-06-14 19:01:46,917] INFO: Initiating epoch #111 train run on device rank=0 [2024-06-14 19:06:22,473] INFO: Initiating epoch #111 valid run on device rank=0 [2024-06-14 19:07:27,010] INFO: Rank 0: epoch=111 / 400 train_loss=7.6716 valid_loss=7.9484 stale=0 time=5.67m eta=1653.6m [2024-06-14 19:07:27,052] INFO: Initiating epoch #112 train run on device rank=0 [2024-06-14 19:12:02,973] INFO: 
Initiating epoch #112 valid run on device rank=0 [2024-06-14 19:13:07,462] INFO: Rank 0: epoch=112 / 400 train_loss=7.6712 valid_loss=7.9480 stale=0 time=5.67m eta=1647.8m [2024-06-14 19:13:07,488] INFO: Initiating epoch #113 train run on device rank=0 [2024-06-14 19:17:42,715] INFO: Initiating epoch #113 valid run on device rank=0 [2024-06-14 19:18:47,240] INFO: Rank 0: epoch=113 / 400 train_loss=7.6708 valid_loss=7.9476 stale=0 time=5.66m eta=1641.9m [2024-06-14 19:18:47,247] INFO: Initiating epoch #114 train run on device rank=0 [2024-06-14 19:23:22,873] INFO: Initiating epoch #114 valid run on device rank=0 [2024-06-14 19:24:27,483] INFO: Rank 0: epoch=114 / 400 train_loss=7.6704 valid_loss=7.9473 stale=0 time=5.67m eta=1636.1m [2024-06-14 19:24:27,545] INFO: Initiating epoch #115 train run on device rank=0 [2024-06-14 19:29:02,820] INFO: Initiating epoch #115 valid run on device rank=0 [2024-06-14 19:30:07,512] INFO: Rank 0: epoch=115 / 400 train_loss=7.6701 valid_loss=7.9468 stale=0 time=5.67m eta=1630.2m [2024-06-14 19:30:07,607] INFO: Initiating epoch #116 train run on device rank=0 [2024-06-14 19:34:42,933] INFO: Initiating epoch #116 valid run on device rank=0 [2024-06-14 19:35:47,407] INFO: Rank 0: epoch=116 / 400 train_loss=7.6697 valid_loss=7.9465 stale=0 time=5.66m eta=1624.3m [2024-06-14 19:35:47,449] INFO: Initiating epoch #117 train run on device rank=0 [2024-06-14 19:40:22,938] INFO: Initiating epoch #117 valid run on device rank=0 [2024-06-14 19:41:27,474] INFO: Rank 0: epoch=117 / 400 train_loss=7.6693 valid_loss=7.9460 stale=0 time=5.67m eta=1618.5m [2024-06-14 19:41:27,525] INFO: Initiating epoch #118 train run on device rank=0 [2024-06-14 19:46:03,321] INFO: Initiating epoch #118 valid run on device rank=0 [2024-06-14 19:47:08,386] INFO: Rank 0: epoch=118 / 400 train_loss=7.6690 valid_loss=7.9456 stale=0 time=5.68m eta=1612.7m [2024-06-14 19:47:08,458] INFO: Initiating epoch #119 train run on device rank=0 [2024-06-14 19:51:43,741] INFO: 
Initiating epoch #119 valid run on device rank=0 [2024-06-14 19:52:48,510] INFO: Rank 0: epoch=119 / 400 train_loss=7.6685 valid_loss=7.9452 stale=0 time=5.67m eta=1606.9m [2024-06-14 19:52:48,542] INFO: Initiating epoch #120 train run on device rank=0 [2024-06-14 19:57:23,906] INFO: Initiating epoch #120 valid run on device rank=0 [2024-06-14 19:58:29,071] INFO: Rank 0: epoch=120 / 400 train_loss=7.6683 valid_loss=7.9448 stale=0 time=5.68m eta=1601.0m [2024-06-14 19:58:29,125] INFO: Initiating epoch #121 train run on device rank=0 [2024-06-14 20:03:04,676] INFO: Initiating epoch #121 valid run on device rank=0 [2024-06-14 20:04:09,254] INFO: Rank 0: epoch=121 / 400 train_loss=7.6678 valid_loss=7.9445 stale=0 time=5.67m eta=1595.2m [2024-06-14 20:04:09,310] INFO: Initiating epoch #122 train run on device rank=0 [2024-06-14 20:08:45,163] INFO: Initiating epoch #122 valid run on device rank=0 [2024-06-14 20:09:49,536] INFO: Rank 0: epoch=122 / 400 train_loss=7.6674 valid_loss=7.9442 stale=0 time=5.67m eta=1589.4m [2024-06-14 20:09:49,550] INFO: Initiating epoch #123 train run on device rank=0 [2024-06-14 20:14:26,089] INFO: Initiating epoch #123 valid run on device rank=0 [2024-06-14 20:15:31,260] INFO: Rank 0: epoch=123 / 400 train_loss=7.6673 valid_loss=7.9438 stale=0 time=5.7m eta=1583.6m [2024-06-14 20:15:31,293] INFO: Initiating epoch #124 train run on device rank=0 [2024-06-14 20:20:07,013] INFO: Initiating epoch #124 valid run on device rank=0 [2024-06-14 20:21:12,252] INFO: Rank 0: epoch=124 / 400 train_loss=7.6669 valid_loss=7.9434 stale=0 time=5.68m eta=1577.8m [2024-06-14 20:21:12,287] INFO: Initiating epoch #125 train run on device rank=0 [2024-06-14 20:25:48,205] INFO: Initiating epoch #125 valid run on device rank=0 [2024-06-14 20:26:52,930] INFO: Rank 0: epoch=125 / 400 train_loss=7.6665 valid_loss=7.9431 stale=0 time=5.68m eta=1572.0m [2024-06-14 20:26:52,938] INFO: Initiating epoch #126 train run on device rank=0 [2024-06-14 20:31:29,303] INFO: 
Initiating epoch #126 valid run on device rank=0 [2024-06-14 20:32:34,047] INFO: Rank 0: epoch=126 / 400 train_loss=7.6663 valid_loss=7.9427 stale=0 time=5.69m eta=1566.2m [2024-06-14 20:32:34,090] INFO: Initiating epoch #127 train run on device rank=0 [2024-06-14 20:37:09,952] INFO: Initiating epoch #127 valid run on device rank=0 [2024-06-14 20:38:15,040] INFO: Rank 0: epoch=127 / 400 train_loss=7.6659 valid_loss=7.9425 stale=0 time=5.68m eta=1560.5m [2024-06-14 20:38:15,157] INFO: Initiating epoch #128 train run on device rank=0 [2024-06-14 20:42:52,932] INFO: Initiating epoch #128 valid run on device rank=0 [2024-06-14 20:43:59,379] INFO: Rank 0: epoch=128 / 400 train_loss=7.6657 valid_loss=7.9421 stale=0 time=5.74m eta=1554.8m [2024-06-14 20:43:59,424] INFO: Initiating epoch #129 train run on device rank=0 [2024-06-14 20:48:37,501] INFO: Initiating epoch #129 valid run on device rank=0 [2024-06-14 20:49:43,596] INFO: Rank 0: epoch=129 / 400 train_loss=7.6655 valid_loss=7.9418 stale=0 time=5.74m eta=1549.1m [2024-06-14 20:49:43,637] INFO: Initiating epoch #130 train run on device rank=0 [2024-06-14 20:54:21,903] INFO: Initiating epoch #130 valid run on device rank=0 [2024-06-14 20:55:28,319] INFO: Rank 0: epoch=130 / 400 train_loss=7.6652 valid_loss=7.9414 stale=0 time=5.74m eta=1543.5m [2024-06-14 20:55:28,365] INFO: Initiating epoch #131 train run on device rank=0 [2024-06-14 21:00:06,604] INFO: Initiating epoch #131 valid run on device rank=0 [2024-06-14 21:01:12,879] INFO: Rank 0: epoch=131 / 400 train_loss=7.6648 valid_loss=7.9412 stale=0 time=5.74m eta=1537.8m [2024-06-14 21:01:12,893] INFO: Initiating epoch #132 train run on device rank=0 [2024-06-14 21:05:51,116] INFO: Initiating epoch #132 valid run on device rank=0 [2024-06-14 21:06:57,365] INFO: Rank 0: epoch=132 / 400 train_loss=7.6645 valid_loss=7.9408 stale=0 time=5.74m eta=1532.1m [2024-06-14 21:06:57,397] INFO: Initiating epoch #133 train run on device rank=0 [2024-06-14 21:11:35,464] INFO: 
Initiating epoch #133 valid run on device rank=0 [2024-06-14 21:12:41,789] INFO: Rank 0: epoch=133 / 400 train_loss=7.6643 valid_loss=7.9406 stale=0 time=5.74m eta=1526.5m [2024-06-14 21:12:41,830] INFO: Initiating epoch #134 train run on device rank=0 [2024-06-14 21:17:19,858] INFO: Initiating epoch #134 valid run on device rank=0 [2024-06-14 21:18:26,124] INFO: Rank 0: epoch=134 / 400 train_loss=7.6639 valid_loss=7.9403 stale=0 time=5.74m eta=1520.8m [2024-06-14 21:18:26,188] INFO: Initiating epoch #135 train run on device rank=0 [2024-06-14 21:23:04,171] INFO: Initiating epoch #135 valid run on device rank=0 [2024-06-14 21:24:10,588] INFO: Rank 0: epoch=135 / 400 train_loss=7.6638 valid_loss=7.9400 stale=0 time=5.74m eta=1515.1m [2024-06-14 21:24:10,622] INFO: Initiating epoch #136 train run on device rank=0 [2024-06-14 21:28:48,354] INFO: Initiating epoch #136 valid run on device rank=0 [2024-06-14 21:29:54,638] INFO: Rank 0: epoch=136 / 400 train_loss=7.6634 valid_loss=7.9399 stale=0 time=5.73m eta=1509.4m [2024-06-14 21:29:54,733] INFO: Initiating epoch #137 train run on device rank=0 [2024-06-14 21:34:32,702] INFO: Initiating epoch #137 valid run on device rank=0 [2024-06-14 21:35:38,831] INFO: Rank 0: epoch=137 / 400 train_loss=7.6631 valid_loss=7.9396 stale=0 time=5.73m eta=1503.7m [2024-06-14 21:35:38,873] INFO: Initiating epoch #138 train run on device rank=0 [2024-06-14 21:40:16,849] INFO: Initiating epoch #138 valid run on device rank=0 [2024-06-14 21:41:22,893] INFO: Rank 0: epoch=138 / 400 train_loss=7.6631 valid_loss=7.9392 stale=0 time=5.73m eta=1498.1m [2024-06-14 21:41:22,948] INFO: Initiating epoch #139 train run on device rank=0 [2024-06-14 21:46:01,320] INFO: Initiating epoch #139 valid run on device rank=0 [2024-06-14 21:47:07,009] INFO: Rank 0: epoch=139 / 400 train_loss=7.6628 valid_loss=7.9390 stale=0 time=5.73m eta=1492.4m [2024-06-14 21:47:07,074] INFO: Initiating epoch #140 train run on device rank=0 [2024-06-14 21:51:45,586] INFO: 
Initiating epoch #140 valid run on device rank=0 [2024-06-14 21:52:51,599] INFO: Rank 0: epoch=140 / 400 train_loss=7.6625 valid_loss=7.9388 stale=0 time=5.74m eta=1486.7m [2024-06-14 21:52:51,633] INFO: Initiating epoch #141 train run on device rank=0 [2024-06-14 21:57:30,059] INFO: Initiating epoch #141 valid run on device rank=0 [2024-06-14 21:58:36,197] INFO: Rank 0: epoch=141 / 400 train_loss=7.6623 valid_loss=7.9385 stale=0 time=5.74m eta=1481.0m [2024-06-14 21:58:36,199] INFO: Initiating epoch #142 train run on device rank=0 [2024-06-14 22:03:14,678] INFO: Initiating epoch #142 valid run on device rank=0 [2024-06-14 22:04:21,030] INFO: Rank 0: epoch=142 / 400 train_loss=7.6621 valid_loss=7.9382 stale=0 time=5.75m eta=1475.4m [2024-06-14 22:04:21,075] INFO: Initiating epoch #143 train run on device rank=0 [2024-06-14 22:08:59,226] INFO: Initiating epoch #143 valid run on device rank=0 [2024-06-14 22:10:05,243] INFO: Rank 0: epoch=143 / 400 train_loss=7.6618 valid_loss=7.9380 stale=0 time=5.74m eta=1469.7m [2024-06-14 22:10:05,260] INFO: Initiating epoch #144 train run on device rank=0 [2024-06-14 22:14:43,478] INFO: Initiating epoch #144 valid run on device rank=0 [2024-06-14 22:15:50,923] INFO: Rank 0: epoch=144 / 400 train_loss=7.6617 valid_loss=7.9378 stale=0 time=5.76m eta=1464.0m [2024-06-14 22:15:51,097] INFO: Initiating epoch #145 train run on device rank=0 [2024-06-14 22:20:29,232] INFO: Initiating epoch #145 valid run on device rank=0 [2024-06-14 22:21:35,695] INFO: Rank 0: epoch=145 / 400 train_loss=7.6614 valid_loss=7.9375 stale=0 time=5.74m eta=1458.4m [2024-06-14 22:21:35,776] INFO: Initiating epoch #146 train run on device rank=0 [2024-06-14 22:26:14,214] INFO: Initiating epoch #146 valid run on device rank=0 [2024-06-14 22:27:20,406] INFO: Rank 0: epoch=146 / 400 train_loss=7.6612 valid_loss=7.9373 stale=0 time=5.74m eta=1452.7m [2024-06-14 22:27:20,446] INFO: Initiating epoch #147 train run on device rank=0 [2024-06-14 22:31:59,072] INFO: 
Initiating epoch #147 valid run on device rank=0 [2024-06-14 22:33:05,191] INFO: Rank 0: epoch=147 / 400 train_loss=7.6610 valid_loss=7.9371 stale=0 time=5.75m eta=1447.0m [2024-06-14 22:33:05,225] INFO: Initiating epoch #148 train run on device rank=0 [2024-06-14 22:37:43,715] INFO: Initiating epoch #148 valid run on device rank=0 [2024-06-14 22:38:49,966] INFO: Rank 0: epoch=148 / 400 train_loss=7.6608 valid_loss=7.9368 stale=0 time=5.75m eta=1441.4m [2024-06-14 22:38:50,004] INFO: Initiating epoch #149 train run on device rank=0 [2024-06-14 22:43:28,302] INFO: Initiating epoch #149 valid run on device rank=0 [2024-06-14 22:44:34,690] INFO: Rank 0: epoch=149 / 400 train_loss=7.6606 valid_loss=7.9367 stale=0 time=5.74m eta=1435.7m [2024-06-14 22:44:34,719] INFO: Initiating epoch #150 train run on device rank=0 [2024-06-14 22:49:13,190] INFO: Initiating epoch #150 valid run on device rank=0 [2024-06-14 22:50:19,343] INFO: Rank 0: epoch=150 / 400 train_loss=7.6604 valid_loss=7.9365 stale=0 time=5.74m eta=1430.0m [2024-06-14 22:50:19,389] INFO: Initiating epoch #151 train run on device rank=0 [2024-06-14 22:54:58,017] INFO: Initiating epoch #151 valid run on device rank=0 [2024-06-14 22:56:04,199] INFO: Rank 0: epoch=151 / 400 train_loss=7.6601 valid_loss=7.9362 stale=0 time=5.75m eta=1424.3m [2024-06-14 22:56:04,276] INFO: Initiating epoch #152 train run on device rank=0 [2024-06-14 23:00:42,686] INFO: Initiating epoch #152 valid run on device rank=0 [2024-06-14 23:01:48,840] INFO: Rank 0: epoch=152 / 400 train_loss=7.6600 valid_loss=7.9360 stale=0 time=5.74m eta=1418.6m [2024-06-14 23:01:48,865] INFO: Initiating epoch #153 train run on device rank=0 [2024-06-14 23:06:27,230] INFO: Initiating epoch #153 valid run on device rank=0 [2024-06-14 23:07:33,347] INFO: Rank 0: epoch=153 / 400 train_loss=7.6598 valid_loss=7.9357 stale=0 time=5.74m eta=1413.0m [2024-06-14 23:07:33,363] INFO: Initiating epoch #154 train run on device rank=0 [2024-06-14 23:12:12,348] INFO: 
Initiating epoch #154 valid run on device rank=0 [2024-06-14 23:13:18,626] INFO: Rank 0: epoch=154 / 400 train_loss=7.6596 valid_loss=7.9355 stale=0 time=5.75m eta=1407.3m [2024-06-14 23:13:18,674] INFO: Initiating epoch #155 train run on device rank=0 [2024-06-14 23:17:57,668] INFO: Initiating epoch #155 valid run on device rank=0 [2024-06-14 23:19:03,925] INFO: Rank 0: epoch=155 / 400 train_loss=7.6595 valid_loss=7.9354 stale=0 time=5.75m eta=1401.6m [2024-06-14 23:19:04,018] INFO: Initiating epoch #156 train run on device rank=0 [2024-06-14 23:23:42,659] INFO: Initiating epoch #156 valid run on device rank=0 [2024-06-14 23:24:49,183] INFO: Rank 0: epoch=156 / 400 train_loss=7.6592 valid_loss=7.9353 stale=0 time=5.75m eta=1396.0m [2024-06-14 23:24:49,235] INFO: Initiating epoch #157 train run on device rank=0 [2024-06-14 23:29:28,037] INFO: Initiating epoch #157 valid run on device rank=0 [2024-06-14 23:30:34,924] INFO: Rank 0: epoch=157 / 400 train_loss=7.6591 valid_loss=7.9350 stale=0 time=5.76m eta=1390.3m [2024-06-14 23:30:35,072] INFO: Initiating epoch #158 train run on device rank=0 [2024-06-14 23:35:13,198] INFO: Initiating epoch #158 valid run on device rank=0 [2024-06-14 23:36:19,399] INFO: Rank 0: epoch=158 / 400 train_loss=7.6590 valid_loss=7.9348 stale=0 time=5.74m eta=1384.6m [2024-06-14 23:36:19,453] INFO: Initiating epoch #159 train run on device rank=0 [2024-06-14 23:40:57,971] INFO: Initiating epoch #159 valid run on device rank=0 [2024-06-14 23:42:04,391] INFO: Rank 0: epoch=159 / 400 train_loss=7.6588 valid_loss=7.9347 stale=0 time=5.75m eta=1378.9m [2024-06-14 23:42:04,481] INFO: Initiating epoch #160 train run on device rank=0 [2024-06-14 23:46:42,734] INFO: Initiating epoch #160 valid run on device rank=0 [2024-06-14 23:47:48,973] INFO: Rank 0: epoch=160 / 400 train_loss=7.6587 valid_loss=7.9345 stale=0 time=5.74m eta=1373.2m [2024-06-14 23:47:49,028] INFO: Initiating epoch #161 train run on device rank=0 [2024-06-14 23:52:27,375] INFO: 
Initiating epoch #161 valid run on device rank=0 [2024-06-14 23:53:33,455] INFO: Rank 0: epoch=161 / 400 train_loss=7.6585 valid_loss=7.9343 stale=0 time=5.74m eta=1367.5m [2024-06-14 23:53:33,463] INFO: Initiating epoch #162 train run on device rank=0 [2024-06-14 23:58:12,096] INFO: Initiating epoch #162 valid run on device rank=0 [2024-06-14 23:59:18,170] INFO: Rank 0: epoch=162 / 400 train_loss=7.6584 valid_loss=7.9343 stale=0 time=5.75m eta=1361.9m [2024-06-14 23:59:18,193] INFO: Initiating epoch #163 train run on device rank=0 [2024-06-15 00:03:56,754] INFO: Initiating epoch #163 valid run on device rank=0 [2024-06-15 00:05:03,231] INFO: Rank 0: epoch=163 / 400 train_loss=7.6583 valid_loss=7.9340 stale=0 time=5.75m eta=1356.2m [2024-06-15 00:05:03,284] INFO: Initiating epoch #164 train run on device rank=0 [2024-06-15 00:09:41,918] INFO: Initiating epoch #164 valid run on device rank=0 [2024-06-15 00:10:48,215] INFO: Rank 0: epoch=164 / 400 train_loss=7.6582 valid_loss=7.9339 stale=0 time=5.75m eta=1350.5m [2024-06-15 00:10:48,273] INFO: Initiating epoch #165 train run on device rank=0 [2024-06-15 00:15:27,051] INFO: Initiating epoch #165 valid run on device rank=0 [2024-06-15 00:16:33,196] INFO: Rank 0: epoch=165 / 400 train_loss=7.6580 valid_loss=7.9337 stale=0 time=5.75m eta=1344.8m [2024-06-15 00:16:33,210] INFO: Initiating epoch #166 train run on device rank=0 [2024-06-15 00:21:12,348] INFO: Initiating epoch #166 valid run on device rank=0 [2024-06-15 00:22:18,987] INFO: Rank 0: epoch=166 / 400 train_loss=7.6579 valid_loss=7.9337 stale=0 time=5.76m eta=1339.1m [2024-06-15 00:22:19,170] INFO: Initiating epoch #167 train run on device rank=0 [2024-06-15 00:26:58,024] INFO: Initiating epoch #167 valid run on device rank=0 [2024-06-15 00:28:04,312] INFO: Rank 0: epoch=167 / 400 train_loss=7.6577 valid_loss=7.9336 stale=0 time=5.75m eta=1333.5m [2024-06-15 00:28:04,375] INFO: Initiating epoch #168 train run on device rank=0 [2024-06-15 00:32:42,941] INFO: 
Initiating epoch #168 valid run on device rank=0 [2024-06-15 00:33:49,438] INFO: Rank 0: epoch=168 / 400 train_loss=7.6576 valid_loss=7.9334 stale=0 time=5.75m eta=1327.8m [2024-06-15 00:33:49,485] INFO: Initiating epoch #169 train run on device rank=0 [2024-06-15 00:38:27,782] INFO: Initiating epoch #169 valid run on device rank=0 [2024-06-15 00:39:33,915] INFO: Rank 0: epoch=169 / 400 train_loss=7.6575 valid_loss=7.9333 stale=0 time=5.74m eta=1322.1m [2024-06-15 00:39:33,969] INFO: Initiating epoch #170 train run on device rank=0 [2024-06-15 00:44:11,971] INFO: Initiating epoch #170 valid run on device rank=0 [2024-06-15 00:45:17,839] INFO: Rank 0: epoch=170 / 400 train_loss=7.6574 valid_loss=7.9331 stale=0 time=5.73m eta=1316.4m [2024-06-15 00:45:17,937] INFO: Initiating epoch #171 train run on device rank=0 [2024-06-15 00:49:56,113] INFO: Initiating epoch #171 valid run on device rank=0 [2024-06-15 00:51:02,423] INFO: Rank 0: epoch=171 / 400 train_loss=7.6573 valid_loss=7.9331 stale=0 time=5.74m eta=1310.7m [2024-06-15 00:51:02,475] INFO: Initiating epoch #172 train run on device rank=0 [2024-06-15 00:55:40,444] INFO: Initiating epoch #172 valid run on device rank=0 [2024-06-15 00:56:46,501] INFO: Rank 0: epoch=172 / 400 train_loss=7.6572 valid_loss=7.9331 stale=0 time=5.73m eta=1305.0m [2024-06-15 00:56:46,539] INFO: Initiating epoch #173 train run on device rank=0 [2024-06-15 01:01:24,904] INFO: Initiating epoch #173 valid run on device rank=0 [2024-06-15 01:02:31,322] INFO: Rank 0: epoch=173 / 400 train_loss=7.6571 valid_loss=7.9329 stale=0 time=5.75m eta=1299.3m [2024-06-15 01:02:31,363] INFO: Initiating epoch #174 train run on device rank=0 [2024-06-15 01:07:09,585] INFO: Initiating epoch #174 valid run on device rank=0 [2024-06-15 01:08:15,512] INFO: Rank 0: epoch=174 / 400 train_loss=7.6570 valid_loss=7.9329 stale=1 time=5.74m eta=1293.6m [2024-06-15 01:08:15,559] INFO: Initiating epoch #175 train run on device rank=0 [2024-06-15 01:12:54,211] INFO: 
Initiating epoch #175 valid run on device rank=0 [2024-06-15 01:14:00,643] INFO: Rank 0: epoch=175 / 400 train_loss=7.6569 valid_loss=7.9327 stale=0 time=5.75m eta=1287.9m [2024-06-15 01:14:00,693] INFO: Initiating epoch #176 train run on device rank=0 [2024-06-15 01:18:38,802] INFO: Initiating epoch #176 valid run on device rank=0 [2024-06-15 01:19:44,985] INFO: Rank 0: epoch=176 / 400 train_loss=7.6569 valid_loss=7.9326 stale=0 time=5.74m eta=1282.2m [2024-06-15 01:19:45,028] INFO: Initiating epoch #177 train run on device rank=0 [2024-06-15 01:24:23,431] INFO: Initiating epoch #177 valid run on device rank=0 [2024-06-15 01:25:30,159] INFO: Rank 0: epoch=177 / 400 train_loss=7.6567 valid_loss=7.9325 stale=0 time=5.75m eta=1276.5m [2024-06-15 01:25:30,191] INFO: Initiating epoch #178 train run on device rank=0 [2024-06-15 01:30:08,798] INFO: Initiating epoch #178 valid run on device rank=0 [2024-06-15 01:31:15,219] INFO: Rank 0: epoch=178 / 400 train_loss=7.6567 valid_loss=7.9324 stale=0 time=5.75m eta=1270.8m [2024-06-15 01:31:15,253] INFO: Initiating epoch #179 train run on device rank=0 [2024-06-15 01:35:53,718] INFO: Initiating epoch #179 valid run on device rank=0 [2024-06-15 01:37:00,105] INFO: Rank 0: epoch=179 / 400 train_loss=7.6566 valid_loss=7.9324 stale=1 time=5.75m eta=1265.1m [2024-06-15 01:37:00,138] INFO: Initiating epoch #180 train run on device rank=0 [2024-06-15 01:41:38,583] INFO: Initiating epoch #180 valid run on device rank=0 [2024-06-15 01:42:45,267] INFO: Rank 0: epoch=180 / 400 train_loss=7.6565 valid_loss=7.9323 stale=0 time=5.75m eta=1259.4m [2024-06-15 01:42:45,381] INFO: Initiating epoch #181 train run on device rank=0 [2024-06-15 01:47:24,236] INFO: Initiating epoch #181 valid run on device rank=0 [2024-06-15 01:48:30,246] INFO: Rank 0: epoch=181 / 400 train_loss=7.6564 valid_loss=7.9323 stale=1 time=5.75m eta=1253.7m [2024-06-15 01:48:30,287] INFO: Initiating epoch #182 train run on device rank=0 [2024-06-15 01:53:09,203] INFO: 
Initiating epoch #182 valid run on device rank=0 [2024-06-15 01:54:15,652] INFO: Rank 0: epoch=182 / 400 train_loss=7.6563 valid_loss=7.9322 stale=0 time=5.76m eta=1248.0m [2024-06-15 01:54:15,707] INFO: Initiating epoch #183 train run on device rank=0 [2024-06-15 01:58:54,316] INFO: Initiating epoch #183 valid run on device rank=0 [2024-06-15 02:00:00,577] INFO: Rank 0: epoch=183 / 400 train_loss=7.6562 valid_loss=7.9321 stale=0 time=5.75m eta=1242.3m [2024-06-15 02:00:00,585] INFO: Initiating epoch #184 train run on device rank=0 [2024-06-15 02:04:39,270] INFO: Initiating epoch #184 valid run on device rank=0 [2024-06-15 02:05:45,438] INFO: Rank 0: epoch=184 / 400 train_loss=7.6562 valid_loss=7.9321 stale=1 time=5.75m eta=1236.6m [2024-06-15 02:05:45,484] INFO: Initiating epoch #185 train run on device rank=0 [2024-06-15 02:10:24,255] INFO: Initiating epoch #185 valid run on device rank=0 [2024-06-15 02:11:30,848] INFO: Rank 0: epoch=185 / 400 train_loss=7.6561 valid_loss=7.9321 stale=0 time=5.76m eta=1230.9m [2024-06-15 02:11:31,057] INFO: Initiating epoch #186 train run on device rank=0 [2024-06-15 02:16:09,390] INFO: Initiating epoch #186 valid run on device rank=0 [2024-06-15 02:17:16,091] INFO: Rank 0: epoch=186 / 400 train_loss=7.6560 valid_loss=7.9319 stale=0 time=5.75m eta=1225.3m [2024-06-15 02:17:16,197] INFO: Initiating epoch #187 train run on device rank=0 [2024-06-15 02:21:54,938] INFO: Initiating epoch #187 valid run on device rank=0 [2024-06-15 02:23:01,431] INFO: Rank 0: epoch=187 / 400 train_loss=7.6559 valid_loss=7.9319 stale=0 time=5.75m eta=1219.6m [2024-06-15 02:23:01,478] INFO: Initiating epoch #188 train run on device rank=0 [2024-06-15 02:27:40,397] INFO: Initiating epoch #188 valid run on device rank=0 [2024-06-15 02:28:46,711] INFO: Rank 0: epoch=188 / 400 train_loss=7.6558 valid_loss=7.9319 stale=0 time=5.75m eta=1213.9m [2024-06-15 02:28:46,769] INFO: Initiating epoch #189 train run on device rank=0 [2024-06-15 02:33:25,353] INFO: 
Initiating epoch #189 valid run on device rank=0 [2024-06-15 02:34:32,523] INFO: Rank 0: epoch=189 / 400 train_loss=7.6558 valid_loss=7.9318 stale=0 time=5.76m eta=1208.2m [2024-06-15 02:34:32,581] INFO: Initiating epoch #190 train run on device rank=0 [2024-06-15 02:39:11,461] INFO: Initiating epoch #190 valid run on device rank=0 [2024-06-15 02:40:18,091] INFO: Rank 0: epoch=190 / 400 train_loss=7.6557 valid_loss=7.9316 stale=0 time=5.76m eta=1202.5m [2024-06-15 02:40:18,134] INFO: Initiating epoch #191 train run on device rank=0 [2024-06-15 02:44:56,862] INFO: Initiating epoch #191 valid run on device rank=0 [2024-06-15 02:46:03,115] INFO: Rank 0: epoch=191 / 400 train_loss=7.6556 valid_loss=7.9316 stale=1 time=5.75m eta=1196.8m [2024-06-15 02:46:03,181] INFO: Initiating epoch #192 train run on device rank=0 [2024-06-15 02:50:42,231] INFO: Initiating epoch #192 valid run on device rank=0 [2024-06-15 02:51:48,439] INFO: Rank 0: epoch=192 / 400 train_loss=7.6556 valid_loss=7.9316 stale=0 time=5.75m eta=1191.1m [2024-06-15 02:51:48,500] INFO: Initiating epoch #193 train run on device rank=0 [2024-06-15 02:56:26,682] INFO: Initiating epoch #193 valid run on device rank=0 [2024-06-15 02:57:32,907] INFO: Rank 0: epoch=193 / 400 train_loss=7.6555 valid_loss=7.9315 stale=0 time=5.74m eta=1185.4m [2024-06-15 02:57:32,953] INFO: Initiating epoch #194 train run on device rank=0 [2024-06-15 03:02:11,077] INFO: Initiating epoch #194 valid run on device rank=0 [2024-06-15 03:03:17,547] INFO: Rank 0: epoch=194 / 400 train_loss=7.6554 valid_loss=7.9314 stale=0 time=5.74m eta=1179.7m [2024-06-15 03:03:17,581] INFO: Initiating epoch #195 train run on device rank=0 [2024-06-15 03:07:56,158] INFO: Initiating epoch #195 valid run on device rank=0 [2024-06-15 03:09:03,088] INFO: Rank 0: epoch=195 / 400 train_loss=7.6554 valid_loss=7.9314 stale=0 time=5.76m eta=1174.0m [2024-06-15 03:09:03,139] INFO: Initiating epoch #196 train run on device rank=0 [2024-06-15 03:13:41,638] INFO: 
Initiating epoch #196 valid run on device rank=0 [2024-06-15 03:14:48,279] INFO: Rank 0: epoch=196 / 400 train_loss=7.6553 valid_loss=7.9313 stale=0 time=5.75m eta=1168.3m [2024-06-15 03:14:48,337] INFO: Initiating epoch #197 train run on device rank=0 [2024-06-15 03:19:26,957] INFO: Initiating epoch #197 valid run on device rank=0 [2024-06-15 03:20:33,644] INFO: Rank 0: epoch=197 / 400 train_loss=7.6553 valid_loss=7.9312 stale=0 time=5.76m eta=1162.6m [2024-06-15 03:20:33,686] INFO: Initiating epoch #198 train run on device rank=0 [2024-06-15 03:25:12,082] INFO: Initiating epoch #198 valid run on device rank=0 [2024-06-15 03:26:18,595] INFO: Rank 0: epoch=198 / 400 train_loss=7.6552 valid_loss=7.9313 stale=1 time=5.75m eta=1156.9m [2024-06-15 03:26:18,649] INFO: Initiating epoch #199 train run on device rank=0 [2024-06-15 03:30:57,498] INFO: Initiating epoch #199 valid run on device rank=0 [2024-06-15 03:32:04,236] INFO: Rank 0: epoch=199 / 400 train_loss=7.6552 valid_loss=7.9312 stale=0 time=5.76m eta=1151.2m [2024-06-15 03:32:04,283] INFO: Initiating epoch #200 train run on device rank=0 [2024-06-15 03:36:43,101] INFO: Initiating epoch #200 valid run on device rank=0 [2024-06-15 03:37:49,608] INFO: Rank 0: epoch=200 / 400 train_loss=7.6551 valid_loss=7.9312 stale=1 time=5.76m eta=1145.5m [2024-06-15 03:37:49,645] INFO: Initiating epoch #201 train run on device rank=0 [2024-06-15 03:42:27,921] INFO: Initiating epoch #201 valid run on device rank=0 [2024-06-15 03:43:34,398] INFO: Rank 0: epoch=201 / 400 train_loss=7.6551 valid_loss=7.9311 stale=0 time=5.75m eta=1139.8m [2024-06-15 03:43:34,478] INFO: Initiating epoch #202 train run on device rank=0 [2024-06-15 03:48:12,962] INFO: Initiating epoch #202 valid run on device rank=0 [2024-06-15 03:49:19,622] INFO: Rank 0: epoch=202 / 400 train_loss=7.6550 valid_loss=7.9310 stale=0 time=5.75m eta=1134.1m [2024-06-15 03:49:19,651] INFO: Initiating epoch #203 train run on device rank=0 [2024-06-15 03:53:58,448] INFO: 
Initiating epoch #203 valid run on device rank=0 [2024-06-15 03:55:04,863] INFO: Rank 0: epoch=203 / 400 train_loss=7.6549 valid_loss=7.9310 stale=0 time=5.75m eta=1128.4m [2024-06-15 03:55:04,917] INFO: Initiating epoch #204 train run on device rank=0 [2024-06-15 03:59:43,727] INFO: Initiating epoch #204 valid run on device rank=0 [2024-06-15 04:00:50,419] INFO: Rank 0: epoch=204 / 400 train_loss=7.6549 valid_loss=7.9310 stale=0 time=5.76m eta=1122.7m [2024-06-15 04:00:50,501] INFO: Initiating epoch #205 train run on device rank=0 [2024-06-15 04:05:29,376] INFO: Initiating epoch #205 valid run on device rank=0 [2024-06-15 04:06:36,427] INFO: Rank 0: epoch=205 / 400 train_loss=7.6549 valid_loss=7.9309 stale=0 time=5.77m eta=1117.0m [2024-06-15 04:06:36,493] INFO: Initiating epoch #206 train run on device rank=0 [2024-06-15 04:11:15,740] INFO: Initiating epoch #206 valid run on device rank=0 [2024-06-15 04:12:22,763] INFO: Rank 0: epoch=206 / 400 train_loss=7.6548 valid_loss=7.9309 stale=0 time=5.77m eta=1111.3m [2024-06-15 04:12:22,816] INFO: Initiating epoch #207 train run on device rank=0 [2024-06-15 04:17:01,754] INFO: Initiating epoch #207 valid run on device rank=0 [2024-06-15 04:18:08,751] INFO: Rank 0: epoch=207 / 400 train_loss=7.6547 valid_loss=7.9307 stale=0 time=5.77m eta=1105.6m [2024-06-15 04:18:08,837] INFO: Initiating epoch #208 train run on device rank=0 [2024-06-15 04:22:47,214] INFO: Initiating epoch #208 valid run on device rank=0 [2024-06-15 04:23:53,715] INFO: Rank 0: epoch=208 / 400 train_loss=7.6548 valid_loss=7.9307 stale=1 time=5.75m eta=1099.9m [2024-06-15 04:23:53,761] INFO: Initiating epoch #209 train run on device rank=0 [2024-06-15 04:28:32,447] INFO: Initiating epoch #209 valid run on device rank=0 [2024-06-15 04:29:39,115] INFO: Rank 0: epoch=209 / 400 train_loss=7.6547 valid_loss=7.9306 stale=0 time=5.76m eta=1094.2m [2024-06-15 04:29:39,165] INFO: Initiating epoch #210 train run on device rank=0 [2024-06-15 04:34:17,870] INFO: 
Initiating epoch #210 valid run on device rank=0 [2024-06-15 04:35:23,934] INFO: Rank 0: epoch=210 / 400 train_loss=7.6546 valid_loss=7.9306 stale=1 time=5.75m eta=1088.5m [2024-06-15 04:35:23,984] INFO: Initiating epoch #211 train run on device rank=0 [2024-06-15 04:40:02,206] INFO: Initiating epoch #211 valid run on device rank=0 [2024-06-15 04:41:08,759] INFO: Rank 0: epoch=211 / 400 train_loss=7.6546 valid_loss=7.9305 stale=0 time=5.75m eta=1082.8m [2024-06-15 04:41:08,905] INFO: Initiating epoch #212 train run on device rank=0 [2024-06-15 04:45:46,980] INFO: Initiating epoch #212 valid run on device rank=0 [2024-06-15 04:46:52,929] INFO: Rank 0: epoch=212 / 400 train_loss=7.6546 valid_loss=7.9305 stale=1 time=5.73m eta=1077.1m [2024-06-15 04:46:53,079] INFO: Initiating epoch #213 train run on device rank=0 [2024-06-15 04:51:31,518] INFO: Initiating epoch #213 valid run on device rank=0 [2024-06-15 04:52:37,959] INFO: Rank 0: epoch=213 / 400 train_loss=7.6545 valid_loss=7.9305 stale=0 time=5.75m eta=1071.3m [2024-06-15 04:52:38,045] INFO: Initiating epoch #214 train run on device rank=0 [2024-06-15 04:57:16,055] INFO: Initiating epoch #214 valid run on device rank=0 [2024-06-15 04:58:22,696] INFO: Rank 0: epoch=214 / 400 train_loss=7.6545 valid_loss=7.9303 stale=0 time=5.74m eta=1065.6m [2024-06-15 04:58:22,769] INFO: Initiating epoch #215 train run on device rank=0 [2024-06-15 05:03:00,981] INFO: Initiating epoch #215 valid run on device rank=0 [2024-06-15 05:04:06,937] INFO: Rank 0: epoch=215 / 400 train_loss=7.6544 valid_loss=7.9303 stale=1 time=5.74m eta=1059.9m [2024-06-15 05:04:07,001] INFO: Initiating epoch #216 train run on device rank=0 [2024-06-15 05:08:45,139] INFO: Initiating epoch #216 valid run on device rank=0 [2024-06-15 05:09:51,413] INFO: Rank 0: epoch=216 / 400 train_loss=7.6544 valid_loss=7.9302 stale=0 time=5.74m eta=1054.2m [2024-06-15 05:09:51,461] INFO: Initiating epoch #217 train run on device rank=0 [2024-06-15 05:14:29,676] INFO: 
Initiating epoch #217 valid run on device rank=0 [2024-06-15 05:15:35,747] INFO: Rank 0: epoch=217 / 400 train_loss=7.6544 valid_loss=7.9302 stale=1 time=5.74m eta=1048.5m [2024-06-15 05:15:35,813] INFO: Initiating epoch #218 train run on device rank=0 [2024-06-15 05:20:14,090] INFO: Initiating epoch #218 valid run on device rank=0 [2024-06-15 05:21:19,926] INFO: Rank 0: epoch=218 / 400 train_loss=7.6543 valid_loss=7.9302 stale=2 time=5.74m eta=1042.7m [2024-06-15 05:21:19,977] INFO: Initiating epoch #219 train run on device rank=0 [2024-06-15 05:25:58,275] INFO: Initiating epoch #219 valid run on device rank=0 [2024-06-15 05:27:04,939] INFO: Rank 0: epoch=219 / 400 train_loss=7.6543 valid_loss=7.9301 stale=0 time=5.75m eta=1037.0m [2024-06-15 05:27:05,043] INFO: Initiating epoch #220 train run on device rank=0 [2024-06-15 05:31:43,668] INFO: Initiating epoch #220 valid run on device rank=0 [2024-06-15 05:32:50,021] INFO: Rank 0: epoch=220 / 400 train_loss=7.6543 valid_loss=7.9300 stale=0 time=5.75m eta=1031.3m [2024-06-15 05:32:50,072] INFO: Initiating epoch #221 train run on device rank=0 [2024-06-15 05:37:28,274] INFO: Initiating epoch #221 valid run on device rank=0 [2024-06-15 05:38:34,914] INFO: Rank 0: epoch=221 / 400 train_loss=7.6542 valid_loss=7.9299 stale=0 time=5.75m eta=1025.6m [2024-06-15 05:38:34,985] INFO: Initiating epoch #222 train run on device rank=0 [2024-06-15 05:43:13,521] INFO: Initiating epoch #222 valid run on device rank=0 [2024-06-15 05:44:20,277] INFO: Rank 0: epoch=222 / 400 train_loss=7.6542 valid_loss=7.9299 stale=0 time=5.75m eta=1019.9m [2024-06-15 05:44:20,353] INFO: Initiating epoch #223 train run on device rank=0 [2024-06-15 05:48:58,394] INFO: Initiating epoch #223 valid run on device rank=0 [2024-06-15 05:50:04,743] INFO: Rank 0: epoch=223 / 400 train_loss=7.6541 valid_loss=7.9298 stale=0 time=5.74m eta=1014.2m [2024-06-15 05:50:04,813] INFO: Initiating epoch #224 train run on device rank=0 [2024-06-15 05:54:43,692] INFO: 
Initiating epoch #224 valid run on device rank=0 [2024-06-15 05:55:49,671] INFO: Rank 0: epoch=224 / 400 train_loss=7.6541 valid_loss=7.9299 stale=1 time=5.75m eta=1008.5m [2024-06-15 05:55:49,757] INFO: Initiating epoch #225 train run on device rank=0 [2024-06-15 06:00:28,507] INFO: Initiating epoch #225 valid run on device rank=0 [2024-06-15 06:01:34,543] INFO: Rank 0: epoch=225 / 400 train_loss=7.6541 valid_loss=7.9297 stale=0 time=5.75m eta=1002.8m [2024-06-15 06:01:34,588] INFO: Initiating epoch #226 train run on device rank=0 [2024-06-15 06:06:12,494] INFO: Initiating epoch #226 valid run on device rank=0 [2024-06-15 06:07:18,916] INFO: Rank 0: epoch=226 / 400 train_loss=7.6541 valid_loss=7.9298 stale=1 time=5.74m eta=997.0m [2024-06-15 06:07:18,971] INFO: Initiating epoch #227 train run on device rank=0 [2024-06-15 06:11:57,422] INFO: Initiating epoch #227 valid run on device rank=0 [2024-06-15 06:13:03,734] INFO: Rank 0: epoch=227 / 400 train_loss=7.6540 valid_loss=7.9297 stale=0 time=5.75m eta=991.3m [2024-06-15 06:13:03,798] INFO: Initiating epoch #228 train run on device rank=0 [2024-06-15 06:17:41,975] INFO: Initiating epoch #228 valid run on device rank=0 [2024-06-15 06:18:48,375] INFO: Rank 0: epoch=228 / 400 train_loss=7.6540 valid_loss=7.9297 stale=0 time=5.74m eta=985.6m [2024-06-15 06:18:48,416] INFO: Initiating epoch #229 train run on device rank=0 [2024-06-15 06:23:26,928] INFO: Initiating epoch #229 valid run on device rank=0 [2024-06-15 06:24:33,246] INFO: Rank 0: epoch=229 / 400 train_loss=7.6540 valid_loss=7.9297 stale=0 time=5.75m eta=979.9m [2024-06-15 06:24:33,314] INFO: Initiating epoch #230 train run on device rank=0 [2024-06-15 06:29:12,075] INFO: Initiating epoch #230 valid run on device rank=0 [2024-06-15 06:30:18,522] INFO: Rank 0: epoch=230 / 400 train_loss=7.6539 valid_loss=7.9297 stale=1 time=5.75m eta=974.2m [2024-06-15 06:30:18,582] INFO: Initiating epoch #231 train run on device rank=0 [2024-06-15 06:34:57,133] INFO: 
Initiating epoch #231 valid run on device rank=0 [2024-06-15 06:36:03,824] INFO: Rank 0: epoch=231 / 400 train_loss=7.6539 valid_loss=7.9296 stale=0 time=5.75m eta=968.4m [2024-06-15 06:36:03,876] INFO: Initiating epoch #232 train run on device rank=0 [2024-06-15 06:40:42,034] INFO: Initiating epoch #232 valid run on device rank=0 [2024-06-15 06:41:48,421] INFO: Rank 0: epoch=232 / 400 train_loss=7.6539 valid_loss=7.9296 stale=1 time=5.74m eta=962.7m [2024-06-15 06:41:48,472] INFO: Initiating epoch #233 train run on device rank=0 [2024-06-15 06:46:26,575] INFO: Initiating epoch #233 valid run on device rank=0 [2024-06-15 06:47:32,859] INFO: Rank 0: epoch=233 / 400 train_loss=7.6539 valid_loss=7.9296 stale=2 time=5.74m eta=957.0m [2024-06-15 06:47:32,925] INFO: Initiating epoch #234 train run on device rank=0 [2024-06-15 06:52:11,612] INFO: Initiating epoch #234 valid run on device rank=0 [2024-06-15 06:53:17,659] INFO: Rank 0: epoch=234 / 400 train_loss=7.6538 valid_loss=7.9296 stale=3 time=5.75m eta=951.3m [2024-06-15 06:53:17,709] INFO: Initiating epoch #235 train run on device rank=0 [2024-06-15 06:57:56,594] INFO: Initiating epoch #235 valid run on device rank=0 [2024-06-15 06:59:02,867] INFO: Rank 0: epoch=235 / 400 train_loss=7.6538 valid_loss=7.9295 stale=0 time=5.75m eta=945.6m [2024-06-15 06:59:02,941] INFO: Initiating epoch #236 train run on device rank=0 [2024-06-15 07:03:41,040] INFO: Initiating epoch #236 valid run on device rank=0 [2024-06-15 07:04:46,986] INFO: Rank 0: epoch=236 / 400 train_loss=7.6538 valid_loss=7.9295 stale=1 time=5.73m eta=939.8m [2024-06-15 07:04:47,072] INFO: Initiating epoch #237 train run on device rank=0 [2024-06-15 07:09:25,177] INFO: Initiating epoch #237 valid run on device rank=0 [2024-06-15 07:10:31,077] INFO: Rank 0: epoch=237 / 400 train_loss=7.6538 valid_loss=7.9295 stale=2 time=5.73m eta=934.1m [2024-06-15 07:10:31,115] INFO: Initiating epoch #238 train run on device rank=0 [2024-06-15 07:15:09,346] INFO: Initiating 
epoch #238 valid run on device rank=0 [2024-06-15 07:16:15,994] INFO: Rank 0: epoch=238 / 400 train_loss=7.6537 valid_loss=7.9294 stale=0 time=5.75m eta=928.4m [2024-06-15 07:16:16,044] INFO: Initiating epoch #239 train run on device rank=0 [2024-06-15 07:20:54,410] INFO: Initiating epoch #239 valid run on device rank=0 [2024-06-15 07:22:00,911] INFO: Rank 0: epoch=239 / 400 train_loss=7.6537 valid_loss=7.9295 stale=1 time=5.75m eta=922.7m [2024-06-15 07:22:00,957] INFO: Initiating epoch #240 train run on device rank=0 [2024-06-15 07:26:39,651] INFO: Initiating epoch #240 valid run on device rank=0 [2024-06-15 07:27:45,966] INFO: Rank 0: epoch=240 / 400 train_loss=7.6536 valid_loss=7.9294 stale=2 time=5.75m eta=917.0m [2024-06-15 07:27:46,037] INFO: Initiating epoch #241 train run on device rank=0 [2024-06-15 07:32:24,842] INFO: Initiating epoch #241 valid run on device rank=0 [2024-06-15 07:33:31,339] INFO: Rank 0: epoch=241 / 400 train_loss=7.6537 valid_loss=7.9294 stale=0 time=5.76m eta=911.2m [2024-06-15 07:33:31,376] INFO: Initiating epoch #242 train run on device rank=0 [2024-06-15 07:38:10,082] INFO: Initiating epoch #242 valid run on device rank=0 [2024-06-15 07:39:16,413] INFO: Rank 0: epoch=242 / 400 train_loss=7.6536 valid_loss=7.9294 stale=1 time=5.75m eta=905.5m [2024-06-15 07:39:16,489] INFO: Initiating epoch #243 train run on device rank=0 [2024-06-15 07:43:54,805] INFO: Initiating epoch #243 valid run on device rank=0 [2024-06-15 07:45:01,021] INFO: Rank 0: epoch=243 / 400 train_loss=7.6536 valid_loss=7.9295 stale=2 time=5.74m eta=899.8m [2024-06-15 07:45:01,054] INFO: Initiating epoch #244 train run on device rank=0 [2024-06-15 07:49:39,329] INFO: Initiating epoch #244 valid run on device rank=0 [2024-06-15 07:50:46,118] INFO: Rank 0: epoch=244 / 400 train_loss=7.6536 valid_loss=7.9294 stale=0 time=5.75m eta=894.1m [2024-06-15 07:50:46,163] INFO: Initiating epoch #245 train run on device rank=0 [2024-06-15 07:55:24,578] INFO: Initiating epoch #245 
valid run on device rank=0 [2024-06-15 07:56:30,976] INFO: Rank 0: epoch=245 / 400 train_loss=7.6536 valid_loss=7.9294 stale=1 time=5.75m eta=888.4m [2024-06-15 07:56:31,019] INFO: Initiating epoch #246 train run on device rank=0 [2024-06-15 08:01:09,444] INFO: Initiating epoch #246 valid run on device rank=0 [2024-06-15 08:02:15,758] INFO: Rank 0: epoch=246 / 400 train_loss=7.6535 valid_loss=7.9294 stale=2 time=5.75m eta=882.6m [2024-06-15 08:02:15,811] INFO: Initiating epoch #247 train run on device rank=0 [2024-06-15 08:06:54,166] INFO: Initiating epoch #247 valid run on device rank=0 [2024-06-15 08:08:00,287] INFO: Rank 0: epoch=247 / 400 train_loss=7.6535 valid_loss=7.9294 stale=3 time=5.74m eta=876.9m [2024-06-15 08:08:00,522] INFO: Initiating epoch #248 train run on device rank=0 [2024-06-15 08:12:38,892] INFO: Initiating epoch #248 valid run on device rank=0 [2024-06-15 08:13:45,067] INFO: Rank 0: epoch=248 / 400 train_loss=7.6535 valid_loss=7.9294 stale=4 time=5.74m eta=871.2m [2024-06-15 08:13:45,113] INFO: Initiating epoch #249 train run on device rank=0 [2024-06-15 08:18:23,244] INFO: Initiating epoch #249 valid run on device rank=0 [2024-06-15 08:19:29,579] INFO: Rank 0: epoch=249 / 400 train_loss=7.6535 valid_loss=7.9294 stale=5 time=5.74m eta=865.5m [2024-06-15 08:19:29,790] INFO: Initiating epoch #250 train run on device rank=0 [2024-06-15 08:24:07,986] INFO: Initiating epoch #250 valid run on device rank=0 [2024-06-15 08:25:14,003] INFO: Rank 0: epoch=250 / 400 train_loss=7.6535 valid_loss=7.9294 stale=6 time=5.74m eta=859.7m [2024-06-15 08:25:14,191] INFO: Initiating epoch #251 train run on device rank=0 [2024-06-15 08:29:52,279] INFO: Initiating epoch #251 valid run on device rank=0 [2024-06-15 08:30:58,661] INFO: Rank 0: epoch=251 / 400 train_loss=7.6535 valid_loss=7.9294 stale=7 time=5.74m eta=854.0m [2024-06-15 08:30:58,862] INFO: Initiating epoch #252 train run on device rank=0 [2024-06-15 08:35:37,530] INFO: Initiating epoch #252 valid run 
on device rank=0 [2024-06-15 08:36:43,735] INFO: Rank 0: epoch=252 / 400 train_loss=7.6534 valid_loss=7.9294 stale=8 time=5.75m eta=848.3m [2024-06-15 08:36:44,021] INFO: Initiating epoch #253 train run on device rank=0 [2024-06-15 08:41:22,366] INFO: Initiating epoch #253 valid run on device rank=0 [2024-06-15 08:42:29,167] INFO: Rank 0: epoch=253 / 400 train_loss=7.6534 valid_loss=7.9293 stale=0 time=5.75m eta=842.6m [2024-06-15 08:42:29,465] INFO: Initiating epoch #254 train run on device rank=0 [2024-06-15 08:47:07,766] INFO: Initiating epoch #254 valid run on device rank=0 [2024-06-15 08:48:14,009] INFO: Rank 0: epoch=254 / 400 train_loss=7.6534 valid_loss=7.9294 stale=1 time=5.74m eta=836.9m [2024-06-15 08:48:14,175] INFO: Initiating epoch #255 train run on device rank=0 [2024-06-15 08:52:52,074] INFO: Initiating epoch #255 valid run on device rank=0 [2024-06-15 08:53:57,897] INFO: Rank 0: epoch=255 / 400 train_loss=7.6534 valid_loss=7.9294 stale=2 time=5.73m eta=831.1m [2024-06-15 08:53:58,096] INFO: Initiating epoch #256 train run on device rank=0 [2024-06-15 08:58:36,232] INFO: Initiating epoch #256 valid run on device rank=0 [2024-06-15 08:59:42,716] INFO: Rank 0: epoch=256 / 400 train_loss=7.6534 valid_loss=7.9293 stale=0 time=5.74m eta=825.4m [2024-06-15 08:59:42,900] INFO: Initiating epoch #257 train run on device rank=0 [2024-06-15 09:04:20,867] INFO: Initiating epoch #257 valid run on device rank=0 [2024-06-15 09:05:27,260] INFO: Rank 0: epoch=257 / 400 train_loss=7.6534 valid_loss=7.9293 stale=0 time=5.74m eta=819.7m [2024-06-15 09:05:27,429] INFO: Initiating epoch #258 train run on device rank=0 [2024-06-15 09:10:05,119] INFO: Initiating epoch #258 valid run on device rank=0 [2024-06-15 09:11:11,671] INFO: Rank 0: epoch=258 / 400 train_loss=7.6533 valid_loss=7.9292 stale=0 time=5.74m eta=814.0m [2024-06-15 09:11:11,851] INFO: Initiating epoch #259 train run on device rank=0 [2024-06-15 09:15:50,568] INFO: Initiating epoch #259 valid run on device 
rank=0 [2024-06-15 09:16:56,790] INFO: Rank 0: epoch=259 / 400 train_loss=7.6533 valid_loss=7.9292 stale=0 time=5.75m eta=808.2m [2024-06-15 09:16:57,080] INFO: Initiating epoch #260 train run on device rank=0 [2024-06-15 09:21:35,581] INFO: Initiating epoch #260 valid run on device rank=0 [2024-06-15 09:22:42,076] INFO: Rank 0: epoch=260 / 400 train_loss=7.6533 valid_loss=7.9292 stale=0 time=5.75m eta=802.5m [2024-06-15 09:22:42,261] INFO: Initiating epoch #261 train run on device rank=0 [2024-06-15 09:27:20,317] INFO: Initiating epoch #261 valid run on device rank=0 [2024-06-15 09:28:26,886] INFO: Rank 0: epoch=261 / 400 train_loss=7.6532 valid_loss=7.9291 stale=0 time=5.74m eta=796.8m [2024-06-15 09:28:27,057] INFO: Initiating epoch #262 train run on device rank=0 [2024-06-15 09:33:05,299] INFO: Initiating epoch #262 valid run on device rank=0 [2024-06-15 09:34:12,128] INFO: Rank 0: epoch=262 / 400 train_loss=7.6533 valid_loss=7.9291 stale=0 time=5.75m eta=791.1m [2024-06-15 09:34:12,401] INFO: Initiating epoch #263 train run on device rank=0 [2024-06-15 09:38:50,510] INFO: Initiating epoch #263 valid run on device rank=0 [2024-06-15 09:39:56,804] INFO: Rank 0: epoch=263 / 400 train_loss=7.6533 valid_loss=7.9291 stale=1 time=5.74m eta=785.3m [2024-06-15 09:39:57,073] INFO: Initiating epoch #264 train run on device rank=0 [2024-06-15 09:44:35,678] INFO: Initiating epoch #264 valid run on device rank=0 [2024-06-15 09:45:41,696] INFO: Rank 0: epoch=264 / 400 train_loss=7.6532 valid_loss=7.9291 stale=2 time=5.74m eta=779.6m [2024-06-15 09:45:41,918] INFO: Initiating epoch #265 train run on device rank=0 [2024-06-15 09:50:20,093] INFO: Initiating epoch #265 valid run on device rank=0 [2024-06-15 09:51:26,147] INFO: Rank 0: epoch=265 / 400 train_loss=7.6532 valid_loss=7.9291 stale=3 time=5.74m eta=773.9m [2024-06-15 09:51:26,344] INFO: Initiating epoch #266 train run on device rank=0 [2024-06-15 09:56:04,797] INFO: Initiating epoch #266 valid run on device rank=0 
[2024-06-15 09:57:10,837] INFO: Rank 0: epoch=266 / 400 train_loss=7.6532 valid_loss=7.9291 stale=0 time=5.74m eta=768.2m [2024-06-15 09:57:10,997] INFO: Initiating epoch #267 train run on device rank=0 [2024-06-15 10:01:49,396] INFO: Initiating epoch #267 valid run on device rank=0 [2024-06-15 10:02:55,873] INFO: Rank 0: epoch=267 / 400 train_loss=7.6532 valid_loss=7.9290 stale=0 time=5.75m eta=762.4m [2024-06-15 10:02:56,035] INFO: Initiating epoch #268 train run on device rank=0 [2024-06-15 10:07:34,091] INFO: Initiating epoch #268 valid run on device rank=0 [2024-06-15 10:08:40,116] INFO: Rank 0: epoch=268 / 400 train_loss=7.6532 valid_loss=7.9291 stale=1 time=5.73m eta=756.7m [2024-06-15 10:08:40,295] INFO: Initiating epoch #269 train run on device rank=0 [2024-06-15 10:13:18,763] INFO: Initiating epoch #269 valid run on device rank=0 [2024-06-15 10:14:25,083] INFO: Rank 0: epoch=269 / 400 train_loss=7.6531 valid_loss=7.9290 stale=0 time=5.75m eta=751.0m [2024-06-15 10:14:25,252] INFO: Initiating epoch #270 train run on device rank=0 [2024-06-15 10:19:03,629] INFO: Initiating epoch #270 valid run on device rank=0 [2024-06-15 10:20:10,048] INFO: Rank 0: epoch=270 / 400 train_loss=7.6532 valid_loss=7.9290 stale=0 time=5.75m eta=745.3m [2024-06-15 10:20:10,254] INFO: Initiating epoch #271 train run on device rank=0 [2024-06-15 10:24:48,766] INFO: Initiating epoch #271 valid run on device rank=0 [2024-06-15 10:25:54,615] INFO: Rank 0: epoch=271 / 400 train_loss=7.6531 valid_loss=7.9290 stale=1 time=5.74m eta=739.5m [2024-06-15 10:25:54,785] INFO: Initiating epoch #272 train run on device rank=0 [2024-06-15 10:30:32,834] INFO: Initiating epoch #272 valid run on device rank=0 [2024-06-15 10:31:38,348] INFO: Rank 0: epoch=272 / 400 train_loss=7.6531 valid_loss=7.9290 stale=2 time=5.73m eta=733.8m [2024-06-15 10:31:38,510] INFO: Initiating epoch #273 train run on device rank=0 [2024-06-15 10:36:16,774] INFO: Initiating epoch #273 valid run on device rank=0 [2024-06-15 
10:37:22,628] INFO: Rank 0: epoch=273 / 400 train_loss=7.6531 valid_loss=7.9290 stale=3 time=5.74m eta=728.1m [2024-06-15 10:37:22,889] INFO: Initiating epoch #274 train run on device rank=0 [2024-06-15 10:42:01,175] INFO: Initiating epoch #274 valid run on device rank=0 [2024-06-15 10:43:07,437] INFO: Rank 0: epoch=274 / 400 train_loss=7.6531 valid_loss=7.9289 stale=0 time=5.74m eta=722.3m [2024-06-15 10:43:07,663] INFO: Initiating epoch #275 train run on device rank=0 [2024-06-15 10:47:45,966] INFO: Initiating epoch #275 valid run on device rank=0 [2024-06-15 10:48:51,678] INFO: Rank 0: epoch=275 / 400 train_loss=7.6531 valid_loss=7.9289 stale=1 time=5.73m eta=716.6m [2024-06-15 10:48:51,836] INFO: Initiating epoch #276 train run on device rank=0 [2024-06-15 10:53:29,912] INFO: Initiating epoch #276 valid run on device rank=0 [2024-06-15 10:54:35,912] INFO: Rank 0: epoch=276 / 400 train_loss=7.6531 valid_loss=7.9290 stale=2 time=5.73m eta=710.9m [2024-06-15 10:54:36,090] INFO: Initiating epoch #277 train run on device rank=0 [2024-06-15 10:59:13,933] INFO: Initiating epoch #277 valid run on device rank=0 [2024-06-15 11:00:19,718] INFO: Rank 0: epoch=277 / 400 train_loss=7.6531 valid_loss=7.9289 stale=3 time=5.73m eta=705.1m [2024-06-15 11:00:19,824] INFO: Initiating epoch #278 train run on device rank=0 [2024-06-15 11:04:58,107] INFO: Initiating epoch #278 valid run on device rank=0 [2024-06-15 11:06:03,788] INFO: Rank 0: epoch=278 / 400 train_loss=7.6531 valid_loss=7.9290 stale=4 time=5.73m eta=699.4m [2024-06-15 11:06:04,034] INFO: Initiating epoch #279 train run on device rank=0 [2024-06-15 11:10:42,093] INFO: Initiating epoch #279 valid run on device rank=0 [2024-06-15 11:11:48,362] INFO: Rank 0: epoch=279 / 400 train_loss=7.6530 valid_loss=7.9290 stale=5 time=5.74m eta=693.7m [2024-06-15 11:11:48,541] INFO: Initiating epoch #280 train run on device rank=0 [2024-06-15 11:16:26,876] INFO: Initiating epoch #280 valid run on device rank=0 [2024-06-15 
11:17:32,872] INFO: Rank 0: epoch=280 / 400 train_loss=7.6530 valid_loss=7.9289 stale=6 time=5.74m eta=688.0m [2024-06-15 11:17:33,056] INFO: Initiating epoch #281 train run on device rank=0 [2024-06-15 11:22:10,978] INFO: Initiating epoch #281 valid run on device rank=0 [2024-06-15 11:23:16,801] INFO: Rank 0: epoch=281 / 400 train_loss=7.6530 valid_loss=7.9290 stale=7 time=5.73m eta=682.2m [2024-06-15 11:23:16,983] INFO: Initiating epoch #282 train run on device rank=0 [2024-06-15 11:27:55,175] INFO: Initiating epoch #282 valid run on device rank=0 [2024-06-15 11:29:01,252] INFO: Rank 0: epoch=282 / 400 train_loss=7.6530 valid_loss=7.9290 stale=8 time=5.74m eta=676.5m [2024-06-15 11:29:01,435] INFO: Initiating epoch #283 train run on device rank=0 [2024-06-15 11:33:39,674] INFO: Initiating epoch #283 valid run on device rank=0 [2024-06-15 11:34:45,788] INFO: Rank 0: epoch=283 / 400 train_loss=7.6530 valid_loss=7.9289 stale=0 time=5.74m eta=670.8m [2024-06-15 11:34:45,967] INFO: Initiating epoch #284 train run on device rank=0 [2024-06-15 11:39:23,952] INFO: Initiating epoch #284 valid run on device rank=0 [2024-06-15 11:40:30,186] INFO: Rank 0: epoch=284 / 400 train_loss=7.6530 valid_loss=7.9289 stale=0 time=5.74m eta=665.0m [2024-06-15 11:40:30,381] INFO: Initiating epoch #285 train run on device rank=0 [2024-06-15 11:45:08,498] INFO: Initiating epoch #285 valid run on device rank=0 [2024-06-15 11:46:14,623] INFO: Rank 0: epoch=285 / 400 train_loss=7.6530 valid_loss=7.9289 stale=1 time=5.74m eta=659.3m [2024-06-15 11:46:14,785] INFO: Initiating epoch #286 train run on device rank=0 [2024-06-15 11:50:52,904] INFO: Initiating epoch #286 valid run on device rank=0 [2024-06-15 11:51:58,827] INFO: Rank 0: epoch=286 / 400 train_loss=7.6530 valid_loss=7.9289 stale=2 time=5.73m eta=653.6m [2024-06-15 11:51:59,041] INFO: Initiating epoch #287 train run on device rank=0 [2024-06-15 11:56:36,880] INFO: Initiating epoch #287 valid run on device rank=0 [2024-06-15 
11:57:43,046] INFO: Rank 0: epoch=287 / 400 train_loss=7.6529 valid_loss=7.9289 stale=0 time=5.73m eta=647.8m [2024-06-15 11:57:43,254] INFO: Initiating epoch #288 train run on device rank=0 [2024-06-15 12:02:20,730] INFO: Initiating epoch #288 valid run on device rank=0 [2024-06-15 12:03:26,983] INFO: Rank 0: epoch=288 / 400 train_loss=7.6530 valid_loss=7.9288 stale=0 time=5.73m eta=642.1m [2024-06-15 12:03:27,190] INFO: Initiating epoch #289 train run on device rank=0 [2024-06-15 12:08:04,834] INFO: Initiating epoch #289 valid run on device rank=0 [2024-06-15 12:09:10,689] INFO: Rank 0: epoch=289 / 400 train_loss=7.6530 valid_loss=7.9290 stale=1 time=5.72m eta=636.4m [2024-06-15 12:09:10,884] INFO: Initiating epoch #290 train run on device rank=0 [2024-06-15 12:13:48,360] INFO: Initiating epoch #290 valid run on device rank=0 [2024-06-15 12:14:54,416] INFO: Rank 0: epoch=290 / 400 train_loss=7.6530 valid_loss=7.9288 stale=0 time=5.73m eta=630.6m [2024-06-15 12:14:54,654] INFO: Initiating epoch #291 train run on device rank=0 [2024-06-15 12:19:32,435] INFO: Initiating epoch #291 valid run on device rank=0 [2024-06-15 12:20:38,443] INFO: Rank 0: epoch=291 / 400 train_loss=7.6530 valid_loss=7.9288 stale=1 time=5.73m eta=624.9m [2024-06-15 12:20:38,622] INFO: Initiating epoch #292 train run on device rank=0 [2024-06-15 12:25:16,305] INFO: Initiating epoch #292 valid run on device rank=0 [2024-06-15 12:26:22,593] INFO: Rank 0: epoch=292 / 400 train_loss=7.6529 valid_loss=7.9287 stale=0 time=5.73m eta=619.2m [2024-06-15 12:26:22,807] INFO: Initiating epoch #293 train run on device rank=0 [2024-06-15 12:31:00,463] INFO: Initiating epoch #293 valid run on device rank=0 [2024-06-15 12:32:06,199] INFO: Rank 0: epoch=293 / 400 train_loss=7.6529 valid_loss=7.9288 stale=1 time=5.72m eta=613.4m [2024-06-15 12:32:06,467] INFO: Initiating epoch #294 train run on device rank=0 [2024-06-15 12:36:44,016] INFO: Initiating epoch #294 valid run on device rank=0 [2024-06-15 
12:37:50,000] INFO: Rank 0: epoch=294 / 400 train_loss=7.6529 valid_loss=7.9287 stale=0 time=5.73m eta=607.7m [2024-06-15 12:37:50,202] INFO: Initiating epoch #295 train run on device rank=0 [2024-06-15 12:42:27,857] INFO: Initiating epoch #295 valid run on device rank=0 [2024-06-15 12:43:34,152] INFO: Rank 0: epoch=295 / 400 train_loss=7.6529 valid_loss=7.9287 stale=0 time=5.73m eta=602.0m [2024-06-15 12:43:34,351] INFO: Initiating epoch #296 train run on device rank=0 [2024-06-15 12:48:12,290] INFO: Initiating epoch #296 valid run on device rank=0 [2024-06-15 12:49:18,427] INFO: Rank 0: epoch=296 / 400 train_loss=7.6529 valid_loss=7.9287 stale=0 time=5.73m eta=596.2m [2024-06-15 12:49:18,591] INFO: Initiating epoch #297 train run on device rank=0 [2024-06-15 12:53:56,173] INFO: Initiating epoch #297 valid run on device rank=0 [2024-06-15 12:55:02,370] INFO: Rank 0: epoch=297 / 400 train_loss=7.6529 valid_loss=7.9286 stale=0 time=5.73m eta=590.5m [2024-06-15 12:55:02,615] INFO: Initiating epoch #298 train run on device rank=0 [2024-06-15 12:59:40,243] INFO: Initiating epoch #298 valid run on device rank=0 [2024-06-15 13:00:46,173] INFO: Rank 0: epoch=298 / 400 train_loss=7.6528 valid_loss=7.9286 stale=1 time=5.73m eta=584.8m [2024-06-15 13:00:46,385] INFO: Initiating epoch #299 train run on device rank=0 [2024-06-15 13:05:23,919] INFO: Initiating epoch #299 valid run on device rank=0 [2024-06-15 13:06:29,933] INFO: Rank 0: epoch=299 / 400 train_loss=7.6529 valid_loss=7.9287 stale=2 time=5.73m eta=579.0m [2024-06-15 13:06:30,197] INFO: Initiating epoch #300 train run on device rank=0 [2024-06-15 13:11:07,908] INFO: Initiating epoch #300 valid run on device rank=0 [2024-06-15 13:12:13,903] INFO: Rank 0: epoch=300 / 400 train_loss=7.6529 valid_loss=7.9286 stale=0 time=5.73m eta=573.3m [2024-06-15 13:12:14,105] INFO: Initiating epoch #301 train run on device rank=0 [2024-06-15 13:16:51,815] INFO: Initiating epoch #301 valid run on device rank=0 [2024-06-15 
13:17:58,111] INFO: Rank 0: epoch=301 / 400 train_loss=7.6528 valid_loss=7.9286 stale=0 time=5.73m eta=567.6m [2024-06-15 13:17:58,381] INFO: Initiating epoch #302 train run on device rank=0 [2024-06-15 13:22:35,917] INFO: Initiating epoch #302 valid run on device rank=0 [2024-06-15 13:23:42,319] INFO: Rank 0: epoch=302 / 400 train_loss=7.6528 valid_loss=7.9286 stale=1 time=5.73m eta=561.8m [2024-06-15 13:23:42,535] INFO: Initiating epoch #303 train run on device rank=0 [2024-06-15 13:28:20,010] INFO: Initiating epoch #303 valid run on device rank=0 [2024-06-15 13:29:25,871] INFO: Rank 0: epoch=303 / 400 train_loss=7.6528 valid_loss=7.9286 stale=0 time=5.72m eta=556.1m [2024-06-15 13:29:26,135] INFO: Initiating epoch #304 train run on device rank=0 [2024-06-15 13:34:04,076] INFO: Initiating epoch #304 valid run on device rank=0 [2024-06-15 13:35:11,336] INFO: Rank 0: epoch=304 / 400 train_loss=7.6528 valid_loss=7.9285 stale=0 time=5.75m eta=550.4m [2024-06-15 13:35:11,813] INFO: Initiating epoch #305 train run on device rank=0 [2024-06-15 13:39:49,401] INFO: Initiating epoch #305 valid run on device rank=0 [2024-06-15 13:40:55,171] INFO: Rank 0: epoch=305 / 400 train_loss=7.6528 valid_loss=7.9286 stale=1 time=5.72m eta=544.6m [2024-06-15 13:40:55,410] INFO: Initiating epoch #306 train run on device rank=0 [2024-06-15 13:45:33,514] INFO: Initiating epoch #306 valid run on device rank=0 [2024-06-15 13:46:39,685] INFO: Rank 0: epoch=306 / 400 train_loss=7.6527 valid_loss=7.9285 stale=0 time=5.74m eta=538.9m [2024-06-15 13:46:39,873] INFO: Initiating epoch #307 train run on device rank=0 [2024-06-15 13:51:17,815] INFO: Initiating epoch #307 valid run on device rank=0 [2024-06-15 13:52:25,220] INFO: Rank 0: epoch=307 / 400 train_loss=7.6527 valid_loss=7.9285 stale=0 time=5.76m eta=533.2m [2024-06-15 13:52:25,774] INFO: Initiating epoch #308 train run on device rank=0 [2024-06-15 13:57:02,929] INFO: Initiating epoch #308 valid run on device rank=0 [2024-06-15 
13:58:08,835] INFO: Rank 0: epoch=308 / 400 train_loss=7.6527 valid_loss=7.9285 stale=1 time=5.72m eta=527.5m [2024-06-15 13:58:09,059] INFO: Initiating epoch #309 train run on device rank=0 [2024-06-15 14:02:47,024] INFO: Initiating epoch #309 valid run on device rank=0 [2024-06-15 14:03:52,952] INFO: Rank 0: epoch=309 / 400 train_loss=7.6527 valid_loss=7.9284 stale=0 time=5.73m eta=521.7m [2024-06-15 14:03:53,266] INFO: Initiating epoch #310 train run on device rank=0 [2024-06-15 14:08:30,838] INFO: Initiating epoch #310 valid run on device rank=0 [2024-06-15 14:09:36,779] INFO: Rank 0: epoch=310 / 400 train_loss=7.6527 valid_loss=7.9285 stale=1 time=5.73m eta=516.0m [2024-06-15 14:09:37,109] INFO: Initiating epoch #311 train run on device rank=0 [2024-06-15 14:14:15,008] INFO: Initiating epoch #311 valid run on device rank=0 [2024-06-15 14:15:21,423] INFO: Rank 0: epoch=311 / 400 train_loss=7.6527 valid_loss=7.9284 stale=0 time=5.74m eta=510.3m [2024-06-15 14:15:21,816] INFO: Initiating epoch #312 train run on device rank=0 [2024-06-15 14:19:59,531] INFO: Initiating epoch #312 valid run on device rank=0 [2024-06-15 14:21:05,589] INFO: Rank 0: epoch=312 / 400 train_loss=7.6527 valid_loss=7.9284 stale=0 time=5.73m eta=504.5m [2024-06-15 14:21:05,786] INFO: Initiating epoch #313 train run on device rank=0 [2024-06-15 14:25:43,160] INFO: Initiating epoch #313 valid run on device rank=0 [2024-06-15 14:26:49,296] INFO: Rank 0: epoch=313 / 400 train_loss=7.6527 valid_loss=7.9284 stale=1 time=5.73m eta=498.8m [2024-06-15 14:26:49,528] INFO: Initiating epoch #314 train run on device rank=0 [2024-06-15 14:31:26,957] INFO: Initiating epoch #314 valid run on device rank=0 [2024-06-15 14:32:32,813] INFO: Rank 0: epoch=314 / 400 train_loss=7.6527 valid_loss=7.9284 stale=2 time=5.72m eta=493.1m [2024-06-15 14:32:33,205] INFO: Initiating epoch #315 train run on device rank=0 [2024-06-15 14:37:10,731] INFO: Initiating epoch #315 valid run on device rank=0 [2024-06-15 
14:38:16,544] INFO: Rank 0: epoch=315 / 400 train_loss=7.6526 valid_loss=7.9284 stale=3 time=5.72m eta=487.3m [2024-06-15 14:38:16,718] INFO: Initiating epoch #316 train run on device rank=0 [2024-06-15 14:42:54,458] INFO: Initiating epoch #316 valid run on device rank=0 [2024-06-15 14:44:00,199] INFO: Rank 0: epoch=316 / 400 train_loss=7.6526 valid_loss=7.9284 stale=4 time=5.72m eta=481.6m [2024-06-15 14:44:00,374] INFO: Initiating epoch #317 train run on device rank=0 [2024-06-15 14:48:37,679] INFO: Initiating epoch #317 valid run on device rank=0 [2024-06-15 14:49:43,434] INFO: Rank 0: epoch=317 / 400 train_loss=7.6526 valid_loss=7.9283 stale=0 time=5.72m eta=475.8m [2024-06-15 14:49:43,704] INFO: Initiating epoch #318 train run on device rank=0 [2024-06-15 14:54:21,272] INFO: Initiating epoch #318 valid run on device rank=0 [2024-06-15 14:55:26,803] INFO: Rank 0: epoch=318 / 400 train_loss=7.6526 valid_loss=7.9283 stale=1 time=5.72m eta=470.1m [2024-06-15 14:55:27,009] INFO: Initiating epoch #319 train run on device rank=0 [2024-06-15 15:00:04,448] INFO: Initiating epoch #319 valid run on device rank=0 [2024-06-15 15:01:10,883] INFO: Rank 0: epoch=319 / 400 train_loss=7.6526 valid_loss=7.9282 stale=0 time=5.73m eta=464.4m [2024-06-15 15:01:11,145] INFO: Initiating epoch #320 train run on device rank=0 [2024-06-15 15:05:48,406] INFO: Initiating epoch #320 valid run on device rank=0 [2024-06-15 15:06:55,514] INFO: Rank 0: epoch=320 / 400 train_loss=7.6526 valid_loss=7.9282 stale=0 time=5.74m eta=458.6m [2024-06-15 15:06:55,705] INFO: Initiating epoch #321 train run on device rank=0 [2024-06-15 15:11:32,846] INFO: Initiating epoch #321 valid run on device rank=0 [2024-06-15 15:12:38,979] INFO: Rank 0: epoch=321 / 400 train_loss=7.6526 valid_loss=7.9282 stale=0 time=5.72m eta=452.9m [2024-06-15 15:12:39,142] INFO: Initiating epoch #322 train run on device rank=0 [2024-06-15 15:17:16,655] INFO: Initiating epoch #322 valid run on device rank=0 [2024-06-15 
15:18:22,852] INFO: Rank 0: epoch=322 / 400 train_loss=7.6526 valid_loss=7.9281 stale=0 time=5.73m eta=447.2m [2024-06-15 15:18:23,029] INFO: Initiating epoch #323 train run on device rank=0 [2024-06-15 15:23:01,294] INFO: Initiating epoch #323 valid run on device rank=0 [2024-06-15 15:24:07,348] INFO: Rank 0: epoch=323 / 400 train_loss=7.6526 valid_loss=7.9281 stale=0 time=5.74m eta=441.5m [2024-06-15 15:24:07,548] INFO: Initiating epoch #324 train run on device rank=0 [2024-06-15 15:28:45,420] INFO: Initiating epoch #324 valid run on device rank=0 [2024-06-15 15:29:51,525] INFO: Rank 0: epoch=324 / 400 train_loss=7.6526 valid_loss=7.9282 stale=1 time=5.73m eta=435.7m [2024-06-15 15:29:51,784] INFO: Initiating epoch #325 train run on device rank=0 [2024-06-15 15:34:30,141] INFO: Initiating epoch #325 valid run on device rank=0 [2024-06-15 15:35:36,111] INFO: Rank 0: epoch=325 / 400 train_loss=7.6526 valid_loss=7.9282 stale=2 time=5.74m eta=430.0m [2024-06-15 15:35:36,307] INFO: Initiating epoch #326 train run on device rank=0 [2024-06-15 15:40:14,552] INFO: Initiating epoch #326 valid run on device rank=0 [2024-06-15 15:41:20,448] INFO: Rank 0: epoch=326 / 400 train_loss=7.6526 valid_loss=7.9282 stale=3 time=5.74m eta=424.3m [2024-06-15 15:41:20,665] INFO: Initiating epoch #327 train run on device rank=0 [2024-06-15 15:45:58,980] INFO: Initiating epoch #327 valid run on device rank=0 [2024-06-15 15:47:05,308] INFO: Rank 0: epoch=327 / 400 train_loss=7.6525 valid_loss=7.9281 stale=0 time=5.74m eta=418.5m [2024-06-15 15:47:05,506] INFO: Initiating epoch #328 train run on device rank=0 [2024-06-15 15:51:43,853] INFO: Initiating epoch #328 valid run on device rank=0 [2024-06-15 15:52:50,058] INFO: Rank 0: epoch=328 / 400 train_loss=7.6525 valid_loss=7.9281 stale=0 time=5.74m eta=412.8m [2024-06-15 15:52:50,244] INFO: Initiating epoch #329 train run on device rank=0 [2024-06-15 15:57:28,854] INFO: Initiating epoch #329 valid run on device rank=0 [2024-06-15 
15:58:35,080] INFO: Rank 0: epoch=329 / 400 train_loss=7.6525 valid_loss=7.9280 stale=0 time=5.75m eta=407.1m [2024-06-15 15:58:35,293] INFO: Initiating epoch #330 train run on device rank=0 [2024-06-15 16:03:13,630] INFO: Initiating epoch #330 valid run on device rank=0 [2024-06-15 16:04:19,939] INFO: Rank 0: epoch=330 / 400 train_loss=7.6525 valid_loss=7.9280 stale=0 time=5.74m eta=401.3m [2024-06-15 16:04:20,157] INFO: Initiating epoch #331 train run on device rank=0 [2024-06-15 16:08:58,457] INFO: Initiating epoch #331 valid run on device rank=0 [2024-06-15 16:10:04,776] INFO: Rank 0: epoch=331 / 400 train_loss=7.6525 valid_loss=7.9280 stale=0 time=5.74m eta=395.6m [2024-06-15 16:10:05,053] INFO: Initiating epoch #332 train run on device rank=0 [2024-06-15 16:14:43,497] INFO: Initiating epoch #332 valid run on device rank=0 [2024-06-15 16:15:49,364] INFO: Rank 0: epoch=332 / 400 train_loss=7.6525 valid_loss=7.9280 stale=1 time=5.74m eta=389.9m [2024-06-15 16:15:49,567] INFO: Initiating epoch #333 train run on device rank=0 [2024-06-15 16:20:27,805] INFO: Initiating epoch #333 valid run on device rank=0 [2024-06-15 16:21:33,702] INFO: Rank 0: epoch=333 / 400 train_loss=7.6525 valid_loss=7.9280 stale=2 time=5.74m eta=384.1m [2024-06-15 16:21:33,889] INFO: Initiating epoch #334 train run on device rank=0 [2024-06-15 16:26:12,313] INFO: Initiating epoch #334 valid run on device rank=0 [2024-06-15 16:27:19,860] INFO: Rank 0: epoch=334 / 400 train_loss=7.6525 valid_loss=7.9280 stale=0 time=5.77m eta=378.4m [2024-06-15 16:27:20,378] INFO: Initiating epoch #335 train run on device rank=0 [2024-06-15 16:31:58,605] INFO: Initiating epoch #335 valid run on device rank=0 [2024-06-15 16:33:05,598] INFO: Rank 0: epoch=335 / 400 train_loss=7.6525 valid_loss=7.9280 stale=0 time=5.75m eta=372.7m [2024-06-15 16:33:06,197] INFO: Initiating epoch #336 train run on device rank=0 [2024-06-15 16:37:44,593] INFO: Initiating epoch #336 valid run on device rank=0 [2024-06-15 
16:38:50,556] INFO: Rank 0: epoch=336 / 400 train_loss=7.6525 valid_loss=7.9280 stale=1 time=5.74m eta=367.0m [2024-06-15 16:38:50,728] INFO: Initiating epoch #337 train run on device rank=0 [2024-06-15 16:43:28,998] INFO: Initiating epoch #337 valid run on device rank=0 [2024-06-15 16:44:35,399] INFO: Rank 0: epoch=337 / 400 train_loss=7.6525 valid_loss=7.9279 stale=0 time=5.74m eta=361.2m [2024-06-15 16:44:35,745] INFO: Initiating epoch #338 train run on device rank=0 [2024-06-15 16:49:14,194] INFO: Initiating epoch #338 valid run on device rank=0 [2024-06-15 16:50:20,498] INFO: Rank 0: epoch=338 / 400 train_loss=7.6525 valid_loss=7.9278 stale=0 time=5.75m eta=355.5m [2024-06-15 16:50:20,673] INFO: Initiating epoch #339 train run on device rank=0 [2024-06-15 16:54:59,218] INFO: Initiating epoch #339 valid run on device rank=0 [2024-06-15 16:56:05,302] INFO: Rank 0: epoch=339 / 400 train_loss=7.6525 valid_loss=7.9279 stale=1 time=5.74m eta=349.8m [2024-06-15 16:56:05,617] INFO: Initiating epoch #340 train run on device rank=0 [2024-06-15 17:00:44,148] INFO: Initiating epoch #340 valid run on device rank=0 [2024-06-15 17:01:50,904] INFO: Rank 0: epoch=340 / 400 train_loss=7.6524 valid_loss=7.9279 stale=2 time=5.75m eta=344.0m [2024-06-15 17:01:51,520] INFO: Initiating epoch #341 train run on device rank=0 [2024-06-15 17:06:29,866] INFO: Initiating epoch #341 valid run on device rank=0 [2024-06-15 17:07:36,263] INFO: Rank 0: epoch=341 / 400 train_loss=7.6525 valid_loss=7.9278 stale=0 time=5.75m eta=338.3m [2024-06-15 17:07:36,441] INFO: Initiating epoch #342 train run on device rank=0 [2024-06-15 17:12:14,676] INFO: Initiating epoch #342 valid run on device rank=0 [2024-06-15 17:13:22,079] INFO: Rank 0: epoch=342 / 400 train_loss=7.6524 valid_loss=7.9278 stale=0 time=5.76m eta=332.6m [2024-06-15 17:13:22,475] INFO: Initiating epoch #343 train run on device rank=0 [2024-06-15 17:18:00,322] INFO: Initiating epoch #343 valid run on device rank=0 [2024-06-15 
17:19:06,547] INFO: Rank 0: epoch=343 / 400 train_loss=7.6524 valid_loss=7.9278 stale=1 time=5.73m eta=326.8m [2024-06-15 17:19:06,808] INFO: Initiating epoch #344 train run on device rank=0 [2024-06-15 17:23:45,136] INFO: Initiating epoch #344 valid run on device rank=0 [2024-06-15 17:24:51,114] INFO: Rank 0: epoch=344 / 400 train_loss=7.6524 valid_loss=7.9279 stale=2 time=5.74m eta=321.1m [2024-06-15 17:24:51,312] INFO: Initiating epoch #345 train run on device rank=0 [2024-06-15 17:29:29,853] INFO: Initiating epoch #345 valid run on device rank=0 [2024-06-15 17:30:36,219] INFO: Rank 0: epoch=345 / 400 train_loss=7.6524 valid_loss=7.9278 stale=3 time=5.75m eta=315.4m [2024-06-15 17:30:36,403] INFO: Initiating epoch #346 train run on device rank=0 [2024-06-15 17:35:14,691] INFO: Initiating epoch #346 valid run on device rank=0 [2024-06-15 17:36:21,199] INFO: Rank 0: epoch=346 / 400 train_loss=7.6524 valid_loss=7.9278 stale=0 time=5.75m eta=309.6m [2024-06-15 17:36:21,366] INFO: Initiating epoch #347 train run on device rank=0 [2024-06-15 17:40:59,603] INFO: Initiating epoch #347 valid run on device rank=0 [2024-06-15 17:42:06,057] INFO: Rank 0: epoch=347 / 400 train_loss=7.6524 valid_loss=7.9278 stale=1 time=5.74m eta=303.9m [2024-06-15 17:42:06,333] INFO: Initiating epoch #348 train run on device rank=0 [2024-06-15 17:46:44,873] INFO: Initiating epoch #348 valid run on device rank=0 [2024-06-15 17:47:51,546] INFO: Rank 0: epoch=348 / 400 train_loss=7.6524 valid_loss=7.9277 stale=0 time=5.75m eta=298.2m [2024-06-15 17:47:51,783] INFO: Initiating epoch #349 train run on device rank=0 [2024-06-15 17:52:29,997] INFO: Initiating epoch #349 valid run on device rank=0 [2024-06-15 17:53:36,037] INFO: Rank 0: epoch=349 / 400 train_loss=7.6524 valid_loss=7.9277 stale=1 time=5.74m eta=292.4m [2024-06-15 17:53:36,361] INFO: Initiating epoch #350 train run on device rank=0 [2024-06-15 17:58:14,344] INFO: Initiating epoch #350 valid run on device rank=0 [2024-06-15 
17:59:20,857] INFO: Rank 0: epoch=350 / 400 train_loss=7.6524 valid_loss=7.9277 stale=0 time=5.74m eta=286.7m [2024-06-15 17:59:21,192] INFO: Initiating epoch #351 train run on device rank=0 [2024-06-15 18:03:59,256] INFO: Initiating epoch #351 valid run on device rank=0 [2024-06-15 18:05:05,751] INFO: Rank 0: epoch=351 / 400 train_loss=7.6524 valid_loss=7.9277 stale=0 time=5.74m eta=281.0m [2024-06-15 18:05:05,977] INFO: Initiating epoch #352 train run on device rank=0 [2024-06-15 18:09:44,208] INFO: Initiating epoch #352 valid run on device rank=0 [2024-06-15 18:10:50,529] INFO: Rank 0: epoch=352 / 400 train_loss=7.6524 valid_loss=7.9277 stale=1 time=5.74m eta=275.3m [2024-06-15 18:10:50,780] INFO: Initiating epoch #353 train run on device rank=0 [2024-06-15 18:15:29,195] INFO: Initiating epoch #353 valid run on device rank=0 [2024-06-15 18:16:37,203] INFO: Rank 0: epoch=353 / 400 train_loss=7.6524 valid_loss=7.9277 stale=2 time=5.77m eta=269.5m [2024-06-15 18:16:37,529] INFO: Initiating epoch #354 train run on device rank=0 [2024-06-15 18:21:15,895] INFO: Initiating epoch #354 valid run on device rank=0 [2024-06-15 18:22:22,399] INFO: Rank 0: epoch=354 / 400 train_loss=7.6524 valid_loss=7.9276 stale=0 time=5.75m eta=263.8m [2024-06-15 18:22:22,722] INFO: Initiating epoch #355 train run on device rank=0 [2024-06-15 18:27:00,954] INFO: Initiating epoch #355 valid run on device rank=0 [2024-06-15 18:28:07,623] INFO: Rank 0: epoch=355 / 400 train_loss=7.6524 valid_loss=7.9276 stale=0 time=5.75m eta=258.1m [2024-06-15 18:28:08,029] INFO: Initiating epoch #356 train run on device rank=0 [2024-06-15 18:32:45,880] INFO: Initiating epoch #356 valid run on device rank=0 [2024-06-15 18:33:52,631] INFO: Rank 0: epoch=356 / 400 train_loss=7.6524 valid_loss=7.9275 stale=0 time=5.74m eta=252.3m [2024-06-15 18:33:53,129] INFO: Initiating epoch #357 train run on device rank=0 [2024-06-15 18:38:31,241] INFO: Initiating epoch #357 valid run on device rank=0 [2024-06-15 
18:39:37,207] INFO: Rank 0: epoch=357 / 400 train_loss=7.6523 valid_loss=7.9275 stale=1 time=5.73m eta=246.6m [2024-06-15 18:39:37,517] INFO: Initiating epoch #358 train run on device rank=0 [2024-06-15 18:44:16,022] INFO: Initiating epoch #358 valid run on device rank=0 [2024-06-15 18:45:22,499] INFO: Rank 0: epoch=358 / 400 train_loss=7.6524 valid_loss=7.9276 stale=2 time=5.75m eta=240.9m [2024-06-15 18:45:22,698] INFO: Initiating epoch #359 train run on device rank=0 [2024-06-15 18:50:01,245] INFO: Initiating epoch #359 valid run on device rank=0 [2024-06-15 18:51:07,364] INFO: Rank 0: epoch=359 / 400 train_loss=7.6524 valid_loss=7.9275 stale=3 time=5.74m eta=235.1m [2024-06-15 18:51:07,774] INFO: Initiating epoch #360 train run on device rank=0 [2024-06-15 18:55:46,471] INFO: Initiating epoch #360 valid run on device rank=0 [2024-06-15 18:56:53,191] INFO: Rank 0: epoch=360 / 400 train_loss=7.6523 valid_loss=7.9275 stale=0 time=5.76m eta=229.4m [2024-06-15 18:56:53,502] INFO: Initiating epoch #361 train run on device rank=0 [2024-06-15 19:01:32,333] INFO: Initiating epoch #361 valid run on device rank=0 [2024-06-15 19:02:38,807] INFO: Rank 0: epoch=361 / 400 train_loss=7.6523 valid_loss=7.9275 stale=1 time=5.76m eta=223.7m [2024-06-15 19:02:39,384] INFO: Initiating epoch #362 train run on device rank=0 [2024-06-15 19:07:18,018] INFO: Initiating epoch #362 valid run on device rank=0 [2024-06-15 19:08:24,776] INFO: Rank 0: epoch=362 / 400 train_loss=7.6524 valid_loss=7.9275 stale=2 time=5.76m eta=217.9m [2024-06-15 19:08:25,253] INFO: Initiating epoch #363 train run on device rank=0 [2024-06-15 19:13:04,431] INFO: Initiating epoch #363 valid run on device rank=0 [2024-06-15 19:14:11,119] INFO: Rank 0: epoch=363 / 400 train_loss=7.6524 valid_loss=7.9275 stale=0 time=5.76m eta=212.2m [2024-06-15 19:14:11,402] INFO: Initiating epoch #364 train run on device rank=0 [2024-06-15 19:18:49,931] INFO: Initiating epoch #364 valid run on device rank=0 [2024-06-15 
19:19:56,411] INFO: Rank 0: epoch=364 / 400 train_loss=7.6524 valid_loss=7.9275 stale=1 time=5.75m eta=206.5m [2024-06-15 19:19:56,901] INFO: Initiating epoch #365 train run on device rank=0 [2024-06-15 19:24:35,645] INFO: Initiating epoch #365 valid run on device rank=0 [2024-06-15 19:25:42,198] INFO: Rank 0: epoch=365 / 400 train_loss=7.6524 valid_loss=7.9275 stale=2 time=5.75m eta=200.7m [2024-06-15 19:25:42,508] INFO: Initiating epoch #366 train run on device rank=0 [2024-06-15 19:30:21,121] INFO: Initiating epoch #366 valid run on device rank=0 [2024-06-15 19:31:27,739] INFO: Rank 0: epoch=366 / 400 train_loss=7.6524 valid_loss=7.9274 stale=0 time=5.75m eta=195.0m [2024-06-15 19:31:28,132] INFO: Initiating epoch #367 train run on device rank=0 [2024-06-15 19:36:06,512] INFO: Initiating epoch #367 valid run on device rank=0 [2024-06-15 19:37:13,423] INFO: Rank 0: epoch=367 / 400 train_loss=7.6524 valid_loss=7.9274 stale=1 time=5.75m eta=189.3m [2024-06-15 19:37:13,772] INFO: Initiating epoch #368 train run on device rank=0 [2024-06-15 19:41:52,616] INFO: Initiating epoch #368 valid run on device rank=0 [2024-06-15 19:42:59,339] INFO: Rank 0: epoch=368 / 400 train_loss=7.6524 valid_loss=7.9274 stale=2 time=5.76m eta=183.5m [2024-06-15 19:42:59,629] INFO: Initiating epoch #369 train run on device rank=0 [2024-06-15 19:47:38,508] INFO: Initiating epoch #369 valid run on device rank=0 [2024-06-15 19:48:45,028] INFO: Rank 0: epoch=369 / 400 train_loss=7.6524 valid_loss=7.9274 stale=3 time=5.76m eta=177.8m [2024-06-15 19:48:45,588] INFO: Initiating epoch #370 train run on device rank=0 [2024-06-15 19:53:24,506] INFO: Initiating epoch #370 valid run on device rank=0 [2024-06-15 19:54:31,713] INFO: Rank 0: epoch=370 / 400 train_loss=7.6524 valid_loss=7.9273 stale=0 time=5.77m eta=172.1m [2024-06-15 19:54:32,045] INFO: Initiating epoch #371 train run on device rank=0 [2024-06-15 19:59:10,717] INFO: Initiating epoch #371 valid run on device rank=0 [2024-06-15 
20:00:17,186] INFO: Rank 0: epoch=371 / 400 train_loss=7.6524 valid_loss=7.9273 stale=1 time=5.75m eta=166.3m [2024-06-15 20:00:17,390] INFO: Initiating epoch #372 train run on device rank=0 [2024-06-15 20:04:56,154] INFO: Initiating epoch #372 valid run on device rank=0 [2024-06-15 20:06:02,650] INFO: Rank 0: epoch=372 / 400 train_loss=7.6524 valid_loss=7.9273 stale=2 time=5.75m eta=160.6m [2024-06-15 20:06:03,096] INFO: Initiating epoch #373 train run on device rank=0 [2024-06-15 20:10:41,903] INFO: Initiating epoch #373 valid run on device rank=0 [2024-06-15 20:11:48,748] INFO: Rank 0: epoch=373 / 400 train_loss=7.6524 valid_loss=7.9273 stale=3 time=5.76m eta=154.9m [2024-06-15 20:11:49,085] INFO: Initiating epoch #374 train run on device rank=0 [2024-06-15 20:16:28,004] INFO: Initiating epoch #374 valid run on device rank=0 [2024-06-15 20:17:34,563] INFO: Rank 0: epoch=374 / 400 train_loss=7.6524 valid_loss=7.9273 stale=4 time=5.76m eta=149.1m [2024-06-15 20:17:34,869] INFO: Initiating epoch #375 train run on device rank=0 [2024-06-15 20:22:13,887] INFO: Initiating epoch #375 valid run on device rank=0 [2024-06-15 20:23:21,807] INFO: Rank 0: epoch=375 / 400 train_loss=7.6524 valid_loss=7.9274 stale=5 time=5.78m eta=143.4m [2024-06-15 20:23:22,185] INFO: Initiating epoch #376 train run on device rank=0 [2024-06-15 20:28:01,340] INFO: Initiating epoch #376 valid run on device rank=0 [2024-06-15 20:29:08,663] INFO: Rank 0: epoch=376 / 400 train_loss=7.6524 valid_loss=7.9273 stale=0 time=5.77m eta=137.7m [2024-06-15 20:29:08,870] INFO: Initiating epoch #377 train run on device rank=0 [2024-06-15 20:33:47,896] INFO: Initiating epoch #377 valid run on device rank=0 [2024-06-15 20:34:54,659] INFO: Rank 0: epoch=377 / 400 train_loss=7.6524 valid_loss=7.9272 stale=0 time=5.76m eta=131.9m [2024-06-15 20:34:54,862] INFO: Initiating epoch #378 train run on device rank=0 [2024-06-15 20:39:33,898] INFO: Initiating epoch #378 valid run on device rank=0 [2024-06-15 
20:40:40,960] INFO: Rank 0: epoch=378 / 400 train_loss=7.6524 valid_loss=7.9273 stale=1 time=5.77m eta=126.2m [2024-06-15 20:40:41,145] INFO: Initiating epoch #379 train run on device rank=0 [2024-06-15 20:45:20,169] INFO: Initiating epoch #379 valid run on device rank=0 [2024-06-15 20:46:27,246] INFO: Rank 0: epoch=379 / 400 train_loss=7.6524 valid_loss=7.9273 stale=2 time=5.77m eta=120.5m [2024-06-15 20:46:27,493] INFO: Initiating epoch #380 train run on device rank=0 [2024-06-15 20:51:06,614] INFO: Initiating epoch #380 valid run on device rank=0 [2024-06-15 20:52:13,462] INFO: Rank 0: epoch=380 / 400 train_loss=7.6524 valid_loss=7.9273 stale=3 time=5.77m eta=114.7m [2024-06-15 20:52:13,670] INFO: Initiating epoch #381 train run on device rank=0 [2024-06-15 20:56:52,759] INFO: Initiating epoch #381 valid run on device rank=0 [2024-06-15 20:57:59,825] INFO: Rank 0: epoch=381 / 400 train_loss=7.6524 valid_loss=7.9273 stale=4 time=5.77m eta=109.0m [2024-06-15 20:58:00,130] INFO: Initiating epoch #382 train run on device rank=0 [2024-06-15 21:02:39,736] INFO: Initiating epoch #382 valid run on device rank=0 [2024-06-15 21:03:47,060] INFO: Rank 0: epoch=382 / 400 train_loss=7.6524 valid_loss=7.9273 stale=5 time=5.78m eta=103.3m [2024-06-15 21:03:47,451] INFO: Initiating epoch #383 train run on device rank=0 [2024-06-15 21:08:26,206] INFO: Initiating epoch #383 valid run on device rank=0 [2024-06-15 21:09:32,507] INFO: Rank 0: epoch=383 / 400 train_loss=7.6524 valid_loss=7.9272 stale=0 time=5.75m eta=97.5m [2024-06-15 21:09:32,817] INFO: Initiating epoch #384 train run on device rank=0 [2024-06-15 21:14:11,878] INFO: Initiating epoch #384 valid run on device rank=0 [2024-06-15 21:15:18,344] INFO: Rank 0: epoch=384 / 400 train_loss=7.6524 valid_loss=7.9273 stale=1 time=5.76m eta=91.8m [2024-06-15 21:15:18,511] INFO: Initiating epoch #385 train run on device rank=0 [2024-06-15 21:19:57,764] INFO: Initiating epoch #385 valid run on device rank=0 [2024-06-15 21:21:04,939] 
INFO: Rank 0: epoch=385 / 400 train_loss=7.6524 valid_loss=7.9272 stale=2 time=5.77m eta=86.1m [2024-06-15 21:21:05,138] INFO: Initiating epoch #386 train run on device rank=0 [2024-06-15 21:25:44,574] INFO: Initiating epoch #386 valid run on device rank=0 [2024-06-15 21:26:50,948] INFO: Rank 0: epoch=386 / 400 train_loss=7.6524 valid_loss=7.9272 stale=3 time=5.76m eta=80.3m [2024-06-15 21:26:51,130] INFO: Initiating epoch #387 train run on device rank=0 [2024-06-15 21:31:30,361] INFO: Initiating epoch #387 valid run on device rank=0 [2024-06-15 21:32:37,647] INFO: Rank 0: epoch=387 / 400 train_loss=7.6524 valid_loss=7.9272 stale=0 time=5.78m eta=74.6m [2024-06-15 21:32:37,837] INFO: Initiating epoch #388 train run on device rank=0 [2024-06-15 21:37:17,112] INFO: Initiating epoch #388 valid run on device rank=0 [2024-06-15 21:38:23,683] INFO: Rank 0: epoch=388 / 400 train_loss=7.6524 valid_loss=7.9272 stale=1 time=5.76m eta=68.8m [2024-06-15 21:38:23,892] INFO: Initiating epoch #389 train run on device rank=0 [2024-06-15 21:43:03,264] INFO: Initiating epoch #389 valid run on device rank=0 [2024-06-15 21:44:10,061] INFO: Rank 0: epoch=389 / 400 train_loss=7.6525 valid_loss=7.9272 stale=0 time=5.77m eta=63.1m [2024-06-15 21:44:10,317] INFO: Initiating epoch #390 train run on device rank=0 [2024-06-15 21:48:49,446] INFO: Initiating epoch #390 valid run on device rank=0 [2024-06-15 21:49:56,643] INFO: Rank 0: epoch=390 / 400 train_loss=7.6524 valid_loss=7.9271 stale=0 time=5.77m eta=57.4m [2024-06-15 21:49:56,827] INFO: Initiating epoch #391 train run on device rank=0 [2024-06-15 21:54:35,912] INFO: Initiating epoch #391 valid run on device rank=0 [2024-06-15 21:55:42,418] INFO: Rank 0: epoch=391 / 400 train_loss=7.6524 valid_loss=7.9271 stale=1 time=5.76m eta=51.6m [2024-06-15 21:55:42,794] INFO: Initiating epoch #392 train run on device rank=0 [2024-06-15 22:00:21,543] INFO: Initiating epoch #392 valid run on device rank=0 [2024-06-15 22:01:28,583] INFO: Rank 0: 
epoch=392 / 400 train_loss=7.6524 valid_loss=7.9271 stale=0 time=5.76m eta=45.9m [2024-06-15 22:01:28,877] INFO: Initiating epoch #393 train run on device rank=0 [2024-06-15 22:06:07,791] INFO: Initiating epoch #393 valid run on device rank=0 [2024-06-15 22:07:14,443] INFO: Rank 0: epoch=393 / 400 train_loss=7.6524 valid_loss=7.9271 stale=1 time=5.76m eta=40.2m [2024-06-15 22:07:14,798] INFO: Initiating epoch #394 train run on device rank=0 [2024-06-15 22:11:53,733] INFO: Initiating epoch #394 valid run on device rank=0 [2024-06-15 22:13:01,265] INFO: Rank 0: epoch=394 / 400 train_loss=7.6525 valid_loss=7.9271 stale=0 time=5.77m eta=34.4m [2024-06-15 22:13:01,933] INFO: Initiating epoch #395 train run on device rank=0 [2024-06-15 22:17:41,043] INFO: Initiating epoch #395 valid run on device rank=0 [2024-06-15 22:18:47,930] INFO: Rank 0: epoch=395 / 400 train_loss=7.6524 valid_loss=7.9272 stale=1 time=5.77m eta=28.7m [2024-06-15 22:18:48,234] INFO: Initiating epoch #396 train run on device rank=0 [2024-06-15 22:23:27,072] INFO: Initiating epoch #396 valid run on device rank=0 [2024-06-15 22:24:33,252] INFO: Rank 0: epoch=396 / 400 train_loss=7.6524 valid_loss=7.9271 stale=2 time=5.75m eta=23.0m [2024-06-15 22:24:33,431] INFO: Initiating epoch #397 train run on device rank=0 [2024-06-15 22:29:12,137] INFO: Initiating epoch #397 valid run on device rank=0 [2024-06-15 22:30:18,804] INFO: Rank 0: epoch=397 / 400 train_loss=7.6525 valid_loss=7.9271 stale=0 time=5.76m eta=17.2m [2024-06-15 22:30:19,058] INFO: Initiating epoch #398 train run on device rank=0 [2024-06-15 22:34:57,625] INFO: Initiating epoch #398 valid run on device rank=0 [2024-06-15 22:36:03,971] INFO: Rank 0: epoch=398 / 400 train_loss=7.6525 valid_loss=7.9271 stale=1 time=5.75m eta=11.5m [2024-06-15 22:36:04,239] INFO: Initiating epoch #399 train run on device rank=0 [2024-06-15 22:40:42,931] INFO: Initiating epoch #399 valid run on device rank=0 [2024-06-15 22:41:49,324] INFO: Rank 0: epoch=399 / 400 
train_loss=7.6524 valid_loss=7.9271 stale=2 time=5.75m eta=5.7m [2024-06-15 22:41:49,655] INFO: Initiating epoch #400 train run on device rank=0 [2024-06-15 22:46:28,225] INFO: Initiating epoch #400 valid run on device rank=0 [2024-06-15 22:47:34,644] INFO: Rank 0: epoch=400 / 400 train_loss=7.6524 valid_loss=7.9271 stale=3 time=5.75m eta=0.0m [2024-06-15 22:47:34,832] INFO: Done with training. Total training time on device 0 is 2295.255min