[2024-06-18 10:33:35,018] INFO: Will use single-gpu: NVIDIA A100-SXM4-80GB [2024-06-18 10:33:35,018] INFO: using dtype=torch.bfloat16 [2024-06-18 10:33:35,018] INFO: using dtype=torch.bfloat16 [2024-06-18 10:33:35,059] INFO: using attention_type=flash [2024-06-18 10:33:35,059] INFO: using attention_type=flash [2024-06-18 10:33:35,071] INFO: using attention_type=flash [2024-06-18 10:33:35,071] INFO: using attention_type=flash [2024-06-18 10:33:35,083] INFO: using attention_type=flash [2024-06-18 10:33:35,083] INFO: using attention_type=flash [2024-06-18 10:33:35,094] INFO: using attention_type=flash [2024-06-18 10:33:35,094] INFO: using attention_type=flash [2024-06-18 10:33:35,105] INFO: using attention_type=flash [2024-06-18 10:33:35,105] INFO: using attention_type=flash [2024-06-18 10:33:35,116] INFO: using attention_type=flash [2024-06-18 10:33:35,116] INFO: using attention_type=flash [2024-06-18 10:33:37,681] INFO: MLPF( (nn0_id): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (nn0_reg): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (conv_id): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (conv_reg): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): 
MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (nn_id): Sequential( (0): Linear(in_features=529, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=6, bias=True) ) (nn_pt): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_eta): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_sin_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_cos_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_energy): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): 
Linear(in_features=512, out_features=2, bias=True) ) ) ) [2024-06-18 10:33:37,681] INFO: MLPF( (nn0_id): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (nn0_reg): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (conv_id): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (conv_reg): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (nn_id): Sequential( (0): Linear(in_features=529, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=6, bias=True) ) (nn_pt): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, 
elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_eta): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_sin_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_cos_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_energy): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) ) [2024-06-18 10:33:37,682] INFO: Trainable parameters: 11671568 [2024-06-18 10:33:37,682] INFO: Trainable parameters: 11671568 [2024-06-18 10:33:37,682] INFO: Non-trainable parameters: 0 [2024-06-18 10:33:37,682] INFO: Non-trainable parameters: 0 [2024-06-18 10:33:37,682] INFO: Total parameters: 11671568 [2024-06-18 10:33:37,682] INFO: Total parameters: 11671568 [2024-06-18 10:33:37,686] INFO: Modules Trainable parameters Non-tranable parameters nn0_id.0.weight 8704 0 nn0_id.0.bias 512 0 nn0_id.2.weight 512 0 nn0_id.2.bias 512 0 nn0_id.4.weight 262144 0 nn0_id.4.bias 512 0 nn0_reg.0.weight 8704 0 nn0_reg.0.bias 512 0 nn0_reg.2.weight 512 0 nn0_reg.2.bias 512 0 nn0_reg.4.weight 262144 0 nn0_reg.4.bias 512 0 conv_id.0.mha.in_proj_weight 786432 0 conv_id.0.mha.in_proj_bias 1536 0 
conv_id.0.mha.out_proj.weight 262144 0 conv_id.0.mha.out_proj.bias 512 0 conv_id.0.norm0.weight 512 0 conv_id.0.norm0.bias 512 0 conv_id.0.norm1.weight 512 0 conv_id.0.norm1.bias 512 0 conv_id.0.seq.0.weight 262144 0 conv_id.0.seq.0.bias 512 0 conv_id.0.seq.2.weight 262144 0 conv_id.0.seq.2.bias 512 0 conv_id.1.mha.in_proj_weight 786432 0 conv_id.1.mha.in_proj_bias 1536 0 conv_id.1.mha.out_proj.weight 262144 0 conv_id.1.mha.out_proj.bias 512 0 conv_id.1.norm0.weight 512 0 conv_id.1.norm0.bias 512 0 conv_id.1.norm1.weight 512 0 conv_id.1.norm1.bias 512 0 conv_id.1.seq.0.weight 262144 0 conv_id.1.seq.0.bias 512 0 conv_id.1.seq.2.weight 262144 0 conv_id.1.seq.2.bias 512 0 conv_id.2.mha.in_proj_weight 786432 0 conv_id.2.mha.in_proj_bias 1536 0 conv_id.2.mha.out_proj.weight 262144 0 conv_id.2.mha.out_proj.bias 512 0 conv_id.2.norm0.weight 512 0 conv_id.2.norm0.bias 512 0 conv_id.2.norm1.weight 512 0 conv_id.2.norm1.bias 512 0 conv_id.2.seq.0.weight 262144 0 conv_id.2.seq.0.bias 512 0 conv_id.2.seq.2.weight 262144 0 conv_id.2.seq.2.bias 512 0 conv_reg.0.mha.in_proj_weight 786432 0 conv_reg.0.mha.in_proj_bias 1536 0 conv_reg.0.mha.out_proj.weight 262144 0 conv_reg.0.mha.out_proj.bias 512 0 conv_reg.0.norm0.weight 512 0 conv_reg.0.norm0.bias 512 0 conv_reg.0.norm1.weight 512 0 conv_reg.0.norm1.bias 512 0 conv_reg.0.seq.0.weight 262144 0 conv_reg.0.seq.0.bias 512 0 conv_reg.0.seq.2.weight 262144 0 conv_reg.0.seq.2.bias 512 0 conv_reg.1.mha.in_proj_weight 786432 0 conv_reg.1.mha.in_proj_bias 1536 0 conv_reg.1.mha.out_proj.weight 262144 0 conv_reg.1.mha.out_proj.bias 512 0 conv_reg.1.norm0.weight 512 0 conv_reg.1.norm0.bias 512 0 conv_reg.1.norm1.weight 512 0 conv_reg.1.norm1.bias 512 0 conv_reg.1.seq.0.weight 262144 0 conv_reg.1.seq.0.bias 512 0 conv_reg.1.seq.2.weight 262144 0 conv_reg.1.seq.2.bias 512 0 conv_reg.2.mha.in_proj_weight 786432 0 conv_reg.2.mha.in_proj_bias 1536 0 conv_reg.2.mha.out_proj.weight 262144 0 conv_reg.2.mha.out_proj.bias 512 0 conv_reg.2.norm0.weight 
512 0 conv_reg.2.norm0.bias 512 0 conv_reg.2.norm1.weight 512 0 conv_reg.2.norm1.bias 512 0 conv_reg.2.seq.0.weight 262144 0 conv_reg.2.seq.0.bias 512 0 conv_reg.2.seq.2.weight 262144 0 conv_reg.2.seq.2.bias 512 0 nn_id.0.weight 270848 0 nn_id.0.bias 512 0 nn_id.2.weight 512 0 nn_id.2.bias 512 0 nn_id.4.weight 3072 0 nn_id.4.bias 6 0 nn_pt.nn.0.weight 273920 0 nn_pt.nn.0.bias 512 0 nn_pt.nn.2.weight 512 0 nn_pt.nn.2.bias 512 0 nn_pt.nn.4.weight 1024 0 nn_pt.nn.4.bias 2 0 nn_eta.nn.0.weight 273920 0 nn_eta.nn.0.bias 512 0 nn_eta.nn.2.weight 512 0 nn_eta.nn.2.bias 512 0 nn_eta.nn.4.weight 1024 0 nn_eta.nn.4.bias 2 0 nn_sin_phi.nn.0.weight 273920 0 nn_sin_phi.nn.0.bias 512 0 nn_sin_phi.nn.2.weight 512 0 nn_sin_phi.nn.2.bias 512 0 nn_sin_phi.nn.4.weight 1024 0 nn_sin_phi.nn.4.bias 2 0 nn_cos_phi.nn.0.weight 273920 0 nn_cos_phi.nn.0.bias 512 0 nn_cos_phi.nn.2.weight 512 0 nn_cos_phi.nn.2.bias 512 0 nn_cos_phi.nn.4.weight 1024 0 nn_cos_phi.nn.4.bias 2 0 nn_energy.nn.0.weight 273920 0 nn_energy.nn.0.bias 512 0 nn_energy.nn.2.weight 512 0 nn_energy.nn.2.bias 512 0 nn_energy.nn.4.weight 1024 0 nn_energy.nn.4.bias 2 0 [2024-06-18 10:33:37,686] INFO: Modules Trainable parameters Non-tranable parameters nn0_id.0.weight 8704 0 nn0_id.0.bias 512 0 nn0_id.2.weight 512 0 nn0_id.2.bias 512 0 nn0_id.4.weight 262144 0 nn0_id.4.bias 512 0 nn0_reg.0.weight 8704 0 nn0_reg.0.bias 512 0 nn0_reg.2.weight 512 0 nn0_reg.2.bias 512 0 nn0_reg.4.weight 262144 0 nn0_reg.4.bias 512 0 conv_id.0.mha.in_proj_weight 786432 0 conv_id.0.mha.in_proj_bias 1536 0 conv_id.0.mha.out_proj.weight 262144 0 conv_id.0.mha.out_proj.bias 512 0 conv_id.0.norm0.weight 512 0 conv_id.0.norm0.bias 512 0 conv_id.0.norm1.weight 512 0 conv_id.0.norm1.bias 512 0 conv_id.0.seq.0.weight 262144 0 conv_id.0.seq.0.bias 512 0 conv_id.0.seq.2.weight 262144 0 conv_id.0.seq.2.bias 512 0 conv_id.1.mha.in_proj_weight 786432 0 conv_id.1.mha.in_proj_bias 1536 0 conv_id.1.mha.out_proj.weight 262144 0 conv_id.1.mha.out_proj.bias 512 0 
conv_id.1.norm0.weight 512 0 conv_id.1.norm0.bias 512 0 conv_id.1.norm1.weight 512 0 conv_id.1.norm1.bias 512 0 conv_id.1.seq.0.weight 262144 0 conv_id.1.seq.0.bias 512 0 conv_id.1.seq.2.weight 262144 0 conv_id.1.seq.2.bias 512 0 conv_id.2.mha.in_proj_weight 786432 0 conv_id.2.mha.in_proj_bias 1536 0 conv_id.2.mha.out_proj.weight 262144 0 conv_id.2.mha.out_proj.bias 512 0 conv_id.2.norm0.weight 512 0 conv_id.2.norm0.bias 512 0 conv_id.2.norm1.weight 512 0 conv_id.2.norm1.bias 512 0 conv_id.2.seq.0.weight 262144 0 conv_id.2.seq.0.bias 512 0 conv_id.2.seq.2.weight 262144 0 conv_id.2.seq.2.bias 512 0 conv_reg.0.mha.in_proj_weight 786432 0 conv_reg.0.mha.in_proj_bias 1536 0 conv_reg.0.mha.out_proj.weight 262144 0 conv_reg.0.mha.out_proj.bias 512 0 conv_reg.0.norm0.weight 512 0 conv_reg.0.norm0.bias 512 0 conv_reg.0.norm1.weight 512 0 conv_reg.0.norm1.bias 512 0 conv_reg.0.seq.0.weight 262144 0 conv_reg.0.seq.0.bias 512 0 conv_reg.0.seq.2.weight 262144 0 conv_reg.0.seq.2.bias 512 0 conv_reg.1.mha.in_proj_weight 786432 0 conv_reg.1.mha.in_proj_bias 1536 0 conv_reg.1.mha.out_proj.weight 262144 0 conv_reg.1.mha.out_proj.bias 512 0 conv_reg.1.norm0.weight 512 0 conv_reg.1.norm0.bias 512 0 conv_reg.1.norm1.weight 512 0 conv_reg.1.norm1.bias 512 0 conv_reg.1.seq.0.weight 262144 0 conv_reg.1.seq.0.bias 512 0 conv_reg.1.seq.2.weight 262144 0 conv_reg.1.seq.2.bias 512 0 conv_reg.2.mha.in_proj_weight 786432 0 conv_reg.2.mha.in_proj_bias 1536 0 conv_reg.2.mha.out_proj.weight 262144 0 conv_reg.2.mha.out_proj.bias 512 0 conv_reg.2.norm0.weight 512 0 conv_reg.2.norm0.bias 512 0 conv_reg.2.norm1.weight 512 0 conv_reg.2.norm1.bias 512 0 conv_reg.2.seq.0.weight 262144 0 conv_reg.2.seq.0.bias 512 0 conv_reg.2.seq.2.weight 262144 0 conv_reg.2.seq.2.bias 512 0 nn_id.0.weight 270848 0 nn_id.0.bias 512 0 nn_id.2.weight 512 0 nn_id.2.bias 512 0 nn_id.4.weight 3072 0 nn_id.4.bias 6 0 nn_pt.nn.0.weight 273920 0 nn_pt.nn.0.bias 512 0 nn_pt.nn.2.weight 512 0 nn_pt.nn.2.bias 512 0 
nn_pt.nn.4.weight 1024 0 nn_pt.nn.4.bias 2 0 nn_eta.nn.0.weight 273920 0 nn_eta.nn.0.bias 512 0 nn_eta.nn.2.weight 512 0 nn_eta.nn.2.bias 512 0 nn_eta.nn.4.weight 1024 0 nn_eta.nn.4.bias 2 0 nn_sin_phi.nn.0.weight 273920 0 nn_sin_phi.nn.0.bias 512 0 nn_sin_phi.nn.2.weight 512 0 nn_sin_phi.nn.2.bias 512 0 nn_sin_phi.nn.4.weight 1024 0 nn_sin_phi.nn.4.bias 2 0 nn_cos_phi.nn.0.weight 273920 0 nn_cos_phi.nn.0.bias 512 0 nn_cos_phi.nn.2.weight 512 0 nn_cos_phi.nn.2.bias 512 0 nn_cos_phi.nn.4.weight 1024 0 nn_cos_phi.nn.4.bias 2 0 nn_energy.nn.0.weight 273920 0 nn_energy.nn.0.bias 512 0 nn_energy.nn.2.weight 512 0 nn_energy.nn.2.bias 512 0 nn_energy.nn.4.weight 1024 0 nn_energy.nn.4.bias 2 0 [2024-06-18 10:33:37,687] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_A100_pyg-clic_20240618_103334_147904 [2024-06-18 10:33:37,687] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_A100_pyg-clic_20240618_103334_147904 [2024-06-18 10:33:37,687] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_A100_pyg-clic_20240618_103334_147904 [2024-06-18 10:33:37,687] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_A100_pyg-clic_20240618_103334_147904 [2024-06-18 10:33:37,723] INFO: train_dataset: clic_edm_qq_pf, 1589912 [2024-06-18 10:33:37,723] INFO: train_dataset: clic_edm_qq_pf, 1589912 [2024-06-18 10:33:37,741] INFO: train_dataset: clic_edm_ttbar_pf, 800800 [2024-06-18 10:33:37,741] INFO: train_dataset: clic_edm_ttbar_pf, 800800 [2024-06-18 10:33:37,764] INFO: train_dataset: clic_edm_ttbar_pu10_pf, 562200 [2024-06-18 10:33:37,764] INFO: train_dataset: clic_edm_ttbar_pu10_pf, 562200 [2024-06-18 10:33:37,778] INFO: train_dataset: clic_edm_ww_fullhad_pf, 800800 [2024-06-18 10:33:37,778] INFO: train_dataset: clic_edm_ww_fullhad_pf, 800800 [2024-06-18 10:33:37,788] INFO: train_dataset: clic_edm_zh_tautau_pf, 800799 [2024-06-18 10:33:37,788] INFO: train_dataset: clic_edm_zh_tautau_pf, 800799 [2024-06-18 10:33:38,181] INFO: 
valid_dataset: clic_edm_qq_pf, 397514 [2024-06-18 10:33:38,181] INFO: valid_dataset: clic_edm_qq_pf, 397514 [2024-06-18 10:33:38,558] INFO: Initiating epoch #1 train run on device rank=0 [2024-06-18 10:33:38,558] INFO: Initiating epoch #1 train run on device rank=0 [2024-06-18 11:45:16,082] INFO: Initiating epoch #1 valid run on device rank=0 [2024-06-18 11:45:16,082] INFO: Initiating epoch #1 valid run on device rank=0 [2024-06-18 11:49:49,784] INFO: Rank 0: epoch=1 / 200 train_loss=17.1257 valid_loss=13.9178 stale=0 time=76.19m eta=15161.2m [2024-06-18 11:49:49,784] INFO: Rank 0: epoch=1 / 200 train_loss=17.1257 valid_loss=13.9178 stale=0 time=76.19m eta=15161.2m [2024-06-18 11:49:50,152] INFO: Initiating epoch #2 train run on device rank=0 [2024-06-18 11:49:50,152] INFO: Initiating epoch #2 train run on device rank=0 [2024-06-18 13:01:30,421] INFO: Initiating epoch #2 valid run on device rank=0 [2024-06-18 13:01:30,421] INFO: Initiating epoch #2 valid run on device rank=0 [2024-06-18 13:06:07,592] INFO: Rank 0: epoch=2 / 200 train_loss=13.2536 valid_loss=12.6360 stale=0 time=76.29m eta=15095.9m [2024-06-18 13:06:07,592] INFO: Rank 0: epoch=2 / 200 train_loss=13.2536 valid_loss=12.6360 stale=0 time=76.29m eta=15095.9m [2024-06-18 13:06:07,632] INFO: Initiating epoch #3 train run on device rank=0 [2024-06-18 13:06:07,632] INFO: Initiating epoch #3 train run on device rank=0 [2024-06-18 14:17:49,969] INFO: Initiating epoch #3 valid run on device rank=0 [2024-06-18 14:17:49,969] INFO: Initiating epoch #3 valid run on device rank=0 [2024-06-18 14:22:26,811] INFO: Rank 0: epoch=3 / 200 train_loss=12.3176 valid_loss=12.1804 stale=0 time=76.32m eta=15024.8m [2024-06-18 14:22:26,811] INFO: Rank 0: epoch=3 / 200 train_loss=12.3176 valid_loss=12.1804 stale=0 time=76.32m eta=15024.8m [2024-06-18 14:22:26,984] INFO: Initiating epoch #4 train run on device rank=0 [2024-06-18 14:22:26,984] INFO: Initiating epoch #4 train run on device rank=0 [2024-06-18 15:34:11,103] INFO: 
Initiating epoch #4 valid run on device rank=0 [2024-06-18 15:34:11,103] INFO: Initiating epoch #4 valid run on device rank=0 [2024-06-18 15:38:48,071] INFO: Rank 0: epoch=4 / 200 train_loss=11.8099 valid_loss=11.7438 stale=0 time=76.35m eta=14952.8m [2024-06-18 15:38:48,071] INFO: Rank 0: epoch=4 / 200 train_loss=11.8099 valid_loss=11.7438 stale=0 time=76.35m eta=14952.8m [2024-06-18 15:38:48,113] INFO: Initiating epoch #5 train run on device rank=0 [2024-06-18 15:38:48,113] INFO: Initiating epoch #5 train run on device rank=0 [2024-06-18 16:50:33,122] INFO: Initiating epoch #5 valid run on device rank=0 [2024-06-18 16:50:33,122] INFO: Initiating epoch #5 valid run on device rank=0 [2024-06-18 16:55:10,189] INFO: Rank 0: epoch=5 / 200 train_loss=11.4197 valid_loss=11.5178 stale=0 time=76.37m eta=14879.6m [2024-06-18 16:55:10,189] INFO: Rank 0: epoch=5 / 200 train_loss=11.4197 valid_loss=11.5178 stale=0 time=76.37m eta=14879.6m [2024-06-18 16:55:10,215] INFO: Initiating epoch #6 train run on device rank=0 [2024-06-18 16:55:10,215] INFO: Initiating epoch #6 train run on device rank=0 [2024-06-18 18:12:29,421] INFO: Initiating epoch #6 valid run on device rank=0 [2024-06-18 18:12:29,421] INFO: Initiating epoch #6 valid run on device rank=0 [2024-06-18 18:17:10,944] INFO: Rank 0: epoch=6 / 200 train_loss=11.1606 valid_loss=11.0642 stale=0 time=82.01m eta=14987.8m [2024-06-18 18:17:10,944] INFO: Rank 0: epoch=6 / 200 train_loss=11.1606 valid_loss=11.0642 stale=0 time=82.01m eta=14987.8m [2024-06-18 18:17:11,317] INFO: Initiating epoch #7 train run on device rank=0 [2024-06-18 18:17:11,317] INFO: Initiating epoch #7 train run on device rank=0 [2024-06-18 19:41:27,393] INFO: Initiating epoch #7 valid run on device rank=0 [2024-06-18 19:41:27,393] INFO: Initiating epoch #7 valid run on device rank=0 [2024-06-18 19:47:55,162] INFO: Rank 0: epoch=7 / 200 train_loss=10.8904 valid_loss=10.7941 stale=0 time=90.73m eta=15282.2m [2024-06-18 19:47:55,162] INFO: Rank 0: epoch=7 / 
200 train_loss=10.8904 valid_loss=10.7941 stale=0 time=90.73m eta=15282.2m [2024-06-18 19:47:56,120] INFO: Initiating epoch #8 train run on device rank=0 [2024-06-18 19:47:56,120] INFO: Initiating epoch #8 train run on device rank=0 [2024-06-18 21:16:41,659] INFO: Initiating epoch #8 valid run on device rank=0 [2024-06-18 21:16:41,659] INFO: Initiating epoch #8 valid run on device rank=0 [2024-06-18 21:23:29,001] INFO: Rank 0: epoch=8 / 200 train_loss=10.6651 valid_loss=10.6969 stale=0 time=95.55m eta=15596.2m [2024-06-18 21:23:29,001] INFO: Rank 0: epoch=8 / 200 train_loss=10.6651 valid_loss=10.6969 stale=0 time=95.55m eta=15596.2m [2024-06-18 21:23:30,729] INFO: Initiating epoch #9 train run on device rank=0 [2024-06-18 21:23:30,729] INFO: Initiating epoch #9 train run on device rank=0 [2024-06-18 22:45:47,690] INFO: Initiating epoch #9 valid run on device rank=0 [2024-06-18 22:45:47,690] INFO: Initiating epoch #9 valid run on device rank=0 [2024-06-18 22:50:31,775] INFO: Rank 0: epoch=9 / 200 train_loss=10.5038 valid_loss=10.4432 stale=0 time=87.02m eta=15638.4m [2024-06-18 22:50:31,775] INFO: Rank 0: epoch=9 / 200 train_loss=10.5038 valid_loss=10.4432 stale=0 time=87.02m eta=15638.4m [2024-06-18 22:50:33,027] INFO: Initiating epoch #10 train run on device rank=0 [2024-06-18 22:50:33,027] INFO: Initiating epoch #10 train run on device rank=0 [2024-06-19 00:13:08,891] INFO: Initiating epoch #10 valid run on device rank=0 [2024-06-19 00:13:08,891] INFO: Initiating epoch #10 valid run on device rank=0 [2024-06-19 00:20:06,245] INFO: Rank 0: epoch=10 / 200 train_loss=10.3030 valid_loss=10.3556 stale=0 time=89.55m eta=15702.8m [2024-06-19 00:20:06,245] INFO: Rank 0: epoch=10 / 200 train_loss=10.3030 valid_loss=10.3556 stale=0 time=89.55m eta=15702.8m [2024-06-19 00:21:02,902] INFO: Initiating epoch #11 train run on device rank=0 [2024-06-19 00:21:02,902] INFO: Initiating epoch #11 train run on device rank=0 [2024-06-19 01:47:06,024] INFO: Initiating epoch #11 valid 
run on device rank=0 [2024-06-19 01:47:06,024] INFO: Initiating epoch #11 valid run on device rank=0 [2024-06-19 01:51:55,104] INFO: Rank 0: epoch=11 / 200 train_loss=10.1449 valid_loss=10.2134 stale=0 time=90.87m eta=15777.6m [2024-06-19 01:51:55,104] INFO: Rank 0: epoch=11 / 200 train_loss=10.1449 valid_loss=10.2134 stale=0 time=90.87m eta=15777.6m [2024-06-19 01:51:57,527] INFO: Initiating epoch #12 train run on device rank=0 [2024-06-19 01:51:57,527] INFO: Initiating epoch #12 train run on device rank=0 [2024-06-19 03:03:55,999] INFO: Initiating epoch #12 valid run on device rank=0 [2024-06-19 03:03:55,999] INFO: Initiating epoch #12 valid run on device rank=0 [2024-06-19 03:08:39,346] INFO: Rank 0: epoch=12 / 200 train_loss=9.9952 valid_loss=10.0955 stale=0 time=76.7m eta=15588.5m [2024-06-19 03:08:39,346] INFO: Rank 0: epoch=12 / 200 train_loss=9.9952 valid_loss=10.0955 stale=0 time=76.7m eta=15588.5m [2024-06-19 03:08:42,333] INFO: Initiating epoch #13 train run on device rank=0 [2024-06-19 03:08:42,333] INFO: Initiating epoch #13 train run on device rank=0 [2024-06-19 04:20:35,059] INFO: Initiating epoch #13 valid run on device rank=0 [2024-06-19 04:20:35,059] INFO: Initiating epoch #13 valid run on device rank=0 [2024-06-19 04:25:12,673] INFO: Rank 0: epoch=13 / 200 train_loss=9.8952 valid_loss=9.9455 stale=0 time=76.51m eta=15414.1m [2024-06-19 04:25:12,673] INFO: Rank 0: epoch=13 / 200 train_loss=9.8952 valid_loss=9.9455 stale=0 time=76.51m eta=15414.1m [2024-06-19 04:25:14,319] INFO: Initiating epoch #14 train run on device rank=0 [2024-06-19 04:25:14,319] INFO: Initiating epoch #14 train run on device rank=0 [2024-06-19 05:36:48,751] INFO: Initiating epoch #14 valid run on device rank=0 [2024-06-19 05:36:48,751] INFO: Initiating epoch #14 valid run on device rank=0 [2024-06-19 05:39:44,060] INFO: Rank 0: epoch=14 / 200 train_loss=9.7900 valid_loss=9.8483 stale=0 time=74.5m eta=15226.6m [2024-06-19 05:39:44,060] INFO: Rank 0: epoch=14 / 200 
train_loss=9.7900 valid_loss=9.8483 stale=0 time=74.5m eta=15226.6m [2024-06-19 05:39:46,048] INFO: Initiating epoch #15 train run on device rank=0 [2024-06-19 05:39:46,048] INFO: Initiating epoch #15 train run on device rank=0 [2024-06-19 06:51:13,888] INFO: Initiating epoch #15 valid run on device rank=0 [2024-06-19 06:51:13,888] INFO: Initiating epoch #15 valid run on device rank=0 [2024-06-19 06:55:47,326] INFO: Rank 0: epoch=15 / 200 train_loss=9.6781 valid_loss=9.8361 stale=0 time=76.02m eta=15073.1m [2024-06-19 06:55:47,326] INFO: Rank 0: epoch=15 / 200 train_loss=9.6781 valid_loss=9.8361 stale=0 time=76.02m eta=15073.1m [2024-06-19 06:55:48,353] INFO: Initiating epoch #16 train run on device rank=0 [2024-06-19 06:55:48,353] INFO: Initiating epoch #16 train run on device rank=0 [2024-06-19 08:07:13,620] INFO: Initiating epoch #16 valid run on device rank=0 [2024-06-19 08:07:13,620] INFO: Initiating epoch #16 valid run on device rank=0 [2024-06-19 08:11:48,754] INFO: Rank 0: epoch=16 / 200 train_loss=9.6048 valid_loss=9.6500 stale=0 time=76.01m eta=14929.0m [2024-06-19 08:11:48,754] INFO: Rank 0: epoch=16 / 200 train_loss=9.6048 valid_loss=9.6500 stale=0 time=76.01m eta=14929.0m [2024-06-19 08:11:49,416] INFO: Initiating epoch #17 train run on device rank=0 [2024-06-19 08:11:49,416] INFO: Initiating epoch #17 train run on device rank=0 [2024-06-19 09:24:54,217] INFO: Initiating epoch #17 valid run on device rank=0 [2024-06-19 09:24:54,217] INFO: Initiating epoch #17 valid run on device rank=0 [2024-06-19 09:33:44,446] INFO: Rank 0: epoch=17 / 200 train_loss=9.5318 valid_loss=9.6345 stale=0 time=81.92m eta=14856.4m [2024-06-19 09:33:44,446] INFO: Rank 0: epoch=17 / 200 train_loss=9.5318 valid_loss=9.6345 stale=0 time=81.92m eta=14856.4m [2024-06-19 09:33:46,000] INFO: Initiating epoch #18 train run on device rank=0 [2024-06-19 09:33:46,000] INFO: Initiating epoch #18 train run on device rank=0 [2024-06-19 11:30:07,069] INFO: Initiating epoch #18 valid run on 
device rank=0 [2024-06-19 11:30:07,069] INFO: Initiating epoch #18 valid run on device rank=0 [2024-06-19 11:33:48,052] INFO: Rank 0: epoch=18 / 200 train_loss=9.4829 valid_loss=9.5904 stale=0 time=120.03m eta=15168.3m [2024-06-19 11:33:48,052] INFO: Rank 0: epoch=18 / 200 train_loss=9.4829 valid_loss=9.5904 stale=0 time=120.03m eta=15168.3m [2024-06-19 11:33:56,139] INFO: Initiating epoch #19 train run on device rank=0 [2024-06-19 11:33:56,139] INFO: Initiating epoch #19 train run on device rank=0 [2024-06-19 13:31:59,308] INFO: Initiating epoch #19 valid run on device rank=0 [2024-06-19 13:31:59,308] INFO: Initiating epoch #19 valid run on device rank=0 [2024-06-19 13:35:20,245] INFO: Rank 0: epoch=19 / 200 train_loss=9.4125 valid_loss=9.6140 stale=1 time=121.4m eta=15448.8m [2024-06-19 13:35:20,245] INFO: Rank 0: epoch=19 / 200 train_loss=9.4125 valid_loss=9.6140 stale=1 time=121.4m eta=15448.8m [2024-06-19 13:35:21,885] INFO: Initiating epoch #20 train run on device rank=0 [2024-06-19 13:35:21,885] INFO: Initiating epoch #20 train run on device rank=0 [2024-06-19 15:47:48,684] INFO: Initiating epoch #20 valid run on device rank=0 [2024-06-19 15:47:48,684] INFO: Initiating epoch #20 valid run on device rank=0 [2024-06-19 15:52:24,715] INFO: Rank 0: epoch=20 / 200 train_loss=9.3671 valid_loss=9.5541 stale=0 time=137.05m eta=15828.9m [2024-06-19 15:52:24,715] INFO: Rank 0: epoch=20 / 200 train_loss=9.3671 valid_loss=9.5541 stale=0 time=137.05m eta=15828.9m [2024-06-19 15:52:27,627] INFO: Initiating epoch #21 train run on device rank=0 [2024-06-19 15:52:27,627] INFO: Initiating epoch #21 train run on device rank=0 [2024-06-19 17:03:58,763] INFO: Initiating epoch #21 valid run on device rank=0 [2024-06-19 17:03:58,763] INFO: Initiating epoch #21 valid run on device rank=0 [2024-06-19 17:07:20,269] INFO: Rank 0: epoch=21 / 200 train_loss=9.3156 valid_loss=9.4374 stale=0 time=74.88m eta=15630.1m [2024-06-19 17:07:20,269] INFO: Rank 0: epoch=21 / 200 train_loss=9.3156 
valid_loss=9.4374 stale=0 time=74.88m eta=15630.1m [2024-06-19 17:07:22,245] INFO: Initiating epoch #22 train run on device rank=0 [2024-06-19 17:07:22,245] INFO: Initiating epoch #22 train run on device rank=0 [2024-06-19 18:27:56,948] INFO: Initiating epoch #22 valid run on device rank=0 [2024-06-19 18:27:56,948] INFO: Initiating epoch #22 valid run on device rank=0 [2024-06-19 18:31:17,220] INFO: Rank 0: epoch=22 / 200 train_loss=9.2686 valid_loss=9.3706 stale=0 time=83.92m eta=15515.5m [2024-06-19 18:31:17,220] INFO: Rank 0: epoch=22 / 200 train_loss=9.2686 valid_loss=9.3706 stale=0 time=83.92m eta=15515.5m [2024-06-19 18:31:20,370] INFO: Initiating epoch #23 train run on device rank=0 [2024-06-19 18:31:20,370] INFO: Initiating epoch #23 train run on device rank=0 [2024-06-19 19:56:21,955] INFO: Initiating epoch #23 valid run on device rank=0 [2024-06-19 19:56:21,955] INFO: Initiating epoch #23 valid run on device rank=0 [2024-06-19 19:59:34,137] INFO: Rank 0: epoch=23 / 200 train_loss=9.2284 valid_loss=9.3156 stale=0 time=88.23m eta=15436.9m [2024-06-19 19:59:34,137] INFO: Rank 0: epoch=23 / 200 train_loss=9.2284 valid_loss=9.3156 stale=0 time=88.23m eta=15436.9m [2024-06-19 19:59:36,230] INFO: Initiating epoch #24 train run on device rank=0 [2024-06-19 19:59:36,230] INFO: Initiating epoch #24 train run on device rank=0 [2024-06-19 21:28:54,614] INFO: Initiating epoch #24 valid run on device rank=0 [2024-06-19 21:28:54,614] INFO: Initiating epoch #24 valid run on device rank=0 [2024-06-19 21:32:19,965] INFO: Rank 0: epoch=24 / 200 train_loss=9.1932 valid_loss=9.3698 stale=1 time=92.73m eta=15390.4m [2024-06-19 21:32:19,965] INFO: Rank 0: epoch=24 / 200 train_loss=9.1932 valid_loss=9.3698 stale=1 time=92.73m eta=15390.4m [2024-06-19 21:32:23,629] INFO: Initiating epoch #25 train run on device rank=0 [2024-06-19 21:32:23,629] INFO: Initiating epoch #25 train run on device rank=0 [2024-06-19 23:07:34,624] INFO: Initiating epoch #25 valid run on device rank=0 
[2024-06-19 23:07:34,624] INFO: Initiating epoch #25 valid run on device rank=0 [2024-06-19 23:13:00,843] INFO: Rank 0: epoch=25 / 200 train_loss=9.1585 valid_loss=9.3529 stale=2 time=100.62m eta=15395.6m [2024-06-19 23:13:00,843] INFO: Rank 0: epoch=25 / 200 train_loss=9.1585 valid_loss=9.3529 stale=2 time=100.62m eta=15395.6m [2024-06-19 23:13:26,992] INFO: Initiating epoch #26 train run on device rank=0 [2024-06-19 23:13:26,992] INFO: Initiating epoch #26 train run on device rank=0 [2024-06-20 00:48:24,728] INFO: Initiating epoch #26 valid run on device rank=0 [2024-06-20 00:48:24,728] INFO: Initiating epoch #26 valid run on device rank=0 [2024-06-20 00:54:38,683] INFO: Rank 0: epoch=26 / 200 train_loss=9.1211 valid_loss=9.3085 stale=0 time=101.19m eta=15399.0m [2024-06-20 00:54:38,683] INFO: Rank 0: epoch=26 / 200 train_loss=9.1211 valid_loss=9.3085 stale=0 time=101.19m eta=15399.0m [2024-06-20 00:55:41,486] INFO: Initiating epoch #27 train run on device rank=0 [2024-06-20 00:55:41,486] INFO: Initiating epoch #27 train run on device rank=0 [2024-06-20 02:52:03,018] INFO: Initiating epoch #27 valid run on device rank=0 [2024-06-20 02:52:03,018] INFO: Initiating epoch #27 valid run on device rank=0 [2024-06-20 02:55:08,575] INFO: Rank 0: epoch=27 / 200 train_loss=9.0955 valid_loss=9.2220 stale=0 time=119.45m eta=15515.5m [2024-06-20 02:55:08,575] INFO: Rank 0: epoch=27 / 200 train_loss=9.0955 valid_loss=9.2220 stale=0 time=119.45m eta=15515.5m [2024-06-20 02:55:10,999] INFO: Initiating epoch #28 train run on device rank=0 [2024-06-20 02:55:10,999] INFO: Initiating epoch #28 train run on device rank=0 [2024-06-20 04:19:06,375] INFO: Initiating epoch #28 valid run on device rank=0 [2024-06-20 04:19:06,375] INFO: Initiating epoch #28 valid run on device rank=0 [2024-06-20 04:22:17,616] INFO: Rank 0: epoch=28 / 200 train_loss=9.0603 valid_loss=9.2053 stale=0 time=87.11m eta=15410.3m [2024-06-20 04:22:17,616] INFO: Rank 0: epoch=28 / 200 train_loss=9.0603 
valid_loss=9.2053 stale=0 time=87.11m eta=15410.3m [2024-06-20 04:22:20,518] INFO: Initiating epoch #29 train run on device rank=0 [2024-06-20 04:22:20,518] INFO: Initiating epoch #29 train run on device rank=0 [2024-06-20 05:34:10,529] INFO: Initiating epoch #29 valid run on device rank=0 [2024-06-20 05:34:10,529] INFO: Initiating epoch #29 valid run on device rank=0 [2024-06-20 05:36:56,115] INFO: Rank 0: epoch=29 / 200 train_loss=9.0317 valid_loss=9.2176 stale=1 time=74.59m eta=15232.5m [2024-06-20 05:36:56,115] INFO: Rank 0: epoch=29 / 200 train_loss=9.0317 valid_loss=9.2176 stale=1 time=74.59m eta=15232.5m [2024-06-20 05:36:57,242] INFO: Initiating epoch #30 train run on device rank=0 [2024-06-20 05:36:57,242] INFO: Initiating epoch #30 train run on device rank=0 [2024-06-20 07:51:54,111] INFO: Initiating epoch #30 valid run on device rank=0 [2024-06-20 07:51:54,111] INFO: Initiating epoch #30 valid run on device rank=0 [2024-06-20 08:01:16,518] INFO: Rank 0: epoch=30 / 200 train_loss=9.0057 valid_loss=9.1262 stale=0 time=144.32m eta=15456.6m [2024-06-20 08:01:16,518] INFO: Rank 0: epoch=30 / 200 train_loss=9.0057 valid_loss=9.1262 stale=0 time=144.32m eta=15456.6m [2024-06-20 08:01:16,610] INFO: Initiating epoch #31 train run on device rank=0 [2024-06-20 08:01:16,610] INFO: Initiating epoch #31 train run on device rank=0 [2024-06-20 10:15:58,288] INFO: Initiating epoch #31 valid run on device rank=0 [2024-06-20 10:15:58,288] INFO: Initiating epoch #31 valid run on device rank=0 [2024-06-20 10:25:18,290] INFO: Rank 0: epoch=31 / 200 train_loss=8.9816 valid_loss=9.1839 stale=1 time=144.03m eta=15655.2m [2024-06-20 10:25:18,290] INFO: Rank 0: epoch=31 / 200 train_loss=8.9816 valid_loss=9.1839 stale=1 time=144.03m eta=15655.2m [2024-06-20 10:25:18,356] INFO: Initiating epoch #32 train run on device rank=0 [2024-06-20 10:25:18,356] INFO: Initiating epoch #32 train run on device rank=0 [2024-06-20 11:46:34,734] INFO: Initiating epoch #32 valid run on device rank=0 
[2024-06-20 11:49:36,588] INFO: Rank 0: epoch=32 / 200 train_loss=8.9666 valid_loss=9.1808 stale=2 time=84.3m eta=15518.8m
[2024-06-20 11:49:36,769] INFO: Initiating epoch #33 train run on device rank=0
[2024-06-20 13:08:37,608] INFO: Initiating epoch #33 valid run on device rank=0
[2024-06-20 13:11:49,084] INFO: Rank 0: epoch=33 / 200 train_loss=8.9387 valid_loss=9.1135 stale=0 time=82.21m eta=15375.0m
[2024-06-20 13:11:51,241] INFO: Initiating epoch #34 train run on device rank=0
[2024-06-20 15:24:59,745] INFO: Initiating epoch #34 valid run on device rank=0
[2024-06-20 15:34:12,490] INFO: Rank 0: epoch=34 / 200 train_loss=8.9225 valid_loss=9.0744 stale=0 time=142.35m eta=15528.6m
[2024-06-20 15:34:12,679] INFO: Initiating epoch #35 train run on device rank=0
[2024-06-20 17:47:54,180] INFO: Initiating epoch #35 valid run on device rank=0
[2024-06-20 17:57:02,399] INFO: Rank 0: epoch=35 / 200 train_loss=8.8969 valid_loss=9.0835 stale=1 time=142.83m eta=15667.4m
[2024-06-20 17:57:02,523] INFO: Initiating epoch #36 train run on device rank=0
[2024-06-20 19:26:21,953] INFO: Initiating epoch #36 valid run on device rank=0
[2024-06-20 19:29:28,986] INFO: Rank 0: epoch=36 / 200 train_loss=8.8835 valid_loss=9.0562 stale=0 time=92.44m eta=15561.1m
[2024-06-20 19:29:29,454] INFO: Initiating epoch #37 train run on device rank=0
[2024-06-20 20:51:18,730] INFO: Initiating epoch #37 valid run on device rank=0
[2024-06-20 20:54:18,589] INFO: Rank 0: epoch=37 / 200 train_loss=8.8599 valid_loss=9.0476 stale=0 time=84.82m eta=15421.9m
[2024-06-20 20:54:18,827] INFO: Initiating epoch #38 train run on device rank=0
[2024-06-20 23:06:52,164] INFO: Initiating epoch #38 valid run on device rank=0
[2024-06-20 23:16:04,942] INFO: Rank 0: epoch=38 / 200 train_loss=8.8405 valid_loss=9.0432 stale=0 time=141.77m eta=15528.3m
[2024-06-20 23:16:05,129] INFO: Initiating epoch #39 train run on device rank=0
[2024-06-21 00:51:34,192] INFO: Initiating epoch #39 valid run on device rank=0
[2024-06-21 00:54:46,748] INFO: Rank 0: epoch=39 / 200 train_loss=8.8224 valid_loss=9.0523 stale=1 time=98.69m eta=15444.2m
[2024-06-21 00:54:47,771] INFO: Initiating epoch #40 train run on device rank=0
[2024-06-21 02:25:01,328] INFO: Initiating epoch #40 valid run on device rank=0
[2024-06-21 02:28:13,751] INFO: Rank 0: epoch=40 / 200 train_loss=8.8077 valid_loss=9.0308 stale=0 time=93.43m eta=15338.3m
[2024-06-21 02:28:14,567] INFO: Initiating epoch #41 train run on device rank=0
[2024-06-21 04:31:52,173] INFO: Initiating epoch #41 valid run on device rank=0
[2024-06-21 04:35:16,571] INFO: Rank 0: epoch=41 / 200 train_loss=8.7894 valid_loss=9.0538 stale=1 time=127.03m eta=15363.4m
[2024-06-21 04:35:17,671] INFO: Initiating epoch #42 train run on device rank=0
[2024-06-21 06:37:40,102] INFO: Initiating epoch #42 valid run on device rank=0
[2024-06-21 06:40:49,377] INFO: Rank 0: epoch=42 / 200 train_loss=8.7755 valid_loss=8.9476 stale=0 time=125.53m eta=15375.6m
[2024-06-21 06:40:50,796] INFO: Initiating epoch #43 train run on device rank=0
[2024-06-21 08:54:21,016] INFO: Initiating epoch #43 valid run on device rank=0
[2024-06-21 09:03:22,400] INFO: Rank 0: epoch=43 / 200 train_loss=8.7636 valid_loss=8.9328 stale=0 time=142.53m eta=15443.4m
[2024-06-21 09:03:22,747] INFO: Initiating epoch #44 train run on device rank=0
[2024-06-21 10:32:02,137] INFO: Initiating epoch #44 valid run on device rank=0
[2024-06-21 10:35:00,447] INFO: Rank 0: epoch=44 / 200 train_loss=8.7431 valid_loss=9.0081 stale=1 time=91.63m eta=15321.2m
[2024-06-21 10:35:01,256] INFO: Initiating epoch #45 train run on device rank=0
[2024-06-21 11:46:47,038] INFO: Initiating epoch #45 valid run on device rank=0
[2024-06-21 11:51:26,515] INFO: Rank 0: epoch=45 / 200 train_loss=8.7296 valid_loss=9.0149 stale=2 time=76.42m eta=15148.0m
[2024-06-21 11:51:26,674] INFO: Initiating epoch #46 train run on device rank=0
[2024-06-21 13:03:09,496] INFO: Initiating epoch #46 valid run on device rank=0
[2024-06-21 13:07:52,299] INFO: Rank 0: epoch=46 / 200 train_loss=8.7237 valid_loss=8.8755 stale=0 time=76.43m eta=14978.9m
[2024-06-21 13:07:53,976] INFO: Initiating epoch #47 train run on device rank=0
[2024-06-21 14:19:42,154] INFO: Initiating epoch #47 valid run on device rank=0
[2024-06-21 14:24:21,550] INFO: Rank 0: epoch=47 / 200 train_loss=8.7017 valid_loss=8.9666 stale=1 time=76.46m eta=14814.0m
[2024-06-21 14:24:21,669] INFO: Initiating epoch #48 train run on device rank=0
[2024-06-21 15:36:05,303] INFO: Initiating epoch #48 valid run on device rank=0
[2024-06-21 15:38:52,731] INFO: Rank 0: epoch=48 / 200 train_loss=8.6911 valid_loss=8.8607 stale=0 time=74.52m eta=14646.6m
[2024-06-21 15:38:53,454] INFO: Initiating epoch #49 train run on device rank=0
[2024-06-21 16:50:40,895] INFO: Initiating epoch #49 valid run on device rank=0
[2024-06-21 16:53:32,362] INFO: Rank 0: epoch=49 / 200 train_loss=8.6755 valid_loss=8.9118 stale=1 time=74.65m eta=14483.4m
[2024-06-21 16:53:32,463] INFO: Initiating epoch #50 train run on device rank=0
[2024-06-21 18:05:06,889] INFO: Initiating epoch #50 valid run on device rank=0
[2024-06-21 18:07:56,508] INFO: Rank 0: epoch=50 / 200 train_loss=8.6637 valid_loss=8.9198 stale=2 time=74.4m eta=14322.9m
[2024-06-21 18:07:56,648] INFO: Initiating epoch #51 train run on device rank=0
[2024-06-21 19:19:38,543] INFO: Initiating epoch #51 valid run on device rank=0
[2024-06-21 19:22:31,414] INFO: Rank 0: epoch=51 / 200 train_loss=8.6516 valid_loss=8.8671 stale=3 time=74.58m eta=14166.3m
[2024-06-21 19:22:31,939] INFO: Initiating epoch #52 train run on device rank=0
[2024-06-21 20:34:08,594] INFO: Initiating epoch #52 valid run on device rank=0
[2024-06-21 20:36:59,653] INFO: Rank 0: epoch=52 / 200 train_loss=8.6408 valid_loss=8.8645 stale=4 time=74.46m eta=14012.6m
[2024-06-21 20:36:59,877] INFO: Initiating epoch #53 train run on device rank=0
[2024-06-21 21:48:35,844] INFO: Initiating epoch #53 valid run on device rank=0
[2024-06-21 21:51:36,715] INFO: Rank 0: epoch=53 / 200 train_loss=8.6290 valid_loss=8.8523 stale=0 time=74.61m eta=13862.3m
[2024-06-21 21:51:36,942] INFO: Initiating epoch #54 train run on device rank=0
[2024-06-21 23:03:27,361] INFO: Initiating epoch #54 valid run on device rank=0
[2024-06-21 23:08:06,294] INFO: Rank 0: epoch=54 / 200 train_loss=8.6172 valid_loss=8.8603 stale=1 time=76.49m eta=13719.8m
[2024-06-21 23:08:11,164] INFO: Initiating epoch #55 train run on device rank=0
[2024-06-22 00:20:05,741] INFO: Initiating epoch #55 valid run on device rank=0
[2024-06-22 00:24:46,036] INFO: Rank 0: epoch=55 / 200 train_loss=8.6084 valid_loss=8.8279 stale=0 time=76.58m eta=13580.2m
[2024-06-22 00:24:46,981] INFO: Initiating epoch #56 train run on device rank=0
[2024-06-22 01:36:39,679] INFO: Initiating epoch #56 valid run on device rank=0
[2024-06-22 01:39:31,028] INFO: Rank 0: epoch=56 / 200 train_loss=8.5961 valid_loss=8.8231 stale=0 time=74.73m eta=13438.0m
[2024-06-22 01:39:32,083] INFO: Initiating epoch #57 train run on device rank=0
[2024-06-22 02:51:33,104] INFO: Initiating epoch #57 valid run on device rank=0
[2024-06-22 02:56:09,329] INFO: Rank 0: epoch=57 / 200 train_loss=8.5853 valid_loss=8.8373 stale=1 time=76.62m eta=13302.8m
[2024-06-22 02:56:09,515] INFO: Initiating epoch #58 train run on device rank=0
[2024-06-22 04:08:02,189] INFO: Initiating epoch #58 valid run on device rank=0
[2024-06-22 04:12:42,765] INFO: Rank 0: epoch=58 / 200 train_loss=8.5760 valid_loss=8.8279 stale=2 time=76.55m eta=13169.4m
[2024-06-22 04:12:42,852] INFO: Initiating epoch #59 train run on device rank=0
[2024-06-22 05:24:34,401] INFO: Initiating epoch #59 valid run on device rank=0
[2024-06-22 05:29:15,985] INFO: Rank 0: epoch=59 / 200 train_loss=8.5640 valid_loss=8.8211 stale=0 time=76.55m eta=13038.0m
[2024-06-22 05:29:16,135] INFO: Initiating epoch #60 train run on device rank=0
[2024-06-22 06:40:51,472] INFO: Initiating epoch #60 valid run on device rank=0
[2024-06-22 06:43:44,847] INFO: Rank 0: epoch=60 / 200 train_loss=8.5546 valid_loss=8.7972 stale=0 time=74.48m eta=12903.6m
[2024-06-22 06:43:45,145] INFO: Initiating epoch #61 train run on device rank=0
[2024-06-22 07:55:34,178] INFO: Initiating epoch #61 valid run on device rank=0
[2024-06-22 08:00:14,744] INFO: Rank 0: epoch=61 / 200 train_loss=8.5395 valid_loss=8.8231 stale=1 time=76.49m eta=12775.7m
[2024-06-22 08:00:15,042] INFO: Initiating epoch #62 train run on device rank=0
[2024-06-22 09:12:07,654] INFO: Initiating epoch #62 valid run on device rank=0
[2024-06-22 09:16:43,490] INFO: Rank 0: epoch=62 / 200 train_loss=8.5297 valid_loss=8.8142 stale=2 time=76.47m eta=12649.4m
[2024-06-22 09:16:43,553] INFO: Initiating epoch #63 train run on device rank=0
[2024-06-22 10:28:49,807] INFO: Initiating epoch #63 valid run on device rank=0
[2024-06-22 10:33:30,515] INFO: Rank 0: epoch=63 / 200 train_loss=8.5174 valid_loss=8.7742 stale=0 time=76.78m eta=12525.4m
[2024-06-22 10:33:30,652] INFO: Initiating epoch #64 train run on device rank=0
[2024-06-22 11:45:25,434] INFO: Initiating epoch #64 valid run on device rank=0
[2024-06-22 11:50:06,191] INFO: Rank 0: epoch=64 / 200 train_loss=8.5107 valid_loss=8.8185 stale=1 time=76.59m eta=12402.5m
[2024-06-22 11:50:06,503] INFO: Initiating epoch #65 train run on device rank=0
[2024-06-22 13:01:57,810] INFO: Initiating epoch #65 valid run on device rank=0
[2024-06-22 13:06:38,669] INFO: Rank 0: epoch=65 / 200 train_loss=8.4952 valid_loss=8.7336 stale=0 time=76.54m eta=12280.8m
[2024-06-22 13:06:38,909] INFO: Initiating epoch #66 train run on device rank=0
[2024-06-22 14:18:30,902] INFO: Initiating epoch #66 valid run on device rank=0
[2024-06-22 14:23:11,030] INFO: Rank 0: epoch=66 / 200 train_loss=8.4798 valid_loss=8.7685 stale=1 time=76.54m eta=12160.6m
[2024-06-22 14:23:11,146] INFO: Initiating epoch #67 train run on device rank=0
[2024-06-22 15:34:44,652] INFO: Initiating epoch #67 valid run on device rank=0
[2024-06-22 15:37:34,598] INFO: Rank 0: epoch=67 / 200 train_loss=8.4699 valid_loss=8.7327 stale=0 time=74.39m eta=12037.4m
[2024-06-22 15:37:34,751] INFO: Initiating epoch #68 train run on device rank=0
[2024-06-22 16:49:19,619] INFO: Initiating epoch #68 valid run on device rank=0
[2024-06-22 16:54:01,487] INFO: Rank 0: epoch=68 / 200 train_loss=8.4559 valid_loss=8.7072 stale=0 time=76.45m eta=11919.6m
[2024-06-22 16:54:01,601] INFO: Initiating epoch #69 train run on device rank=0
[2024-06-22 18:05:51,039] INFO: Initiating epoch #69 valid run on device rank=0
[2024-06-22 18:10:27,455] INFO: Rank 0: epoch=69 / 200 train_loss=8.4450 valid_loss=8.7470 stale=1 time=76.43m eta=11802.9m
[2024-06-22 18:10:27,603] INFO: Initiating epoch #70 train run on device rank=0
[2024-06-22 19:22:12,402] INFO: Initiating epoch #70 valid run on device rank=0
[2024-06-22 19:25:03,962] INFO: Rank 0: epoch=70 / 200 train_loss=8.4353 valid_loss=8.7547 stale=2 time=74.61m eta=11684.1m
[2024-06-22 19:25:04,120] INFO: Initiating epoch #71 train run on device rank=0
[2024-06-22 20:37:00,938] INFO: Initiating epoch #71 valid run on device rank=0
[2024-06-22 20:41:42,641] INFO: Rank 0: epoch=71 / 200 train_loss=8.4260 valid_loss=8.7360 stale=3 time=76.64m eta=11570.2m
[2024-06-22 20:41:43,285] INFO: Initiating epoch #72 train run on device rank=0
[2024-06-22 21:53:38,286] INFO: Initiating epoch #72 valid run on device rank=0
[2024-06-22 21:58:15,229] INFO: Rank 0: epoch=72 / 200 train_loss=8.4169 valid_loss=8.7065 stale=0 time=76.53m eta=11457.1m
[2024-06-22 21:58:15,306] INFO: Initiating epoch #73 train run on device rank=0
[2024-06-22 23:10:09,132] INFO: Initiating epoch #73 valid run on device rank=0
[2024-06-22 23:14:47,881] INFO: Rank 0: epoch=73 / 200 train_loss=8.4056 valid_loss=8.7580 stale=1 time=76.54m eta=11345.0m
[2024-06-22 23:14:47,958] INFO: Initiating epoch #74 train run on device rank=0
[2024-06-23 00:26:44,489] INFO: Initiating epoch #74 valid run on device rank=0
[2024-06-23 00:31:25,090] INFO: Rank 0: epoch=74 / 200 train_loss=8.3983 valid_loss=8.6703 stale=0 time=76.62m eta=11234.1m [2024-06-23 00:31:25,090] INFO: Rank 0: epoch=74 / 200 train_loss=8.3983 valid_loss=8.6703 stale=0 time=76.62m eta=11234.1m [2024-06-23 00:31:25,161] INFO: Initiating epoch #75 train run on device rank=0 [2024-06-23 00:31:25,161] INFO: Initiating epoch #75 train run on device rank=0 [2024-06-23 01:42:51,711] INFO: Initiating epoch #75 valid run on device rank=0 [2024-06-23 01:42:51,711] INFO: Initiating epoch #75 valid run on device rank=0 [2024-06-23 01:45:47,445] INFO: Rank 0: epoch=75 / 200 train_loss=8.3911 valid_loss=8.6304 stale=0 time=74.37m eta=11120.2m [2024-06-23 01:45:47,445] INFO: Rank 0: epoch=75 / 200 train_loss=8.3911 valid_loss=8.6304 stale=0 time=74.37m eta=11120.2m [2024-06-23 01:45:47,524] INFO: Initiating epoch #76 train run on device rank=0 [2024-06-23 01:45:47,524] INFO: Initiating epoch #76 train run on device rank=0 [2024-06-23 02:57:26,210] INFO: Initiating epoch #76 valid run on device rank=0 [2024-06-23 02:57:26,210] INFO: Initiating epoch #76 valid run on device rank=0 [2024-06-23 03:00:19,657] INFO: Rank 0: epoch=76 / 200 train_loss=8.3868 valid_loss=8.7037 stale=1 time=74.54m eta=11007.7m [2024-06-23 03:00:19,657] INFO: Rank 0: epoch=76 / 200 train_loss=8.3868 valid_loss=8.7037 stale=1 time=74.54m eta=11007.7m [2024-06-23 03:00:21,274] INFO: Initiating epoch #77 train run on device rank=0 [2024-06-23 03:00:21,274] INFO: Initiating epoch #77 train run on device rank=0 [2024-06-23 04:12:17,678] INFO: Initiating epoch #77 valid run on device rank=0 [2024-06-23 04:12:17,678] INFO: Initiating epoch #77 valid run on device rank=0 [2024-06-23 04:16:52,163] INFO: Rank 0: epoch=77 / 200 train_loss=8.3736 valid_loss=8.6790 stale=2 time=76.51m eta=10899.4m [2024-06-23 04:16:52,163] INFO: Rank 0: epoch=77 / 200 train_loss=8.3736 valid_loss=8.6790 stale=2 time=76.51m eta=10899.4m [2024-06-23 04:16:52,423] INFO: Initiating 
epoch #78 train run on device rank=0 [2024-06-23 04:16:52,423] INFO: Initiating epoch #78 train run on device rank=0 [2024-06-23 05:28:46,835] INFO: Initiating epoch #78 valid run on device rank=0 [2024-06-23 05:28:46,835] INFO: Initiating epoch #78 valid run on device rank=0 [2024-06-23 05:33:22,606] INFO: Rank 0: epoch=78 / 200 train_loss=8.3676 valid_loss=8.6334 stale=3 time=76.5m eta=10791.9m [2024-06-23 05:33:22,606] INFO: Rank 0: epoch=78 / 200 train_loss=8.3676 valid_loss=8.6334 stale=3 time=76.5m eta=10791.9m [2024-06-23 05:33:23,093] INFO: Initiating epoch #79 train run on device rank=0 [2024-06-23 05:33:23,093] INFO: Initiating epoch #79 train run on device rank=0 [2024-06-23 06:45:02,343] INFO: Initiating epoch #79 valid run on device rank=0 [2024-06-23 06:45:02,343] INFO: Initiating epoch #79 valid run on device rank=0 [2024-06-23 06:47:53,989] INFO: Rank 0: epoch=79 / 200 train_loss=8.3603 valid_loss=8.6088 stale=0 time=74.51m eta=10682.1m [2024-06-23 06:47:53,989] INFO: Rank 0: epoch=79 / 200 train_loss=8.3603 valid_loss=8.6088 stale=0 time=74.51m eta=10682.1m [2024-06-23 06:47:54,582] INFO: Initiating epoch #80 train run on device rank=0 [2024-06-23 06:47:54,582] INFO: Initiating epoch #80 train run on device rank=0 [2024-06-23 08:00:14,114] INFO: Initiating epoch #80 valid run on device rank=0 [2024-06-23 08:00:14,114] INFO: Initiating epoch #80 valid run on device rank=0 [2024-06-23 08:03:01,178] INFO: Rank 0: epoch=80 / 200 train_loss=8.3531 valid_loss=8.6452 stale=1 time=75.11m eta=10574.1m [2024-06-23 08:03:01,178] INFO: Rank 0: epoch=80 / 200 train_loss=8.3531 valid_loss=8.6452 stale=1 time=75.11m eta=10574.1m [2024-06-23 08:03:01,629] INFO: Initiating epoch #81 train run on device rank=0 [2024-06-23 08:03:01,629] INFO: Initiating epoch #81 train run on device rank=0 [2024-06-23 09:14:48,216] INFO: Initiating epoch #81 valid run on device rank=0 [2024-06-23 09:14:48,216] INFO: Initiating epoch #81 valid run on device rank=0 [2024-06-23 
09:17:49,401] INFO: Rank 0: epoch=81 / 200 train_loss=8.3472 valid_loss=8.6509 stale=2 time=74.8m eta=10466.4m [2024-06-23 09:17:49,401] INFO: Rank 0: epoch=81 / 200 train_loss=8.3472 valid_loss=8.6509 stale=2 time=74.8m eta=10466.4m [2024-06-23 09:17:49,489] INFO: Initiating epoch #82 train run on device rank=0 [2024-06-23 09:17:49,489] INFO: Initiating epoch #82 train run on device rank=0 [2024-06-23 10:29:40,252] INFO: Initiating epoch #82 valid run on device rank=0 [2024-06-23 10:29:40,252] INFO: Initiating epoch #82 valid run on device rank=0 [2024-06-23 10:34:19,098] INFO: Rank 0: epoch=82 / 200 train_loss=8.3397 valid_loss=8.6501 stale=3 time=76.49m eta=10361.9m [2024-06-23 10:34:19,098] INFO: Rank 0: epoch=82 / 200 train_loss=8.3397 valid_loss=8.6501 stale=3 time=76.49m eta=10361.9m [2024-06-23 10:34:19,265] INFO: Initiating epoch #83 train run on device rank=0 [2024-06-23 10:34:19,265] INFO: Initiating epoch #83 train run on device rank=0 [2024-06-23 11:46:08,296] INFO: Initiating epoch #83 valid run on device rank=0 [2024-06-23 11:46:08,296] INFO: Initiating epoch #83 valid run on device rank=0 [2024-06-23 11:50:53,521] INFO: Rank 0: epoch=83 / 200 train_loss=8.3351 valid_loss=8.6750 stale=4 time=76.57m eta=10258.3m [2024-06-23 11:50:53,521] INFO: Rank 0: epoch=83 / 200 train_loss=8.3351 valid_loss=8.6750 stale=4 time=76.57m eta=10258.3m [2024-06-23 11:50:53,774] INFO: Initiating epoch #84 train run on device rank=0 [2024-06-23 11:50:53,774] INFO: Initiating epoch #84 train run on device rank=0 [2024-06-23 13:02:46,279] INFO: Initiating epoch #84 valid run on device rank=0 [2024-06-23 13:02:46,279] INFO: Initiating epoch #84 valid run on device rank=0 [2024-06-23 13:07:30,286] INFO: Rank 0: epoch=84 / 200 train_loss=8.3287 valid_loss=8.6796 stale=5 time=76.61m eta=10155.3m [2024-06-23 13:07:30,286] INFO: Rank 0: epoch=84 / 200 train_loss=8.3287 valid_loss=8.6796 stale=5 time=76.61m eta=10155.3m [2024-06-23 13:07:30,323] INFO: Initiating epoch #85 train 
run on device rank=0 [2024-06-23 13:07:30,323] INFO: Initiating epoch #85 train run on device rank=0 [2024-06-23 14:19:23,367] INFO: Initiating epoch #85 valid run on device rank=0 [2024-06-23 14:19:23,367] INFO: Initiating epoch #85 valid run on device rank=0 [2024-06-23 14:24:07,387] INFO: Rank 0: epoch=85 / 200 train_loss=8.3233 valid_loss=8.6908 stale=6 time=76.62m eta=10053.0m [2024-06-23 14:24:07,387] INFO: Rank 0: epoch=85 / 200 train_loss=8.3233 valid_loss=8.6908 stale=6 time=76.62m eta=10053.0m [2024-06-23 14:24:07,607] INFO: Initiating epoch #86 train run on device rank=0 [2024-06-23 14:24:07,607] INFO: Initiating epoch #86 train run on device rank=0 [2024-06-23 15:36:00,402] INFO: Initiating epoch #86 valid run on device rank=0 [2024-06-23 15:36:00,402] INFO: Initiating epoch #86 valid run on device rank=0 [2024-06-23 15:40:45,195] INFO: Rank 0: epoch=86 / 200 train_loss=8.3179 valid_loss=8.6193 stale=7 time=76.63m eta=9951.3m [2024-06-23 15:40:45,195] INFO: Rank 0: epoch=86 / 200 train_loss=8.3179 valid_loss=8.6193 stale=7 time=76.63m eta=9951.3m [2024-06-23 15:40:45,420] INFO: Initiating epoch #87 train run on device rank=0 [2024-06-23 15:40:45,420] INFO: Initiating epoch #87 train run on device rank=0 [2024-06-23 16:52:36,400] INFO: Initiating epoch #87 valid run on device rank=0 [2024-06-23 16:52:36,400] INFO: Initiating epoch #87 valid run on device rank=0 [2024-06-23 16:55:29,695] INFO: Rank 0: epoch=87 / 200 train_loss=8.3122 valid_loss=8.6162 stale=8 time=74.74m eta=9847.7m [2024-06-23 16:55:29,695] INFO: Rank 0: epoch=87 / 200 train_loss=8.3122 valid_loss=8.6162 stale=8 time=74.74m eta=9847.7m [2024-06-23 16:55:29,874] INFO: Initiating epoch #88 train run on device rank=0 [2024-06-23 16:55:29,874] INFO: Initiating epoch #88 train run on device rank=0 [2024-06-23 18:07:18,841] INFO: Initiating epoch #88 valid run on device rank=0 [2024-06-23 18:07:18,841] INFO: Initiating epoch #88 valid run on device rank=0 [2024-06-23 18:12:02,924] INFO: Rank 
0: epoch=88 / 200 train_loss=8.3075 valid_loss=8.6136 stale=9 time=76.55m eta=9747.1m [2024-06-23 18:12:02,924] INFO: Rank 0: epoch=88 / 200 train_loss=8.3075 valid_loss=8.6136 stale=9 time=76.55m eta=9747.1m [2024-06-23 18:12:03,893] INFO: Initiating epoch #89 train run on device rank=0 [2024-06-23 18:12:03,893] INFO: Initiating epoch #89 train run on device rank=0 [2024-06-23 19:23:56,524] INFO: Initiating epoch #89 valid run on device rank=0 [2024-06-23 19:23:56,524] INFO: Initiating epoch #89 valid run on device rank=0 [2024-06-23 19:28:42,883] INFO: Rank 0: epoch=89 / 200 train_loss=8.3027 valid_loss=8.5956 stale=0 time=76.65m eta=9647.1m [2024-06-23 19:28:42,883] INFO: Rank 0: epoch=89 / 200 train_loss=8.3027 valid_loss=8.5956 stale=0 time=76.65m eta=9647.1m [2024-06-23 19:28:42,952] INFO: Initiating epoch #90 train run on device rank=0 [2024-06-23 19:28:42,952] INFO: Initiating epoch #90 train run on device rank=0 [2024-06-23 20:40:35,213] INFO: Initiating epoch #90 valid run on device rank=0 [2024-06-23 20:40:35,213] INFO: Initiating epoch #90 valid run on device rank=0 [2024-06-23 20:45:21,641] INFO: Rank 0: epoch=90 / 200 train_loss=8.2974 valid_loss=8.5812 stale=0 time=76.64m eta=9547.7m [2024-06-23 20:45:21,641] INFO: Rank 0: epoch=90 / 200 train_loss=8.2974 valid_loss=8.5812 stale=0 time=76.64m eta=9547.7m [2024-06-23 20:45:21,747] INFO: Initiating epoch #91 train run on device rank=0 [2024-06-23 20:45:21,747] INFO: Initiating epoch #91 train run on device rank=0 [2024-06-23 21:57:16,346] INFO: Initiating epoch #91 valid run on device rank=0 [2024-06-23 21:57:16,346] INFO: Initiating epoch #91 valid run on device rank=0 [2024-06-23 22:02:01,445] INFO: Rank 0: epoch=91 / 200 train_loss=8.2924 valid_loss=8.6571 stale=1 time=76.66m eta=9448.7m [2024-06-23 22:02:01,445] INFO: Rank 0: epoch=91 / 200 train_loss=8.2924 valid_loss=8.6571 stale=1 time=76.66m eta=9448.7m [2024-06-23 22:02:01,462] INFO: Initiating epoch #92 train run on device rank=0 [2024-06-23 
22:02:01,462] INFO: Initiating epoch #92 train run on device rank=0 [2024-06-23 23:13:58,333] INFO: Initiating epoch #92 valid run on device rank=0 [2024-06-23 23:18:43,895] INFO: Rank 0: epoch=92 / 200 train_loss=8.2895 valid_loss=8.5959 stale=2 time=76.71m eta=9350.3m [2024-06-23 23:18:44,035] INFO: Initiating epoch #93 train run on device rank=0 [2024-06-24 00:30:23,107] INFO: Initiating epoch #93 valid run on device rank=0 [2024-06-24 00:33:16,549] INFO: Rank 0: epoch=93 / 200 train_loss=8.2862 valid_loss=8.6593 stale=3 time=74.54m eta=9249.9m [2024-06-24 00:33:16,697] INFO: Initiating epoch #94 train run on device rank=0 [2024-06-24 01:45:11,668] INFO: Initiating epoch #94 valid run on device rank=0 [2024-06-24 01:49:51,603] INFO: Rank 0: epoch=94 / 200 train_loss=8.2804 valid_loss=8.6269 stale=4 time=76.58m eta=9152.3m [2024-06-24 01:49:51,977] INFO: Initiating epoch #95 train run on device rank=0 [2024-06-24 03:01:24,836] INFO: Initiating epoch #95 valid run on device rank=0 [2024-06-24 03:04:18,232] INFO: Rank 0: epoch=95 / 200 train_loss=8.2748
valid_loss=8.5622 stale=0 time=74.44m eta=9052.8m [2024-06-24 03:04:18,547] INFO: Initiating epoch #96 train run on device rank=0 [2024-06-24 04:16:09,008] INFO: Initiating epoch #96 valid run on device rank=0 [2024-06-24 04:20:45,600] INFO: Rank 0: epoch=96 / 200 train_loss=8.2731 valid_loss=8.5748 stale=1 time=76.45m eta=8956.0m [2024-06-24 04:20:45,859] INFO: Initiating epoch #97 train run on device rank=0 [2024-06-24 05:32:25,677] INFO: Initiating epoch #97 valid run on device rank=0 [2024-06-24 05:35:09,564] INFO: Rank 0: epoch=97 / 200 train_loss=8.2698 valid_loss=8.5828 stale=2 time=74.4m eta=8857.5m [2024-06-24 05:35:09,735] INFO: Initiating epoch #98 train run on device rank=0 [2024-06-24 06:47:08,284] INFO: Initiating epoch #98 valid run on device rank=0 [2024-06-24 06:51:44,417] INFO: Rank 0: epoch=98 / 200 train_loss=8.2633 valid_loss=8.5884 stale=3 time=76.58m eta=8761.7m [2024-06-24 06:51:44,529] INFO: Initiating epoch
#99 train run on device rank=0 [2024-06-24 08:03:45,674] INFO: Initiating epoch #99 valid run on device rank=0 [2024-06-24 08:08:22,235] INFO: Rank 0: epoch=99 / 200 train_loss=8.2600 valid_loss=8.5977 stale=4 time=76.63m eta=8666.3m [2024-06-24 08:08:22,581] INFO: Initiating epoch #100 train run on device rank=0 [2024-06-24 09:20:23,886] INFO: Initiating epoch #100 valid run on device rank=0 [2024-06-24 09:23:13,530] INFO: Rank 0: epoch=100 / 200 train_loss=8.2551 valid_loss=8.6019 stale=5 time=74.85m eta=8569.6m [2024-06-24 09:23:13,846] INFO: Initiating epoch #101 train run on device rank=0 [2024-06-24 10:34:55,087] INFO: Initiating epoch #101 valid run on device rank=0 [2024-06-24 10:37:49,872] INFO: Rank 0: epoch=101 / 200 train_loss=8.2516 valid_loss=8.6950 stale=6 time=74.6m eta=8473.0m [2024-06-24 10:37:51,449] INFO: Initiating epoch #102 train run on device rank=0 [2024-06-24 11:49:17,921] INFO: Initiating epoch #102 valid run on device rank=0 [2024-06-24 11:52:10,543] INFO: Rank 0: epoch=102 / 200 train_loss=8.2493 valid_loss=8.5738
stale=7 time=74.32m eta=8376.6m [2024-06-24 11:52:11,252] INFO: Initiating epoch #103 train run on device rank=0 [2024-06-24 13:03:39,090] INFO: Initiating epoch #103 valid run on device rank=0 [2024-06-24 13:08:12,401] INFO: Rank 0: epoch=103 / 200 train_loss=8.2513 valid_loss=8.5863 stale=8 time=76.02m eta=8282.3m [2024-06-24 13:08:13,031] INFO: Initiating epoch #104 train run on device rank=0 [2024-06-24 14:19:43,345] INFO: Initiating epoch #104 valid run on device rank=0 [2024-06-24 14:24:19,722] INFO: Rank 0: epoch=104 / 200 train_loss=8.2418 valid_loss=8.5999 stale=9 time=76.11m eta=8188.3m [2024-06-24 14:24:21,345] INFO: Initiating epoch #105 train run on device rank=0 [2024-06-24 15:35:47,656] INFO: Initiating epoch #105 valid run on device rank=0 [2024-06-24 15:38:40,134] INFO: Rank 0: epoch=105 / 200 train_loss=8.2395 valid_loss=8.5837 stale=10 time=74.31m eta=8093.1m [2024-06-24 15:38:41,660] INFO: Initiating
epoch #106 train run on device rank=0 [2024-06-24 16:50:04,462] INFO: Initiating epoch #106 valid run on device rank=0 [2024-06-24 16:52:52,453] INFO: Rank 0: epoch=106 / 200 train_loss=8.2380 valid_loss=8.6008 stale=11 time=74.18m eta=7998.2m [2024-06-24 16:52:53,163] INFO: Initiating epoch #107 train run on device rank=0 [2024-06-24 18:04:11,915] INFO: Initiating epoch #107 valid run on device rank=0 [2024-06-24 18:07:01,322] INFO: Rank 0: epoch=107 / 200 train_loss=8.2338 valid_loss=8.6260 stale=12 time=74.14m eta=7903.6m [2024-06-24 18:07:01,441] INFO: Initiating epoch #108 train run on device rank=0 [2024-06-24 19:18:29,948] INFO: Initiating epoch #108 valid run on device rank=0 [2024-06-24 19:21:15,624] INFO: Rank 0: epoch=108 / 200 train_loss=8.2316 valid_loss=8.6108 stale=13 time=74.24m eta=7809.5m [2024-06-24 19:21:15,893] INFO: Initiating epoch #109 train run on device rank=0 [2024-06-24 20:32:45,983] INFO: Initiating epoch #109 valid run on device rank=0 [2024-06-24 20:35:35,993] INFO: Rank 0: epoch=109 / 200 train_loss=8.2302
valid_loss=8.5670 stale=14 time=74.34m eta=7715.8m [2024-06-24 20:35:36,109] INFO: Initiating epoch #110 train run on device rank=0 [2024-06-24 21:47:16,135] INFO: Initiating epoch #110 valid run on device rank=0 [2024-06-24 21:51:54,332] INFO: Rank 0: epoch=110 / 200 train_loss=8.2269 valid_loss=8.6792 stale=15 time=76.3m eta=7624.0m [2024-06-24 21:51:56,351] INFO: Initiating epoch #111 train run on device rank=0 [2024-06-24 23:03:42,782] INFO: Initiating epoch #111 valid run on device rank=0 [2024-06-24 23:06:34,786] INFO: Rank 0: epoch=111 / 200 train_loss=8.2232 valid_loss=8.5772 stale=16 time=74.64m eta=7531.3m [2024-06-24 23:06:35,616] INFO: Initiating epoch #112 train run on device rank=0 [2024-06-25 00:17:51,855] INFO: Initiating epoch #112 valid run on device rank=0 [2024-06-25 00:20:40,183] INFO: Rank 0: epoch=112 / 200 train_loss=8.2234 valid_loss=8.5558 stale=0 time=74.08m eta=7438.4m [2024-06-25
00:20:40,427] INFO: Initiating epoch #113 train run on device rank=0 [2024-06-25 01:32:00,941] INFO: Initiating epoch #113 valid run on device rank=0 [2024-06-25 01:34:57,124] INFO: Rank 0: epoch=113 / 200 train_loss=8.2204 valid_loss=8.5783 stale=1 time=74.28m eta=7346.0m [2024-06-25 01:34:58,525] INFO: Initiating epoch #114 train run on device rank=0 [2024-06-25 02:46:26,960] INFO: Initiating epoch #114 valid run on device rank=0 [2024-06-25 02:49:12,527] INFO: Rank 0: epoch=114 / 200 train_loss=8.2184 valid_loss=8.6191 stale=2 time=74.23m eta=7253.8m [2024-06-25 02:49:13,908] INFO: Initiating epoch #115 train run on device rank=0 [2024-06-25 04:00:55,518] INFO: Initiating epoch #115 valid run on device rank=0 [2024-06-25 04:05:33,539] INFO: Rank 0: epoch=115 / 200 train_loss=8.2192 valid_loss=8.5531 stale=0 time=76.33m eta=7163.6m [2024-06-25 04:05:33,695] INFO: Initiating epoch #116 train run on device rank=0 [2024-06-25 05:17:01,457] INFO: Initiating epoch #116 valid run on device rank=0 [2024-06-25 05:19:55,255] INFO: Rank 0: epoch=116 /
200 train_loss=8.2128 valid_loss=8.5671 stale=1 time=74.36m eta=7072.1m [2024-06-25 05:19:55,801] INFO: Initiating epoch #117 train run on device rank=0 [2024-06-25 06:31:30,226] INFO: Initiating epoch #117 valid run on device rank=0 [2024-06-25 06:36:09,739] INFO: Rank 0: epoch=117 / 200 train_loss=8.2114 valid_loss=8.6807 stale=2 time=76.23m eta=6982.3m [2024-06-25 06:36:09,886] INFO: Initiating epoch #118 train run on device rank=0 [2024-06-25 07:47:32,157] INFO: Initiating epoch #118 valid run on device rank=0 [2024-06-25 07:50:26,973] INFO: Rank 0: epoch=118 / 200 train_loss=8.2097 valid_loss=8.5323 stale=0 time=74.28m eta=6891.3m [2024-06-25 07:50:27,323] INFO: Initiating epoch #119 train run on device rank=0 [2024-06-25 09:02:09,589] INFO: Initiating epoch #119 valid run on device rank=0 [2024-06-25 09:06:46,190] INFO: Rank 0: epoch=119 / 200 train_loss=8.2094 valid_loss=8.5631 stale=1 time=76.31m eta=6802.0m [2024-06-25 09:06:46,332] INFO: Initiating epoch #120 train run on device rank=0
[2024-06-25 10:18:31,263] INFO: Initiating epoch #120 valid run on device rank=0 [2024-06-25 10:23:10,960] INFO: Rank 0: epoch=120 / 200 train_loss=8.2028 valid_loss=8.5698 stale=2 time=76.41m eta=6713.0m [2024-06-25 10:23:11,041] INFO: Initiating epoch #121 train run on device rank=0 [2024-06-25 11:34:40,389] INFO: Initiating epoch #121 valid run on device rank=0 [2024-06-25 11:37:40,569] INFO: Rank 0: epoch=121 / 200 train_loss=8.2003 valid_loss=8.6528 stale=3 time=74.49m eta=6623.0m [2024-06-25 11:37:40,987] INFO: Initiating epoch #122 train run on device rank=0 [2024-06-25 12:49:28,310] INFO: Initiating epoch #122 valid run on device rank=0 [2024-06-25 12:54:07,573] INFO: Rank 0: epoch=122 / 200 train_loss=8.1980 valid_loss=8.5307 stale=0 time=76.44m eta=6534.4m [2024-06-25 12:54:07,788] INFO: Initiating epoch #123 train run on device rank=0 [2024-06-25 14:05:56,696] INFO: Initiating epoch #123 valid run on device rank=0 [2024-06-25 14:10:35,002] INFO: Rank 0:
epoch=123 / 200 train_loss=8.1908 valid_loss=8.5384 stale=1 time=76.45m eta=6446.1m [2024-06-25 14:10:35,292] INFO: Initiating epoch #124 train run on device rank=0 [2024-06-25 15:22:22,475] INFO: Initiating epoch #124 valid run on device rank=0 [2024-06-25 15:25:12,045] INFO: Rank 0: epoch=124 / 200 train_loss=8.1738 valid_loss=8.5577 stale=2 time=74.61m eta=6356.8m [2024-06-25 15:25:13,478] INFO: Initiating epoch #125 train run on device rank=0 [2024-06-25 16:36:38,180] INFO: Initiating epoch #125 valid run on device rank=0 [2024-06-25 16:39:28,458] INFO: Rank 0: epoch=125 / 200 train_loss=8.1500 valid_loss=8.4825 stale=0 time=74.25m eta=6267.5m [2024-06-25 16:39:28,942] INFO: Initiating epoch #126 train run on device rank=0 [2024-06-25 17:51:15,492] INFO: Initiating epoch #126 valid run on device rank=0 [2024-06-25 17:55:55,803] INFO: Rank 0: epoch=126 / 200 train_loss=8.1291 valid_loss=8.5019 stale=1 time=76.45m eta=6179.8m [2024-06-25 17:55:56,517] INFO: Initiating epoch #127 train run on device
rank=0 [2024-06-25 19:07:40,504] INFO: Initiating epoch #127 valid run on device rank=0 [2024-06-25 19:12:18,259] INFO: Rank 0: epoch=127 / 200 train_loss=8.1149 valid_loss=8.4319 stale=0 time=76.36m eta=6092.1m [2024-06-25 19:12:18,328] INFO: Initiating epoch #128 train run on device rank=0 [2024-06-25 20:24:03,005] INFO: Initiating epoch #128 valid run on device rank=0 [2024-06-25 20:28:42,471] INFO: Rank 0: epoch=128 / 200 train_loss=8.1028 valid_loss=8.4436 stale=1 time=76.4m eta=6004.7m [2024-06-25 20:28:43,100] INFO: Initiating epoch #129 train run on device rank=0 [2024-06-25 21:40:27,847] INFO: Initiating epoch #129 valid run on device rank=0 [2024-06-25 21:43:39,082] INFO: Rank 0: epoch=129 / 200 train_loss=8.0933 valid_loss=8.4273 stale=0 time=74.93m eta=5916.7m [2024-06-25 21:43:40,170] INFO: Initiating epoch #130 train run on device rank=0 [2024-06-25 22:55:41,623] INFO: Initiating epoch #130 valid run on device rank=0 [2024-06-25 22:58:42,162] INFO:
Rank 0: epoch=130 / 200 train_loss=8.0873 valid_loss=8.4727 stale=1 time=75.03m eta=5828.9m [2024-06-25 22:58:42,496] INFO: Initiating epoch #131 train run on device rank=0 [2024-06-26 00:15:39,670] INFO: Initiating epoch #131 valid run on device rank=0 [2024-06-26 00:18:30,835] INFO: Rank 0: epoch=131 / 200 train_loss=8.0807 valid_loss=8.4227 stale=0 time=79.81m eta=5743.8m [2024-06-26 00:18:31,146] INFO: Initiating epoch #132 train run on device rank=0 [2024-06-26 01:29:52,949] INFO: Initiating epoch #132 valid run on device rank=0 [2024-06-26 01:32:58,998] INFO: Rank 0: epoch=132 / 200 train_loss=8.0738 valid_loss=8.4331 stale=1 time=74.46m eta=5656.0m [2024-06-26 01:33:00,495] INFO: Initiating epoch #133 train run on device rank=0 [2024-06-26 02:44:21,022] INFO: Initiating epoch #133 valid run on device rank=0 [2024-06-26 02:46:57,849] INFO: Rank 0: epoch=133 / 200 train_loss=8.0698 valid_loss=8.4712 stale=2 time=73.96m eta=5568.2m [2024-06-26 02:46:58,076] INFO: Initiating epoch #134 train run on
device rank=0 [2024-06-26 03:58:39,991] INFO: Initiating epoch #134 valid run on device rank=0 [2024-06-26 04:01:27,142] INFO: Rank 0: epoch=134 / 200 train_loss=8.0659 valid_loss=8.4378 stale=3 time=74.48m eta=5480.9m [2024-06-26 04:01:27,398] INFO: Initiating epoch #135 train run on device rank=0 [2024-06-26 05:12:51,472] INFO: Initiating epoch #135 valid run on device rank=0 [2024-06-26 05:17:36,238] INFO: Rank 0: epoch=135 / 200 train_loss=8.0743 valid_loss=8.4862 stale=4 time=76.15m eta=5394.5m [2024-06-26 05:17:36,644] INFO: Initiating epoch #136 train run on device rank=0 [2024-06-26 06:29:03,840] INFO: Initiating epoch #136 valid run on device rank=0 [2024-06-26 06:31:47,648] INFO: Rank 0: epoch=136 / 200 train_loss=8.0682 valid_loss=8.3990 stale=0 time=74.18m eta=5307.4m [2024-06-26 06:31:48,467] INFO: Initiating epoch #137 train run on device rank=0 [2024-06-26 07:43:28,740] INFO: Initiating epoch #137 valid run on device rank=0 [2024-06-26 07:46:27,684]
INFO: Rank 0: epoch=137 / 200 train_loss=8.0581 valid_loss=8.3940 stale=0 time=74.65m eta=5220.6m [2024-06-26 07:46:28,102] INFO: Initiating epoch #138 train run on device rank=0 [2024-06-26 08:57:54,969] INFO: Initiating epoch #138 valid run on device rank=0 [2024-06-26 09:00:47,981] INFO: Rank 0: epoch=138 / 200 train_loss=8.0521 valid_loss=8.4395 stale=1 time=74.33m eta=5133.9m [2024-06-26 09:00:48,112] INFO: Initiating epoch #139 train run on device rank=0 [2024-06-26 10:12:40,898] INFO: Initiating epoch #139 valid run on device rank=0 [2024-06-26 10:17:18,231] INFO: Rank 0: epoch=139 / 200 train_loss=8.0468 valid_loss=8.4252 stale=2 time=76.5m eta=5048.4m [2024-06-26 10:17:18,614] INFO: Initiating epoch #140 train run on device rank=0 [2024-06-26 11:29:10,619] INFO: Initiating epoch #140 valid run on device rank=0 [2024-06-26 11:33:51,375] INFO: Rank 0: epoch=140 / 200 train_loss=8.0420 valid_loss=8.3894 stale=0 time=76.55m eta=4962.9m [2024-06-26 11:33:51,497] INFO: Initiating epoch #141 train run
on device rank=0 [2024-06-26 12:45:48,953] INFO: Initiating epoch #141 valid run on device rank=0 [2024-06-26 12:48:48,519] INFO: Rank 0: epoch=141 / 200 train_loss=8.0411 valid_loss=8.3807 stale=0 time=74.95m eta=4877.0m [2024-06-26 12:48:49,321] INFO: Initiating epoch #142 train run on device rank=0 [2024-06-26 14:01:47,992] INFO: Initiating epoch #142 valid run on device rank=0 [2024-06-26 14:05:55,683] INFO: Rank 0: epoch=142 / 200 train_loss=8.0363 valid_loss=8.4319 stale=1 time=77.11m eta=4792.1m [2024-06-26 14:05:56,198] INFO: Initiating epoch #143 train run on device rank=0 [2024-06-26 15:23:15,566] INFO: Initiating epoch #143 valid run on device rank=0 [2024-06-26 15:28:01,091] INFO: Rank 0: epoch=143 / 200 train_loss=8.0327 valid_loss=8.4217 stale=2 time=82.08m eta=4709.2m [2024-06-26 15:28:01,305] INFO: Initiating epoch #144 train run on device rank=0 [2024-06-26 16:44:14,520] INFO: Initiating epoch #144 valid run on device rank=0 [2024-06-26
16:49:00,210] INFO: Rank 0: epoch=144 / 200 train_loss=8.0311 valid_loss=8.4183 stale=3 time=80.98m eta=4626.0m [2024-06-26 16:49:00,436] INFO: Initiating epoch #145 train run on device rank=0 [2024-06-26 18:03:40,924] INFO: Initiating epoch #145 valid run on device rank=0 [2024-06-26 18:08:39,909] INFO: Rank 0: epoch=145 / 200 train_loss=8.0375 valid_loss=8.3653 stale=0 time=79.66m eta=4542.2m [2024-06-26 18:08:40,549] INFO: Initiating epoch #146 train run on device rank=0 [2024-06-26 19:21:57,084] INFO: Initiating epoch #146 valid run on device rank=0 [2024-06-26 19:25:15,367] INFO: Rank 0: epoch=146 / 200 train_loss=8.0252 valid_loss=8.3828 stale=1 time=76.58m eta=4457.4m [2024-06-26 19:25:15,645] INFO: Initiating epoch #147 train run on device rank=0 [2024-06-26 20:38:31,213] INFO: Initiating epoch #147 valid run on device rank=0 [2024-06-26 20:41:19,233] INFO: Rank 0: epoch=147 / 200 train_loss=8.0243 valid_loss=8.3876 stale=2 time=76.06m eta=4372.6m [2024-06-26 20:41:19,473] INFO: Initiating
epoch #148 train run on device rank=0 [2024-06-26 21:53:20,252] INFO: Initiating epoch #148 valid run on device rank=0 [2024-06-26 21:57:58,559] INFO: Rank 0: epoch=148 / 200 train_loss=8.0198 valid_loss=8.4000 stale=3 time=76.65m eta=4288.0m [2024-06-26 21:57:59,058] INFO: Initiating epoch #149 train run on device rank=0 [2024-06-26 23:09:43,985] INFO: Initiating epoch #149 valid run on device rank=0 [2024-06-26 23:14:27,074] INFO: Rank 0: epoch=149 / 200 train_loss=8.0185 valid_loss=8.3747 stale=4 time=76.47m eta=4203.5m [2024-06-26 23:14:27,430] INFO: Initiating epoch #150 train run on device rank=0 [2024-06-27 00:26:17,793] INFO: Initiating epoch #150 valid run on device rank=0 [2024-06-27 00:29:01,672] INFO: Rank 0: epoch=150 / 200 train_loss=8.0137 valid_loss=8.4078 stale=5 time=74.57m eta=4118.5m [2024-06-27 00:29:02,782] INFO: Initiating epoch #151 train run on device rank=0 [2024-06-27 01:40:38,763] INFO: Initiating epoch #151 valid run on device rank=0
[2024-06-27 01:43:18,290] INFO: Rank 0: epoch=151 / 200 train_loss=8.0125 valid_loss=8.3787 stale=6 time=74.26m eta=4033.5m [2024-06-27 01:43:18,290] INFO: Rank 0: epoch=151 / 200 train_loss=8.0125 valid_loss=8.3787 stale=6 time=74.26m eta=4033.5m [2024-06-27 01:43:18,673] INFO: Initiating epoch #152 train run on device rank=0 [2024-06-27 01:43:18,673] INFO: Initiating epoch #152 train run on device rank=0 [2024-06-27 02:54:52,144] INFO: Initiating epoch #152 valid run on device rank=0 [2024-06-27 02:54:52,144] INFO: Initiating epoch #152 valid run on device rank=0 [2024-06-27 02:57:37,627] INFO: Rank 0: epoch=152 / 200 train_loss=8.0070 valid_loss=8.3618 stale=0 time=74.32m eta=3948.6m [2024-06-27 02:57:37,627] INFO: Rank 0: epoch=152 / 200 train_loss=8.0070 valid_loss=8.3618 stale=0 time=74.32m eta=3948.6m [2024-06-27 02:57:37,942] INFO: Initiating epoch #153 train run on device rank=0 [2024-06-27 02:57:37,942] INFO: Initiating epoch #153 train run on device rank=0 [2024-06-27 04:09:20,619] INFO: Initiating epoch #153 valid run on device rank=0 [2024-06-27 04:09:20,619] INFO: Initiating epoch #153 valid run on device rank=0 [2024-06-27 04:13:59,820] INFO: Rank 0: epoch=153 / 200 train_loss=8.0060 valid_loss=8.4050 stale=1 time=76.36m eta=3864.6m [2024-06-27 04:13:59,820] INFO: Rank 0: epoch=153 / 200 train_loss=8.0060 valid_loss=8.4050 stale=1 time=76.36m eta=3864.6m [2024-06-27 04:14:00,166] INFO: Initiating epoch #154 train run on device rank=0 [2024-06-27 04:14:00,166] INFO: Initiating epoch #154 train run on device rank=0 [2024-06-27 05:25:35,653] INFO: Initiating epoch #154 valid run on device rank=0 [2024-06-27 05:25:35,653] INFO: Initiating epoch #154 valid run on device rank=0 [2024-06-27 05:28:22,706] INFO: Rank 0: epoch=154 / 200 train_loss=8.0036 valid_loss=8.4303 stale=2 time=74.38m eta=3780.0m [2024-06-27 05:28:22,706] INFO: Rank 0: epoch=154 / 200 train_loss=8.0036 valid_loss=8.4303 stale=2 time=74.38m eta=3780.0m [2024-06-27 05:28:23,066] INFO: 
Initiating epoch #155 train run on device rank=0 [2024-06-27 05:28:23,066] INFO: Initiating epoch #155 train run on device rank=0 [2024-06-27 06:40:15,602] INFO: Initiating epoch #155 valid run on device rank=0 [2024-06-27 06:40:15,602] INFO: Initiating epoch #155 valid run on device rank=0 [2024-06-27 06:44:54,437] INFO: Rank 0: epoch=155 / 200 train_loss=8.0013 valid_loss=8.3903 stale=3 time=76.52m eta=3696.2m [2024-06-27 06:44:54,437] INFO: Rank 0: epoch=155 / 200 train_loss=8.0013 valid_loss=8.3903 stale=3 time=76.52m eta=3696.2m [2024-06-27 06:44:54,791] INFO: Initiating epoch #156 train run on device rank=0 [2024-06-27 06:44:54,791] INFO: Initiating epoch #156 train run on device rank=0 [2024-06-27 07:56:35,941] INFO: Initiating epoch #156 valid run on device rank=0 [2024-06-27 07:56:35,941] INFO: Initiating epoch #156 valid run on device rank=0 [2024-06-27 07:59:19,431] INFO: Rank 0: epoch=156 / 200 train_loss=7.9954 valid_loss=8.3954 stale=4 time=74.41m eta=3611.9m [2024-06-27 07:59:19,431] INFO: Rank 0: epoch=156 / 200 train_loss=7.9954 valid_loss=8.3954 stale=4 time=74.41m eta=3611.9m [2024-06-27 07:59:19,764] INFO: Initiating epoch #157 train run on device rank=0 [2024-06-27 07:59:19,764] INFO: Initiating epoch #157 train run on device rank=0 [2024-06-27 09:10:56,629] INFO: Initiating epoch #157 valid run on device rank=0 [2024-06-27 09:10:56,629] INFO: Initiating epoch #157 valid run on device rank=0 [2024-06-27 09:13:45,034] INFO: Rank 0: epoch=157 / 200 train_loss=8.0148 valid_loss=8.3571 stale=0 time=74.42m eta=3527.7m [2024-06-27 09:13:45,034] INFO: Rank 0: epoch=157 / 200 train_loss=8.0148 valid_loss=8.3571 stale=0 time=74.42m eta=3527.7m [2024-06-27 09:13:45,825] INFO: Initiating epoch #158 train run on device rank=0 [2024-06-27 09:13:45,825] INFO: Initiating epoch #158 train run on device rank=0 [2024-06-27 10:25:17,386] INFO: Initiating epoch #158 valid run on device rank=0 [2024-06-27 10:25:17,386] INFO: Initiating epoch #158 valid run on 
device rank=0 [2024-06-27 10:28:04,318] INFO: Rank 0: epoch=158 / 200 train_loss=8.0088 valid_loss=8.3862 stale=1 time=74.31m eta=3443.6m [2024-06-27 10:28:04,318] INFO: Rank 0: epoch=158 / 200 train_loss=8.0088 valid_loss=8.3862 stale=1 time=74.31m eta=3443.6m [2024-06-27 10:28:04,831] INFO: Initiating epoch #159 train run on device rank=0 [2024-06-27 10:28:04,831] INFO: Initiating epoch #159 train run on device rank=0 [2024-06-27 11:39:34,378] INFO: Initiating epoch #159 valid run on device rank=0 [2024-06-27 11:39:34,378] INFO: Initiating epoch #159 valid run on device rank=0 [2024-06-27 11:44:13,037] INFO: Rank 0: epoch=159 / 200 train_loss=7.9966 valid_loss=8.3227 stale=0 time=76.14m eta=3360.1m [2024-06-27 11:44:13,037] INFO: Rank 0: epoch=159 / 200 train_loss=7.9966 valid_loss=8.3227 stale=0 time=76.14m eta=3360.1m [2024-06-27 11:44:13,830] INFO: Initiating epoch #160 train run on device rank=0 [2024-06-27 11:44:13,830] INFO: Initiating epoch #160 train run on device rank=0 [2024-06-27 12:55:49,059] INFO: Initiating epoch #160 valid run on device rank=0 [2024-06-27 12:55:49,059] INFO: Initiating epoch #160 valid run on device rank=0 [2024-06-27 12:58:35,092] INFO: Rank 0: epoch=160 / 200 train_loss=7.9921 valid_loss=8.4083 stale=1 time=74.35m eta=3276.2m [2024-06-27 12:58:35,092] INFO: Rank 0: epoch=160 / 200 train_loss=7.9921 valid_loss=8.4083 stale=1 time=74.35m eta=3276.2m [2024-06-27 12:58:35,239] INFO: Initiating epoch #161 train run on device rank=0 [2024-06-27 12:58:35,239] INFO: Initiating epoch #161 train run on device rank=0 [2024-06-27 14:10:08,436] INFO: Initiating epoch #161 valid run on device rank=0 [2024-06-27 14:10:08,436] INFO: Initiating epoch #161 valid run on device rank=0 [2024-06-27 14:14:50,632] INFO: Rank 0: epoch=161 / 200 train_loss=7.9903 valid_loss=8.3973 stale=2 time=76.26m eta=3193.0m [2024-06-27 14:14:50,632] INFO: Rank 0: epoch=161 / 200 train_loss=7.9903 valid_loss=8.3973 stale=2 time=76.26m eta=3193.0m [2024-06-27 
14:14:50,751] INFO: Initiating epoch #162 train run on device rank=0 [2024-06-27 14:14:50,751] INFO: Initiating epoch #162 train run on device rank=0 [2024-06-27 15:26:38,497] INFO: Initiating epoch #162 valid run on device rank=0 [2024-06-27 15:26:38,497] INFO: Initiating epoch #162 valid run on device rank=0 [2024-06-27 15:31:15,752] INFO: Rank 0: epoch=162 / 200 train_loss=7.9820 valid_loss=8.3575 stale=3 time=76.42m eta=3109.8m [2024-06-27 15:31:15,752] INFO: Rank 0: epoch=162 / 200 train_loss=7.9820 valid_loss=8.3575 stale=3 time=76.42m eta=3109.8m [2024-06-27 15:31:17,559] INFO: Initiating epoch #163 train run on device rank=0 [2024-06-27 15:31:17,559] INFO: Initiating epoch #163 train run on device rank=0 [2024-06-27 16:43:21,898] INFO: Initiating epoch #163 valid run on device rank=0 [2024-06-27 16:43:21,898] INFO: Initiating epoch #163 valid run on device rank=0 [2024-06-27 16:46:49,286] INFO: Rank 0: epoch=163 / 200 train_loss=7.9782 valid_loss=8.3570 stale=4 time=75.53m eta=3026.5m [2024-06-27 16:46:49,286] INFO: Rank 0: epoch=163 / 200 train_loss=7.9782 valid_loss=8.3570 stale=4 time=75.53m eta=3026.5m [2024-06-27 16:46:49,839] INFO: Initiating epoch #164 train run on device rank=0 [2024-06-27 16:46:49,839] INFO: Initiating epoch #164 train run on device rank=0 [2024-06-27 17:58:54,043] INFO: Initiating epoch #164 valid run on device rank=0 [2024-06-27 17:58:54,043] INFO: Initiating epoch #164 valid run on device rank=0 [2024-06-27 18:01:37,602] INFO: Rank 0: epoch=164 / 200 train_loss=7.9760 valid_loss=8.4461 stale=5 time=74.8m eta=2943.2m [2024-06-27 18:01:37,602] INFO: Rank 0: epoch=164 / 200 train_loss=7.9760 valid_loss=8.4461 stale=5 time=74.8m eta=2943.2m [2024-06-27 18:01:38,241] INFO: Initiating epoch #165 train run on device rank=0 [2024-06-27 18:01:38,241] INFO: Initiating epoch #165 train run on device rank=0 [2024-06-27 19:13:37,376] INFO: Initiating epoch #165 valid run on device rank=0 [2024-06-27 19:13:37,376] INFO: Initiating epoch #165 
valid run on device rank=0 [2024-06-27 19:16:21,219] INFO: Rank 0: epoch=165 / 200 train_loss=7.9718 valid_loss=8.3635 stale=6 time=74.72m eta=2860.0m [2024-06-27 19:16:21,219] INFO: Rank 0: epoch=165 / 200 train_loss=7.9718 valid_loss=8.3635 stale=6 time=74.72m eta=2860.0m [2024-06-27 19:16:22,823] INFO: Initiating epoch #166 train run on device rank=0 [2024-06-27 19:16:22,823] INFO: Initiating epoch #166 train run on device rank=0 [2024-06-27 20:28:34,195] INFO: Initiating epoch #166 valid run on device rank=0 [2024-06-27 20:28:34,195] INFO: Initiating epoch #166 valid run on device rank=0 [2024-06-27 20:33:13,420] INFO: Rank 0: epoch=166 / 200 train_loss=7.9706 valid_loss=8.4313 stale=7 time=76.84m eta=2777.3m [2024-06-27 20:33:13,420] INFO: Rank 0: epoch=166 / 200 train_loss=7.9706 valid_loss=8.4313 stale=7 time=76.84m eta=2777.3m [2024-06-27 20:33:14,451] INFO: Initiating epoch #167 train run on device rank=0 [2024-06-27 20:33:14,451] INFO: Initiating epoch #167 train run on device rank=0 [2024-06-27 21:45:13,173] INFO: Initiating epoch #167 valid run on device rank=0 [2024-06-27 21:45:13,173] INFO: Initiating epoch #167 valid run on device rank=0 [2024-06-27 21:48:05,239] INFO: Rank 0: epoch=167 / 200 train_loss=7.9685 valid_loss=8.3837 stale=8 time=74.85m eta=2694.2m [2024-06-27 21:48:05,239] INFO: Rank 0: epoch=167 / 200 train_loss=7.9685 valid_loss=8.3837 stale=8 time=74.85m eta=2694.2m [2024-06-27 21:48:05,998] INFO: Initiating epoch #168 train run on device rank=0 [2024-06-27 21:48:05,998] INFO: Initiating epoch #168 train run on device rank=0 [2024-06-27 23:00:04,879] INFO: Initiating epoch #168 valid run on device rank=0 [2024-06-27 23:00:04,879] INFO: Initiating epoch #168 valid run on device rank=0 [2024-06-27 23:03:04,474] INFO: Rank 0: epoch=168 / 200 train_loss=7.9673 valid_loss=8.3496 stale=9 time=74.97m eta=2611.3m [2024-06-27 23:03:04,474] INFO: Rank 0: epoch=168 / 200 train_loss=7.9673 valid_loss=8.3496 stale=9 time=74.97m eta=2611.3m 
[2024-06-27 23:03:05,109] INFO: Initiating epoch #169 train run on device rank=0 [2024-06-27 23:03:05,109] INFO: Initiating epoch #169 train run on device rank=0 [2024-06-28 00:14:29,124] INFO: Initiating epoch #169 valid run on device rank=0 [2024-06-28 00:14:29,124] INFO: Initiating epoch #169 valid run on device rank=0 [2024-06-28 00:17:29,064] INFO: Rank 0: epoch=169 / 200 train_loss=7.9645 valid_loss=8.4022 stale=10 time=74.4m eta=2528.4m [2024-06-28 00:17:29,064] INFO: Rank 0: epoch=169 / 200 train_loss=7.9645 valid_loss=8.4022 stale=10 time=74.4m eta=2528.4m [2024-06-28 00:17:29,294] INFO: Initiating epoch #170 train run on device rank=0 [2024-06-28 00:17:29,294] INFO: Initiating epoch #170 train run on device rank=0 [2024-06-28 01:29:07,393] INFO: Initiating epoch #170 valid run on device rank=0 [2024-06-28 01:29:07,393] INFO: Initiating epoch #170 valid run on device rank=0 [2024-06-28 01:33:11,873] INFO: Rank 0: epoch=170 / 200 train_loss=7.9621 valid_loss=8.3460 stale=11 time=75.71m eta=2445.8m [2024-06-28 01:33:11,873] INFO: Rank 0: epoch=170 / 200 train_loss=7.9621 valid_loss=8.3460 stale=11 time=75.71m eta=2445.8m [2024-06-28 01:33:12,293] INFO: Initiating epoch #171 train run on device rank=0 [2024-06-28 01:33:12,293] INFO: Initiating epoch #171 train run on device rank=0 [2024-06-28 02:44:51,082] INFO: Initiating epoch #171 valid run on device rank=0 [2024-06-28 02:44:51,082] INFO: Initiating epoch #171 valid run on device rank=0 [2024-06-28 02:47:33,403] INFO: Rank 0: epoch=171 / 200 train_loss=7.9642 valid_loss=8.3765 stale=12 time=74.35m eta=2363.1m [2024-06-28 02:47:33,403] INFO: Rank 0: epoch=171 / 200 train_loss=7.9642 valid_loss=8.3765 stale=12 time=74.35m eta=2363.1m [2024-06-28 02:47:33,620] INFO: Initiating epoch #172 train run on device rank=0 [2024-06-28 02:47:33,620] INFO: Initiating epoch #172 train run on device rank=0 [2024-06-28 03:59:05,360] INFO: Initiating epoch #172 valid run on device rank=0 [2024-06-28 03:59:05,360] INFO: 
Initiating epoch #172 valid run on device rank=0 [2024-06-28 04:01:52,971] INFO: Rank 0: epoch=172 / 200 train_loss=7.9592 valid_loss=8.3350 stale=13 time=74.32m eta=2280.4m [2024-06-28 04:01:52,971] INFO: Rank 0: epoch=172 / 200 train_loss=7.9592 valid_loss=8.3350 stale=13 time=74.32m eta=2280.4m [2024-06-28 04:01:54,112] INFO: Initiating epoch #173 train run on device rank=0 [2024-06-28 04:01:54,112] INFO: Initiating epoch #173 train run on device rank=0 [2024-06-28 05:13:39,863] INFO: Initiating epoch #173 valid run on device rank=0 [2024-06-28 05:13:39,863] INFO: Initiating epoch #173 valid run on device rank=0 [2024-06-28 05:16:15,738] INFO: Rank 0: epoch=173 / 200 train_loss=7.9598 valid_loss=8.3259 stale=14 time=74.36m eta=2197.9m [2024-06-28 05:16:15,738] INFO: Rank 0: epoch=173 / 200 train_loss=7.9598 valid_loss=8.3259 stale=14 time=74.36m eta=2197.9m [2024-06-28 05:16:15,920] INFO: Initiating epoch #174 train run on device rank=0 [2024-06-28 05:16:15,920] INFO: Initiating epoch #174 train run on device rank=0 [2024-06-28 06:27:42,624] INFO: Initiating epoch #174 valid run on device rank=0 [2024-06-28 06:27:42,624] INFO: Initiating epoch #174 valid run on device rank=0