[2024-04-29 10:11:13,548] INFO: Will use single-gpu: NVIDIA A100-SXM4-80GB
[2024-04-29 10:11:13,550] INFO: using dtype=torch.bfloat16
[2024-04-29 10:11:13,566] INFO: using attention_type=flash
[2024-04-29 10:11:13,576] INFO: using attention_type=flash
[2024-04-29 10:11:13,585] INFO: using attention_type=flash
[2024-04-29 10:11:13,595] INFO: using attention_type=flash
[2024-04-29 10:11:13,604] INFO: using attention_type=flash
[2024-04-29 10:11:13,613] INFO: using attention_type=flash
[2024-04-29 10:11:14,644] INFO: MLPF(
  (nn0_id): Sequential(
    (0): Linear(in_features=17, out_features=512, bias=True)
    (1): ReLU()
    (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=512, out_features=512, bias=True)
  )
  (nn0_reg): Sequential(
    (0): Linear(in_features=17, out_features=512, bias=True)
    (1): ReLU()
    (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=512, out_features=512, bias=True)
  )
  (conv_id): ModuleList(
    (0-2): 3 x SelfAttentionLayer(
      (mha): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
      )
      (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (seq): Sequential(
        (0): Linear(in_features=512, out_features=512, bias=True)
        (1): ReLU()
        (2): Linear(in_features=512, out_features=512, bias=True)
        (3): ReLU()
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (conv_reg): ModuleList(
    (0-2): 3 x SelfAttentionLayer(
      (mha): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
      )
      (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (seq): Sequential(
        (0): Linear(in_features=512, out_features=512, bias=True)
        (1): ReLU()
        (2): Linear(in_features=512, out_features=512, bias=True)
        (3): ReLU()
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (nn_id): Sequential(
    (0): Linear(in_features=529, out_features=512, bias=True)
    (1): ReLU()
    (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=512, out_features=6, bias=True)
  )
  (nn_pt): RegressionOutput(
    (nn): Sequential(
      (0): Linear(in_features=535, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=2, bias=True)
    )
  )
  (nn_eta): RegressionOutput(
    (nn): Sequential(
      (0): Linear(in_features=535, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=2, bias=True)
    )
  )
  (nn_sin_phi): RegressionOutput(
    (nn): Sequential(
      (0): Linear(in_features=535, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=2, bias=True)
    )
  )
  (nn_cos_phi): RegressionOutput(
    (nn): Sequential(
      (0): Linear(in_features=535, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=2, bias=True)
    )
  )
  (nn_energy): RegressionOutput(
    (nn): Sequential(
      (0): Linear(in_features=535, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=2, bias=True)
    )
  )
)
[2024-04-29 10:11:14,645] INFO: Trainable parameters: 11671568
[2024-04-29 10:11:14,645] INFO: Non-trainable parameters: 0
[2024-04-29 10:11:14,645] INFO: Total parameters: 11671568
[2024-04-29 10:11:14,649] INFO:
Modules                         Trainable parameters  Non-tranable parameters
nn0_id.0.weight                 8704                  0
nn0_id.0.bias                   512                   0
nn0_id.2.weight                 512                   0
nn0_id.2.bias                   512                   0
nn0_id.4.weight                 262144                0
nn0_id.4.bias                   512                   0
nn0_reg.0.weight                8704                  0
nn0_reg.0.bias                  512                   0
nn0_reg.2.weight                512                   0
nn0_reg.2.bias                  512                   0
nn0_reg.4.weight                262144                0
nn0_reg.4.bias                  512                   0
conv_id.0.mha.in_proj_weight    786432                0
conv_id.0.mha.in_proj_bias      1536                  0
conv_id.0.mha.out_proj.weight   262144                0
conv_id.0.mha.out_proj.bias     512                   0
conv_id.0.norm0.weight          512                   0
conv_id.0.norm0.bias            512                   0
conv_id.0.norm1.weight          512                   0
conv_id.0.norm1.bias            512                   0
conv_id.0.seq.0.weight          262144                0
conv_id.0.seq.0.bias            512                   0
conv_id.0.seq.2.weight          262144                0
conv_id.0.seq.2.bias            512                   0
conv_id.1.mha.in_proj_weight    786432                0
conv_id.1.mha.in_proj_bias      1536                  0
conv_id.1.mha.out_proj.weight   262144                0
conv_id.1.mha.out_proj.bias     512                   0
conv_id.1.norm0.weight          512                   0
conv_id.1.norm0.bias            512                   0
conv_id.1.norm1.weight          512                   0
conv_id.1.norm1.bias            512                   0
conv_id.1.seq.0.weight          262144                0
conv_id.1.seq.0.bias            512                   0
conv_id.1.seq.2.weight          262144                0
conv_id.1.seq.2.bias            512                   0
conv_id.2.mha.in_proj_weight    786432                0
conv_id.2.mha.in_proj_bias      1536                  0
conv_id.2.mha.out_proj.weight   262144                0
conv_id.2.mha.out_proj.bias     512                   0
conv_id.2.norm0.weight          512                   0
conv_id.2.norm0.bias            512                   0
conv_id.2.norm1.weight          512                   0
conv_id.2.norm1.bias            512                   0
conv_id.2.seq.0.weight          262144                0
conv_id.2.seq.0.bias            512                   0
conv_id.2.seq.2.weight          262144                0
conv_id.2.seq.2.bias            512                   0
conv_reg.0.mha.in_proj_weight   786432                0
conv_reg.0.mha.in_proj_bias     1536                  0
conv_reg.0.mha.out_proj.weight  262144                0
conv_reg.0.mha.out_proj.bias    512                   0
conv_reg.0.norm0.weight         512                   0
conv_reg.0.norm0.bias           512                   0
conv_reg.0.norm1.weight         512                   0
conv_reg.0.norm1.bias           512                   0
conv_reg.0.seq.0.weight         262144                0
conv_reg.0.seq.0.bias           512                   0
conv_reg.0.seq.2.weight         262144                0
conv_reg.0.seq.2.bias           512                   0
conv_reg.1.mha.in_proj_weight   786432                0
conv_reg.1.mha.in_proj_bias     1536                  0
conv_reg.1.mha.out_proj.weight  262144                0
conv_reg.1.mha.out_proj.bias    512                   0
conv_reg.1.norm0.weight         512                   0
conv_reg.1.norm0.bias           512                   0
conv_reg.1.norm1.weight         512                   0
conv_reg.1.norm1.bias           512                   0
conv_reg.1.seq.0.weight         262144                0
conv_reg.1.seq.0.bias           512                   0
conv_reg.1.seq.2.weight         262144                0
conv_reg.1.seq.2.bias           512                   0
conv_reg.2.mha.in_proj_weight   786432                0
conv_reg.2.mha.in_proj_bias     1536                  0
conv_reg.2.mha.out_proj.weight  262144                0
conv_reg.2.mha.out_proj.bias    512                   0
conv_reg.2.norm0.weight         512                   0
conv_reg.2.norm0.bias           512                   0
conv_reg.2.norm1.weight         512                   0
conv_reg.2.norm1.bias           512                   0
conv_reg.2.seq.0.weight         262144                0
conv_reg.2.seq.0.bias           512                   0
conv_reg.2.seq.2.weight         262144                0
conv_reg.2.seq.2.bias           512                   0
nn_id.0.weight                  270848                0
nn_id.0.bias                    512                   0
nn_id.2.weight                  512                   0
nn_id.2.bias                    512                   0
nn_id.4.weight                  3072                  0
nn_id.4.bias                    6                     0
nn_pt.nn.0.weight               273920                0
nn_pt.nn.0.bias                 512                   0
nn_pt.nn.2.weight               512                   0
nn_pt.nn.2.bias                 512                   0
nn_pt.nn.4.weight               1024                  0
nn_pt.nn.4.bias                 2                     0
nn_eta.nn.0.weight              273920                0
nn_eta.nn.0.bias                512                   0
nn_eta.nn.2.weight              512                   0
nn_eta.nn.2.bias                512                   0
nn_eta.nn.4.weight              1024                  0
nn_eta.nn.4.bias                2                     0
nn_sin_phi.nn.0.weight          273920                0
nn_sin_phi.nn.0.bias            512                   0
nn_sin_phi.nn.2.weight          512                   0
nn_sin_phi.nn.2.bias            512                   0
nn_sin_phi.nn.4.weight          1024                  0
nn_sin_phi.nn.4.bias            2                     0
nn_cos_phi.nn.0.weight          273920                0
nn_cos_phi.nn.0.bias            512                   0
nn_cos_phi.nn.2.weight          512                   0
nn_cos_phi.nn.2.bias            512                   0
nn_cos_phi.nn.4.weight          1024                  0
nn_cos_phi.nn.4.bias            2                     0
nn_energy.nn.0.weight           273920                0
nn_energy.nn.0.bias             512                   0
nn_energy.nn.2.weight           512                   0
nn_energy.nn.2.bias             512                   0
nn_energy.nn.4.weight           1024                  0
nn_energy.nn.4.bias             2                     0
[2024-04-29 10:11:14,670] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749
[2024-04-29 10:11:14,670] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749
[2024-04-29 10:11:14,698] INFO: train_dataset: clic_edm_qq_pf, 1589912
[2024-04-29 10:11:14,714] INFO: train_dataset: clic_edm_ttbar_pf, 800800
[2024-04-29 10:11:14,739] INFO: train_dataset: clic_edm_ttbar_pu10_pf, 562200
[2024-04-29 10:11:14,754] INFO: train_dataset: clic_edm_ww_fullhad_pf, 800800
[2024-04-29 10:11:14,764] INFO: train_dataset: clic_edm_zh_tautau_pf, 800799
[2024-04-29 10:11:15,077] INFO: valid_dataset: clic_edm_qq_pf, 397514
[2024-04-29 10:11:15,331] INFO: Initiating epoch #1 train run on device rank=0
[2024-04-29 11:22:01,657] INFO: Initiating epoch #1 valid run on device rank=0
[2024-04-29 11:26:18,806] INFO: Rank 0: epoch=1 / 200 train_loss=17.0982 valid_loss=13.8239 stale=0 time=75.06m eta=14936.5m
[2024-04-29 11:26:18,974] INFO: Initiating epoch #2 train run on device rank=0
[2024-04-29 12:37:01,386] INFO: Initiating epoch #2 valid run on device rank=0
[2024-04-29 12:41:18,653] INFO: Rank 0: epoch=2 / 200 train_loss=13.2095 valid_loss=12.6997 stale=0 time=74.99m eta=14855.5m
[2024-04-29 12:41:18,800] INFO: Initiating epoch #3 train run on device rank=0
[2024-04-29 13:52:03,821] INFO: Initiating epoch #3 valid run on device rank=0
[2024-04-29 13:56:17,206] INFO: Rank 0: epoch=3 / 200 train_loss=12.2860 valid_loss=11.8937 stale=0 time=74.97m eta=14777.1m
[2024-04-29 13:56:17,415] INFO: Initiating epoch #4 train run on device rank=0
[2024-04-29 15:07:06,552] INFO: Initiating epoch #4 valid run on device rank=0
[2024-04-29 15:11:20,438] INFO: Rank 0: epoch=4 / 200 train_loss=11.7016 valid_loss=11.4695 stale=0 time=75.05m eta=14704.2m
[2024-04-29 15:11:20,632] INFO: Initiating epoch #5 train run on device rank=0
[2024-04-29 16:22:06,235] INFO: Initiating epoch #5 valid run on device rank=0
[2024-04-29 16:26:18,759] INFO: Rank 0: epoch=5 / 200 train_loss=11.2795 valid_loss=11.3035 stale=0 time=74.97m eta=14627.2m
[2024-04-29 16:26:18,900] INFO: Initiating epoch #6 train run on device rank=0
[2024-04-29 17:37:07,386] INFO: Initiating epoch #6 valid run on device rank=0
[2024-04-29 17:41:19,946] INFO: Rank 0: epoch=6 / 200 train_loss=10.9621 valid_loss=11.0476 stale=0 time=75.02m eta=14552.5m
[2024-04-29 17:41:20,164] INFO: Initiating epoch #7 train run on device rank=0
[2024-04-29 18:52:10,272] INFO: Initiating epoch #7 valid run on device rank=0
[2024-04-29 18:56:21,436] INFO: Rank 0: epoch=7 / 200 train_loss=10.7122 valid_loss=10.6644 stale=0 time=75.02m eta=14477.8m
[2024-04-29 18:56:21,520] INFO: Initiating epoch #8 train run on device rank=0
[2024-04-29 20:07:09,876] INFO: Initiating epoch #8 valid run on device rank=0
[2024-04-29 20:11:21,290] INFO: Rank 0: epoch=8 / 200 train_loss=10.4874 valid_loss=10.5328 stale=0 time=75.0m eta=14402.4m
[2024-04-29 20:11:21,419] INFO: Initiating epoch #9 train run on device rank=0
[2024-04-29 21:22:09,714] INFO: Initiating epoch #9 valid run on device rank=0
[2024-04-29 21:26:22,330] INFO: Rank 0: epoch=9 / 200 train_loss=10.3074 valid_loss=10.4322 stale=0 time=75.02m eta=14327.5m
[2024-04-29 21:26:22,404] INFO: Initiating epoch #10 train run on device rank=0
[2024-04-29 22:37:12,223] INFO: Initiating epoch #10 valid run on device rank=0
[2024-04-29 22:41:23,482] INFO: Rank 0: epoch=10 / 200 train_loss=10.1709 valid_loss=10.1744 stale=0 time=75.02m eta=14252.6m
[2024-04-29 22:41:23,589] INFO: Initiating epoch #11 train run on device rank=0
[2024-04-29 23:52:30,216] INFO: Initiating epoch #11 valid run on device rank=0
[2024-04-29 23:56:43,529] INFO: Rank 0: epoch=11 / 200 train_loss=10.0462 valid_loss=10.0725 stale=0 time=75.33m eta=14183.1m
[2024-04-29 23:56:43,638] INFO: Initiating epoch #12 train run on device rank=0
[2024-04-30 01:07:34,285] INFO: Initiating epoch #12 valid run on device rank=0
[2024-04-30 01:11:46,670] INFO: Rank 0: epoch=12 / 200 train_loss=9.9586 valid_loss=9.9582 stale=0 time=75.05m eta=14108.2m
[2024-04-30 01:11:46,862] INFO: Initiating epoch #13 train run on device rank=0
[2024-04-30 02:22:38,024] INFO: Initiating epoch #13 valid run on device rank=0
[2024-04-30 02:26:53,168] INFO: Rank 0: epoch=13 / 200 train_loss=9.8642 valid_loss=9.8883 stale=0 time=75.11m eta=14034.1m
[2024-04-30 02:26:53,397] INFO: Initiating epoch #14 train run on device rank=0
[2024-04-30 03:37:44,332] INFO: Initiating epoch #14 valid run on device rank=0
[2024-04-30 03:41:56,945] INFO: Rank 0: epoch=14 / 200 train_loss=9.7724 valid_loss=9.8515 stale=0 time=75.06m eta=13959.2m
[2024-04-30 03:41:57,082] INFO: Initiating epoch #15 train run on device rank=0
[2024-04-30 04:52:47,399] INFO: Initiating epoch #15 valid run on device rank=0
[2024-04-30 04:56:59,750] INFO: Rank 0: epoch=15 / 200 train_loss=9.7046 valid_loss=9.8181 stale=0 time=75.04m eta=13884.1m
[2024-04-30 04:56:59,894] INFO: Initiating epoch #16 train run on device rank=0
[2024-04-30 06:07:43,649] INFO: Initiating epoch #16 valid run on device rank=0
[2024-04-30 06:11:56,278] INFO: Rank 0: epoch=16 / 200 train_loss=9.6187 valid_loss=9.7385 stale=0 time=74.94m eta=13807.8m
[2024-04-30 06:11:56,512] INFO: Initiating epoch #17 train run on device rank=0
[2024-04-30 07:22:43,994] INFO: Initiating epoch #17 valid run on device rank=0
[2024-04-30 07:26:54,738] INFO: Rank 0: epoch=17 / 200 train_loss=9.5444 valid_loss=9.6184 stale=0 time=74.97m eta=13732.1m
[2024-04-30 07:26:54,815] INFO: Initiating epoch #18 train run on device rank=0
[2024-04-30 08:37:46,985] INFO: Initiating epoch #18 valid run on device rank=0
[2024-04-30 08:42:00,329] INFO: Rank 0: epoch=18 / 200 train_loss=9.4832 valid_loss=9.5735 stale=0 time=75.09m eta=13657.6m
[2024-04-30 08:42:00,617] INFO: Initiating epoch #19 train run on device rank=0
[2024-04-30 09:52:52,490] INFO: Initiating epoch #19 valid run on device rank=0
[2024-04-30 09:57:04,748] INFO: Rank 0: epoch=19 / 200 train_loss=9.4214 valid_loss=9.5307 stale=0 time=75.07m eta=13582.8m
[2024-04-30 09:57:04,940] INFO: Initiating epoch #20 train run on device rank=0
[2024-04-30 11:08:07,662] INFO: Initiating epoch #20 valid run on device rank=0
[2024-04-30 11:12:20,200] INFO: Rank 0: epoch=20 / 200 train_loss=9.3719 valid_loss=9.5460 stale=1 time=75.25m eta=13509.7m
[2024-04-30 11:12:20,280] INFO: Initiating epoch #21 train run on device rank=0
[2024-04-30 12:24:36,071] INFO: Initiating epoch #21 valid run on device rank=0
[2024-04-30 12:28:47,094] INFO: Rank 0: epoch=21 / 200 train_loss=9.3189 valid_loss=9.4865 stale=0 time=76.45m eta=13446.6m
[2024-04-30 12:28:47,130] INFO: Initiating epoch #22 train run on device rank=0
[2024-04-30 13:39:38,014] INFO: Initiating epoch #22 valid run on device rank=0
[2024-04-30 13:43:51,642] INFO: Rank 0: epoch=22 / 200 train_loss=9.2810 valid_loss=9.4278 stale=0 time=75.08m eta=13371.1m
[2024-04-30 13:43:51,820] INFO: Initiating epoch #23 train run on device rank=0
[2024-04-30 14:54:39,878] INFO: Initiating epoch #23 valid run on device rank=0
[2024-04-30 14:58:53,633] INFO: Rank 0: epoch=23 / 200 train_loss=9.2325 valid_loss=9.3612 stale=0 time=75.03m eta=13295.3m
[2024-04-30 14:58:53,679] INFO: Initiating epoch #24 train run on device rank=0
[2024-04-30 16:09:40,661] INFO: Initiating epoch #24 valid run on device rank=0
[2024-04-30 16:13:54,619] INFO: Rank 0: epoch=24 / 200 train_loss=9.2040 valid_loss=9.3608 stale=0 time=75.02m eta=13219.5m
[2024-04-30 16:13:54,962] INFO: Initiating epoch #25 train run on device rank=0
[2024-04-30 17:24:42,303] INFO: Initiating epoch #25 valid run on device rank=0
[2024-04-30 17:28:53,439] INFO: Rank 0: epoch=25 / 200 train_loss=9.1539 valid_loss=9.3173 stale=0 time=74.97m eta=13143.4m
[2024-04-30 17:28:53,509] INFO: Initiating epoch #26 train run on device rank=0
[2024-04-30 18:39:41,596] INFO: Initiating epoch #26 valid run on device rank=0
[2024-04-30 18:43:52,999] INFO: Rank 0: epoch=26 / 200 train_loss=9.1191 valid_loss=9.2251 stale=0 time=74.99m eta=13067.6m
[2024-04-30 18:43:53,013] INFO: Initiating epoch #27 train run on device rank=0
[2024-04-30 19:54:41,918] INFO: Initiating epoch #27 valid run on device rank=0
[2024-04-30 19:58:53,874] INFO: Rank 0: epoch=27 / 200 train_loss=9.0864 valid_loss=9.1954 stale=0 time=75.01m eta=12991.9m
[2024-04-30 19:58:53,991] INFO: Initiating epoch #28 train run on device rank=0
[2024-04-30 21:09:43,425] INFO: Initiating epoch #28 valid run on device rank=0
[2024-04-30 21:13:50,872] INFO: Rank 0: epoch=28 / 200 train_loss=9.0638 valid_loss=9.2137 stale=1 time=74.95m eta=12915.9m
[2024-04-30 21:13:50,911] INFO: Initiating epoch #29 train run on device rank=0
[2024-04-30 22:24:47,596] INFO: Initiating epoch #29 valid run on device rank=0
[2024-04-30 22:28:54,282] INFO: Rank 0: epoch=29 / 200 train_loss=9.0377 valid_loss=9.1970 stale=2 time=75.06m eta=12840.6m
[2024-04-30 22:28:54,305] INFO: Initiating epoch #30 train run on device rank=0
[2024-04-30 23:39:46,638] INFO: Initiating epoch #30 valid run on device rank=0
[2024-04-30 23:43:54,285] INFO: Rank 0: epoch=30 / 200 train_loss=9.0026 valid_loss=9.2044 stale=3 time=75.0m eta=12765.0m
[2024-04-30 23:43:54,488] INFO: Initiating epoch #31 train run on device rank=0
[2024-05-01 00:54:48,407] INFO: Initiating epoch #31 valid run on device rank=0
[2024-05-01 00:58:56,134] INFO: Rank 0: epoch=31 / 200 train_loss=8.9738 valid_loss=9.0621 stale=0 time=75.03m eta=12689.6m
[2024-05-01 00:58:56,236] INFO: Initiating epoch #32 train run on device rank=0
[2024-05-01 02:09:51,285] INFO: Initiating epoch #32 valid run on device rank=0
[2024-05-01 02:13:58,452] INFO: Rank 0: epoch=32 / 200 train_loss=8.9523 valid_loss=9.1003 stale=1 time=75.04m eta=12614.3m
[2024-05-01 02:13:58,503] INFO: Initiating epoch #33 train run on device rank=0
[2024-05-01 03:24:51,970] INFO: Initiating epoch #33 valid run on device rank=0
[2024-05-01 03:28:59,964] INFO: Rank 0: epoch=33 / 200 train_loss=8.9371 valid_loss=9.1129 stale=2 time=75.02m eta=12538.9m
[2024-05-01 03:29:00,058] INFO: Initiating epoch #34 train run on device rank=0
[2024-05-01 04:39:52,944] INFO: Initiating epoch #34 valid run on device rank=0
[2024-05-01 04:44:00,032] INFO: Rank 0: epoch=34 / 200 train_loss=8.9066 valid_loss=9.0610 stale=0 time=75.0m eta=12463.4m
[2024-05-01 04:44:00,133] INFO: Initiating epoch #35 train run on device rank=0
[2024-05-01 05:54:55,453] INFO: Initiating epoch #35 valid run on device rank=0
[2024-05-01 05:59:03,716] INFO: Rank 0: epoch=35 / 200 train_loss=8.8861 valid_loss=9.0518 stale=0 time=75.06m eta=12388.2m
[2024-05-01 05:59:03,864] INFO: Initiating epoch #36 train run on device rank=0
[2024-05-01 07:09:55,391] INFO: Initiating epoch #36 valid run on device rank=0
[2024-05-01 07:14:03,431] INFO: Rank 0: epoch=36 / 200 train_loss=8.8682 valid_loss=9.0026 stale=0 time=74.99m eta=12312.8m
[2024-05-01 07:14:03,486] INFO: Initiating epoch #37 train run on device rank=0
[2024-05-01 08:24:56,120] INFO: Initiating epoch #37 valid run on device rank=0
[2024-05-01 08:29:03,995] INFO: Rank 0: epoch=37 / 200 train_loss=8.8500 valid_loss=9.0766 stale=1 time=75.01m eta=12237.4m
[2024-05-01 08:29:04,063] INFO: Initiating epoch #38 train run on device rank=0
[2024-05-01 09:39:55,968] INFO: Initiating epoch #38 valid run on device rank=0
[2024-05-01 09:44:03,097] INFO: Rank 0: epoch=38 / 200 train_loss=8.8361 valid_loss=9.0452 stale=2 time=74.98m eta=12161.9m
[2024-05-01 09:44:03,191] INFO: Initiating epoch #39 train run on device rank=0
[2024-05-01 10:54:57,703] INFO: Initiating epoch #39 valid run on device rank=0
[2024-05-01 10:59:07,018] INFO: Rank 0: epoch=39 / 200 train_loss=8.8137 valid_loss=8.9911 stale=0 time=75.06m eta=12086.8m
[2024-05-01 10:59:08,433] INFO: Initiating epoch #40 train run on device rank=0
[2024-05-01 12:09:58,651] INFO: Initiating epoch #40 valid run on device rank=0
[2024-05-01 12:14:07,040] INFO: Rank 0: epoch=40 / 200 train_loss=8.7984 valid_loss=8.9947 stale=1 time=74.98m eta=12011.4m
[2024-05-01 12:14:08,542] INFO: Initiating epoch #41 train run on device rank=0
[2024-05-01 13:24:58,607] INFO: Initiating epoch #41 valid run on device rank=0
[2024-05-01 13:29:07,631] INFO: Rank 0: epoch=41 / 200 train_loss=8.7920 valid_loss=8.9768 stale=0 time=74.98m eta=11936.1m
[2024-05-01 13:29:08,416] INFO: Initiating epoch #42 train run on device rank=0
[2024-05-01 14:40:02,176] INFO: Initiating epoch #42 valid run on device rank=0
[2024-05-01 14:44:10,710] INFO: Rank 0: epoch=42 / 200 train_loss=8.7757 valid_loss=8.9374 stale=0 time=75.04m eta=11861.0m
[2024-05-01 14:44:11,778] INFO: Initiating epoch #43 train run on device rank=0
[2024-05-01 15:55:04,481] INFO: Initiating epoch #43 valid run on device rank=0
[2024-05-01 15:59:16,981] INFO: Rank 0: epoch=43 / 200 train_loss=8.7628 valid_loss=8.9789 stale=1 time=75.09m eta=11786.1m
[2024-05-01 15:59:20,081] INFO: Initiating epoch #44 train run on device rank=0
[2024-05-01 17:10:11,776] INFO: Initiating epoch #44 valid run on device rank=0
[2024-05-01 17:14:26,616] INFO: Rank 0: epoch=44 / 200 train_loss=8.7431 valid_loss=8.9190 stale=0 time=75.11m eta=11711.3m
[2024-05-01 17:14:30,731] INFO: Initiating epoch #45 train run on device rank=0
[2024-05-01 18:25:21,337] INFO: Initiating epoch #45 valid run on device rank=0
[2024-05-01 18:29:33,076] INFO: Rank 0: epoch=45 / 200 train_loss=8.7294 valid_loss=8.9662 stale=1 time=75.04m eta=11636.4m
[2024-05-01 18:29:37,626] INFO: Initiating epoch #46 train run on device rank=0
[2024-05-01 19:40:28,017] INFO: Initiating epoch #46 valid run on device rank=0
[2024-05-01 19:44:39,245] INFO: Rank 0: epoch=46 / 200 train_loss=8.7149 valid_loss=8.9366 stale=2 time=75.03m eta=11561.4m
[2024-05-01 19:44:42,479] INFO: Initiating epoch #47 train run on device rank=0
[2024-05-01 20:55:31,901] INFO: Initiating epoch #47 valid run on device rank=0
[2024-05-01 20:59:45,750] INFO: Rank 0: epoch=47 / 200 train_loss=8.7006 valid_loss=8.8448 stale=0 time=75.05m eta=11486.4m
[2024-05-01 20:59:48,906] INFO: Initiating epoch #48 train run on device rank=0
[2024-05-01 22:10:40,348] INFO: Initiating epoch #48 valid run on device rank=0
[2024-05-01 22:14:49,284] INFO: Rank 0: epoch=48 / 200 train_loss=8.6858 valid_loss=8.8627 stale=1 time=75.01m eta=11411.3m
[2024-05-01 22:14:50,427] INFO: Initiating epoch #49 train run on device rank=0
[2024-05-01 23:25:41,983] INFO: Initiating epoch #49 valid run on device rank=0
[2024-05-01 23:29:49,367] INFO: Rank 0: epoch=49 / 200 train_loss=8.6705 valid_loss=8.8770 stale=2 time=74.98m eta=11336.0m
[2024-05-01 23:29:49,439] INFO: Initiating epoch #50 train run on device rank=0
[2024-05-02 00:40:40,664] INFO: Initiating epoch #50 valid run on device rank=0
[2024-05-02 00:44:48,216] INFO: Rank 0: epoch=50 / 200 train_loss=8.6576 valid_loss=8.9009 stale=3 time=74.98m eta=11260.6m
[2024-05-02 00:44:48,857] INFO: Initiating epoch #51 train run on device rank=0
[2024-05-02 01:55:41,160] INFO: Initiating epoch #51 valid run on device rank=0
[2024-05-02 01:59:48,841] INFO: Rank 0: epoch=51 / 200 train_loss=8.6415 valid_loss=8.8596 stale=4 time=75.0m eta=11185.4m
[2024-05-02 01:59:48,890] INFO: Initiating epoch #52 train run on device rank=0
[2024-05-02 03:10:40,595] INFO: Initiating epoch #52 valid run on device rank=0
[2024-05-02 03:14:47,423] INFO: Rank 0: epoch=52 / 200 train_loss=8.6330 valid_loss=8.8870 stale=5 time=74.98m eta=11110.1m
[2024-05-02 03:14:47,505] INFO: Initiating epoch #53 train run on device rank=0
[2024-05-02 04:25:40,514] INFO: Initiating epoch #53 valid run on device rank=0
[2024-05-02 04:29:48,126] INFO: Rank 0: epoch=53 / 200 train_loss=8.6150 valid_loss=8.9505 stale=6 time=75.01m eta=11034.8m
[2024-05-02 04:29:48,236] INFO: Initiating epoch #54 train run on device rank=0
[2024-05-02 05:40:36,259] INFO: Initiating epoch #54 valid run on device rank=0
[2024-05-02 05:44:43,055] INFO: Rank 0: epoch=54 / 200 train_loss=8.6041 valid_loss=8.8531 stale=7 time=74.91m eta=10959.4m
[2024-05-02 05:44:43,132] INFO: Initiating epoch #55 train run on device rank=0
[2024-05-02 06:55:34,549] INFO: Initiating epoch #55 valid run on device rank=0
[2024-05-02 06:59:41,339] INFO: Rank 0: epoch=55 / 200 train_loss=8.5961 valid_loss=8.8548 stale=8 time=74.97m eta=10884.1m
[2024-05-02 06:59:41,377] INFO: Initiating epoch #56 train run on device rank=0
[2024-05-02 08:10:35,365] INFO: Initiating epoch #56 valid run on device rank=0
[2024-05-02 08:14:42,970] INFO: Rank 0: epoch=56 / 200 train_loss=8.5807 valid_loss=8.8180 stale=0 time=75.03m eta=10808.9m
[2024-05-02 08:14:43,107] INFO: Initiating epoch #57 train run on device rank=0
[2024-05-02 09:25:34,925] INFO: Initiating epoch #57 valid run on device rank=0
[2024-05-02 09:29:42,218] INFO: Rank 0: epoch=57 / 200 train_loss=8.5777 valid_loss=8.8465 stale=1 time=74.99m eta=10733.7m
[2024-05-02 09:29:42,266] INFO: Initiating epoch #58 train run on device rank=0
[2024-05-02 10:40:37,553] INFO: Initiating epoch #58 valid run on device rank=0
[2024-05-02 10:44:47,398] INFO: Rank 0: epoch=58 / 200 train_loss=8.5634 valid_loss=8.7809 stale=0 time=75.09m eta=10658.7m
[2024-05-02 10:44:48,702] INFO: Initiating epoch #59 train run on device rank=0
[2024-05-02 11:55:41,884] INFO: Initiating epoch #59 valid run on device rank=0
[2024-05-02 11:59:48,676] INFO: Rank 0: epoch=59 / 200 train_loss=8.5520 valid_loss=8.7975 stale=1 time=75.0m eta=10583.5m
[2024-05-02 11:59:48,776] INFO: Initiating epoch #60 train run on device rank=0
[2024-05-02 13:10:47,055] INFO: Initiating epoch #60 valid run on device rank=0
[2024-05-02 13:14:59,318] INFO: Rank 0: epoch=60 / 200 train_loss=8.5439 valid_loss=8.7558 stale=0 time=75.18m eta=10508.7m
[2024-05-02 13:15:00,619] INFO: Initiating epoch #61 train run on device rank=0
[2024-05-02 14:25:50,591] INFO: Initiating epoch #61 valid run on device rank=0
[2024-05-02 14:29:57,906] INFO: Rank 0: epoch=61 / 200 train_loss=8.5351 valid_loss=8.7759 stale=1 time=74.95m eta=10433.5m
[2024-05-02 14:29:58,009] INFO: Initiating epoch #62 train run on device rank=0
[2024-05-02 15:40:50,105] INFO: Initiating epoch #62 valid run on device rank=0
[2024-05-02 15:44:57,231] INFO: Rank 0: epoch=62 / 200 train_loss=8.5277 valid_loss=8.8243 stale=2 time=74.99m eta=10358.2m
[2024-05-02 15:44:57,296] INFO: Initiating epoch #63 train run on device rank=0
[2024-05-02 16:55:51,155] INFO: Initiating epoch #63 valid run on device rank=0
[2024-05-02 16:59:58,040] INFO: Rank 0: epoch=63 / 200 train_loss=8.5176 valid_loss=8.7586 stale=3 time=75.01m eta=10283.1m
[2024-05-02 16:59:58,088] INFO: Initiating epoch #64 train run on device rank=0
[2024-05-02 18:10:53,197] INFO: Initiating epoch #64 valid run on device rank=0
[2024-05-02 18:15:01,382] INFO: Rank 0: epoch=64 / 200 train_loss=8.5146 valid_loss=8.7507 stale=0 time=75.05m eta=10208.0m
[2024-05-02 18:15:01,688] INFO: Initiating epoch #65 train run on device rank=0
[2024-05-02 19:25:55,649] INFO: Initiating epoch #65 valid run on device rank=0
[2024-05-02 19:30:02,748] INFO: Rank 0: epoch=65 / 200 train_loss=8.5017 valid_loss=8.8117 stale=1 time=75.02m eta=10132.9m
[2024-05-02 19:30:02,809] INFO: Initiating epoch #66 train run on device rank=0
[2024-05-02 20:41:00,515] INFO: Initiating epoch #66 valid run on device rank=0
[2024-05-02 20:45:07,234] INFO: Rank 0: epoch=66 / 200 train_loss=8.4954 valid_loss=8.7514 stale=2 time=75.07m eta=10057.8m
[2024-05-02 20:45:07,336] INFO: Initiating epoch #67 train run on device rank=0
[2024-05-02 21:55:59,002] INFO: Initiating epoch #67 valid run on device rank=0
[2024-05-02 22:00:05,375] INFO: Rank 0: epoch=67 / 200 train_loss=8.4903 valid_loss=8.7928 stale=3 time=74.97m eta=9982.6m
[2024-05-02 22:00:05,448] INFO: Initiating epoch #68 train run on device rank=0
[2024-05-02 23:11:00,871] INFO: Initiating epoch #68 valid run on device rank=0
[2024-05-02 23:15:07,874] INFO: Rank 0: epoch=68 / 200 train_loss=8.4835 valid_loss=8.7755 stale=4 time=75.04m eta=9907.5m
[2024-05-02 23:15:08,071] INFO: Initiating epoch #69 train run on device rank=0
[2024-05-03 00:26:01,073] INFO: Initiating epoch #69 valid run on device rank=0
[2024-05-03 00:30:08,712] INFO: Rank 0: epoch=69 / 200 train_loss=8.4782 valid_loss=8.7311 stale=0 time=75.01m eta=9832.4m
[2024-05-03 00:30:08,813] INFO: Initiating epoch #70 train run on device rank=0
[2024-05-03 01:41:01,676] INFO: Initiating epoch #70 valid run on device rank=0
[2024-05-03 01:45:08,936] INFO: Rank 0: epoch=70 / 200 train_loss=8.4742 valid_loss=8.7694 stale=1 time=75.0m eta=9757.2m
[2024-05-03 01:45:09,027] INFO: Initiating epoch #71 train run on device rank=0
[2024-05-03 01:45:09,027] INFO: Initiating epoch #71 train run on device rank=0 [2024-05-03 02:56:00,823] INFO: Initiating epoch #71 valid run on device rank=0 [2024-05-03 02:56:00,823] INFO: Initiating epoch #71 valid run on device rank=0 [2024-05-03 03:00:07,987] INFO: Rank 0: epoch=71 / 200 train_loss=8.4710 valid_loss=8.7435 stale=2 time=74.98m eta=9682.0m [2024-05-03 03:00:07,987] INFO: Rank 0: epoch=71 / 200 train_loss=8.4710 valid_loss=8.7435 stale=2 time=74.98m eta=9682.0m [2024-05-03 03:00:08,055] INFO: Initiating epoch #72 train run on device rank=0 [2024-05-03 03:00:08,055] INFO: Initiating epoch #72 train run on device rank=0 [2024-05-03 04:10:57,869] INFO: Initiating epoch #72 valid run on device rank=0 [2024-05-03 04:10:57,869] INFO: Initiating epoch #72 valid run on device rank=0 [2024-05-03 04:15:05,701] INFO: Rank 0: epoch=72 / 200 train_loss=8.4678 valid_loss=8.8553 stale=3 time=74.96m eta=9606.8m [2024-05-03 04:15:05,701] INFO: Rank 0: epoch=72 / 200 train_loss=8.4678 valid_loss=8.8553 stale=3 time=74.96m eta=9606.8m [2024-05-03 04:15:05,815] INFO: Initiating epoch #73 train run on device rank=0 [2024-05-03 04:15:05,815] INFO: Initiating epoch #73 train run on device rank=0 [2024-05-03 05:25:58,424] INFO: Initiating epoch #73 valid run on device rank=0 [2024-05-03 05:25:58,424] INFO: Initiating epoch #73 valid run on device rank=0 [2024-05-03 05:30:06,608] INFO: Rank 0: epoch=73 / 200 train_loss=8.4657 valid_loss=8.7001 stale=0 time=75.01m eta=9531.7m [2024-05-03 05:30:06,608] INFO: Rank 0: epoch=73 / 200 train_loss=8.4657 valid_loss=8.7001 stale=0 time=75.01m eta=9531.7m [2024-05-03 05:30:06,711] INFO: Initiating epoch #74 train run on device rank=0 [2024-05-03 05:30:06,711] INFO: Initiating epoch #74 train run on device rank=0 [2024-05-03 06:40:58,141] INFO: Initiating epoch #74 valid run on device rank=0 [2024-05-03 06:40:58,141] INFO: Initiating epoch #74 valid run on device rank=0 [2024-05-03 06:45:05,596] INFO: Rank 0: epoch=74 / 200 
train_loss=8.4658 valid_loss=8.7772 stale=1 time=74.98m eta=9456.5m [2024-05-03 06:45:05,596] INFO: Rank 0: epoch=74 / 200 train_loss=8.4658 valid_loss=8.7772 stale=1 time=74.98m eta=9456.5m [2024-05-03 06:45:05,740] INFO: Initiating epoch #75 train run on device rank=0 [2024-05-03 06:45:05,740] INFO: Initiating epoch #75 train run on device rank=0 [2024-05-03 07:55:59,987] INFO: Initiating epoch #75 valid run on device rank=0 [2024-05-03 07:55:59,987] INFO: Initiating epoch #75 valid run on device rank=0 [2024-05-03 08:00:07,762] INFO: Rank 0: epoch=75 / 200 train_loss=8.4625 valid_loss=8.7462 stale=2 time=75.03m eta=9381.5m [2024-05-03 08:00:07,762] INFO: Rank 0: epoch=75 / 200 train_loss=8.4625 valid_loss=8.7462 stale=2 time=75.03m eta=9381.5m [2024-05-03 08:00:07,857] INFO: Initiating epoch #76 train run on device rank=0 [2024-05-03 08:00:07,857] INFO: Initiating epoch #76 train run on device rank=0 [2024-05-03 09:11:01,059] INFO: Initiating epoch #76 valid run on device rank=0 [2024-05-03 09:11:01,059] INFO: Initiating epoch #76 valid run on device rank=0 [2024-05-03 09:15:09,336] INFO: Rank 0: epoch=76 / 200 train_loss=8.4641 valid_loss=8.8005 stale=3 time=75.02m eta=9306.4m [2024-05-03 09:15:09,336] INFO: Rank 0: epoch=76 / 200 train_loss=8.4641 valid_loss=8.8005 stale=3 time=75.02m eta=9306.4m [2024-05-03 09:15:09,380] INFO: Initiating epoch #77 train run on device rank=0 [2024-05-03 09:15:09,380] INFO: Initiating epoch #77 train run on device rank=0 [2024-05-03 10:26:00,460] INFO: Initiating epoch #77 valid run on device rank=0 [2024-05-03 10:26:00,460] INFO: Initiating epoch #77 valid run on device rank=0 [2024-05-03 10:30:07,628] INFO: Rank 0: epoch=77 / 200 train_loss=8.4579 valid_loss=8.7091 stale=4 time=74.97m eta=9231.2m [2024-05-03 10:30:07,628] INFO: Rank 0: epoch=77 / 200 train_loss=8.4579 valid_loss=8.7091 stale=4 time=74.97m eta=9231.2m [2024-05-03 10:30:07,850] INFO: Initiating epoch #78 train run on device rank=0 [2024-05-03 10:30:07,850] 
INFO: Initiating epoch #78 train run on device rank=0 [2024-05-03 11:41:00,880] INFO: Initiating epoch #78 valid run on device rank=0 [2024-05-03 11:41:00,880] INFO: Initiating epoch #78 valid run on device rank=0 [2024-05-03 11:45:07,906] INFO: Rank 0: epoch=78 / 200 train_loss=8.4552 valid_loss=8.7903 stale=5 time=75.0m eta=9156.1m [2024-05-03 11:45:07,906] INFO: Rank 0: epoch=78 / 200 train_loss=8.4552 valid_loss=8.7903 stale=5 time=75.0m eta=9156.1m [2024-05-03 11:45:07,928] INFO: Initiating epoch #79 train run on device rank=0 [2024-05-03 11:45:07,928] INFO: Initiating epoch #79 train run on device rank=0 [2024-05-03 12:56:02,060] INFO: Initiating epoch #79 valid run on device rank=0 [2024-05-03 12:56:02,060] INFO: Initiating epoch #79 valid run on device rank=0 [2024-05-03 13:00:09,907] INFO: Rank 0: epoch=79 / 200 train_loss=8.4568 valid_loss=8.7256 stale=6 time=75.03m eta=9081.0m [2024-05-03 13:00:09,907] INFO: Rank 0: epoch=79 / 200 train_loss=8.4568 valid_loss=8.7256 stale=6 time=75.03m eta=9081.0m [2024-05-03 13:00:09,975] INFO: Initiating epoch #80 train run on device rank=0 [2024-05-03 13:00:09,975] INFO: Initiating epoch #80 train run on device rank=0 [2024-05-03 14:11:06,181] INFO: Initiating epoch #80 valid run on device rank=0 [2024-05-03 14:11:06,181] INFO: Initiating epoch #80 valid run on device rank=0 [2024-05-03 14:15:13,010] INFO: Rank 0: epoch=80 / 200 train_loss=8.4529 valid_loss=8.7338 stale=7 time=75.05m eta=9005.9m [2024-05-03 14:15:13,010] INFO: Rank 0: epoch=80 / 200 train_loss=8.4529 valid_loss=8.7338 stale=7 time=75.05m eta=9005.9m [2024-05-03 14:15:13,145] INFO: Initiating epoch #81 train run on device rank=0 [2024-05-03 14:15:13,145] INFO: Initiating epoch #81 train run on device rank=0 [2024-05-03 15:26:05,367] INFO: Initiating epoch #81 valid run on device rank=0 [2024-05-03 15:26:05,367] INFO: Initiating epoch #81 valid run on device rank=0 [2024-05-03 15:30:12,755] INFO: Rank 0: epoch=81 / 200 train_loss=8.4544 
valid_loss=8.7214 stale=8 time=74.99m eta=8930.8m [2024-05-03 15:30:12,755] INFO: Rank 0: epoch=81 / 200 train_loss=8.4544 valid_loss=8.7214 stale=8 time=74.99m eta=8930.8m [2024-05-03 15:30:12,800] INFO: Initiating epoch #82 train run on device rank=0 [2024-05-03 15:30:12,800] INFO: Initiating epoch #82 train run on device rank=0 [2024-05-03 16:41:02,662] INFO: Initiating epoch #82 valid run on device rank=0 [2024-05-03 16:41:02,662] INFO: Initiating epoch #82 valid run on device rank=0 [2024-05-03 16:45:10,135] INFO: Rank 0: epoch=82 / 200 train_loss=8.4544 valid_loss=8.7018 stale=9 time=74.96m eta=8855.6m [2024-05-03 16:45:10,135] INFO: Rank 0: epoch=82 / 200 train_loss=8.4544 valid_loss=8.7018 stale=9 time=74.96m eta=8855.6m [2024-05-03 16:45:10,185] INFO: Initiating epoch #83 train run on device rank=0 [2024-05-03 16:45:10,185] INFO: Initiating epoch #83 train run on device rank=0 [2024-05-03 17:56:00,392] INFO: Initiating epoch #83 valid run on device rank=0 [2024-05-03 17:56:00,392] INFO: Initiating epoch #83 valid run on device rank=0 [2024-05-03 18:00:07,370] INFO: Rank 0: epoch=83 / 200 train_loss=8.4488 valid_loss=8.7189 stale=10 time=74.95m eta=8780.5m [2024-05-03 18:00:07,370] INFO: Rank 0: epoch=83 / 200 train_loss=8.4488 valid_loss=8.7189 stale=10 time=74.95m eta=8780.5m [2024-05-03 18:00:07,513] INFO: Initiating epoch #84 train run on device rank=0 [2024-05-03 18:00:07,513] INFO: Initiating epoch #84 train run on device rank=0 [2024-05-03 19:10:57,190] INFO: Initiating epoch #84 valid run on device rank=0 [2024-05-03 19:10:57,190] INFO: Initiating epoch #84 valid run on device rank=0 [2024-05-03 19:15:05,109] INFO: Rank 0: epoch=84 / 200 train_loss=8.4430 valid_loss=8.6956 stale=0 time=74.96m eta=8705.3m [2024-05-03 19:15:05,109] INFO: Rank 0: epoch=84 / 200 train_loss=8.4430 valid_loss=8.6956 stale=0 time=74.96m eta=8705.3m [2024-05-03 19:15:05,149] INFO: Initiating epoch #85 train run on device rank=0 [2024-05-03 19:15:05,149] INFO: Initiating 
epoch #85 train run on device rank=0 [2024-05-03 20:25:57,457] INFO: Initiating epoch #85 valid run on device rank=0 [2024-05-03 20:25:57,457] INFO: Initiating epoch #85 valid run on device rank=0 [2024-05-03 20:30:04,841] INFO: Rank 0: epoch=85 / 200 train_loss=8.4393 valid_loss=8.7393 stale=1 time=74.99m eta=8630.2m [2024-05-03 20:30:04,841] INFO: Rank 0: epoch=85 / 200 train_loss=8.4393 valid_loss=8.7393 stale=1 time=74.99m eta=8630.2m [2024-05-03 20:30:05,242] INFO: Initiating epoch #86 train run on device rank=0 [2024-05-03 20:30:05,242] INFO: Initiating epoch #86 train run on device rank=0 [2024-05-03 21:40:55,147] INFO: Initiating epoch #86 valid run on device rank=0 [2024-05-03 21:40:55,147] INFO: Initiating epoch #86 valid run on device rank=0 [2024-05-03 21:45:01,892] INFO: Rank 0: epoch=86 / 200 train_loss=8.4362 valid_loss=8.7466 stale=2 time=74.94m eta=8555.0m [2024-05-03 21:45:01,892] INFO: Rank 0: epoch=86 / 200 train_loss=8.4362 valid_loss=8.7466 stale=2 time=74.94m eta=8555.0m [2024-05-03 21:45:02,164] INFO: Initiating epoch #87 train run on device rank=0 [2024-05-03 21:45:02,164] INFO: Initiating epoch #87 train run on device rank=0 [2024-05-03 22:55:55,630] INFO: Initiating epoch #87 valid run on device rank=0 [2024-05-03 22:55:55,630] INFO: Initiating epoch #87 valid run on device rank=0 [2024-05-03 23:00:05,452] INFO: Rank 0: epoch=87 / 200 train_loss=8.4303 valid_loss=8.6323 stale=0 time=75.05m eta=8480.0m [2024-05-03 23:00:05,452] INFO: Rank 0: epoch=87 / 200 train_loss=8.4303 valid_loss=8.6323 stale=0 time=75.05m eta=8480.0m [2024-05-03 23:00:05,507] INFO: Initiating epoch #88 train run on device rank=0 [2024-05-03 23:00:05,507] INFO: Initiating epoch #88 train run on device rank=0 [2024-05-04 00:11:01,080] INFO: Initiating epoch #88 valid run on device rank=0 [2024-05-04 00:11:01,080] INFO: Initiating epoch #88 valid run on device rank=0 [2024-05-04 00:15:08,918] INFO: Rank 0: epoch=88 / 200 train_loss=8.4261 valid_loss=8.7268 stale=1 
time=75.06m eta=8405.0m [2024-05-04 00:15:08,918] INFO: Rank 0: epoch=88 / 200 train_loss=8.4261 valid_loss=8.7268 stale=1 time=75.06m eta=8405.0m [2024-05-04 00:15:08,958] INFO: Initiating epoch #89 train run on device rank=0 [2024-05-04 00:15:08,958] INFO: Initiating epoch #89 train run on device rank=0 [2024-05-04 01:26:06,109] INFO: Initiating epoch #89 valid run on device rank=0 [2024-05-04 01:26:06,109] INFO: Initiating epoch #89 valid run on device rank=0 [2024-05-04 01:30:14,974] INFO: Rank 0: epoch=89 / 200 train_loss=8.4242 valid_loss=8.6977 stale=2 time=75.1m eta=8330.0m [2024-05-04 01:30:14,974] INFO: Rank 0: epoch=89 / 200 train_loss=8.4242 valid_loss=8.6977 stale=2 time=75.1m eta=8330.0m [2024-05-04 01:30:15,188] INFO: Initiating epoch #90 train run on device rank=0 [2024-05-04 01:30:15,188] INFO: Initiating epoch #90 train run on device rank=0 [2024-05-04 02:41:08,446] INFO: Initiating epoch #90 valid run on device rank=0 [2024-05-04 02:41:08,446] INFO: Initiating epoch #90 valid run on device rank=0 [2024-05-04 02:45:15,065] INFO: Rank 0: epoch=90 / 200 train_loss=8.4185 valid_loss=8.6752 stale=3 time=75.0m eta=8254.9m [2024-05-04 02:45:15,065] INFO: Rank 0: epoch=90 / 200 train_loss=8.4185 valid_loss=8.6752 stale=3 time=75.0m eta=8254.9m [2024-05-04 02:45:15,136] INFO: Initiating epoch #91 train run on device rank=0 [2024-05-04 02:45:15,136] INFO: Initiating epoch #91 train run on device rank=0 [2024-05-04 03:56:06,728] INFO: Initiating epoch #91 valid run on device rank=0 [2024-05-04 03:56:06,728] INFO: Initiating epoch #91 valid run on device rank=0 [2024-05-04 04:00:14,076] INFO: Rank 0: epoch=91 / 200 train_loss=8.4186 valid_loss=8.6954 stale=4 time=74.98m eta=8179.8m [2024-05-04 04:00:14,076] INFO: Rank 0: epoch=91 / 200 train_loss=8.4186 valid_loss=8.6954 stale=4 time=74.98m eta=8179.8m [2024-05-04 04:00:14,961] INFO: Initiating epoch #92 train run on device rank=0 [2024-05-04 04:00:14,961] INFO: Initiating epoch #92 train run on device 
rank=0 [2024-05-04 05:11:07,404] INFO: Initiating epoch #92 valid run on device rank=0 [2024-05-04 05:15:14,340] INFO: Rank 0: epoch=92 / 200 train_loss=8.4104 valid_loss=8.6846 stale=5 time=74.99m eta=8104.7m [2024-05-04 05:15:14,362] INFO: Initiating epoch #93 train run on device rank=0 [2024-05-04 06:26:07,311] INFO: Initiating epoch #93 valid run on device rank=0 [2024-05-04 06:30:15,271] INFO: Rank 0: epoch=93 / 200 train_loss=8.4063 valid_loss=8.6708 stale=6 time=75.02m eta=8029.6m [2024-05-04 06:30:15,396] INFO: Initiating epoch #94 train run on device rank=0 [2024-05-04 07:41:05,766] INFO: Initiating epoch #94 valid run on device rank=0 [2024-05-04 07:45:13,821] INFO: Rank 0: epoch=94 / 200 train_loss=8.4008 valid_loss=8.6985 stale=7 time=74.97m eta=7954.5m [2024-05-04 07:45:13,905] INFO: Initiating epoch #95 train run on device rank=0 [2024-05-04 08:56:06,738] INFO: Initiating epoch #95 valid run on device rank=0 [2024-05-04 
09:00:15,429] INFO: Rank 0: epoch=95 / 200 train_loss=8.3974 valid_loss=8.6378 stale=8 time=75.03m eta=7879.4m [2024-05-04 09:00:15,731] INFO: Initiating epoch #96 train run on device rank=0 [2024-05-04 10:11:13,454] INFO: Initiating epoch #96 valid run on device rank=0 [2024-05-04 10:15:21,752] INFO: Rank 0: epoch=96 / 200 train_loss=8.3880 valid_loss=8.6872 stale=9 time=75.1m eta=7804.4m [2024-05-04 10:15:21,811] INFO: Initiating epoch #97 train run on device rank=0 [2024-05-04 11:26:15,238] INFO: Initiating epoch #97 valid run on device rank=0 [2024-05-04 11:30:22,590] INFO: Rank 0: epoch=97 / 200 train_loss=8.3875 valid_loss=8.6976 stale=10 time=75.01m eta=7729.4m [2024-05-04 11:30:22,705] INFO: Initiating epoch #98 train run on device rank=0 [2024-05-04 12:41:15,381] INFO: Initiating epoch #98 valid run on device rank=0 [2024-05-04 12:45:23,300] INFO: Rank 0: epoch=98 / 200 train_loss=8.3807 valid_loss=8.6958 stale=11 time=75.01m eta=7654.3m [2024-05-04 12:45:23,431] INFO: Initiating epoch #99 train run on device rank=0 [2024-05-04 13:56:14,055] 
INFO: Initiating epoch #99 valid run on device rank=0 [2024-05-04 14:00:22,414] INFO: Rank 0: epoch=99 / 200 train_loss=8.3732 valid_loss=8.6837 stale=12 time=74.98m eta=7579.2m [2024-05-04 14:00:22,504] INFO: Initiating epoch #100 train run on device rank=0 [2024-05-04 15:11:17,586] INFO: Initiating epoch #100 valid run on device rank=0 [2024-05-04 15:15:24,248] INFO: Rank 0: epoch=100 / 200 train_loss=8.3656 valid_loss=8.6274 stale=0 time=75.03m eta=7504.1m [2024-05-04 15:15:24,263] INFO: Initiating epoch #101 train run on device rank=0 [2024-05-04 16:26:23,255] INFO: Initiating epoch #101 valid run on device rank=0 [2024-05-04 16:30:31,056] INFO: Rank 0: epoch=101 / 200 train_loss=8.3659 valid_loss=8.6266 stale=0 time=75.11m eta=7429.2m [2024-05-04 16:30:31,076] INFO: Initiating epoch #102 train run on device rank=0 [2024-05-04 17:41:25,157] INFO: Initiating epoch #102 valid run on device rank=0 [2024-05-04 17:45:32,591] 
INFO: Rank 0: epoch=102 / 200 train_loss=8.3559 valid_loss=8.6246 stale=0 time=75.03m eta=7354.1m [2024-05-04 17:45:32,724] INFO: Initiating epoch #103 train run on device rank=0 [2024-05-04 18:56:24,062] INFO: Initiating epoch #103 valid run on device rank=0 [2024-05-04 19:00:31,819] INFO: Rank 0: epoch=103 / 200 train_loss=8.3568 valid_loss=8.6567 stale=1 time=74.98m eta=7279.0m [2024-05-04 19:00:32,008] INFO: Initiating epoch #104 train run on device rank=0 [2024-05-04 20:11:27,479] INFO: Initiating epoch #104 valid run on device rank=0 [2024-05-04 20:15:35,566] INFO: Rank 0: epoch=104 / 200 train_loss=8.3524 valid_loss=8.7622 stale=2 time=75.06m eta=7204.0m [2024-05-04 20:15:35,686] INFO: Initiating epoch #105 train run on device rank=0 [2024-05-04 21:26:31,583] INFO: Initiating epoch #105 valid run on device rank=0 [2024-05-04 21:30:40,623] INFO: Rank 0: epoch=105 / 200 train_loss=8.3499 valid_loss=8.6644 stale=3 time=75.08m eta=7129.0m [2024-05-04 21:30:40,695] INFO: Initiating epoch #106 train run on device rank=0 [2024-05-04 
22:41:34,790] INFO: Initiating epoch #106 valid run on device rank=0 [2024-05-04 22:45:42,875] INFO: Rank 0: epoch=106 / 200 train_loss=8.3455 valid_loss=8.6604 stale=4 time=75.04m eta=7054.0m [2024-05-04 22:45:42,987] INFO: Initiating epoch #107 train run on device rank=0 [2024-05-04 23:56:38,411] INFO: Initiating epoch #107 valid run on device rank=0 [2024-05-05 00:00:47,397] INFO: Rank 0: epoch=107 / 200 train_loss=8.3387 valid_loss=8.6196 stale=0 time=75.07m eta=6978.9m [2024-05-05 00:00:47,567] INFO: Initiating epoch #108 train run on device rank=0 [2024-05-05 01:11:42,675] INFO: Initiating epoch #108 valid run on device rank=0 [2024-05-05 01:15:50,904] INFO: Rank 0: epoch=108 / 200 train_loss=8.3350 valid_loss=8.6006 stale=0 time=75.06m eta=6903.9m [2024-05-05 01:15:50,909] INFO: Initiating epoch #109 train run on device rank=0 [2024-05-05 02:26:45,432] INFO: Initiating epoch #109 valid run on device rank=0 [2024-05-05 02:30:54,023] INFO: Rank 0: epoch=109 / 200 train_loss=8.3275 valid_loss=8.6356 stale=1 time=75.05m eta=6828.9m 
[2024-05-05 02:30:54,091] INFO: Initiating epoch #110 train run on device rank=0 [2024-05-05 03:41:50,480] INFO: Initiating epoch #110 valid run on device rank=0 [2024-05-05 03:45:57,847] INFO: Rank 0: epoch=110 / 200 train_loss=8.3239 valid_loss=8.5996 stale=0 time=75.06m eta=6753.9m [2024-05-05 03:45:57,996] INFO: Initiating epoch #111 train run on device rank=0 [2024-05-05 04:56:51,851] INFO: Initiating epoch #111 valid run on device rank=0 [2024-05-05 05:00:59,213] INFO: Rank 0: epoch=111 / 200 train_loss=8.3217 valid_loss=8.6409 stale=1 time=75.02m eta=6678.8m [2024-05-05 05:00:59,372] INFO: Initiating epoch #112 train run on device rank=0 [2024-05-05 06:11:54,123] INFO: Initiating epoch #112 valid run on device rank=0 [2024-05-05 06:16:02,938] INFO: Rank 0: epoch=112 / 200 train_loss=8.3193 valid_loss=8.6686 stale=2 time=75.06m eta=6603.8m [2024-05-05 06:16:03,080] INFO: Initiating epoch #113 train run on device 
rank=0 [2024-05-05 07:26:55,728] INFO: Initiating epoch #113 valid run on device rank=0 [2024-05-05 07:31:03,805] INFO: Rank 0: epoch=113 / 200 train_loss=8.3146 valid_loss=8.5429 stale=0 time=75.01m eta=6528.7m [2024-05-05 07:31:03,965] INFO: Initiating epoch #114 train run on device rank=0 [2024-05-05 08:41:57,431] INFO: Initiating epoch #114 valid run on device rank=0 [2024-05-05 08:46:05,216] INFO: Rank 0: epoch=114 / 200 train_loss=8.3131 valid_loss=8.6555 stale=1 time=75.02m eta=6453.6m [2024-05-05 08:46:05,337] INFO: Initiating epoch #115 train run on device rank=0 [2024-05-05 09:56:58,576] INFO: Initiating epoch #115 valid run on device rank=0 [2024-05-05 10:01:06,254] INFO: Rank 0: epoch=115 / 200 train_loss=8.3053 valid_loss=8.6167 stale=2 time=75.02m eta=6378.6m [2024-05-05 10:01:06,327] INFO: Initiating epoch #116 train run on device rank=0 [2024-05-05 11:12:01,516] INFO: Initiating epoch #116 valid run on device rank=0 [2024-05-05 11:16:10,771] INFO: Rank 0: epoch=116 / 200 train_loss=8.2994 valid_loss=8.5730 stale=3 time=75.07m 
eta=6303.6m [2024-05-05 11:16:10,859] INFO: Initiating epoch #117 train run on device rank=0 [2024-05-05 12:27:05,438] INFO: Initiating epoch #117 valid run on device rank=0 [2024-05-05 12:31:13,957] INFO: Rank 0: epoch=117 / 200 train_loss=8.2971 valid_loss=8.6413 stale=4 time=75.05m eta=6228.5m [2024-05-05 12:31:14,033] INFO: Initiating epoch #118 train run on device rank=0 [2024-05-05 13:42:08,366] INFO: Initiating epoch #118 valid run on device rank=0 [2024-05-05 13:46:15,993] INFO: Rank 0: epoch=118 / 200 train_loss=8.2939 valid_loss=8.6132 stale=5 time=75.03m eta=6153.5m [2024-05-05 13:46:16,028] INFO: Initiating epoch #119 train run on device rank=0 [2024-05-05 14:57:11,411] INFO: Initiating epoch #119 valid run on device rank=0 [2024-05-05 15:01:19,098] INFO: Rank 0: epoch=119 / 200 train_loss=8.2921 valid_loss=8.6114 stale=6 time=75.05m eta=6078.4m [2024-05-05 15:01:19,387] INFO: Initiating epoch #120 train run on 
device rank=0 [2024-05-05 16:12:14,080] INFO: Initiating epoch #120 valid run on device rank=0 [2024-05-05 16:16:22,496] INFO: Rank 0: epoch=120 / 200 train_loss=8.2868 valid_loss=8.6116 stale=7 time=75.05m eta=6003.4m [2024-05-05 16:16:22,535] INFO: Initiating epoch #121 train run on device rank=0 [2024-05-05 17:27:17,832] INFO: Initiating epoch #121 valid run on device rank=0 [2024-05-05 17:31:25,453] INFO: Rank 0: epoch=121 / 200 train_loss=8.2862 valid_loss=8.6204 stale=8 time=75.05m eta=5928.4m [2024-05-05 17:31:25,755] INFO: Initiating epoch #122 train run on device rank=0 [2024-05-05 18:42:18,387] INFO: Initiating epoch #122 valid run on device rank=0 [2024-05-05 18:46:26,329] INFO: Rank 0: epoch=122 / 200 train_loss=8.2772 valid_loss=8.5793 stale=9 time=75.01m eta=5853.3m [2024-05-05 18:46:26,498] INFO: Initiating epoch #123 train run on device rank=0