[2024-08-26 15:02:55,370] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 2 gpus
[2024-08-26 15:02:55,454] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 15:02:55,454] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 15:03:00,527] INFO: using dtype=torch.float32
[2024-08-26 15:03:01,387] INFO: model_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'efficient'}
[2024-08-26 15:03:01,415] INFO: using attention_type=math
[2024-08-26 15:03:01,433] INFO: using attention_type=math
[2024-08-26 15:03:01,451] INFO: using attention_type=math
[2024-08-26 15:03:01,468] INFO: using attention_type=math
[2024-08-26 15:03:01,486] INFO: using attention_type=math
[2024-08-26 15:03:01,505] INFO: using attention_type=math
[2024-08-26 15:03:06,655] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_8GTX/best_weights.pth
[2024-08-26 15:03:07,749] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-26 15:03:07,749] INFO: Trainable parameters: 11671568
[2024-08-26 15:03:07,749] INFO: Non-trainable parameters: 0
[2024-08-26 15:03:07,749] INFO: Total parameters: 11671568
[2024-08-26 15:03:07,753] INFO: Modules Trainable parameters Non-tranable parameters
module.nn0_id.0.weight 8704 0
module.nn0_id.0.bias 512 0
module.nn0_id.2.weight 512 0
module.nn0_id.2.bias 512 0
module.nn0_id.4.weight 262144 0
module.nn0_id.4.bias 512 0
module.nn0_reg.0.weight 8704 0
module.nn0_reg.0.bias 512 0
module.nn0_reg.2.weight 512 0
module.nn0_reg.2.bias 512 0
module.nn0_reg.4.weight 262144 0
module.nn0_reg.4.bias 512 0
module.conv_id.0.mha.in_proj_weight 786432 0
module.conv_id.0.mha.in_proj_bias 1536 0
module.conv_id.0.mha.out_proj.weight 262144 0
module.conv_id.0.mha.out_proj.bias 512 0
module.conv_id.0.norm0.weight 512 0
module.conv_id.0.norm0.bias 512 0
module.conv_id.0.norm1.weight 512 0
module.conv_id.0.norm1.bias 512 0
module.conv_id.0.seq.0.weight 262144 0
module.conv_id.0.seq.0.bias 512 0
module.conv_id.0.seq.2.weight 262144 0
module.conv_id.0.seq.2.bias 512 0
module.conv_id.1.mha.in_proj_weight 786432 0
module.conv_id.1.mha.in_proj_bias 1536 0
module.conv_id.1.mha.out_proj.weight 262144 0
module.conv_id.1.mha.out_proj.bias 512 0
module.conv_id.1.norm0.weight 512 0
module.conv_id.1.norm0.bias 512 0
module.conv_id.1.norm1.weight 512 0
module.conv_id.1.norm1.bias 512 0
module.conv_id.1.seq.0.weight 262144 0
module.conv_id.1.seq.0.bias 512 0
module.conv_id.1.seq.2.weight 262144 0
module.conv_id.1.seq.2.bias 512 0
module.conv_id.2.mha.in_proj_weight 786432 0
module.conv_id.2.mha.in_proj_bias 1536 0
module.conv_id.2.mha.out_proj.weight 262144 0
module.conv_id.2.mha.out_proj.bias 512 0
module.conv_id.2.norm0.weight 512 0
module.conv_id.2.norm0.bias 512 0
module.conv_id.2.norm1.weight 512 0
module.conv_id.2.norm1.bias 512 0
module.conv_id.2.seq.0.weight 262144 0
module.conv_id.2.seq.0.bias 512 0
module.conv_id.2.seq.2.weight 262144 0
module.conv_id.2.seq.2.bias 512 0
module.conv_reg.0.mha.in_proj_weight 786432 0
module.conv_reg.0.mha.in_proj_bias 1536 0
module.conv_reg.0.mha.out_proj.weight 262144 0
module.conv_reg.0.mha.out_proj.bias 512 0
module.conv_reg.0.norm0.weight 512 0
module.conv_reg.0.norm0.bias 512 0
module.conv_reg.0.norm1.weight 512 0
module.conv_reg.0.norm1.bias 512 0
module.conv_reg.0.seq.0.weight 262144 0
module.conv_reg.0.seq.0.bias 512 0
module.conv_reg.0.seq.2.weight 262144 0
module.conv_reg.0.seq.2.bias 512 0
module.conv_reg.1.mha.in_proj_weight 786432 0
module.conv_reg.1.mha.in_proj_bias 1536 0
module.conv_reg.1.mha.out_proj.weight 262144 0
module.conv_reg.1.mha.out_proj.bias 512 0
module.conv_reg.1.norm0.weight 512 0
module.conv_reg.1.norm0.bias 512 0
module.conv_reg.1.norm1.weight 512 0
module.conv_reg.1.norm1.bias 512 0
module.conv_reg.1.seq.0.weight 262144 0
module.conv_reg.1.seq.0.bias 512 0
module.conv_reg.1.seq.2.weight 262144 0
module.conv_reg.1.seq.2.bias 512 0
module.conv_reg.2.mha.in_proj_weight 786432 0
module.conv_reg.2.mha.in_proj_bias 1536 0
module.conv_reg.2.mha.out_proj.weight 262144 0
module.conv_reg.2.mha.out_proj.bias 512 0
module.conv_reg.2.norm0.weight 512 0
module.conv_reg.2.norm0.bias 512 0
module.conv_reg.2.norm1.weight 512 0
module.conv_reg.2.norm1.bias 512 0
module.conv_reg.2.seq.0.weight 262144 0
module.conv_reg.2.seq.0.bias 512 0
module.conv_reg.2.seq.2.weight 262144 0
module.conv_reg.2.seq.2.bias 512 0
module.nn_id.0.weight 270848 0
module.nn_id.0.bias 512 0
module.nn_id.2.weight 512 0
module.nn_id.2.bias 512 0
module.nn_id.4.weight 3072 0
module.nn_id.4.bias 6 0
module.nn_pt.nn.0.weight 273920 0
module.nn_pt.nn.0.bias 512 0
module.nn_pt.nn.2.weight 512 0
module.nn_pt.nn.2.bias 512 0
module.nn_pt.nn.4.weight 1024 0
module.nn_pt.nn.4.bias 2 0
module.nn_eta.nn.0.weight 273920 0
module.nn_eta.nn.0.bias 512 0
module.nn_eta.nn.2.weight 512 0
module.nn_eta.nn.2.bias 512 0
module.nn_eta.nn.4.weight 1024 0
module.nn_eta.nn.4.bias 2 0
module.nn_sin_phi.nn.0.weight 273920 0
module.nn_sin_phi.nn.0.bias 512 0
module.nn_sin_phi.nn.2.weight 512 0
module.nn_sin_phi.nn.2.bias 512 0
module.nn_sin_phi.nn.4.weight 1024 0
module.nn_sin_phi.nn.4.bias 2 0
module.nn_cos_phi.nn.0.weight 273920 0
module.nn_cos_phi.nn.0.bias 512 0
module.nn_cos_phi.nn.2.weight 512 0
module.nn_cos_phi.nn.2.bias 512 0
module.nn_cos_phi.nn.4.weight 1024 0
module.nn_cos_phi.nn.4.bias 2 0
module.nn_energy.nn.0.weight 273920 0
module.nn_energy.nn.0.bias 512 0
module.nn_energy.nn.2.weight 512 0
module.nn_energy.nn.2.bias 512 0
module.nn_energy.nn.4.weight 1024 0
module.nn_energy.nn.4.bias 2 0
[2024-08-26 15:03:07,754] INFO: Creating experiment dir /pfvol/experiments/Aug26_CLD_finetuned_1k_pyg-cld_20240826_150254_267365
[2024-08-26 15:03:07,754] INFO: Model directory /pfvol/experiments/Aug26_CLD_finetuned_1k_pyg-cld_20240826_150254_267365
[2024-08-26 15:03:07,769] INFO: train_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 15:03:07,862] INFO: valid_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 15:03:07,918] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-26 15:03:18,956] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-26 15:03:27,207] INFO: Rank 0: epoch=1 / 100 train_loss=46.7217 valid_loss=31.6247 stale=0 time=0.32m eta=31.8m
[2024-08-26 15:03:27,255] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-26 15:03:31,685] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-26 15:03:37,935] INFO: Rank 0: epoch=2 / 100 train_loss=30.7152 valid_loss=30.2023 stale=0 time=0.18m eta=24.5m
[2024-08-26 15:03:38,587] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-26 15:03:42,844] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-26 15:03:50,693] INFO: Rank 0: epoch=3 / 100 train_loss=29.7733 valid_loss=29.6004 stale=0 time=0.2m eta=23.1m
[2024-08-26 15:03:51,399] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-26 15:03:55,707] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-26 15:04:02,332] INFO: Rank 0: epoch=4 / 100 train_loss=29.2078 valid_loss=29.2707 stale=0 time=0.18m eta=21.8m
[2024-08-26 15:04:03,153] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-26 15:04:07,446] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-26 15:04:16,087] INFO: Rank 0: epoch=5 / 100 train_loss=28.8041 valid_loss=29.0064 stale=0 time=0.22m eta=21.6m
[2024-08-26 15:04:17,143] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-26 15:04:21,575] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-26 15:04:28,062] INFO: Rank 0: epoch=6 / 100 train_loss=28.4435 valid_loss=28.8297 stale=0 time=0.18m eta=20.9m
[2024-08-26 15:04:28,857] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-26 15:04:33,287] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-26 15:04:38,871] INFO: Rank 0: epoch=7 / 100 train_loss=28.1219 valid_loss=28.5821 stale=0 time=0.17m eta=20.1m
[2024-08-26 15:04:39,673] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-26 15:04:43,947] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-26 15:04:49,746] INFO: Rank 0: epoch=8 / 100 train_loss=27.8266 valid_loss=28.3269 stale=0 time=0.17m eta=19.5m
[2024-08-26 15:04:50,532] INFO: Initiating epoch #9 train run on device rank=0
[2024-08-26 15:04:54,862] INFO: Initiating epoch #9 valid run on device rank=0
[2024-08-26 15:05:02,130] INFO: Rank 0: epoch=9 / 100 train_loss=27.5437 valid_loss=28.2147 stale=0 time=0.19m eta=19.2m
[2024-08-26 15:05:02,973] INFO: Initiating epoch #10 train run on device rank=0
[2024-08-26 15:05:07,256] INFO: Initiating epoch #10 valid run on device rank=0
[2024-08-26 15:05:13,819] INFO: Rank 0: epoch=10 / 100 train_loss=27.2520 valid_loss=28.1076 stale=0 time=0.18m eta=18.9m
[2024-08-26 15:05:14,422] INFO: Initiating epoch #11 train run on device rank=0
[2024-08-26 15:05:18,717] INFO: Initiating epoch #11 valid run on device rank=0
[2024-08-26 15:05:25,457] INFO: Rank 0: epoch=11 / 100 train_loss=26.9416 valid_loss=27.9996 stale=0 time=0.18m eta=18.5m
[2024-08-26 15:05:25,933] INFO: Initiating epoch #12 train run on device rank=0
[2024-08-26 15:05:30,258] INFO: Initiating epoch #12 valid run on device rank=0
[2024-08-26 15:05:36,317] INFO: Rank 0: epoch=12 / 100 train_loss=26.5233 valid_loss=27.7138 stale=0 time=0.17m eta=18.1m
[2024-08-26 15:05:36,905] INFO: Initiating epoch #13 train run on device rank=0
[2024-08-26 15:05:41,071] INFO: Initiating epoch #13 valid run on device rank=0
[2024-08-26 15:05:48,265] INFO: Rank 0: epoch=13 / 100 train_loss=26.2545 valid_loss=27.3864 stale=0 time=0.19m eta=17.9m
[2024-08-26 15:05:48,876] INFO: Initiating epoch #14 train run on device rank=0
[2024-08-26 15:05:53,137] INFO: Initiating epoch #14 valid run on device rank=0
[2024-08-26 15:06:00,932] INFO: Rank 0: epoch=14 / 100 train_loss=25.9402 valid_loss=27.3679 stale=0 time=0.2m eta=17.7m
[2024-08-26 15:06:01,653] INFO: Initiating epoch #15 train run on device rank=0
[2024-08-26 15:06:05,891] INFO: Initiating epoch #15 valid run on device rank=0
[2024-08-26 15:06:10,565] INFO: Rank 0: epoch=15 / 100 train_loss=25.7753 valid_loss=27.6516 stale=1 time=0.15m eta=17.3m
[2024-08-26 15:06:11,290] INFO: Initiating epoch #16 train run on device rank=0
[2024-08-26 15:06:15,792] INFO: Initiating epoch #16 valid run on device rank=0
[2024-08-26 15:06:20,470] INFO: Rank 0: epoch=16 / 100 train_loss=25.5489 valid_loss=27.5263 stale=2 time=0.15m eta=16.8m
[2024-08-26 15:06:21,330] INFO: Initiating epoch #17 train run on device rank=0
[2024-08-26 15:06:25,632] INFO: Initiating epoch #17 valid run on device rank=0
[2024-08-26 15:06:33,630] INFO: Rank 0: epoch=17 / 100 train_loss=25.1462 valid_loss=27.2451 stale=0 time=0.21m eta=16.7m
[2024-08-26 15:06:34,628] INFO: Initiating epoch #18 train run on device rank=0
[2024-08-26 15:06:39,038] INFO: Initiating epoch #18 valid run on device rank=0
[2024-08-26 15:06:45,917] INFO: Rank 0: epoch=18 / 100 train_loss=24.9320 valid_loss=26.7176 stale=0 time=0.19m eta=16.6m
[2024-08-26 15:06:46,415] INFO: Initiating epoch #19 train run on device rank=0
[2024-08-26 15:06:50,708] INFO: Initiating epoch #19 valid run on device rank=0
[2024-08-26 15:06:56,723] INFO: Rank 0: epoch=19 / 100 train_loss=24.7639 valid_loss=26.6306 stale=0 time=0.17m eta=16.3m
[2024-08-26 15:06:57,397] INFO: Initiating epoch #20 train run on device rank=0
[2024-08-26 15:07:01,666] INFO: Initiating epoch #20 valid run on device rank=0
[2024-08-26 15:07:06,610] INFO: Rank 0: epoch=20 / 100 train_loss=24.4381 valid_loss=26.7470 stale=1 time=0.15m eta=15.9m
[2024-08-26 15:07:07,484] INFO: Initiating epoch #21 train run on device rank=0
[2024-08-26 15:07:11,613] INFO: Initiating epoch #21 valid run on device rank=0
[2024-08-26 15:07:16,284] INFO: Rank 0: epoch=21 / 100 train_loss=24.2247 valid_loss=26.8019 stale=2 time=0.15m eta=15.6m
[2024-08-26 15:07:17,045] INFO: Initiating epoch #22 train run on device rank=0
[2024-08-26 15:07:21,371] INFO: Initiating epoch #22 valid run on device rank=0
[2024-08-26 15:07:25,761] INFO: Rank 0: epoch=22 / 100 train_loss=24.1113 valid_loss=26.6541 stale=3 time=0.15m eta=15.2m
[2024-08-26 15:07:26,603] INFO: Initiating epoch #23 train run on device rank=0
[2024-08-26 15:07:30,954] INFO: Initiating epoch #23 valid run on device rank=0
[2024-08-26 15:07:37,094] INFO: Rank 0: epoch=23 / 100 train_loss=23.7490 valid_loss=26.6704 stale=4 time=0.17m eta=15.0m
[2024-08-26 15:07:38,197] INFO: Initiating epoch #24 train run on device rank=0
[2024-08-26 15:07:42,487] INFO: Initiating epoch #24 valid run on device rank=0
[2024-08-26 15:07:46,478] INFO: Rank 0: epoch=24 / 100 train_loss=23.4957 valid_loss=26.8693 stale=5 time=0.14m eta=14.7m
[2024-08-26 15:07:47,651] INFO: Initiating epoch #25 train run on device rank=0
[2024-08-26 15:07:51,965] INFO: Initiating epoch #25 valid run on device rank=0
[2024-08-26 15:07:56,175] INFO: Rank 0: epoch=25 / 100 train_loss=23.1991 valid_loss=26.9954 stale=6 time=0.14m eta=14.4m
[2024-08-26 15:07:57,231] INFO: Initiating epoch #26 train run on device rank=0
[2024-08-26 15:08:01,525] INFO: Initiating epoch #26 valid run on device rank=0
[2024-08-26 15:08:06,442] INFO: Rank 0: epoch=26 / 100 train_loss=23.0037 valid_loss=27.0160 stale=7 time=0.15m eta=14.2m
[2024-08-26 15:08:07,337] INFO: Initiating epoch #27 train run on device rank=0
[2024-08-26 15:08:11,797] INFO: Initiating epoch #27 valid run on device rank=0
[2024-08-26 15:08:16,938] INFO: Rank 0: epoch=27 / 100 train_loss=22.8287 valid_loss=27.0840 stale=8 time=0.16m eta=13.9m
[2024-08-26 15:08:17,878] INFO: Initiating epoch #28 train run on device rank=0
[2024-08-26 15:08:22,136] INFO: Initiating epoch #28 valid run on device rank=0
[2024-08-26 15:08:26,915] INFO: Rank 0: epoch=28 / 100 train_loss=22.7122 valid_loss=27.3264 stale=9 time=0.15m eta=13.7m
[2024-08-26 15:08:27,617] INFO: Initiating epoch #29 train run on device rank=0
[2024-08-26 15:08:31,889] INFO: Initiating epoch #29 valid run on device rank=0
[2024-08-26 15:08:36,592] INFO: Rank 0: epoch=29 / 100 train_loss=22.5667 valid_loss=27.4345 stale=10 time=0.15m eta=13.4m
[2024-08-26 15:08:37,306] INFO: Initiating epoch #30 train run on device rank=0
[2024-08-26 15:08:41,861] INFO: Initiating epoch #30 valid run on device rank=0
[2024-08-26 15:08:46,464] INFO: Rank 0: epoch=30 / 100 train_loss=22.4941 valid_loss=27.7222 stale=11 time=0.15m eta=13.2m
[2024-08-26 15:08:47,339] INFO: Initiating epoch #31 train run on device rank=0
[2024-08-26 15:08:51,698] INFO: Initiating epoch #31 valid run on device rank=0
[2024-08-26 15:08:56,587] INFO: Rank 0: epoch=31 / 100 train_loss=22.3569 valid_loss=27.5927 stale=12 time=0.15m eta=12.9m
[2024-08-26 15:08:57,446] INFO: Initiating epoch #32 train run on device rank=0
[2024-08-26 15:09:01,954] INFO: Initiating epoch #32 valid run on device rank=0
[2024-08-26 15:09:07,273] INFO: Rank 0: epoch=32 / 100 train_loss=22.0741 valid_loss=27.5606 stale=13 time=0.16m eta=12.7m
[2024-08-26 15:09:08,153] INFO: Initiating epoch #33 train run on device rank=0
[2024-08-26 15:09:12,483] INFO: Initiating epoch #33 valid run on device rank=0
[2024-08-26 15:09:18,161] INFO: Rank 0: epoch=33 / 100 train_loss=21.7619 valid_loss=27.7213 stale=14 time=0.17m eta=12.5m
[2024-08-26 15:09:19,487] INFO: Initiating epoch #34 train run on device rank=0
[2024-08-26 15:09:23,770] INFO: Initiating epoch #34 valid run on device rank=0
[2024-08-26 15:09:28,696] INFO: Rank 0: epoch=34 / 100 train_loss=21.5076 valid_loss=28.3184 stale=15 time=0.15m eta=12.3m
[2024-08-26 15:09:29,748] INFO: Initiating epoch #35 train run on device rank=0
[2024-08-26 15:09:34,445] INFO: Initiating epoch #35 valid run on device rank=0
[2024-08-26 15:09:39,171] INFO: Rank 0: epoch=35 / 100 train_loss=21.2929 valid_loss=27.5211 stale=16 time=0.16m eta=12.1m
[2024-08-26 15:09:40,618] INFO: Initiating epoch #36 train run on device rank=0
[2024-08-26 15:09:45,040] INFO: Initiating epoch #36 valid run on device rank=0
[2024-08-26 15:09:49,886] INFO: Rank 0: epoch=36 / 100 train_loss=20.9209 valid_loss=28.2710 stale=17 time=0.15m eta=11.9m
[2024-08-26 15:09:50,883] INFO: Initiating epoch #37 train run on device rank=0
[2024-08-26 15:09:55,163] INFO: Initiating epoch #37 valid run on device rank=0
[2024-08-26 15:09:59,934] INFO: Rank 0: epoch=37 / 100 train_loss=20.5001 valid_loss=28.7314 stale=18 time=0.15m eta=11.7m
[2024-08-26 15:10:00,846] INFO: Initiating epoch #38 train run on device rank=0
[2024-08-26 15:10:05,230] INFO: Initiating epoch #38 valid run on device rank=0
[2024-08-26 15:10:11,440] INFO: Rank 0: epoch=38 / 100 train_loss=20.4554 valid_loss=29.1423 stale=19 time=0.18m eta=11.5m
[2024-08-26 15:10:12,238] INFO: Initiating epoch #39 train run on device rank=0
[2024-08-26 15:10:16,499] INFO: Initiating epoch #39 valid run on device rank=0
[2024-08-26 15:10:21,782] INFO: Rank 0: epoch=39 / 100 train_loss=20.6065 valid_loss=28.5896 stale=20 time=0.16m eta=11.3m
[2024-08-26 15:10:22,982] INFO: Initiating epoch #40 train run on device rank=0
[2024-08-26 15:10:27,320] INFO: Initiating epoch #40 valid run on device rank=0
[2024-08-26 15:10:32,818] INFO: Done with training. Total training time on device 0 is 7.415min