[2024-08-26 14:53:55,052] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 2 gpus
[2024-08-26 14:53:55,135] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 14:53:55,135] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 14:54:02,150] INFO: using dtype=torch.float32
[2024-08-26 14:54:02,434] INFO: model_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'efficient'}
[2024-08-26 14:54:02,461] INFO: using attention_type=math
[2024-08-26 14:54:02,478] INFO: using attention_type=math
[2024-08-26 14:54:02,495] INFO: using attention_type=math
[2024-08-26 14:54:02,513] INFO: using attention_type=math
[2024-08-26 14:54:02,530] INFO: using attention_type=math
[2024-08-26 14:54:02,547] INFO: using attention_type=math
[2024-08-26 14:54:08,516] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_8GTX/best_weights.pth
[2024-08-26 14:54:09,940] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-26 14:54:09,941] INFO: Trainable parameters: 11671568
[2024-08-26 14:54:09,941] INFO: Non-trainable parameters: 0
[2024-08-26 14:54:09,941] INFO: Total parameters: 11671568
[2024-08-26 14:54:09,946] INFO: Modules  Trainable parameters  Non-tranable parameters
module.nn0_id.0.weight  8704  0
module.nn0_id.0.bias  512  0
module.nn0_id.2.weight  512  0
module.nn0_id.2.bias  512  0
module.nn0_id.4.weight  262144  0
module.nn0_id.4.bias  512  0
module.nn0_reg.0.weight  8704  0
module.nn0_reg.0.bias  512  0
module.nn0_reg.2.weight  512  0
module.nn0_reg.2.bias  512  0
module.nn0_reg.4.weight  262144  0
module.nn0_reg.4.bias  512  0
module.conv_id.0.mha.in_proj_weight  786432  0
module.conv_id.0.mha.in_proj_bias  1536  0
module.conv_id.0.mha.out_proj.weight  262144  0
module.conv_id.0.mha.out_proj.bias  512  0
module.conv_id.0.norm0.weight  512  0
module.conv_id.0.norm0.bias  512  0
module.conv_id.0.norm1.weight  512  0
module.conv_id.0.norm1.bias  512  0
module.conv_id.0.seq.0.weight  262144  0
module.conv_id.0.seq.0.bias  512  0
module.conv_id.0.seq.2.weight  262144  0
module.conv_id.0.seq.2.bias  512  0
module.conv_id.1.mha.in_proj_weight  786432  0
module.conv_id.1.mha.in_proj_bias  1536  0
module.conv_id.1.mha.out_proj.weight  262144  0
module.conv_id.1.mha.out_proj.bias  512  0
module.conv_id.1.norm0.weight  512  0
module.conv_id.1.norm0.bias  512  0
module.conv_id.1.norm1.weight  512  0
module.conv_id.1.norm1.bias  512  0
module.conv_id.1.seq.0.weight  262144  0
module.conv_id.1.seq.0.bias  512  0
module.conv_id.1.seq.2.weight  262144  0
module.conv_id.1.seq.2.bias  512  0
module.conv_id.2.mha.in_proj_weight  786432  0
module.conv_id.2.mha.in_proj_bias  1536  0
module.conv_id.2.mha.out_proj.weight  262144  0
module.conv_id.2.mha.out_proj.bias  512  0
module.conv_id.2.norm0.weight  512  0
module.conv_id.2.norm0.bias  512  0
module.conv_id.2.norm1.weight  512  0
module.conv_id.2.norm1.bias  512  0
module.conv_id.2.seq.0.weight  262144  0
module.conv_id.2.seq.0.bias  512  0
module.conv_id.2.seq.2.weight  262144  0
module.conv_id.2.seq.2.bias  512  0
module.conv_reg.0.mha.in_proj_weight  786432  0
module.conv_reg.0.mha.in_proj_bias  1536  0
module.conv_reg.0.mha.out_proj.weight  262144  0
module.conv_reg.0.mha.out_proj.bias  512  0
module.conv_reg.0.norm0.weight  512  0
module.conv_reg.0.norm0.bias  512  0
module.conv_reg.0.norm1.weight  512  0
module.conv_reg.0.norm1.bias  512  0
module.conv_reg.0.seq.0.weight  262144  0
module.conv_reg.0.seq.0.bias  512  0
module.conv_reg.0.seq.2.weight  262144  0
module.conv_reg.0.seq.2.bias  512  0
module.conv_reg.1.mha.in_proj_weight  786432  0
module.conv_reg.1.mha.in_proj_bias  1536  0
module.conv_reg.1.mha.out_proj.weight  262144  0
module.conv_reg.1.mha.out_proj.bias  512  0
module.conv_reg.1.norm0.weight  512  0
module.conv_reg.1.norm0.bias  512  0
module.conv_reg.1.norm1.weight  512  0
module.conv_reg.1.norm1.bias  512  0
module.conv_reg.1.seq.0.weight  262144  0
module.conv_reg.1.seq.0.bias  512  0
module.conv_reg.1.seq.2.weight  262144  0
module.conv_reg.1.seq.2.bias  512  0
module.conv_reg.2.mha.in_proj_weight  786432  0
module.conv_reg.2.mha.in_proj_bias  1536  0
module.conv_reg.2.mha.out_proj.weight  262144  0
module.conv_reg.2.mha.out_proj.bias  512  0
module.conv_reg.2.norm0.weight  512  0
module.conv_reg.2.norm0.bias  512  0
module.conv_reg.2.norm1.weight  512  0
module.conv_reg.2.norm1.bias  512  0
module.conv_reg.2.seq.0.weight  262144  0
module.conv_reg.2.seq.0.bias  512  0
module.conv_reg.2.seq.2.weight  262144  0
module.conv_reg.2.seq.2.bias  512  0
module.nn_id.0.weight  270848  0
module.nn_id.0.bias  512  0
module.nn_id.2.weight  512  0
module.nn_id.2.bias  512  0
module.nn_id.4.weight  3072  0
module.nn_id.4.bias  6  0
module.nn_pt.nn.0.weight  273920  0
module.nn_pt.nn.0.bias  512  0
module.nn_pt.nn.2.weight  512  0
module.nn_pt.nn.2.bias  512  0
module.nn_pt.nn.4.weight  1024  0
module.nn_pt.nn.4.bias  2  0
module.nn_eta.nn.0.weight  273920  0
module.nn_eta.nn.0.bias  512  0
module.nn_eta.nn.2.weight  512  0
module.nn_eta.nn.2.bias  512  0
module.nn_eta.nn.4.weight  1024  0
module.nn_eta.nn.4.bias  2  0
module.nn_sin_phi.nn.0.weight  273920  0
module.nn_sin_phi.nn.0.bias  512  0
module.nn_sin_phi.nn.2.weight  512  0
module.nn_sin_phi.nn.2.bias  512  0
module.nn_sin_phi.nn.4.weight  1024  0
module.nn_sin_phi.nn.4.bias  2  0
module.nn_cos_phi.nn.0.weight  273920  0
module.nn_cos_phi.nn.0.bias  512  0
module.nn_cos_phi.nn.2.weight  512  0
module.nn_cos_phi.nn.2.bias  512  0
module.nn_cos_phi.nn.4.weight  1024  0
module.nn_cos_phi.nn.4.bias  2  0
module.nn_energy.nn.0.weight  273920  0
module.nn_energy.nn.0.bias  512  0
module.nn_energy.nn.2.weight  512  0
module.nn_energy.nn.2.bias  512  0
module.nn_energy.nn.4.weight  1024  0
module.nn_energy.nn.4.bias  2  0
[2024-08-26 14:54:09,948] INFO: Creating experiment dir /pfvol/experiments/Aug26_CLD_finetuned_10k_pyg-cld_20240826_145354_442318
[2024-08-26 14:54:09,948] INFO: Model directory /pfvol/experiments/Aug26_CLD_finetuned_10k_pyg-cld_20240826_145354_442318
[2024-08-26 14:54:09,975] INFO: train_dataset: cld_edm_ttbar_pf, 10000
[2024-08-26 14:54:10,000] INFO: valid_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 14:54:10,012] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-26 14:54:50,620] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-26 14:54:55,325] INFO: Rank 0: epoch=1 / 100 train_loss=31.0139 valid_loss=27.9408 stale=0 time=0.76m eta=74.8m
[2024-08-26 14:54:55,329] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-26 14:55:28,808] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-26 14:55:33,941] INFO: Rank 0: epoch=2 / 100 train_loss=26.8866 valid_loss=25.9809 stale=0 time=0.64m eta=68.5m
[2024-08-26 14:55:34,476] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-26 14:56:08,458] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-26 14:56:13,903] INFO: Rank 0: epoch=3 / 100 train_loss=25.3245 valid_loss=25.4497 stale=0 time=0.66m eta=66.8m
[2024-08-26 14:56:13,949] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-26 14:56:49,534] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-26 14:56:54,212] INFO: Rank 0: epoch=4 / 100 train_loss=24.5504 valid_loss=24.5991 stale=0 time=0.67m eta=65.7m
[2024-08-26 14:56:54,736] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-26 14:57:31,409] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-26 14:57:37,123] INFO: Rank 0: epoch=5 / 100 train_loss=23.9790 valid_loss=24.3285 stale=0 time=0.71m eta=65.6m
[2024-08-26 14:57:37,646] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-26 14:58:13,504] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-26 14:58:18,632] INFO: Rank 0: epoch=6 / 100 train_loss=23.5353 valid_loss=23.6925 stale=0 time=0.68m eta=64.9m
[2024-08-26 14:58:19,012] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-26 14:58:56,666] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-26 14:59:01,760] INFO: Rank 0: epoch=7 / 100 train_loss=23.1651 valid_loss=23.4469 stale=0 time=0.71m eta=64.6m
[2024-08-26 14:59:02,093] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-26 14:59:38,739] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-26 14:59:42,777] INFO: Rank 0: epoch=8 / 100 train_loss=22.8230 valid_loss=23.2301 stale=0 time=0.68m eta=63.8m
[2024-08-26 14:59:42,918] INFO: Initiating epoch #9 train run on device rank=0
[2024-08-26 15:00:21,376] INFO: Initiating epoch #9 valid run on device rank=0
[2024-08-26 15:00:25,411] INFO: Rank 0: epoch=9 / 100 train_loss=22.4787 valid_loss=23.1212 stale=0 time=0.71m eta=63.3m
[2024-08-26 15:00:25,515] INFO: Initiating epoch #10 train run on device rank=0
[2024-08-26 15:01:03,424] INFO: Initiating epoch #10 valid run on device rank=0
[2024-08-26 15:01:07,592] INFO: Rank 0: epoch=10 / 100 train_loss=22.0972 valid_loss=22.7284 stale=0 time=0.7m eta=62.6m
[2024-08-26 15:01:07,652] INFO: Initiating epoch #11 train run on device rank=0
[2024-08-26 15:01:45,618] INFO: Initiating epoch #11 valid run on device rank=0
[2024-08-26 15:01:53,644] INFO: Rank 0: epoch=11 / 100 train_loss=21.6517 valid_loss=22.4032 stale=0 time=0.77m eta=62.5m
[2024-08-26 15:01:54,471] INFO: Initiating epoch #12 train run on device rank=0
[2024-08-26 15:02:29,830] INFO: Initiating epoch #12 valid run on device rank=0
[2024-08-26 15:02:38,290] INFO: Rank 0: epoch=12 / 100 train_loss=21.2898 valid_loss=22.2579 stale=0 time=0.73m eta=62.1m
[2024-08-26 15:02:38,867] INFO: Initiating epoch #13 train run on device rank=0
[2024-08-26 15:03:14,111] INFO: Initiating epoch #13 valid run on device rank=0
[2024-08-26 15:03:22,047] INFO: Rank 0: epoch=13 / 100 train_loss=20.9539 valid_loss=22.1715 stale=0 time=0.72m eta=61.6m
[2024-08-26 15:03:22,724] INFO: Initiating epoch #14 train run on device rank=0
[2024-08-26 15:03:58,203] INFO: Initiating epoch #14 valid run on device rank=0
[2024-08-26 15:04:05,788] INFO: Rank 0: epoch=14 / 100 train_loss=20.6045 valid_loss=22.0613 stale=0 time=0.72m eta=61.0m
[2024-08-26 15:04:06,751] INFO: Initiating epoch #15 train run on device rank=0
[2024-08-26 15:04:42,290] INFO: Initiating epoch #15 valid run on device rank=0
[2024-08-26 15:04:49,733] INFO: Rank 0: epoch=15 / 100 train_loss=20.2790 valid_loss=21.9366 stale=0 time=0.72m eta=60.4m
[2024-08-26 15:04:50,485] INFO: Initiating epoch #16 train run on device rank=0
[2024-08-26 15:05:27,486] INFO: Initiating epoch #16 valid run on device rank=0
[2024-08-26 15:05:34,524] INFO: Rank 0: epoch=16 / 100 train_loss=19.9529 valid_loss=21.8966 stale=0 time=0.73m eta=59.9m
[2024-08-26 15:05:35,173] INFO: Initiating epoch #17 train run on device rank=0
[2024-08-26 15:06:10,742] INFO: Initiating epoch #17 valid run on device rank=0
[2024-08-26 15:06:17,273] INFO: Rank 0: epoch=17 / 100 train_loss=19.6421 valid_loss=21.8940 stale=0 time=0.7m eta=59.2m
[2024-08-26 15:06:18,399] INFO: Initiating epoch #18 train run on device rank=0
[2024-08-26 15:06:53,934] INFO: Initiating epoch #18 valid run on device rank=0
[2024-08-26 15:07:02,814] INFO: Rank 0: epoch=18 / 100 train_loss=19.3156 valid_loss=21.8933 stale=0 time=0.74m eta=58.7m
[2024-08-26 15:07:03,675] INFO: Initiating epoch #19 train run on device rank=0
[2024-08-26 15:07:39,262] INFO: Initiating epoch #19 valid run on device rank=0
[2024-08-26 15:07:44,803] INFO: Rank 0: epoch=19 / 100 train_loss=19.0088 valid_loss=22.2114 stale=1 time=0.69m eta=57.9m
[2024-08-26 15:07:45,580] INFO: Initiating epoch #20 train run on device rank=0
[2024-08-26 15:08:23,139] INFO: Initiating epoch #20 valid run on device rank=0
[2024-08-26 15:08:29,021] INFO: Rank 0: epoch=20 / 100 train_loss=18.7617 valid_loss=22.2558 stale=2 time=0.72m eta=57.3m
[2024-08-26 15:08:29,832] INFO: Initiating epoch #21 train run on device rank=0
[2024-08-26 15:09:06,585] INFO: Initiating epoch #21 valid run on device rank=0
[2024-08-26 15:09:12,101] INFO: Rank 0: epoch=21 / 100 train_loss=18.4839 valid_loss=22.2334 stale=3 time=0.7m eta=56.6m
[2024-08-26 15:09:13,024] INFO: Initiating epoch #22 train run on device rank=0
[2024-08-26 15:09:49,738] INFO: Initiating epoch #22 valid run on device rank=0
[2024-08-26 15:09:55,706] INFO: Rank 0: epoch=22 / 100 train_loss=18.1775 valid_loss=22.2367 stale=4 time=0.71m eta=55.9m
[2024-08-26 15:09:56,830] INFO: Initiating epoch #23 train run on device rank=0
[2024-08-26 15:10:33,289] INFO: Initiating epoch #23 valid run on device rank=0
[2024-08-26 15:10:38,913] INFO: Rank 0: epoch=23 / 100 train_loss=17.9536 valid_loss=22.3572 stale=5 time=0.7m eta=55.2m
[2024-08-26 15:10:39,573] INFO: Initiating epoch #24 train run on device rank=0
[2024-08-26 15:11:15,882] INFO: Initiating epoch #24 valid run on device rank=0
[2024-08-26 15:11:21,034] INFO: Rank 0: epoch=24 / 100 train_loss=17.7423 valid_loss=22.4930 stale=6 time=0.69m eta=54.4m
[2024-08-26 15:11:21,827] INFO: Initiating epoch #25 train run on device rank=0
[2024-08-26 15:11:58,798] INFO: Initiating epoch #25 valid run on device rank=0
[2024-08-26 15:12:05,561] INFO: Rank 0: epoch=25 / 100 train_loss=17.4906 valid_loss=22.5829 stale=7 time=0.73m eta=53.8m
[2024-08-26 15:12:06,483] INFO: Initiating epoch #26 train run on device rank=0
[2024-08-26 15:12:42,745] INFO: Initiating epoch #26 valid run on device rank=0
[2024-08-26 15:12:48,746] INFO: Rank 0: epoch=26 / 100 train_loss=17.2588 valid_loss=23.0065 stale=8 time=0.7m eta=53.1m
[2024-08-26 15:12:49,663] INFO: Initiating epoch #27 train run on device rank=0
[2024-08-26 15:13:27,314] INFO: Initiating epoch #27 valid run on device rank=0
[2024-08-26 15:13:33,836] INFO: Rank 0: epoch=27 / 100 train_loss=16.9705 valid_loss=23.4270 stale=9 time=0.74m eta=52.4m
[2024-08-26 15:13:34,529] INFO: Initiating epoch #28 train run on device rank=0
[2024-08-26 15:14:10,866] INFO: Initiating epoch #28 valid run on device rank=0
[2024-08-26 15:14:17,355] INFO: Rank 0: epoch=28 / 100 train_loss=16.7079 valid_loss=23.9473 stale=10 time=0.71m eta=51.7m
[2024-08-26 15:14:18,222] INFO: Initiating epoch #29 train run on device rank=0
[2024-08-26 15:14:54,633] INFO: Initiating epoch #29 valid run on device rank=0
[2024-08-26 15:15:00,380] INFO: Rank 0: epoch=29 / 100 train_loss=16.5551 valid_loss=24.3196 stale=11 time=0.7m eta=51.0m
[2024-08-26 15:15:01,183] INFO: Initiating epoch #30 train run on device rank=0
[2024-08-26 15:15:38,003] INFO: Initiating epoch #30 valid run on device rank=0
[2024-08-26 15:15:43,759] INFO: Rank 0: epoch=30 / 100 train_loss=16.3214 valid_loss=24.6123 stale=12 time=0.71m eta=50.3m
[2024-08-26 15:15:44,422] INFO: Initiating epoch #31 train run on device rank=0
[2024-08-26 15:16:20,220] INFO: Initiating epoch #31 valid run on device rank=0
[2024-08-26 15:16:26,253] INFO: Rank 0: epoch=31 / 100 train_loss=15.9884 valid_loss=25.0219 stale=13 time=0.7m eta=49.6m
[2024-08-26 15:16:26,835] INFO: Initiating epoch #32 train run on device rank=0
[2024-08-26 15:17:03,176] INFO: Initiating epoch #32 valid run on device rank=0
[2024-08-26 15:17:09,475] INFO: Rank 0: epoch=32 / 100 train_loss=15.7496 valid_loss=25.4166 stale=14 time=0.71m eta=48.9m
[2024-08-26 15:17:10,276] INFO: Initiating epoch #33 train run on device rank=0
[2024-08-26 15:17:46,674] INFO: Initiating epoch #33 valid run on device rank=0
[2024-08-26 15:17:53,655] INFO: Rank 0: epoch=33 / 100 train_loss=15.5224 valid_loss=25.5530 stale=15 time=0.72m eta=48.2m
[2024-08-26 15:17:56,081] INFO: Initiating epoch #34 train run on device rank=0
[2024-08-26 15:18:32,013] INFO: Initiating epoch #34 valid run on device rank=0
[2024-08-26 15:18:37,600] INFO: Rank 0: epoch=34 / 100 train_loss=15.3605 valid_loss=25.9477 stale=16 time=0.69m eta=47.5m
[2024-08-26 15:18:38,732] INFO: Initiating epoch #35 train run on device rank=0
[2024-08-26 15:19:15,079] INFO: Initiating epoch #35 valid run on device rank=0
[2024-08-26 15:19:20,969] INFO: Rank 0: epoch=35 / 100 train_loss=15.0853 valid_loss=26.3372 stale=17 time=0.7m eta=46.8m
[2024-08-26 15:19:21,685] INFO: Initiating epoch #36 train run on device rank=0
[2024-08-26 15:19:58,449] INFO: Initiating epoch #36 valid run on device rank=0
[2024-08-26 15:20:04,413] INFO: Rank 0: epoch=36 / 100 train_loss=14.9023 valid_loss=26.4938 stale=18 time=0.71m eta=46.1m
[2024-08-26 15:20:05,162] INFO: Initiating epoch #37 train run on device rank=0
[2024-08-26 15:20:42,040] INFO: Initiating epoch #37 valid run on device rank=0
[2024-08-26 15:20:47,821] INFO: Rank 0: epoch=37 / 100 train_loss=14.6175 valid_loss=26.9006 stale=19 time=0.71m eta=45.3m
[2024-08-26 15:20:48,661] INFO: Initiating epoch #38 train run on device rank=0
[2024-08-26 15:21:25,275] INFO: Initiating epoch #38 valid run on device rank=0
[2024-08-26 15:21:31,120] INFO: Rank 0: epoch=38 / 100 train_loss=14.3618 valid_loss=27.0975 stale=20 time=0.71m eta=44.6m
[2024-08-26 15:21:32,150] INFO: Initiating epoch #39 train run on device rank=0
[2024-08-26 15:22:08,286] INFO: Initiating epoch #39 valid run on device rank=0
[2024-08-26 15:22:14,455] INFO: Done with training. Total training time on device 0 is 28.074min