[2024-08-26 15:02:55,322] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 2 gpus [2024-08-26 15:02:55,391] INFO: NVIDIA GeForce RTX 2080 Ti [2024-08-26 15:02:55,391] INFO: NVIDIA GeForce RTX 2080 Ti [2024-08-26 15:03:00,469] INFO: using dtype=torch.float32 [2024-08-26 15:03:01,502] INFO: model_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'efficient'} [2024-08-26 15:03:01,518] INFO: using attention_type=math [2024-08-26 15:03:01,528] INFO: using attention_type=math [2024-08-26 15:03:01,539] INFO: using attention_type=math [2024-08-26 15:03:01,550] INFO: using attention_type=math [2024-08-26 15:03:01,561] INFO: using attention_type=math [2024-08-26 15:03:01,571] INFO: using attention_type=math [2024-08-26 15:03:06,553] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_8GTX/best_weights.pth [2024-08-26 15:03:07,576] INFO: DistributedDataParallel( (module): MLPF( (nn0_id): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (nn0_reg): Sequential( (0): Linear(in_features=17, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=512, bias=True) ) (conv_id): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, 
out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (conv_reg): ModuleList( (0-2): 3 x SelfAttentionLayer( (mha): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (seq): Sequential( (0): Linear(in_features=512, out_features=512, bias=True) (1): ReLU() (2): Linear(in_features=512, out_features=512, bias=True) (3): ReLU() ) (dropout): Dropout(p=0.0, inplace=False) ) ) (nn_id): Sequential( (0): Linear(in_features=529, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=6, bias=True) ) (nn_pt): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_eta): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_sin_phi): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_cos_phi): RegressionOutput( (nn): Sequential( (0): 
Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) (nn_energy): RegressionOutput( (nn): Sequential( (0): Linear(in_features=535, out_features=512, bias=True) (1): ReLU() (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0.0, inplace=False) (4): Linear(in_features=512, out_features=2, bias=True) ) ) ) ) [2024-08-26 15:03:07,577] INFO: Trainable parameters: 11671568 [2024-08-26 15:03:07,577] INFO: Non-trainable parameters: 0 [2024-08-26 15:03:07,577] INFO: Total parameters: 11671568 [2024-08-26 15:03:07,580] INFO: Modules Trainable parameters Non-tranable parameters module.nn0_id.0.weight 8704 0 module.nn0_id.0.bias 512 0 module.nn0_id.2.weight 512 0 module.nn0_id.2.bias 512 0 module.nn0_id.4.weight 262144 0 module.nn0_id.4.bias 512 0 module.nn0_reg.0.weight 8704 0 module.nn0_reg.0.bias 512 0 module.nn0_reg.2.weight 512 0 module.nn0_reg.2.bias 512 0 module.nn0_reg.4.weight 262144 0 module.nn0_reg.4.bias 512 0 module.conv_id.0.mha.in_proj_weight 786432 0 module.conv_id.0.mha.in_proj_bias 1536 0 module.conv_id.0.mha.out_proj.weight 262144 0 module.conv_id.0.mha.out_proj.bias 512 0 module.conv_id.0.norm0.weight 512 0 module.conv_id.0.norm0.bias 512 0 module.conv_id.0.norm1.weight 512 0 module.conv_id.0.norm1.bias 512 0 module.conv_id.0.seq.0.weight 262144 0 module.conv_id.0.seq.0.bias 512 0 module.conv_id.0.seq.2.weight 262144 0 module.conv_id.0.seq.2.bias 512 0 module.conv_id.1.mha.in_proj_weight 786432 0 module.conv_id.1.mha.in_proj_bias 1536 0 module.conv_id.1.mha.out_proj.weight 262144 0 module.conv_id.1.mha.out_proj.bias 512 0 module.conv_id.1.norm0.weight 512 0 module.conv_id.1.norm0.bias 512 0 module.conv_id.1.norm1.weight 512 0 module.conv_id.1.norm1.bias 512 0 module.conv_id.1.seq.0.weight 262144 0 module.conv_id.1.seq.0.bias 512 0 module.conv_id.1.seq.2.weight 262144 0 
module.conv_id.1.seq.2.bias 512 0 module.conv_id.2.mha.in_proj_weight 786432 0 module.conv_id.2.mha.in_proj_bias 1536 0 module.conv_id.2.mha.out_proj.weight 262144 0 module.conv_id.2.mha.out_proj.bias 512 0 module.conv_id.2.norm0.weight 512 0 module.conv_id.2.norm0.bias 512 0 module.conv_id.2.norm1.weight 512 0 module.conv_id.2.norm1.bias 512 0 module.conv_id.2.seq.0.weight 262144 0 module.conv_id.2.seq.0.bias 512 0 module.conv_id.2.seq.2.weight 262144 0 module.conv_id.2.seq.2.bias 512 0 module.conv_reg.0.mha.in_proj_weight 786432 0 module.conv_reg.0.mha.in_proj_bias 1536 0 module.conv_reg.0.mha.out_proj.weight 262144 0 module.conv_reg.0.mha.out_proj.bias 512 0 module.conv_reg.0.norm0.weight 512 0 module.conv_reg.0.norm0.bias 512 0 module.conv_reg.0.norm1.weight 512 0 module.conv_reg.0.norm1.bias 512 0 module.conv_reg.0.seq.0.weight 262144 0 module.conv_reg.0.seq.0.bias 512 0 module.conv_reg.0.seq.2.weight 262144 0 module.conv_reg.0.seq.2.bias 512 0 module.conv_reg.1.mha.in_proj_weight 786432 0 module.conv_reg.1.mha.in_proj_bias 1536 0 module.conv_reg.1.mha.out_proj.weight 262144 0 module.conv_reg.1.mha.out_proj.bias 512 0 module.conv_reg.1.norm0.weight 512 0 module.conv_reg.1.norm0.bias 512 0 module.conv_reg.1.norm1.weight 512 0 module.conv_reg.1.norm1.bias 512 0 module.conv_reg.1.seq.0.weight 262144 0 module.conv_reg.1.seq.0.bias 512 0 module.conv_reg.1.seq.2.weight 262144 0 module.conv_reg.1.seq.2.bias 512 0 module.conv_reg.2.mha.in_proj_weight 786432 0 module.conv_reg.2.mha.in_proj_bias 1536 0 module.conv_reg.2.mha.out_proj.weight 262144 0 module.conv_reg.2.mha.out_proj.bias 512 0 module.conv_reg.2.norm0.weight 512 0 module.conv_reg.2.norm0.bias 512 0 module.conv_reg.2.norm1.weight 512 0 module.conv_reg.2.norm1.bias 512 0 module.conv_reg.2.seq.0.weight 262144 0 module.conv_reg.2.seq.0.bias 512 0 module.conv_reg.2.seq.2.weight 262144 0 module.conv_reg.2.seq.2.bias 512 0 module.nn_id.0.weight 270848 0 module.nn_id.0.bias 512 0 module.nn_id.2.weight 512 0 
module.nn_id.2.bias 512 0 module.nn_id.4.weight 3072 0 module.nn_id.4.bias 6 0 module.nn_pt.nn.0.weight 273920 0 module.nn_pt.nn.0.bias 512 0 module.nn_pt.nn.2.weight 512 0 module.nn_pt.nn.2.bias 512 0 module.nn_pt.nn.4.weight 1024 0 module.nn_pt.nn.4.bias 2 0 module.nn_eta.nn.0.weight 273920 0 module.nn_eta.nn.0.bias 512 0 module.nn_eta.nn.2.weight 512 0 module.nn_eta.nn.2.bias 512 0 module.nn_eta.nn.4.weight 1024 0 module.nn_eta.nn.4.bias 2 0 module.nn_sin_phi.nn.0.weight 273920 0 module.nn_sin_phi.nn.0.bias 512 0 module.nn_sin_phi.nn.2.weight 512 0 module.nn_sin_phi.nn.2.bias 512 0 module.nn_sin_phi.nn.4.weight 1024 0 module.nn_sin_phi.nn.4.bias 2 0 module.nn_cos_phi.nn.0.weight 273920 0 module.nn_cos_phi.nn.0.bias 512 0 module.nn_cos_phi.nn.2.weight 512 0 module.nn_cos_phi.nn.2.bias 512 0 module.nn_cos_phi.nn.4.weight 1024 0 module.nn_cos_phi.nn.4.bias 2 0 module.nn_energy.nn.0.weight 273920 0 module.nn_energy.nn.0.bias 512 0 module.nn_energy.nn.2.weight 512 0 module.nn_energy.nn.2.bias 512 0 module.nn_energy.nn.4.weight 1024 0 module.nn_energy.nn.4.bias 2 0 [2024-08-26 15:03:07,627] INFO: Creating experiment dir /pfvol/experiments/Aug26_CLD_finetuned_100_pyg-cld_20240826_150254_122632 [2024-08-26 15:03:07,627] INFO: Model directory /pfvol/experiments/Aug26_CLD_finetuned_100_pyg-cld_20240826_150254_122632 [2024-08-26 15:03:07,643] INFO: train_dataset: cld_edm_ttbar_pf, 100 [2024-08-26 15:03:07,672] INFO: valid_dataset: cld_edm_ttbar_pf, 1000 [2024-08-26 15:03:07,724] INFO: Initiating epoch #1 train run on device rank=0 [2024-08-26 15:03:15,999] INFO: Initiating epoch #1 valid run on device rank=0 [2024-08-26 15:03:23,987] INFO: Rank 0: epoch=1 / 100 train_loss=151.8542 valid_loss=50.5365 stale=0 time=0.27m eta=26.8m [2024-08-26 15:03:23,988] INFO: Initiating epoch #2 train run on device rank=0 [2024-08-26 15:03:25,936] INFO: Initiating epoch #2 valid run on device rank=0 [2024-08-26 15:03:32,918] INFO: Rank 0: epoch=2 / 100 train_loss=42.8434 valid_loss=39.4724 
stale=0 time=0.15m eta=20.6m [2024-08-26 15:03:33,655] INFO: Initiating epoch #3 train run on device rank=0 [2024-08-26 15:03:35,584] INFO: Initiating epoch #3 valid run on device rank=0 [2024-08-26 15:03:41,800] INFO: Rank 0: epoch=3 / 100 train_loss=37.3250 valid_loss=36.4715 stale=0 time=0.14m eta=18.4m [2024-08-26 15:03:42,389] INFO: Initiating epoch #4 train run on device rank=0 [2024-08-26 15:03:44,442] INFO: Initiating epoch #4 valid run on device rank=0 [2024-08-26 15:03:51,097] INFO: Rank 0: epoch=4 / 100 train_loss=34.8116 valid_loss=35.1516 stale=0 time=0.15m eta=17.3m [2024-08-26 15:03:51,993] INFO: Initiating epoch #5 train run on device rank=0 [2024-08-26 15:03:53,601] INFO: Initiating epoch #5 valid run on device rank=0 [2024-08-26 15:03:59,814] INFO: Rank 0: epoch=5 / 100 train_loss=33.5123 valid_loss=33.6785 stale=0 time=0.13m eta=16.5m [2024-08-26 15:04:00,575] INFO: Initiating epoch #6 train run on device rank=0 [2024-08-26 15:04:02,264] INFO: Initiating epoch #6 valid run on device rank=0 [2024-08-26 15:04:08,975] INFO: Rank 0: epoch=6 / 100 train_loss=32.2616 valid_loss=33.1524 stale=0 time=0.14m eta=16.0m [2024-08-26 15:04:09,763] INFO: Initiating epoch #7 train run on device rank=0 [2024-08-26 15:04:11,491] INFO: Initiating epoch #7 valid run on device rank=0 [2024-08-26 15:04:18,767] INFO: Rank 0: epoch=7 / 100 train_loss=31.7137 valid_loss=32.6384 stale=0 time=0.15m eta=15.7m [2024-08-26 15:04:19,464] INFO: Initiating epoch #8 train run on device rank=0 [2024-08-26 15:04:21,177] INFO: Initiating epoch #8 valid run on device rank=0 [2024-08-26 15:04:27,676] INFO: Rank 0: epoch=8 / 100 train_loss=31.1428 valid_loss=32.3626 stale=0 time=0.14m eta=15.3m [2024-08-26 15:04:28,437] INFO: Initiating epoch #9 train run on device rank=0 [2024-08-26 15:04:30,154] INFO: Initiating epoch #9 valid run on device rank=0 [2024-08-26 15:04:36,384] INFO: Rank 0: epoch=9 / 100 train_loss=30.8227 valid_loss=32.0640 stale=0 time=0.13m eta=14.9m [2024-08-26 
15:04:37,232] INFO: Initiating epoch #10 train run on device rank=0 [2024-08-26 15:04:39,128] INFO: Initiating epoch #10 valid run on device rank=0 [2024-08-26 15:04:44,812] INFO: Rank 0: epoch=10 / 100 train_loss=30.4355 valid_loss=31.8911 stale=0 time=0.13m eta=14.6m [2024-08-26 15:04:45,852] INFO: Initiating epoch #11 train run on device rank=0 [2024-08-26 15:04:47,620] INFO: Initiating epoch #11 valid run on device rank=0 [2024-08-26 15:04:54,848] INFO: Rank 0: epoch=11 / 100 train_loss=30.1706 valid_loss=31.7048 stale=0 time=0.15m eta=14.4m [2024-08-26 15:04:55,758] INFO: Initiating epoch #12 train run on device rank=0 [2024-08-26 15:04:57,570] INFO: Initiating epoch #12 valid run on device rank=0 [2024-08-26 15:05:03,883] INFO: Rank 0: epoch=12 / 100 train_loss=29.8858 valid_loss=31.6137 stale=0 time=0.14m eta=14.2m [2024-08-26 15:05:04,642] INFO: Initiating epoch #13 train run on device rank=0 [2024-08-26 15:05:06,276] INFO: Initiating epoch #13 valid run on device rank=0 [2024-08-26 15:05:13,328] INFO: Rank 0: epoch=13 / 100 train_loss=29.6624 valid_loss=31.4931 stale=0 time=0.14m eta=14.0m [2024-08-26 15:05:14,074] INFO: Initiating epoch #14 train run on device rank=0 [2024-08-26 15:05:15,655] INFO: Initiating epoch #14 valid run on device rank=0 [2024-08-26 15:05:24,036] INFO: Rank 0: epoch=14 / 100 train_loss=29.4187 valid_loss=31.4246 stale=0 time=0.17m eta=14.0m [2024-08-26 15:05:24,835] INFO: Initiating epoch #15 train run on device rank=0 [2024-08-26 15:05:26,608] INFO: Initiating epoch #15 valid run on device rank=0 [2024-08-26 15:05:32,997] INFO: Rank 0: epoch=15 / 100 train_loss=29.2061 valid_loss=31.3160 stale=0 time=0.14m eta=13.7m [2024-08-26 15:05:33,734] INFO: Initiating epoch #16 train run on device rank=0 [2024-08-26 15:05:35,446] INFO: Initiating epoch #16 valid run on device rank=0 [2024-08-26 15:05:41,471] INFO: Rank 0: epoch=16 / 100 train_loss=29.0227 valid_loss=31.2721 stale=0 time=0.13m eta=13.5m [2024-08-26 15:05:42,530] INFO: 
Initiating epoch #17 train run on device rank=0 [2024-08-26 15:05:44,528] INFO: Initiating epoch #17 valid run on device rank=0 [2024-08-26 15:05:50,585] INFO: Rank 0: epoch=17 / 100 train_loss=28.8187 valid_loss=31.2313 stale=0 time=0.13m eta=13.3m [2024-08-26 15:05:51,432] INFO: Initiating epoch #18 train run on device rank=0 [2024-08-26 15:05:53,110] INFO: Initiating epoch #18 valid run on device rank=0 [2024-08-26 15:06:00,099] INFO: Rank 0: epoch=18 / 100 train_loss=28.6349 valid_loss=31.1447 stale=0 time=0.14m eta=13.1m [2024-08-26 15:06:00,922] INFO: Initiating epoch #19 train run on device rank=0 [2024-08-26 15:06:02,822] INFO: Initiating epoch #19 valid run on device rank=0 [2024-08-26 15:06:08,995] INFO: Rank 0: epoch=19 / 100 train_loss=28.4362 valid_loss=31.0621 stale=0 time=0.13m eta=12.9m [2024-08-26 15:06:09,842] INFO: Initiating epoch #20 train run on device rank=0 [2024-08-26 15:06:11,466] INFO: Initiating epoch #20 valid run on device rank=0 [2024-08-26 15:06:15,724] INFO: Rank 0: epoch=20 / 100 train_loss=28.2258 valid_loss=31.1253 stale=1 time=0.1m eta=12.5m [2024-08-26 15:06:16,648] INFO: Initiating epoch #21 train run on device rank=0 [2024-08-26 15:06:18,472] INFO: Initiating epoch #21 valid run on device rank=0 [2024-08-26 15:06:25,326] INFO: Rank 0: epoch=21 / 100 train_loss=28.0113 valid_loss=31.0146 stale=0 time=0.14m eta=12.4m [2024-08-26 15:06:26,107] INFO: Initiating epoch #22 train run on device rank=0 [2024-08-26 15:06:28,060] INFO: Initiating epoch #22 valid run on device rank=0 [2024-08-26 15:06:36,030] INFO: Rank 0: epoch=22 / 100 train_loss=27.7994 valid_loss=31.0020 stale=0 time=0.17m eta=12.3m [2024-08-26 15:06:37,107] INFO: Initiating epoch #23 train run on device rank=0 [2024-08-26 15:06:38,856] INFO: Initiating epoch #23 valid run on device rank=0 [2024-08-26 15:06:46,197] INFO: Rank 0: epoch=23 / 100 train_loss=27.6187 valid_loss=30.9900 stale=0 time=0.15m eta=12.2m [2024-08-26 15:06:47,728] INFO: Initiating epoch #24 train 
run on device rank=0 [2024-08-26 15:06:49,379] INFO: Initiating epoch #24 valid run on device rank=0 [2024-08-26 15:06:54,020] INFO: Rank 0: epoch=24 / 100 train_loss=27.4387 valid_loss=31.1153 stale=1 time=0.1m eta=11.9m [2024-08-26 15:06:54,651] INFO: Initiating epoch #25 train run on device rank=0 [2024-08-26 15:06:56,480] INFO: Initiating epoch #25 valid run on device rank=0 [2024-08-26 15:07:02,372] INFO: Rank 0: epoch=25 / 100 train_loss=27.2824 valid_loss=31.0104 stale=2 time=0.13m eta=11.7m [2024-08-26 15:07:02,872] INFO: Initiating epoch #26 train run on device rank=0 [2024-08-26 15:07:04,604] INFO: Initiating epoch #26 valid run on device rank=0 [2024-08-26 15:07:10,978] INFO: Rank 0: epoch=26 / 100 train_loss=27.1666 valid_loss=30.9199 stale=0 time=0.14m eta=11.5m [2024-08-26 15:07:11,703] INFO: Initiating epoch #27 train run on device rank=0 [2024-08-26 15:07:13,380] INFO: Initiating epoch #27 valid run on device rank=0 [2024-08-26 15:07:17,641] INFO: Rank 0: epoch=27 / 100 train_loss=26.9703 valid_loss=30.9646 stale=1 time=0.1m eta=11.3m [2024-08-26 15:07:18,783] INFO: Initiating epoch #28 train run on device rank=0 [2024-08-26 15:07:20,595] INFO: Initiating epoch #28 valid run on device rank=0 [2024-08-26 15:07:25,465] INFO: Rank 0: epoch=28 / 100 train_loss=26.7631 valid_loss=31.0196 stale=2 time=0.11m eta=11.0m [2024-08-26 15:07:26,271] INFO: Initiating epoch #29 train run on device rank=0 [2024-08-26 15:07:28,366] INFO: Initiating epoch #29 valid run on device rank=0 [2024-08-26 15:07:34,496] INFO: Rank 0: epoch=29 / 100 train_loss=26.5628 valid_loss=31.3096 stale=3 time=0.14m eta=10.9m [2024-08-26 15:07:35,434] INFO: Initiating epoch #30 train run on device rank=0 [2024-08-26 15:07:37,338] INFO: Initiating epoch #30 valid run on device rank=0 [2024-08-26 15:07:42,418] INFO: Rank 0: epoch=30 / 100 train_loss=26.4039 valid_loss=32.0585 stale=4 time=0.12m eta=10.7m [2024-08-26 15:07:43,277] INFO: Initiating epoch #31 train run on device rank=0 
[2024-08-26 15:07:45,114] INFO: Initiating epoch #31 valid run on device rank=0 [2024-08-26 15:07:49,529] INFO: Rank 0: epoch=31 / 100 train_loss=26.4436 valid_loss=33.3935 stale=5 time=0.1m eta=10.5m [2024-08-26 15:07:50,498] INFO: Initiating epoch #32 train run on device rank=0 [2024-08-26 15:07:52,469] INFO: Initiating epoch #32 valid run on device rank=0 [2024-08-26 15:07:57,316] INFO: Rank 0: epoch=32 / 100 train_loss=27.2699 valid_loss=31.8991 stale=6 time=0.11m eta=10.3m [2024-08-26 15:07:58,252] INFO: Initiating epoch #33 train run on device rank=0 [2024-08-26 15:08:00,347] INFO: Initiating epoch #33 valid run on device rank=0 [2024-08-26 15:08:04,478] INFO: Rank 0: epoch=33 / 100 train_loss=27.4846 valid_loss=32.9519 stale=7 time=0.1m eta=10.0m [2024-08-26 15:08:05,342] INFO: Initiating epoch #34 train run on device rank=0 [2024-08-26 15:08:07,036] INFO: Initiating epoch #34 valid run on device rank=0 [2024-08-26 15:08:12,412] INFO: Rank 0: epoch=34 / 100 train_loss=27.4992 valid_loss=32.3455 stale=8 time=0.12m eta=9.9m [2024-08-26 15:08:13,253] INFO: Initiating epoch #35 train run on device rank=0 [2024-08-26 15:08:15,113] INFO: Initiating epoch #35 valid run on device rank=0 [2024-08-26 15:08:20,619] INFO: Rank 0: epoch=35 / 100 train_loss=26.8562 valid_loss=31.1199 stale=9 time=0.12m eta=9.7m [2024-08-26 15:08:21,503] INFO: Initiating epoch #36 train run on device rank=0 [2024-08-26 15:08:23,197] INFO: Initiating epoch #36 valid run on device rank=0 [2024-08-26 15:08:28,078] INFO: Rank 0: epoch=36 / 100 train_loss=26.2047 valid_loss=30.9366 stale=10 time=0.11m eta=9.5m [2024-08-26 15:08:28,928] INFO: Initiating epoch #37 train run on device rank=0 [2024-08-26 15:08:30,701] INFO: Initiating epoch #37 valid run on device rank=0 [2024-08-26 15:08:37,398] INFO: Rank 0: epoch=37 / 100 train_loss=25.9093 valid_loss=30.9108 stale=0 time=0.14m eta=9.4m [2024-08-26 15:08:38,471] INFO: Initiating epoch #38 train run on device rank=0 [2024-08-26 15:08:40,121] 
INFO: Initiating epoch #38 valid run on device rank=0 [2024-08-26 15:08:45,187] INFO: Rank 0: epoch=38 / 100 train_loss=25.5748 valid_loss=31.1037 stale=1 time=0.11m eta=9.2m [2024-08-26 15:08:45,920] INFO: Initiating epoch #39 train run on device rank=0 [2024-08-26 15:08:47,799] INFO: Initiating epoch #39 valid run on device rank=0 [2024-08-26 15:08:52,335] INFO: Rank 0: epoch=39 / 100 train_loss=25.2763 valid_loss=31.4697 stale=2 time=0.11m eta=9.0m [2024-08-26 15:08:53,094] INFO: Initiating epoch #40 train run on device rank=0 [2024-08-26 15:08:54,988] INFO: Initiating epoch #40 valid run on device rank=0 [2024-08-26 15:08:59,559] INFO: Rank 0: epoch=40 / 100 train_loss=25.1711 valid_loss=31.1502 stale=3 time=0.11m eta=8.8m [2024-08-26 15:09:00,756] INFO: Initiating epoch #41 train run on device rank=0 [2024-08-26 15:09:02,604] INFO: Initiating epoch #41 valid run on device rank=0 [2024-08-26 15:09:08,330] INFO: Rank 0: epoch=41 / 100 train_loss=24.9489 valid_loss=31.9435 stale=4 time=0.13m eta=8.6m [2024-08-26 15:09:09,338] INFO: Initiating epoch #42 train run on device rank=0 [2024-08-26 15:09:11,263] INFO: Initiating epoch #42 valid run on device rank=0 [2024-08-26 15:09:16,259] INFO: Rank 0: epoch=42 / 100 train_loss=24.8161 valid_loss=32.1592 stale=5 time=0.12m eta=8.5m [2024-08-26 15:09:17,342] INFO: Initiating epoch #43 train run on device rank=0 [2024-08-26 15:09:19,239] INFO: Initiating epoch #43 valid run on device rank=0 [2024-08-26 15:09:23,519] INFO: Rank 0: epoch=43 / 100 train_loss=25.1915 valid_loss=31.2173 stale=6 time=0.1m eta=8.3m [2024-08-26 15:09:24,218] INFO: Initiating epoch #44 train run on device rank=0 [2024-08-26 15:09:26,464] INFO: Initiating epoch #44 valid run on device rank=0 [2024-08-26 15:09:32,218] INFO: Rank 0: epoch=44 / 100 train_loss=25.0322 valid_loss=32.1644 stale=7 time=0.13m eta=8.2m [2024-08-26 15:09:32,927] INFO: Initiating epoch #45 train run on device rank=0 [2024-08-26 15:09:34,789] INFO: Initiating epoch #45 valid 
run on device rank=0 [2024-08-26 15:09:39,222] INFO: Rank 0: epoch=45 / 100 train_loss=25.1832 valid_loss=32.3233 stale=8 time=0.1m eta=8.0m [2024-08-26 15:09:40,773] INFO: Initiating epoch #46 train run on device rank=0 [2024-08-26 15:09:42,591] INFO: Initiating epoch #46 valid run on device rank=0 [2024-08-26 15:09:47,578] INFO: Rank 0: epoch=46 / 100 train_loss=24.6213 valid_loss=31.6528 stale=9 time=0.11m eta=7.8m [2024-08-26 15:09:48,638] INFO: Initiating epoch #47 train run on device rank=0 [2024-08-26 15:09:50,728] INFO: Initiating epoch #47 valid run on device rank=0 [2024-08-26 15:09:55,702] INFO: Rank 0: epoch=47 / 100 train_loss=24.1893 valid_loss=31.1862 stale=10 time=0.12m eta=7.7m [2024-08-26 15:09:56,783] INFO: Initiating epoch #48 train run on device rank=0 [2024-08-26 15:09:58,636] INFO: Initiating epoch #48 valid run on device rank=0 [2024-08-26 15:10:04,768] INFO: Rank 0: epoch=48 / 100 train_loss=23.7276 valid_loss=32.0799 stale=11 time=0.13m eta=7.5m [2024-08-26 15:10:05,785] INFO: Initiating epoch #49 train run on device rank=0 [2024-08-26 15:10:07,520] INFO: Initiating epoch #49 valid run on device rank=0 [2024-08-26 15:10:12,370] INFO: Rank 0: epoch=49 / 100 train_loss=23.1606 valid_loss=31.7612 stale=12 time=0.11m eta=7.4m [2024-08-26 15:10:13,210] INFO: Initiating epoch #50 train run on device rank=0 [2024-08-26 15:10:15,003] INFO: Initiating epoch #50 valid run on device rank=0 [2024-08-26 15:10:20,852] INFO: Rank 0: epoch=50 / 100 train_loss=22.8223 valid_loss=31.9025 stale=13 time=0.13m eta=7.2m [2024-08-26 15:10:21,695] INFO: Initiating epoch #51 train run on device rank=0 [2024-08-26 15:10:23,419] INFO: Initiating epoch #51 valid run on device rank=0 [2024-08-26 15:10:28,049] INFO: Rank 0: epoch=51 / 100 train_loss=22.4201 valid_loss=32.3084 stale=14 time=0.11m eta=7.1m [2024-08-26 15:10:28,701] INFO: Initiating epoch #52 train run on device rank=0 [2024-08-26 15:10:30,473] INFO: Initiating epoch #52 valid run on device rank=0 
[2024-08-26 15:10:36,191] INFO: Rank 0: epoch=52 / 100 train_loss=22.1397 valid_loss=32.6757 stale=15 time=0.12m eta=6.9m [2024-08-26 15:10:37,337] INFO: Initiating epoch #53 train run on device rank=0 [2024-08-26 15:10:39,045] INFO: Initiating epoch #53 valid run on device rank=0 [2024-08-26 15:10:44,354] INFO: Rank 0: epoch=53 / 100 train_loss=22.0871 valid_loss=33.2177 stale=16 time=0.12m eta=6.7m [2024-08-26 15:10:45,274] INFO: Initiating epoch #54 train run on device rank=0 [2024-08-26 15:10:46,811] INFO: Initiating epoch #54 valid run on device rank=0 [2024-08-26 15:10:51,647] INFO: Rank 0: epoch=54 / 100 train_loss=22.3840 valid_loss=33.6819 stale=17 time=0.11m eta=6.6m [2024-08-26 15:10:52,919] INFO: Initiating epoch #55 train run on device rank=0 [2024-08-26 15:10:54,530] INFO: Initiating epoch #55 valid run on device rank=0 [2024-08-26 15:10:58,813] INFO: Rank 0: epoch=55 / 100 train_loss=22.6161 valid_loss=34.0176 stale=18 time=0.1m eta=6.4m [2024-08-26 15:11:00,695] INFO: Initiating epoch #56 train run on device rank=0 [2024-08-26 15:11:02,217] INFO: Initiating epoch #56 valid run on device rank=0 [2024-08-26 15:11:07,331] INFO: Rank 0: epoch=56 / 100 train_loss=23.4618 valid_loss=34.3613 stale=19 time=0.11m eta=6.3m [2024-08-26 15:11:08,147] INFO: Initiating epoch #57 train run on device rank=0 [2024-08-26 15:11:09,757] INFO: Initiating epoch #57 valid run on device rank=0 [2024-08-26 15:11:15,517] INFO: Rank 0: epoch=57 / 100 train_loss=23.1883 valid_loss=32.4705 stale=20 time=0.12m eta=6.1m [2024-08-26 15:11:16,433] INFO: Initiating epoch #58 train run on device rank=0 [2024-08-26 15:11:17,998] INFO: Initiating epoch #58 valid run on device rank=0 [2024-08-26 15:11:23,005] INFO: Done with training. Total training time on device 0 is 8.255min
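
Note on reading the per-epoch entries above: the lowest valid_loss in this run is 30.9108 at epoch 37, after which stale counts up to 20 and training ends during epoch 58. That pattern is consistent with patience-based early stopping, although the log does not state the patience value. The following is a minimal sketch, not part of the training code, of how the "Rank 0: epoch=..." entries could be parsed to find the best epoch. The names `EPOCH_RE`, `parse_epochs`, and `best_epoch` are hypothetical helpers introduced here for illustration, and the sample string is copied from three entries of the log.

```python
import re

# Pattern for the per-epoch summary entries, e.g.
# "epoch=20 / 100 train_loss=28.2258 valid_loss=31.1253 stale=1"
# (hypothetical helper; the field layout is taken from the log above)
EPOCH_RE = re.compile(
    r"epoch=(\d+) / (\d+) train_loss=([\d.]+) valid_loss=([\d.]+) stale=(\d+)"
)


def parse_epochs(log_text):
    """Return a list of dicts, one per epoch summary found in log_text."""
    # Collapse all whitespace so entries wrapped across lines still match.
    text = " ".join(log_text.split())
    rows = []
    for m in EPOCH_RE.finditer(text):
        rows.append(
            {
                "epoch": int(m.group(1)),
                "train_loss": float(m.group(3)),
                "valid_loss": float(m.group(4)),
                "stale": int(m.group(5)),
            }
        )
    return rows


def best_epoch(rows):
    """Return the row with the lowest validation loss."""
    return min(rows, key=lambda r: r["valid_loss"])


# Three entries copied from the log, including one wrapped mid-entry.
sample = (
    "[2024-08-26 15:08:28,078] INFO: Rank 0: epoch=36 / 100 "
    "train_loss=26.2047 valid_loss=30.9366\n"
    "stale=10 time=0.11m eta=9.5m "
    "[2024-08-26 15:08:37,398] INFO: Rank 0: epoch=37 / 100 "
    "train_loss=25.9093 valid_loss=30.9108 stale=0 time=0.14m eta=9.4m "
    "[2024-08-26 15:08:45,187] INFO: Rank 0: epoch=38 / 100 "
    "train_loss=25.5748 valid_loss=31.1037 stale=1 time=0.11m eta=9.2m"
)

rows = parse_epochs(sample)
best = best_epoch(rows)
print(best["epoch"], best["valid_loss"])  # prints: 37 30.9108
```

Passing the full log text to `parse_epochs` instead of `sample` would give one row per completed epoch (1 through 57 here), which can then be plotted or compared against the saved checkpoint.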