[2024-08-26 15:11:05,547] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 2 gpus
[2024-08-26 15:11:05,592] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 15:11:05,593] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 15:11:11,082] INFO: using dtype=torch.float32
[2024-08-26 15:11:11,803] INFO: model_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'efficient'}
[2024-08-26 15:11:11,820] INFO: using attention_type=math
[2024-08-26 15:11:11,831] INFO: using attention_type=math
[2024-08-26 15:11:11,841] INFO: using attention_type=math
[2024-08-26 15:11:11,852] INFO: using attention_type=math
[2024-08-26 15:11:11,863] INFO: using attention_type=math
[2024-08-26 15:11:11,874] INFO: using attention_type=math
[2024-08-26 15:11:17,761] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_8GTX/best_weights.pth
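The first line reports that the model is wrapped in torch.nn.parallel.DistributedDataParallel() across the two GeForce RTX 2080 Ti cards, and pre-trained weights are then restored from best_weights.pth. A minimal sketch of that kind of setup, using only standard PyTorch calls; the stand-in module, the launcher environment variables, and the checkpoint handling are assumptions for illustration, not details taken from this code base:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_model(checkpoint_path):
        # One process per GPU, launched e.g. with torchrun; LOCAL_RANK is set by the launcher.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Stand-in for the actual MLPF network printed below (17 input features, 6 classes).
        model = torch.nn.Linear(17, 6).to(local_rank)

        # Restore pre-trained weights, as in the "Loaded model weights" line above.
        state_dict = torch.load(checkpoint_path, map_location=f"cuda:{local_rank}")
        model.load_state_dict(state_dict)

        # Replicate across GPUs; gradients are all-reduced automatically during backward().
        return DDP(model, device_ids=[local_rank])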
[2024-08-26 15:11:18,850] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-26 15:11:18,850] INFO: Trainable parameters: 11671568
[2024-08-26 15:11:18,851] INFO: Non-trainable parameters: 0
[2024-08-26 15:11:18,851] INFO: Total parameters: 11671568
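The trainable and non-trainable totals above can be reproduced for any torch.nn.Module by summing over its parameters; a minimal sketch, where `model` is a placeholder for the wrapped network printed above:

    import torch

    def count_parameters(model: torch.nn.Module):
        # Per-parameter breakdown, matching the per-module table below.
        rows = [(name, p.numel(), p.requires_grad) for name, p in model.named_parameters()]
        trainable = sum(n for _, n, grad in rows if grad)
        frozen = sum(n for _, n, grad in rows if not grad)
        return rows, trainable, frozen

    # Example with a stand-in module: a 17->512 linear layer has a 17*512 = 8704 weight
    # matrix plus a 512-element bias, i.e. 9216 trainable parameters in total.
    rows, trainable, frozen = count_parameters(torch.nn.Linear(17, 512))
    print(f"Trainable parameters: {trainable}, non-trainable parameters: {frozen}")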
[2024-08-26 15:11:18,854] INFO: Modules | Trainable parameters | Non-trainable parameters
module.nn0_id.0.weight 8704 0
module.nn0_id.0.bias 512 0
module.nn0_id.2.weight 512 0
module.nn0_id.2.bias 512 0
module.nn0_id.4.weight 262144 0
module.nn0_id.4.bias 512 0
module.nn0_reg.0.weight 8704 0
module.nn0_reg.0.bias 512 0
module.nn0_reg.2.weight 512 0
module.nn0_reg.2.bias 512 0
module.nn0_reg.4.weight 262144 0
module.nn0_reg.4.bias 512 0
module.conv_id.0.mha.in_proj_weight 786432 0
module.conv_id.0.mha.in_proj_bias 1536 0
module.conv_id.0.mha.out_proj.weight 262144 0
module.conv_id.0.mha.out_proj.bias 512 0
module.conv_id.0.norm0.weight 512 0
module.conv_id.0.norm0.bias 512 0
module.conv_id.0.norm1.weight 512 0
module.conv_id.0.norm1.bias 512 0
module.conv_id.0.seq.0.weight 262144 0
module.conv_id.0.seq.0.bias 512 0
module.conv_id.0.seq.2.weight 262144 0
module.conv_id.0.seq.2.bias 512 0
module.conv_id.1.mha.in_proj_weight 786432 0
module.conv_id.1.mha.in_proj_bias 1536 0
module.conv_id.1.mha.out_proj.weight 262144 0
module.conv_id.1.mha.out_proj.bias 512 0
module.conv_id.1.norm0.weight 512 0
module.conv_id.1.norm0.bias 512 0
module.conv_id.1.norm1.weight 512 0
module.conv_id.1.norm1.bias 512 0
module.conv_id.1.seq.0.weight 262144 0
module.conv_id.1.seq.0.bias 512 0
module.conv_id.1.seq.2.weight 262144 0
module.conv_id.1.seq.2.bias 512 0
module.conv_id.2.mha.in_proj_weight 786432 0
module.conv_id.2.mha.in_proj_bias 1536 0
module.conv_id.2.mha.out_proj.weight 262144 0
module.conv_id.2.mha.out_proj.bias 512 0
module.conv_id.2.norm0.weight 512 0
module.conv_id.2.norm0.bias 512 0
module.conv_id.2.norm1.weight 512 0
module.conv_id.2.norm1.bias 512 0
module.conv_id.2.seq.0.weight 262144 0
module.conv_id.2.seq.0.bias 512 0
module.conv_id.2.seq.2.weight 262144 0
module.conv_id.2.seq.2.bias 512 0
module.conv_reg.0.mha.in_proj_weight 786432 0
module.conv_reg.0.mha.in_proj_bias 1536 0
module.conv_reg.0.mha.out_proj.weight 262144 0
module.conv_reg.0.mha.out_proj.bias 512 0
module.conv_reg.0.norm0.weight 512 0
module.conv_reg.0.norm0.bias 512 0
module.conv_reg.0.norm1.weight 512 0
module.conv_reg.0.norm1.bias 512 0
module.conv_reg.0.seq.0.weight 262144 0
module.conv_reg.0.seq.0.bias 512 0
module.conv_reg.0.seq.2.weight 262144 0
module.conv_reg.0.seq.2.bias 512 0
module.conv_reg.1.mha.in_proj_weight 786432 0
module.conv_reg.1.mha.in_proj_bias 1536 0
module.conv_reg.1.mha.out_proj.weight 262144 0
module.conv_reg.1.mha.out_proj.bias 512 0
module.conv_reg.1.norm0.weight 512 0
module.conv_reg.1.norm0.bias 512 0
module.conv_reg.1.norm1.weight 512 0
module.conv_reg.1.norm1.bias 512 0
module.conv_reg.1.seq.0.weight 262144 0
module.conv_reg.1.seq.0.bias 512 0
module.conv_reg.1.seq.2.weight 262144 0
module.conv_reg.1.seq.2.bias 512 0
module.conv_reg.2.mha.in_proj_weight 786432 0
module.conv_reg.2.mha.in_proj_bias 1536 0
module.conv_reg.2.mha.out_proj.weight 262144 0
module.conv_reg.2.mha.out_proj.bias 512 0
module.conv_reg.2.norm0.weight 512 0
module.conv_reg.2.norm0.bias 512 0
module.conv_reg.2.norm1.weight 512 0
module.conv_reg.2.norm1.bias 512 0
module.conv_reg.2.seq.0.weight 262144 0
module.conv_reg.2.seq.0.bias 512 0
module.conv_reg.2.seq.2.weight 262144 0
module.conv_reg.2.seq.2.bias 512 0
module.nn_id.0.weight 270848 0
module.nn_id.0.bias 512 0
module.nn_id.2.weight 512 0
module.nn_id.2.bias 512 0
module.nn_id.4.weight 3072 0
module.nn_id.4.bias 6 0
module.nn_pt.nn.0.weight 273920 0
module.nn_pt.nn.0.bias 512 0
module.nn_pt.nn.2.weight 512 0
module.nn_pt.nn.2.bias 512 0
module.nn_pt.nn.4.weight 1024 0
module.nn_pt.nn.4.bias 2 0
module.nn_eta.nn.0.weight 273920 0
module.nn_eta.nn.0.bias 512 0
module.nn_eta.nn.2.weight 512 0
module.nn_eta.nn.2.bias 512 0
module.nn_eta.nn.4.weight 1024 0
module.nn_eta.nn.4.bias 2 0
module.nn_sin_phi.nn.0.weight 273920 0
module.nn_sin_phi.nn.0.bias 512 0
module.nn_sin_phi.nn.2.weight 512 0
module.nn_sin_phi.nn.2.bias 512 0
module.nn_sin_phi.nn.4.weight 1024 0
module.nn_sin_phi.nn.4.bias 2 0
module.nn_cos_phi.nn.0.weight 273920 0
module.nn_cos_phi.nn.0.bias 512 0
module.nn_cos_phi.nn.2.weight 512 0
module.nn_cos_phi.nn.2.bias 512 0
module.nn_cos_phi.nn.4.weight 1024 0
module.nn_cos_phi.nn.4.bias 2 0
module.nn_energy.nn.0.weight 273920 0
module.nn_energy.nn.0.bias 512 0
module.nn_energy.nn.2.weight 512 0
module.nn_energy.nn.2.bias 512 0
module.nn_energy.nn.4.weight 1024 0
module.nn_energy.nn.4.bias 2 0
[2024-08-26 15:11:18,855] INFO: Creating experiment dir /pfvol/experiments/Aug26_CLD_finetuned_80k_pyg-cld_20240826_151105_058461
[2024-08-26 15:11:18,855] INFO: Model directory /pfvol/experiments/Aug26_CLD_finetuned_80k_pyg-cld_20240826_151105_058461
[2024-08-26 15:11:18,873] INFO: train_dataset: cld_edm_ttbar_pf, 80000
[2024-08-26 15:11:18,906] INFO: valid_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 15:11:18,952] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-26 15:14:59,752] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-26 15:15:08,534] INFO: Rank 0: epoch=1 / 100 train_loss=25.1594 valid_loss=22.9836 stale=0 time=3.83m eta=378.8m
[2024-08-26 15:15:08,782] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-26 15:18:44,934] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-26 15:18:52,120] INFO: Rank 0: epoch=2 / 100 train_loss=21.0788 valid_loss=19.3094 stale=0 time=3.72m eta=370.1m
[2024-08-26 15:18:53,584] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-26 15:22:30,557] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-26 15:22:38,585] INFO: Rank 0: epoch=3 / 100 train_loss=15.7774 valid_loss=13.7460 stale=0 time=3.75m eta=366.2m
[2024-08-26 15:22:39,958] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-26 15:26:17,284] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-26 15:26:24,396] INFO: Rank 0: epoch=4 / 100 train_loss=13.0513 valid_loss=12.9690 stale=0 time=3.74m eta=362.2m
[2024-08-26 15:26:26,444] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-26 15:30:03,028] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-26 15:30:10,647] INFO: Rank 0: epoch=5 / 100 train_loss=12.2333 valid_loss=12.4645 stale=0 time=3.74m eta=358.4m
[2024-08-26 15:30:12,234] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-26 15:33:48,756] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-26 15:33:56,927] INFO: Rank 0: epoch=6 / 100 train_loss=11.7651 valid_loss=12.3406 stale=0 time=3.74m eta=354.6m
[2024-08-26 15:33:58,425] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-26 15:37:35,538] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-26 15:37:42,578] INFO: Rank 0: epoch=7 / 100 train_loss=11.4282 valid_loss=12.2330 stale=0 time=3.74m eta=350.7m
[2024-08-26 15:37:44,007] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-26 15:41:20,635] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-26 15:41:26,614] INFO: Rank 0: epoch=8 / 100 train_loss=11.1572 valid_loss=12.2844 stale=1 time=3.71m eta=346.5m
[2024-08-26 15:41:28,169] INFO: Initiating epoch #9 train run on device rank=0
[2024-08-26 15:45:04,874] INFO: Initiating epoch #9 valid run on device rank=0
[2024-08-26 15:45:13,711] INFO: Rank 0: epoch=9 / 100 train_loss=10.9300 valid_loss=12.3527 stale=2 time=3.76m eta=342.9m
[2024-08-26 15:45:15,808] INFO: Initiating epoch #10 train run on device rank=0
[2024-08-26 15:48:52,847] INFO: Initiating epoch #10 valid run on device rank=0
[2024-08-26 15:48:57,872] INFO: Rank 0: epoch=10 / 100 train_loss=10.7274 valid_loss=12.4872 stale=3 time=3.7m eta=338.8m
[2024-08-26 15:48:59,249] INFO: Initiating epoch #11 train run on device rank=0
[2024-08-26 15:52:36,067] INFO: Initiating epoch #11 valid run on device rank=0
[2024-08-26 15:52:41,348] INFO: Rank 0: epoch=11 / 100 train_loss=10.5379 valid_loss=12.5075 stale=4 time=3.7m eta=334.7m
[2024-08-26 15:52:43,287] INFO: Initiating epoch #12 train run on device rank=0
[2024-08-26 15:56:20,453] INFO: Initiating epoch #12 valid run on device rank=0
[2024-08-26 15:56:25,194] INFO: Rank 0: epoch=12 / 100 train_loss=10.3637 valid_loss=12.6664 stale=5 time=3.7m eta=330.8m
[2024-08-26 15:56:26,669] INFO: Initiating epoch #13 train run on device rank=0
[2024-08-26 16:00:03,671] INFO: Initiating epoch #13 valid run on device rank=0
[2024-08-26 16:00:08,202] INFO: Rank 0: epoch=13 / 100 train_loss=10.2039 valid_loss=12.8772 stale=6 time=3.69m eta=326.7m
[2024-08-26 16:00:09,756] INFO: Initiating epoch #14 train run on device rank=0
[2024-08-26 16:03:46,000] INFO: Initiating epoch #14 valid run on device rank=0
[2024-08-26 16:03:51,139] INFO: Rank 0: epoch=14 / 100 train_loss=10.0588 valid_loss=13.0197 stale=7 time=3.69m eta=322.7m
[2024-08-26 16:03:53,071] INFO: Initiating epoch #15 train run on device rank=0
[2024-08-26 16:07:29,789] INFO: Initiating epoch #15 valid run on device rank=0
[2024-08-26 16:07:35,714] INFO: Rank 0: epoch=15 / 100 train_loss=9.9135 valid_loss=13.0964 stale=8 time=3.71m eta=318.9m
[2024-08-26 16:07:37,817] INFO: Initiating epoch #16 train run on device rank=0
[2024-08-26 16:11:15,669] INFO: Initiating epoch #16 valid run on device rank=0
[2024-08-26 16:11:22,237] INFO: Rank 0: epoch=16 / 100 train_loss=9.7612 valid_loss=13.3827 stale=9 time=3.74m eta=315.3m
[2024-08-26 16:11:23,689] INFO: Initiating epoch #17 train run on device rank=0
[2024-08-26 16:15:03,472] INFO: Initiating epoch #17 valid run on device rank=0
[2024-08-26 16:15:09,508] INFO: Rank 0: epoch=17 / 100 train_loss=9.6255 valid_loss=13.6899 stale=10 time=3.76m eta=311.7m
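From epoch 8 onward the validation loss no longer improves while the training loss keeps falling, so the stale counter in the summaries above increments every epoch. A minimal sketch of such a patience counter; the helper callables are placeholders, and the patience of 20 is only inferred from the run stopping shortly after stale reaches 20, not read from this code base:

    def fit_with_patience(run_train_epoch, run_valid_epoch, num_epochs=100, patience=20):
        """Run up to num_epochs, stopping once the validation loss has been stale for `patience` epochs."""
        best_valid = float("inf")
        stale = 0
        for epoch in range(1, num_epochs + 1):
            train_loss = run_train_epoch()
            valid_loss = run_valid_epoch()
            if valid_loss < best_valid:
                best_valid = valid_loss   # new best: checkpoint here and reset the counter
                stale = 0
            else:
                stale += 1                # no improvement for one more epoch
            print(f"epoch={epoch} / {num_epochs} train_loss={train_loss:.4f} "
                  f"valid_loss={valid_loss:.4f} stale={stale}")
            if stale >= patience:         # give up once patience is exhausted
                break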
[2024-08-26 16:15:11,205] INFO: Initiating epoch #18 train run on device rank=0
[2024-08-26 16:18:51,198] INFO: Initiating epoch #18 valid run on device rank=0
[2024-08-26 16:18:56,842] INFO: Rank 0: epoch=18 / 100 train_loss=9.4920 valid_loss=13.8157 stale=11 time=3.76m eta=308.1m
[2024-08-26 16:18:58,032] INFO: Initiating epoch #19 train run on device rank=0
[2024-08-26 16:22:37,075] INFO: Initiating epoch #19 valid run on device rank=0
[2024-08-26 16:22:42,403] INFO: Rank 0: epoch=19 / 100 train_loss=9.3640 valid_loss=13.9649 stale=12 time=3.74m eta=304.4m
[2024-08-26 16:22:43,343] INFO: Initiating epoch #20 train run on device rank=0
[2024-08-26 16:26:23,721] INFO: Initiating epoch #20 valid run on device rank=0
[2024-08-26 16:26:30,204] INFO: Rank 0: epoch=20 / 100 train_loss=9.2525 valid_loss=14.2556 stale=13 time=3.78m eta=300.8m
[2024-08-26 16:26:31,664] INFO: Initiating epoch #21 train run on device rank=0
[2024-08-26 16:30:11,477] INFO: Initiating epoch #21 valid run on device rank=0
[2024-08-26 16:30:17,210] INFO: Rank 0: epoch=21 / 100 train_loss=9.1242 valid_loss=14.2562 stale=14 time=3.76m eta=297.1m
[2024-08-26 16:30:18,437] INFO: Initiating epoch #22 train run on device rank=0
[2024-08-26 16:33:55,056] INFO: Initiating epoch #22 valid run on device rank=0
[2024-08-26 16:34:00,043] INFO: Rank 0: epoch=22 / 100 train_loss=9.0086 valid_loss=14.3893 stale=15 time=3.69m eta=293.2m
[2024-08-26 16:34:01,146] INFO: Initiating epoch #23 train run on device rank=0
[2024-08-26 16:37:38,624] INFO: Initiating epoch #23 valid run on device rank=0
[2024-08-26 16:37:43,499] INFO: Rank 0: epoch=23 / 100 train_loss=8.8946 valid_loss=14.7135 stale=16 time=3.71m eta=289.3m
[2024-08-26 16:37:44,819] INFO: Initiating epoch #24 train run on device rank=0
[2024-08-26 16:41:22,077] INFO: Initiating epoch #24 valid run on device rank=0
[2024-08-26 16:41:26,664] INFO: Rank 0: epoch=24 / 100 train_loss=8.7869 valid_loss=15.0381 stale=17 time=3.7m eta=285.4m
[2024-08-26 16:41:27,826] INFO: Initiating epoch #25 train run on device rank=0
[2024-08-26 16:45:03,582] INFO: Initiating epoch #25 valid run on device rank=0
[2024-08-26 16:45:08,085] INFO: Rank 0: epoch=25 / 100 train_loss=8.6827 valid_loss=15.1873 stale=18 time=3.67m eta=281.5m
[2024-08-26 16:45:09,308] INFO: Initiating epoch #26 train run on device rank=0
[2024-08-26 16:48:44,433] INFO: Initiating epoch #26 valid run on device rank=0
[2024-08-26 16:48:49,240] INFO: Rank 0: epoch=26 / 100 train_loss=8.5839 valid_loss=15.6001 stale=19 time=3.67m eta=277.5m
[2024-08-26 16:48:50,954] INFO: Initiating epoch #27 train run on device rank=0
[2024-08-26 16:52:25,453] INFO: Initiating epoch #27 valid run on device rank=0
[2024-08-26 16:52:31,121] INFO: Rank 0: epoch=27 / 100 train_loss=8.4844 valid_loss=15.6428 stale=20 time=3.67m eta=273.6m
[2024-08-26 16:52:32,055] INFO: Initiating epoch #28 train run on device rank=0
[2024-08-26 16:56:06,316] INFO: Initiating epoch #28 valid run on device rank=0
[2024-08-26 16:56:11,938] INFO: Done with training. Total training time on device 0 is 104.883min
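Since the per-epoch summaries follow a fixed format, the loss curves can be recovered from a log like this one with a small parser; a sketch, where the log file path is a placeholder:

    import re

    # Matches summary lines such as:
    #   "INFO: Rank 0: epoch=5 / 100 train_loss=12.2333 valid_loss=12.4645 stale=0 time=3.74m eta=358.4m"
    SUMMARY = re.compile(r"epoch=(\d+) / \d+ train_loss=([\d.]+) valid_loss=([\d.]+) stale=(\d+)")

    def parse_epoch_summaries(path):
        records = []
        with open(path) as log:
            for line in log:
                match = SUMMARY.search(line)
                if match:
                    epoch, stale = int(match.group(1)), int(match.group(4))
                    train_loss, valid_loss = float(match.group(2)), float(match.group(3))
                    records.append((epoch, train_loss, valid_loss, stale))
        return records

    # e.g. parse_epoch_summaries("train.log") -> [(1, 25.1594, 22.9836, 0), (2, 21.0788, 19.3094, 0), ...]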