[2024-08-30 10:25:43,755] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus
[2024-08-30 10:25:43,858] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 10:25:43,858] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 10:25:43,858] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 10:25:43,858] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 10:25:52,529] INFO: configured dtype=torch.float32 for autocast
[2024-08-30 10:25:53,327] INFO: using attention_type=math
[2024-08-30 10:25:53,352] INFO: using attention_type=math
[2024-08-30 10:25:53,376] INFO: using attention_type=math
[2024-08-30 10:25:53,400] INFO: using attention_type=math
[2024-08-30 10:25:53,425] INFO: using attention_type=math
[2024-08-30 10:25:53,451] INFO: using attention_type=math
[2024-08-30 10:25:55,784] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_binary_particle): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=2, bias=True)
    )
    (nn_pid): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-30 10:25:55,785] INFO: Trainable parameters: 11950098
[2024-08-30 10:25:55,785] INFO: Non-trainable parameters: 0
[2024-08-30 10:25:55,785] INFO: Total parameters: 11950098
[2024-08-30 10:25:55,791] INFO: Modules Trainable parameters Non-trainable parameters
module.nn0_id.0.weight 8704 0
module.nn0_id.0.bias 512 0
module.nn0_id.2.weight 512 0
module.nn0_id.2.bias 512 0
module.nn0_id.4.weight 262144 0
module.nn0_id.4.bias 512 0
module.nn0_reg.0.weight 8704 0
module.nn0_reg.0.bias 512 0
module.nn0_reg.2.weight 512 0
module.nn0_reg.2.bias 512 0
module.nn0_reg.4.weight 262144 0
module.nn0_reg.4.bias 512 0
module.conv_id.0.mha.in_proj_weight 786432 0
module.conv_id.0.mha.in_proj_bias 1536 0
module.conv_id.0.mha.out_proj.weight 262144 0
module.conv_id.0.mha.out_proj.bias 512 0
module.conv_id.0.norm0.weight 512 0
module.conv_id.0.norm0.bias 512 0
module.conv_id.0.norm1.weight 512 0
module.conv_id.0.norm1.bias 512 0
module.conv_id.0.seq.0.weight 262144 0
module.conv_id.0.seq.0.bias 512 0
module.conv_id.0.seq.2.weight 262144 0
module.conv_id.0.seq.2.bias 512 0
module.conv_id.1.mha.in_proj_weight 786432 0
module.conv_id.1.mha.in_proj_bias 1536 0
module.conv_id.1.mha.out_proj.weight 262144 0
module.conv_id.1.mha.out_proj.bias 512 0
module.conv_id.1.norm0.weight 512 0
module.conv_id.1.norm0.bias 512 0
module.conv_id.1.norm1.weight 512 0
module.conv_id.1.norm1.bias 512 0
module.conv_id.1.seq.0.weight 262144 0
module.conv_id.1.seq.0.bias 512 0
module.conv_id.1.seq.2.weight 262144 0
module.conv_id.1.seq.2.bias 512 0
module.conv_id.2.mha.in_proj_weight 786432 0
module.conv_id.2.mha.in_proj_bias 1536 0
module.conv_id.2.mha.out_proj.weight 262144 0
module.conv_id.2.mha.out_proj.bias 512 0
module.conv_id.2.norm0.weight 512 0
module.conv_id.2.norm0.bias 512 0
module.conv_id.2.norm1.weight 512 0
module.conv_id.2.norm1.bias 512 0
module.conv_id.2.seq.0.weight 262144 0
module.conv_id.2.seq.0.bias 512 0
module.conv_id.2.seq.2.weight 262144 0
module.conv_id.2.seq.2.bias 512 0
module.conv_reg.0.mha.in_proj_weight 786432 0
module.conv_reg.0.mha.in_proj_bias 1536 0
module.conv_reg.0.mha.out_proj.weight 262144 0
module.conv_reg.0.mha.out_proj.bias 512 0
module.conv_reg.0.norm0.weight 512 0
module.conv_reg.0.norm0.bias 512 0
module.conv_reg.0.norm1.weight 512 0
module.conv_reg.0.norm1.bias 512 0
module.conv_reg.0.seq.0.weight 262144 0
module.conv_reg.0.seq.0.bias 512 0
module.conv_reg.0.seq.2.weight 262144 0
module.conv_reg.0.seq.2.bias 512 0
module.conv_reg.1.mha.in_proj_weight 786432 0
module.conv_reg.1.mha.in_proj_bias 1536 0
module.conv_reg.1.mha.out_proj.weight 262144 0
module.conv_reg.1.mha.out_proj.bias 512 0
module.conv_reg.1.norm0.weight 512 0
module.conv_reg.1.norm0.bias 512 0
module.conv_reg.1.norm1.weight 512 0
module.conv_reg.1.norm1.bias 512 0
module.conv_reg.1.seq.0.weight 262144 0
module.conv_reg.1.seq.0.bias 512 0
module.conv_reg.1.seq.2.weight 262144 0
module.conv_reg.1.seq.2.bias 512 0
module.conv_reg.2.mha.in_proj_weight 786432 0
module.conv_reg.2.mha.in_proj_bias 1536 0
module.conv_reg.2.mha.out_proj.weight 262144 0
module.conv_reg.2.mha.out_proj.bias 512 0
module.conv_reg.2.norm0.weight 512 0
module.conv_reg.2.norm0.bias 512 0
module.conv_reg.2.norm1.weight 512 0
module.conv_reg.2.norm1.bias 512 0
module.conv_reg.2.seq.0.weight 262144 0
module.conv_reg.2.seq.0.bias 512 0
module.conv_reg.2.seq.2.weight 262144 0
module.conv_reg.2.seq.2.bias 512 0
module.nn_binary_particle.0.weight 270848 0
module.nn_binary_particle.0.bias 512 0
module.nn_binary_particle.2.weight 512 0
module.nn_binary_particle.2.bias 512 0
module.nn_binary_particle.4.weight 1024 0
module.nn_binary_particle.4.bias 2 0
module.nn_pid.0.weight 270848 0
module.nn_pid.0.bias 512 0
module.nn_pid.2.weight 512 0
module.nn_pid.2.bias 512 0
module.nn_pid.4.weight 3072 0
module.nn_pid.4.bias 6 0
module.nn_pt.nn.0.weight 274944 0
module.nn_pt.nn.0.bias 512 0
module.nn_pt.nn.2.weight 512 0
module.nn_pt.nn.2.bias 512 0
module.nn_pt.nn.4.weight 1024 0
module.nn_pt.nn.4.bias 2 0
module.nn_eta.nn.0.weight 274944 0
module.nn_eta.nn.0.bias 512 0
module.nn_eta.nn.2.weight 512 0
module.nn_eta.nn.2.bias 512 0
module.nn_eta.nn.4.weight 1024 0
module.nn_eta.nn.4.bias 2 0
module.nn_sin_phi.nn.0.weight 274944 0
module.nn_sin_phi.nn.0.bias 512 0
module.nn_sin_phi.nn.2.weight 512 0
module.nn_sin_phi.nn.2.bias 512 0
module.nn_sin_phi.nn.4.weight 1024 0
module.nn_sin_phi.nn.4.bias 2 0
module.nn_cos_phi.nn.0.weight 274944 0
module.nn_cos_phi.nn.0.bias 512 0
module.nn_cos_phi.nn.2.weight 512 0
module.nn_cos_phi.nn.2.bias 512 0
module.nn_cos_phi.nn.4.weight 1024 0
module.nn_cos_phi.nn.4.bias 2 0
module.nn_energy.nn.0.weight 274944 0
module.nn_energy.nn.0.bias 512 0
module.nn_energy.nn.2.weight 512 0
module.nn_energy.nn.2.bias 512 0
module.nn_energy.nn.4.weight 1024 0
module.nn_energy.nn.4.bias 2 0
[2024-08-30 10:25:55,806] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_4GTX_pyg-clic_20240830_102543_631333
[2024-08-30 10:25:55,806] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_4GTX_pyg-clic_20240830_102543_631333
[2024-08-30 10:25:55,894] INFO: train_dataset: clic_edm_ttbar_pf, 2514200
[2024-08-30 10:25:55,931] INFO: train_dataset: clic_edm_qq_pf, 3075590
[2024-08-30 10:26:15,459] INFO: valid_dataset: clic_edm_ttbar_pf, 628600
[2024-08-30 10:26:15,516] INFO: valid_dataset: clic_edm_qq_pf, 768905
[2024-08-30 10:26:15,870] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-30 14:23:15,137] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-30 14:36:59,839] INFO: Rank 0: epoch=1 / 200 train_loss=10.8957 valid_loss=8.9752 stale=0 time=250.73m eta=49895.9m
[2024-08-30 14:37:01,153] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-30 18:26:49,551] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-30 18:40:37,063] INFO: Rank 0: epoch=2 / 200 train_loss=8.5875 valid_loss=8.1817 stale=0 time=243.6m eta=48941.0m
[2024-08-30 18:40:39,255] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-30 22:37:40,964] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-30 22:51:32,609] INFO: Rank 0: epoch=3 / 200 train_loss=8.0236 valid_loss=7.8175 stale=0 time=250.89m eta=48940.0m
[2024-08-30 22:51:36,388] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-31 02:53:38,425] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-31 03:07:24,587] INFO: Rank 0: epoch=4 / 200 train_loss=7.7189 valid_loss=7.6105 stale=0 time=255.8m eta=49056.1m
[2024-08-31 03:07:27,723] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-31 07:14:56,770] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-31 07:28:47,205] INFO: Rank 0: epoch=5 / 200 train_loss=7.5139 valid_loss=7.4224 stale=0 time=261.32m eta=49238.4m
[2024-08-31 07:28:50,659] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-31 11:38:38,768] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-31 11:52:29,271] INFO: Rank 0: epoch=6 / 200 train_loss=7.3417 valid_loss=7.2586 stale=0 time=263.64m eta=49347.9m
[2024-08-31 11:52:33,436] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-31 16:03:44,349] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-31 16:17:32,998] INFO: Rank 0: epoch=7 / 200 train_loss=7.1762 valid_loss=7.1009 stale=0 time=264.99m eta=49388.3m
[2024-08-31 16:17:37,337] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-31 20:26:10,746] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-31 20:40:00,746] INFO: Rank 0: epoch=8 / 200 train_loss=7.0411 valid_loss=6.9797 stale=0 time=262.39m eta=49290.0m
[2024-08-31 20:40:04,036] INFO: Initiating epoch #9 train run on device rank=0
[2024-09-01 00:54:45,964] INFO: Initiating epoch #9 valid run on device rank=0
[2024-09-01 01:08:41,290] INFO: Rank 0: epoch=9 / 200 train_loss=6.9477 valid_loss=6.9104 stale=0 time=268.62m eta=49287.0m
[2024-09-01 01:08:44,081] INFO: Initiating epoch #10 train run on device rank=0
[2024-09-01 05:14:08,262] INFO: Initiating epoch #10 valid run on device rank=0
[2024-09-01 05:27:59,000] INFO: Rank 0: epoch=10 / 200 train_loss=6.8737 valid_loss=6.8542 stale=0 time=259.25m eta=49052.7m
[2024-09-01 05:28:07,430] INFO: Initiating epoch #11 train run on device rank=0
[2024-09-01 09:37:15,236] INFO: Initiating epoch #11 valid run on device rank=0
[2024-09-01 09:51:03,355] INFO: Rank 0: epoch=11 / 200 train_loss=6.8077 valid_loss=6.7942 stale=0 time=262.93m eta=48878.7m
[2024-09-01 09:51:06,657] INFO: Initiating epoch #12 train run on device rank=0
[2024-09-01 13:59:53,243] INFO: Initiating epoch #12 valid run on device rank=0
[2024-09-01 14:13:44,404] INFO: Rank 0: epoch=12 / 200 train_loss=6.7521 valid_loss=6.7711 stale=0 time=262.63m eta=48683.8m
[2024-09-01 14:13:49,037] INFO: Initiating epoch #13 train run on device rank=0
[2024-09-01 18:23:19,511] INFO: Initiating epoch #13 valid run on device rank=0
[2024-09-01 18:37:09,099] INFO: Rank 0: epoch=13 / 200 train_loss=6.7037 valid_loss=6.7130 stale=0 time=263.33m eta=48488.9m
[2024-09-01 18:37:12,413] INFO: Initiating epoch #14 train run on device rank=0
[2024-09-01 22:47:50,297] INFO: Initiating epoch #14 valid run on device rank=0
[2024-09-01 23:01:40,344] INFO: Rank 0: epoch=14 / 200 train_loss=6.6602 valid_loss=6.7023 stale=0 time=264.47m eta=48299.0m
[2024-09-01 23:01:43,655] INFO: Initiating epoch #15 train run on device rank=0
[2024-09-02 03:09:20,506] INFO: Initiating epoch #15 valid run on device rank=0
[2024-09-02 03:23:06,359] INFO: Rank 0: epoch=15 / 200 train_loss=6.6202 valid_loss=6.6433 stale=0 time=261.38m eta=48061.0m
[2024-09-02 03:23:11,375] INFO: Initiating epoch #16 train run on device rank=0
[2024-09-02 07:34:57,054] INFO: Initiating epoch #16 valid run on device rank=0
[2024-09-02 07:48:43,248] INFO: Rank 0: epoch=16 / 200 train_loss=6.5823 valid_loss=6.6139 stale=0 time=265.53m eta=47868.3m
[2024-09-02 07:48:47,625] INFO: Initiating epoch #17 train run on device rank=0
[2024-09-02 11:59:56,230] INFO: Initiating epoch #17 valid run on device rank=0
[2024-09-02 12:13:37,928] INFO: Rank 0: epoch=17 / 200 train_loss=6.5488 valid_loss=6.5868 stale=0 time=264.84m eta=47659.3m
[2024-09-02 12:13:43,240] INFO: Initiating epoch #18 train run on device rank=0
[2024-09-02 16:26:59,128] INFO: Initiating epoch #18 valid run on device rank=0
[2024-09-02 16:40:42,371] INFO: Rank 0: epoch=18 / 200 train_loss=6.5149 valid_loss=6.5437 stale=0 time=266.99m eta=47466.0m
[2024-09-02 16:40:47,218] INFO: Initiating epoch #19 train run on device rank=0
[2024-09-02 20:54:21,051] INFO: Initiating epoch #19 valid run on device rank=0
[2024-09-02 21:08:12,008] INFO: Rank 0: epoch=19 / 200 train_loss=6.4847 valid_loss=6.5214 stale=0 time=267.41m eta=47269.0m
[2024-09-02 21:08:16,311] INFO: Initiating epoch #20 train run on device rank=0
[2024-09-03 01:23:03,364] INFO: Initiating epoch #20 valid run on device rank=0
[2024-09-03 01:36:53,361] INFO: Rank 0: epoch=20 / 200 train_loss=6.4558 valid_loss=6.5029 stale=0 time=268.62m eta=47075.6m
[2024-09-03 01:36:57,340] INFO: Initiating epoch #21 train run on device rank=0