[2024-03-14 13:35:29,916] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 8 gpus
[2024-03-14 13:35:29,950] INFO: NVIDIA A10
[2024-03-14 13:35:29,951] INFO: NVIDIA A10
[2024-03-14 13:35:29,951] INFO: NVIDIA A10
[2024-03-14 13:35:29,951] INFO: NVIDIA A10
[2024-03-14 13:35:29,951] INFO: NVIDIA A10
[2024-03-14 13:35:29,951] INFO: NVIDIA A10
[2024-03-14 13:35:29,951] INFO: NVIDIA A10
[2024-03-14 13:35:29,951] INFO: NVIDIA A10
[2024-03-14 13:35:44,652] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0): Sequential(
      (0): Linear(in_features=42, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x GravNetLayer(
        (conv1): GravNetConv(512, 512, k=16)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x GravNetLayer(
        (conv1): GravNetConv(512, 512, k=16)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=1578, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=9, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_charge): Sequential(
      (0): Linear(in_features=1587, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=3, bias=True)
    )
  )
)
[2024-03-14 13:35:44,653] INFO: Trainable parameters: 7880430
[2024-03-14 13:35:44,653] INFO: Non-trainable parameters: 0
[2024-03-14 13:35:44,653] INFO: Total parameters: 7880430
[2024-03-14 13:35:44,659] INFO:
Modules  Trainable params  Non-tranable params  Trainable Parameters  Non-tranable Parameters
module.nn0.0.weight  NaN  NaN  21504.0  -
module.nn0.0.bias  NaN  NaN  512.0  -
module.nn0.2.weight  NaN  NaN  512.0  -
module.nn0.2.bias  NaN  NaN  512.0  -
module.nn0.4.weight  NaN  NaN  262144.0  -
module.nn0.4.bias  NaN  NaN  512.0  -
module.conv_id.0.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_id.0.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_id.0.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_id.0.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_id.0.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_id.0.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_id.0.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_id.0.norm1.weight  NaN  NaN  512.0  -
module.conv_id.0.norm1.bias  NaN  NaN  512.0  -
module.conv_id.1.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_id.1.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_id.1.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_id.1.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_id.1.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_id.1.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_id.1.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_id.1.norm1.weight  NaN  NaN  512.0  -
module.conv_id.1.norm1.bias  NaN  NaN  512.0  -
module.conv_id.2.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_id.2.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_id.2.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_id.2.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_id.2.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_id.2.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_id.2.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_id.2.norm1.weight  NaN  NaN  512.0  -
module.conv_id.2.norm1.bias  NaN  NaN  512.0  -
module.conv_reg.0.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_reg.0.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_reg.0.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_reg.0.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_reg.0.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_reg.0.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_reg.0.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_reg.0.norm1.weight  NaN  NaN  512.0  -
module.conv_reg.0.norm1.bias  NaN  NaN  512.0  -
module.conv_reg.1.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_reg.1.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_reg.1.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_reg.1.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_reg.1.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_reg.1.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_reg.1.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_reg.1.norm1.weight  NaN  NaN  512.0  -
module.conv_reg.1.norm1.bias  NaN  NaN  512.0  -
module.conv_reg.2.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_reg.2.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_reg.2.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_reg.2.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_reg.2.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_reg.2.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_reg.2.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_reg.2.norm1.weight  NaN  NaN  512.0  -
module.conv_reg.2.norm1.bias  NaN  NaN  512.0  -
module.nn_id.0.weight  NaN  NaN  807936.0  -
module.nn_id.0.bias  NaN  NaN  512.0  -
module.nn_id.2.weight  NaN  NaN  512.0  -
module.nn_id.2.bias  NaN  NaN  512.0  -
module.nn_id.4.weight  NaN  NaN  4608.0  -
module.nn_id.4.bias  NaN  NaN  9.0  -
module.nn_pt.nn.0.weight  NaN  NaN  812544.0  -
module.nn_pt.nn.0.bias  NaN  NaN  512.0  -
module.nn_pt.nn.2.weight  NaN  NaN  512.0  -
module.nn_pt.nn.2.bias  NaN  NaN  512.0  -
module.nn_pt.nn.4.weight  NaN  NaN  1024.0  -
module.nn_pt.nn.4.bias  NaN  NaN  2.0  -
module.nn_eta.nn.0.weight  NaN  NaN  812544.0  -
module.nn_eta.nn.0.bias  NaN  NaN  512.0  -
module.nn_eta.nn.2.weight  NaN  NaN  512.0  -
module.nn_eta.nn.2.bias  NaN  NaN  512.0  -
module.nn_eta.nn.4.weight  NaN  NaN  1024.0  -
module.nn_eta.nn.4.bias  NaN  NaN  2.0  -
module.nn_sin_phi.nn.0.weight  NaN  NaN  812544.0  -
module.nn_sin_phi.nn.0.bias  NaN  NaN  512.0  -
module.nn_sin_phi.nn.2.weight  NaN  NaN  512.0  -
module.nn_sin_phi.nn.2.bias  NaN  NaN  512.0  -
module.nn_sin_phi.nn.4.weight  NaN  NaN  1024.0  -
module.nn_sin_phi.nn.4.bias  NaN  NaN  2.0  -
module.nn_cos_phi.nn.0.weight  NaN  NaN  812544.0  -
module.nn_cos_phi.nn.0.bias  NaN  NaN  512.0  -
module.nn_cos_phi.nn.2.weight  NaN  NaN  512.0  -
module.nn_cos_phi.nn.2.bias  NaN  NaN  512.0  -
module.nn_cos_phi.nn.4.weight  NaN  NaN  1024.0  -
module.nn_cos_phi.nn.4.bias  NaN  NaN  2.0  -
module.nn_energy.nn.0.weight  NaN  NaN  812544.0  -
module.nn_energy.nn.0.bias  NaN  NaN  512.0  -
module.nn_energy.nn.2.weight  NaN  NaN  512.0  -
module.nn_energy.nn.2.bias  NaN  NaN  512.0  -
module.nn_energy.nn.4.weight  NaN  NaN  1024.0  -
module.nn_energy.nn.4.bias  NaN  NaN  2.0  -
module.nn_charge.0.weight  NaN  NaN  812544.0  -
module.nn_charge.0.bias  NaN  NaN  512.0  -
module.nn_charge.2.weight  NaN  NaN  512.0  -
module.nn_charge.2.bias  NaN  NaN  512.0  -
module.nn_charge.4.weight  NaN  NaN  1536.0  -
module.nn_charge.4.bias  NaN  NaN  3.0  -
[2024-03-14 13:35:44,797] INFO: Creating experiment dir /pfvol/experiments/MLPF_cms_Gravnet_MET_False_8gpus_pyg-cms-small_20240314_133526_810952
[2024-03-14 13:35:44,797] INFO: Model directory /pfvol/experiments/MLPF_cms_Gravnet_MET_False_8gpus_pyg-cms-small_20240314_133526_810952
[2024-03-14 13:35:46,469] INFO: train_dataset: cms_pf_ttbar, 80000
[2024-03-14 13:35:47,705] INFO: train_dataset: cms_pf_qcd, 80000
[2024-03-14 13:35:47,736] INFO: valid_dataset: cms_pf_ttbar, 20000
[2024-03-14 13:35:47,762] INFO: valid_dataset: cms_pf_qcd, 20000
[2024-03-14 13:35:48,190] INFO: Initiating epoch #1 train run on device rank=0
[2024-03-14 15:10:50,867] INFO: Initiating epoch #1 valid run on device rank=0
[2024-03-14 15:32:13,612] INFO: Rank 0: epoch=1 / 30 train_loss=64.9347 valid_loss=62.3449 stale=0 time=116.42m eta=3376.3m
[2024-03-14 15:32:14,110] INFO: Initiating epoch #2 train run on device rank=0
[2024-03-14 17:20:31,594] INFO: Initiating epoch #2 valid run on device rank=0
[2024-03-14 17:41:05,908] INFO: Rank 0: epoch=2 / 30 train_loss=61.9078 valid_loss=61.5229 stale=0 time=128.86m eta=3434.1m
[2024-03-14 17:41:07,592] INFO: Initiating epoch #3 train run on device rank=0
[2024-03-14 19:33:44,662] INFO: Initiating epoch #3 valid run on device rank=0
[2024-03-14 19:55:28,683] INFO: Rank 0: epoch=3 / 30 train_loss=61.3540 valid_loss=61.1898 stale=0 time=134.35m eta=3417.1m
[2024-03-14 19:55:31,205] INFO: Initiating epoch #4 train run on device rank=0
[2024-03-14 21:22:23,158] INFO: Initiating epoch #4 valid run on device rank=0
[2024-03-14 21:43:38,787] INFO: Rank 0: epoch=4 / 30 train_loss=61.1288 valid_loss=61.1280 stale=0 time=108.13m eta=3171.0m
[2024-03-14 21:43:40,208] INFO: Initiating epoch #5 train run on device rank=0
[2024-03-14 23:18:57,658] INFO: Initiating epoch #5 valid run on device rank=0
[2024-03-14 23:41:41,707] INFO: Rank 0: epoch=5 / 30 train_loss=60.9828 valid_loss=60.9782 stale=0 time=118.02m eta=3029.5m
[2024-03-14 23:41:43,106] INFO: Initiating epoch #6 train run on device rank=0
[2024-03-15 01:23:29,158] INFO: Initiating epoch #6 valid run on device rank=0
[2024-03-15 01:45:34,170] INFO: Rank 0: epoch=6 / 30 train_loss=60.8480 valid_loss=60.8360 stale=0 time=123.85m eta=2919.1m
[2024-03-15 01:45:34,982] INFO: Initiating epoch #7 train run on device rank=0
[2024-03-15 03:27:59,585] INFO: Initiating epoch #7 valid run on device rank=0
[2024-03-15 03:49:16,956] INFO: Rank 0: epoch=7 / 30 train_loss=60.7483 valid_loss=60.7500 stale=0 time=123.7m eta=2804.3m
[2024-03-15 03:49:17,648] INFO: Initiating epoch #8 train run on device rank=0
[2024-03-15 05:47:58,365] INFO: Initiating epoch #8 valid run on device rank=0
[2024-03-15 06:10:39,737] INFO: Rank 0: epoch=8 / 30 train_loss=60.6691 valid_loss=60.6708 stale=0 time=141.37m eta=2735.9m
[2024-03-15 06:10:40,599] INFO: Initiating epoch #9 train run on device rank=0
[2024-03-15 08:04:10,301] INFO: Initiating epoch #9 valid run on device rank=0
[2024-03-15 08:27:30,371] INFO: Rank 0: epoch=9 / 30 train_loss=60.6116 valid_loss=60.6208 stale=0 time=136.83m eta=2640.6m
[2024-03-15 08:27:30,960] INFO: Initiating epoch #10 train run on device rank=0
[2024-03-15 10:20:33,172] INFO: Initiating epoch #10 valid run on device rank=0
[2024-03-15 10:41:19,715] INFO: Rank 0: epoch=10 / 30 train_loss=60.5633 valid_loss=60.5860 stale=0 time=133.81m eta=2531.1m
[2024-03-15 10:41:20,672] INFO: Initiating epoch #11 train run on device rank=0
[2024-03-15 12:18:56,787] INFO: Initiating epoch #11 valid run on device rank=0
[2024-03-15 12:40:33,726] INFO: Rank 0: epoch=11 / 30 train_loss=60.5231 valid_loss=60.5459 stale=0 time=119.22m eta=2391.9m
[2024-03-15 12:40:34,239] INFO: Initiating epoch #12 train run on device rank=0
[2024-03-15 14:12:05,081] INFO: Initiating epoch #12 valid run on device rank=0
[2024-03-15 14:33:51,814] INFO: Rank 0: epoch=12 / 30 train_loss=60.4871 valid_loss=60.5206 stale=0 time=113.29m eta=2247.1m
[2024-03-15 14:33:52,455] INFO: Initiating epoch #13 train run on device rank=0
[2024-03-15 16:03:28,168] INFO: Initiating epoch #13 valid run on device rank=0
[2024-03-15 16:25:40,020] INFO: Rank 0: epoch=13 / 30 train_loss=60.4541 valid_loss=60.4914 stale=0 time=111.79m eta=2105.2m
[2024-03-15 16:25:41,586] INFO: Initiating epoch #14 train run on device rank=0
[2024-03-15 18:08:38,370] INFO: Initiating epoch #14 valid run on device rank=0
[2024-03-15 18:29:08,301] INFO: Rank 0: epoch=14 / 30 train_loss=60.4258 valid_loss=60.4659 stale=0 time=123.45m eta=1981.0m
[2024-03-15 18:29:08,896] INFO: Initiating epoch #15 train run on device rank=0
[2024-03-15 19:59:38,758] INFO: Initiating epoch #15 valid run on device rank=0
[2024-03-15 20:21:02,962] INFO: Rank 0: epoch=15 / 30 train_loss=60.3995 valid_loss=60.4408 stale=0 time=111.9m eta=1845.2m
[2024-03-15 20:21:03,995] INFO: Initiating epoch #16 train run on device rank=0
[2024-03-15 22:05:33,937] INFO: Initiating epoch #16 valid run on device rank=0
[2024-03-15 22:26:50,417] INFO: Rank 0: epoch=16 / 30 train_loss=60.3753 valid_loss=60.4184 stale=0 time=125.77m eta=1724.7m
[2024-03-15 22:26:51,947] INFO: Initiating epoch #17 train run on device rank=0
[2024-03-15 23:55:38,194] INFO: Initiating epoch #17 valid run on device rank=0
[2024-03-16 00:16:13,019] INFO: Rank 0: epoch=17 / 30 train_loss=60.3526 valid_loss=60.3973 stale=0 time=109.35m eta=1590.9m
[2024-03-16 00:16:14,491] INFO: Initiating epoch #18 train run on device rank=0
[2024-03-16 01:59:19,638] INFO: Initiating epoch #18 valid run on device rank=0
[2024-03-16 02:19:47,103] INFO: Rank 0: epoch=18 / 30 train_loss=60.3319 valid_loss=60.3769 stale=0 time=123.54m eta=1469.3m
[2024-03-16 02:19:47,718] INFO: Initiating epoch #19 train run on device rank=0
[2024-03-16 03:47:00,339] INFO: Initiating epoch #19 valid run on device rank=0
[2024-03-16 04:08:46,381] INFO: Rank 0: epoch=19 / 30 train_loss=60.3106 valid_loss=60.3529 stale=0 time=108.98m eta=1339.1m
[2024-03-16 04:08:46,981] INFO: Initiating epoch #20 train run on device rank=0
[2024-03-16 05:49:11,330] INFO: Initiating epoch #20 valid run on device rank=0
[2024-03-16 06:11:48,117] INFO: Rank 0: epoch=20 / 30 train_loss=60.2915 valid_loss=60.3465 stale=0 time=123.02m eta=1218.0m
[2024-03-16 06:11:48,777] INFO: Initiating epoch #21 train run on device rank=0
[2024-03-16 07:46:28,132] INFO: Initiating epoch #21 valid run on device rank=0
[2024-03-16 08:08:38,510] INFO: Rank 0: epoch=21 / 30 train_loss=60.2732 valid_loss=60.3249 stale=0 time=116.83m eta=1094.1m
[2024-03-16 08:08:39,872] INFO: Initiating epoch #22 train run on device rank=0
[2024-03-16 09:39:31,363] INFO: Initiating epoch #22 valid run on device rank=0
[2024-03-16 10:01:07,315] INFO: Rank 0: epoch=22 / 30 train_loss=60.2556 valid_loss=60.3136 stale=0 time=112.46m eta=969.2m
[2024-03-16 10:01:08,758] INFO: Initiating epoch #23 train run on device rank=0
[2024-03-16 11:32:00,488] INFO: Initiating epoch #23 valid run on device rank=0
[2024-03-16 11:55:19,815] INFO: Rank 0: epoch=23 / 30 train_loss=60.2396 valid_loss=60.2990 stale=0 time=114.18m eta=845.9m
[2024-03-16 11:55:20,554] INFO: Initiating epoch #24 train run on device rank=0
[2024-03-16 13:28:30,958] INFO: Initiating epoch #24 valid run on device rank=0
[2024-03-16 13:51:44,143] INFO: Rank 0: epoch=24 / 30 train_loss=60.2248 valid_loss=60.2878 stale=0 time=116.39m eta=724.0m
[2024-03-16 13:51:44,791] INFO: Initiating epoch #25 train run on device rank=0
[2024-03-16 15:26:16,744] INFO: Initiating epoch #25 valid run on device rank=0
[2024-03-16 15:48:32,544] INFO: Rank 0: epoch=25 / 30 train_loss=60.2118 valid_loss=60.2777 stale=0 time=116.8m eta=602.5m
[2024-03-16 15:48:33,501] INFO: Initiating epoch #26 train run on device rank=0
[2024-03-16 17:24:50,383] INFO: Initiating epoch #26 valid run on device rank=0
[2024-03-16 17:47:27,683] INFO: Rank 0: epoch=26 / 30 train_loss=60.2004 valid_loss=60.2684 stale=0 time=118.9m eta=481.8m
[2024-03-16 17:47:28,882] INFO: Initiating epoch #27 train run on device rank=0
[2024-03-16 19:30:07,801] INFO: Initiating epoch #27 valid run on device rank=0
[2024-03-16 19:53:23,246] INFO: Rank 0: epoch=27 / 30 train_loss=60.1910 valid_loss=60.2594 stale=0 time=125.91m eta=362.0m
[2024-03-16 19:53:24,250] INFO: Initiating epoch #28 train run on device rank=0
[2024-03-16 21:34:54,086] INFO: Initiating epoch #28 valid run on device rank=0
[2024-03-16 21:57:40,549] INFO: Rank 0: epoch=28 / 30 train_loss=60.1837 valid_loss=60.2549 stale=0 time=124.27m eta=241.6m
[2024-03-16 21:57:48,147] INFO: Initiating epoch #29 train run on device rank=0
[2024-03-17 00:26:51,158] INFO: Initiating epoch #29 valid run on device rank=0
[2024-03-17 00:47:04,051] INFO: Rank 0: epoch=29 / 30 train_loss=60.1787 valid_loss=60.2513 stale=0 time=169.27m eta=122.5m
[2024-03-17 00:47:06,957] INFO: Initiating epoch #30 train run on device rank=0
[2024-03-17 02:21:53,558] INFO: Initiating epoch #30 valid run on device rank=0
[2024-03-17 02:46:56,537] INFO: Rank 0: epoch=30 / 30 train_loss=60.1762 valid_loss=60.2506 stale=0 time=119.83m eta=0.0m
[2024-03-17 02:46:57,475] INFO: Done with training. Total training time on device 0 is 3671.155min