[2024-03-14 11:13:11,007] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 8 gpus
[2024-03-14 11:13:11,042] INFO: NVIDIA A10
[2024-03-14 11:13:11,042] INFO: NVIDIA A10
[2024-03-14 11:13:11,042] INFO: NVIDIA A10
[2024-03-14 11:13:11,042] INFO: NVIDIA A10
[2024-03-14 11:13:11,042] INFO: NVIDIA A10
[2024-03-14 11:13:11,042] INFO: NVIDIA A10
[2024-03-14 11:13:11,042] INFO: NVIDIA A10
[2024-03-14 11:13:11,042] INFO: NVIDIA A10
[2024-03-14 11:13:25,735] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0): Sequential(
      (0): Linear(in_features=42, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x GravNetLayer(
        (conv1): GravNetConv(512, 512, k=16)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x GravNetLayer(
        (conv1): GravNetConv(512, 512, k=16)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=1578, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=9, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_charge): Sequential(
      (0): Linear(in_features=1587, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=3, bias=True)
    )
  )
)
[2024-03-14 11:13:25,736] INFO: Trainable parameters: 7880430
[2024-03-14 11:13:25,736] INFO: Non-trainable parameters: 0
[2024-03-14 11:13:25,736] INFO: Total parameters: 7880430
[2024-03-14 11:13:25,741] INFO:
Modules                                  Trainable params  Non-tranable params  Trainable Parameters  Non-tranable Parameters
module.nn0.0.weight                      NaN               NaN                  21504.0               -
module.nn0.0.bias                        NaN               NaN                  512.0                 -
module.nn0.2.weight                      NaN               NaN                  512.0                 -
module.nn0.2.bias                        NaN               NaN                  512.0                 -
module.nn0.4.weight                      NaN               NaN                  262144.0              -
module.nn0.4.bias                        NaN               NaN                  512.0                 -
module.conv_id.0.conv1.lin_s.weight      NaN               NaN                  2048.0                -
module.conv_id.0.conv1.lin_s.bias        NaN               NaN                  4.0                   -
module.conv_id.0.conv1.lin_h.weight      NaN               NaN                  16384.0               -
module.conv_id.0.conv1.lin_h.bias        NaN               NaN                  32.0                  -
module.conv_id.0.conv1.lin_out1.weight   NaN               NaN                  262144.0              -
module.conv_id.0.conv1.lin_out2.weight   NaN               NaN                  32768.0               -
module.conv_id.0.conv1.lin_out2.bias     NaN               NaN                  512.0                 -
module.conv_id.0.norm1.weight            NaN               NaN                  512.0                 -
module.conv_id.0.norm1.bias              NaN               NaN                  512.0                 -
module.conv_id.1.conv1.lin_s.weight      NaN               NaN                  2048.0                -
module.conv_id.1.conv1.lin_s.bias        NaN               NaN                  4.0                   -
module.conv_id.1.conv1.lin_h.weight      NaN               NaN                  16384.0               -
module.conv_id.1.conv1.lin_h.bias        NaN               NaN                  32.0                  -
module.conv_id.1.conv1.lin_out1.weight   NaN               NaN                  262144.0              -
module.conv_id.1.conv1.lin_out2.weight   NaN               NaN                  32768.0               -
module.conv_id.1.conv1.lin_out2.bias     NaN               NaN                  512.0                 -
module.conv_id.1.norm1.weight            NaN               NaN                  512.0                 -
module.conv_id.1.norm1.bias              NaN               NaN                  512.0                 -
module.conv_id.2.conv1.lin_s.weight      NaN               NaN                  2048.0                -
module.conv_id.2.conv1.lin_s.bias        NaN               NaN                  4.0                   -
module.conv_id.2.conv1.lin_h.weight      NaN               NaN                  16384.0               -
module.conv_id.2.conv1.lin_h.bias        NaN               NaN                  32.0                  -
module.conv_id.2.conv1.lin_out1.weight   NaN               NaN                  262144.0              -
module.conv_id.2.conv1.lin_out2.weight   NaN               NaN                  32768.0               -
module.conv_id.2.conv1.lin_out2.bias     NaN               NaN                  512.0                 -
module.conv_id.2.norm1.weight            NaN               NaN                  512.0                 -
module.conv_id.2.norm1.bias              NaN               NaN                  512.0                 -
module.conv_reg.0.conv1.lin_s.weight     NaN               NaN                  2048.0                -
module.conv_reg.0.conv1.lin_s.bias       NaN               NaN                  4.0                   -
module.conv_reg.0.conv1.lin_h.weight     NaN               NaN                  16384.0               -
module.conv_reg.0.conv1.lin_h.bias       NaN               NaN                  32.0                  -
module.conv_reg.0.conv1.lin_out1.weight  NaN               NaN                  262144.0              -
module.conv_reg.0.conv1.lin_out2.weight  NaN               NaN                  32768.0               -
module.conv_reg.0.conv1.lin_out2.bias    NaN               NaN                  512.0                 -
module.conv_reg.0.norm1.weight           NaN               NaN                  512.0                 -
module.conv_reg.0.norm1.bias             NaN               NaN                  512.0                 -
module.conv_reg.1.conv1.lin_s.weight     NaN               NaN                  2048.0                -
module.conv_reg.1.conv1.lin_s.bias       NaN               NaN                  4.0                   -
module.conv_reg.1.conv1.lin_h.weight     NaN               NaN                  16384.0               -
module.conv_reg.1.conv1.lin_h.bias       NaN               NaN                  32.0                  -
module.conv_reg.1.conv1.lin_out1.weight  NaN               NaN                  262144.0              -
module.conv_reg.1.conv1.lin_out2.weight  NaN               NaN                  32768.0               -
module.conv_reg.1.conv1.lin_out2.bias    NaN               NaN                  512.0                 -
module.conv_reg.1.norm1.weight           NaN               NaN                  512.0                 -
module.conv_reg.1.norm1.bias             NaN               NaN                  512.0                 -
module.conv_reg.2.conv1.lin_s.weight     NaN               NaN                  2048.0                -
module.conv_reg.2.conv1.lin_s.bias       NaN               NaN                  4.0                   -
module.conv_reg.2.conv1.lin_h.weight     NaN               NaN                  16384.0               -
module.conv_reg.2.conv1.lin_h.bias       NaN               NaN                  32.0                  -
module.conv_reg.2.conv1.lin_out1.weight  NaN               NaN                  262144.0              -
module.conv_reg.2.conv1.lin_out2.weight  NaN               NaN                  32768.0               -
module.conv_reg.2.conv1.lin_out2.bias    NaN               NaN                  512.0                 -
module.conv_reg.2.norm1.weight           NaN               NaN                  512.0                 -
module.conv_reg.2.norm1.bias             NaN               NaN                  512.0                 -
module.nn_id.0.weight                    NaN               NaN                  807936.0              -
module.nn_id.0.bias                      NaN               NaN                  512.0                 -
module.nn_id.2.weight                    NaN               NaN                  512.0                 -
module.nn_id.2.bias                      NaN               NaN                  512.0                 -
module.nn_id.4.weight                    NaN               NaN                  4608.0                -
module.nn_id.4.bias                      NaN               NaN                  9.0                   -
module.nn_pt.nn.0.weight                 NaN               NaN                  812544.0              -
module.nn_pt.nn.0.bias                   NaN               NaN                  512.0                 -
module.nn_pt.nn.2.weight                 NaN               NaN                  512.0                 -
module.nn_pt.nn.2.bias                   NaN               NaN                  512.0                 -
module.nn_pt.nn.4.weight                 NaN               NaN                  1024.0                -
module.nn_pt.nn.4.bias                   NaN               NaN                  2.0                   -
module.nn_eta.nn.0.weight                NaN               NaN                  812544.0              -
module.nn_eta.nn.0.bias                  NaN               NaN                  512.0                 -
module.nn_eta.nn.2.weight                NaN               NaN                  512.0                 -
module.nn_eta.nn.2.bias                  NaN               NaN                  512.0                 -
module.nn_eta.nn.4.weight                NaN               NaN                  1024.0                -
module.nn_eta.nn.4.bias                  NaN               NaN                  2.0                   -
module.nn_sin_phi.nn.0.weight            NaN               NaN                  812544.0              -
module.nn_sin_phi.nn.0.bias              NaN               NaN                  512.0                 -
module.nn_sin_phi.nn.2.weight            NaN               NaN                  512.0                 -
module.nn_sin_phi.nn.2.bias              NaN               NaN                  512.0                 -
module.nn_sin_phi.nn.4.weight            NaN               NaN                  1024.0                -
module.nn_sin_phi.nn.4.bias              NaN               NaN                  2.0                   -
module.nn_cos_phi.nn.0.weight            NaN               NaN                  812544.0              -
module.nn_cos_phi.nn.0.bias              NaN               NaN                  512.0                 -
module.nn_cos_phi.nn.2.weight            NaN               NaN                  512.0                 -
module.nn_cos_phi.nn.2.bias              NaN               NaN                  512.0                 -
module.nn_cos_phi.nn.4.weight            NaN               NaN                  1024.0                -
module.nn_cos_phi.nn.4.bias              NaN               NaN                  2.0                   -
module.nn_energy.nn.0.weight             NaN               NaN                  812544.0              -
module.nn_energy.nn.0.bias               NaN               NaN                  512.0                 -
module.nn_energy.nn.2.weight             NaN               NaN                  512.0                 -
module.nn_energy.nn.2.bias               NaN               NaN                  512.0                 -
module.nn_energy.nn.4.weight             NaN               NaN                  1024.0                -
module.nn_energy.nn.4.bias               NaN               NaN                  2.0                   -
module.nn_charge.0.weight                NaN               NaN                  812544.0              -
module.nn_charge.0.bias                  NaN               NaN                  512.0                 -
module.nn_charge.2.weight                NaN               NaN                  512.0                 -
module.nn_charge.2.bias                  NaN               NaN                  512.0                 -
module.nn_charge.4.weight                NaN               NaN                  1536.0                -
module.nn_charge.4.bias                  NaN               NaN                  3.0                   -
[2024-03-14 11:13:25,884] INFO: Creating experiment dir /pfvol/experiments/MLPF_cms_Gravnet_MET_False_8gpus_pyg-cms-small_20240314_111309_722128
[2024-03-14 11:13:25,884] INFO: Model directory /pfvol/experiments/MLPF_cms_Gravnet_MET_False_8gpus_pyg-cms-small_20240314_111309_722128
[2024-03-14 11:13:27,309] INFO: train_dataset: cms_pf_ttbar, 80000
[2024-03-14 11:13:28,542] INFO: train_dataset: cms_pf_qcd, 80000
[2024-03-14 11:13:28,576] INFO: valid_dataset: cms_pf_ttbar, 20000
[2024-03-14 11:13:28,607] INFO: valid_dataset: cms_pf_qcd, 20000
[2024-03-14 11:13:29,212] INFO: Initiating epoch #1 train run on device rank=0
[2024-03-14 13:05:11,301] INFO: Initiating epoch #1 valid run on device rank=0
[2024-03-14 13:30:39,529] INFO: Rank 0: epoch=1 / 30 train_loss=64.9224 valid_loss=62.7525 stale=0 time=137.17m eta=3978.0m
[2024-03-14 13:30:40,038] INFO: Initiating epoch #2 train run on device rank=0
[2024-03-14 15:11:36,897] INFO: Initiating epoch #2 valid run on device rank=0
[2024-03-14 15:37:02,177] INFO: Rank 0: epoch=2 / 30 train_loss=62.3490 valid_loss=62.0672 stale=0 time=126.37m eta=3689.7m
[2024-03-14 15:37:03,348] INFO: Initiating epoch #3 train run on device rank=0
[2024-03-14 17:32:56,914] INFO: Initiating epoch #3 valid run on device rank=0
[2024-03-14 17:58:13,501] INFO: Rank 0: epoch=3 / 30 train_loss=61.6124 valid_loss=61.4601 stale=0 time=141.17m eta=3642.6m
[2024-03-14 17:58:13,982] INFO: Initiating epoch #4 train run on device rank=0
[2024-03-14 19:40:10,175] INFO: Initiating epoch #4 valid run on device rank=0
[2024-03-14 20:05:57,782] INFO: Rank 0: epoch=4 / 30 train_loss=61.3332 valid_loss=61.2970 stale=0 time=127.73m eta=3461.1m
[2024-03-14 20:05:58,569] INFO: Initiating epoch #5 train run on device rank=0
[2024-03-14 21:46:29,020] INFO: Initiating epoch #5 valid run on device rank=0
[2024-03-14 22:13:31,163] INFO: Rank 0: epoch=5 / 30 train_loss=61.1771 valid_loss=61.1717 stale=0 time=127.54m eta=3300.2m
[2024-03-14 22:13:32,036] INFO: Initiating epoch #6 train run on device rank=0
[2024-03-14 23:58:25,763] INFO: Initiating epoch #6 valid run on device rank=0
[2024-03-15 00:24:11,436] INFO: Rank 0: epoch=6 / 30 train_loss=61.0506 valid_loss=61.0333 stale=0 time=130.66m eta=3162.8m
[2024-03-15 00:24:12,203] INFO: Initiating epoch #7 train run on device rank=0
[2024-03-15 02:13:42,114] INFO: Initiating epoch #7 valid run on device rank=0
[2024-03-15 02:39:48,289] INFO: Rank 0: epoch=7 / 30 train_loss=60.9402 valid_loss=60.9276 stale=0 time=135.6m eta=3043.6m
[2024-03-15 02:39:49,314] INFO: Initiating epoch #8 train run on device rank=0
[2024-03-15 04:37:19,542] INFO: Initiating epoch #8 valid run on device rank=0
[2024-03-15 05:06:59,876] INFO: Rank 0: epoch=8 / 30 train_loss=60.8423 valid_loss=60.8564 stale=0 time=147.18m eta=2952.2m
[2024-03-15 05:07:01,723] INFO: Initiating epoch #9 train run on device rank=0
[2024-03-15 07:05:47,716] INFO: Initiating epoch #9 valid run on device rank=0
[2024-03-15 07:36:19,244] INFO: Rank 0: epoch=9 / 30 train_loss=60.7422 valid_loss=60.7346 stale=0 time=149.29m eta=2853.3m
[2024-03-15 07:36:20,938] INFO: Initiating epoch #10 train run on device rank=0
[2024-03-15 09:42:11,546] INFO: Initiating epoch #10 valid run on device rank=0
[2024-03-15 10:09:28,249] INFO: Rank 0: epoch=10 / 30 train_loss=60.6636 valid_loss=60.6539 stale=0 time=153.12m eta=2752.0m
[2024-03-15 10:09:28,758] INFO: Initiating epoch #11 train run on device rank=0
[2024-03-15 11:55:18,856] INFO: Initiating epoch #11 valid run on device rank=0
[2024-03-15 12:20:51,694] INFO: Rank 0: epoch=11 / 30 train_loss=60.6061 valid_loss=60.6168 stale=0 time=131.38m eta=2603.6m
[2024-03-15 12:20:52,338] INFO: Initiating epoch #12 train run on device rank=0
[2024-03-15 13:59:26,845] INFO: Initiating epoch #12 valid run on device rank=0
[2024-03-15 14:27:14,316] INFO: Rank 0: epoch=12 / 30 train_loss=60.5590 valid_loss=60.5844 stale=0 time=126.37m eta=2450.6m
[2024-03-15 14:27:15,062] INFO: Initiating epoch #13 train run on device rank=0
[2024-03-15 16:07:40,121] INFO: Initiating epoch #13 valid run on device rank=0
[2024-03-15 16:32:59,022] INFO: Rank 0: epoch=13 / 30 train_loss=60.5191 valid_loss=60.5523 stale=0 time=125.73m eta=2300.9m
[2024-03-15 16:33:00,574] INFO: Initiating epoch #14 train run on device rank=0
[2024-03-15 18:17:10,572] INFO: Initiating epoch #14 valid run on device rank=0
[2024-03-15 18:42:18,073] INFO: Rank 0: epoch=14 / 30 train_loss=60.4847 valid_loss=60.5236 stale=0 time=129.29m eta=2158.6m
[2024-03-15 18:42:18,530] INFO: Initiating epoch #15 train run on device rank=0
[2024-03-15 20:19:55,964] INFO: Initiating epoch #15 valid run on device rank=0
[2024-03-15 20:45:27,982] INFO: Rank 0: epoch=15 / 30 train_loss=60.4572 valid_loss=60.5045 stale=0 time=123.16m eta=2012.0m
[2024-03-15 20:45:28,518] INFO: Initiating epoch #16 train run on device rank=0
[2024-03-15 22:35:52,785] INFO: Initiating epoch #16 valid run on device rank=0
[2024-03-15 23:01:27,588] INFO: Rank 0: epoch=16 / 30 train_loss=60.4308 valid_loss=60.4819 stale=0 time=135.98m eta=1879.5m
[2024-03-15 23:01:28,952] INFO: Initiating epoch #17 train run on device rank=0
[2024-03-16 00:40:42,766] INFO: Initiating epoch #17 valid run on device rank=0
[2024-03-16 01:06:48,950] INFO: Rank 0: epoch=17 / 30 train_loss=60.4074 valid_loss=60.4541 stale=0 time=125.33m eta=1738.4m
[2024-03-16 01:06:49,489] INFO: Initiating epoch #18 train run on device rank=0
[2024-03-16 02:56:31,681] INFO: Initiating epoch #18 valid run on device rank=0
[2024-03-16 03:22:08,960] INFO: Rank 0: epoch=18 / 30 train_loss=60.3845 valid_loss=60.4257 stale=0 time=135.32m eta=1605.8m
[2024-03-16 03:22:09,379] INFO: Initiating epoch #19 train run on device rank=0
[2024-03-16 05:06:36,949] INFO: Initiating epoch #19 valid run on device rank=0
[2024-03-16 05:32:29,432] INFO: Rank 0: epoch=19 / 30 train_loss=60.3635 valid_loss=60.4114 stale=0 time=130.33m eta=1469.9m
[2024-03-16 05:32:30,112] INFO: Initiating epoch #20 train run on device rank=0
[2024-03-16 07:11:23,962] INFO: Initiating epoch #20 valid run on device rank=0
[2024-03-16 07:38:05,861] INFO: Rank 0: epoch=20 / 30 train_loss=60.3427 valid_loss=60.3966 stale=0 time=125.6m eta=1332.3m
[2024-03-16 07:38:06,335] INFO: Initiating epoch #21 train run on device rank=0
[2024-03-16 09:16:24,642] INFO: Initiating epoch #21 valid run on device rank=0
[2024-03-16 09:42:37,433] INFO: Rank 0: epoch=21 / 30 train_loss=60.3235 valid_loss=60.3825 stale=0 time=124.52m eta=1195.3m
[2024-03-16 09:42:39,026] INFO: Initiating epoch #22 train run on device rank=0
[2024-03-16 11:22:51,442] INFO: Initiating epoch #22 valid run on device rank=0
[2024-03-16 11:48:22,680] INFO: Rank 0: epoch=22 / 30 train_loss=60.3064 valid_loss=60.3681 stale=0 time=125.73m eta=1060.0m
[2024-03-16 11:48:23,443] INFO: Initiating epoch #23 train run on device rank=0
[2024-03-16 13:30:39,345] INFO: Initiating epoch #23 valid run on device rank=0
[2024-03-16 13:56:01,631] INFO: Rank 0: epoch=23 / 30 train_loss=60.2901 valid_loss=60.3516 stale=0 time=127.64m eta=926.0m
[2024-03-16 13:56:02,130] INFO: Initiating epoch #24 train run on device rank=0
[2024-03-16 15:35:49,494] INFO: Initiating epoch #24 valid run on device rank=0
[2024-03-16 16:01:10,937] INFO: Rank 0: epoch=24 / 30 train_loss=60.2748 valid_loss=60.3375 stale=0 time=125.15m eta=791.9m
[2024-03-16 16:01:11,970] INFO: Initiating epoch #25 train run on device rank=0
[2024-03-16 17:42:46,947] INFO: Initiating epoch #25 valid run on device rank=0
[2024-03-16 18:08:04,766] INFO: Rank 0: epoch=25 / 30 train_loss=60.2617 valid_loss=60.3253 stale=0 time=126.88m eta=658.9m
[2024-03-16 18:08:05,731] INFO: Initiating epoch #26 train run on device rank=0
[2024-03-16 19:55:14,281] INFO: Initiating epoch #26 valid run on device rank=0
[2024-03-16 20:21:22,941] INFO: Rank 0: epoch=26 / 30 train_loss=60.2506 valid_loss=60.3140 stale=0 time=133.29m eta=527.4m
[2024-03-16 20:21:23,834] INFO: Initiating epoch #27 train run on device rank=0
[2024-03-16 22:01:48,959] INFO: Initiating epoch #27 valid run on device rank=0
[2024-03-16 22:31:42,971] INFO: Rank 0: epoch=27 / 30 train_loss=60.2414 valid_loss=60.3078 stale=0 time=130.32m eta=395.4m
[2024-03-16 22:31:44,230] INFO: Initiating epoch #28 train run on device rank=0
[2024-03-17 00:55:32,946] INFO: Initiating epoch #28 valid run on device rank=0
[2024-03-17 01:21:50,082] INFO: Rank 0: epoch=28 / 30 train_loss=60.2345 valid_loss=60.3027 stale=0 time=170.1m eta=266.3m
[2024-03-17 01:21:51,323] INFO: Initiating epoch #29 train run on device rank=0
[2024-03-17 03:05:06,246] INFO: Initiating epoch #29 valid run on device rank=0
[2024-03-17 03:30:48,584] INFO: Rank 0: epoch=29 / 30 train_loss=60.2297 valid_loss=60.2998 stale=0 time=128.95m eta=133.0m
[2024-03-17 03:30:50,436] INFO: Initiating epoch #30 train run on device rank=0
[2024-03-17 05:16:16,815] INFO: Initiating epoch #30 valid run on device rank=0
[2024-03-17 05:42:03,702] INFO: Rank 0: epoch=30 / 30 train_loss=60.2272 valid_loss=60.2992 stale=0 time=131.22m eta=0.0m
[2024-03-17 05:42:04,178] INFO: Done with training. Total training time on device 0 is 3988.583min