[2024-03-14 11:13:56,730] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 8 gpus
[2024-03-14 11:13:56,738] INFO: NVIDIA GeForce GTX 1080
[2024-03-14 11:13:56,738] INFO: NVIDIA GeForce GTX 1080
[2024-03-14 11:13:56,738] INFO: NVIDIA GeForce GTX 1080
[2024-03-14 11:13:56,738] INFO: NVIDIA GeForce GTX 1080
[2024-03-14 11:13:56,738] INFO: NVIDIA GeForce GTX 1080
[2024-03-14 11:13:56,738] INFO: NVIDIA GeForce GTX 1080
[2024-03-14 11:13:56,739] INFO: NVIDIA GeForce GTX 1080
[2024-03-14 11:13:56,739] INFO: NVIDIA GeForce GTX 1080
[2024-03-14 11:14:10,780] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0): Sequential(
      (0): Linear(in_features=42, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x GravNetLayer(
        (conv1): GravNetConv(512, 512, k=16)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x GravNetLayer(
        (conv1): GravNetConv(512, 512, k=16)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=1578, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=9, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=1587, out_features=512, bias=True)
        (1): ELU(alpha=1.0)
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_charge): Sequential(
      (0): Linear(in_features=1587, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=3, bias=True)
    )
    (nn_probX): Sequential(
      (0): Linear(in_features=1587, out_features=512, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=1, bias=True)
    )
  )
)
[2024-03-14 11:14:10,782] INFO: Trainable parameters: 8695023
[2024-03-14 11:14:10,782] INFO: Non-trainable parameters: 0
[2024-03-14 11:14:10,782] INFO: Total parameters: 8695023
[2024-03-14 11:14:10,791] INFO:
Modules  Trainable params  Non-tranable params  Trainable Parameters  Non-tranable Parameters
module.nn0.0.weight  NaN  NaN  21504.0  -
module.nn0.0.bias  NaN  NaN  512.0  -
module.nn0.2.weight  NaN  NaN  512.0  -
module.nn0.2.bias  NaN  NaN  512.0  -
module.nn0.4.weight  NaN  NaN  262144.0  -
module.nn0.4.bias  NaN  NaN  512.0  -
module.conv_id.0.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_id.0.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_id.0.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_id.0.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_id.0.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_id.0.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_id.0.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_id.0.norm1.weight  NaN  NaN  512.0  -
module.conv_id.0.norm1.bias  NaN  NaN  512.0  -
module.conv_id.1.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_id.1.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_id.1.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_id.1.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_id.1.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_id.1.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_id.1.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_id.1.norm1.weight  NaN  NaN  512.0  -
module.conv_id.1.norm1.bias  NaN  NaN  512.0  -
module.conv_id.2.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_id.2.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_id.2.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_id.2.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_id.2.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_id.2.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_id.2.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_id.2.norm1.weight  NaN  NaN  512.0  -
module.conv_id.2.norm1.bias  NaN  NaN  512.0  -
module.conv_reg.0.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_reg.0.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_reg.0.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_reg.0.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_reg.0.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_reg.0.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_reg.0.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_reg.0.norm1.weight  NaN  NaN  512.0  -
module.conv_reg.0.norm1.bias  NaN  NaN  512.0  -
module.conv_reg.1.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_reg.1.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_reg.1.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_reg.1.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_reg.1.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_reg.1.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_reg.1.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_reg.1.norm1.weight  NaN  NaN  512.0  -
module.conv_reg.1.norm1.bias  NaN  NaN  512.0  -
module.conv_reg.2.conv1.lin_s.weight  NaN  NaN  2048.0  -
module.conv_reg.2.conv1.lin_s.bias  NaN  NaN  4.0  -
module.conv_reg.2.conv1.lin_h.weight  NaN  NaN  16384.0  -
module.conv_reg.2.conv1.lin_h.bias  NaN  NaN  32.0  -
module.conv_reg.2.conv1.lin_out1.weight  NaN  NaN  262144.0  -
module.conv_reg.2.conv1.lin_out2.weight  NaN  NaN  32768.0  -
module.conv_reg.2.conv1.lin_out2.bias  NaN  NaN  512.0  -
module.conv_reg.2.norm1.weight  NaN  NaN  512.0  -
module.conv_reg.2.norm1.bias  NaN  NaN  512.0  -
module.nn_id.0.weight  NaN  NaN  807936.0  -
module.nn_id.0.bias  NaN  NaN  512.0  -
module.nn_id.2.weight  NaN  NaN  512.0  -
module.nn_id.2.bias  NaN  NaN  512.0  -
module.nn_id.4.weight  NaN  NaN  4608.0  -
module.nn_id.4.bias  NaN  NaN  9.0  -
module.nn_pt.nn.0.weight  NaN  NaN  812544.0  -
module.nn_pt.nn.0.bias  NaN  NaN  512.0  -
module.nn_pt.nn.2.weight  NaN  NaN  512.0  -
module.nn_pt.nn.2.bias  NaN  NaN  512.0  -
module.nn_pt.nn.4.weight  NaN  NaN  1024.0  -
module.nn_pt.nn.4.bias  NaN  NaN  2.0  -
module.nn_eta.nn.0.weight  NaN  NaN  812544.0  -
module.nn_eta.nn.0.bias  NaN  NaN  512.0  -
module.nn_eta.nn.2.weight  NaN  NaN  512.0  -
module.nn_eta.nn.2.bias  NaN  NaN  512.0  -
module.nn_eta.nn.4.weight  NaN  NaN  1024.0  -
module.nn_eta.nn.4.bias  NaN  NaN  2.0  -
module.nn_sin_phi.nn.0.weight  NaN  NaN  812544.0  -
module.nn_sin_phi.nn.0.bias  NaN  NaN  512.0  -
module.nn_sin_phi.nn.2.weight  NaN  NaN  512.0  -
module.nn_sin_phi.nn.2.bias  NaN  NaN  512.0  -
module.nn_sin_phi.nn.4.weight  NaN  NaN  1024.0  -
module.nn_sin_phi.nn.4.bias  NaN  NaN  2.0  -
module.nn_cos_phi.nn.0.weight  NaN  NaN  812544.0  -
module.nn_cos_phi.nn.0.bias  NaN  NaN  512.0  -
module.nn_cos_phi.nn.2.weight  NaN  NaN  512.0  -
module.nn_cos_phi.nn.2.bias  NaN  NaN  512.0  -
module.nn_cos_phi.nn.4.weight  NaN  NaN  1024.0  -
module.nn_cos_phi.nn.4.bias  NaN  NaN  2.0  -
module.nn_energy.nn.0.weight  NaN  NaN  812544.0  -
module.nn_energy.nn.0.bias  NaN  NaN  512.0  -
module.nn_energy.nn.2.weight  NaN  NaN  512.0  -
module.nn_energy.nn.2.bias  NaN  NaN  512.0  -
module.nn_energy.nn.4.weight  NaN  NaN  1024.0  -
module.nn_energy.nn.4.bias  NaN  NaN  2.0  -
module.nn_charge.0.weight  NaN  NaN  812544.0  -
module.nn_charge.0.bias  NaN  NaN  512.0  -
module.nn_charge.2.weight  NaN  NaN  512.0  -
module.nn_charge.2.bias  NaN  NaN  512.0  -
module.nn_charge.4.weight  NaN  NaN  1536.0  -
module.nn_charge.4.bias  NaN  NaN  3.0  -
module.nn_probX.0.weight  NaN  NaN  812544.0  -
module.nn_probX.0.bias  NaN  NaN  512.0  -
module.nn_probX.2.weight  NaN  NaN  512.0  -
module.nn_probX.2.bias  NaN  NaN  512.0  -
module.nn_probX.4.weight  NaN  NaN  512.0  -
module.nn_probX.4.bias  NaN  NaN  1.0  -
[2024-03-14 11:14:10,818] INFO: Creating experiment dir /pfvol/experiments/MLPF_cms_Gravnet_MET_True_8gpus_pyg-cms-small_20240314_111356_075210
[2024-03-14 11:14:10,818] INFO: Model directory /pfvol/experiments/MLPF_cms_Gravnet_MET_True_8gpus_pyg-cms-small_20240314_111356_075210
[2024-03-14 11:14:11,153] INFO: train_dataset: cms_pf_ttbar, 80000
[2024-03-14 11:14:11,454] INFO: train_dataset: cms_pf_qcd, 80000
[2024-03-14 11:14:11,501] INFO: valid_dataset: cms_pf_ttbar, 20000
[2024-03-14 11:14:11,537] INFO: valid_dataset: cms_pf_qcd, 20000
[2024-03-14 11:14:11,629] INFO: Initiating epoch #1 train run on device rank=0
[2024-03-14 12:32:23,550] INFO: Initiating epoch #1 valid run on device rank=0
[2024-03-14 12:50:33,391] INFO: Rank 0: epoch=1 / 30 train_loss=326.7937 valid_loss=91.8959 stale=0 time=96.36m eta=2794.5m
[2024-03-14 12:50:33,517] INFO: Initiating epoch #2 train run on device rank=0
[2024-03-14 14:21:38,999] INFO: Initiating epoch #2 valid run on device rank=0
[2024-03-14 14:43:35,572] INFO: Rank 0: epoch=2 / 30 train_loss=89.8721 valid_loss=88.5483 stale=0 time=113.03m eta=2931.6m
[2024-03-14 14:43:36,183] INFO: Initiating epoch #3 train run on device rank=0
[2024-03-14 16:16:22,119] INFO: Initiating epoch #3 valid run on device rank=0
[2024-03-14 16:38:42,260] INFO: Rank 0: epoch=3 / 30 train_loss=88.5117 valid_loss=88.2026 stale=0 time=115.1m eta=2920.6m
[2024-03-14 16:38:42,763] INFO: Initiating epoch #4 train run on device rank=0
[2024-03-14 18:31:15,295] INFO: Initiating epoch #4 valid run on device rank=0
[2024-03-14 18:52:53,041] INFO: Rank 0: epoch=4 / 30 train_loss=87.6454 valid_loss=87.5690 stale=0 time=134.17m eta=2981.5m
[2024-03-14 18:52:53,274] INFO: Initiating epoch #5 train run on device rank=0
[2024-03-14 20:27:09,691] INFO: Initiating epoch #5 valid run on device rank=0
[2024-03-14 20:49:02,110] INFO: Rank 0: epoch=5 / 30 train_loss=87.1806 valid_loss=87.1415 stale=0 time=116.15m eta=2874.2m
[2024-03-14 20:49:02,507] INFO: Initiating epoch #6 train run on device rank=0
[2024-03-14 22:08:35,670] INFO: Initiating epoch #6 valid run on device rank=0
[2024-03-14 22:27:02,046] INFO: Rank 0: epoch=6 / 30 train_loss=86.8507 valid_loss=86.8485 stale=0 time=97.99m eta=2691.4m
[2024-03-14 22:27:02,703] INFO: Initiating epoch #7 train run on device rank=0
[2024-03-14 23:57:11,049] INFO: Initiating epoch #7 valid run on device rank=0
[2024-03-15 00:16:04,694] INFO: Rank 0: epoch=7 / 30 train_loss=86.5949 valid_loss=86.7791 stale=0 time=109.03m eta=2569.0m
[2024-03-15 00:16:04,991] INFO: Initiating epoch #8 train run on device rank=0
[2024-03-15 01:42:25,689] INFO: Initiating epoch #8 valid run on device rank=0
[2024-03-15 02:00:19,322] INFO: Rank 0: epoch=8 / 30 train_loss=86.4599 valid_loss=86.7406 stale=0 time=104.24m eta=2436.9m
[2024-03-15 02:00:19,707] INFO: Initiating epoch #9 train run on device rank=0
[2024-03-15 03:25:19,046] INFO: Initiating epoch #9 valid run on device rank=0
[2024-03-15 03:44:00,488] INFO: Rank 0: epoch=9 / 30 train_loss=86.3229 valid_loss=86.2834 stale=0 time=103.68m eta=2309.6m
[2024-03-15 03:44:01,479] INFO: Initiating epoch #10 train run on device rank=0
[2024-03-15 05:18:35,803] INFO: Initiating epoch #10 valid run on device rank=0
[2024-03-15 05:37:40,653] INFO: Rank 0: epoch=10 / 30 train_loss=86.1940 valid_loss=86.5955 stale=1 time=113.65m eta=2207.0m
[2024-03-15 05:37:42,726] INFO: Initiating epoch #11 train run on device rank=0
[2024-03-15 07:08:27,826] INFO: Initiating epoch #11 valid run on device rank=0
[2024-03-15 07:28:44,464] INFO: Rank 0: epoch=11 / 30 train_loss=86.1195 valid_loss=86.1034 stale=0 time=111.03m eta=2097.9m
[2024-03-15 07:28:46,478] INFO: Initiating epoch #12 train run on device rank=0
[2024-03-15 08:58:11,998] INFO: Initiating epoch #12 valid run on device rank=0
[2024-03-15 09:16:57,188] INFO: Rank 0: epoch=12 / 30 train_loss=85.9955 valid_loss=86.0197 stale=0 time=108.18m eta=1984.1m
[2024-03-15 09:16:57,447] INFO: Initiating epoch #13 train run on device rank=0
[2024-03-15 10:40:18,927] INFO: Initiating epoch #13 valid run on device rank=0
[2024-03-15 10:58:06,624] INFO: Rank 0: epoch=13 / 30 train_loss=85.9173 valid_loss=85.9722 stale=0 time=101.15m eta=1862.0m
[2024-03-15 10:58:06,788] INFO: Initiating epoch #14 train run on device rank=0
[2024-03-15 12:21:24,437] INFO: Initiating epoch #14 valid run on device rank=0
[2024-03-15 12:39:00,610] INFO: Rank 0: epoch=14 / 30 train_loss=85.8340 valid_loss=85.8734 stale=0 time=100.9m eta=1742.6m
[2024-03-15 12:39:00,770] INFO: Initiating epoch #15 train run on device rank=0
[2024-03-15 13:55:12,558] INFO: Initiating epoch #15 valid run on device rank=0
[2024-03-15 14:13:18,738] INFO: Rank 0: epoch=15 / 30 train_loss=85.7696 valid_loss=85.8573 stale=0 time=94.3m eta=1619.1m
[2024-03-15 14:13:21,790] INFO: Initiating epoch #16 train run on device rank=0
[2024-03-15 15:42:18,267] INFO: Initiating epoch #16 valid run on device rank=0
[2024-03-15 16:00:44,567] INFO: Rank 0: epoch=16 / 30 train_loss=85.6921 valid_loss=85.8211 stale=0 time=107.38m eta=1510.7m
[2024-03-15 16:00:45,572] INFO: Initiating epoch #17 train run on device rank=0
[2024-03-15 17:17:34,113] INFO: Initiating epoch #17 valid run on device rank=0
[2024-03-15 17:36:14,446] INFO: Rank 0: epoch=17 / 30 train_loss=85.6463 valid_loss=85.7654 stale=0 time=95.48m eta=1393.3m
[2024-03-15 17:36:14,948] INFO: Initiating epoch #18 train run on device rank=0
[2024-03-15 19:01:16,781] INFO: Initiating epoch #18 valid run on device rank=0
[2024-03-15 19:19:39,618] INFO: Rank 0: epoch=18 / 30 train_loss=85.6019 valid_loss=85.7592 stale=0 time=103.41m eta=1283.6m
[2024-03-15 19:19:39,895] INFO: Initiating epoch #19 train run on device rank=0
[2024-03-15 20:37:06,902] INFO: Initiating epoch #19 valid run on device rank=0
[2024-03-15 20:54:52,924] INFO: Rank 0: epoch=19 / 30 train_loss=85.5294 valid_loss=85.7422 stale=0 time=95.22m eta=1169.9m
[2024-03-15 20:54:53,258] INFO: Initiating epoch #20 train run on device rank=0
[2024-03-15 22:20:11,578] INFO: Initiating epoch #20 valid run on device rank=0
[2024-03-15 22:38:50,197] INFO: Rank 0: epoch=20 / 30 train_loss=85.4526 valid_loss=85.6552 stale=0 time=103.95m eta=1062.3m
[2024-03-15 22:38:51,487] INFO: Initiating epoch #21 train run on device rank=0
[2024-03-15 23:55:40,315] INFO: Initiating epoch #21 valid run on device rank=0
[2024-03-16 00:12:56,403] INFO: Rank 0: epoch=21 / 30 train_loss=85.3803 valid_loss=85.6974 stale=1 time=94.08m eta=950.9m
[2024-03-16 00:12:57,250] INFO: Initiating epoch #22 train run on device rank=0
[2024-03-16 01:37:02,406] INFO: Initiating epoch #22 valid run on device rank=0
[2024-03-16 01:55:52,640] INFO: Rank 0: epoch=22 / 30 train_loss=85.3144 valid_loss=85.6053 stale=0 time=102.92m eta=844.2m
[2024-03-16 01:55:55,068] INFO: Initiating epoch #23 train run on device rank=0
[2024-03-16 03:18:42,775] INFO: Initiating epoch #23 valid run on device rank=0
[2024-03-16 03:36:43,403] INFO: Rank 0: epoch=23 / 30 train_loss=85.2458 valid_loss=85.5870 stale=0 time=100.81m eta=737.3m
[2024-03-16 03:36:43,699] INFO: Initiating epoch #24 train run on device rank=0
[2024-03-16 04:56:40,246] INFO: Initiating epoch #24 valid run on device rank=0
[2024-03-16 05:15:39,615] INFO: Rank 0: epoch=24 / 30 train_loss=85.1793 valid_loss=85.5138 stale=0 time=98.93m eta=630.4m
[2024-03-16 05:15:40,034] INFO: Initiating epoch #25 train run on device rank=0
[2024-03-16 06:38:08,182] INFO: Initiating epoch #25 valid run on device rank=0
[2024-03-16 06:57:31,755] INFO: Rank 0: epoch=25 / 30 train_loss=85.1042 valid_loss=85.4930 stale=0 time=101.86m eta=524.7m
[2024-03-16 06:57:32,014] INFO: Initiating epoch #26 train run on device rank=0
[2024-03-16 08:19:32,553] INFO: Initiating epoch #26 valid run on device rank=0
[2024-03-16 08:37:21,643] INFO: Rank 0: epoch=26 / 30 train_loss=85.0362 valid_loss=85.4771 stale=0 time=99.83m eta=418.9m
[2024-03-16 08:37:21,917] INFO: Initiating epoch #27 train run on device rank=0
[2024-03-16 09:55:43,786] INFO: Initiating epoch #27 valid run on device rank=0
[2024-03-16 10:14:06,779] INFO: Rank 0: epoch=27 / 30 train_loss=84.9823 valid_loss=85.4487 stale=0 time=96.75m eta=313.3m
[2024-03-16 10:14:08,089] INFO: Initiating epoch #28 train run on device rank=0
[2024-03-16 11:31:57,632] INFO: Initiating epoch #28 valid run on device rank=0
[2024-03-16 11:50:04,607] INFO: Rank 0: epoch=28 / 30 train_loss=84.9412 valid_loss=85.4363 stale=0 time=95.94m eta=208.3m
[2024-03-16 11:50:04,975] INFO: Initiating epoch #29 train run on device rank=0
[2024-03-16 13:07:20,850] INFO: Initiating epoch #29 valid run on device rank=0
[2024-03-16 13:26:41,061] INFO: Rank 0: epoch=29 / 30 train_loss=84.9116 valid_loss=85.4341 stale=0 time=96.6m eta=103.9m
[2024-03-16 13:26:41,949] INFO: Initiating epoch #30 train run on device rank=0
[2024-03-16 14:47:08,964] INFO: Initiating epoch #30 valid run on device rank=0
[2024-03-16 15:05:49,453] INFO: Rank 0: epoch=30 / 30 train_loss=84.8940 valid_loss=85.4361 stale=1 time=99.13m eta=0.0m
[2024-03-16 15:05:50,907] INFO: Done with training. Total training time on device 0 is 3111.655min