[2024-06-18 09:39:36,522] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus
[2024-06-18 09:39:36,609] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-18 09:39:36,609] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-18 09:39:36,609] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-18 09:39:36,609] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-06-18 09:39:41,281] INFO: using dtype=torch.float32
[2024-06-18 09:39:42,041] INFO: using attention_type=math
[2024-06-18 09:39:42,051] INFO: using attention_type=math
[2024-06-18 09:39:42,063] INFO: using attention_type=math
[2024-06-18 09:39:42,077] INFO: using attention_type=math
[2024-06-18 09:39:42,087] INFO: using attention_type=math
[2024-06-18 09:39:42,098] INFO: using attention_type=math
[2024-06-18 09:39:44,765] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-06-18 09:39:44,766] INFO: Backbone Trainable parameters: 11671568
[2024-06-18 09:39:44,766] INFO: Backbone Non-trainable parameters: 0
[2024-06-18 09:39:44,766] INFO: Backbone Total parameters: 11671568
[2024-06-18 09:39:44,770] INFO:
Modules                                 Trainable parameters  Non-tranable parameters
module.nn0_id.0.weight                  8704                  0
module.nn0_id.0.bias                    512                   0
module.nn0_id.2.weight                  512                   0
module.nn0_id.2.bias                    512                   0
module.nn0_id.4.weight                  262144                0
module.nn0_id.4.bias                    512                   0
module.nn0_reg.0.weight                 8704                  0
module.nn0_reg.0.bias                   512                   0
module.nn0_reg.2.weight                 512                   0
module.nn0_reg.2.bias                   512                   0
module.nn0_reg.4.weight                 262144                0
module.nn0_reg.4.bias                   512                   0
module.conv_id.0.mha.in_proj_weight     786432                0
module.conv_id.0.mha.in_proj_bias       1536                  0
module.conv_id.0.mha.out_proj.weight    262144                0
module.conv_id.0.mha.out_proj.bias      512                   0
module.conv_id.0.norm0.weight           512                   0
module.conv_id.0.norm0.bias             512                   0
module.conv_id.0.norm1.weight           512                   0
module.conv_id.0.norm1.bias             512                   0
module.conv_id.0.seq.0.weight           262144                0
module.conv_id.0.seq.0.bias             512                   0
module.conv_id.0.seq.2.weight           262144                0
module.conv_id.0.seq.2.bias             512                   0
module.conv_id.1.mha.in_proj_weight     786432                0
module.conv_id.1.mha.in_proj_bias       1536                  0
module.conv_id.1.mha.out_proj.weight    262144                0
module.conv_id.1.mha.out_proj.bias      512                   0
module.conv_id.1.norm0.weight           512                   0
module.conv_id.1.norm0.bias             512                   0
module.conv_id.1.norm1.weight           512                   0
module.conv_id.1.norm1.bias             512                   0
module.conv_id.1.seq.0.weight           262144                0
module.conv_id.1.seq.0.bias             512                   0
module.conv_id.1.seq.2.weight           262144                0
module.conv_id.1.seq.2.bias             512                   0
module.conv_id.2.mha.in_proj_weight     786432                0
module.conv_id.2.mha.in_proj_bias       1536                  0
module.conv_id.2.mha.out_proj.weight    262144                0
module.conv_id.2.mha.out_proj.bias      512                   0
module.conv_id.2.norm0.weight           512                   0
module.conv_id.2.norm0.bias             512                   0
module.conv_id.2.norm1.weight           512                   0
module.conv_id.2.norm1.bias             512                   0
module.conv_id.2.seq.0.weight           262144                0
module.conv_id.2.seq.0.bias             512                   0
module.conv_id.2.seq.2.weight           262144                0
module.conv_id.2.seq.2.bias             512                   0
module.conv_reg.0.mha.in_proj_weight    786432                0
module.conv_reg.0.mha.in_proj_bias      1536                  0
module.conv_reg.0.mha.out_proj.weight   262144                0
module.conv_reg.0.mha.out_proj.bias     512                   0
module.conv_reg.0.norm0.weight          512                   0
module.conv_reg.0.norm0.bias            512                   0
module.conv_reg.0.norm1.weight          512                   0
module.conv_reg.0.norm1.bias            512                   0
module.conv_reg.0.seq.0.weight          262144                0
module.conv_reg.0.seq.0.bias            512                   0
module.conv_reg.0.seq.2.weight          262144                0
module.conv_reg.0.seq.2.bias            512                   0
module.conv_reg.1.mha.in_proj_weight    786432                0
module.conv_reg.1.mha.in_proj_bias      1536                  0
module.conv_reg.1.mha.out_proj.weight   262144                0
module.conv_reg.1.mha.out_proj.bias     512                   0
module.conv_reg.1.norm0.weight          512                   0
module.conv_reg.1.norm0.bias            512                   0
module.conv_reg.1.norm1.weight          512                   0
module.conv_reg.1.norm1.bias            512                   0
module.conv_reg.1.seq.0.weight          262144                0
module.conv_reg.1.seq.0.bias            512                   0
module.conv_reg.1.seq.2.weight          262144                0
module.conv_reg.1.seq.2.bias            512                   0
module.conv_reg.2.mha.in_proj_weight    786432                0
module.conv_reg.2.mha.in_proj_bias      1536                  0
module.conv_reg.2.mha.out_proj.weight   262144                0
module.conv_reg.2.mha.out_proj.bias     512                   0
module.conv_reg.2.norm0.weight          512                   0
module.conv_reg.2.norm0.bias            512                   0
module.conv_reg.2.norm1.weight          512                   0
module.conv_reg.2.norm1.bias            512                   0
module.conv_reg.2.seq.0.weight          262144                0
module.conv_reg.2.seq.0.bias            512                   0
module.conv_reg.2.seq.2.weight          262144                0
module.conv_reg.2.seq.2.bias            512                   0
module.nn_id.0.weight                   270848                0
module.nn_id.0.bias                     512                   0
module.nn_id.2.weight                   512                   0
module.nn_id.2.bias                     512                   0
module.nn_id.4.weight                   3072                  0
module.nn_id.4.bias                     6                     0
module.nn_pt.nn.0.weight                273920                0
module.nn_pt.nn.0.bias                  512                   0
module.nn_pt.nn.2.weight                512                   0
module.nn_pt.nn.2.bias                  512                   0
module.nn_pt.nn.4.weight                1024                  0
module.nn_pt.nn.4.bias                  2                     0
module.nn_eta.nn.0.weight               273920                0
module.nn_eta.nn.0.bias                 512                   0
module.nn_eta.nn.2.weight               512                   0
module.nn_eta.nn.2.bias                 512                   0
module.nn_eta.nn.4.weight               1024                  0
module.nn_eta.nn.4.bias                 2                     0
module.nn_sin_phi.nn.0.weight           273920                0
module.nn_sin_phi.nn.0.bias             512                   0
module.nn_sin_phi.nn.2.weight           512                   0
module.nn_sin_phi.nn.2.bias             512                   0
module.nn_sin_phi.nn.4.weight           1024                  0
module.nn_sin_phi.nn.4.bias             2                     0
module.nn_cos_phi.nn.0.weight           273920                0
module.nn_cos_phi.nn.0.bias             512                   0
module.nn_cos_phi.nn.2.weight           512                   0
module.nn_cos_phi.nn.2.bias             512                   0
module.nn_cos_phi.nn.4.weight           1024                  0
module.nn_cos_phi.nn.4.bias             2                     0
module.nn_energy.nn.0.weight            273920                0
module.nn_energy.nn.0.bias              512                   0
module.nn_energy.nn.2.weight            512                   0
module.nn_energy.nn.2.bias              512                   0
module.nn_energy.nn.4.weight            1024                  0
module.nn_energy.nn.4.bias              2                     0
[2024-06-18 09:39:44,827] INFO: DistributedDataParallel(
  (module): DeepMET(
    (nn): Sequential(
      (0): Linear(in_features=11, out_features=256, bias=True)
      (1): ELU(alpha=1.0)
      (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0, inplace=False)
      (4): Linear(in_features=256, out_features=2, bias=True)
    )
  )
)
[2024-06-18 09:39:44,827] INFO: DeepMET Trainable parameters: 4098
[2024-06-18 09:39:44,827] INFO: DeepMET Non-trainable parameters: 0
[2024-06-18 09:39:44,827] INFO: DeepMET Total parameters: 4098
[2024-06-18 09:39:44,828] INFO:
Modules                                 Trainable parameters  Non-tranable parameters
module.nn.0.weight                      2816                  0
module.nn.0.bias                        256                   0
module.nn.2.weight                      256                   0
module.nn.2.bias                        256                   0
module.nn.4.weight                      512                   0
module.nn.4.bias                        2                     0
[2024-06-18 09:39:44,829] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_MLPFCands_ReinitializeBackbone_20240618_093936_341602
[2024-06-18 09:39:44,829] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/MLPF_4GTX_MET_MLPFCands_ReinitializeBackbone_20240618_093936_341602
[2024-06-18 09:39:44,862] INFO: train_dataset: clic_edm_ttbar_pf, 800800
[2024-06-18 09:39:53,389] INFO: valid_dataset: clic_edm_ttbar_pf, 200200
[2024-06-18 09:39:53,407] INFO: Initiating epoch #1 train run on device rank=0
[2024-06-18 09:54:41,407] INFO: Initiating epoch #1 valid run on device rank=0
[2024-06-18 09:55:46,787] INFO: Rank 0: epoch=1 / 400 train_loss=19.4451 valid_loss=21.4503 stale=0 time=15.89m eta=6340.0m
[2024-06-18 09:55:46,789] INFO: Initiating epoch #2 train run on device rank=0
[2024-06-18 10:10:35,393] INFO: Initiating epoch #2 valid run on device rank=0
[2024-06-18 10:11:39,685] INFO: Rank 0: epoch=2 / 400 train_loss=16.6239 valid_loss=14.0316 stale=0 time=15.88m eta=6322.5m
[2024-06-18 10:11:39,713] INFO: Initiating epoch #3 train run on device rank=0
[2024-06-18 10:26:28,136] INFO: Initiating epoch #3 valid run on device rank=0
[2024-06-18 10:27:32,717] INFO: Rank 0: epoch=3 / 400 train_loss=13.3989 valid_loss=13.7032 stale=0 time=15.88m eta=6306.4m
[2024-06-18 10:27:32,764] INFO: Initiating epoch #4 train run on device rank=0
[2024-06-18 10:42:19,180] INFO: Initiating epoch #4 valid run on device rank=0
[2024-06-18 10:43:23,821] INFO: Rank 0: epoch=4 / 400 train_loss=12.7747 valid_loss=12.9256 stale=0 time=15.85m eta=6287.2m
[2024-06-18 10:43:23,855] INFO: Initiating epoch #5 train run on device rank=0
[2024-06-18 10:58:11,664] INFO: Initiating epoch #5 valid run on device rank=0
[2024-06-18 10:59:16,370] INFO: Rank 0: epoch=5 / 400 train_loss=12.1401 valid_loss=12.0174 stale=0 time=15.88m eta=6271.2m
[2024-06-18 10:59:16,416] INFO: Initiating epoch #6 train run on device rank=0
[2024-06-18 11:14:05,246] INFO: Initiating epoch #6 valid run on device rank=0
[2024-06-18 11:15:09,482] INFO: Rank 0: epoch=6 / 400 train_loss=11.6557 valid_loss=11.8547 stale=0 time=15.88m eta=6255.9m
[2024-06-18 11:15:09,490] INFO: Initiating epoch #7 train run on device rank=0
[2024-06-18 11:29:59,291] INFO: Initiating epoch #7 valid run on device rank=0
[2024-06-18 11:31:03,837] INFO: Rank 0: epoch=7 / 400 train_loss=11.3253 valid_loss=11.7197 stale=0 time=15.91m eta=6241.6m
[2024-06-18 11:31:04,025] INFO: Initiating epoch #8 train run on device rank=0
[2024-06-18 11:45:53,347] INFO: Initiating epoch #8 valid run on device rank=0
[2024-06-18 11:46:58,060] INFO: Rank 0: epoch=8 / 400 train_loss=11.1067 valid_loss=11.4958 stale=0 time=15.9m eta=6226.8m
[2024-06-18 11:46:58,119] INFO: Initiating epoch #9 train run on device rank=0
[2024-06-18 12:01:47,371] INFO: Initiating epoch #9 valid run on device rank=0
[2024-06-18 12:02:52,128] INFO: Rank 0: epoch=9 / 400 train_loss=10.9411 valid_loss=11.2550 stale=0 time=15.9m eta=6211.6m
[2024-06-18 12:02:52,180] INFO: Initiating epoch #10 train run on device rank=0
[2024-06-18 12:17:41,519] INFO: Initiating epoch #10 valid run on device rank=0
[2024-06-18 12:18:45,987] INFO: Rank 0: epoch=10 / 400 train_loss=10.8053 valid_loss=11.2450 stale=0 time=15.9m eta=6196.2m
[2024-06-18 12:18:46,030] INFO: Initiating epoch #11 train run on device rank=0
[2024-06-18 12:33:35,228] INFO: Initiating epoch #11 valid run on device rank=0
[2024-06-18 12:34:39,811] INFO: Rank 0: epoch=11 / 400 train_loss=10.6903 valid_loss=11.1310 stale=0 time=15.9m eta=6180.6m
[2024-06-18 12:34:39,860] INFO: Initiating epoch #12 train run on device rank=0
[2024-06-18 12:49:29,204] INFO: Initiating epoch #12 valid run on device rank=0
[2024-06-18 12:50:33,681] INFO: Rank 0: epoch=12 / 400 train_loss=10.5914 valid_loss=11.0159 stale=0 time=15.9m eta=6165.0m
[2024-06-18 12:50:33,780] INFO: Initiating epoch #13 train run on device rank=0
[2024-06-18 13:05:22,934] INFO: Initiating epoch #13 valid run on device rank=0
[2024-06-18 13:06:27,461] INFO: Rank 0: epoch=13 / 400 train_loss=10.4948 valid_loss=10.9026 stale=0 time=15.89m eta=6149.4m
[2024-06-18 13:06:27,469] INFO: Initiating epoch #14 train run on device rank=0
[2024-06-18 13:21:13,680] INFO: Initiating epoch #14 valid run on device rank=0
[2024-06-18 13:22:18,699] INFO: Rank 0: epoch=14 / 400 train_loss=10.4107 valid_loss=10.6974 stale=0 time=15.85m eta=6132.5m
[2024-06-18 13:22:18,867] INFO: Initiating epoch #15 train run on device rank=0
[2024-06-18 13:37:07,962] INFO: Initiating epoch #15 valid run on device rank=0
[2024-06-18 13:38:12,213] INFO: Rank 0: epoch=15 / 400 train_loss=10.3326 valid_loss=10.6139 stale=0 time=15.89m eta=6116.7m
[2024-06-18 13:38:12,221] INFO: Initiating epoch #16 train run on device rank=0
[2024-06-18 13:53:02,143] INFO: Initiating epoch #16 valid run on device rank=0
[2024-06-18 13:54:06,359] INFO: Rank 0: epoch=16 / 400 train_loss=10.2643 valid_loss=10.5394 stale=0 time=15.9m eta=6101.2m
[2024-06-18 13:54:06,373] INFO: Initiating epoch #17 train run on device rank=0
[2024-06-18 14:08:54,996] INFO: Initiating epoch #17 valid run on device rank=0
[2024-06-18 14:09:59,030] INFO: Rank 0: epoch=17 / 400 train_loss=10.2035 valid_loss=10.4403 stale=0 time=15.88m eta=6085.1m
[2024-06-18 14:09:59,057] INFO: Initiating epoch #18 train run on device rank=0
[2024-06-18 14:24:45,708] INFO: Initiating epoch #18 valid run on device rank=0
[2024-06-18 14:25:49,871] INFO: Rank 0: epoch=18 / 400 train_loss=10.1506 valid_loss=10.3820 stale=0 time=15.85m eta=6068.3m
[2024-06-18 14:25:49,927] INFO: Initiating epoch #19 train run on device rank=0
[2024-06-18 14:40:36,208] INFO: Initiating epoch #19 valid run on device rank=0
[2024-06-18 14:41:40,071] INFO: Rank 0: epoch=19 / 400 train_loss=10.1008 valid_loss=10.3142 stale=0 time=15.84m eta=6051.4m
[2024-06-18 14:41:40,085] INFO: Initiating epoch #20 train run on device rank=0
[2024-06-18 14:56:26,419] INFO: Initiating epoch #20 valid run on device rank=0
[2024-06-18 14:57:30,636] INFO: Rank 0: epoch=20 / 400 train_loss=10.0533 valid_loss=10.2582 stale=0 time=15.84m eta=6034.8m
[2024-06-18 14:57:30,686] INFO: Initiating epoch #21 train run on device rank=0
[2024-06-18 15:12:16,488] INFO: Initiating epoch #21 valid run on device rank=0
[2024-06-18 15:13:21,649] INFO: Rank 0: epoch=21 / 400 train_loss=10.0087 valid_loss=10.1821 stale=0 time=15.85m eta=6018.4m
[2024-06-18 15:13:21,846] INFO: Initiating epoch #22 train run on device rank=0
[2024-06-18 15:28:07,928] INFO: Initiating epoch #22 valid run on device rank=0
[2024-06-18 15:29:11,956] INFO: Rank 0: epoch=22 / 400 train_loss=9.9688 valid_loss=10.1340 stale=0 time=15.84m eta=6001.8m
[2024-06-18 15:29:11,999] INFO: Initiating epoch #23 train run on device rank=0
[2024-06-18 15:43:55,804] INFO: Initiating epoch #23 valid run on device rank=0
[2024-06-18 15:44:59,664] INFO: Rank 0: epoch=23 / 400 train_loss=9.9304 valid_loss=10.0960 stale=0 time=15.79m eta=5984.5m
[2024-06-18 15:44:59,682] INFO: Initiating epoch #24 train run on device rank=0
[2024-06-18 15:59:46,096] INFO: Initiating epoch #24 valid run on device rank=0
[2024-06-18 16:00:50,379] INFO: Rank 0: epoch=24 / 400 train_loss=9.8941 valid_loss=10.0721 stale=0 time=15.84m eta=5968.2m
[2024-06-18 16:00:50,439] INFO: Initiating epoch #25 train run on device rank=0
[2024-06-18 16:15:36,938] INFO: Initiating epoch #25 valid run on device rank=0
[2024-06-18 16:16:40,451] INFO: Rank 0: epoch=25 / 400 train_loss=9.8599 valid_loss=10.0742 stale=1 time=15.83m eta=5951.8m
[2024-06-18 16:16:40,544] INFO: Initiating epoch #26 train run on device rank=0
[2024-06-18 16:31:27,108] INFO: Initiating epoch #26 valid run on device rank=0
[2024-06-18 16:32:31,167] INFO: Rank 0: epoch=26 / 400 train_loss=9.8284 valid_loss=10.1069 stale=2 time=15.84m eta=5935.5m
[2024-06-18 16:32:31,280] INFO: Initiating epoch #27 train run on device rank=0
[2024-06-18 16:47:18,065] INFO: Initiating epoch #27 valid run on device rank=0
[2024-06-18 16:48:21,343] INFO: Rank 0: epoch=27 / 400 train_loss=9.7961 valid_loss=10.1102 stale=3 time=15.83m eta=5919.2m
[2024-06-18 16:48:21,368] INFO: Initiating epoch #28 train run on device rank=0
[2024-06-18 17:03:08,296] INFO: Initiating epoch #28 valid run on device rank=0
[2024-06-18 17:04:11,526] INFO: Rank 0: epoch=28 / 400 train_loss=9.7662 valid_loss=10.0960 stale=4 time=15.84m eta=5902.9m
[2024-06-18 17:04:11,541] INFO: Initiating epoch #29 train run on device rank=0
[2024-06-18 17:18:58,336] INFO: Initiating epoch #29 valid run on device rank=0
[2024-06-18 17:20:01,800] INFO: Rank 0: epoch=29 / 400 train_loss=9.7363 valid_loss=10.1177 stale=5 time=15.84m eta=5886.6m
[2024-06-18 17:20:01,810] INFO: Initiating epoch #30 train run on device rank=0
[2024-06-18 17:34:48,004] INFO: Initiating epoch #30 valid run on device rank=0
[2024-06-18 17:35:51,575] INFO: Rank 0: epoch=30 / 400 train_loss=9.7064 valid_loss=10.0964 stale=6 time=15.83m eta=5870.3m
[2024-06-18 17:35:51,646] INFO: Initiating epoch #31 train run on device rank=0
[2024-06-18 17:50:36,979] INFO: Initiating epoch #31 valid run on device rank=0
[2024-06-18 17:51:40,722] INFO: Rank 0: epoch=31 / 400 train_loss=9.6786 valid_loss=10.0532 stale=0 time=15.82m eta=5853.9m
[2024-06-18 17:51:40,857] INFO: Initiating epoch #32 train run on device rank=0
[2024-06-18 18:06:25,957] INFO: Initiating epoch #32 valid run on device rank=0
[2024-06-18 18:07:30,681] INFO: Rank 0: epoch=32 / 400 train_loss=9.6520 valid_loss=10.0067 stale=0 time=15.83m eta=5837.6m
[2024-06-18 18:07:30,770] INFO: Initiating epoch #33 train run on device rank=0
[2024-06-18 18:22:14,138] INFO: Initiating epoch #33 valid run on device rank=0
[2024-06-18 18:23:20,978] INFO: Rank 0: epoch=33 / 400 train_loss=9.6259 valid_loss=9.9709 stale=0 time=15.84m eta=5821.5m
[2024-06-18 18:23:21,280] INFO: Initiating epoch #34 train run on device rank=0
[2024-06-18 18:38:07,868] INFO: Initiating epoch #34 valid run on device rank=0
[2024-06-18 18:40:29,889] INFO: Rank 0: epoch=34 / 400 train_loss=9.6018 valid_loss=9.9533 stale=0 time=17.14m eta=5819.5m
[2024-06-18 18:40:34,048] INFO: Initiating epoch #35 train run on device rank=0
[2024-06-18 18:55:18,312] INFO: Initiating epoch #35 valid run on device rank=0
[2024-06-18 18:56:22,990] INFO: Rank 0: epoch=35 / 400 train_loss=9.5786 valid_loss=9.9211 stale=0 time=15.82m eta=5803.4m
[2024-06-18 18:56:23,039] INFO: Initiating epoch #36 train run on device rank=0
[2024-06-18 19:11:09,834] INFO: Initiating epoch #36 valid run on device rank=0
[2024-06-18 19:12:16,263] INFO: Rank 0: epoch=36 / 400 train_loss=9.5562 valid_loss=9.9044 stale=0 time=15.89m eta=5787.4m
[2024-06-18 19:12:16,358] INFO: Initiating epoch #37 train run on device rank=0
[2024-06-18 19:27:03,302] INFO: Initiating epoch #37 valid run on device rank=0
[2024-06-18 19:28:19,818] INFO: Rank 0: epoch=37 / 400 train_loss=9.5348 valid_loss=9.9008 stale=0 time=16.06m eta=5773.1m
[2024-06-18 19:28:20,122] INFO: Initiating epoch #38 train run on device rank=0
[2024-06-18 19:43:05,136] INFO: Initiating epoch #38 valid run on device rank=0
[2024-06-18 19:44:09,059] INFO: Rank 0: epoch=38 / 400 train_loss=9.5122 valid_loss=9.8928 stale=0 time=15.82m eta=5756.4m
[2024-06-18 19:44:09,076] INFO: Initiating epoch #39 train run on device rank=0
[2024-06-18 19:58:51,746] INFO: Initiating epoch #39 valid run on device rank=0
[2024-06-18 19:59:56,724] INFO: Rank 0: epoch=39 / 400 train_loss=9.4891 valid_loss=9.8812 stale=0 time=15.79m eta=5739.5m
[2024-06-18 19:59:56,749] INFO: Initiating epoch #40 train run on device rank=0
[2024-06-18 20:14:42,310] INFO: Initiating epoch #40 valid run on device rank=0
[2024-06-18 20:15:56,163] INFO: Rank 0: epoch=40 / 400 train_loss=9.4677 valid_loss=9.8717 stale=0 time=15.99m eta=5724.4m
[2024-06-18 20:15:56,213] INFO: Initiating epoch #41 train run on device rank=0
[2024-06-18 20:30:42,788] INFO: Initiating epoch #41 valid run on device rank=0
[2024-06-18 20:31:49,842] INFO: Rank 0: epoch=41 / 400 train_loss=9.4451 valid_loss=9.8680 stale=0 time=15.89m eta=5708.5m
[2024-06-18 20:31:49,890] INFO: Initiating epoch #42 train run on device rank=0
[2024-06-18 20:46:37,193] INFO: Initiating epoch #42 valid run on device rank=0
[2024-06-18 20:50:00,769] INFO: Rank 0: epoch=42 / 400 train_loss=9.4228 valid_loss=9.8616 stale=0 time=18.18m eta=5712.0m
[2024-06-18 20:50:31,234] INFO: Initiating epoch #43 train run on device rank=0
[2024-06-18 21:05:16,007] INFO: Initiating epoch #43 valid run on device rank=0
[2024-06-18 21:07:20,846] INFO: Rank 0: epoch=43 / 400 train_loss=9.3977 valid_loss=9.8687 stale=1 time=16.83m eta=5707.5m
[2024-06-18 21:07:30,091] INFO: Initiating epoch #44 train run on device rank=0
[2024-06-18 21:22:15,276] INFO: Initiating epoch #44 valid run on device rank=0
[2024-06-18 21:23:34,097] INFO: Rank 0: epoch=44 / 400 train_loss=9.3747 valid_loss=9.8461 stale=0 time=16.07m eta=5693.4m
[2024-06-18 21:23:36,242] INFO: Initiating epoch #45 train run on device rank=0
[2024-06-18 21:38:18,873] INFO: Initiating epoch #45 valid run on device rank=0
[2024-06-18 21:39:47,850] INFO: Rank 0: epoch=45 / 400 train_loss=9.3506 valid_loss=9.8486 stale=1 time=16.19m eta=5679.3m
[2024-06-18 21:39:51,747] INFO: Initiating epoch #46 train run on device rank=0
[2024-06-18 21:54:37,774] INFO: Initiating epoch #46 valid run on device rank=0
[2024-06-18 21:55:57,568] INFO: Rank 0: epoch=46 / 400 train_loss=9.3269 valid_loss=9.8247 stale=0 time=16.1m eta=5664.5m
[2024-06-18 21:55:58,727] INFO: Initiating epoch #47 train run on device rank=0
[2024-06-18 22:10:44,691] INFO: Initiating epoch #47 valid run on device rank=0
[2024-06-18 22:12:08,536] INFO: Rank 0: epoch=47 / 400 train_loss=9.3012 valid_loss=9.8187 stale=0 time=16.16m eta=5649.9m
[2024-06-18 22:12:11,294] INFO: Initiating epoch #48 train run on device rank=0
[2024-06-18 22:26:57,251] INFO: Initiating epoch #48 valid run on device rank=0
[2024-06-18 22:28:04,889] INFO: Rank 0: epoch=48 / 400 train_loss=9.2770 valid_loss=9.8037 stale=0 time=15.89m eta=5633.4m
[2024-06-18 22:28:04,914] INFO: Initiating epoch #49 train run on device rank=0
[2024-06-18 22:42:51,429] INFO: Initiating epoch #49 valid run on device rank=0
[2024-06-18 22:43:55,702] INFO: Rank 0: epoch=49 / 400 train_loss=9.2521 valid_loss=9.7994 stale=0 time=15.85m eta=5616.3m
[2024-06-18 22:43:55,725] INFO: Initiating epoch #50 train run on device rank=0
[2024-06-18 22:58:38,903] INFO: Initiating epoch #50 valid run on device rank=0
[2024-06-18 22:59:42,967] INFO: Rank 0: epoch=50 / 400 train_loss=9.2265 valid_loss=9.7943 stale=0 time=15.79m eta=5598.8m
[2024-06-18 22:59:43,012] INFO: Initiating epoch #51 train run on device rank=0
[2024-06-18 23:14:29,707] INFO: Initiating epoch #51 valid run on device rank=0
[2024-06-18 23:15:36,055] INFO: Rank 0: epoch=51 / 400 train_loss=9.2024 valid_loss=9.7852 stale=0 time=15.88m eta=5582.0m
[2024-06-18 23:15:36,072] INFO: Initiating epoch #52 train run on device rank=0
[2024-06-18 23:30:22,923] INFO: Initiating epoch #52 valid run on device rank=0
[2024-06-18 23:31:31,767] INFO: Rank 0: epoch=52 / 400 train_loss=9.1799 valid_loss=9.7988 stale=1 time=15.93m eta=5565.6m
[2024-06-18 23:31:32,636] INFO: Initiating epoch #53 train run on device rank=0
[2024-06-18 23:46:19,056] INFO: Initiating epoch #53 valid run on device rank=0
[2024-06-18 23:48:46,034] INFO: Rank 0: epoch=53 / 400 train_loss=9.1548 valid_loss=9.7866 stale=2 time=17.22m eta=5557.7m
[2024-06-18 23:48:51,511] INFO: Initiating epoch #54 train run on device rank=0
[2024-06-19 00:03:36,317] INFO: Initiating epoch #54 valid run on device rank=0
[2024-06-19 00:05:27,677] INFO: Rank 0: epoch=54 / 400 train_loss=9.1310 valid_loss=9.7955 stale=3 time=16.6m eta=5546.1m
[2024-06-19 00:05:33,674] INFO: Initiating epoch #55 train run on device rank=0
[2024-06-19 00:20:17,423] INFO: Initiating epoch #55 valid run on device rank=0
[2024-06-19 00:22:15,218] INFO: Rank 0: epoch=55 / 400 train_loss=9.1080 valid_loss=9.7997 stale=4 time=16.69m eta=5534.8m
[2024-06-19 00:22:15,931] INFO: Initiating epoch #56 train run on device rank=0
[2024-06-19 00:36:59,629] INFO: Initiating epoch #56 valid run on device rank=0
[2024-06-19 00:39:07,721] INFO: Rank 0: epoch=56 / 400 train_loss=9.0818 valid_loss=9.8014 stale=5 time=16.86m eta=5523.9m
[2024-06-19 00:39:08,457] INFO: Initiating epoch #57 train run on device rank=0
[2024-06-19 00:53:53,742] INFO: Initiating epoch #57 valid run on device rank=0
[2024-06-19 00:55:29,620] INFO: Rank 0: epoch=57 / 400 train_loss=9.0568 valid_loss=9.8068 stale=6 time=16.35m eta=5509.7m
[2024-06-19 00:55:32,924] INFO: Initiating epoch #58 train run on device rank=0
[2024-06-19 01:10:18,300] INFO: Initiating epoch #58 valid run on device rank=0
[2024-06-19 01:12:00,058] INFO: Rank 0: epoch=58 / 400 train_loss=9.0292 valid_loss=9.8176 stale=7 time=16.45m eta=5496.2m
[2024-06-19 01:12:04,329] INFO: Initiating epoch #59 train run on device rank=0
[2024-06-19 01:26:49,602] INFO: Initiating epoch #59 valid run on device rank=0
[2024-06-19 01:28:00,973] INFO: Rank 0: epoch=59 / 400 train_loss=9.0070 valid_loss=9.8242 stale=8 time=15.94m eta=5479.8m
[2024-06-19 01:28:01,752] INFO: Initiating epoch #60 train run on device rank=0
[2024-06-19 01:42:44,926] INFO: Initiating epoch #60 valid run on device rank=0
[2024-06-19 01:43:56,097] INFO: Rank 0: epoch=60 / 400 train_loss=8.9799 valid_loss=9.8287 stale=9 time=15.91m eta=5462.9m
[2024-06-19 01:43:57,749] INFO: Initiating epoch #61 train run on device rank=0
[2024-06-19 01:58:43,805] INFO: Initiating epoch #61 valid run on device rank=0
[2024-06-19 01:59:56,143] INFO: Rank 0: epoch=61 / 400 train_loss=8.9530 valid_loss=9.8622 stale=10 time=15.97m eta=5446.5m
[2024-06-19 01:59:58,259] INFO: Initiating epoch #62 train run on device rank=0
[2024-06-19 02:14:44,657] INFO: Initiating epoch #62 valid run on device rank=0
[2024-06-19 02:15:54,744] INFO: Rank 0: epoch=62 / 400 train_loss=8.9276 valid_loss=9.8750 stale=11 time=15.94m eta=5429.9m
[2024-06-19 02:15:56,301] INFO: Initiating epoch #63 train run on device rank=0
[2024-06-19 02:30:42,775] INFO: Initiating epoch #63 valid run on device rank=0
[2024-06-19 02:31:59,145] INFO: Rank 0: epoch=63 / 400 train_loss=8.9041 valid_loss=9.9060 stale=12 time=16.05m eta=5413.9m
[2024-06-19 02:32:01,603] INFO: Initiating epoch #64 train run on device rank=0
[2024-06-19 02:46:47,630] INFO: Initiating epoch #64 valid run on device rank=0
[2024-06-19 02:47:59,765] INFO: Rank 0: epoch=64 / 400 train_loss=8.8819 valid_loss=9.9374 stale=13 time=15.97m eta=5397.6m
[2024-06-19 02:48:00,960] INFO: Initiating epoch #65 train run on device rank=0
[2024-06-19 03:02:44,101] INFO: Initiating epoch #65 valid run on device rank=0
[2024-06-19 03:03:57,320] INFO: Rank 0: epoch=65 / 400 train_loss=8.8525 valid_loss=9.9654 stale=14 time=15.94m eta=5381.0m
[2024-06-19 03:03:58,557] INFO: Initiating epoch #66 train run on device rank=0
[2024-06-19 03:18:44,708] INFO: Initiating epoch #66 valid run on device rank=0
[2024-06-19 03:19:53,758] INFO: Rank 0: epoch=66 / 400 train_loss=8.8262 valid_loss=9.9843 stale=15 time=15.92m eta=5364.3m
[2024-06-19 03:19:54,525] INFO: Initiating epoch #67 train run on device rank=0
[2024-06-19 03:34:41,180] INFO: Initiating epoch #67 valid run on device rank=0
[2024-06-19 03:35:49,731] INFO: Rank 0: epoch=67 / 400 train_loss=8.7938 valid_loss=10.0006 stale=16 time=15.92m eta=5347.6m
[2024-06-19 03:35:50,515] INFO: Initiating epoch #68 train run on device rank=0
[2024-06-19 03:50:36,996] INFO: Initiating epoch #68 valid run on device rank=0
[2024-06-19 03:51:43,949] INFO: Rank 0: epoch=68 / 400 train_loss=8.7657 valid_loss=9.9993 stale=17 time=15.89m eta=5330.8m
[2024-06-19 03:51:44,515] INFO: Initiating epoch #69 train run on device rank=0
[2024-06-19 04:06:30,937] INFO: Initiating epoch #69 valid run on device rank=0
[2024-06-19 04:07:39,419] INFO: Rank 0: epoch=69 / 400 train_loss=8.7393 valid_loss=10.0707 stale=18 time=15.92m eta=5314.1m
[2024-06-19 04:07:40,039] INFO: Initiating epoch #70 train run on device rank=0
[2024-06-19 04:22:24,182] INFO: Initiating epoch #70 valid run on device rank=0
[2024-06-19 04:23:30,966] INFO: Rank 0: epoch=70 / 400 train_loss=8.7168 valid_loss=10.1363 stale=19 time=15.85m eta=5297.1m
[2024-06-19 04:23:31,537] INFO: Initiating epoch #71 train run on device rank=0
[2024-06-19 04:38:17,583] INFO: Initiating epoch #71 valid run on device rank=0
[2024-06-19 04:39:24,320] INFO: Rank 0: epoch=71 / 400 train_loss=8.6869 valid_loss=10.1482 stale=20 time=15.88m eta=5280.3m
[2024-06-19 04:39:24,677] INFO: Initiating epoch #72 train run on device rank=0
[2024-06-19 04:54:11,269] INFO: Initiating epoch #72 valid run on device rank=0
[2024-06-19 04:55:19,050] INFO: Done with training. Total training time on device 0 is 1155.427min