[2024-08-26 13:46:19,547] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 2 gpus
[2024-08-26 13:46:19,634] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-26 13:46:19,634] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-26 13:46:28,040] INFO: using dtype=torch.float32
[2024-08-26 13:46:28,376] INFO: using attention_type=math
[2024-08-26 13:46:28,394] INFO: using attention_type=math
[2024-08-26 13:46:28,413] INFO: using attention_type=math
[2024-08-26 13:46:28,432] INFO: using attention_type=math
[2024-08-26 13:46:28,451] INFO: using attention_type=math
[2024-08-26 13:46:28,470] INFO: using attention_type=math
[2024-08-26 13:46:31,316] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-26 13:46:31,318] INFO: Trainable parameters: 11671568
[2024-08-26 13:46:31,318] INFO: Non-trainable parameters: 0
[2024-08-26 13:46:31,318] INFO: Total parameters: 11671568
[2024-08-26 13:46:31,323] INFO: Modules Trainable parameters Non-tranable parameters
module.nn0_id.0.weight 8704 0
module.nn0_id.0.bias 512 0
module.nn0_id.2.weight 512 0
module.nn0_id.2.bias 512 0
module.nn0_id.4.weight 262144 0
module.nn0_id.4.bias 512 0
module.nn0_reg.0.weight 8704 0
module.nn0_reg.0.bias 512 0
module.nn0_reg.2.weight 512 0
module.nn0_reg.2.bias 512 0
module.nn0_reg.4.weight 262144 0
module.nn0_reg.4.bias 512 0
module.conv_id.0.mha.in_proj_weight 786432 0
module.conv_id.0.mha.in_proj_bias 1536 0
module.conv_id.0.mha.out_proj.weight 262144 0
module.conv_id.0.mha.out_proj.bias 512 0
module.conv_id.0.norm0.weight 512 0
module.conv_id.0.norm0.bias 512 0
module.conv_id.0.norm1.weight 512 0
module.conv_id.0.norm1.bias 512 0
module.conv_id.0.seq.0.weight 262144 0
module.conv_id.0.seq.0.bias 512 0
module.conv_id.0.seq.2.weight 262144 0
module.conv_id.0.seq.2.bias 512 0
module.conv_id.1.mha.in_proj_weight 786432 0
module.conv_id.1.mha.in_proj_bias 1536 0
module.conv_id.1.mha.out_proj.weight 262144 0
module.conv_id.1.mha.out_proj.bias 512 0
module.conv_id.1.norm0.weight 512 0
module.conv_id.1.norm0.bias 512 0
module.conv_id.1.norm1.weight 512 0
module.conv_id.1.norm1.bias 512 0
module.conv_id.1.seq.0.weight 262144 0
module.conv_id.1.seq.0.bias 512 0
module.conv_id.1.seq.2.weight 262144 0
module.conv_id.1.seq.2.bias 512 0
module.conv_id.2.mha.in_proj_weight 786432 0
module.conv_id.2.mha.in_proj_bias 1536 0
module.conv_id.2.mha.out_proj.weight 262144 0
module.conv_id.2.mha.out_proj.bias 512 0
module.conv_id.2.norm0.weight 512 0
module.conv_id.2.norm0.bias 512 0
module.conv_id.2.norm1.weight 512 0
module.conv_id.2.norm1.bias 512 0
module.conv_id.2.seq.0.weight 262144 0
module.conv_id.2.seq.0.bias 512 0
module.conv_id.2.seq.2.weight 262144 0
module.conv_id.2.seq.2.bias 512 0
module.conv_reg.0.mha.in_proj_weight 786432 0
module.conv_reg.0.mha.in_proj_bias 1536 0
module.conv_reg.0.mha.out_proj.weight 262144 0
module.conv_reg.0.mha.out_proj.bias 512 0
module.conv_reg.0.norm0.weight 512 0
module.conv_reg.0.norm0.bias 512 0
module.conv_reg.0.norm1.weight 512 0
module.conv_reg.0.norm1.bias 512 0
module.conv_reg.0.seq.0.weight 262144 0
module.conv_reg.0.seq.0.bias 512 0
module.conv_reg.0.seq.2.weight 262144 0
module.conv_reg.0.seq.2.bias 512 0
module.conv_reg.1.mha.in_proj_weight 786432 0
module.conv_reg.1.mha.in_proj_bias 1536 0
module.conv_reg.1.mha.out_proj.weight 262144 0
module.conv_reg.1.mha.out_proj.bias 512 0
module.conv_reg.1.norm0.weight 512 0
module.conv_reg.1.norm0.bias 512 0
module.conv_reg.1.norm1.weight 512 0
module.conv_reg.1.norm1.bias 512 0
module.conv_reg.1.seq.0.weight 262144 0
module.conv_reg.1.seq.0.bias 512 0
module.conv_reg.1.seq.2.weight 262144 0
module.conv_reg.1.seq.2.bias 512 0
module.conv_reg.2.mha.in_proj_weight 786432 0
module.conv_reg.2.mha.in_proj_bias 1536 0
module.conv_reg.2.mha.out_proj.weight 262144 0
module.conv_reg.2.mha.out_proj.bias 512 0
module.conv_reg.2.norm0.weight 512 0
module.conv_reg.2.norm0.bias 512 0
module.conv_reg.2.norm1.weight 512 0
module.conv_reg.2.norm1.bias 512 0
module.conv_reg.2.seq.0.weight 262144 0
module.conv_reg.2.seq.0.bias 512 0
module.conv_reg.2.seq.2.weight 262144 0
module.conv_reg.2.seq.2.bias 512 0
module.nn_id.0.weight 270848 0
module.nn_id.0.bias 512 0
module.nn_id.2.weight 512 0
module.nn_id.2.bias 512 0
module.nn_id.4.weight 3072 0
module.nn_id.4.bias 6 0
module.nn_pt.nn.0.weight 273920 0
module.nn_pt.nn.0.bias 512 0
module.nn_pt.nn.2.weight 512 0
module.nn_pt.nn.2.bias 512 0
module.nn_pt.nn.4.weight 1024 0
module.nn_pt.nn.4.bias 2 0
module.nn_eta.nn.0.weight 273920 0
module.nn_eta.nn.0.bias 512 0
module.nn_eta.nn.2.weight 512 0
module.nn_eta.nn.2.bias 512 0
module.nn_eta.nn.4.weight 1024 0
module.nn_eta.nn.4.bias 2 0
module.nn_sin_phi.nn.0.weight 273920 0
module.nn_sin_phi.nn.0.bias 512 0
module.nn_sin_phi.nn.2.weight 512 0
module.nn_sin_phi.nn.2.bias 512 0
module.nn_sin_phi.nn.4.weight 1024 0
module.nn_sin_phi.nn.4.bias 2 0
module.nn_cos_phi.nn.0.weight 273920 0
module.nn_cos_phi.nn.0.bias 512 0
module.nn_cos_phi.nn.2.weight 512 0
module.nn_cos_phi.nn.2.bias 512 0
module.nn_cos_phi.nn.4.weight 1024 0
module.nn_cos_phi.nn.4.bias 2 0
module.nn_energy.nn.0.weight 273920 0
module.nn_energy.nn.0.bias 512 0
module.nn_energy.nn.2.weight 512 0
module.nn_energy.nn.2.bias 512 0
module.nn_energy.nn.4.weight 1024 0
module.nn_energy.nn.4.bias 2 0
[2024-08-26 13:46:31,339] INFO: Creating experiment dir /pfvol/experiments/Aug26_CLD_fromscratch_80k_pyg-cld_20240826_134619_311257
[2024-08-26 13:46:31,339] INFO: Model directory /pfvol/experiments/Aug26_CLD_fromscratch_80k_pyg-cld_20240826_134619_311257
[2024-08-26 13:46:31,367] INFO: train_dataset: cld_edm_ttbar_pf, 80000
[2024-08-26 13:46:31,412] INFO: valid_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 13:46:31,517] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-26 13:52:14,774] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-26 13:52:24,932] INFO: Rank 0: epoch=1 / 100 train_loss=31.5252 valid_loss=28.4126 stale=0 time=5.89m eta=583.1m
[2024-08-26 13:52:24,960] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-26 13:57:58,490] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-26 13:58:07,397] INFO: Rank 0: epoch=2 / 100 train_loss=27.0772 valid_loss=26.4151 stale=0 time=5.71m eta=568.3m
[2024-08-26 13:58:08,168] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-26 14:03:35,793] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-26 14:03:42,723] INFO: Rank 0: epoch=3 / 100 train_loss=25.0751 valid_loss=24.4869 stale=0 time=5.58m eta=555.7m
[2024-08-26 14:03:43,447] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-26 14:09:09,889] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-26 14:09:14,864] INFO: Rank 0: epoch=4 / 100 train_loss=23.2690 valid_loss=22.6528 stale=0 time=5.52m eta=545.3m
[2024-08-26 14:09:15,105] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-26 14:14:41,808] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-26 14:14:46,262] INFO: Rank 0: epoch=5 / 100 train_loss=21.9402 valid_loss=21.6029 stale=0 time=5.52m eta=536.7m
[2024-08-26 14:14:46,322] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-26 14:20:12,916] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-26 14:20:17,965] INFO: Rank 0: epoch=6 / 100 train_loss=21.0715 valid_loss=20.8639 stale=0 time=5.53m eta=529.1m
[2024-08-26 14:20:18,063] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-26 14:25:44,092] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-26 14:25:49,399] INFO: Rank 0: epoch=7 / 100 train_loss=20.3276 valid_loss=20.3257 stale=0 time=5.52m eta=522.1m
[2024-08-26 14:25:49,647] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-26 14:31:16,075] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-26 14:31:20,626] INFO: Rank 0: epoch=8 / 100 train_loss=19.6861 valid_loss=19.5670 stale=0 time=5.52m eta=515.4m
[2024-08-26 14:31:20,785] INFO: Initiating epoch #9 train run on device rank=0
[2024-08-26 14:36:50,373] INFO: Initiating epoch #9 valid run on device rank=0
[2024-08-26 14:36:55,729] INFO: Rank 0: epoch=9 / 100 train_loss=19.2027 valid_loss=19.2089 stale=0 time=5.58m eta=509.6m
[2024-08-26 14:36:55,802] INFO: Initiating epoch #10 train run on device rank=0
[2024-08-26 14:42:22,665] INFO: Initiating epoch #10 valid run on device rank=0
[2024-08-26 14:42:26,994] INFO: Rank 0: epoch=10 / 100 train_loss=18.8127 valid_loss=19.0446 stale=0 time=5.52m eta=503.3m
[2024-08-26 14:42:27,211] INFO: Initiating epoch #11 train run on device rank=0
[2024-08-26 14:47:55,427] INFO: Initiating epoch #11 valid run on device rank=0
[2024-08-26 14:47:59,951] INFO: Rank 0: epoch=11 / 100 train_loss=18.4728 valid_loss=18.6214 stale=0 time=5.55m eta=497.4m
[2024-08-26 14:48:00,083] INFO: Initiating epoch #12 train run on device rank=0
[2024-08-26 14:53:27,838] INFO: Initiating epoch #12 valid run on device rank=0
[2024-08-26 14:53:32,676] INFO: Rank 0: epoch=12 / 100 train_loss=18.1662 valid_loss=18.5318 stale=0 time=5.54m eta=491.5m
[2024-08-26 14:53:32,946] INFO: Initiating epoch #13 train run on device rank=0
[2024-08-26 14:59:00,046] INFO: Initiating epoch #13 valid run on device rank=0
[2024-08-26 14:59:04,743] INFO: Rank 0: epoch=13 / 100 train_loss=17.8834 valid_loss=18.3358 stale=0 time=5.53m eta=485.6m
[2024-08-26 14:59:05,006] INFO: Initiating epoch #14 train run on device rank=0
[2024-08-26 15:04:33,359] INFO: Initiating epoch #14 valid run on device rank=0
[2024-08-26 15:04:41,341] INFO: Rank 0: epoch=14 / 100 train_loss=17.6372 valid_loss=18.2157 stale=0 time=5.61m eta=480.1m
[2024-08-26 15:04:42,284] INFO: Initiating epoch #15 train run on device rank=0
[2024-08-26 15:10:07,224] INFO: Initiating epoch #15 valid run on device rank=0
[2024-08-26 15:10:17,457] INFO: Rank 0: epoch=15 / 100 train_loss=17.4008 valid_loss=18.1525 stale=0 time=5.59m eta=474.7m
[2024-08-26 15:10:19,164] INFO: Initiating epoch #16 train run on device rank=0
[2024-08-26 15:15:44,538] INFO: Initiating epoch #16 valid run on device rank=0
[2024-08-26 15:15:53,416] INFO: Rank 0: epoch=16 / 100 train_loss=17.1490 valid_loss=18.0946 stale=0 time=5.57m eta=469.2m
[2024-08-26 15:15:54,880] INFO: Initiating epoch #17 train run on device rank=0
[2024-08-26 15:21:19,509] INFO: Initiating epoch #17 valid run on device rank=0
[2024-08-26 15:21:28,874] INFO: Rank 0: epoch=17 / 100 train_loss=16.9186 valid_loss=17.9704 stale=0 time=5.57m eta=463.6m
[2024-08-26 15:21:30,131] INFO: Initiating epoch #18 train run on device rank=0
[2024-08-26 15:26:54,332] INFO: Initiating epoch #18 valid run on device rank=0
[2024-08-26 15:27:02,677] INFO: Rank 0: epoch=18 / 100 train_loss=16.7217 valid_loss=17.6315 stale=0 time=5.54m eta=457.9m
[2024-08-26 15:27:04,142] INFO: Initiating epoch #19 train run on device rank=0
[2024-08-26 15:32:29,105] INFO: Initiating epoch #19 valid run on device rank=0
[2024-08-26 15:32:35,555] INFO: Rank 0: epoch=19 / 100 train_loss=16.5337 valid_loss=17.6353 stale=1 time=5.52m eta=452.2m
[2024-08-26 15:32:37,024] INFO: Initiating epoch #20 train run on device rank=0
[2024-08-26 15:38:02,152] INFO: Initiating epoch #20 valid run on device rank=0
[2024-08-26 15:38:09,161] INFO: Rank 0: epoch=20 / 100 train_loss=16.3611 valid_loss=17.6423 stale=2 time=5.54m eta=446.5m
[2024-08-26 15:38:10,342] INFO: Initiating epoch #21 train run on device rank=0
[2024-08-26 15:43:37,612] INFO: Initiating epoch #21 valid run on device rank=0
[2024-08-26 15:43:46,841] INFO: Rank 0: epoch=21 / 100 train_loss=16.1907 valid_loss=17.5625 stale=0 time=5.61m eta=441.1m
[2024-08-26 15:43:48,386] INFO: Initiating epoch #22 train run on device rank=0
[2024-08-26 15:49:17,917] INFO: Initiating epoch #22 valid run on device rank=0
[2024-08-26 15:49:26,332] INFO: Rank 0: epoch=22 / 100 train_loss=16.0327 valid_loss=17.5018 stale=0 time=5.63m eta=435.8m
[2024-08-26 15:49:27,902] INFO: Initiating epoch #23 train run on device rank=0
[2024-08-26 15:54:56,943] INFO: Initiating epoch #23 valid run on device rank=0
[2024-08-26 15:55:06,103] INFO: Rank 0: epoch=23 / 100 train_loss=15.8653 valid_loss=17.4836 stale=0 time=5.64m eta=430.5m
[2024-08-26 15:55:07,611] INFO: Initiating epoch #24 train run on device rank=0
[2024-08-26 16:00:37,576] INFO: Initiating epoch #24 valid run on device rank=0
[2024-08-26 16:00:43,529] INFO: Rank 0: epoch=24 / 100 train_loss=15.7363 valid_loss=17.4849 stale=1 time=5.6m eta=425.0m
[2024-08-26 16:00:45,041] INFO: Initiating epoch #25 train run on device rank=0
[2024-08-26 16:06:13,995] INFO: Initiating epoch #25 valid run on device rank=0
[2024-08-26 16:06:25,002] INFO: Rank 0: epoch=25 / 100 train_loss=15.5849 valid_loss=17.4256 stale=0 time=5.67m eta=419.7m
[2024-08-26 16:06:26,374] INFO: Initiating epoch #26 train run on device rank=0
[2024-08-26 16:11:53,829] INFO: Initiating epoch #26 valid run on device rank=0
[2024-08-26 16:12:01,889] INFO: Rank 0: epoch=26 / 100 train_loss=15.4507 valid_loss=17.5673 stale=1 time=5.59m eta=414.1m
[2024-08-26 16:12:03,598] INFO: Initiating epoch #27 train run on device rank=0
[2024-08-26 16:17:32,255] INFO: Initiating epoch #27 valid run on device rank=0
[2024-08-26 16:17:39,198] INFO: Rank 0: epoch=27 / 100 train_loss=15.3196 valid_loss=17.5480 stale=2 time=5.59m eta=408.6m
[2024-08-26 16:17:40,628] INFO: Initiating epoch #28 train run on device rank=0
[2024-08-26 16:23:06,878] INFO: Initiating epoch #28 valid run on device rank=0
[2024-08-26 16:23:13,100] INFO: Rank 0: epoch=28 / 100 train_loss=15.1848 valid_loss=17.8035 stale=3 time=5.54m eta=402.9m
[2024-08-26 16:23:14,540] INFO: Initiating epoch #29 train run on device rank=0
[2024-08-26 16:28:40,921] INFO: Initiating epoch #29 valid run on device rank=0
[2024-08-26 16:28:46,960] INFO: Rank 0: epoch=29 / 100 train_loss=15.0539 valid_loss=17.8653 stale=4 time=5.54m eta=397.3m
[2024-08-26 16:28:48,557] INFO: Initiating epoch #30 train run on device rank=0
[2024-08-26 16:34:15,366] INFO: Initiating epoch #30 valid run on device rank=0
[2024-08-26 16:34:21,310] INFO: Rank 0: epoch=30 / 100 train_loss=14.9556 valid_loss=17.9474 stale=5 time=5.55m eta=391.6m
[2024-08-26 16:34:22,672] INFO: Initiating epoch #31 train run on device rank=0
[2024-08-26 16:39:50,574] INFO: Initiating epoch #31 valid run on device rank=0
[2024-08-26 16:39:56,522] INFO: Rank 0: epoch=31 / 100 train_loss=14.8113 valid_loss=18.0633 stale=6 time=5.56m eta=386.0m
[2024-08-26 16:39:57,435] INFO: Initiating epoch #32 train run on device rank=0
[2024-08-26 16:45:23,035] INFO: Initiating epoch #32 valid run on device rank=0
[2024-08-26 16:45:28,806] INFO: Rank 0: epoch=32 / 100 train_loss=14.6932 valid_loss=18.2200 stale=7 time=5.52m eta=380.3m
[2024-08-26 16:45:29,990] INFO: Initiating epoch #33 train run on device rank=0
[2024-08-26 16:50:54,834] INFO: Initiating epoch #33 valid run on device rank=0
[2024-08-26 16:51:01,411] INFO: Rank 0: epoch=33 / 100 train_loss=14.5657 valid_loss=18.2921 stale=8 time=5.52m eta=374.6m
[2024-08-26 16:51:02,400] INFO: Initiating epoch #34 train run on device rank=0
[2024-08-26 16:56:27,960] INFO: Initiating epoch #34 valid run on device rank=0
[2024-08-26 16:56:34,502] INFO: Rank 0: epoch=34 / 100 train_loss=14.4666 valid_loss=18.6517 stale=9 time=5.54m eta=368.9m
[2024-08-26 16:56:35,251] INFO: Initiating epoch #35 train run on device rank=0
[2024-08-26 17:02:02,626] INFO: Initiating epoch #35 valid run on device rank=0
[2024-08-26 17:02:07,790] INFO: Rank 0: epoch=35 / 100 train_loss=14.3454 valid_loss=18.6617 stale=10 time=5.54m eta=363.3m
[2024-08-26 17:02:08,814] INFO: Initiating epoch #36 train run on device rank=0
[2024-08-26 17:07:34,438] INFO: Initiating epoch #36 valid run on device rank=0
[2024-08-26 17:07:41,153] INFO: Rank 0: epoch=36 / 100 train_loss=14.2284 valid_loss=18.7770 stale=11 time=5.54m eta=357.6m
[2024-08-26 17:07:42,792] INFO: Initiating epoch #37 train run on device rank=0
[2024-08-26 17:13:07,676] INFO: Initiating epoch #37 valid run on device rank=0
[2024-08-26 17:13:13,659] INFO: Rank 0: epoch=37 / 100 train_loss=14.1235 valid_loss=18.9439 stale=12 time=5.51m eta=352.0m
[2024-08-26 17:13:14,939] INFO: Initiating epoch #38 train run on device rank=0
[2024-08-26 17:18:40,747] INFO: Initiating epoch #38 valid run on device rank=0
[2024-08-26 17:18:46,612] INFO: Rank 0: epoch=38 / 100 train_loss=13.9873 valid_loss=18.8963 stale=13 time=5.53m eta=346.3m
[2024-08-26 17:18:47,798] INFO: Initiating epoch #39 train run on device rank=0
[2024-08-26 17:24:12,445] INFO: Initiating epoch #39 valid run on device rank=0
[2024-08-26 17:24:18,421] INFO: Rank 0: epoch=39 / 100 train_loss=13.8784 valid_loss=19.1136 stale=14 time=5.51m eta=340.6m
[2024-08-26 17:24:19,855] INFO: Initiating epoch #40 train run on device rank=0
[2024-08-26 17:29:43,878] INFO: Initiating epoch #40 valid run on device rank=0
[2024-08-26 17:29:50,408] INFO: Rank 0: epoch=40 / 100 train_loss=13.7584 valid_loss=18.9927 stale=15 time=5.51m eta=335.0m
[2024-08-26 17:29:51,664] INFO: Initiating epoch #41 train run on device rank=0
[2024-08-26 17:35:17,015] INFO: Initiating epoch #41 valid run on device rank=0
[2024-08-26 17:35:21,984] INFO: Rank 0: epoch=41 / 100 train_loss=13.6636 valid_loss=19.2629 stale=16 time=5.51m eta=329.3m
[2024-08-26 17:35:22,934] INFO: Initiating epoch #42 train run on device rank=0
[2024-08-26 17:40:49,027] INFO: Initiating epoch #42 valid run on device rank=0
[2024-08-26 17:40:54,867] INFO: Rank 0: epoch=42 / 100 train_loss=13.5666 valid_loss=19.0788 stale=17 time=5.53m eta=323.7m
[2024-08-26 17:40:55,775] INFO: Initiating epoch #43 train run on device rank=0
[2024-08-26 17:46:21,733] INFO: Initiating epoch #43 valid run on device rank=0
[2024-08-26 17:46:27,581] INFO: Rank 0: epoch=43 / 100 train_loss=13.4464 valid_loss=19.3210 stale=18 time=5.53m eta=318.1m
[2024-08-26 17:46:28,598] INFO: Initiating epoch #44 train run on device rank=0
[2024-08-26 17:51:55,883] INFO: Initiating epoch #44 valid run on device rank=0
[2024-08-26 17:52:01,760] INFO: Rank 0: epoch=44 / 100 train_loss=13.3417 valid_loss=19.4924 stale=19 time=5.55m eta=312.5m
[2024-08-26 17:52:02,783] INFO: Initiating epoch #45 train run on device rank=0
[2024-08-26 17:57:29,150] INFO: Initiating epoch #45 valid run on device rank=0
[2024-08-26 17:57:35,132] INFO: Rank 0: epoch=45 / 100 train_loss=13.2483 valid_loss=19.5049 stale=20 time=5.54m eta=306.9m
[2024-08-26 17:57:36,465] INFO: Initiating epoch #46 train run on device rank=0
[2024-08-26 18:03:04,076] INFO: Initiating epoch #46 valid run on device rank=0
[2024-08-26 18:03:10,766] INFO: Done with training. Total training time on device 0 is 256.654min
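
In the log above, valid_loss reaches its minimum at epoch 25 (17.4256) and rises afterwards while train_loss keeps falling. The stale counter appears to count epochs since the last improvement in valid_loss, and the run ends during epoch 46 after stale reached 20. The sketch below is one way to pull the per-epoch summary records out of such a log and find the best validation epoch. The regular expression is inferred from the record format seen here, and the names EPOCH_RE, parse_epochs and best_epoch are hypothetical helpers, not part of the training code that produced this log.

```python
import re

# Pattern for the "Rank 0: epoch=... train_loss=... valid_loss=... stale=..."
# summary records. It is inferred from this log only (an assumption) and uses
# \s+ so it matches whether records sit on separate lines or run together.
EPOCH_RE = re.compile(
    r"epoch=(\d+)\s*/\s*\d+\s+"
    r"train_loss=([\d.]+)\s+"
    r"valid_loss=([\d.]+)\s+"
    r"stale=(\d+)"
)


def parse_epochs(log_text):
    """Return one dict per epoch summary record found in log_text."""
    return [
        {
            "epoch": int(m.group(1)),
            "train_loss": float(m.group(2)),
            "valid_loss": float(m.group(3)),
            "stale": int(m.group(4)),
        }
        for m in EPOCH_RE.finditer(log_text)
    ]


def best_epoch(rows):
    """Return the record with the lowest validation loss."""
    return min(rows, key=lambda r: r["valid_loss"])


# Three summary records copied verbatim from the log above.
sample = (
    "[2024-08-26 16:00:43,529] INFO: Rank 0: epoch=24 / 100 train_loss=15.7363 "
    "valid_loss=17.4849 stale=1 time=5.6m eta=425.0m\n"
    "[2024-08-26 16:06:25,002] INFO: Rank 0: epoch=25 / 100 train_loss=15.5849 "
    "valid_loss=17.4256 stale=0 time=5.67m eta=419.7m\n"
    "[2024-08-26 16:12:01,889] INFO: Rank 0: epoch=26 / 100 train_loss=15.4507 "
    "valid_loss=17.5673 stale=1 time=5.59m eta=414.1m\n"
)

rows = parse_epochs(sample)
best = best_epoch(rows)
print(best["epoch"], best["valid_loss"])  # prints: 25 17.4256
```

Passing the full log text to parse_epochs instead of the three-record sample should return one record for each completed epoch, since the pattern does not depend on line breaks.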