[2024-08-26 13:45:19,066] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 2 gpus
[2024-08-26 13:45:19,179] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-26 13:45:19,179] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-26 13:45:27,210] INFO: using dtype=torch.float32
[2024-08-26 13:45:27,810] INFO: using attention_type=math
[2024-08-26 13:45:27,829] INFO: using attention_type=math
[2024-08-26 13:45:27,849] INFO: using attention_type=math
[2024-08-26 13:45:27,868] INFO: using attention_type=math
[2024-08-26 13:45:27,887] INFO: using attention_type=math
[2024-08-26 13:45:27,907] INFO: using attention_type=math
[2024-08-26 13:45:31,054] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-26 13:45:31,055] INFO: Trainable parameters: 11671568
[2024-08-26 13:45:31,055] INFO: Non-trainable parameters: 0
[2024-08-26 13:45:31,055] INFO: Total parameters: 11671568
[2024-08-26 13:45:31,060] INFO: Modules Trainable parameters Non-tranable parameters
module.nn0_id.0.weight 8704 0
module.nn0_id.0.bias 512 0
module.nn0_id.2.weight 512 0
module.nn0_id.2.bias 512 0
module.nn0_id.4.weight 262144 0
module.nn0_id.4.bias 512 0
module.nn0_reg.0.weight 8704 0
module.nn0_reg.0.bias 512 0
module.nn0_reg.2.weight 512 0
module.nn0_reg.2.bias 512 0
module.nn0_reg.4.weight 262144 0
module.nn0_reg.4.bias 512 0
module.conv_id.0.mha.in_proj_weight 786432 0
module.conv_id.0.mha.in_proj_bias 1536 0
module.conv_id.0.mha.out_proj.weight 262144 0
module.conv_id.0.mha.out_proj.bias 512 0
module.conv_id.0.norm0.weight 512 0
module.conv_id.0.norm0.bias 512 0
module.conv_id.0.norm1.weight 512 0
module.conv_id.0.norm1.bias 512 0
module.conv_id.0.seq.0.weight 262144 0
module.conv_id.0.seq.0.bias 512 0
module.conv_id.0.seq.2.weight 262144 0
module.conv_id.0.seq.2.bias 512 0
module.conv_id.1.mha.in_proj_weight 786432 0
module.conv_id.1.mha.in_proj_bias 1536 0
module.conv_id.1.mha.out_proj.weight 262144 0
module.conv_id.1.mha.out_proj.bias 512 0
module.conv_id.1.norm0.weight 512 0
module.conv_id.1.norm0.bias 512 0
module.conv_id.1.norm1.weight 512 0
module.conv_id.1.norm1.bias 512 0
module.conv_id.1.seq.0.weight 262144 0
module.conv_id.1.seq.0.bias 512 0
module.conv_id.1.seq.2.weight 262144 0
module.conv_id.1.seq.2.bias 512 0
module.conv_id.2.mha.in_proj_weight 786432 0
module.conv_id.2.mha.in_proj_bias 1536 0
module.conv_id.2.mha.out_proj.weight 262144 0
module.conv_id.2.mha.out_proj.bias 512 0
module.conv_id.2.norm0.weight 512 0
module.conv_id.2.norm0.bias 512 0
module.conv_id.2.norm1.weight 512 0
module.conv_id.2.norm1.bias 512 0
module.conv_id.2.seq.0.weight 262144 0
module.conv_id.2.seq.0.bias 512 0
module.conv_id.2.seq.2.weight 262144 0
module.conv_id.2.seq.2.bias 512 0
module.conv_reg.0.mha.in_proj_weight 786432 0
module.conv_reg.0.mha.in_proj_bias 1536 0
module.conv_reg.0.mha.out_proj.weight 262144 0
module.conv_reg.0.mha.out_proj.bias 512 0
module.conv_reg.0.norm0.weight 512 0
module.conv_reg.0.norm0.bias 512 0
module.conv_reg.0.norm1.weight 512 0
module.conv_reg.0.norm1.bias 512 0
module.conv_reg.0.seq.0.weight 262144 0
module.conv_reg.0.seq.0.bias 512 0
module.conv_reg.0.seq.2.weight 262144 0
module.conv_reg.0.seq.2.bias 512 0
module.conv_reg.1.mha.in_proj_weight 786432 0
module.conv_reg.1.mha.in_proj_bias 1536 0
module.conv_reg.1.mha.out_proj.weight 262144 0
module.conv_reg.1.mha.out_proj.bias 512 0
module.conv_reg.1.norm0.weight 512 0
module.conv_reg.1.norm0.bias 512 0
module.conv_reg.1.norm1.weight 512 0
module.conv_reg.1.norm1.bias 512 0
module.conv_reg.1.seq.0.weight 262144 0
module.conv_reg.1.seq.0.bias 512 0
module.conv_reg.1.seq.2.weight 262144 0
module.conv_reg.1.seq.2.bias 512 0
module.conv_reg.2.mha.in_proj_weight 786432 0
module.conv_reg.2.mha.in_proj_bias 1536 0
module.conv_reg.2.mha.out_proj.weight 262144 0
module.conv_reg.2.mha.out_proj.bias 512 0
module.conv_reg.2.norm0.weight 512 0
module.conv_reg.2.norm0.bias 512 0
module.conv_reg.2.norm1.weight 512 0
module.conv_reg.2.norm1.bias 512 0
module.conv_reg.2.seq.0.weight 262144 0
module.conv_reg.2.seq.0.bias 512 0
module.conv_reg.2.seq.2.weight 262144 0
module.conv_reg.2.seq.2.bias 512 0
module.nn_id.0.weight 270848 0
module.nn_id.0.bias 512 0
module.nn_id.2.weight 512 0
module.nn_id.2.bias 512 0
module.nn_id.4.weight 3072 0
module.nn_id.4.bias 6 0
module.nn_pt.nn.0.weight 273920 0
module.nn_pt.nn.0.bias 512 0
module.nn_pt.nn.2.weight 512 0
module.nn_pt.nn.2.bias 512 0
module.nn_pt.nn.4.weight 1024 0
module.nn_pt.nn.4.bias 2 0
module.nn_eta.nn.0.weight 273920 0
module.nn_eta.nn.0.bias 512 0
module.nn_eta.nn.2.weight 512 0
module.nn_eta.nn.2.bias 512 0
module.nn_eta.nn.4.weight 1024 0
module.nn_eta.nn.4.bias 2 0
module.nn_sin_phi.nn.0.weight 273920 0
module.nn_sin_phi.nn.0.bias 512 0
module.nn_sin_phi.nn.2.weight 512 0
module.nn_sin_phi.nn.2.bias 512 0
module.nn_sin_phi.nn.4.weight 1024 0
module.nn_sin_phi.nn.4.bias 2 0
module.nn_cos_phi.nn.0.weight 273920 0
module.nn_cos_phi.nn.0.bias 512 0
module.nn_cos_phi.nn.2.weight 512 0
module.nn_cos_phi.nn.2.bias 512 0
module.nn_cos_phi.nn.4.weight 1024 0
module.nn_cos_phi.nn.4.bias 2 0
module.nn_energy.nn.0.weight 273920 0
module.nn_energy.nn.0.bias 512 0
module.nn_energy.nn.2.weight 512 0
module.nn_energy.nn.2.bias 512 0
module.nn_energy.nn.4.weight 1024 0
module.nn_energy.nn.4.bias 2 0
[2024-08-26 13:45:31,115] INFO: Creating experiment dir /pfvol/experiments/Aug26_CLD_fromscratch_1k_pyg-cld_20240826_134517_856191
[2024-08-26 13:45:31,115] INFO: Model directory /pfvol/experiments/Aug26_CLD_fromscratch_1k_pyg-cld_20240826_134517_856191
[2024-08-26 13:45:31,141] INFO: train_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 13:45:31,163] INFO: valid_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 13:45:31,247] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-26 13:45:47,189] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-26 13:45:56,996] INFO: Rank 0: epoch=1 / 100 train_loss=102.4851 valid_loss=58.3279 stale=0 time=0.43m eta=42.5m
[2024-08-26 13:45:57,016] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-26 13:46:02,990] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-26 13:46:11,331] INFO: Rank 0: epoch=2 / 100 train_loss=50.1855 valid_loss=45.1558 stale=0 time=0.24m eta=32.7m
[2024-08-26 13:46:12,052] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-26 13:46:18,044] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-26 13:46:26,259] INFO: Rank 0: epoch=3 / 100 train_loss=42.3783 valid_loss=40.5038 stale=0 time=0.24m eta=29.6m
[2024-08-26 13:46:27,269] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-26 13:46:33,603] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-26 13:46:43,156] INFO: Rank 0: epoch=4 / 100 train_loss=38.9293 valid_loss=37.9875 stale=0 time=0.26m eta=28.8m
[2024-08-26 13:46:44,073] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-26 13:46:50,864] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-26 13:47:00,256] INFO: Rank 0: epoch=5 / 100 train_loss=36.9764 valid_loss=36.7653 stale=0 time=0.27m eta=28.2m
[2024-08-26 13:47:01,118] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-26 13:47:07,566] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-26 13:47:16,727] INFO: Rank 0: epoch=6 / 100 train_loss=35.7879 valid_loss=35.3898 stale=0 time=0.26m eta=27.5m
[2024-08-26 13:47:17,572] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-26 13:47:24,141] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-26 13:47:33,697] INFO: Rank 0: epoch=7 / 100 train_loss=34.6488 valid_loss=34.6770 stale=0 time=0.27m eta=27.1m
[2024-08-26 13:47:34,612] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-26 13:47:41,803] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-26 13:47:50,810] INFO: Rank 0: epoch=8 / 100 train_loss=33.9676 valid_loss=34.2361 stale=0 time=0.27m eta=26.7m
[2024-08-26 13:47:51,412] INFO: Initiating epoch #9 train run on device rank=0
[2024-08-26 13:47:58,040] INFO: Initiating epoch #9 valid run on device rank=0
[2024-08-26 13:48:07,416] INFO: Rank 0: epoch=9 / 100 train_loss=33.3634 valid_loss=33.4591 stale=0 time=0.27m eta=26.3m
[2024-08-26 13:48:08,237] INFO: Initiating epoch #10 train run on device rank=0
[2024-08-26 13:48:14,897] INFO: Initiating epoch #10 valid run on device rank=0
[2024-08-26 13:48:24,316] INFO: Rank 0: epoch=10 / 100 train_loss=32.6483 valid_loss=32.8410 stale=0 time=0.27m eta=26.0m
[2024-08-26 13:48:25,209] INFO: Initiating epoch #11 train run on device rank=0
[2024-08-26 13:48:32,125] INFO: Initiating epoch #11 valid run on device rank=0
[2024-08-26 13:48:42,207] INFO: Rank 0: epoch=11 / 100 train_loss=32.2391 valid_loss=32.6949 stale=0 time=0.28m eta=25.8m
[2024-08-26 13:48:43,169] INFO: Initiating epoch #12 train run on device rank=0
[2024-08-26 13:48:49,573] INFO: Initiating epoch #12 valid run on device rank=0
[2024-08-26 13:48:59,407] INFO: Rank 0: epoch=12 / 100 train_loss=31.8077 valid_loss=32.4919 stale=0 time=0.27m eta=25.4m
[2024-08-26 13:49:00,181] INFO: Initiating epoch #13 train run on device rank=0
[2024-08-26 13:49:06,500] INFO: Initiating epoch #13 valid run on device rank=0
[2024-08-26 13:49:15,990] INFO: Rank 0: epoch=13 / 100 train_loss=31.5205 valid_loss=32.1632 stale=0 time=0.26m eta=25.1m
[2024-08-26 13:49:17,015] INFO: Initiating epoch #14 train run on device rank=0
[2024-08-26 13:49:23,438] INFO: Initiating epoch #14 valid run on device rank=0
[2024-08-26 13:49:32,243] INFO: Rank 0: epoch=14 / 100 train_loss=31.1246 valid_loss=31.7993 stale=0 time=0.25m eta=24.7m
[2024-08-26 13:49:33,227] INFO: Initiating epoch #15 train run on device rank=0
[2024-08-26 13:49:39,498] INFO: Initiating epoch #15 valid run on device rank=0
[2024-08-26 13:49:48,191] INFO: Rank 0: epoch=15 / 100 train_loss=30.8687 valid_loss=31.6073 stale=0 time=0.25m eta=24.3m
[2024-08-26 13:49:49,161] INFO: Initiating epoch #16 train run on device rank=0
[2024-08-26 13:49:55,521] INFO: Initiating epoch #16 valid run on device rank=0
[2024-08-26 13:50:05,527] INFO: Rank 0: epoch=16 / 100 train_loss=30.6555 valid_loss=31.3242 stale=0 time=0.27m eta=24.0m
[2024-08-26 13:50:06,207] INFO: Initiating epoch #17 train run on device rank=0
[2024-08-26 13:50:12,391] INFO: Initiating epoch #17 valid run on device rank=0
[2024-08-26 13:50:20,047] INFO: Rank 0: epoch=17 / 100 train_loss=30.3112 valid_loss=31.0799 stale=0 time=0.23m eta=23.5m
[2024-08-26 13:50:21,070] INFO: Initiating epoch #18 train run on device rank=0
[2024-08-26 13:50:27,287] INFO: Initiating epoch #18 valid run on device rank=0
[2024-08-26 13:50:37,259] INFO: Rank 0: epoch=18 / 100 train_loss=30.1594 valid_loss=31.0604 stale=0 time=0.27m eta=23.2m
[2024-08-26 13:50:38,246] INFO: Initiating epoch #19 train run on device rank=0
[2024-08-26 13:50:44,595] INFO: Initiating epoch #19 valid run on device rank=0
[2024-08-26 13:50:53,082] INFO: Rank 0: epoch=19 / 100 train_loss=29.9457 valid_loss=30.9400 stale=0 time=0.25m eta=22.9m
[2024-08-26 13:50:53,843] INFO: Initiating epoch #20 train run on device rank=0
[2024-08-26 13:51:00,176] INFO: Initiating epoch #20 valid run on device rank=0
[2024-08-26 13:51:05,989] INFO: Rank 0: epoch=20 / 100 train_loss=29.8100 valid_loss=31.0479 stale=1 time=0.2m eta=22.3m
[2024-08-26 13:51:06,939] INFO: Initiating epoch #21 train run on device rank=0
[2024-08-26 13:51:13,524] INFO: Initiating epoch #21 valid run on device rank=0
[2024-08-26 13:51:20,418] INFO: Rank 0: epoch=21 / 100 train_loss=29.6663 valid_loss=31.0656 stale=2 time=0.22m eta=21.9m
[2024-08-26 13:51:21,376] INFO: Initiating epoch #22 train run on device rank=0
[2024-08-26 13:51:28,077] INFO: Initiating epoch #22 valid run on device rank=0
[2024-08-26 13:51:50,499] INFO: Rank 0: epoch=22 / 100 train_loss=29.6804 valid_loss=30.9233 stale=0 time=0.49m eta=22.4m
[2024-08-26 13:51:51,399] INFO: Initiating epoch #23 train run on device rank=0
[2024-08-26 13:51:57,581] INFO: Initiating epoch #23 valid run on device rank=0
[2024-08-26 13:52:03,605] INFO: Rank 0: epoch=23 / 100 train_loss=29.5210 valid_loss=31.0512 stale=1 time=0.2m eta=21.9m
[2024-08-26 13:52:04,557] INFO: Initiating epoch #24 train run on device rank=0
[2024-08-26 13:52:11,017] INFO: Initiating epoch #24 valid run on device rank=0
[2024-08-26 13:52:17,436] INFO: Rank 0: epoch=24 / 100 train_loss=29.3781 valid_loss=31.0722 stale=2 time=0.21m eta=21.4m
[2024-08-26 13:52:18,128] INFO: Initiating epoch #25 train run on device rank=0
[2024-08-26 13:52:25,857] INFO: Initiating epoch #25 valid run on device rank=0
[2024-08-26 13:52:34,442] INFO: Rank 0: epoch=25 / 100 train_loss=29.5264 valid_loss=30.7886 stale=0 time=0.27m eta=21.2m
[2024-08-26 13:52:35,248] INFO: Initiating epoch #26 train run on device rank=0
[2024-08-26 13:52:41,606] INFO: Initiating epoch #26 valid run on device rank=0
[2024-08-26 13:52:50,998] INFO: Rank 0: epoch=26 / 100 train_loss=29.1733 valid_loss=30.7297 stale=0 time=0.26m eta=20.9m
[2024-08-26 13:52:52,116] INFO: Initiating epoch #27 train run on device rank=0
[2024-08-26 13:52:58,505] INFO: Initiating epoch #27 valid run on device rank=0
[2024-08-26 13:53:06,568] INFO: Rank 0: epoch=27 / 100 train_loss=28.9462 valid_loss=31.0213 stale=1 time=0.24m eta=20.5m
[2024-08-26 13:53:07,422] INFO: Initiating epoch #28 train run on device rank=0
[2024-08-26 13:53:13,776] INFO: Initiating epoch #28 valid run on device rank=0
[2024-08-26 13:53:19,434] INFO: Rank 0: epoch=28 / 100 train_loss=28.8260 valid_loss=31.1622 stale=2 time=0.2m eta=20.1m
[2024-08-26 13:53:20,147] INFO: Initiating epoch #29 train run on device rank=0
[2024-08-26 13:53:27,352] INFO: Initiating epoch #29 valid run on device rank=0
[2024-08-26 13:53:33,645] INFO: Rank 0: epoch=29 / 100 train_loss=28.5826 valid_loss=31.1108 stale=3 time=0.22m eta=19.7m
[2024-08-26 13:53:34,452] INFO: Initiating epoch #30 train run on device rank=0
[2024-08-26 13:53:41,069] INFO: Initiating epoch #30 valid run on device rank=0
[2024-08-26 13:53:47,903] INFO: Rank 0: epoch=30 / 100 train_loss=28.4313 valid_loss=31.2035 stale=4 time=0.22m eta=19.3m
[2024-08-26 13:53:48,538] INFO: Initiating epoch #31 train run on device rank=0
[2024-08-26 13:53:54,919] INFO: Initiating epoch #31 valid run on device rank=0
[2024-08-26 13:54:01,466] INFO: Rank 0: epoch=31 / 100 train_loss=28.2549 valid_loss=31.4556 stale=5 time=0.22m eta=18.9m
[2024-08-26 13:54:02,384] INFO: Initiating epoch #32 train run on device rank=0
[2024-08-26 13:54:08,845] INFO: Initiating epoch #32 valid run on device rank=0
[2024-08-26 13:54:15,060] INFO: Rank 0: epoch=32 / 100 train_loss=28.1735 valid_loss=31.1223 stale=6 time=0.21m eta=18.6m
[2024-08-26 13:54:16,105] INFO: Initiating epoch #33 train run on device rank=0
[2024-08-26 13:54:22,269] INFO: Initiating epoch #33 valid run on device rank=0
[2024-08-26 13:54:28,408] INFO: Rank 0: epoch=33 / 100 train_loss=28.0708 valid_loss=31.2236 stale=7 time=0.21m eta=18.2m
[2024-08-26 13:54:29,246] INFO: Initiating epoch #34 train run on device rank=0
[2024-08-26 13:54:35,717] INFO: Initiating epoch #34 valid run on device rank=0
[2024-08-26 13:54:42,088] INFO: Rank 0: epoch=34 / 100 train_loss=27.7482 valid_loss=31.4453 stale=8 time=0.21m eta=17.8m
[2024-08-26 13:54:42,902] INFO: Initiating epoch #35 train run on device rank=0
[2024-08-26 13:54:49,336] INFO: Initiating epoch #35 valid run on device rank=0
[2024-08-26 13:54:55,303] INFO: Rank 0: epoch=35 / 100 train_loss=27.5319 valid_loss=31.6017 stale=9 time=0.21m eta=17.5m
[2024-08-26 13:54:56,317] INFO: Initiating epoch #36 train run on device rank=0
[2024-08-26 13:55:02,764] INFO: Initiating epoch #36 valid run on device rank=0
[2024-08-26 13:55:10,876] INFO: Rank 0: epoch=36 / 100 train_loss=27.4673 valid_loss=31.8858 stale=10 time=0.24m eta=17.2m
[2024-08-26 13:55:13,353] INFO: Initiating epoch #37 train run on device rank=0
[2024-08-26 13:55:19,740] INFO: Initiating epoch #37 valid run on device rank=0
[2024-08-26 13:55:25,965] INFO: Rank 0: epoch=37 / 100 train_loss=27.2895 valid_loss=31.6848 stale=11 time=0.21m eta=16.9m
[2024-08-26 13:55:26,815] INFO: Initiating epoch #38 train run on device rank=0
[2024-08-26 13:55:33,800] INFO: Initiating epoch #38 valid run on device rank=0
[2024-08-26 13:55:39,613] INFO: Rank 0: epoch=38 / 100 train_loss=27.2991 valid_loss=31.8519 stale=12 time=0.21m eta=16.5m
[2024-08-26 13:55:40,269] INFO: Initiating epoch #39 train run on device rank=0
[2024-08-26 13:55:46,463] INFO: Initiating epoch #39 valid run on device rank=0
[2024-08-26 13:55:52,318] INFO: Rank 0: epoch=39 / 100 train_loss=27.2447 valid_loss=31.4133 stale=13 time=0.2m eta=16.2m
[2024-08-26 13:55:52,957] INFO: Initiating epoch #40 train run on device rank=0
[2024-08-26 13:56:00,962] INFO: Initiating epoch #40 valid run on device rank=0
[2024-08-26 13:56:06,997] INFO: Rank 0: epoch=40 / 100 train_loss=27.1831 valid_loss=31.3887 stale=14 time=0.23m eta=15.9m
[2024-08-26 13:56:07,556] INFO: Initiating epoch #41 train run on device rank=0
[2024-08-26 13:56:13,850] INFO: Initiating epoch #41 valid run on device rank=0
[2024-08-26 13:56:19,911] INFO: Rank 0: epoch=41 / 100 train_loss=26.7189 valid_loss=32.5503 stale=15 time=0.21m eta=15.6m
[2024-08-26 13:56:20,777] INFO: Initiating epoch #42 train run on device rank=0
[2024-08-26 13:56:27,473] INFO: Initiating epoch #42 valid run on device rank=0
[2024-08-26 13:56:33,918] INFO: Rank 0: epoch=42 / 100 train_loss=26.4924 valid_loss=33.3223 stale=16 time=0.22m eta=15.3m
[2024-08-26 13:56:34,689] INFO: Initiating epoch #43 train run on device rank=0
[2024-08-26 13:56:40,971] INFO: Initiating epoch #43 valid run on device rank=0
[2024-08-26 13:56:47,126] INFO: Rank 0: epoch=43 / 100 train_loss=26.4915 valid_loss=32.8763 stale=17 time=0.21m eta=14.9m
[2024-08-26 13:56:47,924] INFO: Initiating epoch #44 train run on device rank=0
[2024-08-26 13:56:54,417] INFO: Initiating epoch #44 valid run on device rank=0
[2024-08-26 13:57:00,628] INFO: Rank 0: epoch=44 / 100 train_loss=26.3698 valid_loss=32.8692 stale=18 time=0.21m eta=14.6m
[2024-08-26 13:57:01,816] INFO: Initiating epoch #45 train run on device rank=0
[2024-08-26 13:57:08,251] INFO: Initiating epoch #45 valid run on device rank=0
[2024-08-26 13:57:14,224] INFO: Rank 0: epoch=45 / 100 train_loss=26.2879 valid_loss=32.2546 stale=19 time=0.21m eta=14.3m
[2024-08-26 13:57:14,778] INFO: Initiating epoch #46 train run on device rank=0
[2024-08-26 13:57:21,363] INFO: Initiating epoch #46 valid run on device rank=0
[2024-08-26 13:57:27,526] INFO: Rank 0: epoch=46 / 100 train_loss=26.1288 valid_loss=32.8996 stale=20 time=0.21m eta=14.0m
[2024-08-26 13:57:28,460] INFO: Initiating epoch #47 train run on device rank=0
[2024-08-26 13:57:35,163] INFO: Initiating epoch #47 valid run on device rank=0
[2024-08-26 13:57:40,993] INFO: Done with training. Total training time on device 0 is 12.162min