[2024-08-30 15:25:23,631] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 8 gpus
[2024-08-30 15:25:23,758] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 15:25:23,758] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 15:25:23,758] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 15:25:23,758] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 15:25:23,758] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 15:25:23,758] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 15:25:23,758] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 15:25:23,758] INFO: NVIDIA GeForce GTX 1080 Ti
[2024-08-30 15:25:31,818] INFO: configured dtype=torch.float32 for autocast
[2024-08-30 15:25:32,908] INFO: using attention_type=math
[2024-08-30 15:25:32,937] INFO: using attention_type=math
[2024-08-30 15:25:32,967] INFO: using attention_type=math
[2024-08-30 15:25:32,997] INFO: using attention_type=math
[2024-08-30 15:25:33,027] INFO: using attention_type=math
[2024-08-30 15:25:33,057] INFO: using attention_type=math
[2024-08-30 15:25:36,141] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_binary_particle): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=2, bias=True)
    )
    (nn_pid): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=537, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-30 15:25:36,142] INFO: Trainable parameters: 11950098
[2024-08-30 15:25:36,142] INFO: Non-trainable parameters: 0
[2024-08-30 15:25:36,142] INFO: Total parameters: 11950098
[2024-08-30 15:25:36,148] INFO: Modules Trainable parameters Non-trainable parameters
module.nn0_id.0.weight 8704 0
module.nn0_id.0.bias 512 0
module.nn0_id.2.weight 512 0
module.nn0_id.2.bias 512 0
module.nn0_id.4.weight 262144 0
module.nn0_id.4.bias 512 0
module.nn0_reg.0.weight 8704 0
module.nn0_reg.0.bias 512 0
module.nn0_reg.2.weight 512 0
module.nn0_reg.2.bias 512 0
module.nn0_reg.4.weight 262144 0
module.nn0_reg.4.bias 512 0
module.conv_id.0.mha.in_proj_weight 786432 0
module.conv_id.0.mha.in_proj_bias 1536 0
module.conv_id.0.mha.out_proj.weight 262144 0
module.conv_id.0.mha.out_proj.bias 512 0
module.conv_id.0.norm0.weight 512 0
module.conv_id.0.norm0.bias 512 0
module.conv_id.0.norm1.weight 512 0
module.conv_id.0.norm1.bias 512 0
module.conv_id.0.seq.0.weight 262144 0
module.conv_id.0.seq.0.bias 512 0
module.conv_id.0.seq.2.weight 262144 0
module.conv_id.0.seq.2.bias 512 0
module.conv_id.1.mha.in_proj_weight 786432 0
module.conv_id.1.mha.in_proj_bias 1536 0
module.conv_id.1.mha.out_proj.weight 262144 0
module.conv_id.1.mha.out_proj.bias 512 0
module.conv_id.1.norm0.weight 512 0
module.conv_id.1.norm0.bias 512 0
module.conv_id.1.norm1.weight 512 0
module.conv_id.1.norm1.bias 512 0
module.conv_id.1.seq.0.weight 262144 0
module.conv_id.1.seq.0.bias 512 0
module.conv_id.1.seq.2.weight 262144 0
module.conv_id.1.seq.2.bias 512 0
module.conv_id.2.mha.in_proj_weight 786432 0
module.conv_id.2.mha.in_proj_bias 1536 0
module.conv_id.2.mha.out_proj.weight 262144 0
module.conv_id.2.mha.out_proj.bias 512 0
module.conv_id.2.norm0.weight 512 0
module.conv_id.2.norm0.bias 512 0
module.conv_id.2.norm1.weight 512 0
module.conv_id.2.norm1.bias 512 0
module.conv_id.2.seq.0.weight 262144 0
module.conv_id.2.seq.0.bias 512 0
module.conv_id.2.seq.2.weight 262144 0
module.conv_id.2.seq.2.bias 512 0
module.conv_reg.0.mha.in_proj_weight 786432 0
module.conv_reg.0.mha.in_proj_bias 1536 0
module.conv_reg.0.mha.out_proj.weight 262144 0
module.conv_reg.0.mha.out_proj.bias 512 0
module.conv_reg.0.norm0.weight 512 0
module.conv_reg.0.norm0.bias 512 0
module.conv_reg.0.norm1.weight 512 0
module.conv_reg.0.norm1.bias 512 0
module.conv_reg.0.seq.0.weight 262144 0
module.conv_reg.0.seq.0.bias 512 0
module.conv_reg.0.seq.2.weight 262144 0
module.conv_reg.0.seq.2.bias 512 0
module.conv_reg.1.mha.in_proj_weight 786432 0
module.conv_reg.1.mha.in_proj_bias 1536 0
module.conv_reg.1.mha.out_proj.weight 262144 0
module.conv_reg.1.mha.out_proj.bias 512 0
module.conv_reg.1.norm0.weight 512 0
module.conv_reg.1.norm0.bias 512 0
module.conv_reg.1.norm1.weight 512 0
module.conv_reg.1.norm1.bias 512 0
module.conv_reg.1.seq.0.weight 262144 0
module.conv_reg.1.seq.0.bias 512 0
module.conv_reg.1.seq.2.weight 262144 0
module.conv_reg.1.seq.2.bias 512 0
module.conv_reg.2.mha.in_proj_weight 786432 0
module.conv_reg.2.mha.in_proj_bias 1536 0
module.conv_reg.2.mha.out_proj.weight 262144 0
module.conv_reg.2.mha.out_proj.bias 512 0
module.conv_reg.2.norm0.weight 512 0
module.conv_reg.2.norm0.bias 512 0
module.conv_reg.2.norm1.weight 512 0
module.conv_reg.2.norm1.bias 512 0
module.conv_reg.2.seq.0.weight 262144 0
module.conv_reg.2.seq.0.bias 512 0
module.conv_reg.2.seq.2.weight 262144 0
module.conv_reg.2.seq.2.bias 512 0
module.nn_binary_particle.0.weight 270848 0
module.nn_binary_particle.0.bias 512 0
module.nn_binary_particle.2.weight 512 0
module.nn_binary_particle.2.bias 512 0
module.nn_binary_particle.4.weight 1024 0
module.nn_binary_particle.4.bias 2 0
module.nn_pid.0.weight 270848 0
module.nn_pid.0.bias 512 0
module.nn_pid.2.weight 512 0
module.nn_pid.2.bias 512 0
module.nn_pid.4.weight 3072 0
module.nn_pid.4.bias 6 0
module.nn_pt.nn.0.weight 274944 0
module.nn_pt.nn.0.bias 512 0
module.nn_pt.nn.2.weight 512 0
module.nn_pt.nn.2.bias 512 0
module.nn_pt.nn.4.weight 1024 0
module.nn_pt.nn.4.bias 2 0
module.nn_eta.nn.0.weight 274944 0
module.nn_eta.nn.0.bias 512 0
module.nn_eta.nn.2.weight 512 0
module.nn_eta.nn.2.bias 512 0
module.nn_eta.nn.4.weight 1024 0
module.nn_eta.nn.4.bias 2 0
module.nn_sin_phi.nn.0.weight 274944 0
module.nn_sin_phi.nn.0.bias 512 0
module.nn_sin_phi.nn.2.weight 512 0
module.nn_sin_phi.nn.2.bias 512 0
module.nn_sin_phi.nn.4.weight 1024 0
module.nn_sin_phi.nn.4.bias 2 0
module.nn_cos_phi.nn.0.weight 274944 0
module.nn_cos_phi.nn.0.bias 512 0
module.nn_cos_phi.nn.2.weight 512 0
module.nn_cos_phi.nn.2.bias 512 0
module.nn_cos_phi.nn.4.weight 1024 0
module.nn_cos_phi.nn.4.bias 2 0
module.nn_energy.nn.0.weight 274944 0
module.nn_energy.nn.0.bias 512 0
module.nn_energy.nn.2.weight 512 0
module.nn_energy.nn.2.bias 512 0
module.nn_energy.nn.4.weight 1024 0
module.nn_energy.nn.4.bias 2 0
[2024-08-30 15:25:36,150] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_8GTX_pyg-clic_20240830_152523_166622
[2024-08-30 15:25:36,150] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_8GTX_pyg-clic_20240830_152523_166622
[2024-08-30 15:25:36,242] INFO: train_dataset: clic_edm_ttbar_pf, 2514200
[2024-08-30 15:25:36,282] INFO: train_dataset: clic_edm_qq_pf, 3075590
[2024-08-30 15:25:54,168] INFO: valid_dataset: clic_edm_ttbar_pf, 628600
[2024-08-30 15:25:54,188] INFO: valid_dataset: clic_edm_qq_pf, 768905
[2024-08-30 15:25:54,482] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-30 17:01:14,202] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-30 17:08:00,026] INFO: Rank 0: epoch=1 / 200 train_loss=11.8628 valid_loss=9.7602 stale=0 time=102.09m eta=20316.4m
[2024-08-30 17:08:00,655] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-30 18:43:16,514] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-30 18:50:01,061] INFO: Rank 0: epoch=2 / 200 train_loss=9.1900 valid_loss=8.7369 stale=0 time=102.01m eta=20206.9m
[2024-08-30 18:50:03,211] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-30 20:25:27,320] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-30 20:32:17,564] INFO: Rank 0: epoch=3 / 200 train_loss=8.4877 valid_loss=8.2345 stale=0 time=102.24m eta=20119.3m
[2024-08-30 20:32:19,939] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-30 22:12:47,406] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-30 22:19:33,991] INFO: Rank 0: epoch=4 / 200 train_loss=8.1073 valid_loss=7.9553 stale=0 time=107.23m eta=20269.3m
[2024-08-30 22:19:37,318] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-31 00:00:13,024] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-31 00:06:55,640] INFO: Rank 0: epoch=5 / 200 train_loss=7.8440 valid_loss=7.7391 stale=0 time=107.31m eta=20319.8m
[2024-08-31 00:06:58,346] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-31 01:49:06,633] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-31 01:55:49,728] INFO: Rank 0: epoch=6 / 200 train_loss=7.6377 valid_loss=7.5384 stale=0 time=108.86m eta=20367.4m
[2024-08-31 01:55:52,217] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-31 03:36:38,448] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-31 03:43:20,210] INFO: Rank 0: epoch=7 / 200 train_loss=7.4539 valid_loss=7.4080 stale=0 time=107.47m eta=20332.0m
[2024-08-31 03:43:23,800] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-31 05:27:18,156] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-31 05:34:05,918] INFO: Rank 0: epoch=8 / 200 train_loss=7.2983 valid_loss=7.2448 stale=0 time=110.7m eta=20356.6m
[2024-08-31 05:34:10,008] INFO: Initiating epoch #9 train run on device rank=0
[2024-08-31 07:18:57,207] INFO: Initiating epoch #9 valid run on device rank=0
[2024-08-31 07:25:41,875] INFO: Rank 0: epoch=9 / 200 train_loss=7.1776 valid_loss=7.1310 stale=0 time=111.53m eta=20368.9m
[2024-08-31 07:25:45,725] INFO: Initiating epoch #10 train run on device rank=0
[2024-08-31 09:09:32,681] INFO: Initiating epoch #10 valid run on device rank=0
[2024-08-31 09:16:14,859] INFO: Rank 0: epoch=10 / 200 train_loss=7.0824 valid_loss=7.0647 stale=0 time=110.49m eta=20336.5m
[2024-08-31 09:16:18,524] INFO: Initiating epoch #11 train run on device rank=0
[2024-08-31 11:01:41,214] INFO: Initiating epoch #11 valid run on device rank=0
[2024-08-31 11:08:35,352] INFO: Rank 0: epoch=11 / 200 train_loss=7.0047 valid_loss=6.9948 stale=0 time=112.28m eta=20320.6m
[2024-08-31 11:08:45,341] INFO: Initiating epoch #12 train run on device rank=0
[2024-08-31 12:53:00,781] INFO: Initiating epoch #12 valid run on device rank=0
[2024-08-31 13:01:48,952] INFO: Rank 0: epoch=12 / 200 train_loss=6.9381 valid_loss=6.9357 stale=0 time=113.06m eta=20302.6m
[2024-08-31 13:01:51,455] INFO: Initiating epoch #13 train run on device rank=0
[2024-08-31 14:43:22,535] INFO: Initiating epoch #13 valid run on device rank=0
[2024-08-31 14:50:08,059] INFO: Rank 0: epoch=13 / 200 train_loss=6.8771 valid_loss=6.8960 stale=0 time=108.28m eta=20199.3m
[2024-08-31 14:50:11,931] INFO: Initiating epoch #14 train run on device rank=0
[2024-08-31 16:33:09,781] INFO: Initiating epoch #14 valid run on device rank=0
[2024-08-31 16:39:53,233] INFO: Rank 0: epoch=14 / 200 train_loss=6.8201 valid_loss=6.8379 stale=0 time=109.69m eta=20114.3m
[2024-08-31 16:39:55,835] INFO: Initiating epoch #15 train run on device rank=0
[2024-08-31 18:24:09,961] INFO: Initiating epoch #15 valid run on device rank=0
[2024-08-31 18:30:54,770] INFO: Rank 0: epoch=15 / 200 train_loss=6.7689 valid_loss=6.8104 stale=0 time=110.98m eta=20041.7m
[2024-08-31 18:30:58,844] INFO: Initiating epoch #16 train run on device rank=0
[2024-08-31 20:13:13,681] INFO: Initiating epoch #16 valid run on device rank=0
[2024-08-31 20:19:56,165] INFO: Rank 0: epoch=16 / 200 train_loss=6.7235 valid_loss=6.7545 stale=0 time=108.96m eta=19941.3m
[2024-08-31 20:19:58,834] INFO: Initiating epoch #17 train run on device rank=0
[2024-08-31 22:01:54,000] INFO: Initiating epoch #17 valid run on device rank=0
[2024-08-31 22:08:35,796] INFO: Rank 0: epoch=17 / 200 train_loss=6.6819 valid_loss=6.7147 stale=0 time=108.62m eta=19836.0m
[2024-08-31 22:08:39,365] INFO: Initiating epoch #18 train run on device rank=0
[2024-08-31 23:55:36,069] INFO: Initiating epoch #18 valid run on device rank=0
[2024-09-01 00:02:20,125] INFO: Rank 0: epoch=18 / 200 train_loss=6.6446 valid_loss=6.6687 stale=0 time=113.68m eta=19781.7m
[2024-09-01 00:02:25,749] INFO: Initiating epoch #19 train run on device rank=0
[2024-09-01 01:46:10,947] INFO: Initiating epoch #19 valid run on device rank=0
[2024-09-01 01:52:55,317] INFO: Rank 0: epoch=19 / 200 train_loss=6.6100 valid_loss=6.6431 stale=0 time=110.49m eta=19691.0m
[2024-09-01 01:52:59,096] INFO: Initiating epoch #20 train run on device rank=0
[2024-09-01 03:36:20,784] INFO: Initiating epoch #20 valid run on device rank=0
[2024-09-01 03:43:03,497] INFO: Rank 0: epoch=20 / 200 train_loss=6.5778 valid_loss=6.6214 stale=0 time=110.07m eta=19594.4m
[2024-09-01 03:43:08,016] INFO: Initiating epoch #21 train run on device rank=0
[2024-09-01 05:26:56,949] INFO: Initiating epoch #21 valid run on device rank=0
[2024-09-01 05:33:50,731] INFO: Rank 0: epoch=21 / 200 train_loss=6.5482 valid_loss=6.5957 stale=0 time=110.71m eta=19501.9m
[2024-09-01 05:33:55,211] INFO: Initiating epoch #22 train run on device rank=0
[2024-09-01 07:19:03,320] INFO: Initiating epoch #22 valid run on device rank=0
[2024-09-01 07:25:59,861] INFO: Rank 0: epoch=22 / 200 train_loss=6.5203 valid_loss=6.5673 stale=0 time=112.08m eta=19418.9m
[2024-09-01 07:26:03,309] INFO: Initiating epoch #23 train run on device rank=0
[2024-09-01 09:10:23,695] INFO: Initiating epoch #23 valid run on device rank=0
[2024-09-01 09:17:08,240] INFO: Rank 0: epoch=23 / 200 train_loss=6.4940 valid_loss=6.5502 stale=0 time=111.08m eta=19325.5m
[2024-09-01 09:17:12,614] INFO: Initiating epoch #24 train run on device rank=0
[2024-09-01 11:01:04,359] INFO: Initiating epoch #24 valid run on device rank=0
[2024-09-01 11:07:49,847] INFO: Rank 0: epoch=24 / 200 train_loss=6.4666 valid_loss=6.5211 stale=0 time=110.62m eta=19227.4m
[2024-09-01 11:07:54,409] INFO: Initiating epoch #25 train run on device rank=0
[2024-09-01 12:53:26,275] INFO: Initiating epoch #25 valid run on device rank=0
[2024-09-01 13:00:10,976] INFO: Rank 0: epoch=25 / 200 train_loss=6.4403 valid_loss=6.4931 stale=0 time=112.28m eta=19139.9m
[2024-09-01 13:00:14,350] INFO: Initiating epoch #26 train run on device rank=0
[2024-09-01 14:44:58,881] INFO: Initiating epoch #26 valid run on device rank=0
[2024-09-01 14:51:43,653] INFO: Rank 0: epoch=26 / 200 train_loss=6.4132 valid_loss=6.4539 stale=0 time=111.49m eta=19045.1m
[2024-09-01 14:51:47,269] INFO: Initiating epoch #27 train run on device rank=0
[2024-09-01 16:37:18,398] INFO: Initiating epoch #27 valid run on device rank=0
[2024-09-01 16:44:00,495] INFO: Rank 0: epoch=27 / 200 train_loss=6.3915 valid_loss=6.4543 stale=1 time=112.22m eta=18953.8m
[2024-09-01 16:44:06,133] INFO: Initiating epoch #28 train run on device rank=0
[2024-09-01 18:27:48,819] INFO: Initiating epoch #28 valid run on device rank=0
[2024-09-01 18:34:32,269] INFO: Rank 0: epoch=28 / 200 train_loss=6.3726 valid_loss=6.4336 stale=0 time=110.44m eta=18850.2m
[2024-09-01 18:34:37,727] INFO: Initiating epoch #29 train run on device rank=0
[2024-09-01 20:19:31,618] INFO: Initiating epoch #29 valid run on device rank=0
[2024-09-01 20:26:15,609] INFO: Rank 0: epoch=29 / 200 train_loss=6.3547 valid_loss=6.4219 stale=0 time=111.63m eta=18753.1m
[2024-09-01 20:26:18,974] INFO: Initiating epoch #30 train run on device rank=0
[2024-09-01 22:10:41,204] INFO: Initiating epoch #30 valid run on device rank=0
[2024-09-01 22:17:23,198] INFO: Rank 0: epoch=30 / 200 train_loss=6.3381 valid_loss=6.4151 stale=0 time=111.07m eta=18651.7m
[2024-09-01 22:17:27,051] INFO: Initiating epoch #31 train run on device rank=0
[2024-09-02 00:02:27,679] INFO: Initiating epoch #31 valid run on device rank=0
[2024-09-02 00:09:09,157] INFO: Rank 0: epoch=31 / 200 train_loss=6.3224 valid_loss=6.3896 stale=0 time=111.7m eta=18553.2m
[2024-09-02 00:09:13,176] INFO: Initiating epoch #32 train run on device rank=0
[2024-09-02 01:53:21,194] INFO: Initiating epoch #32 valid run on device rank=0
[2024-09-02 02:00:02,536] INFO: Rank 0: epoch=32 / 200 train_loss=6.3078 valid_loss=6.3727 stale=0 time=110.82m eta=18449.2m
[2024-09-02 02:00:06,868] INFO: Initiating epoch #33 train run on device rank=0
[2024-09-02 03:44:19,105] INFO: Initiating epoch #33 valid run on device rank=0
[2024-09-02 03:51:02,417] INFO: Rank 0: epoch=33 / 200 train_loss=6.2938 valid_loss=6.3699 stale=0 time=110.93m eta=18345.4m
[2024-09-02 03:51:07,728] INFO: Initiating epoch #34 train run on device rank=0
[2024-09-02 05:35:32,754] INFO: Initiating epoch #34 valid run on device rank=0
[2024-09-02 05:42:13,542] INFO: Rank 0: epoch=34 / 200 train_loss=6.2803 valid_loss=6.3587 stale=0 time=111.1m eta=18242.0m
[2024-09-02 05:42:19,145] INFO: Initiating epoch #35 train run on device rank=0
[2024-09-02 07:28:16,666] INFO: Initiating epoch #35 valid run on device rank=0
[2024-09-02 07:35:03,622] INFO: Rank 0: epoch=35 / 200 train_loss=6.2677 valid_loss=6.3481 stale=0 time=112.74m eta=18146.0m
[2024-09-02 07:35:10,946] INFO: Initiating epoch #36 train run on device rank=0
[2024-09-02 09:21:32,445] INFO: Initiating epoch #36 valid run on device rank=0
[2024-09-02 09:28:16,856] INFO: Rank 0: epoch=36 / 200 train_loss=6.2547 valid_loss=6.3386 stale=0 time=113.1m eta=18050.8m
[2024-09-02 09:28:22,536] INFO: Initiating epoch #37 train run on device rank=0
[2024-09-02 11:13:07,548] INFO: Initiating epoch #37 valid run on device rank=0
[2024-09-02 11:20:02,526] INFO: Rank 0: epoch=37 / 200 train_loss=6.2424 valid_loss=6.3284 stale=0 time=111.67m eta=17948.2m
[2024-09-02 11:20:07,950] INFO: Initiating epoch #38 train run on device rank=0
[2024-09-02 13:06:29,902] INFO: Initiating epoch #38 valid run on device rank=0
[2024-09-02 13:13:17,178] INFO: Rank 0: epoch=38 / 200 train_loss=6.2294 valid_loss=6.3253 stale=0 time=113.15m eta=17851.5m
[2024-09-02 13:13:22,428] INFO: Initiating epoch #39 train run on device rank=0
[2024-09-02 15:00:14,853] INFO: Initiating epoch #39 valid run on device rank=0
[2024-09-02 15:06:59,382] INFO: Rank 0: epoch=39 / 200 train_loss=6.2163 valid_loss=6.3129 stale=0 time=113.62m eta=17755.7m
[2024-09-02 15:07:03,095] INFO: Initiating epoch #40 train run on device rank=0
[2024-09-02 16:52:13,218] INFO: Initiating epoch #40 valid run on device rank=0
[2024-09-02 16:58:57,848] INFO: Rank 0: epoch=40 / 200 train_loss=6.2041 valid_loss=6.3071 stale=0 time=111.91m eta=17652.2m
[2024-09-02 16:59:03,325] INFO: Initiating epoch #41 train run on device rank=0
[2024-09-02 18:45:30,519] INFO: Initiating epoch #41 valid run on device rank=0
[2024-09-02 18:52:12,626] INFO: Rank 0: epoch=41 / 200 train_loss=6.1923 valid_loss=6.2968 stale=0 time=113.16m eta=17553.2m
[2024-09-02 18:52:16,692] INFO: Initiating epoch #42 train run on device rank=0
[2024-09-02 20:38:54,828] INFO: Initiating epoch #42 valid run on device rank=0
[2024-09-02 20:45:37,346] INFO: Rank 0: epoch=42 / 200 train_loss=6.1814 valid_loss=6.2934 stale=0 time=113.34m eta=17454.2m
[2024-09-02 20:45:41,665] INFO: Initiating epoch #43 train run on device rank=0
[2024-09-02 22:35:19,266] INFO: Initiating epoch #43 valid run on device rank=0
[2024-09-02 22:42:02,487] INFO: Rank 0: epoch=43 / 200 train_loss=6.1710 valid_loss=6.2777 stale=0 time=116.35m eta=17365.4m
[2024-09-02 22:42:05,246] INFO: Initiating epoch #44 train run on device rank=0
[2024-09-03 00:28:14,126] INFO: Initiating epoch #44 valid run on device rank=0
[2024-09-03 00:34:53,315] INFO: Rank 0: epoch=44 / 200 train_loss=6.1611 valid_loss=6.2802 stale=1 time=112.8m eta=17262.7m
[2024-09-03 00:34:57,652] INFO: Initiating epoch #45 train run on device rank=0
[2024-09-03 02:20:23,743] INFO: Initiating epoch #45 valid run on device rank=0
[2024-09-03 02:27:02,200] INFO: Rank 0: epoch=45 / 200 train_loss=6.1518 valid_loss=6.2826 stale=2 time=112.08m eta=17157.2m
[2024-09-03 02:27:07,107] INFO: Initiating epoch #46 train run on device rank=0