[2024-08-26 13:44:06,988] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 2 gpus
[2024-08-26 13:44:07,065] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 13:44:07,065] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 13:44:11,654] INFO: using dtype=torch.float32
[2024-08-26 13:44:11,898] INFO: using attention_type=math
[2024-08-26 13:44:11,908] INFO: using attention_type=math
[2024-08-26 13:44:11,919] INFO: using attention_type=math
[2024-08-26 13:44:11,929] INFO: using attention_type=math
[2024-08-26 13:44:11,940] INFO: using attention_type=math
[2024-08-26 13:44:11,950] INFO: using attention_type=math
[2024-08-26 13:44:14,230] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-26 13:44:14,230] INFO: Trainable parameters: 11671568
[2024-08-26 13:44:14,230] INFO: Non-trainable parameters: 0
[2024-08-26 13:44:14,230] INFO: Total parameters: 11671568
[2024-08-26 13:44:14,233] INFO:
Modules                                 Trainable parameters  Non-tranable parameters
module.nn0_id.0.weight                  8704                  0
module.nn0_id.0.bias                    512                   0
module.nn0_id.2.weight                  512                   0
module.nn0_id.2.bias                    512                   0
module.nn0_id.4.weight                  262144                0
module.nn0_id.4.bias                    512                   0
module.nn0_reg.0.weight                 8704                  0
module.nn0_reg.0.bias                   512                   0
module.nn0_reg.2.weight                 512                   0
module.nn0_reg.2.bias                   512                   0
module.nn0_reg.4.weight                 262144                0
module.nn0_reg.4.bias                   512                   0
module.conv_id.0.mha.in_proj_weight     786432                0
module.conv_id.0.mha.in_proj_bias       1536                  0
module.conv_id.0.mha.out_proj.weight    262144                0
module.conv_id.0.mha.out_proj.bias      512                   0
module.conv_id.0.norm0.weight           512                   0
module.conv_id.0.norm0.bias             512                   0
module.conv_id.0.norm1.weight           512                   0
module.conv_id.0.norm1.bias             512                   0
module.conv_id.0.seq.0.weight           262144                0
module.conv_id.0.seq.0.bias             512                   0
module.conv_id.0.seq.2.weight           262144                0
module.conv_id.0.seq.2.bias             512                   0
module.conv_id.1.mha.in_proj_weight     786432                0
module.conv_id.1.mha.in_proj_bias       1536                  0
module.conv_id.1.mha.out_proj.weight    262144                0
module.conv_id.1.mha.out_proj.bias      512                   0
module.conv_id.1.norm0.weight           512                   0
module.conv_id.1.norm0.bias             512                   0
module.conv_id.1.norm1.weight           512                   0
module.conv_id.1.norm1.bias             512                   0
module.conv_id.1.seq.0.weight           262144                0
module.conv_id.1.seq.0.bias             512                   0
module.conv_id.1.seq.2.weight           262144                0
module.conv_id.1.seq.2.bias             512                   0
module.conv_id.2.mha.in_proj_weight     786432                0
module.conv_id.2.mha.in_proj_bias       1536                  0
module.conv_id.2.mha.out_proj.weight    262144                0
module.conv_id.2.mha.out_proj.bias      512                   0
module.conv_id.2.norm0.weight           512                   0
module.conv_id.2.norm0.bias             512                   0
module.conv_id.2.norm1.weight           512                   0
module.conv_id.2.norm1.bias             512                   0
module.conv_id.2.seq.0.weight           262144                0
module.conv_id.2.seq.0.bias             512                   0
module.conv_id.2.seq.2.weight           262144                0
module.conv_id.2.seq.2.bias             512                   0
module.conv_reg.0.mha.in_proj_weight    786432                0
module.conv_reg.0.mha.in_proj_bias      1536                  0
module.conv_reg.0.mha.out_proj.weight   262144                0
module.conv_reg.0.mha.out_proj.bias     512                   0
module.conv_reg.0.norm0.weight          512                   0
module.conv_reg.0.norm0.bias            512                   0
module.conv_reg.0.norm1.weight          512                   0
module.conv_reg.0.norm1.bias            512                   0
module.conv_reg.0.seq.0.weight          262144                0
module.conv_reg.0.seq.0.bias            512                   0
module.conv_reg.0.seq.2.weight          262144                0
module.conv_reg.0.seq.2.bias            512                   0
module.conv_reg.1.mha.in_proj_weight    786432                0
module.conv_reg.1.mha.in_proj_bias      1536                  0
module.conv_reg.1.mha.out_proj.weight   262144                0
module.conv_reg.1.mha.out_proj.bias     512                   0
module.conv_reg.1.norm0.weight          512                   0
module.conv_reg.1.norm0.bias            512                   0
module.conv_reg.1.norm1.weight          512                   0
module.conv_reg.1.norm1.bias            512                   0
module.conv_reg.1.seq.0.weight          262144                0
module.conv_reg.1.seq.0.bias            512                   0
module.conv_reg.1.seq.2.weight          262144                0
module.conv_reg.1.seq.2.bias            512                   0
module.conv_reg.2.mha.in_proj_weight    786432                0
module.conv_reg.2.mha.in_proj_bias      1536                  0
module.conv_reg.2.mha.out_proj.weight   262144                0
module.conv_reg.2.mha.out_proj.bias     512                   0
module.conv_reg.2.norm0.weight          512                   0
module.conv_reg.2.norm0.bias            512                   0
module.conv_reg.2.norm1.weight          512                   0
module.conv_reg.2.norm1.bias            512                   0
module.conv_reg.2.seq.0.weight          262144                0
module.conv_reg.2.seq.0.bias            512                   0
module.conv_reg.2.seq.2.weight          262144                0
module.conv_reg.2.seq.2.bias            512                   0
module.nn_id.0.weight                   270848                0
module.nn_id.0.bias                     512                   0
module.nn_id.2.weight                   512                   0
module.nn_id.2.bias                     512                   0
module.nn_id.4.weight                   3072                  0
module.nn_id.4.bias                     6                     0
module.nn_pt.nn.0.weight                273920                0
module.nn_pt.nn.0.bias                  512                   0
module.nn_pt.nn.2.weight                512                   0
module.nn_pt.nn.2.bias                  512                   0
module.nn_pt.nn.4.weight                1024                  0
module.nn_pt.nn.4.bias                  2                     0
module.nn_eta.nn.0.weight               273920                0
module.nn_eta.nn.0.bias                 512                   0
module.nn_eta.nn.2.weight               512                   0
module.nn_eta.nn.2.bias                 512                   0
module.nn_eta.nn.4.weight               1024                  0
module.nn_eta.nn.4.bias                 2                     0
module.nn_sin_phi.nn.0.weight           273920                0
module.nn_sin_phi.nn.0.bias             512                   0
module.nn_sin_phi.nn.2.weight           512                   0
module.nn_sin_phi.nn.2.bias             512                   0
module.nn_sin_phi.nn.4.weight           1024                  0
module.nn_sin_phi.nn.4.bias             2                     0
module.nn_cos_phi.nn.0.weight           273920                0
module.nn_cos_phi.nn.0.bias             512                   0
module.nn_cos_phi.nn.2.weight           512                   0
module.nn_cos_phi.nn.2.bias             512                   0
module.nn_cos_phi.nn.4.weight           1024                  0
module.nn_cos_phi.nn.4.bias             2                     0
module.nn_energy.nn.0.weight            273920                0
module.nn_energy.nn.0.bias              512                   0
module.nn_energy.nn.2.weight            512                   0
module.nn_energy.nn.2.bias              512                   0
module.nn_energy.nn.4.weight            1024                  0
module.nn_energy.nn.4.bias              2                     0
[2024-08-26 13:44:14,276] INFO: Creating experiment dir /pfvol/experiments/Aug26_CLD_fromscratch_10k_pyg-cld_20240826_134406_403025
[2024-08-26 13:44:14,276] INFO: Model directory /pfvol/experiments/Aug26_CLD_fromscratch_10k_pyg-cld_20240826_134406_403025
[2024-08-26 13:44:14,292] INFO: train_dataset: cld_edm_ttbar_pf, 10000
[2024-08-26 13:44:14,364] INFO: valid_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 13:44:14,412] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-26 13:44:46,594] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-26 13:44:55,885] INFO: Rank 0: epoch=1 / 100 train_loss=45.2201 valid_loss=32.7626 stale=0 time=0.69m eta=68.4m
[2024-08-26 13:44:55,887] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-26 13:45:22,322] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-26 13:45:28,970] INFO: Rank 0: epoch=2 / 100 train_loss=31.5884 valid_loss=30.5456 stale=0 time=0.55m eta=60.9m
[2024-08-26 13:45:29,805] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-26 13:45:56,320] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-26 13:46:02,878] INFO: Rank 0: epoch=3 / 100 train_loss=30.4192 valid_loss=30.0116 stale=0 time=0.55m eta=58.5m
[2024-08-26 13:46:03,985] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-26 13:46:30,841] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-26 13:46:37,673] INFO: Rank 0: epoch=4 / 100 train_loss=29.9268 valid_loss=29.3441 stale=0 time=0.56m eta=57.3m
[2024-08-26 13:46:39,183] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-26 13:47:06,829] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-26 13:47:13,926] INFO: Rank 0: epoch=5 / 100 train_loss=29.3929 valid_loss=28.9965 stale=0 time=0.58m eta=56.8m
[2024-08-26 13:47:14,934] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-26 13:47:42,485] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-26 13:47:49,707] INFO: Rank 0: epoch=6 / 100 train_loss=29.0158 valid_loss=28.5107 stale=0 time=0.58m eta=56.2m
[2024-08-26 13:47:50,366] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-26 13:48:18,156] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-26 13:48:25,025] INFO: Rank 0: epoch=7 / 100 train_loss=28.5486 valid_loss=28.2524 stale=0 time=0.58m eta=55.5m
[2024-08-26 13:48:25,840] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-26 13:48:53,773] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-26 13:49:01,147] INFO: Rank 0: epoch=8 / 100 train_loss=28.2622 valid_loss=27.9415 stale=0 time=0.59m eta=55.0m
[2024-08-26 13:49:01,883] INFO: Initiating epoch #9 train run on device rank=0
[2024-08-26 13:49:29,696] INFO: Initiating epoch #9 valid run on device rank=0
[2024-08-26 13:49:36,287] INFO: Rank 0: epoch=9 / 100 train_loss=27.8667 valid_loss=27.7843 stale=0 time=0.57m eta=54.2m
[2024-08-26 13:49:37,223] INFO: Initiating epoch #10 train run on device rank=0
[2024-08-26 13:50:04,967] INFO: Initiating epoch #10 valid run on device rank=0
[2024-08-26 13:50:11,523] INFO: Rank 0: epoch=10 / 100 train_loss=27.5158 valid_loss=27.4986 stale=0 time=0.57m eta=53.6m
[2024-08-26 13:50:12,404] INFO: Initiating epoch #11 train run on device rank=0
[2024-08-26 13:50:40,119] INFO: Initiating epoch #11 valid run on device rank=0
[2024-08-26 13:50:45,444] INFO: Rank 0: epoch=11 / 100 train_loss=27.5096 valid_loss=27.6014 stale=1 time=0.55m eta=52.7m
[2024-08-26 13:50:46,458] INFO: Initiating epoch #12 train run on device rank=0
[2024-08-26 13:51:14,262] INFO: Initiating epoch #12 valid run on device rank=0
[2024-08-26 13:51:21,380] INFO: Rank 0: epoch=12 / 100 train_loss=27.0427 valid_loss=27.1413 stale=0 time=0.58m eta=52.2m
[2024-08-26 13:51:22,116] INFO: Initiating epoch #13 train run on device rank=0
[2024-08-26 13:51:49,722] INFO: Initiating epoch #13 valid run on device rank=0
[2024-08-26 13:51:54,694] INFO: Rank 0: epoch=13 / 100 train_loss=26.8937 valid_loss=27.2193 stale=1 time=0.54m eta=51.3m
[2024-08-26 13:51:55,465] INFO: Initiating epoch #14 train run on device rank=0
[2024-08-26 13:52:23,410] INFO: Initiating epoch #14 valid run on device rank=0
[2024-08-26 13:52:30,295] INFO: Rank 0: epoch=14 / 100 train_loss=26.6465 valid_loss=26.9899 stale=0 time=0.58m eta=50.8m
[2024-08-26 13:52:31,335] INFO: Initiating epoch #15 train run on device rank=0
[2024-08-26 13:52:59,359] INFO: Initiating epoch #15 valid run on device rank=0
[2024-08-26 13:53:07,125] INFO: Rank 0: epoch=15 / 100 train_loss=26.3551 valid_loss=26.8440 stale=0 time=0.6m eta=50.3m
[2024-08-26 13:53:08,067] INFO: Initiating epoch #16 train run on device rank=0
[2024-08-26 13:53:35,815] INFO: Initiating epoch #16 valid run on device rank=0
[2024-08-26 13:53:43,263] INFO: Rank 0: epoch=16 / 100 train_loss=26.0853 valid_loss=26.8128 stale=0 time=0.59m eta=49.8m
[2024-08-26 13:53:44,246] INFO: Initiating epoch #17 train run on device rank=0
[2024-08-26 13:54:11,987] INFO: Initiating epoch #17 valid run on device rank=0
[2024-08-26 13:54:18,807] INFO: Rank 0: epoch=17 / 100 train_loss=25.7558 valid_loss=26.5279 stale=0 time=0.58m eta=49.2m
[2024-08-26 13:54:19,546] INFO: Initiating epoch #18 train run on device rank=0
[2024-08-26 13:54:47,189] INFO: Initiating epoch #18 valid run on device rank=0
[2024-08-26 13:54:54,687] INFO: Rank 0: epoch=18 / 100 train_loss=25.4935 valid_loss=26.5149 stale=0 time=0.59m eta=48.6m
[2024-08-26 13:54:55,897] INFO: Initiating epoch #19 train run on device rank=0
[2024-08-26 13:55:23,323] INFO: Initiating epoch #19 valid run on device rank=0
[2024-08-26 13:55:29,912] INFO: Rank 0: epoch=19 / 100 train_loss=25.2389 valid_loss=26.4797 stale=0 time=0.57m eta=48.0m
[2024-08-26 13:55:30,674] INFO: Initiating epoch #20 train run on device rank=0
[2024-08-26 13:55:58,493] INFO: Initiating epoch #20 valid run on device rank=0
[2024-08-26 13:56:02,422] INFO: Rank 0: epoch=20 / 100 train_loss=25.0099 valid_loss=26.5944 stale=1 time=0.53m eta=47.2m
[2024-08-26 13:56:03,438] INFO: Initiating epoch #21 train run on device rank=0
[2024-08-26 13:56:31,036] INFO: Initiating epoch #21 valid run on device rank=0
[2024-08-26 13:56:35,496] INFO: Rank 0: epoch=21 / 100 train_loss=24.7069 valid_loss=26.5616 stale=2 time=0.53m eta=46.5m
[2024-08-26 13:56:36,233] INFO: Initiating epoch #22 train run on device rank=0
[2024-08-26 13:57:03,873] INFO: Initiating epoch #22 valid run on device rank=0
[2024-08-26 13:57:09,228] INFO: Rank 0: epoch=22 / 100 train_loss=24.3840 valid_loss=26.7096 stale=3 time=0.55m eta=45.8m
[2024-08-26 13:57:09,903] INFO: Initiating epoch #23 train run on device rank=0
[2024-08-26 13:57:37,416] INFO: Initiating epoch #23 valid run on device rank=0
[2024-08-26 13:57:42,041] INFO: Rank 0: epoch=23 / 100 train_loss=24.1537 valid_loss=26.8224 stale=4 time=0.54m eta=45.1m
[2024-08-26 13:57:42,535] INFO: Initiating epoch #24 train run on device rank=0
[2024-08-26 13:58:09,954] INFO: Initiating epoch #24 valid run on device rank=0
[2024-08-26 13:58:14,182] INFO: Rank 0: epoch=24 / 100 train_loss=23.8855 valid_loss=26.8292 stale=5 time=0.53m eta=44.3m
[2024-08-26 13:58:14,790] INFO: Initiating epoch #25 train run on device rank=0
[2024-08-26 13:58:42,571] INFO: Initiating epoch #25 valid run on device rank=0
[2024-08-26 13:58:47,711] INFO: Rank 0: epoch=25 / 100 train_loss=23.6609 valid_loss=27.1336 stale=6 time=0.55m eta=43.7m
[2024-08-26 13:58:48,151] INFO: Initiating epoch #26 train run on device rank=0
[2024-08-26 13:59:15,800] INFO: Initiating epoch #26 valid run on device rank=0
[2024-08-26 13:59:20,139] INFO: Rank 0: epoch=26 / 100 train_loss=23.3309 valid_loss=27.1378 stale=7 time=0.53m eta=43.0m
[2024-08-26 13:59:20,663] INFO: Initiating epoch #27 train run on device rank=0
[2024-08-26 13:59:48,412] INFO: Initiating epoch #27 valid run on device rank=0
[2024-08-26 13:59:52,902] INFO: Rank 0: epoch=27 / 100 train_loss=23.0510 valid_loss=27.4797 stale=8 time=0.54m eta=42.3m
[2024-08-26 13:59:53,623] INFO: Initiating epoch #28 train run on device rank=0
[2024-08-26 14:00:21,367] INFO: Initiating epoch #28 valid run on device rank=0
[2024-08-26 14:00:25,255] INFO: Rank 0: epoch=28 / 100 train_loss=22.7356 valid_loss=27.9814 stale=9 time=0.53m eta=41.6m
[2024-08-26 14:00:25,673] INFO: Initiating epoch #29 train run on device rank=0
[2024-08-26 14:00:53,241] INFO: Initiating epoch #29 valid run on device rank=0
[2024-08-26 14:00:57,890] INFO: Rank 0: epoch=29 / 100 train_loss=22.5966 valid_loss=27.5518 stale=10 time=0.54m eta=40.9m
[2024-08-26 14:00:58,630] INFO: Initiating epoch #30 train run on device rank=0
[2024-08-26 14:01:26,162] INFO: Initiating epoch #30 valid run on device rank=0
[2024-08-26 14:01:30,018] INFO: Rank 0: epoch=30 / 100 train_loss=22.2120 valid_loss=27.9884 stale=11 time=0.52m eta=40.3m
[2024-08-26 14:01:30,563] INFO: Initiating epoch #31 train run on device rank=0
[2024-08-26 14:01:58,282] INFO: Initiating epoch #31 valid run on device rank=0
[2024-08-26 14:02:02,561] INFO: Rank 0: epoch=31 / 100 train_loss=21.7730 valid_loss=28.5953 stale=12 time=0.53m eta=39.6m
[2024-08-26 14:02:03,094] INFO: Initiating epoch #32 train run on device rank=0
[2024-08-26 14:02:30,724] INFO: Initiating epoch #32 valid run on device rank=0
[2024-08-26 14:02:34,775] INFO: Rank 0: epoch=32 / 100 train_loss=21.4779 valid_loss=28.8921 stale=13 time=0.53m eta=39.0m
[2024-08-26 14:02:35,082] INFO: Initiating epoch #33 train run on device rank=0
[2024-08-26 14:03:02,962] INFO: Initiating epoch #33 valid run on device rank=0
[2024-08-26 14:03:06,509] INFO: Rank 0: epoch=33 / 100 train_loss=21.2374 valid_loss=29.1453 stale=14 time=0.52m eta=38.3m
[2024-08-26 14:03:06,867] INFO: Initiating epoch #34 train run on device rank=0
[2024-08-26 14:03:35,022] INFO: Initiating epoch #34 valid run on device rank=0
[2024-08-26 14:03:38,494] INFO: Rank 0: epoch=34 / 100 train_loss=20.9415 valid_loss=29.2509 stale=15 time=0.53m eta=37.7m
[2024-08-26 14:03:39,038] INFO: Initiating epoch #35 train run on device rank=0
[2024-08-26 14:04:06,892] INFO: Initiating epoch #35 valid run on device rank=0
[2024-08-26 14:04:10,771] INFO: Rank 0: epoch=35 / 100 train_loss=20.6351 valid_loss=29.4610 stale=16 time=0.53m eta=37.0m
[2024-08-26 14:04:11,293] INFO: Initiating epoch #36 train run on device rank=0
[2024-08-26 14:04:39,402] INFO: Initiating epoch #36 valid run on device rank=0
[2024-08-26 14:04:43,264] INFO: Rank 0: epoch=36 / 100 train_loss=20.2896 valid_loss=29.9253 stale=17 time=0.53m eta=36.4m
[2024-08-26 14:04:43,845] INFO: Initiating epoch #37 train run on device rank=0
[2024-08-26 14:05:11,435] INFO: Initiating epoch #37 valid run on device rank=0
[2024-08-26 14:05:15,190] INFO: Rank 0: epoch=37 / 100 train_loss=20.0655 valid_loss=30.1508 stale=18 time=0.52m eta=35.8m
[2024-08-26 14:05:15,654] INFO: Initiating epoch #38 train run on device rank=0
[2024-08-26 14:05:43,117] INFO: Initiating epoch #38 valid run on device rank=0
[2024-08-26 14:05:46,880] INFO: Rank 0: epoch=38 / 100 train_loss=19.8505 valid_loss=30.3754 stale=19 time=0.52m eta=35.1m
[2024-08-26 14:05:47,350] INFO: Initiating epoch #39 train run on device rank=0
[2024-08-26 14:06:14,810] INFO: Initiating epoch #39 valid run on device rank=0
[2024-08-26 14:06:18,348] INFO: Rank 0: epoch=39 / 100 train_loss=19.6783 valid_loss=31.0492 stale=20 time=0.52m eta=34.5m
[2024-08-26 14:06:18,756] INFO: Initiating epoch #40 train run on device rank=0
[2024-08-26 14:06:47,049] INFO: Initiating epoch #40 valid run on device rank=0
[2024-08-26 14:06:52,225] INFO: Done with training. Total training time on device 0 is 22.63min