[2024-08-26 13:44:41,299] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 2 gpus
[2024-08-26 13:44:41,364] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 13:44:41,365] INFO: NVIDIA GeForce RTX 2080 Ti
[2024-08-26 13:44:46,235] INFO: using dtype=torch.float32
[2024-08-26 13:44:47,120] INFO: using attention_type=math
[2024-08-26 13:44:47,130] INFO: using attention_type=math
[2024-08-26 13:44:47,140] INFO: using attention_type=math
[2024-08-26 13:44:47,150] INFO: using attention_type=math
[2024-08-26 13:44:47,160] INFO: using attention_type=math
[2024-08-26 13:44:47,170] INFO: using attention_type=math
[2024-08-26 13:44:49,454] INFO: DistributedDataParallel(
  (module): MLPF(
    (nn0_id): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (nn0_reg): Sequential(
      (0): Linear(in_features=17, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=512, bias=True)
    )
    (conv_id): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (conv_reg): ModuleList(
      (0-2): 3 x SelfAttentionLayer(
        (mha): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (norm0): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (seq): Sequential(
          (0): Linear(in_features=512, out_features=512, bias=True)
          (1): ReLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
          (3): ReLU()
        )
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (nn_id): Sequential(
      (0): Linear(in_features=529, out_features=512, bias=True)
      (1): ReLU()
      (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (3): Dropout(p=0.0, inplace=False)
      (4): Linear(in_features=512, out_features=6, bias=True)
    )
    (nn_pt): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_eta): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_sin_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_cos_phi): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
    (nn_energy): RegressionOutput(
      (nn): Sequential(
        (0): Linear(in_features=535, out_features=512, bias=True)
        (1): ReLU()
        (2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (3): Dropout(p=0.0, inplace=False)
        (4): Linear(in_features=512, out_features=2, bias=True)
      )
    )
  )
)
[2024-08-26 13:44:49,454] INFO: Trainable parameters: 11671568
[2024-08-26 13:44:49,454] INFO: Non-trainable parameters: 0
[2024-08-26 13:44:49,454] INFO: Total parameters: 11671568
[2024-08-26 13:44:49,457] INFO:
Modules                                 Trainable parameters  Non-tranable parameters
module.nn0_id.0.weight                  8704                  0
module.nn0_id.0.bias                    512                   0
module.nn0_id.2.weight                  512                   0
module.nn0_id.2.bias                    512                   0
module.nn0_id.4.weight                  262144                0
module.nn0_id.4.bias                    512                   0
module.nn0_reg.0.weight                 8704                  0
module.nn0_reg.0.bias                   512                   0
module.nn0_reg.2.weight                 512                   0
module.nn0_reg.2.bias                   512                   0
module.nn0_reg.4.weight                 262144                0
module.nn0_reg.4.bias                   512                   0
module.conv_id.0.mha.in_proj_weight     786432                0
module.conv_id.0.mha.in_proj_bias       1536                  0
module.conv_id.0.mha.out_proj.weight    262144                0
module.conv_id.0.mha.out_proj.bias      512                   0
module.conv_id.0.norm0.weight           512                   0
module.conv_id.0.norm0.bias             512                   0
module.conv_id.0.norm1.weight           512                   0
module.conv_id.0.norm1.bias             512                   0
module.conv_id.0.seq.0.weight           262144                0
module.conv_id.0.seq.0.bias             512                   0
module.conv_id.0.seq.2.weight           262144                0
module.conv_id.0.seq.2.bias             512                   0
module.conv_id.1.mha.in_proj_weight     786432                0
module.conv_id.1.mha.in_proj_bias       1536                  0
module.conv_id.1.mha.out_proj.weight    262144                0
module.conv_id.1.mha.out_proj.bias      512                   0
module.conv_id.1.norm0.weight           512                   0
module.conv_id.1.norm0.bias             512                   0
module.conv_id.1.norm1.weight           512                   0
module.conv_id.1.norm1.bias             512                   0
module.conv_id.1.seq.0.weight           262144                0
module.conv_id.1.seq.0.bias             512                   0
module.conv_id.1.seq.2.weight           262144                0
module.conv_id.1.seq.2.bias             512                   0
module.conv_id.2.mha.in_proj_weight     786432                0
module.conv_id.2.mha.in_proj_bias       1536                  0
module.conv_id.2.mha.out_proj.weight    262144                0
module.conv_id.2.mha.out_proj.bias      512                   0
module.conv_id.2.norm0.weight           512                   0
module.conv_id.2.norm0.bias             512                   0
module.conv_id.2.norm1.weight           512                   0
module.conv_id.2.norm1.bias             512                   0
module.conv_id.2.seq.0.weight           262144                0
module.conv_id.2.seq.0.bias             512                   0
module.conv_id.2.seq.2.weight           262144                0
module.conv_id.2.seq.2.bias             512                   0
module.conv_reg.0.mha.in_proj_weight    786432                0
module.conv_reg.0.mha.in_proj_bias      1536                  0
module.conv_reg.0.mha.out_proj.weight   262144                0
module.conv_reg.0.mha.out_proj.bias     512                   0
module.conv_reg.0.norm0.weight          512                   0
module.conv_reg.0.norm0.bias            512                   0
module.conv_reg.0.norm1.weight          512                   0
module.conv_reg.0.norm1.bias            512                   0
module.conv_reg.0.seq.0.weight          262144                0
module.conv_reg.0.seq.0.bias            512                   0
module.conv_reg.0.seq.2.weight          262144                0
module.conv_reg.0.seq.2.bias            512                   0
module.conv_reg.1.mha.in_proj_weight    786432                0
module.conv_reg.1.mha.in_proj_bias      1536                  0
module.conv_reg.1.mha.out_proj.weight   262144                0
module.conv_reg.1.mha.out_proj.bias     512                   0
module.conv_reg.1.norm0.weight          512                   0
module.conv_reg.1.norm0.bias            512                   0
module.conv_reg.1.norm1.weight          512                   0
module.conv_reg.1.norm1.bias            512                   0
module.conv_reg.1.seq.0.weight          262144                0
module.conv_reg.1.seq.0.bias            512                   0
module.conv_reg.1.seq.2.weight          262144                0
module.conv_reg.1.seq.2.bias            512                   0
module.conv_reg.2.mha.in_proj_weight    786432                0
module.conv_reg.2.mha.in_proj_bias      1536                  0
module.conv_reg.2.mha.out_proj.weight   262144                0
module.conv_reg.2.mha.out_proj.bias     512                   0
module.conv_reg.2.norm0.weight          512                   0
module.conv_reg.2.norm0.bias            512                   0
module.conv_reg.2.norm1.weight          512                   0
module.conv_reg.2.norm1.bias            512                   0
module.conv_reg.2.seq.0.weight          262144                0
module.conv_reg.2.seq.0.bias            512                   0
module.conv_reg.2.seq.2.weight          262144                0
module.conv_reg.2.seq.2.bias            512                   0
module.nn_id.0.weight                   270848                0
module.nn_id.0.bias                     512                   0
module.nn_id.2.weight                   512                   0
module.nn_id.2.bias                     512                   0
module.nn_id.4.weight                   3072                  0
module.nn_id.4.bias                     6                     0
module.nn_pt.nn.0.weight                273920                0
module.nn_pt.nn.0.bias                  512                   0
module.nn_pt.nn.2.weight                512                   0
module.nn_pt.nn.2.bias                  512                   0
module.nn_pt.nn.4.weight                1024                  0
module.nn_pt.nn.4.bias                  2                     0
module.nn_eta.nn.0.weight               273920                0
module.nn_eta.nn.0.bias                 512                   0
module.nn_eta.nn.2.weight               512                   0
module.nn_eta.nn.2.bias                 512                   0
module.nn_eta.nn.4.weight               1024                  0
module.nn_eta.nn.4.bias                 2                     0
module.nn_sin_phi.nn.0.weight           273920                0
module.nn_sin_phi.nn.0.bias             512                   0
module.nn_sin_phi.nn.2.weight           512                   0
module.nn_sin_phi.nn.2.bias             512                   0
module.nn_sin_phi.nn.4.weight           1024                  0
module.nn_sin_phi.nn.4.bias             2                     0
module.nn_cos_phi.nn.0.weight           273920                0
module.nn_cos_phi.nn.0.bias             512                   0
module.nn_cos_phi.nn.2.weight           512                   0
module.nn_cos_phi.nn.2.bias             512                   0
module.nn_cos_phi.nn.4.weight           1024                  0
module.nn_cos_phi.nn.4.bias             2                     0
module.nn_energy.nn.0.weight            273920                0
module.nn_energy.nn.0.bias              512                   0
module.nn_energy.nn.2.weight            512                   0
module.nn_energy.nn.2.bias              512                   0
module.nn_energy.nn.4.weight            1024                  0
module.nn_energy.nn.4.bias              2                     0
[2024-08-26 13:44:49,514] INFO: Creating experiment dir /pfvol/experiments/Aug26_CLD_fromscratch_100_pyg-cld_20240826_134440_519922
[2024-08-26 13:44:49,514] INFO: Model directory /pfvol/experiments/Aug26_CLD_fromscratch_100_pyg-cld_20240826_134440_519922
[2024-08-26 13:44:49,530] INFO: train_dataset: cld_edm_ttbar_pf, 100
[2024-08-26 13:44:49,564] INFO: valid_dataset: cld_edm_ttbar_pf, 1000
[2024-08-26 13:44:49,629] INFO: Initiating epoch #1 train run on device rank=0
[2024-08-26 13:44:57,212] INFO: Initiating epoch #1 valid run on device rank=0
[2024-08-26 13:45:05,511] INFO: Rank 0: epoch=1 / 100 train_loss=171.3931 valid_loss=140.1817 stale=0 time=0.26m eta=26.2m
[2024-08-26 13:45:05,555] INFO: Initiating epoch #2 train run on device rank=0
[2024-08-26 13:45:07,298] INFO: Initiating epoch #2 valid run on device rank=0
[2024-08-26 13:45:13,910] INFO: Rank 0: epoch=2 / 100 train_loss=127.5241 valid_loss=112.6116 stale=0 time=0.14m eta=19.8m
[2024-08-26 13:45:14,931] INFO: Initiating epoch #3 train run on device rank=0
[2024-08-26 13:45:16,935] INFO: Initiating epoch #3 valid run on device rank=0
[2024-08-26 13:45:23,286] INFO: Rank 0: epoch=3 / 100 train_loss=103.2444 valid_loss=95.7598 stale=0 time=0.14m eta=18.1m
[2024-08-26 13:45:24,046] INFO: Initiating epoch #4 train run on device rank=0
[2024-08-26 13:45:25,936] INFO: Initiating epoch #4 valid run on device rank=0
[2024-08-26 13:45:33,475] INFO: Rank 0: epoch=4 / 100 train_loss=88.9178 valid_loss=84.1871 stale=0 time=0.16m eta=17.5m
[2024-08-26 13:45:34,430] INFO: Initiating epoch #5 train run on device rank=0
[2024-08-26 13:45:36,100] INFO: Initiating epoch #5 valid run on device rank=0
[2024-08-26 13:45:43,348] INFO: Rank 0: epoch=5 / 100 train_loss=79.8078 valid_loss=76.9893 stale=0 time=0.15m eta=17.0m
[2024-08-26 13:45:43,768] INFO: Initiating epoch #6 train run on device rank=0
[2024-08-26 13:45:45,575] INFO: Initiating epoch #6 valid run on device rank=0
[2024-08-26 13:45:52,401] INFO: Rank 0: epoch=6 / 100 train_loss=72.9875 valid_loss=71.2727 stale=0 time=0.14m eta=16.4m
[2024-08-26 13:45:53,283] INFO: Initiating epoch #7 train run on device rank=0
[2024-08-26 13:45:55,114] INFO: Initiating epoch #7 valid run on device rank=0
[2024-08-26 13:46:01,677] INFO: Rank 0: epoch=7 / 100 train_loss=67.7476 valid_loss=66.8665 stale=0 time=0.14m eta=16.0m
[2024-08-26 13:46:02,574] INFO: Initiating epoch #8 train run on device rank=0
[2024-08-26 13:46:04,315] INFO: Initiating epoch #8 valid run on device rank=0
[2024-08-26 13:46:11,278] INFO: Rank 0: epoch=8 / 100 train_loss=63.4527 valid_loss=63.0274 stale=0 time=0.15m eta=15.6m
[2024-08-26 13:46:12,047] INFO: Initiating epoch #9 train run on device rank=0
[2024-08-26 13:46:13,721] INFO: Initiating epoch #9 valid run on device rank=0
[2024-08-26 13:46:20,691] INFO: Rank 0: epoch=9 / 100 train_loss=59.9509 valid_loss=59.8358 stale=0 time=0.14m eta=15.3m
[2024-08-26 13:46:21,985] INFO: Initiating epoch #10 train run on device rank=0
[2024-08-26 13:46:23,690] INFO: Initiating epoch #10 valid run on device rank=0
[2024-08-26 13:46:30,457] INFO: Rank 0: epoch=10 / 100 train_loss=57.0680 valid_loss=57.3374 stale=0 time=0.14m eta=15.1m
[2024-08-26 13:46:31,201] INFO: Initiating epoch #11 train run on device rank=0
[2024-08-26 13:46:33,021] INFO: Initiating epoch #11 valid run on device rank=0
[2024-08-26 13:46:41,331] INFO: Rank 0: epoch=11 / 100 train_loss=54.6374 valid_loss=55.1854 stale=0 time=0.17m eta=15.1m
[2024-08-26 13:46:42,105] INFO: Initiating epoch #12 train run on device rank=0
[2024-08-26 13:46:43,932] INFO: Initiating epoch #12 valid run on device rank=0
[2024-08-26 13:46:50,610] INFO: Rank 0: epoch=12 / 100 train_loss=52.7406 valid_loss=53.4005 stale=0 time=0.14m eta=14.8m
[2024-08-26 13:46:51,637] INFO: Initiating epoch #13 train run on device rank=0
[2024-08-26 13:46:53,435] INFO: Initiating epoch #13 valid run on device rank=0
[2024-08-26 13:47:00,454] INFO: Rank 0: epoch=13 / 100 train_loss=51.0135 valid_loss=52.0414 stale=0 time=0.15m eta=14.6m
[2024-08-26 13:47:01,742] INFO: Initiating epoch #14 train run on device rank=0
[2024-08-26 13:47:03,466] INFO: Initiating epoch #14 valid run on device rank=0
[2024-08-26 13:47:10,775] INFO: Rank 0: epoch=14 / 100 train_loss=49.6180 valid_loss=50.4891 stale=0 time=0.15m eta=14.5m
[2024-08-26 13:47:11,607] INFO: Initiating epoch #15 train run on device rank=0
[2024-08-26 13:47:13,379] INFO: Initiating epoch #15 valid run on device rank=0
[2024-08-26 13:47:21,428] INFO: Rank 0: epoch=15 / 100 train_loss=48.3110 valid_loss=49.3362 stale=0 time=0.16m eta=14.3m
[2024-08-26 13:47:22,808] INFO: Initiating epoch #16 train run on device rank=0
[2024-08-26 13:47:24,603] INFO: Initiating epoch #16 valid run on device rank=0
[2024-08-26 13:47:32,490] INFO: Rank 0: epoch=16 / 100 train_loss=47.1443 valid_loss=48.2088 stale=0 time=0.16m eta=14.3m
[2024-08-26 13:47:33,601] INFO: Initiating epoch #17 train run on device rank=0
[2024-08-26 13:47:35,325] INFO: Initiating epoch #17 valid run on device rank=0
[2024-08-26 13:47:42,631] INFO: Rank 0: epoch=17 / 100 train_loss=46.1467 valid_loss=47.4104 stale=0 time=0.15m eta=14.1m
[2024-08-26 13:47:43,534] INFO: Initiating epoch #18 train run on device rank=0
[2024-08-26 13:47:45,364] INFO: Initiating epoch #18 valid run on device rank=0
[2024-08-26 13:47:51,650] INFO: Rank 0: epoch=18 / 100 train_loss=45.2248 valid_loss=46.6024 stale=0 time=0.14m eta=13.8m
[2024-08-26 13:47:52,481] INFO: Initiating epoch #19 train run on device rank=0
[2024-08-26 13:47:54,146] INFO: Initiating epoch #19 valid run on device rank=0
[2024-08-26 13:48:01,438] INFO: Rank 0: epoch=19 / 100 train_loss=44.4662 valid_loss=45.8603 stale=0 time=0.15m eta=13.6m
[2024-08-26 13:48:02,384] INFO: Initiating epoch #20 train run on device rank=0
[2024-08-26 13:48:04,270] INFO: Initiating epoch #20 valid run on device rank=0
[2024-08-26 13:48:11,415] INFO: Rank 0: epoch=20 / 100 train_loss=43.6950 valid_loss=45.2011 stale=0 time=0.15m eta=13.5m
[2024-08-26 13:48:12,264] INFO: Initiating epoch #21 train run on device rank=0
[2024-08-26 13:48:14,161] INFO: Initiating epoch #21 valid run on device rank=0
[2024-08-26 13:48:21,500] INFO: Rank 0: epoch=21 / 100 train_loss=43.0498 valid_loss=44.4340 stale=0 time=0.15m eta=13.3m
[2024-08-26 13:48:22,430] INFO: Initiating epoch #22 train run on device rank=0
[2024-08-26 13:48:24,241] INFO: Initiating epoch #22 valid run on device rank=0
[2024-08-26 13:48:31,922] INFO: Rank 0: epoch=22 / 100 train_loss=42.4255 valid_loss=44.0795 stale=0 time=0.16m eta=13.1m
[2024-08-26 13:48:32,868] INFO: Initiating epoch #23 train run on device rank=0
[2024-08-26 13:48:34,704] INFO: Initiating epoch #23 valid run on device rank=0
[2024-08-26 13:48:42,503] INFO: Rank 0: epoch=23 / 100 train_loss=41.7909 valid_loss=43.4771 stale=0 time=0.16m eta=13.0m
[2024-08-26 13:48:43,439] INFO: Initiating epoch #24 train run on device rank=0
[2024-08-26 13:48:45,305] INFO: Initiating epoch #24 valid run on device rank=0
[2024-08-26 13:48:52,390] INFO: Rank 0: epoch=24 / 100 train_loss=41.2878 valid_loss=42.9380 stale=0 time=0.15m eta=12.8m
[2024-08-26 13:48:53,443] INFO: Initiating epoch #25 train run on device rank=0
[2024-08-26 13:48:55,181] INFO: Initiating epoch #25 valid run on device rank=0
[2024-08-26 13:49:02,552] INFO: Rank 0: epoch=25 / 100 train_loss=40.7653 valid_loss=42.6678 stale=0 time=0.15m eta=12.6m
[2024-08-26 13:49:03,758] INFO: Initiating epoch #26 train run on device rank=0
[2024-08-26 13:49:05,661] INFO: Initiating epoch #26 valid run on device rank=0
[2024-08-26 13:49:12,815] INFO: Rank 0: epoch=26 / 100 train_loss=40.3013 valid_loss=42.0641 stale=0 time=0.15m eta=12.5m
[2024-08-26 13:49:13,763] INFO: Initiating epoch #27 train run on device rank=0
[2024-08-26 13:49:15,784] INFO: Initiating epoch #27 valid run on device rank=0
[2024-08-26 13:49:22,102] INFO: Rank 0: epoch=27 / 100 train_loss=39.8771 valid_loss=41.8430 stale=0 time=0.14m eta=12.3m
[2024-08-26 13:49:23,306] INFO: Initiating epoch #28 train run on device rank=0
[2024-08-26 13:49:25,264] INFO: Initiating epoch #28 valid run on device rank=0
[2024-08-26 13:49:32,876] INFO: Rank 0: epoch=28 / 100 train_loss=39.4117 valid_loss=41.5461 stale=0 time=0.16m eta=12.1m
[2024-08-26 13:49:33,546] INFO: Initiating epoch #29 train run on device rank=0
[2024-08-26 13:49:35,194] INFO: Initiating epoch #29 valid run on device rank=0
[2024-08-26 13:49:41,691] INFO: Rank 0: epoch=29 / 100 train_loss=38.9850 valid_loss=41.2184 stale=0 time=0.14m eta=11.9m
[2024-08-26 13:49:42,651] INFO: Initiating epoch #30 train run on device rank=0
[2024-08-26 13:49:44,300] INFO: Initiating epoch #30 valid run on device rank=0
[2024-08-26 13:49:51,531] INFO: Rank 0: epoch=30 / 100 train_loss=38.6255 valid_loss=40.9842 stale=0 time=0.15m eta=11.7m
[2024-08-26 13:49:52,384] INFO: Initiating epoch #31 train run on device rank=0
[2024-08-26 13:49:54,038] INFO: Initiating epoch #31 valid run on device rank=0
[2024-08-26 13:50:00,300] INFO: Rank 0: epoch=31 / 100 train_loss=38.2746 valid_loss=40.7224 stale=0 time=0.13m eta=11.5m
[2024-08-26 13:50:01,493] INFO: Initiating epoch #32 train run on device rank=0
[2024-08-26 13:50:03,120] INFO: Initiating epoch #32 valid run on device rank=0
[2024-08-26 13:50:09,129] INFO: Rank 0: epoch=32 / 100 train_loss=37.9751 valid_loss=40.4723 stale=0 time=0.13m eta=11.3m
[2024-08-26 13:50:09,557] INFO: Initiating epoch #33 train run on device rank=0
[2024-08-26 13:50:11,269] INFO: Initiating epoch #33 valid run on device rank=0
[2024-08-26 13:50:18,085] INFO: Rank 0: epoch=33 / 100 train_loss=37.6494 valid_loss=40.3520 stale=0 time=0.14m eta=11.1m
[2024-08-26 13:50:18,741] INFO: Initiating epoch #34 train run on device rank=0
[2024-08-26 13:50:20,505] INFO: Initiating epoch #34 valid run on device rank=0
[2024-08-26 13:50:28,600] INFO: Rank 0: epoch=34 / 100 train_loss=37.3848 valid_loss=40.2349 stale=0 time=0.16m eta=11.0m
[2024-08-26 13:50:29,273] INFO: Initiating epoch #35 train run on device rank=0
[2024-08-26 13:50:31,018] INFO: Initiating epoch #35 valid run on device rank=0
[2024-08-26 13:50:37,143] INFO: Rank 0: epoch=35 / 100 train_loss=37.1451 valid_loss=40.4424 stale=1 time=0.13m eta=10.8m
[2024-08-26 13:50:38,211] INFO: Initiating epoch #36 train run on device rank=0
[2024-08-26 13:50:39,884] INFO: Initiating epoch #36 valid run on device rank=0
[2024-08-26 13:50:47,396] INFO: Rank 0: epoch=36 / 100 train_loss=37.0783 valid_loss=39.7675 stale=0 time=0.15m eta=10.6m
[2024-08-26 13:50:48,268] INFO: Initiating epoch #37 train run on device rank=0
[2024-08-26 13:50:49,958] INFO: Initiating epoch #37 valid run on device rank=0
[2024-08-26 13:50:54,979] INFO: Rank 0: epoch=37 / 100 train_loss=36.8069 valid_loss=39.8278 stale=1 time=0.11m eta=10.4m
[2024-08-26 13:50:55,730] INFO: Initiating epoch #38 train run on device rank=0
[2024-08-26 13:50:57,635] INFO: Initiating epoch #38 valid run on device rank=0
[2024-08-26 13:51:04,314] INFO: Rank 0: epoch=38 / 100 train_loss=36.6700 valid_loss=39.5176 stale=0 time=0.14m eta=10.2m
[2024-08-26 13:51:05,031] INFO: Initiating epoch #39 train run on device rank=0
[2024-08-26 13:51:06,740] INFO: Initiating epoch #39 valid run on device rank=0
[2024-08-26 13:51:13,423] INFO: Rank 0: epoch=39 / 100 train_loss=36.2147 valid_loss=39.4663 stale=0 time=0.14m eta=10.0m
[2024-08-26 13:51:14,175] INFO: Initiating epoch #40 train run on device rank=0
[2024-08-26 13:51:15,801] INFO: Initiating epoch #40 valid run on device rank=0
[2024-08-26 13:51:22,702] INFO: Rank 0: epoch=40 / 100 train_loss=35.9097 valid_loss=39.0889 stale=0 time=0.14m eta=9.8m
[2024-08-26 13:51:23,669] INFO: Initiating epoch #41 train run on device rank=0
[2024-08-26 13:51:25,349] INFO: Initiating epoch #41 valid run on device rank=0
[2024-08-26 13:51:30,234] INFO: Rank 0: epoch=41 / 100 train_loss=35.6130 valid_loss=39.2218 stale=1 time=0.11m eta=9.6m
[2024-08-26 13:51:31,265] INFO: Initiating epoch #42 train run on device rank=0
[2024-08-26 13:51:33,247] INFO: Initiating epoch #42 valid run on device rank=0
[2024-08-26 13:51:51,234] INFO: Rank 0: epoch=42 / 100 train_loss=35.4307 valid_loss=38.9071 stale=0 time=0.33m eta=9.7m
[2024-08-26 13:51:52,414] INFO: Initiating epoch #43 train run on device rank=0
[2024-08-26 13:51:54,112] INFO: Initiating epoch #43 valid run on device rank=0
[2024-08-26 13:52:01,642] INFO: Rank 0: epoch=43 / 100 train_loss=35.1147 valid_loss=38.8330 stale=0 time=0.15m eta=9.5m
[2024-08-26 13:52:02,504] INFO: Initiating epoch #44 train run on device rank=0
[2024-08-26 13:52:04,158] INFO: Initiating epoch #44 valid run on device rank=0
[2024-08-26 13:52:11,253] INFO: Rank 0: epoch=44 / 100 train_loss=34.9096 valid_loss=38.7445 stale=0 time=0.15m eta=9.4m
[2024-08-26 13:52:12,106] INFO: Initiating epoch #45 train run on device rank=0
[2024-08-26 13:52:13,769] INFO: Initiating epoch #45 valid run on device rank=0
[2024-08-26 13:52:20,262] INFO: Rank 0: epoch=45 / 100 train_loss=34.6696 valid_loss=38.6105 stale=0 time=0.14m eta=9.2m
[2024-08-26 13:52:20,920] INFO: Initiating epoch #46 train run on device rank=0
[2024-08-26 13:52:22,589] INFO: Initiating epoch #46 valid run on device rank=0
[2024-08-26 13:52:27,493] INFO: Rank 0: epoch=46 / 100 train_loss=34.4721 valid_loss=38.8167 stale=1 time=0.11m eta=9.0m
[2024-08-26 13:52:28,279] INFO: Initiating epoch #47 train run on device rank=0
[2024-08-26 13:52:30,227] INFO: Initiating epoch #47 valid run on device rank=0
[2024-08-26 13:52:37,655] INFO: Rank 0: epoch=47 / 100 train_loss=34.3216 valid_loss=38.4853 stale=0 time=0.16m eta=8.8m
[2024-08-26 13:52:38,792] INFO: Initiating epoch #48 train run on device rank=0
[2024-08-26 13:52:40,553] INFO: Initiating epoch #48 valid run on device rank=0
[2024-08-26 13:52:45,597] INFO: Rank 0: epoch=48 / 100 train_loss=34.1118 valid_loss=38.8431 stale=1 time=0.11m eta=8.6m
[2024-08-26 13:52:46,298] INFO: Initiating epoch #49 train run on device rank=0
[2024-08-26 13:52:48,177] INFO: Initiating epoch #49 valid run on device rank=0
[2024-08-26 13:52:55,963] INFO: Rank 0: epoch=49 / 100 train_loss=34.1021 valid_loss=38.2227 stale=0 time=0.16m eta=8.4m
[2024-08-26 13:52:56,897] INFO: Initiating epoch #50 train run on device rank=0
[2024-08-26 13:52:58,656] INFO: Initiating epoch #50 valid run on device rank=0
[2024-08-26 13:53:06,526] INFO: Rank 0: epoch=50 / 100 train_loss=33.9092 valid_loss=38.0953 stale=0 time=0.16m eta=8.3m
[2024-08-26 13:53:07,414] INFO: Initiating epoch #51 train run on device rank=0
[2024-08-26 13:53:09,023] INFO: Initiating epoch #51 valid run on device rank=0
[2024-08-26 13:53:13,900] INFO: Rank 0: epoch=51 / 100 train_loss=33.8624 valid_loss=38.6651 stale=1 time=0.11m eta=8.1m
[2024-08-26 13:53:14,751] INFO: Initiating epoch #52 train run on device rank=0
[2024-08-26 13:53:16,679] INFO: Initiating epoch #52 valid run on device rank=0
[2024-08-26 13:53:22,359] INFO: Rank 0: epoch=52 / 100 train_loss=33.7408 valid_loss=38.4001 stale=2 time=0.13m eta=7.9m
[2024-08-26 13:53:23,527] INFO: Initiating epoch #53 train run on device rank=0
[2024-08-26 13:53:25,509] INFO: Initiating epoch #53 valid run on device rank=0
[2024-08-26 13:53:30,068] INFO: Rank 0: epoch=53 / 100 train_loss=33.7663 valid_loss=38.3626 stale=3 time=0.11m eta=7.7m
[2024-08-26 13:53:30,967] INFO: Initiating epoch #54 train run on device rank=0
[2024-08-26 13:53:33,286] INFO: Initiating epoch #54 valid run on device rank=0
[2024-08-26 13:53:40,085] INFO: Rank 0: epoch=54 / 100 train_loss=33.5194 valid_loss=38.0424 stale=0 time=0.15m eta=7.5m
[2024-08-26 13:53:40,826] INFO: Initiating epoch #55 train run on device rank=0
[2024-08-26 13:53:42,530] INFO: Initiating epoch #55 valid run on device rank=0
[2024-08-26 13:53:49,386] INFO: Rank 0: epoch=55 / 100 train_loss=33.0814 valid_loss=37.8894 stale=0 time=0.14m eta=7.4m
[2024-08-26 13:53:50,331] INFO: Initiating epoch #56 train run on device rank=0
[2024-08-26 13:53:51,955] INFO: Initiating epoch #56 valid run on device rank=0
[2024-08-26 13:53:58,796] INFO: Rank 0: epoch=56 / 100 train_loss=32.8441 valid_loss=37.7235 stale=0 time=0.14m eta=7.2m
[2024-08-26 13:53:59,555] INFO: Initiating epoch #57 train run on device rank=0
[2024-08-26 13:54:01,295] INFO: Initiating epoch #57 valid run on device rank=0
[2024-08-26 13:54:08,382] INFO: Rank 0: epoch=57 / 100 train_loss=32.6402 valid_loss=37.7108 stale=0 time=0.15m eta=7.0m
[2024-08-26 13:54:09,297] INFO: Initiating epoch #58 train run on device rank=0
[2024-08-26 13:54:11,165] INFO: Initiating epoch #58 valid run on device rank=0
[2024-08-26 13:54:16,629] INFO: Rank 0: epoch=58 / 100 train_loss=32.4551 valid_loss=37.9126 stale=1 time=0.12m eta=6.8m
[2024-08-26 13:54:17,848] INFO: Initiating epoch #59 train run on device rank=0
[2024-08-26 13:54:19,477] INFO: Initiating epoch #59 valid run on device rank=0
[2024-08-26 13:54:24,496] INFO: Rank 0: epoch=59 / 100 train_loss=32.2918 valid_loss=37.8013 stale=2 time=0.11m eta=6.7m
[2024-08-26 13:54:25,423] INFO: Initiating epoch #60 train run on device rank=0
[2024-08-26 13:54:27,495] INFO: Initiating epoch #60 valid run on device rank=0
[2024-08-26 13:54:32,452] INFO: Rank 0: epoch=60 / 100 train_loss=32.1443 valid_loss=37.7669 stale=3 time=0.12m eta=6.5m
[2024-08-26 13:54:33,369] INFO: Initiating epoch #61 train run on device rank=0
[2024-08-26 13:54:35,465] INFO: Initiating epoch #61 valid run on device rank=0
[2024-08-26 13:54:40,533] INFO: Rank 0: epoch=61 / 100 train_loss=31.9455 valid_loss=37.7802 stale=4 time=0.12m eta=6.3m
[2024-08-26 13:54:41,498] INFO: Initiating epoch #62 train run on device rank=0
[2024-08-26 13:54:43,285] INFO: Initiating epoch #62 valid run on device rank=0
[2024-08-26 13:54:48,226] INFO: Rank 0: epoch=62 / 100 train_loss=31.8528 valid_loss=38.3848 stale=5 time=0.11m eta=6.1m
[2024-08-26 13:54:49,243] INFO: Initiating epoch #63 train run on device rank=0
[2024-08-26 13:54:51,238] INFO: Initiating epoch #63 valid run on device rank=0
[2024-08-26 13:54:56,518] INFO: Rank 0: epoch=63 / 100 train_loss=31.7932 valid_loss=37.8656 stale=6 time=0.12m eta=5.9m
[2024-08-26 13:54:57,242] INFO: Initiating epoch #64 train run on device rank=0
[2024-08-26 13:54:59,006] INFO: Initiating epoch #64 valid run on device rank=0
[2024-08-26 13:55:04,958] INFO: Rank 0: epoch=64 / 100 train_loss=31.6388 valid_loss=37.7642 stale=7 time=0.13m eta=5.8m
[2024-08-26 13:55:05,697] INFO: Initiating epoch #65 train run on device rank=0
[2024-08-26 13:55:07,654] INFO: Initiating epoch #65 valid run on device rank=0
[2024-08-26 13:55:14,293] INFO: Rank 0: epoch=65 / 100 train_loss=31.4031 valid_loss=37.9075 stale=8 time=0.14m eta=5.6m
[2024-08-26 13:55:15,196] INFO: Initiating epoch #66 train run on device rank=0
[2024-08-26 13:55:16,824] INFO: Initiating epoch #66 valid run on device rank=0
[2024-08-26 13:55:21,196] INFO: Rank 0: epoch=66 / 100 train_loss=31.3373 valid_loss=37.7834 stale=9 time=0.1m eta=5.4m
[2024-08-26 13:55:21,749] INFO: Initiating epoch #67 train run on device rank=0
[2024-08-26 13:55:23,581] INFO: Initiating epoch #67 valid run on device rank=0
[2024-08-26 13:55:28,417] INFO: Rank 0: epoch=67 / 100 train_loss=31.6696 valid_loss=38.1168 stale=10 time=0.11m eta=5.2m
[2024-08-26 13:55:29,129] INFO: Initiating epoch #68 train run on device rank=0
[2024-08-26 13:55:31,144] INFO: Initiating epoch #68 valid run on device rank=0
[2024-08-26 13:55:35,075] INFO: Rank 0: epoch=68 / 100 train_loss=31.5404 valid_loss=38.2423 stale=11 time=0.1m eta=5.1m
[2024-08-26 13:55:35,662] INFO: Initiating epoch #69 train run on device rank=0
[2024-08-26 13:55:37,354] INFO: Initiating epoch #69 valid run on device rank=0
[2024-08-26 13:55:44,670] INFO: Rank 0: epoch=69 / 100 train_loss=31.7935 valid_loss=37.6034 stale=0 time=0.15m eta=4.9m
[2024-08-26 13:55:45,341] INFO: Initiating epoch #70 train run on device rank=0
[2024-08-26 13:55:46,960] INFO: Initiating epoch #70 valid run on device rank=0
[2024-08-26 13:55:50,799] INFO: Rank 0: epoch=70 / 100 train_loss=31.2579 valid_loss=37.7425 stale=1 time=0.09m eta=4.7m
[2024-08-26 13:55:51,405] INFO: Initiating epoch #71 train run on device rank=0
[2024-08-26 13:55:53,089] INFO: Initiating epoch #71 valid run on device rank=0
[2024-08-26 13:56:00,456] INFO: Rank 0: epoch=71 / 100 train_loss=30.9250 valid_loss=37.0121 stale=0 time=0.15m eta=4.6m
[2024-08-26 13:56:01,446] INFO: Initiating epoch #72 train run on device rank=0
[2024-08-26 13:56:03,336] INFO: Initiating epoch #72 valid run on device rank=0
[2024-08-26 13:56:07,696] INFO: Rank 0: epoch=72 / 100 train_loss=30.5748 valid_loss=37.1463 stale=1 time=0.1m eta=4.4m
[2024-08-26 13:56:08,728] INFO: Initiating epoch #73 train run on device rank=0
[2024-08-26 13:56:10,544] INFO: Initiating epoch #73 valid run on device rank=0
[2024-08-26 13:56:15,911] INFO: Rank 0: epoch=73 / 100 train_loss=30.4850 valid_loss=37.2062 stale=2 time=0.12m eta=4.2m
[2024-08-26 13:56:16,572] INFO: Initiating epoch #74 train run on device rank=0
[2024-08-26 13:56:18,542] INFO: Initiating epoch #74 valid run on device rank=0
[2024-08-26 13:56:23,063] INFO: Rank 0: epoch=74 / 100 train_loss=30.2967 valid_loss=37.5314 stale=3 time=0.11m eta=4.1m
[2024-08-26 13:56:24,410] INFO: Initiating epoch #75 train run on device rank=0
[2024-08-26 13:56:26,463] INFO: Initiating epoch #75 valid run on device rank=0
[2024-08-26 13:56:30,996] INFO: Rank 0: epoch=75 / 100 train_loss=30.1924 valid_loss=37.6345 stale=4 time=0.11m eta=3.9m
[2024-08-26 13:56:31,501] INFO: Initiating epoch #76 train run on device rank=0
[2024-08-26 13:56:33,589] INFO: Initiating epoch #76 valid run on device rank=0
[2024-08-26 13:56:38,688] INFO: Rank 0: epoch=76 / 100 train_loss=29.9604 valid_loss=37.4989 stale=5 time=0.12m eta=3.7m
[2024-08-26 13:56:39,599] INFO: Initiating epoch #77 train run on device rank=0
[2024-08-26 13:56:41,535] INFO: Initiating epoch #77 valid run on device rank=0
[2024-08-26 13:56:46,031] INFO: Rank 0: epoch=77 / 100 train_loss=29.8711 valid_loss=37.6020 stale=6 time=0.11m eta=3.6m
[2024-08-26 13:56:46,519] INFO: Initiating epoch #78 train run on device rank=0
[2024-08-26 13:56:48,513] INFO: Initiating epoch #78 valid run on device rank=0
[2024-08-26 13:56:52,824] INFO: Rank 0: epoch=78 / 100 train_loss=29.7666 valid_loss=37.6484 stale=7 time=0.11m eta=3.4m
[2024-08-26 13:56:53,414] INFO: Initiating epoch #79 train run on device rank=0
[2024-08-26 13:56:55,171] INFO: Initiating epoch #79 valid run on device rank=0
[2024-08-26 13:56:59,951] INFO: Rank 0: epoch=79 / 100 train_loss=29.7206 valid_loss=38.0210 stale=8 time=0.11m eta=3.2m
[2024-08-26 13:57:00,792] INFO: Initiating epoch #80 train run on device rank=0
[2024-08-26 13:57:02,720] INFO: Initiating epoch #80 valid run on device rank=0
[2024-08-26 13:57:07,732] INFO: Rank 0: epoch=80 / 100 train_loss=29.6602 valid_loss=38.1379 stale=9 time=0.12m eta=3.1m
[2024-08-26 13:57:08,434] INFO: Initiating epoch #81 train run on device rank=0
[2024-08-26 13:57:10,262] INFO: Initiating epoch #81 valid run on device rank=0
[2024-08-26 13:57:14,503] INFO: Rank 0: epoch=81 / 100 train_loss=29.8082 valid_loss=38.7325 stale=10 time=0.1m eta=2.9m
[2024-08-26 13:57:15,046] INFO: Initiating epoch #82 train run on device rank=0
[2024-08-26 13:57:17,174] INFO: Initiating epoch #82 valid run on device rank=0
[2024-08-26 13:57:22,821] INFO: Rank 0: epoch=82 / 100 train_loss=30.0731 valid_loss=37.6889 stale=11 time=0.13m eta=2.8m
[2024-08-26 13:57:23,453] INFO: Initiating epoch #83 train run on device rank=0
[2024-08-26 13:57:25,156] INFO: Initiating epoch #83 valid run on device rank=0
[2024-08-26 13:57:29,551] INFO: Rank 0: epoch=83 / 100 train_loss=29.9014 valid_loss=37.2835 stale=12 time=0.1m eta=2.6m
[2024-08-26 13:57:30,713] INFO: Initiating epoch #84 train run on device rank=0
[2024-08-26 13:57:32,688] INFO: Initiating epoch #84 valid run on device rank=0
[2024-08-26 13:57:37,596] INFO: Rank 0: epoch=84 / 100 train_loss=29.8992 valid_loss=37.8842 stale=13 time=0.11m eta=2.4m
[2024-08-26 13:57:38,288] INFO: Initiating epoch #85 train run on device rank=0
[2024-08-26 13:57:40,040] INFO: Initiating epoch #85 valid run on device rank=0
[2024-08-26 13:57:44,588] INFO: Rank 0: epoch=85 / 100 train_loss=29.6210 valid_loss=37.6647 stale=14 time=0.1m eta=2.3m
[2024-08-26 13:57:45,486] INFO: Initiating epoch #86 train run on device rank=0
[2024-08-26 13:57:47,429] INFO: Initiating epoch #86 valid run on device rank=0
[2024-08-26 13:57:52,306] INFO: Rank 0: epoch=86 / 100 train_loss=29.3055 valid_loss=37.3211 stale=15 time=0.11m eta=2.1m
[2024-08-26 13:57:53,221] INFO: Initiating epoch #87 train run on device rank=0
[2024-08-26 13:57:55,195] INFO: Initiating epoch #87 valid run on device rank=0
[2024-08-26 13:57:59,693] INFO: Rank 0: epoch=87 / 100 train_loss=29.0857 valid_loss=37.1285 stale=16 time=0.11m eta=2.0m
[2024-08-26 13:58:00,174] INFO: Initiating epoch #88 train run on device rank=0
[2024-08-26 13:58:02,140] INFO: Initiating epoch #88 valid run on device rank=0
[2024-08-26 13:58:06,712] INFO: Rank 0: epoch=88 / 100 train_loss=28.8939 valid_loss=37.4519 stale=17 time=0.11m eta=1.8m
[2024-08-26 13:58:07,606] INFO: Initiating epoch #89 train run on device rank=0
[2024-08-26 13:58:09,577] INFO: Initiating epoch #89 valid run on device rank=0
[2024-08-26 13:58:13,562] INFO: Rank 0: epoch=89 / 100 train_loss=28.6558 valid_loss=37.3080 stale=18 time=0.1m eta=1.7m
[2024-08-26 13:58:14,173] INFO: Initiating epoch #90 train run on device rank=0
[2024-08-26 13:58:16,089] INFO: Initiating epoch #90 valid run on device rank=0
[2024-08-26 13:58:20,050] INFO: Rank 0: epoch=90 / 100 train_loss=28.4517 valid_loss=37.7124 stale=19 time=0.1m eta=1.5m
[2024-08-26 13:58:20,602] INFO: Initiating epoch #91 train run on device rank=0
[2024-08-26 13:58:22,544] INFO: Initiating epoch #91 valid run on device rank=0
[2024-08-26 13:58:26,814] INFO: Rank 0: epoch=91 / 100 train_loss=28.3671 valid_loss=37.8489 stale=20 time=0.1m eta=1.3m
[2024-08-26 13:58:27,380] INFO: Initiating epoch #92 train run on device rank=0
[2024-08-26 13:58:29,264] INFO: Initiating epoch #92 valid run on device rank=0
[2024-08-26 13:58:33,940] INFO: Done with training. Total training time on device 0 is 13.739min