[2024-06-21 12:01:57,182] INFO: Will use torch.nn.parallel.DistributedDataParallel() and 4 gpus [2024-06-21 12:01:57,269] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-21 12:01:57,269] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-21 12:01:57,269] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-21 12:01:57,269] INFO: NVIDIA GeForce RTX 2080 Ti [2024-06-21 12:02:01,629] INFO: using dtype=torch.float32 [2024-06-21 12:02:02,688] INFO: using attention_type=math [2024-06-21 12:02:02,699] INFO: using attention_type=math [2024-06-21 12:02:02,710] INFO: using attention_type=math [2024-06-21 12:02:02,720] INFO: using attention_type=math [2024-06-21 12:02:02,731] INFO: using attention_type=math [2024-06-21 12:02:02,741] INFO: using attention_type=math [2024-06-21 12:02:08,351] INFO: mlpf_kwargs: {'input_dim': 17, 'num_classes': 6, 'input_encoding': 'joint', 'pt_mode': 'linear', 'eta_mode': 'linear', 'sin_phi_mode': 'linear', 'cos_phi_mode': 'linear', 'energy_mode': 'linear', 'elemtypes_nonzero': [1, 2], 'learned_representation_mode': 'last', 'conv_type': 'attention', 'num_convs': 3, 'dropout_ff': 0.0, 'dropout_conv_id_mha': 0.0, 'dropout_conv_id_ff': 0.0, 'dropout_conv_reg_mha': 0.0, 'dropout_conv_reg_ff': 0.0, 'activation': 'relu', 'head_dim': 16, 'num_heads': 32, 'attention_type': 'math'} [2024-06-21 12:02:08,351] INFO: Loaded model weights from /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/best_weights.pth [2024-06-21 12:02:09,642] INFO: DistributedDataParallel( (module): DeepMET( (nn): Sequential( (0): Linear(in_features=6, out_features=256, bias=True) (1): ELU(alpha=1.0) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (3): Dropout(p=0, inplace=False) (4): Linear(in_features=256, out_features=2, bias=True) ) ) ) [2024-06-21 12:02:09,642] INFO: DeepMET Trainable parameters: 2818 [2024-06-21 12:02:09,642] INFO: DeepMET Non-trainable parameters: 0 [2024-06-21 12:02:09,642] INFO: DeepMET Total parameters: 2818 [2024-06-21 12:02:09,644] INFO: Modules 
Trainable parameters Non-trainable parameters module.nn.0.weight 1536 0 module.nn.0.bias 256 0 module.nn.2.weight 256 0 module.nn.2.bias 256 0 module.nn.4.weight 512 0 module.nn.4.bias 2 0 [2024-06-21 12:02:09,645] INFO: Creating experiment dir /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/Finetuning_PFCands_20240621_120157_083614 [2024-06-21 12:02:09,645] INFO: Model directory /pfvol/experiments/MLPF_clic_backbone_pyg-clic_20240429_101112_971749/Finetuning_PFCands_20240621_120157_083614 [2024-06-21 12:02:12,549] INFO: train_dataset: clic_edm_ttbar_pf, 800800 [2024-06-21 12:02:12,852] INFO: valid_dataset: clic_edm_ttbar_pf, 200200 [2024-06-21 12:02:12,901] INFO: Initiating epoch #1 train run on device rank=0 [2024-06-21 12:04:35,906] INFO: Initiating epoch #1 valid run on device rank=0 [2024-06-21 12:05:12,862] INFO: Rank 0: epoch=1 / 400 train_loss=10.8688 valid_loss=10.7552 stale=0 time=3.0m eta=1196.7m [2024-06-21 12:05:12,863] INFO: Initiating epoch #2 train run on device rank=0 [2024-06-21 12:07:17,059] INFO: Initiating epoch #2 valid run on device rank=0 [2024-06-21 12:07:51,027] INFO: Rank 0: epoch=2 / 400 train_loss=10.6495 valid_loss=10.6922 stale=0 time=2.64m eta=1121.4m [2024-06-21 12:07:51,043] INFO: Initiating epoch #3 train run on device rank=0 [2024-06-21 12:09:59,549] INFO: Initiating epoch #3 valid run on device rank=0 [2024-06-21 12:10:33,007] INFO: Rank 0: epoch=3 / 400 train_loss=10.5918 valid_loss=10.6327 stale=0 time=2.7m eta=1103.0m [2024-06-21 12:10:33,025] INFO: Initiating epoch #4 train run on device rank=0 [2024-06-21 12:12:40,272] INFO: Initiating epoch #4 valid run on device rank=0 [2024-06-21 12:13:16,011] INFO: Rank 0: epoch=4 / 400 train_loss=10.5366 valid_loss=10.5754 stale=0 time=2.72m eta=1094.1m [2024-06-21 12:13:16,035] INFO: Initiating epoch #5 train run on device rank=0 [2024-06-21 12:15:26,614] INFO: Initiating epoch #5 valid run on device rank=0 [2024-06-21 12:16:01,653] INFO: Rank 0: epoch=5 / 400 
train_loss=10.4889 valid_loss=10.5301 stale=0 time=2.76m eta=1091.2m [2024-06-21 12:16:01,697] INFO: Initiating epoch #6 train run on device rank=0 [2024-06-21 12:18:13,278] INFO: Initiating epoch #6 valid run on device rank=0 [2024-06-21 12:18:47,691] INFO: Rank 0: epoch=6 / 400 train_loss=10.4519 valid_loss=10.4976 stale=0 time=2.77m eta=1088.7m [2024-06-21 12:18:47,738] INFO: Initiating epoch #7 train run on device rank=0 [2024-06-21 12:20:56,956] INFO: Initiating epoch #7 valid run on device rank=0 [2024-06-21 12:21:30,498] INFO: Rank 0: epoch=7 / 400 train_loss=10.4259 valid_loss=10.4723 stale=0 time=2.71m eta=1083.2m [2024-06-21 12:21:30,785] INFO: Initiating epoch #8 train run on device rank=0 [2024-06-21 12:23:43,303] INFO: Initiating epoch #8 valid run on device rank=0 [2024-06-21 12:24:17,668] INFO: Rank 0: epoch=8 / 400 train_loss=10.4092 valid_loss=10.4550 stale=0 time=2.78m eta=1081.9m [2024-06-21 12:24:17,700] INFO: Initiating epoch #9 train run on device rank=0 [2024-06-21 12:26:26,483] INFO: Initiating epoch #9 valid run on device rank=0 [2024-06-21 12:27:00,233] INFO: Rank 0: epoch=9 / 400 train_loss=10.3980 valid_loss=10.4421 stale=0 time=2.71m eta=1076.9m [2024-06-21 12:27:00,258] INFO: Initiating epoch #10 train run on device rank=0 [2024-06-21 12:29:07,600] INFO: Initiating epoch #10 valid run on device rank=0 [2024-06-21 12:29:43,828] INFO: Rank 0: epoch=10 / 400 train_loss=10.3896 valid_loss=10.4320 stale=0 time=2.73m eta=1073.1m [2024-06-21 12:29:43,840] INFO: Initiating epoch #11 train run on device rank=0 [2024-06-21 12:31:55,585] INFO: Initiating epoch #11 valid run on device rank=0 [2024-06-21 12:32:35,040] INFO: Rank 0: epoch=11 / 400 train_loss=10.3825 valid_loss=10.4233 stale=0 time=2.85m eta=1074.0m [2024-06-21 12:32:35,558] INFO: Initiating epoch #12 train run on device rank=0 [2024-06-21 12:34:40,314] INFO: Initiating epoch #12 valid run on device rank=0 [2024-06-21 12:35:16,624] INFO: Rank 0: epoch=12 / 400 train_loss=10.3761 
valid_loss=10.4156 stale=0 time=2.68m eta=1069.0m [2024-06-21 12:35:17,182] INFO: Initiating epoch #13 train run on device rank=0 [2024-06-21 12:37:24,611] INFO: Initiating epoch #13 valid run on device rank=0 [2024-06-21 12:38:00,592] INFO: Rank 0: epoch=13 / 400 train_loss=10.3698 valid_loss=10.4085 stale=0 time=2.72m eta=1065.6m [2024-06-21 12:38:01,196] INFO: Initiating epoch #14 train run on device rank=0 [2024-06-21 12:40:08,400] INFO: Initiating epoch #14 valid run on device rank=0 [2024-06-21 12:40:44,359] INFO: Rank 0: epoch=14 / 400 train_loss=10.3637 valid_loss=10.4017 stale=0 time=2.72m eta=1062.2m [2024-06-21 12:40:45,108] INFO: Initiating epoch #15 train run on device rank=0 [2024-06-21 12:42:49,184] INFO: Initiating epoch #15 valid run on device rank=0 [2024-06-21 12:43:26,268] INFO: Rank 0: epoch=15 / 400 train_loss=10.3578 valid_loss=10.3952 stale=0 time=2.69m eta=1058.1m [2024-06-21 12:43:26,797] INFO: Initiating epoch #16 train run on device rank=0 [2024-06-21 12:45:31,414] INFO: Initiating epoch #16 valid run on device rank=0 [2024-06-21 12:46:07,024] INFO: Rank 0: epoch=16 / 400 train_loss=10.3521 valid_loss=10.3887 stale=0 time=2.67m eta=1053.6m [2024-06-21 12:46:07,584] INFO: Initiating epoch #17 train run on device rank=0 [2024-06-21 12:48:15,303] INFO: Initiating epoch #17 valid run on device rank=0 [2024-06-21 12:48:50,645] INFO: Rank 0: epoch=17 / 400 train_loss=10.3465 valid_loss=10.3822 stale=0 time=2.72m eta=1050.5m [2024-06-21 12:48:51,301] INFO: Initiating epoch #18 train run on device rank=0 [2024-06-21 12:50:55,789] INFO: Initiating epoch #18 valid run on device rank=0 [2024-06-21 12:51:31,626] INFO: Rank 0: epoch=18 / 400 train_loss=10.3410 valid_loss=10.3756 stale=0 time=2.67m eta=1046.5m [2024-06-21 12:51:32,116] INFO: Initiating epoch #19 train run on device rank=0 [2024-06-21 12:53:37,990] INFO: Initiating epoch #19 valid run on device rank=0 [2024-06-21 12:54:14,251] INFO: Rank 0: epoch=19 / 400 train_loss=10.3356 
valid_loss=10.3693 stale=0 time=2.7m eta=1043.2m [2024-06-21 12:54:15,274] INFO: Initiating epoch #20 train run on device rank=0 [2024-06-21 12:56:21,560] INFO: Initiating epoch #20 valid run on device rank=0 [2024-06-21 12:56:58,302] INFO: Rank 0: epoch=20 / 400 train_loss=10.3303 valid_loss=10.3629 stale=0 time=2.72m eta=1040.4m [2024-06-21 12:56:58,829] INFO: Initiating epoch #21 train run on device rank=0 [2024-06-21 12:59:07,664] INFO: Initiating epoch #21 valid run on device rank=0 [2024-06-21 12:59:45,002] INFO: Rank 0: epoch=21 / 400 train_loss=10.3251 valid_loss=10.3568 stale=0 time=2.77m eta=1038.4m [2024-06-21 12:59:45,466] INFO: Initiating epoch #22 train run on device rank=0 [2024-06-21 13:01:53,766] INFO: Initiating epoch #22 valid run on device rank=0 [2024-06-21 13:02:31,355] INFO: Rank 0: epoch=22 / 400 train_loss=10.3201 valid_loss=10.3508 stale=0 time=2.76m eta=1036.2m [2024-06-21 13:02:31,915] INFO: Initiating epoch #23 train run on device rank=0 [2024-06-21 13:04:35,959] INFO: Initiating epoch #23 valid run on device rank=0 [2024-06-21 13:05:13,445] INFO: Rank 0: epoch=23 / 400 train_loss=10.3151 valid_loss=10.3449 stale=0 time=2.69m eta=1032.8m [2024-06-21 13:05:14,093] INFO: Initiating epoch #24 train run on device rank=0 [2024-06-21 13:07:20,082] INFO: Initiating epoch #24 valid run on device rank=0 [2024-06-21 13:07:57,657] INFO: Rank 0: epoch=24 / 400 train_loss=10.3103 valid_loss=10.3392 stale=0 time=2.73m eta=1030.0m [2024-06-21 13:07:58,168] INFO: Initiating epoch #25 train run on device rank=0 [2024-06-21 13:10:05,018] INFO: Initiating epoch #25 valid run on device rank=0 [2024-06-21 13:10:42,392] INFO: Rank 0: epoch=25 / 400 train_loss=10.3056 valid_loss=10.3337 stale=0 time=2.74m eta=1027.4m [2024-06-21 13:10:42,923] INFO: Initiating epoch #26 train run on device rank=0 [2024-06-21 13:12:50,002] INFO: Initiating epoch #26 valid run on device rank=0 [2024-06-21 13:13:23,234] INFO: Rank 0: epoch=26 / 400 train_loss=10.3010 
valid_loss=10.3284 stale=0 time=2.67m eta=1023.8m [2024-06-21 13:13:23,277] INFO: Initiating epoch #27 train run on device rank=0 [2024-06-21 13:15:34,427] INFO: Initiating epoch #27 valid run on device rank=0 [2024-06-21 13:16:08,682] INFO: Rank 0: epoch=27 / 400 train_loss=10.2965 valid_loss=10.3233 stale=0 time=2.76m eta=1021.3m [2024-06-21 13:16:08,720] INFO: Initiating epoch #28 train run on device rank=0 [2024-06-21 13:18:17,392] INFO: Initiating epoch #28 valid run on device rank=0 [2024-06-21 13:18:53,022] INFO: Rank 0: epoch=28 / 400 train_loss=10.2922 valid_loss=10.3185 stale=0 time=2.74m eta=1018.6m [2024-06-21 13:18:53,186] INFO: Initiating epoch #29 train run on device rank=0 [2024-06-21 13:21:00,251] INFO: Initiating epoch #29 valid run on device rank=0 [2024-06-21 13:21:34,398] INFO: Rank 0: epoch=29 / 400 train_loss=10.2879 valid_loss=10.3140 stale=0 time=2.69m eta=1015.2m [2024-06-21 13:21:34,436] INFO: Initiating epoch #30 train run on device rank=0 [2024-06-21 13:23:41,552] INFO: Initiating epoch #30 valid run on device rank=0 [2024-06-21 13:24:15,090] INFO: Rank 0: epoch=30 / 400 train_loss=10.2838 valid_loss=10.3096 stale=0 time=2.68m eta=1011.8m [2024-06-21 13:24:15,140] INFO: Initiating epoch #31 train run on device rank=0 [2024-06-21 13:26:24,684] INFO: Initiating epoch #31 valid run on device rank=0 [2024-06-21 13:26:58,507] INFO: Rank 0: epoch=31 / 400 train_loss=10.2799 valid_loss=10.3055 stale=0 time=2.72m eta=1008.9m [2024-06-21 13:26:58,516] INFO: Initiating epoch #32 train run on device rank=0 [2024-06-21 13:29:08,498] INFO: Initiating epoch #32 valid run on device rank=0 [2024-06-21 13:29:41,625] INFO: Rank 0: epoch=32 / 400 train_loss=10.2761 valid_loss=10.3018 stale=0 time=2.72m eta=1006.0m [2024-06-21 13:29:41,651] INFO: Initiating epoch #33 train run on device rank=0 [2024-06-21 13:31:50,834] INFO: Initiating epoch #33 valid run on device rank=0 [2024-06-21 13:32:24,318] INFO: Rank 0: epoch=33 / 400 train_loss=10.2725 
valid_loss=10.2982 stale=0 time=2.71m eta=1003.0m [2024-06-21 13:32:24,334] INFO: Initiating epoch #34 train run on device rank=0 [2024-06-21 13:34:34,539] INFO: Initiating epoch #34 valid run on device rank=0 [2024-06-21 13:35:08,191] INFO: Rank 0: epoch=34 / 400 train_loss=10.2688 valid_loss=10.2950 stale=0 time=2.73m eta=1000.3m [2024-06-21 13:35:08,287] INFO: Initiating epoch #35 train run on device rank=0 [2024-06-21 13:37:17,756] INFO: Initiating epoch #35 valid run on device rank=0 [2024-06-21 13:37:51,975] INFO: Rank 0: epoch=35 / 400 train_loss=10.2653 valid_loss=10.2918 stale=0 time=2.73m eta=997.5m [2024-06-21 13:37:51,991] INFO: Initiating epoch #36 train run on device rank=0 [2024-06-21 13:39:57,549] INFO: Initiating epoch #36 valid run on device rank=0 [2024-06-21 13:40:31,230] INFO: Rank 0: epoch=36 / 400 train_loss=10.2620 valid_loss=10.2890 stale=0 time=2.65m eta=994.0m [2024-06-21 13:40:31,288] INFO: Initiating epoch #37 train run on device rank=0 [2024-06-21 13:42:37,354] INFO: Initiating epoch #37 valid run on device rank=0 [2024-06-21 13:43:10,870] INFO: Rank 0: epoch=37 / 400 train_loss=10.2590 valid_loss=10.2865 stale=0 time=2.66m eta=990.6m [2024-06-21 13:43:10,894] INFO: Initiating epoch #38 train run on device rank=0 [2024-06-21 13:45:17,744] INFO: Initiating epoch #38 valid run on device rank=0 [2024-06-21 13:45:55,510] INFO: Rank 0: epoch=38 / 400 train_loss=10.2560 valid_loss=10.2841 stale=0 time=2.74m eta=988.0m [2024-06-21 13:45:56,296] INFO: Initiating epoch #39 train run on device rank=0 [2024-06-21 13:48:02,079] INFO: Initiating epoch #39 valid run on device rank=0 [2024-06-21 13:48:38,503] INFO: Rank 0: epoch=39 / 400 train_loss=10.2532 valid_loss=10.2819 stale=0 time=2.7m eta=985.1m [2024-06-21 13:48:39,128] INFO: Initiating epoch #40 train run on device rank=0 [2024-06-21 13:50:46,308] INFO: Initiating epoch #40 valid run on device rank=0 [2024-06-21 13:51:23,263] INFO: Rank 0: epoch=40 / 400 train_loss=10.2504 
valid_loss=10.2797 stale=0 time=2.74m eta=982.6m [2024-06-21 13:51:24,259] INFO: Initiating epoch #41 train run on device rank=0 [2024-06-21 13:53:27,884] INFO: Initiating epoch #41 valid run on device rank=0 [2024-06-21 13:54:03,854] INFO: Rank 0: epoch=41 / 400 train_loss=10.2478 valid_loss=10.2775 stale=0 time=2.66m eta=979.4m [2024-06-21 13:54:04,717] INFO: Initiating epoch #42 train run on device rank=0 [2024-06-21 13:56:08,729] INFO: Initiating epoch #42 valid run on device rank=0 [2024-06-21 13:56:45,787] INFO: Rank 0: epoch=42 / 400 train_loss=10.2452 valid_loss=10.2755 stale=0 time=2.68m eta=976.4m [2024-06-21 13:56:46,868] INFO: Initiating epoch #43 train run on device rank=0 [2024-06-21 13:58:51,583] INFO: Initiating epoch #43 valid run on device rank=0 [2024-06-21 13:59:28,922] INFO: Rank 0: epoch=43 / 400 train_loss=10.2428 valid_loss=10.2736 stale=0 time=2.7m eta=973.6m [2024-06-21 13:59:29,876] INFO: Initiating epoch #44 train run on device rank=0 [2024-06-21 14:01:33,632] INFO: Initiating epoch #44 valid run on device rank=0 [2024-06-21 14:02:12,007] INFO: Rank 0: epoch=44 / 400 train_loss=10.2405 valid_loss=10.2717 stale=0 time=2.7m eta=970.8m [2024-06-21 14:02:12,838] INFO: Initiating epoch #45 train run on device rank=0 [2024-06-21 14:04:16,595] INFO: Initiating epoch #45 valid run on device rank=0 [2024-06-21 14:04:53,698] INFO: Rank 0: epoch=45 / 400 train_loss=10.2382 valid_loss=10.2698 stale=0 time=2.68m eta=967.8m [2024-06-21 14:04:54,587] INFO: Initiating epoch #46 train run on device rank=0 [2024-06-21 14:06:58,143] INFO: Initiating epoch #46 valid run on device rank=0 [2024-06-21 14:07:35,591] INFO: Rank 0: epoch=46 / 400 train_loss=10.2360 valid_loss=10.2682 stale=0 time=2.68m eta=964.9m [2024-06-21 14:07:36,506] INFO: Initiating epoch #47 train run on device rank=0 [2024-06-21 14:09:41,457] INFO: Initiating epoch #47 valid run on device rank=0 [2024-06-21 14:10:17,603] INFO: Rank 0: epoch=47 / 400 train_loss=10.2339 valid_loss=10.2665 
stale=0 time=2.68m eta=962.0m [2024-06-21 14:10:18,261] INFO: Initiating epoch #48 train run on device rank=0 [2024-06-21 14:12:24,271] INFO: Initiating epoch #48 valid run on device rank=0 [2024-06-21 14:13:00,633] INFO: Rank 0: epoch=48 / 400 train_loss=10.2320 valid_loss=10.2649 stale=0 time=2.71m eta=959.2m [2024-06-21 14:13:01,357] INFO: Initiating epoch #49 train run on device rank=0 [2024-06-21 14:15:05,908] INFO: Initiating epoch #49 valid run on device rank=0 [2024-06-21 14:15:42,360] INFO: Rank 0: epoch=49 / 400 train_loss=10.2301 valid_loss=10.2633 stale=0 time=2.68m eta=956.2m [2024-06-21 14:15:42,836] INFO: Initiating epoch #50 train run on device rank=0 [2024-06-21 14:17:50,533] INFO: Initiating epoch #50 valid run on device rank=0 [2024-06-21 14:18:26,738] INFO: Rank 0: epoch=50 / 400 train_loss=10.2282 valid_loss=10.2617 stale=0 time=2.73m eta=953.6m [2024-06-21 14:18:27,391] INFO: Initiating epoch #51 train run on device rank=0 [2024-06-21 14:20:31,455] INFO: Initiating epoch #51 valid run on device rank=0 [2024-06-21 14:21:04,905] INFO: Rank 0: epoch=51 / 400 train_loss=10.2263 valid_loss=10.2601 stale=0 time=2.63m eta=950.3m [2024-06-21 14:21:04,928] INFO: Initiating epoch #52 train run on device rank=0 [2024-06-21 14:23:10,967] INFO: Initiating epoch #52 valid run on device rank=0 [2024-06-21 14:23:44,963] INFO: Rank 0: epoch=52 / 400 train_loss=10.2246 valid_loss=10.2585 stale=0 time=2.67m eta=947.2m [2024-06-21 14:23:45,007] INFO: Initiating epoch #53 train run on device rank=0 [2024-06-21 14:25:51,897] INFO: Initiating epoch #53 valid run on device rank=0 [2024-06-21 14:26:26,637] INFO: Rank 0: epoch=53 / 400 train_loss=10.2229 valid_loss=10.2570 stale=0 time=2.69m eta=944.3m [2024-06-21 14:26:26,683] INFO: Initiating epoch #54 train run on device rank=0 [2024-06-21 14:28:33,579] INFO: Initiating epoch #54 valid run on device rank=0 [2024-06-21 14:29:07,098] INFO: Rank 0: epoch=54 / 400 train_loss=10.2212 valid_loss=10.2556 stale=0 time=2.67m 
eta=941.3m [2024-06-21 14:29:07,174] INFO: Initiating epoch #55 train run on device rank=0 [2024-06-21 14:31:13,363] INFO: Initiating epoch #55 valid run on device rank=0 [2024-06-21 14:31:46,651] INFO: Rank 0: epoch=55 / 400 train_loss=10.2196 valid_loss=10.2542 stale=0 time=2.66m eta=938.2m [2024-06-21 14:31:46,702] INFO: Initiating epoch #56 train run on device rank=0 [2024-06-21 14:33:53,142] INFO: Initiating epoch #56 valid run on device rank=0 [2024-06-21 14:34:30,710] INFO: Rank 0: epoch=56 / 400 train_loss=10.2180 valid_loss=10.2529 stale=0 time=2.73m eta=935.5m [2024-06-21 14:34:31,779] INFO: Initiating epoch #57 train run on device rank=0 [2024-06-21 14:36:37,867] INFO: Initiating epoch #57 valid run on device rank=0 [2024-06-21 14:37:11,157] INFO: Rank 0: epoch=57 / 400 train_loss=10.2164 valid_loss=10.2516 stale=0 time=2.66m eta=932.5m [2024-06-21 14:37:11,217] INFO: Initiating epoch #58 train run on device rank=0 [2024-06-21 14:39:17,426] INFO: Initiating epoch #58 valid run on device rank=0 [2024-06-21 14:39:52,983] INFO: Rank 0: epoch=58 / 400 train_loss=10.2149 valid_loss=10.2503 stale=0 time=2.7m eta=929.7m [2024-06-21 14:39:53,004] INFO: Initiating epoch #59 train run on device rank=0 [2024-06-21 14:41:58,350] INFO: Initiating epoch #59 valid run on device rank=0 [2024-06-21 14:42:31,483] INFO: Rank 0: epoch=59 / 400 train_loss=10.2133 valid_loss=10.2488 stale=0 time=2.64m eta=926.5m [2024-06-21 14:42:31,509] INFO: Initiating epoch #60 train run on device rank=0 [2024-06-21 14:44:37,616] INFO: Initiating epoch #60 valid run on device rank=0 [2024-06-21 14:45:14,988] INFO: Rank 0: epoch=60 / 400 train_loss=10.2116 valid_loss=10.2473 stale=0 time=2.72m eta=923.9m [2024-06-21 14:45:15,591] INFO: Initiating epoch #61 train run on device rank=0 [2024-06-21 14:47:19,381] INFO: Initiating epoch #61 valid run on device rank=0 [2024-06-21 14:47:55,824] INFO: Rank 0: epoch=61 / 400 train_loss=10.2100 valid_loss=10.2460 stale=0 time=2.67m eta=920.9m 
[2024-06-21 14:47:56,407] INFO: Initiating epoch #62 train run on device rank=0 [2024-06-21 14:50:02,638] INFO: Initiating epoch #62 valid run on device rank=0 [2024-06-21 14:50:38,544] INFO: Rank 0: epoch=62 / 400 train_loss=10.2084 valid_loss=10.2447 stale=0 time=2.7m eta=918.2m [2024-06-21 14:50:39,126] INFO: Initiating epoch #63 train run on device rank=0 [2024-06-21 14:52:42,911] INFO: Initiating epoch #63 valid run on device rank=0 [2024-06-21 14:53:19,588] INFO: Rank 0: epoch=63 / 400 train_loss=10.2068 valid_loss=10.2433 stale=0 time=2.67m eta=915.3m [2024-06-21 14:53:20,210] INFO: Initiating epoch #64 train run on device rank=0 [2024-06-21 14:55:24,518] INFO: Initiating epoch #64 valid run on device rank=0 [2024-06-21 14:55:58,947] INFO: Rank 0: epoch=64 / 400 train_loss=10.2052 valid_loss=10.2419 stale=0 time=2.65m eta=912.3m [2024-06-21 14:55:58,980] INFO: Initiating epoch #65 train run on device rank=0 [2024-06-21 14:58:07,329] INFO: Initiating epoch #65 valid run on device rank=0 [2024-06-21 14:58:40,788] INFO: Rank 0: epoch=65 / 400 train_loss=10.2033 valid_loss=10.2403 stale=0 time=2.7m eta=909.5m [2024-06-21 14:58:40,825] INFO: Initiating epoch #66 train run on device rank=0 [2024-06-21 15:00:50,601] INFO: Initiating epoch #66 valid run on device rank=0 [2024-06-21 15:01:24,390] INFO: Rank 0: epoch=66 / 400 train_loss=10.2013 valid_loss=10.2376 stale=0 time=2.73m eta=906.8m [2024-06-21 15:01:24,524] INFO: Initiating epoch #67 train run on device rank=0 [2024-06-21 15:03:30,052] INFO: Initiating epoch #67 valid run on device rank=0 [2024-06-21 15:04:03,788] INFO: Rank 0: epoch=67 / 400 train_loss=10.1992 valid_loss=10.2348 stale=0 time=2.65m eta=903.8m [2024-06-21 15:04:03,803] INFO: Initiating epoch #68 train run on device rank=0 [2024-06-21 15:06:09,372] INFO: Initiating epoch #68 valid run on device rank=0 [2024-06-21 15:06:42,925] INFO: Rank 0: epoch=68 / 400 train_loss=10.1972 valid_loss=10.2323 stale=0 time=2.65m eta=900.8m [2024-06-21 
15:06:42,978] INFO: Initiating epoch #69 train run on device rank=0 [2024-06-21 15:08:49,827] INFO: Initiating epoch #69 valid run on device rank=0 [2024-06-21 15:09:23,146] INFO: Rank 0: epoch=69 / 400 train_loss=10.1951 valid_loss=10.2297 stale=0 time=2.67m eta=897.9m [2024-06-21 15:09:23,257] INFO: Initiating epoch #70 train run on device rank=0 [2024-06-21 15:11:32,595] INFO: Initiating epoch #70 valid run on device rank=0 [2024-06-21 15:12:05,874] INFO: Rank 0: epoch=70 / 400 train_loss=10.1928 valid_loss=10.2267 stale=0 time=2.71m eta=895.2m [2024-06-21 15:12:05,911] INFO: Initiating epoch #71 train run on device rank=0 [2024-06-21 15:14:11,954] INFO: Initiating epoch #71 valid run on device rank=0 [2024-06-21 15:14:45,252] INFO: Rank 0: epoch=71 / 400 train_loss=10.1899 valid_loss=10.2234 stale=0 time=2.66m eta=892.2m [2024-06-21 15:14:45,301] INFO: Initiating epoch #72 train run on device rank=0 [2024-06-21 15:16:51,363] INFO: Initiating epoch #72 valid run on device rank=0 [2024-06-21 15:17:26,064] INFO: Rank 0: epoch=72 / 400 train_loss=10.1870 valid_loss=10.2213 stale=0 time=2.68m eta=889.3m [2024-06-21 15:17:26,086] INFO: Initiating epoch #73 train run on device rank=0 [2024-06-21 15:19:33,700] INFO: Initiating epoch #73 valid run on device rank=0 [2024-06-21 15:20:10,497] INFO: Rank 0: epoch=73 / 400 train_loss=10.1841 valid_loss=10.2212 stale=0 time=2.74m eta=886.8m [2024-06-21 15:20:11,099] INFO: Initiating epoch #74 train run on device rank=0 [2024-06-21 15:22:18,770] INFO: Initiating epoch #74 valid run on device rank=0 [2024-06-21 15:22:53,958] INFO: Rank 0: epoch=74 / 400 train_loss=10.1811 valid_loss=10.2238 stale=1 time=2.71m eta=884.1m [2024-06-21 15:22:54,710] INFO: Initiating epoch #75 train run on device rank=0 [2024-06-21 15:24:59,313] INFO: Initiating epoch #75 valid run on device rank=0 [2024-06-21 15:25:35,835] INFO: Rank 0: epoch=75 / 400 train_loss=10.1784 valid_loss=10.2261 stale=2 time=2.69m eta=881.3m [2024-06-21 15:25:36,446] 
INFO: Initiating epoch #76 train run on device rank=0 [2024-06-21 15:27:41,331] INFO: Initiating epoch #76 valid run on device rank=0 [2024-06-21 15:28:14,549] INFO: Rank 0: epoch=76 / 400 train_loss=10.1765 valid_loss=10.2127 stale=0 time=2.64m eta=878.3m [2024-06-21 15:28:14,583] INFO: Initiating epoch #77 train run on device rank=0 [2024-06-21 15:30:20,796] INFO: Initiating epoch #77 valid run on device rank=0 [2024-06-21 15:30:55,522] INFO: Rank 0: epoch=77 / 400 train_loss=10.1750 valid_loss=10.2131 stale=1 time=2.68m eta=875.5m [2024-06-21 15:30:56,488] INFO: Initiating epoch #78 train run on device rank=0 [2024-06-21 15:33:01,770] INFO: Initiating epoch #78 valid run on device rank=0 [2024-06-21 15:33:37,444] INFO: Rank 0: epoch=78 / 400 train_loss=10.1734 valid_loss=10.2138 stale=2 time=2.68m eta=872.7m [2024-06-21 15:33:38,030] INFO: Initiating epoch #79 train run on device rank=0 [2024-06-21 15:35:44,639] INFO: Initiating epoch #79 valid run on device rank=0 [2024-06-21 15:36:18,039] INFO: Rank 0: epoch=79 / 400 train_loss=10.1722 valid_loss=10.2123 stale=0 time=2.67m eta=869.9m [2024-06-21 15:36:18,072] INFO: Initiating epoch #80 train run on device rank=0 [2024-06-21 15:38:26,372] INFO: Initiating epoch #80 valid run on device rank=0 [2024-06-21 15:38:59,999] INFO: Rank 0: epoch=80 / 400 train_loss=10.1708 valid_loss=10.2110 stale=0 time=2.7m eta=867.1m [2024-06-21 15:39:00,029] INFO: Initiating epoch #81 train run on device rank=0 [2024-06-21 15:41:07,108] INFO: Initiating epoch #81 valid run on device rank=0 [2024-06-21 15:41:40,696] INFO: Rank 0: epoch=81 / 400 train_loss=10.1692 valid_loss=10.2112 stale=1 time=2.68m eta=864.3m [2024-06-21 15:41:40,703] INFO: Initiating epoch #82 train run on device rank=0 [2024-06-21 15:43:50,311] INFO: Initiating epoch #82 valid run on device rank=0 [2024-06-21 15:44:24,227] INFO: Rank 0: epoch=82 / 400 train_loss=10.1680 valid_loss=10.2110 stale=0 time=2.73m eta=861.7m [2024-06-21 15:44:24,568] INFO: Initiating 
epoch #83 train run on device rank=0 [2024-06-21 15:46:31,690] INFO: Initiating epoch #83 valid run on device rank=0 [2024-06-21 15:47:05,578] INFO: Rank 0: epoch=83 / 400 train_loss=10.1668 valid_loss=10.2094 stale=0 time=2.68m eta=858.9m [2024-06-21 15:47:05,588] INFO: Initiating epoch #84 train run on device rank=0 [2024-06-21 15:49:13,384] INFO: Initiating epoch #84 valid run on device rank=0 [2024-06-21 15:49:47,462] INFO: Rank 0: epoch=84 / 400 train_loss=10.1655 valid_loss=10.2028 stale=0 time=2.7m eta=856.1m [2024-06-21 15:49:47,501] INFO: Initiating epoch #85 train run on device rank=0 [2024-06-21 15:51:57,755] INFO: Initiating epoch #85 valid run on device rank=0 [2024-06-21 15:52:32,731] INFO: Rank 0: epoch=85 / 400 train_loss=10.1641 valid_loss=10.2007 stale=0 time=2.75m eta=853.6m [2024-06-21 15:52:32,799] INFO: Initiating epoch #86 train run on device rank=0 [2024-06-21 15:54:38,697] INFO: Initiating epoch #86 valid run on device rank=0 [2024-06-21 15:55:11,757] INFO: Rank 0: epoch=86 / 400 train_loss=10.1630 valid_loss=10.2002 stale=0 time=2.65m eta=850.7m [2024-06-21 15:55:11,789] INFO: Initiating epoch #87 train run on device rank=0 [2024-06-21 15:57:21,534] INFO: Initiating epoch #87 valid run on device rank=0 [2024-06-21 15:57:54,990] INFO: Rank 0: epoch=87 / 400 train_loss=10.1614 valid_loss=10.2021 stale=1 time=2.72m eta=848.0m [2024-06-21 15:57:55,011] INFO: Initiating epoch #88 train run on device rank=0 [2024-06-21 16:00:01,696] INFO: Initiating epoch #88 valid run on device rank=0 [2024-06-21 16:00:35,696] INFO: Rank 0: epoch=88 / 400 train_loss=10.1593 valid_loss=10.1929 stale=0 time=2.68m eta=845.2m [2024-06-21 16:00:35,722] INFO: Initiating epoch #89 train run on device rank=0 [2024-06-21 16:02:42,136] INFO: Initiating epoch #89 valid run on device rank=0 [2024-06-21 16:03:15,733] INFO: Rank 0: epoch=89 / 400 train_loss=10.1567 valid_loss=10.1887 stale=0 time=2.67m eta=842.3m [2024-06-21 16:03:15,780] INFO: Initiating epoch #90 train run 
on device rank=0 [2024-06-21 16:05:24,161] INFO: Initiating epoch #90 valid run on device rank=0 [2024-06-21 16:05:57,370] INFO: Rank 0: epoch=90 / 400 train_loss=10.1545 valid_loss=10.1859 stale=0 time=2.69m eta=839.6m [2024-06-21 16:05:57,405] INFO: Initiating epoch #91 train run on device rank=0 [2024-06-21 16:08:04,349] INFO: Initiating epoch #91 valid run on device rank=0 [2024-06-21 16:08:38,703] INFO: Rank 0: epoch=91 / 400 train_loss=10.1528 valid_loss=10.1850 stale=0 time=2.69m eta=836.8m [2024-06-21 16:08:38,716] INFO: Initiating epoch #92 train run on device rank=0 [2024-06-21 16:10:45,872] INFO: Initiating epoch #92 valid run on device rank=0 [2024-06-21 16:11:20,244] INFO: Rank 0: epoch=92 / 400 train_loss=10.1515 valid_loss=10.1839 stale=0 time=2.69m eta=834.0m [2024-06-21 16:11:20,258] INFO: Initiating epoch #93 train run on device rank=0 [2024-06-21 16:13:26,932] INFO: Initiating epoch #93 valid run on device rank=0 [2024-06-21 16:14:05,713] INFO: Rank 0: epoch=93 / 400 train_loss=10.1502 valid_loss=10.1832 stale=0 time=2.76m eta=831.5m [2024-06-21 16:14:06,540] INFO: Initiating epoch #94 train run on device rank=0 [2024-06-21 16:16:11,170] INFO: Initiating epoch #94 valid run on device rank=0 [2024-06-21 16:16:47,611] INFO: Rank 0: epoch=94 / 400 train_loss=10.1490 valid_loss=10.1825 stale=0 time=2.68m eta=828.7m [2024-06-21 16:16:48,179] INFO: Initiating epoch #95 train run on device rank=0 [2024-06-21 16:18:51,824] INFO: Initiating epoch #95 valid run on device rank=0 [2024-06-21 16:19:26,544] INFO: Rank 0: epoch=95 / 400 train_loss=10.1481 valid_loss=10.1825 stale=0 time=2.64m eta=825.8m [2024-06-21 16:19:26,589] INFO: Initiating epoch #96 train run on device rank=0 [2024-06-21 16:21:35,128] INFO: Initiating epoch #96 valid run on device rank=0 [2024-06-21 16:22:13,034] INFO: Rank 0: epoch=96 / 400 train_loss=10.1472 valid_loss=10.1814 stale=0 time=2.77m eta=823.3m [2024-06-21 16:22:13,759] INFO: Initiating epoch #97 train run on device rank=0 
[2024-06-21 16:24:17,718] INFO: Initiating epoch #97 valid run on device rank=0 [2024-06-21 16:24:55,492] INFO: Rank 0: epoch=97 / 400 train_loss=10.1463 valid_loss=10.1806 stale=0 time=2.7m eta=820.6m [2024-06-21 16:24:56,308] INFO: Initiating epoch #98 train run on device rank=0 [2024-06-21 16:27:01,712] INFO: Initiating epoch #98 valid run on device rank=0 [2024-06-21 16:27:39,657] INFO: Rank 0: epoch=98 / 400 train_loss=10.1454 valid_loss=10.1796 stale=0 time=2.72m eta=818.0m [2024-06-21 16:27:40,582] INFO: Initiating epoch #99 train run on device rank=0 [2024-06-21 16:29:44,364] INFO: Initiating epoch #99 valid run on device rank=0 [2024-06-21 16:30:21,770] INFO: Rank 0: epoch=99 / 400 train_loss=10.1444 valid_loss=10.1789 stale=0 time=2.69m eta=815.3m [2024-06-21 16:30:22,527] INFO: Initiating epoch #100 train run on device rank=0 [2024-06-21 16:32:27,873] INFO: Initiating epoch #100 valid run on device rank=0 [2024-06-21 16:33:05,339] INFO: Rank 0: epoch=100 / 400 train_loss=10.1435 valid_loss=10.1784 stale=0 time=2.71m eta=812.6m [2024-06-21 16:33:05,730] INFO: Initiating epoch #101 train run on device rank=0 [2024-06-21 16:35:13,368] INFO: Initiating epoch #101 valid run on device rank=0 [2024-06-21 16:35:51,307] INFO: Rank 0: epoch=101 / 400 train_loss=10.1425 valid_loss=10.1778 stale=0 time=2.76m eta=810.1m [2024-06-21 16:35:51,804] INFO: Initiating epoch #102 train run on device rank=0 [2024-06-21 16:37:59,935] INFO: Initiating epoch #102 valid run on device rank=0 [2024-06-21 16:38:33,925] INFO: Rank 0: epoch=102 / 400 train_loss=10.1415 valid_loss=10.1770 stale=0 time=2.7m eta=807.4m [2024-06-21 16:38:33,933] INFO: Initiating epoch #103 train run on device rank=0 [2024-06-21 16:40:43,848] INFO: Initiating epoch #103 valid run on device rank=0 [2024-06-21 16:41:18,101] INFO: Rank 0: epoch=103 / 400 train_loss=10.1404 valid_loss=10.1763 stale=0 time=2.74m eta=804.7m [2024-06-21 16:41:18,115] INFO: Initiating epoch #104 train run on device rank=0 
[2024-06-21 16:43:27,873] INFO: Initiating epoch #104 valid run on device rank=0 [2024-06-21 16:44:03,624] INFO: Rank 0: epoch=104 / 400 train_loss=10.1394 valid_loss=10.1756 stale=0 time=2.76m eta=802.2m [2024-06-21 16:44:03,658] INFO: Initiating epoch #105 train run on device rank=0 [2024-06-21 16:46:14,257] INFO: Initiating epoch #105 valid run on device rank=0 [2024-06-21 16:46:48,222] INFO: Rank 0: epoch=105 / 400 train_loss=10.1384 valid_loss=10.1750 stale=0 time=2.74m eta=799.6m [2024-06-21 16:46:48,246] INFO: Initiating epoch #106 train run on device rank=0 [2024-06-21 16:48:58,172] INFO: Initiating epoch #106 valid run on device rank=0 [2024-06-21 16:49:32,109] INFO: Rank 0: epoch=106 / 400 train_loss=10.1375 valid_loss=10.1743 stale=0 time=2.73m eta=796.9m [2024-06-21 16:49:32,136] INFO: Initiating epoch #107 train run on device rank=0 [2024-06-21 16:51:42,010] INFO: Initiating epoch #107 valid run on device rank=0 [2024-06-21 16:52:16,725] INFO: Rank 0: epoch=107 / 400 train_loss=10.1366 valid_loss=10.1736 stale=0 time=2.74m eta=794.3m [2024-06-21 16:52:16,812] INFO: Initiating epoch #108 train run on device rank=0 [2024-06-21 16:54:26,349] INFO: Initiating epoch #108 valid run on device rank=0 [2024-06-21 16:55:00,084] INFO: Rank 0: epoch=108 / 400 train_loss=10.1358 valid_loss=10.1730 stale=0 time=2.72m eta=791.6m [2024-06-21 16:55:00,104] INFO: Initiating epoch #109 train run on device rank=0 [2024-06-21 16:57:08,795] INFO: Initiating epoch #109 valid run on device rank=0 [2024-06-21 16:57:42,987] INFO: Rank 0: epoch=109 / 400 train_loss=10.1352 valid_loss=10.1727 stale=0 time=2.71m eta=788.9m [2024-06-21 16:57:43,003] INFO: Initiating epoch #110 train run on device rank=0 [2024-06-21 16:59:52,936] INFO: Initiating epoch #110 valid run on device rank=0 [2024-06-21 17:00:27,594] INFO: Rank 0: epoch=110 / 400 train_loss=10.1345 valid_loss=10.1726 stale=0 time=2.74m eta=786.3m [2024-06-21 17:00:27,654] INFO: Initiating epoch #111 train run on device 
rank=0 [2024-06-21 17:02:37,258] INFO: Initiating epoch #111 valid run on device rank=0 [2024-06-21 17:03:11,798] INFO: Rank 0: epoch=111 / 400 train_loss=10.1337 valid_loss=10.1693 stale=0 time=2.74m eta=783.6m [2024-06-21 17:03:11,818] INFO: Initiating epoch #112 train run on device rank=0 [2024-06-21 17:05:21,230] INFO: Initiating epoch #112 valid run on device rank=0 [2024-06-21 17:05:55,353] INFO: Rank 0: epoch=112 / 400 train_loss=10.1332 valid_loss=10.1689 stale=0 time=2.73m eta=781.0m [2024-06-21 17:05:55,499] INFO: Initiating epoch #113 train run on device rank=0 [2024-06-21 17:08:05,991] INFO: Initiating epoch #113 valid run on device rank=0 [2024-06-21 17:08:40,549] INFO: Rank 0: epoch=113 / 400 train_loss=10.1326 valid_loss=10.1684 stale=0 time=2.75m eta=778.4m [2024-06-21 17:08:40,602] INFO: Initiating epoch #114 train run on device rank=0 [2024-06-21 17:10:50,817] INFO: Initiating epoch #114 valid run on device rank=0 [2024-06-21 17:11:25,992] INFO: Rank 0: epoch=114 / 400 train_loss=10.1321 valid_loss=10.1681 stale=0 time=2.76m eta=775.8m [2024-06-21 17:11:26,079] INFO: Initiating epoch #115 train run on device rank=0 [2024-06-21 17:13:35,791] INFO: Initiating epoch #115 valid run on device rank=0 [2024-06-21 17:14:09,857] INFO: Rank 0: epoch=115 / 400 train_loss=10.1317 valid_loss=10.1676 stale=0 time=2.73m eta=773.1m [2024-06-21 17:14:09,929] INFO: Initiating epoch #116 train run on device rank=0 [2024-06-21 17:16:17,741] INFO: Initiating epoch #116 valid run on device rank=0 [2024-06-21 17:16:52,186] INFO: Rank 0: epoch=116 / 400 train_loss=10.1312 valid_loss=10.1672 stale=0 time=2.7m eta=770.4m [2024-06-21 17:16:52,232] INFO: Initiating epoch #117 train run on device rank=0 [2024-06-21 17:19:01,290] INFO: Initiating epoch #117 valid run on device rank=0 [2024-06-21 17:19:37,299] INFO: Rank 0: epoch=117 / 400 train_loss=10.1309 valid_loss=10.1668 stale=0 time=2.75m eta=767.7m [2024-06-21 17:19:37,404] INFO: Initiating epoch #118 train run on 
device rank=0 [2024-06-21 17:21:45,913] INFO: Initiating epoch #118 valid run on device rank=0 [2024-06-21 17:22:20,215] INFO: Rank 0: epoch=118 / 400 train_loss=10.1305 valid_loss=10.1665 stale=0 time=2.71m eta=765.0m [2024-06-21 17:22:20,224] INFO: Initiating epoch #119 train run on device rank=0 [2024-06-21 17:24:30,346] INFO: Initiating epoch #119 valid run on device rank=0 [2024-06-21 17:25:04,708] INFO: Rank 0: epoch=119 / 400 train_loss=10.1302 valid_loss=10.1662 stale=0 time=2.74m eta=762.4m [2024-06-21 17:25:04,797] INFO: Initiating epoch #120 train run on device rank=0 [2024-06-21 17:27:13,173] INFO: Initiating epoch #120 valid run on device rank=0 [2024-06-21 17:27:48,606] INFO: Rank 0: epoch=120 / 400 train_loss=10.1299 valid_loss=10.1660 stale=0 time=2.73m eta=759.7m [2024-06-21 17:27:48,624] INFO: Initiating epoch #121 train run on device rank=0 [2024-06-21 17:29:57,857] INFO: Initiating epoch #121 valid run on device rank=0 [2024-06-21 17:30:32,621] INFO: Rank 0: epoch=121 / 400 train_loss=10.1297 valid_loss=10.1658 stale=0 time=2.73m eta=757.1m [2024-06-21 17:30:32,644] INFO: Initiating epoch #122 train run on device rank=0 [2024-06-21 17:32:38,593] INFO: Initiating epoch #122 valid run on device rank=0 [2024-06-21 17:33:12,034] INFO: Rank 0: epoch=122 / 400 train_loss=10.1295 valid_loss=10.1656 stale=0 time=2.66m eta=754.2m [2024-06-21 17:33:12,076] INFO: Initiating epoch #123 train run on device rank=0 [2024-06-21 17:35:19,587] INFO: Initiating epoch #123 valid run on device rank=0 [2024-06-21 17:35:53,633] INFO: Rank 0: epoch=123 / 400 train_loss=10.1293 valid_loss=10.1653 stale=0 time=2.69m eta=751.5m [2024-06-21 17:35:53,669] INFO: Initiating epoch #124 train run on device rank=0 [2024-06-21 17:38:00,342] INFO: Initiating epoch #124 valid run on device rank=0 [2024-06-21 17:38:34,457] INFO: Rank 0: epoch=124 / 400 train_loss=10.1291 valid_loss=10.1651 stale=0 time=2.68m eta=748.7m [2024-06-21 17:38:34,470] INFO: Initiating epoch #125 train run 
on device rank=0 [2024-06-21 17:40:41,509] INFO: Initiating epoch #125 valid run on device rank=0 [2024-06-21 17:41:18,423] INFO: Rank 0: epoch=125 / 400 train_loss=10.1289 valid_loss=10.1649 stale=0 time=2.73m eta=746.0m [2024-06-21 17:41:18,917] INFO: Initiating epoch #126 train run on device rank=0 [2024-06-21 17:43:25,484] INFO: Initiating epoch #126 valid run on device rank=0 [2024-06-21 17:44:07,833] INFO: Rank 0: epoch=126 / 400 train_loss=10.1287 valid_loss=10.1647 stale=0 time=2.82m eta=743.5m [2024-06-21 17:44:09,265] INFO: Initiating epoch #127 train run on device rank=0 [2024-06-21 17:46:13,287] INFO: Initiating epoch #127 valid run on device rank=0 [2024-06-21 17:46:49,524] INFO: Rank 0: epoch=127 / 400 train_loss=10.1286 valid_loss=10.1645 stale=0 time=2.67m eta=740.8m [2024-06-21 17:46:50,310] INFO: Initiating epoch #128 train run on device rank=0 [2024-06-21 17:48:53,897] INFO: Initiating epoch #128 valid run on device rank=0 [2024-06-21 17:49:27,684] INFO: Rank 0: epoch=128 / 400 train_loss=10.1284 valid_loss=10.1643 stale=0 time=2.62m eta=737.9m [2024-06-21 17:49:27,899] INFO: Initiating epoch #129 train run on device rank=0 [2024-06-21 17:51:35,635] INFO: Initiating epoch #129 valid run on device rank=0 [2024-06-21 17:52:10,906] INFO: Rank 0: epoch=129 / 400 train_loss=10.1283 valid_loss=10.1642 stale=0 time=2.72m eta=735.2m [2024-06-21 17:52:10,928] INFO: Initiating epoch #130 train run on device rank=0 [2024-06-21 17:54:17,810] INFO: Initiating epoch #130 valid run on device rank=0 [2024-06-21 17:54:51,312] INFO: Rank 0: epoch=130 / 400 train_loss=10.1282 valid_loss=10.1640 stale=0 time=2.67m eta=732.4m [2024-06-21 17:54:51,333] INFO: Initiating epoch #131 train run on device rank=0 [2024-06-21 17:56:59,878] INFO: Initiating epoch #131 valid run on device rank=0 [2024-06-21 17:57:33,962] INFO: Rank 0: epoch=131 / 400 train_loss=10.1281 valid_loss=10.1639 stale=0 time=2.71m eta=729.7m [2024-06-21 17:57:34,027] INFO: Initiating epoch #132 train 
run on device rank=0 [2024-06-21 17:59:39,911] INFO: Initiating epoch #132 valid run on device rank=0 [2024-06-21 18:00:13,129] INFO: Rank 0: epoch=132 / 400 train_loss=10.1279 valid_loss=10.1639 stale=0 time=2.65m eta=726.9m [2024-06-21 18:00:13,175] INFO: Initiating epoch #133 train run on device rank=0 [2024-06-21 18:02:19,034] INFO: Initiating epoch #133 valid run on device rank=0 [2024-06-21 18:02:53,934] INFO: Rank 0: epoch=133 / 400 train_loss=10.1278 valid_loss=10.1637 stale=0 time=2.68m eta=724.1m [2024-06-21 18:02:54,325] INFO: Initiating epoch #134 train run on device rank=0 [2024-06-21 18:04:59,562] INFO: Initiating epoch #134 valid run on device rank=0 [2024-06-21 18:05:33,060] INFO: Rank 0: epoch=134 / 400 train_loss=10.1277 valid_loss=10.1636 stale=0 time=2.65m eta=721.2m [2024-06-21 18:05:33,062] INFO: Initiating epoch #135 train run on device rank=0 [2024-06-21 18:07:38,987] INFO: Initiating epoch #135 valid run on device rank=0 [2024-06-21 18:08:12,460] INFO: Rank 0: epoch=135 / 400 train_loss=10.1276 valid_loss=10.1635 stale=0 time=2.66m eta=718.4m [2024-06-21 18:08:12,545] INFO: Initiating epoch #136 train run on device rank=0 [2024-06-21 18:10:22,971] INFO: Initiating epoch #136 valid run on device rank=0 [2024-06-21 18:10:57,046] INFO: Rank 0: epoch=136 / 400 train_loss=10.1275 valid_loss=10.1634 stale=0 time=2.74m eta=715.8m [2024-06-21 18:10:57,072] INFO: Initiating epoch #137 train run on device rank=0 [2024-06-21 18:13:05,416] INFO: Initiating epoch #137 valid run on device rank=0 [2024-06-21 18:13:39,491] INFO: Rank 0: epoch=137 / 400 train_loss=10.1274 valid_loss=10.1633 stale=0 time=2.71m eta=713.1m [2024-06-21 18:13:39,525] INFO: Initiating epoch #138 train run on device rank=0 [2024-06-21 18:15:45,604] INFO: Initiating epoch #138 valid run on device rank=0 [2024-06-21 18:16:18,738] INFO: Rank 0: epoch=138 / 400 train_loss=10.1273 valid_loss=10.1632 stale=0 time=2.65m eta=710.2m [2024-06-21 18:16:18,775] INFO: Initiating epoch #139 
train run on device rank=0 [2024-06-21 18:18:25,535] INFO: Initiating epoch #139 valid run on device rank=0 [2024-06-21 18:18:59,059] INFO: Rank 0: epoch=139 / 400 train_loss=10.1272 valid_loss=10.1631 stale=0 time=2.67m eta=707.5m [2024-06-21 18:18:59,074] INFO: Initiating epoch #140 train run on device rank=0 [2024-06-21 18:21:05,946] INFO: Initiating epoch #140 valid run on device rank=0 [2024-06-21 18:21:40,355] INFO: Rank 0: epoch=140 / 400 train_loss=10.1271 valid_loss=10.1630 stale=0 time=2.69m eta=704.7m [2024-06-21 18:21:40,379] INFO: Initiating epoch #141 train run on device rank=0 [2024-06-21 18:23:48,285] INFO: Initiating epoch #141 valid run on device rank=0 [2024-06-21 18:24:21,505] INFO: Rank 0: epoch=141 / 400 train_loss=10.1270 valid_loss=10.1629 stale=0 time=2.69m eta=702.0m [2024-06-21 18:24:21,590] INFO: Initiating epoch #142 train run on device rank=0 [2024-06-21 18:26:28,683] INFO: Initiating epoch #142 valid run on device rank=0 [2024-06-21 18:27:01,885] INFO: Rank 0: epoch=142 / 400 train_loss=10.1268 valid_loss=10.1628 stale=0 time=2.67m eta=699.2m [2024-06-21 18:27:01,934] INFO: Initiating epoch #143 train run on device rank=0 [2024-06-21 18:29:10,064] INFO: Initiating epoch #143 valid run on device rank=0 [2024-06-21 18:29:44,986] INFO: Rank 0: epoch=143 / 400 train_loss=10.1267 valid_loss=10.1627 stale=0 time=2.72m eta=696.5m [2024-06-21 18:29:45,012] INFO: Initiating epoch #144 train run on device rank=0 [2024-06-21 18:31:52,215] INFO: Initiating epoch #144 valid run on device rank=0 [2024-06-21 18:32:26,734] INFO: Rank 0: epoch=144 / 400 train_loss=10.1266 valid_loss=10.1626 stale=0 time=2.7m eta=693.7m [2024-06-21 18:32:26,746] INFO: Initiating epoch #145 train run on device rank=0 [2024-06-21 18:34:33,927] INFO: Initiating epoch #145 valid run on device rank=0 [2024-06-21 18:35:08,480] INFO: Rank 0: epoch=145 / 400 train_loss=10.1265 valid_loss=10.1625 stale=0 time=2.7m eta=691.0m [2024-06-21 18:35:08,523] INFO: Initiating epoch #146 
train run on device rank=0 [2024-06-21 18:37:16,382] INFO: Initiating epoch #146 valid run on device rank=0 [2024-06-21 18:37:50,838] INFO: Rank 0: epoch=146 / 400 train_loss=10.1264 valid_loss=10.1624 stale=0 time=2.71m eta=688.3m [2024-06-21 18:37:50,854] INFO: Initiating epoch #147 train run on device rank=0 [2024-06-21 18:39:58,564] INFO: Initiating epoch #147 valid run on device rank=0 [2024-06-21 18:40:32,555] INFO: Rank 0: epoch=147 / 400 train_loss=10.1263 valid_loss=10.1624 stale=0 time=2.7m eta=685.6m [2024-06-21 18:40:32,580] INFO: Initiating epoch #148 train run on device rank=0 [2024-06-21 18:42:38,263] INFO: Initiating epoch #148 valid run on device rank=0 [2024-06-21 18:43:11,639] INFO: Rank 0: epoch=148 / 400 train_loss=10.1262 valid_loss=10.1623 stale=0 time=2.65m eta=682.7m [2024-06-21 18:43:11,689] INFO: Initiating epoch #149 train run on device rank=0 [2024-06-21 18:45:18,053] INFO: Initiating epoch #149 valid run on device rank=0 [2024-06-21 18:45:51,356] INFO: Rank 0: epoch=149 / 400 train_loss=10.1261 valid_loss=10.1622 stale=0 time=2.66m eta=680.0m [2024-06-21 18:45:51,371] INFO: Initiating epoch #150 train run on device rank=0 [2024-06-21 18:47:57,502] INFO: Initiating epoch #150 valid run on device rank=0 [2024-06-21 18:48:32,162] INFO: Rank 0: epoch=150 / 400 train_loss=10.1260 valid_loss=10.1621 stale=0 time=2.68m eta=677.2m [2024-06-21 18:48:32,377] INFO: Initiating epoch #151 train run on device rank=0 [2024-06-21 18:50:37,979] INFO: Initiating epoch #151 valid run on device rank=0 [2024-06-21 18:51:13,598] INFO: Rank 0: epoch=151 / 400 train_loss=10.1259 valid_loss=10.1620 stale=0 time=2.69m eta=674.5m [2024-06-21 18:51:13,620] INFO: Initiating epoch #152 train run on device rank=0 [2024-06-21 18:53:19,208] INFO: Initiating epoch #152 valid run on device rank=0 [2024-06-21 18:53:52,826] INFO: Rank 0: epoch=152 / 400 train_loss=10.1258 valid_loss=10.1619 stale=0 time=2.65m eta=671.7m [2024-06-21 18:53:52,966] INFO: Initiating epoch 
#153 train run on device rank=0 [2024-06-21 18:55:59,132] INFO: Initiating epoch #153 valid run on device rank=0 [2024-06-21 18:56:33,083] INFO: Rank 0: epoch=153 / 400 train_loss=10.1257 valid_loss=10.1618 stale=0 time=2.67m eta=668.9m [2024-06-21 18:56:33,114] INFO: Initiating epoch #154 train run on device rank=0 [2024-06-21 18:58:38,623] INFO: Initiating epoch #154 valid run on device rank=0 [2024-06-21 18:59:13,048] INFO: Rank 0: epoch=154 / 400 train_loss=10.1257 valid_loss=10.1618 stale=0 time=2.67m eta=666.1m [2024-06-21 18:59:13,097] INFO: Initiating epoch #155 train run on device rank=0 [2024-06-21 19:01:20,891] INFO: Initiating epoch #155 valid run on device rank=0 [2024-06-21 19:01:54,808] INFO: Rank 0: epoch=155 / 400 train_loss=10.1256 valid_loss=10.1617 stale=0 time=2.7m eta=663.4m [2024-06-21 19:01:54,818] INFO: Initiating epoch #156 train run on device rank=0 [2024-06-21 19:04:04,590] INFO: Initiating epoch #156 valid run on device rank=0 [2024-06-21 19:04:39,263] INFO: Rank 0: epoch=156 / 400 train_loss=10.1256 valid_loss=10.1617 stale=0 time=2.74m eta=660.7m [2024-06-21 19:04:39,339] INFO: Initiating epoch #157 train run on device rank=0 [2024-06-21 19:06:45,743] INFO: Initiating epoch #157 valid run on device rank=0 [2024-06-21 19:07:20,379] INFO: Rank 0: epoch=157 / 400 train_loss=10.1256 valid_loss=10.1617 stale=0 time=2.68m eta=658.0m [2024-06-21 19:07:20,437] INFO: Initiating epoch #158 train run on device rank=0 [2024-06-21 19:09:29,615] INFO: Initiating epoch #158 valid run on device rank=0 [2024-06-21 19:10:04,465] INFO: Rank 0: epoch=158 / 400 train_loss=10.1255 valid_loss=10.1616 stale=0 time=2.73m eta=655.3m [2024-06-21 19:10:04,513] INFO: Initiating epoch #159 train run on device rank=0 [2024-06-21 19:12:17,584] INFO: Initiating epoch #159 valid run on device rank=0 [2024-06-21 19:12:53,538] INFO: Rank 0: epoch=159 / 400 train_loss=10.1255 valid_loss=10.1616 stale=0 time=2.82m eta=652.8m [2024-06-21 19:12:53,639] INFO: Initiating 
epoch #160 train run on device rank=0 [2024-06-21 19:15:08,417] INFO: Initiating epoch #160 valid run on device rank=0 [2024-06-21 19:15:43,523] INFO: Rank 0: epoch=160 / 400 train_loss=10.1255 valid_loss=10.1616 stale=0 time=2.83m eta=650.3m [2024-06-21 19:15:43,585] INFO: Initiating epoch #161 train run on device rank=0 [2024-06-21 19:17:55,376] INFO: Initiating epoch #161 valid run on device rank=0 [2024-06-21 19:18:30,838] INFO: Rank 0: epoch=161 / 400 train_loss=10.1254 valid_loss=10.1616 stale=0 time=2.79m eta=647.7m [2024-06-21 19:18:30,931] INFO: Initiating epoch #162 train run on device rank=0 [2024-06-21 19:20:40,748] INFO: Initiating epoch #162 valid run on device rank=0 [2024-06-21 19:21:15,231] INFO: Rank 0: epoch=162 / 400 train_loss=10.1254 valid_loss=10.1615 stale=0 time=2.74m eta=645.0m [2024-06-21 19:21:15,281] INFO: Initiating epoch #163 train run on device rank=0 [2024-06-21 19:23:24,611] INFO: Initiating epoch #163 valid run on device rank=0 [2024-06-21 19:23:59,326] INFO: Rank 0: epoch=163 / 400 train_loss=10.1254 valid_loss=10.1615 stale=0 time=2.73m eta=642.3m [2024-06-21 19:23:59,778] INFO: Initiating epoch #164 train run on device rank=0 [2024-06-21 19:26:10,005] INFO: Initiating epoch #164 valid run on device rank=0 [2024-06-21 19:26:44,167] INFO: Rank 0: epoch=164 / 400 train_loss=10.1254 valid_loss=10.1615 stale=0 time=2.74m eta=639.7m [2024-06-21 19:26:44,217] INFO: Initiating epoch #165 train run on device rank=0 [2024-06-21 19:28:54,770] INFO: Initiating epoch #165 valid run on device rank=0 [2024-06-21 19:29:29,992] INFO: Rank 0: epoch=165 / 400 train_loss=10.1254 valid_loss=10.1615 stale=0 time=2.76m eta=637.0m [2024-06-21 19:29:30,030] INFO: Initiating epoch #166 train run on device rank=0 [2024-06-21 19:31:41,253] INFO: Initiating epoch #166 valid run on device rank=0 [2024-06-21 19:32:16,614] INFO: Rank 0: epoch=166 / 400 train_loss=10.1253 valid_loss=10.1615 stale=0 time=2.78m eta=634.4m [2024-06-21 19:32:16,660] INFO: 
Initiating epoch #167 train run on device rank=0 [2024-06-21 19:34:31,724] INFO: Initiating epoch #167 valid run on device rank=0 [2024-06-21 19:35:06,568] INFO: Rank 0: epoch=167 / 400 train_loss=10.1253 valid_loss=10.1615 stale=0 time=2.83m eta=631.9m [2024-06-21 19:35:06,611] INFO: Initiating epoch #168 train run on device rank=0 [2024-06-21 19:37:16,461] INFO: Initiating epoch #168 valid run on device rank=0 [2024-06-21 19:37:52,121] INFO: Rank 0: epoch=168 / 400 train_loss=10.1253 valid_loss=10.1615 stale=0 time=2.76m eta=629.2m [2024-06-21 19:37:52,176] INFO: Initiating epoch #169 train run on device rank=0 [2024-06-21 19:40:04,711] INFO: Initiating epoch #169 valid run on device rank=0 [2024-06-21 19:40:44,818] INFO: Rank 0: epoch=169 / 400 train_loss=10.1253 valid_loss=10.1615 stale=0 time=2.88m eta=626.8m [2024-06-21 19:40:45,577] INFO: Initiating epoch #170 train run on device rank=0 [2024-06-21 19:42:52,885] INFO: Initiating epoch #170 valid run on device rank=0 [2024-06-21 19:43:30,523] INFO: Rank 0: epoch=170 / 400 train_loss=10.1253 valid_loss=10.1614 stale=0 time=2.75m eta=624.1m [2024-06-21 19:43:31,138] INFO: Initiating epoch #171 train run on device rank=0 [2024-06-21 19:45:40,325] INFO: Initiating epoch #171 valid run on device rank=0 [2024-06-21 19:46:17,413] INFO: Rank 0: epoch=171 / 400 train_loss=10.1253 valid_loss=10.1614 stale=0 time=2.77m eta=621.5m [2024-06-21 19:46:18,010] INFO: Initiating epoch #172 train run on device rank=0 [2024-06-21 19:48:27,536] INFO: Initiating epoch #172 valid run on device rank=0 [2024-06-21 19:49:02,707] INFO: Rank 0: epoch=172 / 400 train_loss=10.1253 valid_loss=10.1614 stale=1 time=2.74m eta=618.8m [2024-06-21 19:49:03,308] INFO: Initiating epoch #173 train run on device rank=0 [2024-06-21 19:51:13,228] INFO: Initiating epoch #173 valid run on device rank=0 [2024-06-21 19:51:51,099] INFO: Rank 0: epoch=173 / 400 train_loss=10.1252 valid_loss=10.1613 stale=0 time=2.8m eta=616.2m [2024-06-21 19:51:51,753] 
INFO: Initiating epoch #174 train run on device rank=0 [2024-06-21 19:54:01,291] INFO: Initiating epoch #174 valid run on device rank=0 [2024-06-21 19:54:35,867] INFO: Rank 0: epoch=174 / 400 train_loss=10.1252 valid_loss=10.1614 stale=1 time=2.74m eta=613.6m [2024-06-21 19:54:35,903] INFO: Initiating epoch #175 train run on device rank=0 [2024-06-21 19:56:46,010] INFO: Initiating epoch #175 valid run on device rank=0 [2024-06-21 19:57:20,226] INFO: Rank 0: epoch=175 / 400 train_loss=10.1252 valid_loss=10.1611 stale=0 time=2.74m eta=610.9m [2024-06-21 19:57:20,356] INFO: Initiating epoch #176 train run on device rank=0 [2024-06-21 19:59:31,259] INFO: Initiating epoch #176 valid run on device rank=0 [2024-06-21 20:00:05,718] INFO: Rank 0: epoch=176 / 400 train_loss=10.1252 valid_loss=10.1614 stale=1 time=2.76m eta=608.2m [2024-06-21 20:00:05,760] INFO: Initiating epoch #177 train run on device rank=0 [2024-06-21 20:02:16,287] INFO: Initiating epoch #177 valid run on device rank=0 [2024-06-21 20:02:50,470] INFO: Rank 0: epoch=177 / 400 train_loss=10.1252 valid_loss=10.1610 stale=0 time=2.75m eta=605.5m [2024-06-21 20:02:50,492] INFO: Initiating epoch #178 train run on device rank=0 [2024-06-21 20:05:01,106] INFO: Initiating epoch #178 valid run on device rank=0 [2024-06-21 20:05:35,408] INFO: Rank 0: epoch=178 / 400 train_loss=10.1252 valid_loss=10.1614 stale=1 time=2.75m eta=602.9m [2024-06-21 20:05:35,431] INFO: Initiating epoch #179 train run on device rank=0 [2024-06-21 20:07:46,503] INFO: Initiating epoch #179 valid run on device rank=0 [2024-06-21 20:08:21,923] INFO: Rank 0: epoch=179 / 400 train_loss=10.1252 valid_loss=10.1609 stale=0 time=2.77m eta=600.2m [2024-06-21 20:08:21,949] INFO: Initiating epoch #180 train run on device rank=0 [2024-06-21 20:10:33,002] INFO: Initiating epoch #180 valid run on device rank=0 [2024-06-21 20:11:08,295] INFO: Rank 0: epoch=180 / 400 train_loss=10.1252 valid_loss=10.1614 stale=1 time=2.77m eta=597.6m [2024-06-21 
20:11:08,582] INFO: Initiating epoch #181 train run on device rank=0 [2024-06-21 20:13:20,094] INFO: Initiating epoch #181 valid run on device rank=0 [2024-06-21 20:13:55,406] INFO: Rank 0: epoch=181 / 400 train_loss=10.1252 valid_loss=10.1609 stale=0 time=2.78m eta=594.9m [2024-06-21 20:13:55,435] INFO: Initiating epoch #182 train run on device rank=0 [2024-06-21 20:16:05,602] INFO: Initiating epoch #182 valid run on device rank=0 [2024-06-21 20:16:40,784] INFO: Rank 0: epoch=182 / 400 train_loss=10.1252 valid_loss=10.1613 stale=1 time=2.76m eta=592.3m [2024-06-21 20:16:40,874] INFO: Initiating epoch #183 train run on device rank=0 [2024-06-21 20:18:50,935] INFO: Initiating epoch #183 valid run on device rank=0 [2024-06-21 20:19:28,470] INFO: Rank 0: epoch=183 / 400 train_loss=10.1252 valid_loss=10.1609 stale=0 time=2.79m eta=589.6m [2024-06-21 20:19:29,015] INFO: Initiating epoch #184 train run on device rank=0 [2024-06-21 20:21:36,376] INFO: Initiating epoch #184 valid run on device rank=0 [2024-06-21 20:22:11,370] INFO: Rank 0: epoch=184 / 400 train_loss=10.1252 valid_loss=10.1612 stale=1 time=2.71m eta=586.9m [2024-06-21 20:22:12,052] INFO: Initiating epoch #185 train run on device rank=0 [2024-06-21 20:24:21,844] INFO: Initiating epoch #185 valid run on device rank=0 [2024-06-21 20:24:55,780] INFO: Rank 0: epoch=185 / 400 train_loss=10.1252 valid_loss=10.1609 stale=2 time=2.73m eta=584.2m [2024-06-21 20:24:55,811] INFO: Initiating epoch #186 train run on device rank=0 [2024-06-21 20:27:07,469] INFO: Initiating epoch #186 valid run on device rank=0 [2024-06-21 20:27:42,062] INFO: Rank 0: epoch=186 / 400 train_loss=10.1252 valid_loss=10.1612 stale=3 time=2.77m eta=581.6m [2024-06-21 20:27:42,183] INFO: Initiating epoch #187 train run on device rank=0 [2024-06-21 20:29:51,352] INFO: Initiating epoch #187 valid run on device rank=0 [2024-06-21 20:30:25,380] INFO: Rank 0: epoch=187 / 400 train_loss=10.1252 valid_loss=10.1609 stale=4 time=2.72m eta=578.9m 
[2024-06-21 20:30:25,392] INFO: Initiating epoch #188 train run on device rank=0 [2024-06-21 20:32:37,381] INFO: Initiating epoch #188 valid run on device rank=0 [2024-06-21 20:33:11,398] INFO: Rank 0: epoch=188 / 400 train_loss=10.1252 valid_loss=10.1611 stale=5 time=2.77m eta=576.2m [2024-06-21 20:33:11,452] INFO: Initiating epoch #189 train run on device rank=0 [2024-06-21 20:35:20,815] INFO: Initiating epoch #189 valid run on device rank=0 [2024-06-21 20:35:55,554] INFO: Rank 0: epoch=189 / 400 train_loss=10.1252 valid_loss=10.1609 stale=6 time=2.74m eta=573.5m [2024-06-21 20:35:56,024] INFO: Initiating epoch #190 train run on device rank=0 [2024-06-21 20:38:04,621] INFO: Initiating epoch #190 valid run on device rank=0 [2024-06-21 20:38:39,243] INFO: Rank 0: epoch=190 / 400 train_loss=10.1252 valid_loss=10.1611 stale=7 time=2.72m eta=570.8m [2024-06-21 20:38:39,284] INFO: Initiating epoch #191 train run on device rank=0 [2024-06-21 20:40:47,922] INFO: Initiating epoch #191 valid run on device rank=0 [2024-06-21 20:41:22,929] INFO: Rank 0: epoch=191 / 400 train_loss=10.1253 valid_loss=10.1609 stale=8 time=2.73m eta=568.1m [2024-06-21 20:41:23,028] INFO: Initiating epoch #192 train run on device rank=0 [2024-06-21 20:43:31,831] INFO: Initiating epoch #192 valid run on device rank=0 [2024-06-21 20:44:05,770] INFO: Rank 0: epoch=192 / 400 train_loss=10.1253 valid_loss=10.1610 stale=9 time=2.71m eta=565.4m [2024-06-21 20:44:05,846] INFO: Initiating epoch #193 train run on device rank=0 [2024-06-21 20:46:15,763] INFO: Initiating epoch #193 valid run on device rank=0 [2024-06-21 20:46:49,816] INFO: Rank 0: epoch=193 / 400 train_loss=10.1253 valid_loss=10.1609 stale=10 time=2.73m eta=562.7m [2024-06-21 20:46:49,845] INFO: Initiating epoch #194 train run on device rank=0 [2024-06-21 20:49:00,701] INFO: Initiating epoch #194 valid run on device rank=0 [2024-06-21 20:49:34,870] INFO: Rank 0: epoch=194 / 400 train_loss=10.1253 valid_loss=10.1610 stale=11 time=2.75m 
eta=560.0m [2024-06-21 20:49:34,922] INFO: Initiating epoch #195 train run on device rank=0 [2024-06-21 20:51:44,061] INFO: Initiating epoch #195 valid run on device rank=0 [2024-06-21 20:52:19,097] INFO: Rank 0: epoch=195 / 400 train_loss=10.1253 valid_loss=10.1609 stale=12 time=2.74m eta=557.3m [2024-06-21 20:52:19,141] INFO: Initiating epoch #196 train run on device rank=0 [2024-06-21 20:54:29,796] INFO: Initiating epoch #196 valid run on device rank=0 [2024-06-21 20:55:03,466] INFO: Rank 0: epoch=196 / 400 train_loss=10.1253 valid_loss=10.1609 stale=13 time=2.74m eta=554.6m [2024-06-21 20:55:03,528] INFO: Initiating epoch #197 train run on device rank=0 [2024-06-21 20:57:14,790] INFO: Initiating epoch #197 valid run on device rank=0 [2024-06-21 20:57:48,333] INFO: Rank 0: epoch=197 / 400 train_loss=10.1253 valid_loss=10.1609 stale=14 time=2.75m eta=551.9m [2024-06-21 20:57:48,350] INFO: Initiating epoch #198 train run on device rank=0 [2024-06-21 20:59:59,535] INFO: Initiating epoch #198 valid run on device rank=0 [2024-06-21 21:00:34,591] INFO: Rank 0: epoch=198 / 400 train_loss=10.1253 valid_loss=10.1609 stale=15 time=2.77m eta=549.2m [2024-06-21 21:00:34,609] INFO: Initiating epoch #199 train run on device rank=0 [2024-06-21 21:02:47,759] INFO: Initiating epoch #199 valid run on device rank=0 [2024-06-21 21:03:22,384] INFO: Rank 0: epoch=199 / 400 train_loss=10.1253 valid_loss=10.1609 stale=16 time=2.8m eta=546.6m [2024-06-21 21:03:22,448] INFO: Initiating epoch #200 train run on device rank=0 [2024-06-21 21:05:31,942] INFO: Initiating epoch #200 valid run on device rank=0 [2024-06-21 21:06:06,893] INFO: Rank 0: epoch=200 / 400 train_loss=10.1253 valid_loss=10.1609 stale=17 time=2.74m eta=543.9m [2024-06-21 21:06:06,918] INFO: Initiating epoch #201 train run on device rank=0 [2024-06-21 21:08:17,928] INFO: Initiating epoch #201 valid run on device rank=0 [2024-06-21 21:08:52,380] INFO: Rank 0: epoch=201 / 400 train_loss=10.1253 valid_loss=10.1609 stale=0 
time=2.76m eta=541.2m [2024-06-21 21:08:52,498] INFO: Initiating epoch #202 train run on device rank=0 [2024-06-21 21:11:05,056] INFO: Initiating epoch #202 valid run on device rank=0 [2024-06-21 21:11:40,317] INFO: Rank 0: epoch=202 / 400 train_loss=10.1253 valid_loss=10.1609 stale=0 time=2.8m eta=538.6m [2024-06-21 21:11:40,359] INFO: Initiating epoch #203 train run on device rank=0 [2024-06-21 21:13:53,088] INFO: Initiating epoch #203 valid run on device rank=0 [2024-06-21 21:14:27,592] INFO: Rank 0: epoch=203 / 400 train_loss=10.1253 valid_loss=10.1609 stale=0 time=2.79m eta=535.9m [2024-06-21 21:14:27,796] INFO: Initiating epoch #204 train run on device rank=0 [2024-06-21 21:16:37,070] INFO: Initiating epoch #204 valid run on device rank=0 [2024-06-21 21:17:11,736] INFO: Rank 0: epoch=204 / 400 train_loss=10.1253 valid_loss=10.1609 stale=0 time=2.73m eta=533.2m [2024-06-21 21:17:12,007] INFO: Initiating epoch #205 train run on device rank=0 [2024-06-21 21:19:22,709] INFO: Initiating epoch #205 valid run on device rank=0 [2024-06-21 21:19:57,010] INFO: Rank 0: epoch=205 / 400 train_loss=10.1253 valid_loss=10.1609 stale=0 time=2.75m eta=530.5m [2024-06-21 21:19:57,041] INFO: Initiating epoch #206 train run on device rank=0 [2024-06-21 21:22:08,632] INFO: Initiating epoch #206 valid run on device rank=0 [2024-06-21 21:22:44,937] INFO: Rank 0: epoch=206 / 400 train_loss=10.1253 valid_loss=10.1609 stale=1 time=2.8m eta=527.9m [2024-06-21 21:22:45,013] INFO: Initiating epoch #207 train run on device rank=0 [2024-06-21 21:24:58,866] INFO: Initiating epoch #207 valid run on device rank=0 [2024-06-21 21:25:34,981] INFO: Rank 0: epoch=207 / 400 train_loss=10.1253 valid_loss=10.1609 stale=0 time=2.83m eta=525.3m [2024-06-21 21:25:35,012] INFO: Initiating epoch #208 train run on device rank=0 [2024-06-21 21:27:49,163] INFO: Initiating epoch #208 valid run on device rank=0 [2024-06-21 21:28:24,114] INFO: Rank 0: epoch=208 / 400 train_loss=10.1253 valid_loss=10.1609 stale=0 
time=2.82m eta=522.6m [2024-06-21 21:28:24,145] INFO: Initiating epoch #209 train run on device rank=0 [2024-06-21 21:30:36,244] INFO: Initiating epoch #209 valid run on device rank=0 [2024-06-21 21:31:12,148] INFO: Rank 0: epoch=209 / 400 train_loss=10.1253 valid_loss=10.1609 stale=0 time=2.8m eta=520.0m [2024-06-21 21:31:12,179] INFO: Initiating epoch #210 train run on device rank=0 [2024-06-21 21:33:27,002] INFO: Initiating epoch #210 valid run on device rank=0 [2024-06-21 21:34:02,329] INFO: Rank 0: epoch=210 / 400 train_loss=10.1253 valid_loss=10.1609 stale=1 time=2.84m eta=517.4m [2024-06-21 21:34:02,346] INFO: Initiating epoch #211 train run on device rank=0 [2024-06-21 21:36:14,452] INFO: Initiating epoch #211 valid run on device rank=0 [2024-06-21 21:36:51,816] INFO: Rank 0: epoch=211 / 400 train_loss=10.1253 valid_loss=10.1608 stale=0 time=2.82m eta=514.7m [2024-06-21 21:36:52,007] INFO: Initiating epoch #212 train run on device rank=0 [2024-06-21 21:39:03,806] INFO: Initiating epoch #212 valid run on device rank=0 [2024-06-21 21:39:38,442] INFO: Rank 0: epoch=212 / 400 train_loss=10.1253 valid_loss=10.1608 stale=0 time=2.77m eta=512.1m [2024-06-21 21:39:38,465] INFO: Initiating epoch #213 train run on device rank=0 [2024-06-21 21:41:50,464] INFO: Initiating epoch #213 valid run on device rank=0 [2024-06-21 21:42:25,822] INFO: Rank 0: epoch=213 / 400 train_loss=10.1253 valid_loss=10.1608 stale=0 time=2.79m eta=509.4m [2024-06-21 21:42:25,825] INFO: Initiating epoch #214 train run on device rank=0 [2024-06-21 21:44:36,183] INFO: Initiating epoch #214 valid run on device rank=0 [2024-06-21 21:45:10,324] INFO: Rank 0: epoch=214 / 400 train_loss=10.1253 valid_loss=10.1609 stale=1 time=2.74m eta=506.7m [2024-06-21 21:45:10,390] INFO: Initiating epoch #215 train run on device rank=0 [2024-06-21 21:47:21,161] INFO: Initiating epoch #215 valid run on device rank=0 [2024-06-21 21:47:56,251] INFO: Rank 0: epoch=215 / 400 train_loss=10.1253 valid_loss=10.1609 
stale=2 time=2.76m eta=504.0m [2024-06-21 21:47:56,285] INFO: Initiating epoch #216 train run on device rank=0 [2024-06-21 21:50:09,301] INFO: Initiating epoch #216 valid run on device rank=0 [2024-06-21 21:50:43,827] INFO: Rank 0: epoch=216 / 400 train_loss=10.1253 valid_loss=10.1609 stale=3 time=2.79m eta=501.3m [2024-06-21 21:50:43,873] INFO: Initiating epoch #217 train run on device rank=0 [2024-06-21 21:52:57,908] INFO: Initiating epoch #217 valid run on device rank=0 [2024-06-21 21:53:32,366] INFO: Rank 0: epoch=217 / 400 train_loss=10.1253 valid_loss=10.1609 stale=4 time=2.81m eta=498.7m [2024-06-21 21:53:32,624] INFO: Initiating epoch #218 train run on device rank=0 [2024-06-21 21:55:45,187] INFO: Initiating epoch #218 valid run on device rank=0 [2024-06-21 21:56:19,650] INFO: Rank 0: epoch=218 / 400 train_loss=10.1253 valid_loss=10.1609 stale=5 time=2.78m eta=496.0m [2024-06-21 21:56:19,732] INFO: Initiating epoch #219 train run on device rank=0 [2024-06-21 21:58:32,324] INFO: Initiating epoch #219 valid run on device rank=0 [2024-06-21 21:59:07,319] INFO: Rank 0: epoch=219 / 400 train_loss=10.1253 valid_loss=10.1609 stale=6 time=2.79m eta=493.3m [2024-06-21 21:59:07,361] INFO: Initiating epoch #220 train run on device rank=0 [2024-06-21 22:01:18,720] INFO: Initiating epoch #220 valid run on device rank=0 [2024-06-21 22:01:53,515] INFO: Rank 0: epoch=220 / 400 train_loss=10.1253 valid_loss=10.1609 stale=7 time=2.77m eta=490.6m [2024-06-21 22:01:53,538] INFO: Initiating epoch #221 train run on device rank=0 [2024-06-21 22:04:05,108] INFO: Initiating epoch #221 valid run on device rank=0 [2024-06-21 22:04:40,834] INFO: Rank 0: epoch=221 / 400 train_loss=10.1253 valid_loss=10.1609 stale=8 time=2.79m eta=488.0m [2024-06-21 22:04:41,374] INFO: Initiating epoch #222 train run on device rank=0 [2024-06-21 22:06:51,686] INFO: Initiating epoch #222 valid run on device rank=0 [2024-06-21 22:07:27,758] INFO: Rank 0: epoch=222 / 400 train_loss=10.1253 
valid_loss=10.1609 stale=9 time=2.77m eta=485.3m [2024-06-21 22:07:28,643] INFO: Initiating epoch #223 train run on device rank=0 [2024-06-21 22:09:39,558] INFO: Initiating epoch #223 valid run on device rank=0 [2024-06-21 22:10:14,893] INFO: Rank 0: epoch=223 / 400 train_loss=10.1253 valid_loss=10.1609 stale=10 time=2.77m eta=482.6m [2024-06-21 22:10:15,586] INFO: Initiating epoch #224 train run on device rank=0 [2024-06-21 22:12:27,154] INFO: Initiating epoch #224 valid run on device rank=0 [2024-06-21 22:13:03,096] INFO: Rank 0: epoch=224 / 400 train_loss=10.1253 valid_loss=10.1609 stale=11 time=2.79m eta=479.9m [2024-06-21 22:13:03,706] INFO: Initiating epoch #225 train run on device rank=0 [2024-06-21 22:15:13,675] INFO: Initiating epoch #225 valid run on device rank=0 [2024-06-21 22:15:59,067] INFO: Rank 0: epoch=225 / 400 train_loss=10.1253 valid_loss=10.1609 stale=12 time=2.92m eta=477.4m [2024-06-21 22:15:59,712] INFO: Initiating epoch #226 train run on device rank=0 [2024-06-21 22:18:07,534] INFO: Initiating epoch #226 valid run on device rank=0 [2024-06-21 22:18:43,292] INFO: Rank 0: epoch=226 / 400 train_loss=10.1253 valid_loss=10.1609 stale=13 time=2.73m eta=474.7m [2024-06-21 22:18:43,730] INFO: Initiating epoch #227 train run on device rank=0 [2024-06-21 22:20:53,676] INFO: Initiating epoch #227 valid run on device rank=0 [2024-06-21 22:21:30,305] INFO: Rank 0: epoch=227 / 400 train_loss=10.1253 valid_loss=10.1609 stale=14 time=2.78m eta=472.0m [2024-06-21 22:21:31,124] INFO: Initiating epoch #228 train run on device rank=0 [2024-06-21 22:23:40,404] INFO: Initiating epoch #228 valid run on device rank=0 [2024-06-21 22:24:16,390] INFO: Rank 0: epoch=228 / 400 train_loss=10.1253 valid_loss=10.1609 stale=15 time=2.75m eta=469.3m [2024-06-21 22:24:17,490] INFO: Initiating epoch #229 train run on device rank=0 [2024-06-21 22:26:27,632] INFO: Initiating epoch #229 valid run on device rank=0 [2024-06-21 22:27:07,236] INFO: Rank 0: epoch=229 / 400 
train_loss=10.1252 valid_loss=10.1609 stale=16 time=2.83m eta=466.6m [2024-06-21 22:27:07,898] INFO: Initiating epoch #230 train run on device rank=0 [2024-06-21 22:29:15,673] INFO: Initiating epoch #230 valid run on device rank=0 [2024-06-21 22:29:52,005] INFO: Rank 0: epoch=230 / 400 train_loss=10.1252 valid_loss=10.1609 stale=17 time=2.74m eta=463.9m [2024-06-21 22:29:52,841] INFO: Initiating epoch #231 train run on device rank=0 [2024-06-21 22:32:01,742] INFO: Initiating epoch #231 valid run on device rank=0 [2024-06-21 22:32:36,998] INFO: Rank 0: epoch=231 / 400 train_loss=10.1252 valid_loss=10.1609 stale=18 time=2.74m eta=461.2m [2024-06-21 22:32:37,524] INFO: Initiating epoch #232 train run on device rank=0 [2024-06-21 22:34:47,240] INFO: Initiating epoch #232 valid run on device rank=0 [2024-06-21 22:35:23,770] INFO: Rank 0: epoch=232 / 400 train_loss=10.1252 valid_loss=10.1609 stale=19 time=2.77m eta=458.5m [2024-06-21 22:35:24,576] INFO: Initiating epoch #233 train run on device rank=0 [2024-06-21 22:37:34,062] INFO: Initiating epoch #233 valid run on device rank=0 [2024-06-21 22:38:08,662] INFO: Rank 0: epoch=233 / 400 train_loss=10.1252 valid_loss=10.1608 stale=20 time=2.73m eta=455.8m [2024-06-21 22:38:08,704] INFO: Initiating epoch #234 train run on device rank=0 [2024-06-21 22:40:19,429] INFO: Initiating epoch #234 valid run on device rank=0 [2024-06-21 22:40:53,958] INFO: Rank 0: epoch=234 / 400 train_loss=10.1251 valid_loss=10.1608 stale=0 time=2.75m eta=453.1m [2024-06-21 22:40:54,052] INFO: Initiating epoch #235 train run on device rank=0 [2024-06-21 22:43:06,454] INFO: Initiating epoch #235 valid run on device rank=0 [2024-06-21 22:43:41,815] INFO: Rank 0: epoch=235 / 400 train_loss=10.1251 valid_loss=10.1608 stale=0 time=2.8m eta=450.4m [2024-06-21 22:43:42,162] INFO: Initiating epoch #236 train run on device rank=0 [2024-06-21 22:45:53,968] INFO: Initiating epoch #236 valid run on device rank=0 [2024-06-21 22:46:28,330] INFO: Rank 0: epoch=236 
/ 400 train_loss=10.1251 valid_loss=10.1608 stale=0 time=2.77m eta=447.7m [2024-06-21 22:46:28,378] INFO: Initiating epoch #237 train run on device rank=0 [2024-06-21 22:48:38,586] INFO: Initiating epoch #237 valid run on device rank=0 [2024-06-21 22:49:12,907] INFO: Rank 0: epoch=237 / 400 train_loss=10.1250 valid_loss=10.1608 stale=0 time=2.74m eta=445.0m [2024-06-21 22:49:12,916] INFO: Initiating epoch #238 train run on device rank=0 [2024-06-21 22:51:24,219] INFO: Initiating epoch #238 valid run on device rank=0 [2024-06-21 22:52:02,620] INFO: Rank 0: epoch=238 / 400 train_loss=10.1250 valid_loss=10.1608 stale=0 time=2.83m eta=442.3m [2024-06-21 22:52:03,474] INFO: Initiating epoch #239 train run on device rank=0 [2024-06-21 22:54:10,999] INFO: Initiating epoch #239 valid run on device rank=0 [2024-06-21 22:54:49,899] INFO: Rank 0: epoch=239 / 400 train_loss=10.1250 valid_loss=10.1608 stale=0 time=2.77m eta=439.6m [2024-06-21 22:54:50,909] INFO: Initiating epoch #240 train run on device rank=0 [2024-06-21 22:57:01,687] INFO: Initiating epoch #240 valid run on device rank=0 [2024-06-21 22:57:40,010] INFO: Rank 0: epoch=240 / 400 train_loss=10.1249 valid_loss=10.1607 stale=0 time=2.82m eta=437.0m [2024-06-21 22:57:40,759] INFO: Initiating epoch #241 train run on device rank=0 [2024-06-21 22:59:49,917] INFO: Initiating epoch #241 valid run on device rank=0 [2024-06-21 23:00:26,949] INFO: Rank 0: epoch=241 / 400 train_loss=10.1249 valid_loss=10.1607 stale=0 time=2.77m eta=434.3m [2024-06-21 23:00:27,667] INFO: Initiating epoch #242 train run on device rank=0 [2024-06-21 23:02:36,136] INFO: Initiating epoch #242 valid run on device rank=0 [2024-06-21 23:03:14,285] INFO: Rank 0: epoch=242 / 400 train_loss=10.1249 valid_loss=10.1607 stale=0 time=2.78m eta=431.6m [2024-06-21 23:03:14,906] INFO: Initiating epoch #243 train run on device rank=0 [2024-06-21 23:05:23,794] INFO: Initiating epoch #243 valid run on device rank=0 [2024-06-21 23:06:01,916] INFO: Rank 0: 
epoch=243 / 400 train_loss=10.1248 valid_loss=10.1607 stale=0 time=2.78m eta=428.9m [2024-06-21 23:06:02,601] INFO: Initiating epoch #244 train run on device rank=0 [2024-06-21 23:08:10,910] INFO: Initiating epoch #244 valid run on device rank=0 [2024-06-21 23:08:49,045] INFO: Rank 0: epoch=244 / 400 train_loss=10.1248 valid_loss=10.1606 stale=0 time=2.77m eta=426.2m [2024-06-21 23:08:51,937] INFO: Initiating epoch #245 train run on device rank=0 [2024-06-21 23:11:00,143] INFO: Initiating epoch #245 valid run on device rank=0 [2024-06-21 23:11:39,218] INFO: Rank 0: epoch=245 / 400 train_loss=10.1248 valid_loss=10.1606 stale=0 time=2.79m eta=423.5m [2024-06-21 23:11:40,084] INFO: Initiating epoch #246 train run on device rank=0 [2024-06-21 23:13:48,385] INFO: Initiating epoch #246 valid run on device rank=0 [2024-06-21 23:14:25,614] INFO: Rank 0: epoch=246 / 400 train_loss=10.1247 valid_loss=10.1606 stale=0 time=2.76m eta=420.8m [2024-06-21 23:14:25,831] INFO: Initiating epoch #247 train run on device rank=0 [2024-06-21 23:16:35,601] INFO: Initiating epoch #247 valid run on device rank=0 [2024-06-21 23:17:10,070] INFO: Rank 0: epoch=247 / 400 train_loss=10.1247 valid_loss=10.1606 stale=0 time=2.74m eta=418.1m [2024-06-21 23:17:10,416] INFO: Initiating epoch #248 train run on device rank=0 [2024-06-21 23:19:22,453] INFO: Initiating epoch #248 valid run on device rank=0 [2024-06-21 23:19:57,893] INFO: Rank 0: epoch=248 / 400 train_loss=10.1246 valid_loss=10.1606 stale=0 time=2.79m eta=415.4m [2024-06-21 23:19:57,923] INFO: Initiating epoch #249 train run on device rank=0 [2024-06-21 23:22:08,240] INFO: Initiating epoch #249 valid run on device rank=0 [2024-06-21 23:22:43,698] INFO: Rank 0: epoch=249 / 400 train_loss=10.1246 valid_loss=10.1605 stale=0 time=2.76m eta=412.7m [2024-06-21 23:22:43,737] INFO: Initiating epoch #250 train run on device rank=0 [2024-06-21 23:24:53,799] INFO: Initiating epoch #250 valid run on device rank=0 [2024-06-21 23:25:28,980] INFO: Rank 
0: epoch=250 / 400 train_loss=10.1246 valid_loss=10.1605 stale=0 time=2.75m eta=410.0m [2024-06-21 23:25:29,107] INFO: Initiating epoch #251 train run on device rank=0 [2024-06-21 23:27:39,976] INFO: Initiating epoch #251 valid run on device rank=0 [2024-06-21 23:28:15,230] INFO: Rank 0: epoch=251 / 400 train_loss=10.1245 valid_loss=10.1605 stale=0 time=2.77m eta=407.3m [2024-06-21 23:28:15,299] INFO: Initiating epoch #252 train run on device rank=0 [2024-06-21 23:30:26,578] INFO: Initiating epoch #252 valid run on device rank=0 [2024-06-21 23:31:01,798] INFO: Rank 0: epoch=252 / 400 train_loss=10.1245 valid_loss=10.1604 stale=0 time=2.77m eta=404.5m [2024-06-21 23:31:01,848] INFO: Initiating epoch #253 train run on device rank=0 [2024-06-21 23:33:13,410] INFO: Initiating epoch #253 valid run on device rank=0 [2024-06-21 23:33:49,106] INFO: Rank 0: epoch=253 / 400 train_loss=10.1244 valid_loss=10.1604 stale=0 time=2.79m eta=401.8m [2024-06-21 23:33:49,188] INFO: Initiating epoch #254 train run on device rank=0 [2024-06-21 23:35:59,710] INFO: Initiating epoch #254 valid run on device rank=0 [2024-06-21 23:36:35,638] INFO: Rank 0: epoch=254 / 400 train_loss=10.1244 valid_loss=10.1604 stale=0 time=2.77m eta=399.1m [2024-06-21 23:36:35,860] INFO: Initiating epoch #255 train run on device rank=0 [2024-06-21 23:38:46,445] INFO: Initiating epoch #255 valid run on device rank=0 [2024-06-21 23:39:21,178] INFO: Rank 0: epoch=255 / 400 train_loss=10.1244 valid_loss=10.1604 stale=0 time=2.76m eta=396.4m [2024-06-21 23:39:21,266] INFO: Initiating epoch #256 train run on device rank=0 [2024-06-21 23:41:31,698] INFO: Initiating epoch #256 valid run on device rank=0 [2024-06-21 23:42:09,598] INFO: Rank 0: epoch=256 / 400 train_loss=10.1243 valid_loss=10.1603 stale=0 time=2.81m eta=393.7m [2024-06-21 23:42:10,044] INFO: Initiating epoch #257 train run on device rank=0 [2024-06-21 23:44:19,029] INFO: Initiating epoch #257 valid run on device rank=0 [2024-06-21 23:44:53,672] INFO: 
Rank 0: epoch=257 / 400 train_loss=10.1243 valid_loss=10.1603 stale=0 time=2.73m eta=391.0m [2024-06-21 23:44:53,802] INFO: Initiating epoch #258 train run on device rank=0 [2024-06-21 23:47:04,299] INFO: Initiating epoch #258 valid run on device rank=0 [2024-06-21 23:47:38,714] INFO: Rank 0: epoch=258 / 400 train_loss=10.1243 valid_loss=10.1603 stale=0 time=2.75m eta=388.3m [2024-06-21 23:47:38,740] INFO: Initiating epoch #259 train run on device rank=0 [2024-06-21 23:49:49,857] INFO: Initiating epoch #259 valid run on device rank=0 [2024-06-21 23:50:24,573] INFO: Rank 0: epoch=259 / 400 train_loss=10.1242 valid_loss=10.1603 stale=0 time=2.76m eta=385.5m [2024-06-21 23:50:26,684] INFO: Initiating epoch #260 train run on device rank=0 [2024-06-21 23:52:36,754] INFO: Initiating epoch #260 valid run on device rank=0 [2024-06-21 23:53:11,670] INFO: Rank 0: epoch=260 / 400 train_loss=10.1242 valid_loss=10.1603 stale=0 time=2.75m eta=382.8m [2024-06-21 23:53:11,744] INFO: Initiating epoch #261 train run on device rank=0 [2024-06-21 23:55:22,716] INFO: Initiating epoch #261 valid run on device rank=0 [2024-06-21 23:55:57,661] INFO: Rank 0: epoch=261 / 400 train_loss=10.1242 valid_loss=10.1603 stale=0 time=2.77m eta=380.1m [2024-06-21 23:55:57,677] INFO: Initiating epoch #262 train run on device rank=0 [2024-06-21 23:58:12,321] INFO: Initiating epoch #262 valid run on device rank=0 [2024-06-21 23:58:47,544] INFO: Rank 0: epoch=262 / 400 train_loss=10.1242 valid_loss=10.1602 stale=0 time=2.83m eta=377.4m [2024-06-21 23:58:47,597] INFO: Initiating epoch #263 train run on device rank=0 [2024-06-22 00:00:58,954] INFO: Initiating epoch #263 valid run on device rank=0 [2024-06-22 00:01:32,879] INFO: Rank 0: epoch=263 / 400 train_loss=10.1241 valid_loss=10.1602 stale=1 time=2.75m eta=374.7m [2024-06-22 00:01:32,909] INFO: Initiating epoch #264 train run on device rank=0 [2024-06-22 00:03:43,229] INFO: Initiating epoch #264 valid run on device rank=0 [2024-06-22 00:04:18,828] 
INFO: Rank 0: epoch=264 / 400 train_loss=10.1241 valid_loss=10.1602 stale=0 time=2.77m eta=372.0m [2024-06-22 00:04:18,853] INFO: Initiating epoch #265 train run on device rank=0 [2024-06-22 00:06:31,522] INFO: Initiating epoch #265 valid run on device rank=0 [2024-06-22 00:07:07,477] INFO: Rank 0: epoch=265 / 400 train_loss=10.1241 valid_loss=10.1602 stale=0 time=2.81m eta=369.3m [2024-06-22 00:07:07,519] INFO: Initiating epoch #266 train run on device rank=0 [2024-06-22 00:09:18,754] INFO: Initiating epoch #266 valid run on device rank=0 [2024-06-22 00:09:52,594] INFO: Rank 0: epoch=266 / 400 train_loss=10.1241 valid_loss=10.1602 stale=0 time=2.75m eta=366.6m [2024-06-22 00:09:52,636] INFO: Initiating epoch #267 train run on device rank=0 [2024-06-22 00:12:02,204] INFO: Initiating epoch #267 valid run on device rank=0 [2024-06-22 00:12:37,166] INFO: Rank 0: epoch=267 / 400 train_loss=10.1241 valid_loss=10.1602 stale=0 time=2.74m eta=363.8m [2024-06-22 00:12:37,174] INFO: Initiating epoch #268 train run on device rank=0 [2024-06-22 00:14:45,634] INFO: Initiating epoch #268 valid run on device rank=0 [2024-06-22 00:15:19,844] INFO: Rank 0: epoch=268 / 400 train_loss=10.1240 valid_loss=10.1602 stale=1 time=2.71m eta=361.1m [2024-06-22 00:15:19,898] INFO: Initiating epoch #269 train run on device rank=0 [2024-06-22 00:17:30,814] INFO: Initiating epoch #269 valid run on device rank=0 [2024-06-22 00:18:04,823] INFO: Rank 0: epoch=269 / 400 train_loss=10.1240 valid_loss=10.1602 stale=0 time=2.75m eta=358.4m [2024-06-22 00:18:04,847] INFO: Initiating epoch #270 train run on device rank=0 [2024-06-22 00:20:12,964] INFO: Initiating epoch #270 valid run on device rank=0 [2024-06-22 00:20:47,486] INFO: Rank 0: epoch=270 / 400 train_loss=10.1240 valid_loss=10.1602 stale=0 time=2.71m eta=355.6m [2024-06-22 00:20:47,510] INFO: Initiating epoch #271 train run on device rank=0 [2024-06-22 00:22:55,995] INFO: Initiating epoch #271 valid run on device rank=0 [2024-06-22 
00:23:30,362] INFO: Rank 0: epoch=271 / 400 train_loss=10.1240 valid_loss=10.1602 stale=1 time=2.71m eta=352.9m [2024-06-22 00:23:30,512] INFO: Initiating epoch #272 train run on device rank=0 [2024-06-22 00:25:39,569] INFO: Initiating epoch #272 valid run on device rank=0 [2024-06-22 00:26:13,722] INFO: Rank 0: epoch=272 / 400 train_loss=10.1240 valid_loss=10.1602 stale=2 time=2.72m eta=350.1m [2024-06-22 00:26:13,752] INFO: Initiating epoch #273 train run on device rank=0 [2024-06-22 00:28:21,500] INFO: Initiating epoch #273 valid run on device rank=0 [2024-06-22 00:28:56,266] INFO: Rank 0: epoch=273 / 400 train_loss=10.1240 valid_loss=10.1602 stale=3 time=2.71m eta=347.4m [2024-06-22 00:28:56,317] INFO: Initiating epoch #274 train run on device rank=0 [2024-06-22 00:31:07,601] INFO: Initiating epoch #274 valid run on device rank=0 [2024-06-22 00:31:43,312] INFO: Rank 0: epoch=274 / 400 train_loss=10.1240 valid_loss=10.1602 stale=4 time=2.78m eta=344.7m [2024-06-22 00:31:43,367] INFO: Initiating epoch #275 train run on device rank=0 [2024-06-22 00:33:57,750] INFO: Initiating epoch #275 valid run on device rank=0 [2024-06-22 00:34:32,102] INFO: Rank 0: epoch=275 / 400 train_loss=10.1240 valid_loss=10.1602 stale=5 time=2.81m eta=342.0m [2024-06-22 00:34:32,114] INFO: Initiating epoch #276 train run on device rank=0 [2024-06-22 00:36:42,323] INFO: Initiating epoch #276 valid run on device rank=0 [2024-06-22 00:37:19,110] INFO: Rank 0: epoch=276 / 400 train_loss=10.1240 valid_loss=10.1602 stale=6 time=2.78m eta=339.2m [2024-06-22 00:37:19,143] INFO: Initiating epoch #277 train run on device rank=0 [2024-06-22 00:39:30,467] INFO: Initiating epoch #277 valid run on device rank=0 [2024-06-22 00:40:05,622] INFO: Rank 0: epoch=277 / 400 train_loss=10.1241 valid_loss=10.1602 stale=7 time=2.77m eta=336.5m [2024-06-22 00:40:05,652] INFO: Initiating epoch #278 train run on device rank=0 [2024-06-22 00:42:16,716] INFO: Initiating epoch #278 valid run on device rank=0 
[2024-06-22 00:42:53,152] INFO: Rank 0: epoch=278 / 400 train_loss=10.1241 valid_loss=10.1602 stale=8 time=2.79m eta=333.8m [2024-06-22 00:42:53,162] INFO: Initiating epoch #279 train run on device rank=0 [2024-06-22 00:45:02,206] INFO: Initiating epoch #279 valid run on device rank=0 [2024-06-22 00:45:36,698] INFO: Rank 0: epoch=279 / 400 train_loss=10.1241 valid_loss=10.1602 stale=9 time=2.73m eta=331.1m [2024-06-22 00:45:36,720] INFO: Initiating epoch #280 train run on device rank=0 [2024-06-22 00:47:46,615] INFO: Initiating epoch #280 valid run on device rank=0 [2024-06-22 00:48:20,853] INFO: Rank 0: epoch=280 / 400 train_loss=10.1241 valid_loss=10.1602 stale=10 time=2.74m eta=328.3m [2024-06-22 00:48:21,037] INFO: Initiating epoch #281 train run on device rank=0 [2024-06-22 00:50:28,794] INFO: Initiating epoch #281 valid run on device rank=0 [2024-06-22 00:51:03,579] INFO: Rank 0: epoch=281 / 400 train_loss=10.1241 valid_loss=10.1602 stale=11 time=2.71m eta=325.6m [2024-06-22 00:51:03,598] INFO: Initiating epoch #282 train run on device rank=0 [2024-06-22 00:53:11,634] INFO: Initiating epoch #282 valid run on device rank=0 [2024-06-22 00:53:45,338] INFO: Rank 0: epoch=282 / 400 train_loss=10.1241 valid_loss=10.1602 stale=12 time=2.7m eta=322.8m [2024-06-22 00:53:45,342] INFO: Initiating epoch #283 train run on device rank=0 [2024-06-22 00:55:53,989] INFO: Initiating epoch #283 valid run on device rank=0 [2024-06-22 00:56:27,592] INFO: Rank 0: epoch=283 / 400 train_loss=10.1241 valid_loss=10.1602 stale=13 time=2.7m eta=320.1m [2024-06-22 00:56:27,624] INFO: Initiating epoch #284 train run on device rank=0 [2024-06-22 00:58:35,947] INFO: Initiating epoch #284 valid run on device rank=0 [2024-06-22 00:59:10,159] INFO: Rank 0: epoch=284 / 400 train_loss=10.1241 valid_loss=10.1602 stale=14 time=2.71m eta=317.3m [2024-06-22 00:59:10,184] INFO: Initiating epoch #285 train run on device rank=0 [2024-06-22 01:01:18,397] INFO: Initiating epoch #285 valid run on device 
rank=0 [2024-06-22 01:01:52,500] INFO: Rank 0: epoch=285 / 400 train_loss=10.1241 valid_loss=10.1602 stale=15 time=2.71m eta=314.6m [2024-06-22 01:01:52,556] INFO: Initiating epoch #286 train run on device rank=0 [2024-06-22 01:04:00,868] INFO: Initiating epoch #286 valid run on device rank=0 [2024-06-22 01:04:35,353] INFO: Rank 0: epoch=286 / 400 train_loss=10.1242 valid_loss=10.1603 stale=16 time=2.71m eta=311.9m [2024-06-22 01:04:35,422] INFO: Initiating epoch #287 train run on device rank=0 [2024-06-22 01:06:44,544] INFO: Initiating epoch #287 valid run on device rank=0 [2024-06-22 01:07:18,175] INFO: Rank 0: epoch=287 / 400 train_loss=10.1242 valid_loss=10.1603 stale=17 time=2.71m eta=309.1m [2024-06-22 01:07:18,241] INFO: Initiating epoch #288 train run on device rank=0 [2024-06-22 01:09:26,797] INFO: Initiating epoch #288 valid run on device rank=0 [2024-06-22 01:10:02,810] INFO: Rank 0: epoch=288 / 400 train_loss=10.1242 valid_loss=10.1603 stale=18 time=2.74m eta=306.4m [2024-06-22 01:10:02,864] INFO: Initiating epoch #289 train run on device rank=0 [2024-06-22 01:12:12,805] INFO: Initiating epoch #289 valid run on device rank=0 [2024-06-22 01:12:46,424] INFO: Rank 0: epoch=289 / 400 train_loss=10.1242 valid_loss=10.1603 stale=19 time=2.73m eta=303.6m [2024-06-22 01:12:46,640] INFO: Initiating epoch #290 train run on device rank=0 [2024-06-22 01:14:57,245] INFO: Initiating epoch #290 valid run on device rank=0 [2024-06-22 01:15:30,621] INFO: Rank 0: epoch=290 / 400 train_loss=10.1242 valid_loss=10.1603 stale=20 time=2.73m eta=300.9m [2024-06-22 01:15:30,660] INFO: Initiating epoch #291 train run on device rank=0 [2024-06-22 01:17:38,756] INFO: Initiating epoch #291 valid run on device rank=0 [2024-06-22 01:18:12,953] INFO: Done with training. Total training time on device 0 is 796.001min