  • new_learning_log3.txt 527.82 KiB
    using world size: 1 and model-parallel size: 1 
     > using dynamic loss scaling
    > initializing model parallel with size 1
    Pretrain GPT2 model
    arguments:
      pretrained_bert .............. False
      attention_dropout ............ 0.1
      num_attention_heads .......... 20
      hidden_size .................. 1280
      intermediate_size ............ None
      num_layers ................... 36
      layernorm_epsilon ............ 1e-05
      hidden_dropout ............... 0.1
      max_position_embeddings ...... 1024
      vocab_size ................... 32296
      deep_init .................... False
      make_vocab_size_divisible_by . 128
      fp16 ......................... True
      fp32_embedding ............... False
      fp32_layernorm ............... False
      fp32_tokentypes .............. False
      fp32_allreduce ............... False
      hysteresis ................... 2
      loss_scale ................... None
      loss_scale_window ............ 1000
      min_scale .................... 1
      batch_size ................... 8
      weight_decay ................. 0.01
      checkpoint_activations ....... True
      checkpoint_num_layers ........ 1
      clip_grad .................... 1.0
      train_iters .................. 300000
      log_interval ................. 100
      exit_interval ................ None
      tensorboard_dir .............. None
      seed ......................... 1234
      reset_position_ids ........... False
      reset_attention_mask ......... False
      eod_mask_loss ................ False
      lr_decay_iters ............... None
      lr_decay_style ............... cosine
      lr ........................... 0.00015
      min_lr ....................... 0.0
      warmup ....................... 0.01
      override_lr_scheduler ........ False
      use_checkpoint_lr_scheduler .. False
      save ......................... checkpoints/gpt2_750m_2
      save_interval ................ 5000
      no_save_optim ................ False
      no_save_rng .................. False
      load ......................... checkpoints/gpt2_750m_2
      no_load_optim ................ False
      no_load_rng .................. False
      finetune ..................... False
      resume_dataloader ............ True
      distributed_backend .......... nccl
      DDP_impl ..................... local
      local_rank ................... None
      adlr_autoresume .............. False
      adlr_autoresume_interval ..... 1000
      eval_batch_size .............. None
      eval_iters ................... 100
      eval_interval ................ 1000
      eval_seq_length .............. None
      eval_max_preds_per_seq ....... None
      overlapping_eval ............. 32
      cloze_eval ................... False
      strict_lambada ............... False
      eval_hf ...................... False
      load_openai .................. False
      temperature .................. 1.0
      greedy ....................... False
      top_p ........................ 0.0
      top_k ........................ 0
      out_seq_length ............... 1024
      sample_input_file ............ 
      sample_output_file ........... 
      num_samples .................. 0
      genfile ...................... None
      recompute .................... False
      model_parallel_size .......... 1
      shuffle ...................... False
      train_data ................... ['wikipedia']
      use_npy_data_loader .......... False
      train_data_path .............. 
      val_data_path ................ 
      test_data_path ............... 
      input_data_sizes_file ........ sizes.txt
      delim ........................ ,
      text_key ..................... sentence
      eval_text_key ................ None
      valid_data ................... None
      split ........................ 949,50,1
      test_data .................... None
      lazy_loader .................. True
      loose_json ................... False
      presplit_sentences ........... False
      num_workers .................. 2
      tokenizer_model_type ......... bert-large-uncased
      tokenizer_path ............... tokenizer.model
      tokenizer_type ............... GPT2BPETokenizer
      cache_dir .................... cache
      use_tfrecords ................ False
      seq_length ................... 1024
      max_preds_per_seq ............ None
      cuda ......................... True
      rank ......................... 0
      world_size ................... 1
      dynamic_loss_scale ........... True
    > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    configuring data
    > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
    > found end-of-document token: 50256
    building GPT2 model ...
     > number of parameters on model parallel rank 0: 774090240
    learning rate decaying cosine
    global rank 0 is loading checkpoint checkpoints/gpt2_750m_2/iter_0130000/mp_rank_00/model_optim_rng.pt
     > using checkpoint value 0.00015 for learning rate
     > using checkpoint value 0.0 for minimum learning rate
     > using checkpoint value 3000.0 for warmup iterations
     > using checkpoint value 300000 for total number of iterations
     > using checkpoint value cosine for decay style
      successfully loaded checkpoints/gpt2_750m_2/iter_0130000/mp_rank_00/model_optim_rng.pt
    setting training data start iteration to 130000
    setting validation data start iteration to 13000
     iteration   130100/  300000 | elapsed time per iteration (ms): 2458.5 | learning rate 9.293E-05 | lm loss 1.639116E+00 | loss scale 524288.0 |
    after 130100 iterations memory (MB) | allocated: 15024.22607421875 | max allocated: 20513.126953125 | cached: 21598.0 | max cached: 21598.0
    time (ms) | forward: 690.11 | backward: 1712.41 | allreduce: 25.10 | optimizer: 55.81 | batch generator: 25.93 | data loader: 25.45
     iteration   130200/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 9.285E-05 | lm loss 1.644546E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1727.71 | allreduce: 27.92 | optimizer: 55.24 | batch generator: 0.50 | data loader: 0.04
     iteration   130300/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 9.278E-05 | lm loss 1.660332E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.51 | backward: 1724.65 | allreduce: 24.32 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   130400/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 9.270E-05 | lm loss 1.667089E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1724.37 | allreduce: 24.79 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   130500/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 9.262E-05 | lm loss 1.642417E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.04 | backward: 1725.60 | allreduce: 26.57 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   130600/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 9.255E-05 | lm loss 1.648132E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.25 | backward: 1724.35 | allreduce: 24.59 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   130700/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 9.247E-05 | lm loss 1.635626E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.20 | backward: 1723.82 | allreduce: 24.59 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   130800/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 9.239E-05 | lm loss 1.634159E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1725.96 | allreduce: 26.92 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   130900/  300000 | elapsed time per iteration (ms): 2447.4 | learning rate 9.232E-05 | lm loss 1.646588E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.17 | backward: 1723.27 | allreduce: 24.11 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   131000/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 9.224E-05 | lm loss 1.632920E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1723.75 | allreduce: 24.19 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 131000 | LM loss: 1.693012E+00 | LM PPL: 5.435829E+00
    ------------------------------------------------------------------------------------
     iteration   131100/  300000 | elapsed time per iteration (ms): 3128.1 | learning rate 9.217E-05 | lm loss 1.623064E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.04 | backward: 1726.64 | allreduce: 27.93 | optimizer: 55.80 | batch generator: 79.57 | data loader: 78.68
     iteration   131200/  300000 | elapsed time per iteration (ms): 2447.8 | learning rate 9.209E-05 | lm loss 1.644969E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.09 | backward: 1723.74 | allreduce: 24.67 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   131300/  300000 | elapsed time per iteration (ms): 2446.1 | learning rate 9.201E-05 | lm loss 1.637834E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.01 | backward: 1722.63 | allreduce: 24.40 | optimizer: 55.24 | batch generator: 0.50 | data loader: 0.04
     iteration   131400/  300000 | elapsed time per iteration (ms): 2446.9 | learning rate 9.194E-05 | lm loss 1.643891E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.07 | backward: 1723.95 | allreduce: 26.70 | optimizer: 54.68 | batch generator: 0.50 | data loader: 0.04
     iteration   131500/  300000 | elapsed time per iteration (ms): 2446.0 | learning rate 9.186E-05 | lm loss 1.648590E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.02 | backward: 1721.96 | allreduce: 23.65 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   131600/  300000 | elapsed time per iteration (ms): 2445.8 | learning rate 9.179E-05 | lm loss 1.649733E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.01 | backward: 1721.78 | allreduce: 23.46 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   131700/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 9.171E-05 | lm loss 1.634006E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.91 | backward: 1725.73 | allreduce: 27.67 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   131800/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 9.163E-05 | lm loss 1.651026E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.06 | backward: 1722.66 | allreduce: 24.24 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   131900/  300000 | elapsed time per iteration (ms): 2446.2 | learning rate 9.156E-05 | lm loss 1.645223E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.99 | backward: 1722.24 | allreduce: 23.74 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   132000/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 9.148E-05 | lm loss 1.640957E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.00 | backward: 1724.50 | allreduce: 27.00 | optimizer: 55.24 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 132000 | LM loss: 1.697133E+00 | LM PPL: 5.458274E+00
    ------------------------------------------------------------------------------------
     iteration   132100/  300000 | elapsed time per iteration (ms): 3051.4 | learning rate 9.140E-05 | lm loss 1.634691E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.23 | backward: 1723.84 | allreduce: 25.34 | optimizer: 55.79 | batch generator: 3.90 | data loader: 3.04
     iteration   132200/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 9.133E-05 | lm loss 1.633199E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.61 | backward: 1724.01 | allreduce: 24.19 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   132300/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 9.125E-05 | lm loss 1.626285E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.59 | backward: 1726.28 | allreduce: 26.66 | optimizer: 55.80 | batch generator: 0.49 | data loader: 0.04
     iteration   132400/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 9.117E-05 | lm loss 1.628713E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.44 | backward: 1723.55 | allreduce: 24.32 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   132500/  300000 | elapsed time per iteration (ms): 2447.1 | learning rate 9.110E-05 | lm loss 1.646128E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.31 | backward: 1722.80 | allreduce: 24.14 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   132600/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 9.102E-05 | lm loss 1.626013E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.41 | backward: 1725.62 | allreduce: 26.59 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   132700/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 9.094E-05 | lm loss 1.637809E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.52 | backward: 1724.20 | allreduce: 24.72 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   132800/  300000 | elapsed time per iteration (ms): 2447.0 | learning rate 9.087E-05 | lm loss 1.647463E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.41 | backward: 1723.22 | allreduce: 24.97 | optimizer: 55.24 | batch generator: 0.49 | data loader: 0.04
     iteration   132900/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 9.079E-05 | lm loss 1.632657E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.42 | backward: 1725.19 | allreduce: 26.66 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   133000/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 9.071E-05 | lm loss 1.631149E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.65 | backward: 1723.38 | allreduce: 23.81 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 133000 | LM loss: 1.667062E+00 | LM PPL: 5.296584E+00
    ------------------------------------------------------------------------------------
     iteration   133100/  300000 | elapsed time per iteration (ms): 3062.2 | learning rate 9.064E-05 | lm loss 1.650172E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.47 | backward: 1722.86 | allreduce: 24.14 | optimizer: 55.79 | batch generator: 16.14 | data loader: 15.27
     iteration   133200/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 9.056E-05 | lm loss 1.629386E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.26 | backward: 1724.84 | allreduce: 26.57 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   133300/  300000 | elapsed time per iteration (ms): 2446.4 | learning rate 9.048E-05 | lm loss 1.626385E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.18 | backward: 1722.20 | allreduce: 24.29 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   133400/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 9.041E-05 | lm loss 1.612287E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.20 | backward: 1722.33 | allreduce: 24.16 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   133500/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 9.033E-05 | lm loss 1.659256E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.30 | backward: 1725.97 | allreduce: 27.47 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   133600/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 9.025E-05 | lm loss 1.643865E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.27 | backward: 1722.22 | allreduce: 23.81 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   133700/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 9.018E-05 | lm loss 1.656689E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.23 | backward: 1722.47 | allreduce: 24.21 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   133800/  300000 | elapsed time per iteration (ms): 2447.0 | learning rate 9.010E-05 | lm loss 1.667565E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.17 | backward: 1723.44 | allreduce: 25.98 | optimizer: 55.24 | batch generator: 0.50 | data loader: 0.04
     iteration   133900/  300000 | elapsed time per iteration (ms): 2447.1 | learning rate 9.002E-05 | lm loss 1.624317E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.48 | backward: 1722.68 | allreduce: 24.23 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   134000/  300000 | elapsed time per iteration (ms): 2447.1 | learning rate 8.995E-05 | lm loss 1.639552E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.54 | backward: 1722.59 | allreduce: 23.95 | optimizer: 55.81 | batch generator: 0.51 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 134000 | LM loss: 1.701056E+00 | LM PPL: 5.479732E+00
    ------------------------------------------------------------------------------------
     iteration   134100/  300000 | elapsed time per iteration (ms): 3058.1 | learning rate 8.987E-05 | lm loss 1.638367E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.44 | backward: 1725.24 | allreduce: 26.93 | optimizer: 55.80 | batch generator: 9.59 | data loader: 8.72
     iteration   134200/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 8.979E-05 | lm loss 1.638537E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.41 | backward: 1723.29 | allreduce: 24.93 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   134300/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 8.971E-05 | lm loss 1.653795E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.39 | backward: 1722.34 | allreduce: 24.17 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   134400/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 8.964E-05 | lm loss 1.630844E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.50 | backward: 1725.47 | allreduce: 26.96 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   134500/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 8.956E-05 | lm loss 1.662026E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.55 | backward: 1723.94 | allreduce: 25.09 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   134600/  300000 | elapsed time per iteration (ms): 2447.4 | learning rate 8.948E-05 | lm loss 1.629301E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.43 | backward: 1722.97 | allreduce: 24.35 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   134700/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 8.941E-05 | lm loss 1.638879E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.34 | backward: 1726.20 | allreduce: 27.67 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   134800/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 8.933E-05 | lm loss 1.614962E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.59 | backward: 1723.85 | allreduce: 24.79 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   134900/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 8.925E-05 | lm loss 1.615125E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.46 | backward: 1723.52 | allreduce: 24.62 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   135000/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 8.917E-05 | lm loss 1.626848E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.29 | backward: 1725.06 | allreduce: 26.92 | optimizer: 55.81 | batch generator: 0.51 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  135000 to checkpoints/gpt2_750m_2/iter_0135000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0135000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 135000 | LM loss: 1.702094E+00 | LM PPL: 5.485422E+00
    ------------------------------------------------------------------------------------
     iteration   135100/  300000 | elapsed time per iteration (ms): 3109.6 | learning rate 8.910E-05 | lm loss 1.636857E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.38 | backward: 1722.85 | allreduce: 24.46 | optimizer: 55.79 | batch generator: 8.91 | data loader: 8.04
     iteration   135200/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 8.902E-05 | lm loss 1.626400E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.37 | backward: 1725.65 | allreduce: 27.07 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   135300/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 8.894E-05 | lm loss 1.633049E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.49 | backward: 1722.74 | allreduce: 23.73 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   135400/  300000 | elapsed time per iteration (ms): 2447.4 | learning rate 8.887E-05 | lm loss 1.635175E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.31 | backward: 1723.12 | allreduce: 24.53 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   135500/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 8.879E-05 | lm loss 1.630444E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.21 | backward: 1724.35 | allreduce: 26.61 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   135600/  300000 | elapsed time per iteration (ms): 2445.8 | learning rate 8.871E-05 | lm loss 1.624113E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.07 | backward: 1721.78 | allreduce: 23.95 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   135700/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 8.863E-05 | lm loss 1.626563E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.19 | backward: 1722.39 | allreduce: 24.37 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   135800/  300000 | elapsed time per iteration (ms): 2448.9 | learning rate 8.856E-05 | lm loss 1.627990E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.20 | backward: 1724.79 | allreduce: 26.57 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   135900/  300000 | elapsed time per iteration (ms): 2446.0 | learning rate 8.848E-05 | lm loss 1.658435E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.04 | backward: 1721.99 | allreduce: 23.99 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   136000/  300000 | elapsed time per iteration (ms): 2446.3 | learning rate 8.840E-05 | lm loss 1.629040E+00 | loss scale 131072.0 |
    time (ms) | forward: 667.94 | backward: 1722.36 | allreduce: 24.65 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 136000 | LM loss: 1.675859E+00 | LM PPL: 5.343385E+00
    ------------------------------------------------------------------------------------
     iteration   136100/  300000 | elapsed time per iteration (ms): 3061.0 | learning rate 8.833E-05 | lm loss 1.630849E+00 | loss scale 131072.0 |
    time (ms) | forward: 667.95 | backward: 1724.60 | allreduce: 26.84 | optimizer: 55.78 | batch generator: 13.75 | data loader: 12.90
     iteration   136200/  300000 | elapsed time per iteration (ms): 2445.3 | learning rate 8.825E-05 | lm loss 1.644989E+00 | loss scale 131072.0 |
    time (ms) | forward: 667.83 | backward: 1721.47 | allreduce: 24.01 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   136300/  300000 | elapsed time per iteration (ms): 2445.6 | learning rate 8.817E-05 | lm loss 1.659691E+00 | loss scale 131072.0 |
    time (ms) | forward: 667.83 | backward: 1721.79 | allreduce: 24.46 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   136400/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 8.809E-05 | lm loss 1.620970E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.04 | backward: 1724.97 | allreduce: 27.03 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   136500/  300000 | elapsed time per iteration (ms): 2446.1 | learning rate 8.802E-05 | lm loss 1.638337E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.07 | backward: 1722.02 | allreduce: 24.27 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   136600/  300000 | elapsed time per iteration (ms): 2445.6 | learning rate 8.794E-05 | lm loss 1.624787E+00 | loss scale 131072.0 |
    time (ms) | forward: 667.84 | backward: 1721.78 | allreduce: 24.46 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   136700/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 8.786E-05 | lm loss 1.636629E+00 | loss scale 131072.0 |
    time (ms) | forward: 667.86 | backward: 1724.63 | allreduce: 27.11 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   136800/  300000 | elapsed time per iteration (ms): 2444.9 | learning rate 8.778E-05 | lm loss 1.634099E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.92 | backward: 1721.01 | allreduce: 23.41 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   136900/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 8.771E-05 | lm loss 1.630581E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.94 | backward: 1722.77 | allreduce: 24.54 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   137000/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 8.763E-05 | lm loss 1.631469E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.03 | backward: 1724.82 | allreduce: 26.57 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 137000 | LM loss: 1.680419E+00 | LM PPL: 5.367806E+00
    ------------------------------------------------------------------------------------
     iteration   137100/  300000 | elapsed time per iteration (ms): 3045.9 | learning rate 8.755E-05 | lm loss 1.622637E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.99 | backward: 1722.04 | allreduce: 23.99 | optimizer: 55.79 | batch generator: 1.24 | data loader: 0.38
     iteration   137200/  300000 | elapsed time per iteration (ms): 2445.4 | learning rate 8.747E-05 | lm loss 1.652675E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.95 | backward: 1721.44 | allreduce: 23.51 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   137300/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 8.740E-05 | lm loss 1.636625E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.75 | backward: 1724.40 | allreduce: 26.73 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   137400/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 8.732E-05 | lm loss 1.628191E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.84 | backward: 1721.20 | allreduce: 23.56 | optimizer: 55.79 | batch generator: 0.49 | data loader: 0.04
     iteration   137500/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 8.724E-05 | lm loss 1.637398E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.81 | backward: 1721.07 | allreduce: 23.29 | optimizer: 55.79 | batch generator: 0.49 | data loader: 0.04
     iteration   137600/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 8.716E-05 | lm loss 1.624063E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.77 | backward: 1723.73 | allreduce: 26.22 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   137700/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 8.709E-05 | lm loss 1.652098E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.82 | backward: 1721.96 | allreduce: 24.28 | optimizer: 55.79 | batch generator: 0.49 | data loader: 0.04
     iteration   137800/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 8.701E-05 | lm loss 1.632375E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.81 | backward: 1721.25 | allreduce: 23.44 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   137900/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 8.693E-05 | lm loss 1.623040E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.90 | backward: 1725.17 | allreduce: 26.91 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   138000/  300000 | elapsed time per iteration (ms): 2445.9 | learning rate 8.685E-05 | lm loss 1.621116E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.89 | backward: 1722.07 | allreduce: 24.05 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 138000 | LM loss: 1.680221E+00 | LM PPL: 5.366742E+00
    ------------------------------------------------------------------------------------
     iteration   138100/  300000 | elapsed time per iteration (ms): 3045.0 | learning rate 8.678E-05 | lm loss 1.607277E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.88 | backward: 1721.83 | allreduce: 23.57 | optimizer: 55.79 | batch generator: 0.93 | data loader: 0.07
     iteration   138200/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 8.670E-05 | lm loss 1.628989E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.75 | backward: 1723.79 | allreduce: 25.80 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   138300/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 8.662E-05 | lm loss 1.638833E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.86 | backward: 1721.85 | allreduce: 23.84 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   138400/  300000 | elapsed time per iteration (ms): 2445.8 | learning rate 8.654E-05 | lm loss 1.627660E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.70 | backward: 1722.11 | allreduce: 24.17 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   138500/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 8.647E-05 | lm loss 1.645545E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.80 | backward: 1725.24 | allreduce: 26.99 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   138600/  300000 | elapsed time per iteration (ms): 2445.4 | learning rate 8.639E-05 | lm loss 1.615064E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.68 | backward: 1721.73 | allreduce: 23.77 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   138700/  300000 | elapsed time per iteration (ms): 2446.2 | learning rate 8.631E-05 | lm loss 1.631654E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.90 | backward: 1722.34 | allreduce: 23.81 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   138800/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 8.623E-05 | lm loss 1.634731E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.84 | backward: 1725.04 | allreduce: 26.72 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   138900/  300000 | elapsed time per iteration (ms): 2446.4 | learning rate 8.616E-05 | lm loss 1.611541E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.69 | backward: 1722.76 | allreduce: 24.52 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   139000/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 8.608E-05 | lm loss 1.623726E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.68 | backward: 1722.01 | allreduce: 23.82 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 139000 | LM loss: 1.682555E+00 | LM PPL: 5.379281E+00
    ------------------------------------------------------------------------------------
     iteration   139100/  300000 | elapsed time per iteration (ms): 3052.6 | learning rate 8.600E-05 | lm loss 1.633978E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.81 | backward: 1724.26 | allreduce: 26.07 | optimizer: 55.78 | batch generator: 5.96 | data loader: 5.09
     iteration   139200/  300000 | elapsed time per iteration (ms): 2445.6 | learning rate 8.592E-05 | lm loss 1.630968E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.80 | backward: 1721.80 | allreduce: 23.45 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   139300/  300000 | elapsed time per iteration (ms): 2444.5 | learning rate 8.585E-05 | lm loss 1.619876E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1720.72 | allreduce: 23.30 | optimizer: 55.22 | batch generator: 1.11 | data loader: 0.65
     iteration   139400/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 8.577E-05 | lm loss 1.641925E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.74 | backward: 1723.40 | allreduce: 25.91 | optimizer: 55.23 | batch generator: 0.49 | data loader: 0.04
     iteration   139500/  300000 | elapsed time per iteration (ms): 2445.8 | learning rate 8.569E-05 | lm loss 1.623623E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.71 | backward: 1722.14 | allreduce: 24.30 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   139600/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 8.561E-05 | lm loss 1.621255E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.88 | backward: 1721.87 | allreduce: 23.68 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   139700/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 8.554E-05 | lm loss 1.621121E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.72 | backward: 1724.48 | allreduce: 26.38 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   139800/  300000 | elapsed time per iteration (ms): 2445.1 | learning rate 8.546E-05 | lm loss 1.600436E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.76 | backward: 1721.40 | allreduce: 23.58 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   139900/  300000 | elapsed time per iteration (ms): 2444.7 | learning rate 8.538E-05 | lm loss 1.607543E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.69 | backward: 1721.07 | allreduce: 23.32 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   140000/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 8.530E-05 | lm loss 1.624468E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.70 | backward: 1724.24 | allreduce: 26.52 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  140000 to checkpoints/gpt2_750m_2/iter_0140000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0140000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 140000 | LM loss: 1.673623E+00 | LM PPL: 5.331447E+00
    ------------------------------------------------------------------------------------
     iteration   140100/  300000 | elapsed time per iteration (ms): 3120.0 | learning rate 8.522E-05 | lm loss 1.632671E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.54 | backward: 1721.51 | allreduce: 24.29 | optimizer: 55.79 | batch generator: 18.47 | data loader: 17.60
     iteration   140200/  300000 | elapsed time per iteration (ms): 2445.4 | learning rate 8.515E-05 | lm loss 1.612584E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1721.74 | allreduce: 23.77 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   140300/  300000 | elapsed time per iteration (ms): 2447.4 | learning rate 8.507E-05 | lm loss 1.617171E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1723.70 | allreduce: 25.92 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   140400/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 8.499E-05 | lm loss 1.628141E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.64 | backward: 1721.31 | allreduce: 23.45 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   140500/  300000 | elapsed time per iteration (ms): 2444.2 | learning rate 8.491E-05 | lm loss 1.615904E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.68 | backward: 1720.61 | allreduce: 22.92 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   140600/  300000 | elapsed time per iteration (ms): 2447.6 | learning rate 8.484E-05 | lm loss 1.617158E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.75 | backward: 1723.93 | allreduce: 25.89 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   140700/  300000 | elapsed time per iteration (ms): 2446.6 | learning rate 8.476E-05 | lm loss 1.626810E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.80 | backward: 1722.88 | allreduce: 24.48 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   140800/  300000 | elapsed time per iteration (ms): 2445.3 | learning rate 8.468E-05 | lm loss 1.606187E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.67 | backward: 1721.77 | allreduce: 23.82 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   140900/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 8.460E-05 | lm loss 1.645331E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.89 | backward: 1724.25 | allreduce: 26.58 | optimizer: 55.23 | batch generator: 0.51 | data loader: 0.04
     iteration   141000/  300000 | elapsed time per iteration (ms): 2443.6 | learning rate 8.453E-05 | lm loss 1.600327E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1720.47 | allreduce: 22.88 | optimizer: 55.23 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 141000 | LM loss: 1.678585E+00 | LM PPL: 5.357969E+00
    ------------------------------------------------------------------------------------
     iteration   141100/  300000 | elapsed time per iteration (ms): 3046.3 | learning rate 8.445E-05 | lm loss 1.612395E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.87 | backward: 1721.41 | allreduce: 23.26 | optimizer: 55.79 | batch generator: 2.84 | data loader: 1.97
     iteration   141200/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 8.437E-05 | lm loss 1.632437E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.80 | backward: 1723.84 | allreduce: 26.05 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   141300/  300000 | elapsed time per iteration (ms): 2444.9 | learning rate 8.429E-05 | lm loss 1.642236E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.74 | backward: 1721.32 | allreduce: 23.62 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   141400/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 8.421E-05 | lm loss 1.640242E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.82 | backward: 1721.08 | allreduce: 23.24 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   141500/  300000 | elapsed time per iteration (ms): 2447.0 | learning rate 8.414E-05 | lm loss 1.623424E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.72 | backward: 1723.39 | allreduce: 25.80 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   141600/  300000 | elapsed time per iteration (ms): 2444.6 | learning rate 8.406E-05 | lm loss 1.615847E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.92 | backward: 1720.79 | allreduce: 22.82 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   141700/  300000 | elapsed time per iteration (ms): 2444.6 | learning rate 8.398E-05 | lm loss 1.632841E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.76 | backward: 1720.91 | allreduce: 23.09 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   141800/  300000 | elapsed time per iteration (ms): 2446.8 | learning rate 8.390E-05 | lm loss 1.618629E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.67 | backward: 1723.28 | allreduce: 25.78 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   141900/  300000 | elapsed time per iteration (ms): 2444.7 | learning rate 8.382E-05 | lm loss 1.613483E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.65 | backward: 1721.20 | allreduce: 23.64 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   142000/  300000 | elapsed time per iteration (ms): 2444.4 | learning rate 8.375E-05 | lm loss 1.602413E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.61 | backward: 1720.91 | allreduce: 23.34 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 142000 | LM loss: 1.688109E+00 | LM PPL: 5.409243E+00
    ------------------------------------------------------------------------------------
     iteration   142100/  300000 | elapsed time per iteration (ms): 3082.1 | learning rate 8.367E-05 | lm loss 1.625222E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.46 | backward: 1723.45 | allreduce: 26.22 | optimizer: 55.79 | batch generator: 37.06 | data loader: 36.19
     iteration   142200/  300000 | elapsed time per iteration (ms): 2444.9 | learning rate 8.359E-05 | lm loss 1.599989E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.56 | backward: 1721.42 | allreduce: 23.81 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   142300/  300000 | elapsed time per iteration (ms): 2444.0 | learning rate 8.351E-05 | lm loss 1.595482E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.50 | backward: 1720.61 | allreduce: 23.42 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   142400/  300000 | elapsed time per iteration (ms): 2446.4 | learning rate 8.343E-05 | lm loss 1.603345E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.53 | backward: 1722.93 | allreduce: 25.34 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   142500/  300000 | elapsed time per iteration (ms): 2444.5 | learning rate 8.336E-05 | lm loss 1.613604E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.46 | backward: 1721.16 | allreduce: 23.84 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   142600/  300000 | elapsed time per iteration (ms): 2445.1 | learning rate 8.328E-05 | lm loss 1.626125E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.51 | backward: 1721.73 | allreduce: 24.27 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   142700/  300000 | elapsed time per iteration (ms): 2446.6 | learning rate 8.320E-05 | lm loss 1.616251E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.46 | backward: 1723.22 | allreduce: 26.06 | optimizer: 55.79 | batch generator: 0.52 | data loader: 0.04
     iteration   142800/  300000 | elapsed time per iteration (ms): 2443.8 | learning rate 8.312E-05 | lm loss 1.594443E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.36 | backward: 1720.58 | allreduce: 23.51 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   142900/  300000 | elapsed time per iteration (ms): 2444.2 | learning rate 8.304E-05 | lm loss 1.628204E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.52 | backward: 1720.77 | allreduce: 23.45 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   143000/  300000 | elapsed time per iteration (ms): 2446.2 | learning rate 8.297E-05 | lm loss 1.609052E+00 | loss scale 2097152.0 |
    time (ms) | forward: 667.42 | backward: 1723.43 | allreduce: 26.13 | optimizer: 55.23 | batch generator: 0.52 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 143000 | LM loss: 1.694710E+00 | LM PPL: 5.445066E+00
    ------------------------------------------------------------------------------------
     iteration   143100/  300000 | elapsed time per iteration (ms): 3043.7 | learning rate 8.289E-05 | lm loss 1.618697E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.58 | backward: 1720.63 | allreduce: 23.64 | optimizer: 55.23 | batch generator: 2.03 | data loader: 1.16
     iteration   143200/  300000 | elapsed time per iteration (ms): 2443.3 | learning rate 8.281E-05 | lm loss 1.618974E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.40 | backward: 1719.95 | allreduce: 22.96 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   143300/  300000 | elapsed time per iteration (ms): 2446.6 | learning rate 8.273E-05 | lm loss 1.609613E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.37 | backward: 1723.37 | allreduce: 26.36 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   143400/  300000 | elapsed time per iteration (ms): 2444.4 | learning rate 8.265E-05 | lm loss 1.616890E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.54 | backward: 1720.92 | allreduce: 23.52 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   143500/  300000 | elapsed time per iteration (ms): 2444.5 | learning rate 8.258E-05 | lm loss 1.606322E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.59 | backward: 1721.01 | allreduce: 23.37 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   143600/  300000 | elapsed time per iteration (ms): 2445.4 | learning rate 8.250E-05 | lm loss 1.633315E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.45 | backward: 1722.57 | allreduce: 26.05 | optimizer: 55.23 | batch generator: 0.51 | data loader: 0.04
     iteration   143700/  300000 | elapsed time per iteration (ms): 2443.7 | learning rate 8.242E-05 | lm loss 1.586727E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.40 | backward: 1720.36 | allreduce: 23.44 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   143800/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 8.234E-05 | lm loss 1.597613E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.64 | backward: 1721.28 | allreduce: 23.97 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   143900/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 8.226E-05 | lm loss 1.630505E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.71 | backward: 1724.77 | allreduce: 27.06 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   144000/  300000 | elapsed time per iteration (ms): 2444.2 | learning rate 8.219E-05 | lm loss 1.623394E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.55 | backward: 1720.74 | allreduce: 23.49 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 144000 | LM loss: 1.692763E+00 | LM PPL: 5.434475E+00
    ------------------------------------------------------------------------------------
     iteration   144100/  300000 | elapsed time per iteration (ms): 3047.5 | learning rate 8.211E-05 | lm loss 1.607367E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.60 | backward: 1721.75 | allreduce: 24.22 | optimizer: 55.79 | batch generator: 4.15 | data loader: 3.28
     iteration   144200/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 8.203E-05 | lm loss 1.606489E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.63 | backward: 1723.67 | allreduce: 26.17 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   144300/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 8.195E-05 | lm loss 1.601854E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.67 | backward: 1721.24 | allreduce: 23.76 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   144400/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 8.187E-05 | lm loss 1.593697E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.69 | backward: 1721.40 | allreduce: 23.67 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   144500/  300000 | elapsed time per iteration (ms): 2446.9 | learning rate 8.180E-05 | lm loss 1.632774E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.65 | backward: 1723.34 | allreduce: 26.02 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   144600/  300000 | elapsed time per iteration (ms): 2444.3 | learning rate 8.172E-05 | lm loss 1.618160E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.54 | backward: 1720.85 | allreduce: 23.48 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   144700/  300000 | elapsed time per iteration (ms): 2444.5 | learning rate 8.164E-05 | lm loss 1.599581E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.57 | backward: 1721.00 | allreduce: 23.36 | optimizer: 55.79 | batch generator: 0.52 | data loader: 0.04
     iteration   144800/  300000 | elapsed time per iteration (ms): 2447.1 | learning rate 8.156E-05 | lm loss 1.592536E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.62 | backward: 1723.57 | allreduce: 25.73 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   144900/  300000 | elapsed time per iteration (ms): 2443.2 | learning rate 8.148E-05 | lm loss 1.618822E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.65 | backward: 1720.24 | allreduce: 23.11 | optimizer: 55.23 | batch generator: 0.51 | data loader: 0.04
     iteration   145000/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 8.140E-05 | lm loss 1.594264E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.50 | backward: 1721.58 | allreduce: 24.00 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  145000 to checkpoints/gpt2_750m_2/iter_0145000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0145000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 145000 | LM loss: 1.674432E+00 | LM PPL: 5.335766E+00
    ------------------------------------------------------------------------------------
     iteration   145100/  300000 | elapsed time per iteration (ms): 3112.9 | learning rate 8.133E-05 | lm loss 1.626325E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.37 | backward: 1723.77 | allreduce: 26.70 | optimizer: 55.79 | batch generator: 11.02 | data loader: 10.16
     iteration   145200/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 8.125E-05 | lm loss 1.605936E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.59 | backward: 1721.51 | allreduce: 23.80 | optimizer: 55.79 | batch generator: 0.52 | data loader: 0.04
     iteration   145300/  300000 | elapsed time per iteration (ms): 2444.7 | learning rate 8.117E-05 | lm loss 1.609830E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.66 | backward: 1721.13 | allreduce: 23.43 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   145400/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 8.109E-05 | lm loss 1.595067E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.72 | backward: 1724.26 | allreduce: 26.22 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   145500/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 8.101E-05 | lm loss 1.611733E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.59 | backward: 1721.50 | allreduce: 23.68 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   145600/  300000 | elapsed time per iteration (ms): 2444.3 | learning rate 8.094E-05 | lm loss 1.623356E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.56 | backward: 1720.85 | allreduce: 23.36 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   145700/  300000 | elapsed time per iteration (ms): 2446.9 | learning rate 8.086E-05 | lm loss 1.602042E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.54 | backward: 1723.51 | allreduce: 25.98 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   145800/  300000 | elapsed time per iteration (ms): 2444.9 | learning rate 8.078E-05 | lm loss 1.597441E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.60 | backward: 1721.41 | allreduce: 23.72 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   145900/  300000 | elapsed time per iteration (ms): 2443.3 | learning rate 8.070E-05 | lm loss 1.605063E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.64 | backward: 1720.35 | allreduce: 23.40 | optimizer: 55.24 | batch generator: 0.50 | data loader: 0.04
     iteration   146000/  300000 | elapsed time per iteration (ms): 2446.1 | learning rate 8.062E-05 | lm loss 1.608689E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.59 | backward: 1722.66 | allreduce: 25.32 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 146000 | LM loss: 1.687506E+00 | LM PPL: 5.405981E+00
    ------------------------------------------------------------------------------------
     iteration   146100/  300000 | elapsed time per iteration (ms): 3043.2 | learning rate 8.054E-05 | lm loss 1.605295E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.65 | backward: 1720.58 | allreduce: 23.24 | optimizer: 55.79 | batch generator: 0.93 | data loader: 0.08
     iteration   146200/  300000 | elapsed time per iteration (ms): 2443.9 | learning rate 8.047E-05 | lm loss 1.589370E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.69 | backward: 1720.35 | allreduce: 22.81 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   146300/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 8.039E-05 | lm loss 1.626626E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.61 | backward: 1723.67 | allreduce: 26.16 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   146400/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 8.031E-05 | lm loss 1.610443E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.82 | backward: 1721.25 | allreduce: 23.24 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   146500/  300000 | elapsed time per iteration (ms): 2444.3 | learning rate 8.023E-05 | lm loss 1.603022E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.81 | backward: 1720.63 | allreduce: 22.82 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   146600/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 8.015E-05 | lm loss 1.602652E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.78 | backward: 1723.54 | allreduce: 25.71 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   146700/  300000 | elapsed time per iteration (ms): 2444.3 | learning rate 8.007E-05 | lm loss 1.609592E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.87 | backward: 1720.58 | allreduce: 22.83 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   146800/  300000 | elapsed time per iteration (ms): 2445.1 | learning rate 8.000E-05 | lm loss 1.630102E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.71 | backward: 1721.51 | allreduce: 23.77 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   146900/  300000 | elapsed time per iteration (ms): 2447.8 | learning rate 7.992E-05 | lm loss 1.619039E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.68 | backward: 1724.27 | allreduce: 26.22 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   147000/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 7.984E-05 | lm loss 1.603241E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.78 | backward: 1721.13 | allreduce: 23.13 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 147000 | LM loss: 1.679711E+00 | LM PPL: 5.364007E+00
    ------------------------------------------------------------------------------------
     iteration   147100/  300000 | elapsed time per iteration (ms): 3045.3 | learning rate 7.976E-05 | lm loss 1.608877E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.64 | backward: 1721.46 | allreduce: 23.66 | optimizer: 55.79 | batch generator: 2.02 | data loader: 1.15
     iteration   147200/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 7.968E-05 | lm loss 1.609911E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.62 | backward: 1724.49 | allreduce: 26.53 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   147300/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 7.960E-05 | lm loss 1.625765E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.52 | backward: 1721.56 | allreduce: 23.96 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   147400/  300000 | elapsed time per iteration (ms): 2444.7 | learning rate 7.953E-05 | lm loss 1.604193E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.65 | backward: 1721.16 | allreduce: 23.63 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   147500/  300000 | elapsed time per iteration (ms): 2446.3 | learning rate 7.945E-05 | lm loss 1.602965E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.64 | backward: 1723.32 | allreduce: 26.28 | optimizer: 55.23 | batch generator: 0.51 | data loader: 0.04
     iteration   147600/  300000 | elapsed time per iteration (ms): 2445.1 | learning rate 7.937E-05 | lm loss 1.605377E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.80 | backward: 1721.42 | allreduce: 23.52 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   147700/  300000 | elapsed time per iteration (ms): 2444.4 | learning rate 7.929E-05 | lm loss 1.609738E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.67 | backward: 1720.79 | allreduce: 22.85 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   147800/  300000 | elapsed time per iteration (ms): 2447.4 | learning rate 7.921E-05 | lm loss 1.608004E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.63 | backward: 1723.91 | allreduce: 26.04 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   147900/  300000 | elapsed time per iteration (ms): 2445.6 | learning rate 7.913E-05 | lm loss 1.616582E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.67 | backward: 1722.04 | allreduce: 24.19 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.05
     iteration   148000/  300000 | elapsed time per iteration (ms): 2445.5 | learning rate 7.906E-05 | lm loss 1.617014E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.83 | backward: 1721.76 | allreduce: 23.68 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 148000 | LM loss: 1.678962E+00 | LM PPL: 5.359992E+00
    ------------------------------------------------------------------------------------
     iteration   148100/  300000 | elapsed time per iteration (ms): 3068.1 | learning rate 7.898E-05 | lm loss 1.602351E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.82 | backward: 1724.22 | allreduce: 25.94 | optimizer: 55.79 | batch generator: 22.02 | data loader: 21.14
     iteration   148200/  300000 | elapsed time per iteration (ms): 2445.6 | learning rate 7.890E-05 | lm loss 1.592080E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.81 | backward: 1721.94 | allreduce: 23.58 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.05
     iteration   148300/  300000 | elapsed time per iteration (ms): 2444.9 | learning rate 7.882E-05 | lm loss 1.613767E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.76 | backward: 1721.21 | allreduce: 23.23 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.05
     iteration   148400/  300000 | elapsed time per iteration (ms): 2446.8 | learning rate 7.874E-05 | lm loss 1.599499E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.62 | backward: 1723.25 | allreduce: 25.36 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   148500/  300000 | elapsed time per iteration (ms): 2443.5 | learning rate 7.866E-05 | lm loss 1.613387E+00 | loss scale 2097152.0 |
    time (ms) | forward: 667.63 | backward: 1720.49 | allreduce: 23.18 | optimizer: 55.22 | batch generator: 0.50 | data loader: 0.04
     iteration   148600/  300000 | elapsed time per iteration (ms): 2443.2 | learning rate 7.859E-05 | lm loss 1.619367E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.84 | backward: 1720.59 | allreduce: 23.46 | optimizer: 54.67 | batch generator: 0.50 | data loader: 0.04
     iteration   148700/  300000 | elapsed time per iteration (ms): 2447.8 | learning rate 7.851E-05 | lm loss 1.602043E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.93 | backward: 1723.98 | allreduce: 25.81 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   148800/  300000 | elapsed time per iteration (ms): 2444.1 | learning rate 7.843E-05 | lm loss 1.605035E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.75 | backward: 1720.51 | allreduce: 22.87 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   148900/  300000 | elapsed time per iteration (ms): 2444.7 | learning rate 7.835E-05 | lm loss 1.595149E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.70 | backward: 1721.13 | allreduce: 23.51 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   149000/  300000 | elapsed time per iteration (ms): 2447.6 | learning rate 7.827E-05 | lm loss 1.627364E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.78 | backward: 1723.94 | allreduce: 25.95 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 149000 | LM loss: 1.676335E+00 | LM PPL: 5.345925E+00
    ------------------------------------------------------------------------------------
     iteration   149100/  300000 | elapsed time per iteration (ms): 3049.6 | learning rate 7.820E-05 | lm loss 1.591786E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.81 | backward: 1720.75 | allreduce: 23.06 | optimizer: 55.79 | batch generator: 6.52 | data loader: 5.66
     iteration   149200/  300000 | elapsed time per iteration (ms): 2446.2 | learning rate 7.812E-05 | lm loss 1.606299E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.98 | backward: 1722.32 | allreduce: 23.90 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   149300/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 7.804E-05 | lm loss 1.614709E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.87 | backward: 1723.32 | allreduce: 26.18 | optimizer: 55.23 | batch generator: 0.51 | data loader: 0.04
     iteration   149400/  300000 | elapsed time per iteration (ms): 2443.8 | learning rate 7.796E-05 | lm loss 1.598603E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.71 | backward: 1720.16 | allreduce: 23.07 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   149500/  300000 | elapsed time per iteration (ms): 2444.2 | learning rate 7.788E-05 | lm loss 1.610622E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.74 | backward: 1720.55 | allreduce: 23.36 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   149600/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 7.780E-05 | lm loss 1.568598E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.60 | backward: 1723.18 | allreduce: 26.15 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   149700/  300000 | elapsed time per iteration (ms): 2444.4 | learning rate 7.773E-05 | lm loss 1.613363E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.72 | backward: 1720.77 | allreduce: 23.66 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   149800/  300000 | elapsed time per iteration (ms): 2444.6 | learning rate 7.765E-05 | lm loss 1.600616E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.81 | backward: 1720.88 | allreduce: 23.46 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   149900/  300000 | elapsed time per iteration (ms): 2447.1 | learning rate 7.757E-05 | lm loss 1.609781E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.66 | backward: 1723.51 | allreduce: 26.35 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.04
     iteration   150000/  300000 | elapsed time per iteration (ms): 2444.3 | learning rate 7.749E-05 | lm loss 1.612223E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.82 | backward: 1720.63 | allreduce: 23.39 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  150000 to checkpoints/gpt2_750m_2/iter_0150000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0150000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 150000 | LM loss: 1.643345E+00 | LM PPL: 5.172442E+00
    ------------------------------------------------------------------------------------
     iteration   150100/  300000 | elapsed time per iteration (ms): 3100.6 | learning rate 7.741E-05 | lm loss 1.599773E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.63 | backward: 1722.69 | allreduce: 25.75 | optimizer: 55.78 | batch generator: 0.93 | data loader: 0.08
     iteration   150200/  300000 | elapsed time per iteration (ms): 2444.4 | learning rate 7.733E-05 | lm loss 1.610615E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.71 | backward: 1720.81 | allreduce: 23.55 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   150300/  300000 | elapsed time per iteration (ms): 2444.2 | learning rate 7.725E-05 | lm loss 1.619339E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.69 | backward: 1720.60 | allreduce: 23.33 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   150400/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 7.718E-05 | lm loss 1.600201E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1723.50 | allreduce: 25.58 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   150500/  300000 | elapsed time per iteration (ms): 2445.5 | learning rate 7.710E-05 | lm loss 1.611155E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.80 | backward: 1721.76 | allreduce: 23.93 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   150600/  300000 | elapsed time per iteration (ms): 2446.1 | learning rate 7.702E-05 | lm loss 1.606558E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.86 | backward: 1722.37 | allreduce: 24.18 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   150700/  300000 | elapsed time per iteration (ms): 2446.9 | learning rate 7.694E-05 | lm loss 1.598593E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1723.23 | allreduce: 25.48 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   150800/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 7.686E-05 | lm loss 1.612386E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.68 | backward: 1721.45 | allreduce: 23.57 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   150900/  300000 | elapsed time per iteration (ms): 2444.1 | learning rate 7.678E-05 | lm loss 1.592682E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.64 | backward: 1720.59 | allreduce: 23.17 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   151000/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 7.670E-05 | lm loss 1.605303E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.58 | backward: 1723.21 | allreduce: 25.95 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 151000 | LM loss: 1.667207E+00 | LM PPL: 5.297353E+00
    ------------------------------------------------------------------------------------
     iteration   151100/  300000 | elapsed time per iteration (ms): 3043.4 | learning rate 7.663E-05 | lm loss 1.615890E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.63 | backward: 1720.57 | allreduce: 23.17 | optimizer: 55.78 | batch generator: 0.93 | data loader: 0.08
     iteration   151200/  300000 | elapsed time per iteration (ms): 2443.2 | learning rate 7.655E-05 | lm loss 1.601953E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.42 | backward: 1719.89 | allreduce: 22.82 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   151300/  300000 | elapsed time per iteration (ms): 2446.8 | learning rate 7.647E-05 | lm loss 1.582508E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.64 | backward: 1723.25 | allreduce: 25.57 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   151400/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 7.639E-05 | lm loss 1.618718E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.84 | backward: 1721.25 | allreduce: 23.12 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   151500/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 7.631E-05 | lm loss 1.605569E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.67 | backward: 1721.22 | allreduce: 23.40 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   151600/  300000 | elapsed time per iteration (ms): 2445.5 | learning rate 7.624E-05 | lm loss 1.601622E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.72 | backward: 1722.99 | allreduce: 26.36 | optimizer: 54.66 | batch generator: 0.51 | data loader: 0.04
     iteration   151700/  300000 | elapsed time per iteration (ms): 2444.6 | learning rate 7.616E-05 | lm loss 1.588381E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.78 | backward: 1720.91 | allreduce: 23.24 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   151800/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 7.608E-05 | lm loss 1.597310E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1721.10 | allreduce: 23.14 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   151900/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 7.600E-05 | lm loss 1.586362E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.65 | backward: 1722.93 | allreduce: 25.37 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   152000/  300000 | elapsed time per iteration (ms): 2444.6 | learning rate 7.592E-05 | lm loss 1.598511E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.70 | backward: 1720.96 | allreduce: 23.20 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.05
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 152000 | LM loss: 1.668991E+00 | LM PPL: 5.306811E+00
    ------------------------------------------------------------------------------------
     iteration   152100/  300000 | elapsed time per iteration (ms): 3071.8 | learning rate 7.584E-05 | lm loss 1.583089E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.68 | backward: 1722.20 | allreduce: 24.60 | optimizer: 55.77 | batch generator: 28.19 | data loader: 27.32
     iteration   152200/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 7.576E-05 | lm loss 1.582883E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.77 | backward: 1724.23 | allreduce: 26.29 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   152300/  300000 | elapsed time per iteration (ms): 2444.9 | learning rate 7.569E-05 | lm loss 1.622689E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.74 | backward: 1721.19 | allreduce: 23.45 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   152400/  300000 | elapsed time per iteration (ms): 2445.9 | learning rate 7.561E-05 | lm loss 1.603352E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.94 | backward: 1721.96 | allreduce: 23.44 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   152500/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 7.553E-05 | lm loss 1.607800E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.81 | backward: 1724.98 | allreduce: 26.98 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   152600/  300000 | elapsed time per iteration (ms): 2445.8 | learning rate 7.545E-05 | lm loss 1.592245E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.97 | backward: 1721.85 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   152700/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 7.537E-05 | lm loss 1.615442E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.76 | backward: 1721.98 | allreduce: 23.79 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   152800/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 7.529E-05 | lm loss 1.605245E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.64 | backward: 1724.96 | allreduce: 26.82 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   152900/  300000 | elapsed time per iteration (ms): 2446.2 | learning rate 7.521E-05 | lm loss 1.600901E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.77 | backward: 1722.43 | allreduce: 24.25 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   153000/  300000 | elapsed time per iteration (ms): 2444.0 | learning rate 7.514E-05 | lm loss 1.590296E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.77 | backward: 1720.78 | allreduce: 23.37 | optimizer: 55.22 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 153000 | LM loss: 1.685126E+00 | LM PPL: 5.393132E+00
    ------------------------------------------------------------------------------------
     iteration   153100/  300000 | elapsed time per iteration (ms): 3066.8 | learning rate 7.506E-05 | lm loss 1.604385E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.68 | backward: 1724.59 | allreduce: 26.67 | optimizer: 55.78 | batch generator: 20.45 | data loader: 19.59
     iteration   153200/  300000 | elapsed time per iteration (ms): 2445.5 | learning rate 7.498E-05 | lm loss 1.592443E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.85 | backward: 1721.73 | allreduce: 23.31 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   153300/  300000 | elapsed time per iteration (ms): 2446.1 | learning rate 7.490E-05 | lm loss 1.560168E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.83 | backward: 1722.31 | allreduce: 23.99 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   153400/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 7.482E-05 | lm loss 1.581269E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.48 | backward: 1724.29 | allreduce: 26.69 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   153500/  300000 | elapsed time per iteration (ms): 2444.9 | learning rate 7.474E-05 | lm loss 1.585845E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.55 | backward: 1721.41 | allreduce: 23.93 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   153600/  300000 | elapsed time per iteration (ms): 2445.2 | learning rate 7.467E-05 | lm loss 1.582289E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.78 | backward: 1721.49 | allreduce: 23.30 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   153700/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 7.459E-05 | lm loss 1.599310E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.71 | backward: 1724.09 | allreduce: 26.03 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   153800/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 7.451E-05 | lm loss 1.601644E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.69 | backward: 1722.04 | allreduce: 23.91 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   153900/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 7.443E-05 | lm loss 1.595514E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.70 | backward: 1722.01 | allreduce: 24.01 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   154000/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 7.435E-05 | lm loss 1.593261E+00 | loss scale 2097152.0 |
    time (ms) | forward: 667.68 | backward: 1723.40 | allreduce: 25.89 | optimizer: 55.21 | batch generator: 0.49 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 154000 | LM loss: 1.656932E+00 | LM PPL: 5.243201E+00
    ------------------------------------------------------------------------------------
     iteration   154100/  300000 | elapsed time per iteration (ms): 3053.4 | learning rate 7.427E-05 | lm loss 1.588524E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.76 | backward: 1721.87 | allreduce: 24.22 | optimizer: 55.22 | batch generator: 9.84 | data loader: 8.99
     iteration   154200/  300000 | elapsed time per iteration (ms): 2445.3 | learning rate 7.420E-05 | lm loss 1.601108E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.83 | backward: 1721.49 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   154300/  300000 | elapsed time per iteration (ms): 2448.9 | learning rate 7.412E-05 | lm loss 1.598652E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.76 | backward: 1725.18 | allreduce: 26.75 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   154400/  300000 | elapsed time per iteration (ms): 2446.4 | learning rate 7.404E-05 | lm loss 1.577140E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.76 | backward: 1722.70 | allreduce: 24.31 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   154500/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 7.396E-05 | lm loss 1.594070E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.85 | backward: 1722.86 | allreduce: 24.33 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   154600/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 7.388E-05 | lm loss 1.600072E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.69 | backward: 1725.17 | allreduce: 27.22 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   154700/  300000 | elapsed time per iteration (ms): 2446.4 | learning rate 7.380E-05 | lm loss 1.596031E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.70 | backward: 1722.75 | allreduce: 24.59 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   154800/  300000 | elapsed time per iteration (ms): 2446.4 | learning rate 7.372E-05 | lm loss 1.600903E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.74 | backward: 1722.70 | allreduce: 24.51 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   154900/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 7.365E-05 | lm loss 1.593926E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.63 | backward: 1725.09 | allreduce: 27.14 | optimizer: 55.79 | batch generator: 0.49 | data loader: 0.04
     iteration   155000/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 7.357E-05 | lm loss 1.582771E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.72 | backward: 1722.02 | allreduce: 23.91 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  155000 to checkpoints/gpt2_750m_2/iter_0155000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0155000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 155000 | LM loss: 1.667142E+00 | LM PPL: 5.297005E+00
    ------------------------------------------------------------------------------------
     iteration   155100/  300000 | elapsed time per iteration (ms): 3098.7 | learning rate 7.349E-05 | lm loss 1.599153E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.55 | backward: 1723.39 | allreduce: 26.70 | optimizer: 54.67 | batch generator: 0.94 | data loader: 0.07
     iteration   155200/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 7.341E-05 | lm loss 1.588849E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.92 | backward: 1722.78 | allreduce: 24.33 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   155300/  300000 | elapsed time per iteration (ms): 2446.1 | learning rate 7.333E-05 | lm loss 1.585094E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.60 | backward: 1722.57 | allreduce: 24.57 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   155400/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 7.326E-05 | lm loss 1.620869E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.49 | backward: 1724.70 | allreduce: 27.00 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   155500/  300000 | elapsed time per iteration (ms): 2446.0 | learning rate 7.318E-05 | lm loss 1.587861E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.67 | backward: 1722.40 | allreduce: 24.52 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   155600/  300000 | elapsed time per iteration (ms): 2446.3 | learning rate 7.310E-05 | lm loss 1.592913E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.75 | backward: 1722.64 | allreduce: 24.39 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   155700/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 7.302E-05 | lm loss 1.587246E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.62 | backward: 1724.76 | allreduce: 26.74 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   155800/  300000 | elapsed time per iteration (ms): 2446.0 | learning rate 7.294E-05 | lm loss 1.566657E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.75 | backward: 1722.25 | allreduce: 24.13 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   155900/  300000 | elapsed time per iteration (ms): 2445.3 | learning rate 7.286E-05 | lm loss 1.581643E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.65 | backward: 1721.72 | allreduce: 23.57 | optimizer: 55.79 | batch generator: 0.49 | data loader: 0.04
     iteration   156000/  300000 | elapsed time per iteration (ms): 2447.4 | learning rate 7.278E-05 | lm loss 1.607814E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.70 | backward: 1723.75 | allreduce: 25.95 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 156000 | LM loss: 1.666989E+00 | LM PPL: 5.296198E+00
    ------------------------------------------------------------------------------------
     iteration   156100/  300000 | elapsed time per iteration (ms): 3047.4 | learning rate 7.271E-05 | lm loss 1.591112E+00 | loss scale 2097152.0 |
    time (ms) | forward: 667.86 | backward: 1721.78 | allreduce: 23.30 | optimizer: 55.78 | batch generator: 3.20 | data loader: 2.34
     iteration   156200/  300000 | elapsed time per iteration (ms): 2443.9 | learning rate 7.263E-05 | lm loss 1.586876E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.79 | backward: 1721.28 | allreduce: 24.16 | optimizer: 54.67 | batch generator: 0.50 | data loader: 0.04
     iteration   156300/  300000 | elapsed time per iteration (ms): 2446.0 | learning rate 7.255E-05 | lm loss 1.596918E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.91 | backward: 1722.14 | allreduce: 23.39 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   156400/  300000 | elapsed time per iteration (ms): 2449.8 | learning rate 7.247E-05 | lm loss 1.585771E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1725.73 | allreduce: 26.53 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   156500/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 7.239E-05 | lm loss 1.613570E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.05 | backward: 1722.64 | allreduce: 23.69 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   156600/  300000 | elapsed time per iteration (ms): 2445.6 | learning rate 7.231E-05 | lm loss 1.592762E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.82 | backward: 1721.77 | allreduce: 23.31 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   156700/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 7.224E-05 | lm loss 1.591329E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.89 | backward: 1724.25 | allreduce: 26.30 | optimizer: 55.22 | batch generator: 0.50 | data loader: 0.04
     iteration   156800/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 7.216E-05 | lm loss 1.593310E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1722.72 | allreduce: 24.54 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   156900/  300000 | elapsed time per iteration (ms): 2445.3 | learning rate 7.208E-05 | lm loss 1.585589E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.90 | backward: 1722.02 | allreduce: 24.30 | optimizer: 55.22 | batch generator: 0.49 | data loader: 0.04
     iteration   157000/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 7.200E-05 | lm loss 1.595329E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.07 | backward: 1724.50 | allreduce: 26.28 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 157000 | LM loss: 1.680420E+00 | LM PPL: 5.367808E+00
    ------------------------------------------------------------------------------------
     iteration   157100/  300000 | elapsed time per iteration (ms): 3047.5 | learning rate 7.192E-05 | lm loss 1.581081E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.96 | backward: 1722.89 | allreduce: 24.83 | optimizer: 55.78 | batch generator: 2.12 | data loader: 1.27
     iteration   157200/  300000 | elapsed time per iteration (ms): 2446.2 | learning rate 7.185E-05 | lm loss 1.581088E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.89 | backward: 1722.35 | allreduce: 24.48 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   157300/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 7.177E-05 | lm loss 1.591513E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.12 | backward: 1724.24 | allreduce: 25.82 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   157400/  300000 | elapsed time per iteration (ms): 2445.2 | learning rate 7.169E-05 | lm loss 1.572337E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.95 | backward: 1721.27 | allreduce: 23.33 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   157500/  300000 | elapsed time per iteration (ms): 2445.0 | learning rate 7.161E-05 | lm loss 1.586519E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.92 | backward: 1721.11 | allreduce: 23.30 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   157600/  300000 | elapsed time per iteration (ms): 2447.3 | learning rate 7.153E-05 | lm loss 1.589717E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.80 | backward: 1723.58 | allreduce: 26.17 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   157700/  300000 | elapsed time per iteration (ms): 2446.0 | learning rate 7.145E-05 | lm loss 1.598645E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.99 | backward: 1722.07 | allreduce: 23.84 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   157800/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 7.137E-05 | lm loss 1.613112E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.11 | backward: 1722.65 | allreduce: 24.02 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   157900/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 7.130E-05 | lm loss 1.550234E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.00 | backward: 1724.48 | allreduce: 26.21 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   158000/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 7.122E-05 | lm loss 1.592085E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.15 | backward: 1722.43 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 158000 | LM loss: 1.650805E+00 | LM PPL: 5.211173E+00
    ------------------------------------------------------------------------------------
     iteration   158100/  300000 | elapsed time per iteration (ms): 3053.0 | learning rate 7.114E-05 | lm loss 1.592870E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1722.77 | allreduce: 23.89 | optimizer: 55.78 | batch generator: 7.66 | data loader: 6.80
     iteration   158200/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 7.106E-05 | lm loss 1.588595E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1725.85 | allreduce: 26.89 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   158300/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 7.098E-05 | lm loss 1.603143E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.06 | backward: 1723.65 | allreduce: 24.86 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   158400/  300000 | elapsed time per iteration (ms): 2445.8 | learning rate 7.090E-05 | lm loss 1.587995E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.89 | backward: 1721.93 | allreduce: 23.57 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   158500/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 7.083E-05 | lm loss 1.597041E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.98 | backward: 1724.38 | allreduce: 25.80 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   158600/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 7.075E-05 | lm loss 1.555439E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.05 | backward: 1723.50 | allreduce: 24.64 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   158700/  300000 | elapsed time per iteration (ms): 2446.0 | learning rate 7.067E-05 | lm loss 1.582020E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.02 | backward: 1722.55 | allreduce: 24.34 | optimizer: 55.22 | batch generator: 0.50 | data loader: 0.04
     iteration   158800/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 7.059E-05 | lm loss 1.603966E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.88 | backward: 1724.22 | allreduce: 25.92 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   158900/  300000 | elapsed time per iteration (ms): 2446.9 | learning rate 7.051E-05 | lm loss 1.584280E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.96 | backward: 1722.95 | allreduce: 24.50 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   159000/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 7.043E-05 | lm loss 1.574888E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.85 | backward: 1722.86 | allreduce: 24.44 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 159000 | LM loss: 1.676126E+00 | LM PPL: 5.344810E+00
    ------------------------------------------------------------------------------------
     iteration   159100/  300000 | elapsed time per iteration (ms): 3075.6 | learning rate 7.036E-05 | lm loss 1.577307E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1725.38 | allreduce: 27.12 | optimizer: 55.78 | batch generator: 27.71 | data loader: 26.84
     iteration   159200/  300000 | elapsed time per iteration (ms): 2446.8 | learning rate 7.028E-05 | lm loss 1.568375E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.00 | backward: 1722.87 | allreduce: 24.16 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   159300/  300000 | elapsed time per iteration (ms): 2445.7 | learning rate 7.020E-05 | lm loss 1.600192E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.93 | backward: 1721.78 | allreduce: 23.60 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   159400/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 7.012E-05 | lm loss 1.607308E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.00 | backward: 1724.39 | allreduce: 25.86 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   159500/  300000 | elapsed time per iteration (ms): 2447.1 | learning rate 7.004E-05 | lm loss 1.547833E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.04 | backward: 1723.10 | allreduce: 24.33 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   159600/  300000 | elapsed time per iteration (ms): 2447.6 | learning rate 6.996E-05 | lm loss 1.586419E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.13 | backward: 1723.49 | allreduce: 24.84 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   159700/  300000 | elapsed time per iteration (ms): 2449.8 | learning rate 6.989E-05 | lm loss 1.575684E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.01 | backward: 1725.85 | allreduce: 27.07 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   159800/  300000 | elapsed time per iteration (ms): 2447.1 | learning rate 6.981E-05 | lm loss 1.593601E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.05 | backward: 1723.12 | allreduce: 24.30 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   159900/  300000 | elapsed time per iteration (ms): 2446.8 | learning rate 6.973E-05 | lm loss 1.575129E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.02 | backward: 1722.88 | allreduce: 24.08 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   160000/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 6.965E-05 | lm loss 1.571174E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.07 | backward: 1724.49 | allreduce: 25.58 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  160000 to checkpoints/gpt2_750m_2/iter_0160000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0160000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 160000 | LM loss: 1.659122E+00 | LM PPL: 5.254695E+00
    ------------------------------------------------------------------------------------
     iteration   160100/  300000 | elapsed time per iteration (ms): 3111.0 | learning rate 6.957E-05 | lm loss 1.581199E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.01 | backward: 1723.02 | allreduce: 24.51 | optimizer: 55.79 | batch generator: 11.60 | data loader: 10.73
     iteration   160200/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 6.949E-05 | lm loss 1.599547E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.03 | backward: 1725.73 | allreduce: 26.58 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   160300/  300000 | elapsed time per iteration (ms): 2446.8 | learning rate 6.942E-05 | lm loss 1.601519E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.87 | backward: 1722.92 | allreduce: 24.22 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   160400/  300000 | elapsed time per iteration (ms): 2446.5 | learning rate 6.934E-05 | lm loss 1.627843E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.08 | backward: 1722.45 | allreduce: 23.32 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   160500/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 6.926E-05 | lm loss 1.601131E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.08 | backward: 1725.11 | allreduce: 26.68 | optimizer: 55.23 | batch generator: 0.50 | data loader: 0.04
     iteration   160600/  300000 | elapsed time per iteration (ms): 2446.9 | learning rate 6.918E-05 | lm loss 1.584797E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.83 | backward: 1723.05 | allreduce: 24.49 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   160700/  300000 | elapsed time per iteration (ms): 2447.4 | learning rate 6.910E-05 | lm loss 1.600839E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.01 | backward: 1723.43 | allreduce: 24.60 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   160800/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 6.902E-05 | lm loss 1.584355E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.03 | backward: 1726.58 | allreduce: 27.69 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   160900/  300000 | elapsed time per iteration (ms): 2446.2 | learning rate 6.895E-05 | lm loss 1.597951E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.94 | backward: 1722.33 | allreduce: 23.59 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   161000/  300000 | elapsed time per iteration (ms): 2445.3 | learning rate 6.887E-05 | lm loss 1.598395E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.08 | backward: 1721.83 | allreduce: 23.30 | optimizer: 55.22 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 161000 | LM loss: 1.663699E+00 | LM PPL: 5.278803E+00
    ------------------------------------------------------------------------------------
     iteration   161100/  300000 | elapsed time per iteration (ms): 3069.6 | learning rate 6.879E-05 | lm loss 1.596302E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.98 | backward: 1725.05 | allreduce: 26.46 | optimizer: 55.79 | batch generator: 22.47 | data loader: 21.61
     iteration   161200/  300000 | elapsed time per iteration (ms): 2447.6 | learning rate 6.871E-05 | lm loss 1.595502E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.10 | backward: 1723.51 | allreduce: 24.67 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   161300/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 6.863E-05 | lm loss 1.598881E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1723.87 | allreduce: 24.66 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   161400/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 6.856E-05 | lm loss 1.592041E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.11 | backward: 1725.96 | allreduce: 27.13 | optimizer: 55.80 | batch generator: 0.50 | data loader: 0.04
     iteration   161500/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 6.848E-05 | lm loss 1.572585E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1723.82 | allreduce: 24.55 | optimizer: 55.78 | batch generator: 0.51 | data loader: 0.04
     iteration   161600/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 6.840E-05 | lm loss 1.613678E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1724.32 | allreduce: 24.86 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   161700/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 6.832E-05 | lm loss 1.592239E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.35 | backward: 1726.33 | allreduce: 26.76 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   161800/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 6.824E-05 | lm loss 1.601647E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1724.54 | allreduce: 24.33 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   161900/  300000 | elapsed time per iteration (ms): 2449.5 | learning rate 6.816E-05 | lm loss 1.582050E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1725.02 | allreduce: 24.83 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   162000/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 6.809E-05 | lm loss 1.598206E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.51 | backward: 1727.57 | allreduce: 26.87 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 162000 | LM loss: 1.664676E+00 | LM PPL: 5.283962E+00
    ------------------------------------------------------------------------------------
     iteration   162100/  300000 | elapsed time per iteration (ms): 3050.3 | learning rate 6.801E-05 | lm loss 1.603024E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.68 | backward: 1725.65 | allreduce: 24.41 | optimizer: 55.79 | batch generator: 1.00 | data loader: 0.14
     iteration   162200/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 6.793E-05 | lm loss 1.598774E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1724.76 | allreduce: 24.66 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   162300/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 6.785E-05 | lm loss 1.579347E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1727.47 | allreduce: 27.16 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.04
     iteration   162400/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 6.777E-05 | lm loss 1.597971E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.66 | backward: 1725.57 | allreduce: 24.56 | optimizer: 55.79 | batch generator: 0.51 | data loader: 0.05
     iteration   162500/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 6.770E-05 | lm loss 1.582781E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.73 | backward: 1725.35 | allreduce: 24.08 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.05
     iteration   162600/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 6.762E-05 | lm loss 1.582565E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1726.53 | allreduce: 26.87 | optimizer: 55.25 | batch generator: 0.51 | data loader: 0.05
     iteration   162700/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 6.754E-05 | lm loss 1.595154E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.45 | backward: 1724.07 | allreduce: 23.95 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.05
     iteration   162800/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 6.746E-05 | lm loss 1.607597E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.63 | backward: 1724.76 | allreduce: 23.77 | optimizer: 55.79 | batch generator: 0.52 | data loader: 0.05
     iteration   162900/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 6.738E-05 | lm loss 1.585512E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.67 | backward: 1727.75 | allreduce: 26.64 | optimizer: 55.80 | batch generator: 0.51 | data loader: 0.05
     iteration   163000/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 6.731E-05 | lm loss 1.586240E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.47 | backward: 1723.62 | allreduce: 23.94 | optimizer: 55.24 | batch generator: 0.51 | data loader: 0.05
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 163000 | LM loss: 1.668265E+00 | LM PPL: 5.302961E+00
    ------------------------------------------------------------------------------------
     iteration   163100/  300000 | elapsed time per iteration (ms): 3063.0 | learning rate 6.723E-05 | lm loss 1.601572E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.68 | backward: 1725.37 | allreduce: 24.69 | optimizer: 55.78 | batch generator: 13.30 | data loader: 12.43
     iteration   163200/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 6.715E-05 | lm loss 1.558159E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.58 | backward: 1727.46 | allreduce: 26.88 | optimizer: 55.79 | batch generator: 0.50 | data loader: 0.04
     iteration   163300/  300000 | elapsed time per iteration (ms): 2449.5 | learning rate 6.707E-05 | lm loss 1.583574E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.70 | backward: 1724.82 | allreduce: 23.81 | optimizer: 55.79 | batch generator: 0.49 | data loader: 0.04
     iteration   163400/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 6.699E-05 | lm loss 1.600803E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.56 | backward: 1725.53 | allreduce: 24.78 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   163500/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 6.692E-05 | lm loss 1.564504E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1727.22 | allreduce: 27.40 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   163600/  300000 | elapsed time per iteration (ms): 2448.9 | learning rate 6.684E-05 | lm loss 1.602377E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.58 | backward: 1724.34 | allreduce: 23.61 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   163700/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 6.676E-05 | lm loss 1.597099E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.84 | backward: 1725.73 | allreduce: 24.00 | optimizer: 55.78 | batch generator: 0.50 | data loader: 0.04
     iteration   163800/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 6.668E-05 | lm loss 1.578363E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.61 | backward: 1727.93 | allreduce: 27.12 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   163900/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 6.660E-05 | lm loss 1.588770E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.62 | backward: 1724.79 | allreduce: 24.01 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   164000/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 6.653E-05 | lm loss 1.609863E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.37 | backward: 1726.90 | allreduce: 23.39 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 164000 | LM loss: 1.659471E+00 | LM PPL: 5.256530E+00
    ------------------------------------------------------------------------------------
     iteration   164100/  300000 | elapsed time per iteration (ms): 3071.4 | learning rate 6.645E-05 | lm loss 1.586630E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.49 | backward: 1729.93 | allreduce: 26.45 | optimizer: 55.77 | batch generator: 16.62 | data loader: 15.75
     iteration   164200/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 6.637E-05 | lm loss 1.591125E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.54 | backward: 1726.28 | allreduce: 23.40 | optimizer: 55.21 | batch generator: 0.49 | data loader: 0.04
     iteration   164300/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 6.629E-05 | lm loss 1.589425E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.30 | backward: 1726.31 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   164400/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 6.622E-05 | lm loss 1.602999E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.22 | backward: 1728.23 | allreduce: 26.21 | optimizer: 55.21 | batch generator: 0.50 | data loader: 0.04
     iteration   164500/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 6.614E-05 | lm loss 1.562660E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.86 | backward: 1726.47 | allreduce: 24.66 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   164600/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 6.606E-05 | lm loss 1.590169E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.57 | backward: 1727.96 | allreduce: 24.46 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   164700/  300000 | elapsed time per iteration (ms): 2455.8 | learning rate 6.598E-05 | lm loss 1.598503E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.54 | backward: 1730.29 | allreduce: 27.05 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   164800/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 6.590E-05 | lm loss 1.601857E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.88 | backward: 1727.75 | allreduce: 23.41 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   164900/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 6.583E-05 | lm loss 1.587811E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.80 | backward: 1727.60 | allreduce: 23.43 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   165000/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 6.575E-05 | lm loss 1.580309E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.51 | backward: 1729.13 | allreduce: 25.94 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  165000 to checkpoints/gpt2_750m_2/iter_0165000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0165000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 165000 | LM loss: 1.648357E+00 | LM PPL: 5.198433E+00
    ------------------------------------------------------------------------------------
     iteration   165100/  300000 | elapsed time per iteration (ms): 3119.0 | learning rate 6.567E-05 | lm loss 1.612360E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.98 | backward: 1727.17 | allreduce: 24.24 | optimizer: 55.77 | batch generator: 12.64 | data loader: 11.77
     iteration   165200/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 6.559E-05 | lm loss 1.586564E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.50 | backward: 1728.97 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   165300/  300000 | elapsed time per iteration (ms): 2456.8 | learning rate 6.551E-05 | lm loss 1.577861E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.37 | backward: 1731.52 | allreduce: 27.34 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   165400/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 6.544E-05 | lm loss 1.583582E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.19 | backward: 1727.90 | allreduce: 24.45 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   165500/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 6.536E-05 | lm loss 1.587984E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.40 | backward: 1727.84 | allreduce: 23.56 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   165600/  300000 | elapsed time per iteration (ms): 2455.8 | learning rate 6.528E-05 | lm loss 1.597776E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.21 | backward: 1730.67 | allreduce: 26.70 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   165700/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 6.520E-05 | lm loss 1.589862E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.89 | backward: 1727.98 | allreduce: 24.79 | optimizer: 55.78 | batch generator: 0.49 | data loader: 0.04
     iteration   165800/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 6.512E-05 | lm loss 1.576640E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.05 | backward: 1728.30 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   165900/  300000 | elapsed time per iteration (ms): 2456.1 | learning rate 6.505E-05 | lm loss 1.609063E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.29 | backward: 1730.88 | allreduce: 26.90 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   166000/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 6.497E-05 | lm loss 1.587865E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.98 | backward: 1727.96 | allreduce: 24.48 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 166000 | LM loss: 1.649485E+00 | LM PPL: 5.204301E+00
    ------------------------------------------------------------------------------------
     iteration   166100/  300000 | elapsed time per iteration (ms): 3054.8 | learning rate 6.489E-05 | lm loss 1.577055E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.16 | backward: 1728.19 | allreduce: 24.43 | optimizer: 55.77 | batch generator: 0.93 | data loader: 0.07
     iteration   166200/  300000 | elapsed time per iteration (ms): 2456.3 | learning rate 6.481E-05 | lm loss 1.585754E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.34 | backward: 1730.96 | allreduce: 26.66 | optimizer: 55.77 | batch generator: 0.49 | data loader: 0.04
     iteration   166300/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 6.473E-05 | lm loss 1.576098E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.38 | backward: 1728.97 | allreduce: 24.70 | optimizer: 55.77 | batch generator: 0.50 | data loader: 0.04
     iteration   166400/  300000 | elapsed time per iteration (ms): 2456.1 | learning rate 6.466E-05 | lm loss 1.613189E+00 | loss scale 2097152.0 |
    time (ms) | forward: 669.46 | backward: 1730.36 | allreduce: 25.36 | optimizer: 55.95 | batch generator: 0.83 | data loader: 0.06
     iteration   166500/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 6.458E-05 | lm loss 1.584977E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.23 | backward: 1729.68 | allreduce: 27.02 | optimizer: 54.68 | batch generator: 0.49 | data loader: 0.04
     iteration   166600/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 6.450E-05 | lm loss 1.585158E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.24 | backward: 1729.09 | allreduce: 25.15 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   166700/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 6.443E-05 | lm loss 1.572517E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.02 | backward: 1728.79 | allreduce: 25.40 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   166800/  300000 | elapsed time per iteration (ms): 2455.3 | learning rate 6.435E-05 | lm loss 1.587440E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.70 | backward: 1730.67 | allreduce: 27.82 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   166900/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 6.427E-05 | lm loss 1.593300E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.30 | backward: 1729.17 | allreduce: 24.94 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   167000/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 6.419E-05 | lm loss 1.578400E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.31 | backward: 1729.07 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 167000 | LM loss: 1.654691E+00 | LM PPL: 5.231465E+00
    ------------------------------------------------------------------------------------
     iteration   167100/  300000 | elapsed time per iteration (ms): 3066.9 | learning rate 6.411E-05 | lm loss 1.586184E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.51 | backward: 1731.88 | allreduce: 27.32 | optimizer: 55.77 | batch generator: 9.73 | data loader: 8.93
     iteration   167200/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 6.404E-05 | lm loss 1.575254E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.56 | backward: 1729.06 | allreduce: 24.49 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   167300/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 6.396E-05 | lm loss 1.552859E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.43 | backward: 1729.17 | allreduce: 24.67 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   167400/  300000 | elapsed time per iteration (ms): 2456.9 | learning rate 6.388E-05 | lm loss 1.578271E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.28 | backward: 1731.63 | allreduce: 27.43 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   167500/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 6.381E-05 | lm loss 1.566068E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.37 | backward: 1728.31 | allreduce: 24.91 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   167600/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 6.373E-05 | lm loss 1.587247E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.19 | backward: 1727.99 | allreduce: 24.76 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   167700/  300000 | elapsed time per iteration (ms): 2455.9 | learning rate 6.365E-05 | lm loss 1.604026E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.16 | backward: 1730.82 | allreduce: 27.40 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   167800/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 6.357E-05 | lm loss 1.605260E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.22 | backward: 1728.57 | allreduce: 24.85 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   167900/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 6.350E-05 | lm loss 1.570569E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.24 | backward: 1728.57 | allreduce: 24.84 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   168000/  300000 | elapsed time per iteration (ms): 2456.6 | learning rate 6.342E-05 | lm loss 1.586873E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.39 | backward: 1731.29 | allreduce: 27.42 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 168000 | LM loss: 1.635211E+00 | LM PPL: 5.130540E+00
    ------------------------------------------------------------------------------------
     iteration   168100/  300000 | elapsed time per iteration (ms): 3061.3 | learning rate 6.334E-05 | lm loss 1.618260E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.25 | backward: 1728.51 | allreduce: 24.81 | optimizer: 55.77 | batch generator: 7.65 | data loader: 6.85
     iteration   168200/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 6.326E-05 | lm loss 1.569742E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.34 | backward: 1728.66 | allreduce: 24.88 | optimizer: 55.78 | batch generator: 0.47 | data loader: 0.04
     iteration   168300/  300000 | elapsed time per iteration (ms): 2457.0 | learning rate 6.319E-05 | lm loss 1.579352E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.46 | backward: 1731.55 | allreduce: 27.34 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   168400/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 6.311E-05 | lm loss 1.590918E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.23 | backward: 1728.71 | allreduce: 24.99 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   168500/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 6.303E-05 | lm loss 1.593012E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.27 | backward: 1728.47 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   168600/  300000 | elapsed time per iteration (ms): 2456.7 | learning rate 6.295E-05 | lm loss 1.589811E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.24 | backward: 1731.48 | allreduce: 27.50 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   168700/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 6.288E-05 | lm loss 1.576123E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.26 | backward: 1729.10 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   168800/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 6.280E-05 | lm loss 1.585071E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.06 | backward: 1728.46 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   168900/  300000 | elapsed time per iteration (ms): 2456.1 | learning rate 6.272E-05 | lm loss 1.566917E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.38 | backward: 1730.76 | allreduce: 26.40 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   169000/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 6.264E-05 | lm loss 1.571462E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.27 | backward: 1727.58 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 169000 | LM loss: 1.655470E+00 | LM PPL: 5.235540E+00
    ------------------------------------------------------------------------------------
     iteration   169100/  300000 | elapsed time per iteration (ms): 3058.2 | learning rate 6.257E-05 | lm loss 1.597696E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.24 | backward: 1727.50 | allreduce: 23.39 | optimizer: 55.78 | batch generator: 5.71 | data loader: 4.92
     iteration   169200/  300000 | elapsed time per iteration (ms): 2455.2 | learning rate 6.249E-05 | lm loss 1.576948E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.21 | backward: 1730.06 | allreduce: 25.92 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   169300/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 6.241E-05 | lm loss 1.580021E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.20 | backward: 1727.32 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   169400/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 6.233E-05 | lm loss 1.567724E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.24 | backward: 1727.41 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   169500/  300000 | elapsed time per iteration (ms): 2454.7 | learning rate 6.226E-05 | lm loss 1.562702E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.09 | backward: 1729.68 | allreduce: 25.93 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   169600/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 6.218E-05 | lm loss 1.562168E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.16 | backward: 1726.82 | allreduce: 23.40 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   169700/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 6.210E-05 | lm loss 1.579203E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.18 | backward: 1727.41 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   169800/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 6.202E-05 | lm loss 1.570576E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.07 | backward: 1729.61 | allreduce: 25.92 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   169900/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 6.195E-05 | lm loss 1.576310E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.16 | backward: 1728.16 | allreduce: 24.13 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   170000/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 6.187E-05 | lm loss 1.572238E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.34 | backward: 1729.24 | allreduce: 24.90 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  170000 to checkpoints/gpt2_750m_2/iter_0170000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0170000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 170000 | LM loss: 1.666919E+00 | LM PPL: 5.295825E+00
    ------------------------------------------------------------------------------------
     iteration   170100/  300000 | elapsed time per iteration (ms): 3112.2 | learning rate 6.179E-05 | lm loss 1.588655E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.30 | backward: 1728.72 | allreduce: 24.50 | optimizer: 55.77 | batch generator: 0.86 | data loader: 0.07
     iteration   170200/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 6.172E-05 | lm loss 1.588457E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.44 | backward: 1728.47 | allreduce: 24.66 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   170300/  300000 | elapsed time per iteration (ms): 2456.7 | learning rate 6.164E-05 | lm loss 1.580034E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.43 | backward: 1731.30 | allreduce: 27.12 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   170400/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 6.156E-05 | lm loss 1.563023E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.39 | backward: 1728.60 | allreduce: 24.51 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   170500/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 6.148E-05 | lm loss 1.578442E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.24 | backward: 1728.30 | allreduce: 24.49 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   170600/  300000 | elapsed time per iteration (ms): 2456.5 | learning rate 6.141E-05 | lm loss 1.594778E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.27 | backward: 1731.26 | allreduce: 27.43 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   170700/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 6.133E-05 | lm loss 1.576304E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.02 | backward: 1728.02 | allreduce: 24.69 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   170800/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 6.125E-05 | lm loss 1.575955E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.03 | backward: 1727.96 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   170900/  300000 | elapsed time per iteration (ms): 2456.3 | learning rate 6.117E-05 | lm loss 1.563058E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.23 | backward: 1731.10 | allreduce: 27.28 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   171000/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 6.110E-05 | lm loss 1.566344E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.95 | backward: 1727.86 | allreduce: 24.69 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 171000 | LM loss: 1.674447E+00 | LM PPL: 5.335846E+00
    ------------------------------------------------------------------------------------
     iteration   171100/  300000 | elapsed time per iteration (ms): 3058.6 | learning rate 6.102E-05 | lm loss 1.575769E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.88 | backward: 1727.75 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 6.09 | data loader: 5.31
     iteration   171200/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 6.094E-05 | lm loss 1.579766E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.83 | backward: 1730.14 | allreduce: 27.17 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   171300/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 6.087E-05 | lm loss 1.578406E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.86 | backward: 1727.86 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   171400/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 6.079E-05 | lm loss 1.549951E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.83 | backward: 1727.86 | allreduce: 24.83 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   171500/  300000 | elapsed time per iteration (ms): 2454.7 | learning rate 6.071E-05 | lm loss 1.565002E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.75 | backward: 1730.01 | allreduce: 27.09 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   171600/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 6.064E-05 | lm loss 1.572021E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.67 | backward: 1726.89 | allreduce: 24.55 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   171700/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 6.056E-05 | lm loss 1.570153E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.61 | backward: 1727.11 | allreduce: 24.65 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   171800/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 6.048E-05 | lm loss 1.575331E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.64 | backward: 1729.68 | allreduce: 27.08 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   171900/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 6.040E-05 | lm loss 1.586454E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.95 | backward: 1727.80 | allreduce: 24.50 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   172000/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 6.033E-05 | lm loss 1.577984E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.89 | backward: 1727.89 | allreduce: 24.69 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 172000 | LM loss: 1.645566E+00 | LM PPL: 5.183945E+00
    ------------------------------------------------------------------------------------
     iteration   172100/  300000 | elapsed time per iteration (ms): 3060.1 | learning rate 6.025E-05 | lm loss 1.579018E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.76 | backward: 1730.03 | allreduce: 27.17 | optimizer: 55.77 | batch generator: 5.76 | data loader: 4.98
     iteration   172200/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 6.017E-05 | lm loss 1.568706E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1727.00 | allreduce: 24.54 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   172300/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 6.010E-05 | lm loss 1.562809E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1727.30 | allreduce: 24.95 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   172400/  300000 | elapsed time per iteration (ms): 2455.3 | learning rate 6.002E-05 | lm loss 1.590865E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.81 | backward: 1730.53 | allreduce: 27.40 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   172500/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 5.994E-05 | lm loss 1.593293E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.87 | backward: 1727.51 | allreduce: 24.41 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   172600/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 5.987E-05 | lm loss 1.569774E+00 | loss scale 2097152.0 |
    time (ms) | forward: 669.01 | backward: 1727.28 | allreduce: 24.48 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   172700/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 5.979E-05 | lm loss 1.585620E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.81 | backward: 1728.69 | allreduce: 27.03 | optimizer: 54.66 | batch generator: 0.45 | data loader: 0.04
     iteration   172800/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.971E-05 | lm loss 1.562733E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.61 | backward: 1726.64 | allreduce: 24.34 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   172900/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 5.964E-05 | lm loss 1.561104E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.86 | backward: 1727.15 | allreduce: 24.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   173000/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 5.956E-05 | lm loss 1.567106E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.62 | backward: 1729.33 | allreduce: 26.93 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 173000 | LM loss: 1.648959E+00 | LM PPL: 5.201563E+00
    ------------------------------------------------------------------------------------
     iteration   173100/  300000 | elapsed time per iteration (ms): 3052.5 | learning rate 5.948E-05 | lm loss 1.545807E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.90 | backward: 1727.31 | allreduce: 24.62 | optimizer: 55.76 | batch generator: 0.85 | data loader: 0.07
     iteration   173200/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 5.941E-05 | lm loss 1.586041E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.78 | backward: 1727.38 | allreduce: 24.88 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   173300/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 5.933E-05 | lm loss 1.551988E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.72 | backward: 1729.54 | allreduce: 27.09 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   173400/  300000 | elapsed time per iteration (ms): 2445.9 | learning rate 5.926E-05 | lm loss 3.393091E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.46 | backward: 1725.92 | allreduce: 25.15 | optimizer: 51.31 | batch generator: 0.45 | data loader: 0.04
     iteration   173500/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 5.918E-05 | lm loss 3.019069E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.51 | backward: 1729.10 | allreduce: 24.59 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   173600/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 5.911E-05 | lm loss 1.588212E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.84 | backward: 1724.01 | allreduce: 24.93 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   173700/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 5.903E-05 | lm loss 1.573870E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.73 | backward: 1726.15 | allreduce: 27.23 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   173800/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 5.895E-05 | lm loss 1.563479E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.71 | backward: 1723.43 | allreduce: 24.62 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   173900/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 5.888E-05 | lm loss 1.570251E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.51 | backward: 1723.41 | allreduce: 24.91 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   174000/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 5.880E-05 | lm loss 1.591870E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.67 | backward: 1725.78 | allreduce: 27.16 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 174000 | LM loss: 1.649237E+00 | LM PPL: 5.203009E+00
    ------------------------------------------------------------------------------------
     iteration   174100/  300000 | elapsed time per iteration (ms): 3049.4 | learning rate 5.872E-05 | lm loss 1.573480E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.71 | backward: 1723.62 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.86 | data loader: 0.07
     iteration   174200/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 5.865E-05 | lm loss 1.562142E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.59 | backward: 1723.54 | allreduce: 24.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   174300/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 5.857E-05 | lm loss 1.577657E+00 | loss scale 2048.0 |
    time (ms) | forward: 668.56 | backward: 1725.84 | allreduce: 27.20 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   174400/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 5.849E-05 | lm loss 1.562688E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.62 | backward: 1723.75 | allreduce: 24.71 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   174500/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 5.842E-05 | lm loss 1.558960E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.56 | backward: 1724.18 | allreduce: 24.77 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   174600/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 5.834E-05 | lm loss 1.559439E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.40 | backward: 1726.29 | allreduce: 27.31 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   174700/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 5.826E-05 | lm loss 1.573769E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.40 | backward: 1723.91 | allreduce: 24.89 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   174800/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 5.819E-05 | lm loss 1.567411E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.53 | backward: 1724.17 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   174900/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 5.811E-05 | lm loss 1.576706E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.44 | backward: 1726.09 | allreduce: 27.24 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   175000/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 5.803E-05 | lm loss 1.571337E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.26 | backward: 1723.50 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  175000 to checkpoints/gpt2_750m_2/iter_0175000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0175000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 175000 | LM loss: 1.675303E+00 | LM PPL: 5.340415E+00
    ------------------------------------------------------------------------------------
     iteration   175100/  300000 | elapsed time per iteration (ms): 3108.7 | learning rate 5.796E-05 | lm loss 1.577269E+00 | loss scale 4096.0 |
    time (ms) | forward: 667.51 | backward: 1723.81 | allreduce: 27.42 | optimizer: 55.78 | batch generator: 8.47 | data loader: 7.66
     iteration   175200/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 5.788E-05 | lm loss 1.560276E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.42 | backward: 1723.59 | allreduce: 24.74 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   175300/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 5.780E-05 | lm loss 1.553773E+00 | loss scale 4096.0 |
    time (ms) | forward: 668.28 | backward: 1723.43 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   175400/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 5.773E-05 | lm loss 1.574240E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.51 | backward: 1726.15 | allreduce: 26.92 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   175500/  300000 | elapsed time per iteration (ms): 2448.9 | learning rate 5.765E-05 | lm loss 1.583874E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.51 | backward: 1724.41 | allreduce: 24.85 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   175600/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 5.757E-05 | lm loss 1.569701E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.35 | backward: 1724.05 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   175700/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 5.750E-05 | lm loss 1.586699E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.34 | backward: 1726.59 | allreduce: 27.22 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   175800/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 5.742E-05 | lm loss 1.558611E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.48 | backward: 1724.20 | allreduce: 24.62 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   175900/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 5.735E-05 | lm loss 1.557604E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.66 | backward: 1724.65 | allreduce: 24.62 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   176000/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.727E-05 | lm loss 1.567551E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.47 | backward: 1726.81 | allreduce: 27.17 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 176000 | LM loss: 1.656957E+00 | LM PPL: 5.243331E+00
    ------------------------------------------------------------------------------------
     iteration   176100/  300000 | elapsed time per iteration (ms): 3058.8 | learning rate 5.719E-05 | lm loss 1.595934E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.45 | backward: 1724.34 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 10.80 | data loader: 9.99
     iteration   176200/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 5.712E-05 | lm loss 1.567663E+00 | loss scale 8192.0 |
    time (ms) | forward: 668.12 | backward: 1724.01 | allreduce: 25.08 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   176300/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 5.704E-05 | lm loss 1.590403E+00 | loss scale 8192.0 |
    time (ms) | forward: 667.91 | backward: 1726.66 | allreduce: 27.98 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   176400/  300000 | elapsed time per iteration (ms): 2447.8 | learning rate 5.696E-05 | lm loss 1.575623E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.01 | backward: 1723.84 | allreduce: 24.96 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   176500/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 5.689E-05 | lm loss 1.561656E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.10 | backward: 1724.37 | allreduce: 25.03 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   176600/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 5.681E-05 | lm loss 1.593772E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.04 | backward: 1726.88 | allreduce: 27.38 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   176700/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 5.673E-05 | lm loss 1.551594E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.21 | backward: 1724.62 | allreduce: 24.91 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   176800/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 5.666E-05 | lm loss 1.560814E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.09 | backward: 1724.49 | allreduce: 24.93 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   176900/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.658E-05 | lm loss 1.577234E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.24 | backward: 1727.01 | allreduce: 27.35 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   177000/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 5.651E-05 | lm loss 1.587682E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.20 | backward: 1724.49 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 177000 | LM loss: 1.668194E+00 | LM PPL: 5.302581E+00
    ------------------------------------------------------------------------------------
     iteration   177100/  300000 | elapsed time per iteration (ms): 3069.7 | learning rate 5.643E-05 | lm loss 1.560725E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.17 | backward: 1724.64 | allreduce: 25.25 | optimizer: 55.77 | batch generator: 21.83 | data loader: 21.02
     iteration   177200/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 5.635E-05 | lm loss 1.567020E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.21 | backward: 1727.24 | allreduce: 27.79 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   177300/  300000 | elapsed time per iteration (ms): 2449.2 | learning rate 5.628E-05 | lm loss 1.555949E+00 | loss scale 16384.0 |
    time (ms) | forward: 668.27 | backward: 1724.92 | allreduce: 25.28 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   177400/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 5.620E-05 | lm loss 1.576893E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.18 | backward: 1724.70 | allreduce: 25.12 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   177500/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 5.613E-05 | lm loss 1.551598E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.31 | backward: 1727.20 | allreduce: 27.12 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   177600/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 5.605E-05 | lm loss 1.545071E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.21 | backward: 1724.59 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   177700/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 5.597E-05 | lm loss 1.562677E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.36 | backward: 1724.73 | allreduce: 24.57 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   177800/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 5.590E-05 | lm loss 1.552126E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.31 | backward: 1727.08 | allreduce: 27.04 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   177900/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 5.582E-05 | lm loss 1.553204E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.21 | backward: 1724.37 | allreduce: 24.52 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   178000/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 5.575E-05 | lm loss 1.573378E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.23 | backward: 1724.80 | allreduce: 24.57 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 178000 | LM loss: 1.661657E+00 | LM PPL: 5.268032E+00
    ------------------------------------------------------------------------------------
     iteration   178100/  300000 | elapsed time per iteration (ms): 3060.1 | learning rate 5.567E-05 | lm loss 1.596845E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.30 | backward: 1727.56 | allreduce: 27.24 | optimizer: 55.77 | batch generator: 8.52 | data loader: 7.71
     iteration   178200/  300000 | elapsed time per iteration (ms): 2448.9 | learning rate 5.559E-05 | lm loss 1.557270E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.31 | backward: 1724.68 | allreduce: 24.61 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   178300/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 5.552E-05 | lm loss 1.566042E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.37 | backward: 1724.77 | allreduce: 24.70 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   178400/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 5.544E-05 | lm loss 1.560842E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.37 | backward: 1727.88 | allreduce: 27.58 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   178500/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 5.537E-05 | lm loss 1.560677E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.09 | backward: 1724.80 | allreduce: 24.94 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   178600/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 5.529E-05 | lm loss 1.581960E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.03 | backward: 1725.05 | allreduce: 24.97 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   178700/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 5.522E-05 | lm loss 1.558833E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.24 | backward: 1728.02 | allreduce: 27.55 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   178800/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 5.514E-05 | lm loss 1.563733E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.19 | backward: 1725.41 | allreduce: 25.02 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   178900/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 5.506E-05 | lm loss 1.578252E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.26 | backward: 1725.46 | allreduce: 25.01 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   179000/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 5.499E-05 | lm loss 1.577145E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.18 | backward: 1727.81 | allreduce: 27.58 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 179000 | LM loss: 1.669789E+00 | LM PPL: 5.311047E+00
    ------------------------------------------------------------------------------------
     iteration   179100/  300000 | elapsed time per iteration (ms): 3065.8 | learning rate 5.491E-05 | lm loss 1.589495E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.21 | backward: 1725.46 | allreduce: 25.04 | optimizer: 55.78 | batch generator: 16.84 | data loader: 16.04
     iteration   179200/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 5.484E-05 | lm loss 1.565946E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.12 | backward: 1725.59 | allreduce: 25.27 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   179300/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 5.476E-05 | lm loss 1.577317E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.59 | backward: 1729.14 | allreduce: 28.07 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   179400/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 5.469E-05 | lm loss 1.576012E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.31 | backward: 1726.14 | allreduce: 25.37 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   179500/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 5.461E-05 | lm loss 1.592391E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.27 | backward: 1726.34 | allreduce: 25.40 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   179600/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 5.453E-05 | lm loss 1.556446E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.44 | backward: 1728.73 | allreduce: 27.75 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   179700/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 5.446E-05 | lm loss 1.563871E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.63 | backward: 1726.86 | allreduce: 25.34 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   179800/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 5.438E-05 | lm loss 1.589321E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.58 | backward: 1726.78 | allreduce: 25.30 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   179900/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 5.431E-05 | lm loss 1.550040E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.51 | backward: 1728.70 | allreduce: 27.58 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   180000/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 5.423E-05 | lm loss 1.563025E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.23 | backward: 1725.26 | allreduce: 24.77 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  180000 to checkpoints/gpt2_750m_2/iter_0180000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0180000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 180000 | LM loss: 1.648620E+00 | LM PPL: 5.199801E+00
    ------------------------------------------------------------------------------------
     iteration   180100/  300000 | elapsed time per iteration (ms): 3107.8 | learning rate 5.416E-05 | lm loss 1.554308E+00 | loss scale 131072.0 |
    time (ms) | forward: 667.45 | backward: 1723.08 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 8.55 | data loader: 7.75
     iteration   180200/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 5.408E-05 | lm loss 1.566817E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.04 | backward: 1727.51 | allreduce: 27.11 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   180300/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 5.401E-05 | lm loss 1.553206E+00 | loss scale 131072.0 |
    time (ms) | forward: 667.98 | backward: 1724.77 | allreduce: 24.63 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   180400/  300000 | elapsed time per iteration (ms): 2449.2 | learning rate 5.393E-05 | lm loss 1.569858E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.09 | backward: 1725.21 | allreduce: 24.62 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   180500/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 5.386E-05 | lm loss 1.579144E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.81 | backward: 1727.14 | allreduce: 26.98 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   180600/  300000 | elapsed time per iteration (ms): 2449.2 | learning rate 5.378E-05 | lm loss 1.577220E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.01 | backward: 1725.28 | allreduce: 24.66 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   180700/  300000 | elapsed time per iteration (ms): 2449.5 | learning rate 5.370E-05 | lm loss 1.578887E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.02 | backward: 1725.56 | allreduce: 24.57 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   180800/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 5.363E-05 | lm loss 1.567803E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.02 | backward: 1728.05 | allreduce: 27.23 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   180900/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 5.355E-05 | lm loss 1.570563E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.93 | backward: 1725.46 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   181000/  300000 | elapsed time per iteration (ms): 2462.4 | learning rate 5.348E-05 | lm loss 1.575165E+00 | loss scale 262144.0 |
    time (ms) | forward: 681.41 | backward: 1724.99 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 14.00 | data loader: 13.59
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 181000 | LM loss: 1.655740E+00 | LM PPL: 5.236952E+00
    ------------------------------------------------------------------------------------
     iteration   181100/  300000 | elapsed time per iteration (ms): 3052.1 | learning rate 5.340E-05 | lm loss 1.568529E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.01 | backward: 1728.10 | allreduce: 27.10 | optimizer: 55.77 | batch generator: 0.88 | data loader: 0.07
     iteration   181200/  300000 | elapsed time per iteration (ms): 2449.8 | learning rate 5.333E-05 | lm loss 1.546543E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.07 | backward: 1725.75 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   181300/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 5.325E-05 | lm loss 1.543888E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.11 | backward: 1725.96 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   181400/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 5.318E-05 | lm loss 1.563427E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.82 | backward: 1727.90 | allreduce: 27.32 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   181500/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 5.310E-05 | lm loss 1.576098E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.98 | backward: 1726.09 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   181600/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 5.303E-05 | lm loss 1.543360E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1726.13 | allreduce: 24.62 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   181700/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 5.295E-05 | lm loss 1.536400E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1728.54 | allreduce: 27.05 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   181800/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 5.288E-05 | lm loss 1.559758E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.96 | backward: 1725.26 | allreduce: 24.88 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   181900/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 5.280E-05 | lm loss 1.560845E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.98 | backward: 1725.98 | allreduce: 24.62 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   182000/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 5.273E-05 | lm loss 1.560261E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1728.52 | allreduce: 27.06 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 182000 | LM loss: 1.639788E+00 | LM PPL: 5.154076E+00
    ------------------------------------------------------------------------------------
     iteration   182100/  300000 | elapsed time per iteration (ms): 3056.6 | learning rate 5.265E-05 | lm loss 1.549064E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.81 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 6.18 | data loader: 5.38
     iteration   182200/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 5.258E-05 | lm loss 1.568672E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.27 | backward: 1726.87 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   182300/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 5.250E-05 | lm loss 1.544029E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1729.04 | allreduce: 27.17 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   182400/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 5.243E-05 | lm loss 1.567149E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.01 | backward: 1725.98 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   182500/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 5.235E-05 | lm loss 1.582791E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.02 | backward: 1726.41 | allreduce: 25.22 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   182600/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 5.228E-05 | lm loss 1.562955E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1729.54 | allreduce: 27.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   182700/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.220E-05 | lm loss 1.553152E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.38 | backward: 1726.83 | allreduce: 24.84 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   182800/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 5.213E-05 | lm loss 1.558522E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.13 | backward: 1726.27 | allreduce: 24.74 | optimizer: 55.75 | batch generator: 0.45 | data loader: 0.04
     iteration   182900/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 5.205E-05 | lm loss 1.569086E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1729.44 | allreduce: 27.30 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   183000/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 5.198E-05 | lm loss 1.543853E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.18 | backward: 1726.61 | allreduce: 24.73 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 183000 | LM loss: 1.672889E+00 | LM PPL: 5.327539E+00
    ------------------------------------------------------------------------------------
     iteration   183100/  300000 | elapsed time per iteration (ms): 3051.2 | learning rate 5.191E-05 | lm loss 1.564106E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.16 | backward: 1726.94 | allreduce: 24.92 | optimizer: 55.77 | batch generator: 0.88 | data loader: 0.07
     iteration   183200/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 5.183E-05 | lm loss 1.541179E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1729.77 | allreduce: 27.71 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   183300/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 5.176E-05 | lm loss 1.556654E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1727.36 | allreduce: 25.31 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   183400/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.168E-05 | lm loss 1.569568E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1727.12 | allreduce: 25.28 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   183500/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 5.161E-05 | lm loss 1.570958E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1729.69 | allreduce: 27.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   183600/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 5.153E-05 | lm loss 1.561234E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1727.44 | allreduce: 25.10 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   183700/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 5.146E-05 | lm loss 1.544219E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1726.81 | allreduce: 24.90 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   183800/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 5.138E-05 | lm loss 1.561282E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1728.38 | allreduce: 27.20 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   183900/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.131E-05 | lm loss 1.568929E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1727.04 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   184000/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 5.124E-05 | lm loss 1.555723E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1726.97 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 184000 | LM loss: 1.644871E+00 | LM PPL: 5.180343E+00
    ------------------------------------------------------------------------------------
     iteration   184100/  300000 | elapsed time per iteration (ms): 3054.0 | learning rate 5.116E-05 | lm loss 1.557391E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1729.41 | allreduce: 27.22 | optimizer: 55.77 | batch generator: 0.87 | data loader: 0.07
     iteration   184200/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 5.109E-05 | lm loss 1.539988E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.07 | backward: 1726.39 | allreduce: 24.83 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   184300/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 5.101E-05 | lm loss 1.552805E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1726.55 | allreduce: 24.74 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   184400/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 5.094E-05 | lm loss 1.570181E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1728.88 | allreduce: 27.14 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   184500/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 5.086E-05 | lm loss 1.549571E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.96 | backward: 1726.21 | allreduce: 24.94 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   184600/  300000 | elapsed time per iteration (ms): 2449.5 | learning rate 5.079E-05 | lm loss 1.555306E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.87 | backward: 1725.70 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   184700/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 5.071E-05 | lm loss 1.557683E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.00 | backward: 1728.85 | allreduce: 27.28 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   184800/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 5.064E-05 | lm loss 1.599880E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.07 | backward: 1725.22 | allreduce: 24.77 | optimizer: 54.65 | batch generator: 0.44 | data loader: 0.04
     iteration   184900/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 5.057E-05 | lm loss 1.566024E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.14 | backward: 1726.20 | allreduce: 24.63 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   185000/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 5.049E-05 | lm loss 1.555749E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.97 | backward: 1728.51 | allreduce: 27.32 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  185000 to checkpoints/gpt2_750m_2/iter_0185000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0185000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 185000 | LM loss: 1.644781E+00 | LM PPL: 5.179874E+00
    ------------------------------------------------------------------------------------
     iteration   185100/  300000 | elapsed time per iteration (ms): 3135.6 | learning rate 5.042E-05 | lm loss 1.562029E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.74 | backward: 1724.76 | allreduce: 24.58 | optimizer: 55.77 | batch generator: 35.56 | data loader: 34.76
     iteration   185200/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 5.034E-05 | lm loss 1.570294E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1729.43 | allreduce: 27.24 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   185300/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 5.027E-05 | lm loss 1.577966E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.34 | backward: 1726.78 | allreduce: 24.69 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   185400/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.020E-05 | lm loss 1.554245E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1726.89 | allreduce: 24.76 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   185500/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 5.012E-05 | lm loss 1.565342E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1729.02 | allreduce: 27.34 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   185600/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 5.005E-05 | lm loss 1.565813E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1726.86 | allreduce: 24.94 | optimizer: 55.76 | batch generator: 0.43 | data loader: 0.04
     iteration   185700/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 4.997E-05 | lm loss 1.547633E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1727.25 | allreduce: 25.25 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   185800/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 4.990E-05 | lm loss 1.564629E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.34 | backward: 1728.26 | allreduce: 27.48 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   185900/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 4.983E-05 | lm loss 1.573736E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1726.91 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   186000/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 4.975E-05 | lm loss 1.558823E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.43 | backward: 1726.69 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 186000 | LM loss: 1.651098E+00 | LM PPL: 5.212701E+00
    ------------------------------------------------------------------------------------
     iteration   186100/  300000 | elapsed time per iteration (ms): 3055.3 | learning rate 4.968E-05 | lm loss 1.553161E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.60 | backward: 1729.65 | allreduce: 27.28 | optimizer: 55.77 | batch generator: 1.41 | data loader: 0.60
     iteration   186200/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 4.961E-05 | lm loss 1.562627E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.71 | backward: 1727.66 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   186300/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 4.953E-05 | lm loss 1.566859E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1727.22 | allreduce: 24.82 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   186400/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 4.946E-05 | lm loss 1.555986E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.59 | backward: 1730.19 | allreduce: 27.51 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   186500/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 4.938E-05 | lm loss 1.526186E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1727.19 | allreduce: 24.86 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   186600/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 4.931E-05 | lm loss 1.565735E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.70 | backward: 1727.86 | allreduce: 24.97 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   186700/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 4.924E-05 | lm loss 1.560674E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.65 | backward: 1729.33 | allreduce: 27.30 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   186800/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 4.916E-05 | lm loss 1.558560E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.47 | backward: 1726.87 | allreduce: 24.74 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   186900/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 4.909E-05 | lm loss 1.538393E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.38 | backward: 1726.71 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   187000/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 4.902E-05 | lm loss 1.539829E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.53 | backward: 1729.34 | allreduce: 27.33 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 187000 | LM loss: 1.648057E+00 | LM PPL: 5.196873E+00
    ------------------------------------------------------------------------------------
     iteration   187100/  300000 | elapsed time per iteration (ms): 3053.2 | learning rate 4.894E-05 | lm loss 1.544986E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.86 | backward: 1727.81 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.88 | data loader: 0.07
     iteration   187200/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 4.887E-05 | lm loss 1.547039E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.79 | backward: 1727.78 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   187300/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 4.880E-05 | lm loss 1.554161E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.69 | backward: 1729.88 | allreduce: 27.44 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   187400/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 4.872E-05 | lm loss 1.556972E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.65 | backward: 1726.92 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   187500/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 4.865E-05 | lm loss 1.561641E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.68 | backward: 1727.00 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   187600/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 4.857E-05 | lm loss 1.527751E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.80 | backward: 1729.85 | allreduce: 27.23 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   187700/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 4.850E-05 | lm loss 1.591044E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.68 | backward: 1726.86 | allreduce: 24.46 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   187800/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 4.843E-05 | lm loss 1.542937E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1727.19 | allreduce: 24.87 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   187900/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 4.835E-05 | lm loss 1.560014E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1729.99 | allreduce: 27.37 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   188000/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 4.828E-05 | lm loss 1.553169E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.69 | backward: 1727.61 | allreduce: 24.66 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 188000 | LM loss: 1.630366E+00 | LM PPL: 5.105745E+00
    ------------------------------------------------------------------------------------
     iteration   188100/  300000 | elapsed time per iteration (ms): 3068.1 | learning rate 4.821E-05 | lm loss 1.549234E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.76 | backward: 1726.98 | allreduce: 24.58 | optimizer: 55.22 | batch generator: 17.57 | data loader: 16.76
     iteration   188200/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 4.813E-05 | lm loss 1.559341E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.71 | backward: 1730.21 | allreduce: 27.23 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   188300/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 4.806E-05 | lm loss 1.568634E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.57 | backward: 1727.35 | allreduce: 24.76 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   188400/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 4.799E-05 | lm loss 1.543040E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.92 | backward: 1728.16 | allreduce: 24.81 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   188500/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 4.792E-05 | lm loss 1.557857E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.64 | backward: 1730.02 | allreduce: 27.26 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   188600/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 4.784E-05 | lm loss 1.538182E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.59 | backward: 1727.40 | allreduce: 24.74 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   188700/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 4.777E-05 | lm loss 1.548657E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1727.00 | allreduce: 24.67 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   188800/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 4.770E-05 | lm loss 1.550792E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1729.61 | allreduce: 27.21 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   188900/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 4.762E-05 | lm loss 1.535154E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.60 | backward: 1727.81 | allreduce: 25.04 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   189000/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 4.755E-05 | lm loss 1.530976E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.64 | backward: 1727.92 | allreduce: 25.30 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 189000 | LM loss: 1.633141E+00 | LM PPL: 5.119933E+00
    ------------------------------------------------------------------------------------
     iteration   189100/  300000 | elapsed time per iteration (ms): 3068.3 | learning rate 4.748E-05 | lm loss 1.561202E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1728.98 | allreduce: 27.79 | optimizer: 54.66 | batch generator: 16.87 | data loader: 16.06
     iteration   189200/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 4.740E-05 | lm loss 1.579550E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1727.76 | allreduce: 25.33 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   189300/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 4.733E-05 | lm loss 1.545867E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.52 | backward: 1726.33 | allreduce: 24.71 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   189400/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 4.726E-05 | lm loss 1.575178E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1728.68 | allreduce: 27.00 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   189500/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 4.719E-05 | lm loss 1.553389E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1726.22 | allreduce: 24.55 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   189600/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 4.711E-05 | lm loss 1.562527E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.35 | backward: 1726.70 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   189700/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 4.704E-05 | lm loss 1.556513E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.25 | backward: 1728.69 | allreduce: 27.16 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   189800/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 4.697E-05 | lm loss 1.551947E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1726.47 | allreduce: 24.94 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   189900/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 4.689E-05 | lm loss 1.540995E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.46 | backward: 1726.69 | allreduce: 24.84 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   190000/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 4.682E-05 | lm loss 1.560892E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.59 | backward: 1729.26 | allreduce: 27.04 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  190000 to checkpoints/gpt2_750m_2/iter_0190000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0190000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 190000 | LM loss: 1.631758E+00 | LM PPL: 5.112857E+00
    ------------------------------------------------------------------------------------
     iteration   190100/  300000 | elapsed time per iteration (ms): 3110.5 | learning rate 4.675E-05 | lm loss 1.551199E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.79 | backward: 1725.40 | allreduce: 24.85 | optimizer: 55.77 | batch generator: 8.95 | data loader: 8.14
     iteration   190200/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 4.668E-05 | lm loss 1.546107E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.87 | backward: 1726.09 | allreduce: 25.13 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   190300/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 4.660E-05 | lm loss 1.550024E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.86 | backward: 1728.28 | allreduce: 27.36 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   190400/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 4.653E-05 | lm loss 1.570221E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.04 | backward: 1726.15 | allreduce: 25.09 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   190500/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 4.646E-05 | lm loss 1.543469E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.27 | backward: 1726.46 | allreduce: 25.00 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   190600/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 4.639E-05 | lm loss 1.541031E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.31 | backward: 1729.55 | allreduce: 27.59 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   190700/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 4.631E-05 | lm loss 1.536968E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1726.70 | allreduce: 25.09 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   190800/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 4.624E-05 | lm loss 1.545254E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.31 | backward: 1727.08 | allreduce: 25.02 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   190900/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 4.617E-05 | lm loss 1.547003E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1729.70 | allreduce: 27.47 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   191000/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 4.610E-05 | lm loss 1.551006E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1726.64 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 191000 | LM loss: 1.655653E+00 | LM PPL: 5.236501E+00
    ------------------------------------------------------------------------------------
     iteration   191100/  300000 | elapsed time per iteration (ms): 3064.9 | learning rate 4.602E-05 | lm loss 1.551340E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.31 | backward: 1726.44 | allreduce: 24.60 | optimizer: 55.76 | batch generator: 14.97 | data loader: 14.17
     iteration   191200/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 4.595E-05 | lm loss 1.556786E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1729.07 | allreduce: 27.37 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   191300/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 4.588E-05 | lm loss 1.562500E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.15 | backward: 1726.21 | allreduce: 24.67 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   191400/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 4.581E-05 | lm loss 1.547110E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.48 | backward: 1726.92 | allreduce: 24.74 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   191500/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 4.574E-05 | lm loss 1.524027E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1729.54 | allreduce: 27.22 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   191600/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 4.566E-05 | lm loss 1.556310E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.63 | backward: 1727.59 | allreduce: 24.66 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   191700/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 4.559E-05 | lm loss 1.550108E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.27 | backward: 1726.77 | allreduce: 24.63 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   191800/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 4.552E-05 | lm loss 1.545665E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1729.87 | allreduce: 27.32 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   191900/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 4.545E-05 | lm loss 1.535877E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.48 | backward: 1727.51 | allreduce: 24.69 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   192000/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 4.537E-05 | lm loss 1.565136E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.77 | backward: 1728.40 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 192000 | LM loss: 1.655813E+00 | LM PPL: 5.237338E+00
    ------------------------------------------------------------------------------------
     iteration   192100/  300000 | elapsed time per iteration (ms): 3057.4 | learning rate 4.530E-05 | lm loss 1.533101E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.54 | backward: 1730.15 | allreduce: 27.23 | optimizer: 55.78 | batch generator: 2.98 | data loader: 2.17
     iteration   192200/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 4.523E-05 | lm loss 1.546876E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.27 | backward: 1727.21 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   192300/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 4.516E-05 | lm loss 1.576999E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1726.72 | allreduce: 24.58 | optimizer: 55.76 | batch generator: 0.43 | data loader: 0.04
     iteration   192400/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 4.509E-05 | lm loss 1.557990E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1729.59 | allreduce: 27.18 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   192500/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 4.502E-05 | lm loss 1.551292E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1726.36 | allreduce: 24.85 | optimizer: 54.64 | batch generator: 0.43 | data loader: 0.04
     iteration   192600/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 4.494E-05 | lm loss 1.547850E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.76 | backward: 1728.62 | allreduce: 25.00 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   192700/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 4.487E-05 | lm loss 1.536568E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1729.71 | allreduce: 27.22 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   192800/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 4.480E-05 | lm loss 1.545925E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1727.28 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   192900/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 4.473E-05 | lm loss 1.553802E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1727.42 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   193000/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 4.466E-05 | lm loss 1.555527E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1729.83 | allreduce: 27.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 193000 | LM loss: 1.636069E+00 | LM PPL: 5.134947E+00
    ------------------------------------------------------------------------------------
     iteration   193100/  300000 | elapsed time per iteration (ms): 3053.0 | learning rate 4.458E-05 | lm loss 1.560687E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1727.91 | allreduce: 24.75 | optimizer: 55.78 | batch generator: 0.87 | data loader: 0.07
     iteration   193200/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 4.451E-05 | lm loss 1.514324E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1727.95 | allreduce: 25.01 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   193300/  300000 | elapsed time per iteration (ms): 2455.0 | learning rate 4.444E-05 | lm loss 1.539072E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1730.45 | allreduce: 27.31 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   193400/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 4.437E-05 | lm loss 1.532797E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.52 | backward: 1727.25 | allreduce: 24.80 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   193500/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 4.430E-05 | lm loss 1.560025E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.60 | backward: 1727.66 | allreduce: 24.88 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   193600/  300000 | elapsed time per iteration (ms): 2455.3 | learning rate 4.423E-05 | lm loss 1.557223E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.79 | backward: 1730.51 | allreduce: 27.20 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   193700/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 4.415E-05 | lm loss 1.547406E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.72 | backward: 1727.86 | allreduce: 24.79 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   193800/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 4.408E-05 | lm loss 1.548113E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.57 | backward: 1727.82 | allreduce: 24.97 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   193900/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 4.401E-05 | lm loss 1.548753E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.66 | backward: 1730.23 | allreduce: 27.12 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   194000/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 4.394E-05 | lm loss 1.564319E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.65 | backward: 1727.69 | allreduce: 24.79 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 194000 | LM loss: 1.626589E+00 | LM PPL: 5.086494E+00
    ------------------------------------------------------------------------------------
     iteration   194100/  300000 | elapsed time per iteration (ms): 3053.9 | learning rate 4.387E-05 | lm loss 1.551226E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.88 | backward: 1728.29 | allreduce: 25.16 | optimizer: 55.77 | batch generator: 0.87 | data loader: 0.07
     iteration   194200/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 4.380E-05 | lm loss 1.557544E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.66 | backward: 1730.03 | allreduce: 27.44 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   194300/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 4.373E-05 | lm loss 1.550200E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.73 | backward: 1727.34 | allreduce: 24.60 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   194400/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 4.365E-05 | lm loss 1.556542E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.66 | backward: 1727.54 | allreduce: 24.64 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   194500/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 4.358E-05 | lm loss 1.542428E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.48 | backward: 1729.79 | allreduce: 27.07 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   194600/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 4.351E-05 | lm loss 1.547936E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.58 | backward: 1727.14 | allreduce: 24.69 | optimizer: 55.20 | batch generator: 0.45 | data loader: 0.04
     iteration   194700/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 4.344E-05 | lm loss 1.537710E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1728.05 | allreduce: 24.84 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   194800/  300000 | elapsed time per iteration (ms): 2454.7 | learning rate 4.337E-05 | lm loss 1.545831E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.59 | backward: 1730.11 | allreduce: 27.21 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   194900/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 4.330E-05 | lm loss 1.529111E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1727.41 | allreduce: 24.60 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   195000/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 4.323E-05 | lm loss 1.573665E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1727.24 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  195000 to checkpoints/gpt2_750m_2/iter_0195000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0195000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 195000 | LM loss: 1.637102E+00 | LM PPL: 5.140252E+00
    ------------------------------------------------------------------------------------
     iteration   195100/  300000 | elapsed time per iteration (ms): 3126.0 | learning rate 4.316E-05 | lm loss 1.542293E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1727.44 | allreduce: 24.82 | optimizer: 55.77 | batch generator: 12.29 | data loader: 11.49
     iteration   195200/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 4.309E-05 | lm loss 1.548293E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1727.94 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   195300/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 4.301E-05 | lm loss 1.523841E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1727.63 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   195400/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 4.294E-05 | lm loss 1.533914E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.72 | backward: 1730.43 | allreduce: 27.25 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   195500/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 4.287E-05 | lm loss 1.557469E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.76 | backward: 1728.66 | allreduce: 25.31 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   195600/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 4.280E-05 | lm loss 1.533657E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.81 | backward: 1728.02 | allreduce: 25.29 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   195700/  300000 | elapsed time per iteration (ms): 2455.2 | learning rate 4.273E-05 | lm loss 1.551068E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1730.73 | allreduce: 27.85 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   195800/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 4.266E-05 | lm loss 1.554981E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.54 | backward: 1727.99 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   195900/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 4.259E-05 | lm loss 1.552334E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1727.83 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   196000/  300000 | elapsed time per iteration (ms): 2455.6 | learning rate 4.252E-05 | lm loss 1.578396E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.74 | backward: 1730.91 | allreduce: 27.31 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 196000 | LM loss: 1.641913E+00 | LM PPL: 5.165039E+00
    ------------------------------------------------------------------------------------
     iteration   196100/  300000 | elapsed time per iteration (ms): 3073.5 | learning rate 4.245E-05 | lm loss 1.546966E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.62 | backward: 1727.84 | allreduce: 24.75 | optimizer: 55.76 | batch generator: 21.30 | data loader: 20.48
     iteration   196200/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 4.238E-05 | lm loss 1.560339E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.70 | backward: 1728.22 | allreduce: 24.82 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   196300/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 4.231E-05 | lm loss 1.559861E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.74 | backward: 1730.12 | allreduce: 27.42 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   196400/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 4.224E-05 | lm loss 1.552484E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.73 | backward: 1728.20 | allreduce: 24.83 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   196500/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 4.217E-05 | lm loss 1.519280E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.52 | backward: 1727.57 | allreduce: 24.86 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   196600/  300000 | elapsed time per iteration (ms): 2455.4 | learning rate 4.210E-05 | lm loss 1.565597E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.82 | backward: 1730.61 | allreduce: 27.31 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   196700/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 4.203E-05 | lm loss 1.543311E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.63 | backward: 1727.98 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   196800/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 4.195E-05 | lm loss 1.549408E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.78 | backward: 1728.38 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   196900/  300000 | elapsed time per iteration (ms): 2455.7 | learning rate 4.188E-05 | lm loss 1.529576E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.83 | backward: 1730.89 | allreduce: 27.30 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   197000/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 4.181E-05 | lm loss 1.544926E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.78 | backward: 1728.40 | allreduce: 24.82 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 197000 | LM loss: 1.643271E+00 | LM PPL: 5.172060E+00
    ------------------------------------------------------------------------------------
     iteration   197100/  300000 | elapsed time per iteration (ms): 3056.5 | learning rate 4.174E-05 | lm loss 1.561043E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.83 | backward: 1728.41 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 3.47 | data loader: 2.66
     iteration   197200/  300000 | elapsed time per iteration (ms): 2455.2 | learning rate 4.167E-05 | lm loss 1.537484E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.64 | backward: 1730.54 | allreduce: 27.45 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   197300/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 4.160E-05 | lm loss 1.552435E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.88 | backward: 1728.13 | allreduce: 24.43 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   197400/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 4.153E-05 | lm loss 1.552778E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.84 | backward: 1727.00 | allreduce: 23.33 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   197500/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 4.146E-05 | lm loss 1.546155E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.76 | backward: 1729.32 | allreduce: 25.84 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   197600/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 4.139E-05 | lm loss 1.540650E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.72 | backward: 1726.66 | allreduce: 23.35 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   197700/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 4.132E-05 | lm loss 1.525108E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.63 | backward: 1726.29 | allreduce: 23.35 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   197800/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 4.125E-05 | lm loss 1.535594E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.60 | backward: 1728.65 | allreduce: 25.84 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   197900/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 4.118E-05 | lm loss 1.524958E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.67 | backward: 1726.44 | allreduce: 23.34 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   198000/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 4.111E-05 | lm loss 1.543306E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1726.29 | allreduce: 23.34 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 198000 | LM loss: 1.641965E+00 | LM PPL: 5.165308E+00
    ------------------------------------------------------------------------------------
     iteration   198100/  300000 | elapsed time per iteration (ms): 3058.8 | learning rate 4.104E-05 | lm loss 1.527046E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.67 | backward: 1728.89 | allreduce: 25.84 | optimizer: 55.76 | batch generator: 5.73 | data loader: 4.94
     iteration   198200/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 4.097E-05 | lm loss 1.522480E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.68 | backward: 1726.52 | allreduce: 23.35 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   198300/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 4.090E-05 | lm loss 1.525587E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.61 | backward: 1725.22 | allreduce: 23.34 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   198400/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 4.083E-05 | lm loss 1.516153E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.79 | backward: 1729.29 | allreduce: 25.84 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   198500/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 4.076E-05 | lm loss 1.546210E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.69 | backward: 1726.53 | allreduce: 23.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   198600/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 4.069E-05 | lm loss 1.543077E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.91 | backward: 1727.68 | allreduce: 24.68 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   198700/  300000 | elapsed time per iteration (ms): 2455.7 | learning rate 4.062E-05 | lm loss 1.526855E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.97 | backward: 1730.74 | allreduce: 27.13 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   198800/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 4.055E-05 | lm loss 1.562176E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.98 | backward: 1728.48 | allreduce: 24.86 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   198900/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 4.048E-05 | lm loss 1.524511E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.88 | backward: 1727.95 | allreduce: 24.66 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   199000/  300000 | elapsed time per iteration (ms): 2455.6 | learning rate 4.041E-05 | lm loss 1.537803E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.95 | backward: 1730.68 | allreduce: 27.16 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 199000 | LM loss: 1.651123E+00 | LM PPL: 5.212831E+00
    ------------------------------------------------------------------------------------
     iteration   199100/  300000 | elapsed time per iteration (ms): 3057.7 | learning rate 4.034E-05 | lm loss 1.524562E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.95 | backward: 1728.30 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 4.62 | data loader: 3.82
     iteration   199200/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 4.028E-05 | lm loss 1.555924E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.13 | backward: 1728.58 | allreduce: 24.62 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   199300/  300000 | elapsed time per iteration (ms): 2455.9 | learning rate 4.021E-05 | lm loss 1.543086E+00 | loss scale 524288.0 |
    time (ms) | forward: 669.01 | backward: 1730.96 | allreduce: 27.20 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   199400/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 4.014E-05 | lm loss 1.535400E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.89 | backward: 1728.12 | allreduce: 24.74 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   199500/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 4.007E-05 | lm loss 1.536042E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.73 | backward: 1727.41 | allreduce: 24.47 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   199600/  300000 | elapsed time per iteration (ms): 2454.7 | learning rate 4.000E-05 | lm loss 1.536505E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.71 | backward: 1730.08 | allreduce: 27.17 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   199700/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 3.993E-05 | lm loss 1.545171E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.61 | backward: 1727.69 | allreduce: 24.65 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   199800/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 3.986E-05 | lm loss 1.524230E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.64 | backward: 1727.76 | allreduce: 24.65 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   199900/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 3.979E-05 | lm loss 1.520063E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.75 | backward: 1730.41 | allreduce: 27.08 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   200000/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 3.972E-05 | lm loss 1.533237E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.84 | backward: 1728.25 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  200000 to checkpoints/gpt2_750m_2/iter_0200000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0200000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 200000 | LM loss: 1.636861E+00 | LM PPL: 5.139014E+00
    ------------------------------------------------------------------------------------
     iteration   200100/  300000 | elapsed time per iteration (ms): 3126.7 | learning rate 3.965E-05 | lm loss 1.553122E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.54 | backward: 1730.25 | allreduce: 27.25 | optimizer: 55.78 | batch generator: 16.95 | data loader: 16.15
     iteration   200200/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 3.958E-05 | lm loss 1.532743E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.80 | backward: 1728.52 | allreduce: 24.85 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   200300/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 3.951E-05 | lm loss 1.544006E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1727.73 | allreduce: 24.59 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   200400/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 3.944E-05 | lm loss 1.537649E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1729.09 | allreduce: 27.08 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   200500/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 3.937E-05 | lm loss 1.519362E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.59 | backward: 1727.44 | allreduce: 24.59 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   200600/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 3.930E-05 | lm loss 1.559530E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1727.34 | allreduce: 24.58 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   200700/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 3.924E-05 | lm loss 1.514224E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.51 | backward: 1729.84 | allreduce: 27.13 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   200800/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 3.917E-05 | lm loss 1.532146E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1727.95 | allreduce: 25.01 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   200900/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 3.910E-05 | lm loss 1.552521E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.72 | backward: 1727.71 | allreduce: 25.42 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   201000/  300000 | elapsed time per iteration (ms): 2455.0 | learning rate 3.903E-05 | lm loss 1.526951E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.63 | backward: 1730.46 | allreduce: 27.79 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 201000 | LM loss: 1.646122E+00 | LM PPL: 5.186828E+00
    ------------------------------------------------------------------------------------
     iteration   201100/  300000 | elapsed time per iteration (ms): 3061.1 | learning rate 3.896E-05 | lm loss 1.552265E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.70 | backward: 1727.61 | allreduce: 25.24 | optimizer: 55.21 | batch generator: 9.75 | data loader: 8.94
     iteration   201200/  300000 | elapsed time per iteration (ms): 2449.2 | learning rate 3.889E-05 | lm loss 1.557907E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.63 | backward: 1725.76 | allreduce: 25.19 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   201300/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 3.883E-05 | lm loss 1.522638E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.54 | backward: 1728.95 | allreduce: 27.69 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   201400/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 3.876E-05 | lm loss 1.548252E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.73 | backward: 1726.77 | allreduce: 24.95 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   201500/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 3.869E-05 | lm loss 1.550882E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.45 | backward: 1726.05 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   201600/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.862E-05 | lm loss 1.519331E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.33 | backward: 1728.45 | allreduce: 27.34 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   201700/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 3.855E-05 | lm loss 1.521086E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.61 | backward: 1725.91 | allreduce: 24.72 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   201800/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 3.848E-05 | lm loss 1.528139E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.54 | backward: 1725.92 | allreduce: 24.72 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   201900/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 3.841E-05 | lm loss 1.549168E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.69 | backward: 1728.71 | allreduce: 27.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   202000/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 3.835E-05 | lm loss 1.521619E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.74 | backward: 1726.39 | allreduce: 24.70 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 202000 | LM loss: 1.633676E+00 | LM PPL: 5.122670E+00
    ------------------------------------------------------------------------------------
     iteration   202100/  300000 | elapsed time per iteration (ms): 3059.7 | learning rate 3.828E-05 | lm loss 1.549337E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.69 | backward: 1726.27 | allreduce: 24.70 | optimizer: 55.78 | batch generator: 8.94 | data loader: 8.14
     iteration   202200/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.821E-05 | lm loss 1.539592E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.50 | backward: 1728.28 | allreduce: 27.34 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   202300/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 3.814E-05 | lm loss 1.539708E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.63 | backward: 1725.76 | allreduce: 24.72 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   202400/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 3.807E-05 | lm loss 1.536518E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.76 | backward: 1726.17 | allreduce: 24.74 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   202500/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.800E-05 | lm loss 1.539339E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.52 | backward: 1728.20 | allreduce: 27.32 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   202600/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 3.793E-05 | lm loss 1.534686E+00 | loss scale 32768.0 |
    time (ms) | forward: 668.64 | backward: 1725.57 | allreduce: 24.54 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   202700/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 3.787E-05 | lm loss 1.536752E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.56 | backward: 1725.69 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   202800/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 3.780E-05 | lm loss 1.532294E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.65 | backward: 1728.80 | allreduce: 27.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   202900/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 3.773E-05 | lm loss 1.523040E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.63 | backward: 1726.34 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   203000/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 3.766E-05 | lm loss 1.539287E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.66 | backward: 1726.11 | allreduce: 24.59 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 203000 | LM loss: 1.639929E+00 | LM PPL: 5.154804E+00
    ------------------------------------------------------------------------------------
     iteration   203100/  300000 | elapsed time per iteration (ms): 3062.2 | learning rate 3.759E-05 | lm loss 1.522311E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.69 | backward: 1728.50 | allreduce: 27.14 | optimizer: 55.76 | batch generator: 9.39 | data loader: 8.59
     iteration   203200/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 3.753E-05 | lm loss 1.568132E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.71 | backward: 1726.08 | allreduce: 24.52 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   203300/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 3.746E-05 | lm loss 1.540614E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.70 | backward: 1725.91 | allreduce: 24.53 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   203400/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 3.739E-05 | lm loss 1.538738E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.69 | backward: 1728.45 | allreduce: 26.97 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   203500/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 3.732E-05 | lm loss 1.557742E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.77 | backward: 1726.71 | allreduce: 24.67 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   203600/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 3.725E-05 | lm loss 1.539019E+00 | loss scale 65536.0 |
    time (ms) | forward: 668.88 | backward: 1726.75 | allreduce: 24.70 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   203700/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 3.719E-05 | lm loss 1.551420E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.71 | backward: 1728.84 | allreduce: 27.15 | optimizer: 55.77 | batch generator: 0.44 | data loader: 0.04
     iteration   203800/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.712E-05 | lm loss 1.518472E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.72 | backward: 1726.64 | allreduce: 24.67 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   203900/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 3.705E-05 | lm loss 1.544381E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.92 | backward: 1727.25 | allreduce: 24.76 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   204000/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 3.698E-05 | lm loss 1.519948E+00 | loss scale 131072.0 |
    time (ms) | forward: 669.02 | backward: 1727.44 | allreduce: 24.62 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 204000 | LM loss: 1.637120E+00 | LM PPL: 5.140343E+00
    ------------------------------------------------------------------------------------
     iteration   204100/  300000 | elapsed time per iteration (ms): 3067.7 | learning rate 3.692E-05 | lm loss 1.536009E+00 | loss scale 131072.0 |
    time (ms) | forward: 669.01 | backward: 1729.87 | allreduce: 27.22 | optimizer: 55.76 | batch generator: 12.72 | data loader: 11.91
     iteration   204200/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 3.685E-05 | lm loss 1.552858E+00 | loss scale 131072.0 |
    time (ms) | forward: 669.06 | backward: 1727.42 | allreduce: 24.68 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   204300/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 3.678E-05 | lm loss 1.543314E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.99 | backward: 1727.35 | allreduce: 24.61 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   204400/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 3.671E-05 | lm loss 1.528197E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.85 | backward: 1729.38 | allreduce: 27.04 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   204500/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 3.664E-05 | lm loss 1.532419E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.96 | backward: 1727.08 | allreduce: 24.59 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   204600/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 3.658E-05 | lm loss 1.516771E+00 | loss scale 131072.0 |
    time (ms) | forward: 668.89 | backward: 1727.16 | allreduce: 24.61 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   204700/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 3.651E-05 | lm loss 1.537541E+00 | loss scale 262144.0 |
    time (ms) | forward: 669.03 | backward: 1730.10 | allreduce: 27.12 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   204800/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 3.644E-05 | lm loss 1.511033E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.93 | backward: 1728.02 | allreduce: 24.69 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   204900/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 3.638E-05 | lm loss 1.547607E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.99 | backward: 1728.13 | allreduce: 24.64 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   205000/  300000 | elapsed time per iteration (ms): 2455.5 | learning rate 3.631E-05 | lm loss 1.512511E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.94 | backward: 1730.56 | allreduce: 27.34 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  205000 to checkpoints/gpt2_750m_2/iter_0205000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0205000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 205000 | LM loss: 1.638279E+00 | LM PPL: 5.146305E+00
    ------------------------------------------------------------------------------------
     iteration   205100/  300000 | elapsed time per iteration (ms): 3115.2 | learning rate 3.624E-05 | lm loss 1.539620E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.79 | backward: 1727.62 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 8.64 | data loader: 7.83
     iteration   205200/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 3.617E-05 | lm loss 1.546516E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.94 | backward: 1730.22 | allreduce: 27.05 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   205300/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 3.611E-05 | lm loss 1.538522E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.90 | backward: 1728.08 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   205400/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 3.604E-05 | lm loss 1.517930E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.63 | backward: 1727.26 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   205500/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 3.597E-05 | lm loss 1.531931E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.81 | backward: 1730.08 | allreduce: 27.17 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   205600/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 3.591E-05 | lm loss 1.531691E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.91 | backward: 1727.92 | allreduce: 24.70 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   205700/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.584E-05 | lm loss 1.517363E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.89 | backward: 1727.90 | allreduce: 24.72 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   205800/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 3.577E-05 | lm loss 1.518047E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.48 | backward: 1729.83 | allreduce: 27.20 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   205900/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 3.570E-05 | lm loss 1.531282E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.76 | backward: 1728.05 | allreduce: 24.81 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   206000/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 3.564E-05 | lm loss 1.541045E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.60 | backward: 1727.59 | allreduce: 24.81 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 206000 | LM loss: 1.639208E+00 | LM PPL: 5.151086E+00
    ------------------------------------------------------------------------------------
     iteration   206100/  300000 | elapsed time per iteration (ms): 3072.0 | learning rate 3.557E-05 | lm loss 1.544321E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1729.32 | allreduce: 27.23 | optimizer: 55.76 | batch generator: 18.82 | data loader: 18.01
     iteration   206200/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 3.550E-05 | lm loss 1.524939E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1727.07 | allreduce: 24.75 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   206300/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 3.544E-05 | lm loss 1.524335E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1726.80 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   206400/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 3.537E-05 | lm loss 1.538159E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.28 | backward: 1729.15 | allreduce: 27.16 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   206500/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 3.530E-05 | lm loss 1.537279E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1726.86 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   206600/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.524E-05 | lm loss 1.531263E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1726.89 | allreduce: 24.61 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   206700/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 3.517E-05 | lm loss 1.540331E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1729.63 | allreduce: 27.19 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   206800/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 3.510E-05 | lm loss 1.511245E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.52 | backward: 1727.63 | allreduce: 24.81 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   206900/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.504E-05 | lm loss 1.555583E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1728.06 | allreduce: 24.74 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   207000/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 3.497E-05 | lm loss 1.539110E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1729.33 | allreduce: 27.24 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 207000 | LM loss: 1.644402E+00 | LM PPL: 5.177914E+00
    ------------------------------------------------------------------------------------
     iteration   207100/  300000 | elapsed time per iteration (ms): 3059.2 | learning rate 3.491E-05 | lm loss 1.572436E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1727.56 | allreduce: 24.81 | optimizer: 55.76 | batch generator: 7.60 | data loader: 6.81
     iteration   207200/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 3.484E-05 | lm loss 1.511736E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1727.27 | allreduce: 24.93 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   207300/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 3.477E-05 | lm loss 1.536637E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1729.68 | allreduce: 27.41 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   207400/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 3.471E-05 | lm loss 1.546067E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1726.20 | allreduce: 23.74 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   207500/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 3.464E-05 | lm loss 1.533851E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1726.07 | allreduce: 23.39 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   207600/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 3.457E-05 | lm loss 1.538064E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1729.50 | allreduce: 27.22 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   207700/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 3.451E-05 | lm loss 1.522415E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.37 | backward: 1727.03 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   207800/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 3.444E-05 | lm loss 1.538762E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1726.73 | allreduce: 24.75 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   207900/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 3.438E-05 | lm loss 1.527930E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1728.39 | allreduce: 26.42 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   208000/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 3.431E-05 | lm loss 1.526058E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.17 | backward: 1725.64 | allreduce: 24.72 | optimizer: 54.65 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 208000 | LM loss: 1.655344E+00 | LM PPL: 5.234879E+00
    ------------------------------------------------------------------------------------
     iteration   208100/  300000 | elapsed time per iteration (ms): 3063.5 | learning rate 3.424E-05 | lm loss 1.530015E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1727.59 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 11.79 | data loader: 10.98
     iteration   208200/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 3.418E-05 | lm loss 1.530737E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1729.18 | allreduce: 27.22 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   208300/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 3.411E-05 | lm loss 1.536243E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.16 | backward: 1725.96 | allreduce: 23.95 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   208400/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 3.405E-05 | lm loss 1.534113E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1725.70 | allreduce: 23.33 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   208500/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 3.398E-05 | lm loss 1.540937E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1729.51 | allreduce: 27.25 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   208600/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 3.392E-05 | lm loss 1.530801E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.18 | backward: 1726.61 | allreduce: 24.83 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   208700/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 3.385E-05 | lm loss 1.536505E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1726.47 | allreduce: 24.87 | optimizer: 55.20 | batch generator: 0.45 | data loader: 0.04
     iteration   208800/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 3.379E-05 | lm loss 1.535588E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1729.15 | allreduce: 27.28 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   208900/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.372E-05 | lm loss 1.524183E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1726.91 | allreduce: 24.90 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   209000/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 3.365E-05 | lm loss 1.535477E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.48 | backward: 1727.07 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 209000 | LM loss: 1.649644E+00 | LM PPL: 5.205124E+00
    ------------------------------------------------------------------------------------
     iteration   209100/  300000 | elapsed time per iteration (ms): 3072.2 | learning rate 3.359E-05 | lm loss 1.527413E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1729.85 | allreduce: 27.22 | optimizer: 55.78 | batch generator: 18.45 | data loader: 17.63
     iteration   209200/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 3.352E-05 | lm loss 1.542352E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.63 | backward: 1727.61 | allreduce: 24.84 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   209300/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 3.346E-05 | lm loss 1.520572E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1727.34 | allreduce: 25.02 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   209400/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 3.339E-05 | lm loss 1.513737E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1729.23 | allreduce: 27.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   209500/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 3.333E-05 | lm loss 1.543435E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.83 | backward: 1728.35 | allreduce: 24.89 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   209600/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 3.326E-05 | lm loss 1.517042E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.45 | backward: 1727.35 | allreduce: 24.93 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   209700/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 3.320E-05 | lm loss 1.524809E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.60 | backward: 1729.97 | allreduce: 27.29 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   209800/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 3.313E-05 | lm loss 1.522126E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1727.81 | allreduce: 24.96 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   209900/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 3.307E-05 | lm loss 1.534135E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.34 | backward: 1726.71 | allreduce: 24.69 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   210000/  300000 | elapsed time per iteration (ms): 2454.7 | learning rate 3.300E-05 | lm loss 1.522401E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.34 | backward: 1730.38 | allreduce: 27.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  210000 to checkpoints/gpt2_750m_2/iter_0210000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0210000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 210000 | LM loss: 1.642650E+00 | LM PPL: 5.168850E+00
    ------------------------------------------------------------------------------------
     iteration   210100/  300000 | elapsed time per iteration (ms): 3106.9 | learning rate 3.294E-05 | lm loss 1.515178E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1727.49 | allreduce: 25.29 | optimizer: 55.77 | batch generator: 0.93 | data loader: 0.13
     iteration   210200/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.287E-05 | lm loss 1.525599E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.45 | backward: 1727.48 | allreduce: 25.21 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   210300/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 3.281E-05 | lm loss 1.525417E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.55 | backward: 1730.43 | allreduce: 27.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   210400/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 3.274E-05 | lm loss 1.514005E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1727.93 | allreduce: 25.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   210500/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.268E-05 | lm loss 1.553431E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.28 | backward: 1727.10 | allreduce: 25.10 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   210600/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 3.261E-05 | lm loss 1.532277E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.28 | backward: 1729.29 | allreduce: 27.41 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   210700/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 3.255E-05 | lm loss 1.517723E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.15 | backward: 1726.52 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   210800/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 3.248E-05 | lm loss 1.530670E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.35 | backward: 1726.83 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   210900/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 3.242E-05 | lm loss 1.519468E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.31 | backward: 1729.13 | allreduce: 27.43 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   211000/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 3.235E-05 | lm loss 1.526380E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.19 | backward: 1726.39 | allreduce: 24.89 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 211000 | LM loss: 1.626989E+00 | LM PPL: 5.088532E+00
    ------------------------------------------------------------------------------------
     iteration   211100/  300000 | elapsed time per iteration (ms): 3054.3 | learning rate 3.229E-05 | lm loss 1.529968E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1727.27 | allreduce: 25.52 | optimizer: 55.77 | batch generator: 3.42 | data loader: 2.63
     iteration   211200/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.222E-05 | lm loss 1.525694E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.09 | backward: 1728.69 | allreduce: 27.30 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   211300/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 3.216E-05 | lm loss 1.522677E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.06 | backward: 1726.78 | allreduce: 25.00 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   211400/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.210E-05 | lm loss 1.512628E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1727.05 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   211500/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 3.203E-05 | lm loss 1.523165E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1729.15 | allreduce: 27.36 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   211600/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.197E-05 | lm loss 1.526353E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.34 | backward: 1727.03 | allreduce: 24.72 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   211700/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 3.190E-05 | lm loss 1.516593E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.43 | backward: 1725.98 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   211800/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 3.184E-05 | lm loss 1.534888E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1728.28 | allreduce: 25.84 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   211900/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 3.177E-05 | lm loss 1.522526E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1725.71 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   212000/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 3.171E-05 | lm loss 1.535247E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1725.19 | allreduce: 23.33 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 212000 | LM loss: 1.611071E+00 | LM PPL: 5.008170E+00
    ------------------------------------------------------------------------------------
     iteration   212100/  300000 | elapsed time per iteration (ms): 3077.2 | learning rate 3.165E-05 | lm loss 1.520656E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1729.38 | allreduce: 27.29 | optimizer: 55.77 | batch generator: 23.97 | data loader: 23.16
     iteration   212200/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 3.158E-05 | lm loss 1.519875E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1725.58 | allreduce: 24.89 | optimizer: 54.66 | batch generator: 0.45 | data loader: 0.04
     iteration   212300/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.152E-05 | lm loss 1.520986E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.27 | backward: 1727.08 | allreduce: 24.94 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   212400/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 3.146E-05 | lm loss 1.512144E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.19 | backward: 1728.66 | allreduce: 27.38 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   212500/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.139E-05 | lm loss 1.509590E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1726.95 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   212600/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.133E-05 | lm loss 1.550407E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1726.94 | allreduce: 24.82 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   212700/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 3.126E-05 | lm loss 1.519657E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1729.88 | allreduce: 27.49 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   212800/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 3.120E-05 | lm loss 1.520656E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.38 | backward: 1727.53 | allreduce: 25.06 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   212900/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 3.114E-05 | lm loss 1.519334E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1727.67 | allreduce: 25.00 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   213000/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 3.107E-05 | lm loss 1.531557E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.48 | backward: 1730.01 | allreduce: 27.47 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 213000 | LM loss: 1.608991E+00 | LM PPL: 4.997768E+00
    ------------------------------------------------------------------------------------
     iteration   213100/  300000 | elapsed time per iteration (ms): 3065.0 | learning rate 3.101E-05 | lm loss 1.515319E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.64 | backward: 1728.03 | allreduce: 25.09 | optimizer: 55.78 | batch generator: 12.78 | data loader: 11.96
     iteration   213200/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 3.095E-05 | lm loss 1.518894E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1727.00 | allreduce: 25.04 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   213300/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 3.088E-05 | lm loss 1.536544E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1729.72 | allreduce: 27.33 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   213400/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 3.082E-05 | lm loss 1.522572E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1727.77 | allreduce: 25.05 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   213500/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 3.076E-05 | lm loss 1.494339E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1727.66 | allreduce: 24.90 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   213600/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 3.069E-05 | lm loss 1.521892E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.48 | backward: 1730.40 | allreduce: 27.51 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   213700/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 3.063E-05 | lm loss 1.527649E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1727.64 | allreduce: 25.16 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   213800/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 3.057E-05 | lm loss 1.526882E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1727.71 | allreduce: 25.02 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   213900/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 3.050E-05 | lm loss 1.521238E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1730.38 | allreduce: 27.59 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   214000/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 3.044E-05 | lm loss 1.531504E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.37 | backward: 1727.49 | allreduce: 24.98 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 214000 | LM loss: 1.636653E+00 | LM PPL: 5.137945E+00
    ------------------------------------------------------------------------------------
     iteration   214100/  300000 | elapsed time per iteration (ms): 3056.4 | learning rate 3.038E-05 | lm loss 1.522905E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.17 | backward: 1727.12 | allreduce: 25.07 | optimizer: 55.78 | batch generator: 5.28 | data loader: 4.47
     iteration   214200/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 3.031E-05 | lm loss 1.510329E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1729.71 | allreduce: 27.51 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   214300/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 3.025E-05 | lm loss 1.508177E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1727.66 | allreduce: 24.94 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   214400/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 3.019E-05 | lm loss 1.514682E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1728.38 | allreduce: 25.02 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   214500/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 3.013E-05 | lm loss 1.534169E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1729.40 | allreduce: 27.60 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   214600/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 3.006E-05 | lm loss 1.528844E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.43 | backward: 1726.96 | allreduce: 24.70 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   214700/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 3.000E-05 | lm loss 1.515377E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.35 | backward: 1726.32 | allreduce: 24.79 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   214800/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 2.994E-05 | lm loss 1.519253E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.39 | backward: 1729.33 | allreduce: 27.32 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   214900/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 2.987E-05 | lm loss 1.530387E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.42 | backward: 1726.70 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   215000/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 2.981E-05 | lm loss 1.498074E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.23 | backward: 1726.45 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  215000 to checkpoints/gpt2_750m_2/iter_0215000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0215000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 215000 | LM loss: 1.629059E+00 | LM PPL: 5.099074E+00
    ------------------------------------------------------------------------------------
     iteration   215100/  300000 | elapsed time per iteration (ms): 3110.3 | learning rate 2.975E-05 | lm loss 1.535164E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.19 | backward: 1725.79 | allreduce: 24.31 | optimizer: 55.77 | batch generator: 2.22 | data loader: 1.41
     iteration   215200/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 2.969E-05 | lm loss 1.528059E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.31 | backward: 1726.42 | allreduce: 24.46 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   215300/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 2.962E-05 | lm loss 1.510566E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.38 | backward: 1729.28 | allreduce: 27.24 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   215400/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 2.956E-05 | lm loss 1.523802E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.42 | backward: 1726.21 | allreduce: 24.14 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   215500/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 2.950E-05 | lm loss 1.509421E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.58 | backward: 1726.72 | allreduce: 24.24 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   215600/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 2.944E-05 | lm loss 1.541730E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.35 | backward: 1728.66 | allreduce: 26.74 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   215700/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 2.937E-05 | lm loss 1.518056E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1726.36 | allreduce: 24.37 | optimizer: 55.79 | batch generator: 0.46 | data loader: 0.04
     iteration   215800/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 2.931E-05 | lm loss 1.510328E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1726.72 | allreduce: 24.32 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   215900/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 2.925E-05 | lm loss 1.528475E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.55 | backward: 1729.43 | allreduce: 26.60 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   216000/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 2.919E-05 | lm loss 1.537195E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.48 | backward: 1726.87 | allreduce: 24.25 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 216000 | LM loss: 1.621067E+00 | LM PPL: 5.058486E+00
    ------------------------------------------------------------------------------------
     iteration   216100/  300000 | elapsed time per iteration (ms): 3063.6 | learning rate 2.913E-05 | lm loss 1.510169E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.45 | backward: 1726.99 | allreduce: 24.35 | optimizer: 55.78 | batch generator: 12.67 | data loader: 11.85
     iteration   216200/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 2.906E-05 | lm loss 1.543517E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1730.01 | allreduce: 27.58 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   216300/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 2.900E-05 | lm loss 1.522852E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.51 | backward: 1728.03 | allreduce: 25.38 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   216400/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.894E-05 | lm loss 1.512945E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1727.50 | allreduce: 25.07 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   216500/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 2.888E-05 | lm loss 1.519243E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.38 | backward: 1729.74 | allreduce: 27.55 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   216600/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 2.882E-05 | lm loss 1.513163E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.26 | backward: 1727.45 | allreduce: 25.35 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   216700/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 2.875E-05 | lm loss 1.508917E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.45 | backward: 1728.15 | allreduce: 25.47 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   216800/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 2.869E-05 | lm loss 1.527094E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1730.86 | allreduce: 28.04 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   216900/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.863E-05 | lm loss 1.505966E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1727.55 | allreduce: 25.00 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   217000/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 2.857E-05 | lm loss 1.511378E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1727.66 | allreduce: 24.92 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 217000 | LM loss: 1.606313E+00 | LM PPL: 4.984400E+00
    ------------------------------------------------------------------------------------
     iteration   217100/  300000 | elapsed time per iteration (ms): 3067.8 | learning rate 2.851E-05 | lm loss 1.528659E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1730.65 | allreduce: 27.89 | optimizer: 55.79 | batch generator: 13.10 | data loader: 12.28
     iteration   217200/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 2.845E-05 | lm loss 1.508725E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1727.54 | allreduce: 24.82 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   217300/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 2.838E-05 | lm loss 1.529051E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1728.42 | allreduce: 25.09 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   217400/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 2.832E-05 | lm loss 1.533648E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1727.95 | allreduce: 25.39 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   217500/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 2.826E-05 | lm loss 1.515663E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1729.97 | allreduce: 27.24 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   217600/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.820E-05 | lm loss 1.528604E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1727.57 | allreduce: 24.84 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   217700/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.814E-05 | lm loss 1.501621E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.24 | backward: 1727.26 | allreduce: 24.84 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   217800/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 2.808E-05 | lm loss 1.515119E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1728.93 | allreduce: 27.32 | optimizer: 54.66 | batch generator: 0.45 | data loader: 0.04
     iteration   217900/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 2.802E-05 | lm loss 1.511718E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.13 | backward: 1726.66 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   218000/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.796E-05 | lm loss 1.507814E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.11 | backward: 1726.89 | allreduce: 25.01 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 218000 | LM loss: 1.675240E+00 | LM PPL: 5.340075E+00
    ------------------------------------------------------------------------------------
     iteration   218100/  300000 | elapsed time per iteration (ms): 3066.9 | learning rate 2.789E-05 | lm loss 1.533660E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1729.92 | allreduce: 27.66 | optimizer: 55.78 | batch generator: 13.18 | data loader: 12.38
     iteration   218200/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 2.783E-05 | lm loss 1.508778E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1728.15 | allreduce: 25.27 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   218300/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 2.777E-05 | lm loss 1.527059E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1726.73 | allreduce: 24.72 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   218400/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 2.771E-05 | lm loss 1.510819E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1729.98 | allreduce: 27.36 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   218500/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.765E-05 | lm loss 1.489325E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.19 | backward: 1726.80 | allreduce: 24.91 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   218600/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 2.759E-05 | lm loss 1.524213E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.15 | backward: 1726.39 | allreduce: 24.62 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   218700/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 2.753E-05 | lm loss 1.495579E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1729.45 | allreduce: 27.31 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   218800/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 2.747E-05 | lm loss 1.529270E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1726.96 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   218900/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.741E-05 | lm loss 1.525555E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1726.74 | allreduce: 24.65 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   219000/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 2.735E-05 | lm loss 1.517031E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1729.08 | allreduce: 27.24 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 219000 | LM loss: 1.608241E+00 | LM PPL: 4.994020E+00
    ------------------------------------------------------------------------------------
     iteration   219100/  300000 | elapsed time per iteration (ms): 3053.3 | learning rate 2.729E-05 | lm loss 1.515677E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.50 | backward: 1727.78 | allreduce: 24.98 | optimizer: 55.78 | batch generator: 0.87 | data loader: 0.07
     iteration   219200/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 2.723E-05 | lm loss 1.505108E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1726.97 | allreduce: 25.00 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   219300/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 2.717E-05 | lm loss 1.527642E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1729.35 | allreduce: 27.31 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   219400/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 2.710E-05 | lm loss 1.526980E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.27 | backward: 1727.42 | allreduce: 25.02 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   219500/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 2.704E-05 | lm loss 1.514046E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1727.62 | allreduce: 24.97 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   219600/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 2.698E-05 | lm loss 1.512144E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.26 | backward: 1729.54 | allreduce: 27.44 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   219700/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.692E-05 | lm loss 1.515969E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1727.29 | allreduce: 24.96 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   219800/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 2.686E-05 | lm loss 1.515663E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1726.76 | allreduce: 24.49 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   219900/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 2.680E-05 | lm loss 1.516317E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1729.61 | allreduce: 26.96 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   220000/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 2.674E-05 | lm loss 1.498490E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1726.89 | allreduce: 24.45 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  220000 to checkpoints/gpt2_750m_2/iter_0220000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0220000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 220000 | LM loss: 1.628286E+00 | LM PPL: 5.095136E+00
    ------------------------------------------------------------------------------------
     iteration   220100/  300000 | elapsed time per iteration (ms): 3131.7 | learning rate 2.668E-05 | lm loss 1.541458E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.88 | backward: 1728.34 | allreduce: 26.98 | optimizer: 55.78 | batch generator: 26.74 | data loader: 25.93
     iteration   220200/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 2.662E-05 | lm loss 1.526868E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1726.84 | allreduce: 24.52 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   220300/  300000 | elapsed time per iteration (ms): 2449.8 | learning rate 2.656E-05 | lm loss 1.510049E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1726.07 | allreduce: 24.35 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   220400/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 2.650E-05 | lm loss 1.522488E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.27 | backward: 1729.19 | allreduce: 26.98 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   220500/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 2.644E-05 | lm loss 1.502921E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1727.41 | allreduce: 25.02 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   220600/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 2.638E-05 | lm loss 1.518931E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.20 | backward: 1726.72 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   220700/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 2.632E-05 | lm loss 1.511769E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.15 | backward: 1729.21 | allreduce: 27.44 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   220800/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 2.627E-05 | lm loss 1.517776E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1727.15 | allreduce: 25.06 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   220900/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 2.621E-05 | lm loss 1.531566E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1727.25 | allreduce: 24.96 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   221000/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 2.615E-05 | lm loss 1.498951E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.35 | backward: 1729.69 | allreduce: 27.42 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 221000 | LM loss: 1.628059E+00 | LM PPL: 5.093980E+00
    ------------------------------------------------------------------------------------
     iteration   221100/  300000 | elapsed time per iteration (ms): 3054.5 | learning rate 2.609E-05 | lm loss 1.517538E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1726.88 | allreduce: 25.03 | optimizer: 55.78 | batch generator: 3.55 | data loader: 2.74
     iteration   221200/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 2.603E-05 | lm loss 1.521606E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1727.01 | allreduce: 24.96 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   221300/  300000 | elapsed time per iteration (ms): 2454.7 | learning rate 2.597E-05 | lm loss 1.509824E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1730.18 | allreduce: 27.50 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   221400/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 2.591E-05 | lm loss 1.544065E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1727.01 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   221500/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.585E-05 | lm loss 1.526325E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1727.53 | allreduce: 24.90 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   221600/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 2.579E-05 | lm loss 1.520059E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.14 | backward: 1729.49 | allreduce: 27.40 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   221700/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 2.573E-05 | lm loss 1.515897E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1726.38 | allreduce: 24.87 | optimizer: 54.66 | batch generator: 0.46 | data loader: 0.04
     iteration   221800/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 2.567E-05 | lm loss 1.520422E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.61 | backward: 1727.95 | allreduce: 25.00 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   221900/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 2.561E-05 | lm loss 1.506706E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1729.24 | allreduce: 26.91 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   222000/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 2.555E-05 | lm loss 1.501400E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.25 | backward: 1725.42 | allreduce: 23.39 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 222000 | LM loss: 1.644852E+00 | LM PPL: 5.180241E+00
    ------------------------------------------------------------------------------------
     iteration   222100/  300000 | elapsed time per iteration (ms): 3066.7 | learning rate 2.549E-05 | lm loss 1.508197E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1727.04 | allreduce: 24.79 | optimizer: 55.78 | batch generator: 15.27 | data loader: 14.45
     iteration   222200/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 2.544E-05 | lm loss 1.497695E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.52 | backward: 1730.00 | allreduce: 27.36 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   222300/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.538E-05 | lm loss 1.525571E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.38 | backward: 1727.17 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   222400/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 2.532E-05 | lm loss 1.498840E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.47 | backward: 1727.19 | allreduce: 24.45 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   222500/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 2.526E-05 | lm loss 1.507364E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1729.08 | allreduce: 26.98 | optimizer: 55.78 | batch generator: 0.47 | data loader: 0.04
     iteration   222600/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 2.520E-05 | lm loss 1.511760E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.18 | allreduce: 24.64 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   222700/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 2.514E-05 | lm loss 1.490275E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1726.69 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   222800/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 2.508E-05 | lm loss 1.512012E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1730.01 | allreduce: 27.53 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   222900/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 2.502E-05 | lm loss 1.496440E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1727.13 | allreduce: 24.76 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   223000/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 2.497E-05 | lm loss 1.503248E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.26 | backward: 1726.97 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 223000 | LM loss: 1.631512E+00 | LM PPL: 5.111600E+00
    ------------------------------------------------------------------------------------
     iteration   223100/  300000 | elapsed time per iteration (ms): 3091.3 | learning rate 2.491E-05 | lm loss 1.497678E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.15 | backward: 1729.46 | allreduce: 27.43 | optimizer: 55.78 | batch generator: 37.96 | data loader: 37.14
     iteration   223200/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 2.485E-05 | lm loss 1.511714E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1727.05 | allreduce: 24.70 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   223300/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.479E-05 | lm loss 1.514535E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1727.27 | allreduce: 24.73 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   223400/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 2.473E-05 | lm loss 1.493242E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.45 | backward: 1730.06 | allreduce: 27.26 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   223500/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 2.467E-05 | lm loss 1.486661E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1727.71 | allreduce: 24.71 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   223600/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 2.462E-05 | lm loss 1.508246E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1727.34 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   223700/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 2.456E-05 | lm loss 1.520789E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1729.14 | allreduce: 27.19 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   223800/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 2.450E-05 | lm loss 1.517294E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1727.26 | allreduce: 24.69 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   223900/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.444E-05 | lm loss 1.502489E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.27 | backward: 1727.30 | allreduce: 24.88 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   224000/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 2.438E-05 | lm loss 1.506926E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1729.35 | allreduce: 27.18 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 224000 | LM loss: 1.625826E+00 | LM PPL: 5.082614E+00
    ------------------------------------------------------------------------------------
     iteration   224100/  300000 | elapsed time per iteration (ms): 3064.7 | learning rate 2.433E-05 | lm loss 1.495735E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1727.24 | allreduce: 24.74 | optimizer: 55.77 | batch generator: 14.09 | data loader: 13.28
     iteration   224200/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 2.427E-05 | lm loss 1.511252E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1727.17 | allreduce: 24.86 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   224300/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 2.421E-05 | lm loss 1.528444E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1729.99 | allreduce: 27.55 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   224400/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 2.415E-05 | lm loss 1.493343E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1728.07 | allreduce: 25.06 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   224500/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.410E-05 | lm loss 1.508666E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1727.29 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   224600/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 2.404E-05 | lm loss 1.536029E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1729.36 | allreduce: 27.33 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   224700/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 2.398E-05 | lm loss 1.537360E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.38 | backward: 1727.32 | allreduce: 24.90 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   224800/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.392E-05 | lm loss 1.520676E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1727.22 | allreduce: 24.96 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   224900/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 2.387E-05 | lm loss 1.494254E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.59 | backward: 1730.35 | allreduce: 27.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   225000/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 2.381E-05 | lm loss 1.501431E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.81 | backward: 1728.61 | allreduce: 25.03 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  225000 to checkpoints/gpt2_750m_2/iter_0225000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0225000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 225000 | LM loss: 1.639601E+00 | LM PPL: 5.153113E+00
    ------------------------------------------------------------------------------------
     iteration   225100/  300000 | elapsed time per iteration (ms): 3128.7 | learning rate 2.375E-05 | lm loss 1.492541E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.73 | backward: 1728.01 | allreduce: 24.77 | optimizer: 55.78 | batch generator: 22.15 | data loader: 21.36
     iteration   225200/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 2.369E-05 | lm loss 1.498843E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.47 | backward: 1729.96 | allreduce: 27.33 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   225300/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 2.364E-05 | lm loss 1.532713E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.59 | backward: 1727.75 | allreduce: 24.82 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   225400/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 2.358E-05 | lm loss 1.515958E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.51 | backward: 1727.25 | allreduce: 24.66 | optimizer: 55.79 | batch generator: 0.46 | data loader: 0.04
     iteration   225500/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 2.352E-05 | lm loss 1.507113E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.62 | backward: 1729.95 | allreduce: 27.23 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   225600/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 2.346E-05 | lm loss 1.489240E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.60 | backward: 1727.94 | allreduce: 24.75 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   225700/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 2.341E-05 | lm loss 1.498587E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.57 | backward: 1727.76 | allreduce: 24.65 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   225800/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 2.335E-05 | lm loss 1.486849E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1730.43 | allreduce: 27.72 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   225900/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 2.329E-05 | lm loss 1.509588E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1727.80 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   226000/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 2.324E-05 | lm loss 1.533536E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.66 | backward: 1728.19 | allreduce: 24.83 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 226000 | LM loss: 1.610039E+00 | LM PPL: 5.003006E+00
    ------------------------------------------------------------------------------------
     iteration   226100/  300000 | elapsed time per iteration (ms): 3056.2 | learning rate 2.318E-05 | lm loss 1.523215E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1730.53 | allreduce: 27.43 | optimizer: 55.77 | batch generator: 0.85 | data loader: 0.07
     iteration   226200/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 2.312E-05 | lm loss 1.478531E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1727.49 | allreduce: 24.65 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   226300/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.307E-05 | lm loss 1.535393E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1727.14 | allreduce: 24.64 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   226400/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 2.301E-05 | lm loss 1.494581E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1730.38 | allreduce: 27.21 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   226500/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 2.295E-05 | lm loss 1.492953E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.51 | backward: 1727.83 | allreduce: 24.92 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   226600/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 2.290E-05 | lm loss 1.510834E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1727.64 | allreduce: 24.90 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   226700/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 2.284E-05 | lm loss 1.503930E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.71 | backward: 1730.21 | allreduce: 27.40 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   226800/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 2.279E-05 | lm loss 1.530680E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.69 | backward: 1727.83 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   226900/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 2.273E-05 | lm loss 1.520904E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.77 | backward: 1728.27 | allreduce: 24.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   227000/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 2.267E-05 | lm loss 1.499350E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.63 | backward: 1730.25 | allreduce: 27.30 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 227000 | LM loss: 1.637368E+00 | LM PPL: 5.141619E+00
    ------------------------------------------------------------------------------------
     iteration   227100/  300000 | elapsed time per iteration (ms): 3056.5 | learning rate 2.262E-05 | lm loss 1.513777E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.66 | backward: 1727.72 | allreduce: 24.73 | optimizer: 55.78 | batch generator: 4.24 | data loader: 3.45
     iteration   227200/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.256E-05 | lm loss 1.519811E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.38 | backward: 1727.17 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   227300/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 2.250E-05 | lm loss 1.508063E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1730.06 | allreduce: 27.28 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   227400/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 2.245E-05 | lm loss 1.490227E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.65 | backward: 1727.64 | allreduce: 24.62 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   227500/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 2.239E-05 | lm loss 1.540402E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.68 | backward: 1727.99 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   227600/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 2.234E-05 | lm loss 1.509412E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.64 | backward: 1730.28 | allreduce: 27.30 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   227700/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 2.228E-05 | lm loss 1.508167E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.66 | backward: 1727.96 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   227800/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 2.222E-05 | lm loss 1.516658E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.61 | backward: 1728.16 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   227900/  300000 | elapsed time per iteration (ms): 2455.7 | learning rate 2.217E-05 | lm loss 1.507032E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.70 | backward: 1731.02 | allreduce: 27.43 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   228000/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 2.211E-05 | lm loss 1.507399E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.78 | backward: 1728.46 | allreduce: 24.75 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 228000 | LM loss: 1.631548E+00 | LM PPL: 5.111784E+00
    ------------------------------------------------------------------------------------
     iteration   228100/  300000 | elapsed time per iteration (ms): 3071.8 | learning rate 2.206E-05 | lm loss 1.518506E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.68 | backward: 1728.23 | allreduce: 24.88 | optimizer: 55.78 | batch generator: 19.24 | data loader: 18.44
     iteration   228200/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 2.200E-05 | lm loss 1.515052E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1729.88 | allreduce: 26.82 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   228300/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 2.195E-05 | lm loss 1.517343E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.62 | backward: 1728.38 | allreduce: 24.94 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   228400/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 2.189E-05 | lm loss 1.526321E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1727.39 | allreduce: 24.81 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   228500/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 2.184E-05 | lm loss 1.510362E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.43 | backward: 1729.70 | allreduce: 27.51 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   228600/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 2.178E-05 | lm loss 1.534620E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1727.96 | allreduce: 24.95 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   228700/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 2.172E-05 | lm loss 1.512917E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.66 | backward: 1728.13 | allreduce: 24.75 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   228800/  300000 | elapsed time per iteration (ms): 2455.5 | learning rate 2.167E-05 | lm loss 1.521705E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.74 | backward: 1730.82 | allreduce: 27.25 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   228900/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 2.161E-05 | lm loss 1.506557E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.51 | backward: 1727.78 | allreduce: 24.80 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   229000/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 2.156E-05 | lm loss 1.505893E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1727.76 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 229000 | LM loss: 1.635488E+00 | LM PPL: 5.131962E+00
    ------------------------------------------------------------------------------------
     iteration   229100/  300000 | elapsed time per iteration (ms): 3070.7 | learning rate 2.150E-05 | lm loss 1.524373E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.59 | backward: 1730.77 | allreduce: 27.50 | optimizer: 55.77 | batch generator: 14.84 | data loader: 14.04
     iteration   229200/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 2.145E-05 | lm loss 1.519162E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.60 | backward: 1728.04 | allreduce: 25.02 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   229300/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.139E-05 | lm loss 1.506056E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.70 | backward: 1727.24 | allreduce: 23.64 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   229400/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 2.134E-05 | lm loss 1.506982E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1730.41 | allreduce: 27.27 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   229500/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 2.129E-05 | lm loss 1.492355E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.87 | backward: 1727.87 | allreduce: 24.76 | optimizer: 55.22 | batch generator: 0.47 | data loader: 0.04
     iteration   229600/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 2.123E-05 | lm loss 1.519596E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.75 | backward: 1728.10 | allreduce: 24.58 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   229700/  300000 | elapsed time per iteration (ms): 2455.0 | learning rate 2.118E-05 | lm loss 1.530549E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.78 | backward: 1730.23 | allreduce: 26.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   229800/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 2.112E-05 | lm loss 1.480091E+00 | loss scale 524288.0 |
    time (ms) | forward: 670.24 | backward: 1728.08 | allreduce: 24.58 | optimizer: 55.77 | batch generator: 1.86 | data loader: 1.44
     iteration   229900/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 2.107E-05 | lm loss 1.486572E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.67 | backward: 1728.32 | allreduce: 25.07 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   230000/  300000 | elapsed time per iteration (ms): 2455.5 | learning rate 2.101E-05 | lm loss 1.532428E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.75 | backward: 1730.77 | allreduce: 27.60 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  230000 to checkpoints/gpt2_750m_2/iter_0230000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0230000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 230000 | LM loss: 1.588412E+00 | LM PPL: 4.895970E+00
    ------------------------------------------------------------------------------------
     iteration   230100/  300000 | elapsed time per iteration (ms): 3120.9 | learning rate 2.096E-05 | lm loss 1.561256E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.69 | backward: 1728.17 | allreduce: 25.00 | optimizer: 55.77 | batch generator: 12.30 | data loader: 11.50
     iteration   230200/  300000 | elapsed time per iteration (ms): 2455.5 | learning rate 2.090E-05 | lm loss 1.536294E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.70 | backward: 1730.89 | allreduce: 27.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   230300/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 2.085E-05 | lm loss 1.588477E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.79 | backward: 1729.08 | allreduce: 25.47 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   230400/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.079E-05 | lm loss 1.554996E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.76 | backward: 1727.19 | allreduce: 23.86 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   230500/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 2.074E-05 | lm loss 1.533681E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.70 | backward: 1729.17 | allreduce: 25.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   230600/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.069E-05 | lm loss 1.545378E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.69 | backward: 1727.23 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   230700/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 2.063E-05 | lm loss 1.532331E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.71 | backward: 1727.53 | allreduce: 23.79 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   230800/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 2.058E-05 | lm loss 1.561074E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.51 | backward: 1730.59 | allreduce: 27.50 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   230900/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 2.052E-05 | lm loss 1.564773E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.79 | backward: 1729.00 | allreduce: 25.14 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   231000/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 2.047E-05 | lm loss 1.533759E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.68 | backward: 1728.54 | allreduce: 25.05 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 231000 | LM loss: 1.632425E+00 | LM PPL: 5.116268E+00
    ------------------------------------------------------------------------------------
     iteration   231100/  300000 | elapsed time per iteration (ms): 3058.9 | learning rate 2.042E-05 | lm loss 1.568414E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.66 | backward: 1730.99 | allreduce: 27.51 | optimizer: 55.77 | batch generator: 2.90 | data loader: 2.11
     iteration   231200/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 2.036E-05 | lm loss 1.550669E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.82 | backward: 1728.83 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   231300/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 2.031E-05 | lm loss 1.557910E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.90 | backward: 1728.31 | allreduce: 24.26 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   231400/  300000 | elapsed time per iteration (ms): 2455.9 | learning rate 2.025E-05 | lm loss 1.537639E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.86 | backward: 1731.12 | allreduce: 27.16 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   231500/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 2.020E-05 | lm loss 1.530409E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.78 | backward: 1725.98 | allreduce: 23.37 | optimizer: 54.66 | batch generator: 0.45 | data loader: 0.04
     iteration   231600/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 2.015E-05 | lm loss 1.566128E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.71 | backward: 1726.96 | allreduce: 24.05 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   231700/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 2.010E-05 | lm loss 1.535590E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1729.56 | allreduce: 26.80 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   231800/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.004E-05 | lm loss 1.541427E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.52 | backward: 1726.98 | allreduce: 24.30 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   231900/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 1.999E-05 | lm loss 1.561528E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1727.05 | allreduce: 24.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   232000/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 1.993E-05 | lm loss 1.570369E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.51 | backward: 1730.47 | allreduce: 27.51 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 232000 | LM loss: 1.630395E+00 | LM PPL: 5.105891E+00
    ------------------------------------------------------------------------------------
     iteration   232100/  300000 | elapsed time per iteration (ms): 3067.4 | learning rate 1.988E-05 | lm loss 1.551206E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.76 | backward: 1728.15 | allreduce: 24.84 | optimizer: 55.77 | batch generator: 14.77 | data loader: 13.98
     iteration   232200/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 1.983E-05 | lm loss 1.563730E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.83 | backward: 1728.92 | allreduce: 25.16 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   232300/  300000 | elapsed time per iteration (ms): 2455.6 | learning rate 1.978E-05 | lm loss 1.553304E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.78 | backward: 1730.83 | allreduce: 27.47 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   232400/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 1.972E-05 | lm loss 1.535039E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.69 | backward: 1728.29 | allreduce: 25.07 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   232500/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 1.967E-05 | lm loss 1.537703E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.73 | backward: 1728.51 | allreduce: 25.05 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   232600/  300000 | elapsed time per iteration (ms): 2455.0 | learning rate 1.962E-05 | lm loss 1.572172E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1730.47 | allreduce: 27.27 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   232700/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 1.956E-05 | lm loss 1.556194E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1726.82 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   232800/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 1.951E-05 | lm loss 1.546939E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.80 | backward: 1727.29 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   232900/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 1.946E-05 | lm loss 1.548309E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.66 | backward: 1729.28 | allreduce: 25.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   233000/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 1.940E-05 | lm loss 1.564272E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.74 | backward: 1727.10 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 233000 | LM loss: 1.611275E+00 | LM PPL: 5.009196E+00
    ------------------------------------------------------------------------------------
     iteration   233100/  300000 | elapsed time per iteration (ms): 3068.3 | learning rate 1.935E-05 | lm loss 1.557346E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1726.84 | allreduce: 23.37 | optimizer: 55.78 | batch generator: 17.34 | data loader: 16.55
     iteration   233200/  300000 | elapsed time per iteration (ms): 2455.4 | learning rate 1.930E-05 | lm loss 1.557696E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.73 | backward: 1730.77 | allreduce: 26.99 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   233300/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 1.925E-05 | lm loss 1.542242E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.92 | backward: 1729.30 | allreduce: 25.15 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   233400/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 1.919E-05 | lm loss 1.571437E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.62 | backward: 1728.80 | allreduce: 25.46 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   233500/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 1.914E-05 | lm loss 1.555129E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.63 | backward: 1730.23 | allreduce: 27.38 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   233600/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 1.909E-05 | lm loss 1.550546E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.68 | backward: 1727.87 | allreduce: 24.94 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   233700/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 1.904E-05 | lm loss 1.530219E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.81 | backward: 1728.48 | allreduce: 25.01 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   233800/  300000 | elapsed time per iteration (ms): 2455.5 | learning rate 1.899E-05 | lm loss 1.559535E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.70 | backward: 1730.80 | allreduce: 27.55 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   233900/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 1.893E-05 | lm loss 1.547995E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.50 | backward: 1727.10 | allreduce: 24.48 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   234000/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 1.888E-05 | lm loss 1.539597E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1726.20 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 234000 | LM loss: 1.590245E+00 | LM PPL: 4.904950E+00
    ------------------------------------------------------------------------------------
     iteration   234100/  300000 | elapsed time per iteration (ms): 3056.5 | learning rate 1.883E-05 | lm loss 1.555919E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.61 | backward: 1727.79 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 3.84 | data loader: 3.04
     iteration   234200/  300000 | elapsed time per iteration (ms): 2456.0 | learning rate 1.878E-05 | lm loss 1.583530E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.88 | backward: 1731.15 | allreduce: 27.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   234300/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 1.873E-05 | lm loss 1.554971E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.77 | backward: 1728.35 | allreduce: 24.91 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   234400/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 1.867E-05 | lm loss 1.549883E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.68 | backward: 1727.85 | allreduce: 24.54 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   234500/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 1.862E-05 | lm loss 1.539758E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.59 | backward: 1730.04 | allreduce: 26.98 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   234600/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 1.857E-05 | lm loss 1.561821E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.63 | backward: 1727.78 | allreduce: 24.50 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   234700/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 1.852E-05 | lm loss 1.532299E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.89 | backward: 1728.44 | allreduce: 24.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   234800/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 1.847E-05 | lm loss 1.542841E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.77 | backward: 1730.32 | allreduce: 26.60 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   234900/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 1.842E-05 | lm loss 1.532066E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.69 | backward: 1728.72 | allreduce: 24.87 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   235000/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 1.836E-05 | lm loss 1.532307E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.69 | backward: 1728.29 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  235000 to checkpoints/gpt2_750m_2/iter_0235000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0235000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 235000 | LM loss: 1.605017E+00 | LM PPL: 4.977942E+00
    ------------------------------------------------------------------------------------
     iteration   235100/  300000 | elapsed time per iteration (ms): 3110.3 | learning rate 1.831E-05 | lm loss 1.558526E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1726.39 | allreduce: 24.66 | optimizer: 54.66 | batch generator: 0.87 | data loader: 0.07
     iteration   235200/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 1.826E-05 | lm loss 1.552702E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.60 | backward: 1727.58 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   235300/  300000 | elapsed time per iteration (ms): 2455.2 | learning rate 1.821E-05 | lm loss 1.553844E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.74 | backward: 1730.51 | allreduce: 27.32 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   235400/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 1.816E-05 | lm loss 1.528173E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.85 | backward: 1728.35 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   235500/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 1.811E-05 | lm loss 1.547773E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.69 | backward: 1727.95 | allreduce: 24.75 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   235600/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 1.806E-05 | lm loss 1.530139E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.60 | backward: 1730.29 | allreduce: 27.20 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   235700/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 1.801E-05 | lm loss 1.548308E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.69 | backward: 1727.95 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   235800/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 1.795E-05 | lm loss 1.536690E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.86 | backward: 1728.39 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   235900/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 1.790E-05 | lm loss 1.540810E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.68 | backward: 1729.91 | allreduce: 26.74 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   236000/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 1.785E-05 | lm loss 1.535663E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.97 | backward: 1728.28 | allreduce: 24.33 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 236000 | LM loss: 1.616296E+00 | LM PPL: 5.034407E+00
    ------------------------------------------------------------------------------------
     iteration   236100/  300000 | elapsed time per iteration (ms): 3070.0 | learning rate 1.780E-05 | lm loss 1.547656E+00 | loss scale 1048576.0 |
    time (ms) | forward: 669.03 | backward: 1728.65 | allreduce: 24.34 | optimizer: 55.76 | batch generator: 16.30 | data loader: 15.51
     iteration   236200/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 1.775E-05 | lm loss 1.563708E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.72 | backward: 1730.67 | allreduce: 27.58 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   236300/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 1.770E-05 | lm loss 1.579942E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1728.17 | allreduce: 24.96 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   236400/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 1.765E-05 | lm loss 1.534102E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.58 | backward: 1728.42 | allreduce: 25.02 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   236500/  300000 | elapsed time per iteration (ms): 2455.3 | learning rate 1.760E-05 | lm loss 1.550711E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1730.77 | allreduce: 27.48 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   236600/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 1.755E-05 | lm loss 1.552955E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.67 | backward: 1728.47 | allreduce: 24.91 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   236700/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 1.750E-05 | lm loss 1.542576E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.71 | backward: 1728.71 | allreduce: 25.01 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   236800/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 1.745E-05 | lm loss 1.535740E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.54 | backward: 1730.11 | allreduce: 26.94 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   236900/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 1.740E-05 | lm loss 1.562806E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.48 | backward: 1727.86 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   237000/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 1.735E-05 | lm loss 1.552116E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.75 | backward: 1727.11 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 237000 | LM loss: 1.636918E+00 | LM PPL: 5.139304E+00
    ------------------------------------------------------------------------------------
     iteration   237100/  300000 | elapsed time per iteration (ms): 3058.0 | learning rate 1.730E-05 | lm loss 1.541470E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.75 | backward: 1731.16 | allreduce: 27.43 | optimizer: 55.77 | batch generator: 2.07 | data loader: 1.28
     iteration   237200/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 1.725E-05 | lm loss 1.566491E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1727.20 | allreduce: 24.98 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   237300/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 1.720E-05 | lm loss 1.553504E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.66 | backward: 1728.66 | allreduce: 25.07 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   237400/  300000 | elapsed time per iteration (ms): 2455.5 | learning rate 1.715E-05 | lm loss 1.551279E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.60 | backward: 1730.94 | allreduce: 27.54 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   237500/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 1.710E-05 | lm loss 1.548428E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1726.94 | allreduce: 23.86 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   237600/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 1.705E-05 | lm loss 1.539141E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.52 | backward: 1728.14 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   237700/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 1.700E-05 | lm loss 1.530884E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.64 | backward: 1729.79 | allreduce: 25.98 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   237800/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 1.695E-05 | lm loss 1.546897E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.54 | backward: 1726.70 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   237900/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 1.690E-05 | lm loss 1.561017E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.74 | backward: 1727.27 | allreduce: 23.39 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   238000/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 1.685E-05 | lm loss 1.543475E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1729.36 | allreduce: 25.91 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 238000 | LM loss: 1.624546E+00 | LM PPL: 5.076116E+00
    ------------------------------------------------------------------------------------
     iteration   238100/  300000 | elapsed time per iteration (ms): 3057.5 | learning rate 1.680E-05 | lm loss 1.557019E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1726.46 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 6.52 | data loader: 5.73
     iteration   238200/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 1.675E-05 | lm loss 1.536175E+00 | loss scale 2097152.0 |
    time (ms) | forward: 668.40 | backward: 1726.25 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   238300/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 1.670E-05 | lm loss 1.556266E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.45 | backward: 1727.90 | allreduce: 25.92 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   238400/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 1.665E-05 | lm loss 1.550167E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1725.54 | allreduce: 23.37 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   238500/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 1.660E-05 | lm loss 1.533115E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.71 | backward: 1726.81 | allreduce: 23.37 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   238600/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 1.655E-05 | lm loss 1.551071E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.50 | backward: 1728.80 | allreduce: 25.90 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   238700/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 1.651E-05 | lm loss 1.536210E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1726.76 | allreduce: 24.34 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   238800/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 1.646E-05 | lm loss 1.544743E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1727.17 | allreduce: 24.29 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   238900/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 1.641E-05 | lm loss 1.552188E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.50 | backward: 1729.69 | allreduce: 26.74 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   239000/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 1.636E-05 | lm loss 1.576162E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.55 | backward: 1727.32 | allreduce: 24.15 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 239000 | LM loss: 1.637886E+00 | LM PPL: 5.144283E+00
    ------------------------------------------------------------------------------------
     iteration   239100/  300000 | elapsed time per iteration (ms): 3054.2 | learning rate 1.631E-05 | lm loss 1.505616E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1727.60 | allreduce: 24.67 | optimizer: 55.77 | batch generator: 2.39 | data loader: 1.60
     iteration   239200/  300000 | elapsed time per iteration (ms): 2455.2 | learning rate 1.626E-05 | lm loss 1.540067E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.65 | backward: 1730.61 | allreduce: 27.45 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   239300/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 1.621E-05 | lm loss 1.522368E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.51 | backward: 1727.62 | allreduce: 24.82 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   239400/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 1.616E-05 | lm loss 1.559929E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1727.11 | allreduce: 24.90 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   239500/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 1.612E-05 | lm loss 1.549392E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1729.84 | allreduce: 27.37 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   239600/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 1.607E-05 | lm loss 1.535242E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.67 | backward: 1728.25 | allreduce: 24.90 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   239700/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 1.602E-05 | lm loss 1.559341E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1727.65 | allreduce: 24.88 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   239800/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 1.597E-05 | lm loss 1.533406E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1730.07 | allreduce: 27.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   239900/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 1.592E-05 | lm loss 1.559025E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.58 | backward: 1728.00 | allreduce: 24.85 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   240000/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 1.587E-05 | lm loss 1.568201E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1727.73 | allreduce: 24.92 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  240000 to checkpoints/gpt2_750m_2/iter_0240000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0240000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 240000 | LM loss: 1.608402E+00 | LM PPL: 4.994821E+00
    ------------------------------------------------------------------------------------
     iteration   240100/  300000 | elapsed time per iteration (ms): 3118.3 | learning rate 1.582E-05 | lm loss 1.536667E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1729.58 | allreduce: 27.27 | optimizer: 55.77 | batch generator: 10.25 | data loader: 9.45
     iteration   240200/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 1.578E-05 | lm loss 1.547506E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.45 | backward: 1727.56 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   240300/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 1.573E-05 | lm loss 1.545614E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.95 | allreduce: 24.59 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   240400/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 1.568E-05 | lm loss 1.559834E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1729.81 | allreduce: 27.33 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   240500/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 1.563E-05 | lm loss 1.551004E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.43 | backward: 1727.66 | allreduce: 24.85 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   240600/  300000 | elapsed time per iteration (ms): 2452.5 | learning rate 1.558E-05 | lm loss 1.558626E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1728.02 | allreduce: 24.67 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   240700/  300000 | elapsed time per iteration (ms): 2455.3 | learning rate 1.554E-05 | lm loss 1.550413E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.59 | backward: 1730.75 | allreduce: 27.32 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   240800/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 1.549E-05 | lm loss 1.582545E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1726.81 | allreduce: 24.82 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   240900/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 1.544E-05 | lm loss 1.557434E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1726.77 | allreduce: 24.83 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   241000/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 1.539E-05 | lm loss 1.554935E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1730.12 | allreduce: 27.45 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 241000 | LM loss: 1.624024E+00 | LM PPL: 5.073464E+00
    ------------------------------------------------------------------------------------
     iteration   241100/  300000 | elapsed time per iteration (ms): 3061.5 | learning rate 1.535E-05 | lm loss 1.545367E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1727.44 | allreduce: 24.84 | optimizer: 55.77 | batch generator: 10.05 | data loader: 9.25
     iteration   241200/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 1.530E-05 | lm loss 1.534755E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1727.55 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   241300/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 1.525E-05 | lm loss 1.566674E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1729.75 | allreduce: 27.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   241400/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 1.520E-05 | lm loss 1.554541E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1727.32 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   241500/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 1.516E-05 | lm loss 1.531032E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.57 | backward: 1727.62 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   241600/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 1.511E-05 | lm loss 1.544676E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.59 | backward: 1730.23 | allreduce: 27.25 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   241700/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 1.506E-05 | lm loss 1.573555E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.63 | backward: 1727.99 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   241800/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 1.501E-05 | lm loss 1.550205E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.43 | backward: 1727.64 | allreduce: 24.82 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   241900/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 1.497E-05 | lm loss 1.533982E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1729.70 | allreduce: 27.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   242000/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 1.492E-05 | lm loss 1.562572E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1727.88 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 242000 | LM loss: 1.610470E+00 | LM PPL: 5.005164E+00
    ------------------------------------------------------------------------------------
     iteration   242100/  300000 | elapsed time per iteration (ms): 3054.5 | learning rate 1.487E-05 | lm loss 1.527349E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.62 | backward: 1728.35 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 2.01 | data loader: 1.24
     iteration   242200/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 1.483E-05 | lm loss 1.539886E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1729.80 | allreduce: 26.65 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   242300/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 1.478E-05 | lm loss 1.513954E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.56 | backward: 1726.70 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   242400/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 1.473E-05 | lm loss 1.537966E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1726.99 | allreduce: 24.15 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   242500/  300000 | elapsed time per iteration (ms): 2455.4 | learning rate 1.469E-05 | lm loss 1.542570E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.58 | backward: 1730.89 | allreduce: 27.40 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   242600/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 1.464E-05 | lm loss 1.537859E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1728.63 | allreduce: 25.07 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   242700/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 1.459E-05 | lm loss 1.547831E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.62 | backward: 1728.46 | allreduce: 24.94 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   242800/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 1.455E-05 | lm loss 1.562097E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.48 | backward: 1730.62 | allreduce: 27.45 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   242900/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 1.450E-05 | lm loss 1.522124E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1726.90 | allreduce: 25.08 | optimizer: 54.65 | batch generator: 0.46 | data loader: 0.04
     iteration   243000/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 1.445E-05 | lm loss 1.552555E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.72 | backward: 1728.71 | allreduce: 25.04 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 243000 | LM loss: 1.612976E+00 | LM PPL: 5.017723E+00
    ------------------------------------------------------------------------------------
     iteration   243100/  300000 | elapsed time per iteration (ms): 3096.9 | learning rate 1.441E-05 | lm loss 1.529106E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1728.86 | allreduce: 25.91 | optimizer: 55.77 | batch generator: 44.09 | data loader: 43.29
     iteration   243200/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 1.436E-05 | lm loss 1.540014E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.80 | backward: 1727.42 | allreduce: 23.39 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   243300/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 1.432E-05 | lm loss 1.542522E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.64 | backward: 1727.02 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   243400/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 1.427E-05 | lm loss 1.557606E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.52 | backward: 1730.08 | allreduce: 26.73 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   243500/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 1.422E-05 | lm loss 1.525952E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.57 | backward: 1727.60 | allreduce: 24.87 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   243600/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 1.418E-05 | lm loss 1.558995E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.60 | backward: 1728.00 | allreduce: 24.88 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   243700/  300000 | elapsed time per iteration (ms): 2456.0 | learning rate 1.413E-05 | lm loss 1.541577E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.86 | backward: 1731.15 | allreduce: 27.41 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   243800/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 1.409E-05 | lm loss 1.542923E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.55 | backward: 1727.77 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   243900/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 1.404E-05 | lm loss 1.555322E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.50 | backward: 1727.86 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   244000/  300000 | elapsed time per iteration (ms): 2454.6 | learning rate 1.399E-05 | lm loss 1.550499E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1730.24 | allreduce: 27.50 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 244000 | LM loss: 1.615975E+00 | LM PPL: 5.032791E+00
    ------------------------------------------------------------------------------------
     iteration   244100/  300000 | elapsed time per iteration (ms): 3055.6 | learning rate 1.395E-05 | lm loss 1.529395E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.48 | backward: 1726.65 | allreduce: 23.96 | optimizer: 55.78 | batch generator: 4.14 | data loader: 3.35
     iteration   244200/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 1.390E-05 | lm loss 1.561104E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1725.95 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   244300/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 1.386E-05 | lm loss 1.549369E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1728.45 | allreduce: 25.88 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   244400/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 1.381E-05 | lm loss 1.550698E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.69 | backward: 1727.67 | allreduce: 24.07 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   244500/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 1.377E-05 | lm loss 1.538578E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.71 | backward: 1728.32 | allreduce: 24.93 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   244600/  300000 | elapsed time per iteration (ms): 2455.2 | learning rate 1.372E-05 | lm loss 1.539020E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1730.72 | allreduce: 27.69 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   244700/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 1.368E-05 | lm loss 1.550198E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.57 | backward: 1728.34 | allreduce: 24.94 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   244800/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 1.363E-05 | lm loss 1.545435E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.62 | backward: 1726.44 | allreduce: 24.06 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   244900/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 1.359E-05 | lm loss 1.522588E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1729.47 | allreduce: 27.35 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   245000/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.354E-05 | lm loss 1.554808E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.18 | backward: 1726.93 | allreduce: 24.81 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  245000 to checkpoints/gpt2_750m_2/iter_0245000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0245000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 245000 | LM loss: 1.610264E+00 | LM PPL: 5.004133E+00
    ------------------------------------------------------------------------------------
     iteration   245100/  300000 | elapsed time per iteration (ms): 3107.5 | learning rate 1.350E-05 | lm loss 1.526389E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1729.21 | allreduce: 27.44 | optimizer: 55.77 | batch generator: 0.86 | data loader: 0.07
     iteration   245200/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.345E-05 | lm loss 1.562231E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.28 | backward: 1726.91 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   245300/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.341E-05 | lm loss 1.548252E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.95 | allreduce: 24.84 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   245400/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 1.336E-05 | lm loss 1.540700E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.25 | backward: 1729.36 | allreduce: 27.31 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   245500/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 1.332E-05 | lm loss 1.532807E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1727.15 | allreduce: 24.82 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   245600/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 1.327E-05 | lm loss 1.557960E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.26 | backward: 1726.81 | allreduce: 24.60 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   245700/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 1.323E-05 | lm loss 1.536965E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.17 | backward: 1728.67 | allreduce: 26.77 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   245800/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 1.318E-05 | lm loss 1.516862E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1726.51 | allreduce: 24.32 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   245900/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 1.314E-05 | lm loss 1.535375E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1726.06 | allreduce: 23.64 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   246000/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 1.310E-05 | lm loss 1.531118E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1728.64 | allreduce: 25.87 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 246000 | LM loss: 1.610967E+00 | LM PPL: 5.007650E+00
    ------------------------------------------------------------------------------------
     iteration   246100/  300000 | elapsed time per iteration (ms): 3052.0 | learning rate 1.305E-05 | lm loss 1.541574E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1727.40 | allreduce: 24.75 | optimizer: 55.78 | batch generator: 0.87 | data loader: 0.07
     iteration   246200/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 1.301E-05 | lm loss 1.565832E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1725.88 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   246300/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 1.296E-05 | lm loss 1.516910E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1728.45 | allreduce: 25.91 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   246400/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 1.292E-05 | lm loss 1.524397E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.17 | backward: 1725.82 | allreduce: 23.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   246500/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 1.287E-05 | lm loss 1.529491E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1725.74 | allreduce: 23.35 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   246600/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 1.283E-05 | lm loss 1.565772E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1728.73 | allreduce: 25.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   246700/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 1.279E-05 | lm loss 1.540853E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.48 | backward: 1727.00 | allreduce: 23.97 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   246800/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 1.274E-05 | lm loss 1.547209E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1726.45 | allreduce: 24.94 | optimizer: 54.67 | batch generator: 0.45 | data loader: 0.04
     iteration   246900/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 1.270E-05 | lm loss 1.529677E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1729.87 | allreduce: 27.25 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   247000/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.266E-05 | lm loss 1.562820E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.16 | backward: 1726.95 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 247000 | LM loss: 1.611489E+00 | LM PPL: 5.010266E+00
    ------------------------------------------------------------------------------------
     iteration   247100/  300000 | elapsed time per iteration (ms): 3058.4 | learning rate 1.261E-05 | lm loss 1.534959E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1727.31 | allreduce: 24.61 | optimizer: 55.77 | batch generator: 7.54 | data loader: 6.74
     iteration   247200/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 1.257E-05 | lm loss 1.547557E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.27 | backward: 1728.75 | allreduce: 27.04 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   247300/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 1.253E-05 | lm loss 1.542031E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1726.85 | allreduce: 24.28 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   247400/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.248E-05 | lm loss 1.555329E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1726.87 | allreduce: 24.40 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   247500/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 1.244E-05 | lm loss 1.565076E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1729.19 | allreduce: 26.89 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   247600/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 1.240E-05 | lm loss 1.550071E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1727.16 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   247700/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.235E-05 | lm loss 1.563294E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.10 | backward: 1727.00 | allreduce: 24.99 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   247800/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 1.231E-05 | lm loss 1.557959E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1728.24 | allreduce: 26.16 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   247900/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 1.227E-05 | lm loss 1.543956E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1727.18 | allreduce: 24.77 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   248000/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 1.222E-05 | lm loss 1.538803E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1727.65 | allreduce: 24.97 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 248000 | LM loss: 1.591482E+00 | LM PPL: 4.911021E+00
    ------------------------------------------------------------------------------------
     iteration   248100/  300000 | elapsed time per iteration (ms): 3065.8 | learning rate 1.218E-05 | lm loss 1.509250E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1729.56 | allreduce: 26.95 | optimizer: 55.77 | batch generator: 12.23 | data loader: 11.45
     iteration   248200/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 1.214E-05 | lm loss 1.540464E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1727.68 | allreduce: 24.99 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   248300/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 1.210E-05 | lm loss 1.560682E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1727.78 | allreduce: 25.03 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   248400/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 1.205E-05 | lm loss 1.530363E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1729.14 | allreduce: 26.24 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   248500/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 1.201E-05 | lm loss 1.554354E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1726.26 | allreduce: 23.37 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   248600/  300000 | elapsed time per iteration (ms): 2449.2 | learning rate 1.197E-05 | lm loss 1.535564E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1725.52 | allreduce: 23.60 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   248700/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 1.192E-05 | lm loss 1.528810E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1729.73 | allreduce: 27.06 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   248800/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 1.188E-05 | lm loss 1.558257E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1726.75 | allreduce: 23.38 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   248900/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 1.184E-05 | lm loss 1.550043E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.67 | backward: 1727.06 | allreduce: 23.37 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   249000/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 1.180E-05 | lm loss 1.537227E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.57 | backward: 1729.15 | allreduce: 25.89 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 249000 | LM loss: 1.606889E+00 | LM PPL: 4.987269E+00
    ------------------------------------------------------------------------------------
     iteration   249100/  300000 | elapsed time per iteration (ms): 3079.3 | learning rate 1.176E-05 | lm loss 1.516311E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.67 | backward: 1727.76 | allreduce: 24.08 | optimizer: 55.78 | batch generator: 27.74 | data loader: 26.95
     iteration   249200/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 1.171E-05 | lm loss 1.526953E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1725.96 | allreduce: 23.37 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   249300/  300000 | elapsed time per iteration (ms): 2453.2 | learning rate 1.167E-05 | lm loss 1.524888E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.50 | backward: 1728.71 | allreduce: 25.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   249400/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 1.163E-05 | lm loss 1.521718E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.65 | backward: 1726.59 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   249500/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.159E-05 | lm loss 1.536049E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.59 | backward: 1726.51 | allreduce: 23.60 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   249600/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 1.155E-05 | lm loss 1.539230E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1728.60 | allreduce: 25.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   249700/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 1.150E-05 | lm loss 1.528898E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.15 | allreduce: 23.82 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   249800/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 1.146E-05 | lm loss 1.542020E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1726.97 | allreduce: 24.41 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   249900/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 1.142E-05 | lm loss 1.550308E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.50 | backward: 1729.53 | allreduce: 26.77 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   250000/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 1.138E-05 | lm loss 1.540786E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.49 | backward: 1727.34 | allreduce: 24.58 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  250000 to checkpoints/gpt2_750m_2/iter_0250000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0250000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 250000 | LM loss: 1.605138E+00 | LM PPL: 4.978544E+00
    ------------------------------------------------------------------------------------
     iteration   250100/  300000 | elapsed time per iteration (ms): 3112.5 | learning rate 1.134E-05 | lm loss 1.537346E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1726.74 | allreduce: 24.64 | optimizer: 55.77 | batch generator: 8.13 | data loader: 7.33
     iteration   250200/  300000 | elapsed time per iteration (ms): 2454.7 | learning rate 1.130E-05 | lm loss 1.550400E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.52 | backward: 1730.25 | allreduce: 27.31 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   250300/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 1.125E-05 | lm loss 1.544858E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.57 | backward: 1728.35 | allreduce: 24.94 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   250400/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 1.121E-05 | lm loss 1.528723E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1727.88 | allreduce: 24.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   250500/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 1.117E-05 | lm loss 1.549218E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1730.39 | allreduce: 27.43 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   250600/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 1.113E-05 | lm loss 1.545238E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1727.83 | allreduce: 24.84 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   250700/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 1.109E-05 | lm loss 1.534498E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1727.80 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   250800/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 1.105E-05 | lm loss 1.547549E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1729.95 | allreduce: 27.23 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   250900/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 1.101E-05 | lm loss 1.545617E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.37 | backward: 1727.53 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   251000/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 1.097E-05 | lm loss 1.540878E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1726.62 | allreduce: 24.66 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 251000 | LM loss: 1.608184E+00 | LM PPL: 4.993735E+00
    ------------------------------------------------------------------------------------
     iteration   251100/  300000 | elapsed time per iteration (ms): 3058.7 | learning rate 1.093E-05 | lm loss 1.533899E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1729.94 | allreduce: 26.85 | optimizer: 55.78 | batch generator: 4.14 | data loader: 3.35
     iteration   251200/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 1.089E-05 | lm loss 1.529814E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1727.35 | allreduce: 24.54 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   251300/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 1.084E-05 | lm loss 1.536202E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1726.35 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   251400/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 1.080E-05 | lm loss 1.536765E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1728.55 | allreduce: 25.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   251500/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 1.076E-05 | lm loss 1.526822E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1725.89 | allreduce: 23.36 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   251600/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 1.072E-05 | lm loss 1.530421E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.47 | backward: 1726.35 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   251700/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 1.068E-05 | lm loss 1.533917E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1728.43 | allreduce: 25.87 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   251800/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 1.064E-05 | lm loss 1.535849E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1726.41 | allreduce: 23.36 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   251900/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 1.060E-05 | lm loss 1.554003E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1725.72 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   252000/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 1.056E-05 | lm loss 1.574433E+00 | loss scale 2097152.0 |
    time (ms) | forward: 668.19 | backward: 1728.79 | allreduce: 26.42 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 252000 | LM loss: 1.605946E+00 | LM PPL: 4.982571E+00
    ------------------------------------------------------------------------------------
     iteration   252100/  300000 | elapsed time per iteration (ms): 3059.2 | learning rate 1.052E-05 | lm loss 1.542620E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1725.95 | allreduce: 25.46 | optimizer: 54.09 | batch generator: 11.18 | data loader: 10.38
     iteration   252200/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 1.048E-05 | lm loss 1.545787E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.11 | backward: 1727.20 | allreduce: 25.30 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   252300/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 1.044E-05 | lm loss 1.553855E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1729.58 | allreduce: 27.43 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   252400/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 1.040E-05 | lm loss 1.527006E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.04 | backward: 1726.64 | allreduce: 25.08 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   252500/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 1.036E-05 | lm loss 1.506544E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1726.78 | allreduce: 24.85 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   252600/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 1.032E-05 | lm loss 1.546030E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.09 | backward: 1729.27 | allreduce: 27.42 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   252700/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 1.028E-05 | lm loss 1.550672E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1727.33 | allreduce: 24.88 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   252800/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.024E-05 | lm loss 1.526539E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.87 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   252900/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 1.020E-05 | lm loss 1.544634E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1729.79 | allreduce: 27.25 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   253000/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 1.016E-05 | lm loss 1.571579E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1726.61 | allreduce: 24.60 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 253000 | LM loss: 1.596227E+00 | LM PPL: 4.934378E+00
    ------------------------------------------------------------------------------------
     iteration   253100/  300000 | elapsed time per iteration (ms): 3059.0 | learning rate 1.012E-05 | lm loss 1.560386E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.63 | backward: 1728.09 | allreduce: 24.97 | optimizer: 55.77 | batch generator: 6.96 | data loader: 6.17
     iteration   253200/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 1.009E-05 | lm loss 1.548193E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.48 | backward: 1730.51 | allreduce: 27.46 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   253300/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 1.005E-05 | lm loss 1.557290E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.26 | backward: 1727.50 | allreduce: 24.95 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   253400/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.001E-05 | lm loss 1.542686E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.17 | backward: 1726.98 | allreduce: 24.71 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   253500/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 9.968E-06 | lm loss 1.540232E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.37 | backward: 1730.10 | allreduce: 27.41 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   253600/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 9.929E-06 | lm loss 1.554463E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1727.02 | allreduce: 24.89 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   253700/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 9.890E-06 | lm loss 1.539200E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1727.74 | allreduce: 24.89 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   253800/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 9.851E-06 | lm loss 1.537926E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.54 | backward: 1729.54 | allreduce: 27.20 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   253900/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 9.813E-06 | lm loss 1.546480E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.47 | backward: 1727.48 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   254000/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 9.774E-06 | lm loss 1.543671E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1727.49 | allreduce: 24.90 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 254000 | LM loss: 1.637535E+00 | LM PPL: 5.142478E+00
    ------------------------------------------------------------------------------------
     iteration   254100/  300000 | elapsed time per iteration (ms): 3069.9 | learning rate 9.735E-06 | lm loss 1.553352E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.15 | backward: 1729.03 | allreduce: 27.09 | optimizer: 55.77 | batch generator: 17.26 | data loader: 16.46
     iteration   254200/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 9.697E-06 | lm loss 1.532104E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.33 | backward: 1726.30 | allreduce: 24.72 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   254300/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 9.658E-06 | lm loss 1.537080E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.27 | backward: 1726.60 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   254400/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 9.620E-06 | lm loss 1.526588E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.49 | backward: 1729.49 | allreduce: 27.19 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   254500/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 9.581E-06 | lm loss 1.518921E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.59 | backward: 1727.53 | allreduce: 24.69 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   254600/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 9.543E-06 | lm loss 1.532216E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.41 | backward: 1726.88 | allreduce: 24.65 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   254700/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 9.505E-06 | lm loss 1.537573E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.38 | backward: 1729.34 | allreduce: 27.19 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   254800/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 9.466E-06 | lm loss 1.541308E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.65 | backward: 1727.60 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   254900/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 9.428E-06 | lm loss 1.527556E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.50 | backward: 1727.19 | allreduce: 24.70 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   255000/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 9.390E-06 | lm loss 1.532166E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.36 | backward: 1729.27 | allreduce: 27.19 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  255000 to checkpoints/gpt2_750m_2/iter_0255000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0255000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 255000 | LM loss: 1.598506E+00 | LM PPL: 4.945636E+00
    ------------------------------------------------------------------------------------
     iteration   255100/  300000 | elapsed time per iteration (ms): 3112.7 | learning rate 9.352E-06 | lm loss 1.549819E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.30 | backward: 1726.82 | allreduce: 24.97 | optimizer: 55.77 | batch generator: 7.43 | data loader: 6.63
     iteration   255200/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 9.314E-06 | lm loss 1.552656E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1729.53 | allreduce: 27.19 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   255300/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 9.276E-06 | lm loss 1.531378E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.53 | backward: 1727.69 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   255400/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 9.238E-06 | lm loss 1.550445E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1726.94 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   255500/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 9.201E-06 | lm loss 1.534818E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.20 | backward: 1729.52 | allreduce: 27.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   255600/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 9.163E-06 | lm loss 1.545884E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.59 | backward: 1727.79 | allreduce: 24.76 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   255700/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 9.125E-06 | lm loss 1.529348E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.45 | backward: 1727.48 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   255800/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 9.088E-06 | lm loss 1.528869E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1729.50 | allreduce: 27.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   255900/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 9.051E-06 | lm loss 1.548156E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1726.78 | allreduce: 24.83 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   256000/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 9.013E-06 | lm loss 1.543198E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1727.21 | allreduce: 24.81 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 256000 | LM loss: 1.619277E+00 | LM PPL: 5.049439E+00
    ------------------------------------------------------------------------------------
     iteration   256100/  300000 | elapsed time per iteration (ms): 3079.2 | learning rate 8.976E-06 | lm loss 1.511360E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.48 | backward: 1729.83 | allreduce: 27.14 | optimizer: 55.78 | batch generator: 25.40 | data loader: 24.61
     iteration   256200/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 8.939E-06 | lm loss 1.530029E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1726.99 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   256300/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 8.901E-06 | lm loss 1.550628E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1727.30 | allreduce: 24.95 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   256400/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 8.864E-06 | lm loss 1.570104E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1729.90 | allreduce: 27.31 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   256500/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 8.827E-06 | lm loss 1.535051E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.45 | backward: 1727.84 | allreduce: 24.88 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   256600/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 8.790E-06 | lm loss 1.547128E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1727.16 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   256700/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 8.754E-06 | lm loss 1.525268E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.16 | backward: 1729.65 | allreduce: 27.41 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   256800/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 8.717E-06 | lm loss 1.527065E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.08 | backward: 1726.65 | allreduce: 24.80 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   256900/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 8.680E-06 | lm loss 1.538504E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1727.43 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   257000/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 8.643E-06 | lm loss 1.529287E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1730.36 | allreduce: 27.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 257000 | LM loss: 1.609903E+00 | LM PPL: 5.002326E+00
    ------------------------------------------------------------------------------------
     iteration   257100/  300000 | elapsed time per iteration (ms): 3053.6 | learning rate 8.607E-06 | lm loss 1.551238E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1727.73 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 2.11 | data loader: 1.32
     iteration   257200/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 8.571E-06 | lm loss 1.524019E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.17 | backward: 1725.60 | allreduce: 24.60 | optimizer: 54.66 | batch generator: 0.46 | data loader: 0.04
     iteration   257300/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 8.535E-06 | lm loss 1.532709E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.24 | backward: 1729.31 | allreduce: 26.92 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   257400/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 8.498E-06 | lm loss 1.532184E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1726.92 | allreduce: 24.34 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   257500/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 8.462E-06 | lm loss 1.537614E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1727.24 | allreduce: 24.33 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   257600/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 8.426E-06 | lm loss 1.560150E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.11 | backward: 1728.36 | allreduce: 26.17 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   257700/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 8.390E-06 | lm loss 1.541412E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.28 | backward: 1725.41 | allreduce: 23.36 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   257800/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 8.354E-06 | lm loss 1.537507E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1725.75 | allreduce: 23.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   257900/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 8.318E-06 | lm loss 1.529949E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.27 | backward: 1727.78 | allreduce: 25.85 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   258000/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 8.282E-06 | lm loss 1.534447E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.16 | backward: 1725.74 | allreduce: 23.73 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 258000 | LM loss: 1.623428E+00 | LM PPL: 5.070444E+00
    ------------------------------------------------------------------------------------
     iteration   258100/  300000 | elapsed time per iteration (ms): 3055.3 | learning rate 8.246E-06 | lm loss 1.533364E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1727.32 | allreduce: 25.18 | optimizer: 55.77 | batch generator: 3.82 | data loader: 3.03
     iteration   258200/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 8.211E-06 | lm loss 1.547578E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.02 | backward: 1729.39 | allreduce: 27.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   258300/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 8.175E-06 | lm loss 1.528297E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.10 | backward: 1726.96 | allreduce: 25.26 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   258400/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 8.139E-06 | lm loss 1.523560E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.09 | backward: 1727.10 | allreduce: 25.24 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   258500/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 8.104E-06 | lm loss 1.536912E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.01 | backward: 1729.39 | allreduce: 27.82 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   258600/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 8.068E-06 | lm loss 1.523961E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.05 | backward: 1726.97 | allreduce: 25.27 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   258700/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 8.033E-06 | lm loss 1.517536E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1727.10 | allreduce: 25.25 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   258800/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 7.997E-06 | lm loss 1.533055E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.18 | backward: 1730.13 | allreduce: 27.84 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   258900/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 7.962E-06 | lm loss 1.556174E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1727.58 | allreduce: 25.27 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   259000/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 7.927E-06 | lm loss 1.508697E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.08 | backward: 1727.36 | allreduce: 25.24 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 259000 | LM loss: 1.606316E+00 | LM PPL: 4.984416E+00
    ------------------------------------------------------------------------------------
     iteration   259100/  300000 | elapsed time per iteration (ms): 3076.7 | learning rate 7.892E-06 | lm loss 1.543428E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.01 | backward: 1729.84 | allreduce: 27.80 | optimizer: 55.78 | batch generator: 23.47 | data loader: 22.68
     iteration   259200/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 7.857E-06 | lm loss 1.537203E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1727.88 | allreduce: 25.27 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   259300/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 7.822E-06 | lm loss 1.538779E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.58 | backward: 1728.17 | allreduce: 24.79 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   259400/  300000 | elapsed time per iteration (ms): 2454.9 | learning rate 7.787E-06 | lm loss 1.516134E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1730.42 | allreduce: 27.34 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   259500/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 7.752E-06 | lm loss 1.522550E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1727.76 | allreduce: 25.08 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   259600/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 7.718E-06 | lm loss 1.536635E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1727.99 | allreduce: 25.20 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   259700/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 7.683E-06 | lm loss 1.534965E+00 | loss scale 2097152.0 |
    time (ms) | forward: 668.56 | backward: 1727.82 | allreduce: 25.22 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   259800/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 7.649E-06 | lm loss 1.544625E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1730.07 | allreduce: 27.76 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   259900/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 7.615E-06 | lm loss 1.535761E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1726.75 | allreduce: 25.26 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   260000/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 7.580E-06 | lm loss 1.529476E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1727.82 | allreduce: 25.23 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  260000 to checkpoints/gpt2_750m_2/iter_0260000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0260000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 260000 | LM loss: 1.597305E+00 | LM PPL: 4.939704E+00
    ------------------------------------------------------------------------------------
     iteration   260100/  300000 | elapsed time per iteration (ms): 3113.5 | learning rate 7.546E-06 | lm loss 1.545927E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1727.40 | allreduce: 24.93 | optimizer: 55.78 | batch generator: 5.75 | data loader: 4.97
     iteration   260200/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 7.512E-06 | lm loss 1.570954E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.70 | backward: 1727.98 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   260300/  300000 | elapsed time per iteration (ms): 2455.4 | learning rate 7.477E-06 | lm loss 1.516898E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.55 | backward: 1730.85 | allreduce: 28.00 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   260400/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 7.443E-06 | lm loss 1.539615E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1727.27 | allreduce: 24.98 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   260500/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 7.409E-06 | lm loss 1.559171E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1727.20 | allreduce: 24.80 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   260600/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 7.375E-06 | lm loss 1.547606E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1728.53 | allreduce: 26.14 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   260700/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 7.341E-06 | lm loss 1.522807E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.47 | backward: 1726.19 | allreduce: 23.53 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   260800/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 7.307E-06 | lm loss 1.528956E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1726.99 | allreduce: 24.36 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   260900/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 7.274E-06 | lm loss 1.554626E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1729.56 | allreduce: 26.90 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   261000/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 7.240E-06 | lm loss 1.541840E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.62 | backward: 1727.01 | allreduce: 24.20 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 261000 | LM loss: 1.620172E+00 | LM PPL: 5.053960E+00
    ------------------------------------------------------------------------------------
     iteration   261100/  300000 | elapsed time per iteration (ms): 3076.2 | learning rate 7.207E-06 | lm loss 1.532696E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1727.39 | allreduce: 24.25 | optimizer: 55.77 | batch generator: 24.91 | data loader: 24.11
     iteration   261200/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 7.173E-06 | lm loss 1.539672E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.65 | backward: 1730.44 | allreduce: 26.96 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   261300/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 7.140E-06 | lm loss 1.537215E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.53 | backward: 1727.56 | allreduce: 24.30 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   261400/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 7.106E-06 | lm loss 1.534530E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.34 | backward: 1726.97 | allreduce: 24.28 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   261500/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 7.073E-06 | lm loss 1.543142E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1729.53 | allreduce: 26.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   261600/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 7.040E-06 | lm loss 1.537859E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1727.10 | allreduce: 24.33 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   261700/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 7.006E-06 | lm loss 1.539489E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.61 | backward: 1727.83 | allreduce: 24.34 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   261800/  300000 | elapsed time per iteration (ms): 2454.0 | learning rate 6.973E-06 | lm loss 1.528992E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.38 | backward: 1729.61 | allreduce: 26.82 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   261900/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 6.940E-06 | lm loss 1.528494E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1727.03 | allreduce: 24.36 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   262000/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 6.908E-06 | lm loss 1.560771E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1725.05 | allreduce: 24.15 | optimizer: 54.10 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 262000 | LM loss: 1.599607E+00 | LM PPL: 4.951086E+00
    ------------------------------------------------------------------------------------
     iteration   262100/  300000 | elapsed time per iteration (ms): 3064.3 | learning rate 6.875E-06 | lm loss 1.536413E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1729.22 | allreduce: 26.88 | optimizer: 55.77 | batch generator: 11.27 | data loader: 10.48
     iteration   262200/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 6.843E-06 | lm loss 1.533410E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1726.71 | allreduce: 24.16 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   262300/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 6.810E-06 | lm loss 1.562262E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.67 | allreduce: 24.32 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   262400/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 6.777E-06 | lm loss 1.577354E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1729.22 | allreduce: 26.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   262500/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 6.745E-06 | lm loss 1.528363E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1727.28 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   262600/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 6.712E-06 | lm loss 1.546794E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.51 | backward: 1727.61 | allreduce: 24.94 | optimizer: 55.79 | batch generator: 0.45 | data loader: 0.04
     iteration   262700/  300000 | elapsed time per iteration (ms): 2454.4 | learning rate 6.680E-06 | lm loss 1.537677E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.55 | backward: 1729.89 | allreduce: 27.20 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   262800/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 6.647E-06 | lm loss 1.540415E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.27 | backward: 1726.91 | allreduce: 24.77 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   262900/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 6.615E-06 | lm loss 1.520761E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.31 | backward: 1727.15 | allreduce: 24.82 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   263000/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 6.583E-06 | lm loss 1.542332E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.42 | backward: 1729.79 | allreduce: 27.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 263000 | LM loss: 1.613962E+00 | LM PPL: 5.022672E+00
    ------------------------------------------------------------------------------------
     iteration   263100/  300000 | elapsed time per iteration (ms): 3061.6 | learning rate 6.551E-06 | lm loss 1.524843E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1727.46 | allreduce: 25.18 | optimizer: 55.78 | batch generator: 10.53 | data loader: 9.74
     iteration   263200/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 6.519E-06 | lm loss 1.547668E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.51 | backward: 1728.41 | allreduce: 25.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   263300/  300000 | elapsed time per iteration (ms): 2454.3 | learning rate 6.487E-06 | lm loss 1.544701E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1730.09 | allreduce: 27.76 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   263400/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 6.455E-06 | lm loss 1.549452E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.27 | backward: 1727.78 | allreduce: 25.34 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   263500/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 6.423E-06 | lm loss 1.551045E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.14 | backward: 1727.55 | allreduce: 25.32 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   263600/  300000 | elapsed time per iteration (ms): 2455.1 | learning rate 6.391E-06 | lm loss 1.579202E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.61 | backward: 1730.56 | allreduce: 27.29 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   263700/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 6.360E-06 | lm loss 1.551046E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1727.04 | allreduce: 24.68 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   263800/  300000 | elapsed time per iteration (ms): 2452.2 | learning rate 6.328E-06 | lm loss 1.503252E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1727.79 | allreduce: 24.76 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   263900/  300000 | elapsed time per iteration (ms): 2454.8 | learning rate 6.297E-06 | lm loss 1.527766E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1730.44 | allreduce: 27.52 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   264000/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 6.265E-06 | lm loss 1.531297E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1727.63 | allreduce: 24.91 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 264000 | LM loss: 1.612461E+00 | LM PPL: 5.015140E+00
    ------------------------------------------------------------------------------------
     iteration   264100/  300000 | elapsed time per iteration (ms): 3071.4 | learning rate 6.234E-06 | lm loss 1.510634E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1728.12 | allreduce: 25.25 | optimizer: 55.78 | batch generator: 19.55 | data loader: 18.75
     iteration   264200/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 6.203E-06 | lm loss 1.549031E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.55 | backward: 1729.73 | allreduce: 27.46 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   264300/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 6.172E-06 | lm loss 1.538875E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.44 | backward: 1727.45 | allreduce: 24.85 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   264400/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 6.140E-06 | lm loss 1.527598E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1726.84 | allreduce: 24.59 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   264500/  300000 | elapsed time per iteration (ms): 2454.5 | learning rate 6.109E-06 | lm loss 1.530142E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.51 | backward: 1730.02 | allreduce: 27.17 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   264600/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 6.078E-06 | lm loss 1.503872E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.63 | backward: 1727.79 | allreduce: 24.78 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   264700/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 6.047E-06 | lm loss 1.545848E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1727.13 | allreduce: 24.62 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   264800/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 6.016E-06 | lm loss 1.522464E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.31 | backward: 1729.45 | allreduce: 27.24 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   264900/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 5.986E-06 | lm loss 1.535353E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1727.10 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   265000/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 5.955E-06 | lm loss 1.514805E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1727.12 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  265000 to checkpoints/gpt2_750m_2/iter_0265000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0265000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 265000 | LM loss: 1.592687E+00 | LM PPL: 4.916944E+00
    ------------------------------------------------------------------------------------
     iteration   265100/  300000 | elapsed time per iteration (ms): 3109.1 | learning rate 5.924E-06 | lm loss 1.524413E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.48 | backward: 1729.68 | allreduce: 27.23 | optimizer: 55.76 | batch generator: 0.86 | data loader: 0.07
     iteration   265200/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.894E-06 | lm loss 1.515624E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1727.02 | allreduce: 24.63 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   265300/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 5.863E-06 | lm loss 1.525544E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1726.91 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   265400/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 5.833E-06 | lm loss 1.542055E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1729.60 | allreduce: 27.19 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   265500/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 5.803E-06 | lm loss 1.530605E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1727.19 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   265600/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 5.772E-06 | lm loss 1.525710E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1726.29 | allreduce: 24.24 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   265700/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 5.742E-06 | lm loss 1.527930E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1729.24 | allreduce: 26.95 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   265800/  300000 | elapsed time per iteration (ms): 2449.5 | learning rate 5.712E-06 | lm loss 1.537985E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.13 | backward: 1726.02 | allreduce: 24.34 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   265900/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.682E-06 | lm loss 1.528993E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1726.86 | allreduce: 24.35 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   266000/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 5.652E-06 | lm loss 1.526863E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1728.84 | allreduce: 26.70 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 266000 | LM loss: 1.596505E+00 | LM PPL: 4.935750E+00
    ------------------------------------------------------------------------------------
     iteration   266100/  300000 | elapsed time per iteration (ms): 3056.6 | learning rate 5.622E-06 | lm loss 1.543363E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.43 | backward: 1727.28 | allreduce: 24.54 | optimizer: 55.77 | batch generator: 5.25 | data loader: 4.46
     iteration   266200/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 5.593E-06 | lm loss 1.526224E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.13 | backward: 1726.88 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   266300/  300000 | elapsed time per iteration (ms): 2453.9 | learning rate 5.563E-06 | lm loss 1.510952E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1729.75 | allreduce: 27.47 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   266400/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 5.533E-06 | lm loss 1.529902E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1727.58 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   266500/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 5.504E-06 | lm loss 1.539363E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1727.09 | allreduce: 24.93 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   266600/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 5.475E-06 | lm loss 1.528813E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.19 | backward: 1728.54 | allreduce: 27.38 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   266700/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 5.445E-06 | lm loss 1.547637E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.43 | backward: 1726.06 | allreduce: 23.84 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   266800/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 5.416E-06 | lm loss 1.556426E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1726.21 | allreduce: 23.85 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   266900/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 5.387E-06 | lm loss 1.527313E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1729.38 | allreduce: 27.04 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   267000/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 5.357E-06 | lm loss 1.533533E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1726.72 | allreduce: 24.65 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 267000 | LM loss: 1.597153E+00 | LM PPL: 4.938951E+00
    ------------------------------------------------------------------------------------
     iteration   267100/  300000 | elapsed time per iteration (ms): 3054.3 | learning rate 5.328E-06 | lm loss 1.509451E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1726.62 | allreduce: 24.63 | optimizer: 55.78 | batch generator: 3.27 | data loader: 2.47
     iteration   267200/  300000 | elapsed time per iteration (ms): 2453.3 | learning rate 5.299E-06 | lm loss 1.524473E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.26 | backward: 1729.10 | allreduce: 27.04 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   267300/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 5.270E-06 | lm loss 1.540244E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1726.83 | allreduce: 24.72 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   267400/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 5.241E-06 | lm loss 1.508963E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1727.01 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   267500/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 5.213E-06 | lm loss 1.544493E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1729.20 | allreduce: 27.46 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   267600/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 5.184E-06 | lm loss 1.534056E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.01 | backward: 1726.58 | allreduce: 24.87 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   267700/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 5.155E-06 | lm loss 1.548752E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1727.14 | allreduce: 24.83 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   267800/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 5.127E-06 | lm loss 1.556512E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.16 | backward: 1729.51 | allreduce: 27.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   267900/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 5.098E-06 | lm loss 1.521862E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1726.29 | allreduce: 24.69 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   268000/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 5.070E-06 | lm loss 1.529336E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1726.60 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 268000 | LM loss: 1.600466E+00 | LM PPL: 4.955339E+00
    ------------------------------------------------------------------------------------
     iteration   268100/  300000 | elapsed time per iteration (ms): 3056.1 | learning rate 5.042E-06 | lm loss 1.543802E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1728.77 | allreduce: 27.48 | optimizer: 55.21 | batch generator: 3.48 | data loader: 2.69
     iteration   268200/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 5.014E-06 | lm loss 1.537308E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1726.92 | allreduce: 24.85 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   268300/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 4.985E-06 | lm loss 1.535961E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1727.03 | allreduce: 24.74 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   268400/  300000 | elapsed time per iteration (ms): 2453.6 | learning rate 4.957E-06 | lm loss 1.545793E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.46 | backward: 1729.22 | allreduce: 26.78 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   268500/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 4.929E-06 | lm loss 1.532993E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.31 | backward: 1726.41 | allreduce: 24.28 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   268600/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 4.901E-06 | lm loss 1.522741E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1726.75 | allreduce: 24.47 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   268700/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 4.873E-06 | lm loss 1.527593E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1728.35 | allreduce: 26.05 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   268800/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 4.846E-06 | lm loss 1.523861E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1725.60 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   268900/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 4.818E-06 | lm loss 1.551425E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1725.05 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   269000/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 4.790E-06 | lm loss 1.540666E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1727.51 | allreduce: 25.86 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 269000 | LM loss: 1.586709E+00 | LM PPL: 4.887636E+00
    ------------------------------------------------------------------------------------
     iteration   269100/  300000 | elapsed time per iteration (ms): 3050.3 | learning rate 4.763E-06 | lm loss 1.532920E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1725.51 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.86 | data loader: 0.07
     iteration   269200/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 4.735E-06 | lm loss 1.531118E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1726.02 | allreduce: 23.37 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   269300/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 4.708E-06 | lm loss 1.545582E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.39 | backward: 1728.52 | allreduce: 25.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   269400/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 4.680E-06 | lm loss 1.518652E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.44 | backward: 1726.21 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   269500/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 4.653E-06 | lm loss 1.531184E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.03 | backward: 1725.62 | allreduce: 23.90 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   269600/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 4.626E-06 | lm loss 1.530977E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1728.23 | allreduce: 25.96 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   269700/  300000 | elapsed time per iteration (ms): 2449.5 | learning rate 4.599E-06 | lm loss 1.542269E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1725.66 | allreduce: 23.36 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   269800/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 4.572E-06 | lm loss 1.533511E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.53 | backward: 1725.44 | allreduce: 23.37 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   269900/  300000 | elapsed time per iteration (ms): 2452.4 | learning rate 4.545E-06 | lm loss 1.546178E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1728.05 | allreduce: 25.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   270000/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 4.518E-06 | lm loss 1.533114E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.35 | backward: 1725.62 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  270000 to checkpoints/gpt2_750m_2/iter_0270000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0270000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 270000 | LM loss: 1.600098E+00 | LM PPL: 4.953520E+00
    ------------------------------------------------------------------------------------
     iteration   270100/  300000 | elapsed time per iteration (ms): 3134.4 | learning rate 4.492E-06 | lm loss 1.513058E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1729.91 | allreduce: 27.65 | optimizer: 55.77 | batch generator: 23.91 | data loader: 23.11
     iteration   270200/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 4.465E-06 | lm loss 1.547623E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.50 | backward: 1727.37 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   270300/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 4.438E-06 | lm loss 1.524248E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1726.36 | allreduce: 24.14 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   270400/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 4.412E-06 | lm loss 1.537380E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1728.73 | allreduce: 27.05 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   270500/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 4.385E-06 | lm loss 1.513052E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1726.57 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   270600/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 4.359E-06 | lm loss 1.539646E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1726.76 | allreduce: 24.80 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   270700/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 4.332E-06 | lm loss 1.529443E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1729.43 | allreduce: 27.42 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   270800/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 4.306E-06 | lm loss 1.542845E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.24 | backward: 1725.76 | allreduce: 23.64 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   270900/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 4.280E-06 | lm loss 1.532045E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.26 | backward: 1725.80 | allreduce: 23.51 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   271000/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 4.254E-06 | lm loss 1.535591E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.34 | backward: 1729.87 | allreduce: 27.28 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 271000 | LM loss: 1.606981E+00 | LM PPL: 4.987731E+00
    ------------------------------------------------------------------------------------
     iteration   271100/  300000 | elapsed time per iteration (ms): 3061.2 | learning rate 4.228E-06 | lm loss 1.510906E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1727.08 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 10.60 | data loader: 9.80
     iteration   271200/  300000 | elapsed time per iteration (ms): 2451.2 | learning rate 4.202E-06 | lm loss 1.542754E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.41 | backward: 1726.84 | allreduce: 24.20 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   271300/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 4.176E-06 | lm loss 1.544887E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.10 | backward: 1728.59 | allreduce: 26.80 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   271400/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 4.150E-06 | lm loss 1.514615E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1726.56 | allreduce: 24.27 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   271500/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 4.125E-06 | lm loss 1.548212E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.50 | backward: 1726.61 | allreduce: 24.19 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   271600/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 4.099E-06 | lm loss 1.534418E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1729.52 | allreduce: 26.96 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   271700/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 4.073E-06 | lm loss 1.518714E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1727.14 | allreduce: 24.93 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   271800/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 4.048E-06 | lm loss 1.544357E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1726.96 | allreduce: 24.80 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   271900/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 4.023E-06 | lm loss 1.521198E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1729.54 | allreduce: 27.14 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   272000/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 3.997E-06 | lm loss 1.536073E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.20 | backward: 1726.07 | allreduce: 24.71 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 272000 | LM loss: 1.615842E+00 | LM PPL: 5.032123E+00
    ------------------------------------------------------------------------------------
     iteration   272100/  300000 | elapsed time per iteration (ms): 3064.7 | learning rate 3.972E-06 | lm loss 1.520334E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1726.76 | allreduce: 24.81 | optimizer: 55.77 | batch generator: 14.66 | data loader: 13.86
     iteration   272200/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 3.947E-06 | lm loss 1.534373E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.16 | backward: 1728.75 | allreduce: 27.20 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   272300/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 3.922E-06 | lm loss 1.526085E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.16 | backward: 1726.47 | allreduce: 24.88 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   272400/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 3.897E-06 | lm loss 1.532384E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1726.42 | allreduce: 24.34 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   272500/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 3.872E-06 | lm loss 1.539615E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1728.03 | allreduce: 25.92 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   272600/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 3.847E-06 | lm loss 1.542396E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.99 | backward: 1724.59 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   272700/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 3.822E-06 | lm loss 1.519065E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1725.22 | allreduce: 23.36 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   272800/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 3.798E-06 | lm loss 1.536267E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1728.33 | allreduce: 26.34 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   272900/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 3.773E-06 | lm loss 1.539610E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1726.70 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   273000/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 3.748E-06 | lm loss 1.536344E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.96 | backward: 1725.93 | allreduce: 24.69 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 273000 | LM loss: 1.622953E+00 | LM PPL: 5.068032E+00
    ------------------------------------------------------------------------------------
     iteration   273100/  300000 | elapsed time per iteration (ms): 3054.3 | learning rate 3.724E-06 | lm loss 1.540533E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1729.50 | allreduce: 27.39 | optimizer: 55.77 | batch generator: 0.86 | data loader: 0.07
     iteration   273200/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 3.699E-06 | lm loss 1.527158E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1727.19 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   273300/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 3.675E-06 | lm loss 1.543538E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1726.70 | allreduce: 24.70 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   273400/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.651E-06 | lm loss 1.539439E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.24 | backward: 1728.53 | allreduce: 26.31 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   273500/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 3.627E-06 | lm loss 1.547628E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.04 | backward: 1726.66 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   273600/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 3.603E-06 | lm loss 1.522507E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.24 | backward: 1727.63 | allreduce: 25.28 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   273700/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 3.579E-06 | lm loss 1.528948E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.43 | backward: 1729.38 | allreduce: 26.66 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   273800/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 3.555E-06 | lm loss 1.558341E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.11 | backward: 1726.24 | allreduce: 24.26 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   273900/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 3.531E-06 | lm loss 1.523877E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.99 | backward: 1726.41 | allreduce: 24.83 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   274000/  300000 | elapsed time per iteration (ms): 2453.7 | learning rate 3.507E-06 | lm loss 1.543919E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.15 | backward: 1729.59 | allreduce: 27.55 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 274000 | LM loss: 1.610531E+00 | LM PPL: 5.005466E+00
    ------------------------------------------------------------------------------------
     iteration   274100/  300000 | elapsed time per iteration (ms): 3061.0 | learning rate 3.484E-06 | lm loss 1.565826E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1726.89 | allreduce: 24.68 | optimizer: 55.76 | batch generator: 10.71 | data loader: 9.92
     iteration   274200/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 3.460E-06 | lm loss 1.518572E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.25 | backward: 1726.49 | allreduce: 24.88 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   274300/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 3.437E-06 | lm loss 1.545332E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.08 | backward: 1728.92 | allreduce: 27.24 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   274400/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 3.413E-06 | lm loss 1.529298E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.02 | backward: 1725.18 | allreduce: 24.84 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   274500/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 3.390E-06 | lm loss 1.520204E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.26 | backward: 1726.14 | allreduce: 24.70 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   274600/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.367E-06 | lm loss 1.523902E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.20 | backward: 1728.59 | allreduce: 27.11 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   274700/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 3.344E-06 | lm loss 1.547372E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.12 | backward: 1725.91 | allreduce: 24.75 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   274800/  300000 | elapsed time per iteration (ms): 2449.5 | learning rate 3.320E-06 | lm loss 1.527291E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.02 | backward: 1725.57 | allreduce: 24.70 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   274900/  300000 | elapsed time per iteration (ms): 2452.8 | learning rate 3.297E-06 | lm loss 1.514855E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.19 | backward: 1728.61 | allreduce: 27.25 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   275000/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 3.274E-06 | lm loss 1.548661E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.22 | backward: 1725.77 | allreduce: 24.19 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  275000 to checkpoints/gpt2_750m_2/iter_0275000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0275000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 275000 | LM loss: 1.607745E+00 | LM PPL: 4.991542E+00
    ------------------------------------------------------------------------------------
     iteration   275100/  300000 | elapsed time per iteration (ms): 3113.2 | learning rate 3.251E-06 | lm loss 1.526421E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.23 | backward: 1726.55 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 6.70 | data loader: 5.90
     iteration   275200/  300000 | elapsed time per iteration (ms): 2452.7 | learning rate 3.229E-06 | lm loss 1.531454E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.11 | backward: 1728.59 | allreduce: 27.15 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   275300/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 3.206E-06 | lm loss 1.531966E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.14 | backward: 1726.12 | allreduce: 24.61 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   275400/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 3.183E-06 | lm loss 1.531829E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.18 | backward: 1726.57 | allreduce: 24.66 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   275500/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 3.161E-06 | lm loss 1.547068E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.32 | backward: 1729.52 | allreduce: 27.25 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   275600/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 3.138E-06 | lm loss 1.533827E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.88 | allreduce: 24.67 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   275700/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 3.116E-06 | lm loss 1.527392E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1726.55 | allreduce: 24.61 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   275800/  300000 | elapsed time per iteration (ms): 2453.5 | learning rate 3.093E-06 | lm loss 1.545502E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1729.33 | allreduce: 27.12 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   275900/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 3.071E-06 | lm loss 1.531718E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.10 | backward: 1725.85 | allreduce: 24.63 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   276000/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 3.049E-06 | lm loss 1.525405E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.09 | backward: 1726.42 | allreduce: 24.65 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 276000 | LM loss: 1.614399E+00 | LM PPL: 5.024869E+00
    ------------------------------------------------------------------------------------
     iteration   276100/  300000 | elapsed time per iteration (ms): 3057.4 | learning rate 3.027E-06 | lm loss 1.526864E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.03 | backward: 1728.78 | allreduce: 27.22 | optimizer: 55.78 | batch generator: 5.05 | data loader: 4.26
     iteration   276200/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 3.005E-06 | lm loss 1.561095E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1726.42 | allreduce: 24.63 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   276300/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.983E-06 | lm loss 1.527935E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1726.86 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   276400/  300000 | elapsed time per iteration (ms): 2452.9 | learning rate 2.961E-06 | lm loss 1.534883E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.08 | backward: 1728.86 | allreduce: 27.07 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   276500/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 2.939E-06 | lm loss 1.538710E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.03 | backward: 1726.29 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   276600/  300000 | elapsed time per iteration (ms): 2449.9 | learning rate 2.917E-06 | lm loss 1.526997E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.03 | backward: 1725.91 | allreduce: 24.55 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   276700/  300000 | elapsed time per iteration (ms): 2453.0 | learning rate 2.896E-06 | lm loss 1.539336E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.09 | backward: 1728.98 | allreduce: 27.24 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   276800/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.874E-06 | lm loss 1.512888E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1726.81 | allreduce: 24.69 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   276900/  300000 | elapsed time per iteration (ms): 2450.9 | learning rate 2.853E-06 | lm loss 1.521941E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.18 | backward: 1726.81 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   277000/  300000 | elapsed time per iteration (ms): 2453.1 | learning rate 2.831E-06 | lm loss 1.526746E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.02 | backward: 1729.10 | allreduce: 27.33 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 277000 | LM loss: 1.614433E+00 | LM PPL: 5.025038E+00
    ------------------------------------------------------------------------------------
     iteration   277100/  300000 | elapsed time per iteration (ms): 3052.0 | learning rate 2.810E-06 | lm loss 1.539358E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1727.10 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.86 | data loader: 0.07
     iteration   277200/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 2.789E-06 | lm loss 1.539272E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.00 | backward: 1726.50 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   277300/  300000 | elapsed time per iteration (ms): 2454.1 | learning rate 2.767E-06 | lm loss 1.518423E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1729.83 | allreduce: 27.16 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   277400/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 2.746E-06 | lm loss 1.510756E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1727.15 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   277500/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 2.725E-06 | lm loss 1.530766E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.90 | backward: 1726.32 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   277600/  300000 | elapsed time per iteration (ms): 2452.6 | learning rate 2.705E-06 | lm loss 1.528558E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.15 | backward: 1729.03 | allreduce: 27.32 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   277700/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 2.684E-06 | lm loss 1.512422E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1727.18 | allreduce: 24.48 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   277800/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 2.663E-06 | lm loss 1.528916E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1727.34 | allreduce: 24.84 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   277900/  300000 | elapsed time per iteration (ms): 2453.8 | learning rate 2.642E-06 | lm loss 1.544867E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.16 | backward: 1729.64 | allreduce: 27.40 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   278000/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 2.622E-06 | lm loss 1.564845E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1726.56 | allreduce: 23.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 278000 | LM loss: 1.603411E+00 | LM PPL: 4.969955E+00
    ------------------------------------------------------------------------------------
     iteration   278100/  300000 | elapsed time per iteration (ms): 3055.8 | learning rate 2.601E-06 | lm loss 1.524604E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.49 | backward: 1728.07 | allreduce: 24.91 | optimizer: 55.78 | batch generator: 3.94 | data loader: 3.14
     iteration   278200/  300000 | elapsed time per iteration (ms): 2454.2 | learning rate 2.581E-06 | lm loss 1.546675E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1729.96 | allreduce: 27.40 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   278300/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 2.560E-06 | lm loss 1.545156E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.05 | backward: 1726.20 | allreduce: 24.76 | optimizer: 55.20 | batch generator: 0.45 | data loader: 0.04
     iteration   278400/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 2.540E-06 | lm loss 1.516379E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1726.67 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   278500/  300000 | elapsed time per iteration (ms): 2453.4 | learning rate 2.520E-06 | lm loss 1.551028E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1729.14 | allreduce: 26.80 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   278600/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 2.500E-06 | lm loss 1.522196E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.34 | backward: 1725.83 | allreduce: 23.36 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   278700/  300000 | elapsed time per iteration (ms): 2449.5 | learning rate 2.480E-06 | lm loss 1.556276E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.19 | backward: 1725.39 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   278800/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 2.460E-06 | lm loss 1.547047E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.06 | backward: 1727.83 | allreduce: 26.17 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   278900/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.440E-06 | lm loss 1.547769E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.20 | backward: 1726.85 | allreduce: 24.75 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   279000/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 2.420E-06 | lm loss 1.535488E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1727.18 | allreduce: 24.84 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 279000 | LM loss: 1.612178E+00 | LM PPL: 5.013721E+00
    ------------------------------------------------------------------------------------
     iteration   279100/  300000 | elapsed time per iteration (ms): 3059.3 | learning rate 2.400E-06 | lm loss 1.538898E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.52 | backward: 1729.65 | allreduce: 26.77 | optimizer: 55.77 | batch generator: 5.68 | data loader: 4.88
     iteration   279200/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 2.381E-06 | lm loss 1.529158E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1725.62 | allreduce: 23.77 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   279300/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 2.361E-06 | lm loss 1.514144E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.89 | backward: 1724.82 | allreduce: 23.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   279400/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 2.341E-06 | lm loss 1.517544E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.99 | backward: 1727.72 | allreduce: 25.85 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   279500/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 2.322E-06 | lm loss 1.526664E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.13 | backward: 1725.52 | allreduce: 23.37 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   279600/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 2.303E-06 | lm loss 1.544845E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.11 | backward: 1725.34 | allreduce: 23.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   279700/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.283E-06 | lm loss 1.554358E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.11 | backward: 1727.84 | allreduce: 25.85 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   279800/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 2.264E-06 | lm loss 1.538214E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.00 | backward: 1724.47 | allreduce: 23.35 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   279900/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 2.245E-06 | lm loss 1.531354E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.87 | backward: 1724.78 | allreduce: 23.35 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   280000/  300000 | elapsed time per iteration (ms): 2451.9 | learning rate 2.226E-06 | lm loss 1.535393E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.07 | backward: 1727.85 | allreduce: 25.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  280000 to checkpoints/gpt2_750m_2/iter_0280000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0280000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 280000 | LM loss: 1.599746E+00 | LM PPL: 4.951775E+00
    ------------------------------------------------------------------------------------
     iteration   280100/  300000 | elapsed time per iteration (ms): 3103.1 | learning rate 2.207E-06 | lm loss 1.514712E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1724.24 | allreduce: 24.67 | optimizer: 55.77 | batch generator: 0.86 | data loader: 0.07
     iteration   280200/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 2.188E-06 | lm loss 1.526581E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.08 | backward: 1726.71 | allreduce: 27.11 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   280300/  300000 | elapsed time per iteration (ms): 2447.8 | learning rate 2.170E-06 | lm loss 1.527452E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.98 | backward: 1723.87 | allreduce: 24.58 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   280400/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 2.151E-06 | lm loss 1.547090E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.01 | backward: 1723.29 | allreduce: 24.78 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   280500/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 2.133E-06 | lm loss 1.521986E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.14 | backward: 1725.78 | allreduce: 27.13 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   280600/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 2.114E-06 | lm loss 1.504128E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.13 | backward: 1723.78 | allreduce: 24.73 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   280700/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 2.096E-06 | lm loss 1.533752E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.14 | backward: 1724.49 | allreduce: 25.28 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   280800/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 2.077E-06 | lm loss 1.536346E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.04 | backward: 1726.58 | allreduce: 27.77 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   280900/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 2.059E-06 | lm loss 1.545434E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.05 | backward: 1724.11 | allreduce: 25.29 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   281000/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 2.041E-06 | lm loss 1.528472E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.17 | backward: 1724.54 | allreduce: 25.28 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 281000 | LM loss: 1.605842E+00 | LM PPL: 4.982052E+00
    ------------------------------------------------------------------------------------
     iteration   281100/  300000 | elapsed time per iteration (ms): 3054.8 | learning rate 2.023E-06 | lm loss 1.527399E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.23 | backward: 1726.45 | allreduce: 27.24 | optimizer: 55.76 | batch generator: 5.50 | data loader: 4.71
     iteration   281200/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 2.004E-06 | lm loss 1.518157E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.35 | backward: 1724.30 | allreduce: 24.47 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   281300/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 1.986E-06 | lm loss 1.538250E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.40 | backward: 1724.24 | allreduce: 24.46 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   281400/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 1.969E-06 | lm loss 1.528118E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.09 | backward: 1726.11 | allreduce: 27.23 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   281500/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 1.951E-06 | lm loss 1.522712E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.25 | backward: 1723.89 | allreduce: 24.67 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   281600/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 1.933E-06 | lm loss 1.542209E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.47 | backward: 1725.61 | allreduce: 24.99 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   281700/  300000 | elapsed time per iteration (ms): 2447.3 | learning rate 1.915E-06 | lm loss 1.514115E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.19 | backward: 1723.70 | allreduce: 24.65 | optimizer: 55.20 | batch generator: 0.45 | data loader: 0.04
     iteration   281800/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.898E-06 | lm loss 1.551724E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1726.89 | allreduce: 27.17 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   281900/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 1.880E-06 | lm loss 1.558524E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.09 | backward: 1723.83 | allreduce: 24.59 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   282000/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 1.863E-06 | lm loss 1.527745E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1724.36 | allreduce: 24.60 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 282000 | LM loss: 1.607812E+00 | LM PPL: 4.991877E+00
    ------------------------------------------------------------------------------------
     iteration   282100/  300000 | elapsed time per iteration (ms): 3051.4 | learning rate 1.846E-06 | lm loss 1.523906E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1726.72 | allreduce: 27.20 | optimizer: 55.76 | batch generator: 1.95 | data loader: 1.16
     iteration   282200/  300000 | elapsed time per iteration (ms): 2446.4 | learning rate 1.828E-06 | lm loss 1.558579E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.06 | backward: 1722.98 | allreduce: 24.66 | optimizer: 55.20 | batch generator: 0.44 | data loader: 0.04
     iteration   282300/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 1.811E-06 | lm loss 1.518077E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.28 | backward: 1723.73 | allreduce: 24.43 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   282400/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 1.794E-06 | lm loss 1.517641E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.22 | backward: 1726.51 | allreduce: 27.20 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   282500/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 1.777E-06 | lm loss 1.531841E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.26 | backward: 1724.36 | allreduce: 24.93 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   282600/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 1.760E-06 | lm loss 1.545347E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.19 | backward: 1723.83 | allreduce: 24.54 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   282700/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.743E-06 | lm loss 1.537175E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.23 | backward: 1726.94 | allreduce: 27.59 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   282800/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 1.726E-06 | lm loss 1.540891E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.17 | backward: 1724.30 | allreduce: 25.28 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   282900/  300000 | elapsed time per iteration (ms): 2447.8 | learning rate 1.710E-06 | lm loss 1.541637E+00 | loss scale 262144.0 |
    time (ms) | forward: 667.97 | backward: 1723.88 | allreduce: 25.33 | optimizer: 55.76 | batch generator: 0.44 | data loader: 0.04
     iteration   283000/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 1.693E-06 | lm loss 1.539971E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.12 | backward: 1726.61 | allreduce: 27.69 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 283000 | LM loss: 1.613188E+00 | LM PPL: 5.018785E+00
    ------------------------------------------------------------------------------------
     iteration   283100/  300000 | elapsed time per iteration (ms): 3077.3 | learning rate 1.677E-06 | lm loss 1.519655E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.11 | backward: 1724.08 | allreduce: 25.22 | optimizer: 55.77 | batch generator: 30.16 | data loader: 29.37
     iteration   283200/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 1.660E-06 | lm loss 1.521262E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1724.74 | allreduce: 25.26 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   283300/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 1.644E-06 | lm loss 1.505891E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1727.24 | allreduce: 27.75 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   283400/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 1.627E-06 | lm loss 1.529963E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.01 | backward: 1724.08 | allreduce: 25.19 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   283500/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 1.611E-06 | lm loss 1.534904E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1724.31 | allreduce: 24.70 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   283600/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 1.595E-06 | lm loss 1.554706E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1726.60 | allreduce: 27.23 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   283700/  300000 | elapsed time per iteration (ms): 2449.2 | learning rate 1.579E-06 | lm loss 1.543643E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.18 | backward: 1725.02 | allreduce: 25.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   283800/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 1.563E-06 | lm loss 1.546333E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.15 | backward: 1724.94 | allreduce: 25.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   283900/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 1.547E-06 | lm loss 1.525800E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.08 | backward: 1727.06 | allreduce: 27.65 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   284000/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 1.531E-06 | lm loss 1.540205E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.26 | backward: 1724.68 | allreduce: 24.88 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 284000 | LM loss: 1.602552E+00 | LM PPL: 4.965686E+00
    ------------------------------------------------------------------------------------
     iteration   284100/  300000 | elapsed time per iteration (ms): 3080.4 | learning rate 1.516E-06 | lm loss 1.533607E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1724.43 | allreduce: 24.92 | optimizer: 55.77 | batch generator: 33.51 | data loader: 32.71
     iteration   284200/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 1.500E-06 | lm loss 1.520436E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.11 | backward: 1726.52 | allreduce: 27.32 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   284300/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 1.484E-06 | lm loss 1.508236E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.16 | backward: 1724.08 | allreduce: 24.57 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   284400/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 1.469E-06 | lm loss 1.536087E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1723.98 | allreduce: 24.31 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   284500/  300000 | elapsed time per iteration (ms): 2449.8 | learning rate 1.453E-06 | lm loss 1.529505E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.90 | backward: 1726.05 | allreduce: 27.26 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   284600/  300000 | elapsed time per iteration (ms): 2447.3 | learning rate 1.438E-06 | lm loss 1.529334E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.98 | backward: 1723.46 | allreduce: 24.57 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   284700/  300000 | elapsed time per iteration (ms): 2446.6 | learning rate 1.423E-06 | lm loss 1.546626E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.11 | backward: 1723.13 | allreduce: 24.12 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   284800/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 1.408E-06 | lm loss 1.518572E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.08 | backward: 1725.97 | allreduce: 27.12 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   284900/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 1.393E-06 | lm loss 1.541815E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.06 | backward: 1724.07 | allreduce: 24.82 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   285000/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 1.378E-06 | lm loss 1.518086E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1724.89 | allreduce: 24.89 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  285000 to checkpoints/gpt2_750m_2/iter_0285000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0285000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 285000 | LM loss: 1.614780E+00 | LM PPL: 5.026783E+00
    ------------------------------------------------------------------------------------
     iteration   285100/  300000 | elapsed time per iteration (ms): 3105.0 | learning rate 1.363E-06 | lm loss 1.531221E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.26 | backward: 1724.72 | allreduce: 24.93 | optimizer: 55.76 | batch generator: 0.85 | data loader: 0.07
     iteration   285200/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 1.348E-06 | lm loss 1.539382E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1725.11 | allreduce: 25.14 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   285300/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 1.333E-06 | lm loss 1.535619E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.17 | backward: 1727.21 | allreduce: 27.74 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   285400/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 1.318E-06 | lm loss 1.528482E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1725.17 | allreduce: 25.21 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   285500/  300000 | elapsed time per iteration (ms): 2449.8 | learning rate 1.304E-06 | lm loss 1.535023E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.40 | backward: 1725.45 | allreduce: 25.11 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   285600/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 1.289E-06 | lm loss 1.524859E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.13 | backward: 1726.43 | allreduce: 27.10 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   285700/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 1.275E-06 | lm loss 1.523768E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1724.79 | allreduce: 24.71 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   285800/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 1.260E-06 | lm loss 1.533406E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1725.07 | allreduce: 25.01 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   285900/  300000 | elapsed time per iteration (ms): 2452.1 | learning rate 1.246E-06 | lm loss 1.530317E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.27 | backward: 1727.83 | allreduce: 27.49 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   286000/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 1.232E-06 | lm loss 1.516670E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.06 | backward: 1724.42 | allreduce: 24.68 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 286000 | LM loss: 1.601203E+00 | LM PPL: 4.958993E+00
    ------------------------------------------------------------------------------------
     iteration   286100/  300000 | elapsed time per iteration (ms): 3060.8 | learning rate 1.218E-06 | lm loss 1.567446E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.09 | backward: 1724.52 | allreduce: 24.86 | optimizer: 55.76 | batch generator: 13.21 | data loader: 12.41
     iteration   286200/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 1.204E-06 | lm loss 1.521348E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1727.53 | allreduce: 27.30 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   286300/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 1.190E-06 | lm loss 1.520400E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.03 | backward: 1724.41 | allreduce: 24.79 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   286400/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 1.176E-06 | lm loss 1.533608E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.95 | backward: 1723.35 | allreduce: 24.48 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   286500/  300000 | elapsed time per iteration (ms): 2450.6 | learning rate 1.162E-06 | lm loss 1.547592E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.26 | backward: 1726.90 | allreduce: 27.20 | optimizer: 55.21 | batch generator: 0.46 | data loader: 0.04
     iteration   286600/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 1.149E-06 | lm loss 1.538184E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1724.78 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   286700/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 1.135E-06 | lm loss 1.532783E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.06 | backward: 1724.02 | allreduce: 24.60 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   286800/  300000 | elapsed time per iteration (ms): 2447.0 | learning rate 1.121E-06 | lm loss 1.542263E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.88 | backward: 1723.13 | allreduce: 24.24 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   286900/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 1.108E-06 | lm loss 1.535213E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.08 | backward: 1726.28 | allreduce: 26.83 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   287000/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 1.094E-06 | lm loss 1.551478E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.12 | backward: 1724.16 | allreduce: 24.58 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 287000 | LM loss: 1.591496E+00 | LM PPL: 4.911093E+00
    ------------------------------------------------------------------------------------
     iteration   287100/  300000 | elapsed time per iteration (ms): 3056.0 | learning rate 1.081E-06 | lm loss 1.563272E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.10 | backward: 1724.19 | allreduce: 24.87 | optimizer: 55.76 | batch generator: 9.28 | data loader: 8.50
     iteration   287200/  300000 | elapsed time per iteration (ms): 2449.8 | learning rate 1.068E-06 | lm loss 1.531792E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.91 | backward: 1725.90 | allreduce: 27.29 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   287300/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 1.055E-06 | lm loss 1.557002E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.13 | backward: 1724.08 | allreduce: 24.67 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   287400/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 1.042E-06 | lm loss 1.532808E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.08 | backward: 1724.42 | allreduce: 25.11 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   287500/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 1.029E-06 | lm loss 1.535840E+00 | loss scale 1048576.0 |
    time (ms) | forward: 667.98 | backward: 1726.61 | allreduce: 27.52 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   287600/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 1.016E-06 | lm loss 1.539356E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.14 | backward: 1724.45 | allreduce: 24.71 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   287700/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 1.003E-06 | lm loss 1.532373E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1725.07 | allreduce: 24.82 | optimizer: 55.77 | batch generator: 0.47 | data loader: 0.04
     iteration   287800/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 9.901E-07 | lm loss 1.519737E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.14 | backward: 1726.73 | allreduce: 27.11 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   287900/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 9.775E-07 | lm loss 1.516524E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.02 | backward: 1724.45 | allreduce: 24.94 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   288000/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 9.649E-07 | lm loss 1.562150E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1724.84 | allreduce: 24.90 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 288000 | LM loss: 1.585181E+00 | LM PPL: 4.880175E+00
    ------------------------------------------------------------------------------------
     iteration   288100/  300000 | elapsed time per iteration (ms): 3059.7 | learning rate 9.523E-07 | lm loss 1.522983E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.18 | backward: 1726.85 | allreduce: 26.95 | optimizer: 55.77 | batch generator: 10.15 | data loader: 9.35
     iteration   288200/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 9.399E-07 | lm loss 1.549914E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1724.50 | allreduce: 24.47 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   288300/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 9.276E-07 | lm loss 1.512751E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.23 | backward: 1724.66 | allreduce: 24.76 | optimizer: 55.77 | batch generator: 0.47 | data loader: 0.04
     iteration   288400/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 9.153E-07 | lm loss 1.533853E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1726.61 | allreduce: 26.70 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   288500/  300000 | elapsed time per iteration (ms): 2445.6 | learning rate 9.033E-07 | lm loss 1.562077E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1722.58 | allreduce: 23.63 | optimizer: 54.65 | batch generator: 0.46 | data loader: 0.04
     iteration   288600/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 8.912E-07 | lm loss 1.516451E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.09 | backward: 1724.10 | allreduce: 24.60 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   288700/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 8.792E-07 | lm loss 1.525495E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.26 | backward: 1727.15 | allreduce: 27.23 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   288800/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 8.672E-07 | lm loss 1.529547E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1725.35 | allreduce: 24.88 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   288900/  300000 | elapsed time per iteration (ms): 2447.6 | learning rate 8.555E-07 | lm loss 1.521286E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.18 | backward: 1724.03 | allreduce: 24.86 | optimizer: 55.21 | batch generator: 0.47 | data loader: 0.04
     iteration   289000/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 8.437E-07 | lm loss 1.534728E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.18 | backward: 1726.52 | allreduce: 27.08 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 289000 | LM loss: 1.604650E+00 | LM PPL: 4.976119E+00
    ------------------------------------------------------------------------------------
     iteration   289100/  300000 | elapsed time per iteration (ms): 3063.2 | learning rate 8.320E-07 | lm loss 1.546144E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1724.34 | allreduce: 24.66 | optimizer: 55.77 | batch generator: 16.04 | data loader: 15.23
     iteration   289200/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 8.204E-07 | lm loss 1.531214E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1724.65 | allreduce: 24.38 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   289300/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 8.088E-07 | lm loss 1.507834E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.15 | backward: 1726.60 | allreduce: 27.06 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   289400/  300000 | elapsed time per iteration (ms): 2446.7 | learning rate 7.975E-07 | lm loss 1.551836E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.20 | backward: 1723.14 | allreduce: 24.57 | optimizer: 55.20 | batch generator: 0.46 | data loader: 0.04
     iteration   289500/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 7.861E-07 | lm loss 1.550196E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.31 | backward: 1724.33 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   289600/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 7.748E-07 | lm loss 1.543289E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.41 | backward: 1727.24 | allreduce: 27.31 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
     iteration   289700/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 7.636E-07 | lm loss 1.530625E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.22 | backward: 1724.11 | allreduce: 24.90 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   289800/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 7.524E-07 | lm loss 1.528188E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.26 | backward: 1724.48 | allreduce: 24.98 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   289900/  300000 | elapsed time per iteration (ms): 2451.8 | learning rate 7.414E-07 | lm loss 1.551728E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.37 | backward: 1727.43 | allreduce: 27.42 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   290000/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 7.304E-07 | lm loss 1.535076E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.17 | backward: 1724.04 | allreduce: 24.90 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  290000 to checkpoints/gpt2_750m_2/iter_0290000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0290000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 290000 | LM loss: 1.601904E+00 | LM PPL: 4.962471E+00
    ------------------------------------------------------------------------------------
     iteration   290100/  300000 | elapsed time per iteration (ms): 3110.8 | learning rate 7.195E-07 | lm loss 1.534693E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.25 | backward: 1726.71 | allreduce: 27.21 | optimizer: 55.78 | batch generator: 6.05 | data loader: 5.23
     iteration   290200/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 7.087E-07 | lm loss 1.513654E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.29 | backward: 1724.41 | allreduce: 24.75 | optimizer: 55.77 | batch generator: 0.47 | data loader: 0.04
     iteration   290300/  300000 | elapsed time per iteration (ms): 2448.9 | learning rate 6.980E-07 | lm loss 1.526417E+00 | loss scale 262144.0 |
    time (ms) | forward: 668.35 | backward: 1724.57 | allreduce: 24.69 | optimizer: 55.76 | batch generator: 0.47 | data loader: 0.04
     iteration   290400/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 6.873E-07 | lm loss 1.526021E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1726.25 | allreduce: 26.39 | optimizer: 55.77 | batch generator: 0.47 | data loader: 0.04
     iteration   290500/  300000 | elapsed time per iteration (ms): 2448.7 | learning rate 6.768E-07 | lm loss 1.517145E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.41 | backward: 1724.36 | allreduce: 23.87 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   290600/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 6.663E-07 | lm loss 1.522451E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.39 | backward: 1723.82 | allreduce: 23.52 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   290700/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 6.559E-07 | lm loss 1.529230E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.07 | backward: 1725.28 | allreduce: 25.89 | optimizer: 55.77 | batch generator: 0.47 | data loader: 0.04
     iteration   290800/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 6.455E-07 | lm loss 1.542189E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.22 | backward: 1723.93 | allreduce: 24.09 | optimizer: 55.78 | batch generator: 0.47 | data loader: 0.04
     iteration   290900/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 6.353E-07 | lm loss 1.538723E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1725.35 | allreduce: 24.72 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   291000/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 6.251E-07 | lm loss 1.528582E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.07 | backward: 1726.47 | allreduce: 27.25 | optimizer: 55.78 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 291000 | LM loss: 1.614735E+00 | LM PPL: 5.026554E+00
    ------------------------------------------------------------------------------------
     iteration   291100/  300000 | elapsed time per iteration (ms): 3047.6 | learning rate 6.151E-07 | lm loss 1.532000E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.05 | backward: 1723.32 | allreduce: 23.87 | optimizer: 55.77 | batch generator: 1.16 | data loader: 0.37
     iteration   291200/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 6.051E-07 | lm loss 1.524571E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.35 | backward: 1723.70 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   291300/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 5.951E-07 | lm loss 1.524902E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.28 | backward: 1725.92 | allreduce: 25.90 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   291400/  300000 | elapsed time per iteration (ms): 2447.1 | learning rate 5.853E-07 | lm loss 1.525042E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.09 | backward: 1723.06 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   291500/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 5.756E-07 | lm loss 1.527295E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.11 | backward: 1723.44 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   291600/  300000 | elapsed time per iteration (ms): 2450.4 | learning rate 5.659E-07 | lm loss 1.525360E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1726.18 | allreduce: 25.92 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   291700/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 5.563E-07 | lm loss 1.527318E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.13 | backward: 1723.13 | allreduce: 23.39 | optimizer: 55.76 | batch generator: 0.47 | data loader: 0.04
     iteration   291800/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 5.468E-07 | lm loss 1.528836E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.25 | backward: 1724.30 | allreduce: 23.81 | optimizer: 55.77 | batch generator: 0.47 | data loader: 0.04
     iteration   291900/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 5.374E-07 | lm loss 1.527048E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1727.11 | allreduce: 26.85 | optimizer: 55.76 | batch generator: 0.47 | data loader: 0.04
     iteration   292000/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 5.280E-07 | lm loss 1.554560E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.06 | backward: 1723.18 | allreduce: 23.74 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 292000 | LM loss: 1.596514E+00 | LM PPL: 4.935798E+00
    ------------------------------------------------------------------------------------
     iteration   292100/  300000 | elapsed time per iteration (ms): 3065.9 | learning rate 5.188E-07 | lm loss 1.538140E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1724.91 | allreduce: 24.53 | optimizer: 55.77 | batch generator: 17.54 | data loader: 16.75
     iteration   292200/  300000 | elapsed time per iteration (ms): 2451.7 | learning rate 5.096E-07 | lm loss 1.514088E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1727.57 | allreduce: 27.39 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   292300/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 5.005E-07 | lm loss 1.522146E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.30 | backward: 1725.13 | allreduce: 24.87 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   292400/  300000 | elapsed time per iteration (ms): 2444.8 | learning rate 4.917E-07 | lm loss 1.526520E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.99 | backward: 1722.56 | allreduce: 24.86 | optimizer: 54.10 | batch generator: 0.46 | data loader: 0.04
     iteration   292500/  300000 | elapsed time per iteration (ms): 2450.0 | learning rate 4.828E-07 | lm loss 1.535453E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.11 | backward: 1725.95 | allreduce: 26.49 | optimizer: 55.77 | batch generator: 0.47 | data loader: 0.04
     iteration   292600/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 4.739E-07 | lm loss 1.546197E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.11 | backward: 1724.00 | allreduce: 24.78 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   292700/  300000 | elapsed time per iteration (ms): 2447.8 | learning rate 4.652E-07 | lm loss 1.525958E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.05 | backward: 1723.77 | allreduce: 24.54 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   292800/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 4.565E-07 | lm loss 1.523010E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.09 | backward: 1726.72 | allreduce: 27.24 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
     iteration   292900/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 4.479E-07 | lm loss 1.537423E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.98 | backward: 1723.61 | allreduce: 24.52 | optimizer: 55.76 | batch generator: 0.47 | data loader: 0.04
     iteration   293000/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 4.393E-07 | lm loss 1.508251E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.08 | backward: 1723.66 | allreduce: 24.50 | optimizer: 55.76 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 293000 | LM loss: 1.595849E+00 | LM PPL: 4.932515E+00
    ------------------------------------------------------------------------------------
     iteration   293100/  300000 | elapsed time per iteration (ms): 3061.5 | learning rate 4.309E-07 | lm loss 1.524305E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.04 | backward: 1726.14 | allreduce: 26.78 | optimizer: 55.77 | batch generator: 12.90 | data loader: 12.11
     iteration   293200/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 4.225E-07 | lm loss 1.526864E+00 | loss scale 524288.0 |
    time (ms) | forward: 667.95 | backward: 1723.30 | allreduce: 24.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   293300/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 4.142E-07 | lm loss 1.523844E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.10 | backward: 1723.42 | allreduce: 24.21 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   293400/  300000 | elapsed time per iteration (ms): 2450.5 | learning rate 4.060E-07 | lm loss 1.540531E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.07 | backward: 1726.45 | allreduce: 26.90 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   293500/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 3.979E-07 | lm loss 1.512540E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1724.13 | allreduce: 24.10 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   293600/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 3.899E-07 | lm loss 1.522417E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.19 | backward: 1724.33 | allreduce: 24.24 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   293700/  300000 | elapsed time per iteration (ms): 2452.3 | learning rate 3.819E-07 | lm loss 1.520466E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.46 | backward: 1727.91 | allreduce: 26.81 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   293800/  300000 | elapsed time per iteration (ms): 2449.2 | learning rate 3.740E-07 | lm loss 1.551999E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.27 | backward: 1724.94 | allreduce: 24.47 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   293900/  300000 | elapsed time per iteration (ms): 2446.3 | learning rate 3.664E-07 | lm loss 1.539032E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.09 | backward: 1723.41 | allreduce: 24.72 | optimizer: 54.65 | batch generator: 0.45 | data loader: 0.04
     iteration   294000/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 3.587E-07 | lm loss 1.501950E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.09 | backward: 1726.95 | allreduce: 27.49 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 294000 | LM loss: 1.611049E+00 | LM PPL: 5.008060E+00
    ------------------------------------------------------------------------------------
     iteration   294100/  300000 | elapsed time per iteration (ms): 3058.0 | learning rate 3.511E-07 | lm loss 1.548806E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.20 | backward: 1724.77 | allreduce: 24.81 | optimizer: 55.78 | batch generator: 10.25 | data loader: 9.46
     iteration   294200/  300000 | elapsed time per iteration (ms): 2449.1 | learning rate 3.435E-07 | lm loss 1.543346E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.25 | backward: 1724.92 | allreduce: 24.81 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   294300/  300000 | elapsed time per iteration (ms): 2451.4 | learning rate 3.360E-07 | lm loss 1.525452E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1727.25 | allreduce: 27.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   294400/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 3.287E-07 | lm loss 1.543942E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.05 | backward: 1724.15 | allreduce: 24.86 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   294500/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 3.213E-07 | lm loss 1.510058E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.21 | backward: 1724.61 | allreduce: 24.84 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   294600/  300000 | elapsed time per iteration (ms): 2452.0 | learning rate 3.141E-07 | lm loss 1.527559E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.43 | backward: 1727.63 | allreduce: 27.15 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   294700/  300000 | elapsed time per iteration (ms): 2449.3 | learning rate 3.070E-07 | lm loss 1.550235E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.37 | backward: 1725.01 | allreduce: 24.77 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   294800/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 2.999E-07 | lm loss 1.536371E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.16 | backward: 1724.08 | allreduce: 24.47 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   294900/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.930E-07 | lm loss 1.539026E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.20 | backward: 1726.83 | allreduce: 26.98 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   295000/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 2.861E-07 | lm loss 1.525711E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1725.10 | allreduce: 24.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  295000 to checkpoints/gpt2_750m_2/iter_0295000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0295000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 295000 | LM loss: 1.598327E+00 | LM PPL: 4.944754E+00
    ------------------------------------------------------------------------------------
     iteration   295100/  300000 | elapsed time per iteration (ms): 3104.3 | learning rate 2.792E-07 | lm loss 1.522055E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1725.29 | allreduce: 24.59 | optimizer: 55.77 | batch generator: 1.13 | data loader: 0.35
     iteration   295200/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 2.725E-07 | lm loss 1.533802E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1727.10 | allreduce: 26.84 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   295300/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 2.659E-07 | lm loss 1.541328E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.37 | backward: 1725.09 | allreduce: 24.28 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   295400/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 2.593E-07 | lm loss 1.543127E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.32 | backward: 1724.71 | allreduce: 24.21 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   295500/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.528E-07 | lm loss 1.547686E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.18 | backward: 1726.82 | allreduce: 26.81 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   295600/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 2.464E-07 | lm loss 1.541587E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.11 | backward: 1724.11 | allreduce: 24.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   295700/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 2.401E-07 | lm loss 1.516415E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1724.35 | allreduce: 24.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   295800/  300000 | elapsed time per iteration (ms): 2451.0 | learning rate 2.339E-07 | lm loss 1.536841E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.18 | backward: 1726.91 | allreduce: 26.87 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   295900/  300000 | elapsed time per iteration (ms): 2445.9 | learning rate 2.278E-07 | lm loss 1.523713E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.17 | backward: 1722.91 | allreduce: 24.27 | optimizer: 54.66 | batch generator: 0.45 | data loader: 0.04
     iteration   296000/  300000 | elapsed time per iteration (ms): 2448.4 | learning rate 2.217E-07 | lm loss 1.531232E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.14 | backward: 1724.26 | allreduce: 24.29 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 296000 | LM loss: 1.596946E+00 | LM PPL: 4.937928E+00
    ------------------------------------------------------------------------------------
     iteration   296100/  300000 | elapsed time per iteration (ms): 3058.3 | learning rate 2.158E-07 | lm loss 1.537546E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.14 | backward: 1726.68 | allreduce: 26.78 | optimizer: 55.77 | batch generator: 8.43 | data loader: 7.65
     iteration   296200/  300000 | elapsed time per iteration (ms): 2448.5 | learning rate 2.098E-07 | lm loss 1.515552E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.17 | backward: 1724.37 | allreduce: 24.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   296300/  300000 | elapsed time per iteration (ms): 2449.2 | learning rate 2.040E-07 | lm loss 1.540226E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.33 | backward: 1724.89 | allreduce: 24.37 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   296400/  300000 | elapsed time per iteration (ms): 2451.5 | learning rate 1.983E-07 | lm loss 1.518402E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1727.24 | allreduce: 26.95 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   296500/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 1.926E-07 | lm loss 1.511408E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.21 | backward: 1724.62 | allreduce: 24.41 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   296600/  300000 | elapsed time per iteration (ms): 2447.9 | learning rate 1.871E-07 | lm loss 1.522186E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.29 | backward: 1724.21 | allreduce: 24.36 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   296700/  300000 | elapsed time per iteration (ms): 2451.6 | learning rate 1.816E-07 | lm loss 1.527414E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.35 | backward: 1727.27 | allreduce: 26.81 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   296800/  300000 | elapsed time per iteration (ms): 2447.8 | learning rate 1.761E-07 | lm loss 1.528538E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.03 | backward: 1723.81 | allreduce: 24.44 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   296900/  300000 | elapsed time per iteration (ms): 2448.3 | learning rate 1.708E-07 | lm loss 1.530917E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.18 | backward: 1724.20 | allreduce: 24.36 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   297000/  300000 | elapsed time per iteration (ms): 2450.3 | learning rate 1.655E-07 | lm loss 1.532607E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.14 | backward: 1726.19 | allreduce: 26.51 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 297000 | LM loss: 1.605001E+00 | LM PPL: 4.977865E+00
    ------------------------------------------------------------------------------------
     iteration   297100/  300000 | elapsed time per iteration (ms): 3048.4 | learning rate 1.604E-07 | lm loss 1.554576E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1724.77 | allreduce: 24.82 | optimizer: 55.76 | batch generator: 0.86 | data loader: 0.07
     iteration   297200/  300000 | elapsed time per iteration (ms): 2448.9 | learning rate 1.553E-07 | lm loss 1.503779E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.23 | backward: 1724.69 | allreduce: 24.73 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   297300/  300000 | elapsed time per iteration (ms): 2451.1 | learning rate 1.503E-07 | lm loss 1.514281E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.17 | backward: 1726.94 | allreduce: 27.27 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   297400/  300000 | elapsed time per iteration (ms): 2449.4 | learning rate 1.480E-07 | lm loss 1.529715E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1725.10 | allreduce: 24.89 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   297500/  300000 | elapsed time per iteration (ms): 2448.9 | learning rate 1.480E-07 | lm loss 1.549809E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.30 | backward: 1724.66 | allreduce: 24.44 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   297600/  300000 | elapsed time per iteration (ms): 2450.7 | learning rate 1.480E-07 | lm loss 1.546072E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.10 | backward: 1726.68 | allreduce: 26.84 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   297700/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 1.480E-07 | lm loss 1.542286E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.16 | backward: 1724.08 | allreduce: 24.29 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   297800/  300000 | elapsed time per iteration (ms): 2449.0 | learning rate 1.480E-07 | lm loss 1.527725E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1724.71 | allreduce: 24.27 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   297900/  300000 | elapsed time per iteration (ms): 2451.3 | learning rate 1.480E-07 | lm loss 1.544193E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.22 | backward: 1727.11 | allreduce: 26.86 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   298000/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 1.480E-07 | lm loss 1.546064E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1723.62 | allreduce: 23.76 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 298000 | LM loss: 1.578382E+00 | LM PPL: 4.847105E+00
    ------------------------------------------------------------------------------------
     iteration   298100/  300000 | elapsed time per iteration (ms): 3049.6 | learning rate 1.480E-07 | lm loss 1.534136E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1725.21 | allreduce: 24.96 | optimizer: 55.76 | batch generator: 0.86 | data loader: 0.07
     iteration   298200/  300000 | elapsed time per iteration (ms): 2449.6 | learning rate 1.480E-07 | lm loss 1.518924E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.28 | backward: 1725.89 | allreduce: 26.04 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
     iteration   298300/  300000 | elapsed time per iteration (ms): 2448.2 | learning rate 1.480E-07 | lm loss 1.531920E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.35 | backward: 1723.87 | allreduce: 23.39 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   298400/  300000 | elapsed time per iteration (ms): 2448.6 | learning rate 1.480E-07 | lm loss 1.548340E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.40 | backward: 1724.23 | allreduce: 23.40 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   298500/  300000 | elapsed time per iteration (ms): 2449.7 | learning rate 1.480E-07 | lm loss 1.520524E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.12 | backward: 1725.65 | allreduce: 25.93 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   298600/  300000 | elapsed time per iteration (ms): 2448.8 | learning rate 1.480E-07 | lm loss 1.528857E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1724.56 | allreduce: 24.28 | optimizer: 55.77 | batch generator: 0.46 | data loader: 0.04
     iteration   298700/  300000 | elapsed time per iteration (ms): 2447.7 | learning rate 1.480E-07 | lm loss 1.514556E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1724.02 | allreduce: 24.43 | optimizer: 55.22 | batch generator: 0.45 | data loader: 0.04
     iteration   298800/  300000 | elapsed time per iteration (ms): 2450.8 | learning rate 1.480E-07 | lm loss 1.537302E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.43 | backward: 1726.39 | allreduce: 25.86 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   298900/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 1.480E-07 | lm loss 1.526992E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1723.32 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   299000/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 1.480E-07 | lm loss 1.537075E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.19 | backward: 1723.02 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 299000 | LM loss: 1.593881E+00 | LM PPL: 4.922819E+00
    ------------------------------------------------------------------------------------
     iteration   299100/  300000 | elapsed time per iteration (ms): 3053.4 | learning rate 1.480E-07 | lm loss 1.532703E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.42 | backward: 1726.34 | allreduce: 25.90 | optimizer: 55.77 | batch generator: 3.73 | data loader: 2.95
     iteration   299200/  300000 | elapsed time per iteration (ms): 2448.1 | learning rate 1.480E-07 | lm loss 1.536558E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.36 | backward: 1723.78 | allreduce: 23.39 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   299300/  300000 | elapsed time per iteration (ms): 2448.0 | learning rate 1.480E-07 | lm loss 1.545900E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.33 | backward: 1723.70 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   299400/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 1.480E-07 | lm loss 1.531297E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.26 | backward: 1725.86 | allreduce: 25.91 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   299500/  300000 | elapsed time per iteration (ms): 2447.2 | learning rate 1.480E-07 | lm loss 1.532396E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.13 | backward: 1723.11 | allreduce: 23.39 | optimizer: 55.76 | batch generator: 0.45 | data loader: 0.04
     iteration   299600/  300000 | elapsed time per iteration (ms): 2447.5 | learning rate 1.480E-07 | lm loss 1.536357E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.24 | backward: 1723.31 | allreduce: 23.38 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   299700/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 1.480E-07 | lm loss 1.530653E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.29 | backward: 1726.40 | allreduce: 26.78 | optimizer: 55.22 | batch generator: 0.46 | data loader: 0.04
     iteration   299800/  300000 | elapsed time per iteration (ms): 2450.2 | learning rate 1.480E-07 | lm loss 1.536035E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.36 | backward: 1725.86 | allreduce: 25.03 | optimizer: 55.78 | batch generator: 0.45 | data loader: 0.04
     iteration   299900/  300000 | elapsed time per iteration (ms): 2449.8 | learning rate 1.480E-07 | lm loss 1.528693E+00 | loss scale 1048576.0 |
    time (ms) | forward: 668.31 | backward: 1725.58 | allreduce: 25.06 | optimizer: 55.77 | batch generator: 0.45 | data loader: 0.04
     iteration   300000/  300000 | elapsed time per iteration (ms): 2450.1 | learning rate 1.480E-07 | lm loss 1.549867E+00 | loss scale 524288.0 |
    time (ms) | forward: 668.16 | backward: 1726.55 | allreduce: 27.49 | optimizer: 55.21 | batch generator: 0.45 | data loader: 0.04
    global rank 0 is saving checkpoint at iteration  300000 to checkpoints/gpt2_750m_2/iter_0300000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0300000/mp_rank_00/model_optim_rng.pt
    ----------------------------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
     validation loss at iteration 300000 | LM loss: 1.609921E+00 | LM PPL: 5.002414E+00
    ------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------
     validation loss at the end of training for val data | LM loss: 1.605040E+00 | LM PPL: 4.978060E+00
    ----------------------------------------------------------------------------------------------------
    global rank 0 is saving checkpoint at iteration  300000 to checkpoints/gpt2_750m_2/iter_0300000/mp_rank_00/model_optim_rng.pt
      successfully saved checkpoints/gpt2_750m_2/iter_0300000/mp_rank_00/model_optim_rng.pt
    Evaluating iter 100/100
    ----------------------------------------------------------------------------------------------------
    -----------------------------------------------------------------------------------------------------
     validation loss at the end of training for test data | LM loss: 1.598005E+00 | LM PPL: 4.943160E+00
    -----------------------------------------------------------------------------------------------------