T4 Performance Counters explained
- by user13346607
Now that T4 has been out for a few months, some people might have wondered which details of the new pipeline can be monitored. Running "cpustat -h" lists many events that can be monitored, and only very few of them are self-explanatory. I will try to give some insight into all of them. Some of these "PIC events" require in-depth knowledge of the T4 pipeline; I will try to explain these over time, and for the time being they should simply be ignored. (Side note: some counters changed from tape-out 1.1 (*only* used in the T4 beta program) to tape-out 1.2 (used in the systems shipping today). The table lists only the tape-out 1.2 counters.)
   
 
   
   
     
       
         
pic name (cpustat)
    Prose Comment

Sel-pipe-drain-cycles, Sel-0-[wait|ready], Sel-[1|2]
    Sel-0-wait counts the cycles a strand waits to be selected. Some of the reasons can be counted in detail:
        Sel-0-ready: Cycles a strand was ready but not selected, which can signal pipeline oversubscription
        Sel-1: Cycles in which only one instruction or µop was selected
        Sel-2: Cycles in which two instructions or µops were selected
        Sel-pipe-drain-cycles: cf. PRM footnote 8 to table 10.2

Pick-any, Pick-[0|1|2|3]
    Cycles in which no (Pick-0), one, two, or three instructions or µops are picked. Pick-any counts cycles in which at least one instruction or µop is picked.
         
       
       
         
Instr_FGU_crypto
    Number of FGU or crypto instructions executed on that vcpu

Instr_ld
    Number of load instructions executed on that vcpu

Instr_st
    Number of store instructions executed on that vcpu

SPR_ring_ops
    Number of SPR ring ops executed on that vcpu

Instr_other
    Number of all other instructions not listed above that were executed on that vcpu. PRM footnote 7 to table 10.2 lists the instructions.

Instr_all
    Total number of instructions executed on that vcpu
         
       
       
         
Sw_count_intr
    Number of S/W count instructions on that vcpu (sethi %hi(fc000),%g0 (whatever that is))

Atomics
    Number of atomic ops, which are LDSTUB/A, CASA/XA, and SWAP/A

SW_prefetch
    Number of PREFETCH or PREFETCHA instructions

Block_ld_st
    Block loads or stores on that vcpu
         
       
       
         
IC_miss_nospec, IC_miss_[L2_or_L3|local|remote]_hit_nospec
    Various I$ misses, distinguished by where they hit. All of these count per thread, but only primary events: T4 counts only the first occurrence of an I$ miss on a core for a certain instruction. If one strand misses in the I$, this miss is counted, but if a second strand on the same core misses while the first miss is being resolved, that second miss is not counted.
    This flavour of I$ misses counts only misses that are caused by instructions that really commit (note the "_nospec").

BTC_miss
    Branch target cache miss

ITLB_miss
    ITLB misses (synchronously counted)

ITLB_miss_asynch
    ITLB misses, counted asynchronously
         
       
       
         
[I|D]TLB_fill_[8KB|64KB|4MB|256MB|2GB|trap]
    H/W tablewalk (HWTW) events that fill the ITLB or DTLB with a translation for the corresponding page size. The "_trap" event occurs if the HWTW was not able to fill the corresponding TLB.

IC_mtag_miss, IC_mtag_miss_[ptag_hit|ptag_miss|ptag_hit_way_mismatch]
    I$ micro tag misses, with some options for drill-down

Fetch-0, Fetch-0-all
    Fetch-0 counts the number of cycles in which nothing was fetched for this particular strand. Fetch-0-all counts the cycles in which nothing was fetched for all strands on a core.

Instr_buffer_full
    Cycles the instruction buffer for a strand was full, thereby preventing any fetch

BTC_targ_incorrect
    Counts all occurrences of wrongly predicted branch targets from the BTC
         
       
       
         
[PQ|ROB|LB|ROB_LB|SB|ROB_SB|LB_SB|RB_LB_SB|DTLB_miss]_tag_wait
    ST_q_tag_wait is listed under sl=20.
    These counters monitor pipeline behaviour and are therefore not strand specific:
        PQ_...: cycles the Rename stage waits for a Pick Queue tag (might signal a memory-bound workload in single-thread mode, cf. mail from Richard Smith)
        ROB_...: cycles the Select stage waits for a ROB (ReOrderBuffer) tag
        LB_...: cycles the Select stage waits for a Load Buffer tag
        SB_...: cycles the Select stage waits for a Store Buffer tag
        Combinations of the above are allowed. Some of these events can overlap, but the counter is incremented only once per cycle if any of them occurs.
        DTLB_...: cycles load or store instructions wait at the Pick stage for a DTLB miss tag
           
         
       
       
         
[I|D]TLB_HWTW_[L2_hit|L3_hit|L3_miss|all]
    Counters for HWTW accesses caused by either DTLB or ITLB misses. Can be further detailed by where they hit.

IC_miss_L2_L3_hit, IC_miss_local_remote_remL3_hit, IC_miss
    I$ prefetches that were dropped because they miss in either L2$ or L3$.
    This variant counts misses regardless of whether the causing instruction commits.

DC_miss_nospec, DC_miss_[L2_L3|local|remote_L3]_hit_nospec
    D$ misses, either in general or detailed by where they hit.
    See the two flavours of IC_miss for an explanation of "_nospec" and for the reasoning behind having two sets of DC_miss counters.

DTLB_miss_asynch
    Counts all DTLB misses asynchronously. There is no way to count them synchronously.
         
       
       
         
DC_pref_drop_DC_hit, SW_pref_drop_[DC_hit|buffer_full]
    DC_pref_drop_DC_hit counts L1-D$ h/w prefetches that were dropped because of a D$ hit, counted per core.
    The others count dropped software prefetches per strand.

[Full|Partial]_RAW_hit_st_[buf|q]
    Count events where a load wants data that has not yet been stored, i.e. data that is still inside the pipeline. The data might still be either in the store buffer (SB) or in the store queue. If the load's data matches in both the SB and the store queue, the data in the buffer takes precedence, since it is younger.

[IC|DC]_evict_invalid, [IC|DC|L1]_snoop_invalid, [IC|DC|L1]_invalid_all
    Counters for invalidated cache evictions per core

St_q_tag_wait
    Number of cycles the pipeline waits for a store queue tag, counted per core
         
       
       
         
Data_pref_[drop_L2|drop_L3|hit_L2|hit_L3|hit_local|hit_remote]
    Data prefetches, further detailed either by why they were dropped or by where they hit

St_hit_[L2|L3], St_L2_[local|remote]_C2C, St_local, St_remote
    Store events distinguished by where they hit or where they cause an L2 cache-to-cache transfer, i.e. a transfer either from another L2$ on the same die or from a different die

DC_miss, DC_miss_[L2_L3|local|remote]_hit
    D$ misses, either in general or detailed by where they hit.
    See the two flavours of IC_miss for an explanation of "_nospec" and for the reasoning behind having two sets of DC_miss counters.
         
       
       
         
L2_[clean|dirty]_evict
    Per core clean or dirty L2$ evictions

L2_fill_buf_full, L2_wb_buf_full, L2_miss_buf_full
    Per core L2$ buffer events. All of them count the number of cycles that this state was present.

L2_pipe_stall
    Per core cycles the pipeline stalled because of L2$
         
       
       
         
Branches
    Counts branches (Tcc, DONE, RETRY, and SIT are not counted as branches)

Br_taken
    Counts taken branches (Tcc, DONE, RETRY, and SIT are not counted as branches)

Br_mispred, Br_dir_mispred, Br_trg_mispred, Br_trg_mispred_[far_tbl|indir_tbl|ret_stk]
    Counters for various branch misprediction events
         
       
       
         
Cycles_user
    Counts cycles. The attribute settings hpriv, nouser, and sys control which address space is counted.

Commit-[0|1|2], Commit-0-all, Commit-1-or-2
    Number of times either no, one, or two µops commit for a strand. Commit-0-all counts the number of times no µop commits for the whole core. See footnote 11 to table 10.2 in the PRM for a more detailed explanation of how these counters interact with the privilege levels.
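
Raw counts from the table are usually easier to read as ratios, for example the share of loads in Instr_all or the fraction of Branches that were mispredicted. The following is only a minimal sketch of that arithmetic: it assumes the counter values have already been collected (however that was done) into a dictionary keyed by the pic names above. The helper names `ratio` and `instruction_mix` and all sample numbers are hypothetical and are not taken from a real T4 system.

```python
def ratio(counts, numerator, denominator):
    """Return counts[numerator] / counts[denominator], or None if undefined."""
    denom = counts.get(denominator, 0)
    if denom == 0 or numerator not in counts:
        return None
    return counts[numerator] / denom


def instruction_mix(counts):
    """Return each instruction class as a fraction of Instr_all."""
    classes = ["Instr_FGU_crypto", "Instr_ld", "Instr_st",
               "SPR_ring_ops", "Instr_other"]
    return {name: ratio(counts, name, "Instr_all") for name in classes}


# Hypothetical sample values, made up for illustration only.
sample = {
    "Instr_all": 1000,
    "Instr_FGU_crypto": 100,
    "Instr_ld": 250,
    "Instr_st": 150,
    "SPR_ring_ops": 10,
    "Instr_other": 490,
    "Branches": 200,
    "Br_taken": 120,
    "Br_mispred": 10,
}

mix = instruction_mix(sample)
mispredict_rate = ratio(sample, "Br_mispred", "Branches")
taken_rate = ratio(sample, "Br_taken", "Branches")

for name, share in mix.items():
    print(f"{name}: {share:.1%}")
print(f"Br_mispred / Branches: {mispredict_rate:.1%}")
print(f"Br_taken / Branches: {taken_rate:.1%}")
```

Returning None instead of raising keeps the helper usable when a counter was not part of a given measurement run; callers then have to check for None before formatting the value.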