Xiangshan_BPU_FTB

Introduction

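FTB is the third sub-predictor of the Xiangshan BPU, and it also receives the outputs of uFTB and TAGE-SC. On the FTB input interface, the s1 channel carries the basic prediction result from uFTB. In the s2 and s3 channels only the br_taken_mask signal group has been filled in by TAGE-SC; they do not yet contain a basic prediction result generated from an FTB entry. The job of FTB is to supply that basic prediction result for the s2 and s3 channels.

FTB is similar to uFTB in function and structure. The main differences are that FTB holds more FTB entries and that its prediction results are output on the s2 and s3 channels. Because of its larger capacity, FTB reads out more slowly than uFTB and cannot produce a prediction in the first cycle, but the larger capacity also makes its predictions more accurate.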

Functional overview

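Cache more FTB entries and provide basic prediction results for the s2 and s3 channels. The FTB predictor is essentially a fairly large memory: it reads out the FTB entry that corresponds to the PC being predicted and produces the prediction result in stage s2. The same FTB entry is held for one more cycle to generate the stage s3 prediction result.
Update the stored FTB entries according to update requests.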

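Receiving requests

In stage S0, FTB sends a read request to the internal FTBBank; the request PC is the PC passed in at S0.
In S1, the cycle after the request is sent, the multi-way data read from the FTB SRAM is latched.
In S2, a hit signal is generated by comparing the tag of each way in the latched data with the tag of the actual request, and on a hit the matching FTB data is selected. If a hit exists, the returned value is the selected FTB entry together with the hit way information.

Inside the FTB module, the data read from the FTBBank module is passed to the following predictors as the stage 2 prediction result over combinational wiring in the same cycle. The read result is also latched inside the FTB module and passed to the following predictors again over combinational wiring as the stage 3 prediction result. If FTB hits, the index of the hit way is also sent in s3 as meta information, together with the hit information and the cycle count, to the downstream FTQ module.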
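The S2 tag comparison described above can be sketched as a plain Scala software model. This is only an illustration, not the Chisel implementation: the address split (one dropped low bit, then 9 index bits, then 20 tag bits) and the names `FtbLookupModel`, `Way`, `getIdx`, `getTag` and `lookup` are assumptions made for this sketch.

```scala
// Software model of the S0-S2 read path; not the Chisel implementation.
object FtbLookupModel {
  // Values taken from the SRAM specification in this document.
  val numSets   = 512
  val numWays   = 4
  val tagLength = 20
  // Assumption: one low bit is dropped because instructions are 2-byte aligned.
  val instOffsetBits = 1
  val idxBits        = 9 // log2(512)

  // One way of a set; `entry` stands in for the stored FTB entry.
  final case class Way(valid: Boolean, tag: BigInt, entry: String)

  // Assumed address split: [ tag | set index | instOffsetBits ].
  def getIdx(pc: BigInt): Int =
    ((pc >> instOffsetBits) & BigInt(numSets - 1)).toInt

  def getTag(pc: BigInt): BigInt =
    (pc >> (instOffsetBits + idxBits)) & ((BigInt(1) << tagLength) - BigInt(1))

  // S2: compare the tag of every way with the request tag and select the hit way.
  // Returns Some((hit way index, entry)) on a hit, None on a miss.
  def lookup(sets: Vector[Vector[Way]], pc: BigInt): Option[(Int, String)] = {
    val ways   = sets(getIdx(pc)) // models the data latched in S1
    val tag    = getTag(pc)
    val hitWay = ways.indexWhere(w => w.valid && w.tag == tag)
    if (hitWay >= 0) Some((hitWay, ways(hitWay).entry)) else None
  }
}
```

Returning the way index together with the entry mirrors the text: the hit way is what later travels to FTQ as meta information.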

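Data update

After receiving an update request, FTB decides when to update based on the hit information in meta. If meta indicates a hit, the update happens in the current cycle; otherwise it is delayed by two cycles, until the existing content of FTB has been read out.

Inside FTBBank, the behavior on an update request also differs between the immediate and the delayed case. For an immediate update, the SRAM write channel inside FTBBank is asserted and the write completes with the given information.

For a delayed update, FTBBank first receives an update read request and reads out the data in the next cycle. The way that hits the given address is selected and its index is passed to the outer FTB module. If that cycle misses, the entry has to be written into an allocated way in the next cycle. The way selection rule is: if all ways are already occupied, the replacement algorithm picks the way.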
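The update timing and way selection above can be summarized in a small Scala model. It is a sketch under assumptions: the text only states what happens when all ways are occupied, so choosing the first free way otherwise is an assumption, and `victim` is a hypothetical parameter standing in for whatever the replacement algorithm returns.

```scala
object FtbUpdateModel {
  // Cycles to wait before the SRAM write: immediate on a meta hit,
  // two cycles later otherwise (the existing content is read out first).
  def writeDelay(metaHit: Boolean): Int = if (metaHit) 0 else 2

  // Way selection for the write.
  // hitWay:    way that hit during the update read, if any
  // validWays: validWays(i) is true when way i already holds an entry
  // victim:    hypothetical stand-in for the replacement algorithm's choice
  def chooseWay(hitWay: Option[Int], validWays: Seq[Boolean], victim: Int): Int =
    hitWay.getOrElse {
      // Assumption: prefer the first free way; use the victim only when all ways are full.
      val free = validWays.indexOf(false)
      if (free >= 0) free else victim
    }
}
```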

SRAM specification

Single bank, 512 sets, 4 ways, single-port SRAM, no read hold, with power-on reset. 20-bit tag, 60-bit FTB entry.

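FTB storage structure

FTB entry

Field	Width (bits)	Meaning
total	62	
valid	1	valid bit
brSlot	21	information of the first branch
tailSlot	29	information of the second branch
pftAddr	4	end address of the prediction block
carry	1	whether the high bits of the end address carry
isCall, isRet, isJalr	3	branch type of tailSlot
(unnamed)	1	special flag bit for the RAS
(unnamed)	2	strong bias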

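FTB slot

Where two values are given, the first applies to brSlot and the second to tailSlot.

Field	Width (bits)	Meaning
total	21/29	
valid	1	valid bit
offset	4	
lower	12/20	
tarStat	2	
sharing	1	
isRVC	1	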
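As a quick consistency check, the widths listed in the two tables can be summed in Scala. The object and function names are made up for this sketch; the numbers come from the tables.

```scala
object FtbWidthCheck {
  // Slot: valid + offset + lower + tarStat + sharing + isRVC
  def slotWidth(lowerBits: Int): Int = 1 + 4 + lowerBits + 2 + 1 + 1

  val brSlotWidth: Int   = slotWidth(12) // lower is 12 bits in brSlot
  val tailSlotWidth: Int = slotWidth(20) // lower is 20 bits in tailSlot

  // Entry: valid + brSlot + tailSlot + pftAddr + carry
  //        + (isCall, isRet, isJalr) + RAS flag bit + strong bias
  val entryWidth: Int = 1 + brSlotWidth + tailSlotWidth + 4 + 1 + 3 + 1 + 2
}
```

The sums come out to 21 and 29 bits for the two slot types and 62 bits for the whole entry, matching the totals in the tables.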

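Target address generation logic

For each slot, one of (PC high bits + 1, PC high bits - 1, PC high bits) is selected according to the three possible states of the high bits (carry / borrow / unchanged) and concatenated with the stored low bits of the target address.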
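The selection and concatenation described above can be modelled in plain Scala with BigInt arithmetic. This is a sketch rather than the hardware code: `VAddrBits = 39` is an assumed value for illustration, the status encodings 0/1/2 follow the TAR_FIT/TAR_OVF/TAR_UDF constants shown later in this document, and `FtbTargetModel` is a made-up name.

```scala
object FtbTargetModel {
  val VAddrBits = 39 // assumption for illustration; the real value comes from the core configuration
  val TAR_FIT = 0    // high bits unchanged
  val TAR_OVF = 1    // high bits + 1 (carry)
  val TAR_UDF = 2    // high bits - 1 (borrow)

  // Rebuild a target address from the PC, the stored low bits and the status.
  // offLen is the number of stored low bits (12 for a branch slot, 20 for a jump).
  def getTarget(pc: BigInt, lower: BigInt, stat: Int, offLen: Int): BigInt = {
    val highBits = VAddrBits - offLen - 1
    val highMask = (BigInt(1) << highBits) - BigInt(1)
    val h        = (pc >> (offLen + 1)) & highMask
    val higher = stat match {
      case TAR_OVF => (h + BigInt(1)) & highMask
      case TAR_UDF => (h - BigInt(1)) & highMask
      case _       => h
    }
    val lowMask = (BigInt(1) << offLen) - BigInt(1)
    // Concatenate: high bits, stored low bits, and a constant 0 for bit 0.
    (higher << (offLen + 1)) | ((lower & lowMask) << 1)
  }
}
```

For example, with `pc = 0x80001000` and `offLen = 12`, the stored pair `(lower = 0x002, stat = TAR_OVF)` rebuilds the target `0x80002004`.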

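Update flow

Entry generation
- Read the following from FTQ:
  - startAddr and old_entry;
  - pd, the predecode information of all branch instructions within the 32 bytes of the FTQ entry;
  - cfiIndex, the actual jump result of the valid instructions in the FTQ entry, including whether a jump was taken and the offset of the jumping instruction relative to startAddr;
  - the jump address (execution result) of the branch instruction (such as a jump) in the FTQ entry;
  - whether FTB really hit at prediction time (whether the old FTB entry is valid);
  - the misprediction mask for all possible instructions in the FTQ entry.
Write flow

2.1 Write conditions: no write is needed if the new FTB entry is completely unchanged, or if FTB missed but uFTB hit. A write is needed if the new FTB entry has changed and the case is not "uFTB hit, FTB miss".

Pipeline diagram of the write to SRAM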
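The write condition in 2.1 reduces to a single boolean expression. The sketch below is a plain Scala restatement of that rule; the function and parameter names are made up for illustration.

```scala
object FtbWriteCondition {
  // entryChanged: the newly generated FTB entry differs from the old one
  // ftbHit:       FTB hit at prediction time
  // uftbHit:      uFTB hit at prediction time
  // A write is skipped when nothing changed, or when FTB missed but uFTB hit.
  def needWrite(entryChanged: Boolean, ftbHit: Boolean, uftbHit: Boolean): Boolean =
    entryChanged && !(uftbHit && !ftbHit)
}
```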

Chisel code analysis

trait FTBParams extends HasXSParameter with HasBPUConst {
  val numEntries = FtbSize              // FtbSize: Int = 2048
  val numWays    = FtbWays              // FtbWays: Int = 4
  val numSets    = numEntries / numWays // 512
  val tagLength  = FtbTagLength         // FtbTagLength: Int = 20

  val TAR_STAT_SZ = 2                    // width of the target status field: 2 bits
  def TAR_FIT     = 0.U(TAR_STAT_SZ.W)   // 00: fit, target high bits equal the PC high bits
  def TAR_OVF     = 1.U(TAR_STAT_SZ.W)   // 01: overflow, target high bits = PC high bits + 1
  def TAR_UDF     = 2.U(TAR_STAT_SZ.W)   // 10: underflow, target high bits = PC high bits - 1

  def BR_OFFSET_LEN  = 12                // number of target low bits stored for a conditional branch
  def JMP_OFFSET_LEN = 20                // number of target low bits stored for a jump

  def FTBCLOSE_THRESHOLD_SZ = log2Ceil(500)                  // width of the threshold counter
  def FTBCLOSE_THRESHOLD    = 500.U(FTBCLOSE_THRESHOLD_SZ.W) // can be modified; threshold constant
}

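These constants describe how the high bits of the target address relate to the high bits of the PC:

TAR_FIT: the target high bits equal the PC high bits (no adjustment)
TAR_OVF: the target high bits are above the PC high bits (PC high bits + 1 is used)
TAR_UDF: the target high bits are below the PC high bits (PC high bits - 1 is used)

In the hardware implementation:

The status value is stored in 2 bits
A Mux1H multiplexer selects the matching target address computation according to the status
These states drive the later address generation logic; for example, the getTarget method of the FtbSlot class adjusts the high part of the address according to them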

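BR_OFFSET_LEN (12 bits)
Length of the target low-bits field for conditional branches; it determines how many low bits of the branch target are stored. For example:
- The field stores target bits [12:1], i.e. 2^12 = 4096 distinct values
JMP_OFFSET_LEN (20 bits)
Length of the target low-bits field for jumps, which need a larger range than conditional branches:
- The field stores target bits [20:1], i.e. 2^20 = 1,048,576 distinct values
- A longer field allows jump targets over a larger range to be stored
FTBCLOSE_THRESHOLD (500)
- Threshold that controls the mechanism for closing FTB (the Fetch Target Buffer)
- When a counter reaches 500, the logic that closes FTB requests is triggered
- log2Ceil(500) gives a width of 9 bits (because 2^8 = 256 < 500 < 512 = 2^9)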
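The width arithmetic above can be reproduced without Chisel. The helper below is a plain Scala stand-in written for this sketch; it is intended to give the same result as `log2Ceil` for positive inputs, but it is not the library function itself.

```scala
object ThresholdWidth {
  // Smallest w such that 2^w >= n, for n >= 1.
  def log2Ceil(n: Int): Int = {
    require(n > 0, "n must be positive")
    32 - Integer.numberOfLeadingZeros(n - 1)
  }

  val thresholdBits: Int = log2Ceil(500) // width needed for the FTB close threshold
  val brLowerValues: Int  = 1 << 12      // values representable in a 12-bit lower field
  val jmpLowerValues: Int = 1 << 20      // values representable in a 20-bit lower field
}
```

A 9-bit counter is enough because 500 lies between 2^8 and 2^9; a threshold above 512 would need a wider field.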

class FtbSlot_FtqMem(implicit p: Parameters) extends XSBundle with FTBParams {
  val offset  = UInt(log2Ceil(PredictWidth).W) // instruction offset within the prediction block
  val sharing = Bool()                         // sharing flag
  val valid   = Bool()                         // valid flag
}

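offset field
- Width: log2Ceil(PredictWidth) bits
- Purpose: stores the position of the instruction within the prediction block
- Example: with PredictWidth = 8 (8 instructions predicted per cycle), offset is 3 bits (0 to 7); the 4-bit offset listed in the FTB slot table is consistent with PredictWidth = 16
- Used in FTB to locate the exact position of a branch instruction within the instruction block
sharing field
- Boolean
- Purpose: indicates that the slot is shared, i.e. that tailSlot records a conditional branch instead of a jump
- Typical scenarios:
  - jmpValid is tailSlot.valid && !tailSlot.sharing, so a shared tailSlot is not treated as a jump
  - setByBrTarget sets the flag when the last branch index (numBr - 1) is stored in tailSlot
valid field
- Boolean
- Purpose: indicates whether the slot contains valid prediction information
- Role:
  - Controls the validity of the prediction result
  - Used in slot allocation and replacement
  - Filters out invalid prediction entries during prediction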

class FtbSlot(val offsetLen: Int, val subOffsetLen: Option[Int] = None)(implicit p: Parameters) extends FtbSlot_FtqMem
    with FTBParams {
  if (subOffsetLen.isDefined) {
    require(subOffsetLen.get <= offsetLen) // the sub-offset length must not exceed offsetLen
  }
  val lower   = UInt(offsetLen.W)
  val tarStat = UInt(TAR_STAT_SZ.W)

  // Encode a target into this slot: the low bits plus a status derived from comparing the high bits
  def setLowerStatByTarget(pc: UInt, target: UInt, isShare: Boolean) = {
    // target high bits above the PC high bits -> overflow; below -> underflow; equal -> fit
    def getTargetStatByHigher(pc_higher: UInt, target_higher: UInt) =
      Mux(target_higher > pc_higher, TAR_OVF, Mux(target_higher < pc_higher, TAR_UDF, TAR_FIT))
    // take target bits [offsetLen:1]; bit 0 is dropped because instructions are at least 2-byte aligned
    def getLowerByTarget(target: UInt, offsetLen: Int) = target(offsetLen, 1)
    val offLen        = if (isShare) this.subOffsetLen.get else this.offsetLen    // offset length depends on the sharing flag
    val pc_higher     = pc(VAddrBits - 1, offLen + 1)                             // PC high bits (used for the range comparison)
    val target_higher = target(VAddrBits - 1, offLen + 1)                         // target high bits
    val stat          = getTargetStatByHigher(pc_higher, target_higher)           // status: carry, borrow or unchanged
    val lower         = ZeroExt(getLowerByTarget(target, offLen), this.offsetLen) // low bits, zero-extended to the full field width
    // write the results into the fields of this slot
    this.lower   := lower     // computed low bits of the target
    this.tarStat := stat      // result of the high-bits comparison
    this.sharing := isShare.B // sharing flag
  }

  def getTarget(pc: UInt, last_stage: Option[Tuple2[UInt, Bool]] = None) = {
    def getTarget(offLen: Int)(pc: UInt, lower: UInt, stat: UInt, last_stage: Option[Tuple2[UInt, Bool]] = None) = {
      val h                = pc(VAddrBits - 1, offLen + 1)
      val higher           = Wire(UInt((VAddrBits - offLen - 1).W))
      val higher_plus_one  = Wire(UInt((VAddrBits - offLen - 1).W))
      val higher_minus_one = Wire(UInt((VAddrBits - offLen - 1).W))

      // Switch between previous stage pc and current stage pc
      // Give flexibility for timing
      if (last_stage.isDefined) {
        val last_stage_pc   = last_stage.get._1
        val last_stage_pc_h = last_stage_pc(VAddrBits - 1, offLen + 1)
        val stage_en        = last_stage.get._2
        higher           := RegEnable(last_stage_pc_h, stage_en)
        higher_plus_one  := RegEnable(last_stage_pc_h + 1.U, stage_en)
        higher_minus_one := RegEnable(last_stage_pc_h - 1.U, stage_en)
      } else {
        higher           := h
        higher_plus_one  := h + 1.U
        higher_minus_one := h - 1.U
      }
      val target =
        Cat(
          Mux1H(Seq(
            (stat === TAR_OVF, higher_plus_one),
            (stat === TAR_UDF, higher_minus_one),
            (stat === TAR_FIT, higher)
          )),
          lower(offLen - 1, 0),
          0.U(1.W)
        )
      require(target.getWidth == VAddrBits)
      require(offLen != 0)
      target
    }
    if (subOffsetLen.isDefined)
      Mux(
        sharing,
        getTarget(subOffsetLen.get)(pc, lower, tarStat, last_stage),
        getTarget(offsetLen)(pc, lower, tarStat, last_stage)
      )
    else
      getTarget(offsetLen)(pc, lower, tarStat, last_stage)
  }
  def fromAnotherSlot(that: FtbSlot) = {
    require(
      this.offsetLen > that.offsetLen && this.subOffsetLen.map(_ == that.offsetLen).getOrElse(true) ||
        this.offsetLen == that.offsetLen
    )
    this.offset  := that.offset
    this.tarStat := that.tarStat
    this.sharing := (this.offsetLen > that.offsetLen && that.offsetLen == this.subOffsetLen.get).B
    this.valid   := that.valid
    this.lower   := ZeroExt(that.lower, this.offsetLen)
  }

  def slotConsistent(that: FtbSlot) =
    VecInit(
      this.offset === that.offset,
      this.lower === that.lower,
      this.tarStat === that.tarStat,
      this.sharing === that.sharing,
      this.valid === that.valid
    ).reduce(_ && _)

}

Branch target address computation
Branch status management
Data consistency and copying
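To illustrate how setLowerStatByTarget and getTarget fit together, the sketch below models both in plain Scala and checks that encoding followed by decoding returns the original target. It is a software model under assumptions: `VAddrBits = 39` is an illustrative value, addresses are assumed to fit in VAddrBits, and `FtbSlotCodec`, `Encoded`, `encode` and `decode` are names invented for this sketch.

```scala
object FtbSlotCodec {
  val VAddrBits = 39 // assumption for illustration
  val TAR_FIT = 0
  val TAR_OVF = 1
  val TAR_UDF = 2

  final case class Encoded(lower: BigInt, stat: Int)

  // Models setLowerStatByTarget: compare the high bits, keep target bits [offLen:1].
  def encode(pc: BigInt, target: BigInt, offLen: Int): Encoded = {
    val pcHigh  = pc >> (offLen + 1)
    val tgtHigh = target >> (offLen + 1)
    val stat =
      if (tgtHigh > pcHigh) TAR_OVF
      else if (tgtHigh < pcHigh) TAR_UDF
      else TAR_FIT
    val lower = (target >> 1) & ((BigInt(1) << offLen) - BigInt(1))
    Encoded(lower, stat)
  }

  // Models getTarget: pick high, high + 1 or high - 1, then append the low bits and a 0.
  def decode(pc: BigInt, e: Encoded, offLen: Int): BigInt = {
    val highMask = (BigInt(1) << (VAddrBits - offLen - 1)) - BigInt(1)
    val h        = (pc >> (offLen + 1)) & highMask
    val higher = e.stat match {
      case TAR_OVF => (h + BigInt(1)) & highMask
      case TAR_UDF => (h - BigInt(1)) & highMask
      case _       => h
    }
    (higher << (offLen + 1)) | (e.lower << 1)
  }
}
```

In this model the round trip only holds when the target high bits differ from the PC high bits by at most one, since the status can express nothing beyond +1, 0 or -1.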

class FTBEntry_part(implicit p: Parameters) extends XSBundle with FTBParams with BPUUtils {
  val isCall = Bool()
  val isRet  = Bool()
  val isJalr = Bool()

  def isJal = !isJalr
}

Defines the basic attributes of a branch prediction entry, describing the type of the instruction.

class FTBEntry_FtqMem(implicit p: Parameters) extends FTBEntry_part with FTBParams with BPUUtils { // FTBEntry

  val brSlots  = Vec(numBrSlot, new FtbSlot_FtqMem) // branch slots, numBrSlot of them
  val tailSlot = new FtbSlot_FtqMem                 // tail slot

  def jmpValid =
    tailSlot.valid && !tailSlot.sharing // tailSlot holds a valid jump

  def getBrRecordedVec(offset: UInt) = // vector of which slots record a branch at the given offset
    VecInit(
      brSlots.map(s => s.valid && s.offset === offset) :+ // check every brSlot
        (tailSlot.valid && tailSlot.offset === offset && tailSlot.sharing)
    )

  def brIsSaved(offset: UInt) = getBrRecordedVec(offset).reduce(_ || _) // whether a branch is recorded at this offset

  def getBrMaskByOffset(offset: UInt) =	
    brSlots.map { s =>
      s.valid && s.offset <= offset
    } :+
      (tailSlot.valid && tailSlot.offset <= offset && tailSlot.sharing)

  def newBrCanNotInsert(offset: UInt) = { // whether a new branch cannot be inserted
    val lastSlotForBr = tailSlot
    lastSlotForBr.valid && lastSlotForBr.offset < offset
  }

}

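def brIsSaved(offset: UInt) = getBrRecordedVec(offset).reduce(_ || _) calls getBrRecordedVec, which returns a sequence of booleans, one per slot, telling whether that slot records a branch at the given offset, for example [true, false, true]. .reduce(_ || _) ORs the elements together, so the result is true as soon as one element is true.
def getBrMaskByOffset(offset: UInt) takes an offset (an unsigned integer) and produces a boolean sequence indicating which branch slots satisfy the condition.

For each brSlot it checks that valid is true and that s.offset is less than or equal to the given offset, which yields a boolean sequence. One more term is then appended to that sequence: tailSlot.valid && tailSlot.offset <= offset && tailSlot.sharing.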
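The query helpers above can be tried out with ordinary Scala collections. The sketch below replaces the hardware types with Boolean and Int and keeps the method names from FTBEntry_FtqMem; the case classes `Slot` and `Entry` are simplified stand-ins made up for this example.

```scala
object FtbEntryQueryModel {
  final case class Slot(valid: Boolean, offset: Int, sharing: Boolean)

  final case class Entry(brSlots: Seq[Slot], tailSlot: Slot) {
    // One flag per slot: does it record a branch exactly at `offset`?
    def getBrRecordedVec(offset: Int): Seq[Boolean] =
      brSlots.map(s => s.valid && s.offset == offset) :+
        (tailSlot.valid && tailSlot.offset == offset && tailSlot.sharing)

    // True if any slot records a branch at `offset`.
    def brIsSaved(offset: Int): Boolean = getBrRecordedVec(offset).reduce(_ || _)

    // One flag per slot: does it record a branch at or before `offset`?
    def getBrMaskByOffset(offset: Int): Seq[Boolean] =
      brSlots.map(s => s.valid && s.offset <= offset) :+
        (tailSlot.valid && tailSlot.offset <= offset && tailSlot.sharing)

    // tailSlot counts as a jump only when it is not shared with a branch.
    def jmpValid: Boolean = tailSlot.valid && !tailSlot.sharing

    // A new branch cannot be inserted after a valid tailSlot.
    def newBrCanNotInsert(offset: Int): Boolean =
      tailSlot.valid && tailSlot.offset < offset
  }
}
```

With one branch at offset 2 in brSlots and a shared tailSlot at offset 5, `getBrMaskByOffset(3)` marks only the first slot, while `getBrMaskByOffset(5)` marks both.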

class FTBEntry(implicit p: Parameters) extends FTBEntry_part with FTBParams with BPUUtils

def getSlotForBr(idx: Int): FtbSlot = {
  require(idx <= numBr - 1)	
  (idx, numBr) match {                               // match on the index
    case (i, n) if i == n - 1 => this.tailSlot       // the last index maps to tailSlot
    case _                    => this.brSlots(idx)   // otherwise return brSlots(idx)
  }
}

Given an index idx, returns the corresponding slot.

require(idx <= numBr - 1): idx must be less than or equal to numBr - 1; otherwise an exception is thrown.

def allSlotsForBr =
  (0 until numBr).map(getSlotForBr(_))

Returns a sequence containing every slot that can hold a branch: all brSlots followed by tailSlot.
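The index-to-slot mapping can be illustrated with a small Scala model in which slots are represented by their names. `numBr = 2` is an assumption chosen to match the entry layout of one brSlot plus tailSlot; the object name is made up for this sketch.

```scala
object SlotIndexModel {
  // Assumption: two branch positions, i.e. one brSlot plus the tailSlot.
  val numBr = 2

  // The last index maps to tailSlot, every other index to brSlots(idx).
  def getSlotForBr(idx: Int): String = {
    require(idx <= numBr - 1)
    if (idx == numBr - 1) "tailSlot" else s"brSlots($idx)"
  }

  // Every slot that can hold a branch, in index order.
  def allSlotsForBr: Seq[String] = (0 until numBr).map(getSlotForBr)
}
```

An index above `numBr - 1` fails the `require` check and raises an IllegalArgumentException.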

def setByBrTarget(brIdx: Int, pc: UInt, target: UInt) = {
  val slot = getSlotForBr(brIdx)
  slot.setLowerStatByTarget(pc, target, brIdx == numBr - 1) // store the target's low bits and status; the last index shares tailSlot
}

Updates the target address state of the branch slot selected by brIdx.

def setByJmpTarget(pc: UInt, target: UInt) =
  this.tailSlot.setLowerStatByTarget(pc, target, false)

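Updates the target address state of tailSlot for a jump; isShare is false because the slot holds a jump rather than a shared branch.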

def getTargetVec(pc: UInt, last_stage: Option[Tuple2[UInt, Bool]] = None)

val h_br                  = pc(VAddrBits - 1, BR_OFFSET_LEN + 1)           // PC high bits above the branch offset field
val higher_br             = Wire(UInt((VAddrBits - BR_OFFSET_LEN - 1).W))  // high bits used for branch targets
val higher_plus_one_br    = Wire(UInt((VAddrBits - BR_OFFSET_LEN - 1).W))  // branch high bits + 1
val higher_minus_one_br   = Wire(UInt((VAddrBits - BR_OFFSET_LEN - 1).W))  // branch high bits - 1
val h_tail                = pc(VAddrBits - 1, JMP_OFFSET_LEN + 1)          // PC high bits above the jump offset field
val higher_tail           = Wire(UInt((VAddrBits - JMP_OFFSET_LEN - 1).W)) // high bits used for jump targets
val higher_plus_one_tail  = Wire(UInt((VAddrBits - JMP_OFFSET_LEN - 1).W)) // jump high bits + 1
val higher_minus_one_tail = Wire(UInt((VAddrBits - JMP_OFFSET_LEN - 1).W)) // jump high bits - 1

if (last_stage.isDefined) {
     val last_stage_pc                  = last_stage.get._1
     val stage_en                       = last_stage.get._2
     val last_stage_pc_higher           = RegEnable(last_stage_pc(VAddrBits - 1, JMP_OFFSET_LEN + 1), stage_en)
     val last_stage_pc_middle           = RegEnable(last_stage_pc(JMP_OFFSET_LEN, BR_OFFSET_LEN + 1), stage_en)
     val last_stage_pc_higher_plus_one  = RegEnable(last_stage_pc(VAddrBits - 1, JMP_OFFSET_LEN + 1) + 1.U, stage_en)
     val last_stage_pc_higher_minus_one = RegEnable(last_stage_pc(VAddrBits - 1, JMP_OFFSET_LEN + 1) - 1.U, stage_en)
     val last_stage_pc_middle_plus_one =
       RegEnable(Cat(0.U(1.W), last_stage_pc(JMP_OFFSET_LEN, BR_OFFSET_LEN + 1)) + 1.U, stage_en)
     val last_stage_pc_middle_minus_one =
       RegEnable(Cat(0.U(1.W), last_stage_pc(JMP_OFFSET_LEN, BR_OFFSET_LEN + 1)) - 1.U, stage_en)