Self-Correcting Strategy for Network-on-Chip (NoC) Interconnect
Reliability has traditionally been a design challenge in mission-critical electronic systems; however, this challenge is now migrating into everyday non-critical systems, where engineers must aim to design reliable systems on unreliable fabrics. The complexity of modern System-on-Chip (SoC) designs has led to the introduction of new interconnection strategies such as Network-on-Chip (NoC), which allow scalable on-chip communication between large numbers of processing components. Traditionally, fault-tolerant and repair/correction approaches have been applied to the processing components of SoCs; however, modern SoCs now devote a large amount of resources to the NoC interconnect, which is therefore a major system component and one that is susceptible to faults. As device geometries scale, device reliability decreases and the risk of faults occurring post-manufacturing rises. Ensuring that modern SoCs can accommodate faults and maintain operation post-manufacturing will require the development of autonomous systems that can self-detect and self-correct without human intervention, not only in the processing elements but also in the NoC interconnect.
This research addresses these reliability challenges and has developed a novel online fault detection mechanism and several fault-tolerant adaptive routing algorithms, operating at different levels, to enhance system fault tolerance while minimising the interruption of runtime operation (performance) during diagnosis. A novel Monitor Module (MM) is proposed to detect NoC interconnect faults using a channel tester which examines NoC channels only when they are idle and which uses a testing interval parameter based on the Binary Exponential Backoff algorithm to dynamically balance the level of testing when recovering from temporary faults. Based on the fault detection results, three adaptive routing algorithms (nearest neighbour, coarse-grained look-ahead and fine-grained look-ahead) are proposed to increase levels of fault-tolerance capability while maintaining NoC throughput. Results show that the proposed routing algorithms can achieve higher throughput than other state-of-the-art routing algorithms under various traffic patterns and levels of injected faults. For example, the approaches are able to autonomously detect faults and take corrective action to maintain NoC throughput or at least provide graceful degradation of system performance. In particular, the hardware area overheads are minimal and the power costs are low, which enables scalability to be maintained for large NoC implementations.
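The idle-only channel testing with a backoff-based testing interval described above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the Monitor Module's actual design: the class name `ChannelTester`, the cycle-based `tick` interface, and the bounds `min_interval` and `max_interval` are hypothetical. The abstract does not say in which direction the interval adapts, so the sketch assumes the classic Binary Exponential Backoff behaviour, where each failed test doubles the interval and a passing test resets it. It also uses deterministic doubling rather than the randomised window of the original algorithm.

```python
from dataclasses import dataclass


@dataclass
class ChannelTester:
    """Hypothetical idle-time channel tester with a doubling test interval."""

    min_interval: int = 1    # assumed lower bound on the interval, in cycles
    max_interval: int = 64   # assumed cap on the interval, in cycles
    interval: int = 1        # current wait between tests, in cycles
    countdown: int = 0       # cycles left before the next test is due
    faulty: bool = False     # last known state of the channel

    def tick(self, channel_idle: bool, test_passes: bool) -> bool:
        """Advance one cycle; return True if a test ran in this cycle.

        `test_passes` stands in for the outcome the hardware test would
        produce if it were run in this cycle (an assumption of this model).
        """
        if self.countdown > 0:
            # Still waiting out the current testing interval.
            self.countdown -= 1
            return False
        if not channel_idle:
            # Never test a busy channel; try again on the next idle cycle.
            return False
        if test_passes:
            # Channel is healthy (or a temporary fault has cleared):
            # reset the interval to its minimum.
            self.faulty = False
            self.interval = self.min_interval
        else:
            # Assumed policy: a persisting fault doubles the interval,
            # capped at max_interval, so testing backs off over time.
            self.faulty = True
            self.interval = min(self.interval * 2, self.max_interval)
        self.countdown = self.interval
        return True
```

In this model a routing algorithm would read `faulty` to decide whether to steer traffic away from the channel, while the growing interval limits how often a persistently faulty channel is re-tested.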