Introduction
Deep neural networks have become one of the most significant workloads because of their unprecedented performance improvements in many fields. Nevertheless, how to run deep neural network models efficiently has remained a significant issue since the revival of deep learning. Compounding the issue are increasingly sophisticated network structures, along with the irregularity introduced by network pruning and compression, which hinders the deployment of state-of-the-art DNNs onto hardware. This inherently emphasizes the need for tailored physical synthesis techniques, because the quality of physical synthesis directly impacts the performance of neural network processors.
As neural network processors become increasingly hierarchical and structured, dataflow optimization becomes essential to boost system capability. Dataflow optimization schedules operations by data availability, which exposes opportunities for parallelism and data reuse. As a successful demonstration, Zhang et al. perform a quantitative analysis of the computing throughput and memory bandwidth required under different traditional optimization methods, including loop tiling and transformation, and then utilize the roofline model to balance resources.
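To make the roofline analysis concrete, the sketch below is a minimal illustration of the model, not the authors' actual tool; the platform numbers and the example kernel are hypothetical. Attainable throughput is bounded by either the peak compute rate or the memory bandwidth scaled by arithmetic intensity.

```python
# Minimal roofline-model sketch: attainable throughput is bounded by
# either peak compute or memory bandwidth times arithmetic intensity.
# All platform numbers below are hypothetical placeholders.

PEAK_GFLOPS = 4096.0   # peak compute throughput (GFLOP/s), hypothetical
PEAK_BW_GBS = 512.0    # peak off-chip memory bandwidth (GB/s), hypothetical

def attainable_gflops(flops: float, bytes_moved: float) -> float:
    """Roofline bound for a kernel with the given FLOP count and traffic."""
    intensity = flops / bytes_moved          # FLOP per byte of DRAM traffic
    return min(PEAK_GFLOPS, PEAK_BW_GBS * intensity)

# Example: a convolution tiling performing 2e9 FLOPs while moving 50 MB.
print(attainable_gflops(2e9, 50e6))          # compute-bound here: 4096.0
```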
Alwani et al. fuse the processing of consecutive convolutional layers, keeping intermediate data on-chip, and exploit a pyramid-shaped multi-layer sliding window to reduce off-chip transfers. A subsequent work analyzes convolution loop acceleration strategies by numerically characterizing the loop optimization techniques and then employing several optimization algorithms to optimize the loop operation and dataflow.
Neural Network Processor
In a deep neural network processor, running inference for operators such as convolution entails an extremely large number of multiply-accumulate operations: a single convolution iterates over every channel and every pixel for each input, often requiring billions or even trillions of operations. In addition, the model must be run once per new input. Central processing units are good at handling highly serialized streams of instructions, but machine learning workloads are highly parallelizable, which makes them well suited to graphics processing units. Neural processing units enjoy even simpler logic, because the computation patterns of deep neural networks are highly regular.
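For a sense of scale, the back-of-the-envelope calculation below counts the multiply-accumulate operations in a single convolutional layer; the layer dimensions are hypothetical.

```python
# MAC count of one convolutional layer: every output pixel of every
# output channel accumulates over all input channels and kernel positions.
def conv_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
    return out_h * out_w * out_ch * in_ch * k_h * k_w

# Hypothetical layer: 56x56 output, 256 output channels,
# 128 input channels, 3x3 kernel.
macs = conv_macs(56, 56, 256, 128, 3, 3)
print(f"{macs:.3e} MACs")   # ~9.25e8 for one layer; a network with
                            # dozens of such layers quickly reaches
                            # tens of billions of MACs per input
```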
For these reasons, numerous specialized neural processors have been designed. An NPU is a well-partitioned circuit that comprises all the control and arithmetic logic elements required to run machine learning algorithms. NPUs are optimized to speed up typical machine learning applications such as image recognition, machine translation, object detection, and many other predictive models. NPUs can be part of a large SoC, multiple NPUs can be realized on one chip, or they can form a standalone neural-network accelerator.
Physical Design Flow
Physical design relies on a netlist that is synthesized from an RTL design into a gate-level description. Typically, the physical design flow is split into several steps: floor planning, partitioning, placement, clock-tree synthesis, routing, physical verification, and layout post-processing with mask data generation. Floor planning, placement, and routing are the most critical steps. Floor planning defines the geometric relationships between modules to meet goals such as area, wire length, and required performance.
A poor floorplan results in wasted die area and routing congestion. Regarding circuit performance, a smaller area is typically desirable, since it implies shorter interconnect distances, fewer routing resources consumed, faster end-to-end signal paths, and even quicker and more predictable place-and-route runtimes. However, routing can become more challenging when fewer routing resources are available. Generally, floor planning is aided by hierarchy information such as data paths. Placement is also a critical phase in physical design.
A poor placement not only degrades chip performance but can also render the chip non-manufacturable, with a wire length far beyond the available routing resources. Hence, placement is always driven by targets that ensure the circuit achieves its performance requirements. Routing takes the placement result and maps out the wires that connect the placed devices correctly according to all the design rules of the integrated circuit. Collectively, the placement and routing stages of integrated circuit design are referred to as place and route.
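A standard proxy for routing demand during placement is the half-perimeter wirelength (HPWL) of each net. The sketch below shows the textbook formulation; the pin coordinates are hypothetical.

```python
# Half-perimeter wirelength (HPWL): for each net, the half-perimeter of
# the bounding box of its pins; summed over all nets, it is the usual
# wirelength objective optimized during placement.
def hpwl(pins):
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

# Hypothetical 3-pin net.
net = [(2.0, 5.0), (7.0, 1.0), (4.0, 9.0)]
print(hpwl(net))   # (7-2) + (9-1) = 13.0
```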
Placement with Datapath Constraints
The concept of data path-driven placement dates back at least to a 1990 paper that addresses automatic bit-sliced data path generation for high-speed DSP circuits. The data path is made up of multi-bit functional building blocks (FBBs) such as adders and registers. The linear placement tool proposed there produces a linear sequence of the FBBs to reduce layout area.
In that paper, the ordering solution space is modeled as a directed acyclic graph so that orderings can be searched using the A* algorithm. The algorithm performs well and is much faster than metaheuristics such as simulated annealing, and the authors stress that it is easy to adapt to different cost functions.
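The sketch below is a minimal illustration of searching an ordering space with A*, not the paper's exact formulation: the cost model (wiring cost between adjacent blocks) and the trivial zero heuristic are our own simplifications, and the block names and cost table are hypothetical.

```python
import heapq

# A*-style search over linear orderings of functional building blocks.
# State: tuple of blocks placed so far. Edge cost: wiring cost between
# the last placed block and the newly appended one (a simplification).
# Heuristic: 0, a trivially admissible bound, so this degenerates to
# uniform-cost search while keeping the A* skeleton.
def best_ordering(blocks, wire_cost):
    start = ()
    frontier = [(0.0, start)]            # (g + h, partial ordering)
    best_g = {start: 0.0}
    while frontier:
        f, order = heapq.heappop(frontier)
        if len(order) == len(blocks):    # first complete pop is optimal
            return order, best_g[order]
        for b in blocks:
            if b in order:
                continue
            step = wire_cost(order[-1], b) if order else 0.0
            nxt = order + (b,)
            g = best_g[order] + step
            if g < best_g.get(nxt, float("inf")):
                best_g[nxt] = g
                heapq.heappush(frontier, (g + 0.0, nxt))   # h = 0
    return None, float("inf")

# Hypothetical 4-block example with a symmetric pairwise cost table.
cost = {frozenset(p): c for p, c in
        [(("A", "B"), 1), (("A", "C"), 4), (("A", "D"), 3),
         (("B", "C"), 2), (("B", "D"), 5), (("C", "D"), 1)]}
order, total = best_ordering(["A", "B", "C", "D"],
                             lambda u, v: cost[frozenset((u, v))])
print(order, total)   # ('A', 'B', 'C', 'D') 4.0
```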
Subsequently, data path-driven standard cell placement was introduced. In this work, highly interconnected subcircuits, i.e., cones, are identified using a breadth-first algorithm augmented with heuristic rules. The cones are handled as soft macro cells and are placed using a macro-cell placement algorithm to minimize the inter-cone wiring length. Macros are then mapped back into cells by a mapping subsystem that maintains the topological relationships among them.
It is contended that if the data path is created independently and merely combined with the netlists of other components, the placement tool has minimal control over the precise location where a cell can be placed, and the regularity information is lost. Datapath structure has also been taken into account in detailed placement, where a modified O-tree-based placer can place components on reflection lines while respecting design rules, as well as in SoC physical design and in parallel multiplier design.
Many more works on data path-driven general ASIC design have been introduced. In one of them, data path clusters are allocated under the constraints that the relative positions of the clusters must follow the order of the dataflow, the relative orientations of the clusters must follow the order of bits, and the same bit order must be maintained along the dataflow. The authors refer to this as '1.5-dimensional placement' and suggest solving it with linear placement heuristics.
A density model based on a sigmoid function has been proposed for independent optimization in the horizontal and vertical directions. Blocks of every functional stage are placed vertically, exploiting the regularity of data paths, which reduces the number of variables in the optimization problem. In another work, data path macros are placed together with other random-logic blocks by an analytical placement algorithm, while the relative positions of bit slices can be tuned within the data path macros to minimize overall wire length. Experiments have demonstrated that these methods achieve better wire length and/or routability.
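As an illustration of the sigmoid-based density idea, the sketch below is our own simplified one-dimensional version, not the cited model: the hard 0/1 occupancy of a cell is smoothed into a differentiable density usable by an analytical optimizer. The sharpness parameter alpha is hypothetical.

```python
import math

# Smoothed 1-D cell density: the hard 0/1 occupancy of a cell spanning
# [center - w/2, center + w/2] is approximated by a product of two
# sigmoids, giving a differentiable density for analytical placement.
def sigmoid(x, alpha=8.0):
    return 1.0 / (1.0 + math.exp(-alpha * x))

def smooth_density(x, center, width, alpha=8.0):
    left, right = center - width / 2, center + width / 2
    return sigmoid(x - left, alpha) * sigmoid(right - x, alpha)

# Density sampled at a few points for a cell of width 2 centered at 5.
for x in (3.0, 4.0, 5.0, 6.0, 7.0):
    print(x, round(smooth_density(x, 5.0, 2.0), 3))
# ~0 outside the cell, ~1 inside, smooth transitions near the edges
```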
Techniques for Regularity Extraction
In addition to data path-driven methodologies, several techniques for regularity extraction have been introduced. We overview some representative methods in this section. Naturally, as defined in earlier work, cells belonging to the same bit slice are arranged horizontally, and cells of the same type occurring at roughly the same location are stacked vertically to form stages. The circuit is thereby accommodated in a matrix of rectangular buckets, which yields a maximum-density cell placement. Beyond this geometric regularity, interconnect regularity means that nearly all nets fall within one slice or one stage.
In one approach, a local regularity measure is derived from the distribution of pin counts, and a regularity extraction algorithm propagates search waves through the network, stage by stage, based on this measure. A signature-based regularity extraction algorithm was later proposed. The signature of a random-logic instance is determined by its master cell and its connectivity to data path instances. A connectivity cost function is then defined in terms of some objective, e.g., the vertical distance between two pins.
The random-logic instances are sorted according to their signatures and divided into blocks sharing the same signature. Lastly, the blocks are synthesized, taking the connectivity cost into account. The authors also introduce a relaxed form of the signature. Template covering of a circuit is another research direction: in addition to assuming a library of given templates, Chowdhary et al. introduce a method to automatically derive all possible templates for the input circuit.
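To illustrate the signature idea, the sketch below is a simplified rendering, not the paper's exact definition: a signature is built from each instance's master cell and its connectivity to data path instances, and instances sharing a signature form one block. All instance and net names are hypothetical.

```python
from collections import defaultdict

# Signature-based grouping: each random-logic instance gets a signature
# from its master cell plus the data path structures it connects to;
# instances with identical signatures are grouped into one block.
instances = {
    "u1": ("AND2", {"dp_add[0]"}),
    "u2": ("AND2", {"dp_add[0]"}),
    "u3": ("AND2", {"dp_add[1]"}),
    "u4": ("INV",  {"dp_add[1]"}),
}

def signature(master, dp_conns):
    # Normalize away bit indices so instances in different bit slices
    # of the same data path structure share a signature.
    return (master, frozenset(n.split("[")[0] for n in dp_conns))

groups = defaultdict(list)
for name, (master, conns) in instances.items():
    groups[signature(master, conns)].append(name)

for sig, members in groups.items():
    print(sig, members)
# ('AND2', frozenset({'dp_add'})) -> ['u1', 'u2', 'u3']
# ('INV',  frozenset({'dp_add'})) -> ['u4']
```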
Conclusion
In this paper, we have discussed physical synthesis for state-of-the-art neural network processors. Owing to the regularity of neural network processors, we reviewed prior literature on data path-driven placement, which takes circuit topology as a physical design input. We also examined a wafer-scale deep learning accelerator placement instance, a case study on specialized physical synthesis for next-generation neural network processors. Experimental findings demonstrate that data path-driven floorplans significantly surpass conventional methods such as simulated annealing. Advanced neural network processor design technologies are further discussed.
Advanced technologies have proved to be highly capable of addressing scaling problems. Owing to page constraints, we briefly introduce a few of them and direct readers to a more extensive survey. Processing-in-memory (PIM) offers massive parallelism and high energy efficiency, providing new solutions to problems in contemporary computer systems. Prior work has shown that neural network computation is possible using most of the emerging non-volatile memories, such as RRAM, STT-MRAM, PCM, and memristors.
Analog in-memory computing is another promising candidate, e.g., memristor-crossbar-based and FTJ-based designs. By eliminating data movement between the processor and the memory, such accelerators greatly boost the performance and energy efficiency of neural network execution. However, many systems require an external controlling device, which may diminish the advantage of PIM.
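As a minimal numerical illustration of the analog in-memory idea, the sketch below models an idealized memristor crossbar; the conductance values are hypothetical, and the model ignores wire resistance, device noise, and limited conductance ranges. Applying input voltages along the rows yields column currents equal to a matrix-vector product by Kirchhoff's current law.

```python
import numpy as np

# Idealized memristor-crossbar matrix-vector multiply: weights are
# stored as conductances G (siemens); input activations are applied as
# row voltages V; each column current is I_j = sum_i V_i * G_ij, so the
# crossbar computes I = V @ G in a single analog step.
G = np.array([[1.0e-6, 2.0e-6],    # conductance matrix (weights)
              [3.0e-6, 0.5e-6],
              [2.0e-6, 1.0e-6]])
V = np.array([0.2, 0.5, 0.1])      # row input voltages (activations)

I = V @ G                          # column currents = MVM result
print(I)                           # amperes, one entry per column
```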