Paper 98-35C-11-5

# High-Speed Pipelined Memory Architecture for Gigabit ATM Packet Switching

Gab Joong Jeong\* and Moon Key Lee\* (鄭甲重, 李文基)

\*Regular members, Dept. of Electronic Engineering, Yonsei University

※ This research was supported by research funding from Hyundai Electronics Industries Co., Ltd.

Received: September 7, 1998; revision completed: October 24, 1998

**Summary**

This paper proposes and designs a high-speed pipelined memory architecture for a shared buffer ATM switch. The proposed architecture supports fast operation and capacity scalability, and thereby overcomes the memory cycle time limitation of a shared buffer ATM switch. The capacity scalability of the memory gives the ATM switch scalable switching performance. The architecture is a 2-D array of small memory banks. Memory capacity is increased by adding memory banks, which enlarges the array. The designed pipelined memory uses sixteen 4160-bit memory banks in a 4×4 array, for a total capacity of 65 Kbit. In post-layout simulation, the maximum operating speed is a 4 ns cycle time at V<sub>dd</sub> = 5 V and 25°C. The designed memory is embedded in a prototype chip of a shared scalable buffer ATM switch. The prototype chip is designed in a 0.6 µm 2-metal 1-poly CMOS process technology.

### **Abstract**

This paper describes a high-speed pipelined memory architecture for a shared buffer ATM switch. The memory architecture provides high speed and scalability. It eliminates the memory cycle time restriction of a shared buffer ATM switch, and its scalability gives the switch versatile performance. It consists of a 2-D array of small memory banks; enlarging the array enlarges the total memory capacity. The maximum cycle time of the designed pipelined memory is 4 ns at V<sub>dd</sub> = 5 V and 25°C. It is embedded in the prototype chip of a shared scalable buffer ATM switch as a $4 \times 4$ configuration of 4160-bit SRAM memory banks. The chip is integrated in a 0.6 µm 2-metal 1-poly CMOS technology.

### I. Introduction

Asynchronous transfer mode (ATM) has been selected as the multiplexing and switching technique for use in the Broadband Integrated Services Digital Network (B-ISDN). Switch element design is an important issue in providing efficient and reliable transport for the various services in the B-ISDN.
Among the various architectures of the ATM switch element, the shared buffer ATM switch architecture provides the best performance [1,2]. However, there are physical limitations in designing a shared buffer ATM switch. One of them is the restriction on buffer memory cycle time. With a single-port buffer memory, the buffer has to operate 2N times faster than the port speed of an N×N shared buffer ATM switch. A high-speed memory architecture is therefore very important for a shared buffer ATM switch.

Many techniques have been developed to improve memory cycle time [3-5]. Those techniques improve the cycle time of general-purpose high-density memory, but they do not consider scalability. Scalability in the buffer memory is important for versatile ATM switch performance: it makes a shared buffer ATM switch re-configurable in both switch size and switching performance.

We have proposed and designed a scalable pipelined memory for fast ATM packet switching [6-8]. The proposed memory architecture divides a large memory block into small memory banks and configures them as a 2-D array. It has been embedded in a shared scalable buffer ATM switch. Its high-speed cycle time breaks through the memory cycle time restriction of a shared buffer ATM switch. It supports scalability without circuit redesign, which gives the ATM switch flexible switching performance. The 2-D array of small memory banks is scaled by adding memory banks and decoders, without changing the memory cycle time. A gated clock in each memory bank minimizes the increase in power consumption as the array grows.

Section II illustrates the pipelined memory architecture and data propagation, and describes the address decoding scheme and the systolic data flow. Detailed circuit design is shown in Section III. Section IV describes the advantages and disadvantages.
Conclusions are presented in Section V.

### II. Architecture

### 1. Pipelined Memory Architecture

The proposed memory architecture for a fast ATM packet switch divides a large buffer memory into small memory banks and arranges the banks in a 2-D array. Fig. 1 shows the entire architecture. It consists of a primary decoder, column and row decoders, small memory banks, and output buffers. The primary decoder partially decodes the input address. It generates the column and row branch addresses, which are used in the column and row decoders respectively, and a pipeline depth, which is used in each SRAM-based small memory bank. The column and row decoders select a column and a row in the 2-D array of memory banks. At every cycle, each memory bank chooses between two operations: propagation and internal memory access. It communicates with adjacent memory banks in three directions. Each output buffer propagates valid data to the data bus on a 'read' operation.

Fig. 1. Architecture of pipelined memory.

There is no output data conflict because only one output buffer gets valid data at every cycle. Fig. 2 shows the three-directional data flows. It represents the systolic data flow of the proposed memory architecture, in which a module in each pipeline stage loads its control signals one cycle after the previous modules.

Fig. 2. Three-directional data flows of the proposed memory architecture.

### 2. Address Decoding and Propagation

The inputs of the pipelined memory are data (DATA), address (ADDR), 'R/W', enable (ME), mode (MD), and port address (PA). The primary decoder decodes parts of the input address and generates special pipeline control signals according to the algorithm in Fig. 3. It generates the column branch address (CBA), row branch address (RBA), and pipeline depth (PD). It transfers the row address (RA) to the row decoder for word-line selection.
It additionally propagates the mode (MD) signal and the port address (PA) to the column decoder. The MD and PA data are used in the address controller of the designed ATM switch.

Each column decoder compares CBA, which is transferred by the primary decoder or the previous column decoder, with zero. If CBA is zero, the column decoder generates a one-bit signal, the column branch trigger (CBT), and transfers the CBT together with DATA to the adjacent memory bank in the vertical direction. It also transfers PD to the adjacent memory bank. It transfers the next CBA and PD, each decreased by 1 from its input value, together with DATA to the next column decoder in the horizontal direction.

Each row decoder compares RBA, which is transferred by the primary decoder or the previous row decoder, with zero. If RBA is zero, the row decoder generates the row branch trigger (RBT) signal and transfers the RBT with the pre-decoded word lines to the adjacent memory bank in the horizontal direction. It transfers the next RBA, decreased by 1 from its input RBA, together with the row address (RA) to the next row decoder in the vertical direction. The CBT and RBT signals go low when the memory enable (ME) propagated by the primary decoder is low.

```
Algorithm: Primary Decoding(RBA, CBA, PD, ADDR)
x = log2(N) bits of ADDR;
y = another log2(N) bits of ADDR;
Temp = x - y;
if (Temp >= ZERO) {
    RBA = |Temp|;
    CBA = 0;
    PD = x;
} else {
    RBA = 0;
    CBA = |Temp|;
    PD = y;
}
```

Fig. 3. Algorithm of the primary decoder.

Each memory bank transfers its input data to the next adjacent memory banks when the CBT or RBT is high, following the systolic three-directional data flow. It compares PD with zero. If PD is zero and both CBT and RBT are high in a memory bank, the memory bank accesses its own internal memory cells according to the 'R/W' signal. The memory bank transfers the next PD, decreased by 1 from its input PD, to the next adjacent memory banks. Output buffers located at the last pipeline stage transfer valid DATA to the data bus.
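
The decoder chain and the per-bank decision described above can be sketched in Python. This is a simplified behavioral sketch, not the authors' circuit: the function names, the `"idle"` label, and the wraparound of the decrementor to log2(N) bits are assumptions made for illustration.

```python
def firing_stage(branch_addr: int, n: int) -> int:
    """Index of the decoder stage that raises its trigger (CBT or RBT).

    Each stage compares the incoming branch address with zero and forwards
    the address decreased by 1 to the next stage.
    """
    fired = []
    addr = branch_addr
    for stage in range(n):
        if addr == 0:            # zero detector raises the trigger
            fired.append(stage)
        addr = (addr - 1) % n    # decrementor; log2(n)-bit wraparound is assumed
    assert len(fired) == 1       # exactly one stage fires per access
    return fired[0]


def bank_action(pd: int, cbt: bool, rbt: bool) -> str:
    """Decision a memory bank takes in one cycle, per the text above."""
    if pd == 0 and cbt and rbt:
        return "access"          # read or write the bank's own cells
    if cbt or rbt:
        return "propagate"       # pass data on to the adjacent banks
    return "idle"                # hypothetical label for an untriggered bank
```

Under these assumptions a branch address `b` fires the trigger at stage `b`, so `firing_stage(2, 4)` returns 2, and a bank only accesses its cells when PD has counted down to zero with both triggers high.
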
Only one output buffer is activated to transfer valid DATA at every cycle, namely when both CBT and RBT are high and the 'R/W' signal indicates 'read' mode. In 'read' mode, the output data of the scalable pipelined memory come out after an initial latency of N + 3 cycles from the read address input cycle, for an N×N array configuration. In 'write' mode, the input data are stored within N + 3 cycles. The scalable pipelined memory does not need any redundant cycles for the mode change between 'read' and 'write'. The timing diagram of input and output data is shown in Fig. 4. It shows the initial latency of 7 cycles for a 4×4 array configuration of small memory banks.

Fig. 4. Timing diagram of the pipelined memory with $4\times4$ array configuration.

### III. VLSI Implementation

### 1. Address Decoders and Output Buffer

The designed pipelined memory decodes the input address in three types of decoder: the primary decoder, the column decoder, and the row decoder. All of these blocks have input and output latches. All input latches transfer new input data during the negative level of the clock. All output latches transfer internal data during the positive level of the clock.

The block diagram of the primary decoder is shown in Fig. 5. It consists of a subtractor, a 2's complement block, and some gates for signal selection according to the algorithm in Fig. 3. The signal selection method is very simple: all the pipeline control data, that is, the column branch address (CBA), row branch address (RBA), and pipeline depth (PD), are determined by the 'borrow' signal. The bit size of CBA, RBA, and PD is determined by the array configuration of memory banks in the pipelined memory. It is $\log_2 N$ bits for the $N\times N$ array configuration. In the prototype chip they are 2 bits wide because the pipelined memory has the $4\times 4$ configuration of memory banks.

Fig. 5. Block diagram of the primary decoder.
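
The borrow-based selection in the primary decoder can be illustrated with a short Python sketch of the algorithm in Fig. 3. The function names are hypothetical, and `x` and `y` are passed in directly because the source does not state which ADDR bits form each field.

```python
def primary_decode(x: int, y: int, n: int = 4):
    """Return (RBA, CBA, PD) for an n x n array, following Fig. 3.

    Modeled as a log2(n)-bit subtractor whose borrow output selects the
    results, plus a 2's-complement block that recovers |x - y|.
    """
    mask = n - 1                 # n is assumed to be a power of two
    diff = (x - y) & mask        # subtractor output, truncated to log2(n) bits
    borrow = x < y               # borrow-out of the subtractor
    if borrow:
        # x - y is negative: the 2's complement of diff equals |x - y|
        return 0, (-diff) & mask, y
    return diff, 0, x


def initial_latency(n: int) -> int:
    """Initial read latency in cycles for an n x n array (N + 3 in the text)."""
    return n + 3
```

For example, `primary_decode(1, 3)` yields `(0, 2, 3)` and `primary_decode(3, 1)` yields `(2, 0, 3)`, while `initial_latency(4)` gives the 7 cycles shown in Fig. 4.
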
Column and row decoders are simpler than the primary decoder. Fig. 6 shows the block diagrams of the column and row decoders. They have input and output latches like the primary decoder. The column decoder has two decrementors to decrease PD and CBA by 1. It has a zero detector that generates the column branch trigger (CBT) for the memory bank when CBA is zero. It propagates additional data, the port address (PA), through the pipeline stages. The address controller of the shared scalable buffer ATM switch accepts the propagated PA to select the output port of the current cell.

The row decoder has one decrementor to decrease RBA by 1. It has a zero detector that generates the row branch trigger (RBT) for the memory bank when RBA is zero. It has a row address (RA) decoder that pre-decodes the word-lines for fast memory access. The 7-bit RA is used for selecting one word-line in the memory bank, but one bit of the RA is reserved because the designed memory bank of the prototype chip has 64 word-lines.

The output buffer checks CBT and RBT. If CBT and RBT are high and R/W is high, the output buffer transfers data to the output data bus of the buffer memory.

Fig. 6. Block diagrams of decoders: (a) column decoder (b) row decoder.

### 2. SRAM Based Memory Bank

We have designed an SRAM-based small memory bank with 65 bit-lines and 64 word-lines for the shared buffer of an ATM switch. Selecting one word-line stores 65 bits of data, made up of 64 bits of ATM cell data and 1 bit of next cell address data. The address controller of the prototype ATM switch uses the next cell address to maintain the output queue chains for every output port of the switch. One 64-byte ATM cell and one 8-bit next cell address occupy 8 word-lines. One memory bank can therefore store eight 64-byte cells and eight 1-byte next cell addresses. We have used 16 memory banks to configure a 4×4 array, so the designed buffer can store 128 cells. Fig. 7 shows the block diagram of one memory bank.

Fig. 7. Block diagram of one memory bank.

The memory bank consists of input and output latches, one decrementor to decrease PD by 1, a control logic block, and memory cells. The control logic block controls the memory bank operation. It generates internal control signals for 'read', 'write', and propagation of data. It controls the sense amplifiers and bit-lines for reading, writing, and propagating. It pre-charges the bit-lines of the memory columns and the load lines of the sense amplifiers to V<sub>dd</sub>. Because all the memory banks are pipelined, word-line capacitance does not increase as the array configuration grows. The cycle time of a designed memory bank is 4 ns, including the input and output latch delays with output load capacitance.

The circuit diagram of one memory column is shown in Fig. 8. We have used a typical 6-transistor SRAM cell and added an additional word-line for propagation at each SRAM cell. The designed single-stage sense amplifier is a p-MOS cross-coupled sense amplifier. The memory column circuit has no steady-state current path, which minimizes the power consumption of the memory. Fig. 9 shows the simulation results of one SRAM memory bank.

Fig. 8. Circuit diagram of one memory column.

Fig. 9. HSPICE simulation plot of one SRAM memory bank.

To minimize the power consumed in the pipelined memory by dynamic signal transitions, we have used a gated clock in each memory bank. The gated clock disables the I/O latches of inactive memory banks. The gated clocking scheme in a memory bank is shown in Fig. 10. We have separated the I/O latches in a memory bank into vertical and horizontal latches to reduce clock skew. Vertical and horizontal data are independent in a memory bank. Data transmission in a memory bank is decided by the trigger signals, CBT and RBT.
If the CBT signal is low, no vertical data are loaded into the input and output latches. If the RBT signal is low, no horizontal data are loaded into the input and output latches. Fig. 11 shows the floor plan and the layout of one memory bank.

Fig. 10. Gated clocking scheme of one memory bank.

Fig. 11. Layout of one memory bank.

### IV. Experimental Results and Discussions

### 1. Characteristics

The scalable pipelined memory proposed in this paper is embedded in a shared scalable buffer ATM switch. The experimental shared scalable buffer ATM switch chip with switch size 4 is shown in Fig. 12.

Fig. 12. Full chip layout of a 4×4 shared scalable buffer ATM switch with 128-cell buffer (0.6 µm 2-metal 1-poly CMOS technology, 1.02M transistors).

The chip supports 640 Mbps at each port of the ATM switch. It has a pipelined memory of 65 Kbit SRAM for the shared scalable buffer of 128 cells with 128 next cell addresses. It is designed in a 0.6 µm twin-well single-poly double-metal CMOS technology. The core size is $10.6 \times 10.6\ \mathrm{mm}^2$. It has about 1 million transistors. The buffer memory is a full-custom design. The routing area of the buffer memory takes 15% of the total memory area, and it can be shrunk with better technology. General low-power and high-speed SRAM circuit techniques can be applied to the pipelined memory to improve its memory cycle time.

The operating frequency of the prototype chip is 80 MHz by post-layout simulation. It can be enhanced by speed optimization of the control circuit. The cycle time of the designed pipelined memory is 4 ns at V<sub>dd</sub> = 5 V and 25°C, including the output load capacitance of block routing. This is fast enough for a 640 Mbps/port 8×8 switch, which needs a shared buffer cycle time of 6.25 ns. The high-speed memory cycle time thus removes the memory cycle time restriction of a shared buffer ATM switch.
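
The 6.25 ns requirement is consistent with the 2N rule from the Introduction if one assumes that each port delivers one 64-bit memory word per word period and that the shared buffer serves N writes and N reads in that period. That reading is an assumption of this sketch rather than a derivation given in the source, and the function name is hypothetical.

```python
def required_cycle_time_ns(port_rate_mbps: float, n_ports: int,
                           word_bits: int = 64) -> float:
    """Cycle time a single-port shared buffer needs for an N x N switch.

    Assumes one word_bits-wide access per port per word period, and
    2 * n_ports accesses (N writes + N reads) in each word period.
    """
    word_period_ns = word_bits * 1e3 / port_rate_mbps  # time per word at one port
    return word_period_ns / (2 * n_ports)


print(required_cycle_time_ns(640, 8))  # 8x8 switch at 640 Mbps/port
print(required_cycle_time_ns(640, 4))  # 4x4 prototype at 640 Mbps/port
```

With these assumptions the 8×8 case gives 6.25 ns, above the simulated 4 ns cycle time, and the 4×4 case gives 12.5 ns, which matches the 80 MHz operating frequency of the prototype chip.
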
The fast memory cycle time is the major advantage of the proposed ATM switch. The estimated power consumption of the designed experimental chip with gated clock is 3.3 W at V<sub>dd</sub> = 5 V for a 640 Mbps 4×4 ATM switch. The power consumed by the pipelined memory is 465 mW at 80 MHz. Table 1 summarizes the characteristics of the experimental chip.

Table 1. Characteristics of the experimental 4×4 shared scalable buffer ATM switch with embedded pipelined memory buffer.

| Parameter | Value |
|------------------------|----------------------------------------------|
| Process Technology | 0.6 µm 2-Metal 1-Poly CMOS |
| Chip Size | $10.6 \times 10.6\ \mathrm{mm}^2$ |
| Switch Size | 4×4 |
| Buffer Size | 66,560 bits (4 × 4 × 4160 bits) |
| Transistors | 1 million |
| Max. Buffer Cycle Time | 4 ns (simulated) |
| Operating Frequency | 80 MHz (simulated) |
| Power Dissipation | 3.3 W at 80 MHz (estimated) |
| Power Supply | 5 V |

### 2. Scalability

Scalability is the advantage of the proposed pipelined memory. The total memory capacity is enlarged by increasing the array configuration, which is accomplished by adding memory banks and decoders without changing the memory cycle time. Scalable buffer size in a shared buffer ATM switch is an advantage for versatile switching performance and fast migration to better process technology. Together, the high-speed memory cycle time and the scalability provide a re-configurable shared buffer ATM switch.

### 3. Power Dissipation

Power consumption is the disadvantage of the proposed memory architecture. It increases considerably as the array configuration grows. Disabling the input and output latches of inactive memory banks with a gated clock limits this increase. Only N of the N<sup>2</sup> memory banks are fully active, because only one diagonal data path is fully active for one operation.

Fig. 13. Comparison of estimated power consumption for the proposed pipelined memory with gated clock and non-gated clock.

Therefore, we estimate that the gated clock reduces the on-chip power consumption of the pipelined memory to approximately 1/N of the non-gated value for an N×N configuration. We have carefully estimated the power consumption of the designed chip by the method of [9]. Fig. 13 compares the estimated power consumption of the proposed memory architecture with gated and non-gated clocks in the pipelined memory. We have compared pipelined memories with 4×4 and 8×8 configurations to show the power minimization achieved with the gated clock.

### V. Conclusions

A high-speed pipelined memory architecture and its circuit implementation have been presented. Its structural characteristics, shown above, provide scalability. It is embedded in a shared scalable buffer ATM switch. We have minimized the on-chip power consumption of the pipelined memory with a gated clock in each memory bank. The buffer memory of the prototype ATM switch is a full-custom layout. It has 65 Kbit of SRAM in sixteen 4160-bit SRAM banks. The designed prototype 4×4 ATM switch has a shared buffer of 128 cells. It is integrated in an area of $10.6 \times 10.6\ \mathrm{mm}^2$ with 0.6 µm CMOS technology. By post-layout simulation it operates at 80 MHz, which supports 640 Mbps per port, for a throughput of 2.5 Gbps. The maximum cycle time of the pipelined memory is 4 ns, which is sufficient for a one-chip 640 Mbps/port 8×8 shared scalable buffer ATM switch. A high-speed, multi-port shared buffer ATM switch on a chip can be designed by speed optimization of the control circuit and area optimization using better process technology.
As future work, we are researching a re-configurable multi-port shared scalable buffer ATM switch and its priority control and multicasting functions.

### References

- [1] T. Koinuma and N. Miyaho, "ATM in B-ISDN Communication Systems and VLSI Realization," *IEEE J. Solid-State Circuits*, vol. 30, no. 4, pp. 341-347, Apr. 1995.
- [2] N. Endo, T. Kozaki, T. Ohuchi, H. Kuwahara, and S. Gohara, "Shared Buffer Memory Switch for an ATM Exchange," *IEEE Trans. Commun.*, vol. 41, no. 1, pp. 237-245, Jan. 1993.
- [3] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nagao, S. Kayano, and T. Nakano, "A divided word-line structure in the static RAM and its application to a 64k full CMOS RAM," *IEEE J. Solid-State Circuits*, vol. SC-18, no. 5, pp. 479-485, Oct. 1983.
- [4] T. Hirose, H. Kuriyama, S. Murakami, K. Yuzuriha, T. Mukai, K. Tsutsumi, Y. Nishimura, Y. Kohno, and K. Anami, "A 20-ns 4-Mb CMOS SRAM with hierarchical word decoding architecture," *IEEE J. Solid-State Circuits*, vol. 25, no. 5, pp. 1068-1074, Oct. 1990.
- [5] D. Schmitt-Landsiedel, B. Hoppe, G. Neuendorf, M. Wurm, and J. Winnerl, "Pipeline architecture for fast CMOS buffer RAM's," *IEEE J. Solid-State Circuits*, vol. 25, no. 3, pp. 741-747, June 1990.
- [6] G. J. Jeong and M. K. Lee, "Design of a Scalable Pipelined RAM System," *IEEE J. Solid-State Circuits*, vol. 33, no. 6, pp. 910-914, June 1998.
- [7] G. J. Jeong, J. W. Shim, M. K. Lee, and S. H. Ahn, "A Scalable Shared Buffer ATM Switch Embedded SPRAMS," in *Proc. IEEE Int. Symp. Circuits and Systems*, Monterey, CA, May 1998, MPA14-5.
- [8] J. W. Shim, G. J. Jeong, M. K. Lee, and S. H. Ahn, "FPGA Implementation of a Scalable Shared Buffer ATM Switch," in *Proc. IEEE Int. Conf. ATM*, Colmar, France, June 1998, pp. 247-251.
- [9] D. Liu and C. Svensson, "Power Consumption Estimation in CMOS VLSI Chips," *IEEE J. Solid-State Circuits*, vol. 29, no. 6, pp. 663-670, June 1994.

About the Authors

### Gab Joong Jeong (鄭甲重), regular member

Received the B.S. degree in electronic engineering from Kyungpook National University in 1987 and the M.S. degree in electronic engineering from the Graduate School of Kyungpook National University in 1989. From 1989 to August 1994 he was a senior researcher at LG Semicon. Since September 1994 he has been in the Ph.D. program of the Graduate School of Yonsei University. His main research interests are VLSI design and CAD.

### Moon Key Lee (李文基), regular member

Received the B.S. degree in electrical engineering from Yonsei University in 1965, and the M.S. and Ph.D. degrees in electrical engineering from the Graduate School of Yonsei University in 1967 and 1973. He received a Ph.D. degree in electrical and electronic engineering from the University of Oklahoma, USA, in 1980. From 1970 to 1976 he was an assistant professor in the Department of Electronic Engineering, Kyung Hee University. From 1980 to 1982 he was head of the IC design laboratory at ETRI. Since 1982 he has been a professor in the Department of Electronic Engineering, Yonsei University, where he is director of the ASIC Design Joint Research Institute. He served as president of the Institute of Electronics Engineers in 1995 and was a visiting professor at the University of Illinois, Urbana-Champaign, USA, in 1997. In 1998 he was awarded the Republic of Korea Order of Civil Merit in recognition of his contributions to Korean science and technology. His main research interests are VLSI design and CAD.