- The processor future is multicore
Machine Design, Vol. 80, No. 5, 6 Mar. 2008, pp. 74-77, 79-80.
It's not hard to predict that next-generation processors will have smaller features, more transistors on each square millimeter of silicon, and do more clever things, such as powering down under-used sections even between keystrokes. They will also have many cores or CPUs. The latest chips, for example, have up to four cores, and by 2010 that figure could rise to eight or more in engineering desktop computers. Even now, Intel has given hardware and software developers a peek at a prototype 80-core processor with one teraflop of computational ability. The company, however, has set no launch date. Nvidia Corp. sells a Quadro Plex graphics processor sporting 128 cores. And Cisco Systems Inc., sensing a coming surge in network traffic, is toying with a 188-core router. But to see what tomorrow's processors hold for engineers, take a closer look at today's multicore designs.
- Cell architecture: key physical design features and methodology
International Symposium on Physical Design: Proceedings of
the 2007 international symposium on Physical design; 18-21
The Cell Broadband Engine is a high-performance supercomputer on a chip. It is a 64-bit Power Processor, compatible with Power Architecture applications and operating systems. This heterogeneous multi-core architecture sets a new performance standard for games and multimedia applications. It exploits parallelism while achieving high frequency. With a multi-core design, we faced several obstacles spanning architecture, technology, and physical design. This presentation will give an overview of the Cell architecture and the challenges we faced. We will also discuss how we addressed these challenges and successfully implemented the Cell Processor.
- The Coming Wave of Multithreaded Chip Multiprocessors
James Laudon and Lawrence Spracklen.
International Journal of Parallel Programming, Vol. 35, No. 3, June 2007, pp. 299-330.
The performance of microprocessors has increased exponentially for over 35 years. However, process technology challenges, chip power constraints, and difficulty in extracting instruction-level parallelism are conspiring to limit the performance of future individual processors. To address these limits, the computer industry has embraced chip multiprocessing (CMP), predominately in the form of multiple high-performance superscalar processors on the same die. We explore the trade-off between building CMPs from a few high-performance cores or building CMPs from a large number of lower-performance cores, and argue that CMPs built from a larger number of lower-performance cores can provide better performance and performance/Watt on many commercial workloads. We examine two multi-threaded CMPs built using a large number of processor cores: Sun's Niagara and Niagara 2 processors. We also explore the programming issues for CMPs with a large number of threads. The programming model for these CMPs is similar to the widely used programming model for symmetric multiprocessors (SMPs), but the greatly reduced costs associated with communication of data through the on-chip shared secondary cache allow more fine-grained parallelism to be effectively exploited by the CMP. Finally, we present performance comparisons between Sun's Niagara and more conventional dual-core processors built from large superscalar processor cores. For several key server workloads, Niagara shows significant performance and even more significant performance/Watt advantages over the CMPs built from traditional superscalar processors.
- A Flexible Heterogeneous Multi-Core Architecture
Miquel Pericas, Adrian Cristal, Francisco J. Cazorla, Ruben Gonzalez, Daniel A. Jimenez and Mateo Valero.
PACT: Proceedings of the 16th International Conference on
Parallel Architecture and Compilation Techniques (PACT 2007)
- Volume 00; 15-19 Sept. 2007
Multi-core processors naturally exploit thread-level parallelism (TLP). However, extracting instruction-level parallelism (ILP) from individual applications or threads is still a challenge, as application mixes in this environment are nonuniform. Thus, multi-core processors should be flexible enough to provide high throughput for uniform parallel applications as well as high performance for more general workloads. Heterogeneous architectures are a first step in this direction, but partitioning remains static and only roughly fits application requirements. This paper proposes the Flexible Heterogeneous MultiCore processor (FMC), the first dynamic heterogeneous multi-core architecture capable of reconfiguring itself to fit application requirements without programmer intervention. The basic building block of this microarchitecture is a scalable, variable-size window microarchitecture that exploits the concept of Execution Locality to provide large-window capabilities. This makes it possible to overcome the memory wall for applications with high memory-level parallelism (MLP). The microarchitecture contains a set of small and fast cache processors that execute high-locality code and a network of small in-order memory engines that together exploit low-locality code. Single-threaded applications can use the entire network of cores, while multi-threaded applications can efficiently share the resources. The sizing of critical structures remains small enough to handle current power envelopes. In single-threaded mode this processor is able to outperform previous state-of-the-art high-performance processor research by 12% on SpecFP. We show how in a quad-threaded/quad-core environment the processor outperforms a statically allocated configuration in both throughput and harmonic mean, two commonly used metrics to evaluate SMT performance, by around 2-4%. This is achieved while using a very simple sharing algorithm.
- The kill rule for multicore
Anant Agarwal and Markus Levy.
Annual ACM IEEE Design Automation Conference: Proceedings
of the 44th annual conference on Design automation : San Diego,
California; 04-08 June 2007
Multicore has shown significant performance and power advantages over single cores in commercial systems with 2-4 cores. Applying a corollary of Moore's Law for multicore, we expect to see 1K-core chips within a decade. 1K-core systems introduce significant architectural challenges. One of these is the power efficiency challenge. Today's cores consume tens of watts. Even at about one watt per core, a 1K-core chip would need to dissipate 1K watts! This paper discusses the 'Kill rule for multicore' for power-efficient multicore design, an approach inspired by the 'Kiss rule for RISC processor design'. Kill stands for Kill if less than linear, and represents a design approach in which any additional area allocated to a resource within a core, such as a cache, is carefully traded off against using the area for additional cores. The Kill Rule states that we must increase resource size (for example, cache size) only if for every 1% increase in core area there is at least a 1% increase in core performance.
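The Kill Rule stated above reduces to a one-line test. A minimal sketch in Python, with purely illustrative area and performance numbers:

```python
def kill_rule_allows(area_increase_pct: float, perf_increase_pct: float) -> bool:
    """KILL: Kill If Less than Linear.

    A per-core resource (e.g. a cache) may grow only if every 1% increase
    in core area buys at least a 1% increase in core performance;
    otherwise the area is better spent on additional cores.
    """
    return perf_increase_pct >= area_increase_pct


# Hypothetical design point: doubling a core's L2 costs 10% more area.
print(kill_rule_allows(10.0, 12.0))  # super-linear return: grow the cache
print(kill_rule_allows(10.0, 4.0))   # sub-linear return: add cores instead
```

The rule deliberately biases designs toward many small cores, since any resource that fails the linearity test is, by construction, a worse use of silicon than another core.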
- Multicore CPU Race Is On
VARBusiness, Vol. 23, No. 6, 19 Mar. 2007, pp. 66-68.
Data center consolidation and virtualization may be driving server implementations these days, but there's a much more fundamental shift occurring at the microprocessor platform level. Systems based on the industry-standard x86 architecture are taking on a predominant role in the overall infrastructure, rivaling the performance and capability of higher-end processors. At the same time, new generations of RISC-based enterprise platforms that run improved multithreaded implementations of Unix are making huge strides in throughput and performance. The result is a constant barrage of next-generation server platforms, backed by a robust pipeline. It's important to note that the open source revolution is redrawing the server-processing landscape. Because customers now can run software on various processor platforms, VARs have the option of moving customers off higher-end Unix systems.
- Research on Architecture Design of Multi-core Processor
Jun He and Biao Wang.
Jisuanji Gongcheng / Computer Engineering, Vol. 33, No. 16, 20 Aug. 2007, pp. 208-210.
Addressing the question of how to design multi-core processor architectures to improve performance, this paper studies multi-core architecture design with reference to traditional multiprocessor theory, identifies development trends in commercial multi-core processors by surveying current products, and reflects on the future of multi-core processor design.
- Trace Cache Miss Rate
A. Hossain, D. Pease and A. El Kateeb.
International Journal of Modelling & Simulation, Vol. 27, No. 3, 2007, pp. 203-210.
The instruction fetch mechanism is a performance bottleneck in superscalar and simultaneous multithreading processors. A hardware mechanism, known as a Trace Cache, is used in several processor architectures to improve instruction fetch performance. Most studies of Trace Cache architectures are based on simulation of benchmark programs; analytical studies of Trace Cache and Trace Cache miss rates are rare. This paper presents a new analytical model of the Trace Cache miss rate. The model can be used to understand performance and tradeoffs in Trace Cache design. The study is the first of its kind, and provides a clearer understanding of Trace Cache performance for designers, students, and researchers.
- Balancing power consumption in multiprocessor systems
Andreas Merkel and Frank Bellosa.
ACM SIGOPS Operating Systems Review, Vol. 40, No. 4, Oct. 2006, pp. 403-414.
Actions usually taken to prevent processors from overheating, such as decreasing the frequency or stopping the execution flow, also degrade performance. Multiprocessor systems, however, offer the possibility of moving the task that caused a CPU to overheat away to some other, cooler CPU, so throttling becomes only a last resort taken if all of a system's processors are hot. Additionally, the scheduler can take advantage of the energy characteristics of individual tasks, and distribute hot tasks as well as cool tasks evenly among all CPUs. This work presents a mechanism for determining the energy characteristics of tasks by means of event monitoring counters, and an energy-aware scheduling policy that strives to assign tasks to CPUs in a way that avoids overheating individual CPUs. Our evaluations show that the benefit of avoiding throttling outweighs the overhead of additional task migrations, and that energy-aware scheduling in many cases increases the system's throughput.
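The distribution policy described above, spreading hot and cool tasks evenly across CPUs, can be sketched as a greedy balancer: place tasks hottest-first, each on the CPU with the least accumulated estimated power. This is a simplified model under assumed per-task power estimates, not the paper's exact counter-driven algorithm:

```python
def balance_tasks(task_power: dict, n_cpus: int):
    """Greedy energy balancing: assign each task (hottest first) to the
    CPU with the lowest accumulated estimated power, so hot and cool
    tasks end up spread evenly and no single CPU overheats.

    task_power maps task name -> estimated power in watts (here an
    assumed input; the paper derives it from event monitoring counters).
    """
    cpu_power = [0.0] * n_cpus
    placement = {}
    # Hottest tasks first, so they land on different CPUs.
    for task, power in sorted(task_power.items(), key=lambda kv: -kv[1]):
        coolest = min(range(n_cpus), key=cpu_power.__getitem__)
        placement[task] = coolest
        cpu_power[coolest] += power
    return placement, cpu_power


# Two hot tasks (9 W, 8 W) and two cool ones (2 W, 1 W) on two CPUs:
placement, loads = balance_tasks({"a": 9, "b": 8, "c": 2, "d": 1}, 2)
print(placement, loads)  # the two hot tasks end up on different CPUs
```

A real scheduler would rebalance periodically and fall back to throttling only when, as the abstract notes, all CPUs are already hot.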
- A Case for Chip Multiprocessors Based on the Data-Driven Multithreading Model
Pedro Trancoso, Paraskevas Evripidou, Kyriakos Stavrou and Costas Kyriacou.
International Journal of Parallel Programming, Vol. 34, No. 3, June 2006, pp. 213-235.
Current high-end microprocessors achieve high performance as a result of adding more features and therefore increasing complexity. This paper makes the case for a Chip-Multiprocessor based on the Data-Driven Multithreading (DDM-CMP) execution model in order to overcome the limitations of current design trends. Data-Driven Multithreading (DDM) is a multithreading model that effectively hides the communication delay and synchronization overheads. DDM-CMP avoids the complexity of other designs by combining simple commodity microprocessors with a small hardware overhead for thread scheduling and an interconnection network. Preliminary experimental results show that a DDM-CMP chip of the same hardware budget as a high-end commercial microprocessor, clocked at the same frequency, achieves a speedup of up to 18.5 while consuming 78-81% of the power of the commercial chip. Overall, the estimated results for the proposed DDM-CMP architecture show a significant benefit in terms of both speedup and power consumption, making it an attractive architecture for future processors.
- Does Dual-Core Processing Have Advantages?
Control Design, Vol. 10, No. 4, Apr. 2006, pp. 59.
Over the years, our machine control and HMI solution has evolved from proprietary processors, to using two separate PCs, to using Windows NT and XP with real-time extensions to achieve less than 50 ms scan rates. Now that dual-core processors are available from AMD and Intel, does anyone think there could be task management performance advantages using a dual-core solution, which might lead to eliminating the third-party kernel? Has anyone tried it?
- Re-inventing the x86 architecture: quad-core and beyond
Conference on High Performance Networking and Computing:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing;
11-17 Nov. 2006
x86 technology has become the standard processor building block of high performance computing. With each new generation of x86, significant performance improvements have occurred. In the past, frequency was the key to this increasing performance. Today, it is a combination of systems architecture and multi-core support. As a result, the bandwidth and latency of both the chip-level interconnect and memory have become the keys to increasing chip-level performance. Mr. Oehler will discuss both the challenges and solutions in maximizing chip-level interconnect and memory efficiencies, citing examples from AMD's development and advancement of the Direct Connect Architecture and integrated memory controller. He will specifically highlight the details associated with the development of next-generation quad-core processors, including the substantial evolution of the Direct Connect high-speed I/O infrastructure and memory access structures. As the number of cores continues to increase, the challenges and possible future solutions will also be discussed.
- Techniques for Multicore Thermal Management
James Donald and Margaret Martonosi.
ACM SIGARCH Computer Architecture News, Vol. 34, No. 2, May 2006, pp. 78-88.
Power density continues to increase exponentially with each new technology generation, posing a major challenge for thermal management in modern processors. Much past work has examined microarchitectural policies for reducing total chip power, but these techniques alone are insufficient if not aimed at mitigating individual hotspots. The industry's current trend has been toward multicore architectures, which provide additional opportunities for dynamic thermal management. This paper explores various thermal management techniques that exploit the distributed nature of multicore processors. We classify these techniques in terms of core throttling policy, whether that policy is applied locally to a core or to the processor as a whole, and process migration policies. We use Turandot and a HotSpot-based thermal simulator to simulate a variety of workloads under thermal duress on a 4-core PowerPC processor. Using benchmarks from the SPEC 2000 suite we characterize workloads in terms of instruction throughput as well as their effective duty cycles. Among a variety of options we find that distributed control-theoretic DVFS alone improves throughput by 2.5X under our test conditions. Our final design involves a PI-based core thermal controller and an outer control loop to decide process migrations. This policy avoids all thermal emergencies and yields an average of 2.6X speedup over the baseline across all workloads.
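The PI-based per-core thermal controller mentioned above can be illustrated with a toy model. The gains, setpoint, and clamping range below are illustrative assumptions, not the authors' tuned values:

```python
class ThermalPIController:
    """Sketch of a per-core PI controller for DVFS.

    Maps the gap between a core's measured temperature and its thermal
    setpoint to a normalized frequency/voltage level in [0.1, 1.0].
    Gains kp/ki and the 85 C setpoint are assumed for illustration.
    """

    def __init__(self, kp=0.05, ki=0.01, setpoint_c=85.0):
        self.kp, self.ki, self.setpoint_c = kp, ki, setpoint_c
        self.integral = 0.0

    def dvfs_level(self, temp_c: float) -> float:
        error = self.setpoint_c - temp_c   # positive when the core runs cool
        self.integral += error             # accumulated error (I term)
        level = 1.0 + self.kp * error + self.ki * self.integral
        return max(0.1, min(1.0, level))   # clamp to the DVFS range


cool = ThermalPIController()
print(cool.dvfs_level(60.0))   # well under setpoint: runs at full speed
hot = ThermalPIController()
print(hot.dvfs_level(100.0))   # over setpoint: frequency is scaled down
```

In the paper's design this inner loop runs per core, while an outer control loop decides when to migrate a process to a cooler core instead of throttling further.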
- AMD VS. Intel: An Uphill Battle?
Jeffrey Schwartz and Rob Wright.
VARBusiness, Vol. 21, No. 16, 25 July 2005, pp. 18-29.
By many accounts, AMD should be on cloud nine. Its 64-bit Opteron server-based processors continue to sell like hotcakes two years after their debut, and it's way out in front with its new line of desktop and server-based dual-core processors. But judging by its recent antitrust suit against No. 1 chip maker Intel, AMD apparently is not gaining the traction it thinks it should. While some see the lawsuit as an act of desperation, others say it will ultimately create a more level playing field. Either way, it still raises plenty of questions: Should AMD win, would it be a Pyrrhic victory where the casualties outweigh any benefits that would come from this action? How will the legal fight impact the product road maps of other vendors? Will it be a distraction, or business as usual? In a research note, Gartner analyst Martin Reynolds points to the latter, arguing that customers should not change direction in light of AMD's suit.
- Hardware and software architectures for the CELL processor
Peter Hofstee and Michael Day.
International Conference on Hardware Software Codesign: Proceedings
of the 3rd IEEE/ACM/IFIP international conference on Hardware/software
codesign and system synthesis; 19-21 Sept. 2005
The Cell processor is a first instance of a new family of processors intended for the broadband era. The processors will find early use in game systems (PlayStation3), a variety of other consumer electronics applications, a wide variety of embedded applications, and various forms of computational accelerators. Cell is a non-homogeneous multi-core processor, with one POWER processor core (two threads) dedicated to the operating system and other control functions, and eight synergistic processors optimized for compute-intensive applications. Cell addresses two of the main limiters to microprocessor performance: increased memory latency, and performance limitations induced by system power limits. Memory latency is addressed by introducing another software-managed level of private 'local' memory, in between the private registers and shared system memory. Data is transferred between this local memory and shared memory with asynchronous cache-coherent DMA commands, and synergistic processor load and store commands access the local store only. This organization of memory makes it possible for the Cell processor to have over 100 memory transactions in flight at the same time, more than enough to cover memory latency. Power limitations are addressed by two main mechanisms: a non-homogeneous multi-core organization, and an ultra-high-frequency design that allows the chip to be operated at 3.2GHz at low voltage. The Cell processor supports many of today's programming models by introducing the concept of heterogeneous tasks or threads. Both Power processor and SPE-based threads can be managed by the operating system and effectively utilized by applications, starting with the relatively straightforward function offload model up to the more complex single-source heterogeneous parallel programming model.
Cell achieves between one and two orders of magnitude of performance advantage over conventional single-core processors on compute-intensive (32-bit) applications, by permitting programmers and compilers explicit control over instruction scheduling, data movement and the use of a large register file.
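The software-managed local store described above is typically programmed with double buffering: start the transfer for the next tile of data before computing on the current one, so the asynchronous DMA overlaps computation. A toy Python sketch of the pattern, where plain list copies stand in for the SPE's asynchronous DMA commands:

```python
def run_spe_kernel(tiles, compute):
    """Double-buffered local-store sketch (the Cell SPE pattern).

    While computing on the tile in one buffer, the next tile is
    'DMA-ed' into the other buffer, hiding memory latency behind
    computation. On Cell the copy would be an asynchronous DMA get
    with a tag wait; here a plain list copy models the transfer.
    """
    results = []
    if not tiles:
        return results
    buffers = [list(tiles[0]), None]  # buffer 0 preloaded with tile 0
    for i in range(len(tiles)):
        current = buffers[i % 2]
        if i + 1 < len(tiles):
            # Kick off the "DMA" for the next tile into the other buffer.
            buffers[(i + 1) % 2] = list(tiles[i + 1])
        # Compute on the current tile; on real hardware this overlaps
        # with the in-flight transfer started above.
        results.append(compute(current))
    return results


print(run_spe_kernel([[1, 2], [3, 4], [5]], sum))  # per-tile results
```

With many SPEs each keeping several such transfers in flight, the chip reaches the 100-plus outstanding memory transactions the abstract describes.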
- Microprocessors: the shift to dual-core begins
Purchasing, Vol. 134, No. 15, 15 Sept. 2005, pp. 17-18.
Strong demand for computers, especially mobile PCs, will result in the microprocessor (MPU) market growing 8% to $34.4 billion in 2005. The good news for buyers is that the average price of a microprocessor will fall by about 5% despite strong demand. Buyers can also expect to see more dual-core processors hit the market, although the big ramp-up for dual core will be in 2006. By the end of next year, about 70% of processors will be dual core. For 2005, most microprocessors will be single-core chips and an increasing percentage will be used in mobile computers. Overall, computer unit growth will be 10.2% to 202.1 million units in 2005, according to researcher Gartner.
- Software and the concurrency revolution
Herb Sutter and James Larus.
Queue, Vol. 3, No. 7, Sept. 2005, pp. 54-62.
Leveraging the full power of multicore processors demands new tools and new thinking from the software industry.
- Dual-Core Chips Shift Performance Focus
Darrell Dunn and Aaron Ricadela.
Information Week, No. 1004, 6 Sept. 2004, pp. 30.
With the race for dominance in processor clock speed over, or at least waning, the pursuit for overall performance gains through architectural enhancements is officially beginning. Back-to-back demonstrations of dual-core processor implementations by Advanced Micro Devices Inc. last week and by Intel at its Developer Forum this week are kicking things off. Dual-core processors reside on a single chip and can run at lower clock frequencies, leading to cooler operations and performance parity or even gains over single-core chips. AMD last week demonstrated how a four-way Hewlett-Packard server using its Opteron dual-core chip effectively turned the machine into an eight-way system with no hardware changes. HP calls dual-core processing a game-changing technology that will lead computer makers to offer a wider variety of systems aimed at specific markets.
- Cache replacement policy to mitigate pollution in multicore processors
A method for identifying a least recently used cache entry in a cache. The method includes receiving a cache access request, determining whether the contents of the main memory address are present in the cache, updating the more significant bits of a pseudo least recently used pointer when the contents of the main memory address are not present in the cache, and updating both the more significant bits and the less significant bits of the pointer when the contents of the main memory address are present in the cache. The cache access request is associated with a main memory address; that address maps to a set within the cache, and the set includes a plurality of ways.
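The pseudo-LRU pointer in the claim can be illustrated with the standard tree-PLRU scheme for a 4-way set. The bit layout below is the textbook variant, not necessarily the exact pointer encoding the patent claims:

```python
class TreePLRU4:
    """Tree pseudo-LRU sketch for one 4-way cache set.

    Three bits form a binary tree: b[0] is the root (the more
    significant bit), b[1] and b[2] cover the left and right way
    pairs. Following the bits from the root selects the pseudo-LRU
    victim; touching a way flips the bits on its path to point away
    from it, so recently used ways are not chosen for eviction.
    """

    def __init__(self):
        self.b = [0, 0, 0]  # b[0]=root, b[1]=ways 0/1, b[2]=ways 2/3

    def victim(self) -> int:
        """Follow the pointer bits to the pseudo-LRU way."""
        if self.b[0] == 0:                 # root points at the left pair
            return 0 if self.b[1] == 0 else 1
        return 2 if self.b[2] == 0 else 3

    def touch(self, way: int) -> None:
        """Record an access to `way` by pointing the bits away from it."""
        if way < 2:
            self.b[0] = 1                  # next victim from the right pair
            self.b[1] = 1 - way % 2        # and away from `way` on the left
        else:
            self.b[0] = 0                  # next victim from the left pair
            self.b[2] = 1 - way % 2        # and away from `way` on the right


plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
print(plru.victim())  # the approximately least recently touched way
```

This also shows why the claim distinguishes hit and miss updates: a miss only needs the upper tree levels adjusted toward the refilled way's pair, while a hit updates the full path down to the way itself.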