Software pipelining

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.

It is important to distinguish software pipelining, which is a target code technique for overlapping loop iterations, from modulo scheduling, the currently most effective known compiler technique for generating software pipelined loops. Software pipelining has been known to assembly language programmers of machines with instruction-level parallelism since such architectures existed. Effective compiler generation of such code dates to the invention of modulo scheduling by Rau and Glaeser.[1] Lam showed that special hardware is unnecessary for effective modulo scheduling. Her technique, modulo variable expansion is widely used in practice.[2] Gao et al. formulated optimal software pipelining in integer linear programming, culminating in validation of advanced heuristics in an evaluation paper.[3] This paper has a good set of references on the topic.

Example

Consider the following loop:

for i = 1 to bignumber
  A(i)
  B(i)
  C(i)
end

In this example, let A(i), B(i), C(i) be instructions, each operating on data i, that are dependent on each other. In other words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.

Without software pipelining, the operations execute in the following sequence:

A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...

Assume that each instruction takes 3 clock cycles to complete (ignore for the moment the cost of the looping control flow). Also assume (as is the case on most modern systems) that an instruction can be dispatched every cycle, as long as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 9 cycles to complete: 3 clock cycles for A(1), 3 clock cycles for B(1), and 3 clock cycles for C(1).

Now consider the following sequence of instructions with software pipelining:

A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...

It can be easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations can be executed in a total of 9 cycles, giving an average of 3 cycles per iteration.

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for the moment that bignumber is divisible by 3):

for i = 1 to (bignumber - 2) step 3
  A(i)
  A(i+1)
  A(i+2)
  B(i)
  B(i+1)
  B(i+2)
  C(i)
  C(i+1)
  C(i+2)
end

Of course, matters are complicated if (as is usually the case) we can't guarantee that the total number of iterations will be divisible by the number of iterations we unroll. See the article on loop unrolling for more on solutions to this problem, but note that software pipelining prevents the use of Duff's device.[citation needed]

In the general case, loop unrolling may not be the best way to implement software pipelining. Consider a loop containing instructions with a high latency. For example, the following code:

for i = 1 to bignumber
  A(i) ; 3 cycle latency
  B(i) ; 3
  C(i) ; 12(perhaps a floating point operation)
  D(i) ; 3
  E(i) ; 3
  F(i) ; 3
end

would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache performance, see code bloat). Even worse, the prologue (code before the loop for handling the case of bignumber not divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient because software pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the execution will spend most of its time in this inefficient prologue code, rendering the software pipelining optimization ineffectual.

By contrast, here is the software pipelining for our example (the prologue and epilogue will be explained later):

prologue
for i = 1 to (bignumber - 6)
  A(i+6)
  B(i+5)
  C(i+4)
  D(i+2) ; note that we skip i+3
  E(i+1)
  F(i)
end
epilogue

Before getting to the prologue and epilogue, which handle iterations at the beginning and end of the loop, let's verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from iteration 7 of the original loop. The sequence of instructions is:

Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)
Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)
Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)
Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)
Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)
Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)
Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoids the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.

The prologue and epilogue handle iterations at the beginning and end of the loop. Here is a possible prologue for our example above:

; loop prologue (arranged on lines for clarity)
A(1)
A(2), B(1)
A(3), B(2), C(1)
A(4), B(3), C(2) ; cannot start D(1) yet
A(5), B(4), C(3), D(1)
A(6), B(5), C(4), D(2), E(1)

Each line above corresponds to an iteration of the main pipelined loop, but without the instructions for iterations that have not yet begun. Similarly, the epilogue progressively removes instructions for iterations that have completed:

; loop epilogue (arranged on lines for clarity)
B(bignumber), C(bignumber-1), D(bignumber-3), E(bignumber-4), F(bignumber-5)
C(bignumber), D(bignumber-2), E(bignumber-3), F(bignumber-4)
D(bignumber-1), E(bignumber-2), F(bignumber-3)
D(bignumber), E(bignumber-1), F(bignumber-2)
E(bignumber), F(bignumber-1)
F(bignumber)

Difficulties of implementation

The requirement of a prologue and epilogue is one of the major difficulties of implementing software pipelining. Note that the prologue in this example is 18 instructions, 3 times as large as the loop itself. The epilogue would also be 18 instructions. In other words, the prologue and epilogue together are 6 times as large as the loop itself. While still better than attempting loop unrolling for this example, software pipelining requires a trade-off between speed and memory usage. Keep in mind, also, that if the code bloat is too large, it will affect speed anyway via a decrease in cache performance.

A further difficulty is that on many architectures, most instructions use a register as an argument, and that the specific register to use must be hard-coded into the instruction. In other words, on many architectures, it is impossible to code such an instruction as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory. This has often been cited as a reason that software pipelining cannot be effectively implemented on conventional architectures.

In fact, Monica Lam presents an elegant solution to this problem in her thesis, A Systolic Array Optimizing Compiler (1989) (ISBN 0-89838-300-5). She calls it modulo variable expansion. The trick is to replicate the body of the loop after it has been scheduled, allowing different registers to be used for different values of the same variable when they have to be live at the same time. For the simplest possible example, let's suppose that A(i) and B(i) can be issued in parallel and that the latency of the former is 2 cycles. The pipelined body could then be:

A(i+2); B(i)

Register allocation of this loop body runs into the problem that the result of A(i+2) must stay live for two iterations. Using the same register for the result of A(i+2) and the input of B(i) will result in incorrect results.

However, if we replicate the scheduled loop body, the problem is solved:

 A(i+2); B(i)
 A(i+3); B(i+1)

Now a separate register can be allocated to the results of A(i+2) and A(i+3). To be more concrete:

 r1 = A(i+2); B(i) = r1
 r2 = A(i+3); B(i+1) = r2
 i = i + 2 // Just to be clear

On the assumption that each instruction bundle reads its input registers before writing its output registers, this code is correct. At the start of the replicated loop body, r1 holds the value of A(i+2) from the previous replicated loop iteration. Since i has been incremented by 2 in the meantime, this is actually the value of A(i) in this replicated loop iteration.

Of course, code replication increases code size and cache pressure just as the prologue and epilogue do. Nevertheless, for loops with large trip counts on architectures with enough instruction level parallelism, the technique easily performs well enough to be worth any increase in code size.

IA-64 implementation

Intel's IA-64 architecture provides an example of an architecture designed with the difficulties of software pipelining in mind. Some of the architectural support for software pipelining includes:

  • A "rotating" register bank; instructions can refer to a register number that is redirected to a different register each iteration of the loop (eventually looping back around to the beginning). This makes the extra instructions[specify] inserted in the previous example unnecessary.
  • Predicates (used to "predicate" instructions; see Branch predication) that take their value from special looping instructions. These predicates turn on or off certain instructions in the loop, making a separate prologue and epilogue unnecessary.

References

  1. ^ B.R. Rau and C.D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing", In Proceedings of the Fourteenth Annual Workshop on Microprogramming (MICRO-14), December 1981, pages 183-198
  2. ^ M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines", In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation (PLDI 88), July 1988 pages 318-328. Also published as ACM SIGPLAN Notices 23(7).
  3. ^ J. Ruttenberg, G.R. Gao, A. Stoutchinin, and W. Lichtenstein, "Software pipelining showdown: optimal vs. heuristic methods in a production compiler", In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, June 1996, pages 1-11. Also published as ACM SIGPLAN Notices 31(5).

Read other articles:

Esta página cita fontes, mas que não cobrem todo o conteúdo. Ajude a inserir referências. Conteúdo não verificável pode ser removido.—Encontre fontes: ABW  • CAPES  • Google (N • L • A) (Junho de 2018) corte de embrião de planta da seda: 6. Cortéx Em botânica, o córtex é a camada externa de muitas estruturas das plantas, no sentido da Taxonomia de Lineu, ou seja, incluindo as plantas verdes, alguns tipos de algas, com...

 

SkotlandiaAsosiasiAsosiasi Sepak Bola SkotlandiaKonfederasiUEFA (Eropa)Pelatih Steve ClarkeKaptenAndrew RobertsonPenampilan terbanyakKenny Dalglish (102)Pencetak gol terbanyakKenny Dalglish Denis Law (30)Stadion kandangHampden ParkKode FIFASCOPeringkat FIFATerkini 34 3 (26 Oktober 2023)[1]Tertinggi13[2] (Oktober 2007)Terendah88[3] (Maret 2005)Peringkat EloTerkini 24 7 (18 Oktober 2023)[4] [5] Warna pertama Warna kedua Pertandingan internasional pertama ...

 

ArteDiluncurkan30 Maret 1992PemilikARTE France & ARTE DeutschlandPangsa pemirsa2.7% (selama jam ketika siaran dilakukan dalam analog) (April 2008, [1])Negara Prancis  JermanBahasaPrancis dan JermanSitus webarte.tv Kantor Arte di Strasbourg Arte (Association Relative à la Télévision Européenne) merupakan sebuah jaringan TV Prancis-Jerman. Menyatakan dirinya sebagai sebuah saluran budaya Eropa dan bertujuan mempromosikan kualitas pemrograman khususnya mengenai budaya dan seni....

Video game franchise For other uses, see The Elder Scrolls (disambiguation). Video game seriesThe Elder ScrollsGenre(s)Action role-playingDeveloper(s)PrimaryBethesda Softworks (1994–1998)Bethesda Game Studios (2002–present)OtherVir2L Studios (2003–2004)TKO Software (2004)ZeniMax Online Studios (2014)Dire Wolf Digital (2017)Publisher(s)PrimaryBethesda Softworks (1994–present)2K Games (Oblivion only)OtherVir2L Studios (2003–2004)Nokia (2004)Platform(s)AndroidiOSJ2MEmacOSMicrosoft Wind...

 

Religion in Kuwait by ethnicity ( January 1, 2023)[1] Muslim Kuwaiti    99.97% non-Kuwaiti    63.02% Christianity Kuwaiti    0.01% non-Kuwaiti    26.08% Hinduism and Other Kuwaiti    0% non-Kuwaiti    10.88% Religion in Kuwait (total residents, 31 December 2020)[1]   Islam (75%)  Christianity (18%)  Other religions (7%) Islam is the official religion in Kuwait, and the majority of t...

 

Part of the 2019 Irish local elections 2019 Galway County Council election ← 2014 24 May 2019 2024 → All 39 seats on Galway County Council20 seats needed for a majority   First party Second party Third party   Party Fianna Fáil Fine Gael Sinn Féin Seats won 15 11 1 Seat change 3 1 2   Fourth party Fifth party Sixth party   Party Green Republican Sinn Féin Independent Seats won 1 1[a] 10 Seat change 1 1 Results by local e...

التدخل العسكري الروسي في الحرب الأهلية السورية جزء من الانخراط الروسي في الحرب الأهلية السورية التدخل العسكري ضد تنظيم الدولة الإسلامية الحرب الأهلية السورية طائرتي سوخوي سو-25 هجوميتان في مطار باسل الأسد الدولي باللاذقية معلومات عامة التاريخ 30 سبتمبر 2015 – جاري (8 سنواتٍ ...

 

Group of ancient Mesopotamian deities Four copper-alloy foundation figures depicting ancient Mesopotamian gods wearing characteristic horned crowns (c. 2130 BC) The Anunnaki (Sumerian:

 

American animated Disney television series My Friends Tigger & PoohCreated byBrian HohlfeldBased onWinnie-the-Poohby A. A. MilneVoices of Chloë Grace Moretz Dee Bradley Baker Jim Cummings Travis Oates Peter Cullen Ken Sansom Kath Soucie Max Burkholder Oliver Dillon Theme music composerAndy SturmerComposerAndy SturmerCountry of originUnited StatesOriginal languageEnglishNo. of seasons3No. of episodes63 (list of episodes)ProductionExecutive producersBrian HohlfeldJeff Kline (S1)ProducerAng...

Indian State Government Government of BiharBicameralSeat of GovernmentPatnaCountryIndiaLegislative branchAssemblyBihar Legislative AssemblySpeaker, Bihar Vidhan SabhaAwadh Bihari ChoudharyDeputy Speaker, Bihar Vidhan SabhaMaheshwar HazariMembers in Assembly243CouncilBihar Legislative CouncilChairman of Bihar Vidhan ParishadDevesh Chandra ThakurDeputy ChairRam Chandra PurveMembers in Council75 (63 Elected + 12 Nominated)Executive branchGovernorRajendra ArlekarChief MinisterNitish KumarDeputy C...

 

For the flower, see Heartsease. For the hamlet near Knighton, see Heartsease, Knighton. This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Heartsease, Llanddewi Ystradenni – news · newspapers · books · scholar · JSTOR (May 2023) (Learn how and when to remove this template message) Human settlement in WalesHeart...

 

American politician Carlos MoorheadMember of theU.S. House of Representativesfrom CaliforniaIn officeJanuary 3, 1973 – January 3, 1997Preceded byH. Allen SmithSucceeded byJames E. RoganConstituency20th district (1973–1975)22nd district (1975–1993)27th district (1993–1997)Member of the California State Assemblyfrom the 43rd districtIn officeJanuary 7, 1967 – January 3, 1973Preceded byHoward J. ThelinSucceeded byMichael D. Antonovich Personal detailsBornCarlos John M...

American fantasy drama television series Not to be confused with Belief (TV series). BelieveGenre Fantasy Drama Science fiction Created by Alfonso Cuarón Mark Friedman Starring Jake McLaughlin Johnny Sequoyah Jamie Chung Kyle MacLachlan Delroy Lindo ComposerSteven PriceCountry of originUnited StatesOriginal languageEnglishNo. of seasons1No. of episodes13 (1 unaired)ProductionExecutive producers Alfonso Cuarón J. J. Abrams Mark Friedman (pilot only) Bryan Burk Jonas Pate Hans Tobeason Produc...

 

العلاقات البرازيلية الغواتيمالية البرازيل غواتيمالا   البرازيل   غواتيمالا تعديل مصدري - تعديل   العلاقات البرازيلية الغواتيمالية هي العلاقات الثنائية التي تجمع بين البرازيل وغواتيمالا.[1][2][3][4][5] مقارنة بين البلدين هذه مقارنة عامة ومرجع...

 

Panamerika-Meisterschaften im Straßenradsport 2023 Stadt Panama-Stadt Austragungsland Panama Panama Austragungszeitraum 18.–23. April 2023 ← 2022 2024 → Die Panamerika-Meisterschaften im Straßenradsport 2023 wurden vom 18. bis 23. April durch den Panamerikanischen Radsportverband (COPACI) in Panama ausgetragen. Zugleich fand der Jahreskongress des Verbands dort statt. Inhaltsverzeichnis 1 Beteiligung 2 Ablauf 3 Ergebnisse 3.1 Männer 3.2 Frauen 3.3 Männer U23 3.4 Männer Ju...

Australia international rugby league footballer Dylan WalkerPersonal informationBorn (1994-09-27) 27 September 1994 (age 29)Sydney, New South Wales, AustraliaHeight181 cm (5 ft 11 in)Weight98 kg (15 st 6 lb)Playing informationPositionCentre, Five-eighth, Lock Club Years Team Pld T G FG P 2013–15 South Sydney 62 33 2 0 140 2016–22 Manly Sea Eagles 124 32 40 0 208 2023– New Zealand Warriors 24 3 0 0 12 Total 210 68 42 0 360 Representative Years T...

 

This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: The Missing Star – news · newspapers · books · scholar · JSTOR (April 2019) (Learn how and when to remove this template message) 2006 filmThe Missing StarDirected byGianni AmelioWritten byGianni Amelio, Umberto ContarelloProduced by01 DistributionStarringSergio...

 

2019 film directed by Tony Giglio Doom: AnnihilationBlu-ray/DVD coverDirected byTony GiglioScreenplay byTony GiglioBased onDoomby id SoftwareProduced by Jeffery Beach Phillip J. Roth Ogden Gavanski Starring Amy Manson Dominic Mafham Luke Allen-Gale Nina Bergman CinematographyAlexander KrumovEdited byPeter MergusMusic byFrederik WiedmannProductioncompanyUniversal 1440 EntertainmentDistributed byUniversal Pictures Home EntertainmentRelease date October 1, 2019 (2019-10-01) Runnin...

Gentrification of Atlanta's inner-city neighborhoods Bungalows in Atlanta's Inman Park neighborhood Part of a series onLiving spaces MainHouse (detached) • Apartment • Housing projects • Human outpost • Tenement • Condominium • Mixed-use development (live-work) • Hotel • Hostel (travellers' hotel) • Castles • Public housing • Squat • Flophouse • Green home • Shack • Slum • Shanty town IssuesAffordability • Executive housing • Environmental planning • Evic...

 

Theory that a society's development is predetermined by its physical environment This article possibly contains original research. Please improve it by verifying the claims made and adding inline citations. Statements consisting only of original research should be removed. (September 2023) (Learn how and when to remove this template message) History of geography Graeco-Roman Chinese Islamic Age of Discovery History of cartography Historical geography Environmental determinism Regional geograp...

 

Strategi Solo vs Squad di Free Fire: Cara Menang Mudah!