Communication-avoiding algorithm

Communication-avoiding algorithms minimize movement of data within a memory hierarchy for improving its running-time and energy consumption. These minimize the total of two costs (in terms of time and energy): arithmetic and communication. Communication, in this context refers to moving data, either between levels of memory or between multiple processors over a network. It is much more expensive than arithmetic.[1]

Formal theory

Two-level memory model

A common computational model in analyzing communication-avoiding algorithms is the two-level memory model:

  • There is one processor and two levels of memory.
  • Level 1 memory is infinitely large. Level 0 memory ("cache") has size .
  • In the beginning, input resides in level 1. In the end, the output resides in level 1.
  • Processor can only operate on data in cache.
  • The goal is to minimize data transfers between the two levels of memory.

Matrix multiplication

[2] Corollary 6.2:

Theorem — Given matrices of sizes , then has communication complexity .

This lower bound is achievable by tiling matrix multiplication.

More general results for other numerical linear algebra operations can be found in.[3] The following proof is from.[4]

Proof

We can draw the computation graph of as a cube of lattice points, each point is of form . Since , computing requires the processor to have access to each point within the cube at least once. So the problem becomes covering the lattice points with a minimal amount of communication.

If is large, then we can simply load all entries then write entries. This is uninteresting.

If is small, then we can divide the minimal-communication algorithm into separate segments. During each segment, it performs exactly reads to cache, and any number of writes from cache.

During each segment, the processor has access to at most different points from .

Let be the set of lattice points covered during this segment. Then by the Loomis–Whitney inequality,

with constraint .

By the inequality of arithmetic and geometric means, we have , with extremum reached when .

Thus the arithmetic intensity is bounded above by where , and so the communication is bounded below by .

Direct computation verifies that the tiling matrix multiplication algorithm reaches the lower bound.

Motivation

Consider the following running-time model:[5]

  • Measure of computation = Time per FLOP = γ
  • Measure of communication = No. of words of data moved = β

⇒ Total running time = γ·(no. of FLOPs) + β·(no. of words)

From the fact that β >> γ as measured in time and energy, communication cost dominates computation cost. Technological trends[6] indicate that the relative cost of communication is increasing on a variety of platforms, from cloud computing to supercomputers to mobile devices. The report also predicts that gap between DRAM access time and FLOPs will increase 100× over coming decade to balance power usage between processors and DRAM.[1]

FLOP rate (γ) DRAM bandwidth (β) Network bandwidth (β)
59% / year 23% / year 26% / year
Energy cost of data movement in 2010: On-chip vs Off-chip

Energy consumption increases by orders of magnitude as we go higher in the memory hierarchy.[7]

United States president Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress:[1]

New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating-point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.

Objectives

Communication-avoiding algorithms are designed with the following objectives:

  • Reorganize algorithms to reduce communication across all memory hierarchies.
  • Attain the lower-bound on communication when possible.

The following simple example[1] demonstrates how these are achieved.

Matrix multiplication example

Let A, B and C be square matrices of order n × n. The following naive algorithm implements C = C + A * B:

 for i = 1 to n
     for j = 1 to n
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)

Arithmetic cost (time-complexity): n2(2n − 1) for sufficiently large n or O(n3).

Rewriting this algorithm with communication cost labelled at each step

 for i = 1 to n
     {read row i of A into fast memory}               - n2 reads
     for j = 1 to n
         {read C(i,j) into fast memory}               - n2 reads
         {read column j of B into fast memory}        - n3 reads
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
         {write C(i,j) back to slow memory}           - n2 writes

Fast memory may be defined as the local processor memory (CPU cache) of size M and slow memory may be defined as the DRAM.

Communication cost (reads/writes): n3 + 3n2 or O(n3)

Since total running time = γ·O(n3) + β·O(n3) and β >> γ the communication cost is dominant. The blocked (tiled) matrix multiplication algorithm[1] reduces this dominant term:

Blocked (tiled) matrix multiplication

Consider A, B and C to be n/b-by-n/b matrices of b-by-b sub-blocks where b is called the block size; assume three b-by-b blocks fit in fast memory.

 for i = 1 to n/b
     for j = 1 to n/b
         {read block C(i,j) into fast memory}           - b2 × (n/b)2 = n2 reads
         for k = 1 to n/b
             {read block A(i,k) into fast memory}       - b2 × (n/b)3 = n3/b reads 
             {read block B(k,j) into fast memory}       - b2 × (n/b)3 = n3/b reads
             C(i,j) = C(i,j) + A(i,k) * B(k,j)          - {do a matrix multiply on blocks}
         {write block C(i,j) back to slow memory}       - b2 × (n/b)2 = n2 writes

Communication cost: 2n3/b + 2n2 reads/writes << 2n3 arithmetic cost

Making b as large possible:

3b2M

we achieve the following communication lower bound:

31/2n3/M1/2 + 2n2 or Ω (no. of FLOPs / M1/2)

Previous approaches for reducing communication

Most of the approaches investigated in the past to address this problem rely on scheduling or tuning techniques that aim at overlapping communication with computation. However, this approach can lead to an improvement of at most a factor of two. Ghosting is a different technique for reducing communication, in which a processor stores and computes redundantly data from neighboring processors for future computations. Cache-oblivious algorithms represent a different approach introduced in 1999 for fast Fourier transforms,[8] and then extended to graph algorithms, dynamic programming, etc. They were also applied to several operations in linear algebra[9][10][11] as dense LU and QR factorizations. The design of architecture specific algorithms is another approach that can be used for reducing the communication in parallel algorithms, and there are many examples in the literature of algorithms that are adapted to a given communication topology.[12]

See also

References

  1. ^ a b c d e Demmel, Jim. "Communication avoiding algorithms". 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE, 2012.
  2. ^ Jia-Wei, Hong; Kung, H. T. (1981). "I/O complexity". Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81. New York, New York, USA: ACM Press. pp. 326–333. doi:10.1145/800076.802486. S2CID 8410593.
  3. ^ Ballard, G.; Carson, E.; Demmel, J.; Hoemmen, M.; Knight, N.; Schwartz, O. (May 2014). "Communication lower bounds and optimal algorithms for numerical linear algebra". Acta Numerica. 23: 1–155. doi:10.1017/s0962492914000038. ISSN 0962-4929. S2CID 122513943.
  4. ^ Demmel, James; Dinh, Grace (2018-04-24). "Communication-Optimal Convolutional Neural Nets". arXiv:1802.06905 [cs.DS].
  5. ^ Demmel, James, and Kathy Yelick. "Communication Avoiding (CA) and Other Innovative Algorithms". The Berkeley Par Lab: Progress in the Parallel Computing Landscape: 243–250.
  6. ^ Bergman, Keren, et al. "Exascale computing study: Technology challenges in exascale computing systems." Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep 15 (2008).
  7. ^ Shalf, John, Sudip Dosanjh, and John Morrison. "Exascale computing technology challenges". High Performance Computing for Computational Science–VECPAR 2010. Springer Berlin Heidelberg, 2011. 1–25.
  8. ^ M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cacheoblivious algorithms", In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, 1999. IEEE Computer Society.
  9. ^ S. Toledo, "Locality of reference in LU Decomposition with partial pivoting," SIAM J. Matrix Anal. Appl., vol. 18, no. 4, 1997.
  10. ^ F. Gustavson, "Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms," IBM Journal of Research and Development, vol. 41, no. 6, pp. 737–755, 1997.
  11. ^ E. Elmroth, F. Gustavson, I. Jonsson, and B. Kagstrom, "Recursive blocked algorithms and hybrid data structures for dense matrix library software," SIAM Review, vol. 46, no. 1, pp. 3–45, 2004.
  12. ^ Grigori, Laura. "Introduction to communication avoiding linear algebra algorithms in high performance computing.

Read other articles:

Red Hat Enterprise LinuxRHEL 8Perusahaan / pengembangRed Hat, Inc.KeluargaMirip Unix (Linux)Status terkiniAktifRilis perdana23 Maret 2002; 21 tahun lalu (2002-03-23)[1]Dukungan platformx86-64, ARM64, IBM Power Systems, dan IBM Z[2]Kernel typeLinuxAntarmuka bawaanGNOME ShellPendahuluRed Hat LinuxSitus web resmiwww.redhat.com/en/technologies/linux-platforms/enterprise-linux Red Hat Enterprise Linux (sering disingkat RHEL) adalah distribusi Linux yang dikembangkan oleh ...

 

Хуан Ерраскін Особисті дані Повне ім'я Хуан Ерраскін Тумас Народження 22 червня 1906(1906-06-22)   Кордова, Аргентина Смерть 30 листопада 1930(1930-11-30) (24 роки)   Ірун, Іспанія Громадянство  Іспанія Аргентина Позиція нападник Професіональні клуби* Роки Клуб І (г) 1921–1928 «Ре

 

Опис Мійосівські ляльки Джерело [1] Час створення 2011, лютий Автор зображення невідомо Ліцензія Ця робота є невільною — тобто, не відповідає визначенню вільних творів культури. Згідно з рішенням фонду «Вікімедіа» від 23 березня 2007 року вона може бути використана у відп

عازل أنثوي1: مثانة، 2: عظم العانة، 3: حالب، 4: مهبل، 5: رحم، 6: قبو المهبل، 7: عنق الرحم، 8: العازل، 9: مستقيمالخلفيةالنوعحاجزأول استعمالعقد 1880فشل (السنة الأولى مع مبيد النطاف)الاستعمال المثالي6%[1]الاستعمال النموذجي12%[1]الاستعمالالمعكوسيةحاليتذكير المستخدميتم إدخاله قبل ا

 

Contre-la-montre masculin aux championnats du monde de cyclisme sur route 2018 GénéralitésCourse25e Championnat du monde masculin de cyclisme contre-la-montreCompétitionChampionnats du monde de cyclisme sur route 2018 CMDate26 septembre 2018Distance52,5 kmPays AutricheLieu de départRattenbergLieu d'arrivéeInnsbruckPartants61Arrivants56Vitesse moyenne49,96 km/hSpecial 1Site officielRésultatsVainqueur Rohan Dennis (Australie)Deuxième Tom Dumoulin (Pays-Bas)Troisième Victor Campenaerts ...

 

Carte du réseau hydrographique du département de la Seine-Maritime (France). La liste des cours d'eau de la Seine-Maritime présente les principaux cours d'eau, de longueur supérieure à 10 km, traversant pour tout ou partie le territoire du département français de la Seine-Maritime dans la région Normandie. Le réseau hydrographique est long d'environ 1 500 km et comprend 28 cours d'eau de longueur supérieure à 10 km, dont la Seine (156 km dans le dépar...

The CrossingPoster teatrikalSutradara John Woo Produser Terence Chang Ditulis oleh Wang Hui-ling PemeranZhang Ziyi Takeshi Kaneshiro Song Hye-kyo Huang Xiaoming Tong Dawei Masami NagasawaPenata musikTaro IwashiroSinematograferZhao FeiPenyuntingJohn WooKai Kit-WaiDavid WuPerusahaanproduksiBeijing Gallop Horse Film China Film Group Corporation Lion Rock Productions Zhejiang Huace Film & TVTanggal rilis 2 Desember 2014 (2014-12-02) (China) 5 Desember 2014 (2014-12-05) ...

 

Zeichnung der Grabplatte (Johann Friedrich Schannat) Hildebold (auch Hildibold, Hildibald, Hildebald) (* um 940; † 3. August 998) war von 978 bis 998 Bischof von Worms. Fraglich ist, ob Hildebold aus der Wormser Domschule hervorgegangen ist. Er war Kanzler und Vertrauter Ottos II. und wurde Ende Oktober 977 Leiter der deutschen Königskanzlei Ottos II. und behielt dieses Amt als erster Kanzler auch nach der Erhebung zum Bischof bis zu seinem Tode. Der Wormser Bischof gehörte in der Zeit de...

 

أكاذيب مخبأة في حديقتي ملصق المسلسل النوع إثارة تشويق  [لغات أخرى]‏  صناعة كي تي إستوديو (تخطيط)[1] تطوير دورثي فيلم[1]استوديو دراغون[1] بطولة كم تاي هي،  وليم جي يون،  وكيم سونغ اوه  [لغات أخرى]‏،  وتشوي جاي ريم  [لغات أخرى]‏  البلد

هذه المقالة تحتاج للمزيد من الوصلات للمقالات الأخرى للمساعدة في ترابط مقالات الموسوعة. فضلًا ساعد في تحسين هذه المقالة بإضافة وصلات إلى المقالات المتعلقة بها الموجودة في النص الحالي. (فبراير 2018) هذه المقالة يتيمة إذ تصل إليها مقالات أخرى قليلة جدًا. فضلًا، ساعد بإضافة وصلة...

 

Pour les articles homonymes, voir Sarr. Cet article est une ébauche concernant une localité sénégalaise. Vous pouvez partager vos connaissances en l’améliorant (comment ?) selon les recommandations des projets correspondants. Keur Momar Sarr Administration Pays Sénégal Région Louga Département Louga Maire Mandat Dioumorou Kâ 2022 - 2027 Démographie Population 1 774 hab. (2003) Géographie Coordonnées 15° 55′ 00″ nord, 15° 58′ 00″...

 

Cottage in Sikkilsdalen in Jotunheimen, Norway The Royal Mountain ChaletPrinsehyttaGeneral informationTypeChaletMountain Residence for the Norwegian Royal FamilyLocationSikkilsdalenJotunheimen, NorwayCompleted1900Opened1902Technical detailsStructural systemTimber Log cabinDesign and constructionArchitect(s)Hjalmar Welhaven The Royal Mountain Chalet (Norwegian: Prinsehytta, The Prince Cottage) is a cottage in Sikkilsdalen in Jotunheimen, Norway. The cottage is privately owned by the Norwegian ...

Servicio del memorial del Michael Jackson. El funeral público de Michael Jackson (Michael Jackson Memorial Service), fue un homenaje póstumo a Michael Jackson realizado el 7 de julio de 2009. Es considerado el más grande en la historia, superando a los funerales de Elvis Presley en 1977, que congregaron a 70.000 personas, y a los de Diana de Gales en 1997, a los que asistieron 230.000 en el Hyde Park de Londres.[1]​ Esto se refiere no solo a la cantidad de entradas (que fueron previa...

 

German engineer and surveyor Hermann Philipp DetznerHermann Detzner, portrayed on the jacket of the 1921 edition of his book Four Years Among the Cannibals.Born16 October 1882Speyer, Bavarian Palatinate, German EmpireDied1 December 1970(1970-12-01) (aged 88)Heidelberg, Baden-Württemberg, West GermanyAllegiance German EmpireService/branchImperial German ArmySchutztruppe (Kamerun and German New Guinea)Years of servicec. 1901–1919RankMajorUnit6th Infantry Regiment (Prussia)2nd ...

 

Parque Chacabuco Barrio de Buenos Aires País Argentina• Ciudad Buenos AiresUbicación 34°38′00″S 58°27′00″O / -34.633333333333, -58.45Superficie 2,4 km²Límites Av. La PlataAv. CoboDel Barco CenteneraAv. RiestraPte. Camilo Torres y TenorioCurapaligüeAv. CastañaresAv. CaraboboPoblación  • Total 56 281 hab.• Densidad 16 447 hab./km²Día del barrio 15 de mayo[editar datos en Wikidata] Pa...

Novel by Robert Cormier This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages) This article consists almost entirely of a plot summary. Please help improve the article by adding more real-world context. (September 2012) (Learn how and when to remove this template message) This article needs additional citations for verification. Please help improve this article by adding citations to reliable so...

 

2015 television miniseries This article is about the 2015 miniseries. For the 1985 miniseries, see A.D. (miniseries). A.D. The Bible ContinuesGenreBiblical dramaCreated byRoma DowneyMark BurnettBased onThe New TestamentDirected byCiaran DonnellyTony MitchellBrian KellyRob EvansPaul WilmshurstStarring Juan Pablo Di Pace Adam Levy Chipo Chung Babou Ceesay Emmett J. Scanlan Will Thorp Theme music composerLorne BalfeCountry of originUnited StatesOriginal languageEnglishNo. of episodes12Production...

 

1928 film by Mark Sandrich Runaway GirlsDirected byMark SandrichWritten byLillie Hayward (story)Dorothy Howell (continuity)Morton Blumenstock (intertitles)Produced byHarry CohnStarringShirley MasonArthur RankinHedda HopperCinematographyHarry DavisEdited byFrank AtkinsonDistributed byColumbia PicturesRelease dateAugust 23, 1928Running time58 minutesCountryUnited States Runaway Girls is a lost[1] 1928 silent film drama directed by Mark Sandrich and starring Shirley Mason and Hedda Hoppe...

American writer (1926–2013) Diana M. Vogel Diana M. Vogel (1926–2013), known professionally as D. H. Melhem, was an American poet, novelist, and editor. Life She was born in Brooklyn, New York, the daughter of Nicholas Melhem and Georgette Deyrataui Melhem, both immigrants from Lebanon. She graduated from New York University cum laude and received her master's degree and a doctoral degree in English and American Literature from City College. She was a longtime resident of New York City, w...

 

This article includes a list of references, related reading, or external links, but its sources remain unclear because it lacks inline citations. Please help to improve this article by introducing more precise citations. (June 2012) (Learn how and when to remove this template message) The Last Airbender Prequel: Zuko's StoryThe Last Airbender Prequel: Zuko's Story coverDateMay 18, 2010PublisherDel Rey MangaCreative teamWriters Dave Roman Alison Wilgus ArtistsNina MatsumotoChronologyFollowed&#...

 

Strategi Solo vs Squad di Free Fire: Cara Menang Mudah!