mD-DSP algorithms exhibit a large amount of complexity, as described in the previous section, which makes efficient implementation difficult with regard to run-time and power consumption. This article primarily addresses basic parallel concepts used to reduce the run-time of common mD-DSP applications. Parallel computing can be applied to mD-DSP applications by exploiting the fact that if a problem can be expressed in a parallel algorithmic form, then parallel programming and multiprocessing can be used in an attempt to increase the computational throughput of the mD-DSP procedure on a given hardware platform. An increase in computational throughput can result in a decreased run-time, i.e. a speedup of a specific mD-DSP algorithm.
In addition to increasing computational throughput, an equally important goal is generally considered to be making maximal use of the memory bandwidth of a given memory architecture. Computational throughput and memory bandwidth usage can be related to each other through the concept of operational intensity (the number of operations performed per byte of memory traffic), which is summarized in what is referred to as the roofline model.[9] The concepts of operational intensity and the roofline model in general have recently become popular methods of quantifying the performance of mD-DSP algorithms.[10][11]
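As a rough illustration of the roofline idea, the attainable performance of a kernel is bounded by the smaller of the machine's peak compute rate and the product of its peak memory bandwidth and the kernel's operational intensity. The sketch below is only a minimal example of that bound; the function names and the machine figures used with it are hypothetical and are not taken from the cited sources.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, operational_intensity):
    """Roofline bound in GFLOP/s.

    peak_gflops           -- peak compute rate of the (hypothetical) machine
    peak_bandwidth_gbs    -- peak memory bandwidth in GB/s
    operational_intensity -- floating-point operations per byte of memory traffic
    """
    memory_roof = peak_bandwidth_gbs * operational_intensity
    return min(peak_gflops, memory_roof)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Operational intensity at which a kernel stops being memory-bound."""
    return peak_gflops / peak_bandwidth_gbs
```

For an assumed machine with a 100 GFLOP/s compute peak and 10 GB/s of bandwidth, a kernel performing 2 operations per byte is memory-bound at 20 GFLOP/s, while any kernel above the ridge point of 10 operations per byte is limited by the compute peak instead.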
Increasing throughput can be beneficial to strong scaling[12][13] of a given mD-DSP algorithm. Another possible benefit of increasing operational intensity is to allow for an increase in weak scaling, which allows the mD-DSP procedure to operate on increased data sizes or larger data sets, which is important for application areas such as data mining and the training of deep neural networks[14] using big data.
The goal of parallelizing an algorithm is not always to decrease the complexity of the algorithm in the traditional sense, because the term complexity as used in this context typically refers to the RAM abstract computer model, which by definition is serial. Parallel abstract computer models such as PRAM have been proposed to describe the complexity of parallel algorithms such as mD signal processing algorithms.[15]
Another factor that is important to the performance of mD-DSP algorithm implementations is the resulting energy consumption and power dissipation.[16]
Existing approaches
Parallel implementations of multidimensional discrete Fourier transforms
As a simple example of an mD-DSP algorithm that is commonly decomposed into a parallel form, consider the parallelization of the discrete Fourier transform (DFT), which is generally implemented using a form of the fast Fourier transform (FFT). There are hundreds of available software libraries that offer optimized FFT algorithms,[17] many of which offer parallelized versions of mD-FFT algorithms,[18][19][20] with the most popular being the parallel versions of the FFTW[21] library.
The most straightforward method of parallelizing the DFT is the row-column decomposition method. The following derivation closely paraphrases the classical text Multidimensional Digital Signal Processing.[22] The row-column decomposition can be applied to an arbitrary number of dimensions, but for illustrative purposes the 2D row-column decomposition of the DFT is described first. The 2D DFT is defined as
$$X(k_1,k_2)=\sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}x(n_1,n_2)\,W_{N_1}^{n_1k_1}\,W_{N_2}^{n_2k_2},$$
where the term $W_N=e^{-j2\pi/N}$ is commonly referred to as the twiddle factor of the DFT in the signal processing literature.
The DFT equation can be rewritten in the following form
$$X(k_1,k_2)=\sum_{n_1=0}^{N_1-1}\left[\sum_{n_2=0}^{N_2-1}x(n_1,n_2)\,W_{N_2}^{n_2k_2}\right]W_{N_1}^{n_1k_1},$$
where the quantity inside the brackets is a 2D sequence which we will denote as $G(n_1,k_2)$. We can then express the above equation as the pair of relations
$$G(n_1,k_2)=\sum_{n_2=0}^{N_2-1}x(n_1,n_2)\,W_{N_2}^{n_2k_2},$$
$$X(k_1,k_2)=\sum_{n_1=0}^{N_1-1}G(n_1,k_2)\,W_{N_1}^{n_1k_1}.$$
Each column of $G$ is the 1D DFT of the corresponding column of $x$. Each row of $X$ is the 1D DFT of the corresponding row of $G$. Expressing the 2D DFT in this form shows that a 2D DFT can be computed by decomposing it into row and column DFTs. The DFT of each column of $x$ is computed first, and the results are placed into an intermediate array. The DFT of each row of the intermediate array is then computed.
This row-column decomposition process can easily be extended to compute an mD DFT. First, the 1D DFT is computed with respect to one of the independent variables, say $n_1$, for each value of the remaining variables. Next, 1D DFTs are computed with respect to the variable $n_2$ for all values of the $(M-1)$-tuple $(k_1,n_3,\ldots,n_M)$. We continue in this fashion until all 1D DFTs have been evaluated with respect to all the spatial variables.[22]
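The two-stage structure of the row-column decomposition can be checked numerically. The following is a minimal sketch, not an optimized implementation: it uses a direct 1D DFT in place of an FFT, stores the signal as a list of lists indexed as x[n1][n2], and compares the result against the double-sum definition. All function names are illustrative.

```python
import cmath


def dft1d(seq):
    """Direct O(N^2) 1D DFT, standing in for an FFT to keep the sketch short."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def dft2d_row_column(x):
    """2D DFT of x[n1][n2] computed as two passes of 1D DFTs."""
    # Stage 1: intermediate array g[n1][k2], a 1D DFT over n2 for each fixed n1.
    g = [dft1d(line) for line in x]
    len1, len2 = len(g), len(g[0])
    # Stage 2: a 1D DFT over n1 for each fixed k2; lines[k2][k1] holds the result.
    lines = [dft1d([g[n1][k2] for n1 in range(len1)]) for k2 in range(len2)]
    # Reorder so that the result is indexed as X[k1][k2].
    return [[lines[k2][k1] for k2 in range(len2)] for k1 in range(len1)]


def dft2d_direct(x):
    """2D DFT evaluated straight from the double-sum definition."""
    len1, len2 = len(x), len(x[0])
    return [[sum(x[n1][n2]
                 * cmath.exp(-2j * cmath.pi * n1 * k1 / len1)
                 * cmath.exp(-2j * cmath.pi * n2 * k2 / len2)
                 for n1 in range(len1) for n2 in range(len2))
             for k2 in range(len2)]
            for k1 in range(len1)]
```

Both functions should agree to within floating-point rounding for any rectangular input, which is the property the decomposition relies on.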
The row-column decomposition of the DFT is parallelized in its simplest form by noting that, within each stage, the 1D DFTs of the individual rows (or columns) are independent of one another and can therefore be performed on separate processors in parallel; the second stage begins once the intermediate array from the first stage is complete. The parallel 1D DFT computations on each processor can then use the FFT algorithm for further optimization. One large advantage of this method of parallelizing an mD DFT is that each of the 1D FFTs being performed in parallel on separate processors can itself be performed concurrently on shared-memory multithreaded SIMD processors.[23][8]
GPUs are a particularly convenient hardware platform for applying both the parallel and the concurrent DFT implementation techniques at the same time. Common GPUs have both a set of separate multithreaded SIMD processors (referred to as "streaming multiprocessors" in the CUDA programming language and "compute units" in the OpenCL language) and individual SIMD lanes (loosely referred to as "cores", or more specifically as CUDA "thread processors" or OpenCL "processing elements") within each multithreaded SIMD processor.
A disadvantage of applying a separate FFT on each shared-memory multiprocessor is the required interleaving of the data in the shared memory. One of the most popular libraries that uses this basic form of concurrent FFT computation is the shared-memory version of the FFTW library.[24]
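The independence of the 1D transforms within one stage can be sketched with a worker pool, where each line of the array becomes a separate task. This is only an illustration of the structure: a thread pool is used for brevity, the direct 1D DFT stands in for an FFT, and in CPython the threads do not run the arithmetic truly in parallel. The transposition between the two stages plays the role of the data reshuffling mentioned above. All names are illustrative.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft1d(seq):
    """Direct 1D DFT, standing in for an FFT."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def transpose(a):
    """Swap the two indices of a rectangular list of lists."""
    return [list(col) for col in zip(*a)]


def parallel_stage(lines, max_workers=4):
    """Run one independent 1D DFT per line; the lines do not depend on each other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(dft1d, lines))


def dft2d_parallel(x, max_workers=4):
    """Two parallel stages separated by a transposition of the intermediate array."""
    g = parallel_stage(x, max_workers)                      # DFTs over the second index
    xt = parallel_stage(transpose(g), max_workers)          # DFTs over the first index
    return transpose(xt)
```

A unit impulse at the origin should transform to an array of ones, and a constant array should concentrate all of its energy in the zero-frequency bin, which gives two quick sanity checks for the sketch.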
Parallel implementations of multidimensional FIR filter structures
This section describes a method of implementing an mD digital finite impulse response (FIR) filter in a completely parallel realization. The completely parallel realization of a general FIR filter is achieved by combining parallel sections, each consisting of cascaded 1D digital filters.[25]
Consider the general desired ideal finite-extent mD FIR digital filter in the complex z-domain, given as
Placing the coefficients of into an array and performing some algebraic manipulation as described in,[25] we are able to arrive at an expression that allows us to decompose the filter into a parallel filterbank, given as
where
Therefore, the original mD digital filter is approximately decomposed into a parallel filterbank realization composed of a set of separable parallel filters whose sum approximates the original filter. This proposed parallel FIR filter realization is represented by the block diagram in Figure 1.
The completely parallel realization in Figure 1 can be implemented in hardware by noting that block diagrams, and their corresponding signal-flow graphs (SFGs), are a useful method of graphically representing any DSP algorithm that can be expressed as a linear constant-coefficient difference equation. SFGs allow for an easy transition from a difference equation to a hardware implementation by allowing one to visualize the difference equation in terms of digital logic components such as shift registers, and basic ALU digital circuit elements such as adders and multipliers. For this specific parallel realization, one could place each parallel section on a separate processor so that the sections are implemented in a completely task-parallel fashion.
Using the fork–join parallel programming model, a 'fork' may be applied at the first pickoff point in Figure 1, and the summing junction can be implemented during a synchronization with a 'join' operation. Implementing an mD FIR filter in this fashion lends itself well to the MapReduce general programming model.[26]
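The fork–join structure described above can be sketched for the 2D case as follows. The sketch assumes that the separable sections are already available as pairs of 1D tap vectors (how they are obtained is the subject of the cited decomposition and is not reproduced here), and that all sections use the same tap lengths so that their outputs have equal size. The function names and the thread pool are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor


def conv1d(signal, taps):
    """Full 1D FIR filtering (linear convolution) of a finite sequence."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out


def transpose(a):
    return [list(col) for col in zip(*a)]


def separable_section(x, taps_first, taps_second):
    """One parallel section: a cascade of two 1D FIR filters, one per dimension."""
    along_second = [conv1d(line, taps_second) for line in x]
    along_first = [conv1d(line, taps_first) for line in transpose(along_second)]
    return transpose(along_first)


def parallel_filterbank(x, sections):
    """Fork one task per section, join the results, then sum them."""
    with ThreadPoolExecutor() as pool:
        # Fork: every section filters the same input independently.
        futures = [pool.submit(separable_section, x, a, b) for a, b in sections]
        # Join: wait for all sections to finish.
        outputs = [f.result() for f in futures]
    # Summing junction: add the section outputs element by element.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*outputs)]
```

With two hypothetical sections, one passing the input through unchanged and one delaying it by one sample in each dimension, the output is the input plus its shifted copy, which makes the behavior easy to verify by hand.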
Implementations of multidimensional discrete convolution via shift registers on an FPGA
Convolution on mD signals lends itself well to pipelining because each output sample of the convolution is computed independently of every other one. Owing to this data independence between the individual convolution operations of the filter's impulse response and the signal, a new set of calculations can begin as soon as the first convolution operation has finished. A common method of performing mD convolution in a raster-scan fashion (including for more than two dimensions) on a traditional general-purpose CPU, or even a GPU, is to cache the set of output data from each scan line of each independent dimension in the local cache. By using the custom re-configurable architecture of a field-programmable gate array (FPGA), this procedure can be optimized dramatically by customizing the cache structure.[27]
As in the illustrative example found in the presentation from which this description is derived,[27] the discussion is restricted to two-dimensional signals. In the example, a set of convolution operations is performed between a general 2D signal and a 3x3 filter kernel. As the sequence of convolution operations proceeds along each raster line, the filter kernel is slid across one dimension of the input signal and the data read from memory is cached. The first pass loads three new lines of data into the cache. The OpenCL code for this procedure is shown below.
    // Using a cache to hide poor memory access patterns on a traditional general-purpose processor
    for (int y = 1; y < yDim - 1; ++y) {
        for (int x = 1; x < xDim - 1; ++x) {
            // 3x3 filter kernel: offsets -1..1 map to kernel indices 0..2
            for (int y2 = -1; y2 <= 1; ++y2) {
                for (int x2 = -1; x2 <= 1; ++x2) {
                    cache_temp[y][x] += input_signal[y + y2][x + x2] * kernel[y2 + 1][x2 + 1];
                    //...
                }
            }
        }
    }
This caching technique is used to hide inefficient memory access patterns, in particular poorly coalesced accesses. However, with each successive loop iteration only a single cache line is updated. Under the reasonable assumption of a performance point of one pixel per cycle, applying this caching technique to an FPGA results in a cache requirement of nine reads and one write per cycle. Using this caching technique on an FPGA is therefore inefficient both in power consumption and in memory footprint, which is larger than required because of the many redundant reads into the cache.
With an FPGA we can customize the cache structure to give rise to a much more efficient result.
To alleviate this poor performance, the method proposed in the corresponding literature[27] customizes the cache architecture using the re-configurable hardware resources of the FPGA. The important point is that the FPGA builds the hardware from the code that we write, as opposed to our writing code to run on a fixed architecture with a fixed instruction set.
We now describe how to modify the implementation to optimize the cache architecture on an FPGA. Again, begin with an initial 2D signal and assume it is W pixels wide. First, remove all of the lines that are not in the neighborhood of the window. Next, linearize the remaining segment of the 2D signal into a 1D array using a simple row-major data layout. Under the assumption that we are not concerned with the boundary conditions of the convolution, which is common in many practical applications, we can then discard any pixels that only contribute to the boundary computations. Now, when we slide the window over each array element to perform the next computation, we have effectively created a shift register. The corresponding code for this shift register implementation of 2D filtering is shown below.
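    // Shift register in OpenCL
    pixel_t sr[2*W+3];
    while (keep_going) {
        // Shift data in, starting from the oldest element so that
        // no value is overwritten before it has been moved
        #pragma unroll
        for (int i = 2*W+2; i > 0; --i)
            sr[i] = sr[i-1];
        sr[0] = data_in;

        // Tap output data: three consecutive pixels from each of the three lines in the window
        data_out = {sr[0],   sr[1],     sr[2],
                    sr[W],   sr[W+1],   sr[W+2],
                    sr[2*W], sr[2*W+1], sr[2*W+2]};
        //...
    }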
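The behavior of the shift register can be simulated in software to confirm that tapping it at offsets 0, 1, 2, W, W+1, W+2, 2W, 2W+1 and 2W+2 reproduces the 3x3 window while reading each input pixel only once. The following sketch is a software model under that assumption, not FPGA code; it uses a bounded double-ended queue as the register, applies the kernel without flipping it (as in the loop-based listing), and produces output only for interior windows. All names are illustrative.

```python
from collections import deque


def filter3x3_direct(image, kernel):
    """Reference: nine reads per output pixel, interior windows only."""
    h, w = len(image), len(image[0])
    return [[sum(image[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(3) for dx in range(3))
             for x in range(w - 2)]
            for y in range(h - 2)]


def filter3x3_shift_register(image, kernel):
    """Stream pixels in raster order through a register of length 2*W + 3."""
    h, w = len(image), len(image[0])
    length = 2 * w + 3
    sr = deque([0] * length, maxlen=length)
    flat_out = []
    stream = (pixel for row in image for pixel in row)   # row-major linearization
    for i, pixel in enumerate(stream):
        sr.appendleft(pixel)      # shift in one pixel; the oldest one drops out
        y, x = divmod(i, w)
        if y >= 2 and x >= 2:     # a full 3x3 window is now inside the register
            # sr[0] is the newest pixel (bottom-right of the window);
            # sr[k] is the pixel read k steps earlier in raster order.
            flat_out.append(sum(sr[(2 - dy) * w + (2 - dx)] * kernel[dy][dx]
                                for dy in range(3) for dx in range(3)))
    return [flat_out[r * (w - 2):(r + 1) * (w - 2)] for r in range(h - 2)]
```

Both functions should return identical results for any image that is at least three pixels wide and three pixels high; the difference lies only in how often the input is read, which is the cost the custom cache structure is meant to remove.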
By performing the aforementioned steps, the goal is to manage the data movement so that it matches the FPGA's architectural strengths and achieves the highest performance. These architectural strengths allow a custom implementation based on the work being done, as opposed to relying on fixed operations, fixed data paths, and fixed data widths, as would be done on a general-purpose processor.
Notes
Keimel, Christian, Martin Rothbucher, Hao Shen, and Klaus Diepold. "Video is a cube." IEEE Signal Processing Magazine 28, no. 6 (2011): 41–49.
"Introduction to Parallel Programming With CUDA | Udacity." Introduction to Parallel Programming With CUDA | Udacity. Accessed December 07, 2016. https://www.youtube.com/watch?v=ET9wxEuqp_Y.
Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
Williams, S., Waterman, A. and Patterson, D., 2009. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4), pp. 65–76.
Ofenbeck, Georg, et al. "Applying the roofline model." Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on. IEEE, 2014.
"Introduction to Parallel Programming With CUDA | Udacity." Introduction to Parallel Programming With CUDA | Udacity. Accessed December 07, 2016. https://www.youtube.com/watch?v=539pseJ3EUc.
Humphries, Benjamin, Hansen Zhang, Jiayi Sheng, Raphael Landaverde, and Martin C. Herbordt. "3D FFTs on a Single FPGA." In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, pp. 68–71. IEEE, 2014.
"2DECOMP&FFT – Review on Fast Fourier Transform Software." 2DECOMP&FFT – Review on Fast Fourier Transform Software. Accessed December 07, 2016. http://www.2decomp.org/fft.html.
SIAM Journal on Scientific Computing 2012, Vol. 34, No. 4, pp. C192–C209.
Wang, Lizhe, Dan Chen, Rajiv Ranjan, Samee U. Khan, Joanna Kołodziej, and Jun Wang. "Parallel processing of massive EEG data with MapReduce." In Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, pp. 164–171. IEEE, 2012.