Compared to the days of yore, the design and tuning of integrated circuits have come a long way: modern chips are themselves designed with the help of computers and manufactured via photolithographic printers, which precisely pattern the tiny (nanometre-scale!) intricate components (mostly transistors, which are used to build other components like logic gates and memory registers, tied together via interconnects) composing the layers of masks that go into chip assembly. Now, while my curiosity does drift towards the making, I tend to be more concerned about the hardware and its implications for the software that runs on these chipsets.
x86 vs ARM
Standing at opposite poles are the two prime architectural variants: x86 (Intel’s 8086/8088 microprocessors and the family extensions that followed) and ARM (as a company, Arm licenses chip designs to select manufacturers for further customization and commercial distribution of those custom chips). The former conforms to CISC, or ‘Complex Instruction Set Computer’, while the latter follows RISC (which is also what the ‘R’ in ARM expands to - a touch of nested abbreviations), with ‘Reduced’ in place of ‘Complex’ and the rest of the expansion being the same.
CISC tries to handle a lot of work per instruction, in a jam-packed manner, with hardware-based schemes that can execute multi-clock instructions of fairly high complexity, translated from fewer lines of assembly. The trade-off from a transistor perspective is that silicon goes into decoding and handling those complex instructions rather than into additional registers that would extend on-chip memory capacity. In terms of clock cycles, a CISC instruction can consume several depending on its nature, but it does save on primary memory, given the relatively shorter machine code that is generated and stored.
Meanwhile, RISC utilizes a small, highly optimized set of instructions as opposed to the specialized ones typically found in CISC architectures (it certainly deals with far fewer instructions than x86!), which saves the time spent deciphering how to evaluate and execute complex instructions (and allows manufacturers to enlarge the register set, in addition to the number of parallel threads the CPU can execute). The onus here is on writing software in fewer instructions, but if that is achieved, the low cycles-per-instruction count can outweigh the benefits of CISC. I am not sure how much of this holds in practice, but in theory RISC targets the execution of one instruction per clock cycle, which, if true, definitely helps with pipelining (where branch prediction[1] is a common complication).
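To make that trade-off concrete, here is a minimal back-of-the-envelope sketch using the classic ‘iron law’ of processor performance (execution time = instruction count × cycles per instruction × clock period); the instruction counts, CPIs, and clock rate below are purely hypothetical, not figures for any real x86 or ARM part.

```python
# Back-of-the-envelope CISC vs RISC comparison via the 'iron law' of performance:
#   execution_time = instruction_count * cycles_per_instruction / clock_rate
# All numbers below are illustrative assumptions, not measurements.

def execution_time(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """Seconds taken to run a program with the given instruction mix."""
    return instruction_count * cpi / clock_hz

CLOCK_HZ = 3e9  # assume both designs run at 3 GHz for a like-for-like comparison

# CISC-style: denser machine code (fewer instructions), but several cycles each.
cisc = execution_time(instruction_count=1_000_000, cpi=4.0, clock_hz=CLOCK_HZ)

# RISC-style: more (simpler) instructions, each close to one cycle.
risc = execution_time(instruction_count=1_800_000, cpi=1.1, clock_hz=CLOCK_HZ)

print(f"CISC-style: {cisc * 1e6:.0f} us vs RISC-style: {risc * 1e6:.0f} us")
```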
Preferences
Having reached the point where I predominantly use ARM-based system-on-a-chip (SoC) architectures for everything on a day-to-day basis (right from Qualcomm’s Snapdragon on my phone to Apple’s M1 on my laptop), I infrequently look back at the times when I dealt with Intel’s x86, for better or worse. Performance is the core reason for the switch (time being a luxury), whereas the conversion of code required to fit the customized ARM architectures (given that nearly everything is originally written and optimized for x86!) is the drawback (although the learning that comes from adapting to a new environment tends to be beneficial).
Apple Silicon and M1s
For a case study on the advantageous part, consider the M1 chips: the SoC structure (a collection of many chips in one silicon package) itself provides faster access and lower power consumption (and with less heat being produced, better battery life is apparent) due to the co-location of components (even the memory is ‘unified’, i.e. shared by both the CPU and the GPU). In contrast, Intel-based x86 systems have these components in separate locations, which in turn costs them on both power efficiency (it takes more power to reach the separately located units, in addition to the considerable amount of time taken) and memory bandwidth.
Among recent developments[2] is the combination of two such chips via a shared memory interconnect, resulting in the ‘Ultra’ chip with twice the specs. Since the memory components and cores contained within the two chips are physically separate, this structure has the look of a system with NUMA[3], but thanks to the rapid data transfer rate through this memory linkage, the core-to-memory access latency turns out to be almost negligible (not that a Frankenstein was the expectation, but Apple certainly did this right, making the access pseudo-uniform). What this means is that developers won’t have to write software with NUMA in mind, i.e. by considering which processing cores have fast/local access to which piece of primary memory (so as to fetch/put instructions and data in the closest block of memory).
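For contrast, a minimal sketch (Linux-only, and purely illustrative) of the sort of placement bookkeeping that NUMA-aware code does and that a pseudo-uniform design spares you: pinning a process to one node’s cores so that, under the default first-touch policy, its allocations tend to land in that node’s local memory. The sysfs path and node number are assumptions about a typical Linux box, not about the M1 Ultra.

```python
# Linux-only, illustrative sketch of NUMA-aware placement: pin this process to the
# cores of one NUMA node so its memory pages tend to be allocated on local RAM.
import os

def cpus_of_node(node: int) -> set[int]:
    """Parse a sysfs cpulist such as '0-15' or '0-3,8-11' into a set of CPU ids."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cpus: set[int] = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

if __name__ == "__main__":
    os.sched_setaffinity(0, cpus_of_node(0))   # stay on node 0's cores
    print("running on CPUs:", sorted(os.sched_getaffinity(0)))
```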
This brings me to my second point, the drawback: at times one has to write code to make certain software compatible with a particular system, or in this case, with a particular IC. While Apple’s Rosetta program intends to do this translation from the x86 instruction set to the Apple silicon equivalent, it does not take care of everything and focuses mostly on legacy software.
A64FX and Fugaku
Consider Fujitsu, the ones behind the modification of AArch64 (the 64-bit extension of the ARM architecture) that resulted in the A64FX[4] processor powering the world’s most powerful supercomputer, Fugaku (a machine on the higher end of the computing scale[5], one that can reach exaflops at half precision (16-bit), and one that is completely CPU-based, without any GPUs!). In one instance, it was used to train CosmoFlow (a 3D ConvNet[6]) in parallel across 2¹⁴ (16,384) nodes in a hybrid setup (combining data and model parallelism). To achieve this, Fujitsu had to develop a code generator that translates the output of Xbyak (the JIT assembler in oneDNN, the numerical library they used) to be compatible with their ARM-based A64FX (the original assembler was built to generate execution code for the x86_64 instruction set, oneDNN having been written for Intel CPUs). Additional complexity, as one may call it, but the results can be fruitful once you pass this barrier: they were able to achieve significantly fast training (under thirty minutes, validation included) with this setup, with further improvements on top, such as tuning the code generators (for all three types of convolutions they deal with) and optimizing the data-loading operations, which includes using TensorFlow’s cache function to cache parsed data samples in order to avoid preprocessing them again in subsequent epochs[7] (i.e., iff the samples can fit into memory), and shuffling the data at the beginning of each epoch to improve the convergence of training.
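Purely as an illustration of those last two data-loading tricks (and not Fujitsu’s actual pipeline), a tf.data input pipeline that caches preprocessed samples after the first epoch and reshuffles at the start of every epoch might look roughly like this; the tensor shapes, batch size, and preprocessing step are all made-up stand-ins.

```python
# Illustrative tf.data pipeline: cache preprocessed samples so the work happens
# only in the first epoch, and reshuffle the data at the start of every epoch.
import tensorflow as tf

# Stand-ins for the real dataset: random 'volumes' and targets, purely hypothetical.
samples = tf.random.uniform((256, 32, 32, 32, 4))
targets = tf.random.uniform((256, 4))

def preprocess(x, y):
    # Stand-in for the per-sample parsing that would otherwise repeat every epoch.
    return tf.cast(x, tf.float16), y

dataset = (
    tf.data.Dataset.from_tensor_slices((samples, targets))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                                       # keep parsed samples in memory (iff they fit)
    .shuffle(256, reshuffle_each_iteration=True)   # new order at the beginning of each epoch
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

for epoch in range(2):      # the second epoch reads from the cache, reshuffled
    for batch in dataset:
        pass
```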
Facts, elaboration, and additional information on the footnoted terms, for the reader (future me included :)
- [1] Basically, when the processor deals with a conditional statement, it has to take a particular route. Given that the right direction isn’t obvious from the processing logic ahead of time, ‘predictions’ are essentially made with certain predicates as heuristics (the reason why sorted data has an advantage over randomized data), which is the basis for this term. Note that this concerns CPU-based workloads; for GPUs, even with predication it’s always better to completely avoid the conditionals than to minimize them (quite possibly the reason why warp sizes weren’t extended beyond 2⁵, i.e. 32), often by offloading the branching logic to the CPU or filtering the workload before pushing it onto the kernel that designates the GPU work (a rough sketch of this filtering idea follows after this list).
- [2] The latest development for Apple Silicon was revealed today at WWDC ’22 (part of the motivation for writing this): the second-generation M2 chips, which bring claimed improvements in memory bandwidth and processor speed, along with significantly reduced power consumption (the key points).
- [3] For systems composed of multiple processors, the access time for a chunk of random-access memory from a processing core can vary based on that memory’s location relative to the particular processor. The decision as to which core gets local access to which piece of memory is traditionally handled by the arbitration logic of the bus that connects the two, but naturally, that bus cannot facilitate multiple cores accessing the same piece of RAM at a given point. While caches mitigate this to some extent (one processor can fetch data from its cache instead of waiting for its turn to access that part of main memory), such systems still suffer from collisions when multiple processors try to access memory at the same time, the probability of which increases with core count.
- [4] A64FX is the first processor to implement the Armv8-A Scalable Vector Extension (SVE) instruction set, with a 512-bit vector length (talk about high SIMD processing!).
- [5] As a side note from the opposite extremity of the size scale, even the world’s smallest computer (courtesy of UMich) is ARM-based!
- [6] A ‘convolutional’ variant of artificial neural networks, built from multiple layers of abstraction (given by mathematical operations that merge sets of filtered information), often used for classification and, in general, the analysis of images (definitely not restricted to image-based scopes though; signal processing, for instance, is another area where it fits well). For good reason, CNNs are used to filter the images: with proper heuristics, they lessen the computational expense that plain ANNs incur. Consider the basics: for an image of x megapixels (which is nothing but a multidimensional array of programmable colours, mapping to the linear address space inside computers via strides, which designate the width of a row in the image, or how far we have to go in memory to reach the next row’s pixel), a brute-force strategy that considers every pixel as input would need x million input nodes in a traditional, fully connected neural network (where every neuron in one layer connects to every neuron in the next, much like nodes in a network with a mesh topology). One pass from this input layer to a similarly sized next layer (on the order of x million × x million weights) would be more than enough to melt typical computers, and quite possibly even supercomputers for a large number of layers; a worked calculation follows after this list. Given that there is too much information to begin with, one can’t attempt to look at it with the unintuitive all-in approach; there has to be some informed decision making, which is exactly what ConvNets provide, and why they come into play here.
- [7] An epoch is one complete pass of the entire dataset through the neural network, forward and backward.
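As a rough illustration of the ‘filter before the kernel’ idea from note [1]: rather than branching per element, select the relevant elements up front and hand the heavy, uniform computation to a vectorized routine in one go. NumPy stands in for the GPU kernel here, and the threshold and the per-element work are made-up assumptions.

```python
# Illustrative only: avoid per-element branching by filtering the workload first,
# then running one uniform, vectorized pass over the survivors.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1_000_000)

def branchy(values):
    # Per-element conditional: the kind of divergence best kept off the GPU.
    return sum(np.sqrt(v) for v in values if v > 0.5)

def filtered(values):
    # Filter up front, then run the 'kernel' on a dense, branch-free subset.
    survivors = values[values > 0.5]
    return float(np.sqrt(survivors).sum())

assert np.isclose(branchy(data), filtered(data))
```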
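And to put rough numbers on the brute-force claim in note [6], a worked comparison under assumed sizes (a 4-megapixel greyscale image, a dense hidden layer as large as the input, versus a single 3×3 convolutional layer with 32 filters); the sizes are illustrative, not taken from any of the systems above.

```python
# Worked, illustrative parameter count: fully connected layer over raw pixels
# versus a small convolutional layer with shared weights. Sizes are assumptions.
height, width, channels = 2000, 2000, 1       # a 4-megapixel greyscale image
inputs = height * width * channels            # 4,000,000 input nodes

# Dense layer with as many neurons as inputs: every input feeds every neuron.
dense_weights = inputs * inputs               # 1.6e13 weights -- the 'melting' case

# Conv layer: 32 filters of size 3x3, reused across every spatial position.
filters, k = 32, 3
conv_weights = filters * (k * k * channels)   # 288 weights, independent of image size

print(f"dense: {dense_weights:,} weights vs conv: {conv_weights:,} weights")
```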
Anirban | 06/06/2022 |