Embedded applications can include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can perform one or more instructions in accordance with at least one embodiment.

One embodiment may be described in the context of a single-processor desktop or server system, but alternative embodiments may be included in a multiprocessor system. The computer system includes a processor to process data signals. The processor, as one illustrative example, includes a complex instruction set computer (CISC) microprocessor or a reduced instruction set computing (RISC) microprocessor. The processor is coupled to a processor bus that transmits data signals between the processor and other components in the system. Depending on the architecture, the processor may have a single internal cache or multiple levels of internal caches.

Other embodiments include a combination of both internal and external caches, depending on the particular implementation and needs. One embodiment of a SoC includes a processor and a memory. The memory of the SoC may be a flash memory, which can be located on the same die as the processor and other system components. Other logic blocks, such as a memory controller or graphics controller, can also be located on a SoC. The logic device (LD) may be a programmable logic device (PLD) or a non-programmable logic device.

In one embodiment, the processor and the LD may be included on a single circuit board, each in its respective location. The LD can be an electronic component used in connection with other components or other integrated circuits, such as the processor. In general, PLDs can have undefined functions at the time of manufacturing and can be programmed or reconfigured before use.

The LD can be a combination of a logic device and a memory device. The memory of the LD can store a pattern that was given to the integrated circuit during programming. The LD can use any type of logic device technology. In one embodiment, the hardware accelerator is a Bitcoin mining hardware accelerator, described in further detail with respect to the figures.

In one embodiment, the Bitcoin mining process starts with a 1024-bit message consisting of a 32-bit version, a 256-bit hash from the previous block, a 256-bit Merkle root of the transactions, a 32-bit time stamp, a 32-bit target value, a 32-bit nonce, and 384 bits of padding. The 1024-bit message is compressed using two stages of 64-round SHA-256 to generate a 256-bit hash. This is padded with a 256-bit constant and compressed again to obtain the final 256-bit hash. A nonce is valid when this final hash is smaller than the target. This may be achieved by looking for a minimum number of leading zeros that would ensure the hash is smaller than the target.
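As a rough software illustration of the message layout, the following Python sketch packs the six header fields (version, previous-block hash, Merkle root, time stamp, target, nonce) into 80 bytes and applies SHA-256 twice. `hashlib` performs the padding internally, so the padding steps are implicit here. The function names are illustrative, not part of the described accelerator; the sample values are those of the Bitcoin genesis block.

```python
import hashlib
import struct


def pack_header(version, prev_hash_hex, merkle_root_hex, timestamp, bits, nonce):
    """Serialize the six header fields into the 80-byte (640-bit) message."""
    return (
        struct.pack("<I", version)
        + bytes.fromhex(prev_hash_hex)[::-1]      # hashes are stored byte-reversed
        + bytes.fromhex(merkle_root_hex)[::-1]
        + struct.pack("<III", timestamp, bits, nonce)
    )


def double_sha256(data):
    """Hash the header, then hash the resulting 256-bit digest again."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


# Header fields of the Bitcoin genesis block.
GENESIS = pack_header(
    version=1,
    prev_hash_hex="00" * 32,
    merkle_root_hex="4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b",
    timestamp=1231006505,
    bits=0x1D00FFFF,
    nonce=2083236893,
)

# Block hashes are conventionally displayed byte-reversed.
block_hash = double_sha256(GENESIS)[::-1].hex()
print(block_hash)
```

The leading zeros in the printed hash are what the target comparison checks for.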

The target, and hence the leading zero requirement, may change depending on the rate of new block creation to maintain the rate at approximately one block every ten minutes. Decreasing the target may decrease the probability of finding a valid hash and hence increase the overall search space to generate a new block for the chain. In one embodiment, for a given header, the Bitcoin mining hardware accelerator traverses the search space of 2^32 options to potentially find a valid nonce.
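The relationship between the target and the leading zero requirement can be sketched in a few lines of Python. This assumes the standard Bitcoin compact encoding of the 32-bit target field; the helper names are hypothetical.

```python
def compact_to_target(bits):
    """Expand the 32-bit compact target field into a 256-bit integer."""
    exponent = bits >> 24
    mantissa = bits & 0x007FFFFF
    if exponent <= 3:
        return mantissa >> (8 * (3 - exponent))
    return mantissa << (8 * (exponent - 3))


def required_leading_zero_bits(target):
    """Minimum number of leading zero bits of any 256-bit hash <= target."""
    return 256 - target.bit_length()


def is_valid(hash_bytes, target):
    """Bitcoin reads the digest as a little-endian integer and accepts hash <= target."""
    return int.from_bytes(hash_bytes, "little") <= target


easy = compact_to_target(0x1D00FFFF)
harder = easy >> 4  # a 16x smaller target
print(required_leading_zero_bits(easy), required_leading_zero_bits(harder))
```

A smaller target demands more leading zeros, so fewer of the 2^32 nonces for a given header are expected to succeed.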

If no valid nonce is found, the Merkle root may be changed by choosing a different set of pending transactions and starting over with the nonce search. The three stages of hashing may be implemented as fully unrolled 64 rounds of SHA-256 message digest and parallel message expansion logic. The computation-intensive SHA-256 hashing may be the major contributor to the energy consumption in a Bitcoin mining accelerator.
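The search procedure can be sketched in software as a nonce loop nested inside a loop over candidate Merkle roots. This is a minimal model, not the accelerator's implementation: `mine`, `search_nonce`, and the `head`/`tail` split of the fixed header fields are assumptions made for illustration, and `nonce_space` is a parameter so the example can run quickly.

```python
import hashlib
import struct


def double_sha256(data):
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def search_nonce(header_prefix, target, nonce_space=2**32):
    """Try each nonce in turn; return the first that meets the target, else None."""
    for nonce in range(nonce_space):
        digest = double_sha256(header_prefix + struct.pack("<I", nonce))
        if int.from_bytes(digest, "little") <= target:
            return nonce
    return None


def mine(head, tail, candidate_merkle_roots, target, nonce_space=2**32):
    """head = version + previous hash (36 bytes); tail = time stamp + target field (8 bytes).

    If the nonce space is exhausted for one Merkle root, move on to the next
    root, which stands in for choosing a different set of pending transactions.
    """
    for root in candidate_merkle_roots:
        nonce = search_nonce(head + root + tail, target, nonce_space)
        if nonce is not None:
            return root, nonce
    return None


# Toy run: a very easy target and a small nonce space.
found = mine(bytes(36), bytes(8), [bytes(32), b"\x01" * 32], 1 << 252, nonce_space=1024)
```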

This may equate to approximately 19 logic gate levels, as shown in FIG. In one embodiment,. H may be a shifted version of G e. Therefore, with WH-LookAhead, in one embodiment,. Deferring the computation may increase the overall SHA latency by one cycle. This may result in a negligible 0.

The computation of A_i from the previous optimizations may make use of the addition of E_i and the subtraction of D_{i-1}. In one embodiment, the 512-bit message input to SHA-256 is consumed by the message digest logic across the first 16 rounds in the form of 32-bit words.
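The following Python sketch shows one way the new A value can be formed from the new E value. In standard SHA-256, the new E is D + T1 and the new A is T1 + T2, so A can equally be written as (new E) - (old D) + T2. This is the textbook identity, not necessarily the exact datapath of the embodiment. The round constants are derived from cube and square roots of primes rather than typed in, and the result is checked against `hashlib`.

```python
import hashlib
import math

MASK = 0xFFFFFFFF


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def icbrt(n):
    """Floor of the cube root of a non-negative integer."""
    x = round(n ** (1.0 / 3.0))
    while x ** 3 > n:
        x -= 1
    while (x + 1) ** 3 <= n:
        x += 1
    return x


PRIMES = first_primes(64)
# Round constants: first 32 fractional bits of the cube roots of the primes.
K = [icbrt(p << 96) & MASK for p in PRIMES]
# Initial state: first 32 fractional bits of the square roots of the first 8 primes.
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def compress(state, block):
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        new_e = (d + t1) & MASK
        # Usual form: new_a = t1 + t2. Because t1 = new_e - d, the same value
        # is obtained by adding new_e and subtracting the old d.
        new_a = (new_e - d + t2) & MASK
        a, b, c, d, e, f, g, h = new_a, a, b, c, new_e, e, f, g
    return [(x + y) & MASK for x, y in zip(state, [a, b, c, d, e, f, g, h])]


def sha256_single_block(message):
    """SHA-256 for messages of at most 55 bytes (one padded 512-bit block)."""
    assert len(message) <= 55
    padded = (message + b"\x80" + bytes(55 - len(message))
              + (8 * len(message)).to_bytes(8, "big"))
    return b"".join(x.to_bytes(4, "big") for x in compress(H0, padded))
```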

For the remaining 48 rounds, the message scheduler logic may combine words of the input message to generate a new 32-bit message word each round. In one embodiment, the datapath for a single round of message expansion logic is shown in figure 8a. The critical path in the message expansion datapath may include a sigma function, two carry-save adders (CSAs), and a completion adder (CA). This results in a critical path of 16 logic gates, as shown in FIG.
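In software terms, the message expansion follows the standard SHA-256 recurrence: the first 16 words come straight from the 512-bit input block, and each later word combines four earlier words through two sigma functions and three additions. A minimal Python sketch (the function names are illustrative):

```python
MASK = 0xFFFFFFFF


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def small_sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def small_sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def expand_message(block):
    """Return the 64 message words W[0..63] for one 64-byte block."""
    assert len(block) == 64
    # Rounds 0-15 consume the input block directly as sixteen 32-bit words.
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    # Rounds 16-63 generate one new word per round from four earlier words.
    for t in range(16, 64):
        w.append((small_sigma1(w[t - 2]) + w[t - 7]
                  + small_sigma0(w[t - 15]) + w[t - 16]) & MASK)
    return w
```

The three additions in the loop body are what the hardware maps onto the two CSAs and the completion adder.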

In one embodiment, the new 32-bit message word generated in each round or cycle is not consumed by the message digest logic for the subsequent 15 rounds. As a result, the computation of a new message word may be distributed across multiple rounds or cycles to reduce the critical path.

The 3-cycle distributed message expansion datapath is shown in FIG. Each of the three additions in the message expansion logic is distributed across three rounds, thereby limiting the critical path of each round to a maximum of one sigma-function and a CA.
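The idea of spreading the three additions over three rounds can be modeled in Python as below. The schedule shown (which partial sum is formed in which round) is an assumption chosen so that every step contains at most one sigma function and one addition; the actual arrangement in the figure is not recoverable from the text. The model is checked against the direct recurrence.

```python
MASK = 0xFFFFFFFF


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def expand_direct(w16):
    """Reference: all three additions for W[t] performed in round t."""
    w = list(w16)
    for t in range(16, 64):
        w.append((sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16]) & MASK)
    return w


def expand_distributed(w16):
    """Each W[t] is built from three one-adder steps done in consecutive rounds."""
    w = list(w16)
    # Prime the pipeline: these partial sums depend only on input words.
    p1 = {16: (sigma0(w[1]) + w[0]) & MASK, 17: (sigma0(w[2]) + w[1]) & MASK}
    p2 = {16: (p1[16] + w[9]) & MASK}
    for t in range(16, 64):
        # Three independent steps share round t:
        w.append((p2[t] + sigma1(w[t - 2])) & MASK)                # last step for W[t]
        if t + 1 < 64:
            p2[t + 1] = (p1[t + 1] + w[t - 6]) & MASK              # middle step for W[t+1]
        if t + 2 < 64:
            p1[t + 2] = (sigma0(w[t - 13]) + w[t - 14]) & MASK     # first step for W[t+2]
    return w


sample = [(i * 0x9E3779B1) & MASK for i in range(16)]
```

Because a new word is not needed by the digest logic immediately, the early partial sums can be formed well before the word is consumed.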

The critical path in the 3-cycle distributed message expansion may include the completion adder. In one embodiment, the bit complete addition can be distributed across two rounds to obtain a 6-cycle distributed message expansion datapath, as shown in FIG. The bit addition in each round may be replaced by a bit addition, reducing the critical path by at least 1 logic gate.

The embodiments of the Bitcoin mining hardware accelerator operations described herein can be implemented in the processor. As yet another option, the processor may include a special-purpose core, such as, for example, a network or communication core, compression engine, graphics core, or the like. In one embodiment, the processor may be a multi-core processor or may be part of a multiprocessor system. The decode unit (also known as a decoder) may decode instructions and generate as an output one or more micro-operations or micro-code entry points.

The decoder may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), etc. The instruction cache unit is further coupled to the memory unit. The scheduler unit(s) represents any number of different schedulers, including reservation stations (RS), a central instruction window, etc.

The scheduler unit(s) is coupled to the physical register file(s) unit(s). Each of the physical register file(s) units represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, etc.

The physical register file(s) unit(s) is overlapped by the retirement unit to illustrate various ways in which register renaming and out-of-order execution may be implemented. The registers are not limited to any known particular type of circuit. Various types of registers are suitable as long as they are capable of storing and providing data as described herein. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc.

The retirement unit and the physical register file(s) unit(s) are coupled to the execution cluster(s). The execution cluster(s) includes a set of one or more execution units and a set of one or more memory access units. The execution units may perform various operations.

In some embodiments, the DCU is also known as a first-level data cache (L1 cache). The DCU may handle multiple outstanding cache misses and continue to service incoming stores and loads. It also supports maintaining cache coherency. The data TLB unit is a cache used to improve virtual address translation speed by mapping virtual and physical address spaces.

In one exemplary embodiment, the memory access units may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit in the memory unit. The L2 cache unit may be coupled to one or more other levels of cache and eventually to a main memory. Prefetching may refer to transferring data stored in one memory location to another memory location.

While the illustrated embodiment of the processor includes separate instruction and data cache units and a shared L2 cache unit, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache.

In some embodiments, the system may include a. The solid lined boxes in FIG. In FIG. In some embodiments, the ordering of stages may be different than illustrated and are not limited to the specific ordering shown in FIG. In some embodiments, Bitcoin mining hardware accelerator operation instructions in accordance with one embodiment can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, etc.

In one embodiment, the in-order front end is the part of the processor that fetches instructions to be executed and prepares them to be used later in the processor pipeline. The embodiments of the Bitcoin mining hardware accelerator operations disclosed herein can be implemented in the processor. In one embodiment, the instruction prefetcher fetches instructions from memory and feeds them to an instruction decoder, which in turn decodes or interprets them.

In other embodiments, the decoder parses the instruction into an opcode and corresponding data and control fields that are used by the micro-architecture to perform operations in accordance with one embodiment. In one embodiment, the trace cache takes decoded uops and assembles them into program ordered sequences or traces in the uop queue for execution.

When the trace cache encounters a complex instruction, the microcode ROM provides the uops needed to complete the operation. In one embodiment, if more than four micro-ops are needed to complete an instruction, the decoder accesses the microcode ROM to complete the instruction. For one embodiment, an instruction can be decoded into a small number of micro-ops for processing at the instruction decoder. In another embodiment, an instruction can be stored within the microcode ROM should a number of micro-ops be needed to accomplish the operation.

The trace cache refers to an entry point. After the microcode ROM finishes sequencing micro-ops for an instruction, the front end of the machine resumes fetching micro-ops from the trace cache. The out-of-order execution logic has a number of buffers to smooth out and reorder the flow of instructions to optimize performance as they go down the pipeline and get scheduled for execution. The allocator logic allocates the machine buffers and resources that each uop needs in order to execute.

The register renaming logic renames logical registers onto entries in a register file. The uop schedulers determine when a uop is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the uop needs to complete its operation. The fast scheduler of one embodiment can schedule on each half of the main clock cycle, while the other schedulers can only schedule once per main processor clock cycle.
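Register renaming can be pictured as a lookup table plus a free list. The Python sketch below is a simplified model under stated assumptions: the class and method names are hypothetical, and freeing of old physical registers at retirement is left out.

```python
class RenameTable:
    """Toy model: map logical register numbers onto physical register entries."""

    def __init__(self, num_logical, num_physical):
        # Initially logical register i lives in physical entry i.
        self.mapping = list(range(num_logical))
        self.free = list(range(num_logical, num_physical))

    def rename(self, dest, sources):
        """Rename one uop; returns (new physical dest, physical sources)."""
        # Sources read whatever mapping is current before the dest is remapped.
        phys_sources = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register; allocation must stall")
        new_dest = self.free.pop(0)
        self.mapping[dest] = new_dest
        return new_dest, phys_sources


table = RenameTable(num_logical=4, num_physical=8)
first = table.rename(dest=1, sources=[2, 3])   # r1 = r2 op r3
second = table.rename(dest=1, sources=[1])     # r1 = op r1 (reads the renamed r1)
```

Giving each write a fresh physical entry removes false dependencies between the two writes to r1, which is what lets the schedulers issue uops out of order.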

The schedulers arbitrate for the dispatch ports to schedule uops for execution. There is a separate register file for integer and floating point operations, respectively. Each register file of one embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent uops.

The integer register file and the floating point register file are also capable of communicating data with each other. For one embodiment, the integer register file is split into two separate register files, one register file for the low-order 32 bits of data and a second register file for the high-order 32 bits of data. The floating point register file of one embodiment has 128-bit-wide entries because floating point instructions typically have operands from 64 to 128 bits in width.

This section includes the register files that store the integer and floating point data operand values that the micro-instructions need to execute. For embodiments of the present disclosure, instructions involving a floating point value may be handled with the floating point hardware.

The fast ALUs of one embodiment can execute fast operations with an effective latency of half a clock cycle. For one embodiment, most complex integer operations go to the slow ALU, as the slow ALU includes integer execution hardware for long-latency operations, such as a multiplier, shifts, flag logic, and branch processing.

For one embodiment, the integer ALUs are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, the ALUs can be implemented to support a variety of data bits including 16, 32, 128, 256, etc. Similarly, the floating point units can be implemented to support a range of operands having bits of various widths. For one embodiment, the floating point units can operate on 128-bit-wide packed data operands in conjunction with SIMD and multimedia instructions.

As uops are speculatively scheduled and executed in the processor, the processor also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data.

A replay mechanism tracks and re-executes instructions that use incorrect data. Only the dependent operations need to be replayed and the independent ones are allowed to complete. The schedulers and replay mechanism of one embodiment of a processor are also designed to catch instruction sequences for text string comparison operations.

In one embodiment, the execution block of the processor may include a microcontroller (MCU) to perform Bitcoin mining operations according to the description herein. However, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment is capable of storing and providing data, and performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc.

In one embodiment, integer registers store thirty-two bit integer data. A register file of one embodiment also contains eight multimedia SIMD registers for packed data. In one embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types. In one embodiment, integer and floating point are either contained in the same register file or different register files. Furthermore, in one embodiment, floating point and integer data may be stored in different registers or the same registers.

Referring now to FIG. As shown in FIG. The processors each may include hybrid write mode logics in accordance with an embodiment of the present disclosure. Bitcoin mining hardware accelerator operations discussed herein can be implemented in either processor or in both. In other implementations, one or more additional processors may be present in a given processor. The first processor also includes, as part of its bus controller units, point-to-point (P-P) interfaces; similarly, the second processor includes P-P interfaces. The processors may exchange information via a point-to-point (P-P) interface using P-P interface circuits. The chipset may also exchange information with a high-performance graphics circuit via a high-performance graphics interface. In one embodiment, the second bus may be a low pin count (LPC) bus.

Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. Like elements in FIGS. For at least one embodiment, the CL may include integrated memory controller units such as described herein.

In addition, operations discussed herein can be implemented in either processor or in both. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable.

Dashed lined boxes are features on more advanced SoCs. Bitcoin mining hardware accelerator operations discussed herein can be implemented by the SoC. As an illustrative example, the SoC is included in user equipment (UE). In one embodiment, UE refers to any device to be used by an end user to communicate, such as a hand-held phone, smartphone, tablet, ultra-thin notebook, notebook with broadband adapter, or any other similar communication device.

The cores are coupled to cache control that is associated with the bus interface unit and L2 cache to communicate with other parts of the system. The interconnect includes an on-chip interconnect, such as an IOSF, AMBA, or other interconnects discussed above, which can implement one or more aspects of the described disclosure. The interface components include DRAM, a flash controller to interface with non-volatile memory (e.g., Flash), a peripheral control (e.g., Serial Peripheral Interface) to interface with peripherals, power control to control power, and video codecs and a video interface to display and receive input. Any of these interfaces may incorporate aspects of the embodiments described herein. Note, as stated above, that a UE includes a radio for communication. As a result, these peripheral communication modules may not all be included.

However, in a UE some form of a radio for external communication should be included. In alternative embodiments, the machine may be connected (e.g., networked) to other machines. The machine may operate in the capacity of a server or a client device in a client-server network environment, or as a peer machine in a peer-to-peer or distributed network environment.

The machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.

The embodiments of the page additions and content copying can be implemented in the computing system. More particularly, the processing device may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computer (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing other instruction sets, or processors implementing a combination of instruction sets. The processing device may also be one or more special-purpose processing devices such as an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, or the like.

In one embodiment, the processing device may include one or more processor cores. The processing device is configured to execute the processing logic for performing the Bitcoin mining hardware accelerator operations discussed herein.

I'm old enough to remember being given a couple of bitcoins when they were worth next to nothing. Needless to say, I don't have them anymore. Now, with bitcoin and other cryptocurrency prices skyrocketing again, there's renewed interest in cryptomining, which is a way to accumulate cryptocurrency without having to pay for it.

Let's take a look at what makes a good cryptomining rig, and what hardware it takes if you want to be serious about mining. In the most basic terms, you are using a computer or computers to solve cryptographic equations and record that data to a blockchain. Taking this a bit deeper, miners verify the hashes of unconfirmed blocks and receive a reward for every hash that is verified.

The process is computationally intensive, requiring state-of-the-art hardware if you are planning on making much headway with mining. Mining, as it was back in the days of the gold rush, is not for the faint of heart. And as with all high-end systems, it's less a case of how much do you want to spend, and more a case of how fast do you want to spend. So, what hardware do you need to mine cryptocurrency?

OK, the "rig" is essentially a customized PC. Where things deviate from the norm is when it comes to the graphics cards. You're going to need quite a powerful GPU for mining, and likely you are going to be buying more than one. A lot more. In fact, you can think of a mining rig as a relatively cheap PC with one or more high-performance GPUs attached. You need to connect multiple graphics cards to a single system, which means you also need a motherboard to handle that.

You'll also be looking at more than one power supply unit (PSU) if you're planning to push things to the extremes. There are also some other mining-specific items you'll need to make the mining rig ready for mining. OK, let's start with the motherboard. The Asus B250 Mining Expert is a beast of a motherboard, capable of having 19 graphics cards connected to it. That's a lot. The board isn't new -- it was released in 2017 -- and it is finicky when it comes to setting up (it needs a specific layout of AMD and Nvidia graphics cards).

Asus has published recommended GPU layouts for this board, and while other layouts might work, I recommend staying with what the manufacturer suggests, as veering away from this is a recipe for serious -- not to mention expensive -- headaches.

This quad-core Core i5 is perfect for this setup and works great with the motherboard chosen above. You're not going to overspend on RAM either. SKILL fits the bill. Depending on how many graphics cards you have installed, you may need multiple PSUs. It's tempting to find the cheapest possible, but since they are going to be pushed hard, I recommend paying a little more. These Segotep PSUs are middle-of-the-road good value, yet they offer reliable performance.

The modular nature also means that you're not turning the mining rig into a spaghetti of wires. This is where a bitcoin mining rig differs from a regular PC in that you can't have all the graphics cards directly attached to the motherboard, so these risers allow you to connect them indirectly. You're going to need one of these for every card you connect other than the card that goes into the x16 PCI-e slot. This six-pack of powered risers is great and provides stable power to your graphics cards.
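If you want to sanity-check a parts list, the riser and PSU counts are simple arithmetic. Here's a rough Python sketch; the wattage figures and the 80% headroom factor are made-up placeholders, so plug in the numbers for your own cards and power supplies.

```python
import math


def risers_needed(num_cards):
    """One riser per card, except the card sitting in the x16 slot."""
    return max(num_cards - 1, 0)


def psus_needed(num_cards, watts_per_card, base_system_watts, psu_watts, headroom=0.8):
    """How many PSUs it takes if each is loaded to at most `headroom` of its rating."""
    total_watts = num_cards * watts_per_card + base_system_watts
    return math.ceil(total_watts / (psu_watts * headroom))


# Hypothetical rig: six cards at 150 W each, 100 W for everything else, 750 W PSUs.
print(risers_needed(6), psus_needed(6, 150, 100, 750))
```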

I do not recommend using non-powered risers. I've had nothing but problems with stability using them in the past in cryptomining rigs, so don't make the same mistake I made! This is a great card and everything you're looking for in a mining rig. Loads of potential for overclocking, stable, and great cooling. Another nice side benefit is that it's quite an efficient card, which means lower power consumption and reduced mining costs. Another example of you get what you pay for: A high-performance graphics card that offers power, performance, and a nice level of efficiency.

By Adrian Kingsley-Hughes for Hardware 2.

