Search Results

Search found 34 results on 2 pages for 'microprocessor'.

Page 1/2 | 1 2  | Next Page >

  • Looking for a good book on microprocessor internals

    - by David Holm
    I'm looking for a good book on how modern microprocessors are designed and work as I would like to increase my understanding of what makes them tick. Something that covers pipelines, superscalar architectures, caches etc. A book that is suitable for a programmer with several years of experience and has done and understands assembly programming and machine language, so basically not "CPUs for Dummies" or anything such. What books do people who design today's processors read for instance?

    Read the article

  • What is the relationship between the command line, the OS and the microprocessor? [closed]

    - by ssbrewster
    I'm not totally clear on how using the command line differs from working through the OS' interface using an editor for example. Obviously the UI is different but I want to understand how the command line interacts with the kernel and microprocessor, and how this compares to how kernel interfaces with the OS' GUI. I know I'm missing several layers of abstraction but would be grateful for someone to explain this.

    Read the article

  • What operating systems available for an 8-bit microprocessor?

    - by Benoit
    It does not need to be a full fledged OS, but at least have multitasking capabilities (i.e. a scheduler). Please mention what processor architecture it works on. This is a survey, so exact capabilities are not really important. Think of this as being a place to look at for possibilities when your next 8-bit embedded project comes up... I realize that most 8-bit micro do not require an OS, but just as a counter-example, Rabbit Semiconductor offers the RCM3710 processor module with 4 serial ports, a 10-BaseT ethernet port, 512K RAM and 512K Flash. All that for $39. All based around an 8-bit Z80 core. 8-bit does NOT necessarily mean extreme resource constraint.

    Read the article

  • Optimizing Solaris 11 SHA-1 on Intel Processors

    - by danx
    SHA-1 is a "hash" or "digest" operation that produces a 160 bit (20 byte) checksum value on arbitrary data, such as a file. It is intended to uniquely identify text and to verify it hasn't been modified. Max Locktyukhin and others at Intel have improved the performance of the SHA-1 digest algorithm using multiple techniques. This code has been incorporated into Solaris 11 and is available in the Solaris Crypto Framework via the libmd(3LIB), the industry-standard libpkcs11(3LIB) library, and Solaris kernel module sha1. The optimized code is used automatically on systems with a x86 CPU supporting SSSE3 (Intel Supplemental SSSE3). Intel microprocessor architectures that support SSSE3 include Nehalem, Westmere, Sandy Bridge microprocessor families. Further optimizations are available for microprocessors that support AVX (such as Sandy Bridge). Although SHA-1 is considered obsolete because of weaknesses found in the SHA-1 algorithm—NIST recommends using at least SHA-256, SHA-1 is still widely used and will be with us for awhile more. Collisions (the same SHA-1 result for two different inputs) can be found with moderate effort. SHA-1 is used heavily though in SSL/TLS, for example. And SHA-1 is stronger than the older MD5 digest algorithm, another digest option defined in SSL/TLS. Optimizations Review SHA-1 operates by reading an arbitrary amount of data. The data is read in 512 bit (64 byte) blocks (the last block is padded in a specific way to ensure it's a full 64 bytes). Each 64 byte block has 80 "rounds" of calculations (consisting of a mixture of "ROTATE-LEFT", "AND", and "XOR") applied to the block. Each round produces a 32-bit intermediate result, called W[i]. Here's what each round operates: The first 16 rounds, rounds 0 to 15, read the 512 bit block 32 bits at-a-time. These 32 bits is used as input to the round. The remaining rounds, rounds 16 to 79, use the results from the previous rounds as input. Specifically for round i it XORs the results of rounds i-3, i-8, i-14, and i-16 and rotates the result left 1 bit. The remaining calculations for the round is a series of AND, XOR, and ROTATE-LEFT operators on the 32-bit input and some constants. The 32-bit result is saved as W[i] for round i. The 32-bit result of the final round, W[79], is the SHA-1 checksum. Optimization: Vectorization The first 16 rounds can be vectorized (computed in parallel) because they don't depend on the output of a previous round. As for the remaining rounds, because of step 2 above, computing round i depends on the results of round i-3, W[i-3], one can vectorize 3 rounds at-a-time. Max Locktyukhin found through simple factoring, explained in detail in his article referenced below, that the dependencies of round i on the results of rounds i-3, i-8, i-14, and i-16 can be replaced instead with dependencies on the results of rounds i-6, i-16, i-28, and i-32. That is, instead of initializing intermediate result W[i] with: W[i] = (W[i-3] XOR W[i-8] XOR W[i-14] XOR W[i-16]) ROTATE-LEFT 1 Initialize W[i] as follows: W[i] = (W[i-6] XOR W[i-16] XOR W[i-28] XOR W[i-32]) ROTATE-LEFT 2 That means that 6 rounds could be vectorized at once, with no additional calculations, instead of just 3! This optimization is independent of Intel or any other microprocessor architecture, although the microprocessor has to support vectorization to use it, and exploits one of the weaknesses of SHA-1. Optimization: SSSE3 Intel SSSE3 makes use of 16 %xmm registers, each 128 bits wide. The 4 32-bit inputs to a round, W[i-6], W[i-16], W[i-28], W[i-32], all fit in one %xmm register. The following code snippet, from Max Locktyukhin's article, converted to ATT assembly syntax, computes 4 rounds in parallel with just a dozen or so SSSE3 instructions: movdqa W_minus_04, W_TMP pxor W_minus_28, W // W equals W[i-32:i-29] before XOR // W = W[i-32:i-29] ^ W[i-28:i-25] palignr $8, W_minus_08, W_TMP // W_TMP = W[i-6:i-3], combined from // W[i-4:i-1] and W[i-8:i-5] vectors pxor W_minus_16, W // W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W_TMP, W // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) movdqa W, W_TMP // 4 dwords in W are rotated left by 2 psrld $30, W // rotate left by 2 W = (W >> 30) | (W << 2) pslld $2, W_TMP por W, W_TMP movdqa W_TMP, W // four new W values W[i:i+3] are now calculated paddd (K_XMM), W_TMP // adding 4 current round's values of K movdqa W_TMP, (WK(i)) // storing for downstream GPR instructions to read A window of the 32 previous results, W[i-1] to W[i-32] is saved in memory on the stack. This is best illustrated with a chart. Without vectorization, computing the rounds is like this (each "R" represents 1 round of SHA-1 computation): RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR With vectorization, 4 rounds can be computed in parallel: RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR Optimization: AVX The new "Sandy Bridge" microprocessor architecture, which supports AVX, allows another interesting optimization. SSSE3 instructions have two operands, a input and an output. AVX allows three operands, two inputs and an output. In many cases two SSSE3 instructions can be combined into one AVX instruction. The difference is best illustrated with an example. Consider these two instructions from the snippet above: pxor W_minus_16, W // W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W_TMP, W // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) With AVX they can be combined in one instruction: vpxor W_minus_16, W, W_TMP // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) This optimization is also in Solaris, although Sandy Bridge-based systems aren't widely available yet. As an exercise for the reader, AVX also has 256-bit media registers, %ymm0 - %ymm15 (a superset of 128-bit %xmm0 - %xmm15). Can %ymm registers be used to parallelize the code even more? Optimization: Solaris-specific In addition to using the Intel code described above, I performed other minor optimizations to the Solaris SHA-1 code: Increased the digest(1) and mac(1) command's buffer size from 4K to 64K, as previously done for decrypt(1) and encrypt(1). This size is well suited for ZFS file systems, but helps for other file systems as well. Optimized encode functions, which byte swap the input and output data, to copy/byte-swap 4 or 8 bytes at-a-time instead of 1 byte-at-a-time. Enhanced the Solaris mdb(1) and kmdb(1) debuggers to display all 16 %xmm and %ymm registers (mdb "$x" command). Previously they only displayed the first 8 that are available in 32-bit mode. Can't optimize if you can't debug :-). Changed the SHA-1 code to allow processing in "chunks" greater than 2 Gigabytes (64-bits) Performance I measured performance on a Sun Ultra 27 (which has a Nehalem-class Xeon 5500 Intel W3570 microprocessor @3.2GHz). Turbo mode is disabled for consistent performance measurement. Graphs are better than words and numbers, so here they are: The first graph shows the Solaris digest(1) command before and after the optimizations discussed here, contained in libmd(3LIB). I ran the digest command on a half GByte file in swapfs (/tmp) and execution time decreased from 1.35 seconds to 0.98 seconds. The second graph shows the the results of an internal microbenchmark that uses the Solaris libpkcs11(3LIB) library. The operations are on a 128 byte buffer with 10,000 iterations. The results show operations increased from 320,000 to 416,000 operations per second. Finally the third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module. The operations are on a 64Kbyte buffer with 100 iterations. third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module. The operations are on a 64Kbyte buffer with 100 iterations. The results show for 1 kernel thread, operations increased from 410 to 600 MBytes/second. For 8 kernel threads, operations increase from 1540 to 1940 MBytes/second. Availability This code is in Solaris 11 FCS. It is available in the 64-bit libmd(3LIB) library for 64-bit programs and is in the Solaris kernel. You must be running hardware that supports Intel's SSSE3 instructions (for example, Intel Nehalem, Westmere, or Sandy Bridge microprocessor architectures). The easiest way to determine if SSSE3 is available is with the isainfo(1) command. For example, nehalem $ isainfo -v $ isainfo -v 64-bit amd64 applications sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu If the output also shows "avx", the Solaris executes the even-more optimized 3-operand AVX instructions for SHA-1 mentioned above: sandybridge $ isainfo -v 64-bit amd64 applications avx xsave pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications avx xsave pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu No special configuration or setup is needed to take advantage of this code. Solaris libraries and kernel automatically determine if it's running on SSSE3 or AVX-capable machines and execute the correctly-tuned code for that microprocessor. Summary The Solaris 11 Crypto Framework, via the sha1 kernel module and libmd(3LIB) and libpkcs11(3LIB) libraries, incorporated a useful SHA-1 optimization from Intel for SSSE3-capable microprocessors. As with other Solaris optimizations, they come automatically "under the hood" with the current Solaris release. References "Improving the Performance of the Secure Hash Algorithm (SHA-1)" by Max Locktyukhin (Intel, March 2010). The source for these SHA-1 optimizations used in Solaris "SHA-1", Wikipedia Good overview of SHA-1 FIPS 180-1 SHA-1 standard (FIPS, 1995) NIST Comments on Cryptanalytic Attacks on SHA-1 (2005, revised 2006)

    Read the article

  • C++ USB Programming

    - by JB_SO
    Hi, I am new to hardware programming(especially USB) so please bear with me and my questions. I am using C++ and I need to send/receive some data (a byte array) to/from a USB port on a microprocessor board. Now, I have done some serial port programming before and I know that for a serial port you have to open a port, setup, perform i/o and finally close the port. I am guessing to use a USB port, it is not as simple as what I mentioned above. I do know that I want to use Microsoft standard drivers and implement standard Windows IO commands to accomplish this, since I believe there are no drivers for the microprocessor board for me to interact with. If somebody can point me in the right direction as to the steps needed to "talk" to a USB port (open, setup, i/o) via standard Windows IO commands, I would truly and greatly appreciate it. Thanks you so much!!

    Read the article

  • Preprocessor not skipping asm directives

    - by demula
    I'm programming to a microprocessor (Coldfire) and I don't have access every day to the microprocessor trainer so I want to be able to execute some of my code on the computer. So I'm trying to skip a part of my code when executing on the computer by defining _TEST_. It doesn't work. It tries to compile the asm code and dies whining about not knowing the registers names (they're defined alright compiling against the Coldfire, not my Intel Core Duo). Any ideas why it's not working? or maybe an alternative way to run the code on the pc without commenting it out?. Here's sample code from my project: inline void ct_sistem_exit(int status) { #ifdef _TEST_ exit(status); #else asm volatile( "moveb #0,%%d1\n\t" "movel #0, %%d0\n\t" "trap #15\n\t" : : : "d0", "d1" ); #endif /* _TEST_ */ } If it helps: using gcc3 with cygwin on Netbeans 6.8

    Read the article

  • Problem in installation in My Hp g4 1226se

    - by vivek Verma
    1vivek.100 Dual booting error in Hp pavilion g4 1226se Dear sir or Madam, My name is vivek verma.... I am the user of my Hp laptop which series and model name is HP PAVILION G4 1226SE........ i have purchase in the year of 2012 and month is February.....the windows 7 home basic 64 Bit is already installed in in my laptop.... Now i want to install Ubuntu 12.04 Lts or 13.10 lts..... i have try many time to install in my laptop via live CD or USB installer....and i have try many live CD and many pen drive to install Ubuntu ... but it is not done......now i am in very big problem...... when i put my CD or USB drive to boot and install the Ubuntu......my laptop screen is goes the some black (brightness of my laptop screen is very low and there is very low visibility ) and not showing any thing on my laptop screen..... and when i move the my laptop screen.....then there is graphics option in this screen to installation of the Ubuntu option......and when i press the dual boot with setting button and press to continue them my laptop is goes for shutdown after 2 or 5 minutes..... ...... and Hp service center person is saying to me our laptop hardware has no problem.....please contact to Ubuntu tech support............. show please help me if possible..... My laptop configuration is here...... Hardware Product Name g4-1226se Product Number QJ551EA Microprocessor 2.4 GHz Intel Core i5-2430M Microprocessor Cache 3 MB L3 cache Memory 4 GB DDR3 Memory Max Upgradeable to 4 GB DDR3 Video Graphics Intel HD 3000 (up to 1.65 GB) Display 35,5 cm (14,0") High-Definition LED-backlit BrightView Display (1366 x 768) Hard Drive 500 GB SATA (5400 rpm) Multimedia Drive SuperMulti DVD±R/RW with Double Layer Support Network Card Integrated 10/100 BASE-T Ethernet LAN Wireless Connectivity 802.11 b/g/n Sound Altec Lansing speakers Keyboard Full size island-style keyboard with home roll keys Pointing Device TouchPad supporting Multi-Touch gestures with On/Off button PC Card Slots Multi-Format Digital Media Card Reader for Secure Digital cards, Multimedia cards External Ports 1 VGA 1 headphone-out 1 microphone-in 3 USB 2.0 1 RJ45 Dimensions 34.1 x 23.1 x 3.56 cm Weight Starting at 2.1 kg Power 65W AC Power Adapter 6-cell Lithium-Ion (Li-Ion) What's In The Box Webcam with Integrated Digital Microphone (VGA) Software Operating System: Windows 7 Home Basic 64bit....Genuine..... ......... Sir please help me if possible....... Name =vivek verma Contact no.+919911146737 Email [email protected]

    Read the article

  • What Are Some Advantages/Disadvantages of Using C over Assembly?

    - by Daniel
    I'm currently studying engineering in Telecommunications and Electronics and we have migrated from assembler to C in microprocessor programming. I have doubts that this is a good idea. What are some advantages and disadvantages of C compared to assembly? The advantages/disadvantages I see are: Advantages: I can tell that C syntax is a lot easier to learn than Assembler syntax. C is easier to use for making more complex programs. Learning C is somehow more productive than learning assembler cause there is more developing stuff around C than Assembler. Disadvantages: Assembler is a lower level programming language than C,so this makes it a good for programming directly to hardware. Is a lot more flexible alluding you to work with memory,interrupts,micro-registers,etc.

    Read the article

  • DIY Sunrise Simulator Combines Microchips, LEDs, and Laser Cut Goodness

    - by Jason Fitzpatrick
    Sunrise simulators use a gradually brightening light to wake you in the morning. Check out this creative build that combines a microprocessor, addressable LEDs, and a nifty laser-cut bracket to yield a polished and wall-mountable alarm clock lamp. Courtesy of NYC-based tinker Holly, the project features a detailed build guide that references all the other projects that inspired her sunrise simulator. Hit up the link below to check out everything from her laser cut shade brackets to the Adafruit module she used to control the light timing. Sunrise Lamp Alarm Clock [via Make] How To Create a Customized Windows 7 Installation Disc With Integrated Updates How to Get Pro Features in Windows Home Versions with Third Party Tools HTG Explains: Is ReadyBoost Worth Using?

    Read the article

  • Resources on learning to program in machine code?

    - by AceofSpades
    I'm a student, fresh into programming and loving it, from Java to C++ and down to C. I moved backwards to the barebones and thought to go further down to Assembly. But, to my surprise, a lot of people said it's not as fast as C and there is no use. They suggested learning either how to program a kernel or writing a C compiler. My dream is to learn to program in binary (machine code) or maybe program bare metal (program micro-controller physically) or write bios or boot loaders or something of that nature. The only possible thing I heard after so much research is that a hex editor is the closest thing to machine language I could find in this age and era. Are there other things I'm unaware of? Are there any resources to learn to program in machine code? Preferably on a 8-bit micro-controller/microprocessor. This question is similar to mine, but I'm interested in practical learning first and then understanding the theory.

    Read the article

  • GPRS communication

    - by Manoj Goel
    Hi All, I want to set up communication between GSM/GPRS modem and a remote server or PC. How to do that? Do we need some application on PC which will communicate to the GSM modem. I want 2 way commuinication. I want to interface GSM/GPRS modem with some microprocessor which has some LCD display. Can anybody help me in this. Thanks, Manoj

    Read the article

  • Optimizing AES modes on Solaris for Intel Westmere

    - by danx
    Optimizing AES modes on Solaris for Intel Westmere Review AES is a strong method of symmetric (secret-key) encryption. It is a U.S. FIPS-approved cryptographic algorithm (FIPS 197) that operates on 16-byte blocks. AES has been available since 2001 and is widely used. However, AES by itself has a weakness. AES encryption isn't usually used by itself because identical blocks of plaintext are always encrypted into identical blocks of ciphertext. This encryption can be easily attacked with "dictionaries" of common blocks of text and allows one to more-easily discern the content of the unknown cryptotext. This mode of encryption is called "Electronic Code Book" (ECB), because one in theory can keep a "code book" of all known cryptotext and plaintext results to cipher and decipher AES. In practice, a complete "code book" is not practical, even in electronic form, but large dictionaries of common plaintext blocks is still possible. Here's a diagram of encrypting input data using AES ECB mode: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ AESKey-->(AES Encryption) AESKey-->(AES Encryption) | | | | \/ \/ CipherTextOutput CipherTextOutput Block 1 Block 2 What's the solution to the same cleartext input producing the same ciphertext output? The solution is to further process the encrypted or decrypted text in such a way that the same text produces different output. This usually involves an Initialization Vector (IV) and XORing the decrypted or encrypted text. As an example, I'll illustrate CBC mode encryption: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ IV >----->(XOR) +------------->(XOR) +---> . . . . | | | | | | | | \/ | \/ | AESKey-->(AES Encryption) | AESKey-->(AES Encryption) | | | | | | | | | \/ | \/ | CipherTextOutput ------+ CipherTextOutput -------+ Block 1 Block 2 The steps for CBC encryption are: Start with a 16-byte Initialization Vector (IV), choosen randomly. XOR the IV with the first block of input plaintext Encrypt the result with AES using a user-provided key. The result is the first 16-bytes of output cryptotext. Use the cryptotext (instead of the IV) of the previous block to XOR with the next input block of plaintext Another mode besides CBC is Counter Mode (CTR). As with CBC mode, it also starts with a 16-byte IV. However, for subsequent blocks, the IV is just incremented by one. Also, the IV ix XORed with the AES encryption result (not the plain text input). Here's an illustration: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ AESKey-->(AES Encryption) AESKey-->(AES Encryption) | | | | \/ \/ IV >----->(XOR) IV + 1 >---->(XOR) IV + 2 ---> . . . . | | | | \/ \/ CipherTextOutput CipherTextOutput Block 1 Block 2 Optimization Which of these modes can be parallelized? ECB encryption/decryption can be parallelized because it does more than plain AES encryption and decryption, as mentioned above. CBC encryption can't be parallelized because it depends on the output of the previous block. However, CBC decryption can be parallelized because all the encrypted blocks are known at the beginning. CTR encryption and decryption can be parallelized because the input to each block is known--it's just the IV incremented by one for each subsequent block. So, in summary, for ECB, CBC, and CTR modes, encryption and decryption can be parallelized with the exception of CBC encryption. How do we parallelize encryption? By interleaving. Usually when reading and writing data there are pipeline "stalls" (idle processor cycles) that result from waiting for memory to be loaded or stored to or from CPU registers. Since the software is written to encrypt/decrypt the next data block where pipeline stalls usually occurs, we can avoid stalls and crypt with fewer cycles. This software processes 4 blocks at a time, which ensures virtually no waiting ("stalling") for reading or writing data in memory. Other Optimizations Besides interleaving, other optimizations performed are Loading the entire key schedule into the 128-bit %xmm registers. This is done once for per 4-block of data (since 4 blocks of data is processed, when present). The following is loaded: the entire "key schedule" (user input key preprocessed for encryption and decryption). This takes 11, 13, or 15 registers, for AES-128, AES-192, and AES-256, respectively The input data is loaded into another %xmm register The same register contains the output result after encrypting/decrypting Using SSSE 4 instructions (AESNI). Besides the aesenc, aesenclast, aesdec, aesdeclast, aeskeygenassist, and aesimc AESNI instructions, Intel has several other instructions that operate on the 128-bit %xmm registers. Some common instructions for encryption are: pxor exclusive or (very useful), movdqu load/store a %xmm register from/to memory, pshufb shuffle bytes for byte swapping, pclmulqdq carry-less multiply for GCM mode Combining AES encryption/decryption with CBC or CTR modes processing. Instead of loading input data twice (once for AES encryption/decryption, and again for modes (CTR or CBC, for example) processing, the input data is loaded once as both AES and modes operations occur at in the same function Performance Everyone likes pretty color charts, so here they are. I ran these on Solaris 11 running on a Piketon Platform system with a 4-core Intel Clarkdale processor @3.20GHz. Clarkdale which is part of the Westmere processor architecture family. The "before" case is Solaris 11, unmodified. Keep in mind that the "before" case already has been optimized with hand-coded Intel AESNI assembly. The "after" case has combined AES-NI and mode instructions, interleaved 4 blocks at-a-time. « For the first table, lower is better (milliseconds). The first table shows the performance improvement using the Solaris encrypt(1) and decrypt(1) CLI commands. I encrypted and decrypted a 1/2 GByte file on /tmp (swap tmpfs). Encryption improved by about 40% and decryption improved by about 80%. AES-128 is slighty faster than AES-256, as expected. The second table shows more detail timings for CBC, CTR, and ECB modes for the 3 AES key sizes and different data lengths. » The results shown are the percentage improvement as shown by an internal PKCS#11 microbenchmark. And keep in mind the previous baseline code already had optimized AESNI assembly! The keysize (AES-128, 192, or 256) makes little difference in relative percentage improvement (although, of course, AES-128 is faster than AES-256). Larger data sizes show better improvement than 128-byte data. Availability This software is in Solaris 11 FCS. It is available in the 64-bit libcrypto library and the "aes" Solaris kernel module. You must be running hardware that supports AESNI (for example, Intel Westmere and Sandy Bridge, microprocessor architectures). The easiest way to determine if AES-NI is available is with the isainfo(1) command. For example, $ isainfo -v 64-bit amd64 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu No special configuration or setup is needed to take advantage of this software. Solaris libraries and kernel automatically determine if it's running on AESNI-capable machines and execute the correctly-tuned software for the current microprocessor. Summary Maximum throughput of AES cipher modes can be achieved by combining AES encryption with modes processing, interleaving encryption of 4 blocks at a time, and using Intel's wide 128-bit %xmm registers and instructions. References "Block cipher modes of operation", Wikipedia Good overview of AES modes (ECB, CBC, CTR, etc.) "Advanced Encryption Standard", Wikipedia "Current Modes" describes NIST-approved block cipher modes (ECB,CBC, CFB, OFB, CCM, GCM)

    Read the article

  • Parameters for selection of Operating system, memory and processor for embedded system ?

    - by James
    I am developing an embedded real time system software (in C language). I have designed the s/w architecture - we know various objects required, interactions required between various objects and IPC communication between tasks. Based on this information, i need to decide on the operating system(RTOS), microprocessor and memory size requirements. (Most likely i would be using Quadros, as it has been suggested by the client based on their prior experience in similar projects) But i am confused about which one to begin with, since choice of one could impact the selection of other. Could you also guide me on parameters to consider to estimate the memory requirements from the s/w design (lower limit and upper limit of memory requirement) ? (Cost of the component(s) could be ignored for this evaluation)

    Read the article

  • Which equipments should I buy for a laboratory?

    - by Hossein Margani
    Hi every body! I want to create a network of computers for a network laboratory in a university. I want to install windows on them. What is the fastest way for doing this and also which equipments I should buy for this laboratory. I also want to use this network as a computer laboratory, operating system laboratory, programming laboratory, microprocessor laboratory, computer architecture laboratory and software engineering laboratory. I should emphasis that this is a scholastic laboratory. Thanks.

    Read the article

  • How do I create my own programming language and a compiler for it

    - by Dave
    I am thorough with programming and have come across languages including BASIC, FORTRAN, COBOL, LISP, LOGO, Java, C++, C, MATLAB, Mathematica, Python, Ruby, Perl, JavaScript, Assembly and so on. I can't understand how people create programming languages and devise compilers for it. I also couldn't understand how people create OS like Windows, Mac, UNIX, DOS and so on. The other thing that is mysterious to me is how people create libraries like OpenGL, OpenCL, OpenCV, Cocoa, MFC and so on. The last thing I am unable to figure out is how scientists devise an assembly language and an assembler for a microprocessor. I would really like to learn all of these stuff and I am 15 years old. I always wanted to be a computer scientist someone like Babbage, Turing, Shannon, or Dennis Ritchie. I have already read Aho's Compiler Design and Tanenbaum's OS concepts book and they all only discuss concepts and code in a high level. They don't go into the details and nuances and how to devise a compiler or operating system. I want a concrete understanding so that I can create one myself and not just an understanding of what a thread, semaphore, process, or parsing is. I asked my brother about all this. He is a SB student in EECS at MIT and hasn't got a clue of how to actually create all these stuff in the real world. All he knows is just an understanding of Compiler Design and OS concepts like the ones that you guys have mentioned (i.e. like Thread, Synchronization, Concurrency, memory management, Lexical Analysis, Intermediate code generation and so on)

    Read the article

  • CodePlex Daily Summary for Friday, December 14, 2012

    CodePlex Daily Summary for Friday, December 14, 2012Popular ReleasesCommand Line Parser Library: 1.9.3.31 rc0: Main assembly CommandLine.dll signed. Removed old commented code. Added missing XML documentation comments. Two (very) minor code refactoring changes.BlackJumboDog: Ver5.7.4: 2012.12.13 Ver5.7.4 (1)Web???????、???????????????????????????????????????????VFPX: ssClasses A1.0: My initial release. See https://vfpx.codeplex.com/wikipage?title=ssClasses&referringTitle=Home for a brief description of what is inside this releasesb0t v.5: sb0t 5.01 alpha 1: GUI support for Spanish language (Thank you Di3go and Oysterhead) Captcha database updates Bug fix where some hosts could not joinHome Access Plus+: v8.6: v8.6.1213.1220 Added: Group look up to the visible property of the Booking System Fixed: Switched to using the outlook/exchange thumbnailPhoto instead of jpegPhoto Added: Add a blank paragraph below the tiles. This means that the browser displays a vertical scroller when resizing the window. Previously it was possible for the bottom edge of a tile not to be visible if the browser window was resized. Added: Booking System: Only Display Day+Month on the booking Home Page. This allows for the cs...Layered Architecture Solution Guidance (LASG): LASG 1.0.0.8 for Visual Studio 2012: PRE-REQUISITES Open GAX (Please install Oct 4, 2012 version) Microsoft® System CLR Types for Microsoft® SQL Server® 2012 Microsoft® SQL Server® 2012 Shared Management Objects Microsoft Enterprise Library 5.0 (for the generated code) Windows Azure SDK (for layered cloud applications) Silverlight 5 SDK (for Silverlight applications) THE RELEASE This release only works on Visual Studio 2012. Known Issue If you choose the Database project, the solution unfolding time will be slow....Fiskalizacija za developere: FiskalizacijaDev 2.0: Prva prava produkcijska verzija - Zakon je tu, ova je verzija uskladena sa trenutno važecom Tehnickom specifikacijom (v1.2. od 04.12.2012.) i spremna je za produkcijsko korištenje. Verzije iza ove ce ovisiti o naknadnim izmjenama Zakona i/ili Tehnicke specifikacije, odnosno, o eventualnim greškama u radu/zahtjevima community-a za novim feature-ima. Novosti u v2.0 su: - That assembly does not allow partially trusted callers (http://fiskalizacija.codeplex.com/workitem/699) - scheme IznosType...Simple Injector: Simple Injector v1.6.1: This patch release fixes a bug in the integration libraries that disallowed the application to start when .NET 4.5 was not installed on the machine (but only .NET 4.0). The following packages are affected: SimpleInjector.Integration.Web.dll SimpleInjector.Integration.Web.Mvc.dll SimpleInjector.Integration.Wcf.dll SimpleInjector.Extensions.LifetimeScoping.dllBootstrap Helpers: Version 1: First releasesheetengine - Isometric HTML5 JavaScript Display Engine: sheetengine v1.2.0: Main featuresOptimizations for intersectionsThe main purpose of this release was to further optimize rendering performance by skipping object intersections with other sheets. From now by default an object's sheets will only intersect its own sheets and never other static or dynamic sheets. This is the usual scenario since objects will never bump into other sheets when using collision detection. DocumentationMany of you have been asking for proper documentation, so here it goes. Check out the...DirectX Tool Kit: December 11, 2012: December 11, 2012 Ex versions of DDSTextureLoader and WICTextureLoader Removed use of ATL's CComPtr in favor of WRL's ComPtr for all platforms to support VS Express editions Updated VS 2010 project for official 'property sheet' integration for Windows 8.0 SDK Minor fix to CommonStates for Feature Level 9.1 Tweaked AlphaTestEffect.cpp to work around ARM NEON compiler codegen bug Added dxguid.lib as a default library for Debug builds to resolve GUID link issuesDNN Flash Viewer: Source (same as default release): SourceArcGIS Editor for OpenStreetMap: ArcGIS Editor for OSM 2.1 Final for 10.1: We are proud to announce the release of ArcGIS Editor for OpenStreetMap version 2.1. This download is compatible with ArcGIS 10.1, and includes setups for the Desktop Component, Desktop Component when 64 bit Background Geoprocessing is installed, and the Server Component. Important: if you already have ArcGIS Editor for OSM installed but want to install this new version, you will need to uninstall your previous version and then install this one. This release includes support for the ArcGIS 1...SharpCompress - a fully native C# library for RAR, 7Zip, Zip, Tar, GZip, BZip2: SharpCompress 0.8.2: This release just contains some fixes that have been done since the last release. Plus, this is strong named as well. I apologize for the lack of updates but my free time is less these days.Media Companion: MediaCompanion3.511b release: Two more bug fixes: - General Preferences were not getting restored - Fanart and poster image files were being locked, preventing changes to themVodigi Open Source Interactive Digital Signage: Vodigi Release 5.5: The following enhancements and fixes are included in Vodigi 5.5. Vodigi Administrator - Manage Music Files - Add Music Files to Image Slide Shows - Manage System Messages - Display System Messages to Users During Login - Ported to Visual Studio 2012 and MVC 4 - Added New Vodigi Administrator User Guide Vodigi Player - Improved Login/Schedule Startup Procedure - Startup Using Last Known Schedule when Disconnected on Startup - Improved Check for Schedule Changes - Now Every 15 Minutes - Pla...VidCoder: 1.4.10 Beta: Added progress percent to the title bar/task bar icon. Added MPLS information to Blu-ray titles. Fixed the following display issues in Windows 8: Uncentered text in textbox controls Disabled controls not having gray text making them hard to identify as disabled Drop-down menus having hard-to distinguish white on light-blue text Added more logging to proxy disconnect issues and increased timeout on initial call to help prevent timeouts. Fixed encoding window showing the built-in pre...WPF Application Framework (WAF): WPF Application Framework (WAF) 2.5.0.400: Version 2.5.0.400 (Release): This release contains the source code of the WPF Application Framework (WAF) and the sample applications. Requirements .NET Framework 4.0 (The package contains a solution file for Visual Studio 2010) The unit test projects require Visual Studio 2010 Professional Changelog Legend: [B] Breaking change; [O] Marked member as obsolete Update the documentation. InfoMan: Write the documentation. Other Downloads Downloads OverviewBee OPOA Platform: Bee OPOA Demo V1.0.001: Initial version.Microsoft Ajax Minifier: Microsoft Ajax Minifier 4.78: Fix for issue #18924 - using -pretty option left in ///#DEBUG blocks. Fix for issue #18980 - bad += optimization caused bug in resulting code. Optimization has been removed pending further review.New Projects.net demo: This is a demo project. 12141325: tesing12141327: testingAW User Applications: This project is used to enable AccessWeb User Applications for Database Application Users. Outline: The degree of professionalism permissions access options OkBattery Life Live Tile: Windows 8 app that displays the battery percentage in a live tile on the homescreen with the help of a system tray application. only available on intel devices.CMS .NET: CMS .net HTML5 multilingual+FORUM+GALLERY+Responsive Web Design+WIKI+COMMUNITY: Easy user-friendly, fast 10X,;multi site in different domain; multi servercookieTerm: A simple BBS terminal that can run in unicode environmentDevCow: This is a location for all of the community projects that help support DevCow.comDevville.NET: The project contains some of a very useful helpers and extension methods for .NET and SharePoint.Echo Garden: Echo Garden is a modification of Alphalabs' Windows Phone app, Node Garden, that represents the nodes through sound.FluentNavigationCoercion: Simple library that makes navigation coercion on WP simple and readableGet Music: Internet radio's parser. Parse radio logs, store received data to store. Make statistic analyze of radio. jKinect - Kinectify any web site.: Provide a unique Kinect User Experience for your website. With jKinect, turn any web site into a Kinect enabled web application.Lagos Single Mothers: This is a web2.0 site that will help single mothers in the city of Lagos (in Nigeria) share ideas on how to raise children, as single mothers. LIF11: Projet logique classiqueMalkiSum: Malki SumMyLittleAdressBook: Just a little project to test MVVM pattern on a multiview solution with authentificationNopCommerce 2.7x Multi Store version: NopCommerce 7.x Multi Store / Vendor versionPodcastToMp3: Automated tool to make MP3s from M4A chapters.Primer Demo ClickOnce: Demo ClickOncePulsus: A simple .NET logging library for modern applications.Sharp6800 - ET-3400 Microprocessor Trainer Emulator: A Heathkit ET-3400 Microprocessor Trainer Emulator. It features a 6800 emulator core and simulated 7-segment display and keypad. Written in 100% C#TEdit: TEdit is a source code editor that mainly used for InfoBasic Programming language. Running in Windows platform. Developed using Scintilla & ScintillaNet.VsPackageUtils: VsPackageUtils is a basic utility helper class for common operations in a VsPackage.WiFiShare: WiFiShare shares your LAN connection to WiFi using Internet Connection Sharing(ICS) from your Windows operating system.Windows 8 RSS App Kit: A Windows 8 App "kit" that allows you to build a fixed-list RSS reader with auto-image-detection in feeds.

    Read the article

  • How do I create my own programming language and a compiler for it

    - by Dave
    I am thorough with programming and have come across languages including BASIC, FORTRAN, COBOL, LISP, LOGO, Java, C++, C, MATLAB, Mathematica, Python, Ruby, Perl, Javascript, Assembly and so on. I can't understand how people create programming languages and devise compilers for it. I also couldn't understand how people create OS like Windows, Mac, UNIX, DOS and so on. The other thing that is mysterious to me is how people create libraries like OpenGL, OpenCL, OpenCV, Cocoa, MFC and so on. The last thing I am unable to figure out is how scientists devise an assembly language and an assembler for a microprocessor. I would really like to learn all of these stuff and I am 15 years old. I always wanted to be a computer scientist some one like Babbage, Turing, Shannon, or Dennis Ritchie. I have already read Aho's Compiler Design and Tanenbaum's OS concepts book and they all only discuss concepts and code in a high level. They don't go into the details and nuances and how to devise a compiler or operating system. I want a concrete understanding so that I can create one myself and not just an understanding of what a thread, semaphore, process, or parsing is. I asked my brother about all this. He is a SB student in EECS at MIT and hasn't got a clue of how to actually create all these stuff in the real world. All he knows is just an understanding of Compiler Design and OS concepts like the ones that you guys have mentioned (ie like Thread, Synchronisation, Concurrency, memory management, Lexical Analysis, Intermediate code generation and so on)

    Read the article

  • Using typedefs (or #defines) on built in types - any sensible reason?

    - by jb
    Well I'm doing some Java - C integration, and throught C library werid type mappings are used (theres more of them;)): #define CHAR char /* 8 bit signed int */ #define SHORT short /* 16 bit signed int */ #define INT int /* "natural" length signed int */ #define LONG long /* 32 bit signed int */ typedef unsigned char BYTE; /* 8 bit unsigned int */ typedef unsigned char UCHAR; /* 8 bit unsigned int */ typedef unsigned short USHORT; /* 16 bit unsigned int */ typedef unsigned int UINT; /* "natural" length unsigned int*/ Is there any legitimate reason not to use them? It's not like char is going to be redefined anytime soon. I can think of: Writing platform/compiler portable code (size of type is underspecified in C/C++) Saving space and time on embedded systems - if you loop over array shorter than 255 on 8bit microprocessor writing: for(uint8_t ii = 0; ii < len; ii++) will give meaureable speedup.

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • Why I&rsquo;m Getting an iPad

    - by andrewbrust
    I have never purchased an Apple product in my life.  That’s a “true fact.”  And, for that matter, the last Apple product I really wanted was an Apple IIe, back in the 1980s.  I couldn’t afford it though (I was in high school), so I got a Commodore 64 instead…it had the same microprocessor, after all.  If the iPhone were on Verizon, I probably would have picked one up in December, when I got my Droid.  And if the iPod Touch worked with my Napster subscription (which of course it does not, but my Sonos does) I might have picked on of those instead. That’s three strikes, but Apple’s not out.  I’ve decided I want the iPad.  Why?  Well, to start with, my birthday is March 31st…the iPad comes out on April 3rd, and my wife wanted to know what to get me.  Also, my house is a 7-minute walk from the Apple Store on West 14th Street in Manhattan.  This makes it easy to get my pre-ordered device on launch day, and get home quickly with it.  Oh, and I agreed to write an article for Redmond Magazine, the fee for which will pay for the device…that way the birthday present doesn’t have to be an extravagant expense.  Plus, I’m a contrarian, so I want to buy the one device from Apple that the fanboys have actually panned. Think those are bad reasons? How about this: I want to experience iPhone and iPad development and, although my app will probably never hit the App Store and run on the actual device, I still think owning one will help me develop something better.  i want to see if the slate form factor has good business usage scenarios.  I want to see if Business Intelligence technology on a device like this can work.  Imagine a dashboard on this thing. And, for the consumer experience, I really want a touch device on which I can surf the Web while I’m in the kitchen, or on the couch.  I don’t want the small form factor of my phone, I don’t want to use my TV, and I don’t want a keyboard that will get dirty or in my way. I don’t want to watch movies on it (my TV is good for that), so I don’t care that the iPad has a 4:3 screen.  I don’t want to read books on it, so I don’t care that the display is backlit LCD, rather than eInk. But really what I want is to understand, first hand, why people have such brand loyalty to Apple.  I know the big reasons; I’m not detached from society.  But I want to know the subtle points of what Apple does really well, and also what they do poorly.  And I’d like to know, once and for all, if Microsoft can beat Apple, if Microsoft can think the right way to beat Apple and if Microsoft should  even try to beat Apple. I expect to share my thoughts on these questions, as they develop.

    Read the article

1 2  | Next Page >