Search Results

Search found 10 results on 1 pages for 'danx'.

Page 1/1 | 1

Toorcon14

- by danx

Toorcon 2012 Information Security Conference San Diego, CA, http://www.toorcon.org/ Dan Anderson, October 2012 It's almost Halloween, and we all know what that means—yes, of course, it's time for another Toorcon Conference! Toorcon is an annual conference for people interested in computer security. This includes the whole range of hackers, computer hobbyists, professionals, security consultants, press, law enforcement, prosecutors, FBI, etc. We're at Toorcon 14—see earlier blogs for some of the previous Toorcon's I've attended (back to 2003). This year's "con" was held at the Westin on Broadway in downtown San Diego, California. The following are not necessarily my views—I'm just the messenger—although I could have misquoted or misparaphrased the speakers. Also, I only reviewed some of the talks, below, which I attended and interested me. MalAndroid—the Crux of Android Infections, Aditya K. Sood Programming Weird Machines with ELF Metadata, Rebecca "bx" Shapiro Privacy at the Handset: New FCC Rules?, Valkyrie Hacking Measured Boot and UEFI, Dan Griffin You Can't Buy Security: Building the Open Source InfoSec Program, Boris Sverdlik What Journalists Want: The Investigative Reporters' Perspective on Hacking, Dave Maas & Jason Leopold Accessibility and Security, Anna Shubina Stop Patching, for Stronger PCI Compliance, Adam Brand McAfee Secure & Trustmarks — a Hacker's Best Friend, Jay James & Shane MacDougall MalAndroid—the Crux of Android Infections Aditya K. Sood, IOActive, Michigan State PhD candidate Aditya talked about Android smartphone malware. There's a lot of old Android software out there—over 50% Gingerbread (2.3.x)—and most have unpatched vulnerabilities. Of 9 Android vulnerabilities, 8 have known exploits (such as the old Gingerbread Global Object Table exploit). Android protection includes sandboxing, security scanner, app permissions, and screened Android app market. The Android permission checker has fine-grain resource control, policy enforcement. Android static analysis also includes a static analysis app checker (bouncer), and a vulnerablity checker. What security problems does Android have? User-centric security, which depends on the user to grant permission and make smart decisions. But users don't care or think about malware (the're not aware, not paranoid). All they want is functionality, extensibility, mobility Android had no "proper" encryption before Android 3.0 No built-in protection against social engineering and web tricks Alternative Android app markets are unsafe. Simply visiting some markets can infect Android Aditya classified Android Malware types as: Type A—Apps. These interact with the Android app framework. For example, a fake Netflix app. Or Android Gold Dream (game), which uploads user files stealthy manner to a remote location. Type K—Kernel. Exploits underlying Linux libraries or kernel Type H—Hybrid. These use multiple layers (app framework, libraries, kernel). These are most commonly used by Android botnets, which are popular with Chinese botnet authors What are the threats from Android malware? These incude leak info (contacts), banking fraud, corporate network attacks, malware advertising, malware "Hackivism" (the promotion of social causes. For example, promiting specific leaders of the Tunisian or Iranian revolutions. Android malware is frequently "masquerated". That is, repackaged inside a legit app with malware. To avoid detection, the hidden malware is not unwrapped until runtime. The malware payload can be hidden in, for example, PNG files. Less common are Android bootkits—there's not many around. What they do is hijack the Android init framework—alteering system programs and daemons, then deletes itself. For example, the DKF Bootkit (China). Android App Problems: no code signing! all self-signed native code execution permission sandbox — all or none alternate market places no robust Android malware detection at network level delayed patch process Programming Weird Machines with ELF Metadata Rebecca "bx" Shapiro, Dartmouth College, NH https://github.com/bx/elf-bf-tools @bxsays on twitter Definitions. "ELF" is an executable file format used in linking and loading executables (on UNIX/Linux-class machines). "Weird machine" uses undocumented computation sources (I think of them as unintended virtual machines). Some examples of "weird machines" are those that: return to weird location, does SQL injection, corrupts the heap. Bx then talked about using ELF metadata as (an uintended) "weird machine". Some ELF background: A compiler takes source code and generates a ELF object file (hello.o). A static linker makes an ELF executable from the object file. A runtime linker and loader takes ELF executable and loads and relocates it in memory. The ELF file has symbols to relocate functions and variables. ELF has two relocation tables—one at link time and another one at loading time: .rela.dyn (link time) and .dynsym (dynamic table). GOT: Global Offset Table of addresses for dynamically-linked functions. PLT: Procedure Linkage Tables—works with GOT. The memory layout of a process (not the ELF file) is, in order: program (+ heap), dynamic libraries, libc, ld.so, stack (which includes the dynamic table loaded into memory) For ELF, the "weird machine" is found and exploited in the loader. ELF can be crafted for executing viruses, by tricking runtime into executing interpreted "code" in the ELF symbol table. One can inject parasitic "code" without modifying the actual ELF code portions. Think of the ELF symbol table as an "assembly language" interpreter. It has these elements: instructions: Add, move, jump if not 0 (jnz) Think of symbol table entries as "registers" symbol table value is "contents" immediate values are constants direct values are addresses (e.g., 0xdeadbeef) move instruction: is a relocation table entry add instruction: relocation table "addend" entry jnz instruction: takes multiple relocation table entries The ELF weird machine exploits the loader by relocating relocation table entries. The loader will go on forever until told to stop. It stores state on stack at "end" and uses IFUNC table entries (containing function pointer address). The ELF weird machine, called "Brainfu*k" (BF) has: 8 instructions: pointer inc, dec, inc indirect, dec indirect, jump forward, jump backward, print. Three registers - 3 registers Bx showed example BF source code that implemented a Turing machine printing "hello, world". More interesting was the next demo, where bx modified ping. Ping runs suid as root, but quickly drops privilege. BF modified the loader to disable the library function call dropping privilege, so it remained as root. Then BF modified the ping -t argument to execute the -t filename as root. It's best to show what this modified ping does with an example: $ whoami bx $ ping localhost -t backdoor.sh # executes backdoor $ whoami root $ The modified code increased from 285948 bytes to 290209 bytes. A BF tool compiles "executable" by modifying the symbol table in an existing ELF executable. The tool modifies .dynsym and .rela.dyn table, but not code or data. Privacy at the Handset: New FCC Rules? "Valkyrie" (Christie Dudley, Santa Clara Law JD candidate) Valkyrie talked about mobile handset privacy. Some background: Senator Franken (also a comedian) became alarmed about CarrierIQ, where the carriers track their customers. Franken asked the FCC to find out what obligations carriers think they have to protect privacy. The carriers' response was that they are doing just fine with self-regulation—no worries! Carriers need to collect data, such as missed calls, to maintain network quality. But carriers also sell data for marketing. Verizon sells customer data and enables this with a narrow privacy policy (only 1 month to opt out, with difficulties). The data sold is not individually identifiable and is aggregated. But Verizon recommends, as an aggregation workaround to "recollate" data to other databases to identify customers indirectly. The FCC has regulated telephone privacy since 1934 and mobile network privacy since 2007. Also, the carriers say mobile phone privacy is a FTC responsibility (not FCC). FTC is trying to improve mobile app privacy, but FTC has no authority over carrier / customer relationships. As a side note, Apple iPhones are unique as carriers have extra control over iPhones they don't have with other smartphones. As a result iPhones may be more regulated. Who are the consumer advocates? Everyone knows EFF, but EPIC (Electrnic Privacy Info Center), although more obsecure, is more relevant. What to do? Carriers must be accountable. Opt-in and opt-out at any time. Carriers need incentive to grant users control for those who want it, by holding them liable and responsible for breeches on their clock. Location information should be added current CPNI privacy protection, and require "Pen/trap" judicial order to obtain (and would still be a lower standard than 4th Amendment). Politics are on a pro-privacy swing now, with many senators and the Whitehouse. There will probably be new regulation soon, and enforcement will be a problem, but consumers will still have some benefit. Hacking Measured Boot and UEFI Dan Griffin, JWSecure, Inc., Seattle, @JWSdan Dan talked about hacking measured UEFI boot. First some terms: UEFI is a boot technology that is replacing BIOS (has whitelisting and blacklisting). UEFI protects devices against rootkits. TPM - hardware security device to store hashs and hardware-protected keys "secure boot" can control at firmware level what boot images can boot "measured boot" OS feature that tracks hashes (from BIOS, boot loader, krnel, early drivers). "remote attestation" allows remote validation and control based on policy on a remote attestation server. Microsoft pushing TPM (Windows 8 required), but Google is not. Intel TianoCore is the only open source for UEFI. Dan has Measured Boot Tool at http://mbt.codeplex.com/ with a demo where you can also view TPM data. TPM support already on enterprise-class machines. UEFI Weaknesses. UEFI toolkits are evolving rapidly, but UEFI has weaknesses: assume user is an ally trust TPM implicitly, and attached to computer hibernate file is unprotected (disk encryption protects against this) protection migrating from hardware to firmware delays in patching and whitelist updates will UEFI really be adopted by the mainstream (smartphone hardware support, bank support, apathetic consumer support) You Can't Buy Security: Building the Open Source InfoSec Program Boris Sverdlik, ISDPodcast.com co-host Boris talked about problems typical with current security audits. "IT Security" is an oxymoron—IT exists to enable buiness, uptime, utilization, reporting, but don't care about security—IT has conflict of interest. There's no Magic Bullet ("blinky box"), no one-size-fits-all solution (e.g., Intrusion Detection Systems (IDSs)). Regulations don't make you secure. The cloud is not secure (because of shared data and admin access). Defense and pen testing is not sexy. Auditors are not solution (security not a checklist)—what's needed is experience and adaptability—need soft skills. Step 1: First thing is to Google and learn the company end-to-end before you start. Get to know the management team (not IT team), meet as many people as you can. Don't use arbitrary values such as CISSP scores. Quantitive risk assessment is a myth (e.g. AV*EF-SLE). Learn different Business Units, legal/regulatory obligations, learn the business and where the money is made, verify company is protected from script kiddies (easy), learn sensitive information (IP, internal use only), and start with low-hanging fruit (customer service reps and social engineering). Step 2: Policies. Keep policies short and relevant. Generic SANS "security" boilerplate policies don't make sense and are not followed. Focus on acceptable use, data usage, communications, physical security. Step 3: Implementation: keep it simple stupid. Open source, although useful, is not free (implementation cost). Access controls with authentication & authorization for local and remote access. MS Windows has it, otherwise use OpenLDAP, OpenIAM, etc. Application security Everyone tries to reinvent the wheel—use existing static analysis tools. Review high-risk apps and major revisions. Don't run different risk level apps on same system. Assume host/client compromised and use app-level security control. Network security VLAN != segregated because there's too many workarounds. Use explicit firwall rules, active and passive network monitoring (snort is free), disallow end user access to production environment, have a proxy instead of direct Internet access. Also, SSL certificates are not good two-factor auth and SSL does not mean "safe." Operational Controls Have change, patch, asset, & vulnerability management (OSSI is free). For change management, always review code before pushing to production For logging, have centralized security logging for business-critical systems, separate security logging from administrative/IT logging, and lock down log (as it has everything). Monitor with OSSIM (open source). Use intrusion detection, but not just to fulfill a checkbox: build rules from a whitelist perspective (snort). OSSEC has 95% of what you need. Vulnerability management is a QA function when done right: OpenVas and Seccubus are free. Security awareness The reality is users will always click everything. Build real awareness, not compliance driven checkbox, and have it integrated into the culture. Pen test by crowd sourcing—test with logging COSSP http://www.cossp.org/ - Comprehensive Open Source Security Project What Journalists Want: The Investigative Reporters' Perspective on Hacking Dave Maas, San Diego CityBeat Jason Leopold, Truthout.org The difference between hackers and investigative journalists: For hackers, the motivation varies, but method is same, technological specialties. For investigative journalists, it's about one thing—The Story, and they need broad info-gathering skills. J-School in 60 Seconds: Generic formula: Person or issue of pubic interest, new info, or angle. Generic criteria: proximity, prominence, timeliness, human interest, oddity, or consequence. Media awareness of hackers and trends: journalists becoming extremely aware of hackers with congressional debates (privacy, data breaches), demand for data-mining Journalists, use of coding and web development for Journalists, and Journalists busted for hacking (Murdock). Info gathering by investigative journalists include Public records laws. Federal Freedom of Information Act (FOIA) is good, but slow. California Public Records Act is a lot stronger. FOIA takes forever because of foot-dragging—it helps to be specific. Often need to sue (especially FBI). CPRA is faster, and requests can be vague. Dumps and leaks (a la Wikileaks) Journalists want: leads, protecting ourselves, our sources, and adapting tools for news gathering (Google hacking). Anonomity is important to whistleblowers. They want no digital footprint left behind (e.g., email, web log). They don't trust encryption, want to feel safe and secure. Whistleblower laws are very weak—there's no upside for whistleblowers—they have to be very passionate to do it. Accessibility and Security or: How I Learned to Stop Worrying and Love the Halting Problem Anna Shubina, Dartmouth College Anna talked about how accessibility and security are related. Accessibility of digital content (not real world accessibility). mostly refers to blind users and screenreaders, for our purpose. Accessibility is about parsing documents, as are many security issues. "Rich" executable content causes accessibility to fail, and often causes security to fail. For example MS Word has executable format—it's not a document exchange format—more dangerous than PDF or HTML. Accessibility is often the first and maybe only sanity check with parsing. They have no choice because someone may want to read what you write. Google, for example, is very particular about web browser you use and are bad at supporting other browsers. Uses JavaScript instead of links, often requiring mouseover to display content. PDF is a security nightmare. Executible format, embedded flash, JavaScript, etc. 15 million lines of code. Google Chrome doesn't handle PDF correctly, causing several security bugs. PDF has an accessibility checker and PDF tagging, to help with accessibility. But no PDF checker checks for incorrect tags, untagged content, or validates lists or tables. None check executable content at all. The "Halting Problem" is: can one decide whether a program will ever stop? The answer, in general, is no (Rice's theorem). The same holds true for accessibility checkers. Language-theoretic Security says complicated data formats are hard to parse and cannot be solved due to the Halting Problem. W3C Web Accessibility Guidelines: "Perceivable, Operable, Understandable, Robust" Not much help though, except for "Robust", but here's some gems: * all information should be parsable (paraphrasing) * if not parsable, cannot be converted to alternate formats * maximize compatibility in new document formats Executible webpages are bad for security and accessibility. They say it's for a better web experience. But is it necessary to stuff web pages with JavaScript for a better experience? A good example is The Drudge Report—it has hand-written HTML with no JavaScript, yet drives a lot of web traffic due to good content. A bad example is Google News—hidden scrollbars, guessing user input. Solutions: Accessibility and security problems come from same source Expose "better user experience" myth Keep your corner of Internet parsable Remember "Halting Problem"—recognize false solutions (checking and verifying tools) Stop Patching, for Stronger PCI Compliance Adam Brand, protiviti @adamrbrand, http://www.picfun.com/ Adam talked about PCI compliance for retail sales. Take an example: for PCI compliance, 50% of Brian's time (a IT guy), 960 hours/year was spent patching POSs in 850 restaurants. Often applying some patches make no sense (like fixing a browser vulnerability on a server). "Scanner worship" is overuse of vulnerability scanners—it gives a warm and fuzzy and it's simple (red or green results—fix reds). Scanners give a false sense of security. In reality, breeches from missing patches are uncommon—more common problems are: default passwords, cleartext authentication, misconfiguration (firewall ports open). Patching Myths: Myth 1: install within 30 days of patch release (but PCI Â§6.1 allows a "risk-based approach" instead). Myth 2: vendor decides what's critical (also PCI Â§6.1). But Â§6.2 requires user ranking of vulnerabilities instead. Myth 3: scan and rescan until it passes. But PCI Â§11.2.1b says this applies only to high-risk vulnerabilities. Adam says good recommendations come from NIST 800-40. Instead use sane patching and focus on what's really important. From NIST 800-40: Proactive: Use a proactive vulnerability management process: use change control, configuration management, monitor file integrity. Monitor: start with NVD and other vulnerability alerts, not scanner results. Evaluate: public-facing system? workstation? internal server? (risk rank) Decide:on action and timeline Test: pre-test patches (stability, functionality, rollback) for change control Install: notify, change control, tickets McAfee Secure & Trustmarks — a Hacker's Best Friend Jay James, Shane MacDougall, Tactical Intelligence Inc., Canada "McAfee Secure Trustmark" is a website seal marketed by McAfee. A website gets this badge if they pass their remote scanning. The problem is a removal of trustmarks act as flags that you're vulnerable. Easy to view status change by viewing McAfee list on website or on Google. "Secure TrustGuard" is similar to McAfee. Jay and Shane wrote Perl scripts to gather sites from McAfee and search engines. If their certification image changes to a 1x1 pixel image, then they are longer certified. Their scripts take deltas of scans to see what changed daily. The bottom line is change in TrustGuard status is a flag for hackers to attack your site. Entire idea of seals is silly—you're raising a flag saying if you're vulnerable.

Read the article
Toorcon 15 (2013)

- by danx

The Toorcon gang (senior staff): h1kari (founder), nfiltr8, and Geo Introduction to Toorcon 15 (2013) A Tale of One Software Bypass of MS Windows 8 Secure Boot Breaching SSL, One Byte at a Time Running at 99%: Surviving an Application DoS Security Response in the Age of Mass Customized Attacks x86 Rewriting: Defeating RoP and other Shinanighans Clowntown Express: interesting bugs and running a bug bounty program Active Fingerprinting of Encrypted VPNs Making Attacks Go Backwards Mask Your Checksums—The Gorry Details Adventures with weird machines thirty years after "Reflections on Trusting Trust" Introduction to Toorcon 15 (2013) Toorcon 15 is the 15th annual security conference held in San Diego. I've attended about a third of them and blogged about previous conferences I attended here starting in 2003. As always, I've only summarized the talks I attended and interested me enough to write about them. Be aware that I may have misrepresented the speaker's remarks and that they are not my remarks or opinion, or those of my employer, so don't quote me or them. Those seeking further details may contact the speakers directly or use The Google. For some talks, I have a URL for further information. A Tale of One Software Bypass of MS Windows 8 Secure Boot Andrew Furtak and Oleksandr Bazhaniuk Yuri Bulygin, Oleksandr ("Alex") Bazhaniuk, and (not present) Andrew Furtak Yuri and Alex talked about UEFI and Bootkits and bypassing MS Windows 8 Secure Boot, with vendor recommendations. They previously gave this talk at the BlackHat 2013 conference. MS Windows 8 Secure Boot Overview UEFI (Unified Extensible Firmware Interface) is interface between hardware and OS. UEFI is processor and architecture independent. Malware can replace bootloader (bootx64.efi, bootmgfw.efi). Once replaced can modify kernel. Trivial to replace bootloader. Today many legacy bootkits—UEFI replaces them most of them. MS Windows 8 Secure Boot verifies everything you load, either through signatures or hashes. UEFI firmware relies on secure update (with signed update). You would think Secure Boot would rely on ROM (such as used for phones0, but you can't do that for PCs—PCs use writable memory with signatures DXE core verifies the UEFI boat loader(s) OS Loader (winload.efi, winresume.efi) verifies the OS kernel A chain of trust is established with a root key (Platform Key, PK), which is a cert belonging to the platform vendor. Key Exchange Keys (KEKs) verify an "authorized" database (db), and "forbidden" database (dbx). X.509 certs with SHA-1/SHA-256 hashes. Keys are stored in non-volatile (NV) flash-based NVRAM. Boot Services (BS) allow adding/deleting keys (can't be accessed once OS starts—which uses Run-Time (RT)). Root cert uses RSA-2048 public keys and PKCS#7 format signatures. SecureBoot — enable disable image signature checks SetupMode — update keys, self-signed keys, and secure boot variables CustomMode — allows updating keys Secure Boot policy settings are: always execute, never execute, allow execute on security violation, defer execute on security violation, deny execute on security violation, query user on security violation Attacking MS Windows 8 Secure Boot Secure Boot does NOT protect from physical access. Can disable from console. Each BIOS vendor implements Secure Boot differently. There are several platform and BIOS vendors. It becomes a "zoo" of implementations—which can be taken advantage of. Secure Boot is secure only when all vendors implement it correctly. Allow only UEFI firmware signed updates protect UEFI firmware from direct modification in flash memory protect FW update components program SPI controller securely protect secure boot policy settings in nvram protect runtime api disable compatibility support module which allows unsigned legacy Can corrupt the Platform Key (PK) EFI root certificate variable in SPI flash. If PK is not found, FW enters setup mode wich secure boot turned off. Can also exploit TPM in a similar manner. One is not supposed to be able to directly modify the PK in SPI flash from the OS though. But they found a bug that they can exploit from User Mode (undisclosed) and demoed the exploit. It loaded and ran their own bootkit. The exploit requires a reboot. Multiple vendors are vulnerable. They will disclose this exploit to vendors in the future. Recommendations: allow only signed updates protect UEFI fw in ROM protect EFI variable store in ROM Breaching SSL, One Byte at a Time Yoel Gluck and Angelo Prado Angelo Prado and Yoel Gluck, Salesforce.com CRIME is software that performs a "compression oracle attack." This is possible because the SSL protocol doesn't hide length, and because SSL compresses the header. CRIME requests with every possible character and measures the ciphertext length. Look for the plaintext which compresses the most and looks for the cookie one byte-at-a-time. SSL Compression uses LZ77 to reduce redundancy. Huffman coding replaces common byte sequences with shorter codes. US CERT thinks the SSL compression problem is fixed, but it isn't. They convinced CERT that it wasn't fixed and they issued a CVE. BREACH, breachattrack.com BREACH exploits the SSL response body (Accept-Encoding response, Content-Encoding). It takes advantage of the fact that the response is not compressed. BREACH uses gzip and needs fairly "stable" pages that are static for ~30 seconds. It needs attacker-supplied content (say from a web form or added to a URL parameter). BREACH listens to a session's requests and responses, then inserts extra requests and responses. Eventually, BREACH guesses a session's secret key. Can use compression to guess contents one byte at-a-time. For example, "Supersecret SupersecreX" (a wrong guess) compresses 10 bytes, and "Supersecret Supersecret" (a correct guess) compresses 11 bytes, so it can find each character by guessing every character. To start the guess, BREACH needs at least three known initial characters in the response sequence. Compression length then "leaks" information. Some roadblocks include no winners (all guesses wrong) or too many winners (multiple possibilities that compress the same). The solutions include: lookahead (guess 2 or 3 characters at-a-time instead of 1 character). Expensive rollback to last known conflict check compression ratio can brute-force first 3 "bootstrap" characters, if needed (expensive) block ciphers hide exact plain text length. Solution is to align response in advance to block size Mitigations length: use variable padding secrets: dynamic CSRF tokens per request secret: change over time separate secret to input-less servlets Future work eiter understand DEFLATE/GZIP HTTPS extensions Running at 99%: Surviving an Application DoS Ryan Huber Ryan Huber, Risk I/O Ryan first discussed various ways to do a denial of service (DoS) attack against web services. One usual method is to find a slow web page and do several wgets. Or download large files. Apache is not well suited at handling a large number of connections, but one can put something in front of it Can use Apache alternatives, such as nginx How to identify malicious hosts short, sudden web requests user-agent is obvious (curl, python) same url requested repeatedly no web page referer (not normal) hidden links. hide a link and see if a bot gets it restricted access if not your geo IP (unless the website is global) missing common headers in request regular timing first seen IP at beginning of attack count requests per hosts (usually a very large number) Use of captcha can mitigate attacks, but you'll lose a lot of genuine users. Bouncer, goo.gl/c2vyEc and www.github.com/rawdigits/Bouncer Bouncer is software written by Ryan in netflow. Bouncer has a small, unobtrusive footprint and detects DoS attempts. It closes blacklisted sockets immediately (not nice about it, no proper close connection). Aggregator collects requests and controls your web proxies. Need NTP on the front end web servers for clean data for use by bouncer. Bouncer is also useful for a popularity storm ("Slashdotting") and scraper storms. Future features: gzip collection data, documentation, consumer library, multitask, logging destroyed connections. Takeaways: DoS mitigation is easier with a complete picture Bouncer designed to make it easier to detect and defend DoS—not a complete cure Security Response in the Age of Mass Customized Attacks Peleus Uhley and Karthik Raman Peleus Uhley and Karthik Raman, Adobe ASSET, blogs.adobe.com/asset/ Peleus and Karthik talked about response to mass-customized exploits. Attackers behave much like a business. "Mass customization" refers to concept discussed in the book Future Perfect by Stan Davis of Harvard Business School. Mass customization is differentiating a product for an individual customer, but at a mass production price. For example, the same individual with a debit card receives basically the same customized ATM experience around the world. Or designing your own PC from commodity parts. Exploit kits are another example of mass customization. The kits support multiple browsers and plugins, allows new modules. Exploit kits are cheap and customizable. Organized gangs use exploit kits. A group at Berkeley looked at 77,000 malicious websites (Grier et al., "Manufacturing Compromise: The Emergence of Exploit-as-a-Service", 2012). They found 10,000 distinct binaries among them, but derived from only a dozen or so exploit kits. Characteristics of Mass Malware: potent, resilient, relatively low cost Technical characteristics: multiple OS, multipe payloads, multiple scenarios, multiple languages, obfuscation Response time for 0-day exploits has gone down from ~40 days 5 years ago to about ~10 days now. So the drive with malware is towards mass customized exploits, to avoid detection There's plenty of evicence that exploit development has Project Manager bureaucracy. They infer from the malware edicts to: support all versions of reader support all versions of windows support all versions of flash support all browsers write large complex, difficult to main code (8750 lines of JavaScript for example Exploits have "loose coupling" of multipe versions of software (adobe), OS, and browser. This allows specific attacks against specific versions of multiple pieces of software. Also allows exploits of more obscure software/OS/browsers and obscure versions. Gave examples of exploits that exploited 2, 3, 6, or 14 separate bugs. However, these complete exploits are more likely to be buggy or fragile in themselves and easier to defeat. Future research includes normalizing malware and Javascript. Conclusion: The coming trend is that mass-malware with mass zero-day attacks will result in mass customization of attacks. x86 Rewriting: Defeating RoP and other Shinanighans Richard Wartell Richard Wartell The attack vector we are addressing here is: First some malware causes a buffer overflow. The malware has no program access, but input access and buffer overflow code onto stack Later the stack became non-executable. The workaround malware used was to write a bogus return address to the stack jumping to malware Later came ASLR (Address Space Layout Randomization) to randomize memory layout and make addresses non-deterministic. The workaround malware used was to jump t existing code segments in the program that can be used in bad ways "RoP" is Return-oriented Programming attacks. RoP attacks use your own code and write return address on stack to (existing) expoitable code found in program ("gadgets"). Pinkie Pie was paid $60K last year for a RoP attack. One solution is using anti-RoP compilers that compile source code with NO return instructions. ASLR does not randomize address space, just "gadgets". IPR/ILR ("Instruction Location Randomization") randomizes each instruction with a virtual machine. Richard's goal was to randomize a binary with no source code access. He created "STIR" (Self-Transofrming Instruction Relocation). STIR disassembles binary and operates on "basic blocks" of code. The STIR disassembler is conservative in what to disassemble. Each basic block is moved to a random location in memory. Next, STIR writes new code sections with copies of "basic blocks" of code in randomized locations. The old code is copied and rewritten with jumps to new code. the original code sections in the file is marked non-executible. STIR has better entropy than ASLR in location of code. Makes brute force attacks much harder. STIR runs on MS Windows (PEM) and Linux (ELF). It eliminated 99.96% or more "gadgets" (i.e., moved the address). Overhead usually 5-10% on MS Windows, about 1.5-4% on Linux (but some code actually runs faster!). The unique thing about STIR is it requires no source access and the modified binary fully works! Current work is to rewrite code to enforce security policies. For example, don't create a *.{exe,msi,bat} file. Or don't connect to the network after reading from the disk. Clowntown Express: interesting bugs and running a bug bounty program Collin Greene Collin Greene, Facebook Collin talked about Facebook's bug bounty program. Background at FB: FB has good security frameworks, such as security teams, external audits, and cc'ing on diffs. But there's lots of "deep, dark, forgotten" parts of legacy FB code. Collin gave several examples of bountied bugs. Some bounty submissions were on software purchased from a third-party (but bounty claimers don't know and don't care). We use security questions, as does everyone else, but they are basically insecure (often easily discoverable). Collin didn't expect many bugs from the bounty program, but they ended getting 20+ good bugs in first 24 hours and good submissions continue to come in. Bug bounties bring people in with different perspectives, and are paid only for success. Bug bounty is a better use of a fixed amount of time and money versus just code review or static code analysis. The Bounty program started July 2011 and paid out $1.5 million to date. 14% of the submissions have been high priority problems that needed to be fixed immediately. The best bugs come from a small % of submitters (as with everything else)—the top paid submitters are paid 6 figures a year. Spammers like to backstab competitors. The youngest sumitter was 13. Some submitters have been hired. Bug bounties also allows to see bugs that were missed by tools or reviews, allowing improvement in the process. Bug bounties might not work for traditional software companies where the product has release cycle or is not on Internet. Active Fingerprinting of Encrypted VPNs Anna Shubina Anna Shubina, Dartmouth Institute for Security, Technology, and Society (I missed the start of her talk because another track went overtime. But I have the DVD of the talk, so I'll expand later) IPsec leaves fingerprints. Using netcat, one can easily visually distinguish various crypto chaining modes just from packet timing on a chart (example, DES-CBC versus AES-CBC) One can tell a lot about VPNs just from ping roundtrips (such as what router is used) Delayed packets are not informative about a network, especially if far away from the network More needed to explore about how TCP works in real life with respect to timing Making Attacks Go Backwards Fuzzynop FuzzyNop, Mandiant This talk is not about threat attribution (finding who), product solutions, politics, or sales pitches. But who are making these malware threats? It's not a single person or group—they have diverse skill levels. There's a lot of fat-fingered fumblers out there. Always look for low-hanging fruit first: "hiding" malware in the temp, recycle, or root directories creation of unnamed scheduled tasks obvious names of files and syscalls ("ClearEventLog") uncleared event logs. Clearing event log in itself, and time of clearing, is a red flag and good first clue to look for on a suspect system Reverse engineering is hard. Disassembler use takes practice and skill. A popular tool is IDA Pro, but it takes multiple interactive iterations to get a clean disassembly. Key loggers are used a lot in targeted attacks. They are typically custom code or built in a backdoor. A big tip-off is that non-printable characters need to be printed out (such as "[Ctrl]" "[RightShift]") or time stamp printf strings. Look for these in files. Presence is not proof they are used. Absence is not proof they are not used. Java exploits. Can parse jar file with idxparser.py and decomile Java file. Java typially used to target tech companies. Backdoors are the main persistence mechanism (provided externally) for malware. Also malware typically needs command and control. Application of Artificial Intelligence in Ad-Hoc Static Code Analysis John Ashaman John Ashaman, Security Innovation Initially John tried to analyze open source files with open source static analysis tools, but these showed thousands of false positives. Also tried using grep, but tis fails to find anything even mildly complex. So next John decided to write his own tool. His approach was to first generate a call graph then analyze the graph. However, the problem is that making a call graph is really hard. For example, one problem is "evil" coding techniques, such as passing function pointer. First the tool generated an Abstract Syntax Tree (AST) with the nodes created from method declarations and edges created from method use. Then the tool generated a control flow graph with the goal to find a path through the AST (a maze) from source to sink. The algorithm is to look at adjacent nodes to see if any are "scary" (a vulnerability), using heuristics for search order. The tool, called "Scat" (Static Code Analysis Tool), currently looks for C# vulnerabilities and some simple PHP. Later, he plans to add more PHP, then JSP and Java. For more information see his posts in Security Innovation blog and NRefactory on GitHub. Mask Your Checksums—The Gorry Details Eric (XlogicX) Davisson Eric (XlogicX) Davisson Sometimes in emailing or posting TCP/IP packets to analyze problems, you may want to mask the IP address. But to do this correctly, you need to mask the checksum too, or you'll leak information about the IP. Problem reports found in stackoverflow.com, sans.org, and pastebin.org are usually not masked, but a few companies do care. If only the IP is masked, the IP may be guessed from checksum (that is, it leaks data). Other parts of packet may leak more data about the IP. TCP and IP checksums both refer to the same data, so can get more bits of information out of using both checksums than just using one checksum. Also, one can usually determine the OS from the TTL field and ports in a packet header. If we get hundreds of possible results (16x each masked nibble that is unknown), one can do other things to narrow the results, such as look at packet contents for domain or geo information. With hundreds of results, can import as CSV format into a spreadsheet. Can corelate with geo data and see where each possibility is located. Eric then demoed a real email report with a masked IP packet attached. Was able to find the exact IP address, given the geo and university of the sender. Point is if you're going to mask a packet, do it right. Eric wouldn't usually bother, but do it correctly if at all, to not create a false impression of security. Adventures with weird machines thirty years after "Reflections on Trusting Trust" Sergey Bratus Sergey Bratus, Dartmouth College (and Julian Bangert and Rebecca Shapiro, not present) "Reflections on Trusting Trust" refers to Ken Thompson's classic 1984 paper. "You can't trust code that you did not totally create yourself." There's invisible links in the chain-of-trust, such as "well-installed microcode bugs" or in the compiler, and other planted bugs. Thompson showed how a compiler can introduce and propagate bugs in unmodified source. But suppose if there's no bugs and you trust the author, can you trust the code? Hell No! There's too many factors—it's Babylonian in nature. Why not? Well, Input is not well-defined/recognized (code's assumptions about "checked" input will be violated (bug/vunerabiliy). For example, HTML is recursive, but Regex checking is not recursive. Input well-formed but so complex there's no telling what it does For example, ELF file parsing is complex and has multiple ways of parsing. Input is seen differently by different pieces of program or toolchain Any Input is a program input executes on input handlers (drives state changes & transitions) only a well-defined execution model can be trusted (regex/DFA, PDA, CFG) Input handler either is a "recognizer" for the inputs as a well-defined language (see langsec.org) or it's a "virtual machine" for inputs to drive into pwn-age ELF ABI (UNIX/Linux executible file format) case study. Problems can arise from these steps (without planting bugs): compiler linker loader ld.so/rtld relocator DWARF (debugger info) exceptions The problem is you can't really automatically analyze code (it's the "halting problem" and undecidable). Only solution is to freeze code and sign it. But you can't freeze everything! Can't freeze ASLR or loading—must have tables and metadata. Any sufficiently complex input data is the same as VM byte code Example, ELF relocation entries + dynamic symbols == a Turing Complete Machine (TM). @bxsays created a Turing machine in Linux from relocation data (not code) in an ELF file. For more information, see Rebecca "bx" Shapiro's presentation from last year's Toorcon, "Programming Weird Machines with ELF Metadata" @bxsays did same thing with Mach-O bytecode Or a DWARF exception handling data .eh_frame + glibc == Turning Machine X86 MMU (IDT, GDT, TSS): used address translation to create a Turning Machine. Page handler reads and writes (on page fault) memory. Uses a page table, which can be used as Turning Machine byte code. Example on Github using this TM that will fly a glider across the screen Next Sergey talked about "Parser Differentials". That having one input format, but two parsers, will create confusion and opportunity for exploitation. For example, CSRs are parsed during creation by cert requestor and again by another parser at the CA. Another example is ELF—several parsers in OS tool chain, which are all different. Can have two different Program Headers (PHDRs) because ld.so parses multiple PHDRs. The second PHDR can completely transform the executable. This is described in paper in the first issue of International Journal of PoC. Conclusions trusting computers not only about bugs! Bugs are part of a problem, but no by far all of it complex data formats means bugs no "chain of trust" in Babylon! (that is, with parser differentials) we need to squeeze complexity out of data until data stops being "code equivalent" Further information See and langsec.org. USENIX WOOT 2013 (Workshop on Offensive Technologies) for "weird machines" papers and videos.

Read the article
Elfsign Object Signing on Solaris

- by danx

Elfsign Object Signing on Solaris Don't let this happen to you—use elfsign! Solaris elfsign(1) is a command that signs and verifies ELF format executables. That includes not just executable programs (such as ls or cp), but other ELF format files including libraries (such as libnvpair.so) and kernel modules (such as autofs). Elfsign has been available since Solaris 10 and ELF format files distributed with Solaris, since Solaris 10, are signed by either Sun Microsystems or its successor, Oracle Corporation. When an ELF file is signed, elfsign adds a new section the ELF file, .SUNW_signature, that contains a RSA public key signature and other information about the signer. That is, the algorithm used, algorithm OID, signer CN/OU, and time stamp. The signature section can later be verified by elfsign or other software by matching the signature in the file agains the ELF file contents (excluding the signature). ELF executable files may also be signed by a 3rd-party or by the customer. This is useful for verifying the origin and authenticity of executable files installed on a system. The 3rd-party or customer public key certificate should be installed in /etc/certs/ to allow verification by elfsign. For currently-released versions of Solaris, only cryptographic framework plugin libraries are verified by Solaris. However, all ELF files may be verified by the elfsign command at any time. Elfsign Algorithms Elfsign signatures are created by taking a digest of the ELF section contents, then signing the digest with RSA. To verify, one takes a digest of ELF file and compares with the expected digest that's computed from the signature and RSA public key. Originally elfsign took a MD5 digest of a SHA-1 digest of the ELF file sections, then signed the resulting digest with RSA. In Solaris 11.1 then Solaris 11.1 SRU 7 (5/2013), the elfsign crypto algorithms available have been expanded to keep up with evolving cryptography. The following table shows the available elfsign algorithms: Elfsign Algorithm Solaris Release Comments elfsign sign -F rsa_md5_sha1 S10, S11.0, S11.1 Default for S10. Not recommended* elfsign sign -F rsa_sha1 S11.1 Default for S11.1. Not recommended elfsign sign -F rsa_sha256 S11.1 patch SRU7+ Recommended ___ *Most or all CAs do not accept MD5 CSRs and do not issue MD5 certs due to MD5 hash collision problems. RSA Key Length. I recommend using RSA-2048 key length with elfsign is RSA-2048 as the best balance between a long expected "life time", interoperability, and performance. RSA-2048 keys have an expected lifetime through 2030 (and probably beyond). For details, see Recommendation for Key Management: Part 1: General, NIST Publication SP 800-57 part 1 (rev. 3, 7/2012, PDF), tables 2 and 4 (pp. 64, 67). Step 1: create or obtain a key and cert The first step in using elfsign is to obtain a key and cert from a public Certificate Authority (CA), or create your own self-signed key and cert. I'll briefly explain both methods. Obtaining a Certificate from a CA To obtain a cert from a CA, such as Verisign, Thawte, or Go Daddy (to name a few random examples), you create a private key and a Certificate Signing Request (CSR) file and send it to the CA, following the instructions of the CA on their website. They send back a signed public key certificate. The public key cert, along with the private key you created is used by elfsign to sign an ELF file. The public key cert is distributed with the software and is used by elfsign to verify elfsign signatures in ELF files. You need to request a RSA "Class 3 public key certificate", which is used for servers and software signing. Elfsign uses RSA and we recommend RSA-2048 keys. The private key and CSR can be generated with openssl(1) or pktool(1) on Solaris. Here's a simple example that uses pktool to generate a private RSA_2048 key and a CSR for sending to a CA: $ pktool gencsr keystore=file format=pem outcsr=MYCSR.p10 \ subject="CN=canineswworks.com,OU=Canine SW object signing" \ outkey=MYPRIVATEKEY.key $ openssl rsa -noout -text -in MYPRIVATEKEY.key Private-Key: (2048 bit) modulus: 00:d2:ef:42:f2:0b:8c:96:9f:45:32:fc:fe:54:94: . . . [omitted for brevity] . . . c9:c7 publicExponent: 65537 (0x10001) privateExponent: 26:14:fc:49:26:bc:a3:14:ee:31:5e:6b:ac:69:83: . . . [omitted for brevity] . . . 81 prime1: 00:f6:b7:52:73:bc:26:57:26:c8:11:eb:6c:dc:cb: . . . [omitted for brevity] . . . bc:91:d0:40:d6:9d:ac:b5:69 prime2: 00:da:df:3f:56:b2:18:46:e1:89:5b:6c:f1:1a:41: . . . [omitted for brevity] . . . f3:b7:48:de:c3:d9:ce:af:af exponent1: 00:b9:a2:00:11:02:ed:9a:3f:9c:e4:16:ce:c7:67: . . . [omitted for brevity] . . . 55:50:25:70:d3:ca:b9:ab:99 exponent2: 00:c8:fc:f5:57:11:98:85:8e:9a:ea:1f:f2:8f:df: . . . [omitted for brevity] . . . 23:57:0e:4d:b2:a0:12:d2:f5 coefficient: 2f:60:21:cd:dc:52:76:67:1a:d8:75:3e:7f:b0:64: . . . [omitted for brevity] . . . 06:94:56:d8:9d:5c:8e:9b $ openssl req -noout -text -in MYCSR.p10 Certificate Request: Data: Version: 2 (0x2) Subject: OU=Canine SW object signing, CN=canineswworks.com Subject Public Key Info: Public Key Algorithm: rsaEncryption Public-Key: (2048 bit) Modulus: 00:d2:ef:42:f2:0b:8c:96:9f:45:32:fc:fe:54:94: . . . [omitted for brevity] . . . c9:c7 Exponent: 65537 (0x10001) Attributes: Signature Algorithm: sha1WithRSAEncryption b3:e8:30:5b:88:37:68:1c:26:6b:45:af:5e:de:ea:60:87:ea: . . . [omitted for brevity] . . . 06:f9:ed:b4 Secure storage of RSA private key. The private key needs to be protected if the key signing is used for production (as opposed to just testing). That is, protect the key to protect against unauthorized signatures by others. One method is to use a PIN-protected PKCS#11 keystore. The private key you generate should be stored in a secure manner, such as in a PKCS#11 keystore using pktool(1). Otherwise others can sign your signature. Other secure key storage mechanisms include a SCA-6000 crypto card, a USB thumb drive stored in a locked area, a dedicated server with restricted access, Oracle Key Manager (OKM), or some combination of these. I also recommend secure backup of the private key. Here's an example of generating a private key protected in the PKCS#11 keystore, and a CSR. $ pktool setpin # use if PIN not set yet Enter token passphrase: changeme Create new passphrase: Re-enter new passphrase: Passphrase changed. $ pktool gencsr keystore=pkcs11 label=MYPRIVATEKEY \ format=pem outcsr=MYCSR.p10 \ subject="CN=canineswworks.com,OU=Canine SW object signing" $ pktool list keystore=pkcs11 Enter PIN for Sun Software PKCS#11 softtoken: Found 1 asymmetric public keys. Key #1 - RSA public key: MYPRIVATEKEY Here's another example that uses openssl instead of pktool to generate a private key and CSR: $ openssl genrsa -out cert.key 2048 $ openssl req -new -key cert.key -out MYCSR.p10 Self-Signed Cert You can use openssl or pktool to create a private key and a self-signed public key certificate. A self-signed cert is useful for development, testing, and internal use. The private key created should be stored in a secure manner, as mentioned above. The following example creates a private key, MYSELFSIGNED.key, and a public key cert, MYSELFSIGNED.pem, using pktool and displays the contents with the openssl command. $ pktool gencert keystore=file format=pem serial=0xD06F00D lifetime=20-year \ keytype=rsa hash=sha256 outcert=MYSELFSIGNED.pem outkey=MYSELFSIGNED.key \ subject="O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com" $ pktool list keystore=file objtype=cert infile=MYSELFSIGNED.pem Found 1 certificates. 1. (X.509 certificate) Filename: MYSELFSIGNED.pem ID: c8:24:59:08:2b:ae:6e:5c:bc:26:bd:ef:0a:9c:54:de:dd:0f:60:46 Subject: O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com Issuer: O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com Not Before: Oct 17 23:18:00 2013 GMT Not After: Oct 12 23:18:00 2033 GMT Serial: 0xD06F00D0 Signature Algorithm: sha256WithRSAEncryption $ openssl x509 -noout -text -in MYSELFSIGNED.pem Certificate: Data: Version: 3 (0x2) Serial Number: 3496935632 (0xd06f00d0) Signature Algorithm: sha256WithRSAEncryption Issuer: O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com Validity Not Before: Oct 17 23:18:00 2013 GMT Not After : Oct 12 23:18:00 2033 GMT Subject: O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com Subject Public Key Info: Public Key Algorithm: rsaEncryption Public-Key: (2048 bit) Modulus: 00:bb:e8:11:21:d9:4b:88:53:8b:6c:5a:7a:38:8b: . . . [omitted for brevity] . . . bf:77 Exponent: 65537 (0x10001) Signature Algorithm: sha256WithRSAEncryption 9e:39:fe:c8:44:5c:87:2c:8f:f4:24:f6:0c:9a:2f:64:84:d1: . . . [omitted for brevity] . . . 5f:78:8e:e8 $ openssl rsa -noout -text -in MYSELFSIGNED.key Private-Key: (2048 bit) modulus: 00:bb:e8:11:21:d9:4b:88:53:8b:6c:5a:7a:38:8b: . . . [omitted for brevity] . . . bf:77 publicExponent: 65537 (0x10001) privateExponent: 0a:06:0f:23:e7:1b:88:62:2c:85:d3:2d:c1:e6:6e: . . . [omitted for brevity] . . . 9c:e1:e0:0a:52:77:29:4a:75:aa:02:d8:af:53:24: c1 prime1: 00:ea:12:02:bb:5a:0f:5a:d8:a9:95:b2:ba:30:15: . . . [omitted for brevity] . . . 5b:ca:9c:7c:19:48:77:1e:5d prime2: 00:cd:82:da:84:71:1d:18:52:cb:c6:4d:74:14:be: . . . [omitted for brevity] . . . 5f:db:d5:5e:47:89:a7:ef:e3 exponent1: 32:37:62:f6:a6:bf:9c:91:d6:f0:12:c3:f7:04:e9: . . . [omitted for brevity] . . . 97:3e:33:31:89:66:64:d1 exponent2: 00:88:a2:e8:90:47:f8:75:34:8f:41:50:3b:ce:93: . . . [omitted for brevity] . . . ff:74:d4:be:f3:47:45:bd:cb coefficient: 4d:7c:09:4c:34:73:c4:26:f0:58:f5:e1:45:3c:af: . . . [omitted for brevity] . . . af:01:5f:af:ad:6a:09:bf Step 2: Sign the ELF File object By now you should have your private key, and obtained, by hook or crook, a cert (either from a CA or use one you created (a self-signed cert). The next step is to sign one or more objects with your private key and cert. Here's a simple example that creates an object file, signs, verifies, and lists the contents of the ELF signature. $ echo '#include <stdio.h>\nint main(){printf("Hello\\n");}'>hello.c $ make hello cc -o hello hello.c $ elfsign verify -v -c MYSELFSIGNED.pem -e hello elfsign: no signature found in hello. $ elfsign sign -F rsa_sha256 -v -k MYSELFSIGNED.key -c MYSELFSIGNED.pem -e hello elfsign: hello signed successfully. format: rsa_sha256. signer: O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com. signed on: October 17, 2013 04:22:49 PM PDT. $ elfsign list -f format -e hello rsa_sha256 $ elfsign list -f signer -e hello O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com $ elfsign list -f time -e hello October 17, 2013 04:22:49 PM PDT $ elfsign verify -v -c MYSELFSIGNED.key -e hello elfsign: verification of hello failed. format: rsa_sha256. signer: O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com. signed on: October 17, 2013 04:22:49 PM PDT. Signing using the pkcs11 keystore To sign the ELF file using a private key in the secure pkcs11 keystore, replace "-K MYSELFSIGNED.key" in the "elfsign sign" command line with "-T MYPRIVATEKEY", where MYPRIVATKEY is the pkcs11 token label. Step 3: Install the cert and test on another system Just signing the object isn't enough. You need to copy or install the cert and the signed ELF file(s) on another system to test that the signature is OK. Your public key cert should be installed in /etc/certs. Use elfsign verify to verify the signature. Elfsign verify checks each cert in /etc/certs until it finds one that matches the elfsign signature in the file. If one isn't found, the verification fails. Here's an example: $ su Password: # rm /etc/certs/MYSELFSIGNED.key # cp MYSELFSIGNED.pem /etc/certs # exit $ elfsign verify -v hello elfsign: verification of hello passed. format: rsa_sha256. signer: O=Canine Software Works, OU=Self-signed CA, CN=canineswworks.com. signed on: October 17, 2013 04:24:20 PM PDT. After testing, package your cert along with your ELF object to allow elfsign verification after your cert and object are installed or copied. Under the Hood: elfsign verification Here's the steps taken to verify a ELF file signed with elfsign. The steps to sign the file are similar except the private key exponent is used instead of the public key exponent and the .SUNW_signature section is written to the ELF file instead of being read from the file. Generate a digest (SHA-256) of the ELF file sections. This digest uses all ELF sections loaded in memory, but excludes the ELF header, the .SUNW_signature section, and the symbol table Extract the RSA signature (RSA-2048) from the .SUNW_signature section Extract the RSA public key modulus and public key exponent (65537) from the public key cert Calculate the expected digest as follows: signaturepublicKeyExponent % publicKeyModulus Strip the PKCS#1 padding (most significant bytes) from the above. The padding is 0x00, 0x01, 0xff, 0xff, . . ., 0xff, 0x00. If the actual digest == expected digest, the ELF file is verified (OK). Further Information elfsign(1), pktool(1), and openssl(1) man pages. "Signed Solaris 10 Binaries?" blog by Darren Moffat (2005) shows how to use elfsign. "Simple CLI based CA on Solaris" blog by Darren Moffat (2008) shows how to set up a simple CA for use with self-signed certificates. "How to Create a Certificate by Using the pktool gencert Command" System Administration Guide: Security Services (available at docs.oracle.com)

Read the article
Solaris X86 AESNI OpenSSL Engine

- by danx

Solaris X86 AESNI OpenSSL Engine Cryptography is a major component of secure e-commerce. Since cryptography is compute intensive and adds a significant load to applications, such as SSL web servers (https), crypto performance is an important factor. Providing accelerated crypto hardware greatly helps these applications and will help lead to a wider adoption of cryptography, and lower cost, in e-commerce and other applications. The Intel Westmere microprocessor has six new instructions to acclerate AES encryption. They are called "AESNI" for "AES New Instructions". These are unprivileged instructions, so no "root", other elevated access, or context switch is required to execute these instructions. These instructions are used in a new built-in OpenSSL 1.0 engine available in Solaris 11, the aesni engine. Previous Work Previously, AESNI instructions were introduced into the Solaris x86 kernel and libraries. That is, the "aes" kernel module (used by IPsec and other kernel modules) and the Solaris pkcs11 library (for user applications). These are available in Solaris 10 10/09 (update 8) and above, and Solaris 11. The work here is to add the aesni engine to OpenSSL. X86 AESNI Instructions Intel's Xeon 5600 is one of the processors that support AESNI. This processor is used in the Sun Fire X4170 M2 As mentioned above, six new instructions acclerate AES encryption in processor silicon. The new instructions are: aesenc performs one round of AES encryption. One encryption round is composed of these steps: substitute bytes, shift rows, mix columns, and xor the round key. aesenclast performs the final encryption round, which is the same as above, except omitting the mix columns (which is only needed for the next encryption round). aesdec performs one round of AES decryption aesdeclast performs the final AES decryption round aeskeygenassist Helps expand the user-provided key into a "key schedule" of keys, one per round aesimc performs an "inverse mixed columns" operation to convert the encryption key schedule into a decryption key schedule pclmulqdq Not a AESNI instruction, but performs "carryless multiply" operations to acclerate AES GCM mode. Since the AESNI instructions are implemented in hardware, they take a constant number of cycles and are not vulnerable to side-channel timing attacks that attempt to discern some bits of data from the time taken to encrypt or decrypt the data. Solaris x86 and OpenSSL Software Optimizations Having X86 AESNI hardware crypto instructions is all well and good, but how do we access it? The software is available with Solaris 11 and is used automatically if you are running Solaris x86 on a AESNI-capable processor. AESNI is used internally in the kernel through kernel crypto modules and is available in user space through the PKCS#11 library. For OpenSSL on Solaris 11, AESNI crypto is available directly with a new built-in OpenSSL 1.0 engine, called the "aesni engine." This is in lieu of the extra overhead of going through the Solaris OpenSSL pkcs11 engine, which accesses Solaris crypto and digest operations. Instead, AESNI assembly is included directly in the new aesni engine. Instead of including the aesni engine in a separate library in /lib/openssl/engines/, the aesni engine is "built-in", meaning it is included directly in OpenSSL's libcrypto.so.1.0.0 library. This reduces overhead and the need to manually specify the aesni engine. Since the engine is built-in (that is, in libcrypto.so.1.0.0), the openssl -engine command line flag or API call is not needed to access the engine—the aesni engine is used automatically on AESNI hardware. Ciphers and Digests supported by OpenSSL aesni engine The Openssl aesni engine auto-detects if it's running on AESNI hardware and uses AESNI encryption instructions for these ciphers: AES-128-CBC, AES-192-CBC, AES-256-CBC, AES-128-CFB128, AES-192-CFB128, AES-256-CFB128, AES-128-CTR, AES-192-CTR, AES-256-CTR, AES-128-ECB, AES-192-ECB, AES-256-ECB, AES-128-OFB, AES-192-OFB, and AES-256-OFB. Implementation of the OpenSSL aesni engine The AESNI assembly language routines are not a part of the regular Openssl 1.0.0 release. AESNI is a part of the "HEAD" ("development" or "unstable") branch of OpenSSL, for future release. But AESNI is also available as a separate patch provided by Intel to the OpenSSL project for OpenSSL 1.0.0. A minimal amount of "glue" code in the aesni engine works between the OpenSSL libcrypto.so.1.0.0 library and the assembly functions. The aesni engine code is separate from the base OpenSSL code and requires patching only a few source files to use it. That means OpenSSL can be more easily updated to future versions without losing the performance from the built-in aesni engine. OpenSSL aesni engine Performance Here's some graphs of aesni engine performance I measured by running openssl speed -evp $algorithm where $algorithm is aes-128-cbc, aes-192-cbc, and aes-256-cbc. These are using the 64-bit version of openssl on the same AESNI hardware, a Sun Fire X4170 M2 with a Intel Xeon E5620 @2.40GHz, running Solaris 11 FCS. "Before" is openssl without the aesni engine and "after" is openssl with the aesni engine. The numbers are MBytes/second. OpenSSL aesni engine performance on Sun Fire X4170 M2 (Xeon E5620 @2.40GHz) (Higher is better; "before"=OpenSSL on AESNI without AESNI engine software, "after"=OpenSSL AESNI engine) As you can see the speedup is dramatic for all 3 key lengths and for data sizes from 16 bytes to 8 Kbytes—AESNI is about 7.5-8x faster over hand-coded amd64 assembly (without aesni instructions). Verifying the OpenSSL aesni engine is present The easiest way to determine if you are running the aesni engine is to type "openssl engine" on the command line. No configuration, API, or command line options are needed to use the OpenSSL aesni engine. If you are running on Intel AESNI hardware with Solaris 11 FCS, you'll see this output indicating you are using the aesni engine: intel-westmere $ openssl engine (aesni) Intel AES-NI engine (no-aesni) (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support If you are running on Intel without AESNI hardware you'll see this output indicating the hardware can't support the aesni engine: intel-nehalem $ openssl engine (aesni) Intel AES-NI engine (no-aesni) (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support For Solaris on SPARC or older Solaris OpenSSL software, you won't see any aesni engine line at all. Third-party OpenSSL software (built yourself or from outside Oracle) will not have the aesni engine either. Solaris 11 FCS comes with OpenSSL version 1.0.0e. The output of typing "openssl version" should be "OpenSSL 1.0.0e 6 Sep 2011". 64- and 32-bit OpenSSL OpenSSL comes in both 32- and 64-bit binaries. 64-bit executable is now the default, at /usr/bin/openssl, and OpenSSL 64-bit libraries at /lib/amd64/libcrypto.so.1.0.0 and libssl.so.1.0.0 The 32-bit executable is at /usr/bin/i86/openssl and the libraries are at /lib/libcrytpo.so.1.0.0 and libssl.so.1.0.0. Availability The OpenSSL AESNI engine is available in Solaris 11 x86 for both the 64- and 32-bit versions of OpenSSL. It is not available with Solaris 10. You must have a processor that supports AESNI instructions, otherwise OpenSSL will fallback to the older, slower AES implementation without AESNI. Processors that support AESNI include most Westmere and Sandy Bridge class processor architectures. Some low-end processors (such as for mobile/laptop platforms) do not support AESNI. The easiest way to determine if the processor supports AESNI is with the isainfo -v command—look for "amd64" and "aes" in the output: $ isainfo -v 64-bit amd64 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu Conclusion The Solaris 11 OpenSSL aesni engine provides easy access to powerful Intel AESNI hardware cryptography, in addition to Solaris userland PKCS#11 libraries and Solaris crypto kernel modules.

Read the article
Optimizing Solaris 11 SHA-1 on Intel Processors

- by danx

SHA-1 is a "hash" or "digest" operation that produces a 160 bit (20 byte) checksum value on arbitrary data, such as a file. It is intended to uniquely identify text and to verify it hasn't been modified. Max Locktyukhin and others at Intel have improved the performance of the SHA-1 digest algorithm using multiple techniques. This code has been incorporated into Solaris 11 and is available in the Solaris Crypto Framework via the libmd(3LIB), the industry-standard libpkcs11(3LIB) library, and Solaris kernel module sha1. The optimized code is used automatically on systems with a x86 CPU supporting SSSE3 (Intel Supplemental SSSE3). Intel microprocessor architectures that support SSSE3 include Nehalem, Westmere, Sandy Bridge microprocessor families. Further optimizations are available for microprocessors that support AVX (such as Sandy Bridge). Although SHA-1 is considered obsolete because of weaknesses found in the SHA-1 algorithm—NIST recommends using at least SHA-256, SHA-1 is still widely used and will be with us for awhile more. Collisions (the same SHA-1 result for two different inputs) can be found with moderate effort. SHA-1 is used heavily though in SSL/TLS, for example. And SHA-1 is stronger than the older MD5 digest algorithm, another digest option defined in SSL/TLS. Optimizations Review SHA-1 operates by reading an arbitrary amount of data. The data is read in 512 bit (64 byte) blocks (the last block is padded in a specific way to ensure it's a full 64 bytes). Each 64 byte block has 80 "rounds" of calculations (consisting of a mixture of "ROTATE-LEFT", "AND", and "XOR") applied to the block. Each round produces a 32-bit intermediate result, called W[i]. Here's what each round operates: The first 16 rounds, rounds 0 to 15, read the 512 bit block 32 bits at-a-time. These 32 bits is used as input to the round. The remaining rounds, rounds 16 to 79, use the results from the previous rounds as input. Specifically for round i it XORs the results of rounds i-3, i-8, i-14, and i-16 and rotates the result left 1 bit. The remaining calculations for the round is a series of AND, XOR, and ROTATE-LEFT operators on the 32-bit input and some constants. The 32-bit result is saved as W[i] for round i. The 32-bit result of the final round, W[79], is the SHA-1 checksum. Optimization: Vectorization The first 16 rounds can be vectorized (computed in parallel) because they don't depend on the output of a previous round. As for the remaining rounds, because of step 2 above, computing round i depends on the results of round i-3, W[i-3], one can vectorize 3 rounds at-a-time. Max Locktyukhin found through simple factoring, explained in detail in his article referenced below, that the dependencies of round i on the results of rounds i-3, i-8, i-14, and i-16 can be replaced instead with dependencies on the results of rounds i-6, i-16, i-28, and i-32. That is, instead of initializing intermediate result W[i] with: W[i] = (W[i-3] XOR W[i-8] XOR W[i-14] XOR W[i-16]) ROTATE-LEFT 1 Initialize W[i] as follows: W[i] = (W[i-6] XOR W[i-16] XOR W[i-28] XOR W[i-32]) ROTATE-LEFT 2 That means that 6 rounds could be vectorized at once, with no additional calculations, instead of just 3! This optimization is independent of Intel or any other microprocessor architecture, although the microprocessor has to support vectorization to use it, and exploits one of the weaknesses of SHA-1. Optimization: SSSE3 Intel SSSE3 makes use of 16 %xmm registers, each 128 bits wide. The 4 32-bit inputs to a round, W[i-6], W[i-16], W[i-28], W[i-32], all fit in one %xmm register. The following code snippet, from Max Locktyukhin's article, converted to ATT assembly syntax, computes 4 rounds in parallel with just a dozen or so SSSE3 instructions: movdqa W_minus_04, W_TMP pxor W_minus_28, W // W equals W[i-32:i-29] before XOR // W = W[i-32:i-29] ^ W[i-28:i-25] palignr $8, W_minus_08, W_TMP // W_TMP = W[i-6:i-3], combined from // W[i-4:i-1] and W[i-8:i-5] vectors pxor W_minus_16, W // W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W_TMP, W // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) movdqa W, W_TMP // 4 dwords in W are rotated left by 2 psrld $30, W // rotate left by 2 W = (W >> 30) | (W << 2) pslld $2, W_TMP por W, W_TMP movdqa W_TMP, W // four new W values W[i:i+3] are now calculated paddd (K_XMM), W_TMP // adding 4 current round's values of K movdqa W_TMP, (WK(i)) // storing for downstream GPR instructions to read A window of the 32 previous results, W[i-1] to W[i-32] is saved in memory on the stack. This is best illustrated with a chart. Without vectorization, computing the rounds is like this (each "R" represents 1 round of SHA-1 computation): RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR With vectorization, 4 rounds can be computed in parallel: RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR Optimization: AVX The new "Sandy Bridge" microprocessor architecture, which supports AVX, allows another interesting optimization. SSSE3 instructions have two operands, a input and an output. AVX allows three operands, two inputs and an output. In many cases two SSSE3 instructions can be combined into one AVX instruction. The difference is best illustrated with an example. Consider these two instructions from the snippet above: pxor W_minus_16, W // W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W_TMP, W // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) With AVX they can be combined in one instruction: vpxor W_minus_16, W, W_TMP // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) This optimization is also in Solaris, although Sandy Bridge-based systems aren't widely available yet. As an exercise for the reader, AVX also has 256-bit media registers, %ymm0 - %ymm15 (a superset of 128-bit %xmm0 - %xmm15). Can %ymm registers be used to parallelize the code even more? Optimization: Solaris-specific In addition to using the Intel code described above, I performed other minor optimizations to the Solaris SHA-1 code: Increased the digest(1) and mac(1) command's buffer size from 4K to 64K, as previously done for decrypt(1) and encrypt(1). This size is well suited for ZFS file systems, but helps for other file systems as well. Optimized encode functions, which byte swap the input and output data, to copy/byte-swap 4 or 8 bytes at-a-time instead of 1 byte-at-a-time. Enhanced the Solaris mdb(1) and kmdb(1) debuggers to display all 16 %xmm and %ymm registers (mdb "$x" command). Previously they only displayed the first 8 that are available in 32-bit mode. Can't optimize if you can't debug :-). Changed the SHA-1 code to allow processing in "chunks" greater than 2 Gigabytes (64-bits) Performance I measured performance on a Sun Ultra 27 (which has a Nehalem-class Xeon 5500 Intel W3570 microprocessor @3.2GHz). Turbo mode is disabled for consistent performance measurement. Graphs are better than words and numbers, so here they are: The first graph shows the Solaris digest(1) command before and after the optimizations discussed here, contained in libmd(3LIB). I ran the digest command on a half GByte file in swapfs (/tmp) and execution time decreased from 1.35 seconds to 0.98 seconds. The second graph shows the the results of an internal microbenchmark that uses the Solaris libpkcs11(3LIB) library. The operations are on a 128 byte buffer with 10,000 iterations. The results show operations increased from 320,000 to 416,000 operations per second. Finally the third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module. The operations are on a 64Kbyte buffer with 100 iterations. third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module. The operations are on a 64Kbyte buffer with 100 iterations. The results show for 1 kernel thread, operations increased from 410 to 600 MBytes/second. For 8 kernel threads, operations increase from 1540 to 1940 MBytes/second. Availability This code is in Solaris 11 FCS. It is available in the 64-bit libmd(3LIB) library for 64-bit programs and is in the Solaris kernel. You must be running hardware that supports Intel's SSSE3 instructions (for example, Intel Nehalem, Westmere, or Sandy Bridge microprocessor architectures). The easiest way to determine if SSSE3 is available is with the isainfo(1) command. For example, nehalem $ isainfo -v $ isainfo -v 64-bit amd64 applications sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu If the output also shows "avx", the Solaris executes the even-more optimized 3-operand AVX instructions for SHA-1 mentioned above: sandybridge $ isainfo -v 64-bit amd64 applications avx xsave pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications avx xsave pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu No special configuration or setup is needed to take advantage of this code. Solaris libraries and kernel automatically determine if it's running on SSSE3 or AVX-capable machines and execute the correctly-tuned code for that microprocessor. Summary The Solaris 11 Crypto Framework, via the sha1 kernel module and libmd(3LIB) and libpkcs11(3LIB) libraries, incorporated a useful SHA-1 optimization from Intel for SSSE3-capable microprocessors. As with other Solaris optimizations, they come automatically "under the hood" with the current Solaris release. References "Improving the Performance of the Secure Hash Algorithm (SHA-1)" by Max Locktyukhin (Intel, March 2010). The source for these SHA-1 optimizations used in Solaris "SHA-1", Wikipedia Good overview of SHA-1 FIPS 180-1 SHA-1 standard (FIPS, 1995) NIST Comments on Cryptanalytic Attacks on SHA-1 (2005, revised 2006)

Read the article
Solaris X86 AESNI OpenSSL Engine

- by danx

Solaris X86 AESNI OpenSSL Engine Cryptography is a major component of secure e-commerce. Since cryptography is compute intensive and adds a significant load to applications, such as SSL web servers (https), crypto performance is an important factor. Providing accelerated crypto hardware greatly helps these applications and will help lead to a wider adoption of cryptography, and lower cost, in e-commerce and other applications. The Intel Westmere microprocessor has six new instructions to acclerate AES encryption. They are called "AESNI" for "AES New Instructions". These are unprivileged instructions, so no "root", other elevated access, or context switch is required to execute these instructions. These instructions are used in a new built-in OpenSSL 1.0 engine available in Solaris 11, the aesni engine. Previous Work Previously, AESNI instructions were introduced into the Solaris x86 kernel and libraries. That is, the "aes" kernel module (used by IPsec and other kernel modules) and the Solaris pkcs11 library (for user applications). These are available in Solaris 10 10/09 (update 8) and above, and Solaris 11. The work here is to add the aesni engine to OpenSSL. X86 AESNI Instructions Intel's Xeon 5600 is one of the processors that support AESNI. This processor is used in the Sun Fire X4170 M2 As mentioned above, six new instructions acclerate AES encryption in processor silicon. The new instructions are: aesenc performs one round of AES encryption. One encryption round is composed of these steps: substitute bytes, shift rows, mix columns, and xor the round key. aesenclast performs the final encryption round, which is the same as above, except omitting the mix columns (which is only needed for the next encryption round). aesdec performs one round of AES decryption aesdeclast performs the final AES decryption round aeskeygenassist Helps expand the user-provided key into a "key schedule" of keys, one per round aesimc performs an "inverse mixed columns" operation to convert the encryption key schedule into a decryption key schedule pclmulqdq Not a AESNI instruction, but performs "carryless multiply" operations to acclerate AES GCM mode. Since the AESNI instructions are implemented in hardware, they take a constant number of cycles and are not vulnerable to side-channel timing attacks that attempt to discern some bits of data from the time taken to encrypt or decrypt the data. Solaris x86 and OpenSSL Software Optimizations Having X86 AESNI hardware crypto instructions is all well and good, but how do we access it? The software is available with Solaris 11 and is used automatically if you are running Solaris x86 on a AESNI-capable processor. AESNI is used internally in the kernel through kernel crypto modules and is available in user space through the PKCS#11 library. For OpenSSL on Solaris 11, AESNI crypto is available directly with a new built-in OpenSSL 1.0 engine, called the "aesni engine." This is in lieu of the extra overhead of going through the Solaris OpenSSL pkcs11 engine, which accesses Solaris crypto and digest operations. Instead, AESNI assembly is included directly in the new aesni engine. Instead of including the aesni engine in a separate library in /lib/openssl/engines/, the aesni engine is "built-in", meaning it is included directly in OpenSSL's libcrypto.so.1.0.0 library. This reduces overhead and the need to manually specify the aesni engine. Since the engine is built-in (that is, in libcrypto.so.1.0.0), the openssl -engine command line flag or API call is not needed to access the engine—the aesni engine is used automatically on AESNI hardware. Ciphers and Digests supported by OpenSSL aesni engine The Openssl aesni engine auto-detects if it's running on AESNI hardware and uses AESNI encryption instructions for these ciphers: AES-128-CBC, AES-192-CBC, AES-256-CBC, AES-128-CFB128, AES-192-CFB128, AES-256-CFB128, AES-128-CTR, AES-192-CTR, AES-256-CTR, AES-128-ECB, AES-192-ECB, AES-256-ECB, AES-128-OFB, AES-192-OFB, and AES-256-OFB. Implementation of the OpenSSL aesni engine The AESNI assembly language routines are not a part of the regular Openssl 1.0.0 release. AESNI is a part of the "HEAD" ("development" or "unstable") branch of OpenSSL, for future release. But AESNI is also available as a separate patch provided by Intel to the OpenSSL project for OpenSSL 1.0.0. A minimal amount of "glue" code in the aesni engine works between the OpenSSL libcrypto.so.1.0.0 library and the assembly functions. The aesni engine code is separate from the base OpenSSL code and requires patching only a few source files to use it. That means OpenSSL can be more easily updated to future versions without losing the performance from the built-in aesni engine. OpenSSL aesni engine Performance Here's some graphs of aesni engine performance I measured by running openssl speed -evp $algorithm where $algorithm is aes-128-cbc, aes-192-cbc, and aes-256-cbc. These are using the 64-bit version of openssl on the same AESNI hardware, a Sun Fire X4170 M2 with a Intel Xeon E5620 @2.40GHz, running Solaris 11 FCS. "Before" is openssl without the aesni engine and "after" is openssl with the aesni engine. The numbers are MBytes/second. OpenSSL aesni engine performance on Sun Fire X4170 M2 (Xeon E5620 @2.40GHz) (Higher is better; "before"=OpenSSL on AESNI without AESNI engine software, "after"=OpenSSL AESNI engine) As you can see the speedup is dramatic for all 3 key lengths and for data sizes from 16 bytes to 8 Kbytes—AESNI is about 7.5-8x faster over hand-coded amd64 assembly (without aesni instructions). Verifying the OpenSSL aesni engine is present The easiest way to determine if you are running the aesni engine is to type "openssl engine" on the command line. No configuration, API, or command line options are needed to use the OpenSSL aesni engine. If you are running on Intel AESNI hardware with Solaris 11 FCS, you'll see this output indicating you are using the aesni engine: intel-westmere $ openssl engine (aesni) Intel AES-NI engine (no-aesni) (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support If you are running on Intel without AESNI hardware you'll see this output indicating the hardware can't support the aesni engine: intel-nehalem $ openssl engine (aesni) Intel AES-NI engine (no-aesni) (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support For Solaris on SPARC or older Solaris OpenSSL software, you won't see any aesni engine line at all. Third-party OpenSSL software (built yourself or from outside Oracle) will not have the aesni engine either. Solaris 11 FCS comes with OpenSSL version 1.0.0e. The output of typing "openssl version" should be "OpenSSL 1.0.0e 6 Sep 2011". 64- and 32-bit OpenSSL OpenSSL comes in both 32- and 64-bit binaries. 64-bit executable is now the default, at /usr/bin/openssl, and OpenSSL 64-bit libraries at /lib/amd64/libcrypto.so.1.0.0 and libssl.so.1.0.0 The 32-bit executable is at /usr/bin/i86/openssl and the libraries are at /lib/libcrytpo.so.1.0.0 and libssl.so.1.0.0. Availability The OpenSSL AESNI engine is available in Solaris 11 x86 for both the 64- and 32-bit versions of OpenSSL. It is not available with Solaris 10. You must have a processor that supports AESNI instructions, otherwise OpenSSL will fallback to the older, slower AES implementation without AESNI. Processors that support AESNI include most Westmere and Sandy Bridge class processor architectures. Some low-end processors (such as for mobile/laptop platforms) do not support AESNI. The easiest way to determine if the processor supports AESNI is with the isainfo -v command—look for "amd64" and "aes" in the output: $ isainfo -v 64-bit amd64 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu Conclusion The Solaris 11 OpenSSL aesni engine provides easy access to powerful Intel AESNI hardware cryptography, in addition to Solaris userland PKCS#11 libraries and Solaris crypto kernel modules.

Read the article
Solaris X86 64-bit Assembly Programming

- by danx

Solaris X86 64-bit Assembly Programming This is a simple example on writing, compiling, and debugging Solaris 64-bit x86 assembly language with a C program. This is also referred to as "AMD64" assembly. The term "AMD64" is used in an inclusive sense to refer to all X86 64-bit processors, whether AMD Opteron family or Intel 64 processor family. Both run Solaris x86. I'm keeping this example simple mainly to illustrate how everything comes together—compiler, assembler, linker, and debugger when using assembly language. The example I'm using here is a C program that calls an assembly language program passing a C string. The assembly language program takes the C string and calls printf() with it to print the string. AMD64 Register Usage But first let's review the use of AMD64 registers. AMD64 has several 64-bit registers, some special purpose (such as the stack pointer) and others general purpose. By convention, Solaris follows the AMD64 ABI in register usage, which is the same used by Linux, but different from Microsoft Windows in usage (such as which registers are used to pass parameters). This blog will only discuss conventions for Linux and Solaris. The following chart shows how AMD64 registers are used. The first six parameters to a function are passed through registers. If there's more than six parameters, parameter 7 and above are pushed on the stack before calling the function. The stack is also used to save temporary "stack" variables for use by a function. 64-bit Register Usage %rip Instruction Pointer points to the current instruction %rsp Stack Pointer %rbp Frame Pointer (saved stack pointer pointing to parameters on stack) %rdi Function Parameter 1 %rsi Function Parameter 2 %rdx Function Parameter 3 %rcx Function Parameter 4 %r8 Function Parameter 5 %r9 Function Parameter 6 %rax Function return value %r10, %r11 Temporary registers (need not be saved before used) %rbx, %r12, %r13, %r14, %r15 Temporary registers, but must be saved before use and restored before returning from the current function (usually with the push and pop instructions). 32-, 16-, and 8-bit registers To access the lower 32-, 16-, or 8-bits of a 64-bit register use the following: 64-bit register Least significant 32-bits Least significant 16-bits Least significant 8-bits %rax%eax%ax%al %rbx%ebx%bx%bl %rcx%ecx%cx%cl %rdx%edx%dx%dl %rsi%esi%si%sil %rdi%edi%di%axl %rbp%ebp%bp%bp %rsp%esp%sp%spl %r9%r9d%r9w%r9b %r10%r10d%r10w%r10b %r11%r11d%r11w%r11b %r12%r12d%r12w%r12b %r13%r13d%r13w%r13b %r14%r14d%r14w%r14b %r15%r15d%r15w%r15b %r16%r16d%r16w%r16b There's other registers present, such as the 64-bit %mm registers, 128-bit %xmm registers, 256-bit %ymm registers, and 512-bit %zmm registers. Except for %mm registers, these registers may not present on older AMD64 processors. Assembly Source The following is the source for a C program, helloas1.c, that calls an assembly function, hello_asm(). $ cat helloas1.c extern void hello_asm(char *s); int main(void) { hello_asm("Hello, World!"); } The assembly function called above, hello_asm(), is defined below. $ cat helloas2.s /* * helloas2.s * To build: * cc -m64 -o helloas2-cpp.s -D_ASM -E helloas2.s * cc -m64 -c -o helloas2.o helloas2-cpp.s */ #if defined(lint) || defined(__lint) /* ARGSUSED */ void hello_asm(char *s) { } #else /* lint */ #include <sys/asm_linkage.h> .extern printf ENTRY_NP(hello_asm) // Setup printf parameters on stack mov %rdi, %rsi // P2 (%rsi) is string variable lea .printf_string, %rdi // P1 (%rdi) is printf format string call printf ret SET_SIZE(hello_asm) // Read-only data .text .align 16 .type .printf_string, @object .printf_string: .ascii "The string is: %s.\n\0" #endif /* lint || __lint */ In the assembly source above, the C skeleton code under "#if defined(lint)" is optionally used for lint to check the interfaces with your C program--very useful to catch nasty interface bugs. The "asm_linkage.h" file includes some handy macros useful for assembly, such as ENTRY_NP(), used to define a program entry point, and SET_SIZE(), used to set the function size in the symbol table. The function hello_asm calls C function printf() by passing two parameters, Parameter 1 (P1) is a printf format string, and P2 is a string variable. The function begins by moving %rdi, which contains Parameter 1 (P1) passed hello_asm, to printf()'s P2, %rsi. Then it sets printf's P1, the format string, by loading the address the address of the format string in %rdi, P1. Finally it calls printf. After returning from printf, the hello_asm function returns itself. Larger, more complex assembly functions usually do more setup than the example above. If a function is returning a value, it would set %rax to the return value. Also, it's typical for a function to save the %rbp and %rsp registers of the calling function and to restore these registers before returning. %rsp contains the stack pointer and %rbp contains the frame pointer. Here is the typical function setup and return sequence for a function: ENTRY_NP(sample_assembly_function) push %rbp // save frame pointer on stack mov %rsp, %rbp // save stack pointer in frame pointer xor %rax, %r4ax // set function return value to 0. mov %rbp, %rsp // restore stack pointer pop %rbp // restore frame pointer ret // return to calling function SET_SIZE(sample_assembly_function) Compiling and Running Assembly Use the Solaris cc command to compile both C and assembly source, and to pre-process assembly source. You can also use GNU gcc instead of cc to compile, if you prefer. The "-m64" option tells the compiler to compile in 64-bit address mode (instead of 32-bit). $ cc -m64 -o helloas2-cpp.s -D_ASM -E helloas2.s $ cc -m64 -c -o helloas2.o helloas2-cpp.s $ cc -m64 -c helloas1.c $ cc -m64 -o hello-asm helloas1.o helloas2.o $ file hello-asm helloas1.o helloas2.o hello-asm: ELF 64-bit LSB executable AMD64 Version 1 [SSE FXSR FPU], dynamically linked, not stripped helloas1.o: ELF 64-bit LSB relocatable AMD64 Version 1 helloas2.o: ELF 64-bit LSB relocatable AMD64 Version 1 $ hello-asm The string is: Hello, World!. Debugging Assembly with MDB MDB is the Solaris system debugger. It can also be used to debug user programs, including assembly and C. The following example runs the above program, hello-asm, under control of the debugger. In the example below I load the program, set a breakpoint at the assembly function hello_asm, display the registers and the first parameter, step through the assembly function, and continue execution. $ mdb hello-asm # Start the debugger > hello_asm:b # Set a breakpoint > ::run # Run the program under the debugger mdb: stop at hello_asm mdb: target stopped at: hello_asm: movq %rdi,%rsi > $C # display function stack ffff80ffbffff6e0 hello_asm() ffff80ffbffff6f0 0x400adc() > $r # display registers %rax = 0x0000000000000000 %r8 = 0x0000000000000000 %rbx = 0xffff80ffbf7f8e70 %r9 = 0x0000000000000000 %rcx = 0x0000000000000000 %r10 = 0x0000000000000000 %rdx = 0xffff80ffbffff718 %r11 = 0xffff80ffbf537db8 %rsi = 0xffff80ffbffff708 %r12 = 0x0000000000000000 %rdi = 0x0000000000400cf8 %r13 = 0x0000000000000000 %r14 = 0x0000000000000000 %r15 = 0x0000000000000000 %cs = 0x0053 %fs = 0x0000 %gs = 0x0000 %ds = 0x0000 %es = 0x0000 %ss = 0x004b %rip = 0x0000000000400c70 hello_asm %rbp = 0xffff80ffbffff6e0 %rsp = 0xffff80ffbffff6c8 %rflags = 0x00000282 id=0 vip=0 vif=0 ac=0 vm=0 rf=0 nt=0 iopl=0x0 status=<of,df,IF,tf,SF,zf,af,pf,cf> %gsbase = 0x0000000000000000 %fsbase = 0xffff80ffbf782a40 %trapno = 0x3 %err = 0x0 > ::dis # disassemble the current instructions hello_asm: movq %rdi,%rsi hello_asm+3: leaq 0x400c90,%rdi hello_asm+0xb: call -0x220 <PLT:printf> hello_asm+0x10: ret 0x400c81: nop 0x400c85: nop 0x400c88: nop 0x400c8c: nop 0x400c90: pushq %rsp 0x400c91: pushq $0x74732065 0x400c96: jb +0x69 <0x400d01> > 0x0000000000400cf8/S # %rdi contains Parameter 1 0x400cf8: Hello, World! > [ # Step and execute 1 instruction mdb: target stopped at: hello_asm+3: leaq 0x400c90,%rdi > [ mdb: target stopped at: hello_asm+0xb: call -0x220 <PLT:printf> > [ The string is: Hello, World!. mdb: target stopped at: hello_asm+0x10: ret > [ mdb: target stopped at: main+0x19: movl $0x0,-0x4(%rbp) > :c # continue program execution mdb: target has terminated > $q # quit the MDB debugger $ In the example above, at the start of function hello_asm(), I display the stack contents with "$C", display the registers contents with "$r", then disassemble the current function with "::dis". The first function parameter, which is a C string, is passed by reference with the string address in %rdi (see the register usage chart above). The address is 0x400cf8, so I print the value of the string with the "/S" MDB command: "0x0000000000400cf8/S". I can also print the contents at an address in several other formats. Here's a few popular formats. For more, see the mdb(1) man page for details. address/S C string address/C ASCII character (1 byte) address/E unsigned decimal (8 bytes) address/U unsigned decimal (4 bytes) address/D signed decimal (4 bytes) address/J hexadecimal (8 bytes) address/X hexadecimal (4 bytes) address/B hexadecimal (1 bytes) address/K pointer in hexadecimal (4 or 8 bytes) address/I disassembled instruction Finally, I step through each machine instruction with the "[" command, which steps over functions. If I wanted to enter a function, I would use the "]" command. Then I continue program execution with ":c", which continues until the program terminates. MDB Basic Cheat Sheet Here's a brief cheat sheet of some of the more common MDB commands useful for assembly debugging. There's an entire set of macros and more powerful commands, especially some for debugging the Solaris kernel, but that's beyond the scope of this example. $C Display function stack with pointers $c Display function stack $e Display external function names $v Display non-zero variables and registers $r Display registers ::fpregs Display floating point (or "media" registers). Includes %st, %xmm, and %ymm registers. ::status Display program status ::run Run the program (followed by optional command line parameters) $q Quit the debugger address:b Set a breakpoint address:d Delete a breakpoint $b Display breakpoints :c Continue program execution after a breakpoint [ Step 1 instruction, but step over function calls ] Step 1 instruction address::dis Disassemble instructions at an address ::events Display events Further Information "Assembly Language Techniques for Oracle Solaris on x86 Platforms" by Paul Lowik (2004). Good tutorial on Solaris x86 optimization with assembly. The Solaris Operating System on x86 Platforms An excellent, detailed tutorial on X86 architecture, with Solaris specifics. By an ex-Sun employee, Frank Hofmann (2005). "AMD64 ABI Features", Solaris 64-bit Developer's Guide contains rules on data types and register usage for Intel 64/AMD64-class processors. (available at docs.oracle.com) Solaris X86 Assembly Language Reference Manual (available at docs.oracle.com) SPARC Assembly Language Reference Manual (available at docs.oracle.com) System V Application Binary Interface (2003) defines the AMD64 ABI for UNIX-class operating systems, including Solaris, Linux, and BSD. Google for it—the original website is gone. cc(1), gcc(1), and mdb(1) man pages.

Read the article
How to tell if SPARC T4 crypto is being used?

- by danx

A question that often comes up when running applications on SPARC T4 systems is "How can I tell if hardware crypto accleration is being used?" To review, the SPARC T4 processor includes a crypto unit that supports several crypto instructions. For hardware crypto these include 11 AES instructions, 4 xmul* instructions (for AES GCM carryless multiply), mont for Montgomery multiply (optimizes RSA and DSA), and 5 des_* instructions (for DES3). For hardware hash algorithm optimization, the T4 has the md5, sha1, sha256, and sha512 instructions (the last two are used for SHA-224 an SHA-384). First off, it's easy to tell if the processor T4 crypto instructions—use the isainfo -v command and look for "sparcv9" and "aes" (and other hash and crypto algorithms) in the output: $ isainfo -v 64-bit sparcv9 applications crc32c cbcond pause mont mpmul sha512 sha256 sha1 md5 camellia kasumi des aes ima hpc vis3 fmaf asi_blk_init vis2 vis popc These instructions are not-privileged, so are available for direct use in user-level applications and libraries (such as OpenSSL). Here is the "openssl speed -evp" command shown with the built-in t4 engine and with the pkcs11 engine. Both run the T4 AES instructions, but the t4 engine is faster than the pkcs11 engine because it has less overhead (especially for smaller packet sizes): t-4 $ /usr/bin/openssl version OpenSSL 1.0.0j 10 May 2012 t-4 $ /usr/bin/openssl engine (t4) SPARC T4 engine support (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support t-4 $ /usr/bin/openssl speed -evp aes-128-cbc # t4 engine used by default . . . The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 487777.10k 816822.21k 986012.59k 1017029.97k 1053543.08k t-4 $ /usr/bin/openssl speed -engine pkcs11 -evp aes-128-cbc engine "pkcs11" set. . . . The 'numbers' are in 1000s of bytes per second processed. type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes aes-128-cbc 31703.58k 116636.39k 350672.81k 696170.50k 993599.49k Note: The "-evp" flag indicates use the OpenSSL "EnVeloPe" API, which gives more accurate results. That's because it tells OpenSSL to use the same API that external programs use when calling OpenSSL libcrypto functions, evp(3openssl). DTrace Shows if T4 Crypto Functions Are Used OK, good enough, the isainfo(1) command shows the instructions are present, but how does one know if they are being used? Chi-Chang Lin, who works on Oracle Solaris performance, wrote a Dtrace script to show if T4 instructions are being executed. To show the T4 instructions are being used, run the following Dtrace script. Look for functions named "t4" and "yf" in the output. The OpenSSL T4 engine uses functions named "t4" and the PKCS#11 engine uses functions named "yf". To demonstrate, I'll first run "openssl speed" with the built-in t4 engine then with the pkcs11 engine. The performance numbers are not valid due to dtrace probes slowing things down. t-4 # dtrace -Z -n ' pid$target::*yf*:entry,pid$target::*t4_*:entry{ @[probemod, probefunc] = count();}' \ -c "/usr/bin/openssl speed -evp aes-128-cbc" dtrace: description 'pid$target::*yf*:entry' matched 101 probes . . . dtrace: pid 2029 has exited libcrypto.so.1.0.0 ENGINE_load_t4 1 libcrypto.so.1.0.0 t4_DH 1 libcrypto.so.1.0.0 t4_DSA 1 libcrypto.so.1.0.0 t4_RSA 1 libcrypto.so.1.0.0 t4_destroy 1 libcrypto.so.1.0.0 t4_free_aes_ctr_NIDs 1 libcrypto.so.1.0.0 t4_init 1 libcrypto.so.1.0.0 t4_add_NID 3 libcrypto.so.1.0.0 t4_aes_expand128 5 libcrypto.so.1.0.0 t4_cipher_init_aes 5 libcrypto.so.1.0.0 t4_get_all_ciphers 6 libcrypto.so.1.0.0 t4_get_all_digests 59 libcrypto.so.1.0.0 t4_digest_final_sha1 65 libcrypto.so.1.0.0 t4_digest_init_sha1 65 libcrypto.so.1.0.0 t4_sha1_multiblock 126 libcrypto.so.1.0.0 t4_digest_update_sha1 261 libcrypto.so.1.0.0 t4_aes128_cbc_encrypt 1432979 libcrypto.so.1.0.0 t4_aes128_load_keys_for_encrypt 1432979 libcrypto.so.1.0.0 t4_cipher_do_aes_128_cbc 1432979 t-4 # dtrace -Z -n 'pid$target::*yf*:entry{ @[probemod, probefunc] = count();} pid$target::*yf*:entry,pid$target::*t4_*:entry{ @[probemod, probefunc] = count();}' \ -c "/usr/bin/openssl speed -engine pkcs11 -evp aes-128-cbc" dtrace: description 'pid$target::*yf*:entry' matched 101 probes engine "pkcs11" set. . . . dtrace: pid 2033 has exited libcrypto.so.1.0.0 ENGINE_load_t4 1 libcrypto.so.1.0.0 t4_DH 1 libcrypto.so.1.0.0 t4_DSA 1 libcrypto.so.1.0.0 t4_RSA 1 libcrypto.so.1.0.0 t4_destroy 1 libcrypto.so.1.0.0 t4_free_aes_ctr_NIDs 1 libcrypto.so.1.0.0 t4_get_all_ciphers 1 libcrypto.so.1.0.0 t4_get_all_digests 1 libsoftcrypto.so.1 rijndael_key_setup_enc_yf 1 libsoftcrypto.so.1 yf_aes_expand128 1 libcrypto.so.1.0.0 t4_add_NID 3 libsoftcrypto.so.1 yf_aes128_cbc_encrypt 1542330 libsoftcrypto.so.1 yf_aes128_load_keys_for_encrypt 1542330 So, as shown above the OpenSSL built-in t4 engine executes t4_* functions (which are hand-coded assembly executing the T4 AES instructions) and the OpenSSL pkcs11 engine executes *yf* functions. Programmatic Use of OpenSSL T4 engine The OpenSSL t4 engine is used automatically with the /usr/bin/openssl command line. Chi-Chang Lin also points out that if you're calling the OpenSSL API (libcrypto.so) from a program, you must call ENGINE_load_built_engines(), otherwise the built-in t4 engine will not be loaded. You do not call ENGINE_set_default(). That's because "openssl speed -evp" test calls ENGINE_load_built_engines() even though the "-engine" option wasn't specified. OpenSSL T4 engine Availability The OpenSSL t4 engine is available with Solaris 11 and 11.1. For Solaris 10 08/11 (U10), you need to use the OpenSSL pkcs311 engine. The OpenSSL t4 engine is distributed only with the version of OpenSSL distributed with Solaris (and not third-party or self-compiled versions of OpenSSL). The OpenSSL engine implements the AES cipher for Solaris 11, released 11/2011. For Solaris 11.1, released 11/2012, the OpenSSL engine adds optimization for the MD5, SHA-1, and SHA-2 hash algorithms, and DES-3. Although the T4 processor has Camillia and Kasumi block cipher instructions, these are not implemented in the OpenSSL T4 engine. The following charts may help view availability of optimizations. The first chart shows what's available with Solaris CLIs and APIs, the second chart shows what's available in Solaris OpenSSL. Native Solaris Optimization for SPARC T4 This table is shows Solaris native CLI and API support. As such, they are all available with the OpenSSL pkcs11 engine. CLIs: "openssl -engine pkcs11", encrypt(1), decrypt(1), mac(1), digest(1), MD5sum(1), SHA1sum(1), SHA224sum(1), SHA256sum(1), SHA384sum(1), SHA512sum(1) APIs: PKCS#11 library libpkcs11(3LIB) (incluDES Openssl pkcs11 engine), libMD(3LIB), and Solaris kernel modules AlgorithmSolaris 1008/11 (U10)Solaris 11Solaris 11.1 AES-ECB, AES-CBC, AES-CTR, AES-CBC AES-CFB128 XXX DES3-ECB, DES3-CBC, DES2-ECB, DES2-CBC, DES-ECB, DES-CBC XXX bignum Montgomery multiply (RSA, DSA) XXX MD5, SHA-1, SHA-256, SHA-384, SHA-512 XXX SHA-224 X ARCFOUR (RC4) X Solaris OpenSSL T4 Engine Optimization This table is for the Solaris OpenSSL built-in t4 engine. Algorithms listed above are also available through the OpenSSL pkcs11 engine. CLI: openssl(1openssl) APIs: openssl(5), engine(3openssl), evp(3openssl), libcrypto crypto(3openssl) AlgorithmSolaris 11Solaris 11SRU2Solaris 11.1 AES-ECB, AES-CBC, AES-CTR, AES-CBC AES-CFB128 XXX DES3-ECB, DES3-CBC, DES-ECB, DES-CBC X bignum Montgomery multiply (RSA, DSA) X MD5, SHA-1, SHA-256, SHA-384, SHA-512 XX SHA-224 X Source Code Availability Solaris Most of the T4 assembly code that called the new T4 crypto instructions was written by Ferenc Rákóczi of the Solaris Security group, with assistance from others. You can download the Solaris source for this and other parts of Solaris as a few zip files at the Oracle Download website. The relevant source files are generally under directories usr/src/common/crypto/{aes,arcfour,des,md5,modes,sha1,sha2}}/sun4v/. and usr/src/common/bignum/sun4v/. Solaris 11 binary is available from the Oracle Solaris 11 download website. OpenSSL t4 engine The source for the OpenSSL t4 engine, which is based on the Solaris source above, is viewable through the OpenGrok source code browser in directory src/components/openssl/openssl-1.0.0/engines/t4 . You can download the source from the same website or through Mercurial source code management, hg(1). Conclusion Oracle Solaris with SPARC T4 provides a rich set of accelerated cryptographic and hash algorithms. Using the latest update, Solaris 11.1, provides the best set of optimized algorithms, but alternatives are often available, sometimes slightly slower, for releases back to Solaris 10 08/11 (U10). Reference See also these earlier blogs. SPARC T4 OpenSSL Engine by myself, Dan Anderson (2011), discusses the Openssl T4 engine and reviews the SPARC T4 processor for the Solaris 11 release. Exciting Crypto Advances with the T4 processor and Oracle Solaris 11 by Valerie Fenwick (2011) discusses crypto algorithms that were optimized for the T4 processor with the Solaris 11 FCS (11/11) and Solaris 10 08/11 (U10) release. T4 Crypto Cheat Sheet by Stefan Hinker (2012) discusses how to make T4 crypto optimization available to various consumers (such as SSH, Java, OpenSSL, Apache, etc.) High Performance Security For Oracle Database and Fusion Middleware Applications using SPARC T4 (PDF, 2012) discusses SPARC T4 and its usage to optimize application security. Configuring Oracle iPlanet WebServer / Oracle Traffic Director to use crypto accelerators on T4-1 servers by Meena Vyas (2012)

Read the article
Optimizing AES modes on Solaris for Intel Westmere

- by danx

Optimizing AES modes on Solaris for Intel Westmere Review AES is a strong method of symmetric (secret-key) encryption. It is a U.S. FIPS-approved cryptographic algorithm (FIPS 197) that operates on 16-byte blocks. AES has been available since 2001 and is widely used. However, AES by itself has a weakness. AES encryption isn't usually used by itself because identical blocks of plaintext are always encrypted into identical blocks of ciphertext. This encryption can be easily attacked with "dictionaries" of common blocks of text and allows one to more-easily discern the content of the unknown cryptotext. This mode of encryption is called "Electronic Code Book" (ECB), because one in theory can keep a "code book" of all known cryptotext and plaintext results to cipher and decipher AES. In practice, a complete "code book" is not practical, even in electronic form, but large dictionaries of common plaintext blocks is still possible. Here's a diagram of encrypting input data using AES ECB mode: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ AESKey-->(AES Encryption) AESKey-->(AES Encryption) | | | | \/ \/ CipherTextOutput CipherTextOutput Block 1 Block 2 What's the solution to the same cleartext input producing the same ciphertext output? The solution is to further process the encrypted or decrypted text in such a way that the same text produces different output. This usually involves an Initialization Vector (IV) and XORing the decrypted or encrypted text. As an example, I'll illustrate CBC mode encryption: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ IV >----->(XOR) +------------->(XOR) +---> . . . . | | | | | | | | \/ | \/ | AESKey-->(AES Encryption) | AESKey-->(AES Encryption) | | | | | | | | | \/ | \/ | CipherTextOutput ------+ CipherTextOutput -------+ Block 1 Block 2 The steps for CBC encryption are: Start with a 16-byte Initialization Vector (IV), choosen randomly. XOR the IV with the first block of input plaintext Encrypt the result with AES using a user-provided key. The result is the first 16-bytes of output cryptotext. Use the cryptotext (instead of the IV) of the previous block to XOR with the next input block of plaintext Another mode besides CBC is Counter Mode (CTR). As with CBC mode, it also starts with a 16-byte IV. However, for subsequent blocks, the IV is just incremented by one. Also, the IV ix XORed with the AES encryption result (not the plain text input). Here's an illustration: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ AESKey-->(AES Encryption) AESKey-->(AES Encryption) | | | | \/ \/ IV >----->(XOR) IV + 1 >---->(XOR) IV + 2 ---> . . . . | | | | \/ \/ CipherTextOutput CipherTextOutput Block 1 Block 2 Optimization Which of these modes can be parallelized? ECB encryption/decryption can be parallelized because it does more than plain AES encryption and decryption, as mentioned above. CBC encryption can't be parallelized because it depends on the output of the previous block. However, CBC decryption can be parallelized because all the encrypted blocks are known at the beginning. CTR encryption and decryption can be parallelized because the input to each block is known--it's just the IV incremented by one for each subsequent block. So, in summary, for ECB, CBC, and CTR modes, encryption and decryption can be parallelized with the exception of CBC encryption. How do we parallelize encryption? By interleaving. Usually when reading and writing data there are pipeline "stalls" (idle processor cycles) that result from waiting for memory to be loaded or stored to or from CPU registers. Since the software is written to encrypt/decrypt the next data block where pipeline stalls usually occurs, we can avoid stalls and crypt with fewer cycles. This software processes 4 blocks at a time, which ensures virtually no waiting ("stalling") for reading or writing data in memory. Other Optimizations Besides interleaving, other optimizations performed are Loading the entire key schedule into the 128-bit %xmm registers. This is done once for per 4-block of data (since 4 blocks of data is processed, when present). The following is loaded: the entire "key schedule" (user input key preprocessed for encryption and decryption). This takes 11, 13, or 15 registers, for AES-128, AES-192, and AES-256, respectively The input data is loaded into another %xmm register The same register contains the output result after encrypting/decrypting Using SSSE 4 instructions (AESNI). Besides the aesenc, aesenclast, aesdec, aesdeclast, aeskeygenassist, and aesimc AESNI instructions, Intel has several other instructions that operate on the 128-bit %xmm registers. Some common instructions for encryption are: pxor exclusive or (very useful), movdqu load/store a %xmm register from/to memory, pshufb shuffle bytes for byte swapping, pclmulqdq carry-less multiply for GCM mode Combining AES encryption/decryption with CBC or CTR modes processing. Instead of loading input data twice (once for AES encryption/decryption, and again for modes (CTR or CBC, for example) processing, the input data is loaded once as both AES and modes operations occur at in the same function Performance Everyone likes pretty color charts, so here they are. I ran these on Solaris 11 running on a Piketon Platform system with a 4-core Intel Clarkdale processor @3.20GHz. Clarkdale which is part of the Westmere processor architecture family. The "before" case is Solaris 11, unmodified. Keep in mind that the "before" case already has been optimized with hand-coded Intel AESNI assembly. The "after" case has combined AES-NI and mode instructions, interleaved 4 blocks at-a-time. « For the first table, lower is better (milliseconds). The first table shows the performance improvement using the Solaris encrypt(1) and decrypt(1) CLI commands. I encrypted and decrypted a 1/2 GByte file on /tmp (swap tmpfs). Encryption improved by about 40% and decryption improved by about 80%. AES-128 is slighty faster than AES-256, as expected. The second table shows more detail timings for CBC, CTR, and ECB modes for the 3 AES key sizes and different data lengths. » The results shown are the percentage improvement as shown by an internal PKCS#11 microbenchmark. And keep in mind the previous baseline code already had optimized AESNI assembly! The keysize (AES-128, 192, or 256) makes little difference in relative percentage improvement (although, of course, AES-128 is faster than AES-256). Larger data sizes show better improvement than 128-byte data. Availability This software is in Solaris 11 FCS. It is available in the 64-bit libcrypto library and the "aes" Solaris kernel module. You must be running hardware that supports AESNI (for example, Intel Westmere and Sandy Bridge, microprocessor architectures). The easiest way to determine if AES-NI is available is with the isainfo(1) command. For example, $ isainfo -v 64-bit amd64 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu No special configuration or setup is needed to take advantage of this software. Solaris libraries and kernel automatically determine if it's running on AESNI-capable machines and execute the correctly-tuned software for the current microprocessor. Summary Maximum throughput of AES cipher modes can be achieved by combining AES encryption with modes processing, interleaving encryption of 4 blocks at a time, and using Intel's wide 128-bit %xmm registers and instructions. References "Block cipher modes of operation", Wikipedia Good overview of AES modes (ECB, CBC, CTR, etc.) "Advanced Encryption Standard", Wikipedia "Current Modes" describes NIST-approved block cipher modes (ECB,CBC, CFB, OFB, CCM, GCM)

Read the article
Interesting articles and blogs on SPARC T4

- by mv

Interesting articles and blogs on SPARC T4 processor I have consolidated all the interesting information I could get on SPARC T4 processor and its hardware cryptographic capabilities. Hope its useful. 1. Advantages of SPARC T4 processor Most important points in this T4 announcement are : "The SPARC T4 processor was designed from the ground up for high speed security and has a cryptographic stream processing unit (SPU) integrated directly into each processor core. These accelerators support 16 industry standard security ciphers and enable high speed encryption at rates 3 to 5 times that of competing processors. By integrating encryption capabilities directly inside the instruction pipeline, the SPARC T4 processor eliminates the performance and cost barriers typically associated with secure computing and makes it possible to deliver high security levels without impacting the user experience." Data Sheet has more details on these : "New on-chip Encryption Instruction Accelerators with direct non-privileged support for 16 industry-standard cryptographic algorithms plus random number generation in each of the eight cores: AES, Camellia, CRC32c, DES, 3DES, DH, DSA, ECC, Kasumi, MD5, RSA, SHA-1, SHA-224, SHA-256, SHA-384, SHA-512" I ran "isainfo -v" command on Solaris 11 Sparc T4-1 system. It shows the new instructions as expected : $ isainfo -v 64-bit sparcv9 applications crc32c cbcond pause mont mpmul sha512 sha256 sha1 md5 camellia kasumi des aes ima hpc vis3 fmaf asi_blk_init vis2 vis popc 32-bit sparc applications crc32c cbcond pause mont mpmul sha512 sha256 sha1 md5 camellia kasumi des aes ima hpc vis3 fmaf asi_blk_init vis2 vis popc v8plus div32 mul32 2. Dan Anderson's Blog have some interesting points about how these can be used : "New T4 crypto instructions include: aes_kexpand0, aes_kexpand1, aes_kexpand2, aes_eround01, aes_eround23, aes_eround01_l, aes_eround_23_l, aes_dround01, aes_dround23, aes_dround01_l, aes_dround_23_l. Having SPARC T4 hardware crypto instructions is all well and good, but how do we access it ? The software is available with Solaris 11 and is used automatically if you are running Solaris a SPARC T4. It is used internally in the kernel through kernel crypto modules. It is available in user space through the PKCS#11 library." 3. Dans' Blog on Where's the Crypto Libraries? Although this was written in 2009 but still is very useful "Here's a brief tour of the major crypto libraries shown in the digraph: The libpkcs11 library contains the PKCS#11 API (C_\*() functions, such as C_Initialize()). That in turn calls library pkcs11_softtoken or pkcs11_kernel, for userland or kernel crypto providers. The latter is used mostly for hardware-assisted cryptography (such as n2cp for Niagara2 SPARC processors), as that is performed more efficiently in kernel space with the "kCF" module (Kernel Crypto Framework). Additionally, for Solaris 10, strong crypto algorithms were split off in separate libraries, pkcs11_softtoken_extra libcryptoutil contains low-level utility functions to help implement cryptography. libsoftcrypto (OpenSolaris and Solaris Nevada only) implements several symmetric-key crypto algorithms in software, such as AES, RC4, and DES3, and the bignum library (used for RSA). libmd implements MD5, SHA, and SHA2 message digest algorithms" 4. Difference in T3 and T4 Diagram in this blog is good and self explanatory. Jeff's blog also highlights the differences "The T4 servers have improved crypto acceleration, described at https://blogs.oracle.com/DanX/entry/sparc_t4_openssl_engine. It is "just built in" so administrators no longer have to assign crypto accelerator units to domains - it "just happens". Every physical or virtual CPU on a SPARC-T4 has full access to hardware based crypto acceleration at all times. .... For completeness sake, it's worth noting that the T4 adds more crypto algorithms, and accelerates Camelia, CRC32c, and more SHA-x." 5. About performance counters In this blog, performance counters are explained : "Note that unlike T3 and before, T4 crypto doesn't require kernel modules like ncp or n2cp, there is no visibility of crypto hardware with kstats or cryptoadm. T4 does provide hardware counters for crypto operations. You can see these using cpustat: cpustat -c pic0=Instr_FGU_crypto 5 You can check the general crypto support of the hardware and OS with the command "isainfo -v". Since T4 crypto's implementation now allows direct userland access, there are no "crypto units" visible to cryptoadm. " For more details refer Martin's blog as well. 6. How to turn off SPARC T4 or Intel AES-NI crypto acceleration I found this interesting blog from Darren about how to turn off SPARC T4 or Intel AES-NI crypto acceleration. "One of the new Solaris 11 features of the linker/loader is the ability to have a single ELF object that has multiple different implementations of the same functions that are selected at runtime based on the capabilities of the machine. The alternate to this is having the application coded to call getisax(2) system call and make the choice itself. We use this functionality of the linker/loader when we build the userland libraries for the Solaris Cryptographic Framework (specifically libmd.so and libsoftcrypto.so) The Solaris linker/loader allows control of a lot of its functionality via environment variables, we can use that to control the version of the cryptographic functions we run. To do this we simply export the LD_HWCAP environment variable with values that tell ld.so.1 to not select the HWCAP section matching certain features even if isainfo says they are present. This will work for consumers of the Solaris Cryptographic Framework that use the Solaris PKCS#11 libraries or use libmd.so interfaces directly. For SPARC T4 : export LD_HWCAP="-aes -des -md5 -sha256 -sha512 -mont -mpul" .. For Intel systems with AES-NI support: export LD_HWCAP="-aes"" Note that LD_HWCAP is explained in http://docs.oracle.com/cd/E23823_01/html/816-5165/ld.so.1-1.html "LD_HWCAP, LD_HWCAP_32, and LD_HWCAP_64 - Identifies an alternative hardware capabilities value... A “-” prefix results in the capabilities that follow being removed from the alternative capabilities." 7. Whitepaper on SPARC T4 Servers—Optimized for End-to-End Data Center Computing This Whitepaper on SPARC T4 Servers—Optimized for End-to-End Data Center Computing explains more details. It has DTrace scripts which may come in handy : "To ensure the hardware-assisted cryptographic acceleration is configured to use and working with the security scenarios, it is recommended to use the following Solaris DTrace script. #!/usr/sbin/dtrace -s pid$1:libsoftcrypto:yf*:entry, pid$target:libsoftcrypto:rsa*:entry, pid$1:libmd:yf*:entry { @[probefunc] = count(); } tick-1sec { printa(@ops); trunc(@ops); }" Note that I have slightly modified the D Script to have RSA "libsoftcrypto:rsa*:entry" as well as per recommendations from Chi-Chang Lin. 8. References http://www.oracle.com/us/corporate/features/sparc-t4-announcement-494846.html http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t4-1-ds-487858.pdf https://blogs.oracle.com/DanX/entry/sparc_t4_openssl_engine https://blogs.oracle.com/DanX/entry/where_s_the_crypto_libraries https://blogs.oracle.com/darren/entry/howto_turn_off_sparc_t4 http://docs.oracle.com/cd/E23823_01/html/816-5165/ld.so.1-1.html https://blogs.oracle.com/hardware/entry/unleash_the_power_of_cryptography https://blogs.oracle.com/cmt/entry/t4_crypto_cheat_sheet https://blogs.oracle.com/martinm/entry/t4_performance_counters_explained https://blogs.oracle.com/jsavit/entry/no_mau_required_on_a http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t4-business-wp-524472.pdf

Read the article

1