Search Results

Search found 2629 results on 106 pages for 'idle timer'.

Page 98/106 | < Previous Page | 94 95 96 97 98 99 100 101 102 103 104 105 | Next Page >

Determining cause of high NFS/IO utilization without iotop

- by Matt

I have a server that is doing an NFSv4 export for user's home directories. There are roughly 25 users (mostly developers/analysts) and about 40 servers mounting the home directory export. Performance is miserable, with users often seeing multi-second lags for simple commands (like ls, or writing a small text file). Sometimes the home directory mount completely hangs for minutes, with users getting "permission denied" errors. The hardware is a Dell R510 with dual E5620 CPUs and 8 GB RAM. There are eight 15k 2.5” 600 GB drives (Seagate ST3600057SS) configured in hardware RAID-6 with a single hot spare. RAID controller is a Dell PERC H700 w/512MB cache (Linux sees this as a LSI MegaSAS 9260). OS is CentOS 5.6, home directory partition is ext3, with options “rw,data=journal,usrquota”. I have the HW RAID configured to present two virtual disks to the OS: /dev/sda for the OS (boot, root and swap partitions), and /dev/sdb for the home directories. What I find curious, and suspicious, is that the sda device often has very high utilization, even though it only contains the OS. I would expect this virtual drive to be idle almost all the time. The system is not swapping, according to "free" and "vmstat". Why would there be major load on this device? Here is a 30-second snapshot from iostat: Time: 09:37:28 AM Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 44.09 0.03 107.76 0.13 607.40 11.27 0.89 8.27 7.27 78.35 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda2 0.00 44.09 0.03 107.76 0.13 607.40 11.27 0.89 8.27 7.27 78.35 sdb 0.00 2616.53 0.67 157.88 2.80 11098.83 140.04 8.57 54.08 4.21 66.68 sdb1 0.00 2616.53 0.67 157.88 2.80 11098.83 140.04 8.57 54.08 4.21 66.68 dm-0 0.00 0.00 0.03 151.82 0.13 607.26 8.00 1.25 8.23 5.16 78.35 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.67 2774.84 2.80 11099.37 8.00 474.30 170.89 0.24 66.84 dm-3 0.00 0.00 0.67 2774.84 2.80 11099.37 8.00 474.30 170.89 0.24 66.84 Looks like iotop is the ideal tool to use to sniff out these kinds of issues. But I'm on CentOS 5.6, which doesn't have a new enough kernel to support that program. I looked at Determining which process is causing heavy disk I/O?, and besides iotop, one of the suggestions said to do a "echo 1 /proc/sys/vm/block_dump". I did that (after directing kernel messages to tempfs). In about 13 minutes I had about 700k reads or writes, roughly half from kjournald and the other half from nfsd: # egrep " kernel: .*(READ|WRITE)" messages | wc -l 768439 # egrep " kernel: kjournald.*(READ|WRITE)" messages | wc -l 403615 # egrep " kernel: nfsd.*(READ|WRITE)" messages | wc -l 314028 For what it's worth, for the last hour, utilization has constantly been over 90% for the home directory drive. My 30-second iostat keeps showing output like this: Time: 09:36:30 PM Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0.00 6.46 0.20 11.33 0.80 71.71 12.58 0.24 20.53 14.37 16.56 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda2 0.00 6.46 0.20 11.33 0.80 71.71 12.58 0.24 20.53 14.37 16.56 sdb 137.29 7.00 549.92 3.80 22817.19 43.19 82.57 3.02 5.45 1.74 96.32 sdb1 137.29 7.00 549.92 3.80 22817.19 43.19 82.57 3.02 5.45 1.74 96.32 dm-0 0.00 0.00 0.20 17.76 0.80 71.04 8.00 0.38 21.21 9.22 16.57 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 687.47 10.80 22817.19 43.19 65.48 4.62 6.61 1.43 99.81 dm-3 0.00 0.00 687.47 10.80 22817.19 43.19 65.48 4.62 6.61 1.43 99.82

Read the article
Cisco ASA 5505 site to site IPSEC VPN won't route from multiple LANs

- by franklundy

Hi I've set up a standard site to site VPN between 2 ASA 5505s (using the wizard in ASDM) and have the VPN working fine for traffic between Site A and Site B on the directly connected LANs. But this VPN is actually to be used for data originating on LAN subnets that are one hop away from the directly connected LANs. So actually there is another router connected to each ASA (LAN side) that then route to two completely different LAN ranges, where the clients and servers reside. At the moment, any traffic that gets to the ASA that has not originated from the directly connected LAN gets sent straight to the default gateway, and not through the VPN. I've tried adding the additional subnets to the "Protected Networks" on the VPN, but that has no effect. I have also tried adding a static route to each ASA trying to point the traffic to the other side, but again this hasn't worked. Here is the config for one of the sites. This works for traffic to/from the 192.168.144.x subnets perfectly. What I need is to be able to route traffic from 10.1.0.0/24 to 10.2.0.0/24 for example. ASA Version 8.0(3) ! hostname Site1 enable password ** encrypted names name 192.168.144.4 Site2 ! interface Vlan1 nameif inside security-level 100 ip address 192.168.144.2 255.255.255.252 ! interface Vlan2 nameif outside security-level 0 ip address 10.78.254.70 255.255.255.252 (this is a private WAN circuit) ! interface Ethernet0/0 switchport access vlan 2 ! interface Ethernet0/1 ! interface Ethernet0/2 ! interface Ethernet0/3 ! interface Ethernet0/4 ! interface Ethernet0/5 ! interface Ethernet0/6 ! interface Ethernet0/7 ! passwd ** encrypted ftp mode passive access-list inside_access_in extended permit ip any any access-list outside_access_in extended permit icmp any any echo-reply access-list outside_1_cryptomap extended permit ip 192.168.144.0 255.255.255.252 Site2 255.255.255.252 access-list inside_nat0_outbound extended permit ip 192.168.144.0 255.255.255.252 Site2 255.255.255.252 pager lines 24 logging enable logging asdm informational mtu inside 1500 mtu outside 1500 icmp unreachable rate-limit 1 burst-size 1 asdm image disk0:/asdm-603.bin no asdm history enable arp timeout 14400 global (outside) 1 interface nat (inside) 0 access-list inside_nat0_outbound nat (inside) 1 0.0.0.0 0.0.0.0 access-group inside_access_in in interface inside access-group outside_access_in in interface outside route outside 0.0.0.0 0.0.0.0 10.78.254.69 1 timeout xlate 3:00:00 timeout conn 1:00:00 half-closed 0:10:00 udp 0:02:00 icmp 0:00:02 timeout sunrpc 0:10:00 h323 0:05:00 h225 1:00:00 mgcp 0:05:00 mgcp-pat 0:05:00 timeout sip 0:30:00 sip_media 0:02:00 sip-invite 0:03:00 sip-disconnect 0:02:00 timeout uauth 0:05:00 absolute dynamic-access-policy-record DfltAccessPolicy aaa authentication ssh console LOCAL http server enable http 0.0.0.0 0.0.0.0 outside http 192.168.1.0 255.255.255.0 inside no snmp-server location no snmp-server contact snmp-server enable traps snmp authentication linkup linkdown coldstart crypto ipsec transform-set ESP-3DES-SHA esp-3des esp-sha-hmac crypto map outside_map 1 match address outside_1_cryptomap crypto map outside_map 1 set pfs crypto map outside_map 1 set peer 10.78.254.66 crypto map outside_map 1 set transform-set ESP-3DES-SHA crypto map outside_map interface outside crypto isakmp enable outside crypto isakmp policy 10 authentication pre-share encryption 3des hash sha group 2 lifetime 86400 no crypto isakmp nat-traversal telnet timeout 5 ssh 0.0.0.0 0.0.0.0 outside ssh timeout 5 console timeout 0 management-access inside threat-detection basic-threat threat-detection statistics port threat-detection statistics protocol threat-detection statistics access-list group-policy DfltGrpPolicy attributes vpn-idle-timeout none username enadmin password * encrypted privilege 15 tunnel-group 10.78.254.66 type ipsec-l2l tunnel-group 10.78.254.66 ipsec-attributes pre-shared-key * ! ! prompt hostname context

Read the article
ESXI guests not using available CPU resources

- by Alan M

I have a VMWare ESXI 5.0.0 (it's a bit old, I know) host with three guest VMs on it. For reasons unknown, the guests will not use much of the availble CPU resources. I have all three guests in a single pool, with the hosts all configured to use the same amount of resource shares, so they're basically 33% each. The three guests are basically identically configured as far as their VM resources go. So the problem is, even when the guests are performing what should be very 'busy' activitites, such as at bootup, the actual host CPU consumed is something tiny, like 33mhz, when seen via vSphere console's "Virtual Machines" tab when viewing properties for the pool. And of course, the performance of the guest VMs is terrible. The host has plenty of CPU to spare. I've tried tinkering with individual guest VM resource settings; cranking up the reservation, etc. No matter. The guests just refuse to make use of the abundant CPU available to them, and insist on using a sliver of the available resources. Any suggestions? Update after reading various comments below Per the suggestions below, I did remove the guests from the application pool; this didn't make any difference. I do understand that the guests are not going to consume resources they do not need. I have tried to do a remote perfmon on the guest which is experiencing long boot times, but I cannot connect to the guest remotely with perfmon (guest is w2k8r2 server). Host graphs for CPU, Mem, Disk are basically flatlining; very little demand. Same is true for the guest stats; while the guest itself seems to be crawling, the guest resource graphing shows very little activity across CPU, Mem, Disk. Host is a Dell PowerEdge 2900, has 2 physical CPU,20gb RAM. (it's a test/dev environment using surplus gear) Guest1 has: VM ver. 7, 2vCPU, 4gb RAM, 140gb storage which lives on a RAID-5 array on the host. Guest2 has: VM ver. 7, 2vCPU, 4gb RAM, 140gb storage which lives on a RAID-5 array on the host. Guest3 has: VM ver. 7, 1vCPU, 2gb RAM, 2tb storage which lives on a RAID-5 ISCSI NAS box Perhaps I am making a false assumption that if a guest has a demand for CPU (e.g. Windows Task Manager shows 100% CPU), the host would supply the guest with more CPU (mem, disk) on demand. Another Update After checking the stats, it would appear that the host is indeed not busy at all, neither is the guest. I believe I have a good idea on the issue, though; a messed-up VMWare Tools install. The guest has VMware Tools on it, but the host says it does not. VMWare Tools refuses to uninstall, refuses to be upgraded, refuses to be recognized. While I cannot say with authority, this would appear to be something worth investigation. I do not know the origin of the guest itself, nor the specifics on the original VMWare Tools install. Following various bits of googling, I did come up with a few suggestions that went nowhere. To that end, I was going to delete this question, but was prompted not to do so since so many folks answered. My suspicion right now is; the problem truly is the guest; the guest is not making a demand on the host, and as a natural result, the host is treating the guest accordingly. My Final Update I am 99% certain the guest VM had something fundamentally wrong with it re VMWare Tools. I created a clone of a different VM with a near-identical OS config, but a properly working install of VMWare tools. The guest runs just great, and takes up it's allotment of resources when it needs to; e.g. it eats up about 850mhz CPU during startup, then ticks down to idle once the guest OS is stable.

Read the article
A faulty Caviar Blue hard drive?

- by Glister

We have a small "homemade" server running fully updated Debian Wheezy (amd64). One hard drive installed: WDC WD6400AAKS. The motherboard is ASUS M4N68T V2. The usual load: CPU: an average of 20% Each week about 50GB of additional space is occupied. About 47GB of uploaded files and 3GB of MySQL data. I'm afraid that the hard drive may be about to fail. I saw Pre-fail on few places when I ran: root@SERVER:/tmp# smartctl -a /dev/sda smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Model Family: Western Digital Caviar Blue Serial ATA Device Model: WDC WD6400AAKS-XXXXXXX Serial Number: WD-XXXXXXXXXXXXXXXXXXX LU WWN Device Id: 5 0014ee XXXXXXXXXXXXX Firmware Version: 01.03B01 User Capacity: 640,135,028,736 bytes [640 GB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Mon Oct 28 18:55:27 2013 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x85) Offline data collection activity was aborted by an interrupting command from host. Auto Offline Data Collection: Enabled. Self-test execution status: ( 247) Self-test routine in progress... 70% of test remaining. Total time to complete Offline data collection: (11580) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 136) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x303f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 157 146 021 Pre-fail Always - 5108 4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2968 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 051 Old_age Always - 0 9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 15445 10 Spin_Retry_Count 0x0032 100 100 051 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 100 051 Old_age Always - 0 12 Power_Cycle_Count 0x0032 098 098 000 Old_age Always - 2950 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 426 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2968 194 Temperature_Celsius 0x0022 111 095 000 Old_age Always - 36 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 160 000 Old_age Always - 21716 200 Multi_Zone_Error_Rate 0x0008 200 200 051 Old_age Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 15444 - Error SMART Read Selective Self-Test Log failed: scsi error aborted command Smartctl: SMART Selective Self Test Log Read Failed root@SERVER:/tmp# In one tutorial I read that the pre-fail is a an indication of coming failure, in another tutorial I read that it is not true. Can you guys help me decode the output of smartctl? It would be also nice to share suggestions what should I do if I want to ensure data integrity (about 50GB of new data each week, up to 2TB for the whole period I'm interested in). Maybe I will go with 2x2TB Caviar Black in RAID4?

Read the article
Optimizing AES modes on Solaris for Intel Westmere

- by danx

Optimizing AES modes on Solaris for Intel Westmere Review AES is a strong method of symmetric (secret-key) encryption. It is a U.S. FIPS-approved cryptographic algorithm (FIPS 197) that operates on 16-byte blocks. AES has been available since 2001 and is widely used. However, AES by itself has a weakness. AES encryption isn't usually used by itself because identical blocks of plaintext are always encrypted into identical blocks of ciphertext. This encryption can be easily attacked with "dictionaries" of common blocks of text and allows one to more-easily discern the content of the unknown cryptotext. This mode of encryption is called "Electronic Code Book" (ECB), because one in theory can keep a "code book" of all known cryptotext and plaintext results to cipher and decipher AES. In practice, a complete "code book" is not practical, even in electronic form, but large dictionaries of common plaintext blocks is still possible. Here's a diagram of encrypting input data using AES ECB mode: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ AESKey-->(AES Encryption) AESKey-->(AES Encryption) | | | | \/ \/ CipherTextOutput CipherTextOutput Block 1 Block 2 What's the solution to the same cleartext input producing the same ciphertext output? The solution is to further process the encrypted or decrypted text in such a way that the same text produces different output. This usually involves an Initialization Vector (IV) and XORing the decrypted or encrypted text. As an example, I'll illustrate CBC mode encryption: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ IV >----->(XOR) +------------->(XOR) +---> . . . . | | | | | | | | \/ | \/ | AESKey-->(AES Encryption) | AESKey-->(AES Encryption) | | | | | | | | | \/ | \/ | CipherTextOutput ------+ CipherTextOutput -------+ Block 1 Block 2 The steps for CBC encryption are: Start with a 16-byte Initialization Vector (IV), choosen randomly. XOR the IV with the first block of input plaintext Encrypt the result with AES using a user-provided key. The result is the first 16-bytes of output cryptotext. Use the cryptotext (instead of the IV) of the previous block to XOR with the next input block of plaintext Another mode besides CBC is Counter Mode (CTR). As with CBC mode, it also starts with a 16-byte IV. However, for subsequent blocks, the IV is just incremented by one. Also, the IV ix XORed with the AES encryption result (not the plain text input). Here's an illustration: Block 1 Block 2 PlainTextInput PlainTextInput | | | | \/ \/ AESKey-->(AES Encryption) AESKey-->(AES Encryption) | | | | \/ \/ IV >----->(XOR) IV + 1 >---->(XOR) IV + 2 ---> . . . . | | | | \/ \/ CipherTextOutput CipherTextOutput Block 1 Block 2 Optimization Which of these modes can be parallelized? ECB encryption/decryption can be parallelized because it does more than plain AES encryption and decryption, as mentioned above. CBC encryption can't be parallelized because it depends on the output of the previous block. However, CBC decryption can be parallelized because all the encrypted blocks are known at the beginning. CTR encryption and decryption can be parallelized because the input to each block is known--it's just the IV incremented by one for each subsequent block. So, in summary, for ECB, CBC, and CTR modes, encryption and decryption can be parallelized with the exception of CBC encryption. How do we parallelize encryption? By interleaving. Usually when reading and writing data there are pipeline "stalls" (idle processor cycles) that result from waiting for memory to be loaded or stored to or from CPU registers. Since the software is written to encrypt/decrypt the next data block where pipeline stalls usually occurs, we can avoid stalls and crypt with fewer cycles. This software processes 4 blocks at a time, which ensures virtually no waiting ("stalling") for reading or writing data in memory. Other Optimizations Besides interleaving, other optimizations performed are Loading the entire key schedule into the 128-bit %xmm registers. This is done once for per 4-block of data (since 4 blocks of data is processed, when present). The following is loaded: the entire "key schedule" (user input key preprocessed for encryption and decryption). This takes 11, 13, or 15 registers, for AES-128, AES-192, and AES-256, respectively The input data is loaded into another %xmm register The same register contains the output result after encrypting/decrypting Using SSSE 4 instructions (AESNI). Besides the aesenc, aesenclast, aesdec, aesdeclast, aeskeygenassist, and aesimc AESNI instructions, Intel has several other instructions that operate on the 128-bit %xmm registers. Some common instructions for encryption are: pxor exclusive or (very useful), movdqu load/store a %xmm register from/to memory, pshufb shuffle bytes for byte swapping, pclmulqdq carry-less multiply for GCM mode Combining AES encryption/decryption with CBC or CTR modes processing. Instead of loading input data twice (once for AES encryption/decryption, and again for modes (CTR or CBC, for example) processing, the input data is loaded once as both AES and modes operations occur at in the same function Performance Everyone likes pretty color charts, so here they are. I ran these on Solaris 11 running on a Piketon Platform system with a 4-core Intel Clarkdale processor @3.20GHz. Clarkdale which is part of the Westmere processor architecture family. The "before" case is Solaris 11, unmodified. Keep in mind that the "before" case already has been optimized with hand-coded Intel AESNI assembly. The "after" case has combined AES-NI and mode instructions, interleaved 4 blocks at-a-time. « For the first table, lower is better (milliseconds). The first table shows the performance improvement using the Solaris encrypt(1) and decrypt(1) CLI commands. I encrypted and decrypted a 1/2 GByte file on /tmp (swap tmpfs). Encryption improved by about 40% and decryption improved by about 80%. AES-128 is slighty faster than AES-256, as expected. The second table shows more detail timings for CBC, CTR, and ECB modes for the 3 AES key sizes and different data lengths. » The results shown are the percentage improvement as shown by an internal PKCS#11 microbenchmark. And keep in mind the previous baseline code already had optimized AESNI assembly! The keysize (AES-128, 192, or 256) makes little difference in relative percentage improvement (although, of course, AES-128 is faster than AES-256). Larger data sizes show better improvement than 128-byte data. Availability This software is in Solaris 11 FCS. It is available in the 64-bit libcrypto library and the "aes" Solaris kernel module. You must be running hardware that supports AESNI (for example, Intel Westmere and Sandy Bridge, microprocessor architectures). The easiest way to determine if AES-NI is available is with the isainfo(1) command. For example, $ isainfo -v 64-bit amd64 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu No special configuration or setup is needed to take advantage of this software. Solaris libraries and kernel automatically determine if it's running on AESNI-capable machines and execute the correctly-tuned software for the current microprocessor. Summary Maximum throughput of AES cipher modes can be achieved by combining AES encryption with modes processing, interleaving encryption of 4 blocks at a time, and using Intel's wide 128-bit %xmm registers and instructions. References "Block cipher modes of operation", Wikipedia Good overview of AES modes (ECB, CBC, CTR, etc.) "Advanced Encryption Standard", Wikipedia "Current Modes" describes NIST-approved block cipher modes (ECB,CBC, CFB, OFB, CCM, GCM)

Read the article
Thread placement policies on NUMA systems - update

- by Dave

In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded node NUMA node (the node with lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect. On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph. I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creating between the processes, so I was relatively sure the effect didn't related to thread creation time, but rather that placement was a function of process membership. I this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different that what we've seen on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior. There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. Lgrp_expand_proc_thresh controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more. To recap, the 'new' policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code. Remarks: It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could the topic of a future blog entry. The scheduler is work-conserving. The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, following by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows. [6:6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly. Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy. Solaris 11 introduces the Power Aware Dispatcher (PAD) which packs threads instead of spreading them out in an attempt to be able to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management polices constitute yet another confounding factor with respect to scheduling and dispatching. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand if your threads in your process communicate rarely, then it's possible the new packing policy might result on contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to optimize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could augment the CPU with performance counters for misses where a thread evicts a line it installed vs misses where a thread displaces a line installed by some other thread.)

Read the article
Users loggin to 3Com switches authenticated by radius not getting admin priv and no access available

- by 3D1L

Hi, Following the setup that I have for my Cisco devices, I got some basic level of functionality authenticating users that loggin to 3Com switches authenticated against a RADIUS server. Problem is that I can not get the user to obtain admin privileges. I'm using Microsoft's IAS service. According to 3Com documentation when configuring the access policy on IAS the value of 010600000003 have to be used to specify admin access level. That value have to be input in the Dial-in profile section: 010600000003 - indicates admin privileges 010600000002 - manager 010600000001 - monitor 010600000000 - visitor Here is the configuration on the switch: radius scheme system server-type standard primary authentication XXX.XXX.XXX.XXX accounting optional key authentication XXXXXX key accounting XXXXXX domain system scheme radius-scheme system local-user admin service-type ssh telnet terminal level 3 local-user manager service-type ssh telnet terminal level 2 local-user monitor service-type ssh telnet terminal level 1 The configuration is working with the IAS server because I can check user login events with the Eventviewer tool. Here is the output of the DISPLAY RADIUS command at the switch: [4500]disp radius SchemeName =system Index=0 Type=standard Primary Auth IP =XXX.XXX.XXX.XXX Port=1645 State=active Primary Acct IP =127.0.0.1 Port=1646 State=active Second Auth IP =0.0.0.0 Port=1812 State=block Second Acct IP =0.0.0.0 Port=1813 State=block Auth Server Encryption Key= XXXXXX Acct Server Encryption Key= XXXXXX Accounting method = optional TimeOutValue(in second)=3 RetryTimes=3 RealtimeACCT(in minute)=12 Permitted send realtime PKT failed counts =5 Retry sending times of noresponse acct-stop-PKT =500 Quiet-interval(min) =5 Username format =without-domain Data flow unit =Byte Packet unit =1 Total 1 RADIUS scheme(s). 1 listed Here is the output of the DISPLAY DOMAIN and DISPLAY CONNECTION commands after users log into the switch: [4500]display domain 0 Domain = system State = Active RADIUS Scheme = system Access-limit = Disable Domain User Template: Idle-cut = Disable Self-service = Disable Messenger Time = Disable Default Domain Name: system Total 1 domain(s).1 listed. [4500]display connection Index=0 ,Username=admin@system IP=0.0.0.0 Index=2 ,Username=user@system IP=xxx.xxx.xxx.xxx On Unit 1:Total 2 connections matched, 2 listed. Total 2 connections matched, 2 listed. [4500] Here is the DISP RADIUS STATISTICS: [4500] %Apr 2 00:23:39:957 2000 4500 SHELL/5/LOGIN:- 1 - ecajigas(xxx.xxx.xxx.xxx) in un it1 logindisp radius stat state statistic(total=1048): DEAD=1046 AuthProc=0 AuthSucc=0 AcctStart=0 RLTSend=0 RLTWait=2 AcctStop=0 OnLine=2 Stop=0 StateErr=0 Received and Sent packets statistic: Unit 1........................................ Sent PKT total :4 Received PKT total:1 Resend Times Resend total 1 1 2 1 Total 2 RADIUS received packets statistic: Code= 2,Num=1 ,Err=0 Code= 3,Num=0 ,Err=0 Code= 5,Num=0 ,Err=0 Code=11,Num=0 ,Err=0 Running statistic: RADIUS received messages statistic: Normal auth request , Num=1 , Err=0 , Succ=1 EAP auth request , Num=0 , Err=0 , Succ=0 Account request , Num=1 , Err=0 , Succ=1 Account off request , Num=0 , Err=0 , Succ=0 PKT auth timeout , Num=0 , Err=0 , Succ=0 PKT acct_timeout , Num=3 , Err=1 , Succ=2 Realtime Account timer , Num=0 , Err=0 , Succ=0 PKT response , Num=1 , Err=0 , Succ=1 EAP reauth_request , Num=0 , Err=0 , Succ=0 PORTAL access , Num=0 , Err=0 , Succ=0 Update ack , Num=0 , Err=0 , Succ=0 PORTAL access ack , Num=0 , Err=0 , Succ=0 Session ctrl pkt , Num=0 , Err=0 , Succ=0 RADIUS sent messages statistic: Auth accept , Num=0 Auth reject , Num=0 EAP auth replying , Num=0 Account success , Num=0 Account failure , Num=0 Cut req , Num=0 RecError_MSG_sum:0 SndMSG_Fail_sum :0 Timer_Err :0 Alloc_Mem_Err :0 State Mismatch :0 Other_Error :0 No-response-acct-stop packet =0 Discarded No-response-acct-stop packet for buffer overflow =0 The other problem is that when the RADIUS server is not available I can not log in to the switch. The switch have 3 local accounts but none of them works. How can I specify the switch to use the local accounts in case that the RADIUS service is not available?

Read the article
Performance Tuning a High-Load Apache Server

- by futureal

I am looking to understand some server performance problems I am seeing with a (for us) heavily loaded web server. The environment is as follows: Debian Lenny (all stable packages + patched to security updates) Apache 2.2.9 PHP 5.2.6 Amazon EC2 large instance The behavior we're seeing is that the web typically feels responsive, but with a slight delay to begin handling a request -- sometimes a fraction of a second, sometimes 2-3 seconds in our peak usage times. The actual load on the server is being reported as very high -- often 10.xx or 20.xx as reported by top. Further, running other things on the server during these times (even vi) is very slow, so the load is definitely up there. Oddly enough Apache remains very responsive, other than that initial delay. We have Apache configured as follows, using prefork: StartServers 5 MinSpareServers 5 MaxSpareServers 10 MaxClients 150 MaxRequestsPerChild 0 And KeepAlive as: KeepAlive On MaxKeepAliveRequests 100 KeepAliveTimeout 5 Looking at the server-status page, even at these times of heavy load we are rarely hitting the client cap, usually serving between 80-100 requests and many of those in the keepalive state. That tells me to rule out the initial request slowness as "waiting for a handler" but I may be wrong. Amazon's CloudWatch monitoring tells me that even when our OS is reporting a load of 15, our instance CPU utilization is between 75-80%. Example output from top: top - 15:47:06 up 31 days, 1:38, 8 users, load average: 11.46, 7.10, 6.56 Tasks: 221 total, 28 running, 193 sleeping, 0 stopped, 0 zombie Cpu(s): 66.9%us, 22.1%sy, 0.0%ni, 2.6%id, 3.1%wa, 0.0%hi, 0.7%si, 4.5%st Mem: 7871900k total, 7850624k used, 21276k free, 68728k buffers Swap: 0k total, 0k used, 0k free, 3750664k cached The majority of the processes look like: 24720 www-data 15 0 202m 26m 4412 S 9 0.3 0:02.97 apache2 24530 www-data 15 0 212m 35m 4544 S 7 0.5 0:03.05 apache2 24846 www-data 15 0 209m 33m 4420 S 7 0.4 0:01.03 apache2 24083 www-data 15 0 211m 35m 4484 S 7 0.5 0:07.14 apache2 24615 www-data 15 0 212m 35m 4404 S 7 0.5 0:02.89 apache2 Example output from vmstat at the same time as the above: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 8 0 0 215084 68908 3774864 0 0 154 228 5 7 32 12 42 9 6 21 0 198948 68936 3775740 0 0 676 2363 4022 1047 56 16 9 15 23 0 0 169460 68936 3776356 0 0 432 1372 3762 835 76 21 0 0 23 1 0 140412 68936 3776648 0 0 280 0 3157 827 70 25 0 0 20 1 0 115892 68936 3776792 0 0 188 8 2802 532 68 24 0 0 6 1 0 133368 68936 3777780 0 0 752 71 3501 878 67 29 0 1 0 1 0 146656 68944 3778064 0 0 308 2052 3312 850 38 17 19 24 2 0 0 202104 68952 3778140 0 0 28 90 2617 700 44 13 33 5 9 0 0 188960 68956 3778200 0 0 8 0 2226 475 59 17 6 2 3 0 0 166364 68956 3778252 0 0 0 21 2288 386 65 19 1 0 And finally, output from Apache's server-status: Server uptime: 31 days 2 hours 18 minutes 31 seconds Total accesses: 60102946 - Total Traffic: 974.5 GB CPU Usage: u209.62 s75.19 cu0 cs0 - .0106% CPU load 22.4 requests/sec - 380.3 kB/second - 17.0 kB/request 107 requests currently being processed, 6 idle workers C.KKKW..KWWKKWKW.KKKCKK..KKK.KKKK.KK._WK.K.K.KKKKK.K.R.KK..C.C.K K.C.K..WK_K..KKW_CK.WK..W.KKKWKCKCKW.W_KKKKK.KKWKKKW._KKK.CKK... KK_KWKKKWKCKCWKK.KKKCK.......................................... ................................................................ From my limited experience I draw the following conclusions/questions: We may be allowing far too many KeepAlive requests I do see some time spent waiting for IO in the vmstat although not consistently and not a lot (I think?) so I am not sure this is a big concern or not, I am less experienced with vmstat Also in vmstat, I see in some iterations a number of processes waiting to be served, which is what I am attributing the initial page load delay on our web server to, possibly erroneously We serve a mixture of static content (75% or higher) and script content, and the script content is often fairly processor intensive, so finding the right balance between the two is important; long term we want to move statics elsewhere to optimize both servers but our software is not ready for that today I am happy to provide additional information if anybody has any ideas, the other note is that this is a high-availability production installation so I am wary of making tweak after tweak, and is why I haven't played with things like the KeepAlive value myself yet.

Read the article
DNS lookups failing somewhere between firewall and router

- by TessellatingHeckler

we have a setup of ADSL line - Cisco 837 ADSL router - Zyxel ZyWall 35 firewall/NAT - Switch == Intel load balanced NICS in a server. It has been fine for years, suddenly DNS resolution stopped working on the server. No changes that I know of, so I can't work backwards from there. It was configured with the ISP's DNS servers, neither network device does DNS relaying. Wireshark shows the request go out but nothing comes back. The server networking stack seems OK though, because if we query an internal DNS server on a remote site, that works. I can logon to the Cisco, and DNS resolves OK from the command line. I can logon to the ZyWall, and DNS does not resolve from the command line. So the problem seems to be the firewall, patch cable or router, yes? On the router: interface Ethernet0 ip address aaa.bbb.ccc.ddd 255.255.255.ddd ip tcp adjust-mss 1450 hold-queue 100 out On the firewall: DNS server set to 8.8.8.8 (Google's), DNS traffic allowed LAN-WAN. What else should I look for? Update: Following This guide I've got traffic logging on the Cisco. I have also got access to a public DNS server which I can run tcpdump on to see things from the other side. And as per the below comments, I've tested with Dig and see that DNS over TCP works, and over UDP does not. Currently: DNS request from the server using TCP shows up in the firewall log, and in the Cisco log, and in tcpdump on the DNS server, the answer comes back, it works fine. DNS request from the server using UDP shows up in the firewall log, and in the Cisco log, does NOT show in tcpdump on the DNS server, times out. DNS request from the cisco (using UDP) does show up in tcpdump on the DNS server, answer received, works fine. Ping requests from the server and the cisco to the DNS server show up in tcpdump on the DNS server. DNS request from the server using UDP does show up on the firewall. Summary: TCP seems fine throughought. UDP works over the ADSL and to the Cisco, and it works from the server to the Cisco, but it doesn't cross the Cisco properly, it seems. I did see the Cisco showing as connected at 10Mb/full-duplex internally, and the firewall showing as 100Mb/full-duplex externally. I have forced the firewall to 10Mb and rebooted both devices. That seemed to help get UDP traffic (server-firewall-cisco) instead of (server-firewall), but did not fix it. Update: Sanitized Cisco config: version 12.2 no service pad service timestamps debug datetime msec service timestamps log datetime msec service password-encryption ! hostname cisco ! logging queue-limit 100 enable secret 5 {password} enable password 7 {password} ! ip subnet-zero ip domain name example.org ip name-server {nameserver_IP} ! ! ip audit notify log ip audit po max-events 100 no ftp-server write-enable ! interface Ethernet0 ip address {Inside_public_IP} 255.255.255.248 ip tcp adjust-mss 1460 hold-queue 100 out ! interface ATM0 no ip address no atm ilmi-keepalive pvc 0/38 encapsulation aal5mux ppp dialer dialer pool-member 1 ! dsl operating-mode auto ! interface Dialer1 ip unnumbered Ethernet0 encapsulation ppp dialer pool 1 dialer idle-timeout 0 dialer persistent no cdp enable ppp chap hostname {ADSL_Username} ppp chap password 7 {ADSL_Password} ! ip classless ip route 0.0.0.0 0.0.0.0 Dialer1 no ip http server no ip http secure-server ! access-list 23 permit {IP} dialer-list 1 protocol ip permit no cdp run snmp-server enable traps tty ! {con, vty} end

Read the article
smartctl -t long isn't finishing

- by xenoterracide

I been running smartctl -t long on a drive for about 2 days now and it seems to be stalled at 10%. short and conveyance both passed. I have to send 1 of 2 drives purchased back I found badblocks with badblocks (none on this drive and I'ts made over a pass already). I'm just wondering if I should be concerned about this. smartctl 5.39.1 2010-01-28 r3054 [x86_64-unknown-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Device Model: WDC WD10EARS-00Y5B1 Serial Number: WD-WMAV51582123 Firmware Version: 80.00A80 User Capacity: 1,000,204,886,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Mon May 10 22:19:52 2010 EDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 241) Self-test routine in progress... 10% of test remaining. Total time to complete Offline data collection: (20100) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 231) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x3031) SCT Status supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2 3 Spin_Up_Time 0x0027 131 131 021 Pre-fail Always - 6408 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 12 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 148 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 7 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 174 194 Temperature_Celsius 0x0022 106 102 000 Old_age Always - 41 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Conveyance offline Completed without error 00% 99 - # 2 Extended offline Interrupted (host reset) 10% 30 - # 3 Short offline Completed without error 00% 0 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.

Read the article
Bizarre and very specific Internet connection loss

- by Synetech

Yesterday (Friday, September 21, 2012), my Internet connection started acting up. After some testing, I confirmed a very specific and baffling set of symptoms: Internet connection goes away every 25-35 minutes (I did not confirm the exact interval, but it seems to be about 30 mins.) Only some protocols are affected; HTTP*, P2P, etc. stop working; FTP, etc. continue to work When it’s stopped, cannot even ping router or cable-modem IPs or view their firmware pages Domain-names and IPs are irrelevant (for protocols that stop working, neither work, for those that still work, both work) Resetting router fixes it for another 30 minutes Keeping the connection idle or active doesn’t seem to make a difference (nor the bandwidth usage in that period) Connecting directly to cable-modem allows it to work indefinitely Disconnecting the router from the cable-modem works indefinitely (no Internet connection obviously, but can still access router IP and firmware page) Connecting the router to the cable-modem, but putting the modem on standby also works indefinitely Same problem with both a wireless laptop and wired (on any port) desktop (both Windows 7; will try to test Windows XP when possible) Nothing had changed in the days leading up to the issue. No modifications to the networking configuration or the router; there were not even any Windows updates except for an MSSE definition update. Waiting does not fix it, nor does any amount of fiddling with anything; only resetting the router fixes it for 30 minutes (resetting the cable-modem doesn't work either) I tried cleaning the pins in the router’s plugs, but that didn’t help, which was not really a surprise since I was not getting a lost connection error. Obviously my first thought was that the router was having a problem, and this is borne out by some tests. The problem is that when it drops, it is not a full drop since I can still do things like ftp ftp.mcafee.com and such which means that the connection and DNS are still working. Moreover, if it were the router, then why does it stay alive indefinitely when not connected to the cable-modem (i.e., no outside influence)? The problem doesn't seem to be either the cable-modem nor the router, but rather an interaction between the two, like something from the outside (port scan? hacker? ISP?) that is triggering a problem in the router. I see that there have been a couple of vulnerabilities for the DI-524, but those were a while back and should be fixed since I have the last firmware for it. I don’t think it’s my ISP (Rogers) since I have been using the router for several years without problem and can connect indefinitely when bypassing it. But I can’t rule them out since that is one of the only possible things that could have suddenly changed. Does anybody have any ideas of explanations, fixed, or tests? (I note that when I opened the router, I heard a very high-pitched noise from somewhere near the capacitors/ferrite ring which I don’t think I heard the last time I opened it a few years ago, but then if it were that, then why would it affect only a very small, specific set of functions?)

Read the article
why nginx rewrite post request from /login to //login?

- by jiangchengwu

There is a if statement, which will rewrite url when the client is Android. Everything ok. But, something got strange. Nginx will write post request /login to //login, even if the block of if statement is bank. So I got a 404 page. As the jetty server only accept /login request. Server conf: location / { proxy_pass http://localhost:8785/; proxy_set_header Host $http_host; proxy_set_header Remote-Addr $http_remote_addr; proxy_set_header X-Real-IP $remote_addr; if ( $http_user_agent ~ Android ){ # rewrite something, been commented } } Debug info, origin log https://gist.github.com/3799021 ... 2012/09/28 16:29:49 [debug] 26416#0: *1 http script regex: "Android" 2012/09/28 16:29:49 [notice] 26416#0: *1 "Android" matches "Android/1.0", client: 106.187.97.22, server: ireedr.com, request: "POST /login HTTP/1.1", host: "ireedr.com" ... 2012/09/28 16:29:49 [debug] 26416#0: *1 http proxy header: "POST //login HTTP/1.0 Host: ireedr.com X-Real-IP: 106.187.97.22 Connection: close Accept-Encoding: identity, deflate, compress, gzip Accept: */* User-Agent: Android/1.0 " ... 2012/09/28 16:29:49 [debug] 26416#0: *1 HTTP/1.1 404 Not Found Server: nginx/1.2.1 Date: Fri, 28 Sep 2012 08:29:49 GMT Content-Type: text/html;charset=ISO-8859-1 Transfer-Encoding: chunked Connection: keep-alive Cache-Control: must-revalidate,no-cache,no-store Content-Encoding: gzip ... Only when I commented the block in the configration file: location / { proxy_pass http://localhost:8785/; proxy_set_header Host $http_host; proxy_set_header Remote-Addr $http_remote_addr; proxy_set_header X-Real-IP $remote_addr; #if ( $http_user_agent ~ Android ){ # #} } The client can get an 200 response. Debug info, origin log https://gist.github.com/3799023 ... "POST /login HTTP/1.0 Host: ireedr.com X-Real-IP: 106.187.97.22 Connection: close Accept-Encoding: identity, deflate, compress, gzip Accept: */* User-Agent: Android/1.0 " ... 2012/09/28 16:27:19 [debug] 26319#0: *1 HTTP/1.1 200 OK Server: nginx/1.2.1 Date: Fri, 28 Sep 2012 08:27:19 GMT Content-Type: application/json;charset=UTF-8 Content-Length: 17 Connection: keep-alive ... As the log: 2012/09/28 16:29:49 [notice] 26416#0: *1 "Android" matches "Android/1.0", client: 106.187.97.22, server: ireedr.com, request: "POST /login HTTP/1.1", host: "ireedr.com" 2012/09/28 16:29:49 [debug] 26416#0: *1 http script if 2012/09/28 16:29:49 [debug] 26416#0: *1 post rewrite phase: 4 2012/09/28 16:29:49 [debug] 26416#0: *1 generic phase: 5 2012/09/28 16:29:49 [debug] 26416#0: *1 generic phase: 6 2012/09/28 16:29:49 [debug] 26416#0: *1 generic phase: 7 2012/09/28 16:29:49 [debug] 26416#0: *1 access phase: 8 2012/09/28 16:29:49 [debug] 26416#0: *1 access phase: 9 2012/09/28 16:29:49 [debug] 26416#0: *1 access phase: 10 2012/09/28 16:29:49 [debug] 26416#0: *1 post access phase: 11 2012/09/28 16:29:49 [debug] 26416#0: *1 try files phase: 12 2012/09/28 16:29:49 [debug] 26416#0: *1 posix_memalign: 0000000001E798F0:4096 @16 2012/09/28 16:29:49 [debug] 26416#0: *1 http init upstream, client timer: 0 2012/09/28 16:29:49 [debug] 26416#0: *1 epoll add event: fd:13 op:3 ev:80000005 2012/09/28 16:29:49 [debug] 26416#0: *1 http script copy: "Host: " 2012/09/28 16:29:49 [debug] 26416#0: *1 http script var: "ireedr.com" 2012/09/28 16:29:49 [debug] 26416#0: *1 http script copy: " " 2012/09/28 16:29:49 [debug] 26416#0: *1 http script copy: "" 2012/09/28 16:29:49 [debug] 26416#0: *1 http script copy: "" 2012/09/28 16:29:49 [debug] 26416#0: *1 http script copy: "X-Real-IP: " 2012/09/28 16:29:49 [debug] 26416#0: *1 http script var: "106.187.97.22" 2012/09/28 16:29:49 [debug] 26416#0: *1 http script copy: " " 2012/09/28 16:29:49 [debug] 26416#0: *1 http script copy: "Connection: close " 2012/09/28 16:29:49 [debug] 26416#0: *1 http proxy header: "Accept-Encoding: identity, deflate, compress, gzip" 2012/09/28 16:29:49 [debug] 26416#0: *1 http proxy header: "Accept: */*" 2012/09/28 16:29:49 [debug] 26416#0: *1 http proxy header: "User-Agent: Android/1.0" 2012/09/28 16:29:49 [debug] 26416#0: *1 http proxy header: "POST //login HTTP/1.0 Host: ireedr.com X-Real-IP: 106.187.97.22 Connection: close Accept-Encoding: identity, deflate, compress, gzip Accept: */* User-Agent: Android/1.0 " ... Maybe post rewrite phase had rewrite the request. Anybody can help me to solve this problem or know why nginx do that ? Much appreciated.

Read the article
IBM Server Config questions

- by Joel Coel

I have a few questions on a potential server setup. First, the situation: Last year we bought an IBM x3500 server with 2 Xeon E5410's, 9GB RAM, 6 HDDs. The original intent for this server was to replace the old exchange e-mail server. It was brought in, set up, and then shortly after we switched to gmail. Shortly after that my predecessor left for greener pastures, and finally I was hired. So this nice server is now sitting (mostly) idle. This year I have budget again for one server, and of course I want to put this other server to work. I'm thinking about the best use for the two server, and I think I finally have a plan for what I want to do with them. The idea is to use the two newer servers as a pair of VM hosts. I will set up each server with the same 8 VMs, but divide up the load so that only 4 are active per physical host. That means I've normally got 2GB RAM + 2 cores per host. I've done some load testing to pick out what servers to convert to virtual, and chose them so that each host will be capable of handling the entire set of 8 by itself in a pinch with 1 core and 1GB RAM, but would be very taxed to do so. This should take our data center from 13 total servers down to 7. The "servers" I'm replacing are mostly re-purposed desktops, so I'm more than happy to be able to do this. Now it's time to go shopping for the new server. I'd like my two hosts to match as closely as possible, and so I'm looking at IBM again. It also helps that we have some educational matching grant money from IBM that I need to use to help pay for this system (we're a small private college). So finally, (if you're not bored already), we come to my questions: Am I missing anything big or obvious in this plan? I'm a little worried about network performance since the VM hosts will only have 4 nics total where 8 used to be, but I don't think it will be a problem. Is there anything else like this I might be overlooking? Am I making it even too complicated? IBM no longer has a good analog to last year's server. If I want to match the performance (8 cores, 9GB RAM, 1333mhz front side bus, 6 spindles), I have to spend quite a bit more than we paid last year: $2K+, or nearly a 33% cost increase. This only brings a marginal increase in performance. The alternative to stay in budget is to take a hit on the fsb down to 800mhz or cut the number of cores in half, neither of which is attractive. The main cost culprit is the processor. IBM no longer offers the E5410. It's listed as a part, but not available in any of the server configs I've looked at. I'm considering getting the cheapest 800mhz fsb dual core xeon I can configure and then buying the E5410's separately. That's still an extra $350 I wasn't counting on, but that's better than $2K. I want to know what others think of this - will it work or will I end up with the wrong motherboard or some other issue? Am I missing a simple way to configure the server I really want? I don't really intend to do this, but one option to save some money back is to omit the redundant power supply. Since my redundancy plan for these system is to switch over to a completely different host, the extra power isn't fully necessary. That said, it's still very helpful to avoid even short downtimes while I switch over VMs. Has anyone done this?

Read the article
iPhone doesn't save password for Cisco IPsec VPN using racoon daemon

- by dsx

On my Debian server I had set up racoon daemon (1:0.8.0-14) for Cisco IPSec VPN using certificates for authentication. My racoon.conf is like following: log info; path certificate "/etc/racoon/certs"; listen { isakmp $SERVER_IP_HERE [500]; isakmp_natt $SERVER_IP_HERE [4500]; } timer { natt_keepalive 10 sec; } remote anonymous { lifetime time 24 hours; proposal_check obey; passive on; exchange_mode aggressive,main; my_identifier asn1dn; peers_identifier asn1dn; verify_identifier on; certificate_type x509 "cert_name.crt" "key_name.key"; ca_type x509 "ca.crt"; mode_cfg on; verify_cert on; ike_frag on; generate_policy on; nat_traversal on; dpd_delay 20; proposal { encryption_algorithm aes; hash_algorithm sha1; authentication_method xauth_rsa_server; dh_group modp1024; } } mode_cfg { conf_source local; auth_source system; auth_throttle 3; save_passwd on; dns4 8.8.8.8; network4 $SOME_LAN_SUBNET; netmask4 255.255.255.0; pool_size 128; } sainfo anonymous { pfs_group 2; lifetime time 24 hour; encryption_algorithm aes; authentication_algorithm hmac_sha1; compression_algorithm deflate; } I'm not using PSK authentication here. Using iPhone configuration utility I had uploaded all required certificates to iPhone and set up VPN on demand. Everything works just fine except one thing: iPhone refuses to save VPN password regardless of save_passwd on; in racoon configuration file. As opposed to iPhone behaviour, Mac OS X 10.8.2 have no problems saving password. I had examined iPhone log file and found following: racoon[151] <Notice>: >>>>> phase change status = phase 1 established configd[50] <Notice>: IPSec Network Configuration started. configd[50] <Notice>: IPSec Network Configuration: INTERNAL-IP4-ADDRESS = $SUBNET_IP_HERE. configd[50] <Notice>: IPSec Network Configuration: INTERNAL-IP4-MASK = 255.255.255.0. configd[50] <Notice>: IPSec Network Configuration: SAVE-PASSWORD = 0. configd[50] <Notice>: IPSec Network Configuration: INTERNAL-IP4-DNS = 8.8.8.8. configd[50] <Notice>: IPSec Network Configuration: BANNER = . configd[50] <Notice>: IPSec Network Configuration: DEF-DOMAIN = . configd[50] <Notice>: IPSec Network Configuration: DEFAULT-ROUTE = local-address $SUBNET_IP_HERE/32. configd[50] <Notice>: IPSec Phase2 starting. configd[50] <Notice>: IPSec Network Configuration established. configd[50] <Notice>: IPSec Phase1 established. Please note IPSec Network Configuration message containing SAVE-PASSWORD = 0.. Is it a bug in racoon daemon on server, or iPhone (iOS version is 6.0.1 (10A523)) or it is me missing something? How to make iPhone remember IPSec VPN password?

Read the article
Slow disk transfer rate

- by Nooklez

I have problem with slow disk transfer rate. It's static files server for our website. I was making backup of data and noticed that tar is very slow. So I did hdparm -t and... hdparm -t /dev/sda3 /dev/sda3: Timing buffered disk reads: 6 MB in 4.70 seconds = 1.28 MB/sec It's low traffic hour now on our site, so huge I/O traffic is not a reason (iotop show less than 1 MB/s). It's RAID10 setup (2x2 SATA drives). Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy ------------------------------------------------------------------------------ u0 RAID-10 OK - - 64K 1396.96 W ON VPort Status Unit Size Type Phy Encl-Slot Model ------------------------------------------------------------------------------ p0 OK u0 698.63 GB SATA 0 - WDC WD7500AADS-00M2 p1 OK u0 698.63 GB SATA 1 - WDC WD7500AADS-00M2 p2 OK u0 698.63 GB SATA 2 - WDC WD7500AADS-00M2 p3 OK u0 698.63 GB SATA 3 - WDC WD7500AADS-00M2 We have recently changed almost all components of server (excluding 3ware controller + disks). And I think problems started since then. Can it be configuration problem or hardware? EDIT: I found something like that in dmesg [166843.625843] irq 16: nobody cared (try booting with the "irqpoll" option) [166843.625846] Pid: 0, comm: swapper Not tainted 3.1.5-gentoo #3 [166843.625847] Call Trace: [166843.625848] <IRQ> [<ffffffff810859d5>] __report_bad_irq+0x35/0xc1 [166843.625856] [<ffffffff81085cec>] note_interrupt+0x165/0x1e1 [166843.625859] [<ffffffff8108445f>] handle_irq_event_percpu+0x16f/0x187 [166843.625861] [<ffffffff810844a9>] handle_irq_event+0x32/0x51 [166843.625863] [<ffffffff8108640b>] handle_fasteoi_irq+0x75/0x99 [166843.625866] [<ffffffff810039d7>] handle_irq+0x83/0x8b [166843.625868] [<ffffffff810036ad>] do_IRQ+0x48/0xa0 [166843.625871] [<ffffffff8155082b>] common_interrupt+0x6b/0x6b [166843.625872] <EOI> [<ffffffff812981e8>] ? acpi_safe_halt+0x22/0x35 [166843.625877] [<ffffffff812981e2>] ? acpi_safe_halt+0x1c/0x35 [166843.625879] [<ffffffff81298216>] acpi_idle_do_entry+0x1b/0x2b [166843.625881] [<ffffffff81298276>] acpi_idle_enter_c1+0x50/0x99 [166843.625884] [<ffffffff813b792a>] cpuidle_idle_call+0xed/0x171 [166843.625886] [<ffffffff81001257>] cpu_idle+0x55/0x81 [166843.625888] [<ffffffff81532a69>] rest_init+0x6d/0x6f [166843.625891] [<ffffffff81aa1aca>] start_kernel+0x329/0x334 [166843.625893] [<ffffffff81aa12a6>] x86_64_start_reservations+0xb6/0xba [166843.625894] [<ffffffff81aa139c>] x86_64_start_kernel+0xf2/0xf9 [166843.625896] handlers: [166843.625898] [<ffffffff812dc8de>] twl_interrupt [166843.625900] Disabling IRQ #16 It's related to problem? EDIT #2: Based on feedback in comments, here is more informations. cat /proc/interrupts 16: 390813 0 0 0 IO-APIC-fasteoi 3w-sas Controller model: [ 1.095350] 3ware Storage Controller device driver for Linux v1.26.02.003. [ 1.095467] 3ware 9000 Storage Controller device driver for Linux v2.26.02.014. [ 1.095641] LSI 3ware SAS/SATA-RAID Controller device driver for Linux v3.26.02.000. [ 1.095787] 3w-sas 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 [ 1.095881] 3w-sas 0000:01:00.0: setting latency timer to 64 [ 1.910801] 3w-sas: scsi0: Found an LSI 3ware 9750-4i Controller at 0xfe560000, IRQ: 16. [ 2.216537] 3w-sas: scsi0: Firmware FH9X 5.08.00.008, BIOS BE9X 5.07.00.011, Phys: 8. [ 2.216836] scsi 0:0:0:0: Direct-Access LSI 9750-4i DISK 5.08 PQ: 0 ANSI: 5 And motherboard: description: Motherboard product: P8H67-M vendor: ASUSTeK Computer INC.

Read the article
Learnings from trying to write better software: Loud errors from the very start

- by theo.spears

Microsoft made a very small number of backwards incompatible changes between .NET 1.1 and 2.0, because they wanted to make it as easy and safe as possible to port applications to the new runtime. (Here’s a list.) However, one thing they did change was what happens when a background thread fails with an unhanded exception - in .NET 1.1 nothing happened, the thread terminated, and the application continued oblivious. Try the same trick in .NET 2.0 and the entire application, including all threads, will rudely terminate. There are three reasons for this. Firstly if a background thread has crashed, it may have left the entire application in an inconsistent state, in a way that will affect other threads. It’s better to terminate the entire application than continue and have the application perform actions based on a broken state, for example take customer orders, or write corrupt files to disk. Secondly, during software development, it is far better for errors to be loud and obtrusive. Even if you have unit tests and integration tests (and you should), a key part of ensuring software works properly is to actually try using it, both through systematic testing and through the casual use all software gets by its developers during use. Subtle errors are easy to miss if you are not actually doing real work using the application, loud errors are obvious. Thirdly, and most importantly, even if catching and swallowing exceptions indiscriminately doesn't cause any problems in your application, the presence of unexpected exceptions shows you do not fully understand the behavior of your code. The currently released version of your application may be absolutely correct. However, because your mental model of the behavior is wrong, any future change you make to the program could and probably will introduce critical errors. This applies to more than just exceptions causing threads to exit, any unexpected state should make the application blow up in an un-ignorable way. The worst thing you can do is silently swallow errors and continue. And let's be clear, writing to a log file does not count as blowing up in an un-ignorable way. This is all simple as long as the call stack only contains your code, but when your functions start to be called by third party or .NET framework code, it's surprisingly easy for exceptions to start vanishing. Let's look at two examples. 1. Windows forms drag drop events Usually if you throw an exception from a winforms event handler it will bring up the "application has crashed" dialog with abort and continue options. This is a good default behavior - the error is big and loud, but it is possible for the user to ignore the error and hopefully save their data, if somehow this bug makes it past testing. However drag and drop are different - throw an exception from one of these and it will just be silently swallowed with no explanation. By the way, it's not just drag and drop events. Timer events do it too. You can research how exceptions are treated in different handlers and code appropriately, but the safest and most user friendly approach is to always catch exceptions in your event handlers and show your own error message. I'll talk about one good approach to handling these exceptions at the end of this post. 2. SSMS integration for SQL Tab Magic A while back wrote an SSMS add-in called SQL Tab Magic (learn more about the process here). It works by listening to certain SSMS events and remembering what documents are opened and closed. I deployed it internally and it was used for a few months by a number of people without problems, so I was reasonably confident in its quality. Before releasing I made a few cleanups, including introducing error reporting. Bam. A few days later I was looking at over 1,000 error reports in my inbox. In turns out I wasn't handling table designers properly. The exceptions were there, but again SSMS was helpfully swallowing them all for me, so I was blissfully unaware. Had I made my errors loud from the start, I would have noticed these issues long before and fixed them. Handling exceptions Now you are systematically catching exceptions throughout your application, you need to do something with them. I've tried 3 options: log them, alert the user, and automatically send them home. There are a few good options for logging in .NET. The most widespread is Apache log4net, which provides a very capable and configurable logging framework. There is also NLog which has a compatible interface, with a greater emphasis on fluent rather than XML configuration. Alerting the user serves two purposes. Firstly it means they understand their action has failed to they don't just assume it worked (Silent file copy failure is a problem if you then delete the originals) or that they should keep waiting for a background task to complete. Secondly, it means the users can report the bug to your support team, and then you can fix it. This means the message you show the user should contain the information you need as a developer to identify and fix it. And the user will probably just send you a screenshot of the dialog, so it shouldn't be hidden by scroll bars. This leads us to the third option, automatically sending error reports home. By automatic I mean with minimal effort on the part of the user, rather than doing it silently behind their backs. The advantage of this is you can send back far more detailed and precise information than you can expect a user to include in an email, and by making it easier to report errors, you make it more likely users will do so. We do this using a great tool called SmartAssembly (full disclosure: this is a product made by Red Gate). It captures complete stack traces including the values of all local variables and then allows the user to send all this information back with a single click. We also capture log files to help understand what lead up to the error. We then use the free SmartAssembly Sync for Jira to dedupe these reports and raise them as bugs in our bug tracking system. The combined effect of loud errors during development and then automatic error reporting once software is deployed allows us to find and fix more bugs, correct misunderstandings on how our software works, and overall is a key piece in delivering higher quality software. However it is no substitute for having motivated cunning testers in the building - and we're looking to hire more of those too. If you found this post interesting you should follow me on twitter.

Read the article
How to Identify Which Hardware Component is Failing in Your Computer

- by Chris Hoffman

Concluding that your computer has a hardware problem is just the first step. If you’re dealing with a hardware issue and not a software issue, the next step is determining what hardware problem you’re actually dealing with. If you purchased a laptop or pre-built desktop PC and it’s still under warranty, you don’t need to care about this. Have the manufacturer fix the PC for you — figuring it out is their problem. If you’ve built your own PC or you want to fix a computer that’s out of warranty, this is something you’ll need to do on your own. Blue Screen 101: Search for the Error Message This may seem like obvious advice, but searching for information about a blue screen’s error message can help immensely. Most blue screens of death you’ll encounter on modern versions of Windows will likely be caused by hardware failures. The blue screen of death often displays information about the driver that crashed or the type of error it encountered. For example, let’s say you encounter a blue screen that identified “NV4_disp.dll” as the driver that caused the blue screen. A quick Google search will reveal that this is the driver for NVIDIA graphics cards, so you now have somewhere to start. It’s possible that your graphics card is failing if you encounter such an error message. Check Hard Drive SMART Status Hard drives have a built in S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) feature. The idea is that the hard drive monitors itself and will notice if it starts to fail, providing you with some advance notice before the drive fails completely. This isn’t perfect, so your hard drive may fail even if SMART says everything is okay. If you see any sort of “SMART error” message, your hard drive is failing. You can use SMART analysis tools to view the SMART health status information your hard drives are reporting. Test Your RAM RAM failure can result in a variety of problems. If the computer writes data to RAM and the RAM returns different data because it’s malfunctioning, you may see application crashes, blue screens, and file system corruption. To test your memory and see if it’s working properly, use Windows’ built-in Memory Diagnostic tool. The Memory Diagnostic tool will write data to every sector of your RAM and read it back afterwards, ensuring that all your RAM is working properly. Check Heat Levels How hot is is inside your computer? Overheating can rsult in blue screens, crashes, and abrupt shut downs. Your computer may be overheating because you’re in a very hot location, it’s ventilated poorly, a fan has stopped inside your computer, or it’s full of dust. Your computer monitors its own internal temperatures and you can access this information. It’s generally available in your computer’s BIOS, but you can also view it with system information utilities such as SpeedFan or Speccy. Check your computer’s recommended temperature level and ensure it’s within the appropriate range. If your computer is overheating, you may see problems only when you’re doing something demanding, such as playing a game that stresses your CPU and graphics card. Be sure to keep an eye on how hot your computer gets when it performs these demanding tasks, not only when it’s idle. Stress Test Your CPU You can use a utility like Prime95 to stress test your CPU. Such a utility will fore your computer’s CPU to perform calculations without allowing it to rest, working it hard and generating heat. If your CPU is becoming too hot, you’ll start to see errors or system crashes. Overclockers use Prime95 to stress test their overclock settings — if Prime95 experiences errors, they throttle back on their overclocks to ensure the CPU runs cooler and more stable. It’s a good way to check if your CPU is stable under load. Stress Test Your Graphics Card Your graphics card can also be stress tested. For example, if your graphics driver crashes while playing games, the games themselves crash, or you see odd graphical corruption, you can run a graphics benchmark utility like 3DMark. The benchmark will stress your graphics card and, if it’s overheating or failing under load, you’ll see graphical problems, crashes, or blue screens while running the benchmark. If the benchmark seems to work fine but you have issues playing a certain game, it may just be a problem with that game. Swap it Out Not every hardware problem is easy to diagnose. If you have a bad motherboard or power supply, their problems may only manifest through occasional odd issues with other components. It’s hard to tell if these components are causing problems unless you replace them completely. Ultimately, the best way to determine whether a component is faulty is to swap it out. For example, if you think your graphics card may be causing your computer to blue screen, pull the graphics card out of your computer and swap in a new graphics card. If everything is working well, it’s likely that your previous graphics card was bad. This isn’t easy for people who don’t have boxes of components sitting around, but it’s the ideal way to troubleshoot. Troubleshooting is all about trial and error, and swapping components out allows you to pin down which component is actually causing the problem through a process of elimination. This isn’t a complete guide to everything that could likely go wrong and how to identify it — someone could write a full textbook on identifying failing components and still not cover everything. But the tips above should give you some places to start dealing with the more common problems. Image Credit: Justin Marty on Flickr

Read the article
What's new in Solaris 11.1?

- by Karoly Vegh

Solaris 11.1 is released. This is the first release update since Solaris 11 11/11, the versioning has been changed from MM/YY style to 11.1 highlighting that this is Solaris 11 Update 1. Solaris 11 itself has been great. What's new in Solaris 11.1? Allow me to pick some new features from the What's New PDF that can be found in the official Oracle Solaris 11.1 Documentation. The updates are very numerous, I really can't include all. I. New AI Automated Installer RBAC profiles have been introduced to enable delegation of installation tasks. II. The interactive installer now supports installing the OS to iSCSI targets. III. ASR (Auto Service Request) and OCM (Oracle Configuration Manager) have been enabled by default to proactively provide support information and create service requests to speed up support processes. This is optional and can be disabled but helps a lot in supportcases. For further information, see: http://oracle.com/goto/solarisautoreg IV. The new command svcbundle helps you to create SMF manifests without having to struggle with XML editing. (btw, do you know the interactive editprop subcommand in svccfg? The listprop/setprop subcommands are great for scripting and automating, but for an interactive property editing session try, for example, this: svccfg -s svc:/application/pkg/system-repository:default editprop ) V. pfedit: Ever wondered how to delegate editing permissions to certain files? It is well known "sudo /usr/bin/vi /etc/hosts" is not the right way, for sudo elevates the complete vi process to admin levels, and the user can "break" out of the session as root with simply starting a shell from that vi. Now, the new pfedit command provides a solution exactly to this challenge - an auditable, secure, per-user configurable editing possibility. See the pfedit man page for examples. VI. rsyslog, the popular logging daemon (filters, SSL, formattable output, SQL collect...) has been included in Solaris 11.1 as an alternative to syslog. VII: Zones: Solaris Zones - as a major Solaris differentiator - got lots of love in terms of new features: ZOSS - Zones on Shared Storage: Placing your zones to shared storage (FC, iSCSI) has never been this easy - via zonecfg. parallell updates - with S11's bootenvironments updating zones was no problem and meant no downtime anyway, but still, now you can update them parallelly, a way faster update action if you are running a large number of zones. This is like parallell patching in Solaris 10, but with all the IPS/ZFS/S11 goodness. per-zone fstype statistics: Running zones on a shared filesystems complicate the I/O debugging, since ZFS collects all the random writes and delivers them sequentially to boost performance. Now, over kstat you can find out which zone's I/O has an impact on the other ones, see the examples in the documentation: http://docs.oracle.com/cd/E26502_01/html/E29024/gmheh.html#scrolltoc Zones got RDSv3 protocol support for InfiniBand, and IPoIB support with Crossbow's anet (automatic vnic creation) feature. NUMA I/O support for Zones: customers can now determine the NUMA I/O topology of the system from within zones. VIII: Security got a lot of attention too: Automated security/audit reporting, with builtin reporting templates e.g. for PCI (payment card industry) audits. PAM is now configureable on a per-user basis instead of system wide, allowing different authentication requirements for different users SSH in Solaris 11.1 now supports running in FIPS 140-2 mode, that is, in a U.S. government security accredited fashion. SHA512/224 and SHA512/256 cryptographic hash functions are implemented in a FIPS-compliant way - and on a T4 implemented in silicon! That is, goverment-approved cryptography at HW-speed. Generally, Solaris is currently under evaluation to be both FIPS and Common Criteria certified. IX. Networking, as one of the core strengths of Solaris 11, has been extended with: Data Center Bridging (DCB) - not only setups where network and storage share the same fabric (FCoE, anyone?) can have Quality-of-Service requirements. DCB enables peers to distinguish traffic based on priorities. Your NICs have to support DCB, see the documentation, and additional information on Wikipedia. DataLink MultiPathing, DLMP, enables link aggregation to span across multiple switches, even between those of different vendors. But there are essential differences to the good old bandwidth-aggregating LACP, see the documentation: http://docs.oracle.com/cd/E26502_01/html/E28993/gmdlu.html#scrolltoc VNIC live migration is now supported from one physical NIC to another on-the-fly X. Data management: FedFS, (Federated FileSystem) is new, it relies on Solaris 11's NFS referring mechanism to join separate shares of different NFS servers into a single filesystem namespace. The referring system has been there since S11 11/11, in Solaris 11.1 FedFS uses a LDAP - as the one global nameservice to bind them all. The iSCSI initiator now uses the T4 CPU's HW-implemented CRC32 algorithm - thus improving iSCSI throughput while reducing CPU utilization on a T4 Storage locking improvements are now RAC aware, speeding up throughput with better locking-communication between nodes up to 20%! XI: Kernel performance optimizations: The new Virtual Memory subsystem ("VM2") scales now to 100+ TB Memory ranges. The memory predictor monitors large memory page usage, and adjust memory page sizes to applications' needs OSM, the Optimized Shared Memory allows Oracle DBs' SGA to be resized online XII: The Power Aware Dispatcher in now by default enabled, reducing power consumption of idle CPUs. Also, the LDoms' Power Management policies and the poweradm settings in Solaris 11 OS will cooperate. XIII: x86 boot: upgrade to the (Grand Unified Bootloader) GRUB2. Because grub2 differs in the configuration syntactically from grub1, one shall not edit the new grub configuration (grub.cfg) but use the new bootadm features to update it. GRUB2 adds UEFI support and also support for disks over 2TB. XIV: Improved viewing of per-CPU statistics of mpstat. This one might seem of less importance at first, but nowadays having better sorting/filtering possibilities on a periodically updated mpstat output of 256+ vCPUs can be a blessing. XV: Support for Solaris Cluster 4.1: The What's New document doesn't actually mention this one, since OSC 4.1 has not been released at the time 11.1 was. But since then it is available, and it requires Solaris 11.1. And it's only a "pkg update" away. ...aand I seriously need to stop here. There's a lot I missed, Edge Virtual Bridging, lofi tuning, ZFS sharing and crypto enhancements, USB3.0, pulseaudio, trusted extensions updates, etc - but if I mention all those then I effectively copy the What's New document. Which I recommend reading now anyway, it is a great extract of the 300+ new projects and RFE-followups in S11.1. And this blogpost is a summary of that extract. For closing words, allow me to come back to Request For Enhancements, RFEs. Any customer can request features. Open up a Support Request, explain that this is an RFE, describe the feature you/your company desires to have in S11 implemented. The more SRs are collected for an RFE, the more chance it's got to get implemented. Feel free to provide feedback about the product, as well as about the Solaris 11.1 Documentation using the "Feedback" button there. Both the Solaris engineers and the documentation writers are eager to hear your input.Feel free to comment about this post too. Except that it's too long ;) wbr,charlie

Read the article
apache webserver unresponsible with server-status showing all child processes waiting for connection

- by Jeff

My setup: i have 3 nearly identical webserver machines serving the same high loaded dynamic website with simple load balancing over dns. The service has been working for over two ears with the same apache config. apache2, php5, ubuntu 8.04 linux 2.6.24-29-server My problem: since about two weeks i'm experiencing problems with this config. Nearly every day i have one small moment about 5 minutes, in which the website is unreachable. I'm still able to login to the servers over ssh. If i run htop, i see the machine simply doing nothing. i have about 1000 apache processes running, but no cpu activity. i've used the apache mod_status to debug this situation. the process scoreboard looks like this: _C.___K_______________________R._______.__K_K____K___C_______.__ _______C__________.___________________________________.________C _.____K__________K___K_WK_____._K_____________________________._ W______K__________K________.____________________._______C_______ _C_.__K__K____.._.._____________________________________C_______ _R___________K___.______C________.C_________.______._____C______ ____________KKC____K_____K__WC_________________C_____.__.____.__ _____________________C_________K______.____C______._____________ _.___C____.___.___________________________.K______.____K________ W__.___________________C.__.____K________K_______R_._.__._______ __C__C_.__________C__C_______._____W______________C_.___C_______ ____.______C_____________C________.____C____________.________._K __.__________.K_____________K_________._____C____.K__________KW_ __K.W________R_________._______.___W___________.____.__K_____W__ W___.___..________W____K Scoreboard Key: "_" Waiting for Connection, "S" Starting up, "R" Reading Request, "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup, "C" Closing connection, "L" Logging, "G" Gracefully finishing, "I" Idle cleanup of worker, "." Open slot with no current process So the most of the processes are just waiting for connection. after about 5 minutes the situation will return to normal: i have lot least processes on every machine, the most workers have the "."-status (meaing they are open to process a request) and of course the website is reachable! so i'm trying to find something in the logs, but there is simply nothing... the apache access log is silent for about 4 minutes, the same is for the error log. i also can not figure out anything wrong in other system logs. the situation is the same on all 3 webservers (all of them have this load peak and unresposibility at the same time), so i do not thing this is hardware related. but i think, this might be related to some network (tcp) issue. any ideas? EDIT: some more information, that i have just discovered: it has just happened again. and i was able to verify that i'm also not able to connect locally when this problem occurs. i have made some connection statistics with the following command after it happend netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c 109 CLOSE_WAIT 2652 ESTABLISHED 2 FIN_WAIT1 11 LAST_ACK 12 LISTEN 91 SYN_RECV 1 SYN_SENT 16 TIME_WAIT If i execute the same command some time later, i have something like this: 4 CLOSING 108 ESTABLISHED 18 FIN_WAIT1 182 FIN_WAIT2 37 LAST_ACK 12 LISTEN 50 SYN_RECV 11276 TIME_WAIT So in the normal situation i have only 100-200 open connections by clients beeing handled by apache in this moment. when i have this "crash", i have a lot more connections. what is the best way to analyse this? EDIT2: the important lines in apache2.conf are: KeepAlive On MaxKeepAliveRequests 20 KeepAliveTimeout 1 <IfModule mpm_prefork_module> ServerLimit 920 StartServers 30 MinSpareServers 80 MaxSpareServers 120 MaxClients 920 MaxRequestsPerChild 700 </IfModule> it is an apache2 prefork with php_mod. the server has 8GB ram and a 4gb swap partition.

Read the article
How to stop Apache from crashing my entire server?

- by CyberShadow

I maintain a Gentoo server with a few services, including Apache. It's fairly low-end (2GB of RAM and a low-end CPU with 2 cores). My problem is that, despite my best efforts, an over-loaded Apache crashes the entire server. In fact, at this point I'm close to being convinced that Linux is a horrible operating system that isn't worth anyone's time looking for stability under load. Things I tried: Adjusting oom_adj for the root Apache process (and thus all its children). That had close to no effect. When Apache was overloaded it would bring the system to a grind, as the system paged out everything else before it got to kill anything. Turning off swap. Didn't help, it would unload memory paged to binaries of processes and other files on /, thus causing the same effect. Putting it in a memory-limited cgroup (limited to 512 MB of RAM, 1/4th of the total). This "worked", at least in my own stress tests - except the server keeps crashing under load (basically stalling all other processes, inaccessible via SSH, etc.) Running it with idle I/O priority. This wasn't a very good idea in the end, because it just caused the system load to climb indefinitely (into the thousands) with almost no visible effect - until you tried to access an unbuffered part of the disk. This caused the task to freeze. (So much for good I/O scheduling, eh?) Limiting the number of concurrent connections to Apache. Setting the number too low caused web sites to become unresponsive due to most slots being occupied with long requests (file downloads). I tried various Apache MPMs without much success (prefork, event, itk). Switching from prefork/event+php-cgi+suphp to itk+mod_php. This improved performance, but didn't solve the actual problem. Switching I/O schedulers (cfq to deadline). Just to stress this out: I don't care if Apache itself goes down under load, I just want the rest of my system to remain stable. Of course, having Apache recover quickly after a brief period of intensive load would be great to have, but one step at a time. Right now I am mostly dumbfounded by how can humanity, in this day and age, design an operating system where such a seemingly simple task (don't allow one system component to crash the entire system) seems practically impossible - or at least, very hard to do. Please don't suggest things like VMs or "BUY MORE RAM". Some more information gathered with a friend's help: The processes hang when the cgroup oom killer is invoked. Here's the call trace: [<ffffffff8104b94b>] ? prepare_to_wait+0x70/0x7b [<ffffffff810a9c73>] mem_cgroup_handle_oom+0xdf/0x180 [<ffffffff810a9559>] ? memcg_oom_wake_function+0x0/0x6d [<ffffffff810aa041>] __mem_cgroup_try_charge+0x32d/0x478 [<ffffffff810aac67>] mem_cgroup_charge_common+0x48/0x73 [<ffffffff81081c98>] ? __lru_cache_add+0x60/0x62 [<ffffffff810aadc3>] mem_cgroup_newpage_charge+0x3b/0x4a [<ffffffff8108ec38>] handle_mm_fault+0x305/0x8cf [<ffffffff813c6276>] ? schedule+0x6ae/0x6fb [<ffffffff8101f568>] do_page_fault+0x214/0x22b [<ffffffff813c7e1f>] page_fault+0x1f/0x30 At this point, the apache memory cgroup is practically deadlocked, and burning CPU in syscalls (all with the above call trace). This seems like a problem in the cgroup implementation...

Read the article
External File Upload Optimizations for Windows Azure

- by rgillen

[Cross posted from here: http://rob.gillenfamily.net/post/External-File-Upload-Optimizations-for-Windows-Azure.aspx] I’m wrapping up a bit of the work we’ve been doing on data movement optimizations for cloud computing and the latest set of data yielded some interesting points I thought I’d share. The work done here is not really rocket science but may, in some ways, be slightly counter-intuitive and therefore seemed worthy of posting. Summary: for those who don’t like to read detailed posts or don’t have time, the synopsis is that if you are uploading data to Azure, block your data (even down to 1MB) and upload in parallel. Set your block size based on your source file size, but if you must choose a fixed value, use 1MB. Following the above will result in significant performance gains… upwards of 10x-24x and a reduction in overall file transfer time of upwards of 90% (eg, uploading a 1GB file averaged 46.37 minutes prior to optimizations and averaged 1.86 minutes afterwards). Detail: For those of you who want more detail, or think that the claims at the end of the preceding paragraph are over-reaching, what follows is information and code supporting these claims. As the title would indicate, these tests were run from our research facility pointing to the Azure cloud (specifically US North Central as it is physically closest to us) and do not represent intra-cloud results… we have performed intra-cloud tests and the overall results are similar in notion but the data rates are significantly different as well as the tipping points for the various block sizes… this will be detailed separately). We started by building a very simple console application that would loop through a directory and upload each file to Azure storage. This application used the shipping storage client library from the 1.1 version of the azure tools. The only real variation from the client library is that we added code to collect and record the duration (in ms) and size (in bytes) for each file transferred. The code is available here. We then created a directory that had a collection of files for the following sizes: 2KB, 32KB, 64KB, 128KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB, 250MB, 500MB, 750MB, and 1GB (50 files for each size listed). These files contained randomly-generated binary data and do not benefit from compression (a separate discussion topic). Our file generation tool is available here. The baseline was established by running the application described above against the directory containing all of the data files. This application uploads the files in a random order so as to avoid transferring all of the files of a given size sequentially and thereby spreading the affects of periodic Internet delays across the collection of results. We then ran some scripts to split the resulting data and generate some reports. The raw data collected for our non-optimized tests is available via the links in the Related Resources section at the bottom of this post. For each file size, we calculated the average upload time (and standard deviation) and the average transfer rate (and standard deviation). As you likely are aware, transferring data across the Internet is susceptible to many transient delays which can cause anomalies in the resulting data. It is for this reason that we randomized the order of source file processing as well as executed the tests 50x for each file size. We expect that these steps will yield a sufficiently balanced set of results. Once the baseline was collected and analyzed, we updated the test harness application with some methods to split the source file into user-defined block sizes and then to upload those blocks in parallel (using the PutBlock() method of Azure storage). The parallelization was handled by simply relying on the Parallel Extensions to .NET to provide a Parallel.For loop (see linked source for specific implementation details in Program.cs, line 173 and following… less than 100 lines total). Once all of the blocks were uploaded, we called PutBlockList() to assemble/commit the file in Azure storage. For each block transferred, the MD5 was calculated and sent ensuring that the bits that arrived matched was was intended. The timer for the blocked/parallelized transfer method wraps the entire process (source file splitting, block transfer, MD5 validation, file committal). A diagram of the process is as follows: We then tested the affects of blocking & parallelizing the transfers by running the updated application against the same source set and did a parameter sweep on the block size including 256KB, 512KB, 1MB, 2MB, and 4MB (our assumption was that anything lower than 256KB wasn’t worth the trouble and 4MB is the maximum size of a block supported by Azure). The raw data for the parallel tests is available via the links in the Related Resources section at the bottom of this post. This data was processed and then compared against the single-threaded / non-optimized transfer numbers and the results were encouraging. The Excel version of the results is available here. Two semi-obvious points need to be made prior to reviewing the data. The first is that if the block size is larger than the source file size you will end up with a “negative optimization” due to the overhead of attempting to block and parallelize. The second is that as the files get smaller, the clock-time cost of blocking and parallelizing (overhead) is more apparent and can tend towards negative optimizations. For this reason (and is supported in the raw data provided in the linked worksheet) the charts and dialog below ignore source file sizes less than 1MB. (click chart for full size image) The chart above illustrates some interesting points about the results: When the block size is smaller than the source file, performance increases but as the block size approaches and then passes the source file size, you see decreasing benefit to the point of negative gains (see the values for the 1MB file size) For some of the moderately-sized source files, small blocks (256KB) are best As the size of the source file gets larger (see values for 50MB and up), the smallest block size is not the most efficient (presumably due, at least in part, to the increased number of blocks, increased number of individual transfer requests, and reassembly/committal costs). Once you pass the 250MB source file size, the difference in rate for 1MB to 4MB blocks is more-or-less constant The 1MB block size gives the best average improvement (~16x) but the optimal approach would be to vary the block size based on the size of the source file. (click chart for full size image) The above is another view of the same data as the prior chart just with the axis changed (x-axis represents file size and plotted data shows improvement by block size). It again highlights the fact that the 1MB block size is probably the best overall size but highlights the benefits of some of the other block sizes at different source file sizes. This last chart shows the change in total duration of the file uploads based on different block sizes for the source file sizes. Nothing really new here other than this view of the data highlights the negative affects of poorly choosing a block size for smaller files. Summary What we have found so far is that blocking your file uploads and uploading them in parallel results in significant performance improvements. Further, utilizing extension methods and the Task Parallel Library (.NET 4.0) make short work of altering the shipping client library to provide this functionality while minimizing the amount of change to existing applications that might be using the client library for other interactions. Related Resources Source code for upload test application Source code for random file generator ODatas feed of raw data from non-optimized transfer tests Experiment Metadata Experiment Datasets 2KB Uploads 32KB Uploads 64KB Uploads 128KB Uploads 256KB Uploads 512KB Uploads 1MB Uploads 5MB Uploads 10MB Uploads 25MB Uploads 50MB Uploads 100MB Uploads 250MB Uploads 500MB Uploads 750MB Uploads 1GB Uploads Raw Data OData feeds of raw data from blocked/parallelized transfer tests Experiment Metadata Experiment Datasets Raw Data 256KB Blocks 512KB Blocks 1MB Blocks 2MB Blocks 4MB Blocks Excel worksheet showing summarizations and comparisons

Read the article
Oracle Virtual Server OEL vm fails to start - kernel panic on cpu identify

- by Towndrunk

I am in the process of following a guide to setup various oracle vm templates, so far I have installed OVS 2. 2 and got the OVM Manager working, imported the template for OEL5U5 and created a vm from it.. the problem comes when starting that vm. The log in the OVMM console shows the following; Update VM Status - Running Configure CPU Cap Set CPU Cap: failed:<Exception: failed:<Exception: ['xm', 'sched-credit', '-d', '32_EM11g_OVM', '-c', '0'] => Error: Domain '32_EM11g_OVM' does not exist. StackTrace: File "/opt/ovs-agent-2.3/OVSXXenVMConfig.py", line 2531, in xen_set_cpu_cap run_cmd(args=['xm', File "/opt/ovs-agent-2.3/OVSCommons.py", line 92, in run_cmd raise Exception('%s => %s' % (args, err)) The xend.log shows; [2012-11-12 16:42:01 7581] DEBUG (DevController:139) Waiting for devices vtpm [2012-11-12 16:42:01 7581] INFO (XendDomain:1180) Domain 32_EM11g_OVM (3) unpaused. [2012-11-12 16:42:03 7581] WARNING (XendDomainInfo:1907) Domain has crashed: name=32_EM11g_OVM id=3. [2012-11-12 16:42:03 7581] ERROR (XendDomainInfo:2041) VM 32_EM11g_OVM restarting too fast (Elapsed time: 11.377262 seconds). Refusing to restart to avoid loops .> [2012-11-12 16:42:03 7581] DEBUG (XendDomainInfo:2757) XendDomainInfo.destroy: domid=3 [2012-11-12 16:42:12 7581] DEBUG (XendDomainInfo:2230) Destroying device model [2012-11-12 16:42:12 7581] INFO (image:553) 32_EM11g_OVM device model terminated I have set_on_crash="preserve" in the vm.cfg and have then run xm create -c to get the console screen while booting and this is the log of what happens.. Started domain 32_EM11g_OVM (id=4) Bootdata ok (command line is ro root=LABEL=/ ) Linux version 2.6.18-194.0.0.0.3.el5xen ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Mon Mar 29 18:27:00 EDT 2010 BIOS-provided physical RAM map: Xen: 0000000000000000 - 0000000180800000 (usable)> No mptable found. Built 1 zonelists. Total pages: 1574912 Kernel command line: ro root=LABEL=/ Initializing CPU#0 PID hash table entries: 4096 (order: 12, 32768 bytes) Xen reported: 1600.008 MHz processor. Console: colour dummy device 80x25 Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes) Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes) Software IO TLB disabled Memory: 6155256k/6299648k available (2514k kernel code, 135548k reserved, 1394k data, 184k init) Calibrating delay using timer specific routine.. 4006.42 BogoMIPS (lpj=8012858) Security Framework v1.0.0 initialized SELinux: Initializing. selinux_register_security: Registering secondary module capability Capability LSM initialized as secondary Mount-cache hash table entries: 256 CPU: L1 I Cache: 64K (64 bytes/line), D cache 16K (64 bytes/line) CPU: L2 Cache: 2048K (64 bytes/line) general protection fault: 0000 [1] SMP last sysfs file: CPU 0 Modules linked in: Pid: 0, comm: swapper Not tainted 2.6.18-194.0.0.0.3.el5xen #1 RIP: e030:[ffffffff80271280] [ffffffff80271280] identify_cpu+0x210/0x494 RSP: e02b:ffffffff80643f70 EFLAGS: 00010212 RAX: 0040401000810008 RBX: 0000000000000000 RCX: 00000000c001001f RDX: 0000000000404010 RSI: 0000000000000001 RDI: 0000000000000005 RBP: ffffffff8063e980 R08: 0000000000000025 R09: ffff8800019d1000 R10: 0000000000000026 R11: ffff88000102c400 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffffffff805d2000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process swapper (pid: 0, threadinfo ffffffff80642000, task ffffffff804f4b80) Stack: 0000000000000000 ffffffff802d09bb ffffffff804f4b80 0000000000000000 0000000021100800 0000000000000000 0000000000000000 ffffffff8064cb00 0000000000000000 0000000000000000 Call Trace: [ffffffff802d09bb] kmem_cache_zalloc+0x62/0x80 [ffffffff8064cb00] start_kernel+0x210/0x224 [ffffffff8064c1e5] _sinittext+0x1e5/0x1eb Code: 0f 30 b8 73 00 00 00 f0 0f ab 45 08 e9 f0 00 00 00 48 89 ef RIP [ffffffff80271280] identify_cpu+0x210/0x494 RSP ffffffff80643f70 0 Kernel panic - not syncing: Fatal exception clear as mud to me. are there any other logs that will help me? I have now deployed another vm from the same template and used the default vm settings rather than adding more memory etc - I get exactly the same error.

Read the article
Why are my httpd mpm_prefork processes being reaped so quickly?

- by Dan Pritts

We've got a system running RHEL6, x64. We are using a local installation of apache 2.2.22 from source. we serve primarily: mod_perl applications (with a local installation of perl 5.16.0) tomcat applications proxied with mod_jk Here is some context; the main question is below. All of this talks to an Oracle backend. We are having issues with Oracle becoming unresponsive. We think this is because we're hitting the maximum process limit in oracle. We've upped the process limit, but now we are hitting memory pressure on the oracle server. We have tons of oracle sessions sitting idle. I can trace a bunch of them back to the httpd processes. We have mod_perl's Apache::DBI start up a new connection to the database with each httpd child that's spawned. We are concerned that these are not always getting closed out properly when the httpd's exit...and the httpd's are exiting very frequently. I know that it would be good to modify the mod_perl applications to use some better form of db connection pooling; we plan to pursue that but would like to solve our immediate problem sooner. So here's the main question. We are using the prefork MPM. The apache child processes are lasting at most a few minutes. Log analysis shows that each one is serving fewer than 50 clients before exiting; the last request each child serves is OPTIONS * HTTP/1.0 on some sort of internal connection; I'm under the impression that this is a "ping" from the master process. I've adjusted the MPM config as follows. I didn't want to raise MinSpareServers too high, because, after all, i'm trying to minimize the number of sessions to oracle. MinSpareServers 5 MaxSpareServers 30 MaxClients 150 MaxRequestsPerChild 10000 Right now we're serving 250-300 requests per minute. We've got 21 httpd's running, the eldest (other than the master, owned by root) being 3 minutes old. This rate of reaping of the apache children really seems excessive. What could be causing it? Apache was built with: $ ./configure --prefix=/opt/apache --with-ssl=/usr/lib --enable-expires --enable-ext-filter --enable-info --enable-mime-magic --enable-rewrite --enable-so --enable-speling --enable-ssl --enable-usertrack --enable-proxy --enable-headers --enable-log-forensic Apache config info: % /opt/apache/bin/httpd -V Server version: Apache/2.2.22 (Unix) Server built: Jul 23 2012 22:30:13 Server's Module Magic Number: 20051115:30 Server loaded: APR 1.4.5, APR-Util 1.4.1 Compiled using: APR 1.4.5, APR-Util 1.4.1 Architecture: 64-bit Server MPM: Prefork threaded: no forked: yes (variable process count) Server compiled with.... -D APACHE_MPM_DIR="server/mpm/prefork" -D APR_HAS_SENDFILE -D APR_HAS_MMAP -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled) -D APR_USE_SYSVSEM_SERIALIZE -D APR_USE_PTHREAD_SERIALIZE -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT -D APR_HAS_OTHER_CHILD -D AP_HAVE_RELIABLE_PIPED_LOGS -D DYNAMIC_MODULE_LIMIT=128 -D HTTPD_ROOT="/opt/apache" -D SUEXEC_BIN="/opt/apache/bin/suexec" -D DEFAULT_PIDLOG="logs/httpd.pid" -D DEFAULT_SCOREBOARD="logs/apache_runtime_status" -D DEFAULT_LOCKFILE="logs/accept.lock" -D DEFAULT_ERRORLOG="logs/error_log" -D AP_TYPES_CONFIG_FILE="conf/mime.types" -D SERVER_CONFIG_FILE="conf/httpd.conf" modules are compiled into apache rather than shared libs: % /opt/apache/bin/httpd -l Compiled in modules: core.c mod_authn_file.c mod_authn_default.c mod_authz_host.c mod_authz_groupfile.c mod_authz_user.c mod_authz_default.c mod_auth_basic.c mod_ext_filter.c mod_include.c mod_filter.c mod_log_config.c mod_log_forensic.c mod_env.c mod_mime_magic.c mod_expires.c mod_headers.c mod_usertrack.c mod_setenvif.c mod_version.c mod_proxy.c mod_proxy_connect.c mod_proxy_ftp.c mod_proxy_http.c mod_proxy_scgi.c mod_proxy_ajp.c mod_proxy_balancer.c mod_ssl.c prefork.c http_core.c mod_mime.c mod_status.c mod_autoindex.c mod_asis.c mod_info.c mod_cgi.c mod_negotiation.c mod_dir.c mod_actions.c mod_speling.c mod_userdir.c mod_alias.c mod_rewrite.c mod_so.c One final note - the red hat httpd, apr, and perl packages are all installed, but ldd shows that none of those libraries are linked with the running httpd.

Read the article
Conducting Effective Web Meetings

- by BuckWoody

There are several forms of corporate communication. From immediate, rich communications like phones and IM messaging to historical transactions like e-mail, there are a lot of ways to get information to one or more people. From time to time, it's even useful to have a meeting. (This is where a witty picture of a guy sleeping in a meeting goes. I won't bother actually putting one here; you're already envisioning it in your mind) Most meetings are pointless, and a complete waste of time. This is the fault, completely and solely, of the organizer. It's because he or she hasn't thought things through enough to think about alternate forms of information passing. Here's the criteria for a good meeting - whether in-person or over the web: 100% of the content of a meeting should require the participation of 100% of the attendees for 100% of the time It doesn't get any simpler than that. If it doesn't meet that criteria, then don't invite that person to that meeting. If you're just conveying information and no one has the need for immediate interaction with that information (like telling you something that modifies the message), then send an e-mail. If you're a manager, and you need to get status from lots of people, pick up the phone.If you need a quick answer, use IM. I once had a high-level manager that called frequent meetings. His real need was status updates on various processes, so 50 of us would sit in a room while he asked each one of us questions. He believed this larger meeting helped us "cross pollinate ideas". In fact, it was a complete waste of time for most everyone, except in the one or two moments that they interacted with him. So I wrote some code for a Palm Pilot (which was a kind of SmartPhone but with no phone and no real graphics, but this was in the days when we had just discovered fire and the wheel, although the order of those things is still in debate) that took an average of the salaries of the people in the room (I guessed at it) and ran a timer which multiplied the number of people against the salaries. I left that running in plain sight for him, and when he asked about it, I explained how much the meetings were really costing the company. We had far fewer meetings after. Meetings are now web-enabled. I believe that's largely a good thing, since it saves on travel time and allows more people to participate, but I think the rule above still holds. And in fact, there are some other rules that you should follow to have a great meeting - and fewer of them. Be Clear About the Goal This is important in any meeting, but all of us have probably gotten an invite with a web link and an ambiguous title. Then you get to the meeting, and it's a 500-level deep-dive on something everyone expects you to know. This is unfair to the "expert" and to the participants. I always tell people that invite me to a meeting that I will be as detailed as I can - but the more detail they can tell me about the questions, the more detailed I can be in my responses. Granted, there are times when you don't know what you don't know, but the more you can say about the topic the better. There's another point here - and it's that you should have a clearly defined "win" for the meeting. When the meeting is over, and everyone goes back to work, what were you expecting them to do with the information? Have that clearly defined in your head, and in the meeting invite. Understand the Technology There are several web-meeting clients out there. I use them all, since I meet with clients all over the world. They all work differently - so I take a few moments and read up on the different clients and find out how I can use the tools properly. I do this with the technology I use for everything else, and it's important to understand it if the meeting is to be a success. If you're running the meeting, know the tools. I don't care if you like the tools or not, learn them anyway. Don't waste everyone else's time just because you're too bitter/snarky/lazy to spend a few minutes reading. Check your phone or mic. Check your video size. Install (and learn to use) ZoomIT (http://technet.microsoft.com/en-us/sysinternals/bb897434.aspx). Format your slides or screen or output correctly. Learn to use the voting features of the meeting software, and especially it's whiteboard features. Figure out how multiple monitors work. Try a quick meeting with someone to test all this. Do this *before* you invite lots of other people to your meeting. Use a WebCam I'm not a pretty man. I have a face fit for radio. But after attending a meeting with clients where one Microsoft person used a webcam and another did not, I'm convinced that people pay more attention when a face is involved. There are tons of studies around this, or you can take my word for it, but toss a shirt on over those pajamas and turn the webcam on. Set Up Early Whether you're attending or leading the meeting, don't wait to sign on to the meeting at the time when it starts. I can almost plan that a 10:00 meeting will actually start at 10:10 because the participants/leader is just now installing the web client for the meeting at 10:00. Sign on early, go on mute, and then wait for everyone to arrive. Mute When Not Talking No one wants to hear your screaming offspring / yappy dog / other cubicle conversations / car wind noise (are you driving in a desert storm or something?) while the person leading the meeting is trying to talk. I use the Lync software from Microsoft for my meetings, and I mute everyone by default, and then tell them to un-mute to talk to the group. Share Collateral If you have a PowerPoint deck, mail it out in case you have a tech failure. If you have a document, share it as an attachment to the meeting. Don't make people ask you for the information - that's why you're there to begin with. Even better, send it out early. "But", you say, "then no one will come to the meeting if they have the deck first!" Uhm, then don't have a meeting. Send out the deck and a quick e-mail and let everyone get on with their productive day. Set Actions At the Meeting A meeting should have some sort of outcome (see point one). That means there are actions to take, a follow up, or some deliverable. Otherwise, it's an e-mail. At the meeting, decide who will do what, when things are needed, and so on. And avoid, if at all possible, setting up another meeting, unless absolutely necessary. So there you have it. Whether it's on-premises or on the web, meetings are a necessary evil, and should be treated that way. Like politicians, you should have as few of them as are necessary to keep the roads paved and public libraries open.

Read the article
Proliant server will not accept new hard disks in RAID 1+0?

- by Leigh

I have a HP ProLiant DL380 G5, I have two logical drives configured with RAID. I have one logical drive RAID 1+0 with two 72 gb 10k sas 1 port spare no 376597-001. I had one hard disk fail and ordered a replacement. The configuration utility showed error and would not rebuild the RAID. I presumed a hard disk fault and ordered a replacement again. In the mean time I put the original failed disk back in the server and this started rebuilding. Currently shows ok status however in the log I can see hardware errors. The new disk has come and I again have the same problem of not accepting the hard disk. I have updated the P400 controller with the latest firmware 7.24 , but still no luck. The only difference I can see is the original drive has firmware 0103 (same as the RAID drive) and the new one has HPD2. Any advice would be appreciated. Thanks in advance Logs from server ctrl all show config Smart Array P400 in Slot 1 (sn: PAFGK0P9VWO0UQ) array A (SAS, Unused Space: 0 MB) logicaldrive 1 (68.5 GB, RAID 1, Interim Recovery Mode) physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SAS, 73.5 GB, OK) physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SAS, 72 GB, Failed array B (SAS, Unused Space: 0 MB) logicaldrive 2 (558.7 GB, RAID 5, OK) physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK) physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SAS, 300 GB, OK) physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SAS, 300 GB, OK) ctrl all show config detail Smart Array P400 in Slot 1 Bus Interface: PCI Slot: 1 Serial Number: PAFGK0P9VWO0UQ Cache Serial Number: PA82C0J9VWL8I7 RAID 6 (ADG) Status: Disabled Controller Status: OK Hardware Revision: E Firmware Version: 7.24 Rebuild Priority: Medium Expand Priority: Medium Surface Scan Delay: 15 secs Surface Scan Mode: Idle Wait for Cache Room: Disabled Surface Analysis Inconsistency Notification: Disabled Post Prompt Timeout: 0 secs Cache Board Present: True Cache Status: OK Cache Status Details: A cache error was detected. Run more information. Cache Ratio: 100% Read / 0% Write Drive Write Cache: Disabled Total Cache Size: 256 MB Total Cache Memory Available: 208 MB No-Battery Write Cache: Disabled Battery/Capacitor Count: 0 SATA NCQ Supported: True Array: A Interface Type: SAS Unused Space: 0 MB Status: Failed Physical Drive Array Type: Data One of the drives on this array have failed or has Logical Drive: 1 Size: 68.5 GB Fault Tolerance: RAID 1 Heads: 255 Sectors Per Track: 32 Cylinders: 17594 Strip Size: 128 KB Full Stripe Size: 128 KB Status: Interim Recovery Mode Caching: Enabled Unique Identifier: 600508B10010503956574F305551 Disk Name: \\.\PhysicalDrive0 Mount Points: C:\ 68.5 GB Logical Drive Label: A0100539PAFGK0P9VWO0UQ0E93 Mirror Group 0: physicaldrive 2I:1:2 (port 2I:box 1:bay 2, S Mirror Group 1: physicaldrive 2I:1:1 (port 2I:box 1:bay 1, S Drive Type: Data physicaldrive 2I:1:1 Port: 2I Box: 1 Bay: 1 Status: OK Drive Type: Data Drive Interface Type: SAS Size: 73.5 GB Rotational Speed: 10000 Firmware Revision: 0103 Serial Number: B379P8C006RK Model: HP DG072A9B7 PHY Count: 2 PHY Transfer Rate: Unknown, Unknown physicaldrive 2I:1:2 Port: 2I Box: 1 Bay: 2 Status: Failed Drive Type: Data Drive Interface Type: SAS Size: 72 GB Rotational Speed: 15000 Firmware Revision: HPD9 Serial Number: D5A1PCA04SL01244 Model: HP EH0072FARUA PHY Count: 2 PHY Transfer Rate: Unknown, Unknown Array: B Interface Type: SAS Unused Space: 0 MB Status: OK Array Type: Data Logical Drive: 2 Size: 558.7 GB Fault Tolerance: RAID 5 Heads: 255 Sectors Per Track: 32 Cylinders: 65535 Strip Size: 64 KB Full Stripe Size: 128 KB Status: OK Caching: Enabled Parity Initialization Status: Initialization Co Unique Identifier: 600508B10010503956574F305551 Disk Name: \\.\PhysicalDrive1 Mount Points: E:\ 558.7 GB Logical Drive Label: AF14FD12PAFGK0P9VWO0UQD007 Drive Type: Data physicaldrive 1I:1:5 Port: 1I Box: 1 Bay: 5 Status: OK Drive Type: Data Drive Interface Type: SAS Size: 300 GB Rotational Speed: 10000 Firmware Revision: HPD4 Serial Number: 3SE07QH300009923X1X3 Model: HP DG0300BALVP Current Temperature (C): 32 Maximum Temperature (C): 45 PHY Count: 2 PHY Transfer Rate: Unknown, Unknown physicaldrive 2I:1:3 Port: 2I Box: 1 Bay: 3 Status: OK Drive Type: Data Drive Interface Type: SAS Size: 300 GB Rotational Speed: 10000 Firmware Revision: HPD4 Serial Number: 3SE0AHVH00009924P8F3 Model: HP DG0300BALVP Current Temperature (C): 34 Maximum Temperature (C): 47 PHY Count: 2 PHY Transfer Rate: Unknown, Unknown physicaldrive 2I:1:4 Port: 2I Box: 1 Bay: 4 Status: OK Drive Type: Data Drive Interface Type: SAS Size: 300 GB Rotational Speed: 10000 Firmware Revision: HPD4 Serial Number: 3SE08NAK00009924KWD6 Model: HP DG0300BALVP Current Temperature (C): 35 Maximum Temperature (C): 47 PHY Count: 2 PHY Transfer Rate: Unknown, Unknown

Read the article

Search Results

Search found 2629 results on 106 pages for 'idle timer'.

Page 98/106 | < Previous Page | 94 95 96 97 98 99 100 101 102 103 104 105 | Next Page >

- by Matt

- by franklundy

- by Alan M

- by Glister

- by danx

- by Dave

- by 3D1L

- by futureal

- by TessellatingHeckler

- by xenoterracide

- by Synetech

- by jiangchengwu

- by Joel Coel

- by dsx

- by Nooklez

- by theo.spears

- by Chris Hoffman

- by Karoly Vegh

- by Jeff

- by CyberShadow

- by rgillen

- by Towndrunk

- by Dan Pritts

- by BuckWoody

- by Leigh

< Previous Page | 94 95 96 97 98 99 100 101 102 103 104 105 | Next Page >