Search Results

Search found 205 results on 9 pages for 'outage'.

Page 8/9 | < Previous Page | 4 5 6 7 8 9 | Next Page >

SQL Server High Availability - Mirroring with MSCS?

- by David

I'm looking at options for high-availability for my SQL Server-powered application. The requirements are: HA protection from storage failure. Data accessibility when one of the DB servers is undergoing software updates (e.g. planned outage for Windows Update / SQL Server service-packs). Must not involve much in the way of hardware procurement. The application is an ASP.NET web application. The web application's users have their own database instances. I've seen two main options: SQL Server failover clustering, and SQL Server mirroring. I understand that SQL Server Failover Clustering requires the purchasing of a shared disk array and doesn't offer any protection if the shared storage goes down (so the documentation recommends to set up a Mirroring between two clusters). Database Mirroring seems the cheaper option (as it only requires two database servers and a simple witness box) - but I've heard it doesn't work well when you have a large number of databases. The application I'm developing involves giving each client their own database for their application - there could be hundreds of databases. Setting up the mirroring is no problem thanks to the automation systems we have in place. My final point concerns how failover works with respect to client connections - SQL Server Failover Clustering uses MSCS which means that the cluster is invisible to clients - a connection attempt might fail during the failover, but a simple reconnect will have it working again. However mirroring, as far as I know, requires that the client be aware of the mirrored partners: if the client cannot connect to the primary server then it tries the secondary server. I'm wondering how this work with respect to Connection Pooling in ASP.NET applications - does the client connection failovering mean that there's a potential 2-second (assuming 2000ms TCP timeout policy) pause when the connection pool tries the primary server on every connection attempt? I read somewhere that Mirroring can be used on top of MSCS which means that the client does not need to be aware of mirroring (so there wouldn't be any potential delays during connection, and also that no changes would need to be made to the client, not even the connection string) - however I'm finding it hard to get documentation or white papers on this approach. But if true, then it means the best method is then Mirroring (for HA) with MSCS (for client ignorance and connection performance). ...but how does this scale to a server instance that might contain hundreds of mirrored databases?

Read the article
How can I prevent an unintentional DDOS running ColdFusion 8 with IIS 6?

- by Eric Belair

We had an interesting outage today on one of our client's websites. Out of nowhere, the website was inaccessible. The website runs by itself on a dedicated physical Windows 2000 server (probably overkill, I know, but that's a discussion for a different day). After restarting IIS and ColdFusion Application Service, the problem came back several times. My initial thought was that it was a DNS issue, which happens occasionally - the last time it happened was after Hurricane Sandy when we our ISP was out, and we had to make some network config changes. But, it was not a DNS issue. My second thought was that it was a DDOS attack, but, there's very little reason anyone would want to take this site down. When we called our ISP, the operator on the other end noted that traffic was spiking significantly. As it turned out, the client had unintentionally caused a DDOS on the website, after they FTPed a very large video file, and then mass emailed a link to it. Hundreds of people clicked the link and brought the site to its knees. I am primarily a Website Programmer, but I often have to contribute to server administration at times. Sadly, I'm the resident ColdFusion and IIS expert, but I don't have a lot of experience with this issue. What are some basic steps that I can take to prevent this from happening in the future, since we cannot always control what files the client posts to the website. Here are some ideas I had, but I'm unsure of the impact: Limit the number of connections in IIS. Put media files on a separate server (like an Amazon site, etc.). File requests of this type currently behind a server-script (i.e. /www.site.com/viewFile.cfm?fileId=1424545, where the fileId references a file off the webroot) that logs requests, and pushes the file to the browser using CFCONTENT. I could edit this script to reject requests when they exceed a certain amount in a given time-frame (i.e. a 5MB can be accessed globally 10 times in an hour). This may cause some users frustration, but, if hundreds of users are attempting to view the file, the site is going to crash anyways, as it did today, which is way more frustrating, since there is no "pretty" message explaining why they can't get to the file. I'm open to any suggestions, as I'm continuing my research to report to the CTO with the best options, so that we can put a solution into effect. Thank you.

Read the article
APC ups es 700 randomly overload

- by Matteo Mosca

First of all, I live in Italy, Europe, so keep this in mind for Volt/Watt considerations. Standard voltage in Italian apartments is 220V. In my living room I have 2 APC ups, one being an ES-550 and the other an ES-700 They each have 4 slots for surge protection only, and 4 slots for surge protection + battery backup. Just to give all the information, they both got their battery replaced less than one month ago. The ES-550 works fine, without any problem. On the battery I have connected: Pc Monitor Sony Bravia 46'' 4th slot is empty The ES-700 has the following on battery: Xbox 360 Ps3 (standby when not used) Wii (standby when not used) Netgear 8 port switch (always on) Here's what happens: the ES-700, randomly, but mostly at night when I'm sleeping, goes like "overload", with the constant beep. If I try to shut it off keeping the power button pressed, nothing happens. The only thing that works is unplugging random stuff (sometimes unplugging 1 console works, other times I have to unplug all 4 devices). Every time this happens the problem is "real", meaning the 4 devices become unpowered, so it's not just an "alarm no working properly" problem. While I'm sleeping, of course, the power usage is what described on the list, 2 devices on standy, 1 off and 1 on. Today it happened again while I was playing with my Ps3. I unplugged it, problem went away. I plugged it again, and it kept working fine. I just can't figure out what's the problem. The only additional info I can provide is that this behaviour started after a big power outage last december 26 (a blackout that lasted almost all day) but the "surge protection" part of those UPS should be there for those problems, to leverage peaks when power goes away or gets restored. Another funny thing is, althought it might not be related, for a couple of days after that event the Wii was unable to power on, I thought its power transformer was broken, but then it suddenly started working again. I can be sure it's not the Wii overloading the UPS because the overload happens even if I leave the Wii unplugged. Any suggestion is really appreciated, and I can provide any additional info, if needed, that didn't come to mind right now. Thanks.

Read the article
Linux filesystem with inodes close on the disk

- by pts

I'd like to make the ls -laR /media/myfs on Linux as fast as possible. I'll have 1 million files on the filesystem, 2TB of total file size, and some directories containing as much as 10000 files. Which filesystem should I use and how should I configure it? As far as I understand, the reason why ls -laR is slow because it has to stat(2) each inode (i.e. 1 million stat(2)s), and since inodes are distributed randomly on the disk, each stat(2) needs one disk seek. Here are some solutions I had in mind, none of which I am satisfied with: Create the filesystem on an SSD, because the seek operations on SSDs are fast. This wouldn't work, because a 2TB SSD doesn't exist, or it's prohibitively expensive. Create a filesystem which spans on two block devices: an SSD and a disk; the disk contains file data, and the SSD contains all the metadata (including directory entries, inodes and POSIX extended attributes). Is there a filesystem which supports this? Would it survive a system crash (power outage)? Use find /media/myfs on ext2, ext3 or ext4, instead of ls -laR /media/myfs, because the former can the advantage of the d_type field (see in the getdents(2) man page), so it doesn't have to stat. Unfortunately, this doesn't meet my requirements, because I need all file sizes as well, which find /media/myfs doesn't print. Use a filesystem, such as VFAT, which stores inodes in the directory entries. I'd love this one, but VFAT is not reliable and flexible enough for me, and I don't know of any other filesystem which does that. Do you? Of course, storing inodes in the directory entries wouldn't work for files with a link count more than 1, but that's not a problem since I have only a few dozen such files in my use case. Adjust some settings in /proc or sysctl so that inodes are locked to system memory forever. This would not speed up the first ls -laR /media/myfs, but it would make all subsequent invocations amazingly fast. How can I do this? I don't like this idea, because it doesn't speed up the first invocation, which currently takes 30 minutes. Also I'd like to lock the POSIX extended attributes in memory as well. What do I have to do for that? Use a filesystem which has an online defragmentation tool, which can be instructed to relocate inodes to the the beginning of the block device. Once the relocation is done, I can run dd if=/dev/sdb of=/dev/null bs=1M count=256 to get the beginning of the block device fetched to the kernel in-memory cache without seeking, and then the stat(2) operations would be fast, because they read from the cache. Is there a way to lock those inodes and/or blocks into memory once they have been read? Which filesystem has such a defragmentation tool?

Read the article
Mysql innoDB corruption after server crash

- by Ward Loockx

Yesterday my server died because an outage in the data center. Today it's back up, but having some problems with mysql. First of all my mysql server was not able to start. For this reason I deleted the files ib_logfile0 and ib_logfile1 in /var/lib/mysql folder (I still have the old failing files). After this my server was able to startup again. But now I see a lot of issues in the mysql log file. Sep 1 09:43:55 * mysqld: 120901 9:43:55 InnoDB: Error: page 70944 log sequence number 8 1483471899 Sep 1 09:43:55 * mysqld: InnoDB: is in the future! Current system log sequence number 5 612394935. Sep 1 09:43:55 * mysqld: InnoDB: Your database may be corrupt or you may have copied the InnoDB Sep 1 09:43:55 * mysqld: InnoDB: tablespace but not the InnoDB log files. See Sep 1 09:43:55 * mysqld: InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html When I check the docs on mysql.com, I found that I need to recover my database with backups. I have a backup but not sure what's the good way on importing it. Or is there a way to recover without having to re-import the database again? So if I'm correct I need to put innodb_force_recovery to 4 in mysql and delete all current data and re-import? Is there a way to do this without having downtime? I also have one slave running. This slave has the current status now: Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. How can I totally reset the slave after the new import on the master has happend? Hopefully we can find a solution without not to much downtime. Thanks!

Read the article
I have been told to accept one error with Memtest86+

- by DustByte

Bought a new computer back in August with 4x4 GB RAM. Had problems with the RAM. They sent me four new sticks, which also generated errors. Singled out four sticks (from the eight I now had) that didn't generate any errors. Discovered by coincident a new RAM error last week (this time no BSOD). Contacted the company. According to them there have been issues with a bad stock from last summer so I got two tested 8 GB sticks sent to me. Been running Memtest86+ over the weekend. After 20 hours I got an error (see attached photo). The test has now been running for 37 hours but so far only this one error. I contacted the company where I bought the computer. They wrote back: I wouldn't worry about hat one fail. We have had similar situations here whereby it passes numerous times but then fails once. We think it's an issue with memtest, after all memory is faulty or it isn't so you can't really have it pass a few times, fail the next time around and then pass again! Please trust me on this and continue with the memory we sent you and if your problems continue we'll look at getting it replaced again. I gather from other forum posts that many people do not accept a single error. What could this single error signify, faulty RAM or a glitch in the MEMTEST program (or other)? Update: From the helpful comments below I conclude that an occasional (and rare) "random" error could occur and be acceptable, but repeated errors at the same address would indicate malfunction. Memtest has now run for 45 hours and I still have only one error. For everyone's information, I will keep running the test. In less than two days I am going away for a month. I will most likely leave Memtest running. As I do not have a UPS there is a risk that a power outage will ruin the experiment. The computer is a desktop so I cannot bring it with me (which would curiously have exposed it to more cosmic rays as I will be flying ;)).

Read the article
OTN Architect Day Headed to Reston, VA - May 16

- by Bob Rhubart

In 2011 OTN Architect Day made stops in Chicago, Denver, Phoenix, Redwood Shores, and Toronto. The 2012 series begins with OTN Architect Day in Reston, VA on Wednesday May 16. Registration is now open for this free event, but don't get caught napping -- seating is limited, and the event is just 5 weeks away. The information below reflects the most recent updates to the event agenda, including the addition of Oracle ACE Director Kai Yu as the guest keynote speaker. Kai is Senior System Engineer / Architect at Dell, Inc., and has been very busy of late as a speaker at various industry and Oracle User Group events. I'm very happy Kai has agreed to make the trek from his hometown in Austin, TX to share his insight at the Architect Day event in Reston. If you're in the area, put this one on your calendar. You won't be sorry. Venue Sheraton Reston Hotel 11810 Sunrise Valley Drive Reston, VA 20191 Event Agenda 8:30 am - 9:00 am Registration and Continental Breakfast 9:00 am - 9:15 am Welcome and Opening Comments 9:15 am - 10:00 am Engineered Systems: Oracle's Vision for the Future | Ralf Dossman Oracle's Exadata and Exalogic are impressive products in their own right. But working in combination they deliver unparalleled transaction processing performance with up to a 30x increase over existing legacy systems, with the lowest cost of ownership over a 3 or 5 year basis than any other hardware. In this session you'll learn how to leverage Oracle's Engineered Systems within your enterprise to deliver record-breaking performance at the lowest TCO. 10:00 am - 10:30 am High Availability Infrastructure for Cloud Computing | Kai Yu Infrastructure high availability is extremely critical to Cloud Computing. In a Cloud system that hosts a large number of databases and applications with different SLAs, any unplanned outage can be devastating, and even a small planned downtime may be unacceptable. This presentation will discuss various technology solutions and the related best practices that system architects should consider in cloud infrastructure design to ensure high availability. 10:30 am - 10:45 am Break 10:45 am - 11:30 am Breakout Sessions: (pick one) Innovations in Grid Computing with Oracle Coherence | Bjorn Boe Learn how Coherence can increase the availability, scalability and performance of your existing applications with its advanced low-latency data-grid technologies. Also hear some interesting industry-specific use cases that customers had implemented and how Oracle is integrating Coherence into its Enterprise Java stack. Cloud Computing - Making IT Simple | Scott Mattoon The road to Cloud Computing is not without a few bumps. This session will help to smooth out your journey by tackling some of the potential complications. We'll examine whether standardization is a prerequisite for the Cloud. We'll look at why refactoring isn't just for application code. We'll check out deployable entities and their simplification via higher levels of abstraction. And we'll close out the session with a look at engineered systems and modular clouds. 11:30 pm - 12:15 pm Breakout Sessions: (pick one) Oracle Enterprise Manager | Joe Diemer Oracle Enterprise Manager (EM) provides complete lifecycle management for the cloud - from automated cloud setup to self-service delivery to cloud operations. In this session you'll learn how to take control of your cloud infrastructure with EM features including Consolidation Planning and Self-Service provisioning with Metering and Chargeback. Come hear how Oracle is expanding its management capabilities into the cloud! Rationalization and Defense in Depth - Two Steps Closer to the Clouds | Dave Chappelle Security represents one of the biggest concerns about cloud computing. In this session we'll get past the FUD with a real-world look at some key issues. We'll discuss the infrastructure necessary to support rationalization and security services, explore architecture for defense -in-depth, and deal frankly with the good, the bad, and the ugly in Cloud security. 12:15 pm - 1:15 pm Lunch 1:40 pm - 2:00 pm Panel Discussion - Q&A 2:00 pm - 2:45 pm Breakout Sessions: (pick one) 21st Century SOA | Peter Belknap Service Oriented Architecture has evolved from concept to reality in the last decade. The right methodology coupled with mature SOA technologies has helped customers demonstrate success in both innovation and ROI. In this session you will learn how Oracle SOA Suite's orchestration, virtualization, and governance capabilities provide the infrastructure to run mission critical business and system applications. And we'll take a special look at the convergence of SOA & BPM using Oracle's Unified technology stack. Track B: Oracle Cloud Reference Architecture | Anbu Krishnaswamy Cloud initiatives are beginning to dominate enterprise IT roadmaps. Successful adoption of Cloud and the subsequent governance challenges warrant a Cloud reference architecture that is applied consistently across the enterprise. This presentation gives an overview of Oracle's Cloud Reference Architecture, which is part of the Cloud Enterprise Technology Strategy (ETS). Concepts covered include common management layer capabilities, service models, resource pools, and use cases. 2:45 pm - 3:00 pm Break 3:00 pm - 4:00 pm Roundtable Discussions 4:00 pm - 4:15 pm Closing Comments & Readouts from Roundtable 4:15 pm - 5:00 pm Cocktail Reception / Networking Session schedule and content subject to change.

Read the article
Open World Day 3

- by Antony Reynolds

A Day in the Life of an Oracle OpenWorld Attendee Part IV My third day was exhibition day for me! I took the opportunity to wander around the JavaOne and OpenWorld exhibitions to see what might be useful for me when selling WebLogic, Coherence & SOA Suite. I found a number of interesting vendors and thought I would share what I found here. These are not necessarily endorsements, but observations on companies that I thought had interesting looking products that fill a need I have seen at customers. Highly Available EBS Upgrades A few years ago I worked with a customer that was a port authority. They wanted to tie E-Business Suite into their operations to provide faster processing of cargo and passengers. However they only had a 2 hour downtime window to perform upgrades. This was not a problem for core database and middleware technology, this could accommodate those upgrade timescales easily. It was a problem for EBS however so I intrigued to find Rapid E-Suite Inc offering an 11i to 12i upgrade service that claims to require no outage. This could be a real boon to EBS customers like my port friends that need to upgrade without disruption to their business. Mobile on WebLogic I have come across a number of customers who want a comprehensive mobile solution, connected and disconnected operation and so forth. ADF only addresses part of these requirements currently so I was excited to discover mFrontiers Inc offering an apparently comprehensive solution that should integrate easily with Oracle SOA Suite to mobile enable a SOA infrastructure. The ability to operate without a network is important for many applications, particularly in industries that require their engineers to enter buildings to perform maintenance or repairs, because network access is not always available – many of my colleagues don’t have mobile access from their homes because they live in the middle of nowhere – and disconnected support is crucial in these situations. Sharepoint Connector for WebCenter Content Obviously Sharepoint is an evil pernicious intrusion into a companies IT estate but it is widely deployed and many people like it but also would like to take advantage of Oracle products such as WebCenter Content. So I was encouraged to see that Fishbowl Solutions have created a connector for Sharepoint that allows it to bring in content from WebCenter, it looks like a valuable way to maintain the Sharepoint interface end users are used to but extend the range of content by pulling stuff (technical term for content) from WebCenter. Load Balancing The Enterprise Deployment Guides are Oracles bible on building highly available FMW environments, and each of them requires a front end load balancer. I have been asked to help configure F5 Load Balancers on a number of occasions over my time at Oracle and each time I come back to it I find more useful features have been added to the BigIP line of load balancers that F5 sell, many of their documents are tailored to FMW. I like F5, they provide (relatively) easy to use products that do what they say on the side of the box. They may not have all the bells and whistles of some of their more expensive competitors but they do the job and do it well! Besides which I like their logo! Other Stuff I saw lots of other interesting products and services, such as a lightweight monitoring tool for Coherence, Forms migration services, JCAPS migration services and lots of cool freebies to take home to the children! A Quiet Night Wednesday night was the partner appreciation event and I had decided to go back to the hotel and have an early night. I decided to attend the last session of the day – a Maven/Hudson/WebLogic tutorial. I got the wrong hotel for the session and snuck in 20 minutes late at the back and starting working on the hands on workshop. One of my co-attendees raised his hand for help and as the presenter came over to help he suddenly stopped and yelled – “Is that Antony”! It was my old friend Steve Button who used to be based in Redwood Shores but is now a WebLogic guru PM in Australia. It was good to catch up with him. As he yelled out a guy with really bad posture turned around to see who he was talking to, this turned out to be my friend Simon Haslan, Oracle ACE from the UK. After the tutorial Simon and I retired to the coffee shop to catch up and share stories. 2 and half hours later we decided it was time to retire, so much for an early night but great to renew old friendships and find out what real customers are worrying about.

Read the article
How to Achieve Real-Time Data Protection and Availabilty....For Real

- by JoeMeeks

There is a class of business and mission critical applications where downtime or data loss have substantial negative impact on revenue, customer service, reputation, cost, etc. Because the Oracle Database is used extensively to provide reliable performance and availability for this class of application, it also provides an integrated set of capabilities for real-time data protection and availability. Active Data Guard, depicted in the figure below, is the cornerstone for accomplishing these objectives because it provides the absolute best real-time data protection and availability for the Oracle Database. This is a bold statement, but it is supported by the facts. It isn’t so much that alternative solutions are bad, it’s just that their architectures prevent them from achieving the same levels of data protection, availability, simplicity, and asset utilization provided by Active Data Guard. Let’s explore further. Backups are the most popular method used to protect data and are an essential best practice for every database. Not surprisingly, Oracle Recovery Manager (RMAN) is one of the most commonly used features of the Oracle Database. But comparing Active Data Guard to backups is like comparing apples to motorcycles. Active Data Guard uses a hot (open read-only), synchronized copy of the production database to provide real-time data protection and HA. In contrast, a restore from backup takes time and often has many moving parts - people, processes, software and systems – that can create a level of uncertainty during an outage that critical applications can’t afford. This is why backups play a secondary role for your most critical databases by complementing real-time solutions that can provide both data protection and availability. Before Data Guard, enterprises used storage remote-mirroring for real-time data protection and availability. Remote-mirroring is a sophisticated storage technology promoted as a generic infrastructure solution that makes a simple promise – whatever is written to a primary volume will also be written to the mirrored volume at a remote site. Keeping this promise is also what causes data loss and downtime when the data written to primary volumes is corrupt – the same corruption is faithfully mirrored to the remote volume making both copies unusable. This happens because remote-mirroring is a generic process. It has no intrinsic knowledge of Oracle data structures to enable advanced protection, nor can it perform independent Oracle validation BEFORE changes are applied to the remote copy. There is also nothing to prevent human error (e.g. a storage admin accidentally deleting critical files) from also impacting the remote mirrored copy. Remote-mirroring tricks users by creating a false impression that there are two separate copies of the Oracle Database. In truth; while remote-mirroring maintains two copies of the data on different volumes, both are part of a single closely coupled system. Not only will remote-mirroring propagate corruptions and administrative errors, but the changes applied to the mirrored volume are a result of the same Oracle code path that applied the change to the source volume. There is no isolation, either from a storage mirroring perspective or from an Oracle software perspective. Bottom line, storage remote-mirroring lacks both the smarts and isolation level necessary to provide true data protection. Active Data Guard offers much more than storage remote-mirroring when your objective is protecting your enterprise from downtime and data loss. Like remote-mirroring, an Active Data Guard replica is an exact block for block copy of the primary. Unlike remote-mirroring, an Active Data Guard replica is NOT a tightly coupled copy of the source volumes - it is a completely independent Oracle Database. Active Data Guard’s inherent knowledge of Oracle data block and redo structures enables a separate Oracle Database using a different Oracle code path than the primary to use the full complement of Oracle data validation methods before changes are applied to the synchronized copy. These include: physical check sum, logical intra-block checking, lost write validation, and automatic block repair. The figure below illustrates the stark difference between the knowledge that remote-mirroring can discern from an Oracle data block and what Active Data Guard can discern. An Active Data Guard standby also provides a range of additional services enabled by the fact that it is a running Oracle Database - not just a mirrored copy of data files. An Active Data Guard standby database can be open read-only while it is synchronizing with the primary. This enables read-only workloads to be offloaded from the primary system and run on the active standby - boosting performance by utilizing all assets. An Active Data Guard standby can also be used to implement many types of system and database maintenance in rolling fashion. Maintenance and upgrades are first implemented on the standby while production runs unaffected at the primary. After the primary and standby are synchronized and all changes have been validated, the production workload is quickly switched to the standby. The only downtime is the time required for user connections to transfer from one system to the next. These capabilities further expand the expectations of availability offered by a data protection solution beyond what is possible to do using storage remote-mirroring. So don’t be fooled by appearances. Storage remote-mirroring and Active Data Guard replication may look similar on the surface - but the devil is in the details. Only Active Data Guard has the smarts, the isolation, and the simplicity, to provide the best data protection and availability for the Oracle Database. Stay tuned for future blog posts that dive into the many differences between storage remote-mirroring and Active Data Guard along the dimensions of data protection, data availability, cost, asset utilization and return on investment. For additional information on Active Data Guard, see: Active Data Guard Technical White Paper Active Data Guard vs Storage Remote-Mirroring Active Data Guard Home Page on the Oracle Technology Network

Read the article
Availability Best Practices on Oracle VM Server for SPARC

- by jsavit

This is the first of a series of blog posts on configuring Oracle VM Server for SPARC (also called Logical Domains) for availability. This series will show how to how to plan for availability, improve serviceability, avoid single points of failure, and provide resiliency against hardware and software failures. Availability is a broad topic that has filled entire books, so these posts will focus on aspects specifically related to Oracle VM Server for SPARC. The goal is to improve Reliability, Availability and Serviceability (RAS): An article defining RAS can be found here. Oracle VM Server for SPARC Principles for Availability Let's state some guiding principles for availability that apply to Oracle VM Server for SPARC: Avoid Single Points Of Failure (SPOFs). Systems should be configured so a component failure does not result in a loss of application service. The general method to avoid SPOFs is to provide redundancy so service can continue without interruption if a component fails. For a critical application there may be multiple levels of redundancy so multiple failures can be tolerated. Oracle VM Server for SPARC makes it possible to configure systems that avoid SPOFs. Configure for availability at a level of resource and effort consistent with business needs. Effort and resource should be consistent with business requirements. Production has different availability requirements than test/development, so it's worth expending resources to provide higher availability. Even within the category of production there may be different levels of criticality, outage tolerances, recovery and repair time requirements. Keep in mind that a simple design may be more understandable and effective than a complex design that attempts to "do everything". Design for availability at the appropriate tier or level of the platform stack. Availability can be provided in the application, in the database, or in the virtualization, hardware and network layers they depend on - or using a combination of all of them. It may not be necessary to engineer resilient virtualization for stateless web applications applications where availability is provided by a network load balancer, or for enterprise applications like Oracle Real Application Clusters (RAC) and WebLogic that provide their own resiliency. It's (often) the same architecture whether virtual or not: For example, providing resiliency against a lost device path or failing disk media is done for the same reasons and may use the same design whether in a domain or not. It's (often) the same technique whether using domains or not: Many configuration steps are the same. For example, configuring IPMP or creating a redundant ZFS pool is pretty much the same within the guest whether you're in a guest domain or not. There are configuration steps and choices for provisioning the guest with the virtual network and disk devices, which we will discuss. Sometimes it is different using domains: There are new resources to configure. Most notable is the use of alternate service domains, which provides resiliency in case of a domain failure, and also permits improved serviceability via "rolling upgrades". This is an important differentiator between Oracle VM Server for SPARC and traditional virtual machine environments where all virtual I/O is provided by a monolithic infrastructure that itself is a SPOF. Alternate service domains are widely used to provide resiliency in production logical domains environments. Some things are done via logical domains commands, and some are done in the guest: For example, with Oracle VM Server for SPARC we provide multiple network connections to the guest, and then configure network resiliency in the guest via IP Multi Pathing (IPMP) - essentially the same as for non-virtual systems. On the other hand, we configure virtual disk availability in the virtualization layer, and the guest sees an already-resilient disk without being aware of the details. These blogs will discuss configuration details like this. Live migration is not "high availability" in the sense of "continuous availability": If the server is down, then you don't live migrate from it! (A cluster or VM restart elsewhere would be used). However, live migration can be part of the RAS (Reliability, Availability, Serviceability) picture by improving Serviceability - you can move running domains off of a box before planned service or maintenance. The blog Best Practices - Live Migration on Oracle VM Server for SPARC discusses this. Topics Here are some of the topics that will be covered: Network availability using IP Multipathing and aggregates Disk path availability using virtual disks defined with multipath groups ("mpgroup") Disk media resiliency configuring for redundant disks that can tolerate media loss Multiple service domains - this is probably the most significant item and the one most specific to Oracle VM Server for SPARC. It is very widely deployed in production environments as the means to provide network and disk availability, but it can be confusing. Subsequent articles will describe why and how to configure multiple service domains. Note, for the sake of precision: an I/O domain is any domain that has a physical I/O resource (such as a PCIe bus root complex). A service domain is a domain providing virtual device services to other domains; it is almost always an I/O domain too (so it can have something to serve). Resources Here are some important links; we'll be drawing on their content in the next several articles: Oracle VM Server for SPARC Documentation Maximizing Application Reliability and Availability with SPARC T5 Servers whitepaper by Gary Combs Maximizing Application Reliability and Availability with the SPARC M5-32 Server whitepaper by Gary Combs Summary Oracle VM Server for SPARC offers features that can be used to provide highly-available environments. This and the following blog entries will describe how to plan and deploy them.

Read the article
Computer Networks UNISA - Chap 14 – Insuring Integrity & Availability

- by MarkPearl

After reading this section you should be able to Identify the characteristics of a network that keep data safe from loss or damage Protect an enterprise-wide network from viruses Explain network and system level fault tolerance techniques Discuss issues related to network backup and recovery strategies Describe the components of a useful disaster recovery plan and the options for disaster contingencies What are integrity and availability? Integrity – the soundness of a networks programs, data, services, devices, and connections Availability – How consistently and reliably a file or system can be accessed by authorized personnel A number of phenomena can compromise both integrity and availability including… security breaches natural disasters malicious intruders power flaws human error users etc Although you cannot predict every type of vulnerability, you can take measures to guard against the most damaging events. The following are some guidelines… Allow only network administrators to create or modify NOS and application system users. Monitor the network for unauthorized access or changes Record authorized system changes in a change management system’ Install redundant components Perform regular health checks on the network Check system performance, error logs, and the system log book regularly Keep backups Implement and enforce security and disaster recovery policies These are just some of the basics… Malware Malware refers to any program or piece of code designed to intrude upon or harm a system or its resources. Types of Malware… Boot sector viruses Macro viruses File infector viruses Worms Trojan Horse Network Viruses Bots Malware characteristics Some common characteristics of Malware include… Encryption Stealth Polymorphism Time dependence Malware Protection There are various tools available to protect you from malware called anti-malware software. These monitor your system for indications that a program is performing potential malware operations. A number of techniques are used to detect malware including… Signature Scanning Integrity Checking Monitoring unexpected file changes or virus like behaviours It is important to decide where anti-malware tools will be installed and find a balance between performance and protection. There are several general purpose malware policies that can be implemented to protect your network including… Every compute in an organization should be equipped with malware detection and cleaning software that regularly runs Users should not be allowed to alter or disable the anti-malware software Users should know what to do in case the anti-malware program detects a malware virus Users should be prohibited from installing any unauthorized software on their systems System wide alerts should be issued to network users notifying them if a serious malware virus has been detected. Fault Tolerance Besides guarding against malware, another key factor in maintaining the availability and integrity of data is fault tolerance. Fault tolerance is the ability for a system to continue performing despite an unexpected hardware or software malfunction. Fault tolerance can be realized in varying degrees, the optimal level of fault tolerance for a system depends on how critical its services and files are to productivity. Generally the more fault tolerant the system, the more expensive it is. The following describe some of the areas that need to be considered for fault tolerance. Environment (Temperature and humidity) Power Topology and Connectivity Servers Storage Power Typical power flaws include Surges – a brief increase in voltage due to lightening strikes, solar flares or some idiot at City Power Noise – Fluctuation in voltage levels caused by other devices on the network or electromagnetic interference Brownout – A sag in voltage for just a moment Blackout – A complete power loss The are various alternate power sources to consider including UPS’s and Generators. UPS’s are found in two categories… Standby UPS – provides continuous power when mains goes down (brief period of switching over) Online UPS – is online all the time and the device receives power from the UPS all the time (the UPS is charged continuously) Servers There are various techniques for fault tolerance with servers. Server mirroring is an option where one device or component duplicates the activities of another. It is generally an expensive process. Clustering is a fault tolerance technique that links multiple servers together to appear as a single server. They share processing and storage responsibilities and if one unit in the cluster goes down, another unit can be brought in to replace it. Storage There are various techniques available including the following… RAID Arrays NAS (Storage (Network Attached Storage) SANs (Storage Area Networks) Data Backup A backup is a copy of data or program files created for archiving or safekeeping. Many different options for backups exist with various media including… These vary in cost and speed. Optical Media Tape Backup External Disk Drives Network Backups Backup Strategy After selecting the appropriate tool for performing your servers backup, devise a backup strategy to guide you through performing reliable backups that provide maximum data protection. Questions that should be answered include… What data must be backed up At what time of day or night will the backups occur How will you verify the accuracy of the backups Where and for how long will backup media be stored Who will take responsibility for ensuring that backups occurred How long will you save backups Where will backup and recovery documentation be stored Different backup methods provide varying levels of certainty and corresponding labour cost. There are also different ways to determine which files should be backed up including… Full backup – all data on all servers is copied to storage media Incremental backup – Only data that has changed since the last full or incremental backup is copied to a storage medium Differential backup – Only data that has changed since the last backup is coped to a storage medium Disaster Recovery Disaster recovery is the process of restoring your critical functionality and data after an enterprise wide outage has occurred. A disaster recovery plan is for extreme scenarios (i.e. fire, line fault, etc). A cold site is a place were the computers, devices, and connectivity necessary to rebuild a network exist but they are not appropriately configured. A warm site is a place where the computers, devices, and connectivity necessary to rebuild a network exists with some appropriately configured devices. A hot site is a place where the computers, devices, and connectivity necessary to rebuild a network exists and all are appropriately configured.

Read the article
24 Hours of PASS – first reflections

- by Rob Farley

A few days after the end of 24HOP, I find myself reflecting on it. I’m still waiting on most of the information. I want to be able to discover things like where the countries represented on each of the sessions, and things like that. So far, I have the feedback scores and the numbers of attendees. The data was provided in a PDF, so while I wait for it to appear in a more flexible format, I’ve pushed the 24 attendee numbers into Excel. This chart shows the numbers by time. Remember that we started at midnight GMT, which was 10:30am in my part of the world and 8pm in New York. It’s probably no surprise that numbers drooped a bit at the start, stayed comparatively low, and then grew as the larger populations of the English-speaking world woke up. I remember last time 24HOP ran for 24 hours straight, there were quite a few sessions with less than 100 attendees. None this time though. We got close, but even when it was 4am in New York, 8am in London and 7pm in Sydney (which would have to be the worst slot for attracting people), we still had over 100 people tuning in. As expected numbers grew as the UK woke up, and even more so as the US did, with numbers peaking at 755 for the “3pm in New York” session on SQL Server Data Tools. Kendra Little almost reached those numbers too, and certainly contributed the biggest ‘spike’ on the chart with her session five hours earlier. Of all the sessions, Kendra had the highest proportion of ‘Excellent’s for the “Overall Evaluation of the session” question, and those of you who saw her probably won’t be surprised by that. Kendra had one of the best ranked sessions from the 24HOP event this time last year (narrowly missing out on being top 3), and she has produced a lot of good video content since then. The reports indicate that there were nearly 8.5 thousand attendees across the 24 sessions, averaging over 350 at each one. I’m looking forward to seeing how many different people that was, although I do know that Wil Sisney managed to attend every single one (if you did too, please let me know). Wil even moderated one of the sessions, which made his feat even greater. Thanks Wil. I also want to send massive thanks to Dave Dustin. Dave probably would have attended all of the sessions, if it weren’t for a power outage that forced him to take a break. He was also a moderator, and it was during this session that he earned special praise. Part way into the session he was moderating, the speaker lost connectivity and couldn’t get back for about fifteen minutes. That’s an incredibly long time when you’re in a live presentation. There were over 200 people tuned in at the time, and I’m sure Dave was as stressed as I was to have a speaker disappear. I started chasing down a phone number for the speaker, while Dave spoke to the audience. And he did brilliantly. He started answering questions, and kept doing that until the speaker came back. Bear in mind that Dave hadn’t expected to give a presentation on that topic (or any other), and was simply drawing on his SQL expertise to get him through. Also consider that this was between midnight at 1am in Dave’s part of the world (Auckland, NZ). I would’ve been expecting just to welcome people, monitor questions, probably read some out, and in general, help make things run smoothly. He went far beyond the call of duty, and if I had a medal to give him, he’d definitely be getting one. On the whole, I think this 24HOP was a success. We tried a different platform, and I think for the most part it was a popular move. We didn’t ask the question “Was this better than LiveMeeting?”, but we did get a number of people telling us that they thought the platform was very good. Some people have told me I get a chance to put my feet up now that this is over. As I’m also co-ordinating a tour of SQLSaturday events across the Australia/New Zealand region, I don’t quite get to take that much of a break (plus, there’s the little thing of squeezing in seven SQL 2012 exams over the next 2.5 weeks). But I am pleased to be reflecting on this event rather than anticipating it. There were a number of factors that could have gone badly, but on the whole I’m pleased about how it went. A massive thanks to everyone involved. If you’re reading this and thinking you wish you could’ve tuned in more, don’t worry – they were all recorded and you’ll be able to watch them on demand very soon. But as well as that, PASS has a stream of content produced by the Virtual Chapters, so you can keep learning from the comfort of your desk all year round. More info on them at sqlpass.org, of course.

Read the article
High Availability for IaaS, PaaS and SaaS in the Cloud

- by BuckWoody

Outages, natural disasters and unforeseen events have proved that even in a distributed architecture, you need to plan for High Availability (HA). In this entry I'll explain a few considerations for HA within Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). In a separate post I'll talk more about Disaster Recovery (DR), since each paradigm has a different way to handle that. Planning for HA in IaaS IaaS involves Virtual Machines - so in effect, an HA strategy here takes on many of the same characteristics as it would on-premises. The primary difference is that the vendor controls the hardware, so you need to verify what they do for things like local redundancy and so on from the hardware perspective. As far as what you can control and plan for, the primary factors fall into three areas: multiple instances, geographical dispersion and task-switching. In almost every cloud vendor I've studied, to ensure your application will be protected by any level of HA, you need to have at least two of the Instances (VM's) running. This makes sense, but you might assume that the vendor just takes care of that for you - they don't. If a single VM goes down (for whatever reason) then the access to it is lost. Depending on multiple factors, you might be able to recover the data, but you should assume that you can't. You should keep a sync to another location (perhaps the vendor's storage system in another geographic datacenter or to a local location) to ensure you can continue to serve your clients. You'll also need to host the same VM's in another geographical location. Everything from a vendor outage to a network path problem could prevent your users from reaching the system, so you need to have multiple locations to handle this. This means that you'll have to figure out how to manage state between the geo's. If the system goes down in the middle of a transaction, you need to figure out what part of the process the system was in, and then re-create or transfer that state to the second set of systems. If you didn't write the software yourself, this is non-trivial. You'll also need a manual or automatic process to detect the failure and re-route the traffic to your secondary location. You could flip a DNS entry (if your application can tolerate that) or invoke another process to alias the first system to the second, such as load-balancing and so on. There are many options, but all of them involve coding the state into the application layer. If you've simply moved a state-ful application to VM's, you may not be able to easily implement an HA solution. Planning for HA in PaaS Implementing HA in PaaS is a bit simpler, since it's built on the concept of stateless applications deployment. Once again, you need at least two copies of each element in the solution (web roles, worker roles, etc.) to remain available in a single datacenter. Also, you need to deploy the application again in a separate geo, but the advantage here is that you could work out a "shared storage" model such that state is auto-balanced across the world. In fact, you don't have to maintain a "DR" site, the alternate location can be live and serving clients, and only take on extra load if the other site is not available. In Windows Azure, you can use the Traffic Manager service top route the requests as a type of auto balancer. Even with these benefits, I recommend a second backup of storage in another geographic location. Storage is inexpensive; and that second copy can be used for not only HA but DR. Planning for HA in SaaS In Software-as-a-Service (such as Office 365, or Hadoop in Windows Azure) You have far less control over the HA solution, although you still maintain the responsibility to ensure you have it. Since each SaaS is different, check with the vendor on the solution for HA - and make sure you understand what they do and what you are responsible for. They may have no HA for that solution, or pin it to a particular geo, or perhaps they have a massive HA built in with automatic load balancing (which is often the case). All of these options (with the exception of SaaS) involve higher costs for the design. Do not sacrifice reliability for cost - that will always cost you more in the end. Build in the redundancy and HA at the very outset of the project - if you try to tack it on later in the process the business will push back and potentially not implement HA. References: http://www.bing.com/search?q=windows+azure+High+Availability (each type of implementation is different, so I'm routing you to a search on the topic - look for the "Patterns and Practices" results for the area in Azure you're interested in)

Read the article
Clarification On Write-Caching Policy, Its Underlying Options And How It Applies To Hard Drives And Solid-State Drives

- by Boris_yo

In last week after doing more research on subject matter, I have been wondering about what I have been neglecting all those years to understand write-caching policy, always leaving it on default setting. Write-caching policy improves writing performance and consists of write-back caching and write-cache buffer flushing. This is how I understand all the above, but correct me if I erred somewhere: Write-through cache / Write-through caching itself is not a part of write caching policy per se and it's when data is written to both cache and storage device so if Windows will need that data later again, it is retrieved from cache and not from storage device which means only improved read performance as there is no need for waiting for storage device to read required data again. Since data is still written to storage device, write performance isn't improved and represents no risk of data loss or corruption in case of power failure or system crash while only data in cache gets lost. This option seems to be enabled by default and is recommended for removable devices with no need to use function of "Safely Remove Hardware" on user's part. Write-back caching is similar to above but without writing data to storage device, periodically releasing data from cache and writing to storage device when it is idle. In my opinion this option improves both read and write performance but represents risk if power failure or system crash occurs with the outcome of not only losing data eventually to be written to storage device, but causing file inconsistencies or corrupted file system. Write-back caching cannot be enabled together with write-through caching and it is not recommended to be enabled if no backup power supply is availabe. Write-cache buffer flushing I reckon is similar to write-back caching but enables immediate release and writing of data from cache to storage device right before power outage occurs but I don't know if it applies also to occasional system crash. This option seem to be complementary to write-back cache reducing or potentially eliminating risk of data loss or corruption of file system. I have questions about relevance of last 2 options to today's modern SSDs in order to get best performance and with less wear on SSDs: I know that traditional hard drives come with onboard cache (I wonder what type of cache that is), but do SSDs also come with cache? Assuming they do, is this cache faster than their NAND flash and system RAM and worth taking the risk of utilizing it by enabling write-back cache? I read somewhere that generally storage device's cache is faster than RAM, but I want to be sure. Additionally I read that write-caching should be enabled since current data that is to be written later to NAND flash is kept for a while in cache and provided there is data that gets modified a lot before finally being written, holding of this data and its periodic release reduces its write times to SSD thereby reducing its wearing. Now regarding to write-cache buffer flushing, I heard that SSD controllers are so fast by themselves that enabling this option is not required, because they manage flushing. However, once again, I don't know if SSDs have their own onboard cache and whether or not it is faster than their NAND flash and system RAM because if it is, keeping this option enabled would make sense. Recently I have posted question about issue with my Intel 330 SSD 120GB which was main reason to do deeper research having suspicion of write-caching policy being the culprit of SSD's freezing issue assuming data being released is what causes freezes. Currently I have write-cache enabled and write-cache buffer flushing disabled because I believe SSD controller's management of write-cache flushing and Windows write-cache buffer flushing are conflicting with each other: Since I want to troubleshoot in small steps to finally determine the source of issue, I have decided to start with write-caching policy and the move to drivers, switching to AHCI later on and finally disabling DIPM (device initiated power management) through registry modification thanks to @TomWijsman

Read the article
Week in Geek: New Security Flaw Confirmed for Internet Explorer Edition

- by Asian Angel

This week we learned how to use a PC to stay entertained while traveling for the holidays, create quality photo prints with free software, share links between any browser and any smartphone, create perfect Christmas photos using How-To Geek’s 10 best how-to photo guides, and had fun decorating Firefox with a collection of Holiday 2010 Personas themes. Photo by Repoort. Random Geek Links Photo by Asian Angel. Critical 0-Day Flaw Affects All Internet Explorer Versions, Microsoft Warns Microsoft has confirmed a zero-day vulnerability affecting all supported versions of Internet Explorer, including IE8, IE7 and IE6. Note: Article contains link to Microsoft Security Advisory detailing two work-arounds until a security update is released. Hackers targeting human rights, indie media groups Hackers are increasingly hitting the Web sites of human rights and independent media groups in an attempt to silence them, says a new study released this week by Harvard University’s Berkman Center for Internet & Society. OpenBSD: audits give no indication of back doors So far, the analyses of OpenBSD’s crypto and IPSec code have not provided any indication that the system contains back doors for listening to encrypted VPN connections. But the developers have already found two bugs during their current audits. Sophos: Beware Facebook’s new facial-recognition feature Facebook’s new facial recognition software might result in undesirable photos of users being circulated online, warned a security expert, who urged users to keep abreast with the social network’s privacy settings to prevent the abovementioned scenario from becoming a reality. Microsoft withdraws flawed Outlook update Microsoft has withdrawn update KB2412171 for Outlook 2007, released last Patch Tuesday, after a number of user complaints. Skype: Millions still without service Skype was still working to right itself going into the holiday weekend from a major outage that began this past Wednesday. Mozilla improves sync setup and WebGL in Firefox 4 beta 8 Firefox 4.0 beta 8 brings better support for WebGL and introduces an improved setup process for Firefox Sync that simplifies the steps for configuring the synchronization service across multiple devices. Chrome OS the litmus test for cloud The success or failure of Google’s browser-oriented Chrome OS will be the litmus test to decide if the cloud is capable of addressing user needs for content and services, according to a new Ovum report released Monday. FCC Net neutrality rules reach mobile apps The Federal Communications Commission (FCC) finally released its long-expected regulations on Thursday and the related explanations total a whopping 194 pages. One new item that was not previously disclosed: mobile wireless providers can’t block “applications that compete with the provider’s” own voice or video telephony services. KDE and the Document Foundation join Open Invention Network The KDE e.V. and the Document Foundation (TDF) have both joined the Open Invention Network (OIN) as licensees, expanding the organization’s roster of supporters. Report: SEC looks into Hurd’s ousting from HP The scandal surrounding Mark Hurd’s departure from the world’s largest technology company in August has officially drawn attention from the U.S. Securities and Exchange Commission. Report: Google requests delay of new Google TVs Google TV is apparently encountering a bit of static that has resulted in a programming change. Geek Video of the Week This week we have a double dose of geeky video goodness for you with the original Mac vs PC video and the trailer for the sequel. Photo courtesy of Peacer. Mac vs PC Photo courtesy of Peacer. Mac vs PC 2 Trailer Random TinyHacker Links Awesome Tools To Extract Audio From Video Here’s a list of really useful, and free tools to rip audio from videos. Getting Your iPhone Out of Recovery Mode Is your iPhone stuck in recovery mode? This tutorial will help you get it out of that state. Google Shared Spaces Quickly create a shared space and collaborate with friends online. McAfee Internet Security 2011 – Upgrade not worthy of a version change McAfee has released their 2011 version of security products. And as this review details, the upgrades are minimal when compared to their 2010 products. For more information, check out the review. 200 Countries Plotted Hans Rosling’s famous lectures combine enormous quantities of public data with a sport’s commentator’s style to reveal the story of the world’s past, present and future development. Now he explores stats in a way he has never done before – using augmented reality animation. Super User Questions Enjoy looking through this week’s batch of popular questions and answers from Super User. How to restore windows 7 to a known working state every time it boots? Is there an easy way to mass-transfer all files between two computers? Coffee spilled inside computer, damaged hard drive Computer does not boot after ram upgrade Keyboard not detected when trying to install Ubuntu 10.10 How-To Geek Weekly Article Recap Have you had a super busy week while preparing for the holiday weekend? Then here is your chance to get caught up on your reading with our five hottest articles for the week. Ask How-To Geek: Rescuing an Infected PC, Installing Bloat-free iTunes, and Taming a Crazy Trackpad How to Use the Avira Rescue CD to Clean Your Infected PC Eight Geektacular Christmas Projects for Your Day Off VirtualBox 4.0 Rocks Extensions and a Simplified GUI Ask the Readers: How Many Monitors Do You Use with Your Computer? One Year Ago on How-To Geek Here are more great articles from one year ago for you to read and enjoy during the holiday break. Enjoy Distraction-Free Writing with WriteMonkey Shutter is a State of Art Screenshot Tool for Ubuntu Get Hex & RGB Color Codes the Easy Way Find User Scripts for Your Favorite Websites the Easy Way Access Your Unsorted Bookmarks the Easy Way (Firefox) The Geek Note That “wraps” things up for this week and we hope that everyone enjoys the rest of their holiday break! Found a great tip during the break? Then be sure to send it in to us at [email protected]. Photo by ArSiSa7. Latest Features How-To Geek ETC How to Use the Avira Rescue CD to Clean Your Infected PC The Complete List of iPad Tips, Tricks, and Tutorials Is Your Desktop Printer More Expensive Than Printing Services? 20 OS X Keyboard Shortcuts You Might Not Know HTG Explains: Which Linux File System Should You Choose? HTG Explains: Why Does Photo Paper Improve Print Quality? Simon’s Cat Explores the Christmas Tree! [Video] The Outdoor Lights Scene from National Lampoon’s Christmas Vacation [Video] The Famous Home Alone Pizza Delivery Scene [Classic Video] Chronicles of Narnia: The Voyage of the Dawn Treader Theme for Windows 7 Cardinal and Rabbit Sharing a Tree on a Cold Winter Morning Wallpaper An Alternate Star Wars Christmas Special [Video]

Read the article
Slicing the EDG

- by Antony Reynolds

Different SOA Domain Configurations In this blog entry I would like to introduce three different configurations for a SOA environment. I have omitted load balancers and OTD/OHS as they introduce a whole new round of discussion. For each possible deployment architecture I have identified some of the advantages. Super Domain This is a single EDG style domain for everything needed for SOA/OSB. It extends the standard EDG slightly but otherwise assumes a single “super” domain. This is basically the SOA EDG. I have broken out JMS servers and Coherence servers to improve scalability and reduce dependencies. Key Points Separate JMS allows those servers to be kept up separately from rest of SOA Domain, allowing JMS clients to post messages even if rest of domain is unavailable. JMS servers are only used to host application specific JMS destinations, SOA/OSB JMS destinations remain in relevant SOA/OSB managed servers. Separate Coherence servers allow OSB cache to be offloaded from OSB servers. Use of Coherence by other components as a shared infrastructure data grid service. Coherence cluster may be managed by WLS but more likely run as a standalone Coherence cluster. Benefits Single Administration Point (1 Admin Server) Closely follows EDG with addition of application specific JMS servers and standalone Coherence servers for OSB caching and application specific caches. Coherence grid can be scaled independent of OSB/SOA. JMS queues provide for inter-application communication. Drawbacks Patching is an all or nothing affair. Startup time for SOA may be slow if large number of composites deployed. Multiple Domains This extends the EDG into multiple domains, allowing separate management and update of these domains. I see this type of configuration quite often with customers, although some don't have OWSM, others don't have separate Coherence etc. SOA & BAM are kept in the same domain as little benefit is obtained by separating them. Key Points Separate JMS allows those servers to be kept up separately from rest of SOA Domain, allowing JMS clients to post messages even if other domains are unavailable. JMS servers are only used to host application specific JMS destinations, SOA/OSB JMS destinations remain in relevant SOA/OSB managed servers. Separate Coherence servers allow OSB cache to be offloaded from OSB servers. Use of Coherence by other components as a shared infrastructure data grid service. Coherence cluster may be managed by WLS but more likely run as a standalone Coherence cluster. Benefits Follows EDG but in separate domains and with addition of application specific JMS servers and standalone Coherence servers for OSB caching and application specific caches. Coherence grid can be scaled independent of OSB/SOA. JMS queues provide for inter-application communication. Patch lifecycle of OSB/SOA/JMS are no longer lock stepped. JMS may be kept running independently of other domains allowing applications to insert messages fro later consumption by SOA/OSB. OSB may be kept running independent of other domains, allowing service virtualization to continue independent of other domains availability. All domains use same OWSM policy store (MDS-WSM). Drawbacks Multiple domains to manage and configure. Multiple Admin servers (single view requires use of Grid Control) Multiple Admin servers/WSM clusters waste resources. Additional homes needed to enjoy benefits of separate patching. Cross domain trust needs setting up to simplify cross domain interactions. Startup time for SOA may be slow if large number of composites deployed. Shared Service Environment This model extends the previous multiple domain arrangement to provide a true shared service environment.This extends the previous model by allowing multiple additional SOA domains and/or other domains to take advantage of the shared services. Only one non-shared domain is shown, but there could be multiple, allowing groups of applications to share patching independent of other application groups. Key Points Separate JMS allows those servers to be kept up separately from rest of SOA Domain, allowing JMS clients to post messages even if other domains are unavailable. JMS servers are only used to host application specific JMS destinations, SOA/OSB JMS destinations remain in relevant SOA/OSB managed servers. Separate Coherence servers allow OSB cache to be offloaded from OSB servers. Use of Coherence by other components as a shared infrastructure data grid service Coherence cluster may be managed by WLS but more likely run as a standalone Coherence cluster. Shared SOA Domain hosts Human Workflow Tasks BAM Common "utility" composites Single OSB domain provides "Enterprise Service Bus" All domains use same OWSM policy store (MDS-WSM) Benefits Follows EDG but in separate domains and with addition of application specific JMS servers and standalone Coherence servers for OSB caching and application specific caches. Coherence grid can be scaled independent of OSB/SOA. JMS queues provide for inter-application communication. Patch lifecycle of OSB/SOA/JMS are no longer lock stepped. JMS may be kept running independently of other domains allowing applications to insert messages fro later consumption by SOA/OSB. OSB may be kept running independent of other domains, allowing service virtualization to continue independent of other domains availability. All domains use same OWSM policy store (MDS-WSM). Supports large numbers of deployed composites in multiple domains. Single URL for Human Workflow end users. Single URL for BAM end users. Drawbacks Multiple domains to manage and configure. Multiple Admin servers (single view requires use of Grid Control) Multiple Admin servers/WSM clusters waste resources. Additional homes needed to enjoy benefits of separate patching. Cross domain trust needs setting up to simplify cross domain interactions. Human Workflow needs to be specially configured to point to shared services domain. Summary The alternatives in this blog allow for patching to have different impacts, depending on the model chosen. Each organization must decide the tradeoffs for itself. One extreme is to go for the shared services model and have one domain per SOA application. This requires a lot of administration of the multiple domains. The other extreme is to have a single super domain. This makes the entire enterprise susceptible to an outage at the same time due to patching or other domain level changes. Hopefully this blog will help your organization choose the right model for you.

Read the article
Let your Signature Experience drive IT-decision making

- by Tania Le Voi

Today’s CIO job description: ‘’Align IT infrastructure and solutions with business goals and objectives ; AND while doing so reduce costs; BUT ALSO, be innovative, ensure the architectures are adaptable and agile as we need to act today on the changes that we may request tomorrow.” Sound like an unachievable request? The fact is, reality dictates that CIO’s are put under this type of pressure to deliver more with less. In a past career phase I spent a few years as an IT Relationship Manager for a large Insurance company. This is a role that we see all too infrequently in many of our customers, and it’s a shame. The purpose of this role was to build a bridge, a relationship between IT and the business. Key to achieving that goal was to ensure the same language was being spoken and more importantly that objectives were commonly understood - hence service and projects were delivered to time, to budget and actually solved the business problems. In reality IT and the business are already married, but the relationship is most often defined as ‘supplier’ of IT rather than a ‘trusted partner’. To deliver business value they need to understand how to work together effectively to attain this next level of partnership. The Business cannot compete if they do not get a new product to market ahead of the competition, or for example act in a timely manner to address a new industry problem such as a legislative change. An even better example is when the Application or Service fails and the Business takes a hit by bad publicity, being trending topics on social media and losing direct revenue from online channels. For this reason alone Business and IT need the alignment of their priorities and deliverables now more than ever! Take a look at Forrester’s recent study that found ‘many IT respondents considering themselves to be trusted partners of the business but their efforts are impaired by the inadequacy of tools and organizations’. IT Meet the Business; Business Meet IT So what is going on? We talk about aligning the business with IT but the reality is it’s difficult to do. Like any relationship each side has different goals and needs and language can be a barrier; business vs. technology jargon! What if we could translate the needs of both sides into actionable information, backed by data both sides understand, presented in a meaningful way? Well now we can with the Business-Driven Application Management capabilities in Oracle Enterprise Manager 12cR2! Enterprise Manager’s Business-Driven Application Management capabilities provide the information that IT needs to understand the impact of its decisions on business criteria. No longer does IT need to be focused solely on speeds and feeds, performance and throughput – now IT can understand IT’s impact on business KPIs like inventory turns, order-to-cash cycle, pipeline-to-forecast, and similar. Similarly, now the line of business can understand which IT services are most critical for the KPIs they care about. There are a good deal of resources on Oracle Technology Network that describe the functionality of these products, so I won’t’ rehash them here. What I want to talk about is what you do with these products. What’s next after we meet? Where do you start? Step 1: Identify the Signature Experience. This is THE business process (or set of processes) that is core to the business, the one that drives the economic engine, the process that a customer recognises the company brand for, reputation, the customer experience, the process that a CEO would state as his number one priority. The crème de la crème of your business! Once you have nailed this it gets easy as Enterprise Manager 12c makes it easy. Step 2: Map the Signature Experience to underlying IT. Taking the signature experience, map out the touch points of the components that play a part in ensuring this business transaction is successful end to end, think of it like mapping out a critical path; the applications, middleware, databases and hardware. Use the wealth of Enterprise Manager features such as Systems, Services, Business Application Targets and Business Transaction Management (BTM) to assist you. Adding Real User Experience Insight (RUEI) into the mix will make the end to end customer satisfaction story transparent. Work with the business and define meaningful key performance indicators (KPI’s) and thresholds to enable you to report and action upon. Step 3: Observe the data over time. You now have meaningful insight into every step enabling your signature experience and you understand the implication of that experience on your underlying IT. Watch if for a few months, see what happens and reconvene with your business stakeholders and set clear and measurable targets which can re-define service levels. Step 4: Change the information about which you and the business communicate. It’s amazing what happens when you and the business speak the same language. You’ll be able to make more informed business and IT decisions. From here IT can identify where/how budget is spent whether on the level of support, performance, capacity, HA, DR, certification etc. IT SLA’s no longer need be focused on metrics such as %availability but structured around business process requirements. The power of this way of thinking doesn’t end here. IT staff get to see and understand how their own role contributes to the business making them accountable for the business service. Take a step further and appraise your staff on the business competencies that are linked to the service availability. For the business, the language barrier is removed by producing targeted reports on the signature experience core to the business and therefore key to the CEO. Chargeback or show back becomes easier to justify as the ‘cost of day per outage’ can be more easily calculated; the business will be able to translate the cost to the business to the cost/value of the underlying IT that supports it. Used this way, Oracle Enterprise Manager 12c is a key enabler to a harmonious relationship between the end customer the business and IT to deliver ultimate service and satisfaction. Just engage with the business upfront, make the signature experience visible and let Enterprise Manager 12c do the rest. In the next blog entry we will cover some of the Enterprise Manager features mentioned to enable you to implement this new way of working.

Read the article
Un-failing over a Cisco PIX 515e

- by ABrown

We had a power outage at our data center last week and when our dual PIX 515E running IOS 7.0(8) (configured with a failover cable) came back, they were in a failed over state where the Secondary unit is active and the Primary unit is standby I have tried 'failover reset', 'failover active', and 'failover reload-standby' as well as executing reloads on both units in a variety of orders, and they don't come back Primary/Active Secondary/Standby. The only thing in my arsenal that I haven't tried is driving to the data center and performing a hard reboot, which I hate to do. I have read How Failover Works on the Cisco Secure Firewall and it seems like this should be wicked straight forward. output of show failover on Primary: Failover On Cable status: Normal Failover unit Primary Failover LAN Interface: N/A - Serial-based failover enabled Unit Poll frequency 15 seconds, holdtime 45 seconds Interface Poll frequency 15 seconds Interface Policy 1 Monitored Interfaces 2 of 250 maximum Version: Ours 7.0(8), Mate 7.0(8) Last Failover at: 02:52:05 UTC Mar 10 2010 This host: Primary - Standby Ready Active time: 0 (sec) Interface outside (x.x.x.165): Normal Interface inside (y.y.y.3): Normal Other host: Secondary - Active Active time: 897045 (sec) Interface outside (x.x.x.164): Normal Interface inside (y.y.y.4): Normal Stateful Failover Logical Update Statistics Link : Unconfigured. output of show failover on Secondary: Failover On Cable status: Normal Failover unit Secondary Failover LAN Interface: N/A - Serial-based failover enabled Unit Poll frequency 15 seconds, holdtime 45 seconds Interface Poll frequency 15 seconds Interface Policy 1 Monitored Interfaces 2 of 250 maximum Version: Ours 7.0(8), Mate 7.0(8) Last Failover at: 02:03:04 UTC Feb 28 2010 This host: Secondary - Active Active time: 896925 (sec) Interface outside (x.x.x.164): Normal Interface inside (y.y.y.4): Normal Other host: Primary - Standby Ready Active time: 0 (sec) Interface outside (x.x.x.165): Normal Interface inside (y.y.y.3): Normal Stateful Failover Logical Update Statistics Link : Unconfigured. I'm seeing the following in my syslog: Mar 10 03:05:00 fw1 %PIX-5-111008: User 'enable_15' executed the 'failover reset' command. Mar 10 03:05:09 fw1 %PIX-5-111008: User 'enable_15' executed the 'failover reload-standby' command. Mar 10 03:05:12 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=406,op=20,my=Active,peer=Failed. Mar 10 03:05:12 fw1 %PIX-6-720028: (VPN-Secondary) HA status callback: Peer state Failed. Mar 10 03:06:09 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=401,op=0,my=Active,peer=Failed. Mar 10 03:06:09 fw1 %PIX-6-720024: (VPN-Secondary) HA status callback: Control channel is down. Mar 10 03:06:09 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=401,op=1,my=Active,peer=Failed. Mar 10 03:06:10 fw1 %PIX-6-720024: (VPN-Secondary) HA status callback: Control channel is up. Mar 10 03:06:10 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=411,op=2,my=Active,peer=Failed. Mar 10 03:06:23 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=406,op=80,my=Active,peer=Standby Ready. Mar 10 03:06:23 fw1 %PIX-6-720028: (VPN-Secondary) HA status callback: Peer state Standby Ready. Mar 10 03:06:24 fw2 %PIX-6-720027: (VPN-Primary) HA status callback: My state Standby Ready. Mar 10 03:07:05 fw1 %PIX-5-111008: User 'enable_15' executed the 'failover reset' command. Mar 10 03:07:31 fw1 %PIX-5-111008: User 'enable_15' executed the 'failover active' command. Mar 10 03:08:04 fw1 %PIX-5-611103: User logged out: Uname: enable_1 Mar 10 03:08:04 fw1 %PIX-6-315011: SSH session from admin1_int on interface inside for user "pix" terminated normally Mar 10 03:08:39 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=406,op=20,my=Active,peer=Failed. Mar 10 03:08:39 fw1 %PIX-6-720028: (VPN-Secondary) HA status callback: Peer state Failed. Mar 10 03:09:10 fw1 %PIX-6-605005: Login permitted from admin1_int/36891 to inside:192.168.4.4/ssh for user "pix" Mar 10 03:09:23 fw1 %PIX-5-111008: User 'enable_15' executed the 'failover reset' command. Mar 10 03:09:38 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=401,op=0,my=Active,peer=Failed. Mar 10 03:09:39 fw1 %PIX-6-720024: (VPN-Secondary) HA status callback: Control channel is down. Mar 10 03:09:39 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=401,op=1,my=Active,peer=Failed. Mar 10 03:09:39 fw1 %PIX-6-720024: (VPN-Secondary) HA status callback: Control channel is up. Mar 10 03:09:39 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=411,op=2,my=Active,peer=Failed. Mar 10 03:09:52 fw1 %PIX-6-720032: (VPN-Secondary) HA status callback: id=3,seq=200,grp=0,event=406,op=80,my=Active,peer=Standby Ready. Mar 10 03:09:52 fw1 %PIX-6-720028: (VPN-Secondary) HA status callback: Peer state Standby Ready. Mar 10 03:09:53 fw2 %PIX-6-720027: (VPN-Primary) HA status callback: My state Standby Ready. I'm not exactly sure how to interpret that syslog data. Primary doesn't seem to even try to become Active. When I reload the individual units separately, my connections are retained, so it doesn't seem like I have a real hardware failure. Is there something I can query (IOS or SNMP) to check for hardware issues? Any thoughts? My IOS-fu is weak. Thanks for any help you might provide, Aaron

Read the article
What does it mean when ARP shows <incomplete> on eth1

- by Geoff Dalgas

We have been using HAProxy along with heartbeat from the Linux-HA project. We are using two linux instances to provide a failover. Each server has with their own public IP and a single IP which is shared between the two using a virtual interface (eth1:1) at IP: 69.59.196.211 The virtual interface (eth1:1) IP 69.59.196.211 is configured as the gateway for the windows servers behind them and we use ip_forwarding to route traffic. We are experiencing an occasional network outage on one of our windows servers behind our linux gateways. HAProxy will detect the server is offline which we can verify by remoting to the failed server and attempting to ping the gateway: Pinging 69.59.196.211 with 32 bytes of data: Reply from 69.59.196.220: Destination host unreachable. Running arp -a on this failed server shows that there is no entry for the gateway address (69.59.196.211): Interface: 69.59.196.220 --- 0xa Internet Address Physical Address Type 69.59.196.161 00-26-88-63-c7-80 dynamic 69.59.196.210 00-15-5d-0a-3e-0e dynamic 69.59.196.212 00-21-5e-4d-45-c9 dynamic 69.59.196.213 00-15-5d-00-b2-0d dynamic 69.59.196.215 00-21-5e-4d-61-1a dynamic 69.59.196.217 00-21-5e-4d-2c-e8 dynamic 69.59.196.219 00-21-5e-4d-38-e5 dynamic 69.59.196.221 00-15-5d-00-b2-0d dynamic 69.59.196.222 00-15-5d-0a-3e-09 dynamic 69.59.196.223 ff-ff-ff-ff-ff-ff static 224.0.0.22 01-00-5e-00-00-16 static 224.0.0.252 01-00-5e-00-00-fc static 225.0.0.1 01-00-5e-00-00-01 static On our linux gateway instances arp -a shows: peak-colo-196-220.peak.org (69.59.196.220) at <incomplete> on eth1 stackoverflow.com (69.59.196.212) at 00:21:5e:4d:45:c9 [ether] on eth1 peak-colo-196-215.peak.org (69.59.196.215) at 00:21:5e:4d:61:1a [ether] on eth1 peak-colo-196-219.peak.org (69.59.196.219) at 00:21:5e:4d:38:e5 [ether] on eth1 peak-colo-196-222.peak.org (69.59.196.222) at 00:15:5d:0a:3e:09 [ether] on eth1 peak-colo-196-209.peak.org (69.59.196.209) at 00:26:88:63:c7:80 [ether] on eth1 peak-colo-196-217.peak.org (69.59.196.217) at 00:21:5e:4d:2c:e8 [ether] on eth1 Why would arp occasionally set the entry for this failed server as <incomplete>? Should we be defining our arp entries statically? I've always left arp alone since it works 99% of the time, but in this one instance it appears to be failing. Are there any additional troubleshooting steps we can take help resolve this issue? THINGS WE HAVE TRIED I added a static arp entry for testing on one of the linux gateways which still didn't help. root@haproxy2:~# arp -a peak-colo-196-215.peak.org (69.59.196.215) at 00:21:5e:4d:61:1a [ether] on eth1 peak-colo-196-221.peak.org (69.59.196.221) at 00:15:5d:00:b2:0d [ether] on eth1 stackoverflow.com (69.59.196.212) at 00:21:5e:4d:45:c9 [ether] on eth1 peak-colo-196-219.peak.org (69.59.196.219) at 00:21:5e:4d:38:e5 [ether] on eth1 peak-colo-196-209.peak.org (69.59.196.209) at 00:26:88:63:c7:80 [ether] on eth1 peak-colo-196-217.peak.org (69.59.196.217) at 00:21:5e:4d:2c:e8 [ether] on eth1 peak-colo-196-220.peak.org (69.59.196.220) at 00:21:5e:4d:30:8d [ether] PERM on eth1 root@haproxy2:~# arp -i eth1 -s 69.59.196.220 00:21:5e:4d:30:8d root@haproxy2:~# ping 69.59.196.220 PING 69.59.196.220 (69.59.196.220) 56(84) bytes of data. --- 69.59.196.220 ping statistics --- 7 packets transmitted, 0 received, 100% packet loss, time 6006ms Rebooting the windows web server solves this issue temporarily with no other changes to the network but our experience shows this issue will come back. Swapping network cards and switches I noticed the link light on the port of the switch for the failed windows server was running at 100Mb instead of 1Gb on the failed interface. I moved the cable to several other open ports and the link indicated 100Mb for each port that I tried. I also swapped the cable with the same result. I tried changing the properties of the network card in windows and the server locked up and required a hard reset after clicking apply. This windows server has two physical network interfaces so I have swapped the cables and network settings on the two interfaces to see if the problem follows the interface. If the public interface goes down again we will know that it is not an issue with the network card. (We also tried another switch we have on hand, no change) Changing network hardware driver versions We've had the same problem with the latest Broadcom driver, as well as the built-in driver that ships in Windows Server 2008 R2. Replacing network cables As a last ditch effort we remembered another change that occurred was the replacement of all of the patch cords between our servers / switch. We had purchased two sets, one green of lengths 1ft - 3ft for the private interfaces and another set of red cables for the public interfaces. We swapped out all of the public interface patch cables with a different brand and ran our servers without issue for a full week ... aaaaaand then the problem recurred. Disable checksum offload, remove TProxy We also tried disabling TCP/IP checksum offload in the driver, no change. We're now pulling out TProxy and moving to a more traditional x-forwarded-for network arrangement without any fancy IP address rewriting. We'll see if that helps.

Read the article
Recover RAID 5 data after created new array instead of re-using

- by Brigadieren

Folks please help - I am a newb with a major headache at hand (perfect storm situation). I have a 3 1tb hdd on my ubuntu 11.04 configured as software raid 5. The data had been copied weekly onto another separate off the computer hard drive until that completely failed and was thrown away. A few days back we had a power outage and after rebooting my box wouldn't mount the raid. In my infinite wisdom I entered mdadm --create -f... command instead of mdadm --assemble and didn't notice the travesty that I had done until after. It started the array degraded and proceeded with building and syncing it which took ~10 hours. After I was back I saw that that the array is successfully up and running but the raid is not I mean the individual drives are partitioned (partition type f8 ) but the md0 device is not. Realizing in horror what I have done I am trying to find some solutions. I just pray that --create didn't overwrite entire content of the hard driver. Could someone PLEASE help me out with this - the data that's on the drive is very important and unique ~10 years of photos, docs, etc. Is it possible that by specifying the participating hard drives in wrong order can make mdadm overwrite them? when I do mdadm --examine --scan I get something like ARRAY /dev/md/0 metadata=1.2 UUID=f1b4084a:720b5712:6d03b9e9:43afe51b name=<hostname>:0 Interestingly enough name used to be 'raid' and not the host hame with :0 appended. Here is the 'sanitized' config entries: DEVICE /dev/sdf1 /dev/sde1 /dev/sdd1 CREATE owner=root group=disk mode=0660 auto=yes HOMEHOST <system> MAILADDR root ARRAY /dev/md0 metadata=1.2 name=tanserv:0 UUID=f1b4084a:720b5712:6d03b9e9:43afe51b Here is the output from mdstat cat /proc/mdstat Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid5 sdd1[0] sdf1[3] sde1[1] 1953517568 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU] unused devices: <none> fdisk shows the following: fdisk -l Disk /dev/sda: 80.0 GB, 80026361856 bytes 255 heads, 63 sectors/track, 9729 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x000bf62e Device Boot Start End Blocks Id System /dev/sda1 * 1 9443 75846656 83 Linux /dev/sda2 9443 9730 2301953 5 Extended /dev/sda5 9443 9730 2301952 82 Linux swap / Solaris Disk /dev/sdb: 750.2 GB, 750156374016 bytes 255 heads, 63 sectors/track, 91201 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x000de8dd Device Boot Start End Blocks Id System /dev/sdb1 1 91201 732572001 8e Linux LVM Disk /dev/sdc: 500.1 GB, 500107862016 bytes 255 heads, 63 sectors/track, 60801 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x00056a17 Device Boot Start End Blocks Id System /dev/sdc1 1 60801 488384001 8e Linux LVM Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes 255 heads, 63 sectors/track, 121601 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x000ca948 Device Boot Start End Blocks Id System /dev/sdd1 1 121601 976760001 fd Linux raid autodetect Disk /dev/dm-0: 1250.3 GB, 1250254913536 bytes 255 heads, 63 sectors/track, 152001 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x00000000 Disk /dev/dm-0 doesn't contain a valid partition table Disk /dev/sde: 1000.2 GB, 1000204886016 bytes 255 heads, 63 sectors/track, 121601 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x93a66687 Device Boot Start End Blocks Id System /dev/sde1 1 121601 976760001 fd Linux raid autodetect Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes 255 heads, 63 sectors/track, 121601 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0xe6edc059 Device Boot Start End Blocks Id System /dev/sdf1 1 121601 976760001 fd Linux raid autodetect Disk /dev/md0: 2000.4 GB, 2000401989632 bytes 2 heads, 4 sectors/track, 488379392 cylinders Units = cylinders of 8 * 512 = 4096 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 524288 bytes / 1048576 bytes Disk identifier: 0x00000000 Disk /dev/md0 doesn't contain a valid partition table Per suggestions I did clean up the superblocks and re-created the array with --assume-clean option but with no luck at all. Is there any tool that will help me to revive at least some of the data? Can someone tell me what and how the mdadm --create does when syncs to destroy the data so I can write a tool to un-do whatever was done? After the re-creating of the raid I run fsck.ext4 /dev/md0 and here is the output root@tanserv:/etc/mdadm# fsck.ext4 /dev/md0 e2fsck 1.41.14 (22-Dec-2010) fsck.ext4: Superblock invalid, trying backup blocks... fsck.ext4: Bad magic number in super-block while trying to open /dev/md0 The superblock could not be read or does not describe a correct ext2 filesystem. If the device is valid and it really contains an ext2 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock: e2fsck -b 8193 Per Shanes' suggestion I tried root@tanserv:/home/mushegh# mkfs.ext4 -n /dev/md0 mke2fs 1.41.14 (22-Dec-2010) Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) Stride=128 blocks, Stripe width=256 blocks 122101760 inodes, 488379392 blocks 24418969 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=0 14905 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 and run fsck.ext4 with every backup block but all returned the following: root@tanserv:/home/mushegh# fsck.ext4 -b 214990848 /dev/md0 e2fsck 1.41.14 (22-Dec-2010) fsck.ext4: Invalid argument while trying to open /dev/md0 The superblock could not be read or does not describe a correct ext2 filesystem. If the device is valid and it really contains an ext2 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock: e2fsck -b 8193 <device> Any suggestions? Regards!

Read the article
How can a single disk in a hardware SATA RAID-10 array bring the entire array to a screeching halt?

- by Stu Thompson

Prelude: I'm a code-monkey that's increasingly taken on SysAdmin duties for my small company. My code is our product, and increasingly we provide the same app as SaaS. About 18 months ago I moved our servers from a premium hosting centric vendor to a barebones rack pusher in a tier IV data center. (Literally across the street.) This ment doing much more ourselves--things like networking, storage and monitoring. As part the big move, to replace our leased direct attached storage from the hosting company, I built a 9TB two-node NAS based on SuperMicro chassises, 3ware RAID cards, Ubuntu 10.04, two dozen SATA disks, DRBD and . It's all lovingly documented in three blog posts: Building up & testing a new 9TB SATA RAID10 NFSv4 NAS: Part I, Part II and Part III. We also setup a Cacit monitoring system. Recently we've been adding more and more data points, like SMART values. I could not have done all this without the awesome boffins at ServerFault. It's been a fun and educational experience. My boss is happy (we saved bucket loads of $$$), our customers are happy (storage costs are down), I'm happy (fun, fun, fun). Until yesterday. Outage & Recovery: Some time after lunch we started getting reports of sluggish performance from our application, an on-demand streaming media CMS. About the same time our Cacti monitoring system sent a blizzard of emails. One of the more telling alerts was a graph of iostat await. Performance became so degraded that Pingdom began sending "server down" notifications. The overall load was moderate, there was not traffic spike. After logging onto the application servers, NFS clients of the NAS, I confirmed that just about everything was experiencing highly intermittent and insanely long IO wait times. And once I hopped onto the primary NAS node itself, the same delays were evident when trying to navigate the problem array's file system. Time to fail over, that went well. Within 20 minuts everything was confirmed to be back up and running perfectly. Post-Mortem: After any and all system failures I perform a post-mortem to determine the cause of the failure. First thing I did was ssh back into the box and start reviewing logs. It was offline, completely. Time for a trip to the data center. Hardware reset, backup an and running. In /var/syslog I found this scary looking entry: Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_00], 6 Currently unreadable (pending) sectors Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_07], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 171 to 170 Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 16 Currently unreadable (pending) sectors Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 4 Offline uncorrectable sectors Nov 15 06:49:45 umbilo smartd[2827]: Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error Nov 15 06:49:45 umbilo smartd[2827]: # 1 Short offline Completed: read failure 90% 6576 3421766910 Nov 15 06:49:45 umbilo smartd[2827]: # 2 Short offline Completed: read failure 90% 6087 3421766910 Nov 15 06:49:45 umbilo smartd[2827]: # 3 Short offline Completed: read failure 10% 5901 656821791 Nov 15 06:49:45 umbilo smartd[2827]: # 4 Short offline Completed: read failure 90% 5818 651637856 Nov 15 06:49:45 umbilo smartd[2827]: So I went to check the Cacti graphs for the disks in the array. Here we see that, yes, disk 7 is slipping away just like syslog says it is. But we also see that disk 8's SMART Read Erros are fluctuating. There are no messages about disk 8 in syslog. More interesting is that the fluctuating values for disk 8 directly correlate to the high IO wait times! My interpretation is that: Disk 8 is experiencing an odd hardware fault that results in intermittent long operation times. Somehow this fault condition on the disk is locking up the entire array Maybe there is a more accurate or correct description, but the net result has been that the one disk is impacting the performance of the whole array. The Question(s) How can a single disk in a hardware SATA RAID-10 array bring the entire array to a screeching halt? Am I being naïve to think that the RAID card should have dealt with this? How can I prevent a single misbehaving disk from impacting the entire array? Am I missing something?

Read the article
How to Reuse Your Old Wi-Fi Router as a Network Switch

- by Jason Fitzpatrick

Just because your old Wi-Fi router has been replaced by a newer model doesn’t mean it needs to gather dust in the closet. Read on as we show you how to take an old and underpowered Wi-Fi router and turn it into a respectable network switch (saving your $20 in the process). Image by mmgallan. Why Do I Want To Do This? Wi-Fi technology has changed significantly in the last ten years but Ethernet-based networking has changed very little. As such, a Wi-Fi router with 2006-era guts is lagging significantly behind current Wi-Fi router technology, but the Ethernet networking component of the device is just as useful as ever; aside from potentially being only 100Mbs instead of 1000Mbs capable (which for 99% of home applications is irrelevant) Ethernet is Ethernet. What does this matter to you, the consumer? It means that even though your old router doesn’t hack it for your Wi-Fi needs any longer the device is still a perfectly serviceable (and high quality) network switch. When do you need a network switch? Any time you want to share an Ethernet cable among multiple devices, you need a switch. For example, let’s say you have a single Ethernet wall jack behind your entertainment center. Unfortunately you have four devices that you want to link to your local network via hardline including your smart HDTV, DVR, Xbox, and a little Raspberry Pi running XBMC. Instead of spending $20-30 to purchase a brand new switch of comparable build quality to your old Wi-Fi router it makes financial sense (and is environmentally friendly) to invest five minutes of your time tweaking the settings on the old router to turn it from a Wi-Fi access point and routing tool into a network switch–perfect for dropping behind your entertainment center so that your DVR, Xbox, and media center computer can all share an Ethernet connection. What Do I Need? For this tutorial you’ll need a few things, all of which you likely have readily on hand or are free for download. To follow the basic portion of the tutorial, you’ll need the following: 1 Wi-Fi router with Ethernet ports 1 Computer with Ethernet jack 1 Ethernet cable For the advanced tutorial you’ll need all of those things, plus: 1 copy of DD-WRT firmware for your Wi-Fi router We’re conducting the experiment with a Linksys WRT54GL Wi-Fi router. The WRT54 series is one of the best selling Wi-Fi router series of all time and there’s a good chance a significant number of readers have one (or more) of them stuffed in an office closet. Even if you don’t have one of the WRT54 series routers, however, the principles we’re outlining here apply to all Wi-Fi routers; as long as your router administration panel allows the necessary changes you can follow right along with us. A quick note on the difference between the basic and advanced versions of this tutorial before we proceed. Your typical Wi-Fi router has 5 Ethernet ports on the back: 1 labeled “Internet”, “WAN”, or a variation thereof and intended to be connected to your DSL/Cable modem, and 4 labeled 1-4 intended to connect Ethernet devices like computers, printers, and game consoles directly to the Wi-Fi router. When you convert a Wi-Fi router to a switch, in most situations, you’ll lose two port as the “Internet” port cannot be used as a normal switch port and one of the switch ports becomes the input port for the Ethernet cable linking the switch to the main network. This means, referencing the diagram above, you’d lose the WAN port and LAN port 1, but retain LAN ports 2, 3, and 4 for use. If you only need to switch for 2-3 devices this may be satisfactory. However, for those of you that would prefer a more traditional switch setup where there is a dedicated WAN port and the rest of the ports are accessible, you’ll need to flash a third-party router firmware like the powerful DD-WRT onto your device. Doing so opens up the router to a greater degree of modification and allows you to assign the previously reserved WAN port to the switch, thus opening up LAN ports 1-4. Even if you don’t intend to use that extra port, DD-WRT offers you so many more options that it’s worth the extra few steps. Preparing Your Router for Life as a Switch Before we jump right in to shutting down the Wi-Fi functionality and repurposing your device as a network switch, there are a few important prep steps to attend to. First, you want to reset the router (if you just flashed a new firmware to your router, skip this step). Following the reset procedures for your particular router or go with what is known as the “Peacock Method” wherein you hold down the reset button for thirty seconds, unplug the router and wait (while still holding the reset button) for thirty seconds, and then plug it in while, again, continuing to hold down the rest button. Over the life of a router there are a variety of changes made, big and small, so it’s best to wipe them all back to the factory default before repurposing the router as a switch. Second, after resetting, we need to change the IP address of the device on the local network to an address which does not directly conflict with the new router. The typical default IP address for a home router is 192.168.1.1; if you ever need to get back into the administration panel of the router-turned-switch to check on things or make changes it will be a real hassle if the IP address of the device conflicts with the new home router. The simplest way to deal with this is to assign an address close to the actual router address but outside the range of addresses that your router will assign via the DHCP client; a good pick then is 192.168.1.2. Once the router is reset (or re-flashed) and has been assigned a new IP address, it’s time to configure it as a switch. Basic Router to Switch Configuration If you don’t want to (or need to) flash new firmware onto your device to open up that extra port, this is the section of the tutorial for you: we’ll cover how to take a stock router, our previously mentioned WRT54 series Linksys, and convert it to a switch. Hook the Wi-Fi router up to the network via one of the LAN ports (consider the WAN port as good as dead from this point forward, unless you start using the router in its traditional function again or later flash a more advanced firmware to the device, the port is officially retired at this point). Open the administration control panel via web browser on a connected computer. Before we get started two things: first, anything we don’t explicitly instruct you to change should be left in the default factory-reset setting as you find it, and two, change the settings in the order we list them as some settings can’t be changed after certain features are disabled. To start, let’s navigate to Setup ->Basic Setup. Here you need to change the following things: Local IP Address: [different than the primary router, e.g. 192.168.1.2] Subnet Mask: [same as the primary router, e.g. 255.255.255.0] DHCP Server: Disable Save with the “Save Settings” button and then navigate to Setup -> Advanced Routing: Operating Mode: Router This particular setting is very counterintuitive. The “Operating Mode” toggle tells the device whether or not it should enable the Network Address Translation (NAT) feature. Because we’re turning a smart piece of networking hardware into a relatively dumb one, we don’t need this feature so we switch from Gateway mode (NAT on) to Router mode (NAT off). Our next stop is Wireless -> Basic Wireless Settings: Wireless SSID Broadcast: Disable Wireless Network Mode: Disabled After disabling the wireless we’re going to, again, do something counterintuitive. Navigate to Wireless -> Wireless Security and set the following parameters: Security Mode: WPA2 Personal WPA Algorithms: TKIP+AES WPA Shared Key: [select some random string of letters, numbers, and symbols like JF#d$di!Hdgio890] Now you may be asking yourself, why on Earth are we setting a rather secure Wi-Fi configuration on a Wi-Fi router we’re not going to use as a Wi-Fi node? On the off chance that something strange happens after, say, a power outage when your router-turned-switch cycles on and off a bunch of times and the Wi-Fi functionality is activated we don’t want to be running the Wi-Fi node wide open and granting unfettered access to your network. While the chances of this are next-to-nonexistent, it takes only a few seconds to apply the security measure so there’s little reason not to. Save your changes and navigate to Security ->Firewall. Uncheck everything but Filter Multicast Firewall Protect: Disable At this point you can save your changes again, review the changes you’ve made to ensure they all stuck, and then deploy your “new” switch wherever it is needed. Advanced Router to Switch Configuration For the advanced configuration, you’ll need a copy of DD-WRT installed on your router. Although doing so is an extra few steps, it gives you a lot more control over the process and liberates an extra port on the device. Hook the Wi-Fi router up to the network via one of the LAN ports (later you can switch the cable to the WAN port). Open the administration control panel via web browser on the connected computer. Navigate to the Setup -> Basic Setup tab to get started. In the Basic Setup tab, ensure the following settings are adjusted. The setting changes are not optional and are required to turn the Wi-Fi router into a switch. WAN Connection Type: Disabled Local IP Address: [different than the primary router, e.g. 192.168.1.2] Subnet Mask: [same as the primary router, e.g. 255.255.255.0] DHCP Server: Disable In addition to disabling the DHCP server, also uncheck all the DNSMasq boxes as the bottom of the DHCP sub-menu. If you want to activate the extra port (and why wouldn’t you), in the WAN port section: Assign WAN Port to Switch [X] At this point the router has become a switch and you have access to the WAN port so the LAN ports are all free. Since we’re already in the control panel, however, we might as well flip a few optional toggles that further lock down the switch and prevent something odd from happening. The optional settings are arranged via the menu you find them in. Remember to save your settings with the save button before moving onto a new tab. While still in the Setup -> Basic Setup menu, change the following: Gateway/Local DNS : [IP address of primary router, e.g. 192.168.1.1] NTP Client : Disable The next step is to turn off the radio completely (which not only kills the Wi-Fi but actually powers the physical radio chip off). Navigate to Wireless -> Advanced Settings -> Radio Time Restrictions: Radio Scheduling: Enable Select “Always Off” There’s no need to create a potential security problem by leaving the Wi-Fi radio on, the above toggle turns it completely off. Under Services -> Services: DNSMasq : Disable ttraff Daemon : Disable Under the Security -> Firewall tab, uncheck every box except “Filter Multicast”, as seen in the screenshot above, and then disable SPI Firewall. Once you’re done here save and move on to the Administration tab. Under Administration -> Management: Info Site Password Protection : Enable Info Site MAC Masking : Disable CRON : Disable 802.1x : Disable Routing : Disable After this final round of tweaks, save and then apply your settings. Your router has now been, strategically, dumbed down enough to plod along as a very dependable little switch. Time to stuff it behind your desk or entertainment center and streamline your cabling.

Read the article
Recreating OMS instances in a HA environment when instances on all nodes are lost

- by rnigam

Oracle highly recommends deploying EM in a HA environment. The best practices for HA deployments, backup and housekeeping of your Enterprise Manager environment are documented in the Oracle Enterprise Manager Advanced Configuration Guide. It is imperative that there is a good disaster recovery plan in place for your EM deployment. In this post I want to talk about a customer who failed to do the correct planning and housekeeping for EM and landed in a situation where we the all the OMSes were nearly blown away had we not jumped to help. We recently hit an issue at a customer site where we had a two node OMS setup of the Enterprise Manager and a RAC Database being used as the EM repository. An accidental delete of the OMS oracle home left us with a single node deployment. While we were trying to figure out a possible path to recover the first node, the second node was rebooted under a maintenance window. What followed was a complete site outage as the Admin and managed servers would not start on either of the nodes. In my situation there were - No backups of the Oracle Homes from any node - No OMS Configuration snapshots (created using the “emctl exportconfig oms” command) and the instance home was completely lost on node 1 which also had the Admin Server We did however have: - A copy of the emkey.ora that I found under the OMS_ORACLE_HOME/ of the second node (NOTE: it is a bad practice to have your emkey present under the OMS Oracle home directory on the same server as the OMS. The backup of the emkey should be maintained on some other server. In this case however it was a savior in my situation since there were no backups - The oms oracle home on the second node but missing a number of files and had a number of changes done to the files in the home. There were a number of attempts to start the server by modifying various files based on the Weblogic server logs to have atleast node up and running but all of them failed. Here is how you can recover from this scenario: Follow these steps: STEP 1: Check status of emkey.ora Check whether the emkey exists is present in the EM repository or not. Run the following command: $OMS_ORACLE_HOME/bin/emctl status emkey If the output is something like this below then you are good to go and the key is present in the repository ./emctl status emkey Oracle Enterprise Manager 11g Release 1 Grid Control Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved. Enter Enterprise Manager Root (SYSMAN) Password : The EMKey is configured properly. Here are the messages that you might see as the emctl status emkey output depending upon whether the EM Admin Server is up and if the key is configured properly: Case1: AdminServer is up, emkey is proper in CredStore & not in repos. This is same as the output of the command shown above:The EMKey is configured properly Case 2: AdminServer is up, emkey is proper in CredStore & exists in repos:The EMKey is configured properly, but is not secure. Secure the EMKey by running "emctl config emkey -remove_from_repos".Case 3: AdminServer is down or emkey is corrupted in CredStore) & (emkey exists in repos): The EMKey exists in the Management Repository, but is not configured properly or is corrupted in the credential store.Configure the EMKey by running "emctl config emkey -copy_to_credstore".Case 4: (AdminServer is down or emkey is corrupted in CredStore) & (emkey does not exist in repos): The EMKey is not configured properly or is corrupted in the credential store and does not exist in the Management Repository. To correct the problem:1) Get the backed up emkey.ora file.2) Configure the emkey by running "emctl config emkey -copy_to_credstore_from_file". If not the key was not secured properly, we will have to be put in the repository before proceeding. Look at the next step 2 for doing this There may be cases (like mine) where running emctl may give errors like the following: $OMS_ORACLE_HOME/bin/emctl status emkey Exception in thread “Main Thread” java.lang.NoClassDefFoundError: oracle/security/pki/OracleWallet At oracle.sysman.emctl.config.oms.EMKeyCmds.main (EMKeyCmds.java:658) Just move to the next step to put the key back in the repository STEP 2: Put emkey.ora back in the repository Skip this step if your emkey.ora is present in the repository. If not, you need to put the key back in the repository See if you can run the following command (with sample output): $OMS_ORACLE_HOME/bin/emctl config emkey –copy_to_repos Oracle Enterprise Manager 11g Release 1 Grid Control Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved. The EMKey has been copied to the Management Repository. This operation will cause the EMKey to become unsecure. After the required operation has been completed, secure the EMKey by running "emctl config emkey -remove_from_repos". Typically the key is present under $OMS_ORACLE_HOME/sysman/config directory before being removed after the install as a best practice. If you hit any errors while running emctl commands like the one mentioned in step 1, jump to step 3 and we will take care of the emkey.ora in Step 5 STEP 3: Get the port information Check for the existing port information in the emd.properties file under EM_INSTANCE_DIRECTORY (typically gc_inst directory right above the Middleware home where you have deployed em. For eg. /u01/app/oracle/product/gc_inst in case your oms home is /u01/app/oracle/product/Middleware/oms11g) In my case I got the information from the emgc.properties present in the gc_inst on the second node. If you can run emctl you may want to try the following command as well $OMS_ORACLE_HOME/bin/emctl status oms –details Note this information as this will be used in the next step STEP 4: Perform cleanup on Node 1 Note the oracle home of the Weblogic and OMS, get the list of applied patches in the homes (using opatch lsinventory command), take a backup copy of the home just in case we need it and then de-install/remove oracle homes, update inventory and cleanup processes on the first node STEP 5: Perform Software Only Installation of OMS on Node 1 Perform Weblogic 10.3.2 installation exactly under the same location as present in the earlier installation. Perform software only installation of the OMS using the following command. This will not run any configuration assistants and bypass all user interface validations runInstaller –noconfig -validationaswarnings Select the “Additional OMS” option while performing the installation. Provide the same path for OMS and Instance directories like the previous installation Use the port information collected in Step 3 while performing the installation. Once the installation is complete run the allroot.sh script to complete the binary deployment STEP 6: Apply one-off patches At this point you can apply any patches to the OMS Oracle Home previously. You only need to run opatch to install the patch in the home and not required to run the SQLs STEP 7: Copy EM key This step is only required if you were not able to use emctl command to put the emkey back into the EM repository in STEP 2 Copy the emkey.ora file of the old installation you have under $OMS_ORACLE_HOME/sysman/config directory of the newly installed OMS STEP 8: Configure Grid Control Domain Run the following command to configure the EM domain and OMS. Note that you need to use a different GC Domain name than what you used earlier. For example I have used GCDOMAIN11 as the new domain name when my previous domain name was GCDOMAIN $OMS_ORACLE_HOME/bin/omsca new –AS_USERNAME weblogic –EM_DOMAIN_NAME GCDOMAIN11 –NM_USER nodemanager -nostart This command as shown below will prompt for a number of inputs like Admin Server hostname, port, password, etc. Verify if the defaults shown are correct by pressing enter or provide a new value STEP 9: Run Add-ON Configuration Assistant After this step run the following add-on configuration assistant. This was used in my case to configure the virtualization add-on $OMS_ORACLE_HOME/addonca -oui -omsonly -name vt -install gc STEP 10: Start the OMS Now start the OMS using $OMS_ORACLE_HOME/bin/emctl start oms In a multi-node setup like mine you would either have a software load balancer or DNS round robin (using a virtual host name that resolves to one of multiple OMS hostnames) being used for load balancing. Secure the OMS against the SLB or DNS virtual hostname using the following $ OMS_HOME/bin/emctl secure oms -host slb.example.com -secure_port 1159 -slb_port 1159 -slb_console_port 443 STEP 11: Configure the Agent From the $AGENT_ORACLE_HOME/bin run the ./agentca –f At this point you should have your OMS on node 1 fully re-covered. Clean up node 2 and use the normal Additional OMS installation process documented in the official installation guide to add the additional OMS on node 2 Summary It took us nearly a little over two days to completely recover the environment with some other non-EM related issues that hit us along the way as well. In the end a situation like this could have been completely avoided had the proper housekeeping and backup of the Enterprise Manager Deployment been done in the first place. This is going to a topic that we cover in the next post. In the meantime please do refer to the Oracle Enterprise Manager Advanced Configuration Guide for planning your EM installation, backup and housekeeping procedures. This can be found here: http://download.oracle.com/docs/cd/E11857_01/index.htm Thanks This post would not have been possible without Raj Aggarwal, Prasad Chebrolu and Ravikumar Basa who helped to recover the environment and provided all the support we needed

Read the article
Oracle on Oracle: Is that all?

- by Darin Pendergraft

On October 17th, I posted a short blog and a podcast interview with Chirag Andani, talking about how Oracle IT uses its own IDM products. Blog link here. In response, I received a comment from reader Jaime Cardoso ([email protected]) who posted: “- You could have talked about how by deploying Oracle's Open standards base technology you were able to integrate any new system in your infrastructure in days. - You could have talked about how by deploying federation you were enabling the business side to keep all their options open in terms of companies to buy and sell while maintaining perfect employee and customer's single view. - You could have talked about how you are now able to cut response times to your audit and security teams into 1/10th of your former times Instead you spent 6 minutes talking about single sign on and self provisioning? If I didn't knew your IDM offer so well I would now be wondering what its differences from Microsoft's offer was. Sorry for not giving a positive comment here but, please your IDM suite is very good and, you simply aren't promoting it well enough” So I decided to send Jaime a note asking him about his experience, and to get his perspective on what makes the Oracle products great. What I found out is that Jaime is a very experienced IDM Architect with several major projects under his belt. Darin Pendergraft: Can you tell me a bit about your experience? How long have you worked in IT, and what is your IDM experience? Jaime Cardoso: I started working in "serious" IT in 1998 when I became Netscape's technical specialist in Portugal. Netscape Portugal didn't exist so, I was working for their VAR here. Most of my work at the time was with Netscape's mail server and LDAP server. Since that time I've been bouncing between the system's side like Sun resellers, Solaris stuff and even worked with Sun's Engineering in the making of an Hierarchical Storage Product (Sun CIS if you know it) and the application's side, mostly in LDAP and IDM. Over the years I've been doing support, service delivery and pre-sales / architecture design of IDM solutions in most big customers in Portugal, to name a few projects: - The first European deployment of Sun Access Manager (SAPO – Portugal Telecom) - The identity repository of 5/5 of the Biggest Portuguese banks - The Portuguese government federation of services project DP: OK, in your blog response, you mentioned 3 topics: 1. Using Oracle's standards based architecture; (you) were able to integrate any new system in days: can you give an example? What systems, how long did it take, number of apps/users/accounts/roles etc. JC: It's relatively easy to design a user management strategy for a static environment, or if you simply assume that you're an <insert vendor here> shop and all your systems will bow to that vendor's will. We've all seen that path, the use of proprietary technologies in interoperability solutions but, then reality kicks in. As an ISP I recall that I made the technical decision to use Active Directory as a central authentication system for the entire IT infrastructure. Clients, systems, apps, everything was there. As a good part of the systems and apps were running on UNIX, then a connector became needed in order to have UNIX boxes to authenticate against AD. And, that strategy worked but, each new machine required the component to be installed, monitoring had to be made for that component and each new app had to be independently certified. A self care user portal was an ongoing project, AD access assumes the client is inside the domain, something the ISP's customers (and UNIX boxes) weren't nor had any intention of ever being. When the Windows 2008 rollout was done, Microsoft changed the Active Directory interface. The Windows administrators didn't have enough know-how about directories and the way systems outside the MS world behaved so, on the go live, things weren't properly tested and a general outage followed. Several hours and 1 roll back later, everything was back working. But, the ISP still had to change all of its applications to work with the new access methods and reset the effort spent on the self service user portal. To keep with the same strategy, they would also have to trust Microsoft not to change interfaces again. Simply by putting up an Oracle LDAP server in the middle and replicating the user info from the AD into LDAP, most of the problems went away. Even systems for which no AD connector existed had PAM in them so, integration was made at the OS level, fully supported by the OS supplier. Sun Identity Manager already had a self care portal, combined with a user workflow so, all the clearances had to be given before the account was created or updated. Adding a new system as a client for these authentication services was simply a new checkbox in the OS installer and, even True64 systems were, for the first time integrated also with a 5 minute work of a junior system admin. True, all the windows clients and MS apps still went to the AD for their authentication needs so, from the start everybody knew that they weren't 100% free of migration pains but, now they had a single point of problems to look at. If you're looking for numbers: - 500K directory entries (users) - 2-300 systems After the initial setup, I personally integrated about 20 systems / apps against LDAP in 1 day while being watched by the different IT teams. The internal IT staff did the rest. DP: 2. Using Federation allows the business to keep options open for buying and selling companies, and yet maintain a single view for both employee and customer. What do you mean by this? Can you give an example? JC: The market is dynamic. The company that's being bought today tomorrow will be sold again. Companies that spread on different markets may see the regulator forcing a sale of part of a company due to monopoly reasons and companies that are in multiple countries have to comply with different legislations. Our job, as IT architects, while addressing the customers and employees authentication services, is quite hard and, quite contrary. On one hand, we need to give access to all of our employees to the relevant systems, apps and resources and, we already have marketing talking with us trying to find out who's a customer of the bough company but not from ours to address. On the other hand, we have to do that and keep in mind we may have to break up all that effort and that different countries legislation may became a problem with a full integration plan. That's a job for user Federation. you don't want to be the one who's telling your President that he will sell that business unit without it's customer's database (making the deal worth a lot less) or that the buyer will take with him a copy of your entire customer's database. Federation enables you to start controlling permissions to users outside of your traditional authentication realm. So what if the people of that company you just bought are keeping their old logins? Do you want, because of that, to have a dedicated system for their expenses reports? And do you want to keep their sales (and pre-sales) people out of the loop in terms of your group's path? Control the information flow, establish a Federation trust circle and give access to your apps to users that haven't (yet?) been brought into your internal login systems. You can still see your users in a unified view, you obviously control if a user has access to any particular application, either that user is in your local database or stored in a directory on the other side of the world. DP: 3. Cut response times of audit and security teams to 1/10. Is this a real number? Can you give an example? JC: No, I don't have any backing for this number. One of the companies I did system Administration for has a SOX compliance policy in place (I remind you that I live in Portugal so, this definition of SOX may be somewhat different from what you're used to) and, every time the audit team says they'll do another audit, we have to negotiate with them the size of the sample and we spend about 15 man/days gathering all the required info they ask. I did some work with Sun's Identity auditor and, from what I've been seeing, Oracle's product is even better and, I've seen that most of the information they ask would have been provided in a few hours with the help of this tool. I do stand by what I said here but, to be honest, someone from Identity Auditor team would do a much better job than me explaining this time savings. Jaime is right: the Oracle IDM products have a lot of business value, and Oracle IT is using them for a lot more than I was able to cover in the short podcast that I posted. I want to thank Jaime for his comments and perspective. We want these blog posts to be informative and honest – so if you have feedback for the Oracle IDM team on any topic discussed here, please post your comments below.

Read the article
Oracle Enterprise Manager 12c Configuration Best Practices (Part 3 of 3)

- by Bethany Lapaglia

<span id="XinhaEditingPostion"></span>&lt;span id=&quot;XinhaEditingPostion&quot;&gt;&lt;/span&gt;&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;span id=&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;quot;XinhaEditingPostion&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;quot;&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;gt;&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;/span&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;gt; This is part 3 of a three-part blog series that summarizes the most commonly implemented configuration changes to improve performance and operation of a large Enterprise Manager 12c environment. A “large” environment is categorized by the number of agents, targets and users. See the Oracle Enterprise Manager Cloud Control Advanced Installation and Configuration Guide chapter on Sizing for more details on sizing your environment properly. Part 1 of this series covered recommended configuration changes for the OMS and Repository Part 2 covered recommended changes for the Weblogic server Part 3 covers general configuration recommendations and a few known issues The entire series can be found in the My Oracle Support note titled Oracle Enterprise Manager 12c Configuration Best Practices [1553342.1]. Configuration Recommendations Configure E-Mail Notifications for EM related Alerts In some environments, the notifications for events for different target types may be sent to different support teams (i.e. notifications on host targets may be sent to a platform support team). However, the EM application administrators should be well informed of any alerts or problems seen on the EM infrastructure components. Recommendation: Create a new Incident rule for monitoring all EM components and setup the notifications to be sent to the EM administrator(s). The notification methods available can create or update an incident, send an email or forward to an event connector. To setup the incident rule set follow the steps below. Note that each individual rule in the rule set can have different actions configured. 1. To create an incident rule for monitoring the EM components, click on Setup / Incidents / Incident Rules. On the All Enterprise Rules page, click on the out-of-box rule called “Incident management Ruleset for all targets” and then click on the Actions drop down list and select “Create Like Rule Set…” 2. For the rule set name, enter a name such as MTM Ruleset. Under the Targets tab, select “All targets of types” and select “OMS and Repository” from the drop down list. This target type contains all of the key EM components (OMS servers, repository, domains, etc.) 3. Click on the Rules tab. To edit a rule, click on the rule name and click on Edit as seen below 4. Modify the following rules: a. Incident creation Rule for metric alerts i. Leave the Type set as is but change the Severity to add Warning by clicking on the drop down list and selecting “Warning”. Click Next. ii. Add or modify the actions as required (i.e. add email notifications). Click Continue and then click Next. iii. Leave the Name and description the same and click Next. iv. Click Continue on the Review page. b. Incident creation Rule for target unreachable. i. Leave the Type set as is but change the Target type to add OMS and Repository by clicking on the drop down list selecting “OMS and Repository”. Click Next. ii. Add or modify the actions as required (i.e. add email notifications) Click Continue and then click Next. iii. Leave the Name and description the same and click Next. iv. Click Continue on the Review page. 5. Modify the actions for any other rule as required and be sure to click the “Save” push button to save the rule set or all changes will be lost. Configure Out-of-Band Notifications for EM Agent Out-of-Band notifications act as a backup when there’s a complete EM outage or a repository database issue. This is configured on the agent of the OMS server and can be used to send emails or execute another script that would create a trouble ticket. It will send notifications about the following issues: • Repository Database down • All OMS are down • Repository side collection job that is broken or has an invalid schedule • Notification job that is broken or has an invalid schedule Recommendation: To setup Out-of-Band Notifications, refer to the MOS note “How To Setup Out Of Bound Email Notification In 12c” (Doc ID 1472854.1) Modify the Performance Test for the EM Console Service The EM Console Service has an out-of-box defined performance test that will be run to determine the status of this service. The test issues a request via an HTTP method to a specific URL. By default, the HTTP method used for this test is a GET but for performance reasons, should be changed to HEAD. The URL used for this request is set to point to a specific OMS server by default. If a multi-OMS system has been implemented and the OMS servers are behind a load balancer, then the URL in this section must be modified to point to the load balancer name instead of a specific server name. If this is not done and a portion of the infrastructure is down then the EM Console Service will show down as this test will fail. Recommendation: Modify the HTTP Method for the EM Console Service test and the URL if required following the detailed steps below. 1. To create an incident rule for monitoring the EM components, click on Targets / Services. From the list of services, click on the EM Console Service. 2. On the EM Console Service page, click on the Test Performance tab. 3. At the bottom of the page, click on the Web Transaction test called EM Console Service Test 4. Click on the Service Tests and Beacons breadcrumb near the top of the page. 5. Under the Service Tests section, make sure the EM Console Service Test is selected and click on the Edit push button. 6. Under the Transaction section, make sure the Access Logout page transaction is selected and click on the Edit push button 7) Under the Request section, change the HTTP Method from the default of GET to the recommended value of HEAD. The URL in this section must be modified to point to the load balancer name instead of a specific server name if multi-OMSes have been implemented. Check for Known Issues Job Purge Repository Job is Shown as Down This issue is caused after upgrading EM from 12c to 12cR2. On the Repository page under Setup ? Manage Cloud Control ? Repository, the job called “Job Purge” is shown as down and the Next Scheduled Run is blank. Also, repvfy reports that this is a missing DBMS_SCHEDULER job. Recommendation: In EM 12cR2, the apply_purge_policies have been moved from the MGMT_JOB_ENGINE package to the EM_JOB_PURGE package. To remove this error, execute the commands below: $ repvfy verify core -test 2 -fix To confirm that the issue resolved, execute $ repvfy verify core -test 2 It can also be verified by refreshing the Job Service page in EM and check the status of the job, it should now be Up. Configure the Listener Targets in EM with the Listener Password (where required) EM will report this error every time it is encountered in the listener log file. In a RAC environment, typically the grid home and rdbms homes are owned by different OS users. The listener always runs from the grid home. Only the listener process owner can query or change the listener properties. The listener uses a password to allow other OS users (ex. the agent user) to query the listener process for parameters. EM has a default listener target metric that will query these properties. If the agent is not permitted to do this, the TNS incident (TNS-1190) will be logged in the listener’s log file. This means that the listener targets in EM also need to have this password set. Not doing so will cause many TNS incidents (TNS-1190). Below is a sample of this error from the listener log file: Recommendation: Set a listener password and include it in the configuration of the listener targets in EM For steps on setting the listener passwords, see MOS notes: 260986.1 , 427422.1

Read the article

Search Results

Search found 205 results on 9 pages for 'outage'.

Page 8/9 | < Previous Page | 4 5 6 7 8 9 | Next Page >

- by David

- by Eric Belair

- by Matteo Mosca

- by pts

- by Ward Loockx

- by DustByte

- by Bob Rhubart

- by Antony Reynolds

- by JoeMeeks

- by jsavit

- by MarkPearl

- by Rob Farley

- by BuckWoody

- by Boris_yo

- by Asian Angel

- by Antony Reynolds

- by Tania Le Voi

- by ABrown

- by Geoff Dalgas

- by Brigadieren

- by Stu Thompson

- by Jason Fitzpatrick

- by rnigam

- by Darin Pendergraft

- by Bethany Lapaglia

< Previous Page | 4 5 6 7 8 9 | Next Page >