Search Results

Search found 140 results on 6 pages for 'heartbeat'.

Page 2/6 | < Previous Page | 1 2 3 4 5 6  | Next Page >

  • IP issue with Heartbeat & DRBD

    - by adam0345
    I'm in the process of setting up 3-node stacked DRBD, and i'm experiencing a rather bizarre issue. Two nodes are located at the data center, and the 3rd node is located locally. The Primary and Secondary nodes are working as expected, however the 3rd node won't connect to the primary. If I ping the IP provided by heartbeat on the 3rd node it will return 100% packet loss, if I reset networking interfaces, ping will then return a few successful packets, but then stop returning any packets. I can't work out any reason why this would be behaving like this. All nodes are running Debian Squeeze, and the latest version of DRBD.

    Read the article

  • High Availability Configuration using Heartbeat and Pacemaker

    - by pradeepchhetri
    I have the following setup: I have configured high availability between two load balancers (HAProxy) so that if HAProxy1 get down, the floating IP gets transferred to the other load balancer HAProxy2, hence all the clients will get the response from HAProxy2, which at the back-end is doing LB among the sme two webserver. This is for removing the single point of failure in case of only one HAProxy. Whenever I stops the hearbeat in HAProxy1, the floating IP goes to HAProxy2. But I want to configure such that whenever the process haproxy goes down, the floating IP should get assigned to HAProxy2. Can someone tell me how to implement it ?

    Read the article

  • Architecture for highly available MySQL with automatic failover in physically diverse locations

    - by Warner
    I have been researching high availability (HA) solutions for MySQL between data centers. For servers located in the same physical environment, I have preferred dual master with heartbeat (floating VIP) using an active passive approach. The heartbeat is over both a serial connection as well as an ethernet connection. Ultimately, my goal is to maintain this same level of availability but between data centers. I want to dynamically failover between both data centers without manual intervention and still maintain data integrity. There would be BGP on top. Web clusters in both locations, which would have the potential to route to the databases between both sides. If the Internet connection went down on site 1, clients would route through site 2, to the Web cluster, and then to the database in site 1 if the link between both sites is still up. With this scenario, due to the lack of physical link (serial) there is a more likely chance of split brain. If the WAN went down between both sites, the VIP would end up on both sites, where a variety of unpleasant scenarios could introduce desync. Another potential issue I see is difficulty scaling this infrastructure to a third data center in the future. The network layer is not a focus. The architecture is flexible at this stage. Again, my focus is a solution for maintaining data integrity as well as automatic failover with the MySQL databases. I would likely design the rest around this. Can you recommend a proven solution for MySQL HA between two physically diverse sites? Thank you for taking the time to read this. I look forward to reading your recommendations.

    Read the article

  • Testing shared memory ,strange thing happen

    - by barfatchen
    I have 2 program compiled in 4.1.2 running in RedHat 5.5 , It is a simple job to test shared memory , shmem1.c like following : #define STATE_FILE "/program.shared" #define NAMESIZE 1024 #define MAXNAMES 100 typedef struct { char name[MAXNAMES][NAMESIZE]; int heartbeat ; int iFlag ; } SHARED_VAR; int main (void) { int first = 0; int shm_fd; static SHARED_VAR *conf; if((shm_fd = shm_open(STATE_FILE, (O_CREAT | O_EXCL | O_RDWR), (S_IREAD | S_IWRITE))) > 0 ) { first = 1; /* We are the first instance */ } else if((shm_fd = shm_open(STATE_FILE, (O_CREAT | O_RDWR), (S_IREAD | S_IWRITE))) < 0) { printf("Could not create shm object. %s\n", strerror(errno)); return errno; } if((conf = mmap(0, sizeof(SHARED_VAR), (PROT_READ | PROT_WRITE), MAP_SHARED, shm_fd, 0)) == MAP_FAILED) { return errno; } if(first) { for(idx=0;idx< 1000000000;idx++) { conf->heartbeat = conf->heartbeat + 1 ; } } printf("conf->heartbeat=(%d)\n",conf->heartbeat) ; close(shm_fd); shm_unlink(STATE_FILE); exit(0); }//main And shmem2.c like following : #define STATE_FILE "/program.shared" #define NAMESIZE 1024 #define MAXNAMES 100 typedef struct { char name[MAXNAMES][NAMESIZE]; int heartbeat ; int iFlag ; } SHARED_VAR; int main (void) { int first = 0; int shm_fd; static SHARED_VAR *conf; if((shm_fd = shm_open(STATE_FILE, (O_RDWR), (S_IREAD | S_IWRITE))) < 0) { printf("Could not create shm object. %s\n", strerror(errno)); return errno; } ftruncate(shm_fd, sizeof(SHARED_VAR)); if((conf = mmap(0, sizeof(SHARED_VAR), (PROT_READ | PROT_WRITE), MAP_SHARED, shm_fd, 0)) == MAP_FAILED) { return errno; } int idx ; for(idx=0;idx< 1000000000;idx++) { conf->heartbeat = conf->heartbeat + 1 ; } printf("conf->heartbeat=(%d)\n",conf->heartbeat) ; close(shm_fd); exit(0); } After compiled : gcc shmem1.c -lpthread -lrt -o shmem1.exe gcc shmem2.c -lpthread -lrt -o shmem2.exe And Run both program almost at the same time with 2 terminal : [test]$ ./shmem1.exe First creation of the shm. Setting up default values conf->heartbeat=(840825951) [test]$ ./shmem2.exe conf->heartbeat=(1215083817) I feel confused !! since shmem1.c is a loop 1,000,000,000 times , how can it be possible to have a answer like 840,825,951 ? I run shmem1.exe and shmem2.exe this way,most of the results are conf-heartbeat will larger than 1,000,000,000 , but seldom and randomly , I will see result conf-heartbeat will lesser than 1,000,000,000 , either in shmem1.exe or shmem2.exe !! if run shmem1.exe only , it is always print 1,000,000,000 , my question is , what is the reason cause conf-heartbeat=(840825951) in shmem1.exe ? Update: Although not sure , but I think I figure it out what is going on , If shmem1.exe run 10 times for example , then conf-heartbeat = 10 , in this time shmem1.exe take a rest and then back , shmem1.exe read from shared memory and conf-heartbeat = 8 , so shmem1.exe will continue from 8 , why conf-heartbeat = 8 ? I think it is because shmem2.exe update the shared memory data to 8 , shmem1.exe did not write 10 back to shared memory before it took a rest ....that is just my theory... i don't know how to prove it !!

    Read the article

  • DRBD with MySQL

    - by tdimmig
    Question about using DRBD to provide HA for MySQL. I need to be sure that my backup MySQL instance is always going to be in a functional state when the failover occurs. What happens, for example, if the primary dies part way through committing a transaction? Are we going to end up with data copied to the secondary that mysql can't handle? Or, what if the network goes away while the two are syncing, and not all of the data makes it across. It seems like it's possible to get into a state where incomplete data on the secondary makes it impossible for mysql to start up and read the database. Am I missing something?

    Read the article

  • Multiple Notifications Not Firing

    - by motionpotion
    I'm scheduling two notifications as shown below. The app is a long-lived app. One local notification is scheduled to run every hour. The other is scheduled to run once per day. Only the second scheduled notification (the hourly notifcation) fires. - (void)scheduleNotification { LogInfo(@"IN scheduleNotification - DELETEYESTERDAY NOTIFICATION SCHEDULED."); UILocalNotification *notif = [[UILocalNotification alloc] init]; NSDictionary *deleteDict = [NSDictionary dictionaryWithObject:@"DeleteYesterday" forKey:@"DeleteYesterday"]; NSCalendar *calendar = [NSCalendar currentCalendar]; NSDateComponents *components = [[NSDateComponents alloc] init]; components = [[NSCalendar currentCalendar] components:NSDayCalendarUnit | NSMonthCalendarUnit | NSYearCalendarUnit fromDate:[NSDate date]]; NSInteger day = [components day]; NSInteger month = [components month]; NSInteger year = [components year]; [components setDay: day]; [components setMonth: month]; [components setYear: year]; [components setHour: 00]; [components setMinute: 45]; [components setSecond: 0]; [calendar setTimeZone: [NSTimeZone systemTimeZone]]; NSDate *dateToFire = [calendar dateFromComponents:components]; notif.fireDate = dateToFire; notif.timeZone = [NSTimeZone systemTimeZone]; notif.repeatInterval = NSDayCalendarUnit; notif.userInfo = deleteDict; [[UIApplication sharedApplication] scheduleLocalNotification:notif]; } and then I schedule this after above: - (void)scheduleHeartBeat { LogInfo(@"IN scheduleHeartBeat - HEARTBEAT NOTIFICATION SCHEDULED."); UILocalNotification *heartbeat = [[UILocalNotification alloc] init]; NSDictionary *heartbeatDict = [NSDictionary dictionaryWithObject:@"HeartBeat" forKey:@"HeartBeat"]; heartbeat.userInfo = heartbeatDict; NSCalendar *calendar = [NSCalendar currentCalendar]; NSDateComponents *components = [[NSDateComponents alloc] init]; components = [[NSCalendar currentCalendar] components:NSDayCalendarUnit | NSMonthCalendarUnit | NSYearCalendarUnit fromDate:[NSDate date]]; NSInteger day = [components day]; NSInteger month = [components month]; NSInteger year = [components year]; [components setDay: day]; [components setMonth: month]; [components setYear: year]; [components setHour: 00]; [components setMinute: 50]; [components setSecond: 0]; [calendar setTimeZone: [NSTimeZone systemTimeZone]]; NSDate *dateToFire = [calendar dateFromComponents:components]; heartbeat.fireDate = dateToFire; heartbeat.timeZone = [NSTimeZone systemTimeZone]; heartbeat.repeatInterval = NSHourCalendarUnit; [[UIApplication sharedApplication] scheduleLocalNotification:heartbeat]; } The above are scheduled when the app launches in the viewDidLoad of the main view controller. - (void)viewDidLoad { [self scheduleNotification]; [self scheduleHeartBeat]; [super viewDidLoad]; //OTHER CODE HERE } Then in the appdelegate I have the following: - (void)application:(UIApplication *)application didReceiveLocalNotification:(UILocalNotification *)notification { LogInfo(@"IN didReceiveLocalNotification NOTIFICATION RECEIVED."); NSString *notificationHeartBeat = nil; NSString *notificationDeleteYesterday = nil; application.applicationIconBadgeNumber = 0; if (notification) { notificationHeartBeat = [notification.userInfo objectForKey:@"HeartBeat"]; notificationDeleteYesterday = [notification.userInfo objectForKey:@"DeleteYesterday"]; LogInfo(@"IN didReceiveLocalNotification HEARTBEAT NOTIFICATION TYPE: %@", notificationHeartBeat); LogInfo(@"IN didReceiveLocalNotification DELETEYESTERDAY NOTIFICATION TYPE: %@", notificationDeleteYesterday); } if ([notificationHeartBeat isEqualToString:@"HeartBeat"]) { //CREATE THE HEARTBEAT LogInfo(@"CREATING THE HEARTBEAT."); //CALL THE FUNCTIONALITY HERE THAT CREATES HEARTBEAT. } if ([notificationDeleteYesterday isEqualToString:@"DeleteYesterday"]) { //DELETE YESTERDAYS RECORDS LogInfo(@"DELETING YESTERDAYS RECORDS."); } } The notification that is scheduled last (scheduleHeartBeat) is the only notification that is fired. Could somebody help me figure out why this is happening?

    Read the article

  • Python service using Upstart on Ubuntu

    - by Soumya Simanta
    I want to create to deploy a heartbeat service (a python script) as a service using Upstart. My understanding is that I've to add a /etc/init/myheartbeatservice.conf with the following contents. # my heartbeat service description "Heartbeat monitor" start on startup stop on shutdown script exec /path/to/my/python/script.py end script My script starts another service process and the monitors the processes and sends heartbeat to an outside server regularly. Are startup and shutdown the correct events ? Also my script create a new thread. I'm assuming I also need to add fork daemon to my conf file? Thanks.

    Read the article

  • Oracle GoldenGate: Knowledge Document Series Post #2

    - by Doug Reid
    0 false 18 pt 18 pt 0 0 false false false /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Times New Roman"; mso-ascii-font-family:Cambria; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Cambria; mso-hansi-theme-font:minor-latin;} 0 false 18 pt 18 pt 0 0 false false false /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Times New Roman"; mso-ascii-font-family:Cambria; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Cambria; mso-hansi-theme-font:minor-latin;} For our second post in this series the team would like to highlight the knowledge document “How-To: Oracle GoldenGate – Heartbeat Process to Monitor Lag and Performance”. This knowledge document outlines a procedure to reliably measure lag between source and target systems through the use of 'heartbeat' tables. The basic idea is to have a table on the source system that gets updated at a predetermined interval. In your capture processes you would capture the update from the heartbeat table. Using tokens you would add some additional information to the heartbeat record to be able to tell which extract process was capturing the update. This additional information would be used downstream to calculate the real lag time between the source and target systems for a given extract and by checking the last update time on the heartbeat at the target you could also determine if data has stopped flowing between the source and target.  Click here to view the document

    Read the article

  • ??????????

    - by Allen Gao
    ???????RAC??????????????????10gR2?11gR1.??????????????CRS???????1.ocssd : ???????????(Node Monitoring)????(Group Management),??CRS????????????????????????,????????????(network heartbeat)?????(disk heartbeat)???,?????????????????????,????????????,??????????????????,?????node kill escalation(???11gR1????????),??????????????????????????(reboot time,???3?)????????:ocssd.bin??????????????????????????,??????????????????????????????,misscount(???30?,??????????????600?),????????????,???????????????????,?????????????2???,??????,??????????????,?????????????????????:ocssd.bin?????????????(Voting File)??????????,?????????????????????????????,disk timeou(???200?),?????????????????????,CRS???[N/2]+1????????,??N??????,??????2.oclsomon:????????ocssd????,????ocssd.bin??????,???????3.oprocd:??????Linux?Unix??,????????????????????????????????,?????????:????????????init.cssd?????????????????????????1.??????2.<crs???>/log/<????>/cssd/ocssd.log3.oprocd.log(/etc/oracle/oprocd/*.log.* ? /var/opt/oracle/oprocd/*.log.*)4.<crs???>/log/<????>/cssd/oclsomon/oclsomon.log5. Oracle OSWatcher ????????????????????1.?ocssd???????????ocssd.log???????,????????????????????????????????,???????,OSW??(traceroute???),???????(cluster interconnect)??????,?????????[ CSSD]2012-03-02 23:56:18.749 [3086] >WARNING: clssnmPollingThread: node <node_name> at 50% heartbeat fatal, eviction in 14.494 seconds[ CSSD]2012-03-02 23:56:25.749 [3086] >WARNING: clssnmPollingThread: node <node_name> at 75% heartbeat fatal, eviction in 7.494 seconds[ CSSD]2012-03-02 23:56:32.749 [3086] >WARNING: clssnmPollingThread: node <node_name>at 90% heartbeat fatal, eviction in 0.494 seconds[CSSD]2012-03-02 23:56:33.243 [3086] >TRACE:   clssnmPollingThread: Eviction started for node <node_name>, flags 0x040d, state 3, wt4c 0[CSSD]2012-03-02 23:56:33.243 [3086] >TRACE:   clssnmDiscHelper: <node_name>, node(4) connection failed, con (1128a5530), probe(0)[CSSD]2012-03-02 23:56:33.243 [3086] >TRACE:   clssnmDiscHelper: node 4 clean up, con (1128a5530), init state 5, cur state 5[CSSD]2012-03-02 23:56:33.243 [3600] >TRACE:   clssnmDoSyncUpdate: Initiating sync 196446491[CSSD]2012-03-02 23:56:33.243 [3600] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (27000)ms??:???????ocssd.log?????????????????????,?????????????????????ocssd.log???????,??????????????????????????????,OSWatcher??(iostat???),???i/o????????,?????????2010-08-13 18:34:37.423: [    CSSD][150477728]clssnmvDiskOpen: Opening /dev/sdb82010-08-13 18:34:37.423: [    CLSF][150477728]Opened hdl:0xf4336530 for dev:/dev/sdb8:2010-08-13 18:34:37.429: [   SKGFD][150477728]ERROR: -9(Error 27072, OS Error (Linux Error: 5: Input/output errorAdditional information: 4Additional information: 720913Additional information: -1))2010-08-13 18:34:37.429: [    CSSD][150477728](:CSSNM00060: )clssnmvReadBlocks: read failed at offset 17 of /dev/sdb82010-08-13 18:34:38.205: [    CSSD][4110736288](:CSSNM00058: )clssnmvDiskCheck: No I/O completions for 200880 ms for voting file /dev/sdb8)2010-08-13 18:34:38.206: [    CSSD][4110736288](:CSSNM00018: )clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 12010-08-13 18:34:38.206: [    CSSD][4110736288]###################################2010-08-13 18:34:38.206: [    CSSD][4110736288]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread 2010-08-13 18:34:38.206: [    CSSD][4110736288]###################################2. ?oclsomon???????????oclsomon.log ?????,??????????ocssd????,??ocssd??????(RT)???,?????????????(?cpu)??,?????????????,OSW??(vmstat,top???),?????????3.?oprocd???????????oprocd?????????,?????????oprocd????? Dec 21 16:15:30.369857 | LASTGASP | AlarmHandler:  timeout(2312 msec) exceeds interval(1000 msec)+margin(500 msec).   Rebooting NOW.??oprocd?????????????????????,?????ntp(?????????),??diagwait=13 ????????,??,?????????????,??????CRS,???????????????,??????????????oprocd????,??,?????OSWatcher??(vmstat,top???),??????????????????????????????,????????????????? ???????,??????MOS ???Note 265769.1 :Troubleshooting 10g and 11.1 Clusterware RebootsNote 1050693.1 :Troubleshooting 11.2 Clusterware Node Evictions (Reboots)

    Read the article

  • Corosync :: Restarting some resources after Lan connectivity issue

    - by moebius_eye
    I am currently looking into corosync to build a two-node cluster. So, I've got it working fine, and it does what I want to do, which is: Lost connectivity between the two nodes gives the first node '10node' both Failover Wan IPs. (aka resources WanCluster100 and WanCluster101 ) '11node' does nothing. He "thinks" he still has his Failover Wan IP. (aka WanCluster101) But it doesn't do this: '11node' should restart the WanCluster101 resource when the connectivity with the other node is back. This is to prevent a condition where node10 simply dies (and thus does not get 11node's Failover Wan IP), resulting in a situation where none of the nodes have 10node's failover IP because 10node is down 11node has "given back" his failover Wan IP. Here's the current configuration I'm working on. node 10sch \ attributes standby="off" node 11sch \ attributes standby="off" primitive LanCluster100 ocf:heartbeat:IPaddr2 \ params ip="172.25.0.100" cidr_netmask="32" nic="eth3" \ op monitor interval="10s" \ meta is-managed="true" target-role="Started" primitive LanCluster101 ocf:heartbeat:IPaddr2 \ params ip="172.25.0.101" cidr_netmask="32" nic="eth3" \ op monitor interval="10s" \ meta is-managed="true" target-role="Started" primitive Ping100 ocf:pacemaker:ping \ params host_list="192.0.2.1" multiplier="500" dampen="15s" \ op monitor interval="5s" \ meta target-role="Started" primitive Ping101 ocf:pacemaker:ping \ params host_list="192.0.2.1" multiplier="500" dampen="15s" \ op monitor interval="5s" \ meta target-role="Started" primitive WanCluster100 ocf:heartbeat:IPaddr2 \ params ip="192.0.2.100" cidr_netmask="32" nic="eth2" \ op monitor interval="10s" \ meta target-role="Started" primitive WanCluster101 ocf:heartbeat:IPaddr2 \ params ip="192.0.2.101" cidr_netmask="32" nic="eth2" \ op monitor interval="10s" \ meta target-role="Started" primitive Website0 ocf:heartbeat:apache \ params configfile="/etc/apache2/apache2.conf" options="-DSSL" \ operations $id="Website-one" \ op start interval="0" timeout="40" \ op stop interval="0" timeout="60" \ op monitor interval="10" timeout="120" start-delay="0" statusurl="http://127.0.0.1/server-status/" \ meta target-role="Started" primitive Website1 ocf:heartbeat:apache \ params configfile="/etc/apache2/apache2.conf.1" options="-DSSL" \ operations $id="Website-two" \ op start interval="0" timeout="40" \ op stop interval="0" timeout="60" \ op monitor interval="10" timeout="120" start-delay="0" statusurl="http://127.0.0.1/server-status/" \ meta target-role="Started" group All100 WanCluster100 LanCluster100 group All101 WanCluster101 LanCluster101 location AlwaysPing100WithNode10 Ping100 \ rule $id="AlWaysPing100WithNode10-rule" inf: #uname eq 10sch location AlwaysPing101WithNode11 Ping101 \ rule $id="AlWaysPing101WithNode11-rule" inf: #uname eq 11sch location NeverLan100WithNode11 LanCluster100 \ rule $id="RAND1083308" -inf: #uname eq 11sch location NeverPing100WithNode11 Ping100 \ rule $id="NeverPing100WithNode11-rule" -inf: #uname eq 11sch location NeverPing101WithNode10 Ping101 \ rule $id="NeverPing101WithNode10-rule" -inf: #uname eq 10sch location Website0NeedsConnectivity Website0 \ rule $id="Website0NeedsConnectivity-rule" -inf: not_defined pingd or pingd lte 0 location Website1NeedsConnectivity Website1 \ rule $id="Website1NeedsConnectivity-rule" -inf: not_defined pingd or pingd lte 0 colocation Never -inf: LanCluster101 LanCluster100 colocation Never2 -inf: WanCluster100 LanCluster101 colocation NeverBothWebsitesTogether -inf: Website0 Website1 property $id="cib-bootstrap-options" \ dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ no-quorum-policy="ignore" \ stonith-enabled="false" \ last-lrm-refresh="1408954702" \ maintenance-mode="false" rsc_defaults $id="rsc-options" \ resource-stickiness="100" \ migration-threshold="3" I also have a less important question concerning this line: colocation NeverBothLans -inf: LanCluster101 LanCluster100 How do I tell it that this collocation only applies to '11node'.

    Read the article

  • ldirectord ipvsadm not show reals ip and not work wtih pacemaker and corosync

    - by miguer27
    first thanks for your time. I'm having a problem with ldirectord that I can not solve, I comment my situation: I have two nodes with pace maker and corosync and configure somes resources: root@ldap1:/home/mamartin# crm status Last updated: Tue Jun 3 12:58:30 2014 Last change: Tue Jun 3 12:23:47 2014 via cibadmin on ldap1 Stack: openais Current DC: ldap2 - partition with quorum Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff 2 Nodes configured, 2 expected votes 7 Resources configured. Online: [ ldap1 ldap2 ] Resource Group: IPV_LVS IPV_4 (ocf::heartbeat:IPaddr2): Started ldap1 IPV_6 (ocf::heartbeat:IPv6addr): Started ldap1 lvs (ocf::heartbeat:ldirectord): Started ldap1 Clone Set: clon_IPV_lo [IPV_lo] Started: [ ldap2 ] Stopped: [ IPV_lo:1 ] root@ldap1:/home/mamartin# crm configure show node ldap2 \ attributes standby="off" node ldap1 \ attributes standby="off" primitive IPV-lo_4 ocf:heartbeat:IPaddr \ params ip="192.168.1.10" cidr_netmask="32" nic="lo" \ op monitor interval="5s" primitive IPV-lo_6 ocf:heartbeat:IPv6addrLO \ params ipv6addr="[fc00:1::3]" cidr_netmask="64" \ op monitor interval="5s" primitive IPV_4 ocf:heartbeat:IPaddr2 \ params ip="192.168.1.10" nic="eth0" cidr_netmask="25" lvs_support="true" \ op monitor interval="5s" primitive IPV_6 ocf:heartbeat:IPv6addr \ params ipv6addr="[fc00:1::3]" nic="eth0" cidr_netmask="64" \ op monitor interval="5s" primitive lvs ocf:heartbeat:ldirectord \ params configfile="/etc/ldirectord.cf" \ op monitor interval="20" timeout="10" \ meta target-role="Started" group IPV_LVS IPV_4 IPV_6 lvs group IPV_lo IPV-lo_6 IPV-lo_4 clone clon_IPV_lo IPV_lo \ meta interleave="true" target-role="Started" location cli-prefer-IPV_LVS IPV_LVS \ rule $id="cli-prefer-rule-IPV_LVS" inf: #uname eq ldap1 colocation LVS_no_IPV_lo -inf: clon_IPV_lo IPV_LVS property $id="cib-bootstrap-options" \ dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ no-quorum-policy="ignore" \ stonith-enabled="false" \ last-lrm-refresh="1401264327" rsc_defaults $id="rsc-options" \ resource-stickiness="1000" The problem is in the ipvsadm only show a one real IP, when i configured two now, show the ldirector.cf: root@ldap1:/home/mamartin# ipvsadm IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags - RemoteAddress:Port Forward Weight ActiveConn InActConn TCP ldap-maqueta.cica.es:ldap wrr - ldap2.cica.es:ldap Route 4 0 0 TCP [[fc00:1::3]]:ldap wrr - [[fc00:1::2]]:ldap Route 4 0 0 root@ldap1:/home/mamartin# cat /etc/ldirectord.cf checktimeout=10 checkinterval=2 autoreload=yes logfile="/var/log/ldirectord.log" quiescent=yes #ipv4 virtual=192.168.1.10:389 real=192.168.1.11:389 gate 4 real=192.168.1.12:389 gate 4 scheduler=wrr protocol=tcp checktype=on #ipv6 virtual6=[[fc00:1::3]]:389 real6=[[fc00:1::1]]:389 gate 4 real6=[[fc00:1::2]]:389 gate 4 scheduler=wrr protocol=tcp checkport=389 checktype=on and in the logs I see nothing clear: root@ldap1:/home/mamartin# ldirectord -d /etc/ldirectord.cf start DEBUG2: Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.11:389 -g -w 0) Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.11:389 -g -w 0) DEBUG2: Quiescent real server: 192.168.1.11:389 (192.168.1.10:389) (Weight set to 0) Quiescent real server: 192.168.1.11:389 (192.168.1.10:389) (Weight set to 0) DEBUG2: Disabled real server=on:tcp:192.168.1.11:389:::4:gate:\/: (virtual=tcp:192.168.1.10:389) DEBUG2: Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 0) Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 0) DEBUG2: Quiescent real server: 192.168.1.12:389 (192.168.1.10:389) (Weight set to 0) Quiescent real server: 192.168.1.12:389 (192.168.1.10:389) (Weight set to 0) DEBUG2: Disabled real server=on:tcp:192.168.1.12:389:::4:gate:\/: (virtual=tcp:192.168.1.10:389) DEBUG2: Checking on: Real servers are added without any checks DEBUG2: Resetting soft failure count: 192.168.1.12:389 (tcp:192.168.1.10:389) Resetting soft failure count: 192.168.1.12:389 (tcp:192.168.1.10:389) DEBUG2: Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 4) Running system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 4) Destination already exists root@ldap1:/home/mamartin# cat /var/log/ldirectord.log [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Quiescent real server: 192.168.1.11:389 (192.168.1.10:389) (Weight set to 0) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Quiescent real server: 192.168.1.12:389 (192.168.1.10:389) (Weight set to 0) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Resetting soft failure count: 192.168.1.12:389 (tcp:192.168.1.10:389) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] system(/sbin/ipvsadm -a -t 192.168.1.10:389 -r 192.168.1.12:389 -g -w 4) failed: [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Added real server: 192.168.1.12:389 (192.168.1.10:389) (Weight set to 4) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Resetting soft failure count: 192.168.1.11:389 (tcp:192.168.1.10:389) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Restored real server: 192.168.1.11:389 (192.168.1.10:389) (Weight set to 4) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Resetting soft failure count: [[fc00:1::2]]:389 (tcp:[[fc00:1::3]]:389) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] system(/sbin/ipvsadm -a -t [[fc00:1::3]]:389 -r [[fc00:1::2]]:389 -g -w 4) failed: [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Added real server: [[fc00:1::2]]:389 ([[fc00:1::3]]:389) (Weight set to 4) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Resetting soft failure count: [[fc00:1::1]]:389 (tcp:[[fc00:1::3]]:389) [Tue Jun 3 09:39:29 2014|ldirectord.cf|19266] Restored real server: [[fc00:1::1]]:389 ([[fc00:1::3]]:389) (Weight set to 4) do not know if this is a bug or a configuration error, can anyone help? Regards.

    Read the article

  • SAN/NAS with high availability?

    - by netvope
    I have two servers that I plan to use for storage. Each of them has a few SATA disks directly attached. I want the storage to be available even if one of the storage servers is down (preferably the clients wouldn't even notice that the fail-over, although I'm not sure if this is possible). The clients may access the storage via NFS and samba, but this is not a must; I could use something else if needed. I found this guide, Installing and Configuring Openfiler with DRBD and Heartbeat, which apparently does the thing I want. It relies on three components, Openfiler, DRBD, and Heartbeat, and all three of them need to be configured separately. I'm wondering are there simpler solutions? Is using DRBD+Heartbeat the best practice for a situation like mine? I'm also interested to know if there are alternatives that don't depend on DRBD.

    Read the article

  • Can my thread help the OS decide when to context switch it out?

    - by WilliamKF
    I am working on a threaded application on Linux in C++ which attempts to be real time, doing an action on a heartbeat, or as close to it as possible. In practice, I find the OS is swapping out my thread and causing delays of up to a tenth of a second while it is switched out, causing the heartbeat to be irregular. Is there a way my thread can hint to the OS that now is a good time to context switch it out? I could make this call right after doing a heartbeat, and thus minimize the delay due to an ill timed context switch.

    Read the article

  • Oracle Coherence, Split-Brain and Recovery Protocols In Detail

    - by Ricardo Ferreira
    This article provides a high level conceptual overview of Split-Brain scenarios in distributed systems. It will focus on a specific example of cluster communication failure and recovery in Oracle Coherence. This includes a discussion on the witness protocol (used to remove failed cluster members) and the panic protocol (used to resolve Split-Brain scenarios). Note that the removal of cluster members does not necessarily indicate a Split-Brain condition. Oracle Coherence does not (and cannot) detect a Split-Brain as it occurs, the condition is only detected when cluster members that previously lost contact with each other regain contact. Cluster Topology and Configuration In order to create an good didactic for the article, let's assume a cluster topology and configuration. In this example we have a six member cluster, consisting of one JVM on each physical machine. The member IDs are as follows: Member ID  IP Address  1  10.149.155.76  2  10.149.155.77  3  10.149.155.236  4  10.149.155.75  5  10.149.155.79  6  10.149.155.78 Members 1, 2, and 3 are connected to a switch, and members 4, 5, and 6 are connected to a second switch. There is a link between the two switches, which provides network connectivity between all of the machines. Member 1 is the first member to join this cluster, thus making it the senior member. Member 6 is the last member to join this cluster. Here is a log snippet from Member 6 showing the complete member set: 2010-02-26 15:27:57.390/3.062 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=main, member=6): Started DefaultCacheServer... SafeCluster: Name=cluster:0xDDEB Group{Address=224.3.5.3, Port=35465, TTL=4} MasterMemberSet ( ThisMember=Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) OldestMember=Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) ActualMemberSet=MemberSet(Size=6, BitSetCount=2 Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer) Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer) Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) Member(Id=5, Timestamp=2010-02-26 15:27:49.095, Address=10.149.155.79:8088, MachineId=1103, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:3229, Role=CoherenceServer) Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) ) RecycleMillis=120000 RecycleSet=MemberSet(Size=0, BitSetCount=0 ) ) At approximately 15:30, the connection between the two switches is severed: Thirty seconds later (the default packet timeout in development mode) the logs indicate communication failures across the cluster. In this example, the communication failure was caused by a network failure. In a production setting, this type of communication failure can have many root causes, including (but not limited to) network failures, excessive GC, high CPU utilization, swapping/virtual memory, and exceeding maximum network bandwidth. In addition, this type of failure is not necessarily indicative of a split brain. Any communication failure will be logged in this fashion. Member 2 logs a communication failure with Member 5: 2010-02-26 15:30:32.638/196.928 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=5, Timestamp=2010-02-26 15:27:49.095, Address=10.149.155.79:8088, MachineId=1103, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:3229, Role=CoherenceServer) by MemberSet(Size=2, BitSetCount=2 Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) ) The Coherence clustering protocol (TCMP) is a reliable transport mechanism built on UDP. In order for the protocol to be reliable, it requires an acknowledgement (ACK) for each packet delivered. If a packet fails to be acknowledged within the configured timeout period, the Coherence cluster member will log a packet timeout (as seen in the log message above). When this occurs, the cluster member will consult with other members to determine who is at fault for the communication failure. If the witness members agree that the suspect member is at fault, the suspect is removed from the cluster. If the witnesses unanimously disagree, the accuser is removed. This process is known as the witness protocol. Since Member 2 cannot communicate with Member 5, it selects two witnesses (Members 1 and 4) to determine if the communication issue is with Member 5 or with itself (Member 2). However, Member 4 is on the switch that is no longer accessible by Members 1, 2 and 3; thus a packet timeout for member 4 is recorded as well: 2010-02-26 15:30:35.648/199.938 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) by MemberSet(Size=2, BitSetCount=2 Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) ) Member 1 has the ability to confirm the departure of member 4, however Member 6 cannot as it is also inaccessible. At the same time, Member 3 sends a request to remove Member 6, which is followed by a report from Member 3 indicating that Member 6 has departed the cluster: 2010-02-26 15:30:35.706/199.996 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=2): MemberLeft request for Member 6 received from Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer) 2010-02-26 15:30:35.709/199.999 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=2): MemberLeft notification for Member 6 received from Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer) The log for Member 3 determines how Member 6 departed the cluster: 2010-02-26 15:30:35.161/191.694 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=3): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) by MemberSet(Size=2, BitSetCount=2 Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer) ) 2010-02-26 15:30:35.165/191.698 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=3): Member departure confirmed by MemberSet(Size=2, BitSetCount=2 Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer) ); removing Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) In this case, Member 3 happened to select two witnesses that it still had connectivity with (Members 1 and 2) thus resulting in a simple decision to remove Member 6. Given the departure of Member 6, Member 2 is left with a single witness to confirm the departure of Member 4: 2010-02-26 15:30:35.713/200.003 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=2): Member departure confirmed by MemberSet(Size=1, BitSetCount=2 Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) ); removing Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) In the meantime, Member 4 logs a missing heartbeat from the senior member. This message is also logged on Members 5 and 6. 2010-02-26 15:30:07.906/150.453 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=PacketListenerN, member=4): Scheduled senior member heartbeat is overdue; rejoining multicast group. Next, Member 4 logs a TcpRing failure with Member 2, thus resulting in the termination of Member 2: 2010-02-26 15:30:21.421/163.968 Oracle Coherence GE 3.5.3/465p2 <D4> (thread=Cluster, member=4): TcpRing: Number of socket exceptions exceeded maximum; last was "java.net.SocketTimeoutException: connect timed out"; removing the member: 2 For quick process termination detection, Oracle Coherence utilizes a feature called TcpRing which is a sparse collection of TCP/IP-based connections between different members in the cluster. Each member in the cluster is connected to at least one other member, which (if at all possible) is running on a different physical box. This connection is not used for any data transfer, only heartbeat communications are sent once a second per each link. If a certain number of exceptions are thrown while trying to re-establish a connection, the member throwing the exceptions is removed from the cluster. Member 5 logs a packet timeout with Member 3 and cites witnesses Members 4 and 6: 2010-02-26 15:30:29.791/165.037 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=5): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer) by MemberSet(Size=2, BitSetCount=2 Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) ) 2010-02-26 15:30:29.798/165.044 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=5): Member departure confirmed by MemberSet(Size=2, BitSetCount=2 Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) ); removing Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer) Eventually we are left with two distinct clusters consisting of Members 1, 2, 3 and Members 4, 5, 6, respectively. In the latter cluster, Member 4 is promoted to senior member. The connection between the two switches is restored at 15:33. Upon the restoration of the connection, the cluster members immediately receive cluster heartbeats from the two senior members. In the case of Members 1, 2, and 3, the following is logged: 2010-02-26 15:33:14.970/369.066 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=Cluster, member=1): The member formerly known as Member(Id=4, Timestamp=2010-02-26 15:30:35.341, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored. Likewise for Members 4, 5, and 6: 2010-02-26 15:33:14.343/336.890 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=Cluster, member=4): The member formerly known as Member(Id=1, Timestamp=2010-02-26 15:30:31.64, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored. This message indicates that a senior heartbeat is being received from members that were previously removed from the cluster, in other words, something that should not be possible. For this reason, the recipients of these messages will initially ignore them. After several iterations of these messages, the existence of multiple clusters is acknowledged, thus triggering the panic protocol to reconcile this situation. When the presence of more than one cluster (i.e. Split-Brain) is detected by a Coherence member, the panic protocol is invoked in order to resolve the conflicting clusters and consolidate into a single cluster. The protocol consists of the removal of smaller clusters until there is one cluster remaining. In the case of equal size clusters, the one with the older Senior Member will survive. Member 1, being the oldest member, initiates the protocol: 2010-02-26 15:33:45.970/400.066 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=Cluster, member=1): An existence of a cluster island with senior Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) containing 3 nodes have been detected. Since this Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) is the senior of an older cluster island, the panic protocol is being activated to stop the other island's senior and all junior nodes that belong to it. Member 3 receives the panic: 2010-02-26 15:33:45.803/382.336 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=3): Received panic from senior Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) caused by Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) Member 4, the senior member of the younger cluster, receives the kill message from Member 3: 2010-02-26 15:33:44.921/367.468 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=4): Received a Kill message from a valid Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer); stopping cluster service. In turn, Member 4 requests the departure of its junior members 5 and 6: 2010-02-26 15:33:44.921/367.468 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=4): Received a Kill message from a valid Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer); stopping cluster service. 2010-02-26 15:33:43.343/349.015 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=6): Received a Kill message from a valid Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer); stopping cluster service. Once Members 4, 5, and 6 restart, they rejoin the original cluster with senior member 1. The log below is from Member 4. Note that it receives a different member id when it rejoins the cluster. 2010-02-26 15:33:44.921/367.468 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=4): Received a Kill message from a valid Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer); stopping cluster service. 2010-02-26 15:33:46.921/369.468 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Service Cluster left the cluster 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Invocation:InvocationService, member=4): Service InvocationService left the cluster 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=OptimisticCache, member=4): Service OptimisticCache left the cluster 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=ReplicatedCache, member=4): Service ReplicatedCache left the cluster 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=DistributedCache, member=4): Service DistributedCache left the cluster 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Invocation:Management, member=4): Service Management left the cluster 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service Management with senior member 5 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service DistributedCache with senior member 5 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service ReplicatedCache with senior member 5 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service OptimisticCache with senior member 5 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service InvocationService with senior member 5 2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member(Id=6, Timestamp=2010-02-26 15:33:47.046, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) left Cluster with senior member 4 2010-02-26 15:33:49.218/371.765 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=main, member=n/a): Restarting cluster 2010-02-26 15:33:49.421/371.968 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Service Cluster joined the cluster with senior service member n/a 2010-02-26 15:33:49.625/372.172 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=n/a): This Member(Id=5, Timestamp=2010-02-26 15:33:50.499, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=1) joined cluster "cluster:0xDDEB" with senior Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2) Cool isn't it?

    Read the article

  • Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 4)

    - by hinkmond
    Here's the first sign of life of a Hexbug Spider Robot converted to become a Skynet Big Data model T-1. Yes, this is T-1 the precursor to the Cyberdyne Systems T-101 (and you know where that will lead to...) It is demonstrating a heartbeat using a simple Java SE Embedded program to drive it. See: Skynet Model T-1 Heartbeat It's alive!!! Well, almost alive. At least there's a pulse. We'll program more to its actions next, and then finally connect it to Skynet Big Data to do more advanced stuff, like hunt for Sara Connor. Java SE Embedded programming makes it simple to create the first model in the long line of T-XXX robots to take on the world. Raspberry Pi makes connecting it all together on one simple device, easy. Next post, I'll show how the wires are connected to drive the T-1 robot. Hinkmond

    Read the article

  • SQL Server 2008 cluster freezing

    - by Ed Leighton-Dick
    We have run into a strange situation in which a SQL Server 2008 single-node cluster hangs. As background, we are rebuilding a Windows Server 2003/SQL Server 2005 two-node cluster using Windows 2008 and SQL Server 2008. Here's the timeline: Evicted the passive node (server B) from the Windows 2003/SQL 2005 cluster. The active node now functions as a single-node cluster with no problems. Wiped server B's disks and installed Windows 2008 and SQL Server 2008 as a single-node cluster. Since we do not want to the two clusters to communicate yet, we left the cluster's private network "heartbeat" adapter unconfigured. The cluster comes up and functions normally. Moved all databases to the new cluster. Cluster continues to function normally. Turned off server A (old cluster) in preparation for rebuilding as the second node of the new cluster. SQL Server instance on server B (new cluster) locks up, even though it should have no knowledge of or interaction with server A. Restarted server A. SQL Server instance on server B (new cluster) immediately begins working again. Things we have tried: The new cluster's name responds to ping and NETBIOS requests, even while the SQL Server is hung. We have confirmed that no IP address is assigned to the old heartbeat adapter, and it is not pulling an IP address from DHCP. Disabling the heartbeat's network card has the same effect. No errors were generated in any logs - Windows or SQL. When the error first occurred, it sat in the hung state for quite some time (well over 10 minutes) before anyone figured out what was going on. This would seem to eliminate any sort of normal cluster timeout in which it would have been searching for the other node (even if one had been configured). Server B is running Windows 2008 SP2, fully patched, and SQL Server 2008 SP1 CU7 (10.0.2775).

    Read the article

  • o2cb thinks ocfs2 cluster is still online, and refuses to shut down

    - by Kendall
    I have a handful of OpenSuSE 11.2 servers that utilize OCFS2 volumes. I've noticed that o2cb can't figure out when the OCFS2 cluster is actually mounted. For example, when I try to shutdown o2cb, after stopping OCSF2, o2cb refuses to shutdown because it thinks OCFS2 is still up! After stopping OCFS2 I try to stop o2cb... hamguy:/dev/disk/by-label # /etc/init.d/o2cb stop Stopping O2CB cluster ocfs2: Failed Unable to stop cluster as heartbeat region still active So I check the status... hamguy:/dev/disk/by-label # /etc/init.d/o2cb status Driver for "configfs": Loaded Filesystem "configfs": Mounted Stack glue driver: Loaded Stack plugin "o2cb": Loaded Driver for "ocfs2_dlmfs": Loaded Filesystem "ocfs2_dlmfs": Mounted Checking O2CB cluster ocfs2: Online Heartbeat dead threshold = 31 Network idle timeout: 30000 Network keepalive delay: 2000 Network reconnect delay: 2000 Checking O2CB heartbeat: Active And double check OCFS2... hamguy:/dev/disk/by-label # /etc/init.d/ocfs2 status Configured OCFS2 mountpoints: /u/conf /u/logs /u/backup /u/client /u/data /u/mdata OCFS2 is clearly down, while o2cb clearly thinks otherwise. The versions of OCFS2 and o2cb are... kendall@hamguy:~> rpm -qa |grep ocfs2 ocfs2console-1.4.1-25.6.x86_64 ocfs2-tools-o2cb-1.4.1-25.6.x86_64 ocfs2-tools-1.4.1-25.6.x86_64 kendall@hamguy:~> rpm -qa |grep o2cb ocfs2-tools-o2cb-1.4.1-25.6.x86_64 What causes this, and is there a way around it? If I try to reboot the machine, it will just sit there forever until your physically power cycle it. That obviously is a bit of a problem. Any insight is appreciated, thank you. Kendall

    Read the article

  • Clustering for Mere Mortals (Pt2)

    - by Geoff N. Hiten
    Planning. I could stop there and let that be the entirety post #2 in this series.  Planning is the single most important element in building a cluster and the Laptop Demo Cluster is no exception.  One of the more awkward parts of actually creating a cluster is coordinating information between Windows Clustering and SQL Clustering.  The dialog boxes show up hours apart, but still have to have matching and consistent information. Excel seems to be a good tool for tracking these settings.  My workbook has four pages: Systems, Storage, Network, and Service Accounts.  The systems page looks like this:   Name Role Software Location East Physical Cluster Node 1 Windows Server 2008 R2 Enterprise Laptop VM West Physical Cluster Node 2 Windows Server 2008 R2 Enterprise Laptop VM North Physical Cluster Node 3 (Future Reserved) Windows Server 2008 R2 Enterprise Laptop VM MicroCluster Cluster Management Interface N/A Laptop VM SQL01 High-Performance High-Security Instance SQL Server 2008 Enterprise Edition x64 SP1 Laptop VM SQL02 High-Performance Standard-Security Instance SQL Server 2008 Enterprise Edition x64 SP1 Laptop VM SQL03 Standard-Performance High-Security Instance SQL Server 2008 Enterprise Edition x64 SP1 Laptop VM Note that everything that has a computer name is listed here, whether physical or virtual. Storage looks like this: Storage Name Instance Purpose Volume Path Size (GB) LUN ID Speed Quorum MicroCluster Cluster Quorum Quorum Q: 2     SQL01Anchor SQL01 Instance Anchor SQL01Anchor L: 2     SQL02Anchor SQL02 Instance Anchor SQL02Anchor M: 2     SQL01Data1 SQL01 SQL Data SQL01Data1 L:\MountPoints\SQL01Data1 2     SQL02Data1 SQL02 SQL Data SQL02Data1 M:\MountPoints\SQL02Data1       Starting at the left is the name used in the storage array.  It is important to rename resources at each level, whether it is Storage, LUN, Volume, or disk folder.  Otherwise, troubleshooting things gets complex and difficult.  You want to be able to glance at a resource at any level and see where it comes from and what it is connected to. Networking is the same way:   System Network VLAN  IP Subnet Mask Gateway DNS1 DNS2 East Public Cluster1 10.97.230.x(DHCP) 255.255.255.0 10.97.230.1 10.97.230.1 10.97.230.1 East Heartbeat Cluster2   255.255.255.0       West Public Cluster1 10.97.230.x(DHCP) 255.255.255.0 10.97.230.1 10.97.230.1 10.97.230.1 West Heartbeat Cluster2   255.255.255.0       North Public Cluster1 10.97.230.x(DHCP) 255.255.255.0 10.97.230.1 10.97.230.1 10.97.230.1 North Heartbeat Cluster2   255.255.255.0       SQL01 Public Cluster1 10.97.230.x(DHCP) 255.255.255.0       SQL02 Public Cluster1 10.97.230.x(DHCP) 255.255.255.0       One hallmark of a poorly planned and implemented cluster is a bunch of "Local Network Connection #n" entries in the network settings page.  That lets me know that somebody didn't care about the long-term supportabaility of the cluster.  This can be critically important with Hyper-V Clusters and their high NIC counts.  Final page:   Instance Service Name Account Password Domain OU SQL01 SQL Server SVCSQL01 Baseline22 MicroAD Service Accounts SQL01 SQL Agent SVCSQL01 Baseline22 MicroAD Service Accounts SQL02 SQL Server SVC_SQL02 Baseline22 MicroAD Service Accounts SQL02 SQL Agent SVC_SQL02 Baseline22 MicroAD Service Accounts SQL03 (Future) SQL Server SVC_SQL03 Baseline22 MicroAD Service Accounts SQL03 (Future) SQL Agent SVC_SQL03 Baseline22 MicroAD Service Accounts             Installation Account           administrator            Yes.  I write down the account information.  I secure the file via NTFS, but I don't want to fumble around looking for passwords when it comes time to rebuild a node. Always fill out the workbook COMPLETELY before installing anything.  The whole point is to have everything you need at your fingertips before you begin.  The install experience is so much better and more productive with this information in place.

    Read the article

  • Thread resource sharing

    - by David
    I'm struggling with multi-threaded programming... I have an application that talks to an external device via a CAN to USB module. I've got the application talking on the CAN bus just fine, but there is a requirement for the application to transmit a "heartbeat" message every second. This sounds like a perfect time to use threads, so I created a thread that wakes up every second and sends the heartbeat. The problem I'm having is sharing the CAN bus interface. The heartbeat must only be sent when the bus is idle. How do I share the resource? Here is pseudo code showing what I have so far: TMainThread { Init: CanBusApi =new TCanBusApi; MutexMain =CreateMutex( "CanBusApiMutexName" ); HeartbeatThread =new THeartbeatThread( CanBusApi ); Execution: WaitForSingleObject( MutexMain ); CanBusApi->DoSomething(); ReleaseMutex( MutexMain ); } THeartbeatThread( CanBusApi ) { Init: MutexHeart =CreateMutex( "CanBusApiMutexName" ); Execution: Sleep( 1000 ); WaitForSingleObject( MutexHeart ); CanBusApi->DoHeartBeat(); ReleaseMutex( MutexHeart ); } The problem I'm seeing is that when DoHeartBeat is called, it causes the main thread to block while waiting for MutexMain as expected, but DoHeartBeat also stops. DoHeartBeat doesn't complete until after WaitForSingleObject(MutexMain) times out in failure. Does DoHeartBeat execute in the context of the MainThread or HeartBeatThread? It seems to be executing in MainThread. What am I doing wrong? Is there a better way? Thanks, David

    Read the article

  • Python print statement prints nothing with a carriage return

    - by Jonathan Sternberg
    I'm trying to write a simple tool that reads files from disc, does some image processing, and returns the result of the algorithm. Since the program can sometimes take awhile, I like to have a progress bar so I know where it is in the program. And since I don't like to clutter up my command line and I'm on a Unix platform, I wanted to use the '\r' character to print the progress bar on only one line. But when I have this code here, it prints nothing. # Files is a list with the filenames for i, f in enumerate(files): print '\r%d / %d' % (i, len(files)), # Code that takes a long time I have also tried: print '\r', i, '/', len(files), Now just to make sure this worked in python, I tried this: heartbeat = 1 while True: print '\rHello, world', heartbeat, heartbeat += 1 This code works perfectly. What's going on? My understanding of carriage returns on Linux was that it would just move the line feed character to the beginning and then I could overwrite old text that was written previously, as long as I don't print a newline anywhere. This doesn't seem to be happening though. Also, is there a better way to display a progress bar in a command line than what I'm current trying to do?

    Read the article

  • MySQL: Pacemaker cannot start the failed master as a new slave?

    - by quanta
    I'm going to setup failover for MySQL replication (1 master and 1 slave) follow this guide: https://github.com/jayjanssen/Percona-Pacemaker-Resource-Agents/blob/master/doc/PRM-setup-guide.rst Here're the output of crm configure show: node serving-6192 \ attributes p_mysql_mysql_master_IP="192.168.6.192" node svr184R-638.localdomain \ attributes p_mysql_mysql_master_IP="192.168.6.38" primitive p_mysql ocf:percona:mysql \ params config="/etc/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysql.sock" replication_user="repl" replication_passwd="x" test_user="test_user" test_passwd="x" \ op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \ op monitor interval="2s" role="Slave" timeout="30s" OCF_CHECK_LEVEL="1" \ op start interval="0" timeout="120s" \ op stop interval="0" timeout="120s" primitive writer_vip ocf:heartbeat:IPaddr2 \ params ip="192.168.6.8" cidr_netmask="32" \ op monitor interval="10s" \ meta is-managed="true" ms ms_MySQL p_mysql \ meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true" colocation writer_vip_on_master inf: writer_vip ms_MySQL:Master order ms_MySQL_promote_before_vip inf: ms_MySQL:promote writer_vip:start property $id="cib-bootstrap-options" \ dc-version="1.0.12-unknown" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ no-quorum-policy="ignore" \ stonith-enabled="false" \ last-lrm-refresh="1341801689" property $id="mysql_replication" \ p_mysql_REPL_INFO="192.168.6.192|mysql-bin.000006|338" crm_mon: Last updated: Mon Jul 9 10:30:01 2012 Stack: openais Current DC: serving-6192 - partition with quorum Version: 1.0.12-unknown 2 Nodes configured, 2 expected votes 2 Resources configured. ============ Online: [ serving-6192 svr184R-638.localdomain ] Master/Slave Set: ms_MySQL Masters: [ serving-6192 ] Slaves: [ svr184R-638.localdomain ] writer_vip (ocf::heartbeat:IPaddr2): Started serving-6192 Editing /etc/my.cnf on the serving-6192 of wrong syntax to test failover and it's working fine: svr184R-638.localdomain being promoted to become the master writer_vip switch to svr184R-638.localdomain Current state: Last updated: Mon Jul 9 10:35:57 2012 Stack: openais Current DC: serving-6192 - partition with quorum Version: 1.0.12-unknown 2 Nodes configured, 2 expected votes 2 Resources configured. ============ Online: [ serving-6192 svr184R-638.localdomain ] Master/Slave Set: ms_MySQL Masters: [ svr184R-638.localdomain ] Stopped: [ p_mysql:0 ] writer_vip (ocf::heartbeat:IPaddr2): Started svr184R-638.localdomain Failed actions: p_mysql:0_monitor_5000 (node=serving-6192, call=15, rc=7, status=complete): not running p_mysql:0_demote_0 (node=serving-6192, call=22, rc=7, status=complete): not running p_mysql:0_start_0 (node=serving-6192, call=26, rc=-2, status=Timed Out): unknown exec error Remove the wrong syntax from /etc/my.cnf on serving-6192, and restart corosync, what I would like to see is serving-6192 was started as a new slave but it doesn't: Failed actions: p_mysql:0_start_0 (node=serving-6192, call=4, rc=1, status=complete): unknown error Here're snippet of the logs which I'm suspecting: Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: rsc:p_mysql:0:4: start Jul 09 10:46:32 serving-6192 lrmd: [7321]: info: RA output: (p_mysql:0:start:stderr) Error performing operation: The object/attribute does not exist Jul 09 10:46:32 serving-6192 crm_attribute: [7420]: info: Invoked: /usr/sbin/crm_attribute -N serving-6192 -l reboot --name readable -v 0 The full logs: http://fpaste.org/AyOZ/ The strange thing is I can starting it manually: export OCF_ROOT=/usr/lib/ocf export OCF_RESKEY_config="/etc/my.cnf" export OCF_RESKEY_pid="/var/run/mysqld/mysqld.pid" export OCF_RESKEY_socket="/var/lib/mysql/mysql.sock" export OCF_RESKEY_replication_user="repl" export OCF_RESKEY_replication_passwd="x" export OCF_RESKEY_test_user="test_user" export OCF_RESKEY_test_passwd="x" sh -x /usr/lib/ocf/resource.d/percona/mysql start: http://fpaste.org/RVGh/ Did I make something wrong?

    Read the article

  • Providing high availability and failover using MySQL on EC2

    - by crb
    I would like to have a highly-available MySQL system, with automatic failover, running on Amazon EC2 instances. The standard approach to solving this is problem Heartbeat + DRBD, but I've found a lot of posts suggesting DRBD doesn't work on EC2, though none saying exactly why. Obviously, a serial heartbeat or distinct network is out of the question in the virtualised environment. It would also be good to have the different servers be in different availability zones, but we're getting into a much harder problem there. What are peoples' opinion on having a high uptime solution in "the cloud"?

    Read the article

  • Providing high availability and failover using MySQL on EC2

    - by crb
    I would like to have a highly-available MySQL system, with automatic failover, running on Amazon EC2 instances. The standard approach to solving this is problem Heartbeat + DRBD, but I've found a lot of posts suggesting DRBD doesn't work on EC2, though none saying exactly why. Obviously, a serial heartbeat or distinct network is out of the question in the virtualised environment. It would also be good to have the different servers be in different availability zones, but we're getting into a much harder problem there. What are peoples' opinion on having a high uptime solution in "the cloud"? Note: This question was asked before RDS with multi-AZ was announced, which is the nice automatic answer for today's modern IT professional. :)

    Read the article

< Previous Page | 1 2 3 4 5 6  | Next Page >