Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 10/17 | < Previous Page | 6 7 8 9 10 11 12 13 14 15 16 17 | Next Page >

Advanced queries in HBase

- by Teflon Ted

Given the following HBase schema scenario (from the official FAQ)... How would you design an Hbase table for many-to-many association between two entities, for example Student and Course? I would define two tables: Student: student id student data (name, address, ...) courses (use course ids as column qualifiers here) Course: course id course data (name, syllabus, ...) students (use student ids as column qualifiers here) This schema gives you fast access to the queries, show all classes for a student (student table, courses family), or all students for a class (courses table, students family). How would you satisfy the request: "Give me all the students that share at least two courses in common"? Can you build a "query" in HBase that will return that set, or do you have to retrieve all the pertinent data and crunch it yourself in code?

Read the article
How to pick random (small) data samples using Map/Reduce?

- by Andrei Savu

I want to write a map/reduce job to select a number of random samples from a large dataset based on a row level condition. I want to minimize the number of intermediate keys. Pseudocode: for each row if row matches condition put the row.id in the bucket if the bucket is not already large enough Have you done something like this? Is there any well known algorithm? A sample containing sequential rows is also good enough. Thanks.

Read the article
Nutch search always returns 0 results

- by darbour

I have set up nutch 1.0 on a cluster. It has been setup and has successfully crawled, I copied the crawl directory using the dfs -copyToLocal and set the value of searcher.dir in the nutch-site.xml file located in the tomcat directory to point to that directory. Still when I try to search I receive 0 results. Any help would be greatly appreciated.

Read the article
Amazon Elastic MapReduce: the number of launched map task

- by S.N

Hi, In the "syslog" for a MapReduce job flow step, I see the following: Job Counters Launched reduce tasks=4 Launched map tasks=39 Does the number of launched map tasks include failed tasks? I am using NLineInputFormat class as input format to manage the number of map tasks. However, I get slightly different numbers for exact same input occasionally, or depending on the number of instances (10, 15, and 20). Can anyone tell me why I am seeing different number of tasks launched?

Read the article
Control the MultipleOutputFormat files sub-path

- by iCode

I need to control the sub-path of the different different files being managed by MultipleOutputFormat based on the reducer key. I basically want to set the sub path of the file based on the key given to the reducer. I can changed the file name by overwrting the generateFileNameForKeyValue method of MultipleOutputFormatbut how can I also change the sub-path of these files? I mean with just overriding the generateFileNameForKeyValue, I get mySetJobConfigOutputPath/fileNameBasedKey1.dat /fileNameBasedKey2.dat /fileNameBasedKey3.dat ... but I want to make it to be organize files like below mySetJobConfigOutputPath/path0ConfiguredInsideReducerBasedOnKey/fileNameBasedKey1.dat /path1ConfiguredInsideReducerBasedOnKey/fileNameBasedKey2.dat /fileNameBasedKey3.dat /path2ConfiguredInsideReducerBasedOnKey/fileNameBasedKey8.dat as seen, the sub-path and the file name are both figured out by the key inside the reducer. I know how to configure the file name but was wondering if I can configure the sub-path of the each file under the mySetJobConfigOutputPath folder?

Read the article
The Buzz at the JavaOne Bookstore

- by Janice J. Heiss

I found my way to the JavaOne bookstore, a hub of activity. Who says brick and mortar bookstores are dead? I asked what was hot and got two answers: Hadoop in Practice by Alex Holmes was doing well. And Scala for the Impatient by noted Java Champion Cay Horstmann also seemed to be a fast seller. Hadoop in PracticeHadoop is a framework that organizes large clusters of computers around a problem. It is touted as especially effective for large amounts of data, and is use such companies as Facebook, Yahoo, Apple, eBay and LinkedIn. Hadoop in Practice collects nearly 100 Hadoop examples and presents them in a problem/solution format with step by step explanations of solutions and designs. It’s very much a participatory book intended to make developers more at home with Hadoop.The author, Alex Holmes, is a senior software engineer with more than 15 years of experience developing large-scale distributed Java systems. For the last four years, he has gained expertise in Hadoop solving Big Data problems across a number of projects. He has presented at JavaOne and Jazoon and is currently a technical lead at VeriSign.At this year’s JavaOne, he is presenting a session with VeriSign colleague, Karthik Shyamsunder called “Java: A Perfect Platform for Data Science” where they will explain how the Java platform has emerged as a perfect platform for practicing data science, and also talk about such technologies as Hadoop, Hive, Pig, HBase, Cassandra, and Mahout. Scala for the ImpatientSan Jose State University computer science professor and Java Champion Cay Horstmann is the principal author of the highly regarded Core Java. Scala for the Impatient is a basic, practical introduction to Scala for experienced programmers. Horstmann has a presentation summarizing the themes of his book on at his website. On the final page he offers an enticing summary of his conclusions:* Widespread dissatisfaction with Java + XML + IDEs --Don't make me eat Elephant again * A separate language for every problem domain is not efficient --It takes time to master the idioms* ”JavaScript Everywhere” isn't going to scale* Trend is towards languages with more expressive power, less boilerplate* Will Scala be the “one ring to rule them”?* Maybe --If it succeeds in industry --If student-friendly subsets and tools are created The popularity of both books echoed comments by IBM Distinguished Engineer Jason McGee who closed his part of the Sunday JavaOne keynote by pointing out that the use of Java in complex applications is increasingly being augmented by a host of other languages with strong communities around them – JavaScript, JRuby, Scala, Python and so forth. Java developers increasingly must know the strengths and weaknesses of such languages going forward.

Read the article
Using Hadooop (HDInsight) with Microsoft - Two (OK, Three) Options

- by BuckWoody

Microsoft has many tools for “Big Data”. In fact, you need many tools – there’s no product called “Big Data Solution” in a shrink-wrapped box – if you find one, you probably shouldn’t buy it. It’s tempting to want a single tool that handles everything in a problem domain, but with large, complex data, that isn’t a reality. You’ll mix and match several systems, open and closed source, to solve a given problem. But there are tools that help with handling data at large, complex scales. Normally the best way to do this is to break up the data into parts, and then put the calculation engines for that chunk of data right on the node where the data is stored. These systems are in a family called “Distributed File and Compute”. Microsoft has a couple of these, including the High Performance Computing edition of Windows Server. Recently we partnered with Hortonworks to bring the Apache Foundation’s release of Hadoop to Windows. And as it turns out, there are actually two (technically three) ways you can use it. (There’s a more detailed set of information here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx, I’ll cover the options at a general level below) First Option: Windows Azure HDInsight Service Your first option is that you can simply log on to a Hadoop control node and begin to run Pig or Hive statements against data that you have stored in Windows Azure. There’s nothing to set up (although you can configure things where needed), and you can send the commands, get the output of the job(s), and stop using the service when you are done – and repeat the process later if you wish. (There are also connectors to run jobs from Microsoft Excel, but that’s another post) This option is useful when you have a periodic burst of work for a Hadoop workload, or the data collection has been happening into Windows Azure storage anyway. That might be from a web application, the logs from a web application, telemetrics (remote sensor input), and other modes of constant collection. You can read more about this option here: http://blogs.msdn.com/b/windowsazure/archive/2012/10/24/getting-started-with-windows-azure-hdinsight-service.aspx Second Option: Microsoft HDInsight Server Your second option is to use the Hadoop Distribution for on-premises Windows called Microsoft HDInsight Server. You set up the Name Node(s), Job Tracker(s), and Data Node(s), among other components, and you have control over the entire ecostructure. This option is useful if you want to have complete control over the system, leave it running all the time, or you have a huge quantity of data that you have to bulk-load constantly – something that isn’t going to be practical with a network transfer or disk-mailing scheme. You can read more about this option here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx Third Option (unsupported): Installation on Windows Azure Virtual Machines Although unsupported, you could simply use a Windows Azure Virtual Machine (we support both Windows and Linux servers) and install Hadoop yourself – it’s open-source, so there’s nothing preventing you from doing that. Aside from being unsupported, there are other issues you’ll run into with this approach – primarily involving performance and the amount of configuration you’ll need to do to access the data nodes properly. But for a single-node installation (where all components run on one system) such as learning, demos, training and the like, this isn’t a bad option. Did I mention that’s unsupported? :) You can learn more about Windows Azure Virtual Machines here: http://www.windowsazure.com/en-us/home/scenarios/virtual-machines/ And more about Hadoop and the installation/configuration (on Linux) here: http://en.wikipedia.org/wiki/Apache_Hadoop And more about the HDInsight installation here: http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW Choosing the right option Since you have two or three routes you can go, the best thing to do is evaluate the need you have, and place the workload where it makes the most sense. My suggestion is to install the HDInsight Server locally on a test system, and play around with it. Read up on the best ways to use Hadoop for a given workload, understand the parts, write a little Pig and Hive, and get your feet wet. Then sign up for a test account on HDInsight Service, and see how that leverages what you know. If you're a true tinkerer, go ahead and try the VM route as well. Oh - there’s another great reference on the Windows Azure HDInsight that just came out, here: http://blogs.msdn.com/b/brunoterkaly/archive/2012/11/16/hadoop-on-azure-introduction.aspx

Read the article
Know your Data Lineage

- by Simon Elliston Ball

An academic paper without the footnotes isn’t an academic paper. Journalists wouldn’t base a news article on facts that they can’t verify. So why would anyone publish reports without being able to say where the data has come from and be confident of its quality, in other words, without knowing its lineage. (sometimes referred to as ‘provenance’ or ‘pedigree’) The number and variety of data sources, both traditional and new, increases inexorably. Data comes clean or dirty, processed or raw, unimpeachable or entirely fabricated. On its journey to our report, from its source, the data can travel through a network of interconnected pipes, passing through numerous distinct systems, each managed by different people. At each point along the pipeline, it can be changed, filtered, aggregated and combined. When the data finally emerges, how can we be sure that it is right? How can we be certain that no part of the data collection was based on incorrect assumptions, that key data points haven’t been left out, or that the sources are good? Even when we’re using data science to give us an approximate or probable answer, we cannot have any confidence in the results without confidence in the data from which it came. You need to know what has been done to your data, where it came from, and who is responsible for each stage of the analysis. This information represents your data lineage; it is your stack-trace. If you’re an analyst, suspicious of a number, it tells you why the number is there and how it got there. If you’re a developer, working on a pipeline, it provides the context you need to track down the bug. If you’re a manager, or an auditor, it lets you know the right things are being done. Lineage tracking is part of good data governance. Most audit and lineage systems require you to buy into their whole structure. If you are using Hadoop for your data storage and processing, then tools like Falcon allow you to track lineage, as long as you are using Falcon to write and run the pipeline. It can mean learning a new way of running your jobs (or using some sort of proxy), and even a distinct way of writing your queries. Other Hadoop tools provide a lot of operational and audit information, spread throughout the many logs produced by Hive, Sqoop, MapReduce and all the various moving parts that make up the eco-system. To get a full picture of what’s going on in your Hadoop system you need to capture both Falcon lineage and the data-exhaust of other tools that Falcon can’t orchestrate. However, the problem is bigger even that that. Often, Hadoop is just one piece in a larger processing workflow. The next step of the challenge is how you bind together the lineage metadata describing what happened before and after Hadoop, where ‘after’ could be a data analysis environment like R, an application, or even directly into an end-user tool such as Tableau or Excel. One possibility is to push as much as you can of your key analytics into Hadoop, but would you give up the power, and familiarity of your existing tools in return for a reliable way of tracking lineage? Lineage and auditing should work consistently, automatically and quietly, allowing users to access their data with any tool they require to use. The real solution, therefore, is to create a consistent method by which to bring lineage data from these data various disparate sources into the data analysis platform that you use, rather than being forced to use the tool that manages the pipeline for the lineage and a different tool for the data analysis. The key is to keep your logs, keep your audit data, from every source, bring them together and use the data analysis tools to trace the paths from raw data to the answer that data analysis provides.

Read the article
Talking JavaOne with Rock Star Raghavan Srinivas

- by Janice J. Heiss

Raghavan Srinivas, affectionately known as “Rags,” is a two-time JavaOne Rock Star (from 2005 and 2011) who, as a Developer Advocate at Couchbase, gets his hands dirty with emerging technology directions and trends. His general focus is on distributed systems, with a specialization in cloud computing. He worked on Hadoop and HBase during its early stages, has spoken at conferences world-wide on a variety of technical topics, conducted and organized Hands-on Labs and taught graduate classes.He has 20 years of hands-on software development and over 10 years of architecture and technology evangelism experience and has worked for Digital Equipment Corporation, Sun Microsystems, Intuit and Accenture. He has evangelized and influenced the architecture of numerous technologies including the early releases of JavaFX, Java, Java EE, Java and XML, Java ME, AJAX and Web 2.0, and Java Security.Rags will be giving these sessions at JavaOne 2012: CON3570 -- Autosharding Enterprise to Social Gaming Applications with NoSQL and Couchbase CON3257 -- Script Bowl 2012: The Battle of the JVM-Based Languages (with Guillaume Laforge, Aaron Bedra, Dick Wall, and Dr Nic Williams) Rags emphasized the importance of the Cloud: “The Cloud and the Big Data are popular technologies not merely because they are trendy, but, largely due to the fact that it's possible to do massive data mining and use that information for business advantage,” he explained. I asked him what we should know about Hadoop. “Hadoop,” he remarked, “is mainly about using commodity hardware and achieving unprecedented scalability. At the heart of all this is the Java Virtual Machine which is running on each of these nodes. The vision of taking the processing to where the data resides is made possible by Java and Hadoop.” And the most exciting thing happening in the world of Java today? “I read recently that Java projects on github.com are just off the charts when compared to other projects. It's exciting to realize the robust growth of Java and the degree of collaboration amongst Java programmers.” He encourages Java developers to take advantage of Java 7 for Mac OS X which is now available for download. At the same time, he also encourages us to read the caveats. Originally published on blogs.oracle.com/javaone.

Read the article
Talking JavaOne with Rock Star Raghavan Srinivas

- by Janice J. Heiss

Raghavan Srinivas, affectionately known as “Rags,” is a two-time JavaOne Rock Star (from 2005 and 2011) who, as a Developer Advocate at Couchbase, gets his hands dirty with emerging technology directions and trends. His general focus is on distributed systems, with a specialization in cloud computing. He worked on Hadoop and HBase during its early stages, has spoken at conferences world-wide on a variety of technical topics, conducted and organized Hands-on Labs and taught graduate classes.He has 20 years of hands-on software development and over 10 years of architecture and technology evangelism experience and has worked for Digital Equipment Corporation, Sun Microsystems, Intuit and Accenture. He has evangelized and influenced the architecture of numerous technologies including the early releases of JavaFX, Java, Java EE, Java and XML, Java ME, AJAX and Web 2.0, and Java Security.Rags will be giving these sessions at JavaOne 2012: CON3570 -- Autosharding Enterprise to Social Gaming Applications with NoSQL and Couchbase CON3257 -- Script Bowl 2012: The Battle of the JVM-Based Languages (with Guillaume Laforge, Aaron Bedra, Dick Wall, and Dr Nic Williams) Rags emphasized the importance of the Cloud: “The Cloud and the Big Data are popular technologies not merely because they are trendy, but, largely due to the fact that it's possible to do massive data mining and use that information for business advantage,” he explained. I asked him what we should know about Hadoop. “Hadoop,” he remarked, “is mainly about using commodity hardware and achieving unprecedented scalability. At the heart of all this is the Java Virtual Machine which is running on each of these nodes. The vision of taking the processing to where the data resides is made possible by Java and Hadoop.” And the most exciting thing happening in the world of Java today? “I read recently that Java projects on github.com are just off the charts when compared to other projects. It's exciting to realize the robust growth of Java and the degree of collaboration amongst Java programmers.” He encourages Java developers to take advantage of Java 7 for Mac OS X which is now available for download. At the same time, he also encourages us to read the caveats.

Read the article
Why Oracle Data Integrator for Big Data?

- by Mala Narasimharajan

Big Data is everywhere these days - but what exactly is it? It’s data that comes from a multitude of sources – not only structured data, but unstructured data as well. The sheer volume of data is mindboggling – here are a few examples of big data: climate information collected from sensors, social media information, digital pictures, log files, online video files, medical records or online transaction records. These are just a few examples of what constitutes big data. Embedded in big data is tremendous value and being able to manipulate, load, transform and analyze big data is key to enhancing productivity and competitiveness. The value of big data lies in its propensity for greater in-depth analysis and data segmentation -- in turn giving companies detailed information on product performance, customer preferences and inventory. Furthermore, by being able to store and create more data in digital form, “big data can unlock significant value by making information transparent and usable at much higher frequency." (McKinsey Global Institute, May 2011) Oracle's flagship product for bulk data movement and transformation, Oracle Data Integrator, is a critical component of Oracle’s Big Data strategy. ODI provides automation, bulk loading, and validation and transformation capabilities for Big Data while minimizing the complexities of using Hadoop. Specifically, the advantages of ODI in a Big Data scenario are due to pre-built Knowledge Modules that drive processing in Hadoop. This leverages the graphical UI to load and unload data from Hadoop, perform data validations and create mapping expressions for transformations. The Knowledge Modules provide a key jump-start and eliminate a significant amount of Hadoop development. Using Oracle Data Integrator together with Oracle Big Data Connectors, you can simplify the complexities of mapping, accessing, and loading big data (via NoSQL or HDFS) but also correlating your enterprise data – this correlation may require integrating across heterogeneous and standards-based environments, connecting to Oracle Exadata, or sourcing via a big data platform such as Oracle Big Data Appliance. To learn more about Oracle Data Integration and Big Data, download our resource kit to see the latest in whitepapers, webinars, downloads, and more… or go to our website on www.oracle.com/bigdata

Read the article
Cascading S3 Sink Tap not being deleted with SinkMode.REPLACE

- by Eric Charles

We are running Cascading with a Sink Tap being configured to store in Amazon S3 and were facing some FileAlreadyExistsException (see [1]). This was only from time to time (1 time on around 100) and was not reproducable. Digging into the Cascading codem, we discovered the Hfs.deleteResource() is called (among others) by the BaseFlow.deleteSinksIfNotUpdate(). Btw, we were quite intrigued with the silent NPE (with comment "hack to get around npe thrown when fs reaches root directory"). From there, we extended the Hfs tap with our own Tap to add more action in the deleteResource() method (see [2]) with a retry mechanism calling directly the getFileSystem(conf).delete. The retry mechanism seemed to bring improvement, but we are still sometimes facing failures (see example in [3]): it sounds like HDFS returns isDeleted=true, but asking directly after if the folder exists, we receive exists=true, which should not happen. Logs also shows randomly isDeleted true or false when the flow succeeds, which sounds like the returned value is irrelevant or not to be trusted. Can anybody bring his own S3 experience with such a behavior: "folder should be deleted, but it is not"? We suspect a S3 issue, but could it also be in Cascading or HDFS? We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev. [1] org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132) at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856) at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104) at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.j [2] @Override public boolean deleteResource(JobConf conf) throws IOException { LOGGER.info("Deleting resource {}", getIdentifier()); boolean isDeleted = super.deleteResource(conf); LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted, getIdentifier()); Path path = new Path(getIdentifier()); int retryCount = 0; int cumulativeSleepTime = 0; int sleepTime = 1000; while (getFileSystem(conf).exists(path)) { LOGGER .info( "Resource {} still exists, it should not... - I will continue to wait patiently...", getIdentifier()); try { LOGGER.info("Now I will sleep " + sleepTime / 1000 + " seconds while trying to delete {} - attempt: {}", getIdentifier(), retryCount + 1); Thread.sleep(sleepTime); cumulativeSleepTime += sleepTime; sleepTime *= 2; } catch (InterruptedException e) { e.printStackTrace(); LOGGER .error( "Interrupted while sleeping trying to delete {} with message {}...", getIdentifier(), e.getMessage()); throw new RuntimeException(e); } if (retryCount == 0) { getFileSystem(conf).delete(getPath(), true); } retryCount++; if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) { break; } } if (getFileSystem(conf).exists(path)) { LOGGER .error( "We didn't succeed to delete the resource {}. Throwing now a runtime exception.", getIdentifier()); throw new RuntimeException( "Although we waited to delete the resource for " + getIdentifier() + ' ' + retryCount + " iterations, it still exists - This must be an issue in the underlying storage system."); } return isDeleted; } [3] INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969 INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1 INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://... INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://... ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception. WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ... java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system. at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179) at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40) at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971) at cascading.flow.BaseFlow.prepare(BaseFlow.java:733) at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761) at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619)

Read the article
Oracle R Enterprise Tutorial Series on Oracle Learning Library

- by mhornick

Oracle Server Technologies Curriculum has just released the Oracle R Enterprise Tutorial Series, which is publicly available on Oracle Learning Library (OLL). This 8 part interactive lecture series with review sessions covers Oracle R Enterprise 1.1 and an introduction to Oracle R Connector for Hadoop 1.1: Introducing Oracle R Enterprise Getting Started with ORE R Language Basics Producing Graphs in R The ORE Transparency Layer ORE Embedded R Scripts: R Interface ORE Embedded R Scripts: SQL Interface Using the Oracle R Connector for Hadoop We encourage you to download Oracle software for evaluation from the Oracle Technology Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop. As always, we welcome comments and questions on the Oracle R Forum.

Read the article
Big Data – Buzz Words: What is HDFS – Day 8 of 21

- by Pinal Dave

In yesterday’s blog post we learned what is MapReduce. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – HDFS. What is HDFS ? HDFS stands for Hadoop Distributed File System and it is a primary storage system used by Hadoop. It provides high performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware. In commodity hardware deployment server failures are very common. Due to the same reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which leads to reduced risk of failure. HDFS creates smaller pieces of the big data and distributes it on different nodes. It also copies each smaller piece to multiple times on different nodes. Hence when any node with the data crashes the system is automatically able to use the data from a different node and continue the process. This is the key feature of the HDFS system. Architecture of HDFS The architecture of the HDFS is master/slave architecture. An HDFS cluster always consists of single NameNode. This single NameNode is a master server and it manages the file system as well regulates access to various files. In additional to NameNode there are multiple DataNodes. There is always one DataNode for each data server. In HDFS a big file is split into one or more blocks and those blocks are stored in a set of DataNodes. The primary task of the NameNode is to open, close or rename files and directory and regulate access to the file system, whereas the primary task of the DataNode is read and write to the file systems. DataNode is also responsible for the creation, deletion or replication of the data based on the instruction from NameNode. In reality, NameNode and DataNode are software designed to run on commodity machine build in Java language. Visual Representation of HDFS Architecture Let us understand how HDFS works with the help of the diagram. Client APP or HDFS Client connects to NameSpace as well as DataNode. Client App access to the DataNode is regulated by NameSpace Node. NameSpace Node allows Client App to connect to the DataNode based by allowing the connection to the DataNode directly. A big data file is divided into multiple data blocks (let us assume that those data chunks are A,B,C and D. Client App will later on write data blocks directly to the DataNode. Client App does not have to directly write to all the node. It just has to write to any one of the node and NameNode will decide on which other DataNode it will have to replicate the data. In our example Client App directly writes to DataNode 1 and detained 3. However, data chunks are automatically replicated to other nodes. All the information like in which DataNode which data block is placed is written back to NameNode. High Availability During Disaster Now as multiple DataNode have same data blocks in the case of any DataNode which faces the disaster, the entire process will continue as other DataNode will assume the role to serve the specific data block which was on the failed node. This system provides very high tolerance to disaster and provides high availability. If you notice there is only single NameNode in our architecture. If that node fails our entire Hadoop Application will stop performing as it is a single node where we store all the metadata. As this node is very critical, it is usually replicated on another clustered as well as on another data rack. Though, that replicated node is not operational in architecture, it has all the necessary data to perform the task of the NameNode in the case of the NameNode fails. The entire Hadoop architecture is built to function smoothly even there are node failures or hardware malfunction. It is built on the simple concept that data is so big it is impossible to have come up with a single piece of the hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data and hardware failure is part of the commodity servers. To reduce the impact of hardware failure Hadoop architecture is built to overcome the limitation of the non-functioning hardware. Tomorrow In tomorrow’s blog post we will discuss the importance of the relational database in Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21

- by Pinal Dave

In yesterday’s blog post we learned the importance of the operational database in Big Data Story. In this article we will understand what is Hive and HQL in Big Data Story. Yahoo started working on PIG (we will understand that in the next blog post) for their application deployment on Hadoop. The goal of Yahoo to manage their unstructured data. Similarly Facebook started deploying their warehouse solutions on Hadoop which has resulted in HIVE. The reason for going with HIVE is because the traditional warehousing solutions are getting very expensive. What is HIVE? Hive is a datawarehouseing infrastructure for Hadoop. The primary responsibility is to provide data summarization, query and analysis. It supports analysis of large datasets stored in Hadoop’s HDFS as well as on the Amazon S3 filesystem. The best part of HIVE is that it supports SQL-Like access to structured data which is known as HiveQL (or HQL) as well as big data analysis with the help of MapReduce. Hive is not built to get a quick response to queries but it it is built for data mining applications. Data mining applications can take from several minutes to several hours to analysis the data and HIVE is primarily used there. HIVE Organization The data are organized in three different formats in HIVE. Tables: They are very similar to RDBMS tables and contains rows and tables. Hive is just layered over the Hadoop File System (HDFS), hence tables are directly mapped to directories of the filesystems. It also supports tables stored in other native file systems. Partitions: Hive tables can have more than one partition. They are mapped to subdirectories and file systems as well. Buckets: In Hive data may be divided into buckets. Buckets are stored as files in partition in the underlying file system. Hive also has metastore which stores all the metadata. It is a relational database containing various information related to Hive Schema (column types, owners, key-value data, statistics etc.). We can use MySQL database over here. What is HiveSQL (HQL)? Hive query language provides the basic SQL like operations. Here are few of the tasks which HQL can do easily. Create and manage tables and partitions Support various Relational, Arithmetic and Logical Operators Evaluate functions Download the contents of a table to a local directory or result of queries to HDFS directory Here is the example of the HQL Query: SELECT upper(name), salesprice FROM sales; SELECT category, count(1) FROM products GROUP BY category; When you look at the above query, you can see they are very similar to SQL like queries. Tomorrow In tomorrow’s blog post we will discuss about very important components of the Big Data Ecosystem – Pig. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Big Data Matters with ODI12c

- by Madhu Nair

contributed by Mike Eisterer On October 17th, 2013, Oracle announced the release of Oracle Data Integrator 12c (ODI12c). This release signifies improvements to Oracle’s Data Integration portfolio of solutions, particularly Big Data integration. Why Big Data = Big Business Organizations are gaining greater insights and actionability through increased storage, processing and analytical benefits offered by Big Data solutions. New technologies and frameworks like HDFS, NoSQL, Hive and MapReduce support these benefits now. As further data is collected, analytical requirements increase and the complexity of managing transformations and aggregations of data compounds and organizations are in need for scalable Data Integration solutions. ODI12c provides enterprise solutions for the movement, translation and transformation of information and data heterogeneously and in Big Data Environments through: The ability for existing ODI and SQL developers to leverage new Big Data technologies. A metadata focused approach for cataloging, defining and reusing Big Data technologies, mappings and process executions. Integration between many heterogeneous environments and technologies such as HDFS and Hive. Generation of Hive Query Language. Working with Big Data using Knowledge Modules ODI12c provides developers with the ability to define sources and targets and visually develop mappings to effect the movement and transformation of data. As the mappings are created, ODI12c leverages a rich library of prebuilt integrations, known as Knowledge Modules (KMs). These KMs are contextual to the technologies and platforms to be integrated. Steps and actions needed to manage the data integration are pre-built and configured within the KMs. The Oracle Data Integrator Application Adapter for Hadoop provides a series of KMs, specifically designed to integrate with Big Data Technologies. The Big Data KMs include: Check Knowledge Module Reverse Engineer Knowledge Module Hive Transform Knowledge Module Hive Control Append Knowledge Module File to Hive (LOAD DATA) Knowledge Module File-Hive to Oracle (OLH-OSCH) Knowledge Module Nothing to beat an Example: To demonstrate the use of the KMs which are part of the ODI Application Adapter for Hadoop, a mapping may be defined to move data between files and Hive targets. The mapping is defined by dragging the source and target into the mapping, performing the attribute (column) mapping (see Figure 1) and then selecting the KM which will govern the process. In this mapping example, movie data is being moved from an HDFS source into a Hive table. Some of the attributes, such as “CUSTID to custid”, have been mapped over. Figure 1 Defining the Mapping Before the proper KM can be assigned to define the technology for the mapping, it needs to be added to the ODI project. The Big Data KMs have been made available to the project through the KM import process. Generally, this is done prior to defining the mapping. Figure 2 Importing the Big Data Knowledge Modules Following the import, the KMs are available in the Designer Navigator. v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} Normal 0 false false false EN-US ZH-TW X-NONE MicrosoftInternetExplorer4 /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Calibri","sans-serif"; mso-bidi-font-family:"Times New Roman";} Figure 3 The Project View in Designer, Showing Installed IKMs Once the KM is imported, it may be assigned to the mapping target. This is done by selecting the Physical View of the mapping and examining the Properties of the Target. In this case MOVIAPP_LOG_STAGE is the target of our mapping. Figure 4 Physical View of the Mapping and Assigning the Big Data Knowledge Module to the Target Alternative KMs may have been selected as well, providing flexibility and abstracting the logical mapping from the physical implementation. Our mapping may be applied to other technologies as well. The mapping is now complete and is ready to run. We will see more in a future blog about running a mapping to load Hive. To complete the quick ODI for Big Data Overview, let us take a closer look at what the IKM File to Hive is doing for us. ODI provides differentiated capabilities by defining the process and steps which normally would have to be manually developed, tested and implemented into the KM. As shown in figure 5, the KM is preparing the Hive session, managing the Hive tables, performing the initial load from HDFS and then performing the insert into Hive. HDFS and Hive options are selected graphically, as shown in the properties in Figure 4. Figure 5 Process and Steps Managed by the KM What’s Next Big Data being the shape shifting business challenge it is is fast evolving into the deciding factor between market leaders and others. Now that an introduction to ODI and Big Data has been provided, look for additional blogs coming soon using the Knowledge Modules which make up the Oracle Data Integrator Application Adapter for Hadoop: Importing Big Data Metadata into ODI, Testing Data Stores and Loading Hive Targets Generating Transformations using Hive Query language Loading Oracle from Hadoop Sources For more information now, please visit the Oracle Data Integrator Application Adapter for Hadoop web site, http://www.oracle.com/us/products/middleware/data-integration/hadoop/overview/index.html Do not forget to tune in to the ODI12c Executive Launch webcast on the 12th to hear more about ODI12c and GG12c. Normal 0 false false false EN-US ZH-TW X-NONE MicrosoftInternetExplorer4 /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Calibri","sans-serif"; mso-bidi-font-family:"Times New Roman";}

Read the article
OTN Virtual Technology Summit - July 9 - Middleware Track

- by OTN ArchBeat

The Architecture of Analytics: Big Time Big Data and Business Intelligence This four-session track, part of the free OTN Virtual Technology Summit on July 9, will present a solution architect's perspective on how business intelligence products in Oracle's Fusion Middleware family and beyond fit into an effective big data architecture, offering insight and expertise from Oracle ACE Directors and product team experts specializing in business Intelligence to help you meet your big data business intelligence challenges. Register now! Sessions Oracle Big Data Appliance Case Study: Using Big Data to Analyze Cancer-Genome Relationships Tom Plunkett, Lead Author of the Oracle Big Data Handbook What does it take to build an award winning Big Data solution? This presentation takes a deep technical dive into the use of the Oracle Big Data Appliance in a project for the National Cancer Institute's Frederick National Laboratory for Cancer Research. The Frederick National Laboratory and the Oracle team won several awards for analyzing relationships between genomes and cancer subtypes with big data, including the 2012 Government Big Data Solutions Award, the 2013 Excellence.Gov Finalist for Innovation, and the 2013 ComputerWorld Honors Laureate for Innovation. [30 mins] Getting Value from Big Data Variety Richard Tomlinson, Director, Product Management, Oracle Big data variety implies big data complexity. Performing analytics on diverse data typically involves mashing up structured, semi-structured and unstructured content. So how can we do this effectively to get real value? How do we relate diverse content so we can start to analyze it? This session looks at how we approach this tricky problem using Endeca Information Discovery. [30 mins] How To Leverage Your Investment In Oracle Business Intelligence Enterprise Edition Within a Big Data Architecture Oracle ACE Director Kevin McGinley More and more organizations are realizing the value Big Data technologies contribute to the return on investment in Analytics. But as an increasing variety of data types reside in different data stores, organizations are finding that a unified Analytics layer can help bridge the divide in modern data architectures. This session will examine how you can enable Oracle Business Intelligence Enterprise Edition (OBIEE) to play a role in a unified Analytics layer and the benefits and use cases for doing so. [30 mins] Oracle Data Integrator 12c As Your Big Data Data Integration Hub Oracle ACE Director Mark Rittman Oracle Data Integrator 12c (ODI12c), as well as being able to integrate and transform data from application and database data sources, also has the ability to load, transform and orchestrate data loads to and from Big Data sources. In this session, we'll look at ODI12c's ability to load data from Hadoop, Hive, NoSQL and file sources, transform that data using Hive and MapReduce processing across the Hadoop cluster, and then bulk-load that data into an Oracle Data Warehouse using Oracle Big Data Connectors. We will also look at how ODI12c enables ETL-offloading to a Hadoop cluster, with some tips and techniques on real-time capture into a Hadoop data reservoir and techniques and limitations when performing ETL on big data sources. [90 mins] Register now!

Read the article
Big Data – Various Learning Resources – How to Start with Big Data? – Day 20 of 21

- by Pinal Dave

In yesterday’s blog post we learned how to become a Data Scientist for Big Data. In this article we will go over various learning resources related to Big Data. In this series we have covered many of the most essential details about Big Data. At the beginning of this series, I have encouraged readers to send me questions. One of the most popular questions is - “I want to learn more about Big Data. Where can I learn it?” This is indeed a great question as there are plenty of resources out to learn about Big Data and it is indeed difficult to select on one resource to learn Big Data. Hence I decided to write here a few of the very important resources which are related to Big Data. Learn from Pluralsight Pluralsight is a global leader in high-quality online training for hardcore developers. It has fantastic Big Data Courses and I started to learn about Big Data with the help of Pluralsight. Here are few of the courses which are directly related to Big Data. Big Data: The Big Picture Big Data Analytics with Tableau NoSQL: The Big Picture Understanding NoSQL Data Analysis Fundamentals with Tableau I encourage all of you start with this video course as they are fantastic fundamentals to learn Big Data. Learn from Apache Resources at Apache are single point the most authentic learning resources. If you want to learn fundamentals and go deep about every aspect of the Big Data, I believe you must understand various concepts in Apache’s library. I am pretty impressed with the documentation and I am personally referencing it every single day when I work with Big Data. I strongly encourage all of you to bookmark following all the links for authentic big data learning. Haddop - The Apache Hadoop® project develops open-source software for reliable, scalable, distributed computing. Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which include support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heat maps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner. Avro: A data serialization system. Cassandra: A scalable multi-master database with no single points of failure. Chukwa: A data collection system for managing large distributed systems. HBase: A scalable, distributed database that supports structured data storage for large tables. Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout: A Scalable machine learning and data mining library. Pig: A high-level data-flow language and execution framework for parallel computation. ZooKeeper: A high-performance coordination service for distributed applications. Learn from Vendors One of the biggest issues with about learning Big Data is setting up the environment. Every Big Data vendor has different environment request and there are lots of things require to set up Big Data framework. Many of the users do not start with Big Data as they are afraid about the resources required to set up framework as well as a time commitment. Here Hortonworks have created fantastic learning environment. They have created Sandbox with everything one person needs to learn Big Data and also have provided excellent tutoring along with it. Sandbox comes with a dozen hands-on tutorial that will guide you through the basics of Hadoop as well it contains the Hortonworks Data Platform. I think Hortonworks did a fantastic job building this Sandbox and Tutorial. Though there are plenty of different Big Data Vendors I have decided to list only Hortonworks due to their unique setup. Please leave a comment if there are any other such platform to learn Big Data. I will include them over here as well. Learn from Books There are indeed few good books out there which one can refer to learn Big Data. Here are few good books which I have read. I will update the list as I will learn more. Ethics of Big Data Balancing Risk and Innovation Big Data for Dummies Head First Data Analysis: A Learner’s Guide to Big Numbers, Statistics, and Good Decisions If you search on Amazon there are millions of the books but I think above three books are a great set of books and it will give you great ideas about Big Data. Once you go through above books, you will have a clear idea about what is the next step you should follow in this series. You will be capable enough to make the right decision for yourself. Tomorrow In tomorrow’s blog post we will wrap up this series of Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Windows Azure: Import/Export Hard Drives, VM ACLs, Web Sockets, Remote Debugging, Continuous Delivery, New Relic, Billing Alerts and More

- by ScottGu

Two weeks ago we released a giant set of improvements to Windows Azure, as well as a significant update of the Windows Azure SDK. This morning we released another massive set of enhancements to Windows Azure. Today’s new capabilities include: Storage: Import/Export Hard Disk Drives to your Storage Accounts HDInsight: General Availability of our Hadoop Service in the cloud Virtual Machines: New VM Gallery, ACL support for VIPs Web Sites: WebSocket and Remote Debugging Support Notification Hubs: Segmented customer push notification support with tag expressions TFS & GIT: Continuous Delivery Support for Web Sites + Cloud Services Developer Analytics: New Relic support for Web Sites + Mobile Services Service Bus: Support for partitioned queues and topics Billing: New Billing Alert Service that sends emails notifications when your bill hits a threshold you define All of these improvements are now available to use immediately (note that some features are still in preview). Below are more details about them. Storage: Import/Export Hard Disk Drives to Windows Azure I am excited to announce the preview of our new Windows Azure Import/Export Service! The Windows Azure Import/Export Service enables you to move large amounts of on-premises data into and out of your Windows Azure Storage accounts. It does this by enabling you to securely ship hard disk drives directly to our Windows Azure data centers. Once we receive the drives we’ll automatically transfer the data to or from your Windows Azure Storage account. This enables you to import or export massive amounts of data more quickly and cost effectively (and not be constrained by available network bandwidth). Encrypted Transport Our Import/Export service provides built-in support for BitLocker disk encryption – which enables you to securely encrypt data on the hard drives before you send it, and not have to worry about it being compromised even if the disk is lost/stolen in transit (since the content on the transported hard drives is completely encrypted and you are the only one who has the key to it). The drive preparation tool we are shipping today makes setting up bitlocker encryption on these hard drives easy. How to Import/Export your first Hard Drive of Data You can read our Getting Started Guide to learn more about how to begin using the import/export service. You can create import and export jobs via the Windows Azure Management Portal as well as programmatically using our Server Management APIs. It is really easy to create a new import or export job using the Windows Azure Management Portal. Simply navigate to a Windows Azure storage account, and then click the new Import/Export tab now available within it (note: if you don’t have this tab make sure to sign-up for the Import/Export preview): Then click the “Create Import Job” or “Create Export Job” commands at the bottom of it. This will launch a wizard that easily walks you through the steps required: For more comprehensive information about Import/Export, refer to Windows Azure Storage team blog. You can also send questions and comments to the [email protected] email address. We think you’ll find this new service makes it much easier to move data into and out of Windows Azure, and it will dramatically cut down the network bandwidth required when working on large data migration projects. We hope you like it. HDInsight: 100% Compatible Hadoop Service in the Cloud Last week we announced the general availability release of Windows Azure HDInsight. HDInsight is a 100% compatible Hadoop service that allows you to easily provision and manage Hadoop clusters for big data processing in Windows Azure. This release is now live in production, backed by an enterprise SLA, supported 24x7 by Microsoft Support, and is ready to use for production scenarios. HDInsight allows you to use Apache Hadoop tools, such as Pig and Hive, to process large amounts of data in Windows Azure Blob Storage. Because data is stored in Windows Azure Blob Storage, you can choose to dynamically create Hadoop clusters only when you need them, and then shut them down when they are no longer required (since you pay only for the time the Hadoop cluster instances are running this provides a super cost effective way to use them). You can create Hadoop clusters using either the Windows Azure Management Portal (see below) or using our PowerShell and Cross Platform Command line tools: The import/export hard drive support that came out today is a perfect companion service to use with HDInsight – the combination allows you to easily ingest, process and optionally export a limitless amount of data. We’ve also integrated HDInsight with our Business Intelligence tools, so users can leverage familiar tools like Excel in order to analyze the output of jobs. You can find out more about how to get started with HDInsight here. Virtual Machines: VM Gallery Enhancements Today’s update of Windows Azure brings with it a new Virtual Machine gallery that you can use to create new VMs in the cloud. You can launch the gallery by doing New->Compute->Virtual Machine->From Gallery within the Windows Azure Management Portal: The new Virtual Machine Gallery includes some nice enhancements that make it even easier to use: Search: You can now easily search and filter images using the search box in the top-right of the dialog. For example, simply type “SQL” and we’ll filter to show those images in the gallery that contain that substring. Category Tree-view: Each month we add more built-in VM images to the gallery. You can continue to browse these using the “All” view within the VM Gallery – or now quickly filter them using the category tree-view on the left-hand side of the dialog. For example, by selecting “Oracle” in the tree-view you can now quickly filter to see the official Oracle supplied images. MSDN and Supported checkboxes: With today’s update we are also introducing filters that makes it easy to filter out types of images that you may not be interested in. The first checkbox is MSDN: using this filter you can exclude any image that is not part of the Windows Azure benefits for MSDN subscribers (which have highly discounted pricing - you can learn more about the MSDN pricing here). The second checkbox is Supported: this filter will exclude any image that contains prerelease software, so you can feel confident that the software you choose to deploy is fully supported by Windows Azure and our partners. Sort options: We sort gallery images by what we think customers are most interested in, but sometimes you might want to sort using different views. So we’re providing some additional sort options, like “Newest,” to customize the image list for what suits you best. Pricing information: We now provide additional pricing information about images and options on how to cost effectively run them directly within the VM Gallery. The above improvements make it even easier to use the VM Gallery and quickly create launch and run Virtual Machines in the cloud. Virtual Machines: ACL Support for VIPs A few months ago we exposed the ability to configure Access Control Lists (ACLs) for Virtual Machines using Windows PowerShell cmdlets and our Service Management API. With today’s release, you can now configure VM ACLs using the Windows Azure Management Portal as well. You can now do this by clicking the new Manage ACL command in the Endpoints tab of a virtual machine instance: This will enable you to configure an ordered list of permit and deny rules to scope the traffic that can access your VM’s network endpoints. For example, if you were on a virtual network, you could limit RDP access to a Windows Azure virtual machine to only a few computers attached to your enterprise. Or if you weren’t on a virtual network you could alternatively limit traffic from public IPs that can access your workloads: Here is the default behaviors for ACLs in Windows Azure: By default (i.e. no rules specified), all traffic is permitted. When using only Permit rules, all other traffic is denied. When using only Deny rules, all other traffic is permitted. When there is a combination of Permit and Deny rules, all other traffic is denied. Lastly, remember that configuring endpoints does not automatically configure them within the VM if it also has firewall rules enabled at the OS level. So if you create an endpoint using the Windows Azure Management Portal, Windows PowerShell, or REST API, be sure to also configure your guest VM firewall appropriately as well. Web Sites: Web Sockets Support With today’s release you can now use Web Sockets with Windows Azure Web Sites. This feature enables you to easily integrate real-time communication scenarios within your web based applications, and is available at no extra charge (it even works with the free tier). Higher level programming libraries like SignalR and socket.io are also now supported with it. You can enable Web Sockets support on a web site by navigating to the Configure tab of a Web Site, and by toggling Web Sockets support to “on”: Once Web Sockets is enabled you can start to integrate some really cool scenarios into your web applications. Check out the new SignalR documentation hub on www.asp.net to learn more about some of the awesome scenarios you can do with it. Web Sites: Remote Debugging Support The Windows Azure SDK 2.2 we released two weeks ago introduced remote debugging support for Windows Azure Cloud Services. With today’s Windows Azure release we are extending this remote debugging support to also work with Windows Azure Web Sites. With live, remote debugging support inside of Visual Studio, you are able to have more visibility than ever before into how your code is operating live in Windows Azure. It is now super easy to attach the debugger and quickly see what is going on with your application in the cloud. Remote Debugging of a Windows Azure Web Site using VS 2013 Enabling the remote debugging of a Windows Azure Web Site using VS 2013 is really easy. Start by opening up your web application’s project within Visual Studio. Then navigate to the “Server Explorer” tab within Visual Studio, and click on the deployed web-site you want to debug that is running within Windows Azure using the Windows Azure->Web Sites node in the Server Explorer. Then right-click and choose the “Attach Debugger” option on it: When you do this Visual Studio will remotely attach the debugger to the Web Site running within Windows Azure. The debugger will then stop the web site’s execution when it hits any break points that you have set within your web application’s project inside Visual Studio. For example, below I set a breakpoint on the “ViewBag.Message” assignment statement within the HomeController of the standard ASP.NET MVC project template. When I hit refresh on the “About” page of the web site within the browser, the breakpoint was triggered and I am now able to debug the app remotely using Visual Studio: Note above how we can debug variables (including autos/watchlist/etc), as well as use the Immediate and Command Windows. In the debug session above I used the Immediate Window to explore some of the request object state, as well as to dynamically change the ViewBag.Message property. When we click the the “Continue” button (or press F5) the app will continue execution and the Web Site will render the content back to the browser. This makes it super easy to debug web apps remotely. Tips for Better Debugging To get the best experience while debugging, we recommend publishing your site using the Debug configuration within Visual Studio’s Web Publish dialog. This will ensure that debug symbol information is uploaded to the Web Site which will enable a richer debug experience within Visual Studio. You can find this option on the Web Publish dialog on the Settings tab: When you ultimately deploy/run the application in production we recommend using the “Release” configuration setting – the release configuration is memory optimized and will provide the best production performance. To learn more about diagnosing and debugging Windows Azure Web Sites read our new Troubleshooting Windows Azure Web Sites in Visual Studio guide. Notification Hubs: Segmented Push Notification support with tag expressions In August we announced the General Availability of Windows Azure Notification Hubs - a powerful Mobile Push Notifications service that makes it easy to send high volume push notifications with low latency from any mobile app back-end. Notification hubs can be used with any mobile app back-end (including ones built using our Mobile Services capability) and can also be used with back-ends that run in the cloud as well as on-premises. Beginning with the initial release, Notification Hubs allowed developers to send personalized push notifications to both individual users as well as groups of users by interest, by associating their devices with tags representing the logical target of the notification. For example, by registering all devices of customers interested in a favorite MLB team with a corresponding tag, it is possible to broadcast one message to millions of Boston Red Sox fans and another message to millions of St. Louis Cardinals fans with a single API call respectively. New support for using tag expressions to enable advanced customer segmentation With today’s release we are adding support for even more advanced customer targeting. You can now identify customers that you want to send push notifications to by defining rich tag expressions. With tag expressions, you can now not only broadcast notifications to Boston Red Sox fans, but take that segmenting a step farther and reach more granular segments. This opens up a variety of scenarios, for example: Offers based on multiple preferences—e.g. send a game day vegetarian special to users tagged as both a Boston Red Sox fan AND a vegetarian Push content to multiple segments in a single message—e.g. rain delay information only to users who are tagged as either a Boston Red Sox fan OR a St. Louis Cardinal fan Avoid presenting subsets of a segment with irrelevant content—e.g. season ticket availability reminder to users who are tagged as a Boston Red Sox fan but NOT also a season ticket holder To illustrate with code, consider a restaurant chain app that sends an offer related to a Red Sox vs Cardinals game for users in Boston. Devices can be tagged by your app with location tags (e.g. “Loc:Boston”) and interest tags (e.g. “Follows:RedSox”, “Follows:Cardinals”), and then a notification can be sent by your back-end to “(Follows:RedSox || Follows:Cardinals) && Loc:Boston” in order to deliver an offer to all devices in Boston that follow either the RedSox or the Cardinals. This can be done directly in your server backend send logic using the code below: var notification = new WindowsNotification(messagePayload); hub.SendNotificationAsync(notification, "(Follows:RedSox || Follows:Cardinals) && Loc:Boston"); In your expressions you can use all Boolean operators: AND (&&), OR (||), and NOT (!). Some other cool use cases for tag expressions that are now supported include: Social: To “all my group except me” - group:id && !user:id Events: Touchdown event is sent to everybody following either team or any of the players involved in the action: Followteam:A || Followteam:B || followplayer:1 || followplayer:2 … Hours: Send notifications at specific times. E.g. Tag devices with time zone and when it is 12pm in Seattle send to: GMT8 && follows:thaifood Versions and platforms: Send a reminder to people still using your first version for Android - version:1.0 && platform:Android For help on getting started with Notification Hubs, visit the Notification Hub documentation center. Then download the latest NuGet package (or use the Notification Hubs REST APIs directly) to start sending push notifications using tag expressions. They are really powerful and enable a bunch of great new scenarios. TFS & GIT: Continuous Delivery Support for Web Sites + Cloud Services With today’s Windows Azure release we are making it really easy to enable continuous delivery support with Windows Azure and Team Foundation Services. Team Foundation Services is a cloud based offering from Microsoft that provides integrated source control (with both TFS and Git support), build server, test execution, collaboration tools, and agile planning support. It makes it really easy to setup a team project (complete with automated builds and test runners) in the cloud, and it has really rich integration with Visual Studio. With today’s Windows Azure release it is now really easy to enable continuous delivery support with both TFS and Git based repositories hosted using Team Foundation Services. This enables a workflow where when code is checked in, built successfully on an automated build server, and all tests pass on it – I can automatically have the app deployed on Windows Azure with zero manual intervention or work required. The below screen-shots demonstrate how to quickly setup a continuous delivery workflow to Windows Azure with a Git-based ASP.NET MVC project hosted using Team Foundation Services. Enabling Continuous Delivery to Windows Azure with Team Foundation Services The project I’m going to enable continuous delivery with is a simple ASP.NET MVC project whose source code I’m hosting using Team Foundation Services. I did this by creating a “SimpleContinuousDeploymentTest” repository there using Git – and then used the new built-in Git tooling support within Visual Studio 2013 to push the source code to it. Below is a screen-shot of the Git repository hosted within Team Foundation Services: I can access the repository within Visual Studio 2013 and easily make commits with it (as well as branch, merge and do other tasks). Using VS 2013 I can also setup automated builds to take place in the cloud using Team Foundation Services every time someone checks in code to the repository: The cool thing about this is that I don’t have to buy or rent my own build server – Team Foundation Services automatically maintains its own build server farm and can automatically queue up a build for me (for free) every time someone checks in code using the above settings. This build server (and automated testing) support now works with both TFS and Git based source control repositories. Connecting a Team Foundation Services project to Windows Azure Once I have a source repository hosted in Team Foundation Services with Automated Builds and Testing set up, I can then go even further and set it up so that it will be automatically deployed to Windows Azure when a source code commit is made to the repository (assuming the Build + Tests pass). Enabling this is now really easy. To set this up with a Windows Azure Web Site simply use the New->Compute->Web Site->Custom Create command inside the Windows Azure Management Portal. This will create a dialog like below. I gave the web site a name and then made sure the “Publish from source control” checkbox was selected: When we click next we’ll be prompted for the location of the source repository. We’ll select “Team Foundation Services”: Once we do this we’ll be prompted for our Team Foundation Services account that our source repository is hosted under (in this case my TFS account is “scottguthrie”): When we click the “Authorize Now” button we’ll be prompted to give Windows Azure permissions to connect to the Team Foundation Services account. Once we do this we’ll be prompted to pick the source repository we want to connect to. Starting with today’s Windows Azure release you can now connect to both TFS and Git based source repositories. This new support allows me to connect to the “SimpleContinuousDeploymentTest” respository we created earlier: Clicking the finish button will then create the Web Site with the continuous delivery hooks setup with Team Foundation Services. Now every time someone pushes source control to the repository in Team Foundation Services, it will kick off an automated build, run all of the unit tests in the solution , and if they pass the app will be automatically deployed to our Web Site in Windows Azure. You can monitor the history and status of these automated deployments using the Deployments tab within the Web Site: This enables a really slick continuous delivery workflow, and enables you to build and deploy apps in a really nice way. Developer Analytics: New Relic support for Web Sites + Mobile Services With today’s Windows Azure release we are making it really easy to enable Developer Analytics and Monitoring support with both Windows Azure Web Site and Windows Azure Mobile Services. We are partnering with New Relic, who provide a great dev analytics and app performance monitoring offering, to enable this - and we have updated the Windows Azure Management Portal to make it really easy to configure. Enabling New Relic with a Windows Azure Web Site Enabling New Relic support with a Windows Azure Web Site is now really easy. Simply navigate to the Configure tab of a Web Site and scroll down to the “developer analytics” section that is now within it: Clicking the “add-on” button will display some additional UI. If you don’t already have a New Relic subscription, you can click the “view windows azure store” button to obtain a subscription (note: New Relic has a perpetually free tier so you can enable it even without paying anything): Clicking the “view windows azure store” button will launch the integrated Windows Azure Store experience we have within the Windows Azure Management Portal. You can use this to browse from a variety of great add-on services – including New Relic: Select “New Relic” within the dialog above, then click the next button, and you’ll be able to choose which type of New Relic subscription you wish to purchase. For this demo we’ll simply select the “Free Standard Version” – which does not cost anything and can be used forever: Once we’ve signed-up for our New Relic subscription and added it to our Windows Azure account, we can go back to the Web Site’s configuration tab and choose to use the New Relic add-on with our Windows Azure Web Site. We can do this by simply selecting it from the “add-on” dropdown (it is automatically populated within it once we have a New Relic subscription in our account): Clicking the “Save” button will then cause the Windows Azure Management Portal to automatically populate all of the needed New Relic configuration settings to our Web Site: Deploying the New Relic Agent as part of a Web Site The final step to enable developer analytics using New Relic is to add the New Relic runtime agent to our web app. We can do this within Visual Studio by right-clicking on our web project and selecting the “Manage NuGet Packages” context menu: This will bring up the NuGet package manager. You can search for “New Relic” within it to find the New Relic agent. Note that there is both a 32-bit and 64-bit edition of it – make sure to install the version that matches how your Web Site is running within Windows Azure (note: you can configure your Web Site to run in either 32-bit or 64-bit mode using the Web Site’s “Configuration” tab within the Windows Azure Management Portal): Once we install the NuGet package we are all set to go. We’ll simply re-publish the web site again to Windows Azure and New Relic will now automatically start monitoring the application Monitoring a Web Site using New Relic Now that the application has developer analytics support with New Relic enabled, we can launch the New Relic monitoring portal to start monitoring the health of it. We can do this by clicking on the “Add Ons” tab in the left-hand side of the Windows Azure Management Portal. Then select the New Relic add-on we signed-up for within it. The Windows Azure Management Portal will provide some default information about the add-on when we do this. Clicking the “Manage” button in the tray at the bottom will launch a new browser tab and single-sign us into the New Relic monitoring portal associated with our account: When we do this a new browser tab will launch with the New Relic admin tool loaded within it: We can now see insights into how our app is performing – without having to have written a single line of monitoring code. The New Relic service provides a ton of great built-in monitoring features allowing us to quickly see: Performance times (including browser rendering speed) for the overall site and individual pages. You can optionally set alert thresholds to trigger if the speed does not meet a threshold you specify. Information about where in the world your customers are hitting the site from (and how performance varies by region) Details on the latency performance of external services your web apps are using (for example: SQL, Storage, Twitter, etc) Error information including call stack details for exceptions that have occurred at runtime SQL Server profiling information – including which queries executed against your database and what their performance was And a whole bunch more… The cool thing about New Relic is that you don’t need to write monitoring code within your application to get all of the above reports (plus a lot more). The New Relic agent automatically enables the CLR profiler within applications and automatically captures the information necessary to identify these. This makes it super easy to get started and immediately have a rich developer analytics view for your solutions with very little effort. If you haven’t tried New Relic out yet with Windows Azure I recommend you do so – I think you’ll find it helps you build even better cloud applications. Following the above steps will help you get started and deliver you a really good application monitoring solution in only minutes. Service Bus: Support for partitioned queues and topics With today’s release, we are enabling support within Service Bus for partitioned queues and topics. Enabling partitioning enables you to achieve a higher message throughput and better availability from your queues and topics. Higher message throughput is achieved by implementing multiple message brokers for each partitioned queue and topic. The multiple messaging stores will also provide higher availability. You can create a partitioned queue or topic by simply checking the Enable Partitioning option in the custom create wizard for a Queue or Topic: Read this article to learn more about partitioned queues and topics and how to take advantage of them today. Billing: New Billing Alert Service Today’s Windows Azure update enables a new Billing Alert Service Preview that enables you to get proactive email notifications when your Windows Azure bill goes above a certain monetary threshold that you configure. This makes it easier to manage your bill and avoid potential surprises at the end of the month. With the Billing Alert Service Preview, you can now create email alerts to monitor and manage your monetary credits or your current bill total. To set up an alert first sign-up for the free Billing Alert Service Preview. Then visit the account management page, click on a subscription you have setup, and then navigate to the new Alerts tab that is available: The alerts tab allows you to setup email alerts that will be sent automatically once a certain threshold is hit. For example, by clicking the “add alert” button above I can setup a rule to send myself email anytime my Windows Azure bill goes above $100 for the month: The Billing Alert Service will evolve to support additional aspects of your bill as well as support multiple forms of alerts such as SMS. Try out the new Billing Alert Service Preview today and give us feedback. Summary Today’s Windows Azure release enables a ton of great new scenarios, and makes building applications hosted in the cloud even easier. If you don’t already have a Windows Azure account, you can sign-up for a free trial and start using all of the above features today. Then visit the Windows Azure Developer Center to learn more about how to build apps with it. Hope this helps, Scott P.S. In addition to blogging, I am also now using Twitter for quick updates and to share links. Follow me at: twitter.com/scottgu

Read the article
Big Data – Learning Basics of Big Data in 21 Days – Bookmark

- by Pinal Dave

Earlier this month I had a great time to write Bascis of Big Data series. This series received great response and lots of good comments I have received, I am going to follow up this basics series with further in-depth series in near future. Here is the consolidated blog post where you can find all the 21 days blog posts together. Bookmark this page for future reference. Big Data – Beginning Big Data – Day 1 of 21 Big Data – What is Big Data – 3 Vs of Big Data – Volume, Velocity and Variety – Day 2 of 21 Big Data – Evolution of Big Data – Day 3 of 21 Big Data – Basics of Big Data Architecture – Day 4 of 21 Big Data – Buzz Words: What is NoSQL – Day 5 of 21 Big Data – Buzz Words: What is Hadoop – Day 6 of 21 Big Data – Buzz Words: What is MapReduce – Day 7 of 21 Big Data – Buzz Words: What is HDFS – Day 8 of 21 Big Data – Buzz Words: Importance of Relational Database in Big Data World – Day 9 of 21 Big Data – Buzz Words: What is NewSQL – Day 10 of 21 Big Data – Role of Cloud Computing in Big Data – Day 11 of 21 Big Data – Operational Databases Supporting Big Data – RDBMS and NoSQL – Day 12 of 21 Big Data – Operational Databases Supporting Big Data – Key-Value Pair Databases and Document Databases – Day 13 of 21 Big Data – Operational Databases Supporting Big Data – Columnar, Graph and Spatial Database – Day 14 of 21 Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21 Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21 Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper? – Day 17 of 21 Big Data – Basics of Big Data Analytics – Day 18 of 21 Big Data – How to become a Data Scientist and Learn Data Science? – Day 19 of 21 Big Data – Various Learning Resources – How to Start with Big Data? – Day 20 of 21 Big Data – Final Wrap and What Next – Day 21 of 21 Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
ssh permission denied

- by Gitmo

I am trying to ssh into a remote machine and I get the following debug messages: debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug2: ssh_connect: needpriv 0 debug1: Connecting to xxx.xxx.x.xx [xxx.xxx.xx.x] port 22. debug1: Connection established. debug3: Not a RSA1 key file /home/hadoop/.ssh/id_rsa. debug2: key_type_from_name: unknown key type '-----BEGIN' debug3: key_read: missing keytype debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug3: key_read: missing whitespace debug2: key_type_from_name: unknown key type '-----END' debug3: key_read: missing keytype debug1: identity file /home/hadoop/.ssh/id_rsa type 1 debug1: Checking blacklist file /usr/share/ssh/blacklist.RSA-2048 debug1: Checking blacklist file /etc/ssh/blacklist.RSA-2048 debug1: Remote protocol version 2.0, remote software version OpenSSH_5.1p1 Debian-6ubuntu2 debug1: match: OpenSSH_5.1p1 Debian-6ubuntu2 pat OpenSSH* debug1: Enabling compatibility mode for protocol 2.0 debug1: Local version string SSH-2.0-OpenSSH_5.1p1 Debian-6ubuntu2 debug2: fd 3 setting O_NONBLOCK debug1: SSH2_MSG_KEXINIT sent debug1: SSH2_MSG_KEXINIT received debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1 debug2: kex_parse_kexinit: ssh-rsa,ssh-dss debug2: kex_parse_kexinit: aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,[email protected],aes128-ctr,aes192-ctr,aes256-ctr debug2: kex_parse_kexinit: aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,[email protected],aes128-ctr,aes192-ctr,aes256-ctr debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: none,[email protected],zlib debug2: kex_parse_kexinit: none,[email protected],zlib debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: first_kex_follows 0 debug2: kex_parse_kexinit: reserved 0 debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1 debug2: kex_parse_kexinit: ssh-rsa,ssh-dss debug2: kex_parse_kexinit: aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,[email protected],aes128-ctr,aes192-ctr,aes256-ctr debug2: kex_parse_kexinit: aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,arcfour128,arcfour256,arcfour,aes192-cbc,aes256-cbc,[email protected],aes128-ctr,aes192-ctr,aes256-ctr debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,[email protected],hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96 debug2: kex_parse_kexinit: none,[email protected] debug2: kex_parse_kexinit: none,[email protected] debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: debug2: kex_parse_kexinit: first_kex_follows 0 debug2: kex_parse_kexinit: reserved 0 debug2: mac_setup: found hmac-md5 debug1: kex: server->client aes128-cbc hmac-md5 none debug2: mac_setup: found hmac-md5 debug1: kex: client->server aes128-cbc hmac-md5 none debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP debug2: dh_gen_key: priv key bits set: 128/256 debug2: bits set: 511/1024 debug1: SSH2_MSG_KEX_DH_GEX_INIT sent debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY debug3: check_host_in_hostfile: filename /home/hadoop/.ssh/known_hosts debug3: check_host_in_hostfile: match line 20 debug1: Host '192.168.1.63' is known and matches the RSA host key. debug1: Found key in /home/hadoop/.ssh/known_hosts:20 debug2: bits set: 511/1024 debug1: ssh_rsa_verify: signature correct debug2: kex_derive_keys debug2: set_newkeys: mode 1 debug1: SSH2_MSG_NEWKEYS sent debug1: expecting SSH2_MSG_NEWKEYS debug2: set_newkeys: mode 0 debug1: SSH2_MSG_NEWKEYS received debug1: SSH2_MSG_SERVICE_REQUEST sent debug2: service_accept: ssh-userauth debug1: SSH2_MSG_SERVICE_ACCEPT received debug2: key: /home/hadoop/.ssh/id_rsa (0x241c110) debug1: Authentications that can continue: publickey,password debug3: start over, passed a different list publickey,password debug3: preferred gssapi-keyex,gssapi-with-mic,gssapi,publickey,keyboard-interactive debug3: authmethod_lookup publickey debug3: remaining preferred: keyboard-interactive debug3: authmethod_is_enabled publickey debug1: Next authentication method: publickey debug1: Offering public key: /home/hadoop/.ssh/id_rsa debug3: send_pubkey_test debug2: we sent a publickey packet, wait for reply debug1: Authentications that can continue: publickey,password debug2: we did not send a packet, disable method debug1: No more authentication methods to try. Permission denied (publickey,password). What seems to be the problem?? I have tried everything, this is driving me nuts.

Read the article
It's Not TV- It's OTN: Top 10 Videos on the OTN YouTube Channel

- by Bob Rhubart

It's been a while since we checked in on what people are watching on the Oracle Technology Network YouTube Channel. Here are the Top 10 video for the last 30 days. Tom Kyte: Keeping Up with the Latest in Database Technology Tom Kyte expands on his keynote presentation at the Great Lakes Oracle Conference with tips for developers, DBAs and others who want to make sure they are prepared to work with the latest database technologies. That Jeff Smith: Oracle SQL Developer Oracle SQL Developer product manager Jeff Smith (yeah, that Jeff Smith) talks about his presentations at the Great Lakes Oracle Conference and shares his reaction to keynote speaker C.J. Date's claim that "SQL dropped the ball." Gwen Shapira: Hadoop and Oracle Database Oracle ACE Director Gwen Shapira @gwenshap talks about the fit between Hadoop and Oracle Database and dives into the details of why Oracle Loader for Hadoop is 5x faster. Kai Yu: Virtualization and Cloud Oracle ACE Director Kai Yu talks about the questions he is most frequently asked when he does presentations on cloud computing and virtualization. Mark Sewtz: APEX 4.2 Mobile App Development Application Express developer Marc Sewtz demos the new features he built into APEX4.2 to support Mobile App Development. Jeremy Schneider: RAC Attack Oracle ACE Jeremy Schneider @jer_s describes what you can expect when you come to a RAC (Real Application Cluster) Attack. Frits Hoogland: Exadata Under the Hood Oracle ACE Director Frits Hoogland (@fritshoogland) talks about the secret sauce under Exadata's hood. David Peake: APEX 4.2 New Features David Peake, PM for Oracle Application Express, gives a quick overview of some of the new APEX features. Greg Marsden: Hugepages = Huge Performance on Linux Greg Marsden of Oracle's Linux Kernel Engineering Team talks about some common customer performance questions and making the most of Oracle Linux 6 and Transparent HugePages. John Hurley: NEOOUG and GLOC 2013 Northeast Ohio Oracle User Group president John Hurley talks about the background and success of the 2013 Great Lakes Oracle Conference.

Read the article
posting nutch data into a BASIC auth secured Solr instance

- by mlathe

Hi. I've secured a solr instance using BASIC auth, kind of how it is shown here: http://blog.comtaste.com/2009/02/securing_your_solr_server_on_t.html Now i'm trying to update my batch processes to push data into the authenticated instance. The ones using "curl" are easy, but i also have a Nutch crawl that uses the "solrindex" command to push data into Solr. When i do that i get this error: 2010-02-22 12:09:28,226 INFO auth.AuthChallengeProcessor - basic authentication scheme selected 2010-02-22 12:09:28,229 INFO httpclient.HttpMethodDirector - No credentials available for BASIC 'Tomcat Manager Application'@ninja:5500 2010-02-22 12:09:28,236 WARN mapred.LocalJobRunner - job_local_0001 org.apache.solr.common.SolrException: Unauthorized Unauthorized request: http://ninja:5500/solr/foo/update?wt=javabin&version=2.2 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183) at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:48) at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:69) at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) 2010-02-22 12:09:29,134 FATAL solr.SolrIndexer - SolrIndexer: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.indexer.solr.SolrIndexer.indexSolr(SolrIndexer.java:73) at org.apache.nutch.indexer.solr.SolrIndexer.run(SolrIndexer.java:95) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.solr.SolrIndexer.main(SolrIndexer.java:104) Apparently nutch uses SolrJ to push the content, and after going through the solrj code, it's clear that it uses commons-httpclient without providing a way to set the credentials. Here are my question(s) Is this possible to do? ie push from nutch into a BASIC auth secured Solr instance? Is it possible to tell commons-httpclient about a credential without explicitly doing an _httpclient.getState().setCredentials(...)? Anyother ideas? One idea i had was to use an IPfiltering Valve for just the "update" Solr webservices. That would mean you could only make an update call from certain nodes. Thanks

Read the article
What's a good(affordable) business router that can limit traffic to certain machines(ips/ports)?

- by Ryan Detzel

We are using the basic Verizon router but it sucks so we're looking for a new one that allows us to limit users and our hadoop cluster to certain limits. Our problem is one person can start downloading something and kill the network and every hour we download logs into our cluster but it floods the network unless we rate limit it. Ideally we want to be able to say: total: 35 mbps Hadoop Cluster (15 mbps) Phones (1 mbps) Office(25 people) (19mbps but no one machine can have more than 5mbps)

Read the article
NFS Server in Java

- by dmeister

I search an implementation of a network (or distributed) file system like NFS in Java. The goal is to extend it and do some research stuff with it. On the web I found some implementation e.g. DJ NFS, but the open question is how mature and fast they are. Can anyone purpose a good starting point, has anyone experience with such things? P.S. I know Hadoop DFS and I used it for some projects, but Hadoop is not a good fit for the things I want to do here. --EDIT-- Hadoop is really focused on highly scalable, high throughput computing without the possibilities to overwrite parts of a file and so an. The goal is you could use the filesystem e.g. for user home directories. --EDIT-- More Details: The idea is to modify such a implementation so that the files are not stored directly on a local filesystem, but to apply data de-duplication.

Read the article

Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 10/17 | < Previous Page | 6 7 8 9 10 11 12 13 14 15 16 17 | Next Page >

- by Teflon Ted

- by Andrei Savu

- by darbour

- by S.N

- by iCode

- by Janice J. Heiss

- by BuckWoody

- by Simon Elliston Ball

- by Janice J. Heiss

- by Janice J. Heiss

- by Mala Narasimharajan

- by Eric Charles

- by mhornick

- by Pinal Dave

- by Pinal Dave

- by Madhu Nair

- by OTN ArchBeat

- by Pinal Dave

- by ScottGu

- by Pinal Dave

- by Gitmo

- by Bob Rhubart

- by mlathe

- by Ryan Detzel

- by dmeister

< Previous Page | 6 7 8 9 10 11 12 13 14 15 16 17 | Next Page >