distributed algorithm - Page 142

I'm trying to implement 2 factor authentication on the cheap. How would I do that?

- by Biff MaGriff

Ok so I need 2 of the 3. Something the user knows. Something the user has. Something the user is. I have a system that is exposed to the internet and we need clients to connect in a secure manner to satisfy our security standards. I'm thinking when a user registers to use our system we send them an application that they install on their home system. The application generates a key based on a timed randomness algorithm. Our application server has the same algorithm so when the user submits their credentials with the key we know that they are a legitimate user. Is this a valid method of 2 factor authentication? What is another way of doing this? Are there any pitfalls that I should be aware of? Thanks for your help!

Read the article

Approaches for Content-based Item Recommendations

- by PartlyCloudy

Hello, I'm currently developing an application where I want to group similar items. Items (like videos) can be created by users and also their attributes can be altered or extended later (like new tags). Instead of relying on users' preferences as most collaborative filtering mechanisms do, I want to compare item similarity based on the items' attributes (like similar length, similar colors, similar set of tags, etc.). The computation is necessary for two main purposes: Suggesting x similar items for a given item and for clustering into groups of similar items. My application so far is follows an asynchronous design and I want to decouple this clustering component as far as possible. The creation of new items or the addition of new attributes for an existing item will be advertised by publishing events the component can then consume. Computations can be provided best-effort and "snapshotted", which means that I'm okay with the best result possible at a given point in time, although result quality will eventually increase. So I am now searching for appropriate algorithms to compute both similar items and clusters. At important constraint is scalability. Initially the application has to handle a few thousand items, but later million items might be possible as well. Of course, computations will then be executed on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supports some kind of incremental mode on partial changes of the data. My initial thought of comparing each item with each other and storing the numerical similarity sounds a little bit crude. Also, it requires n*(n-1)/2 entries for storing all similarities and any change or new item will eventually cause n similarity computations. Thanks in advance! UPDATE tl;dr To clarify what I want, here is my targeted scenario: User generate entries (think of documents) User edit entry meta data (think of tags) And here is what my system should provide: List of similar entries to a given item as recommendation Clusters of similar entries Both calculations should be based on: The meta data/attributes of entries (i.e. usage of similar tags) Thus, the distance of two entries using appropriate metrics NOT based on user votings, preferences or actions (unlike collaborative filtering). Although users may create entries and change attributes, the computation should only take into account the items and their attributes, and not the users associated with (just like a system where only items and no users exist). Ideally, the algorithm should support: permanent changes of attributes of an entry incrementally compute similar entries/clusters on changes scale something better than a simple distance table, if possible (because of the O(n²) space complexity)

Read the article

in java, which is better - three arrays of booleans or 1 array of bytes?

- by joe_shmoe

I know the question sounds silly, but consider this: I have an array of items and a labelling algorithm. at any point the item is in one of three states. The current version holds these states in a byte array, where 0, 1 and 2 represent the three states. alternatively, I could have three arrays of boolean - one for each state. which is better (consumes less memory) depends on how jvm (sun's version) stores the arrays - is a boolean represented by 1 bit? (p.s. don't start with all that "this is not the way OO/Java works" - I know, but here performance comes in front. plus the algorithm is simple and perfectly readable even in such form). Thanks a lot

Read the article

How to implement HeightMap smoothing using Thrust

- by igal k

I'm trying to implement height map smoothing using Thrust. Let's say I have a big array ( around 1 million floats). A known graphics algorithm to implement the above problem is to calculate the average around a given cell. If for example I need to calculate the value at a given cell[i,j] what I will basically do is: cell[i,j] = cell[i-1,j-1] + cell[i-1,j] + cell[i-1,j+1] + cell[i,j-1] + cell[i,j+1] + cell[i+1,j -1] + cell[i+1,j] + cell[i+1,j+1] cell[i,j] /=9 That's the CPU code. Is there a way to implement it using thrust? I know that I could use the transform algorithm. But I'm not sure it's correct to access different cells which are occupied but different threads (banks conflicts and so on).

Read the article

algorithmes with no executable example code

- by gcc

[link] http://stackoverflow.com/questions/2932016/parsing-of-mathematical-expressions (problem has been told before) ** program is to create the infix tree for the given math expression.** if the expression is given completely paranthesized then the out put is fine but when there are no paranthesis or some part paranthesized then the out put is wrong. cant get the idea how to solve. my problem is told above. I have some algorithm to solve my problem, but I have no simple code which will be guide for me. Can anyone give me simple code (not working code) so that I will try working to understand.(for a 3 hours ,I have been searching and reading some text to understand algorithm which is told above.Actually,there is no example code investigating how it is working. Can anyone send me example which is written in c not other language.

Read the article

Detecting similar words among n text documents

- by javanes

Hi; I have n documents and want to find common words that are included in these documents. For example I want to say (n-3) documents include the word "web". Certainly I can do this by basic data structures but there maybe efficient algorithm or a way to handle same words with different suffix. Is there any algorithm for such purposes? I am unfamiliar with datamining world. In general manner is there a term used for efforts of finding similarities between different documents? If there is then I will make my research easily. Thanks.

Read the article

large test data for knapsack problem

- by user347918

i am researcher student. I am searching large data for knapsack problem. I wanted test my algorithm for knapsack problem. But i couldn't find large data. I need data has 1000 item and capacity is no matter. The point is item as much as huge it's good for my algorithm. Is there any huge data available in internet. Does anybody know please guys i need urgent.

Read the article

Problem with Precision floating point operation in C

- by Microkernel

Hi Guys, For one of my course project I started implementing "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially Spam) using huge training data. Now I have problem implementing the algorithm because of the limitations in the C's datatype. ( Algorithm I am using is given here, http://en.wikipedia.org/wiki/Bayesian_spam_filtering ) PROBLEM STATEMENT: The algorithm involves taking each word in a document and calculating probability of it being spam word. If p1, p2 p3 .... pn are probabilities of word-1, 2, 3 ... n. The probability of doc being spam or not is calculated using Here, probability value can be very easily around 0.01. So even if I use datatype "double" my calculation will go for a toss. To confirm this I wrote a sample code given below. #define PROBABILITY_OF_UNLIKELY_SPAM_WORD (0.01) #define PROBABILITY_OF_MOSTLY_SPAM_WORD (0.99) int main() { int index; long double numerator = 1.0; long double denom1 = 1.0, denom2 = 1.0; long double doc_spam_prob; /* Simulating FEW unlikely spam words */ for(index = 0; index < 162; index++) { numerator = numerator*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD; denom2 = denom2*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD; denom1 = denom1*(long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD); } /* Simulating lot of mostly definite spam words */ for (index = 0; index < 1000; index++) { numerator = numerator*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD; denom2 = denom2*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD; denom1 = denom1*(long double)(1- PROBABILITY_OF_MOSTLY_SPAM_WORD); } doc_spam_prob= (numerator/(denom1+denom2)); return 0; } I tried Float, double and even long double datatypes but still same problem. Hence, say in a 100K words document I am analyzing, if just 162 words are having 1% spam probability and remaining 99838 are conspicuously spam words, then still my app will say it as Not Spam doc because of Precision error (as numerator easily goes to ZERO)!!!. This is the first time I am hitting such issue. So how exactly should this problem be tackled?

Read the article

how to tesselate bezier triangles?

- by Cheery

My concern are quadratic bezier triangles which I'm trying to tesselate for rendering them. I've managed to implement this by subdividing the triangle recursively like described in a wikipedia page. Though I'd like to get more precision to subdivision. The problem is that I'll either get too few subdivisions or too many because the amount of surfaces doubles on every iteration of that algorithm. In particular I would need an adaptive tesselation algorithm that allows me to define the amount of segments at the edges. I'm not sure whether I can get that though so I'd also like to hear about uniform tesselation techniques. Hardest trouble I have trouble with calculating normals for a point in bezier surface, which I'm not sure whether I need, but been trying to solve out.

Read the article

Preoblem with Precision floating point operation in C

- by Microkernel

Hi Guys, For one of my course project I started implementing "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially Spam) using huge training data. Now I have problem implementing the algorithm because of the limitations in the C's datatype. ( Algorithm I am using is given here, http://en.wikipedia.org/wiki/Bayesian_spam_filtering ) PROBLEM STATEMENT: The algorithm involves taking each word in a document and calculating probability of it being spam word. If p1, p2 p3 .... pn are probabilities of word-1, 2, 3 ... n. The probability of doc being spam or not is calculated using Here, probability value can be very easily around 0.01. So even if I use datatype "double" my calculation will go for a toss. To confirm this I wrote a sample code given below. #define PROBABILITY_OF_UNLIKELY_SPAM_WORD (0.01) #define PROBABILITY_OF_MOSTLY_SPAM_WORD (0.99) int main() { int index; long double numerator = 1.0; long double denom1 = 1.0, denom2 = 1.0; long double doc_spam_prob; /* Simulating FEW unlikely spam words */ for(index = 0; index < 162; index++) { numerator = numerator*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD; denom2 = denom2*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD; denom1 = denom1*(long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD); } /* Simulating lot of mostly definite spam words */ for (index = 0; index < 1000; index++) { numerator = numerator*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD; denom2 = denom2*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD; denom1 = denom1*(long double)(1- PROBABILITY_OF_MOSTLY_SPAM_WORD); } doc_spam_prob= (numerator/(denom1+denom2)); return 0; } I tried Float, double and even long double datatypes but still same problem. Hence, say in a 100K words document I am analyzing, if just 162 words are having 1% spam probability and remaining 99838 are conspicuously spam words, then still my app will say it as Not Spam doc because of Precision error (as numerator easily goes to ZERO)!!!. This is the first time I am hitting such issue. So how exactly should this problem be tackled?

Read the article

Minimum number of training examples for Find-S/Candidate Elimination algorithms?

- by Rich

Consider the instance space consisting of integer points in the x, y plane, where 0 = x, y = 10, and the set of hypotheses consisting of rectangles (i.e. being of the form (a = x = b, c = y = d), where 0 = a, b, c, d = 10). What is the smallest number of training examples one needs to provide so that the Find-S algorithm perfectly learns a particular target concept (e.g. (2 = x = 4, 6 = y = 9))? When can we say that the target concept is exactly learned in the case of the Find-S algorithm, and what is the optimal query strategy? I'd also like to know the answer w.r.t Candidate Elimination. Thanks in advance.

Read the article

overloading new operator in c++

- by Angus

I have a code for best fit algorithm. I want to try to use the best fit algorithm using the operator new. Every time I create an object I should give it from the already allocated memory say, 1]20 2]12 3]15 4]6 5]23 respectively. which ever minimum amount fits to the objects size(eg.21) I wanted to do it for different object types, so I need to write the overloaded operator new to be common functionality for all the class objects. Can I do it through friend functions, or is there any possible way to do it.

Read the article

Read a file to multiple array byte[]

- by hankol

I have an encryption algorithm (AES) that accepts file converted to array byte and encrypt it. Since I am going to process a very big size files, the JVM may go out of memory. I am planing to read the files in multiple array byte. each containing some part of the file. Then I teratively feed the algorithm. Finally merge them to produce encrypted file. So my question is: there any way to read a file part by part to multiple array byte? I thought I can use the following to read the file to array byte: IOUtils.toByteArray(InputStream input). And then split the array into multiple bytes using: Arrays.copyOfRange(). But I am afraid that the first code that reads file to byte will make the JVM to go out of memory. any suggestion please ? thanks

Read the article

Determining if two rays intersect

- by Faken

I have two rays on a 2D plane that extend to infinity but both have a starting point. They are both described by a starting point and a vector in the direction of the ray extending to infinity. I want to find out if the two rays intersect but i don't need to know where they intersect (its part of a collision detection algorithm). Everything i have looked at so far describes finding the intersection point of two lines or line segments. Anyone know a fast algorithm to solve this?

Read the article

How can I call an executable to run on a separate machine within a program on my own machine (win xp

- by Mr. H.

My objective is to write a program which will call another executable on a separate computer(all with win xp) with parameters determined at run-time, then repeat for several more computers, and then collect the results. In short, I'm working on a grid-computing project. The algorithm itself being used is already coded in FORTRAN, but we are looking for an efficient way to run it on many computers at once. I suppose one way to accomplish this would be to upload a script to each computer and then run said script on each computer, all automatically and dependent on my own parameters. But how can I write a program which will write to, upload, and run a script on a separate computer? I had considered GridGain, but the algorithm is already coded and in a different language, so that is ruled out. My current guess at accomplishing this task is using Expect (wiki/Expect), but I have no knowledge of the tool. Any advice appreciated.

Read the article

Can I replicate some of the optimisations done by the JVM by hand?

- by Subb

I'm working on a Sudoku solver at school and we're having a little performance contest. Right now, my algorithm is pretty fast on the first run (about 2.5ms), but even faster when I solve the same puzzle 10 000 times (about 0.5ms for each run). Those timing are, of course, depend of the puzzle being solved. I know the JVM do some optimization when a method is called multiple time, and this is what I suspect is happening. I don't think I can further optimize the algorithm itself (though I'll keep looking), so I was wondering if I could replicate some of the optimizations done by the JVM. Note : compiling to native code is not an option Thanks!

Read the article

Scalling connected lines

- by Hristo

Hello, I have some kind of a shape consisting of vertical, horizontal and diagonal lines. I have starting X,Y and ending X,Y (this is my input - just 2 points defining a line) of each line and I would like to make the whole shape scalable (just by changing the value of a scale ratio variable), so that I can still preserve the proper connection of the lines and the proportions as well. Just for getting a better idea of what I mean: it'd be as if I had the same lines in a vector editor. Would that be possible with an algorithm, and could you please, give me another possible solution if there is no such algorithm ? Thank you very much in advance!

Read the article

i need help to designe code in c++

- by user344987

) Design and implement a Graph data structure. Use adjacency matrix to implement the unweighted graph edges. The Graph must support the following operations: 1.Constructor 2.Destructor 3.Copy constructor 4.A function to add an edge between two nodes in the graph 5.A display function that outputs all the edges of the graph 6.A function edge that accepts two nodes, the function returns true if there is an edge between the passed nodes, and returns false otherwise. B.(100 points) Depth first search and Breadth first search functions. C.(100 points) A function to output a spanning tree of the graph, use any algorithm you find appropriate, also, make the necessary changes on the data structure in A to implement your algorithm.

Read the article

Garbage collection implementation

- by Suresh S

In which language garbage collection algorithm is implemented for java.i think c, please confirm?

Read the article

How do C++ header files work?

- by PulpFiction

Hi all. When I include some function from a header file in a C++ program, does the entire header file code get copied to the final executable or only the machine code for the specific function is generated. For example, if I call std::sort from the <algorithm> header in C++, is the machine code generated only for the sort() function or for the entire <algorithm> header file. I think that a similar question exists somewhere on Stack Overflow, but I have tried my best to find it (I glanced over it once, but lost the link). If you can point me to that, it would be wonderful.

Read the article

Improve C function performance with cache locality?

- by Christoper Hans

I have to find a diagonal difference in a matrix represented as 2d array and the function prototype is int diagonal_diff(int x[512][512]) I have to use a 2d array, and the data is 512x512. This is tested on a SPARC machine: my current timing is 6ms but I need to be under 2ms. Sample data: [3][4][5][9] [2][8][9][4] [6][9][7][3] [5][8][8][2] The difference is: |4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16 In order to do that, I use the following algorithm: int i,j,result=0; for(i=0; i<4; i++) for(j=0; j<4; j++) result+=abs(array[i][j]-[j][i]); return result; But this algorithm keeps accessing the column, row, column, row, etc which make inefficient use of cache. Is there a way to improve my function?

Read the article

construct a unique number for a string in java

- by praveen

We have a requirement of reading/writing more than 10 million strings into a file. Also we do not want duplicates in the file. Since the strings would be flushed to a file as soon as they are read we are not maintaining it in memory. We cannot use hashcode because of collisions in the hash code due to which we might miss a string as duplicate. Two other approaches i found in my googling: 1.Use a message digest algorithm like MD5 - but it might be too costly to calculate and store. 2.Use a checksum algorithm. [i am not sure if this produces a unique key for a string- can someone please confirm] Is there any other approach avaiable. Thanks.

Read the article

How can we change method for generatinf PK and FK names in hibernate?

- by Max

Can i provide some algorithm for generate smth intead of FK030HF9840303 for example?

Read the article

CodePlex Daily Summary for Thursday, March 04, 2010

CodePlex Daily Summary for Thursday, March 04, 2010New ProjectsAcPrac: A educational program designed to run on Windows. It is fully customizable. It is developed in C# 2008.argo: Linguistic Tool Camelot: A simple utility for testing cross site data queries in SharePointdelta: Delta is a difference tool for large files. Delta will have several clients, including AJAX and Silverlight. You can view differences in a scroll...EF Dorsal: A dorsal spine for Entity Framework based project. This code generator provides a powerfull Business Layer with almost all high important best p...Excel formatting: Excel formatting projectEyes Recognition: eye recognitionGameOfLife: The Game of Life is a the best example of a cellular automaton. It's developed in C#, SilverLight.GKO Libraries: .NET tool libraries written in C# that we have used in our projectsHello Demo: A simple Hello World application, used to demonstrate access to CodePlex using Team Explorerjog.Portal: jogportal lays the foundation for an extensible portal solution leveraging the latest technology including linq and silver lightKM Brasil: k-meleon, km, gecko, brasil, browser, web, web browser, navegador, navegação, kmeleon, k meleonLost in Translation: Ever find yourself in a foreign country eager (or clueless) to what is written on a shop sign or restaurant menu? With your trusty phone, its came...Machine Learning for .NET: Machine Learning Library for .NET. Initial inclusions will be binary and multi-class classification as well as standard clustering algorithms.Maito: Iron Kingdoms Name GeneratorManPowerEngine: ManPowerEngine is a Silverlight 2D Game Engine. It has an game application framework and supports game object hierarchy management, 2D physics simu...Marvin's Arena: Marvin's Arena is a free and entertaining programming game. The game is designed to easily learn programming in any .NET compatible language. It is...MultiMediaPlayer: 整合我们的内容NCacheD - A Simple Distributed .NET Cache using the TPL and WCF: NCacheD is a simple distributed cache system written in .NET. NCacheD offers functionality similar to that of MemcacheD but scaled back. NCacheD ...NDAP: OPeNDAP is a client/server system for making local scientific data accessible to remote locations without regard to the local or remote storage for...Open Guide CMS: Open Guide makes your traveling through city more adventurous, mode educational and more funny. You can support us by lines of code, by interesti...Rollback - A social backup tool.: Rollback is a simple and intuitive social backup application. You can create multiple backup jobs with a few mouse clicks and even schedule it to ...sELedit: An editor for elements.data file of the popular MMORPG Perfect World.SharePoint Theme Applicator: SharePoint Theme Applicator will give SharePoint administrators the ability to apply a theme accross a whole web application (i.e. apply a theme to...sPATCH: ! sPATCH - Perfect World Client Autopatcher This beta patcher is an alternative to the default perfect world patcher. It offers easy client patc...Wiggle: A-life investigation. Let's make little squirmies!WPFSLBlendModeFx: A blend mode library for WPF & Silverlight.休息提醒: 程序员们，尤其是像我这样对程序痴狂的程序员们。一旦研究起自己感兴趣的程序时，觉得上厕所都是浪费时间。这个程序是在设定的时间后锁定计算机或关闭显示器，从而从某种程度强迫程序员们去休息New ReleasesAMFParser: 1.1.0: Add handling of DSK object when using BlazeDS Add handling of string references Add handling of DateTime values Correct handling of Double values B...Camelot: Camelot 0.1 Alpha: Early release: 1) Query by content type, optionally list or base template 2) Query by list or base templateDictionary Translator for Umbraco: Dictionary Translator 1.1: This is a minor release that fixes a bug that Thomas Hohler found on the first day of release This package is to be used with the ASP.NET open sou...Dynamic Configuration: Sample Application Release 1: These binaries demonstrate the effect of using DynamicConfigurationManager. The source-code for these binaries is available in the 'Source Code' t...EF Dorsal: EFDorsal v0.3b: Second real version. This version add suport to types and entity sets in tree-view.Enterprise Library Extensions: Release 1.0: This is the initial release of the package. The release will contain one feature only, as being able to deploy the project itself it the milestone ...Free Silverlight & WPF Chart Control - Visifire: Visifire SL and WPF Charts v3.0.4 beta 2 Released: Hi, Today’s release contains fix for the following issues: * In WPF application chart was throwing exception as VisualStateGroup was not foun...GameOfLife: Game of life: First release of the game. PublishedGameStore League Manager: League Manager 1.0 release 2.: Fixes crashing bugs from the first release. To use: 1. Install SQL Server Express 2005 http://www.microsoft.com/Sqlserver/2005/en/us/express-down....Marvin's Arena: Version 0.0.5.0: Code Editor (development of robots without Visual Studio - no debugging) * Rounds * 3D Battle Engine: Skybox * 3D Battle Engine: Robot...Morphfolia - ASP.NET CMS and Framework: Morphfolia v2.4.1.1: Morphfolia v2.4.1.1 - New Release Includes: Better support for browsers other than IE (Chrome, Firefox, Safari - all tested on Windows) Supports ...NCacheD - A Simple Distributed .NET Cache using the TPL and WCF: NCacheD Version 1: Getting Started To get up and running, open two instances of Visual Studio 2010. In one window open the NCacheD client solution and then open the ...Papercut: Papercut 2010-3-3: This release includes a few bug fixes and updates, several of which were contributed patches (thanks!). Feature: Added support for embedded images...PE-file Reader Writer API (PERWAPI): PERWAPI-1.1.4: Perwapi version 1.1.4 is the complete distribution package. It contains Binary files, pdb files and xml files for the PERWAPI and SymbolRW compone...Prolog.NET: Prolog.NET 1.0 Beta 1.2: Installer includes: primary Prolog.NET assembly Prolog.NET Workbench PrologTest console application all required dependencies Beta 1.2 in...Protoforma | Tactica Adversa: Skilful 0.2.4.320: BetaRoTwee: RoTwee 6.1.0.0: 16604 "Post playing tune feature" is added. Using this new feature, you can tweet tune playing in iTunes. 16625 Error processing for network error...sELedit: sELedit v1.0: -SharePoint 2007 Deployment Wizard: Support for SharePoint Server and Foundation 2010: This release encompasses the supported install paths for the default install of SPS 2010 (the 14 hive). All three versions are now supported (60 h...SharePoint Theme Applicator: SharePoint Theme Applicator: SharePoint Theme Applicator was built using C# and WPF, it includes the following features: Provides the total number of site collections in the g...Shinkansen: compress, crunch, combine, and cache JavaScript and CSS: Shinkansen 1.0.0.030310: Build 1.0.0.030310, binaries onlysPATCH: sPatcher v0.8: sPatch - Server Example *Contains a sample Patch that "downgrades" PWI 1.4.2 Client to an 1.3.6 ClientTFS Code Comment Checking Policy (CCCP): CCCP 3.0 for VSTS 2008 SP1: This release includes NRefactory 3.2.0.5571 and is built against VSTS 2008 SP1 (.NET 3.5 is required). New: horizontal scrollbar in listboxes for ...TortoiseHg: Beta for TortoiseHg 1.0 (0.9.31254): Please backup your user Mercurial.ini file and then uninstall any 0.9.X release before installing Use the x86 msi file for 32 bit platforms and th...TwitEclipseAPI: TwitEclipseAPI 0.9: 0.9 Release of TwitEclipseAPI Moved API calls to the api.Twitter.com URL. Moved API calls to the versioning API. Now uses the increased Rate Limit...TwitterVB - A .NET Twitter Library: TwitterVB-2.3: Patch 5151: Added BlockedUsers function to get a page other then the first Patch 5420: The ListMembers function will now return more then just th...休息提醒: 初始版本: 初始版本Most Popular ProjectsMetaSharpRawrWBFS ManagerAJAX Control ToolkitMicrosoft SQL Server Product Samples: DatabaseSilverlight ToolkitWindows Presentation Foundation (WPF)Microsoft SQL Server Community & SamplesASP.NETLiveUpload to FacebookMost Active ProjectsRawrBlogEngine.NETMapWindow GISpatterns & practices – Enterprise LibraryjQuery Library for SharePoint Web ServicesMDT Web FrontEndsvn2tfsDiffPlex - a .NET Diff GeneratorIonics Isapi Rewrite FilterFarseer Physics Engine

Read the article

Running a simple integration scenario using the Oracle Big Data Connectors on Hadoop/HDFS cluster

- by hamsun

Between the elephant ( the tradional image of the Hadoop framework) and the Oracle Iron Man (Big Data..) an english setter could be seen as the link to the right data Data, Data, Data, we are living in a world where data technology based on popular applications , search engines, Webservers, rich sms messages, email clients, weather forecasts and so on, have a predominant role in our life. More and more technologies are used to analyze/track our behavior, try to detect patterns, to propose us "the best/right user experience" from the Google Ad services, to Telco companies or large consumer sites (like Amazon:) ). The more we use all these technologies, the more we generate data, and thus there is a need of huge data marts and specific hardware/software servers (as the Exadata servers) in order to treat/analyze/understand the trends and offer new services to the users. Some of these "data feeds" are raw, unstructured data, and cannot be processed effectively by normal SQL queries. Large scale distributed processing was an emerging infrastructure need and the solution seemed to be the "collocation of compute nodes with the data", which in turn leaded to MapReduce parallel patterns and the development of the Hadoop framework, which is based on MapReduce and a distributed file system (HDFS) that runs on larger clusters of rather inexpensive servers. Several Oracle products are using the distributed / aggregation pattern for data calculation ( Coherence, NoSql, times ten ) so once that you are familiar with one of these technologies, lets says with coherence aggregators, you will find the whole Hadoop, MapReduce concept very similar. Oracle Big Data Appliance is based on the Cloudera Distribution (CDH), and the Oracle Big Data Connectors can be plugged on a Hadoop cluster running the CDH distribution or equivalent Hadoop clusters. In this paper, a "lab like" implementation of this concept is done on a single Linux X64 server, running an Oracle Database 11g Enterprise Edition Release 11.2.0.4.0, and a single node Apache hadoop-1.2.1 HDFS cluster, using the SQL connector for HDFS. The whole setup is fairly simple: Install on a Linux x64 server ( or virtual box appliance) an Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 server Get the Apache Hadoop distribution from: http://mir2.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-1.2.1. Get the Oracle Big Data Connectors from: http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html?ssSourceSiteId=ocomen. Check the java version of your Linux server with the command: java -version java version "1.7.0_40" Java(TM) SE Runtime Environment (build 1.7.0_40-b43) Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode) Decompress the hadoop hadoop-1.2.1.tar.gz file to /u01/hadoop-1.2.1 Modify your .bash_profile export HADOOP_HOME=/u01/hadoop-1.2.1 export PATH=$PATH:$HADOOP_HOME/bin export HIVE_HOME=/u01/hive-0.11.0 export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin (also see my sample .bash_profile) Set up ssh trust for Hadoop process, this is a mandatory step, in our case we have to establish a "local trust" as will are using a single node configuration copy the new public keys to the list of authorized keys connect and test the ssh setup to your localhost: We will run a "pseudo-Hadoop cluster", in what is called "local standalone mode", all the Hadoop java components are running in one Java process, this is enough for our demo purposes. We need to "fine tune" some Hadoop configuration files, we have to go at our $HADOOP_HOME/conf, and modify the files: core-site.xml hdfs-site.xml mapred-site.xml check that the hadoop binaries are referenced correctly from the command line by executing: hadoop -version As Hadoop is managing our "clustered HDFS" file system we have to create "the mount point" and format it , the mount point will be declared to core-site.xml as: The layout under the /u01/hadoop-1.2.1/data will be created and used by other hadoop components (MapReduce = /mapred/...) HDFS is using the /dfs/... layout structure format the HDFS hadoop file system: Start the java components for the HDFS system As an additional check, you can use the GUI Hadoop browsers to check the content of your HDFS configurations: Once our HDFS Hadoop setup is done you can use the HDFS file system to store data ( big data : )), and plug them back and forth to Oracle Databases by the means of the Big Data Connectors ( which is the next configuration step). You can create / use a Hive db, but in our case we will make a simple integration of "raw data" , through the creation of an External Table to a local Oracle instance ( on the same Linux box, we run the Hadoop HDFS one node cluster and one Oracle DB). Download some public "big data", I use the site: http://france.meteofrance.com/france/observations, from where I can get *.csv files for my big data simulations :). Here is the data layout of my example file: Download the Big Data Connector from the OTN (oraosch-2.2.0.zip), unzip it to your local file system (see picture below) Modify your environment in order to access the connector libraries , and make the following test: [oracle@dg1 bin]$./hdfs_stream Usage: hdfs_stream locationFile [oracle@dg1 bin]$ Load the data to the Hadoop hdfs file system: hadoop fs -mkdir bgtest_data hadoop fs -put obsFrance.txt bgtest_data/obsFrance.txt hadoop fs -ls /user/oracle/bgtest_data/obsFrance.txt [oracle@dg1 bg-data-raw]$ hadoop fs -ls /user/oracle/bgtest_data/obsFrance.txt Found 1 items -rw-r--r-- 1 oracle supergroup 54103 2013-10-22 06:10 /user/oracle/bgtest_data/obsFrance.txt [oracle@dg1 bg-data-raw]$hadoop fs -ls hdfs:///user/oracle/bgtest_data/obsFrance.txt Found 1 items -rw-r--r-- 1 oracle supergroup 54103 2013-10-22 06:10 /user/oracle/bgtest_data/obsFrance.txt Check the content of the HDFS with the browser UI: Start the Oracle database, and run the following script in order to create the Oracle database user, the Oracle directories for the Oracle Big Data Connector (dg1 it’s my own db id replace accordingly yours): #!/bin/bash export ORAENV_ASK=NO export ORACLE_SID=dg1 . oraenv sqlplus /nolog <<EOF CONNECT / AS sysdba; CREATE OR REPLACE DIRECTORY osch_bin_path AS '/u01/orahdfs-2.2.0/bin'; CREATE USER BGUSER IDENTIFIED BY oracle; GRANT CREATE SESSION, CREATE TABLE TO BGUSER; GRANT EXECUTE ON sys.utl_file TO BGUSER; GRANT READ, EXECUTE ON DIRECTORY osch_bin_path TO BGUSER; CREATE OR REPLACE DIRECTORY BGT_LOG_DIR as '/u01/BG_TEST/logs'; GRANT READ, WRITE ON DIRECTORY BGT_LOG_DIR to BGUSER; CREATE OR REPLACE DIRECTORY BGT_DATA_DIR as '/u01/BG_TEST/data'; GRANT READ, WRITE ON DIRECTORY BGT_DATA_DIR to BGUSER; EOF Put the following in a file named t3.sh and make it executable, hadoop jar $OSCH_HOME/jlib/orahdfs.jar \ oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=BGTEST_DP_XTAB \ -D oracle.hadoop.exttab.defaultDirectory=BGT_DATA_DIR \ -D oracle.hadoop.exttab.dataPaths="hdfs:///user/oracle/bgtest_data/obsFrance.txt" \ -D oracle.hadoop.exttab.columnCount=7 \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@//localhost:1521/dg1 \ -D oracle.hadoop.connection.user=BGUSER \ -D oracle.hadoop.exttab.printStackTrace=true \ -createTable --noexecute then test the creation fo the external table with it: [oracle@dg1 samples]$ ./t3.sh ./t3.sh: line 2: /u01/orahdfs-2.2.0: Is a directory Oracle SQL Connector for HDFS Release 2.2.0 - Production Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved. Enter Database Password:] The create table command was not executed. The following table would be created. CREATE TABLE "BGUSER"."BGTEST_DP_XTAB" ( "C1" VARCHAR2(4000), "C2" VARCHAR2(4000), "C3" VARCHAR2(4000), "C4" VARCHAR2(4000), "C5" VARCHAR2(4000), "C6" VARCHAR2(4000), "C7" VARCHAR2(4000) ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY "BGT_DATA_DIR" ACCESS PARAMETERS ( RECORDS DELIMITED BY 0X'0A' CHARACTERSET AL32UTF8 STRING SIZES ARE IN CHARACTERS PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream' FIELDS TERMINATED BY 0X'2C' MISSING FIELD VALUES ARE NULL ( "C1" CHAR(4000), "C2" CHAR(4000), "C3" CHAR(4000), "C4" CHAR(4000), "C5" CHAR(4000), "C6" CHAR(4000), "C7" CHAR(4000) ) ) LOCATION ( 'osch-20131022081035-74-1' ) ) PARALLEL REJECT LIMIT UNLIMITED; The following location files would be created. osch-20131022081035-74-1 contains 1 URI, 54103 bytes 54103 hdfs://localhost:19000/user/oracle/bgtest_data/obsFrance.txt Then remove the --noexecute flag and create the external Oracle table for the Hadoop data. Check the results: The create table command succeeded. CREATE TABLE "BGUSER"."BGTEST_DP_XTAB" ( "C1" VARCHAR2(4000), "C2" VARCHAR2(4000), "C3" VARCHAR2(4000), "C4" VARCHAR2(4000), "C5" VARCHAR2(4000), "C6" VARCHAR2(4000), "C7" VARCHAR2(4000) ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY "BGT_DATA_DIR" ACCESS PARAMETERS ( RECORDS DELIMITED BY 0X'0A' CHARACTERSET AL32UTF8 STRING SIZES ARE IN CHARACTERS PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream' FIELDS TERMINATED BY 0X'2C' MISSING FIELD VALUES ARE NULL ( "C1" CHAR(4000), "C2" CHAR(4000), "C3" CHAR(4000), "C4" CHAR(4000), "C5" CHAR(4000), "C6" CHAR(4000), "C7" CHAR(4000) ) ) LOCATION ( 'osch-20131022081719-3239-1' ) ) PARALLEL REJECT LIMIT UNLIMITED; The following location files were created. osch-20131022081719-3239-1 contains 1 URI, 54103 bytes 54103 hdfs://localhost:19000/user/oracle/bgtest_data/obsFrance.txt This is the view from the SQL Developer: and finally the number of lines in the oracle table, imported from our Hadoop HDFS cluster SQL select count(*) from "BGUSER"."BGTEST_DP_XTAB"; COUNT(*) ---------- 1151 In a next post we will integrate data from a Hive database, and try some ODI integrations with the ODI Big Data connector. Our simplistic approach is just a step to show you how these unstructured data world can be integrated to Oracle infrastructure. Hadoop, BigData, NoSql are great technologies, they are widely used and Oracle is offering a large integration infrastructure based on these services. Oracle University presents a complete curriculum on all the Oracle related technologies: NoSQL: Introduction to Oracle NoSQL Database Using Oracle NoSQL Database Big Data: Introduction to Big Data Oracle Big Data Essentials Oracle Big Data Overview Oracle Data Integrator: Oracle Data Integrator 12c: New Features Oracle Data Integrator 11g: Integration and Administration Oracle Data Integrator: Administration and Development Oracle Data Integrator 11g: Advanced Integration and Development Oracle Coherence 12c: Oracle Coherence 12c: New Features Oracle Coherence 12c: Share and Manage Data in Clusters Oracle Coherence 12c: Oracle GoldenGate 11g: Fundamentals for Oracle Oracle GoldenGate 11g: Fundamentals for SQL Server Oracle GoldenGate 11g Fundamentals for Oracle Oracle GoldenGate 11g Fundamentals for DB2 Oracle GoldenGate 11g Fundamentals for Teradata Oracle GoldenGate 11g Fundamentals for HP NonStop Oracle GoldenGate 11g Management Pack: Overview Oracle GoldenGate 11g Troubleshooting and Tuning Oracle GoldenGate 11g: Advanced Configuration for Oracle Other Resources: Apache Hadoop : http://hadoop.apache.org/ is the homepage for these technologies. "Hadoop Definitive Guide 3rdEdition" by Tom White is a classical lecture for people who want to know more about Hadoop , and some active "googling " will also give you some more references. About the author: Eugene Simos is based in France and joined Oracle through the BEA-Weblogic Acquisition, where he worked for the Professional Service, Support, end Education for major accounts across the EMEA Region. He worked in the banking sector, ATT, Telco companies giving him extensive experience on production environments. Eugen currently specializes in Oracle Fusion Middleware teaching an array of courses on Weblogic/Webcenter, Content,BPM /SOA/Identity-Security/GoldenGate/Virtualisation/Unified Comm Suite) throughout the EMEA region.

Search Results

Search found 6716 results on 269 pages for 'distributed algorithm'.

Page 142/269 | < Previous Page | 138 139 140 141 142 143 144 145 146 147 148 149 | Next Page >

- by Biff MaGriff

- by PartlyCloudy

- by joe_shmoe

- by igal k

- by gcc

- by javanes

- by user347918

- by Microkernel

- by Cheery

- by Microkernel

- by Rich

- by Angus

- by hankol

- by Faken

- by Mr. H.

- by Subb

- by Hristo

- by user344987

- by Suresh S

- by PulpFiction

- by Christoper Hans

- by praveen

- by Max

- by hamsun

< Previous Page | 138 139 140 141 142 143 144 145 146 147 148 149 | Next Page >