Search Results

Search found 324 results on 13 pages for 'mining'.

Page 4/13 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

Manchester UG Presentation Video

In July I was invited to speak at the UK SQL Server UG event in Manchester. I spoke about Excel being a good data mining client. I was a little rushed at the end, as Chris Testa-O'Neill told me I had only 5 minutes to go when I had only been talking for 10 minutes. Apparently I have a reputation for running over my time allocation. At the event we also had a product demo from SQL Sentry around their BI monitoring dashboard solution. This includes SSIS, but the main thrust was SSAS. Then came Chris with a look at Analysis Services. If you have never heard Chris talk, then take the opportunity now: he is a top-class presenter and I am often found sat at the back of his classes. Here is the video link

Read the article
Knowledge mining using Hadoop.

- by Anurag

Hello there, I want to do a project using Hadoop and MapReduce and present it as my graduation project. To this end, I've given it some thought, searched the internet, and came up with the idea of implementing some basic knowledge mining algorithms on social websites such as Facebook, or maybe Stack Overflow, Quora, etc., and drawing statistical graphs, comparisons, frequency distributions and other important values. For searching, would it be wise to use Apache Solr? I want to know if such a thing is feasible using the above-mentioned tools and, if so, how I should build on this little idea. Where can I learn about knowledge mining algorithms that are easy to implement using Java and MapReduce techniques? In case this is a wrong idea, please suggest what else could be done using Hadoop and related sub-projects. Thank you

Read the article
How do I cluster strings based on a relation between two strings?

- by Tom Wijsman

If you don't know WEKA, you can try a theoretical answer. I don't need literal code/examples... I have a huge data set of strings in which I want to cluster the strings to find the most related ones; these could just as well be seen as duplicates. I already have a set of pairs of strings that I know are duplicates of each other, so now I want to do some data mining on those two sets. The result I'm looking for is a system that returns the most relevant pairs of strings that are not yet known to be duplicates. I believe I need clustering for this, but which type? Note that I want to base this on word occurrence comparison, not on interpretation or meaning.
Here is an example of two strings that we know are duplicates (in our view of them):
- The weather is really cold and it is raining.
- It is raining and the weather is really cold.
Now, the following strings also exist (most to least relevant, ignoring stop words):
- Is the weather really that cold today?
- Rainy days are awful.
- I see the sunshine outside.
The software would return the following two strings as most relevant, which aren't known to be duplicates:
- The weather is really cold and it is raining.
- Is the weather really that cold today?
Then I would mark that pair as duplicate or not duplicate, and it would present me with another pair. How do I implement this in the most efficient way, so that I can apply it to a large data set?
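One way to read the question is as ranking candidate pairs by word overlap rather than as clustering proper. The following is a minimal sketch under that assumption, using Jaccard similarity over non-stop-word tokens; the stop-word list and the function names `tokens`, `jaccard` and `rank_pairs` are illustrative choices, not something the question specifies.

```python
import re
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list; a real one would be larger.
STOP_WORDS = {"the", "is", "it", "and", "that", "are", "i", "a", "an"}


def tokens(text):
    """Lower-case the text, keep alphabetic words, drop stop words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS}


def jaccard(a, b):
    """Shared words divided by all distinct words of the two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_pairs(strings, known_duplicates=()):
    """Return (score, a, b) for every pair not already known, best first."""
    known = {frozenset(pair) for pair in known_duplicates}
    scored = [
        (jaccard(a, b), a, b)
        for a, b in combinations(strings, 2)
        if frozenset((a, b)) not in known
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

Comparing every pair is quadratic in the number of strings, so for a genuinely large data set this brute-force loop would probably need to be replaced by an approximate scheme such as MinHash with locality-sensitive hashing, which estimates the same Jaccard score without visiting all pairs.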

Read the article
What data mining tools do you use?

- by python dude

Hello everyone, Besides the two well-known Open Source tools RapidMiner and Weka, are there any other good tools (either Open Source or Commercial), which you can recommend for data mining? Thanks in advance!

Read the article
Big Data’s Killer App…

- by jean-pierre.dijcks

Recently Keith spent some time talking about the cloud on this blog, and I will spare you my thoughts on the whole thing. What I do want to write down is something about the Big Data movement and what I think is the killer app for Big Data...
Where is this coming from? OK, I confess: I spent 3 days in cloud land at the Cloud Connect conference in Santa Clara and it was quite a lot of fun. One of the nice things at Cloud Connect was that there was a track dedicated to Big Data, which prompted me to some extent to write this post.
What is Big Data anyway?
The most valuable point made in the Big Data track was that Big Data in itself is not very cool. Doing something with Big Data is what makes all of this cool and interesting to a business user! The other good insight I got was that a lot of people think Big Data means a single gigantic monolithic system holding gazillions of bytes or documents or log files. Well, it turns out that most people in the Big Data track are talking about a lot of collections of smaller data sets. So rather than thinking "big = monolithic" you should be thinking "big = many data sets". This is more than just theoretical; it is actually relevant when thinking about big data and how to process it. It is important because it means that the platform that stores data will most likely consist of multiple solutions. You may be storing logs on something like HDFS, you may store your customer information in Oracle and you may store clickstream information in some distilled form in MySQL. The big question you will need to solve is not what lives where, but how to get it all together and get some value out of all that data.
NoSQL and MapReduce
Nope, sorry, this is not the killer app... and no, I'm not saying this because my business card says Oracle and I'm therefore biased. I think language is important, but as with storage I think pragmatic is better. In other words, some questions can be answered with SQL very efficiently, others can be answered with Perl or Tcl, others with MR. History should teach us that anyone trying to solve a problem will use any and all tools around. For example, most data warehouses (Big Data 1.0?) get a lot of data in flat files. Everyone then runs a bunch of shell scripts to massage or verify those files and then shoves those files into the database. We've even built shell script support into external tables to allow for this. I think the Big Data projects will do the same. Some people will use MapReduce, although I would argue that things like Cascading are more interesting; some people will use Java. Some data is stored on HDFS, making Cascading the way to go; some data is stored in Oracle, and SQL does do a good job there. As with storage and with history, be pragmatic and use what fits; neither NoSQL nor MR will be the one and only. Also, a language, while important, does not in itself deliver business value. So while cool, it is not a killer app...
Vertical Behavioral Analytics
This is the killer app! And you are now thinking: "what does that mean?" Let's decompose that heading. First of all, analytics. I would think you had guessed by now that this is really what I'm after, and of course you are right. But not just analytics, which has a very large scope and means many things to many people. I'm not just after Business Intelligence (analytics 1.0?) or data mining (analytics 2.0?); I'm after something more interesting that you can only do after collecting large volumes of specific data. That all-important data is about behavior. What do my customers do? More importantly, why do they behave like that? If you can figure that out, you can tailor web sites, stores, products etc. to that behavior and figure out how to be successful. The behavior that is somewhat easily tracked today is web site clicks, search patterns and all of those things that a web site or web server tracks. That is where the Big Data lives and where these patterns are now emerging. Other examples are emerging, however, and one of the examples used at the conference was about predicting churn for a telco based on the social network its members are a part of. That social network is not about LinkedIn or Facebook, but about who calls whom. I call you a lot, you switch provider, and I might/will switch too. And that just naturally brings me to the next word, vertical. Vertical in this context means per industry, e.g. communications or retail or government or any other vertical. The reason for being more specific than just behavioral analytics is that each industry has its own data sources, has its own quirky logic and has its own demands and priorities. Of course, the methods and some of the software will be common, and some will have both retail and service industry analytics in place (your corner coffee store, for example). But the gist of it all is that analytics that can predict customer behavior for a specific focused group of people in a specific industry is what makes Big Data interesting.
Building a Vertical Behavioral Analysis System
Well, that is going to be interesting. I have not seen much going on in that space, and if I had one criticism of the Cloud Connect conference it would be the lack of concrete use cases on big data. The telco example, while a step into the vertical behavioral part, is not really on big data. It used a sample of data from the customers' data warehouse. One thing I do think, and this is where I think parts of the NoSQL stuff come from, is that we will be doing this analysis where the data is. Over the past 10 years we at Oracle have called this in-database analytics. I guess we were (too) early? Now the entire market is going there, including companies like SAS. In-place, by the way, does not mean "no data movement at all"; it means that you will do this in the data's permanent home. For SAS that is kind of the current problem. Most of the inputs live in a data warehouse. So why move it into SAS and back? That all worked with 1 TB data warehouses, but when we are looking at 100 TB to 500 TB of distilled data...
Comments?
As it is still early days with these systems, I'm very interested in seeing reactions to some of these thoughts...

Read the article
How do comparison sites work?

- by Vijay

I need your thoughts on how comparison sites actually work: sites like Junglee.com, policybazaar.com and many others that provide comparisons of products, fares, etc. gathered from different websites. I have read a little about it, and this is what I found:
- These sites use data feeds from the source sites.
- These sites use APIs provided by the source sites.
- For sites that offer neither of these two possibilities, the comparison sites use a web crawler to crawl their data.
If you think there is more to it, please share your own views. What I want to know, for learning purposes and a little out of curiosity, is how they actually match the crawled data, feeds, and other sources so that there is no duplication. What is the process, or what are the algorithms, for this? And where should I go to learn these concepts? References to books, articles or anything else are welcome.

Read the article
Extracting data from internet

- by Ankiov Spetsnaz

I would like to extract data from the internet like www.mozenda.com does, but I want to write my own program to do that. The specific data I'm looking for is various event data. Based on my research, I think a custom web crawler is my answer, but I would like to confirm this and see if there are any suggestions for building custom web crawlers, if a web crawler is indeed the answer. Personally, I would prefer Java, and I'm planning on using Glassfish technology, if that matters...

Read the article
Going For Gold: AngloGold Ashanti and Oracle Spatial 11g

- by stephen.garth

Last chance - Register Now for Free Webinar Date and Time: Thursday May 6 at 11:00am PDT (2:00pm EDT) Check out this 1-hour Directions Media webinar to learn how the world's 3rd largest gold miner has implemented a unique geospatial data infrastructure based on Oracle Spatial 11g to streamline their business processes for gold exploration. Terry Harbort, Exploration Systems Architect with AngloGold Ashanti, will provide insights into the company's use of Oracle Spatial 11g GeoRaster, 3D visualization techniques, Real Application Clusters, and more. The presentation is followed by a live Q&A session. Register Here

Read the article
Free NOSQL database for use with C# client [closed]

- by Mitten

I've never used NOSQL databases before, but so far it seems like the best data storage solution for my project. I am going to implement a data mining application. The data I would like to mine is thousands of documents which cannot be imported into data mining applications directly. To make the import easier and faster (than importing thousands of documents), I am planning to import these documents into a NOSQL database first and then import the NOSQL database into the data mining software. At the very least, once I have all the data in a NOSQL database I should be able to code the simplest data mining logic myself. Am I correct that NOSQL databases allow you to create records of data, but don't require all the records to adhere to the same data schema (the same column names/types, as in classic table-oriented SQL databases)? I think for each document I would create a row/entry/object (I'm not sure what the correct term is in the NOSQL world), which would have a string id, a few columns with unstructured text data, and dozens of columns mostly of datetime and integer types. As its name suggests, NOSQL does not support SQL query syntax, but it does support locating an object (row/entry?) by its unique id. Does NOSQL support querying objects using property=value syntax? Unfortunately, most free NOSQL databases only support Java/C++ clients. Which free NOSQL database would you recommend for a C# programmer?

Read the article
StreamInsight on the Brain - can you help?

- by sqlartist

I just came across this guy who is once again in the news as the world's first cyborg. I read all about this research some years back when he implanted a chip into his arm to allow him to open doors in his research lab. Now, without really advancing the research he is claiming that a virus could be implanted onto these implanted devices. Captain Cyborg sidekick implants virus-infected chip - http://www.theregister.co.uk/2010/05/26/captain_cyborg_cyberfud/ This is of interest to me as I actually...(read more)

Read the article
Create associations between pieces of information

- by Andrea Girardi

I deployed a project some days ago that extracts medical articles based on the results of a questionnaire completed by a user. For instance, if I reply on the questionnaire that I'm affected by type 2 diabetes and I'm a smoker, my algorithm extracts all articles related to diabetes, bubbling up all articles that contain information about type 2 diabetes and smoking. Basically, we created a list of topics and, for every topic, we defined a kind of "guideline" that allows us to extract and order information for a user. I'm quite sure there are better ways to relate two pieces of content, but I was not able to find them online. Could you suggest a model, algorithm or paper that would help me better understand this kind of problem and find a faster and more accurate way to extract information for a user?

Read the article
Cleaning a dataset of song data - what sort of problem is this?

- by Rob Lourens

I have a set of data about songs. Each entry is a line of text which includes the artist name, song title, and some extra text. Some entries are only "extra text". My goal is to resolve as many of these as possible to songs on Spotify using their web API. My strategy so far has been to search for the entry via the API - if there are no results, apply a transformation such as "remove all text between ( )" and search again. I have a list of heuristics and I've had reasonable success with this but as the code gets more and more convoluted I keep thinking there must be a more generic and consistent way. I don't know where to look - any suggestions for what to try, topics to study, buzzwords to google?

Read the article
Which prediction model for web page recommendation?

- by Nilesh

I am trying to implement web page recommendation, wherein registered users will be given a recommendation of which page to visit based on previous data. After an initial study, I decided to cluster the data with rough sets and then find the sequential patterns using the PrefixSpan algorithm. Now I want to have a better prediction model in place which can predict the access frequency of pages. I have come up with a Markov model, but more suggestions would be valuable. Please also point me to some references for the models. Is it possible to directly predict the next page access from the result of PrefixSpan? If so, how?
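For the Markov model mentioned in the question, a first-order chain can be estimated just by counting page-to-page transitions in past sessions. The sketch below assumes each session is already available as an ordered list of page identifiers; `train_transitions` and `predict_next` are hypothetical names introduced for illustration, and this is not connected to PrefixSpan or rough-set clustering.

```python
from collections import Counter, defaultdict


def train_transitions(sessions):
    """Count how often each page is followed directly by each other page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, page, top_n=1):
    """Return up to top_n (next_page, estimated probability) pairs for `page`."""
    followers = counts.get(page)
    if not followers:
        return []  # page never seen with a successor in the training data
    total = sum(followers.values())
    return [(nxt, n / total) for nxt, n in followers.most_common(top_n)]
```

A first-order chain only looks at the current page. Conditioning on the last two or three pages is a common refinement, at the cost of needing far more session data per state.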

Read the article
Denali CTP3 - Semantic Search 2 (Lots of documents)

- by sqlartist

Hi again, I thought I would improve on the previous post by actually putting a decent amount of content into the FileTable. This time I used the open-source DMOZ Health document repository, which contains 5,880 files inside 220 folders. The files are all HTML and are pretty small in size. The entire document collection is about 120 MB unzipped and 30 MB zipped. If anyone is interested in testing this collection, drop me a note and I will upload the dmoz_health repository archive to Skydrive. This time...(read more)

Read the article
Clustering Strings on the basis of Common Substrings

- by pk188

I have 10,000+ strings and have to identify and group all the strings that look similar (I base the similarity on the number of common words between any two given strings). The more common words, the more similar the strings. For instance:
1. How to make another layer from an existing layer
2. Unable to edit data on the network drive
3. Existing layers in the desktop
4. Assistance with network drive
In this case, strings 1 and 3 are similar, with the common words "Existing" and "Layer", and strings 2 and 4 are similar, with the common words "Network" and "Drive" (after eliminating stop words). The steps I'm following are:
1. Iterate through the data set
2. Do a row-by-row comparison
3. Find the common words between the strings
4. Form a cluster where the number of common words is greater than or equal to 2 (eliminating stop words)
5. If the number of common words is less than 2, put the string in a new cluster
6. Assign the rows either to the existing clusters or form a new one, depending on the common words
7. Continue until all the strings are processed
I am implementing the project in C# and have got as far as step 3. However, I'm not sure how to proceed with the clustering. I have researched string clustering a lot but could not find any solution that fits my problem. Your input would be highly appreciated.
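The steps in the question amount to a greedy, single-pass clustering. Below is a minimal sketch of that idea, written in Python rather than the asker's C# and resting on two assumptions that are not in the question: a string joins the first cluster in which any member shares at least two key words with it, and a crude "strip a trailing s" rule is used so that "layer" and "layers" count as the same word. The stop-word list and function names are illustrative.

```python
import re

# Hypothetical stop-word list, just large enough for the example strings.
STOP_WORDS = {"a", "an", "the", "to", "from", "in", "on", "with", "how", "of", "and"}


def key_words(text):
    """Lower-case words, with a crude plural strip, minus stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = {w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words}
    return stems - STOP_WORDS


def cluster_by_common_words(strings, min_common=2):
    """Greedy single pass: join the first matching cluster, else start a new one."""
    clusters = []  # each cluster is a list of (original string, key-word set)
    for s in strings:
        words = key_words(s)
        for cluster in clusters:
            if any(len(words & other) >= min_common for _, other in cluster):
                cluster.append((s, words))
                break
        else:
            clusters.append([(s, words)])
    return [[s for s, _ in cluster] for cluster in clusters]
```

The result of a greedy pass depends on the order of the input, and a string is never moved once placed. If that matters, building a graph with an edge between every sufficiently similar pair and taking its connected components gives an order-independent grouping.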

Read the article
How much information can you mine out of a name?

- by Finglas Fjorn

While not directly related to programming, I figured that the programmers on here would be just as curious as I was about this question. Feel free to close the question if it does not meet the guidelines. A name: first, possibly a middle, and surname. I'm curious about how much information you can mine out of a name using publicly available datasets. I know that you can get the following, with anywhere from low to high probability (depending on the input), using US census data: 1) Gender. 2) Race. Facebook, for instance, used exactly that to find out, with a decent level of accuracy, the racial distribution of users of their site (https://www.facebook.com/note.php?note_id=205925658858). What else can be mined? I'm not looking for anything specific; this is a very open-ended question to assuage my curiosity. My examples are US-specific, so we'll assume that the name is the name of someone located in the US; but if someone knows of publicly available datasets for other countries, I'm more than open to them too. I hope this is an interesting question!

Read the article
How can I perform sentiment analysis on extracted text from online sources?

- by aniket69

I'm working on extracting the sentiment from YouTube comments, blogs, news content, Facebook wall posts, and Twitter feeds. I'm looking for an automated way to do this: the two third-party solutions I've found have been AlchemyAPI and RapidMiner. Are these the best way to approach this project, or should I be using something else? Is there a more efficient way to approach sentiment analysis? What techniques have worked for you in a project like this?

Read the article
Algorithm for optimal combination of two variables

- by AlanChavez

I'm looking for an algorithm that can determine the optimal combination of two variables, but I'm not sure where to start looking. For example, if I have 10,000 rows in a database and each row contains a price and a square footage, is there any algorithm that can determine which combination of price and sq ft is optimal? I know this is vague, but I assume it is along the lines of fuzzy logic and fuzzy sets. I'm not sure, though, and I'd like to start digging in the right field to see if I can come up with something that solves my problem.

Read the article
Mahout - Clustering - "naming" the cluster elements

- by Mark Bramnik

I'm doing some research and I'm playing with Apache Mahout 0.6. My purpose is to build a system that will name different categories of documents based on user input. The documents are not known in advance, and I also don't know which categories I have while collecting these documents. But I do know that all the documents in the model should belong to one of the predefined categories.
For example, let's say I've collected N documents that belong to 3 different groups:
- Politics
- Madonna (pop star)
- Science fiction
I don't know which document belongs to which category, but I know that each one of my N documents belongs to one of those categories (e.g. there are no documents about, say, basketball among these N docs). So I came up with the following idea:
- Apply Mahout clustering (for example k-means with k=3) to these documents. This should divide the N documents into 3 groups and serve as my model to learn from. I still don't know which document really belongs to which group, but at least the documents are now clustered by group.
- Ask the user to find any document on the web that should be about 'Madonna' (I can't show the user any of my N documents; it's a restriction).
- Measure the 'similarity' between this document and each of the 3 groups. I expect the similarity between user_doc and the documents in the Madonna group to be higher than the similarity between user_doc and the documents about politics.
I've managed to produce the clusters of documents using the 'Mahout in Action' book. But I don't understand how I should use Mahout to measure the similarity between a 'ready' cluster of documents and one given document. I thought about rerunning the clustering with k=3 for N+1 documents with the same centroids (in terms of k-means clustering) to see where the new document falls, but maybe there is another way to do that? Is it possible to do this with Mahout, or is my idea conceptually wrong? (An example in terms of the Mahout API would be really good.)
Thanks a lot, and sorry for the long question (I couldn't describe it better). Any help is highly appreciated.
P.S. This is not a homework project :)
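The underlying idea of comparing one new document with each existing cluster can be illustrated without re-running k-means: vectorize the new document the same way as the clustered ones and pick the centroid with the highest cosine similarity. The sketch below is plain Python, not the Mahout API the asker requested; it assumes raw term counts as weights, and `term_vector`, `centroid`, `cosine` and `nearest_cluster` are names invented here.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: term -> raw count (a real run might use TF-IDF)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def centroid(vectors):
    """Component-wise mean of the vectors belonging to one cluster."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_cluster(doc_text, centroids):
    """Name of the centroid most similar to the new document."""
    doc = term_vector(doc_text)
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))
```

Whatever toolkit produced the clusters, the new document would need to go through the same tokenization and weighting as the original N documents for the comparison to be meaningful.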

Read the article
Recorded YouTube-like presentation and "live" demos of Oracle Advanced Analytics

- by chberger

Ever want to just sit and watch a YouTube-like presentation and "live" demos of Oracle Advanced Analytics? Then click here! This 1+ hour long session focuses primarily on the Oracle Data Mining component of the Oracle Advanced Analytics Option and is tied to the Oracle SQL Developer Days virtual and onsite events. I cover:
- Big Data + Big Data Analytics
- Competing on analytics & value proposition
- What is data mining?
- Typical use cases
- Oracle Data Mining high performance in-database SQL based data mining functions
- Exadata "smart scan" scoring
- Oracle Data Miner GUI (an Extension that ships with SQL Developer)
- Oracle Business Intelligence EE + Oracle Data Mining results/predictions in dashboards
- Applications "powered by Oracle Data Mining" for factory installed predictive analytics methodologies
- Oracle R Enterprise
Please contact [email protected] should you have any questions. Hope you enjoy! Charlie Berger, Sr. Director of Product Management, Oracle Data Mining & Advanced Analytics, Oracle Corporation

Read the article
Data mining - parsing a log file in Java

- by nuvio

Hello, I am working on a Java project for the university in which I have to analyse poker hands. I found some poker hands in a txt log file. They typically look like this:
PokerStars Zoom Hand #86981279921: Hold'em No Limit ($0.10/$0.25 USD) - 2012/09/30 23:49:51 ET
Table 'Whirlpool Zoom 40-100 bb' 9-max Seat #1 is the button
Seat 1: lgwong ($30.99 in chips)
Seat 2: hastyboots ($28.61 in chips)
Seat 3: seula i ($25.31 in chips)
Seat 4: fr_kevin01 ($31.81 in chips)
Seat 5: limey05 ($27.45 in chips)
Seat 6: sanlu ($24.65 in chips)
Seat 7: Masterfrank ($25.35 in chips)
Seat 8: Refu$e2Lose ($33.23 in chips)
Seat 9: 1pepepe0114 ($37.62 in chips)
hastyboots: posts small blind $0.10
seula i: posts big blind $0.25
*** HOLE CARDS ***
fr_kevin01: folds
limey05: folds
sanlu: folds
Masterfrank: folds
Refu$e2Lose: folds
1pepepe0114: folds
lgwong: folds
hastyboots: folds
Uncalled bet ($0.15) returned to seula i
seula i collected $0.20 from pot
seula i: doesn't show hand
*** SUMMARY ***
Total pot $0.20 | Rake $0
Seat 1: lgwong (button) folded before Flop (didn't bet)
Seat 2: hastyboots (small blind) folded before Flop
Seat 3: seula i (big blind) collected ($0.20)
Seat 4: fr_kevin01 folded before Flop (didn't bet)
Seat 5: limey05 folded before Flop (didn't bet)
Seat 6: sanlu folded before Flop (didn't bet)
Seat 7: Masterfrank folded before Flop (didn't bet)
Seat 8: Refu$e2Lose folded before Flop (didn't bet)
Seat 9: 1pepepe0114 folded before Flop (didn't bet)
My problem is that I am not sure how to proceed with parsing the log file: the only approach I know is "manually" scanning line by line for a particular character or symbol, but I am afraid it would need exhaustive error handling. So I was wondering if there are any other techniques or a better way to parse these poker hands. Many thanks for your help
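Since every kind of line in the hand history has a fixed shape, one regular expression per line type is a common alternative to character-by-character scanning. The sketch below is in Python (the same patterns should carry over to `java.util.regex`, which the asker's Java project could use) and assumes the file holds one entry per line; it covers only the line types visible in the excerpt, and `parse_hand` and the pattern names are illustrative.

```python
import re
from decimal import Decimal

# One pattern per line type; re.MULTILINE lets ^ and $ match at each line.
HAND_RE = re.compile(r"^PokerStars Zoom Hand #(\d+):", re.MULTILINE)
SEAT_RE = re.compile(r"^Seat (\d+): (.+) \(\$([\d.]+) in chips\)[ \t]*$", re.MULTILINE)
FOLD_RE = re.compile(r"^(.+): folds[ \t]*$", re.MULTILINE)
POT_RE = re.compile(r"^Total pot \$([\d.]+) \| Rake \$([\d.]+)", re.MULTILINE)


def parse_hand(text):
    """Extract hand id, seats, folds, pot and rake from one hand history."""
    hand = HAND_RE.search(text)
    pot = POT_RE.search(text)
    if hand is None or pot is None:
        raise ValueError("not a recognisable hand history")
    return {
        "hand_id": hand.group(1),
        # seat number -> (player name, starting chips); names may contain
        # spaces or '$', so the name is everything before " ($... in chips)"
        "seats": {
            int(number): (name, Decimal(chips))
            for number, name, chips in SEAT_RE.findall(text)
        },
        "folds": FOLD_RE.findall(text),
        "pot": Decimal(pot.group(1)),
        "rake": Decimal(pot.group(2)),
    }


# Shortened copy of the hand quoted in the question.
SAMPLE = """\
PokerStars Zoom Hand #86981279921: Hold'em No Limit ($0.10/$0.25 USD) - 2012/09/30 23:49:51 ET
Table 'Whirlpool Zoom 40-100 bb' 9-max Seat #1 is the button
Seat 1: lgwong ($30.99 in chips)
Seat 3: seula i ($25.31 in chips)
Seat 8: Refu$e2Lose ($33.23 in chips)
*** HOLE CARDS ***
Refu$e2Lose: folds
lgwong: folds
*** SUMMARY ***
Total pot $0.20 | Rake $0
Seat 1: lgwong (button) folded before Flop (didn't bet)
"""

if __name__ == "__main__":
    print(parse_hand(SAMPLE))
```

Lines that match none of the patterns are simply ignored here, which keeps error handling small. A stricter parser would report them instead, so that unexpected formats surface early.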

Read the article
Error occurs while using SPADE method in R

- by Yuwon Lee

I'm currently mining sequence patterns using the SPADE algorithm in R. SPADE is included in the "arulesSequences" package for R. I'm running R on CentOS 6.3 64-bit. As an exercise, I tried the example presented at http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Sequence_Mining/SPADE When I run "cspade(x, parameter = list(support = 0.4), control = list(verbose = TRUE))", R says:
parameter specification:
support : 0.4
maxsize : 10
maxlen : 10
algorithmic control:
bfstype : FALSE
verbose : TRUE
summary : FALSE
preprocessing ... 1 partition(s), 0 MB [0.096s]
mining transactions ... 0 MB [0.066s]
reading sequences ...Error in asMethod(object) : 's' is not an integer vector
When I run SPADE on Windows 7 32-bit, it runs well without any error. Does anybody know why this error occurs?

Read the article
Implementing Naïve Bayes algorithm in Java - Need some guidance

- by techventure

Hello Stack Overflow people. As a school assignment I'm required to implement the Naïve Bayes algorithm, which I intend to do in Java. In trying to understand how it's done, I've read the book "Data Mining - Practical Machine Learning Tools and Techniques", which has a section on this topic, but I am still unsure about some primary points that are blocking my progress. Since I'm seeking guidance here, not a solution, I'll tell you what I'm thinking and what I believe is the correct approach, and in return ask for correction/guidance, which will be very much appreciated. Please note that I am an absolute beginner in Naïve Bayes, data mining and programming in general, so you might see stupid comments/calculations below.
The training data set I'm given has 4 attributes/features that are numeric and normalized (in the range [0, 1]) using Weka (no missing values) and one nominal class (yes/no).
1) The data coming from a CSV file is numeric, hence:
* Given that the attributes are numeric, I use the PDF (probability density function) formula.
+ To calculate the PDF in Java, I first separate the attributes based on whether they're in class yes or class no and hold them in different arrays (array class yes and array class no)
+ Then calculate the mean (sum of the values in row / number of values in that row) and standard deviation for each of the 4 attributes (columns) of each class
+ Now to find the PDF of a given value (n) I do (n-mean)^2/(2*SD^2),
+ Then to find P(yes | E) and P(no | E) I multiply the PDF values of all 4 given attributes and compare which is larger, which indicates the class it belongs to
In terms of Java, I'm using an ArrayList of ArrayList and Double to store the attribute values. Lastly, I'm unsure how to get new data. Should I ask for an input file (like CSV), or use the command prompt and ask for 4 values? I'll stop here for now (I do have more questions), but I'm worried this won't get any responses given how long it's got. I will really appreciate those who give their time to read my problems and comment.
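One point worth checking in the steps above: the expression (n-mean)^2/(2*SD^2) is only the magnitude of the exponent of the normal density. The full density is exp(-(n-mean)^2 / (2*SD^2)) / (SD * sqrt(2*pi)), and the usual Naïve Bayes score also multiplies in the class prior P(yes) or P(no). The sketch below shows that calculation in Python rather than the asker's Java, summing logarithms instead of multiplying densities to avoid underflow. The function names are illustrative, and it assumes every attribute has a non-zero standard deviation within each class.

```python
import math
from statistics import mean, stdev


def gaussian_log_pdf(x, mu, sigma):
    """Log of the normal density: -(x-mu)^2/(2 sigma^2) - ln(sigma sqrt(2 pi))."""
    return -((x - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))


def gaussian_pdf(x, mu, sigma):
    """The density itself, shown for comparison with the formula in the text."""
    return math.exp(gaussian_log_pdf(x, mu, sigma))


def train(rows, labels):
    """Per class: prior probability plus (mean, sample std dev) per attribute."""
    model = {}
    for label in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        columns = list(zip(*members))  # one tuple of values per attribute
        model[label] = {
            "prior": len(members) / len(rows),
            "stats": [(mean(col), stdev(col)) for col in columns],
        }
    return model


def classify(model, row):
    """Pick the class with the largest log prior + sum of log densities."""
    def log_score(entry):
        return math.log(entry["prior"]) + sum(
            gaussian_log_pdf(x, mu, sigma)
            for x, (mu, sigma) in zip(row, entry["stats"])
        )
    return max(model, key=lambda label: log_score(model[label]))
```

`statistics.stdev` needs at least two rows per class, and a constant attribute would give a standard deviation of zero; a fuller version would guard against both, for example by adding a small minimum variance.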

Read the article
Retrieving coordinates in this page

- by hao

Hey guys, I'm trying to do some data mining and analyze data based on locations. For this site, http://www.dianping.com/shop/1898365, I am trying to figure out the latitude and longitude by crawling, but I can't seem to figure out where this information is stored. Can someone give me some pointers?

Read the article
How can I scrape specific data from a website

- by Stoney

I'm trying to scrape data from a website for research. The urls are nicely organized in an example.com/x format, with x as an ascending number and all of the pages are structured in the same way. I just need to grab certain headings and a few numbers which are always in the same locations. I'll then need to get this data into structured form for analysis in Excel. I have used wget before to download pages, but I can't figure out how to grab specific lines of text. Excel has a feature to grab data from the web (Data-From Web) but from what I can see it only allows me to download tables. Unfortunately, the data I need is not in tables.
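Because the pages share one structure, a small HTML parser that collects the text of chosen tags, followed by a CSV export that Excel can open, is one possible route. The sketch below uses only the Python standard library and assumes, purely for illustration, that the wanted values sit in `h1` and `h2` elements; the real tag names would have to be read from the page source. `HeadingCollector`, `extract_headings` and `to_csv` are names made up here.

```python
import csv
import io
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text content of the wanted tags (assumed: h1 and h2)."""

    def __init__(self, wanted=("h1", "h2")):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # text pieces of the heading being read, if any
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = []

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag in self.wanted and self.current is not None:
            self.headings.append("".join(self.current).strip())
            self.current = None


def extract_headings(html):
    """Return the heading texts found in one HTML document."""
    parser = HeadingCollector()
    parser.feed(html)
    return parser.headings


def to_csv(rows):
    """Serialize rows (lists of values) as CSV text that Excel can import."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

Each page could be fetched with `urllib.request.urlopen` in a loop over the ascending numbers, or the files already downloaded with wget could be read from disk and passed to `extract_headings` one by one.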

Read the article
