mining - Page 5 - Developer IT

Machine learning challenge: diagnosing program in java/groovy (datamining, machine learning)

- by Registered User

Hi All! I'm planning to develop program in Java which will provide diagnosis. The data set is divided into two parts one for training and the other for testing. My program should learn to classify from the training data (BTW which contain answer for 30 questions each in new column, each record in new line the last column will be diagnosis 0 or 1, in the testing part of data diagnosis column will be empty - data set contain about 1000 records) and then make predictions in testing part of data :/ I've never done anything similar so I'll appreciate any advice or information about solution to similar problem. I was thinking about Java Machine Learning Library or Java Data Mining Package but I'm not sure if it's right direction... ? and I'm still not sure how to tackle this challenge... Please advise. All the best!

Read the article

DataMining / Analyzing responses to Multiple Choice Questions in a survey

- by Shailesh Tainwala

Hi, I have a set of training data consisting of 20 multiple choice questions (A/B/C/D) answered by a hundred respondents. The answers are purely categorical and cannot be scaled to numerical values. 50 of these respondents were selected for free product trial. The selection process is not known. What interesting knowledge can be mined from this information? The following is a list of what I have come up with so far- A study of percentages (Example - Percentage of people who answered B on Qs.5 and got selected for free product trial) Conditional probabilities (Example - What is the probability that a person will get selected for free product trial given that he answered B on Qs.5) Naive Bayesian classifier (This can be used to predict whether a person will be selected or not for a given set of values for any subset of questions). Can you think of any other interesting analysis or data-mining activities that can be performed? The usual suspects like correlation can be eliminated as the response is not quantifiable/scoreable. Is my approach correct?

Read the article

DMX Analysis Services question

- by user282382

Hi, I am have two mining models, both are time series. One is [Company_Inputs] and the other is [Booking_Projections]. What I want to do is use EXTEND_MODEL_CASES to join the results of [Company_Inputs] as the extended cases. So basically something like: Select Flattened PredictTimeSeries([Bookings], 1, 6, EXTEND_MODEL_CASES) FROM [Booking_Projections] Natural Prediction Join (Select Flattened PredictTimeSeries([Metric1], 1, 6) From [Company_Inputs]) AS T This code of course doesn't work, but the idea is to use the predictions made from [Company_Inputs] as cases for predicting future values of [Booking_Projections] If anyone has an idea of how I can accomplish this I would appreciate it very much.

Read the article

Parallelizing a serial algorithm

- by user643813

Hej folks, I am working on porting a Text mining/Natural language application from single-core to a Map-Reduce style system. One of the steps involves a while loop similar to this: Queue<Element>; while (!queue.empty()) { Element e = queue.next(); Set<Element> result = calculateResultSet(e); if (!result.empty()) { queue.addAll(result); } } Each iteration depends on the result of the one before (kind of). There is no way of determining the number of iterations this loop will have to perform. Is there a way of parallelizing a serial algorithm such as this one? I am trying to think of a feedback mechanism, that is able to provide its own input, but how would one go about parallelizing it? Thanks for any help/remarks

Read the article

Testing When Correctness is Poorly Defined?

- by dsimcha

I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions. However, a lot of the code I write is data mining code that basically looks for significant patterns in large datasets. Correct behavior in this case is often not well defined and depends on a lot of different inputs in ways that are not easy for a human to predict (i.e. the math can't reasonably be done by hand, which is why I'm using a computer to solve the problem in the first place). These inputs can be very complex, to the point where coming up with a reasonable test case is near impossible. Identifying the edge cases that are worth testing is extremely difficult. Sometimes the algorithm isn't even deterministic. Usually, I do the best I can by using asserts for sanity checks and creating a small toy test case with a known pattern and informally seeing if the answer at least "looks reasonable", without it necessarily being objectively correct. Is there any better way to test these kinds of cases?

Read the article

How to apply Data Mining (Association Rule) to a huge database ?

- by stckvrflw

Hello What I want to do is to apply Association method of data mining on my SQL Server 2000 database. Association rule is something like "finding the most frequent items that appear together in database." For those who don't know or who want to remember what is association method is like, take a look at this presentation about Association rule in Data Mining. www.authorstream.com/Presentation/a.besimi-233030-data-mining-intro-seeu-education-ppt-powerpoint 17th slide gives a nice example of applying association rule on a database. So Can you help me about how should I write my SQL codes (If they will be sufficient of course) Thanks.

Read the article

How can I extract similarities/patterns from a collection of binary strings?

- by JohnIdol

I have a collection of binary strings of given size encoding effective solutions to a given problem. By looking at them, I can spot obvious similarities and intuitively see patterns of symmetry and periodicity. Are there mathematical/algorithmic tools I can "feed" this set of strings to and get results that might give me an idea of what this set of strings have in common? By doing so I would be able to impose a structure (or at least favor some features over others) on candidate solutions in order to greatly reduce the search space, maximizing chances to find optimal solutions for my problem (I am using genetic algorithms as the search tool - but this is not pivotal to the question). Any pointers/approaches appreciated.

Read the article

A good web data extraction/screen scraper program?

- by Taylor

I need to capture product data from a site on a regular basis and wondered if any one knows of a good software program? I've trialed Mozenda but its a monthly subscription and pricey in the long term. Obviously something thats free would be best but I don't mind paying either. Just need a decent program thats reliable and doesn't require much programming knowledge.

Read the article

Java HTML Parsing

- by Richie_W

Hello everyone. I'm working on an app which scrapes data from a website and I was wondering how I should go about getting the data. Specifically I need data contained in a number of div tags which use a specific CSS class - Currently (for testing purposes) I'm just checking for "div class = "classname"" in each line of HTML - This works, but I can't help but feel there is a better solution out there. Ie. - Is there any nice way where I could give a class a line of HTML and have some nice methods like: boolean usesClass(String CSSClassname); String getText(); String getLink(); Many many thanks!

Read the article

Clever way of building a tag cloud? - Python

- by RadiantHex

Hi folks, I've built a content aggregator and would like to add a tag cloud representing the current trends. Unfortunately this is quite complex, as I have to look for keywords that represent the context of each article. For example words such as I, was, the, amazing, nice have no relation to context. Help would be much appreciated! :)

Read the article

Netflix prize dataset?

- by user169410

Hi, I am looking to work on a machine learning project for my course and I would like to use the netflix prize dataset? But it looks like the contest is closed and the dataset is not available for download in the netflix website. Does anyone who wokred on it has the dataset? If so ,can u share it?

Read the article

How to determine the (natural) language of a document?

- by Robert Petermeier

I have a set of documents in two languages: English and German. There is no usable meta information about these documents, a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in. Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is Java Not free for "semi-commercial" usage This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad. For six web pages in German to which I pointed it only one guess was correct. The other guesses were Swedish, English, Danish and French... A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. If my app scans the documents for occurrences of stop words from either language the one with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.

Read the article

Finding Common Phrases in SQL Server TEXT Column

- by regex

Short Desc: I'm curious to see if I can use SQL Analysis services or some other SQL Server service to mine some data for me that will show commonalities between SQL TEXT fields in a dataset. Long Desc I am looking at a subset of data that consists of about 10,000 rows of TEXT blobs which are used as a notes column in a issue tracking (ticketing) software. I would like to use something out of the box (without having to build something) that might be able to parse through all of the rows and find commonly used byte sequences in the "Notes" column. In other words, I want to find commonly used phrases (two to three word phrases, so 9 - 20 character sections of the TEXT blob). This will help me better determine if associate's notes contain similar phrases (troubleshooting techniques) that we could standardize in our troubleshooting process flow. Closing Note I'd really rather not build an application to do this as my method will probably not be the most efficient way to do it. Hopefully all this makes sense. Please let me know in the comments if anything needs clarification.

Read the article

Minimum confidence and minimum support for Apriori

- by lmsasu

What are appropriate values for minimum confidence and minimum support values for the Apriori algorithm? How could you tweak them? Are they fixed values, or do they change during the running of the algorithm? If you have used this algorithm before, what values did you use?

Read the article

Naive Bayesian for Topic detection using "Bag of Words" approach

- by AlgoMan

I am trying to implement a naive bayseian approach to find the topic of a given document or stream of words. Is there are Naive Bayesian approach that i might be able to look up for this ? Also, i am trying to improve my dictionary as i go along. Initially, i have a bunch of words that map to a topics (hard-coded). Depending on the occurrence of the words other than the ones that are already mapped. And depending on the occurrences of these words i want to add them to the mappings, hence improving and learning about new words that map to topic. And also changing the probabilities of words. How should i go about doing this ? Is my approach the right one ? Which programming language would be best suited for the implementation ?

Read the article

how to find maximum frequent item sets from large transactional data file

- by ANIL MANE

Hi, I have the input file contains large amount of transactions like Transaction ID Items T1 Bread, milk, coffee, juice T2 Juice, milk, coffee T3 Bread, juice T4 Coffee, milk T5 Bread, Milk T6 Coffee, Bread T7 Coffee, Bread, Juice T8 Bread, Milk, Juice T9 Milk, Bread, Coffee, T10 Bread T11 Milk T12 Milk, Coffee, Bread, Juice i want the occurrence of every unique item like Item Name Count Bread 9 Milk 8 Coffee 7 Juice 6 and from that i want an a fp-tree now by traversing this tree i want the maximal frequent itemsets as follows The basic idea of method is to dispose nodes in each “layer” from bottom to up. The concept of “layer” is different to the common concept of layer in a tree. Nodes in a “layer” mean the nodes correspond to the same item and be in a linked list from the “Head Table”. For nodes in a “layer” NBN method will be used to dispose the nodes from left to right along the linked list. To use NBN method, two extra fields will be added to each node in the ordered FP-Tree. The field tag of node N stores the information of whether N is maximal frequent itemset, and the field count’ stores the support count information in the nodes at left. In Figure, the first node to be disposed is “juice: 2”. If the min_sup is equal to or less than 2 then “bread, milk, coffee, juice” is a maximal frequent itemset. Firstly output juice:2 and set the field tag of “coffee:3” as “false” (the field tag of each node is “true” initially ). Next check whether the right four itemsets juice:1 be the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 set the field tag of the node “false”. In the following process when the field tag of the disposed node is FALSE we can omit the node after the same tagging. If the min_sup is more than 2 then check whether the right four juice:1 is the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 then set the field count’ of the node with the sum of the former count’ and 2 After all the nodes “juice” disposed ,begin to dispose the node “coffee:3”. Any suggestions or available source code, welcome. thanks in advance

Read the article

Microsoft Business Intelligence. Is what I am trying to do possible?

- by Nai

Hi guys, I have been charged with the task of analysing the log table of my company's website. This table contains a user's click path throughout the website for a given session. My company is looking to understand/spot trends based on the 'click paths' of our users. In doing so, identify groups of users that take on a certain 'click path' based on age/geography and so on. As you can tell from the title, I am completely new to BI and its capabilities so I was wondering: Are our objectives attainable? How should I go about doing this? I am currently reading books online as well as other e-books I have found. All signs seem to suggest this is possible via sequence clustering. Although the exact implementation and tweaks involved are currently lost on me. Therefore, if anyone has first hand experience in such an undertaking, I would be awesome if you could share it here. Cheers!

Read the article

Confusion Matrix of Bayesian Network

- by iva123

Hi, I'm trying to understand bayesian network. I have a data file which has 10 attributes, I want to acquire the confusion table of this data table ,I thought I need to calculate tp,fp, fn, tn of all fields. Is it true ? if it's then what i need to do for bayesian network. Really need some guidance, I'm lost.

Read the article

Finding Common Phrases in MS SQL TEXT Column

- by regex

Hello All, Short Desc: I'm curious to see if I can use SQL Analysis services or some other MS SQL service to mine some data for me that will show commonalities between SQL TEXT fields in a dataset. Long Desc I am looking at a subset of data that consists of about 10,000 rows of TEXT blobs which are used as a notes column in a issue tracking (ticketing) software. I would like to use something out of the box (without having to build something) that might be able to parse through all of the rows and find commonly used byte sequences in the "Notes" column. In other words, I want to find commonly used phrases (two to three word phrases, so 9 - 20 character sections of the TEXT blob). This will help me better determine if associate's notes contain similar phrases (troubleshooting techniques) that we could standardize in our troubleshooting process flow. Closing Note I'd really rather not build an application to do this as my method will probably not be the most efficient way to do it. Hopefully all this makes sense. Please let me know in the comments if anything needs clarification. Thanks in advance for your help.

Read the article

How to identify ideas and concepts in a given text

- by Nick

I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic/idea is mentioned in a body of text. For instance, if the text contained: Maybe if you tell me a little more about who Mr Balzac is, that would help. It would also be useful if I could have a description of his appearance, or even better a photograph? It'd be great to be able to detect that the person has asked for a photograph of Mr Balzac. I could take a really naïve approach and just look for the word "photo" or "photograph", but this would obviously be no good if they wrote something like: Please, never send me a photo of Mr Balzac. Does anyone know where to start with this? Is it even possible? I've looked into things like nltk, but I've yet to find an example of someone doing something similar and am still not entirely sure what this kind of analysis is called. Any help that can get me off the ground would be great. Thanks!

Read the article

Getting Wikipedia infoboxes in a format that Ruby can understand

- by hadees

I am trying to get the data from Wikipedia's infoboxes into a hash or something so that I can use it in my Ruby on Rails program. Specifically I'm interested in the Infobox company and Infobox person. The example I have been using is "Ford Motor Company". I want to get the company info for that and the person info for the people linked to in Ford's company box. I've tried figuring out how to do this from the Wikipedia API or DBPedia but I haven't had much luck. I know wikipedia can return some things as json which I could parse with ruby but I haven't been able to figure out how to get the infobox. In the case of DBPedia I am kind of lost on how to even query it to get the info for Ford Motor Company.

Read the article

Finding Common Byte Sequences in MS SQL TEXT Column

- by regex

Hello All, Short Desc: I'm curious to see if I can use SQL Analysis services or some other MS SQL service to mine some data for me that will show commonalities between SQL TEXT fields in a dataset. Long Desc I am looking at a subset of data that consists of about 10,000 rows of TEXT blobs which are used as a notes column in a issue tracking (ticketing) software. I would like to use something out of the box (without having to build something) that might be able to parse through all of the rows and find commonly used byte sequences in the "Notes" column. In other words, I want to find commonly used phrases (two to three word phrases, so 9 - 20 character sections of the TEXT blob). This will help me better determine if associate's notes contain similar phrases (troubleshooting techniques) that we could standardize in our troubleshooting process flow. Closing Note I'd really rather not build an application to do this as my method will probably not be the most efficient way to do it. Hopefully all this makes sense. Please let me know in the comments if anything needs clarification. Thanks in advance for your help.

Read the article

OPTICS Clustering algorithm. How to get the best epsilon

- by Marco Galassi

I am implementing a project which needs to cluster geographical points. OPTICS algorithm seems to be a very nice solution. It needs just 2 parameters as input(MinPts and Epsilon), which are, respectively, the minimum number of points needed to consider them as a cluster, and the distance value used to compare if two points are in can be placed in same cluster. My problem is that, due to the extreme variety of the points, I can't set a fixed epsilon. Just look at the image below. The same points structure but in a different scale would result very different. Suppose to set MinPts=2 and epsilon = 1Km. On the left, the algorithm would create 2 clusters(red and blue), but on the right it would create one single cluster containing all of the points(red), but I would like to obtain 2 clusters even on the right. So my question is: is there any kind of way to calculate dynamically the epsilon value to get this result? Thank you very much and excuse my for my poor english. Marco

Read the article

Algorithm detect repeating/similiar strings in a corpus of data -- say email subjects, in Python

- by RizwanK

I'm downloading a long list of my email subject lines , with the intent of finding email lists that I was a member of years ago, and would want to purge them from my Gmail account (which is getting pretty slow.) I'm specifically thinking of newsletters that often come from the same address, and repeat the product/service/group's name in the subject. I'm aware that I could search/sort by the common occurrence of items from a particular email address (and I intend to), but I'd like to correlate that data with repeating subject lines.... Now, many subject lines would fail a string match, but "Google Friends : Our latest news" "Google Friends : What we're doing today" are more similar to each other than a random subject line, as is: "Virgin Airlines has a great sale today" "Take a flight with Virgin Airlines" So -- how can I start to automagically extract trends/examples of strings that may be more similar. Approaches I've considered and discarded ('because there must be some better way'): Extracting all the possible substrings and ordering them by how often they show up, and manually selecting relevant ones Stripping off the first word or two and then count the occurrence of each sub string Comparing Levenshtein distance between entries Some sort of string similarity index ... Most of these were rejected for massive inefficiency or likelyhood of a vast amount of manual intervention required. I guess I need some sort of fuzzy string matching..? In the end, I can think of kludgy ways of doing this, but I'm looking for something more generic so I've added to my set of tools rather than special casing for this data set. After this, I'd be matching the occurring of particular subject strings with 'From' addresses - I'm not sure if there's a good way of building a data structure that represents how likely/not two messages are part of the 'same email list' or by filtering all my email subjects/from addresses into pools of likely 'related' emails and not -- but that's a problem to solve after this one. Any guidance would be appreciated.

Read the article

Architecture for database analytics

- by David Cournapeau

Hi, We have an architecture where we provide each customer Business Intelligence-like services for their website (internet merchant). Now, I need to analyze those data internally (for algorithmic improvement, performance tracking, etc...) and those are potentially quite heavy: we have up to millions of rows / customer / day, and I may want to know how many queries we had in the last month, weekly compared, etc... that is the order of billions entries if not more. The way it is currently done is quite standard: daily scripts which scan the databases, and generate big CSV files. I don't like this solutions for several reasons: as typical with those kinds of scripts, they fall into the write-once and never-touched-again category tracking things in "real-time" is necessary (we have separate toolset to query the last few hours ATM). this is slow and non-"agile" Although I have some experience in dealing with huge datasets for scientific usage, I am a complete beginner as far as traditional RDBM go. It seems that using column-oriented database for analytics could be a solution (the analytics don't need most of the data we have in the app database), but I would like to know what other options are available for this kind of issues.

Search Results

Search found 324 results on 13 pages for 'mining'.

Page 5/13 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by Registered User

- by Shailesh Tainwala

- by user282382

- by user643813

- by dsimcha

- by stckvrflw

- by JohnIdol

- by Taylor

- by Richie_W

- by RadiantHex

- by user169410

- by Robert Petermeier

- by regex

- by lmsasu

- by AlgoMan

- by ANIL MANE

- by Nai

- by iva123

- by regex

- by Nick

- by hadees

- by regex

- by Marco Galassi

- by RizwanK

- by David Cournapeau

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >