Search Results

Search found 1631 results on 66 pages for 'statistics'.

Page 14/66 | < Previous Page | 10 11 12 13 14 15 16 17 18 19 20 21  | Next Page >

  • Optimal two variable linear regression calculation

    - by Dave Jarvis
    Problem

    I am looking to apply the y = mx + b equation (where m is SLOPE, b is INTERCEPT) to a data set, which is retrieved as shown in the SQL code. The values from the (MySQL) query are:

        SLOPE = 0.0276653965651912
        INTERCEPT = -57.2338357550468

    SQL Code

        SELECT
          ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
          (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
          ((sum(t.YEAR) * sum(t.YEAR * t.AMOUNT)) - (sum(t.AMOUNT) * sum(power(t.YEAR, 2)))) /
          (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT
        FROM
          (SELECT D.AMOUNT, Y.YEAR
           FROM CITY C, STATION S, YEAR_REF Y, MONTH_REF M, DAILY D
           WHERE
             -- For a specific city ...
             C.ID = 8590 AND
             -- Find all the stations within a 15 unit radius ...
             SQRT( POW( C.LATITUDE - S.LATITUDE, 2 ) + POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < 15 AND
             -- Gather all known years for that station ...
             S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
             -- The data before 1900 is shaky; insufficient after 2009.
             Y.YEAR BETWEEN 1900 AND 2009 AND
             -- Filtered by all known months ...
             M.YEAR_REF_ID = Y.ID AND
             -- Whittled down by category ...
             M.CATEGORY_ID = '001' AND
             -- Into the valid daily climate data.
             M.ID = D.MONTH_REF_ID AND
             D.DAILY_FLAG_ID <> 'M'
           GROUP BY Y.YEAR
           ORDER BY Y.YEAR
          ) t

    Data

    The data is visualized here: [image]

    Question

    The following results (to calculate the start and end points of the line) appear incorrect. Why are the results off by ~10 degrees (e.g., are outliers skewing the data)?

        (1900 * 0.0276653965651912) + (-57.2338357550468) = -4.66958228
        (2009 * 0.0276653965651912) + (-57.2338357550468) = -1.65405406

    I would have expected the 1900 result to be around 10 (not -4.67) and the 2009 result to be around 11.50 (not -1.65).

    Related Sites: Least absolute deviations; Robust regression

    Thank you!
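
    As a cross-check outside the database, the same closed-form least-squares expressions can be sketched in Python. This is only a minimal sketch, not the poster's method: the function name `linear_fit` and the three sample points are hypothetical, and it assumes exactly one (year, amount) pair per year.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = m*x + b."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2  # zero only if all x values are identical
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y * sum_xx - sum_x * sum_xy) / denom
    return slope, intercept


# Hypothetical yearly values lying exactly on y = 0.5 * x - 940.
years = [1900, 1950, 2000]
amounts = [10.0, 35.0, 60.0]
slope, intercept = linear_fit(years, amounts)
```

    A least-squares line always passes through the point (mean of x, mean of y), so evaluating a slope/intercept pair at a year inside the data range and comparing it with the typical amount gives a quick sanity check.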

    Read the article

  • Is it possible to do A/B testing by page rather than by individual?

    - by mojones
    Let's say I have a simple ecommerce site that sells 100 different t-shirt designs. I want to do some A/B testing to optimise my sales. Let's say I want to test two different "buy" buttons. Normally, I would use A/B testing to randomly assign each visitor to see button A or button B (and try to ensure that the user experience is consistent by storing that assignment in session, cookies etc). Would it be possible to take a different approach and instead randomly assign each of my 100 designs to use button A or B, and measure the conversion rate as (number of sales of design n) / (pageviews of design n)? This approach would seem to have some advantages: I would not have to worry about keeping the user experience consistent, since a given page (e.g. www.example.com/viewdesign?id=6) would always return the same HTML. If I were to test different prices, it would be far less distressing to the user to see different prices for different designs than different prices for the same design on different computers. I also wonder whether it might be better for SEO; my suspicion is that Google would "prefer" that it always sees the same HTML when crawling a page. Obviously this approach would only be suitable for a limited number of sites; I was just wondering if anyone has tried it?

    Read the article

  • mysql/algorithm: Weighting an average to accentuate differences from the mean

    - by Sai Emrys
    This is for a new feature on http://cssfingerprint.com (see /about for general info). The feature looks up the sites you've visited in a database of site demographics, and tries to guess what your demographic stats are based on that. All my demographics are in 0..1 probability format, not ratios or absolute numbers or the like. Essentially, you have a large number of data points that each tend you towards their own demographics. However, just taking the average is poor, because it means that by adding in a lot of generic data, the number goes down. For example, suppose you've visited sites S0..S50. All except S0 are 48% female; S0 is 100% male. If I'm guessing your gender, I want to have a value close to 100%, not just the 49% that a straight average would give. Also, consider that most demographics (i.e. everything other than gender) do not have the average at 50%. For example, the average probability of having kids 0-17 is ~37%. The more a given site's demographics differ from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status. What's the best way to calculate this? For extra credit: what's the best way to calculate this that is also cheap & easy to do in MySQL?

    Read the article

  • Randomized experiments in R

    - by gd047
    Here is a simple randomized experiment. In the following code I calculate the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields. The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used on the plants where the second sample (y) comes from.

        x <- c(11.4,25.3,29.9,16.5,21.1)
        y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
        total <- c(x,y)
        first <- combn(total,length(x))
        second <- apply(first,2,function(x) total[!total %in% x])
        dif.treat <- apply(second,2,mean) - apply(first,2,mean)
        # the first element of dif.treat is the one that I'm interested in
        (p.value <- length(dif.treat[dif.treat >= dif.treat[1]]) / length(dif.treat))

    Do you know of any R function that performs tests like this one?
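
    The enumeration described above can also be sketched in Python for comparison. This is a minimal sketch under stated assumptions, not an existing library routine: the function name `permutation_p_value` is hypothetical, the test is one-sided (improved minus standard), and index sets are enumerated rather than values so that repeated yields would still be split correctly.

```python
from itertools import combinations


def permutation_p_value(x, y):
    """Exact one-sided permutation test for mean(y) - mean(x).

    Returns (p_value, number_of_splits_enumerated).
    """
    total = list(x) + list(y)
    n_x = len(x)
    n_y = len(total) - n_x
    observed = sum(y) / n_y - sum(x) / n_x
    grand_sum = sum(total)
    count_extreme = 0
    count_all = 0
    # Each combination picks which positions play the role of the first sample.
    for idx in combinations(range(len(total)), n_x):
        sum_first = sum(total[i] for i in idx)
        diff = (grand_sum - sum_first) / n_y - sum_first / n_x
        count_all += 1
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            count_extreme += 1
    return count_extreme / count_all, count_all


x = [11.4, 25.3, 29.9, 16.5, 21.1]
y = [23.7, 26.6, 28.5, 14.2, 17.9, 24.3]
p_value, n_splits = permutation_p_value(x, y)
```

    With 11 observations and a first sample of size 5 there are C(11, 5) = 462 splits, so full enumeration is cheap here; for larger samples a random subset of splits would be the usual compromise.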

    Read the article

  • R selecting duplicate rows

    - by Matt
    Okay, I'm fairly new to R and I've tried to search the documentation for what I need to do, but here is the problem. I have a data.frame called heeds.data in the following form (some columns omitted for simplicity):

        eval.num, eval.count, ... fitness, fitness.mean, green.h.0, green.v.0, offset.0, green.h.1, green.v.1, ... green.h.7, green.v.7, offset.7 ...

    And I have selected a row meeting the following criteria:

        best.fitness <- min(heeds.data$fitness.mean[heeds.data$eval.count == 10])
        best.row <- heeds.data[heeds.data$fitness.mean == best.fitness, ]

    Now, what I want are all of the other rows that have columns green.h.0 to offset.7 (a contiguous section of columns) equal to those of best.row. Basically I'm looking for rows that have some of the conditions the same as the "best" row. I thought I could just do this:

        heeds.best <- heeds.data$fitness[ heeds.data$green.h.0 == best.row$green.h.0 & ... ]

    But with 24 columns it seems like a stupid method. Looking for something a bit simpler with less manual typing. Thanks!

    Read the article

  • How to use boost normal distribution classes?

    - by David Alfonso
    Hi all, I'm trying to use boost::normal_distribution in order to generate a normal distribution with mean 0 and sigma 1. The following code uses the boost normal classes. Am I using them correctly?

        #include <iostream>
        #include <boost/random.hpp>
        #include <boost/random/normal_distribution.hpp>

        int main()
        {
            boost::mt19937 rng; // I don't seed it on purpose (it's not relevant)
            boost::normal_distribution<> nd(0.0, 1.0);
            boost::variate_generator<boost::mt19937&, boost::normal_distribution<> > var_nor(rng, nd);

            int i = 0;
            for (; i < 10; ++i)
            {
                double d = var_nor();
                std::cout << d << std::endl;
            }
        }

    The result on my machine is:

        0.213436 -0.49558 1.57538 -1.0592 1.83927 1.88577 0.604675 -0.365983 -0.578264 -0.634376

    As you can see, not all values are between -1 and 1. Thank you all in advance!
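
    For a normal distribution, sigma is a scale rather than a bound: roughly 68% of draws fall within one sigma of the mean and roughly 95% within two, so values outside [-1, 1] are expected. The following Python sketch illustrates this with the standard library's `random.gauss` instead of Boost; the helper name `fraction_within` and the seed are arbitrary choices made for the illustration.

```python
import random


def fraction_within(limit, n_samples, seed=42):
    """Share of N(0, 1) draws whose absolute value is at most `limit`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_samples) if abs(rng.gauss(0.0, 1.0)) <= limit)
    return hits / n_samples


share_1_sigma = fraction_within(1.0, 100_000)  # close to 0.68
share_2_sigma = fraction_within(2.0, 100_000)  # close to 0.95
```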

    Read the article

  • What is the best Java numerical method package?

    - by Bob Cross
    I am looking for a Java-based numerical method package that provides functionality including: Solving systems of equations using different numerical analysis algorithms. Matrix methods (e.g., inversion). Spline approximations. Probability distributions and statistical methods. In this case, "best" is defined as a package with a mature and usable API, solid performance and numerical accuracy. Edit: derick van brought up a good point in that cost is a factor. I am heavily biased in favor of free packages but others may have a different emphasis.

    Read the article

  • How to notice unusual news activity

    - by ??iu
    Suppose you were able to keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer". What are ways that you could tell whether the number of mentions per entity in a given time period was unusual relative to their normal frequency of appearance? I imagine that for a more popular person like Steve Jobs an increase of like 50% might be unusual (an increase from 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase from 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.

    Read the article

  • Screening (multi)collinearity in a regression model

    - by aL3xa
    I hope that this one is not going to be an "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in the regression model. How to cure them... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, but the interpretation of the effect of individual predictors. One way to spot collinearity is to put each predictor as a dependent variable and the other predictors as independent variables, determine R², and if it's larger than .9 (or .95), we can consider the predictor redundant. This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from the model and watching for b-coefficient changes (they should be noticeably different). Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the research, but right now I'm interested in various ways of screening redundant predictors when (multi)collinearity occurs in a regression model.

    Read the article

  • R Question. Numeric variable vs. Non-numeric and "names" function

    - by Michael
    > scores=cbind(UNCA.score, A.score, B.score, U.m.A, U.m.B)
    > names(scores)=c('UNCA.scores', 'A.scores', 'B.scores','UNCA.minus.A', 'UNCA.minus.B')
    > names(scores)
    [1] "UNCA.scores"  "A.scores"     "B.scores"     "UNCA.minus.A" "UNCA.minus.B"
    > summary(UNCA.scores)
     X6.69230769230769
     Min.   : 4.154
     1st Qu.: 7.333
     Median : 8.308
     Mean   : 8.451
     3rd Qu.: 9.538
     Max.   :12.000
    > is.numeric(UNCA.scores)
    [1] FALSE
    > is.numeric(scores[,1])
    [1] TRUE

    My question is, what is the difference between UNCA.scores and scores[,1]? UNCA.scores is the first column in the data.frame 'scores', but they are not the same thing, since one is numeric and the other isn't. If UNCA.scores is just a label here, how can I make it equivalent to scores[,1]? Thanks!

    Read the article

  • What to use to create bar, line and pie charts with javascript compatible with all major browsers?

    - by marcgg
    I used to work with flot but it doesn't support pie charts, so I'm forced to change. I just saw JS Charts, but their documentation is very obscure regarding cross-browser compatibility (I need it to be IE6+ compliant :). Also, this will be for commercial use, so I'd rather have something that I can use free of charge. jQuery Google chart looks really nice and is well integrated with Rails (the framework I'm using), but I'm not sure how good it is. So what do you guys use? What would you recommend, keeping in mind that:

        It will be for commercial use (I can deal with a license, but I'd rather avoid that).
        It needs to be JavaScript (no SVG, no Flash please).
        It needs to be compatible with IE6+, FF, Chrome, Opera and Safari.
        It needs to be pretty ^^
        If it uses jQuery it's even better.

    Read the article

  • how to develop a program to minimize errors in human transcription of hand written surveys

    - by Alex. S.
    I need to develop custom software to do surveys. Questions may be multiple choice, or free text in a very few cases. I was asked to design a subsystem to check if there is any error in the manual data entry for the multiple choice part. We're trying to speed up the user data entry process and to minimize human input differences between digital forms and the original questionnaires. The surveys are filled in with handwritten marks and text by human interviewers, so it's possible to find hard-to-read marks, or the user could accidentally select a different value in some question, and we would like to avoid that. The software must include some automatic control to detect possible typing differences. Each answer of the multiple choice questions has the same probability of being selected. This question has two parts:

    The GUI. The simplest thing I have in mind is to implement the most usable design for the question display: use large and readable fonts and space the choices generously. Is there something else? For faster input, I would like to use drop-down lists (favoring keyboard over mouse). Given that the questions are grouped in sections, I would like to show the answers selected for the questions of that section, but this could slow down the process. Any other ideas?

    The error checking subsystem. What else can I do to minimize or to check human typos in the multiple choice questions? Is this a solvable problem? Is there some statistical methodology to check that the values entered by the users are the same as those on the hand-filled forms? For example, let's suppose the survey has 5 questions, and each has 4 options. Let's say I have n survey forms filled in on paper by interviewers, and they're ready to be entered in the software; how can I minimize the accidental differences that the manual transcription of the n surveys can introduce, without having to double check everything in the 5 questions of the n surveys?

    My first suggestion is that at the end of the processing of all the hand-filled forms, the software could choose some forms randomly to double check the responses in a few instances, but on what criteria can I make this selection? Would this validation be enough to cover everything in a significant way? The actual survey is nation level and it has 56 pages with over 200 questions in total, so it will be a lot of handwritten pages by many people, and the intention is to reduce the likelihood of errors and to optimize speed in the data entry process. The surveys must be filled in on paper first, given the complications of taking laptops or handhelds with the interviewers.

    Read the article

  • Discrete problem of probability theory [closed]

    - by calejero
    A jury consists of 12 persons each of which has, before the trial started, a probability of 0.4 to vote in favor of the defendant's innocence. During the trial, the lawyer has a probability of 0.6 to change the mind of each juror who was biased against the accused. How likely is the defendant to be acquitted if he needs 10 votes in favor?
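
    One way to set this up can be sketched as follows, under assumptions the problem statement does not spell out: jurors act independently, a juror initially in favor stays in favor, and a juror initially against is converted with probability 0.6. Each juror then votes in favor with probability 0.4 + 0.6 * 0.6 = 0.76, and the number of favorable votes is Binomial(12, 0.76). The function name `acquittal_probability` is hypothetical.

```python
from math import comb


def acquittal_probability(jurors=12, needed=10, p_initial=0.4, p_convert=0.6):
    """P(at least `needed` of `jurors` vote in favor), assuming independence."""
    # In favor at the end: already in favor, or initially against and converted.
    p_favor = p_initial + (1 - p_initial) * p_convert
    return sum(
        comb(jurors, k) * p_favor ** k * (1 - p_favor) ** (jurors - k)
        for k in range(needed, jurors + 1)
    )


probability = acquittal_probability()
```

    Under these assumptions the result is a little over 0.42; a different reading of the problem (for example, the lawyer also losing some favorable jurors) would change `p_favor` but not the binomial structure.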

    Read the article

  • Significance in R

    - by Gemsie
    Ok, this is quite hard to explain, but I'm at a complete loss what to do. I'm a relative newcomer to R and although I can completely admire how powerful it is, I'm not too good at actually using it.... Basically, I have some very contrived data that I need to analyse (it wasn't me who chose this, I can assure you!). I have the right and left hand lengths of lots of people, as well as some numeric data that shows their sociability. Now I would like to know if people who have significantly different lengths of hand are more or less sociable than those who have the same (leading into the research that 'symmetrical' people are more sociable and intelligent, etc.). I have got as far as loading the data into R, then I have no idea where to go from there. How on Earth do I start to separate those who are close to symmetrical from those who aren't, to then start to do the analysis?

    Ok, using Sasha's great advice, I did the cor.test and got the following:

            Pearson's product-moment correlation

        data:  measurements$l.hand - measurements$r.hand and measurements$sociable
        t = 0.2148, df = 150, p-value = 0.8302
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
         -0.1420623  0.1762437
        sample estimates:
               cor
        0.01753501

    I have never used this test before, so am unsure how to interpret it... you wouldn't think I was on my fourth Scientific degree, would you?! :(

    Read the article

  • incremental way of counting quantiles for large set of data

    - by Gacek
    I need to compute the quantiles for a large set of data. Let's assume we can get the data only in portions (e.g. one row of a large matrix). To compute the Q3 quantile one needs to get all the portions of the data and store them somewhere, then sort them and compute the quantile:

        List<double> allData = new List<double>();
        // This is only an example; in fact the portions of data are not rows of some matrix.
        foreach (var row in matrix)
        {
            allData.AddRange(row);
        }
        allData.Sort();
        double p = 0.75 * allData.Count;
        int idQ3 = (int)Math.Ceiling(p) - 1;
        double Q3 = allData[idQ3];

    Now, I would like to find a way of computing this without storing the data in some separate variable. The best solution would be to compute some parameters of mid-results for the first row and then adjust them step by step for the next rows.

    Note:

        These datasets are really big (ca. 5000 elements in each row).
        The Q3 can be estimated; it doesn't have to be an exact value.
        I call the portions of data "rows", but they can have different lengths! Usually it doesn't vary much (+/- a few hundred samples), but it varies!

    This question is similar to this one: http://stackoverflow.com/questions/1058813/on-line-iterator-algorithms-for-estimating-statistical-median-mode-skewness But I need to compute quantiles. Also, there are a few articles on this topic, e.g.: http://web.cs.wpi.edu/~hofri/medsel.pdf http://portal.acm.org/citation.cfm?id=347195&dl But before I try to implement these, I wanted to ask whether there are any other, quicker ways of computing the 0.25/0.75 quantiles?
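
    Since an estimate is acceptable, one simple option (swapped in here for illustration, not taken from the linked articles) is reservoir sampling: keep a fixed-size uniform random sample of everything seen so far and read the quantile off that sample. The sketch below is in Python rather than C#; the class name `ReservoirQuantile`, the capacity, and the seed are hypothetical choices.

```python
import math
import random


class ReservoirQuantile:
    """Approximate quantiles from a fixed-size uniform sample of a stream."""

    def __init__(self, capacity=2000, seed=0):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add_row(self, row):
        """Feed one portion of data; rows may have different lengths."""
        for value in row:
            self.seen += 1
            if len(self.sample) < self.capacity:
                self.sample.append(value)
            else:
                # Keep the new value with probability capacity / seen.
                j = self.rng.randrange(self.seen)
                if j < self.capacity:
                    self.sample[j] = value

    def quantile(self, q):
        """Same index rule as the original snippet: ceil(q * n) - 1."""
        if not self.sample:
            raise ValueError("no data seen yet")
        ordered = sorted(self.sample)
        index = max(0, math.ceil(q * len(ordered)) - 1)
        return ordered[index]


# Hypothetical stream: 50 rows of uniform values in [0, 1).
data_rng = random.Random(1)
estimator = ReservoirQuantile()
for _ in range(50):
    estimator.add_row([data_rng.random() for _ in range(2000)])
q3_estimate = estimator.quantile(0.75)
```

    Memory use is bounded by the capacity regardless of how many rows arrive, and accuracy improves with a larger reservoir; only the small sample is ever sorted.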

    Read the article

  • Naive Bayesian classification (spam filtering) - Doubt in one calculation? Which one is right? Plz c

    - by Microkernel
    Hi guys, I am implementing a Naive Bayesian classifier for spam filtering. I have a doubt about one calculation. Please clarify what I should do. Here is my question. In this method, you have to calculate:

        P(S|W) - Probability that a message is spam given that word W occurs in it.
        P(W|S) - Probability that word W occurs in a spam message.
        P(W|H) - Probability that word W occurs in a ham message.

    So to calculate P(W|S), should I do

        (1) (Number of times W occurs in spam) / (Total number of times W occurs in all the messages)

    OR

        (2) (Number of times word W occurs in spam) / (Total number of words in the spam message)

    So, to calculate P(W|S), should I do (1) or (2)? (I thought it to be (2), but I am not sure, so please clarify.) I am referring to http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the info, by the way. I have to complete the implementation by this weekend :( Thanks and regards, MicroKernel :)

    @sth: Hmm... Shouldn't repeated occurrence of word W increase a message's spam score? In your approach it wouldn't, right? Let's take a scenario and discuss... Let's say we have 100 training messages, out of which 50 are spam and 50 are ham, and say the word count of each message = 100. And let's say word W occurs 5 times in each spam message and 1 time in each ham message. So the total number of times W occurs in all the spam messages = 5*50 = 250 times. And the total number of times W occurs in all ham messages = 1*50 = 50 times. Total occurrences of W in all of the training messages = (250+50) = 300 times. So, in this scenario, how do you calculate P(W|S) and P(W|H)? Naturally we should expect P(W|S) > P(W|H), right? Please share your thoughts...
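
    The scenario above can be worked through numerically. The following Python sketch assumes that option (2) is pooled over all spam messages (total occurrences of W divided by total spam words) and that spam and ham have equal priors of 0.5; all variable names are made up for the illustration.

```python
spam_messages = 50
ham_messages = 50
words_per_message = 100
w_per_spam_message = 5
w_per_ham_message = 1

w_in_spam = w_per_spam_message * spam_messages  # 250 occurrences
w_in_ham = w_per_ham_message * ham_messages     # 50 occurrences

# Option (1): share of all occurrences of W that fall in spam.
option_1 = w_in_spam / (w_in_spam + w_in_ham)

# Option (2), pooled: occurrences of W per word of spam (and of ham).
p_w_given_spam = w_in_spam / (spam_messages * words_per_message)
p_w_given_ham = w_in_ham / (ham_messages * words_per_message)

# Bayes' rule with equal priors P(S) = P(H) = 0.5.
p_spam_given_w = p_w_given_spam / (p_w_given_spam + p_w_given_ham)
```

    With these numbers option (2) gives P(W|S) = 0.05 and P(W|H) = 0.01, while option (1) gives 250/300, which coincides with P(S|W) from Bayes' rule here because the two classes contain the same number of words. In other words, in this scenario option (1) behaves like an estimate of P(S|W) rather than of P(W|S).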

    Read the article

  • How do you combine "Revision Control" with "WorkFlow" for R?

    - by Tal Galili
    Hello all, I remember coming across R users writing that they use "revision control" (i.e. "source control"), and I am curious to know: how do you combine "revision control" with your statistical analysis workflow? Two (very) interesting discussions talk about how to deal with the workflow, but neither of them refers to the revision control element: http://stackoverflow.com/questions/1266279/how-to-organize-large-r-programs http://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing

    A Long Update To The Question: Following some of the people's answers, and Dirk's question in the comment, I would like to direct my question a bit more. After reading the Wiki article about "revision control" (which I was previously not familiar with), it was clear to me that when using revision control, what one does is build a development structure for one's code. This structure either leads to a "final product" or to several branches. When building something like, let's say, a website, there is usually one end product you work towards (the website), with some prototypes along the way. But when doing a statistical analysis, the work (to my view) is different. Sometimes you know where you want to get to. But more often, you explore. Explore cleaning the dataset. Explore different methods for statistical analysis, and ask various questions of your data (and I am writing this knowing how Frank Harrell and other experienced statisticians feel about data dredging). That is why the workflow question with statistical programming is (in my view) a serious and deep question, raising many issues.

    The simpler ones are technical: Which revision control software do you use (and why)? Which IDE do you use (and why)?

    The more interesting questions are about work process: How do you structure your files? What do you keep as a separate file and what as a revision? Or, asking in a different way: what should be a "branch" and what should be a "sub project" in your code? For example: when starting to explore your data, should a plot be created and then erased because it didn't lead anywhere (but kept as a revision), or should there be a backup file of that path? How you solve this tension was my initial curiosity.

    The second question is "what might I be missing?". What rules (of thumb) should one follow so as to avoid common pitfalls when doing statistical programming with version control? In my intuition, I feel that statistical programming is inherently different from software development (I am writing this without being a real expert in statistical programming, and even less so in software development). That's why I am unsure which of the lessons I have read here about version control would be applicable. Thanks a lot, Tal

    Read the article

  • What's the best way to unit test code that generates random output?

    - by Flynn1179
    Specifically, I've got a method that picks n items from a list in such a way that a% of them meet one criterion, b% meet a second, and so on. A simplified example would be to pick 5 items where 50% have a given property with the value 'true' and 50% 'false'; 50% of the time the method would return 2 true/3 false, and the other 50%, 3 true/2 false. Statistically speaking, this means that over 100 runs I should get about 250 true/250 false, but because of the randomness, 240/260 is entirely possible. What's the best way to unit test this? I'm assuming that even though technically 300/200 is possible, it should probably fail the test if this happens. Is there a generally accepted tolerance for cases like this, and if so, how do you determine what that is?
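
    One way to reason about a tolerance for the simplified example: each run contributes 2 'true' items plus one more with probability 0.5, so over 100 runs the total is 200 plus a Binomial(100, 0.5) count, with mean 250 and standard deviation sqrt(100 * 0.25) = 5. A bound of a few standard deviations then follows directly. The Python sketch below is an illustration under that assumption, not an accepted standard; `tolerance_bounds` and `simulate_total_true` are hypothetical names, and the simulator merely stands in for the method under test.

```python
import math
import random


def tolerance_bounds(runs, items_per_run=5, sigmas=4.0):
    """Acceptable range for the total 'true' count over all runs.

    Specific to the example: an odd number of items per run, split 50/50,
    so each run is floor(items/2) plus a fair coin flip.
    """
    expected = runs * items_per_run / 2
    std_dev = math.sqrt(runs * 0.25)
    return expected - sigmas * std_dev, expected + sigmas * std_dev


def simulate_total_true(runs, rng):
    """Stand-in for the method under test: 2 or 3 'true' items per run."""
    return sum(rng.choice((2, 3)) for _ in range(runs))


low, high = tolerance_bounds(100)
total_true = simulate_total_true(100, random.Random(12345))
within_tolerance = low <= total_true <= high
```

    With four standard deviations the accepted range is 230 to 270, so 240/260 passes and 300/200 fails. Passing a seeded generator into the code under test, as done here, is a complementary approach that makes each test run repeatable.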

    Read the article

  • R: Forecast package: Automatic algorithm for composite model involving ETS and AR

    - by phanikishan
    Hey, I would like to write code for automatic selection of the best composite model using ETS as well as autoregressive models. What criteria should I base my selection on? Also, if I'm using the auto.arima function from the forecast package in R to deduce the number of AR terms and the corresponding coefficients, does my input series necessarily have to be stationary? Or would the value for d be selected automatically, thus returning a non-stationary model? Thanks, Phani

    Read the article

  • Determining the popularity of a video with ratings and views

    - by user295825
    I am about to embark on a new project: a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system. Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video? If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the top of the charts (just because they're so bad). If I go the route of a scoring system based on the number of "like" and "dislike" votes (e.g. 100 like votes and 50 dislike votes equals a score of 2), videos with few views could appear at the top of the charts. So, what I need to do is a combination of the two. Barring, of course, spammy views and votes. What are your thoughts on the subject?
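
    One commonly used option for two-option votes (named here as a suggestion, not as something the post proposes) is to rank by the lower bound of the Wilson score interval for the like fraction, which automatically discounts videos with very few votes. A minimal Python sketch follows; the function name `wilson_lower_bound` and the default z = 1.96 (roughly 95% confidence) are choices made for the illustration, and views are not taken into account.

```python
import math


def wilson_lower_bound(likes, dislikes, z=1.96):
    """Lower bound of the Wilson score interval for the like fraction."""
    n = likes + dislikes
    if n == 0:
        return 0.0  # no votes yet: rank at the bottom
    p_hat = likes / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


few_votes = wilson_lower_bound(1, 0)       # 100% likes, but only one vote
many_votes = wilson_lower_bound(100, 5)    # about 95% likes, many votes
mixed_votes = wilson_lower_bound(100, 50)  # the 100/50 example from the post
```

    A video with a single like scores well below one with 100 likes and 5 dislikes, which addresses the "few views at the top of the chart" concern; view counts could be combined separately, for example as a minimum threshold.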

    Read the article

  • Summarising grouped records in a dataframe in R (...again)

    - by monch1962
    Hello all, (I tried to ask this question earlier today, but later realised I over-simplified the question; the answers I received were correct, but I couldn't use them because of my over-simplification of the problem in the original question. Here's my 2nd attempt...) I have a data frame in R that looks like:

        "Timestamp", "Source", "Target", "Length", "Content"
        0.1        , P1      , P2      , 5       , "ABCDE"
        0.2        , P1      , P2      , 3       , "HIJ"
        0.4        , P1      , P2      , 4       , "PQRS"
        0.5        , P2      , P1      , 2       , "ZY"
        0.9        , P2      , P1      , 4       , "SRQP"
        1.1        , P1      , P2      , 1       , "B"
        1.6        , P1      , P2      , 3       , "DEF"
        2.0        , P2      , P1      , 3       , "IJK"
        ...

    and I want to convert this to:

        "StartTime", "EndTime", "Duration", "Source", "Target", "Length", "Content"
        0.1        , 0.4      , 0.3       , P1      , P2      , 12      , "ABCDEHIJPQRS"
        0.5        , 0.9      , 0.4       , P2      , P1      , 6       , "ZYSRQP"
        1.1        , 1.6      , 0.5       , P1      , P2      , 4       , "BDEF"
        ...

    Trying to put this into English, I want to group consecutive records with the same 'Source' and 'Target' together, then print out a single record per group showing the StartTime, EndTime & Duration (= EndTime - StartTime) for that group, along with the sum of the Lengths for that group, and a concatenation of the Content (which will all be strings) in that group. The Timestamp values will always increase throughout the data frame. I had a look at melt/recast and have a feeling that it could be used to solve the problem, but couldn't get my head around the documentation. I suspect it's possible to do this within R, but I really don't know where to start. In a pinch I could export the data frame and do it in e.g. Python, but I'd prefer to stay within R if possible. Thanks in advance for any assistance you can provide.
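
    For the Python fallback mentioned above, the grouping of consecutive records can be sketched with `itertools.groupby`, which starts a new group every time the key changes. This is only an illustration of the logic, not an R solution: the function name `summarise_runs` and the tuple layout (timestamp, source, target, length, content) are assumptions.

```python
from itertools import groupby


def summarise_runs(records):
    """Collapse consecutive records sharing (source, target) into one row each."""
    summary = []
    for (source, target), run in groupby(records, key=lambda r: (r[1], r[2])):
        run = list(run)
        start, end = run[0][0], run[-1][0]
        summary.append({
            "StartTime": start,
            "EndTime": end,
            "Duration": round(end - start, 6),  # rounding hides float noise
            "Source": source,
            "Target": target,
            "Length": sum(r[3] for r in run),
            "Content": "".join(r[4] for r in run),
        })
    return summary


records = [
    (0.1, "P1", "P2", 5, "ABCDE"),
    (0.2, "P1", "P2", 3, "HIJ"),
    (0.4, "P1", "P2", 4, "PQRS"),
    (0.5, "P2", "P1", 2, "ZY"),
    (0.9, "P2", "P1", 4, "SRQP"),
    (1.1, "P1", "P2", 1, "B"),
    (1.6, "P1", "P2", 3, "DEF"),
    (2.0, "P2", "P1", 3, "IJK"),
]
rows = summarise_runs(records)
```

    Because `groupby` only merges adjacent records, the two separate P1-to-P2 runs stay separate, which is the behaviour the desired output asks for. The last record listed forms its own one-row group.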

    Read the article
