Search Results

Search found 114 results on 5 pages for 'median'.

Page 5/5 | < Previous Page | 1 2 3 4 5

Optimizing near-duplicate value search

- by GApple

I'm trying to find near duplicate values in a set of fields in order to allow an administrator to clean them up. There are two criteria that I am matching on One string is wholly contained within the other, and is at least 1/4 of its length The strings have an edit distance less than 5% of the total length of the two strings The Pseudo-PHP code: foreach($values as $value){ foreach($values as $match){ if( ( $value['length'] < $match['length'] && $value['length'] * 4 > $match['length'] && stripos($match['value'], $value['value']) !== false ) || ( $match['length'] < $value['length'] && $match['length'] * 4 > $value['length'] && stripos($value['value'], $match['value']) !== false ) || ( abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length']) && 0 < ($match['changes'] = levenshtein($value['value'], $match['value'])) && $match['changes'] * 20 <= ($value['length'] + $match['length']) ) ){ $matches[] = &$match; } } } I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit. However, as an O(n^2) operation this just doesn't scale to the larger sets of values and it seems that a significant amount of the processing time is spent simply iterating through the arrays. Some properties of a few sets of values being operated on Total | Strings | # of matches per string | | Strings | With Matches | Average | Median | Max | Time (s) | --------+--------------+---------+--------+------+----------+ 844 | 413 | 1.8 | 1 | 58 | 140 | 593 | 156 | 1.2 | 1 | 5 | 62 | 272 | 168 | 3.2 | 2 | 26 | 10 | 157 | 47 | 1.5 | 1 | 4 | 3.2 | 106 | 48 | 1.8 | 1 | 8 | 1.3 | 62 | 47 | 2.9 | 2 | 16 | 0.4 | Are there any other things I can do to reduce the time to check criteria, and more importantly are there any ways for me to reduce the number of criteria checks required (for example, by pre-processing the input values), since there is such low selectivity?

Read the article
MongoDB and datasets that don't fit in RAM no matter how hard you shove

- by sysadmin1138

This is very system dependent, but chances are near certain we'll scale past some arbitrary cliff and get into Real Trouble. I'm curious what kind of rules-of-thumb exist for a good RAM to Disk-space ratio. We're planning our next round of systems, and need to make some choices regarding RAM, SSDs, and how much of each the new nodes will get. But now for some performance details! During normal workflow of a single project-run, MongoDB is hit with a very high percentage of writes (70-80%). Once the second stage of the processing pipeline hits, it's extremely high read as it needs to deduplicate records identified in the first half of processing. This is the workflow for which "keep your working set in RAM" is made for, and we're designing around that assumption. The entire dataset is continually hit with random queries from end-user derived sources; though the frequency is irregular, the size is usually pretty small (groups of 10 documents). Since this is user-facing, the replies need to be under the "bored-now" threshold of 3 seconds. This access pattern is much less likely to be in cache, so will be very likely to incur disk hits. A secondary processing workflow is high read of previous processing runs that may be days, weeks, or even months old, and is run infrequently but still needs to be zippy. Up to 100% of the documents in the previous processing run will be accessed. No amount of cache-warming can help with this, I suspect. Finished document sizes vary widely, but the median size is about 8K. The high-read portion of the normal project processing strongly suggests the use of Replicas to help distribute the Read traffic. I have read elsewhere that a 1:10 RAM-GB to HD-GB is a good rule-of-thumb for slow disks, As we are seriously considering using much faster SSDs, I'd like to know if there is a similar rule of thumb for fast disks. I know we're using Mongo in a way where cache-everything really isn't going to fly, which is why I'm looking at ways to engineer a system that can survive such usage. The entire dataset will likely be most of a TB within half a year and keep growing.

Read the article
And the Winners of Fusion Middleware Innovation Awards in Data Integration are…

- by Irem Radzik

Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin-top:0in; mso-para-margin-right:0in; mso-para-margin-bottom:10.0pt; mso-para-margin-left:0in; line-height:115%; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin;} At OpenWorld, we announced the winners of Fusion Middleware Innovation Awards 2012. Raymond James and Morrison Supermarkets were selected for the data integration category for their innovative use of Oracle’s data integration products and the great results they have achieved. In this blog I would like to briefly introduce you to these award winning projects. Raymond James is a diversified financial services company, which provides financial planning, wealth management, investment banking, and asset management. They are using Oracle GoldenGate and Oracle Data Integrator to feed their operational data store (ODS), which supports application services across the enterprise. A major requirement for their project was low data latency, as key decisions are made based on the data in the ODS. They were able to fulfill this requirement due to the Oracle Data Integrator’s integrated solution with Oracle GoldenGate. Oracle GoldenGate captures changed data from different systems including Oracle Database, HP NonStop and Microsoft SQL Server into a single data store on SQL Server 2008. Oracle Data Integrator provides data transformations for the ODS. Leveraging ODI’s integration with GoldenGate, Raymond James now sees a 9 second median latency (from source commit to ODS target commit). The ODS solution delivers high quality, accurate data for consuming applications such as Raymond James’ next generation client and portfolio management systems as well as real-time operational reporting. It enables timely information for making better decisions. There are more benefits Raymond James achieved with this implementation of Oracle’s data integration solution. The software developers and architects of this solution, Tim Garrod and Ryan Fonnett, have told us during their presentation at OpenWorld that they also reduced application complexity significantly while improving developer productivity through trusted operational services. They were able to utilize CDC to generate alerts for business users, and for applications (for example for cache hydration mechanisms). One cool innovation example among many in this project is that using ODI's flexible architecture, Tim and Ryan could build 24/7 self-healing processes. And these processes have hardly failed. Integration processes fixes the errors itself. Pretty amazing; and a great solution for environments that need such reliability and availability. (You can see Tim and Ryan’s photo with the Innovation Award above.) The other winner of this year in the data integration category, Morrison Supermarkets, is the UK’s 4th largest grocery retailer. The company has been migrating all their legacy applications on to a new-world application set based on Oracle and consolidating all BI on to a single Oracle platform. The company recently implemented Oracle Exadata as the data warehouse engine and uses Oracle Business Intelligence EE. Their goal with deploying GoldenGate and ODI was to provide BI data to the enterprise in a way that it also supports operational decision making requirements from a wide range of Oracle based ERP applications such as E-Business Suite, PeopleSoft, Oracle Retail Suite. They use GoldenGate’s log-based change data capture capabilities and Oracle Data Integrator to populate the Oracle Retail Data Model. The electronic point of sale (EPOS) integration solution they built processes over 80 million transactions/day at busy periods in near real time (15 mins). It provides valuable insight to Retail and Commercial teams for both intra-day and historical trend analysis. As I mentioned in yesterday’s blog, the right data integration platform can transform the business. Here is another example: The point-of-sale integration enabled the grocery chain to optimize its stock management, leading to another award: Morrisons won the Grocer 33 award in 2012 - beating all other major UK supermarkets in product availability. Congratulations, Morrisons,on another award! Celebrating the innovation and the success of our customers with Oracle’s data integration products was definitely a highlight of Oracle OpenWorld for me. I look forward to hearing more from Raymond James, Morrisons, and the other customers that presented their data integration projects at OpenWorld, on how they are creating more value for their organizations.

Read the article
College Courses through distance learning

- by Matt

I realize this isn't really a programming question, but didn't really know where to post this in the stackexchange and because I am a computer science major i thought id ask here. This is pretty unique to the programmer community since my degree is about 95% programming. I have 1 semester left, but i work full time. I would like to finish up in December, but to make things easier i like to take online classes whenever I can. So, my question is does anyone know of any colleges that offer distance learning courses for computer science? I have been searching around and found a few potential classes, but not sure yet. I would like to gather some classes and see what i can get approval for. Class I need: Only need one C SC 437 Geometric Algorithms C SC 445 Algorithms C SC 473 Automata Only need one C SC 452 Operating Systems C SC 453 Compilers/Systems Software While i only need of each of the above courses i still need to take two more electives. These also have to be upper 400 level classes. So i can take multiple in each category. Some other classes I can take are: CSC 447 - Green Computing CSC 425 - Computer Networking CSC 460 - Database Design CSC 466 - Computer Security I hoping to take one or two of these courses over the summer. If not, then online over the regular semester would be ok too. Any help in helping find these classes would be awesome. Maybe you went to a college that offered distance learning. Some of these classes may be considered to be graduate courses too. Descriptions are listed below if you need. Thanks! Descriptions Computer Security This is an introductory course covering the fundamentals of computer security. In particular, the course will cover basic concepts of computer security such as threat models and security policies, and will show how these concepts apply to specific areas such as communication security, software security, operating systems security, network security, web security, and hardware-based security. Computer Networking Theory and practice of computer networks, emphasizing the principles underlying the design of network software and the role of the communications system in distributed computing. Topics include routing, flow and congestion control, end-to-end protocols, and multicast. Database Design Functions of a database system. Data modeling and logical database design. Query languages and query optimization. Efficient data storage and access. Database access through standalone and web applications. Green Computing This course covers fundamental principles of energy management faced by designers of hardware, operating systems, and data centers. We will explore basic energy management option in individual components such as CPUs, network interfaces, hard drives, memory. We will further present the energy management policies at the operating system level that consider performance vs. energy saving tradeoffs. Finally we will consider large scale data centers where energy management is done at multiple layers from individual components in the system to shutting down entries subset of machines. We will also discuss energy generation and delivery and well as cooling issues in large data centers. Compilers/Systems Software Basic concepts of compilation and related systems software. Topics include lexical analysis, parsing, semantic analysis, code generation; assemblers, loaders, linkers; debuggers. Operating Systems Concepts of modern operating systems; concurrent processes; process synchronization and communication; resource allocation; kernels; deadlock; memory management; file systems. Algorithms Introduction to the design and analysis of algorithms: basic analysis techniques (asymptotics, sums, recurrences); basic design techniques (divide and conquer, dynamic programming, greedy, amortization); acquiring an algorithm repertoire (sorting, median finding, strong components, spanning trees, shortest paths, maximum flow, string matching); and handling intractability (approximation algorithms, branch and bound). Automata Introduction to models of computation (finite automata, pushdown automata, Turing machines), representations of languages (regular expressions, context-free grammars), and the basic hierarchy of languages (regular, context-free, decidable, and undecidable languages). Geometric Algorithms The study of algorithms for geometric objects, using a computational geometry approach, with an emphasis on applications for graphics, VLSI, GIS, robotics, and sensor networks. Topics may include the representation and overlaying of maps, finding nearest neighbors, solving linear programming problems, and searching geometric databases.

Read the article
What are the industry metrics for average spend on dev hardware and software? [on hold]

- by RationalGeek

I'm trying to budget for my dev shop and compare our budget items to industry expectations. I'm hoping to find some information on what percentage of a dev's salary is generally spent on tooling, both hardware and software. Where can I find such information? If instead there is a source that looks at raw dollars that is useful, too. I can extrapolate what I need from that. NOTE: Your anecdotal evidence from your own job will not be very helpful. I'm looking for industry average statistics from a credible source. EDIT: I'm reluctant to even keep this question going based on the passionate negative responses of commenters, but I do think this is valuable information (assuming anyone will care to answer) so let me make one attempt to clarify why I'm looking for this information, and then leave it at that. I'm not sure why understanding and validating my motives is a necessary step to providing the information, but apparently that is the case, so I will do my best. Firstly, let me respond to the idea that us "management types" shouldn't use these types of metrics to evaluate budgets. I agree in part. Ideally, you should spend whatever is necessary on developers in order to keep them fully happy and productive. And this is true of all employees. However, companies operate in a world of limited resources, and every dollar spent in one area means a dollar not spent in another. So it is not enough to simply say "I need to spend $10,000 per developer next year" without having some way to justify that position. One way to help justify it is to compare yourself against the industry. If it is the case that on average a software shops spends 5% (making up that number) of their total development budget (salaries being the large portion of the other 95%, for arguments sake), and I'm only spending 3%, it helps in the justification process. So, it is not my intent to use this information to limit what I spend on developers, but rather to arm myself with the necessary justification to spend what I need to spend on developers to give them the best tools I can. I have been a developer for many years and I understand the need for proper tooling. Next, let's examine the idea that even considering the relationship between a spend on developer salaries and developer tooling is ludicrous and should be banned from budgetary thinking. As Jimmy Hoffa put it in their comment, it's like saying "I'm going to spend no more than 10% of median employee salary on light bulbs and coffee from now on.". Well, yes, it is like saying that, and from a budgeting perspective, this is a useful way to look at things. If you know that, on average, an employee consumes X dollars of coffee a year, then you can project a coffee budget based on that. And you can compare it to an industry metric to understand where you fall: do you spend more on coffee than other companies or less? Why might this be? If you are a coffee supply manager, that seems like a useful thought process. The same seems to hold true for developers. Now, on to the idea that I need to compare "apples to apples" and only look at other shops that are in the same place geographically, the same business, the same application architecture, and the same development frameworks. I guess if I could find such a statistic that said "a shop that is exactly identical to yours spends X on developer tooling" it would be wonderful. But there is plenty of value in an average statistic. Here's an analogy: let's say you are working on a household budget and need to decide how much to spend on groceries. Is it enough to know that the average consumer spends 15% on groceries and therefore decide that you will budget exactly 15%? No. You have to tweak your budget based on your individual needs and situation. But the generalized statistic does help in this evaluation. You can know if your budget is grossly off from what others are doing, and this can help you figure out why this is. So, I will concede the point that it would be better to find statistics that align to my shop, though I think any statistics I could find would be useful for what I'm doing. In that light, let's say that my shop is mostly focused on ASP.NET web applications. That doesn't map perfectly to reality because large enterprises have very heterogenous IT environments. But if I was going to pick one technology that is our focus that would be it. But, if you were to point me at some statistics that are related to a Linux shop doing embedded Java applications, I would still find it useful as a point of comparison. SUMMARY: Let me try to rephrase my question. I'm trying to find industry metrics on how much dev shops spend on developer tooling, both hardware and software. I don't so much care whether it is expressed as a percentage of total budget or as X dollars per dev or as Y percentage of salary. Any metric would be useful. If there are metrics that are specific to ASP.NET dev shops in the Northeast US, all the better, but I would be happy to find anything.

Read the article
! Extra }, or forgotten \endgroup. latex

- by gzou

hey, I met these latex format problem, anyone can offer some help? the .tex file: \begin{table}{} \renewcommand{\arraystretch}{1.1} \caption{Cambridge Flow feature definition and description} \label{cambridge-feature}} \centering \begin{tabular}{|c|c|} \hline\bfseries Abbreviation &\bfseries Description\\ \hline serv-port & Server port\\ \hline clnt-port & Client port\\ \hline push-pkts-serv & count of all packets with\\ & push bit set in TCP header (server to client)\\ \hline init-win-bytes-clnt & the total number of bytes \\ & sent in initial window (client to server)\\ \hline init-win-bytes-serv & the total number of bytes sent\\ & in initial window (server to client)\\ \hline avg-seg-size-clnt & average segment size: \\ & data bytes devided by number of packets\\ \hline IP-bytes-med-clnt & median of total bytes in IP packet\\ \hline act-data-pkt-serv & count of packet with at least one byte \\ & of TCP data playload (server to client)\\ \hline data-bytes-var-clnt & variance of total \\ & bytes in packets (client to server)\\ \hline min-seg-size-serv & minimum segment size \\ & observed (server to client)\\ \hline RTT-samples-serv & total number of RTT samples\\ & found (server to client),\\ & {\bf see also \cite{Moore05discriminators}}\\ \hline push-pkts-clnt & count of all packets with push bit set \\ & in TCP header (server to client)\\ \hline \end{tabular} \end{table} and the error message: ! Extra }, or forgotten \endgroup. \@endfloatbox ...pagefalse \outer@nobreak \egroup \color@endbox l.892 \end{table} I've deleted a group-closing symbol because it seems to be spurious, as in $x}$'. But perhaps the } is legitimate and you forgot something else, as in\hbox{$x}'. In such cases the way to recover is to insert both the forgotten and the deleted material, e.g., by typing `I$}'. there is no $ in my table, also this { are matching with the }, and also after I comment the citation, the error remains. anyone can offer help? really appreciate all the comments! ! Extra }, or forgotten \endgroup.

Read the article
Can MySQL reasonably perform queries on billions of rows?

- by haxney

I am planning on storing scans from a mass spectrometer in a MySQL database and would like to know whether storing and analyzing this amount of data is remotely feasible. I know performance varies wildly depending on the environment, but I'm looking for the rough order of magnitude: will queries take 5 days or 5 milliseconds? Input format Each input file contains a single run of the spectrometer; each run is comprised of a set of scans, and each scan has an ordered array of datapoints. There is a bit of metadata, but the majority of the file is comprised of arrays 32- or 64-bit ints or floats. Host system |----------------+-------------------------------| | OS | Windows 2008 64-bit | | MySQL version | 5.5.24 (x86_64) | | CPU | 2x Xeon E5420 (8 cores total) | | RAM | 8GB | | SSD filesystem | 500 GiB | | HDD RAID | 12 TiB | |----------------+-------------------------------| There are some other services running on the server using negligible processor time. File statistics |------------------+--------------| | number of files | ~16,000 | | total size | 1.3 TiB | | min size | 0 bytes | | max size | 12 GiB | | mean | 800 MiB | | median | 500 MiB | | total datapoints | ~200 billion | |------------------+--------------| The total number of datapoints is a very rough estimate. Proposed schema I'm planning on doing things "right" (i.e. normalizing the data like crazy) and so would have a runs table, a spectra table with a foreign key to runs, and a datapoints table with a foreign key to spectra. The 200 Billion datapoint question I am going to be analyzing across multiple spectra and possibly even multiple runs, resulting in queries which could touch millions of rows. Assuming I index everything properly (which is a topic for another question) and am not trying to shuffle hundreds of MiB across the network, is it remotely plausible for MySQL to handle this? UPDATE: additional info The scan data will be coming from files in the XML-based mzML format. The meat of this format is in the <binaryDataArrayList> elements where the data is stored. Each scan produces = 2 <binaryDataArray> elements which, taken together, form a 2-dimensional (or more) array of the form [[123.456, 234.567, ...], ...]. These data are write-once, so update performance and transaction safety are not concerns. My naïve plan for a database schema is: runs table | column name | type | |-------------+-------------| | id | PRIMARY KEY | | start_time | TIMESTAMP | | name | VARCHAR | |-------------+-------------| spectra table | column name | type | |----------------+-------------| | id | PRIMARY KEY | | name | VARCHAR | | index | INT | | spectrum_type | INT | | representation | INT | | run_id | FOREIGN KEY | |----------------+-------------| datapoints table | column name | type | |-------------+-------------| | id | PRIMARY KEY | | spectrum_id | FOREIGN KEY | | mz | DOUBLE | | num_counts | DOUBLE | | index | INT | |-------------+-------------| Is this reasonable?

Read the article
How to use an adjacency matrix to determine which rows to 'pass' to a function in r?

- by dubhousing

New to R, and I have a long-ish question: I have a shapefile/map, and I'm aiming to calculate a certain index for every polygon in that map, based on attributes of that polygon and each polygon that neighbors it. I have an adjacency matrix -- which I think is the same as a "1st-order queen contiguity weights matrix", although I'm not sure -- that describes which polygons border which other polygons, e.g., POLYID A B C D E A 0 0 1 0 1 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 1 0 The above indicates, for instance, that polygons 'C' and 'E' adjoin polygon 'A'; polygon 'B' adjoins only polygon 'C', etc. The attribute table I have has one polygon per row: POLYID TOT L10K 10_15K 15_20K ... A 500 24 30 77 ... Where TOT, L10K, etc. are the variables I use to calculate an index. There are 525 polygons/rows in my data, so I'd like to use the adjacency matrix to determine which rows' attributes to incorporate into the calculation of the index of interest. For now, I can calculate the index when I subset the rows that correspond to one 'bundle' of neighboring polygons, and then use a loop (if it's of interest, I'm calculating the Centile Gap Index, a measure of local income segregation). E.g., subsetting the 'neighborhood' of the Detroit City Schools: Detroit <- UNSD00[c(142,150,164,221,226,236,295,327,157,177,178,364,233,373,418,424,449,451,487),] Then record the marginal column proportions and a running total: catprops <- vector() for(i in 4:19) { catprops[(i-3)]<-sum(Detroit[,i])/sum(Detroit[,3]) } catprops <- as.data.frame(catprops) catprops[,2]<-cumsum(catprops[,1]) Columns 4:19 are the necessary ones in the attribute table. Then I use the following code to calculate the index -- note that the loop has "i in 1:19" because the Detroit subset has 19 polygons. cgidistsum <- 0 for(i in 1:19) { pranks <- vector() for(j in 4:19) { if (Detroit[i,j]==0) pranks <- append(pranks,0) else if (j == 4) pranks <- append(pranks,seq(0,catprops[1,2],by=catprops[1,2]/Detroit[i,j])) else pranks <- append(pranks,seq(catprops[j-4,2],catprops[j-3,2],by=catprops[j-3,1]/Detroit[i,j])) } distpranks <- vector() distpranks<-abs(pranks-median(pranks)) cgidistsum <- cgidistsum + sum(distpranks) } cgi <- (.25-(cgidistsum/sum(Detroit[,3])))/.25 My apologies if I've provided more information than is necessary. I would really like to exploit the adjacency matrix in order to calculate the CGI for each 'bundle' of these rows. If you happen to know how I could started with this, that would be great. and my apologies for any novice mistakes, I'm new to R!

Read the article
Analytic functions – they’re not aggregates

- by Rob Farley

SQL 2012 brings us a bunch of new analytic functions, together with enhancements to the OVER clause. People who have known me over the years will remember that I’m a big fan of the OVER clause and the types of things that it brings us when applied to aggregate functions, as well as the ranking functions that it enables. The OVER clause was introduced in SQL Server 2005, and remained frustratingly unchanged until SQL Server 2012. This post is going to look at a particular aspect of the analytic functions though (not the enhancements to the OVER clause). When I give presentations about the analytic functions around Australia as part of the tour of SQL Saturdays (starting in Brisbane this Thursday), and in Chicago next month, I’ll make sure it’s sufficiently well described. But for this post – I’m going to skip that and assume you get it. The analytic functions introduced in SQL 2012 seem to come in pairs – FIRST_VALUE and LAST_VALUE, LAG and LEAD, CUME_DIST and PERCENT_RANK, PERCENTILE_CONT and PERCENTILE_DISC. Perhaps frustratingly, they take slightly different forms as well. The ones I want to look at now are FIRST_VALUE and LAST_VALUE, and PERCENTILE_CONT and PERCENTILE_DISC. The reason I’m pulling this ones out is that they always produce the same result within their partitions (if you’re applying them to the whole partition). Consider the following query: SELECT YEAR(OrderDate), FIRST_VALUE(TotalDue) OVER (PARTITION BY YEAR(OrderDate) ORDER BY OrderDate, SalesOrderID RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), LAST_VALUE(TotalDue) OVER (PARTITION BY YEAR(OrderDate) ORDER BY OrderDate, SalesOrderID RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY TotalDue) OVER (PARTITION BY YEAR(OrderDate)), PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY TotalDue) OVER (PARTITION BY YEAR(OrderDate)) FROM Sales.SalesOrderHeader ; This is designed to get the TotalDue for the first order of the year, the last order of the year, and also the 95% percentile, using both the continuous and discrete methods (‘discrete’ means it picks the closest one from the values available – ‘continuous’ means it will happily use something between, similar to what you would do for a traditional median of four values). I’m sure you can imagine the results – a different value for each field, but within each year, all the rows the same. Notice that I’m not grouping by the year. Nor am I filtering. This query gives us a result for every row in the SalesOrderHeader table – 31465 in this case (using the original AdventureWorks that dates back to the SQL 2005 days). The RANGE BETWEEN bit in FIRST_VALUE and LAST_VALUE is needed to make sure that we’re considering all the rows available. If we don’t specify that, it assumes we only mean “RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW”, which means that LAST_VALUE ends up being the row we’re looking at. At this point you might think about other environments such as Access or Reporting Services, and remember aggregate functions like FIRST. We really should be able to do something like: SELECT YEAR(OrderDate), FIRST_VALUE(TotalDue) OVER (PARTITION BY YEAR(OrderDate) ORDER BY OrderDate, SalesOrderID RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM Sales.SalesOrderHeader GROUP BY YEAR(OrderDate) ; But you can’t. You get that age-old error: Msg 8120, Level 16, State 1, Line 5 Column 'Sales.SalesOrderHeader.OrderDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. Msg 8120, Level 16, State 1, Line 5 Column 'Sales.SalesOrderHeader.SalesOrderID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. Hmm. You see, FIRST_VALUE isn’t an aggregate function. None of these analytic functions are. There are too many things involved for SQL to realise that the values produced might be identical within the group. Furthermore, you can’t even surround it in a MAX. Then you get a different error, telling you that you can’t use windowed functions in the context of an aggregate. And so we end up grouping by doing a DISTINCT. SELECT DISTINCT YEAR(OrderDate), FIRST_VALUE(TotalDue) OVER (PARTITION BY YEAR(OrderDate) ORDER BY OrderDate, SalesOrderID RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), LAST_VALUE(TotalDue) OVER (PARTITION BY YEAR(OrderDate) ORDER BY OrderDate, SalesOrderID RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY TotalDue) OVER (PARTITION BY YEAR(OrderDate)), PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY TotalDue) OVER (PARTITION BY YEAR(OrderDate)) FROM Sales.SalesOrderHeader ; I’m sorry. It’s just the way it goes. Hopefully it’ll change the future, but for now, it’s what you’ll have to do. If we look in the execution plan, we see that it’s incredibly ugly, and actually works out the results of these analytic functions for all 31465 rows, finally performing the distinct operation to convert it into the four rows we get in the results. You might be able to achieve a better plan using things like TOP, or the kind of calculation that I used in http://sqlblog.com/blogs/rob_farley/archive/2011/08/23/t-sql-thoughts-about-the-95th-percentile.aspx (which is how PERCENTILE_CONT works), but it’s definitely convenient to use these functions, and in time, I’m sure we’ll see good improvements in the way that they are implemented. Oh, and this post should be good for fellow SQL Server MVP Nigel Sammy’s T-SQL Tuesday this month.

Read the article
CodePlex Daily Summary for Tuesday, October 15, 2013

CodePlex Daily Summary for Tuesday, October 15, 2013Popular ReleasesFFXIV Crafting Simulator: Crafting Simulator 2.4.1: -Fixed the offset for the new patch (Auto Loading function)iBoxDB.EX - Fast Transactional NoSQL Database Resources: iBoxDB.net fast transactional nosql database 1.5.2: Easily process objects and documents, zero configuration. fast embeddable transactional nosql document database, includes CURD, QueryLanguage, Master-Master-Slave Replication, MVCC, etc. supports .net2, .net4, windows phone, mono, unity3d, node.js , copy and run. http://download-codeplex.sec.s-msft.com/Download?ProjectName=iboxdb&DownloadId=737783 Benchmark with MongoDB Compatibility more platforms for java versionneurogoody: slicebox: this is the slice box jsEvent-Based Components AppBuilder: AB3.AppDesigner.55: Iteration 55 (Feature): Moving of TargetEdge (simple wires only) by mouse.Sandcastle Help File Builder: SHFB v1.9.8.0 with Visual Studio Package: General InformationIMPORTANT: On some systems, the content of the ZIP file is blocked and the installer may fail to run. Before extracting it, right click on the ZIP file, select Properties, and click on the Unblock button if it is present in the lower right corner of the General tab in the properties dialog. This new release contains bug fixes and feature enhancements. There are some potential breaking changes in this release as some features of the Help File Builder have been moved into...SharpConfig: SharpConfig 1.2: Implemented comment parsing. Comments are now part of settings and setting categories. New properties: Setting: Comment PreComments SettingCategory: Comment PreCommentsC++ REST SDK (codename "Casablanca"): C++ REST SDK 1.3.0: This release fixes multiple customer reported issues as well as the following: Full support for Dev12 binaries and project files Full support for Windows XP New sample highlighting the Client and Server APIs : BlackJack Expose underlying native handle to set custom options on http_client Improvements to Listener Library Note: Dev10 binaries have been dropped as of this release, however the Dev10 project files are still available in the Source CodeAD ACL Scanner: 1.3.2: Minor bug fixed: Powershell 4.0 will report: Select—Object: Parameter cannot be processed because the parameter name p is ambiguous.Json.NET: Json.NET 5.0 Release 7: New feature - Added support for Immutable Collections New feature - Added WriteData and ReadData settings to DataExtensionAttribute New feature - Added reference and type name handling support to extension data New feature - Added default value and required support to constructor deserialization Change - Extension data is now written when serializing Fix - Added missing casts to JToken Fix - Fixed parsing large floating point numbers Fix - Fixed not parsing some ISO date ...RESX Manager: ResxManager 0.2.1: FIXED: Many critical bugs have been fixed. New Features Error logging for improved exception handling New toolbar Improvements of user interfaceFast YouTube Downloader: YouTube Downloader 2.2.0: YouTube Downloader 2.2.0VidCoder: 1.5.8 Beta: Added hardware acceleration options: Bicubic OpenCL scaling algorithm, QSV decoding/encoding and DXVA decoding. Updated HandBrake core to SVN 5834. Updated VidCoder setup icon. Fixed crash when choosing the mp4v2 container on x86 and opening on x64. Warning: the hardware acceleration features require specific hardware or file types to work correctly: QSV: Need an Intel processor that supports Quick Sync Video encoding, with a monitor hooked up to the Intel HD Graphics output and the lat...ASP.net MVC Awesome - jQuery Ajax Helpers: 3.5.2: version 3.5.2 - fix for setting single value to multivalue controls - datepicker min max date offset fix - html encoding for keys fix - enable Column.ClientFormatFunc to be a function call that will return a function version 3.5.1 - fixed html attributes rendering - fixed loading animation rendering - css improvements version 3.5 ========================== - autosize for all popups ( can be turned off by calling in js awe.autoSize = false ) - added Parent, Paremeter extensions ...Wsus Package Publisher: Release v1.3.1310.12: Allow the Update Creation Wizard to be set in full screen mode. Fix a bug which prevent WPP to Reset Remote Sus Client ID. Change the behavior of links in the Update Detail Viewer. Left-Click to open, Right-Click to copy to the Clipboard.TerrariViewer: TerrariViewer v7 [Terraria Inventory Editor]: This is a complete overhaul but has the same core style. I hope you enjoy it. This version is compatible with 1.2.0.3 Please send issues to my Twitter or https://github.com/TJChap2840WDTVHubGen - Adds Metadata, thumbnails and subtitles to WDTV Live Hubs: WDTVHubGen.v2.1.6.maint: I think this covers all of the issues. new additions: fixed the thumbnail problem for backgrounds. general clean up and error checking. need to get this put through the wringer and all feedback is welcome.BIDS Helper: BIDS Helper 1.6.4: This BIDS Helper release brings the following new features and fixes: New Features: A new Bus Matrix style report option when you run the Printer Friendly Dimension Usage report for an SSAS cube. The Biml engine is now fully in sync with the supported subset of Varigence Mist 3.4. This includes a large number of language enhancements, bugfixes, and project deployment support. Fixed Issues: Fixed Biml execution for project connections fixing a bug with Tabular Translations Editor not a...MoreTerra (Terraria World Viewer): MoreTerra 1.11.3: =========== =New Features= =========== New Markers added for Plantera's Bulb, Heart Fruits and Gold Cache. Markers now correctly display for the gems found in rock debris on the floor. =========== =Compatibility= =========== Fixed header changes found in Terraria 1.0.3.1Media Companion: Media Companion MC3.581b: Fix in place for TVDB xml issue. New* Movie - General Preferences, allow saving of ignored 'The' or 'A' to end of movie title, stored in sorttitle field. * Movie - New Way for Cropping Posters. Fixed* Movie - Rename of folders/filename. caught error message. * Movie - Fixed Bug in Save Cropped image, only saving in Pre-Frodo format if Both model selected. * Movie - Fixed Cropped image didn't take zoomed ratio into effect. * Movie - Separated Folder Renaming and File Renaming fuctions durin...SmartStore.NET - Free ASP.NET MVC Ecommerce Shopping Cart Solution: SmartStore.NET 1.2.0: HighlightsMulti-store support "Trusted Shops" plugins Highly improved SmartStore.biz Importer plugin Add custom HTML content to pages Performance optimization New FeaturesMulti-store-support: now multiple stores can be managed within a single application instance (e.g. for building different catalogs, brands, landing pages etc.) Added 3 new Trusted Shops plugins: Seal, Buyer Protection, Store Reviews Added Display as HTML Widget to CMS Topics (store owner now can add arbitrary HT...New ProjectsArtezio SharePoint 2013 Workflow Activities: SharePoint Workflow 2013 doesn’t provide activities to work with permissions, we've fixed it using HttpSend activity that makes REST API calls.Dependency.Injection: An attempt to write a really simple dependency injection framework. Does property-based and recursive dependency injection. Handles singletons. Yay!DHGMS SUO Killer: SUO Killer is a Visual Studio extension to deal with the removal of SUO file to mitigate SUO related issues in Visual Studio. This project is written in C#.dynamicsheet: dynamicsheetExcel Comparator: Excel Comparator is an add-in for Microsoft Excel that allows the user to compare a range between two sheets. FetchAIP: FetchAIP is a utility to download the various sections of the Aeronautical Information Publication (AIP) for New Zealand.Fluent Method and Type Builder: Still working on the summary.getboost: NuGet package for Boost framework.Goldstone Forum: WebForms Forum - TelerikAcademy Team ProjectGroupMe Software Development Kit: .NET Software Development Kit for http://groupme.com/ chat service.GSLMS: ----Import Excel Files Into SQL Server: Load Excel files into SQL Database without schema changes.Inaction: ?????????? jBegin: Learning ASP.net MVC from beginning, then here will be the source code for jbegin.comKDG's Statistical Quality Control Solver: This tool will include methods that can solve sample standard deviation, sample variance, median, mode, moving average, percentiles, margin of error, etc.kpi: Key Performance Indicator (KPI)????; visual studio 2010 with .NET 4.0 runtimeLECO Remote Control Client Application: Sample code and binaries are provided to demonstrate the remote control capability of a LECO Cornerstone instrument.LinkPad: My first Windows Store app intended for student to sketch up thoughts and concepts in quick diagrams.Modler.NET - Automating Graphical Data Model Co-Evolution: Modler.NET was the tool created for a Master's thesis project, which automates the co-evolution of graphical data models and the database that they represent.MyFileManager1: SummaryNever Lotto: Korean 465 Lotto Analyzer and Simulator. The real purpose of this project is to show that this kind of lotto things are just shit.NHibernate: The purpose of this project is to demo CRUD operations using NHibernate with Mono in Visual Studio 2012 using C# language. OAuth2 Authorizer: OAuth2 Authorizer helps you get the access code for a standard OAuth2 REST service that implements 3-legged authentication.Regular Expression for Excel: Regular Expression For Excel is an Excel Plugin. It provides a regular expressions EXCEL support. We can use it in the EXCEL function.Service Tester: Service Tester is an Azure Cloud based load testing application targeted at Soap Web Services which allows you to invoke your Web Service by random parameters.Simple TypeScript and C# Class Generator: Simple GUI application to generate compatible class source code for C# and TypeScript for communications between C# and TypeScript. Soccer team management: ---Spanner: No more stringly-typed web development! Build statically typed single page web applications in C#, automatically generating all HTML, JavaScript, and Knockout.

Read the article
Getting timing consistency in Linux

- by Jim Hunziker

I can't seem to get a simple program (with lots of memory access) to achieve consistent timing in Linux. I'm using a 2.6 kernel, and the program is being run on a dual-core processor with realtime priority. I'm trying to disable cache effects by declaring the memory arrays as volatile. Below are the results and the program. What are some possible sources of the outliers? Results: Number of trials: 100 Range: 0.021732s to 0.085596s Average Time: 0.058094s Standard Deviation: 0.006944s Extreme Outliers (2 SDs away from mean): 7 Average Time, excluding extreme outliers: 0.059273s Program: #include <stdio.h> #include <stdlib.h> #include <math.h> #include <sched.h> #include <sys/time.h> #define NUM_POINTS 5000000 #define REPS 100 unsigned long long getTimestamp() { unsigned long long usecCount; struct timeval timeVal; gettimeofday(&timeVal, 0); usecCount = timeVal.tv_sec * (unsigned long long) 1000000; usecCount += timeVal.tv_usec; return (usecCount); } double convertTimestampToSecs(unsigned long long timestamp) { return (timestamp / (double) 1000000); } int main(int argc, char* argv[]) { unsigned long long start, stop; double times[REPS]; double sum = 0; double scale, avg, newavg, median; double stddev = 0; double maxval = -1.0, minval = 1000000.0; int i, j, freq, count; int outliers = 0; struct sched_param sparam; sched_getparam(getpid(), &sparam); sparam.sched_priority = sched_get_priority_max(SCHED_FIFO); sched_setscheduler(getpid(), SCHED_FIFO, &sparam); volatile float* data; volatile float* results; data = calloc(NUM_POINTS, sizeof(float)); results = calloc(NUM_POINTS, sizeof(float)); for (i = 0; i < REPS; ++i) { start = getTimestamp(); for (j = 0; j < NUM_POINTS; ++j) { results[j] = data[j]; } stop = getTimestamp(); times[i] = convertTimestampToSecs(stop-start); } free(data); free(results); for (i = 0; i < REPS; i++) { sum += times[i]; if (times[i] > maxval) maxval = times[i]; if (times[i] < minval) minval = times[i]; } avg = sum/REPS; for (i = 0; i < REPS; i++) stddev += (times[i] - avg)*(times[i] - avg); stddev /= REPS; stddev = sqrt(stddev); for (i = 0; i < REPS; i++) { if (times[i] > avg + 2*stddev || times[i] < avg - 2*stddev) { sum -= times[i]; outliers++; } } newavg = sum/(REPS-outliers); printf("Number of trials: %d\n", REPS); printf("Range: %fs to %fs\n", minval, maxval); printf("Average Time: %fs\n", avg); printf("Standard Deviation: %fs\n", stddev); printf("Extreme Outliers (2 SDs away from mean): %d\n", outliers); printf("Average Time, excluding extreme outliers: %fs\n", newavg); return 0; }

Read the article
Join and sum not compatible matrices through data.table

- by leodido

My goal is to "sum" two not compatible matrices (matrices with different dimensions) using (and preserving) row and column names. I've figured this approach: convert the matrices to data.table objects, join them and then sum columns vectors. An example: > M1 1 3 4 5 7 8 1 0 0 1 0 0 0 3 0 0 0 0 0 0 4 1 0 0 0 0 0 5 0 0 0 0 0 0 7 0 0 0 0 1 0 8 0 0 0 0 0 0 > M2 1 3 4 5 8 1 0 0 1 0 0 3 0 0 0 0 0 4 1 0 0 0 0 5 0 0 0 0 0 8 0 0 0 0 0 > M1 %ms% M2 1 3 4 5 7 8 1 0 0 2 0 0 0 3 0 0 0 0 0 0 4 2 0 0 0 0 0 5 0 0 0 0 0 0 7 0 0 0 0 1 0 8 0 0 0 0 0 0 This is my code: M1 <- matrix(c(0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0), byrow = TRUE, ncol = 6) colnames(M1) <- c(1,3,4,5,7,8) M2 <- matrix(c(0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0), byrow = TRUE, ncol = 5) colnames(M2) <- c(1,3,4,5,8) # to data.table objects DT1 <- data.table(M1, keep.rownames = TRUE, key = "rn") DT2 <- data.table(M2, keep.rownames = TRUE, key = "rn") # join and sum of common columns if (nrow(DT1) > nrow(DT2)) { A <- DT2[DT1, roll = TRUE] A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn] } That outputs: rn X1 X3 X4 X5 X7 X8 1: 1 0 0 2 0 0 0 2: 3 0 0 0 0 0 0 3: 4 2 0 0 0 0 0 4: 5 0 0 0 0 0 0 5: 7 0 0 0 0 1 0 6: 8 0 0 0 0 0 0 Then I can convert back this data.table to a matrix and fix row and column names. The questions are: how to generalize this procedure? I need a way to automatically create list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1) because i want to apply this function to matrices which dimensions (and row/columns names) are not known in advance. In summary I need a merge procedure that behaves as described. there are other strategies/implementations that achieve the same goal that are, at the same time, faster and generalized? (hoping that some data.table monster help me) to what kind of join (inner, outer, etc. etc.) is assimilable this procedure? Thanks in advance. p.s.: I'm using data.table version 1.8.2 EDIT - SOLUTIONS @Aaron solution. No external libraries, only base R. It works also on list of matrices. add_matrices_1 <- function(...) { a <- list(...) cols <- sort(unique(unlist(lapply(a, colnames)))) rows <- sort(unique(unlist(lapply(a, rownames)))) out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows,cols)) for (m in a) out[rownames(m), colnames(m)] <- out[rownames(m), colnames(m)] + m out } @MadScone solution. Used reshape2 package. It works only on two matrices per call. add_matrices_2 <- function(m1, m2) { m <- acast(rbind(melt(M1), melt(M2)), Var1~Var2, fun.aggregate = sum) mn <- unique(colnames(m1), colnames(m2)) rownames(m) <- mn colnames(m) <- mn m } BENCHMARK (100 runs with microbenchmark package) Unit: microseconds expr min lq median uq max 1 add_matrices_1 196.009 257.5865 282.027 291.2735 549.397 2 add_matrices_2 13737.851 14697.9790 14864.778 16285.7650 25567.448 No need to comment the benchmark: @Aaron solution wins. I'll continue to investigate a similar solution for data.table objects. I'll add other solutions eventually reported or discovered.

Read the article
What statistics can be maintained for a set of numerical data without iterating?

- by Dan Tao

Update Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning): Obvious Count Sum Mean Max* Min* Median** Less Obvious Variance Standard Deviation Skewness Kurtosis Mode*** Weighted Average Weighted Moving Average**** OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now. *Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections. **Recalculated in O(1) for a sorted, indexed collection only. ***Requires a fairly complex data structure to recalculate in O(1). ****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure. Original Question Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could... Option 1: Iterate through the collection, adding all the values: double sum = 0.0; for (int i = 0; i < values.Count; i++) sum += values[i]; Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum: void Add(double value) { values.Add(value); sum += value; } void Remove(double value) { values.Remove(value); sum -= value; } EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation: Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said, "11" you thought "11", when I said "16", you thought, "27," and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," if you've been keeping track of the sum inside your head you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and them summing up 11 + 13 + 12. So my question is, what other calculations, other than the obvious ones like sum and average, are like this? SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again. So I think this gets to the heart of my question: if we can divide kinds of statistics into these categories, those that are maintainable (my own term, maybe there's a more official one somewhere) versus those that require iteration to compute any time a collection is changed, what are all the maintainable ones? What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data, they just don't need to reiterate through it over and over again whenever it changes.

Read the article
Problem measuring N times the execution time of a code block

- by Nazgulled

EDIT: I just found my problem after writing this long post explaining every little detail... If someone can give me a good answer on what I'm doing wrong and how can I get the execution time in seconds (using a float with 5 decimal places or so), I'll mark that as accepted. Hint: The problem was on how I interpreted the clock_getttime() man page. Hi, Let's say I have a function named myOperation that I need to measure the execution time of. To measure it, I'm using clock_gettime() as it was recommend here in one of the comments. My teacher recommends us to measure it N times so we can get an average, standard deviation and median for the final report. He also recommends us to execute myOperation M times instead of just one. If myOperation is a very fast operation, measuring it M times allow us to get a sense of the "real time" it takes; cause the clock being used might not have the required precision to measure such operation. So, execution myOperation only one time or M times really depends if the operation itself takes long enough for the clock precision we are using. I'm having trouble dealing with that M times execution. Increasing M decreases (a lot) the final average value. Which doesn't make sense to me. It's like this, on average you take 3 to 5 seconds to travel from point A to B. But then you go from A to B and back to A 5 times (which makes it 10 times, cause A to B is the same as B to A) and you measure that. Than you divide by 10, the average you get is supposed to be the same average you take traveling from point A to B, which is 3 to 5 seconds. This is what I want my code to do, but it's not working. If I keep increasing the number of times I go from A to B and back A, the average will be lower and lower each time, it makes no sense to me. Enough theory, here's my code: #include <stdio.h> #include <time.h> #define MEASUREMENTS 1 #define OPERATIONS 1 typedef struct timespec TimeClock; TimeClock diffTimeClock(TimeClock start, TimeClock end) { TimeClock aux; if((end.tv_nsec - start.tv_nsec) < 0) { aux.tv_sec = end.tv_sec - start.tv_sec - 1; aux.tv_nsec = 1E9 + end.tv_nsec - start.tv_nsec; } else { aux.tv_sec = end.tv_sec - start.tv_sec; aux.tv_nsec = end.tv_nsec - start.tv_nsec; } return aux; } int main(void) { TimeClock sTime, eTime, dTime; int i, j; for(i = 0; i < MEASUREMENTS; i++) { printf(" » MEASURE %02d\n", i+1); clock_gettime(CLOCK_REALTIME, &sTime); for(j = 0; j < OPERATIONS; j++) { myOperation(); } clock_gettime(CLOCK_REALTIME, &eTime); dTime = diffTimeClock(sTime, eTime); printf(" - NSEC (TOTAL): %ld\n", dTime.tv_nsec); printf(" - NSEC (OP): %ld\n\n", dTime.tv_nsec / OPERATIONS); } return 0; } Notes: The above diffTimeClock function is from this blog post. I replaced my real operation with myOperation() because it doesn't make any sense to post my real functions as I would have to post long blocks of code, you can easily code a myOperation() with whatever you like to compile the code if you wish. As you can see, OPERATIONS = 1 and the results are: » MEASURE 01 - NSEC (TOTAL): 27456580 - NSEC (OP): 27456580 For OPERATIONS = 100 the results are: » MEASURE 01 - NSEC (TOTAL): 218929736 - NSEC (OP): 2189297 For OPERATIONS = 1000 the results are: » MEASURE 01 - NSEC (TOTAL): 862834890 - NSEC (OP): 862834 For OPERATIONS = 10000 the results are: » MEASURE 01 - NSEC (TOTAL): 574133641 - NSEC (OP): 57413 Now, I'm not a math wiz, far from it actually, but this doesn't make any sense to me whatsoever. I've already talked about this with a friend that's on this project with me and he also can't understand the differences. I don't understand why the value is getting lower and lower when I increase OPERATIONS. The operation itself should take the same time (on average of course, not the exact same time), no matter how many times I execute it. You could tell me that that actually depends on the operation itself, the data being read and that some data could already be in the cache and bla bla, but I don't think that's the problem. In my case, myOperation is reading 5000 lines of text from an CSV file, separating the values by ; and inserting those values into a data structure. For each iteration, I'm destroying the data structure and initializing it again. Now that I think of it, I also that think that there's a problem measuring time with clock_gettime(), maybe I'm not using it right. I mean, look at the last example, where OPERATIONS = 10000. The total time it took was 574133641ns, which would be roughly 0,5s; that's impossible, it took a couple of minutes as I couldn't stand looking at the screen waiting and went to eat something.

Read the article

Search Results

Search found 114 results on 5 pages for 'median'.

Page 5/5 | < Previous Page | 1 2 3 4 5

- by GApple

- by sysadmin1138

- by Irem Radzik

- by Matt

- by RationalGeek

- by gzou

- by haxney

- by dubhousing

- by Rob Farley

- by Jim Hunziker

- by leodido

- by Dan Tao

- by Nazgulled

< Previous Page | 1 2 3 4 5