Search Results

Search found 118 results on 5 pages for 'aggregates'.

Page 1/5 | 1 2 3 4 5  | Next Page >

  • Fun with Aggregates

    - by Paul White
    There are interesting things to be learned from even the simplest queries.  For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten. One possible query to meet that specification is: SELECT p.Name FROM Production.Product AS p JOIN Production.TransactionHistory AS th ON p.ProductID = th.ProductID GROUP BY p.ProductID, p.Name HAVING COUNT_BIG(*) < 10; That query correctly returns 23 rows (execution plan and data sample shown below): The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join.  The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten. This ‘fully-optimized’ plan has an estimated cost of around 0.33 units.  The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too?  To answer that, let’s look at some other ways to formulate this query.  This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones.  The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above.  It joins the result of pre-aggregating the history table to the Product table before filtering: SELECT p.Name FROM ( SELECT th.ProductID, cnt = COUNT_BIG(*) FROM Production.TransactionHistory AS th GROUP BY th.ProductID ) AS q1 JOIN Production.Product AS p ON p.ProductID = q1.ProductID WHERE q1.cnt < 10; Perhaps a little surprisingly, we get a slightly different execution plan: The results are the same (23 rows) but this time the Filter is pushed below the join!  The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join.  In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above. The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive. Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) < 10 ); Unfortunately, this query produces an incorrect result (86 rows): The problem is that it lists products with no history rows, though the reasons are interesting.  The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set.  In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL).  To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist: -- Scalar aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709;   -- Vector aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709 GROUP BY th.ProductID; The estimated execution plans for these two statements are almost identical: You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case.  The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away. In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate.  This is something I would like to see exposed in show plan so I suggested it on Connect.  Anyway, the results of running the two queries show the difference at runtime: The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all.  Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) BETWEEN 1 AND 9 ); The query now returns the correct 23 rows: Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans.  Let’s try adding a redundant GROUP BY instead of changing the HAVING clause: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY th.ProductID HAVING COUNT_BIG(*) < 10 ); Not only do we now get correct results (23 rows), this is the execution plan: I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :)  The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier.  What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ); I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too).  Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated). The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms): WHERE Clause SELECT p.Name FROM Production.Product AS p WHERE ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) < 10; APPLY SELECT p.Name FROM Production.Product AS p CROSS APPLY ( SELECT NULL FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ) AS ca (dummy); FROM Clause SELECT q1.Name FROM ( SELECT p.Name, cnt = ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) FROM Production.Product AS p ) AS q1 WHERE q1.cnt < 10; This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :) SELECT q.Name FROM ( SELECT p.Name, cnt = ( SELECT SUM(1) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID ) FROM Production.Product AS p ) AS q WHERE q.cnt < 10; The semantics of SQL aggregates are rather odd in places.  It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates.  As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both. © 2012 Paul White Twitter: @SQL_Kiwi email: [email protected]

    Read the article

  • Fun with Aggregates

    - by Paul White
    There are interesting things to be learned from even the simplest queries.  For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten. One possible query to meet that specification is: SELECT p.Name FROM Production.Product AS p JOIN Production.TransactionHistory AS th ON p.ProductID = th.ProductID GROUP BY p.ProductID, p.Name HAVING COUNT_BIG(*) < 10; That query correctly returns 23 rows (execution plan and data sample shown below): The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join.  The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten. This ‘fully-optimized’ plan has an estimated cost of around 0.33 units.  The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too?  To answer that, let’s look at some other ways to formulate this query.  This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones.  The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above.  It joins the result of pre-aggregating the history table to the Product table before filtering: SELECT p.Name FROM ( SELECT th.ProductID, cnt = COUNT_BIG(*) FROM Production.TransactionHistory AS th GROUP BY th.ProductID ) AS q1 JOIN Production.Product AS p ON p.ProductID = q1.ProductID WHERE q1.cnt < 10; Perhaps a little surprisingly, we get a slightly different execution plan: The results are the same (23 rows) but this time the Filter is pushed below the join!  The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join.  In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above. The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive. Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) < 10 ); Unfortunately, this query produces an incorrect result (86 rows): The problem is that it lists products with no history rows, though the reasons are interesting.  The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set.  In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL).  To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist: -- Scalar aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709;   -- Vector aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709 GROUP BY th.ProductID; The estimated execution plans for these two statements are almost identical: You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case.  The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away. In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate.  This is something I would like to see exposed in show plan so I suggested it on Connect.  Anyway, the results of running the two queries show the difference at runtime: The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all.  Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) BETWEEN 1 AND 9 ); The query now returns the correct 23 rows: Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans.  Let’s try adding a redundant GROUP BY instead of changing the HAVING clause: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY th.ProductID HAVING COUNT_BIG(*) < 10 ); Not only do we now get correct results (23 rows), this is the execution plan: I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :)  The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier.  What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ); I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too).  Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated). The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms): WHERE Clause SELECT p.Name FROM Production.Product AS p WHERE ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) < 10; APPLY SELECT p.Name FROM Production.Product AS p CROSS APPLY ( SELECT NULL FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ) AS ca (dummy); FROM Clause SELECT q1.Name FROM ( SELECT p.Name, cnt = ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) FROM Production.Product AS p ) AS q1 WHERE q1.cnt < 10; This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :) SELECT q.Name FROM ( SELECT p.Name, cnt = ( SELECT SUM(1) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID ) FROM Production.Product AS p ) AS q WHERE q.cnt < 10; The semantics of SQL aggregates are rather odd in places.  It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates.  As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both. © 2012 Paul White Twitter: @SQL_Kiwi email: [email protected]

    Read the article

  • SQL Server 2008 R2: StreamInsight - User-defined aggregates

    - by Greg Low
    I'd briefly played around with user-defined aggregates in StreamInsight with CTP3 but when I started working with the new Count Windows, I found I had to have one working. I learned a few things along the way that I hope will help someone. The first thing you have to do is define a class: public class IntegerAverage : CepAggregate < int , int > { public override int GenerateOutput( IEnumerable < int > eventData) { if (eventData.Count() == 0) { return 0; } else { return eventData.Sum()...(read more)

    Read the article

  • Analytic functions – they’re not aggregates

    - by Rob Farley
    SQL 2012 brings us a bunch of new analytic functions, together with enhancements to the OVER clause. People who have known me over the years will remember that I’m a big fan of the OVER clause and the types of things that it brings us when applied to aggregate functions, as well as the ranking functions that it enables. The OVER clause was introduced in SQL Server 2005, and remained frustratingly unchanged until SQL Server 2012. This post is going to look at a particular aspect of the analytic functions though (not the enhancements to the OVER clause). When I give presentations about the analytic functions around Australia as part of the tour of SQL Saturdays (starting in Brisbane this Thursday), and in Chicago next month, I’ll make sure it’s sufficiently well described. But for this post – I’m going to skip that and assume you get it. The analytic functions introduced in SQL 2012 seem to come in pairs – FIRST_VALUE and LAST_VALUE, LAG and LEAD, CUME_DIST and PERCENT_RANK, PERCENTILE_CONT and PERCENTILE_DISC. Perhaps frustratingly, they take slightly different forms as well. The ones I want to look at now are FIRST_VALUE and LAST_VALUE, and PERCENTILE_CONT and PERCENTILE_DISC. The reason I’m pulling this ones out is that they always produce the same result within their partitions (if you’re applying them to the whole partition). Consider the following query: SELECT     YEAR(OrderDate),     FIRST_VALUE(TotalDue)         OVER (PARTITION BY YEAR(OrderDate)               ORDER BY OrderDate, SalesOrderID               RANGE BETWEEN UNBOUNDED PRECEDING                         AND UNBOUNDED FOLLOWING),     LAST_VALUE(TotalDue)         OVER (PARTITION BY YEAR(OrderDate)               ORDER BY OrderDate, SalesOrderID               RANGE BETWEEN UNBOUNDED PRECEDING                         AND UNBOUNDED FOLLOWING),     PERCENTILE_CONT(0.95)         WITHIN GROUP (ORDER BY TotalDue)         OVER (PARTITION BY YEAR(OrderDate)),     PERCENTILE_DISC(0.95)         WITHIN GROUP (ORDER BY TotalDue)         OVER (PARTITION BY YEAR(OrderDate)) FROM Sales.SalesOrderHeader ; This is designed to get the TotalDue for the first order of the year, the last order of the year, and also the 95% percentile, using both the continuous and discrete methods (‘discrete’ means it picks the closest one from the values available – ‘continuous’ means it will happily use something between, similar to what you would do for a traditional median of four values). I’m sure you can imagine the results – a different value for each field, but within each year, all the rows the same. Notice that I’m not grouping by the year. Nor am I filtering. This query gives us a result for every row in the SalesOrderHeader table – 31465 in this case (using the original AdventureWorks that dates back to the SQL 2005 days). The RANGE BETWEEN bit in FIRST_VALUE and LAST_VALUE is needed to make sure that we’re considering all the rows available. If we don’t specify that, it assumes we only mean “RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW”, which means that LAST_VALUE ends up being the row we’re looking at. At this point you might think about other environments such as Access or Reporting Services, and remember aggregate functions like FIRST. We really should be able to do something like: SELECT     YEAR(OrderDate),     FIRST_VALUE(TotalDue)         OVER (PARTITION BY YEAR(OrderDate)               ORDER BY OrderDate, SalesOrderID               RANGE BETWEEN UNBOUNDED PRECEDING                         AND UNBOUNDED FOLLOWING) FROM Sales.SalesOrderHeader GROUP BY YEAR(OrderDate) ; But you can’t. You get that age-old error: Msg 8120, Level 16, State 1, Line 5 Column 'Sales.SalesOrderHeader.OrderDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. Msg 8120, Level 16, State 1, Line 5 Column 'Sales.SalesOrderHeader.SalesOrderID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. Hmm. You see, FIRST_VALUE isn’t an aggregate function. None of these analytic functions are. There are too many things involved for SQL to realise that the values produced might be identical within the group. Furthermore, you can’t even surround it in a MAX. Then you get a different error, telling you that you can’t use windowed functions in the context of an aggregate. And so we end up grouping by doing a DISTINCT. SELECT DISTINCT     YEAR(OrderDate),         FIRST_VALUE(TotalDue)              OVER (PARTITION BY YEAR(OrderDate)                   ORDER BY OrderDate, SalesOrderID                   RANGE BETWEEN UNBOUNDED PRECEDING                             AND UNBOUNDED FOLLOWING),         LAST_VALUE(TotalDue)             OVER (PARTITION BY YEAR(OrderDate)                   ORDER BY OrderDate, SalesOrderID                   RANGE BETWEEN UNBOUNDED PRECEDING                             AND UNBOUNDED FOLLOWING),     PERCENTILE_CONT(0.95)          WITHIN GROUP (ORDER BY TotalDue)         OVER (PARTITION BY YEAR(OrderDate)),     PERCENTILE_DISC(0.95)         WITHIN GROUP (ORDER BY TotalDue)         OVER (PARTITION BY YEAR(OrderDate)) FROM Sales.SalesOrderHeader ; I’m sorry. It’s just the way it goes. Hopefully it’ll change the future, but for now, it’s what you’ll have to do. If we look in the execution plan, we see that it’s incredibly ugly, and actually works out the results of these analytic functions for all 31465 rows, finally performing the distinct operation to convert it into the four rows we get in the results. You might be able to achieve a better plan using things like TOP, or the kind of calculation that I used in http://sqlblog.com/blogs/rob_farley/archive/2011/08/23/t-sql-thoughts-about-the-95th-percentile.aspx (which is how PERCENTILE_CONT works), but it’s definitely convenient to use these functions, and in time, I’m sure we’ll see good improvements in the way that they are implemented. Oh, and this post should be good for fellow SQL Server MVP Nigel Sammy’s T-SQL Tuesday this month.

    Read the article

  • The blocking nature of aggregates

    - by Rob Farley
    I wrote a post recently about how query tuning isn’t just about how quickly the query runs – that if you have something (such as SSIS) that is consuming your data (and probably introducing a bottleneck), then it might be more important to have a query which focuses on getting the first bit of data out. You can read that post here.  In particular, we looked at two operators that could be used to ensure that a query returns only Distinct rows. and The Sort operator pulls in all the data, sorts it (discarding duplicates), and then pushes out the remaining rows. The Hash Match operator performs a Hashing function on each row as it comes in, and then looks to see if it’s created a Hash it’s seen before. If not, it pushes the row out. The Sort method is quicker, but has to wait until it’s gathered all the data before it can do the sort, and therefore blocks the data flow. But that was my last post. This one’s a bit different. This post is going to look at how Aggregate functions work, which ties nicely into this month’s T-SQL Tuesday. I’ve frequently explained about the fact that DISTINCT and GROUP BY are essentially the same function, although DISTINCT is the poorer cousin because you have less control over it, and you can’t apply aggregate functions. Just like the operators used for Distinct, there are different flavours of Aggregate operators – coming in blocking and non-blocking varieties. The example I like to use to explain this is a pile of playing cards. If I’m handed a pile of cards and asked to count how many cards there are in each suit, it’s going to help if the cards are already ordered. Suppose I’m playing a game of Bridge, I can easily glance at my hand and count how many there are in each suit, because I keep the pile of cards in order. Moving from left to right, I could tell you I have four Hearts in my hand, even before I’ve got to the end. By telling you that I have four Hearts as soon as I know, I demonstrate the principle of a non-blocking operation. This is known as a Stream Aggregate operation. It requires input which is sorted by whichever columns the grouping is on, and it will release a row as soon as the group changes – when I encounter a Spade, I know I don’t have any more Hearts in my hand. Alternatively, if the pile of cards are not sorted, I won’t know how many Hearts I have until I’ve looked through all the cards. In fact, to count them, I basically need to put them into little piles, and when I’ve finished making all those piles, I can count how many there are in each. Because I don’t know any of the final numbers until I’ve seen all the cards, this is blocking. This performs the aggregate function using a Hash Match. Observant readers will remember this from my Distinct example. You might remember that my earlier Hash Match operation – used for Distinct Flow – wasn’t blocking. But this one is. They’re essentially doing a similar operation, applying a Hash function to some data and seeing if the set of values have been seen before, but before, it needs more information than the mere existence of a new set of values, it needs to consider how many of them there are. A lot is dependent here on whether the data coming out of the source is sorted or not, and this is largely determined by the indexes that are being used. If you look in the Properties of an Index Scan, you’ll be able to see whether the order of the data is required by the plan. A property called Ordered will demonstrate this. In this particular example, the second plan is significantly faster, but is dependent on having ordered data. In fact, if I force a Stream Aggregate on unordered data (which I’m doing by telling it to use a different index), a Sort operation is needed, which makes my plan a lot slower. This is all very straight-forward stuff, and information that most people are fully aware of. I’m sure you’ve all read my good friend Paul White (@sql_kiwi)’s post on how the Query Optimizer chooses which type of aggregate function to apply. But let’s take a look at SQL Server Integration Services. SSIS gives us a Aggregate transformation for use in Data Flow Tasks, but it’s described as Blocking. The definitive article on Performance Tuning SSIS uses Sort and Aggregate as examples of Blocking Transformations. I’ve just shown you that Aggregate operations used by the Query Optimizer are not always blocking, but that the SSIS Aggregate component is an example of a blocking transformation. But is it always the case? After all, there are plenty of SSIS Performance Tuning talks out there that describe the value of sorted data in Data Flow Tasks, describing the IsSorted property that can be set through the Advanced Editor of your Source component. And so I set about testing the Aggregate transformation in SSIS, to prove for sure whether providing Sorted data would let the Aggregate transform behave like a Stream Aggregate. (Of course, I knew the answer already, but it helps to be able to demonstrate these things). A query that will produce a million rows in order was in order. Let me rephrase. I used a query which produced the numbers from 1 to 1000000, in a single field, ordered. The IsSorted flag was set on the source output, with the only column as SortKey 1. Performing an Aggregate function over this (counting the number of rows per distinct number) should produce an additional column with 1 in it. If this were being done in T-SQL, the ordered data would allow a Stream Aggregate to be used. In fact, if the Query Optimizer saw that the field had a Unique Index on it, it would be able to skip the Aggregate function completely, and just insert the value 1. This is a shortcut I wouldn’t be expecting from SSIS, but certainly the Stream behaviour would be nice. Unfortunately, it’s not the case. As you can see from the screenshots above, the data is pouring into the Aggregate function, and not being released until all million rows have been seen. It’s not doing a Stream Aggregate at all. This is expected behaviour. (I put that in bold, because I want you to realise this.) An SSIS transformation is a piece of code that runs. It’s a physical operation. When you write T-SQL and ask for an aggregation to be done, it’s a logical operation. The physical operation is either a Stream Aggregate or a Hash Match. In SSIS, you’re telling the system that you want a generic Aggregation, that will have to work with whatever data is passed in. I’m not saying that it wouldn’t be possible to make a sometimes-blocking aggregation component in SSIS. A Custom Component could be created which could detect whether the SortKeys columns of the input matched the Grouping columns of the Aggregation, and either call the blocking code or the non-blocking code as appropriate. One day I’ll make one of those, and publish it on my blog. I’ve done it before with a Script Component, but as Script components are single-use, I was able to handle the data knowing everything about my data flow already. As per my previous post – there are a lot of aspects in which tuning SSIS and tuning execution plans use similar concepts. In both situations, it really helps to have a feel for what’s going on behind the scenes. Considering whether an operation is blocking or not is extremely relevant to performance, and that it’s not always obvious from the surface. In a future post, I’ll show the impact of blocking v non-blocking and synchronous v asynchronous components in SSIS, using some of LobsterPot’s Script Components and Custom Components as examples. When I get that sorted, I’ll make a Stream Aggregate component available for download.

    Read the article

  • The blocking nature of aggregates

    - by Rob Farley
    I wrote a post recently about how query tuning isn’t just about how quickly the query runs – that if you have something (such as SSIS) that is consuming your data (and probably introducing a bottleneck), then it might be more important to have a query which focuses on getting the first bit of data out. You can read that post here.  In particular, we looked at two operators that could be used to ensure that a query returns only Distinct rows. and The Sort operator pulls in all the data, sorts it (discarding duplicates), and then pushes out the remaining rows. The Hash Match operator performs a Hashing function on each row as it comes in, and then looks to see if it’s created a Hash it’s seen before. If not, it pushes the row out. The Sort method is quicker, but has to wait until it’s gathered all the data before it can do the sort, and therefore blocks the data flow. But that was my last post. This one’s a bit different. This post is going to look at how Aggregate functions work, which ties nicely into this month’s T-SQL Tuesday. I’ve frequently explained about the fact that DISTINCT and GROUP BY are essentially the same function, although DISTINCT is the poorer cousin because you have less control over it, and you can’t apply aggregate functions. Just like the operators used for Distinct, there are different flavours of Aggregate operators – coming in blocking and non-blocking varieties. The example I like to use to explain this is a pile of playing cards. If I’m handed a pile of cards and asked to count how many cards there are in each suit, it’s going to help if the cards are already ordered. Suppose I’m playing a game of Bridge, I can easily glance at my hand and count how many there are in each suit, because I keep the pile of cards in order. Moving from left to right, I could tell you I have four Hearts in my hand, even before I’ve got to the end. By telling you that I have four Hearts as soon as I know, I demonstrate the principle of a non-blocking operation. This is known as a Stream Aggregate operation. It requires input which is sorted by whichever columns the grouping is on, and it will release a row as soon as the group changes – when I encounter a Spade, I know I don’t have any more Hearts in my hand. Alternatively, if the pile of cards are not sorted, I won’t know how many Hearts I have until I’ve looked through all the cards. In fact, to count them, I basically need to put them into little piles, and when I’ve finished making all those piles, I can count how many there are in each. Because I don’t know any of the final numbers until I’ve seen all the cards, this is blocking. This performs the aggregate function using a Hash Match. Observant readers will remember this from my Distinct example. You might remember that my earlier Hash Match operation – used for Distinct Flow – wasn’t blocking. But this one is. They’re essentially doing a similar operation, applying a Hash function to some data and seeing if the set of values have been seen before, but before, it needs more information than the mere existence of a new set of values, it needs to consider how many of them there are. A lot is dependent here on whether the data coming out of the source is sorted or not, and this is largely determined by the indexes that are being used. If you look in the Properties of an Index Scan, you’ll be able to see whether the order of the data is required by the plan. A property called Ordered will demonstrate this. In this particular example, the second plan is significantly faster, but is dependent on having ordered data. In fact, if I force a Stream Aggregate on unordered data (which I’m doing by telling it to use a different index), a Sort operation is needed, which makes my plan a lot slower. This is all very straight-forward stuff, and information that most people are fully aware of. I’m sure you’ve all read my good friend Paul White (@sql_kiwi)’s post on how the Query Optimizer chooses which type of aggregate function to apply. But let’s take a look at SQL Server Integration Services. SSIS gives us a Aggregate transformation for use in Data Flow Tasks, but it’s described as Blocking. The definitive article on Performance Tuning SSIS uses Sort and Aggregate as examples of Blocking Transformations. I’ve just shown you that Aggregate operations used by the Query Optimizer are not always blocking, but that the SSIS Aggregate component is an example of a blocking transformation. But is it always the case? After all, there are plenty of SSIS Performance Tuning talks out there that describe the value of sorted data in Data Flow Tasks, describing the IsSorted property that can be set through the Advanced Editor of your Source component. And so I set about testing the Aggregate transformation in SSIS, to prove for sure whether providing Sorted data would let the Aggregate transform behave like a Stream Aggregate. (Of course, I knew the answer already, but it helps to be able to demonstrate these things). A query that will produce a million rows in order was in order. Let me rephrase. I used a query which produced the numbers from 1 to 1000000, in a single field, ordered. The IsSorted flag was set on the source output, with the only column as SortKey 1. Performing an Aggregate function over this (counting the number of rows per distinct number) should produce an additional column with 1 in it. If this were being done in T-SQL, the ordered data would allow a Stream Aggregate to be used. In fact, if the Query Optimizer saw that the field had a Unique Index on it, it would be able to skip the Aggregate function completely, and just insert the value 1. This is a shortcut I wouldn’t be expecting from SSIS, but certainly the Stream behaviour would be nice. Unfortunately, it’s not the case. As you can see from the screenshots above, the data is pouring into the Aggregate function, and not being released until all million rows have been seen. It’s not doing a Stream Aggregate at all. This is expected behaviour. (I put that in bold, because I want you to realise this.) An SSIS transformation is a piece of code that runs. It’s a physical operation. When you write T-SQL and ask for an aggregation to be done, it’s a logical operation. The physical operation is either a Stream Aggregate or a Hash Match. In SSIS, you’re telling the system that you want a generic Aggregation, that will have to work with whatever data is passed in. I’m not saying that it wouldn’t be possible to make a sometimes-blocking aggregation component in SSIS. A Custom Component could be created which could detect whether the SortKeys columns of the input matched the Grouping columns of the Aggregation, and either call the blocking code or the non-blocking code as appropriate. One day I’ll make one of those, and publish it on my blog. I’ve done it before with a Script Component, but as Script components are single-use, I was able to handle the data knowing everything about my data flow already. As per my previous post – there are a lot of aspects in which tuning SSIS and tuning execution plans use similar concepts. In both situations, it really helps to have a feel for what’s going on behind the scenes. Considering whether an operation is blocking or not is extremely relevant to performance, and that it’s not always obvious from the surface. In a future post, I’ll show the impact of blocking v non-blocking and synchronous v asynchronous components in SSIS, using some of LobsterPot’s Script Components and Custom Components as examples. When I get that sorted, I’ll make a Stream Aggregate component available for download.

    Read the article

  • T-SQL Tuesday # 16 : This is not the aggregate you're looking for

    - by AaronBertrand
    This week, T-SQL Tuesday is being hosted by Jes Borland ( blog | twitter ), and the theme is " Aggregate Functions ." When people think of aggregates, they tend to think of MAX(), SUM() and COUNT(). And occasionally, less common functions such as AVG() and STDEV(). I thought I would write a quick post about a different type of aggregate: string concatenation. Even going back to my classic ASP days, one of the more common questions out in the community has been, "how do I turn a column into a comma-separated...(read more)

    Read the article

  • Django and conditional aggregates

    - by piquadrat
    I have two models, authors and articles: class Author(models.Model): name = models.CharField('name', max_length=100) class Article(models.Model) title = models.CharField('title', max_length=100) pubdate = models.DateTimeField('publication date') authors = models.ManyToManyField(Author) Now I want to select all authors and annotate them with their respective article count. That's a piece of cake with Django's aggregates. Problem is, it should only count the articles that are already published. According to ticket 11305 in the Django ticket tracker, this is not yet possible. I tried to use the CountIf annotation mentioned in that ticket, but it doesn't quote the datetime string and doesn't make all the joins it would need. So, what's the best solution, other than writing custom SQL?

    Read the article

  • merging two tables, while applying aggregates on the duplicates (max,min and sum)

    - by cloudraven
    I have a table (let's call it log) with a few millions of records. Among the fields I have Id, Count, FirstHit, LastHit. Id - The record id Count - number of times this Id has been reported FirstHit - earliest timestamp with which this Id was reported LastHit - latest timestamp with which this Id was reported This table only has one record for any given Id Everyday I get into another table (let's call it feed) with around half a million records with these fields among many others: Id Timestamp - Entry date and time. This table can have many records for the same id What I want to do is to update log in the following way. Count - log count value, plus the count() of records for that id found in feed FirstHit - the earliest of the current value in log or the minimum value in feed for that id LastHit - the latest of the current value in log or the maximum value in feed for that id. It should be noticed that many of the ids in feed are already in log. The simple thing that worked is to create a temporary table and insert into it the union of both as in Select Id, Min(Timestamp) As FirstHit, MAX(Timestamp) as LastHit, Count(*) as Count FROM feed GROUP BY Id UNION ALL Select Id, FirstHit,LastHit,Count FROM log; From that temporary table I do a select that aggregates Min(firsthit), max(lasthit) and sum(Count) Select Id, Min(FirstHit),Max(LastHit),Sum(Count) FROM @temp GROUP BY Id; and that gives me the end result. I could then delete everything from log and replace it with everything with temp, or craft an update for the common records and insert the new ones. However, I think both are highly inefficient. Is there a more efficient way of doing this. Perhaps doing the update in place in the log table?

    Read the article

  • Are DDD Aggregates really a good idea in a Web Application?

    - by Mystere Man
    I'm diving in to Domain Driven Design and some of the concepts i'm coming across make a lot of sense on the surface, but when I think about them more I have to wonder if that's really a good idea. The concept of Aggregates, for instance makes sense. You create small domains of ownership so that you don't have to deal with the entire domain model. However, when I think about this in the context of a web app, we're frequently hitting the database to pull back small subsets of data. For instance, a page may only list the number of orders, with links to click on to open the order and see its order id's. If i'm understanding Aggregates right, I would typically use the repository pattern to return an OrderAggregate that would contain the members GetAll, GetByID, Delete, and Save. Ok, that sounds good. But... If I call GetAll to list all my order's, it would seem to me that this pattern would require the entire list of aggregate information to be returned, complete orders, order lines, etc... When I only need a small subset of that information (just header information). Am I missing something? Or is there some level of optimization you would use here? I can't imagine that anyone would advocate returning entire aggregates of information when you don't need it. Certainly, one could create methods on your repository like GetOrderHeaders, but that seems to defeat the purpose of using a pattern like repository in the first place. Can anyone clarify this for me?

    Read the article

  • What is good practice in .NET system architecture design concerning multiple models and aggregates

    - by BuzzBubba
    I'm designing a larger enterprise architecture and I'm in a doubt about how to separate the models and design those. There are several points I'd like suggestions for: - models to define - way to define models Currently my idea is to define: Core (domain) model Repositories to get data to that domain model from a database or other store Business logic model that would contain business logic, validation logic and more specific versions of forms of data retrieval methods View models prepared for specifically formated data output that would be parsed by views of different kind (web, silverlight, etc). For the first model I'm puzzled at what to use and how to define the mode. Should this model entities contain collections and in what form? IList, IEnumerable or IQueryable collections? - I'm thinking of immutable collections which IEnumerable is, but I'd like to avoid huge data collections and to offer my Business logic layer access with LINQ expressions so that query trees get executed at Data level and retrieve only really required data for situations like the one when I'm retrieving a very specific subset of elements amongst thousands or hundreds of thousands. What if I have an item with several thousands of bids? I can't just make an IEnumerable collection of those on the model and then retrieve an item list in some Repository method or even Business model method. Should it be IQueryable so that I actually pass my queries to Repository all the way from the Business logic model layer? Should I just avoid collections in my domain model? Should I void only some collections? Should I separate Domain model and BusinessLogic model or integrate those? Data would be dealt trough repositories which would use Domain model classes. Should repositories be used directly using only classes from domain model like data containers? This is an example of what I had in mind: So, my Domain objects would look like (e.g.) public class Item { public string ItemName { get; set; } public int Price { get; set; } public bool Available { get; set; } private IList<Bid> _bids; public IQueryable<Bid> Bids { get { return _bids.AsQueryable(); } private set { _bids = value; } } public AddNewBid(Bid newBid) { _bids.Add(new Bid {.... } } Where Bid would be defined as a normal class. Repositories would be defined as data retrieval factories and used to get data into another (Business logic) model which would again be used to get data to ViewModels which would then be rendered by different consumers. I would define IQueryable interfaces for all aggregating collections to get flexibility and minimize data retrieved from real data store. Or should I make Domain Model "anemic" with pure data store entities and all collections define for business logic model? One of the most important questions is, where to have IQueryable typed collections? - All the way from Repositories to Business model or not at all and expose only solid IList and IEnumerable from Repositories and deal with more specific queries inside Business model, but have more finer grained methods for data retrieval within Repositories. So, what do you think? Have any suggestions?

    Read the article

  • DDD: Getting aggregate roots for other aggregates

    - by Ed
    I've been studying DDD for the past 2 weeks, and one of the things that really stuck out to me was how aggregate roots can contain other aggregate roots. Aggregate roots are retrieved from the repository, but if a root contains another root, does the repository have a reference to the other repository and asks it to build the subroot?

    Read the article

  • MDX: Aggregates over a set

    - by =!3fb7.13e2.c801.b5f3
    Hi! What I am trying to achieves looks very simple, yet I cannot make it work. My facts are orders which have a date and I have a typical time dimension with the 'Month" and 'Year' levels. I would like to get an output which lists the number of orders for the last 6 months and the total, like this: Oct 2009 20 Nov 2009 30 Dec 2009 25 Jan 2009 15 Feb 2010 45 Mar 2010 5 Total 140 I can create the set with the members Oct 2009 until Mar 2010 and I manage to get this part of my desired output: Oct 2009 20 Nov 2009 30 Dec 2009 25 Jan 2009 15 Feb 2010 45 Mar 2010 5 Just I fail to get the total line.

    Read the article

  • DDD: Persisting aggregates

    - by Mosh
    Hello, Let's consider the typical Order and OrderItem example. Assuming that OrderItem is part of the Order Aggregate, it an only be added via Order. So, to add a new OrderItem to an Order, we have to load the entire Aggregate via Repository, add a new item to the Order object and persist the entire Aggregate again. This seems to have a lot of overhead. What if our Order has 10 OrderItems? This way, just to add a new OrderItem, not only do we have to read 10 OrderItems, but we should also re-insert all these 10 OrderItems again. (This is the approach that Jimmy Nillson has taken in his DDD book. Everytime he wants to persists an Aggregate, he clears all the child objects, and then re-inserts them again. This can cause other issues as the ID of the children are changed everytime because of the IDENTITY column in database.) I know some people may suggest to apply Unit of Work pattern at the Aggregate Root so it keeps track of what has been changed and only commit those changes. But this violates Persistence Ignorance (PI) principle because persistence logic is leaking into the Domain Model. Has anyone thought about this before? Mosh

    Read the article

  • Saving complex aggregates using Repository Pattern

    - by Kevin Lawrence
    We have a complex aggregate (sensitive names obfuscated for confidentiality reasons). The root, R, is composed of collections of Ms, As, Cs, Ss. Ms have collections of other low-level details. etc etc R really is an aggregate (no fair suggesting we split it!) We use lazy loading to retrieve the details. No problem there. But we are struggling a little with how to save such a complex aggregate. From the caller's point of view: r = repository.find(id); r.Ps.add(factory.createP()); r.Cs[5].updateX(123); r.Ms.removeAt(5); repository.save(r); Our competing solutions are: Dirty flags Each entity in the aggregate in the aggregate has a dirty flag. The save() method in the repository walks the tree looking for dirty objects and saves them. Deletes and adds are a little trickier - especially with lazy-loading - but doable. Event listener accumulates changes. Repository subscribes a listener to changes and accumulates events. When save is called, the repository grabs all the change events and writes them to the DB. Give up on repository pattern. Implement overloaded save methods to save the parts of the aggregate separately. The original example would become: r = repository.find(id); r.Ps.add(factory.createP()); r.Cs[5].updateX(123); r.Ms.removeAt(5); repository.save(r.Ps); repository.save(r.Cs); repository.save(r.Ms); (or worse) Advice please! What should we do?

    Read the article

  • N-tier Repository POCOs - Aggregates?

    - by Sam
    Assume the following simple POCOs, Country and State: public partial class Country { public Country() { States = new List<State>(); } public virtual int CountryId { get; set; } public virtual string Name { get; set; } public virtual string CountryCode { get; set; } public virtual ICollection<State> States { get; set; } } public partial class State { public virtual int StateId { get; set; } public virtual int CountryId { get; set; } public virtual Country Country { get; set; } public virtual string Name { get; set; } public virtual string Abbreviation { get; set; } } Now assume I have a simple respository that looks something like this: public partial class CountryRepository : IDisposable { protected internal IDatabase _db; public CountryRepository() { _db = new Database(System.Configuration.ConfigurationManager.AppSettings["DbConnName"]); } public IEnumerable<Country> GetAll() { return _db.Query<Country>("SELECT * FROM Countries ORDER BY Name", null); } public Country Get(object id) { return _db.SingleById(id); } public void Add(Country c) { _db.Insert(c); } /* ...And So On... */ } Typically in my UI I do not display all of the children (states), but I do display an aggregate count. So my country list view model might look like this: public partial class CountryListVM { [Key] public int CountryId { get; set; } public string Name { get; set; } public string CountryCode { get; set; } public int StateCount { get; set; } } When I'm using the underlying data provider (Entity Framework, NHibernate, PetaPoco, etc) directly in my UI layer, I can easily do something like this: IList<CountryListVM> list = db.Countries .OrderBy(c => c.Name) .Select(c => new CountryListVM() { CountryId = c.CountryId, Name = c.Name, CountryCode = c.CountryCode, StateCount = c.States.Count }) .ToList(); But when I'm using a repository or service pattern, I abstract away direct access to the data layer. It seems as though my options are to: Return the Country with a populated States collection, then map over in the UI layer. The downside to this approach is that I'm returning a lot more data than is actually needed. -or- Put all my view models into my Common dll library (as opposed to having them in the Models directory in my MVC app) and expand my repository to return specific view models instead of just the domain pocos. The downside to this approach is that I'm leaking UI specific stuff (MVC data validation annotations) into my previously clean POCOs. -or- Are there other options? How are you handling these types of things?

    Read the article

  • DDD: Aggregate Roots

    - by Mosh
    Hello, I need help with finding my aggregate root and boundary. I have 3 Entities: Plan, PlannedRole and PlannedTraining. Each Plan can include many PlannedRoles and PlannedTrainings. Solution 1: At first I thought Plan is the aggregate root because PlannedRole and PlannedTraining do not make sense out of the context of a Plan. They are always within a plan. Also, we have a business rule that says each Plan can have a maximum of 3 PlannedRoles and 5 PlannedTrainings. So I thought by nominating the Plan as the aggregate root, I can enforce this invariant. However, we have a Search page where the user searches for Plans. The results shows a few properties of the Plan itself (and none of its PlannedRoles or PlannedTrainings). I thought if I have to load the entire aggregate, it would have a lot of overhead. There are nearly 3000 plans and each may have a few children. Loading all these objects together and then ignoring PlannedRoles and PlannedTrainings in the search page doesn't make sense to me. Solution 2: I just realized the user wants 2 more search pages where they can search for Planned Roles or Planned Trainings. That made me realize they are trying to access these objects independently and "out of" the context of Plan. So I thought I was wrong about my initial design and that is how I came up with this solution. So, I thought to have 3 aggregates here, 1 for each Entity. This approach enables me to search for each Entity independently and also resolves the performance issue in solution 1. However, using this approach I cannot enforce the invariant I mentioned earlier. There is also another invariant that states a Plan can be changed only if it is of a certain status. So, I shouldn't be able to add any PlannedRoles or PlannedTrainings to a Plan that is not in that status. Again, I can't enforce this invariant with the second approach. Any advice would be greatly appreciated. Cheers, Mosh

    Read the article

  • Where can I find the list of all SQL-standard-mandated aggregate functions? [on hold]

    - by einpoklum
    I know that different DBMSes support different aggregate functions; for example: MySQL's aggregates Oracle's aggregates I want to get the list of aggregates mandated by the SQL standard. Or, to be more precise, the lists of mandatory aggregates for SQL 1992, 1998, 2003, 2008 and 2011 - with 2011 being the most important to me. Edit: Of course if I buy a copy of the standards I could compile these lists myself. My question is whether they're accessible somewhere online.

    Read the article

  • How to formulate a SQL Server indexed view that aggregates distinct values?

    - by Jeremy Lew
    I have a schema that includes tables like the following (pseudo schema): TABLE ItemCollection { ItemCollectionId ...etc... } TABLE Item { ItemId, ItemCollectionId, ContributorId } I need to aggregate the number of distinct contributors per ItemCollectionId. This is possible with a query like: SELECT ItemCollectionId, COUNT(DISTINCT ContributorId) FROM Item GROUP BY ItemCollectionId I further want to pre-calculate this aggregation using an indexed (materialized) view. The DISTINCT prevents an index being placed on this view. Is there any way to reformulate this which will not violate SQL Server's indexed view constraints?

    Read the article

  • How to do Linq aggregates when there might be an empty set?

    - by Shaul
    I have a Linq collection of Things, where Thing has an Amount (decimal) property. I'm trying to do an aggregate on this for a certain subset of Things: var total = myThings.Sum(t => t.Amount); and that works nicely. But then I added a condition that left me with no Things in the result: var total = myThings.Where(t => t.OtherProperty == 123).Sum(t => t.Amount); And instead of getting total = 0 or null, I get an error: System.InvalidOperationException: The null value cannot be assigned to a member with type System.Decimal which is a non-nullable value type. That is really nasty, because I didn't expect that behavior. I would have expected total to be zero, maybe null - but certainly not to throw an exception! What am I doing wrong? What's the workaround/fix?

    Read the article

  • Setting Up and Running Summary Advisor on an Exalytics Machine (Oracle-by-Example)

    - by Saresh
    If you are running Oracle BI on an Exalytics machine, you can use Summary Advisor to identify the aggregates that will increase query performance. Summary Advisor intelligently recommends an optimal list of aggregate tables based on query patterns that will achieve maximum query performance gain while meeting specific resource constraints. Summary Advisor then generates an aggregate creation script that can be run to create the recommended aggregate tables. Aggregate tables reduce query times by storing precomputed results for queries that include rolled-up data. This tutorial covers steps to set up, configure, and run Summary Advisor on an Exalytics machine using TimesTen database as a target for storing aggregates. You can find the Oracle By Example (OBE) in the Oracle Learning Library (OLL). The content in OLL is available to all customers, partners, and employees.

    Read the article

  • Information I need to know as a Java Developer [on hold]

    - by Woy
    I'm a java developer. I'm trying to get more knowledge to become a better programmer. I've listed a number of technologies to learn. Instead of what I've listed, what technologies would you suggest to learn as well for a Junior Java Developer? I realize, there's a lot of things to study. Java: - how a garbage collector works - resource management - network programming - TCP/IP HTTP - transactions, - consistency: interfaces, classes collections, hash codes, algorithms, comp. complexity concurrent programming: synchronizing, semafores steam management metability: thread-safety byte code manipulations, reflections, Aspect-Oriented Programming as base to understand frameworks such as Spring etc. Web stack: servlets, filters, socket programming Libraries: JDK, GWT, Apache Commons, Joda-Time, Dependency Injections: Spring, Nano Tools: IDE: very good knowledge - debugger - profiler - web analyzers: Wireshark, firebugs - unit testing SQL/Databases: Basics SELECTing columns from a table Aggregates Part 1: COUNT, SUM, MAX/MIN Aggregates Part 2: DISTINCT, GROUP BY, HAVING + Intermediate JOINs, ANSI-89 and ANSI-92 syntax + UNION vs UNION ALL x NULL handling: COALESCE & Native NULL handling Subqueries: IN, EXISTS, and inline views Subqueries: Correlated ITH syntax: Subquery Factoring/CTE Views Advanced Topics Functions, Stored Procedures, Packages Pivoting data: CASE & PIVOT syntax Hierarchical Queries Cursors: Implicit and Explicit Triggers Dynamic SQL Materialized Views Query Optimization: Indexes Query Optimization: Explain Plans Query Optimization: Profiling Data Modelling: Normal Forms, 1 through 3 Data Modelling: Primary & Foreign Keys Data Modelling: Table Constraints Data Modelling: Link/Corrollary Tables Full Text Searching XML Isolation Levels Entity Relationship Diagrams (ERDs), Logical and Physical Transactions: COMMIT, ROLLBACK, Error Handling

    Read the article

  • Architecture for dashboard showing aggregated stats [on hold]

    - by soulnafein
    I'd like to know what are common architectural pattern for the following problem. Web application A has information on sales, users, responsiveness score, etc. Some of this information are computationally intensive and or have a complex business logic (e.g. responsiveness score). I'm building a separate application (B) for internal admin tasks that modifies data in web application A and report on data from web application A. For writing I'm planning to use a restful api. E.g. create a new entity, update entity, etc. In application B I'd like to show some graphs and other aggregate data for the previous 12 months. I'm planning to store the aggregate data for each month in redis. Some data should update more often, e.g every 10 minutes. I can think of 3 ways of doing this. A scheduled task in app B that connects to an api of app A that provides some aggregated data. Then app B stores it in Redis and use that to visualise pages. Cons: it makes complex calculation within a web request, requires lot's of work e.g. api server and client, storing, etc., pros: business logic still lives in app A. A scheduled task in app A that aggregates data in an non-web process and stores it directly in Redis to be accessed by app B. A scheduled task in app A that aggregates data in a non-web process and uses an api in app B to save it. I'd like to know if there is a well known architectural solution to this type of problems and if not what are other pros/cons for the solution I've suggested?

    Read the article

1 2 3 4 5  | Next Page >