Search Results

Search found 79 results on 4 pages for 'dataflow'.

Page 1/4 | 1 2 3 4 | Next Page >

Investigation: Can different combinations of components effect Dataflow performance?

- by jamiet

Introduction The Dataflow task is one of the core components (if not the core component) of SQL Server Integration Services (SSIS) and often the most misunderstood. This is not surprising, its an incredibly complicated beast and we’re abstracted away from that complexity via some boxes that go yellow red or green and that have some lines drawn between them. Example dataflow In this blog post I intend to look under that facade and get into some of the nuts and bolts of the Dataflow Task by investigating how the decisions we make when building our packages can affect performance. I will do this by comparing the performance of three dataflows that all have the same input, all produce the same output, but which all operate slightly differently by way of having different transformation components. I also want to use this blog post to challenge a common held opinion that I see perpetuated over and over again on the SSIS forum. That is, that people assume adding components to a dataflow will be detrimental to overall performance. Its not surprising that people think this –it is intuitive to think that more components means more work- however this is not a view that I share. I have always been of the opinion that there are many factors affecting dataflow duration and the number of components is actually one of the less important ones; having said that I have never proven that assertion and that is one reason for this investigation. I have actually seen evidence that some people think dataflow duration is simply a function of number of rows and number of components. I’ll happily call that one out as a myth even without any investigation! The Setup I have a 2GB datafile which is a list of 4731904 (~4.7million) customer records with various attributes against them and it contains 2 columns that I am going to use for categorisation: [YearlyIncome] [BirthDate] The data file is a SSIS raw format file which I chose to use because it is the quickest way of getting data into a dataflow and given that I am testing the transformations, not the source or destination adapters, I want to minimise external influences as much as possible. In the test I will split the customers according to month of birth (12 of those) and whether or not their yearly income is above or below 50000 (2 of those); in other words I will be splitting them into 24 discrete categories and in order to do it I shall be using different combinations of SSIS’ Conditional Split and Derived Column transformation components. The 24 datapaths that occur will each input to a rowcount component, again because this is the least resource intensive means of terminating a datapath. The test is being carried out on a Dell XPS Studio laptop with a quad core (8 logical Procs) Intel Core i7 at 1.73GHz and Samsung SSD hard drive. Its running SQL Server 2008 R2 on Windows 7. The Variables Here are the three combinations of components that I am going to test: One Conditional Split - A single Conditional Split component CSPL Split by Month of Birth and income category that will use expressions on [YearlyIncome] & [BirthDate] to send each row to one of 24 outputs. This next screenshot displays the expression logic in use: Derived Column & Conditional Split - A Derived Column component DER Income Category that adds a new column [IncomeCategory] which will contain one of two possible text values {“LessThan50000”,”GreaterThan50000”} and uses [YearlyIncome] to determine which value each row should get. A Conditional Split component CSPL Split by Month of Birth and Income Category then uses that new column in conjunction with [BirthDate] to determine which of the same 24 outputs to send each row to. Put more simply, I am separating the Conditional Split of #1 into a Derived Column and a Conditional Split. The next screenshots display the expression logic in use: DER Income Category CSPL Split by Month of Birth and Income Category Three Conditional Splits - A Conditional Split component that produces two outputs based on [YearlyIncome], one for each Income Category. Each of those outputs will go to a further Conditional Split that splits the input into 12 outputs, one for each month of birth (identical logic in each). In this case then I am separating the single Conditional Split of #1 into three Conditional Split components. The next screenshots display the expression logic in use: CSPL Split by Income Category CSPL Split by Month of Birth 1& 2 Each of these combinations will provide an input to one of the 24 rowcount components, just the same as before. For illustration here is a screenshot of the dataflow containing three Conditional Split components: As you can these dataflows have a fair bit of work to do and remember that they’re doing that work for 4.7million rows. I will execute each dataflow 10 times and use the average for comparison. I foresee three possible outcomes: The dataflow containing just one Conditional Split (i.e. #1) will be quicker There is no significant difference between any of them One of the two dataflows containing multiple transformation components will be quicker Regardless of which of those outcomes come to pass we will have learnt something and that makes this an interesting test to carry out. Note that I will be executing the dataflows using dtexec.exe rather than hitting F5 within BIDS. The Results and Analysis The table below shows all of the executions, 10 for each dataflow. It also shows the average for each along with a standard deviation. All durations are in seconds. I’m pasting a screenshot because I frankly can’t be bothered with the faffing about needed to make a presentable HTML table. It is plain to see from the average that the dataflow containing three conditional splits is significantly faster, the other two taking 43% and 52% longer respectively. This seems strange though, right? Why does the dataflow containing the most components outperform the other two by such a big margin? The answer is actually quite logical when you put some thought into it and I’ll explain that below. Before progressing, a side note. The standard deviation for the “Three Conditional Splits” dataflow is orders of magnitude smaller – indicating that performance for this dataflow can be predicted with much greater confidence too. The Explanation I refer you to the screenshot above that shows how CSPL Split by Month of Birth and salary category in the first dataflow is setup. Observe that there is a case for each combination of Month Of Date and Income Category – 24 in total. These expressions get evaluated in the order that they appear and hence if we assume that Month of Date and Income Category are uniformly distributed in the dataset we can deduce that the expected number of expression evaluations for each row is 12.5 i.e. 1 (the minimum) + 24 (the maximum) divided by 2 = 12.5. Now take a look at the screenshots for the second dataflow. We are doing one expression evaluation in DER Income Category and we have the same 24 cases in CSPL Split by Month of Birth and Income Category as we had before, only the expression differs slightly. In this case then we have 1 + 12.5 = 13.5 expected evaluations for each row – that would account for the slightly longer average execution time for this dataflow. Now onto the third dataflow, the quick one. CSPL Split by Income Category does a maximum of 2 expression evaluations thus the expected number of evaluations per row is 1.5. CSPL Split by Month of Birth 1 & CSPL Split by Month of Birth 2 both have less work to do than the previous Conditional Split components because they only have 12 cases to test for thus the expected number of expression evaluations is 6.5 There are two of them so total expected number of expression evaluations for this dataflow is 6.5 + 6.5 + 1.5 = 14.5. 14.5 is still more than 12.5 & 13.5 though so why is the third dataflow so much quicker? Simple, the conditional expressions in the first two dataflows have two boolean predicates to evaluate – one for Income Category and one for Month of Birth; the expressions in the Conditional Split in the third dataflow however only have one predicate thus they are doing a lot less work. To sum up, the difference in execution times can be attributed to the difference between: MONTH(BirthDate) == 1 && YearlyIncome <= 50000 and MONTH(BirthDate) == 1 In the first two dataflows YearlyIncome <= 50000 gets evaluated an average of 12.5 times for every row whereas in the third dataflow it is evaluated once and once only. Multiply those 11.5 extra operations by 4.7million rows and you get a significant amount of extra CPU cycles – that’s where our duration difference comes from. The Wrap-up The obvious point here is that adding new components to a dataflow isn’t necessarily going to make it go any slower, moreover you may be able to achieve significant improvements by splitting logic over multiple components rather than one. Performance tuning is all about reducing the amount of work that needs to be done and that doesn’t necessarily mean use less components, indeed sometimes you may be able to reduce workload in ways that aren’t immediately obvious as I think I have proven here. Of course there are many variables in play here and your mileage will most definitely vary. I encourage you to download the package and see if you get similar results – let me know in the comments. The package contains all three dataflows plus a fourth dataflow that will create the 2GB raw file for you (you will also need the [AdventureWorksDW2008] sample database from which to source the data); simply disable all dataflows except the one you want to test before executing the package and remember, execute using dtexec, not within BIDS. If you want to explore dataflow performance tuning in more detail then here are some links you might want to check out: Inequality joins, Asynchronous transformations and Lookups Destination Adapter Comparison Don’t turn the dataflow into a cursor SSIS Dataflow – Designing for performance (webinar) Any comments? Let me know! @Jamiet

Read the article
Dataflow Programming - Patterns and Frameworks

- by Styrac

I just came across the proposed Boost::Dataflow library. It seems like an interesting approach and I was wondering if there are other such alternative frameworks for C++, and if there are any related design patterns. I have not ruled out Boost::Dataflow, I am just looking into any available alternatives so I can understand the domain and my options better (or roll my own if necessary).

Read the article
Techniques for modeling a dynamic dataflow with Java concurrency API

- by Maian

Is there an elegant way to model a dynamic dataflow in Java? By dataflow, I mean there are various types of tasks, and these tasks can be "connected" arbitrarily, such that when a task finishes, successor tasks are executed in parallel using the finished tasks output as input, or when multiple tasks finish, their output is aggregated in a successor task (see flow-based programming). By dynamic, I mean that the type and number of successors tasks when a task finishes depends on the output of that finished task, so for example, task A may spawn task B if it has a certain output, but may spawn task C if has a different output. Another way of putting it is that each task (or set of tasks) is responsible for determining what the next tasks are. Sample dataflow for rendering a webpage: I have as task types: file downloader, HTML/CSS renderer, HTML parser/DOM builder, image renderer, JavaScript parser, JavaScript interpreter. File downloader task for HTML file HTML parser/DOM builder task File downloader task for each embedded file/link If image, image renderer If external JavaScript, JavaScript parser JavaScript interpreter Otherwise, just store in some var/field in HTML parser task JavaScript parser for each embedded script JavaScript interpreter Wait for above tasks to finish, then HTML/CSS renderer (obviously not optimal or perfectly correct, but this is simple) I'm not saying the solution needs to be some comprehensive framework (in fact, the closer to the JDK API, the better), and I absolutely don't want something as heavyweight is say Spring Web Flow or some declarative markup or other DSL. To be more specific, I'm trying to think of a good way to model this in Java with Callables, Executors, ExecutorCompletionServices, and perhaps various synchronizer classes (like Semaphore or CountDownLatch). There are a couple use cases and requirements: Don't make any assumptions on what executor(s) the tasks will run on. In fact, to simplify, just assume there's only one executor. It can be a fixed thread pool executor, so a naive implementation can result in deadlocks (e.g. imagine a task that submits another task and then blocks until that subtask is finished, and now imagine several of these tasks using up all the threads). To simplify, assume that the data is not streamed between tasks (task output-succeeding task input) - the finishing task and succeeding task won't exist together, so the input data to the succeeding task will not be changed by the preceeding task (since it's already done). There are only a couple operations that the dataflow "engine" should be able to handle: A mechanism where a task can queue more tasks A mechanism whereby a successor task is not queued until all the required input tasks are finished A mechanism whereby the main thread (or other threads not managed by the executor) blocks until the flow is finished A mechanism whereby the main thread (or other threads not managed by the executor) blocks until certain tasks have finished Since the dataflow is dynamic (depends on input/state of the task), the activation of these mechanisms should occur within the task code, e.g. the code in a Callable is itself responsible for queueing more Callables. The dataflow "internals" should not be exposed to the tasks (Callables) themselves - only the operations listed above should be available to the task. Note that the type of the data is not necessarily the same for all tasks, e.g. a file download task may accept a File as input but will output a String. If a task throws an uncaught exception (indicating some fatal error requiring all dataflow processing to stop), it must propagate up to the thread that initiated the dataflow as quickly as possible and cancel all tasks (or something fancier like a fatal error handler). Tasks should be launched as soon as possible. This along with the previous requirement should preclude simple Future polling + Thread.sleep(). As a bonus, I would like to dataflow engine itself to perform some action (like logging) every time task is finished or when no has finished in X time since last task has finished. Something like: ExecutorCompletionService<T> ecs; while (hasTasks()) { Future<T> future = ecs.poll(1 minute); some_action_like_logging(); if (future != null) { future.get() ... } ... } Are there straightforward ways to do all this with Java concurrency API? Or if it's going to complex no matter what with what's available in the JDK, is there a lightweight library that satisfies the requirements? I already have a partial solution that fits my particular use case (it cheats in a way, since I'm using two executors, and just so you know, it's not related at all to the web browser example I gave above), but I'd like to see a more general purpose and elegant solution.

Read the article
Simple Concurrency with Dataflow Variables in C++

What are dataflow variables anyway? Programming - Languages - Dataflow - C++ - Threads

Read the article
What criteria would I use SQL Stream Insight vs TPL Dataflow [closed]

- by makerofthings7

There is an add-in to the Task Parallel Library (TPL) called TPL Dataflow that allows a variety of data processing scenarios. It seems that there are some parallels to the SQL Stream Insight product, however since SQL's Stream Insight has some interesting licensing around it, and it has a better performance depending on what license I get... I found myself asking myself should I use TPL Dataflow and not have any licensing issues, and possibly better performance. Can anyone tell me if performance is a valid criteria for comparing SQL Stream Insight vs TPL Dataflow? What other criteria should I be looking at when comparing the two?

Read the article
Library for Dataflow in C

- by msutherl

How can I do dataflow (pipes and filters, stream processing, flow based) in C? And not with UNIX pipes. I recently came across stream.py. Streams are iterables with a pipelining mechanism to enable data-flow programming and easy parallelization. The idea is to take the output of a function that turns an iterable into another iterable and plug that as the input of another such function. While you can already do this using function composition, this package provides an elegant notation for it by overloading the operator. I would like to duplicate a simple version of this kind of functionality in C. I particularly like the overloading of the operator to avoid function composition mess. Wikipedia points to this hint from a Usenet post in 1990. Why C? Because I would like to be able to do this on microcontrollers and in C extensions for other high level languages (Max, Pd*, Python). * (ironic given that Max and Pd were written, in C, specifically for this purpose – I'm looking for something barebones)

Read the article
Dataflow Pipeline holding on to memory

- by Jesse Carter

I've created a Dataflow pipeline consisting of 4 blocks (which includes one optional block) which is responsible for receiving a query object from my application across HTTP and retrieving information from a database, doing an optional transform on that data, and then writing the information back in the HTTP response. In some testing I've done I've been pulling down a significant amount of data from the database (570 thousand rows) which are stored in a List object and passed between the different blocks and it seems like even after the final block has been completed the memory isn't being released. Ram usage in Task Manager will spike up to over 2 GB and I can observe several large spikes as the List hits each block. The signatures for my blocks look like this: private TransformBlock<HttpListenerContext, Tuple<HttpListenerContext, QueryObject>> m_ParseHttpRequest; private TransformBlock<Tuple<HttpListenerContext, QueryObject>, Tuple<HttpListenerContext, QueryObject, List<string>>> m_RetrieveDatabaseResults; private TransformBlock<Tuple<HttpListenerContext, QueryObject, List<string>>, Tuple<HttpListenerContext, QueryObject, List<string>>> m_ConvertResults; private ActionBlock<Tuple<HttpListenerContext, QueryObject, List<string>>> m_ReturnHttpResponse; They are linked as follows: m_ParseHttpRequest.LinkTo(m_RetrieveDatabaseResults); m_RetrieveDatabaseResults.LinkTo(m_ConvertResults, tuple => tuple.Item2 is QueryObjectA); m_RetrieveDatabaseResults.LinkTo(m_ReturnHttpResponse, tuple => tuple.Item2 is QueryObjectB); m_ConvertResults.LinkTo(m_ReturnHttpResponse); Is it possible that I can set up the pipeline such that once each block is done with the list they no longer need to hold on to it as well as once the entire pipeline is completed that the memory is released?

Read the article
Asynchrony in C# 5: Dataflow Async Logger Sample

- by javarg

Check out this (very simple) code examples for TPL Dataflow. Suppose you are developing an Async Logger to register application events to different sinks or log writers. The logger architecture would be as follow: Note how blocks can be composed to achieved desired behavior. The BufferBlock<T> is the pool of log entries to be process whereas linked ActionBlock<TInput> represent the log writers or sinks. The previous composition would allows only one ActionBlock to consume entries at a time. Implementation code would be something similar to (add reference to System.Threading.Tasks.Dataflow.dll in %User Documents%\Microsoft Visual Studio Async CTP\Documentation): TPL Dataflow Logger var bufferBlock = new BufferBlock<Tuple<LogLevel, string>>(); ActionBlock<Tuple<LogLevel, string>> infoLogger = new ActionBlock<Tuple<LogLevel, string>>( e => Console.WriteLine("Info: {0}", e.Item2)); ActionBlock<Tuple<LogLevel, string>> errorLogger = new ActionBlock<Tuple<LogLevel, string>>( e => Console.WriteLine("Error: {0}", e.Item2)); bufferBlock.LinkTo(infoLogger, e => (e.Item1 & LogLevel.Info) != LogLevel.None); bufferBlock.LinkTo(errorLogger, e => (e.Item1 & LogLevel.Error) != LogLevel.None); bufferBlock.Post(new Tuple<LogLevel, string>(LogLevel.Info, "info message")); bufferBlock.Post(new Tuple<LogLevel, string>(LogLevel.Error, "error message")); Note the filter applied to each link (in this case, the Logging Level selects the writer used). We can specify message filters using Predicate functions on each link. Now, the previous sample is useless for a Logger since Logging Level is not exclusive (thus, several writers could be used to process a single message). Let´s use a Broadcast<T> buffer instead of a BufferBlock<T>. Broadcast Logger var bufferBlock = new BroadcastBlock<Tuple<LogLevel, string>>( e => new Tuple<LogLevel, string>(e.Item1, e.Item2)); ActionBlock<Tuple<LogLevel, string>> infoLogger = new ActionBlock<Tuple<LogLevel, string>>( e => Console.WriteLine("Info: {0}", e.Item2)); ActionBlock<Tuple<LogLevel, string>> errorLogger = new ActionBlock<Tuple<LogLevel, string>>( e => Console.WriteLine("Error: {0}", e.Item2)); ActionBlock<Tuple<LogLevel, string>> allLogger = new ActionBlock<Tuple<LogLevel, string>>( e => Console.WriteLine("All: {0}", e.Item2)); bufferBlock.LinkTo(infoLogger, e => (e.Item1 & LogLevel.Info) != LogLevel.None); bufferBlock.LinkTo(errorLogger, e => (e.Item1 & LogLevel.Error) != LogLevel.None); bufferBlock.LinkTo(allLogger, e => (e.Item1 & LogLevel.All) != LogLevel.None); bufferBlock.Post(new Tuple<LogLevel, string>(LogLevel.Info, "info message")); bufferBlock.Post(new Tuple<LogLevel, string>(LogLevel.Error, "error message")); As this block copies the message to all its outputs, we need to define the copy function in the block constructor. In this case we create a new Tuple, but you can always use the Identity function if passing the same reference to every output. Try both scenarios and compare the results.

Read the article
What is a good motivating example for dataflow concurrency?

- by Alex Miller

I understand the basics of dataflow programming and have encountered it a bit in Clojure APIs, talks from Jonas Boner, GPars in Groovy, etc. I know it's prevalent in languages like Io (although I have not studied Io). What I am missing is a compelling reason to care about dataflow as a paradigm when building a concurrent program. Why would I use a dataflow model instead of a mutable state+threads+locks model (common in Java, C++, etc) or an actor model (common in Erlang or Scala) or something else? In particular, while I know of library support in the languages above (and Scala and Ruby), I don't know of a single program or library that is a poster child user of this model. Who is using it? Why do they find it better than the other models I mentioned?

Read the article
SSIS code smell – Unused columns in the dataflow

- by jamiet

A code smell is defined on Wikipedia as being a “symptom in the source code of a program that possibly indicates a deeper problem”. It’s a term commonly used by our code-writing brethren to describe sub-optimal code but I think the term can be applied equally well to SSIS packages too as I shall now explain One of my pet hates about SSIS development is packages that throw warnings of the form: The output column "ColumnName" (1358) on output "OLE DB Source Output" (1289) and component "OLE_SRC Name" (1279) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance. The warning is fairly self-explanatory – any column that appears in the data flow but doesn’t get used will throw this warning when the data flow is executed. Its not the negligible performance degradation that they cause that bothers me though, it’s the clutter that they cause in your log file/table. Take a look at the following screenshot if you don’t believe me: There are 231409 such warnings in the system that I took this screenshot from, that is 231409 log records that should not be there. The most infuriating thing about this warning is that it is so easily avoidable; eliminating such columns is a very quick and easy thing to do in the SSIS Designer. The only problem I see is that the warnings don’t occur until you execute the package – it would be preferable for the designer to have an unobtrusive way of informing you of them as well. Anyway, I digress… I consider such warnings to be a code smell because, to me, they’re symptomatic of a lack of due care and attention; a lack of developer discipline if you will. What other code smells can you think of when building SSIS packages? If I get a good list in the comments maybe I’ll compile them into a later blog post. @Jamiet Share this post: email it! | bookmark it! | digg it! | reddit! | kick it! | live it!

Read the article
Always use dtexec.exe to test performance of your dataflows. No exceptions.

- by jamiet

Earlier this evening I posted a blog post entitled Investigation: Can different combinations of components effect Dataflow performance? where I compared the performance of three different dataflows all working to the same overall goal. I wanted to make one last point related to the results but I thought it warranted a blog post all of its own. Here is a screenshot of one of the dataflows that I was testing: Pretty complicated I’m sure you’ll agree. Now, when I executed this dataflow in the test it was executing in ~19seconds however in that case I was executing using the command-line tool dtexec. I also tried executing inside the BIDS development environment and in that case it took much longer – 139seconds. That’s more than seven times as long. The point I want to make is very simple. If you are testing your dataflows for performance please use dtexec. Nothing else will suffice. @Jamiet

Read the article
SSIS Dataflow From Excel Empty Rows

- by Gerard

Hi All, I am using SSIS Dataflow to import data into SQL2008. My data source is an excel file. The dataflow is working, however it seems that it is importing empty rows from the Excel file. I don't understand why this is happening. For example i have data in rows 1 to rows 100,000. But when the data flow task runs it might say it is importing 200,000 rows. When I then import the data back into excel, I get 200,000 rows of data with 100,000 empty rows in between the data. Can someone please help? Thanks

Read the article
SSIS - Setting a Dataflow Derived Column value from a variable

- by Jason W

I have an ID in a package variable that I need to add as a column (with each row having that package variablevalue) in a Dataflow. Is there a way to do this with only the Derived Column? I know i can using the Derived Column to make a new column and then set the value using a Script Component, but that seems inefficient.

Read the article
Excel DataFlow UML Viewer/Navigator/Visualiser tool/ hint

- by Arjang

Not sure what to call it but, is there a birds eye view tool for excel to show the data flow between excel sheets/cels etc? I have inherited some huge reports and looking at each cell to see where it's data comes from or what sheet/cell dependencies it has is a nightmare. Or even just something with excel that show the dependencies within a sheet of cells to each other etc. Or Any other visualization tool that can show the data flow between cells ( I tried visio but it seemed it is only for making diagrams of data not the data model of excel itself ). Or at least if I am within a cell and see a formula referring to other sheets and cells, is there a quick way to navigate there and back? Like code navigation in VS? Thank you for your help

Read the article
Merge Join component sorted outputs [SSIS]

- by jamiet

One question that I have been asked a few times of late in regard to performance tuning SSIS data flows is this: Why isn’t the Merge Join output sorted (i.e.IsSorted=True)? This is a fair question. After all both of the Merge Join inputs are sorted, hence why wouldn’t the output be sorted as well? Well here’s a little secret, the Merge Join output IS sorted! There’s a caveat though – it is only under certain circumstances and SSIS itself doesn’t do a good job of informing you of it. Let’s take a look at an example. Here we have a dataflow that consumes data from the [AdventureWorks2008].[Sales].[SalesOrderHeader] & [AdventureWorks2008].[Sales].[SalesOrderDetail] tables then joins them using a Merge Join component: Let’s take a look inside the editor of the Merge Join: We are joining on the [SalesOrderId] field (which is what the two inputs just happen to be sorted upon). We are also putting [SalesOrderHeader].[SalesOrderId] into the output. Believe it or not the output from this Merge Join component is sorted (i.e. has IsSorted=True) but unfortunately the Merge Join component does not have an Advanced Editor hence it is hidden away from us. There are a couple of ways to prove to you that is the case; I could open up the package XML inside the .dtsx file and show you the metadata but there is an easier way than that – I can attach a Sort component to the output. Take a look: Notice that the Sort component is attempting to sort on the [SalesOrderId] column. This gives us the following warning: Validation warning. DFT Get raw data: {992B7C9A-35AD-47B9-A0B0-637F7DDF93EB}: The data is already sorted as specified so the transform can be removed. The warning proves that the output from the Merge Join is sorted! It must be noted that the Merge Join output will only have IsSorted=True if at least one of the join columns is included in the output. So there you go, the Merge Join component can indeed produce a sorted output and that’s very useful in order to avoid unnecessary expensive Sort operations downstream. Hope this is useful to someone out there! @Jamiet P.S. Thank you to Bob Bojanic on the SSIS product team who pointed this out to me!

Read the article
Merge Join component sorted outputs [SSIS]

- by jamiet

One question that I have been asked a few times of late in regard to performance tuning SSIS data flows is this: Why isn’t the Merge Join output sorted (i.e.IsSorted=True)? This is a fair question. After all both of the Merge Join inputs are sorted, hence why wouldn’t the output be sorted as well? Well here’s a little secret, the Merge Join output IS sorted! There’s a caveat though – it is only under certain circumstances and SSIS itself doesn’t do a good job of informing you of it. Let’s take a look at an example. Here we have a dataflow that consumes data from the [AdventureWorks2008].[Sales].[SalesOrderHeader] & [AdventureWorks2008].[Sales].[SalesOrderDetail] tables then joins them using a Merge Join component: Let’s take a look inside the editor of the Merge Join: We are joining on the [SalesOrderId] field (which is what the two inputs just happen to be sorted upon). We are also putting [SalesOrderHeader].[SalesOrderId] into the output. Believe it or not the output from this Merge Join component is sorted (i.e. has IsSorted=True) but unfortunately the Merge Join component does not have an Advanced Editor hence it is hidden away from us. There are a couple of ways to prove to you that is the case; I could open up the package XML inside the .dtsx file and show you the metadata but there is an easier way than that – I can attach a Sort component to the output. Take a look: Notice that the Sort component is attempting to sort on the [SalesOrderId] column. This gives us the following warning: Validation warning. DFT Get raw data: {992B7C9A-35AD-47B9-A0B0-637F7DDF93EB}: The data is already sorted as specified so the transform can be removed. The warning proves that the output from the Merge Join is sorted! It must be noted that the Merge Join output will only have IsSorted=True if at least one of the join columns is included in the output. So there you go, the Merge Join component can indeed produce a sorted output and that’s very useful in order to avoid unnecessary expensive Sort operations downstream. Hope this is useful to someone out there! @Jamiet P.S. Thank you to Bob Bojanic on the SSIS product team who pointed this out to me!

Read the article
My SqlComand on SSIS - DataFlow OLE DB Command seems not works

- by Angel Escobedo

Hello Im using OLE DB Source for get rows from a dBase IV file and it works, then I split the data and perform a group by with aggregate component. So I obtain a row with two columns with "null" value : CompanyID | CompanyName | SubTotal | Tax | TotalRevenue Null Null 145487 27642.53 173129.53 this success because all rows have been grouped with out taking care about the firsts columns and just Summing the valuable columns, so I need to change that null for default values as CompanyID = "100000000" and CompanyName = "Others". I try use SqlCommand on a OLE DB Command Component : SELECT "10000000" AS RUCCLI , "Otros - Varios" AS RAZCLI FROM RGVCAFAC <property id="1505" name="SqlCommand" dataType="System.String" state="default" isArray="false" description="The SQL command to be executed." typeConverter="" UITypeEditor="Microsoft.DataTransformationServices.Controls.ModalMultilineStringEditor, Microsoft.DataTransformationServices.Controls, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91" containsID="false" expressionType="Notify">SELECT "10000000" AS RUCCLI , "Otros - Varios" AS RAZCLI FROM RGVCAFAC</property> but nothings happens, why? and finally the task finish when the data is inserted on a SQL Server Table. Im using the same connection manager on extracting data and transform. (View Code) <DTS:Property DTS:Name="ConnectionString">Data Source=C:\CONTA\Resocen\Agosto\;Provider=Microsoft.Jet.OLEDB.4.0;Persist Security Info=False;Extended Properties=dBASE IV;</DTS:Property></DTS:ConnectionManager></DTS:ObjectData></DTS:ConnectionManager> all work is on memory, Im not using cache manager connections

Read the article
Dataflow in Controllers & Layouts in Zend Framework

- by favo

I use a basic setup which uses a layout.phtml for the HTML Layout and view scripts for the content part. I want to control some variables in the layout in my controllers, i.e. the title of my site. How can I access my layout to output variables from within the controller? Thanks for your feedback!

Read the article
Is it bad practice for a module to contain more information than it needs?

- by gekod

I just wanted to ask for your opinion on a situation that occurs sometimes and which I don't know what would be the most elegant way to solve it. Here it goes: We have module A which reads an entry from a database and sends a request to module B containing ONLY the information from the entry module B would need to accomplish it's job (to keep modularity I just give it the information it needs - module B has nothing to do with the rest of the information from the read DB entry). Now after finishing it's job, module B has to reply to a module C if it succeeded or failed. To do this module B replies with the information it has gotten from module A and some variable meaning success or fail. Now here comes the problem: module C needs to find that entry again BUT the information it has gotten from module B is not enough to uniquely find the exact same entry again. I don't think that module A giving more information to module B which it doesn't need to do it's job but which it could then give back to module C would be a good practice because this would mean giving some module information it doesn't really need. What do you think?

Read the article
Return dataset in dataflow

- by praveen

Hi All, Could I get ideas on retrieving the dataset using lookup method. Basically, my scenario as I have source data needs to lookup for other source table and on matching column from source I need to get all the records from other source data. its a one to many relations. I tried Lookup but gives only one record on matching condition, OLE DB command don't retrieve any data as it will do only Insert/Update operations. Thanks prav

Read the article
please explain the dataflow in MVC specially in codeigniter

- by pokhara

Dear friends, Can anyone please explain me the object flow in codeigniter MVC ? I can see for example when I put the followong code in controller it works, but i am not being able to figure out which part of this goes in model in vews. I tried several ways but coudn't. When i use the example codes from other it works but myself i am getting confused. Please help $query = $this->db->query("YOUR QUERY"); foreach ($query->result() as $row) { echo $row->title; echo $row->name; echo $row->body; }

Read the article
Is this bad practice?

- by gekod

I just wanted to ask for your opinion on a situation that occurs sometimes and which I don't know what would be the most elegant way to solve it. Here it goes: We have module A which reads an entry from a database and sends a request to module B containing ONLY the information from the entry module B would need to accomplish it's job (to keep modularity I just give it the information it needs - module B has nothing to do with the rest of the information from the read DB entry). Now after finishing it's job, module B has to reply to a module C if it succeeded or failed. To do this module B replies with the information it has gotten from module A and some variable meaning success or fail. Now here comes the problem: module C needs to find that entry again BUT the information it has gotten from module B is not enough to uniquely find the exact same entry again. I don't think that module A giving more information to module B which it doesn't need to do it's job but which it could then give back to module C would be a good practice because this would mean giving some module information it doesn't really need. What do you think?

Read the article
What are the difference between: agent, actor, dataflow based programming?

- by inf3rno

What are the difference between the following terms? agent-based programming agent-based programming with microagents actor-based programming actor-based programming with lightweight actors dataflow based programming It is hard to find comparing articles and they are very similar. Afaik they have different constraints and they are implemented on a different abstraction level, but I need some reassurance...

Read the article
SSIS Lookup component tuning tips

- by jamiet

Yesterday evening I attended a London meeting of the UK SQL Server User Group at Microsoft’s offices in London Victoria. As usual it was both a fun and informative evening and in particular there seemed to be a few questions arising about tuning the SSIS Lookup component; I rattled off some comments and figured it would be prudent to drop some of them into a dedicated blog post, hence the one you are reading right now. Scene setting A popular pattern in SSIS is to use a Lookup component to determine whether a record in the pipeline already exists in the intended destination table or not and I cover this pattern in my 2006 blog post Checking if a row exists and if it does, has it changed? (note to self: must rewrite that blog post for SSIS2008). Fundamentally the SSIS lookup component (when using FullCache option) sucks some data out of a database and holds it in memory so that it can be compared to data in the pipeline. One of the big benefits of using SSIS dataflows is that they process data one buffer at a time; that means that not all of the data from your source exists in the dataflow at the same time and is why a SSIS dataflow can process data volumes that far exceed the available memory. However, that only applies to data in the pipeline; for reasons that are hopefully obvious ALL of the data in the lookup set must exist in the memory cache for the duration of the dataflow’s execution which means that any memory used by the lookup cache will not be available to be used as a pipeline buffer. Moreover, there’s an obvious correlation between the amount of data in the lookup cache and the time it takes to charge that cache; the more data you have then the longer it will take to charge and the longer you have to wait until the dataflow actually starts to do anything. For these reasons your goal is simple: ensure that the lookup cache contains as little data as possible. General tips Here is a simple tick list you can follow in order to tune your lookups: Use a SQL statement to charge your cache, don’t just pick a table from the dropdown list made available to you. (Read why in SELECT *... or select from a dropdown in an OLE DB Source component?) Only pick the columns that you need, ignore everything else Make the database columns that your cache is populated from as narrow as possible. If a column is defined as VARCHAR(20) then SSIS will allocate 20 bytes for every value in that column – that is a big waste if the actual values are significantly less than 20 characters in length. Do you need DT_WSTR typed columns or will DT_STR suffice? DT_WSTR uses twice the amount of space to hold values that can be stored using a DT_STR so if you can use DT_STR, consider doing so. Same principle goes for the numerical datatypes DT_I2/DT_I4/DT_I8. Only populate the cache with data that you KNOW you will need. In other words, think about your WHERE clause! Thinking outside the box It is tempting to build a large monolithic dataflow that does many things, one of which is a Lookup. Often though you can make better use of your available resources by, well, mixing things up a little and here are a few ideas to get your creative juices flowing: There is no rule that says everything has to happen in a single dataflow. If you have some particularly resource intensive lookups then consider putting that lookup into a dataflow all of its own and using raw files to pass the pipeline data in and out of that dataflow. Know your data. If you think, for example, that the majority of your incoming rows will match with only a small subset of your lookup data then consider chaining multiple lookup components together; the first would use a FullCache containing that data subset and the remaining data that doesn’t find a match could be passed to a second lookup that perhaps uses a NoCache lookup thus negating the need to pull all of that least-used lookup data into memory. Do you need to process all of your incoming data all at once? If you can process different partitions of your data separately then you can partition your lookup cache as well. For example, if you are using a lookup to convert a location into a [LocationId] then why not process your data one region at a time? This will mean your lookup cache only has to contain data for the location that you are currently processing and with the ability of the Lookup in SSIS2008 and beyond to charge the cache using a dynamically built SQL statement you’ll be able to achieve it using the same dataflow and simply loop over it using a ForEach loop. Taking the previous data partitioning idea further … a dataflow can contain more than one data path so why not split your data using a conditional split component and, again, charge your lookup caches with only the data that they need for that partition. Lookups have two uses: to (1) find a matching row from the lookup set and (2) put attributes from that matching row into the pipeline. Ask yourself, do you need to do these two things at the same time? After all once you have the key column(s) from your lookup set then you can use that key to get the rest of attributes further downstream, perhaps even in another dataflow. Are you using the same lookup data set multiple times? If so, consider the file caching option in SSIS 2008 and beyond. Above all, experiment and be creative with different combinations. You may be surprised at what works. Final thoughts If you want to know more about how the Lookup component differs in SSIS2008 from SSIS2005 then I have a dedicated blog post about that at Lookup component gets a makeover. I am on a mini-crusade at the moment to get a BULK MERGE feature into the database engine, the thinking being that if the database engine can quickly merge massive amounts of data in a similar manner to how it can insert massive amounts using BULK INSERT then that’s a lot of work that wouldn’t have to be done in the SSIS pipeline. If you think that is a good idea then go and vote for BULK MERGE on Connect. If you have any other tips to share then please stick them in the comments. Hope this helps! @Jamiet Share this post: email it! | bookmark it! | digg it! | reddit! | kick it! | live it!

Read the article
How do I get SSIS Data Flow to put '0.00' in a flat file?

- by theog

I have an SSIS package with a Data Flow that takes an ADO.NET data source (just a small table), executes a select * query, and outputs the query results to a flat file (I've also tried just pulling the whole table and not using a SQL select). The problem is that the data source pulls a column that is a Money datatype, and if the value is not zero, it comes into the text flat file just fine (like '123.45'), but when the value is zero, it shows up in the destination flat file as '.00'. I need to know how to get the leading zero back into the flat file. I've tried various datatypes for the output (in the Flat File Connection Manager), including currency and string, but this seems to have no effect. I've tried a case statement in my select, like this: CASE WHEN columnValue = 0 THEN '0.00' ELSE columnValue END (still results in '.00') I've tried variations on that like this: CASE WHEN columnValue = 0 THEN convert(decimal(12,2), '0.00') ELSE convert(decimal(12,2), columnValue) END (Still results in '.00') and: CASE WHEN columnValue = 0 THEN convert(money, '0.00') ELSE convert(money, columnValue) END (results in '.0000000000000000000') This silly little issue is killin' me. Can anybody tell me how to get a zero Money datatype database value into a flat file as '0.00'?

Read the article

Search Results

Search found 79 results on 4 pages for 'dataflow'.

Page 1/4 | 1 2 3 4 | Next Page >

- by jamiet

- by Styrac

- by Maian

- by makerofthings7

- by msutherl

- by Jesse Carter

- by javarg

- by Alex Miller

- by jamiet

- by jamiet

- by Gerard

- by Jason W

- by Arjang

- by jamiet

- by jamiet

- by Angel Escobedo

- by favo

- by gekod

- by praveen

- by pokhara

- by gekod

- by inf3rno

- by jamiet

- by theog

1 2 3 4 | Next Page >