Search Results

Search found 28744 results on 1150 pages for 'higher order functions'.

Page 197/1150 | < Previous Page | 193 194 195 196 197 198 199 200 201 202 203 204 | Next Page >

Faster, secure, protocol/code required for long-distance transfer.

- by Chopper3

I've ran into a problem and I'm looking for a new secure protocol/client/server that's faster over a 1Gb/s fibre link - let me tell you the story... I have a pair of redundant, diversely-routed, 1Gb/s links over a distance of around 250 miles or so (not dark fibre but a dedicated point to point link, not a mesh). At the 'client' end I have a HP DL380 G5 (2 x dual-core 2.66Ghz Xeon's, 4GB, Windows 2003EE 32-bit), at the 'server' end I have a HP BL460c G6 (2 x quad-core 2.53Ghz Xeons, 48GB, Oracle Linux 5.3 64-bit). I need to transfer around 500 x 2GB files per week from the client to the server machines per week - but the transfer NEEDS to be secure. Using both iPerf or regular FTP I can get ~80MB/s of transfer pretty consistently, which is great. Using WinSCP or Windows SFTP I can't seem to get more that ~3-4MB/s, at this point the server's CPU is 3% busy while CPU0 of the client goes to ~30% utilised. We've tried editing various TCP window sizes with little success. Both ends are connected to quite low-usage Cisco Cat6509's with Sup720's. I can replace the client machine with a newer machine and/or move it to Linux - but this will take time. Clearly these single-threaded secure Windows clients are introducing too much latency doing their encryption. So a few questions/thoughts; Are there any higher performing secure protocols or client software for Windows that I could try? I'm pretty protocol-gnostic so long as it'll work between Windows and Linux. Should I be using hardware to do the encryption, either in the client or the network parts? If so what would you recommend? I'm not convinced that just swapping the server would be that much faster, the CPU was only at 30% but then again that's higher than I'd have expected given the load - moving to Linux at the client end may be a better idea but would be quite disruptive. Am I missing a trick? Thanks in advance.

Read the article
Simple electric DC question. Currency consumption

- by Bobb

Suppose you have DC power supply and a consumer connected to it (i.e. computer PSU and a hard drive). Suppose PSU which was supplied with the consumer has output 5V 1A. So I assume that the consumer should not consume more than 1A. Suppose the original PSU is broken now and I want to replace it with the one I have which is 5V 10A. My guess is that current is something which depends on the consumer. So if the consumer consumes normally 1A then it will not consume more than that even if it is connected to 10A PSU. In other word - am I right assuming that the consumer will not burn out being connected to a power supply with higher current output? P.S. my understanding is that voltage is something independent from the consumer. If you give it higher voltage it will burn (voltage is from PSU to the consumer). However current must be in opposite - consumer sucks as much current as it need not as much as PSU can provide (of course given that max PSU current is greater than the consumer needs)

Read the article
Rendering a frame is producing noise from speakers in Windows and Linux

- by Robber

When any hardware accelerated application is rendering a frame (or many of them) a very short noise is coming from my speakers. This can be a game, a WebGL application or XBMC. When the application/game is rendering many frames per second (like most of them do) the noise is a continuous buzzing that gets higher pitched with higher framerates. This applies to Linux and Windows, so I'd assume it's a hardware problem. The current hardware in the PC is: CPU: Core2Quad Q9550 GPU: Radeon HD 5770 RAM: 2x2GB DDR2 Motherboard: Asus P5QLD PRO PSU: be quiet! Pure Power 530W Screen and speakers: Old 720p LCD TV connected via VGA and aux cable Muting the TV stops the noise, muting Windows doesn't. I tried replacing the PSU first (used a Tagan 700W PSU before) because I thought it was a power problem. It wasn't. I tried replacing the motherboard (used a ASUS P5B SE before) next because I thought it was a sound card problem. It wasn't. I tried the GPU in a different PC because I thought it was a broken graphics card. It worked perfectly fine in the other PC. I thought it might be interference, but moving the audio cable around changes absolutely nothing. I tried using an HDMI cable instead and that did work, but is not an option since my TV has only one HDMI input and I need that for my PS3.

Read the article
reaching constant video bitrate by using ffmpeg

- by sebastian

We use ffmpeg and a transcoding script for transcoding and want to make some batch files which we can use for transcoding. For example I use a parameter called video_kbit and if I am writing in 30000 it should reach 30 Mbit. Of course if I use 6000 as parameter it should reach 6 MBit as well, so I have one script which reaches every video bitrate I want. As my settings are now, I only reach 18.1 Mbit. Only when I use 15000 as a parameter I am reaching my goal for a constant video bitrate of 15 MBit. If I use 8000 as parameter I get 10.1 MBit as a result. So under 15000, I get a higher bitrate and over 15000, I get a lower bitrate than I want. My presettings are: ffmpeg -threads "4" -i "$2" -f mp4 -c:v libx264 -crf 1 \ -bufsize 30000k -maxrate ${FC_PARAM_video_kbit}k \ -acodec libfaac -ac 2 -ab ${FC_PARAM_audio_kbit}k -ar 44100 \ -pix_fmt yuv420p -vf scale=${FC_PARAM_width}:${FC_PARAM_height} -y "$3" And I am using these parameters: FC_PARAM_video_kbit = 30000 FC_PARAM_audio_kbit = 192 FC_PARAM_width = 1920 FC_PARAM_height = 1080 I have tried using a higher bufsize and using profile:v and level settings, but nothing got me near the constant video bitrate of 30000 Mbit. Do you guys have any ideas or suggestions for a better way to reach my goal?

Read the article
LAMP Stack Version Help -- Is there a website or version tracker source to help suggest the right versions of each part of a platform stack?

- by Chris Adragna

Taken singly, it's easy to research versions and compatibility. Version information is readily available on each single part of a platform stack, such as MySQL. You can find out the latest version, stable version, and sometimes even the percentage of people adopting it by version (personally, I like seeing numbers on adoption rates). However, when trying to find the best possible mix of versions, I have a harder time. For example, "if you're using MySQL 5.5, you'll need PHP version XX or higher." It gets even more difficult to mitigate when you throw higher level platforms into the mix such as Drupal, Joomla, etc. I do consider "wizard" like installers to be beneficial, such as the Bitnami installers. However, I always wonder if those solutions cater more to the least common denominator -- be all to many -- and as such, I think I'd be better to install things on my own. Such solutions do seem kind of slow to adopt new versions, slower than necessary, I suspect. Is there a website or tool that consolidates versioning data in order to help a webmaster choose which versions to deploy or which upgrades to install, in consideration of all the other parts of the stack?

Read the article
Minecraft server hosting hardware specifications [on hold]

- by Andrew Wright

I am planning on purchasing a server to rent off Minecraft game servers, largely to friends. I am planning on purchasing a 128GB RAM server to save on colocation costs (as I am likely to need more than 32GB and would have to rent 2U of space...) I am hoping for some advice about the processing power needed to deal with this level of RAM. The servers will be run in a shared environment on linux in a VM to make backups easier. The server I have in mind is dual CPU. I have been considering at the low end dual Xeon E5-2609V2 Quad-Core 2.5Ghz, and at the high end dual Xeon E5-2650V2 Eight-Core 2.6Ghz. The difference between these is 6.4 GT/s and 8 GT/s and £3000 for the lower spec server, £4300 for the higher spec. I was hoping I could get advice about whether it is worth paying for the extra/higher speed processor or if I would be wasting my money? Thank you for any help - I appreciate that this is not directly related to professional system administration.

Read the article
Low 'Burst Rate' from SATA drive in HDTune?

- by UpTheCreek

I recently upgraded my laptop's v slow hard drive to a seagate momentus 7200. Everything is working fine, but I'm a bit confused by these benchmark results: The burst rate is significantly less than the Maximim transfer rate, and not much higher than the normal minimum (if you ignore the spikes). What's going on here? On the HDtune website it defines Burst Rate as: ...the highest speed (in megabytes per second) at which data can be transferred from the drive interface (IDE or SCSI for example) to the operating system. Which begs some questions... e.g. if this is the highest, then how did the bechmarking tool record the 103MB/sec maximum? And if this really is the true maximum, then where is the bottleneck? The laptops SATA interface is on an Intel 82801GBM southbridge controller. When I check in hardware manager, I see that it's driver is iaStor.sys from 2005. Maybe that's the issue? I'll look for a newever version, but any insights would be appreciated. Thanks UPDATE: Acorting to this page on the HDTune website... An important parameter of the test is the Burst Rate. This value should always be higher than the maximum transfer rate. A lower value is usually an indication of a configuration problem. So what might be the configuration problem?

Read the article
Sending mail with Gmail Account using System.Net.Mail in ASP.NET

- by Jalpesh P. Vadgama

Any web application is in complete without mail functionality you should have to write send mail functionality. Like if there is shopping cart application for example then when a order created on the shopping cart you need to send an email to administrator of website for Order notification and for customer you need to send an email of receipt of order. So any web application is not complete without sending email. This post is also all about sending email. In post I will explain that how we can send emails from our Gmail Account without purchasing any smtp server etc. There are some limitations for sending email from Gmail Account. Please note following things. Gmail will have fixed number of quota for sending emails per day. So you can not send more then that emails for the day. Your from email address always will be your account email address which you are using for sending email. You can not send an email to unlimited numbers of people. Gmail ant spamming policy will restrict this. Gmail provide both Popup and SMTP settings both should be active in your account where you testing. You can enable that via clicking on setting link in gmail account and go to Forwarding and POP/Imap. So if you are using mail functionality for limited emails then Gmail is Best option. But if you are sending thousand of email daily then it will not be Good Idea. Here is the code for sending mail from Gmail Account. using System.Net.Mail; namespace Experiement { public partial class WebForm1 : System.Web.UI.Page { protected void Page_Load(object sender,System.EventArgs e) { MailMessage mailMessage = new MailMessage(new MailAddress("[email protected]") ,new MailAddress("[email protected]")); mailMessage.Subject = "Sending mail through gmail account"; mailMessage.IsBodyHtml = true; mailMessage.Body = "<B>Sending mail thorugh gmail from asp.net</B>"; System.Net.NetworkCredential networkCredentials = new System.Net.NetworkCredential("[email protected]", "yourpassword"); SmtpClient smtpClient = new SmtpClient(); smtpClient.EnableSsl = true; smtpClient.UseDefaultCredentials = false; smtpClient.Credentials = networkCredentials; smtpClient.Host = "smtp.gmail.com"; smtpClient.Port = 587; smtpClient.Send(mailMessage); Response.Write("Mail Successfully sent"); } } } That’s run this application and you will get like below in your account. Technorati Tags: Gmail,System.NET.Mail,ASP.NET

Read the article
Rendering ASP.NET Script References into the Html Header

- by Rick Strahl

One thing that I’ve come to appreciate in control development in ASP.NET that use JavaScript is the ability to have more control over script and script include placement than ASP.NET provides natively. Specifically in ASP.NET you can use either the ClientScriptManager or ScriptManager to embed scripts and script references into pages via code. This works reasonably well, but the script references that get generated are generated into the HTML body and there’s very little operational control for placement of scripts. If you have multiple controls or several of the same control that need to place the same scripts onto the page it’s not difficult to end up with scripts that render in the wrong order and stop working correctly. This is especially critical if you load script libraries with dependencies either via resources or even if you are rendering referenced to CDN resources. Natively ASP.NET provides a host of methods that help embedding scripts into the page via either Page.ClientScript or the ASP.NET ScriptManager control (both with slightly different syntax): RegisterClientScriptBlock Renders a script block at the top of the HTML body and should be used for embedding callable functions/classes. RegisterStartupScript Renders a script block just prior to the </form> tag and should be used to for embedding code that should execute when the page is first loaded. Not recommended – use jQuery.ready() or equivalent load time routines. RegisterClientScriptInclude Embeds a reference to a script from a url into the page. RegisterClientScriptResource Embeds a reference to a Script from a resource file generating a long resource file string All 4 of these methods render their <script> tags into the HTML body. The script blocks give you a little bit of control by having a ‘top’ and ‘bottom’ of the document location which gives you some flexibility over script placement and precedence. Script includes and resource url unfortunately do not even get that much control – references are simply rendered into the page in the order of declaration. The ASP.NET ScriptManager control facilitates this task a little bit with the abililty to specify scripts in code and the ability to programmatically check what scripts have already been registered, but it doesn’t provide any more control over the script rendering process itself. Further the ScriptManager is a bear to deal with generically because generic code has to always check and see if it is actually present. Some time ago I posted a ClientScriptProxy class that helps with managing the latter process of sending script references either to ClientScript or ScriptManager if it’s available. Since I last posted about this there have been a number of improvements in this API, one of which is the ability to control placement of scripts and script includes in the page which I think is rather important and a missing feature in the ASP.NET native functionality. Handling ScriptRenderModes One of the big enhancements that I’ve come to rely on is the ability of the various script rendering functions described above to support rendering in multiple locations: /// <summary> /// Determines how scripts are included into the page /// </summary> public enum ScriptRenderModes { /// <summary> /// Inherits the setting from the control or from the ClientScript.DefaultScriptRenderMode /// </summary> Inherit, /// Renders the script include at the location of the control /// </summary> Inline, /// <summary> /// Renders the script include into the bottom of the header of the page /// </summary> Header, /// <summary> /// Renders the script include into the top of the header of the page /// </summary> HeaderTop, /// <summary> /// Uses ClientScript or ScriptManager to embed the script include to /// provide standard ASP.NET style rendering in the HTML body. /// </summary> Script, /// <summary> /// Renders script at the bottom of the page before the last Page.Controls /// literal control. Note this may result in unexpected behavior /// if /body and /html are not the last thing in the markup page. /// </summary> BottomOfPage } This enum is then applied to the various Register functions to allow more control over where scripts actually show up. Why is this useful? For me I often render scripts out of control resources and these scripts often include things like a JavaScript Library (jquery) and a few plug-ins. The order in which these can be loaded is critical so that jQuery.js always loads before any plug-in for example. Typically I end up with a general script layout like this: Core Libraries- HeaderTop Plug-ins: Header ScriptBlocks: Header or Script depending on other dependencies There’s also an option to render scripts and CSS at the very bottom of the page before the last Page control on the page which can be useful for speeding up page load when lots of scripts are loaded. The API syntax of the ClientScriptProxy methods is closely compatible with ScriptManager’s using static methods and control references to gain access to the page and embedding scripts. For example, to render some script into the current page in the header: // Create script block in header ClientScriptProxy.Current.RegisterClientScriptBlock(this, typeof(ControlResources), "hello_function", "function helloWorld() { alert('hello'); }", true, ScriptRenderModes.Header); // Same again - shouldn't be rendered because it's the same id ClientScriptProxy.Current.RegisterClientScriptBlock(this, typeof(ControlResources), "hello_function", "function helloWorld() { alert('hello'); }", true, ScriptRenderModes.Header); // Create a second script block in header ClientScriptProxy.Current.RegisterClientScriptBlock(this, typeof(ControlResources), "hello_function2", "function helloWorld2() { alert('hello2'); }", true, ScriptRenderModes.Header); // This just calls ClientScript and renders into bottom of document ClientScriptProxy.Current.RegisterStartupScript(this,typeof(ControlResources), "call_hello", "helloWorld();helloWorld2();", true); which generates: <html xmlns="http://www.w3.org/1999/xhtml" > <head><title> </title> <script type="text/javascript"> function helloWorld() { alert('hello'); } </script> <script type="text/javascript"> function helloWorld2() { alert('hello2'); } </script> </head> <body> … <script type="text/javascript"> //<![CDATA[ helloWorld();helloWorld2();//]]> </script> </form> </body> </html> Note that the scripts are generated into the header rather than the body except for the last script block which is the call to RegisterStartupScript. In general I wouldn’t recommend using RegisterStartupScript – ever. It’s a much better practice to use a script base load event to handle ‘startup’ code that should fire when the page first loads. So instead of the code above I’d actually recommend doing: ClientScriptProxy.Current.RegisterClientScriptBlock(this, typeof(ControlResources), "call_hello", "$().ready( function() { alert('hello2'); });", true, ScriptRenderModes.Header); assuming you’re using jQuery on the page. For script includes from a Url the following demonstrates how to embed scripts into the header. This example injects a jQuery and jQuery.UI script reference from the Google CDN then checks each with a script block to ensure that it has loaded and if not loads it from a server local location: // load jquery from CDN ClientScriptProxy.Current.RegisterClientScriptInclude(this, typeof(ControlResources), "http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js", ScriptRenderModes.HeaderTop); // check if jquery loaded - if it didn't we're not online string scriptCheck = @"if (typeof jQuery != 'object') document.write(unescape(""%3Cscript src='{0}' type='text/javascript'%3E%3C/script%3E""));"; string jQueryUrl = ClientScriptProxy.Current.GetWebResourceUrl(this, typeof(ControlResources), ControlResources.JQUERY_SCRIPT_RESOURCE); ClientScriptProxy.Current.RegisterClientScriptBlock(this, typeof(ControlResources), "jquery_register", string.Format(scriptCheck,jQueryUrl),true, ScriptRenderModes.HeaderTop); // Load jquery-ui from cdn ClientScriptProxy.Current.RegisterClientScriptInclude(this, typeof(ControlResources), "http://ajax.googleapis.com/ajax/libs/jqueryui/1.7.2/jquery-ui.min.js", ScriptRenderModes.Header); // check if we need to load from local string jQueryUiUrl = ResolveUrl("~/scripts/jquery-ui-custom.min.js"); ClientScriptProxy.Current.RegisterClientScriptBlock(this, typeof(ControlResources), "jqueryui_register", string.Format(scriptCheck, jQueryUiUrl), true, ScriptRenderModes.Header); // Create script block in header ClientScriptProxy.Current.RegisterClientScriptBlock(this, typeof(ControlResources), "hello_function", "$().ready( function() { alert('hello'); });", true, ScriptRenderModes.Header); which in turn generates this HTML: <html xmlns="http://www.w3.org/1999/xhtml" > <head> <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js" type="text/javascript"></script> <script type="text/javascript"> if (typeof jQuery != 'object') document.write(unescape("%3Cscript src='/WestWindWebToolkitWeb/WebResource.axd?d=DIykvYhJ_oXCr-TA_dr35i4AayJoV1mgnQAQGPaZsoPM2LCdvoD3cIsRRitHKlKJfV5K_jQvylK7tsqO3lQIFw2&t=633979863959332352' type='text/javascript'%3E%3C/script%3E")); </script> <title> </title> <script src="http://ajax.googleapis.com/ajax/libs/jqueryui/1.7.2/jquery-ui.min.js" type="text/javascript"></script> <script type="text/javascript"> if (typeof jQuery != 'object') document.write(unescape("%3Cscript src='/WestWindWebToolkitWeb/scripts/jquery-ui-custom.min.js' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> $().ready(function() { alert('hello'); }); </script> </head> <body> …</body> </html> As you can see there’s a bit more control in this process as you can inject both script includes and script blocks into the document at the top or bottom of the header, plus if necessary at the usual body locations. This is quite useful especially if you create custom server controls that interoperate with script and have certain dependencies. The above is a good example of a useful switchable routine where you can switch where scripts load from by default – the above pulls from Google CDN but a configuration switch may automatically switch to pull from the local development copies if your doing development for example. How does it work? As mentioned the ClientScriptProxy object mimicks many of the ScriptManager script related methods and so provides close API compatibility with it although it contains many additional overloads that enhance functionality. It does however work against ScriptManager if it’s available on the page, or Page.ClientScript if it’s not so it provides a single unified frontend to script access. There are however many overloads of the original SM methods like the above to provide additional functionality. The implementation of script header rendering is pretty straight forward – as long as a server header (ie. it has to have runat=”server” set) is available. Otherwise these routines fall back to using the default document level insertions of ScriptManager/ClientScript. Given that there is a server header it’s relatively easy to generate the script tags and code and append them to the header either at the top or bottom. I suspect Microsoft didn’t provide header rendering functionality precisely because a runat=”server” header is not required by ASP.NET so behavior would be slightly unpredictable. That’s not really a problem for a custom implementation however. Here’s the RegisterClientScriptBlock implementation that takes a ScriptRenderModes parameter to allow header rendering: /// <summary> /// Renders client script block with the option of rendering the script block in /// the Html header /// /// For this to work Header must be defined as runat="server" /// </summary> /// <param name="control">any control that instance typically page</param> /// <param name="type">Type that identifies this rendering</param> /// <param name="key">unique script block id</param> /// <param name="script">The script code to render</param> /// <param name="addScriptTags">Ignored for header rendering used for all other insertions</param> /// <param name="renderMode">Where the block is rendered</param> public void RegisterClientScriptBlock(Control control, Type type, string key, string script, bool addScriptTags, ScriptRenderModes renderMode) { if (renderMode == ScriptRenderModes.Inherit) renderMode = DefaultScriptRenderMode; if (control.Page.Header == null || renderMode != ScriptRenderModes.HeaderTop && renderMode != ScriptRenderModes.Header && renderMode != ScriptRenderModes.BottomOfPage) { RegisterClientScriptBlock(control, type, key, script, addScriptTags); return; } // No dupes - ref script include only once const string identifier = "scriptblock_"; if (HttpContext.Current.Items.Contains(identifier + key)) return; HttpContext.Current.Items.Add(identifier + key, string.Empty); StringBuilder sb = new StringBuilder(); // Embed in header sb.AppendLine("\r\n<script type=\"text/javascript\">"); sb.AppendLine(script); sb.AppendLine("</script>"); int? index = HttpContext.Current.Items["__ScriptResourceIndex"] as int?; if (index == null) index = 0; if (renderMode == ScriptRenderModes.HeaderTop) { control.Page.Header.Controls.AddAt(index.Value, new LiteralControl(sb.ToString())); index++; } else if(renderMode == ScriptRenderModes.Header) control.Page.Header.Controls.Add(new LiteralControl(sb.ToString())); else if (renderMode == ScriptRenderModes.BottomOfPage) control.Page.Controls.AddAt(control.Page.Controls.Count-1,new LiteralControl(sb.ToString())); HttpContext.Current.Items["__ScriptResourceIndex"] = index; } Note that the routine has to keep track of items inserted by id so that if the same item is added again with the same key it won’t generate two script entries. Additionally the code has to keep track of how many insertions have been made at the top of the document so that entries are added in the proper order. The RegisterScriptInclude method is similar but there’s some additional logic in here to deal with script file references and ClientScriptProxy’s (optional) custom resource handler that provides script compression /// <summary> /// Registers a client script reference into the page with the option to specify /// the script location in the page /// </summary> /// <param name="control">Any control instance - typically page</param> /// <param name="type">Type that acts as qualifier (uniqueness)</param> /// <param name="url">the Url to the script resource</param> /// <param name="ScriptRenderModes">Determines where the script is rendered</param> public void RegisterClientScriptInclude(Control control, Type type, string url, ScriptRenderModes renderMode) { const string STR_ScriptResourceIndex = "__ScriptResourceIndex"; if (string.IsNullOrEmpty(url)) return; if (renderMode == ScriptRenderModes.Inherit) renderMode = DefaultScriptRenderMode; // Extract just the script filename string fileId = null; // Check resource IDs and try to match to mapped file resources // Used to allow scripts not to be loaded more than once whether // embedded manually (script tag) or via resources with ClientScriptProxy if (url.Contains(".axd?r=")) { string res = HttpUtility.UrlDecode( StringUtils.ExtractString(url, "?r=", "&", false, true) ); foreach (ScriptResourceAlias item in ScriptResourceAliases) { if (item.Resource == res) { fileId = item.Alias + ".js"; break; } } if (fileId == null) fileId = url.ToLower(); } else fileId = Path.GetFileName(url).ToLower(); // No dupes - ref script include only once const string identifier = "script_"; if (HttpContext.Current.Items.Contains( identifier + fileId ) ) return; HttpContext.Current.Items.Add(identifier + fileId, string.Empty); // just use script manager or ClientScriptManager if (control.Page.Header == null || renderMode == ScriptRenderModes.Script || renderMode == ScriptRenderModes.Inline) { RegisterClientScriptInclude(control, type,url, url); return; } // Retrieve script index in header int? index = HttpContext.Current.Items[STR_ScriptResourceIndex] as int?; if (index == null) index = 0; StringBuilder sb = new StringBuilder(256); url = WebUtils.ResolveUrl(url); // Embed in header sb.AppendLine("\r\n<script src=\"" + url + "\" type=\"text/javascript\"></script>"); if (renderMode == ScriptRenderModes.HeaderTop) { control.Page.Header.Controls.AddAt(index.Value, new LiteralControl(sb.ToString())); index++; } else if (renderMode == ScriptRenderModes.Header) control.Page.Header.Controls.Add(new LiteralControl(sb.ToString())); else if (renderMode == ScriptRenderModes.BottomOfPage) control.Page.Controls.AddAt(control.Page.Controls.Count-1, new LiteralControl(sb.ToString())); HttpContext.Current.Items[STR_ScriptResourceIndex] = index; } There’s a little more code here that deals with cleaning up the passed in Url and also some custom handling of script resources that run through the ScriptCompressionModule – any script resources loaded in this fashion are automatically cached based on the resource id. Raw urls extract just the filename from the URL and cache based on that. All of this to avoid doubling up of scripts if called multiple times by multiple instances of the same control for example or several controls that all load the same resources/includes. Finally RegisterClientScriptResource utilizes the previous method to wrap the WebResourceUrl as well as some custom functionality for the resource compression module: /// <summary> /// Returns a WebResource or ScriptResource URL for script resources that are to be /// embedded as script includes. /// </summary> /// <param name="control">Any control</param> /// <param name="type">A type in assembly where resources are located</param> /// <param name="resourceName">Name of the resource to load</param> /// <param name="renderMode">Determines where in the document the link is rendered</param> public void RegisterClientScriptResource(Control control, Type type, string resourceName, ScriptRenderModes renderMode) { string resourceUrl = GetClientScriptResourceUrl(control, type, resourceName); RegisterClientScriptInclude(control, type, resourceUrl, renderMode); } /// <summary> /// Works like GetWebResourceUrl but can be used with javascript resources /// to allow using of resource compression (if the module is loaded). /// </summary> /// <param name="control"></param> /// <param name="type"></param> /// <param name="resourceName"></param> /// <returns></returns> public string GetClientScriptResourceUrl(Control control, Type type, string resourceName) { #if IncludeScriptCompressionModuleSupport // If wwScriptCompression Module through Web.config is loaded use it to compress // script resources by using wcSC.axd Url the module intercepts if (ScriptCompressionModule.ScriptCompressionModuleActive) { string url = "~/wwSC.axd?r=" + HttpUtility.UrlEncode(resourceName); if (type.Assembly != GetType().Assembly) url += "&t=" + HttpUtility.UrlEncode(type.FullName); return WebUtils.ResolveUrl(url); } #endif return control.Page.ClientScript.GetWebResourceUrl(type, resourceName); } This code merely retrieves the resource URL and then simply calls back to RegisterClientScriptInclude with the URL to be embedded which means there’s nothing specific to deal with other than the custom compression module logic which is nice and easy. What else is there in ClientScriptProxy? ClientscriptProxy also provides a few other useful services beyond what I’ve already covered here: Transparent ScriptManager and ClientScript calls ClientScriptProxy includes a host of routines that help figure out whether a script manager is available or not and all functions in this class call the appropriate object – ScriptManager or ClientScript – that is available in the current page to ensure that scripts get embedded into pages properly. This is especially useful for control development where controls have no control over the scripting environment in place on the page. RegisterCssLink and RegisterCssResource Much like the script embedding functions these two methods allow embedding of CSS links. CSS links are appended to the header or to a form declared with runat=”server”. LoadControlScript Is a high level resource loading routine that can be used to easily switch between different script linking modes. It supports loading from a WebResource, a url or not loading anything at all. This is very useful if you build controls that deal with specification of resource urls/ids in a standard way. Check out the full Code You can check out the full code to the ClientScriptProxyClass here: ClientScriptProxy.cs ClientScriptProxy Documentation (class reference) Note that the ClientScriptProxy has a few dependencies in the West Wind Web Toolkit of which it is part of. ControlResources holds a few standard constants and script resource links and the ScriptCompressionModule which is referenced in a few of the script inclusion methods. There’s also another useful ScriptContainer companion control to the ClientScriptProxy that allows scripts to be placed onto the page’s markup including the ability to specify the script location and script minification options. You can find all the dependencies in the West Wind Web Toolkit repository: West Wind Web Toolkit Repository West Wind Web Toolkit Home Page© Rick Strahl, West Wind Technologies, 2005-2010Posted in ASP.NET JavaScript

Read the article
Improving Partitioned Table Join Performance

- by Paul White

The query optimizer does not always choose an optimal strategy when joining partitioned tables. This post looks at an example, showing how a manual rewrite of the query can almost double performance, while reducing the memory grant to almost nothing. Test Data The two tables in this example use a common partitioning partition scheme. The partition function uses 41 equal-size partitions: CREATE PARTITION FUNCTION PFT (integer) AS RANGE RIGHT FOR VALUES ( 125000, 250000, 375000, 500000, 625000, 750000, 875000, 1000000, 1125000, 1250000, 1375000, 1500000, 1625000, 1750000, 1875000, 2000000, 2125000, 2250000, 2375000, 2500000, 2625000, 2750000, 2875000, 3000000, 3125000, 3250000, 3375000, 3500000, 3625000, 3750000, 3875000, 4000000, 4125000, 4250000, 4375000, 4500000, 4625000, 4750000, 4875000, 5000000 ); GO CREATE PARTITION SCHEME PST AS PARTITION PFT ALL TO ([PRIMARY]); There two tables are: CREATE TABLE dbo.T1 ( TID integer NOT NULL IDENTITY(0,1), Column1 integer NOT NULL, Padding binary(100) NOT NULL DEFAULT 0x, CONSTRAINT PK_T1 PRIMARY KEY CLUSTERED (TID) ON PST (TID) ); CREATE TABLE dbo.T2 ( TID integer NOT NULL, Column1 integer NOT NULL, Padding binary(100) NOT NULL DEFAULT 0x, CONSTRAINT PK_T2 PRIMARY KEY CLUSTERED (TID, Column1) ON PST (TID) ); The next script loads 5 million rows into T1 with a pseudo-random value between 1 and 5 for Column1. The table is partitioned on the IDENTITY column TID: INSERT dbo.T1 WITH (TABLOCKX) (Column1) SELECT (ABS(CHECKSUM(NEWID())) % 5) + 1 FROM dbo.Numbers AS N WHERE n BETWEEN 1 AND 5000000; In case you don’t already have an auxiliary table of numbers lying around, here’s a script to create one with 10 million rows: CREATE TABLE dbo.Numbers (n bigint PRIMARY KEY); WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1), L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B), L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B), L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B), L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B), L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B), Nums AS(SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L5) INSERT dbo.Numbers WITH (TABLOCKX) SELECT TOP (10000000) n FROM Nums ORDER BY n OPTION (MAXDOP 1); Table T1 contains data like this: Next we load data into table T2. The relationship between the two tables is that table 2 contains ‘n’ rows for each row in table 1, where ‘n’ is determined by the value in Column1 of table T1. There is nothing particularly special about the data or distribution, by the way. INSERT dbo.T2 WITH (TABLOCKX) (TID, Column1) SELECT T.TID, N.n FROM dbo.T1 AS T JOIN dbo.Numbers AS N ON N.n >= 1 AND N.n <= T.Column1; Table T2 ends up containing about 15 million rows: The primary key for table T2 is a combination of TID and Column1. The data is partitioned according to the value in column TID alone. Partition Distribution The following query shows the number of rows in each partition of table T1: SELECT PartitionID = CA1.P, NumRows = COUNT_BIG(*) FROM dbo.T1 AS T CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P) GROUP BY CA1.P ORDER BY CA1.P; There are 40 partitions containing 125,000 rows (40 * 125k = 5m rows). The rightmost partition remains empty. The next query shows the distribution for table 2: SELECT PartitionID = CA1.P, NumRows = COUNT_BIG(*) FROM dbo.T2 AS T CROSS APPLY (VALUES ($PARTITION.PFT(TID))) AS CA1 (P) GROUP BY CA1.P ORDER BY CA1.P; There are roughly 375,000 rows in each partition (the rightmost partition is also empty): Ok, that’s the test data done. Test Query and Execution Plan The task is to count the rows resulting from joining tables 1 and 2 on the TID column: SET STATISTICS IO ON; DECLARE @s datetime2 = SYSUTCDATETIME(); SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID; SELECT DATEDIFF(Millisecond, @s, SYSUTCDATETIME()); SET STATISTICS IO OFF; The optimizer chooses a plan using parallel hash join, and partial aggregation: The Plan Explorer plan tree view shows accurate cardinality estimates and an even distribution of rows across threads (click to enlarge the image): With a warm data cache, the STATISTICS IO output shows that no physical I/O was needed, and all 41 partitions were touched: Running the query without actual execution plan or STATISTICS IO information for maximum performance, the query returns in around 2600ms. Execution Plan Analysis The first step toward improving on the execution plan produced by the query optimizer is to understand how it works, at least in outline. The two parallel Clustered Index Scans use multiple threads to read rows from tables T1 and T2. Parallel scan uses a demand-based scheme where threads are given page(s) to scan from the table as needed. This arrangement has certain important advantages, but does result in an unpredictable distribution of rows amongst threads. The point is that multiple threads cooperate to scan the whole table, but it is impossible to predict which rows end up on which threads. For correct results from the parallel hash join, the execution plan has to ensure that rows from T1 and T2 that might join are processed on the same thread. For example, if a row from T1 with join key value ‘1234’ is placed in thread 5’s hash table, the execution plan must guarantee that any rows from T2 that also have join key value ‘1234’ probe thread 5’s hash table for matches. The way this guarantee is enforced in this parallel hash join plan is by repartitioning rows to threads after each parallel scan. The two repartitioning exchanges route rows to threads using a hash function over the hash join keys. The two repartitioning exchanges use the same hash function so rows from T1 and T2 with the same join key must end up on the same hash join thread. Expensive Exchanges This business of repartitioning rows between threads can be very expensive, especially if a large number of rows is involved. The execution plan selected by the optimizer moves 5 million rows through one repartitioning exchange and around 15 million across the other. As a first step toward removing these exchanges, consider the execution plan selected by the optimizer if we join just one partition from each table, disallowing parallelism: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = 1 AND $PARTITION.PFT(T2.TID) = 1 OPTION (MAXDOP 1); The optimizer has chosen a (one-to-many) merge join instead of a hash join. The single-partition query completes in around 100ms. If everything scaled linearly, we would expect that extending this strategy to all 40 populated partitions would result in an execution time around 4000ms. Using parallelism could reduce that further, perhaps to be competitive with the parallel hash join chosen by the optimizer. This raises a question. If the most efficient way to join one partition from each of the tables is to use a merge join, why does the optimizer not choose a merge join for the full query? Forcing a Merge Join Let’s force the optimizer to use a merge join on the test query using a hint: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (MERGE JOIN); This is the execution plan selected by the optimizer: This plan results in the same number of logical reads reported previously, but instead of 2600ms the query takes 5000ms. The natural explanation for this drop in performance is that the merge join plan is only using a single thread, whereas the parallel hash join plan could use multiple threads. Parallel Merge Join We can get a parallel merge join plan using the same query hint as before, and adding trace flag 8649: SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (MERGE JOIN, QUERYTRACEON 8649); The execution plan is: This looks promising. It uses a similar strategy to distribute work across threads as seen for the parallel hash join. In practice though, performance is disappointing. On a typical run, the parallel merge plan runs for around 8400ms; slower than the single-threaded merge join plan (5000ms) and much worse than the 2600ms for the parallel hash join. We seem to be going backwards! The logical reads for the parallel merge are still exactly the same as before, with no physical IOs. The cardinality estimates and thread distribution are also still very good (click to enlarge): A big clue to the reason for the poor performance is shown in the wait statistics (captured by Plan Explorer Pro): CXPACKET waits require careful interpretation, and are most often benign, but in this case excessive waiting occurs at the repartitioning exchanges. Unlike the parallel hash join, the repartitioning exchanges in this plan are order-preserving ‘merging’ exchanges (because merge join requires ordered inputs): Parallelism works best when threads can just grab any available unit of work and get on with processing it. Preserving order introduces inter-thread dependencies that can easily lead to significant waits occurring. In extreme cases, these dependencies can result in an intra-query deadlock, though the details of that will have to wait for another time to explore in detail. The potential for waits and deadlocks leads the query optimizer to cost parallel merge join relatively highly, especially as the degree of parallelism (DOP) increases. This high costing resulted in the optimizer choosing a serial merge join rather than parallel in this case. The test results certainly confirm its reasoning. Collocated Joins In SQL Server 2008 and later, the optimizer has another available strategy when joining tables that share a common partition scheme. This strategy is a collocated join, also known as as a per-partition join. It can be applied in both serial and parallel execution plans, though it is limited to 2-way joins in the current optimizer. Whether the optimizer chooses a collocated join or not depends on cost estimation. The primary benefits of a collocated join are that it eliminates an exchange and requires less memory, as we will see next. Costing and Plan Selection The query optimizer did consider a collocated join for our original query, but it was rejected on cost grounds. The parallel hash join with repartitioning exchanges appeared to be a cheaper option. There is no query hint to force a collocated join, so we have to mess with the costing framework to produce one for our test query. Pretending that IOs cost 50 times more than usual is enough to convince the optimizer to use collocated join with our test query: -- Pretend IOs are 50x cost temporarily DBCC SETIOWEIGHT(50); -- Co-located hash join SELECT COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID OPTION (RECOMPILE); -- Reset IO costing DBCC SETIOWEIGHT(1); Collocated Join Plan The estimated execution plan for the collocated join is: The Constant Scan contains one row for each partition of the shared partitioning scheme, from 1 to 41. The hash repartitioning exchanges seen previously are replaced by a single Distribute Streams exchange using Demand partitioning. Demand partitioning means that the next partition id is given to the next parallel thread that asks for one. My test machine has eight logical processors, and all are available for SQL Server to use. As a result, there are eight threads in the single parallel branch in this plan, each processing one partition from each table at a time. Once a thread finishes processing a partition, it grabs a new partition number from the Distribute Streams exchange…and so on until all partitions have been processed. It is important to understand that the parallel scans in this plan are different from the parallel hash join plan. Although the scans have the same parallelism icon, tables T1 and T2 are not being co-operatively scanned by multiple threads in the same way. Each thread reads a single partition of T1 and performs a hash match join with the same partition from table T2. The properties of the two Clustered Index Scans show a Seek Predicate (unusual for a scan!) limiting the rows to a single partition: The crucial point is that the join between T1 and T2 is on TID, and TID is the partitioning column for both tables. A thread that processes partition ‘n’ is guaranteed to see all rows that can possibly join on TID for that partition. In addition, no other thread will see rows from that partition, so this removes the need for repartitioning exchanges. CPU and Memory Efficiency Improvements The collocated join has removed two expensive repartitioning exchanges and added a single exchange processing 41 rows (one for each partition id). Remember, the parallel hash join plan exchanges had to process 5 million and 15 million rows. The amount of processor time spent on exchanges will be much lower in the collocated join plan. In addition, the collocated join plan has a maximum of 8 threads processing single partitions at any one time. The 41 partitions will all be processed eventually, but a new partition is not started until a thread asks for it. Threads can reuse hash table memory for the new partition. The parallel hash join plan also had 8 hash tables, but with all 5,000,000 build rows loaded at the same time. The collocated plan needs memory for only 8 * 125,000 = 1,000,000 rows at any one time. Collocated Hash Join Performance The collated join plan has disappointing performance in this case. The query runs for around 25,300ms despite the same IO statistics as usual. This is much the worst result so far, so what went wrong? It turns out that cardinality estimation for the single partition scans of table T1 is slightly low. The properties of the Clustered Index Scan of T1 (graphic immediately above) show the estimation was for 121,951 rows. This is a small shortfall compared with the 125,000 rows actually encountered, but it was enough to cause the hash join to spill to physical tempdb: A level 1 spill doesn’t sound too bad, until you realize that the spill to tempdb probably occurs for each of the 41 partitions. As a side note, the cardinality estimation error is a little surprising because the system tables accurately show there are 125,000 rows in every partition of T1. Unfortunately, the optimizer uses regular column and index statistics to derive cardinality estimates here rather than system table information (e.g. sys.partitions). Collocated Merge Join We will never know how well the collocated parallel hash join plan might have worked without the cardinality estimation error (and the resulting 41 spills to tempdb) but we do know: Merge join does not require a memory grant; and Merge join was the optimizer’s preferred join option for a single partition join Putting this all together, what we would really like to see is the same collocated join strategy, but using merge join instead of hash join. Unfortunately, the current query optimizer cannot produce a collocated merge join; it only knows how to do collocated hash join. So where does this leave us? CROSS APPLY sys.partitions We can try to write our own collocated join query. We can use sys.partitions to find the partition numbers, and CROSS APPLY to get a count per partition, with a final step to sum the partial counts. The following query implements this idea: SELECT row_count = SUM(Subtotals.cnt) FROM ( -- Partition numbers SELECT p.partition_number FROM sys.partitions AS p WHERE p.[object_id] = OBJECT_ID(N'T1', N'U') AND p.index_id = 1 ) AS P CROSS APPLY ( -- Count per collocated join SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals; The estimated plan is: The cardinality estimates aren’t all that good here, especially the estimate for the scan of the system table underlying the sys.partitions view. Nevertheless, the plan shape is heading toward where we would like to be. Each partition number from the system table results in a per-partition scan of T1 and T2, a one-to-many Merge Join, and a Stream Aggregate to compute the partial counts. The final Stream Aggregate just sums the partial counts. Execution time for this query is around 3,500ms, with the same IO statistics as always. This compares favourably with 5,000ms for the serial plan produced by the optimizer with the OPTION (MERGE JOIN) hint. This is another case of the sum of the parts being less than the whole – summing 41 partial counts from 41 single-partition merge joins is faster than a single merge join and count over all partitions. Even so, this single-threaded collocated merge join is not as quick as the original parallel hash join plan, which executed in 2,600ms. On the positive side, our collocated merge join uses only one logical processor and requires no memory grant. The parallel hash join plan used 16 threads and reserved 569 MB of memory: Using a Temporary Table Our collocated merge join plan should benefit from parallelism. The reason parallelism is not being used is that the query references a system table. We can work around that by writing the partition numbers to a temporary table (or table variable): SET STATISTICS IO ON; DECLARE @s datetime2 = SYSUTCDATETIME(); CREATE TABLE #P ( partition_number integer PRIMARY KEY); INSERT #P (partition_number) SELECT p.partition_number FROM sys.partitions AS p WHERE p.[object_id] = OBJECT_ID(N'T1', N'U') AND p.index_id = 1; SELECT row_count = SUM(Subtotals.cnt) FROM #P AS p CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals; DROP TABLE #P; SELECT DATEDIFF(Millisecond, @s, SYSUTCDATETIME()); SET STATISTICS IO OFF; Using the temporary table adds a few logical reads, but the overall execution time is still around 3500ms, indistinguishable from the same query without the temporary table. The problem is that the query optimizer still doesn’t choose a parallel plan for this query, though the removal of the system table reference means that it could if it chose to: In fact the optimizer did enter the parallel plan phase of query optimization (running search 1 for a second time): Unfortunately, the parallel plan found seemed to be more expensive than the serial plan. This is a crazy result, caused by the optimizer’s cost model not reducing operator CPU costs on the inner side of a nested loops join. Don’t get me started on that, we’ll be here all night. In this plan, everything expensive happens on the inner side of a nested loops join. Without a CPU cost reduction to compensate for the added cost of exchange operators, candidate parallel plans always look more expensive to the optimizer than the equivalent serial plan. Parallel Collocated Merge Join We can produce the desired parallel plan using trace flag 8649 again: SELECT row_count = SUM(Subtotals.cnt) FROM #P AS p CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals OPTION (QUERYTRACEON 8649); The actual execution plan is: One difference between this plan and the collocated hash join plan is that a Repartition Streams exchange operator is used instead of Distribute Streams. The effect is similar, though not quite identical. The Repartition uses round-robin partitioning, meaning the next partition id is pushed to the next thread in sequence. The Distribute Streams exchange seen earlier used Demand partitioning, meaning the next partition id is pulled across the exchange by the next thread that is ready for more work. There are subtle performance implications for each partitioning option, but going into that would again take us too far off the main point of this post. Performance The important thing is the performance of this parallel collocated merge join – just 1350ms on a typical run. The list below shows all the alternatives from this post (all timings include creation, population, and deletion of the temporary table where appropriate) from quickest to slowest: Collocated parallel merge join: 1350ms Parallel hash join: 2600ms Collocated serial merge join: 3500ms Serial merge join: 5000ms Parallel merge join: 8400ms Collated parallel hash join: 25,300ms (hash spill per partition) The parallel collocated merge join requires no memory grant (aside from a paltry 1.2MB used for exchange buffers). This plan uses 16 threads at DOP 8; but 8 of those are (rather pointlessly) allocated to the parallel scan of the temporary table. These are minor concerns, but it turns out there is a way to address them if it bothers you. Parallel Collocated Merge Join with Demand Partitioning This final tweak replaces the temporary table with a hard-coded list of partition ids (dynamic SQL could be used to generate this query from sys.partitions): SELECT row_count = SUM(Subtotals.cnt) FROM ( VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10), (11),(12),(13),(14),(15),(16),(17),(18),(19),(20), (21),(22),(23),(24),(25),(26),(27),(28),(29),(30), (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41) ) AS P (partition_number) CROSS APPLY ( SELECT cnt = COUNT_BIG(*) FROM dbo.T1 AS T1 JOIN dbo.T2 AS T2 ON T2.TID = T1.TID WHERE $PARTITION.PFT(T1.TID) = p.partition_number AND $PARTITION.PFT(T2.TID) = p.partition_number ) AS SubTotals OPTION (QUERYTRACEON 8649); The actual execution plan is: The parallel collocated hash join plan is reproduced below for comparison: The manual rewrite has another advantage that has not been mentioned so far: the partial counts (per partition) can be computed earlier than the partial counts (per thread) in the optimizer’s collocated join plan. The earlier aggregation is performed by the extra Stream Aggregate under the nested loops join. The performance of the parallel collocated merge join is unchanged at around 1350ms. Final Words It is a shame that the current query optimizer does not consider a collocated merge join (Connect item closed as Won’t Fix). The example used in this post showed an improvement in execution time from 2600ms to 1350ms using a modestly-sized data set and limited parallelism. In addition, the memory requirement for the query was almost completely eliminated – down from 569MB to 1.2MB. The problem with the parallel hash join selected by the optimizer is that it attempts to process the full data set all at once (albeit using eight threads). It requires a large memory grant to hold all 5 million rows from table T1 across the eight hash tables, and does not take advantage of the divide-and-conquer opportunity offered by the common partitioning. The great thing about the collocated join strategies is that each parallel thread works on a single partition from both tables, reading rows, performing the join, and computing a per-partition subtotal, before moving on to a new partition. From a thread’s point of view… If you have trouble visualizing what is happening from just looking at the parallel collocated merge join execution plan, let’s look at it again, but from the point of view of just one thread operating between the two Parallelism (exchange) operators. Our thread picks up a single partition id from the Distribute Streams exchange, and starts a merge join using ordered rows from partition 1 of table T1 and partition 1 of table T2. By definition, this is all happening on a single thread. As rows join, they are added to a (per-partition) count in the Stream Aggregate immediately above the Merge Join. Eventually, either T1 (partition 1) or T2 (partition 1) runs out of rows and the merge join stops. The per-partition count from the aggregate passes on through the Nested Loops join to another Stream Aggregate, which is maintaining a per-thread subtotal. Our same thread now picks up a new partition id from the exchange (say it gets id 9 this time). The count in the per-partition aggregate is reset to zero, and the processing of partition 9 of both tables proceeds just as it did for partition 1, and on the same thread. Each thread picks up a single partition id and processes all the data for that partition, completely independently from other threads working on other partitions. One thread might eventually process partitions (1, 9, 17, 25, 33, 41) while another is concurrently processing partitions (2, 10, 18, 26, 34) and so on for the other six threads at DOP 8. The point is that all 8 threads can execute independently and concurrently, continuing to process new partitions until the wider job (of which the thread has no knowledge!) is done. This divide-and-conquer technique can be much more efficient than simply splitting the entire workload across eight threads all at once. Related Reading Understanding and Using Parallelism in SQL Server Parallel Execution Plans Suck © 2013 Paul White – All Rights Reserved Twitter: @SQL_Kiwi

Read the article
Getting the innermost .NET Exception

- by Rick Strahl

Here's a trivial but quite useful function that I frequently need in dynamic execution of code: Finding the innermost exception when an exception occurs, because for many operations (for example Reflection invocations or Web Service calls) the top level errors returned can be rather generic. A good example - common with errors in Reflection making a method invocation - is this generic error: Exception has been thrown by the target of an invocation In the debugger it looks like this: In this case this is an AJAX callback, which dynamically executes a method (ExecuteMethod code) which in turn calls into an Amazon Web Service using the old Amazon WSE101 Web service extensions for .NET. An error occurs in the Web Service call and the innermost exception holds the useful error information which in this case points at an invalid web.config key value related to the System.Net connection APIs. The "Exception has been thrown by the target of an invocation" error is the Reflection APIs generic error message that gets fired when you execute a method dynamically and that method fails internally. The messages basically says: "Your code blew up in my face when I tried to run it!". Which of course is not very useful to tell you what actually happened. If you drill down the InnerExceptions eventually you'll get a more detailed exception that points at the original error and code that caused the exception. In the code above the actually useful exception is two innerExceptions down. In most (but not all) cases when inner exceptions are returned, it's the innermost exception that has the information that is really useful. It's of course a fairly trivial task to do this in code, but I do it so frequently that I use a small helper method for this: /// <summary> /// Returns the innermost Exception for an object /// </summary> /// <param name="ex"></param> /// <returns></returns> public static Exception GetInnerMostException(Exception ex) { Exception currentEx = ex; while (currentEx.InnerException != null) { currentEx = currentEx.InnerException; } return currentEx; } This code just loops through all the inner exceptions (if any) and assigns them to a temporary variable until there are no more inner exceptions. The end result is that you get the innermost exception returned from the original exception. It's easy to use this code then in a try/catch handler like this (from the example above) to retrieve the more important innermost exception: object result = null; string stringResult = null; try { if (parameterList != null) // use the supplied parameter list result = helper.ExecuteMethod(methodToCall,target, parameterList.ToArray(), CallbackMethodParameterType.Json,ref attr); else // grab the info out of QueryString Values or POST buffer during parameter parsing // for optimization result = helper.ExecuteMethod(methodToCall, target, null, CallbackMethodParameterType.Json, ref attr); } catch (Exception ex) { Exception activeException = DebugUtils.GetInnerMostException(ex); WriteErrorResponse(activeException.Message, ( HttpContext.Current.IsDebuggingEnabled ? ex.StackTrace : null ) ); return; } Another function that is useful to me from time to time is one that returns all inner exceptions and the original exception as an array: /// <summary> /// Returns an array of the entire exception list in reverse order /// (innermost to outermost exception) /// </summary> /// <param name="ex">The original exception to work off</param> /// <returns>Array of Exceptions from innermost to outermost</returns> public static Exception[] GetInnerExceptions(Exception ex) { List<Exception> exceptions = new List<Exception>(); exceptions.Add(ex); Exception currentEx = ex; while (currentEx.InnerException != null) { exceptions.Add(ex); } // Reverse the order to the innermost is first exceptions.Reverse(); return exceptions.ToArray(); } This function loops through all the InnerExceptions and returns them and then reverses the order of the array returning the innermost exception first. This can be useful in certain error scenarios where exceptions stack and you need to display information from more than one of the exceptions in order to create a useful error message. This is rare but certain database exceptions bury their exception info in mutliple inner exceptions and it's easier to parse through them in an array then to manually walk the exception stack. It's also useful if you need to log errors and want to see the all of the error detail from all exceptions. None of this is rocket science, but it's useful to have some helpers that make retrieval of the critical exception info trivial. Resources DebugUtils.cs utility class in the West Wind Web Toolkit© Rick Strahl, West Wind Technologies, 2005-2011Posted in CSharp .NET

Read the article
query to select topic with highest number of comment +support+oppose+views

- by chetan

table schema title description desid replyto support oppose views browser used a1 none 1 1 12 - bad topic b2 1 2 3 14 sql database a3 none 4 5 34 - crome b4 1 3 4 12 Topic desid starts with a and comment desid starts with b .For comment replyto is the desid of topic . Its easy to select * with highest number of support+oppose+views by query "select * from [DB_user1212].[dbo].[discussions] where desid like 'a%' order by (sup+opp+visited) desc" For highest (comment +support+oppose+views ) i tried "select * from [DB_user1212].[dbo].[discussions] where desid like 'a%' order by ((select count(*) from [DB_user1212].[dbo].[discussions] where replyto = desid )+sup+opp+visited) desc" but it didn't work . Because its not possible to send desid from outer query to innner subquery .

Read the article
SQL SERVER – SSMS: Backup and Restore Events Report

- by Pinal Dave

A DBA wears multiple hats and in fact does more than what an eye can see. One of the core task of a DBA is to take backups. This looks so trivial that most developers shrug this off as the only activity a DBA might be doing. I have huge respect for DBA’s all around the world because even if they seem cool with all the scripting, automation, maintenance works round the clock to keep the business working almost 365 days 24×7, their worth is knowing that one day when the systems / HDD crashes and you have an important delivery to make. So these backup tasks / maintenance jobs that have been done come handy and are no more trivial as they might seem to be as considered by many. So the important question like: “When was the last backup taken?”, “How much time did the last backup take?”, “What type of backup was taken last?” etc are tricky questions and this report lands answers to the same in a jiffy. So the SSMS report, we are talking can be used to find backups and restore operation done for the selected database. Whenever we perform any backup or restore operation, the information is stored in the msdb database. This report can utilize that information and provide information about the size, time taken and also the file location for those operations. Here is how this report can be launched. Once we launch this report, we can see 4 major sections shown as listed below. Average Time Taken For Backup Operations Successful Backup Operations Backup Operation Errors Successful Restore Operations Let us look at each section next. Average Time Taken For Backup Operations Information shown in “Average Time Taken For Backup Operations” section is taken from a backupset table in the msdb database. Here is the query and the expanded version of that particular section USE msdb; SELECT (ROW_NUMBER() OVER (ORDER BY t1.TYPE))%2 AS l1 , 1 AS l2 , 1 AS l3 , t1.TYPE AS [type] , (AVG(DATEDIFF(ss,backup_start_date, backup_finish_date)))/60.0 AS AverageBackupDuration FROM backupset t1 INNER JOIN sys.databases t3 ON ( t1.database_name = t3.name) WHERE t3.name = N'AdventureWorks2014' GROUP BY t1.TYPE ORDER BY t1.TYPE On my small database the time taken for differential backup was less than a minute, hence the value of zero is displayed. This is an important piece of backup operation which might help you in planning maintenance windows. Successful Backup Operations Here is the expanded version of this section. This information is derived from various backup tracking tables from msdb database. Here is the simplified version of the query which can be used separately as well. SELECT * FROM sys.databases t1 INNER JOIN backupset t3 ON (t3.database_name = t1.name) LEFT OUTER JOIN backupmediaset t5 ON ( t3.media_set_id = t5.media_set_id) LEFT OUTER JOIN backupmediafamily t6 ON ( t6.media_set_id = t5.media_set_id) WHERE (t1.name = N'AdventureWorks2014') ORDER BY backup_start_date DESC,t3.backup_set_id,t6.physical_device_name; The report does some calculations to show the data in a more readable format. For example, the backup size is shown in KB, MB or GB. I have expanded first row by clicking on (+) on “Device type” column. That has shown me the path of the physical backup file. Personally looking at this section, the Backup Size, Device Type and Backup Name are critical and are worth a note. As mentioned in the previous section, this section also has the Duration embedded inside it. Backup Operation Errors This section of the report gets data from default trace. You might wonder how. One of the event which is tracked by default trace is “ErrorLog”. This means that whatever message is written to errorlog gets written to default trace file as well. Interestingly, whenever there is a backup failure, an error message is written to ERRORLOG and hence default trace. This section takes advantage of that and shows the information. We can read below message under this section, which confirms above logic. No backup operations errors occurred for (AdventureWorks2014) database in the recent past or default trace is not enabled. Successful Restore Operations This section may not be very useful in production server (do you perform a restore of database?) but might be useful in the development and log shipping secondary environment, where we might be interested to see restore operations for a particular database. Here is the expanded version of the section. To fill this section of the report, I have restored the same backups which were taken to populate earlier sections. Here is the simplified version of the query used to populate this output. USE msdb; SELECT * FROM restorehistory t1 LEFT OUTER JOIN restorefile t2 ON ( t1.restore_history_id = t2.restore_history_id) LEFT OUTER JOIN backupset t3 ON ( t1.backup_set_id = t3.backup_set_id) WHERE t1.destination_database_name = N'AdventureWorks2014' ORDER BY restore_date DESC, t1.restore_history_id,t2.destination_phys_name Have you ever looked at the backup strategy of your key databases? Are they in sync and do we have scope for improvements? Then this is the report to analyze after a week or month of maintenance plans running in your database. Do chime in with what are the strategies you are using in your environments. Reference: Pinal Dave (http://blog.sqlauthority.com)Filed under: PostADay, SQL, SQL Authority, SQL Backup and Restore, SQL Query, SQL Server, SQL Server Management Studio, SQL Tips and Tricks, T SQL Tagged: SQL Reports

Read the article
[GEEK SCHOOL] Network Security 1: Securing User Accounts and Passwords in Windows

- by Matt Klein

This How-To Geek School class is intended for people who want to learn more about security when using Windows operating systems. You will learn many principles that will help you have a more secure computing experience and will get the chance to use all the important security tools and features that are bundled with Windows. Obviously, we will share everything you need to know about using them effectively. In this first lesson, we will talk about password security; the different ways of logging into Windows and how secure they are. In the proceeding lesson, we will explain where Windows stores all the user names and passwords you enter while working in this operating systems, how safe they are, and how to manage this data. Moving on in the series, we will talk about User Account Control, its role in improving the security of your system, and how to use Windows Defender in order to protect your system from malware. Then, we will talk about the Windows Firewall, how to use it in order to manage the apps that get access to the network and the Internet, and how to create your own filtering rules. After that, we will discuss the SmartScreen Filter – a security feature that gets more and more attention from Microsoft and is now widely used in its Windows 8.x operating systems. Moving on, we will discuss ways to keep your software and apps up-to-date, why this is important and which tools you can use to automate this process as much as possible. Last but not least, we will discuss the Action Center and its role in keeping you informed about what’s going on with your system and share several tips and tricks about how to stay safe when using your computer and the Internet. Let’s get started by discussing everyone’s favorite subject: passwords. The Types of Passwords Found in Windows In Windows 7, you have only local user accounts, which may or may not have a password. For example, you can easily set a blank password for any user account, even if that one is an administrator. The only exception to this rule are business networks where domain policies force all user accounts to use a non-blank password. In Windows 8.x, you have both local accounts and Microsoft accounts. If you would like to learn more about them, don’t hesitate to read the lesson on User Accounts, Groups, Permissions & Their Role in Sharing, in our Windows Networking series. Microsoft accounts are obliged to use a non-blank password due to the fact that a Microsoft account gives you access to Microsoft services. Using a blank password would mean exposing yourself to lots of problems. Local accounts in Windows 8.1 however, can use a blank password. On top of traditional passwords, any user account can create and use a 4-digit PIN or a picture password. These concepts were introduced by Microsoft to speed up the sign in process for the Windows 8.x operating system. However, they do not replace the use of a traditional password and can be used only in conjunction with a traditional user account password. Another type of password that you encounter in Windows operating systems is the Homegroup password. In a typical home network, users can use the Homegroup to easily share resources. A Homegroup can be joined by a Windows device only by using the Homegroup password. If you would like to learn more about the Homegroup and how to use it for network sharing, don’t hesitate to read our Windows Networking series. What to Keep in Mind When Creating Passwords, PINs and Picture Passwords When creating passwords, a PIN, or a picture password for your user account, we would like you keep in mind the following recommendations: Do not use blank passwords, even on the desktop computers in your home. You never know who may gain unwanted access to them. Also, malware can run more easily as administrator because you do not have a password. Trading your security for convenience when logging in is never a good idea. When creating a password, make it at least eight characters long. Make sure that it includes a random mix of upper and lowercase letters, numbers, and symbols. Ideally, it should not be related in any way to your name, username, or company name. Make sure that your passwords do not include complete words from any dictionary. Dictionaries are the first thing crackers use to hack passwords. Do not use the same password for more than one account. All of your passwords should be unique and you should use a system like LastPass, KeePass, Roboform or something similar to keep track of them. When creating a PIN use four different digits to make things slightly harder to crack. When creating a picture password, pick a photo that has at least 10 “points of interests”. Points of interests are areas that serve as a landmark for your gestures. Use a random mixture of gesture types and sequence and make sure that you do not repeat the same gesture twice. Be aware that smudges on the screen could potentially reveal your gestures to others. The Security of Your Password vs. the PIN and the Picture Password Any kind of password can be cracked with enough effort and the appropriate tools. There is no such thing as a completely secure password. However, passwords created using only a few security principles are much harder to crack than others. If you respect the recommendations shared in the previous section of this lesson, you will end up having reasonably secure passwords. Out of all the log in methods in Windows 8.x, the PIN is the easiest to brute force because PINs are restricted to four digits and there are only 10,000 possible unique combinations available. The picture password is more secure than the PIN because it provides many more opportunities for creating unique combinations of gestures. Microsoft have compared the two login options from a security perspective in this post: Signing in with a picture password. In order to discourage brute force attacks against picture passwords and PINs, Windows defaults to your traditional text password after five failed attempts. The PIN and the picture password function only as alternative login methods to Windows 8.x. Therefore, if someone cracks them, he or she doesn’t have access to your user account password. However, that person can use all the apps installed on your Windows 8.x device, access your files, data, and so on. How to Create a PIN in Windows 8.x If you log in to a Windows 8.x device with a user account that has a non-blank password, then you can create a 4-digit PIN for it, to use it as a complementary login method. In order to create one, you need to go to “PC Settings”. If you don’t know how, then press Windows + C on your keyboard or flick from the right edge of the screen, on a touch-enabled device, then press “Settings”. The Settings charm is now open. Click or tap the link that says “Change PC settings”, on the bottom of the charm. In PC settings, go to Accounts and then to “Sign-in options”. Here you will find all the necessary options for changing your existing password, creating a PIN, or a picture password. To create a PIN, press the “Add” button in the PIN section. The “Create a PIN” wizard is started and you are asked to enter the password of your user account. Type it and press “OK”. Now you are asked to enter a 4-digit pin in the “Enter PIN” and “Confirm PIN” fields. The PIN has been created and you can now use it to log in to Windows. How to Create a Picture Password in Windows 8.x If you log in to a Windows 8.x device with a user account that has a non-blank password, then you can also create a picture password and use it as a complementary login method. In order to create one, you need to go to “PC settings”. In PC Settings, go to Accounts and then to “Sign-in options”. Here you will find all the necessary options for changing your existing password, creating a PIN, or a picture password. To create a picture password, press the “Add” button in the “Picture password” section. The “Create a picture password” wizard is started and you are asked to enter the password of your user account. You are shown a guide on how the picture password works. Take a few seconds to watch it and learn the gestures that can be used for your picture password. You will learn that you can create a combination of circles, straight lines, and taps. When ready, press “Choose picture”. Browse your Windows 8.x device and select the picture you want to use for your password and press “Open”. Now you can drag the picture to position it the way you want. When you like how the picture is positioned, press “Use this picture” on the left. If you are not happy with the picture, press “Choose new picture” and select a new one, as shown during the previous step. After you have confirmed that you want to use this picture, you are asked to set up your gestures for the picture password. Draw three gestures on the picture, any combination you wish. Please remember that you can use only three gestures: circles, straight lines, and taps. Once you have drawn those three gestures, you are asked to confirm. Draw the same gestures one more time. If everything goes well, you are informed that you have created your picture password and that you can use it the next time you sign in to Windows. If you don’t confirm the gestures correctly, you will be asked to try again, until you draw the same gestures twice. To close the picture password wizard, press “Finish”. Where Does Windows Store Your Passwords? Are They Safe? All the passwords that you enter in Windows and save for future use are stored in the Credential Manager. This tool is a vault with the usernames and passwords that you use to log on to your computer, to other computers on the network, to apps from the Windows Store, or to websites using Internet Explorer. By storing these credentials, Windows can automatically log you the next time you access the same app, network share, or website. Everything that is stored in the Credential Manager is encrypted for your protection.

Read the article
Internet Protocol Suite: Transition Control Protocol (TCP) vs. User Datagram Protocol (UDP)

How do we communicate over the Internet? How is data transferred from one machine to another? These types of act ivies can only be done by using one of two Internet protocols currently. The collection of Internet Protocol consists of the Transition Control Protocol (TCP) and the User Datagram Protocol (UDP). Both protocols are used to send data between two network end points, however they both have very distinct ways of transporting data from one endpoint to another. If transmission speed and reliability is the primary concern when trying to transfer data between two network endpoints then TCP is the proper choice. When a device attempts to send data to another endpoint using TCP it creates a direct connection between both devices until the transmission has completed. The direct connection between both devices ensures the reliability of the transmission due to the fact that no intermediate devices are needed to transfer the data. Due to the fact that both devices have to continuously poll the connection until transmission has completed increases the resources needed to perform the transmission. An example of this type of direct communication can be seen when a teacher tells a students to do their homework. The teacher is talking directly to the students in order to communicate that the homework needs to be done. Students can then ask questions about the assignment to ensure that they have received the proper instructions for the assignment. UDP is a less resource intensive approach to sending data between to network endpoints. When a device uses UDP to send data across a network, the data is broken up and repackaged with the destination address. The sending device then releases the data packages to the network, but cannot ensure when or if the receiving device will actually get the data. The sending device depends on other devices on the network to forward the data packages to the destination devices in order to complete the transmission. As you can tell this type of transmission is less resource intensive because not connection polling is needed, but should not be used for transmitting data with speed or reliability requirements. This is due to the fact that the sending device can not ensure that the transmission is received. An example of this type of communication can be seen when a teacher tells a student that they would like to speak with their parents. The teacher is relying on the student to complete the transmission to the parents, and the teacher has no guarantee that the student will actually inform the parents about the request. Both TCP and UPD are invaluable when attempting to send data across a network, but depending on the situation one protocol may be better than the other. Before deciding on which protocol to use an evaluation for transmission speed, reliability, latency, and overhead must be completed in order to define the best protocol for the situation.

Read the article
OWB 11gR2 - Early Arriving Facts

- by Dawei Sun

A common challenge when building ETL components for a data warehouse is how to handle early arriving facts. OWB 11gR2 introduced a new feature to address this for dimensional objects entitled Orphan Management. An orphan record is one that does not have a corresponding existing parent record. Orphan management automates the process of handling source rows that do not meet the requirements necessary to form a valid dimension or cube record. In this article, a simple example will be provided to show you how to use Orphan Management in OWB. We first import a sample MDL file that contains all the objects we need. Then we take some time to examine all the objects. After that, we prepare the source data, deploy the target table and dimension/cube loading map. Finally, we run the loading maps, and check the data in target dimension/cube tables. OK, let’s start… 1. Import MDL file and examine sample project First, download zip file from here, which includes a MDL file and three source data files. Then we open OWB design center, import orphan_management.mdl by using the menu File->Import->Warehouse Builder Metadata. Now we have several objects in BI_DEMO project as below: Mapping LOAD_CHANNELS_OM: The mapping for dimension loading. Mapping LOAD_SALES_OM: The mapping for cube loading. Dimension CHANNELS_OM: The dimension that contains channels data. Cube SALES_OM: The cube that contains sales data. Table CHANNELS_OM: The star implementation table of dimension CHANNELS_OM. Table SALES_OM: The star implementation table of cube SALES_OM. Table SRC_CHANNELS: The source table of channels data, that will be loaded into dimension CHANNELS_OM. Table SRC_ORDERS and SRC_ORDER_ITEMS: The source tables of sales data that will be loaded into cube SALES_OM. Sequence CLASS_OM_DIM_SEQ: The sequence used for loading dimension CHANNELS_OM. Dimension CHANNELS_OM This dimension has a hierarchy with three levels: TOTAL, CLASS and CHANNEL. Each level has three attributes: ID (surrogate key), NAME and SOURCE_ID (business key). It has a standard star implementation. The orphan management policy and the default parent setting are shown in the following screenshots: The orphan management policy options that you can set for loading are: Reject Orphan: The record is not inserted. Default Parent: You can specify a default parent record. This default record is used as the parent record for any record that does not have an existing parent record. If the default parent record does not exist, Warehouse Builder creates the default parent record. You specify the attribute values of the default parent record at the time of defining the dimensional object. If any ancestor of the default parent does not exist, Warehouse Builder also creates this record. No Maintenance: This is the default behavior. Warehouse Builder does not actively detect, reject, or fix orphan records. While removing data from a dimension, you can select one of the following orphan management policies: Reject Removal: Warehouse Builder does not allow you to delete the record if it has existing child records. No Maintenance: This is the default behavior. Warehouse Builder does not actively detect, reject, or fix orphan records. (More details are at http://download.oracle.com/docs/cd/E11882_01/owb.112/e10935/dim_objects.htm#insertedID1) Cube SALES_OM This cube is references to dimension CHANNELS_OM. It has three measures: AMOUNT, QUANTITY and COST. The orphan management policy setting are shown as following screenshot: The orphan management policy options that you can set for loading are: No Maintenance: Warehouse Builder does not actively detect, reject, or fix orphan rows. Default Dimension Record: Warehouse Builder assigns a default dimension record for any row that has an invalid or null dimension key value. Use the Settings button to define the default parent row. Reject Orphan: Warehouse Builder does not insert the row if it does not have an existing dimension record. (More details are at http://download.oracle.com/docs/cd/E11882_01/owb.112/e10935/dim_objects.htm#BABEACDG) Mapping LOAD_CHANNELS_OM This mapping loads source data from table SRC_CHANNELS to dimension CHANNELS_OM. The operator CHANNELS_IN is bound to table SRC_CHANNELS; CHANNELS_OUT is bound to dimension CHANNELS_OM. The TOTALS operator is used for generating a constant value for the top level in the dimension. The CLASS_FILTER operator is used to filter out the “invalid” class name, so then we can see what will happen when those channel records with an “invalid” parent are loading into dimension. Some properties of the dimension operator in this mapping are important to orphan management. See the screenshot below: Create Default Level Records: If YES, then default level records will be created. This property must be set to YES for dimensions and cubes if one of their orphan management policies is “Default Parent” or “Default Dimension Record”. This property is set to NO by default, so the user may need to set this to YES manually. LOAD policy for INVALID keys/ LOAD policy for NULL keys: These two properties have the same meaning as in the dimension editor. The values are set to the same as the dimension value when user drops the dimension into the mapping. The user does not need to modify these properties. Record Error Rows: If YES, error rows will be inserted into error table when loading the dimension. REMOVE Orphan Policy: This property is used when removing data from a dimension. Since the dimension loading type is set to LOAD in this example, this property is disabled. Mapping LOAD_SALES_OM This mapping loads source data from table SRC_ORDERS and SRC_ORDER_ITEMS to cube SALES_OM. This mapping seems a little bit complicated, but operators in the red rectangle are used to filter out and generate the records with “invalid” or “null” dimension keys. Some properties of the cube operator in a mapping are important to orphan management. See the screenshot below: Enable Source Aggregation: Should be checked in this example. If the default dimension record orphan policy is set for the cube operator, then it is recommended that source aggregation also be enabled. Otherwise, the orphan management processing may produce multiple fact rows with the same default dimension references, which will cause an “unstable rowset” execution error in the database, since the dimension refs are used as update match attributes for updating the fact table. LOAD policy for INVALID keys/ LOAD policy for NULL keys: These two properties have the same meaning as in the cube editor. The values are set to the same as in the cube editor when the user drops the cube into the mapping. The user does not need to modify these properties. Record Error Rows: If YES, error rows will be inserted into error table when loading the cube. 2. Deploy objects and mappings We now can deploy the objects. First, make sure location SALES_WH_LOCAL has been correctly configured. Then open Control Center Manager by using the menu Tools->Control Center Manager. Expand BI_DEMO->SALES_WH_LOCAL, click SALES_WH node on the project tree. We can see the following objects: Deploy all the objects in the following order: Sequence CLASS_OM_DIM_SEQ Table CHANNELS_OM, SALES_OM, SRC_CHANNELS, SRC_ORDERS, SRC_ORDER_ITEMS Dimension CHANNELS_OM Cube SALES_OM Mapping LOAD_CHANNELS_OM, LOAD_SALES_OM Note that we deployed source tables as well. Normally, we import source table from database instead of deploying them to target schema. However, in this example, we designed the source tables in OWB and deployed them to database for the purpose of this demonstration. 3. Prepare and examine source data Before running the mappings, we need to populate and examine the source data first. Run SRC_CHANNELS.sql, SRC_ORDERS.sql and SRC_ORDER_ITEMS.sql as target user. Then we check the data in these three tables. Table SRC_CHANNELS SQL> select rownum, id, class, name from src_channels; Records 1~5 are correct; they should be loaded into dimension without error. Records 6,7 and 8 have null parents; they should be loaded into dimension with a default parent value, and should be inserted into error table at the same time. Records 9, 10 and 11 have “invalid” parents; they should be rejected by dimension, and inserted into error table. Table SRC_ORDERS and SRC_ORDER_ITEMS SQL> select rownum, a.id, a.channel, b.amount, b.quantity, b.cost from src_orders a, src_order_items b where a.id = b.order_id; Record 178 has null dimension reference; it should be loaded into cube with a default dimension reference, and should be inserted into error table at the same time. Record 179 has “invalid” dimension reference; it should be rejected by cube, and inserted into error table. Other records should be aggregated and loaded into cube correctly. 4. Run the mappings and examine the target data In the Control Center Manager, expand BI_DEMO-> SALES_WH_LOCAL-> SALES_WH-> Mappings, right click on LOAD_CHANNELS_OM node, click Start. Use the same way to run mapping LOAD_SALES_OM. When they successfully finished, we can check the data in target tables. Table CHANNELS_OM SQL> select rownum, total_id, total_name, total_source_id, class_id,class_name, class_source_id, channel_id, channel_name,channel_source_id from channels_om order by abs(dimension_key); Records 1,2 and 3 are the default dimension records for the three levels. Records 8, 10 and 15 are the loaded records that originally have null parents. We see their parents name (class_name) is set to DEF_CLASS_NAME. Those records whose CHANNEL_NAME are Special_4, Special_5 and Special_6 are not loaded to this table because of the invalid parent. Error Table CHANNELS_OM_ERR SQL> select rownum, class_source_id, channel_id, channel_name,channel_source_id, err$$$_error_reason from channels_om_err order by channel_name; We can see all the record with null parent or invalid parent are inserted into this error table. Error reason is “Default parent used for record” for the first three records, and “No parent found for record” for the last three. Table SALES_OM SQL> select a.*, b.channel_name from sales_om a, channels_om b where a.channels=b.channel_id; We can see the order record with null channel_name has been loaded into target table with a default channel_name. The one with “invalid” channel_name are not loaded. Error Table SALES_OM_ERR SQL> select a.amount, a.cost, a.quantity, a.channels, b.channel_name, a.err$$$_error_reason from sales_om_err a, channels_om b where a.channels=b.channel_id(+); We can see the order records with null or invalid channel_name are inserted into error table. If the dimension reference column is null, the error reason is “Default dimension record used for fact”. If it is invalid, the error reason is “Dimension record not found for fact”. Summary In summary, this article illustrated the Orphan Management feature in OWB 11gR2. Automated orphan management policies improve ETL developer and administrator productivity by addressing an important cause of cube and dimension load failures, without requiring developers to explicitly build logic to handle these orphan rows.

Read the article
rsnapshot intervals in configuration file…

- by Patrick

A simple question about rsnapshot. In order to perform daily backups I'm going to add lines to cron in my Ubuntu. Then, why do I have also these lines in the rsnapshot.conf ? ######################################### # BACKUP INTERVALS # # Must be unique and in ascending order # # i.e. hourly, daily, weekly, etc. # ######################################### interval hourly 6 interval daily 7 interval weekly 4 #interval monthly 3 If I use cron, should I disable them ? thanks ps. I've just realized that in the crontab I still have "hourly" and "daily". Should I then uncomment only the one I use in the crontab ? And what's the point to specify hourly if it is already specified in cron ? I'm a bit confused. # crontab -e 0 */4 * * * /usr/local/bin/rsnapshot hourly 30 23 * * * /usr/local/bin/rsnapshot daily

Read the article
Asserting with JustMock

In this post, i will be digging in a bit deep on Mock.Assert. This is the continuation from previous post and covers up the ways you can use assert for your mock expectations. I have used another traditional sample of Talisker that has a warehouse [Collaborator] and an order class [SUT] that will call upon the warehouse to see the stock and fill it up with items. Our sample, interface of warehouse and order looks similar to : public interface IWarehouse { bool HasInventory(string productName,...Did you know that DotNetSlackers also publishes .net articles written by top known .net Authors? We already have over 80 articles in several categories including Silverlight. Take a look: here.

Read the article
Specifying a file name for the FTP and File based transports in OSB

- by [email protected]

A common question I receive is how to incorporate a variable value into a file name when using the FTP, SFTP, or File transports in Oracle Service Bus. For example, if one of the fields in a message being put down to a file by the File transport is an order number variable, then how can you make the order number become part of the file name? Another example might be if you want to specify the date in the file name. The transport configuration wizard in OSB does not have an option to allow for this, other than allowing you to specify a static prefix of suffix variable.

Read the article
How do I use with LTSP with a Dell FX170 thin client?

- by v4169sgr

Just got a reconditioned DELL Optiplex FX170 thin client delivered, without an image. On power-up, I see the message: DISK BOOT FAILURE, INSERT SYSTEM DISK AND PRESS ENTER I am very sure all the wires etc are connected properly. I would like it to PXE boot and find my LTSP installation. I have an HP T5525 with the same connectivity that works fine. On F12 there's a BIOS boot order menu. The only options available though are LS120, the HDD, CD / DVD [via USB presumably], and various USB options. I do not see 'PXE' or any network card. And I don't know how to change the boot order :( Any help appreciated!

Read the article
How to bill a client for frequently-interrupted time

- by Greg

I find that when I'm working on hourly-billable projects (in particular, those that are research/design/architecture-oriented as opposed to straight coding) that I'm easily distracted by any number of things (email, grab a drink (loss of focus, but nature happens), link off the webpage I was reading, wandering mind (easy when the job calls for a lot of thinking), etc.) This results in very fragmented time, far too incremental IMO to accurately track with a timeclock, and some time very gray. I frequently end up billing for only some fraction of the elapsed time I spent in order to feel fair, but sometimes it takes a really long time to put in an 8-hour day. By contrast, when I've worked for salary I've not worried about whether I'm actively working at any given minute, I just get the job done, and I've never had anything but stellar reviews/feedback from past salaried employers, so I think I get the job done well. I personally believe in an 80/20 cycle: I get 80% of my work done during an inspired 20% of my time. But I have to screw around the other 80% of the time in order to get that first 20%. So the question: what billing/time-tracking policy can I adopt in order to be fair to my hourly customers without having to write off my own less-productive 80% that a salaried employer is willing to overlook in light of the complete package? Note: This question is not about how to be more productive or focused. It's about how to work around whatever salient limitations that I have in a way that's both fair to me and to my customers. Update: A little clarification (to pre-emptively stop some righteous indignation): I currently have a half dozen different project/client groups. It's not a great situation and I'm working at reducing it down to two, but that's my current reality. It's very easy to get off on a thread related to a different project than the one I'm clocking, and I'm not always conscious of it at the time. [I did not intend the question to mean that I was off playing games or making personal calls, etc., and have adjusted wording above to be clearer. Most of the time. I am only human, and sometimes the mind does force you to take a break! :-)]

Read the article
Inside the iPad SDK: Bigger Screens, Continued Frustrations

Now that the gag order has expired, Developer World reviews the iPad SDK.

Read the article
Inside the iPad SDK: Bigger Screens, Continued Frustrations

Now that the gag order has expired, Developer World reviews the iPad SDK.

Read the article
How John Got 15x Improvement Without Really Trying

- by rchrd

The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here. How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

Read the article
CodeStock 2012 Review: Michael Eaton( @mjeaton ) - 3 Simple Things for Increased Productivity

3 Simple Things for Increased ProductivitySpeaker: Michael EatonTwitter: @mjeatonBlog: http://mjeaton.net/blog This was the first time I had seen Michael Eaton speak but have hear a lot of really good things about his speaking abilities. Needless to say I was really looking forward to his session. He basically addressed the topic of distractions and how they can decrease or increase your productivity as a developer. He makes the case that in order to become more productive you must block/limit all distractions. For example, he covered his top distractions as a developer. Top Distractions Social Media(Twitter, Reddit, Facebook) Wiki sites Phone Email Video Games Coworkers, Friends, Family Michael stated that he uses various types of music to help him block out these distractions in order for him to get into his coding zone. While he states that music works for him, he also notes that he knows of others that cannot really work with music. I have to say I am in the latter group because I require a quiet environment in order to work. A few session attendees also recommended listening to really loud white noise or music in another language other than your own. This allows for less focus to be placed on words being sung compared to the rhythmic beats being played. I have to say that I have not tried these suggestions yet but will in the near future. However, distractions can be very beneficial to productivity in that they give your mind a chance to relax and not think about the issues at hand. He spoke highly of taking vacations, and setting boundaries at work so that develops prevent the problem of burnout. One way he suggested that developer’s combat distractions is to use the Pomodoro technique. In his example he selects one task to do for 20 minutes and he can only do that task during that time. He ignores all other distractions until this task or time limit is complete. After it is completed he allows himself to relax and distract himself for another 5- 10 minutes before his next Pomodoro. This allows him to stay completely focused on a task and when the time is up he can then focus on other things.

Read the article

Search Results

Search found 28744 results on 1150 pages for 'higher order functions'.

Page 197/1150 | < Previous Page | 193 194 195 196 197 198 199 200 201 202 203 204 | Next Page >

- by Chopper3

- by Bobb

- by Robber

- by sebastian

- by Chris Adragna

- by Andrew Wright

- by UpTheCreek

- by Jalpesh P. Vadgama

- by Rick Strahl

- by Paul White

- by Rick Strahl

- by chetan

- by Pinal Dave

- by Matt Klein

- by Dawei Sun

- by Patrick

- by [email protected]

- by v4169sgr

- by Greg

- by rchrd

< Previous Page | 193 194 195 196 197 198 199 200 201 202 203 204 | Next Page >