Search Results

Search found 96 results on 4 pages for 'proportional'.

Page 4/4 | < Previous Page | 1 2 3 4 

  • Short Season, Long Models - Dealing with Seasonality

    - by Michel Adar
    Accounting for seasonality presents a challenge for the accurate prediction of events. Examples of seasonality include: ·         Boxed cosmetics sets are more popular during Christmas. They sell at other times of the year, but they rise higher than other products during the holiday season. ·         Interest in a promotion rises around the time advertising on TV airs ·         Interest in the Sports section of a newspaper rises when there is a big football match There are several ways of dealing with seasonality in predictions. Time Windows If the length of the model time windows is short enough relative to the seasonality effect, then the models will see only seasonal data, and therefore will be accurate in their predictions. For example, a model with a weekly time window may be quick enough to adapt during the holiday season. In order for time windows to be useful in dealing with seasonality it is necessary that: The time window is significantly shorter than the season changes There is enough volume of data in the short time windows to produce an accurate model An additional issue to consider is that sometimes the season may have an abrupt end, for example the day after Christmas. Input Data If available, it is possible to include the seasonality effect in the input data for the model. For example the customer record may include a list of all the promotions advertised in the area of residence. A model with these inputs will have to learn the effect of the input. It is possible to learn it specific to the promotion – and by the way learn about inter-promotion cross feeding – by leaving the list of ads as it is; or it is possible to learn the general effect by having a flag that indicates if the promotion is being advertised. For inputs to properly represent the effect in the model it is necessary that: The model sees enough events with the input present. For example, by virtue of the model lifetime (or time window) being long enough to see several “seasons” or by having enough volume for the model to learn seasonality quickly. Proportional Frequency If we create a model that ignores seasonality it is possible to use that model to predict how the specific person likelihood differs from average. If we have a divergence from average then we can transfer that divergence proportionally to the observed frequency at the time of the prediction. Definitions: Ft = trailing average frequency of the event at time “t”. The average is done over a suitable period of to achieve a statistical significant estimate. F = average frequency as seen by the model. L = likelihood predicted by the model for a specific person Lt = predicted likelihood proportionally scaled for time “t”. If the model is good at predicting deviation from average, and this holds over the interesting range of seasons, then we can estimate Lt as: Lt = L * (Ft / F) Considering that: L = (L – F) + F Substituting we get: Lt = [(L – F) + F] * (Ft / F) Which simplifies to: (i)                  Lt = (L – F) * (Ft / F)  +  Ft This latest expression can be understood as “The adjusted likelihood at time t is the average likelihood at time t plus the effect from the model, which is calculated as the difference from average time the proportion of frequencies”. The formula above assumes a linear translation of the proportion. It is possible to generalize the formula using a factor which we will call “a” as follows: (ii)                Lt = (L – F) * (Ft / F) * a  +  Ft It is also possible to use a formula that does not scale the difference, like: (iii)               Lt = (L – F) * a  +  Ft While these formulas seem reasonable, they should be taken as hypothesis to be proven with empirical data. A theoretical analysis provides the following insights: The Cumulative Gains Chart (lift) should stay the same, as at any given time the order of the likelihood for different customers is preserved If F is equal to Ft then the formula reverts to “L” If (Ft = 0) then Lt in (i) and (ii) is 0 It is possible for Lt to be above 1. If it is desired to avoid going over 1, for relatively high base frequencies it is possible to use a relative interpretation of the multiplicative factor. For example, if we say that Y is twice as likely as X, then we can interpret this sentence as: If X is 3%, then Y is 6% If X is 11%, then Y is 22% If X is 70%, then Y is 85% - in this case we interpret “twice as likely” as “half as likely to not happen” Applying this reasoning to (i) for example we would get: If (L < F) or (Ft < (1 / ((L/F) + 1)) Then  Lt = L * (Ft / F) Else Lt = 1 – (F / L) + (Ft * F / L)  

    Read the article

  • Developing Schema Compare for Oracle (Part 5): Query Snapshots

    - by Simon Cooper
    If you've emailed us about a bug you've encountered with the EAP or beta versions of Schema Compare for Oracle, we probably asked you to send us a query snapshot of your databases. Here, I explain what a query snapshot is, and how it helps us fix your bug. Problem 1: Debugging users' bug reports When we started the Schema Compare project, we knew we were going to get problems with users' databases - configurations we hadn't considered, features that weren't installed, unicode issues, wierd dependencies... With SQL Compare, users are generally happy to send us a database backup that we can restore using a single RESTORE DATABASE command on our test servers and immediately reproduce the problem. Oracle, on the other hand, would be a lot more tricky. As Oracle generally has a 1-to-1 mapping between instances and databases, any databases users sent would have to be restored to their own instance. Furthermore, the number of steps required to get a properly working database, and the size of most oracle databases, made it infeasible to ask every customer who came across a bug during our beta program to send us their databases. We also knew that there would be lots of issues with data security that would make it hard to get backups. So we needed an easier way to be able to debug customers issues and sort out what strange schema data Oracle was returning. Problem 2: Test execution time Another issue we knew we would have to solve was the execution time of the tests we would produce for the Schema Compare engine. Our initial prototype showed that querying the data dictionary for schema information was going to be slow (at least 15 seconds per database), and this is generally proportional to the size of the database. If you're running thousands of tests on the same databases, each one registering separate schemas, not only would the tests would take hours and hours to run, but the test servers would be hammered senseless. The solution To solve these, we needed to be able to populate the schema of a database without actually connecting to it. Well, the IDataReader interface is the primary way we read data from an Oracle server. The data dictionary queries we use return their data in terms of simple strings and numbers, which we then process and reconstruct into an object model, and the results of these queries are identical for identical schemas. So, we can record the raw results of the queries once, and then replay these results to construct the same object model as many times as required without needing to actually connect to the original database. This is what query snapshots do. They are binary files containing the raw unprocessed data we get back from the oracle server for all the queries we run on the data dictionary to get schema information. The core of the query snapshot generation takes the results of the IDataReader we get from running queries on Oracle, and passes the row data to a BinaryWriter that writes it straight to a file. The query snapshot can then be replayed to create the same object model; when the results of a specific query is needed by the population code, we can simply read the binary data stored in the file on disk and present it through an IDataReader wrapper. This is far faster than querying the server over the network, and allows us to run tests in a reasonable time. They also allow us to easily debug a customers problem; using a simple snapshot generation program, users can generate a query snapshot that could be sent along with a bug report that we can immediately replay on our machines to let us debug the issue, rather than having to obtain database backups and restore databases to test systems. There are also far fewer problems with data security; query snapshots only contain schema information, which is generally less sensitive than table data. Query snapshots implementation However, actually implementing such a feature did have a couple of 'gotchas' to it. My second blog post detailed the development of the dependencies algorithm we use to ensure we get all the dependencies in the database, and that algorithm uses data from both databases to find all the needed objects - what database you're comparing to affects what objects get populated from both databases. We get information on these additional objects using an appropriate WHERE clause on all the population queries. So, in order to accurately replay the results of querying the live database, the query snapshot needs to be a snapshot of a comparison of two databases, not just populating a single database. Furthermore, although the code population queries (eg querying all_tab_cols to get column information) can simply be passed straight from the IDataReader to the BinaryWriter, we need to hook into and run the live dependencies algorithm while we're creating the snapshot to ensure we get the same WHERE clauses, and the same query results, as if we were populating straight from a live system. We also need to store the results of the dependencies queries themselves, as the resulting dependency graph is stored within the OracleDatabase object that is produced, and is later used to help order actions in synchronization scripts. This is significantly helped by the dependencies algorithm being a deterministic algorithm - given the same input, it will always return the same output. Therefore, when we're replaying a query snapshot, and processing dependency information, we simply have to return the results of the queries in the order we got them from the live database, rather than trying to calculate the contents of all_dependencies on the fly. Query snapshots are a significant feature in Schema Compare that really helps us to debug problems with the tool, as well as making our testers happier. Although not really user-visible, they are very useful to the development team to help us fix bugs in the product much faster than we otherwise would be able to.

    Read the article

  • Applications: The Mathematics of Movement, Part 3

    - by TechTwaddle
    Previously: Part 1, Part 2 As promised in the previous post, this post will cover two variations of the marble move program. The first one, Infinite Move, keeps the marble moving towards the click point, rebounding it off the screen edges and changing its direction when the user clicks again. The second version, Finite Move, is the same as first except that the marble does not move forever. It moves towards the click point, rebounds off the screen edges and slowly comes to rest. The amount of time that it moves depends on the distance between the click point and marble. Infinite Move This case is simple (actually both cases are simple). In this case all we need is the direction information which is exactly what the unit vector stores. So when the user clicks, you calculate the unit vector towards the click point and then keep updating the marbles position like crazy. And, of course, there is no stop condition. There’s a little more additional code in the bounds checking conditions. Whenever the marble goes off the screen boundaries, we need to reverse its direction.  Here is the code for mouse up event and UpdatePosition() method, //stores the unit vector double unitX = 0, unitY = 0; double speed = 6; //speed times the unit vector double incrX = 0, incrY = 0; private void Form1_MouseUp(object sender, MouseEventArgs e) {     double x = e.X - marble1.x;     double y = e.Y - marble1.y;     //calculate distance between click point and current marble position     double lenSqrd = x * x + y * y;     double len = Math.Sqrt(lenSqrd);     //unit vector along the same direction (from marble towards click point)     unitX = x / len;     unitY = y / len;     timer1.Enabled = true; } private void UpdatePosition() {     //amount by which to increment marble position     incrX = speed * unitX;     incrY = speed * unitY;     marble1.x += incrX;     marble1.y += incrY;     //check for bounds     if ((int)marble1.x < MinX + marbleWidth / 2)     {         marble1.x = MinX + marbleWidth / 2;         unitX *= -1;     }     else if ((int)marble1.x > (MaxX - marbleWidth / 2))     {         marble1.x = MaxX - marbleWidth / 2;         unitX *= -1;     }     if ((int)marble1.y < MinY + marbleHeight / 2)     {         marble1.y = MinY + marbleHeight / 2;         unitY *= -1;     }     else if ((int)marble1.y > (MaxY - marbleHeight / 2))     {         marble1.y = MaxY - marbleHeight / 2;         unitY *= -1;     } } So whenever the user clicks we calculate the unit vector along that direction and also the amount by which the marble position needs to be incremented. The speed in this case is fixed at 6. You can experiment with different values. And under bounds checking, whenever the marble position goes out of bounds along the x or y direction we reverse the direction of the unit vector along that direction. Here’s a video of it running;   Finite Move The code for finite move is almost exactly same as that of Infinite Move, except for the difference that the speed is not fixed and there is an end condition, so the marble comes to rest after a while. Code follows, //unit vector along the direction of click point double unitX = 0, unitY = 0; //speed of the marble double speed = 0; private void Form1_MouseUp(object sender, MouseEventArgs e) {     double x = 0, y = 0;     double lengthSqrd = 0, length = 0;     x = e.X - marble1.x;     y = e.Y - marble1.y;     lengthSqrd = x * x + y * y;     //length in pixels (between click point and current marble pos)     length = Math.Sqrt(lengthSqrd);     //unit vector along the same direction as vector(x, y)     unitX = x / length;     unitY = y / length;     speed = length / 12;     timer1.Enabled = true; } private void UpdatePosition() {     marble1.x += speed * unitX;     marble1.y += speed * unitY;     //check for bounds     if ((int)marble1.x < MinX + marbleWidth / 2)     {         marble1.x = MinX + marbleWidth / 2;         unitX *= -1;     }     else if ((int)marble1.x > (MaxX - marbleWidth / 2))     {         marble1.x = MaxX - marbleWidth / 2;         unitX *= -1;     }     if ((int)marble1.y < MinY + marbleHeight / 2)     {         marble1.y = MinY + marbleHeight / 2;         unitY *= -1;     }     else if ((int)marble1.y > (MaxY - marbleHeight / 2))     {         marble1.y = MaxY - marbleHeight / 2;         unitY *= -1;     }     //reduce speed by 3% in every loop     speed = speed * 0.97f;     if ((int)speed <= 0)     {         timer1.Enabled = false;     } } So the only difference is that the speed is calculated as a function of length when the mouse up event occurs. Again, this can be experimented with. Bounds checking is same as before. In the update and draw cycle, we reduce the speed by 3% in every cycle. Since speed is calculated as a function of length, speed = length/12, the amount of time it takes speed to reach zero is directly proportional to length. Note that the speed is in ‘pixels per 40ms’ because the timeout value of the timer is 40ms.  The readability can be improved by representing speed in ‘pixels per second’. This would require you to add some more calculations to the code, which I leave out as an exercise. Here’s a video of this second version,

    Read the article

  • A Patent for Workload Management Based on Service Level Objectives

    - by jsavit
    I'm very pleased to announce that after a tiny :-) wait of about 5 years, my patent application for a workload manager was finally approved. Background Many operating systems have a resource manager which lets you control machine resources. For example, Solaris provides controls for CPU with several options: shares for proportional CPU allocation. If you have twice as many shares as me, and we are competing for CPU, you'll get about twice as many CPU cycles), dedicated CPU allocation in which a number of CPUs are exclusively dedicated to an application's use. You can say that a zone or project "owns" 8 CPUs on a 32 CPU machine, for example. And, capped CPU in which you specify the upper bound, or cap, of how much CPU an application gets. For example, you can throttle an application to 0.125 of a CPU. (This isn't meant to be an exhaustive list of Solaris RM controls.) Workload management Useful as that is (and tragic that some other operating systems have little resource management and isolation, and frighten people into running only 1 app per OS instance - and wastefully size every server for the peak workload it might experience) that's not really workload management. With resource management one controls the resources, and hope that's enough to meet application service objectives. In fact, we hold resource distribution constant, see if that was good enough, and adjust resource distribution if that didn't meet service level objectives. Here's an example of what happens today: Let's try 30% dedicated CPU. Not enough? Let's try 80% Oh, that's too much, and we're achieving much better response time than the objective, but other workloads are starving. Let's back that off and try again. It's not the process I object to - it's that we to often do this manually. Worse, we sometimes identify and adjust the wrong resource and fiddle with that to no useful result. Back in my days as a customer managing large systems, one of my users would call me up to beg for a "CPU boost": Me: "it won't make any difference - there's plenty of spare CPU to be had, and your application is completely I/O bound." User: "Please do it anyway." Me: "oh, all right, but it won't do you any good." (I did, because he was a friend, but it didn't help.) Prior art There are some operating environments that take a stab about workload management (rather than resource management) but I find them lacking. I know of one that uses synthetic "service units" composed of the sum of CPU, I/O and memory allocations multiplied by weighting factors. A workload is set to make a target rate of service units consumed per second. But this seems to be missing a key point: what is the relationship between artificial 'service units' and actually meeting a throughput or response time objective? What if I get plenty of one of the components (so am getting enough service units), but not enough of the resource whose needed to remove the bottleneck? Actual workload management That's not really the answer either. What is needed is to specify a workload's service levels in terms of externally visible metrics that are meaningful to a business, such as response times or transactions per second, and have the workload manager figure out which resources are not being adequately provided, and then adjust it as needed. If an application is not meeting its service level objectives and the reason is that it's not getting enough CPU cycles, adjust its CPU resource accordingly. If the reason is that the application isn't getting enough RAM to keep its working set in memory, then adjust its RAM assignment appropriately so it stops swapping. Simple idea, but that's a task we keep dumping on system administrators. In other words - don't hold the number of CPU shares constant and watch the achievement of service level vary. Instead, hold the service level constant, and dynamically adjust the number of CPU shares (or amount of other resources like RAM or I/O bandwidth) in order to meet the objective. Instrumenting non-instrumented applications There's one little problem here: how do I measure application performance in a way relating to a service level. I don't want to do it based on internal resources like number of CPU seconds it received per minute - We need to make resource decisions based on externally visible and meaningful measures of performance, not synthetic items or internal resource counters. If I have a way of marking the beginning and end of a transaction, I can then measure whether or not the application is meeting an objective based on it. If I can observe the delay factors for an application, I can see which resource shortages are slowing an application enough to keep it from meeting its objectives. I can then adjust resource allocations to relieve those shortages. Fortunately, Solaris provides facilities for both marking application progress and determining what factors cause application latency. The Solaris DTrace facility let's me introspect on application behavior: in particular I can see events like "receive a web hit" and "respond to that web hit" so I can get transaction rate and response time. DTrace (and tools like prstat) let me see where latency is being added to an application, so I know which resource to adjust. Summary After a delay of a mere few years, I am the proud creator of a patent (advice to anyone interested in going through the process: don't hold your breath!). The fundamental idea is fairly simple: instead of holding resource constant and suffering variable levels of success meeting service level objectives, properly characterise the service level objective in meaningful terms, instrument the application to see if it's meeting the objective, and then have a workload manager change resource allocations to remove delays preventing service level attainment. I've done it by hand for a long time - I think that's what a computer should do for me.

    Read the article

  • delta-dictionary/dictionary with revision awareness in python?

    - by shabbychef
    I am looking to create a dictionary with 'roll-back' capabilities in python. The dictionary would start with a revision number of 0, and the revision would be bumped up only by explicit method call. I do not need to delete keys, only add and update key,value pairs, and then roll back. I will never need to 'roll forward', that is, when rolling the dictionary back, all the newer revisions can be discarded, and I can start re-reving up again. thus I want behaviour like: >>> rr = rev_dictionary() >>> rr.rev 0 >>> rr["a"] = 17 >>> rr[('b',23)] = 'foo' >>> rr["a"] 17 >>> rr.rev 0 >>> rr.roll_rev() >>> rr.rev 1 >>> rr["a"] 17 >>> rr["a"] = 0 >>> rr["a"] 0 >>> rr[('b',23)] 'foo' >>> rr.roll_to(0) >>> rr.rev 0 >>> rr["a"] 17 >>> rr.roll_to(1) Exception ... Just to be clear, the state associated with a revision is the state of the dictionary just prior to the roll_rev() method call. thus if I can alter the value associated with a key several times 'within' a revision, and only have the last one remembered. I would like a fairly memory-efficient implementation of this: the memory usage should be proportional to the deltas. Thus simply having a list of copies of the dictionary will not scale for my problem. One should assume the keys are in the tens of thousands, and the revisions are in the hundreds of thousands. We can assume the values are immutable, but need not be numeric. For the case where the values are e.g. integers, there is a fairly straightforward implementation (have a list of dictionaries of the numerical delta from revision to revision). I am not sure how to turn this into the general form. Maybe bootstrap the integer version and add on an array of values? all help appreciated.

    Read the article

  • The Math of a Jump in a 2D game.

    - by ONi
    I have been all around with this question and I can't find the correct answer! So, behold, the description of my question: I'm working in J2ME, I have my gameloop that do the following: public void run() { Graphics g = this.getGraphics(); while (running) { long diff = System.currentTimeMillis() - lastLoop; lastLoop = System.currentTimeMillis(); input(); this.level.doLogic(); render(g, diff); try { Thread.sleep(10); } catch (InterruptedException e) { stop(e); } } } so it's just a basic gameloop, the doLogic() function calls for all the logic functions of the characters in the scene and render(g, diff) calls the animateChar function of every character on scene, following this, the animChar function in the Character class sets up everything in the screen as this: protected void animChar(long diff) { this.checkGravity(); this.move((int) ((diff * this.dx) / 1000), (int) ((diff * this.dy) / 1000)); if (this.acumFrame > this.framerate) { this.nextFrame(); this.acumFrame = 0; } else { this.acumFrame += diff; } } This ensures me that everything must to move according to the time that the machine takes to go from cycle to cycle (remember it's a phone, not a gaming rig). I'm sure it's not the most efficient way to achieve this behavior so I'm totally open for criticism of my programming skills in the comments, but here my problem: When I make I character jump, what I do is that I put his dy to a negative value, say -200 and I set the boolean jumping to true, that makes the character go up, and then I have this function called checkGravity() that ensure that everything that goes up has to go down, checkGravity also checks for the character being over platforms so I will strip it down a little for the sake of your time: public void checkGravity() { if (this.jumping) { this.jumpSpeed += 10; if (this.jumpSpeed > 0) { this.jumping = false; this.falling = true; } this.dy = this.jumpSpeed; } if (this.falling) { this.jumpSpeed += 10; if (this.jumpSpeed > 200) this.jumpSpeed = 200; this.dy = this.jumpSpeed; if (this.collidesWithPlatform()) { this.falling = false; this.standing = true; this.jumping = false; this.jumpSpeed = 0; this.dy = this.jumpSpeed; } } } So, the problem is, that this function updates the dy regardless of the diff, making the characters fly like Superman in slow machines, and I have no idea how to implement the diff factor so that when a character is jumping, his speed decrement in a proportional way to the game speed. Can anyone help me fix this issue? or give me pointers on how to make a 2D Jump in J2ME the right way. Thank you very much for your time.

    Read the article

  • Counting point size based on chart area during zooming/unzoomin

    - by Gacek
    Hi folks. I heave a quite simple task. I know (I suppose) it should be easy, but from the reasons I cannot understand, I try to solve it since 2 days and I don't know where I'm making the mistake. So, the problem is as follows: - we have a chart with some points - The chart starts with some known area and points have known size - we would like to "emulate" the zooming effect. So when we zoom to some part of the chart, the size of points is getting proportionally bigger. In other words, the smaller part of the chart we select, the bigger the point should get. So, we have something like that. We know this two parameters: initialArea; // Initial area - area of the whole chart, counted as width*height initialSize; // initial size of the points Now lets assume we are handling some kind of OnZoom event. We selected some part of the chart and would like to count the current size of the points float CountSizeOnZoom() { float currentArea = CountArea(...); // the area is counted for us. float currentSize = initialSize * initialArea / currentArea; return currentSize; } And it works. But the rate of change is too fast. In other words, the points are getting really big too soon. So I would like the currentSize to be invertly proportional to currentArea, but with some scaling coefficient. So I created the second function: float CountSizeOnZoom() { float currentArea = CountArea(...); % the area is counted for us. // Lets assume we want the size of points to change ten times slower, than area of the chart changed. float currentSize = initialSize + 0.1f* initialSize * ((initialArea / currentArea) -1); return currentSize; } Lets do some calculations in mind. if currentArea is smaller than initialArea, initialArea/currentArea > 1 and then we add "something" small and postive to initialSize. Checked, it works. Lets check what happens if we would un-zoom. currentArea will be equal to initialArea, so we would have 0 at the right side (1-1), so new size should be equal to initialSize. Right? Yeah. So lets check it... and it doesn't work. My question is: where is the mistake? Or maybe you have any ideas how to count this scaled size depending on current area in some other way?

    Read the article

  • Best practices for displaying large number of images as thumbnails in c#

    - by andySF
    I got to a point where it's very difficult to get answers by debugging and tracing object, so i need some help. What I'm trying to do: A history form for my screen capture pet project. The history must list all images as thumbnails (ex: picasa). What I've done: I created a HistoryItem:UserControl. This history item has a few buttons, a check box, a label and a picture box. The buttons are for delete/edit/copy image. The check box is used for selecting one or more images and the label is for some info text. The picture box is getting the image from a public property that is a path and a method creates a proportional thumbnail to display it when the control has been loaded. This user control has two public events. One for deleting the image and one for bubbling the events for mouse enter and mouse leave trough all controls. For this I use EventBroadcastProvider. The bubbling is useful because wherever I move the mouse over the control, the buttons appear. The dispose method has been extended and I manually remove the events. All images are loaded by looping a xml file that contains the path of all images. For each image in this XML I create a new HitoryItem that is added (after a little coding to sort and limit the amount of images loaded) to a flow layout panel. The problem: When I lunch the history form, and the flow layout panel is populated with my HistoryItem custom control, my memory usage increases drastically.From 14Mb to around 100MB with 100 images loaded. By closing the history form and disposing whatever I could dispose and even trying to call GC.Collect() the memory increase remain. I search for any object that could not be disposed properly like an image or event but wherever I used them they are disposed. The problem seams to be from multiple sources. One is that the events for bubbling are not disposing properly, and the other is from the picture box itself. All of this i could see by commenting all the code to a limited version when only the custom control without any image processing and even events is loaded. Without the events the memory consumption is reduced by axiomatically 20%. So my real question is if this logic, flow layout panels and custom controls with picture boxes, is the best solution for displaying large amounts of images as thumbnails. Thank you!

    Read the article

  • Big Oh Notation - formal definition.

    - by aloh
    I'm reading a textbook right now for my Java III class. We're reading about Big-Oh and I'm a little confused by its formal definition. Formal Definition: "A function f(n) is of order at most g(n) - that is, f(n) = O(g(n)) - if a positive real number c and positive integer N exist such that f(n) <= c g(n) for all n = N. That is, c g(n) is an upper bound on f(n) when n is sufficiently large." Ok, that makes sense. But hold on, keep reading...the book gave me this example: "In segment 9.14, we said that an algorithm that uses 5n + 3 operations is O(n). We now can show that 5n + 3 = O(n) by using the formal definition of Big Oh. When n = 3, 5n + 3 <= 5n + n = 6n. Thus, if we let f(n) = 5n + 3, g(n) = n, c = 6, N = 3, we have shown that f(n) <= 6 g(n) for n = 3, or 5n + 3 = O(n). That is, if an algorithm requires time directly proportional to 5n + 3, it is O(n)." Ok, this kind of makes sense to me. They're saying that if n = 3 or greater, 5n + 3 takes less time than if n was less than 3 - thus 5n + n = 6n - right? Makes sense, since if n was 2, 5n + 3 = 13 while 6n = 12 but when n is 3 or greater 5n + 3 will always be less than or equal to 6n. Here's where I get confused. They give me another example: Example 2: "Let's show that 4n^2 + 50n - 10 = O(n^2). It is easy to see that: 4n^2 + 50n - 10 <= 4n^2 + 50n for any n. Since 50n <= 50n^2 for n = 50, 4n^2 + 50n - 10 <= 4n^2 + 50n^2 = 54n^2 for n = 50. Thus, with c = 54 and N = 50, we have shown that 4n^2 + 50n - 10 = O(n^2)." This statement doesn't make sense: 50n <= 50n^2 for n = 50. Isn't any n going to make the 50n less than 50n^2? Not just greater than or equal to 50? Why did they even mention that 50n <= 50n^2? What does that have to do with the problem? Also, 4n^2 + 50n - 10 <= 4n^2 + 50n^2 = 54n^2 for n = 50 is going to be true no matter what n is. And how in the world does picking numbers show that f(n) = O(g(n))? Please help me understand! :(

    Read the article

  • Cloud Computing = Elasticity * Availability

    - by Herve Roggero
    What is cloud computing? Is hosting the same thing as cloud computing? Are you running a cloud if you already use virtual machines? What is the difference between Infrastructure as a Service (IaaS) and a cloud provider? And the list goes on… these questions keep coming up and all try to fundamentally explain what “cloud” means relative to other concepts. At the risk of over simplification, answering these questions becomes simpler once you understand the primary foundations of cloud computing: Elasticity and Availability.   Elasticity The basic value proposition of cloud computing is to pay as you go, and to pay for what you use. This implies that an application can expand and contract on demand, across all its tiers (presentation layer, services, database, security…).  This also implies that application components can grow independently from each other. So if you need more storage for your database, you should be able to grow that tier without affecting, reconfiguring or changing the other tiers. Basically, cloud applications behave like a sponge; when you add water to a sponge, it grows in size; in the application world, the more customers you add, the more it grows. Pure IaaS providers will provide certain benefits, specifically in terms of operating costs, but an IaaS provider will not help you in making your applications elastic; neither will Virtual Machines. The smallest elasticity unit of an IaaS provider and a Virtual Machine environment is a server (physical or virtual). While adding servers in a datacenter helps in achieving scale, it is hardly enough. The application has yet to use this hardware.  If the process of adding computing resources is not transparent to the application, the application is not elastic.   As you can see from the above description, designing for the cloud is not about more servers; it is about designing an application for elasticity regardless of the underlying server farm.   Availability The fact of the matter is that making applications highly available is hard. It requires highly specialized tools and trained staff. On top of it, it's expensive. Many companies are required to run multiple data centers due to high availability requirements. In some organizations, some data centers are simply on standby, waiting to be used in a case of a failover. Other organizations are able to achieve a certain level of success with active/active data centers, in which all available data centers serve incoming user requests. While achieving high availability for services is relatively simple, establishing a highly available database farm is far more complex. In fact it is so complex that many companies establish yearly tests to validate failover procedures.   To a certain degree certain IaaS provides can assist with complex disaster recovery planning and setting up data centers that can achieve successful failover. However the burden is still on the corporation to manage and maintain such an environment, including regular hardware and software upgrades. Cloud computing on the other hand removes most of the disaster recovery requirements by hiding many of the underlying complexities.   Cloud Providers A cloud provider is an infrastructure provider offering additional tools to achieve application elasticity and availability that are not usually available on-premise. For example Microsoft Azure provides a simple configuration screen that makes it possible to run 1 or 100 web sites by clicking a button or two on a screen (simplifying provisioning), and soon SQL Azure will offer Data Federation to allow database sharding (which allows you to scale the database tier seamlessly and automatically). Other cloud providers offer certain features that are not available on-premise as well, such as the Amazon SC3 (Simple Storage Service) which gives you virtually unlimited storage capabilities for simple data stores, which is somewhat equivalent to the Microsoft Azure Table offering (offering a server-independent data storage model). Unlike IaaS providers, cloud providers give you the necessary tools to adopt elasticity as part of your application architecture.    Some cloud providers offer built-in high availability that get you out of the business of configuring clustered solutions, or running multiple data centers. Some cloud providers will give you more control (which puts some of that burden back on the customers' shoulder) and others will tend to make high availability totally transparent. For example, SQL Azure provides high availability automatically which would be very difficult to achieve (and very costly) on premise.   Keep in mind that each cloud provider has its strengths and weaknesses; some are better at achieving transparent scalability and server independence than others.    Not for Everyone Note however that it is up to you to leverage the elasticity capabilities of a cloud provider, as discussed previously; if you build a website that does not need to scale, for which elasticity is not important, then you can use a traditional host provider unless you also need high availability. Leveraging the technologies of cloud providers can be difficult and can become a journey for companies that build their solutions in a scale up fashion. Cloud computing promises to address cost containment and scalability of applications with built-in high availability. If your application does not need to scale or you do not need high availability, then cloud computing may not be for you. In fact, you may pay a premium to run your applications with cloud providers due to the underlying technologies built specifically for scalability and availability requirements. And as such, the cloud is not for everyone.   Consistent Customer Experience, Predictable Cost With all its complexities, buzz and foggy definition, cloud computing boils down to a simple objective: consistent customer experience at a predictable cost.  The objective of a cloud solution is to provide the same user experience to your last customer than the first, while keeping your operating costs directly proportional to the number of customers you have. Making your applications elastic and highly available across all its tiers, with as much automation as possible, achieves the first objective of a consistent customer experience. And the ability to expand and contract the infrastructure footprint of your application dynamically achieves the cost containment objectives.     Herve Roggero is a SQL Azure MVP and co-author of Pro SQL Azure (APress).  He is the co-founder of Blue Syntax Consulting (www.bluesyntax.net), a company focusing on cloud computing technologies helping customers understand and adopt cloud computing technologies. For more information contact herve at hroggero @ bluesyntax.net .

    Read the article

  • Why Is Vertical Resolution Monitor Resolution so Often a Multiple of 360?

    - by Jason Fitzpatrick
    Stare at a list of monitor resolutions long enough and you might notice a pattern: many of the vertical resolutions, especially those of gaming or multimedia displays, are multiples of 360 (720, 1080, 1440, etc.) But why exactly is this the case? Is it arbitrary or is there something more at work? Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites. The Question SuperUser reader Trojandestroy recently noticed something about his display interface and needs answers: YouTube recently added 1440p functionality, and for the first time I realized that all (most?) vertical resolutions are multiples of 360. Is this just because the smallest common resolution is 480×360, and it’s convenient to use multiples? (Not doubting that multiples are convenient.) And/or was that the first viewable/conveniently sized resolution, so hardware (TVs, monitors, etc) grew with 360 in mind? Taking it further, why not have a square resolution? Or something else unusual? (Assuming it’s usual enough that it’s viewable). Is it merely a pleasing-the-eye situation? So why have the display be a multiple of 360? The Answer SuperUser contributor User26129 offers us not just an answer as to why the numerical pattern exists but a history of screen design in the process: Alright, there are a couple of questions and a lot of factors here. Resolutions are a really interesting field of psychooptics meeting marketing. First of all, why are the vertical resolutions on youtube multiples of 360. This is of course just arbitrary, there is no real reason this is the case. The reason is that resolution here is not the limiting factor for Youtube videos – bandwidth is. Youtube has to re-encode every video that is uploaded a couple of times, and tries to use as little re-encoding formats/bitrates/resolutions as possible to cover all the different use cases. For low-res mobile devices they have 360×240, for higher res mobile there’s 480p, and for the computer crowd there is 360p for 2xISDN/multiuser landlines, 720p for DSL and 1080p for higher speed internet. For a while there were some other codecs than h.264, but these are slowly being phased out with h.264 having essentially ‘won’ the format war and all computers being outfitted with hardware codecs for this. Now, there is some interesting psychooptics going on as well. As I said: resolution isn’t everything. 720p with really strong compression can and will look worse than 240p at a very high bitrate. But on the other side of the spectrum: throwing more bits at a certain resolution doesn’t magically make it better beyond some point. There is an optimum here, which of course depends on both resolution and codec. In general: the optimal bitrate is actually proportional to the resolution. So the next question is: what kind of resolution steps make sense? Apparently, people need about a 2x increase in resolution to really see (and prefer) a marked difference. Anything less than that and many people will simply not bother with the higher bitrates, they’d rather use their bandwidth for other stuff. This has been researched quite a long time ago and is the big reason why we went from 720×576 (415kpix) to 1280×720 (922kpix), and then again from 1280×720 to 1920×1080 (2MP). Stuff in between is not a viable optimization target. And again, 1440P is about 3.7MP, another ~2x increase over HD. You will see a difference there. 4K is the next step after that. Next up is that magical number of 360 vertical pixels. Actually, the magic number is 120 or 128. All resolutions are some kind of multiple of 120 pixels nowadays, back in the day they used to be multiples of 128. This is something that just grew out of LCD panel industry. LCD panels use what are called line drivers, little chips that sit on the sides of your LCD screen that control how bright each subpixel is. Because historically, for reasons I don’t really know for sure, probably memory constraints, these multiple-of-128 or multiple-of-120 resolutions already existed, the industry standard line drivers became drivers with 360 line outputs (1 per subpixel). If you would tear down your 1920×1080 screen, I would be putting money on there being 16 line drivers on the top/bottom and 9 on one of the sides. Oh hey, that’s 16:9. Guess how obvious that resolution choice was back when 16:9 was ‘invented’. Then there’s the issue of aspect ratio. This is really a completely different field of psychology, but it boils down to: historically, people have believed and measured that we have a sort of wide-screen view of the world. Naturally, people believed that the most natural representation of data on a screen would be in a wide-screen view, and this is where the great anamorphic revolution of the ’60s came from when films were shot in ever wider aspect ratios. Since then, this kind of knowledge has been refined and mostly debunked. Yes, we do have a wide-angle view, but the area where we can actually see sharply – the center of our vision – is fairly round. Slightly elliptical and squashed, but not really more than about 4:3 or 3:2. So for detailed viewing, for instance for reading text on a screen, you can utilize most of your detail vision by employing an almost-square screen, a bit like the screens up to the mid-2000s. However, again this is not how marketing took it. Computers in ye olden days were used mostly for productivity and detailed work, but as they commoditized and as the computer as media consumption device evolved, people didn’t necessarily use their computer for work most of the time. They used it to watch media content: movies, television series and photos. And for that kind of viewing, you get the most ‘immersion factor’ if the screen fills as much of your vision (including your peripheral vision) as possible. Which means widescreen. But there’s more marketing still. When detail work was still an important factor, people cared about resolution. As many pixels as possible on the screen. SGI was selling almost-4K CRTs! The most optimal way to get the maximum amount of pixels out of a glass substrate is to cut it as square as possible. 1:1 or 4:3 screens have the most pixels per diagonal inch. But with displays becoming more consumery, inch-size became more important, not amount of pixels. And this is a completely different optimization target. To get the most diagonal inches out of a substrate, you want to make the screen as wide as possible. First we got 16:10, then 16:9 and there have been moderately successful panel manufacturers making 22:9 and 2:1 screens (like Philips). Even though pixel density and absolute resolution went down for a couple of years, inch-sizes went up and that’s what sold. Why buy a 19″ 1280×1024 when you can buy a 21″ 1366×768? Eh… I think that about covers all the major aspects here. There’s more of course; bandwidth limits of HDMI, DVI, DP and of course VGA played a role, and if you go back to the pre-2000s, graphics memory, in-computer bandwdith and simply the limits of commercially available RAMDACs played an important role. But for today’s considerations, this is about all you need to know. Have something to add to the explanation? Sound off in the the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.     

    Read the article

  • New Bundling and Minification Support (ASP.NET 4.5 Series)

    - by ScottGu
    This is the sixth in a series of blog posts I'm doing on ASP.NET 4.5. The next release of .NET and Visual Studio include a ton of great new features and capabilities.  With ASP.NET 4.5 you'll see a bunch of really nice improvements with both Web Forms and MVC - as well as in the core ASP.NET base foundation that both are built upon. Today’s post covers some of the work we are doing to add built-in support for bundling and minification into ASP.NET - which makes it easy to improve the performance of applications.  This feature can be used by all ASP.NET applications, including both ASP.NET MVC and ASP.NET Web Forms solutions. Basics of Bundling and Minification As more and more people use mobile devices to surf the web, it is becoming increasingly important that the websites and apps we build perform well with them. We’ve all tried loading sites on our smartphones – only to eventually give up in frustration as it loads slowly over a slow cellular network.  If your site/app loads slowly like that, you are likely losing potential customers because of bad performance.  Even with powerful desktop machines, the load time of your site and perceived performance can make an enormous customer perception. Most websites today are made up of multiple JavaScript and CSS files to separate the concerns and keep the code base tight. While this is a good practice from a coding point of view, it often has some unfortunate consequences for the overall performance of the website.  Multiple JavaScript and CSS files require multiple HTTP requests from a browser – which in turn can slow down the performance load time.  Simple Example Below I’ve opened a local website in IE9 and recorded the network traffic using IE’s built-in F12 developer tools. As shown below, the website consists of 5 CSS and 4 JavaScript files which the browser has to download. Each file is currently requested separately by the browser and returned by the server, and the process can take a significant amount of time proportional to the number of files in question. Bundling ASP.NET is adding a feature that makes it easy to “bundle” or “combine” multiple CSS and JavaScript files into fewer HTTP requests. This causes the browser to request a lot fewer files and in turn reduces the time it takes to fetch them.   Below is an updated version of the above sample that takes advantage of this new bundling functionality (making only one request for the JavaScript and one request for the CSS): The browser now has to send fewer requests to the server. The content of the individual files have been bundled/combined into the same response, but the content of the files remains the same - so the overall file size is exactly the same as before the bundling.   But notice how even on a local dev machine (where the network latency between the browser and server is minimal), the act of bundling the CSS and JavaScript files together still manages to reduce the overall page load time by almost 20%.  Over a slow network the performance improvement would be even better. Minification The next release of ASP.NET is also adding a new feature that makes it easy to reduce or “minify” the download size of the content as well.  This is a process that removes whitespace, comments and other unneeded characters from both CSS and JavaScript. The result is smaller files, which will download and load in a browser faster.  The graph below shows the performance gain we are seeing when both bundling and minification are used together: Even on my local dev box (where the network latency is minimal), we now have a 40% performance improvement from where we originally started.  On slow networks (and especially with international customers), the gains would be even more significant. Using Bundling and Minification inside ASP.NET The upcoming release of ASP.NET makes it really easy to take advantage of bundling and minification within projects and see performance gains like in the scenario above. The way it does this allows you to avoid having to run custom tools as part of your build process –  instead ASP.NET has added runtime support to perform the bundling/minification for you dynamically (caching the results to make sure perf is great).  This enables a really clean development experience and makes it super easy to start to take advantage of these new features. Let’s assume that we have a simple project that has 4 JavaScript files and 6 CSS files: Bundling and Minifying the .css files Let’s say you wanted to reference all of the stylesheets in the “Styles” folder above on a page.  Today you’d have to add multiple CSS references to get all of them – which would translate into 6 separate HTTP requests: The new bundling/minification feature now allows you to instead bundle and minify all of the .css files in the Styles folder – simply by sending a URL request to the folder (in this case “styles”) with an appended “/css” path after it.  For example:    This will cause ASP.NET to scan the directory, bundle and minify the .css files within it, and send back a single HTTP response with all of the CSS content to the browser.  You don’t need to run any tools or pre-processor to get this behavior.  This enables you to cleanly separate your CSS into separate logical .css files and maintain a very clean development experience – while not taking a performance hit at runtime for doing so.  The Visual Studio designer will also honor the new bundling/minification logic as well – so you’ll still get a WYSWIYG designer experience inside VS as well. Bundling and Minifying the JavaScript files Like the CSS approach above, if we wanted to bundle and minify all of our JavaScript into a single response we could send a URL request to the folder (in this case “scripts”) with an appended “/js” path after it:   This will cause ASP.NET to scan the directory, bundle and minify the .js files within it, and send back a single HTTP response with all of the JavaScript content to the browser.  Again – no custom tools or builds steps were required in order to get this behavior.  And it works with all browsers. Ordering of Files within a Bundle By default, when files are bundled by ASP.NET they are sorted alphabetically first, just like they are shown in Solution Explorer. Then they are automatically shifted around so that known libraries and their custom extensions such as jQuery, MooTools and Dojo are loaded before anything else. So the default order for the merged bundling of the Scripts folder as shown above will be: Jquery-1.6.2.js Jquery-ui.js Jquery.tools.js a.js By default, CSS files are also sorted alphabetically and then shifted around so that reset.css and normalize.css (if they are there) will go before any other file. So the default sorting of the bundling of the Styles folder as shown above will be: reset.css content.css forms.css globals.css menu.css styles.css The sorting is fully customizable, though, and can easily be changed to accommodate most use cases and any common naming pattern you prefer.  The goal with the out of the box experience, though, is to have smart defaults that you can just use and be successful with. Any number of directories/sub-directories supported In the example above we just had a single “Scripts” and “Styles” folder for our application.  This works for some application types (e.g. single page applications).  Often, though, you’ll want to have multiple CSS/JS bundles within your application – for example: a “common” bundle that has core JS and CSS files that all pages use, and then page specific or section specific files that are not used globally. You can use the bundling/minification support across any number of directories or sub-directories in your project – this makes it easy to structure your code so as to maximize the bunding/minification benefits.  Each directory by default can be accessed as a separate URL addressable bundle.  Bundling/Minification Extensibility ASP.NET’s bundling and minification support is built with extensibility in mind and every part of the process can be extended or replaced. Custom Rules In addition to enabling the out of the box - directory-based - bundling approach, ASP.NET also supports the ability to register custom bundles using a new programmatic API we are exposing.  The below code demonstrates how you can register a “customscript” bundle using code within an application’s Global.asax class.  The API allows you to add/remove/filter files that go into the bundle on a very granular level:     The above custom bundle can then be referenced anywhere within the application using the below <script> reference:     Custom Processing You can also override the default CSS and JavaScript bundles to support your own custom processing of the bundled files (for example: custom minification rules, support for Saas, LESS or Coffeescript syntax, etc). In the example below we are indicating that we want to replace the built-in minification transforms with a custom MyJsTransform and MyCssTransform class. They both subclass the CSS and JavaScript minifier respectively and can add extra functionality:     The end result of this extensibility is that you can plug-into the bundling/minification logic at a deep level and do some pretty cool things with it. 2 Minute Video of Bundling and Minification in Action Mads Kristensen has a great 90 second video that shows off using the new Bundling and Minification feature.  You can watch the 90 second video here. Summary The new bundling and minification support within the next release of ASP.NET will make it easier to build fast web applications.  It is really easy to use, and doesn’t require major changes to your existing dev workflow.  It is also supports a rich extensibility API that enables you to customize it however you want. You can easily take advantage of this new support within ASP.NET MVC, ASP.NET Web Forms and ASP.NET Web Pages based applications. Hope this helps, Scott P.S. In addition to blogging, I use Twitter to-do quick posts and share links. My Twitter handle is: @scottgu

    Read the article

  • Trace flags - TF 1117

    - by Damian
    I had a session about trace flags this year on the SQL Day 2014 conference that was held in Wroclaw at the end of April. The session topic is important to most of DBA's and the reason I did it was that I sometimes forget about various trace flags :). So I decided to prepare a presentation but I think it is a good idea to write posts about trace flags, too. Let's start then - today I will describe the TF 1117. I assume that we all know how to setup a TF using starting parameters or registry or in the session or on the query level. I will always write if a trace flag is local or global to make sure we know how to use it. Why do we need this trace flag? Let’s create a test database first. This is quite ordinary database as it has two data files (4 MB each) and a log file that has 1MB. The data files are able to expand by 1 MB and the log file grows by 10%: USE [master] GO CREATE DATABASE [TF1117]  ON  PRIMARY ( NAME = N'TF1117',      FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.SQL2014\MSSQL\DATA\TF1117.mdf' ,      SIZE = 4096KB ,      MAXSIZE = UNLIMITED,      FILEGROWTH = 1024KB ), ( NAME = N'TF1117_1',      FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.SQL2014\MSSQL\DATA\TF1117_1.ndf' ,      SIZE = 4096KB ,      MAXSIZE = UNLIMITED,      FILEGROWTH = 1024KB )  LOG ON ( NAME = N'TF1117_log',      FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.SQL2014\MSSQL\DATA\TF1117_log.ldf' ,      SIZE = 1024KB ,      MAXSIZE = 2048GB ,      FILEGROWTH = 10% ) GO Without the TF 1117 turned on the data files don’t grow all up at once. When a first file is full the SQL Server expands it but the other file is not expanded until is full. Why is that so important? The SQL Server proportional fill algorithm will direct new extent allocations to the file with the most available space so new extents will be written to the file that was just expanded. When the TF 1117 is enabled it will cause all files to auto grow by their specified increment. That means all files will have the same percent of free space so we still have the benefit of evenly distributed IO. The TF 1117 is global flag so it affects all databases on the instance. Of course if a filegroup contains only one file the TF does not have any effect on it. Now let’s do a simple test. First let’s create a table in which every row will fit to a single page: The table definition is pretty simple as it has two integer columns and one character column of fixed size 8000 bytes: create table TF1117Tab (      col1 int,      col2 int,      col3 char (8000) ) go Now I load some data to the table to make sure that one of the data file must grow: declare @i int select @i = 1 while (@i < 800) begin       insert into TF1117Tab  values (@i, @i+1000, 'hello')        select @i= @i + 1 end I can check the actual file size in the sys.database_files DMV: SELECT name, (size*8)/1024 'Size in MB' FROM sys.database_files  GO   As you can see only the first data file was  expanded and the other has still the initial size:   name                  Size in MB --------------------- ----------- TF1117                5 TF1117_log            1 TF1117_1              4 There is also other methods of looking at the events of file autogrows. One possibility is to create an Extended Events session and the other is to look into the default trace file:     DECLARE @path NVARCHAR(260); SELECT    @path = REVERSE(SUBSTRING(REVERSE([path]),          CHARINDEX('\', REVERSE([path])), 260)) + N'log.trc' FROM    sys.traces WHERE   is_default = 1; SELECT    DatabaseName,                 [FileName],                 SPID,                 Duration,                 StartTime,                 EndTime,                 FileType =                         CASE EventClass                                     WHEN 92 THEN 'Data'                                    WHEN 93 THEN 'Log'             END FROM sys.fn_trace_gettable(@path, DEFAULT) WHERE   EventClass IN (92,93) AND StartTime >'2014-07-12' AND DatabaseName = N'TF1117' ORDER BY   StartTime DESC;   After running the query I can see the file was expanded and how long did the process take which might be useful from the performance perspective.    Now it’s time to turn on the flag 1117. DBCC TRACEON(1117)   I dropped the database and recreated it once again. Then I ran the queries and observed the results. After loading the records I see that both files were evenly expanded: name                  Size in MB --------------------- ----------- TF1117                5 TF1117_log            1 TF1117_1              5 I found also information in the default trace. The query returned three rows. The last one is connected to my first experiment when the TF was turned off.  The two rows shows that first file was expanded by 1MB and right after that operation the second file was expanded, too. This is what is this TF all about J  

    Read the article

  • Python: Memory usage and optimization when modifying lists

    - by xApple
    The problem My concern is the following: I am storing a relativity large dataset in a classical python list and in order to process the data I must iterate over the list several times, perform some operations on the elements, and often pop an item out of the list. It seems that deleting one item out of a Python list costs O(N) since Python has to copy all the items above the element at hand down one place. Furthermore, since the number of items to delete is approximately proportional to the number of elements in the list this results in an O(N^2) algorithm. I am hoping to find a solution that is cost effective (time and memory-wise). I have studied what I could find on the internet and have summarized my different options below. Which one is the best candidate ? Keeping a local index: while processingdata: index = 0 while index < len(somelist): item = somelist[index] dosomestuff(item) if somecondition(item): del somelist[index] else: index += 1 This is the original solution I came up with. Not only is this not very elegant, but I am hoping there is better way to do it that remains time and memory efficient. Walking the list backwards: while processingdata: for i in xrange(len(somelist) - 1, -1, -1): dosomestuff(item) if somecondition(somelist, i): somelist.pop(i) This avoids incrementing an index variable but ultimately has the same cost as the original version. It also breaks the logic of dosomestuff(item) that wishes to process them in the same order as they appear in the original list. Making a new list: while processingdata: for i, item in enumerate(somelist): dosomestuff(item) newlist = [] for item in somelist: if somecondition(item): newlist.append(item) somelist = newlist gc.collect() This is a very naive strategy for eliminating elements from a list and requires lots of memory since an almost full copy of the list must be made. Using list comprehensions: while processingdata: for i, item in enumerate(somelist): dosomestuff(item) somelist[:] = [x for x in somelist if somecondition(x)] This is very elegant but under-the-cover it walks the whole list one more time and must copy most of the elements in it. My intuition is that this operation probably costs more than the original del statement at least memory wise. Keep in mind that somelist can be huge and that any solution that will iterate through it only once per run will probably always win. Using the filter function: while processingdata: for i, item in enumerate(somelist): dosomestuff(item) somelist = filter(lambda x: not subtle_condition(x), somelist) This also creates a new list occupying lots of RAM. Using the itertools' filter function: from itertools import ifilterfalse while processingdata: for item in itertools.ifilterfalse(somecondtion, somelist): dosomestuff(item) This version of the filter call does not create a new list but will not call dosomestuff on every item breaking the logic of the algorithm. I am including this example only for the purpose of creating an exhaustive list. Moving items up the list while walking while processingdata: index = 0 for item in somelist: dosomestuff(item) if not somecondition(item): somelist[index] = item index += 1 del somelist[index:] This is a subtle method that seems cost effective. I think it will move each item (or the pointer to each item ?) exactly once resulting in an O(N) algorithm. Finally, I hope Python will be intelligent enough to resize the list at the end without allocating memory for a new copy of the list. Not sure though. Abandoning Python lists: class Doubly_Linked_List: def __init__(self): self.first = None self.last = None self.n = 0 def __len__(self): return self.n def __iter__(self): return DLLIter(self) def iterator(self): return self.__iter__() def append(self, x): x = DLLElement(x) x.next = None if self.last is None: x.prev = None self.last = x self.first = x self.n = 1 else: x.prev = self.last x.prev.next = x self.last = x self.n += 1 class DLLElement: def __init__(self, x): self.next = None self.data = x self.prev = None class DLLIter: etc... This type of object resembles a python list in a limited way. However, deletion of an element is guaranteed O(1). I would not like to go here since this would require massive amounts of code refactoring almost everywhere.

    Read the article

  • Optimizing Python code with many attribute and dictionary lookups

    - by gotgenes
    I have written a program in Python which spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know if there's any way I can optimize these lookup times, potentially with a C extension, to reduce the time of execution, or if I need to simply re-implement the program in a compiled language. The program implements some algorithms using a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile using a reduced data set that could actually complete. The vast majority of the time is being burned in one function, and specifically in two statements, generator expressions, within the function: The generator expression at line 202 is neighbors_in_selected_nodes = (neighbor for neighbor in node_neighbors if neighbor in selected_nodes) and the generator expression at line 204 is neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for neighbor in neighbors_in_selected_nodes) The source code for this function of context provided below. selected_nodes is a set of nodes in the interaction_graph, which is a NetworkX Graph instance. node_neighbors is an iterator from Graph.neighbors_iter(). Graph itself uses dictionaries for storing nodes and edges. Its Graph.node attribute is a dictionary which stores nodes and their attributes (e.g., 'weight') in dictionaries belonging to each node. Each of these lookups should be amortized constant time (i.e., O(1)), however, I am still paying a large penalty for the lookups. Is there some way which I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language? Below is the full source code for the function that provides the context; the vast majority of execution time is spent within this function. def calculate_node_z_prime( node, interaction_graph, selected_nodes ): """Calculates a z'-score for a given node. The z'-score is based on the z-scores (weights) of the neighbors of the given node, and proportional to the z-score (weight) of the given node. Specifically, we find the maximum z-score of all neighbors of the given node that are also members of the given set of selected nodes, multiply this z-score by the z-score of the given node, and return this value as the z'-score for the given node. If the given node has no neighbors in the interaction graph, the z'-score is defined as zero. Returns the z'-score as zero or a positive floating point value. :Parameters: - `node`: the node for which to compute the z-prime score - `interaction_graph`: graph containing the gene-gene or gene product-gene product interactions - `selected_nodes`: a `set` of nodes fitting some criterion of interest (e.g., annotated with a term of interest) """ node_neighbors = interaction_graph.neighbors_iter(node) neighbors_in_selected_nodes = (neighbor for neighbor in node_neighbors if neighbor in selected_nodes) neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for neighbor in neighbors_in_selected_nodes) try: max_z_score = max(neighbor_z_scores) # max() throws a ValueError if its argument has no elements; in this # case, we need to set the max_z_score to zero except ValueError, e: # Check to make certain max() raised this error if 'max()' in e.args[0]: max_z_score = 0 else: raise e z_prime = interaction_graph.node[node]['weight'] * max_z_score return z_prime Here are the top couple of calls according to cProfiler, sorted by time. ncalls tottime percall cumtime percall filename:lineno(function) 156067701 352.313 0.000 642.072 0.000 bpln_contextual.py:204(<genexpr>) 156067701 289.759 0.000 289.759 0.000 bpln_contextual.py:202(<genexpr>) 13963893 174.047 0.000 816.119 0.000 {max} 13963885 69.804 0.000 936.754 0.000 bpln_contextual.py:171(calculate_node_z_prime) 7116883 61.982 0.000 61.982 0.000 {method 'update' of 'set' objects}

    Read the article

  • SQLAuthority News – Job Interviewing the Right Way (and for the Right Reasons) – Guest Post by Feodor Georgiev

    - by pinaldave
    Feodor Georgiev is a SQL Server database specialist with extensive experience of thinking both within and outside the box. He has wide experience of different systems and solutions in the fields of architecture, scalability, performance, etc. Feodor has experience with SQL Server 2000 and later versions, and is certified in SQL Server 2008. Feodor has written excellent article on Job Interviewing the Right Way. Here is his article in his own language. A while back I was thinking to start a blog post series on interviewing and employing IT personnel. At that time I had just read the ‘Smart and gets things done’ book (http://www.joelonsoftware.com/items/2007/06/05.html) and I was hyped up on some debatable topics regarding finding and employing the best people in the branch. I have no problem with hiring the best of the best; it’s just the definition of ‘the best of the best’ that makes things a bit more complicated. One of the fundamental books one can read on the topic of interviewing is the one mentioned above. If you have not read it, then you must do so; not because it contains the ultimate truth, and not because it gives the answers to most questions on the subject, but because the book contains an extensive set of questions about interviewing and employing people. Of course, a big part of these questions have different answers, depending on location, culture, available funds and so on. (What works in the US may not necessarily work in the Nordic countries or India, or it may work in a different way). The only thing that is valid regardless of any external factor is this: curiosity. In my belief there are two kinds of people – curious and not-so-curious; regardless of profession. Think about it – professional success is directly proportional to the individual’s curiosity + time of active experience in the field. (I say ‘active experience’ because vacations and any distractions do not count as experience :)  ) So, curiosity is the factor which will distinguish a good employee from the not-so-good one. But let’s shift our attention to something else for now: a few tips and tricks for successful interviews. Tip and trick #1: get your priorities straight. Your status usually dictates your priorities; for example, if the person looking for a job has just relocated to a new country, they might tend to ignore some of their priorities and overload others. In other words, setting priorities straight means to define the personal criteria by which the interview process is lead. For example, similar to the following questions can help define the criteria for someone looking for a job: How badly do I need a (any) job? Is it more important to work in a clean and quiet environment or is it important to get paid well (or both, if possible)? And so on… Furthermore, before going to the interview, the candidate should have a list of priorities, sorted by the most importance: e.g. I want a quiet environment, x amount of money, great helping boss, a desk next to a window and so on. Also it is a good idea to be prepared and know which factors can be compromised and to what extent. Tip and trick #2: the interview is a two-way street. A job candidate should not forget that the interview process is not a one-way street. What I mean by this is that while the employer is interviewing the potential candidate, the job seeker should not miss the chance to interview the employer. Usually, the employer and the candidate will meet for an interview and talk about a variety of topics. In a quality interview the candidate will be presented to key members of the team and will have the opportunity to ask them questions. By asking the right questions both parties will define their opinion about each other. For example, if the candidate talks to one of the potential bosses during the interview process and they notice that the potential manager has a hard time formulating a question, then it is up to the candidate to decide whether working with such person is a red flag for them. There are as many interview processes out there as there are companies and each one is different. Some bigger companies and corporates can afford pre-selection processes, 3 or even 4 stages of interviews, small companies usually settle with one interview. Some companies even give cognitive tests on the interview. Why not? In his book Joel suggests that a good candidate should be pampered and spoiled beyond belief with a week-long vacation in New York, fancy hotels, food and who knows what. For all I can imagine, an interview might even take place at the top of the Eifel tower (right, Mr. Joel, right?) I doubt, however, that this is the optimal way to capture the attention of a good employee. The ‘curiosity’ topic What I have learned so far in my professional experience is that opinions can be subjective. Plus, opinions on technology subjects can also be subjective. According to Joel, only hiring the best of the best is worth it. If you ask me, there is no such thing as best of the best, simply because human nature (well, aside from some physical limitations, like putting your pants on through your head :) ) has no boundaries. And why would it have boundaries? I have seen many curious and interesting people, naturally good at technology, though uninterested in it as one  can possibly be; I have also seen plenty of people interested in technology, who (in an ideal world) should have stayed far from it. At any rate, all of this sums up at the end to the ‘supply and demand’ factor. The interview process big-bang boils down to this: If there is a mutual benefit for both the employer and the potential employee to work together, then it all sorts out nicely. If there is no benefit, then it is much harder to get to a common place. Tip and trick #3: word-of-mouth is worth a thousand words Here I would just mention that the best thing a job candidate can get during the interview process is access to future team members or other employees of the new company. Nowadays the world has become quite small and everyone knows everyone. Look at LinkedIn, look at other professional networks and you will realize how small the world really is. Knowing people is a good way to become more approachable and to approach them. Tip and trick #4: Be confident. It is true that for some people confidence is as natural as breathing and others have to work hard to express it. Confidence is, however, a key factor in convincing the other side (potential employer or employee) that there is a great chance for success by working together. But it cannot get you very far if it’s not backed up by talent, curiosity and knowledge. Tip and trick #5: The right reasons What really bothers me in Sweden (and I am sure that there are similar situations in other countries) is that there is a tendency to fill quotas and to filter out candidates by criteria different from their skill and knowledge. In job ads I see quite often the phrases ‘positive thinker’, ‘team player’ and many similar hints about personality features. So my guess here is that discrimination has evolved to a new level. Let me clear up the definition of discrimination: ‘unfair treatment of a person or group on the basis of prejudice’. And prejudice is the ‘partiality that prevents objective consideration of an issue or situation’. In other words, there is not much difference whether a job candidate is filtered out by race, gender or by personality features – it is all a bad habit. And in reality, there is no proven correlation between the technology knowledge paired with skills and the personal features (gender, race, age, optimism). It is true that a significantly greater number of Darwin awards were given to men than to women, but I am sure that somewhere there is a paper or theory explaining the genetics behind this. J This topic actually brings to mind one of my favorite work related stories. A while back I was working for a big company with many teams involved in their processes. One of the teams was occupying 2 rooms – one had the team members and was full of light, colorful posters, chit-chats and giggles, whereas the other room was dark, lighted only by a single monitor with a quiet person in front of it. Later on I realized that the ‘dark room’ person was the guru and the ultimate problem-solving-brain who did not like the chats and giggles and hence was in a separate room. In reality, all severe problems which the chatty and cheerful team members could not solve and all emergencies were directed to ‘the dark room’. And thus all worked out well. The moral of the story: Personality has nothing to do with technology knowledge and skills. End of story. Summary: I’d like to stress the fact that there is no ultimately perfect candidate for a job, and there is no such thing as ‘best-of-the-best’. From my personal experience, the main criteria by which I measure people (co-workers and bosses) is the curiosity factor; I know from experience that the more curious and inventive a person is, the better chances there are for great achievements in their field. Related stories: (for extra credit) 1) Get your priorities straight. A while back as a consultant I was working for a few days at a time at different offices and for different clients, and so I was able to compare and analyze the work environments. There were two different places which I compared and recently I asked a friend of mine the following question: “Which one would you prefer as a work environment: a noisy office full of people, or a quiet office full of faulty smells because the office is rarely cleaned?” My friend was puzzled for a while, thought about it and said: “Hmm, you are talking about two different kinds of pollution… I will probably choose the second, since I can clean the workplace myself a bit…” 2) The interview is a two-way street. One time, during a job interview, I met a potential boss that had a hard time phrasing a question. At that particular time it was clear to me that I would not have liked to work under this person. According to my work religion, the properly asked question contains at least half of the answer. And if I work with someone who cannot ask a question… then I’d be doing double or triple work. At another interview, after the technical part with the team leader of the department, I was introduced to one of the team members and we were left alone for 5 minutes. I immediately jumped on the occasion and asked the blunt question: ‘What have you learned here for the past year and how do you like your job?’ The team member looked at me and said ‘Nothing really. I like playing with my cats at home, so I am out of here at 5pm and I don’t have time for much.’ I was disappointed at the time and I did not take the job offer. I wasn’t that shocked a few months later when the company went bankrupt. 3) The right reasons to take a job: personality check. A while back I was asked to serve as a job reference for a coworker. I agreed, and after some weeks I got a phone call from the company where my colleague was applying for a job. The conversation started with the manager’s question about my colleague’s personality and about their social skills. (You can probably guess what my internal reaction was… J ) So, after 30 minutes of pouring common sense into the interviewer’s head, we finally agreed on the fact that a shy or quiet personality has nothing to do with work skills and knowledge. Some years down the road my former colleague is taking the manager’s position as the manager is demoted to a different department. Reference: Feodor Georgiev, Pinal Dave (http://blog.SQLAuthority.com) Filed under: PostADay, Readers Contribution, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology

    Read the article

  • Understanding the value of Customer Experience & Loyalty for the Telecommunications Industry

    - by raul.goycoolea
    Worried by economic woes and market forces, especially in mature markets, communications service providers (CSPs) increasingly focus on improving customer experience. In fact, it seems difficult to find a major message by a C-level executive in the developed world that does not include something on "meeting and exceeding customers' needs". Frequently in customer satisfaction studies by prominent firms, CSPs fall short of the leadership demonstrated by other industries that take customer-centric approaches to their bottom-line strategies. Consider the following:Despite the continued impact of global economic crisis, in July 2010, Apple Computer posted record revenue and net quarterly profit. Those who attribute the results primarily to the iPhone 4 launch should note that Apple also shipped around 30% more Macintosh computers than the same period the previous year. Even sales of the iPod line increased by 8% in a highly commoditized, shrinking media player market. Finally, Apple began selling iPads during the quarter, with total sales of more than 3 million units. What does Apple have that the others lack? Well, some great products (and services) to be sure, but it also excels at customer service and support, marketing, and distribution, and has one of the strongest brands globally. Its products are useful, simple to use, easy to acquire and augment, high quality, and considered very cool. They also evoke such an emotional response from many of Apple's customers, which they turn up their noses at competitive products.In other words, Apple appears to have mastered virtually every aspect of customer experience and the resultant loyalty of its customer base - even in difficult financial times. Through that unwavering customer focus, Apple continues to drive its revenues and profits to new heights. Other customer loyalty leaders like Wal-Mart, Google, Toyota and Honda are also doing well by focusing on customer experience as an essential driver of profitability. Service providers should note this performance and ask themselves how they might leverage the same principles to increase their own profitability. After all, that is what customer experience and loyalty are all about: profitability.To successfully manage all the critical touch points of customer experience, CSPs must shun the one-size-fits-all approach. They can no longer afford to view customer service fundamentally as an act of altruism - which mentality dates back to the industry's civil service days, when CSPs were typically government organizations that were critical to economic development and public safety.As regulators and public officials have pushed, and continue to push, service providers to new heights of reliability - using incentives and punishments - most CSPs already have some of the fundamental building blocks of customer service in place. Yet despite that history and experience, service providers still lag other industries in providing what is seen as good customer service.As we observed in the TMF's 2009 Insights Research report, Customer Experience Management: Driving Loyalty & Profitability there has been resurgence in interest by CSPs. More and more of them have stated ambitions to catch up other industries, and they are realizing that good customer service is a powerful strategy for increasing business performance and profitability, not an act of good will.CSPs are recognizing the connection between customer experience and profitability, as demonstrated in many studies. For example, according to research by Bain & Company, a 5 percent improvement in customer retention rates can yield as much as a 75 percent increase in profits for companies across a range of industries.After decades of customer experience strategy formulation, Bain partner and business author, Frederick Reichheld, considers "would you recommend us to a friend?" as the ultimate question for a customer. How many times have you or your friends recommended an iPod, iPhone or a Mac? What do your children recommend to their peers? Their peers to them?There are certain steps service providers have to take to create more personalized relationships with their customers, as well as reduce churn and increase profitability, all while becoming leaner and more agile. First, they have to define customer experience, we define it as the result of the sum of observations, perceptions, thoughts and feelings arising from interactions and relationships between customers and their service provider(s). Virtually every customer touch point - whether directly or indirectly linked to service providers and their partners - contributes to customer perception, satisfaction, loyalty, and ultimately profitability. Gaining leadership in customer experience and satisfaction will not be a simple task, as it is affected by virtually every customer-facing aspect of the service provider, and in turn impacts the service provider deeply - especially on the all-important bottom line. The scope of issues affecting customer experience is complex and dynamic.With new services, devices and applications extending the basis of customer experience to domains beyond the direct control of the service provider, it is likely to increase in complexity and dynamism.Customer loyalty = increased profitsAs stated earlier, customer experience programs are not fundamentally altruistic exercises, but a strategic means of improving competitiveness and profitability in the short and long term. Loyalty is essential to deriving long term profits from customers.Some of the earliest loyalty programs date back to the 1930s, when packaged goods companies offered embedded coupons for rewards to buyers, and eventually retail chains began offering reward programs to frequent shoppers. These programs continued for decades but were leapfrogged in the 1980s by more aggressive programs from the airlines.This movement was led by American Airlines, which launched the first full-scale loyalty marketing program of the modern era with the AAdvantage frequent flyer scheme. It was the first to reward frequent fliers with notional air miles that could be accumulated and later redeemed for free travel. Figure 1: Opportunities example of Customer loyalty driven profitOther airlines and travel providers were quick to grasp the incredible value of providing customers with an incentive to use their company exclusively. Within a few years, dozens of travel industry companies launched similar initiatives and now loyalty programs are achieving near-ubiquity in many service industries, especially those in which it is difficult to differentiate offerings by product attributes.The belief is that increased profitability will result from customer retention efforts because:•    The cost of acquisition occurs only at the beginning of a relationship: the longer the relationship, the lower the amortized cost;•    Account maintenance costs decline as a percentage of total costs, or as a percentage of revenue, over the lifetime of the relationship;•    Long term customers tend to be less inclined to switch and less price sensitive which can result in stable unit sales volume and increases in dollar-sales volume;•    Long term customers may initiate word-of-mouth promotions and referrals, which cost the company nothing and arguably are the most effective form of advertising;•    Long-term customers are more likely to buy ancillary products and higher margin supplemental products;•    Long term customers tend to be satisfied with their relationship with the company and are less likely to switch to competitors, making market entry or competitors gaining market share difficult;•    Regular customers tend to be less expensive to service, as they are familiar with the processes involved, require less 'education', and are consistent in their order placement;•    Increased customer retention and loyalty makes the employees' jobs easier and more satisfying. In turn, happy employees feed back into higher customer satisfaction in a virtuous circle. Figure 2: The virtuous circle of customer loyaltyFigure 2 represents a high-level example of a virtuous cycle driven by customer satisfaction and loyalty, depicting how superiority in product and service offerings, as well as strong customer support by competent employees, lead to higher sales and ultimately profitability. As stated above, this is not a new concept, but succeeding with it is difficult. It has eluded many a company driven to achieve profitability goals. Of course, for this circle to be virtuous, the customer relationship(s) must be profitable.Trying to maintain the loyalty of unprofitable customers is not a viable business strategy. It is, therefore, important that marketers can assess the profitability of each customer (or customer segment), and either improve or terminate relationships that are not profitable. This means each customer's 'relationship costs' must be understood and compared to their 'relationship revenue'. Customer lifetime value (CLV) is the most commonly used metric here, as it is generally accepted as a representation of exactly how much each customer is worth in monetary terms, and therefore a determinant of exactly how much a service provider should be willing to spend to acquire or retain that customer.CLV models make several simplifying assumptions and often involve the following inputs:•    Churn rate represents the percentage of customers who end their relationship with a company in a given period;•    Retention rate is calculated by subtracting the churn rate percentage from 100;•    Period/horizon equates to the units of time into which a customer relationship can be divided for analysis. A year is the most commonly used period for this purpose. Customer lifetime value is a multi-period calculation, often projecting three to seven years into the future. In practice, analysis beyond this point is viewed as too speculative to be reliable. The model horizon is the number of periods used in the calculation;•    Periodic revenue is the amount of revenue collected from a customer in a given period (though this is often extended across multiple periods into the future to understand lifetime value), such as usage revenue, revenues anticipated from cross and upselling, and often some weighting for referrals by a loyal customer to others; •    Retention cost describes the amount of money the service provider must spend, in a given period, to retain an existing customer. Again, this is often forecast across multiple periods. Retention costs include customer support, billing, promotional incentives and so on;•    Discount rate means the cost of capital used to discount future revenue from a customer. Discounting is an advanced method used in more sophisticated CLV calculations;•    Profit margin is the projected profit as a percentage of revenue for the period. This may be reflected as a percentage of gross or net profit. Again, this is generally projected across the model horizon to understand lifetime value.A strong focus on managing these inputs can help service providers realize stronger customer relationships and profits, but there are some obstacles to overcome in achieving accurate calculations of CLV, such as the complexity of allocating costs across the customer base. There are many costs that serve all customers which must be properly allocated across the base, and often a simple proportional allocation across the whole base or a segment may not accurately reflect the true cost of serving that customer;  This is made worse by the fragmentation of customer information, which is likely to be across a variety of product or operations groups, and may be difficult to aggregate due to different representations.In addition, there is the complexity of account relationships and structures to take into consideration. Complex account structures may not be understood or properly represented. For example, a profitable customer may have a separate account for a second home or another family member, which may appear to be unprofitable. If the service provider cannot relate the two accounts, CLV is not properly represented and any resultant cancellation of the apparently unprofitable account may result in the customer churning from the profitable one.In summary, if service providers are to realize strong customer relationships and their attendant profits, there must be a very strong focus on data management. This needs to be coupled with analytics that help business managers and those who work in customer-facing functions offer highly personalized solutions to customers, while maintaining profitability for the service provider. It's clear that acquiring new customers is expensive. Advertising costs, campaign management expenses, promotional service pricing and discounting, and equipment subsidies make a serious dent in a new customer's profitability. That is especially true given the rising subsidies for Smartphone users, which service providers hope will result in greater profits from profits from data services profitability in future.  The situation is made worse by falling prices and greater competition in mature markets.Customer acquisition through industry consolidation isn't cheap either. A North American service provider spent about $2,000 per subscriber in its acquisition of a smaller company earlier this year. While this has allowed it to leapfrog to become the largest mobile service provider in the country, it required a total investment of more than $28 billion (including assumption of the acquiree's debt).While many operating cost synergies clearly made this deal more attractive to the acquiring company, this is certainly an expensive way to acquire customers: the cost per subscriber in this case is not out of line with the prices others have paid for acquisitions.While growth by acquisition certainly increases overall revenues, it often creates tremendous challenges for profitability. Organic growth through increased customer loyalty and retention is a more effective driver of profit, as well as a stronger predictor of future profitability. Service providers, especially those in mature markets, are increasingly recognizing this and taking steps toward a creating a more personalized, flexible and satisfying experience for their customers.In summary, the clearest path to profitability for companies in virtually all industries is through customer retention and maximization of lifetime value. Service providers would do well to recognize this and focus attention on profitable customer relationships.

    Read the article

  • creating a color coded time chart using colorbar and colormaps in python

    - by Rusty
    I'm trying to make a time tracking chart based on a daily time tracking file that I used. I wrote code that crawls through my files and generates a few lists. endTimes is a list of times that a particular activity ends in minutes going from 0 at midnight the first day of the month to however many minutes are in a month. labels is a list of labels for the times listed in endTimes. It is one shorter than endtimes since the trackers don't have any data about before 0 minute. Most labels are repeats. categories contains every unique value of labels in order of how well I regard that time. I want to create a colorbar or a stack of colorbars (1 for eachday) that will depict how I spend my time for a month and put a color associated with each label. Each value in categories will have a color associated. More blue for more good. More red for more bad. It is already in order for the jet colormap to be right, but I need to get desecrate color values evenly spaced out for each value in categories. Then I figure the next step would be to convert that to a listed colormap to use for the colorbar based on how the labels associated with the categories. I think this is the right way to do it, but I am not sure. I am not sure how to associate the labels with color values. Here is the last part of my code so far. I found one function to make a discrete colormaps. It does, but it isn't what I am looking for and I am not sure what is happening. Thanks for the help! # now I need to develop the graph import numpy as np from matplotlib import pyplot,mpl import matplotlib from scipy import interpolate from scipy import * def contains(thelist,name): # checks if the current list of categories contains the one just read for val in thelist: if val == name: return True return False def getCategories(lastFile): ''' must determine the colors to use I would like to make a gradient so that the better the task, the closer to blue bad labels will recieve colors closer to blue read the last file given for the information on how I feel the order should be then just keep them in the order of how good they are in the tracker use a color range and develop discrete values for each category by evenly spacing them out any time not found should assume to be sleep sleep should be white ''' tracker = open(lastFile+'.txt') # open the last file # find all the categories categories = [] for line in tracker: pos = line.find(':') # does it have a : or a ? if pos==-1: pos=line.find('?') if pos != -1: # ignore if no : or ? name = line[0:pos].strip() # split at the : or ? if contains(categories,name)==False: # if the category is new categories.append(name) # make a new one return categories # find good values in order of last day newlabels=[] for val in getCategories(lastDay): if contains(labels,val): newlabels.append(val) categories=newlabels # convert discrete colormap to listed colormap python for ii,val in enumerate(labels): if contains(categories,val)==False: labels[ii]='sleep' # create a figure fig = pyplot.figure() axes = [] for x in range(endTimes[-1]%(24*60)): ax = fig.add_axes([0.05, 0.65, 0.9, 0.15]) axes.append(ax) # figure out the colors to use # stole this function to make a discrete colormap # http://www.scipy.org/Cookbook/Matplotlib/ColormapTransformations def cmap_discretize(cmap, N): """Return a discrete colormap from the continuous colormap cmap. cmap: colormap instance, eg. cm.jet. N: Number of colors. Example x = resize(arange(100), (5,100)) djet = cmap_discretize(cm.jet, 5) imshow(x, cmap=djet) """ cdict = cmap._segmentdata.copy() # N colors colors_i = np.linspace(0,1.,N) # N+1 indices indices = np.linspace(0,1.,N+1) for key in ('red','green','blue'): # Find the N colors D = np.array(cdict[key]) I = interpolate.interp1d(D[:,0], D[:,1]) colors = I(colors_i) # Place these colors at the correct indices. A = zeros((N+1,3), float) A[:,0] = indices A[1:,1] = colors A[:-1,2] = colors # Create a tuple for the dictionary. L = [] for l in A: L.append(tuple(l)) cdict[key] = tuple(L) # Return colormap object. return matplotlib.colors.LinearSegmentedColormap('colormap',cdict,1024) # jet colormap goes from blue to red (good to bad) cmap = cmap_discretize(mpl.cm.jet, len(categories)) cmap.set_over('0.25') cmap.set_under('0.75') #norm = mpl.colors.Normalize(endTimes,cmap.N) print endTimes print labels # make a color list by matching labels to a picture #norm = mpl.colors.ListedColormap(colorList) cb1 = mpl.colorbar.ColorbarBase(axes[0],cmap=cmap ,orientation='horizontal' ,boundaries=endTimes ,ticks=endTimes ,spacing='proportional') pyplot.show()

    Read the article

  • How to declare a(n) vector/array of reducer objects in Cilk++?

    - by Jin
    Hi All, I had a problem when I am using Cilk++, an extension to C++ for parallel computing. I found that I can't declare a vector of reducer objects: typedef cilk::reducer_opadd<int> T_reducer; vector<T_reducer> bitmiss_vec; for (int i = 0; i < 24; ++i) { T_reducer r; bitmiss_vec.push_back(r); } However, when I compile the code with Cilk++, it complains at the push_back() line: cilk++ geneAttack.cilk -O1 -g -lcilkutil -o geneAttack /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h: In member function ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Tp*, const _Tp&) [with _Tp = cilk::reducer_opadd<int>]’: /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:601: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:229: error: ‘cilk::reducer_opadd<Type>::reducer_opadd(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/ext/new_allocator.h:107: error: within this context /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h: In member function ‘void std::vector<_Tp, _Alloc>::_M_insert_aux(__gnu_cxx::__normal_iterator<typename std::_Vector_base<_Tp, _Alloc>::_Tp_alloc_type::pointer, std::vector<_Tp, _Alloc> >, const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’: /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:605: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:229: error: ‘cilk::reducer_opadd<Type>::reducer_opadd(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/vector.tcc:252: error: within this context /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:605: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:230: error: ‘cilk::reducer_opadd<Type>& cilk::reducer_opadd<Type>::operator=(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/vector.tcc:256: error: within this context /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h: In static member function ‘static _BI2 std::__copy_backward<_BoolType, std::random_access_iterator_tag>::__copy_b(_BI1, _BI1, _BI2) [with _BI1 = cilk::reducer_opadd<int>*, _BI2 = cilk::reducer_opadd<int>*, bool _BoolType = false]’: /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_algobase.h:465: instantiated from ‘_BI2 std::__copy_backward_aux(_BI1, _BI1, _BI2) [with _BI1 = cilk::reducer_opadd<int>*, _BI2 = cilk::reducer_opadd<int>*]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_algobase.h:474: instantiated from ‘static _BI2 std::__copy_backward_normal<<anonymous>, <anonymous> >::__copy_b_n(_BI1, _BI1, _BI2) [with _BI1 = cilk::reducer_opadd<int>*, _BI2 = cilk::reducer_opadd<int>*, bool <anonymous> = false, bool <anonymous> = false]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_algobase.h:540: instantiated from ‘_BI2 std::copy_backward(_BI1, _BI1, _BI2) [with _BI1 = cilk::reducer_opadd<int>*, _BI2 = cilk::reducer_opadd<int>*]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/vector.tcc:253: instantiated from ‘void std::vector<_Tp, _Alloc>::_M_insert_aux(__gnu_cxx::__normal_iterator<typename std::_Vector_base<_Tp, _Alloc>::_Tp_alloc_type::pointer, std::vector<_Tp, _Alloc> >, const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:605: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:230: error: ‘cilk::reducer_opadd<Type>& cilk::reducer_opadd<Type>::operator=(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_algobase.h:433: error: within this context /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h: In function ‘void std::_Construct(_T1*, const _T2&) [with _T1 = cilk::reducer_opadd<int>, _T2 = cilk::reducer_opadd<int>]’: /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_uninitialized.h:87: instantiated from ‘_ForwardIterator std::__uninitialized_copy_aux(_InputIterator, _InputIterator, _ForwardIterator, std::__false_type) [with _InputIterator = cilk::reducer_opadd<int>*, _ForwardIterator = cilk::reducer_opadd<int>*]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_uninitialized.h:114: instantiated from ‘_ForwardIterator std::uninitialized_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = cilk::reducer_opadd<int>*, _ForwardIterator = cilk::reducer_opadd<int>*]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_uninitialized.h:254: instantiated from ‘_ForwardIterator std::__uninitialized_copy_a(_InputIterator, _InputIterator, _ForwardIterator, std::allocator<_Tp>) [with _InputIterator = cilk::reducer_opadd<int>*, _ForwardIterator = cilk::reducer_opadd<int>*, _Tp = cilk::reducer_opadd<int>]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/vector.tcc:275: instantiated from ‘void std::vector<_Tp, _Alloc>::_M_insert_aux(__gnu_cxx::__normal_iterator<typename std::_Vector_base<_Tp, _Alloc>::_Tp_alloc_type::pointer, std::vector<_Tp, _Alloc> >, const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:605: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:229: error: ‘cilk::reducer_opadd<Type>::reducer_opadd(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_construct.h:81: error: within this context make: *** [geneAttack] Error 1 jinchen@galactica:~/workspace/biometrics/genAttack$ make cilk++ geneAttack.cilk -O1 -g -lcilkutil -o geneAttack geneAttack.cilk: In function ‘int cilk cilk_main(int, char**)’: geneAttack.cilk:670: error: expected primary-expression before ‘,’ token geneAttack.cilk:670: error: expected primary-expression before ‘}’ token geneAttack.cilk:674: error: ‘bitmiss_vec’ was not declared in this scope make: *** [geneAttack] Error 1 The Cilk++ manule says it supports array/vector of reducers, although there are performance issues to consider: "If you create a large number of reducers (for example, an array or vector of reducers) you must be aware that there is an overhead at steal and reduce that is proportional to the number of reducers in the program. " Anyone knows what is going on? How should I declare/use vector of reducers? Thank you

    Read the article

  • How to declare a vector or array of reducer objects in Cilk++?

    - by Jin
    Hi All, I had a problem when I am using Cilk++, an extension to C++ for parallel computing. I found that I can't declare a vector of reducer objects: typedef cilk::reducer_opadd<int> T_reducer; vector<T_reducer> bitmiss_vec; for (int i = 0; i < 24; ++i) { T_reducer r; bitmiss_vec.push_back(r); } However, when I compile the code with Cilk++, it complains at the push_back() line: cilk++ geneAttack.cilk -O1 -g -lcilkutil -o geneAttack /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h: In member function ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Tp*, const _Tp&) [with _Tp = cilk::reducer_opadd<int>]’: /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:601: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:229: error: ‘cilk::reducer_opadd<Type>::reducer_opadd(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/ext/new_allocator.h:107: error: within this context /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h: In member function ‘void std::vector<_Tp, _Alloc>::_M_insert_aux(__gnu_cxx::__normal_iterator<typename std::_Vector_base<_Tp, _Alloc>::_Tp_alloc_type::pointer, std::vector<_Tp, _Alloc> >, const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’: /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:605: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:229: error: ‘cilk::reducer_opadd<Type>::reducer_opadd(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/vector.tcc:252: error: within this context /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:605: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:230: error: ‘cilk::reducer_opadd<Type>& cilk::reducer_opadd<Type>::operator=(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/vector.tcc:256: error: within this context /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h: In static member function ‘static _BI2 std::__copy_backward<_BoolType, std::random_access_iterator_tag>::__copy_b(_BI1, _BI1, _BI2) [with _BI1 = cilk::reducer_opadd<int>*, _BI2 = cilk::reducer_opadd<int>*, bool _BoolType = false]’: /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_algobase.h:465: instantiated from ‘_BI2 std::__copy_backward_aux(_BI1, _BI1, _BI2) [with _BI1 = cilk::reducer_opadd<int>*, _BI2 = cilk::reducer_opadd<int>*]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_algobase.h:474: instantiated from ‘static _BI2 std::__copy_backward_normal<<anonymous>, <anonymous> >::__copy_b_n(_BI1, _BI1, _BI2) [with _BI1 = cilk::reducer_opadd<int>*, _BI2 = cilk::reducer_opadd<int>*, bool <anonymous> = false, bool <anonymous> = false]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_algobase.h:540: instantiated from ‘_BI2 std::copy_backward(_BI1, _BI1, _BI2) [with _BI1 = cilk::reducer_opadd<int>*, _BI2 = cilk::reducer_opadd<int>*]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/vector.tcc:253: instantiated from ‘void std::vector<_Tp, _Alloc>::_M_insert_aux(__gnu_cxx::__normal_iterator<typename std::_Vector_base<_Tp, _Alloc>::_Tp_alloc_type::pointer, std::vector<_Tp, _Alloc> >, const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:605: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:230: error: ‘cilk::reducer_opadd<Type>& cilk::reducer_opadd<Type>::operator=(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_algobase.h:433: error: within this context /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h: In function ‘void std::_Construct(_T1*, const _T2&) [with _T1 = cilk::reducer_opadd<int>, _T2 = cilk::reducer_opadd<int>]’: /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_uninitialized.h:87: instantiated from ‘_ForwardIterator std::__uninitialized_copy_aux(_InputIterator, _InputIterator, _ForwardIterator, std::__false_type) [with _InputIterator = cilk::reducer_opadd<int>*, _ForwardIterator = cilk::reducer_opadd<int>*]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_uninitialized.h:114: instantiated from ‘_ForwardIterator std::uninitialized_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = cilk::reducer_opadd<int>*, _ForwardIterator = cilk::reducer_opadd<int>*]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_uninitialized.h:254: instantiated from ‘_ForwardIterator std::__uninitialized_copy_a(_InputIterator, _InputIterator, _ForwardIterator, std::allocator<_Tp>) [with _InputIterator = cilk::reducer_opadd<int>*, _ForwardIterator = cilk::reducer_opadd<int>*, _Tp = cilk::reducer_opadd<int>]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/vector.tcc:275: instantiated from ‘void std::vector<_Tp, _Alloc>::_M_insert_aux(__gnu_cxx::__normal_iterator<typename std::_Vector_base<_Tp, _Alloc>::_Tp_alloc_type::pointer, std::vector<_Tp, _Alloc> >, const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_vector.h:605: instantiated from ‘void std::vector<_Tp, _Alloc>::push_back(const _Tp&) [with _Tp = cilk::reducer_opadd<int>, _Alloc = std::allocator<cilk::reducer_opadd<int> >]’ geneAttack.cilk:667: instantiated from here /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/cilk++/reducer_opadd.h:229: error: ‘cilk::reducer_opadd<Type>::reducer_opadd(const cilk::reducer_opadd<Type>&) [with Type = int]’ is private /usr/local/cilk/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/stl_construct.h:81: error: within this context make: *** [geneAttack] Error 1 jinchen@galactica:~/workspace/biometrics/genAttack$ make cilk++ geneAttack.cilk -O1 -g -lcilkutil -o geneAttack geneAttack.cilk: In function ‘int cilk cilk_main(int, char**)’: geneAttack.cilk:670: error: expected primary-expression before ‘,’ token geneAttack.cilk:670: error: expected primary-expression before ‘}’ token geneAttack.cilk:674: error: ‘bitmiss_vec’ was not declared in this scope make: *** [geneAttack] Error 1 The Cilk++ manule says it supports array/vector of reducers, although there are performance issues to consider: "If you create a large number of reducers (for example, an array or vector of reducers) you must be aware that there is an overhead at steal and reduce that is proportional to the number of reducers in the program. " Anyone knows what is going on? How should I declare/use vector of reducers? Thank you

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

< Previous Page | 1 2 3 4