Search Results

Search found 130 results on 6 pages for 'rig veda'.

Page 6/6 | < Previous Page | 2 3 4 5 6 

  • Load and Web Performance Testing using Visual Studio Ultimate 2010-Part 3

    - by Tarun Arora
    Welcome back once again, in Part 1 of Load and Web Performance Testing using Visual Studio 2010 I talked about why Performance Testing the application is important, the test tools available in Visual Studio Ultimate 2010 and various test rig topologies, in Part 2 of Load and Web Performance Testing using Visual Studio 2010 I discussed the details of web performance & load tests as well as why it’s important to follow a goal based pattern while performance testing your application. In part 3 I’ll be discussing Test Result Analysis, Test Result Drill through, Test Report Generation, Test Run Comparison, Asp.net Profiler and some closing thoughts. Test Results – I see some creepy worms! In Part 2 we put together a web performance test and a load test, lets run the test to see load test to see how the Web site responds to the load simulation. While the load test is running you will be able to see close to real time analysis in the Load Test Analyser window. You can use the Load Test Analyser to conduct load test analysis in three ways: Monitor a running load test - A condensed set of the performance counter data is maintained in memory. To prevent the results memory requirements from growing unbounded, up to 200 samples for each performance counter are maintained. This includes 100 evenly spaced samples that span the current elapsed time of the run and the most recent 100 samples.         After the load test run is completed - The test controller spools all collected performance counter data to a database while the test is running. Additional data, such as timing details and error details, is loaded into the database when the test completes. The performance data for a completed test is loaded from the database and analysed by the Load Test Analyser. Below you can see a screen shot of the summary view, this provides key results in a format that is compact and easy to read. You can also print the load test summary, this is generated after the test has completed or been stopped.         Analyse the load test results of a previously run load test – We’ll see this in the section where i discuss comparison between two test runs. The performance counters can be plotted on the graphs. You also have the option to highlight a selected part of the test and view details, drill down to the user activity chart where you can hover over to see more details of the test run.   Generate Report => Test Run Comparisons The level of reports you can generate using the Load Test Analyser is astonishing. You have the option to create excel reports and conduct side by side analysis of two test results or to track trend analysis. The tools also allows you to export the graph data either to MS Excel or to a CSV file. You can view the ASP.NET profiler report to conduct further analysis as well. View Data and Diagnostic Attachments opens the Choose Diagnostic Data Adapter Attachment dialog box to select an adapter to analyse the result type. For example, you can select an IntelliTrace adapter, click OK and open the IntelliTrace summary for the test agent that was used in the load test.   Compare results This creates a set of reports that compares the data from two load test results using tables and bar charts. I have taken these screen shots from the MSDN documentation, I would highly recommend exploring the wealth of knowledge available on MSDN. Leaving Thoughts While load testing the application with an excessive load for a longer duration of time, i managed to bring the IIS to its knees by piling up a huge queue of requests waiting to be processed. This clearly means that the IIS had run out of threads as all the threads were busy processing existing request, one easy way of fixing this is by increasing the default number of allocated threads, but this might escalate the problem. The better suggestion is to try and drill down to the actual root cause of the problem. When ever the garbage collection runs it stops processing any pages so all requests that come in during that period are queued up, but realistically the garbage collection completes in fraction of a a second. To understand this better lets look at the .net heap, it is divided into large heap and small heap, anything greater than 85kB in size will be allocated to the Large object heap, the Large object heap is non compacting and remember large objects are expensive to move around, so if you are allocating something in the large object heap, make sure that you really need it! The small object heap on the other hand is divided into generations, so all objects that are supposed to be short-lived are suppose to live in Gen-0 and the long living objects eventually move to Gen-2 as garbage collection goes through.  As you can see in the picture below all < 85 KB size objects are first assigned to Gen-0, when Gen-0 fills up and a new object comes in and finds Gen-0 full, the garbage collection process is started, the process checks for all the dead objects and assigns them as the valid candidate for deletion to free up memory and promotes all the remaining objects in Gen-0 to Gen-1. So in the future when ever you clean up Gen-1 you have to clean up Gen-0 as well. When you fill up Gen – 0 again, all of Gen – 1 dead objects are drenched and rest are moved to Gen-2 and Gen-0 objects are moved to Gen-1 to free up Gen-0, but by this time your Garbage collection process has started to take much more time than it usually takes. Now as I mentioned earlier when garbage collection is being run all page requests that come in during that period are queued up. Does this explain why possibly page requests are getting queued up, apart from this it could also be the case that you are waiting for a long running database process to complete.      Lets explore the heap a bit more… What is really a case of crisis is when the objects are living long enough to make it to Gen-2 and then dying, this is definitely a high cost operation. But sometimes you need objects in memory, for example when you cache data you hold on to the objects because you need to use them right across the user session, which is acceptable. But if you wanted to see what extreme caching can do to your server then write a simple application that chucks in a lot of data in cache, run a load test over it for about 10-15 minutes, forcing a lot of data in memory causing the heap to run out of memory. If you get to such a state where you start running out of memory the IIS as a mode of recovery restarts the worker process. It is great way to free up all your memory in the heap but this would clear the cache. The problem with this is if the customer had 10 items in their shopping basket and that data was stored in the application cache, the user basket will now be empty forcing them either to get frustrated and go to a competitor website or if the customer is really patient, give it another try! How can you address this, well two ways of addressing this; 1. Workaround – A x86 bit processor only allows a maximum of 4GB of RAM, this means the machine effectively has around 3.4 GB of RAM available, the OS needs about 1.5 GB of RAM to run efficiently, the IIS and .net framework also need their share of memory, leaving you a heap of around 800 MB to play with. Because Team builds by default build your application in ‘Compile as any mode’ it means the application is build such that it will run in x86 bit mode if run on a x86 bit processor and run in a x64 bit mode if run on a x64 but processor. The problem with this is not all applications are really x64 bit compatible specially if you are using com objects or external libraries. So, as a quick win if you compiled your application in x86 bit mode by changing the compile as any selection to compile as x86 in the team build, you will be able to run your application on a x64 bit machine in x86 bit mode (WOW – By running Windows on Windows) and what that means is, you could use 8GB+ worth of RAM, if you take away everything else your application will roughly get a heap size of at least 4 GB to play with, which is immense. If you need a heap size of more than 4 GB you have either build a software for NASA or there is something fundamentally wrong in your application. 2. Solution – Now that you have put a workaround in place the IIS will not restart the worker process that regularly, which means you can take a breather and start working to get to the root cause of this memory leak. But this begs a question “How do I Identify possible memory leaks in my application?” Well i won’t say that there is one single tool that can tell you where the memory leak is, but trust me, ‘Performance Profiling’ is a great start point, it definitely gets you started in the right direction, let’s have a look at how. Performance Wizard - Start the Performance Wizard and select Instrumentation, this lets you measure function call counts and timings. Before running the performance session right click the performance session settings and chose properties from the context menu to bring up the Performance session properties page and as shown in the screen shot below, check the check boxes in the group ‘.NET memory profiling collection’ namely ‘Collect .NET object allocation information’ and ‘Also collect the .NET Object lifetime information’.    Now if you fire off the profiling session on your pages you will notice that the results allows you to view ‘Object Lifetime’ which shows you the number of objects that made it to Gen-0, Gen-1, Gen-2, Large heap, etc. Another great feature about the profile is that if your application has > 5% cases where objects die right after making to the Gen-2 storage a threshold alert is generated to alert you. Since you have the option to also view the most expensive methods and by capturing the IntelliTrace data you can drill in to narrow down to the line of code that is the root cause of the problem. Well now that we have seen how crucial memory management is and how easy Visual Studio Ultimate 2010 makes it for us to identify and reproduce the problem with the best of breed tools in the product. Caching One of the main ways to improve performance is Caching. Which basically means you tell the web server that instead of going to the database for each request you keep the data in the webserver and when the user asks for it you serve it from the webserver itself. BUT that can have consequences! Let’s look at some code, trust me caching code is not very intuitive, I define a cache key for almost all searches made through the common search page and cache the results. The approach works fine, first time i get the data from the database and second time data is served from the cache, significant performance improvement, EXCEPT when two users try to do the same operation and run into each other. But it is easy to handle this by adding the lock as you can see in the snippet below. So, as long as a user comes in and finds that the cache is empty, the user locks and starts to get the cache no more concurrency issues. But lets say you are processing 10 requests per second, by the time i have locked the operation to get the results from the database, 9 other users came in and found that the cache key is null so after i have come out and populated the cache they will still go in to get the results again. The application will still be faster because the next set of 10 users and so on would continue to get data from the cache. BUT if we added another null check after locking to build the cache and before actual call to the db then the 9 users who follow me would not make the extra trip to the database at all and that would really increase the performance, but didn’t i say that the code won’t be very intuitive, may be you should leave a comment you don’t want another developer to come in and think what a fresher why is he checking for the cache key null twice !!! The downside of caching is, you are storing the data outside of the database and the data could be wrong because the updates applied to the database would make the data cached at the web server out of sync. So, how do you invalidate the cache? Well if you only had one way of updating the data lets say only one entry point to the data update you can write some logic to say that every time new data is entered set the cache object to null. But this approach will not work as soon as you have several ways of feeding data to the system or your system is scaled out across a farm of web servers. The perfect solution to this is Micro Caching which means you cache the query for a set time duration and invalidate the cache after that set duration. The advantage is every time the user queries for that data with in the time span for which you have cached the results there are no calls made to the database and the data is served right from the server which makes the response immensely quick. Now figuring out the appropriate time span for which you micro cache the query results really depends on the application. Lets say your website gets 10 requests per second, if you retain the cache results for even 1 minute you will have immense performance gains. You would reduce 90% hits to the database for searching. Ever wondered why when you go to e-bookers.com or xpedia.com or yatra.com to book a flight and you click on the book button because the fare seems too exciting and you get an error message telling you that the fare is not valid any more. Yes, exactly => That is a cache failure! These travel sites or price compare engines are not going to hit the database every time you hit the compare button instead the results will be served from the cache, because the query results are micro cached, its a perfect trade-off, by micro caching the results the site gains 100% performance benefits but every once in a while annoys a customer because the fare has expired. But the trade off works in the favour of these sites as they are still able to process up to 30+ page requests per second which means cater to the site traffic by may be losing 1 customer every once in a while to a competitor who is also using a similar caching technique what are the odds that the user will not come back to their site sooner or later? Recap   Resources Below are some Key resource you might like to review. I would highly recommend the documentation, walkthroughs and videos available on MSDN. You can always make use of Fiddler to debug Web Performance Tests. Some community test extensions and plug ins available on Codeplex might also be of interest to you. The Road Ahead Thank you for taking the time out and reading this blog post, you may also want to read Part I and Part II if you haven’t so far. If you enjoyed the post, remember to subscribe to http://feeds.feedburner.com/TarunArora. Questions/Feedback/Suggestions, etc please leave a comment. Next ‘Load Testing in the cloud’, I’ll be working on exploring the possibilities of running Test controller/Agents in the Cloud. See you on the other side! Thank You!   Share this post : CodeProject

    Read the article

  • Seeking on a Heap, and Two Useful DMVs

    - by Paul White
    So far in this mini-series on seeks and scans, we have seen that a simple ‘seek’ operation can be much more complex than it first appears.  A seek can contain one or more seek predicates – each of which can either identify at most one row in a unique index (a singleton lookup) or a range of values (a range scan).  When looking at a query plan, we will often need to look at the details of the seek operator in the Properties window to see how many operations it is performing, and what type of operation each one is.  As you saw in the first post in this series, the number of hidden seeking operations can have an appreciable impact on performance. Measuring Seeks and Scans I mentioned in my last post that there is no way to tell from a graphical query plan whether you are seeing a singleton lookup or a range scan.  You can work it out – if you happen to know that the index is defined as unique and the seek predicate is an equality comparison, but there’s no separate property that says ‘singleton lookup’ or ‘range scan’.  This is a shame, and if I had my way, the query plan would show different icons for range scans and singleton lookups – perhaps also indicating whether the operation was one or more of those operations underneath the covers. In light of all that, you might be wondering if there is another way to measure how many seeks of either type are occurring in your system, or for a particular query.  As is often the case, the answer is yes – we can use a couple of dynamic management views (DMVs): sys.dm_db_index_usage_stats and sys.dm_db_index_operational_stats. Index Usage Stats The index usage stats DMV contains counts of index operations from the perspective of the Query Executor (QE) – the SQL Server component that is responsible for executing the query plan.  It has three columns that are of particular interest to us: user_seeks – the number of times an Index Seek operator appears in an executed plan user_scans – the number of times a Table Scan or Index Scan operator appears in an executed plan user_lookups – the number of times an RID or Key Lookup operator appears in an executed plan An operator is counted once per execution (generating an estimated plan does not affect the totals), so an Index Seek that executes 10,000 times in a single plan execution adds 1 to the count of user seeks.  Even less intuitively, an operator is also counted once per execution even if it is not executed at all.  I will show you a demonstration of each of these things later in this post. Index Operational Stats The index operational stats DMV contains counts of index and table operations from the perspective of the Storage Engine (SE).  It contains a wealth of interesting information, but the two columns of interest to us right now are: range_scan_count – the number of range scans (including unrestricted full scans) on a heap or index structure singleton_lookup_count – the number of singleton lookups in a heap or index structure This DMV counts each SE operation, so 10,000 singleton lookups will add 10,000 to the singleton lookup count column, and a table scan that is executed 5 times will add 5 to the range scan count. The Test Rig To explore the behaviour of seeks and scans in detail, we will need to create a test environment.  The scripts presented here are best run on SQL Server 2008 Developer Edition, but the majority of the tests will work just fine on SQL Server 2005.  A couple of tests use partitioning, but these will be skipped if you are not running an Enterprise-equivalent SKU.  Ok, first up we need a database: USE master; GO IF DB_ID('ScansAndSeeks') IS NOT NULL DROP DATABASE ScansAndSeeks; GO CREATE DATABASE ScansAndSeeks; GO USE ScansAndSeeks; GO ALTER DATABASE ScansAndSeeks SET ALLOW_SNAPSHOT_ISOLATION OFF ; ALTER DATABASE ScansAndSeeks SET AUTO_CLOSE OFF, AUTO_SHRINK OFF, AUTO_CREATE_STATISTICS OFF, AUTO_UPDATE_STATISTICS OFF, PARAMETERIZATION SIMPLE, READ_COMMITTED_SNAPSHOT OFF, RESTRICTED_USER ; Notice that several database options are set in particular ways to ensure we get meaningful and reproducible results from the DMVs.  In particular, the options to auto-create and update statistics are disabled.  There are also three stored procedures, the first of which creates a test table (which may or may not be partitioned).  The table is pretty much the same one we used yesterday: The table has 100 rows, and both the key_col and data columns contain the same values – the integers from 1 to 100 inclusive.  The table is a heap, with a non-clustered primary key on key_col, and a non-clustered non-unique index on the data column.  The only reason I have used a heap here, rather than a clustered table, is so I can demonstrate a seek on a heap later on.  The table has an extra column (not shown because I am too lazy to update the diagram from yesterday) called padding – a CHAR(100) column that just contains 100 spaces in every row.  It’s just there to discourage SQL Server from choosing table scan over an index + RID lookup in one of the tests. The first stored procedure is called ResetTest: CREATE PROCEDURE dbo.ResetTest @Partitioned BIT = 'false' AS BEGIN SET NOCOUNT ON ; IF OBJECT_ID(N'dbo.Example', N'U') IS NOT NULL BEGIN DROP TABLE dbo.Example; END ; -- Test table is a heap -- Non-clustered primary key on 'key_col' CREATE TABLE dbo.Example ( key_col INTEGER NOT NULL, data INTEGER NOT NULL, padding CHAR(100) NOT NULL DEFAULT SPACE(100), CONSTRAINT [PK dbo.Example key_col] PRIMARY KEY NONCLUSTERED (key_col) ) ; IF @Partitioned = 'true' BEGIN -- Enterprise, Trial, or Developer -- required for partitioning tests IF SERVERPROPERTY('EngineEdition') = 3 BEGIN EXECUTE (' DROP TABLE dbo.Example ; IF EXISTS ( SELECT 1 FROM sys.partition_schemes WHERE name = N''PS'' ) DROP PARTITION SCHEME PS ; IF EXISTS ( SELECT 1 FROM sys.partition_functions WHERE name = N''PF'' ) DROP PARTITION FUNCTION PF ; CREATE PARTITION FUNCTION PF (INTEGER) AS RANGE RIGHT FOR VALUES (20, 40, 60, 80, 100) ; CREATE PARTITION SCHEME PS AS PARTITION PF ALL TO ([PRIMARY]) ; CREATE TABLE dbo.Example ( key_col INTEGER NOT NULL, data INTEGER NOT NULL, padding CHAR(100) NOT NULL DEFAULT SPACE(100), CONSTRAINT [PK dbo.Example key_col] PRIMARY KEY NONCLUSTERED (key_col) ) ON PS (key_col); '); END ELSE BEGIN RAISERROR('Invalid SKU for partition test', 16, 1); RETURN; END; END ; -- Non-unique non-clustered index on the 'data' column CREATE NONCLUSTERED INDEX [IX dbo.Example data] ON dbo.Example (data) ; -- Add 100 rows INSERT dbo.Example WITH (TABLOCKX) ( key_col, data ) SELECT key_col = V.number, data = V.number FROM master.dbo.spt_values AS V WHERE V.[type] = N'P' AND V.number BETWEEN 1 AND 100 ; END; GO The second stored procedure, ShowStats, displays information from the Index Usage Stats and Index Operational Stats DMVs: CREATE PROCEDURE dbo.ShowStats @Partitioned BIT = 'false' AS BEGIN -- Index Usage Stats DMV (QE) SELECT index_name = ISNULL(I.name, I.type_desc), scans = IUS.user_scans, seeks = IUS.user_seeks, lookups = IUS.user_lookups FROM sys.dm_db_index_usage_stats AS IUS JOIN sys.indexes AS I ON I.object_id = IUS.object_id AND I.index_id = IUS.index_id WHERE IUS.database_id = DB_ID(N'ScansAndSeeks') AND IUS.object_id = OBJECT_ID(N'dbo.Example', N'U') ORDER BY I.index_id ; -- Index Operational Stats DMV (SE) IF @Partitioned = 'true' SELECT index_name = ISNULL(I.name, I.type_desc), partitions = COUNT(IOS.partition_number), range_scans = SUM(IOS.range_scan_count), single_lookups = SUM(IOS.singleton_lookup_count) FROM sys.dm_db_index_operational_stats ( DB_ID(N'ScansAndSeeks'), OBJECT_ID(N'dbo.Example', N'U'), NULL, NULL ) AS IOS JOIN sys.indexes AS I ON I.object_id = IOS.object_id AND I.index_id = IOS.index_id GROUP BY I.index_id, -- Key I.name, I.type_desc ORDER BY I.index_id; ELSE SELECT index_name = ISNULL(I.name, I.type_desc), range_scans = SUM(IOS.range_scan_count), single_lookups = SUM(IOS.singleton_lookup_count) FROM sys.dm_db_index_operational_stats ( DB_ID(N'ScansAndSeeks'), OBJECT_ID(N'dbo.Example', N'U'), NULL, NULL ) AS IOS JOIN sys.indexes AS I ON I.object_id = IOS.object_id AND I.index_id = IOS.index_id GROUP BY I.index_id, -- Key I.name, I.type_desc ORDER BY I.index_id; END; The final stored procedure, RunTest, executes a query written against the example table: CREATE PROCEDURE dbo.RunTest @SQL VARCHAR(8000), @Partitioned BIT = 'false' AS BEGIN -- No execution plan yet SET STATISTICS XML OFF ; -- Reset the test environment EXECUTE dbo.ResetTest @Partitioned ; -- Previous call will throw an error if a partitioned -- test was requested, but SKU does not support it IF @@ERROR = 0 BEGIN -- IO statistics and plan on SET STATISTICS XML, IO ON ; -- Test statement EXECUTE (@SQL) ; -- Plan and IO statistics off SET STATISTICS XML, IO OFF ; EXECUTE dbo.ShowStats @Partitioned; END; END; The Tests The first test is a simple scan of the heap table: EXECUTE dbo.RunTest @SQL = 'SELECT * FROM Example'; The top result set comes from the Index Usage Stats DMV, so it is the Query Executor’s (QE) view.  The lower result is from Index Operational Stats, which shows statistics derived from the actions taken by the Storage Engine (SE).  We see that QE performed 1 scan operation on the heap, and SE performed a single range scan.  Let’s try a single-value equality seek on a unique index next: EXECUTE dbo.RunTest @SQL = 'SELECT key_col FROM Example WHERE key_col = 32'; This time we see a single seek on the non-clustered primary key from QE, and one singleton lookup on the same index by the SE.  Now for a single-value seek on the non-unique non-clustered index: EXECUTE dbo.RunTest @SQL = 'SELECT data FROM Example WHERE data = 32'; QE shows a single seek on the non-clustered non-unique index, but SE shows a single range scan on that index – not the singleton lookup we saw in the previous test.  That makes sense because we know that only a single-value seek into a unique index is a singleton seek.  A single-value seek into a non-unique index might retrieve any number of rows, if you think about it.  The next query is equivalent to the IN list example seen in the first post in this series, but it is written using OR (just for variety, you understand): EXECUTE dbo.RunTest @SQL = 'SELECT data FROM Example WHERE data = 32 OR data = 33'; The plan looks the same, and there’s no difference in the stats recorded by QE, but the SE shows two range scans.  Again, these are range scans because we are looking for two values in the data column, which is covered by a non-unique index.  I’ve added a snippet from the Properties window to show that the query plan does show two seek predicates, not just one.  Now let’s rewrite the query using BETWEEN: EXECUTE dbo.RunTest @SQL = 'SELECT data FROM Example WHERE data BETWEEN 32 AND 33'; Notice the seek operator only has one predicate now – it’s just a single range scan from 32 to 33 in the index – as the SE output shows.  For the next test, we will look up four values in the key_col column: EXECUTE dbo.RunTest @SQL = 'SELECT key_col FROM Example WHERE key_col IN (2,4,6,8)'; Just a single seek on the PK from the Query Executor, but four singleton lookups reported by the Storage Engine – and four seek predicates in the Properties window.  On to a more complex example: EXECUTE dbo.RunTest @SQL = 'SELECT * FROM Example WITH (INDEX([PK dbo.Example key_col])) WHERE key_col BETWEEN 1 AND 8'; This time we are forcing use of the non-clustered primary key to return eight rows.  The index is not covering for this query, so the query plan includes an RID lookup into the heap to fetch the data and padding columns.  The QE reports a seek on the PK and a lookup on the heap.  The SE reports a single range scan on the PK (to find key_col values between 1 and 8), and eight singleton lookups on the heap.  Remember that a bookmark lookup (RID or Key) is a seek to a single value in a ‘unique index’ – it finds a row in the heap or cluster from a unique RID or clustering key – so that’s why lookups are always singleton lookups, not range scans. Our next example shows what happens when a query plan operator is not executed at all: EXECUTE dbo.RunTest @SQL = 'SELECT key_col FROM Example WHERE key_col = 8 AND @@TRANCOUNT < 0'; The Filter has a start-up predicate which is always false (if your @@TRANCOUNT is less than zero, call CSS immediately).  The index seek is never executed, but QE still records a single seek against the PK because the operator appears once in an executed plan.  The SE output shows no activity at all.  This next example is 2008 and above only, I’m afraid: EXECUTE dbo.RunTest @SQL = 'SELECT * FROM Example WHERE key_col BETWEEN 1 AND 30', @Partitioned = 'true'; This is the first example to use a partitioned table.  QE reports a single seek on the heap (yes – a seek on a heap), and the SE reports two range scans on the heap.  SQL Server knows (from the partitioning definition) that it only needs to look at partitions 1 and 2 to find all the rows where key_col is between 1 and 30 – the engine seeks to find the two partitions, and performs a range scan seek on each partition. The final example for today is another seek on a heap – try to work out the output of the query before running it! EXECUTE dbo.RunTest @SQL = 'SELECT TOP (2) WITH TIES * FROM Example WHERE key_col BETWEEN 1 AND 50 ORDER BY $PARTITION.PF(key_col) DESC', @Partitioned = 'true'; Notice the lack of an explicit Sort operator in the query plan to enforce the ORDER BY clause, and the backward range scan. © 2011 Paul White email: [email protected] twitter: @SQL_Kiwi

    Read the article

  • Need help unformatting text

    - by Axilus
    I am currently programming a Visual C# service to receive emails from various sources then I take certain info and organize it in a database using Regex to retrieve the deferent cell values (such as header, body, problem, cost, etc.etc.). My problem is that I am currently using a Hotmail account to email the service which the service then extracts data and writes it to a csv file; however this is all going fine an dandy except for the fact that the text is formated so when there is a "\n" or something of the sort, the program decides to not input the data that follows that into the cell. For instance, if I emailed this: Cost:$1000.00 Body: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed vulputate mattis dolor, a dapibus turpis lacinia mollis. Fusce in enim nulla, sit amet gravida dolor. Suspendisse at nisi velit, vel ornare odio. Integer metus justo, imperdiet et pellentesque in, facilisis dignissim metus. Suspendisse potenti. Vivamus purus nisl, hendrerit sit amet rutrum eu, euismod in felis. Maecenas blandit, metus ac eleifend vulputate, nibh ligula mollis mi, non malesuada nunc tellus ac risus. In at rutrum elit. Proin metus sem, ullamcorper ut rhoncus sed, semper nec tellus. Maecenas adipiscing nisl nec elit egestas vel bibendum justo vehicula. Aliquam erat volutpat. Nullam fermentum enim in magna consequat a lacinia felis iaculis. Ut odio justo, consectetur nec cursus eu, dignissim non sapien. Duis tincidunt fringilla aliquet. Vivamus elementum lobortis massa vel posuere. Aenean non congue odio. Aenean aliquam elit volutpat tortor tempor pharetra. Mauris non est eu orci ultricies lacinia. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Ut vitae orci lectus, sit amet convallis nunc. Vivamus feugiat ante at justo auctor at pretium ante congue. In hac habitasse platea dictumst. Sed at feugiat odio. The body cell would look as follows: <span class=3D"ecxecxApple-style-s= pan" style=3D"font-family:Arial=2C Helvetica=2C sans=3Bfont-size:11px"><p s= tyle=3D"text-align:justify=3Bfont-size:11px=3Bline-height:14px=3Bmargin-rig= ht:0px=3Bmargin-bottom:14px=3Bmargin-left:0px=3Bpadding-top:0px=3Bpadding-r= ight:0px=3Bpadding-bottom:0px=3Bpadding-left:0px">Lorem ipsum dolor sit ame= t=2C consectetur adipiscing elit. Praesent in augue nec justo tempor varius= eu et tellus. Nunc id massa tortor=2C ut lobortis sem. Class aptent taciti= sociosqu ad litora torquent per conubia nostra=2C per inceptos himenaeos. = Maecenas quis nisl nec quam tristique posuere sed at nibh. Cras fringilla v= estibulum metus vel porttitor. Cras iaculis=2C erat nec gravida accumsan=2C= metus felis vestibulum risus=2C quis venenatis nisl nulla sed diam. Aenean= quis viverra velit. Etiam quis massa lectus=2C faucibus facilisis sem. Cur= abitur non eros tellus. Sed at ligula neque. Donec elementum rhoncus volutp= at. Curabitur eu accumsan erat. Phasellus auctor odio dolor=2C ut ornare au= gue. Suspendisse vel est nibh. Vivamus facilisis placerat augue sit amet al= iquam. Maecenas viverra=2C ipsum a tincidunt elementum=2C arcu tellus rutru= m ipsum=2C et dignissim urna orci ac mi. Vivamus non odio massa. Nulla cong= ue massa eu leo pretium non consequat urna molestie.</p><p style=3D"text-al= ign:justify=3Bfont-size:11px=3Bline-height:14px=3Bmargin-right:0px=3Bmargin= -bottom:14px=3Bmargin-left:0px=3Bpadding-top:0px=3Bpadding-right:0px=3Bpadd= ing-bottom:0px=3Bpadding-left:0px">Integer neque odio=2C scelerisque at mol= estie quis=2C congue sed arcu. Praesent a arcu odio. Donec sollicitudin=2C = quam vel tincidunt lobortis=2C urna augue cursus lorem=2C in eleifend nunc = risus nec neque. Donec euismod mauris non nibh blandit sollicitudin. Vivamu= s sed tincidunt augue. Suspendisse iaculis massa ut tellus rutrum auctor. C= ras venenatis consequat urna in viverra. Ut blandit imperdiet dolor non sce= lerisque. Suspendisse potenti. Sed vitae lacus ac odio euismod tempus. Aene= an ut sem odio. Curabitur auctor purus a diam iaculis facilisis. Integer mo= lestie commodo mauris a imperdiet. Nunc aliquet tempus orci sit amet viverr= a.</p><p style=3D"text-align:justify=3Bfont-size:11px=3Bline-height:14px=3B= margin-right:0px=3Bmargin-bottom:14px=3Bmargin-left:0px=3Bpadding-top:0px= =3Bpadding-right:0px=3Bpadding-bottom:0px=3Bpadding-left:0px">Morbi ultrici= es fermentum magna=2C et ultricies urna convallis non. Aenean nibh felis=2C= faucibus et pellentesque ultrices=2C accumsan a ligula. Aliquam vulputate = nisi vitae mi pretium et pretium nulla aliquet. Nam egestas diam vel elit c= ommodo fermentum. Aenean venenatis bibendum tellus=2C eget scelerisque risu= s consequat ut. In porta interdum eleifend. Cras laoreet venenatis pulvinar= .. Praesent ultricies tristique lorem=2C quis interdum arcu scelerisque nec.= Quisque arcu tellus=2C consectetur vel mattis nec=2C feugiat ac quam. Prae= sent sit amet fermentum nulla. Nulla lobortis=2C odio vitae elementum aucto= r=2C libero turpis condimentum mi=2C sed aliquet felis sapien nec tortor. I= nteger vehicula=2C neque in egestas accumsan=2C felis metus sagittis nulla= =2C eu dapibus ligula ipsum ut sapien. Nulla quis urna tortor=2C sed facili= sis leo. In at metus sed velit venenatis varius. Fusce aliquam mattis enim= =2C vitae tincidunt sem cursus in.</p><p style=3D"text-align:justify=3Bfont= -size:11px=3Bline-height:14px=3Bmargin-right:0px=3Bmargin-bottom:14px=3Bmar= gin-left:0px=3Bpadding-top:0px=3Bpadding-right:0px=3Bpadding-bottom:0px=3Bp= adding-left:0px">Proin tincidunt ligula at ligula bibendum vitae condimentu= m nunc congue. Curabitur ac magna nibh=2C vel accumsan nisl. Duis nec eros = et purus vestibulum tincidunt at sit amet libero. Donec eu nibh eros. Pelle= ntesque habitant morbi tristique senectus et netus et malesuada fames ac tu= rpis egestas. Donec accumsan=2C tellus at luctus faucibus=2C est nibh sempe= r diam=2C vitae adipiscing lorem tellus vel nulla. Donec eget ipsum ut lore= m tristique ultricies. Aliquam sem diam=2C semper sit amet volutpat pretium= =2C lobortis et eros. Sed vel iaculis metus. Phasellus malesuada elementum = porta.</p><p style=3D"text-align:justify=3Bfont-size:11px=3Bline-height:14p= x=3Bmargin-right:0px=3Bmargin-bottom:14px=3Bmargin-left:0px=3Bpadding-top:0= px=3Bpadding-right:0px=3Bpadding-bottom:0px=3Bpadding-left:0px">Fusce tinci= dunt dignissim massa quis dapibus. Sed aliquet consequat orci=2C eu cursus = libero dapibus vitae. Pellentesque at felis felis=2C vitae condimentum libe= ro. Vivamus eros erat=2C elementum et tristique vitae=2C mattis et neque. P= raesent bibendum leo ac tortor congue id mollis libero ornare. Pellentesque= adipiscing accumsan mi=2C a bibendum purus dignissim id. Cum sociis natoqu= e penatibus et magnis dis parturient montes=2C nascetur ridiculus mus. Morb= i mollis nisi in nibh cursus facilisis. Ut eu quam dolor=2C sit amet congue= orci. Aliquam quam dolor=2C viverra vitae varius sed=2C molestie et quam. = Suspendisse purus mauris=2C fermentum condimentum pharetra at=2C molestie a= nunc. Nam rhoncus euismod venenatis. Nam pellentesque quam ac ipsum volutp= at a eleifend odio imperdiet. Class aptent taciti sociosqu ad litora torque= nt per conubia nostra=2C per inceptos himenaeos. Nulla in nunc magna. Lorem= ipsum dolor sit amet=2C consectetur adipiscing elit. Donec pretium tincidu= nt gravida.</p></span> As you can tell I need a way to get rid of all that html junk and make it readable again. Is there anyway to do this with Regex? Or an easier way if possible. Cheers

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • Retrieving saved checkboxes' name and values from database

    - by sermed
    I have a form with checkboxes, each one has a value. When the registered user select any checkbox the value is incremented (the summation) and then then registred user save his selection of checkbox if he satisfied with the result of summation into database all this work fine ...i want to enable the registred user to view his selection history by retriving and displaying the checkboxes he selected in a page with thier values ... How I can do that? I'm just able to save the selected checkboxes as choice 1, choice 2, for example .. I want to view the selected checkboxes that is saved in database as the appear in the page when the user first select them: for example if the registred user selects these 3 options LEAD DEEP KEEL (1825) FULLY BATTENED MAINSAIL (558) TEAK SIDE DECKS (2889) They will be saved as for example (choice1, choice2, choice3). But if he want to view selected checkboxes the appear exactly as first he selects them: LEAD DEEP KEEL (1825) FULLY BATTENED MAINSAIL (558) TEAK SIDE DECKS (2889) This is my user table: $query="CREATE TABLE User( user_id varchar(20), password varchar(40), user_type varchar(20), firstname varchar(30), lastname varchar(30), street varchar(50), city varchar(50), county varchar(50), post_code varchar(10), country varchar(50), gender varchar(6), dob varchar(15), tel_no varchar(50), vals varchar(50), email varchar(50))"; and the code to inser the options selected to database <?php include("databaseconnection.php"); $str = ''; foreach($_POST as $key => $val) if (strpos($key,'choice') !== false) $str .= $key.','; $query = "INSERT INTO User (vals) VALUES('$str')"; $result=mysql_query($query,$conn); if ($result) { (mysql_error(); } else { echo " done"; } ?> And this is my form: function checkTotal() { document.listForm.total.value = ''; var sum = 0; for (i=0;i <form name="listForm" method="post" action="insert_options.php" > <TABLE cellPadding=3 width=600 border=0> <TBODY> <TR> <TH align=left width="87%" bgColor=#b0b3b4><SPAN class=whiteText>Item</SPAN></TH> <TH align=right width="13%" bgColor=#b0b3b4><SPAN class=whiteText>Select</SPAN></TH></TR> <TR> <TD bgcolor="#9da8af"colSpan=2><SPAN class=normalText><B>General</B></SPAN></TD></TR> <TR> <TD bgcolor="#c4c8ca"><SPAN class=normalText >TEAK SIDE DECKS (2889)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="2889" type="checkbox" onchange="checkTotal()" /></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>LEAD DEEP KEEL (1825)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="1825" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>FULLY BATTENED MAINSAIL (558)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="558" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>HIGH TECH SAILS FOR CONVENTIONAL RIG (1979)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="1979" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>IN MAST REEFING WITH HIGH TECH SAILS (2539)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="2539" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>SPlNNAKER GEAR (POLE LINES DECK FITTINGS) (820)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="820" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>SPINNAKER POLE VERTICAL STOWAGE SYSTEM (214)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="214" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>GAS ROD KICKER (208)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="208" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>SIDE RAIL OPENINGS (BOTH SIDES) (392)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="392" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>SPRING CLEATS MIDSHIPS -ALUMIMIUM (148)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="148" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>ELECTRIC ANCHOR WINDLASS (1189)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="1189" type="checkbox" onchange="checkTotal()"> </TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>ANCHOR CHAIN GALVANISED (50m) (202)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="202" type="checkbox" onchange="checkTotal()"> </TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>ANCHOR CHAIN GALVANISED (50m) (1141)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="1141" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgcolor="#9da8af"colSpan=2><SPAN class=normalText><B>NAVIGATION & ELECTRONICS</B></SPAN></TD></TR> <TR> <TD bgcolor="#c4c8ca"><SPAN class=normalText >WIND VANE (STAINLESS STEEL)(41)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="41" type="checkbox" onchange="checkTotal()" /></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>RAYMARINE ST6O LOG & DEPTH (SEPARATE UNITS)(226)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="226" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgcolor="#9da8af"colSpan=2><SPAN class=normalText><B>ENGINES & ELECTRICS</B></SPAN></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>SHORE SUPPLY (220V) WITH 3 OUTLETS (EXCLUDJNG SHORE CABLE) (327)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="327" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgColor=#c4c8ca><SPAN class=normalText>3rd BATTERY(14OA/H)(196)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="196" type="checkbox" onchange="checkTotal()"></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>24 AMP BATTERY CHARGER (475)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="475" type="checkbox" onchange="checkTotal()"></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>2 BLADED FOLDING PROPELLER (UPGRADE)(299)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="299" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgcolor="#9da8af"colSpan=2><SPAN class=normalText><B>BELOW DECKS/DOMESTIC</B></SPAN></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>WARM WATER (FROM ENGINE & 220V)(749)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="749" type="checkbox" onchange="checkTotal()"></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>SHOWER IN AFT HEADS WITH PUMPOUT(446)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="446" type="checkbox" onchange="checkTotal()"></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>DECK SUCTION DISPOSAL FOR HOLDINGTANK(166)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="166" type="checkbox" onchange="checkTotal()"></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>REFRIGERATED COOLBOX (12V)(666)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="666" type="checkbox" onchange="checkTotal()"></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>LFS SAFETY PACKAGE (COCKPIT HARNESS POINTS STAINLESS STEEL JACKSTAYS)(208)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="208" type="checkbox" onchange="checkTotal()"></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>UPHOLSTERY UPGRADE IN SALOON (SUEDETYPE)(701)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="701" type="checkbox" onchange="checkTotal()"></TD></TR> <TR> <TD bgcolor="#9da8af"colSpan=2><SPAN class=normalText><B>NAVIGATION ELECTRONICS & ELECTRICS</B></SPAN></TD></TR> <TD bgColor=#c4c8ca><SPAN class=normalText>VHF RADIO AERIAL CABLED TO NAVIGATION AREA(178)</SPAN></TD> <TD align=right bgColor=#c4c8ca><input name="choice" value="178" type="checkbox" onchange="checkTotal()"></TD></TR> </table>

    Read the article

< Previous Page | 2 3 4 5 6