Search Results

Search found 35507 results on 1421 pages for 'performance test'.

Page 127/1421 | < Previous Page | 123 124 125 126 127 128 129 130 131 132 133 134 | Next Page >

Does Test Driven Development (TDD) improve Quality and Correctness? (Part 1)

- by David V. Corbin

Since the dawn of the computer age, various methodologies have been introduced to improve quality and reduce cost. In this posting, I will by sharing my experiences with Test Driven Development; both its benefits and limitations. To start this topic, we need to agree on what TDD is. The first is to define each of the three words as used in this context. Test - An item or action which measures something in some quantifiable form. Driven - The primary motivation or focus of a series of activities (process) Development - All phases of a software project/product from concept through delivery. The above are very simple definitions that result in the following: "TDD is a process where the primary focus is on measuring and quantifying all aspects of the creation of a (software) product." There are many places where TDD is used outside of software development, even though it is not known by this name. Consider the (conventional) education process that most of us grew up on. The focus was to get the best grades as measured by different tests. Many of these tests measured rote memorization and not understanding of the subject matter. The result of this that many people graduated with high scores but without "quality and correctness" in their ability to utilize the subject matter (of course, the flip side is true where certain people DID understand the material but were not very good at taking this type of test). Returning to software development, let us look at some common scenarios. While these items are generally applicable regardless of platform, language and tools; the remainder of this post will utilize Microsoft Visual Studio and Team Foundation Server (TFS) for examples. It should be realized that everyone does at least some aspect of TDD. At the most rudimentary level, getting a program to compile involves a "pass/fail" measurement (is the syntax valid) that drives their ability to proceed further (run the program). Other developers may create "Unit Tests" in the belief that having a test for every method/property of a class and good code coverage is the goal of TDD. These items may be helpful and even important, but really only address a small aspect of the overall effort. To see TDD in a bigger view, lets identify the various activities that are part of the Software Development LifeCycle. These are going to be presented in a Waterfall style for simplicity, but each item also occurs within Iterative methodologies such as Agile/Scrum. the key ones here are: Requirements Gathering Architecture Design Implementation Quality Assurance Can each of these items be subjected to a process which establishes metrics (quantified metrics) that reflect both the quality and correctness of each item? It should be clear that conventional Unit Tests do not apply to all of these items; at best they can verify that a local aspect (e.g. a Class/Method) of implementation matches the (test writers perspective of) the appropriate design document. So what can we do? For each of area, the goal is to create tests that are quantifiable and durable. The ability to quantify the measurements (beyond a simple pass/fail) is critical to tracking progress(eventually measuring the level of success that has been achieved) and for providing clear information on what items need to be addressed (along with the appropriate time to address them - in varying levels of detail) . Durability is important so that the test can be reapplied (ideally in an automated fashion) over the entire cycle. Returning for a moment back to our "education example", one must also be careful of how the tests are organized and how the measurements are taken. If a test is in a multiple choice format, there is a significant statistical probability that a correct answer might be the result of a random guess. Also, in many situations, having the student simply provide a final answer can obscure many important elements. For example, on a math test, having the student simply provide a numeric answer (rather than showing the methodology) may result in a complete mismatch between the process and the result. It is hard to determine which is worse: The student who makes a simple arithmetric error at one step of a long process (resulting in a wrong answer) or The student who (without providing the "workflow") uses a completely invalid approach, yet still comes up with the right number. The "Wrong Process"/"Right Answer" is probably the single biggest problem in software development. Even very simple items can suffer from this. As an example consider the following code for a "straight line" calculation....Is it correct? (for Integral Points) int Solve(int m, int b, int x) { return m * x + b; } Most people would respond "Yes". But let's take the question one step further... Is it correct for all possible values of m,b,x??? (no fair if you cheated by being focused on the bolded text!) Without additional information regarding constrains on "the possible values of m,b,x" the answer must be NO, there is the risk of overflow/wraparound that will produce an incorrect result! To properly answer this question (i.e. Test the Code), one MUST be able to backtrack from the implementation through the design, and architecture all the way back to the requirements. And the requirement itself must be tested against the stakeholder(s). It is only when the bounding conditions are defined that it is possible to determine if the code is "Correct" and has "Quality". Yet, how many of us (myself included) have written such code without even thinking about it. In many canses we (think we) "know" what the bounds are, and that the code will be correct. As we all know, requirements change, "code reuse" causes implementations to be applied to different scenarios, etc. This leads directly to the types of system failures that plague so many projects. This approach to TDD is much more holistic than ones which start by focusing on the details. The fundamental concepts still apply: Each item should be tested. The test should be defined/implemented before (or concurrent with) the definition/implementation of the actual item. We also add concepts that expand the scope and alter the style by recognizing: There are many things beside "lines of code" that benefit from testing (measuring/evaluating in a formal way) Correctness and Quality can not be solely measured by "correct results" In the future parts, we will examine in greater detail some of the techniques that can be applied to each of these areas....

Read the article
Various problems with software raid1 array built with Samsung 840 Pro SSDs

- by Andy B

I am bringing to ServerFault a problem that is tormenting me for 6+ months. I have a CentOS 6 (64bit) server with an md software raid-1 array with 2 x Samsung 840 Pro SSDs (512GB). Problems: Serious write speed problems: root [~]# time dd if=arch.tar.gz of=test4 bs=2M oflag=sync 146+1 records in 146+1 records out 307191761 bytes (307 MB) copied, 23.6788 s, 13.0 MB/s real 0m23.680s user 0m0.000s sys 0m0.932s When doing the above (or any other larger copy) the load spikes to unbelievable values (even over 100) going up from ~ 1. When doing the above I've also noticed very weird iostat results: Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 1589.50 0.00 54.00 0.00 13148.00 243.48 0.60 11.17 0.46 2.50 sdb 0.00 1627.50 0.00 16.50 0.00 9524.00 577.21 144.25 1439.33 60.61 100.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 1602.00 0.00 12816.00 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 And it keeps it this way until it actually writes the file to the device (out from swap/cache/memory). The problem is that the second SSD in the array has svctm and await roughly 100 times larger than the second. For some reason the wear is different between the 2 members of the array root [~]# smartctl --attributes /dev/sda | grep -i wear 177 Wear_Leveling_Count 0x0013 094% 094 000 Pre-fail Always - 180 root [~]# smartctl --attributes /dev/sdb | grep -i wear 177 Wear_Leveling_Count 0x0013 070% 070 000 Pre-fail Always - 1005 The first SSD has a wear of 6% while the second SSD has a wear of 30%!! It's like the second SSD in the array works at least 5 times as hard as the first one as proven by the first iteration of iostat (the averages since reboot): Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 10.44 51.06 790.39 125.41 8803.98 1633.11 11.40 0.33 0.37 0.06 5.64 sdb 9.53 58.35 322.37 118.11 4835.59 1633.11 14.69 0.33 0.76 0.29 12.97 md1 0.00 0.00 1.88 1.33 15.07 10.68 8.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 1109.02 173.12 10881.59 1620.39 9.75 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.41 0.01 3.10 0.02 7.42 0.00 0.00 0.00 0.00 What I've tried: I've updated the firmware to DXM05B0Q (following reports of dramatic improvements for 840Ps after this update). I have looked for "hard resetting link" in dmesg to check for cable/backplane issues but nothing. I have checked the alignment and I believe they are aligned correctly (1MB boundary, listing below) I have checked /proc/mdstat and the array is Optimal (second listing below). root [~]# fdisk -ul /dev/sda Disk /dev/sda: 512.1 GB, 512110190592 bytes 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x00026d59 Device Boot Start End Blocks Id System /dev/sda1 2048 4196351 2097152 fd Linux raid autodetect Partition 1 does not end on cylinder boundary. /dev/sda2 * 4196352 4605951 204800 fd Linux raid autodetect Partition 2 does not end on cylinder boundary. /dev/sda3 4605952 814106623 404750336 fd Linux raid autodetect root [~]# fdisk -ul /dev/sdb Disk /dev/sdb: 512.1 GB, 512110190592 bytes 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x0003dede Device Boot Start End Blocks Id System /dev/sdb1 2048 4196351 2097152 fd Linux raid autodetect Partition 1 does not end on cylinder boundary. /dev/sdb2 * 4196352 4605951 204800 fd Linux raid autodetect Partition 2 does not end on cylinder boundary. /dev/sdb3 4605952 814106623 404750336 fd Linux raid autodetect /proc/mdstat root # cat /proc/mdstat Personalities : [raid1] md0 : active raid1 sdb2[1] sda2[0] 204736 blocks super 1.0 [2/2] [UU] md2 : active raid1 sdb3[1] sda3[0] 404750144 blocks super 1.0 [2/2] [UU] md1 : active raid1 sdb1[1] sda1[0] 2096064 blocks super 1.1 [2/2] [UU] unused devices: Running a read test with hdparm root [~]# hdparm -t /dev/sda /dev/sda: Timing buffered disk reads: 664 MB in 3.00 seconds = 221.33 MB/sec root [~]# hdparm -t /dev/sdb /dev/sdb: Timing buffered disk reads: 288 MB in 3.01 seconds = 95.77 MB/sec But look what happens if I add --direct root [~]# hdparm --direct -t /dev/sda /dev/sda: Timing O_DIRECT disk reads: 788 MB in 3.01 seconds = 262.08 MB/sec root [~]# hdparm --direct -t /dev/sdb /dev/sdb: Timing O_DIRECT disk reads: 534 MB in 3.02 seconds = 176.90 MB/sec Both tests increase but /dev/sdb doubles while /dev/sda increases maybe 20%. I just don't know what to make of this. As suggested by Mr. Wagner I've done another read test with dd this time and it confirms the hdparm test: root [/home2]# dd if=/dev/sda of=/dev/null bs=1G count=10 10+0 records in 10+0 records out 10737418240 bytes (11 GB) copied, 38.0855 s, 282 MB/s root [/home2]# dd if=/dev/sdb of=/dev/null bs=1G count=10 10+0 records in 10+0 records out 10737418240 bytes (11 GB) copied, 115.24 s, 93.2 MB/s So sda is 3 times faster than sdb. Or maybe sdb is doing also something else besides what sda does. Is there some way to find out if sdb is doing more than what sda does? UPDATE Again, as suggested by Mr. Wagner, I have swapped the 2 SSDs. And as he thought it would happen, the problem moved from sdb to sda. So I guess I'll RMA one of the SSDs. I wonder if the cage might be problematic. What is wrong with this array? Please help!

Read the article
Why would Linux VM in vSphere ESXi 5.5 show dramatically increased disk i/o latency?

- by mhucka

I'm stumped and I hope someone else will recognize the symptoms of this problem. Hardware: new Dell T110 II, dual-core Pentium G860 2.9 GHz, onboard SATA controller, one new 500 GB 7200 RPM cabled hard drive inside the box, other drives inside but not mounted yet. No RAID. Software: fresh CentOS 6.5 virtual machine under VMware ESXi 5.5.0 (build 174 + vSphere Client). 2.5 GB RAM allocated. The disk is how CentOS offered to set it up, namely as a volume inside an LVM Volume Group, except that I skipped having a separate /home and simply have / and /boot. CentOS is patched up, ESXi patched up, latest VMware tools installed in the VM. No users on the system, no services running, no files on the disk but the OS installation. I'm interacting with the VM via the VM virtual console in vSphere Client. Before going further, I wanted to check that I configured things more or less reasonably. I ran the following command as root in a shell on the VM: for i in 1 2 3 4 5 6 7 8 9 10; do dd if=/dev/zero of=/test.img bs=8k count=256k conv=fdatasync done I.e., just repeat the dd command 10 times, which results in printing the transfer rate each time. The results are disturbing. It starts off well: 262144+0 records in 262144+0 records out 2147483648 bytes (2.1 GB) copied, 20.451 s, 105 MB/s 262144+0 records in 262144+0 records out 2147483648 bytes (2.1 GB) copied, 20.4202 s, 105 MB/s ... but after 7-8 of these, it then prints 262144+0 records in 262144+0 records out 2147483648 bytes (2.1 GG) copied, 82.9779 s, 25.9 MB/s 262144+0 records in 262144+0 records out 2147483648 bytes (2.1 GB) copied, 84.0396 s, 25.6 MB/s 262144+0 records in 262144+0 records out 2147483648 bytes (2.1 GB) copied, 103.42 s, 20.8 MB/s If I wait a significant amount of time, say 30-45 minutes, and run it again, it again goes back to 105 MB/s, and after several rounds (sometimes a few, sometimes 10+), it drops to ~20-25 MB/s again. Plotting the disk latency in vSphere's interface, it shows periods of high disk latency hitting 1.2-1.5 seconds during the times that dd reports the low throughput. (And yes, things get pretty unresponsive while that's happening.) What could be causing this? I'm comfortable that it is not due to the disk failing, because I also had configured two other disks as an additional volume in the same system. At first I thought I did something wrong with that volume, but after commenting the volume out from /etc/fstab and rebooting, and trying the tests on / as shown above, it became clear that the problem is elsewhere. It is probably an ESXi configuration problem, but I'm not very experienced with ESXi. It's probably something stupid, but after trying to figure this out for many hours over multiple days, I can't find the problem, so I hope someone can point me in the right direction. (P.S.: yes, I know this hardware combo won't win any speed awards as a server, and I have reasons for using this low-end hardware and running a single VM, but I think that's besides the point for this question [unless it's actually a hardware problem].) ADDENDUM #1: Reading other answers such as this one made me try adding oflag=direct to dd. However, it makes no difference in the pattern of results: initially the numbers are higher for many rounds, then they drop to 20-25 MB/s. (The initial absolute numbers are in the 50 MB/s range.) ADDENDUM #2: Adding sync ; echo 3 > /proc/sys/vm/drop_caches into the loop does not make a difference at all. ADDENDUM #3: To take out further variables, I now run dd such that the file it creates is larger than the amount of RAM on the system. The new command is dd if=/dev/zero of=/test.img bs=16k count=256k conv=fdatasync oflag=direct. Initial throughput numbers with this version of the command are ~50 MB/s. They drop to 20-25 MB/s when things go south. ADDENDUM #4: Here is the output of iostat -d -m -x 1 running in another terminal window while performance is "good" and then again when it's "bad". (While this is going on, I'm running dd if=/dev/zero of=/test.img bs=16k count=256k conv=fdatasync oflag=direct.) First, when things are "good", it shows this: When things go "bad", iostat -d -m -x 1 shows this:

Read the article
optimizing iPhone OpenGL ES fill rate

- by NateS

I have an Open GL ES game on the iPhone. My framerate is pretty sucky, ~20fps. Using the Xcode OpenGL ES performance tool on an iPhone 3G, it shows: Renderer Utilization: 95% to 99% Tiler Utilization: ~27% I am drawing a lot of pretty large images with a lot of blending. If I reduce the number of images drawn, framerates go from ~20 to ~40, though the performance tool results stay about the same (renderer still maxed). I think I'm being limited by the fill rate of the iPhone 3G, but I'm not sure. My questions are: How can I determine with more granularity where the bottleneck is? That is my biggest problem, I just don't know what is taking all the time. If it is fillrate, is there anything I do to improve it besides just drawing less? I am using texture atlases. I have tried to minimize image binds, though it isn't always possible (drawing order, not everything fits on one 1024x1024 texture, etc). Every frame I do 10 image binds. This seem pretty reasonable, but I could be mistaken. I'm using vertex arrays and glDrawArrays. I don't really have a lot of geometry. I can try to be more precise if needed. Each image is 2 triangles and I try to batch things were possible, though often (maybe half the time) images are drawn with individual glDrawArrays calls. Besides the images, I have ~60 triangles worth of geometry being rendered in ~6 glDrawArrays calls. I often glTranslate before calling glDrawArrays. Would it improve the framerate to switch to VBOs? I don't think it is a huge amount of geometry, but maybe it is faster for other reasons? Are there certain things to watch out for that could reduce performance? Eg, should I avoid glTranslate, glColor4g, etc? I'm using glScissor in a 3 places per frame. Each use consists of 2 glScissor calls, one to set it up, and one to reset it to what it was. I don't know if there is much of a performance impact here. If I used PVRTC would it be able to render faster? Currently all my images are GL_RGBA. I don't have memory issues. Here is a rough idea of what I'm drawing, in this order: 1) Switch to perspective matrix. 2) Draw a full screen background image 3) Draw a full screen image with translucency (this one has a scrolling texture). 4) Draw a few sprites. 5) Switch to ortho matrix. 6) Draw a few sprites. 7) Switch to perspective matrix. 8) Draw sprites and some other textured geometry. 9) Switch to ortho matrix. 10) Draw a few sprites (eg, game HUD). Steps 1-6 draw a bunch of background stuff. 8 draws most of the game content. 10 draws the HUD. As you can see, there are many layers, some of them full screen and some of the sprites are pretty large (1/4 of the screen). The layers use translucency, so I have to draw them in back-to-front order. This is further complicated by needing to draw various layers in ortho and others in perspective. I will gladly provide additional information if reqested. Thanks in advance for any performance tips or general advice on my problem!

Read the article
CUDA: Memory copy to GPU 1 is slower in multi-GPU

- by zenna

My company has a setup of two GTX 295, so a total of 4 GPUs in a server, and we have several servers. We GPU 1 specifically was slow, in comparison to GPU 0, 2 and 3 so I wrote a little speed test to help find the cause of the problem. //#include <stdio.h> //#include <stdlib.h> //#include <cuda_runtime.h> #include <iostream> #include <fstream> #include <sstream> #include <string> #include <cutil.h> __global__ void test_kernel(float *d_data) { int tid = blockDim.x*blockIdx.x + threadIdx.x; for (int i=0;i<10000;++i) { d_data[tid] = float(i*2.2); d_data[tid] += 3.3; } } int main(int argc, char* argv[]) { int deviceCount; cudaGetDeviceCount(&deviceCount); int device = 0; //SELECT GPU HERE cudaSetDevice(device); cudaEvent_t start, stop; unsigned int num_vals = 200000000; float *h_data = new float[num_vals]; for (int i=0;i<num_vals;++i) { h_data[i] = float(i); } float *d_data = NULL; float malloc_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice); cudaMalloc((void**)&d_data, sizeof(float)*num_vals); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &malloc_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); float mem_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &mem_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); float kernel_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); test_kernel<<<1000,256>>>(d_data); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &kernel_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); printf("cudaMalloc took %f ms\n",malloc_timer); printf("Copy to the GPU took %f ms\n",mem_timer); printf("Test Kernel took %f ms\n",kernel_timer); cudaMemcpy(h_data,d_data, sizeof(float)*num_vals,cudaMemcpyDeviceToHost); delete[] h_data; return 0; } The results are GPU0 cudaMalloc took 0.908640 ms Copy to the GPU took 296.058777 ms Test Kernel took 326.721283 ms GPU1 cudaMalloc took 0.913568 ms Copy to the GPU took[b] 663.182251 ms[/b] Test Kernel took 326.710785 ms GPU2 cudaMalloc took 0.925600 ms Copy to the GPU took 296.915039 ms Test Kernel took 327.127930 ms GPU3 cudaMalloc took 0.920416 ms Copy to the GPU took 296.968384 ms Test Kernel took 327.038696 ms As you can see, the cudaMemcpy to the GPU is well double the amount of time for GPU1. This is consistent between all our servers, it is always GPU1 that is slow. Any ideas why this may be? All servers are running windows XP.

Read the article
Python bindings for a vala library

- by celil

I am trying to create python bindings to a vala library using the following IBM tutorial as a reference. My initial directory has the following two files: test.vala using GLib; namespace Test { public class Test : Object { public int sum(int x, int y) { return x + y; } } } test.override %% headers #include <Python.h> #include "pygobject.h" #include "test.h" %% modulename test %% import gobject.GObject as PyGObject_Type %% ignore-glob *_get_type %% and try to build the python module source test_wrap.c using the following code build.sh #/usr/bin/env bash valac test.vala -CH test.h python /usr/share/pygobject/2.0/codegen/h2def.py test.h > test.defs pygobject-codegen-2.0 -o test.override -p test test.defs > test_wrap.c However, the last command fails with an error $ ./build.sh Traceback (most recent call last): File "/usr/share/pygobject/2.0/codegen/codegen.py", line 1720, in <module> sys.exit(main(sys.argv)) File "/usr/share/pygobject/2.0/codegen/codegen.py", line 1672, in main o = override.Overrides(arg) File "/usr/share/pygobject/2.0/codegen/override.py", line 52, in __init__ self.handle_file(filename) File "/usr/share/pygobject/2.0/codegen/override.py", line 84, in handle_file self.__parse_override(buf, startline, filename) File "/usr/share/pygobject/2.0/codegen/override.py", line 96, in __parse_override command = words[0] IndexError: list index out of range Is this a bug in pygobject, or is something wrong with my setup? What is the best way to call code written in vala from python? EDIT: Removing the extra line fixed the current problem, but now as I proceed to build the python module, I am facing another problem. Adding the following C file to the existing two in the directory: test_module.c #include <Python.h> void test_register_classes (PyObject *d); extern PyMethodDef test_functions[]; DL_EXPORT(void) inittest(void) { PyObject *m, *d; init_pygobject(); m = Py_InitModule("test", test_functions); d = PyModule_GetDict(m); test_register_classes(d); if (PyErr_Occurred ()) { Py_FatalError ("can't initialise module test"); } } and building with the following script build.sh #/usr/bin/env bash valac test.vala -CH test.h python /usr/share/pygobject/2.0/codegen/h2def.py test.h > test.defs pygobject-codegen-2.0 -o test.override -p test test.defs > test_wrap.c CFLAGS="`pkg-config --cflags pygobject-2.0` -I/usr/include/python2.6/ -I." LDFLAGS="`pkg-config --libs pygobject-2.0`" gcc $CFLAGS -fPIC -c test.c gcc $CFLAGS -fPIC -c test_wrap.c gcc $CFLAGS -fPIC -c test_module.c gcc $LDFLAGS -shared test.o test_wrap.o test_module.o -o test.so python -c 'import test; exit()' results in an error: $ ./build.sh ***INFO*** The coverage of global functions is 100.00% (1/1) ***INFO*** The coverage of methods is 100.00% (1/1) ***INFO*** There are no declared virtual proxies. ***INFO*** There are no declared virtual accessors. ***INFO*** There are no declared interface proxies. Traceback (most recent call last): File "<string>", line 1, in <module> ImportError: ./test.so: undefined symbol: init_pygobject Where is the init_pygobject symbol defined? What have I missed linking to?

Read the article
udp expected behaviour not responding to test result

- by ernst

I have a local network topology that is structured as follows: three hosts and a switch in the middle. I am using a switch that supports 10,100,1000 Mbit/s full/half duplex connection. I have configured the hosts with a static ip 172.16.0.1-2-3/25. This is the output of ifconfig eth0 Link encap: Ethernet HWaddr ***** inet addr:172.16.0.3 Bcast:172.16.0.127 Mask:255.255.255.128 UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) Interrupt:16 The output on H1 and H2 is perfectly matchable They are mutually reachable since i have tested the network with ping. I have forced the ethernet interface to work at 10M with ethtool -s eth0 speed 10 duplex full autoneg on this is the output of ethtool eth0 supported ports: [ TP ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full S upported pause frame use: No Supports auto-negotiation: Yes Advertised link modes: 10baseT/Full Advertised pause frame use: Symmetric A dvertised auto-negotiation: Yes Speed: 10Mb/s Duplex: Full Port: Twisted Pair PHYAD: 1 Transceiver: internal Auto-negotiation: on MDI-X: Unknown Supports Wake-on: g Wake-on: d Current message level: 0x000000ff (255) drv probe link timer ifdown ifup rx_err tx_err Link detected: yes – I am doing an experimental test using nttcp to calculate the GOODPUT in the case that H1 and H2 at the same time send data to H3. Since the three links have the same forced capability and the amount of arrving data speed is 10 from H1+10 from H2--20M to H3 it would be expected a bottleneck effect and, due to the non reliable nature of udp, a packet loss. But this doesn't appen since the output of nttcp application shows the same number of byte sended and received. this is the output of nttcp on h3 nttcp -T -r -u 172.16.0.2 & nttcp -T -r -u 172.16.0.1 [1] 4071 Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s l 8388608 13.74 0.05 4.8848 1398.0140 2049 149.14 42684.8 Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s l 8388608 14.02 0.05 4.7872 1398.0140 2049 146.17 42684.8 1 8388608 13.56 0.06 4.9500 1118.4065 2051 151.28 34181.1 1 8388608 13.89 0.06 4.8310 1198.3084 2051 147.65 36623.0 – How is this possible? Am i missing something? Any help will be gratefully apprecciated, Best regards

Read the article
Performance of Cluster Shared Volume file copy from SAN

- by Sequenzia

I am hoping someone can help me out with a strange issue. We are running a Microsoft Failover Cluster with Server 2008 R2 and an Equallogic PS4000 SAN. Our main configuration has 2 Dell Poweredge T710 Servers in the cluster. We have CSV and Quorm setup. The servers each have 10 Broadcom 1Gb NICs. Right now 4 of the NICS are on the iSCSI network for accessing the SAN. They use MPIO and the Dell HIT pack. We have 5 VMs running on each node and everything runs smooth. No noticeable performance issues or anything. From the SAN I can see the 4 iSCSI connections from each server to each volume (CSV and Quorm). Again, it seems to perform great. The problem I am running into is with backups. I have tried a few backup programs like backupchain and Veeam. The problem is both of them are very very slow to backup the VMs. For instance I have a 500GB (fixed disc) VHD that’s running on the cluster. It takes over 18 hours to backup that VHD and that’s with compression and depuping turned off which is supposed to be the fasted. We also have a separate server that is just for backups. It has a lot of directed attached storage. As part of the troubleshooting I decided to bring that server into the cluster as a node. It now has access to the CSV and can read from C:\clusterstorage\volume1 which is where our VHDs live. This backup server only has 2 NICs. 1 NIC is going to the iSCSI network and the other is just on the main network. It has Intel NICS in it without any sort of MPIO or teaming. So with the 3rd server now in the cluster I started doing some benchmarking. I have a test VHD that’s about 7GBs that’s stored in the CSV. I have tested file copying that VHD from all 3 servers to directed attached storage in the respective server. The 2 Dell servers that are the main nodes in the cluster (they house the VMs) are reading that file at about 20Mbs/Sec. Which at that rate is way to slow for the backups. The other server which only has 1 NIC to the SAN is reading at around 100Mbs/Sec. I spent a few hours on the phone with Dell today about this . We went through all kind of tests and he was pretty dumb founded. He really has no idea why that server with only 1 NIC is reading about 5 times as fast as the servers with 4 NICS and MPIO. We looked at the network utilization of the NICs while the file copy was going on. The servers with the 4 NICs had a small increase of activity during the file copy but they only went up to around 8-10% on all 4 NICs. The other server with the 1 NIC jumped up to over 80% during the file copy. I plan on doing some more testing after hours and calling Dell back tomorrow but I really am confused (and so is Dell’s support rep) why I cannot get faster file copy access to the CSV on those servers. Anyone have any input on this? Any feedback would be greatly appreciated. Thanks in advance.

Read the article
Improving SAS multipath to JBOD performance on Linux

- by user36825

Hello all I'm trying to optimize a storage setup on some Sun hardware with Linux. Any thoughts would be greatly appreciated. We have the following hardware: Sun Blade X6270 2* LSISAS1068E SAS controllers 2* Sun J4400 JBODs with 1 TB disks (24 disks per JBOD) Fedora Core 12 2.6.33 release kernel from FC13 (also tried with latest 2.6.31 kernel from FC12, same results) Here's the datasheet for the SAS hardware: http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf It's using PCI Express 1.0a, 8x lanes. With a bandwidth of 250 MB/sec per lane, we should be able to do 2000 MB/sec per SAS controller. Each controller can do 3 Gb/sec per port and has two 4 port PHYs. We connect both PHYs from a controller to a JBOD. So between the JBOD and the controller we have 2 PHYs * 4 SAS ports * 3 Gb/sec = 24 Gb/sec of bandwidth, which is more than the PCI Express bandwidth. With write caching enabled and when doing big writes, each disk can sustain about 80 MB/sec (near the start of the disk). With 24 disks, that means we should be able to do 1920 MB/sec per JBOD. multipath { rr_min_io 100 uid 0 path_grouping_policy multibus failback manual path_selector "round-robin 0" rr_weight priorities alias somealias no_path_retry queue mode 0644 gid 0 wwid somewwid } I tried values of 50, 100, 1000 for rr_min_io, but it doesn't seem to make much difference. Along with varying rr_min_io I tried adding some delay between starting the dd's to prevent all of them writing over the same PHY at the same time, but this didn't make any difference, so I think the I/O's are getting properly spread out. According to /proc/interrupts, the SAS controllers are using a "IR-IO-APIC-fasteoi" interrupt scheme. For some reason only core #0 in the machine is handling these interrupts. I can improve performance slightly by assigning a separate core to handle the interrupts for each SAS controller: echo 2 /proc/irq/24/smp_affinity echo 4 /proc/irq/26/smp_affinity Using dd to write to the disk generates "Function call interrupts" (no idea what these are), which are handled by core #4, so I keep other processes off this core too. I run 48 dd's (one for each disk), assigning them to cores not dealing with interrupts like so: taskset -c somecore dd if=/dev/zero of=/dev/mapper/mpathx oflag=direct bs=128M oflag=direct prevents any kind of buffer cache from getting involved. None of my cores seem maxed out. The cores dealing with interrupts are mostly idle and all the other cores are waiting on I/O as one would expect. Cpu0 : 0.0%us, 1.0%sy, 0.0%ni, 91.2%id, 7.5%wa, 0.0%hi, 0.2%si, 0.0%st Cpu1 : 0.0%us, 0.8%sy, 0.0%ni, 93.0%id, 0.2%wa, 0.0%hi, 6.0%si, 0.0%st Cpu2 : 0.0%us, 0.6%sy, 0.0%ni, 94.4%id, 0.1%wa, 0.0%hi, 4.8%si, 0.0%st Cpu3 : 0.0%us, 7.5%sy, 0.0%ni, 36.3%id, 56.1%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.0%us, 1.3%sy, 0.0%ni, 85.7%id, 4.9%wa, 0.0%hi, 8.1%si, 0.0%st Cpu5 : 0.1%us, 5.5%sy, 0.0%ni, 36.2%id, 58.3%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 5.0%sy, 0.0%ni, 36.3%id, 58.7%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 0.0%us, 5.1%sy, 0.0%ni, 36.3%id, 58.5%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 0.1%us, 8.3%sy, 0.0%ni, 27.2%id, 64.4%wa, 0.0%hi, 0.0%si, 0.0%st Cpu9 : 0.1%us, 7.9%sy, 0.0%ni, 36.2%id, 55.8%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 0.0%us, 7.8%sy, 0.0%ni, 36.2%id, 56.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 0.0%us, 7.3%sy, 0.0%ni, 36.3%id, 56.4%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 0.0%us, 5.6%sy, 0.0%ni, 33.1%id, 61.2%wa, 0.0%hi, 0.0%si, 0.0%st Cpu13 : 0.1%us, 5.3%sy, 0.0%ni, 36.1%id, 58.5%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 0.0%us, 4.9%sy, 0.0%ni, 36.4%id, 58.7%wa, 0.0%hi, 0.0%si, 0.0%st Cpu15 : 0.1%us, 5.4%sy, 0.0%ni, 36.5%id, 58.1%wa, 0.0%hi, 0.0%si, 0.0%st Given all this, the throughput reported by running "dstat 10" is in the range of 2200-2300 MB/sec. Given the math above I would expect something in the range of 2*1920 ~= 3600+ MB/sec. Does anybody have any idea where my missing bandwidth went? Thanks!

Read the article
Accessing local variable doesn't improve performance

- by NicMagnier

The short version Why is this code: var index = (Math.floor(y / scale) * img.width + Math.floor(x / scale)) * 4; More performant than this one? var index = Math.floor(ref_index) * 4; The long version This week, the author of Impact js published an article about some rendering issue: http://www.phoboslab.org/log/2012/09/drawing-pixels-is-hard In the article there was the source of a function to scale an image by accessing pixels in the canvas. I wanted to suggest some traditional ways to optimize this kind of code so that the scaling would be shorter at loading time. But after testing it my result was most of the time worst that the original function. Guessing this was the JavaScript engine that was doing some smart optimization I tried to understand a bit more what was going on so I did a bunch of test. But my results are quite confusing and I would need some help to understand what's going on. I have a test page here: http://www.mx981.com/stuff/resize_bench/test.html jsPerf: http://jsperf.com/local-variable-due-to-the-scope-lookup To start the test, click the picture and the results will appear in the console. There are three different versions: The original code: for( var y = 0; y < heightScaled; y++ ) { for( var x = 0; x < widthScaled; x++ ) { var index = (Math.floor(y / scale) * img.width + Math.floor(x / scale)) * 4; var indexScaled = (y * widthScaled + x) * 4; scaledPixels.data[ indexScaled ] = origPixels.data[ index ]; scaledPixels.data[ indexScaled+1 ] = origPixels.data[ index+1 ]; scaledPixels.data[ indexScaled+2 ] = origPixels.data[ index+2 ]; scaledPixels.data[ indexScaled+3 ] = origPixels.data[ index+3 ]; } } jsPerf: http://jsperf.com/so-accessing-local-variable-doesn-t-improve-performance One of my attempt to optimize it: var ref_index = 0; var ref_indexScaled = 0 var ref_step = 1 / scale; for( var y = 0; y < heightScaled; y++ ) { for( var x = 0; x < widthScaled; x++ ) { var index = Math.floor(ref_index) * 4; scaledPixels.data[ ref_indexScaled++ ] = origPixels.data[ index ]; scaledPixels.data[ ref_indexScaled++ ] = origPixels.data[ index+1 ]; scaledPixels.data[ ref_indexScaled++ ] = origPixels.data[ index+2 ]; scaledPixels.data[ ref_indexScaled++ ] = origPixels.data[ index+3 ]; ref_index+= ref_step; } } jsPerf: http://jsperf.com/so-accessing-local-variable-doesn-t-improve-performance The same optimized code but with recalculating the index variable each time (Hybrid) var ref_index = 0; var ref_indexScaled = 0 var ref_step = 1 / scale; for( var y = 0; y < heightScaled; y++ ) { for( var x = 0; x < widthScaled; x++ ) { var index = (Math.floor(y / scale) * img.width + Math.floor(x / scale)) * 4; scaledPixels.data[ ref_indexScaled++ ] = origPixels.data[ index ]; scaledPixels.data[ ref_indexScaled++ ] = origPixels.data[ index+1 ]; scaledPixels.data[ ref_indexScaled++ ] = origPixels.data[ index+2 ]; scaledPixels.data[ ref_indexScaled++ ] = origPixels.data[ index+3 ]; ref_index+= ref_step; } } jsPerf: http://jsperf.com/so-accessing-local-variable-doesn-t-improve-performance The only difference in the two last one is the calculation of the 'index' variable. And to my surprise the optimized version is slower in most browsers (except opera). Results of personal testing (not the jsPerf tests): Opera Original: 8668ms Optimized: 932ms Hybrid: 8696ms Chrome Original: 139ms Optimized: 145ms Hybrid: 136ms Safari Original: 433ms Optimized: 853ms Hybrid: 451ms Firefox Original: 343ms Optimized: 422ms Hybrid: 350ms After digging around, it seems an usual good practice is to access mainly local variable due to the scope lookup. Because The optimized version only call one local variable it should be faster that the Hybrid code which call multiple variable and object in addition to the various operation involved. So why the "optimized" version is slower? I thought that it might be because some JavaScript engine don't optimize the Optimized version because it is not hot enough but after using --trace-opt in chrome, it seems all version are properly compiled by V8. At this point I am a bit clueless and wonder if somebody would know what is going on? I did also some more test cases in this page: http://www.mx981.com/stuff/resize_bench/index.html

Read the article
MySQL Memory usage

- by Rob Stevenson-Leggett

Our MySQL server seems to be using a lot of memory. I've tried looking for slow queries and queries with no index and have halved the peak CPU usage and Apache memory usage but the MySQL memory stays constantly at 2.2GB (~51% of available memory on the server). Here's the graph from Plesk. Running top in the SSH window shows the same figures. Does anyone have any ideas on why the memory usage is constant like this and not peaks and troughs with usage of the app? Here's the output of the MySQL Tuning Primer script: -- MYSQL PERFORMANCE TUNING PRIMER -- - By: Matthew Montgomery - MySQL Version 5.0.77-log x86_64 Uptime = 1 days 14 hrs 4 min 21 sec Avg. qps = 22 Total Questions = 3059456 Threads Connected = 13 Warning: Server has not been running for at least 48hrs. It may not be safe to use these recommendations To find out more information on how each of these runtime variables effects performance visit: http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html Visit http://www.mysql.com/products/enterprise/advisors.html for info about MySQL's Enterprise Monitoring and Advisory Service SLOW QUERIES The slow query log is enabled. Current long_query_time = 1 sec. You have 6 out of 3059477 that take longer than 1 sec. to complete Your long_query_time seems to be fine BINARY UPDATE LOG The binary update log is NOT enabled. You will not be able to do point in time recovery See http://dev.mysql.com/doc/refman/5.0/en/point-in-time-recovery.html WORKER THREADS Current thread_cache_size = 0 Current threads_cached = 0 Current threads_per_sec = 2 Historic threads_per_sec = 0 Threads created per/sec are overrunning threads cached You should raise thread_cache_size MAX CONNECTIONS Current max_connections = 100 Current threads_connected = 14 Historic max_used_connections = 20 The number of used connections is 20% of the configured maximum. Your max_connections variable seems to be fine. INNODB STATUS Current InnoDB index space = 6 M Current InnoDB data space = 18 M Current InnoDB buffer pool free = 0 % Current innodb_buffer_pool_size = 8 M Depending on how much space your innodb indexes take up it may be safe to increase this value to up to 2 / 3 of total system memory MEMORY USAGE Max Memory Ever Allocated : 2.07 G Configured Max Per-thread Buffers : 274 M Configured Max Global Buffers : 2.01 G Configured Max Memory Limit : 2.28 G Physical Memory : 3.84 G Max memory limit seem to be within acceptable norms KEY BUFFER Current MyISAM index space = 4 M Current key_buffer_size = 7 M Key cache miss rate is 1 : 40 Key buffer free ratio = 81 % Your key_buffer_size seems to be fine QUERY CACHE Query cache is supported but not enabled Perhaps you should set the query_cache_size SORT OPERATIONS Current sort_buffer_size = 2 M Current read_rnd_buffer_size = 256 K Sort buffer seems to be fine JOINS Current join_buffer_size = 132.00 K You have had 16 queries where a join could not use an index properly You should enable "log-queries-not-using-indexes" Then look for non indexed joins in the slow query log. If you are unable to optimize your queries you may want to increase your join_buffer_size to accommodate larger joins in one pass. Note! This script will still suggest raising the join_buffer_size when ANY joins not using indexes are found. OPEN FILES LIMIT Current open_files_limit = 1024 files The open_files_limit should typically be set to at least 2x-3x that of table_cache if you have heavy MyISAM usage. Your open_files_limit value seems to be fine TABLE CACHE Current table_cache value = 64 tables You have a total of 426 tables You have 64 open tables. Current table_cache hit rate is 1% , while 100% of your table cache is in use You should probably increase your table_cache TEMP TABLES Current max_heap_table_size = 16 M Current tmp_table_size = 32 M Of 15134 temp tables, 9% were created on disk Effective in-memory tmp_table_size is limited to max_heap_table_size. Created disk tmp tables ratio seems fine TABLE SCANS Current read_buffer_size = 128 K Current table scan ratio = 2915 : 1 read_buffer_size seems to be fine TABLE LOCKING Current Lock Wait ratio = 1 : 142213 Your table locking seems to be fine The app is a facebook game with about 50-100 concurrent users. Thanks, Rob

Read the article
Linux software RAID6: rebuild slow

- by Ole Tange

I am trying to find the bottleneck in the rebuilding of a software raid6. ## Pause rebuilding when measuring raw I/O performance # echo 1 > /proc/sys/dev/raid/speed_limit_min # echo 1 > /proc/sys/dev/raid/speed_limit_max ## Drop caches so that does not interfere with measuring # sync ; echo 3 | tee /proc/sys/vm/drop_caches >/dev/null # time parallel -j0 "dd if=/dev/{} bs=256k count=4000 | cat >/dev/null" ::: sdbd sdbc sdbf sdbm sdbl sdbk sdbe sdbj sdbh sdbg 4000+0 records in 4000+0 records out 1048576000 bytes (1.0 GB) copied, 7.30336 s, 144 MB/s [... similar for each disk ...] # time parallel -j0 "dd if=/dev/{} skip=15000000 bs=256k count=4000 | cat >/dev/null" ::: sdbd sdbc sdbf sdbm sdbl sdbk sdbe sdbj sdbh sdbg 4000+0 records in 4000+0 records out 1048576000 bytes (1.0 GB) copied, 12.7991 s, 81.9 MB/s [... similar for each disk ...] So we can read sequentially at 140 MB/s in the outer tracks and 82 MB/s in the inner tracks on all the drives simultaneously. Sequential write performance is similar. This would lead me to expect a rebuild speed of 82 MB/s or more. # echo 800000 > /proc/sys/dev/raid/speed_limit_min # echo 800000 > /proc/sys/dev/raid/speed_limit_max # cat /proc/mdstat md2 : active raid6 sdbd[10](S) sdbc[9] sdbf[0] sdbm[8] sdbl[7] sdbk[6] sdbe[11] sdbj[4] sdbi[3](F) sdbh[2] sdbg[1] 27349121408 blocks super 1.2 level 6, 128k chunk, algorithm 2 [9/8] [UUU_UUUUU] [=========>...........] recovery = 47.3% (1849905884/3907017344) finish=855.9min speed=40054K/sec But we only get 40 MB/s. And often this drops to 30 MB/s. # iostat -dkx 1 sdbc 0.00 8023.00 0.00 329.00 0.00 33408.00 203.09 0.70 2.12 1.06 34.80 sdbd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdbe 13.00 0.00 8334.00 0.00 33388.00 0.00 8.01 0.65 0.08 0.06 47.20 sdbf 0.00 0.00 8348.00 0.00 33388.00 0.00 8.00 0.58 0.07 0.06 48.00 sdbg 16.00 0.00 8331.00 0.00 33388.00 0.00 8.02 0.71 0.09 0.06 48.80 sdbh 961.00 0.00 8314.00 0.00 37100.00 0.00 8.92 0.93 0.11 0.07 54.80 sdbj 70.00 0.00 8276.00 0.00 33384.00 0.00 8.07 0.78 0.10 0.06 48.40 sdbk 124.00 0.00 8221.00 0.00 33380.00 0.00 8.12 0.88 0.11 0.06 47.20 sdbl 83.00 0.00 8262.00 0.00 33380.00 0.00 8.08 0.96 0.12 0.06 47.60 sdbm 0.00 0.00 8344.00 0.00 33376.00 0.00 8.00 0.56 0.07 0.06 47.60 iostat says the disks are not 100% busy (but only 40-50%). This fits with the hypothesis that the max is around 80 MB/s. Since this is software raid the limiting factor could be CPU. top says: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 38520 root 20 0 0 0 0 R 64 0.0 2947:50 md2_raid6 6117 root 20 0 0 0 0 D 53 0.0 473:25.96 md2_resync So md2_raid6 and md2_resync are clearly busy taking up 64% and 53% of a CPU respectively, but not near 100%. The chunk size (128k) of the RAID was chosen after measuring which chunksize gave the least CPU penalty. If this speed is normal: What is the limiting factor? Can I measure that? If this speed is not normal: How can I find the limiting factor? Can I change that?

Read the article
Parallel processing slower than sequential?

- by zebediah49

EDIT: For anyone who stumbles upon this in the future: Imagemagick uses a MP library. It's faster to use available cores if they're around, but if you have parallel jobs, it's unhelpful. Do one of the following: do your jobs serially (with Imagemagick in parallel mode) set MAGICK_THREAD_LIMIT=1 for your invocation of the imagemagick binary in question. By making Imagemagick use only one thread, it slows down by 20-30% in my test cases, but meant I could run one job per core without issues, for a significant net increase in performance. Original question: While converting some images using ImageMagick, I noticed a somewhat strange effect. Using xargs was significantly slower than a standard for loop. Since xargs limited to a single process should act like a for loop, I tested that, and found it to be about the same. Thus, we have this demonstration. Quad core (AMD Athalon X4, 2.6GHz) Working entirely on a tempfs (16g ram total; no swap) No other major loads Results: /media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level real 0m3.784s user 0m2.240s sys 0m0.230s /media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 2 convert -auto-level real 0m9.097s user 0m28.020s sys 0m0.910s /media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 10 convert -auto-level real 0m9.844s user 0m33.200s sys 0m1.270s Can anyone think of a reason why running two instances of this program takes more than twice as long in real time, and more than ten times as long in processor time to complete the same task? After that initial hit, more processes do not seem to have as significant of an effect. I thought it might have to do with disk seeking, so I did that test entirely in ram. Could it have something to do with how Convert works, and having more than one copy at once means it cannot use processor cache as efficiently or something? EDIT: When done with 1000x 769KB files, performance is as expected. Interesting. /media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level real 3m37.679s user 5m6.980s sys 0m6.340s /media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level real 3m37.152s user 5m6.140s sys 0m6.530s /media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 2 convert -auto-level real 2m7.578s user 5m35.410s sys 0m6.050s /media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 4 convert -auto-level real 1m36.959s user 5m48.900s sys 0m6.350s /media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 10 convert -auto-level real 1m36.392s user 5m54.840s sys 0m5.650s

Read the article
Informed TDD – Kata “To Roman Numerals”

- by Ralf Westphal

Originally posted on: http://geekswithblogs.net/theArchitectsNapkin/archive/2014/05/28/informed-tdd-ndash-kata-ldquoto-roman-numeralsrdquo.aspxIn a comment on my article on what I call Informed TDD (ITDD) reader gustav asked how this approach would apply to the kata “To Roman Numerals”. And whether ITDD wasn´t a violation of TDD´s principle of leaving out “advanced topics like mocks”. I like to respond with this article to his questions. There´s more to say than fits into a commentary. Mocks and TDD I don´t see in how far TDD is avoiding or opposed to mocks. TDD and mocks are orthogonal. TDD is about pocess, mocks are about structure and costs. Maybe by moving forward in tiny red+green+refactor steps less need arises for mocks. But then… if the functionality you need to implement requires “expensive” resource access you can´t avoid using mocks. Because you don´t want to constantly run all your tests against the real resource. True, in ITDD mocks seem to be in almost inflationary use. That´s not what you usually see in TDD demonstrations. However, there´s a reason for that as I tried to explain. I don´t use mocks as proxies for “expensive” resource. Rather they are stand-ins for functionality not yet implemented. They allow me to get a test green on a high level of abstraction. That way I can move forward in a top-down fashion. But if you think of mocks as “advanced” or if you don´t want to use a tool like JustMock, then you don´t need to use mocks. You just need to stand the sight of red tests for a little longer ;-) Let me show you what I mean by that by doing a kata. ITDD for “To Roman Numerals” gustav asked for the kata “To Roman Numerals”. I won´t explain the requirements again. You can find descriptions and TDD demonstrations all over the internet, like this one from Corey Haines. Now here is, how I would do this kata differently. 1. Analyse A demonstration of TDD should never skip the analysis phase. It should be made explicit. The requirements should be formalized and acceptance test cases should be compiled. “Formalization” in this case to me means describing the API of the required functionality. “[D]esign a program to work with Roman numerals” like written in this “requirement document” is not enough to start software development. Coding should only begin, if the interface between the “system under development” and its context is clear. If this interface is not readily recognizable from the requirements, it has to be developed first. Exploration of interface alternatives might be in order. It might be necessary to show several interface mock-ups to the customer – even if that´s you fellow developer. Designing the interface is a task of it´s own. It should not be mixed with implementing the required functionality behind the interface. Unfortunately, though, this happens quite often in TDD demonstrations. TDD is used to explore the API and implement it at the same time. To me that´s a violation of the Single Responsibility Principle (SRP) which not only should hold for software functional units but also for tasks or activities. In the case of this kata the API fortunately is obvious. Just one function is needed: string ToRoman(int arabic). And it lives in a class ArabicRomanConversions. Now what about acceptance test cases? There are hardly any stated in the kata descriptions. Roman numerals are explained, but no specific test cases from the point of view of a customer. So I just “invent” some acceptance test cases by picking roman numerals from a wikipedia article. They are supposed to be just “typical examples” without special meaning. Given the acceptance test cases I then try to develop an understanding of the problem domain. I´ll spare you that. The domain is trivial and is explain in almost all kata descriptions. How roman numerals are built is not difficult to understand. What´s more difficult, though, might be to find an efficient solution to convert into them automatically. 2. Solve The usual TDD demonstration skips a solution finding phase. Like the interface exploration it´s mixed in with the implementation. But I don´t think this is how it should be done. I even think this is not how it really works for the people demonstrating TDD. They´re simplifying their true software development process because they want to show a streamlined TDD process. I doubt this is helping anybody. Before you code you better have a plan what to code. This does not mean you have to do “Big Design Up-Front”. It just means: Have a clear picture of the logical solution in your head before you start to build a physical solution (code). Evidently such a solution can only be as good as your understanding of the problem. If that´s limited your solution will be limited, too. Fortunately, in the case of this kata your understanding does not need to be limited. Thus the logical solution does not need to be limited or preliminary or tentative. That does not mean you need to know every line of code in advance. It just means you know the rough structure of your implementation beforehand. Because it should mirror the process described by the logical or conceptual solution. Here´s my solution approach: The arabic “encoding” of numbers represents them as an ordered set of powers of 10. Each digit is a factor to multiply a power of ten with. The “encoding” 123 is the short form for a set like this: {1*10^2, 2*10^1, 3*10^0}. And the number is the sum of the set members. The roman “encoding” is different. There is no base (like 10 for arabic numbers), there are just digits of different value, and they have to be written in descending order. The “encoding” XVI is short for [10, 5, 1]. And the number is still the sum of the members of this list. The roman “encoding” thus is simpler than the arabic. Each “digit” can be taken at face value. No multiplication with a base required. But what about IV which looks like a contradiction to the above rule? It is not – if you accept roman “digits” not to be limited to be single characters only. Usually I, V, X, L, C, D, M are viewed as “digits”, and IV, IX etc. are viewed as nuisances preventing a simple solution. All looks different, though, once IV, IX etc. are taken as “digits”. Then MCMLIV is just a sum: M+CM+L+IV which is 1000+900+50+4. Whereas before it would have been understood as M-C+M+L-I+V – which is more difficult because here some “digits” get subtracted. Here´s the list of roman “digits” with their values: {1, I}, {4, IV}, {5, V}, {9, IX}, {10, X}, {40, XL}, {50, L}, {90, XC}, {100, C}, {400, CD}, {500, D}, {900, CM}, {1000, M} Since I take IV, IX etc. as “digits” translating an arabic number becomes trivial. I just need to find the values of the roman “digits” making up the number, e.g. 1954 is made up of 1000, 900, 50, and 4. I call those “digits” factors. If I move from the highest factor (M=1000) to the lowest (I=1) then translation is a two phase process: Find all the factors Translate the factors found Compile the roman representation Translation is just a look-up. Finding, though, needs some calculation: Find the highest remaining factor fitting in the value Remember and subtract it from the value Repeat with remaining value and remaining factors Please note: This is just an algorithm. It´s not code, even though it might be close. Being so close to code in my solution approach is due to the triviality of the problem. In more realistic examples the conceptual solution would be on a higher level of abstraction. With this solution in hand I finally can do what TDD advocates: find and prioritize test cases. As I can see from the small process description above, there are two aspects to test: Test the translation Test the compilation Test finding the factors Testing the translation primarily means to check if the map of factors and digits is comprehensive. That´s simple, even though it might be tedious. Testing the compilation is trivial. Testing factor finding, though, is a tad more complicated. I can think of several steps: First check, if an arabic number equal to a factor is processed correctly (e.g. 1000=M). Then check if an arabic number consisting of two consecutive factors (e.g. 1900=[M,CM]) is processed correctly. Then check, if a number consisting of the same factor twice is processed correctly (e.g. 2000=[M,M]). Finally check, if an arabic number consisting of non-consecutive factors (e.g. 1400=[M,CD]) is processed correctly. I feel I can start an implementation now. If something becomes more complicated than expected I can slow down and repeat this process. 3. Implement First I write a test for the acceptance test cases. It´s red because there´s no implementation even of the API. That´s in conformance with “TDD lore”, I´d say: Next I implement the API: The acceptance test now is formally correct, but still red of course. This will not change even now that I zoom in. Because my goal is not to most quickly satisfy these tests, but to implement my solution in a stepwise manner. That I do by “faking” it: I just “assume” three functions to represent the transformation process of my solution: My hypothesis is that those three functions in conjunction produce correct results on the API-level. I just have to implement them correctly. That´s what I´m trying now – one by one. I start with a simple “detail function”: Translate(). And I start with all the test cases in the obvious equivalence partition: As you can see I dare to test a private method. Yes. That´s a white box test. But as you´ll see it won´t make my tests brittle. It serves a purpose right here and now: it lets me focus on getting one aspect of my solution right. Here´s the implementation to satisfy the test: It´s as simple as possible. Right how TDD wants me to do it: KISS. Now for the second equivalence partition: translating multiple factors. (It´a pattern: if you need to do something repeatedly separate the tests for doing it once and doing it multiple times.) In this partition I just need a single test case, I guess. Stepping up from a single translation to multiple translations is no rocket science: Usually I would have implemented the final code right away. Splitting it in two steps is just for “educational purposes” here. How small your implementation steps are is a matter of your programming competency. Some “see” the final code right away before their mental eye – others need to work their way towards it. Having two tests I find more important. Now for the next low hanging fruit: compilation. It´s even simpler than translation. A single test is enough, I guess. And normally I would not even have bothered to write that one, because the implementation is so simple. I don´t need to test .NET framework functionality. But again: if it serves the educational purpose… Finally the most complicated part of the solution: finding the factors. There are several equivalence partitions. But still I decide to write just a single test, since the structure of the test data is the same for all partitions: Again, I´m faking the implementation first: I focus on just the first test case. No looping yet. Faking lets me stay on a high level of abstraction. I can write down the implementation of the solution without bothering myself with details of how to actually accomplish the feat. That´s left for a drill down with a test of the fake function: There are two main equivalence partitions, I guess: either the first factor is appropriate or some next. The implementation seems easy. Both test cases are green. (Of course this only works on the premise that there´s always a matching factor. Which is the case since the smallest factor is 1.) And the first of the equivalence partitions on the higher level also is satisfied: Great, I can move on. Now for more than a single factor: Interestingly not just one test becomes green now, but all of them. Great! You might say, then I must have done not the simplest thing possible. And I would reply: I don´t care. I did the most obvious thing. But I also find this loop very simple. Even simpler than a recursion of which I had thought briefly during the problem solving phase. And by the way: Also the acceptance tests went green: Mission accomplished. At least functionality wise. Now I´ve to tidy up things a bit. TDD calls for refactoring. Not uch refactoring is needed, because I wrote the code in top-down fashion. I faked it until I made it. I endured red tests on higher levels while lower levels weren´t perfected yet. But this way I saved myself from refactoring tediousness. At the end, though, some refactoring is required. But maybe in a different way than you would expect. That´s why I rather call it “cleanup”. First I remove duplication. There are two places where factors are defined: in Translate() and in Find_factors(). So I factor the map out into a class constant. Which leads to a small conversion in Find_factors(): And now for the big cleanup: I remove all tests of private methods. They are scaffolding tests to me. They only have temporary value. They are brittle. Only acceptance tests need to remain. However, I carry over the single “digit” tests from Translate() to the acceptance test. I find them valuable to keep, since the other acceptance tests only exercise a subset of all roman “digits”. This then is my final test class: And this is the final production code: Test coverage as reported by NCrunch is 100%: Reflexion Is this the smallest possible code base for this kata? Sure not. You´ll find more concise solutions on the internet. But LOC are of relatively little concern – as long as I can understand the code quickly. So called “elegant” code, however, often is not easy to understand. The same goes for KISS code – especially if left unrefactored, as it is often the case. That´s why I progressed from requirements to final code the way I did. I first understood and solved the problem on a conceptual level. Then I implemented it top down according to my design. I also could have implemented it bottom-up, since I knew some bottom of the solution. That´s the leaves of the functional decomposition tree. Where things became fuzzy, since the design did not cover any more details as with Find_factors(), I repeated the process in the small, so to speak: fake some top level, endure red high level tests, while first solving a simpler problem. Using scaffolding tests (to be thrown away at the end) brought two advantages: Encapsulation of the implementation details was not compromised. Naturally private methods could stay private. I did not need to make them internal or public just to be able to test them. I was able to write focused tests for small aspects of the solution. No need to test everything through the solution root, the API. The bottom line thus for me is: Informed TDD produces cleaner code in a systematic way. It conforms to core principles of programming: Single Responsibility Principle and/or Separation of Concerns. Distinct roles in development – being a researcher, being an engineer, being a craftsman – are represented as different phases. First find what, what there is. Then devise a solution. Then code the solution, manifest the solution in code. Writing tests first is a good practice. But it should not be taken dogmatic. And above all it should not be overloaded with purposes. And finally: moving from top to bottom through a design produces refactored code right away. Clean code thus almost is inevitable – and not left to a refactoring step at the end which is skipped often for different reasons. PS: Yes, I have done this kata several times. But that has only an impact on the time needed for phases 1 and 2. I won´t skip them because of that. And there are no shortcuts during implementation because of that.

Read the article
MSTest VS2010 - DeploymentItem copying files to different locations on different machines

- by Jack

I have found that DeploymentItem [TestClass(), DeploymentItem(@"TestData\")] is not copying my test data files to the same location when tests are built and run on different machines. The test data files are copied to the "bin\debug" directory in the test project on my machine, but on my friend's machine they are copied to "TestResults\*name_machine YY-MM-DD HH_MM_SS*\Out". The bin\debug directory on my machine can be obtained with the code: string appDirectory = Path.GetDirectoryNameSystem.Reflection.Assembly.GetExecutingAssembly().Location; and the same code will return "TestResults\*name_machine YY-MM-DD HH_MM_SS*\Out" on my friends PC. This however isn't really the problem. The problem is that the test data files I have made have a folder structure, and this folder structure is only maintained on my machine when copied to bin\debug, whereas on my friends machine only the files are added to the "TestResults\*name_machine YY-MM-DD HH_MM_SS*\Out" directory. This means that tests will pass on my machine and fail on his! Is there a way to ensure that DeploymentItem always copys to the bin\debug folder? Or a way to ensure that the folder structure will be retained when DeploymentItem copies the files to the "TestResults\*name_machine YY-MM-DD HH_MM_SS*\Out" folder?

Read the article
Performance of Silverlight Datagrid in Silverlight 3 vs Silverlight 4 on a mac

- by Simon

I'm using Silverlight Beta 4 for a LOB application. After finding out today that I'll have to wait perhaps 4 months to be able to develop with SL4 on Visual Studio 2010 I'm thinking I need to downgrade my application to SL3 but thats another question. The problem is I'm noticing absolutely abismal performance for simple datagrids that work just fine on a PC when I'm running on a Mac. These grids contain only 5-10 columns and maybe 50 rows. Paging up and down takes about 1-2 seconds sometimes. I would appreciate anybody's experience in which of the following is the best solution: reverting to Silverlight 3 and hoping DataGrid is faster switching to 3rd party datagrid such as Telerik forgetting silverlight altogether I was hoping that possibly SL4 runtime might be updated but that won't happen probably for 3-4 months. Just a reminder - this is specifically a mac issue. Performance on my PC while slightly slow to populate the grid initially is fine.

Read the article
Workflow for academic research projects, one-step builds, and the Joel Test

- by Steve

Working alone on academic research sometimes breeds bad habits. With no one else reading my code, I would write a lot of throw-away code, and I would lose track of intermediate results which, weeks or months later, I wish I had retained. My recent attempts to make my personal workflow conform to the Joel Test raised interesting questions. Academic research has inherently different goals than industrial software development, and therefore some aspects of the Joel Test become less valid. Nevertheless, I find these steps to be still valuable for academic research: Do you use source control? Can you make a build in one step? Do you have an up-to-date schedule? Do you have a spec? Of particular use is the one-step build. I find myself more organized now that I have implemented the following "one-step build": In other words, I have a single script, build.py, that accepts Python code, data, and TeX as inputs. The outputs are results, figures, and a paper with all the results filled in. (Yes, I know "build" is probably not accurate in this context, but you get the idea.) By consolidating many small steps into one, I am not backtracking as much as I used to. ...but I'm sure there is still room for improvement. Question: For research projects, which steps of the Joel Test do you still value? Do you have a one-step build? If so, what does yours consist of, i.e., what inputs does it accept, and what output does it generate?

Read the article
Linux testing - CPU & disk

- by JavaRocky

I am a bit rusty lately. What is the standard way to test CPU and disk I/O these days?

Read the article
RAD/Eclipse Eclipse Test and Performance Tools Platform, export data to text file

- by Berlin Brown

I am using the RAD (also on Eclipse) Test and Performance Monitoring. I monitor CPU performance time with it, on particular methods, etc. It is a good tool for my monitoring my applications but I can't copy/paste or export the output to a text file format. So I can send to the others. There has to be a way to export this? Also, I can save the output to file but it is '*.trcxml' binary file? has anyone seen a parser for this file format?

Read the article
Does obfuscation affects performance?

- by Magesh

Does obfuscating a Java program affect its performance (excluding renaming things)?

Read the article
.NET or Windows Synchronization Primitives Performance Specifications

- by ovanes

Hello *, I am currently writing a scientific article, where I need to be very exact with citation. Can someone point me to either MSDN, MSDN article, some published article source or a book, where I can find performance comparison of Windows or .NET Synchronization primitives. I know that these are in the descending performance order: Interlocked API, Critical Section, .NET lock-statement, Monitor, Mutex, EventWaitHandle, Semaphore. Many Thanks, Ovanes P.S. I found a great book: Concurrent Programming on Windows by Joe Duffy. This book is written by one of the head concurrency developers for .NET Framework and is simply brilliant with lots of explanations, how things work or were implemented.

Read the article
Postfix/ClamAV not stopping viruses under Virtualmin

- by Josh

I am using Virtualmin and have it set up to have Postfix scan incoming emails with ClamAV (using clamdscan) and delete any emails which contain a virus. However when I email myself the EICAR test string, it comes through just fine. I know ClamAV will report this file as a virus. How can I troubleshoot this / what could be wrong?

Read the article
Logging strategy vs. performance

- by vtortola

Hi, I'm developing a web application that has to support lots of simultaneous requests, and I'd like to keep it fast enough. I have now to implement a logging strategy, I'm gonna use log4net, but ... what and how should I log? I mean: How logging impacts in performance? is it possible/recomendable logging using async calls? Is better use a text file or a database? Is it possible to do it conditional? for example, default log to the database, and if it fails, the switch to a text file. What about multithreading? should I care about synchronization when I use log4net? or it's thread safe out of the box? In the requirements appear that the application should cache a couple of things per request, and I'm afraid of the performance impact of that. Cheers.

Read the article
iPhone Foundation - performance implications of mutable and xxxWithCapacity:0

- by Adam Eberbach

All of the collection classes have two versions - mutable and immutable, such as NSArray and NSMutableArray. Is the distinction merely to promote careful programming by providing a const collection or is there some performance hit when using a mutable object as opposed to immutable? Similarly each of the collection classes has a method xxxxWithCapacity, like [NSMutableArray arrayWithCapacity:0]. I often use zero as the argument because it seems a better choice than guessing wrongly how many objects might be added. Is there some performance advantage to creating a collection with capacity for enough objects in advance? If not why isn't the function something like + (id)emptyArray?

Read the article
A way to measure performance

- by Andrei Ciobanu

Given Exercise 14 from 99 Haskell Problems: (*) Duplicate the elements of a list. Eg.: *Main> dupli''' [1..10] [1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10] I've implemented 4 solutions: {-- my first attempt --} dupli :: [a] -> [a] dupli [] = [] dupli (x:xs) = replicate 2 x ++ dupli xs {-- using concatMap and replicate --} dupli' :: [a] -> [a] dupli' xs = concatMap (replicate 2) xs {-- usign foldl --} dupli'' :: [a] -> [a] dupli'' xs = foldl (\acc x -> acc ++ [x,x]) [] xs {-- using foldl 2 --} dupli''' :: [a] -> [a] dupli''' xs = reverse $ foldl (\acc x -> x:x:acc) [] xs Still, I don't know how to really measure performance . So what's the recommended function (from the above list) in terms of performance . Any suggestions ?

Read the article

Search Results

Search found 35507 results on 1421 pages for 'performance test'.

Page 127/1421 | < Previous Page | 123 124 125 126 127 128 129 130 131 132 133 134 | Next Page >

- by David V. Corbin

- by Andy B

- by mhucka

- by NateS

- by zenna

- by celil

- by ernst

- by Sequenzia

- by user36825

- by NicMagnier

- by Rob Stevenson-Leggett

- by Ole Tange

- by zebediah49

- by Ralf Westphal

- by Jack

- by Simon

- by Steve

- by JavaRocky

- by Berlin Brown

- by Magesh

- by ovanes

- by Josh

- by vtortola

- by Adam Eberbach

- by Andrei Ciobanu

< Previous Page | 123 124 125 126 127 128 129 130 131 132 133 134 | Next Page >