Search Results

Search found 73 results on 3 pages for 'openmp'.

Page 3/3 | < Previous Page | 1 2 3 

  • STL algorithms and concurrent programming

    - by Andrew
    Hello everyone, Can any of STL algorithms/container operations like std::fill, std::transform be executed in parallel if I enable OpenMP for my compiler? I am working with MSVC 2008 at the moment. Or maybe there are other ways to make it concurrent? Thanks.

    Read the article

  • Ideas for student parallel programming project

    - by chi42
    I'm looking to do a parallel programming project in C (probably using pthreads or maybe OpenMP) for a class. It will done by a group of about four students, and should take about 4 weeks. I was thinking it would be interesting to attack some NP-complete problem with a more complex algorithm like a genetic algo with simulated annealing, but I'm not sure if it would be a big enough project. Anyone knew of any cool problems that could benefit from a parallel approach?

    Read the article

  • Data Structure for Small Number of Agents in a Relatively Big 2D World

    - by Seçkin Savasçi
    I'm working on a project where we will implement a kind of world simulation where there is a square 2D world. Agents live on this world and make decisions like moving or replicating themselves based on their neighbor cells(world=grid) and some extra parameters(which are not based on the state of the world). I'm looking for a data structure to implement such a project. My concerns are : I will implement this 3 times: sequential, using OpenMP, using MPI. So if I can use the same structure that will be quite good. The first thing comes up is keeping a 2D array for the world and storing agent references in it. And simulate the world for each time slice by checking every cell in each iteration and further processing if an agents is found in the cell. The downside is what if I have 1000x1000 world and only 5 agents in it. It will be an overkill for both sequential and parallel versions to check each cell and look for possible agents in them. I can use quadtree and store agents in it, but then how can I get the information about neighbor cells then? Please let me know if I should elaborate more.

    Read the article

  • Running Awk command on a cluster

    - by alex
    How do you execute a Unix shell command (awk script, a pipe etc) on a cluster in parallel (step 1) and collect the results back to a central node (step 2) Hadoop seems to be a huge overkill with its 600k LOC and its performance is terrible (takes minutes just to initialize the job) i don't need shared memory, or - something like MPI/openMP as i dont need to synchronize or share anything, don't need a distributed VM or anything as complex Google's SawZall seems to work only with Google proprietary MapReduce API some distributed shell packages i found failed to compile, but there must be a simple way to run a data-centric batch job on a cluster, something as close as possible to native OS, may be using unix RPC calls i liked rsync simplicity but it seem to update remote notes sequentially, and you cant use it for executing scripts as afar as i know switching to Plan 9 or some other network oriented OS looks like another overkill i'm looking for a simple, distributed way to run awk scripts or similar - as close as possible to data with a minimal initialization overhead, in a nothing-shared, nothing-synchronized fashion Thanks Alex

    Read the article

  • Parallelism in Python

    - by fmark
    What are the options for achieving parallelism in Python? I want to perform a bunch of CPU bound calculations over some very large rasters, and would like to parallelise them. Coming from a C background, I am familiar with three approaches to parallelism: Message passing processes, possibly distributed across a cluster, e.g. MPI. Explicit shared memory parallelism, either using pthreads or fork(), pipe(), et. al Implicit shared memory parallelism, using OpenMP. Deciding on an approach to use is an exercise in trade-offs. In Python, what approaches are available and what are their characteristics? Is there a clusterable MPI clone? What are the preferred ways of achieving shared memory parallelism? I have heard reference to problems with the GIL, as well as references to tasklets. In short, what do I need to know about the different parallelization strategies in Python before choosing between them?

    Read the article

  • Where are the static methods in gcc's dump file.c.135r.jump

    - by Customizer
    When I run gcc with the parameter -fdump-rtl-jump, I get a dump file with the name file.c.135r.jump, where I can read some information about the intermediate representation of the methods in my C or C++ file. I just recently discovered, that the static methods of a project are missing in this dump file. Do you know, why they are missing in that representation and if there is a possibility to include the static methods in this file, too. Update (some additional information): The test program, I'm using here, is the Hybrid OpenMP MPI Benchmark. Update2: I just reproduced the problem with a serial application, so it has nothing to do with parallel sections.

    Read the article

  • [C++] Needed: A simple C++ container (stack, linked list) that is thread-safe for writing

    - by conradlee
    I am writing a multi-threaded program using OpenMP in C++. At one point my program forks into many threads, each of which need to add "jobs" to some container that keeps track of all added jobs. Each job can just be a pointer to some object. Basically, I just need the add pointers to some container from several threads at the same time. Is there a simple solution that performs well? After some googling, I found that STL containers are not thread-safe. Some stackoverflow threads address this question, but none form a consensus on a simple solution.

    Read the article

  • Simple C++ container class that is thread-safe for writing

    - by conradlee
    I am writing a multi-threaded program using OpenMP in C++. At one point my program forks into many threads, each of which need to add "jobs" to some container that keeps track of all added jobs. Each job can just be a pointer to some object. Basically, I just need the add pointers to some container from several threads at the same time. Is there a simple solution that performs well? After some googling, I found that STL containers are not thread-safe. Some stackoverflow threads address this question, but none that forms a consensus on a simple solution.

    Read the article

  • Visual Studio 2005 won't install on Windows 7

    - by Peanut
    Hi, My question relates very closely to this question: http://superuser.com/questions/34190/visual-studio-2005-sp1-refuses-to-install-in-windows-7 However this question hasn't provided the answer I'm looking for. I'm trying to install Visual Studio 2005 onto a clean Windows 7 (64 bit) box. However I keep getting the following error when the 'Microsoft Visual Studio 2005' component finishes installing ... Error 1935.An error occurred during the installation of assembly 'policy.8.0.Microsoft.VC80.OpenMP,type="win32-policy",version="8.0.50727.42",publicKeyToken="1fc8b3b9a1e18e3b",processorArchitecture="x86",Please refer to Help and Support for more information. HRESULT: 0x80073712. On my first attempt to install VS 2005 I got a warning about compatibility issues. I stopped at this point, downloaded the necessary service packs and restarted the installation from the beginning. Every since then I just get the error message above. I keep rolling back the installation and trying again ... it's but always the same error. Any help would be very much appreciated. Thanks.

    Read the article

  • MPI Cluster Debugger launch integration in VS2010

    Let's assume that you have all the HPC bits installed and that you have existing MPI code (or you created a "Hello World" project using the MPI project template). Of course, you create a single MPI application and at runtime it will correspond to multiple processes (of the same app) launched on multiple nodes (i.e. machines) on the cluster. So how do you debug such a situation by simply hitting the familiar "F5" keystroke (i.e. Debug - Start Debugging)?WATCH IT INSTEAD OF READING ABOUT ITIf you can't bear to read through all the details below, just watch this 19-minute screencast explaining this VS2010 feature. Alternatively, or even additionally, keep on reading.REQUIREMENTWhen you debug an MPI application, you would want the copying of resources from your client machine (where Visual Studio is installed) to each compute node (where Windows HPC Server is installed) to take place automatically for you. 'Resources' in the previous sentence includes your application binary, plus any binary or data dependencies it may have, plus PDBs if needed, plus the debug CRT of the correct bitness, plus msvsmon for remote debugging to work. You would also want, after copying is complete, to have your app and msvsmon launched and attached so that you can hit breakpoints back in Visual Studio on your client machine. All these thing that you would want are delivered in VS2010.STEPS TO F51. In your MPI project where you have placed a breakpoint go to Project Properties - Configuration Properties - Debugging. Ensure the "Debugger to launch" combo box value is set to MPI Cluster Debugger.2. There are a whole bunch of properties here and typically you can ignore all of them except one: Run Environment. By default it is set to run 1 process on your local machine and if you change the number after that to, for example, 4 it will launch 4 processes of your app on your local machine.You want this to run on your cluster though, so go to the dropdown arrow at the end of the Run Environment cell and open it to expose the "Edit Hpc node" menu which opens the Node Selector dialog:In this dialog you can enter (or pick from a list) the cluster head node name and then the number of processes you want to execute on the cluster and then hit OK and… you are done.3. Press F5 and watch your breakpoint get hit (after giving it some time for copying, remote execution, attachment and symbol resolution to take place).GOING DEEPERIn the MPI Cluster Debugger project properties above, you can see many additional properties to the Run Environment. They are all optional, but you may want to understand them in order to fine tune your cluster debugging. Read all about each one of these on the MSDN page Configuration Properties for the MPI Cluster Debugger.In the Node Selector dialog above you can see more options than just the Head Node name and Number of Process to run. They should be self-explanatory but I also cover them in depth in my screencast showing you an example of why you would choose to schedule processes per core versus per node. You can also read about these options on MSDN as part of the page How to: Configure and Launch the MPI Cluster Debugger.To read through an example that touches on MPI project creation, project properties, node selector, and also usage of MPI with OpenMP plus MPI with PPL, read the MSDN page Walkthrough: Launching the MPI Cluster Debugger in Visual Studio 2010.Happy MPI debugging! Comments about this post welcome at the original blog.

    Read the article

  • MPI Cluster Debugger launch integration in VS2010

    Let's assume that you have all the HPC bits installed and that you have existing MPI code (or you created a "Hello World" project using the MPI project template). Of course, you create a single MPI application and at runtime it will correspond to multiple processes (of the same app) launched on multiple nodes (i.e. machines) on the cluster. So how do you debug such a situation by simply hitting the familiar "F5" keystroke (i.e. Debug - Start Debugging)?WATCH IT INSTEAD OF READING ABOUT ITIf you can't bear to read through all the details below, just watch this 19-minute screencast explaining this VS2010 feature. Alternatively, or even additionally, keep on reading.REQUIREMENTWhen you debug an MPI application, you would want the copying of resources from your client machine (where Visual Studio is installed) to each compute node (where Windows HPC Server is installed) to take place automatically for you. 'Resources' in the previous sentence includes your application binary, plus any binary or data dependencies it may have, plus PDBs if needed, plus the debug CRT of the correct bitness, plus msvsmon for remote debugging to work. You would also want, after copying is complete, to have your app and msvsmon launched and attached so that you can hit breakpoints back in Visual Studio on your client machine. All these thing that you would want are delivered in VS2010.STEPS TO F51. In your MPI project where you have placed a breakpoint go to Project Properties - Configuration Properties - Debugging. Ensure the "Debugger to launch" combo box value is set to MPI Cluster Debugger.2. There are a whole bunch of properties here and typically you can ignore all of them except one: Run Environment. By default it is set to run 1 process on your local machine and if you change the number after that to, for example, 4 it will launch 4 processes of your app on your local machine.You want this to run on your cluster though, so go to the dropdown arrow at the end of the Run Environment cell and open it to expose the "Edit Hpc node" menu which opens the Node Selector dialog:In this dialog you can enter (or pick from a list) the cluster head node name and then the number of processes you want to execute on the cluster and then hit OK and… you are done.3. Press F5 and watch your breakpoint get hit (after giving it some time for copying, remote execution, attachment and symbol resolution to take place).GOING DEEPERIn the MPI Cluster Debugger project properties above, you can see many additional properties to the Run Environment. They are all optional, but you may want to understand them in order to fine tune your cluster debugging. Read all about each one of these on the MSDN page Configuration Properties for the MPI Cluster Debugger.In the Node Selector dialog above you can see more options than just the Head Node name and Number of Process to run. They should be self-explanatory but I also cover them in depth in my screencast showing you an example of why you would choose to schedule processes per core versus per node. You can also read about these options on MSDN as part of the page How to: Configure and Launch the MPI Cluster Debugger.To read through an example that touches on MPI project creation, project properties, node selector, and also usage of MPI with OpenMP plus MPI with PPL, read the MSDN page Walkthrough: Launching the MPI Cluster Debugger in Visual Studio 2010.Happy MPI debugging! Comments about this post welcome at the original blog.

    Read the article

  • Scalable / Parallel Large Graph Analysis Library?

    - by Joel Hoff
    I am looking for good recommendations for scalable and/or parallel large graph analysis libraries in various languages. The problems I am working on involve significant computational analysis of graphs/networks with 1-100 million nodes and 10 million to 1+ billion edges. The largest SMP computer I am using has 256 GB memory, but I also have access to an HPC cluster with 1000 cores, 2 TB aggregate memory, and MPI for communication. I am primarily looking for scalable, high-performance graph libraries that could be used in either single or multi-threaded scenarios, but parallel analysis libraries based on MPI or a similar protocol for communication and/or distributed memory are also of interest for high-end problems. Target programming languages include C++, C, Java, and Python. My research to-date has come up with the following possible solutions for these languages: C++ -- The most viable solutions appear to be the Boost Graph Library and Parallel Boost Graph Library. I have looked briefly at MTGL, but it is currently slanted more toward massively multithreaded hardware architectures like the Cray XMT. C - igraph and SNAP (Small-world Network Analysis and Partitioning); latter uses OpenMP for parallelism on SMP systems. Java - I have found no parallel libraries here yet, but JGraphT and perhaps JUNG are leading contenders in the non-parallel space. Python - igraph and NetworkX look like the most solid options, though neither is parallel. There used to be Python bindings for BGL, but these are now unsupported; last release in 2005 looks stale now. Other topics here on SO that I've looked at have discussed graph libraries in C++, Java, Python, and other languages. However, none of these topics focused significantly on scalability. Does anyone have recommendations they can offer based on experience with any of the above or other library packages when applied to large graph analysis problems? Performance, scalability, and code stability/maturity are my primary concerns. Most of the specialized algorithms will be developed by my team with the exception of any graph-oriented parallel communication or distributed memory frameworks (where the graph state is distributed across a cluster).

    Read the article

  • How are you taking advantage of Multicore?

    - by tgamblin
    As someone in the world of HPC who came from the world of enterprise web development, I'm always curious to see how developers back in the "real world" are taking advantage of parallel computing. This is much more relevant now that all chips are going multicore, and it'll be even more relevant when there are thousands of cores on a chip instead of just a few. My questions are: How does this affect your software roadmap? I'm particularly interested in real stories about how multicore is affecting different software domains, so specify what kind of development you do in your answer (e.g. server side, client-side apps, scientific computing, etc). What are you doing with your existing code to take advantage of multicore machines, and what challenges have you faced? Are you using OpenMP, Erlang, Haskell, CUDA, TBB, UPC or something else? What do you plan to do as concurrency levels continue to increase, and how will you deal with hundreds or thousands of cores? If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too. Finally, I've framed this as a multicore question, but feel free to talk about other types of parallel computing. If you're porting part of your app to use MapReduce, or if MPI on large clusters is the paradigm for you, then definitely mention that, too. Update: If you do answer #5, mention whether you think things will change if there get to be more cores (100, 1000, etc) than you can feed with available memory bandwidth (seeing as how bandwidth is getting smaller and smaller per core). Can you still use the remaining cores for your application?

    Read the article

  • Multithreaded linked list traversal

    - by Rob Bryce
    Given a (doubly) linked list of objects (C++), I have an operation that I would like multithread, to perform on each object. The cost of the operation is not uniform for each object. The linked list is the preferred storage for this set of objects for a variety of reasons. The 1st element in each object is the pointer to the next object; the 2nd element is the previous object in the list. I have solved the problem by building an array of nodes, and applying OpenMP. This gave decent performance. I then switched to my own threading routines (based off Windows primitives) and by using InterlockedIncrement() (acting on the index into the array), I can achieve higher overall CPU utilization and faster through-put. Essentially, the threads work by "leap-frog'ing" along the elements. My next approach to optimization is to try to eliminate creating/reusing the array of elements in my linked list. However, I'd like to continue with this "leap-frog" approach and somehow use some nonexistent routine that could be called "InterlockedCompareDereference" - to atomically compare against NULL (end of list) and conditionally dereference & store, returning the dereferenced value. I don't think InterlockedCompareExchangePointer() will work since I cannot atomically dereference the pointer and call this Interlocked() method. I've done some reading and others are suggesting critical sections or spin-locks. Critical sections seem heavy-weight here. I'm tempted to try spin-locks but I thought I'd first pose the question here and ask what other people are doing. I'm not convinced that the InterlockedCompareExchangePointer() method itself could be used like a spin-lock. Then one also has to consider acquire/release/fence semantics... Ideas? Thanks!

    Read the article

  • Build OpenGL model in parallel?

    - by Brendan Long
    I have a program which draws some terrain and simulates water flowing over it (in a cheap and easy way). Updating the water was easy to parallelize using OpenMP, so I can do ~50 updates per second. The problem is that even with a small amounts of water, my draws per second are very very low (starts at 5 and drops to around 2 once there's a significant amount of water). It's not a problem with the video card because the terrain is more complicated and gets drawn so quickly that boost::timer tells me that I get infinity draws per second if I turn the water off. It may be related to memory bandwidth though (since I assume the model stays on the card and doesn't have to be transfered every time). What I'm concerned about is that on every draw, I'm calling glVertex3f() about a million times (max size is 450*600, 4 vertices each), and it's done entirely sequentially because Glut won't let me call anything in parallel. So.. is if there's some way of building the list in parallel and then passing it to OpenGL all at once? Or some other way of making it draw this faster? Am I using the wrong method (besides the obvious "use less vertices")?

    Read the article

  • Embedded Development Board

    - by ALF3130
    I'm new to the embedded development world and am looking to get my very first board. After some research, I realize that there aren't many choices with FPUs. This is important in my project as I'm going to be doing quite a bit of floating point computations. I found the Mini2440 which seems to run on the ARM920T core. This particular unit is perfect for my needs (decent price, all the right I/O ports, and a touch screen to boot) but it seems that it doesn't have an FPU. I don't know how big of a penalty I'd be paying for FP emulation, so I'm unsure of whether to pull the trigger on this one. That said: Can someone please confirm whether this product (Mini2440) has an FPU or not? My project will do image capture and analysis. Does anyone have any experience with running things like OpenMP on such platforms? Please suggest any other similar boards in the = $200 price range that have an FPU. This world is new to me. Any other advice or things I should be aware of is much appreciated.

    Read the article

  • iproute2 not functioning ("RTNETLINK answers: Operation not supported")

    - by James Watt
    The command and error message: gtwy ~ # ip rule add from 64.251.23.186 table t1 RTNETLINK answers: Operation not supported Older article of the same problem, but it did not help me: http://forums.gentoo.org/viewtopic-t-696982-start-0-postdays-0-postorder-asc-highlight-.html I have looked on google at great lengths to try to find a solution. It seems that my kernel configuration is missing something? Any help or ideas would be appreciated. My system/kernel is: 2.6.36-gentoo-r5 #3 SMP Thu Jan 13 10:49:06 EST 2011 x86_64 Intel(R) Xeon(R) CPU X3220 @ 2.40GHz GenuineIntel GNU/Linux. I am posting this on SuperUser since this system is used as a workstation and this problem is unrelated to specific tasks that are handled exclusively by servers. iproute2 is installed: gtwy etc # emerge --search iproute2 Searching... [ Results for search key : iproute2 ] [ Applications found : 1 ] * sys-apps/iproute2 Latest version available: 2.6.35-r2 Latest version installed: 2.6.35-r2 Size of files: 378 kB Homepage: http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2 Description: kernel routing and traffic control utilities License: GPL-2 A small snippet of my kernel .config (view entire .config): gtwy linux # cat .config | grep NETLINK CONFIG_NETFILTER_NETLINK=y CONFIG_NETFILTER_NETLINK_QUEUE=y CONFIG_NETFILTER_NETLINK_LOG=y CONFIG_NF_CT_NETLINK=y CONFIG_SCSI_NETLINK=y gtwy linux # cat .config | grep IP_ADVANCED_ROUTER CONFIG_IP_ADVANCED_ROUTER=y gtwy linux # cat .config | grep INGRESS CONFIG_NET_SCH_INGRESS=y gtwy linux # cat .config | grep NET_SCHED CONFIG_NET_SCHED=y emerge --info Portage 2.1.9.25 (default/linux/amd64/10.0, gcc-4.1.2, glibc-2.10.1-r1, 2.6.36-gentoo-r5 x86_64) ================================================================= System uname: Linux-2.6.36-gentoo-r5-x86_64-Intel-R-_Xeon-R-_CPU_X3220_@_2.40GHz-with-gentoo-1.12.13 Timestamp of tree: Thu, 13 Jan 2011 01:15:01 +0000 app-shells/bash: 4.0_p37 dev-java/java-config: 1.3.7-r1, 2.1.10 dev-lang/python: 2.4.6, 2.5.4-r4, 2.6.5-r2, 3.1.2-r3 sys-apps/baselayout: 1.12.13 sys-apps/sandbox: 1.6-r2 sys-devel/autoconf: 2.13, 2.65 sys-devel/automake: 1.9.6-r2::<unknown repository>, 1.10.2, 1.11.1 sys-devel/binutils: 2.20.1-r1 sys-devel/gcc: 4.1.2, 4.3.4, 4.4.3-r2 sys-devel/gcc-config: 1.4.1 sys-devel/libtool: 2.2.6b sys-devel/make: 3.81 virtual/os-headers: 2.6.30-r1 (sys-kernel/linux-headers) ACCEPT_KEYWORDS="amd64" ACCEPT_LICENSE="*" CBUILD="x86_64-pc-linux-gnu" CFLAGS="-march=nocona -O2 -pipe" CHOST="x86_64-pc-linux-gnu" CONFIG_PROTECT="/etc /var/bind" CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo" CXXFLAGS="-march=nocona -O2 -pipe" DISTDIR="/usr/portage/distfiles" FEATURES="assume-digests binpkg-logs distlocks fixlafiles fixpackages news parallel-fetch protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch" GENTOO_MIRRORS="http://gentoo.chem.wisc.edu/gentoo" LC_ALL="en_US.UTF-8" LDFLAGS="-Wl,-O1 -Wl,--as-needed" LINGUAS="en" MAKEOPTS="-j5" PKGDIR="/usr/portage/packages" PORTAGE_CONFIGROOT="/" PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages" PORTAGE_TMPDIR="/var/tmp" PORTDIR="/usr/portage" PORTDIR_OVERLAY="/usr/local/portage" SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage" USE="acl amd64 apache2 berkdb bzip2 cli cracklib crypt ctype cups curl cxx dri fortran gdbm gpm iconv jpeg jpeg2k libwww mmx modules mudflap multilib mysql ncurses nls nptl nptlonly openmp pam pcre perl php png pppd python readline session sockets sse sse2 ssl symlink sysfs tcpd threads unicode vhosts xml xorg xsl zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en" PHP_TARGETS="php5-3" RUBY_TARGETS="ruby18" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga neomagic nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account" Unset: CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LANG, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

    Read the article

  • C++ Optimize if/else condition

    - by Heye
    I have a single line of code, that consumes 25% - 30% of the runtime of my application. It is a less-than comparator for an std::set (the set is implemented with a Red-Black-Tree). It is called about 180 Million times within 52 seconds. struct Entry { const float _cost; const long _id; // some other vars Entry(float cost, float id) : _cost(cost), _id(id) { } }; template<class T> struct lt_entry: public binary_function <T, T, bool> { bool operator()(const T &l, const T &r) const { // Most readable shape if(l._cost != r._cost) { return r._cost < l._cost; } else { return l._id < r._id; } } }; The entries should be sorted by cost and if the cost is the same by their id. I have many insertions for each extraction of the minimum. I thought about using Fibonacci-Heaps, but I have been told that they are theoretically nice, but suffer from high constants and are pretty complicated to implement. And since insert is in O(log(n)) the runtime increase is nearly constant with large n. So I think its okay to stick to the set. To improve performance I tried to express it in different shapes: return l._cost < r._cost || r._cost > l._cost || l._id < r._id; return l._cost < r._cost || (l._cost == r._cost && l._id < r._id); Even this: typedef union { float _f; int _i; } flint; //... flint diff; diff._f = (l._cost - r._cost); return (diff._i && diff._i >> 31) || l._id < r._id; But the compiler seems to be smart enough already, because I haven't been able to improve the runtime. I also thought about SSE but this problem is really not very applicable for SSE... The assembly looks somewhat like this: movss (%rbx),%xmm1 mov $0x1,%r8d movss 0x20(%rdx),%xmm0 ucomiss %xmm1,%xmm0 ja 0x410600 <_ZNSt8_Rb_tree[..]+96> ucomiss %xmm0,%xmm1 jp 0x4105fd <_ZNSt8_Rb_[..]_+93> jne 0x4105fd <_ZNSt8_Rb_[..]_+93> mov 0x28(%rdx),%rax cmp %rax,0x8(%rbx) jb 0x410600 <_ZNSt8_Rb_[..]_+96> xor %r8d,%r8d I have a very tiny bit experience with assembly language, but not really much. I thought it would be the best (only?) point to squeeze out some performance, but is it really worth the effort? Can you see any shortcuts that could save some cycles? The platform the code will run on is an ubuntu 12 with gcc 4.6 (-stl=c++0x) on a many-core intel machine. Only libraries available are boost, openmp and tbb. I am really stuck on this one, it seems so simple, but takes that much time. I have been crunching my head since days thinking how I could improve this line... Can you give me a suggestion how to improve this part, or is it already at its best?

    Read the article

  • Residual packages Ubuntu 12.04

    - by hydroxide
    I have an Asus Q500A with win8 and Ubuntu 12.04 64 bit; Linux kernel 3.8.0-32-generic. I have been having residual package issues which have been giving me trouble trying to reconfigure xserver-xorg-lts-raring. I tried removing all residual packages from synaptic but the following were not removed. Output of sudo dpkg -l | grep "^rc" rc gstreamer0.10-plugins-good:i386 0.10.31-1ubuntu1.2 GStreamer plugins from the "good" set rc libaa1:i386 1.4p5-39ubuntu1 ASCII art library rc libaio1:i386 0.3.109-2ubuntu1 Linux kernel AIO access library - shared library rc libao4:i386 1.1.0-1ubuntu2 Cross Platform Audio Output Library rc libasn1-8-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - ASN.1 library rc libasound2:i386 1.0.25-1ubuntu10.2 shared library for ALSA applications rc libasyncns0:i386 0.8-4 Asynchronous name service query library rc libatk1.0-0:i386 2.4.0-0ubuntu1 ATK accessibility toolkit rc libavahi-client3:i386 0.6.30-5ubuntu2 Avahi client library rc libavahi-common3:i386 0.6.30-5ubuntu2 Avahi common library rc libavc1394-0:i386 0.5.3-1ubuntu2 control IEEE 1394 audio/video devices rc libcaca0:i386 0.99.beta17-2.1ubuntu2 colour ASCII art library rc libcairo-gobject2:i386 1.10.2-6.1ubuntu3 The Cairo 2D vector graphics library (GObject library) rc libcairo2:i386 1.10.2-6.1ubuntu3 The Cairo 2D vector graphics library rc libcanberra-gtk0:i386 0.28-3ubuntu3 GTK+ helper for playing widget event sounds with libcanberra rc libcanberra0:i386 0.28-3ubuntu3 simple abstract interface for playing event sounds rc libcap2:i386 1:2.22-1ubuntu3 support for getting/setting POSIX.1e capabilities rc libcdparanoia0:i386 3.10.2+debian-10ubuntu1 audio extraction tool for sampling CDs (library) rc libcroco3:i386 0.6.5-1ubuntu0.1 Cascading Style Sheet (CSS) parsing and manipulation toolkit rc libcups2:i386 1.5.3-0ubuntu8 Common UNIX Printing System(tm) - Core library rc libcupsimage2:i386 1.5.3-0ubuntu8 Common UNIX Printing System(tm) - Raster image library rc libcurl3:i386 7.22.0-3ubuntu4.3 Multi-protocol file transfer library (OpenSSL) rc libdatrie1:i386 0.2.5-3 Double-array trie library rc libdbus-glib-1-2:i386 0.98-1ubuntu1.1 simple interprocess messaging system (GLib-based shared library) rc libdbusmenu-qt2:i386 0.9.2-0ubuntu1 Qt implementation of the DBusMenu protocol rc libdrm-nouveau2:i386 2.4.43-0ubuntu0.0.3 Userspace interface to nouveau-specific kernel DRM services -- runtime rc libdv4:i386 1.0.0-3ubuntu1 software library for DV format digital video (runtime lib) rc libesd0:i386 0.2.41-10build3 Enlightened Sound Daemon - Shared libraries rc libexif12:i386 0.6.20-2ubuntu0.1 library to parse EXIF files rc libexpat1:i386 2.0.1-7.2ubuntu1.1 XML parsing C library - runtime library rc libflac8:i386 1.2.1-6 Free Lossless Audio Codec - runtime C library rc libfontconfig1:i386 2.8.0-3ubuntu9.1 generic font configuration library - runtime rc libfreetype6:i386 2.4.8-1ubuntu2.1 FreeType 2 font engine, shared library files rc libgail18:i386 2.24.10-0ubuntu6 GNOME Accessibility Implementation Library -- shared libraries rc libgconf-2-4:i386 3.2.5-0ubuntu2 GNOME configuration database system (shared libraries) rc libgcrypt11:i386 1.5.0-3ubuntu0.2 LGPL Crypto library - runtime library rc libgd2-xpm:i386 2.0.36~rc1~dfsg-6ubuntu2 GD Graphics Library version 2 rc libgdbm3:i386 1.8.3-10 GNU dbm database routines (runtime version) rc libgdk-pixbuf2.0-0:i386 2.26.1-1 GDK Pixbuf library rc libgif4:i386 4.1.6-9ubuntu1 library for GIF images (library) rc libgl1-mesa-dri-lts-quantal:i386 9.0.3-0ubuntu0.4~precise1 free implementation of the OpenGL API -- DRI modules rc libgl1-mesa-dri-lts-raring:i386 9.1.4-0ubuntu0.1~precise2 free implementation of the OpenGL API -- DRI modules rc libgl1-mesa-glx:i386 8.0.4-0ubuntu0.6 free implementation of the OpenGL API -- GLX runtime rc libgl1-mesa-glx-lts-quantal:i386 9.0.3-0ubuntu0.4~precise1 free implementation of the OpenGL API -- GLX runtime rc libgl1-mesa-glx-lts-raring:i386 9.1.4-0ubuntu0.1~precise2 free implementation of the OpenGL API -- GLX runtime rc libglapi-mesa:i386 8.0.4-0ubuntu0.6 free implementation of the GL API -- shared library rc libglapi-mesa-lts-quantal:i386 9.0.3-0ubuntu0.4~precise1 free implementation of the GL API -- shared library rc libglapi-mesa-lts-raring:i386 9.1.4-0ubuntu0.1~precise2 free implementation of the GL API -- shared library rc libglu1-mesa:i386 8.0.4-0ubuntu0.6 Mesa OpenGL utility library (GLU) rc libgnome-keyring0:i386 3.2.2-2 GNOME keyring services library rc libgnutls26:i386 2.12.14-5ubuntu3.5 GNU TLS library - runtime library rc libgomp1:i386 4.6.3-1ubuntu5 GCC OpenMP (GOMP) support library rc libgpg-error0:i386 1.10-2ubuntu1 library for common error values and messages in GnuPG components rc libgphoto2-2:i386 2.4.13-1ubuntu1.2 gphoto2 digital camera library rc libgphoto2-port0:i386 2.4.13-1ubuntu1.2 gphoto2 digital camera port library rc libgssapi-krb5-2:i386 1.10+dfsg~beta1-2ubuntu0.3 MIT Kerberos runtime libraries - krb5 GSS-API Mechanism rc libgssapi3-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - GSSAPI support library rc libgstreamer-plugins-base0.10-0:i386 0.10.36-1ubuntu0.1 GStreamer libraries from the "base" set rc libgstreamer0.10-0:i386 0.10.36-1ubuntu1 Core GStreamer libraries and elements rc libgtk2.0-0:i386 2.24.10-0ubuntu6 GTK+ graphical user interface library rc libgudev-1.0-0:i386 1:175-0ubuntu9.4 GObject-based wrapper library for libudev rc libhcrypto4-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - crypto library rc libheimbase1-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - Base library rc libheimntlm0-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - NTLM support library rc libhx509-5-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - X509 support library rc libibus-1.0-0:i386 1.4.1-3ubuntu1 Intelligent Input Bus - shared library rc libice6:i386 2:1.0.7-2build1 X11 Inter-Client Exchange library rc libidn11:i386 1.23-2 GNU Libidn library, implementation of IETF IDN specifications rc libiec61883-0:i386 1.2.0-0.1ubuntu1 an partial implementation of IEC 61883 rc libieee1284-3:i386 0.2.11-10build1 cross-platform library for parallel port access rc libjack-jackd2-0:i386 1.9.8~dfsg.1-1ubuntu2 JACK Audio Connection Kit (libraries) rc libjasper1:i386 1.900.1-13 JasPer JPEG-2000 runtime library rc libjpeg-turbo8:i386 1.1.90+svn733-0ubuntu4.2 IJG JPEG compliant runtime library. rc libjson0:i386 0.9-1ubuntu1 JSON manipulation library - shared library rc libk5crypto3:i386 1.10+dfsg~beta1-2ubuntu0.3 MIT Kerberos runtime libraries - Crypto Library rc libkeyutils1:i386 1.5.2-2 Linux Key Management Utilities (library) rc libkrb5-26-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - libraries rc libkrb5-3:i386 1.10+dfsg~beta1-2ubuntu0.3 MIT Kerberos runtime libraries rc libkrb5support0:i386 1.10+dfsg~beta1-2ubuntu0.3 MIT Kerberos runtime libraries - Support library rc liblcms1:i386 1.19.dfsg-1ubuntu3 Little CMS color management library rc libldap-2.4-2:i386 2.4.28-1.1ubuntu4.4 OpenLDAP libraries rc libllvm3.0:i386 3.0-4ubuntu1 Low-Level Virtual Machine (LLVM), runtime library rc libllvm3.1:i386 3.1-2ubuntu1~12.04.1 Low-Level Virtual Machine (LLVM), runtime library rc libllvm3.2:i386 3.2-2ubuntu5~precise1 Low-Level Virtual Machine (LLVM), runtime library rc libltdl7:i386 2.4.2-1ubuntu1 A system independent dlopen wrapper for GNU libtool rc libmad0:i386 0.15.1b-7ubuntu1 MPEG audio decoder library rc libmikmod2:i386 3.1.12-2 Portable sound library rc libmng1:i386 1.0.10-3 Multiple-image Network Graphics library rc libmpg123-0:i386 1.12.1-3.2ubuntu1 MPEG layer 1/2/3 audio decoder -- runtime library rc libmysqlclient18:i386 5.5.32-0ubuntu0.12.04.1 MySQL database client library rc libnspr4:i386 4.9.5-0ubuntu0.12.04.1 NetScape Portable Runtime Library rc libnss3:i386 3.14.3-0ubuntu0.12.04.1 Network Security Service libraries rc libodbc1:i386 2.2.14p2-5ubuntu3 ODBC library for Unix rc libogg0:i386 1.2.2~dfsg-1ubuntu1 Ogg bitstream library rc libopenal1:i386 1:1.13-4ubuntu3 Software implementation of the OpenAL API (shared library) rc liborc-0.4-0:i386 1:0.4.16-1ubuntu2 Library of Optimized Inner Loops Runtime Compiler rc libosmesa6:i386 8.0.4-0ubuntu0.6 Mesa Off-screen rendering extension rc libp11-kit0:i386 0.12-2ubuntu1 Library for loading and coordinating access to PKCS#11 modules - runtime rc libpango1.0-0:i386 1.30.0-0ubuntu3.1 Layout and rendering of internationalized text rc libpixman-1-0:i386 0.24.4-1 pixel-manipulation library for X and cairo rc libproxy1:i386 0.4.7-0ubuntu4.1 automatic proxy configuration management library (shared) rc libpulse-mainloop-glib0:i386 1:1.1-0ubuntu15.4 PulseAudio client libraries (glib support) rc libpulse0:i386 1:1.1-0ubuntu15.4 PulseAudio client libraries rc libqt4-dbus:i386 4:4.8.1-0ubuntu4.4 Qt 4 D-Bus module rc libqt4-declarative:i386 4:4.8.1-0ubuntu4.4 Qt 4 Declarative module rc libqt4-designer:i386 4:4.8.1-0ubuntu4.4 Qt 4 designer module rc libqt4-network:i386 4:4.8.1-0ubuntu4.4 Qt 4 network module rc libqt4-opengl:i386 4:4.8.1-0ubuntu4.4 Qt 4 OpenGL module rc libqt4-qt3support:i386 4:4.8.1-0ubuntu4.4 Qt 3 compatibility library for Qt 4 rc libqt4-script:i386 4:4.8.1-0ubuntu4.4 Qt 4 script module rc libqt4-scripttools:i386 4:4.8.1-0ubuntu4.4 Qt 4 script tools module rc libqt4-sql:i386 4:4.8.1-0ubuntu4.4 Qt 4 SQL module rc libqt4-svg:i386 4:4.8.1-0ubuntu4.4 Qt 4 SVG module rc libqt4-test:i386 4:4.8.1-0ubuntu4.4 Qt 4 test module rc libqt4-xml:i386 4:4.8.1-0ubuntu4.4 Qt 4 XML module rc libqt4-xmlpatterns:i386 4:4.8.1-0ubuntu4.4 Qt 4 XML patterns module rc libqtcore4:i386 4:4.8.1-0ubuntu4.4 Qt 4 core module rc libqtgui4:i386 4:4.8.1-0ubuntu4.4 Qt 4 GUI module rc libqtwebkit4:i386 2.2.1-1ubuntu4 Web content engine library for Qt rc libraw1394-11:i386 2.0.7-1ubuntu1 library for direct access to IEEE 1394 bus (aka FireWire) rc libroken18-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - roken support library rc librsvg2-2:i386 2.36.1-0ubuntu1 SAX-based renderer library for SVG files (runtime) rc librtmp0:i386 2.4~20110711.gitc28f1bab-1 toolkit for RTMP streams (shared library) rc libsamplerate0:i386 0.1.8-4 Audio sample rate conversion library rc libsane:i386 1.0.22-7ubuntu1 API library for scanners rc libsasl2-2:i386 2.1.25.dfsg1-3ubuntu0.1 Cyrus SASL - authentication abstraction library rc libsdl-image1.2:i386 1.2.10-3 image loading library for Simple DirectMedia Layer 1.2 rc libsdl-mixer1.2:i386 1.2.11-7 Mixer library for Simple DirectMedia Layer 1.2, libraries rc libsdl-net1.2:i386 1.2.7-5 Network library for Simple DirectMedia Layer 1.2, libraries rc libsdl-ttf2.0-0:i386 2.0.9-1.1ubuntu1 ttf library for Simple DirectMedia Layer with FreeType 2 support rc libsdl1.2debian:i386 1.2.14-6.4ubuntu3 Simple DirectMedia Layer rc libshout3:i386 2.2.2-7ubuntu1 MP3/Ogg Vorbis broadcast streaming library rc libsm6:i386 2:1.2.0-2build1 X11 Session Management library rc libsndfile1:i386 1.0.25-4 Library for reading/writing audio files rc libsoup-gnome2.4-1:i386 2.38.1-1 HTTP library implementation in C -- GNOME support library rc libsoup2.4-1:i386 2.38.1-1 HTTP library implementation in C -- Shared library rc libspeex1:i386 1.2~rc1-3ubuntu2 The Speex codec runtime library rc libspeexdsp1:i386 1.2~rc1-3ubuntu2 The Speex extended runtime library rc libsqlite3-0:i386 3.7.9-2ubuntu1.1 SQLite 3 shared library rc libssl0.9.8:i386 0.9.8o-7ubuntu3.1 SSL shared libraries rc libstdc++5:i386 1:3.3.6-25ubuntu1 The GNU Standard C++ Library v3 rc libstdc++6:i386 4.6.3-1ubuntu5 GNU Standard C++ Library v3 rc libtag1-vanilla:i386 1.7-1ubuntu5 audio meta-data library - vanilla flavour rc libtasn1-3:i386 2.10-1ubuntu1.1 Manage ASN.1 structures (runtime) rc libtdb1:i386 1.2.9-4 Trivial Database - shared library rc libthai0:i386 0.1.16-3 Thai language support library rc libtheora0:i386 1.1.1+dfsg.1-3ubuntu2 The Theora Video Compression Codec rc libtiff4:i386 3.9.5-2ubuntu1.5 Tag Image File Format (TIFF) library rc libtxc-dxtn-s2tc0:i386 0~git20110809-2.1 Texture compression library for Mesa rc libunistring0:i386 0.9.3-5 Unicode string library for C rc libusb-0.1-4:i386 2:0.1.12-20 userspace USB programming library rc libv4l-0:i386 0.8.6-1ubuntu2 Collection of video4linux support libraries rc libv4lconvert0:i386 0.8.6-1ubuntu2 Video4linux frame format conversion library rc libvisual-0.4-0:i386 0.4.0-4 Audio visualization framework rc libvorbis0a:i386 1.3.2-1ubuntu3 The Vorbis General Audio Compression Codec (Decoder library) rc libvorbisenc2:i386 1.3.2-1ubuntu3 The Vorbis General Audio Compression Codec (Encoder library) rc libvorbisfile3:i386 1.3.2-1ubuntu3 The Vorbis General Audio Compression Codec (High Level API) rc libwavpack1:i386 4.60.1-2 audio codec (lossy and lossless) - library rc libwind0-heimdal:i386 1.6~git20120311.dfsg.1-2ubuntu0.1 Heimdal Kerberos - stringprep implementation rc libwrap0:i386 7.6.q-21 Wietse Venema's TCP wrappers library rc libx11-6:i386 2:1.4.99.1-0ubuntu2.2 X11 client-side library rc libx11-xcb1:i386 2:1.4.99.1-0ubuntu2.2 Xlib/XCB interface library rc libxau6:i386 1:1.0.6-4 X11 authorisation library rc libxaw7:i386 2:1.0.9-3ubuntu1 X11 Athena Widget library rc libxcb-dri2-0:i386 1.8.1-1ubuntu0.2 X C Binding, dri2 extension rc libxcb-glx0:i386 1.8.1-1ubuntu0.2 X C Binding, glx extension rc libxcb-render0:i386 1.8.1-1ubuntu0.2 X C Binding, render extension rc libxcb-shm0:i386 1.8.1-1ubuntu0.2 X C Binding, shm extension rc libxcb1:i386 1.8.1-1ubuntu0.2 X C Binding rc libxcomposite1:i386 1:0.4.3-2build1 X11 Composite extension library rc libxcursor1:i386 1:1.1.12-1ubuntu0.1 X cursor management library rc libxdamage1:i386 1:1.1.3-2build1 X11 damaged region extension library rc libxdmcp6:i386 1:1.1.0-4 X11 Display Manager Control Protocol library rc libxext6:i386 2:1.3.0-3ubuntu0.1 X11 miscellaneous extension library rc libxfixes3:i386 1:5.0-4ubuntu4.1 X11 miscellaneous 'fixes' extension library rc libxft2:i386 2.2.0-3ubuntu2 FreeType-based font drawing library for X rc libxi6:i386 2:1.6.0-0ubuntu2.1 X11 Input extension library rc libxinerama1:i386 2:1.1.1-3ubuntu0.1 X11 Xinerama extension library rc libxml2:i386 2.7.8.dfsg-5.1ubuntu4.6 GNOME XML library rc libxmu6:i386 2:1.1.0-3 X11 miscellaneous utility library rc libxp6:i386 1:1.0.1-2ubuntu0.12.04.1 X Printing Extension (Xprint) client library rc libxpm4:i386 1:3.5.9-4 X11 pixmap library rc libxrandr2:i386 2:1.3.2-2ubuntu0.2 X11 RandR extension library rc libxrender1:i386 1:0.9.6-2ubuntu0.1 X Rendering Extension client library rc libxslt1.1:i386 1.1.26-8ubuntu1.3 XSLT 1.0 processing library - runtime library rc libxss1:i386 1:1.2.1-2 X11 Screen Saver extension library rc libxt6:i386 1:1.1.1-2ubuntu0.1 X11 toolkit intrinsics library rc libxtst6:i386 2:1.2.0-4ubuntu0.1 X11 Testing -- Record extension library rc libxv1:i386 2:1.0.6-2ubuntu0.1 X11 Video extension library rc libxxf86vm1:i386 1:1.1.1-2ubuntu0.1 X11 XFree86 video mode extension library rc odbcinst1debian2:i386 2.2.14p2-5ubuntu3 Support library for accessing odbc ini files rc skype-bin:i386 4.2.0.11-0ubuntu0.12.04.2 client for Skype VOIP and instant messaging service - binary files rc sni-qt:i386 0.2.5-0ubuntu3 indicator support for Qt rc wine-compholio:i386 1.7.4~ubuntu12.04.1 The Compholio Edition is a special build of the popular Wine software rc xaw3dg:i386 1.5+E-18.1ubuntu1 Xaw3d widget set

    Read the article

  • A Taxonomy of Numerical Methods v1

    - by JoshReuben
    Numerical Analysis – When, What, (but not how) Once you understand the Math & know C++, Numerical Methods are basically blocks of iterative & conditional math code. I found the real trick was seeing the forest for the trees – knowing which method to use for which situation. Its pretty easy to get lost in the details – so I’ve tried to organize these methods in a way that I can quickly look this up. I’ve included links to detailed explanations and to C++ code examples. I’ve tried to classify Numerical methods in the following broad categories: Solving Systems of Linear Equations Solving Non-Linear Equations Iteratively Interpolation Curve Fitting Optimization Numerical Differentiation & Integration Solving ODEs Boundary Problems Solving EigenValue problems Enjoy – I did ! Solving Systems of Linear Equations Overview Solve sets of algebraic equations with x unknowns The set is commonly in matrix form Gauss-Jordan Elimination http://en.wikipedia.org/wiki/Gauss%E2%80%93Jordan_elimination C++: http://www.codekeep.net/snippets/623f1923-e03c-4636-8c92-c9dc7aa0d3c0.aspx Produces solution of the equations & the coefficient matrix Efficient, stable 2 steps: · Forward Elimination – matrix decomposition: reduce set to triangular form (0s below the diagonal) or row echelon form. If degenerate, then there is no solution · Backward Elimination –write the original matrix as the product of ints inverse matrix & its reduced row-echelon matrix à reduce set to row canonical form & use back-substitution to find the solution to the set Elementary ops for matrix decomposition: · Row multiplication · Row switching · Add multiples of rows to other rows Use pivoting to ensure rows are ordered for achieving triangular form LU Decomposition http://en.wikipedia.org/wiki/LU_decomposition C++: http://ganeshtiwaridotcomdotnp.blogspot.co.il/2009/12/c-c-code-lu-decomposition-for-solving.html Represent the matrix as a product of lower & upper triangular matrices A modified version of GJ Elimination Advantage – can easily apply forward & backward elimination to solve triangular matrices Techniques: · Doolittle Method – sets the L matrix diagonal to unity · Crout Method - sets the U matrix diagonal to unity Note: both the L & U matrices share the same unity diagonal & can be stored compactly in the same matrix Gauss-Seidel Iteration http://en.wikipedia.org/wiki/Gauss%E2%80%93Seidel_method C++: http://www.nr.com/forum/showthread.php?t=722 Transform the linear set of equations into a single equation & then use numerical integration (as integration formulas have Sums, it is implemented iteratively). an optimization of Gauss-Jacobi: 1.5 times faster, requires 0.25 iterations to achieve the same tolerance Solving Non-Linear Equations Iteratively find roots of polynomials – there may be 0, 1 or n solutions for an n order polynomial use iterative techniques Iterative methods · used when there are no known analytical techniques · Requires set functions to be continuous & differentiable · Requires an initial seed value – choice is critical to convergence à conduct multiple runs with different starting points & then select best result · Systematic - iterate until diminishing returns, tolerance or max iteration conditions are met · bracketing techniques will always yield convergent solutions, non-bracketing methods may fail to converge Incremental method if a nonlinear function has opposite signs at 2 ends of a small interval x1 & x2, then there is likely to be a solution in their interval – solutions are detected by evaluating a function over interval steps, for a change in sign, adjusting the step size dynamically. Limitations – can miss closely spaced solutions in large intervals, cannot detect degenerate (coinciding) solutions, limited to functions that cross the x-axis, gives false positives for singularities Fixed point method http://en.wikipedia.org/wiki/Fixed-point_iteration C++: http://books.google.co.il/books?id=weYj75E_t6MC&pg=PA79&lpg=PA79&dq=fixed+point+method++c%2B%2B&source=bl&ots=LQ-5P_taoC&sig=lENUUIYBK53tZtTwNfHLy5PEWDk&hl=en&sa=X&ei=wezDUPW1J5DptQaMsIHQCw&redir_esc=y#v=onepage&q=fixed%20point%20method%20%20c%2B%2B&f=false Algebraically rearrange a solution to isolate a variable then apply incremental method Bisection method http://en.wikipedia.org/wiki/Bisection_method C++: http://numericalcomputing.wordpress.com/category/algorithms/ Bracketed - Select an initial interval, keep bisecting it ad midpoint into sub-intervals and then apply incremental method on smaller & smaller intervals – zoom in Adv: unaffected by function gradient à reliable Disadv: slow convergence False Position Method http://en.wikipedia.org/wiki/False_position_method C++: http://www.dreamincode.net/forums/topic/126100-bisection-and-false-position-methods/ Bracketed - Select an initial interval , & use the relative value of function at interval end points to select next sub-intervals (estimate how far between the end points the solution might be & subdivide based on this) Newton-Raphson method http://en.wikipedia.org/wiki/Newton's_method C++: http://www-users.cselabs.umn.edu/classes/Summer-2012/csci1113/index.php?page=./newt3 Also known as Newton's method Convenient, efficient Not bracketed – only a single initial guess is required to start iteration – requires an analytical expression for the first derivative of the function as input. Evaluates the function & its derivative at each step. Can be extended to the Newton MutiRoot method for solving multiple roots Can be easily applied to an of n-coupled set of non-linear equations – conduct a Taylor Series expansion of a function, dropping terms of order n, rewrite as a Jacobian matrix of PDs & convert to simultaneous linear equations !!! Secant Method http://en.wikipedia.org/wiki/Secant_method C++: http://forum.vcoderz.com/showthread.php?p=205230 Unlike N-R, can estimate first derivative from an initial interval (does not require root to be bracketed) instead of inputting it Since derivative is approximated, may converge slower. Is fast in practice as it does not have to evaluate the derivative at each step. Similar implementation to False Positive method Birge-Vieta Method http://mat.iitm.ac.in/home/sryedida/public_html/caimna/transcendental/polynomial%20methods/bv%20method.html C++: http://books.google.co.il/books?id=cL1boM2uyQwC&pg=SA3-PA51&lpg=SA3-PA51&dq=Birge-Vieta+Method+c%2B%2B&source=bl&ots=QZmnDTK3rC&sig=BPNcHHbpR_DKVoZXrLi4nVXD-gg&hl=en&sa=X&ei=R-_DUK2iNIjzsgbE5ID4Dg&redir_esc=y#v=onepage&q=Birge-Vieta%20Method%20c%2B%2B&f=false combines Horner's method of polynomial evaluation (transforming into lesser degree polynomials that are more computationally efficient to process) with Newton-Raphson to provide a computational speed-up Interpolation Overview Construct new data points for as close as possible fit within range of a discrete set of known points (that were obtained via sampling, experimentation) Use Taylor Series Expansion of a function f(x) around a specific value for x Linear Interpolation http://en.wikipedia.org/wiki/Linear_interpolation C++: http://www.hamaluik.com/?p=289 Straight line between 2 points à concatenate interpolants between each pair of data points Bilinear Interpolation http://en.wikipedia.org/wiki/Bilinear_interpolation C++: http://supercomputingblog.com/graphics/coding-bilinear-interpolation/2/ Extension of the linear function for interpolating functions of 2 variables – perform linear interpolation first in 1 direction, then in another. Used in image processing – e.g. texture mapping filter. Uses 4 vertices to interpolate a value within a unit cell. Lagrange Interpolation http://en.wikipedia.org/wiki/Lagrange_polynomial C++: http://www.codecogs.com/code/maths/approximation/interpolation/lagrange.php For polynomials Requires recomputation for all terms for each distinct x value – can only be applied for small number of nodes Numerically unstable Barycentric Interpolation http://epubs.siam.org/doi/pdf/10.1137/S0036144502417715 C++: http://www.gamedev.net/topic/621445-barycentric-coordinates-c-code-check/ Rearrange the terms in the equation of the Legrange interpolation by defining weight functions that are independent of the interpolated value of x Newton Divided Difference Interpolation http://en.wikipedia.org/wiki/Newton_polynomial C++: http://jee-appy.blogspot.co.il/2011/12/newton-divided-difference-interpolation.html Hermite Divided Differences: Interpolation polynomial approximation for a given set of data points in the NR form - divided differences are used to approximately calculate the various differences. For a given set of 3 data points , fit a quadratic interpolant through the data Bracketed functions allow Newton divided differences to be calculated recursively Difference table Cubic Spline Interpolation http://en.wikipedia.org/wiki/Spline_interpolation C++: https://www.marcusbannerman.co.uk/index.php/home/latestarticles/42-articles/96-cubic-spline-class.html Spline is a piecewise polynomial Provides smoothness – for interpolations with significantly varying data Use weighted coefficients to bend the function to be smooth & its 1st & 2nd derivatives are continuous through the edge points in the interval Curve Fitting A generalization of interpolating whereby given data points may contain noise à the curve does not necessarily pass through all the points Least Squares Fit http://en.wikipedia.org/wiki/Least_squares C++: http://www.ccas.ru/mmes/educat/lab04k/02/least-squares.c Residual – difference between observed value & expected value Model function is often chosen as a linear combination of the specified functions Determines: A) The model instance in which the sum of squared residuals has the least value B) param values for which model best fits data Straight Line Fit Linear correlation between independent variable and dependent variable Linear Regression http://en.wikipedia.org/wiki/Linear_regression C++: http://www.oocities.org/david_swaim/cpp/linregc.htm Special case of statistically exact extrapolation Leverage least squares Given a basis function, the sum of the residuals is determined and the corresponding gradient equation is expressed as a set of normal linear equations in matrix form that can be solved (e.g. using LU Decomposition) Can be weighted - Drop the assumption that all errors have the same significance –-> confidence of accuracy is different for each data point. Fit the function closer to points with higher weights Polynomial Fit - use a polynomial basis function Moving Average http://en.wikipedia.org/wiki/Moving_average C++: http://www.codeproject.com/Articles/17860/A-Simple-Moving-Average-Algorithm Used for smoothing (cancel fluctuations to highlight longer-term trends & cycles), time series data analysis, signal processing filters Replace each data point with average of neighbors. Can be simple (SMA), weighted (WMA), exponential (EMA). Lags behind latest data points – extra weight can be given to more recent data points. Weights can decrease arithmetically or exponentially according to distance from point. Parameters: smoothing factor, period, weight basis Optimization Overview Given function with multiple variables, find Min (or max by minimizing –f(x)) Iterative approach Efficient, but not necessarily reliable Conditions: noisy data, constraints, non-linear models Detection via sign of first derivative - Derivative of saddle points will be 0 Local minima Bisection method Similar method for finding a root for a non-linear equation Start with an interval that contains a minimum Golden Search method http://en.wikipedia.org/wiki/Golden_section_search C++: http://www.codecogs.com/code/maths/optimization/golden.php Bisect intervals according to golden ratio 0.618.. Achieves reduction by evaluating a single function instead of 2 Newton-Raphson Method Brent method http://en.wikipedia.org/wiki/Brent's_method C++: http://people.sc.fsu.edu/~jburkardt/cpp_src/brent/brent.cpp Based on quadratic or parabolic interpolation – if the function is smooth & parabolic near to the minimum, then a parabola fitted through any 3 points should approximate the minima – fails when the 3 points are collinear , in which case the denominator is 0 Simplex Method http://en.wikipedia.org/wiki/Simplex_algorithm C++: http://www.codeguru.com/cpp/article.php/c17505/Simplex-Optimization-Algorithm-and-Implemetation-in-C-Programming.htm Find the global minima of any multi-variable function Direct search – no derivatives required At each step it maintains a non-degenerative simplex – a convex hull of n+1 vertices. Obtains the minimum for a function with n variables by evaluating the function at n-1 points, iteratively replacing the point of worst result with the point of best result, shrinking the multidimensional simplex around the best point. Point replacement involves expanding & contracting the simplex near the worst value point to determine a better replacement point Oscillation can be avoided by choosing the 2nd worst result Restart if it gets stuck Parameters: contraction & expansion factors Simulated Annealing http://en.wikipedia.org/wiki/Simulated_annealing C++: http://code.google.com/p/cppsimulatedannealing/ Analogy to heating & cooling metal to strengthen its structure Stochastic method – apply random permutation search for global minima - Avoid entrapment in local minima via hill climbing Heating schedule - Annealing schedule params: temperature, iterations at each temp, temperature delta Cooling schedule – can be linear, step-wise or exponential Differential Evolution http://en.wikipedia.org/wiki/Differential_evolution C++: http://www.amichel.com/de/doc/html/ More advanced stochastic methods analogous to biological processes: Genetic algorithms, evolution strategies Parallel direct search method against multiple discrete or continuous variables Initial population of variable vectors chosen randomly – if weighted difference vector of 2 vectors yields a lower objective function value then it replaces the comparison vector Many params: #parents, #variables, step size, crossover constant etc Convergence is slow – many more function evaluations than simulated annealing Numerical Differentiation Overview 2 approaches to finite difference methods: · A) approximate function via polynomial interpolation then differentiate · B) Taylor series approximation – additionally provides error estimate Finite Difference methods http://en.wikipedia.org/wiki/Finite_difference_method C++: http://www.wpi.edu/Pubs/ETD/Available/etd-051807-164436/unrestricted/EAMPADU.pdf Find differences between high order derivative values - Approximate differential equations by finite differences at evenly spaced data points Based on forward & backward Taylor series expansion of f(x) about x plus or minus multiples of delta h. Forward / backward difference - the sums of the series contains even derivatives and the difference of the series contains odd derivatives – coupled equations that can be solved. Provide an approximation of the derivative within a O(h^2) accuracy There is also central difference & extended central difference which has a O(h^4) accuracy Richardson Extrapolation http://en.wikipedia.org/wiki/Richardson_extrapolation C++: http://mathscoding.blogspot.co.il/2012/02/introduction-richardson-extrapolation.html A sequence acceleration method applied to finite differences Fast convergence, high accuracy O(h^4) Derivatives via Interpolation Cannot apply Finite Difference method to discrete data points at uneven intervals – so need to approximate the derivative of f(x) using the derivative of the interpolant via 3 point Lagrange Interpolation Note: the higher the order of the derivative, the lower the approximation precision Numerical Integration Estimate finite & infinite integrals of functions More accurate procedure than numerical differentiation Use when it is not possible to obtain an integral of a function analytically or when the function is not given, only the data points are Newton Cotes Methods http://en.wikipedia.org/wiki/Newton%E2%80%93Cotes_formulas C++: http://www.siafoo.net/snippet/324 For equally spaced data points Computationally easy – based on local interpolation of n rectangular strip areas that is piecewise fitted to a polynomial to get the sum total area Evaluate the integrand at n+1 evenly spaced points – approximate definite integral by Sum Weights are derived from Lagrange Basis polynomials Leverage Trapezoidal Rule for default 2nd formulas, Simpson 1/3 Rule for substituting 3 point formulas, Simpson 3/8 Rule for 4 point formulas. For 4 point formulas use Bodes Rule. Higher orders obtain more accurate results Trapezoidal Rule uses simple area, Simpsons Rule replaces the integrand f(x) with a quadratic polynomial p(x) that uses the same values as f(x) for its end points, but adds a midpoint Romberg Integration http://en.wikipedia.org/wiki/Romberg's_method C++: http://code.google.com/p/romberg-integration/downloads/detail?name=romberg.cpp&can=2&q= Combines trapezoidal rule with Richardson Extrapolation Evaluates the integrand at equally spaced points The integrand must have continuous derivatives Each R(n,m) extrapolation uses a higher order integrand polynomial replacement rule (zeroth starts with trapezoidal) à a lower triangular matrix set of equation coefficients where the bottom right term has the most accurate approximation. The process continues until the difference between 2 successive diagonal terms becomes sufficiently small. Gaussian Quadrature http://en.wikipedia.org/wiki/Gaussian_quadrature C++: http://www.alglib.net/integration/gaussianquadratures.php Data points are chosen to yield best possible accuracy – requires fewer evaluations Ability to handle singularities, functions that are difficult to evaluate The integrand can include a weighting function determined by a set of orthogonal polynomials. Points & weights are selected so that the integrand yields the exact integral if f(x) is a polynomial of degree <= 2n+1 Techniques (basically different weighting functions): · Gauss-Legendre Integration w(x)=1 · Gauss-Laguerre Integration w(x)=e^-x · Gauss-Hermite Integration w(x)=e^-x^2 · Gauss-Chebyshev Integration w(x)= 1 / Sqrt(1-x^2) Solving ODEs Use when high order differential equations cannot be solved analytically Evaluated under boundary conditions RK for systems – a high order differential equation can always be transformed into a coupled first order system of equations Euler method http://en.wikipedia.org/wiki/Euler_method C++: http://rosettacode.org/wiki/Euler_method First order Runge–Kutta method. Simple recursive method – given an initial value, calculate derivative deltas. Unstable & not very accurate (O(h) error) – not used in practice A first-order method - the local error (truncation error per step) is proportional to the square of the step size, and the global error (error at a given time) is proportional to the step size In evolving solution between data points xn & xn+1, only evaluates derivatives at beginning of interval xn à asymmetric at boundaries Higher order Runge Kutta http://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods C++: http://www.dreamincode.net/code/snippet1441.htm 2nd & 4th order RK - Introduces parameterized midpoints for more symmetric solutions à accuracy at higher computational cost Adaptive RK – RK-Fehlberg – estimate the truncation at each integration step & automatically adjust the step size to keep error within prescribed limits. At each step 2 approximations are compared – if in disagreement to a specific accuracy, the step size is reduced Boundary Value Problems Where solution of differential equations are located at 2 different values of the independent variable x à more difficult, because cannot just start at point of initial value – there may not be enough starting conditions available at the end points to produce a unique solution An n-order equation will require n boundary conditions – need to determine the missing n-1 conditions which cause the given conditions at the other boundary to be satisfied Shooting Method http://en.wikipedia.org/wiki/Shooting_method C++: http://ganeshtiwaridotcomdotnp.blogspot.co.il/2009/12/c-c-code-shooting-method-for-solving.html Iteratively guess the missing values for one end & integrate, then inspect the discrepancy with the boundary values of the other end to adjust the estimate Given the starting boundary values u1 & u2 which contain the root u, solve u given the false position method (solving the differential equation as an initial value problem via 4th order RK), then use u to solve the differential equations. Finite Difference Method For linear & non-linear systems Higher order derivatives require more computational steps – some combinations for boundary conditions may not work though Improve the accuracy by increasing the number of mesh points Solving EigenValue Problems An eigenvalue can substitute a matrix when doing matrix multiplication à convert matrix multiplication into a polynomial EigenValue For a given set of equations in matrix form, determine what are the solution eigenvalue & eigenvectors Similar Matrices - have same eigenvalues. Use orthogonal similarity transforms to reduce a matrix to diagonal form from which eigenvalue(s) & eigenvectors can be computed iteratively Jacobi method http://en.wikipedia.org/wiki/Jacobi_method C++: http://people.sc.fsu.edu/~jburkardt/classes/acs2_2008/openmp/jacobi/jacobi.html Robust but Computationally intense – use for small matrices < 10x10 Power Iteration http://en.wikipedia.org/wiki/Power_iteration For any given real symmetric matrix, generate the largest single eigenvalue & its eigenvectors Simplest method – does not compute matrix decomposition à suitable for large, sparse matrices Inverse Iteration Variation of power iteration method – generates the smallest eigenvalue from the inverse matrix Rayleigh Method http://en.wikipedia.org/wiki/Rayleigh's_method_of_dimensional_analysis Variation of power iteration method Rayleigh Quotient Method Variation of inverse iteration method Matrix Tri-diagonalization Method Use householder algorithm to reduce an NxN symmetric matrix to a tridiagonal real symmetric matrix vua N-2 orthogonal transforms     Whats Next Outside of Numerical Methods there are lots of different types of algorithms that I’ve learned over the decades: Data Mining – (I covered this briefly in a previous post: http://geekswithblogs.net/JoshReuben/archive/2007/12/31/ssas-dm-algorithms.aspx ) Search & Sort Routing Problem Solving Logical Theorem Proving Planning Probabilistic Reasoning Machine Learning Solvers (eg MIP) Bioinformatics (Sequence Alignment, Protein Folding) Quant Finance (I read Wilmott’s books – interesting) Sooner or later, I’ll cover the above topics as well.

    Read the article

  • Cannot build digiKam

    - by Tichomir Mitkov
    I'm trying to compile digiKam 2.8.0. I have installed the required libraries but cMake seems to stuck without any meaningful reason. Here is the output of cMake: $ cmake -DCMAKE_BUILD_TYPE=relwithdebinfo -DCMAKE_INSTALL_PREFIX=/usr/local . -- Found Qt-Version 4.7.1 (using /usr/bin/qmake) -- Found X11: /usr/lib64/libX11.so -- Found KDE 4.6 include dir: /usr/include -- Found KDE 4.6 library dir: /usr/lib64 -- Found the KDE4 kconfig_compiler preprocessor: /usr/bin/kconfig_compiler -- Found automoc4: /usr/bin/automoc4 -- Local kdegraphics libraries will be compiled... YES -- Handbooks will be compiled..................... YES -- Extract translations files..................... NO -- Translations will be compiled.................. YES -- ---------------------------------------------------------------------------------- -- Starting CMake configuration for: libmediawiki ----------------------------------------------------------------------------- -- The following external packages were located on your system. -- This installation will have the extra features provided by these packages. ----------------------------------------------------------------------------- * QJSON - Qt library for handling JSON data ----------------------------------------------------------------------------- -- Congratulations! All external packages have been found. ----------------------------------------------------------------------------- -- ---------------------------------------------------------------------------------- -- Starting CMake configuration for: libkgeomap -- Found Qt-Version 4.7.1 (using /usr/bin/qmake) -- Found X11: /usr/lib64/libX11.so -- Check Kexiv2 library in local sub-folder... -- Found Kexiv2 library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkexiv2 -- kexiv2 found, the demo application will be compiled. -- ---------------------------------------------------------------------------------- -- Starting CMake configuration for: libkface -- Found Qt-Version 4.7.1 (using /usr/bin/qmake) -- Found X11: /usr/lib64/libX11.so -- First try at finding OpenCV... -- Great, found OpenCV on the first try. -- OpenCV Root directory is /usr/share/opencv -- External libface was not found, use internal version instead... -- ---------------------------------------------------------------------------------- -- Starting CMake configuration for: kipi-plugins -- Check Kexiv2 library in local sub-folder... -- Found Kexiv2 library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkexiv2 -- Check for Kdcraw library in local sub-folder... -- Found Kdcraw library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkdcraw CMake Error at extra/libkdcraw/cmake/modules/FindKdcraw.cmake:137 (file): file Internal CMake error when trying to open file: /home/tichomir/Downloads/digikam-2.8.0/extra/libkdcraw/libkdcraw/version.h for reading. Call Stack (most recent call first): extra/kipi-plugins/CMakeLists.txt:123 (FIND_PACKAGE) -- Check Kipi library in local sub-folder... -- Found Kipi library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkipi CMake Warning at extra/kipi-plugins/CMakeLists.txt:139 (MESSAGE): libkdcraw: Version information not found, your version is probably too old. -- Found GObject libraries: /usr/lib64/libgobject-2.0.so;/usr/lib64/libgmodule-2.0.so;/usr/lib64/libgthread-2.0.so;/usr/lib64/libglib-2.0.so -- Found GObject includes : /usr/include/glib-2.0/gobject -- Check for Ksane library in local sub-folder... -- Found Ksane library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libksane -- Check for KGeoMap library in local sub-folder... -- Found KGeoMap library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkgeomap -- Check Mediawiki library in local sub-folder... -- Found Mediawiki library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libmediawiki -- Check Vkontakte library in local sub-folder... -- Found Vkontakte library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkvkontakte -- Boost version: 1.38.0 -- libkgeomap: Found version 2.0.0 -- Found X11: /usr/lib64/libX11.so -- CMake version: cmake version 2.8.9 -- CMake version (cleaned): cmake version 2.8.9 -- -- ---------------------------------------------------------------------------------- -- kipi-plugins 2.8.0 dependencies results <http://www.digikam.org> -- -- libjpeg library found.................... YES -- libtiff library found.................... YES -- libpng library found..................... YES -- libkipi library found.................... YES -- libkexiv2 library found.................. YES -- libkdcraw library found.................. YES -- libxml2 library found.................... YES (optional) -- libxslt library found.................... YES (optional) -- libexpat library found................... YES (optional) -- native threads support library found..... YES (optional) -- libopengl library found.................. YES (optional) -- Qt4 OpenGL module found.................. YES -- libopencv library found.................. YES (optional) -- QJson library found...................... YES (optional) -- libgpod library found.................... YES (optional) -- Gdk library found........................ YES (optional) -- libkdepim library found.................. YES (optional) -- qca2 library found....................... YES (optional) -- libkgeomap library found................. YES (optional) -- libmediawiki library found............... YES (optional) -- libkvkontakte library found.............. YES (optional) -- boost library found...................... YES (optional) -- OpenMP library found..................... YES (optional) -- libX11 library found..................... YES (optional) -- libksane library found................... YES (optional) -- -- kipi-plugins will be compiled............ YES -- Shwup will be compiled................... YES (optional) -- YandexFotki will be compiled............. YES (optional) -- HtmlExport will be compiled.............. YES (optional) -- AdvancedSlideshow will be compiled....... YES (optional) -- ImageViewer will be compiled............. YES (optional) -- AcquireImages will be compiled........... YES (optional) -- DNGConverter will be compiled............ YES (optional) -- RemoveRedEyes will be compiled........... YES (optional) -- Debian Screenshots will be compiled...... YES (optional) -- Facebook will be compiled................ YES (optional) -- Imgur will be compiled................... YES (optional) -- VKontakte will be compiled............... YES (optional) -- IpodExport will be compiled.............. YES (optional) -- Calendar will be compiled................ YES (optional) -- GPSSync will be compiled................. YES (optional) -- Mediawiki will be compiled............... YES (optional) -- Panorama will be compiled................ YES (optional) -- ---------------------------------------------------------------------------------- -- -- ---------------------------------------------------------------------------------- -- Starting CMake configuration for: digiKam -- Check for Kdcraw library in local sub-folder... -- Found Kdcraw library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkdcraw CMake Error at extra/libkdcraw/cmake/modules/FindKdcraw.cmake:137 (file): file Internal CMake error when trying to open file: /home/tichomir/Downloads/digikam-2.8.0/extra/libkdcraw/libkdcraw/version.h for reading. Call Stack (most recent call first): core/CMakeLists.txt:156 (FIND_PACKAGE) -- Check Kexiv2 library in local sub-folder... -- Found Kexiv2 library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkexiv2 -- Check Kipi library in local sub-folder... -- Found Kipi library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkipi -- Check Kface library in local sub-folder... -- Found Kface library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkface -- Check for KGeoMap library in local sub-folder... -- Found KGeoMap library in local sub-folder: /home/tichomir/Downloads/digikam-2.8.0/extra/libkgeomap -- PGF_INCLUDE_DIRS = /usr/local/include/libpgf -- PGF_INCLUDEDIR = /usr/local/include/libpgf -- PGF_LIBRARIES = pgf -- PGF_LDFLAGS = -L/usr/local/lib;-lpgf -- PGF_CFLAGS = -I/usr/local/include/libpgf -- PGF_VERSION = 6.12.24 -- PGF_CODEC_VERSION_ID = 61224 -- Could NOT find any working clapack installation -- Boost version: 1.38.0 -- Check for LCMS1 availability... -- Found LCMS1: /usr/lib64/liblcms.so /usr/include -- Paralelized PGF codec disabled... -- Identified libjpeg version: 62 -- Found MySQL server executable at: /usr/sbin/mysqld -- Found MySQL install_db executable at: /usr/bin/mysql_install_db CMake Warning at core/CMakeLists.txt:310 (MESSAGE): libkdcraw: Version information not found, your version is probably too old. -- libkgeomap: Found version 2.0.0 -- Found gphoto2: -L/usr/lib64 -lgphoto2_port;-L/usr/lib64 -lgphoto2 -lgphoto2_port -lm -- WARNING: you are using the obsolete 'PKGCONFIG' macro, use FindPkgConfig -- WARNING: you are using the obsolete 'PKGCONFIG' macro, use FindPkgConfig -- PKGCONFIG() indicates that lqr-1 is not installed (install the package which contains lqr-1.pc if you want to support this feature) -- Could NOT find Lqr-1 (missing: LQR-1_INCLUDE_DIRS LQR-1_LIBRARIES) -- Found SharedDesktopOntologies: /usr/share/ontology -- Found SharedDesktopOntologies: /usr/share/ontology (found version "0.5.0", required is "0.2") -- -- ---------------------------------------------------------------------------------- -- digiKam 2.8.0 dependencies results <http://www.digikam.org> -- -- Qt4 SQL module found..................... YES -- MySQL Server found....................... YES -- MySQL install_db tool found.............. YES -- libtiff library found.................... YES -- libpng library found..................... YES -- libjasper library found.................. YES -- liblcms library found.................... YES -- Boost Graph library found................ YES -- libkipi library found.................... YES -- libkexiv2 library found.................. YES -- libkdcraw library found.................. YES -- libkface library found................... YES -- libkgeomap library found................. YES -- libpgf library found..................... YES (optional) -- libclapack library found................. NO (optional - internal version used instead) -- libgphoto2 and libusb libraries found.... YES (optional) -- libkdepimlibs library found.............. YES (optional) -- Nepomuk libraries found.................. YES (optional) -- libglib2 library found................... YES (optional) -- liblqr-1 library found................... NO (optional - internal version used instead) -- liblensfun library found................. YES (optional) -- Doxygen found............................ YES (optional) -- digiKam can be compiled.................. YES -- ---------------------------------------------------------------------------------- -- -- Adjusting compilation flags for GCC version ( 4.5.1 ) -- Configuring incomplete, errors occurred! Actually this line shows a sign of error CMake Error at extra/libkdcraw/cmake/modules/FindKdcraw.cmake:137 (file): file Internal CMake error when trying to open file: /home/tichomir/Downloads/digikam-2.8.0/extra/libkdcraw/libkdcraw/version.h for reading. 'version.h' doesn't exists instead there is a file 'version.h.cmake' I have installed libkdcraw (64-bit) from sources. I'm using OpenSuse

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

< Previous Page | 1 2 3