Search Results

Search found 231 results on 10 pages for 'fortran'.

Page 5/10 | < Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >

  • DllImport Based on OS Platform

    - by Ngu Soon Hui
    I have a mixture of unmanaged code ( backend) and managed code ( front end), as such, I would need to call the unmanaged code from my managed code, using interop techniques and DllImport attribute. Now, I've compiled two versions of unmanaged code, for both 32 and 64 bit OS; they are named service32.dll and service64.dll respectively. So, in my .Net code, I would have to do a DllImport for both dlls: [DllImport(@"service32.dll")] //for 32 bit OS invocation public static void SimpleFunction(); [DllImport(@"service64.dll")] //for 64 bit OS invocation public static void SimpleFunction(); And call them depending on which platform my application is running on. The issue now is that for every unmanaged function, I have to declared it twice, one for 32 bit OS and one for 64 bit OS. This is a duplication of work, and everytime I change the signature of an unmanaged function, I have to modified it in two places. Is there anyway that I can change the argument in DllImport so that the correct dll will be invoked automagically, depending on the platform?

    Read the article

  • Importing data from file to array

    - by stamp
    I have 2 dimensional table in file, which look like this: 11, 12, 13, 14, 15 21, 22, 23, 24, 25 I want it to be imported in 2 dimensional array. I wrote this code: INTEGER :: SMALL(10) DO I = 1, 3 READ(UNIT=10, FMT='(5I4)') SMALL WRITE(UNIT=*, FMT='(6X,5I4)') SMALL ENDDO But it imports everything in one dimensional array.

    Read the article

  • Program crash for array copy with ifort

    - by Stefano Borini
    This program crashes with Illegal instruction: 4 on MacOSX Lion and ifort (IFORT) 12.1.0 20111011 program foo real, pointer :: a(:,:), b(:,:) allocate(a(5400, 5400)) allocate(b(5400, 3600)) a=1.0 b(:, 1:3600) = a(:, 1:3600) print *, a print *, b deallocate(a) deallocate(b) end program The same program works with gfortran. I don't see any problem. Any ideas ? Unrolling the copy and performing the explicit loop over the columns works in both compilers. Note that with allocatable instead of pointer I have no problems. The behavior is the same if the statement is either inside a module or not. I confirm the same behavior on Linux using ifort (IFORT) 12.1.3 20120130.

    Read the article

  • Can I legally publish my Fortran 90 wrappers to nVidias CUFFT library (from CUDA SDK)?

    - by Jakub Narebski
    From a legal standpoint (licensing issues), can I legally in agreement with license publish Fortran 90 wrappers (bindings) to CUFFT library from nVidia CUDA Toolkit, under some open source license (either CC0 i.e. public domain, or some kind of permissive license like BSD). nVidia provides only C bindings with their CUDA SDK. Header files contain the following text: /* * Copyright 1993-2011 NVIDIA Corporation. All rights reserved. * * NOTICE TO LICENSEE: * * This source code and/or documentation ("Licensed Deliverables") are * subject to NVIDIA intellectual property rights under U.S. and * international Copyright laws. * * These Licensed Deliverables contained herein is PROPRIETARY and * CONFIDENTIAL to NVIDIA and is being provided under the terms and * conditions of a form of NVIDIA software license agreement by and * between NVIDIA and Licensee ("License Agreement") or electronically * accepted by Licensee. Notwithstanding any terms or conditions to * the contrary in the License Agreement, reproduction or disclosure * of the Licensed Deliverables to any third party without the express * written consent of NVIDIA is prohibited. The License.txt file includes the following fragment Source Code: Developer shall have the right to modify and create derivative works with the Source Code. Developer shall own any derivative works ("Derivatives") it creates to the Source Code, provided that Developer uses the Materials in accordance with the terms and conditions of this Agreement. Developer may distribute the Derivatives, provided that all NVIDIA copyright notices and trademarks are propagated and used properly and the Derivatives include the following statement: "This software contains source code provided by NVIDIA Corporation."

    Read the article

  • Make errors when compiling HPL-2.1 on MOSIX-clustered Debian server

    - by tlake
    I'm trying to compile HPL 2.1 on a MOSIX-clustered Debian server, but the make process terminates with errors as seen below. Included are my makefile and two versions of output: one from a standard execution, and one from an execution run with the debug flag. Any help and guidance would be very much appreciated! The makefile: # ---------------------------------------------------------------------- # - shell -------------------------------------------------------------- # ---------------------------------------------------------------------- # SHELL = /bin/bash # CD = cd CP = cp LN_S = ln -s MKDIR = mkdir RM = /bin/rm -f TOUCH = touch # # ---------------------------------------------------------------------- # - Platform identifier ------------------------------------------------ # ---------------------------------------------------------------------- # ARCH = Linux_PII_CBLAS # # ---------------------------------------------------------------------- # - HPL Directory Structure / HPL library ------------------------------ # ---------------------------------------------------------------------- # TOPdir = $(HOME)/hpl-2.1 INCdir = $(TOPdir)/include BINdir = $(TOPdir)/bin/$(ARCH) LIBdir = $(TOPdir)/lib/$(ARCH) # HPLlib = $(LIBdir)/libhpl.a # # ---------------------------------------------------------------------- # - Message Passing library (MPI) -------------------------------------- # ---------------------------------------------------------------------- # MPinc tells the C compiler where to find the Message Passing library # header files, MPlib is defined to be the name of the library to be # used. The variable MPdir is only used for defining MPinc and MPlib. # MPdir = /usr/local MPinc = -I$(MPdir)/include MPlib = $(MPdir)/lib/libmpi.so # # ---------------------------------------------------------------------- # - Linear Algebra library (BLAS or VSIPL) ----------------------------- # ---------------------------------------------------------------------- # LAinc tells the C compiler where to find the Linear Algebra library # header files, LAlib is defined to be the name of the library to be # used. The variable LAdir is only used for defining LAinc and LAlib. # LAdir = $(HOME)/CBLAS/lib LAinc = LAlib = $(LAdir)/cblas_LINUX.a # # ---------------------------------------------------------------------- # - F77 / C interface -------------------------------------------------- # ---------------------------------------------------------------------- # You can skip this section if and only if you are not planning to use # a BLAS library featuring a Fortran 77 interface. Otherwise, it is # necessary to fill out the F2CDEFS variable with the appropriate # options. **One and only one** option should be chosen in **each** of # the 3 following categories: # # 1) name space (How C calls a Fortran 77 routine) # # -DAdd_ : all lower case and a suffixed underscore (Suns, # Intel, ...), [default] # -DNoChange : all lower case (IBM RS6000), # -DUpCase : all upper case (Cray), # -DAdd__ : the FORTRAN compiler in use is f2c. # # 2) C and Fortran 77 integer mapping # # -DF77_INTEGER=int : Fortran 77 INTEGER is a C int, [default] # -DF77_INTEGER=long : Fortran 77 INTEGER is a C long, # -DF77_INTEGER=short : Fortran 77 INTEGER is a C short. # # 3) Fortran 77 string handling # # -DStringSunStyle : The string address is passed at the string loca- # tion on the stack, and the string length is then # passed as an F77_INTEGER after all explicit # stack arguments, [default] # -DStringStructPtr : The address of a structure is passed by a # Fortran 77 string, and the structure is of the # form: struct {char *cp; F77_INTEGER len;}, # -DStringStructVal : A structure is passed by value for each Fortran # 77 string, and the structure is of the form: # struct {char *cp; F77_INTEGER len;}, # -DStringCrayStyle : Special option for Cray machines, which uses # Cray fcd (fortran character descriptor) for # interoperation. # F2CDEFS = # # ---------------------------------------------------------------------- # - HPL includes / libraries / specifics ------------------------------- # ---------------------------------------------------------------------- # HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib) # # - Compile time options ----------------------------------------------- # # -DHPL_COPY_L force the copy of the panel L before bcast; # -DHPL_CALL_CBLAS call the cblas interface; # -DHPL_CALL_VSIPL call the vsip library; # -DHPL_DETAILED_TIMING enable detailed timers; # # By default HPL will: # *) not copy L before broadcast, # *) call the BLAS Fortran 77 interface, # *) not display detailed timing information. # HPL_OPTS = -DHPL_CALL_CBLAS # # ---------------------------------------------------------------------- # HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES) # # ---------------------------------------------------------------------- # - Compilers / linkers - Optimization flags --------------------------- # ---------------------------------------------------------------------- # CC = /usr/bin/gcc CCNOOPT = $(HPL_DEFS) CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops # # On some platforms, it is necessary to use the Fortran linker to find # the Fortran internals used in the BLAS library. # LINKER = ~/BLAS LINKFLAGS = $(CCFLAGS) # ARCHIVER = ar ARFLAGS = r RANLIB = echo # # ---------------------------------------------------------------------- Make output: ~/BLAS -DHPL_CALL_CBLAS -I/homes/laket/hpl-2.1/include -I/homes/laket/hpl-2.1/include/Linux_PII_CBLAS -I/usr/local/include -fomit-frame-pointer -O3 -funroll-loops -o /homes/laket/hpl-2.1/bin/Linux_PII_CBLAS/xhpl HPL_pddriver.o HPL_pdinfo.o HPL_pdtest.o /homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/libhpl.a /homes/laket/CBLAS/lib/cblas_LINUX.a /usr/local/lib/libmpi.so /bin/bash: /homes/laket/BLAS: Is a directory make[2]: *** [dexe.grd] Error 126 make[2]: Target `all' not remade because of errors. make[2]: Leaving directory `/homes/laket/hpl-2.1/testing/ptest/Linux_PII_CBLAS' make[1]: *** [build_tst] Error 2 make[1]: Leaving directory `/homes/laket/hpl-2.1' make: *** [build] Error 2 make: Target `all' not remade because of errors. Make -d output: Considering target file `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/libhpl.a'. Looking for an implicit rule for `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/libhpl.a'. Trying pattern rule with stem `libhpl.a'. Trying implicit prerequisite `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/libhpl.a,v'. Trying pattern rule with stem `libhpl.a'. Trying implicit prerequisite `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/RCS/libhpl.a,v'. Trying pattern rule with stem `libhpl.a'. Trying implicit prerequisite `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/RCS/libhpl.a'. Trying pattern rule with stem `libhpl.a'. Trying implicit prerequisite `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/s.libhpl.a'. Trying pattern rule with stem `libhpl.a'. Trying implicit prerequisite `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/SCCS/s.libhpl.a'. No implicit rule found for `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/libhpl.a'. Finished prerequisites of target file `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/libhpl.a'. No need to remake target `/homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/libhpl.a'. Finished prerequisites of target file `dexe.grd'. Must remake target `dexe.grd'. ~/BLAS -DHPL_CALL_CBLAS -I/homes/laket/hpl-2.1/include -I/homes/laket/hpl-2.1/include/Linux_PII_CBLAS -I/usr/local/include -fomit-frame-pointer -O3 -funroll-loops -o /homes/laket/hpl-2.1/bin/Linux_PII_CBLAS/xhpl HPL_pddriver.o HPL_pdinfo.o HPL_pdtest.o /homes/laket/hpl-2.1/lib/Linux_PII_CBLAS/libhpl.a /homes/laket/CBLAS/lib/cblas_LINUX.a /usr/local/lib/libmpi.so Putting child 0x0129a2c0 (dexe.grd) PID 24853 on the chain. Live child 0x0129a2c0 (dexe.grd) PID 24853 /bin/bash: /homes/laket/BLAS: Is a directory make[2]: Reaping losing child 0x0129a2c0 PID 24853 *** [dexe.grd] Error 126 Removing child 0x0129a2c0 PID 24853 from chain. Failed to remake target file `dexe.grd'. Finished prerequisites of target file `dexe'. Giving up on target file `dexe'. Finished prerequisites of target file `all'. Giving up on target file `all'. make[2]: Target `all' not remade because of errors. make[2]: Leaving directory `/homes/laket/hpl-2.1/testing/ptest/Linux_PII_CBLAS' Reaping losing child 0x010ce900 PID 24841 make[1]: *** [build_tst] Error 2 Removing child 0x010ce900 PID 24841 from chain. Failed to remake target file `build_tst'. make[1]: Leaving directory `/homes/laket/hpl-2.1' Reaping losing child 0x00d91ae0 PID 24774 make: *** [build] Error 2 Removing child 0x00d91ae0 PID 24774 from chain. Failed to remake target file `build'. Finished prerequisites of target file `install'. make: Target `all' not remade because of errors. Giving up on target file `install'. Finished prerequisites of target file `all'. Giving up on target file `all'. Thanks!

    Read the article

  • IBM fête ses 100 ans et revient sur 100 moments forts de son histoire : invention du PC, de Fortran, du disque dur, etc.

    IBM fête ses 100 ans Et sélectionne 100 moments forts de son histoire illustrés par des « Icônes du Progrès » « Arrive un moment où chaque entreprise doit se demander : en quoi nous sommes nous différenciés ? Quel impact avons-nous eu sur le monde ? Qu'avons-nous changé ? » Ces questions, IBM se les pose au moment de fêter son centième anniversaire, « 100 ans d'innovation, de prises de risques audacieuses et de transformation ». Pour l'occasion, son PDG, Sam J. Palmisano, a inauguré le centenaire de la société en France en donnant une conférence sur le thème : « A business and its ideas : Shaping a company and a century » dans les locaux...

    Read the article

  • Referencing variables in a structure / C++

    - by user1628622
    Below, I provided a minimal example of code I created. I managed to get this code working, but I'm not sure if the practice being employed is sound. In essence, what I am trying to do is have the 'Parameter' class reference select elements in the 'States' class, so variables in States can be changed via Parameters. Questions I have: is the approach taken OK? If not, is there a better way to achieve what I am aiming for? Example code: struct VAR_TYPE{ public: bool is_fixed; // If is_fixed = true, then variable is a parameter double value; // Numerical value std::string name; // Description of variable (to identify it by name) }; struct NODE{ public: VAR_TYPE X, Y, Z; /* VAR_TYPE is a structure of primitive types */ }; class States{ private: std::vector <NODE_ptr> node; // shared ptr to struct NODE std::vector <PROP_DICTIONARY_ptr> property; // CAN NOT be part of Parameter std::vector <ELEMENT_ptr> element; // CAN NOT be part of Parameter public: /* ect */ void set_X_reference ( Parameter &T , int i ) { T.push_var( &node[i]->X ); } void set_Y_reference ( Parameter &T , int i ) { T.push_var( &node[i]->Y ); } void set_Z_reference ( Parameter &T , int i ) { T.push_var( &node[i]->Z ); } bool get_node_bool_X( int i ) { return node[i]->X.is_fixed; } // repeat for Y and Z }; class Parameter{ private: std::vector <VAR_TYPE*> var; public: /* ect */ }; int main(){ States S; Parameter P; /* Here I initialize and set S, and do other stuff */ // Now I assign components in States to Parameters for(int n=0 ; n<S.size_of_nodes() ; n++ ){ if ( S.get_node_bool_X(n)==true ){ S.set_X_reference ( P , n ); }; // repeat if statement for Y and Z }; /* Now P points selected to data in S, and I can * modify the contents of S through P */ return 0; }; Update The reason this issue cropped up is I am working with Fortran legacy code. To sum up this Fotran code - it's a numerical simulation of a flight vehicle. This code has a fairly rigid procedural framework one must work within, which comes with a pre-defined list of allowable Fortran types. The Fortran glue code can create an instance of a C++ object (in actuality, a reference from the perspective of Fortran), but is not aware what is contained in it (other means are used to extract C++ data into Fortran). The problem that I encountered is when a C++ module is dynamically linked to the Fortran glue code, C++ objects have to be initialized each instance the C++ code is called. This happens by virtue of how the Fortran template is defined. To avoid this cycle of re-initializing objects, I plan to use 'State' as a container class. The Fortran code allows a 'State' object, which has an arbitrary definition; but I plan to use it to harness all relevant information about the model. The idea is to use the Parameters class (which is exposed and updated by the Fortran code) to update variables in States.

    Read the article

  • Why do old programming languages continue to be revised?

    - by SunAvatar
    This question is not, "Why do people still use old programming languages?" I understand that quite well. In fact the two programming languages I know best are C and Scheme, both of which date back to the 70s. Recently I was reading about the changes in C99 and C11 versus C89 (which seems to still be the most-used version of C in practice and the version I learned from K&R). Looking around, it seems like every programming language in heavy use gets a new specification at least once per decade or so. Even Fortran is still getting new revisions, despite the fact that most people using it are still using FORTRAN 77. Contrast this with the approach of, say, the typesetting system TeX. In 1989, with the release of TeX 3.0, Donald Knuth declared that TeX was feature-complete and future releases would contain only bug fixes. Even beyond this, he has stated that upon his death, "all remaining bugs will become features" and absolutely no further updates will be made. Others are free to fork TeX and have done so, but the resulting systems are renamed to indicate that they are different from the official TeX. This is not because Knuth thinks TeX is perfect, but because he understands the value of a stable, predictable system that will do the same thing in fifty years that it does now. Why do most programming language designers not follow the same principle? Of course, when a language is relatively new, it makes sense that it will go through a period of rapid change before settling down. And no one can really object to minor changes that don't do much more than codify existing pseudo-standards or correct unintended readings. But when a language still seems to need improvement after ten or twenty years, why not just fork it or start over, rather than try to change what is already in use? If some people really want to do object-oriented programming in Fortran, why not create "Objective Fortran" for that purpose, and leave Fortran itself alone? I suppose one could say that, regardless of future revisions, C89 is already a standard and nothing stops people from continuing to use it. This is sort of true, but connotations do have consequences. GCC will, in pedantic mode, warn about syntax that is either deprecated or has a subtly different meaning in C99, which means C89 programmers can't just totally ignore the new standard. So there must be some benefit in C99 that is sufficient to impose this overhead on everyone who uses the language. This is a real question, not an invitation to argue. Obviously I do have an opinion on this, but at the moment I'm just trying to understand why this isn't just how things are done already. I suppose the question is: What are the (real or perceived) advantages of updating a language standard, as opposed to creating a new language based on the old?

    Read the article

  • Modifying a gedit syntax highlighting file

    - by Oscar Saleta Reig
    I am trying to change a highlighting file from Gedit. I have modified the file /usr/share/gtksourceview-3.0/language-specs/fortran.lang because I want to change the cases in which the editor takes a statement as a comment. The problem I have is that when I choose the new highlighting scheme nothing highlights, it just remains as plain text. The file fortran.lang was opened with su permissions and I just copy-pasted everything into a new Gedit file and later saved it as fortran_enhanced.lang in the same folder. The changes I've done to the original file are these: Original fortran.lang file: <language id="fortran" _name="Fortran 95" version="2.0" _section="Sources"> <metadata> <property name="mimetypes">text/x-fortran</property> <property name="globs">*.f;*.f90;*.f95;*.for</property> <property name="line-comment-start">!</property> </metadata> <styles> <style id="comment" _name="Comment" map-to="def:comment"/> <style id="floating-point" _name="Floating Point" map-to="def:floating-point"/> <style id="keyword" _name="Keyword" map-to="def:keyword"/> <style id="intrinsic" _name="Intrinsic function" map-to="def:builtin"/> <style id="boz-literal" _name="BOZ Literal" map-to="def:base-n-integer"/> <style id="decimal" _name="Decimal" map-to="def:decimal"/> <style id="type" _name="Data Type" map-to="def:type"/> </styles> <default-regex-options case-sensitive="false"/> <definitions> <!-- Note: contains an hack to avoid considering ^COMMON a comment --> <context id="line-comment" style-ref="comment" end-at-line-end="true" class="comment" class-disabled="no-spell-check"> <start>!|(^[Cc](\b|[^OoAaYy]))</start> <include> <context ref="def:escape"/> <context ref="def:in-line-comment"/> </include> </context> (...) Modified fortran_enhanced.lang file: <!-- Note: changed language id and name --> <language id="fortran_enhanced" _name="Fortran 95 2.0" version="2.0" _section="Sources"> <metadata> <property name="mimetypes">text/x-fortran</property> <!-- Note: removed *.f and *.for from file extensions --> <property name="globs">*.f90;*.f95;</property> <property name="line-comment-start">!</property> </metadata> <styles> <style id="comment" _name="Comment" map-to="def:comment"/> <style id="floating-point" _name="Floating Point" map-to="def:floating-point"/> <style id="keyword" _name="Keyword" map-to="def:keyword"/> <style id="intrinsic" _name="Intrinsic function" map-to="def:builtin"/> <style id="boz-literal" _name="BOZ Literal" map-to="def:base-n-integer"/> <style id="decimal" _name="Decimal" map-to="def:decimal"/> <style id="type" _name="Data Type" map-to="def:type"/> </styles> <default-regex-options case-sensitive="false"/> <definitions> <!-- Note: I want comments only beginning with !, not C --> <context id="line-comment" style-ref="comment" end-at-line-end="true" class="comment" class-disabled="no-spell-check"> <start>!</start> <include> <context ref="def:escape"/> <context ref="def:in-line-comment"/> </include> </context> (...) I have read this question [ Custom gedit Syntax Highlighting for Dummies? ] and I tried to make the new fortran_enhanced.lang file readable with $ cd /usr/share/gtksourceview-3.0/language-specs $ sudo chmod 0644 fortran_enhanced.lang but it doesn't seem that made some difference. I have to say that I have never done a thing like this before and I don't even understand most of the language file, so I am open to every criticism, as I have been guided purely by intuition. Thank you in advanced!

    Read the article

  • Helping install mrcwa and solve problems with f2py in Ubuntu 14.04 LTS

    - by user288160
    I am sorry if this is the wrong section but I am starting to get desperate, please someone help me... I need to install the program mrcwa-20080820 (sourceforge.net/projects/mrcwa/) because a summer project that I am involved. I need to use it together with anaconda (store.continuum.io/cshop/anaconda/), I already installed Anaconda and apparently it is working. When I type: conda --version I got the expected answer. conda 3.5.2 If I tried to import numpy or scipy with python or simple type f2py there are no errors. So far so good. But when I tried to install this program sudo python setup.py install I got these errors: running install running build sh: 1: f2py: not found cp: cannot stat ‘mrcwaf.so’: No such file or directory running build_py running install_lib running install_egg_info Removing /usr/local/lib/python2.7/dist-packages/mrcwa-20080820.egg-info Writing /usr/local/lib/python2.7/dist-packages/mrcwa-20080820.egg-info Obs: I am trying to use intel fortran 64-bits and Ubuntu 14.04 LTS. So I was checking f2py and tried to execute the program hello world f2py -c -m hello hello.f from here: cens.ioc.ee/projects/f2py2e/index.html#usage and I had some problems too: running build running config_cc unifing config_cc, config, build_clib, build_ext, build commands --compiler options running config_fc unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options running build_src build_src building extension "hello" sources f2py options: [] f2py:> /tmp/tmpf8P4Y3/src.linux-x86_64-2.7/hellomodule.c creating /tmp/tmpf8P4Y3/src.linux-x86_64-2.7 Reading fortran codes... Reading file 'hello.f' (format:fix,strict) Post-processing... Block: hello Block: foo Post-processing (stage 2)... Building modules... Building module "hello"... Constructing wrapper function "foo"... foo(a) Wrote C/API module "hello" to file "/tmp/tmpf8P4Y3/src.linux-x86_64-2.7 /hellomodule.c" adding '/tmp/tmpf8P4Y3/src.linux-x86_64-2.7/fortranobject.c' to sources. adding '/tmp/tmpf8P4Y3/src.linux-x86_64-2.7' to include_dirs. copying /home/felipe/.local/lib/python2.7/site-packages/numpy/f2py/src/fortranobject.c -> /tmp/tmpf8P4Y3/src.linux-x86_64-2.7 copying /home/felipe/.local/lib/python2.7/site-packages/numpy/f2py/src/fortranobject.h -> /tmp/tmpf8P4Y3/src.linux-x86_64-2.7 build_src: building npy-pkg config files running build_ext customize UnixCCompiler customize UnixCCompiler using build_ext customize Gnu95FCompiler Could not locate executable gfortran Could not locate executable f95 customize IntelFCompiler Found executable /opt/intel/composer_xe_2013_sp1.3.174/bin/intel64/ifort customize LaheyFCompiler Could not locate executable lf95 customize PGroupFCompiler Could not locate executable pgfortran customize AbsoftFCompiler Could not locate executable f90 Could not locate executable f77 customize NAGFCompiler customize VastFCompiler customize CompaqFCompiler Could not locate executable fort customize IntelItaniumFCompiler customize IntelEM64TFCompiler customize IntelEM64TFCompiler customize IntelEM64TFCompiler using build_ext building 'hello' extension compiling C sources C compiler: gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC creating /tmp/tmpf8P4Y3/tmp creating /tmp/tmpf8P4Y3/tmp/tmpf8P4Y3 creating /tmp/tmpf8P4Y3/tmp/tmpf8P4Y3/src.linux-x86_64-2.7 compile options: '-I/tmp/tmpf8P4Y3/src.linux-x86_64-2.7 -I/home/felipe/.local/lib/python2.7/site-packages/numpy/core/include -I/home/felipe/anaconda/include/python2.7 -c' gcc: /tmp/tmpf8P4Y3/src.linux-x86_64-2.7/hellomodule.c In file included from /home/felipe/.local/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1761:0, from /home/felipe/.local/lib/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:17, from /home/felipe/.local/lib/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4, from /tmp/tmpf8P4Y3/src.linux-x86_64-2.7/fortranobject.h:13, from /tmp/tmpf8P4Y3/src.linux-x86_64-2.7/hellomodule.c:17: /home/felipe/.local/lib/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp] #warning "Using deprecated NumPy API, disable it by " \ ^ gcc: /tmp/tmpf8P4Y3/src.linux-x86_64-2.7/fortranobject.c In file included from /home/felipe/.local/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1761:0, from /home/felipe/.local/lib/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:17, from /home/felipe/.local/lib/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4, from /tmp/tmpf8P4Y3/src.linux-x86_64-2.7/fortranobject.h:13, from /tmp/tmpf8P4Y3/src.linux-x86_64-2.7/fortranobject.c:2: /home/felipe/.local/lib/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp] #warning "Using deprecated NumPy API, disable it by " \ ^ compiling Fortran sources Fortran f77 compiler: /opt/intel/composer_xe_2013_sp1.3.174/bin/intel64/ifort -FI -fPIC -xhost -openmp -fp-model strict Fortran f90 compiler: /opt/intel/composer_xe_2013_sp1.3.174/bin/intel64/ifort -FR -fPIC -xhost -openmp -fp-model strict Fortran fix compiler: /opt/intel/composer_xe_2013_sp1.3.174/bin/intel64/ifort -FI -fPIC -xhost -openmp -fp-model strict compile options: '-I/tmp/tmpf8P4Y3/src.linux-x86_64-2.7 -I/home/felipe/.local /lib/python2.7/site-packages/numpy/core/include -I/home/felipe/anaconda/include/python2.7 -c' ifort:f77: hello.f /opt/intel/composer_xe_2013_sp1.3.174/bin/intel64/ifort -shared -shared -nofor_main /tmp/tmpf8P4Y3/tmp/tmpf8P4Y3/src.linux-x86_64-2.7/hellomodule.o /tmp/tmpf8P4Y3 /tmp/tmpf8P4Y3/src.linux-x86_64-2.7/fortranobject.o /tmp/tmpf8P4Y3/hello.o -L/home/felipe /anaconda/lib -lpython2.7 -o ./hello.so Removing build directory /tmp/tmpf8P4Y3 Please help me I am new in ubuntu and python. I really need this program, my advisor is waiting an answer. Thank you very much, Felipe Oliveira.

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • CMake and BLAS for a C program

    - by SuperBloup
    I'm trying to use CMake to build a program relying on blas, I'm detecting blas using : include (${CMAKE_ROOT}/Modules/FindBLAS.cmake) The problem is, FindBLAS require a fortran compiler and complain with -- Looking for BLAS... - NOT found (Fortran not enabled) As blas is already installed on my machine (ATLAS Blas), and gfortran is also installed, how can I enable Fortran, or is there a workaround to find the blas library for C?

    Read the article

  • Designing for an algorithm that reports progress

    - by Stefano Borini
    I have an iterative algorithm and I want to print the progress. However, I may also want it not to print any information, or to print it in a different way, or do other logic. In an object oriented language, I would perform the following solutions: Solution 1: virtual method have the algorithm class MyAlgoClass which implements the algo. The class also implements a virtual reportIteration(iterInfo) method which is empty and can be reimplemented. Subclass the MyAlgoClass and override reportIteration so that it does what it needs to do. This solution allows you to carry additional information (for example, the file unit) in the reimplemented class. I don't like this method because it clumps together two functionalities that may be unrelated, but in GUI apps it may be ok. Solution 2: observer pattern the algorithm class has a register(Observer) method, keeps a list of the registered observers and takes care of calling notify() on each of them. Observer::notify() needs a way to get the information from the Subject, so it either has two parameters, one with the Subject and the other with the data the Subject may pass, or just the Subject and the Observer is now in charge of querying it to fetch the relevant information. Solution 3: callbacks I tend to see the callback method as a lightweight observer. Instead of passing an object, you pass a callback, which may be a plain function, but also an instance method in those languages that allow it (for example, in python you can because passing an instance method will remain bound to the instance). C++ however does not allow it, because if you pass a pointer to an instance method, this will not be defined. Please correct me on this regard, my C++ is quite old. The problem with callbacks is that generally you have to pass them together with the data you want the callback to be invoked with. Callbacks don't store state, so you have to pass both the callback and the state to the Subject in order to find it at callback execution, together with any additional data the Subject may provide about the event is reporting. Question My question is relative to the fact that I need to implement the opening problem in a language that is not object oriented, namely Fortran 95, and I am fighting with my usual reasoning which is based on python assumptions and style. I think that in Fortran the concept is similar to C, with the additional trouble that in C you can store a function pointer, while in Fortran 95 you can only pass it around. Do you have any comments, suggestions, tips, and quirks on this regard (in C, C++, Fortran and python, but also in any other language, so to have a comparison of language features that can be exploited on this regard) on how to design for an algorithm that must report progress to some external entity, using state from both the algorithm and the external entity ?

    Read the article

  • Performance of Java matrix math libraries?

    - by dfrankow
    We are computing something whose runtime is bound by matrix operations. (Some details below if interested.) This experience prompted the following question: Do folk have experience with the performance of Java libraries for matrix math (e.g., multiply, inverse, etc.)? For example: JAMA: http://math.nist.gov/javanumerics/jama/ COLT: http://acs.lbl.gov/~hoschek/colt/ Apache commons math: http://commons.apache.org/math/ I searched and found nothing. Details of our speed comparison: We are using Intel FORTRAN (ifort (IFORT) 10.1 20070913). We have reimplemented it in Java (1.6) using Apache commons math 1.2 matrix ops, and it agrees to all of its digits of accuracy. (We have reasons for wanting it in Java.) (Java doubles, Fortran real*8). Fortran: 6 minutes, Java 33 minutes, same machine. jvisualm profiling shows much time spent in RealMatrixImpl.{getEntry,isValidCoordinate} (which appear to be gone in unreleased Apache commons math 2.0, but 2.0 is no faster). Fortran is using Atlas BLAS routines (dpotrf, etc.). Obviously this could depend on our code in each language, but we believe most of the time is in equivalent matrix operations. In several other computations that do not involve libraries, Java has not been much slower, and sometimes much faster.

    Read the article

  • Oracle Solaris Studio Express 6/10 and its Customer Feedback Program are now available

    - by pieter.humphrey
    Oracle Solaris Studio Express 6/10 and the Customer Feedback Program for it are now available. Oracle Solaris Studio Express 6/10 is available on Solaris 10 (SPARC, x86), OEL 5 (x86), RHEL 5 (x86), SuSE 11 (x86) today and will be available for OpenSolaris in the near future. New feature highlights since the last release include: C/C++/Fortran compiler optimizations for the latest UltraSPARC and SPARC64-based architectures such as UltraSPARC T2 and SPARC64 VII C/C++/Fortran compiler optimizations for the latest x86 architectures including the Intel Xeon 7500 processor series (Nehalem-EX) and the Intel Xeon 5600 processor series (Westmere-EP) Enhanced debugging and code coverage tooling Improved application profiling with the Performance Analyzer Updated IDE based on NetBeans 6.8 To find more information and download go to http://developers.sun.com/sunstudio/downloads/express/ To participate in the customer feedback program for Oracle Solaris Studio Express 6/10 go to http://developers.sun.com/sunstudio/customerfeedback/index.jsp Please get the word out, try out this new release and send us your feedback! Technorati Tags: developer,development,solaris,sparc,Oracle Solaris Studio,Solaris Studio,Sun Studio,oracle,otn del.icio.us Tags: developer,development,solaris,sparc,Oracle Solaris Studio,Solaris Studio,Sun Studio,oracle,otn

    Read the article

  • Installing g77 in Ubuntu 12.04 LTS (32bit)

    - by pksahani
    I am using Ubuntu 12.04 LTS (32bit-i386) in my desktop PC. I need g77 compiler for some specific applications. The app can only be installed after having g77 compiler. This specific app is designed based on g77 fortran compiler and can't be used with gfortran which is the standard available compiler in 12.04 LTS. And guide me the procedure to install g77 in 12.04. I have been trying apt-get update & apt-get install g77 after changing the sources.list file. After processing I am able to install g77 but when i try to compile a fortran program, it shows error /usr/bin/ld: cannot find crt1.o: No such file or directory /usr/bin/ld: cannot find crti.o: No such file or directory /usr/bin/ld: cannot find -lgcc_s collect2: ld returned 1 exit status Please help me. I m struggling a lot to fix this.

    Read the article

  • reading parameters and files on browser, looking how to execute on server

    - by jbcolmenares
    I have a site done in Rails, which uses javascript to load files and generate forms for the user to input certain information. Those files and parameters are then to be used in a fortran code on the server. When the UI was on the server (using Qt), I would create a parameters file and execute the fortran code using threads so I wouldn't block the computer. Now that is web-based, I need to make the server and browser talk. What's the procedure for that? where should I start looking? I'm already using rails + javascript. I need that extra tool to do the talking, and no idea where to start.

    Read the article

  • How do I install gfortran (via cygwin and etexteditor) and enable ifort under Windows XP?

    - by bez
    I'm a newbie in the Unix world so all this is a little confusing to me. I'm having trouble compiling some Fortran files under Cygwin on Windows XP. Here's what I've done so far: Installed the e text editor. Installed Cygwin via the "automatic" option inside e text editor. I need to compile some Fortran files so via the "manage bundles" option I installed the Fortran bundle as well. However, when I select "compile single file" I get an error saying gfortran was missing, and then that I need to set the TM_FORTRAN variable to the full path of my compiler. I tried opening a Cygwin bash shell at the path mentioned (.../bin/gfortran), but the compiler was nowhere to be found. Can someone tell me how to install this from the Cygwin command line? Where do I need to update the TM_FORTRAN variable for the bundle to work? Also, how do I change the bundle "compile" option to work with ifort (my native compiler) on Windows? I've read the bundle file, but it is totally incomprehensible to me. Ifort is a Windows compiler, invoked simply by ifort filename.f90, since it is on the Windows path. I know this is a lot to ask of a first time user here, but I really would appreciate any time you can spare to help.

    Read the article

  • Perl for a Python programmer

    - by fortran
    I know Python (and a bunch of other languages) and I think it might be nice to learn Perl, even if it seems that most of the people is doing it the other way around... My main concern is not about the language itself (I think that part is always easy), but about learning the Perlish (as contrasted with Pythonic) way of doing things; because I don't think it'll be worth the effort if I end up programming Python in Perl. So my questions are basically two: Are there many problems/application areas where it's actually more convenient to solve them in Perl rather than in Python? If the first question is positive, where can I found a good place to get started and learn best practices that is not oriented to beginners?

    Read the article

  • Minimizing the number of boxes that cover a given set of intervals

    - by fortran
    Hi, this is a question for the algorithms gurus out there :-) Let S be a set of intervals of the natural numbers that might overlap and b a box size. Assume that for each interval, the range is strictly less than b. I want to find the minimum set of intervals of size b (let's call it M) so all the intervals in S are contained in the intervals of M. Trivial example: S = {[1..4], [2..7], [3..5], [8..15], [9..13]} b = 10 M = {[1..10], [8..18]} // so ([1..4], [2..7], [3..5]) are inside [1..10] and ([8..15], [9..13]) are inside [8..18] I think a greedy algorithm might not work always, so if anybody knows of a solution of this problem (or a similar one that can be converted into), that would be great. Thanks!

    Read the article

  • Why has anybody ever used COBOL?

    - by sarzl
    I know: You and me hate COBOL. I took a look at a lot of code examples and it didn't take me long to know why everybody tries to avoid it. So I really have no idea: Why was COBOL ever used? I mean: Hey - there was Fortran before it, and Fortran looks like a jesus-language compared to COBOL. This isn't argumentative but historical as I'm young and didn't even know about COBOL before 4 months.

    Read the article

  • Appropriate high level language to deal with binary data

    - by fortran
    Hi, I need to write a small tool that parses a textual input and generates some binary encoded data. I would prefer to stay away from C and the like, in favour of a higher level, (optionally) safer, more expressive and faster to develop language. My language of choice for this kind of tasks usually is Python, but for this case dealing with binary raw data can be problematic if one isn't very careful with the numbers being promoted to bignums, sign extensions and such. Ideally I would like to have records with named bitfields that are portable to be serialised in a consistent manner. (I know that there's a strong point in doing it in a language I already master, although it isn't optimal, but I think this could be a good opportunity to learn something new). Thanks.

    Read the article

  • Checking when two headers are included at the same time.

    - by fortran
    Hi, I need to do an assertion based on two related macro preprocessor #define's declared in different header files... The codebase is huge and it would be nice if I could find a place to put the assertion where the two headers are already included, to avoid polluting namespaces unnecessarily. Checking just that a file includes both explicitly might not suffice, as one (or both) of them might be included in an upper level of a nesting include's hierarchy. I know it wouldn't be too hard to write an script to check that, but if there's already a tool that does the job, the better. Example: file foo.h #define FOO 0xf file bar.h #define BAR 0x1e I need to put somewhere (it doesn't matter a lot where) something like this: #if (2*FOO) != BAR #error "foo is not twice bar" #endif Yes, I know the example is silly, as they could be replaced so one is derived from the other, but let's say that the includes can be generated from different places not under my control and I just need to check that they match at compile time... And I don't want to just add one include after the other, as it might conflict with previous code that I haven't written, so that's why I would like to find a file where both are already present. In brief: how can I find a file that includes (direct or indirectly) two other files? Thanks!

    Read the article

  • Multiple (variant) arguments overloading in Java: What's the purpose?

    - by fortran
    Browsing google's guava collect library code, I've found the following: // Casting to any type is safe because the list will never hold any elements. @SuppressWarnings("unchecked") public static <E> ImmutableList<E> of() { return (ImmutableList<E>) EmptyImmutableList.INSTANCE; } public static <E> ImmutableList<E> of(E element) { return new SingletonImmutableList<E>(element); } public static <E> ImmutableList<E> of(E e1, E e2) { return new RegularImmutableList<E>( ImmutableList.<E>nullCheckedList(e1, e2)); } public static <E> ImmutableList<E> of(E e1, E e2, E e3) { return new RegularImmutableList<E>( ImmutableList.<E>nullCheckedList(e1, e2, e3)); } public static <E> ImmutableList<E> of(E e1, E e2, E e3, E e4) { return new RegularImmutableList<E>( ImmutableList.<E>nullCheckedList(e1, e2, e3, e4)); } public static <E> ImmutableList<E> of(E e1, E e2, E e3, E e4, E e5) { return new RegularImmutableList<E>( ImmutableList.<E>nullCheckedList(e1, e2, e3, e4, e5)); } public static <E> ImmutableList<E> of(E e1, E e2, E e3, E e4, E e5, E e6) { return new RegularImmutableList<E>( ImmutableList.<E>nullCheckedList(e1, e2, e3, e4, e5, e6)); } public static <E> ImmutableList<E> of( E e1, E e2, E e3, E e4, E e5, E e6, E e7) { return new RegularImmutableList<E>( ImmutableList.<E>nullCheckedList(e1, e2, e3, e4, e5, e6, e7)); } public static <E> ImmutableList<E> of( E e1, E e2, E e3, E e4, E e5, E e6, E e7, E e8) { return new RegularImmutableList<E>( ImmutableList.<E>nullCheckedList(e1, e2, e3, e4, e5, e6, e7, e8)); } public static <E> ImmutableList<E> of( E e1, E e2, E e3, E e4, E e5, E e6, E e7, E e8, E e9) { return new RegularImmutableList<E>( ImmutableList.<E>nullCheckedList(e1, e2, e3, e4, e5, e6, e7, e8, e9)); } public static <E> ImmutableList<E> of( E e1, E e2, E e3, E e4, E e5, E e6, E e7, E e8, E e9, E e10) { return new RegularImmutableList<E>(ImmutableList.<E>nullCheckedList( e1, e2, e3, e4, e5, e6, e7, e8, e9, e10)); } public static <E> ImmutableList<E> of( E e1, E e2, E e3, E e4, E e5, E e6, E e7, E e8, E e9, E e10, E e11) { return new RegularImmutableList<E>(ImmutableList.<E>nullCheckedList( e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11)); } public static <E> ImmutableList<E> of( E e1, E e2, E e3, E e4, E e5, E e6, E e7, E e8, E e9, E e10, E e11, E e12, E... others) { final int paramCount = 12; Object[] array = new Object[paramCount + others.length]; arrayCopy(array, 0, e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12); arrayCopy(array, paramCount, others); return new RegularImmutableList<E>(ImmutableList.<E>nullCheckedList(array)); } And although it seems reasonable to have overloads for empty and single arguments (as they are going to use special instances), I cannot see the reason behind having all the others, when just the last one (with two fixed arguments plus the variable argument instead the dozen) seems to be enough. As I'm writing, one explanation that pops into my head is that the API pre-dates Java 1.5; and although the signatures would be source-level compatible, the binary interface would differ. Isn't it?

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >