Search Results

Search found 13675 results on 547 pages for 'concurrent programming'.

Page 233/547 | < Previous Page | 229 230 231 232 233 234 235 236 237 238 239 240  | Next Page >

  • Climbing the hacker ladder

    - by cobie
    This is not a question in which I am asking for opinions rather I am asking for first hand experience. I have been programming in python for quite a while and I feel solid enough in python programming. I can come up with algorithms for problems and implement them but I somehow feel I am stuck with remaining an apprentice. What are some first hand experiences on how to climb up the ladder and become better at programming as in learning about browsers security, compilers etc. Personal experiences would be valued in responses.

    Read the article

  • Becoming an expert vs boredom [closed]

    - by QAH
    I am a college student, and I love to program, period. I code all kinds of things in different kinds of languages. Although I enjoy programming, I have an extremely hard time sticking to one project for a long time. I attribute this shortcoming to my high level of curiosity, exploring different technologies, languages, libraries, etc. What would be best? Should I settle down more and spend time on becoming an expert in one or two programming fields, or should I be more of a jack of all trades, trying out all kinds of new technologies, languages, programming methods, etc.? I'm guessing that somewhere in the middle would be best. I'm always amazed at how many developers are able to create one or two projects, and develop on them for years. What techniques do you guys employ to help you stay focused on a project?

    Read the article

  • job offer in dead technology

    - by bold
    I have a job offer in a dead technology (specific programming language) that I don't want to work with nor do I believe it will offer many jobs in the future. It requires twice a year travels abroad, which not a plus in my eyes. On the other hand the money on the table is high. What would you do? edit: as its not clear I got a job in a programming language that is different from the academic programming language I worked with. Now I see it as a mistake to head to that direction.

    Read the article

  • Should I continue to learn and program using ansi c or non standard? [closed]

    - by Erik
    I am a Cs Student, enthusiastic about programming. C is my first programming language that I learned. (Never been exposed to programming, data structures, algorithms before) I failed the exam because I studied on my own and didn't know how to use windows non standard libraries like conio.h. (I had to draw circles, etc using get x and get y). I told them but they just don't care. What do I do because for a beginner this is very confusing? Do I continue to study ansi c on my own, or should I study non standard c? Should I do them in parallel? *I did use the search bar but found nothing useful that could help me.

    Read the article

  • Why do we need private variables?

    - by rak
    Why do we need private variables in classes in the context of programming? Every book on programming I've read says this is a private variable, this is how you define it but stops there. The wording of these explanations always seemed to me like we really have a crisis of trust in our profession. The explanations always sounded like other programmers are out to mess up our code. Yet, there are many programming languages that do not have private variables. What do private variables help prevent? How do you decide if a particular of properties should be private or not? If by default every field SHOULD be private then why are there public data members in a class? Under what circumstances should a variable be made public?

    Read the article

  • Usefull skills from a computer science degree

    - by Tom Squires
    I did my degree in physics and moved later into programming. I have two and a half years experience under my belt and like to think I write good code. I am, however, concerned that not doing a compsci degree has left holes in my knowledge. I would like to fill those up now since I know I want to be doing programming for the rest of my career. What skills/techniques did you learn in your compsci degree that one wouldn't pick up from on-the-job programming?

    Read the article

  • 27 vidéos techniques des Qt DevDays 2005, 2006 et 2008 sont désormais rendues publiques par Qt eLear

    L'équipe eLearning de Qt a depuis quelques temps cherché à récupérer des vidéos techniques issues des conférences des anciens QtDevDays dans l'optique de les faire partager à tout le monde. C'est aujourd'hui chose faite avec la publication en ligne de 27 présentations techniques ce qui correspond à 22h30min de vidéos. Les sujets traités sont toujours valides aujourd'hui, même si le framework a évolué au fil des années. 2005 :All About Qt Widgets Effective Graphics Programming Practical Model/View Programming Threaded Programming with Qt - Good Practise Writing Custom Styles with QStyle Writing plugin applications with Qt 2006 :Advanced Item Views...

    Read the article

  • Windows Phone App with 4SQ

    - by Nuttanon Pornpipak
    I'm want to create a my own Coffee shop app for semester's project. It's Windows Phone App. The App can i.e. view who is check-in here now , view menu , view photo by using 4SQ Endpoint APIs. And my problem is I don't know how to start it...which book i should read about C# and I don't know which knowledge (keyword) should i google it i.e. GET POST METHOD , JSON I ever used 4SQ Endpoint APIs once with javascript (jquery) $.ajax{(.....)} to get data from 4SQ Endpoint APIs So I googled and found JSON.NET Class but I don't know how to use it because i never programming in C# I'm just begin programming. I can programming in C only. Thank you Sorry for my bad grammar

    Read the article

  • For asp.net mvc is this a three tiered solution?

    - by bbb
    I am a asp.net mvc programmer and if I want to start a project I do this: I make a class library named Model for my models. I make a class library named Infrastructure.Repository for database processes I make a class library named Application for business logic layer And finally I make a MVC project for the UI. But now some things are confusing me. Am I using 3-tier programming? If yes so what is n-tier programming and which one is better? If no so what is 3-tier programming? Some where I see that the tiers namings are DAL and BIZ. Which one is correct according to the naming convention?

    Read the article

  • Which C# Book to take?

    - by Fischkopf
    I was searching for a book to learn C#, but now i'm kinda stuck. I found many people asking the same question, and many people gave answers, but there are so many books about C# that it is really hard to decide which one to take. Now i reduced my choice on two books, but I just can't decide between them. Namely, there are: Programming C# 4.0 and C# 4.0 In A Nutshell The first thing I want to know, are these good choices? I'm not completely new to programming, but I just didn't find the right language until know, but i think C# is the one I was searching for. I know all the bassic stuff from Delphi/Java/Python so I think i'm not a complete beginner in programming. Is there anyone out there that read both books and can cleary explain whats the difference between them? I haven't found many reviews and sort of, so I just don't know which one to chose. Or is there any book that is better suiting me?

    Read the article

  • Unable to import Maven project into IntelliJ IDEA

    - by del
    I'm having problems importing any Maven projects into IntelliJ IDEA. I create an empty Maven project like this: $ mvn archetype:generate -DgroupId=com.mycompany.app -DartifactId=my-app -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false Then I try to open the project in IDEA (File Open Project, then choose the pom.xml). A progress box saying "Reading pom.xml" displays for a few minutes, and then just disappears without opening the project. Looking in the IDEA log, I see some connection timeout exceptions like this: 2012-10-03 11:55:55,483 [ 16981] INFO - ution.rmi.RemoteProcessSupport - Port/ID: 18011/Maven2ServerImpl9407569f 2012-10-03 11:56:58,898 [ 80396] WARN - ution.rmi.RemoteProcessSupport - The cook failed to start due to java.net.ConnectException: Connection timed out 2012-10-03 11:57:55,483 [ 136981] WARN - ution.rmi.RemoteProcessSupport - java.rmi.NotBoundException: _DEAD_HAND_ 2012-10-03 11:57:55,484 [ 136982] WARN - ution.rmi.RemoteProcessSupport - at sun.rmi.registry.RegistryImpl.lookup(RegistryImpl.java:106) 2012-10-03 11:57:55,484 [ 136982] WARN - ution.rmi.RemoteProcessSupport - at com.intellij.execution.rmi.RemoteServer.start(RemoteServer.java:73) 2012-10-03 11:57:55,484 [ 136982] WARN - ution.rmi.RemoteProcessSupport - at org.jetbrains.idea.maven.server.RemoteMavenServer.main(RemoteMavenServer.java:22) 2012-10-03 11:58:01,749 [ 143247] ERROR - com.intellij.ide.IdeEventQueue - Error during dispatching of java.awt.event.MouseEvent[MOUSE_RELEASED,(65,116),absolute(64,140),button=1,modifiers=Button1,clickCount=1] on frame0 java.lang.RuntimeException: Cannot reconnect. at org.jetbrains.idea.maven.server.RemoteObjectWrapper.perform(RemoteObjectWrapper.java:82) at org.jetbrains.idea.maven.server.MavenServerManager.applyProfiles(MavenServerManager.java:311) at org.jetbrains.idea.maven.project.MavenProjectReader.applyProfiles(MavenProjectReader.java:369) at org.jetbrains.idea.maven.project.MavenProjectReader.doReadProjectModel(MavenProjectReader.java:98) at org.jetbrains.idea.maven.project.MavenProjectReader.readProject(MavenProjectReader.java:52) at org.jetbrains.idea.maven.project.MavenProject.read(MavenProject.java:405) at org.jetbrains.idea.maven.project.MavenProjectsTree.doUpdate(MavenProjectsTree.java:534) at org.jetbrains.idea.maven.project.MavenProjectsTree.doAdd(MavenProjectsTree.java:481) at org.jetbrains.idea.maven.project.MavenProjectsTree.update(MavenProjectsTree.java:442) at org.jetbrains.idea.maven.project.MavenProjectsTree.updateAll(MavenProjectsTree.java:413) at org.jetbrains.idea.maven.wizards.MavenProjectBuilder.readMavenProjectTree(MavenProjectBuilder.java:198) at org.jetbrains.idea.maven.wizards.MavenProjectBuilder.access$800(MavenProjectBuilder.java:44) at org.jetbrains.idea.maven.wizards.MavenProjectBuilder$3.run(MavenProjectBuilder.java:179) at org.jetbrains.idea.maven.utils.MavenUtil$8.run(MavenUtil.java:388) at com.intellij.openapi.progress.impl.ProgressManagerImpl$TaskRunnable.run(ProgressManagerImpl.java:469) at com.intellij.openapi.progress.impl.ProgressManagerImpl$6.run(ProgressManagerImpl.java:288) at com.intellij.openapi.progress.impl.ProgressManagerImpl$2.run(ProgressManagerImpl.java:178) at com.intellij.openapi.progress.impl.ProgressManagerImpl.executeProcessUnderProgress(ProgressManagerImpl.java:218) at com.intellij.openapi.progress.impl.ProgressManagerImpl.runProcess(ProgressManagerImpl.java:169) at com.intellij.openapi.application.impl.ApplicationImpl$8$1.run(ApplicationImpl.java:641) at com.intellij.openapi.application.impl.ApplicationImpl$6.run(ApplicationImpl.java:434) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) at com.intellij.openapi.application.impl.ApplicationImpl$1$1.run(ApplicationImpl.java:145) Caused by: java.rmi.RemoteException: Cannot start maven service; nested exception is: java.rmi.ConnectException: Connection refused to host: localhost; nested exception is: java.net.ConnectException: Connection timed out at org.jetbrains.idea.maven.server.MavenServerManager.create(MavenServerManager.java:120) at org.jetbrains.idea.maven.server.MavenServerManager.create(MavenServerManager.java:71) at org.jetbrains.idea.maven.server.RemoteObjectWrapper.getOrCreateWrappee(RemoteObjectWrapper.java:41) at org.jetbrains.idea.maven.server.MavenServerManager$8.execute(MavenServerManager.java:314) at org.jetbrains.idea.maven.server.MavenServerManager$8.execute(MavenServerManager.java:311) at org.jetbrains.idea.maven.server.RemoteObjectWrapper.perform(RemoteObjectWrapper.java:76) ... 27 more Caused by: java.rmi.ConnectException: Connection refused to host: localhost; nested exception is: java.net.ConnectException: Connection timed out at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:601) at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:198) at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184) at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322) at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source) at com.intellij.execution.rmi.RemoteProcessSupport$2.compute(RemoteProcessSupport.java:215) at com.intellij.execution.rmi.RemoteUtil.executeWithClassLoader(RemoteUtil.java:122) at com.intellij.execution.rmi.RemoteProcessSupport.acquire(RemoteProcessSupport.java:212) at com.intellij.execution.rmi.RemoteProcessSupport.acquire(RemoteProcessSupport.java:133) at org.jetbrains.idea.maven.server.MavenServerManager.create(MavenServerManager.java:117) ... 32 more Caused by: java.net.ConnectException: Connection timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:529) at java.net.Socket.connect(Socket.java:478) at java.net.Socket.(Socket.java:375) at java.net.Socket.(Socket.java:189) at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:22) at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:128) at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:595) ... 41 more I'm using the latest versions of IDEA (11.1.3) and Maven (3.0.4). Any ideas what I am doing wrong?

    Read the article

  • Concurrency and Calendar classes

    - by fbielejec
    I have a thread (class implementing runnable, called AnalyzeTree) organised around a hash map (ConcurrentMap slicesMap). The class goes through the data (called trees here) in the large text file and parses the geographical coordinates from it to the HashMap. The idea is to process one tree at a time and add or grow the values according to the key (which is just a Double value representing time). The relevant part of code looks like this: // grow map entry if key exists if (slicesMap.containsKey(sliceTime)) { double[] imputedLocation = imputeValue( location, parentLocation, sliceHeight, nodeHeight, parentHeight, rate, useTrueNoise, currentTreeNormalization, precisionArray); slicesMap.get(sliceTime).add( new Coordinates(imputedLocation[1], imputedLocation[0], 0.0)); // start new entry if no such key in the map } else { List<Coordinates> coords = new ArrayList<Coordinates>(); double[] imputedLocation = imputeValue( location, parentLocation, sliceHeight, nodeHeight, parentHeight, rate, useTrueNoise, currentTreeNormalization, precisionArray); coords.add(new Coordinates(imputedLocation[1], imputedLocation[0], 0.0)); slicesMap.putIfAbsent(sliceTime, coords); // slicesMap.put(sliceTime, coords); }// END: key check And the class is called like this (executor is ExecutorService executor = Executors.newFixedThreadPool(NTHREDS) ): mrsd = new SpreadDate(mrsdString); int readTrees = 1; while (treesImporter.hasTree()) { currentTree = (RootedTree) treesImporter.importNextTree(); executor.submit(new AnalyzeTree(currentTree, precisionString, coordinatesName, rateString, numberOfIntervals, treeRootHeight, timescaler, mrsd, slicesMap, useTrueNoise)); // new AnalyzeTree(currentTree, precisionString, // coordinatesName, rateString, numberOfIntervals, // treeRootHeight, timescaler, mrsd, slicesMap, // useTrueNoise).run(); readTrees++; }// END: while has trees Now this is running into troubles when executed in parallel (the commented part running sequentially is fine), I thought it might throw a ConcurrentModificationException, but apparently the problem is in mrsd (instance of SpreadDate object, which is simply a class for date related calculations). The SpreadDate class looks like this: public class SpreadDate { private Calendar cal; private SimpleDateFormat formatter; private Date stringdate; public SpreadDate(String date) throws ParseException { // if no era specified assume current era String line[] = date.split(" "); if (line.length == 1) { StringBuilder properDateStringBuilder = new StringBuilder(); date = properDateStringBuilder.append(date).append(" AD") .toString(); } formatter = new SimpleDateFormat("yyyy-MM-dd G", Locale.US); stringdate = formatter.parse(date); cal = Calendar.getInstance(); } public long plus(int days) { cal.setTime(stringdate); cal.add(Calendar.DATE, days); return cal.getTimeInMillis(); }// END: plus public long minus(int days) { cal.setTime(stringdate); cal.add(Calendar.DATE, -days); //line 39 return cal.getTimeInMillis(); }// END: minus public long getTime() { cal.setTime(stringdate); return cal.getTimeInMillis(); }// END: getDate } And the stack trace from when exception is thrown: java.lang.ArrayIndexOutOfBoundsException: 58 at sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(BaseCalendar.java:454) at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2098) at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2013) at java.util.Calendar.setTimeInMillis(Calendar.java:1126) at java.util.GregorianCalendar.add(GregorianCalendar.java:1020) at utils.SpreadDate.minus(SpreadDate.java:39) at templates.AnalyzeTree.run(AnalyzeTree.java:88) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) If a move the part initializing mrsd to the AnalyzeTree class it runs without any problems - however it is not very memory efficient to initialize class each time this thread is running, hence my concerns. How can it be remedied?

    Read the article

  • Apache reaching MaxClients and locking the server

    - by Rodrigo Sieiro
    Hi. I currently have an Apache2 server running with mpm-prefork and mod_php on a OpenVZ VPS with 512M real / 1024M burstable RAM (no swap). After running some tests, I found that the maximum process size Apache gets is 23M, so I've set MaxClients to 25 (23M x 25 = 575 MB, ok for me). I decided to run some load tests on my server, and the results left me puzzled. I'm using ab on my desktop machine requesting the main page from a wordpress blog. When I run ab with 24 concurrent connections, everything seems fine. Sure, CPU goes up, free RAM goes down, and the result is about 2-3s response time per request. But if I run ab with 25 concurrent connections (my server limit), Apache just hangs after a couple of seconds. It starts processing the requests, then it stops responding, CPU goes back to 100% idle and ab times out. Apache log says it reached MaxClients. When this happens, Apache keeps itself locked up with 25 running processes (they're all in "W" if I check server status) and only after the TimeOut setting the processes start to die and the server starts responding again (in my case it's set to 45). My question: is that expected behaviour? Why Apache just dies when it reaches MaxClients? If it works with 24 connections, shouldn't it work with 25, just taking maybe more time to respond each request and queueing up the rest? It sounds kinda strange to me that any kid running ab can alone kill a webserver just by setting the concurrent connections to the servers MaxClients.

    Read the article

  • Apache MaxClients reaching max and locking the server

    - by Rodrigo Sieiro
    Hi. I currently have an Apache2 server running with mpm-prefork and mod_php on a OpenVZ VPS with 512M real / 1024M burstable RAM (no swap). After running some tests, I found that the maximum process size Apache gets is 23M, so I've set MaxClients to 25 (23M x 25 = 575 MB, ok for me). I decided to run some load tests on my server, and the results left me puzzled. I'm using ab on my desktop machine requesting the main page from a wordpress blog. When I run ab with 24 concurrent connections, everything seems fine. Sure, CPU goes up, free RAM goes down, and the result is about 2-3s response time per request. But if I run ab with 25 concurrent connections (my server limit), Apache just hangs after a couple of seconds. It starts processing the requests, then it stops responding, CPU goes back to 100% idle and ab times out. Apache log says it reached MaxClients. When this happens, Apache keeps itself locked up with 25 running processes (they're all in "W" if I check server status) and only after the TimeOut setting the processes start to die and the server starts responding again (in my case it's set to 45). My question: is that expected behaviour? Why Apache just dies when it reaches MaxClients? If it works with 24 connections, shouldn't it work with 25, just taking maybe more time to respond each request and queueing up the rest? It sounds kinda strange to me that any kid running ab can alone kill a webserver just by setting the concurrent connections to the servers MaxClients.

    Read the article

  • Virtualbox HTTP load testing, host CPU overload issues

    - by aschuler
    I'm doing HTTP load testing benchmarks (using Apache Benchmark and Siege) on a small Java EE 1.7.0 / Tomcat 7.0.26 application running on a Debian Squeeze 6.0.4 x64 virtualized with Virtualbox 4.1.8. The computer host is Ubuntu 11.10 x64. I've modified those parameters in the Tomcat server.xml : <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="200000" redirectPort="8443" acceptCount="2000" maxThreads="150" minSpareThreads="50" /> The application executed on the server takes around 300ms. This app is running well until a certain amount of concurrent connections like those one : ab -n 500 -c 150 http://xx.xx.xx.xx:8080/myapp/ ab -n 1000 -c 50 http://xx.xx.xx.xx:8080/myapp/ siege -b -c 100 -r 20 http://xx.xx.xx.xx:8080/myapp/ A lot of socket connection timed out happens and this completly overload the host processor (but the CPU load inside the VM is normal). Doing an htop on the host, i can see that the Virtualbox processus is running under 300% CPU and never come down even after the load test is finished. (I've allocated 4 processors to the VM, if I allocate only one processor, CPU load goes under 100%). Restarting Tomcat don't do anything, i'm forced to restart the whole VM. I've tryed to launch those ab/siege commands locally on the VM and everything goes well. I first thought it was related to a linux network limit as explained here: Running some benchmarks using ab, and tomcat starts to really slow down So I've modified those TCP parameters : echo 15 > /proc/sys/net/ipv4/tcp_fin_timeout echo 30 > /proc/sys/net/ipv4/tcp_keepalive_intvl echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse It seems to be better, but it continues to overload the host CPU and output socket connections time out at a certain amount of concurrent connections. I'm wondering if this is not related to how Virtualbox handles external concurrent connections.

    Read the article

  • What is optimal hardware configuration for heavy load LAMP application

    - by Piotr K.
    I need to run Linux-Apache-PHP-MySQL application (Moodle e-learning platform) for a large number of concurrent users - I am aiming 5000 users. By concurrent I mean that 5000 people should be able to work with the application at the same time. "Work" means not only do database reads but writes as well. The application is not very typical, since it is doing a lot of inserts/updates on the database, so caching techniques are not helping to much. We are using InnoDB storage engine. In addition application is not written with performance in mind. For instance one Apache thread usually occupies about 30-50 MB of RAM. I would be greatful for information what hardware is needed to build scalable configuration that is able to handle this kind of load. We are using right now two HP DLG 380 with two 4 core processors which are able to handle much lower load (typically 300-500 concurrent users). Is it reasonable to invest in this kind of boxes and build cluster using them or is it better to go with some more high-end hardware? I am particularly curious how many and how powerful servers are needed (number of processors/cores, size of RAM) what network equipment should be used (what kind of switches, network cards) any other hardware, like particular disc storage solutions, etc, that are needed Another thing is how to put together everything, that is what is the most optimal architecture. Clustering with MySQL is rather hard (people are complaining about MySQL Cluster, even here on Stackoverflow).

    Read the article

  • java.lang.OutOfMemoryError: unable to create new native thread

    - by Brad
    I consistently get this exception when trying to run my Junit tests on my mac: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:658) at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92) at com.google.appengine.tools.development.ApiProxyLocalImpl$PrivilegedApiAction.run(ApiProxyLocalImpl.java:197) at com.google.appengine.tools.development.ApiProxyLocalImpl$PrivilegedApiAction.run(ApiProxyLocalImpl.java:184) at java.security.AccessController.doPrivileged(Native Method) at com.google.appengine.tools.development.ApiProxyLocalImpl.doAsyncCall(ApiProxyLocalImpl.java:172) at com.google.appengine.tools.development.ApiProxyLocalImpl.makeAsyncCall(ApiProxyLocalImpl.java:138) The same set of unit tests pass perfectly fine on ubuntu and windows. Some information about my system resources on the mac: $ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 1 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 266 virtual memory (kbytes, -v) unlimited $ java -version java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode) The reason I dont think this is an application issue is because the same tests pass in different environments. I have tried setting heap to 1024m, 512m and setting the stack to 64k and 128k (and each of these combinations) with no luck. My open files was originally 256 and I have bumped this to 1024. I have been googling around for a bit and all posts say to decrease heap size and increase stack size but that doesnt seem to help. Anyone have anymore ideas? EDIT: Here are is some environment information on my ubuntu box: $ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 20 file size (blocks, -f) unlimited pending signals (-i) 16382 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited file locks (-x) unlimited $ java -version java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

    Read the article

  • Difference in performance: local machine VS amazon medium instance

    - by user644745
    I see a drastic difference in performance matrix when i run it with apache benchmark (ab) in my local machine VS production hosted in amazon medium instance. Same concurrent requests (5) and same total number of requests (111) has been run against both. Amazon has better memory than my local machine. But there are 2 CPUs in my local machine vs 1 CPU in m1.medium. My internet speed is very low at the moment, I am getting Transfer rate as 25.29KBps. How can I improve the performance ? Do not know how to interpret Connect, Processing, Waiting and total in ab output. Here is Localhost: Server Hostname: localhost Server Port: 9999 Document Path: / Document Length: 7631 bytes Concurrency Level: 5 Time taken for tests: 1.424 seconds Complete requests: 111 Failed requests: 102 (Connect: 0, Receive: 0, Length: 102, Exceptions: 0) Write errors: 0 Total transferred: 860808 bytes HTML transferred: 847155 bytes Requests per second: 77.95 [#/sec] (mean) Time per request: 64.148 [ms] (mean) Time per request: 12.830 [ms] (mean, across all concurrent requests) Transfer rate: 590.30 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.5 0 1 Processing: 14 63 99.9 43 562 Waiting: 14 60 96.7 39 560 Total: 14 63 99.9 43 563 And this is production: Document Path: / Document Length: 7783 bytes Concurrency Level: 5 Time taken for tests: 33.883 seconds Complete requests: 111 Failed requests: 0 Write errors: 0 Total transferred: 877566 bytes HTML transferred: 863913 bytes Requests per second: 3.28 [#/sec] (mean) Time per request: 1526.258 [ms] (mean) Time per request: 305.252 [ms] (mean, across all concurrent requests) Transfer rate: 25.29 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 290 297 14.0 293 413 Processing: 897 1178 63.4 1176 1391 Waiting: 296 606 135.6 588 1171 Total: 1191 1475 66.0 1471 1684

    Read the article

  • Performance: Nginx SSL slowness or just SSL slowness in general?

    - by Mauvis Ledford
    I have an Amazon Web Services setup with an Apache instance behind Nginx with Nginx handling SSL and serving everything but the .php pages. In my ApacheBench tests I'm seeing this for my most expensive API call (which cache via Memcached): 100 concurrent calls to API call (http): 115ms (median) 260ms (max) 100 concurrent calls to API call (https): 6.1s (median) 11.9s (max) I've done a bit of research, disabled the most expensive SSL ciphers and enabled SSL caching (I know it doesn't help in this particular test.) Can you tell me why my SSL is taking so long? I've set up a massive EC2 server with 8CPUs and even applying consistent load to it only brings it up to 50% total CPU. I have 8 Nginx workers set and a bunch of Apache. Currently this whole setup is on one EC2 box but I plan to split it up and load balance it. There have been a few questions on this topic but none of those answers (disable expensive ciphers, cache ssl, seem to do anything.) Sample results below: $ ab -k -n 100 -c 100 https://URL This is ApacheBench, Version 2.3 <$Revision: 655654 $> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Licensed to The Apache Software Foundation, http://www.apache.org/ Benchmarking URL.com (be patient).....done Server Software: nginx/1.0.15 Server Hostname: URL.com Server Port: 443 SSL/TLS Protocol: TLSv1/SSLv3,AES256-SHA,2048,256 Document Path: /PATH Document Length: 73142 bytes Concurrency Level: 100 Time taken for tests: 12.204 seconds Complete requests: 100 Failed requests: 0 Write errors: 0 Keep-Alive requests: 0 Total transferred: 7351097 bytes HTML transferred: 7314200 bytes Requests per second: 8.19 [#/sec] (mean) Time per request: 12203.589 [ms] (mean) Time per request: 122.036 [ms] (mean, across all concurrent requests) Transfer rate: 588.25 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 65 168 64.1 162 268 Processing: 385 6096 3438.6 6199 11928 Waiting: 379 6091 3438.5 6194 11923 Total: 449 6264 3476.4 6323 12196 Percentage of the requests served within a certain time (ms) 50% 6323 66% 8244 75% 9321 80% 9919 90% 11119 95% 11720 98% 12076 99% 12196 100% 12196 (longest request)

    Read the article

  • Autologin 2 Windows users OR Login another user from the desktop

    - by fpdragon
    I'm using two windows users on my HTPC at the same time. One is just for watching videos and one for administration via remote. This setup is quite ideal for me since windows can handle multiple concurrent logins and the win "rdp concurrent hack" (Google). The problem is, I want both users to be logged in automatically when the pc was started. It shall be possible to watch tv and also the admin user shall be automatically logged in to start my scripts and other tasks, even if I haven't logged in via remote desktop manually. Later, when I want to admin my htpc I can just rdp connect the admin user without interrupting the video playback on the actual HTPC's screen and check my cleanup tasks, downloads, ... witch already executed for this admin user. But right now I found no solution to automatically login user A from a user B desktop and I also found no solution to autologin both users immediately at startup. As a workaround I have to fire up my other notebook machine and login one time with the remote user via rdp. From this time on the remote admin user is running concurrent with the main user in the background of the machine. The other workaround would be... after startup switch user from main user to admin user and then back again. But that also requires manual steps. I'm on a Windows 8 System right now but all infos for Win7 or XP would be also interesting. thanks a lot for all ideas. PS: just to prevent useless posts... don't tell me that only one user can be logged in to windows. ;)

    Read the article

  • How to implement a virtual server running Ubuntu inside a fileserver in windows?

    - by user541445
    I work in a company that has some limitations regarding their budget. They need client/server aplication. I can code the aplication, I've made mini tests on primordial applications that work. The thing is that they only have fileservers, and the application they need must be concurrent compliant, so the database must be in their local network (Fileserver is the only choice). So far, I have explored almost any option available, starting with: Desktop databases. Access (We have a license)(But concurreny is just not effective, besides it's a windows software, yuck). Sqlite (Nice, but since the information they manage is a lot, I've performed some concurrency tests with INSERTs and doing SELECTS at the same time). It fails, somehow it just stops inserting. Open office Base (I dismember a base office only to see that it was a file mode HSQLDB). I've tried, not concurrent at all. Etc (You name the Open source Desktop Database manager, and yes, I've tried that one.) Server databases Call me a stubborn person, but reading some server databases documentation say that their databases will work in a file mode. I've tried a lot of products. Postgres, MySQL, FireBird, H2, Derby, Oracle Express, IBM DB2 Express, etc. So, I really need a hand here, I've been doing this install/delete/depression thing with a lot of databases for 3 weeks, until I got with a crazy idea and I just came here to express it. So, my question comes down to this: Installing a light virtual OS software like Virtual PC and then installing in there a Server OS like an Ubuntu server inside that virtual software will work? Will it work 24/7 or when I close the virtual pc software? Will this work in a fileserver? Any suggestion, answer, critic to the place I work, crazy new concurrent database that will work in fileserver will be most than welcome.

    Read the article

  • Remote Desktop Services Licensing - Does server have to have a RDS role?

    - by transistor1
    I recently set up a "micro" size Windows 2008 Datacenter server on Amazon AWS. My small group needs several concurrent RDS users to be able to access the machine. Without installing the "Remote Desktop Server" role, it allows 2 concurrent connections. I read on MS' website that in order to set up multiple users, we needed to install the RDS role. I did so, but now the application we are trying to share is running much slower than it was before. Prior to the role installation, it was taking about 5 seconds to open; now it is taking a few minutes to open -- without any other users logged on except me. My assumption is that the RDS role may be too much for this micro instance to handle, and currently, changing to another size instance is not an option (it may be possible later if we were to receive enough funding). This leads me to the following questions: 1) Is it a sensible assessment to assume that it is the RDS role is slowing things down, or are there other things that I could look at to speed it up? We are talking about a machine with ~600MB of memory. 2) If I revert back to the pre-RDS role, is there any legitimate way (in terms of purchasing RDS licenses) to get more than 2 concurrent desktops? I did read this, and am not questioning that the answerer is knowlegeable; but someone else may have some other experience. I am also making it clear that we want to do this in a legitimate way. Thanks in advance for any assistance that can be provided! EDIT: if it is helpful in answering the question, the application in question is a Lotus Approach database. Also, I am asking this from a technical perspective: not a legal one. I want to know if it is possible to install valid licenses without the RDS role.

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • Windows Azure Use Case: New Development

    - by BuckWoody
    This is one in a series of posts on when and where to use a distributed architecture design in your organization's computing needs. You can find the main post here: http://blogs.msdn.com/b/buckwoody/archive/2011/01/18/windows-azure-and-sql-azure-use-cases.aspx Description: Computing platforms evolve over time. Originally computers were directed by hardware wiring - that, the “code” was the path of the wiring that directed an electrical signal from one component to another, or in some cases a physical switch controlled the path. From there software was developed, first in a very low machine language, then when compilers were created, computer languages could more closely mimic written statements. These language statements can be compiled into the lower-level machine language still used by computers today. Microprocessors replaced logic circuits, sometimes with fewer instructions (Reduced Instruction Set Computing, RISC) and sometimes with more instructions (Complex Instruction Set Computing, CISC). The reason this history is important is that along each technology advancement, computer code has adapted. Writing software for a RISC architecture is significantly different than developing for a CISC architecture. And moving to a Distributed Architecture like Windows Azure also has specific implementation details that our code must follow. But why make a change? As I’ve described, we need to make the change to our code to follow advances in technology. There’s no point in change for its own sake, but as a new paradigm offers benefits to our users, it’s important for us to leverage those benefits where it makes sense. That’s most often done in new development projects. It’s a far simpler task to take a new project and adapt it to Windows Azure than to try and retrofit older code designed in a previous computing environment. We can still use the same coding languages (.NET, Java, C++) to write code for Windows Azure, but we need to think about the architecture of that code on a new project so that it runs in the most efficient, cost-effective way in a Distributed Architecture. As we receive new requests from the organization for new projects, a distributed architecture paradigm belongs in the decision matrix for the platform target. Implementation: When you are designing new applications for Windows Azure (or any distributed architecture) there are many important details to consider. But at the risk of over-simplification, there are three main concepts to learn and architect within the new code: Stateless Programming - Stateless program is a prime concept within distributed architectures. Rather than each server owning the complete processing cycle, the information from an operation that needs to be retained (the “state”) should be persisted to another location c(like storage) common to all machines involved in the process.  An interesting learning process for Stateless Programming (although not unique to this language type) is to learn Functional Programming. Server-Side Processing - Along with developing using a Stateless Design, the closer you can locate the code processing to the data, the less expensive and faster the code will run. When you control the network layer, this is less important, since you can send vast amounts of data between the server and client, allowing the client to perform processing. In a distributed architecture, you don’t always own the network, so it’s performance is unpredictable. Also, you may not be able to control the platform the user is on (such as a smartphone, PC or tablet), so it’s imperative to deliver only results and graphical elements where possible.  Token-Based Authentication - Also called “Claims-Based Authorization”, this code practice means instead of allowing a user to log on once and then running code in that context, a more granular level of security is used. A “token” or “claim”, often represented as a Certificate, is sent along for a series or even one request. In other words, every call to the code is authenticated against the token, rather than allowing a user free reign within the code call. While this is more work initially, it can bring a greater level of security, and it is far more resilient to disconnections. Resources: See the references of “Nondistributed Deployment” and “Distributed Deployment” at the top of this article for more information with graphics:  http://msdn.microsoft.com/en-us/library/ee658120.aspx  Stack Overflow has a good thread on functional programming: http://stackoverflow.com/questions/844536/advantages-of-stateless-programming  Another good discussion on Stack Overflow on server-side processing is here: http://stackoverflow.com/questions/3064018/client-side-or-server-side-processing Claims Based Authorization is described here: http://msdn.microsoft.com/en-us/magazine/ee335707.aspx

    Read the article

  • Windows Azure Use Case: New Development

    - by BuckWoody
    This is one in a series of posts on when and where to use a distributed architecture design in your organization's computing needs. You can find the main post here: http://blogs.msdn.com/b/buckwoody/archive/2011/01/18/windows-azure-and-sql-azure-use-cases.aspx Description: Computing platforms evolve over time. Originally computers were directed by hardware wiring - that, the “code” was the path of the wiring that directed an electrical signal from one component to another, or in some cases a physical switch controlled the path. From there software was developed, first in a very low machine language, then when compilers were created, computer languages could more closely mimic written statements. These language statements can be compiled into the lower-level machine language still used by computers today. Microprocessors replaced logic circuits, sometimes with fewer instructions (Reduced Instruction Set Computing, RISC) and sometimes with more instructions (Complex Instruction Set Computing, CISC). The reason this history is important is that along each technology advancement, computer code has adapted. Writing software for a RISC architecture is significantly different than developing for a CISC architecture. And moving to a Distributed Architecture like Windows Azure also has specific implementation details that our code must follow. But why make a change? As I’ve described, we need to make the change to our code to follow advances in technology. There’s no point in change for its own sake, but as a new paradigm offers benefits to our users, it’s important for us to leverage those benefits where it makes sense. That’s most often done in new development projects. It’s a far simpler task to take a new project and adapt it to Windows Azure than to try and retrofit older code designed in a previous computing environment. We can still use the same coding languages (.NET, Java, C++) to write code for Windows Azure, but we need to think about the architecture of that code on a new project so that it runs in the most efficient, cost-effective way in a Distributed Architecture. As we receive new requests from the organization for new projects, a distributed architecture paradigm belongs in the decision matrix for the platform target. Implementation: When you are designing new applications for Windows Azure (or any distributed architecture) there are many important details to consider. But at the risk of over-simplification, there are three main concepts to learn and architect within the new code: Stateless Programming - Stateless program is a prime concept within distributed architectures. Rather than each server owning the complete processing cycle, the information from an operation that needs to be retained (the “state”) should be persisted to another location c(like storage) common to all machines involved in the process.  An interesting learning process for Stateless Programming (although not unique to this language type) is to learn Functional Programming. Server-Side Processing - Along with developing using a Stateless Design, the closer you can locate the code processing to the data, the less expensive and faster the code will run. When you control the network layer, this is less important, since you can send vast amounts of data between the server and client, allowing the client to perform processing. In a distributed architecture, you don’t always own the network, so it’s performance is unpredictable. Also, you may not be able to control the platform the user is on (such as a smartphone, PC or tablet), so it’s imperative to deliver only results and graphical elements where possible.  Token-Based Authentication - Also called “Claims-Based Authorization”, this code practice means instead of allowing a user to log on once and then running code in that context, a more granular level of security is used. A “token” or “claim”, often represented as a Certificate, is sent along for a series or even one request. In other words, every call to the code is authenticated against the token, rather than allowing a user free reign within the code call. While this is more work initially, it can bring a greater level of security, and it is far more resilient to disconnections. Resources: See the references of “Nondistributed Deployment” and “Distributed Deployment” at the top of this article for more information with graphics:  http://msdn.microsoft.com/en-us/library/ee658120.aspx  Stack Overflow has a good thread on functional programming: http://stackoverflow.com/questions/844536/advantages-of-stateless-programming  Another good discussion on Stack Overflow on server-side processing is here: http://stackoverflow.com/questions/3064018/client-side-or-server-side-processing Claims Based Authorization is described here: http://msdn.microsoft.com/en-us/magazine/ee335707.aspx

    Read the article

< Previous Page | 229 230 231 232 233 234 235 236 237 238 239 240  | Next Page >