size optimization - Page 111

Is there table with timing(cost) of C functions?

- by DSblizzard

Preferable for x86-32 gcc implementation

Can we do this query without subqueries? SELECT login, post_n, (SELECT SUM(vote) FROM votes WHERE votes.post_n=posts.post_n)AS votes, (SELECT COUNT(comments.post_n) FROM comments WHERE comments.post_n=posts.post_n)AS comments_count FROM users, posts WHERE posts.id=users.id AND (visibility=2 OR visibility=3) ORDER BY date DESC LIMIT 0, 15 tables: Users: id, login Posts: post_n, id, visibility Votes: post_n, vote id — it`s user id, Users the main table.

Read the article

some pointer to understanding GCC source code

- by user299570

hi, I'm student working on optimizing GCC for multi-core processor. I tried going through the source code, it is difficult to follow through it since I need to add some code to the back end. Can anyone suggest some good resource which explains the code flow through the different phases. Also suggest some development environment for debugging GCC mainly to step through the code. Is it possible on windows?

Read the article

Optimize a MySQL count each duplicate Query

- by Onema

I have the following query That gets the city name, city id, the region name, and a count of duplicate names for that record: SELECT Country_CA.City AS currentCity, Country_CA.CityID, globe_region.region_name, ( SELECT count(Country_CA.City) FROM Country_CA WHERE City LIKE currentCity ) as counter FROM Country_CA LEFT JOIN globe_region ON globe_region.region_id = Country_CA.RegionID AND globe_region.country_code = Country_CA.CountryCode ORDER BY City This example is for Canada, and the cities will be displayed on a dropdown list. There are a few towns in Canada, and in other countries, that have the same names. Therefore I want to know if there is more than one town with the same name region name will be appended to the town name. Region names are found in the globe_region table. Country_CA and globe_region look similar to this (I have changed a few things for visualization purposes) CREATE TABLE IF NOT EXISTS `Country_CA` ( `City` varchar(75) NOT NULL DEFAULT '', `RegionID` varchar(10) NOT NULL DEFAULT '', `CountryCode` varchar(10) NOT NULL DEFAULT '', `CityID` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`City`,`RegionID`), KEY `CityID` (`CityID`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; AND CREATE TABLE IF NOT EXISTS `globe_region` ( `country_code` char(2) COLLATE utf8_unicode_ci NOT NULL, `region_code` char(2) COLLATE utf8_unicode_ci NOT NULL, `region_name` varchar(50) COLLATE utf8_unicode_ci NOT NULL, PRIMARY KEY (`country_code`,`region_code`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; The query on the top does exactly what I want it to do, but It takes way too long to generate a list for 5000 records. I would like to know if there is a way to optimize the sub-query in order to obtain the same results faster. the results should look like this City CityID region_name counter sheraton 2349269 British Columbia 1 sherbrooke 2349270 Quebec 2 sherbrooke 2349271 Nova Scotia 2 shere 2349273 British Columbia 1 sherridon 2349274 Manitoba 1

Read the article

Is opening too many datacontexts bad?

- by ryudice

I've been checking my application with linq 2 sql profiler, and I noticed that it opens a lot of datacontexts, most of them are opened by the linq datasource I used, since my repositories use only the instance stored in Request.Items, is it bad to open too many datacontext? and how can I make my linqdatasource to use the datacontext that I store in Request.Items for the duration of the request? thanks for any help!

Read the article

fast way to check if an array of chars is zero

- by Claudiu

I have an array of bytes, in memory. What's the fastest way to see if all the bytes in the array are zero?

Read the article

Google Web Optimizer -- How long until winning combination?

- by Django Reinhardt

I've had an A/B Test running in Google Web Optimizer for six weeks now, and there's still no end in sight. Google is still saying: "We have not gathered enough data yet to show any significant results. When we collect more data we should be able to show you a winning combination." Is there any way of telling how close Google is to making up its mind? (Does anyone know what algorithm does it use to decide if there's been any "high confidence winners"?) According to the Google help documentation: Sometimes we simply need more data to be able to reach a level of high confidence. A tested combination typically needs around 200 conversions for us to judge its performance with certainty. But all of our conversions have over 200 conversations at the moment: 230 / 4061 (Original) 223 / 3937 (Variation 1) 205 / 3984 (Variation 2) 205 / 4007 (Variation 3) How much longer is it going to have to run?? Thanks for any help.

Read the article

Meassure website

- by s0mmer

Hi, I was wondering if it is possible to install or use any online service to measure your website's performance? I've seen many just checking the download speed of images, external files etc. But is it possible to meassure how long asp/php code takes to execute? I have a site running a bit slowly, and it would be very nice with some app/service guiding where to optimize.

Read the article

Does replacing statements by expressions using the C++ comma operator could allow more compiler opti

- by Gabriel Cuvillier

The C++ comma operator is used to chain individual expressions, yielding the value of the last executed expression as the result. For example the skeleton code (6 statements, 6 expressions): step1; step2; if (condition) step3; return step4; else return step5; May be rewritten to: (1 statement, 6 expressions) return step1, step2, condition? step3, step4 : step5; I noticed that it is not possible to perform step-by-step debugging of such code, as the expression chain seems to be executed as a whole. Does it means that the compiler is able to perform special optimizations which are not possible with the traditional statement approach (specially if the steps are const or inline)? Note: I'm not talking about the coding style merit of that way of expressing sequence of expressions! Just about the possible optimisations allowed by replacing statements by expressions.

Read the article

help me improve my sse yuv to rgb ssse3 code

- by David McPaul

Hello, I am looking to optimise some sse code I wrote for converting yuv to rgb (both planar and packed yuv functions). i am using SSSE3 at the moment but if there are useful functions from later sse versions thats ok. I am mainly interested in how I would work out processor stalls and the like. Anyone know of any tools that do static analysis of sse code? ; ; Copyright (C) 2009-2010 David McPaul ; ; All rights reserved. Distributed under the terms of the MIT License. ; ; A rather unoptimised set of ssse3 yuv to rgb converters ; does 8 pixels per loop ; inputer: ; reads 128 bits of yuv 8 bit data and puts ; the y values converted to 16 bit in xmm0 ; the u values converted to 16 bit and duplicated into xmm1 ; the v values converted to 16 bit and duplicated into xmm2 ; conversion: ; does the yuv to rgb conversion using 16 bit integer and the ; results are placed into the following registers as 8 bit clamped values ; r values in xmm3 ; g values in xmm4 ; b values in xmm5 ; outputer: ; writes out the rgba pixels as 8 bit values with 0 for alpha ; xmm6 used for scratch ; xmm7 used for scratch %macro cglobal 1 global _%1 %define %1 _%1 align 16 %1: %endmacro ; conversion code %macro yuv2rgbsse2 0 ; u = u - 128 ; v = v - 128 ; r = y + v + v >> 2 + v >> 3 + v >> 5 ; g = y - (u >> 2 + u >> 4 + u >> 5) - (v >> 1 + v >> 3 + v >> 4 + v >> 5) ; b = y + u + u >> 1 + u >> 2 + u >> 6 ; subtract 16 from y movdqa xmm7, [Const16] ; loads a constant using data cache (slower on first fetch but then cached) psubsw xmm0,xmm7 ; y = y - 16 ; subtract 128 from u and v movdqa xmm7, [Const128] ; loads a constant using data cache (slower on first fetch but then cached) psubsw xmm1,xmm7 ; u = u - 128 psubsw xmm2,xmm7 ; v = v - 128 ; load r,b with y movdqa xmm3,xmm0 ; r = y pshufd xmm5,xmm0, 0xE4 ; b = y ; r = y + v + v >> 2 + v >> 3 + v >> 5 paddsw xmm3, xmm2 ; add v to r movdqa xmm7, xmm1 ; move u to scratch pshufd xmm6, xmm2, 0xE4 ; move v to scratch psraw xmm6,2 ; divide v by 4 paddsw xmm3, xmm6 ; and add to r psraw xmm6,1 ; divide v by 2 paddsw xmm3, xmm6 ; and add to r psraw xmm6,2 ; divide v by 4 paddsw xmm3, xmm6 ; and add to r ; b = y + u + u >> 1 + u >> 2 + u >> 6 paddsw xmm5, xmm1 ; add u to b psraw xmm7,1 ; divide u by 2 paddsw xmm5, xmm7 ; and add to b psraw xmm7,1 ; divide u by 2 paddsw xmm5, xmm7 ; and add to b psraw xmm7,4 ; divide u by 32 paddsw xmm5, xmm7 ; and add to b ; g = y - u >> 2 - u >> 4 - u >> 5 - v >> 1 - v >> 3 - v >> 4 - v >> 5 movdqa xmm7,xmm2 ; move v to scratch pshufd xmm6,xmm1, 0xE4 ; move u to scratch movdqa xmm4,xmm0 ; g = y psraw xmm6,2 ; divide u by 4 psubsw xmm4,xmm6 ; subtract from g psraw xmm6,2 ; divide u by 4 psubsw xmm4,xmm6 ; subtract from g psraw xmm6,1 ; divide u by 2 psubsw xmm4,xmm6 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,2 ; divide v by 4 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g %endmacro ; outputer %macro rgba32sse2output 0 ; clamp values pxor xmm7,xmm7 packuswb xmm3,xmm7 ; clamp to 0,255 and pack R to 8 bit per pixel packuswb xmm4,xmm7 ; clamp to 0,255 and pack G to 8 bit per pixel packuswb xmm5,xmm7 ; clamp to 0,255 and pack B to 8 bit per pixel ; convert to bgra32 packed punpcklbw xmm5,xmm4 ; bgbgbgbgbgbgbgbg movdqa xmm0, xmm5 ; save bg values punpcklbw xmm3,xmm7 ; r0r0r0r0r0r0r0r0 punpcklwd xmm5,xmm3 ; lower half bgr0bgr0bgr0bgr0 punpckhwd xmm0,xmm3 ; upper half bgr0bgr0bgr0bgr0 ; write to output ptr movntdq [edi], xmm5 ; output first 4 pixels bypassing cache movntdq [edi+16], xmm0 ; output second 4 pixels bypassing cache %endmacro SECTION .data align=16 Const16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 Const128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 UMask db 0x01 db 0x80 db 0x01 db 0x80 db 0x05 db 0x80 db 0x05 db 0x80 db 0x09 db 0x80 db 0x09 db 0x80 db 0x0d db 0x80 db 0x0d db 0x80 VMask db 0x03 db 0x80 db 0x03 db 0x80 db 0x07 db 0x80 db 0x07 db 0x80 db 0x0b db 0x80 db 0x0b db 0x80 db 0x0f db 0x80 db 0x0f db 0x80 YMask db 0x00 db 0x80 db 0x02 db 0x80 db 0x04 db 0x80 db 0x06 db 0x80 db 0x08 db 0x80 db 0x0a db 0x80 db 0x0c db 0x80 db 0x0e db 0x80 ; void Convert_YUV422_RGBA32_SSSE3(void *fromPtr, void *toPtr, int width) width equ ebp+16 toPtr equ ebp+12 fromPtr equ ebp+8 ; void Convert_YUV420P_RGBA32_SSSE3(void *fromYPtr, void *fromUPtr, void *fromVPtr, void *toPtr, int width) width1 equ ebp+24 toPtr1 equ ebp+20 fromVPtr equ ebp+16 fromUPtr equ ebp+12 fromYPtr equ ebp+8 SECTION .text align=16 cglobal Convert_YUV422_RGBA32_SSSE3 ; reserve variables push ebp mov ebp, esp push edi push esi push ecx mov esi, [fromPtr] mov edi, [toPtr] mov ecx, [width] ; loop width / 8 times shr ecx,3 test ecx,ecx jng ENDLOOP REPEATLOOP: ; loop over width / 8 ; YUV422 packed inputer movdqa xmm0, [esi] ; should have yuyv yuyv yuyv yuyv pshufd xmm1, xmm0, 0xE4 ; copy to xmm1 movdqa xmm2, xmm0 ; copy to xmm2 ; extract both y giving y0y0 pshufb xmm0, [YMask] ; extract u and duplicate so each u in yuyv becomes u0u0 pshufb xmm1, [UMask] ; extract v and duplicate so each v in yuyv becomes v0v0 pshufb xmm2, [VMask] yuv2rgbsse2 rgba32sse2output ; endloop add edi,32 add esi,16 sub ecx, 1 ; apparently sub is better than dec jnz REPEATLOOP ENDLOOP: ; Cleanup pop ecx pop esi pop edi mov esp, ebp pop ebp ret cglobal Convert_YUV420P_RGBA32_SSSE3 ; reserve variables push ebp mov ebp, esp push edi push esi push ecx push eax push ebx mov esi, [fromYPtr] mov eax, [fromUPtr] mov ebx, [fromVPtr] mov edi, [toPtr1] mov ecx, [width1] ; loop width / 8 times shr ecx,3 test ecx,ecx jng ENDLOOP1 REPEATLOOP1: ; loop over width / 8 ; YUV420 Planar inputer movq xmm0, [esi] ; fetch 8 y values (8 bit) yyyyyyyy00000000 movd xmm1, [eax] ; fetch 4 u values (8 bit) uuuu000000000000 movd xmm2, [ebx] ; fetch 4 v values (8 bit) vvvv000000000000 ; extract y pxor xmm7,xmm7 ; 00000000000000000000000000000000 punpcklbw xmm0,xmm7 ; interleave xmm7 into xmm0 y0y0y0y0y0y0y0y0 ; extract u and duplicate so each becomes 0u0u punpcklbw xmm1,xmm7 ; interleave xmm7 into xmm1 u0u0u0u000000000 punpcklwd xmm1,xmm7 ; interleave again u000u000u000u000 pshuflw xmm1,xmm1, 0xA0 ; copy u values pshufhw xmm1,xmm1, 0xA0 ; to get u0u0 ; extract v punpcklbw xmm2,xmm7 ; interleave xmm7 into xmm1 v0v0v0v000000000 punpcklwd xmm2,xmm7 ; interleave again v000v000v000v000 pshuflw xmm2,xmm2, 0xA0 ; copy v values pshufhw xmm2,xmm2, 0xA0 ; to get v0v0 yuv2rgbsse2 rgba32sse2output ; endloop add edi,32 add esi,8 add eax,4 add ebx,4 sub ecx, 1 ; apparently sub is better than dec jnz REPEATLOOP1 ENDLOOP1: ; Cleanup pop ebx pop eax pop ecx pop esi pop edi mov esp, ebp pop ebp ret SECTION .note.GNU-stack noalloc noexec nowrite progbits

Read the article

Rewriting a for loop in pure NumPy to decrease execution time

- by Statto

I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100 for me! However, calculation of the B value is actually nested within a few other loops, because it is evaluated at a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure? I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would not be possible to report back to the user on the progress of the calculation, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications? Is it possible to circumvent any of these? import numpy as np import time def reshape_vector(v): b = np.empty((3,1)) for i in range(3): b[i][0] = v[i] return b def unit_vectors(r): return r / np.sqrt((r*r).sum(0)) def calculate_dipole(mu, r_i, mom_i): relative = mu - r_i r_unit = unit_vectors(relative) A = 1e-7 num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i) den = np.sqrt(np.sum(relative*relative, 0))**3 B = np.sum(num/den, 1) return B N = 20000 # number of dipoles r_i = np.random.random((3,N)) # positions of dipoles mom_i = np.random.random((3,N)) # moments of dipoles a = np.random.random((3,3)) # three basis vectors for this crystal n = [10,10,10] # points at which to evaluate sum gamma_mu = 135.5 # a constant t_start = time.clock() for i in range(n[0]): r_frac_x = np.float(i)/np.float(n[0]) r_test_x = r_frac_x * a[0] for j in range(n[1]): r_frac_y = np.float(j)/np.float(n[1]) r_test_y = r_frac_y * a[1] for k in range(n[2]): r_frac_z = np.float(k)/np.float(n[2]) r_test = r_test_x +r_test_y + r_frac_z * a[2] r_test_fast = reshape_vector(r_test) B = calculate_dipole(r_test_fast, r_i, mom_i) omega = gamma_mu*np.sqrt(np.dot(B,B)) # write r_test, B and omega to a file frac_done = np.float(i+1)/(n[0]+1) t_elapsed = (time.clock()-t_start) t_remain = (1-frac_done)*t_elapsed/frac_done print frac_done*100,'% done in',t_elapsed/60.,'minutes...approximately',t_remain/60.,'minutes remaining'

Read the article

Find point which sum of distances to set of other points is minimal

- by Pawel Markowski

I have one set (X) of points (not very big let's say 1-20 points) and the second (Y), much larger set of points. I need to choose some point from Y which sum of distances to all points from X is minimal. I came up with an idea that I would treat X as a vertices of a polygon and find centroid of this polygon, and then I will choose a point from Y nearest to the centroid. But I'm not sure whether centroid minimizes sum of its distances to the vertices of polygon, so I'm not sure whether this is a good way? Is there any algorithm for solving this problem? Points are defined by geographical coordinates.

Read the article

Difference between Logarithmic and Uniform cost criteria

- by Marthin

I'v got some problem to understand the difference between Logarithmic(Lcc) and Uniform(Ucc) cost criteria and also how to use it in calculations. Could someone please explain the difference between the two and perhaps show how to calculate the complexity for a problem like A+B*C (Yes this is part of an assignment =) ) Thx for any help! /Marthin

Read the article

Offering text in different sizes

- by Simon R

This is a general question of sorts, but do you think that it's important to offer text resizing tools on a website, which in essense only effect text as it seems that most browsers offer text resising or more commonly zooming?

Read the article

Trying to reduce the speed overhead of an almost-but-not-quite-int number class

- by Fumiyo Eda

I have implemented a C++ class which behaves very similarly to the standard int type. The difference is that it has an additional concept of "epsilon" which represents some tiny value that is much less than 1, but greater than 0. One way to think of it is as a very wide fixed point number with 32 MSBs (the integer parts), 32 LSBs (the epsilon parts) and a huge sea of zeros in between. The following class works, but introduces a ~2x speed penalty in the overall program. (The program includes code that has nothing to do with this class, so the actual speed penalty of this class is probably much greater than 2x.) I can't paste the code that is using this class, but I can say the following: +, -, +=, <, > and >= are the only heavily used operators. Use of setEpsilon() and getInt() is extremely rare. * is also rare, and does not even need to consider the epsilon values at all. Here is the class: #include <limits> struct int32Uepsilon { typedef int32Uepsilon Self; int32Uepsilon () { _value = 0; _eps = 0; } int32Uepsilon (const int &i) { _value = i; _eps = 0; } void setEpsilon() { _eps = 1; } Self operator+(const Self &rhs) const { Self result = *this; result._value += rhs._value; result._eps += rhs._eps; return result; } Self operator-(const Self &rhs) const { Self result = *this; result._value -= rhs._value; result._eps -= rhs._eps; return result; } Self operator-( ) const { Self result = *this; result._value = -result._value; result._eps = -result._eps; return result; } Self operator*(const Self &rhs) const { return this->getInt() * rhs.getInt(); } // XXX: discards epsilon bool operator<(const Self &rhs) const { return (_value < rhs._value) || (_value == rhs._value && _eps < rhs._eps); } bool operator>(const Self &rhs) const { return (_value > rhs._value) || (_value == rhs._value && _eps > rhs._eps); } bool operator>=(const Self &rhs) const { return (_value >= rhs._value) || (_value == rhs._value && _eps >= rhs._eps); } Self &operator+=(const Self &rhs) { this->_value += rhs._value; this->_eps += rhs._eps; return *this; } Self &operator-=(const Self &rhs) { this->_value -= rhs._value; this->_eps -= rhs._eps; return *this; } int getInt() const { return(_value); } private: int _value; int _eps; }; namespace std { template<> struct numeric_limits<int32Uepsilon> { static const bool is_signed = true; static int max() { return 2147483647; } } }; The code above works, but it is quite slow. Does anyone have any ideas on how to improve performance? There are a few hints/details I can give that might be helpful: 32 bits are definitely insufficient to hold both _value and _eps. In practice, up to 24 ~ 28 bits of _value are used and up to 20 bits of _eps are used. I could not measure a significant performance difference between using int32_t and int64_t, so memory overhead itself is probably not the problem here. Saturating addition/subtraction on _eps would be cool, but isn't really necessary. Note that the signs of _value and _eps are not necessarily the same! This broke my first attempt at speeding this class up. Inline assembly is no problem, so long as it works with GCC on a Core i7 system running Linux!

Read the article

How to index a date column with null values?

- by Heinz Z.

How should I index a date column when some rows has null values? We have to select rows between a date range and rows with null dates. We use Oracle 9.2 and higher. Options I found Using a bitmap index on the date column Using an index on date column and an index on a state field which value is 1 when the date is null Using an index on date column and an other granted not null column My thoughts to the options are: to 1: the table have to many different values to use an bitmap index to 2: I have to add an field only for this purpose and to change the query when I want to retrieve the null date rows to 3: locks tricky to add an field to an index which is not really needed What is the best practice for this case? Thanks in advance Some infos I have read: Oracle Date Index When does Oracle index null column values?

Read the article

How to optimize an database suggestion engine

- by Dimitar Vouldjeff

Hi, I`m making an online engine for item-to-item recommending movies. I have made some researches and I think that the best way to implement that is using pearson correlation and make a table with item1, item2 and correlation fields, but the problem is that after each rate of item I have to regenerate the correlation for in the worst case N records (where N is the number of items). Another think that I read is the following article, but I haven`t thought a way to implement it. So what is your suggestion to optimize this process? Or any other suggestions? Thanks.

Read the article

How can I optimize this subqueried and Joined MySQL Query?

- by kevzettler

Read the article

Where is the bottleneck in this code?

- by Mikhail

I have the following tight loop that makes up the serial bottle neck of my code. Ideally I would parallelize the function that calls this but that is not possible. //n is about 60 for (int k = 0;k < n;k++) { double fone = z[k*n+i+1]; double fzer = z[k*n+i]; z[k*n+i+1]= s*fzer+c*fone; z[k*n+i] = c*fzer-s*fone; } Are there any optimizations that can be made such as vectorization or some evil inline that can help this code? I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html

Read the article

How many developers are there in the world?

- by Nick Hodges

What is the total number of software developers in the world? And to respond to the inevitable "How do you define a software developer?" -- I'll answer two ways: Define it as "Anyone who writes code to make a computer do something he wants done". Define it however you like and then answer the question References to studies or more authoritative sources of information would be greatly appreciated.

Read the article

Javascriptlibrary more efficient than Rickshaw for realtime visualizations

- by dan kutz

I want to visualize data as time-series graphs on mobile devices(tablets) and therefore stumbled upon rickshaw, which is based on D3. First I must say I was a little bit confused when I realized that realtime in web design is defined totally different to realtime in engineering which has fixed(and often very short) timeframes. Anyway my aim is to visualize the data as fast as possible, and on older tablets visualization with rickshaw is quite slow. Can anybody recommend another library, which may be more efficient in rendering? Or is there no way out and I have to go native? regards Dan.

Read the article

Database structure, optimizing table with lots of rows

- by aepheus

I have a table with a lot of rows. The structure is something like this: UserID | ItemID | Item Data... Would I see any gains in query time if I separated this into a table per user, or per smaller group of users? Queries are always single user getting/modifying a selection of items.

Read the article

How to insert zeros between bits in a bitmap?

- by anatolyg

I have some performance-heavy code that performs bit manipulations. It can be reduced to the following well-defined problem: Given a 13-bit bitmap, construct a 26-bit bitmap that contains the original bits spaced at even positions. To illustrate: 0000000000000000000abcdefghijklm (input, 32 bits) 0000000a0b0c0d0e0f0g0h0i0j0k0l0m (output, 32 bits) I currently have it implemented in the following way in C: if (input & (1 << 12)) output |= 1 << 24; if (input & (1 << 11)) output |= 1 << 22; if (input & (1 << 10)) output |= 1 << 20; ... My compiler (MS Visual Studio) turned this into the following: test eax,1000h jne 0064F5EC or edx,1000000h ... (repeated 13 times with minor differences in constants) I wonder whether i can make it any faster. I would like to have my code written in C, but switching to assembly language is possible. Can i use some MMX/SSE instructions to process all bits at once? Maybe i can use multiplication? (multiply by 0x11111111 or some other magical constant) Would it be better to use condition-set instruction (SETcc) instead of conditional-jump instruction? If yes, how can i make the compiler produce such code for me? Any other idea how to make it faster? Any idea how to do the inverse bitmap transformation (i have to implement it too, bit it's less critical)?

Read the article

Script Speed vs Memory Usage

- by Doug Neiner

I am working on an image generation script in PHP and have gotten it working two ways. One way is slow but uses a limited amount of memory, the second is much faster, but uses 6x the memory . There is no leakage in either script (as far as I can tell). In a limited benchmark, here is how they performed: -------------------------------------------- METHOD | TOTAL TIME | PEAK MEMORY | IMAGES -------------------------------------------- One | 65.626 | 540,036 | 200 Two | 20.207 | 3,269,600 | 200 -------------------------------------------- And here is the average of the previous numbers (if you don't want to do your own math): -------------------------------------------- METHOD | TOTAL TIME | PEAK MEMORY | IMAGES -------------------------------------------- One | 0.328 | 540,036 | 1 Two | 0.101 | 3,269,600 | 1 -------------------------------------------- Which method should I use and why? I anticipate this being used by a high volume of users, with each user making 10-20 requests to this script during a normal visit. I am leaning toward the faster method because though it uses more memory, it is for a 1/3 of the time and would reduce the number of concurrent requests.

Read the article

Optimizing Vector elements swaps using CUDA

- by Orion Nebula

Hi all, Since I am new to cuda .. I need your kind help I have this long vector, for each group of 24 elements, I need to do the following: for the first 12 elements, the even numbered elements are multiplied by -1, for the second 12 elements, the odd numbered elements are multiplied by -1 then the following swap takes place: Graph: because I don't yet have enough points, I couldn't post the image so here it is: http://www.freeimagehosting.net/image.php?e4b88fb666.png I have written this piece of code, and wonder if you could help me further optimize it to solve for divergence or bank conflicts .. //subvector is a multiple of 24, Mds and Nds are shared memory _shared_ double Mds[subVector]; _shared_ double Nds[subVector]; int tx = threadIdx.x; int tx_mod = tx ^ 0x0001; int basex = __umul24(blockDim.x, blockIdx.x); Mds[tx] = M.elements[basex + tx]; __syncthreads(); // flip the signs if (tx < (tx/24)*24 + 12) { //if < 12 and even if ((tx & 0x0001)==0) Mds[tx] = -Mds[tx]; } else if (tx < (tx/24)*24 + 24) { //if >12 and < 24 and odd if ((tx & 0x0001)==1) Mds[tx] = -Mds[tx]; } __syncthreads(); if (tx < (tx/24)*24 + 6) { //for the first 6 elements .. swap with last six in the 24elements group (see graph) Nds[tx] = Mds[tx_mod + 18]; Mds [tx_mod + 18] = Mds [tx]; Mds[tx] = Nds[tx]; } else if (tx < (tx/24)*24 + 12) { // for the second 6 elements .. swp with next adjacent group (see graph) Nds[tx] = Mds[tx_mod + 6]; Mds [tx_mod + 6] = Mds [tx]; Mds[tx] = Nds[tx]; } __syncthreads(); Thanks in advance ..

Search Results

Search found 21107 results on 845 pages for 'size optimization'.

Page 111/845 | < Previous Page | 107 108 109 110 111 112 113 114 115 116 117 118 | Next Page >

- by DSblizzard

- by swamprunner7

- by user299570

- by Onema

- by ryudice

- by Claudiu

- by Django Reinhardt

- by s0mmer

- by Gabriel Cuvillier

- by David McPaul

- by Statto

- by Pawel Markowski

- by Marthin

- by Simon R

- by Fumiyo Eda

- by Heinz Z.

- by Dimitar Vouldjeff

- by kevzettler

- by Mikhail

- by Nick Hodges

- by dan kutz

- by aepheus

- by anatolyg

- by Doug Neiner

- by Orion Nebula

< Previous Page | 107 108 109 110 111 112 113 114 115 116 117 118 | Next Page >