Search Results

Search found 26 results on 2 pages for 'intrinsics'.


  • Intel AVX intrinsics: any compatibility library out?

    - by ~buratinas
    Is there any Intel AVX intrinsics compatibility library out there? I'm looking for something similar to the 'sse2mmx.h' header, which falls back to MMX intrinsics if SSE2 integer intrinsics are not available at compile time. With a similar library for AVX I could write optimized code for new hardware that would still run at nearly optimal speed when the AVX extension isn't available. Googling hasn't helped much so far :(
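
    No such header is named in the question; as an illustration of the kind of compile-time fallback being asked about, a wrapper along these lines (the names avx_compat.h and vec8f are made up here) could dispatch on __AVX__:

        // avx_compat.h -- hypothetical fallback header (illustration only)
        #ifdef __AVX__
          #include <immintrin.h>
          typedef __m256 vec8f;
          static inline vec8f vec8f_add(vec8f a, vec8f b)      { return _mm256_add_ps(a, b); }
          static inline vec8f vec8f_loadu(const float* p)      { return _mm256_loadu_ps(p); }
          static inline void  vec8f_storeu(float* p, vec8f v)  { _mm256_storeu_ps(p, v); }
        #else
          #include <xmmintrin.h>
          typedef struct { __m128 lo, hi; } vec8f;   // two SSE halves emulate one 256-bit register
          static inline vec8f vec8f_add(vec8f a, vec8f b) {
              vec8f r; r.lo = _mm_add_ps(a.lo, b.lo); r.hi = _mm_add_ps(a.hi, b.hi); return r;
          }
          static inline vec8f vec8f_loadu(const float* p) {
              vec8f r; r.lo = _mm_loadu_ps(p); r.hi = _mm_loadu_ps(p + 4); return r;
          }
          static inline void vec8f_storeu(float* p, vec8f v) {
              _mm_storeu_ps(p, v.lo); _mm_storeu_ps(p + 4, v.hi);
          }
        #endif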

    Read the article

  • g++ SSE intrinsics dilemma - value from intrinsic "saturates"

    - by Sriram
    Hi, I wrote a simple program to implement SSE intrinsics for computing the inner product of two large (100000 or more elements) vectors. The program compares the execution time for both: the inner product computed the conventional way and using intrinsics. Everything works out fine, until I insert (just for the fun of it) an inner loop before the statement that computes the inner product. Before I go further, here is the code:

        //this is a sample Intrinsics program to compute inner product of two vectors and compare Intrinsics with traditional method of doing things.
        #include <iostream>
        #include <iomanip>
        #include <xmmintrin.h>
        #include <stdio.h>
        #include <time.h>
        #include <stdlib.h>
        using namespace std;

        typedef float v4sf __attribute__ ((vector_size(16)));

        double innerProduct(float* arr1, int len1, float* arr2, int len2) {
            //assume len1 = len2.
            float result = 0.0;
            for(int i = 0; i < len1; i++) {
                for(int j = 0; j < len1; j++) {
                    result += (arr1[i] * arr2[i]);
                }
            }
            //float y = 1.23e+09;
            //cout << "y = " << y << endl;
            return result;
        }

        double sse_v4sf_innerProduct(float* arr1, int len1, float* arr2, int len2) {
            //assume that len1 = len2.
            if(len1 != len2) {
                cout << "Lengths not equal." << endl;
                exit(1);
            }
            /*steps:
             * 1. load a long-type (4 float) into a v4sf type data from both arrays.
             * 2. multiply the two.
             * 3. multiply the same and store result.
             * 4. add this to previous results.
             */
            v4sf arr1Data, arr2Data, prevSums, multVal, xyz;
            //__builtin_ia32_xorps(prevSums, prevSums); //making it equal zero.
            //can explicitly load 0 into prevSums using loadps or storeps (Check).
            float temp[4] = {0.0, 0.0, 0.0, 0.0};
            prevSums = __builtin_ia32_loadups(temp);
            float result = 0.0;
            for(int i = 0; i < (len1 - 3); i += 4) {
                for(int j = 0; j < len1; j++) {
                    arr1Data = __builtin_ia32_loadups(&arr1[i]);
                    arr2Data = __builtin_ia32_loadups(&arr2[i]); //store the contents of two arrays.
                    multVal = __builtin_ia32_mulps(arr1Data, arr2Data); //multiply.
                    xyz = __builtin_ia32_addps(multVal, prevSums);
                    prevSums = xyz;
                }
            }
            //prevSums will hold the sums of 4 32-bit floating point values taken at a time. Individual entries in prevSums also need to be added.
            __builtin_ia32_storeups(temp, prevSums); //store prevSums into temp.
            cout << "Values of temp:" << endl;
            for(int i = 0; i < 4; i++)
                cout << temp[i] << endl;
            result += temp[0] + temp[1] + temp[2] + temp[3];
            return result;
        }

        int main() {
            clock_t begin, end;
            int length = 100000;
            float *arr1, *arr2;
            double result_Conventional, result_Intrinsic;

            // printStats("Allocating memory.");
            arr1 = new float[length];
            arr2 = new float[length];
            // printStats("End allocation.");

            srand(time(NULL)); //init random seed.
            // printStats("Initializing array1 and array2");
            begin = clock();
            for(int i = 0; i < length; i++) {
                // for(int j = 0; j < length; j++) {
                // arr1[i] = rand() % 10 + 1;
                arr1[i] = 2.5;
                // arr2[i] = rand() % 10 - 1;
                arr2[i] = 2.5;
                // }
            }
            end = clock();
            cout << "Time to initialize array1 and array2 = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
            // printStats("Finished initialization.");

            // printStats("Begin inner product conventionally.");
            begin = clock();
            result_Conventional = innerProduct(arr1, length, arr2, length);
            end = clock();
            cout << "Time to compute inner product conventionally = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
            // printStats("End inner product conventionally.");

            // printStats("Begin inner product using Intrinsics.");
            begin = clock();
            result_Intrinsic = sse_v4sf_innerProduct(arr1, length, arr2, length);
            end = clock();
            cout << "Time to compute inner product with intrinsics = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
            //printStats("End inner product using Intrinsics.");

            cout << "Results: " << endl;
            cout << " result_Conventional = " << result_Conventional << endl;
            cout << " result_Intrinsics = " << result_Intrinsic << endl;
            return 0;
        }

    I use the following g++ invocation to build this:

        g++ -W -Wall -O2 -pedantic -march=i386 -msse intrinsics_SSE_innerProduct.C -o innerProduct

    Each of the loops above, in both functions, runs a total of N^2 times. However, given that arr1 and arr2 (the two floating-point vectors) are loaded with the value 2.5 and the length of the arrays is 100,000, the result in both cases should be 6.25e+10. The results I get are:

        Results:
        result_Conventional = 6.25e+10
        result_Intrinsics = 5.36871e+08

    This is not all. It seems that the value returned from the function that uses intrinsics "saturates" at the value above. I tried putting other values in the arrays and different sizes too, but any value above 1.0 for the array contents and any size above 1000 produces the same value seen above. Initially I thought it might be because all operations within SSE are in floating point, but floating point should be able to store a number that is of the order of e+08. I am trying to see where I could be going wrong but cannot seem to figure it out. I am using g++ version: g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2). Any help on this is most welcome. Thanks, Sriram.

    Read the article

  • What's the difference between logical SSE intrinsics?

    - by ~buratinas
    Hello, is there any difference between the logical SSE intrinsics for different types? For example, taking the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128. My questions: Is there any difference between using one intrinsic or another (with appropriate type casting)? Won't there be hidden costs, like longer execution time, in some specific situation? These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any idea why Intel is spending precious opcode space on several instructions that do the same thing?
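
    For concreteness, a sketch (not from the question) of the three equivalent formulations, using the _mm_cast* reinterpretation intrinsics, which only relabel the type and generate no instructions:

        #include <emmintrin.h>   // SSE2: provides _mm_or_si128 and the _mm_cast* reinterpretations

        // Three equivalent ways to OR the bits of two float vectors.
        __m128 or_as_ps(__m128 a, __m128 b) {
            return _mm_or_ps(a, b);                                    // compiles to orps
        }
        __m128 or_as_pd(__m128 a, __m128 b) {
            return _mm_castpd_ps(_mm_or_pd(_mm_castps_pd(a),           // compiles to orpd
                                           _mm_castps_pd(b)));
        }
        __m128 or_as_int(__m128 a, __m128 b) {
            return _mm_castsi128_ps(_mm_or_si128(_mm_castps_si128(a),  // compiles to por
                                                 _mm_castps_si128(b)));
        }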

    Read the article

  • Intrinsics program (SSE) - g++ - help needed

    - by Sriram
    Hi all, This is the first time I am posting a question on stackoverflow, so please try and overlook any errors I may have made in formatting my question/code. But please do point the same out to me so I may be more careful. I was trying to write some simple intrinsics routines for the addition of two 128-bit (containing 4 float variables) numbers. I found some code on the net and was trying to get it to run on my system. The code is as follows:

        //this is a sample Intrinsics program to add two vectors.
        #include <iostream>
        #include <iomanip>
        #include <xmmintrin.h>
        #include <stdio.h>
        using namespace std;

        struct vector4 {
            float x, y, z, w;
        };

        //functions to operate on them.
        vector4 set_vector(float x, float y, float z, float w = 0) {
            vector4 temp;
            temp.x = x;
            temp.y = y;
            temp.z = z;
            temp.w = w;
            return temp;
        }

        void print_vector(const vector4& v) {
            cout << " This is the contents of vector: " << endl;
            cout << " > vector.x = " << v.x << endl;
            cout << " vector.y = " << v.y << endl;
            cout << " vector.z = " << v.z << endl;
            cout << " vector.w = " << v.w << endl;
        }

        vector4 sse_vector4_add(const vector4& a, const vector4& b) {
            vector4 result;
            asm volatile (
                "movl $a, %eax"             //move operands into registers.
                "\n\tmovl $b, %ebx"
                "\n\tmovups (%eax), xmm0"   //move register contents into SSE registers.
                "\n\tmovups (%ebx), xmm1"
                "\n\taddps xmm0, xmm1"      //add the elements. addps operates on single-precision vectors.
                "\n\t movups xmm0, result"  //move result into vector4 type data.
            );
            return result;
        }

        int main() {
            vector4 a, b, result;
            a = set_vector(1.1, 2.1, 3.2, 4.5);
            b = set_vector(2.2, 4.2, 5.6);
            result = sse_vector4_add(a, b);
            print_vector(a);
            print_vector(b);
            print_vector(result);
            return 0;
        }

    The g++ parameters I use are:

        g++ -Wall -pedantic -g -march=i386 -msse intrinsics_SSE_example.C -o h

    The errors I get are as follows:

        intrinsics_SSE_example.C: Assembler messages:
        intrinsics_SSE_example.C:45: Error: too many memory references for movups
        intrinsics_SSE_example.C:46: Error: too many memory references for movups
        intrinsics_SSE_example.C:47: Error: too many memory references for addps
        intrinsics_SSE_example.C:48: Error: too many memory references for movups

    I have spent a lot of time on trying to debug these errors, googled them and so on. I am a complete noob to Intrinsics and so may have overlooked some important things. Any help is appreciated, Thanks, Sriram.
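
    Since the title mentions intrinsics, here is how the same four-float addition might look written with actual SSE intrinsics instead of inline assembly; this is an illustrative sketch, not the poster's code:

        #include <xmmintrin.h>

        vector4 sse_vector4_add(const vector4& a, const vector4& b) {
            vector4 result;
            __m128 va = _mm_loadu_ps(&a.x);                 // loads x, y, z, w (the struct is 16 bytes)
            __m128 vb = _mm_loadu_ps(&b.x);
            _mm_storeu_ps(&result.x, _mm_add_ps(va, vb));   // one addps, then store back into the struct
            return result;
        }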

    Read the article

  • Stack usage with MMX intrinsics and Microsoft C++

    - by arik-funke
    I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the eight MMX registers can together accommodate 16 int32s to calculate 16 different cumulative sums in parallel. I would now like to convert this piece of code to MMX intrinsics, but I am afraid that I will suffer a performance penalty because one cannot explicitly instruct the compiler to use the 8 MMX registers to accumulate 16 independent sums. Can anybody comment on this and maybe propose a solution for converting the piece of code below to intrinsics?

        == inline assembler (only part within the loop) ==
        paddd mm0, [esi+edx+8*0]   ; add first & second pair of int32 elements
        paddd mm1, [esi+edx+8*1]   ; add third & fourth pair of int32 elements
        ...
        paddd mm2, [esi+edx+8*2]
        paddd mm3, [esi+edx+8*3]
        paddd mm4, [esi+edx+8*4]
        paddd mm5, [esi+edx+8*5]
        paddd mm6, [esi+edx+8*6]
        paddd mm7, [esi+edx+8*7]   ; add 15th & 16th pair of int32 elements

    esi points to the beginning of the data array; edx provides the offset in the data array for the current loop iteration; the data array is arranged such that the elements for the 16 independent sums are interleaved.
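
    Purely for illustration (not from the question), the loop body above might translate to MMX intrinsics roughly as follows, keeping eight separate __m64 accumulators and leaving register allocation to the compiler; whether all eight actually stay in MMX registers is up to the compiler:

        #include <mmintrin.h>   // MMX intrinsics: __m64, _mm_add_pi32, _mm_empty

        // data: interleaved int32 array viewed as __m64 pairs; n: number of 64-byte groups
        // (hypothetical names; assumes the original loop advances by 64 bytes per iteration)
        void accumulate16(const __m64* data, int n, __m64 acc[8]) {
            for (int i = 0; i < n; ++i) {
                const __m64* p = data + 8 * i;          // same addressing as [esi+edx+8*k]
                acc[0] = _mm_add_pi32(acc[0], p[0]);    // paddd mm0, ...
                acc[1] = _mm_add_pi32(acc[1], p[1]);
                acc[2] = _mm_add_pi32(acc[2], p[2]);
                acc[3] = _mm_add_pi32(acc[3], p[3]);
                acc[4] = _mm_add_pi32(acc[4], p[4]);
                acc[5] = _mm_add_pi32(acc[5], p[5]);
                acc[6] = _mm_add_pi32(acc[6], p[6]);
                acc[7] = _mm_add_pi32(acc[7], p[7]);
            }
            _mm_empty();   // emms: leave MMX state before any x87 floating-point use
        }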

    Read the article

  • How do I reorder vector data using ARM Neon intrinsics?

    - by goldenmean
    This is specifically related to ARM Neon SIMD coding. I am using ARM Neon intrinsics for a certain module in a video decoder. I have vectorized data as follows: there are four 32-bit elements in one Neon register, say Q0, which is 128 bits wide:

        3B 3A 1B 1A

    There are another four 32-bit elements in a second Neon register, say Q1, also 128 bits wide:

        3D 3C 1D 1C

    I want the final data to be ordered as shown below:

        1D 1C 1B 1A
        3D 3C 3B 3A

    Which Neon intrinsics can achieve the desired data order?
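
    Purely as an illustration (not from the question), and assuming the values are written high lane first, so that Q0 holds 1A, 1B in its low half and 3A, 3B in its high half, recombining the 64-bit halves gives the desired order:

        #include <arm_neon.h>

        // q0 lanes (low to high): 1A 1B 3A 3B; q1 lanes: 1C 1D 3C 3D  (assumed layout)
        void reorder(uint32x4_t q0, uint32x4_t q1, uint32x4_t* out1, uint32x4_t* out3) {
            *out1 = vcombine_u32(vget_low_u32(q0),  vget_low_u32(q1));   // 1A 1B 1C 1D
            *out3 = vcombine_u32(vget_high_u32(q0), vget_high_u32(q1));  // 3A 3B 3C 3D
        }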

    Read the article

  • How to rotate an SSE/AVX vector

    - by user1584773
    I need to perform a rotate operation in as few clock cycles as possible. In the first case, let's assume __m128i as the source and dest type:

        source: || A0 || A1 || A2 || A3 ||
        dest  : || A1 || A2 || A3 || A0 ||

        dest = (__m128i)_mm_shuffle_epi32((__m128i)source, _MM_SHUFFLE(0,3,2,1));

    Now I want to do the same with AVX intrinsics. So let's assume this time __m256i as the source and dest type:

        source: || A0 || A1 || A2 || A3 || A4 || A5 || A6 || A7 ||
        dest  : || A1 || A2 || A3 || A4 || A5 || A6 || A7 || A0 ||

    The AVX intrinsics are missing most of the corresponding SSE integer operations. Maybe there is some way to get the desired output working with the floating-point version. I've tried:

        dest = (__m256i)_mm256_shuffle_ps((__m256)source, (__m256)source, _MM_SHUFFLE(0,3,2,1));

    but what I get is:

        || A0 || A2 || A3 || A4 || A5 || A6 || A7 || A1 ||

    Any idea on how to solve this in an efficient way (without mixing SSE and AVX operations and without "manually" swapping A0 and A1)? Thanks in advance!
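
    If AVX2 is available (an assumption; the question appears limited to AVX), a single cross-lane permute produces the rotated order; a sketch:

        #include <immintrin.h>

        // Rotate the eight 32-bit elements of src left by one position: A1..A7, A0.
        // Requires AVX2 for _mm256_permutevar8x32_epi32 (vpermd).
        __m256i rotate_left_1(__m256i src) {
            const __m256i idx = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0);  // new lane i takes old lane idx[i]
            return _mm256_permutevar8x32_epi32(src, idx);
        }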

    Read the article

  • Does passing __m128i objects by reference to an inline function cause these objects to be moved to the stack?

    - by ~buratinas
    Hello, I'm writing a transpose function for 8x16-bit vectors with SSE2 intrinsics. Since there are 8 arguments to that function (a matrix of size 8x8x16 bits), I can't do anything but pass them by reference. Will that be optimized by the compiler (I mean, will these __m128i objects be passed in registers instead of on the stack)? Code snippet:

        inline void transpose(__m128i &a0, __m128i &a1, __m128i &a2, __m128i &a3,
                              __m128i &a4, __m128i &a5, __m128i &a6, __m128i &a7)
        {
            ....
        }

    Read the article

  • How to use VC++ intrinsic functions w/o run-time library

    - by Adrian McCarthy
    I'm involved in one of those challenges where you try to produce the smallest possible binary, so I'm building my program without the C or C++ run-time libraries (RTL). I don't link to the DLL version or the static version. I don't even #include the header files. I have this working fine. For some code constructs, the compiler generates calls to memset(). For example:

        struct MyStruct {
            int foo;
            int bar;
        };
        MyStruct blah = {}; // calls memset()

    Since I don't include the RTL, this results in a missing symbol at link time. I've been getting around this by avoiding those constructs. For the given example, I'll explicitly initialize the struct:

        MyStruct blah;
        blah.foo = 0;
        blah.bar = 0;

    But memset() can be useful, so I tried adding my own implementation. It works fine in Debug builds, even for those places where the compiler generates an implicit call to memset(). But in Release builds, I get an error saying that I cannot define an intrinsic function. You see, in Release builds, intrinsic functions are enabled, and memset() is an intrinsic. I would love to use the intrinsic for memset() in my release builds, since it's probably inlined and smaller and faster than my implementation. But I seem to be in a catch-22. If I don't define memset(), the linker complains that it's undefined. If I do define it, the compiler complains that I cannot define an intrinsic function. I've tried adding #pragma intrinsic(memset) with and without declarations of memset, but no luck. Does anyone know the right combination of definition, declaration, #pragma, and compiler and linker flags to get an intrinsic function without pulling in RTL overhead? Visual Studio 2008, x86, Windows XP+.
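
    One arrangement that is commonly suggested for CRT-free MSVC builds (a sketch, not verified against this exact project): keep a fallback definition in a single source file where the intrinsic form is disabled with #pragma function, and let every other translation unit use the intrinsic expansion:

        // memset_fallback.cpp -- one file in the project provides the out-of-line definition
        #include <stddef.h>                  // for size_t (compiler-provided header, not the RTL)

        #pragma function(memset)             // MSVC: in this file, treat memset as a plain function

        extern "C" void* memset(void* dest, int value, size_t count) {
            unsigned char* p = static_cast<unsigned char*>(dest);
            while (count--)
                *p++ = static_cast<unsigned char>(value);
            return dest;
        }
        // Other translation units can keep /Oi (or #pragma intrinsic(memset)) and get the inline
        // expansion; the definition above satisfies the linker when the compiler emits a real call.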

    Read the article

  • Help with Assembly/SSE Multiplication

    - by Brett
    I've been trying to figure out how to gain some improvement in my code at a very crucial couple of lines:

        float x = a*b;
        float y = c*d;
        float z = e*f;
        float w = g*h;

    where a, b, c, ... are all floats. I decided to look into using SSE, but can't seem to find any improvement; in fact it turns out to be twice as slow. My SSE code is:

        Vector4 abcd, efgh, result;
        abcd = [float a, float b, float c, float d];
        efgh = [float e, float f, float g, float h];
        _asm {
            movups xmm1, abcd
            movups xmm2, efgh
            mulps xmm1, xmm2
            movups result, xmm1
        }

    I also attempted using standard inline assembly, but it doesn't appear that I can pack the register with the four floating-point values like I can with SSE. Any comments or help would be greatly appreciated; I mainly need to understand why my calculations using SSE are slower than the serial C++ code. I'm compiling in Visual Studio 2005, on Windows XP, using a Pentium 4 with HT, if that provides any additional information to assist. Thanks in advance!
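
    For comparison (an illustrative sketch, not the poster's code), the same four multiplications written with SSE intrinsics avoid the inline-assembly block and let the compiler schedule the loads and stores:

        #include <xmmintrin.h>

        // Multiply four independent float pairs at once: {a*b, c*d, e*f, g*h}.
        void mul4(float a, float b, float c, float d,
                  float e, float f, float g, float h, float out[4]) {
            __m128 lhs = _mm_set_ps(g, e, c, a);   // _mm_set_ps takes arguments from high lane to low lane
            __m128 rhs = _mm_set_ps(h, f, d, b);
            _mm_storeu_ps(out, _mm_mul_ps(lhs, rhs));   // out[0]=a*b, out[1]=c*d, out[2]=e*f, out[3]=g*h
        }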

    Read the article

  • Clarification of atomic memory access for different OSs

    - by murrekatt
    I'm currently porting a Windows C++ library to MacOS as a hobby project and a learning experience. I stumbled across some code using the Win Interlocked* functions, and thus I've been trying to read up on the subject in general. Reading related questions here on SO, I understand there are different ways to do these operations depending on the OS: Interlocked* on Windows, OSAtomic* on MacOS, and I also found that compilers have builtin (intrinsic) operations for this. After reading about gcc builtin atomic memory access, I'm left wondering what the difference is between the intrinsics and the OSAtomic* or Interlocked* ones. I mean, can't I choose between OSAtomic* and the gcc builtins if I'm on MacOS using gcc? The same if I were on Windows using gcc. I also read that on Windows the Interlocked* functions come in both inline and intrinsic versions. What should I consider when choosing between intrinsic and inline? In general, are there multiple options on each OS for what to use? Or is this again "it depends"? If so, what does it depend on? Thanks!
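
    To make the comparison concrete, here is an illustrative sketch (not from the question) of the same atomic increment expressed three ways: the Win32 API, the MacOS API, and the gcc builtin:

        #include <stdint.h>

        #if defined(_WIN32)
          #include <windows.h>
          long atomic_increment(volatile long* p)        { return InterlockedIncrement(p); }
        #elif defined(__APPLE__)
          #include <libkern/OSAtomic.h>
          int32_t atomic_increment(volatile int32_t* p)  { return OSAtomicIncrement32Barrier(p); }
        #else
          // gcc/clang builtin; works on any platform those compilers support
          long atomic_increment(volatile long* p)        { return __sync_add_and_fetch(p, 1); }
        #endif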

    Read the article

  • How do I use compiler intrinsic __fmul_?

    - by Eric Thoma
    I am writing a massively parallel GPU application. I have been optimizing it by hand. I got a 20% performance increase with __fdividef(x, y), and according to the CUDA C Programming Guide (section C.2.1), using similar functions for multiplication and addition is also beneficial. The function is stated as this: "__fmul_[rn,rz,ru,rd]()". __fdividef(x, y) was not stated with the arguments in brackets. I was wondering, what are those brackets? If I run the simple code:

        int t = __fmul_(5,4);

    I get a compiler error about how __fmul_ is undefined. I have the CUDA runtime included, so I don't think it is a setup thing; rather it is something to do with those square brackets. How do I correctly use this function? Thank you.
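
    For reference, the bracketed suffixes in the guide denote the rounding mode, so the concrete device functions are __fmul_rn, __fmul_rz, __fmul_ru and __fmul_rd; a small sketch of device code calling one of them (illustrative only):

        __global__ void scale(const float* x, const float* y, float* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                // __fmul_rn: single-precision multiply, rounded to nearest even (the default IEEE mode)
                out[i] = __fmul_rn(x[i], y[i]);
            }
        }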

    Read the article

  • XNA and C# vs the 360's in-order processor

    - by Richard Fabian
    Having read this rant, I feel that they're probably right (seeing as the Xenon and the CELL BE are both highly sensitive to branchy and cache-ignorant code), but I've heard that people are making games fine with C# too. So, is it true that it's impossible to do anything professional with XNA, or was the poster just missing what makes XNA the killer API for game development? By professional, I don't mean in the sense that you can make money from it, but more in the sense that the more professional games have physics models and animation systems that seem outside the reach of a language with such definite barriers to the intrinsics. If I wanted to write a collision system or fluid dynamics engine for my game, I don't feel like C# offers me the chance to do that, as the runtime code generator and the lack of intrinsics get in the way of performance. However, many people seem to be fine working within these constraints, making their successful games but seemingly avoiding any of the problems by omission. I've not noticed any XNA games doing anything complex beyond what's already provided by the libraries or handled by shaders. Is this avoidance of the more complex game dynamics because of the limitations of C#, or just people concentrating on getting it done? In all honesty, I can't believe that AI War can maintain that many units of AI without some data-oriented approach, or some dirty C# hacks to make it run better than the standard approach, and I guess that's partly my question too: how have people hacked C# so it's able to do what games coders do naturally with C++?

    Read the article

  • How to get `gcc` to generate `bts` instruction for x86-64 from standard C?

    - by Norman Ramsey
    Inspired by a recent question, I'd like to know if anyone knows how to get gcc to generate the x86-64 bts instruction (bit test and set) on Linux x86-64 platforms, without resorting to inline assembly or to nonstandard compiler intrinsics. Related questions:

        Why doesn't gcc do this for a simple |= operation where the right-hand side has exactly 1 bit set?
        How to get bts using compiler intrinsics or the asm directive

    Portability is more important to me than bts, so I won't use an asm directive, and if there's another solution, I prefer not to use compiler intrinsics. EDIT: The C source language does not support atomic operations, so I'm not particularly interested in getting an atomic test-and-set (even though that's the original reason for test-and-set to exist in the first place). If I want something atomic I know I have no chance of doing it with standard C source: it has to be an intrinsic, a library function, or inline assembly. (I have implemented atomic operations in compilers that support multiple threads.)
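
    For context, the plain-C pattern in question is just an OR with a single-bit mask; whether gcc lowers it to bts depends on the target and flags (no particular version is claimed to do so here):

        #include <stdint.h>

        // Set bit n of *word; plain standard C, no intrinsics or asm.
        void set_bit(uint64_t *word, unsigned n) {
            *word |= (uint64_t)1 << n;    // a compiler may emit "bts" here, or a shift plus an or
        }

        // Test-and-set flavour: returns the previous value of the bit (not atomic).
        int test_and_set_bit(uint64_t *word, unsigned n) {
            uint64_t mask = (uint64_t)1 << n;
            int old = (*word & mask) != 0;
            *word |= mask;
            return old;
        }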

    Read the article

  • Well-tested C/C++ lock-free queue?

    - by uj
    I am looking for a well-tested, publicly available C/C++ implementation of a lock-free queue. I need at least multiple-producer/single-consumer functionality; multiple consumers would be even better, if such an implementation exists. I'm targeting VC's _Interlocked... intrinsics, though anything that is straightforward to port would be fine. Could anyone give any pointers?

    Read the article

  • FFMPEG based Theora Video Decoder performance??

    - by goldenmean
    Hi, I am in the process of porting and optimizing the Theora video decoder in the ffmpeg-0.5 package for an ARM Cortex-A8 (Neon) processor @ 667 MHz. I am looking for a target estimate of the frames per second the decoder library alone should achieve after full optimization (C level and Neon assembly / intrinsics) for 720x480 progressive content in a 2 Mbps stream. I have a RealVideo 9 decoder on the Cortex-A8 which gives around 40 fps for the same kind of stream (720x480, 2 Mbps). How can I extrapolate from this data, based on the relative complexities of RV9 and Theora, to get an fps estimate for a Theora decoder on the Cortex-A8? I am aware the performance depends upon the cache configuration of the h/w, etc., but any pointers will help. Thanks, -AD

    Read the article

  • C++/g++: Concurrent program

    - by phimuemue
    Hi, I have a C++ program (source) that is said to work in parallel. However, if I compile it (I am using Ubuntu 10.04 and g++ 4.4.3) and run it, one of my two CPU cores gets full load while the other is doing "nothing". So I spoke to the person who gave me the program. I was told that I had to set specific flags for g++ in order to get the program compiled for 2 CPU cores. However, looking at the code I'm not able to find any lines that point to parallelism. So I have two questions: Are there any C++ intrinsics for multithreaded applications, i.e. is it possible to write parallel code without any extra libraries (because I did not find any non-standard libraries included)? Is it true that there are indeed flags for g++ that tell the compiler to compile the program for 2 CPU cores so that it runs in parallel (and if so, what are they)?
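
    One common arrangement (an assumption here, since the actual source isn't shown) is OpenMP: the parallelism lives in pragmas that are ignored unless a g++ flag enables them; a minimal sketch:

        // omp_sum.cpp -- build with: g++ -fopenmp omp_sum.cpp -o omp_sum
        #include <iostream>

        int main() {
            const long n = 100000000;
            double sum = 0.0;

            // Without -fopenmp this pragma is silently ignored and the loop runs on a single core.
            #pragma omp parallel for reduction(+:sum)
            for (long i = 0; i < n; ++i)
                sum += 0.5 * i;

            std::cout << sum << std::endl;
            return 0;
        }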

    Read the article

  • How much effort do you have to put in to get gains from using SSE?

    - by John
    Case One: Say you have a little class:

        class Point3D {
        private:
            float x,y,z;
        public:
            operator+=() ...etc
        };

        Point3D &Point3D::operator+=(Point3D &other) {
            this->x += other.x;
            this->y += other.y;
            this->z += other.z;
        }

    A naive use of SSE would simply replace these function bodies with a few intrinsics. But would we expect this to make much difference? MMX used to involve costly state changes, IIRC; does SSE, or are its instructions just like other instructions? And even if there's no direct "use SSE" overhead, would moving the values into SSE registers and back out again really make it any faster?

    Case Two: Instead, you're working with a less OO-based code base. Rather than an array/vector of Point3D objects, you simply have a big array of floats:

        float coordinateData[NUM_POINTS*3];

        void add(int i, int j) //yes it's unsafe, no overlap check... example only
        {
            for (int x = 0; x < 3; ++x) {
                coordinateData[i*3+x] += coordinateData[j*3+x];
            }
        }

    What about use of SSE here? Any better? In conclusion: is trying to optimise single vector operations using SSE actually worthwhile, or is it really only valuable when doing bulk operations?
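
    For Case One, a sketch of what the naive intrinsics replacement might look like (illustrative only; it assumes the class is padded to four floats so that a full 16-byte load and store are legal):

        #include <xmmintrin.h>

        class Point3D {
        private:
            float x, y, z, w;   // w is padding so the object is exactly one __m128 wide
        public:
            Point3D& operator+=(const Point3D& other) {
                __m128 a = _mm_loadu_ps(&x);             // load x, y, z, w
                __m128 b = _mm_loadu_ps(&other.x);
                _mm_storeu_ps(&x, _mm_add_ps(a, b));     // one addps instead of three scalar adds
                return *this;
            }
        };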

    Read the article

  • How to efficiently compare the sign of two floating-point values while handling negative zeros

    - by François Beaune
    Given two floating-point numbers, I'm looking for an efficient way to check if they have the same sign, given that if either of the two values is zero (+0.0 or -0.0), they should be considered to have the same sign. For instance:

        SameSign(1.0, 2.0)   should return true
        SameSign(-1.0, -2.0) should return true
        SameSign(-1.0, 2.0)  should return false
        SameSign(0.0, 1.0)   should return true
        SameSign(0.0, -1.0)  should return true
        SameSign(-0.0, 1.0)  should return true
        SameSign(-0.0, -1.0) should return true

    A naive but correct implementation of SameSign in C++ would be:

        bool SameSign(float a, float b)
        {
            if (fabs(a) == 0.0f || fabs(b) == 0.0f)
                return true;
            return (a >= 0.0f) == (b >= 0.0f);
        }

    Assuming the IEEE floating-point model, here's a variant of SameSign that compiles to branchless code (at least with Visual C++ 2008):

        bool SameSign(float a, float b)
        {
            int ia = binary_cast<int>(a);
            int ib = binary_cast<int>(b);
            int az = (ia & 0x7FFFFFFF) == 0;
            int bz = (ib & 0x7FFFFFFF) == 0;
            int ab = (ia ^ ib) >= 0;
            return (az | bz | ab) != 0;
        }

    with binary_cast defined as follows:

        template <typename Target, typename Source>
        inline Target binary_cast(Source s)
        {
            union { Source m_source; Target m_target; } u;
            u.m_source = s;
            return u.m_target;
        }

    I'm looking for two things: a faster, more efficient implementation of SameSign, using bit tricks, FPU tricks or even SSE intrinsics; and an efficient extension of SameSign to three values.
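
    On the second point, the question leaves the three-value semantics open; under the reading that all three values must agree pairwise (with any zero matching anything), a direct, not necessarily optimal, extension simply reuses the pairwise test on all three pairs:

        // Sketch only: correctness relies on the pairwise SameSign above. Checking just two
        // pairs would be wrong when the shared value is zero, e.g. (0.0f, 1.0f, -1.0f).
        bool SameSign(float a, float b, float c)
        {
            return SameSign(a, b) && SameSign(a, c) && SameSign(b, c);
        }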

    Read the article

  • Writing my own implementation of stl-like Iterator in C++.

    - by Negai
    Good evening everybody, I'm currently trying to understand the intrinsics of iterators in various languages, i.e. the way they are implemented. For example, there is the following class exposing the list interface:

        template<class T>
        class List {
        public:
            virtual void Insert( int beforeIndex, const T item ) throw( ListException ) =0;
            virtual void Append( const T item ) =0;
            virtual T Get( int position ) const throw( ListException ) =0;
            virtual int GetLength() const =0;
            virtual void Remove( int position ) throw( ListException ) =0;
            virtual ~List() =0 {};
        };

    According to GoF, the best way to implement an iterator that can support different kinds of traversal is to create a base Iterator class (a friend of List) with protected methods that can access List's members. The concrete implementations of Iterator handle the job in different ways and access List's private and protected data through the base interface. From here forth things get confusing. Say I have classes LinkedList and ArrayList, both derived from List, and there are also corresponding iterators that each of the classes returns. How can I implement LinkedListIterator? I'm absolutely out of ideas. And what kind of data can the base iterator class retrieve from the List (which is a mere interface, while the implementations of the derived classes differ significantly)? Sorry for so much clutter. Thanks.
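
    As an illustration of the GoF arrangement (a sketch only; the Node type and member names below are hypothetical, not from the question), a list-specific iterator typically holds a pointer into the concrete list's own representation:

        template<class T>
        class Iterator {                        // abstract traversal interface
        public:
            virtual bool HasNext() const = 0;
            virtual T    Next()          = 0;
            virtual ~Iterator() {}
        };

        template<class T>
        struct Node { T item; Node* next; };    // LinkedList's internal representation

        template<class T>
        class LinkedListIterator : public Iterator<T> {
            const Node<T>* current;             // position inside the concrete list
        public:
            explicit LinkedListIterator(const Node<T>* start) : current(start) {}
            bool HasNext() const { return current != 0; }
            T Next() { T value = current->item; current = current->next; return value; }
        };

        // Inside LinkedList<T> (derived from List<T>), the factory method would be something like:
        //     Iterator<T>* CreateIterator() const { return new LinkedListIterator<T>(head); }
        // An ArrayListIterator would instead keep an index into the array, so each concrete
        // iterator depends only on its own list's representation.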

    Read the article

  • Calculating rotation and translation matrices between two odometry positions for monocular linear triangulation

    - by user1298891
    Recently I've been trying to implement a system to identify and triangulate the 3D position of an object in a robotic system. The general outline of the process goes as follows:

        1. Identify the object using SURF matching, from a set of "training" images to the actual live feed from the camera.
        2. Move/rotate the robot a certain amount.
        3. Identify the object using SURF again in this new view.
        4. Now I have: a set of corresponding 2D points (same object from the two different views), two odometry locations (position + orientation), and the camera intrinsics (focal length, principal point, etc.), since the camera was calibrated beforehand. So I should be able to create the 2 projection matrices and triangulate using a basic linear triangulation method as in Hartley & Zisserman's book Multiple View Geometry, pg. 312.
        5. Solve the AX = 0 equation for each of the corresponding 2D points, then take the average.

    In practice, the triangulation only works when there's almost no change in rotation; if the robot rotates even a slight bit while moving (due to e.g. wheel slippage) then the estimate is way off. This also applies in simulation. Since I can only post two hyperlinks, here's a link to a page with images from the simulation (on the map, the red square is the simulated robot position and orientation, and the yellow square is the estimated position of the object using linear triangulation). You can see that the estimate is thrown way off even by a little rotation, as in Position 2 on that page (that was 15 degrees; if I rotate any more then the estimate is completely off the map), even in a simulated environment where a perfect calibration matrix is known. In a real environment, when I actually move around with the robot, it's worse. There aren't any problems with obtaining point correspondences, nor with actually solving the AX = 0 equation once I compute the A matrix, so I figure it probably has to do with how I'm setting up the two camera projection matrices, specifically how I'm calculating the translation and rotation matrices from the position/orientation info I have relative to the world frame. How I'm doing that right now is:

        - The rotation matrix is composed by creating a 1x3 matrix [0, (change in orientation angle), 0] and then converting that to a 3x3 one using OpenCV's Rodrigues function.
        - The translation matrix is composed by rotating the two points (start angle) degrees and then subtracting the final position from the initial position, in order to get the robot's straight and lateral movement relative to its starting orientation.

    This results in the first projection matrix being K [I | 0] and the second being K [R | T], with R and T calculated as described above. Is there anything I'm doing really wrong here? Or could it possibly be some other problem? Any help would be greatly appreciated.
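
    For reference, a sketch of how the two projection matrices and the linear triangulation could be assembled with OpenCV (illustrative only; it follows the setup described above, assumes K and t are CV_64F, and does not claim to fix the rotation problem):

        #include <vector>
        #include <opencv2/core/core.hpp>
        #include <opencv2/calib3d/calib3d.hpp>

        // K: 3x3 camera matrix; dtheta: change in orientation (radians); t: 3x1 translation
        // between the two camera positions, expressed in the first camera's frame (assumptions).
        cv::Mat triangulate(const cv::Mat& K, double dtheta, const cv::Mat& t,
                            const std::vector<cv::Point2f>& pts1,
                            const std::vector<cv::Point2f>& pts2)
        {
            cv::Mat R;
            cv::Mat rvec = (cv::Mat_<double>(3, 1) << 0.0, dtheta, 0.0);  // rotation about the y axis
            cv::Rodrigues(rvec, R);                                       // 3x1 axis-angle -> 3x3 matrix

            cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);                  // K [I | 0]
            cv::Mat Rt;
            cv::hconcat(R, t, Rt);                                        // [R | t]
            cv::Mat P2 = K * Rt;                                          // K [R | t]

            cv::Mat points4D;
            cv::triangulatePoints(P1, P2, pts1, pts2, points4D);          // homogeneous 4xN result
            return points4D;                                              // divide by the 4th row for 3D
        }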

    Read the article
