Search Results

Search found 1848 results on 74 pages for 'algorithms'.

Page 16/74 | < Previous Page | 12 13 14 15 16 17 18 19 20 21 22 23 | Next Page >

How should I compress a file with multiple bytes that are the same with Huffman coding?

- by Omega

On my great quest for compressing/decompressing files with a Java implementation of Huffman coding (http://en.wikipedia.org/wiki/Huffman_coding) for a school assignment, I am now at the point of building a list of prefix codes. Such codes are used when decompressing a file. Basically, the code is made of zeroes and ones, that are used to follow a path in a Huffman tree (left or right) for, ultimately, finding a byte. In this Wikipedia image, to reach the character m the prefix code would be 0111 The idea is that when you compress the file, you will basically convert all the bytes of the file into prefix codes instead (they tend to be smaller than 8 bits, so there's some gain). So every time the character m appears in a file (which in binary is actually 1101101), it will be replaced by 0111 (if we used the tree above). Therefore, 1101101110110111011011101101 becomes 0111011101110111 in the compressed file. I'm okay with that. But what if the following happens: In the file to be compressed there exists only one unique byte, say 1101101. There are 1000 of such byte. Technically, the prefix code of such byte would be... none, because there is no path to follow, right? I mean, there is only one unique byte anyway, so the tree has just one node. Therefore, if the prefix code is none, I would not be able to write the prefix code in the compressed file, because, well, there is nothing to write. Which brings this problem: how would I compress/decompress such file if it is impossible to write a prefix code when compressing? (using Huffman coding, due to the school assignment's rules) This tutorial seems to explain a bit better about prefix codes: http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html but doesn't seem to address this issue either.

Read the article
Looking for a dynamic programming solution

- by krammer

Given a sequence of integers in range 1 to n. Each number can appear at most once. Let there be a symbol X in the sequence which means remove the minimum element from the list. There can be an arbitrarily number of X in the sequence. Example: 1,3,4,X,5,2,X The output is 1,2. We need to find the best way to perform this operation. The solution I have been thinking is: Scan the sequence from left to right and count number of X which takes O(n) time. Perform partial sorting and find the k smallest elements (k = number of X) which takes O(n+klogk) time using median of medians. Is there a better way to solve this problem using dynamic programming or any other way ?

Read the article
Finding the median of the merged array of two sorted arrays in O(logN)?

- by user176517

Refering to the solution present at MIT handout I have tried to figure out the solution myself but have got stuck and I believe I need help to understand the following points. In the function header used in the solution MEDIAN -SEARCH (A[1 . . l], B[1 . . m], max(1,n/2 - m), min(l, n/2)) I do not understand the last two arguments why not simply 1, l why the max and min respectively. Thanking You.

Read the article
Writing a spell checker similar to "did you mean"

- by user888734

I'm hoping to write a spellchecker for search queries in a web application - not unlike Google's "Did you mean?" The algorithm will be loosely based on this: http://catalog.ldc.upenn.edu/LDC2006T13 In short, it generates correction candidates and scores them on how often they appear (along with adjacent words in the search query) in an enormous dataset of known n-grams - Google Web 1T - which contains well over 1 billion 5-grams. I'm not using the Web 1T dataset, but building my n-gram sets from my own documents - about 200k docs, and I'm estimating tens or hundreds of millions of n-grams will be generated. This kind of process is pushing the limits of my understanding of basic computing performance - can I simply load my n-grams into memory in a hashtable or dictionary when the app starts? Is the only limiting factor the amount of memory on the machine? Or am I barking up the wrong tree? Perhaps putting all my n-grams in a graph database with some sort of tree query optimisation? Could that ever be fast enough?

Read the article
Algorithm for grouping friends at the cinema [closed]

- by Tim Skauge

I got a brain teaser for you - it's not as simple as it sounds so please read and try to solve the issue. Before you ask if it's homework - it's not! I just wish to see if there's an elegant way of solving this. Here's the issue: X-number of friends want's to go to the cinema and wish to be seated in the best available groups. Best case is that everyone sits together and worst case is that everyone sits alone. Fewer groups are preferred over more groups. Sitting alone is least preferred. Input is the number of people going to the cinema and output should be an array of integer arrays that contains: Ordered combinations (most preferred are first) Number of people in each group Below are some examples of number of people going to the cinema and a list of preferred combinations these people can be seated: 1 person: 1 2 persons: 2, 1+1 3 persons: 3, 2+1, 1+1+1 4 persons: 4, 2+2, 3+1, 2+1+1, 1+1+1+1 5 persons: 5, 3+2, 4+1, 2+2+1, 3+1+1, 2+1+1+1, 1+1+1+1+1 6 persons: 6, 3+3, 4+2, 2+2+2, 5+1, 3+2+1, 2+2+1+1, 2+1+1+1+1, 1+1+1+1+1+1 Example with more than 7 persons explodes in combinations but I think you get the point by now. Question is: What does an algorithm look like that solves this problem? My language by choice is C# so if you could give an answer in C# it would be fantastic!

Read the article
Where would you start if you were trying to solve this PDF classification problem?

- by burtonic

We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages. The PDFs are scanned and the database is populated with, among other things, the: Title Contents (full text) Page count Word count Orientation First line Using this data we are checking for the obvious phrases such as: Annual report Financial statement Quarterly report Interim report Then recording the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not. We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?

Read the article
How to find optimal path visit every node with parallel workers complicated by dynamic edge costs?

- by Aaron Anodide

Say you have an acyclic directed graph with weighted edges and create N workers. My goal is to calculate the optimal way those workers can traverse the entire graph in parralel. However, edge costs may change along the way. Example: A -1-> B A -2-> C B -3-> C (if A has already been visited) B -5-> C (if A has not already been visited) Does what I describe lend itself to a standard algorithmic approach, or alternately can someone suggest if I'm looking at this in an inherently flawed way (i have an intuition I might be)?

Read the article
Is using build-in sorting considered cheating in practice tests?

- by user10326

I am using one of the practice online judges where a practice problem is asked and one submits the answer and gets back if it is accepted or not based on test inputs. My question is the following: In one of the practice tests, I needed to sort an array as part of the solution algorithm. If it matters the problem was: find 2 numbers in an array that add up to a specific target. As part of my algorithm I sorted the array, but to do that I used Java's quicksort and not implement sorting as part of the same method. To do that I had to do: java.util.Arrays.sort(array); Since I had to use the fully qualified name I am wondering if this is a kind of "cheating". (I mean perhaps an online judge does not expect this) Is it? In a formal interview (since these tests are practice for interview as I understand) would this be acceptable?

Read the article
Grid Game Algorithm

- by 7Aces

Problem Link - http://www.iarcs.org.in/inoi/2009/zco2009/zco2009-1a.php You have to find the path with the maximum weight, from the top-left cell to the bottom-right cell. It could've been solved with a simple Dynamic Programming approach, if it were not for the special condition - you are allowed at most one move that can be either to the left or up. How do I now approach the problem with this special case? Also, I'm looking for a time-efficient approach. Thanks!

Read the article
Is the Leptonica implementation of 'Modified Median Cut' not using the median at all?

- by TheCodeJunkie

I'm playing around a bit with image processing and decided to read up on how color quantization worked and after a bit of reading I found the Modified Median Cut Quantization algorithm. I've been reading the code of the C implementation in Leptonica library and came across something I thought was a bit odd. Now I want to stress that I am far from an expert in this area, not am I a math-head, so I am predicting that this all comes down to me not understanding all of it and not that the implementation of the algorithm is wrong at all. The algorithm states that the vbox should be split along the lagest axis and that it should be split using the following logic The largest axis is divided by locating the bin with the median pixel (by population), selecting the longer side, and dividing in the center of that side. We could have simply put the bin with the median pixel in the shorter side, but in the early stages of subdivision, this tends to put low density clusters (that are not considered in the subdivision) in the same vbox as part of a high density cluster that will outvote it in median vbox color, even with future median-based subdivisions. The algorithm used here is particularly important in early subdivisions, and 3is useful for giving visible but low population color clusters their own vbox. This has little effect on the subdivision of high density clusters, which ultimately will have roughly equal population in their vboxes. For the sake of the argument, let's assume that we have a vbox that we are in the process of splitting and that the red axis is the largest. In the Leptonica algorithm, on line 01297, the code appears to do the following Iterate over all the possible green and blue variations of the red color For each iteration it adds to the total number of pixels (population) it's found along the red axis For each red color it sum up the population of the current red and the previous ones, thus storing an accumulated value, for each red note: when I say 'red' I mean each point along the axis that is covered by the iteration, the actual color may not be red but contains a certain amount of red So for the sake of illustration, assume we have 9 "bins" along the red axis and that they have the following populations 4 8 20 16 1 9 12 8 8 After the iteration of all red bins, the partialsum array will contain the following count for the bins mentioned above 4 12 32 48 49 58 70 78 86 And total would have a value of 86 Once that's done it's time to perform the actual median cut and for the red axis this is performed on line 01346 It iterates over bins and check they accumulated sum. And here's the part that throws me of from the description of the algorithm. It looks for the first bin that has a value that is greater than total/2 Wouldn't total/2 mean that it is looking for a bin that has a value that is greater than the average value and not the median ? The median for the above bins would be 49 The use of 43 or 49 could potentially have a huge impact on how the boxes are split, even though the algorithm then proceeds by moving to the center of the larger side of where the matched value was.. Another thing that puzzles me a bit is that the paper specified that the bin with the median value should be located, but does not mention how to proceed if there are an even number of bins.. the median would be the result of (a+b)/2 and it's not guaranteed that any of the bins contains that population count. So this is what makes me thing that there are some approximations going on that are negligible because of how the split actually takes part at the center of the larger side of the selected bin. Sorry if it got a bit long winded, but I wanted to be as thoroughas I could because it's been driving me nuts for a couple of days now ;)

Read the article
Approach to Authenticate Clients to TCP Server

- by dab

I'm writing a Server/Client application where clients will connect to the server. What I want to do, is make sure that the client connecting to the server is actually using my protocol and I can "trust" the data being sent from the client to the server. What I thought about doing is creating a sort of hash on the client's machine that follows a particular algorithm. What I did in a previous version was took their IP address, the client version, and a few other attributes of the client and sent it as a calculated hash to the server, who then took their IP, and the version of the protocol the client claimed to be using, and calculated that number to see if they matched. This works ok until you get clients that connect from within a router environment where their internal IP is different from their external IP. My fix for this was to pass the client's internal IP used to calculate this hash with the authentication protocol. My fear is this approach is not secure enough. Since I'm passing the data used to create the "auth hash". Here's an example of what I'm talking about: Client IP: 192.168.1.10, Version: 2.4.5.2 hash = 2*4*5*1 * (1+9+2) * (1+6+8) * (1) * (1+0) Client Connects to Server client sends: auth hash ip version Server calculates that info, and accepts or denies the hash. Before I go and come up with another algorithm to prove a client can provide data a server (or use this existing algorithm), I was wondering if there are any existing, proven, and secure systems out there for generating a hash that both sides can generate with general knowledge. The server won't know about the client until the very first connection is established. The protocol's intent is to manage a network of clients who will be contributing data to the server periodically. New clients will be added simply by connecting the client to the server and "registering" with the server. So a client connects to the server for the first time, and registers their info (mac address or some other kind of unique computer identifier), then when they connect again, the server will recognize that client as a previous person and associate them with their data in the database.

Read the article
Algorithm for optimal combination of two variables

- by AlanChavez

I'm looking for an algorithm that would be able to determine the optimal combination of two variables, but I'm not sure where to start looking. For example, if I have 10,000 rows in a database and each row contains price, and square feet is there any algorithm out there that will be able to determine what combination of price and sq ft is optimal. I know this is vague, but I assume is along the lines of Fuzzy logic and fuzzy sets, but I'm not sure and I'd like to start digging in the right field to see if I can come up with something that solves my problem.

Read the article
Is there a phrase or word to describe an algorithim or program is complete in that given any value for its arguments there is a defined outcome?

- by Mrk Mnl

Is there a phrase or word to describe an algorithim or programme is complete in that given any value for its arguments there is a defined outcome? i.e. all the ramifications have been considered whatever the context? A simple example would be the below function: function returns string get_item_type(int type_no) { if(type_no < 10) return "hockey stick" else if (type_no < 20) return "bulldozer" else return "unknown" } (excuse the dismal pseudo code) No matter what number is supplied all possibiblites are catered for. My question is: is there a word to fill the blank here: "get_item_type() is ______ complete" ? (The answer is not Turing Complete - that is something quite different - but I annoyingly always think of something as "Turing Complete" when I am thinking of the above).

Read the article
Efficient Bus Loading

- by System Down

This is something I did for a bus travel company a long time ago, and I was never happy with the results. I was thinking about that old project recently and thought I'd revisit that problem. Problem: Bus travel company has several buses with different passenger capacities (e.g. 15 50-passenger buses, 25 30-passenger buses ... etc). They specialized in offering transportation to very large groups (300+ passengers per group). Since each group needs to travel together they needed to manage their fleet efficiently to reduce waste. For instance, 88 passengers are better served by three 30-passenger buses (2 empty seats) than by two 50-passenger buses (12 empty seats). Another example, 75 passengers would be better served by one 50-passenger bus and one 30-passenger bus, a mix of types. What's a good algorithm to do this?

Read the article
Structuring Access Control In Hierarchical Object Graph

- by SB2055

I have a Folder entity that can be Moderated by users. Folders can contain other folders. So I may have a structure like this: Folder 1 Folder 2 Folder 3 Folder 4 I have to decide how to implement Moderation for this entity. I've come up with two options: Option 1 When the user is given moderation privileges to Folder 1, define a moderator relationship between Folder 1 and User 1. No other relationships are added to the db. To determine if the user can moderate Folder 3, I check and see if User 1 is the moderator of any parent folders. This seems to alleviate some of the complexity of handling updates / moved entities / additions under Folder 1 after the relationship has been defined, and reverting the relationship means I only have to deal with one entity. Option 2 When the user is given moderation privileges to Folder 1, define a new relationship between User 1 and Folder 1, and all child entities down to the grandest of grandchildren when the relationship is created, and if it's ever removed, iterate back down the graph to remove the relationship. If I add something under Folder 2 after this relationship has been made, I just copy all Moderators into the new Entity. But when I need to show only the top-level Folders that a user is Moderating, I need to query all folders that have a parent folder that the user does not moderate, as opposed to option 1, where I just query any items that the user is moderating. Thoughts I think it comes down to determining if users will be querying for all parent items more than they'll be querying child items... if so, then option 1 seems better. But I'm not sure. Is either approach better than the other? Why? Or is there another approach that's better than both? I'm using Entity Framework in case it matters.

Read the article
Common Substring of two strings

- by Chander Shivdasani

This particular interview-question stumped me: Given two Strings S1 and S2. Find the longest Substring which is a Prefix of S1 and suffix of S2. Through Google, I came across the following solution, but didnt quite understand what it was doing. public String findLongestSubstring(String s1, String s2) { List<Integer> occurs = new ArrayList<>(); for (int i = 0; i < s1.length(); i++) { if (s1.charAt(i) == s2.charAt(s2.length()-1)) { occurs.add(i); } } Collections.reverse(occurs); for(int index : occurs) { boolean equals = true; for(int i = index; i >= 0; i--) { if (s1.charAt(index-i) != s2.charAt(s2.length() - i - 1)) { equals = false; break; } } if(equals) { return s1.substring(0,index+1); } } return null; }

Read the article
Slides and Code from “Using C#’s Async Effectively”

- by Reed

The slides and code from my talk on the new async language features in C# and VB.Net are now available on https://github.com/ReedCopsey/Effective-Async This includes the complete slide deck, and all 4 projects, including: FakeService: Simple WCF service to run locally and simulate network service calls. AsyncService: Simple WCF service which wraps FakeService to demonstrate converting sync to async SimpleWPFExample: Simplest example of converting a method call to async from a synchronous version AsyncExamples: Windows Store application demonstrating main concepts, pitfalls, tips, and tricks from the slide deck

Read the article
How to identify a PDF classification problem?

- by burtonic

We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages. The PDFs are scanned and the database is populated with, among other things, the: Title Contents (full text) Page count Word count Orientation First line Using this data we are checking for the obvious phrases such as: Annual report Financial statement Quarterly report Interim report Then recording the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not. We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?

Read the article
Algorithm for Learning development

- by user9057

Hi all, This is a fairly general question. I know a bit of Perl and Python and I am looking to learn programming in more depth so that once I get the hang of it I can start developing applications and then websites. I would like to know of an algorithm (sequence of steps :)) that could describe my approach towards learning programming in general. I have posted small questions on Perl/Python and I have recieved great help from everyone. Note:- I am not in a hurry to learn. I know it takes time and that's fine. Please give any suggestions you think are valid. Also, please don't push me to learn Lisp, Haskell etc - I am a beginner.

Read the article
An algorithm for finding subset matching criteria?

- by Macin

I recently came up with a problem which I would like to share some thoughts about with someone on this forum. This relates to finding a subset. In reality it is more complicated, but I tried to present it here using some simpler concepts. To make things easier, I created this conceptual DB model: Let's assume this is a DB for storing recipes. Recipe can have many instructions steps and many ingredients. Ingredients are stored in a cupboard and we know how much of each ingredient we have. Now, when we create a recipe, we have to define how much of each ingredient we need. When we want to use a recipe, we would just check if required amount is less than available amount for each product and then decide if we can cook a dinner - if amount required for at least one ingredient is less than available amount - recipe cannot be cooked. Simple sql query to get the result. This is straightforward, but I'm wondering, how should I work when the problem is stated the other way round, i.e. how to find recipies which can be cooked only from ingredients that are available? I hope my explanation is clear, but if you need any more clarification, please ask.

Read the article
Tiling Problem Solutions for Various Size "Dominoes"

- by user67081

I've got an interesting tiling problem, I have a large square image (size 128k so 131072 squares) with dimensons 256x512... I want to fill this image with certain grain types (a 1x1 tile, a 1x2 strip, a 2x1 strip, and 2x2 square) and have no overlap, no holes, and no extension past the image boundary. Given some probability for each of these grain types, a list of the number required to be placed is generated for each. Obviously an iterative/brute force method doesn't work well here if we just randomly place the pieces, instead a certain algorithm is required. 1) all 2x2 square grains are randomly placed until exhaustion. 2) 1x2 and 2x1 grains are randomly placed alternatively until exhaustion 3) the remaining 1x1 tiles are placed to fill in all holes. It turns out this algorithm works pretty well for some cases and has no problem filling the entire image, however as you might guess, increasing the probability (and thus number) of 1x2 and 2x1 grains eventually causes the placement to stall (since there are too many holes created by the strips and not all them can be placed). My approach to this solution has been as follows: 1) Create a mini-image of size 8x8 or 16x16. 2) Fill this image randomly and following the algorithm specified above so that the desired probability of the entire image is realized in the mini-image. 3) Create N of these mini-images and then randomly successively place them in the large image. Unfortunately there are some downfalls to this simplification. 1) given the small size of the mini-images, nailing an exact probability for the entire image is not possible. Example if I want p(2x1)=P(1x2)=0.4, the mini image may only give 0.41 as the closes probability. 2) The mini-images create a pseudo boundary where no overlaps occur which isn't really descriptive of the model this is being used for. 3) There is only a fixed number of mini-images so i'm not sure how random this really is. I'm really just looking to brainstorm about possible solutions to this. My main concern is really to nail down closer probabilities, now one might suggest I just increase the mini-image size. Well I have, and it turns out that in certain cases(p(1x2)=p(2x1)=0.5) the mini-image 16x16 isn't even iteratively solvable.. So it's pretty obvious how difficult it is to randomly solve this for anything greater than 8x8 sizes.. So I'd love to hear some ideas. Thanks

Read the article
Improving python code

- by cobie

I just answered the question on project euler about finding circular primes below 1 million using python. My solution is below. I was able to reduce the running time of the solution from 9 seconds to about 3 seconds. I would like to see what else can be done to the code to reduce its running time further. This is strictly for educational purposes and for fun. import math import time def getPrimes(n): """returns set of all primes below n""" non_primes = [j for j in range(4, n, 2)] # 2 covers all even numbers for i in range(3, n, 2): non_primes.extend([j for j in range(i*2, n, i)]) return set([i for i in range(2, n)]) - set(non_primes) def getCircularPrimes(n): primes = getPrimes(n) is_circ = [] for prime in primes: prime_str = str(prime) iter_count = len(prime_str) - 1 rotated_num = [] while iter_count > 0: prime_str = prime_str[1:] + prime_str[:1] rotated_num.append(int(prime_str)) iter_count -= 1 if primes >= set(rotated_num): is_circ.append(prime) return len(is_circ)

Read the article
Resolving equivalence relations

- by Luca Cerone

I am writing a function to label the connected component in an image (I know there are several libraries outside, I just wanted to play with the algorithm). To do this I label the connected regions with different labels and create an equivalence table that contain information on the labels belonging to the same connected component. As an example if my equivalence table (vector of vector) looks something like: 1: 1,3 2: 2,3 3: 1,2,3 4: 4 It means that in the image there are 2 different regions, one made of elements that are labelled 1,2,3 and an other made of elements labelled 4. What is an easy and efficient way to resolve the equivalences and end up with something that looks like: 1: 1,2,3 2: 4 that I can use to "merge" the different connected regions belonging to the same connected component? Thanks a lot for the help!

Read the article
Algorithm to generate N random numbers between A and B which sum up to X

- by Shaamaan

This problem seemed like something which should be solvable with but a few lines of code. Unfortunately, once I actually started to write the thing, I've realized it's not as simple as it sounds. What I need is a set of X random numbers, each of which is between A and B and they all add up to X. The exact variables for the problem I'm facing seem to be even simpler: I need 5 numbers, between -1 and 1 (note: these are decimal numbers), which add up to 1. My initial "few lines of code, should be easy" approach was to randomize 4 numbers between -1 and 1 (which is simple enough), and then make the last one 1-(sum of previous numbers). This quickly proved wrong, as the last number could just as well be larger than 1 or smaller than -1. What would be the best way to approach this problem? PS. Just for reference: I'm using C#, but I don't think it matters. I'm actually having trouble creating a good enough solution for the problem in my head.

Read the article
Labeling algorithm for points

- by Qwertie

I need an algorithm to place horizontal text labels for multiple series of points on the screen (basically I need to show timestamps and other information for a history of moving objects on a map; in general there are multiple data points per object). The text labels should appear close to their points--above, below, or on the right side--but should not overlap other points or text labels. Does anyone know an algorithm/heuristic for this?

Read the article

< Previous Page | 12 13 14 15 16 17 18 19 20 21 22 23 | Next Page >