Search Results

Search found 84 results on 4 pages for 'sharding'.

Page 1/4 | 1 2 3 4  | Next Page >

  • Sharding / indexing strategy for multi-faceted search

    - by Graham
    I'm currently thinking about our database structure and how we modify it for scale. Specifically, we're thinking about using ElasticSearch to provide our search functionality. One common pattern with ElasticSearch seems to be the 'user-routing' pattern; that is, using routing to ensure that any one user's data resides on the same shard. This is great for client-specific search e.g. Gmail. Our application has a constraint such that any user will have a maximum of a few thousand documents, so this pattern seems like a good candidate. However, our search needs to work across all users, as well as targeting a specific user (so I might search my content, Alice's content, or all content). Similarly, we need to provide full-text search across any timeframe; recent months to several years ago. I'm thinking of combining the 'user-routing' and 'index-per-time-interval' patterns: I create an index for each month By default, searches are aliased against the most recent X months If no results are found, we can search against previous X months As we grow, we can reduce the interval X Each document is routed by the user ID So, this should let us do the following: search by user. This will search all indeces across 1 shard search by time. This will search ~2 indeces (by default) across all shards Is this a reasonable approach, considering we may scale to multi-million+ documents? Or should I be denormalizing the data somehow, so that user searches are performed on a totally seperate index from date searches? Thanks for any pros-cons of the above scenario.

    Read the article

  • Custom Lucene Sharding with Hibernate Search

    - by Timo Westkämper
    Has anyone experience with custom Lucene sharding / paritioning using Hibernate Search? The documentation of Hibernate Search says the following about Lucene Sharding : In some cases, it is necessary to split (shard) the indexing data of a given entity type into several Lucene indexes. This solution is not recommended unless there is a pressing need because by default, searches will be slower as all shards have to be opened for a single search. In other words don't do it until you have problems :) Has anyone implemented sharding in such a way for Hibernate Search that also queries can be target to one of the shards? In our case we have Lucene queries that should target only one shard per query.

    Read the article

  • PASS Conference 2011 Topic: Multitenant Design and Sharding with SQL Azure

    - by Herve Roggero
    I am really happy to announce that I have been accepted as a speaker at the 2011 PASS Conference in Seattle. The topic? It will be about SQL Azure scalability using shards, and the Data Federation feature of SQL Azure. I will also talk extensively about the community open-source sharding library Enzo SQL Shard (enzosqlshard.codeplex.com) and show how to make the most out of it. In general, the presentation will provide details about how to properly design an application for sharding, how to make it work for SQL Server, SQL Azure, and how to leverage the upcoming Data Federation technology that Microsoft is planning. The primary objective is to turn sharding an implementation concern, not a development concern. Using a library like Enzo SQL Shard will help you achieve this objective. If you come to PASS Summit this year, come see me and mention you saw this blog!

    Read the article

  • Rabbitmq queue sharding

    - by kolchanov
    I have to implement this scenario: An external application publish message to rabbitmq. This message has a client_id property. We can place this id to routing key or message header or some other property. I have to implement sharding in a exchange routng logic - the message should be delivered to specific queue based on the client_id range. Is it possible to implement in a standard exchanges? If not what exchange should I take as the base? How to dynamicly change client_id ranges?

    Read the article

  • Webscale is all about sharding and its coming to SQL Azure

    - by simonsabin
    There are many that joke about developers always talking about webscale and needing to shard to be able to scale. In reality many systems, if not most, don’t need to be able to scale to numerous nodes because todays processing is so powerful. However in the cloud world where you don’t have 1 big box you have many little ones (instances) you need some way of sharding/federating/distributing data and load. I’ve mentioned before of a PDC presentation on whats coming in SQL Azure, well they’ve put some...(read more)

    Read the article

  • SQL Sharding and SQL Azure…

    - by Dave Noderer
    Herve Roggero has just published a paper that outlines patterns for scaling using SQL Azure and the Blue Syntax (he and Scott Klein’s company) sharding api. You can find the paper at: http://www.bluesyntax.net/files/EnzoFramework.pdf Herve and Scott have also just released an Apress book Pro SQL Azure. The idea of being able to split (shard) database operations automatically and control them from a web based management console is very appealing. These ideas have been talked about for a long time and implemented in thousands of very custom ways that have been costly, complicated and fragile. Now, there is light at the end of the tunnel. Scaling database access will become easier and move into the mainstream of application development. The main cost is using an api whenever accessing the database. The api will direct the query to the correct database(s) which may be located locally or in the cloud. It is inevitable that the api will change in the future, perhaps incorporated into a Microsoft offering. Even if this is the case, your application has now been architected to utilize these patterns and details of the actual api will be less important. Herve does a great job of laying out the concepts which every developer and architect should be familiar with!

    Read the article

  • mySQL & Relational databases: How to handle sharding/splitting on application level?

    - by Industrial
    Hi everybody, I have thought a bit about sharding tables, since partitioning cannot be done with foreign keys in a mySQL table. Maybe there's an option to switch to a different relational database that features both, but I don't see that as an option right now. So, the sharding idea seems like a pretty decent thing. But, what's a good approach to do this on a application level? I am guessing that a take-off point would be to prefix tables with a max value for the primary key in each table. Something like products_4000000 , products_8000000 and products_12000000. Then the application would have to check with a simple if-statement the size of the id (PK) that will be requested is smaller then four, eight or twelve million before doing any actual database calls. So, is this a step in the right direction or are we doing something really stupid?

    Read the article

  • How to distribute a unique database already in production?

    - by JVerstry
    Let's assume a successful web spring application running on a MySql or PostGre kind of database. The traffic is becoming so high and the amount of data is becoming so big that a distributed dataase solution needs to be implemented. It is a scalability issue. Let's assume this application is using Hibernate and the data access layer is cleanly separated with DAO objects. What would be the best strategy to scale this database? Does anyone have hands on experience to share? Is it possible to minimize sharding code (Shard) in the application? Ideally, one should be able to add or remove databases easily. A failback solution is welcome too. I am not looking for you could go for sharding or you could go no sql kind of answers. I am looking for deeper answers from people with experience.

    Read the article

  • How do you deal with denormalization / secondary indexes in database sharding?

    - by Continuation
    Say I have a "message" table with 2 secondary indexes: "recipient_id" "sender_id" I want to shard the "message" table by "recipient_id". That way to retrieve all messages sent to a certain recipient I only need to query one shard. But at the same time, I want to be able to make a query that ask for all messages sent by a certain sender. Now I don't want to send that query to every single shard of the "message" table. One way to do this is to duplicate the data and have a "message_by_sender" table sharded by "sender_id". The problem with that approach is that every time a message has been sent, I need to insert the message into both "message" and "message_by_sender" tables. But what if after inserting into "message" the insertion into "message_by_sender" fail? In that case the message exists in "message" but not in "message_by_sender". How do I make sure that if a message exists in "message" then it also exists in "message_by_sender" without resorting to 2 phase commit? This must be a very common issue for anyone who shards their databases. How do you deal woth it?

    Read the article

  • High-concurrency counters without sharding

    - by dound
    This question concerns two implementations of counters which are intended to scale without sharding (with a tradeoff that they might under-count in some situations): http://appengine-cookbook.appspot.com/recipe/high-concurrency-counters-without-sharding/ (the code in the comments) http://blog.notdot.net/2010/04/High-concurrency-counters-without-sharding My questions: With respect to #1: Running memcache.decr() in a deferred, transactional task seems like overkill. If memcache.decr() is done outside the transaction, I think the worst-case is the transaction fails and we miss counting whatever we decremented. Am I overlooking some other problem that could occur by doing this? What are the significiant tradeoffs between the two implementations? Here are the tradeoffs I see: #2 does not require datastore transactions. To get the counter's value, #2 requires a datastore fetch while with #1 typically only needs to do a memcache.get() and memcache.add(). When incrementing a counter, both call memcache.incr(). Periodically, #2 adds a task to the task queue while #1 transactionally performs a datastore get and put. #1 also always performs memcache.add() (to test whether it is time to persist the counter to the datastore). Conclusions (without actually running any performance tests): #1 should typically be faster at retrieving a counter (#1 memcache vs #2 datastore). Though #1 has to perform an extra memcache.add() too. However, #2 should be faster when updating counters (#1 datastore get+put vs #2 enqueue a task). On the other hand, with #1 you have to be a bit more careful with the update interval since the task queue quota is almost 100x smaller than either the datastore or memcahce APIs.

    Read the article

  • How can I distribute a unique database already in production?

    - by JVerstry
    Let's assume a successful web Spring application running on a MySQL or PostgreSQL database. The traffic is becoming so high and the amount of data is becoming so big that a distributed database solution needs to be implemented to address scalability issue. Let's also assume this application is using Hibernate and the data access layer is cleanly separated with DAOs. Ideally, one should be able to add or remove databases easily. A failback solution is welcome too. What would be the best strategy to scale this database? Is it possible to minimize sharding code (Shard) in the application?

    Read the article

  • Rename Mongo Shard

    - by HeySteve
    Can I, and if I can, how can I rename a shard in Mongo? Like if I wanted to change the instances of rs0 to rep0 below: mongos> sh.status() --- Sharding Status --- sharding version: { "_id" : 1, "version" : 4, "minCompatibleVersion" : 4, "currentVersion" : 5, "clusterId" : ObjectId("111111111111") } shards: { "_id" : "rs0", "host" : "rs0/mongo0a:27017,mongo0b:27017" } ... I have thought about removing and re-adding the shard, but I'm not sure how I'd do that without having to drain the shard and drop dbs. Currently 0 of the collections have sharding enabled, I just have a few standalones added as shards. Thanks

    Read the article

  • Replicated MongoDB server slower than simple shards

    - by displayName
    I tried to compare the performance of a sharded configuration against a sharded and replicated configuration. The sharded configuration consists of 8 shards each running on three different machines thereby constituting a total of 24 shards. All 8 of these shards run in the same partition on each machine. The sharded and replicated version is 8 shards again just like plain sharding, and all 8 mongods run on the same partition in each machine. But apart from this, each of these three machine now run additional 16 threads on another partition which serve as the secondary for the 8 mongods running on other machines. This is the way I prepared a sharded and replicated configuration with data chunks having replication factor of 3. Important point to note is that once the data has been loaded, it is not modified. So after primary and secondaries have synchronized then it doesn't matter which one i read from. To run the queries, I use an entirely different machine (let's call it config) which runs mongos and this machine's only purpose is to receive queries and run them on the cluster. Contrary to my expectations, plain sharding of 8 threads on each machine (total = 3 * 8 = 24) is performing better for queries than the sharded + replicated configuration. I have a script written to perform the query. So in order to time the scripts, I use time ./testScript and see the result. I tried changing the reading preference for replicated cluster by logging to mongo of config and run db.getMongo().setReadPref('secondary') and then exit the shell and run the queries like time ./testScript. The questions are: Where am i going wrong in the replication? Why is it slower than its plain sharding version? Does the db.getMongo().ReadPref('secondary') persist when i leave the shell and try to perform the query? All the four machines are running Linux and i have already increased the ulimit -n to 2048 from initial value of 1024 to allow more connections. The collections are properly distributed and all the mongods have equal number of chunks. Goes without saying that indices in both configurations are the same.

    Read the article

  • multitenancy with some data sharing

    - by user55108
    I'm in the planning stages of a new webapp, and I am leaning strongly toward a multitenancy model. The app has a file storage function, where the user can upload (and operate on) files. I would like the ability of the user to share these files, however. How is this typically accomplished in a multi-tenant model? The example would be something like google docs. Each user has their own files; they can edit and tag and build collections with these files. Then, they can share a doc or a collection with someone else for collaboration. If every user has their own Database and tables, what strategy would one use to allow this kind of sharing while minimizing duplication of files and associated metadata?

    Read the article

  • How would I template an SLS using saltstack

    - by user180041
    I'm trying to do proof of concept with Mongodb(sharding) and Id like to run a command every time I spin up a new cluster without having to add lines in all my sls files. My current init is as follows: mongo Replica4:27000 /usr/lib/mongo/init_addshard.js: cmd: - run - user: present The word Replica4 is not templated id like to know a way I would be able to do so, that way when I spin up a new cluster I wouldn't have to touch anything in this file.

    Read the article

  • postgreSQL vs Cassandra vs MongoDB vs Voldemart ?

    - by ramonrails
    Which database to decide upon? Any comparisions? Existing: postgresql Issues Not easily scalable horizontal. Needs sharding etc Clustering does not solve the data growth problem Looking for: Any database that is easily horizontally scalable Cassandra (Twitter uses that?) MongoDB (rapidly gaining popularity) Voldemart Other? Why? Data growing with snowball effect existing postgresql locks table etc for vaccuum tasks periodically Archiving data is tideous currently Human interaction involved in existing archive, vaccuum, ... process periodically Need a 'set it. forget it. just add another server when data grows more.' type of solution

    Read the article

  • Implementing database redundancy with sharded tables

    - by ensnare
    We're looking to implement load balancing by horizontally sharding our tables across a cluster of servers. What are some options to implement live redundancy should a server fail? Would it be effective to do (2) INSERTS instead of one ... one to the target shard, and another to a secondary shard which could be accessed should the primary shard not respond? Or is there a better way? Thanks.

    Read the article

  • Run a MongoDB configuration server without 3GB of journal files

    - by Thilo
    For a production sharded MongoDB installation we need 3 configuration servers. According to the documentation "the config server mongod process is fairly lightweight and can be ran on machines performing other work". However, in the default configuration, they all have journalling enabled, and with preallocation this takes up 3 GB of disk space. I assume that the actual data and transaction volume of a config server is quite small, so that this seems a bit too much. Is there a way to (safely!) run these config servers with much less disk use for the journal? Do I need journalling at all on config servers? Can I set the journal size to be smaller?

    Read the article

  • I have to shard a mysql database. I want to start with 12 shards on 2 machines. What is the best w

    - by Tim
    All tables are InnoDb. I would rather not use mysqldump, because the shard sizes will be about 200 GB (about 700 million rows), and that will take too long. I was hoping to just stop mysql for an hour, copy the data files to a new machine, and start back up. But you can't do this with InnoDb, as some data is in the shared tablespace. Even if I have the innodb_file_per_table option set. This is not a website, but a custom application, used by tens of thousands right now, so uptime and performance are important. I suppose I could add logic into my server application to allow for gradual rebalancing / moving of a shard. Does anyone have a better idea?

    Read the article

  • Mongodb Slave replication lag

    - by Leonid Bugaev
    We using standard mongo setup: 2 replicas + 1 arbiter. Both replica servers use same AWS m1.medium with RAID10 EBS. We experiencing constantly growing replication lag on secondary replica. I tried to do full-resync, you can see it on graph, but it helped only for some hours. Our mongo usage is really low now, and frankly i can't understan why it can be. iostat 1 for secondary: avg-cpu: %user %nice %system %iowait %steal %idle 80.39 0.00 2.94 0.00 16.67 0.00 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn xvdap1 0.00 0.00 0.00 0 0 xvdb 0.00 0.00 0.00 0 0 xvdfp4 12.75 0.00 189.22 0 193 xvdfp3 12.75 0.00 189.22 0 193 xvdfp2 7.84 0.00 40.20 0 41 xvdfp1 7.84 0.00 40.20 0 41 md127 19.61 0.00 219.61 0 224 mongostat for secondary (why 100% locks? i guess its the problem): insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time *10 *0 *16 *0 0 2|4 0 30.9g 62.4g 1.65g 0 107 0 0|0 0|0 198b 1k 16 replset-01 SEC 06:55:37 *4 *0 *8 *0 0 12|0 0 30.9g 62.4g 1.65g 0 91.7 0 0|0 0|0 837b 5k 16 replset-01 SEC 06:55:38 *4 *0 *7 *0 0 3|0 0 30.9g 62.4g 1.64g 0 110 0 0|0 0|0 342b 1k 16 replset-01 SEC 06:55:39 *4 *0 *8 *0 0 1|0 0 30.9g 62.4g 1.64g 0 82.9 0 0|0 0|0 62b 1k 16 replset-01 SEC 06:55:40 *3 *0 *7 *0 0 5|0 0 30.9g 62.4g 1.6g 0 75.2 0 0|0 0|0 466b 2k 16 replset-01 SEC 06:55:41 *4 *0 *7 *0 0 1|0 0 30.9g 62.4g 1.64g 0 138 0 0|0 0|1 62b 1k 16 replset-01 SEC 06:55:42 *7 *0 *15 *0 0 3|0 0 30.9g 62.4g 1.64g 0 95.4 0 0|0 0|0 342b 1k 16 replset-01 SEC 06:55:43 *7 *0 *14 *0 0 1|0 0 30.9g 62.4g 1.64g 0 98 0 0|0 0|0 62b 1k 16 replset-01 SEC 06:55:44 *8 *0 *17 *0 0 3|0 0 30.9g 62.4g 1.64g 0 96.3 0 0|0 0|0 342b 1k 16 replset-01 SEC 06:55:45 *7 *0 *14 *0 0 3|0 0 30.9g 62.4g 1.64g 0 96.1 0 0|0 0|0 186b 2k 16 replset-01 SEC 06:55:46 mongostat for primary insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time 12 30 20 0 0 3 0 30.9g 62.6g 641m 0 0.9 0 0|0 0|0 212k 619k 48 replset-01 M 06:56:41 5 17 10 0 0 2 0 30.9g 62.6g 641m 0 0.5 0 0|0 0|0 159k 429k 48 replset-01 M 06:56:42 9 22 16 0 0 3 0 30.9g 62.6g 642m 0 0.7 0 0|0 0|0 158k 276k 48 replset-01 M 06:56:43 6 18 12 0 0 2 0 30.9g 62.6g 640m 0 0.7 0 0|0 0|0 93k 231k 48 replset-01 M 06:56:44 6 12 8 0 0 3 0 30.9g 62.6g 640m 0 0.3 0 0|0 0|0 80k 125k 48 replset-01 M 06:56:45 8 21 14 0 0 9 0 30.9g 62.6g 641m 0 0.6 0 0|0 0|0 118k 419k 48 replset-01 M 06:56:46 10 34 20 0 0 6 0 30.9g 62.6g 640m 0 1.3 0 0|0 0|0 164k 527k 48 replset-01 M 06:56:47 6 21 13 0 0 2 0 30.9g 62.6g 641m 0 0.7 0 0|0 0|0 111k 477k 48 replset-01 M 06:56:48 8 21 15 0 0 2 0 30.9g 62.6g 641m 0 0.7 0 0|0 0|0 204k 336k 48 replset-01 M 06:56:49 4 12 8 0 0 8 0 30.9g 62.6g 641m 0 0.5 0 0|0 0|0 156k 530k 48 replset-01 M 06:56:50 Mongo version: 2.0.6

    Read the article

  • Splitting a MySQL DB in two may ease server from "Too many connetions"? I don't think so

    - by Petruza
    I was requested to split a MySQL in two, it's kind of a horizontal partition, in which some rows correspond to one site, and some other correspond to another site. But they want to split it in two DBs in the same MySQL server. I'm no DB expert but I guess keeping them in the same MySQL server with the same amount of memory and processor and the same platform won't improve things. What we're trying to avoid is the "Too many connections" problem.

    Read the article

  • Is it OK to re-create many SQL connections (SQL 2008)

    - by Mr. Flibble
    When performing many inserts into a database I would usually have code like this: using (var connection = new SqlConnection(connStr)) { connection.Open(); foreach (var item in items) { var cmd = new SqlCommand("INSERT ...") cmd.ExecuteNonQuery(); } } I now want to shard the database and therefore need to choose the connection string based on the item being inserted. This would make my code run more like this foreach (var item in items) { connStr = GetConnectionString(item); using (var connection = new SqlConnection(connStr)) { connection.Open(); var cmd = new SqlCommand("INSERT ...") cmd.ExecuteNonQuery(); } } Which basically means it's creating a new connection to the database for each item. Will this work or will recreating connections for each insert cause terrible overhead?

    Read the article

1 2 3 4  | Next Page >