Search Results

Search found 813 results on 33 pages for 'concurrency'.

Page 18/33 | < Previous Page | 14 15 16 17 18 19 20 21 22 23 24 25 | Next Page >

How to reduce celeryd memory consumption?

- by Gringo Suave

I'm using celery 2.5.1 with django on a micro ec2 instance with 613mb memory and as such have to keep memory consumption down. Currently I'm using it only for the scheduler "celery beat" as a web interface to cron, though I hope to use it for more in the future. I've noticed it is the biggest consumer of memory on my micro machine even though I have configured the number of workers to one. I don't have many other options set in settings.py: import djcelery djcelery.setup_loader() BROKER_BACKEND = 'djkombu.transport.DatabaseTransport' CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler' CELERY_RESULT_BACKEND = 'database' BROKER_POOL_LIMIT = 2 CELERYD_CONCURRENCY = 1 CELERY_DISABLE_RATE_LIMITS = True CELERYD_MAX_TASKS_PER_CHILD = 20 CELERYD_SOFT_TASK_TIME_LIMIT = 5 * 60 CELERYD_TASK_TIME_LIMIT = 6 * 60 Here's the details via top: PID USER NI CPU% VIRT SHR RES MEM% Command 1065 wuser 10 0.0 283M 4548 85m 14.3 python manage_prod.py celeryd --beat 1025 wuser 10 1.0 577M 6368 67m 11.2 python manage_prod.py celeryd --beat 1071 wuser 10 0.0 578M 2384 62m 10.6 python manage_prod.py celeryd --beat That's about 214mb of memory (and not much shared) to run a cron job occasionally. Have I done anything wrong, or can this be reduced about ten-fold somehow? ;) Update: here's my upstart config: description "Celery Daemon" start on (net-device-up and local-filesystems) stop on runlevel [016] nice 10 respawn respawn limit 5 10 chdir /home/wuser/wuser/ env CELERYD_OPTS=--concurrency=1 exec sudo -u wuser -H /usr/bin/python manage_prod.py celeryd --beat --concurrency=1 --loglevel info --logfile /var/tmp/celeryd.log Update 2: I notice there is one root process, one user child process, and two grandchildren from that. So I think it isn't a matter of duplicate startup. root 34580 1556 sudo -u wuser -H /usr/bin/python manage_prod.py celeryd wuser 577M 67548 +- python manage_prod.py celeryd --beat --concurrency=1 wuser 578M 63784 +- python manage_prod.py celeryd --beat --concurrency=1 wuser 271M 76260 +- python manage_prod.py celeryd --beat --concurrency=1

Read the article
Apache Bench length failures

- by Laurens

I am running Apache Bench against a Ruby on Rails XML-RPC web service that is running on Passenger via mod_passenger. All is fine when I run 1000 requests without concurrency. Bench indicates that all requests successfully complete with no failures. When I run Bench again with a concurrency level of 2, however, requests start to fail due to content length. I am seeing failures rates of 70-80% when using concurrency. This should not happen. The requests I am sending to the web service should always results in the same response. I have used cURL to verify that this is in fact the case. My Rails log is not showing any errors as well so I am curious to see what content Bench actually received and interpreted as a failure. Is there any way to print these failures?

Read the article
Concurrent Affairs

- by Tony Davis

I once wrote an editorial, multi-core mania, on the conundrum of ever-increasing numbers of processor cores, but without the concurrent programming techniques to get anywhere near exploiting their performance potential. I came to the.controversial.conclusion that, while the problem loomed for all procedural languages, it was not a big issue for the vast majority of programmers. Two years later, I still think most programmers don't concern themselves overly with this issue, but I do think that's a bigger problem than I originally implied. Firstly, is the performance boost from writing code that can fully exploit all available cores worth the cost of the additional programming complexity? Right now, with quad-core processors that, at best, can make our programs four times faster, the answer is still no for many applications. But what happens in a few years, as the number of cores grows to 100 or even 1000? At this point, it becomes very hard to ignore the potential gains from exploiting concurrency. Possibly, I was optimistic to assume that, by the time we have 100-core processors, and most applications really needed to exploit them, some technology would be around to allow us to do so with relative ease. The ideal solution would be one that allows programmers to forget about the problem, in much the same way that garbage collection removed the need to worry too much about memory allocation. From all I can find on the topic, though, there is only a remote likelihood that we'll ever have a compiler that takes a program written in a single-threaded style and "auto-magically" converts it into an efficient, correct, multi-threaded program. At the same time, it seems clear that what is currently the most common solution, multi-threaded programming with shared memory, is unsustainable. As soon as a piece of state can be changed by a different thread of execution, the potential number of execution paths through your program grows exponentially with the number of threads. If you have two threads, each executing n instructions, then there are 2^n possible "interleavings" of those instructions. Of course, many of those interleavings will have identical behavior, but several won't. Not only does this make understanding how a program works an order of magnitude harder, but it will also result in irreproducible, non-deterministic, bugs. And of course, the problem will be many times worse when you have a hundred or a thousand threads. So what is the answer? All of the possible alternatives require a change in the way we write programs and, currently, seem to be plagued by performance issues. Software transactional memory (STM) applies the ideas of database transactions, and optimistic concurrency control, to memory. However, working out how to break down your program into sufficiently small transactions, so as to avoid contention issues, isn't easy. Another approach is concurrency with actors, where instead of having threads share memory, each thread runs in complete isolation, and communicates with others by passing messages. It simplifies concurrent programs but still has performance issues, if the threads need to operate on the same large piece of data. There are doubtless other possible solutions that I haven't mentioned, and I would love to know to what extent you, as a developer, are considering the problem of multi-core concurrency, what solution you currently favor, and why. Cheers, Tony.

Read the article
SYN receives RST,ACK very frequently

- by user1289508

Hi Socket Programming experts, I am writing a proxy server on Linux for SQL Database server running on Windows. The proxy is coded using bsd sockets and in C, and it is working just fine. When I use a database client (written in JAVA, and running on a Linux box) to fire queries (with a concurrency of 100 or more) directly to the Database server, not experiencing connection resets. But through my proxy I am experiencing many connection resets. Digging deeper I came to know that connection from 'DB client' to 'Proxy' always succeeds but when the 'Proxy' tries to connect to the DB server the connection fails, due to the SYN packet getting RST,ACK. That was to give some background. The question is : Why does sometimes SYN receives RST,ACK? 'DB client(linux)' to 'Server(windows)' ---- Works fine 'DB client(linux) to 'Proxy(Linux)' to 'Server(windows)' ----- problematic I am aware that this can happen in "connection refused" case but this definitely is not that one. SYN flooding might be another scenario, but that does not explain fine behavior while firing to Server directly. I am suspecting some socket option setting may be required, that the client does before connecting and my proxy does not. Please put some light on this. Any help (links or pointers) is most appreciated. Additional info: Wrote a C client that does concurrent connections, which takes concurrency as an argument. Here are my observations: - At 5000 concurrency and above, some connects failed with 'connection refused'. - Below 2000, it works fine. But the actual problem is observed even at a concurrency of 100 or more. Note: The problem is time dependent sometimes it never comes at all and sometimes it is very frequent and DB client (directly to server) works fine at all times .

Read the article
Apache2 benchmarks - very poor performance

- by andrzejp

I have two servers on which I test the configuration of apache2. The first server: 4GB of RAM, AMD Athlon (tm) 64 X2 Dual Core Processor 5600 + Apache 2.2.3, mod_php, mpm prefork: Settings: Timeout 100 KeepAlive On MaxKeepAliveRequests 150 KeepAliveTimeout 4 <IfModule Mpm_prefork_module> StartServers 7 MinSpareServers 15 MaxSpareServers 30 MaxClients 250 MaxRequestsPerChild 2000 </ IfModule> Compiled in modules: core.c mod_log_config.c mod_logio.c prefork.c http_core.c mod_so.c Second server: 8GB of RAM, Intel (R) Core (TM) i7 CPU [email protected] Apache 2.2.9, **fcgid, mpm worker, suexec** PHP scripts are running via fcgi-wrapper Settings: Timeout 100 KeepAlive On MaxKeepAliveRequests 100 KeepAliveTimeout 4 <IfModule Mpm_worker_module> StartServers 10 MaxClients 200 MinSpareThreads 25 MaxSpareThreads 75 ThreadsPerChild 25 MaxRequestsPerChild 1000 </ IfModule> Compiled in modules: core.c mod_log_config.c mod_logio.c worker.c http_core.c mod_so.c The following test results, which are very strange! New server (dynamic content - php via fcgid+suexec): Server Software: Apache/2.2.9 Server Hostname: XXXXXXXX Server Port: 80 Document Path: XXXXXXX Document Length: 179512 bytes Concurrency Level: 10 Time taken for tests: 0.26276 seconds Complete requests: 1000 Failed requests: 0 Total transferred: 179935000 bytes HTML transferred: 179512000 bytes Requests per second: 38.06 Transfer rate: 6847.88 kb/s received Connnection Times (ms) min avg max Connect: 2 4 54 Processing: 161 257 449 Total: 163 261 503 Old server (dynamic content - mod_php): Server Software: Apache/2.2.3 Server Hostname: XXXXXX Server Port: 80 Document Path: XXXXXX Document Length: 187537 bytes Concurrency Level: 10 Time taken for tests: 173.073 seconds Complete requests: 1000 Failed requests: 22 (Connect: 0, Length: 22, Exceptions: 0) Total transferred: 188003372 bytes HTML transferred: 187546372 bytes Requests per second: 5777.91 Transfer rate: 1086267.40 kb/s received Connnection Times (ms) min avg max Connect: 3 3 28 Processing: 298 1724 26615 Total: 301 1727 26643 Old server: Static content (jpg file) Server Software: Apache/2.2.3 Server Hostname: xxxxxxxxx Server Port: 80 Document Path: /images/top2.gif Document Length: 40486 bytes Concurrency Level: 100 Time taken for tests: 3.558 seconds Complete requests: 1000 Failed requests: 0 Write errors: 0 Total transferred: 40864400 bytes HTML transferred: 40557482 bytes Requests per second: 281.09 [#/sec] (mean) Time per request: 355.753 [ms] (mean) Time per request: 3.558 [ms] (mean, across all concurrent requests) Transfer rate: 11217.51 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 3 11 4.5 12 23 Processing: 40 329 61.4 339 1009 Waiting: 6 282 55.2 293 737 Total: 43 340 63.0 351 1020 New server - static content (jpg file) Server Software: Apache/2.2.9 Server Hostname: XXXXX Server Port: 80 Document Path: /images/top2.gif Document Length: 40486 bytes Concurrency Level: 100 Time taken for tests: 3.571531 seconds Complete requests: 1000 Failed requests: 0 Write errors: 0 Total transferred: 41282792 bytes HTML transferred: 41030080 bytes Requests per second: 279.99 [#/sec] (mean) Time per request: 357.153 [ms] (mean) Time per request: 3.572 [ms] (mean, across all concurrent requests) Transfer rate: 11287.88 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 2 63 24.8 66 119 Processing: 124 278 31.8 282 391 Waiting: 3 70 28.5 66 164 Total: 126 341 35.9 350 443 I noticed that in the apache error.log is a lot of entries: [notice] mod_fcgid: call /www/XXXXX/public_html/forum/index.php with wrapper /www/php-fcgi-scripts/XXXXXX/php-fcgi-starter What I have omitted, or do not understand? Such a difference in requests per second? Is it possible? What could be the cause?

Read the article
Is there a quasi-standard set of attributes to annotate thread safety, immutability etc.?

- by Eugene Beresovksy

Except for a blog post here and there, describing the custom attributes someone created, but that do not seem to get any traction - like one describing how to enforce immutability, another one on Documenting Thread Safety, modeling the attributes after JCIP annotations - is there any standard emerging? Anything MS might be planning for the future? This is something that should be standard, if there's to be any chance of interoperability between libraries concurrency-wise. Both for documentation purposes, and also to feed static / dynamic test tools. If MS isn't doing anything in that direction, it could be done on CodePlex - but I couldn't find anything there, either. <opinion>Concurrency and thread safety are really hard in imperative and object-languages like C# and Java, we should try to tame it, until we hopefully switch to more appropriate languages.</opinion>

Read the article
Utility Objects–Waitfor Delay Coordinator (SQL Server 2008+)

- by drsql

Finally… took longer than I had expected when I wrote this a while back, but I had to move my website and get DNS moved before I could post code… When I write code, I do my best to test that code in as many ways as necessary. One of the last types of tests that is necessary is concurrency testing. Concurrency testing is one of the most difficult types of testing because it takes running multiple processes simultaneously and making sure that you get the correct answers multiple times. This is really...(read more)

Read the article
How can I designed multi-threaded application for larger user base

- by rokonoid

Here's my scenario, I need to develop a scalable application. My user base may be over 100K, every user has 10 or more small tasks. Tasks check every minute with a third party application through web services, retrieving data which is written in the database, then the data is forwarded to it's original destination. So here's the processing of a small task: while(true){ Boolean isNewInformationAvailable = checkWhetherInformationIsAvailableOrNot(); If(isNewInformationAvailable ==true){ fetchTheData(); writeToDatabase(); findTheDestination(); deliverTheData(); isAvailable =false; } } Here is the question: as the application is large, how should I approach designing this. I'm going to use Java to write it. Should I use concurrency, and how would you manage the concurrency?

Read the article
How can I design multi-threaded application for larger user base

- by rokonoid

Here's my scenario, I need to develop a scalable application. My user base may be over 100K, every user has 10 or more small tasks. Tasks check every minute with a third party application through web services, retrieving data which is written in the database, then the data is forwarded to it's original destination. So here's the processing of a small task: while(true){ Boolean isNewInformationAvailable = checkWhetherInformationIsAvailableOrNot(); If(isNewInformationAvailable ==true){ fetchTheData(); writeToDatabase(); findTheDestination(); deliverTheData(); isAvailable =false; } } As the application is large, how should I approach designing this? I'm going to use Java to write it. Should I use concurrency, and how would you manage the concurrency?

Read the article
Performance issues when using SSD for a developer notebook (WAMP/LAMP stack)?

- by András Szepesházi

I'm a web application developer using my notebook as a standalone development environment (WAMP stack). I just switched from a Core2-duo Vista 32 bit notebook with 2Gb RAM and SATA HDD, to an i5-2520M Win7 64 bit with 4Gb RAM and 128 GB SDD (Corsair P3 128). My initial experience was what I expected, fast boot, quick load of all the applications (Eclipse takes now 5 seconds as opposed to 30s on my old notebook), overall great experience. Then I started to build up my development stack, both as LAMP (using VirtualBox with a debian guest) and WAMP (windows native apache + mysql + php). I wanted to compare those two. This still all worked great out, then I started to pull in my projects to these stacks. And here came the nasty surprise, one of those projects produced a lot worse response times than on my old notebook (that was true for both the VirtualBox and WAMP stack). Apache, php and mysql configurations were practically identical in all environments. I started to do a lot of benchmarking and profiling, and here is what I've found: All general benchmarks (Performance Test 7.0, HDTune Pro, wPrime2 and some more) gave a big advantage to the new notebook. Nothing surprising here. Disc specific tests showed that read/write operations peaked around 380M/160M for the SSD, and all the different sized block operations also performed very well. Started apache performance benchmarking with Apache Benchmark for a small static html file (10 concurrent threads, 500 iterations). Old notebook: min 47ms, median 111ms, max 156ms New WAMP stack: min 71ms, median 135ms, max 296ms New LAMP stack (in VirtualBox): min 6ms, median 46ms, max 175ms Right here I don't get why the native WAMP stack performed so bad, but at least the LAMP environment brought the expected speed. Apache performance measurement for non-cached php content. The php runs a loop of 1000 and generates sha1(uniqid()) inisde. Again, 10 concurrent threads, 500 iterations were used for the benchmark. Old notebook: min 0ms, median 39ms, max 218ms New WAMP stack: min 20ms, median 61ms, max 186ms New LAMP stack (in VirtualBox): min 124ms, median 704ms, max 2463ms What the hell? The new LAMP performed miserably, and even the new native WAMP was outperformed by the old notebook. php + mysql test. The test consists of connecting to a database and reading a single record form a table using INNER JOIN on 3 more (indexed) tables, repeated 100 times within a loop. Databases were identical. 10 concurrent threads, 100 iterations were used for the benchmark. Old notebook: min 1201ms, median 1734ms, max 3728ms New WAMP stack: min 367ms, median 675ms, max 1893ms New LAMP stack (in VirtualBox): min 1410ms, median 3659ms, max 5045ms And the same test with concurrency set to 1 (instead of 10): Old notebook: min 1201ms, median 1261ms, max 1357ms New WAMP stack: min 399ms, median 483ms, max 539ms New LAMP stack (in VirtualBox): min 285ms, median 348ms, max 444ms Strictly for my purposes, as I'm using a self contained development environment (= low concurrency) I could be satisfied with the second test's result. Though I have no idea why the VirtualBox environment performed so bad with higher concurrency. Finally I performed a test of including many php files. The application that I mentioned at the beginning, the one that was performing so bad, has a heavy bootstrap, loads hundreds of small library and configuration files while initializing. So this test does nothing else just includes about 100 files. Concurrency set to 1, 100 iterations: Old notebook: min 140ms, median 168ms, max 406ms New WAMP stack: min 434ms, median 488ms, max 604ms New LAMP stack (in VirtualBox): min 413ms, median 1040ms, max 1921ms Even if I consider that VirtualBox reached those files via shared folders, and that slows things down a bit, I still don't see how could the old notebook outperform so heavily both new configurations. And I think this is the real root of the slow performance, as the application uses even more includes, and the whole bootstrap will occur several times within a page request (for each ajax call, for example). To sum it up, here I am with a brand new high-performance notebook that loads the same page in 20 seconds, that my old notebook can do in 5-7 seconds. Needless to say, I'm not a very happy person right now. Why do you think I experience these poor performance values? What are my options to remedy this situation?

Read the article
OSError : [Errno 38] Function not implemented - Django Celery implementation

- by Jordan Messina

I installed django-celery and I tried to start up the worker server but I get an OSError that a function isn't implemented. I'm running CentOS release 5.4 (Final) on a VPS: . broker -> amqp://guest@localhost:5672/ . queues -> . celery -> exchange:celery (direct) binding:celery . concurrency -> 4 . loader -> djcelery.loaders.DjangoLoader . logfile -> [stderr]@WARNING . events -> OFF . beat -> OFF [2010-07-22 17:10:01,364: WARNING/MainProcess] Traceback (most recent call last): [2010-07-22 17:10:01,364: WARNING/MainProcess] File "manage.py", line 11, in <module> [2010-07-22 17:10:01,364: WARNING/MainProcess] execute_manager(settings) [2010-07-22 17:10:01,364: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/django/core/management/__init__.py", line 438, in execute_manager [2010-07-22 17:10:01,364: WARNING/MainProcess] utility.execute() [2010-07-22 17:10:01,364: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/django/core/management/__init__.py", line 379, in execute [2010-07-22 17:10:01,365: WARNING/MainProcess] self.fetch_command(subcommand).run_from_argv(self.argv) [2010-07-22 17:10:01,365: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/django/core/management/base.py", line 191, in run_from_argv [2010-07-22 17:10:01,365: WARNING/MainProcess] self.execute(*args, **options.__dict__) [2010-07-22 17:10:01,365: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/django/core/management/base.py", line 218, in execute [2010-07-22 17:10:01,365: WARNING/MainProcess] output = self.handle(*args, **options) [2010-07-22 17:10:01,365: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/django_celery-2.0.0-py2.6.egg/djcelery/management/commands/celeryd.py", line 22, in handle [2010-07-22 17:10:01,366: WARNING/MainProcess] run_worker(**options) [2010-07-22 17:10:01,366: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/celery-2.0.1-py2.6.egg/celery/bin/celeryd.py", line 385, in run_worker [2010-07-22 17:10:01,366: WARNING/MainProcess] return Worker(**options).run() [2010-07-22 17:10:01,366: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/celery-2.0.1-py2.6.egg/celery/bin/celeryd.py", line 218, in run [2010-07-22 17:10:01,366: WARNING/MainProcess] self.run_worker() [2010-07-22 17:10:01,366: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/celery-2.0.1-py2.6.egg/celery/bin/celeryd.py", line 312, in run_worker [2010-07-22 17:10:01,367: WARNING/MainProcess] worker.start() [2010-07-22 17:10:01,367: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/celery-2.0.1-py2.6.egg/celery/worker/__init__.py", line 206, in start [2010-07-22 17:10:01,367: WARNING/MainProcess] component.start() [2010-07-22 17:10:01,367: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/celery-2.0.1-py2.6.egg/celery/concurrency/processes/__init__.py", line 54, in start [2010-07-22 17:10:01,367: WARNING/MainProcess] maxtasksperchild=self.maxtasksperchild) [2010-07-22 17:10:01,367: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/celery-2.0.1-py2.6.egg/celery/concurrency/processes/pool.py", line 448, in __init__ [2010-07-22 17:10:01,368: WARNING/MainProcess] self._setup_queues() [2010-07-22 17:10:01,368: WARNING/MainProcess] File "/usr/local/lib/python2.6/site-packages/celery-2.0.1-py2.6.egg/celery/concurrency/processes/pool.py", line 564, in _setup_queues [2010-07-22 17:10:01,368: WARNING/MainProcess] self._inqueue = SimpleQueue() [2010-07-22 17:10:01,368: WARNING/MainProcess] File "/usr/local/lib/python2.6/multiprocessing/queues.py", line 315, in __init__ [2010-07-22 17:10:01,368: WARNING/MainProcess] self._rlock = Lock() [2010-07-22 17:10:01,368: WARNING/MainProcess] File "/usr/local/lib/python2.6/multiprocessing/synchronize.py", line 117, in __init__ [2010-07-22 17:10:01,369: WARNING/MainProcess] SemLock.__init__(self, SEMAPHORE, 1, 1) [2010-07-22 17:10:01,369: WARNING/MainProcess] File "/usr/local/lib/python2.6/multiprocessing/synchronize.py", line 49, in __init__ [2010-07-22 17:10:01,369: WARNING/MainProcess] sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue) [2010-07-22 17:10:01,369: WARNING/MainProcess] OSError [2010-07-22 17:10:01,369: WARNING/MainProcess] : [2010-07-22 17:10:01,369: WARNING/MainProcess] [Errno 38] Function not implemented Am I just totally screwed and should use a new kernel that has this implemented or is there an easy way to resolve this?

Read the article
Yet Another ASP.NET MVC CRUD Tutorial

- by Ricardo Peres

I know that I have not posted much on MVC, mostly because I don’t use it on my daily life, but since I find it so interesting, and since it is gaining such popularity, I will be talking about it much more. This time, it’s about the most basic of scenarios: CRUD. Although there are several ASP.NET MVC tutorials out there that cover ordinary CRUD operations, I couldn’t find any that would explain how we can have also AJAX, optimistic concurrency control and validation, using Entity Framework Code First, so I set out to write one! I won’t go into explaining what is MVC, Code First or optimistic concurrency control, or AJAX, I assume you are all familiar with these concepts by now. Let’s consider an hypothetical use case, products. For simplicity, we only want to be able to either view a single product or edit this product. First, we need our model: 1: public class Product 2: { 3: public Product() 4: { 5: this.Details = new HashSet<OrderDetail>(); 6: } 7: 8: [Required] 9: [StringLength(50)] 10: public String Name 11: { 12: get; 13: set; 14: } 15: 16: [Key] 17: [ScaffoldColumn(false)] 18: [DatabaseGenerated(DatabaseGeneratedOption.Identity)] 19: public Int32 ProductId 20: { 21: get; 22: set; 23: } 24: 25: [Required] 26: [Range(1, 100)] 27: public Decimal Price 28: { 29: get; 30: set; 31: } 32: 33: public virtual ISet<OrderDetail> Details 34: { 35: get; 36: protected set; 37: } 38: 39: [Timestamp] 40: [ScaffoldColumn(false)] 41: public Byte[] RowVersion 42: { 43: get; 44: set; 45: } 46: } Keep in mind that this is a simple scenario. Let’s see what we have: A class Product, that maps to a product record on the database; A product has a required (RequiredAttribute) Name property which can contain up to 50 characters (StringLengthAttribute); The product’s Price must be a decimal value between 1 and 100 (RangeAttribute); It contains a set of order details, for each time that it has been ordered, which we will not talk about (Details); The record’s primary key (mapped to property ProductId) comes from a SQL Server IDENTITY column generated by the database (KeyAttribute, DatabaseGeneratedAttribute); The table uses a SQL Server ROWVERSION (previously known as TIMESTAMP) column for optimistic concurrency control mapped to property RowVersion (TimestampAttribute). Then we will need a controller for viewing product details, which will located on folder ~/Controllers under the name ProductController: 1: public class ProductController : Controller 2: { 3: [HttpGet] 4: public ViewResult Get(Int32 id = 0) 5: { 6: if (id != 0) 7: { 8: using (ProductContext ctx = new ProductContext()) 9: { 10: return (this.View("Single", ctx.Products.Find(id) ?? new Product())); 11: } 12: } 13: else 14: { 15: return (this.View("Single", new Product())); 16: } 17: } 18: } If the requested product does not exist, or one was not requested at all, one with default values will be returned. I am using a view named Single to display the product’s details, more on that later. As you can see, it delegates the loading of products to an Entity Framework context, which is defined as: 1: public class ProductContext: DbContext 2: { 3: public DbSet<Product> Products 4: { 5: get; 6: set; 7: } 8: } Like I said before, I’ll keep it simple for now, only aggregate root Product is available. The controller will use the standard routes defined by the Visual Studio ASP.NET MVC 3 template: 1: routes.MapRoute( 2: "Default", // Route name 3: "{controller}/{action}/{id}", // URL with parameters 4: new { controller = "Home", action = "Index", id = UrlParameter.Optional } // Parameter defaults 5: ); Next, we need a view for displaying the product details, let’s call it Single, and have it located under ~/Views/Product: 1: <%@ Page Language="C#" Inherits="System.Web.Mvc.ViewPage<Product>" %> 2: <!DOCTYPE html> 3: 4: <html> 5: <head runat="server"> 6: <title>Product</title> 7: <script src="/Scripts/jquery-1.7.2.js" type="text/javascript"></script> 1: 2: <script src="/Scripts/jquery-ui-1.8.19.js" type="text/javascript"> 1: </script> 2: <script src="/Scripts/jquery.unobtrusive-ajax.js" type="text/javascript"> 1: </script> 2: <script src="/Scripts/jquery.validate.js" type="text/javascript"> 1: </script> 2: <script src="/Scripts/jquery.validate.unobtrusive.js" type="text/javascript"> 1: </script> 2: <script type="text/javascript"> 3: function onFailure(error) 4: { 5: } 6: 7: function onComplete(ctx) 8: { 9: } 10: 11: </script> 8: </head> 9: <body> 10: <div> 11: <% 1: : this.Html.ValidationSummary(false) %> 12: <% 1: using (this.Ajax.BeginForm("Edit", "Product", new AjaxOptions{ HttpMethod = FormMethod.Post.ToString(), OnSuccess = "onSuccess", OnFailure = "onFailure" })) { %> 13: <% 1: : this.Html.EditorForModel() %> 14: <input type="submit" name="submit" value="Submit" /> 15: <% 1: } %> 16: </div> 17: </body> 18: </html> Yes… I am using ASPX syntax… sorry about that! I implemented an editor template for the Product class, which must be located on the ~/Views/Shared/EditorTemplates folder as file Product.ascx: 1: <%@ Control Language="C#" Inherits="System.Web.Mvc.ViewUserControl<Product>" %> 2: <div> 3: <%: this.Html.HiddenFor(model => model.ProductId) %> 4: <%: this.Html.HiddenFor(model => model.RowVersion) %> 5: <fieldset> 6: <legend>Product</legend> 7: <div class="editor-label"> 8: <%: this.Html.LabelFor(model => model.Name) %> 9: </div> 10: <div class="editor-field"> 11: <%: this.Html.TextBoxFor(model => model.Name) %> 12: <%: this.Html.ValidationMessageFor(model => model.Name) %> 13: </div> 14: <div class="editor-label"> 15: <%= this.Html.LabelFor(model => model.Price) %> 16: </div> 17: <div class="editor-field"> 18: <%= this.Html.TextBoxFor(model => model.Price) %> 19: <%: this.Html.ValidationMessageFor(model => model.Price) %> 20: </div> 21: </fieldset> 22: </div> One thing you’ll notice is, I am including both the ProductId and the RowVersion properties as hidden fields; they will come handy later or, so that we know what product and version we are editing. The other thing is the included JavaScript files: jQuery, jQuery UI and unobtrusive validations. Also, I am not using the Content extension method for translating relative URLs, because that way I would lose JavaScript intellisense for jQuery functions. OK, so, at this moment, I want to add support for AJAX and optimistic concurrency control. So I write a controller method like this: 1: [HttpPost] 2: [AjaxOnly] 3: [Authorize] 4: public JsonResult Edit(Product product) 5: { 6: if (this.TryValidateModel(product) == true) 7: { 8: using (BlogContext ctx = new BlogContext()) 9: { 10: Boolean success = false; 11: 12: ctx.Entry(product).State = (product.ProductId == 0) ? EntityState.Added : EntityState.Modified; 13: 14: try 15: { 16: success = (ctx.SaveChanges() == 1); 17: } 18: catch (DbUpdateConcurrencyException) 19: { 20: ctx.Entry(product).Reload(); 21: } 22: 23: return (this.Json(new { Success = success, ProductId = product.ProductId, RowVersion = Convert.ToBase64String(product.RowVersion) })); 24: } 25: } 26: else 27: { 28: return (this.Json(new { Success = false, ProductId = 0, RowVersion = String.Empty })); 29: } 30: } So, this method is only valid for HTTP POST requests (HttpPost), coming from AJAX (AjaxOnly, from MVC Futures), and from authenticated users (Authorize). It returns a JSON object, which is what you would normally use for AJAX requests, containing three properties: Success: a boolean flag; RowVersion: the current version of the ROWVERSION column as a Base-64 string; ProductId: the inserted product id, as coming from the database. If the product is new, it will be inserted into the database, and its primary key will be returned into the ProductId property. Success will be set to true; If a DbUpdateConcurrencyException occurs, it means that the value in the RowVersion property does not match the current ROWVERSION column value on the database, so the record must have been modified between the time that the page was loaded and the time we attempted to save the product. In this case, the controller just gets the new value from the database and returns it in the JSON object; Success will be false. Otherwise, it will be updated, and Success, ProductId and RowVersion will all have their values set accordingly. So let’s see how we can react to these situations on the client side. Specifically, we want to deal with these situations: The user is not logged in when the update/create request is made, perhaps the cookie expired; The optimistic concurrency check failed; All went well. So, let’s change our view: 1: <%@ Page Language="C#" Inherits="System.Web.Mvc.ViewPage<Product>" %> 2: <%@ Import Namespace="System.Web.Security" %> 3: 4: <!DOCTYPE html> 5: 6: <html> 7: <head runat="server"> 8: <title>Product</title> 9: <script src="/Scripts/jquery-1.7.2.js" type="text/javascript"></script> 1: 2: <script src="/Scripts/jquery-ui-1.8.19.js" type="text/javascript"> 1: </script> 2: <script src="/Scripts/jquery.unobtrusive-ajax.js" type="text/javascript"> 1: </script> 2: <script src="/Scripts/jquery.validate.js" type="text/javascript"> 1: </script> 2: <script src="/Scripts/jquery.validate.unobtrusive.js" type="text/javascript"> 1: </script> 2: <script type="text/javascript"> 3: function onFailure(error) 4: { 5: window.alert('An error occurred: ' + error); 6: } 7: 8: function onSuccess(ctx) 9: { 10: if (typeof (ctx.Success) != 'undefined') 11: { 12: $('input#ProductId').val(ctx.ProductId); 13: $('input#RowVersion').val(ctx.RowVersion); 14: 15: if (ctx.Success == false) 16: { 17: window.alert('An error occurred while updating the entity: it may have been modified by third parties. Please try again.'); 18: } 19: else 20: { 21: window.alert('Saved successfully'); 22: } 23: } 24: else 25: { 26: if (window.confirm('Not logged in. Login now?') == true) 27: { 28: document.location.href = '<%: FormsAuthentication.LoginUrl %>?ReturnURL=' + document.location.pathname; 29: } 30: } 31: } 32: 33: </script> 10: </head> 11: <body> 12: <div> 13: <% 1: : this.Html.ValidationSummary(false) %> 14: <% 1: using (this.Ajax.BeginForm("Edit", "Product", new AjaxOptions{ HttpMethod = FormMethod.Post.ToString(), OnSuccess = "onSuccess", OnFailure = "onFailure" })) { %> 15: <% 1: : this.Html.EditorForModel() %> 16: <input type="submit" name="submit" value="Submit" /> 17: <% 1: } %> 18: </div> 19: </body> 20: </html> The implementation of the onSuccess function first checks if the response contains a Success property, if not, the most likely cause is the request was redirected to the login page (using Forms Authentication), because it wasn’t authenticated, so we navigate there as well, keeping the reference to the current page. It then saves the current values of the ProductId and RowVersion properties to their respective hidden fields. They will be sent on each successive post and will be used in determining if the request is for adding a new product or to updating an existing one. The only thing missing is the ability to insert a new product, after inserting/editing an existing one, which can be easily achieved using this snippet: 1: <input type="button" value="New" onclick="$('input#ProductId').val('');$('input#RowVersion').val('');"/> And that’s it.

Read the article
PTLQueue : a scalable bounded-capacity MPMC queue

- by Dave

Title: Fast concurrent MPMC queue -- I've used the following concurrent queue algorithm enough that it warrants a blog entry. I'll sketch out the design of a fast and scalable multiple-producer multiple-consumer (MPSC) concurrent queue called PTLQueue. The queue has bounded capacity and is implemented via a circular array. Bounded capacity can be a useful property if there's a mismatch between producer rates and consumer rates where an unbounded queue might otherwise result in excessive memory consumption by virtue of the container nodes that -- in some queue implementations -- are used to hold values. A bounded-capacity queue can provide flow control between components. Beware, however, that bounded collections can also result in resource deadlock if abused. The put() and take() operators are partial and wait for the collection to become non-full or non-empty, respectively. Put() and take() do not allocate memory, and are not vulnerable to the ABA pathologies. The PTLQueue algorithm can be implemented equally well in C/C++ and Java. Partial operators are often more convenient than total methods. In many use cases if the preconditions aren't met, there's nothing else useful the thread can do, so it may as well wait via a partial method. An exception is in the case of work-stealing queues where a thief might scan a set of queues from which it could potentially steal. Total methods return ASAP with a success-failure indication. (It's tempting to describe a queue or API as blocking or non-blocking instead of partial or total, but non-blocking is already an overloaded concurrency term. Perhaps waiting/non-waiting or patient/impatient might be better terms). It's also trivial to construct partial operators by busy-waiting via total operators, but such constructs may be less efficient than an operator explicitly and intentionally designed to wait. A PTLQueue instance contains an array of slots, where each slot has volatile Turn and MailBox fields. The array has power-of-two length allowing mod/div operations to be replaced by masking. We assume sensible padding and alignment to reduce the impact of false sharing. (On x86 I recommend 128-byte alignment and padding because of the adjacent-sector prefetch facility). Each queue also has PutCursor and TakeCursor cursor variables, each of which should be sequestered as the sole occupant of a cache line or sector. You can opt to use 64-bit integers if concerned about wrap-around aliasing in the cursor variables. Put(null) is considered illegal, but the caller or implementation can easily check for and convert null to a distinguished non-null proxy value if null happens to be a value you'd like to pass. Take() will accordingly convert the proxy value back to null. An advantage of PTLQueue is that you can use atomic fetch-and-increment for the partial methods. We initialize each slot at index I with (Turn=I, MailBox=null). Both cursors are initially 0. All shared variables are considered "volatile" and atomics such as CAS and AtomicFetchAndIncrement are presumed to have bidirectional fence semantics. Finally T is the templated type. I've sketched out a total tryTake() method below that allows the caller to poll the queue. tryPut() has an analogous construction. Zebra stripping : alternating row colors for nice-looking code listings. See also google code "prettify" : https://code.google.com/p/google-code-prettify/ Prettify is a javascript module that yields the HTML/CSS/JS equivalent of pretty-print. -- pre:nth-child(odd) { background-color:#ff0000; } pre:nth-child(even) { background-color:#0000ff; } border-left: 11px solid #ccc; margin: 1.7em 0 1.7em 0.3em; background-color:#BFB; font-size:12px; line-height:65%; " // PTLQueue : Put(v) : // producer : partial method - waits as necessary assert v != null assert Mask = 1 && (Mask & (Mask+1)) == 0 // Document invariants // doorway step // Obtain a sequence number -- ticket // As a practical concern the ticket value is temporally unique // The ticket also identifies and selects a slot auto tkt = AtomicFetchIncrement (&PutCursor, 1) slot * s = &Slots[tkt & Mask] // waiting phase : // wait for slot's generation to match the tkt value assigned to this put() invocation. // The "generation" is implicitly encoded as the upper bits in the cursor // above those used to specify the index : tkt div (Mask+1) // The generation serves as an epoch number to identify a cohort of threads // accessing disjoint slots while s-Turn != tkt : Pause assert s-MailBox == null s-MailBox = v // deposit and pass message Take() : // consumer : partial method - waits as necessary auto tkt = AtomicFetchIncrement (&TakeCursor,1) slot * s = &Slots[tkt & Mask] // 2-stage waiting : // First wait for turn for our generation // Acquire exclusive "take" access to slot's MailBox field // Then wait for the slot to become occupied while s-Turn != tkt : Pause // Concurrency in this section of code is now reduced to just 1 producer thread // vs 1 consumer thread. // For a given queue and slot, there will be most one Take() operation running // in this section. // Consumer waits for producer to arrive and make slot non-empty // Extract message; clear mailbox; advance Turn indicator // We have an obvious happens-before relation : // Put(m) happens-before corresponding Take() that returns that same "m" for T v = s-MailBox if v != null : s-MailBox = null ST-ST barrier s-Turn = tkt + Mask + 1 // unlock slot to admit next producer and consumer return v Pause tryTake() : // total method - returns ASAP with failure indication for auto tkt = TakeCursor slot * s = &Slots[tkt & Mask] if s-Turn != tkt : return null T v = s-MailBox // presumptive return value if v == null : return null // ratify tkt and v values and commit by advancing cursor if CAS (&TakeCursor, tkt, tkt+1) != tkt : continue s-MailBox = null ST-ST barrier s-Turn = tkt + Mask + 1 return v The basic idea derives from the Partitioned Ticket Lock "PTL" (US20120240126-A1) and the MultiLane Concurrent Bag (US8689237). The latter is essentially a circular ring-buffer where the elements themselves are queues or concurrent collections. You can think of the PTLQueue as a partitioned ticket lock "PTL" augmented to pass values from lock to unlock via the slots. Alternatively, you could conceptualize of PTLQueue as a degenerate MultiLane bag where each slot or "lane" consists of a simple single-word MailBox instead of a general queue. Each lane in PTLQueue also has a private Turn field which acts like the Turn (Grant) variables found in PTL. Turn enforces strict FIFO ordering and restricts concurrency on the slot mailbox field to at most one simultaneous put() and take() operation. PTL uses a single "ticket" variable and per-slot Turn (grant) fields while MultiLane has distinct PutCursor and TakeCursor cursors and abstract per-slot sub-queues. Both PTL and MultiLane advance their cursor and ticket variables with atomic fetch-and-increment. PTLQueue borrows from both PTL and MultiLane and has distinct put and take cursors and per-slot Turn fields. Instead of a per-slot queues, PTLQueue uses a simple single-word MailBox field. PutCursor and TakeCursor act like a pair of ticket locks, conferring "put" and "take" access to a given slot. PutCursor, for instance, assigns an incoming put() request to a slot and serves as a PTL "Ticket" to acquire "put" permission to that slot's MailBox field. To better explain the operation of PTLQueue we deconstruct the operation of put() and take() as follows. Put() first increments PutCursor obtaining a new unique ticket. That ticket value also identifies a slot. Put() next waits for that slot's Turn field to match that ticket value. This is tantamount to using a PTL to acquire "put" permission on the slot's MailBox field. Finally, having obtained exclusive "put" permission on the slot, put() stores the message value into the slot's MailBox. Take() similarly advances TakeCursor, identifying a slot, and then acquires and secures "take" permission on a slot by waiting for Turn. Take() then waits for the slot's MailBox to become non-empty, extracts the message, and clears MailBox. Finally, take() advances the slot's Turn field, which releases both "put" and "take" access to the slot's MailBox. Note the asymmetry : put() acquires "put" access to the slot, but take() releases that lock. At any given time, for a given slot in a PTLQueue, at most one thread has "put" access and at most one thread has "take" access. This restricts concurrency from general MPMC to 1-vs-1. We have 2 ticket locks -- one for put() and one for take() -- each with its own "ticket" variable in the form of the corresponding cursor, but they share a single "Grant" egress variable in the form of the slot's Turn variable. Advancing the PutCursor, for instance, serves two purposes. First, we obtain a unique ticket which identifies a slot. Second, incrementing the cursor is the doorway protocol step to acquire the per-slot mutual exclusion "put" lock. The cursors and operations to increment those cursors serve double-duty : slot-selection and ticket assignment for locking the slot's MailBox field. At any given time a slot MailBox field can be in one of the following states: empty with no pending operations -- neutral state; empty with one or more waiting take() operations pending -- deficit; occupied with no pending operations; occupied with one or more waiting put() operations -- surplus; empty with a pending put() or pending put() and take() operations -- transitional; or occupied with a pending take() or pending put() and take() operations -- transitional. The partial put() and take() operators can be implemented with an atomic fetch-and-increment operation, which may confer a performance advantage over a CAS-based loop. In addition we have independent PutCursor and TakeCursor cursors. Critically, a put() operation modifies PutCursor but does not access the TakeCursor and a take() operation modifies the TakeCursor cursor but does not access the PutCursor. This acts to reduce coherence traffic relative to some other queue designs. It's worth noting that slow threads or obstruction in one slot (or "lane") does not impede or obstruct operations in other slots -- this gives us some degree of obstruction isolation. PTLQueue is not lock-free, however. The implementation above is expressed with polite busy-waiting (Pause) but it's trivial to implement per-slot parking and unparking to deschedule waiting threads. It's also easy to convert the queue to a more general deque by replacing the PutCursor and TakeCursor cursors with Left/Front and Right/Back cursors that can move either direction. Specifically, to push and pop from the "left" side of the deque we would decrement and increment the Left cursor, respectively, and to push and pop from the "right" side of the deque we would increment and decrement the Right cursor, respectively. We used a variation of PTLQueue for message passing in our recent OPODIS 2013 paper. ul { list-style:none; padding-left:0; padding:0; margin:0; margin-left:0; } ul#myTagID { padding: 0px; margin: 0px; list-style:none; margin-left:0;} -- -- There's quite a bit of related literature in this area. I'll call out a few relevant references: Wilson's NYU Courant Institute UltraComputer dissertation from 1988 is classic and the canonical starting point : Operating System Data Structures for Shared-Memory MIMD Machines with Fetch-and-Add. Regarding provenance and priority, I think PTLQueue or queues effectively equivalent to PTLQueue have been independently rediscovered a number of times. See CB-Queue and BNPBV, below, for instance. But Wilson's dissertation anticipates the basic idea and seems to predate all the others. Gottlieb et al : Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors Orozco et al : CB-Queue in Toward high-throughput algorithms on many-core architectures which appeared in TACO 2012. Meneghin et al : BNPVB family in Performance evaluation of inter-thread communication mechanisms on multicore/multithreaded architecture Dmitry Vyukov : bounded MPMC queue (highly recommended) Alex Otenko : US8607249 (highly related). John Mellor-Crummey : Concurrent queues: Practical fetch-and-phi algorithms. Technical Report 229, Department of Computer Science, University of Rochester Thomasson : FIFO Distributed Bakery Algorithm (very similar to PTLQueue). Scott and Scherer : Dual Data Structures I'll propose an optimization left as an exercise for the reader. Say we wanted to reduce memory usage by eliminating inter-slot padding. Such padding is usually "dark" memory and otherwise unused and wasted. But eliminating the padding leaves us at risk of increased false sharing. Furthermore lets say it was usually the case that the PutCursor and TakeCursor were numerically close to each other. (That's true in some use cases). We might still reduce false sharing by incrementing the cursors by some value other than 1 that is not trivially small and is coprime with the number of slots. Alternatively, we might increment the cursor by one and mask as usual, resulting in a logical index. We then use that logical index value to index into a permutation table, yielding an effective index for use in the slot array. The permutation table would be constructed so that nearby logical indices would map to more distant effective indices. (Open question: what should that permutation look like? Possibly some perversion of a Gray code or De Bruijn sequence might be suitable). As an aside, say we need to busy-wait for some condition as follows : "while C == 0 : Pause". Lets say that C is usually non-zero, so we typically don't wait. But when C happens to be 0 we'll have to spin for some period, possibly brief. We can arrange for the code to be more machine-friendly with respect to the branch predictors by transforming the loop into : "if C == 0 : for { Pause; if C != 0 : break; }". Critically, we want to restructure the loop so there's one branch that controls entry and another that controls loop exit. A concern is that your compiler or JIT might be clever enough to transform this back to "while C == 0 : Pause". You can sometimes avoid this by inserting a call to a some type of very cheap "opaque" method that the compiler can't elide or reorder. On Solaris, for instance, you could use :"if C == 0 : { gethrtime(); for { Pause; if C != 0 : break; }}". It's worth noting the obvious duality between locks and queues. If you have strict FIFO lock implementation with local spinning and succession by direct handoff such as MCS or CLH,then you can usually transform that lock into a queue. Hidden commentary and annotations - invisible : * And of course there's a well-known duality between queues and locks, but I'll leave that topic for another blog post. * Compare and contrast : PTLQ vs PTL and MultiLane * Equivalent : Turn; seq; sequence; pos; position; ticket * Put = Lock; Deposit Take = identify and reserve slot; wait; extract & clear; unlock * conceptualize : Distinct PutLock and TakeLock implemented as ticket lock or PTL Distinct arrival cursors but share per-slot "Turn" variable provides exclusive role-based access to slot's mailbox field put() acquires exclusive access to a slot for purposes of "deposit" assigns slot round-robin and then acquires deposit access rights/perms to that slot take() acquires exclusive access to slot for purposes of "withdrawal" assigns slot round-robin and then acquires withdrawal access rights/perms to that slot At any given time, only one thread can have withdrawal access to a slot at any given time, only one thread can have deposit access to a slot Permissible for T1 to have deposit access and T2 to simultaneously have withdrawal access * round-robin for the purposes of; role-based; access mode; access role mailslot; mailbox; allocate/assign/identify slot rights; permission; license; access permission; * PTL/Ticket hybrid Asymmetric usage ; owner oblivious lock-unlock pairing K-exclusion add Grant cursor pass message m from lock to unlock via Slots[] array Cursor performs 2 functions : + PTL ticket + Assigns request to slot in round-robin fashion Deconstruct protocol : explication put() : allocate slot in round-robin fashion acquire PTL for "put" access store message into slot associated with PTL index take() : Acquire PTL for "take" access // doorway step seq = fetchAdd (&Grant, 1) s = &Slots[seq & Mask] // waiting phase while s-Turn != seq : pause Extract : wait for s-mailbox to be full v = s-mailbox s-mailbox = null Release PTL for both "put" and "take" access s-Turn = seq + Mask + 1 * Slot round-robin assignment and lock "doorway" protocol leverage the same cursor and FetchAdd operation on that cursor FetchAdd (&Cursor,1) + round-robin slot assignment and dispersal + PTL/ticket lock "doorway" step waiting phase is via "Turn" field in slot * PTLQueue uses 2 cursors -- put and take. Acquire "put" access to slot via PTL-like lock Acquire "take" access to slot via PTL-like lock 2 locks : put and take -- at most one thread can access slot's mailbox Both locks use same "turn" field Like multilane : 2 cursors : put and take slot is simple 1-capacity mailbox instead of queue Borrow per-slot turn/grant from PTL Provides strict FIFO Lock slot : put-vs-put take-vs-take at most one put accesses slot at any one time at most one put accesses take at any one time reduction to 1-vs-1 instead of N-vs-M concurrency Per slot locks for put/take Release put/take by advancing turn * is instrumental in ... * P-V Semaphore vs lock vs K-exclusion * See also : FastQueues-excerpt.java dice-etc/queue-mpmc-bounded-blocking-circular-xadd/ * PTLQueue is the same as PTLQB - identical * Expedient return; ASAP; prompt; immediately * Lamport's Bakery algorithm : doorway step then waiting phase Threads arriving at doorway obtain a unique ticket number Threads enter in ticket order * In the terminology of Reed and Kanodia a ticket lock corresponds to the busy-wait implementation of a semaphore using an eventcount and a sequencer It can also be thought of as an optimization of Lamport's bakery lock was designed for fault-tolerance rather than performance Instead of spinning on the release counter, processors using a bakery lock repeatedly examine the tickets of their peers --

Read the article
RHEL - blocked FC remote port time out: saving binding

- by Dev G

My Server went into a faulty state since the database could not write on the partition. I found out that the partition went into Read Only mode. Finally to fix it, I had to do a hard reboot. Linux 2.6.18-164.el5PAE #1 SMP Tue Aug 18 15:59:11 EDT 2009 i686 i686 i386 GNU/Linux /var/log/messages Oct 31 00:56:45 ota3g1 Had[17275]: VCS ERROR V-16-1-10214 Concurrency Violation:CurrentCount increased above 1 for failover group sg_network Oct 31 00:57:05 ota3g1 Had[17275]: VCS CRITICAL V-16-1-50086 CPU usage on ota3g1.mtsallstream.com is 100% Oct 31 01:01:47 ota3g1 Had[17275]: VCS ERROR V-16-1-10214 Concurrency Violation:CurrentCount increased above 1 for failover group sg_network Oct 31 01:06:50 ota3g1 Had[17275]: VCS ERROR V-16-1-10214 Concurrency Violation:CurrentCount increased above 1 for failover group sg_network Oct 31 01:11:52 ota3g1 Had[17275]: VCS ERROR V-16-1-10214 Concurrency Violation:CurrentCount increased above 1 for failover group sg_network Oct 31 01:12:10 ota3g1 kernel: lpfc 0000:29:00.1: 1:1305 Link Down Event x2 received Data: x2 x20 x80000 x0 x0 Oct 31 01:12:10 ota3g1 kernel: lpfc 0000:29:00.1: 1:1303 Link Up Event x3 received Data: x3 x1 x10 x1 x0 x0 0 Oct 31 01:12:12 ota3g1 kernel: lpfc 0000:29:00.1: 1:1305 Link Down Event x4 received Data: x4 x20 x80000 x0 x0 Oct 31 01:12:40 ota3g1 kernel: rport-8:0-0: blocked FC remote port time out: saving binding Oct 31 01:12:40 ota3g1 kernel: lpfc 0000:29:00.1: 1:(0):0203 Devloss timeout on WWPN 20:25:00:a0:b8:74:f5:65 NPort x0000e4 Data: x0 x7 x0 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 38617577 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 283532153 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 90825 Oct 31 01:12:40 ota3g1 kernel: Aborting journal on device dm-16. Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 868841 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: Aborting journal on device dm-10. Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 37759889 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 283349449 Oct 31 01:12:40 ota3g1 kernel: printk: 6 messages suppressed. Oct 31 01:12:40 ota3g1 kernel: Aborting journal on device dm-12. Oct 31 01:12:40 ota3g1 kernel: EXT3-fs error (device dm-12) in ext3_reserve_inode_write: Journal has aborted Oct 31 01:12:40 ota3g1 kernel: Buffer I/O error on device dm-16, logical block 1545 Oct 31 01:12:40 ota3g1 kernel: lost page write due to I/O error on dm-16 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 12745 Oct 31 01:12:40 ota3g1 kernel: Buffer I/O error on device dm-10, logical block 1545 Oct 31 01:12:40 ota3g1 kernel: EXT3-fs error (device dm-16) in ext3_reserve_inode_write: Journal has aborted Oct 31 01:12:40 ota3g1 kernel: lost page write due to I/O error on dm-10 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 37749121 Oct 31 01:12:40 ota3g1 kernel: Buffer I/O error on device dm-12, logical block 0 Oct 31 01:12:40 ota3g1 kernel: lost page write due to I/O error on dm-12 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: EXT3-fs error (device dm-12) in ext3_dirty_inode: Journal has aborted Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 37757897 Oct 31 01:12:40 ota3g1 kernel: Buffer I/O error on device dm-12, logical block 1097 Oct 31 01:12:40 ota3g1 kernel: lost page write due to I/O error on dm-12 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 283337089 Oct 31 01:12:40 ota3g1 kernel: Buffer I/O error on device dm-16, logical block 0 Oct 31 01:12:40 ota3g1 kernel: lost page write due to I/O error on dm-16 Oct 31 01:12:40 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:40 ota3g1 kernel: EXT3-fs error (device dm-16) in ext3_dirty_inode: Journal has aborted Oct 31 01:12:40 ota3g1 kernel: end_request: I/O error, dev sdi, sector 37749121 Oct 31 01:12:40 ota3g1 kernel: Buffer I/O error on device dm-12, logical block 0 Oct 31 01:12:41 ota3g1 kernel: lost page write due to I/O error on dm-12 Oct 31 01:12:41 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 Oct 31 01:12:41 ota3g1 kernel: end_request: I/O error, dev sdi, sector 283337089 Oct 31 01:12:41 ota3g1 kernel: Buffer I/O error on device dm-16, logical block 0 Oct 31 01:12:41 ota3g1 kernel: lost page write due to I/O error on dm-16 Oct 31 01:12:41 ota3g1 kernel: sd 8:0:0:4: SCSI error: return code = 0x00010000 df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/cciss-root 4.9G 730M 3.9G 16% / /dev/mapper/cciss-home 9.7G 1.2G 8.1G 13% /home /dev/mapper/cciss-var 9.7G 494M 8.8G 6% /var /dev/mapper/cciss-usr 15G 2.6G 12G 19% /usr /dev/mapper/cciss-tmp 3.9G 153M 3.6G 5% /tmp /dev/sda1 996M 43M 902M 5% /boot tmpfs 5.9G 0 5.9G 0% /dev/shm /dev/mapper/cciss-product 25G 16G 7.4G 68% /product /dev/mapper/cciss-opt 20G 4.5G 14G 25% /opt /dev/mapper/dg_db1-vol_db1_system 18G 2.2G 15G 14% /database/OTADB/sys /dev/mapper/dg_db1-vol_db1_undo 18G 5.8G 12G 35% /database/OTADB/undo /dev/mapper/dg_db1-vol_db1_redo 8.9G 4.3G 4.2G 51% /database/OTADB/redo /dev/mapper/dg_db1-vol_db1_sgbd 8.9G 654M 7.8G 8% /database/OTADB/admin /dev/mapper/dg_db1-vol_db1_arch 98G 24G 69G 26% /database/OTADB/arch /dev/mapper/dg_db1-vol_db1_indexes 240G 14G 214G 6% /database/OTADB/index /dev/mapper/dg_db1-vol_db1_data 275G 47G 215G 18% /database/OTADB/data /dev/mapper/dg_dbrman-vol_db_rman 8.9G 351M 8.1G 5% /database/RMAN /dev/mapper/dg_app1-vol_app1 151G 113G 31G 79% /files/ota /etc/fstab /dev/cciss/root / ext3 defaults 1 1 /dev/cciss/home /home ext3 defaults 1 2 /dev/cciss/var /var ext3 defaults 1 2 /dev/cciss/usr /usr ext3 defaults 1 2 /dev/cciss/tmp /tmp ext3 defaults 1 2 LABEL=/boot /boot ext3 defaults 1 2 tmpfs /dev/shm tmpfs defaults 0 0 devpts /dev/pts devpts gid=5,mode=620 0 0 sysfs /sys sysfs defaults 0 0 proc /proc proc defaults 0 0 /dev/cciss/swap swap swap defaults 0 0 /dev/cciss/product /product ext3 defaults 1 2 /dev/cciss/opt /opt ext3 defaults 1 2 /dev/dg_db1/vol_db1_system /database/OTADB/sys ext3 defaults 1 2 /dev/dg_db1/vol_db1_undo /database/OTADB/undo ext3 defaults 1 2 /dev/dg_db1/vol_db1_redo /database/OTADB/redo ext3 defaults 1 2 /dev/dg_db1/vol_db1_sgbd /database/OTADB/admin ext3 defaults 1 2 /dev/dg_db1/vol_db1_arch /database/OTADB/arch ext3 defaults 1 2 /dev/dg_db1/vol_db1_indexes /database/OTADB/index ext3 defaults 1 2 /dev/dg_db1/vol_db1_data /database/OTADB/data ext3 defaults 1 2 /dev/dg_dbrman/vol_db_rman /database/RMAN ext3 defaults 1 2 /dev/dg_app1/vol_app1 /files/ota ext3 defaults 1 2 Thanks for all the help.

Read the article
array and array_view from amp.h

- by Daniel Moth

This is a very long post, but it also covers what are probably the classes (well, array_view at least) that you will use the most with C++ AMP, so I hope you enjoy it! Overview The concurrency::array and concurrency::array_view template classes represent multi-dimensional data of type T, of N dimensions, specified at compile time (and you can later access the number of dimensions via the rank property). If N is not specified, it is assumed that it is 1 (i.e. single-dimensional case). They are rectangular (not jagged). The difference between them is that array is a container of data, whereas array_view is a wrapper of a container of data. So in that respect, array behaves like an STL container, whereas the closest thing an array_view behaves like is an STL iterator (albeit with random access and allowing you to view more than one element at a time!). The data in the array (whether provided at creation time or added later) resides on an accelerator (which is specified at creation time either explicitly by the developer, or set to the default accelerator at creation time by the runtime) and is laid out contiguously in memory. The data provided to the array_view is not stored by/in the array_view, because the array_view is simply a view over the real source (which can reside on the CPU or other accelerator). The underlying data is copied on demand to wherever the array_view is accessed. Elements which differ by one in the least significant dimension of the array_view are adjacent in memory. array objects must be captured by reference into the lambda you pass to the parallel_for_each call, whereas array_view objects must be captured by value (into the lambda you pass to the parallel_for_each call). Creating array and array_view objects and relevant properties You can create array_view objects from other array_view objects of the same rank and element type (shallow copy, also possible via assignment operator) so they point to the same underlying data, and you can also create array_view objects over array objects of the same rank and element type e.g. array_view<int,3> a(b); // b can be another array or array_view of ints with rank=3 Note: Unlike the constructors above which can be called anywhere, the ones in the rest of this section can only be called from CPU code. You can create array objects from other array objects of the same rank and element type (copy and move constructors) and from other array_view objects, e.g. array<float,2> a(b); // b can be another array or array_view of floats with rank=2 To create an array from scratch, you need to at least specify an extent object, e.g. array<int,3> a(myExtent);. Note that instead of an explicit extent object, there are convenience overloads when N<=3 so you can specify 1-, 2-, 3- integers (dependent on the array's rank) and thus have the extent created for you under the covers. At any point, you can access the array's extent thought the extent property. The exact same thing applies to array_view (extent as constructor parameters, incl. convenience overloads, and property). While passing only an extent object to create an array is enough (it means that the array will be written to later), it is not enough for the array_view case which must always wrap over some other container (on which it relies for storage space and actual content). So in addition to the extent object (that describes the shape you'd like to be viewing/accessing that data through), to create an array_view from another container (e.g. std::vector) you must pass in the container itself (which must expose .data() and a .size() methods, e.g. like std::array does), e.g. array_view<int,2> aaa(myExtent, myContainerOfInts); Similarly, you can create an array_view from a raw pointer of data plus an extent object. Back to the array case, to optionally initialize the array with data, you can pass an iterator pointing to the start (and optionally one pointing to the end of the source container) e.g. array<double,1> a(5, myVector.begin(), myVector.end()); We saw that arrays are bound to an accelerator at creation time, so in case you don’t want the C++ AMP runtime to assign the array to the default accelerator, all array constructors have overloads that let you pass an accelerator_view object, which you can later access via the accelerator_view property. Note that at the point of initializing an array with data, a synchronous copy of the data takes place to the accelerator, and then to copy any data back we'll see that an explicit copy call is required. This does not happen with the array_view where copying is on demand... refresh and synchronize on array_view Note that in the previous section on constructors, unlike the array case, there was no overload that accepted an accelerator_view for array_view. That is because the array_view is simply a wrapper, so the allocation of the data has already taken place before you created the array_view. When you capture an array_view variable in your call to parallel_for_each, the copy of data between the non-CPU accelerator and the CPU takes place on demand (i.e. it is implicit, versus the explicit copy that has to happen with the array). There are some subtleties to the on-demand-copying that we cover next. The assumption when using an array_view is that you will continue to access the data through the array_view, and not through the original underlying source, e.g. the pointer to the data that you passed to the array_view's constructor. So if you modify the data through the array_view on the GPU, the original pointer on the CPU will not "know" that, unless one of two things happen: you access the data through the array_view on the CPU side, i.e. using indexing that we cover below you explicitly call the array_view's synchronize method on the CPU (this also gets called in the array_view's destructor for you) Conversely, if you make a change to the underlying data through the original source (e.g. the pointer), the array_view will not "know" about those changes, unless you call its refresh method. Finally, note that if you create an array_view of const T, then the data is copied to the accelerator on demand, but it does not get copied back, e.g. array_view<const double, 5> myArrView(…); // myArrView will not get copied back from GPU There is also a similar mechanism to achieve the reverse, i.e. not to copy the data of an array_view to the GPU. copy_to, data, and global copy/copy_async functions Both array and array_view expose two copy_to overloads that allow copying them to another array, or to another array_view, and these operations can also be achieved with assignment (via the = operator overloads). Also both array and array_view expose a data method, to get a raw pointer to the underlying data of the array or array_view, e.g. float* f = myArr.data();. Note that for array_view, this only works when the rank is equal to 1, due to the data only being contiguous in one dimension as covered in the overview section. Finally, there are a bunch of global concurrency::copy functions returning void (and corresponding concurrency::copy_async functions returning a future) that allow copying between arrays and array_views and iterators etc. Just browse intellisense or amp.h directly for the full set. Note that for array, all copying described throughout this post is deep copying, as per other STL container expectations. You can never have two arrays point to the same data. indexing into array and array_view plus projection Reading or writing data elements of an array is only legal when the code executes on the same accelerator as where the array was bound to. In the array_view case, you can read/write on any accelerator, not just the one where the original data resides, and the data gets copied for you on demand. In both cases, the way you read and write individual elements is via indexing as described next. To access (or set the value of) an element, you can index into it by passing it an index object via the subscript operator. Furthermore, if the rank is 3 or less, you can use the function ( ) operator to pass integer values instead of having to use an index object. e.g. array<float,2> arr(someExtent, someIterator); //or array_view<float,2> arr(someExtent, someContainer); index<2> idx(5,4); float f1 = arr[idx]; float f2 = arr(5,4); //f2 ==f1 //and the reverse for assigning, e.g. arr(idx[0], 7) = 6.9; Note that for both array and array_view, regardless of rank, you can also pass a single integer to the subscript operator which results in a projection of the data, and (for both array and array_view) you get back an array_view of rank N-1 (or if the rank was 1, you get back just the element at that location). Not Covered In this already very long post, I am not going to cover three very cool methods (and related overloads) that both array and array_view expose: view_as, section, reinterpret_as. We'll revisit those at some point in the future, probably on the team blog. Comments about this post by Daniel Moth welcome at the original blog.

Read the article
State Changes in a Component Based Architecture [closed]

- by Maxem

I'm currently working on a game and using the naive component based architecture thingie (Entities are a bag of components, entity.Update() calls Update on each updateable component), while the addition of new features is really simple, it makes a few things really difficult: a) multithreading / currency b) networking c) unit testing. Multithreading / Concurrency is difficult because I basically have to do poor mans concurrency (running the entity updates in separate threads while locking only stuff that crashes (like lists) and ignoring the staleness of read state (some states are already updated, others aren't)) Networking: There are no explicit state changes that I could efficiently push over the net. Unit testing: All updates may or may not conflict, so automated testing is at least awkward. I was thinking about these issues a bit and would like your input on these changes / idea: Switch from the naive cba to a cba with sub systems that work on lists of components Make all state changes explicit Combine 1 and 2 :p Example world update: statePostProcessing.Wait() // ensure that post processing has finished Apply(postProcessedState) state = new StateBag() Concurrently( () => LifeCycleSubSystem.Update(state), // populates the state bag () => MovementSubSystem.Update(state), // populates the state bag .... }) statePostProcessing = Future(() => PostProcess(state)) statePostProcessing.Start() // Tick is finished, the post processing happens in the background So basically the changes are (consistently) based on the data for the last tick; the post processing can a) generate network packages and b) fix conflicts / remove useless changes (example: entity has been destroyed - ignore movement etc.). EDIT: To clarify the granularity of the state changes: If I save these post processed state bags and apply them to an empty world, I see exactly what has happened in the game these state bags originated from - "Free" replay capability. EDIT2: I guess I should have used the term Event instead of State Change and point out that I kind of want to use the Event Sourcing pattern

Read the article
Why mysql 5.5 slower than 5.1 (linux,using mysqlslap)

- by Zenofo

my.cnf (5.5 and 5.1 is the same) : back_log=200 max_connections=512 max_connect_errors=999999 key_buffer=512M max_allowed_packet=8M table_cache=512 sort_buffer=8M read_buffer_size=8M thread_cache=8 thread_concurrency=4 myisam_sort_buffer_size=128M interactive_timeout=28800 wait_timeout=7200 mysql 5.5: ..mysql5.5/bin/mysqlslap -a --concurrency=10 --number-of-queries 5000 --iterations=5 -S /tmp/mysql_5.5.sock --engine=innodb Benchmark Running for engine innodb Average number of seconds to run all queries: 15.156 seconds Minimum number of seconds to run all queries: 15.031 seconds Maximum number of seconds to run all queries: 15.296 seconds Number of clients running queries: 10 Average number of queries per client: 500 mysql5.1: ..mysql5.5/bin/mysqlslap -a --concurrency=10 --number-of-queries 5000 --iterations=5 -S /tmp/mysql_5.1.sock --engine=innodb Benchmark Running for engine innodb Average number of seconds to run all queries: 13.252 seconds Minimum number of seconds to run all queries: 13.019 seconds Maximum number of seconds to run all queries: 13.480 seconds Number of clients running queries: 10 Average number of queries per client: 500 Why mysql 5.5 slower than 5.1 ? BTW:I'm tried mysql5.5/bin/mysqlslap and mysql5.1/bin/mysqlslap,result is the same

Read the article
Should I amortize scripting cost via bytecode analysis or multithreading?

- by user18983

I'm working on a game sort of thing where users can write arbitrary code for individual agents, and I'm trying to decide the best way to divide up computation time. The simplest option would be to give each agent a set amount of time and skip their turn if it elapses without an action being decided upon, but I would like people to be able to write their agents decision functions without having to think too much about how long its taking unless they really want to. The two approaches I'm considering are giving each agent a set number of bytecode instructions (taking cost into account) each timestep, and making players deal with the consequences of the game state changing between blocks of computation (as with Battlecode) or giving each agent it's own thread and giving each thread equal time on the processor. I'm about equally knowledgeable on both concurrency and bytecode stuff, which is to say not very, so I'm wondering which approach would be best. I have a clearer idea of how I'd structure things if I used bytecode, but less certainty about how to actually implement the analysis. I'm pretty sure I can work up a concurrency based system without much trouble, but I worry it will be messier with more overhead and will add unnecessary complexity to the project.

Read the article
Partition Wise Joins

- by jean-pierre.dijcks

Some say they are the holy grail of parallel computing and PWJ is the basis for a shared nothing system and the only join method that is available on a shared nothing system (yes this is oversimplified!). The magic in Oracle is of course that is one of many ways to join data. And yes, this is the old flexibility vs. simplicity discussion all over, so I won't go there... the point is that what you must do in a shared nothing system, you can do in Oracle with the same speed and methods. The Theory A partition wise join is a join between (for simplicity) two tables that are partitioned on the same column with the same partitioning scheme. In shared nothing this is effectively hard partitioning locating data on a specific node / storage combo. In Oracle is is logical partitioning. If you now join the two tables on that partitioned column you can break up the join in smaller joins exactly along the partitions in the data. Since they are partitioned (grouped) into the same buckets, all values required to do the join live in the equivalent bucket on either sides. No need to talk to anyone else, no need to redistribute data to anyone else... in short, the optimal join method for parallel processing of two large data sets. PWJ's in Oracle Since we do not hard partition the data across nodes in Oracle we use the Partitioning option to the database to create the buckets, then set the Degree of Parallelism (or run Auto DOP - see here) and get our PWJs. The main questions always asked are: How many partitions should I create? What should my DOP be? In a shared nothing system the answer is of course, as many partitions as there are nodes which will be your DOP. In Oracle we do want you to look at the workload and concurrency, and once you know that to understand the following rules of thumb. Within Oracle we have more ways of joining of data, so it is important to understand some of the PWJ ideas and what it means if you have an uneven distribution across processes. Assume we have a simple scenario where we partition the data on a hash key resulting in 4 hash partitions (H1 -H4). We have 2 parallel processes that have been tasked with reading these partitions (P1 - P2). The work is evenly divided assuming the partitions are the same size and we can scan this in time t1 as shown below. Now assume that we have changed the system and have a 5th partition but still have our 2 workers P1 and P2. The time it takes is actually 50% more assuming the 5th partition has the same size as the original H1 - H4 partitions. In other words to scan these 5 partitions, the time t2 it takes is not 1/5th more expensive, it is a lot more expensive and some other join plans may now start to look exciting to the optimizer. Just to post the disclaimer, it is not as simple as I state it here, but you get the idea on how much more expensive this plan may now look... Based on this little example there are a few rules of thumb to follow to get the partition wise joins. First, choose a DOP that is a factor of two (2). So always choose something like 2, 4, 8, 16, 32 and so on... Second, choose a number of partitions that is larger or equal to 2* DOP. Third, make sure the number of partitions is divisible through 2 without orphans. This is also known as an even number... Fourth, choose a stable partition count strategy, which is typically hash, which can be a sub partitioning strategy rather than the main strategy (range - hash is a popular one). Fifth, make sure you do this on the join key between the two large tables you want to join (and this should be the obvious one...). Translating this into an example: DOP = 8 (determined based on concurrency or by using Auto DOP with a cap due to concurrency) says that the number of partitions >= 16. Number of hash (sub) partitions = 32, which gives each process four partitions to work on. This number is somewhat arbitrary and depends on your data and system. In this case my main reasoning is that if you get more room on the box you can easily move the DOP for the query to 16 without repartitioning... and of course it makes for no leftovers on the table... And yes, we recommend up-to-date statistics. And before you start complaining, do read this post on a cool way to do stats in 11.

Read the article
Coherence Data Guarantees for Data Reads - Basic Terminology

- by jpurdy

When integrating Coherence into applications, each application has its own set of requirements with respect to data integrity guarantees. Developers often describe these requirements using expressions like "avoiding dirty reads" or "making sure that updates are transactional", but we often find that even in a small group of people, there may be a wide range of opinions as to what these terms mean. This may simply be due to a lack of familiarity, but given that Coherence sits at an intersection of several (mostly) unrelated fields, it may be a matter of conflicting vocabularies (e.g. "consistency" is similar but different in transaction processing versus multi-threaded programming). Since almost all data read consistency issues are related to the concept of concurrency, it is helpful to start with a definition of that, or rather what it means for two operations to be concurrent. Rather than implying that they occur "at the same time", concurrency is a slightly weaker statement -- it simply means that it can't be proven that one event precedes (or follows) the other. As an example, in a Coherence application, if two client members mutate two different cache entries sitting on two different cache servers at roughly the same time, it is likely that one update will precede the other by a significant amount of time (say 0.1ms). However, since there is no guarantee that all four members have their clocks perfectly synchronized, and there is no way to precisely measure the time it takes to send a given message between any two members (that have differing clocks), we consider these to be concurrent operations since we can not (easily) prove otherwise. So this leads to a question that we hear quite frequently: "Are the contents of the near cache always synchronized with the underlying distributed cache?". It's easy to see that if an update on a cache server results in a message being sent to each near cache, and then that near cache being updated that there is a window where the contents are different. However, this is irrelevant, since even if the application reads directly from the distributed cache, another thread update the cache before the read is returned to the application. Even if no other member modifies a cache entry prior to the local near cache entry being updated (and subsequently read), the purpose of reading a cache entry is to do something with the result, usually either displaying for consumption by a human, or by updating the entry based on the current state of the entry. In the former case, it's clear that if the data is updated faster than a human can perceive, then there is no problem (and in many cases this can be relaxed even further). For the latter case, the application must assume that the value might potentially be updated before it has a chance to update it. This almost aways the case with read-only caches, and the solution is the traditional optimistic transaction pattern, which requires the application to explicitly state what assumptions it made about the old value of the cache entry. If the application doesn't want to bother stating those assumptions, it is free to lock the cache entry prior to reading it, ensuring that no other threads will mutate the entry, a pessimistic approach. The optimistic approach relies on what is sometimes called a "fuzzy read". In other words, the application assumes that the read should be correct, but it also acknowledges that it might not be. (I use the qualifier "sometimes" because in some writings, "fuzzy read" indicates the situation where the application actually sees an original value and then later sees an updated value within the same transaction -- however, both definitions are roughly equivalent from an application design perspective). If the read is not correct it is called a "stale read". Going back to the definition of concurrency, it may seem difficult to precisely define a stale read, but the practical way of detecting a stale read is that is will cause the encompassing transaction to roll back if it tries to update that value. The pessimistic approach relies on a "coherent read", a guarantee that the value returned is not only the same as the primary copy of that value, but also that it will remain that way. In most cases this can be used interchangeably with "repeatable read" (though that term has additional implications when used in the context of a database system). In none of cases above is it possible for the application to perform a "dirty read". A dirty read occurs when the application reads a piece of data that was never committed. In practice the only way this can occur is with multi-phase updates such as transactions, where a value may be temporarily update but then withdrawn when a transaction is rolled back. If another thread sees that value prior to the rollback, it is a dirty read. If an application uses optimistic transactions, dirty reads will merely result in a lack of forward progress (this is actually one of the main risks of dirty reads -- they can be chained and potentially cause cascading rollbacks). The concepts of dirty reads, fuzzy reads, stale reads and coherent reads are able to describe the vast majority of requirements that we see in the field. However, the important thing is to define the terms used to define requirements. A quick web search for each of the terms in this article will show multiple meanings, so I've selected what are generally the most common variations, but it never hurts to state each definition explicitly if they are critical to the success of a project (many applications have sufficiently loose requirements that precise terminology can be avoided).

Read the article
Plagued by multithreaded bugs

- by koncurrency

On my new team that I manage, the majority of our code is platform, TCP socket, and http networking code. All C++. Most of it originated from other developers that have left the team. The current developers on the team are very smart, but mostly junior in terms of experience. Our biggest problem: multi-threaded concurrency bugs. Most of our class libraries are written to be asynchronous by use of some thread pool classes. Methods on the class libraries often enqueue long running taks onto the thread pool from one thread and then the callback methods of that class get invoked on a different thread. As a result, we have a lot of edge case bugs involving incorrect threading assumptions. This results in subtle bugs that go beyond just having critical sections and locks to guard against concurrency issues. What makes these problems even harder is that the attempts to fix are often incorrect. Some mistakes I've observed the team attempting (or within the legacy code itself) includes something like the following: Common mistake #1 - Fixing concurrency issue by just put a lock around the shared data, but forgetting about what happens when methods don't get called in an expected order. Here's a very simple example: void Foo::OnHttpRequestComplete(statuscode status) { m_pBar->DoSomethingImportant(status); } void Foo::Shutdown() { m_pBar->Cleanup(); delete m_pBar; m_pBar=nullptr; } So now we have a bug in which Shutdown could get called while OnHttpNetworkRequestComplete is occuring on. A tester finds the bug, captures the crash dump, and assigns the bug to a developer. He in turn fixes the bug like this. void Foo::OnHttpRequestComplete(statuscode status) { AutoLock lock(m_cs); m_pBar->DoSomethingImportant(status); } void Foo::Shutdown() { AutoLock lock(m_cs); m_pBar->Cleanup(); delete m_pBar; m_pBar=nullptr; } The above fix looks good until you realize there's an even more subtle edge case. What happens if Shutdown gets called before OnHttpRequestComplete gets called back? The real world examples my team has are even more complex, and the edge cases are even harder to spot during the code review process. Common Mistake #2 - fixing deadlock issues by blindly exiting the lock, wait for the other thread to finish, then re-enter the lock - but without handling the case that the object just got updated by the other thread! Common Mistake #3 - Even though the objects are reference counted, the shutdown sequence "releases" it's pointer. But forgets to wait for the thread that is still running to release it's instance. As such, components are shutdown cleanly, then spurious or late callbacks are invoked on an object in an state not expecting any more calls. There are other edge cases, but the bottom line is this: Multithreaded programming is just plain hard, even for smart people. As I catch these mistakes, I spend time discussing the errors with each developer on developing a more appropriate fix. But I suspect they are often confused on how to solve each issue because of the enormous amount of legacy code that the "right" fix will involve touching. We're going to be shipping soon, and I'm sure the patches we're applying will hold for the upcoming release. Afterwards, we're going to have some time to improve the code base and refactor where needed. We won't have time to just re-write everything. And the majority of the code isn't all that bad. But I'm looking to refactor code such that threading issues can be avoided altogether. One approach I am considering is this. For each significant platform feature, have a dedicated single thread where all events and network callbacks get marshalled onto. Similar to COM apartment threading in Windows with use of a message loop. Long blocking operations could still get dispatched to a work pool thread, but the completion callback is invoked on on the component's thread. Components could possibly even share the same thread. Then all the class libraries running inside the thread can be written under the assumption of a single threaded world. Before I go down that path, I am also very interested if there are other standard techniques or design patterns for dealing with multithreaded issues. And I have to emphasize - something beyond a book that describes the basics of mutexes and semaphores. What do you think? I am also interested in any other approaches to take towards a refactoring process. Including any of the following: Literature or papers on design patterns around threads. Something beyond an introduction to mutexes and semaphores. We don't need massive parallelism either, just ways to design an object model so as to handle asynchronous events from other threads correctly. Ways to diagram the threading of various components, so that it will be easy to study and evolve solutions for. (That is, a UML equivalent for discussing threads across objects and classes) Educating your development team on the issues with multithreaded code. What would you do?

Read the article
Inside the Concurrent Collections: ConcurrentDictionary

- by Simon Cooper

Using locks to implement a thread-safe collection is rather like using a sledgehammer - unsubtle, easy to understand, and tends to make any other tool redundant. Unlike the previous two collections I looked at, ConcurrentStack and ConcurrentQueue, ConcurrentDictionary uses locks quite heavily. However, it is careful to wield locks only where necessary to ensure that concurrency is maximised. This will, by necessity, be a higher-level look than my other posts in this series, as there is quite a lot of code and logic in ConcurrentDictionary. Therefore, I do recommend that you have ConcurrentDictionary open in a decompiler to have a look at all the details that I skip over. The problem with locks There's several things to bear in mind when using locks, as encapsulated by the lock keyword in C# and the System.Threading.Monitor class in .NET (if you're unsure as to what lock does in C#, I briefly covered it in my first post in the series): Locks block threads The most obvious problem is that threads waiting on a lock can't do any work at all. No preparatory work, no 'optimistic' work like in ConcurrentQueue and ConcurrentStack, nothing. It sits there, waiting to be unblocked. This is bad if you're trying to maximise concurrency. Locks are slow Whereas most of the methods on the Interlocked class can be compiled down to a single CPU instruction, ensuring atomicity at the hardware level, taking out a lock requires some heavy lifting by the CLR and the operating system. There's quite a bit of work required to take out a lock, block other threads, and wake them up again. If locks are used heavily, this impacts performance. Deadlocks When using locks there's always the possibility of a deadlock - two threads, each holding a lock, each trying to aquire the other's lock. Fortunately, this can be avoided with careful programming and structured lock-taking, as we'll see. So, it's important to minimise where locks are used to maximise the concurrency and performance of the collection. Implementation As you might expect, ConcurrentDictionary is similar in basic implementation to the non-concurrent Dictionary, which I studied in a previous post. I'll be using some concepts introduced there, so I recommend you have a quick read of it. So, if you were implementing a thread-safe dictionary, what would you do? The naive implementation is to simply have a single lock around all methods accessing the dictionary. This would work, but doesn't allow much concurrency. Fortunately, the bucketing used by Dictionary allows a simple but effective improvement to this - one lock per bucket. This allows different threads modifying different buckets to do so in parallel. Any thread making changes to the contents of a bucket takes the lock for that bucket, ensuring those changes are thread-safe. The method that maps each bucket to a lock is the GetBucketAndLockNo method: private void GetBucketAndLockNo( int hashcode, out int bucketNo, out int lockNo, int bucketCount) { // the bucket number is the hashcode (without the initial sign bit) // modulo the number of buckets bucketNo = (hashcode & 0x7fffffff) % bucketCount; // and the lock number is the bucket number modulo the number of locks lockNo = bucketNo % m_locks.Length; } However, this does require some changes to how the buckets are implemented. The 'implicit' linked list within a single backing array used by the non-concurrent Dictionary adds a dependency between separate buckets, as every bucket uses the same backing array. Instead, ConcurrentDictionary uses a strict linked list on each bucket: This ensures that each bucket is entirely separate from all other buckets; adding or removing an item from a bucket is independent to any changes to other buckets. Modifying the dictionary All the operations on the dictionary follow the same basic pattern: void AlterBucket(TKey key, ...) { int bucketNo, lockNo; 1: GetBucketAndLockNo( key.GetHashCode(), out bucketNo, out lockNo, m_buckets.Length); 2: lock (m_locks[lockNo]) { 3: Node headNode = m_buckets[bucketNo]; 4: Mutate the node linked list as appropriate } } For example, when adding another entry to the dictionary, you would iterate through the linked list to check whether the key exists already, and add the new entry as the head node. When removing items, you would find the entry to remove (if it exists), and remove the node from the linked list. Adding, updating, and removing items all follow this pattern. Performance issues There is a problem we have to address at this point. If the number of buckets in the dictionary is fixed in the constructor, then the performance will degrade from O(1) to O(n) when a large number of items are added to the dictionary. As more and more items get added to the linked lists in each bucket, the lookup operations will spend most of their time traversing a linear linked list. To fix this, the buckets array has to be resized once the number of items in each bucket has gone over a certain limit. (In ConcurrentDictionary this limit is when the size of the largest bucket is greater than the number of buckets for each lock. This check is done at the end of the TryAddInternal method.) Resizing the bucket array and re-hashing everything affects every bucket in the collection. Therefore, this operation needs to take out every lock in the collection. Taking out mutiple locks at once inevitably summons the spectre of the deadlock; two threads each hold a lock, and each trying to acquire the other lock. How can we eliminate this? Simple - ensure that threads never try to 'swap' locks in this fashion. When taking out multiple locks, always take them out in the same order, and always take out all the locks you need before starting to release them. In ConcurrentDictionary, this is controlled by the AcquireLocks, AcquireAllLocks and ReleaseLocks methods. Locks are always taken out and released in the order they are in the m_locks array, and locks are all released right at the end of the method in a finally block. At this point, it's worth pointing out that the locks array is never re-assigned, even when the buckets array is increased in size. The number of locks is fixed in the constructor by the concurrencyLevel parameter. This simplifies programming the locks; you don't have to check if the locks array has changed or been re-assigned before taking out a lock object. And you can be sure that when a thread takes out a lock, another thread isn't going to re-assign the lock array. This would create a new series of lock objects, thus allowing another thread to ignore the existing locks (and any threads controlling them), breaking thread-safety. Consequences of growing the array Just because we're using locks doesn't mean that race conditions aren't a problem. We can see this by looking at the GrowTable method. The operation of this method can be boiled down to: private void GrowTable(Node[] buckets) { try { 1: Acquire first lock in the locks array // this causes any other thread trying to take out // all the locks to block because the first lock in the array // is always the one taken out first // check if another thread has already resized the buckets array // while we were waiting to acquire the first lock 2: if (buckets != m_buckets) return; 3: Calculate the new size of the backing array 4: Node[] array = new array[size]; 5: Acquire all the remaining locks 6: Re-hash the contents of the existing buckets into array 7: m_buckets = array; } finally { 8: Release all locks } } As you can see, there's already a check for a race condition at step 2, for the case when the GrowTable method is called twice in quick succession on two separate threads. One will successfully resize the buckets array (blocking the second in the meantime), when the second thread is unblocked it'll see that the array has already been resized & exit without doing anything. There is another case we need to consider; looking back at the AlterBucket method above, consider the following situation: Thread 1 calls AlterBucket; step 1 is executed to get the bucket and lock numbers. Thread 2 calls GrowTable and executes steps 1-5; thread 1 is blocked when it tries to take out the lock in step 2. Thread 2 re-hashes everything, re-assigns the buckets array, and releases all the locks (steps 6-8). Thread 1 is unblocked and continues executing, but the calculated bucket and lock numbers are no longer valid. Between calculating the correct bucket and lock number and taking out the lock, another thread has changed where everything is. Not exactly thread-safe. Well, a similar problem was solved in ConcurrentStack and ConcurrentQueue by storing a local copy of the state, doing the necessary calculations, then checking if that state is still valid. We can use a similar idea here: void AlterBucket(TKey key, ...) { while (true) { Node[] buckets = m_buckets; int bucketNo, lockNo; GetBucketAndLockNo( key.GetHashCode(), out bucketNo, out lockNo, buckets.Length); lock (m_locks[lockNo]) { // if the state has changed, go back to the start if (buckets != m_buckets) continue; Node headNode = m_buckets[bucketNo]; Mutate the node linked list as appropriate } break; } } TryGetValue and GetEnumerator And so, finally, we get onto TryGetValue and GetEnumerator. I've left these to the end because, well, they don't actually use any locks. How can this be? Whenever you change a bucket, you need to take out the corresponding lock, yes? Indeed you do. However, it is important to note that TryGetValue and GetEnumerator don't actually change anything. Just as immutable objects are, by definition, thread-safe, read-only operations don't need to take out a lock because they don't change anything. All lockless methods can happily iterate through the buckets and linked lists without worrying about locking anything. However, this does put restrictions on how the other methods operate. Because there could be another thread in the middle of reading the dictionary at any time (even if a lock is taken out), the dictionary has to be in a valid state at all times. Every change to state has to be made visible to other threads in a single atomic operation (all relevant variables are marked volatile to help with this). This restriction ensures that whatever the reading threads are doing, they never read the dictionary in an invalid state (eg items that should be in the collection temporarily removed from the linked list, or reading a node that has had it's key & value removed before the node itself has been removed from the linked list). Fortunately, all the operations needed to change the dictionary can be done in that way. Bucket resizes are made visible when the new array is assigned back to the m_buckets variable. Any additions or modifications to a node are done by creating a new node, then splicing it into the existing list using a single variable assignment. Node removals are simply done by re-assigning the node's m_next pointer. Because the dictionary can be changed by another thread during execution of the lockless methods, the GetEnumerator method is liable to return dirty reads - changes made to the dictionary after GetEnumerator was called, but before the enumeration got to that point in the dictionary. It's worth listing at this point which methods are lockless, and which take out all the locks in the dictionary to ensure they get a consistent view of the dictionary: Lockless: TryGetValue GetEnumerator The indexer getter ContainsKey Takes out every lock (lockfull?): Count IsEmpty Keys Values CopyTo ToArray Concurrent principles That covers the overall implementation of ConcurrentDictionary. I haven't even begun to scratch the surface of this sophisticated collection. That I leave to you. However, we've looked at enough to be able to extract some useful principles for concurrent programming: Partitioning When using locks, the work is partitioned into independant chunks, each with its own lock. Each partition can then be modified concurrently to other partitions. Ordered lock-taking When a method does need to control the entire collection, locks are taken and released in a fixed order to prevent deadlocks. Lockless reads Read operations that don't care about dirty reads don't take out any lock; the rest of the collection is implemented so that any reading thread always has a consistent view of the collection. That leads us to the final collection in this little series - ConcurrentBag. Lacking a non-concurrent analogy, it is quite different to any other collection in the class libraries. Prepare your thinking hats!

Read the article
How to reconcile my support of open-source software and need to feed and house myself?

- by Guzba

I have a bit of a dilemma and wanted to get some other developers' opinions on it and maybe some guidance. So I have created a 2D game for Android from the ground up, learning and re factoring as I went along. I did it for the experience, and have been proud of the results. I released it for free as ad supported with AdMob not really expecting much out of it, but curious to see what would happen. Its been a few of months now since release, and it has become very popular (250k downloads!). Additionally, the ad revenue is great and is driving me to make more good games and even allowing me to work less so that I can focus on my own works. When I originally began working on the game, I was pretty new to concurrency and completely new to Android (had Java experience though). The standard advice I got for starting an Android game was to look at the sample games from Google (Snake, Lunar Lander, ...) so I did. In my opinion, these Android sample games from Google are decent to see in general what your code should look like, but not actually all that great to follow. This is because some of their features don't work (saving game state), the concurrency is both unexplained and cumbersome (there is no real separation between the game thread and the UI thread since they sync lock each other out all the time and the UI thread runs game thread code). This made it difficult for me as a newbie to concurrency to understand how it was organized and what was really running what code. Here is my dilemma: After spending this past few months slowly improving my code, I feel that it could be very beneficial to developers who are in the same position that I was in when I started. (Since it is not a complex game, but clearly coded in my opinion.) I want to open up the source so that others can learn from it but don't want to lose my ad revenue stream, which, if I did open the source, I fear I would when people released versions with the ad stripped, or minor tweaks that would fragment my audience, etc. I am a CS undergrad major in college and this money is giving me the freedom to work less at summer job, thus giving me the time and will to work on more of my own projects and improving my own skills while still paying the bills. So what do I do? Open the source at personal sacrifice for the greater good, or keep it closed and be a sort of hypocritical supporter of open source?

Read the article
Does this prove a network bandwidth bottleneck?

- by Yuji Tomita

I've incorrectly assumed that my internal AB testing means my server can handle 1k concurrency @3k hits per second. My theory at at the moment is that the network is the bottleneck. The server can't send enough data fast enough. External testing from blitz.io at 1k concurrency shows my hits/s capping off at 180, with pages taking longer and longer to respond as the server is only able to return 180 per second. I've served a blank file from nginx and benched it: it scales 1:1 with concurrency. Now to rule out IO / memcached bottlenecks (nginx normally pulls from memcached), I serve up a static version of the cached page from the filesystem. The results are very similar to my original test; I'm capped at around 180 RPS. Splitting the HTML page in half gives me double the RPS, so it's definitely limited by the size of the page. If I internally ApacheBench from the local server, I get consistent results of around 4k RPS on both the Full Page and the Half Page, at high transfer rates. Transfer rate: 62586.14 [Kbytes/sec] received If I AB from an external server, I get around 180RPS - same as the blitz.io results. How do I know it's not intentional throttling? If I benchmark from multiple external servers, all results become poor which leads me to believe the problem is in MY servers outbound traffic, not a download speed issue with my benchmarking servers / blitz.io. So I'm back to my conclusion that my server can't send data fast enough. Am I right? Are there other ways to interpret this data? Is the solution/optimization to set up multiple servers + load balancing that can each serve 180 hits per second? I'm quite new to server optimization, so I'd appreciate any confirmation interpreting this data. Outbound traffic Here's more information about the outbound bandwidth: The network graph shows a maximum output of 16 Mb/s: 16 megabits per second. Doesn't sound like much at all. Due to a suggestion about throttling, I looked into this and found that linode has a 50mbps cap (which I'm not even close to hitting, apparently). I had it raised to 100mbps. Since linode caps my traffic, and I'm not even hitting it, does this mean that my server should indeed be capable of outputting up to 100mbps but is limited by some other internal bottleneck? I just don't understand how networks at this large of a scale work; can they literally send data as fast as they can read from the HDD? Is the network pipe that big? In conclusion 1: Based on the above, I'm thinking I can definitely raise my 180RPS by adding an nginx load balancer on top of a multi nginx server setup at exactly 180RPS per server behind the LB. 2: If linode has a 50/100mbit limit that I'm not hitting at all, there must be something I can do to hit that limit with my single server setup. If I can read / transmit data fast enough locally, and linode even bothers to have a 50mbit/100mbit cap, there must be an internal bottleneck that's not allowing me to hit those caps that I'm not sure how to detect. Correct? I realize the question is huge and vague now, but I'm not sure how to condense it. Any input is appreciated on any conclusion I've made.

Read the article
C# 4: The Curious ConcurrentDictionary

- by James Michael Hare

In my previous post (here) I did a comparison of the new ConcurrentQueue versus the old standard of a System.Collections.Generic Queue with simple locking. The results were exactly what I would have hoped, that the ConcurrentQueue was faster with multi-threading for most all situations. In addition, concurrent collections have the added benefit that you can enumerate them even if they're being modified. So I set out to see what the improvements would be for the ConcurrentDictionary, would it have the same performance benefits as the ConcurrentQueue did? Well, after running some tests and multiple tweaks and tunes, I have good and bad news. But first, let's look at the tests. Obviously there's many things we can do with a dictionary. One of the most notable uses, of course, in a multi-threaded environment is for a small, local in-memory cache. So I set about to do a very simple simulation of a cache where I would create a test class that I'll just call an Accessor. This accessor will attempt to look up a key in the dictionary, and if the key exists, it stops (i.e. a cache "hit"). However, if the lookup fails, it will then try to add the key and value to the dictionary (i.e. a cache "miss"). So here's the Accessor that will run the tests: 1: internal class Accessor 2: { 3: public int Hits { get; set; } 4: public int Misses { get; set; } 5: public Func<int, string> GetDelegate { get; set; } 6: public Action<int, string> AddDelegate { get; set; } 7: public int Iterations { get; set; } 8: public int MaxRange { get; set; } 9: public int Seed { get; set; } 10: 11: public void Access() 12: { 13: var randomGenerator = new Random(Seed); 14: 15: for (int i=0; i<Iterations; i++) 16: { 17: // give a wide spread so will have some duplicates and some unique 18: var target = randomGenerator.Next(1, MaxRange); 19: 20: // attempt to grab the item from the cache 21: var result = GetDelegate(target); 22: 23: // if the item doesn't exist, add it 24: if(result == null) 25: { 26: AddDelegate(target, target.ToString()); 27: Misses++; 28: } 29: else 30: { 31: Hits++; 32: } 33: } 34: } 35: } Note that so I could test different implementations, I defined a GetDelegate and AddDelegate that will call the appropriate dictionary methods to add or retrieve items in the cache using various techniques. So let's examine the three techniques I decided to test: Dictionary with mutex - Just your standard generic Dictionary with a simple lock construct on an internal object. Dictionary with ReaderWriterLockSlim - Same Dictionary, but now using a lock designed to let multiple readers access simultaneously and then locked when a writer needs access. ConcurrentDictionary - The new ConcurrentDictionary from System.Collections.Concurrent that is supposed to be optimized to allow multiple threads to access safely. So the approach to each of these is also fairly straight-forward. Let's look at the GetDelegate and AddDelegate implementations for the Dictionary with mutex lock: 1: var addDelegate = (key,val) => 2: { 3: lock (_mutex) 4: { 5: _dictionary[key] = val; 6: } 7: }; 8: var getDelegate = (key) => 9: { 10: lock (_mutex) 11: { 12: string val; 13: return _dictionary.TryGetValue(key, out val) ? val : null; 14: } 15: }; Nothing new or fancy here, just your basic lock on a private object and then query/insert into the Dictionary. Now, for the Dictionary with ReadWriteLockSlim it's a little more complex: 1: var addDelegate = (key,val) => 2: { 3: _readerWriterLock.EnterWriteLock(); 4: _dictionary[key] = val; 5: _readerWriterLock.ExitWriteLock(); 6: }; 7: var getDelegate = (key) => 8: { 9: string val; 10: _readerWriterLock.EnterReadLock(); 11: if(!_dictionary.TryGetValue(key, out val)) 12: { 13: val = null; 14: } 15: _readerWriterLock.ExitReadLock(); 16: return val; 17: }; And finally, the ConcurrentDictionary, which since it does all it's own concurrency control, is remarkably elegant and simple: 1: var addDelegate = (key,val) => 2: { 3: _concurrentDictionary[key] = val; 4: }; 5: var getDelegate = (key) => 6: { 7: string s; 8: return _concurrentDictionary.TryGetValue(key, out s) ? s : null; 9: }; Then, I set up a test harness that would simply ask the user for the number of concurrent Accessors to attempt to Access the cache (as specified in Accessor.Access() above) and then let them fly and see how long it took them all to complete. Each of these tests was run with 10,000,000 cache accesses divided among the available Accessor instances. All times are in milliseconds. 1: Dictionary with Mutex Locking 2: --------------------------------------------------- 3: Accessors Mostly Misses Mostly Hits 4: 1 7916 3285 5: 10 8293 3481 6: 100 8799 3532 7: 1000 8815 3584 8: 9: 10: Dictionary with ReaderWriterLockSlim Locking 11: --------------------------------------------------- 12: Accessors Mostly Misses Mostly Hits 13: 1 8445 3624 14: 10 11002 4119 15: 100 11076 3992 16: 1000 14794 4861 17: 18: 19: Concurrent Dictionary 20: --------------------------------------------------- 21: Accessors Mostly Misses Mostly Hits 22: 1 17443 3726 23: 10 14181 1897 24: 100 15141 1994 25: 1000 17209 2128 The first test I did across the board is the Mostly Misses category. The mostly misses (more adds because data requested was not in the dictionary) shows an interesting trend. In both cases the Dictionary with the simple mutex lock is much faster, and the ConcurrentDictionary is the slowest solution. But this got me thinking, and a little research seemed to confirm it, maybe the ConcurrentDictionary is more optimized to concurrent "gets" than "adds". So since the ratio of misses to hits were 2 to 1, I decided to reverse that and see the results. So I tweaked the data so that the number of keys were much smaller than the number of iterations to give me about a 2 to 1 ration of hits to misses (twice as likely to already find the item in the cache than to need to add it). And yes, indeed here we see that the ConcurrentDictionary is indeed faster than the standard Dictionary here. I have a strong feeling that as the ration of hits-to-misses gets higher and higher these number gets even better as well. This makes sense since the ConcurrentDictionary is read-optimized. Also note that I tried the tests with capacity and concurrency hints on the ConcurrentDictionary but saw very little improvement, I think this is largely because on the 10,000,000 hit test it quickly ramped up to the correct capacity and concurrency and thus the impact was limited to the first few milliseconds of the run. So what does this tell us? Well, as in all things, ConcurrentDictionary is not a panacea. It won't solve all your woes and it shouldn't be the only Dictionary you ever use. So when should we use each? Use System.Collections.Generic.Dictionary when: You need a single-threaded Dictionary (no locking needed). You need a multi-threaded Dictionary that is loaded only once at creation and never modified (no locking needed). You need a multi-threaded Dictionary to store items where writes are far more prevalent than reads (locking needed). And use System.Collections.Concurrent.ConcurrentDictionary when: You need a multi-threaded Dictionary where the writes are far more prevalent than reads. You need to be able to iterate over the collection without locking it even if its being modified. Both Dictionaries have their strong suits, I have a feeling this is just one where you need to know from design what you hope to use it for and make your decision based on that criteria.

Read the article

Search Results

Search found 813 results on 33 pages for 'concurrency'.

Page 18/33 | < Previous Page | 14 15 16 17 18 19 20 21 22 23 24 25 | Next Page >

- by Gringo Suave

- by Laurens

- by Tony Davis

- by user1289508

- by andrzejp

- by Eugene Beresovksy

- by drsql

- by rokonoid

- by rokonoid

- by András Szepesházi

- by Jordan Messina

- by Ricardo Peres

- by Dave

- by Dev G

- by Daniel Moth

- by Maxem

- by Zenofo

- by user18983

- by jean-pierre.dijcks

- by jpurdy

- by koncurrency

- by Simon Cooper

- by Guzba

- by Yuji Tomita

- by James Michael Hare

< Previous Page | 14 15 16 17 18 19 20 21 22 23 24 25 | Next Page >