Search Results

Search found 33242 results on 1330 pages for 'database optimization'.

Page 712/1330 | < Previous Page | 708 709 710 711 712 713 714 715 716 717 718 719  | Next Page >

  • T-SQL Tuesday #31 - Logging Tricks with CONTEXT_INFO

    - by Most Valuable Yak (Rob Volk)
    This month's T-SQL Tuesday is being hosted by Aaron Nelson [b | t], fellow Atlantan (the city in Georgia, not the famous sunken city, or the resort in the Bahamas) and covers the topic of logging (the recording of information, not the harvesting of trees) and maintains the fine T-SQL Tuesday tradition begun by Adam Machanic [b | t] (the SQL Server guru, not the guy who fixes cars, check the spelling again, there will be a quiz later). This is a trick I learned from Fernando Guerrero [b | t] waaaaaay back during the PASS Summit 2004 in sunny, hurricane-infested Orlando, during his session on Secret SQL Server (not sure if that's the correct title, and I haven't used parentheses in this paragraph yet).  CONTEXT_INFO is a neat little feature that's existed since SQL Server 2000 and perhaps even earlier.  It lets you assign data to the current session/connection, and maintains that data until you disconnect or change it.  In addition to the CONTEXT_INFO() function, you can also query the context_info column in sys.dm_exec_sessions, or even sysprocesses if you're still running SQL Server 2000, if you need to see it for another session. While you're limited to 128 bytes, one big advantage that CONTEXT_INFO has is that it's independent of any transactions.  If you've ever logged to a table in a transaction and then lost messages when it rolled back, you can understand how aggravating it can be.  CONTEXT_INFO also survives across multiple SQL batches (GO separators) in the same connection, so for those of you who were going to suggest "just log to a table variable, they don't get rolled back":  HA-HA, I GOT YOU!  Since GO starts a new batch all variable declarations are lost. Here's a simple example I recently used at work.  I had to test database mirroring configurations for disaster recovery scenarios and measure the network throughput.  I also needed to log how long it took for the script to run and include the mirror settings for the database in question.  I decided to use AdventureWorks as my database model, and Adam Machanic's Big Adventure script to provide a fairly large workload that's repeatable and easily scalable.  My test would consist of several copies of AdventureWorks running the Big Adventure script while I mirrored the databases (or not). Since Adam's script contains several batches, I decided CONTEXT_INFO would have to be used.  As it turns out, I only needed to grab the start time at the beginning, I could get the rest of the data at the end of the process.   The code is pretty small: declare @time binary(128)=cast(getdate() as binary(8)) set context_info @time   ... rest of Big Adventure code ...   go use master; insert mirror_test(server,role,partner,db,state,safety,start,duration) select @@servername, mirroring_role_desc, mirroring_partner_instance, db_name(database_id), mirroring_state_desc, mirroring_safety_level_desc, cast(cast(context_info() as binary(8)) as datetime), datediff(s,cast(cast(context_info() as binary(8)) as datetime),getdate()) from sys.database_mirroring where db_name(database_id) like 'Adv%';   I declared @time as a binary(128) since CONTEXT_INFO is defined that way.  I couldn't convert GETDATE() to binary(128) as it would pad the first 120 bytes as 0x00.  To keep the CAST functions simple and avoid using SUBSTRING, I decided to CAST GETDATE() as binary(8) and let SQL Server do the implicit conversion.  It's not the safest way perhaps, but it works on my machine. :) As I mentioned earlier, you can query system views for sessions and get their CONTEXT_INFO.  With a little boilerplate code this can be used to monitor long-running procedures, in case you need to kill a process, or are just curious  how long certain parts take.  In this example, I added code to Adam's Big Adventure script to set CONTEXT_INFO messages at strategic places I want to monitor.  (His code is in UPPERCASE as it was in the original, mine is all lowercase): declare @msg binary(128) set @msg=cast('Altering bigProduct.ProductID' as binary(128)) set context_info @msg go ALTER TABLE bigProduct ALTER COLUMN ProductID INT NOT NULL GO set context_info 0x0 go declare @msg1 binary(128) set @msg1=cast('Adding pk_bigProduct Constraint' as binary(128)) set context_info @msg1 go ALTER TABLE bigProduct ADD CONSTRAINT pk_bigProduct PRIMARY KEY (ProductID) GO set context_info 0x0 go declare @msg2 binary(128) set @msg2=cast('Altering bigTransactionHistory.TransactionID' as binary(128)) set context_info @msg2 go ALTER TABLE bigTransactionHistory ALTER COLUMN TransactionID INT NOT NULL GO set context_info 0x0 go declare @msg3 binary(128) set @msg3=cast('Adding pk_bigTransactionHistory Constraint' as binary(128)) set context_info @msg3 go ALTER TABLE bigTransactionHistory ADD CONSTRAINT pk_bigTransactionHistory PRIMARY KEY NONCLUSTERED(TransactionID) GO set context_info 0x0 go declare @msg4 binary(128) set @msg4=cast('Creating IX_ProductId_TransactionDate Index' as binary(128)) set context_info @msg4 go CREATE NONCLUSTERED INDEX IX_ProductId_TransactionDate ON bigTransactionHistory(ProductId,TransactionDate) INCLUDE(Quantity,ActualCost) GO set context_info 0x0   This doesn't include the entire script, only those portions that altered a table or created an index.  One annoyance is that SET CONTEXT_INFO requires a literal or variable, you can't use an expression.  And since GO starts a new batch I need to declare a variable in each one.  And of course I have to use CAST because it won't implicitly convert varchar to binary.  And even though context_info is a nullable column, you can't SET CONTEXT_INFO NULL, so I have to use SET CONTEXT_INFO 0x0 to clear the message after the statement completes.  And if you're thinking of turning this into a UDF, you can't, although a stored procedure would work. So what does all this aggravation get you?  As the code runs, if I want to see which stage the session is at, I can run the following (assuming SPID 51 is the one I want): select CAST(context_info as varchar(128)) from sys.dm_exec_sessions where session_id=51   Since SQL Server 2005 introduced the new system and dynamic management views (DMVs) there's not as much need for tagging a session with these kinds of messages.  You can get the session start time and currently executing statement from them, and neatly presented if you use Adam's sp_whoisactive utility (and you absolutely should be using it).  Of course you can always use xp_cmdshell, a CLR function, or some other tricks to log information outside of a SQL transaction.  All the same, I've used this trick to monitor long-running reports at a previous job, and I still think CONTEXT_INFO is a great feature, especially if you're still using SQL Server 2000 or want to supplement your instrumentation.  If you'd like an exercise, consider adding the system time to the messages in the last example, and an automated job to query and parse it from the system tables.  That would let you track how long each statement ran without having to run Profiler. #TSQL2sDay

    Read the article

  • PTLQueue : a scalable bounded-capacity MPMC queue

    - by Dave
    Title: Fast concurrent MPMC queue -- I've used the following concurrent queue algorithm enough that it warrants a blog entry. I'll sketch out the design of a fast and scalable multiple-producer multiple-consumer (MPSC) concurrent queue called PTLQueue. The queue has bounded capacity and is implemented via a circular array. Bounded capacity can be a useful property if there's a mismatch between producer rates and consumer rates where an unbounded queue might otherwise result in excessive memory consumption by virtue of the container nodes that -- in some queue implementations -- are used to hold values. A bounded-capacity queue can provide flow control between components. Beware, however, that bounded collections can also result in resource deadlock if abused. The put() and take() operators are partial and wait for the collection to become non-full or non-empty, respectively. Put() and take() do not allocate memory, and are not vulnerable to the ABA pathologies. The PTLQueue algorithm can be implemented equally well in C/C++ and Java. Partial operators are often more convenient than total methods. In many use cases if the preconditions aren't met, there's nothing else useful the thread can do, so it may as well wait via a partial method. An exception is in the case of work-stealing queues where a thief might scan a set of queues from which it could potentially steal. Total methods return ASAP with a success-failure indication. (It's tempting to describe a queue or API as blocking or non-blocking instead of partial or total, but non-blocking is already an overloaded concurrency term. Perhaps waiting/non-waiting or patient/impatient might be better terms). It's also trivial to construct partial operators by busy-waiting via total operators, but such constructs may be less efficient than an operator explicitly and intentionally designed to wait. A PTLQueue instance contains an array of slots, where each slot has volatile Turn and MailBox fields. The array has power-of-two length allowing mod/div operations to be replaced by masking. We assume sensible padding and alignment to reduce the impact of false sharing. (On x86 I recommend 128-byte alignment and padding because of the adjacent-sector prefetch facility). Each queue also has PutCursor and TakeCursor cursor variables, each of which should be sequestered as the sole occupant of a cache line or sector. You can opt to use 64-bit integers if concerned about wrap-around aliasing in the cursor variables. Put(null) is considered illegal, but the caller or implementation can easily check for and convert null to a distinguished non-null proxy value if null happens to be a value you'd like to pass. Take() will accordingly convert the proxy value back to null. An advantage of PTLQueue is that you can use atomic fetch-and-increment for the partial methods. We initialize each slot at index I with (Turn=I, MailBox=null). Both cursors are initially 0. All shared variables are considered "volatile" and atomics such as CAS and AtomicFetchAndIncrement are presumed to have bidirectional fence semantics. Finally T is the templated type. I've sketched out a total tryTake() method below that allows the caller to poll the queue. tryPut() has an analogous construction. Zebra stripping : alternating row colors for nice-looking code listings. See also google code "prettify" : https://code.google.com/p/google-code-prettify/ Prettify is a javascript module that yields the HTML/CSS/JS equivalent of pretty-print. -- pre:nth-child(odd) { background-color:#ff0000; } pre:nth-child(even) { background-color:#0000ff; } border-left: 11px solid #ccc; margin: 1.7em 0 1.7em 0.3em; background-color:#BFB; font-size:12px; line-height:65%; " // PTLQueue : Put(v) : // producer : partial method - waits as necessary assert v != null assert Mask = 1 && (Mask & (Mask+1)) == 0 // Document invariants // doorway step // Obtain a sequence number -- ticket // As a practical concern the ticket value is temporally unique // The ticket also identifies and selects a slot auto tkt = AtomicFetchIncrement (&PutCursor, 1) slot * s = &Slots[tkt & Mask] // waiting phase : // wait for slot's generation to match the tkt value assigned to this put() invocation. // The "generation" is implicitly encoded as the upper bits in the cursor // above those used to specify the index : tkt div (Mask+1) // The generation serves as an epoch number to identify a cohort of threads // accessing disjoint slots while s-Turn != tkt : Pause assert s-MailBox == null s-MailBox = v // deposit and pass message Take() : // consumer : partial method - waits as necessary auto tkt = AtomicFetchIncrement (&TakeCursor,1) slot * s = &Slots[tkt & Mask] // 2-stage waiting : // First wait for turn for our generation // Acquire exclusive "take" access to slot's MailBox field // Then wait for the slot to become occupied while s-Turn != tkt : Pause // Concurrency in this section of code is now reduced to just 1 producer thread // vs 1 consumer thread. // For a given queue and slot, there will be most one Take() operation running // in this section. // Consumer waits for producer to arrive and make slot non-empty // Extract message; clear mailbox; advance Turn indicator // We have an obvious happens-before relation : // Put(m) happens-before corresponding Take() that returns that same "m" for T v = s-MailBox if v != null : s-MailBox = null ST-ST barrier s-Turn = tkt + Mask + 1 // unlock slot to admit next producer and consumer return v Pause tryTake() : // total method - returns ASAP with failure indication for auto tkt = TakeCursor slot * s = &Slots[tkt & Mask] if s-Turn != tkt : return null T v = s-MailBox // presumptive return value if v == null : return null // ratify tkt and v values and commit by advancing cursor if CAS (&TakeCursor, tkt, tkt+1) != tkt : continue s-MailBox = null ST-ST barrier s-Turn = tkt + Mask + 1 return v The basic idea derives from the Partitioned Ticket Lock "PTL" (US20120240126-A1) and the MultiLane Concurrent Bag (US8689237). The latter is essentially a circular ring-buffer where the elements themselves are queues or concurrent collections. You can think of the PTLQueue as a partitioned ticket lock "PTL" augmented to pass values from lock to unlock via the slots. Alternatively, you could conceptualize of PTLQueue as a degenerate MultiLane bag where each slot or "lane" consists of a simple single-word MailBox instead of a general queue. Each lane in PTLQueue also has a private Turn field which acts like the Turn (Grant) variables found in PTL. Turn enforces strict FIFO ordering and restricts concurrency on the slot mailbox field to at most one simultaneous put() and take() operation. PTL uses a single "ticket" variable and per-slot Turn (grant) fields while MultiLane has distinct PutCursor and TakeCursor cursors and abstract per-slot sub-queues. Both PTL and MultiLane advance their cursor and ticket variables with atomic fetch-and-increment. PTLQueue borrows from both PTL and MultiLane and has distinct put and take cursors and per-slot Turn fields. Instead of a per-slot queues, PTLQueue uses a simple single-word MailBox field. PutCursor and TakeCursor act like a pair of ticket locks, conferring "put" and "take" access to a given slot. PutCursor, for instance, assigns an incoming put() request to a slot and serves as a PTL "Ticket" to acquire "put" permission to that slot's MailBox field. To better explain the operation of PTLQueue we deconstruct the operation of put() and take() as follows. Put() first increments PutCursor obtaining a new unique ticket. That ticket value also identifies a slot. Put() next waits for that slot's Turn field to match that ticket value. This is tantamount to using a PTL to acquire "put" permission on the slot's MailBox field. Finally, having obtained exclusive "put" permission on the slot, put() stores the message value into the slot's MailBox. Take() similarly advances TakeCursor, identifying a slot, and then acquires and secures "take" permission on a slot by waiting for Turn. Take() then waits for the slot's MailBox to become non-empty, extracts the message, and clears MailBox. Finally, take() advances the slot's Turn field, which releases both "put" and "take" access to the slot's MailBox. Note the asymmetry : put() acquires "put" access to the slot, but take() releases that lock. At any given time, for a given slot in a PTLQueue, at most one thread has "put" access and at most one thread has "take" access. This restricts concurrency from general MPMC to 1-vs-1. We have 2 ticket locks -- one for put() and one for take() -- each with its own "ticket" variable in the form of the corresponding cursor, but they share a single "Grant" egress variable in the form of the slot's Turn variable. Advancing the PutCursor, for instance, serves two purposes. First, we obtain a unique ticket which identifies a slot. Second, incrementing the cursor is the doorway protocol step to acquire the per-slot mutual exclusion "put" lock. The cursors and operations to increment those cursors serve double-duty : slot-selection and ticket assignment for locking the slot's MailBox field. At any given time a slot MailBox field can be in one of the following states: empty with no pending operations -- neutral state; empty with one or more waiting take() operations pending -- deficit; occupied with no pending operations; occupied with one or more waiting put() operations -- surplus; empty with a pending put() or pending put() and take() operations -- transitional; or occupied with a pending take() or pending put() and take() operations -- transitional. The partial put() and take() operators can be implemented with an atomic fetch-and-increment operation, which may confer a performance advantage over a CAS-based loop. In addition we have independent PutCursor and TakeCursor cursors. Critically, a put() operation modifies PutCursor but does not access the TakeCursor and a take() operation modifies the TakeCursor cursor but does not access the PutCursor. This acts to reduce coherence traffic relative to some other queue designs. It's worth noting that slow threads or obstruction in one slot (or "lane") does not impede or obstruct operations in other slots -- this gives us some degree of obstruction isolation. PTLQueue is not lock-free, however. The implementation above is expressed with polite busy-waiting (Pause) but it's trivial to implement per-slot parking and unparking to deschedule waiting threads. It's also easy to convert the queue to a more general deque by replacing the PutCursor and TakeCursor cursors with Left/Front and Right/Back cursors that can move either direction. Specifically, to push and pop from the "left" side of the deque we would decrement and increment the Left cursor, respectively, and to push and pop from the "right" side of the deque we would increment and decrement the Right cursor, respectively. We used a variation of PTLQueue for message passing in our recent OPODIS 2013 paper. ul { list-style:none; padding-left:0; padding:0; margin:0; margin-left:0; } ul#myTagID { padding: 0px; margin: 0px; list-style:none; margin-left:0;} -- -- There's quite a bit of related literature in this area. I'll call out a few relevant references: Wilson's NYU Courant Institute UltraComputer dissertation from 1988 is classic and the canonical starting point : Operating System Data Structures for Shared-Memory MIMD Machines with Fetch-and-Add. Regarding provenance and priority, I think PTLQueue or queues effectively equivalent to PTLQueue have been independently rediscovered a number of times. See CB-Queue and BNPBV, below, for instance. But Wilson's dissertation anticipates the basic idea and seems to predate all the others. Gottlieb et al : Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors Orozco et al : CB-Queue in Toward high-throughput algorithms on many-core architectures which appeared in TACO 2012. Meneghin et al : BNPVB family in Performance evaluation of inter-thread communication mechanisms on multicore/multithreaded architecture Dmitry Vyukov : bounded MPMC queue (highly recommended) Alex Otenko : US8607249 (highly related). John Mellor-Crummey : Concurrent queues: Practical fetch-and-phi algorithms. Technical Report 229, Department of Computer Science, University of Rochester Thomasson : FIFO Distributed Bakery Algorithm (very similar to PTLQueue). Scott and Scherer : Dual Data Structures I'll propose an optimization left as an exercise for the reader. Say we wanted to reduce memory usage by eliminating inter-slot padding. Such padding is usually "dark" memory and otherwise unused and wasted. But eliminating the padding leaves us at risk of increased false sharing. Furthermore lets say it was usually the case that the PutCursor and TakeCursor were numerically close to each other. (That's true in some use cases). We might still reduce false sharing by incrementing the cursors by some value other than 1 that is not trivially small and is coprime with the number of slots. Alternatively, we might increment the cursor by one and mask as usual, resulting in a logical index. We then use that logical index value to index into a permutation table, yielding an effective index for use in the slot array. The permutation table would be constructed so that nearby logical indices would map to more distant effective indices. (Open question: what should that permutation look like? Possibly some perversion of a Gray code or De Bruijn sequence might be suitable). As an aside, say we need to busy-wait for some condition as follows : "while C == 0 : Pause". Lets say that C is usually non-zero, so we typically don't wait. But when C happens to be 0 we'll have to spin for some period, possibly brief. We can arrange for the code to be more machine-friendly with respect to the branch predictors by transforming the loop into : "if C == 0 : for { Pause; if C != 0 : break; }". Critically, we want to restructure the loop so there's one branch that controls entry and another that controls loop exit. A concern is that your compiler or JIT might be clever enough to transform this back to "while C == 0 : Pause". You can sometimes avoid this by inserting a call to a some type of very cheap "opaque" method that the compiler can't elide or reorder. On Solaris, for instance, you could use :"if C == 0 : { gethrtime(); for { Pause; if C != 0 : break; }}". It's worth noting the obvious duality between locks and queues. If you have strict FIFO lock implementation with local spinning and succession by direct handoff such as MCS or CLH,then you can usually transform that lock into a queue. Hidden commentary and annotations - invisible : * And of course there's a well-known duality between queues and locks, but I'll leave that topic for another blog post. * Compare and contrast : PTLQ vs PTL and MultiLane * Equivalent : Turn; seq; sequence; pos; position; ticket * Put = Lock; Deposit Take = identify and reserve slot; wait; extract & clear; unlock * conceptualize : Distinct PutLock and TakeLock implemented as ticket lock or PTL Distinct arrival cursors but share per-slot "Turn" variable provides exclusive role-based access to slot's mailbox field put() acquires exclusive access to a slot for purposes of "deposit" assigns slot round-robin and then acquires deposit access rights/perms to that slot take() acquires exclusive access to slot for purposes of "withdrawal" assigns slot round-robin and then acquires withdrawal access rights/perms to that slot At any given time, only one thread can have withdrawal access to a slot at any given time, only one thread can have deposit access to a slot Permissible for T1 to have deposit access and T2 to simultaneously have withdrawal access * round-robin for the purposes of; role-based; access mode; access role mailslot; mailbox; allocate/assign/identify slot rights; permission; license; access permission; * PTL/Ticket hybrid Asymmetric usage ; owner oblivious lock-unlock pairing K-exclusion add Grant cursor pass message m from lock to unlock via Slots[] array Cursor performs 2 functions : + PTL ticket + Assigns request to slot in round-robin fashion Deconstruct protocol : explication put() : allocate slot in round-robin fashion acquire PTL for "put" access store message into slot associated with PTL index take() : Acquire PTL for "take" access // doorway step seq = fetchAdd (&Grant, 1) s = &Slots[seq & Mask] // waiting phase while s-Turn != seq : pause Extract : wait for s-mailbox to be full v = s-mailbox s-mailbox = null Release PTL for both "put" and "take" access s-Turn = seq + Mask + 1 * Slot round-robin assignment and lock "doorway" protocol leverage the same cursor and FetchAdd operation on that cursor FetchAdd (&Cursor,1) + round-robin slot assignment and dispersal + PTL/ticket lock "doorway" step waiting phase is via "Turn" field in slot * PTLQueue uses 2 cursors -- put and take. Acquire "put" access to slot via PTL-like lock Acquire "take" access to slot via PTL-like lock 2 locks : put and take -- at most one thread can access slot's mailbox Both locks use same "turn" field Like multilane : 2 cursors : put and take slot is simple 1-capacity mailbox instead of queue Borrow per-slot turn/grant from PTL Provides strict FIFO Lock slot : put-vs-put take-vs-take at most one put accesses slot at any one time at most one put accesses take at any one time reduction to 1-vs-1 instead of N-vs-M concurrency Per slot locks for put/take Release put/take by advancing turn * is instrumental in ... * P-V Semaphore vs lock vs K-exclusion * See also : FastQueues-excerpt.java dice-etc/queue-mpmc-bounded-blocking-circular-xadd/ * PTLQueue is the same as PTLQB - identical * Expedient return; ASAP; prompt; immediately * Lamport's Bakery algorithm : doorway step then waiting phase Threads arriving at doorway obtain a unique ticket number Threads enter in ticket order * In the terminology of Reed and Kanodia a ticket lock corresponds to the busy-wait implementation of a semaphore using an eventcount and a sequencer It can also be thought of as an optimization of Lamport's bakery lock was designed for fault-tolerance rather than performance Instead of spinning on the release counter, processors using a bakery lock repeatedly examine the tickets of their peers --

    Read the article

  • Announcing Entity Framework Code-First (CTP 5 release)

    In this article, Scott provides a detailed coverage of Entity Framework Code-First CTP 5 release and the features included with the build. He begins with the steps required to install EF Code First. Scott then examines the usage of EF Code First to create a model layer for the Northwind sample database in a series of steps. Towards the end of the article, Scott examines the usage of UI Validation and few addtional EF Code First Improvements shipped with CTP 5.

    Read the article

  • The Data Scientist

    - by BuckWoody
    A new term - well, perhaps not that new - has come up and I’m actually very excited about it. The term is Data Scientist, and since it’s new, it’s fairly undefined. I’ll explain what I think it means, and why I’m excited about it. In general, I’ve found the term deals at its most basic with analyzing data. Of course, we all do that, and the term itself in that definition is redundant. There is no science that I know of that does not work with analyzing lots of data. But the term seems to refer to more than the common practices of looking at data visually, putting it in a spreadsheet or report, or even using simple coding to examine data sets. The term Data Scientist (as far as I can make out this early in it’s use) is someone who has a strong understanding of data sources, relevance (statistical and otherwise) and processing methods as well as front-end displays of large sets of complicated data. Some - but not all - Business Intelligence professionals have these skills. In other cases, senior developers, database architects or others fill these needs, but in my experience, many lack the strong mathematical skills needed to make these choices properly. I’ve divided the knowledge base for someone that would wear this title into three large segments. It remains to be seen if a given Data Scientist would be responsible for knowing all these areas or would specialize. There are pretty high requirements on the math side, specifically in graduate-degree level statistics, but in my experience a company will only have a few of these folks, so they are expected to know quite a bit in each of these areas. Persistence The first area is finding, cleaning and storing the data. In some cases, no cleaning is done prior to storage - it’s just identified and the cleansing is done in a later step. This area is where the professional would be able to tell if a particular data set should be stored in a Relational Database Management System (RDBMS), across a set of key/value pair storage (NoSQL) or in a file system like HDFS (part of the Hadoop landscape) or other methods. Or do you examine the stream of data without storing it in another system at all? This is an important decision - it’s a foundation choice that deals not only with a lot of expense of purchasing systems or even using Cloud Computing (PaaS, SaaS or IaaS) to source it, but also the skillsets and other resources needed to care and feed the system for a long time. The Data Scientist sets something into motion that will probably outlast his or her career at a company or organization. Often these choices are made by senior developers, database administrators or architects in a company. But sometimes each of these has a certain bias towards making a decision one way or another. The Data Scientist would examine these choices in light of the data itself, starting perhaps even before the business requirements are created. The business may not even be aware of all the strategic and tactical data sources that they have access to. Processing Once the decision is made to store the data, the next set of decisions are based around how to process the data. An RDBMS scales well to a certain level, and provides a high degree of ACID compliance as well as offering a well-known set-based language to work with this data. In other cases, scale should be spread among multiple nodes (as in the case of Hadoop landscapes or NoSQL offerings) or even across a Cloud provider like Windows Azure Table Storage. In fact, in many cases - most of the ones I’m dealing with lately - the data should be split among multiple types of processing environments. This is a newer idea. Many data professionals simply pick a methodology (RDBMS with Star Schemas, NoSQL, etc.) and put all data there, regardless of its shape, processing needs and so on. A Data Scientist is familiar not only with the various processing methods, but how they work, so that they can choose the right one for a given need. This is a huge time commitment, hence the need for a dedicated title like this one. Presentation This is where the need for a Data Scientist is most often already being filled, sometimes with more or less success. The latest Business Intelligence systems are quite good at allowing you to create amazing graphics - but it’s the data behind the graphics that are the most important component of truly effective displays. This is where the mathematics requirement of the Data Scientist title is the most unforgiving. In fact, someone without a good foundation in statistics is not a good candidate for creating reports. Even a basic level of statistics can be dangerous. Anyone who works in analyzing data will tell you that there are multiple errors possible when data just seems right - and basic statistics bears out that you’re on the right track - that are only solvable when you understanding why the statistical formula works the way it does. And there are lots of ways of presenting data. Sometimes all you need is a “yes” or “no” answer that can only come after heavy analysis work. In that case, a simple e-mail might be all the reporting you need. In others, complex relationships and multiple components require a deep understanding of the various graphical methods of presenting data. Knowing which kind of chart, color, graphic or shape conveys a particular datum best is essential knowledge for the Data Scientist. Why I’m excited I love this area of study. I like math, stats, and computing technologies, but it goes beyond that. I love what data can do - how it can help an organization. I’ve been fortunate enough in my professional career these past two decades to work with lots of folks who perform this role at companies from aerospace to medical firms, from manufacturing to retail. Interestingly, the size of the company really isn’t germane here. I worked with one very small bio-tech (cryogenics) company that worked deeply with analysis of complex interrelated data. So  watch this space. No, I’m not leaving Azure or distributed computing or Microsoft. In fact, I think I’m perfectly situated to investigate this role further. We have a huge set of tools, from RDBMS to Hadoop to allow me to explore. And I’m happy to share what I learn along the way.

    Read the article

  • SQLBeat Episode 11 – Ted the Fred Krueger Halloween SQL

    - by SQLBeat
    In this episode of the SQLBeat Podcast I speak conversationally (Ok I will just say I converse) with Ted Krueger about Elm Street, where he works as a DBA who stores nightmares in SQL Server database tables. The joke about it being BLOB storage is only one of several that may scare you away from this Halloween Special. If you like listening to two SQL guys talking about the bands they used to be in, rainbow trout and video games, come on in. Bwaaaa Haaaa Haaa…..Ok I will stop. Download the MP3

    Read the article

  • Daily tech links for .net and related technologies - Apr 1-3, 2010

    - by SanjeevAgarwal
    Daily tech links for .net and related technologies - Apr 1-3, 2010 Web Development Cleaner HTML Markup with ASP.NET 4 Web Forms - Client IDs - ScottGu Using jQuery and OData to Insert a Database Record - Stephen Walter Apple vs. Microsoft – A Website Usability Study Mastering ASP.NET MVC 2.0: Preview - TekPub Web Design UX Lessons Learned From Offline Experiences - Jon Phillips 5 Steps Toward jQuery Mastery - Dave Ward 20 jQuery Cheatsheets, Docs and References for Every Occasion - Paul Andrew 11...(read more)

    Read the article

  • Best partition Scheme for Ubuntu Server

    - by K.K Patel
    I am going to deploy Ubuntu server having Following servers on it Bind server, dhcp server, LAMP Server, Openssh Server, Ldap server, Monodb database, FTP server,mail server, Samba server, NFS server , in future I want to set Openstack for PAAS. Currently I have Raid 5 with 10TB. How should I make my Partition Scheme So never get problem in future and easily expand Storage size. Suggest me such a partition Scheme with giving specific percentage of Storage to partitions like /, /boot, /var, /etc. Thanks In advance

    Read the article

  • JavaOne Rock Star – Adam Bien

    - by Janice J. Heiss
    Among the most celebrated developers in recent years, especially in the domain of Java EE and JavaFX, is consultant Adam Bien, who, in addition to being a JavaOne Rock Star for Java EE sessions given in 2009 and 2011, is a Java Champion, the winner of Oracle Magazine’s 2011 Top Java Developer of the Year Award, and recently won a 2012 JAX Innovation Award as a top Java Ambassador. Bien will be presenting the following sessions: TUT3907 - Java EE 6/7: The Lean Parts CON3906 - Stress-Testing Java EE 6 Applications Without Stress CON3908 - Building Serious JavaFX 2 Applications CON3896 - Interactive Onstage Java EE Overengineering I spoke with Bien to get his take on Java today. He expressed excitement that the smallest companies and startups are showing increasing interest in Java EE. “This is a very good sign,” said Bien. “Only a few years ago J2EE was mostly used by larger companies -- now it becomes interesting even for one-person shows. Enterprise Java events are also extremely popular. On the Java SE side, I'm really excited about Project Nashorn.” Nashorn is an upcoming JavaScript engine, developed fully in Java by Oracle, and based on the Da Vinci Machine (JSR 292) which is expected to be available for Java 8.    Bien expressed concern about a common misconception regarding Java's mediocre productivity. “The problem is not Java,” explained Bien, “but rather systems built with ancient patterns and approaches. Sometimes it really is ‘Cargo Cult Programming.’ Java SE/EE can be incredibly productive and lean without the unnecessary and hard-to-maintain bloat. The real problems are ‘Ivory Towers’ and not Java’s lack of productivity.” Bien remarked that if there is one thing he wanted Java developers to understand it is that, "Premature optimization is the root of all evil. Or at least of some evil. Modern JVMs and application servers are hard to optimize upfront. It is far easier to write simple code and measure the results continuously. Identify the hotspots first, then optimize.”   He advised Java EE developers to, “Rethink everything you know about Enterprise Java. Before you implement anything, ask the question: ‘Why?’ If there is no clear answer -- just don't do it. Most well known best practices are outdated. Focus your efforts on the domain problem and not the technology.” Looking ahead, Bien remarked, “I would like to see open source application servers running directly on a hypervisor. Packaging the whole runtime in a single file would significantly simplify the deployment and operations.” Check out a recent Java Magazine interview with Bien about his Java EE 6 stress monitoring tool here.

    Read the article

  • MySQL and Hadoop Integration - Unlocking New Insight

    - by Mat Keep
    “Big Data” offers the potential for organizations to revolutionize their operations. With the volume of business data doubling every 1.2 years, analysts and business users are discovering very real benefits when integrating and analyzing data from multiple sources, enabling deeper insight into their customers, partners, and business processes. As the world’s most popular open source database, and the most deployed database in the web and cloud, MySQL is a key component of many big data platforms, with Hadoop vendors estimating 80% of deployments are integrated with MySQL. The new Guide to MySQL and Hadoop presents the tools enabling integration between the two data platforms, supporting the data lifecycle from acquisition and organisation to analysis and visualisation / decision, as shown in the figure below The Guide details each of these stages and the technologies supporting them: Acquire: Through new NoSQL APIs, MySQL is able to ingest high volume, high velocity data, without sacrificing ACID guarantees, thereby ensuring data quality. Real-time analytics can also be run against newly acquired data, enabling immediate business insight, before data is loaded into Hadoop. In addition, sensitive data can be pre-processed, for example healthcare or financial services records can be anonymized, before transfer to Hadoop. Organize: Data is transferred from MySQL tables to Hadoop using Apache Sqoop. With the MySQL Binlog (Binary Log) API, users can also invoke real-time change data capture processes to stream updates to HDFS. Analyze: Multi-structured data ingested from multiple sources is consolidated and processed within the Hadoop platform. Decide: The results of the analysis are loaded back to MySQL via Apache Sqoop where they inform real-time operational processes or provide source data for BI analytics tools. So how are companies taking advantage of this today? As an example, on-line retailers can use big data from their web properties to better understand site visitors’ activities, such as paths through the site, pages viewed, and comments posted. This knowledge can be combined with user profiles and purchasing history to gain a better understanding of customers, and the delivery of highly targeted offers. Of course, it is not just in the web that big data can make a difference. Every business activity can benefit, with other common use cases including: - Sentiment analysis; - Marketing campaign analysis; - Customer churn modeling; - Fraud detection; - Research and Development; - Risk Modeling; - And more. As the guide discusses, Big Data is promising a significant transformation of the way organizations leverage data to run their businesses. MySQL can be seamlessly integrated within a Big Data lifecycle, enabling the unification of multi-structured data into common data platforms, taking advantage of all new data sources and yielding more insight than was ever previously imaginable. Download the guide to MySQL and Hadoop integration to learn more. I'd also be interested in hearing about how you are integrating MySQL with Hadoop today, and your requirements for the future, so please use the comments on this blog to share your insights.

    Read the article

  • Good practices for large scale development/delivery of software

    - by centic
    What practices do you apply when working with large teams on multiple versions of a software or multiple competing projects? What are best practices that can be used to still get the right things done first? Is there information available how big IT companies do development and management of some of their large projects, e.g. things like Oracle Database, WebSphere Application Server, Microsoft Windows, ....?

    Read the article

  • SQLAuthority News – Presented Soft Skill Session on Presentation Skills at SQL Bangalore on May 3, 2014

    - by Pinal Dave
    I have presented on various database technologies for almost 10 years now. SQL, Database and NoSQL have been part of my life. Earlier this month, I had the opportunity to present on the topic Performing an Effective Presentation. I must say it was blast to prepare as well as present this session. This event was part of the SQL Bangalore community. If you are in Bangalore, you must be part of this group. SQL Bangalore is a wonderful community and we always have a great response when we present on technology. It is SQL User Group and we discuss everything SQL there. This month we had SQL Server 2014 theme and we had a community launch of SQL Server. We have the best of the best speakers presenting on SQL Server 2014 technology. The event had amazing speakers and each of them did justice to the subject. You can read about this over here. In this session I told a story from my life. I talked about who inspired me and how I learned to speak in public. I told stories about two legends  who have inspired me. There is no video recording of this session. If you want to get resources from this session, please sign up my newsletter at http://bit.ly/sqllearn. Well, I had a great time at this event. We had over 250 people showed up at this event and had a grand  time together. I personally enjoyed a session of Amit Benerjee, Balmukund Lakhani and Vinod Kumar. Ken and Surabh also entertained the audience. Overall, this was a grand event and if you were in Bangalore and did not make it to this event. You did miss out on a few things. Here are a few photos of this event. SQL Bangalore UG Nupur, Chandra, Shaivi, Balmukund, Amit, Vinod [captions This] SQL Bangalore UG Audience Pinal Dave presenting at SQL UG in Bangalore Here are few of the slides from this presentation: Reference: Pinal Dave (http://blog.sqlauthority.com)Filed under: PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, SQLAuthority Author Visit, SQLAuthority News, T SQL

    Read the article

  • Ubuntu server spontaneous reboot

    - by user1941407
    I have got two ubuntu 12.04 servers(xeon e3). Sometimes(several days) each server spontaneously reboots. HDDs and other hardware are ok. Which logfile can help find a reason of the problem? UPDATED. hardware: xeon e3 processor, intel server motherboard, 32gb ddr3 ecc, mdadm mirror hdd raid for system, mdadm ssd raid for database(postgres). Both servers have similar (not identical) components. Smart is OK. It seems that the problem is in the software. Python process and database are running on this servers. Syslog (time of reboot): Aug 23 13:42:23 xeon hddtemp[1411]: /dev/sdc: WDC WD15NPVT-00Z2TT0: 34 C Aug 23 13:42:23 xeon hddtemp[1411]: /dev/sdd: WDC WD15NPVT-00Z2TT0: 34 C Aug 23 13:43:24 xeon hddtemp[1411]: /dev/sdc: WDC WD15NPVT-00Z2TT0: 34 C Aug 23 13:43:24 xeon hddtemp[1411]: /dev/sdd: WDC WD15NPVT-00Z2TT0: 34 C Aug 23 13:44:14 xeon sensord: Chip: acpitz-virtual-0 Aug 23 13:44:14 xeon sensord: Adapter: Virtual device Aug 23 13:44:14 xeon sensord: temp1: 27.8 C Aug 23 13:44:14 xeon sensord: temp2: 29.8 C Aug 23 13:44:14 xeon sensord: Chip: coretemp-isa-0000 Aug 23 13:44:14 xeon sensord: Adapter: ISA adapter Aug 23 13:44:14 xeon sensord: Physical id 0: 37.0 C Aug 23 13:44:14 xeon sensord: Core 0: 37.0 C Aug 23 13:44:14 xeon sensord: Core 1: 37.0 C Aug 23 13:44:14 xeon sensord: Core 2: 37.0 C Aug 23 13:44:14 xeon sensord: Core 3: 37.0 C Aug 23 13:44:24 xeon hddtemp[1411]: /dev/sdc: WDC WD15NPVT-00Z2TT0: 34 C Aug 23 13:44:24 xeon hddtemp[1411]: /dev/sdd: WDC WD15NPVT-00Z2TT0: 34 C Aug 23 13:47:01 xeon kernel: imklog 5.8.6, log source = /proc/kmsg started. Aug 23 13:47:01 xeon rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="582" x-info="http://www.rsyslog.com"] start Aug 23 13:47:01 xeon rsyslogd: rsyslogd's groupid changed to 103 Aug 23 13:47:01 xeon rsyslogd: rsyslogd's userid changed to 101 Aug 23 13:47:00 xeon rsyslogd-2039: Could not open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ] Aug 23 13:47:01 xeon kernel: [ 0.000000] Initializing cgroup subsys cpuset Aug 23 13:47:01 xeon kernel: [ 0.000000] Initializing cgroup subsys cpu Aug 23 13:47:01 xeon kernel: [ 0.000000] Initializing cgroup subsys cpuacct Aug 23 13:47:01 xeon kernel: [ 0.000000] Linux version 3.11.0-26-generic (buildd@komainu) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #45~precise1-Ubuntu SMP Tue Jul 15 04:02:35 UTC 2014 (Ubuntu 3.11.0-26.45~precise1-generic 3.11.10.12) Aug 23 13:47:01 xeon kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.11.0-26-generic root=UUID=0daa7f53-6c74-47d2-873e-ebd339cd39b0 ro splash quiet vt.handoff=7 Aug 23 13:47:01 xeon kernel: [ 0.000000] KERNEL supported cpus: Aug 23 13:47:01 xeon kernel: [ 0.000000] Intel GenuineIntel Aug 23 13:47:01 xeon kernel: [ 0.000000] AMD AuthenticAMD Aug 23 13:47:01 xeon kernel: [ 0.000000] Centaur CentaurHauls Aug 23 13:47:01 xeon kernel: [ 0.000000] e820: BIOS-provided physical RAM map: Aug 23 13:47:01 xeon kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009bbff] usable Aug 23 13:47:01 xeon kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009bc00-0x000000000009ffff] reserved Dmseg - nothing strange.

    Read the article

  • New videos: Getting started with embedded Java and more

    - by terrencebarr
    OTN just published a set of six videos related to embedded Java: Java at ARM TechCon Java SE Embedded Development Made Easy, Part 1 Java SE Embedded Development Made Easy, Part 2 Mobile Database Synchronization – Healthcare Demonstration Tomcat Micro Cluster Java Embedded Partnerships Good stuff. Enjoy! Cheers, – Terrence Filed under: Mobile & Embedded Tagged: embedded, Java Embedded, Java SE Embedded, video

    Read the article

  • Can you Trust Search?

    - by David Dorf
    An awful lot of referrals to e-commerce sites come from web searches. Retailers rely on search engine optimization (SEO) to correctly position their website so they can be found. Search on "blue jeans" and the results are determined by a semi-secret algorithm -- in my case Levi.com, Banana Republic, and ShopStyle show up. The NY Times recently uncovered a situation where JCPenney, via third-parties hired to help with SEO, was caught manipulating search results so they were erroneously higher in page rankings. No doubt this helped drive additional sales during this part Christmas. The article, The Dirty Little Secrets of Search, is well worth reading. My friend Ron Kleinman started an interesting discussion at the ARTS Linkedin forum. He posed the question: The ability of a single company to "punish" any retailer (by significantly impacting their on-line sales volume) who does not play by their rules ... is this a good thing or a bad thing? Clearly JCP was in the wrong and needed to be punished, but should that decision lie with Google alone? Don't get me wrong -- I'm certainly not advocating we create a Department of Search where bureaucrats think of ways to spend money, but Google wields an awful lot of power in this situation, and it makes me feel uncomfortable. Now Google is incorporating more social aspects into their search results. For example, when Google knows its me (i.e. I'm logged in when using Google) search results will be influenced by my Twitter network. In an effort to increase relevance, the blogs and re-tweeted articles from my network will be higher in the search results than they otherwise would be. So in the case of product searches, things discussed in my network will rise to the top. Continuing my blue jean example, if someone in my network had been discussing Macy's perhaps they would now be higher in the result set. soapbox: I already have lots of spammers posting bogus comments to this blog in an effort to create additional links to their sites and thus increase their search ranking. Should I expect a similar situation in Twitter and eventually Facebook? Now retailers need to expand their SEO efforts to incorporate social media as well, but do us all a favor and please don't cheat.

    Read the article

  • How to implement an email unsubscribe system for a site with many kinds of emails?

    - by Mike Liu
    I'm working on a website that features many different types of emails. Users have accounts, and when logged in they have access to a setting page that they can use to customize what types of emails they receive. However, I'd like to also give users an easy way to unsubscribe directly in the emails they receive. I've looked into list unsubscribe headers as well as creating some type of one click link that would unsubscribe a user from that type of email without requiring login or further action. The later would probably require me to break convention and make changes to the database in response to a GET on the link. However, am I incorrect in thinking that either of these would require me to generate and permanently store a unique identifier in my database for every email I ever send, really complicating email delivery? Without that, I'm not sure how I would be able to uniquely identify a user and a type of email in order to change their email preferences, and this identifier would need to be stored forever as a user could have an email sitting in their inbox for a long time before they decide to act on it. Alternatively, I was considering having a no-login page for managing email preferences. In contrast to above where I would need one of these identifiers for each email, this would only need one identifier per user, with no generation or other action required on sending an email. All of these raise security issues, and they could potentially be used by people to tamper with others' email preferences. This could be mitigated somewhat by ensuring that the identifier is really difficult to guess. For the once per user identifier approach, I was considering generating the identifier by passing a user's ID through some type of encryption algorithm, is this a sound approach? For the per-email identifiers, perhaps I could use a user's ID appended to the time. However, even this would not eliminate the problem entirely, as this would really just be security through obscurity, and anyone with the URL could tamper, and in the end the main defense would have to be that most people aren't so bored as to tamper with other people's email preferences. Are there any other alternatives I've missed, or issues or solutions with these that anyone can provide insight on? What are best practices in this area?

    Read the article

  • Extracting Data from a Source System to History Tables

    - by Derek D.
    This is a topic I find very little information written about, however it is very important that the method for extracting data be done in a way that does not hinder performance of the source system.  In this example, the goal is to extract data from a source system, into another database (or server) all [...]

    Read the article

  • Basics of XML and SQL Server, Part 4: Create an XML invoice with SSIS

    This article demonstrates how to build an SSIS package that generates an XML invoice document from data stored in SQL Server and saves it to an XML file. What are your servers really trying to tell you? Find out with new SQL Monitor 3.0, an easy-to-use tool built for no-nonsense database professionals.For effortless insights into SQL Server, download a free trial today.

    Read the article

< Previous Page | 708 709 710 711 712 713 714 715 716 717 718 719  | Next Page >