Search Results

Search found 65101 results on 2605 pages for 'big data'.

Page 11/2605 | < Previous Page | 7 8 9 10 11 12 13 14 15 16 17 18  | Next Page >

  • Data Virtualization: Federated and Hybrid

    - by Krishnamoorthy
    Data becomes useful when it can be leveraged at the right time. Not only enterprises application stores operate on large volume, velocity and variety of data. Mobile and social computing are in the need of operating in foresaid data. Replicating and transferring large swaths of data is one challenge faced in the field of data integration. However, smaller chunks of data aggregated from a variety of sources presents and even more interesting challenge in the industry. Over the past few decades, technology trends focused on best user experience, operating systems, high performance computing, high performance web sites, analysis of warehouse data, service oriented architecture, social computing, cloud computing, and big data. Operating on the ‘dark data’ becomes mandatory in the future technology trend, although, no solution can make dark data useful data in a single day. Useful data can be quantified by the facts of contextual, personalized and on time delivery. In most cases, data from a single source may not be complete the picture. Data has to be combined and computed from various sources, where data may be captured as hybrid data, meaning the combination of structured and unstructured data. Since related data is often found across disparate sources, effectively integrating these sources determines how useful this data ultimately becomes. Technology trends in 2013 are expected to focus on big data and private cloud. Consumers are not merely interested in where data is located or how data is retrieved and computed. Consumers are interested in how quick and how the data can be leveraged. In many cases, data virtualization is the right solution, and is expected to play a foundational role for SOA, Cloud integration, and Big Data. The Oracle Data Integration portfolio includes a data virtualization product called ODSI (Oracle Data Service Integrator). Unlike other data virtualization solutions, ODSI can perform both read and write operations on federated/hybrid data (RDBMS, Webservices,  delimited file and XML). The ODSI Engine is built on XQuery, hence ODSI user can perform computations on data either using XQuery or SQL. Built in data and query caching features, which reduces latency in repetitive calls. Rightly positioning ODSI, can results in a highly scalable model, reducing spend on additional hardware infrastructure.

    Read the article

  • Big O and Little o

    - by hyperdude
    If algorithm A has complexity O(n) and algorithm B has complexity o(n^2), what, if anything, can we say about the relationship between A and B? Note: the complexity of A is expressed using big-Oh, and the complexity of B is expressed using little-Oh.

    Read the article

  • Big Data Sessions at Openworld 2012

    - by Jean-Pierre Dijcks
    If you are coming to San Francisco, and you are interested in all the aspects to big data, this Focus On Big Data is a must have document.  Some (other) highlights: A performance demo of a full rack Big Data Appliance in the engineered systems showcase A set of handson labs on how to go from a NoSQL DB to an effective analytics play on big data Much, much more See you all in a few weeks in SF!

    Read the article

  • BIG DATA eBook - Now Available

    - by Javier Puerta
    The Big Data interactive e-book “Meeting the Challenge of Big Data: Part One” has just been released. It’s your “one-stop shop” for info about Big Data and the Oracle offering around it.The new e-book (available on your computer or iPad) is packed with multi-media resources to educate Oracle staff, customers, prospects and partners on the value of Big Data. It features videos, tutorials, podcasts, reports, white papers, datasheets, blogs, web links, a 3-D demo, and more. Go and get it here!

    Read the article

  • Python what's the data structure for triple data

    - by Paul
    I've got a set of data that has three attributes, say A, B, and C, where A is kind of the index (i.e., A is used to look up the other two attributes.) What would be the best data structure for such data? I used two dictionaries, with A as the index of each. However, there's key errors when the query to the data doesn't match any instance of A.

    Read the article

  • Big Oh Notation - formal definition.

    - by aloh
    I'm reading a textbook right now for my Java III class. We're reading about Big-Oh and I'm a little confused by its formal definition. Formal Definition: "A function f(n) is of order at most g(n) - that is, f(n) = O(g(n)) - if a positive real number c and positive integer N exist such that f(n) <= c g(n) for all n = N. That is, c g(n) is an upper bound on f(n) when n is sufficiently large." Ok, that makes sense. But hold on, keep reading...the book gave me this example: "In segment 9.14, we said that an algorithm that uses 5n + 3 operations is O(n). We now can show that 5n + 3 = O(n) by using the formal definition of Big Oh. When n = 3, 5n + 3 <= 5n + n = 6n. Thus, if we let f(n) = 5n + 3, g(n) = n, c = 6, N = 3, we have shown that f(n) <= 6 g(n) for n = 3, or 5n + 3 = O(n). That is, if an algorithm requires time directly proportional to 5n + 3, it is O(n)." Ok, this kind of makes sense to me. They're saying that if n = 3 or greater, 5n + 3 takes less time than if n was less than 3 - thus 5n + n = 6n - right? Makes sense, since if n was 2, 5n + 3 = 13 while 6n = 12 but when n is 3 or greater 5n + 3 will always be less than or equal to 6n. Here's where I get confused. They give me another example: Example 2: "Let's show that 4n^2 + 50n - 10 = O(n^2). It is easy to see that: 4n^2 + 50n - 10 <= 4n^2 + 50n for any n. Since 50n <= 50n^2 for n = 50, 4n^2 + 50n - 10 <= 4n^2 + 50n^2 = 54n^2 for n = 50. Thus, with c = 54 and N = 50, we have shown that 4n^2 + 50n - 10 = O(n^2)." This statement doesn't make sense: 50n <= 50n^2 for n = 50. Isn't any n going to make the 50n less than 50n^2? Not just greater than or equal to 50? Why did they even mention that 50n <= 50n^2? What does that have to do with the problem? Also, 4n^2 + 50n - 10 <= 4n^2 + 50n^2 = 54n^2 for n = 50 is going to be true no matter what n is. And how in the world does picking numbers show that f(n) = O(g(n))? Please help me understand! :(

    Read the article

  • Big O Complexity of a method

    - by timeNomad
    I have this method: public static int what(String str, char start, char end) { int count=0; for(int i=0;i<str.length(); i++) { if(str.charAt(i) == start) { for(int j=i+1;j<str.length(); j++) { if(str.charAt(j) == end) count++; } } } return count; } What I need to find is: 1) What is it doing? Answer: counting the total number of end occurrences after EACH (or is it? Not specified in the assignment, point 3 depends on this) start. 2) What is its complexity? Answer: the first loops iterates over the string completely, so it's at least O(n), the second loop executes only if start char is found and even then partially (index at which start was found + 1). Although, big O is all about worst case no? So in the worst case, start is the 1st char & the inner iteration iterates over the string n-1 times, the -1 is a constant so it's n. But, the inner loop won't be executed every outer iteration pass, statistically, but since big O is about worst case, is it correct to say the complexity of it is O(n^2)? Ignoring any constants and the fact that in 99.99% of times the inner loop won't execute every outer loop pass. 3) Rewrite it so that complexity is lower. What I'm not sure of is whether start occurs at most once or more, if once at most, then method can be rewritten using one loop (having a flag indicating whether start has been encountered and from there on incrementing count at each end occurrence), yielding a complexity of O(n). In case though, that start can appear multiple times, which most likely it is, because assignment is of a Java course and I don't think they would make such ambiguity. Solving, in this case, is not possible using one loop... WAIT! Yes it is..! Just have a variable, say, inc to be incremented each time start is encountered & used to increment count each time end is encountered after the 1st start was found: inc = 0, count = 0 if (current char == start) inc++ if (inc > 0 && current char == end) count += inc This would also yield a complexity of O(n)? Because there is only 1 loop. Yes I realize I wrote a lot hehe, but what I also realized is that I understand a lot better by forming my thoughts into words...

    Read the article

  • Implementing a generic repository for WCF data services

    - by cibrax
    The repository implementation I am going to discuss here is not exactly what someone would call repository in terms of DDD, but it is an abstraction layer that becomes handy at the moment of unit testing the code around this repository. In other words, you can easily create a mock to replace the real repository implementation. The WCF Data Services update for .NET 3.5 introduced a nice feature to support two way data bindings, which is very helpful for developing WPF or Silverlight based application but also for implementing the repository I am going to talk about. As part of this feature, the WCF Data Services Client library introduced a new collection DataServiceCollection<T> that implements INotifyPropertyChanged to notify the data context (DataServiceContext) about any change in the association links. This means that it is not longer necessary to manually set or remove the links in the data context when an item is added or removed from a collection. Before having this new collection, you basically used the following code to add a new item to a collection. Order order = new Order {   Name = "Foo" }; OrderItem item = new OrderItem {   Name = "bar",   UnitPrice = 10,   Qty = 1 }; var context = new OrderContext(); context.AddToOrders(order); context.AddToOrderItems(item); context.SetLink(item, "Order", order); context.SaveChanges(); Now, thanks to this new collection, everything is much simpler and similar to what you have in other ORMs like Entity Framework or L2S. Order order = new Order {   Name = "Foo" }; OrderItem item = new OrderItem {   Name = "bar",   UnitPrice = 10,   Qty = 1 }; order.Items.Add(item); var context = new OrderContext(); context.AddToOrders(order); context.SaveChanges(); In order to use this new feature, you first need to enable V2 in the data service, and then use some specific arguments in the datasvcutil tool (You can find more information about this new feature and how to use it in this post). DataSvcUtil /uri:"http://localhost:3655/MyDataService.svc/" /out:Reference.cs /dataservicecollection /version:2.0 Once you use those two arguments, the generated proxy classes will use DataServiceCollection<T> rather than a simple ObjectCollection<T>, which was the default collection in V1. There are some aspects that you need to know to use this feature correctly. 1. All the entities retrieved directly from the data context with a query track the changes and report those to the data context automatically. 2. A entity created with “new” does not track any change in the properties or associations. In order to enable change tracking in this entity, you need to do the following trick. public Order CreateOrder() {   var collection = new DataServiceCollection<Order>(this.context);   var order = new Order();   collection.Add(order);   return order; } You basically need to create a collection, and add the entity to that collection with the “Add” method to enable change tracking on that entity. 3. If you need to attach an existing entity (For example, if you created the entity with the “new” operator rather than retrieving it from the data context with a query) to a data context for tracking changes, you can use the “Load” method in the DataServiceCollection. var order = new Order {   Id = 1 }; var collection = new DataServiceCollection<Order>(this.context); collection.Load(order); In this case, the order with Id = 1 must exist on the data source exposed by the Data service. Otherwise, you will get an error because the entity did not exist. These cool extensions methods discussed by Stuart Leeks in this post to replace all the magic strings in the “Expand” operation with Expression Trees represent another feature I am going to use to implement this generic repository. Thanks to these extension methods, you could replace the following query with magic strings by a piece of code that only uses expressions. Magic strings, var customers = dataContext.Customers .Expand("Orders")         .Expand("Orders/Items") Expressions, var customers = dataContext.Customers .Expand(c => c.Orders.SubExpand(o => o.Items)) That query basically returns all the customers with their orders and order items. Ok, now that we have the automatic change tracking support and the expression support for explicitly loading entity associations, we are ready to create the repository. The interface for this repository looks like this,public interface IRepository { T Create<T>() where T : new(); void Update<T>(T entity); void Delete<T>(T entity); IQueryable<T> RetrieveAll<T>(params Expression<Func<T, object>>[] eagerProperties); IQueryable<T> Retrieve<T>(Expression<Func<T, bool>> predicate, params Expression<Func<T, object>>[] eagerProperties); void Attach<T>(T entity); void SaveChanges(); } The Retrieve and RetrieveAll methods are used to execute queries against the data service context. While both methods receive an array of expressions to load associations explicitly, only the Retrieve method receives a predicate representing the “where” clause. The following code represents the final implementation of this repository.public class DataServiceRepository: IRepository { ResourceRepositoryContext context; public DataServiceRepository() : this (new DataServiceContext()) { } public DataServiceRepository(DataServiceContext context) { this.context = context; } private static string ResolveEntitySet(Type type) { var entitySetAttribute = (EntitySetAttribute)type.GetCustomAttributes(typeof(EntitySetAttribute), true).FirstOrDefault(); if (entitySetAttribute != null) return entitySetAttribute.EntitySet; return null; } public T Create<T>() where T : new() { var collection = new DataServiceCollection<T>(this.context); var entity = new T(); collection.Add(entity); return entity; } public void Update<T>(T entity) { this.context.UpdateObject(entity); } public void Delete<T>(T entity) { this.context.DeleteObject(entity); } public void Attach<T>(T entity) { var collection = new DataServiceCollection<T>(this.context); collection.Load(entity); } public IQueryable<T> Retrieve<T>(Expression<Func<T, bool>> predicate, params Expression<Func<T, object>>[] eagerProperties) { var entitySet = ResolveEntitySet(typeof(T)); var query = context.CreateQuery<T>(entitySet); foreach (var e in eagerProperties) { query = query.Expand(e); } return query.Where(predicate); } public IQueryable<T> RetrieveAll<T>(params Expression<Func<T, object>>[] eagerProperties) { var entitySet = ResolveEntitySet(typeof(T)); var query = context.CreateQuery<T>(entitySet); foreach (var e in eagerProperties) { query = query.Expand(e); } return query; } public void SaveChanges() { this.context.SaveChanges(SaveChangesOptions.Batch); } } For instance, you can use the following code to retrieve customers with First name equal to “John”, and all their orders in a single call. repository.Retrieve<Customer>(    c => c.FirstName == “John”, //Where    c => c.Orders.SubExpand(o => o.Items)); In case, you want to have some pre-defined queries that you are going to use across several places, you can put them in an specific class. public static class CustomerQueries {   public static Expression<Func<Customer, bool>> LastNameEqualsTo(string lastName)   {     return c => c.LastName == lastName;   } } And then, use it with the repository. repository.Retrieve<Customer>(    CustomerQueries.LastNameEqualsTo("foo"),    c => c.Orders.SubExpand(o => o.Items));

    Read the article

  • data structure for counting frequencies in a database table-like format

    - by user373312
    i was wondering if there is a data structure optimized to count frequencies against data that is stored in a database table-like format. for example, the data comes in a (comma) delimited format below. col1, col2, col3 x, a, green x, b, blue ... y, c, green now i simply want to count the frequency of col1=x or col1=x and col2=green. i have been storing the data in a database table, but in my profiling and from empirical observation, database connection is the bottle-neck. i have tried using in-memory database solutions too, and that works quite well; the only problem is memory requirements and quirky init/destroy calls. also, i work mainly with java, but have experience with .net, and was wondering if there was any api to work with "tabular" data in a linq way using java. any help is appreciated.

    Read the article

  • Big Data for Retail

    - by David Dorf
    Right up there with mobile, social, and cloud is the term "big data," which seems to be popping up lots in the press these days.  Companies like Google, Yahoo, and Facebook have popularized a new class of data technologies meant to solve the problem of processing large amounts of data quickly.  I first mentioned this in a posting back in March 2009.  Put simply, big data implies datasets so large they can't normally be processed using a standard transactional database.  The term "noSQL" is often used in this context as well. Actually, using parallel processing within the Oracle database combined with Exadata can achieve impressive results.  Look for more from Oracle at OpenWorld as hinted by Jean-Pierre Dijcks. McKinsey recently released a report on big data in which retail was specifically mentioned as an industry that can benefit from the new technologies.  I won't rehash that report because my friend Rama already did such a good job in his posting, Impact of "Big Data" on Retail. The presentation below does a pretty good job of framing the problem, although it doesn't really get into the available technologies (e.g. Exadata, Hadoop, Cassandra, etc.) and isn't retail specific. Determine the Right Analytic Database: A Survey of New Data Technologies So when a retailer asks me about big data, here's what I say:  Big data refers to a set of technologies for processing large volumes of structured and unstructured data.  Imagine collecting everything uttered by your customers on Facebook and Twitter and combining it with all the data you can find about the products you sell (e.g. reviews, images, demonstration videos), including competitive data.  Assuming you could process all that data, you could then personalize offers to specific customers based on their tastes, ensure prices are competitive, and implement better local assortments.  It's really not that far off.

    Read the article

  • What Works in Data Integration?

    - by dain.hansen
    TDWI just recently put out this paper on "What Works in Data Integration". I invite you especially to take a look at the section on "Accelerating your Business with Real-time Data Integration" and the DIRECTV case study. The article discusses some of the technology considerations for BI/DW and how data integration plays a role to deliver timely, accessible, and high-quality data. It goes on to outline the three key requirements for how to deliver high performance, low impact, and reliability and how that can translate to faster results. The DIRECTV webinar is something you definitely want to take a look at, you'll hear how DIRECTV successfully transformed their data warehouse investments into a competitive advantage with Oracle GoldenGate.

    Read the article

  • The Ins and Outs of Effective Smart Grid Data Management

    - by caroline.yu
    Oracle Utilities and Accenture recently sponsored a one-hour Web cast entitled, "The Ins and Outs of Effective Smart Grid Data Management." Oracle and Accenture created this Web cast to help utilities better understand the types of data collected over smart grid networks and the issues associated with mapping out a coherent information management strategy. The Web cast also addressed important points that utilities must consider with the imminent flood of data that both present and next-generation smart grid components will generate. The three speakers, including Oracle Utilities' Brad Williams, focused on the key factors associated with taking the millions of data points captured in real time and implementing the strategies, frameworks and technologies that enable utilities to process, store, analyze, visualize, integrate, transport and transform data into the information required to deliver targeted business benefits. The Web cast replay is available here. The Web cast slides are available here.

    Read the article

  • Are there sources of email marketing data available?

    - by Gortron
    Are sources of email marketing data available to the public? I would like to see email marketing data to see what kind of content a business sends out, the frequency of sending, the number of people emailed, especially the resulting open rates and click through rates. Are businesses willing to share data on their previous email marketing campaigns without divulging their contact list? I would like to use this data to create an application to help businesses create better newsletters by using this data as a benchmark, basically sharing what works and what doesn't for each industry.

    Read the article

  • Extending SSIS with custom Data Flow components (Presentation)

    Download the slides and sample code from my Extending SSIS with custom Data Flow components presentation, first presented at the SQLBits II (The SQL) Community Conference. Abstract Get some real-world insights into developing data flow components for SSIS. This starts with an introduction to the data flow pipeline engine, and explains the real differences between adapters and the three sub-types of transformation. Understanding how the different types of component behave and manage data is key to writing components of your own, and probably should but be required knowledge for anyone building packages at all. Using sample code throughout, I will show you how to write components, as well as highlighting best practice and lessons learned. The sample code includes fully working example projects for source, destination and transformation components. Presentation & Samples (358KB) Extending SSIS with custom Data Flow components.zip

    Read the article

  • How to Achieve Real-Time Data Protection and Availabilty....For Real

    - by JoeMeeks
    There is a class of business and mission critical applications where downtime or data loss have substantial negative impact on revenue, customer service, reputation, cost, etc. Because the Oracle Database is used extensively to provide reliable performance and availability for this class of application, it also provides an integrated set of capabilities for real-time data protection and availability. Active Data Guard, depicted in the figure below, is the cornerstone for accomplishing these objectives because it provides the absolute best real-time data protection and availability for the Oracle Database. This is a bold statement, but it is supported by the facts. It isn’t so much that alternative solutions are bad, it’s just that their architectures prevent them from achieving the same levels of data protection, availability, simplicity, and asset utilization provided by Active Data Guard. Let’s explore further. Backups are the most popular method used to protect data and are an essential best practice for every database. Not surprisingly, Oracle Recovery Manager (RMAN) is one of the most commonly used features of the Oracle Database. But comparing Active Data Guard to backups is like comparing apples to motorcycles. Active Data Guard uses a hot (open read-only), synchronized copy of the production database to provide real-time data protection and HA. In contrast, a restore from backup takes time and often has many moving parts - people, processes, software and systems – that can create a level of uncertainty during an outage that critical applications can’t afford. This is why backups play a secondary role for your most critical databases by complementing real-time solutions that can provide both data protection and availability. Before Data Guard, enterprises used storage remote-mirroring for real-time data protection and availability. Remote-mirroring is a sophisticated storage technology promoted as a generic infrastructure solution that makes a simple promise – whatever is written to a primary volume will also be written to the mirrored volume at a remote site. Keeping this promise is also what causes data loss and downtime when the data written to primary volumes is corrupt – the same corruption is faithfully mirrored to the remote volume making both copies unusable. This happens because remote-mirroring is a generic process. It has no  intrinsic knowledge of Oracle data structures to enable advanced protection, nor can it perform independent Oracle validation BEFORE changes are applied to the remote copy. There is also nothing to prevent human error (e.g. a storage admin accidentally deleting critical files) from also impacting the remote mirrored copy. Remote-mirroring tricks users by creating a false impression that there are two separate copies of the Oracle Database. In truth; while remote-mirroring maintains two copies of the data on different volumes, both are part of a single closely coupled system. Not only will remote-mirroring propagate corruptions and administrative errors, but the changes applied to the mirrored volume are a result of the same Oracle code path that applied the change to the source volume. There is no isolation, either from a storage mirroring perspective or from an Oracle software perspective.  Bottom line, storage remote-mirroring lacks both the smarts and isolation level necessary to provide true data protection. Active Data Guard offers much more than storage remote-mirroring when your objective is protecting your enterprise from downtime and data loss. Like remote-mirroring, an Active Data Guard replica is an exact block for block copy of the primary. Unlike remote-mirroring, an Active Data Guard replica is NOT a tightly coupled copy of the source volumes - it is a completely independent Oracle Database. Active Data Guard’s inherent knowledge of Oracle data block and redo structures enables a separate Oracle Database using a different Oracle code path than the primary to use the full complement of Oracle data validation methods before changes are applied to the synchronized copy. These include: physical check sum, logical intra-block checking, lost write validation, and automatic block repair. The figure below illustrates the stark difference between the knowledge that remote-mirroring can discern from an Oracle data block and what Active Data Guard can discern. An Active Data Guard standby also provides a range of additional services enabled by the fact that it is a running Oracle Database - not just a mirrored copy of data files. An Active Data Guard standby database can be open read-only while it is synchronizing with the primary. This enables read-only workloads to be offloaded from the primary system and run on the active standby - boosting performance by utilizing all assets. An Active Data Guard standby can also be used to implement many types of system and database maintenance in rolling fashion. Maintenance and upgrades are first implemented on the standby while production runs unaffected at the primary. After the primary and standby are synchronized and all changes have been validated, the production workload is quickly switched to the standby. The only downtime is the time required for user connections to transfer from one system to the next. These capabilities further expand the expectations of availability offered by a data protection solution beyond what is possible to do using storage remote-mirroring. So don’t be fooled by appearances.  Storage remote-mirroring and Active Data Guard replication may look similar on the surface - but the devil is in the details. Only Active Data Guard has the smarts, the isolation, and the simplicity, to provide the best data protection and availability for the Oracle Database. Stay tuned for future blog posts that dive into the many differences between storage remote-mirroring and Active Data Guard along the dimensions of data protection, data availability, cost, asset utilization and return on investment. For additional information on Active Data Guard, see: Active Data Guard Technical White Paper Active Data Guard vs Storage Remote-Mirroring Active Data Guard Home Page on the Oracle Technology Network

    Read the article

  • How to use OO for data analysis? [closed]

    - by Konsta
    In which ways could object-orientation (OO) make my data analysis more efficient and let me reuse more of my code? The data analysis can be broken up into get data (from db or csv or similar) transform data (filter, group/pivot, ...) display/plot (graph timeseries, create tables, etc.) I mostly use Python and its Pandas and Matplotlib packages for this besides some DB connectivity (SQL). Almost all of my code is a functional/procedural mix. While I have started to create a data object for a certain collection of time series, I wonder if there are OO design patterns/approaches for other parts of the process that might increase efficiency?

    Read the article

  • Markup format or script for data files?

    - by Aaron
    The game I'm designing will be mainly written in a high level scripting language (leaning towards either Lua or Squirrel) with a C++ core. In addition to scripts I'm also going to need different data files. Many data files will be for static information such as graphical assets and monster types. I'd also want to create and update data files at runtime for user information like option settings and game saves. Can I get away with using plain script files (i.e. .lua or .nut files) for my data files, or is it better to use dedicated markup formats like XML or YAML? If I use script files, loaded separately from my true scripts, then I wouldn't need an extra library to read those files. Scripting languages like Lua also have table syntax that lend themselves towards data definition. On the other hand I'd have to write my own schema check code. These languages also don't seem to support serialization "out of the box" like the markup format libraries do.

    Read the article

  • SQL Server and the XML Data Type : Data Manipulation

    The introduction of the xml data type, with its own set of methods for processing xml data, made it possible for SQL Server developers to create columns and variables of the type xml. Deanna Dicken examines the modify() method, which provides for data manipulation of the XML data stored in the xml data type via XML DML statements. Too many SQL Servers to keep up with?Download a free trial of SQL Response to monitor your SQL Servers in just one intuitive interface."The monitoringin SQL Response is excellent." Mike Towery.

    Read the article

  • Getting data from a webpage in a stable and efficient way

    - by Mike Heremans
    Recently I've learned that using a regex to parse the HTML of a website to get the data you need isn't the best course of action. So my question is simple: What then, is the best / most efficient and a generally stable way to get this data? I should note that: There are no API's There is no other source where I can get the data from (no databases, feeds and such) There is no access to the source files. (Data from public websites) Let's say the data is normal text, displayed in a table in a html page I'm currently using python for my project but a language independent solution/tips would be nice. As a side question: How would you go about it when the webpage is constructed by Ajax calls?

    Read the article

  • What data structure to use / data persistence

    - by Dave
    I have an app where I need one table of information with the following fields: field 1 - int or char field 2 - string (max 10 char) field 3 - string (max 20 char) field 4 - float I need the program to filter on field 1 based upon a segmented control and select a field 2 from a picker. From this data I need to look up field 4 to use in a calculation. Total records will be about 200. I never see it go above 400 - 500. I am going to use a singleton which I am able to do, I just need help with the structure for this with data persistence. What type of data structure should I use for this and should I use NSNumber, NSString, etc. or old data types like float, Char, etc. I thought about a struct put into an array but there is probably a better way. This is new to me so any help or reference to examples would be great. I also thought about a plist or dictionary but it looks like it is just a lookup and a field which obviously won't work. Core data looked like overkill to me. Also, with any recommendation how should I get initial data into it? I want the user to be able to edit and add to the database. Sorry for the old terms, you can see what generation I am from... Thanks in advance!!!!

    Read the article

  • big O notation algorithm

    - by niggersak
    Use big-O notation to classify the traditional grade school algorithms for addition and multiplication. That is, if asked to add two numbers each having N digits, how many individual additions must be performed? If asked to multiply two N-digit numbers, how many individual multiplications are required? . Suppose f is a function that returns the result of reversing the string of symbols given as its input, and g is a function that returns the concatenation of the two strings given as its input. If x is the string hrwa, what is returned by g(f(x),x)? Explain your answer - don't just provide the result!

    Read the article

  • Tricky Big-O complexity

    - by timeNomad
    public void foo (int n, int m) { int i = m; while (i > 100) i = i/3; for (int k=i ; k>=0; k--) { for (int j=1; j<n; j*=2) System.out.print(k + "\t" + j); System.out.println(); } } I figured the complexity would be O(logn). That is as a product of the inner loop, the outer loop -- will never be executed more than 100 times, so it can be omitted. What I'm not sure about is the while clause, should it be incorporated into the Big-O complexity? For very large i values it could make an impact, or arithmetic operations, doesn't matter on what scale, count as basic operations and can be omitted?

    Read the article

  • Database indexes and their Big-O notation

    - by miket2e
    I'm trying to understand the performance of database indexes in terms of Big-O notation. Without knowing much about it, I would guess that: Querying on a primary key or unique index will give you a O(1) lookup time. Querying on a non-unique index will also give a O(1) time, albeit maybe the '1' is slower than for the unique index (?) Querying on a column without an index will give a O(N) lookup time (full table scan). Is this generally correct ? Will querying on a primary key ever give worse performance than O(1) ? My specific concern is for SQLite, but I'd be interested in knowing to what extent this varies between different databases too.

    Read the article

  • Can someone help with big O notation?

    - by Dann
    void printScientificNotation(double value, int powerOfTen) { if (value >= 1.0 && value < 10.0) { System.out.println(value + " x 10^" + powerOfTen); } else if (value < 1.0) { printScientificNotation(value * 10, powerOfTen - 1); } else // value >= 10.0 { printScientificNotation(value / 10, powerOfTen + 1); } } I understand how the method goes but I cannot figure out a way to represent the method. For example, if value was 0.00000009 or 9e-8, the method will call on printScientificNotation(value * 10, powerOfTen - 1); eight times and System.out.println(value + " x 10^" + powerOfTen); once. So the it is called recursively by the exponent for e. But how do I represent this by big O notation? Thanks!

    Read the article

  • Using a "white list" for extracting terms for Text Mining

    - by [email protected]
    In Part 1 of my post on "Generating cluster names from a document clustering model" (part 1, part 2, part 3), I showed how to build a clustering model from text documents using Oracle Data Miner, which automates preparing data for text mining. In this process we specified a custom stoplist and lexer and relied on Oracle Text to identify important terms.  However, there is an alternative approach, the white list, which uses a thesaurus object with the Oracle Text CTXRULE index to allow you to specify the important terms. INTRODUCTIONA stoplist is used to exclude, i.e., black list, specific words in your documents from being indexed. For example, words like a, if, and, or, and but normally add no value when text mining. Other words can also be excluded if they do not help to differentiate documents, e.g., the word Oracle is ubiquitous in the Oracle product literature. One problem with stoplists is determining which words to specify. This usually requires inspecting the terms that are extracted, manually identifying which ones you don't want, and then re-indexing the documents to determine if you missed any. Since a corpus of documents could contain thousands of words, this could be a tedious exercise. Moreover, since every word is considered as an individual token, a term excluded in one context may be needed to help identify a term in another context. For example, in our Oracle product literature example, the words "Oracle Data Mining" taken individually are not particular helpful. The term "Oracle" may be found in nearly all documents, as with the term "Data." The term "Mining" is more unique, but could also refer to the Mining industry. If we exclude "Oracle" and "Data" by specifying them in the stoplist, we lose valuable information. But it we include them, they may introduce too much noise. Still, when you have a broad vocabulary or don't have a list of specific terms of interest, you rely on the text engine to identify important terms, often by computing the term frequency - inverse document frequency metric. (This is effectively a weight associated with each term indicating its relative importance in a document within a collection of documents. We'll revisit this later.) The results using this technique is often quite valuable. As noted above, an alternative to the subtractive nature of the stoplist is to specify a white list, or a list of terms--perhaps multi-word--that we want to extract and use for data mining. The obvious downside to this approach is the need to specify the set of terms of interest. However, this may not be as daunting a task as it seems. For example, in a given domain (Oracle product literature), there is often a recognized glossary, or a list of keywords and phrases (Oracle product names, industry names, product categories, etc.). Being able to identify multi-word terms, e.g., "Oracle Data Mining" or "Customer Relationship Management" as a single token can greatly increase the quality of the data mining results. The remainder of this post and subsequent posts will focus on how to produce a dataset that contains white list terms, suitable for mining. CREATING A WHITE LIST We'll leverage the thesaurus capability of Oracle Text. Using a thesaurus, we create a set of rules that are in effect our mapping from single and multi-word terms to the tokens used to represent those terms. For example, "Oracle Data Mining" becomes "ORACLEDATAMINING." First, we'll create and populate a mapping table called my_term_token_map. All text has been converted to upper case and values in the TERM column are intended to be mapped to the token in the TOKEN column. TERM                                TOKEN DATA MINING                         DATAMINING ORACLE DATA MINING                  ORACLEDATAMINING 11G                                 ORACLE11G JAVA                                JAVA CRM                                 CRM CUSTOMER RELATIONSHIP MANAGEMENT    CRM ... Next, we'll create a thesaurus object my_thesaurus and a rules table my_thesaurus_rules: CTX_THES.CREATE_THESAURUS('my_thesaurus', FALSE); CREATE TABLE my_thesaurus_rules (main_term     VARCHAR2(100),                                  query_string  VARCHAR2(400)); We next populate the thesaurus object and rules table using the term token map. A cursor is defined over my_term_token_map. As we iterate over  the rows, we insert a synonym relationship 'SYN' into the thesaurus. We also insert into the table my_thesaurus_rules the main term, and the corresponding query string, which specifies synonyms for the token in the thesaurus. DECLARE   cursor c2 is     select token, term     from my_term_token_map; BEGIN   for r_c2 in c2 loop     CTX_THES.CREATE_RELATION('my_thesaurus',r_c2.token,'SYN',r_c2.term);     EXECUTE IMMEDIATE 'insert into my_thesaurus_rules values                        (:1,''SYN(' || r_c2.token || ', my_thesaurus)'')'     using r_c2.token;   end loop; END; We are effectively inserting the token to return and the corresponding query that will look up synonyms in our thesaurus into the my_thesaurus_rules table, for example:     'ORACLEDATAMINING'        SYN ('ORACLEDATAMINING', my_thesaurus)At this point, we create a CTXRULE index on the my_thesaurus_rules table: create index my_thesaurus_rules_idx on        my_thesaurus_rules(query_string)        indextype is ctxsys.ctxrule; In my next post, this index will be used to extract the tokens that match each of the rules specified. We'll then compute the tf-idf weights for each of the terms and create a nested table suitable for mining.

    Read the article

< Previous Page | 7 8 9 10 11 12 13 14 15 16 17 18  | Next Page >