Search Results

Search found 401 results on 17 pages for 'hadoop'.


Better to build or buy a compute grid platform?

- by James B

I am looking to do some quite processor-intensive brute-force processing for string matching. I have run my prototype in a multi-threaded environment and compared its performance to an implementation using GridGain with a couple of nodes (also multithreaded). My GridGain implementation performed slower than my multithreaded implementation. There could have been a flaw in my GridGain implementation, but it was only a prototype, and I thought the results were indicative.

So my question is this: what are the advantages of having to learn and then build an implementation for a particular grid platform (Hadoop, GridGain, or EC2 if going hosted; other suggestions welcome), when one could fairly easily put together a lightweight compute grid platform with a much shallower learning curve? I.e., what do we get for free with these cloud/grid platforms that is worth having or tricky to implement? (Please note, I don't have any need for a data grid.)

Cheers, -James

(P.S. Happy to make this community wiki if need be.)

Pig: Count number of keys in a map

- by Donald Miner

I'd like to count the number of keys in a map in Pig. I could write a UDF to do this, but I was hoping there would be an easier way.

    data = LOAD 'hbase://MARS1'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'A:*', '-loadKey true -caching=100000')
        AS (id:bytearray, A_map:map[]);

In the code above, I basically want to build a histogram of id and how many items in column family A that key has. Hoping it would just work, I tried

    c = FOREACH data GENERATE id, COUNT(A_map);

but that unsurprisingly didn't work. Or perhaps someone can suggest a better way to do this entirely. If I can't figure this out soon, I'll just write a Java MapReduce job or a Pig UDF.
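The computation being asked for can be sketched outside Pig as follows. This is a minimal illustration in plain Python, not Pig code: the row data and the helper names keys_per_row and histogram are made up for the example. Pig's built-in SIZE function is documented as returning the number of key/value pairs when given a map, so SIZE(A_map) may be worth trying in place of COUNT, subject to checking it against the Pig version in use.

```python
from collections import Counter


def keys_per_row(rows):
    """Return (row_id, number_of_keys) for each (row_id, mapping) pair."""
    return [(row_id, len(mapping)) for row_id, mapping in rows]


def histogram(counts):
    """Count how many rows have each distinct number of keys."""
    return Counter(n for _, n in counts)


# Hypothetical rows standing in for (id, A_map) tuples loaded from HBase.
rows = [
    ("r1", {"A:x": 1, "A:y": 2}),
    ("r2", {}),
    ("r3", {"A:x": 5, "A:z": 7}),
]
counts = keys_per_row(rows)
```

Here keys_per_row gives the per-id key count, and histogram(counts) summarises how many ids have each count.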

HBase as a multimap

- by Ibrahim

Hi guys, I'm doing some large-scale text processing work and I'm trying to get started with Hadoop and HBase. One of the things I need to do is build a multimap of some stuff, which I later use to look things up and get all items with a certain key (in an M/R job). Would it be OK to use HBase and insert many rows with the same key, relying on versions/timestamps to achieve a multimap-like setup, or is this a bad idea? The multimap is built up in the reduce phase of a MapReduce task, by the way, or at least that is how I've formulated it on paper. Thanks! If more information is needed, I'd be happy to provide it. I'm not sure whether this question is clear.
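One alternative to relying on versions is to keep a single row per key and store each item as its own column qualifier, since a column family only retains a configured maximum number of versions per cell. The sketch below is an in-memory toy model of that layout, assuming items can be used as qualifier names. RowStore and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class RowStore:
    """Toy model of a wide-row layout: row key -> {column qualifier -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, qualifier, value=b""):
        # Writing the same qualifier twice overwrites it, so items stay unique.
        self._rows[row_key][qualifier] = value

    def get_all(self, row_key):
        """Return every item stored under row_key, in sorted order."""
        return sorted(self._rows.get(row_key, {}))


store = RowStore()
store.put("apple", "doc1")
store.put("apple", "doc2")
store.put("apple", "doc1")  # duplicate item, stored once
store.put("pear", "doc3")
```

With this layout a single row read returns the whole multimap entry, at the cost of rows growing wide for very popular keys.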

Architecture for analysing search result impressions/clicks to improve future searches

- by Hais

We have a large database of items (10M+) stored in MySQL and intend to implement search on the metadata of these items, taking advantage of something like Sphinx. The dataset will be changing slightly on a daily basis, so Sphinx will be re-indexing daily. However, we want the algorithm to self-learn and improve search results by analysing impression and click data, so that we provide better results for our customers on that search term, and possibly on other similar search terms too.

I've been reading up on Hadoop and it seems like it has the potential to crunch all this data, although I'm still unsure how to approach it. Amazon has tutorials for compiling impression vs. click data using MapReduce, but I can't see how to get this data into a usable format.

My idea is that when a search term comes in, I query Sphinx to get all the matching items from the dataset, then query the analytics (compiled on an hourly basis or similar) so that we know the most popular items for that search term, then cache the final results using something like Memcached, Membase or similar. Am I along the right lines here?
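The "query the analytics, then re-rank" step could look roughly like the sketch below. It is a minimal in-memory illustration, not a Hadoop job: the event format (term, item_id, kind), the function names, and the smoothing priors are all assumptions made for the example. The priors keep items with very few impressions from jumping to the top on one lucky click.

```python
from collections import defaultdict


def aggregate(events):
    """Fold (term, item_id, kind) events into per-(term, item) counters.

    kind is assumed to be either "impression" or "click".
    """
    stats = defaultdict(lambda: [0, 0])  # (term, item) -> [impressions, clicks]
    for term, item, kind in events:
        if kind == "impression":
            stats[(term, item)][0] += 1
        elif kind == "click":
            stats[(term, item)][1] += 1
    return stats


def rerank(term, candidates, stats, prior_clicks=1.0, prior_impressions=20.0):
    """Order search-engine candidates by smoothed click-through rate."""

    def score(item):
        impressions, clicks = stats.get((term, item), (0, 0))
        return (clicks + prior_clicks) / (impressions + prior_impressions)

    return sorted(candidates, key=score, reverse=True)


# Hypothetical log: item 1 is shown often but rarely clicked, item 2 is clicked often.
events = (
    [("shoes", 1, "impression")] * 100 + [("shoes", 1, "click")] * 2
    + [("shoes", 2, "impression")] * 50 + [("shoes", 2, "click")] * 10
)
stats = aggregate(events)
ranking = rerank("shoes", [1, 2, 3], stats)
```

In a real pipeline the aggregate step would be the periodic batch job, and its output table is what the search front end would look up and cache.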

How should I best store these files?

- by Triton Man

I have a set of image files. They are generally very small, between 5 KB and 100 KB. They can be any size though, upwards of 50 MB, but this is very rare. Once these images are put into the system, they are never modified. There are about 50 TB of these images in total. They are currently chunked and stored in BLOBs in Oracle, but we want to change this since it requires special software to extract them. These images are sometimes accessed at a rate of over 100 requests per second across about 10 servers. I'm thinking about Hadoop or Cassandra, but I really don't know which would be best or how best to index them.

How to map a set of text as a whole to a node?

- by JIpeng Tan

Suppose I have a plain text file with the following data:

    DataSetOne
    content
    content
    content
    DataSetTwo
    content
    content
    content
    content
    ...and so on...

What I want to do is count how many contents there are in each data set. For example, the result should be <DataSetOne, 3>, <DataSetTwo, 4>. I am a beginner with Hadoop, and I wonder if there is a way to map a chunk of data as a whole to a node, for example, send all of DataSetOne to node 1 and all of DataSetTwo to node 2. Can anyone give me an idea how to achieve this?
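The counting itself can be sketched in a few lines of plain Python. This assumes header lines can be recognised by a "DataSet" prefix, which is a guess based on the sample; the is_header parameter is there so a different rule can be swapped in. In Hadoop, input splits are cut by byte range rather than by data set, so keeping a whole data set together would normally need a custom InputFormat and RecordReader; the sketch only shows the per-chunk logic.

```python
def count_contents(lines, is_header=lambda line: line.startswith("DataSet")):
    """Count the content lines that follow each header line."""
    counts = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if is_header(line):
            current = line
            counts[current] = 0
        elif current is not None:
            counts[current] += 1
    return counts


sample = [
    "DataSetOne", "content", "content", "content",
    "DataSetTwo", "content", "content", "content", "content",
]
result = count_contents(sample)
```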

Find Port Number and Domain Name to connect to Hive Table

- by user1419563

I am new to Hive, MapReduce and Hadoop. I am using PuTTY to connect to a Hive table and access records in the tables. I opened PuTTY, typed ares-ingest.vip.host.com as the host name, and clicked Open. Then I entered my username and password, followed by a few commands to get to the Hive prompt. Below is what I did:

    $ bash
    bash-3.00$ hive
    Hive history file=/tmp/rjamal/hive_job_log_rjamal_201207010451_1212680168.txt
    hive> set mapred.job.queue.name=hdmi-technology;
    hive> select * from table LIMIT 1;

Now I am trying to connect to the Hive tables using SQuirreL SQL Client, with the connection URL jdbc:hive://ares-ingest.vip.host.com:10000/default. Whenever I try to connect with these attributes, I always get:

    Hive: Could not establish connection to ares-ingest.vip.host.com:10000/default: java.net.ConnectException: Connection timed out: connect

It is possible that I am using the wrong port number or domain name here. Is there any way I can find out these two things from the command prompt, i.e. which domain name and port number (where the Hive server is running) I should use to connect to the Hive tables from SQuirreL SQL Client? As far as I know, the host and port are determined by where the Hive server is running.
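A "Connection timed out" error can be narrowed down by checking whether anything accepts TCP connections on the host and port in the JDBC URL. The helper below is a small sketch using only Python's standard socket module. can_connect is a name made up for this example, and a successful connection only shows that some service is listening, not that it is a Hive server.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call, using the host and port from the JDBC URL in the question:
#   can_connect("ares-ingest.vip.host.com", 10000)
```

If this returns False from the client machine but the Hive CLI works when logged in on the server, the likely candidates are that no Hive server process is listening on that port or that a firewall sits in between.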

Mahout - Error when trying out Wikipedia examples

- by Li'

Note: this post is similar to "Caused by: java.lang.ClassNotFoundException: classpath" but has a different error message.

I am trying to run the Wikipedia Bayes Example from https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example

When I ran the following command:

    lis-macbook-pro:mahout-distribution-0.8 Li$ mahout wikipediaXMLSplitter -d examples/temp/enwiki-latest-pages-articles10.xml -o wikipedia/chunks -c 64

I got this error message:

    MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
    MAHOUT_LOCAL is set, running locally
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/Users/Li/File/Java/mahout-distribution-0.8/examples/target/mahout-examples-0.8-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/Users/Li/File/Java/mahout-distribution-0.8/examples/target/dependency/slf4j-jcl-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.JCLLoggerFactory]
    Oct 21, 2013 4:25:47 PM org.slf4j.impl.JCLLoggerAdapter warn
    WARNING: Unable to add class: wikipediaXMLSplitter
    java.lang.ClassNotFoundException: wikipediaXMLSplitter
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:171)
        at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:236)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:127)

I am using Hadoop 1.2 and Mahout 0.8. mahout-distribution-0.8/bin has been added to $PATH. $MAHOUT_LOCAL is set to "True", so it runs locally. I don't know why I got "Unable to add class: wikipediaXMLSplitter".

Getting started with massive data

- by Max

I'm a math guy and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from?

Hadoop/MapReduce is one obvious start. Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?)

I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with?

Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale.

Any other suggestions on things to learn would be great!

Filtering null values with pig

- by arianp

It looks like a silly problem, but I can't find a way to filter null values from my rows. This is the result when I dump the object geoinfo:

    DUMP geoinfo;
    ([longitude#70.95853,latitude#30.9773])
    ([longitude#-9.37944507,latitude#38.91780853])
    (null)
    (null)
    (null)
    ([longitude#-92.64416,latitude#16.73326])
    (null)
    (null)
    ([longitude#-9.15199849,latitude#38.71179122])
    ([longitude#-9.15210796,latitude#38.71195131])

Here is the description:

    DESCRIBE geoinfo;
    geoinfo: {geoLocation: bytearray}

What I'm trying to do is filter null values like this:

    geoinfo_no_nulls = FILTER geoinfo BY geoLocation is not null;

but the result remains the same; nothing is filtered. I also tried something like this:

    geoinfo_no_nulls = FILTER geoinfo BY geoLocation != 'null';

and I got an error:

    org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a map to a String

What am I doing wrong?

Details: running on Ubuntu, hadoop-1.0.3 with Pig 0.9.3.

    pig -version
    Apache Pig version 0.9.3-SNAPSHOT (rexported)
    compiled Oct 24 2012, 19:04:03

    java version "1.6.0_24"
    OpenJDK Runtime Environment (IcedTea6 1.11.4) (6b24-1.11.4-1ubuntu0.12.04.1)
    OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Sybase IQ 15.4 announced: Sybase bets on Hadoop and MapReduce, and challenges its parent company?

Sybase IQ 15.4 announced for the end of November. Sybase wants to push the limits of Big Data with Hadoop and MapReduce. While SAPPHIRE NOW, SAP's big annual gathering, was in full swing, Sybase, the German vendor's new subsidiary, independently announced the release of Sybase IQ 15.4, its high-performance, column-oriented analytics server for handling "big data". SAP, for its part, is promoting HANA, its new data caching technology (or "In-Memory Computing"), to accelerate processing speed...

Crawling engine architecture - Java/ Perl integration

- by Bigtwinz

Hi all, I am looking to develop a management and administration solution around our web-crawling Perl scripts. Right now our scripts are saved in SVN and are manually kicked off by sysadmins, devs, etc. Every time we need to retrieve data from new sources, we have to create a ticket with business instructions and goals. As you can imagine, this is not an optimal solution.

There are 3 consistent themes with this system:

- the retrieval of data has a "conceptual structure", for lack of a better phrase, i.e. the retrieval of information follows a particular path
- we are only looking for very specific information, so we don't really have to worry about extensive crawling for a while (think thousands to tens of thousands of pages vs. millions)
- crawls are URL-based instead of site-based

As I enhance this alpha version into a more production-level beta, I am looking to add automation and management of the retrieval of data. Additionally, our other systems are Java (which I'm more proficient in) and I'd like to compartmentalize the Perl aspects so we don't have to lean heavily on outside help. I've evaluated the usual suspects (Nutch, Droid, etc.), but the time spent modifying those frameworks to suit our specific information retrieval can't be justified. So I'd like your thoughts regarding the following architecture. I want to create a solution which will:

- use Java as the interface for managing and executing the Perl scripts
- use Java for configuration and data access
- stick with Perl for retrieval

An example use case would be:

- a data analyst delivers us a requirement for crawling
- a Perl developer creates the required script and uses this webapp to submit the script (which gets saved to the filesystem)
- the script gets kicked off from the webapp with specific parameters
- ....

The webapp should be able to create multiple threads of the Perl script to initiate multiple crawlers.

So the questions are:

- what do you think?
- how solid is the integration between Java and Perl, specifically when calling Perl from Java?
- has someone used such a system which actually is part Perl repository?

The goal really is to not have a whole bunch of unorganized Perl scripts, and to put some management and organization around our information retrieval. Also, I know I can use Perl to do the web part of what we want, but as I mentioned before, I'm trying to keep Perl focused. But if that seems ass-backwards, I'm not averse to making it an all-Perl solution. Open to any and all suggestions and opinions. Thanks
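The "kick off scripts from a managing app, several at a time" part of this architecture boils down to launching child processes from a worker pool and collecting their exit codes and output. The sketch below shows that pattern in Python for brevity; in Java the analogous pieces would be ProcessBuilder and an ExecutorService. The fake_crawler helper is a hypothetical stand-in for a real command such as a perl interpreter plus script path and arguments.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_script(command, timeout=60):
    """Run one crawler command and return (exit_code, stdout, stderr)."""
    done = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return done.returncode, done.stdout, done.stderr


def run_many(commands, max_workers=4):
    """Run several crawler commands concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_script, commands))


def fake_crawler(url):
    """Hypothetical stand-in for a crawler script: just echoes the URL it was given."""
    code = "import sys; print('crawled ' + sys.argv[1])"
    return [sys.executable, "-c", code, url]


results = run_many([fake_crawler("http://example.com/a"),
                    fake_crawler("http://example.com/b")])
```

Keeping the contract to "arguments in, exit code and output out" is what lets the scripts stay in Perl while scheduling, configuration, and storage live in the managing application.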

I'm familiar with Python and its data structures. Can someone give me a very basic example on how to

- by alex

What can I do with MapReduce? Dictionaries? Lists? What do I use it for? Give a really easy example.
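A very small example of the idea, using only Python lists and dictionaries, is the classic word count. This is a single-process sketch of the three stages (map, shuffle, reduce), not Hadoop code; the function names are chosen for the illustration. In a real cluster the map and reduce calls run on many machines and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


totals = word_count(["the cat", "The dog the"])
```

The same shape fits any problem that can be phrased as "emit key/value pairs per record, then combine the values per key", such as counting, summing, or grouping.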

Does throwing an exception in an EvalFunc pig UDF skip just that line, or stop completely?

- by Daniel Huckstep

I have a User Defined Function (UDF) written in Java to parse lines in a log file and return information back to Pig, so it can do all the processing. It looks something like this:

    public abstract class Foo extends EvalFunc<Tuple> {
        public Foo() {
            super();
        }

        public Tuple exec(Tuple input) throws IOException {
            try {
                // do stuff with input
            } catch (Exception e) {
                throw WrappedIOException.wrap("Error with line", e);
            }
        }
    }

My question is: if it throws the IOException, will it stop completely, or will it return results for the rest of the lines that don't throw an exception?

Example: I run this in Pig

    REGISTER myjar.jar
    DEFINE Extractor com.namespace.Extractor();
    logs = LOAD '$IN' USING TextLoader AS (line: chararray);
    events = FOREACH logs GENERATE FLATTEN(Extractor(line));

With this input:

    1.5 7 "Valid Line"
    1.3 gghyhtt Inv"alid line"" I throw an exceptioN!!
    1.8 10 "Valid Line 2"

Will it process the two lines and will 'logs' have 2 tuples, or will it just die in a fire?
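A common way to get "skip the bad line, keep the rest" behaviour is for the parsing function to return a null result for unparseable input instead of raising, and to filter the nulls out afterwards. The sketch below shows that pattern in plain Python. The record format (float, integer, quoted string) is inferred from the sample input and is an assumption, as are the function names. Whether an exception thrown from exec aborts the whole job should be checked against the Pig version in use; my understanding is that it fails the task rather than skipping the record.

```python
import re

# Assumed line format, inferred from the sample: <float> <int> "<text>"
LINE = re.compile(r'^(\d+\.\d+) (\d+) "([^"]*)"$')


def extract(line):
    """Parse one log line; return None instead of raising on bad input."""
    match = LINE.match(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), match.group(3)


def process(lines):
    """Parse every line and drop the ones that could not be parsed."""
    parsed = (extract(line) for line in lines)
    return [record for record in parsed if record is not None]


log = [
    '1.5 7 "Valid Line"',
    '1.3 gghyhtt Inv"alid line"" I throw an exceptioN!!',
    '1.8 10 "Valid Line 2"',
]
events = process(log)
```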

Unable to run MR on cluster

- by RAVITEJA SATYAVADA

I have a MapReduce program that runs successfully in standalone (Eclipse) mode, but when I export the jar and try to run the same job on the cluster, it shows a null pointer exception like this:

    13/06/26 05:46:22 ERROR mypackage.HHDriver: Error while configuring run method. java.lang.NullPointerException

I double-checked the run method parameters; they are not null, and it runs in standalone mode as well.

Is it worth purchasing Mahout in Action to get up to speed with Mahout, or are there other better sources?

- by gab

I'm currently a very casual user of Apache Mahout, and I'm considering purchasing the book Mahout in Action. Unfortunately, I'm having a really hard time getting an idea of how worth it this book is -- and seeing as it's a Manning Early Access Program book (and therefore only currently available as a beta-version e-book), I can't take a look myself in a bookstore. Can anyone recommend this as a good (or less good) guide to getting up to speed with Mahout, and/or other sources that can supplement the Mahout website?

Converting python collaborative filtering code to use Map Reduce

- by Neil Kodner

Using Python, I'm computing cosine similarity across items. Given event data that represents a purchase (user, item), I have a list of all items 'bought' by my users. Given this input data

    (user,item)
    X,1
    X,2
    Y,1
    Y,2
    Z,2
    Z,3

I build a Python dictionary

    {1: ['X','Y'], 2 : ['X','Y','Z'], 3 : ['Z']}

From that dictionary, I generate a bought/not bought matrix, which is also a dictionary (bnb).

    {1 : [1,1,0], 2 : [1,1,1], 3 : [0,0,1]}

From there, I'm computing the similarity between (1,2) by calculating the cosine between (1,1,0) and (1,1,1), yielding 0.816496. I'm doing this by:

    items = [1, 2, 3]
    for item in items:
        for sub in items:
            if sub >= item:  # so as not to calculate similarity on the inverse
                sim = coSim(bnb[item], bnb[sub])

I think the brute-force approach is killing me, and it only runs slower as the data gets larger. On my trusty laptop, this calculation runs for hours when dealing with 8500 users and 3500 items. I'm trying to compute similarity for all items in my dict and it's taking longer than I'd like it to. I think this is a good candidate for MapReduce, but I'm having trouble 'thinking' in terms of key/value pairs. Alternatively, is the issue with my approach, and this is not necessarily a candidate for MapReduce?
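One way to think in key/value pairs here is to key on the user rather than on the item pair. For each user, emit every pair of items that user bought (the map step); then sum the pair counts (the reduce step). Because the vectors are binary, the cosine of two items is the co-purchase count divided by the square root of the product of their purchase counts, so item pairs with no common buyer never have to be touched. The sketch below is a single-process illustration of that idea; the function name is made up, and it is not the asker's coSim.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(purchases):
    """Cosine similarity for every item pair that shares at least one buyer."""
    # Group purchases by user; a set removes repeated purchases.
    baskets = defaultdict(set)
    for user, item in purchases:
        baskets[user].add(item)

    item_count = defaultdict(int)  # item -> number of buyers
    pair_count = defaultdict(int)  # (a, b) with a < b -> number of common buyers
    for items in baskets.values():  # "map": one basket at a time
        for item in items:
            item_count[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_count[(a, b)] += 1  # "reduce": sum per pair key

    return {
        pair: n / sqrt(item_count[pair[0]] * item_count[pair[1]])
        for pair, n in pair_count.items()
    }


purchases = [("X", 1), ("X", 2), ("Y", 1), ("Y", 2), ("Z", 2), ("Z", 3)]
sims = item_similarities(purchases)
```

On the sample data this reproduces the 0.816496 figure for items (1, 2), and the pair (1, 3) is absent because no user bought both.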

Searches (and general querying) with HBase and/or Cassandra (best practices?)

- by alexeypro

I have a User model object with quite a few fields (properties, if you wish) in it, say "firstname", "lastname", "city" and "year-of-birth". Each user also gets a "unique id". I want to be able to search by them. How do I do that properly? How do I do that at all?

My understanding (this will work for pretty much any key-value storage; first goes the key, then the value):

    u:123456789 = serialized_json_object

("u" is a simple prefix for users' keys, and 123456789 is the "unique id"). Now, since I want to be able to search by firstname and lastname, I can save:

    f:Steve = u:384734807,u:2398248764,u:23276263
    f:Alex = u:12324355,u:121324334

So the key is "f", the prefix for firstnames, followed by "Steve", the actual firstname. For "f:Steve" we save as the value the ids of all users named Steve. That makes every search very easy. Querying by a few fields (properties), say by firstname (i.e. "Steve") and lastname (i.e. "l:Anything"), is still easy: first get the list of user ids from "f:Steve", then the list from "l:Anything", find the intersecting user ids, and here you go.

Problems (and there are quite a few):

- Saving, updating, or deleting a user is a pain. It has to be an atomic and consistent operation. Also, if the size of a value is limited, then we are in (potential) trouble, and there is not really an answer here. Only zipping the list of user ids? Not too cool, though.
- What if we eventually want to add a new field to search by, say "city"? We can certainly do it the same way ("c:Los Angeles" = ..., "c:Chicago" = ...), but if we didn't foresee all those "search choices" from the very beginning, we will have to create some nightly job or something to go through all existing User records and update those "c:CITY" keys for them... Quite a big job!
- Problems with locking. User "u:123" updates his name to "Alex", and user "u:456" updates his name to "Alex". They both have to update "f:Alex" with their ids. That means either we get an overwriting problem, or one update will wait for the other (and imagine if there are many of them?!).

What's the best way of doing that, keeping in mind that I want to search by many fields?

P.S. Please, the question is about HBase/Cassandra/NoSQL/key-value storage. Please, please, no advice to use MySQL and "read about" SELECTs and worry about scaling problems "later". There is a reason why I asked MY question exactly the way I did. :-)
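The scheme described above is a hand-maintained inverted index. The in-memory sketch below shows its moving parts: the primary record under "u:<id>", one index entry per searchable field value, removal of stale index entries on update, and multi-field search by set intersection. UserIndex and its methods are hypothetical names for illustration only; the sketch does nothing about the atomicity and locking problems listed above, which in a real store depend on what the store itself offers.

```python
import json
from collections import defaultdict


class UserIndex:
    """Toy key-value store with manually maintained secondary indexes."""

    PREFIXES = {"firstname": "f", "lastname": "l"}

    def __init__(self):
        self.kv = {}                   # "u:<id>" -> serialized user
        self.index = defaultdict(set)  # "f:<name>" / "l:<name>" -> user ids

    def _index_keys(self, user):
        for field, prefix in self.PREFIXES.items():
            if field in user:
                yield prefix + ":" + str(user[field])

    def put(self, user_id, user):
        key = "u:" + str(user_id)
        old = self.kv.get(key)
        if old is not None:
            # Remove stale index entries before writing the new version.
            for index_key in self._index_keys(json.loads(old)):
                self.index[index_key].discard(user_id)
        self.kv[key] = json.dumps(user)
        for index_key in self._index_keys(user):
            self.index[index_key].add(user_id)

    def search(self, **criteria):
        """Return ids of users matching every given field=value pair."""
        sets = [
            self.index.get(self.PREFIXES[field] + ":" + str(value), set())
            for field, value in criteria.items()
        ]
        return set.intersection(*sets) if sets else set()


idx = UserIndex()
idx.put(1, {"firstname": "Steve", "lastname": "Jobs"})
idx.put(2, {"firstname": "Steve", "lastname": "Wozniak"})
idx.put(3, {"firstname": "Alex", "lastname": "Jobs"})
```

Adding a new searchable field later means adding a prefix and re-running put over every existing record, which is the backfill job the question anticipates.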

CouchDB, HDFS, HBase or which is right for my situation?

- by Lucas

Hello all, This question is regarding data storage systems such as CouchDB, HDFS and HBase, specifically, which is right. I am looking at making a simple and customized Document Management System for my organization. Basically, we need the ability to store some Word Documents, PDFs and other similar files. I also want to store metadata about these files (e.g., Author, Dates, etc). Usage permissions would also be handy, but that can probably be built using meta-data. I would also need the ability to full-text index. The ability to version, while not required would be extremely useful. I would like the ability to simply add hardware to expand the resources of the system and the system must support Network Attached Storage over the CIFS or NFS protocol(s). I have read about CouchDB, HDFS and HBase. My preferred programming language is C# as all of my end-users will be running Windows machines and I will want to make both web and winforms client implementations. My question is which solution best fits my needs? Based on my research it appears that CouchDB (utilizing the CouchDB-Lounge and CouchDB-Lucene) perfectly fits my needs. However, I am worried that since I have worked with CouchDB that I might be overlooking something useful for my needs in HDFS or HBase or something similar due to a bias. Any and all opinions are welcome as I am looking for the community input as I really do not want to make the wrong choice at the start of my project. Please ask if you need more information. I thank you all for your time, input and assistance.

what is a data serialization system?

- by Yang

According to the Apache Avro project, "Avro is a serialization system". By saying "data serialization system", does it mean that Avro is a product or an API? Also, I am not quite sure what a data serialization system is. For now, my understanding is that it is a protocol that defines how a data object is passed over the network. Can anyone help explain it in an intuitive way, so that it is easier for people with a limited distributed computing background to understand? Thanks in advance!
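At its core, serialization means turning an in-memory object into a sequence of bytes that can be stored or sent, and turning those bytes back into an equivalent object on the other side. The sketch below round-trips one record in two ways using only the Python standard library: as self-describing JSON text, and as a compact binary layout that only makes sense if both sides agree on the schema (here: a length-prefixed name followed by a one-byte age). The binary layout is invented for this example and is not Avro's actual encoding; it only illustrates why a schema-based format can be much smaller.

```python
import json
import struct


def to_json_bytes(record):
    """Self-describing text encoding: field names travel with the data."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def from_json_bytes(data):
    return json.loads(data.decode("utf-8"))


def to_binary(record):
    """Schema-based encoding: 2-byte name length, name bytes, 1-byte age."""
    name = record["name"].encode("utf-8")
    return struct.pack(">H", len(name)) + name + struct.pack(">B", record["age"])


def from_binary(data):
    (name_length,) = struct.unpack_from(">H", data, 0)
    name = data[2:2 + name_length].decode("utf-8")
    (age,) = struct.unpack_from(">B", data, 2 + name_length)
    return {"name": name, "age": age}


record = {"name": "Yang", "age": 30}
```

A "serialization system" packages this idea with a schema language, libraries for several programming languages, and file and RPC formats, so it is both a specification and a set of APIs.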

Strange results - I obtain same value for all keys

- by Pietro Luciani

I have a problem with MapReduce. Given as input a list of songs ("Songname"#"UserID"#"boolean"), I must produce a song list that specifies how many different users listened to each song, so an output of ("Songname","timelistening"). I used a Hashtable to allow only one couple. With short files it works well, but when I give it a list of about 1,000,000 records as input, it returns the same value (20) for all records.

This is my mapper:

    public static class CanzoniMapper extends Mapper<Object, Text, Text, IntWritable> {
        private IntWritable userID = new IntWritable(0);
        private Text song = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            /*StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }*/
            String[] caratteri = value.toString().split("#");
            if (caratteri[2].equals("1")) {
                song.set(caratteri[0]);
                userID.set(Integer.parseInt(caratteri[1]));
                context.write(song, userID);
            }
        }
    }

This is my reducer:

    public static class CanzoniReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            Hashtable<IntWritable, Text> doppioni = new Hashtable<IntWritable, Text>();
            for (IntWritable val : values) {
                doppioni.put(val, key);
            }
            result.set(doppioni.size());
            //doppioni.clear();
            context.write(key, result);
        }
    }

and main:

    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(Canzoni.class);
    job.setMapperClass(CanzoniMapper.class);
    //job.setCombinerClass(CanzoniReducer.class);
    //job.setNumReduceTasks(2);
    job.setReducerClass(CanzoniReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);

Any idea?
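One possible explanation is object reuse: Hadoop's reduce-side iterator is commonly reported to hand back the same IntWritable instance on every step, changing only its contents. Storing that object itself as a hash key then stores many references to one mutating object, and the resulting size no longer reflects the number of distinct users. The plain-Python sketch below reproduces the effect with a hypothetical IntBox class standing in for the reusable value holder. The usual remedy is to copy the primitive value out (for example val.get() into a set of integers) before storing it.

```python
class IntBox:
    """Mutable value holder, a hypothetical stand-in for a reused Writable."""

    def __init__(self, value=0):
        self.value = value


def reused_values(numbers):
    """Yield the SAME IntBox every time, overwriting its contents."""
    box = IntBox()
    for number in numbers:
        box.value = number
        yield box


def distinct_by_reference(values):
    """Buggy: keeps references, so all entries end up showing the last value."""
    kept = [box for box in values]
    return len({box.value for box in kept})


def distinct_by_copy(values):
    """Correct: copies the primitive out while the iterator is at that element."""
    return len({box.value for box in values})


user_ids = [7, 7, 8, 9]
wrong = distinct_by_reference(reused_values(user_ids))
right = distinct_by_copy(reused_values(user_ids))
```

The Python symptom (everything collapses to one value) differs in detail from the Java Hashtable symptom, but the underlying cause, holding on to an object that the iterator keeps overwriting, is the same.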

PIG doesn't read my custom InputFormat

- by Simon Guo

I have a custom MyInputFormat that is supposed to deal with the record boundary problem for multi-line inputs. But it does not work as expected when I plug MyInputFormat into my UDF load function, as follows:

    public class EccUDFLogLoader extends LoadFunc {
        @Override
        public InputFormat getInputFormat() {
            System.out.println("I am in getInputFormat function");
            return new MyInputFormat();
        }
    }

    public class MyInputFormat extends TextInputFormat {
        public RecordReader createRecordReader(InputSplit inputSplit, JobConf jobConf)
                throws IOException {
            System.out.println("I am in createRecordReader");
            // MyRecordReader is supposed to handle the record boundary
            return new MyRecordReader((FileSplit) inputSplit, jobConf);
        }
    }

For each mapper, it prints "I am in getInputFormat function" but not "I am in createRecordReader". I am wondering if anyone can provide a hint on how to hook up my custom MyInputFormat to Pig's UDF loader. Many thanks. I am using Pig on Amazon EMR.

Retrieving multiple versions through the HBase API

- by sammy

Hello, this is a continuation of my previous question, where I used the HBase shell: http://stackoverflow.com/questions/3024417/facing-problems-while-updating-rows-in-hbase

I tried the same with the API. I'm not able to figure out how to retrieve all versions, iterate over them, and print their values for a specific row. I've spent hours reading. Please help me out.

    Scan s = new Scan(Bytes.toBytes("row1"));
    s.addColumn(Bytes.toBytes("column"), Bytes.toBytes("address"));
    // SETTING RANGE FOR THE VERSIONS
    s.setTimeRange(0L, 6L);
    ResultScanner scanner = table.getScanner(s);
    for (Result r : scanner) {
        for (KeyValue kv : r.sorted()) {
            System.out.println("To" + kv.getTimestamp());
            System.out.println("from " + Bytes.toString(kv.getKey()));
            System.out.println("To " + Bytes.toString(kv.getValue()));
        }
        scanner.close();
    }

Here I'm intending to print all versions of the column, but it gives only the most recent one. I'm stuck here.

Reducer getting fewer records than expected

- by sathishs

We have a scenario of generating a unique key for every single row in a file. We have a timestamp column, but in a few cases there are multiple rows for the same timestamp. We decided the unique values would be the timestamp with its respective count appended, as in the program below. The mapper just emits the timestamp as the key and the entire row as its value, and the key is generated in the reducer. The problem is that the map outputs about 236 rows, of which only 230 records are fed as input to the reducer, which outputs the same 230 records.

    public class UniqueKeyGenerator extends Configured implements Tool {

        private static final String SEPERATOR = "\t";
        private static final int TIME_INDEX = 10;
        private static final String COUNT_FORMAT_DIGITS = "%010d";

        public static class Map extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text row, Context context)
                    throws IOException, InterruptedException {
                String input = row.toString();
                String[] vals = input.split(SEPERATOR);
                if (vals != null && vals.length >= TIME_INDEX) {
                    context.write(new Text(vals[TIME_INDEX - 1]), row);
                }
            }
        }

        public static class Reduce extends Reducer<Text, Text, NullWritable, Text> {
            @Override
            protected void reduce(Text eventTimeKey, Iterable<Text> timeGroupedRows,
                    Context context) throws IOException, InterruptedException {
                int cnt = 1;
                final String eventTime = eventTimeKey.toString();
                for (Text val : timeGroupedRows) {
                    final String res = SEPERATOR.concat(getDate(
                            Long.valueOf(eventTime)).concat(
                            String.format(COUNT_FORMAT_DIGITS, cnt)));
                    val.append(res.getBytes(), 0, res.length());
                    cnt++;
                    context.write(NullWritable.get(), val);
                }
            }
        }

        public static String getDate(long time) {
            SimpleDateFormat utcSdf = new SimpleDateFormat("yyyyMMddhhmmss");
            utcSdf.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
            return utcSdf.format(new Date(time));
        }

        public int run(String[] args) throws Exception {
            conf(args);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            conf(args);
        }

        private static void conf(String[] args)
                throws IOException, InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "uniquekeygen");
            job.setJarByClass(UniqueKeyGenerator.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            // job.setNumReduceTasks(400);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }

The behaviour is consistent for larger numbers of lines, and the difference is as large as 208969 records for an input of 20855982 lines. What might be the reason for the reducer receiving fewer input records?
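The intended key scheme (group rows by timestamp, then append the formatted timestamp plus a zero-padded per-group counter) can be checked on a small sample outside Hadoop. The sketch below is a single-process illustration and makes a few assumptions that differ from the Java code: timestamps are taken to be epoch milliseconds, formatting uses UTC instead of America/Los_Angeles to stay dependency-free, and the hour is formatted on a 24-hour clock. In Java's SimpleDateFormat the pattern letters "hh" mean a 12-hour clock, so morning and afternoon timestamps can format to the same string; that would not explain the record-count difference, but it can produce duplicate keys.

```python
from collections import defaultdict
from datetime import datetime, timezone


def unique_keys(rows, time_index=9, sep="\t"):
    """Append '<yyyymmddHHMMSS><10-digit counter>' to each row, grouped by timestamp."""
    groups = defaultdict(list)
    for row in rows:
        fields = row.split(sep)
        if len(fields) > time_index:  # same guard as vals.length >= TIME_INDEX
            groups[fields[time_index]].append(row)

    out = []
    for ts in sorted(groups):
        # Assumption: the column holds epoch milliseconds.
        stamp = datetime.fromtimestamp(int(ts) / 1000, tz=timezone.utc)
        prefix = stamp.strftime("%Y%m%d%H%M%S")
        for n, row in enumerate(groups[ts], start=1):
            out.append(f"{row}{sep}{prefix}{n:010d}")
    return out


def make_row(ts, label):
    """Hypothetical 10-column row with the timestamp in the last column."""
    return "\t".join([label] + ["x"] * 8 + [str(ts)])


rows = [make_row(0, "a"), make_row(0, "b"), make_row(1000, "c")]
keyed = unique_keys(rows)
```

Comparing the number of rows that pass the column-count guard with the number of output rows on such a sample is a quick way to see where records are being dropped.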

HBase as web app backend

- by NathanD

Can anyone advise if it is a good idea to have HBase as the primary data source for a web-based application? My primary concern is HBase's response time to queries. Is it possible to have sub-second response?

Edit: more details about the app itself.

- Amount of data: ~500 GB of text data, expected to reach 1 TB soon
- Number of concurrent users using the app: up to 50
- The app will be used to present reports about data stored in HBase, like how many times keyword "X" occurred in the last 24h.
- For ~80% of requests from that app I will know the exact key; 20% will be scans (I'm looking into HBase schema design related topics to make it run fast)
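For a report like "how many times did keyword X occur in the last 24h", one schema idea is to pre-aggregate counts into hourly buckets keyed by keyword plus hour, so that a report needs only 24 exact-key reads rather than a scan over raw text. The sketch below models that with an in-memory dictionary. The row key format, the class, and its method names are assumptions for illustration, not HBase API calls, and "last 24h" is approximated as the current hour bucket plus the 23 before it.

```python
from collections import defaultdict

HOUR = 3600  # seconds per bucket


class HourlyCounters:
    """Toy model of pre-aggregated counters keyed by 'keyword|hour-bucket'."""

    def __init__(self):
        self.cells = defaultdict(int)  # row key -> count

    @staticmethod
    def row_key(keyword, epoch_seconds):
        # Zero-padded bucket number keeps keys for one keyword in time order.
        return f"{keyword}|{epoch_seconds // HOUR:010d}"

    def record(self, keyword, epoch_seconds, n=1):
        self.cells[self.row_key(keyword, epoch_seconds)] += n

    def count_last_24h(self, keyword, now):
        current = now // HOUR
        return sum(
            self.cells.get(f"{keyword}|{bucket:010d}", 0)
            for bucket in range(current - 23, current + 1)
        )


now = 100 * HOUR
counters = HourlyCounters()
counters.record("X", now - 10)            # within the last hour
counters.record("X", now - 5 * HOUR, 2)   # five hours ago, two occurrences
counters.record("X", now - 30 * HOUR)     # older than 24h, ignored by the report
counters.record("Y", now)
```

Whether sub-second responses are achievable then depends mostly on keeping requests to a handful of exact-key reads like these, which matches the 80% known-key case described above.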

