Search Results

Search found 7 results on 1 page for 'flume'.

  • On Ubuntu get: "-bash: ./flume No such file or directory" BUT flume is there and executable. Same binary OK on RHEL

    - by lcbrevard
    This is already posted on Server Fault, and may be more appropriate there. Reworked a bit from the original posting.

    We have a product built on CentOS 4 32-bit Linux that runs unmodified on 32- and 64-bit CentOS/RHEL 4 and 5 and SLES 10. It also runs unmodified on SLES 9 64-bit. [SLES 9 32-bit requires a different libstdc++.] The name of the main binary executable is 'flume'.

    Yesterday we tried to put this on 64-bit Ubuntu 10 and, even though the file is there and the right size, we get:

    -bash: ./flume: No such file or directory

    'file flume' shows it to be a 32-bit ELF (can't remember the exact output and the system is on an isolated network). If put into /usr/local/bin, then 'which flume' returns:

    /usr/local/bin/flume

    The file is marked as executable (did 'chmod +x flume') and lsattr shows no problems with attribute bits. I was not able to try 'ldd flume' yet. I have also not tried 'strace flume'. Currently I am dealing with an air conditioning failure. [It's been that kind of week!]

    I now suspect that some library is not there. This is a profoundly unhelpful message and one I have never seen before. Is this peculiar to Ubuntu, or perhaps just to this installation? We gave up and moved to a RHEL 4 system and everything is fine. But I sure would like to know what causes this.
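The suspicion about a missing library can be checked even before 'ldd' or 'strace' are available. On Linux, one common cause of "No such file or directory" for a file that clearly exists is that the dynamic loader named inside the binary (for 32-bit x86 programs usually /lib/ld-linux.so.2) is not installed, which can happen on a 64-bit system without 32-bit compatibility libraries. Whether that is the cause here is not confirmed by the post. The Python sketch below reads the ELF header and reports the requested loader; the function names are illustrative, not part of any existing tool.

```python
import os
import struct

PT_INTERP = 3  # program-header type that names the dynamic loader


def elf_interpreter(data):
    """Return (bits, loader_path_or_None) for an ELF image given as bytes."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}.get(data[4])
    endian = {1: "<", 2: ">"}.get(data[5])
    if bits is None or endian is None:
        raise ValueError("unrecognized ELF class or byte order")

    # Field offsets differ between the 32-bit and 64-bit ELF layouts.
    if bits == 32:
        (e_phoff,) = struct.unpack_from(endian + "I", data, 28)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 42)
    else:
        (e_phoff,) = struct.unpack_from(endian + "Q", data, 32)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 54)

    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        (p_type,) = struct.unpack_from(endian + "I", data, base)
        if p_type != PT_INTERP:
            continue
        if bits == 32:
            (p_offset,) = struct.unpack_from(endian + "I", data, base + 4)
            (p_filesz,) = struct.unpack_from(endian + "I", data, base + 16)
        else:
            (p_offset,) = struct.unpack_from(endian + "Q", data, base + 8)
            (p_filesz,) = struct.unpack_from(endian + "Q", data, base + 32)
        raw = data[p_offset:p_offset + p_filesz]
        return bits, raw.rstrip(b"\0").decode("ascii", "replace")

    return bits, None  # statically linked: no loader requested


def loader_is_missing(path):
    """True if the binary at `path` asks for a loader that does not exist."""
    with open(path, "rb") as f:
        _, loader = elf_interpreter(f.read())
    return loader is not None and not os.path.exists(loader)
```

Calling `loader_is_missing("./flume")` on the Ubuntu machine would return True if the 32-bit loader is absent, which would point toward installing the distribution's 32-bit compatibility libraries.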

  • simple and reliable centralized logging inside Amazon VPC

    - by Nakedible
    I need to set up centralized logging for a set of servers (10-20) in an Amazon VPC. The logging should be set up so as not to lose any log messages if any single server goes offline, or if an entire availability zone goes offline. It should also tolerate packet loss and other normal network conditions without losing or duplicating messages. It should store the messages durably, at the minimum on two different EBS volumes in two availability zones, but S3 is a good place as well. It should also be realtime, so that the messages arrive within seconds of their generation in two different availability zones. I also need to sync logfiles not generated via syslog, so a syslog-only centralized logging solution would not fulfill all the needs, although I guess that limitation could be worked around.

    I have already reviewed a few solutions, and I will list them here:

    - Flume to Flume to S3: I could set up two logservers as Flume hosts which would store log messages either locally or in S3, and configure all the servers with Flume to send all messages to both servers, using the end-to-end reliability options. That way the loss of a single server shouldn't cause lost messages and all messages would arrive in two availability zones in realtime. However, there would need to be some way to join the logs of the two servers, deduplicating all the messages delivered to both. This could be done by adding a unique id on the sending side to each message and then writing some manual deduplication runs on the logfiles. I haven't found an easy solution to the duplication problem.

    - Logstash to Logstash to ElasticSearch: I could install Logstash on the servers and have them deliver to a central server via AMQP, with the durability options turned on. However, for this to work I would need to use one of the clustering-capable AMQP implementations, or fan out the delivery just as in the Flume case. AMQP seems to be yet another moving part, with several implementations and no real guidance on what works best for this sort of setup. And I'm not entirely convinced that I could get actual end-to-end durability from Logstash to ElasticSearch, assuming crashing servers in between. The fan-out solutions run into the deduplication problem again. The best solution, one that would seem to handle all the cases, would be Beetle, which seems to provide high availability and deduplication via a Redis store. However, I haven't seen any guidance on how to set this up with Logstash, and Redis is one more moving part for something that shouldn't be terribly difficult.

    - Logstash to ElasticSearch: I could run Logstash on all the servers, have all the filtering and processing rules in the servers themselves and just have them log directly to a remote ElasticSearch server. I think this should bring me reliable logging, and I can use the ElasticSearch clustering features to share the database transparently. However, I am not sure if the setup actually survives Logstash restarts and intermittent network problems without duplicating messages in a failover case or similar. But this approach sounds pretty promising.

    - rsync: I could just rsync all the relevant log files to two different servers. The reliability aspect should be perfect here, as the files should be identical to the source files after a sync is done. However, doing an rsync several times per second doesn't sound fun. Also, I need the logs to be untamperable after they have been sent, so the rsyncs would need to be in append-only mode. And log rotations mess things up unless I'm careful.

    - rsyslog with RELP: I could set up rsyslog to send messages to two remote hosts via RELP and have a local queue to store the messages. There is the deduplication problem again, and RELP itself might also duplicate some messages. However, this would only handle the things that log via syslog.

    None of these solutions seem terribly good, and they still have many unknowns, so I am asking for more information here from people who have set up centralized reliable logging as to what the best tools are to achieve that goal.
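The deduplication idea mentioned for the fan-out setups (a unique id added on the sending side, then a merge of the two servers' logs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a feature of Flume, Logstash, or rsyslog: it assumes each line starts with a sender-assigned id such as `host:sequence`, followed by a tab and the message, and the function name is hypothetical.

```python
def merge_deduplicated(*streams):
    """Yield each message once across several iterables of log lines.

    Assumes every line has the form "<unique-id>\t<message>", where the
    id was attached by the sender (a hypothetical format for this sketch).
    """
    seen = set()
    for stream in streams:
        for line in stream:
            line = line.rstrip("\n")
            if not line:
                continue
            msg_id, _, _ = line.partition("\t")
            if msg_id in seen:
                continue  # already received via the other log server
            seen.add(msg_id)
            yield line


# Two log servers that each received an overlapping subset of messages.
server_a = ["web1:1\tstarted", "web1:2\tlistening", "web2:1\tstarted"]
server_b = ["web1:2\tlistening", "web2:1\tstarted", "web2:2\tready"]
merged = list(merge_deduplicated(server_a, server_b))
```

The set of seen ids grows with the number of messages, so a real deduplication run would probably need to work on bounded time windows or on sorted input; that choice is left open here.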

  • Hadoop Rolling Small files

    - by Arenstar
    I am running Hadoop on a project and need a suggestion. By default Hadoop has a "block size" of around 64 MB. There is also a suggestion not to use many small files. I currently have very, very small files being put into HDFS due to the application design of Flume. The problem is that Hadoop <= 0.20 cannot append to files, so I have too many files for my map-reduce to function efficiently. There must be a correct way to simply roll/merge roughly 100 files into one, so that Hadoop is effectively reading 1 large file instead of 10. Any suggestions?
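The roll/merge step asked about above can be sketched as plain concatenation in batches. The Python below is a minimal illustration that works on a local directory as a stand-in for HDFS; the function name and the `merged-NNNNN` output naming are assumptions, and it presumes the small files hold newline-terminated records so that joining them keeps record boundaries intact.

```python
import shutil
from pathlib import Path


def roll_files(src_dir, dst_dir, batch_size=100):
    """Concatenate small files from src_dir into larger files in dst_dir.

    batch_size=100 mirrors the "roughly 100 files into one" figure from
    the post. Returns the list of merged files that were written.
    """
    sources = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    merged = []
    for start in range(0, len(sources), batch_size):
        out_path = dst / f"merged-{start // batch_size:05d}"
        with open(out_path, "wb") as out:
            # Append each small file in name order to the current output.
            for src in sources[start:start + batch_size]:
                with open(src, "rb") as f:
                    shutil.copyfileobj(f, out)
        merged.append(out_path)
    return merged
```

On a real cluster the same batching would have to run against HDFS paths, for example as a periodic job that writes the merged files to a separate directory used as map-reduce input.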

  • Cloudera Hadoop Certification Value in IT Industry for freshers

    - by Saumitra
    I am a software developer with 8 months of experience in the IT industry, working on the development of tools for BIG DATA analytics. I have learned Hadoop basics on my own and I am pretty comfortable with writing MapReduce Jobs, PIG, HIVE, Flume and other related projects. I am thinking of appearing for the Cloudera Hadoop Certification. My question is whether it will benefit me in any way, considering that I am a fresher with not even 1 year of experience. Most of the job postings I have seen related to Hadoop require at least 3 years of experience. I currently work in India but I can relocate. Please help me decide whether I should invest my time in perfecting my Hadoop skills for certification.

  • What is the value of the Cloudera Hadoop Certification for people new to the IT industry?

    - by Saumitra
    I am a software developer with 8 months of experience in the IT industry, currently working on the development of tools for BIG DATA analytics. I have learned Hadoop basics on my own and I am pretty comfortable with writing MapReduce Jobs, PIG, HIVE, Flume and other related projects. I am thinking of taking the exam for the Cloudera Hadoop Certification. Will this certification add value, considering that I have less than 1 year of experience? Many of the jobs I've seen relating to Hadoop require at least 3 years of experience. Should I invest more time in learning Hadoop and improving my skills to take this certification?

  • Windows Azure Recipe: Big Data

    - by Clint Edmonson
    As the name implies, what we’re talking about here is the explosion of electronic data that comes from huge volumes of transactions, devices, and sensors being captured by businesses today. This data often comes in unstructured formats and/or too fast for us to effectively process in real time. Collectively, we call these the 4 big data V’s: Volume, Velocity, Variety, and Variability. These qualities make this type of data best managed by NoSQL systems like Hadoop, rather than by a conventional Relational Database Management System (RDBMS). We know that there are patterns hidden inside this data that might provide competitive insight into market trends. The key is knowing when and how to leverage these “No SQL” tools combined with traditional business tools such as SQL-based relational databases and warehouses and other business intelligence tools.

    Drivers
    - Petabyte scale data collection and storage
    - Business intelligence and insight

    Solution
    The sketch below shows one of many big data solutions using Hadoop’s unique highly scalable storage and parallel processing capabilities combined with Microsoft Office’s Business Intelligence Components to access the data in the cluster.

    Ingredients
    - Hadoop: this big data industry heavyweight provides both large scale data storage infrastructure and a highly parallelized map-reduce processing engine to crunch through the data efficiently. Here are the key pieces of the environment:
      - Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
      - Mahout: a machine learning library with algorithms for clustering, classification and batch based collaborative filtering that are implemented on top of Apache Hadoop using the map/reduce paradigm.
      - Hive: data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage. Directly accessible to Microsoft Office and other consumers via add-ins and the Hive ODBC data driver.
      - Pegasus: a Peta-scale graph mining system that runs in a parallel, distributed manner on top of Hadoop and that provides algorithms for important graph mining tasks such as Degree, PageRank, Random Walk with Restart (RWR), Radius, and Connected Components.
      - Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
      - Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data to HDFS.
    - Database: directly accessible to Hadoop via the Sqoop based Microsoft SQL Server Connector for Apache Hadoop, data can be efficiently transferred to traditional relational data stores for replication, reporting, or other needs.
    - Reporting: provides easily consumable reporting when combined with a database being fed from the Hadoop environment.

    Training
    These links point to online Windows Azure training labs where you can learn more about the individual ingredients described above.
    - Hadoop Learning Resources (20+ tutorials and labs): Huge collection of resources for learning about all aspects of Apache Hadoop-based development on Windows Azure and the Hadoop and Windows Azure Ecosystems.
    - SQL Azure (7 labs): Microsoft SQL Azure delivers on the Microsoft Data Platform vision of extending the SQL Server capabilities to the cloud as web-based services, enabling you to store structured, semi-structured, and unstructured data.

    See my Windows Azure Resource Guide for more guidance on how to get started, including links to web portals, training kits, samples, and blogs related to Windows Azure.

  • Oracle Big Data Software Downloads

    - by Mike.Hallett(at)Oracle-BI&EPM
    Companies have been making business decisions for decades based on transactional data stored in relational databases. Beyond that critical data is a potential treasure trove of less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information. Oracle offers a broad integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships.

    Oracle Big Data Connectors
    Downloads here, includes:
    - Oracle SQL Connector for Hadoop Distributed File System Release 2.1.0
    - Oracle Loader for Hadoop Release 2.1.0
    - Oracle Data Integrator Companion 11g
    - Oracle R Connector for Hadoop v 2.1

    Oracle Big Data Documentation
    The Oracle Big Data solution offers an integrated portfolio of products to help you organize and analyze your diverse data sources alongside your existing data to find new insights and capitalize on hidden relationships.
    - Oracle Big Data, Release 2.2.0 - E41604_01 zip (27.4 MB)
    - Integrated Software and Big Data Connectors User's Guide (HTML, PDF)

    Oracle Data Integrator (ODI) Application Adapter for Hadoop
    Apache Hadoop is designed to handle and process data that is typically from data sources that are non-relational and data volumes that are beyond what is handled by relational databases. Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Designing and implementing a MapReduce job usually requires expert programming knowledge. However, when you use Oracle Data Integrator with the Application Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs. Employing familiar and easy-to-use tools and pre-configured knowledge modules (KMs), the application adapter provides the following capabilities:
    - Loading data into Hadoop from the local file system and HDFS
    - Performing validation and transformation of data within Hadoop
    - Loading processed data from Hadoop to an Oracle database for further processing and generating reports

    Oracle Database Loader for Hadoop
    Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. It pre-partitions the data if necessary and transforms it into a database-ready format. Oracle Loader for Hadoop is a Java MapReduce application that balances the data across reducers to help maximize performance.

    Oracle R Connector for Hadoop
    Oracle R Connector for Hadoop is a collection of R packages that provide:
    - Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables
    - Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, that can be applied to data in HDFS files
    You install and load this package as you would any other R package. Using simple R functions, you can perform tasks such as:
    - Access and transform HDFS data using a Hive-enabled transparency layer
    - Use the R language for writing mappers and reducers
    - Copy data between R memory, the local file system, HDFS, Hive, and Oracle databases
    - Schedule R programs to execute as Hadoop MapReduce jobs and return the results to any of those locations

    Oracle SQL Connector for Hadoop Distributed File System
    Using Oracle SQL Connector for HDFS, you can use an Oracle Database to access and analyze data residing in Hadoop in these formats:
    - Data Pump files in HDFS
    - Delimited text files in HDFS
    - Hive tables
    For other file formats, such as JSON files, you can stage the input in Hive tables before using Oracle SQL Connector for HDFS. Oracle SQL Connector for HDFS uses external tables to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS.

    Related Documentation
    - Cloudera's Distribution Including Apache Hadoop Library (HTML)
    - Oracle R Enterprise (HTML)
    - Oracle NoSQL Database (HTML)

    Recent Blog Posts
    - Big Data Appliance vs. DIY Price Comparison
    - Big Data: Architecture Overview
    - Big Data: Achieve the Impossible in Real-Time
    - Big Data: Vertical Behavioral Analytics
    - Big Data: In-Memory MapReduce
    - Flume and Hive for Log Analytics
    - Building Workflows in Oozie
