Search Results

Search found 22 results on 1 page for 'stopwords'.

  • Making profit - Adsense contains too many stopwords

    - by Jack
    I was thinking of using Adsense, but then I read about the stopwords policy... Too many words are banned: "a**, s**t, id**t, a****le, bu****it," etc. That generally means that I cannot use Adsense unless I edit my posts. How else would I go about making some profit from my site? I don't want to use things like popups or text-link ads, I can't post many shop links, and my site is too small to sell ad space. For specific reasons, I also don't do videos and am not planning on starting a forum, premium content, or anything close to that. The reason for this post is basically that I've seen huge sites without any ads, and I started to wonder: how do they make money? That was Gizmodo, to be precise. Some info about my site: it's a blog where I review games and post news. There is no forum and no registration.

    Read the article

  • Removing stopwords, but should return as a line

    - by Sarath R Nair
    My question may seem silly, but I am a rookie in Python, so please help me out. I have to pass a line to a stopword removal function. It works fine, but my problem is that the function returns the words appended to a list. I want it as follows: line = " I am feeling good , but I cant talk". Let "I", "but", and "cant" be the stopwords. After passing the line to the function, my output should be "am feeling good , talk". What I am getting now is [['am','feeling','good','talk']]. Please help.

    Read the article
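
A minimal sketch of one way to get a line back instead of a nested list, assuming the function keeps the surviving tokens in a list and that stopwords should match whole tokens case-insensitively. The name `remove_stopwords` is hypothetical, not the asker's function.

```python
def remove_stopwords(line, stopwords):
    """Return `line` with stopword tokens dropped, as a single string."""
    # Assumption: matching is case-insensitive and on whole tokens only.
    stop = {w.lower() for w in stopwords}
    kept = [tok for tok in line.split() if tok.lower() not in stop]
    # Joining the surviving tokens turns the list back into one line.
    return " ".join(kept)


line = " I am feeling good , but I cant talk"
print(remove_stopwords(line, ["I", "but", "cant"]))  # am feeling good , talk
```

The key step is `" ".join(kept)`: the filtering can stay as it is, and only the return value changes from a list to a string.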

  • How to remove list of words from strings

    - by zeljko
    What I would like to do (in Clojure): for example, I have a vector of words that need to be removed:

        (def forbidden-words [":)" "the" "." "," " " ...many more...])

    ... and a vector of strings:

        (def strings ["the movie list" "this.is.a.string" "haha :)" ...many more...])

    So, each forbidden word should be removed from each string, and the result, in this case, would be: ["movie list" "thisisastring" "haha"]. How can I do this?

    Read the article
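
The question asks for Clojure; the sketch below only illustrates the idea in Python and is not a Clojure answer. It assumes plain substring removal followed by whitespace tidying. The `" "` entry is deliberately left out of the list, because removing every space would turn "movie list" into "movielist", which contradicts the expected result. The name `strip_forbidden` is hypothetical.

```python
def strip_forbidden(strings, forbidden_words):
    """Remove every forbidden substring from each string, then tidy whitespace."""
    # Longest entries first, so a longer entry is not broken up by a shorter one.
    ordered = sorted(forbidden_words, key=len, reverse=True)
    cleaned = []
    for s in strings:
        for word in ordered:
            s = s.replace(word, "")
        # Collapse the gaps left behind and trim both ends.
        cleaned.append(" ".join(s.split()))
    return cleaned


forbidden_words = [":)", "the", ".", ","]  # " " left out on purpose (assumption)
strings = ["the movie list", "this.is.a.string", "haha :)"]
print(strip_forbidden(strings, forbidden_words))
```

Note that substring removal also strips "the" from inside longer words such as "other"; matching on word boundaries would be needed if that is not wanted.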

  • How to get the top keys from a hash by value

    - by Kirs Kringle
    I have a hash that I sorted by values, greatest to least. How would I go about getting the top 5? There was a post on here that talked about getting only one value: "What is the easiest way to get a key with the highest value from a hash in Perl?" As I understand it, I would take that value, add it to an array, delete the element from the hash, and then repeat the process? It seems like there should be an easier way to do this than that, though. My hash is called %words.

        use strict;
        use warnings;
        use Tk; #Learn to install here: http://factscruncher.blogspot.com/2012/01/easy-way-to-install-tk-on-strawberry.html

        #Reading in the text file
        my $file0 = Tk::MainWindow->new->Tk::getOpenFile;
        open( my $filehandle0, '<', $file0 ) || die "Could not open $file0\n";
        my @words;
        while ( my $line = <$filehandle0> ) {
            chomp $line;
            my @word = split( /\s+/, lc($line));
            push( @words, @word );
        }
        for (@words) {
            s/[\,|\.|\!|\?|\:|\;|\"]//g;
        }

        #Counting words that repeat; put in hash
        my %words_count;
        $words_count{$_}++ for @words;

        #Reading in the stopwords file
        my $file1 = "stoplist.txt";
        open( my $filehandle1, '<', $file1 ) or die "Could not open $file1\n";
        my @stopwords;
        while ( my $line = <$filehandle1> ) {
            chomp $line;
            my @linearray = split( " ", $line );
            push( @stopwords, @linearray );
        }
        for my $w ( my @stopwords ) {
            s/\b\Q$w\E\B//ig;
        }

        #Comparing the array to Hash and deleting stopwords
        my %words = %words_count;
        for my $stopwords ( @stopwords ) {
            delete $words{ $stopwords };
        }

        #Sorting Hash Table
        my @keys = sort { $words{$b} <=> $words{$a} or "\L$a" cmp "\L$b" } keys %words;

        #Starting Statistical Work
        my $value_count = 0;
        my $key_count = 0;

        #Printing Hash Table
        $key_count = keys %words;
        foreach my $key (@keys) {
            $value_count = $words{$key} + $value_count;
            printf "%-20s %6d\n", $key, $words{$key};
        }
        my $value_average = $value_count / $key_count;

        #my @topwords;
        #foreach my $key (@keys){
        #if($words{$key} > $value_average){
        # @topwords = keys %words;
        # }
        #}

        print "\n", "The number of values: ", $value_count, "\n";
        print "The number of elements: ", $key_count, "\n";
        print "The Average: ", $value_average, "\n\n";

    Read the article
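
Once the keys are sorted by value, the top five are simply the first five entries of that list, so no repeated find-and-delete pass is needed. The sketch below shows the same idea in Python rather than Perl, using the ordering from the post (count descending, then case-insensitive alphabetical). The name `top_keys` and the sample text are hypothetical.

```python
from collections import Counter


def top_keys(counts, n=5):
    """Return the n keys with the highest counts; ties are broken alphabetically."""
    ordered = sorted(counts, key=lambda w: (-counts[w], w.lower()))
    # Slicing the sorted list yields the top n without touching the mapping.
    return ordered[:n]


words = "the cat sat on the mat the cat ran".split()
stopwords = {"the", "on"}  # illustrative stoplist, not the asker's file
counts = Counter(w for w in words if w not in stopwords)
print(top_keys(counts, 2))  # ['cat', 'mat']
```

In the Perl script, the equivalent step would be taking the first five elements of the already sorted @keys array.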

  • What is a good stopword in full text indexation?

    - by Benoit
    When you go to the Appendix D in Oracle Text Reference they provide lists of stopwords used by Oracle Text when indexing table contents. When I see the English list, nothing puzzles me. But the reason why the French list includes moyennant (French for in view of which) for example is unclear. Oracle has probably thought it through more than once before including it. How would you constitute a list of appropriate stopwords if you were to design an indexer?

    Read the article

  • Removing words from a file

    - by user1765792
    I'm trying to take a regular text file and remove the words listed in a separate file (stopwords), which contains the words to be removed separated by newlines ("\n"). Right now I'm converting both files into lists so that the elements of each list can be compared. I got this function to work, but it doesn't remove all of the words I have specified in the stopwords file. Any help is greatly appreciated.

        def elimstops(file_str):  # takes as input a string for the stopwords file location
            stop_f = open(file_str, 'r')
            stopw = stop_f.read()
            stopw = stopw.split('\n')
            text_file = open('sample.txt')  # Opens the file whose stop words will be eliminated
            prime = text_file.read()
            prime = prime.split(' ')  # Splits the string into a list separated by a space
            tot_str = ""  # total string
            i = 0
            while i < (len(stopw)):
                if stopw[i] in prime:
                    prime.remove(stopw[i])  # removes the stopword from the text
                else:
                    pass
                i += 1
            # Creates a new string from the compilation of list elements
            # with the stop words removed
            for v in prime:
                tot_str = tot_str + str(v) + " "
            return tot_str

    Read the article
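
One likely cause of the leftover words is that `list.remove` deletes only the first matching element, so a stopword that appears several times survives after the first hit. A minimal sketch of an alternative follows, assuming whole-word, case-insensitive matching. It takes the two texts as strings to stay self-contained (reading them from files is left to the caller), and the name `remove_stopwords_from_text` is hypothetical.

```python
def remove_stopwords_from_text(text, stopword_text):
    """Drop every occurrence of each stopword (one per line) from `text`."""
    # A set gives fast membership tests; blank lines in the stopword text are skipped.
    stopwords = {w.strip().lower() for w in stopword_text.splitlines() if w.strip()}
    # split() with no argument splits on any whitespace, including newlines,
    # so words at line ends are not left glued to a "\n".
    kept = [w for w in text.split() if w.lower() not in stopwords]
    return " ".join(kept)


sample = "the cat and the dog and the bird"
stops = "the\nand\n"
print(remove_stopwords_from_text(sample, stops))  # cat dog bird
```

Filtering with a comprehension visits every word once, so repeated stopwords are all removed in a single pass.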

  • Filter items from a feed using words in a text file, with Yahoo Pipes

    - by pufferfish
    I have a pipe that filters an RSS feed and removes any item that contains "stopwords" that I've chosen. Currently I've manually created a filter for each stopword in the pipe editor, but the more logical way is to read these from a file. I've figured out how to read the stopwords out of the text file, but how do I apply the filter operator to the feed, once for every stopword? The documentation states explicitly that operators can't be applied within the loop construct, but hopefully I'm missing something here.

    Read the article

  • Django sphinx works only after app restart.

    - by Lhiash
    Hi, I've set up django-sphinx in my project, and it works perfectly, but only for some time. After that it always returns an empty result set. Surprisingly, restarting the Django app fixes it, and search works again, but again only for a short time (or a very limited number of queries). Here's my sphinx.conf:

        source src_questions {
            # data source
            type = mysql
            sql_host = xxxxxx
            sql_user = xxxxxx #replace with your db username
            sql_pass = xxxxxx #replace with your db password
            sql_db = xxxxxx #replace with your db name
            # these two are optional
            sql_port = xxxxxx
            #sql_sock = /var/lib/mysql/mysql.sock
            # pre-query, executed before the main fetch query
            sql_query_pre = SET NAMES utf8
            # main document fetch query
            sql_query = SELECT q.id AS id, q.title AS title, q.tagnames AS tags, q.html AS text, q.level AS level \
                FROM question AS q \
                WHERE q.deleted=0 \
            # optional - used by command-line search utility to display document information
            sql_query_info = SELECT title, id, level FROM question WHERE id=$id
            sql_attr_uint = level
        }

        index questions {
            # which document source to index
            source = src_questions
            # this is path and index file name without extension
            # you may need to change this path or create this folder
            path = /home/rafal/core_index/index_questions
            # docinfo (ie. per-document attribute values) storage strategy
            docinfo = extern
            # morphology
            morphology = stem_en
            # stopwords file
            #stopwords = /var/data/sphinx/stopwords.txt
            # minimum word length
            min_word_len = 3
            # uncomment next 2 lines to allow wildcard (*) searches
            min_infix_len = 1
            enable_star = 1
            # charset encoding type
            charset_type = utf-8
        }

        # indexer settings
        indexer {
            # memory limit (default is 32M)
            mem_limit = 64M
        }

        # searchd settings
        searchd {
            # IP address on which search daemon will bind and accept
            # optional, default is to listen on all addresses,
            # ie. address = 0.0.0.0
            address = 127.0.0.1
            # port on which search daemon will listen
            port = 3312
            # searchd run info is logged here - create or change the folder
            log = ../log/sphinx.log
            # all the search queries are logged here
            query_log = ../log/query.log
            # client read timeout, seconds
            read_timeout = 5
            # maximum amount of children to fork
            max_children = 30
            # a file which will contain searchd process ID
            pid_file = searchd.pid
            # maximum amount of matches this daemon would ever retrieve
            # from each index and serve to client
            max_matches = 1000
        }

    And here's the search part from my views.py:

        content = Question.search.query(keywords)
        if level:
            content = content.filter(level=level)  # level is array of integers

    There are no errors in any logs; it just isn't returning any results. All help would be most appreciated.

    Read the article

  • Lucene stop words not removed during searching need a substitute for AnalyzingQueryParser

    - by iamrohitbanga
    I have created a Lucene index with the following analyzer.

        public class DocSpecAnalyzer extends Analyzer {
            private static CharArraySet stopSet;// = new HashSet<String>(Arrays.asList());//STOP_WORDS_SET;
            static {
                stopSet = new CharArraySet(FDConstants.stopwords, true);
                // uncommenting this displays all the stop words
                // for (String s: FDConstants.stopwords) {
                //     System.out.println(s);
                // }
            }

            /**
             * Specifies whether deprecated acronyms should be replaced with HOST type.
             * See {@linkplain https://issues.apache.org/jira/browse/LUCENE-1068}
             */
            private final boolean enableStopPositionIncrements;
            private final Version matchVersion;

            public DocSpecAnalyzer(Version matchVersion) {
                this.matchVersion = matchVersion;
                enableStopPositionIncrements = StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion);
            }

            public TokenStream tokenStream(String fieldName, Reader reader) {
                StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
                tokenStream.setMaxTokenLength(DEFAULT_MAX_TOKEN_LENGTH);
                TokenStream result = new StandardFilter(tokenStream);
                result = new LowerCaseFilter(result);
                result = new StopFilter(enableStopPositionIncrements, result, stopSet);
                result = new PorterStemFilter(result);
                return result;
            }

            /** Default maximum allowed token length */
            public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
        }

    Now when I search for documents with a query containing stop words, I get hits for the stop words as well. It is because of http://lucene.apache.org/java/2_9_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html not handling stop words. Is there a substitute?

    Update: I forgot to mention that I need to do a fuzzy search. That is why I am using an AnalyzingQueryParser.

    Update: the portion of code that invokes AnalyzingQueryParser:

        AnalyzingQueryParser parser = new AnalyzingQueryParser(Version.LUCENE_CURRENT,"description", analyzer);
        // fuzzy matching preparation
        String fuzzyStr = TextQuery.prepareFuzzy(tq.text, fuzzyDist);
        Query query = parser.parse(fuzzyStr);
        TopScoreDocCollector collector = TopScoreDocCollector.create(numHits, true);
        searcher.search(query, collector);

    Read the article

  • SOLR - wildcard search with capital letter

    - by Yurish
    I have a problem with SOLR searching. When I search for dog*, everything is OK, but when the query is Dog* (with a capital first letter), I get no results. Any advice? My config:

        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
        </fieldType>

    Read the article

  • Wildcard searching and highlighting with Solr 1.4

    - by andy
    Hey guys, I've got a pretty much vanilla install of SOLR 1.4 apart from a few small config and schema changes.

        <requestHandler name="standard" class="solr.SearchHandler" default="true">
          <!-- default values for query parameters -->
          <lst name="defaults">
            <str name="defType">dismax</str>
            <str name="echoParams">explicit</str>
            <str name="qf"> text </str>
            <str name="spellcheck.dictionary">default</str>
            <str name="spellcheck.onlyMorePopular">false</str>
            <str name="spellcheck.extendedResults">false</str>
            <str name="spellcheck.count">1</str>
          </lst>
        </requestHandler>

    The main field type I'm using for indexing is this:

        <fieldType name="textNoHTML" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <charFilter class="solr.HTMLStripCharFilterFactory" />
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
          </analyzer>
        </fieldType>

    Now, when I perform a search using "q=search+term&hl=on" I get highlighting and nice, accurate scores. BUT, for wildcards, I'm assuming you need to use "q.alt"? Is that true? If so, my query looks like this: "q.alt=search*&hl=on". When I use the above query, highlighting doesn't work, and all the scores are "1.0". What am I doing wrong? Is what I want possible without bypassing some of the really cool SOLR optimizations? Cheers!

    Read the article

  • Exception when indexing text documents with Lucene, using SnowballAnalyzer for cleaning up

    - by Julia
    Hello! I am indexing documents with Lucene and am trying to apply the SnowballAnalyzer for punctuation and stopword removal from the text. I keep getting the following error:

        IllegalAccessError: tried to access method org.apache.lucene.analysis.Tokenizer.<init>(Ljava/io/Reader;)V from class org.apache.lucene.analysis.snowball.SnowballAnalyzer

    Here is the code. I would very much appreciate help! I am new to this.

        public class Indexer {
            private Indexer(){};
            private String[] stopWords = {....};
            private String indexName;
            private IndexWriter iWriter;
            private static String FILES_TO_INDEX = "/Users/ssi/forindexing";

            public static void main(String[] args) throws Exception {
                Indexer m = new Indexer();
                m.index("./newindex");
            }

            public void index(String indexName) throws Exception {
                this.indexName = indexName;
                final File docDir = new File(FILES_TO_INDEX);
                if(!docDir.exists() || !docDir.canRead()){
                    System.err.println("Something wrong... " + docDir.getPath());
                    System.exit(1);
                }
                Date start = new Date();
                PerFieldAnalyzerWrapper analyzers = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
                analyzers.addAnalyzer("text", new SnowballAnalyzer("English", stopWords));
                Directory directory = FSDirectory.open(new File(this.indexName));
                IndexWriter.MaxFieldLength maxLength = IndexWriter.MaxFieldLength.UNLIMITED;
                iWriter = new IndexWriter(directory, analyzers, true, maxLength);
                System.out.println("Indexing to dir..........." + indexName);
                if(docDir.isDirectory()){
                    File[] files = docDir.listFiles();
                    if(files != null){
                        for (int i = 0; i < files.length; i++) {
                            try {
                                indexDocument(files[i]);
                            }catch (FileNotFoundException fnfe){
                                fnfe.printStackTrace();
                            }
                        }
                    }
                }
                System.out.println("Optimizing...... ");
                iWriter.optimize();
                iWriter.close();
                Date end = new Date();
                System.out.println("Time to index was" + (end.getTime()-start.getTime()) + "miliseconds");
            }

            private void indexDocument(File someDoc) throws IOException {
                Document doc = new Document();
                Field name = new Field("name", someDoc.getName(), Field.Store.YES, Field.Index.ANALYZED);
                Field text = new Field("text", new FileReader(someDoc), Field.TermVector.WITH_POSITIONS_OFFSETS);
                doc.add(name);
                doc.add(text);
                iWriter.addDocument(doc);
            }
        }

    Read the article

  • ft_stopword_file not picked up

    - by Alex Holsgrove
    I have a VPS with a company called Webfusion. I want to remove some or all of the FULLTEXT stopwords because some specific words need to be searchable in my DB content. I opened /etc/mysql/my.cnf and added the line ft_stopword_file="". I restarted the mysql service, ran a repair table, and then tried my MATCH query with no success. I ran SHOW VARIABLES LIKE 'ft_%' and it simply shows (built-in) next to the stopword file. I am running WAMP on my workstation, and whilst I realise this isn't configured the same as a commercial VPS, the above method worked just fine there. Could someone please offer some guidance?

    Read the article

  • How to write a SQL script to achieve the following

    - by 3nigma
    Hi, I have a table, let's call it "tbl.items", with a column "title". I want to loop through each row, and for each "title" in "tbl.items" I want to do the following. The column has the datatype nvarchar(max) and contains a string. First, filter the string to remove words like in, out, where, etc. (stopwords). Then compare the rest of the string to a predefined list and, if there is a match, perform some action that involves inserting data into other tables as well. The problem is that I'm inexperienced when it comes to writing T-SQL scripts. Please help and guide me on how I can achieve this. Can it be achieved by writing a SQL script, or do I have to develop a console application in C# or another language? I'm using MS SQL Server 2008. Thanks in advance.

    Read the article

  • How can I cluster short messages [Tweets] based on topic ? [Topic Based Clustering]

    - by Jagira
    Hello, I am planning an application which will make clusters of short messages/tweets based on topics. The number of topics will be limited, like Sports [ NBA, NFL, Cricket, Soccer ], Entertainment [ movies, music ] and so on... I can think of two approaches to this:

    1. Ask users to tag messages, as Stack Overflow does with questions. Users can select tags from a predefined list of tags. Then on the server side I will cluster the messages based on their tags.
       Pros: Simple design. Less complexity in code.
       Cons: Choices for users will be restricted. Clusters will not be dynamic. If a new event occurs, the predefined tags will miss it.

    2. Take the message, delete the stopwords [ predefined in a dictionary ], and apply some clustering algorithm to make a cluster; depending on its popularity, display the cluster. The cluster will be maintained according to its sustained popularity. New messages will be skimmed and assigned to the corresponding clusters.
       Pros: Dynamic clustering based on the popularity of the event/accident.
       Cons: Increased complexity. More server resources required.

    I would like to know whether there are any other approaches to this problem. Or are there any ways of improving the above-mentioned methods? Also, please suggest some good clustering algorithms. I think the "K-Nearest Clustering" algorithm is apt for this situation.

    Read the article
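
The second approach (strip stopwords, then cluster) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a greedy single-pass scheme with Jaccard similarity on token sets, which is a swapped-in technique and not the "K-Nearest Clustering" mentioned in the post. The stopword list, the 0.2 threshold, and all function names are hypothetical.

```python
import re

# Hypothetical stopword dictionary; a real one would be much larger.
STOPWORDS = {"the", "a", "is", "in", "on", "to", "and", "of", "for", "at", "was"}


def tokens(message):
    """Lowercase the message, split it into words, and drop stopwords."""
    found = re.findall(r"[a-z0-9#@']+", message.lower())
    return {w for w in found if w not in STOPWORDS}


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(messages, threshold=0.2):
    """Greedy single pass: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"vocab": set of tokens, "members": list of messages}
    for msg in messages:
        toks = tokens(msg)
        best = max(clusters, key=lambda c: jaccard(toks, c["vocab"]), default=None)
        if best is not None and jaccard(toks, best["vocab"]) >= threshold:
            best["members"].append(msg)
            best["vocab"] |= toks  # the cluster vocabulary grows with each new member
        else:
            clusters.append({"vocab": set(toks), "members": [msg]})
    return clusters


tweets = [
    "The Lakers won the NBA game",
    "NBA game tonight: Lakers at home",
    "New movie trailer is out",
    "Watching the new movie trailer",
]
result = cluster(tweets)
for c in result:
    print(c["members"])
```

Because clusters are created on demand, a new event gets its own cluster as soon as messages about it arrive, which is the dynamic behaviour the second approach is after. The threshold controls how readily messages merge and would need tuning on real data.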
