Search Results

Search found 7 results on 1 pages for 'rohanbk'.

Page 1/1 | 1 

  • Python web scraping involving HTML tags with attributes

    - by rohanbk
    I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </body> </html> I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do? At the moment, my code looks like what is below: import re import urllib2,sys import lxml from lxml import etree from lxml.html.soupparser import fromstring from lxml.etree import tostring from lxml.cssselect import CSSSelector from BeautifulSoup import BeautifulSoup, NavigableString address='http://www.example.com/' html = urllib2.urlopen(address).read() soup = BeautifulSoup(html) html=soup.prettify() html=html.replace('&nbsp', '&#160') html=html.replace('&iacute','&#237') root=fromstring(html) I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in more source file. EDIT: I suppose that I didn't make this quite clear, but I have multiple tags in page that I want to scrape.

    Read the article

  • Python, dictionaries, and chi-square contingency table

    - by rohanbk
    I have a file which contains several lines in the following format (word, time that the word occurred in, and frequency of documents containing the given word within the given instance in time): #inputfile <word, time, frequency> apple, 1, 3 banana, 1, 2 apple, 2, 1 banana, 2, 4 orange, 3, 1 I have Python class below that I used to create 2-D dictionaries to store the above file using as the key, and frequency as the value: class Ddict(dict): ''' 2D dictionary class ''' def __init__(self, default=None): self.default = default def __getitem__(self, key): if not self.has_key(key): self[key] = self.default() return dict.__getitem__(self, key) wordtime=Ddict(dict) # Store each inputfile entry with a <word,time> key timeword=Ddict(dict) # Store each inputfile entry with a <time,word> key # Loop over every line of the inputfile for line in open('inputfile'): word,time,count=line.split(',') # If <word,time> already a key, increment count try: wordtime[word][time]+=count # Otherwise, create the key except KeyError: wordtime[word][time]=count # If <time,word> already a key, increment count try: timeword[time][word]+=count # Otherwise, create the key except KeyError: timeword[time][word]=count The question that I have pertains to calculating certain things while iterating over the entries in this 2D dictionary. For each word 'w' at each time 't', calculate: The number of documents with word 'w' within time 't'. (a) The number of documents without word 'w' within time 't'. (b) The number of documents with word 'w' outside time 't'. (c) The number of documents without word 'w' outside time 't'. (d) Each of the items above represents one of the cells of a chi-square contingency table for each word and time. Can all of these be calculated within a single loop or do they need to be done one at a time? Ideally, I would like the output to be what's below, where a,b,c,d are all the items calculated above: print "%s, %s, %s, %s" %(a,b,c,d)

    Read the article

  • Python and MySQLdb

    - by rohanbk
    I have the following query that I'm executing using a Python script (by using the MySQLdb module). conn=MySQLdb.connect (host = "localhost", user = "root",passwd = "<password>",db = "test") cursor = conn.cursor () preamble='set @radius=%s; set @o_lat=%s; set @o_lon=%s; '%(radius,latitude,longitude) query='SELECT *, 6371*1000 * acos(cos(radians(@o_lat)) * cos(radians(lat)) * cos(radians(lon) - radians(@o_lon)) + sin(radians(@o_lat)) * sin(radians(lat))) as distance FROM poi_table HAVING distance < @radius ORDER BY distance ASC LIMIT 0, 50' complete_query=preamble+query results=cursor.execute (complete_query) print results The values of radius, latitude, and longitude are not important, but they are being defined when the script executes. What bothers me is that the snippet of code above returns no results; essentially meaning that the way that the query is being executed is wonky. I executed the SQL query (including the set variables with actual values, and it returned the correct number of results). If I modify the query to just be a simple SELECT FROM query (SELECT * FROM poi_table) it returns results. What is going on here?

    Read the article

  • Convert data retrieved from MySQL database into JSON object using Python/Django

    - by rohanbk
    I have a MySQL database called People which contains the following schema <id,name,foodchoice1,foodchoice2>. The database contains a list of people and the two choices of food they wish to have at a party (for example). I want to create some kind of Python web-service that will output a JSON object. An example of output should be like: { "guestlist": [ {"id":1,"name":"Bob","choice1":"chicken","choice2":"pasta"},{"id":2,"name":"Alice","choice1":"pasta","choice2":"chicken"} ], "partyname": "My awesome party", "day": "1", "month": "June", "2010": "null" } Basically every guest is stored into a dictionary 'guestlist' along with their choices of food. At the end of the JSON object is just some additional information that only needs to be mentioned once. The question that I have is regarding the method that I need to utilize to grab the data from my database, and create the JSON object. Do I need to use a standard Model/View structure of Django or can I get away with something that is much simpler since what I need to do is really simple?

    Read the article

  • Auto SSH and execute script

    - by rohanbk
    I have roughly 12 computers that each have the same script on them. This script merely pings all the other machines, and prints out whether the machine is "reachable" or "unreachable". However, it is inefficient to login to each machine manually using ssh to execute this script. Suppose I'm logged into node 1. Is there any way to for me to login to node 2-12 automatically using SSH, execute the ping script, pipe the results to a file, logout and proceed to the next machine? Some kind of bash shell script? I'm afraid I'm at a loss here since I haven't had experience with shell-scripting before.

    Read the article

  • Substring extraction using bash shell scripting and awk

    - by rohanbk
    So, I have a file called 'dummy' which contains the string: "There is 100% packet loss at node 1". I also have a small script that I want to use to grab the percentage from this file. The script is below. result=`grep 'packet loss' dummy` | awk '{ first=match($0,"[0-9]+%") last=match($0," packet loss") s=substr($0,first,last-first) print s}' echo $result I want the value of $result to basically be 100% in this case. But for some reason, it just prints out a blank string. Can anyone help me?

    Read the article

  • Hadoop Map Reduce job never finishes

    - by rohanbk
    I am running a Hadoop Map Reduce job using a Python Mapper and Reducer script, and Hadoop Streaming. Both my Map and Reduce jobs run till they are both 100%, but the job doesn't end. I know that when things go sour, Hadoop will terminate the job, but in this case, both stages reach a 100% and just never end. Has anyone else encountered anything similar? Also, how do I debug my program to figure out where things are going wrong? If I use a smaller input file, and I just run something like: $> cat input_file | mapper.py | sort | reduce.py >> output_file everything works perfectly fine. However, when I use Hadoop, things don't work out.

    Read the article

1