Hi there,
I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go.
Since this is a private group, I need to log in to my Google account first.
Unfortunately, I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly, Google Groups is not accessible through the ClientLogin interface, so the ClientLogin code samples are of no use here.
My Ruby script is embedded at the end of the post.
The response to the authentication query is a 200 OK, but there are no cookies in the response headers, and the body contains the message "Your browser's cookie functionality is turned off. Please turn it on."
I got the same output with wget. See the bash script at the end of this message.
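One thing I have not ruled out is that the server expects the POST to send back a cookie set by an earlier response, such as the login page itself. That is only my reading of the error message, not something I have confirmed. Here is a minimal sketch of turning Set-Cookie values into a Cookie request header. The helper name cookie_header and the sample values are made up for illustration.

```ruby
# Hypothetical helper: build a Cookie request header from Set-Cookie values.
# Only the leading name=value pair of each value is kept; attributes such as
# Path, Expires or Secure are dropped.
def cookie_header(set_cookie_values)
  set_cookie_values.map { |value| value.split(';', 2).first.strip }.join('; ')
end

# Illustrative input only, not real server output.
sample = [
  'GALX=abc123; Path=/accounts; Secure',
  'PREF=ID=1; Expires=Wed, 01 Jan 2031 00:00:00 GMT'
]
puts cookie_header(sample)
```

With Net::HTTP, the input could come from resp.get_fields('set-cookie'), which returns an array of header values or nil when the header is absent, and the result could be passed as the 'Cookie' entry of the request headers.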
I don't know how to work around this. Am I missing something? Any ideas?
Thanks in advance.
John
Here is the Ruby script:
# a ruby script
require 'net/https'
http = Net::HTTP.new('www.google.com', 443)
http.use_ssl = true
path = '/accounts/ServiceLoginAuth'
email='[email protected]'
password='topsecret'
# form inputs from the login page
data = "Email=#{email}&Passwd=#{password}&dsh=7379491738180116079&GALX=irvvmW0Z-zI"
headers = { 'Content-Type' => 'application/x-www-form-urlencoded',
'user-agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0"}
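# Post the request and print out the response to retrieve our authentication token
# (on Ruby 1.9 and later, http.post returns only the response object)
resp = http.post(path, data, headers)
puts "#{resp.code} #{resp.message}"
# Print every response header as name=value
resp.each { |h, v| puts h + '=' + v }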
#warning: peer certificate won't be verified in this SSL session
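The data string in the script interpolates the email and password as-is, so a value containing characters such as & or = would corrupt the form body. Here is a minimal sketch of building the same body with URI.encode_www_form from the standard library. The helper name login_form_body and the sample credentials are placeholders for illustration; the field names are the ones used in the script.

```ruby
require 'uri'

# Hypothetical helper: build the URL-encoded form body for the login POST.
# Field names (Email, Passwd, dsh, GALX) are taken from the script above.
def login_form_body(email, password, dsh, galx)
  URI.encode_www_form(
    'Email'  => email,
    'Passwd' => password,
    'dsh'    => dsh,
    'GALX'   => galx
  )
end

# Placeholder credentials; '@' and '&' get percent-encoded.
puts login_form_body('user@example.com', 'top&secret', '7379491738180116079', 'irvvmW0Z-zI')
```

The returned string could be passed to http.post in place of the hand-built data string.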
Here is the bash script:
# A bash script for wget
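# Build the arguments in an array so that quoted values survive expansion
# (quotes stored inside a plain string are passed to wget literally)
CMD=()
CMD+=(--keep-session-cookies --save-cookies cookies.tmp)
CMD+=(--no-check-certificate)
CMD+=(--post-data='[email protected]&Passwd=topsecret&dsh=-8408553335275857936&GALX=irvvmW0Z-zI')
CMD+=(--user-agent='Mozilla')
CMD+=(https://www.google.com/accounts/ServiceLoginAuth)
echo "${CMD[@]}"
wget "${CMD[@]}"
# Quote the URL so the shell does not treat '?' as a glob character
wget --load-cookies="cookies.tmp" 'http://groups.google.com/group/mygroup/topics?tsc=2'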