Translating with Google Translate without API and C# Code

Posted by Rick Strahl on West-Wind
Published on Sat, 06 Aug 2011 22:44:16 GMT

Some time back I created a database-driven ASP.NET Resource Provider along with some tools that make it easy to edit ASP.NET resources interactively in a Web application. One of the small helper features of the interactive resource admin tool is the ability to do simple translations using both Google Translate and Babelfish.

Here's what this looks like in the resource administration form:

[Screenshot: LocalizationAdmin resource administration form]

When a resource is displayed, the user can click a Translate button, which shows the current resource text and lets you set the source and target languages for the translation. The Go button fires the translation for both Google and Babelfish and displays the results - pressing Use then changes the language of the resource to the target language and sets the resource value to the newly translated value. It's a nice way to get a quick translation going.

Ch… Ch… Changes

Originally, both implementations basically did some screen scraping of the interactive Web sites and retrieved translated text out of result HTML. Screen scraping is always kind of an iffy proposition as content can be changed easily, but surprisingly that code worked for many years without fail. Recently however, Google at least changed their input pages to use AJAX callbacks and the page updates no longer worked the same way. End result: The Google translate code was broken.

Now, Google does have an official API that you can access, but the API is being deprecated and you actually need to have an API key. Since I have public samples that people can download, the API key is an issue if I want the samples to work out of the box - the only way I could do this is by sharing my API key (not allowed).

After a bit of spelunking and playing around with the public site, however, I found that Google's interactive translate page actually makes callbacks using plain public access without an API key. By intercepting some of those AJAX calls and calling them directly from code I was able to get translation back up and working with minimal fuss, by parsing out the JSON these AJAX calls return. I don't think this particular

Warning: This is hacky code, but after a fair bit of testing I found this to work very well with all sorts of languages and accented and escaped text etc. as long as you stick to small blocks of translated text. I thought I'd share it in case anybody else had been relying on a screen scraping mechanism like I did and needed a non-API based replacement.

Here's the code:

// Requires: System.Net, System.Text, System.Text.RegularExpressions,
//           System.Web, System.Web.Script.Serialization
/// <summary>
/// Translates a string into another language using Google's translate API JSON calls.
/// <seealso>Class TranslationServices</seealso>
/// </summary>
/// <param name="text">Text to translate. Should be a single word or sentence.</param>
/// <param name="fromCulture">
/// Two letter culture (en for en-us, fr for fr-ca, de for de-ch)
/// </param>
/// <param name="toCulture">
/// Two letter culture (as for fromCulture)
/// </param>
/// <returns>The translated text, or null on failure (see ErrorMessage)</returns>
public string TranslateGoogle(string text, string fromCulture, string toCulture)
{
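    // Use the invariant culture so lower-casing doesn't depend on the current locale
    fromCulture = fromCulture.ToLowerInvariant();
    toCulture = toCulture.ToLowerInvariant();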

    // normalize the culture in case something like en-us was passed 
    // retrieve only en since Google doesn't support sub-locales
    string[] tokens = fromCulture.Split('-');
    if (tokens.Length > 1)
        fromCulture = tokens[0];
    
    // normalize ToCulture
    tokens = toCulture.Split('-');
    if (tokens.Length > 1)
        toCulture = tokens[0];
    
    string url = string.Format(@"http://translate.google.com/translate_a/t?client=j&text={0}&hl=en&sl={1}&tl={2}",
                               HttpUtility.UrlEncode(text), fromCulture, toCulture);

    // Retrieve Translation with HTTP GET call
    string html = null;
    try
    {
        // WebClient is IDisposable, so wrap it in a using block
        using (WebClient web = new WebClient())
        {
            // MUST add a known browser user agent or else response encoding doesn't return UTF-8 (WTF Google?)
            web.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0");
            web.Headers.Add(HttpRequestHeader.AcceptCharset, "UTF-8");

            // Make sure the response is decoded as UTF-8
            web.Encoding = Encoding.UTF8;
            html = web.DownloadString(url);
        }
    }
    catch (Exception ex)
    {
        this.ErrorMessage = Westwind.Globalization.Resources.Resources.ConnectionFailed + ": " +
                            ex.GetBaseException().Message;
        return null;
    }

    // Extract the quoted value in "trans":"...[Extracted]...", from the JSON string
    string result = Regex.Match(html, "trans\":(\".*?\"),\"", RegexOptions.IgnoreCase).Groups[1].Value;

    if (string.IsNullOrEmpty(result))
    {
        this.ErrorMessage = Westwind.Globalization.Resources.Resources.InvalidSearchResult;
        return null;
    }

    //return WebUtils.DecodeJsString(result);

    // Result is a JavaScript string so we need to deserialize it properly
    JavaScriptSerializer ser = new JavaScriptSerializer();
    return ser.Deserialize(result, typeof(string)) as string;            
}

Using the code is straightforward enough - simply provide a string to translate and a pair of two-letter source and target languages:

string result = service.TranslateGoogle("Life is great and one is spoiled when it goes on and on and on", "en", "de");
TestContext.WriteLine(result);

How it works

The code to translate is fairly straightforward. It basically uses the URL I snagged from the Google Translate Web page, slightly changed to return a JSON result (&client=j) instead of the funky nested PHP-style JSON array that the default returns.
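As a rough sketch, the culture normalization and URL construction could be pulled out into small helpers like the ones here. This is an illustration and not code from the toolkit: TranslateUrlSketch, NormalizeCulture and BuildUrl are made-up names, and Uri.EscapeDataString stands in for HttpUtility.UrlEncode so the snippet doesn't need a System.Web reference (it encodes spaces as %20 rather than +).

```csharp
using System;

public static class TranslateUrlSketch
{
    // Reduces a culture like "en-US" or "fr-ca" to its two-letter language part.
    public static string NormalizeCulture(string culture)
    {
        if (string.IsNullOrEmpty(culture))
            throw new ArgumentException("Culture is required", "culture");

        culture = culture.Trim().ToLowerInvariant();
        int dash = culture.IndexOf('-');
        return dash > 0 ? culture.Substring(0, dash) : culture;
    }

    // Builds the request URL in the same shape used by TranslateGoogle().
    public static string BuildUrl(string text, string fromCulture, string toCulture)
    {
        return string.Format(
            "http://translate.google.com/translate_a/t?client=j&text={0}&hl=en&sl={1}&tl={2}",
            Uri.EscapeDataString(text),          // assumption: swapped in for HttpUtility.UrlEncode
            NormalizeCulture(fromCulture),
            NormalizeCulture(toCulture));
    }
}
```

Keeping the URL construction separate from the HTTP call makes it easy to check what is actually sent without hitting the network.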

The JSON result returned looks like this:

{"sentences":[{"trans":"Das Leben ist großartig und man wird verwöhnt, wenn es weiter und weiter und weiter geht","orig":"Life is great and one is spoiled when it goes on and on and on","translit":"","src_translit":""}],"src":"en","server_time":24}

I use WebClient to make an HTTP GET call to retrieve the JSON data and strip out the part of the full JSON response that contains the actual translated text. Since the extracted value is a JSON string, I need to deserialize it in case it's encoded (for upper/lower ASCII chars or quotes etc.).
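To make the extract-then-decode step concrete, here is a minimal sketch that runs the same regular expression against a response and then unescapes the quoted value by hand. The DecodeJsString shown here is a hypothetical stand-in written for this example - it is not the WebUtils.DecodeJsString from the toolkit - and JsStringSketch and ExtractTransLiteral are made-up names as well. It only handles the common JSON escapes.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class JsStringSketch
{
    // Returns the first quoted "trans" value, surrounding quotes included.
    public static string ExtractTransLiteral(string json)
    {
        // Same pattern as in TranslateGoogle()
        return Regex.Match(json, "trans\":(\".*?\"),\"", RegexOptions.IgnoreCase).Groups[1].Value;
    }

    // Hypothetical decoder: unescapes \n, \r, \t, \b, \f, \uXXXX, and
    // passes any other escaped character (\" \\ \/) through as-is.
    public static string DecodeJsString(string literal)
    {
        if (string.IsNullOrEmpty(literal))
            return literal;

        // Strip the surrounding quotes if present
        if (literal.Length >= 2 && literal[0] == '"' && literal[literal.Length - 1] == '"')
            literal = literal.Substring(1, literal.Length - 2);

        StringBuilder sb = new StringBuilder(literal.Length);
        for (int i = 0; i < literal.Length; i++)
        {
            char c = literal[i];
            if (c != '\\' || i == literal.Length - 1)
            {
                sb.Append(c);
                continue;
            }

            char next = literal[++i];
            switch (next)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case 'u':
                    int code;
                    if (i + 4 < literal.Length &&
                        int.TryParse(literal.Substring(i + 1, 4), NumberStyles.HexNumber,
                                     CultureInfo.InvariantCulture, out code))
                    {
                        sb.Append((char)code);
                        i += 4;   // skip the four hex digits
                    }
                    else
                    {
                        sb.Append('u');   // malformed escape, keep the character
                    }
                    break;
                default:
                    sb.Append(next);
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Note that the lazy regular expression stops at the first quote that is followed by a comma and another quote, so it only ever picks up the first "trans" value in the response.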

Couple of odd things to note in this code:

First note that a valid user agent string must be passed (or at least one starting with a common browser identification - I use Mozilla/5.0). Without this Google doesn't encode the result with UTF-8, but instead uses an ISO encoding that .NET can't easily decode. Google seems to ignore the character set header and use the user agent instead which is - odd to say the least.

The other is that the service returns a full JSON response. Rather than use the full response and decode it into a custom type that matches Google's result object, I just strip out the translated text. Yeah, I know that's hacky, but it avoids an extra type and running the JavaScript deserializer over the entire response. My internal version uses a small DecodeJsString() method to decode JavaScript strings without the overhead of a full JSON parser.
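For comparison, the non-hacky route of decoding the whole response into a matching type might look roughly like this. It is only a sketch under a few assumptions: the type names (GoogleTranslateResult, GoogleSentence, TranslateResultParser) are invented, the property layout is inferred from the sample response shown earlier, and DataContractJsonSerializer is used instead of JavaScriptSerializer so the snippet depends only on the base class library.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

[DataContract]
public class GoogleSentence
{
    [DataMember(Name = "trans")] public string Trans { get; set; }
    [DataMember(Name = "orig")] public string Orig { get; set; }
}

[DataContract]
public class GoogleTranslateResult
{
    [DataMember(Name = "sentences")] public GoogleSentence[] Sentences { get; set; }
    [DataMember(Name = "src")] public string Src { get; set; }
}

public static class TranslateResultParser
{
    // Deserializes the full JSON response; members not mapped above are ignored.
    public static GoogleTranslateResult Parse(string json)
    {
        DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(GoogleTranslateResult));
        using (MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            return (GoogleTranslateResult)ser.ReadObject(ms);
        }
    }

    // Joins the translated text of every entry in "sentences".
    // Assumption: a response may carry more than one entry.
    public static string GetTranslation(GoogleTranslateResult result)
    {
        if (result == null || result.Sentences == null)
            return null;
        return string.Join(" ", result.Sentences.Select(s => s.Trans));
    }
}
```

The trade-off is the one described above: two extra types and a full parse in exchange for not depending on the exact character sequence the regular expression looks for.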

It's obviously not rocket science, but as mentioned above what's nice about it is that it works without a Google API key. I can't vouch for how many translations you can do before there are cut offs, but in my limited testing, running a few stress tests on a Web server under load, I didn't run into any problems.

Limitations

There are some restrictions with this: It only works on single words or single sentences - multiple sentences (delimited by .) are cut off at the ".". There is also a length limitation, which appears to kick in at around 220 characters or so. While that may not sound like much, for typical word or phrase translations this is plenty of length.

Use with a grain of salt - Google seems to be trying to limit their exposure to usage of the Translate APIs so this code might break in the future, but for now at least it works.
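If longer text has to go through, one way to live with these limits is to split the input into sentences up front and translate each piece separately. The following is just a sketch of that idea: TranslationChunker and its members are hypothetical names, the 220 figure is the rough value observed above rather than a documented limit, and the split is deliberately naive (it will also break on periods in numbers or abbreviations).

```csharp
using System;
using System.Collections.Generic;

public static class TranslationChunker
{
    // Approximate limit observed in testing; not a documented value.
    public const int MaxLength = 220;

    // Splits text after each "." and drops empty pieces.
    public static List<string> SplitSentences(string text)
    {
        List<string> chunks = new List<string>();
        if (string.IsNullOrWhiteSpace(text))
            return chunks;

        int start = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '.')
            {
                AddChunk(chunks, text.Substring(start, i - start + 1));
                start = i + 1;
            }
        }

        // Remainder after the last period, if any
        if (start < text.Length)
            AddChunk(chunks, text.Substring(start));

        return chunks;
    }

    // True if the chunk is short enough to send in one request.
    public static bool FitsInOneCall(string chunk)
    {
        return chunk != null && chunk.Length <= MaxLength;
    }

    private static void AddChunk(List<string> chunks, string chunk)
    {
        chunk = chunk.Trim();
        if (chunk.Length > 0)
            chunks.Add(chunk);
    }
}
```

Each chunk that passes FitsInOneCall could then be handed to TranslateGoogle() in turn and the results joined back together.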

FWIW, I also found that Google's translation is not as good as Babelfish, especially for contextual content like sentences. Google is faster, but Babelfish tends to give better translations. This is why in my translation tool I show both the Google and Babelfish values retrieved. You can check out the code for this in the West Wind Web Toolkit's TranslationService.cs file, which contains both the Google and Babelfish translation code pieces. Ironically the Babelfish code has been working forever using screen scraping and continues to work just fine today. I think it's a good idea to have multiple translation providers in case one is down or changes its format, hence the dual display in my translation form above.
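The multiple-provider idea can be sketched as a small helper that calls each provider in turn and keeps going when one of them fails. None of this is taken from TranslationService.cs: MultiProviderSketch and TranslateAll are invented for the example, and providers are modeled simply as delegates taking (text, fromCulture, toCulture).

```csharp
using System;
using System.Collections.Generic;

public static class MultiProviderSketch
{
    // Runs every provider and returns one result per provider name.
    // A provider that throws yields null instead of failing the whole call.
    public static Dictionary<string, string> TranslateAll(
        string text, string fromCulture, string toCulture,
        IDictionary<string, Func<string, string, string, string>> providers)
    {
        Dictionary<string, string> results = new Dictionary<string, string>();

        foreach (KeyValuePair<string, Func<string, string, string, string>> provider in providers)
        {
            try
            {
                results[provider.Key] = provider.Value(text, fromCulture, toCulture);
            }
            catch (Exception)
            {
                // One broken provider should not take the others down
                results[provider.Key] = null;
            }
        }

        return results;
    }
}
```

A caller could register TranslateGoogle under one key and the Babelfish routine under another, then display whichever results come back non-null.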

I hope this has been helpful to some of you - I've actually had many small uses for this code in a number of applications and it's sweet to have a simple routine that performs these operations for me easily.

© Rick Strahl, West Wind Technologies, 2005-2011