Let me start by apologizing for being away for such a long time without writing any posts. I was just not in the mood, if you know what I mean.
But today I’m going to show you how to translate some text using Google Translate, right inside your own .NET programs.
I was in need of that feature today, and at first glance I just thought “sure, Google is nice, they have an API for programmers like me”. Turns out they do, but that’s a JSON API intended for JavaScript. Yes yes, we can indeed access that in .NET, but I didn’t want to start messing with JSON at the moment, and I always like a challenge. So I began to look into the possibilities for screen scraping, a process I’m quite comfortable with as I’ve done a lot of it.
But be warned. If Google changes anything in their layout, your application is likely to break. I will not offer support on this, but if you ask nicely I may be able to help you anyway.
For me that doesn’t really matter as it’s primarily intended for personal use on a project I’m constantly working on.
So let’s get coding. First, take a look at the translation itself. For testing purposes I will go from Danish to English. Normally Google hides the parameters inside Ajax and POST requests, so I will do the hard work for you and show you this URL.
http://translate.google.com/?hl=en&ie=UTF8&text=Hej+verden&langpair=da|en
It’s the direct translation URL. Put in your own text instead of “Hej verden”, change the langpair to suit your needs (da|en means from Danish to English) and you are good to go.
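If your text contains characters like “&” or “#”, pasting it straight into that URL will cut the query short. Here is a minimal sketch of how the URL could be put together with the text escaped first. The class name TranslateUrl and the method Build are made up for this example and are not part of the code later in this post; Uri.EscapeDataString is the standard .NET call doing the escaping.

```csharp
using System;

public static class TranslateUrl
{
    // Hypothetical helper (not from the original post): builds the direct
    // translation URL shown above and percent-encodes the text to translate.
    public static string Build(string text, string langFrom, string langTo)
    {
        return "http://translate.google.com/?hl=en&ie=UTF8"
            + "&text=" + Uri.EscapeDataString(text)
            + "&langpair=" + langFrom + "|" + langTo;
    }
}
```

Note that this writes the space as %20 instead of the + used in the URL above. Both are common ways to encode a space in a query string.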
Let’s take that into .NET. Today I’m going to present it in C# but there will be a VB example at the end.
Let’s start out by creating a new Windows Forms project and opening the code view. You should begin with some imports. There should be some auto-generated ones, so just insert this at the end of the imports.
using System.Net;
using System.Text.RegularExpressions;
After that we can create our initial function. It will take 3 parameters. “input” is gonna be the text to be translated, “langFrom” is the language to translate from in shortcode (like da or en, for Danish or English) and “langTo” is gonna be the language to translate to, in the same format as “langFrom”.
public string TranslateText(string input, string langFrom, string langTo)
{
    //Function here
}
Now we can move on to the next step where we will actually fetch some data from Google. We will create a new instance of WebClient and add the appropriate headers to it (to make sure Google is going to send us UTF-8 encoded text). From there we will take the parameters passed to the function and insert them into our previous translation URL, and after that fetch the content.
public string TranslateText(string input, string langFrom, string langTo)
{
    //Defines a new WebClient
    WebClient Client = new WebClient();
    //Sets the client encoding to UTF8
    Client.Headers.Add("Charset", "text/html; charset=UTF-8");
    //Creates the string. And yes I prefer this over string.format ! ;)
    string downloadUrl = "http://www.google.com/translate_t?hl=da&ie=UTF8&text=" + input + "&langpair=" + langFrom + "|" + langTo;
    //Downloads the string from the URL above
    string data = Client.DownloadString(downloadUrl);
    return data;
}
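A small side note on the fetching step. WebClient also has an Encoding property that controls how downloaded strings are decoded, and the client is disposable. The sketch below shows one way to wrap that up. The class name Fetcher and its two methods are made up for this example, so treat it as an assumption about how you might structure it, not as part of the function above.

```csharp
using System.Net;
using System.Text;

public static class Fetcher
{
    // Hypothetical helper: creates a WebClient that decodes strings as UTF-8.
    public static WebClient CreateClient()
    {
        var client = new WebClient();
        client.Encoding = Encoding.UTF8;
        return client;
    }

    // Hypothetical helper: downloads a page and disposes the client afterwards.
    public static string Fetch(string url)
    {
        using (var client = CreateClient())
        {
            return client.DownloadString(url);
        }
    }
}
```

Calling Fetcher.Fetch(downloadUrl) would then stand in for the Client.DownloadString call.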
Now we have the data stored inside our “data” variable. Let’s just parse it real fast. We will begin by finding where the “resultbox” begins and afterwards parse our way through until we hit two “</span>” right after each other (indicating the end of the resultbox).
public string TranslateText(string input, string langFrom, string langTo)
{
    //Defines a new WebClient
    WebClient Client = new WebClient();
    //Sets the client encoding to UTF8
    Client.Headers.Add("Charset", "text/html; charset=UTF-8");
    //Creates the string. And yes I prefer this over string.format ! ;)
    string downloadUrl = "http://www.google.com/translate_t?hl=da&ie=UTF8&text=" + input + "&langpair=" + langFrom + "|" + langTo;
    //Downloads the string from the URL above
    string data = Client.DownloadString(downloadUrl);
    //Searches for the beginning of the resultbox and cuts everything away before that
    data = data.Substring(data.IndexOf("<span id=result_box") + 19);
    //Finds the ending of the resultbox by searching for two spans right after each other
    data = data.Remove(data.IndexOf("</span></span>") + 7);
    return data;
}
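One thing to be aware of: IndexOf returns -1 when the marker is not found, and then the Substring and Remove calls above will cut at the wrong place or throw. Here is a hedged sketch of the same cutting step with a guard added. The class ResultBox and the method Isolate are names I made up for this example, and returning null for "not found" is just one possible choice.

```csharp
using System;

public static class ResultBox
{
    // Hypothetical helper: returns everything between the start marker and
    // the first "</span></span>", or null when either marker is missing.
    public static string Isolate(string html)
    {
        const string startMarker = "<span id=result_box";
        const string endMarker = "</span></span>";

        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += startMarker.Length;

        int end = html.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return null;

        // Keep the first "</span>" (7 characters) so the last inner span stays closed.
        return html.Substring(start, end + 7 - start);
    }
}
```

On a page without a result box this gives you null to check for instead of an exception somewhere in the middle of the parsing.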
Now we have the contents of the “resultbox” (and a little of its beginning) and are ready to move on to the next step. Here we will use a regex for counting the occurrences of spans and afterwards loop through the entire datablock, extract each span and put them together in the variable “translatedText” and return it at the end.
public string TranslateText(string input, string langFrom, string langTo)
{
    //Defines a new WebClient
    WebClient Client = new WebClient();
    //Sets the client encoding to UTF8
    Client.Headers.Add("Charset", "text/html; charset=UTF-8");
    //Creates the string. And yes I prefer this over string.format ! ;)
    string downloadUrl = "http://www.google.com/translate_t?hl=da&ie=UTF8&text=" + input + "&langpair=" + langFrom + "|" + langTo;
    //Downloads the string from the URL above
    string data = Client.DownloadString(downloadUrl);
    //Searches for the beginning of the resultbox and cuts everything away before that
    data = data.Substring(data.IndexOf("<span id=result_box") + 19);
    //Finds the ending of the resultbox by searching for two spans right after each other
    data = data.Remove(data.IndexOf("</span></span>") + 7);
    //Defines a new regex used for counting all spans inside the resultbox
    Regex spans = new Regex("<span");
    //Finds the count and puts it inside the variable spanOccurences
    int spanOccurences = spans.Matches(data).Count;
    //Defines an empty string for use in the for loop
    string translatedText = "";
    //Extract each tiny bit of text from each span in the resultbox
    for (int i = 0; i < spanOccurences; i++)
    {
        //Defines currentBlock and sets it to everything which comes after the first "<span"
        string currentBlock = data.Substring(data.IndexOf("<span") + 5);
        //Finds the ending of the current span and removes everything after that
        currentBlock = currentBlock.Remove(currentBlock.IndexOf("</span>"));
        //Goes back to the beginning and cleans everything from inside the first span
        currentBlock = currentBlock.Substring(currentBlock.IndexOf(">") + 1);
        //Removes the current processed span from the beginning of the data for next extraction
        data = data.Substring(data.IndexOf("</span>") + 7);
        //Adds the extracted text to the translatedText variable
        translatedText += currentBlock;
    }
    //Returns the translated text
    return translatedText;
}
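Since we already pull in Regex for the counting, the extraction loop could also be done by letting the regex capture the text inside each span. This is an alternative sketch, not the method used in this post. The class SpanText and the method Join are made-up names, and the pattern assumes the spans inside the resultbox are not nested.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class SpanText
{
    // Matches one non-nested span and captures the text between its tags.
    private static readonly Regex Span =
        new Regex("<span[^>]*>(.*?)</span>", RegexOptions.Singleline);

    // Hypothetical helper: joins the inner text of every span in the input.
    public static string Join(string resultBox)
    {
        var sb = new StringBuilder();
        foreach (Match m in Span.Matches(resultBox))
        {
            sb.Append(m.Groups[1].Value);
        }
        return sb.ToString();
    }
}
```

Like the loop above, this leaves HTML entities such as &amp; untouched, so you may want to decode those afterwards.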
And the full code in C# looks like this.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Net;
using System.Text.RegularExpressions;

namespace TranslateScraper
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            MessageBox.Show(TranslateText("Tillykke! Dine programmer kan nu bruge Google Oversæt", "da", "en"));
        }

        /// <summary>
        /// Translates a text using screen scraping on Google Translate
        /// </summary>
        /// <param name="input">The string to translate</param>
        /// <param name="langFrom">The language to translate from. E.g. "en" for English or "da" for Danish</param>
        /// <param name="langTo">The language to translate to in the same format as langFrom</param>
        /// <returns></returns>
        public string TranslateText(string input, string langFrom, string langTo)
        {
            //Defines a new WebClient
            WebClient Client = new WebClient();
            //Sets the client encoding to UTF8
            Client.Headers.Add("Charset", "text/html; charset=UTF-8");
            //Creates the string. And yes I prefer this over string.format ! ;)
            string downloadUrl = "http://www.google.com/translate_t?hl=da&ie=UTF8&text=" + input + "&langpair=" + langFrom + "|" + langTo;
            //Downloads the string from the URL above
            string data = Client.DownloadString(downloadUrl);
            //Searches for the beginning of the resultbox and cuts everything away before that
            data = data.Substring(data.IndexOf("<span id=result_box") + 19);
            //Finds the ending of the resultbox by searching for two spans right after each other
            data = data.Remove(data.IndexOf("</span></span>") + 7);
            //Defines a new regex used for counting all spans inside the resultbox
            Regex spans = new Regex("<span");
            //Finds the count and puts it inside the variable spanOccurences
            int spanOccurences = spans.Matches(data).Count;
            //Defines an empty string for use in the for loop
            string translatedText = "";
            //Extract each tiny bit of text from each span in the resultbox
            for (int i = 0; i < spanOccurences; i++)
            {
                //Defines currentBlock and sets it to everything which comes after the first "<span"
                string currentBlock = data.Substring(data.IndexOf("<span") + 5);
                //Finds the ending of the current span and removes everything after that
                currentBlock = currentBlock.Remove(currentBlock.IndexOf("</span>"));
                //Goes back to the beginning and cleans everything from inside the first span
                currentBlock = currentBlock.Substring(currentBlock.IndexOf(">") + 1);
                //Removes the current processed span from the beginning of the data for next extraction
                data = data.Substring(data.IndexOf("</span>") + 7);
                //Adds the extracted text to the translatedText variable
                translatedText += currentBlock;
            }
            //Returns the translated text
            return translatedText;
        }
    }
}
And in VB.
Imports System.Text.RegularExpressions
Imports System.Net

Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        MsgBox(TranslateText("Tillykke! Dine programmer kan nu bruge Google oversæt", "da", "en"))
    End Sub

    ''' <summary>
    ''' Translates a text using screen scraping on Google Translate
    ''' </summary>
    ''' <param name="input">The string to translate</param>
    ''' <param name="langFrom">The language to translate from. E.g. "en" for English or "da" for Danish</param>
    ''' <param name="langTo">The language to translate to in the same format as langFrom</param>
    ''' <returns></returns>
    Public Function TranslateText(ByVal input As String, ByVal langFrom As String, ByVal langTo As String) As String
        'Defines a new WebClient
        Dim Client As New WebClient()
        'Sets the client encoding to UTF8
        Client.Headers.Add("Charset", "text/html; charset=UTF-8")
        'Creates the string. And yes I prefer this over string.format ! ;)
        Dim downloadUrl As String = "http://www.google.com/translate_t?hl=da&ie=UTF8&text=" & input & "&langpair=" & langFrom & "|" & langTo
        'Downloads the string from the URL above
        Dim data As String = Client.DownloadString(downloadUrl)
        'Searches for the beginning of the resultbox and cuts everything away before that
        data = data.Substring(data.IndexOf("<span id=result_box") + 19)
        'Finds the ending of the resultbox by searching for two spans right after each other
        data = data.Remove(data.IndexOf("</span></span>") + 7)
        'Defines a new regex used for counting all spans inside the resultbox
        Dim spans As New Regex("<span")
        'Finds the count and puts it inside the variable spanOccurences
        Dim spanOccurences As Integer = spans.Matches(data).Count
        'Defines an empty string for use in the for loop
        Dim translatedText As String = ""
        'Extract each tiny bit of text from each span in the resultbox
        For i As Integer = 0 To spanOccurences - 1
            'Defines currentBlock and sets it to everything which comes after the first "<span"
            Dim currentBlock As String = data.Substring(data.IndexOf("<span") + 5)
            'Finds the ending of the current span and removes everything after that
            currentBlock = currentBlock.Remove(currentBlock.IndexOf("</span>"))
            'Goes back to the beginning and cleans everything from inside the first span
            currentBlock = currentBlock.Substring(currentBlock.IndexOf(">") + 1)
            'Removes the current processed span from the beginning of the data for next extraction
            data = data.Substring(data.IndexOf("</span>") + 7)
            'Adds the extracted text to the translatedText variable
            translatedText += currentBlock
        Next
        'Returns the translated text
        Return translatedText
    End Function
End Class
And that concludes this tutorial. Thanks for reading, I hope to be back soon with some fresh new content, and possibly an article about my home automation system.