Goldendict: Template system for Online Dictionaries (Websites)

Created on 9 Jun 2012 · 18 Comments · Source: goldendict/goldendict

Greetings,

I received a feature request from a user, which I reproduce here:

I was thinking about how the online dictionaries are handled in GoldenDict. I was using Lingoes the other day, and I noticed how nicely it handles the online dictionaries. Only the bare dictionary entry is shown, no banners, no anything, just the text. This way, the online dictionaries are completely indistinguishable from the offline ones, and are very convenient to use. I wonder, if the same functionality is possible in GoldenDict?

At the very least, one should be able to configure the program to only show the div containing the actual dictionary entry. Let's take a look at dr.eye.com (a Chinese-English online dictionary). The entry is contained within the "dict_cont" div, everything else could be safely skipped.

I understand such functionality is difficult to implement using the existing interface. I propose simple text files, each containing the web page URL and the names of the elements on that page that need to be shown (or maybe more advanced rules, similar to AdBlock or GreaseMonkey). Such files would be placed in the dictionary folder, and GoldenDict would pick them up on start and add them to the dictionary list. After that, they would act as normal dictionary files, which is very user-friendly. Users would also be able to exchange these files with each other.
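To make the proposal concrete, here is a minimal sketch of what reading such a rule file could look like. The file layout, the key names (`name`, `url`, `show`, `hide`) and the `example.invalid` address are all hypothetical; nothing like this exists in GoldenDict. Only `%GDWORD%` and the `dict_cont` element name come from this thread.

```python
import configparser

# Hypothetical rule file: the keys and layout are assumptions made for this
# sketch and are not a format that GoldenDict understands.
EXAMPLE_RULES = """
[dictionary]
name = Example online dictionary
url = http://example.invalid/lookup?word=%GDWORD%
show = dict_cont
hide = banner, footer
"""


def load_rules(text):
    """Parse one rule file into a plain dict."""
    # interpolation=None keeps configparser from treating the percent signs
    # in %GDWORD% as interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["dictionary"]

    def split(value):
        return [item.strip() for item in value.split(",") if item.strip()]

    return {
        "name": section["name"],
        "url": section["url"],
        "show": split(section.get("show", "")),
        "hide": split(section.get("hide", "")),
    }


rules = load_rules(EXAMPLE_RULES)
print(rules["name"], rules["show"], rules["hide"])
```

A plain INI-style file is used here only because it is easy to edit by hand and to share; any other simple text format would serve the same purpose.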

I also agree this functionality should exist in GoldenDict. I have talked previously about it here:
http://goldendict.org/forum/viewtopic.php?f=6&t=1317&p=5898#p5898

This is how I see it:

1) We should have some kind of proxy (built on _QNetworkAccessManager_, _QNetworkReply_ and similar Qt classes) that fetches the whole web page, as happens now. We can also take the opportunity to implement the HTTP POST method. Right now GoldenDict supports only the GET method for search forms, so it works only with searches that put the search term in the URL.

2) Then GoldenDict should load a user-defined template from the config folder, one for each online dictionary. This template would contain the fixed HTML and CSS, plus placeholders where GD should write the results. To locate the results in the page retrieved from the web, we could use regular expressions, CSS selectors or XPath.

For example:

Original web page:

        <html>
            <head>
                <title>Online dictionary - Results</title>
                <script type="text/javascript">
                    // lots of crappy javascript
                    // frame killer / frame buster javascript
                </script>
                <style type="text/css">
                    /* lots of crappy css styles */
                </style>
            </head>
            <body>
                <!--lot of annoying banners-->
                ...
                <table id="dictionaryResult">
                    <tr id="dictionaryEntry">
                        <td class="term">
                            cat
                        </td>

                        <td class="definition">
                            feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats
                        </td>
                    </tr>
                </table>
                ...
                <!--more annoying banners-->
                ...
            </body>
        </html>

Template page:

        <html>
            <head>
                <style type="text/css">
                    /* custom css styles, or just use the default external CSS provided by GoldenDict */
                    span.term {
                        ...
                    }
                    div.definition {
                        ...
                    }
                </style>
            </head>
            <body>
                <div>
                    <span class="term">
                        <!-- placeholder for term -->
                        {{ table#dictionaryResult td.term }}
                    </span>
                    <div class="definition">
                        <br>
                        <!-- placeholder for definition -->
                        {{ table#dictionaryResult td.definition }} 
                    </div>
                </div>
            </body>
        </html>

3) GoldenDict should merge the results with the template and output the static HTML to the result window. If a regular expression, CSS selector or XPath expression cannot be resolved, it should print a warning to the console.
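Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the helper names `css_to_path` and `render` are made up, only the selector steps `tag`, `tag#id` and `tag.class` are understood, and the fetched page must be well-formed XML because the standard-library parser is used. Real web pages would need a forgiving HTML parser and a real selector engine.

```python
import html
import re
import sys
import xml.etree.ElementTree as ET


def css_to_path(selector):
    """Translate "tag", "tag#id" and "tag.class" steps into ElementPath."""
    steps = []
    for token in selector.split():
        match = re.fullmatch(r"(\w+)(?:([#.])([\w-]+))?", token)
        if match is None:
            raise ValueError("unsupported selector step: " + token)
        tag, kind, value = match.groups()
        if kind == "#":
            steps.append("%s[@id='%s']" % (tag, value))
        elif kind == ".":
            steps.append("%s[@class='%s']" % (tag, value))
        else:
            steps.append(tag)
    # A space in CSS means "any descendant", which ElementPath spells "//".
    return ".//" + "//".join(steps)


def render(template, page_source):
    """Fill every {{ selector }} placeholder with text from page_source."""
    root = ET.fromstring(page_source)

    def fill(match):
        selector = match.group(1).strip()
        element = root.find(css_to_path(selector))
        if element is None:
            # Step 3: an unresolved selector produces a console warning.
            print("warning: nothing matched " + selector, file=sys.stderr)
            return ""
        return html.escape("".join(element.itertext()).strip())

    return re.sub(r"\{\{(.*?)\}\}", fill, template)


page = ("<html><body><table id='dictionaryResult'><tr>"
        "<td class='term'> cat </td>"
        "<td class='definition'>feline mammal</td>"
        "</tr></table></body></html>")
template = ("<span class='term'>{{ table#dictionaryResult td.term }}</span>"
            "<div class='definition'>{{ table#dictionaryResult td.definition }}</div>")
print(render(template, page))
```

Only the text of the matched element is copied into the template, so scripts and banners in the original page never reach the result window.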

We can have a default template that just displays the whole webpage as it is, with no text or style manipulation. This way we keep backward compatibility.

This solution would let users customize the result according to their needs. It would also sidestep the frame-buster techniques that prevent an online dictionary from being displayed alongside offline dictionaries in the same GoldenDict result screen.

I would like to hear others' opinions.
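As for the GET/POST point in step 1, the difference is small on the client side. The sketch below uses Python's standard library rather than the Qt classes mentioned above, purely to show the idea; `build_request` is a made-up helper and `example.invalid` is a placeholder host. No request is actually sent.

```python
import urllib.parse
import urllib.request


def build_request(url_template, word, post_fields=None):
    """Return a GET request, or a POST request when form fields are given."""
    url = url_template.replace("%GDWORD%", urllib.parse.quote(word))
    if post_fields is None:
        # GET: the search term travels in the URL itself.
        return urllib.request.Request(url)
    # POST: the search term travels in the form-encoded request body.
    filled = {key: value.replace("%GDWORD%", word)
              for key, value in post_fields.items()}
    data = urllib.parse.urlencode(filled).encode("utf-8")
    return urllib.request.Request(url, data=data)


get_request = build_request("http://example.invalid/search?q=%GDWORD%", "chat noir")
post_request = build_request("http://example.invalid/search", "cat",
                             {"query": "%GDWORD%"})
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

With POST support, a dictionary definition would need one extra piece of configuration: the names of the form fields and which of them receives the search term.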

Regards,

Chulai

Bug Feature Request UI

All 18 comments

I guess this could also be a solution for my request https://github.com/goldendict/goldendict/issues/104 ?

Yes, but only for the first issue in #104. That doesn't mean we couldn't also implement "disallow loading content from other sites" for each online dictionary.

This would also address the issue reported here: http://goldendict.org/forum/viewtopic.php?f=4&t=1539

I wrote a PHP script to communicate with Google Translate. I send a word to my script, it forwards the word to Google and processes the response, and in the end I receive the translation as clean text.

Is there any chance this is going to be addressed any time soon? I'm looking for a way to use Multitran.ru with other online dictionaries. Thank you.

Basically, this has already been implemented, in the form of the ability to call an external program and show its output. I have some python3 scripts that I use for extracting the relevant information from websites. It would be good if we could share such scripts as part of the GoldenDict release, so that people do not have to look for them on the internet or write them themselves, but I am not quite sure how to do that. I am also not quite sure how to run a Python script on a Windows machine, so I cannot help you if you are on one.

On a unix machine, copy the text of the script below, and save it as a regular text file, let's say "/home/login/bin/Goldendict/getxpath".

In GoldenDict, open the Dictionary menu, go to the "programs" tab and enter a new dictionary. The example below is for an online dictionary I am using, not Multitran.ru, but the comments in the script (listed between the lines containing """) and an examination of the web page's source code should help you adjust it for Multitran. If you are able to run this dictionary but unable to use the script for Multitran, I will have a look at it later and suggest the correct way to call the script:

Enabled, Type: Html, Name: Collins English-German, Command Line:
/home/login/bin/Goldendict/getxpath http://www.collinsdictionary.com/dictionary/english-german/%GDWORD% definition_main .//div[@class="""definition_main"""] collins.css ''

where /home/login/bin/Goldendict/getxpath is the address of the following script (make sure the line starting with #! is the first line of the file):


#!/usr/bin/python3

import urllib.request
import urllib.parse
import sys
from lxml.html import fromstring, tostring
import re

"""
Arguments: url
select_div class attribute of the selected part (used only in output, not for actual selection)
select_xpath xpath of the elements that will be output (leave empty for outputting whole page)
css_file file in the same dir as this script
elements_for_removal array of xpath addresses, e.g. ['//div[@class="copyright"]','//input','//img']
values_for_javascript (optional) hash of 'key': 'value' pairs
"""

def get_page():
if len(sys.argv)==7:
values = eval(sys.argv[6])
data = urllib.parse.urlencode(values).encode('utf8')
request = urllib.request.Request(url, data)
else:
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
html = response.read()
response.close()
return html

def remove(el):
el.getparent().remove(el)

url = urllib.parse.quote(sys.argv[1],':?/=&#;')
select_div = sys.argv[2]
select_xpath = sys.argv[3]
css_file = sys.argv[4]
if css_file.endswith('.css'):
css_file = css_file[:-4]
elements_for_removal = eval(sys.argv[5])
try:
html = get_page()
page = fromstring(html.decode('utf8','ignore'))
page.make_links_absolute(base_url=url)
baseurl = url.split("#",2)

for address in elements_for_removal:
for element in page.xpath(address):
remove(element)

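print('<!DOCTYPE html>')
print('<html><head><meta charset="utf-8">')
if not css_file=='':
print('<link rel="stylesheet" type="text/css" href="file:///home/ansa/bin/Goldendict/'+css_file+'.css"></head><body><div class="'+css_file+'"><div class="'+select_div+'">')
else:
print('</head><body><div><div class="'+select_div+'">')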
if not select_xpath=='':
if not page.findall(select_xpath)==[]:
for element in page.findall(select_xpath):
print(tostring(element).decode('utf8').replace(baseurl[0]+"#","#") )
else:
print("Nothing found.")
else:
print(tostring(page).decode('utf8').replace(baseurl[0]+"#","#") )
print('</div></div></body></html>')

except urllib.error.HTTPError as e:
print('Downloading the page '+url+' failed with error code %s.' % e.code)

Cool, thanks for the pointer. The script is not properly indented and therefore won't run, and I'm not familiar enough with Python to fix it myself.

Hey @Ansa211, this is exactly the script I'm looking for, since I also need to use Collins' German-English dictionary. However, the script you posted was not indented and thus cannot run. Could you kindly post a fixed version? Thank you!

Sorry for such a slow reply. I hope I manage to make the script appear indented this time :-)

#!/usr/bin/python3
"""
Fetch a web page and print only the selected part, wrapped in minimal HTML.

Arguments: url
           select_div             class attribute of the selected part
                                   (used only in output, not for actual selection)
           select_xpath           xpath of the elements that will be output
                                   (leave empty for outputting whole page)
           css_file               file in the same dir as this script
           elements_for_removal   array of xpath addresses,
                                    e.g. ['//div[@class="copyright"]','//input','//img']
           values_for_javascript  (optional) hash of 'key': 'value' pairs
"""
import ast
import pathlib
import sys
import urllib.error
import urllib.parse
import urllib.request

from lxml.html import fromstring, tostring

SCRIPT_DIR = pathlib.Path(__file__).resolve().parent


def get_page():
    # A sixth argument (values_for_javascript) turns the GET into a POST.
    if len(sys.argv) == 7:
        # literal_eval accepts only Python literals, unlike eval.
        values = ast.literal_eval(sys.argv[6])
        data = urllib.parse.urlencode(values).encode('utf8')
        request = urllib.request.Request(url, data)
    else:
        request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    html = response.read()
    response.close()
    return html


def remove(el):
    el.getparent().remove(el)


# Percent-encode the looked-up word but keep the URL's own punctuation.
url = urllib.parse.quote(sys.argv[1], ':?/=&#;')
select_div = sys.argv[2]
select_xpath = sys.argv[3]
css_file = sys.argv[4]
if css_file.endswith('.css'):
    css_file = css_file[:-4]
# An empty argument means that nothing has to be removed.
elements_for_removal = ast.literal_eval(sys.argv[5]) if sys.argv[5] else []

try:
    html = get_page()
    page = fromstring(html.decode('utf8', 'ignore'))
    page.make_links_absolute(base_url=url)
    baseurl = url.split("#", 2)

    for address in elements_for_removal:
        for element in page.xpath(address):
            remove(element)

    print('<!DOCTYPE html>')
    print('<html><head><meta charset="utf-8">')
    if css_file:
        # The stylesheet is looked up next to this script, as documented
        # above, instead of under a hard-coded home directory.
        css_uri = (SCRIPT_DIR / (css_file + '.css')).as_uri()
        print('<link rel="stylesheet" type="text/css" href="' + css_uri
              + '"></head><body><div class="' + css_file
              + '"><div class="' + select_div + '">')
    else:
        print('</head><body><div><div class="' + select_div + '">')

    if select_xpath:
        elements = page.findall(select_xpath)
        if elements:
            for element in elements:
                # Turn links to anchors on the same page back into "#..." links.
                print(tostring(element).decode('utf8').replace(baseurl[0] + "#", "#"))
        else:
            print("Nothing found.")
    else:
        print(tostring(page).decode('utf8').replace(baseurl[0] + "#", "#"))
    print('</div></div></body></html>')

except urllib.error.HTTPError as e:
    print('Downloading the page ' + url + ' failed with error code %s.' % e.code)

I see that the script also expects to get a collins.css file. That is a file that tells goldendict how certain parts of the downloaded page should be formatted (colours, font sizes and such). In Firefox, you can obtain the css file for a webpage you are viewing by following these steps:

press Shift+F7
(this is equivalent to clicking the right mouse button, selecting "Inspect Element", and then selecting "Style Editor" from the top bar of the developer tools)

then copy the content of all the css files you see there into a common file, and save that file as collins.css in the same directory where you saved the script itself
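If copying the stylesheets by hand is too tedious, their addresses can also be collected with a few lines of Python. This is only a sketch: `StylesheetFinder` and `find_stylesheets` are made-up names, `example.invalid` is a placeholder, and inline `<style>` blocks are ignored. Each address returned would still have to be downloaded and appended to collins.css.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetFinder(HTMLParser):
    """Collect the absolute URLs of all <link rel="stylesheet"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "stylesheet" in rel and href:
            # Relative addresses are resolved against the page address.
            self.stylesheets.append(urljoin(self.base_url, href))


def find_stylesheets(html_text, base_url):
    finder = StylesheetFinder(base_url)
    finder.feed(html_text)
    return finder.stylesheets


sample = ('<html><head>'
          '<link rel="stylesheet" href="/css/main.css">'
          '<link rel="icon" href="/favicon.ico">'
          '</head><body></body></html>')
print(find_stylesheets(sample, "http://example.invalid/dictionary/cat"))
```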

Ansa211, could this be automated? Have the script be part of GoldenDict, or maybe a plugin, and then have it retrieve the CSS from websites automatically, giving the .css file whatever name the user gives to the online dictionary?

Also, is there any chance this would get implemented in a way that is more intuitive to those of us that don't code? Like, an adblock-style option to click on an element with the mouse to block it, or at least support for ABP filter lists?
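For what it's worth, the element-hiding part of ABP filter lists is simple enough to read with a short script. In those lists a line such as `example.com##.banner` means "hide everything matching the CSS selector `.banner` on example.com", a line starting with `##` applies to every site, and lines starting with `!` are comments. The sketch below handles only that subset (no exception rules, no negated domains); the function names and the `example.invalid` hosts are made up for illustration.

```python
def parse_hiding_rule(line):
    """Return (domains, selector) for an element-hiding rule, else None."""
    line = line.strip()
    if not line or line.startswith("!") or "##" not in line:
        return None
    domains, selector = line.split("##", 1)
    return [d for d in domains.split(",") if d], selector


def selectors_for(host, lines):
    """Collect the selectors that apply to the given host name."""
    selected = []
    for line in lines:
        rule = parse_hiding_rule(line)
        if rule is None:
            continue
        domains, selector = rule
        # A rule without domains is generic; otherwise the host must be one
        # of the listed domains or a subdomain of one of them.
        if not domains or any(host == d or host.endswith("." + d)
                              for d in domains):
            selected.append(selector)
    return selected


filter_list = [
    "! example filter list",
    "##.ad-banner",
    "example.invalid##div#header",
    "other.invalid##.sidebar",
]
print(selectors_for("www.example.invalid", filter_list))
```

The selectors collected this way could then be used to remove elements from a downloaded page before it is displayed.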

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <title>Iframe test for GoldenDict</title>
   </head>
   <body>
        Google search:
      <iframe src="http://www.google.com/search?q=hello"></iframe>
      <br>
        Google translate:
      <iframe src="https://translate.google.cn/#auto/zh-CN/hello"></iframe>
      <br>
        Baidu translate:
      <iframe src="http://fanyi.baidu.com/#auto/zh/hello"></iframe>
      <br>
        WordReference.com:
      <iframe src="http://www.wordreference.com/es/en/translation.asp?spen=ventana"></iframe>
      <br>
        testers:
      <iframe src="http://139.199.209.106/trans/google.action?from=en&to=zh&query=human"></iframe>
   </body>
</html>

"Baidu translate" works in GoldenDict, but "testers" does not!

I think the bug is in GoldenDict. Could it include a tiny browser?
Besides, GoldenDict on Windows doesn't support MP3 audio files. Could it embed a player?

For me, the link you give for "testers" does not work in a normal browser either (I get a "302 the document has moved" error; at the address it has moved to, I get a "violation of terms of service" message from Google). I therefore do not think this is a GoldenDict bug.

For the mp3 issue, file a separate bug report.

Unfortunately, "testers" didn't work today; it is not administered by me.

Try this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <title>Iframe test for GoldenDict</title>
   </head>
   <body>
        testers:
      <iframe src="http://139.199.209.106/trans/baidu.action?from=en&to=zh&query=human"></iframe>
   </body>
</html>

It can't be shown in GoldenDict either.

What is necessary for an online dictionary's results to be shown in GoldenDict?
It seems very hard...

I see!
This is a known bug: https://github.com/goldendict/goldendict/issues/628
A related issue has been reported as https://github.com/goldendict/goldendict/issues/517

We don't need to reinvent the wheel; we should use translate-shell or doodle-translate.

Here is a full explanation of what we could use to provide that feature easily:
https://stackoverflow.com/questions/49972190/how-to-simply-implement-online-dictionaries-for-goldendict/

If you want to use translate shell, I think you can simply install it on your computer and then integrate it with goldendict through goldendict's "script" type of dictionary. However, that does not address this issue, which is much more general - for any online dictionary that the user chooses to access through goldendict, how do we allow the user to configure which part of the webpage returned by the dictionary they want to see? Currently, the whole page is displayed by default. There is a global option that prohibits loading stuff from other servers than the one which is directly accessed, which usually gets rid of adverts, but sometimes also removes stuff that the user actually wants to see; it does not remove page headers and all the other unnecessary "decorations" of the online dictionaries.

Can you file a separate issue with the suggestion to integrate translate shell directly in goldendict core?
