About Wikiwhere

Wikiwhere is the result of the 2015-2016 research lab at the University of Koblenz-Landau, carried out in cooperation with GESIS – Leibniz Institute for the Social Sciences.

Wikipedia articles about the same event in different language editions draw on different sources of information. Wikiwhere helps to answer the question of where this information comes from by analyzing and visualizing the geographic location of external links that are displayed in a given Wikipedia article. Instead of relying solely on the IP location of a given URL, our machine learning models additionally consider the top level domain and the website language.

Publication: Körner, M., Sennikova, T., Windhäuser, F., Wagner, C., & Flöck, F. (2016) Wikiwhere: An interactive tool for studying the geographical provenance of Wikipedia references. arXiv e-prints. (Updated version of the accepted abstract for the CSS Winter Symposium 2016 poster session.)

See also the related Wikipedia “Localness” project by Shilad Sen et al.

Why should I use Wikiwhere?

You might just want a starting point for checking the sources, but there are other interesting things you can find out. For instance, you could compare different language versions of the same article to find out whether the different versions have a geographical bias.

Example: comparison between the German and Russian versions of an article.

You could even carry out larger investigations of Wikipedia sources. Do national or cross-national patterns exist? Do certain language versions of Wikipedia use more diverse sources? There is still more you could find out. Get creative!

In the following we explain our approach and the different ways of using our web service. Additionally, we provide the source code for this website and the underlying Python code that runs the analysis and data extraction under an MIT license.

Approach

We use the term reference to refer to a URL that leads from a given Wikipedia article to another web page that is not associated with the Wikimedia Foundation. To obtain a training set for the machine learning model, we retrieved geo-location information on websites from DBpedia SPARQL endpoints. In order to evaluate the accuracy of this ground truth, we manually checked 255 locations for references that we extracted from the English DBpedia. The resulting accuracy was 95% (see DBpedia Location Extraction). We then randomly extracted URLs from Wikipedia articles which link to websites for which we have the geo-location. For this list of URLs we computed the IP location, top level domain, and website language and used these features in combination with the DBpedia geo-location on the country level to train the model.

The following subsections provide some more details on the individual steps.

Data Collection

In the prediction model we used three features: the IP location of the URL, the top level domain (TLD) location of the URL, and the language of its content. The target variable is the country that we retrieved from the DBpedia geo-location linked to the given website. We built separate prediction models for the following languages: English, German, French, Italian, Spanish, Ukrainian, Slovak, and Dutch. These are the languages for which DBpedia knowledge bases currently exist. We also built a general model that combines the data from all DBpedia knowledge bases. The Python modules for the data collection can be found in the feature_extraction and location_extraction packages in our wikiwhere GitHub repository.
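
To make this setup concrete, a single training record could look as follows. This is a hedged illustration only; the field names and the example values are ours, not taken from the repository:

    # One training example: three features plus the DBpedia country as target.
    record = {
        "url": "http://www.bbc.co.uk/news",
        "ip_location": "GB",        # country derived from the IP address of the host
        "tld_location": "GB",       # country derived from the top level domain
        "website_language": "en",   # detected language of the page content
        "dbpedia_location": "GB",   # target: country retrieved from DBpedia
    }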

IP Location Extraction

There are several public APIs that return the geo-coordinates for a given IP. Our first implementation relied on the Google Maps API. However, due to the high number of requests needed to analyze larger Wikipedia articles, we quickly reached the daily limits. Our final solution uses the geoip2 Python library in combination with the GeoLite2 data created by MaxMind, available from http://www.maxmind.com. This allows us to compute IP locations locally on a country level. One potential source of wrong IP locations are websites that use content delivery networks such as Akamai, since their servers are globally distributed and the location of the content server can therefore be misleading. Better handling of such cases is a goal for future work.
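
A minimal sketch of such a local lookup with the geoip2 package, assuming a downloaded GeoLite2-Country.mmdb file; the file path and host name below are placeholders:

    # Resolve a host name to an IP address and look up its country locally.
    import socket

    import geoip2.database

    reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # path is a placeholder
    ip = socket.gethostbyname("www.example.com")
    print(reader.country(ip).country.iso_code)  # e.g. "US"
    reader.close()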

Top Level Domain Extraction

As a first step we created two datasets by parsing HTML tables. The first, taken from the IANA website, indicates whether a TLD is generic or a country code. The second, from the CIA, maps country-code TLDs to their corresponding country codes. With the help of the Python package tld we extract the TLD from a URL. If the TLD contains a ".", for example "co.uk", we only consider the part after the final "." for further analysis. We then use the first dataset to find out whether the TLD is a country code. If so, we take the corresponding ISO 2-character code from the second dataset. If it is not a country code, we use the TLD itself as the parameter, for example "COM". Errors in this process occur when the tld package reports a bad URL or cannot find a TLD, or when the TLD is unknown to the IANA dataset. Empty values can result from a TLD that is a country code but does not correspond to any specific country, for example ".eu".
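
A hedged sketch of this procedure using the tld package; the country-code lookup table below is a tiny illustrative stand-in for the IANA and CIA datasets, not the datasets themselves:

    # Extract the TLD, keep only the part after the final ".", and map
    # country-code TLDs to ISO country codes.
    from tld import get_tld

    COUNTRY_CODE_TLDS = {"uk": "GB", "de": "DE", "fr": "FR"}  # illustrative subset

    def tld_location(url):
        tld = get_tld(url, fail_silently=True)  # e.g. "co.uk"; None for a bad URL
        if tld is None:
            return None
        suffix = tld.rsplit(".", 1)[-1].lower()  # e.g. "uk"
        if suffix in COUNTRY_CODE_TLDS:
            return COUNTRY_CODE_TLDS[suffix]     # ISO 2-character country code
        return suffix.upper()                    # generic TLD, e.g. "COM"

    print(tld_location("http://www.bbc.co.uk/news"))  # "GB"
    print(tld_location("http://www.example.com/"))    # "COM"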

Website Language Extraction

In order to determine the language of a given website, we first request the website content using the urllib package. The next step is extracting the actual textual content of the website from the HTML code.

This is done by first generating Markdown text with the html2text package, converting the Markdown back to HTML, and then extracting the plain text with the beautifulsoup package. On the extracted text we use the langdetect package to detect the language.
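
The pipeline could be sketched as follows; the library choices follow the text, while the markdown package used for the Markdown-to-HTML conversion is our assumption:

    # Fetch a page, strip its markup, and detect the language of the remaining text.
    import urllib.request

    import html2text
    import markdown
    from bs4 import BeautifulSoup
    from langdetect import detect

    def website_language(url):
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        md = html2text.html2text(html)                  # HTML -> Markdown
        text = BeautifulSoup(markdown.markdown(md),     # Markdown -> HTML -> plain text
                             "html.parser").get_text()
        return detect(text)                             # ISO 639-1 code, e.g. "de"

    print(website_language("https://de.wikipedia.org/wiki/Test"))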

DBpedia Location Extraction

In order to gather a large amount of geo-locations for websites, we used the DBpedia SPARQL endpoints. The SPARQL requests were made with the SPARQLWrapper package. The first step in getting a location is to associate a given URL with a DBpedia entity. For this entity we then query its location, location city, or the location of its parent company. The SPARQL query that we used can be found here. It is possible to copy this query into the query field of the English DBpedia SPARQL endpoint. In cases where we retrieve more than one location for a URL, we perform a majority vote. For example, the SPARQL query returns four locations for the URL http://www.treasury.gov.au/. The geo-coordinates of all four locations only differ in the second digit after the decimal point, and with our current threshold, geo-coordinates are considered the same if they differ by less than 0.1. Another example is http://www.bangladesh.gov.bd/maps/images/pabna/Chatmohar.gif, for which the geo-coordinates of two of the four locations differ by 0.2. In this case a majority vote takes place. Since there are two different locations which both appear two times, one of them is selected at random. For the English DBpedia, a majority vote was necessary for 5179 out of the 162827 URLs that we extracted from the SPARQL endpoint. In addition, we do not consider URLs that contain "web.archive.org" or "webcitation.org" since they usually reference another website and the DBpedia location refers to that referenced website.
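
The majority vote over near-identical coordinates could look like this hedged sketch; the function and variable names are ours, while the 0.1 threshold and the random tie-breaking follow the text:

    # Group coordinates that differ by less than the threshold and pick the largest group.
    import random

    def majority_location(coordinates, threshold=0.1):
        groups = []
        for lat, lon in coordinates:
            for group in groups:
                g_lat, g_lon = group[0]
                if abs(lat - g_lat) < threshold and abs(lon - g_lon) < threshold:
                    group.append((lat, lon))
                    break
            else:
                groups.append([(lat, lon)])
        largest = max(len(g) for g in groups)
        # Ties between equally large groups are broken at random, as described above.
        return random.choice([g for g in groups if len(g) == largest])[0]

    # Four coordinates that only differ in the second digit collapse into one group.
    print(majority_location([(-35.30, 149.12), (-35.31, 149.13), (-35.30, 149.12), (-35.31, 149.12)]))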

Since the quality of the geo-locations that we extract from DBpedia is crucial for the performance of the machine learning model and thereby for the predictions of our service, we performed a manual evaluation. In the following we use the term entity to refer to the subject a website belongs to, for example a company, a school, or a government. The location result (LR) is the geo-coordinate acquired from the DBpedia queries. For the manual evaluation we used the following rules:

  1. If the entity is clearly based and active (for companies, etc.) within only one country, we evaluated every LR within that country as correct and anything else as wrong.
  2. If the entity is active internationally and has publicly available information about the location of its headquarters, we evaluated every LR within the country of the headquarters as correct and anything else as wrong.
  3. If the entity is active internationally and has no publicly available information about the location of its headquarters, we evaluated every LR that can be related to one of its offices as correct.
  4. If the website was not reachable (offline, etc.), we recorded "not found" and removed these cases from the statistics.
The evaluation showed an accuracy of 95% for the ground truth. The website was not reachable in 8 out of 255 cases.

Learning model

The result of the data collection was a total of 233932 URLs with a location from DBpedia and for which we extracted the IP location, TLD location, and website language. To this data we applied a variety of statistical models, including logistic regression, random forests, and support vector machines (SVMs). SVMs consistently provided the most accurate prediction of a location. We used a one-vs.-one multiclass classifier and trained the models separately for each of our Wikipedia language editions. We also trained a general prediction model based on the merged data from all DBpedia knowledge bases, which we use for all languages. To evaluate the performance of our models, we used 10-fold cross-validation.
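
A hedged sketch of this training step, assuming scikit-learn (the library is not named above) and a tiny toy data set in place of the 233932 extracted URLs:

    # Train a one-vs-one SVM on one-hot encoded categorical features and
    # evaluate it with cross-validation.
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.svm import SVC

    # Features: IP location, TLD location, website language; target: DBpedia country.
    X = [
        ["US", "COM", "en"], ["US", "COM", "en"], ["US", "ORG", "en"], ["US", "COM", "en"],
        ["DE", "DE", "de"], ["DE", "DE", "de"], ["DE", "COM", "de"], ["DE", "DE", "en"],
        ["FR", "FR", "fr"], ["FR", "FR", "fr"], ["FR", "COM", "fr"], ["FR", "FR", "en"],
    ]
    y = ["US"] * 4 + ["DE"] * 4 + ["FR"] * 4

    # SVC performs one-vs-one multiclass classification internally.
    model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), SVC(kernel="linear"))

    # We used 10-fold cross-validation; 4 folds are shown here only because the toy
    # data set is tiny.
    print(cross_val_score(model, X, y, cv=4).mean())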

Table 1 shows the accuracy of the models. First we measured the accuracy over all the data we have, shown in the row "All data - Model". Then we measured how well the models handle difficult cases in which all the parameters disagree, shown in the row "Difficult cases - Model". As the baseline we used the IP location alone.

Table 1. Accuracy of the models

Method                                  General  EN   FR   DE   ES   UK   IT   NL   SV   CS
All data - Model                        81%      81%  91%  90%  75%  96%  91%  96%  92%  98%
All data - IP only (Baseline)           61%      30%  62%  77%  29%  86%  73%  86%  81%  80%
Difficult cases - Model                 77%      78%  86%  80%  71%  89%  85%  91%  85%  93%
Difficult cases - IP only (Baseline)    30%      57%  64%  25%  81%  66%  80%  74%  79%  53%

Table 2 presents the importance of each parameter of the learning models. The number in each cell reflects how well a particular parameter can describe the variance of the ground truth. To obtain these data we calculated how often a particular parameter agrees with the ground truth.

Table 2. Parameter contribution over all data

Model     IP location   TLD location   Website language
General   61%           58%            25%
EN        30%           13%            2%
FR        62%           73%            23%
DE        77%           68%            42%
ES        29%           30%            7%
UK        86%           89%            29%
IT        73%           70%            27%
NL        86%           76%            47%
SV        81%           82%            29%
CS        80%           78%            34%

Classification Fix

Due to the poor data quality of some of the DBpedia language editions, we decided to include one exception in our final classification: if our machine learning model predicts a country that appears in none of the three features, we instead use the IP location as the classification. One concrete example where this helps is, for the German model, the case where the IP location is "US", the TLD is "COM", and the website language is "EN". For this combination, our training data contains 776 URLs with the DBpedia location "FR" and only 460 URLs with the location "US". A goal for future work is therefore to further improve the input data by either modifying the SPARQL queries or switching to another source of website locations.
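
A hedged sketch of this exception; the function name is ours, and comparing the predicted country directly against the website language feature is our simplification of "appears in none of the three features":

    # Fall back to the IP location if the model predicts a country that matches
    # none of the three input features.
    def final_classification(predicted, ip_location, tld, language):
        if predicted not in {ip_location, tld, language}:
            return ip_location
        return predicted

    # The German-model example from the text: IP "US", TLD "COM", language "EN",
    # model prediction "FR" -> the IP location "US" is used instead.
    print(final_classification("FR", "US", "COM", "EN"))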

Usage

In the following we will give some examples of different ways to run the analysis and access the results.

Via the Web Interface

The easiest way is to use the web interface that we provide on our homepage. Insert a valid Wikipedia article URL into the input box and press "Get Analysis". If the option "Fresh crawl" is not selected and the article was analysed before, the previous results are displayed. Otherwise, a new analysis is executed on the server. Currently we allow up to ten parallel analyses. Since we extract the content of all websites linked in the given article, the analysis can take several minutes, depending on the number of external links in the article. The plotted results are shown on a separate webpage.

Via URL Parameters

It is possible to access the plotted results via URL parameters:
http://wikiwhere.west.uni-koblenz.de/article.php?url=[article-url]
For example, the German Wikipedia article Test can be accessed with:
http://wikiwhere.west.uni-koblenz.de/article.php?url=https://de.wikipedia.org/wiki/Test
The URL parameter new-crawl can be used to force a new analysis:
http://wikiwhere.west.uni-koblenz.de/article.php?url=https://de.wikipedia.org/wiki/Test&new-crawl=true
Again, the analysis can take several minutes, depending on the number of external links in the article.

Via the File Browser

Previous analyses can be accessed via the Articles tab on the website. The results are stored in folders based on the Wikipedia language edition and the article title.
For example, the analysis results (analysis.json) for the German Wikipedia article Test can be found at: http://wikiwhere.west.uni-koblenz.de/articles/de/Test/ In addition to the analysis results, we also provide a file called visualization-redirect.php that redirects to the visualization page of the corresponding article.

Via wget

In order to retrieve the analysis.json file with wget, the following command can be used:
wget "http://wikiwhere.west.uni-koblenz.de/json.php?url=[article-url]" -O [file-name].json
Again, it is possible to use the new-crawl parameter to force a new analysis.
A concrete example for the German Wikipedia article Test:
wget "http://wikiwhere.west.uni-koblenz.de/json.php?url=https://de.wikipedia.org/wiki/Test" -O de-Test.json

Source Code

The source code for this website is on GitHub at https://github.com/mkrnr/wikiwhere-website.
For the analysis we have written Python modules which are also on GitHub at https://github.com/mkrnr/wikiwhere.
The code in both repositories is available under an MIT license.