Wikipedia articles about the same event in different language editions draw on different sources of information. Wikiwhere helps to answer where this information comes from by analyzing and visualizing the geographic locations of the external links displayed in a given Wikipedia article. Instead of relying solely on the IP location of a given URL, our machine learning models additionally consider the top-level domain and the website language.
Publication: Körner, M., Sennikova, T., Windhäuser, F., Wagner, C., & Flöck, F. (2016) Wikiwhere: An interactive tool for studying the geographical provenance of Wikipedia references. arXiv e-prints. (Updated version of the accepted abstract for the CSS Winter Symposium 2016 poster session.)
See also the related Wikipedia “Localness” project by Shilad Sen et al.
You might just want a starting point for checking the sources, but there may be other interesting things you can find out. For instance, you could compare different language versions of the same article to find out whether the versions have a geographical bias.
You could even run larger investigations on Wikipedia sources. Do national or cross-national patterns exist? Do certain language versions of Wikipedia use more diverse sources? There is still more you could find out. Get creative!
In the following we explain our approach and the different ways to use our web service. Additionally, we provide the source code for this website and the underlying Python code that runs the analysis and data extraction under an MIT license.
We use the term reference to refer to a URL that leads from a given Wikipedia article to another web page that is not associated with the Wikimedia Foundation. To obtain a training set for the machine learning model, we retrieved geo-location information on websites from DBpedia SPARQL endpoints. In order to evaluate the accuracy of this ground truth, we manually checked 255 locations for references that we extracted from the English DBpedia. The resulting accuracy was 95% (see DBpedia Location Extraction). We then randomly extracted URLs from Wikipedia articles that link to websites for which we have the geo-location. For this list of URLs we computed the IP location, top-level domain, and website language, and used these features in combination with the DBpedia geo-location on the country level to train the model.
The following subsections provide some more details on the individual steps.
In the prediction model we used three features: the IP location of the URL, the top-level domain (TLD) location of the URL, and the language of its content. The target variable is the country that we retrieved from the DBpedia geo-location linked to the given website. We built separate prediction models for the following languages: English, German, French, Italian, Spanish, Ukrainian, Slovak, and Dutch. These are the languages for which DBpedia knowledge bases currently exist. We also built a general model that combines the data from all DBpedia knowledge bases. The Python modules for the data collection can be found in the feature_extraction and location_extraction packages in our wikiwhere GitHub repository.
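The setup can be sketched as follows, assuming scikit-learn (the actual training code lives in the wikiwhere repository; the toy rows and feature values here are purely illustrative): the three categorical features are one-hot encoded and fed to a one-vs-one SVM.

```python
# Illustrative sketch of the model setup, assuming scikit-learn.
# Toy data only; the real features come from the wikiwhere pipeline.
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Each row: (IP location, TLD location, content language); target: country
X = [
    ["US", "COM", "EN"],
    ["DE", "DE", "DE"],
    ["FR", "FR", "FR"],
    ["US", "US", "EN"],
]
y = ["US", "DE", "FR", "US"]

# One-hot encode the categorical features; SVC supports one-vs-one
# multiclass classification via decision_function_shape="ovo".
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    SVC(decision_function_shape="ovo"),
)
model.fit(X, y)
print(model.predict([["DE", "DE", "DE"]])[0])
```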
There exist several public APIs that return the geo-coordinates for a given IP.
Our first implementation relied on the Google Maps API.
Yet, due to the high number of requests that we run for the analysis of larger Wikipedia articles, we quickly reached the daily limits.
Our final solution uses the geoip2 Python library in combination with the GeoLite2 data created by MaxMind.
This allows us to compute IP locations locally on a country level.
One potential source of wrong IP locations are websites that use content delivery networks such as Akamai, since their servers are globally distributed.
In such cases, the location of the content server can be misleading.
One goal of future work could be a better handling of such cases.
As a first step we created two datasets by parsing HTML tables.
The first, taken from the IANA website, contains information on whether a TLD is generic or a country-code.
The second, from the CIA, gives us the top-level domains corresponding to different country codes.
With the help of the Python package tld, we extracted the TLD from each URL.
If the TLD itself contains a ".", for example "co.uk", we only considered the part after the final "." for further analysis.
We then used our first dataset to find out whether the TLD is a country-code.
If that's the case we took the corresponding ISO 2 character code from the second dataset.
If it is not a country code, we use the TLD itself as the value, for example "COM".
Errors in this process can result from a bad URL, from the tld package not finding a TLD, or from TLDs unknown to the IANA dataset.
Empty values can result from a TLD that is a country code but does not correspond to any specific country, for example ".eu".
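The TLD step can be sketched using only the standard library (the project itself uses the tld package together with the IANA and CIA datasets; the small lookup tables below are illustrative stand-ins for those datasets):

```python
# Stdlib-only sketch of the TLD-location step; the lookup tables are
# small excerpts standing in for the parsed IANA and CIA datasets.
from urllib.parse import urlparse

COUNTRY_CODE_TLDS = {"uk": "GB", "de": "DE", "fr": "FR"}  # stand-in for the CIA dataset
GENERIC_TLDS = {"com", "org", "net"}                       # stand-in for the IANA dataset

def tld_location(url):
    host = urlparse(url).hostname or ""
    label = host.rsplit(".", 1)[-1].lower()  # for "co.uk", keep only "uk"
    if label in COUNTRY_CODE_TLDS:
        return COUNTRY_CODE_TLDS[label]      # ISO 2-character country code
    if label in GENERIC_TLDS:
        return label.upper()                 # generic TLD, e.g. "COM"
    return None                              # bad URL or TLD unknown to the dataset

print(tld_location("http://www.bbc.co.uk/news"))  # country-code TLD
print(tld_location("http://example.com/page"))    # generic TLD
```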
In order to determine the language of a given website, we first request the website content.
The next step is extracting the actual textual content of the website out of the HTML code: we convert the HTML to Markdown using the html2text package and then use the beautifulsoup package to extract the text from the Markdown via another conversion to HTML. After extracting the text, we use the langdetect package to detect the language.
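The core of this step, pulling the visible text out of the HTML so it can be handed to a language detector, can be sketched with the standard library alone (the project itself chains html2text, beautifulsoup, and langdetect; this is a simplified stand-in for that pipeline):

```python
# Stdlib-only sketch: extract visible text from HTML, skipping script
# and style content, as input for language detection.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = ("<html><body><h1>Hallo</h1>"
        "<script>var x=1;</script>"
        "<p>Dies ist ein Test.</p></body></html>")
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
print(text)  # the string that would be passed to the language detector
```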
We query DBpedia using the SPARQLWrapper package. The first part of getting a location is to associate a given URL with a DBpedia entity. For this entity we then query for its location, its location city, or the location of its parent company. The SPARQL query that we used can be found here. It is possible to copy this query into the field in the English DBpedia SPARQL endpoint. In cases where we retrieve more than one location for a URL, we perform a majority voting. For example, the SPARQL query returns four locations for the URL http://www.treasury.gov.au/. The geo-coordinates of all four locations differ only in the second digit after the decimal point. With our current threshold, geo-coordinates are considered the same if they differ by less than 0.1. Another example is http://www.bangladesh.gov.bd/maps/images/pabna/Chatmohar.gif, for which the geo-coordinates of two of the four locations differ by 0.2. In this case a majority voting takes place. Since there are two different locations which both appear two times, one of them is selected at random. For the English DBpedia, a majority voting was necessary for 5179 out of the 162827 URLs that we extracted from the SPARQL endpoint. In addition, we do not consider URLs that contain "web.archive.org" or "webcitation.org", since they usually reference another website and the DBpedia location also refers to the referenced website.
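The majority voting can be sketched as follows, assuming the 0.1-degree threshold described above (grouping coordinates by a representative first member is an illustrative simplification, and the sample coordinates are made up):

```python
# Sketch of the majority voting over multiple DBpedia locations.
# Coordinates closer than THRESHOLD degrees count as the same location.
import random

THRESHOLD = 0.1

def same_location(a, b):
    return abs(a[0] - b[0]) < THRESHOLD and abs(a[1] - b[1]) < THRESHOLD

def majority_location(coords):
    groups = []  # each group holds coordinates near its first member
    for c in coords:
        for group in groups:
            if same_location(group[0], c):
                group.append(c)
                break
        else:
            groups.append([c])
    best = max(len(g) for g in groups)
    winners = [g[0] for g in groups if len(g) == best]
    return random.choice(winners)  # ties between equal groups: random pick

# Four locations differing only in the second digit after the decimal
# point collapse into a single group:
print(majority_location([(-35.30, 149.12), (-35.31, 149.12),
                         (-35.31, 149.13), (-35.30, 149.13)]))
```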
Since the quality of the geo-location that we extract from DBpedia is crucial for the performance of the machine learning and thereby the predictions of our service, we performed a manual evaluation. In the following we use the term entity to refer to the subject the website belongs to, for example companies, schools, or the government. The location result (LR) is the geo-coordinate acquired from DBpedia queries. For the manual evaluation we used the following rules:
The result of the data collection was a total of 233932 URLs with a location from DBpedia and for which we extracted the IP location, TLD location, and website language. On this data we applied a variety of statistical models, including logistic regression, random forests, and support vector machines (SVMs). SVMs consistently provided the most accurate prediction of a location. We used a one-vs-one multiclass classifier. We trained the models separately for each of our Wikipedia language editions. We also trained a general prediction model based on merged data from all DBpedia knowledge bases; we use this general model for all other languages. To evaluate the performance of our models, we used 10-fold cross-validation.
Table 1 shows the accuracy of the models. First we checked the accuracy over all the data we have, represented by the entry "All data - Model". Then we checked how well the models handle difficult cases in which all three features disagree, represented by the entry "Difficult cases - Model". As the baseline we used the IP location alone.
| All data - Model | 81% | 81% | 91% | 90% | 75% | 96% | 91% | 96% | 92% | 98% |
| All data - IP only (Baseline) | 61% | 30% | 62% | 77% | 29% | 86% | 73% | 86% | 81% | 80% |
| Difficult cases - Model | 77% | 78% | 86% | 80% | 71% | 89% | 85% | 91% | 85% | 93% |
| Difficult cases - IP only (Baseline) | 30% | 57% | 64% | 25% | 81% | 66% | 80% | 74% | 79% | 53% |
Table 2 presents the importance of each parameter of the learning models. The number in each cell reflects how well a particular parameter can describe the variance of the ground truth. To obtain these data we calculated how often a particular parameter agrees with the ground truth.
Due to the poor data quality of some DBpedia language editions, we decided to include one exception in our final classification: if our machine learning model predicts a country that appears in none of the three features, we instead use the IP location as the classification. One concrete example where this helps, in the German model, is the case where the IP location is "US", the TLD is "COM", and the website language is "EN". For this combination, our training data contains 776 URLs with the DBpedia location "FR" and only 460 URLs with the location "US". One goal of future work should therefore be to further improve the input data, either by modifying the SPARQL queries or by switching to another source for website locations.
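This exception can be sketched as follows (the function name is illustrative, not taken from the wikiwhere code):

```python
# Sketch of the fallback rule: if the model's predicted country matches
# none of the three feature values, use the IP location instead.
def final_classification(predicted, ip_location, tld, language):
    if predicted not in (ip_location, tld, language):
        return ip_location
    return predicted

# German-model example from the text: the features are US/COM/EN, but
# skewed training data makes the model predict "FR"; the IP location wins.
print(final_classification("FR", "US", "COM", "EN"))  # falls back to "US"
```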
In the following we will give some examples of different ways to run the analysis and access the results.
The easiest way is to use the web interface that we provide on our homepage: insert a valid Wikipedia article URL into the input box and press "Get Analysis". If the option "Fresh crawl" is not selected and the article was analysed before, the previous results are displayed. Otherwise, a new analysis is executed on the server. Currently we allow up to ten parallel analyses. Since we extract the content of all websites linked in the given article, the analysis can take several minutes, depending on the number of external links. The plotted results are shown on a separate webpage.
It is possible to access the plotted results via URL parameters:
For example, the German Wikipedia article Test can be accessed with:
The URL parameter new-crawl can be used to force a new analysis:
Again, the analysis can take several minutes, depending on the number of external links in the article.
Previous analyses can be accessed via the Articles tab on the website.
The results are stored in folders based on the Wikipedia language edition and the article title.
For example, the analysis results (analysis.json) for the German Wikipedia article Test can be found at: http://wikiwhere.west.uni-koblenz.de/articles/de/Test/
In addition to the analysis results, we also provide a file called visualization-redirect.php that performs a redirect to the visualization page of the corresponding article.
In order to retrieve the analysis.json file with wget, the following command can be used:
wget "http://wikiwhere.west.uni-koblenz.de/json.php?url=[article-url]" -O [file-name].json
Again, it is possible to use the new-crawl parameter to force a new analysis.
A concrete example for the German Wikipedia article Test:
wget "http://wikiwhere.west.uni-koblenz.de/json.php?url=https://de.wikipedia.org/wiki/Test" -O de-Test.json
The source code for this website is on GitHub at https://github.com/mkrnr/wikiwhere-website.
For the analysis we have written Python modules, which are also on GitHub at https://github.com/mkrnr/wikiwhere.
The code in both repositories is available under an MIT license.