4b Python assignment TED talks

Assignment 4b

A corpus of TED talks and their translations
In this assignment, you are going to work with a corpus of transcribed TED talks (English) and their translations into other languages.
The resource is called WIT3 – acronym for Web Inventory of Transcribed and Translated Talks.
The learning goals of this assignment are the following:
Become comfortable with extracting information from xml
Learn how to explore a corpus consisting of multiple files
Learn how to map information across multiple files
Take the first steps in independently structuring code
If you are interested in learning more about the resource and what it is used for, you can check the following paper:
M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proc. of EAMT, pp. 261-268, Trento, Italy [pdf].
Step 1: Prepare the data and understand the corpus structure
1.) Download
Please go to the following url: https://wit3.fbk.eu/home
Download the latest version of the corpus by clicking on “Talks in XML format (109 languages)”.
Place the download in the Data directory (../Data/). Please create a new folder called ted-talks and move the download there. Unpack it (this should work by double-clicking on it).
The corpus you downloaded contains multiple releases of the data. Your data should be at ../Data/ted-talks/XML_releases/.
Please run the following cell below to check.
You should see these files and subdirectories:
tools.html
wit3_xml-20140120.zip
xml-20150616/
xml-20140120/
The directories starting with ‘xml’ contain different versions of the data. We will focus on the most recent version of the data in the directory xml/. Feel free to remove the remaining two directories.
# mac os/linux:
%ls ../Data/ted-talks/XML_releases/
# windows (please adapt if necessary in this and all following cells):
%ls ..\Data\ted-talks\XML_releases\
2.) Unzipping the xml files
If you run the next cell, you will see that the xml files are still zipped.
Mac OS/Linux To unzip all files at once, navigate to the directory on your command line and run: unzip "*.zip"
Windows On Windows, the unzip command will not work. Instead, the following PowerShell command should do the trick (first navigate to the directory on your command line):
Get-ChildItem 'path to folder' -Filter *.zip | Expand-Archive -DestinationPath 'path to folder' -Force
Alternatively, you can install the Windows Subsystem for Linux (WSL), which gives you an Ubuntu terminal and thus access to all the Linux commands on your own Windows computer.

%ls ../Data/ted-talks/XML_releases/xml
3.) Understanding the structure of the corpus
List only the xml files in the corpus using the cell below.
You will see that all file names follow the same convention: ted-[language]-[release-date].xml
Each file contains all talks translated from English into the target language (i.e. the language in the file name). Not all talks are translated into all languages. The original English talks are in ../Data/ted-talks/XML_releases/xml/ted_en-20160408.xml.

%ls ../Data/ted-talks/XML_releases/xml/*.xml
4.) Understanding the xml structure and finding translations of the same text
To extract a translation of a particular text, we have to look into the xml structure of a single file. Open the xml file with the translations into Dutch in an editor (e.g. atom).
Try to get a global idea of the structure. The following questions can help to guide you:
(1) How many different talks does the file contain? (Hint: Scroll all the way down.)
(2) Which tags indicate new talks?
(3) Where do you find meta-information about a talk? Where do you find the translated text of the talk? Where do you find the transcription of the video?
(4) Where do you find the identifier of the talk? Hint: look for a tag called ‘talkid’. You can use this information to match translations with original talks.

Tip: Load one file and explore the information given about one talk using lxml.etree in Python. You can use the code below to get started.
from lxml import etree as et

test = '../Data/ted-talks/XML_releases/xml/ted_nl-20160408.xml'
tree = et.parse(test)
root = tree.getroot()

# The language is again provided in an attribute of the root:
print(root.attrib)

# Explore the first layer of xml tags:
for ch in root:
    print(ch.tag)
# Tip: Is there more information you can access at this point? Explore text and attributes.

# Explore a single talk
## First extract all talks (hint: each 'file' element captures one talk)
talks = root.findall('file')
print(len(talks))

## Pick one talk to explore
test_talk = talks[5]
for ch in test_talk:
    print(ch.tag)

# Explore the meta information:
head = test_talk.find('head')
for ch in head:
    print(ch.tag)
Assignment
In this assignment, you will write code to explore the dataset. For instance, you will answer questions such as
How many talks are there in total? How many languages do the translations cover?
What is the oldest/latest talk?
Which speaker is most widely translated?
Don’t worry if you don’t know how to solve these questions just now. You will start by working on the English talks (i.e. a single xml file). This will give you a feeling for the xml structure. You will write a number of small functions to extract information and compare talks. You should then be able to reuse your functions to explore the translations (i.e. work on multiple xml files).
Please write doc strings for all of your functions to get full points.
Part I: Analyze the original English talks
This part of the assignment only requires you to analyze the content of the xml file containing the original talks (in English).
Create a Python script which gives you the following information:

What is the longest talk (in terms of word count), what is the shortest talk (in terms of word count), what is the average word count? (id and title, numbers) (find_wc)
Oldest and latest talk (id and title, dates) (find_date)
Is there a speaker with multiple talks? (function: find_speaker)
How many English talks are there in total? (No function required, you can simply use a built-in function on the list of all English talk elements.)
Each of these aspects should be covered by a single function. Below, we give you some instructions about what the functions should do:

find_wc:
input: list of all talk elements (positional), length (longest/shortest, keyword argument)
output: title(s), id(s), mean word count
find_date:
input: list of all talk elements (positional), time (latest/oldest, keyword argument)
output: title(s), id(s)
find_speaker:
input: list of talk elements (positional)
output: dict mapping speakers with more than one talk to their talks (tuple of talk title and id)
Note: All of your functions should return the talk title and id of the talk. The reason for this is that we use the id information in Part II of the assignment. We will need ids to find translation pairs.
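A minimal sketch of what find_speaker could look like. The assignment confirms a tag called talkid inside the talk's meta-information; the tag names speaker and title under head are assumptions you should verify against the actual file:

```python
def find_speaker(talks):
    """Map speakers with more than one talk to their talks.

    talks -- list of talk ('file') elements

    Returns a dict mapping speaker names to a list of (title, id) tuples,
    keeping only speakers with more than one talk.
    """
    by_speaker = {}
    for talk in talks:
        head = talk.find('head')
        # 'speaker' and 'title' are assumed tag names; 'talkid' is given
        # in the assignment description
        speaker = head.find('speaker').text
        title = head.find('title').text
        talk_id = head.find('talkid').text
        by_speaker.setdefault(speaker, []).append((title, talk_id))
    return {s: t for s, t in by_speaker.items() if len(t) > 1}
```

The function works on already-parsed elements, so it can be reused unchanged for the translation files in Part II.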

The script should execute all functions.
Code structure
The script should only contain the functions listed above. If you use helper functions (highly recommended, see tips below), please put them in a script called utils.py and import them in the script called ted_english_analysis.py.
You can use a main function (main()), which calls all functions (not compulsory).
Informative print statements
Please use print statements to indicate what the function outputs refer to. Please don’t forget to print the total number of talks. We recommend using f-strings, for instance:

n_talks = len(talks)
print(f'In total, there are {n_talks} English TED talks in the dataset.')

The script should be called `ted_english_analysis.py` and execute all functions when called from the command line. Add print statements so the output can be interpreted. For example, the script should print:
The total number of English talks is: [total number]
Talk length:
Longest talk: [title] (id: [id])
Shortest talk: [title] (id: [id])
Mean word count: [mean word count]

You will receive points for the correct functions (doc string, definition, and correct output), the print statements, and the script(s).
Recommended helper functions
We highly recommend defining small helper functions that extract information from a single talk element.
First think about the different pieces of information you will need:
word count
date
speaker
title and talk id
In addition, you want to load your file as an xml tree, access the root and extract all talks (as a list of xml elements). Tip: Use one element in the list to develop and test your helper functions.

# For example:
from lxml import etree as et

def load_root(path):
    """Find the root of an xml file given a filepath (str)."""
    tree = et.parse(path)
    root = tree.getroot()
    return root

def get_talks(root):
    """Get all talk elements from an xml file."""
    talks = root.findall('file')
    return talks

path = '../Data/ted-talks/XML_releases/xml/ted_en-20160408.xml'
root = load_root(path)
talks = get_talks(root)
print(len(talks))

# This can be your test example
test_talk = talks[3]
Part II: Analyze the translations
This part of the assignment requires you to compare information across multiple files.
Which language has the most translations? Which language has the least translations?
Which talk(s) is (are) translated into most languages? Please provide the English title(s) and the talk ids.
BONUS (just for fun – no points): What is the word for 'applause' in the languages represented in the corpus?
Please create a script called ted_translation_analysis.py. The script should print answers to the questions above. You can reuse (import, copy, modify) functions you created for Part I.
Below, you will find several steps that guide you through the assignment. Code following these steps will earn you points (even if you do not manage to get the correct final output).
Step 1. Map languages to filepaths (x points)
Answering the questions above will require you to load and analyze xml files containing the translations of the English talks. It will be useful to have a dictionary that maps languages to filepaths. For example, the dictionary should contain the following entries:
'nl'   : '../Data/ted-talks/XML_releases/xml/ted_nl-20160408.xml',
'it'   : '../Data/ted-talks/XML_releases/xml/ted_it-20160408.xml',
'fr-ca': '../Data/ted-talks/XML_releases/xml/ted_fr-ca-20160408.xml',
You can use the os or glob package to get a list of all filepaths in the directory ../Data/ted-talks/XML_releases/xml/. Note that the language is provided between ted_ and the release information (-20160408). Use string manipulation to access the language information. Attention: Some languages contain a hyphen (e.g. Canadian French fr-ca).
In this assignment, you do not have to spell out the languages. It is alright if you provide the shortened names as they appear in the filepaths.
Write a function (called map_languages_to_paths) that returns the dictionary.
# Example string manipulation (feel free to choose a different strategy)
lang_id = 'fr-ca-20160408'
rev = lang_id[::-1]
# split only at the first hyphen of the reversed string
# (i.e. the last hyphen of the original):
f_id, lang = rev.split('-', 1)
print(lang[::-1])
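One possible sketch of map_languages_to_paths, using glob and rpartition to split off the release date from the right so hyphenated codes like fr-ca survive. The directory argument is a parameter added here for testability; in your script you would pass '../Data/ted-talks/XML_releases/xml/':

```python
import glob
import os

def map_languages_to_paths(directory):
    """Map language codes to the filepaths of the corresponding xml files.

    directory -- path to the folder holding the ted_[lang]-[date].xml files
    """
    lang_to_path = {}
    for path in glob.glob(os.path.join(directory, 'ted_*.xml')):
        filename = os.path.basename(path)
        # strip 'ted_' and '.xml', then cut off the release date at the
        # last hyphen so codes like 'fr-ca' stay intact
        stem = filename[len('ted_'):-len('.xml')]
        lang, _, date = stem.rpartition('-')
        lang_to_path[lang] = path
    return lang_to_path
```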
Step 2: Write a function which returns the language with the most/least translations
Name: find_coverage
Input:
dictionary mapping languages to paths (positional)
most/least translations (e.g. 'most', 'least') (positional)
Output:
a dictionary with language(s) (keys) and the respective number of translated talks (values)
Tip: You can simply check the number of talks in each xml file corresponding to a language.
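Following that tip, a sketch of find_coverage. The standard-library xml.etree.ElementTree is used here so the sketch is self-contained; the lxml calls used elsewhere in the assignment work identically for these operations:

```python
import xml.etree.ElementTree as ET

def find_coverage(lang_to_path, extreme):
    """Find the language(s) with the most or least translated talks.

    lang_to_path -- dict mapping language codes to xml filepaths
    extreme      -- 'most' or 'least'

    Returns a dict mapping the language(s) at the chosen extreme to
    their number of translated talks.
    """
    counts = {}
    for lang, path in lang_to_path.items():
        root = ET.parse(path).getroot()
        # each 'file' element corresponds to one translated talk
        counts[lang] = len(root.findall('file'))
    target = max(counts.values()) if extreme == 'most' else min(counts.values())
    # several languages can share the same count, hence a dict
    return {lang: n for lang, n in counts.items() if n == target}
```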

Step 3: Map talk ids to titles
You can use talk ids to map English talks to their translations in other languages. Most of your code will work with these ids. In the end, you should map talk ids to titles.
Write a function called get_id_title_dict that maps talk ids to English titles. Your function should take the path to the English file as input and return a dictionary (keys: talk ids, values: English talk titles).
Tip: Reuse functions from the previous assignment.
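A possible sketch of get_id_title_dict (again using the standard-library ElementTree so the sketch runs on its own; lxml behaves the same here). The title tag name inside head is an assumption to check against the file:

```python
import xml.etree.ElementTree as ET

def get_id_title_dict(path):
    """Map talk ids to English talk titles.

    path -- filepath to the English xml file
    """
    root = ET.parse(path).getroot()
    id_to_title = {}
    for talk in root.findall('file'):
        head = talk.find('head')
        # 'talkid' is named in the assignment; 'title' is an assumed tag name
        id_to_title[head.find('talkid').text] = head.find('title').text
    return id_to_title
```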
Step 4 Map talks to languages they have been translated into
Function name: map_talks_to_languages
Input: language filepath dict (result of Step 1)
Output: a dictionary mapping talk ids to languages with translations of the talk. The dictionary should have the following structure (the example is made up):
’10’: [‘hy’, ‘nl’, ‘de’, ‘fr-ca’]
’20’: [‘pl’, ‘da’, ‘nl’, ‘oc’, ‘ar’]

Tip: You can use defaultdict for this step.
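Using that tip, a sketch of map_talks_to_languages with a defaultdict (standard-library ElementTree again for self-containment; swap in lxml as you prefer):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def map_talks_to_languages(lang_to_path):
    """Map talk ids to the languages a talk has been translated into.

    lang_to_path -- dict mapping language codes to xml filepaths
    """
    talk_to_langs = defaultdict(list)
    for lang, path in lang_to_path.items():
        root = ET.parse(path).getroot()
        for talk in root.findall('file'):
            talk_id = talk.find('head').find('talkid').text
            # a talk appearing in this file means it is translated
            # into this language
            talk_to_langs[talk_id].append(lang)
    return talk_to_langs
```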
Step 5: Map number of languages to talks
Goal: You want to know which talks have been translated into how many languages. In the next step, you will want to rank talks by how many languages they've been translated into (or directly get the highest or lowest number of translations). To do this, it is useful to have a mapping from the number of translations to the talk ids.

Function name: map_nlang_to_talks
Input: dictionary mapping talk ids to languages (list)
Output: dictionary mapping number of translations (int) to talks (list of talk ids) having the following structure (this is not the correct output – just an example of the structure):
30 : [‘200′, ’10’, ’31’]
29 : [‘201’, ‘9’, ‘7’]
47 : [‘1′, ’14’, ‘209’, ‘5’]
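This inversion is a short pure function, for example:

```python
from collections import defaultdict

def map_nlang_to_talks(talk_to_langs):
    """Invert a talk-to-languages mapping.

    talk_to_langs -- dict mapping talk ids to lists of languages

    Returns a dict mapping a number of translations (int) to the list
    of talk ids with exactly that many translations.
    """
    nlang_to_talks = defaultdict(list)
    for talk_id, langs in talk_to_langs.items():
        nlang_to_talks[len(langs)].append(talk_id)
    return nlang_to_talks
```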
Step 6: Put it all together
Put the functions you wrote above together to find the talk(s) that has (have) been translated into the most or least languages.
Name: find_top_coverage
Input: * dict mapping languages to filepaths (created in Step 1) * most or least translations (‘most’ or ‘least’)
Output: A dictionary mapping the most/least talk titles to the languages they have been translated into. For example (this is not the correct solution – we just use it to show the structure):
"Dan Gross: Why gun violence can't be our new normal": ['de', 'nl', 'it']
"Angélica Dass: The beauty of human skin in every color": ['fr-ca', 'de', 'ceb']
Use the output to print the correct solution to the terminal. (Tip: Call the same function twice – once for the most translations, once for the least translations)
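One way the pieces could fit together, inlined into a single self-contained sketch. Note one deviation from the specification above: the path to the English file is passed as an extra argument here so the id-to-title lookup has somewhere to come from; in your own script you could instead read it from the language-to-path dictionary. Tag names other than talkid are assumptions:

```python
import xml.etree.ElementTree as ET

def find_top_coverage(lang_to_path, english_path, extreme='most'):
    """Map the title(s) of the most/least widely translated talk(s)
    to the languages they are translated into.

    lang_to_path -- dict mapping language codes to xml filepaths
    english_path -- filepath to the English xml file (extra argument,
                    see note above)
    extreme      -- 'most' or 'least'
    """
    # collect, per talk id, the languages it is translated into
    talk_to_langs = {}
    for lang, path in lang_to_path.items():
        root = ET.parse(path).getroot()
        for talk in root.findall('file'):
            talk_id = talk.find('head').find('talkid').text
            talk_to_langs.setdefault(talk_id, []).append(lang)
    counts = {tid: len(langs) for tid, langs in talk_to_langs.items()}
    target = max(counts.values()) if extreme == 'most' else min(counts.values())
    # map the winning ids back to English titles
    en_root = ET.parse(english_path).getroot()
    id_to_title = {t.find('head').find('talkid').text: t.find('head').find('title').text
                   for t in en_root.findall('file')}
    return {id_to_title[tid]: langs
            for tid, langs in talk_to_langs.items() if counts[tid] == target}
```

Calling it twice, once with extreme='most' and once with extreme='least', answers both questions with the same code path.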
Code structure:
Your functions should be called in the script called ted_translation_analysis.py. If you used helper functions (recommended), please store them in utils.py and import them.
Informative print statements
Please add print statements so the output can be interpreted. See instructions for Part 1.
You will receive points for the functions, the print statements, and the script.