This page provides additional material for Debopam Das' dissertation, "Signalling of Coherence Relations in Discourse", completed in August 2014 in the Department of Linguistics at Simon Fraser University.

The thesis is available via the SFU Library: http://summit.sfu.ca/item/14446

The corpus compiled as part of the thesis is available through the Linguistic Data Consortium.

Statistics on this corpus can be downloaded from this page as a zip file: Das_PhD_Supplementary_Material.zip

Details on the zip file:

  • The material contains two types of distributions for the RST Signalling Corpus:
    • The statistical distribution of coherence (rhetorical) relations with respect to the signals used to indicate those relations, and
    • The statistical distribution of signals with respect to the relations indicated by those signals.
  • The RST Signalling Corpus, which is built on top of the RST Discourse Treebank, includes a collection of over 20,000 coherence relations annotated for signalling information. The corpus provides signalling annotation for the 78 relation types present in the RST Discourse Treebank.
  • The signals used for annotation are divided into three broad classes: Single Signal, Combined Signal and Unsure.
    • The class Single Signal is divided into nine signal types: discourse marker (DM), reference, lexical, semantic, morphological, syntactic, graphical, genre and numerical features.
    • The class Combined Signal is divided into six signal types: (reference + syntactic), (semantic + syntactic), (lexical + syntactic), (syntactic + semantic), (syntactic + positional) and (graphical + syntactic).
    • These signal types are further divided into numerous specific signals.
  • More information about the RST Discourse Treebank and the RST Signalling Corpus can be found in the dissertation.
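
The signal taxonomy described above can be written down as a small lookup structure. The following is only a sketch for readers who want to process the statistics programmatically: the class and type labels are taken from the list above, while the names SIGNAL_CLASSES and signal_class_of are illustrative and not part of the corpus or the zip file. The page lists no signal types for the class Unsure, so it is left empty here.

```python
# Signal taxonomy as listed on this page. The variable and function names
# are hypothetical; they do not come from the corpus distribution.
SIGNAL_CLASSES = {
    "Single Signal": [
        "discourse marker (DM)",
        "reference",
        "lexical",
        "semantic",
        "morphological",
        "syntactic",
        "graphical",
        "genre",
        "numerical",
    ],
    "Combined Signal": [
        "reference + syntactic",
        "semantic + syntactic",
        "lexical + syntactic",
        "syntactic + semantic",
        "syntactic + positional",
        "graphical + syntactic",
    ],
    # No signal types are listed for this class on the page.
    "Unsure": [],
}


def signal_class_of(signal_type):
    """Return the broad class a signal type belongs to, or None if unknown."""
    for signal_class, signal_types in SIGNAL_CLASSES.items():
        if signal_type in signal_types:
            return signal_class
    return None
```

For example, signal_class_of("syntactic + positional") gives "Combined Signal", and an unlisted label gives None.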

A description of the sub-directories and data follows:

  • All files are in HTML format.
  • The directory named "Relation_Distribution_by_Signals" contains the statistical distribution of individual relations with respect to both signal types and specific signals used to indicate those relations.
  • The directory named "Signal_Distribution_by_Relations" contains the statistical distribution of signals (including signal classes, signal types and specific signals) with respect to individual relations.
  • Abbreviations used in file names:

    1. graph_plus_syn = graphical + syntactic
    2. lex_plus_syn = lexical + syntactic
    3. ref_plus_syn = reference + syntactic
    4. sem_plus_syn = semantic + syntactic
    5. syn_plus_pos = syntactic + positional
    6. Syn_plus_sem = syntactic + semantic
    7. comma_plus_pres_part_cl = comma + present participial clause
    8. comma_plus_pst_part_cl = comma + past participial clause
    9. ind_wrd_plus_pres_part_cl = indicative word + present participial clause
    10. com_ref_plus_subj_NP = comparative reference + subject NP
    11. dem_ref_plus_subj_NP = demonstrative reference + subject NP
    12. per_ref_plus_subj_NP = personal reference + subject NP
    13. prop_ref_plus_subj_NP = propositional reference + subject NP
    14. gen_word_plus_subj_NP = general word + subject NP
    15. lex_chain_plus_subj_NP = lexical chain + subject NP
    16. mero_plus_subj_NP = meronymy + subject NP
    17. rep_plus_subj_NP = repetition + subject NP
    18. syno_plus_subj_NP = synonymy + subject NP
    19. pres_part_cl_plus_beg = present participial clause + beginning
    20. pst_part_cl_plus_beg = past participial clause + beginning
    21. par_constr_plus_lex_chain = parallel syntactic construction + lexical chain
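
The abbreviation list above can be turned into a lookup table for labelling the HTML files after extraction. This is a minimal sketch under one assumption that the page does not confirm: that a file's name without its extension is exactly one of the abbreviations. Names that do not match are returned unchanged. The names ABBREVIATIONS and expand_abbreviation are illustrative only.

```python
from pathlib import Path

# Abbreviations copied from the list above, spelled as given on this page.
ABBREVIATIONS = {
    "graph_plus_syn": "graphical + syntactic",
    "lex_plus_syn": "lexical + syntactic",
    "ref_plus_syn": "reference + syntactic",
    "sem_plus_syn": "semantic + syntactic",
    "syn_plus_pos": "syntactic + positional",
    "Syn_plus_sem": "syntactic + semantic",
    "comma_plus_pres_part_cl": "comma + present participial clause",
    "comma_plus_pst_part_cl": "comma + past participial clause",
    "ind_wrd_plus_pres_part_cl": "indicative word + present participial clause",
    "com_ref_plus_subj_NP": "comparative reference + subject NP",
    "dem_ref_plus_subj_NP": "demonstrative reference + subject NP",
    "per_ref_plus_subj_NP": "personal reference + subject NP",
    "prop_ref_plus_subj_NP": "propositional reference + subject NP",
    "gen_word_plus_subj_NP": "general word + subject NP",
    "lex_chain_plus_subj_NP": "lexical chain + subject NP",
    "mero_plus_subj_NP": "meronymy + subject NP",
    "rep_plus_subj_NP": "repetition + subject NP",
    "syno_plus_subj_NP": "synonymy + subject NP",
    "pres_part_cl_plus_beg": "present participial clause + beginning",
    "pst_part_cl_plus_beg": "past participial clause + beginning",
    "par_constr_plus_lex_chain": "parallel syntactic construction + lexical chain",
}

# Lower-cased keys so that differences in capitalisation do not block a match.
_LOOKUP = {key.lower(): value for key, value in ABBREVIATIONS.items()}


def expand_abbreviation(file_name):
    """Return the long form for a file name, or its stem if no entry matches.

    Assumption: the file stem (name without extension) is the abbreviation.
    """
    stem = Path(file_name).stem
    return _LOOKUP.get(stem.lower(), stem)
```

For example, expand_abbreviation("ref_plus_syn.html") gives "reference + syntactic". If the actual file names embed an abbreviation inside a longer name, the matching step would need to be adapted.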

©2014 Debopam Das