class: center

# Identifying novel features from specimen data for the prediction of valuable collection trips

### Nicky Nicolson<sup>1,2</sup>, Allan Tucker<sup>2</sup>

1\. Biodiversity Informatics & Spatial Analysis, Royal Botanic Gardens, Kew (UK).
2\. Department of Computer Science, Brunel University London (UK).

#### Intelligent Data Analysis XVI, 26-28th October 2017, London (UK)
---
class:bigger_font

## Outline

1. About the scientific domain:
    - Examine a specimen
    - Motivation for research
1. Method:
    - Data exploration
    - Data mining
    - Applications:
        - Data abstraction (grouping & feature definition)
        - Classifier construction
1. Results
1. Conclusions

???
To introduce the domain, I'll show a specimen and explain how it was collected, what it's used for (its life as a research object), how it was made digital and what we can do with the digital data.

I'll explain the motivation for the research (better understanding species discovery).

We'll explore the data mining process and how its output has been used to build models to better understand species discovery.

---
background-image: url(images/solanum-sanchez-vegae-US.jpg)
background-size: contain

???
- This is a specimen, a physical sample of plant material collected in the field.
- If you only remember one thing about a herbarium specimen, remember this: they are usually flat. That means:
  - Easy to share (established culture of specimen exchange between researchers and institutions)
  - Easy to annotate - to add notes about the creation and use of the specimen
  - Easy to photograph (i.e. digitise). We can also transcribe after the fact (from the digitised image)
- Information rich
--
Collection information
???
- The highlighted notes concern the field collection of the specimen - what/when/where/who
- This is all about transitioning an organism in the field into a research object
--
Research annotation
???
- Specimens are lodged in institutional collections for long-term consultation and research.
- We also have annotations concerning the use of the specimen as a research object
- Long-term, large-scale scientific data sharing

---
background-image: url(images/solanum-sanchez-vegae-US-label.jpg)
background-size: contain

???
Focussing on the collection information...
--
WHAT
???
- **What** was collected - they were not sure (only identified in very general terms)
--
WHERE
???
- **Where** it was collected - only available as a description (collection pre-dates handheld GPS)
--
WHEN
???
- **When** the record was collected - recorded here to a precise day
--
WHO
???
- **Who** collected it - the names of the collecting team:
--
???
- The first team member is the **primary collector**
--
RECORD #
???
- They control the **record number** - a numeric sequence used throughout their career
--
WHY
???
- **Why** the specimen was collected - in this case part of an expedition

Although the data are rich, they are also somewhat fragile (reliant on text transcription).
--
???
That said, we have two **numeric features**...
--
???
...and we will use these to try to data-mine the **primary collector (who)** and the **collecting trip (why)**.

---
class:bigger_font

## Motivation: species discovery

On-going: 2000 new species described / year in higher plants.

Our example specimen:

- Early 1960s: collected from field
- 2010: recognised as a species new to science (and published). Specimen annotated, digital record flagged as a "type"

**Can we direct effort to field collection localities / institutional collections that will yield new species?**

???
- Species discovery is not complete, even in a well-known group like flowering plants.
- Discovery takes two main forms - in-field and in-collection. The example specimen shows in-collection discovery - described c. 50 years after collection.
- Given the abundance of specimens and field localities - can we use the specimen data to direct effort?

---
background-image: url(images/scatters-1a.png)
background-size: contain

???
- The example specimen showed two numeric features (date of collection and collector's sequence number).
- These data points for the activity of a single collector show a positive correlation

---
background-image: url(images/scatters-1b.png)
background-size: contain

???
- However, as collector names are transcribed text (also in many languages), they are very variable:
  - it is hard to gather a dataset of specimens for a single collector
- Looking at a dataset of points from all collectors (for a constrained time period), we see that the specimens form elongated clusters
- The data mining process will detect these

---
class:bigger_font

## Data mining: preparation

Data: c. 3.5m specimens from Brazil, downloaded from the Global Biodiversity Information Facility (GBIF)

### Feature definition

- Numeric feature-set:
  - eventdate - days since 1970-01-01
  - recordnumber - sequential, unique in context
- Collector name transcription, e.g. Gert Hatcshbach
- Lexical feature-set:
  - First initial
  - First upper-case of surname
  - First lower-case of surname
  - Last lower-case of surname
  - e.g. **G**ert **Ha**tcshbac**h** -> G, H, a, h

???
- Numeric features as used in the scatter plots shown in data exploration
- Lexical features allow minimal advice from the collector name transcription

---
class:bigger_font

## Data mining steps (1/4): Cluster

- DBSCAN: selected as we want to detect elongated clusters
  - featuresets: lexical & numeric
  - epsilon: 300
  - min_samples: 2
- Expert analysis:
  - density of samples: multiple logical collectors assigned to a single cluster
- Computational post-processing
  - clusters pessimistically broken into subclusters, based on lexical examination of transcriptions

???
Data mining is a 4-step process.

Design principles:

- Visualise the results after each step
- Allow expert input to influence design, using insights from visualisation

---
class:bigger_font

## Data mining steps (2/4): Classify

- Expert analysis:
  - variation in transcription results affects lexical featureset: logical collectors assigned to separate clusters
  - visualised using scatter plot
- Classify:
  - train decision tree on numeric featureset, to predict cluster identifier
  - commonly confused classes candidates for joining
  - computationally assessed for lexical similarity
  - iterative process (join affects overlap calculation)

???
If the expert advice in the previous step was that some clusters were too greedily constructed, we also found the converse problem...
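
The "commonly confused classes" idea on this slide can be sketched as counting how often a classifier's prediction swaps one cluster identifier for another. This is a hedged illustration, not the implementation used in the study: `confused_pairs` and `min_count` are hypothetical names, and the predictions are assumed to come from any classifier (for example a decision tree trained on the numeric featureset).

```python
from collections import Counter


def confused_pairs(true_ids, predicted_ids, min_count=1):
    """Return unordered pairs of cluster ids that are confused with each other.

    true_ids / predicted_ids: equal-length sequences of cluster identifiers.
    min_count: hypothetical threshold; pairs confused fewer times are dropped.
    """
    counts = Counter()
    for true_id, predicted_id in zip(true_ids, predicted_ids):
        if true_id != predicted_id:
            # Treat (a, b) and (b, a) as the same confusion.
            counts[frozenset((true_id, predicted_id))] += 1
    # Most frequently confused pairs first.
    return [
        (tuple(sorted(pair)), n)
        for pair, n in counts.most_common()
        if n >= min_count
    ]


# Toy example: clusters 1 and 2 are confused three times, cluster 3 never.
print(confused_pairs([1, 1, 1, 2, 2, 3], [1, 2, 2, 1, 2, 3]))
```

Pairs near the top of the returned list would then be candidates for the lexical similarity check before any join.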
---
background-image: url(images/classification-example-cluster.png)
background-size: contain

---
class:bigger_font

## Data mining steps (2/4): Classify

- Expert analysis:
  - variation in transcription results affects lexical featureset: logical collectors assigned to separate clusters
  - visualised using scatter plot
- Classify:
  - train decision tree on numeric featureset, to predict cluster identifier
  - commonly confused classes candidates for joining
  - computationally assessed for lexical similarity
  - iterative process (join affects overlap calculation)

???
If the expert advice in the previous step was that some clusters were too greedily constructed, we also found the converse problem...

---
class:bigger_font

## Data mining steps (3/4): Join

- Aim to gather all data to get a career grouping for a single logical collector
- Two-stage process, clusters are joined if:
  - Most frequently occurring transcription is shared and all variant transcriptions agree
  - Clusters share external identifier in bibliographic author dataset

???
A collector's activity may stretch across years, with long spans of time with no field work - so a single career will legitimately cover multiple clusters.

At the end of this stage we have now detected collectors...

---
class:bigger_font

## Data mining steps (4/4): Detect trips

- For each collector's career, pass all samples into DBSCAN to detect collecting trips
- Create and apply a trip identifier to each "collecting trip" cluster

???
4th step: for each collector, use DBSCAN to find their trips

---
class:bigger_font

## Application (1/3): grouping

1. Baseline - grouped by transcribed primary collector name
2. Collector - grouped by data-mined collector entity
3. Trip - grouped by data-mined collecting trip entity

???
After data mining we have several options for grouping the data

- Baseline is the pre-data-mining grouping used for comparison (data grouped by the source of the lexical features)

---
class:bigger_font

## Application (2/3): feature definition

- Temporal:
  - Start year
- Scale:
  - *#* specimens
  - Range of numbers allocated
- Rate:
  - Slope of line of best fit
  - Correlation score
- Character:
  - Specialist (T/F)
  - Generalist (T/F)
- Experience:
  - *#* previous collections
- Class: species discovery value:
  - Does the grouping include material used as a type (T/F)

???
Given that we can group the data, we can define a novel set of features about the grouping (a richer set of data than that available when working at specimen level).

---
class:bigger_font

## Application (3/3): build classifier

- Decision tree classifier used to predict species discovery value.
- Datasets downsampled to balance class variable.
- 10-fold stratified cross-validation.
- Feature selection.

---
class:bigger_font

## Results: data mining

Raw data:

- 131582 unique collector team transcriptions
- 41511 unique primary collector name transcriptions

Data mining process:

- Step 1: DBSCAN identified 42096 clusters; lexically post-processed to 51192 clusters
- Step 2: Resolved via decision-tree classifier to 44768 clusters
- Step 3: Joined to 19706 clusters representing collector entities
- Step 4: 79012 different collecting trips were identified

Species discovery value:

- 1127 (5.7%) of collectors and 3412 (4.3%) of trips collected specimens later labelled as type specimens.

---
background-image: url(images/scatters-1c.png)
background-size: contain

???
Top 3 most numerous clusters shown

Note that crossed-over clusters (red/green, top right) are correctly distinguished

---
background-image: url(images/sankey-taylor-captioned.png)
background-size: contain

???
A Sankey diagram shows flow... here we use it to illustrate the "flow" of specimen records into different groupings

Points to note:

- ambiguous collector name transcription split into two separate collector groupings
- greedy allocation of data to topmost collector category
- many small collection trips identified (future work to define post-processing steps)

---
background-image: url(images/unified.png)
background-size: contain

???
This ROC-AUC plot shows classification results for each of the 3 groupings. The 2 data-mined groupings additionally show results from a feature-selected subset.

Features selected were:

- Temporal (**start year**)
- Character (**specialist** and **nomenclaturalist**)

---
class:bigger_font

## Conclusions

- Specimens are the visible end point of a hidden collecting process
- Machine learning techniques help to uncover the hidden processes
- Data mining results reshape the data, build models - steps towards understanding species discovery
- Techniques also have practical applications - efficiencies in data mobilisation

---
class:center,bigger_font

## Acknowledgements

Data providers: for sharing their specimen data

Reviewers: for valuable comments

Kew: for funding support

## Further information

n.nicolson@kew.org / @nickynicolson

[http://bit.ly/nicolson-ida2017](http://bit.ly/nicolson-ida2017)
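
---
class:bigger_font

## Appendix: feature definition sketch

A minimal sketch of the numeric and lexical features listed on the "Data mining: preparation" slide. It is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, and the name is assumed to be transcribed in "Given Surname" order (real transcriptions vary far more than this).

```python
from datetime import date


def days_since_epoch(event_date):
    """Numeric feature: eventdate as days since 1970-01-01 (negative before it)."""
    return (event_date - date(1970, 1, 1)).days


def lexical_features(name):
    """Lexical features from a collector name, assumed to be 'Given Surname'.

    Returns (first initial, first upper-case of surname,
             first lower-case of surname, last lower-case of surname).
    Missing characters are returned as empty strings.
    """
    tokens = name.split()
    if not tokens:
        return ("", "", "", "")
    surname = tokens[-1]
    uppers = [c for c in surname if c.isupper()]
    lowers = [c for c in surname if c.islower()]
    return (
        tokens[0][0],
        uppers[0] if uppers else "",
        lowers[0] if lowers else "",
        lowers[-1] if lowers else "",
    )


# The transcription from the slide yields G, H, a, h.
print(lexical_features("Gert Hatcshbach"))
# A date in the early 1960s maps to a negative day count.
print(days_since_epoch(date(1960, 1, 1)))
```

Reducing each name to four characters keeps the lexical feature-set tolerant of spelling variation in the middle of a surname, which is where the example transcription differs from other spellings of the same name.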