Nicolson & Tucker: Identifying novel features from specimen data for the prediction of valuable collection trips

class: center

# Identifying novel features from specimen data for the prediction of valuable collection trips

### Nicky Nicolson<sup>1,2</sup>, Allan Tucker<sup>2</sup>

1\. Biodiversity Informatics & Spatial Analysis, Royal Botanic Gardens, Kew (UK). 2\. Department of Computer Science, Brunel University London (UK).

#### Intelligent Data Analysis XVI, 26-28th October 2017, London (UK)

<div style="text-align:center; height:10%; width:auto; padding: none;" >
<img src="images/logos.png" width="50%"/>
</div>
---
class:bigger_font

## Outline

1. About the scientific domain:
    - Examine a specimen
    - Motivation for research
1. Method:
    - Data exploration
    - Data mining
    - Applications:
        - Data abstraction (grouping & feature definition)
        - Classifier construction
1. Results
1. Conclusions

???

To introduce the domain, I'll show a specimen and explain how it was collected,
what it's used for (its life as a research object), how it was made digital and
what we can do with the digital data.

I'll explain the motivation for the research (better understanding species discovery).

We'll explore the data mining process and how its output has been used to
build models to better understand species discovery.

---
background-image: url(images/solanum-sanchez-vegae-US.jpg)
background-size: contain

???

- This is a specimen, a physical sample of plant material collected in the field.
- If you only remember one thing about a herbarium specimen remember this - they are usually flat. That means:
    - Easy to share (established culture of specimen exchange between researchers and institutions)
    - Easy to annotate - to add notes about the creation and use of the specimen
    - Easy to photograph (i.e. digitise). We can also transcribe after the fact (from the digitised image)
- Information rich

--
<div style="position:absolute; top: 480px; left:560px; width:145px; height:47px; border:3px solid #000;font-size:18px;font-weight:bold;background: rgba(255,255,255,1.0);">Collection information</div>
<div style="position:absolute; top: 530px; left:560px; width:145px; height:240px; border:3px solid #000;background: rgba(255,0,0,.25);"></div>

???

- The highlighted notes concern the field collection of the specimen - what/when/where/who
- This is all about transitioning an organism in the field into a research object

--
<div style="position:absolute; top: 590px; left:280px; width:145px; height:100px; border:3px solid #000;background: rgba(0,0,255,.25);"></div>
<div style="position:absolute; top: 340px; left:590px; width:100px; height:47px; border:3px solid #000;font-size:18px;font-weight:bold;background: rgba(255,255,255,1.0);">Research annotation</div>
<div style="position:absolute; top: 390px; left:590px; width:100px; height:50px; border:3px solid #000;background: rgba(0,0,255,.25);"></div>

???
- Specimens are lodged in institutional collections for long term consultation and research.
- We also have annotations concerning the use of the specimen as a research object
- Long-term, large scale scientific data sharing

---

background-image: url(images/solanum-sanchez-vegae-US-label.jpg)
background-size: contain

???

Focussing on the collection information...

--
<div style="position:absolute; top: 250px; left:20px; width:240px; height:60px; border:3px solid #00f;font-weight:bold;">WHAT</div>

???

- **What** was collected - they were not sure (only identified in very general terms)

--
<div style="position:absolute; top: 330px; left:20px; width:770px; height:85px; border:3px solid #00f;font-weight:bold;">WHERE</div>

???

- **Where** it was collected - only available as a description (collection pre-dates handheld GPS)

--
<div style="position:absolute; top: 560px; left:465px; width:240px; height:60px; border:3px solid #00f;font-weight:bold;">WHEN</div>

???

- **When** the record was collected - recorded here to a precise day

--
<div style="position:absolute; top: 535px; left:20px; width:282px; height:105px; border:3px solid #00f;font-weight:bold;">WHO</div>

???

- **Who** collected it - the names of the collecting team:

--
<div style="position:absolute; top: 560px; left:20px; width:282px; height:35px; border:3px solid #00f;font-weight:bold;"></div>
???
- The first team member is the **primary collector**

--
<div style="position:absolute; top: 555px; left:310px; width:145px; height:60px; border:3px solid #00f;font-weight:bold;">RECORD #</div>

???
- They control the **record number** - a numeric sequence used throughout their career

--
<div style="position:absolute; top: 85px; left:20px; width:790px; height:30px; border:3px solid #00f;font-weight:bold">WHY</div>

???

- **Why** the specimen was collected - in this case part of an expedition

Although the data are rich, they are also somewhat fragile (reliant on text transcription).

--
<div style="position:absolute; top: 555px; left:310px; width:145px; height:60px; border:3px solid #f00;background: rgba(255,0,0,.25);"></div>
<div style="position:absolute; top: 560px; left:465px; width:240px; height:60px; border:3px solid #f00;background: rgba(255,0,0,.25);"></div>

???
That said, we have two **numeric features**...
--
<div style="position:absolute; top: 560px; left:20px; width:282px; height:35px; border:3px solid #0f0;background: rgba(0,255,0,.25);"></div>
<div style="position:absolute; top: 85px; left:20px; width:790px; height:30px; border:3px solid #0f0;background: rgba(0,255,0,.25);"></div>

???

...and we will use these to try to data-mine the **primary collector (who)** and the **collecting trip (why)**.

---
class:bigger_font

## Motivation: species discovery

On-going: 2000 new species described / year in higher plants.

Our example specimen:
- Early 1960s: collected from field
- 2010: recognised as a species new to science (and published). Specimen annotated, digital record flagged as a "type"

**Can we direct effort to field collection localities / institutional collections that will yield new species?**

???

- Species discovery not complete, even in a well known group like flowering plants.
- Discovery takes two main forms - in-field and in-collection. The example specimen shows in-collection discovery - described c. 50 years after collection.

- Given the abundance of specimens and field localities - can we use the specimen data to direct effort?

---

???

- Example specimen showed two numeric features (date of collection and collector's sequence number).
- These data points for the activity of a single collector show a positive correlation

---

???

- However as collector names are transcribed text (also in many languages), they are very variable:
    - it is hard to gather a dataset of specimens for a single collector
- Looking at a dataset of points from all collectors (for a constrained time period), we see that the specimens form elongated clusters
- The data mining process will detect these
---
class:bigger_font

## Data mining: preparation

Data: c. 3.5m specimens from Brazil, downloaded from the Global Biodiversity Information Facility (GBIF)

### Feature definition
- Numeric feature-set:
    - eventdate - days since 1970-01-01
    - recordnumber - sequential, unique in context
- Collector name transcription, e.g. Gert Hatschbach
- Lexical feature-set:
    - First initial
    - First upper-case of surname
    - First lower-case of surname
    - Last lower-case of surname
    - e.g. **G**ert **Ha**tschbac**h** -> G, H, a, h
???
- Numeric features as used in the scatter plots shown in data exploration
- Lexical features take only minimal guidance from the collector name transcription
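
A minimal sketch of the two feature-sets, assuming each transcription has the simple form "Forename Surname" (real transcriptions are far messier). The function names are illustrative only and are not taken from the project code.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def numeric_features(event_date, record_number):
    """Return (days since 1970-01-01, record number)."""
    return ((event_date - EPOCH).days, record_number)


def lexical_features(name):
    """Return (first initial, first upper-case of surname,
    first lower-case of surname, last lower-case of surname).

    Assumption: the name is "Forename Surname", separated by whitespace.
    """
    parts = name.split()
    surname = parts[-1]
    first_upper = next(c for c in surname if c.isupper())
    lowers = [c for c in surname if c.islower()]
    return (parts[0][0], first_upper, lowers[0], lowers[-1])


print(lexical_features("Gert Hatschbach"))      # ('G', 'H', 'a', 'h')
print(numeric_features(date(1970, 1, 11), 42))  # (10, 42)
```

Collections made before 1970 simply get a negative day count, which is harmless for distance-based clustering.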

---
class:bigger_font
## Data mining steps (1/4): Cluster

- DBSCAN: selected as we want to detect elongated clusters
    - featuresets: lexical & numeric
    - epsilon: 300
    - min_samples: 2
- Expert analysis:
    - density of samples: multiple logical collectors assigned to a single cluster
- Computational post-processing
    - clusters pessimistically broken into subclusters, based on lexical examination of transcriptions
???
Data mining is a 4 step process. Design principles:
- Visualise the results after each step
- Allow expert input to influence design, using insights from visualisation
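
To show why DBSCAN suits elongated clusters, here is a small stdlib-only sketch of the algorithm with the slide's parameters (epsilon 300, min_samples 2). It is an assumption-laden illustration: it clusters on the numeric feature-set only with Euclidean distance, because the slides do not say how the lexical features enter the distance, and it is not the implementation used in the study.

```python
from math import dist

NOISE = -1


def dbscan(points, eps=300.0, min_samples=2):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joined, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing
        cluster += 1
    return labels


# (eventdate in days, recordnumber) - invented values for illustration
points = [(0, 0), (200, 200), (400, 400), (600, 600),  # one long chain
          (5000, 9000), (5100, 9100),                  # a second collector
          (20000, 5)]                                  # an isolated record
print(dbscan(points))  # [0, 0, 0, 0, 1, 1, -1]
```

The first four points end up in one cluster even though its two ends are much further apart than epsilon: each point only needs a neighbour within reach, so clusters can grow along a line.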

---
class:bigger_font
## Data mining steps (2/4): Classify

- Expert analysis:
    - variation in transcription results affects lexical featureset: logical collectors assigned to separate clusters
    - visualised using scatter plot
- Classify:
    - train decision tree on numeric featureset, to predict cluster identifier
    - commonly confused classes candidates for joining
    - computationally assessed for lexical similarity
    - iterative process (join affects overlap calculation)
???
If the expert advice in the previous step was that some clusters were too
greedily constructed, we also found the converse problem...
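
A rough sketch of how commonly confused classes can become join candidates. It assumes the decision tree's predictions are already available as a list of cluster ids; the similarity measure (`difflib.SequenceMatcher`) and both thresholds are placeholders chosen for illustration, not the values used in the study.

```python
from collections import Counter
from difflib import SequenceMatcher


def confused_pairs(actual, predicted):
    """Count how often each unordered pair of cluster ids is confused."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a != p:
            counts[tuple(sorted((a, p)))] += 1
    return counts


def join_candidates(actual, predicted, names,
                    min_confusions=2, min_ratio=0.8):
    """Return confused pairs whose representative names look alike.

    names maps a cluster id to a representative transcription.
    min_confusions and min_ratio are hypothetical thresholds.
    """
    result = []
    for (a, b), n in confused_pairs(actual, predicted).most_common():
        ratio = SequenceMatcher(None, names[a], names[b]).ratio()
        if n >= min_confusions and ratio >= min_ratio:
            result.append((a, b))
    return result


actual = [1, 1, 1, 2, 2, 2, 3, 3]     # cluster ids from step 1
predicted = [2, 1, 2, 1, 2, 2, 3, 1]  # ids predicted from numeric features
names = {1: "G. Hatschbach", 2: "G. Hatshbach", 3: "R. Reitz"}
print(join_candidates(actual, predicted, names))  # [(1, 2)]
```

After a join, the confusion counts have to be recomputed, which is why the process is iterative.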

---
background-image: url(images/classification-example-cluster.png)
background-size: contain
---
class:bigger_font
## Data mining steps (2/4): Classify

---
class:bigger_font
## Data mining steps (3/4): Join

- Aim to gather all data to get a career grouping for a single logical collector
- Two stage process, clusters are joined if:
    - Most frequently occurring transcription is shared and all variant transcriptions agree
    - Clusters share external identifier in bibliographic author dataset

???
Their activity may stretch across years, with long spans of time with no
field work - so a single career will legitimately cover multiple clusters.

At the end of this stage we have now detected collectors...
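
As a simplified sketch of the first join rule, the snippet below groups clusters whose most frequently occurring transcription is identical. It deliberately leaves out the check that all variant transcriptions agree and the second stage (shared external identifier), and the data structure is an assumption made for the example.

```python
from collections import Counter, defaultdict


def join_by_top_transcription(cluster_names):
    """Group cluster ids that share their most frequent transcription.

    cluster_names maps a cluster id to the list of name transcriptions
    found on that cluster's specimens.
    """
    groups = defaultdict(list)
    for cluster_id, names in cluster_names.items():
        top_name, _ = Counter(names).most_common(1)[0]
        groups[top_name].append(cluster_id)
    return {name: sorted(ids) for name, ids in groups.items()}


clusters = {
    10: ["G. Hatschbach", "G. Hatschbach", "G. Hatshbach"],
    11: ["G. Hatschbach"],
    12: ["R. Reitz", "R. Reitz"],
}
print(join_by_top_transcription(clusters))
# {'G. Hatschbach': [10, 11], 'R. Reitz': [12]}
```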

---
class:bigger_font
## Data mining steps (4/4): Detect trips

- For each collector's career, pass all samples into DBSCAN to detect collecting trips
- Create and apply a trip identifier to each "collecting trip" cluster

???
4th step: for each collector, use DBSCAN to find their trips

---
class:bigger_font
## Application (1/3): grouping
1. Baseline - grouped by transcribed primary collector name

2. Collector - grouped by data-mined collector entity

3. Trip - grouped by data-mined collecting trip entity
???
After data-mining we have several options for grouping the data
- Baseline is the pre-data mining grouping used for comparison (data grouped by the source of the lexical features)

---
class:bigger_font
## Application (2/3): feature definition
- Temporal:
    - Start year
- Scale:
    - *#* specimens
    - Range of numbers allocated
- Rate:
    - Slope of line of best fit
    - Correlation score
- Character:
    - Specialist (T/F)
    - Generalist (T/F)
- Experience:
    - *#* previous collections
- Class: species discovery value:
    - Does the grouping include material used as a type (T/F)
???
Given that we can group the data, we can define a novel set of features about
the grouping (a richer set of data than that available when working at specimen
level).
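
A small sketch of how the temporal, scale and rate features could be derived for one grouping, assuming each specimen is reduced to a (days since 1970-01-01, record number) pair. The character and experience features are omitted because their definitions are not given on the slide; the dictionary keys are illustrative names.

```python
from datetime import date, timedelta
from math import sqrt

EPOCH = date(1970, 1, 1)


def group_features(specimens):
    """specimens: list of (eventdate_days, recordnumber) for one grouping.

    Assumes at least two distinct dates and two distinct record numbers,
    otherwise the slope and correlation are undefined.
    """
    xs = [d for d, _ in specimens]
    ys = [r for _, r in specimens]
    n = len(specimens)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return {
        "start_year": (EPOCH + timedelta(days=min(xs))).year,
        "n_specimens": n,
        "number_range": max(ys) - min(ys),
        "slope": sxy / sxx,  # least-squares slope: numbers issued per day
        "correlation": sxy / sqrt(sxx * syy),  # Pearson's r
    }


trip = [(0, 100), (10, 120), (20, 140), (30, 160)]  # invented values
features = group_features(trip)
print(features["slope"], features["correlation"])  # 2.0 1.0
```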

---
class:bigger_font
## Application (3/3): build classifier

- Decision tree classifier used to predict species discovery value.

- Datasets downsampled to balance class variable.

- 10-fold stratified cross-validation.

- Feature selection.
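
???
A stdlib-only sketch of the two data preparation steps, downsampling to balance the class variable and stratified 10-fold splitting. It assumes random downsampling of the larger class to the size of the smaller one; the helper names and the toy data are invented for illustration, and the classifier itself is not shown.

```python
import random
from collections import defaultdict


def split_by_class(rows, label_of):
    """Return {class label: [rows]}."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    return by_class


def downsample(rows, label_of, seed=0):
    """Randomly drop rows so every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = split_by_class(rows, label_of)
    size = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, size))
    return balanced


def stratified_folds(rows, label_of, k=10, seed=0):
    """Split rows into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for members in split_by_class(rows, label_of).values():
        rng.shuffle(members)
        for i, row in enumerate(members):
            folds[i % k].append(row)
    return folds


# Toy data: (grouping id, includes type material?) with 30 positives in 500
rows = [(i, i < 30) for i in range(500)]
balanced = downsample(rows, label_of=lambda r: r[1])
folds = stratified_folds(balanced, label_of=lambda r: r[1])
print(len(balanced), [len(f) for f in folds])
```

Each fold serves once as the test set while the other nine are used for training, and every fold keeps the same 50/50 class mix as the balanced dataset.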

---
class:bigger_font
## Results: data mining

Raw data:
- 131582 unique collector team transcriptions
- 41511 unique primary collector name transcriptions

Data mining process:
- Step 1: DBSCAN identified 42096 clusters; lexically post-processed to 51192 clusters
- Step 2: Resolved via decision-tree classifier to 44768 clusters
- Step 3: Joined to 19706 clusters representing collector entities
- Step 4: 79012 different collecting trips were identified

Species discovery value:
- 1127 (5.7%) of collectors and 3412 (4.3%) of trips collected specimens later labelled as type specimens.

---

???
Top 3 most numerous clusters shown

Note that the crossed-over clusters (red/green, top right) are correctly distinguished

---

background-image: url(images/sankey-taylor-captioned.png)
background-size: contain

???

A sankey diagram shows flow... here we use it to illustrate the "flow" of specimen records into different groupings

Points to note:
- ambiguous collector name transcription split into two separate collector groupings
- greedy allocation of data to topmost collector category
- many small collection trips identified (future work to define post processing steps)

---

???
This ROC-AUC plot shows classification results for each of the 3 groupings.

The 2 data mined groupings additionally show results from a feature selected subset.

Features selected were:
- Temporal (**start year**)
- Character (**specialist** and **nomenclaturalist**)
---
class:bigger_font
## Conclusions

- Specimens visible end point of a hidden collecting process

- Machine learning techniques help to uncover the hidden processes

- Data mining results reshape the data, build models - steps towards understanding species discovery

- Techniques also have practical applications - efficiencies in data mobilisation

---
class:center,bigger_font
## Acknowledgements
Data providers: for sharing their specimen data

Reviewers: for valuable comments

Kew: for funding support

## Further information
n.nicolson@kew.org / @nickynicolson

[http://bit.ly/nicolson-ida2017](http://bit.ly/nicolson-ida2017)