Current State of Our Investigation and Available Outcomes

Modelling Efforts

Through application of varied techniques, we’ve steadily increased our predictive power, based on this same dataset. Currently, we’ve got fairly accurate models just for grapes and spinach ( both with \(R^2\) around 0.7 ). For the other types of produce, we are usually below 0.5. We’ve searched for another useful application of our current efforts and developed classification models with three categories of content ( lower, middle and superior ). These ones are considerably more accurate and may probably find useful application as they exist right now or with a reasonable level of additional effort. We’ve got one model with 60% or more precission on category preddiction for almost every produce type, most of them require just the usage of the Our Sci reflectometer and perhaps a brixometer, so they are deployable and could find use for in-field testing. There’s still ongoing effort in getting a better profit from the SIWARE data, which has required a substantial amount of cutting edge knowledge, specifical to the fields of Functional Data Analysis and Spectrometry. We may still find a way to squeeze more predictive power from this wavelengths, which have mostly been the least correlated data but have shown some promising results under heavy processing and with the usage of specialized methods. This is reasonable considering the extremely high dimensionality of the dataset and the complex nature of NIR spectroscopy data.

Other Useful Outcomes

  • We’ve organized and made available this dataset for further investigations from us or other interested parts.
  • Provided a huge array of graphical information to make the task of a first insight over what’s available rather easy and fast.
  • Devised a big set of specific tools to help in that task.
  • Developed several methods to identify atypical spectrometry curves, all of them deployable on the field, so a user can immediately know if they’re under an unusual datum or a badly acquired measure.
  • Pointed at the most successful modelling schemes and the most powerful variables for those models.

Characterization of the Dataset

Search For Outliers

Before any further analysis, a search for outrageously outlying data will be carried. As the amount of variables is overwhelming, boxplots graphics will serve the purpose of a fast glimpse in search of radically unusual points.

Our boxplots will show label IDs over the outliers to help identify the points that require more urgent consideration. While defining what an outlier is may be a contentious matter, we are not making decisions based on this criteria, just choosing a range of data to plot the labels and help making informed decisions later.

Sample Composition

This is the sample composition by species:

explicativeMetadata <- c(
  "Brix",
  "Source",
  "Color"
  ## "Dry_Mass_%_(x_100)",
  ## State <- turn into climate regions
)

regsframe <- dataframe %>%
  select( -contains( "Device_ID" ) ) %>%
  select( contains( "siware" ), contains("Scan"), Type, Polyphenols, Antioxidants, id_sample, !!explicativeMetadata, `Dry_Mass_%_(x_100)` ) %>%
  filter( !is.na(Type) ) %>%
  filter( Type != "other" ) %>%
  mutate( Type = fct_drop( Type ),
         Color = as.factor(Color),
         Source = as.factor(Source)
         )

siwareJuiceVars <- regsframe %>% select( contains("Juice-siware") ) %>% colnames
siwareSurfaceVars <- regsframe %>% select( contains("Surface-siware") ) %>% colnames
uvJuiceVars <- regsframe %>% select( contains("Juice_Scan") ) %>% colnames
uvSurfaceVars <- regsframe %>% select( contains("Surface_Scan") ) %>% colnames

datasets <- list(
  metadata = explicativeMetadata,
  uvJuice = uvJuiceVars,
  uvSurface = uvSurfaceVars,
  uvAll = c( uvJuiceVars, uvSurfaceVars, explicativeMetadata ),
  siwareJuice = siwareJuiceVars,
  siwareSurface = siwareSurfaceVars,
  siwareAll = c( siwareJuiceVars, siwareSurfaceVars, explicativeMetadata ),
  juiceAll = c( uvJuiceVars, siwareJuiceVars, explicativeMetadata ),
  surfaceAll = c( uvSurfaceVars, siwareSurfaceVars, explicativeMetadata ),
  all = c( uvJuiceVars, siwareJuiceVars, uvSurfaceVars, siwareSurfaceVars, explicativeMetadata )
)

datasetsDescription <- list(
  metadata = explicativeMetadata,
  uvJuice = c("ourSci's reflectometer on juice"),
  uvSurface = c("ourSci's reflectometer on surface"),
  uvAll = c( "ourSci in surface and juice", explicativeMetadata ),
  siwareJuice = c("SIWARE on juice"),
  siwareSurface =c("SIWARE on surface") ,
  siwareAll = c( "SIWARE on juice","SIWARE on surface", explicativeMetadata ),
  juiceAll = c( "SIWARE and OurSci measures over juice", explicativeMetadata ),
  surfaceAll = c("SIWARE and OurSci measures over juice" , explicativeMetadata ),
  all = c("SIWARE and OurSci measures over juice and surface", explicativeMetadata )
)

componentsList <- function( list ) {
  list %>% reduce( ~ paste( .x, .y,  sep = "," ) )
} 

datasetsTable <- tibble(
  name = names( datasets ),
  description = map_chr( datasetsDescription ,componentsList )
)

produce <-  regsframe$Type %>% unique %>% na.omit %>% enframe( name=NULL, value="produce" ) %>% filter( produce != "other" )

## produce.sample_type+ produce.brix + produce.sample_color + farm_variety + sample_source + climate_region

datasetsNames <- datasets %>% names %>% enframe( name=NULL, value="dataset")

modelsFrame <- crossing( produce, datasetsNames )

write_rds(
  x = list(
    datasetsVarSets = datasets,
    metadata = explicativeMetadata,
    modelsFrame = modelsFrame
  ),
  path = "./modelling/variablesSets.Rds"
)
sampleComposition <- dataframe %>% group_by( id_full, Type ) %>% summarise() %>% arrange( Type ) %>% ungroup %>% group_by( Type ) %>% summarise( n() )
## Warning: Factor `Type` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `Type` contains implicit NA, consider using
## `forcats::fct_explicit_na`
sampleComposition %>% kable
Type n()
carrot 482
cherry_tomato 461
grape 81
kale 534
lettuce 482
other 11
spinach 164
NA 222

Boxplots for Polyphenols content produce type.

It’s important to notice that, as the distribution of this variables is drastically different among species, we will mostly be using free scales in this faceted graphics. Looking at two adjacent boxplots is not useful to compare their contents. We are using this visualization mostly in search of outliers inside every species.

Polyphenols Content by Produce, Far Outliers Labeled

dataframe %>%
  ggplot()  +
  aes( y = Polyphenols, fill = Type, label = id_full ) +
  geom_boxplot() +
  geom_label_repel( data =
                      dataframe %>% group_by( Type ) %>% arrange( -Polyphenols ) %>% select( Type, id_full,Polyphenols ) %>% top_n(3) %>% arrange( Type ) %>% group_by( id_full ) %>% filter( row_number(Polyphenols) == 1 )
                 , aes( x=0, y = Polyphenols ), color = "white" ) +
  facet_wrap( Type ~ ., scales = "free" ) +
  ggtitle( "Polyphenols Content by Produce, Far Outliers Labeled" )
## Selecting by Polyphenols

ggsave("polyphenolsBoxPlot.png",width=20, height=28, units = "cm")

## This chunk searchs for boxplot outliers (measured as "some number of IQRs away from extreme quartile"). They were so many it's not useful.
## %>%
##   group_by( Type ) %>%
##   mutate( iqrPo = IQR( Polyphenols, na.rm = TRUE ), quartile1 = quantile( x=Polyphenols, probs = c(0.25), na.rm=TRUE ), quartile3 = quantile( x=Polyphenols, probs = c(0.75), na.rm=TRUE ) ) %>%
##   mutate( outlier = Polyphenols < ( quartile1 - ( 1.5*iqrPo ) ) | Polyphenols > ( quartile3 + ( 1.5*iqrPo ) ) ) %>%
##   ungroup()

Here we see an interesting trend: outliers form clusters of a size reasonable enought to justify searching for a partition criteria on the population. This could perhaps lead us to interesting information.

We are going to add the far outlying points that are beyond any cluster into a table. This will be expanded with all suspicious points we find, so it will have a column for the reason of suspicion.

farOutliersPoly <- tibble(
  full_id = c("2186_a", "2186_b", "4581_2_b", "2545_b", "3349", "3349", "3826_3", "3826_1", "3266", "3370", "3899", "2997", "2869", "2868", "2103_b", "2103_a" ),
  reason = "Polyphenols."
)

farOutliersPoly %>% kable
full_id reason
2186_a Polyphenols.
2186_b Polyphenols.
4581_2_b Polyphenols.
2545_b Polyphenols.
3349 Polyphenols.
3349 Polyphenols.
3826_3 Polyphenols.
3826_1 Polyphenols.
3266 Polyphenols.
3370 Polyphenols.
3899 Polyphenols.
2997 Polyphenols.
2869 Polyphenols.
2868 Polyphenols.
2103_b Polyphenols.
2103_a Polyphenols.

Boxplots for Antioxidants content by produce type.

Antioxidants Content by Produce, Far Outliers Labeled

dataframe %>%
  ggplot()  +
  aes( y = Antioxidants, fill = Type, label = id_full ) +
  geom_boxplot() +
  geom_label_repel( data =
                      dataframe %>% group_by( Type ) %>% arrange( -Antioxidants ) %>% select( Type, id_full,Antioxidants ) %>% top_n(3) %>% arrange( Type ) %>% group_by( id_full ) %>% filter( row_number(Antioxidants) == 1 )
                 , aes( x=0, y = Antioxidants ), color = "white"         ) +
  facet_wrap( Type ~ ., scales = "free" ) +
ggtitle( "Antioxidants Content by Produce, Far Outliers Labeled" )
## Selecting by Antioxidants