
ANNOUNCING THE LAUNCH OF THE AI/ML ENHANCEMENT PROJECT FOR GEP AND URBAN TEP
EXPLOITATION PLATFORMS

Category: AI Extensions | Tag: aigepu-tep


pedro (Terradue staff), May '23


We are excited to announce the launch of a new project aimed at augmenting the
capabilities of two Ellip-powered Exploitation platforms, the Geohazards
Exploitation Platform (GEP) and the Urban Thematic Exploitation Platform
(U-TEP). The project’s primary objective is to seamlessly integrate an AI/ML
processing framework into both platforms to enhance their services and empower
service providers to develop and deploy AI/ML models for improved geohazards and
urban management applications.

Project Overview
The project will focus on integrating a comprehensive AI/ML processing framework
that covers the entire machine learning pipeline, including data discovery,
training data, model development, deployment, hosting, monitoring, and
visualization. A critical aspect of this project will be the integration of
MLOps processes into both GEP and Urban TEP platforms’ service offerings,
ensuring the smooth operation of AI-driven applications on the platforms.

GEP and Urban TEP Platforms
GEP is designed to support the exploitation of satellite Earth Observations for
geohazards, focusing on mapping hazard-prone land surfaces and monitoring
terrain deformation. It offers over 25 services for monitoring terrain motion
and critical infrastructures, with more than 2500 registered users actively
participating in content creation.

Urban TEP aims to provide end-to-end and ready-to-use solutions for a broad
spectrum of users to extract unique information and indicators required for
urban management and sustainability. It focuses on bridging the gap between the
mass data streams and archives of various satellite missions and the information
needs of users involved in urban and environmental science, planning, and
policy.

Project Partners
The project brings together a strong partnership of experienced organizations,
including Terradue, CRIM, Solenix, and Gisat. These partners have a proven track
record in various aspects of Thematic Exploitation Platforms, cloud research
platforms, AI/ML applications, and EO data analytics.

Expected Outcomes
Upon successful completion, the project will result in the enhancement of both
GEP and Urban TEP platforms and their service offerings. The addition of AI/ML
capabilities will empower service providers to develop and deploy AI/ML models,
ultimately improving their services and delivering added value to their
customers. This enhancement will greatly benefit the GEP and Urban TEP platforms
by expanding their capabilities and enabling new AI-driven applications for
geohazards and urban management.

Discussion Points:

 1. How do you foresee AI/ML capabilities enhancing the services provided by GEP
    and Urban TEP?
 2. What challenges do you anticipate in integrating AI/ML processing frameworks
    into existing platforms?
 3. Which use cases do you believe would benefit the most from the addition of
    AI/ML capabilities in GEP and Urban TEP?

We encourage you to share your thoughts, ideas, and experiences related to the
project. Let’s discuss the potential impact and improvements this project can
bring to the GEP and Urban TEP platforms and their user communities.



11 months later
simonevaccari (Terradue staff), Mar 21



AI/ML ENHANCEMENT PROJECT - PROGRESS UPDATE


BACKGROUND

One year has passed since the announcement of the AI/ML Enhancement Project
launch (see post). This project innovatively integrates cutting-edge Artificial
Intelligence (AI) and Machine Learning (ML) technologies into Earth Observation
(EO) platforms like Geohazards Exploitation Platform (GEP) and Urban Thematic
Exploitation Platform (U-TEP) through MLOps - the fusion of ML with DevOps
principles.

Leveraging these platforms’ extensive EO data usage, the new AI extensions
promise enhanced efficiency, accuracy, and functionalities. The integration of
these new capabilities unlocks advanced data processing, predictive modelling,
and automation, strengthening capabilities in urban management and geohazard
assessment.


USER PERSONAS, USER SCENARIOS AND SHOWCASES

For the project implementation we have identified two types of users:

 * An ML Practitioner, whom we will call “Alice”: an expert in building and
   training ML models, selecting appropriate algorithms, analysing data, and
   using ML techniques to solve real-world problems.
 * A Consumer, whom we will call “Eric”: a stakeholder or user (e.g. a business
   owner, a customer, a researcher) who benefits from, or relies upon, the
   insights or predictions generated by the ML models to inform his
   decision-making process.

From these users we have derived ten User Scenarios that capture the key
activities and goals of these types of users in utilising the service. The user
scenarios are:

 * User Scenario 1 - Alice does Exploratory Data Analysis (EDA)
 * User Scenario 2 - Alice labels Earth Observation data
 * User Scenario 3 - Alice describes the labelled Earth Observation data
 * User Scenario 4 - Alice discovers labelled Earth Observation data
 * User Scenario 5 - Alice develops a new Machine Learning model
 * User Scenario 6 - Alice starts a training job on a remote machine
 * User Scenario 7 - Alice describes her trained machine learning model
 * User Scenario 8 - Alice reuses an existing pre-trained model
 * User Scenario 9 - Alice creates a training dataset
 * User Scenario 10 - Eric discovers a model and consumes it

From these user scenarios, three Showcases were selected to develop and apply AI
approaches in different contexts, in order to validate and verify the activities
of the AI Extensions service:

 * “Urban greenery” showcase: monitoring urban greenery with EO data, with a
   specific focus on urban heat patterns and flood prevention in urban
   areas.
 * “Informal settlement” showcase: AI approaches in the context of urban
   management, specifically targeting the challenges posed by informal
   settlements.
 * “Geohazards - volcanoes” showcase: AI approaches for EO data for monitoring
   and assessing volcanic hazards.


PROJECT STATUS

The first release of this project was critical in setting the foundation as it
focused on developing a cloud-based environment and related tools that enabled
users to work with EO data and data labels. With the successful completion of
the second release, the user is now able to build and train ML models with EO
data labels effectively.

The project implementation of the User Scenarios focused on developing
interactive Jupyter Notebooks that aim to validate and verify all the key
requirements of the activities performed in each Scenario.

To date, Jupyter Notebooks for User Scenarios 1 - 5 have been developed and
validated.


UPCOMING WORK

The project’s future phases are eagerly anticipated. Release 3 will focus on
enabling users to train their ML models on remote machines, while Release 4 will
empower them to execute these models from the stakeholder/end-user Eric’s
perspective. This progression underscores a strategic roadmap towards making GEP
and U-TEP powerful platforms for data analysis and interpretation using advanced
AI techniques.

Dedicated articles will be published in the coming weeks, describing the
activities and main outcomes of each Scenario / Notebook, so stay tuned!






1 month later
simonevaccari (Terradue staff), May 3



AI/ML ENHANCEMENT PROJECT - EXPLORATORY DATA ANALYSIS USER SCENARIO


INTRODUCTION

Exploratory Data Analysis (EDA) is an essential step in the workflow of a data
scientist or machine learning (ML) practitioner. The purpose of EDA is to
analyse the data that will be used to train and evaluate ML models. This new
capability, brought by the AI/ML Enhancement Project, will support users of the
Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform
(U-TEP) in better understanding the dataset structure and its properties,
discovering missing values and possible outliers, and identifying correlations
and patterns between features that can be used to tailor and improve model
performance.

This post presents User Scenario 1 of the AI/ML Enhancement Project, titled
“Alice does Exploratory Data Analysis (EDA)”. For this user scenario, an
interactive Jupyter Notebook has been developed to guide an ML practitioner,
such as Alice, in implementing EDA on her data. The Notebook first introduces
connectivity with a STAC catalog, interacting with the STAC API to search and
access EO data and labels by defining specific query parameters (we will cover
that in a dedicated article). Subsequently, the user loads the input dataframe
and then performs the EDA steps for understanding her data, such as data
cleaning, correlation analysis, histogram plotting and feature engineering.
Practical examples and commands are displayed to demonstrate how simply this
Notebook can be used for this purpose.



[Figure: Notebook preview]




INPUT DATAFRAME

The input data consisted of point data labelled with three classes, with
features extracted from Sentinel-2 reflectance bands. Three vegetation indices
were also computed from selected bands. The pre-arranged dataframe was loaded
using the pandas library. The dataframe is composed of 13 columns:

 * column CLASSIFICATION: defines the land cover class of each label, with
   available classes VEGETATION, NOT_VEGETATED, WATER.
 * columns with reflectance bands, extracted from the spectral bands of six
   Sentinel-2 scenes: coastal, red, green, blue, nir, nir08, nir09, swir16, and
   swir22.
 * columns with vegetation indices, calculated from the reflectance bands:
   ndvi, ndwi1, and ndwi2.
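
As a hedged illustration (the exact band combinations used in the Notebook are
not stated in this post), the three index columns could be computed from the
band columns as follows; NDVI uses the nir/red pair, while the two NDWI
variants are assumed here to use green/nir and nir/swir16 respectively.

# Hedged sketch: possible definitions of the three index columns.
# The ndwi1/ndwi2 band pairs are assumptions, not confirmed by the post.
dataset['ndvi'] = (dataset['nir'] - dataset['red']) / (dataset['nir'] + dataset['red'])
dataset['ndwi1'] = (dataset['green'] - dataset['nir']) / (dataset['green'] + dataset['nir'])
dataset['ndwi2'] = (dataset['nir'] - dataset['swir16']) / (dataset['nir'] + dataset['swir16'])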

A snapshot of how the dataframe can be loaded and displayed is shown below.

import pandas as pd
dataset = pd.read_pickle('./input/dataframe.pkl')
dataset




[Image: dataframe preview]



This analysis focused on differentiating between “water” and “no-water” labels;
therefore a pre-processing operation was performed on the dataframe to
reclassify the VEGETATION and NOT_VEGETATED labels as “no-water”. This can be
quickly achieved with the command below:

LABEL_NAME = 'water'
dataset[LABEL_NAME] = dataset['CLASSIFICATION'].apply(lambda x: 1 if x == 'WATER' else 0)



DATA CLEANING

After loading, the user can inspect the dataframe with the pandas function
dataset.info(), which shows a quick overview of the data, such as the number of
rows and columns and the data types. A further statistical analysis can then be
performed for each feature with the function dataset.describe(), which extracts
relevant information including count, mean, min and max, standard deviation,
and the 25th, 50th and 75th percentiles.

dataset.info()
dataset.describe()


The user can quickly check whether null values are present in the dataframe. In
general, if features with null values are identified, the user should either
remove them from the dataframe, or convert or assign them to appropriate
values, if known.

dataset[dataset.isnull().any(axis=1)]
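
A minimal sketch of the two options mentioned above (dropping or filling),
using standard pandas calls:

# Option 1: drop rows containing any null value
dataset_clean = dataset.dropna()

# Option 2: fill nulls with a column statistic, when an appropriate value is known
dataset_filled = dataset.fillna(dataset.median(numeric_only=True))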


CORRELATION ANALYSIS

The correlation analysis between “water” and “no-water” pixels for all features
was performed with the pairplot() function of the seaborn library.

import seaborn as sns
sns.pairplot(dataset, hue=LABEL_NAME, kind='reg', palette="Set1")


This simple command generates multiple pairwise bivariate distributions of all
features in the dataset, where the diagonal plots represent univariate
distributions. It displays the relationship for the (n, 2) combination of
variables in a dataframe as a matrix of plots, as depicted in the figure below
(with ‘water’ points shown in blue).



[Figure: pairwise correlation plots]



The correlation between variables can also be visually represented by the
correlation matrix, generated with the pandas corr() function and rendered as a
seaborn heatmap (see figure below). Each cell in the matrix represents the
correlation coefficient, which quantifies the degree to which two variables are
linearly related. Values close to 1 (in yellow) and -1 (in dark blue) represent
positive and negative correlations respectively, and values close to 0
represent no correlation. The matrix is highly customisable, with different
formats and colour maps available.
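
A minimal sketch of this step, assuming the dataset and the seaborn import from
the previous snippets (the colour map is illustrative):

import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns, rendered as a heatmap
corr = dataset.select_dtypes(include='number').corr()
plt.figure(figsize=(12, 4))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='viridis')
plt.show()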



[Figure: correlation matrix]




DISTRIBUTION DENSITY HISTOGRAMS

Another good practice is to understand the distribution density of values for
each feature column. The user can specifically target the distribution of
selected features in relation to the corresponding label “water”, plot this
over the histograms, and save the output figure to file.

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 14))
for i, c in enumerate(dataset.select_dtypes(include='number').columns):
   plt.subplot(4, 3, i + 1)
   sns.histplot(dataset[c], kde=True)
   plt.title('Distribution plot for field: ' + c)
plt.tight_layout()
plt.savefig('./distribution_hist.png')




[Figure: distribution histograms]




OUTLIERS DETECTION

The statistical analysis and histogram plots provide an assessment of the data
distribution of each feature. To further analyse the data distribution, it is
advisable to conduct a dedicated analysis to detect possible outliers in the
data. The Tukey IQR method identifies outliers as values lying more than 1.5
times the interquartile range (IQR) beyond the quartiles, i.e. below
Q1 - 1.5 IQR or above Q3 + 1.5 IQR. An example of the Tukey IQR method applied
to the NDVI index is shown below:

import numpy as np

def find_outliers_tukey(x):
   # Quartiles and interquartile range
   q1 = np.percentile(x, 25)
   q3 = np.percentile(x, 75)
   iqr = q3 - q1
   # Tukey fences
   floor = q1 - 1.5 * iqr
   ceiling = q3 + 1.5 * iqr
   # Indices and values falling outside the fences
   outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
   outlier_values = list(x[outlier_indices])
   return outlier_indices, outlier_values

tukey_indices, tukey_values = find_outliers_tukey(dataset['ndvi'])



FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION

Feature engineering can be used when the available features are not sufficient
for training an ML model, for example when only a small number of features is
available, or when they are not representative enough. In such cases, feature
engineering can increase and/or improve the representativeness of the
dataframe. Here, the PolynomialFeatures function from the sklearn library was
used to increase the overall number of features through an iterative
combination of the available ones. For an algorithm like a Random Forest, where
explicit decisions are made on feature values, adding more features through
feature engineering can provide a substantial improvement. Algorithms like
convolutional neural networks, however, might not need this step, since they
can extract patterns directly from the data.

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(interaction_only=True, include_bias=False)
new_dataset = pd.DataFrame(poly.fit_transform(dataset))
new_dataset


On the other hand, Principal Component Analysis (PCA) is a technique that
transforms a dataset with many features into fewer principal components that
best summarise the variance underlying the data. This can be used to extract
the principal components from the features so they can be used in training. PCA
is also available in the sklearn Python library.

from sklearn.decomposition import PCA

# All components are computed here; for dimensionality reduction,
# n_components would be set to a smaller number
pca = PCA(n_components=len(new_dataset.columns))
X_pca = pd.DataFrame(pca.fit_transform(new_dataset))



CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement
Project, with simple steps and commands that an ML practitioner like Alice can
take to analyse a dataframe in the preparatory step of an ML application
lifecycle. Using this Jupyter Notebook, Alice can iteratively conduct the EDA
steps to gain insights into data patterns, calculate statistical summaries,
generate histograms or scatter plots, understand correlations between features,
and share results with her colleagues.

Useful links:

 * The link to the Notebook for User Scenario 1 is:
   https://github.com/ai-extensions/notebooks/blob/main/scenario-1/s1-eda.ipynb
   Note: access to this Notebook must be granted - please send an email to
   support@terradue.com with subject “Request Access to s1-eda” and body “Please
   provide access to Notebook for AI Extensions User Scenario 1”;
 * The user manual of the AI/ML Enhancement Project Platform is available at
   AI-Extensions Application Hub - User Manual.






20 days later
simonevaccari (Terradue staff), May 23



AI/ML ENHANCEMENT PROJECT - LABELLING EO DATA USER SCENARIO 2


INTRODUCTION

Labelling data is a crucial step in the process for developing supervised
Machine Learning (ML) models. It involves the critical task of assigning
relevant labels or categories to different features within the data, such as
land cover class (e.g. vegetation, water bodies, urban area, etc.) or other
physical characteristics of the Earth’s surface. These labels can be multi-class
(e.g., forest, grassland, urban), or binary (e.g., water or non-water).

This post presents User Scenario 2 of the AI/ML Enhancement Project, titled
“Alice labels Earth Observation (EO) data”. It demonstrates how the enhancements
being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic
Exploitation Platform (U-TEP) will support users labelling EO data.

For this User Scenario, an interactive Jupyter Notebook is used to guide an ML
practitioner, such as Alice, through the following steps:

 * create data labels, using QGIS Software or a Solara / Leafmap application
 * load Labels and Sentinel-2 data using STAC API
 * sample Sentinel-2 data with Labels and create a dataframe
 * validate the labelled data against the Global Surface Water (GSW) dataset
 * use the dataframe to train a ML model based on a Random Forest classifier
 * perform raster inference on a Sentinel-2 scene to generate a binary water
   mask

Practical examples and commands are displayed to demonstrate how these new
capabilities can be used from a Jupyter Notebook.



[Figure: Notebook preview]




LABELLING EO DATA

The process for creating vector (point or polygon) data layers is illustrated
with two examples:

 * QGIS Software: a dedicated profile on the App Hub is configured for the user
   to use QGIS Software (more details can be found in the App Hub online User
   Manual). The steps to create new Shapefile layers, add classification types
   for each point / polygon, and save the output in GeoJSON format are
   illustrated with several screenshots.
 * Solara / Leafmap application: an interactive map, built on Solara and
   Leafmap, has been integrated in the Notebook to give the user the option to
   manually create and save labels right from the Notebook itself.

After the annotations are created, either from QGIS or from the Solara / Leafmap
interactive map, and saved into a .geojson file, the user can create the STAC
Item of the EO labels, and publish it on the STAC endpoint. This is done with
the pystac Python library and an interactive form right in the Notebook.


LOAD LABELS AND EO DATA WITH STAC API

Access to Labels and EO data is provided by the pystac and pystac_client
libraries. These enable users to interact with a STAC catalog by defining
specific query parameters, such as time range, area of interest, and data
collection preferences. Only the STAC Items that align with the provided
criteria are then retrieved for the user.

A simplified code snippet for implementing a STAC data search and for
displaying the results on an interactive map is given below. An upcoming
article dedicated to the STAC format and data access will provide more guidance
and examples.


SEARCH DATA USING STAC API

# Import libraries
import pystac
from pystac_client import Client

# Access the STAC Catalog
cat = Client.open("https://ai-extensions-stac.terradue.com", ...)

# Define query parameters
start_date = "2023-06-01"
end_date = "2023-06-30"
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
cloud_cover = 30
tile = "10SFH"

# Search Labels by AOI and start/end date
query_sel = cat.search(
  collections=["ai-extensions-svv-dataset-labels"],
  datetime=(start_date, end_date),
  bbox=bbox,
)

labels = query_sel.item_collection()

# Search EO data (Sentinel-2) by AOI, start/end date, cloud cover and tile number
query_sel = cat.search(
  collections=["sentinel-2-l2a"],
  datetime=(start_date, end_date),
  bbox=bbox,
  query={"eo:cloud_cover": {"lt": cloud_cover}},
)

eo_item = [item for item in query_sel.item_collection() if tile in item.id][0]



PLOT LABELS AND EO DATA ON INTERACTIVE MAP

Once the Label data is loaded, it is converted into a geodataframe (gdf) using
the geopandas library. The Python library folium is then used to display both
the Labels and the EO data on an interactive map.

import folium
from folium import GeoJson, LayerControl, plugins

# Create map centred on the AOI (x, y: latitude and longitude of the centre)
map = folium.Map(location=[x, y], tiles="OpenStreetMap", zoom_start=9)

# Add Labels to map (addPoints2Map is a helper defined in the Notebook)
map = addPoints2Map(gdf, map)

# Add footprint of the EO scene
footprint_eo = folium.GeoJson(eo_item.geometry, style_function=lambda x: {...})
footprint_eo.add_to(map)

# Visualise map
map




[Figure: interactive map with labels and scene footprint]




SAMPLE EO DATA WITH LABELS

After loading the data, the Notebook continues with the implementation of a
function that iteratively samples the EO data at each labelled point. In
addition to sampling a selection of the Sentinel-2 reflectance bands (coastal,
red, green, blue, nir, nir08, nir09, swir16, and swir22), three vegetation
indices are also calculated (ndvi, ndwi1, and ndwi2). After sampling the EO
bands and calculating the vegetation indices, all the data is concatenated into
a pandas DataFrame.

import pandas as pd

# Sample each EO item at the labelled points (sample_data is a helper
# defined in the Notebook)
tmp_gdfs = []
for i, label_item in enumerate(eo_items):
  sampled_data = sample_data(
    label_item=label_item,
    common_bands=["coastal", "red", "green", "blue", "nir", "nir08",
                  "nir09", "swir16", "swir22"],
  )
  tmp_gdfs.append(sampled_data)

# Create pandas dataframe
gdf_points = pd.concat(tmp_gdfs)

# Save to file
gdf_points.to_pickle("filename.pkl")




[Image: sampled dataframe preview]




VALIDATION AGAINST REFERENCE DATASET

A comparison against another, independent dataset was performed to show a
validation approach for the labelled data. As a validation dataset, we used the
Global Surface Water (GSW) dataset generated by the JRC (Citation: Pekel,
Jean-François; Cottam, Andrew; Gorelick, Noel; Belward, Alan (2017): Global
Surface Water Explorer dataset. European Commission, Joint Research Centre
(JRC), http://data.europa.eu/89h/jrc-gswe-global-surface-water-explorer-v1).

The comparison was performed simply by iterating through the generated labels
dataframe and counting the number of points labelled as “water” that were also
classified as water in the GSW dataset (i.e. with a GSW occurrence value higher
than 80%).
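
A minimal sketch of such a check, assuming a local GSW occurrence GeoTIFF
covering the AOI (the file path is hypothetical) and that gdf_points retains
the point geometries in the raster's CRS:

import rasterio

# Hypothetical local copy of the GSW occurrence layer for the AOI
gsw_path = './input/gsw_occurrence.tif'

# Keep only the points labelled as water
water_points = gdf_points[gdf_points['CLASSIFICATION'] == 'WATER']
coords = [(p.x, p.y) for p in water_points.geometry]

# Sample the GSW raster at each labelled point
with rasterio.open(gsw_path) as src:
    gsw_values = [v[0] for v in src.sample(coords)]

# Count labels confirmed by GSW (occurrence value higher than 80%)
confirmed = sum(1 for v in gsw_values if v > 80)
print(f"{confirmed}/{len(gsw_values)} water labels confirmed by GSW")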


EO LABELLED DATA FOR SUPERVISED ML TASK


DATASET PREPARATION

The dataframe was prepared for the supervised ML task by converting it into a
binary classification dataset (i.e. “water” and “no-water”) and by removing
unnecessary columns. Further and more detailed analysis of the dataframe can be
performed through Exploratory Data Analysis (EDA); see the recently published
article dedicated to EDA for more details and guidance.

The dataset was then split into train and test sets with the dedicated function
train_test_split() from the sklearn package.

from sklearn.model_selection import train_test_split

# columns used as features during training
feature_cols = ['coastal','red','green','blue','nir','nir08','nir09','swir16','swir22', 'ndvi', 'ndwi1', 'ndwi2']

# column name for label
LABEL_NAME = 'CLASSIFICATION'

features = train_dataset[feature_cols] # cols for features
label = train_dataset[LABEL_NAME] # col for labels
X_train, X_test, y_train, y_test = train_test_split(
  features, label,
  random_state=42,
  train_size=0.85,
)



ML MODEL

The ML model developed in this Notebook was a Random Forest classifier trained
using k-fold cross validation. Random Forest is a powerful and versatile
supervised ML algorithm that grows and combines multiple decision trees to
create a “forest”; it can be used for both classification and regression
problems. K-Fold Cross-Validation is a technique used in ML to assess the
performance and generalisation ability of a model. The steps involved are:

 1. Split the dataset into K subsets, or “folds”.
 2. Train the model K times, each time using K-1 folds for training and the
    remaining fold for validation, so that each of the K folds is used exactly
    once as validation data.
 3. Average the K results to produce a single estimate of model performance.
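
As a hedged illustration of the same procedure using plain sklearn (the project
itself wraps the training in a custom Model class, shown further below):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross validation of a Random Forest on the training split
rf = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(rf, X_train, y_train, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")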



[Figure: K-fold cross-validation diagram]



The ML hyperparameters are defined and used to train the model with a few
simple functions, provided by a Model class defined in the Notebook’s utils.py.

hyperparameters = {
  'n_estimators': 200,
  'criterion':'gini',
  'max_depth':None,
  'min_samples_split':2,
  'min_samples_leaf':1,
  'min_weight_fraction_leaf':0.0,
  'max_features':'sqrt',
  'max_leaf_nodes':None,
  'min_impurity_decrease':0.0,
  'bootstrap':True,
  'oob_score':False,
  'n_jobs':-1,
  'random_state':42,
  'verbose':0,
  'warm_start':True,
  'class_weight':None,
  'ccp_alpha':0.0,
  'max_samples':None
}

# Instantiate the Model class defined in utils.py
model = Model(hyperparameters)

# Train the model using k-fold cross validation
estimators = model.training(X=X_train, Y=y_train, folds=5)



MODEL EVALUATION

The model is evaluated on unseen data with the following evaluation metrics:

 * Accuracy: calculated as the ratio of correctly predicted instances to the
   total number of instances in the dataset
 * Recall: also known as sensitivity or true positive rate, recall is a metric
   that evaluates the ability of a classification model to correctly identify
   all relevant instances from a dataset
 * Precision: it evaluates the accuracy of the positive predictions made by a
   classification model
 * F1-score: it is a metric that combines precision and recall into a single
   value. It is particularly useful when there is an uneven class distribution
   (imbalanced classes) and provides a balance between precision and recall
 * Confusion Matrix: it provides a detailed breakdown of the model’s
   performance, highlighting instances of correct and incorrect predictions.

The code snippet below shows how the model can be evaluated, followed by the
output of the evaluation metrics calculated during the process.

# evaluate model
best_model = model.evaluation(estimators, X_test, y_test)




Other ways to evaluate the ML model are the distribution of the probability of
predicted values, the Receiver Operating Characteristic (ROC) curve, and the
analysis of permutation feature importance. All three can be derived and
plotted from within the Notebook with one simple line of code each.

# Distribution of the probability of predicted values
ml_helper.distribution_of_predicted_val(best_model, X_train, X_test)

# ROC Curve
ml_helper.roc(best_model, X_test, y_test)

# Permutation Importance
ml_helper.p_importance(best_model, X_test, y_test, hyperparameters, MODEL_OUTPUT_DIR)




[Figure: prediction probability distribution, ROC curve, and permutation importance plots]



Finally, the best ML model can be saved to a file so that it can be loaded and
used in the future. The only prerequisite for applying the ML model is for the
input dataset to have the same format as the training dataset described above.

import joblib

# Save the model to file
model_fname = 'best_rf_model.joblib'
joblib.dump(best_model, model_fname)



RASTER INFERENCE

Now the user can apply the ML model on a Sentinel-2 image to generate a binary
water mask output. After loading the EO data and the ML model into the Notebook,
the ML model is applied to make predictions over the entire input EO data. The
steps to perform these operations are shown in the simplified code snippet
below.

# Import libraries
import numpy as np
import rasterio
from rasterio.features import sieve

# Select EO assets from the loaded Sentinel-2 scene (eo_item)
fileList = {}
for f in eo_item.get_assets():
  if (f in feature_cols) or f == 'scl':
    fileList[f] = eo_item.get_assets()[f].href

# Load the ML model classifier
model = joblib.load(model_fname)

# Make predictions
predictions = ml_helper.readRastersToArray(model, fileList, feature_cols)

# Save predictions
df_predict = pd.DataFrame(predictions.ravel(), columns=['predictions'])
df_predict.to_pickle('prediction.pkl')

# Create binary mask (10980 x 10980 pixels for a 10 m Sentinel-2 tile)
predictions = df_predict['predictions']
predictions = predictions.to_numpy().reshape((10980, 10980))

# Apply sieve operation to remove small features (threshold in pixels)
my_array_uint8 = predictions.astype(rasterio.uint8)
sieved = sieve(my_array_uint8, threshold=1000, connectivity=8)

# Use the Scene Classification (SCL) band to filter out clouds and bad data
with rasterio.open(fileList['scl']) as scl_src:
  scl = scl_src.read(1)
  scl = np.where(~np.isin(scl, [4, 5, 6, 7, 11]), np.nan, scl)
mask_out = np.where(~np.isnan(scl), sieved, np.nan)

# Plot the masked binary water mask
import matplotlib.pyplot as plt
plt.imshow(mask_out, interpolation='none'); plt.title("Improved result")




In the figure above, water bodies are plotted in yellow, non-water pixels in
dark blue, and clouds are masked out in white (top-right corner of the image).


CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement
Project to help a ML practitioner:

 * create EO data labels, using QGIS Software or a Solara / Leafmap application
 * load Labels and EO data with STAC API
 * sample EO data with Labels and create a dataframe
 * use the dataframe to train a Random Forest classifier
 * perform raster inference on a selected Sentinel-2 scene to generate a binary
   water mask.

Useful links:

 * The link to the Notebook for User Scenario 2 is:
   https://github.com/ai-extensions/notebooks/blob/main/scenario-2/s2-labellingEOdata.ipynb
   Note: access to the Notebook for User Scenario 2 must be granted - please
   send an email to support@terradue.com with subject “Request Access to
   s2-labellingEOdata” and body “Please provide access to Notebook for AI
   Extensions User Scenario 2”
 * The user manual of the AI/ML Enhancement Project Platform is available at
   AI-Extensions Application Hub - User Manual
 * Link to the project update article “AI/ML Enhancement Project - Progress
   Update”
 * Link to User Scenario 1 article “AI/ML Enhancement Project - Exploratory Data
   Analysis”





simonevaccari (Terradue staff), May 29



AI/ML ENHANCEMENT PROJECT - DESCRIBING LABELLED EO DATA WITH STAC


INTRODUCTION

The use of the SpatioTemporal Asset Catalog (STAC) format is crucial when it
comes to describing spatio-temporal datasets, including labelled Earth
Observation (EO) data. It makes it possible to describe the labelled EO data
while defining standardised sets of metadata that delineate its key properties,
such as spatial and temporal extents, resolution, and other pertinent
characteristics. The use of STAC brings several benefits, including enhancing
the reproducibility and transparency of the process and its results, as well as
ensuring that the data becomes discoverable and accessible to other
stakeholders (e.g. users, researchers, policymakers).

This post presents User Scenario 3 of the AI/ML Enhancement Project, titled
“Alice describes the labelled EO data”. It demonstrates how the enhancements
being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic
Exploitation Platform (U-TEP) will support users describing labelled EO data
using the STAC format.

To demonstrate the new capabilities defined in this User Scenario, an
interactive Jupyter Notebook is used to guide an ML practitioner, such as
Alice, through the process of exploiting the STAC format to describe, publish,
and search labelled EO data, including:

 * Loading labelled EO data (a .geojson file) and displaying it as a geopandas
   dataframe
 * Showing the labelled EO data on an interactive map
 * Generating a STAC Item and adding metadata to it
 * Publishing the STAC Item on a dedicated S3 bucket and on the STAC endpoint
 * Searching for the STAC Item using the STAC API and query parameters

Practical examples and commands are displayed to demonstrate how these new
capabilities can be used from a Jupyter Notebook.



[Figure: Notebook preview]




LOADING LABELLED EO DATA

A .geojson file of the labelled EO data was loaded into the notebook and
converted into a geopandas dataframe.

import geopandas as gpd
import geojson

fname = './input/label-S2A_10SFH_20230519_0_L2A.geojson'

# Load the raw GeoJSON object
with open(fname) as f:
  gj = geojson.load(f)

# Make a geodataframe out of the same file
gdf = gpd.read_file(fname)
gdf




The Python library folium was then used to display the labelled EO data on an
interactive map.

import folium
import numpy as np
from folium import GeoJson, LayerControl

# Get extent and centre of the dataframe points
# (total_bounds returns minx, miny, maxx, maxy)
bbox = gdf.geometry.total_bounds
centerx, centery = (np.average(bbox[1::2]), np.average(bbox[::2]))

# Create map
map = folium.Map(location=[centerx, centery], tiles="OpenStreetMap", zoom_start=9)

# Add Labels to map (addPoints2Map is a helper defined in the Notebook)
map = addPoints2Map(gdf, map)

# Add layer control
LayerControl().add_to(map)

# Visualise map
map






GENERATE STAC ITEM

Before creating the STAC Item, the user defines the geometry of the vector data
represented by the dataframe.

# Get geometry of dataframe points

label_geom = geojson.Polygon([[
  (bbox[0], bbox[1]),
  (bbox[2], bbox[1]),
  (bbox[2], bbox[3]),
  (bbox[0], bbox[3]),
  (bbox[0], bbox[1])
]])



The user can now create the STAC Item and populate it with relevant information,
by exploiting the pystac library.

import pystac
from datetime import datetime

# Creating the STAC Item
label_item = pystac.Item(
  id="<label_id>",
  geometry=label_geom,
  bbox=list(bbox),
  datetime=datetime.utcnow(),
  properties={},
)


The user defines a dictionary named label_classes to represent the classes for a
classification task. The dictionary contains the class names for various land
cover types, such as vegetation, water, clouds, shadows, and more. This mapping
can be used to label and categorise data in a classification process.

The user can then apply the label-specific STAC Extension with the defined
label classes.

from pystac.extensions.label import LabelExtension, LabelType, LabelClasses
from pystac.extensions.version import ItemVersionExtension

# Define label classes
label_classes = {
  "name": "CLASSIFICATION",
  "classes": [
    "NO_DATA",
    "SATURATED_OR_DEFECTIVE",
    "CAST_SHADOWS",
    "CLOUD_SHADOWS",
    "VEGETATION",
    "NOT_VEGETATED",
    "WATER",
    "UNCLASSIFIED",
    "CLOUD_MEDIUM_PROBABILITY",
    "CLOUD_HIGH_PROBABILITY",
    "THIN_CIRRUS",
    "SNOW or ICE",
  ],
}

# Apply label-specific STAC Extension “LabelExtension” with its related fields
label = LabelExtension.ext(label_item, add_if_missing=True)
label.apply(
   label_description="Land cover labels",
   label_type=LabelType.VECTOR,
   label_tasks=["segmentation", "regression"],
   label_classes=[LabelClasses(label_classes)],
   label_methods=["manual"],
   label_properties=["CLASSIFICATION"],
)

# Add geojson labels
label.add_geojson_labels(f"label-{label_id}.geojson")

# Add version
version = ItemVersionExtension(label_item)
version.apply(version="0.1", deprecated=False)

label_item.stac_extensions.extend(
   ["https://stac-extensions.github.io/version/v1.2.0/schema.json"]
)


Finally, the user validates the created STAC Item.

# Validate STAC Item
label_item.validate()
display(label_item)




[Image: STAC Item JSON preview]




PUBLISH THE STAC ITEM

The STAC endpoint and STAC Collection in which to publish the STAC Item are
firstly defined:

stac_endpoint = "https://ai-extensions-stac.terradue.com"
collection = read_file("input/collection/collection.json")


Subsequently, the STAC Item can be posted on a dedicated S3 bucket.

import os

# Define filename and write locally
out_fname = f"item-label-{label_id}.json"
pystac.write_file(label_item, dest_href=out_fname)


# Define wrapper to write on S3 bucket
wrapper = StarsCopyWrapper()
exit_code, stdout, stderr = (
   wrapper.recursivity()
   .output(f"s3://ai-ext-bucket-dev/svv-dataset/{label_id}")
   .config_file("/etc/Stars/appsettings.json")
   .extract_archive(extract=False)
   .absolute_assets()
   .run(f"file://{os.getcwd()}/{out_fname}")
)


When the STAC Item is posted on S3, it can be published on the dedicated STAC
endpoint.

# Define customized StacIO class
StacIO.set_default(CustomStacIO)

# Read catalog.json file posted on S3
catalog_url = f"s3://ai-ext-bucket-dev/svv-dataset/{label_id}/catalog.json"
catalog = read_url(catalog_url)

ingest_items(
   app_host=stac_endpoint,
   items=list(catalog.get_all_items()),
   collection=collection,
   headers=get_headers(),
)



FIND STAC ITEM ON STAC CATALOG

Once the STAC Item is successfully published on the STAC endpoint, it can be
searched using the pystac and pystac_client libraries. These libraries enable
users to interact with a STAC catalog by defining specific query parameters,
such as time range, area of interest, and data collection preferences.
Subsequently, only the STAC Items that align with the provided criteria are
retrieved for the user.

# Import libraries
import pystac
from pystac_client import Client
from datetime import datetime

# Access the STAC Catalog
cat = Client.open(stac_endpoint, headers=get_headers(), ignore_conformance=True)

# Define query parameters
start_date = datetime.strptime("20230601", '%Y%m%d')
end_date = datetime.strptime("20230630", '%Y%m%d')
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
tile = "10SFH"

# Query by AOI, start and end date
query_sel = cat.search(
    collections=["ai-extensions-svv-dataset-labels"],
    datetime=(start_date, end_date),
    bbox=bbox,
)
item = [item for item in query_sel.item_collection() if tile in item.id][0]

# Display Item
display(item)




[Image: searched STAC Item]




CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement
Project to help an ML practitioner exploit the STAC format to describe,
publish, and search labelled EO data, including:

 * Loading labelled EO data (a .geojson file) and displaying it as a geopandas
   dataframe
 * Showing the labelled EO data on an interactive map
 * Generating a STAC Item and adding metadata to it
 * Publishing the STAC Item on a dedicated S3 bucket and on the STAC endpoint
 * Searching for the STAC Item using the STAC API and query parameters with
   pystac

Useful links:

 * The link to the Notebook for User Scenario 3 is:
   https://github.com/ai-extensions/notebooks/blob/main/scenario-3/s3-describingEOdata.ipynb
   Note: access to this Notebook must be granted - please send an email to
   support@terradue.com with subject “Request Access to s3-describeEOdata” and
   body “Please provide access to Notebook for AI Extensions User Scenario 3”
 * The user manual of the AI/ML Enhancement Project Platform is available at
   AI-Extensions Application Hub - User Manual
 * Link to the project update article “AI/ML Enhancement Project - Progress
   Update”
 * Link to User Scenario 1 article “AI/ML Enhancement Project - Exploratory Data
   Analysis”
 * Link to User Scenario 2 article “AI/ML Enhancement Project - Labelling EO
   Data”





14 days later
simonevaccari (Terradue staff), Jun 13



AI/ML ENHANCEMENT PROJECT - DISCOVERING LABELLED EO DATA WITH STAC


INTRODUCTION

The use of the SpatioTemporal Asset Catalog (STAC) format is crucial when it
comes to searching for and discovering spatio-temporal datasets, including
labelled Earth Observation (EO) data. It allows filtering search results using
STAC metadata as query parameters, such as spatial and temporal extents,
resolution, and other properties. As well as ensuring that the data becomes
discoverable and accessible to other stakeholders (e.g. users, researchers,
policymakers), the use of STAC brings several other benefits, including
enhancing the reproducibility and transparency of the process and its results.

This post presents User Scenario 4 of the AI/ML Enhancement Project, titled
“Alice discovers the labelled EO data”. It demonstrates how the enhancements
being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic
Exploitation Platform (U-TEP) will support users exploiting STAC format to
discover labelled EO data.

To demonstrate the new capabilities defined in this User Scenario, an
interactive Jupyter Notebook is used to guide an ML practitioner, such as
Alice, through the process of exploiting the STAC format to discover labelled
EO data, including:

 * Understanding the STAC format
 * Accessing STAC via STAC Browser and STAC API
 * Connectivity with dedicated S3 storage

Practical examples and commands are displayed to demonstrate how these new
capabilities can be used from a Jupyter Notebook.



[Figure: Notebook preview]




UNDERSTANDING STAC

The SpatioTemporal Asset Catalog (STAC) specification was designed to establish
a standard, unified language to talk about geospatial data, allowing it to be
more easily searchable and queryable. By defining query parameters based on STAC
metadata, such as spatial and temporal extents, resolution, and other
properties, the user can narrow down a search with only those datasets that
align with the specific requirements.

There are four component specifications that together make up the core STAC
specification:

 * STAC Item: the core unit representing a single spatiotemporal asset as a
   GeoJSON feature with datetime and links.

 * STAC Catalog: a simple, flexible JSON file of links that provides a structure
   to organize and browse STAC Items.

 * STAC Collection: an extension of the STAC Catalog with additional information
   such as the extents, license, keywords, providers, etc., that describe STAC
   Items that fall within the Collection.

 * STAC API: provides a RESTful endpoint that enables the search of STAC Items,
   specified in OpenAPI and following OGC API - Features (formerly WFS 3).

A STAC Catalog is used to group STAC objects like Items, Collections, and/or
even other Catalogs.

Some commands of the pystac library that can be used to extract information
from a STAC Catalog / Item / Collection are shown below.

import pystac
from pystac import Catalog

# Read a STAC Catalog from file and explore high-level catalog information
cat = Catalog.from_file(url)
cat.describe()

# Print some key metadata
print(f"ID: {cat.id}")
print(f"Title: {cat.title or 'N/A'}")
print(f"Description: {cat.description}")

# Access STAC child Catalogs and/or Collections
col = [col for col in cat.get_all_collections()]

# Explore STAC Item metadata
item = cat.get_item(id="<item_id>", recursive=True)

More information can be found in the official STAC documentation.


ACCESSING STAC VIA STAC BROWSER AND STAC API

There are two ways to discover STAC data: by using the STAC Browser or by using
the STAC API.


ACCESSING USING STAC BROWSER

The STAC Browser provides a user-friendly graphical interface that facilitates
the search and discovery of datasets. A few screenshots of the graphical
interface are provided below.

The dedicated STAC Browser app can be launched by the user at login with the
option STAC Browser for AI-Extensions STAC API. The STAC Catalog and Collections
available on the App Hub project endpoint will be displayed.



[Figure: STAC Browser catalog view]



After selecting a specific collection, the query parameters can be manually
specified with the dedicated widgets in the Filters section (temporal and
spatial extents in this case).



[Figure: STAC Browser filters]



The search results are then shown after clicking Submit. In the example
screenshot below, a single STAC Item is shown with its key metadata.



[Figure: STAC Browser search results]



Despite its user-friendly interface, the STAC Browser is limited to manual
interaction, which makes performing multiple searches with different parameters
difficult and time consuming. For this reason, the STAC Browser is primarily
designed for manual exploration and is less suited for automated workflows.


ACCESSING USING STAC API

The STAC API allows for programmatic access to data, enabling automation of data
discovery, retrieval, and processing workflows. This is particularly useful for
integrating STAC data into larger geospatial data processing pipelines or
applications.

import os
import requests

# Define payload for the token request
payload = {
  "client_id": "ai-extensions",
  "username": "ai-extensions-user",
  "password": os.environ.get("IAM_PASSWORD"),
  "grant_type": "password",
}

# get_token is a helper defined in the Notebook
auth_url = 'https://iam-dev.terradue.com/realms/ai-extensions/protocol/openid-connect/token'
token = get_token(url=auth_url, **payload)
headers = {"Authorization": f"Bearer {token}"}


Once the authentication credentials are defined, the private STAC Catalog can be
accessed and searched using specific query parameters, such as time range, area
of interest, and data collection preferences. Subsequently, only the STAC Items
that align with the provided criteria are retrieved for the user. This can be
achieved with the pystac and pystac_client libraries.

# Import libraries
import pystac
from pystac_client import Client
from datetime import datetime

# Define STAC endpoint and access the Catalog
stac_endpoint = "https://ai-extensions-stac.terradue.com"
cat = Client.open(stac_endpoint, headers=headers, ignore_conformance=True)

# Define query parameters
start_date = datetime.strptime("20230601", '%Y%m%d')
end_date = datetime.strptime("20230630", '%Y%m%d')
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
tile = "10SFH"

# Query by AOI, start and end date
query_sel = cat.search(
  collections=["ai-extensions-svv-dataset-labels"],
  datetime=(start_date, end_date),
  bbox=bbox,
)
item = [item for item in query_sel.item_collection() if tile in item.id][0]

# Display Item
display(item)




[Image: retrieved STAC Item]




CONNECTIVITY WITH DEDICATED S3 STORAGE

Up until now the user accessed the STAC endpoint for exploring the Catalog and
its Collections / Items. In this section we describe the process to access the
data referenced in the Item’s assets, which are stored in a dedicated S3 bucket.

The AWS S3 configuration settings are defined in a .json file (e.g.
appsettings.json), which is used to create a UserSettings object. This is then
used to create a configured S3 client to retrieve an object stored on S3, using
the boto3 and botocore libraries.

# Import libraries
import os
import boto3
import botocore
from urllib.parse import urlparse

# Define AWS S3 settings (UserSettings is a helper defined in the Notebook)
settings = UserSettings("appsettings.json")
settings.set_s3_environment(<asset_s3_path>)

# Start a botocore session
session = botocore.session.Session()

# Create the client object
s3_client = session.create_client(
  service_name="s3",
  region_name=os.environ.get("AWS_REGION"),
  use_ssl=True,
  endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
  aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
  aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)
parsed = urlparse(geojson_url)

# Retrieve bucket name and object key
bucket = parsed.netloc
key = parsed.path[1:]

# Retrieve the object stored on S3
response = s3_client.get_object(Bucket=bucket, Key=key)


The user can then download the file stored on S3 to a local file using the io library.

import io

# Read the object body returned by S3
geojson_content = io.BytesIO(response["Body"].read())
fname = './output/downloaded.geojson'

# Save the GeoJSON content to a local file
with open(fname, "wb") as file:
  file.write(geojson_content.getvalue())


The user can also import the downloaded data into the Notebook. In this
example, the downloaded .geojson file is loaded and converted into a geopandas
dataframe.

import geopandas as gpd

# Make geodataframe out of the downloaded .geojson file
gdf = gpd.read_file(fname)
gdf





CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement
Project to help an ML practitioner exploit the STAC format to discover
labelled EO data, including:

 * Understanding the STAC format
 * Accessing STAC via STAC Browser and STAC API
 * Connectivity with dedicated S3 storage

Useful links:

 * The link to the Notebook for User Scenario 4 is:
   https://github.com/ai-extensions/notebooks/blob/develop/scenario-4/s4-discoveringLabelledEOData.ipynb
   Note: access to this Notebook must be granted - please send an email to
   support@terradue.com with subject “Request Access to
   s4-discoveringLabelledEOData” and body “Please provide access to Notebook for
   AI Extensions User Scenario 4”
 * The user manual of the AI/ML Enhancement Project Platform is available at
   AI-Extensions Application Hub - User Manual
 * Project Update “AI/ML Enhancement Project - Progress Update”
 * User Scenario 1 “AI/ML Enhancement Project - Exploratory Data Analysis”
 * User Scenario 2 “AI/ML Enhancement Project - Labelling EO Data”
 * User Scenario 3 “AI/ML Enhancement Project - Describing labelled EO data”





25 days later
pmembari (Terradue staff), 8d ago



AI/ML ENHANCEMENT PROJECT - DEVELOPING A NEW ML MODEL AND TRACKING WITH MLFLOW


INTRODUCTION

In this scenario, the ML practitioner Alice develops a Convolutional Neural
Network (CNN) model for a classification task and employs MLflow to monitor
the ML model development cycle. MLflow is a crucial tool that ensures effective
log tracking and preserves key information, including specific code versions,
datasets used, and model hyperparameters. By logging this information, the
reproducibility of the work drastically increases, enabling users to revisit and
replicate past experiments accurately. Moreover, quality metrics such as
classification accuracy, loss function fluctuations, and inference time are also
tracked, enabling easy comparison between different models.
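
As a minimal hedged sketch of the kind of tracking described above (the
tracking URI, experiment name, and logged values are illustrative, not the
project's actual configuration):

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # assumed tracking server
mlflow.set_experiment("eurosat-cnn")              # illustrative experiment name

with mlflow.start_run():
    # Log hyperparameters (e.g. values read from params.yml)
    mlflow.log_param("BATCH_SIZE", 128)
    mlflow.log_param("LEARNING_RATE", 0.001)
    # Log quality metrics after evaluation (values are placeholders)
    mlflow.log_metric("accuracy", 0.93)
    mlflow.log_metric("val_loss", 0.21)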

This post presents User Scenario 5 of the AI/ML Enhancement Project, titled
“Alice develops a new ML model”. It demonstrates how the enhancements being
deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic
Exploitation Platform (U-TEP) will support users in developing a new ML model
and in using MLflow to track experiments.

These new capabilities are implemented with an interactive Jupyter Notebook to
guide an ML practitioner, such as Alice, through the following steps:

 * Data ingestion
 * Design the ML model architecture
 * Train the ML model and fine-tuning
 * Evaluate the ML model performance with metrics such as accuracy, precision,
   recall, or F1 score, and confusion matrix
 * Check experiments with MLflow

These steps are outlined in the diagram below.



[Figure: workflow diagram]



Practical examples and commands are displayed to demonstrate how these new
capabilities can be used from a Jupyter Notebook.



[Figure: Notebook preview]




DATA INGESTION

The training data used for this scenario is the EuroSAT dataset, which is based
on ESA’s Sentinel-2 data, covering 13 spectral bands and consisting of 10
classes with a total of 27,000 labelled and geo-referenced images. A separate
Notebook was created to generate a STAC Catalog, a STAC Collection, and STAC
Items for the entire EuroSAT dataset, and to publish these on the STAC endpoint
(https://ai-extensions-stac.terradue.com/collections/EUROSAT_2024_dataset).

The data ingestion process was implemented with a DataIngestion class,
configured with three main components (a sketch is given after this list):

 * stac_loader: for fetching the dataset from the STAC endpoint
 * data_splitting: for splitting the dataset into train, test and validation
   sets with defined percentages
 * data_downloader: for downloading the data onto the local system.
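
A hedged sketch of how such a class could be organised is given below; the
class and method names mirror the components listed above, but the actual
signatures in the project repository may differ.

from dataclasses import dataclass
from pystac_client import Client

@dataclass
class DataIngestionConfig:
    stac_endpoint: str
    collection_id: str
    splits: tuple = (0.70, 0.15, 0.15)  # train / test / validation fractions
    download_dir: str = "./data"

class DataIngestion:
    def __init__(self, config: DataIngestionConfig):
        self.config = config

    def stac_loader(self):
        # Fetch the dataset items from the STAC endpoint
        cat = Client.open(self.config.stac_endpoint)
        search = cat.search(collections=[self.config.collection_id])
        return list(search.item_collection())

    def data_splitting(self, items):
        # Split the item list into train / test / validation subsets
        n_train = int(self.config.splits[0] * len(items))
        n_test = int(self.config.splits[1] * len(items))
        return (items[:n_train],
                items[n_train:n_train + n_test],
                items[n_train + n_test:])

    def data_downloader(self, items):
        # Download each item's assets into the local download directory
        ...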


ML MODEL ARCHITECTURE

In this section, the user defines a Convolutional Neural Network (CNN) model
with six layers. The first layer serves as the input layer, accepting an image
with a defined shape of (13, 64, 64) (i.e. the same as the shape of the EuroSAT
images in this case). The model is designed with four convolutional layers, each
employing a relu activation function, a BatchNormalization layer, a 2D
MaxPooling operation, and a Dropout layer. Subsequently, the model includes two
Dense layers; finally, a Softmax activation is applied to the last Dense layer,
which generates a vector with 10 cells containing the likelihood of each
predicted class. The user defines a loss function and an optimizer, and
eventually the best model is compiled and saved locally at each epoch based on
the improvement in the validation loss. The input parameters defining the ML
model architecture are described in a params.yml file, which is used for the
configuration process. See below for the params.yml file defined in this test.

params.yml

BATCH_SIZE: 128
EPOCHS: 50
LEARNING_RATE: 0.001
DECAY: 0.1 ### float
EPSILON: 0.0000001
MOMENTUM: 0.9
LOSS: categorical_crossentropy
# choose one of l1, l2, None
REGULARIZER: None
OPTIMIZER: SGD
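
As a hedged illustration, a Keras definition consistent with the layer list and
parameter counts printed in the summary further below (dropout rates are
assumptions, channels-last input is assumed, and the BatchNormalization layers
mentioned above are omitted because they do not appear in the printed summary):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, Activation, MaxPooling2D,
                                     Dropout, Flatten, Dense)

model = Sequential([
    Conv2D(32, (3, 3), padding='same', input_shape=(64, 64, 13)),
    Activation('relu'),
    Conv2D(32, (3, 3)),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Conv2D(64, (3, 3), padding='same'),
    Activation('relu'),
    Conv2D(64, (3, 3)),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(512),
    Activation('relu'),
    Dropout(0.5),
    Dense(10),
    Activation('softmax'),
])
model.compile(optimizer='SGD', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()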


The configuration of the ML model architecture is run with a dedicated pipeline,
such as that defined below.

# pipeline
try:
  config = ConfigurationManager()
  prepare_base_model_config = config.get_prepare_base_model_config()
  prepare_base_model = PrepareBaseModel(config=prepare_base_model_config)
  prepare_base_model.base_model()
except Exception as e:
  raise e


The output of the ML model architecture configuration is displayed below,
allowing the user to summarise the model and report the number of trainable and
non-trainable parameters.

Model: "sequential"
___________________________________________________________________
 Layer (type)                    Output Shape              Param #   
===================================================================
 conv2d (Conv2D)                 (None, 64, 64, 32)        3776                                                              
 activation (Activation)         (None, 64, 64, 32)        0                                                                          
 conv2d_1 (Conv2D)               (None, 62, 62, 32)        9248                                                                       
 activation_1 (Activation)       (None, 62, 62, 32)        0        
 max_pooling2d (MaxPooling2D)    (None, 31, 31, 32)        0                                                   
 dropout (Dropout)               (None, 31, 31, 32)        0         
 conv2d_2 (Conv2D)               (None, 31, 31, 64)        18496     
 activation_2 (Activation)       (None, 31, 31, 64)        0         
 conv2d_3 (Conv2D)               (None, 29, 29, 64)        36928    
 activation_3 (Activation)       (None, 29, 29, 64)        0         
 max_pooling2d_1 (MaxPooling2D)  (None, 14, 14, 64)        0         
 dropout_1 (Dropout)             (None, 14, 14, 64)        0         
 flatten (Flatten)               (None, 12544)             0         
 dense (Dense)                   (None, 512)               6423040   
 activation_4 (Activation)       (None, 512)               0         
 dropout_2 (Dropout)             (None, 512)               0         
 dense_1 (Dense)                 (None, 10)                5130      
 activation_5 (Activation)       (None, 10)                0         
===================================================================
Total params: 6,496,618
Trainable params: 6,496,618
Non-trainable params: 0
===================================================================



TRAINING AND FINE-TUNING

The steps involved in the training phase are as follows:

 * Create the training entity
 * Create the configuration manager
 * Define the training component
 * Run the training pipeline

As mentioned in the “Training Data Ingestion” chapter, the training data was
split into train, validation, and test sets to ensure that the model is
trained effectively and its performance is evaluated accurately and without
bias. The user trains the ML model on the train set for the number of epochs
defined in the params.yml file; after each epoch, the model is evaluated on
the validation set to detect overfitting. There are several approaches to
address overfitting during training. One effective method, sketched below, is
adding a regularizer to the model’s layers, which introduces a penalty term
into the loss function to discourage large weights. Finally, the test set,
which is not used in any part of the training or validation process, is used
to evaluate the final model’s performance.
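
A minimal example of this approach in Keras is shown below; the L2 factor is
illustrative and would correspond to the REGULARIZER entry in params.yml:

# Illustrative: attach an L2 penalty to a Dense layer's weights.
from tensorflow.keras import layers, regularizers

dense = layers.Dense(
    512,
    kernel_regularizer=regularizers.l2(1e-4),  # adds a penalty on large weights to the loss
)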

In order to assess the ML model’s performance and reliability, the user can plot
the Loss and Accuracy curves of the Training and Validation sets. This can be
done with the matplotlib library, as illustrated below.

# Import library
import matplotlib.pyplot as plt

# `history` is the History object returned by model.fit(...)
plt.figure(figsize=(12, 5))

# Plot Loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()

# Plot Accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.tight_layout()
plt.show()




[Image: s5-curves — training and validation loss and accuracy curves]




EVALUATION

The evaluation of the trained ML model was conducted on the test set. It is
crucial for the user to prevent any data leakage between the train and test sets
to ensure an independent and unbiased assessment of the training pipeline’s
outcome. The model’s performance was measured using the following evaluation
metrics: accuracy, recall, precision, F1-score, and the confusion matrix.

 * Accuracy: the ratio of correctly predicted instances to the total number of
   instances in the dataset
 * Recall: also known as sensitivity or true positive rate, it measures the
   model’s ability to correctly identify all relevant instances in a dataset
 * Precision: the accuracy of the positive predictions made by a
   classification model
 * F1-score: a metric that combines precision and recall into a single value;
   it is particularly useful when the class distribution is uneven (imbalanced
   classes) and provides a balance between precision and recall
 * Confusion Matrix: a detailed breakdown of the model’s performance,
   highlighting correct and incorrect predictions (a minimal computation
   sketch follows this list).
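
A minimal sketch of how these metrics can be computed with scikit-learn,
assuming y_true and y_pred arrays of class labels (scikit-learn is used here
for illustration only):

# Illustrative computation of the evaluation metrics listed above.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

acc  = accuracy_score(y_true, y_pred)                    # correct / total
prec = precision_score(y_true, y_pred, average="macro")  # TP / (TP + FP), averaged over classes
rec  = recall_score(y_true, y_pred, average="macro")     # TP / (TP + FN), averaged over classes
f1   = f1_score(y_true, y_pred, average="macro")         # 2 * prec * rec / (prec + rec)
cm   = confusion_matrix(y_true, y_pred)                  # rows: true class, columns: predicted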

The pipeline for generating the evaluation metrics was defined as follows:

# Pipeline: evaluate the trained model on the test set
try:
  config = ConfigurationManager()
  eval_config = config.get_evaluation_config()
  evaluation = Evaluation(eval_config)
  test_dataset, conf_mat = evaluation.evaluation()
  evaluation.log_into_mlflow()
except Exception as e:
  raise e


The confusion matrix can be plotted with the seaborn library, as shown below.

# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def plot_confusion_matrix(self):
  class_names = np.unique(self.y_true)
  fig, ax = plt.subplots()

  # Create a heatmap of the confusion matrix
  sns.heatmap(
    self.matrix,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=class_names,
    yticklabels=class_names,
    ax=ax
  )

  # Add labels and title
  plt.xlabel('Predicted')
  plt.ylabel('True')
  plt.title('Confusion Matrix')

  # Show the plot and return the figure so it can be logged to MLflow
  plt.show()
  return fig





MLFLOW TRACKING

The training, fine-tuning, and evaluation processes are executed multiple
times, in what are referred to as “runs”. Runs are generated by executing
multiple jobs, each with a different combination of the parameters specified
in the params.yml file described in the ML Model Architecture section. The
user monitors all executed runs during the training and evaluation phases
using MLflow and its built-in tracking functionalities, as shown in the code
below.

# Import libraries
import os
from urllib.parse import urlparse

import mlflow
import tensorflow

def log_into_mlflow(self):
  mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
  tracking_url_type_store = urlparse(os.environ.get("MLFLOW_TRACKING_URI")).scheme
  confusion_matrix_figure = self.plot_confusion_matrix()

  with mlflow.start_run():
    mlflow.tensorflow.autolog()
    mlflow.log_params(self.config.all_params)
    mlflow.log_figure(confusion_matrix_figure, artifact_file="Confusion_Matrix.png")
    mlflow.log_metrics(
      {
        "loss": self.score[0], "test_accuracy": self.score[1],
        "test_precision": self.score[2], "test_recall": self.score[3],
      }
    )
    # Model registry does not work with a file store
    if tracking_url_type_store != "file":
      mlflow.tensorflow.log_model(self.model, "model", registered_model_name="CNN")
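
Note that the MLFLOW_TRACKING_URI environment variable must point at the
MLflow tracking server before this method runs. In environments where it is
not already set, it can be exported explicitly; the URI below is a
placeholder, not a real endpoint:

# Illustrative only: configure the tracking endpoint before calling log_into_mlflow().
import os
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.example.local:5000"  # placeholder URI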


The MLflow dashboard allows for visual and interactive comparisons of different
runs, enabling the user to make informed decisions when selecting the best
model. The user can access the MLflow dashboard by clicking on the dedicated
icon from the user’s App Hub dashboard.



[Image: s5-mlflow-jupyterhub_icon — MLflow icon on the App Hub dashboard]



On the MLflow dashboard, the user can select the experiments to compare in the
“Experiment” tab.



[Image: s5-mlflow-compare1 — selecting experiments to compare in the “Experiment” tab]



Subsequently, the user can select the specific parameters and metrics to
include in the comparison from the “Visualizations” dropdown, which then
displays the behaviour and details of the runs for the selected metrics and
parameters.



[Image: s5-mlflow-compare2 — run comparison in the “Visualizations” dropdown]



The comparison of the parameters and metrics is shown in the dedicated
dropdown.



[Image: s5-mlflow-parameters — parameter comparison]





[Image: s5-mlflow-metrics — metric comparison]




CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML
Enhancement Project to guide an ML practitioner through the development of a
new ML model, together with the related tracking functionalities provided by
MLflow, including:

 * Ingesting the training data
 * Designing the ML model architecture
 * Training and fine-tuning the ML model
 * Evaluating the ML model performance with metrics such as accuracy,
   precision, recall, F1-score, and the confusion matrix
 * Inspecting experiments with the MLflow dashboard and tools.

Useful links:

 * The link to the Notebook for User Scenario 5 is:
   https://github.com/ai-extensions/notebooks/blob/main/scenario-5/trials/s5-newMLModel.ipynb
   Note: access to this Notebook must be granted - please send an email to
   support@terradue.com with subject “Request Access to s5-newMLModel” and body
   “Please provide access to Notebook for AI Extensions User Scenario 5”
 * The user manual of the AI/ML Enhancement Project Platform is available at
   AI-Extensions Application Hub - User Manual
 * Project Update “AI/ML Enhancement Project - Progress Update”
 * User Scenario 1 “AI/ML Enhancement Project - Exploratory Data Analysis”
 * User Scenario 2 “AI/ML Enhancement Project - Labelling EO Data”
 * User Scenario 3 “AI/ML Enhancement Project - Describing labelled EO data”
 * User Scenario 4 “AI/ML Enhancement Project - Discovering labelled EO data
   with STAC”





