ANNOUNCING THE LAUNCH OF THE AI/ML ENHANCEMENT PROJECT FOR GEP AND URBAN TEP EXPLOITATION PLATFORMS

Category: AI Extensions (discuss.terradue.com)

Posted by pedro (Terradue staff), May '23

We are excited to announce the launch of a new project aimed at augmenting the capabilities of two Ellip-powered Exploitation platforms, the Geohazards Exploitation Platform (GEP) and the Urban Thematic Exploitation Platform (U-TEP). The project's primary objective is to seamlessly integrate an AI/ML processing framework into both platforms to enhance their services and empower service providers to develop and deploy AI/ML models for improved geohazards and urban management applications.

Project Overview

The project will focus on integrating a comprehensive AI/ML processing framework that covers the entire machine learning pipeline, including data discovery, training data, model development, deployment, hosting, monitoring, and visualization. A critical aspect of this project will be the integration of MLOps processes into both GEP and Urban TEP platforms' service offerings, ensuring the smooth operation of AI-driven applications on the platforms.

GEP and Urban TEP Platforms

GEP is designed to support the exploitation of satellite Earth Observations for geohazards, focusing on mapping hazard-prone land surfaces and monitoring terrain deformation. It offers over 25 services for monitoring terrain motion and critical infrastructures, with more than 2500 registered users actively participating in content creation.

Urban TEP aims to provide end-to-end and ready-to-use solutions for a broad spectrum of users to extract unique information and indicators required for urban management and sustainability. It focuses on bridging the gap between the mass data streams and archives of various satellite missions and the information needs of users involved in urban and environmental science, planning, and policy.

Project Partners

The project brings together a strong partnership of experienced organizations, including Terradue, CRIM, Solenix, and Gisat. These partners have a proven track record in various aspects of Thematic Exploitation Platforms, cloud research platforms, AI/ML applications, and EO data analytics.

Expected Outcomes

Upon successful completion, the project will result in the enhancement of both GEP and Urban TEP platforms and their service offerings. The addition of AI/ML capabilities will empower service providers to develop and deploy AI/ML models, ultimately improving their services and delivering added value to their customers. This enhancement will greatly benefit the GEP and Urban TEP platforms by expanding their capabilities and enabling new AI-driven applications for geohazards and urban management.

Discussion Points:

1. How do you foresee AI/ML capabilities enhancing the services provided by GEP and Urban TEP?
2. What challenges do you anticipate in integrating AI/ML processing frameworks into existing platforms?
3. Which use cases do you believe would benefit the most from the addition of AI/ML capabilities in GEP and Urban TEP?

We encourage you to share your thoughts, ideas, and experiences related to the project. Let's discuss the potential impact and improvements this project can bring to the GEP and Urban TEP platforms and their user communities.
Posted by simonevaccari (Terradue staff), Mar 21

AI/ML ENHANCEMENT PROJECT - PROGRESS UPDATE

BACKGROUND

One year has passed since the announcement of the AI/ML Enhancement Project launch (see the post above). This project integrates cutting-edge Artificial Intelligence (AI) and Machine Learning (ML) technologies into Earth Observation (EO) platforms such as the Geohazards Exploitation Platform (GEP) and the Urban Thematic Exploitation Platform (U-TEP) through MLOps, the fusion of ML with DevOps principles. Leveraging these platforms' extensive EO data usage, the new AI extensions promise enhanced efficiency, accuracy, and functionality. The integration of these new capabilities unlocks advanced data processing, predictive modelling, and automation, strengthening capabilities in urban management and geohazard assessment.

USER PERSONAS, USER SCENARIOS AND SHOWCASES

For the project implementation we have identified two types of users:

* An ML Practitioner, whom we will call "Alice": an expert in building and training ML models, selecting appropriate algorithms, analysing data, and using ML techniques to solve real-world problems.
* A Consumer, whom we will call "Eric": a stakeholder or user (e.g. a business owner, a customer, a researcher) who benefits from, or relies upon, the insights or predictions generated by the ML models to inform their decision-making process.

From these users we have derived ten User Scenarios that capture the key activities and goals of these types of users in utilising the service. The user scenarios are:

* User Scenario 1 - Alice does Exploratory Data Analysis (EDA)
* User Scenario 2 - Alice labels Earth Observation data
* User Scenario 3 - Alice describes the labelled Earth Observation data
* User Scenario 4 - Alice discovers labelled Earth Observation data
* User Scenario 5 - Alice develops a new Machine Learning model
* User Scenario 6 - Alice starts a training job on a remote machine
* User Scenario 7 - Alice describes her trained machine learning model
* User Scenario 8 - Alice reuses an existing pre-trained model
* User Scenario 9 - Alice creates a training dataset
* User Scenario 10 - Eric discovers a model and consumes it

From these user scenarios, three Showcases were selected to develop and apply AI approaches in different contexts, in order to validate and verify the activities of the AI Extensions service:

* "Urban greenery" showcase: urban greenery mapping with EO data, specifically focusing on monitoring urban heat patterns and preventing flooding in urban areas.
* "Informal settlement" showcase: AI approaches in the context of urban management, specifically targeting the challenges posed by informal settlements.
* "Geohazards - volcanoes" showcase: AI approaches applied to EO data for monitoring and assessing volcanic hazards.

PROJECT STATUS

The first release of this project was critical in setting the foundation, as it focused on developing a cloud-based environment and related tools that enable users to work with EO data and data labels. With the successful completion of the second release, the user is now able to build and train ML models with EO data labels effectively. The project implementation of the User Scenarios focused on developing interactive Jupyter Notebooks that validate and verify all the key requirements of the activities performed in each Scenario. To date, Jupyter Notebooks for User Scenarios 1 - 5 have been developed and validated.
UPCOMING WORK

The project's future phases are eagerly anticipated: Release 3 will focus on enabling users to train their ML models on remote machines, while Release 4 will empower them to execute these models from the perspective of the stakeholder/end-user, Eric. This progression underscores a strategic roadmap towards making GEP and U-TEP powerful platforms for data analysis and interpretation using advanced AI techniques.

Dedicated articles will be published in the coming weeks, describing the activities and main outcomes of each Scenario / Notebook, so stay tuned!

Posted by simonevaccari (Terradue staff), May 3

AI/ML ENHANCEMENT PROJECT - EXPLORATORY DATA ANALYSIS USER SCENARIO

INTRODUCTION

Exploratory Data Analysis (EDA) is an essential step in the workflow of a data scientist or machine learning (ML) practitioner. The purpose of EDA is to analyse the data that will be used to train and evaluate ML models. This new capability, brought by the AI/ML Enhancement Project, will help users of the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) to better understand the dataset structure and its properties, discover missing values and possible outliers, and identify correlations and patterns between features that can be used to tailor and improve model performance.

This post presents User Scenario 1 of the AI/ML Enhancement Project, titled "Alice does Exploratory Data Analysis (EDA)". For this user scenario, an interactive Jupyter Notebook has been developed to guide an ML practitioner, such as Alice, through implementing EDA on her data. The Notebook first introduces the connectivity with a STAC catalog, interacting with the STAC API to search and access EO data and labels by defining specific query parameters (we will cover that in a dedicated article). Subsequently, the user loads the input dataframe and performs the EDA steps for understanding her data, such as data cleaning, correlation analysis, histogram plotting and feature engineering. Practical examples and commands demonstrate how simple it is to use this Notebook for this purpose.

[Figure: Notebook preview]

INPUT DATAFRAME

The input data consists of point data labelled with three classes, plus features extracted from Sentinel-2 reflectance bands. Three vegetation indices were also computed from selected bands. The pre-arranged dataframe was loaded using the pandas library. The dataframe is composed of 13 columns:

* column CLASSIFICATION: defines the land cover class of each label, with available classes VEGETATION, NOT_VEGETATED, WATER.
* columns with reflectance bands, extracted from the spectral bands of six Sentinel-2 scenes: coastal, red, green, blue, nir, nir08, nir09, swir16, and swir22.
* columns for vegetation indices, calculated from the reflectance bands: ndvi, ndwi1, and ndwi2.

A snapshot of how the dataframe can be loaded and displayed is shown below.

import pandas as pd

dataset = pd.read_pickle('./input/dataframe.pkl')
dataset

[Figure: dataframe preview]

This analysis focused on differentiating between "water" and "no-water" labels; therefore a pre-processing operation was performed on the dataframe to reclassify the VEGETATION and NOT_VEGETATED labels as "no-water".
This can be quickly achieved with the command below:

LABEL_NAME = 'water'
dataset[LABEL_NAME] = dataset['CLASSIFICATION'].apply(lambda x: 1 if x == 'WATER' else 0)

DATA CLEANING

After loading, the user can inspect the dataframe with the pandas function dataset.info(), which shows a quick overview of the data, such as the number of rows and columns and the data types. A further statistical analysis can then be performed for each feature with the function dataset.describe(), which extracts relevant information including count, mean, min & max, standard deviation, and the 25%, 50% and 75% percentiles.

dataset.info()
dataset.describe()

The user can quickly check whether null data is present in the dataframe. In general, if features with null values are identified, the user should either remove them from the dataframe, or convert or assign them to appropriate values, if known.

dataset[dataset.isnull().any(axis=1)]

CORRELATION ANALYSIS

The correlation analysis between "water" and "no-water" pixels for all features was performed with the pairplot() function of the seaborn library.

import seaborn as sns

sns.pairplot(dataset, hue=LABEL_NAME, kind='reg', palette="Set1")

This simple command generates pairwise bivariate distributions of all features in the dataset, where the diagonal plots represent univariate distributions. It displays the relationship for every (n, 2) combination of variables in the dataframe as a matrix of plots, as depicted in the figure below (with 'water' points shown in blue).

[Figure: pairwise correlation plots]

The correlation between variables can also be visually represented by the correlation matrix, computed with the pandas corr() method and rendered with seaborn (a minimal sketch is given below). Each cell in the matrix represents the correlation coefficient, which quantifies the degree to which two variables are linearly related. Values close to 1 (in yellow) and -1 (in dark blue) respectively represent positive and negative correlations, and values close to 0 represent no correlation. The matrix is highly customisable, with different formats and colour maps available.

[Figure: correlation matrix]
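The snippet that produces this matrix is not reproduced in the post; a minimal sketch, assuming the dataset dataframe from the previous examples (figure size and colour map illustrative), could be:

# Hedged sketch: correlation matrix via pandas corr(), rendered with seaborn
import matplotlib.pyplot as plt
import seaborn as sns

corr = dataset.select_dtypes(include='number').corr()
plt.figure(figsize=(14, 4))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='viridis', vmin=-1, vmax=1)
plt.title('Correlation matrix')
plt.show()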
DISTRIBUTION DENSITY HISTOGRAMS

Another good practice is to understand the distribution density of values for each column feature. The user can specifically target the distribution of selected features in relation to the corresponding "water" label, plot this over the histograms, and save the output figure to file.

import matplotlib.pyplot as plt

for i, c in enumerate(dataset.select_dtypes(include='number').columns):
    plt.subplot(4, 3, i + 1)
    # distplot is deprecated in recent seaborn; histplot/displot are the modern equivalents
    sns.distplot(dataset[c])
    plt.title('Distribution plot for field: ' + c)
# save the figure once all subplots are drawn
plt.savefig('./distribution_hist.png')

[Figure: distribution histograms]

OUTLIERS DETECTION

The statistical analysis and histogram plots provide an assessment of the data distribution of each feature. To analyse the data distribution further, it is advisable to conduct a dedicated analysis to detect possible outliers in the data. The Tukey IQR method identifies outliers as values lying more than 1.5 times the interquartile range beyond the quartiles: either below Q1 - 1.5 IQR, or above Q3 + 1.5 IQR.

An example of the Tukey IQR method applied to the NDVI index is shown below:

import numpy as np

def find_outliers_tukey(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x[outlier_indices])
    return outlier_indices, outlier_values

tukey_indices, tukey_values = find_outliers_tukey(dataset['ndvi'])

FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION

Feature engineering can be used when the available features are not sufficient for training an ML model, for example when only a small number of features is available, or when they are not representative enough. In such cases, feature engineering can increase and/or improve the representativeness of the dataframe. Here, the PolynomialFeatures function from the sklearn library was used to increase the overall number of features through an iterative combination of the available features. For an algorithm like a Random Forest, where explicit decisions are made on feature values, adding more features through feature engineering can provide a substantial improvement. Algorithms like convolutional neural networks, however, might not need this, since they can extract patterns directly from the data.

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(interaction_only=True, include_bias=False)
new_dataset = pd.DataFrame(poly.fit_transform(dataset))
new_dataset

On the other hand, Principal Component Analysis (PCA) is a technique that transforms a dataset with many features into fewer, principal components that best summarise the variance underlying the data. It can be used to extract the principal components from the features so they can be used in training. PCA is also offered as a function by the sklearn Python library.

from sklearn.decomposition import PCA

pca = PCA(n_components=len(new_dataset.columns))
X_pca = pd.DataFrame(pca.fit_transform(new_dataset))
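Although not shown in the post, a common follow-up is to inspect how much variance the leading components capture, to decide how many to keep; a quick check on the pca object fitted above:

import numpy as np

# Cumulative share of variance captured by the leading components
print(np.cumsum(pca.explained_variance_ratio_)[:5])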
CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project, with simple steps and commands an ML practitioner like Alice can take to analyse a dataframe in the preparatory step of an ML application lifecycle. Using this Jupyter Notebook, Alice can iteratively conduct the EDA steps to gain insights and analyse data patterns, calculate statistical summaries, generate histograms or scatter plots, understand correlations between features, and share results with her colleagues.

Useful links:

* The link to the Notebook for User Scenario 1 is: https://github.com/ai-extensions/notebooks/blob/main/scenario-1/s1-eda.ipynb. Note: access to this Notebook must be granted - please send an email to support@terradue.com with subject "Request Access to s1-eda" and body "Please provide access to Notebook for AI Extensions User Scenario 1";
* The user manual of the AI/ML Enhancement Project Platform is available at AI-Extensions Application Hub - User Manual.

Posted by simonevaccari (Terradue staff), May 23

AI/ML ENHANCEMENT PROJECT - LABELLING EO DATA USER SCENARIO 2

INTRODUCTION

Labelling data is a crucial step in the process of developing supervised Machine Learning (ML) models. It involves the critical task of assigning relevant labels or categories to different features within the data, such as land cover class (e.g. vegetation, water bodies, urban area) or other physical characteristics of the Earth's surface. These labels can be multi-class (e.g., forest, grassland, urban) or binary (e.g., water or non-water).

This post presents User Scenario 2 of the AI/ML Enhancement Project, titled "Alice labels Earth Observation (EO) data". It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in labelling EO data. For this User Scenario, an interactive Jupyter Notebook is used to guide an ML practitioner, such as Alice, through the following steps:

* create data labels, using the QGIS software or a Solara / Leafmap application
* load Labels and Sentinel-2 data using the STAC API
* sample Sentinel-2 data with Labels and create a dataframe
* validate the labelled data against the Global Surface Water (GSW) dataset
* use the dataframe to train an ML model based on a Random Forest classifier
* perform raster inference on a Sentinel-2 scene to generate a binary water mask

Practical examples and commands demonstrate how these new capabilities can be used from a Jupyter Notebook.

[Figure: Notebook preview]

LABELLING EO DATA

The process for creating vector (point or polygon) data layers is illustrated with two examples:

* QGIS software: a dedicated profile on the App Hub is configured for the user to run the QGIS software (more details can be found in the App Hub online User Manual). The steps to create new Shapefile layers, add classification types for each point / polygon, and save the output in GeoJSON format are illustrated with several screenshots.
* Solara / Leafmap application: an interactive map, built on Solara and Leafmap, has been integrated in the Notebook to give the user the option to manually create and save labels right from the Notebook itself.

After the annotations are created, either from QGIS or from the Solara / Leafmap interactive map, and saved into a .geojson file, the user can create the STAC Item of the EO labels and publish it on the STAC endpoint. This is done with the pystac Python library and an interactive form right in the Notebook.

LOAD LABELS AND EO DATA WITH STAC API

Access to Labels and EO data is provided by the pystac and pystac_client libraries. These libraries enable users to interact with a STAC catalog by defining specific query parameters, such as time range, area of interest, and data collection preferences. Subsequently, only the STAC Items that match the provided criteria are retrieved for the user. A simplified code snippet implementing a STAC data search and displaying the results on an interactive map is given below. An upcoming article dedicated to the STAC format and data access will provide more guidance and examples.

SEARCH DATA USING STAC API

# Import libraries
import pystac
from pystac_client import Client

# Access the STAC Catalog
cat = Client.open("https://ai-extensions-stac.terradue.com", ...)
# Define query parameters
start_date = "2023-06-01"
end_date = "2023-06-30"
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
cloud_cover = 30
tile = "10SFH"

# Search Labels by AOI, start/end date
query_sel = cat.search(
    collections=["ai-extensions-svv-dataset-labels"],
    datetime=(start_date, end_date),
    bbox=bbox,
)
labels = query_sel.item_collection()

# Search EO data (Sentinel-2) by AOI, start/end date, cloud cover and tile number
query_sel = cat.search(
    collections=["sentinel-2-l2a"],
    datetime=(start_date, end_date),
    bbox=bbox,
    query={"eo:cloud_cover": {"lt": cloud_cover}},
)
eo_item = [item for item in query_sel.item_collection() if tile in item.id][0]

PLOT LABELS AND EO DATA ON INTERACTIVE MAP

Once the Label data is loaded, it is converted into a dataframe (gdf) using the geopandas library. The Python library folium is then used to display both the Labels and the EO data on an interactive map.

import folium
from folium import GeoJson, LayerControl, plugins

# x, y: centre coordinates of the area of interest
map = folium.Map(location=[x, y], tiles="OpenStreetMap", zoom_start=9)

# Add Labels to map
map = addPoints2Map(gdf, map)

# Add footprint of EO scene
footprint_eo = folium.GeoJson(eo_item.geometry, style_function=lambda x: {...})
footprint_eo.add_to(map)

# Visualise map
map

[Figure: interactive map]

SAMPLE EO DATA WITH LABELS

After loading the data, the Notebook continues with the implementation of a function that iteratively samples the EO data at each labelled point. In addition to sampling a selection of the Sentinel-2 reflectance bands (coastal, red, green, blue, nir, nir08, nir09, swir16, and swir22), three vegetation indices are also calculated (ndvi, ndwi1, and ndwi2). After sampling the EO bands and calculating the vegetation indices, all the data is concatenated into a pandas DataFrame.

import pandas as pd

tmp_gdfs = []
for i, label_item in enumerate(eo_items):
    sampled_data = sample_data(
        label_item=label_item,
        common_bands=["coastal", "red", "green", "blue", "nir", "nir08", "nir09", "swir16", "swir22"],
    )
    tmp_gdfs.append(sampled_data)

# Create pandas dataframe
gdf_points = pd.concat(tmp_gdfs)

# Save to file
gdf_points.to_pickle("filename.pkl")

[Figure: dataframe preview]

VALIDATION AGAINST REFERENCE DATASET

A comparison against another, independent dataset was performed to demonstrate a validation approach for the labelled data. As the validation dataset we used the Global Surface Water (GSW) dataset generated by the JRC (citation: Pekel, Jean-François; Cottam, Andrew; Gorelick, Noel; Belward, Alan (2017): Global Surface Water Explorer dataset. European Commission, Joint Research Centre (JRC), http://data.europa.eu/89h/jrc-gswe-global-surface-water-explorer-v1). The comparison was performed by iterating through the generated labels dataframe and counting the number of points labelled as "water" that were also classified as water in the GSW dataset (i.e. with a pixel value higher than 80%).
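The Notebook's comparison loop is not reproduced in the post. The sketch below illustrates the idea under stated assumptions: gdf_points is the dataframe built above with a binary water column (1 = water), 'gsw_occurrence.tif' is a hypothetical local copy of the GSW occurrence layer, and the points and the raster share the same CRS.

# Hedged sketch of the GSW cross-check (not the Notebook's exact code)
import rasterio

with rasterio.open('gsw_occurrence.tif') as gsw:  # hypothetical filename
    coords = [(geom.x, geom.y) for geom in gdf_points.geometry]
    occurrence = [sample[0] for sample in gsw.sample(coords)]

water_total, water_confirmed = 0, 0
for is_water, occ in zip(gdf_points['water'] == 1, occurrence):
    if is_water:
        water_total += 1
        if occ > 80:  # GSW occurrence above 80% counts as water
            water_confirmed += 1

print(f'{water_confirmed}/{water_total} water labels confirmed by GSW')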
EO LABELLED DATA FOR SUPERVISED ML TASK

DATASET PREPARATION

The dataframe was prepared for the supervised ML task by converting it into a binary classification dataset (i.e. "water" and "no-water") and by removing unnecessary columns. Further, more detailed analysis of the dataframe can be performed through Exploratory Data Analysis (EDA); see the recently published article dedicated to EDA for more details and guidance. The dataset was then split into train and test sets with the dedicated function train_test_split() from the sklearn package.

from sklearn.model_selection import train_test_split

# columns used as features during training
feature_cols = ['coastal', 'red', 'green', 'blue', 'nir', 'nir08', 'nir09', 'swir16', 'swir22', 'ndvi', 'ndwi1', 'ndwi2']

# column name for label
LABEL_NAME = 'CLASSIFICATION'

features = train_dataset[feature_cols]  # cols for features
label = train_dataset[LABEL_NAME]  # col for labels

X_train, X_test, y_train, y_test = train_test_split(
    features,
    label,
    random_state=42,
    train_size=0.85,
)

ML MODEL

The ML model developed in this Notebook was a Random Forest classifier trained with k-fold cross-validation. Random Forest is a powerful and versatile supervised ML algorithm that grows and combines multiple decision trees to create a "forest"; it can be used for both classification and regression problems. K-fold cross-validation is a technique used in ML to assess the performance and generalisation ability of a model. The steps involved are:

1. Split the dataset into K subsets, or "folds".
2. Train the model K times, each time using K-1 folds for training and the remaining fold for validation.
3. Repeat this process K times, with each of the K folds used exactly once as the validation data.
4. Average the K results to produce a single estimation of model performance.

[Figure: k-fold cross-validation diagram]

The ML parameters are defined and used to train the model with a few simple functions, provided these are defined.

hyperparameters = {
    'n_estimators': 200,
    'criterion': 'gini',
    'max_depth': None,
    'min_samples_split': 2,
    'min_samples_leaf': 1,
    'min_weight_fraction_leaf': 0.0,
    'max_features': 'sqrt',
    'max_leaf_nodes': None,
    'min_impurity_decrease': 0.0,
    'bootstrap': True,
    'oob_score': False,
    'n_jobs': -1,
    'random_state': 42,
    'verbose': 0,
    'warm_start': True,
    'class_weight': None,
    'ccp_alpha': 0.0,
    'max_samples': None,
}

# instantiate the model object, which is defined in utils.py
model = Model(hyperparameters)

# train the model using k-fold cross validation
estimators = model.training(X=X_train, Y=y_train, folds=5)
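The Model class above lives in the Notebook's utils.py, which the post does not reproduce. For orientation, a roughly equivalent k-fold training with plain scikit-learn (same hyperparameters dictionary; X_train and y_train as created earlier) might look like this sketch:

# Hedged sketch: k-fold cross validation with a plain RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rf = RandomForestClassifier(**hyperparameters)
cv = cross_validate(rf, X_train, y_train, cv=5, scoring='accuracy', return_estimator=True)
print('Per-fold accuracy:', cv['test_score'])
print('Mean accuracy:', cv['test_score'].mean())
estimators = cv['estimator']  # one fitted model per fold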
MODEL EVALUATION

The model is evaluated on unseen data with the following evaluation metrics:

* Accuracy: the ratio of correctly predicted instances to the total number of instances in the dataset.
* Recall: also known as sensitivity or true positive rate, recall evaluates the ability of a classification model to correctly identify all relevant instances in a dataset.
* Precision: evaluates the accuracy of the positive predictions made by a classification model.
* F1-score: a metric that combines precision and recall into a single value. It is particularly useful when there is an uneven class distribution (imbalanced classes), and provides a balance between precision and recall.
* Confusion Matrix: provides a detailed breakdown of the model's performance, highlighting instances of correct and incorrect predictions.

The code snippet below shows how the model can be evaluated, followed by the output of the evaluation metrics calculated during the process.

# evaluate model
best_model = model.evaluation(estimators, X_test, y_test)

Other ways to evaluate the ML model are the distribution of the probability of predicted values, the Receiver Operating Characteristic (ROC) Curve, and the analysis of permutation feature importance. All three can be derived and plotted from within the Notebook with one simple line of code each.

# Distribution of probability of predicted values
ml_helper.distribution_of_predicted_val(best_model, X_train, X_test)

# ROC Curve
ml_helper.roc(best_model, X_test, y_test)

# Permutation Importance
ml_helper.p_importance(best_model, X_test, y_test, hyperparameters, MODEL_OUTPUT_DIR)

[Figure: prediction probability distribution, ROC curve and permutation importance plots]

Finally, the best ML model can be saved to a file so that it can be loaded and used in the future. The only prerequisite for applying the ML model is that the input dataset has the same format as the training dataset described above.

import joblib

# Save the model to file
model_fname = 'best_rf_model.joblib'
joblib.dump(best_model, model_fname)

RASTER INFERENCE

Now the user can apply the ML model to a Sentinel-2 image to generate a binary water mask. After loading the EO data and the ML model into the Notebook, the ML model is applied to make predictions over the entire input EO data. The steps to perform these operations are shown in the simplified code snippet below.

# Import libraries
import numpy as np
import rasterio
from rasterio.features import sieve

# Select EO assets from the loaded Sentinel-2 scene (eo_item)
fileList = {}
for f in eo_item.get_assets():
    if (f in feature_cols) or f == 'scl':
        fileList[f] = eo_item.get_assets()[f].href

# Load the ML model classifier
model = joblib.load(model_fname)

# Make predictions
predictions = ml_helper.readRastersToArray(model, fileList, feature_cols)

# Save predictions
df_predict = pd.DataFrame(predictions.ravel(), columns=['predictions'])
df_predict.to_pickle('prediction.pkl')

# Create binary mask
predictions = df_predict['predictions']
predictions = predictions.to_numpy().reshape((10980, 10980))

# Apply sieve operation to remove small features (in pixels)
my_array_uint8 = predictions.astype(rasterio.uint8)
sieved = sieve(my_array_uint8, threshold=1000, connectivity=8)

# Use Scene Classification band to filter out clouds and bad data
with rasterio.open(fileList['scl']) as scl_src:
    scl = scl_src.read(1)
scl = np.where(~np.isin(scl, [4, 5, 6, 7, 11]), np.nan, scl)
mask_out = np.where(~np.isnan(scl), sieved, np.nan)

# Plot the improved result
import matplotlib.pyplot as plt
plt.imshow(mask_out, interpolation='none')
plt.title("Improved result")

In the figure above, water bodies are plotted in yellow, non-water pixels in dark blue, and clouds are masked out in white (top-right corner of the image).

CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to help an ML practitioner:

* create EO data labels, using the QGIS software or a Solara / Leafmap application
* load Labels and EO data with the STAC API
* sample EO data with Labels and create a dataframe
* use the dataframe to train a Random Forest classifier
* perform raster inference on a selected Sentinel-2 scene to generate a binary water mask.

Useful links:

* The link to the Notebook for User Scenario 2 is: https://github.com/ai-extensions/notebooks/blob/main/scenario-2/s2-labellingEOdata.ipynb.
Note: access to the Notebook for User Scenario 2 must be granted - please send an email to support@terradue.com with subject "Request Access to s2-labellingEOdata" and body "Please provide access to Notebook for AI Extensions User Scenario 2"
* The user manual of the AI/ML Enhancement Project Platform is available at AI-Extensions Application Hub - User Manual
* Link to the project update article "AI/ML Enhancement Project - Progress Update"
* Link to User Scenario 1 article "AI/ML Enhancement Project - Exploratory Data Analysis"

Posted by simonevaccari (Terradue staff), May 29

AI/ML ENHANCEMENT PROJECT - DESCRIBING LABELLED EO DATA WITH STAC

INTRODUCTION

The SpatioTemporal Asset Catalog (STAC) format is crucial when it comes to describing spatio-temporal datasets, including labelled Earth Observation (EO) data. It makes it possible to describe the labelled EO data with standardised sets of metadata that delineate its key properties, such as spatial and temporal extents, resolution, and other pertinent characteristics. The use of STAC brings several benefits, including enhancing the reproducibility and transparency of the process and its results, as well as ensuring that the data becomes discoverable and accessible to other stakeholders (e.g. users, researchers, policymakers).

This post presents User Scenario 3 of the AI/ML Enhancement Project, titled "Alice describes the labelled EO data". It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in describing labelled EO data using the STAC format. To demonstrate these new capabilities, an interactive Jupyter Notebook is used to guide an ML practitioner, such as Alice, in the process of exploiting the STAC format to describe, publish, and search labelled EO data, including:

* Load labelled EO data (a .geojson file) and display it as a geopandas dataframe
* Show the labelled EO data on an interactive map
* Generate a STAC Item and add metadata to it
* Publish the STAC Item on a dedicated S3 bucket and on the STAC endpoint
* Search for the STAC Item using the STAC API and query parameters

Practical examples and commands demonstrate how these new capabilities can be used from a Jupyter Notebook.

[Figure: Notebook preview]

LOADING LABELLED EO DATA

A .geojson file of the labelled EO data was loaded into the Notebook and converted into a geopandas dataframe.

import geopandas as gpd
import geojson

fname = './input/label-S2A_10SFH_20230519_0_L2A.geojson'
with open(fname) as f:
    gj = geojson.load(f)

# Make a geodataframe out of the file
gdf = gpd.read_file(fname)
gdf

The Python library folium was then used to display the labelled EO data on an interactive map.

import numpy as np
import folium
from folium import GeoJson, LayerControl

# Get extent and centre of dataframe points
bbox = gdf.geometry.total_bounds
centerx, centery = (np.average(bbox[1::2]), np.average(bbox[::2]))

# Create map
map = folium.Map(location=[centerx, centery], tiles="OpenStreetMap", zoom_start=9)

# Add Labels to map
map = addPoints2Map(gdf, map)

# Add layer control
LayerControl().add_to(map)

# Visualise map
map

GENERATE STAC ITEM

Before creating the STAC Item, the user defines the geometry of the vector data represented by the dataframe.
# Get geometry of dataframe points
label_geom = geojson.Polygon([[
    (bbox[0], bbox[1]),
    (bbox[2], bbox[1]),
    (bbox[2], bbox[3]),
    (bbox[0], bbox[3]),
    (bbox[0], bbox[1]),
]])

The user can now create the STAC Item and populate it with relevant information, exploiting the pystac library.

from datetime import datetime
import pystac

# Create STAC Item
label_item = pystac.Item(
    id="<label_id>",
    geometry=label_geom,
    bbox=list(bbox),
    datetime=datetime.utcnow(),
    properties={},
)

The user defines a dictionary named label_classes to represent the classes for a classification task. The dictionary contains the class names for various land cover types, such as vegetation, water, clouds, shadows, and more. This mapping can be used to label and categorise data in a classification process. The user can then apply the label-specific STAC Extension with the defined label classes.

from pystac.extensions.label import LabelExtension, LabelType, LabelClasses
from pystac.extensions.version import ItemVersionExtension

# Define label classes
label_classes = {
    "name": "CLASSIFICATION",
    "classes": [
        "NO_DATA",
        "SATURATED_OR_DEFECTIVE",
        "CAST_SHADOWS",
        "CLOUD_SHADOWS",
        "VEGETATION",
        "NOT_VEGETATED",
        "WATER",
        "UNCLASSIFIED",
        "CLOUD_MEDIUM_PROBABILITY",
        "CLOUD_HIGH_PROBABILITY",
        "THIN_CIRRUS",
        "SNOW or ICE",
    ],
}

# Apply the label-specific STAC Extension "LabelExtension" with its related fields
label = LabelExtension.ext(label_item, add_if_missing=True)
label.apply(
    label_description="Land cover labels",
    label_type=LabelType.VECTOR,
    label_tasks=["segmentation", "regression"],
    label_classes=[LabelClasses(label_classes)],
    label_methods=["manual"],
    label_properties=["CLASSIFICATION"],
)

# Add geojson labels
label.add_geojson_labels(f"label-{label_id}.geojson")

# Add version
version = ItemVersionExtension(label_item)
version.apply(version="0.1", deprecated=False)
label_item.stac_extensions.extend(
    ["https://stac-extensions.github.io/version/v1.2.0/schema.json"]
)

In the end, the user validates the created STAC Item.

# Validate STAC Item
label_item.validate()
display(label_item)

[Figure: STAC Item]

PUBLISH THE STAC ITEM

The STAC endpoint and the STAC Collection in which to publish the STAC Item are first defined:

stac_endpoint = "https://ai-extensions-stac.terradue.com"
collection = read_file("input/collection/collection.json")

Subsequently, the STAC Item can be posted on a dedicated S3 bucket.

# Define filename and write locally
out_fname = f"item-label-{label_id}.json"
pystac.write_file(label_item, dest_href=out_fname)

# Define wrapper to write on S3 bucket
wrapper = StarsCopyWrapper()
exit_code, stdout, stderr = (
    wrapper.recursivity()
    .output(f"s3://ai-ext-bucket-dev/svv-dataset/{label_id}")
    .config_file("/etc/Stars/appsettings.json")
    .extract_archive(extract=False)
    .absolute_assets()
    .run(f"file://{os.getcwd()}/{out_fname}")
)

Once the STAC Item is posted on S3, it can be published on the dedicated STAC endpoint.

# Define customised StacIO class
StacIO.set_default(CustomStacIO)

# Read the catalog.json file posted on S3
catalog_url = f"s3://ai-ext-bucket-dev/svv-dataset/{label_id}/catalog.json"
catalog = read_url(catalog_url)

ingest_items(
    app_host=stac_endpoint,
    items=list(catalog.get_all_items()),
    collection=collection,
    headers=get_headers(),
)

FIND STAC ITEM ON STAC CATALOG

Once the STAC Item is successfully published on the STAC endpoint, it can be searched using the pystac and pystac_client libraries. These libraries enable users to interact with a STAC catalog by defining specific query parameters, such as time range, area of interest, and data collection preferences.
Subsequently, only the STAC Items that match the provided criteria are retrieved for the user.

# Import libraries
from datetime import datetime
import pystac
from pystac_client import Client

# Access the STAC Catalog
cat = Client.open(stac_endpoint, headers=get_headers(), ignore_conformance=True)

# Define query parameters
start_date = datetime.strptime("20230601", '%Y%m%d')
end_date = datetime.strptime("20230630", '%Y%m%d')
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
tile = "10SFH"

# Query by AOI, start and end date
query_sel = cat.search(
    collections=["ai-extensions-svv-dataset-labels"],
    datetime=(start_date, end_date),
    bbox=bbox,
)
item = [item for item in query_sel.item_collection() if tile in item.id][0]

# Display Item
display(item)

[Figure: searched STAC Item]

CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to help an ML practitioner exploit the STAC format to describe, publish, and search labelled EO data, including:

* Load labelled EO data (a .geojson file) and display it as a geopandas dataframe
* Show the labelled EO data on an interactive map
* Generate a STAC Item and add metadata to it
* Publish the STAC Item on a dedicated S3 bucket and on the STAC endpoint
* Search for the STAC Item with pystac, using the STAC API and query parameters

Useful links:

* The link to the Notebook for User Scenario 3 is: https://github.com/ai-extensions/notebooks/blob/main/scenario-3/s3-describingEOdata.ipynb
Note: access to this Notebook must be granted - please send an email to support@terradue.com with subject "Request Access to s3-describeEOdata" and body "Please provide access to Notebook for AI Extensions User Scenario 3"
* The user manual of the AI/ML Enhancement Project Platform is available at AI-Extensions Application Hub - User Manual
* Link to the project update article "AI/ML Enhancement Project - Progress Update"
* Link to User Scenario 1 article "AI/ML Enhancement Project - Exploratory Data Analysis"
* Link to User Scenario 2 article "AI/ML Enhancement Project - Labelling EO Data"

Posted by simonevaccari (Terradue staff), Jun 13

AI/ML ENHANCEMENT PROJECT - DISCOVERING LABELLED EO DATA WITH STAC

INTRODUCTION

The SpatioTemporal Asset Catalog (STAC) format is crucial when it comes to searching for and discovering spatio-temporal datasets, including labelled Earth Observation (EO) data. It allows filtering search results using STAC metadata as query parameters, such as spatial and temporal extents, resolution, and other properties. As well as ensuring that the data becomes discoverable and accessible to other stakeholders (e.g. users, researchers, policymakers), the use of STAC brings several other benefits, including enhancing the reproducibility and transparency of the process and its results.

This post presents User Scenario 4 of the AI/ML Enhancement Project, titled "Alice discovers the labelled EO data". It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in exploiting the STAC format to discover labelled EO data.
To demonstrate these new capabilities, an interactive Jupyter Notebook is used to guide an ML practitioner, such as Alice, in the process of exploiting the STAC format to discover labelled EO data, including:

* Understanding the STAC format
* Accessing STAC via the STAC Browser and the STAC API
* Connectivity with dedicated S3 storage

Practical examples and commands demonstrate how these new capabilities can be used from a Jupyter Notebook.

[Figure: Notebook preview]

UNDERSTANDING STAC

The SpatioTemporal Asset Catalog (STAC) specification was designed to establish a standard, unified language to talk about geospatial data, allowing it to be more easily searchable and queryable. By defining query parameters based on STAC metadata, such as spatial and temporal extents, resolution, and other properties, the user can narrow down a search to only those datasets that align with specific requirements. Four component specifications together make up the core STAC specification:

* STAC Item: the core unit, representing a single spatiotemporal asset as a GeoJSON feature with a datetime and links.
* STAC Catalog: a simple, flexible JSON file of links that provides a structure to organise and browse STAC Items.
* STAC Collection: an extension of the STAC Catalog with additional information, such as the extents, license, keywords and providers, that describes the STAC Items that fall within the Collection.
* STAC API: a RESTful endpoint that enables the search of STAC Items, specified in OpenAPI and following OGC's WFS 3.

A STAC Catalog is used to group STAC objects like Items, Collections, and/or even other Catalogs. Below are some commands of the pystac library that can be used to extract information from a STAC Catalog / Item / Collection.

import pystac
from pystac import Catalog

# Read STAC Catalog from file and explore high-level Catalog information
cat = Catalog.from_file(url)
cat.describe()

# Print some key metadata
print(f"ID: {cat.id}")
print(f"Title: {cat.title or 'N/A'}")
print(f"Description: {cat.description}")

# Access STAC child Catalogs and/or Collections
col = [col for col in cat.get_all_collections()]

# Explore STAC Item metadata
item = cat.get_item(id="<item_id>", recursive=True)

More information can be found in the official STAC documentation.

ACCESSING STAC VIA STAC BROWSER AND STAC API

There are two ways to discover STAC data: by using the STAC Browser or by using the STAC API.

ACCESSING USING STAC BROWSER

The STAC Browser provides a user-friendly graphical interface that facilitates the search and discovery of datasets. A few screenshots of the graphical interface are provided below. The dedicated STAC Browser app can be launched by the user at login with the option "STAC Browser for AI-Extensions STAC API". The STAC Catalog and Collections available on the App Hub project endpoint will be displayed.

[Figure: STAC Browser]

After selecting a specific collection, the query parameters can be specified manually with the dedicated widgets in the Filters section (temporal and spatial extents in this case).

[Figure: STAC Browser filters]

The search results are shown after clicking Submit. The example screenshot below shows a single STAC Item with its key metadata.

[Figure: STAC Browser results]

Despite its user-friendly interface, the STAC Browser is limited to manual interaction, which makes it difficult and time-consuming to perform multiple searches with different parameters, for example.
For this reason, the STAC Browser is primarily designed for manual exploration and is less suited to automated workflows.

ACCESSING USING STAC API

The STAC API allows for programmatic access to data, enabling the automation of data discovery, retrieval, and processing workflows. This is particularly useful for integrating STAC data into larger geospatial data processing pipelines or applications.

import os
import requests

# Define payload for token request
payload = {
    "client_id": "ai-extensions",
    "username": "ai-extensions-user",
    "password": os.environ.get("IAM_PASSWORD"),
    "grant_type": "password",
}
auth_url = 'https://iam-dev.terradue.com/realms/ai-extensions/protocol/openid-connect/token'
token = get_token(url=auth_url, **payload)
headers = {"Authorization": f"Bearer {token}"}

Once the authentication credentials are defined, the private STAC Catalog can be accessed and searched using specific query parameters, such as time range, area of interest, and data collection preferences. Subsequently, only the STAC Items that match the provided criteria are retrieved for the user. This can be achieved with the pystac and pystac_client libraries.

# Import libraries
from datetime import datetime
import pystac
from pystac_client import Client

# Define STAC endpoint and access the Catalog
stac_endpoint = "https://ai-extensions-stac.terradue.com"
cat = Client.open(stac_endpoint, headers=headers, ignore_conformance=True)

# Define query parameters
start_date = datetime.strptime("20230601", '%Y%m%d')
end_date = datetime.strptime("20230630", '%Y%m%d')
bbox = [-121.857043, 37.853934, -120.608968, 38.840424]
tile = "10SFH"

# Query by AOI, start and end date
query_sel = cat.search(
    collections=["ai-extensions-svv-dataset-labels"],
    datetime=(start_date, end_date),
    bbox=bbox,
)
item = [item for item in query_sel.item_collection() if tile in item.id][0]

# Display Item
display(item)

[Figure: STAC Item]

CONNECTIVITY WITH DEDICATED S3 STORAGE

Up to now the user has accessed the STAC endpoint to explore the Catalog and its Collections / Items. In this section we describe the process of accessing the data referenced in the Item's assets, which is stored in a dedicated S3 bucket. The AWS S3 configuration settings are defined in a .json file (e.g. appsettings.json), which is used to create a UserSettings object. This is used to create a configured S3 client to retrieve an object stored on S3, using the boto3 and botocore libraries.

# Import libraries
import os
import botocore, boto3
from urllib.parse import urlparse

# Define AWS S3 settings
settings = UserSettings("appsettings.json")
settings.set_s3_environment(<asset_s3_path>)

# Start botocore session
session = botocore.session.Session()

# Create client object
s3_client = session.create_client(
    service_name="s3",
    region_name=os.environ.get("AWS_REGION"),
    use_ssl=True,
    endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
)

# Parse the asset URL to get the bucket name and key
parsed = urlparse(geojson_url)
bucket = parsed.netloc
key = parsed.path[1:]

# Retrieve the object stored on S3
response = s3_client.get_object(Bucket=bucket, Key=key)

The user can then download the file stored on S3 locally, using the io library.

import io

geojson_content = io.BytesIO(response["Body"].read())
fname = './output/downloaded.geojson'

# Save the GeoJSON content to a local file
with open(fname, "wb") as file:
    file.write(geojson_content.getvalue())

The user can also import the downloaded data into the Notebook.
In this example, the downloaded .geojson file is loaded and converted into a geopandas dataframe.

import geopandas as gpd

# Make a geodataframe out of the downloaded .geojson file
gdf = gpd.read_file(fname)
gdf

CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to help an ML practitioner exploit the STAC format to discover labelled EO data, including:

* Understanding the STAC format
* Accessing STAC via the STAC Browser and the STAC API
* Connectivity with dedicated S3 storage

Useful links:

* The link to the Notebook for User Scenario 4 is: https://github.com/ai-extensions/notebooks/blob/develop/scenario-4/s4-discoveringLabelledEOData.ipynb
Note: access to this Notebook must be granted - please send an email to support@terradue.com with subject "Request Access to s4-discoveringLabelledEOData" and body "Please provide access to Notebook for AI Extensions User Scenario 4"
* The user manual of the AI/ML Enhancement Project Platform is available at AI-Extensions Application Hub - User Manual
* Project Update "AI/ML Enhancement Project - Progress Update"
* User Scenario 1 "AI/ML Enhancement Project - Exploratory Data Analysis"
* User Scenario 2 "AI/ML Enhancement Project - Labelling EO Data"
* User Scenario 3 "AI/ML Enhancement Project - Describing labelled EO data"

Posted by pmembari (Terradue staff), 8 days ago

AI/ML ENHANCEMENT PROJECT - DEVELOPING A NEW ML MODEL AND TRACKING WITH MLFLOW

INTRODUCTION

In this scenario, the ML practitioner Alice develops a Convolutional Neural Network (CNN) model for a classification task and employs MLflow to monitor the ML model development cycle. MLflow is a crucial tool that ensures effective log tracking and preserves key information, including specific code versions, datasets used, and model hyperparameters. By logging this information, the reproducibility of the work drastically increases, enabling users to revisit and replicate past experiments accurately. Moreover, quality metrics such as classification accuracy, loss function fluctuations, and inference time are also tracked, enabling easy comparison between different models.

This post presents User Scenario 5 of the AI/ML Enhancement Project, titled "Alice develops a new ML model". It demonstrates how the enhancements being deployed in the Geohazards Exploitation Platform (GEP) and Urban Thematic Exploitation Platform (U-TEP) will support users in developing a new ML model and in using MLflow to track experiments. These new capabilities are implemented in an interactive Jupyter Notebook that guides an ML practitioner, such as Alice, through the following steps:

* Data ingestion
* Design of the ML model architecture
* Training and fine-tuning of the ML model
* Evaluation of the ML model performance, with metrics such as accuracy, precision, recall, F1-score, and the confusion matrix
* Checking experiments with MLflow

These steps are outlined in the diagram below.

[Figure: workflow diagram]

Practical examples and commands demonstrate how these new capabilities can be used from a Jupyter Notebook.

[Figure: Notebook preview]

DATA INGESTION

The training data used for this scenario is the EuroSAT dataset. The EuroSAT dataset is based on ESA's Sentinel-2 data, covering 13 spectral bands and consisting of 10 classes with a total of 27,000 labelled and geo-referenced images.
A separate Notebook was generated to create a STAC Catalog, a STAC Collection, and STAC Items for the entire EuroSAT dataset, and to publish these to the STAC endpoint (https://ai-extensions-stac.terradue.com/collections/EUROSAT_2024_dataset). The data ingestion process was implemented with a DataIngestion class, configured with three main components:

* stac_loader: fetches the dataset from the STAC endpoint
* data_splitting: splits the dataset into train, test and validation sets with defined percentages
* data_downloader: downloads the data onto the local system.

ML MODEL ARCHITECTURE

In this section, the user defines a Convolutional Neural Network (CNN) model with six layers. The first layer serves as the input layer, accepting an image with a defined shape of (13, 64, 64) (i.e. the shape of the EuroSAT images in this case). The model is designed with four convolutional layers, each employing a relu activation function, a BatchNormalization layer, a 2D MaxPooling operation, and a Dropout layer. Subsequently, the model includes two Dense layers; a Softmax activation applied to the last Dense layer generates a vector with 10 cells containing the likelihood of each predicted class. The user defines a loss function and an optimizer, and the best model is compiled and saved locally at each epoch based on the improvement in the validation loss. The input parameters defining the ML model architecture are described in a params.yml file, which is used for the configuration process. The params.yml file defined for this test is shown below.

params.yml

BATCH_SIZE: 128
EPOCHS: 50
LEARNING_RATE: 0.001
DECAY: 0.1
### float
EPSILON: 0.0000001
MEMENTUM: 0.9
LOSS: categorical_crossentropy
# choose one of l1,l2,None
REGULIZER: None
OPTIMIZER: SGD

The configuration of the ML model architecture is run with a dedicated pipeline, such as the one defined below.

# pipeline
try:
    config = ConfigurationManager()
    prepare_base_model_config = config.get_prepare_base_model_config()
    prepare_base_model = PrepareBaseModel(config=prepare_base_model_config)
    prepare_base_model.base_model()
except Exception as e:
    raise e
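The layer definitions themselves are not included in the post. The following Keras sketch is one plausible reconstruction, written to be consistent with the summary reported below: channels-last input of shape (64, 64, 13), dropout rates illustrative, and the BatchNormalization layers mentioned above omitted since they do not appear in the reported summary.

# Hedged reconstruction of the described CNN (not the project's exact code)
from tensorflow.keras import Input, layers, models

model = models.Sequential([
    Input(shape=(64, 64, 13)),  # 13-band EuroSAT patches
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),  # rate illustrative
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),  # rate illustrative
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),  # rate illustrative
    layers.Dense(10, activation='softmax'),  # 10 EuroSAT classes
])
model.compile(optimizer='SGD', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()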
Model: "sequential" ___________________________________________________________________ Layer (type) Output Shape Param # =================================================================== conv2d (Conv2D) (None, 64, 64, 32) 3776 activation (Activation) (None, 64, 64, 32) 0 conv2d_1 (Conv2D) (None, 62, 62, 32) 9248 activation_1 (Activation) (None, 62, 62, 32) 0 max_pooling2d (MaxPooling2D) (None, 31, 31, 32) 0 dropout (Dropout) (None, 31, 31, 32) 0 conv2d_2 (Conv2D) (None, 31, 31, 64) 18496 activation_2 (Activation) (None, 31, 31, 64) 0 conv2d_3 (Conv2D) (None, 29, 29, 64) 36928 activation_3 (Activation) (None, 29, 29, 64) 0 max_pooling2d_1 (MaxPooling2D) (None, 14, 14, 64) 0 dropout_1 (Dropout) (None, 14, 14, 64) 0 flatten (Flatten) (None, 12544) 0 dense (Dense) (None, 512) 6423040 activation_4 (Activation) (None, 512) 0 dropout_2 (Dropout) (None, 512) 0 dense_1 (Dense) (None, 10) 5130 activation_5 (Activation) (None, 10) 0 =================================================================== Total params: 6,496,618 Trainable params: 6,496,618 Non-trainable params: 0 =================================================================== TRAINING AND FINE-TUNING The steps involved in the training phase are as follows: * Create the training entity * Create the configuration manager * Define the training component * Run the training pipeline As mentioned in the “Training Data Ingestion” chapter, the training data was split into train, test and validation sets in order to ensure that the model is trained effectively and its performance is evaluated accurately and without bias. The user trains the ML model on the train data set for a specific number of epochs, defined in the params.yml file, after each epoch the model is evaluated on the validation data to avoid overfitting. There are several approaches to address overfitting during training. One effective method is adding a regularizer to the model’s layers, which introduces a penalty term to the loss function to penalize larger weights. In the end, the test set, which is not used in any part of the training or validation process, is used to evaluate the final model’s performance. In order to assess the ML model’s performance and reliability, the user can plot the Loss and Accuracy curves of the Training and Validation sets. This can be done with the matplotlib library, as illustrated below. # Import library import matplotlib.pyplot as plt plt.figure(figsize=(12, 5)) # Plot Loss plt.subplot(1, 2, 1) plt.plot(history.history['loss'], label='Train Loss') plt.plot(history.history['val_loss'], label='Validation Loss') plt.xlabel('Epoch') plt.ylabel('Loss') plt.title('Training and Validation Loss') plt.legend() # Plot Accuracy plt.subplot(1, 2, 2) plt.plot(history.history['accuracy'], label='Train Accuracy') plt.plot(history.history['val_accuracy'], label='Validation Accuracy') plt.xlabel('Epoch') plt.ylabel('Accuracy') plt.title('Training and Validation Accuracy') plt.legend() plt.tight_layout() plt.show() s5-curves1189×490 44.6 KB EVALUATION The evaluation of the trained ML model was conducted on the test set. It is crucial for the user to prevent any data leakage between the train and test sets to ensure an independent and unbiased assessment of the training pipeline’s outcome. The model’s performance was measured using the following evaluation metrics: accuracy, recall, precision, F1-score, and the confusion matrix. 
In the end, the test set, which is not used in any part of the training or validation process, is used to evaluate the final model's performance.

To assess the ML model's performance and reliability, the user can plot the Loss and Accuracy curves of the Training and Validation sets. This can be done with the matplotlib library, as illustrated below.

# Import library
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))

# Plot Loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()

# Plot Accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

[Figure: training and validation loss and accuracy curves]

EVALUATION

The evaluation of the trained ML model was conducted on the test set. It is crucial for the user to prevent any data leakage between the train and test sets, to ensure an independent and unbiased assessment of the training pipeline's outcome. The model's performance was measured using the following evaluation metrics:

* Accuracy: the ratio of correctly predicted instances to the total number of instances in the dataset.
* Recall: also known as sensitivity or true positive rate, recall evaluates the ability of a classification model to correctly identify all relevant instances in a dataset.
* Precision: evaluates the accuracy of the positive predictions made by a classification model.
* F1-score: a metric that combines precision and recall into a single value. It is particularly useful when there is an uneven class distribution (imbalanced classes), and provides a balance between precision and recall.
* Confusion Matrix: provides a detailed breakdown of the model's performance, highlighting instances of correct and incorrect predictions.

The pipeline for generating the evaluation metrics was defined as follows:

try:
    config = ConfigurationManager()
    eval_config = config.get_evaluation_config()
    evaluation = Evaluation(eval_config)
    test_dataset, conf_mat = evaluation.evaluation()
    evaluation.log_into_mlflow()
except Exception as e:
    raise e

The confusion matrix can be easily plotted with the seaborn library.

# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Method of the evaluation class
def plot_confusion_matrix(self):
    class_names = np.unique(self.y_true)
    fig, ax = plt.subplots()

    # Create a heatmap
    sns.heatmap(
        self.matrix,
        annot=True,
        fmt="d",
        cmap="Blues",
        xticklabels=class_names,
        yticklabels=class_names,
    )

    # Add labels and title
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title('Confusion Matrix')

    # Show the plot and return the figure for logging
    plt.show()
    return fig

MLFLOW TRACKING

The training, fine-tuning, and evaluation processes are executed multiple times; each execution is referred to as a "run". Each run is generated by executing jobs with different combinations of the parameters specified in the params.yml file described in the ML Model Architecture section. The user monitors all executed runs during the training and evaluation phases using mlflow and its built-in tracking functionalities, as shown in the code below.

# Import libraries
import os
from urllib.parse import urlparse
import mlflow
import tensorflow

def log_into_mlflow(self):
    mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI"))
    tracking_url_type_store = urlparse(os.environ.get("MLFLOW_TRACKING_URI")).scheme
    confusion_matrix_figure = self.plot_confusion_matrix()

    with mlflow.start_run():
        mlflow.tensorflow.autolog()
        mlflow.log_params(self.config.all_params)
        mlflow.log_figure(confusion_matrix_figure, artifact_file="Confusion_Matrix.png")
        mlflow.log_metrics(
            {
                "loss": self.score[0],
                "test_accuracy": self.score[1],
                "test_precision": self.score[2],
                "test_recall": self.score[3],
            }
        )
        # Model registry does not work with the file store
        if tracking_url_type_store != "file":
            mlflow.tensorflow.log_model(self.model, "model", registered_model_name="CNN")

The MLflow dashboard allows for visual and interactive comparison of different runs, enabling the user to make informed decisions when selecting the best model. The user can access the MLflow dashboard by clicking on the dedicated icon in the user's App Hub dashboard.

[Figure: MLflow icon on the App Hub dashboard]

On the MLflow dashboard, the user can select the experiments to compare in the "Experiment" tab.

[Figure: selecting runs to compare in MLflow]

Subsequently, the user can select the specific parameters and metrics to include in the comparison from the "Visualizations" dropdown. The runs' behaviour and details, generated by the different evaluation metrics and parameters, are displayed.
[Figure: MLflow run comparison]

The comparison of the parameters and metrics is shown in the dedicated dropdowns.

[Figure: MLflow parameter comparison]
[Figure: MLflow metric comparison]

CONCLUSION

This work demonstrates the new functionalities brought by the AI/ML Enhancement Project to guide an ML practitioner through the development of a new ML model and the related tracking functionalities provided by MLflow, including:

* Data ingestion
* Design of the ML model architecture
* Training and fine-tuning of the ML model
* Evaluation of the ML model performance, with metrics such as accuracy, precision, recall, F1-score, and the confusion matrix
* Checking experiments with the MLflow dashboard and tools.

Useful links:

* The link to the Notebook for User Scenario 5 is: https://github.com/ai-extensions/notebooks/blob/main/scenario-5/trials/s5-newMLModel.ipynb
Note: access to this Notebook must be granted - please send an email to support@terradue.com with subject "Request Access to s5-newMLModel" and body "Please provide access to Notebook for AI Extensions User Scenario 5"
* The user manual of the AI/ML Enhancement Project Platform is available at AI-Extensions Application Hub - User Manual
* Project Update "AI/ML Enhancement Project - Progress Update"
* User Scenario 1 "AI/ML Enhancement Project - Exploratory Data Analysis"
* User Scenario 2 "AI/ML Enhancement Project - Labelling EO Data"
* User Scenario 3 "AI/ML Enhancement Project - Describing labelled EO data"
* User Scenario 4 "AI/ML Enhancement Project - Discovering labelled EO data with STAC"