Faculty of Science and TechnologyDepartment of Physics and Technology Scalable computing for Earth observation Application on Sea Ice analysis Salman Khaleghian A dissertation for the degree of Doctor of Philosophy (PhD), December 2022 This thesis document was typeset using the UiT Thesis LaTEX Template. © 2022 – http://github.com/egraff/uit-thesis “Work for a better life as if you live forever, and work for a better end as if you die tomorrow.” –Ali (pbuh) Abstract In recent years, Deep learning (DL) networks have shown considerable im- provements and have become a preferred methodology in many different applications. These networks have outperformed other classical techniques, particularly in large data settings. In Earth observation from the satellite field, for example, DL algorithms have demonstrated the ability to learn complicated nonlinear relationships in input data accurately. Thus, it contributed to ad- vancement in this field. However, the training process of these networks has heavy computational overheads. The reason is two-fold: The sizable complexity of these networks and the high number of training samples needed to learn all parameters comprising these architectures. Although the quantity of training data enhances the accuracy of the trained models in general, the computa- tional cost may restrict the amount of analysis that can be done. This issue is particularly critical in satellite remote sensing, where a myriad of satellites generate an enormous amount of data daily, and acquiring in-situ ground truth for building a large training dataset is a fundamental prerequisite. This dissertation considers various aspects of deep learning based sea ice monitoring from SAR data. In this application, labeling data is very costly and time-consuming. Also, in some cases, it is not even achievable due to challenges in establishing the required domain knowledge, specifically when it comes to monitoring Arctic Sea ice with Synthetic Aperture Radar (SAR), which is the application domain of this thesis. Because the Arctic is remote, has long dark seasons, and has a very dynamic weather system, the collection of reliable in- situ data is very demanding. In addition to the challenges of interpreting SAR data of sea ice, this issue makes SAR-based sea ice analysis with DL networks a complicated process. We propose novel DL methods to cope with the problems of scarce training data and address the computational cost of the training process. We analyze DL network capabilities based on self-designed architectures and learn strategies, such as transfer learning for sea ice classification. We also address the scarcity of training data by proposing a novel deep semi-supervised learning method based on SAR data for incorporating unlabeled data information into the training process. Finally, a new distributed DL method that can be used in a iv abstract semi-supervised manner is proposed to address the computational complexity of deep neural network training. Acknowledgements I would like to thank my supervisors, Andrea Marinoni, Einar Broch Johnsen, and Anders Andersen. Thank you for your time and energy, and your continuous motivation and encouragement during my PhD studies. It was a great oppor- tunity for me to learn from all of you during this multidiscipline PhD project. Andrea, thank you for your constant support in the course of PhD. I would also like to thank all members of the Earth observation group at UiT and the people at CIRFA. I learned many things from you through scientific and technical discussions. I would like to state my appreciation to CIRFA center leader, Torbjørn Eltoft, who founded this project and gave me this opportunity to research interesting areas. I will not forget your support during all the ups and downs of my PhD. Finally, I would like to state my heartfelt gratitude to my family for their unending support and presence, Thank you to my wife Shima, my daughters Kosar and Tasnim, my son Aliakbar and my parents even from far away. Contents Abstract iii Acknowledgements v List of Figures xi List of Tables xiii List of Abbreviations xv 1 Introduction 1 1.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.1 Deep Learning for Earth Observation . . . . . . . . . 4 1.1.2 Deep Learning Architectures Design . . . . . . . . . 5 1.1.3 Scarce Training Data . . . . . . . . . . . . . . . . . . 6 1.1.4 Computational Complexity . . . . . . . . . . . . . . 6 1.2 Distributed Deep Learning for Big Data Analysis . . . . . . . 7 1.2.1 Communication Overhead . . . . . . . . . . . . . . . 9 1.2.2 Earth Observation as a Big Data Problem . . . . . . . 9 1.3 Earth Observation . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.1 Thesis Application: Sea Ice Analysis . . . . . . . . . . 13 1.3.2 Synthetic Aperture Radar (SAR) . . . . . . . . . . . 13 1.3.3 Classification Problems . . . . . . . . . . . . . . . . 14 1.3.4 Exploitation: Operational Ice Charting . . . . . . . . 15 1.4 Objectives of this Thesis . . . . . . . . . . . . . . . . . . . . 16 1.4.1 Thesis Outline . . . . . . . . . . . . . . . . . . . . . 17 2 Synthetic Aperture Radar Imagery for Sea Ice Classification 19 2.1 Spaceborne SAR System Properties . . . . . . . . . . . . . . 21 2.2 Sentinel-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 Sea Ice Classes in SAR Images . . . . . . . . . . . . . . . . . 24 2.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3 Deep Neural Networks Learning Approaches 31 vii viii contents 3.1 Learning Approaches . . . . . . . . . . . . . . . . . . . . . 33 3.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . 35 3.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . 37 3.3.1 Semi-supervised GANs . . . . . . . . . . . . . . . . . 39 3.3.2 Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 39 3.3.3 Disentangling Representations . . . . . . . . . . . . 40 3.3.4 Teacher-Student Models . . . . . . . . . . . . . . . . 40 3.3.5 Self-training . . . . . . . . . . . . . . . . . . . . . . 41 3.3.6 Deep Transductive Learning . . . . . . . . . . . . . . 41 3.4 Thesis Approach: DL and Sea Ice Classification . . . . . . . . 41 4 Distributed Deep learning for Scalable Computing 45 4.1 Type of Parallelism . . . . . . . . . . . . . . . . . . . . . . . 46 4.1.1 Model parallelism . . . . . . . . . . . . . . . . . . . 46 4.1.2 Data Parallelism . . . . . . . . . . . . . . . . . . . . 47 4.2 Type of Aggregation . . . . . . . . . . . . . . . . . . . . . . 48 4.2.1 Allreduce Algorithm . . . . . . . . . . . . . . . . . . 49 4.3 Type of Concurrency . . . . . . . . . . . . . . . . . . . . . . 50 4.3.1 Concurrency in Network . . . . . . . . . . . . . . . . 50 4.3.2 Concurrency in Training . . . . . . . . . . . . . . . . 50 4.4 Type of Communication . . . . . . . . . . . . . . . . . . . . 51 4.4.1 Synchronous . . . . . . . . . . . . . . . . . . . . . . 51 4.4.2 Asynchronous . . . . . . . . . . . . . . . . . . . . . 51 4.5 Communication Compression . . . . . . . . . . . . . . . . . 52 4.5.1 Quantization . . . . . . . . . . . . . . . . . . . . . . 52 4.5.2 Sparsification . . . . . . . . . . . . . . . . . . . . . 53 4.6 Communication Overhead Issue . . . . . . . . . . . . . . . . 53 4.7 Thesis Approach: Distributed Deep Learning . . . . . . . . . 53 5 Overview of Publications 55 5.1 Paper Summaries . . . . . . . . . . . . . . . . . . . . . . . 55 5.1.1 Paper 1 . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.1.2 Paper 2 . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.1.3 Paper 3 . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2 Other Contributions . . . . . . . . . . . . . . . . . . . . . . 59 6 Paper 1: Sea Ice Classification of SAR Imagery Based on Con- volution Neural Networks 63 7 Paper 2: Deep Semi-supervised Teacher–Student Model Based on Label Propagation for Sea Ice Classification 85 8 Paper 3: AFSD- Adaptive Feature Space Distillation for Dis- tributed Deep Learning 99 contents ix 9 Conclusions and Future Works 111 9.1 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Bibliography 113 List of Figures 1.1 Dissertation Research areas. . . . . . . . . . . . . . . . . . . 3 1.2 Deep learning performs better in comparison of other meth- ods when available data is increased [3]. . . . . . . . . . . . 4 1.3 Example of learned features in each layer for a object detec- tion task (from Yann LeCun, 2015). . . . . . . . . . . . . . . 5 1.4 Computational complexity. The Top-1 accuracy versus the floating- point operations (FLOPs) needed for a single forward pass. The size of each ball depends model complexity [23]. . . . . 7 1.5 Parallel Architectures in Deep Learning using different hard- ware acceleration and single or multi nodes systems [24]. . . 8 1.6 Five characteristics of a big data problem [28]. . . . . . . . 11 1.7 SAR vs Optical Earth observation. Left is a Sentinel-1 SAR scene and right is a optical image by Sentinel-2 from same location with small time gap [50]. . . . . . . . . . . . . . . 14 1.8 Example of an ice chart produced by the Ice Service of the Norwegian Meteorological Institute (NIS). The polygons de- pict different ice concentration zones. The ice charts also de- pict the ice edge to indicate open water which is important for ship navigation. Detailed information can be found in [51]. 16 2.1 A geometric model for a SAR system. . . . . . . . . . . . . . 20 2.2 Illustration of Sentinel-1 acquisition modes [66]. . . . . . . 24 2.3 Ice types in optical observation. . . . . . . . . . . . . . . . . 25 2.4 Sentinel-1 HH+HV SAR image, (a), versus Sea Ice in optical image, (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.5 The difference between Sentinel-1 IW-mode photos (HH po- larisation) for a) pre-melt sea ice (17 April 2018) and b) full- melt sea ice (17 April 2018) is shown in the figure (16 June 2018). Due to the fact that this is a region of fast ice, Belgica Bank off the coast of north-east Greenland, the sea ice and icebergs are identical in both photographs (courtesy of Nick Hughes, the Norwegian Meteorological Institute). . . . . . . 27 2.6 Two HH-polarization sentinel-1 SAR from the marginal ice zone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 xi xii l ist of figures 2.7 An example of a noise HV SAR image, with corresponding noise floor profile on the right. . . . . . . . . . . . . . . . . 28 3.1 Deep Learning is a machine learning technique [9]. . . . . . 32 3.2 Deep Learning vs Traditional machine learning. . . . . . . . 33 3.3 Basic example of binary classification considering supervised, unsupervised, and semi-supervised learning. We have two classes, blue circles, and red triangles. Unlabeled data belonging to each class are shown in black circles and triangles. (a) refers to supervised learning in which all training samples are la- beled. (b) is unsupervised learning in the absence of labeled training data. (c) is semi-supervised learning, in which some training samples contain labels and some do not [74]. . . . . 34 3.4 A general CNN architecture for image analysis, here for char- acter recognition [78]. . . . . . . . . . . . . . . . . . . . . 35 3.5 An overview of the landscape of semi-supervised methods [16]. 39 4.1 Model parallelism. It can achieve through vertical or horizon- tal splits [97]. . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Data parallelism [25]. . . . . . . . . . . . . . . . . . . . . . 47 4.3 Centralized approach. . . . . . . . . . . . . . . . . . . . . . 48 4.4 Illustration of Ring Allreduce. . . . . . . . . . . . . . . . . . 49 4.5 synchronous and asynchronous in centralized data parallelism distributed deep learning. . . . . . . . . . . . . . . . . . . . 52 5.1 Paper 1 considered deep learning approaches. . . . . . . . . 56 5.2 Overall architecture of proposed method in paper 2. . . . . . 57 5.3 Overall architecture of proposed method in paper 3. . . . . . 59 List of Tables 2.1 Different water and ice classes. . . . . . . . . . . . . . . . . 25 4.1 Communication time versus training time in the parameter server [132]. . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2 Training time for different methods using 60 labeled data and 10000 unlabeled data on the UiT training dataset [134] on a single GPU (Quadro RTX 5000 16GB). . . . . . . . . . . . . 54 xiii List of Abbreviations CNN Convolutional Neural Network DDL Distributed Deep Learning DL Deep Learning DNN Deep Neural Network EM Electromagnetic EO Earth Observation ESA European Space Agency EW Extra Wide Swath FC Fully Connected GANs Generative Adversarial Networks IA Incident Angle IR Infrared ISERV Station SERVIR Environmental Research and Visualization System IW Interferometric Wide Swath LULC Land cover/land use mapping:and-use and land cover MIR Mid-infrared NASA National Aeronautics and Space Administration xv xvi l ist of abbreviat ions NIR Near-infrared NIS Norwegian Meteorological Institute RGB Red, Green, Blue RS Remote Sensing SAR Synthetic Aperture Radar SGD Stochastic Gradient Descent SM Stripmap SOD Stage of Development SRT Shuttle Radar Topography SRTM Shuttle Radar Topography Mission SSL Semi-Supervised Learning TIR Thermal Infrared UiT University of Tromsø VAE Variational Autoencoders WHO World Health Organization WMO World Meteorological Organization WV Wave 1 Introduction In the last decade, Deep Neural Networks (dnns) have shown remarkable performance in tackling various challenging machine learning problems [1]. Practically, a dnn may have hundreds of layers and millions of parameters. It has been demonstrated that deep neural networks outperform alternative methods, especially in big data problems [1]. Despite this, training a deep network architecture is a computationally expensive task. The reason for this costly computation is the high amount of data points to be handled and the complexity of the network structures [2]. In general, a larger training data set will improve the accuracy of the trained models. However, the computing cost of the training data may limit the amount of analysis that can be performed [3]. dnns have been employed to address many Remote Sensing (rs) and Earth Observation (eo) challenges, and they have shown great success in solving a variety of satellite-based rs image analysis tasks, including land cover classi- fication, object detection, and change detection [4]. Satellite images, which constitute a significant data source for Earth observation, allow us to measure and observe intricate features on the surface of the Earth. The amount of satellite images is rapidly increasing as a result of the development in space- borne Earth observation technologies [5]. It is no accident that this field is now referred to as big data. The National Aeronautics and Space Administration (nasa)’s Landsat [6], and European Space Agency (esa)’s Copernicus [7], respectively, offer high revisiting frequency data and data with large spectral- spatial coverage, allowing for near-real-time worldwide surveillance of the 1 2 chapter 1 introduction Earth surface. Indeed, Copernicus is presently the world’s biggest single eo program, with its fleet of Sentinel spacecraft. To conduct large-scale, high-frequency monitoring of the Earth using deep learning architectures, we need scalable computing to train the models using a substantial quantity of labeled data [8]. However, these massive amounts of training data do not always exist, which is the case for the application focused in this thesis. In this dissertation, we consider various aspects of the deep learning-based analysis of Arctic Sea ice from satellite-based synthetic aperture radar (SAR) data. This application is challenging. The Arctic is remote, has long dark seasons and a very dynamic weather system, and the collection of reliable in-situ data is very demanding. In addition, SAR images of sea ice are inherently difficult to interpret and require extensive time and resources. Scarce training data is exacerbated when the trained model is to examine the dynamic sea ice in an Arctic-wide setting, dealing with a variety of seasonal and meteorological circumstances. This thesis investigates three main topics in deep learning–based sea ice monitoring from SAR (Figure 1.1): deep learning architecture design, semi- supervised learning to cope with the training data, distributed systems, and high-performance computing. • Deep learning architectures design: Deep neural networks are a holis- tic learning architecture for feature extraction and classification. We consider how different deep learning architectures cope with the sea ice classification task. • Semi-supervised learning:We investigate several methods of label prop- agation and advance deep semi-supervised learning approaches to ad- dress the scarce training data problem in sea ice classification. • Scalable computing: Distributed deep learning is considered to address computation complexity along with specific challenges when deep neural networks are trained on big Earth observation data. In the analysis, we specifically consider the challenges associated with the properties of sar data like scattering ambiguities, type-dependent incidence angle slopes, and the annoying additive noise pattern of Sentinel-1 SAR data, which are the principal data used in this study. 1.1 deep learning 3 Figure 1.1: Dissertation Research areas. This section briefly discusses the benefits of deep learning and its advantage for big data analysis, particularly for Earth observation applications. We describe the distributed deep learning setup and explain how this provides a scalable computing framework for deep learning analysis. Then, some major Earth ob- servation applications are briefly listed to contextualize our chosen application. We explain our primary application, namely sea ice analysis from SAR, and add some perspectives on how the results may be exploited in operational sea ice charting. Finally, the thesis’s objectives and structure are outlined. 1.1 Deep Learning Deep learning models as feature learning hierarchies extract multiple layers of non-linear features and feed them to a classifier that integrates all the features to produce predictions. Hierarchically,dl algorithms directly learn the representative and discriminative features from the data. It is different from manual feature learning, which performs manual selection and extraction of features for each task. Features are automatically learned in a deep learning method to optimize the model’s performance [9]. Thanks to dl theory [9], unsupervised feature learning from immense raw- image data sets has become conceivable. This unsupervised feature learning provides an alternative technique that autonomously learns practical features from the training set [10]. More specifically, when a large amount of data is available, it has been shown that deep neural networks perform better than other learning approaches [3], as we see in Figure 1.2. 4 chapter 1 introduction Figure 1.2: Deep learning performs better in comparison of other methods when available data is increased [3]. For example, Convolutional Neural Network (Convolutional Neural Network (cnn)) [9] is a well-known deep learning model that has been used to solve several computer vision tasks. A cnn employs convolutional layers to ex- tract valuable information from the inputs. These convolutional layers contain learned parameters, allowing the filters to automatically extract the most valu- able information for the task [9]. More details ondl andcnns will be discussed in Chapter 3. Figure 1.3 shows a traffic sign image filtered by four convolutional kernels, which create four feature maps; These feature maps are sub-sampled by max pooling. The next layer applies ten convolutional kernels to these sub- sampled images. The final layer is a fc layers where all generated features are combined and used in the classifier. 1.1.1 Deep Learning for Earth Observation dl has been utilized in Remote Sensing (rs) and Earth Observation (eo) for tasks ranging from image preprocessing, pixel-based classification, patch-based classification, and target recognition to high-level semantic feature extraction and rs scene interpretation [11]. In fact, deep learning is a novel and fasci- nating method that potentially can be the next step in the evolution of Earth observation and remote sensing image processing [11]. Extracting useful information from diverse forms of remote sensing data, and coping with ever-increasing data types and volumes, is a significant problem in 1.1 deep learning 5 Figure 1.3: Example of learned features in each layer for a object detection task (from Yann LeCun, 2015). Earth observation analysis [11]. Traditional techniques use feature engineering from rs images to build extracted and selected features to feed to different classification/regression models. Handcrafted features are shown to be suc- cessful in representing several spectral, textural, and geometrical properties of images [12, 13]. However, because these features cannot readily reflect the complexities of the actual data statistics, they cannot attain an ideal balance between discriminability and resilience. The dilemma is exacerbated when dealing with a large amount of remote sensing image data since imaging conditions fluctuate rapidly and may change dramatically in a short period [11]. 1.1.2 Deep Learning Architectures Design Different deep learning network architectures specially have been proposed for addressing computer vision issues. [14]. In computer vision, dl architectures have been made to work best for specific object detection and recognition by optimizing number layers, number filters, and many network hyperparameters. Optimizing the best architecture for new tasks, like rs and eo problems, is challenging. This issue is even more significant for more specific uses, such as SAR-based monitoring of sea ice in the Arctic. When looking into dnn architectures, we can usually consider two main ways to deal with this problem. The first approach is to analyze the problem by making a model of a custom or ad hoc architecture. An ad hoc architecture is interesting because it is very flexible. However, it usually needs to have a lot of hyper-parameters optimized. The second approach relies on the use of an existing architecture that can either be fine-tuned by already trained 6 chapter 1 introduction parameters or trained from scratch. This approach reduces the time needed for the design of the deep learning architecture. For instance, we have employed this strategy in Chapter 6. 1.1.3 Scarce Training Data Deep neural networks have been employed in many rs and eo data analysis challenges and showed the promising results [15]. They are well-known for their high efficiency and test-time performance. The disadvantage is that a significant number of training samples must be available to train the models. Also, these samples are in most cases labeled by humans. This issue is more serious in the case of the rs and eo domains since acquiring in-situ ground truth are very expensive, time-demanding, and often impossible. This issue becomes more significant in Arctic applications, where the volume of validated labeled data is typically low. The remoteness, long dark seasons, and exceedingly variable weather conditions make it difficult to acquire ground truth in these areas. The scarcity of training data is well-known in the machine learning and deep learning domains and is recognized as a significant issue in big data applica- tions [16]. New advanced approaches in deep semi-supervised learning, deep unsupervised learning, and deep self-supervised learning have been proposed to overcome this issue [17, 16]. More details on semi-supervised learning, one of our focus areas, will be discussed in Chapter 3. 1.1.4 Computational Complexity In dl, increasing the quantity of training datasets often improves model per- formance (e.g., classification accuracy) [18, 19]. Nonetheless, as data amount and model complexity rise, the training process of dl is computationally costly and time-consuming. For instance, training a state-of-the-art ResNet-50 [20] model (in 90 epochs) on the ImageNet dataset [21] using the most recent Nvidia Tesla V100 GPU takes around two days [22]. In general, one must tweak the hyperparameters for certain task, which takes a great deal of effort and is necessary to get acceptable performance. Figure 1.4 quantifies the computing requirements of frequently used deep learning models [23]. As shown in the graph, the highest performing trained architectures are those with very high computational complexity (such as NASNet-A-Large), which are located at the far right of the graph. In addition to this, it should be noted that they are not the ones with the most model complexity (as is evidenced by the size of the bubble). Due to the high compu- 1.2 distributed deep learning for big data analysis 7 Figure 1.4: Computational complexity. The Top-1 accuracy versus the floating-point operations (FLOPs) needed for a single forward pass. The size of each ball depends model complexity [23]. tational cost of iterative dl training across a large quantity of data, substantial computer resources are required. As a result, single machines are not always capable of performing this job in the time provided. 1.2 Distributed Deep Learning for Big Data Analysis The ascent in popularity of dnns is closely tied to the amount of accessible processing power, which has made it possible to harness the fruits of the intrinsic parallelism of these networks [24]. Deep learning’s computational 8 chapter 1 introduction Figure 1.5: Parallel Architectures in Deep Learning using different hardware accelera- tion and single or multi nodes systems [24]. intensity and memory requirements rise in direct proportion to the size of the available datasets and the level of dnns complexity. Training a dnn to an accuracy that is competitive in today’s market is almost impossible without a high-performance computer cluster [25, 24]. Different components of training and inference (evaluation) of dnns are adjusted to boost concurrency to use such systems. In modern computer architectures, parallelism can be found both internally on the chip in the form of pipelining and out-of-order execution, as well as in the form of multi-core or multi-socket systems. Multi-core computers may be designed with either multiple processes, which use distinct memory domains, or multiple threads, which use sharedmemory domains. Alternatively, a combination of the two is also possible. The primary distinction between these two lies in the fact that multi-process parallel programming requires the programmer to think about the distribution of the data as a first-class concern, whereas multi-threaded programming only requires the programmer to focus about the parallelism and allows the hardware system to handle the data shuffling (typically through hardware cache-coherence protocols). The process of training large-scale models requires a significant amount of computational resources. Therefore, single machines are not always capable of completing this work within the allotted amount of time. The computation might be split up and carried out simultaneously on several different computers all linked together through a network. Distributed deep learning by leveraging the computational resources of multiple devices (e.g., multiple GPUs) [24] is used to accelerate the training process of dnns when working with a large amount of data. Figure 1.5 provides a summary of the machine architectures that have been employed in technical literature. There is a discernible shift toward GPUs, which are the focus of the majority of articles starting from 2013. Despite 1.2 distributed deep learning for big data analysis 9 this, even the most accelerated nodes are not enough to handle the massive computational demand. Figure 1.5 illustrates how the multi-node parallelism in those activities is rapidly expanding. Different approaches have been proposed to train deep learning models on multi-GPUs in distributed environments. More details on different approaches will be discussed in Chapter 4. 1.2.1 Communication Overhead Latency, bandwidth, and message rate are the three most significant metrics for the interconnection network [24]. There are many performance measures available across the various network technologies. InfiniBand, for instance, provides much shorter latencies and more message rates than current Ethernet, even though both provide enormous capacity [26]. Interconnection networks designed specifically for high-performance computing may yield better results in all three performance criteria. However, communication via a network is often slower than the communication that occurs inside a single computer. In distributed deep learning, regardless of networking technology, the communi- cation overhead directly affects the capability of distributed training [24]. This effect is because a large number of model parameters (full or partial) should be sent over the network in each training iteration. This issue will be discussed in more detail in Chapter 4. 1.2.2 Earth Observation as a Big Data Problem International space agencies, like ESA and NASA, adhere to an open data policy and make available an enormous amount of multi-sensor data for free daily. Because of the rapid technological advancement that has been incorporated into Remote Sensing (rs) optical and microwave sensor technologies [5], the systems have made significant strides forward in the last several decades. In this sense, it is not a coincidence that remote sensing data are now being described using the big data terminology, which includes characteristics such as volume, velocity, variety, veracity, and value [27, 8]. Copernicus is the European Union’s program [7] for Environmental monitoring. It is made up of a collection of systems that receive data from satellites and in- situ sensors, process and interprets this data, and then offer users accurate and up-to-date information on a variety of environmental and security concerns. The Sentinel satellites, in orbit for the particular purposes of the Copernicus program, and other contributing satellite missions run by national or interna- tional organizations, support Copernicus with data. Access to Sentinel data is governed by EU legislation and is complete, open, and unrestricted. Copernicus information is made accessible to consumers via Copernicus services covering 10 chapter 1 introduction six theme areas: land, marine, atmosphere, climate, emergency, and security. The Copernicus program is a vanguard of the Big Data paradigm resulting from the data and information processed and distributed. It also gives rise to the so-called five V’s paradigm, briefly discussed below [5]: • Volume: The European Space Agency’s (ESA) Sentinel product repository has published over 5 million products to date, It has over 100000 users who have downloaded more than 50 PB of data since the system’s incep- tion. As additional Sentinel satellites are being launched, this volume will grow in the coming years. • Velocity: The Copernicus data must be sent and processed quickly to provide 24/7 services to users that want immediate information. By the end of 2016, six TB of data had been generated, with 100 TB of data being broadcasted daily from the Sentinel product repository. As additional Sentinel satellites are being launched, these rates will rise in the following years. • Variety: The Sentinel satellites have various types of sensors (e.g., radars, optical instruments), offering data products at multiple processing lev- els (from raw data to advanced products). Furthermore, in addition to satellite data (e.g., public government data), datasets utilized for geospa- tial applications might include aerial images, in-situ data, and other collateral information. To extract information and knowledge eo actors process this data. The information data is similarly sizable and faces the same Big Data issues mentioned above. For example, 1PB of Sentinel data may consist of around 750000 datasets which, when processed, can yield approximately 450TB of information and knowledge contents (e.g., classifications of items observed). • Veracity: Reliable information is required for decision-making in opera- tions. As a result, verifying data quality is critical for the entire informa- tion extraction chain. • Value: The extraction of information from Copernicus data directly im- proves Europe’s economy. Several economic assessments have found that the Copernicus initiative can substantially influence job generation, inno- vation, and growth. According to the Copernicus Market Report 2016, the overall investment in Copernicus will reach EUR 7.4 billion between 2008 and 2020, with a collective commercial benefit of roughly EUR 13.5 billion created during the same period; it also will provide 28.030 employment years in the eo industry. For these reasons, eo problems represent a good platform to design, develop 1.3 earth observation 11 Figure 1.6: Five characteristics of a big data problem [28]. and test distributed deep learning architectures. Indeed, it is possible to expect that distributed deep learning frameworks can provide a remarkable added value to the implementation of eo analysis pipeline, so to substantially sup- port the understanding of human-environment interactions by retrieving solid information from EO data analysis. 1.3 Earth Observation The science of remote sensing (rs) is the ability to obtain information without physically touching an object or surface. In this process, the reflected or emitted radiation of the remote object or surface is observed and measured. Based on these measurements, the object and materials are identified and categorized by class or type, essence, and spatial characteristics. The Earth’s physical, chemical, and biological systems can be measured and mapped from various remote sensing platforms, including satellites and aircraft, This technology is known as Earth Observation (eo) [29]. Plenty of phenomena, such as climate change, disasters, disease outbreaks, ship navigation, and fire and smoke observation, can be studied using Earth observation data [30]. The amount of eo data is steadily increasing due to the rapid development of this technology and the continuous launch of more satellites, this is one of the main reasons for which the eo domain has become a significant big data application area. [31]. The developments in the eo domain are now driven by various application areas and environmental and climate problems threatening our planet such 12 chapter 1 introduction as: • Mapping land cover and land use: Land-use and land cover (lulc) map help assess climate change effects on hydrology, biodiversity, carbon dynamics, population, migration, and urbanization. This lulcmaps can effectively be used for mobilizing decision-makers, industry, farmers, and the general public toward more sustainable use of resources [32, 33]. • Carbon biomass assessment: Forest biomass is a key in estimating car- bon sequestration. the Forests sequester carbon in part by accumulating biomass since approximately half of the forests’ dry biomass is carbon [34]. In carbon biomass assessment, Landsat Imagery, Shuttle Radar Topography Mission (srt) and International Space Station SERVIR En- vironmental Research and Visualization System (iserv) are examples of remote sensing platforms which provide principal images of the Earth [35]. • Agriculture and food security: In developing countries with subsistence farming, food security significantly impacts agriculture. Defined by the World Health Organization (WHO) is, food security characterized by three factors; (1) access to sufficient resources for obtaining a nutritious diet, (2) knowledge of primary nutrition, and (3) adequate water and sanitation. Satellite remote sensing is a prominent tool for acquiring food security [36]. • Disaster management: Global climate change severely impacts the already marginalized areas of the Earth, which are more susceptible to unpredictable weather patterns, floods, droughts, and rising sea levels. There is a need for chief short, medium, and long-term mitigation and warning strategies at national and regional levels to prepare for disasters like landslides, floods, and earthquakes. Satellite remote sensing is an important tool in establishing this preparedness [37]. • Polar monitoring: Despite its remoteness, the Arctic is home to 4 million people and has an economy exceeding 230 billion US$ (World Economic Forum, 2014). eo has a critical role to play in securing sustainable development in the Arctic [38]. Furthermore, this region directly impacts climate change and human life on the whole planet Earth. In the Arctic, satellite data are used to generate information about sea ice conditions and weather, which is of paramount importance for Arctic peoples, science, the commercial sector, and decision-making, marine navigation, safety, and climate change research [5]. 1.3 earth observation 13 1.3.1 Thesis Application: Sea Ice Analysis The specific application studied in this thesis is sea ice classification. Sea ice is a critical environmental component of the Earth’s climate system [39] and consid- erably impacts polar ecosystems. The Arctic has witnessed substantial climate change in the last decades, affecting its ecosystem, ecology, and meteorology. The changes are stronger in the Arctic than in any other place. These changes have been dubbed the Arctic augmentation, causing highly changeable Arctic weather and sea ice conditions. These harsh conditions pose challenges and hazards to high north maritime operations connected to resource extraction, fishing, and tourism [40, 41]. As a result, reliable and continuous monitoring of sea ice behavior, thickness, and ice type distribution are critical for effective and reliable human activities, along with for understanding changes over ex- tended time frames [42, 43]. Therefore sea ice monitoring is a prominent eo application from a scientific and societal point of view. Every year,many polar nations conduct scientific cruises to the Arctic to perform sea ice observations and surveys. Regular ice watch inspections from vessels [44, 45] are the most common and fundamental in-situ measurements. In-situ investigations on the ice include ice thickness and roughness measurements, temperature and salinity profiles, and the thickness and characteristics of snow cover on sea ice [46]. For validation of satellite rs products, airborne ice thickness measurements using electromagnetic em induction devices are frequently used and referred to as in-situ data [47]. 1.3.2 Synthetic Aperture Radar (SAR) There are two primary forms of remote sensing sensors, active and passive. They are distinguished by generating the signals’ sources used to examine an object. Active remote sensing systems generate their own illumination, whereas passive systems rely on reflected or backscattered radiation generated by other sources, like the sun. The electromagnetic radiation most commonly used for remote sensing differs in wavelengths, is classified as short (visible, nir, and mir) and long (visible,nir,mir andmicrowave). In the high north,microwave sensors are particularly prominent, because they are generally independent of weather and light conditions [48]. Synthetic Aperture Radars (sars) are active imaging remote sensing sensors, commonly employed in Arctic monitoring. They are particularly useful for monitoring sea ice. It can obtain high spatial resolution by using the coherent character of the transmitted radar pulse by combining radar technology and advanced signal processing [49]. SAR imaging is not affected by sunlight or cloud cover. The difference between sar and optical images is illustrated in 14 chapter 1 introduction Figure 1.7. Sea ice classification using sar images is affected by several challenges that influence the performance of any machine learning system. These include incidence angle dependencies, seasonal changes, ambiguous radar scattering, and system noise. The sar challenges will be discussed in more detail in chapter 2. Figure 1.7: sar vs Optical Earth observation. Left is a Sentinel-1 sar scene and right is a optical image by Sentinel-2 from same location with small time gap [50]. 1.3.3 Classification Problems Focusing on the operational needs of stakeholders, communities and authorities in the Arctic ecosystem, rs research on sea ice monitoring has addressed two classification problems: • Classification of sea ice as opposed to water: High-resolution ice masks can provide detailed information about the location of the ice edge and leads, and can be utilized to build large-scale ice concentration maps. • Multi-class ice type classification: This entails the more general classi- fication problem of constructing ice maps of multiple classes, including first and multi-year ice, thin ice, deformed ice, ridges, and leads. The primary emphasis of this thesis is on the binary classification of sea ice versus water. However, the assumptions made in the suggested approaches are such that the algorithms described in this thesis may also be extended to multi-class ice type classification. 1.3 earth observation 15 1.3.4 Exploitation: Operational Ice Charting National sea ice services are responsible for producing operational sea ice charts, which are often updated daily. The ice charts offer information on the concentration of some ice categories and ice margins based on a range of satel- lite data, principally from sars, passive microwave, and optical sensors. Visual observations and meteorological weather forecasts supplement the satellite data. Thermal infrared imaging satellite data are used to create contours of the sea surface temperatures. [51]. Currently, none of the ice services have reported deploying automated categorization methods on a year-round basis. All algorithms still require some level of human intervention. The criteria for ice chart generation vary greatly based on the end-user and their demands. There are both operational and scientific end-users. Scientific end-users employ ice charts for academic research, such as in climate, biology, or data assimila- tion in numerical models, On the other hand, operational end-users demand timely ice supporting their activities. Mariners, for example, require data to support navigation and safety in ice-covered waterways. In this respect, precise and consistently accurate information regarding the location of the ice edge, leads and ridges, areas of (thin) first-year ice, and areas of multi-year ice is critical. Mariners also demand short-term projections of the development of ice conditions in the region where they are or are headed. [52]. Current ice charts for European seas are provided after 1500 UTC on weekdays (Monday-Friday), and for the Antarctic on Mondays (October-April). [51]. An example of an ice chart for January 14th, 2022 is provided in Figure 1.8. 16 chapter 1 introduction Figure 1.8: Example of an ice chart produced by the Ice Service of the Norwegian Meteorological Institute (nis). The polygons depict different ice concen- tration zones. The ice charts also depict the ice edge to indicate open water which is important for ship navigation. Detailed information can be found in [51]. 1.4 Objectives of this Thesis Sea ice is a dynamic and complicated target surface, making it difficult to de- velop a reliable algorithm for autonomous classification based on SAR images. Heavy speckle noise, incidence angle impact, range-dependent noise pattern, and sensitivity of SAR signals to the sea surface all contribute to the diffi- culty of sea ice analysis [53]. Robust ice charting and ice edge identification become increasingly difficult as a result of the intricate interaction between SAR signals, the imaging geometry, and underlying physical processes [54]. To accomplish automated sea ice analysis, rather than using feature engineering or hand-crafted features (i.e. features that have been designed to highlight specific properties of the surface), this project will use deep neural network architectures as a feature learning scheme to achieve more robust features and better classification. In this regard, the objectives of this thesis can be summarized as follows: Main objective: The main objective of this thesis is to investigate the capa- bilities of deep neural networks and propose new methods to improve sea ice analysis by considering the scarce labeled data and computation complexity issues. 1.4 objectives of this thesis 17 • Objective 1: Investigating the capabilities of deep neural networks based on self-designed and previously proposed famous architectures consider- ing networks depth, different inputs, and learning strategies (Paper 1). This investigation includes identifying specific issues and requirements related to sea ice analysis, which will subsequently be addressed by semi-supervised learning and scalable computing. • Objective 2: Investigating different semi-supervised methods to address scarce training data considering the specific requirements of sea ice analysis by deep neural networks. We propose a new architecture for semi-supervised sea ice classification in Paper 2. • Objective 3: Investigating distributed deep learning in relation to sea ice analysis. We propose a new distributed deep learning method to address the computational complexity of deep neural networks’ training for the sea ice analysis application (Paper 3). The thesis considers deep neural networks as a holistic learning architecture for feature extraction and classification problems. It proposes new methods in deep supervised (Paper 1, Chapter 6) and semi-supervised (Paper 2, Chapter 7) learning to improve the accuracy and generalization of sea ice analyses. Moreover, to mitigate the computational complexity of training deep learning architectures, the thesis proposes a new distributed deep learning method for sea ice analysis (Paper 3, Chapter 8). 1.4.1 Thesis Outline The three journal articles, which form the core of this thesis, are discussed in detail in chapters 6, 7, and 8. In these three studies, supervised and semi- supervised learning for sar-based sea ice classification are discussed, and a novel distributed deep learning algorithm is suggested. Chapter 2 provides essential background material, covering the basic principles in Synthetic Aper- ture Radar (sar) image formation, emphasizing sea ice categorization, to provide the necessary context. Chapter 3 discusses several learning techniques, including supervised and semi-supervised deep learning. The core idea and techniques of distributed deep learning are covered in detail in Chapter 4. Chap- ter 5 provides a summary of the scientific papers. Finally, Chapter 9 summarizes the findings and discusses possible future studies. 2 Synthetic Aperture RadarImagery for Sea IceClassification Synthetic Aperture Radar images are the primary data source for operational ice services [55], and the scientific study is presented in this dissertation. An imaging Synthetic Aperture Radar, abbreviated as sar, is a radar that can be operated from the ground, aircraft (airborne) or satellites (spaceborne) that can provide two-dimensional images of the Earth’s surface [56]. This chapter discusses the fundamental concepts of spaceborne sar, emphasizing on its use for the rs of sea ice in polar regions. Generally speaking, radar (radio detection and ranging) systems are designed on the principle of echolocation. An antenna sends an Electromagnetic (em) signal and analyses the bounces returned from a given target. Assuming that the signal’s speed is known, the distance between the antenna and the target may be estimated by multiplying the signal’s travel time by speed [29]. The Earth is now orbited by several sar instruments, each operating at a differ- ent frequency range, with various polarization channels, and divergent spatial resolution. They provide a critical contribution to observations of distant and inaccessible locations, such as the high northern latitudes. The measurements are independent of sunlight or any other naturally released radiation since the 19 20 chapter 2 synthetic aperture radar imagery for sea ice classif ication sar is an active system that generates its own signals [57]. Furthermore, the usual sar wavelengths penetrate clouds with little or no loss of energy [58]. As a result, sar sensors may capture pictures continually, regardless of whether or not there is sunshine or good weather. This matter is particularly paramount in the Arctic, where there is prolonged darkness throughout the polar night and generally approximately 70-80 percent of cloud cover throughout the year [59]. Because of its capacity to capture images throughout the day and in all weather conditions, sar has become the chief data source for sea ice charting in operational ice services across the globe [60]. On the other hand, sar images are significantly different from optical images and may be difficult to decipher and comprehend. The geometry of a side-looking imaging radar system is sketched in Fig- ure 2.1. Figure 2.1: A geometric model for a sar system. The main concepts of sar imaging are defined below: Swath: The swath width is the coverage of the image in range direction, which is the direction perpendicular to the flight direction. The along-track direction is often referred to as the azimuth direction (Figure 2.1). Slant range: Slant range is the length between the antenna and grounds of a pixel. Ground range: The ground range is the distance between the ground track and the ground pixel (Figure 2.1). Incidence Angle (IA): The incidence angle is the angle formed between the 2.1 spaceborne sar system properties 21 incident sar beam and the axis normal to the geodetic ground surface of the area where the beam is incident. 2.1 Spaceborne SAR System Properties sar systems can image the Earth’s surface from space with a high spatial resolution. They are furthermore characterized by properties like frequency, polarimetric capabilities, and the temporal resolution given by the revisit time period of the satellite platform. The following are the definition of these characteristics: Spatial Resolution: Spatial resolution is a prominent metric for measuring image quality because it indicates the distance between two objects on the ground which can be separated in a sar image. sar spatial resolution is characterized in the azimuth and range directions separately. Range refers to the across-track direction, perpendicular to the satellite’s flight direction, while azimuth refers to the along-track dimension, parallel to the satellite’s flight direction. In a real aperture radar, the range resolution is dependent on the length of the pulse. Two distinct targets on the surface will be resolved if their separation is greater than half the pulse length. Most sar systems use a linear frequency modulated pulsed waveform, called a chirp, and achieve, after compression, a range resolution that is inversely proportional to the chirp bandwidth, i.e., 𝛿𝑟 = 𝑐02𝐵𝑟 , where 𝛿𝑟 is the range resolution, 𝑐0 is the velocityof light, and 𝐵𝑟 is the chirp bandwidth. In a realistic real aperture radar, the azimuth resolution is determined by the angular width of the radiated microwave beam, and the slant range distance [61]. This beam width is a measure of the width of the illumination pattern. As the radar’s distance to the surface increases, the azimuth resolution gets worse. A sar exploits the relative movement of the antenna compared to the imaged surface, which in essence generates a linear frequency modulation in the azimuth direction. This feature is exploited to improve the resolution dramatically. In fact, after compression, the azimuth resolution becomes 𝛿𝑎 = 𝐿𝑎2 , where 𝛿𝑎 is the azimuth resolution, and 𝐿𝑎 is the antenna length in the azimuth direction. The motion of the platform is used to synthesize a long antenna in order to get a high azimuth resolution [61]. Hence, the name synthetic aperture radar, and we note that both the range and the azimuth resolutions are independent of the range to the target. Frequency: Most spaceborne sar systems operate at wavelengths between 0.5 22 chapter 2 synthetic aperture radar imagery for sea ice classif ication and 75 cm [62]. Using this spectral range has the primary advantage of allowing the atmosphere to be virtually transparent. The atmosphere is transparent for wavelengths in microwave bands. The backscattered radiation is generally dependent on the surface roughness at the scale of the radar wavelength. The wavelength selection should therefore be made to match the size of surface features on the targeted object. Furthermore, longer wavelengths can penetrate through the snow cover and the ice, resulting in a higher contribution of volume scattering. The most commonly used frequency bands for spaceborne sar observations of sea ice are L-band, C-band, and X-band [63]. Polarimetry: An EM wave is made up of an electric and magnetic field. These fields are orthogonal to each other and the propagation direction of the wave. The direction of the electric field characterizes the signal’s polarization, and it may be expressed in terms of two orthogonal basis vectors [64]. em waves are polarized, with linear or circular polarization in particular cases [57]. Most sar satellites use linear polarization on both the transmission and reception. The polarization direction in the plane of wave propagation is either horizontal (H) or vertical (V). An individual sar channel might also be defined by a two-letter combination, with the first letter indicating the polarization of the sent signal and the second letter indicating the polarization of the received signal. The HV channel, for example, sends in horizontal polarization and receives in vertical polarization. Temporal Resolution: When building a time series of satellite images, the revisit cycle determines the temporal resolution. The revisit cycle equals the time it takes between two observations of the same object on the Planet’s sur- face. These observations may occur at different orbits of the sar. The revisit cycle should not be confused with the repetition cycle, which is defined as the duration between two passes of a satellite along the same orbit, according to [65]. While the repetition cycle is solely determined by the satellite orbit configuration, the revisit duration is determined by the target location and swath width. The majority of sar sensors in space are borne by polar orbit- ing spacecraft. The orbit track spacing for these satellites is closer at higher latitudes, resulting in a much higher revisit rate in Polar regions. Data Acquisition Modes: The majority of spaceborne sar sensors can operate in a variety of acquisition modes. The geographic coverage, resolution, and polarimetric capability all change across these modes. For example, the sar instrument onboard Sentinel-1 may operate in: Stripmap (sm), Interferometric Wide Swath (iw), ExtraWide Swath (ew) andWave (wv). The best acquisition mode and image product for a certain application must be selected based on its individual aims and needs, along with the overall availability of data. The Stripmap (sm) and ScanSAR modes are the most used acquisition modes. The antenna footprint is fixed to one swath in SM mode, and a continuous strip on 2.2 sentinel-1 23 the Earth’s surface is scanned. By directing the antenna to various elevation angles and merging numerous subswaths, the Scan SAR mode obtains more sizable spatial coverage. This increased geographical coverage comes at the expense of reduced spatial resolution. 2.2 Sentinel-1 There are several spaceborne platforms operated by different countries for sea ice monitoring and analysis. They include satellites like, Radarsat-2 (Canada), TerraSAR-X/TanDEM-X (Germany),HJ-1C (China),ALOS-2 (Japan) andSentinel- 1A/1B (Europe/ESA). These sensors take images at different frequencies and polarization constellations. The Sentinel-1 C-band sar instrument supports operation in single polarization (HH or VV) and dual polarization (HH+HV or VV+VH) modes, implemented through one transmit chain (switchable to H or V) and two parallel receive chains for H and V polarisation. SM, IW, and EW products are available in single (HH or VV) or dual polarization (HH+HV or VV+VH). WV is single polarisation only (HH or VV). The primary conflict-free modes are IW, with VV+VH polarization over land, and WV, with VV polarization, over the open ocean. Having the Interferometric Wide swath mode as the chief operational mode satisfies most current service requirements, avoids conflicts, preserves revisit performance, simplifies mission planning, decreases operational costs, and builds a consistent long-term archive [56]. Figure 2.2 shows Sentinel-1 acquisition modes. 24 chapter 2 synthetic aperture radar imagery for sea ice classif ication Figure 2.2: Illustration of Sentinel-1 acquisition modes [66]. Generally, The Sentinel EW mode is utilized for wide-area coastal monitoring, such as ship traffic, oil spill, and sea-ice monitoring, among others [65, 67]. Strictly speaking, smmode is only utilized on tiny islands and upon request for special occasions like disaster relief. At least some operational sea ice charting criteria can be met by the EM mode’s extended coverage at a resolution for various applications [56]. Sentinel-1 em mode data have been used as the primary source in Paper 1 and Paper 2 of this dissertation to develop deep learning models for sea ice classification. Sentinel-1 has a temporal revisit period of twelve days and completes 175 orbits every cycle, giving it a spatial resolution of twelve days. Because Sentinel-1a and Sentinel-1b circle in the same orbit plane with a 180◦ orbital phase difference, the effective repetition constellation cycle is decreased to six days instead of the previous twelve. 2.3 Sea Ice Classes in SAR Images Sea ice surfaces may be classified into different ice types. Distinct user commu- nities use different ice class definitions. The most used categorization criteria 2.3 sea ice classes in sar images 25 Table 2.1: Different water and ice classes. wmo code Classes 02 Open Water/ Leads with Water 01–02 Brash/Pancake Ice 83 Young Ice (YI) 86–89 Level first-year ice FYI 95 Old/deformed Ice are based on the ice thickness, where ice type is associated with a defined thickness range. In general, ice thickness is related to the ice age. Therefore, these criteria may be regarded as age-based [68]. A term used to describe the distinct thickness/age-based classes of sea ice is the sea ice stage of develop- ment (sod). [69] describes a categorization method that has been accepted by the World Meteorological Organization (wmo) and is frequently used in operational ice charts. Figure 2.3 shows various types of ice. In this dissertation, we use a dataset with five classes based on the wmo definition. The Table 2.1 shows these classes and their respective codes. Figure 2.3: Ice types in optical observation. Ice types appear very differently in sar compared to optical images. Their signatures in sar depend on the backscatter of the radar signal at the ice and water surfaces and the signal’s penetration into the ice. Figure 2.4 displays an optical image from the same region as a sar image, acquired with a short time interval between the two images. To take both HV and HH polarization channels into account, they are displayed as the red and green channels, respectively, while the blue channel is zero. 26 chapter 2 synthetic aperture radar imagery for sea ice classif ication Figure 2.4: Sentinel-1 HH+HV SAR image, (a), versus Sea Ice in optical image, (b). 2.4 Challenges Sea ice classification has a variety of challenges that might impact the perfor- mance of any machine learning classification system. Several particular issues associated with this dissertation are addressed in this section. Incidence angle dependency: In the case of spaceborne sar systems, the ia can span a wide range from the side-looking aspect of imaging geometry, where the local ia changes dramatically from the near to the far range views. While this significantly impacts the resulting image intensity values, it also represents an important physical phenomenon related to radar scattering,which is influenced by the physical properties of the sea ice surface and volume. The IA gradient is dependent on the ice type. It is also very different for a sea ice surface and open water Seasonal changes: Seasonality significantly influences the dielectric charac- teristics of ice and snow covers, and also, as a result, the intensities of the backscattered signals. During the winter, the surface scattering from young ice and the volume scattering from the bubble structure in the top layer of multi-year ice strongly contribute to backscattered signals [70]. Melting alters the physical qualities and dielectric properties of sea ice and snow, resulting in a drastic reduction in the radar signal from a particular ice type consequence of increased humidity. Hence, it becomes increasingly difficult to distinguish between different ice forms in the summer, as seen in Figure 2.5. This figure shows two Sentinel-1 sar images from the fast ice region off the east coast of Greenland before (on the left), and after (on the right) melting starts. Ambiguous scattering: The roughness and dielectric characteristics of the target surfaces are responsible for the backscattering seen from sea ice and water surfaces. In the ocean, small-scale roughness (on the order of the radar 2.4 challenges 27 Figure 2.5: The difference between Sentinel-1 IW-mode photos (HH polarisation) for a) pre-melt sea ice (17 April 2018) and b) full-melt sea ice (17 April 2018) is shown in the figure (16 June 2018). Due to the fact that this is a region of fast ice, Belgica Bank off the coast of north-east Greenland, the sea ice and icebergs are identical in both photographs (courtesy of Nick Hughes, the Norwegian Meteorological Institute). signal’s wavelength size) is the dominant source of scattering, and they are modified by the larger waves that go through the water. As a result, excessive wind and turbulence will produce high backscatter, with texture linked with random directions of the wind turbulence. The scattering process is more complicated for sea ice. Level or deformed ice surfaces, dry or wet, snow- covered or bare, are all possible conditions on an ice surface. There are also incidence angle dependencies, which depend on ice type, to consider. All these phenomena combine to make the sar returns confusing, to the point that many diverse ice types may provide almost identical signal characteristics. Even distinguishing between open ocean and sea ice may be problematic. This problem is depicted in Figure 2.6. Noise characteristics of Sentinel-1: Internal noise on Sentinel-1 has an im- pact on the performance of sar systems. A structured but class-independent intensity noise signal is introduced throughout the images as a result of this, which is particularly noticeable in the lower intensity HV and VH channels. As a function of range (and hence IA), this noise floor is often more prominent with the five sub-swath assemblies of Terrain Observation with Progressive Scans sar Sentinel-1 EW imagery. The varying noise floor often has undesired classification effects comparable, but distinct, from the IA dependence. It has proven difficult to efficiently filter out this noise component, in part because the spatial correlation properties of the noise are very similar to the spatial correlation properties of speckle, which is a well-known phenomenon associ- 28 chapter 2 synthetic aperture radar imagery for sea ice classif ication Figure 2.6: Two HH-polarization sentinel-1 sar from the marginal ice zone. Figure 2.7: An example of a noise HV sar image, with corresponding noise floor profile on the right. ated with coherent radar imaging and has been studied extensively. Figure 2.7 shows an example of a noise pattern in a Sentinel-1 EW-mode HV sar scene (on the left), and the associated noise floor profile (on the right). Ground truth labeling: Acquiring in-situ ground truth for Sea Ice classification is very costly, time-consuming, and in certain cases, not even practicable. There are many reasons for this, including the difficulties of acquiring ground truth in this particular region of the Earth, which has extended dark seasons and very variable weather conditions. Even though numerous sensors are often utilized to label the data, manual labeling suffers from the above issues. Manual labeling is restricted in diversity since humans can only assess a small number of locations, resulting in tiny and unbalanced training datasets. It is important to note that this is due to the nature of the application and that even with great 2.4 challenges 29 efforts, we will only have a limited quantity of labeled data compared with the volume and diversity of data available in the Arctic. Thus, it is a critical problem that should be considered, particularly in deep learning training. The issue of scarce training data is well-known in the machine learning and deep learning domains. It has resulted in significant hurdles in the field of big data applications. For this reason, new advanced approaches in deep learning, such as deep semi-supervised learning, deep unsupervised learning, and deep self- supervised learning, have been developed to solve this problem at hand. Lack of automatic learning methods: The majority of Sea ice classification techniques rely on traditional and statistical approaches, while deep learning models and architectures for sea ice classification are still not very mature, especially for operational use. When a deep learning model is trained on sea ice records, it may uncover a slew of previously undisclosed difficulties and confederations. This issue is much more crucial when the question of the explainability of deep learning is still being debated. Thus, describing deep learning models’ behaviors based on newly found physical features might be challenging. 3 Deep Neural NetworksLearning Approaches Deep Learning (dl) is a machine learning technique that creates artificial neural networks designed to imitate the structure and function of the human brain. Deep Learning is a technique used to train computers to recognize patterns in data. Deep learning, also known as deep structured learning or hierarchical learning, is a type of nonlinear processing that, in practice, employs a large number of hidden layers - typically more than 6, but often much higher - to extract features from data and transform the data into different levels of abstraction or representations [71]. This concept is motivated by the fact that the mammalian brain is structured in a deep architecture, with each input percept encoded at numerous layers of abstraction, particularly in the monkey visual system reference [72]. DL researchers have built unique deep architectures as an alternative to shallow architectures, inspired by the architectural depth of the human brain. Tradi- tionally, machine learning needs human feature extraction from training data by a subject matter expert. However, as shown in Figure 3.2, a deep learning framework automatically learns relevant features from training data such as images, text, patterns, and digital signals. The optimization techniques for training deep models differ numerously from typical optimization strategies. Typically, machine learning operates directly. In 31 32 chapter 3 deep neural networks learning approaches AI Machine learning Representation learning Deep learning Example: Knowledge bases Example: Logistic regression Example: Shallow autoencodersExample: MLPs Figure 3.1: Deep Learning is a machine learning technique [9]. the majority of machine learning cases, we are interested in some performance measure P that is defined relative to the test set and may be intractable. As a result, we optimize P indirectly. We lower a different cost function 𝐽 (𝜃 ) to improve P. This contrasts with pure optimization, in which reducing J is a self-contained objective. Additionally, optimization techniques for deep model training often incorporate some specialization in the form of machine learning objective functions. Typically, the cost function is expressed as a weighted average across the training set, as follows: 𝐽 (𝜃 ) = E(𝑥,𝑦)∼𝑃𝑑𝑎𝑡𝑎𝐿(𝑓 (𝑥, 𝜃 ), 𝑦) (3.1) where 𝐿 is the per-example loss function, 𝑓 (𝑥, 𝜃 ) is the predicted output when the input is 𝑥 , and 𝑃𝑑𝑎𝑡𝑎 is the empirical distribution. In the supervised learning case,y is the target output. 3.1 learning approaches 33 Figure 3.2: Deep Learning vs Traditional machine learning. 3.1 Learning Approaches The learning techniques used in deep learning methodologies are generally classified as follows: 1) Supervised Learning, 2) Semi-supervised Learning (ab- breviated as ssl), and 3) Unsupervised Learning. Supervised learning creates a knowledge base from previously identified patterns, which aids in the classi- fication of new patterns. The primary objective of this learning is to translate the input characteristics into a class output. The result of this learning is the construction of a model from input patterns. The model may be used to categorize previously unknown occurrences accurately. In general, it may be expressed as a function f(x) with patterns as input and a class 𝑦 as output. The effectiveness of deep learning and cnns is often contingent on the availability of a vast amount of labeled data, where millions of photos are tagged to train deep neural networks [14, 20] to allow these models to perform on par with or even better than humans. While visual data are abundant, data that has been accurately annotated by humans is very sparse. Not only obtaining vast volumes of annotated training data for each job is difficult and perhaps expensive, but also it proves to be mistake-prone. This problem is even more essential in the case of remote sens- ing and Earth observation, since acquiring in-situ ground truth is prohibitively costly, time demanding, and in some cases impossible. This problem will be far more severe when it comes to Arctic data analysis [73]. While the amount of labeled data would be incredibly tiny, the amount of unlabeled data may be enormous. The distribution of such unlabeled data contains critical infor- 34 chapter 3 deep neural networks learning approaches mation constructing resilient representations that are generalizable to novel learning tasks. depending on the number of labeled instances used to train models, Unlabeled data may be used in an unsupervised or semi-supervised way. Unlabeled data may also help models bridge the domain divide between multiple tasks, resulting in a plethora of unsupervised and semi-supervised techniques [16]. These techniques are shown in Figure 3.3. Figure 3.3: Basic example of binary classification considering supervised, unsuper- vised, and semi-supervised learning. We have two classes, blue circles, and red triangles. Unlabeled data belonging to each class are shown in black circles and triangles. (a) refers to supervised learning in which all training samples are labeled. (b) is unsupervised learning in the absence of labeled training data. (c) is semi-supervised learning, in which some training samples contain labels and some do not [74]. On the other hand, unsupervised learning does not need a predefined output value. Rather than that, one attempts to deduce some underlying structure from the inputs. Unsupervised approaches aim to develop representations that are sufficiently generalizable to be used for various future learning challenges. For example, in unsupervised clustering, the objective is to infer a mapping from provided inputs (e.g., vectors of real numbers) to groups in such a way that comparable inputs are mapped to the same group [75]. Semi-supervised learning seeks to bridge the divide between these two ob- jectives [76, 77]. While attempting to solve a classification issue, extra data points with unknown labels may be introduced to help the classification process. On the other hand, for clustering algorithms, the learning process may profit from the fact that some data points belong to the same class. Semi-supervised classification techniques are especially useful when labeled data is in little supply [75]. The majority of semi-supervised learning research is focused on classification. 3.2 convolutional neural networks 35 3.2 Convolutional Neural Networks As the most representative supervised DL model, the Convolutional Neural Networks (Convolutional Neural Network (cnn)s) [78] have outperformed most algorithms in visual recognition. The deep structure of cnns allows the model to learn highly abstract feature detectors and to map the input features into representations that can boost the performance of the subsequent classifiers. The cnn is a trainable multilayer architecture made from many feature- extraction stages that may be trained. Usually, each level is made from three layers, as follows: a convolutional layer, a nonlinearity layer, and a pooling layer. When designing the architecture of a cnn, it is crucial to consider how the two-dimensional structure of the input picture will be used. A cnn is composed of the feature extraction process including one or more conventional layers, followed by fc layers that serve as the classifier. Figure 3.4 is a diagram that depicts the overall architecture of a cnn . Figure 3.4: A general cnn architecture for image analysis, here for character recog- nition [78]. The convolutional layer receives as input a three-dimensional array of 𝑟 two- dimensional feature maps of size 𝑚 × 𝑛. Additionally, a three-dimensional array 𝑚 × 𝑛 × 𝑘 is produced, consisting of 𝑘 feature maps of size 𝑚 × 𝑛. The convolutional layer has 𝑘 trainable filters of size 𝑙 × 𝑙 × 𝑞, collectively referred to as the filter bank 𝑊 , which links the input and output feature maps. The convolutional layer generates the output feature maps seen in Equation 3.2. 𝑧𝑠 = 𝑞∑︁ 𝑖=1 𝑊 𝑠𝑖 ∗ 𝑥𝑖 + 𝑏𝑠 (3.2) where 𝑥𝑖 represents each input feature map, * represents a two-dimensional dis- 36 chapter 3 deep neural networks learning approaches crete convolution operator, and b represents a trainable bias parameter. This layer is just a pointwise nonlinearity function applied to each component in a feature map in a conventional cnn. The nonlinearity layer determines the output feature map 𝑎𝑠 = 𝑔(𝑧𝑠), where g(.) is often selected to be a rectified linear unit (𝑅𝑒𝐿𝑈 )𝑔(𝑥) =𝑚𝑎𝑥 (0, 𝑥). This function is often referred to as the activation function. However, for the last layer of completely linked layers, the SoftMax function is utilized to get the probability distribution. 𝑆𝑜 𝑓 𝑡𝑚𝑎𝑥 (𝑧𝑠) = 𝑒𝑥𝑝 (𝑧 𝑠)∑ 𝑗 𝑒𝑥𝑝 (𝑧 𝑗 ) (3.3) The pooling layer is composed of a grid of pooling units spaced s pixels apart, each of which summarizes a small spatial area of size 𝑝 ∗ 𝑝 centered on the pooling unit’s position. Following numerous feature extraction steps and using a Fully Connected (fc) layers as a classifier, the complete network is trained using backpropagation of a supervised loss function such as the traditional least-squares output as Equation 3.3. 𝐽 (𝜃 ) = ∑︁ (∥ 𝑓 (𝑥, 𝜃 ) − 𝑦∥2) (3.4) where f denotes the network output after the operation of the SoftMax function to the trainable parameter𝜃 . The output vector y is a 1-of-K vector. The objective is to minimize J(𝜃) as a function of 𝜃 . The stochastic gradient descent with backpropagation [79] algorithm is investigated for this optimization. Gradient-based optimization methods are the most often utilized in dl to solve the above optimization issue. Due to the computational difficulty of second-order gradient descent techniques, first-order gradient descentmethods, particularly Stochastic Gradient Descent (sgd) with mini-batch and variations, are often employed in dl. Generally, the iterative process consists of the following steps: 1) It takes a mini-batch of data and samples it. 2) It conducts feed-forward calculations to determine the objective function’s loss value (Equation 3.4). 3) It uses backward propagation to determine the gradients relative to the model parameters. 4) Finally, it does a model parameter update. It takes time to train a deep model, much more so for big models or datasets. It is becoming more popular to use distributed training approaches to expedite the training process by using 3.3 Semi-Supervised Learning 37 several processors [25]. 3.3 Semi-Supervised Learning Several semi-supervised classification techniques have been proposed through- out the last two decades. These approaches vary in their underlying semi- supervised learning assumptions, their treatment of unlabeled data, and their relationship to supervised algorithms. The primary differences between ssl techniques are due to the various assumptions they make. The underlying marginal data distribution p(x) over the input space must include information about the posterior distribution p(y|x). This is a crucial condition for semi- supervised learning. If this is the case, unlabeled data may be used to elicit information about p(x), and hence about p(y|x). If, on the other hand, this requirement is not satisfied and p(x) includes no information about p(y|x), it is intrinsically impossible to enhance the accuracy of predictions using the extra unlabelled data [77]. Fortunately, the previously indicated criteria seem to be fulfilled in many real- world learning situations, as seen by the effectual application of semi-supervised learning approaches in practice. However, the interaction between p(x) and p(y|x) is not necessarily the same. As a result, semi-supervised learning as- sumptions such as smoothness, low-density assumption, and manifold have emerged, which codify the forms of predicted interaction [76]. It is assumed that the associated labels 𝑦 and 𝑦 should be the same for two input points 𝑥, 𝑥 ′ ∈ 𝑋 that, according to the smoothness assumption, are near to each other in the input/feature space. This assumption is also often employed in supervised learning, but it has an additional advantage in the semi- supervised context: the smoothness assumption may be applied transitively to unlabeled data, which is practical in many situations. Consider the following scenario: there is a labelled data point 𝑥1 ∈ 𝑋𝐿, and two unlabeled data points 𝑥2, 𝑥3 ∈ 𝑋𝑈 , and 𝑥1 is near to 𝑥2 and 𝑥2 is close to 𝑥3, but 𝑥1 is not close to 𝑥3. If the smoothness condition is met, we may still expect 𝑥3 to have the same label as 𝑥1, since closeness (and hence the label) is transitively transmitted via 𝑥2. The low-density assumption states that the decision border of a classifier should, wherever possible, travel through low-density areas in the input or feature space of the classification problem. In other words, the decision border should not travel through densely populated neighborhoods. The assumption is specified over the actual distribution of the input data, which is represented by the function 𝑝 (𝑥). When just a small number of samples from this distribution are 38 chapter 3 deep neural networks learning approaches considered, it simply indicates that the decision boundary should be located in an area where only a small number of data points are seen. Consequently, the low-density assumption is closely connected to the smoothness assumption; in fact, it may be thought of as the smoothness assumption’s equivalent for the data distribution underpinning the assumption of smoothness. When dealing with machine learning situations in which the data may be rep- resented in Euclidean space, the observed data points in the high-dimensional input space 𝑅𝑑 are often clustered in lower-dimensional substructures of the input space 𝑅𝑑 . Manifolds are topological spaces that are locally Euclidean. For example, if we examine a 3-dimensional input space in which all points are located on the surface of a sphere, then the data may be said to be located on a 2-dimensional manifold in which the points are located. Semi-supervised learning is based on the manifold assumption, which asserts that (a) the in- put/feature space consists of numerous lower-dimensional manifolds on which all data points sit and (b) data points lying on the same manifold are assigned the same label. If we can figure out which manifolds exist and which data points are located on which manifold, we may infer class assignments for unla- beled data points from the class assignments of labeled data points located on the same manifold. This is very useful when dealing with large datasets. Various deep semi-supervised learning methods have been proposed consid- ering these assumptions and advanced deep learning approaches. These ap- proaches are shown in Figure 3.5. 3.3 Semi-Supervised Learning 39 Figure 3.5: An overview of the landscape of semi-supervised methods [16]. 3.3.1 Semi-supervised GANs Generative Adversarial Networks gans have been used multifariously to ad- dress different problems. For semi-supervised learning, they have represented two distinct approaches to the use of gans in semi-supervised learning. One suggests training a K+1 classifier using K predefined labels and a false class to represent generated samples. It investigates the distribution of unlabeled instances by classifying them into the first K real classes [80]. The second paradigm interprets the generator of a learned GANs model as the (local) parameterization of the data manifold. This parameterization allows for the description of label invariance along the manifold’s tangents. This is strongly connected to the Laplace-Beltrami operator [81], which is only approximated in conventional graph-based semi-supervised models by the graph Laplacian [16]. 3.3.2 Auto-Encoders These approaches expand the unsupervised Variational Autoencoders (vae) to two types of semi-supervised models, which are described in more detail below. The first latent-feature discriminative model (M1), [82], is a basic model to understand. A classifier is trained to predict the label of sample x on top of 40 chapter 3 deep neural networks learning approaches the latent representation z of a sample x created by a vae model. While the vae is trained on both the labeled and unlabeled portions of a training set, the classifier is only trained on the labeled portions of a training set. The second generative semi-supervised model (M2), [82], is more difficult to understand and implement. An additional class variable y, which is latent for unlabeled x but visible for labeled x, is used to construct sample x. This class variable y is generated in addition to the latent representation z. In addition to the M1 and M2 models and the hybrid model, attempts have been made in the literature to include supervision information in vae ([83]). 3.3.3 Disentangling Representations Through the development of an inverse graphical representation of the visual model, these approaches construct the semi-supervised model based on the vae model. In the first method, it is intended to learn a collection of "graphics codes," which are used to manipulate and render pictures similar to that of graphics software [84]. Images are represented as disentangled representations of graphics codes in this context. The second technique [85], proposes an extended form of semi-supervised vaes to separate interpretable variables from latent representations, similar to the previous approach. 3.3.4 Teacher-Student Models The idea behind teacher-student models for semi-supervised learning is to obtain a single or an ensemble of teachers, and use the predictions on unlabeled examples as targets to supervise the training of a student model. Consistency between the teacher and the student is maximized to improve the student’s performance and stability in classifying unlabeled samples. Various ways of training the teacher and maximizing the consistency between the teacher and the student lead to a variety of semi-supervised models of this category. Several famous approaches can be categorized in this group, such as mean teacher [86] and Noisy teacher [87, 88]. Obtaining a single or an ensemble of instructors and using their predictions on unlabeled instances as objectives to oversee the training of a student model is the concept underlying teacher-student models for semi-supervised learning. Maximizing consistency between the teacher and the student is crucial to in- crease the student’s performance and stability while categorizing unlabeled samples. Diverse approaches to training the instructor and optimizing consis- tency between the teacher and the student result in an array of semi-supervised models. In this category, there are numerous well-known techniques, such as the mean teacher [86] and the Noisy teacher [87, 88]. 3.4 thesis approach: dl and sea ice classif ication 41 3.3.5 Self-training These methods are based on Consistency regularization by considering noisy data (usually data augmentation) by leveraging the idea that a classifier should output the same class distribution for an unlabeled example even after it has been changed by adding some noise (augmentation). These approaches can be considered as a single teacher-student model in which the model is trained by itself. Therefore, these methods can also be considered the Teacher-Student models. Several famous approaches can be categorized in this group, such as MixMatch [89], FixMatch [90] and [91]. 3.3.6 Deep Transductive Learning In transductive learning, label inference that is limited to a certain set of unlabeled examples is crucial. This act may be accomplished, for example, by combining classification loss on labeled data with unsupervised aims on all data, the latter of which serves as a regularizer. In the deep learning field, [73] first employs efficient transductive label propagation [92] to infer pseudo-labels for unlabeled data that are then utilized to train the classifier. Label propagation is a graph-based technique, and in this study, the graph is generated using the classification network’s embeddings. As a result, this procedure consists of two phases. The network is first trained using labeled and pseudo-labeled data. The second phase constructs the closest neighbor graph using the embeddings of the network trained in the previous step. 3.4 Thesis Approach: DL and Sea Ice Classification In order to employ dl for sea ice classification, two main challenges, including network architecture and scarce training data, are considered. Network Architectures: Network architecture is an important factor in extracting meaningful features for any application. Hyperparameters such as the number of layers, the number of kernels, the size of kernels and etc. are significantly important to designing a network architecture. However, it is interesting to employ the proposed architectures for other applications and fine-tune them for sea ice applications. Customizing a proposed network for sea ice application could be a shortcut approach. To encounter this issue, we consider the following questions: 42 chapter 3 deep neural networks learning approaches • What is the proper network architecture in terms of number layers, number kernels and etc. to be used for sea ice classification? • Is it beneficial to use existing well-known architectures for sea ice classi- fication? • Which type of proposed architectures are more suitable for this applica- tion? These questions were significantly important since dl networks are not very mature, and there are few works in technical literature. Therefore, these issues were targeted in the first publication. we proposed the following contribution to address these questions: • A dataset for deep learning analysis is published. • Different approaches in deep learning, including ad hoc architecture design, transfer learning, and re-training from scratch, have been studied. • A self-design ad hoc architecture is designed and trained for sea ice classification. • Different proposed architectures including MobileNetV2 [93], RestNet50 [20], and DenseNet121 [94] are trained for sea ice classification. • Considering transfer learning and re-training from scratch, especially the VGG-16 model [95] is trained for sea ice classification. • Effects of max-pooling layers in the VGG-16 architecture are studied, and a modified VGG-16 model is proposed. • Scarce training data issue is highlighted through the reported experi- ments. • Data augmentation technique is used, and it is highlighted that this technique is led to some uncertainty on preserving the physical properties of signals. • Based on our finding using self-design ad hoc architecture and proposed architectures, a specific 13-layer is considered for further analysis. Scarce training data: To encounter this issue, we consider the following questions: 3.4 thesis approach: dl and sea ice classif ication 43 • What is the efficient approach to increasing the number of samples in the training data set? • How advanced approaches can be used to address this issue for sea ice classification? • Which approach is more appropriate to preserve physical properties? Scarce training data is an important obstacle toward employing dl models for sea ice classification. To address this issue,wemight consider human labeling or an algorithm that can use the advantage of labeled and unlabeled data: • Increasing the dataset: Adding more training data depends on visual manual annotation,which is very time-consuming. Moreover, considering the ambiguity of image, even with the help of optical images, it is not always visually possible to label data. • Semi-supervised learning: Semi-supervised learning seeks to bridge between labeled and unlabeled data to improve classification accuracy. In this method, the semi-supervised algorithm itself tries to include information from unlabeled data without human interference. To address the scarce training data issue, a new semi-supervised approach is proposed in the second publication: • A teacher-student learning based on label propagation for deep semi- supervised learning is proposed. • The method is based on the feature space of the trained cnn. • the method is not dependent on heavy augmentation technique that is not desired in this application. • A limited number of labeled samples starting from 15 samples and unla- beled samples are considered to efficiently train the models. • Our method reduced the dependence on labeled samples, which is very time-consuming and costly to collect for sea ice analysis. • We have also shown that by adding more unlabeled samples, the perfor- mance of the inference results has improved. Since unlabeled data is available more than labeled data, the size of training data can be increased easily. Regarding the computational intensity of deep 44 chapter 3 deep neural networks learning approaches learning, the training time can be significantly increased in this situation. It leads to a consequential problem for hyperparameter optimization. This issue is considered in the third publication. 4 Distributed Deep learningfor Scalable Computing Deep Neural Networks dnns are rapidly taking over various aspects of our daily lives and apply for many different applications. In particular, the disruptive trend toward big data has led to an explosion in the size and availability of training datasets for machine learning tasks. Training such models on large datasets to convergence can easily take weeks or even months on a single GPU [20, 95]. However,dnns rise into prominence is tightly coupled to the available computational power. Often, A single machine can’t finish a training task in the desired time frame [96]. An effective remedy to this problem is to utilize multiple GPUs to speed up training. Scale-up approaches rely on tight hardware integration to improve the data throughput. These solutions are effective but costly. Furthermore, technological and economic constraints impose tight limitations on scaling up [97]. In contrast, distributed deep learning (ddl) aims at scaling out to train large models using the combined resources of clusters of independent machines [97]. ddl can be studied from various perspectives, including the type of paral- lelism, type of concurrency, type of aggregation, and type of communication. In this section, different proposed methods are discussed based on different views. 45 46 chapter 4 distributed deep learning for scalable computing 4.1 Type of Parallelism In general, distributed deep learning can be divided into two main approaches: model parallelism and data parallelism. 4.1.1 Model parallelism Model parallelism distributes model parameters to several computational work- ers [98]. Each computer worker is in charge of distinct model parameters or layers. Because various neurons in the deep model are highly dependent on one another, each worker should share its output results with the other work- ers before proceeding with the calculation of the next layer. One prominent advantage of model parallelism is that training large models becomes viable since each worker may only keep a subset of the model, resulting in a reduced memory need. However, the model parallelism approaches have numerous significant flaws, including unbalanced parameter sizes and significant computing reliance in different levels of the deep model [98]. It isn’t easy, if not NP-complete [99], to divide the learning model into appropriate sections and allocate them to distinct compute nodes. Furthermore, an intricate technique must be designed to ensure the durability of model parallelism, which is challenging owing to computational reliance. Because of these difficulties, using model parallelism to expedite training is not straightforward [100]. This approach can conserve memory (since the complete network is not stored in one place) but incurs additional communication after every layer [24]. Model partitioning can be conducted by applying splits between neural network layers (=vertical split) or by splitting the layers (=horizontal split), as depicted in Figure 4.1. 4.1 type of parallel ism 47 Figure 4.1: Model parallelism. It can achieve through vertical or horizontal splits [97]. 4.1.2 Data Parallelism The other prominent form of distributed training is data parallelism [101, 102, 103, 104], which is seen in Figure 4.2. In this case, the model parameters are duplicated to all computer workstations, resulting in a faster training time. Each computing worker analyses separate mini-batches of data in a single iteration to calculate the local gradient changes, which are then exchanged with the other workers before updating the model parameters to the latest values. Figure 4.2: Data parallelism [25]. 48 chapter 4 distributed deep learning for scalable computing 4.2 Type of Aggregation From an aggregation view, we can have two different architectures, centralized and decentralized approaches. In a centralized approach, one or several central servers, often called parameter server, is responsible for updating a specific model parameter [103]. Parameter servers depend on the gradients computed by cluster nodes that perform back- propagation (workers). Figure 4.1 Illustrates the data flow during training in such a system. Centralized optimization allows the expensive task of computing per-parameter gradients to be distributed across the cluster machines. It also elegantly handles updating the model by pooling all communication at the parameter server. Thereby, per-parameter gradients for large amounts of train- ing, samples can be computed quickly. For large clusters, this frequent need for communication focused on the same network endpoints can quickly be- come a bottleneck [105, 106]. Therefore, most centralized-optimization-based ddl implement communication patterns where the parameter server role is distributed [97]. Figure 4.3: Centralized approach. On the other hand, the decentralized approach treats its workers as a swarm, in which each worker independently probes the loss function to find gradient descent trajectories to minima that have good generalization properties [107]. The famous approach in decentralized approaches are collective algorithms like Allreduce, and Ring-Allreduce [108]. 4.2 type of aggregation 49 4.2.1 Allreduce Algorithm To describe Allreduce, we consider P as the total number of the processes, and each process is uniquely identified as a number between 1 and P. We consider each process has an array of values (network parameters) that should be aggregated with other arrays on other processes. In this case, each process divides its own array into P subarrays, which we refer to as “chunks.” Let chunk[p] be the p-th chunk. These processes communicate in a ring mode, as in Figure 4.4. Each process sends chunk[p] to the next process while simulta- neously receiving chunk[p-1] from the previous. Then, process p performs the aggregation operation to the received chunk[p-1] and its own chunk[p-1] and sends the aggregated chunk to the next process p+1. In other words, every chunk travels all around the ring and accumulates a chunk in each process. After visiting all processes once, it becomes a portion of the final result array, and the last-visited process holds the chunk. Finally, all processes can obtain the complete array by sharing the distributed partial results. This process is achieved by circulating step again without aggregation operations, i.e., merely overwriting the received chunk to the corresponding local chunk in each pro- cess. The AllReduce operation completes when all processes obtain all portions of the final array [109, 110]. The aggregation operations of SUM, MAX, and MIN are frequently used. In distributed deep learning, the SUM operation is used to compute the mean of gradients. By repeating the receive-aggregate-send steps P-1 times, each process obtains a different portion of the resulting array. Figure 4.4: Illustration of Ring Allreduce. 50 chapter 4 distributed deep learning for scalable computing 4.3 Type of Concurrency 4.3.1 Concurrency in Network In this category, we compute the output of the layers or the whole network in a concurrent mode for the forward evaluation and backpropagation phases. For example, model parallelism divides the work according to the neurons in each layer. Different parts of the dnns are computed on distinct processors in various machines [24, 111]. With data parallelism, several replicas of a neural network model are created during training, each on a different worker (processor). For example, the replicas of the model are synchronized (i.e., either average gradients or parameters) at every step by communicating either with a centralized parameter server [103, 112] or decentralized using Allreduce [113, 106, 109]. 4.3.2 Concurrency in Training In this category, concurrency is used in the training stage. Multiple instances of training processes run independently on different machines. In this sense distributed training of ensembles is an entirely parallel process, requiring no communication between the workers [114]. Ensemble learning requires more memory and computational power in the training and inference phases. Therefore, knowledge distillation has been used in a two-step training to transfer knowledge of an ensemble with several networks to a single network [115, 116, 117]. To handle the problem of two-step training, Zhang et al. [118] investigated how an ensemble of students can learn collaboratively and teach each other throughout the training process. Kim et el. [119] introduced a fusion learning method that trains a robust classifier by integrating feature maps. Park et el. [120] used a feature-level ensemble for knowledge distillation by transferring the ensemble knowledge between multiple teacher networks. Although these methods can be trained in parallel, their main problem is accuracy, where the number of epochs is not taken into account. Codistillation [121] takes advantage of ensemble and mutual learning to speed up the training. Codistillation uses a distillation-like loss that penalizes predictions made by one model on a batch of training samples for deviating from the predictions made by other models on the same batch. 4.4 type of communication 51 4.4 Type of Communication 4.4.1 Synchronous From the communication perspective,ddl can also be divided into synchronous and asynchronous [24] methods. In synchronous systems, all computations hap- pen simultaneously. A global synchronization barrier prevents workers from progressing until all workers are in the same position. By avoiding deviations in progression between workers, synchronous systems can achieve efficient collaborative training at the expense of potentially underutilizing resources. In synchronize approach, the replicas synchronize (i.e., average either gradi- ents or parameters) at every step by communicating either with a centralized parameter server [103] or using all reduce [113] 4.4.2 Asynchronous Using asynchronous systems takes a more relaxed approach to organizing collaboration and avoids delaying the execution of one worker for another. However, asynchronous systems favor a higher utilization of hardware and regard deviations between workers as manageable side effects that can, in certain circumstances, even prove helpful. On the other hand, by relaxing the synchronization restriction and creating inconsistent models, training workers can update gradients asynchronously [122]. The Figure 5.3 depicts synchronous and asynchronous approaches in centralized data parallelism distributed deep learning. 52 chapter 4 distributed deep learning for scalable computing Figure 4.5: synchronous and asynchronous in centralized data parallelism distributed deep learning. 4.5 Communication Compression Regarding what to communicate, the exchanged gradients or models may be compressed by quantization or sparsification before transmission over net- work connections to minimize communication traffic while maintaining model convergence. 4.5.1 Quantization Quantization is a kind of compression technique that utilizes fewer bits to rep- resent data previously encoded in 32 bits on each dimension of the transmitted gradient. As a result, the gradients utilized to improve the models are imprecise after the quantized transmission. Quantization of transmitted gradients is di- rectly related to low-precision deep learning. Deep learning with low precision evolved in an environment where the CPU and GPU need faster calculation 4.6 communication overhead issue 53 and less memory to train dnns. Quantized communication is feasible if the low accuracy of gradients is sufficient to ensure training convergence. Numerous studies have previously addressed the convergence of deep learning under low-precision gradients [123, 124, 125, 126, 127, 125]. 4.5.2 Sparsification Sparsification techniques seek to minimize the number of items conveyed during each repetition. The essential concept of sparsification is that during Stochastic Gradient Descent (sgd) updates, only "significant" gradients are necessary to update the model parameter and ensure training convergence [128]. As seen in Fig. 6, a considerable part of the gradient vector’s coordinates are wiped out, eliminating the need to transmit zero-valued components. Gra- dient sparsification is a more severe compression method than quantization for reducing communication traffic [129, 130, 128, 131]. 4.6 Communication Overhead Issue To end this section, we present an example to show the main challenge in distributed deep learning. Most DL models contain a large number of param- eters to capture the complex features of the input data and their impact on the prediction results. In the parameter server approach, each worker needs to push the gradient of every parameter and pull every updated parameter from others through a parameter server. As a result, distributed training entails a substantial amount of data exchange between workers and servers. For example, Table 4.1 shows an experiment using one parameter server and three workers on 1 PS – 3 workers P3.2xlarge Instance Amazon EC2 – Tesla V100 with 10Gb Bandwidth for 100 Iteration [132]. An estimation of the commu- nication time and training time in each iteration is shown considering different network architectures. As we can see, most of the time spends on transferring the models between machines. In fact, different approaches in distributed deep learning try to address communication overhead from different aspects. 4.7 Thesis Approach: Distributed Deep Learning Expanding the deep learning techniques to process big data can significantly improve the overall performance. However, time constraints to training deep learning architectures are one serious obstacle to training a complex dlmodel. 54 chapter 4 distributed deep learning for scalable computing Table 4.1: Communication time versus training time in the parameter server [132]. Model # Param. Data Size (MB) Training Time Comm. Time AlexNet [14] 61.1M 488 x 4 1.99s 1.56s VGG-16 [95] 138M 1104 x 4 4.93s 3.53s VGG-19 [95] 143M 1104 x 4 5.15s 3.66s ResNet-152 [133] 60.2 M 481.6 x 4 1.86s 1.54s It is more significantly important when we have a large amount of training data along with complex dl architectures, especially in semi-supervised learning where we can increase unlabeled data to be involved in the training process easier. Table 4.2 shows the training time considering various semi-supervised methods and our proposed method in paper 2. Although our training dataset is not so sizable, we can observe high training time. It is important to consider which distributed learning method is more suitable for our proposed supervised and semi-supervised (proposed semi-supervised method is based on extracted feature space of the model) methods. Paper 3 specifically considered the scalability issue and proposed a new method for distributed deep learning. Our method is a new distributed learning approach based on knowledge distil- lation. Our method is interesting because: • Considering the feature space of distributed models, it is related to our proposed method in supervised and semi-supervised, and we can apply this method to extend them to distributed training. • Our method provides more parallelism while reducing the communica- tion cost. • It can be used on commodity hardware in contrast to other approaches that need high-speed and low latency networks. Table 4.2: Training time for different methods using 60 labeled data and 10000 unla- beled data on the UiT training dataset [134] on a single GPU (Quadro RTX 5000 16GB). Methods # Batch Size. Training time LP-DeepSSL(WideResNet) [73] 100 2h LP-DeepSSL [73] 100 3h TSLP-SSL [135] 100 6h MixMatch (WideResNet) [89] 100 10h SGANs [80] 64(best) 1h 5 Overview of Publications In this chapter, we present a summary of three publications and other contri- butions. 5.1 Paper Summaries 5.1.1 Paper 1 Salman Khaleghian,Habib Ullah, Thomas Kræmer,NickHughes, Torbjørn Eltoft, Andrea Marinoni, “Sea Ice Classification of SAR Imagery Based on Convo- lution Neural Networks”, Remote Sensing, 2021, 13(9). In the first paper, different training approaches in deep learning have been studied and a modified version of famous architecture called VGG-16 has been proposed. Most State-of-the-art methods are largely based on traditional statis- tical and machine learning methods. dl architectures for sea ice classification are in an early stage of development. In this sense, most works use basic simple shallow architecture and trivial learning methods. In this paper, the most recent deep architecture networks, and a self-design ad hoc architecture, have been studied. These methods are depicted in Figure 5.1. We consider two main approaches the first consists of modeling a custom or ad hoc architecture to analyze the problem. An ad hoc architecture is interesting as it offers high flexibility, but it generally requires the optimization of many hyper-parameters. 55 56 chapter 5 overview of publications In the second approach, where a given existing DL architecture is used in a new application domain, the existing architecture can either be fine-tuned based on already pre-trained parameters or trained from scratch. This approach signifi- cantly reduces the time to design the deep learning architecture. We explore the VGG-16 model [95], well-known network architecture developed for image recognition at the University of Oxford and has achieved high performance in many applications. Different training approaches, including transfer learning and re-training from scratch along with data augmentation, are discussed. We also studied the effects of max-pooling layers in the VGG-16 architecture and proposed a modified VGG-16 model for sea ice classification. We also compared it with three other reference models to show the stability and robustness of our modified VGG-16 model for sea ice classification. These reference models are MobileNetV2 [93], RestNet50 [20], and DenseNet121 [94]. Figure 5.1: Paper 1 considered deep learning approaches. We tested and assessed the results both qualitatively and quantitatively. The results showed that these complex architectures (such as those based on the VGG network) typically obtain promising classification results. Moreover, based on the evaluation of data augmentation, even if the quantitative performance improvement was only minor, the data extension technique seemingly can prevent over-fitting caused by a scarce training dataset. We also assessed the robustness of the trained CNN models when applied to SAR scenes collected at different spatial locations and times. We also found that the additive system noise in the SAR imagery is challenging in obtaining refined sea ice maps. Both the computational requirements and the additive system noise are crucial issues for the operational use of SAR data for sea ice classification. This work highlights the scarcity issue in sea ice classification and shows the significance of involving more training data. This finding was a base for our second paper on semi-supervised learning. 5.1 paper summaries 57 5.1.2 Paper 2 S. Khaleghian, H. Ullah, T. Kræmer, T. Eltoft, and A. Marinoni, "Deep Semi- supervised Teacher–Student Model Based on Label Propagation for Sea Ice Classification" in IEEE Journal of Selected Topics in Applied Earth Obser- vations and Remote Sensing, vol. 14, pp. 10761-10772, 2021, doi: 10.1109/JS- TARS.2021.3119485. Semi-Supervised Learning (SLL) [16] is employed, to extract accurate infor- mation from large-scale datasets, when a limited amount of labeled data are available. These methods aim to combine labeled data with unlabeled records. In the past few years, semi-supervised models have presented performance improvement in various fields of remote sensing research. To address the scarce train data in sea ice classification, we propose a teacher–student-based label propagation deep semi-supervised learning (TSLP-SSL) method. The overall architecture is illustrated in Figure 5.2. Our architecture consists of two models: a teacher and a student model. The teacher model is trained in a two-step pro- cedure. Initially, we trained the teacher model in a supervised fashion utilizing only the labeled data. We then feed both the labeled and unlabeled samples to the trained teacher model and consider the feature space embedding to engen- der pseudo-labels for the unlabeled data through a label propagation procedure [136, 92, 73]. In the next step, the original and the pseudo-labels are used to train the student model, which is subsequently used during the inference stage. Hence, our proposed method effectively exploits a relatively large amount of unlabeled data to improve the final classification performance. Figure 5.2: Overall architecture of proposed method in paper 2. To show our proposed method’s capabilities, we considered a limited number of 58 chapter 5 overview of publications labeled samples starting from 15 samples and unlabeled samples to efficiently train the models. In fact, our proposed method was characterized by the ability to learn practical information from both labeled and unlabeled data. Our method reduced the dependence on labeled samples, which is very time- consuming and costly to collect for sea ice analysis. Therefore, this property of our method makes it a good fit for the community of sea ice analysis, where limited labeled data are available. We have also shown that by adding more unlabeled samples, the performance of the inference results has improved our method can be extended to other problem areas considering the semi-supervised aspect, where a limited number of labeled samples are available. Since unlabeled data is available more than labeled data, the size of training data can be increased easily. Regarding the computational intensity of deep learning, the training time can be significantly increased in this situation. It leads to a consequential problem for hyperparam- eter optimization. In this regard, the first and second papers show the need for scalable computing, especially when unlabeled data is available. Therefore, the third paper addresses this issue to reduce the training time when a large amount of labeled and unlabeled data is available. 5.1.3 Paper 3 S. Khaleghian, H. Ullah, E. B. Johnsen, A. Andersen and A. Marinoni, "AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning," in IEEE Ac- cess, vol. 10, pp. 84569-84578, 2022, doi: 10.1109/ACCESS.2022.3197646. Deep Neural Networks (dnns) training is very computationally intensive. A sin- gle machine often can’t finish finishing a training task in the desired time frame. It is more significantly important when we have a large amount of training data along with complex dl architectures, especially in semi-supervised learning where we can increase unlabeled data to be involved in the training process easier. In fact, expanding the deep learning techniques to process big data can significantly improve the overall performance. However, time constraints to training deep learning architectures are one serious obstacle to training a complex dl model. Through the Ph.D. project, different methods have been investigated. Paper 3 specifically considered the scalability issue and proposed a new method for distributed deep learning. Our method is based on knowledge distillation. This method is interesting because: 1) it is related to our proposed method in supervised and semi- supervised, and we can apply this method to extend to distributed training. 2) 5.2 other contributions 59 this method provides more parallelism while reducing the communication cost. And 3) it can be used on commodity hardware in contrast to other approaches that need high-speed and low latency networks. Figure 5.3: Overall architecture of proposed method in paper 3. In contrast to the two-step distillation, our method trains n copies of a model in parallel and starts distillation early in the training process. We proposed a new loss function using an additional distillation term based on extracted features by the models. Our proposed method can tolerate longer update interval rates. We rarely update the models but still, we achieve the same accuracy with fewer epochs. In our method, distilling the knowledge between the models less frequently provides flexibility to the models in terms of learning diverse variations in the data. The overall architecture of this method is shown in Figure 5.3. The proposed method mainly targets distributed deep supervised learning. However, a possible extension of the semi-supervised version has been described in future works. 5.2 Other Contributions Dataset: A dataset for sea ice classification based on deep learning methods was published during the PhD Project. S. Khaleghian, J. P. Lohse, and T. Kræmer, “Synthetic-aperture radar (SAR) based ice types/ice edge dataset for deep learning analysis,” 2020. [Online]. Available: https://doi.org/10.18710/QAYI4O. Conference Papers: S. Khaleghian, T. Kræmer, A. Everett, Å. Kiærbech, N. Hughes, T. Eltoft, A. Mari- 60 chapter 5 overview of publications noni, “Synthetic aperture radar data analysis by deep learning for automatic sea ice classification”, EUSAR, Leipzig, Germany, June 2021. H. Ullah, S. Khaleghian, T. Kromer, T. Eltoft and A. Marinoni, "A Noise-Aware Deep LearningModel for Sea Ice Classification Based on Sentinel-1 Sar Imagery," 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, 2021, pp. 816-819, doi: 10.1109/IGARSS47720.2021.9553971. A. Marinoni, G. C. Iannelli, S. Khaleghian and P. Gamba, "On the Optimal Design of Convolutional Neural Networks for Earth Observation Data Analysis by Maximization of Information Extraction," IGARSS 2020 - 2020 IEEE Interna- tional Geoscience and Remote Sensing Symposium, 2020, pp. 3505-3508, doi: 10.1109/IGARSS39084.2020.9323521. Manolis Koubarakis, George Stamoulis, Dimitris Bilidas, Theofilos Ioannidis, George Mandilaras, Despina-Athanasia Pantazi, George Papadakis, Vladimir Vlassov, Amir H. Payberah, Tianze Wang, Sina Sheikholeslami, Desta Haile- selassie Hagos, Lorenzo Bruzzone, Claudia Paris, Giulio Weikmann, Daniele Marinelli,Torbjørn Eltoft,AndreaMarinoni,Thomas Kræmer,Salman Khaleghian, Habib Ullah, Antonis Troumpoukis, Nefeli Prokopaki Kostopoulou, Stasinos Konstantopoulos, Vangelis Karkaletsis, Jim Dowling, Theofilos Kakantousis, Mi- hai Datcu, Wei Yao, Corneliu Octavian Dumitru, Florian Appel, Silke Migdall, Markus Muerth, Heike Bach, Nick Hughes, Alistair Everett, Ashild Kiærbech, Joakim Lillehaug Pedersen, David Arthurs, Andrew Fleming, Andreas Czifer- szky.” Artificial Intelligence and Big Data Technologies for Copernicus Data: The ExtremeEarth Project.” Conference on Big Data from Space (BiDS21) 2021. Virtual event, 18-20 May 2021. Journal papers: Desta Haileselassie Hagos, Theofilos Kakantousis, Vladimir Vlassov, Sina Sheik- holeslami, Tianze Wang, Jim Dowling, Claudia Paris, Daniele Marinelli, Giulio Weikmann, Lorenzo Bruzzone, Salman Khaleghian, Thomas Kræmer, Torb- jørn Eltoft, Andrea Marinoni, Despina-Athanasia Pantazi, George Stamoulis, Dimitris Bilidas,George Papadakis,GeorgeMandilaras,Manolis Koubarakis, An- tonis Troumpoukis, Stasinos Konstantopoulos, Markus Muerth, Florian Appel, Andrew Fleming, and Andreas Cziferszky. "ExtremeEarth Meets Satellite Data From Space." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021 Presentations: Salman Khaleghian, Thomas Kræmer, Alistair Everett, Åshild Kiærbech, Nick Hughes, Torbjørn Eltoft, Andrea Marinoni. Deep learning for enhanced sea 5.2 other contributions 61 ice understanding. Arctic Frontiers 2020, Tromsø, Norway, January 26 - 30, 2020 Salman Khaleghian, Habib Ullah, Thomas Kræmer, Torbjørn Eltoft, Andrea Marinoni, "A deep semi-supervised learning method based on transductive label propagation for sea/ice classification", NORA Conference 2021, Bergen, Norway. This work is funded in part by Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN Grant no. 237906), the European Union’s Horizon 2020 research and innovation programme ExtremeEarth project, grant agreement no. 825258 (http://Earthanalytics.eu/) and by the Fram Center under the Automised Large- scale Sea Ice Mapping (ALSIM) "Polhavet" flagship project. In this regards,most of the findings has been presented in the project meetings. 6 Paper 1: Sea IceClassification of SARImagery Based onConvolution NeuralNetworks 63 remote sensing Article Sea Ice Classification of SAR Imagery Based on Convolution Neural Networks Salman Khaleghian 1,* , Habib Ullah 1, Thomas Kræmer 1 , Nick Hughes 2 Torbjørn Eltoft 1 and Andrea Marinoni 1   Citation: Khaleghian, S.; Ullah, H.; Kræmer, T.; Hughes, N.; Eltoft, T.; Marinoni, A. Sea Ice Classification of SAR Imagery Based on Convolution Neural Networks. Remote Sens. 2021, 13, 1734. https://doi.org/10.3390/ rs13091734 Academic Editor: John Paden Received: 1 March 2021 Accepted: 22 April 2021 Published: 29 April 2021 Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/). 1 Department of Science and Technology, UiT the Arctic University of Norway, NO-9037 Tromsø, Norway; habib.ullah@uit.no (H.U.); thomas.Kramer@uit.no (T.K.); torbjorn.eltoft@uit.no (T.E.); andrea.marinoni@uit.no (A.M.) 2 Norwegian Ice Service, Norwegian Meteorological Institute, P.O. Box 6314 Langnes, NO-9293 Tromsø, Norway; nick.hughes@met.no * Correspondence: salman.khaleghian@uit.no Abstract: We explore new and existing convolutional neural network (CNN) architectures for sea ice classification using Sentinel-1 (S1) synthetic aperture radar (SAR) data by investigating two key challenges: binary sea ice versus open-water classification, and a multi-class sea ice type classification. The analysis of sea ice in SAR images is challenging because of the thermal noise effects and ambiguities in the radar backscatter for certain conditions that include the reflection of complex information from sea ice surfaces. We use manually annotated SAR images containing various sea ice types to construct a dataset for our Deep Learning (DL) analysis. To avoid contamination between classes we use a combination of near-simultaneous SAR images from S1 and fine resolution cloud-free optical data from Sentinel-2 (S2). For the classification, we use data augmentation to adjust for the imbalance of sea ice type classes in the training data. The SAR images are divided into small patches which are processed one at a time. We demonstrate that the combination of data augmentation and training of a proposed modified Visual Geometric Group 16-layer (VGG-16) network, trained from scratch, significantly improves the classification performance, compared to the original VGG-16 model and an ad hoc CNN model. The experimental results show both qualitatively and quantitatively that our models produce accurate classification results. Keywords: convolutional neural network; ice edge detection; polar region; Sentinel-1; sea ice classification; synthetic aperture radar 1. Introduction Sea ice is a key environmental factor [1] that significantly affects polar ecosystems. Over the past decade, the Arctic has experienced dramatic climate change that affects its environment, ecology, and meteorology. The trends are more pronounced than in other regions, and this has been called the Arctic amplification [2] resulting in increasingly variable Arctic weather and sea ice conditions. These are already more extreme than at lower latitudes, and present challenges and threats to maritime operations related to resource exploitation, fisheries, and tourism in the northern areas [3,4]. Therefore, reliable and continuous monitoring of sea ice dynamics, coverage, and the distribution of ice types is important for safe and efficient operations, in addition to supporting detection of how the conditions are changing over longer timescales [5,6]. For example, Ren et al. [7] classified sea ice and open water from synthetic aperture radar (SAR) images using the U-Net model, and integrated a dual-attention mechanism into the original U-Net to improve the feature representations. Han et al. [8] introduced a method for sea ice image classification based on feature extraction and a feature-level fusion of heterogeneous data from SAR and optical images. Song et al. [9] proposed a method based on the combination of spatial and temporal features, derived from residual convolutional neural networks (ResNet) and long Remote Sens. 2021, 13, 1734. https://doi.org/10.3390/rs13091734 https://www.mdpi.com/journal/remotesensing Remote Sens. 2021, 13, 1734 2 of 20 short-term memory (LSTM) networks that allowed the extraction of spatial feature vectors for a time series of sea-ice samples using a trained ResNet network. Then, using the feature vectors as inputs, the LSTM network further learnt the temporal variation of the set of sea-ice samples. Subsequently, they fed the high-level features into a Softmax classifier to output the most recent ice type. For sea ice data analysis, SAR imaging plays a key role as the images acquired by air and satellite-borne platforms provide information that is not restricted by environmental factors and, importantly for Arctic monitoring, can continue to be collected during all weather conditions and through the polar night [10]. Recently, Deep Learning (DL) based methods have shown promising results in many application areas, including computer vision [11], information theory [12], and natural language processing [13]. These have been shown to have excellent generalization capabilities, particularly when properly trained on large datasets. These developments have therefore led to a belief that deep neural networks (DNNs) could lead to a significant improvement of automatic sea ice classification, considering the specific challenges related to this task. However, no applications based on this approach have yet made it into operational use. In our paper, we explore the performance and efficiency of some DL-based methods for sea ice classification from SAR imagery. DNNs are trainable multi-layer architectures composed of multiple feature-extraction stages, succeeded by a fully connected classification module. DNNs may consist of hundreds of layers, and their architecture can be feed- forward or recurrent, having different types of layers and activation functions, and the training can be achieved through many different optimization strategies. A DNN can be built from different combinations of fully connected, convolutional, maxpooling (sub- sampling), or recurrent layers. Due to their deep nature, they are often trained on large datasets, and in general are able to achieve low generalization errors. A convolutional neural network (CNN) [14,15] is a feed-forward network consisting of only convolutional layers, pooling layers, and fully connected layers. A CNN [16] is the type of DNN which is most commonly applied to analyzing visual imagery. In the convolutional layers, a CNN extracts features from the image in a hierarchical way by using multiple filters. Each filter consists of a set of weight parameters, which are iteratively adjusted and optimised using an optimization algorithm. These filters are applied to an input image to create a feature map that summarizes the presence of detected features in the input. The CNN learns the filter coefficients during training in the context of the specific problem, and uses pooling layers to sub-sample the output in such a way that the most prominent pixels are propagated to the next layer, dropping the rest. Here it provides a fixed sized output matrix, which is translation and rotation invariant. Sea ice classification based on SAR imagery is a very challenging task because, in addition to the sea ice characteristics, the radar signals are sensitive to imaging geometry, speckle noise [17], and the blurring of edges and strong anisotropies that may be produced by the SAR imaging process based on the backscattering of signals. In the literature, different methods [18–21] for sea ice classification based on SAR imagery have been presented and typically consider traditional machine learning and probabilistic approaches based on shallow learning strategies. Generally, shallow learning relies on handcrafted features like intensities, polarization ratios, and texture features, which may not encode well the large variations that sea ice may display. Therefore, their generalization capabilities are limited. To address these challenges, we explore deep learning networks for sea ice classification. Inspired by the success of DNNs in general, and CNNs in particular in many applications, we consider two main approaches when exploring DL networks. The first consists of modeling a custom or ad hoc architecture to analyze the problem. An ad hoc architecture is interesting as it offers high flexibility, but it generally requires optimization of many hyper- parameters. In the second approach, where a given existing DL architecture is used in a new application domain, the existing architecture can either be fine-tuned based on already Remote Sens. 2021, 13, 1734 3 of 20 pre-trained parameters, or trained from scratch. This approach significantly reduces the time to design the deep learning architecture. We explore the VGG-16 model [14], which is a well-known network architecture developed for image recognition at University of Oxford, and has achieved high performances in many applications. This architecture is the core in other architectures like the Fully Convolutional Networks (FCN) [22]. Different training approaches including transfer learning and re-training from scratch are discussed in Section 3.1. We also studied effects of maxpooling layers in the VGG-16 architecture and propose a modified VGG-16 model for sea ice classification. The main contributions of this paper are: 1. We present a deep learning based models for sea ice classification based on SAR imagery. One of the major attractions of these models is their capability to model sea ice and water distinctively in SAR images representing different geographic locations and timing. 2. We extensively evaluate the models on our collected dataset and compare it to both a baseline method and a reference method. Our results show that our explored model outperforms these methods. 3. We categorize state-of-the-art methods and present a comprehensive literature review in this area in the next section. The rest of the paper is organized as follows. In Section 2, related work is presented. Section 3 reports our proposed deep models and training strategies. Section 4 presents the experimental results on multiple SAR scenes. Finally, Section 5 outlines the conclusion and final remarks. 2. Related Work Sea ice type classification is a major research field in the exploitation of SAR images and has been the subject of research for more than 30 years [23]. The literature on this topic is quite extensive, and here we highlight only a few of the more recent studies. In general, sea ice classification methods fall into three categories: probabilistic/ statistical methods, classical machine learning methods, and deep learning based methods. In the first category, Moen et al. [24] investigated a Bayesian classification algorithm based on statistical and polarimetric properties for automatic segmentation of SAR sea ice scenes into a specified number of ice classes. Fors et al. [25] investigated the ability of various statistical and polarimetric SAR features to discriminate between sea ice types and their temporal consistency within a similar Bayesian framework, finding that the relative kurtosis, geometric brightness, cross-polarisation ratio and co-polarisation correlation angle are temporally consistent, while the co-polarisation ratio and the co-polarisation correlation magnitude are temporally inconsistent. Yu et al. [26] presented a sea ice classification framework based on a projection matrix, which preserves spatial localities of multi-source images features from SAR and multi-spectral images. By applying a Laplacian eigen-decomposition to the feature similarity matrix, they obtained a set of fusion vectors that preserved the local similarities. The classification was then obtained in a sliding ensemble strategy, which enhances both the feature similarity and spatial locality. In a recent paper, Cristea et al. [27] proposed to integrate the target-specific incidence-angle- dependent intensity decay rates into a non-stationary statistical model. The decay of the intensities of co-polarized SAR signals with incidence angle is dependent on the nature of the targets, and this decay impacts the segmentation result when applied to wide-swath images. By integrating the decay into the Bayesian segmentation process, this deteriorating effect is alleviated and cleaner segmentation results are obtained. In the second category of classical machine learning, Orlando et al. [28] used a multi- layer perceptron classifier to perform multi-class classification considering first-year ice, multi-year ice, icebergs, and the shadows cast by icebergs. Alhumaidi et al. [29] trained a neural network classifier using polarization features alone, and polarization features plus multi-azimuth “look” Ku-band backscatter for sea ice edge classification. Their method demonstrated a slight advantage in combining polarization and multi-azimuth ‘look’ Remote Sens. 2021, 13, 1734 4 of 20 over using only co-polarized backscatter. Bogdanov et al. [30] also used a multi-layer perceptron classifier for sea ice classification in the winter season, based on a multi-sensor data fusion using coincident data from both the ERS-2 and RADARSAT-1 SAR satellites, low-resolution television camera images, and image texture features. They assessed the performance of their method with different combinations of input features and concluded that a substantial improvement can be gained by fusing the three different types of data. Leigh et al. [31] proposed a support vector machine (SVM) based ice-water discrimination algorithm considering dual polarization images produced by RADARSAT-2, and extracting texture features from the gray-level co-occurrence matrix (GLCM), in addition to backscatter features. Lit et al. [32] introduced a sea ice classification method based on the extraction of local binary patterns, and subsequently used a bagging principal component analysis (PCA) to generate hashing codes of the extracted features. Finally, these hashing codes were fed into an extreme learning machine for classification. Park et al. [33] extracted texture features from SAR images and trained a random forest classifier for sea ice classification. Their method classifies a SAR scene into three generalized cover types, including ice-free water, integrated first-year ice, and old ice. Zhang et al. [34] introduced a conditional random fields classifier for sea ice classification for Sentinel-1 (S1) data that has been applied to SAR scenes from the melt season in the Fram Strait region in the Arctic, and is based on the modeling of backscatter from ice and water to overcome the effects of speckle noise and wind roughened open water. In the third category, we present DL-based methods for sea ice classification. These methods have been widely used in analyzing Earth observation (EO) data, but the literature is very limited when it comes to the analysis of sea ice data. Previous work can be categorized into two main approaches, namely ad hoc architectures and well-established, existing architectures. For ad hoc architectures, one can, for example, freely determine hyperparameters, including the number of layers, the number of nodes in a particular layer, and the training technique. Many researchers have created ad hoc architectures for handling specific problems [35–39]. Some of the popular existing architectures are the AlexNet [40], the VGG net [14], and the GoogLeNet [41]. There are three sub-approaches on how to train the network when considering the use of existing architectures: (1) re-training the architecture from scratch, (2) using transfer learning and fine-tuning the architecture based on problem specific training data, or (3) applying feature extractors. In the case of re- training from scratch on a new training dataset, the weights of the architecture are randomly initialized. In the case of transfer learning, pre-trained weights are copied and fine-tuned with the new data. All weights may be adjusted, or only some of the network’s layers are re-trained and fine-tuned with new training data [42]. For example, Castelluccio et al. [43] fine-tuned two existing architectures to perform semantic classification of remote sensing data, namely the CaffeNet and the GoogLeNet, and showed significant performance improvements. Wang et al. [38,39] used deep ad hoc CNNs for ice concentration estimation. Kruk et al. [44] used DenseNet [45] for finding ice concentration and ice types considering dual-polarization RADARSAT-2 SAR imagery by fusing the HH and HV polarizations for the input samples. Han et al. [46] introduced a hyperspectral sea ice image classification method based on spectral-spatial-joint features with deep learning. Initially, they extracted sea ice texture information from the GLCM and then a three-dimensional deep network to extract deep spectral-spatial features of sea ice for classification. Gao et al. [47] proposed a deep fusion network for sea ice change detection based on SAR images. They exploited the complementary information among low, mid, and high-level feature representations, and for optimizing the network’s parameters, they used a fine-tuning strategy. Petrou and Tian [48] used a DL approach [49] to predict sea ice motion for several days in the future, given only a series of past motion observations. Their method is based on an encoder- decoder network and to calculate motion vectors, they used sea ice drift derived from daily optical images covering the entire Arctic. Their model learnt long-time dependencies within the motion time series and captured spatial correlations among neighboring motion vectors. Remote Sens. 2021, 13, 1734 5 of 20 3. Method Our work falls in the third category, namely DL-based methods, and is inspired by the success of the versatile CNNs in many different applications [14]. We perform both binary and multi-class classifications. In binary classification, we categorize different types of sea ice into one class and water into another class. In multi-class classification, we consider four different ice types that correspond to the World Meteorological Organization (WMO) ice types classification [50]. To model the effects of incidence angle, we create a patch-based training dataset which includes incidence angle as a separate image channel. We explore three different CNN models for sea ice classification, including an ad hoc CNN architecture designed from scratch, a VGG-16 model [14], considering both transfer learning and re- training from scratch, and a modified version of the VGG-16 model. The ad hoc architecture is a new CNN, where we explore different numbers of convolutional and maxpooling layers to examine the impact on their classification performance. We also studied effects of maxpooling layers in the VGG-16 architecture and propose a modified VGG-16 model for sea ice classification. When it comes to the training of the CNN architectures, it is worth noticing that there is no pre-prepared, publicly available sea ice classification datasets. Therefore, in our work, we train and test all the architectures considering a sea ice dataset that we have carefully generated ourselves from a combination of overlapping SAR and optical satellite images, supported by expert evaluations from sea ice analysts. Our dataset consists of 31 SAR images from north of the Svalbard archipelago collected between March and September during 2015–2018. In order to reduce the effect of overfitting of the models during the training process due to scarce training data, we use an augmentation technique to extend the training set. 3.1. CNN Models for Classification The ad hoc CNN model we investigate consists of three convolution layers along with an equal number of maxpooling layers. For these layers, the number of kernels/filters are 32, 64, and 64, respectively. Our model also consists of three fully connected layers with 1024, 512, and 2 nodes, respectively. We also use dropout, a regularization technique to avoid overfitting, where we set the dropout probability equal to 0.5. The specification of the ad hoc model is depicted in Figure 1. Figure 1. Ad hoc CNN. Our proposed CNN architecture consists of three convolution layers and three fully connected layers. At the input, there are training images. We extract patches from these images and feed them to the network during the training process. We also explore the VGG-16 model [14] for sea ice classification. The architecture of this model is depicted in Figure 2. The fully connected layers are the same for both architectures, but the convolution layers are different, and hence the extracted output Remote Sens. 2021, 13, 1734 6 of 20 features from the convolution layers are not the same. In both architectures, the rectified linear unit (ReLU) activation functions [51] have been used in all layers, except for the last layer, where the SoftMax function [52] is used. We use the cross-entropy loss function and Adam optimizer [53] in the training process. The batch size, which refers t o the number of training examples utilized in one iteration, was set to 50 patches. In case of the VGG-16 network, we adopt both training from scratch using our sea ice training dataset and a transfer learning strategy. Training from scratch provides insight on the impact of a deeper network in relation to the sea ice classification task. In this case, we adjust all the weights during the training process, starting from a random initialization. For transfer learning, we readjust the weights during training, following the setup obtained for a specific application. We tested each network model with different sizes of the input patches. Furthermore, we use an augmentation technique to extend our labelled dataset. Here it is supposed to consolidate the architectures both in the feature extraction and the classification stages. In data augmentation, the training data is processed using multiple patch-wise operators and transformations. We used the augmentation strategy of Buslaev et al. [54]. According to the strategy, we perform horizontal flip, rotation with 90 degrees, blurring, and random changes to both brightness and contrast. The data augmentation technique aims to improving the robustness of both architectures by focusing on the structure of the classes, and should help both architectures to be independent of changes in brightness and contrast. Figure 2. VGG-16 CNN. The architecture of this network consists of several convolution layers and three fully connected (FC) layers. The FC layers are represented by dense layers. At the input, there are training images. We extract patches from these images and feed them to the network during the training process. 3.2. Modified CNN Model for Classification We also propose a modified VGG-16 model [14] for sea ice classification. In general, convolutional neural networks introduce equivariance to translation. It means that if an object moves along the height or width axis of an image, the activation translated to the output will be the same. However, this is not true for rotations and changes in the illumination. It can be described intuitively by thinking about a filter that picks up horizontal edges. This filter can find all the horizontal edges in the image, but it cannot detect vertical edges. To this end, a maxpooling layer adds translational invariance [55]. If we consider a pooling layer with a window size of 2 × 2 and a stride equal to 2, it does not matter in which of the four locations the big activation is, the output of the pooling layer will be the same. However, this is not always desired. For example, translation invariance is not desired for face-recognition where the exact distances between eyes and the nose are crucial. In this case, you would not want to reduce the use of pooling. Remote Sens. 2021, 13, 1734 7 of 20 In sea ice classification, we are not looking for a specific object, and the texture of the classes is important. The use of many maxpooling layers severely affects the networks ability to encode texture characteristics. In a large network, a maxpooling layer shrinks the image size and saves computation. On the other hand, this process limits the minimum input patch size which can be used. For example, the smallest input image that can be used in VGG-16 is 32 × 32. If smaller patches are fed to the network, there will be no output image after the convolution and maxpooling layers to feed to the fully connected layers. We investigate the effect on sea ice classification of removing the last maxpooling layer in the VGG-16 architecture. By reducing the maxpooling layers, we suppress the the translational invariance property of the VGG-16 network, and simultaneously reduce the minimum input size that is allowed to be used. 4. Experimental Results 4.1. Dataset To test the deep CNN models for sea ice classification, we created an annotated dataset building on the work of Lohse et al. [56]. This is based on 31 Sentinel-1A Extended Wide (EW) Level-1 Ground Range Detected (GRD) scenes, with a spatial resolution of 40 m × 40 m, that were acquired north of the Svalbard archipelago in winter months between September and March during the period 2015–2018. Four sample images from our dataset are shown in Figure 3. Our dataset can be accessed from the provided link https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/QAYI4O (accessed on 1 March 2021). The images were pre-processed by applying a thermal noise removal algorithm in the European Space Agency (ESA) Sentinel Application Platform (SNAP) software [57], calibrated using the σ0 look-up table, and multi-looked using a 3 × 3 boxcar filter. After conversion to dB scale, the images were clipped and scaled linearly in the range [0, 1], considering the dual-polarization intensity channels individually, and including a third input-channel representing the incident angle. The range for the co- polarization (HH) is −30 to 0 dB, for the cross-polarization (HV) it is −35 to −5 dB, and for the incidence angle 19 to 46 degrees. A set of polygons representing homogeneous sea ice types was subsequently manually annotated with labels for those types, taking into account additional information from co-located and nearly temporally coincident optical image data from Sentinel-2. Patches were then extracted from these polygons for 5 different classes representing: Water (including ice-free water (windy), ice-free water (calm), and open water in leads), Brash/Pancake Ice, Young Ice, Level First-Year Ice, and Deformed Ice (including both first-year and multi-year ice). The stride between patches was 10 pixels. In Table 1, we provide the code values for the ice types related to the stage of development (ice age), as defined by the SIGRID-3 vector archive format for sea ice georeferenced information and data ([58]), the class names, and the number of samples for each class for a patch size of 32 × 32. It is worth noticing that we have an imbalanced dataset, where the number of samples for each class has considerable variation. This is a result of the effort we made to accurately annotate the polygons, and hence the number of polygons was small and not representing all classes equally. For binary Water/Ice classification, we grouped the samples into two classes, namely Water and Ice. Our motivation for performing binary classification is to investigate if deep models can distinguish between sea ice and water, which would subsequently allow for quantitative sea ice concentration mapping. The number of samples of water and ice for different patch sizes are shown in Table 2, and there is not a class imbalance problem in this case. For all the tests, 80 percent of the dataset was used for training and 20 percent for validation. Remote Sens. 2021, 13, 1734 8 of 20 Figure 3. Sample images. Four different input SAR images are presented in both the rows from our collected dataset. The light vertical lines in all the images represent SAR additive noise. Table 1. Number of samples for different classes namely open water/leads with water, brash/pancake ice, young ice (YI), level first year ice (FYI), and old/deformed ice. Codes Classes 32× 32 02 Open Water/Leads with Water 9318 01–02 Brash/Pancake Ice 159 83 Young Ice (YI) 202 86–89 Level First-Year Ice (FYI) 213 95 Old/Deformed Ice 9137 Table 2. Number of samples for sea/ice classification in different patch sizes. For example, the total number of patches are 19,029 in case of patch size equal to 32 × 32. Patch Size Total Ice Sea 10 × 10 22,999 12,723 10,276 20 × 20 21,020 11,301 9719 32 × 32 19,029 9711 9318 36 × 36 18,469 9255 9214 46 × 46 17,237 8255 8982 We would like to mention that in the inference experiment, we used completely different images. These were another 4 scenes from north of Svalbard, and 8 scenes from Danmarkshavn, East Greenland that were each collected during separate months in 2018. 4.2. Model Accuracies 4.2.1. Patch Channels and Sizes In the first study case, we report the validation accuracy by considering three different channel compositions. We calculate the validation accuracy for a patch by checking if the predicted class is the same as the true class, and by comparing the index of the highest scoring class in the predicted vector with the index of the actual class in the ground truth vector. It is interesting to use the HH polarization alone since it generally has a Remote Sens. 2021, 13, 1734 9 of 20 stronger signal and is less affected by additive noise. However, the HV polarization is more sensitive to ice types during freezing conditions and provides information about the different classes [56]. Furthermore, it is well-known that the radar backscatter from sea ice is dependent on the incidence angle, with lower incidence angles appearing brighter [56]. In order to study the importance of this effect for different classes, we included the incidence angle as a separate input channel. Hence, we consider three alternatives. First, we extracted one-channel patches using only the HH polarization. Secondly, we extracted two-channel patches, with both the HH and HV polarizations as inputs. Finally, we extracted three- channel patches by considering the HH and HV channels, plus the incidence angle. The results for these channel compositions are summarized in Table 3 for the ad hoc CNN, using a patch size of 32 × 32. As can be seen, the composition of the input patches affects performance of the model, with a large improvement due to adding the HV channel to the HH, and another small improvement by adding the incidence angle. The improvement of adding the incidence angle is surprisingly small. However, based on the validation results for the ad hoc CNN, we will use all three channels in our next experiments. Table 3. Validation accuracy of ad hoc CNN for different Patch compositions including HH, HH-HV, and HH-HV-incidence angle. The patch size is equal to 32 × 32 with spatial resolution 1440 m2. HH HH, HV HH, HV, Incidence Angle Validation Accuracy 88.4% 98.2% 98.4% Next, we studied the effect of using different patch sizes for the three-channel case. We consider the ad hoc architecture and input patch sizes of 10 × 10, 20 × 20, 32 × 32, 36 × 36 and 46 × 46, respectively. The validation results in Table 4 show that the accuracy improves with the increase in the patch size. However, this improvement comes at the cost of a lower spatial resolution as larger patches cover wider areas of the surface. Note that for S1 EW GRD images each pixel covers 40× 40 meters on the Earth surface and, for example, a patch size equal to 46 × 46 covers a 1840 × 1840 square meters area. This patch will be classified as water if the majority of the pixels represent water and would be a problem at ice edges as classification based on larger patches would lead to coarser or non-smooth edges. Hence, there is a trade-off between accuracy and resolution. We used smaller patch sizes in our other experiments. Table 4. Validation accuracy using ad hoc CNN with different patch sizes including 10 × 10, 20 × 20, 32 × 32, 36 × 36, and 46 × 46. 10× 10 20× 20 32× 32 36× 36 46× 46 Validation Accuracy 95.54% 97.49 % 98.53 % 98.75 % 99.09% Spatial Resolution (meter) 400 800 1280 1440 1840 4.2.2. Different Training Strategies In this section, we study the performance of the VGG-16 architecture for sea ice classification from SAR images under different training strategies. These strategies include: (a) training the network by transfer learning, where the pre-trained network is trained on the ImageNet dataset, (b) training the network from scratch, (c) training the network from scratch, with an augmented dataset, (d) training the modified VGG-16 network from scratch considering the augmented dataset with a patch sizes equal to 32 × 32, and (e) similar to (d) with a patch size of 20 × 20. Transfer learning and data augmentation are well-known learning strategies that have been successfully applied in computer vision applications. The image formation process for SAR images is fundamentally different from optical images, and our objective here is to understand if these techniques are also suitable for the sea ice classification task using SAR data. For training our model, we consider the learning rate equal to 0.001 and batch size equal to 20. Remote Sens. 2021, 13, 1734 10 of 20 The number of convolutional layers of the VGG-16 network is different from the ad hoc network. Therefore, the extracted features are different in these architectures, and presumably also their classification performances. We present the classification results related to the VGG-16 network with different training approaches in Table 5. As can be seen, when the network is trained by transfer learning, the validation accuracy is equal to 97.9%, whereas when the same network is trained from scratch, the accuracy is 99.5%. Figure 4 shows the training and validation losses for these two cases, transfer learning in the left panel and ’from scratch’ learning in right, respectively. We note that the validation losses in both panels show increasing trends after the point where the training losses indicate conversion, meaning these networks suffer from overfitting. Figure 4. Validation and training losses. Considering the VGG-16 network trained with transfer learning (left), the validation loss is increasing and the training loss is decreasing. Considering the VGG-16 network trained from scratch (right), the Validation loss is increasing and the training loss is decreasing. Therefore, both the networks are facing the problem of overfitting. Table 5. The comparison of the validation accuracy of different models: VGG-16 transfer learning, VGG-16 trained from scratch, VGG-16 trained from scratch with augmentation, VGG-16 modified with augmentation considering two different pixel resolutions. Training Strategies Validation Resolution ResolutionAccuracy in Pixels in Meters VGG-16 transfer learning 97.9 32 × 32 1280 VGG-16 trained from scratch 99.5% 32 × 32 1280 VGG-16 trained from scratch with augmentation 99.79% 32 × 32 1280 VGG-16 Modified + augmentation 99.89% 32 × 32 1280 VGG-16 Modified + augmentation 99.30% 20 × 20 800 The issue of overfitting is often related to sparse training data, and can be remedied by extending the training set using data augmentation. We demonstrate this for the strategy of training the network from scratch, since as shown in Table 5 it has the best performance. We train the VGG-16 network from scratch with the augmented data according to the augmentation strategy described above. Figure 5 presents the corresponding training and validation loss curves, and, as can be noted, both the validation and training losses are decreasing, hence, showing better generalization capabilities. Table 5 shows that data augmentation also improves the classification results. In fact, we achieve a validation accuracy equal to 99.79%, which is remarkably good. We also report the validation accuracies of the modified VGG-16 network trained from scratch using the augmented dataset considering two different patch sizes, namely 32 × 32 and 20 × 20. We remind the reader that this architecture is designed to have a reduced number of maxpooling layers, and hence would allow for better texture preservation and smaller input patch sizes. Table 5 also displays the validation accuracies for these models, and as can be noted, the modified VGG-16 networks achieves very high validation Remote Sens. 2021, 13, 1734 11 of 20 accuracies. We also perform a comparison with three other reference models to show the stability and robustness of our modified VGG-16 model for sea ice classification. These reference models are MobileNetV2 [59], RestNet50 [60], and DenseNet121 [45]. The performance of our model in comparison with these reference models is presented in Figure 6 in the form of validation accuracy over time. As can be seen, our model presents higher and consistent validation accuracy. Based on these experimental analyses, we observe that the modified VGG-16 network trained from scratch with the augmented data provides the highest accuracies. This leads us to conclude that in the case of sea ice classification from SAR data, training the network from scratch with an augmented dataset enables better adjustment and learning of the sea ice characteristics. Transfer learning, with pre-training on ImageNet data, which is fundamentally different from SAR data, does not allow the same adaptation to the data. Moreover, by reducing the number of maxpooling layers, the network better preserves the structure of the data and shows improved performance. Figure 5. Validation and training losses. Considering the VGG-16 network trained from scratch with the augmented data, both the validation and training losses are decreasing showing the better generalization capability of the network. Figure 6. Validation accuracy. We present the performance of our model in comparison with MobileNetV2 [59], RestNet50 [60], DenseNet121 [45] in the form of validation accuracy. As can be seen, our model shows higher and consistent validation accuracy over time. Remote Sens. 2021, 13, 1734 12 of 20 4.3. Inference Results In order to assess the robustness of the proposed approaches, we investigated the classification results for four new SAR scenes from north of Svalbard, i.e., scenes that are not part of the training data, by presenting the results as qualitative ice versus water maps. To this aim, we set up the inference experiment in a patch-wise manner, where the images are partitioned into non-overlapping patches, and the classification is performed on the entire patches. Figure 7 shows the four input images from north of Svalbard in the first row. In the same figure, the patch-wise results of the ad hoc CNN are presented in the second row, the results of the VGG-16 model trained with transfer learning are presented in the third row, the results of the VGG-16 model retrained from scratch without the augmented data are presented in the fourth row, the results of the VGG-16 model retrained from scratch with the augmented data are presented in the fifth row, and the results of the modified VGG-16 model trained from scratch with augmented data are presented in the sixth row. Areas consisting of water are annotated in blue and areas consisting of sea ice are annotating in white. For better visualization, we applied a land mask to detect land areas, and the black regions in the images represent land areas. We zoom in on parts of some images to highlight specific details. The classification results obtained with ad hoc CNN (second row) are not satisfactory. The classified images are severely affected by the banding additive noise pattern, as can be clearly seen in columns two and three. The VGG-16 trained with transfer learning (third row) does not classify sea ice areas properly. In fact, open water and newly formed sea ice often have lower radar backscatter values in HV than in HH channels.These cross-polarization values are closer to the noise floor and therefore often have a lower signal-to-noise ratio producing artifacts due to different noise patterns. It can lead to problems during the interpretation of sea ice maps because the added intensity corrupts the true back scattered signal of the sea ice region. In Figure 7, The VGG-16 retrained from scratch without using the augmented data (fourth row) is better than ad hoc CNN and VGG-16 trained with transfer learning. However, there are still some misclassifications, as can be seen in the first column. The second last row presents the results obtained with the VGG-16 model retrained from scratch with the augmented data. The last row presents the results obtained with the modified VGG-16 retrained from scratch with the augmented data. For the modified VGG-16 model, we reduced the number of maxpooling layers. In this case, the noise seems to be quite well handled, as can be seen in the second column of the last row. However, there is still some noise effects in the third column. Hence, it is worth noticing how the results are affected by the additive noise, which can be seen in the original images (row one) as distinct bands marking the different sub-swaths, and in particular the case when the ad hoc CNN and VGG-16 with transfer learning are considered. Nevertheless, the results obtained by using VGG-16 trained from scratch appear to be more robust against the noise. From this experimental analysis, we conclude that the patch-wise classification results seem to be better when the training data obtained from data augmentation is used to train the VGG-16 model from scratch. The improvement is evident in the last row of Figure 7. To further show the generalization performance of the CNN models for ice versus water classification, we also tested the models on images acquired from a different Arctic region, the area offshore of Danmarkshavn, East Greenland (76°46′ N, 18°40′ W). Here the Norwegian Meteorological Institute provided vector polygon data representing manually interpreted sea ice areas for the SAR data [61], which consisted of eight images, corresponding to eight different months of the year. These including both the freezing and melting seasons, and were then analyzed with the trained architectures. Figure 8 displays the classification results corresponding to the modified VGG network, trained from scratch with data augmentation, using patch sizes of 32 × 32 and 20 × 20. Remote Sens. 2021, 13, 1734 13 of 20 Figure 7. Patch-wise results considering patch size equal to 32 × 32. The first row presents the original images in two-bands. The second row presents results using ad hoc CNN, the third row presents results using VGG-16 with transfer learning. The fourth row presents results using VGG-16 trained from scratch without augmentation. The fifth row presents results obtained using VGG-16 trained from scratch with augmentation. The sixth row presents the results of modified VGG-16 trained from scratch considering patch size equal to 20 × 20. Ice is annotated in white and water is annotated in blue. The land mask is annotated in black. As can be seen, the overall performance is good. It is also noticed that the results obtained with patch size equal to 32 × 32 are better than the results obtained with patch size equal to 20 × 20. The larger patch-size seems to be less affected by the noise and therefore we conclude that a patch size equal to 32 × 32 is a better choice for Sentinel-1 SAR images corrupted by additive noise. Overall, our experimental analysis shows that the VGG-16, when trained from scratch with augmented data, presents very good classification results when trained in a supervised fashion. Remote Sens. 2021, 13, 1734 14 of 20 Figure 8. Patch-wise inference results of VGG-16 trained from scratch with 32 × 32 and 20 × 20 patches. The network is trained on north of Svalbard images and tested on the new region, Danmarkshavn, East Greenland. Different images in freezing and melting seasons are shown. To better characterize the quality of the sea ice classification, it is important to distinguish between ice edges and water. Therefore, we also present the performance of our proposed method considering the ice edges of 16 January 2018 as depicted in Figure 9. For this purpose, we overlay the ice polygons (Norwegian Meteorological Institute [61]) from the Danmarkshavn region over the geo-referenced classified image from our method. Overestimation means predicting a larger sea ice area than the manually labelled cover area. Underestimation means predicting a smaller sea ice area than the manually labelled cover area. As can be seen, our proposed method performs effectively to separate ice edges from the water, although there remains some minor overestimation of the sea ice extent in some areas which is preferable to underestimating. However misclassification still occurs in interior areas of the ice pack where there is low backscatter from both cross- and co-polarization such as for areas of level, undeformed landfast ice close to the Greenland coast. An assessment of the accuracy of the ice edge, based on the Integrated Ice Edge Error (IIEE) metric [62], was performed on this example against a selection of other data sources. In Table 6 it can be seen that the contribution to the error from classifying ice as water (under-representing the ice) is consistent with all the products (4646 to 6632 km2) that are compared, as these have fairly good agreement on the presence of landfast ice. There is also a similar level of error against products with accurate ice edges (1522 to 3766 km2) such as the manually analyzed polygons introduced earlier [61], the Norwegian Ice Chart from the Norwegian Meteorological Institute (https://cryo.met.no/en/latest-ice-charts, accessed on 1 March 2021) which is the routine operational analysis produced by an ice analyst, and the sea ice concentration (SIC) produced by the University of Bremen from Advanced Microwave Scanning Radiometer 2 (AMSR2) data [63]. Products based on low resolution passive microwave radiometry, for example the EUMETSAT Satellite Application Facility on Ocean and Sea Ice (OSI SAF) SIC that uses Special Sensor Microwave Imager/Sounder (SSMIS) [64], are less capable of resolving the ice edge, and here there is a far greater contribution to the IIEE (10,797 km2) because the SAR classification correctly identifies sea ice. Remote Sens. 2021, 13, 1734 15 of 20 Table 6. Area differences and IIEE scores for the 16 January 2018 VGG-16 results, (Figure 9) against 4 different sea ice data products: manual analysis [61], Norwegian Ice Chart, OSI SAF SIC [64], and University of Bremen AMSR2 SIC [63]. Products Overestimation km2 Underestimation km2 IIEE km2 Manual Analysis [61] 3766 4646 8412 Norwegian Ice Chart 1522 6632 8155 OSI SAF SIC [64] 10,797 5482 16,279 Bremen AMSR2 SIC [63] 2637 5966 8604 Figure 9. Ice edges. We overlay the manually analyzed polygons from the Danmarkshavn region over the classified images from our method to show the effectiveness of our proposed method considering ice edges. The polygons highlighted in the light red color represent the manual analysis, the light grey color represents ice, the dark grey color represent water, and the white color represents overestimated ice from our method. We have also extended our experimental analysis to multi-class sea ice type classifica- tion considering five images from the Danmarkshavn region. The results are depicted in Figure 10. In this classification experiment, we used the modified VGG-16 model trained from scratch with the dataset from north of Svalbard as shown Table 1. We would like to emphasize that our dataset is scarce and unbalanced, with an unequal number samples from the ice types. This is affecting the classification performance, and the results presented in Figure 10 are slightly biased toward ice types where we have more samples than others. The effect of the imbalance data can be seen in Figure 10, where brash/pancake ice is detected in the right-hand side of the right-most image, which apparently is a dense ice area. In general, brash/pancake ice is located at the edges towards open water. Despite this problem, the results indicate that the VGG-16 trained from scratch shows promising performance in distinguishing different ice types as well as binary ice versus water classification. Remote Sens. 2021, 13, 1734 16 of 20 Figure 10. Multi-class ice types classification using 32 × 32 patches using the network trained from scratch by considering multi-class ice types in Table 1 from north of Svalbard and tested on the Danmarkshavn region. We also present the inference result obtained considering only the HH channel. In Figure 11, the left column shows the input SAR image and the right column shows the inference results. As can be seen, the inference result lacks coherency to distinguish sea ice from water. Therefore, both the HV channel and incident angle contribute to the process of properly training the model. Figure 11. The input HH SAR image is shown on the left side and the inference result considering only the HH channel is shown on the right side. The color of the input image is different from the ones reported earlier because in this case we have only HH channel. As can be seen, the result lacks coherency to distinguish sea ice from water. 5. Conclusions In this work, we explored the potential of different CNN models for sea ice classification. We tested and assessed the results both qualitatively and quantitatively. The results showed that these complex architectures (such as those based on the VGG network) typically obtain promising classification results. Moreover, we evaluated the value of data augmentation, Remote Sens. 2021, 13, 1734 17 of 20 and found that even if the quantitative performance improvement was only minor, the data extension technique seemingly can prevent over-fitting caused by a scarce training dataset. We also assessed the robustness of the trained CNN models when applied to SAR scenes collected at different spatial locations and times. Even though our analysis is limited to only a few scenes, our findings are positive and show that the models have good potential. The computational processing to obtain the inference result for a single high resolution SAR image requires a few minutes on a typical desktop computer. We also found that the additive system noise in the SAR imagery is a challenging problem to obtaining refined sea ice maps. Both the computational requirements and the additive system noise are important issues for the operational use of SAR data for sea ice classification. We also trained our models to perform multi-class classification. In this preliminary study, we had a scarce and unbalanced dataset, which obviously affected the output, but the analysis still showed promise. This motivates us to carry out our research in this direction. In our investigation we performed patch-wise classification which degrades the spatial resolution. Future work will address a pixel-wise setup. However, the pixel-wise set-up will be driven by more computational overhead. Therefore, our future work will also focus on transforming the current architecture to process the input data quickly. For this purpose, we will replace the fully connected layers by convolution layers based on the work of Sermanet et al. [65]. To reduce the impact of noise on sea ice classification, we would include the nominal noise profiles as a feature directly into the model. Finally, we emphasize that the scarcity of reliable and balanced sea ice training and validation datasets is a severe problem for these complex CNN architectures and needs full attention from the sea ice community. In future work, we will develop semi-supervised learning methods to partly remedy this issue. Author Contributions: Conceptualization, formal analysis, validation, S.K.; writing and formal analysis, H.U.; data curation and review, T.K.; writing and editing, validation, visualization, N.H.; review and editing, supervision, T.E.; review and editing, supervision, A.M. All authors have read and agreed to the published version of the manuscript. Funding: This research is funded in part by Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN Grant no. 237906), the European Union’s Horizon 2020 research and innovation programme ExtremeEarth project, grant agreement no. 825258 (http://earthanalytics.eu/, accessed on 1 March 2021) and by the Fram Center under the Automised Large-scale Sea Ice Mapping (ALSIM) “Polhavet” flagship project. Institutional Review Board Statement: Not applicable. Informed Consent Statement: Not applicable. Data Availability Statement: https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/QAYI4O (accessed on 1 March 2021). Acknowledgments: This work is funded in part by Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN Grant no. 237906), the European Union’s Horizon 2020 research and innovation programme ExtremeEarth project, grant agreement no. 825258 (http://earthanalytics.eu/, accessed on 1 January 2020) and by the Fram Center under the Automised Large-scale Sea Ice Mapping (ALSIM) “Polhavet” flagship project. Conflicts of Interest: The authors declare no conflict of interest. Abbreviations We provide the abbreviations used in the paper in this part SAR Synthetic aperture radar LSTM Long short term memory DNNs Deep neural networks FCN Fully convolutional networks PCA Principal component analysis Remote Sens. 2021, 13, 1734 18 of 20 S1 Sentinel-1 HH Horizontal-horizontal polarization RLU Rectified linear unit ResNet Residual convolutional neural network DL Deep learning CNN Convolutional neural network SVM Support vector machine GLCM Gray-level co-occurrence matrix EO Earth observation HV Horizontal-vertical polarization ESA European Space Agency References 1. Bobylev, L.P.; Miles, M.W. Sea Ice in the Arctic Paleoenvironments. In Sea Ice in the Arctic; Springer: Berlin/Heidelberg, Germany, 2020; pp. 9–56. 2. Serreze, M.; Barrett, A.; Stroeve, J.; Kindig, D.; Holland, M. The emergence of surface-based Arctic amplification. Cryosphere 2009, 3, 11. [CrossRef] 3. Vihma, T. Effects of Arctic sea ice decline on weather and climate: A review. Surv. Geophys. 2014, 35, 1175–1214. [CrossRef] 4. Najafi, M.R.; Zwiers, F.W.; Gillett, N.P. Attribution of Arctic temperature change to greenhouse-gas and aerosol influences. Nat. Clim. Chang. 2015, 5, 246. [CrossRef] 5. Stroeve, J.C.; Serreze, M.C.; Holland, M.M.; Kay, J.E.; Malanik, J.; Barrett, A.P. The Arctic’s rapidly shrinking sea ice cover: A research synthesis. Clim. Chang. 2012, 110, 1005–1027. [CrossRef] 6. Haykin, S.; Lewis, E.O.; Raney, R.K.; Rossiter, J.R. Remote Sensing of Sea Ice and Icebergs; John Wiley & Sons: Hoboken, NJ, USA, 1994; Volume 13. 7. Ren, Y.; Li, X.; Yang, X.; Xu, H. Development of a Dual-Attention U-Net Model for Sea Ice and Open Water Classification on SAR Images. IEEE Geosci. Remote Sens. Lett. 2021.[CrossRef] 8. Han, Y.; Liu, Y.; Hong, Z.; Zhang, Y.; Yang, S.; Wang, J. Sea Ice Image Classification Based on Heterogeneous Data Fusion and Deep Learning. Remote Sens. 2021, 13, 592. [CrossRef] 9. Song, W.; Li, M.; Gao, W.; Huang, D.; Ma, Z.; Liotta, A.; Perra, C. Automatic Sea-Ice Classification of SAR Images Based on Spatial and Temporal Features Learning. IEEE Trans. Geosci. Remote Sens. 2021. [CrossRef] 10. Awange, J.L.; Kiema, J.B.K. Microwave remote sensing. In Environmental Geoinformatics; Springer: Berlin/Heidelberg, Germany, 2013; pp. 133–144. 11. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. 12. Yan, X.; Cui, B.; Xu, Y.; Shi, P.; Wang, Z. A method of information protection for collaborative deep learning under gan model attack. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019. [CrossRef] 13. Otter, D.W.; Medina, J.R.; Kalita, J.K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 604–624. [CrossRef] 14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. 15. Jin, K.H.; McCann, M.T.; Froustey, E.; Unser, M. Deep convolutional neural network for inverse problems in imaging. IEEE Trans. Image Process. 2017, 26, 4509–4522. [CrossRef] [PubMed] 16. Mustaqeem; Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2020, 20, 183. 17. Liu, J.; Scott, K.A.; Gawish, A.; Fieguth, P. Automatic detection of the ice edge in SAR imagery using curvelet transform and active contour. Remote Sens. 2016, 8, 480. [CrossRef] 18. Lindell, D.B.; Long, D.G. Multiyear Arctic sea ice classification using OSCAT and QuikSCAT. IEEE Trans. Geosci. Remote Sens. 2015, 54, 167–175. [CrossRef] 19. Shen, X.; Zhang, J.; Zhang, X.; Meng, J.; Ke, C. Sea ice classification using Cryosat-2 altimeter data by optimal classifier–feature assembly. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1948–1952. [CrossRef] 20. Zakhvatkina, N.; Smirnov, V.; Bychkova, I. Sea ice classification based on neural networks method using Sentinel-1 data. Int. Multidiscip. Sci. GeoConf. SGEM 2019, 19, 617–623. 21. Zakhvatkina, N.Y.; Demchev, D.; Sandven, S.; Volkov, V.A.; Komarov, A.S. SAR Sea Ice Type Classification and Drift Retrieval in the Arctic. In Sea Ice in the Arctic; Springer: Berlin/Heidelberg, Germany, 2020; pp. 247–299. 22. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. 23. Eppler, D.T.; Farmer, L.D.; Lohanick, A.W.; Hoover, M. Classification of sea ice types with single-band (33.6 GHz) airborne passive microwave imagery. J. Geophys. Res. Ocean. 1986, 91, 10661–10695. [CrossRef] 24. Moen, M.A.; Doulgeris, A.P.; Anfinsen, S.N.; Renner, A.H.; Hughes, N.; Gerland, S.; Eltoft, T. Comparison of feature based segmentation of full polarimetric SAR satellite sea ice images with manually drawn ice charts. Cryosphere 2013, 7, 1693–1705. [CrossRef] Remote Sens. 2021, 13, 1734 19 of 20 25. Fors, A.S.; Brekke, C.; Doulgeris, A.P.; Eltoft, T.; Renner, A.H.; Gerland, S. Late-summer sea ice segmentation with multi- polarisation SAR features in C and X band. Cryosphere 2016, 10, 401–415. [CrossRef] 26. Yu, Z.; Wang, T.; Zhang, X.; Zhang, J.; Ren, P. Locality preserving fusion of multi-source images for sea-ice classification. Acta Oceanol. Sin. 2019, 38, 129–136. [CrossRef] 27. Cristea, A.; van Houtte, J.; Doulgeris, A.P. Integrating incidence angle dependencies into the clustering-based segmentation of SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2925–2939. [CrossRef] 28. Orlando, J.R.; Mann, R.; Haykin, S. Classification of sea-ice images using a dual-polarized radar. IEEE J. Ocean. Eng. 1990, 15, 228–237. [CrossRef] 29. Alhumaidi, S.M.; Jones, L.; Park, J.D.; Ferguson, S.M. A neural network algorithm for sea ice edge classification. IEEE Trans. Geosci. Remote Sens. 1997, 35, 817–826. [CrossRef] 30. Bogdanov, A.V.; Sandven, S.; Johannessen, O.M.; Alexandrov, V.Y.; Bobylev, L.P. Multisensor approach to automated classification of sea ice image data. IEEE Trans. Geosci. Remote Sens. 2005, 43, 1648–1664. [CrossRef] 31. Leigh, S.; Wang, Z.; Clausi, D.A. Automated ice–water classification using dual polarization SAR satellite imagery. IEEE Trans. Geosci. Remote Sens. 2013, 52, 5529–5539. [CrossRef] 32. Li, Y.; Gao, F.; Dong, J.; Wang, S. A Novel Sea Ice Classification Method from Hyperspectral Image Based on Bagging PCA Hashing. In Proceedings of the 2018 Fifth International Workshop on Earth Observation and Remote Sensing Applications (EORSA), Xi’an, China, 18–20 June 2018; pp. 1–4. 33. Park, J.W.; Korosov, A.A.; Babiker, M.; Won, J.S.; Hansen, M.W.; Kim, H.C. Classification of Sea Ice Types in Sentinel-1 SAR images. Cryosphere Discuss. 2019, 1–23. [CrossRef] 34. Zhang, Y.; Zhu, T.; Spreen, G.; Melsheimer, C.; Zhang, S.; Li, F. Sea ice-water classification on dual-polarized Sentinel-1 imagery during melting season. In Proceedings of the 21st EGU General Assembly, EGU2019, Vienna, Austria, 7–12 April 2019; Volume 21. 35. Nogueira, K.; Miranda, W.O.; Dos Santos, J.A. Improving spatial feature representation from aerial scenes by using convolutional networks. In Proceedings of the 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Brazil, 26–29 August 2015; pp. 289–296. 36. Makantasis, K.; Karantzalos, K.; Doulamis, A.; Doulamis, N. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 4959–4962. 37. Wang, L.; Wong, A.; Scott, K.A.; Clausi, D.A.; Xu, L.; Shafiee, M.J.; Li, F. Sea ice concentration estimation from satellite SAR imagery using convolutional neural network and stochastic fully connected conditional random field. In Proceedings of the CVPR 2015 Earthvision Workshop, Boston, MA, USA, 11–12 June 2015. 38. Wang, L.; Scott, K.A.; Xu, L.; Clausi, D.A. Sea ice concentration estimation during melt from dual-pol SAR scenes using deep convolutional neural networks: A case study. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4524–4533. [CrossRef] 39. Wang, L.; Scott, K.; Clausi, D. Sea ice concentration estimation during freeze-up from SAR imagery using a convolutional neural network. Remote Sens. 2017, 9, 408. [CrossRef] 40. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [CrossRef] 41. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. 42. Nogueira, K.; Penatti, O.A.; dos Santos, J.A. Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognit. 2017, 61, 539–556. [CrossRef] 43. Castelluccio, M.; Poggi, G.; Sansone, C.; Verdoliva, L. Land use classification in remote sensing images by convolutional neural networks. arXiv 2015, arXiv:1508.00092. 44. Kruk, R.; Fuller, M.C.; Komarov, A.S.; Isleifson, D.; Jeffrey, I. Proof of Concept for Sea Ice Stage of Development Classification Using Deep Learning. Remote Sens. 2020, 12, 2486. [CrossRef] 45. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. 46. Han, Y.; Gao, Y.; Zhang, Y.; Wang, J.; Yang, S. Hyperspectral Sea Ice Image Classification Based on the Spectral-Spatial-Joint Feature with Deep Learning. Remote Sens. 2019, 11, 2170. [CrossRef] 47. Gao, Y.; Gao, F.; Dong, J.; Wang, S. Transferred deep learning for sea ice change detection from synthetic-aperture radar images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1655–1659. [CrossRef] 48. Petrou, Z.I.; Tian, Y. Prediction of Sea Ice Motion With Convolutional Long Short-Term Memory Networks. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6865–6876. [CrossRef] 49. Mustaqeem; Sajjad, M.; Kwon, S. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [CrossRef] 50. WMO Sea-Ice Nomenclature, Volumes I, II and III. 2014. Available online: https://library.wmo.int/doc_num.php?explnum_id= 4651 (accessed on 1 March 2021). 51. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375. Remote Sens. 2021, 13, 1734 20 of 20 52. Wang, M.; Lu, S.; Zhu, D.; Lin, J.; Wang, Z. A high-speed and low-complexity architecture for softmax function in deep learning. In Proceedings of the 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), Chengdu, China, 26–30 October 2018; pp. 223–226. 53. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. 54. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and flexible image augmentations. Information 2020, 11, 125. [CrossRef] 55. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA. Available online: http://www. deeplearningbook.org (accessed on 1 March 2021). 56. Lohse, J.; Doulgeris, A.P.; Dierking, W. Mapping sea-ice types from Sentinel-1 considering the surface-type dependent effect of incidence angle. Ann. Glaciol. 2020, 1–11. [CrossRef] 57. Piantanida, R.; Miranda, N.; Hadjduch, G.Thermal Denoising of Products Generated by the S-1 IPF; S-1 Mission Performance Centre. 2017. Available online: https://sentinel.esa.int/documents/247904/2142675/Thermal-Denoising-of-Products-Generated-by- Sentinel-1-IPF (accessed on 1 March 2021) 58. Joint WMO-IOC Technical Commission for Oceanography. Marine Meteorology. SIGRID-3 : A Vector Archive Format for Sea Ice Charts: Developed by the International Ice Charting Working Group’s Ad Hoc Format Team for the WMO Global Digital Sea Ice Data Bank Project; WMO & IOC: Geneva, Switzerland, 2004. 59. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. 60. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. 61. Hughes, N. ExtremeEarth Polar Use Case Training Data 2020. Available online: https://zenodo.org/record/3695276#.X-ytf2j0 mUn (accessed on 1 March 2021). 62. Goessling, H.F.; Tietsche, S.; Day, J.J.; Hawkins, E.; Jung, T. Predictability of the Arctic sea ice edge. Geophys. Res. Lett. 2016, 43, 1642–1650. [CrossRef] 63. Melsheimer, C.; Spreen, G. AMSR2 ASI Sea Ice Concentration Data, Arctic, Version 5.4 (NetCDF) (July 2012–December 2019). 2019. Available online: https://doi.pangaea.de/10.1594/PANGAEA.898399 (accessed on 1 March 2021). [CrossRef] 64. Lavergne, T.; Sørensen, A.M.; Kern, S.; Tonboe, R.; Notz, D.; Aaboe, S.; Bell, L.; Dybkjær, G.; Eastwood, S.; Gabarro, C.; et al. Version 2 of the EUMETSAT OSI SAF and ESA CCI sea-ice concentration climate data records. Cryosphere 2019, 13, 49–78. [CrossRef] 65. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014. 7 Paper 2: DeepSemi-supervisedTeacher–Student ModelBased on LabelPropagation for Sea IceClassification 85 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021 10761 Deep Semisupervised Teacher–Student Model Based on Label Propagation for Sea Ice Classification Salman Khaleghian , Habib Ullah, Thomas Kræmer , Torbjørn Eltoft , Member, IEEE, and Andrea Marinoni , Senior Member, IEEE Abstract—In this article, we propose a novelteacher–student- based label propagation deep semisupervised learning (TSLP-SSL) method for sea ice classification based on Sentinel-1 synthetic aperture radar data. For sea ice classification, labeling the data precisely is very time consuming and requires expert knowledge. Our method efficiently learns sea ice characteristics from a lim- ited number of labeled samples and a relatively large number of unlabeled samples. Therefore, our method addresses the key challenge of using a limited number of precisely labeled samples to achieve generalization capability by discovering the underlying sea ice characteristics also from unlabeled data. We perform ex- perimental analysis considering a standard dataset consisting of properly labeled sea ice data spanning over different time slots of the year. Both qualitative and quantitative results obtained on this dataset show that our proposed TSLP-SSL method outperforms deep supervised and semisupervised reference methods. Index Terms—Deep learning, earth observation, scarce training data, sea ice classification, semisupervised learning (SSL). I. INTRODUCTION ARCTIC sea ice keeps the northern polar regions cooland thereby helps to moderate the global climate. It is a key component of the Arctic environment [1] that substantially affects the polar physical environment and its ecosystems. The Arctic has faced severe environmental impacts over the past few decades. These changes have transformed its environment, ecology, and meteorology and caused unsteady variations in the weather and sea ice conditions, which pose new challenges to maritime industries, including but not limited to aquaculture, natural energy resources, and travel exploration operating in the high north areas [2], [3]. Therefore, proper monitoring Manuscript received July 1, 2021; revised September 11, 2021 and September 22, 2021; accepted September 23, 2021. Date of publication October 14, 2021; date of current version November 2, 2021. This work was supported in part by the Centre for Integrated Remote Sensing and Forecasting for Arctic Operations and the Research Council of Norway under Grant 237906, in part by the European Union’s Horizon 2020 Research and Innovation Program ExtremeEarth Project under Grant 825258, and in part by the Fram Center under the Automised Large- scale Sea Ice Mapping “Polhavet” Flagship Project. (Corresponding author: Salman Khaleghian.) Salman Khaleghian, Thomas Kræmer, Torbjørn Eltoft, and Andrea Mari- noni are with the Faculty of Science and Technology, University of Tromsø—The Arctic University of Norway, 9019 Tromsø, Norway (e-mail: salman.khaleghian@uit.no; thomas.kramer@uit.no; torbjorn.eltoft@uit.no; an- drea.marinoni@uit.no). Habib Ullah is with the Faculty of Science and Technology, Norwegian University of Life Sciences, 1430 ˚As, Norway (e-mail: habib.ullah@nmbu.no). Digital Object Identifier 10.1109/JSTARS.2021.3119485 of the sea ice conditions and how it changes with time is important [4], [5]. For high-resolution sea ice analysis, researchers and ice cen- ters around the world are using synthetic aperture radar (SAR) data [6], [7]. These data are not restricted by weather conditions and polar darkness [8]. An important part of sea ice analysis includes sea ice classification. Sea ice classification based on SAR data [9] is carried out by classical statistical classifica- tion methods, traditional machine learning (TML) methods, and deep-learning-based methods (DLMs). Statistical and TML methods rely on handcrafted features, which may not properly encapsulate the challenging sea ice characteristics [10]. There- fore, their generalization capabilities and their abilities to find efficient features that can be considered to various geographic areas and time frames are limited [10]. DLMs, when prop- erly trained on large training datasets, have shown excellent generalization capabilities in many research fields, including several remote sensing applications such as food security moni- toring [11], hybrid data-driven Earth observation modeling [12], and flood mapping from high-resolution optical data [13]. We consider these achievements in the aforementioned fields and believe that deep neural networks (DNNs) may also show per- formance improvement in automatic sea ice classification [14], [15]. However, scarce training data is the most challenging issue in sea ice data analysis. This problem is particularly challenging in the Arctic, where gathering of precise true observations is expensive, time driven, and sometimes not feasible [16]. For sea ice classification, archived ice charts are available rendering huge labeled data. Nonetheless, these charts are very coarsely labeled and do not have the quality and details needed to train a DLM effectively [17]. To extract accurate information from large-scale datasets, when limited amount of labeled data are available, semisu- pervised learning (SSL) has been introduced in the technical literature [18]. These methods aim to combine labeled data with unlabeled records. In the past few years, semisupervised models have presented performance improvement in various fields of re- mote sensing research, such as despeckling of SAR images [19], change detection in heterogeneous remote sensing images [20], and hyperspectral image classification [21]. Considering these successes, we anticipate that deep SSL methodologies could also be favorable in sea ice classification and potentially lead to significant improvements by overcoming the specific challenge of few labeled samples. In fact, a deep SSL technique is halfway This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 10762 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021 Fig. 1. TSLP-SSL method. We have two models, namely, teacher and student models. The teacher model is trained on labeled data during the first stage, and then, both models are trained on labeled and unlabeled data during the second stage of the training. between supervised and unsupervised learning. This technique exploits multiple layers to progressively extract higher level features from the raw input data considering both labeled and unlabeled data. We propose a teacher–student-based label propagation deep semisupervised learning (TSLP-SSL) method. Our architecture consists of two models, namely, a teacher model and a student model. The teacher model is trained in a two-step procedure. Initially, we trained the teacher model in a supervised fashion utilizing only the labeled data. We then feed both the labeled and unlabeled samples to the trained teacher model and consider the feature space embedding to engender pseudo-labels for the unlabeled data through a label propagation procedure [22]–[24]. The original and the pseudo-labels are in the next step used to train the student model, which is subsequently used during the inference stage. The purpose of using the student model is to avoid the problem of the teacher model being biased toward the labeled data, which is like in case of a small training set. Our proposed method, hence, effectively exploits a relatively large amount of unlabeled data to improve the final classification performance. The training methodology is depicted in Fig. 1 and is more thoroughly described in Section III. The summary of our contributions is as follows. 1) We propose a novel TSLP-SSL method. One of the major attractions of our proposed method is its capability to deal with a small number of labeled samples. This is a favorable property in the case of sea ice classification using SAR data, where the availability of a large amount of reliable labeled data is scarce. 2) We consider sea ice datasets to train and analyze the generalization capabilities of our proposed method. We compare our method with a supervised method and three state-of-the-art semisupervised methods. Our results show that our proposed method performs better than all the reference methods, especially in cases with a small number of labeled samples. 3) Additionally, we present a comprehensive literature re- view covering both the probabilistic learning method and the DLM. The rest of this article is organized as follows. Related work is described in Section II. We present our proposed deep models and training approaches in Section III. Section IV depicts the experimental analysis considering a set of SAR images. Finally, Section V concludes this article and presents future work. II. RELATED WORKS In general, sea ice classification can be divided into two major classes: TML/probabilistic methods and DLMs [25]. The approaches in the latter class fall into two subclasses, namely, supervised deep learning and semisupervised deep learning methods. The literature is very limited in the case of semisuper- vised DLMs since methods in this subcategory are quite recent and still under development. A. Probabilistic Methods for Sea Ice Classification The literature on TML/probabilistic methods is very rich, and we will restrict ourselves to only including a few recent publica- tions. Statistical algorithms often combine probabilistic models and classical classification methods with texture or polarimetric features to perform sea-ice-type maps. An extensive survey is given in [26]. Some specific studies in this category are highlighted be- low. Examples of machine learning algorithms include the use KHALEGHIAN et al.: DEEP SEMISUPERVISED TEACHER–STUDENT MODEL BASED ON LABEL PROPAGATION 10763 of standard multilayer perceptrons, as in [14], support vector machines, as in [7], or decision tree methods [15], as in [15]. Statistical and shallow machine learning methods often rely on having extracted the input features in a preoperation prior to the classification. Karvonen [27] and Dinessen [28] used probabilis- tic and statistical features for estimating sea ice concentration from SAR imagery. Johansson et al. [29] used statistical entropy and horizontal–vertical (HV) polarization computations to iso- late sea ice from open water and thicker sea ice. Furthermore, Fors et al. [30] investigated the potential of C- and X-band multipolarization SAR features for sea ice segmentation during late summer. Dabboor et al. [31] analyzed a set of compact polarimetric parameters for classifying newly formed ice and multiyear ice. Hong and Yang [32] used the statistical coefficient, incidence angle, environment temperature, and speed of wind to improve the sea ice and water classification. Johansson et al. [33] used a statistical mixture model to isolate open water from sea ice. Their method is based on the semiautomatic segmentation technique. They applied the algorithm to explore the sea ice characteristics in Svalbard. Aldenhoff et al. [34] demonstrated that C-band SAR can reliably generate the layout of the ice boundary, whereas the L-band shows effectiveness considering thin ice and water regions. B. DLMs for Sea Ice Classification Deep-learning-based approaches have been widely exploited for addressing the challenge of sea ice classification. Malmgren- Hansen et al. [17] applied a convolutional neural network (CNN) model to predict Arctic sea ice by fusing data from two different satellites. They found that the CNNs are showing good performance for multisensor data integration. It is worth noting that they used archived ice chart data for both training and validation. However, these data are coarsely labeled, hence leading to undesired effects in the training of the CNN model. Wang et al. [10], [35], [36] exploited CNNs for ice concentration estimation. Tom et al. [37] proposed an ice monitoring model based on Sentinel-1 data with a deep learning approach. Boulze et al. [38] introduced a CNN for detecting different kinds of sea ice [39] using SAR data. They trained the CNN considering the archived ice chart data. They performed comparison with a random forest classifier using texture features. SSL methods are proposed for classification when only scarce training data or a limited number of training samples are available. The idea of SSL relies on the assumption that unlabeled samples provide essential information and clues on how the data are distributed. Therefore, a DLM can be trained by considering this distribution. In this sense, different ap- proaches such as teacher–student models [40], graph-based methods [41], pseudo-labeling [42], consistency regulariza- tion [43], and generative models (i.e., generative adversarial networks—GANs) [44] have been introduced. Shin [40] pro- posed a multiteacher single-student method to solve the visual attribute prediction problem. His method learnt task-specific domain experts called teacher networks and a student network by forcing a model to imitate the distributions learned by domain experts. Xie et al. [45] proposed a noisy student method for generating pseudo-labels to train a model in an iterative way. The output of the trained model based on the labeled sam- ples is exploited to produce pseudo-labels for the unlabeled samples, which are subsequently used to train another model. They used the teacher–student model to train a larger student model by incorporating noise, considering data augmentation (DA), dropout, and stochastic depth. Tarvainen and Valpola [46] proposed a mean teacher method that averages model weights instead of label predictions. Their method improves test accu- racy and enables training with fewer labeled samples. Salimans et al. [47] trained the semisupervised generative adversarial network (semi-GAN) as a generative model. Kingma et al. [48] exploited a variational autoencoder in the form of a semisu- pervised model. In their method, a classifier is trained on top a latent representation to predict the labels. Iscen et al. [24] proposed a transductive label propagation model for deep SSL. This model is trained in an iterative two-step procedure. In the first phase, a CNN is trained using the labeled part of the dataset in a supervised manner. In the second phase, based on a manifold assumption in the feature space of the CNN, pseudo-labels are produced for the unlabeled data through a label propagation procedure using a nearest neighbor graph. The pseudo-labels are considered to extend the set of labeled samples in the second stage to train the CNN model. Berthelot et al. [49] used an augmentation technique to introduce an SSL approach. They assumed that the distribution of a classifier should remain the same considering unlabeled data. They used average prediction to produce pseudo-labels for the unlabeled samples. C. SSL Methods for Sea Ice Classification The aforesaid cases show that the development of SSL meth- ods is a hot topic in the data analysis community. However, it is also true that the application of SSL architectures to sea ice classification is very limited. For example, Han et al. [50] investigated an approach for sea ice classification based on active learning (AL) and SSL. They acquired the most informative data examples considering AL. They exploited these informative examples in training the SSL method. Staccone [51] introduced an SSL approach based on GANs for sea ice classification. In this work, both labeled and unlabeled data were considered to achieve more accurate results by exploiting the knowledge from both data sources. Li et al. [52] presented an SSL method for ice and water classification based on self-training. Their method combined a contextual model and the self-training approach into a unified framework. Our proposed method falls into the subcategory of SSL meth- ods. We propose a teacher–student model considering the feature space using the label propagation method, which is summarized in the following section. III. TEACHER–STUDENT-BASED LABEL PROPAGATION METHOD As mentioned above, labeled sea ice samples are difficult to acquire, making the training of sea ice classification architec- tures a difficult task. Therefore, we explore a novel TSLP-SSL method for this application. We adequately utilize a limited 10764 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021 number of labeled samples and a comparatively much large number of unlabeled samples to train a deep CNN architecture for extracting sea ice information. Our proposed TSLP-SSL method consists of a teacher model and a student model, which are cooperatively trained in an iterative way during two training stages. Our method is different from the teacher–student models presented in [45] and [46] in two major aspects. First, in our case, features generated by the trained teacher model are extracted before the final classification layer and used in the label propaga- tion process to produce pseudo-labels for the unlabeled samples using a k-nearest neighbor approach. Hence, label propagation is performed in feature space, and not in output label space. Second, the pseudo-labels from the teacher model are exploited, together with the original labels, to train the student model in order to find an optimal decision boundary during a second iterative training stage. Our proposed method is also different from the deep SSL model in [24] in the way it aims to avoid the model to be biased toward the labeled data. In fact, the method in [24] is based on a single model, which is trained on only the labeled data, making it susceptible to be biased toward these data samples. The biasing problem may be even more significant in the sea ice classification task, considering the small amount of labeled data and noting the fact that texture features are important for discriminating between different ice types. In our proposed method, both models are represented by a CNN constructed of a 13-layer architecture [24]. During the first training stage, the teacher model is trained on the labeled data only. During the second stage, the teacher model gener- ates pseudo-labels for the unlabeled data. These pseudo-labels, combined with the labeled samples, are used to train the student model. The motivation for considering an additional student model is to handle the problem of the teacher model being biased toward the labeled data, as discussed above [53]. To further elaborate on this issue, the teacher model formulates a decision boundary considering a small set of labeled data. However, this decision boundary may not be the best boundary when also considering the unlabeled data during the second stage, especially if the teacher model gets overfitted to the labeled data because of the limited number of samples [54]. The idea is that the student model should discover a more appropriate decision boundary, as illustrated in Fig. 2. Fig. 2 displays a simplified case, in which the triangles represent samples from one arbitrary class and the circles show samples from another class. Hence, the red and blue symbols represent labeled data from the two classes, respectively, and the black symbols represent unlabeled data from both classes. Since the teacher model is trained using labeled data only, the decision boundary shown as a blue solid line in Fig. 2 could be a solution. A better decision boundary is discovered by repeatedly training the student model from scratch with both pseudo-labeled and labeled data. In this way, the student model would end up with the decision boundary defined by the green-dashed line, which properly separates both the labeled and unlabeled data from both the two classes. It is worth noting that this example shows the advantage of using label propagation based on nearest neighbors instead of using the network output as pseudo-labels. Fig. 2. Complexity of tuning the teacher decision boundary to also take into account the unlabeled data. We show two-class labeled data with red triangles and blue circles. The black markers represent the unlabeled data. During the second stage of our training, the teacher model generates predictions for the entire dataset. The feature space embedding is subsequently used to construct a nearest neigh- bor graph and an adjacency matrix, from which we assign pseudo-labels to the unlabeled samples in a transductive label propagation procedure [24]. A. Formulation for the Learning Process To clearly provide the details of the process of label prop- agation for our teacher model, we present the affiliated nota- tions in this section. In this, we will largely follow the outline in [24]. We consider a set of n samples denoted by X := (x1, . . ., xs, xs+1, . . ., xn) with xi ∈ X , where s samples xi for i ∈ S := {1, . . ., s}, represented by XS , are labeled according to YS := (y1, . . ., ys). Each element in YS is yi ∈ G, where G := {1, . . ., g} is a discrete label set of g classes. The rest of the e := n− s samples xi for i ∈ E := {s+ 1, . . ., n}, represented by XE , are unlabeled. We consider all samples in X and labels in YS to train the CNN to assign class labels to the previously unseen samples. The CNN takes an input sample xi from X and builds a vector of class probabilities fΛ(xi), fΛ : X → Rg, where Λ represents the hyperparameters of our deep model. In this process, the feature extraction stage is represented by the function ΩΛ : X → Rd, which maps the input data to a d- dimensional feature vector, where the ith sample is represented by di := ΩΛ(xi). In the next stage, a vector of class probabilities is built by the softmax on top of the fully connected layer considering ΩΛ. The prediction of the CNN for the ith sample is the class of the highest probability, i.e., yˆi = argmaxjfΛ(xi)j (1) where j is the jth dimension of the vector. In supervised learning, the loss function in (2) is minimized to train the CNN ξsup(XS , YS ; Λ) = s∑ i=1 εsup(fΛ(xi), yi). (2) KHALEGHIAN et al.: DEEP SEMISUPERVISED TEACHER–STUDENT MODEL BASED ON LABEL PROPAGATION 10765 Equation (2) applies only to the labeled samples, i.e., xi ∈ XS . In fact, (2) shows one term of the loss function in SSL. In classification problems, the cross-entropy loss function is generally used for εsup, which for a given sample xi is defined as εsup(fΛ(xi), yi) = − g∑ k=1 y′k log (fΛ(xi))k (3) where y′k is the kth component of the one-hot encoding of yi ∈ G. Pseudo-labeling finds a pseudo-label yˆi for each sample xi for i ∈ E. The pseudo-labels for unlabeled samples in XE are represented by YˆE = {yˆs+1, . . ., yˆn}, and they form an additional loss term formulated as ξpseu(XE , YˆE ; Λ) = n∑ i=s+1 εpseu(fΛ(xi), yˆi). (4) B. Pseudo-Label Generation and Learning Process In our method, the CNN is represented by the param- eters Λ, and we formulate the descriptor set as D := (d1, . . ., ds, ds+1, . . ., dn), where di := ΩΛ(xi). We build a sparse affinity matrix Δ ∈ Rn×n, where its elements are rep- resented by νij = { [dTi , dj ] γ +, if i 6= j ∧ di ∈ Nk(dj) 0, otherwise (5) where Nk represents the set of k-nearest neighbors in X , and γ is a hyperparameter. It is worth noticing that building the sparse affinity matrix is computationally efficient even if we have a very large number of samples. We then build a symmet- ric adjacency matrix Θ = Δ+ΔT such that Θ ∈ Rn×n. The diagonal of the matrix Θ consists of zeroes. The rest of the elements of Θ are nonnegative pairwise similarities between di and dj for i = 1, 2, . . . , n and j = 1, 2, . . . , n. We formulate the symmetrically normalized counterpart of Θ as Ξ = Γ− 1 2ΘΓ− 1 2 (6) where Γ = (Θ1n) is the degree matrix and 1n is an n- dimensional vector with all elements set to 1. We formulate a label matrix Y of size n× g consisting of the elements Yij = { 1, if i ∈ S ∧ yi = j 0, otherwise . (7) The rows of the matrix Y represent one-hot encoded labels for the labeled samples. Subsequently, the diffusion amounts to formulating an n× g matrix ψ such that ψ = (I − αΞ)−1Y (8) where α ∈ [0, 1) is a parameter. The elements of ψ are repre- sented by δij . In fact, calculating matrix ψ, according to (8), is impractical for large n because the inverse matrix (I − αΞ)−1 is not sparse. Therefore, we use the conjugate gradient method to solve the linear system (I − αΞ)ψ = Y. (9) Equation (9) is fast and valid since the matrix (I − αΞ) is pos- itive definite. We find the pseudo-labels YˆE = {yˆs+1, . . ., yˆn} for unlabeled samples as yˆi = argmaxjδij (10) where δij is the (i, j)th element of matrix ψ. It is worth noting that finding pseudo-labels from matrix ψ in this way has some unwanted causes. For example, we assign pseudo-labels to all unlabeled samples; however, we are clearly not confident about the same certainty for all generated pseudo-labels. Moreover, pseudo-labels may not represent the same number of samples for each class, which will affect the performance of the learning process. To handle the former problem, we affiliate a weight representing the certainty of the prediction to each pseudo-label. For this purpose, we consider the entropy Υ to compute the level of uncertainty and provide a weight ωi to sample xi formulated as ωi = 1− Υ(δˆi) log(g) (11) where Υ : Rg → R is the entropy function, and the weight ωi is normalized in [0, 1] because log(g) is the maximum possi- ble entropy in Rg [when all datapoints are equally distributed to the clusters, the maximum entropy for g classes is H = −∑gc=1 1/glog(1/g) = log(g)]. δˆi is a g-dimensional vector of the ith rowwise normalized counterpart of δi with components formulated as δˆij = δij∑ k δik . (12) To cope with the issue of the situation when we have different number of samples for each class, we provide weight υj to class j that is inversely related to class size, formulated as υj = (|Sj |+ |Ej |)−1 (13) where |Sj | is the number of labeled samples and |Ej | is the number of pseudo-labeled samples in class j. To this end, we formulated per-sample and per-class weights. We relate the weighted loss to the labeled and pseudo-labeled samples as follows: ξw(X,YS , YˆE ; Λ) = s∑ i=1 υyiεsup(fΛ(xi), yi) + n∑ i=s+1 ωiυyˆiεpseu(fΛ(xi), yˆi). (14) In fact, (14) is the sum of weighted versions of ξsup and ξpseu in (2) and (4), respectively. Iscen et al. [24] used one CNN model to produce the pseudo-labels and then used these labels to train the same model. On the contrary of this approach, we are using two CNN models in the form of a teacher model and a student model. The teacher model generates the pseudo-labels, which are combined with the labeled samples to train the student model. Therefore, the trained student model is not biased toward the labeled data. To this end, the student and teacher models are trained in parallel, according to (14), in which yˆi in the student model comes from the teacher model. In summary, considering the nearest neighbor graph defini- tion in the form of affinity matrix, label propagation, sample 10766 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021 TABLE I DIFFERENT WATER AND ICE CLASSES and class weights, and label and pseudo-label loss terms, our semisupervised method follows a repetitive procedure. Initially, we randomly initialize all the parameters. We then train the teacher model using the s labeled samples in XS , considering the supervised loss term. We use the trained teacher model to extract descriptors D for the complete training set X . We then find the k-nearest neighbors of all samples to build the adjacency matrix Θ and carry out label propagation by computing (9). We then assign pseudo-labels to the unlabeled samples in XE by considering (10). Subsequently, we train both the teacher and student models for one epoch on the complete training set X using the weighted loss function in (14). This process is repeated for T ′ epochs. IV. EXPERIMENTAL ANALYSIS A. SAR-Based Sea Ice Dataset We have trained our proposed method considering 31 Sentinel-1 images. The images are acquired from the North of Svalbard with 40 m × 40 m pixel resolution. They are prepro- cessed using the ESA SNAP software by applying thermal noise removal, calibration using the σ0 lookup table, and multilooking using a 3 × 3 boxcar filter. After converting the intensity images to dB values, they are clipped and scaled linearly in the range [0, 1] considering individual channels. The range in dB for horizontal–horizontal (HH) polarization and HV polarization are [min: −30, max: 0] and [min: −35, max: −5], respectively. To create a suitable dataset for sea ice classification, we used labeled polygons generated from 31 Sentinel-1 EW scenes from the North of Svalbard. These polygons were carefully labeled manually according to coregistered optical images with as small as possible time gaps. We used these images for training our proposed method. More details can be found in [39]. The dataset consists of five classes, as shown in Table I. Nonetheless, to perform sea ice classification and create a proper dataset [55] for deep learning, we extracted patches with size equal to 32 × 32 pixels, corresponding to a spatial resolution of 1280 m2, from inside the polygons, with a stride of 10 pixels. This dataset can be accessed from the link [55]. It is worth mentioning that we analyzed the effect of different patch sizes in a previous work [9]. We found that the validation results got better by increasing the patch size. However, this improvement comes at the cost of a lower spatial resolution as larger patches cover wider areas of the surface. For instance, a larger patch will be classified as water if the majority of the pixels represent water. This would become a significant issue at ice edges as classification based on larger patches would lead to coarser or nonsmooth edges. Therefore, there is a tradeoff between accuracy and resolution. To compensate for this, in our proposed work, we consider a patch size equal to 32 × 32 pixels. We extracted two channel patches consisting of HH and HV intensities. It is also worth mentioning that we also analyzed the effect of different channel composition (HH, HV, and incidence angle) in our previous work [9]. We found that adding the HV channel to the HH gives large improvement. However, the improvement resulting from also adding the in- cidence angle is quite small. In the current work, we do not include the incidence angle as this also enables more proper comparison with other SSL methods [48], [49]. These reference SSL methods largely apply different DA techniques, and the inclusion of the incidence angle is not feasible because of the DA techniques. Therefore, the patches in our work consist of only HH and HV intensities to maintain consistency. In Table I, we provide ice type codes, following the definitions of the World Meteorological Organization [56] and a brief description of each class. We consider binary sea ice classification. The first class, namely, the water class, consists of open water and leads with water, and the various ice types are grouped together as the ice class. The total number of patches for water is 9317, and for ice, it is 5433. We provided the dataset online [55]. For now, we are interested in analyzing the performance of DNNs for binary classification. Our consideration based on our experience with sea ice classification is that if DNNs can perform well in the binary classification case, they may also classify multiple sea ice types properly. For validation, we consider some other Sentinel-1 scenes provided by the Norwegian Meteorological Institute [57] from the Danmarkshavn area on the Northeastern coast of Greenland and extract 1516 water patches and 1324 ice patches, mostly from challenging areas. In the first experiment, we consider the training dataset from the North of Svalbard and split it into labeled XS and unlabeled XE samples. In the next experiment, to show the capability of the proposed method in classifying real unlabeled data, we consider 5000 random patches picked from the Norwegian Meteorological Institute dataset as the unlabeled dataset XE and use all samples in the training set as the labeled dataset XS . We insert a different number of labeled datasets for each class, i.e., 15, 30, 60, 100, 500, and 1000. For the inference results, we apply SAR images from the Norwegian Meteorological Institute dataset [57], which were collected in 2018. B. Our Model Configurations We exploit the same network models for the teacher and student models. Similar to [24], we use the network architecture defined in [46] and shown in Table II. We trained the teacher model for 100 epochs in the first training step. In the second step, we trained the teacher model for 200 epochs based on the label propagation to produce pseudo-labels. These labels are then exploited to train the student model concurrently. The learning rate for the teacher model is 0.0008 for the first step and 0.0001 for the second step. The learning rate for the student model during the second step is 0.002. For DA, we used only rotation in both steps to keep the same physical meaning of all KHALEGHIAN et al.: DEEP SEMISUPERVISED TEACHER–STUDENT MODEL BASED ON LABEL PROPAGATION 10767 TABLE II BASE CNN ARCHITECTURE TABLE III VALIDATION ACCURACY FOR DIFFERENT AMOUNT OF LABELED DATA AND UNLABELED DATA FROM THE TRAINING DATASET the channels of the SAR data and considered the same values for the hyperparameters as used in the previous studies [24], [58] in all experiments. We run the experiments on a single NVIDIA Quadro RTX 5000 with 16-GB memory. The code is available.1 C. Results and Discussion We trained our models with a distinct number of labeled data to assess the performance of our proposed method in comparison with four reference methods. For this purpose, we consider both a supervised CNN model and three semisupervised methods, namely, semi-GANs [47], MixMatch [49], and label propaga- tion model (LP-SSL) [24]. In the supervised CNN model, we consider the same CNN architecture that we use for both our teacher and student models. We present the validation results in Table III in terms of accuracy for both our proposed TSLP-SSL method and the reference methods. In the first experiment, we use our training data and split it into labeled, i.e., XS , YS , and unlabeled datasets, (XE . For the validation, we use the validation data that were mentioned previously (see Section IV-A). As can be seen in Table III, our proposed method outperforms the fully supervised CNN architecture considering 15, 30, 40, 60, and 100 labeled samples. Similarly, our method also outperforms the semisupervised methods semi-GANs [47], MixMatch [49], and LP-SSL [24] considering different number of labeled datasets except in case of 500 labeled samples. For comprehensive analysis, we also consider other performance metrics, namely, average precision, average recall, and average F1-score, for both the classes: water and ice. We present the results in Table IV. As 1https://github.com/sakh251/TSLP-SSL can be seen, we also outperform in most cases considering both the supervised and semisupervised methods. In fact, our method learns more information from the unlabeled data, especially when a very limited number of samples are available. In fact, the student model in our approach has the potential to remedy the problem of overfitting of the teacher model when only few samples are available, and it presents comparable validation accuracy when considering 500 and 1000 labeled datasets. However, when the number of labeled datasets increases, the amount of information extracted from the unlabeled data does not significantly improve the results. It is worth noticing that the good samples of the labeled data can significantly impact the results in the second step. This can be seen when comparing the results of using 15 and 30 labeled samples in Tables III and IV. In fact, our proposed method can learn from the unlabeled data and, thus, improves its performance. It even achieves better validation accuracy than the supervised and LP-SSL models considering 15, 30, 40, 60, and 100 labeled samples. In order to explain the behavior of our method considering 500 and 1000 labeled samples, we compute the accuracy of the pseudo-labels from the teacher model during the second step of the training process. This can be done since the ground-truth labels of the unlabeled data can be extracted from the training dataset. We consider the comparison of our proposed method with the fully supervised CNN architecture. When both the methods are trained on 500 and 1000 labeled datasets, the accuracy on the pseudo-labels reaches more than 99%, but at the same time, the validation accuracy does not increase, as shown in Table III. This means that there is no more information in the unlabeled data to further improve the validation accuracy considering this particular dataset. We investigate this by training the supervised model with all the data in the training dataset, and it reached a validation accuracy of 91.57%. We also investigated the inference results on a single-image SAR scene from Danmarkshavn considering 30, 60, and 100 labeled datasets in our proposed TSLP-SSL model. The results of this experiment are reported in Fig. 3, where the first row shows results using the supervised model and the second row shows results using our proposed method. Blue color indicates the water and white color indicates the ice class. As can be seen, our method presents improvement compared to the supervised model, especially in the noisy areas. D. Feature Separability of Our Proposed Method Furthermore, we illustrate the capability of the label propaga- tion step that we use to generate the pseudo-labels for training the student model. In fact, label propagation is characterized by consolidated feature separability, which helps generate mean- ingful pseudo-labels for training the student model. To explain this visually, we extract the feature vector output from the last convolution layer. The dimension of the feature vectors is 128. We transform the feature vectors into three components based on the principal component analysis (PCA), considering both labeled and unlabeled data, to visually understand the feature space. These components are shown in Fig. 4. Fig. 4(a) and (c) shows the feature space when training the teacher model in the 10768 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021 TABLE IV AVERAGE OF PRECISION, RECALL, AND F1-SCORE FOR DIFFERENT AMOUNT OF LABELED DATA AND UNLABELED DATA FROM THE TRAINING DATASET Fig. 3. Inference results. We present qualitative results of a single input image. The first row depicts the results considering supervised deep learning, and the second row depicts the results using our proposed TSLP-SSL model. Fig. 4. Three PCA components’ visualization of extracted features (flattened vector after convolution layers with 128 values) from labeled and unlabeled data. The yellow color represents water and the purple color represents ice. (a) and (c) show the supervised feature space from first step with 60 and 1000 labeled data, respectively. (b) and (d) show the best feature space of second step with 60 and 1000 labeled data, respectively. KHALEGHIAN et al.: DEEP SEMISUPERVISED TEACHER–STUDENT MODEL BASED ON LABEL PROPAGATION 10769 Fig. 5. Inference results. The first column shows input images, the second column shows the results obtained with supervised deep learning, and the third row shows results obtained with our TSLP-SSL model, which is trained by also taking into account unlabeled data from other images. first step considering 60 and 1000 labeled samples, respectively. Fig. 4(b) and (d) shows the feature space representation after label propagation is applied in the second step. The yellow circles represent water and the purple circles represent the ice class. As can be seen, label propagation leads to more separable classes in the feature space, especially when 1000 labeled sam- ples are considered. Therefore, through label propagation, the unlabeled data help to build a more class-separable feature space and generate more meaningful and informative pseudo-labels to train the student model. E. Extended Unlabeled Data To elaborate a bit more on the capability of our proposed method, we conduct another experiment. We evaluate the vali- dation accuracy of the proposed method by considering 1000 data samples from the training dataset as labeled data (i.e., considering it as an element of XS) and adding unlabeled data TABLE V VALIDATION ACCURACY, AVERAGE PRECISION, AVERAGE RECALL, AND AVERAGE F1-SCORE CONSIDERING ADDITIONAL REAL UNLABELED DATA not contained in the training dataset. For this purpose, we extract 5000 random patches from the Danmarkshavn data and add to the training process in the second stepXE . We present the perfor- mance of all the methods in Table V in terms of accuracy, average precision, average recall, and average F1-scores. As can be seen, our method performs better than the fully supervised CNN method and three semisupervised methods: semi-GANs [47], MixMatch [49], and LP-SSL [24]. These results demonstrate 10770 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021 that our proposed method can extract and use relevant infor- mation from real unlabeled data and learn new information from unseen and unlabeled data. This is a useful and powerful capability that can be beneficial in sea ice classification, where the amount of available training data is limited. We also present inference results using four different images from the Danmarkshavn data considering the student model trained on 1000 labeled datasets and extended with unlabeled data. In Fig. 5, the left column depicts the original SAR images, the middle column presents the inference results obtained with the supervised learning model, and the last column shows the results obtained with our proposed TSLP-SSL method. Water is highlighted in blue color and ice is highlighted in white color. These inference results again show the capability of our proposed semisupervised method in using the information of unlabeled data. V. CONCLUSION In this article, we proposed a teacher–student-based label propagation method for sea ice classification. The teacher model and the student model were trained in an iterative way during the training stage. The teacher model produced features that were extracted before the final classification layer. These fea- tures were used during the label propagation process. Consid- ering the unlabeled data, the labels were propagated to produce pseudo-labels. Subsequently, the pseudo-labels from the teacher models were fed to the student model during the training to find an unbiased decision boundary. Our method outperformed the supervised CNN and the semisupervised LP-SSL models. We presented both qualitative and quantitative results for our pro- posed method and the reference methods. Our proposed method outperformed both the reference methods. Our proposed method considered a very limited number of labeled samples starting from 15 samples and unlabeled samples to train the models efficiently. In fact, our proposed method was characterized by the ability to learn useful information from both labeled and unlabeled data. Our method reduced the dependence on labeled samples, which is very time consuming and costly to collect for sea ice analysis. Therefore, this property of our method makes it a good fit for the community of sea ice analysis, where limited labeled data are available. We have also shown that by adding more unlabeled samples, the performance of the inference results has improved. Considering the semisupervised aspect, our method can be extended to other problem areas, where a very limited number of labeled samples are available since we coped with the biasing and dependence issues related to the labeled samples. The dataset we collected consists of different ice types. However, the number of samples for each ice type is limited. Considering the promising performance of our proposed method for binary sea ice classification, in our future work, we would adopt and extend our method to ice type classification. ACKNOWLEDGMENT This work is funded in part by Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN Grant no. 237906), the European Union’s Horizon 2020 research and innovation programme ExtremeEarth project, grant agreement no. 825258 (http://earthanalytics.eu/) and by the Fram Center under the Automised Large-scale Sea Ice Mapping (ALSIM) “Polhavet” flagship project. REFERENCES [1] L. P. Bobylev and M. W. Miles, “Sea ice in the Arctic paleoenvironments,” in Sea Ice in the Arctic. Berlin, Germany: Springer, 2020, pp. 9–56. [2] T. Vihma, “Effects of Arctic Sea ice decline on weather and climate: A review,” Surv. Geophys., vol. 35, no. 5, pp. 1175–1214, 2014. [3] M. R. Najafi, F. W. Zwiers, and N. P. Gillett, “Attribution of Arctic tem- perature change to greenhouse-gas and aerosol influences,” Nat. Climate Change, vol. 5, no. 3, pp. 246–249, 2015. [4] J. C. Stroeve, M. C. Serreze, M. M. Holland, J. E. Kay, J. Malanik, and A. P. Barrett, “The Arctic’s rapidly shrinking sea ice cover: A research synthesis,” Climatic Change, vol. 110, no. 3, pp. 1005–1027, Feb. 2012. [5] S. Haykin, E. O. Lewis, R. K. Raney, and J. R. Rossiter, Remote Sensing of Sea Ice and Icebergs, vol. 13. Hoboken, NJ, USA: Wiley, 1994. [6] A. Cristea, J. van Houtte, and A. P. Doulgeris, “Integrating incidence angle dependencies into the clustering-based segmentation of SAR im- ages,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 2925–2939, 2020. [7] M. Ghanbari, D. A. Clausi, L. Xu, and M. Jiang, “Contextual classification of sea-ice types using compact polarimetric SAR data,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 10, pp. 7476–7491, Oct. 2019. [8] J. L. Awange and J. B. K. Kiema, “Microwave remote sensing,” in Environ- mental Geoinformatics. Berlin, Germany: Springer, 2013, pp. 133–144. [9] S. Khaleghian, H. Ullah, T. Kræmer, N. Hughes, T. Eltoft, and A. Marinoni, “Sea ice classification of SAR imagery based on convolution neural networks,” Remote Sens., vol. 13, no. 9, 2021, Art. no. 1734. [10] L. Wang, K. Scott, and D. Clausi, “Sea ice concentration estimation during freeze-up from SAR imagery using a convolutional neural network,” Remote Sens., vol. 9, no. 5, 2017, Art. no. 408. [11] W. Wen, J. Timmermans, Q. Chen, and P. M. van Bodegom, “A review of remote sensing challenges for food security with respect to salinity and drought threats,” Remote Sens., vol. 13, no. 1, 2021, Art. no. 6. [12] D. H. Svendsen, M. Piles, J. Muñoz-Marí, D. Luengo, L. Martino, and G. Camps-Valls, “Integrating domain knowledge in data-driven Earth ob- servation with process convolutions,” IEEE Trans. Geosci. Remote Sens., 2021. [13] L. Hashemi-Beni and A. A. Gebrehiwot, “Flood extent mapping: An integrated method using deep learning and region growing using UAV optical data,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 2127–2135, Jan. 2021. [14] N. Asadi, K. A. Scott, A. S. Komarov, M. Buehner, and D. A. Clausi, “Evaluation of a neural network with uncertainty for detection of ice and water in SAR imagery,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 1, pp. 247–259, Jan. 2021. [15] J. Lohse, A. P. Doulgeris, and W. Dierking, “An optimal decision-tree design strategy and its application to sea ice classification from SAR imagery,” Remote Sens., vol. 11, no. 13, 2019, Art. no. 1574. [16] J. Lohse, A. Doulgeris, and W. Dierking, “Mapping sea-ice types from Sentinel-1 considering surface-type dependent effect of incidence angle,” Ann. Glaciol., vol. 61, no. 83, pp. 260–270, 2020. [17] D. Malmgren-Hansen et al., “A convolutional neural network architecture for Sentinel-1 and AMSR2 data fusion,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 3, pp. 1890–1902, Mar. 2021. [18] G.-J. Qi and J. Luo, “Small data challenges in Big Data Era: A sur- vey of recent progress on unsupervised and semi-supervised meth- ods,” IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2020.3031898. [19] E. Dalsasso, L. Denis, and F. Tupin, “SAR2SAR: A semi-supervised despeckling algorithm for SAR images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 4321–4329, 2021. [20] X. Jiang, G. Li, X.-P. Zhang, and Y. He, “A semisupervised Siamese network for efficient change detection in heterogeneous remote sens- ing images,” IEEE Trans. Geosci. Remote Sens., to be published, doi: 10.1109/TGRS.2021.3061686. KHALEGHIAN et al.: DEEP SEMISUPERVISED TEACHER–STUDENT MODEL BASED ON LABEL PROPAGATION 10771 [21] Y. Ding, X. Zhao, Z. Zhang, W. Cai, N. Yang, and Y. Zhan, “Semi-supervised locality preserving dense graph neural network with ARMA filters and context-aware learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., to be published, doi: 10.1109/TGRS.2021.3100578. [22] M. Douze, A. Szlam, B. Hariharan, and H. Jégou, “Low-shot learning with large-scale diffusion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3349–3358. [23] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2004, pp. 321–328. [24] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Label propagation for deep semi-supervised learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5065–5074. [25] N. Zakhvatkina, V. Smirnov, and I. Bychkova, “Sea ice classification based on neural networks method using Sentinel-1 data,” Int. Multidisciplinary Sci. GeoConf., vol. 19, no. 2.2, pp. 617–623, 2019. [26] N. Zakhvatkina, V. Smirnov, and I. Bychkova, “Satellite SAR data-based sea ice classification: An overview,” Geosciences, vol. 9, no. 4, 2019, Art. no. 152. [27] J. Karvonen, “Baltic sea ice concentration estimation using SENTINEL- 1 SAR and AMSR2 microwave radiometer data,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2871–2883, May 2017. [28] F. Dinessen, “Operational multisensor sea ice concentration algorithm utilizing SENTINEL-1 and AMSR2 data,” in Proc. 19th EGU General Assembly, 2017, Art. no. 19037. [29] A. M. Johansson, C. Brekke, G. Spreen, and J. A. King, “X-, C-, and L-band SAR signatures of newly formed sea ice in Arctic leads during winter and spring,” Remote Sens. Environ., vol. 204, pp. 162–180, 2018. [30] A. S. Fors, C. Brekke, A. P. Doulgeris, T. Eltoft, A. H. Renner, and S. Gerland, “Late-summer sea ice segmentation with multi-polarisation SAR features in C and X band,” Cryosphere, vol. 10, no. 1, pp. 401–415, 2016. [31] M. Dabboor, B. Montpetit, and S. Howell, “Assessment of the high resolution SAR mode of the RADARSAT constellation mission for first year ice and multiyear ice characterization,” Remote Sens., vol. 10, no. 4, 2018, Art. no. 594. [32] D.-B. Hong and C.-S. Yang, “Automatic discrimination approach of sea ice in the arctic ocean using SENTINEL-1 extra wide swath dual-polarized SAR data,” Int. J. Remote Sens., vol. 39, no. 13, pp. 4469–4483, 2018. [33] A. M. Johansson et al., “Consistent ice and open water classification com- bining historical synthetic aperture radar satellite images from ERS-1/2, ENVISAT ASAR, RADARSAT-2 and sentinel-1A/B,” Ann. Glaciology, vol. 61, no. 82, pp. 40–50, 2020. [34] W. Aldenhoff, C. Heuzé, and L. E. Eriksson, “Comparison of ice/water classification in Fram Strait from C-and L-band SAR imagery,” Ann. Glaciol., vol. 59, no. 76pt2, pp. 112–123, 2018. [35] L. Wang, K. A. Scott, L. Xu, and D. A. Clausi, “Sea ice concentration estimation during melt from dual-pol SAR scenes using deep convolutional neural networks: A case study,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4524–4533, Aug. 2016. [36] Y. Gao, F. Gao, J. Dong, and S. Wang, “Transferred deep learning for sea ice change detection from synthetic-aperture radar images,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 10, pp. 1655–1659, Oct. 2019. [37] M. Tom, R. Aguilar, P. Imhof, S. Leinss, E. Baltsavias, and K. Schindler, “Lake ice detection from SENTINEL-1 SAR with deep learning,” ISPRS Ann. Photogrammetry, Remote Sens. Spatial Inf. Sci., vol. 3, pp. 409–416, 2020. [38] H. Boulze, A. Korosov, and J. Brajard, “Classification of sea ice types in SENTINEL-1 SAR data using convolutional neural networks,” Remote Sens., vol. 12, no. 13, 2020, Art. no. 2165. [39] J. Lohse, A. P. Doulgeris, and W. Dierking, “Mapping sea-ice types from SENTINEL-1 considering the surface-type dependent effect of incidence angle,” Ann. Glaciol., vol. 61, no. 83, pp. 260–270, 2020. [40] M. Shin, “Semi-supervised learning with a teacher-student network for generalized attribute prediction,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 509–525. [41] Q. She, J. Zou, M. Meng, Y. Fan, and Z. Luo, “Balanced graph-based regu- larized semi-supervised extreme learning machine for EEG classification,” Int. J. Mach. Learn. Cybern., vol. 12, no. 4, pp. 903–916, 2021. [42] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness, “Pseudo-labeling and confirmation bias in deep semi-supervised learning,” in Proc. IEEE Int. Joint Conf. Neural Netw., 2020, pp. 1–8. [43] K. Yu, H. Ma, T. R. Lin, and X. Li, “A consistency regularization based semi-supervised learning approach for intelligent fault diagnosis of rolling bearing,” Measurement, vol. 165, 2020, Art. no. 107987. [44] J. Gordon and J. M. Hernández-Lobato, “Combining deep generative and discriminative models for Bayesian semi-supervised learning,” Pattern Recognit., vol. 100, 2020, Art. no. 107156. [45] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves ImageNet classification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10 687–10 698. [46] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learn- ing results,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 1195– 1204. [47] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Proc. 30th Int. Conf. Neural Inf. Process. Syst., 2016, pp. 2234–2242. [48] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi- supervised learning with deep generative models,” in Proc. 27th Int. Conf. Neural Inf. Process. Syst., 2014, pp. 3581–3589. [49] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel, “Mixmatch: A holistic approach to semi-supervised learning,” in Proc. 33rd Conf. Neural Inf. Process. Syst., 2019, pp. 5050–5060. [50] Y. Han et al., “A cooperative framework based on active and semi- supervised learning for sea ice classification using EO-1 hyperion data,” Trans. Japan Soc. Aeronaut. Space Sci., vol. 62, no. 6, pp. 318–330, 2019. [51] F. Staccone, “Deep learning for sea-ice classification on synthetic aperture radar (SAR) images in Earth observation: Classification using semi- supervised generative adversarial networks on partially labeled data,” master’s thesis, School Elect. Eng. Comput. Sci., KTH Roy. Inst. Technol., Stockholm, Sweden, 2020. [52] F. Li, D. A. Clausi, L. Wang, and L. Xu, “A semi-supervised approach for ice-water classification using dual-polarization SAR satellite imagery,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2015, pp. 28–35. [53] G. Algan and I. Ulusoy, “Image classification with deep learning in the presence of noisy labels: A survey,” Knowl.-Based Syst., vol. 215, 2021, Art. no. 106771. [54] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight ex- amples for robust deep learning,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 4334–4343. [55] S. Khaleghian, J. P. Lohse, and T. Kræmer, “Synthetic-aperture radar (SAR) based ice types/ice edge dataset for deep learning analysis,” 2020. [Online]. Available: https://doi.org/10.18710/QAYI4O [56] J. Falkingham and V. Smolyanitsky, “Electronic chart systems ice objects catalogue,” Version 5.1. draft for approval. Feb. 2012. [Online]. Available: http://hdl.handle. net/11329/403 [57] N. Hughes, “Extremeearth polar use case training data,” 2020. [Online]. Available: https://zenodo.org/record/3695276#.X-ytf2j0mUn [58] A. Iscen, G. Tolias, Y. Avrithis, T. Furon, and O. Chum, “Efficient diffusion on region manifolds: Recovering small objects with compact CNN rep- resentations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2077–2086. Salman Khaleghian received the bachelor’s de- gree in applied mathematics from Shahed Univer- sity, Tehran, Iran, in 2006, and the M.S. degree in computer software engineering from Science and Research branch of Azad University, Tehran, Iran, in 2010. He is currently working toward the Ph.D. degree in scalable computing for Earth observation with the Center for Integrated Remote Sensing and Forecasting for Arctic Operations, Faculty of Science and Technology, University of Tromsø—The Arctic University of Norway, Tromsø, Norway, and the SIR- IUS Lab, Department of Informatics, University of Oslo, Oslo, Norway. His research interests include machine learning, deep learning, scalable deep learning, and computer vision. 10772 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 14, 2021 Habib Ullah received the M.S. degree in electronics and computer engineering from Hanyang University, Seoul, South Korea, in 2009, and the Ph.D. degree in information and communication technology from the University of Trento, Trento, Italy, in 2015. From 2015 to 2016, he was an Assistant Professor of Electrical Engineering with COMSATS Univer- sity, Islamabad, Pakistan. From 2016 to 2020, he was an Assistant Professor with the College of Computer Science and Engineering, University of Ha’il, Ha’il, Saudi Arabia. In 2020, he was a Postdoctoral Re- searcher with the University of Tromsø—The Arctic University of Norway, Tromsø, Norway. He is currently an Associate Professor with the Norwegian University of Life Sciences, ˚As, Norway. His research interests include computer vision and machine learning. Thomas Kræmer received the M.Sc. degree in data analysis and sensor technology in 2011 from the University of Tromsø—the Arctic University of Nor- way, Tromsø, Norway, where he is currently working toward the Ph.D. degree. Since 2016, he has been the Head Engineer with the Earth Observation Laboratory, University of Tromsø—The Arctic University of Norway. His research interests include algorithms for automated analysis of synthetic aperture radar images for sea ice applications. Torbjørn Eltoft (Member, IEEE) received the M.Sc. and Ph.D. degrees in physics from University of Tromsø, Norway, in 1981 and 1984, respectively. In 1988, he joined the Faculty of Science and Technology, University of Tromsø (UiT)—The Arc- tic University of Norway, Tromsø, Norway, where he is currently a Professor of Remote Sensing with the Department of Physics and Technology. He is also the Director of the Centre for Integrated Remote Sens- ing and Forecasting for Arctic Operations. He has a significant publication record. His research interests include signal and image analysis, statistical modeling, and machine learning with applications in synthetic aperture radar and ocean color remote sensing. Dr. Eltoft was the co-recipient of the 2000 Outstanding Paper Award in Neural Networks awarded by the IEEE Neural Networks Council and an honorable mention for the 2003 Pattern Recognition Journal Best Paper Award. He was the recipient of the UiT Award for Research and Development in 2017. He was an Associate Editor for Pattern Recognition from 2005 to 2011 and a Guest Editor for Remote Sensing on the Special Issue for the PolInSAR 2017 Conference. Andrea Marinoni (Senior Member, IEEE) received the B.S., M.Sc. (cum laude), and Ph.D. degrees in electronic engineering from the University of Pavia, Pavia, Italy, in 2005, 2007, and 2011, respectively. He is currently an Associate Professor with the Earth Observation Group, Centre for Integrated Re- mote Sensing and Forecasting for Arctic Operations, Department of Physics and Technology, University of Tromsø—The Arctic University of Norway, Tromsø, Norway, and a Visiting Academic Fellow with the De- partment of Engineering, University of Cambridge, Cambridge, U.K. From 2015 to 2017, he was a Visiting Researcher with the Earth and Planetary Image Facility, Ben-Gurion University of the Negev, Be’er Sheva, Israel; the School of Geography and Planning, Sun Yat-Sen University, Guangzhou, China; the School of Computer Science, Fudan University, Shang- hai, China; the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China; and the Instituto de Telecomunicações, Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal. From 2013 to 2018, he was a Research Fellow with the Telecommunications and Remote Sensing Laboratory, Department of Electrical, Computer and Biomedical Engineering, University of Pavia. In 2009, he was a Visiting Researcher with the Communi- cations Systems Laboratory, Department of Electrical Engineering, University of California, Los Angeles, Los Angeles, CA, USA. In 2020 and 2021, he was a Visiting Professor with the Department of Electrical, Computer and Biomedical Engineering, University of Pavia. His research interests include efficient infor- mation extraction from multimodal remote sensing, nonlinear signal processing applied to large-scale heterogeneous records, Earth observation interpretation and Big Data mining, and analysis and management for human–environment interaction assessment. Dr. Marinoni was the recipient of the two-year “Applied Research Grant,” sponsored by the Region of Lombardy, Italy, and STMicroelectronics N.V., in 2011; the INROAD Grant, sponsored by the University of Pavia and Fondazione Cariplo, Italy, for supporting excellence in design of ERC proposal in 2017; the “Progetto professionalitá Ivano Becchi” grant funded by the Fondazione Banco del Monte di Lombardia, Italy, and sponsored by University of Pavia and NASA Jet Propulsion Laboratory, Pasadena, CA, for supporting the development of advanced methods of air pollution analysis by remote sensing data investigation, in 2018; and the ˚Asgard Research Program and ˚Asgard Recherche+ Program grants funded by the Institut Français de Norvège, Oslo, Norway, in 2019 and 2020, respectively, for supporting the development of scientific collaborations between French and Norwegian Research Institutes. He is the Founder and the Chair of the IEEE Geoscience and Remote Sensing Society (GRSS) Norway Chapter. He is also an Ambassador of the IEEE Region 8 Humanitarian activities, and a research contact point for the Norwegian Artificial Intelligence Research Consortium. He is a topical Associate Editor of machine learning for IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING. He was a Guest Editor for three special issues on multimodal remote sensing and sustainable develop- ment of IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING. He is the Leader of the GR4S Committee of the IEEE GRSS, coordinating the organization of schools and workshops sponsored by the IEEE GRSS worldwide. 8 Paper 3: AFSD- AdaptiveFeature Space Distillationfor Distributed DeepLearning 99 Received 7 July 2022, accepted 29 July 2022, date of publication 8 August 2022, date of current version 16 August 2022. Digital Object Identifier 10.1109/ACCESS.2022.3197646 AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning SALMAN KHALEGHIAN 1, HABIB ULLAH 2, EINAR BROCH JOHNSEN3, ANDERS ANDERSEN 1, AND ANDREA MARINONI 1, (Senior Member, IEEE) 1Faculty of Science and Technology, UiT The Arctic University of Norway, 9019 Tromsø, Norway 2Faculty of Science and Technology, Norwegian University of Life Sciences (NMBU), 1430 Ås, Norway 3Department of Informatics, University of Oslo, 0315 Oslo, Norway Corresponding author: Salman Khaleghian (salman.khaleghian@uit.no) This work was supported in part by the Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN) under Grant 237906, in part by the European Union’s Horizon 2020 Research and Innovation Programme Extreme Earth Project under Agreement 825258, and in part by the Fram Center under the Automised Large-Scale Sea Ice Mapping (ALSIM) through the ‘‘Polhavet’’ Flagship Project. ABSTRACT We propose a novel and adaptive feature space distillation method (AFSD) to reduce the communication overhead among distributed computers. The proposed method improves the Codistillation process by supporting longer update interval rates. AFSD performs knowledge distillates across the models infrequently and provides flexibility to the models in terms of exploring diverse variations in the training process. We perform knowledge distillation in terms of sharing the feature space instead of output only. Therefore, we also propose a new loss function for the Codistillation technique in AFSD. Using the feature space leads to more efficient knowledge transfer between models with a longer update interval rates. In our method, the models can achieve the same accuracy as Allreduce and Codistillation with fewer epochs. INDEX TERMS Distributed deep learning, convolutional neural networks, knowledge distillation, codistil- lation. I. INTRODUCTION To efficiently process big data, new deep learning based systems have been proposed. These systems significantly improve the overall performance when considering the big data. Furthermore, these systems scale up the training and inference process of deep learning techniques. To address the need for computational resources, the training process could be distributed across multiple computers connected by a net- work [1], [2]. Distributed deep learning is the most widely used approach to speed up the neural network training by leveraging the computational resources of multiple devices (e.g., multiple GPUs) [1]. These devices are used to accelerate training by distributing data (data-parallel) across the devices. Each device holds a copy of the model being trained, and the copies are kept synchronized throughout the training process. In one step of a typical implementation, every device com- putes a gradient using different data samples. The gradients The associate editor coordinating the review of this manuscript and approving it for publication was Massimo Cafaro . are then averaged across all devices (e.g., via an Allreduce operation). Subsequently, each device locally performs an optimization step using the average gradient [1], [3], [4]. The model parameters on every device are initialized to the same value to keep them synchronized. Increasing the number of devices brings more computational power. However, it also brings a significant communication overhead to synchronize the model parameters across all devices at every step [5], [6]. In the literature, different methods are introduced to reduce the communication overhead. For example, quantizing or compressing gradients before synchronizing them [7], syn- chronizing periodically rather than on every update [8], and only synchronizing among subsets of devices [9]. These methods reduce the communication overhead per update but they also impact the quality of a model or training time. In this regards, one-step Codistillation [10] method is pro- posed based on online distillation. The distillation process involves training a single model to match the ensemble output rather than the real data labels [11]. Codistillation makes use of distillation in an online manner to accelerate training by VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ 84569 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning transferring the improved performance of the ensemble to each model [12]. In the codistillation technique, the update interval for synchronizing the stored models is important, because it can affect the communication overhead between the machines. However, by having a longer update inter- val, the efficiency of knowledge distillation is crucial for knowledge sharing between models. Otherwise, the train- ing time would increase which is against the essence of the Codistillation process. In this paper, we propose adaptive feature space distilla- tion (AFSD). In our proposed method, the models perform knowledge distillation by sharing features instead of output as in regular Codistillation [10]. We achieve this by means of a new loss function for the Codistillation technique, char- acterized by reduced communication overhead. Our method can achieve the same accuracywith longer update intervals by considering fewer epochs. Our method performs knowledge distillation across the models infrequently, which provides flexibility to themodels in terms of learning diverse variations in the data. In short, the main contribution of this paper are: • A novel and adaptive feature space distillation method characterized by reduced communication overhead. • A new loss function for knowledge distillation which shares features instead of output in the Codistillation technique. • We outperform the state-of-the-art methods, Allreduce andCodistillation, by getting the same performancewith fewer epochs. The paper is organized as follows: related work is discussed in Section II and background material in Section III. We present the AFSD method in Section IV. Section V validates the proposed method experimentally. Section VI discuss the conclusions and future work. II. RELATED WORK We can categorize distributed deep learning techniques from two perspectives [1], namely concurrency in networks and concurrency in training. The first category can be further divided into the two sub-categories: model parallelism and data parallelism. A. CONCURRENCY IN NETWORKS In this category, we compute the output of the layers or the whole network in concurrent mode for the forward evalua- tion and backpropagation phases. Model parallelism divides the work according to the neurons in each layer. Different parts of the Deep Neural Network (DNN) are computed on different processors in different machines [1]. For example, Huang et al. [13] proposed an approach for training huge DNNs that can not be stored in one GPU. With data paral- lelism, several replicas of a neural network model are created during training, each on a different worker (processor). The workers process different mini-batches locally at each step using an optimizer. For example, the replicas of the model are synchronized (i.e., either by average gradients or parame- ters) at every step by communicating either with a centralized parameter server [14], [15] or decentralized using Allre- duce [16], [17], [18]. By relaxing the synchronization restric- tions and creating an inconsistent model, training workers can read parameters and update gradients asynchronously [19]. Data communication in distributed deep learning can be reduced using methods such as quantization [20], [21], [22] or sparsification [23], [24], [25], [26], [27]. B. CONCURRENCY IN TRAINING In this category, concurrency is used in the training stage. Multiple instances of training processes run independently on different machines. Concurrency is also used for ensem- ble learning. Distributed training of ensembles is a com- pletely parallel process, requiring no communication between the workers [28]. Ensemble learning requires more mem- ory and computational power in the training and inference phases. Therefore, knowledge distillation has been used in a two-step training to transfer knowledge of an ensemble with several networks to a single network [12], [29], [30]. To handle the problem of two-step training, Zhang et al. [31] investigated how an ensemble of students can learn collab- oratively and teach each other throughout the training pro- cess. Kim et al. [32] introduced a fusion learning method that trains a robust classifier by integrating feature maps. Park and Kwak [33] used feature-level ensembles for knowl- edge distillation by transferring the ensemble knowledge between multiple teacher networks. Although these meth- ods can be trained in parallel, their main problem is accu- racy when the number of epochs is not taken into account. Codistillation [10] taken advantage of ensemble learning and mutual learning to speed up the training. Codistillation uses a distillation-like loss that penalizes predictions made by one model on a batch of training samples for deviating from the predictions made by other models on the same batch. Our proposed method falls in the category concurrency in training. It includes distilled knowledge between models by directly tuning their feature space. III. BACKGROUND Distributed ensemble learning (DEL) addresses the problem of communication overhead by training multiple instances of models (weights) independently on the same dataset. The overall prediction is the average of the predictions of all the models. DEL requires no communication between the com- puters [1]. However, ensemble learning increases the cost during the validation stage since the predictions from mul- tiple machines are averaged. Ensemble learning also causes latency [1]. The distillation approach [12] addresses this problem by a two-step processes. In the first step, ensem- ble learning is performed over several machines, so-called teachers. In the second step, a student model is trained to mimic the teacher models. The student model is then used during the test stage. It reduces the cost of ensemble learn- ing by adding another phase to the training process. Using more machines, distillation increases the training time and 84570 VOLUME 10, 2022 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning FIGURE 1. Architecture of our proposed method. The checkpoints of the models are stored on a shared storage or a prediction server after the specific interval (N). Each model is forced to produce the same feature space as the others models for the same inputs batch. complexity in return for a quality improvement close to a larger teacher ensemble model [10]. However, the ensemble method with distillation remains time consuming. In contrast, one-step Codistillation [10] is based on online distillation and trains N copies of a model in parallel. It starts distillation early in the training pro- cess. In the Codistillation technique, the length of the update interval can affect the communication overhead among the machines. For example, the longer the update interval, the lower is the communication overhead. In the ideal case, the communication among different models should be reduced with longer update intervals. Moreover, the update interval affects the diversity between the trained models. The distillation process in Codistillation reduces diversity by forc- ing themodels to predict the same outputs for the same inputs. IV. AFSD: ADAPTIVE FEATURE SPACE DISTILLATION AFSD, the method proposed in this paper, exploits the feature space of each model to explore more variations in the training data instead of using the outputs of the networks to share knowledge between the models. In fact, we tune the mod- els to generate similar feature spaces for transferring knowl- edge between the models. We consider feature spaces to be similar when the distance between extracted features is the same for the same inputs using different models. To perform knowledge distillation more efficiently with fewer epochs, we manipulate and tune the feature space by considering a distillation term. Our method is based on the Codistillation technique [10] to share knowledge between models rather than synchronizingmodels to have the sameweights.We train n copies of a model in parallel and start distillation early in the training process by adding a new distillation loss term to the loss function. In fact, we have a set of students who simulta- neously learn during the training process to handle the classi- fication together (Figure 1). Each model saves the checkpoint of its weights on a shared storage after each update interval, so each model considers the other models as teachers in a distillation-like setup. However, each model uses the stale version of stored models on shared storage (or a prediction server) and performs additional forward passes with the same input batch. A. FORMULATION OF THE LEARNING PROCESS We consider an input batch X = {x1, . . . xn} where xk is an input sample and n is the size of the input batch and labels Y = {y1, y2, . . . , yn} according to the input batch X . The output of model i is defined by fθ i(xk ) where xk ∈ X and θi is the parameter of model i. The extracted features from model i are represented by Fθ i(xk ) = aik where aik are extracted features by considering the xk sample by the model i. The number of models is represented by m. We propose a loss function including a distillation penalty to force the models to produce the same feature space (extracted features before fully connected layers (FC)). For this purpose, the penalty term forces models to have the same distance between extracted features when considering the same inputs. In other words, if the distance between two features for two samples is shorter or longer using one model, the distance between the features for the same two samples extracted from the other models should be similar. For this purpose, we formulate a n × n distance matrix representing the distance between an individual feature and all other fea- tures. We formulate the distance matrix Ei of the model i as Ei =  D(ai1, ai1) D(ai1, ai2) . . . D(ai1, ain) D(ai2, ai1) D(ai2, ai2) . . . D(ai2, ain) . . . . . . . . . . . . D(ain, ai1) D(ain, ai2) . . . D(ain, ain)  (1) where aik and aij are extracted features consider- ing samples xk and xj using model i. Let D be the distance metric and n the input batch size. We formulate the loss function for the model i as VOLUME 10, 2022 84571 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning FIGURE 2. Comparison between earliest epoch number that achieves the same accuracy as Allreduce considering Codistillation and our proposed method. follows: loss = L(Y , fθ i(X ))+ α 1m− 1 ∑ g(Ei,Ej) i 6= j and j ∈ 1, 2, 3, . . .m (2) g(Ei,Ej) = 1ne T |Ei − Ej|e (3) where m is the number of models on different machines, X is the input batch, Y is the input labels, α is the penalty coefficient and L represents the loss between prediction and the labels. The function g indicates the average distance between elements of Ei and Ej considering batch size n, where e is the column vector whose entries are all 1’s and T is the transpose operator. The first term is the cross entropy loss and the second term is the distillation loss. Based on this loss function, we show our pro- posed method in Algorithm 1. Algorithm 1 AFSD Algorithm 1: Initialization of network parameters (θi, i ∈ 1, 2, 3, . . . ,m) 2: Initialization of learning rate µ and penalty coefficients α for each model 3: Repeat for the number of epochs: 4: Do in parallel for i ∈ 1, 2, 3, . . . ,m : 5: Get next batch (X,Y) by size n 6: Update θ : θk+1i = θki + µ∇θ i(L(Y , f kθ i(X )) + α 1m−1 ∑ g(Ei,Ej)) i 6= j and j ∈ 1,2,3, . . .m V. EXPERIMENTAL ANALYSIS In our experiments, wewant to evaluate ourmethod and show that it can address the aforementioned problems. The evaluation is done in terms of four research questions (RQs). RQ1: How does AFSD cope with the impact of the update interval with- out affecting the performance? RQ2: Does AFSD achieve the same performance with fewer epochs considering a longer update interval? RQ3: How does the new distillation loss term based on the fea- ture space affect the training process and the out- puts of the models? RQ4: How does AFSD work when different network architectures are trained? We address these research questions in the subsec- tions V-A, V-B, V-C, respectively. Experimental Setup and Design: In our experi- ments, we use the standard CIFAR10 dataset [34]. For comparison, we consider the Allreduce [16] and the Codistillation [10] techniques. We con- sider the Allreduce technique as a baseline for comparison. For the Allreduce technique, we used hyperparameters and the gradual warmup strategy for changing the learning rate [16]. For evaluating our model, we also consider different architectures, namely ResNet20, ResNet32, and VGG16. In case of ResNet20, we trained the network using the Allreduce technique and we achieved a top-1 vali- dation accuracy of 91.71% after 114 epochs. In the Allreduce technique, the input batch is divided 84572 VOLUME 10, 2022 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning TABLE 1. Validation accuracy and earliest epoch that achieves the same accuracy as Allreduce considering Codistillation [10] and AFSD with update interval between 40 and 3500. TABLE 2. Validation accuracy and earliest epoch that achieves the same accuracy as Allreduce considering Codistillation [10] and AFSD with update interval between 4000 and 15000. between different GPUs. However, we use the same batch input for training the models based on AFSD. Therefore, we feed the data twice as compared to the Allreduce technique and we expect that AFSD driven by linear scalability should achieve the same accuracywith half the number of epochs.We set the batch size such that each GPU receives 128 sam- ples in each batch in all three methods. In our experiments, we use two servers with three Nvidia GPUs namely Quadro RTX 5000 with 16GB memory on each. The servers are connected through a point-to-point 10Gb network. We use NFS shared storage to save and restore models checkpoints. A. UPDATE INTERVAL (RQ1 AND RQ2) We compare our proposed method with the Codistillation method [10] considering different update intervals.We consider update intervals from 50 steps to 15000 steps. To show the capability of our proposed method, we compare the epoch number on which each method achieves the desired accuracy based on the Allreduce method. The com- parison between the Codistillation method and our proposed method is shown in Figure 2. Table 1 shows validation accuracy and earliest epoch that achieves the same accuracy as Allreduce consider- ing Codistillation [10] andAFSDwith update inter- val 40 to 3500 and Table 2 shows same validation accuracy for update interval from 4000 to 15000. In this experiment, we record the epoch number when a specific method reaches the same accu- racy as Allreduce. It can be seen, when the update interval is longer, our method achieves the desired accuracy with much fewer epochs. Additionally, our proposed method is driven by linear scalability and tolerates longer update intervals considering 12000 steps. In fact, when using 9000 steps as the update interval, we update the saved checkpoint after 23 epochs (the batch size is 128, and we have 50000 samples in the training dataset). This shows that our method can achieve the same accuracy and scalability with very little communication over- head. However, when we update the models more often with shorter update intervals, we reduce the diversity of the models. When we use more power- ful distillation, information sharing does not have the intended benefits. The comparison between Allreduce, Codistillation, and our proposedmethod using ResNet20 [36] is shown in Figure 3. We con- sider update intervals equal to 5000 and 9000 steps. We use the early stopping strategy when Codistil- lation and AFSD achieve 91.71% or more accu- racy. We reduce the learning rate on epochs 45 and 55 with a factor 0.1 when Codistillation and AFSD are used. We also consider the first four epochs as the warm-up epochs applicable to the update interval. Otherwise, we let the networks continue the training independently, based on the specified update intervals. As we can see in Figure3, our method can achieve the same accuracy with fewer epochs compared to Allreduce and Codistillation. Since we are using much longer update intervals, the networks are trained more independently in a different direction. Therefore, we can see a rise in training loss when we transfer knowledge based on distillation loss terms. However, the loss reduces through the training process. VOLUME 10, 2022 84573 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning FIGURE 3. Validation accuracy and training loss using Codistillation [10], our proposed method, and Allreduce [35]. We use the early stopping strategy when the Codistillation achieves the same accuracy of Allreduce considering 9000 steps (a, b) and 5000 steps (c, d) update intervals. FIGURE 4. Distillation losses based on the outputs and features considering Codistillation and our proposed method. We do not include distillation loss based on the outputs in our method, but we measure it to compare with the Codistillation approach. B. DISTILLATION LOSS (RQ3) The loss function is based on two terms: cross- entropy loss and distillation loss. The Codistilla- tion technique uses distillation loss based on the output of the networks. In contrast, our method encodes distillation loss based on the feature space. Wewant to illustrate the difference between the two loss terms by showing how these behave during the training process for AFSD. Figure4 shows the dis- tillation loss considering the Codistillation method andAFSD. It should be noted, in ourmethod, we do not use output-based distillation, but we measure it during the training. As we can see, in Figures4 (a) and4 (c), distillation based on outputs fluctuates 84574 VOLUME 10, 2022 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning and even increases with AFSD. The networks are robust for classifying the extracted features even with different outputs since we just force them to generate the same features. This is because the neural network models can represent the same function in different ways with different parameter values [37]. However, Figures4 (b) and4 (d) show that the loss between extracted features is reduced through the training, and we can transfer knowl- edge between the models. Even small changes in the features can lead to more effective distillation, and the networks can achieve the same accuracy with fewer epochs. C. NETWORK ARCHITECTURES (RQ4) In order to evaluate the generalization capability of AFSD regardless of the use of a specific net- work architecture, we consider other architectures in this section. Figure5 shows the validation accu- racy of the ResNet32 network using Codistil- lation and AFSD for update intervals equal to 5000 and 9000 steps. For this experiment, we con- sider 92.41% as the top-1 accuracy of Allreduce. The Allreduce operation achieves this accuracy after 123 epochs. We use the learning rate schedule for this network to reduce the learning rate by a fac- tor of 0.1 on the epoch number equal to 50 and 60. As we can see, our method achieves this accuracy with fewer epochs compared to the Codistillation method. We also explore the VGG-16 [38] model and a 13-layer CNN [39] architecture to consider architectures not belonging to the ResNet fami- lies. However, considering both Codistillation and AFSD, using these architectures, we would not get the same accuracy with fewer epochs then needed for the Allreduce technique. Therefore it seems these methods can bemore effective with the ResNet family of architectures. D. THREATS TO VALIDITY In our experiments, we used three GPUs on each server. In each server we considered the Allreduce algorithm to train a model in a synchronized man- ner on these three GPUs. Increasing the number of GPUs on each server could affect the results since it would increase the number of epochs to get the appropriate accuracy. Since this would be FIGURE 5. Validation accuracy of the ResNet32 network using Codistillation and our proposed method for the update intervals equal to 5000 and 9000. same for both Codistillation and AFSD, we con- sider them significant parameters. Experiments with more GPUs on each server in a two-way setup can be considered. We considered the ResNet20, ResNet30, VGG16 and a 13-layer CNN [39] architectures in our experiments. Deeper architectures with more parameters could exhibit different behaviors. Deeper architectures usually learn features at var- ious levels of abstractions. Therefore, considering only high-level features at the end of the network would not be sufficient to share knowledge.We can also observe this issue when a pure convolution network like VGG16 is used. Hence deeper archi- tectures with more parameters and more inter- mediate features can be considered for future experiments. We consider random initialization as a diversity enforcement regularization. However, it can violate a specific situation when the initialized weights or training directions are aligned together. Hence we assume that this diversity can be accomplished by weight randomization. VI. CONCLUSION In this paper, we propose AFSD, a new method for large scale distributed deep learning. The main VOLUME 10, 2022 84575 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning novelty of AFSD is knowledge sharing based on the feature space of parallel models in the Codis- tillation setup. Our method supports much longer update intervals using a new knowledge distillation loss function. By prolonging the update interval, the models become more diverse and contribute more to the training process. Additionally, the communication overhead is significantly reduced. We show that with only two updates through the training process, the models can achieve linear scalability using the feature space for sharing the information. In future work, we will consider our approach of feature space sharing for scalable semi-supervised learning. It has been shown that generating pseudo-labels for unlabeled data using fea- ture spaces increases the performance of deep semi-supervised learning [40], [41]. An extension of our method to semi-supervised learning, could be useful to address the issue of scarce training data, which is critical in many areas where a small number of labeled data and a large amount of unla- beled data are available. REFERENCES [1] T. Ben-Nun and T. Hoefler, ‘‘Demystifying parallel and distributed deep learning: An in-depth concurrency analysis,’’ ACM Comput. Surv., vol. 52, no. 4, pp. 1–43, Jul. 2020. [2] A. C. Zhou, B. Shen, Y. Xiao, S. Ibrahim, and B. He, ‘‘Cost-aware par- titioning for efficient large graph processing in geo-distributed datacen- ters,’’ IEEE Trans. Parallel Distrib. Syst., vol. 31, no. 7, pp. 1707–1723, Jul. 2020. [3] R. Mayer and H.-A. Jacobsen, ‘‘Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools,’’ ACM Comput. Surv., vol. 53, no. 1, pp. 1–37, Jan. 2021. [4] M. A. Soyturk, P. Akhtar, E. Tezcan, and D. Unat, ‘‘Monitoring collective communication among GPUs,’’ 2021, arXiv:2110.10401. [5] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz, ‘‘Revisiting distributed synchronous SGD,’’ 2016, arXiv:1604.00981. [6] F. Zhang, Z. Chen, C. Zhang, A. C. Zhou, J. Zhai, and X. Du, ‘‘An effi- cient parallel secure machine learning framework on GPUs,’’ IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 9, pp. 2262–2276, Sep. 2021. [7] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, ‘‘QSGD: Communication-efficient SGD via gradient quantization and encoding,’’ in Advances in Neural Information Processing Systems, vol. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2017. [Online]. Available: https://proceedings. neurips.cc/paper/2017/file/6c340f25839e6acdc73414517203f5f0- Paper.pdf [8] S. U. Stich, ‘‘Local SGD converges fast and communicates little,’’ in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–19. [Online]. Available: https://openreview.net/forum?id=S1g2JnRcFX [9] M. Assran, N. Loizou, N. Ballas, and M. Rabbat, ‘‘Stochastic gradient push for distributed deep learning,’’ in Proc. Int. Conf. Mach. Learn., 2019, pp. 344–353. [10] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton, ‘‘Large scale distributed neural network training through online distilla- tion,’’ 2018, arXiv:1804.03235. [11] Z. Allen-Zhu and Y. Li, ‘‘Towards understanding ensemble, knowledge distillation and self-distillation in deep learning,’’ 2020, arXiv:2012.09816. [12] G. Hinton, O. Vinyals, and J. Dean, ‘‘Distilling the knowledge in a neural network,’’ in Proc. Conf. Neural Inf. Process. Syst. (NIPS), Deep Learn. Workshop, 2014. [Online]. Available: https://openreview. net/forum?id=S1g2JnRcFX [13] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, ‘‘GPipe: Efficient training of giant neural networks using pipeline parallelism,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 103–112. [14] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, ‘‘Communication efficient distributed machine learning with the parameter server,’’ in Proc. NIPS, vol. 2, 2014, pp. 1–4. [15] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, ‘‘GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server,’’ in Proc. 11th Eur. Conf. Comput. Syst., Apr. 2016, pp. 1–16. [16] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L.Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, ‘‘Accurate, large minibatch SGD: Training ImageNet in 1 hour,’’ CoRR, vol. abs/1706.02677, pp. 1–12, Jun. 2017. [17] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, ‘‘Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,’’ 2017, arXiv:1705.09056. [18] A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, ‘‘S-Caffe: Co-designingMPI runtimes and caffe for scalable deep learning onmodern GPU clusters,’’ in Proc. 22nd ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2017, pp. 193–205. [19] F. Niu, B. Recht, C. Ré, and S. J. Wright, ‘‘HOGWILD!: A lock- free approach to parallelizing stochastic gradient descent,’’ 2011, arXiv:1106.5730. [20] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, ‘‘Deep learn- ing with limited numerical precision,’’ in Proc. Int. Conf. Mach. Learn., 2015, pp. 1737–1746. [21] S. Han, H. Mao, and W. J. Dally, ‘‘Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman cod- ing,’’ in Proc. 4th Int. Conf. Learn. Represent. (ICLR), Y. Bengio and Y. LeCun, Eds., San Juan, PR, USA, May 2016, pp. 1–14. [22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, ‘‘Quan- tized neural networks: Training neural networkswith low precisionweights and activations,’’ J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869–6898, 2017. [23] A. F. Aji and K. Heafield, ‘‘Sparse communication for distributed gra- dient descent,’’ in Proc. Conf. Empirical Methods Natural Lang. Pro- cess. Copenhagen, Denmark: Association for Computational Linguis- tics, Sep. 2017, pp. 440–445. [Online]. Available: https://aclanthology. org/D17-1045 [24] C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan, ‘‘Adacomp: Adaptive residual gradient compression for data-parallel distributed training,’’ in Proc. AAAI Conf. Artif. Intell., 2018, vol. 32, no. 1, pp. 1–9. [25] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally, ‘‘Deep gradi- ent compression: Reducing the communication bandwidth for dis- tributed training,’’ in Proc. 6th Int. Conf. Learn. Represent. (ICLR), Vancouver, BC, Canada, Apr./May 2018, pp. 1–14. [Online]. Available: https://openreview.net/forum?id=SkhQHMW0W [26] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler, ‘‘SparCML: High-performance sparse communication for machine learn- ing,’’ in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2019, pp. 1–15. [27] P. D’Ambra and S. Filippone, ‘‘A parallel generalized relaxation method for high-performance image segmentation on GPUs,’’ J. Com- put. Appl. Math., vol. 293, pp. 35–44, Feb. 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S037704271500254X [28] S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra, ‘‘Why m heads are better than one: Training a diverse ensemble of deep net- works,’’ 2015, arXiv:1511.06314. [29] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, ‘‘Augmix: A simple data processing method to improve robustness and uncertainty,’’ in Proc. 8th Int. Conf. Learn. Rep- resent. (ICLR), Addis Ababa, Ethiopia, Apr. 2020, pp. 1–15. [Online]. Available: https://openreview.net/forum?id=S1gmrxHFvB [30] T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and B. Ramabhadran, ‘‘Efficient knowledge distillation from an ensemble of teachers,’’ in Proc. Interspeech, Aug. 2017, pp. 3697–3701. 84576 VOLUME 10, 2022 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning [31] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, ‘‘Deep mutual learn- ing,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4320–4328. [32] J. Kim, M. Hyun, I. Chung, and N. Kwak, ‘‘Feature fusion for online mutual knowledge distillation,’’ in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 4619–4625. [33] S. Park and N. Kwak, ‘‘Feature-level ensemble knowledge distillation for aggregating knowledge from multiple networks,’’ in Proc. ECAI. Amsterdam, The Netherlands: IOS Press, 2020, pp. 1411–1418. [34] A. Krizhevsky and G. Hinton, ‘‘Learning multiple layers of features from tiny images,’’ Univ. Toronto, Toronto, ON, USA, Tech. Rep., 2009. [35] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, ‘‘Accurate, large minibatch SGD: Training ImageNet in 1 hour,’’ 2017, arXiv:1706.02677. [36] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778. [37] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. [38] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for large-scale image recognition,’’ in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), Y. Bengio and Y. LeCun, Eds., San Diego, CA, USA, May 2015, pp. 1–14. [39] A. Tarvainen and H. Valpola, ‘‘Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learn- ing results,’’ in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1195–1204. [40] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, ‘‘Label propagation for deep semi-supervised learning,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5070–5079. [41] S. Khaleghian, H. Ullah, T. Kraemer, T. Eltoft, and A. Marinoni, ‘‘Deep semisupervised teacher–student model based on label propagation for sea ice classification,’’ IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 10761–10772, 2021. SALMAN KHALEGHIAN received the bachelor’s degree in applied mathematics/computer science and the M.S. degree in computer software engi- neering, in 2006 and 2010, respectively. He is currently pursuing the Ph.D. degree in scalable computing for earth observation with the Center for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA), Faculty of Science and Technology, University of Tromsø (UiT), and the SIRIUS Laboratory, Department of informat- ics, University of Oslo (UiO). His research interests include machine learn- ing, deep learning, scalable deep learning, and computer vision. HABIB ULLAH received the M.S. degree in elec- tronics and computer engineering from Hanyang University, South Korea, in 2009, and the Ph.D. degree in information and communication technol- ogy from the University of Trento, Italy, in 2015. He served as an Assistant Professor in electrical engineering at COMSATS University Islamabad, Pakistan, from 2015 to 2016. He worked as an Assistant Professor at the College of Computer Science and Engineering, University of Ha’il, Saudi Arabia, from 2016 to 2020. He also worked as a Postdoctoral Researcher at the UiT The Arctic University of Norway, in 2020. He is currently working as an Associate Professor with the Norwegian University of Life Sciences (NMBU). His research interests include computer vision and machine learning. EINAR BROCH JOHNSEN is currently a Professor with the Department of Informatics, University of Oslo. He is active in formal methods for distributed and concurrent systems, including object-oriented and actor languages, manycore computing, cloud computing, and digital twins. He is one of the main developers of the ABS Modeling Language. He is the Strategy Director of SIRIUS, a Center for Research-Driven Innovation on Scalable Data Access, with eight years funding, from the Research Council of Norway. He has been prominently involved in many national and European research projects; in particular, he was the Coordinator of the EU FP7 Project Envis- age on formal methods for cloud computing, from 2013 to 2016, and the Scientific Coordinator of the EU H2020 Project HyVar on hybrid variability systems, from 2015 to 2018. His research interests include programming models andmethodology, program specification andmodeling, formal meth- ods and associated theory, lightweight analysis, type systems, testing, and deductive verification and formal logic. He is a member of IFIP WG2.2 ‘‘Formal Description of Programming Concepts’’. He was a Board Member of SINTEF ICT, from 2009 to 2015. He is currently a member of the Scientific Council of the Science Centre, UiO, a Board Member of the Formal Methods Europe, and a Steering Committee Member of the conference series on Fundamental Approaches to Software Engineering (FASE), Integrated Formal Methods (iFM), and For- mal Techniques for Networked and Distributed Systems (FORTE). He was the General Chair of FM 2015 and DisCoTec 2008, and the PC Chair of FASE 2022, SEFM 2018, TAP 2017, ESOCC 2016, iFM 2013, and FMOODS 2007. He is an Editorial Board Member of the Journals Formal Aspects of Computing and Journal of Logical and Algebraic Methods in Programming. ANDERS ANDERSEN is the Head of Depart- ment at the Department of Computer Science, UiT The Arctic University of Norway. He is Lead- ing a national workgroup for the strengthening of ICT-security in technology studies in Norway, from 2019 to 2020. The workgroup is appointed by the Norwegian Association of Higher Education Institutions (UHR) on behalf of the Ministry of Education and Research. The research of Anders Andersen covers four main domains. The first domain is security, where the focus is adaptable security, secure storage and sharing of data, security related to mobile sys- tems, NFC and secure elements (e.g. SIM), and analysis of sensitive data (e.g. person-sensitive health data). The second domain is support for mobility and context, where configuration and reconfiguration of systems based on the current context are made possible with adaptable architectures. This domain includes integration of a wide range of services and information sources for the development of personalized and context sensitive solutions. The third domain is support for multimedia applications, including support for continuous media. He has used formal specifications directly for quality of service (QoS) management in running systems and he has developed an explicit binding architecture for multimedia communication. The fourth domain is adaptive system architectures, where he has developed program- ming abstractions for adaption control and techniques to observe and analyse system behavior. His research interests include security, mobile services, personalisa- tion, complex distributed applications, and privacy aware analysis of sensitive data. VOLUME 10, 2022 84577 S. Khaleghian et al.: AFSD: Adaptive Feature Space Distillation for Distributed Deep Learning ANDREA MARINONI (Senior Member, IEEE) received the B.S., M.Sc., (cum laude) and Ph.D. degrees in electronic engineering from the Univer- sity of Pavia, Pavia, Italy, in 2005, 2007, and 2011, respectively. From 2013 to 2018, he has been a Research Fellow at the Telecommunications and Remote Sensing Laboratory, Department of Electrical, Computer, and Biomedical Engineering, Univer- sity of Pavia. In 2009, he has been a Visiting Researcher at the Communications Systems Laboratory, Department of Electrical Engineering, University of California–Los Angeles (UCLA), Los Angeles, CA, USA. In 2011, he has been the recipient of the two-year ‘‘Applied Research Grant’’, sponsored by the Region of Lombardy, Italy, and STMicroelectronics N.V. In 2017, he has been the recipient of the INROAD grant, sponsored by University of Pavia and Fondazione Cariplo, Italy, for supporting excellence in design of ERC proposal. In 2018, he has been the recipient of the ‘‘Progetto Professionalità Ivano Becchi’’ grant funded by Fondazione Banco del Monte di Lombardia, Italy, and sponsored by the University of Pavia and NASA Jet Propulsion Laboratory, Pasadena, CA, for supporting the development of advanced methods of air pollution analysis by remote sensing data investigation. He has been the recipient of Åsgard Research Programme and Åsgard Recherche+Programme grants funded by the Institut Français de Norvège, Oslo, Norway, in 2019 and 2020, respec- tively, for supporting the development of scientific collaborations between French and Norwegian research institutes. From 2015 to 2017, he has been a Visiting Researcher at the Earth and Planetary Image Facility, Ben-Gurion University of the Negev, Be’er Sheva, Israel; School of Geography and Planning, Sun Yat-sen University, Guangzhou, China; School of Computer Science, Fudan University, Shanghai, China.; Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China.; Instituto de Telecomunicações, Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal. In 2020 and 2021, he has been a Visiting Professor at the Department of Electrical, Computer, and Biomedical Engineering, Univer- sity of Pavia. He is currently an Associate Professor with the Earth Obser- vation Group, Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA), Department of Physics and Technology, UiT the Arctic University of Norway, Tromsø, Norway, and a Visiting Academic Fellow with the Department of Engineering, University of Cambridge, U.K. His main research interests include efficient information extraction from multimodal remote sensing, nonlinear signal processing applied to large scale heterogeneous records, earth observation interpretation and big data mining, and analysis and management for human–environment interaction assessment. He is the Founder and the Current Chair of the IEEE GRSS Nor- way Chapter. He is also an Ambassador for IEEE Region 8 Humanitar- ian Activities, and a Research Contact Point for the Norwegian Artificial Intelligence Research Consortium (NORA–nora.ai). He serves as a Topical Associate Editor of machine learning for IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING. He has been the Guest Editor of three special issues on multimodal remote sensing and sustainable development for IEEE JOURNAL OF SELECTED TOPICS INAPPLIED EARTHOBSERVATIONS AND REMOTE SENSING. He is also the leader of the GR4S committee within IEEE GRSS, coordinat- ing the organization of schools and workshops sponsored by IEEE GRSS worldwide. 84578 VOLUME 10, 2022 9 Conclusions and FutureWorks We explored the potential of different cnn models for sea ice classification. The results showed that cnn architectures (such as those based on the VGG network) typically obtain promising classification results. We assessed the robustness of the trained cnn models when applied to sar scenes collected at different spatial locations and times our findings were positive and show that the models have good potential. However, we found that the additive system noise in the sar imagery is a challenging problem in obtaining refined sea ice maps. Additionally, we emphasized that the scarcity of reliable and balanced sea ice training and validation datasets is a severe problem for training deep neural network architectures. We proposed a semi-supervised method that considered very limited labeled samples and relatively large unlabeled samples to train the models to address the scarce training data issue. In fact, our method in semi-supervised learning is characterized by the ability to learn practical information from labeled and unlabeled data. Ourmethod reduced the dependency on labeled samples which is very time-consuming and costly to collect for sea ice analysis. However, as mentioned before, Deep Neural Networks training is computationally intensive, especially in semi-supervised learning where we can increase unlabeled data to involve in the training process easier. Additionally, we found selecting the labeled and unlabeled data can significantly affect the performance of the 111 112 chapter 9 conclusions and future works models. Specifically, unlabeled and labeled data should come from the same classes and the same distribution. Otherwise, the methods could not be able to extract useful information from unlabeled data to improve the model. We proposed a new method for large-scale distributed deep learning analysis to address computation complexity. Our method is a novel approach for knowl- edge sharing based on the feature space of parallel models. In our method, the communication overhead is significantly reduced and we showed that with only a few communications through the training process, the models can achieve lin- ear scalability. We considered our proposed distributed deep learning method in a supervised manner. However, this method can be used to extend our semi-supervised method for large-scale semi-supervised analysis. 9.1 Future works In our methods, we considered patch-wise classification which degrades the spatial resolution. To cope with this issue one direction could be considering a pixel-wise set-up. However, the pixel-wise set-up will be driven by more computational overhead. Therefore, transforming the current architecture to process the input data quickly in a pixel-wise setup is necessary. For this purpose, one direction can be accomplished by replacing the fc layers with convolution layers based on the work of Sermanet et al. [137]. We considered a supervised version of our proposed ddl. In future work, our adaptive feature space distillation can be used with the junction of label propagation [135, 73] idea for scalable semi-supervised learning. The feature space from different models based on unlabeled data can bring more informa- tion. Bibliography [1] X.-W. Chen and X. Lin, “Big data deep learning: challenges and perspec- tives,” IEEE access, vol. 2, pp. 514–525, 2014. [2] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, “Efficient processing of deep neural networks: A tutorial and survey.” [Online]. Available: https://arxiv.org/abs/1703.09039 [3] A. Ng. (2019) Deep learning course - stanford cs229. [Online]. Available: http://cs229.stanford.edu/materials/CS229-DeepLearning.pdf [4] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017. [5] e. a. Manolis Koubarakis, Konstantina Bereta, “From copernicus big data to extreme earth analytics.” in 22nd International Conference on Extending Database Technology (EDBT 2019). Lisbon, Portugal, pp. 26–29, year=2019. [6] M. A. Wulder, J. G. Masek, W. B. Cohen, T. R. Loveland, and C. E. Woodcock, “Opening the archive: How free data has enabled the science and monitoring promise of landsat,” Remote Sensing of Environment, vol. 122, pp. 2–10, 2012. [7] ESA. (2019) Esa copernicus program. [Online]. Available: https: //www.copernicus.eu/en/about-copernicus [8] M. Chi, A. Plaza, J. A. Benediktsson, Z. Sun, J. Shen, and Y. Zhu, “Big data for remote sensing: Challenges and opportunities,” Proceedings of the IEEE, vol. 104, no. 11, pp. 2207–2219, 2016. [9] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016. 113 114 bibl iography [10] G. E. Hinton, “Learning multiple layers of representation,” Trends in cognitive sciences, vol. 11, no. 10, pp. 428–434, 2007. [11] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016. [12] X. Jia, B.-C. Kuo, and M. M. Crawford, “Feature mining for hyperspectral image classification,” Proceedings of the IEEE, vol. 101, no. 3, pp. 676–697, 2013. [13] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” IEEE Signal Processing Magazine, vol. 31, no. 1, pp. 45–54, 2014. [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural informa- tion processing systems, vol. 25, pp. 1097–1105, 2012. [15] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017. [16] G.-J. Qi and J. Luo, “Small data challenges in big data era: A survey of recent progress on unsupervised and semi-supervised methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. [17] Y. Ouali, C. Hudelot, and M. Tami, “An overview of deep semi-supervised learning,” arXiv preprint arXiv:2006.05278, 2020. [18] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreasonable effectiveness of data in deep learning era,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 843–852. [19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015. [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. bibl iography 115 [21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255. [22] Y. Wang, Q. Wang, S. Shi, X. He, Z. Tang, K. Zhao, and X. Chu, “Bench- marking the performance and energy efficiency of ai accelerators for ai training,” in 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), 2020, pp. 744–751. [23] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, “Benchmark analysis of representative deep neural network architectures,” IEEE access, vol. 6, pp. 64 270–64 277, 2018. [24] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis,” ACM Computing Surveys (CSUR), vol. 52, no. 4, pp. 1–43, 2019. [25] Z. Tang, S. Shi, X. Chu, W. Wang, and B. Li, “Communication-efficient distributed deep learning: A comprehensive survey,” arXiv preprint arXiv:2003.06307, 2020. [26] H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgdwith faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 5693–5700. [27] Y. Ma, H. Wu, L. Wang, B. Huang, R. Ranjan, A. Zomaya, and W. Jie, “Remote sensing big data computing: Challenges and opportunities,” Future Generation Computer Systems, vol. 51, pp. 47–60, 2015. [28] H. Özköse, E. S. Arı, and C. Gencer, “Yesterday, today and tomorrow of big data,” Procedia-Social and Behavioral Sciences, vol. 195, pp. 1042–1050, 2015. [29] NASA. (1999) Earth observatory (1999). [Online]. Available: https: //earthobservatory.nasa.gov/features/RemoteSensing/remote_02.php [30] X. Yao, G. Li, J. Xia, J. Ben, Q. Cao, L. Zhao, Y. Ma, L. Zhang, and D. Zhu, “Enabling the big earth observation data via cloud computing and dggs: Opportunities and challenges,” Remote Sensing, vol. 12, no. 1, 2020. [Online]. Available: https://www.mdpi.com/2072-4292/12/1/62 [31] J. Xia, C. Yang, and Q. Li, “Building a spatiotemporal index for earth observation big data,” International journal of applied earth observation 116 bibl iography and geoinformation, vol. 73, pp. 245–252, 2018. [32] X. Hu, J. S. Næss, C. M. Iordan, B. Huang, W. Zhao, and F. Cherubini, “Recent global land cover dynamics and implications for soil erosion and carbon losses from deforestation,” Anthropocene, vol. 34, p. 100291, 2021. [33] D. G. Goodin, K. L. Anibas, andM. Bezymennyi, “Mapping land cover and land use from object-based classification: An example from a complex agricultural landscape,” International Journal of Remote Sensing, vol. 36, no. 18, pp. 4702–4723, 2015. [34] P. Vicharnakorn, R. P. Shrestha, M. Nagai, A. P. Salam, and S. Kiratipray- oon, “Carbon stock assessment using remote sensing and forest inventory data in savannakhet, lao pdr,” Remote Sensing, vol. 6, no. 6, pp. 5452– 5479, 2014. [35] R. D. D. Altarez, A. Apan, and T. Maraseni, “Spaceborne satellite remote sensing of tropical montane forests: a review of applications and future trends,” Geocarto International, pp. 1–29, 2022. [36] C. Persello, J. D. Wegner, R. Hänsch, D. Tuia, P. Ghamisi, M. Koeva, and G. Camps-Valls, “Deep learning and earth observation to support the sustainable development goals: Current approaches, open challenges, and future opportunities,” IEEE Geoscience and Remote Sensing Magazine, vol. 10, no. 2, pp. 172–200, 2022. [37] G. J. Schumann,G. R. Brakenridge,A. J. Kettner,R. Kashif, andE. Niebuhr, “Assisting flood disaster response with earth observation data and prod- ucts: A critical assessment,” Remote Sensing, vol. 10, no. 8, p. 1230, 2018. [38] E. Lancheros, A. Camps, H. Park, P. Rodriguez, S. Tonetti, J. Cote, and S. Pierotti, “Selection of the key earth observation sensors and platforms focusing on applications for polar regions in the scope of copernicus system 2020–2030,” Remote Sensing, vol. 11, no. 2, p. 175, 2019. [39] L. P. Bobylev and M. W. Miles, “Sea ice in the arctic paleoenvironments,” in Sea Ice in the Arctic. Springer, 2020, pp. 9–56. [40] T. Vihma, “Effects of arctic sea ice decline on weather and climate: A review,” Surveys in Geophysics, vol. 35, no. 5, pp. 1175–1214, 2014. [41] M. R. Najafi, F. W. Zwiers, and N. P. Gillett, “Attribution of arctic temper- ature change to greenhouse-gas and aerosol influences,” Nature Climate Change, vol. 5, no. 3, p. 246, 2015. bibl iography 117 [42] J. C. Stroeve, M. C. Serreze, M. M. Holland, J. E. Kay, J. Malanik, and A. P. Barrett, “The arctic’s rapidly shrinking sea ice cover: a research synthesis,” Climatic Change, vol. 110, no. 3, pp. 1005–1027, Feb 2012. [Online]. Available: https://doi.org/10.1007/s10584-011-0101-1 [43] S. Haykin, E. O. Lewis, R. K. Raney, and J. R. Rossiter, Remote sensing of sea ice and icebergs. John Wiley & Sons, 1994, vol. 13. [44] O. J. Hegelund, A. Everett, T. Cheeseman, P. Wagner, N. Hughes,M. Piere- chod, K. Southerland, P. Robinson, J. Hutchings, Å. Kiærbech et al., “Ex- tending the ice watch system as a citizen science project for the collection of in-situ sea ice observations,” in EGU General Assembly Conference Ab- stracts, 2020, p. 7126. [45] J. K. Hutchings and M. K. Faber, “Sea-ice morphology change in the canada basin summer: 2006–2015 ship observations compared to obser- vations from the 1960s to the early 1990s,” Frontiers in Earth Science, vol. 6, p. 123, 2018. [46] H. Kaartokallio, M. A. Granskog, H. Kuosa, and J. Vainio, “Ice in subarctic seas,” Sea Ice, pp. 630–644, 2017. [47] C. Haas, “Airborne electromagnetic sea ice thickness sounding in shal- low, brackish water environments of the caspian and baltic seas,” in International Conference on Offshore Mechanics and Arctic Engineering, vol. 47470, 2006, pp. 717–722. [48] eos. (18) Types of remote sensing: Technology changing the world. [49] J. Grahn, “Multi-frequency radar remote sensing of sea ice. modelling and interpretation of polarimetric multi-frequency radar signatures of sea ice,” 2018. [50] D. Murashkin, G. Spreen, M. Huntemann, and W. Dierking, “Method for detection of leads from sentinel-1 sar images,” Annals of Glaciology, vol. 59, no. 76pt2, p. 124–136, 2018. [51] T. I. S. of the Norwegian Meteorological Institute (NIS). (2022) Ice charts -. [Online]. Available: https://cryo.met.no/en/latest-ice-charts [52] J. Lohse, “On automated classification of sea ice types in sar imagery,” Ph.D. dissertation, Universitetet i Tromsø, 2021. [53] L.-K. Soh, C. Tsatsoulis, D. Gineris, and C. Bertoia, “Arktos: An intelli- 118 bibl iography gent system for sar sea ice image classification,” IEEE Transactions on geoscience and remote sensing, vol. 42, no. 1, pp. 229–248, 2004. [54] J. A. Karvonen, “Baltic sea ice sar segmentation and classification using modified pulse-coupled neural networks,” IEEE Transactions on Geo- science and Remote Sensing, vol. 42, no. 7, pp. 1566–1574, 2004. [55] W. Dierking, “Sea ice monitoring by synthetic aperture radar,” Oceanog- raphy, vol. 26, no. 2, pp. 100–111, 2013. [56] ESA. (2022) Sentinel-1 acquisition modes. [Online]. Available: https://sentinels.copernicus.eu/web/sentinel/user-guides/ sentinel-1-sar/acquisition-modes [57] J.-S. Lee and E. Pottier, Polarimetric radar imaging: from basics to appli- cations. CRC press, 2017. [58] C. Oliver and S. Quegan, Understanding synthetic aperture radar images. SciTech Publishing, 2004. [59] A. J. Schweiger, “Changes in seasonal cloud cover over the arctic seas from satellite and surface observations,” Geophysical Research Letters, vol. 31, no. 12, 2004. [60] N. Zakhvatkina, V. Smirnov, and I. Bychkova, “Satellite sar data-based sea ice classification: An overview,” Geosciences, vol. 9, no. 4, p. 152, 2019. [61] A. Moreira, P. Prats-Iraola, M. Younis, G. Krieger, I. Hajnsek, and K. P. Papathanassiou, “A tutorial on synthetic aperture radar,” IEEE Geoscience and remote sensing magazine, vol. 1, no. 1, pp. 6–43, 2013. [62] W. Parker, “Discover the benefits of radar imaging: The top 10 consid- erations for buying and using synthetic aperture radar imagery,” Earth Imaging Journal, 2012. [63] W. Dierking, “Sea ice and icebergs,” in Maritime Surveillance with Syn- thetic Aperture Radar. Institution of Engineering and Technology, 2020, pp. 173–225. [64] C. Elachi and J. J. Van Zyl, Introduction to the physics and techniques of remote sensing. John Wiley & Sons, 2021. [65] J. Lohse, “On automated classication of sea ice types in sar imagery,” Ph.D. dissertation, UiT The Arctic university of Norway, 2020. bibl iography 119 [66] K. Čotar, K. Oštir, and Ž. Kokalj, “Radar satellite imagery and automatic detection of water bodies,” Geodetski glasnik, vol. 50, no. 47, pp. 5–15, 2016. [67] R. Pelich, M. Chini, R. Hostache, P. Matgen, and C. López-Martinez, “Coastline detection based on sentinel-1 time series for ship- and flood- monitoring applications,” IEEE Geoscience and Remote Sensing Letters, vol. 18, no. 10, pp. 1771–1775, 2021. [68] N. K. Sinha and M. Shokr, Sea ice: physics and remote sensing. John Wiley & Sons, 2015. [69] T. Armstrong, “World meteorological organization. wmo sea-ice nomen- clature. terminology, codes and illustrated glossary. edition 1970. geneva, secretariat of the world meteorological organization, 1970.[ix], 147 p.[including 175 photos]+ corrigenda slip.(wmo/omm/bmo, no. 259, tp. 145.),” Journal of Glaciology, vol. 11, no. 61, pp. 148–149, 1972. [70] F. T. Ulaby, F. Kouyate, B. Brisco, and T. L. Williams, “Textural infornation in sar images,” IEEE Transactions on Geoscience and Remote Sensing, no. 2, pp. 235–245, 1986. [71] DeepAI.og. (1999) Earth observatory (1999). [Online]. Available: deepAI.org [72] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, and T. Poggio, “A quantitative theory of immediate visual recognition,” Progress in brain research, vol. 165, pp. 33–56, 2007. [73] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Label propagation for deep semi-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5070–5079. [74] K. Y. Yip, C. Cheng, and M. Gerstein, “Machine learning and genome annotation: a match meant to be?” Genome biology, vol. 14, no. 5, pp. 1–10, 2013. [75] J. E. Van Engelen andH. H. Hoos, “A survey on semi-supervised learning,” Machine Learning, vol. 109, no. 2, pp. 373–440, 2020. [76] O. Chapelle, M. Chi, and A. Zien, “A continuation method for semi- supervised svms,” in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06. New York, NY, USA: Association for Computing Machinery, 2006, p. 185–192. [Online]. Available: 120 bibl iography https://doi.org/10.1145/1143844.1143868 [77] X. J. Zhu, “Semi-supervised learning literature survey,” 2005. [78] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. [79] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning repre- sentations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986. [80] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, pp. 2234–2242, 2016. [81] D. Burago,S. Ivanov, andY. Kurylev, “A graph discretization of the laplace– beltrami operator,” Journal of Spectral Theory, vol. 4, no. 4, pp. 675–714, 2015. [82] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi- supervised learning with deep generative models,” in Advances in neural information processing systems, 2014, pp. 3581–3589. [83] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther, “Auxiliary deep generative models,” in International conference on machine learning. PMLR, 2016, pp. 1445–1453. [84] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum, “Deep con- volutional inverse graphics network,” arXiv preprint arXiv:1503.03167, 2015. [85] S. Narayanaswamy, T. Paige, J.-W. Van de Meent, A. Desmaison, N. Good- man, P. Kohli, F. Wood, and P. Torr, “Learning disentangled representa- tions with semi-supervised deep generative models,” 2018. [86] A. Tarvainen and H. Valpola, “Mean teachers are better role mod- els: Weight-averaged consistency targets improve semi-supervised deep learning results,” arXiv preprint arXiv:1703.01780, 2017. [87] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko, “Semi-supervised learning with ladder networks,” arXiv preprint arXiv:1507.02672, 2015. bibl iography 121 [88] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” arXiv preprint arXiv:1610.02242, 2016. [89] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raf- fel, “Mixmatch: A holistic approach to semi-supervised learning,” arXiv preprint arXiv:1905.02249, 2019. [90] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch: Simplifying semi- supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020. [91] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 687– 10 698. [92] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in neural information processing systems, 2004, pp. 321–328. [93] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mo- bilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520. [94] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708. [95] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [96] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015. [97] M. Langer, Z. He, W. Rahayu, and Y. Xue, “Distributed training of deep learning models: A taxonomic perspective,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 12, pp. 2802–2818, 2020. [98] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ran- zato, A. Senior, P. Tucker, K. Yang et al., “Large scale distributed deep networks,” Advances in neural information processing systems, vol. 25, 2012. 122 bibl iography [99] R. Mayer, C. Mayer, and L. Laich, “The tensorflow partitioning and scheduling problem: it’s the critical path!” in Proceedings of the 1st Workshop on Distributed Infrastructures for Deep Learning, 2017, pp. 1–6. [100] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean, “Device placement optimization with reinforcement learning,” in International Conference on Machine Learning. PMLR, 2017, pp. 2430–2439. [101] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 583–598. [102] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. Smola, “Parameter server for distributed machine learning,” in Big learning NIPS workshop, vol. 6, 2013, p. 2. [103] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” Advances in Neural Information Processing Systems, vol. 27, 2014. [104] B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. Tung, Y. Wang et al., “Singa: A distributed deep learning platform,” in Proceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 685–688. [105] M. Langer, A. Hall, Z. He, and W. Rahayu, “Mpca sgd—a method for distributed training of deep learningmodels on spark,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 11, pp. 2540–2556, 2018. [106] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can de- centralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” arXiv preprint arXiv:1705.09056, 2017. [107] S. Zhang, “Distributed stochastic optimization for deep learning,” Ph.D. dissertation, New York University, 2016. [108] R. Rabenseifner, “Optimization of collective reduction operations,” in International Conference on Computational Science. Springer, 2004, pp. 1–9. [109] A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, “S-caffe: Co- bibl iography 123 designing mpi runtimes and caffe for scalable deep learning on modern gpu clusters,” in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017, pp. 193–205. [110] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, “Efficient and scalable multi-source streaming broadcast on gpu clusters for deep learning,” in 2017 46th International Conference on Parallel Processing (ICPP). IEEE, 2017, pp. 161–170. [111] Y. Huang,Y. Cheng,A. Bapna,O. Firat,D. Chen,M. Chen,H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” Advances in neural information processing systems, vol. 32, pp. 103–112, 2019. [112] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, “Geeps: Scalable deep learning on distributed gpus with a gpu-specialized pa- rameter server,” in Proceedings of the Eleventh European Conference on Computer Systems, 2016, pp. 1–16. [113] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training imagenet in 1 hour,” CoRR, vol. abs/1706.02677, 2017. [Online]. Available: http://arxiv.org/abs/1706.02677 [114] S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra, “Why m heads are better than one: Training a diverse ensemble of deep networks,” arXiv preprint arXiv:1511.06314, 2015. [115] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Conference on Neural Information Processing Systems (NIPS), Deep Learning Workshop, 2014. [Online]. Available: https://openreview.net/forum?id=S1g2JnRcFX [116] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “Augmix: A simple data processing method to improve robustness and uncertainty,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. [Online]. Available: https: //openreview.net/forum?id=S1gmrxHFvB [117] T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and B. Ramabhad- ran, “Efficient knowledge distillation from an ensemble of teachers.” in Interspeech, 2017, pp. 3697–3701. 124 bibl iography [118] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4320–4328. [119] J. Kim, M. Hyun, I. Chung, and N. Kwak, “Feature fusion for online mutual knowledge distillation,” in 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 4619–4625. [120] S. Park and N. Kwak, “Feature-level ensemble knowledge distillation for aggregating knowledge from multiple networks,” in ECAI 2020. IOS Press, 2020, pp. 1411–1418. [121] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hin- ton, “Large scale distributed neural network training through online distillation,” arXiv preprint arXiv:1804.03235, 2018. [122] F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” 2011. [123] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International conference on machine learning. PMLR, 2015, pp. 1737–1746. [124] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: http: //arxiv.org/abs/1510.00149 [125] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quan- tized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017. [126] M. Höhfeld and S. E. Fahlman, “Probabilistic rounding in neural network learning with limited precision,” Neurocomputing, vol. 4, no. 6, pp. 291– 299, 1992. [127] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training lowbitwidth convolutional neural networkswith lowbitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016. [128] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compres- bibl iography 125 sion: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017. [129] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 440–445. [Online]. Available: https://aclanthology.org/D17-1045 [130] C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrish- nan, “Adacomp: Adaptive residual gradient compression for data-parallel distributed training,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018. [131] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler, “Sparcml: High-performance sparse communication for machine learn- ing,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–15. [132] X.Wu,H. Xu,B. Li, and Y. Xiong, “Stanza: Layer separation for distributed training in deep learning,” IEEE Transactions on Services Computing, 2020. [133] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645. [134] S. Khaleghian, J. P. Lohse, and T. Kræmer, “Synthetic-Aperture Radar (SAR) based Ice types/Ice edge dataset for deep learning analysis,” 2020. [Online]. Available: https://doi.org/10.18710/QAYI4O [135] S. Khaleghian, H. Ullah, T. Kræmer, T. Eltoft, and A. Marinoni, “Deep semisupervised teacher–student model based on label propagation for sea ice classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 10 761–10 772, 2021. [136] M. Douze, A. Szlam, B. Hariharan, and H. Jégou, “Low-shot learning with large-scale diffusion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3349–3358. [137] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Le- Cun, “Overfeat: Integrated recognition, localization and detection us- ing convolutional networks. 2nd international conference on learning representations, iclr 2014,” in 2nd International Conference on Learning 126 bibl iography Representations, ICLR 2014, 2014.