In the recent Kaggle competition, Dstl Satellite Imagery Feature Detection, our team took 4th place among 419 teams. We applied a modified U-Net, an artificial neural network for image segmentation. In this blog post we present our deep learning solution and share the lessons we learnt along the way.


The challenge was organized by the Defence Science and Technology Laboratory (Dstl), an Executive Agency of the United Kingdom’s Ministry of Defence, on the Kaggle platform. As a training set, they provided 25 high-resolution satellite images, each covering a 1 km² area. The task was to locate 10 different types of objects:

  1. Buildings
  2. Miscellaneous manmade structures
  3. Roads
  4. Tracks
  5. Trees
  6. Crops
  7. Waterway
  8. Standing water
  9. Large vehicles
  10. Small vehicles

Sample image from the training set with labels.

These objects were not completely disjoint – you can find examples with vehicles on roads or trees within crops. The distribution of classes was uneven: from very common, such as crops (28% of the total area) and trees (10%), to much smaller such as roads (0.8%) or vehicles (0.02%). Moreover, most images only had a subset of classes.

Prediction quality was measured with Intersection over Union (IoU, also known as the Jaccard index) between predictions and the ground truth: a score of 0 means no overlap, whereas 1 means perfect overlap. The score was computed for each class separately and then averaged. Our solution achieved an average IoU of 0.46, versus 0.49 for the winning solution.
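For a pair of binary masks, the metric boils down to a few lines of NumPy. A minimal sketch (the competition actually scored polygons, but the pixel-level version conveys the same idea):

```python
import numpy as np

def jaccard(pred, truth, eps=1e-9):
    """Intersection over Union between two binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / (union + eps)

pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
print(jaccard(pred, truth))  # 1 shared pixel / 3 in the union ≈ 0.333
```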


For each image we were given three versions: grayscale, 3-band and 16-band. Details are presented in the table below:

Type        Wavebands             Pixel resolution   #channels   Size
grayscale   Panchromatic          0.31 m             1           3348 x 3392
3-band      RGB                   0.31 m             3           3348 x 3392
16-band     Multispectral         1.24 m             8           837 x 848
16-band     Short-wave infrared   7.5 m              8           134 x 136

We resized and aligned the 16-band channels to match the 3-band channels. Alignment was necessary to remove shifts between channels. Finally, all channels were concatenated into a single 20-channel input image.
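As an illustration, here is a rough NumPy sketch of the stacking step, assuming the bands are already loaded as arrays. The shift correction itself is omitted, and a naive nearest-neighbour resize stands in for a proper resampler:

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize for an (h, w, c) array; a stand-in for a
    proper resampler such as cv2.resize."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def stack_channels(pan, rgb, multispectral, swir):
    """Build the (H, W, 20) input: 1 panchromatic + 3 RGB + 8 M-band + 8 SWIR."""
    h, w = rgb.shape[:2]
    ms_up = resize_nearest(multispectral, h, w)
    swir_up = resize_nearest(swir, h, w)
    return np.concatenate([pan[..., None], rgb, ms_up, swir_up], axis=2)

# toy shapes mirroring the real resolution ratios
pan = np.zeros((16, 16))
rgb = np.zeros((16, 16, 3))
ms = np.zeros((4, 4, 8))
swir = np.zeros((2, 2, 8))
print(stack_channels(pan, rgb, ms, swir).shape)  # (16, 16, 20)
```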


Our fully convolutional model was inspired by the family of U-Net architectures, in which low-level feature maps are combined with higher-level ones to enable precise localization. This type of network architecture was designed especially to solve image segmentation problems effectively. U-Net was the default choice for us and other competitors. If you would like more insight into the architecture, we suggest reading the original paper. Our final architecture is depicted below:

Convolutional neural network for image segmentation in satellite imagery.

A typical convolutional neural network (CNN) architecture increases the number of feature maps (channels) with each max pooling operation. In our network we decided to keep a constant number of 64 feature maps throughout. This choice was motivated by two observations. Firstly, we can allow the network to lose some information after the downsampling layer because the model still has access to low-level features in the upsampling path. Secondly, in satellite images there is no concept of depth or high-level 3D objects to understand, so a large number of feature maps in the higher layers may not be critical for good performance.
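A minimal PyTorch sketch of this constant-width idea. This is a simplification, not our exact network: the depth, batch normalization and nearest-neighbour upsampling here are illustrative choices for the example:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class ConstantWidthUNet(nn.Module):
    """U-Net-style encoder/decoder that keeps 64 feature maps at every depth,
    instead of doubling the channel count after each max pooling."""
    def __init__(self, in_channels=20, width=64, depth=3):
        super().__init__()
        self.inc = conv_block(in_channels, width)
        self.down = nn.ModuleList([conv_block(width, width) for _ in range(depth)])
        self.up = nn.ModuleList([conv_block(2 * width, width) for _ in range(depth)])
        self.pool = nn.MaxPool2d(2)
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.out = nn.Conv2d(width, 1, 1)  # per-pixel logit for a single class

    def forward(self, x):
        skips = [self.inc(x)]
        for block in self.down:
            skips.append(block(self.pool(skips[-1])))
        x = skips.pop()
        for block in self.up:
            # concatenating the skip connection gives 128 channels, back to 64
            x = block(torch.cat([self.upsample(x), skips.pop()], dim=1))
        return self.out(x)

net = ConstantWidthUNet()
print(net(torch.zeros(1, 20, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```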

We developed separate models for each class, because it was easier to fine-tune them individually for better performance and to overcome the imbalanced data problem.

Training procedure

Models assign each pixel of the input image a probability of belonging to the target class. Although Jaccard was the evaluation metric, we used per-pixel binary cross entropy as the training objective.
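For reference, the per-pixel objective is ordinary binary cross entropy averaged over all pixels. A NumPy sketch:

```python
import numpy as np

def pixel_bce(prob, target, eps=1e-7):
    """Per-pixel binary cross entropy, averaged over the image.
    prob: predicted probabilities in (0, 1); target: binary mask."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(np.mean(-(target * np.log(prob) + (1 - target) * np.log(1 - prob))))

prob = np.array([[0.9, 0.1], [0.8, 0.2]])
target = np.array([[1, 0], [1, 0]])
print(round(pixel_bce(prob, target), 4))  # ≈ 0.1643
```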

We normalized images to zero mean and unit variance using precomputed statistics from the dataset. Depending on the class, we either left the preprocessed images unchanged or resized them, together with the corresponding label masks, to 1024 x 1024 or 2048 x 2048 squares. During training we collected batches of 256 x 256 patches cropped from different images, where half of the patches always contained some positive pixels (objects of the target class). We found this to be both the best and the simplest way to handle the class imbalance problem. Each image in a batch was augmented with random horizontal and vertical flips, random rotation and color jittering.
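The balanced sampling scheme can be sketched as follows. This is a simplification of our pipeline; the rejection loop and `max_tries` are illustrative:

```python
import random
import numpy as np

def sample_patch(image, mask, size=256):
    """Random size x size crop from an (H, W, C) image and its (H, W) mask."""
    h, w = mask.shape
    y = random.randrange(h - size + 1)
    x = random.randrange(w - size + 1)
    return image[y:y + size, x:x + size], mask[y:y + size, x:x + size]

def sample_batch(images, masks, batch_size=4, size=256, max_tries=50):
    """Draw a batch of patches; the first half must contain positive pixels."""
    batch = []
    for i in range(batch_size):
        for _ in range(max_tries):
            idx = random.randrange(len(images))
            img, msk = sample_patch(images[idx], masks[idx], size)
            if i >= batch_size // 2 or msk.any():  # force positives in first half
                batch.append((img, msk))
                break
    return batch
```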

Each model had approx. 1.7 million parameters. Its training (with batch size 4) from scratch took about two days on a single GTX 1070.


We used a sliding window approach at test time, with the window size fixed at 256 x 256 and a stride of 64. This eliminated weaker predictions at image patch boundaries, where objects may be only partially shown without surrounding context. To further improve prediction quality we averaged results over flipped and rotated versions of the input image, as well as over models trained on different scales. Overall, we obtained well-smoothed outputs.
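In code, the overlap averaging looks roughly like this. It is a sketch that assumes the image dimensions are multiples of the stride; `predict_fn` stands for any model returning a per-pixel probability map for a window:

```python
import numpy as np

def predict_sliding(image, predict_fn, window=256, stride=64):
    """Average overlapping window predictions. A count array tracks how many
    windows covered each pixel, so overlaps are properly normalized."""
    h, w = image.shape[:2]
    prob = np.zeros((h, w))
    count = np.zeros((h, w))
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            prob[y:y + window, x:x + window] += predict_fn(image[y:y + window, x:x + window])
            count[y:y + window, x:x + window] += 1
    return prob / np.maximum(count, 1)
```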


Ground truth labels were provided in WKT format, representing objects as polygons (defined by their vertices). We had to generate submissions in which polygons are concise and can be processed quickly by the evaluation system, to avoid its timeout limits. We found that this can be accomplished with minimal loss on the evaluation metric by using parameterized operations on the binarized outputs. In our post-processing stage we applied morphological dilation/erosion and simply removed objects/holes smaller than a given threshold.
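A sketch of such a post-processing step using SciPy; the actual thresholds and structuring elements were tuned per class:

```python
import numpy as np
from scipy import ndimage

def postprocess(mask, min_size=50, iters=1):
    """Close small gaps with dilation followed by erosion, then drop
    connected components smaller than min_size pixels."""
    mask = ndimage.binary_dilation(mask, iterations=iters)
    mask = ndimage.binary_erosion(mask, iterations=iters)
    labeled, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labeled, range(1, n + 1))
    for comp, size in enumerate(sizes, start=1):
        if size < min_size:
            mask[labeled == comp] = False  # remove the small object
    return mask
```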

Our solution

Buildings, Misc., Roads, Tracks, Trees, Crops, Standing Water

For these seven classes we trained convolutional networks (separately for each class) with the binary cross entropy loss described above, on 20-channel inputs and at two different scales (1024 and 2048), with satisfactory results. The models’ outputs were simply averaged and then post-processed with class-specific hyperparameters.

Waterway

The solution for the waterway class was a combination of linear regression and random forest, trained on per-pixel data from the 20 input channels. Such a simple setup works surprisingly well because of the characteristic spectral response of water.
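Schematically, the per-pixel setup looks like this in scikit-learn. This is a sketch: we substitute logistic regression for plain linear regression so that both models output probabilities, and the subsampling and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def fit_pixel_models(image, mask, n_samples=10000, seed=0):
    """Treat every pixel as an independent sample of its 20 spectral values."""
    h, w, c = image.shape
    X = image.reshape(-1, c)
    y = mask.reshape(-1)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    rf = RandomForestClassifier(n_estimators=20, random_state=seed).fit(X[idx], y[idx])
    lr = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    return rf, lr

def predict_water(image, rf, lr):
    """Average the two per-pixel probability maps."""
    X = image.reshape(-1, image.shape[2])
    prob = (rf.predict_proba(X)[:, 1] + lr.predict_proba(X)[:, 1]) / 2
    return prob.reshape(image.shape[:2])
```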

Large and Small Vehicles

We observed high variation of the results between local validation and the public leaderboard, due to the small number of vehicles in the training set. To combat this we trained models separately for large and small vehicles, as well as a single model for both of them (their label masks were added together), on 20-channel inputs. Additionally, we repeated all experiments using 4-channel inputs (RGB + panchromatic) to increase the diversity of the models in our ensemble. Outputs from the models trained on both classes were averaged with the single-class models to produce the final predictions for each vehicle type.


We implemented the models in PyTorch and Keras (with a TensorFlow backend), according to our team members’ preferences. Since our strategy was to build separate models for each class, this required careful management of our code. To run models and keep track of our experiments we used Neptune.

Final results

Below we present a small sample of the final results from our models:

Buildings detection in satellite imagery.


Roads detection in satellite imagery.


Tracks detection in satellite imagery.


Crops detection in satellite imagery.


Waterway detection in satellite imagery.


Small vehicles detection in satellite imagery.


Satellite imagery is a domain with a high volume of data, which makes it perfect for deep learning. We have shown that results from current state-of-the-art research can be applied to solve practical problems. Excited by our results, we look forward to more such challenges in the future.

Team members:
Arkadiusz Nowaczyński
Michał Romaniuk
Adam Jakubowski
Michał Tadeusiak
Konrad Czechowski
Maksymilian Sokołowski
Kamil Kaczmarek
Piotr Migdał

Suggested readings

For those of you interested in additional reading, we recommend the following papers on image segmentation which inspired our work and success:

  1. Fully Convolutional Networks for Semantic Segmentation
  2. U-Net: Convolutional Networks for Biomedical Image Segmentation
  3. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation
  4. Analyzing The Papers Behind Facebook’s Computer Vision Approach
  5. Image-to-Image Translation with Conditional Adversarial Nets
  6. ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation


60 Comments » for Deep learning for satellite imagery via image segmentation
  1. Will C says:

    This is awesome. Perhaps I missed it, but I did not see what weights you fine tuned from.

  2. Deepank says:

    Is this type of feature extraction possible in Landsat images? Like extracting some larger features.

    • Arkadiusz Nowaczynski says:

      In general yes – it’s possible. Automatic labeling by convolutional neural networks should work well in most image-based data.

      • vijendra singh says:

        Don’t you think the Landsat resolution (30 m) is too coarse to find structures within it?

        • Arkadiusz Nowaczynski says:

          Yes – but my point is that whatever a human can see in these images, a machine can label automatically after training. Applications in agriculture, forestry or water resources are definitely possible.

  3. asmith says:

    Great post thanks – do you intend to release any code?

  4. Ayush Singh says:

    Thanks for this post, really great!
    Did you use only the 25 images for training, or also data (images) from other sources (public datasets etc.) for different classes?

  5. Zahran says:

    Awesome! Can I ask a general question?
    Why did you go for image segmentation rather than object detection with this type of images?

    • Arkadiusz Nowaczynski says:

      For object detection you usually predict bounding boxes together with classes, whereas in image segmentation you do fine-grained classification of each pixel. This image should make things clear:
      detection vs segmentation
      In our case, if you look at the labels (e.g. Roads, Tracks), it would be difficult to predict accurate bounding boxes matching those complex shapes. Using image segmentation you can match irregular shapes quite easily.

  6. quest says:

    Would you share the code please, for starters like us to learn. Thank you very much.

  7. guo says:

    Great works!
    Did you try other CNN-based image segmentation methods, such as DeepLab or DeepLab v2?

  8. lee says:

    Will the result be different if all object types are displayed in a single image with different colors?

    • Arkadiusz Nowaczynski says:

      If you’re concerned about color change between training and test images, you can always play around with color augmentation techniques to make training images more similar to test ones during training phase. This particular dataset was really small and additionally all images came from regions located in Nigeria, so using this model for images from other parts of the world may not be fully successful.

  9. Navin kumar says:

    Hello, your concepts are really interesting. What impacts image segmentation more when you perform deep learning: the number of observations (i.e. the training sets considered), the level of reasoning (i.e. the spectral bands), or do both balance each other?

    • Arkadiusz Nowaczynski says:

      I’d say that having a large training dataset is always beneficial if you want to build a successful deep learning application for image segmentation.
      Information from spectral bands is useful, but we should remember that humans are able to create annotations using only the RGB visualization, so neural nets should be able to as well. In this competition the training set was rather small and the spectral bands definitely helped a lot, but if there were 100 times more samples, I think RGB alone would be sufficient.

    • Navin kumar says:

      Thank you for your reply.
      Let’s say we are going to segment forest vegetation and its types. Is RGB with 100 times the samples enough?
      Because the features to be segmented are not regular, linear features – in that case, which approach should we adopt? I would be happy to hear your suggestions from a deep learning point of view.

      • Arkadiusz Nowaczynski says:

        It’s hard for me to answer without looking at the data.
        100 images per class? How many classes? It seems a small but reasonable number to begin with. I suggest trying a U-Net model for segmentation (feel free to fine-tune the original architecture if you need to). It should be better to train a single model for all labels at once. If your labels do not overlap each other, use a per-pixel softmax loss; otherwise you can try per-pixel binary cross entropy. Good luck!

  10. youyou says:

    Great work!
    Maybe I missed it, but I would like to ask why semantic segmentation could predict the shape of buildings so well? I applied FCN for building extraction myself, but it did badly on building shapes.

    • Arkadiusz Nowaczynski says:

      There are many reasons influencing the quality of predicted segmentation mask, the most important from our experience:
      1) Using a smaller scale at training and test time should give more contiguous results, because one pixel covers a bigger area. After predicting, you can resize the image up to match the original scale. For example, here the original size of the images was 3348×3392, but for training and prediction we used 1024×1024 or 2048×2048.
      2) Make sure the model converges properly – the learning rate should be tuned. Monitor IoU on training images; at the end you should see almost perfect predictions on training samples. It’s a very good idea to visualize this on an RGB image (e.g. green – true positive pixel, red – false positive, blue – false negative; our predictions on training images were almost completely green).
      3) At test time, use a sliding window approach with a stride smaller than the window size (our window was 256×256 and the stride 64). You can also generate predictions for horizontal and vertical flips of the original image and then average all the probabilities.
      4) Post-processing: removing small holes/objects from the final binary mask.
      Our IoU for buildings on the test set was around 0.7.
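      The visualization from point 2 takes only a few lines – a sketch assuming binary NumPy masks:

      ```python
      import numpy as np

      def error_map(pred, truth):
          """RGB error image: green = true positive, red = false positive,
          blue = false negative."""
          pred, truth = pred.astype(bool), truth.astype(bool)
          rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)
          rgb[pred & truth] = (0, 255, 0)    # green: correct
          rgb[pred & ~truth] = (255, 0, 0)   # red: predicted but absent
          rgb[~pred & truth] = (0, 0, 255)   # blue: missed
          return rgb
      ```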

    • ccnucb says:

      Could you send your files to me? I just want to know how to distinguish different land-use types with this model.

  11. Akash says:

    Awesome work!!
    I just wanted to clarify two things.
    1. While normalizing the images before training, do you mean you are normalizing across all the 20 channels of the image or only the RGB image?
    2. What is the ultimate size of input image for training? And depending on class, how do you decide whether to resize to 1024 x 1024 or 2048 x 2048?

    • Arkadiusz Nowaczynski says:

      1. We normalized the input across 20 channels, so each channel is expected to have zero mean and unit variance. To compute the mean and std used for normalization it’s good to use all available data.
      2. In most cases we trained separate models at each scale and then averaged the results. If I had to choose one scale, I would try both and select the better model based on the validation/test result.

      • Akash says:

        Thanks. I have a very simple question which I am having a problem understanding. Suppose we consider only 3 channels, RGB. If we normalize each of the channels separately, won’t the color composition of the image change completely? In that case, will the model be successful in training these normalized images?

        • Arkadiusz Nowaczynski says:

          Per-channel normalization is a pure technical step and its purpose is to improve optimization performance. The values of the channels are linearly scaled using some pre-computed statistics and this doesn’t change color composition.
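          A tiny NumPy example of what this step does – each channel ends up with zero mean and unit variance, while the relative ordering of values (and hence the “color” information) is untouched:

          ```python
          import numpy as np

          def normalize(image, mean, std):
              """Per-channel standardization with dataset-wide statistics.
              image: (H, W, C); mean, std: length-C vectors."""
              return (image - mean) / std

          img = np.array([[[10.0, 200.0], [20.0, 100.0]]])  # (1, 2, 2): two channels
          mean = img.reshape(-1, 2).mean(axis=0)
          std = img.reshape(-1, 2).std(axis=0)
          out = normalize(img, mean, std)
          print(out.mean(axis=(0, 1)))  # ~[0, 0]
          ```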

  12. surf reta says:

    Hi, I am working on a data set consisting of 150 image pairs, each of which has 500*700 pixels. The mask has only two classes, i.e., background and foreground. Across the image set, the foreground class takes about 25% of the whole image area. For this kind of scenario, what are your recommendations for generating the training set?

    • Arkadiusz Nowaczynski says:

      A good starting point is to divide the whole set into a training set (80%) and a validation set (20%), such that both sets have similar class distributions.
      If you have groups/clusters of very similar images (almost identical, maybe the same region slightly shifted, or frames taken from a video sequence), make sure the split doesn’t separate them.

      At the very beginning it’s good to establish a sanity check that your algorithm is correct. I suggest picking 3 images from the training set and trying to overfit the model on them. Visualize your predictions as JPGs; you should see almost perfect predictions compared to the ground truth. Only then move on to training with the whole training set.

  13. Srinivas says:

    I have a scenario where I need to detect objects that could occupy smaller or larger fraction of the image (video frame) based on the zoom level. Essentially the model has to be able to detect object across different scales. Will U-Net be suitable? My frame size would be 1080×720 and the object size could vary from 30×20 to 300×200 .

    • Arkadiusz Nowaczynski says:

      I think in your case the key is to train on multiple scales at once. As part of the data augmentation step you can randomly resize images to, let’s say, [0.3, 1.0] of the original scale, then crop the image to a fixed size and feed it to the network. U-Net should handle such a scenario.

  14. Akshay says:

    Excellent Work…

    After segmentation, can we know the number of pixels under the segmentation mask, or can we crop the image per segment and count its pixels? The idea behind this is to understand the percentage of area covered by the segmentation.

    Looking forward to hearing from you.

    • Arkadiusz Nowaczynski says:

      Yes, you can do all of that:
      – calculating percentage of area covered by a given class (or even precise value of area in a specific unit)
      – counting objects like trees/cars/buildings, and area covered for each one of them separately
      – obtaining exact location (latitude and longitude) for extracted objects

      • Akshay says:

        Thanks for your reply, so generous of you.

        If you could give a hint on how to achieve the above, it would be a great help to me. Moreover, are you using some pretrained model in the above network, or is it trained only on the Kaggle images from scratch?

  15. Ruchira says:

    I had a question regarding the combining of predictions. Suppose we have 10 models for 10 labels and get a prediction from each of them. How are you combining them to form a single prediction map? Do you check for the highest probability occurring among the 10 classifier models and then assign the pixel to that label? However, if the probability map is normalized, due to class imbalance this might lead to an inaccurate prediction map. How did you tackle this issue?

    • Arkadiusz Nowaczynski says:

      In this competition ground truth labels were allowed to overlap (between classes), so there was no problem with that (look at sample image from the training set with labels at the top). If you have a case where each pixel always belongs to only one class, I suggest training one model for all classes with softmax as the final layer.

  16. Saetlan says:

    Thanks for your really well described work !
    I’ve just posted a question on Stack Overflow about a U-Net project with Keras that I’m building – could you take a look if you have any advice to speed up my prediction task?
    link :

    Also, how did you handle the stride of 64? At the moment I’m keeping an array of the same size as my image to count how many times I’ve computed the probability for a given pixel, so I can divide by it afterwards. Is there a more efficient way to do it?
    Thanks again !

    • Arkadiusz Nowaczynski says:

      We did exactly as you wrote (with an array counting how many times each pixel’s probability was updated). Additionally, we used batched input at test time. Another possibility is to train the model using smaller images, i.e. reduce the scale of the input data (maybe 5000×5000 instead of 20000×20000 would be enough? a good thing to try).

  17. Mahdi Alehdaghi says:

    Did you release your code or implementation? I need a simple black-box object (or feature) detector in satellite images.

  18. zanluyang says:

    I learned a lot.
    Did you convert the labels (provided in WKT format or shp) to a new raster image (such that the background is 0, building is 1, water is 2, …)?
    If you converted to a raster image, how did you deal with a pixel included in two categories?
    Thanks!

    • Arkadiusz Nowaczynski says:

      We trained separate models for each class, so every category had its own binary mask. If you want to train one model with overlapping targets, you need to predict a 10-channel output and compare it with a 10-channel label.

  19. Arthur Costa says:

    Thanks for sharing your work, it’s really interesting.
    Is U-Net a good solution in my case? I’m working with digital mammography, and our goal is to segment lesions in breast tissue. I’ve been searching for CNN architectures that also deal with localization, and this model really stood out as a good alternative. Also, my dataset is not very big, around 200 exams. At first we’re trying to deal with only 2 classes (normal tissue or lesion), and then we’ll try to expand to more lesion types. The point is that in our images we have the background (black), the breast and the lesion area inside.
    Our labeled regions were selected by a doctor, and it has normal tissue mixed with lesion parts.

    Thanks again!

    • Arkadiusz Nowaczynski says:

      Given your description I don’t see any reason why it shouldn’t work, but maybe some architectural changes will be needed.

  20. Abhinav G says:

    Hey Arkadiusz Nowaczynski,
    Great work there! I actually need a little guidance from you. I’m working on hyperspectral satellite imagery, mainly the WorldView-3 dataset, for crop differentiation using deep learning in Python, in which I need to preprocess the data using multi-resolution segmentation and apply supervised classification to it. Some people have recommended CNNs for this.
    I don’t have much of an idea about this – I’m actually new to Python and machine learning concepts. Can you guide me on this? Like which tutorials to go through (Python + machine learning based) and which libraries are required for it.
    It would be of great help.


    • Arkadiusz Nowaczynski says:


      There are so many resources for learning Python and machine learning that I have a hard time choosing the best recommendation for you.
      If you like online courses: platforms like Coursera or edX can give you a great introduction (search for Python, Machine Learning, Deep Learning).
      Other possibilities are: or
      You also need to know some kind of library for image processing (either scikit-image or opencv are good choices).

      Good luck!

  21. Razieh says:

    Awesome, you did a very good job.
    I’m pretty new to this area. I downloaded the dataset, but I don’t know how to visualize the images!
    Is there any way to visualize them using Matlab?

  22. huaiyang says:


    Thanks for sharing this wonderful tutorial with us. I have a question regarding the training procedure: “We developed separate models for each class, because it was easier to fine tune them individually for better performance and to overcome imbalanced data problems.”
    Classes such as tracks or waterways are usually very thin and take up only a small proportion of the corresponding image. In other words, the respective binary semantic segmentation problem has to handle a highly imbalanced dataset. Would you like to discuss more how to handle this kind of scenario? Thanks.

    • Arkadiusz Nowaczynski says:

      I referred to this problem in the Training procedure section:
      “During training we collected a batch of cropped 256 x 256 patches from different images where half of the images always contained some positive pixels (objects of target classes). We found this to be both the best and the simplest way to handle the imbalanced classes problem.”.

  23. Juan Taylor says:

    Awesome example and explanations.

    I am experimenting with the U-Net architecture for simple two-class segmentation. It gives pretty good results. However, one thing is persistently bad: the model segments out objects that are mostly much ‘fatter’ than the ground truth. Taking your example, my model would find those roads or houses, but a few neighboring pixels of a road or house would also be taken as part of it.

    Have you encountered similar issues and how did you solve them?

  24. Amazing work, I can see this used for planets and sea floor exploration.

  25. Razieh says:

    Awesome. I’ve learned a lot from your explanation and answers to the questions.
    There are two things that I could not understand very well!
    1. Would you please explain more why you kept a constant number of feature maps throughout the network?
    2. I’d appreciate it if you explained how you combined linear regression and random forest to detect waterways.
    Thanks in advance,

  26. ada_wang says:

    I was doing some tasks with satellite data, so this is really helpful for me. I still have some questions:
    why did your team choose U-Net models instead of others (like DeepLab v2 or RefineNet)? I have fine-tuned the DeepLab v2 model on my data (just around 300 images), but it suffers from exploding gradients, and I have no idea why.
