Segmentation of X-Ray Microtomography Scans with a Convolutional Neural Network

Yahia Ali, Shehab Attia, Landan Seguin, Justin Zheng

[Left] Image of a slice of a mouse brain. [Center] Convolutional neural network segmentation of axons. [Right] Ground truth annotations of axons.


Researchers need to be able to quickly and accurately identify the cells, blood vessels, and axons present in a brain scan [1]. A proper segmentation of brain components can help summarize a brain’s state and potentially identify abnormalities. The goal of this case study is to develop a convolutional neural network (CNN) that segments blood vessels and axons in images of a thalamocortical sample from a mouse brain. The main objective is to maximize the precision and recall of the per-pixel class predictions.

Dataset Generation

Six NRRD files, together containing the equivalent of more than 2696 slices of the brain, are provided. Separate annotations are provided for axons and blood vessels. Each annotation is a binary image in which pixels of the positive class (axon or blood vessel) have a non-zero value and all other pixels have a value of 0.


Figure 1. Block diagram of case study steps for implementing CNN segmentation on mouse brain.

For this case study, several hundred sparsely and densely annotated images of axons and blood vessels were provided to train a supervised method that learns features and generalizes to new images from different sections of the brain. However, there were two major issues with the provided dataset. First, sparsely annotated images can significantly decrease the performance of deep learning models. Second, the provided training set contains only one section of the brain, which means the training samples are highly correlated. To avoid these problems, the sparse annotations of axons and blood vessels were replaced with manual dense annotations made in 3DSlicer, an image-analysis software package. Additional NRRDs containing other regions of the brain were also manually annotated. To segment axons, two threshold masks were created and one was subtracted from the other, producing a binary image of axons free of blood vessel walls and cell membranes. To segment blood vessels, a threshold mask was used to mask the lumen of the blood vessels and cells. To remove cells, connected voxels smaller than 2000 units were removed. Then, the lumen mask was grown by 4 µm so that the vessel mask includes the vessel’s wall. Finally, 3D reconstructions of the segmentations were displayed and manually cleaned in 3DSlicer’s 3D-view window. Examples of this process are shown in Fig. 2.
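The blood vessel steps above (threshold, remove small connected components, grow the mask) were performed interactively in 3DSlicer. As a rough illustration only, the same sequence could be sketched with NumPy and SciPy. The function names, the threshold direction, the reading of "2000 units" as a voxel count, and the voxel size used to convert 4 µm into dilation steps are all assumptions, not details taken from the original workflow.

```python
import numpy as np
from scipy import ndimage


def remove_small_components(mask, min_voxels):
    """Drop connected components that contain fewer than min_voxels voxels."""
    labels, count = ndimage.label(mask)       # face-connected components by default
    if count == 0:
        return np.zeros(mask.shape, dtype=bool)
    sizes = np.bincount(labels.ravel())       # sizes[i] = voxel count of label i
    keep = sizes >= min_voxels
    keep[0] = False                           # label 0 is background
    return keep[labels]


def grow_mask(mask, margin_um, voxel_um):
    """Dilate a mask by roughly margin_um, assuming isotropic voxels of voxel_um."""
    steps = max(1, int(round(margin_um / voxel_um)))
    return ndimage.binary_dilation(mask, iterations=steps)


# Synthetic volume: dark "lumen" voxels on a bright background (assumed polarity).
volume = np.full((24, 24, 24), 200, dtype=np.uint8)
volume[5:15, 5:15, 5:15] = 20                 # large structure: 1000 voxels
volume[0:2, 0:2, 0:2] = 20                    # small structure: 8 voxels

lumen = volume < 100                          # hypothetical threshold
cleaned = remove_small_components(lumen, min_voxels=100)
# voxel_um=4.0 is a placeholder; the real voxel size is not stated in the text.
grown = grow_mask(cleaned, margin_um=4.0, voxel_um=4.0)
```

An axon mask built from two thresholds would follow the same pattern, for example `mask_a & ~mask_b` for two boolean threshold masks. The manual cleaning in the 3D view has no direct equivalent in this sketch.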


Figure 2. (A) The mask for the lumen of the blood vessel, shown in red. (B) The mask grown by 4 µm to include the vessel’s wall. (C) The 3D reconstruction of the vessel mask before cleaning in 3DSlicer’s 3D-view window. (D) The vessel mask after manual cleaning.

Network Architecture and Training

The CNN training and testing pipelines were built using the PyTorch framework. The final CNN was based on the U-Net architecture (Fig. 3) [1] because of three of its defining characteristics. First, U-Net was designed to perform binary segmentation on biomedical images, the same domain as this project. Second, it was designed to perform well with relatively few training images, whereas other segmentation architectures require tens or hundreds of thousands. Third, the architecture is small, so it can be trained quickly and has low memory requirements (the weights file is approximately 1.9 MB). The network downsamples the input image into a spatially smaller feature representation, upsamples that representation back to the spatial dimensions of the input image, and finally applies a linear classifier to make a prediction at each pixel location. Features from early in the network are concatenated with features from later in the network so that the final convolutional layer can operate directly on both low-level and high-level features when making predictions.


Figure 3. U-Net architecture, a fully convolutional network. This network downsamples the input image into a vectorized feature space, and upsamples the features into a prediction map consisting of predictions for each pixel.

The network developed for this case study has a similar structure to U-Net, but it is “shallower”: it has only two downsampling and two upsampling operations instead of four of each. Input images and their corresponding labels are resized to 400×400 by scaling before being fed into the network. Binary cross-entropy was used as the loss function to train the model. The network architecture was the same for predicting axons and blood vessels, but a separate model was trained for each task, and each model had different training hyperparameters. For axon prediction, the network was trained using the Adam optimizer with an initial learning rate of 1e-2, a learning rate decay multiplier of 0.5 every epoch, a regularization strength of 0.1, and a batch size of 1 for 10 epochs. For blood vessel prediction, the network was trained using the Adam optimizer with an initial learning rate of 1e-3, a learning rate decay multiplier of 0.9 every epoch, a regularization strength of 1e-4, and a batch size of 1 for 10 epochs.
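A minimal PyTorch sketch of such a two-level U-Net-style network and the axon training setup is given below. It is not the original implementation: the class name `ShallowUNet`, the channel widths, the single-channel input, and the layer choices inside each block are assumptions. The "regularization strength" is interpreted here as Adam's `weight_decay`, and `BCEWithLogitsLoss` is used as a numerically stable form of binary cross-entropy; the original code may differ on both points.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU; padding=1 keeps the spatial size."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )


class ShallowUNet(nn.Module):
    """U-Net-style network with two downsampling and two upsampling steps."""

    def __init__(self, in_ch=1, base=8):  # widths are illustrative assumptions
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, 1, kernel_size=1)  # per-pixel linear classifier

    def forward(self, x):
        e1 = self.enc1(x)                     # full resolution
        e2 = self.enc2(self.pool(e1))         # 1/2 resolution
        b = self.bottleneck(self.pool(e2))    # 1/4 resolution
        # Skip connections: concatenate early features with upsampled ones.
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                  # logits, one per pixel


model = ShallowUNet()
criterion = nn.BCEWithLogitsLoss()
# Axon hyperparameters from the text; weight_decay as "regularization" is assumed.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)

# One training step on a random 400x400 image with batch size 1.
image = torch.rand(1, 1, 400, 400)
label = (torch.rand(1, 1, 400, 400) > 0.5).float()
logits = model(image)
loss = criterion(logits, label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # called once per epoch: multiplies the learning rate by 0.5
```

For the blood vessel model, the same sketch would use `lr=1e-3`, `weight_decay=1e-4`, and `gamma=0.9`. Because 400 is divisible by 4, the two pooling steps and two transposed convolutions return an output with the same spatial size as the input.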


The models’ performance was evaluated by calculating the accuracy, precision, and recall of the results on the test set. The accuracy was 0.909 for axons and 0.993 for blood vessels, but this metric can be optimistic if most of the image is background, a common case for blood vessel images. Precision and recall provide better measures of performance in that situation. Precision is the fraction of pixels predicted as positive that are actually positive in the annotation. Recall is the number of true positives divided by the number of positives in the annotation. Precision and recall vary with the threshold used to separate positive (axon or blood vessel) from negative (background) predictions. A precision-recall curve, shown in Fig. 4, was used to understand model performance regardless of the threshold. The curve is generated by systematically varying the threshold between 0 and 1 and plotting the resulting precision-recall pairs. Output examples are shown in Fig. 5.
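The definitions above can be illustrated with a short NumPy sketch. The function names and the toy data are hypothetical, and returning a precision of 1.0 when no pixel is predicted positive is a common convention assumed here, not something stated in the text.

```python
import numpy as np


def precision_recall(prob, truth, threshold):
    """Precision and recall of thresholded per-pixel probabilities."""
    pred = prob >= threshold
    truth = truth.astype(bool)
    tp = np.sum(pred & truth)       # predicted positive, annotated positive
    fp = np.sum(pred & ~truth)      # predicted positive, annotated background
    fn = np.sum(~pred & truth)      # predicted background, annotated positive
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return float(precision), float(recall)


def pr_curve(prob, truth, num_thresholds=11):
    """(threshold, precision, recall) triples for thresholds spanning 0 to 1."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return [(float(t), *precision_recall(prob, truth, t)) for t in thresholds]


# Toy example: six pixels, three of them annotated positive.
prob = np.array([0.9, 0.8, 0.3, 0.6, 0.1, 0.2])
truth = np.array([1, 1, 1, 0, 0, 0])
p, r = precision_recall(prob, truth, threshold=0.5)
curve = pr_curve(prob, truth)

# Why accuracy can mislead: 1% positive pixels, model predicts all background.
sparse_truth = np.zeros(1000, dtype=bool)
sparse_truth[:10] = True
all_background = np.zeros(1000)
accuracy = float(np.mean((all_background >= 0.5) == sparse_truth))
_, sparse_recall = precision_recall(all_background, sparse_truth, 0.5)
```

In the toy example, a threshold of 0.5 gives two true positives, one false positive, and one false negative, so precision and recall are both 2/3. In the sparse example, the all-background prediction reaches an accuracy of 0.99 while its recall is 0, which is the failure mode that motivates reporting precision and recall.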


Figure 4. [Left] The precision-recall curve for axon segmentation. [Right] The precision-recall curve for blood vessel segmentation. Both curves are computed on the 212-image test dataset.


Figure 5. Network output on the densely annotated test images. Left: Blood vessel segmentation results with Image, CNN Output and Ground Truth in order. Right: Axon segmentation results with Image, CNN Output and Ground Truth in order.


As the results show, the CNN segments axons better than it segments blood vessels. This is reasonable because the majority of axons can be segmented with simple color thresholding, whereas blood vessels vary in shape and often look very similar to other objects in the image. A typical disadvantage of CNNs is that they often require more training data than simpler classifiers, and the network itself may consume significantly more memory. However, the network used in this project is very lightweight and takes less than 30 minutes to train. One major limitation is that the network can only be as good as the annotations provided. Because the images were labeled by non-experts (undergraduate students), there are likely to be errors in the annotations and therefore in the CNN output. Ideally, experts would label millions of training images, although this would be very time-consuming and expensive.


[1] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.

