Preferred Networks: Deep Learning Supercomputer

2nd Generation Intel® Xeon® Scalable processors and Intel® Optane™ persistent memory enable a faster data pipeline.

At a glance:

  • Preferred Networks (PFN) develops artificial intelligence solutions for industrial and domestic robotics, Industrial Internet of Things (IIoT), manufacturing systems, and other industries.

  • Traditional SSDs could not meet the I/O throughput requirements of PFN’s new custom-designed deep learning accelerator, so they turned to Supermicro’s SuperServer hardware with Intel® Xeon® Platinum 8260M processors and Intel® Optane™ persistent memory to enable a balanced node with fast access and high capacity for training data.

Executive Summary

Preferred Networks (PFN) uses Intel® Xeon® Platinum 8260M processors and Intel® Optane™ persistent memory to create a high-performance data pipeline to keep their custom, high-performance deep learning training accelerator busy in their new MN-3 HPC system. Located in Tokyo, Preferred Networks is a deep learning company, deploying High Performance Computing (HPC) clusters to build and train algorithms used in domestic and industrial applications. Their latest system, MN-3, integrates a custom-designed deep learning accelerator they engineered. Intel Optane persistent memory provides the capacity and speed needed to feed data to the accelerator, maintaining high training performance.

Traditional SSDs could not meet the I/O throughput requirements of the new architecture, so Preferred Networks turned to Intel® Xeon® Platinum 8260M processors and Intel® Optane™ persistent memory to enable a balanced node with fast access and high capacity for training data.

Challenge

Preferred Networks develops artificial intelligence solutions for industrial and domestic robotics, Industrial Internet of Things (IIoT), manufacturing systems, and other industries. It is a leader in the robotics revolution.1

The company’s Research and Development (R&D) team uses HPC systems designed specifically to create and train algorithms for automated functions, such as:

  • Predictive analytics of industrial machines to optimize the use and maintenance of them for increased productivity
  • Controlling a robot to easily navigate in a home, recognize objects out of place, pick them up, and put them where they belong
  • Other autonomous operations based on vision computing

Preferred Networks’ largest R&D supercomputers, MN-1 and MN-2, include more than 2,500 GPUs in total. Yet Preferred Networks needed to accelerate computations further to support the many projects its engineering team is working on.

Solution

“We believe more computational power makes our engineers and researchers more effective,” vice president of Computing Infrastructure Yusuke Doi said. “By keeping a leadership position in our computational capabilities, we can better compete in our industry and provide advanced solutions to our customers.”

So, Preferred Networks designed a unique custom accelerator called MN-Core.2 MN-Core is a custom processor based on a four-die multi-chip package designed specifically for PFN’s own R&D projects. The quadruple-chip package—specialized for deep learning training tasks—is at the center of a design for a new supercomputing cluster, MN-3. However, due to the dramatic increase in computing performance, they ran into I/O bottlenecks when they began to design and evaluate the data loading path for the training system.

Many of Preferred Networks’ projects are computer vision problems. The training data set, consisting of millions of JPEG image files, is archived on a large external storage system. It is not practical to store the entire data set directly in system memory to take advantage of the faster access. For training, the data is first copied to the nodes into high-performance NVMe SSD drives.

2nd Generation Intel® Xeon® Scalable Processors and Intel® Optane™ Persistent Memory Enable Up to 3.5X Faster Data Pipeline3 

“We first benchmarked node performance with the Intel Xeon 8260M processors,” engineer Tianqi Xu of Preferred Networks explained. “During the I/O phase, the processor must get the JPEG files out of block storage and into memory, decode them, and then perform model-specific augmentations. With the 2nd Generation Intel® Xeon® Scalable processors and current GPUs, the node was well balanced for I/O, computing, and storage.”

But with terabytes of data to move during training and the I/O challenges discovered in the data path, a traditional storage hierarchy with SSDs would not be able to keep up with the custom accelerator. The accelerator would be starved for data. Preferred Networks needed high-capacity storage at DIMM-like speeds in the node. Engineers worked directly with Intel to understand how the high memory bandwidth of 2nd Gen Intel Xeon Scalable processors and their support for high-capacity Intel Optane persistent memory could create a very fast and very large data pipeline.
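
The starvation problem can be illustrated with a simple bottleneck model: a serial data pipeline sustains only the rate of its slowest stage, and the accelerator sits idle whenever it can consume data faster than that. The sketch below is a minimal illustration under stated assumptions. The stage names follow the read, decode, and augment steps described above, but every rate is a placeholder chosen for illustration, not a measurement from MN-3.

```python
def sustained_rate(stage_rates):
    """A serial pipeline can go no faster than its slowest stage."""
    return min(stage_rates.values())


def accelerator_utilization(stage_rates, accelerator_rate):
    """Fraction of time the accelerator has data to work on (capped at 1.0)."""
    return min(1.0, sustained_rate(stage_rates) / accelerator_rate)


# Placeholder rates in images per second, NOT measurements from MN-3.
stages = {"read": 4000.0, "decode": 6000.0, "augment": 8000.0}

# A slower consumer is fully fed; a much faster one is starved by the same pipeline.
slow_consumer = accelerator_utilization(stages, accelerator_rate=3000.0)
fast_consumer = accelerator_utilization(stages, accelerator_rate=16000.0)

print(f"bottleneck stage: {min(stages, key=stages.get)}")
print(f"utilization with slower consumer: {slow_consumer:.2f}")
print(f"utilization with faster consumer: {fast_consumer:.2f}")
```

In this model, raising the accelerator's consumption rate without speeding up the slowest stage only lowers utilization, which is why the data path had to be redesigned along with the compute.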

Once Preferred Networks became aware of Intel Optane persistent memory’s capability to speed up its AI pipeline, it initiated a proof of concept to verify that the design would support high-capacity storage. Intel continues to advise the company as it moves forward with its AI technology efforts.

Leveraging a New Hierarchy of Storage with Intel® Optane™ Technology

Intel Optane persistent memory is a high-density, byte-addressable, 3D memory technology in a DIMM form factor that delivers a unique combination of large capacity, low latency, low power, and data persistence. The persistent memory modules integrate a new layer into the memory/storage hierarchy of an HPC system, offering DIMM-like speeds of byte-addressable data access with terabytes of capacity on the memory bus. Most 2nd Generation Intel Xeon Scalable processors support Intel Optane persistent memory modules. A node with the Intel Xeon 8260M processors can support up to 3 TB of Intel Optane persistent memory.

Intel Optane persistent memory can operate in different modes (memory, app-direct, and storage over app-direct). In memory mode, the CPU uses the Intel Optane persistent memory as system memory and uses the DRAM DIMMs as a cache. In app-direct mode, software is made aware of both types of memory and is configured to direct data reads and writes based on suitability for DRAM or Intel Optane persistent memory. App-direct mode offers larger capacity and higher performance to Preferred Networks’ training processes.

“In memory mode, the entire memory domain would reside in the persistent memory,” Xu added, “which means we wouldn’t get optimal use of the entire three terabytes. Additionally, deep learning data access patterns are very random. DRAM as cache doesn’t work effectively for those accesses. We needed direct control over the persistent memory, so we developed custom code to control it in app-direct mode.”
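
The point about random access defeating a cache can be checked with a small simulation. The sketch below is a minimal illustration under stated assumptions, not a model of the actual hardware: it uses an LRU cache as a simple stand-in for the DRAM cache (the real cache policy may differ), assumes uniformly random accesses, and uses hypothetical sizes. Under those assumptions the hit rate settles near the ratio of cache size to data set size, so a cache holding a small fraction of the data serves only a small fraction of the reads.

```python
import random
from collections import OrderedDict


def lru_hit_rate(num_items, cache_items, num_accesses, seed=0):
    """Simulate an LRU cache under uniformly random item accesses."""
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(num_accesses):
        key = rng.randrange(num_items)
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_items:
                cache.popitem(last=False)  # evict least recently used
    return hits / num_accesses


# Hypothetical sizes: the cache holds one eighth of the data set.
rate = lru_hit_rate(num_items=8000, cache_items=1000, num_accesses=200_000)
print(f"hit rate: {rate:.3f} (cache/data ratio: {1000 / 8000:.3f})")
```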

In addition to their own code, Preferred Networks developed a custom library to take advantage of the large capacity, low latency, and byte-addressable features of Intel Optane persistent memory. To optimize performance for the entire data pipeline and custom accelerator, they included a staging phase to pre-process the JPEG images by converting them to raw pixel data and loading the data set into Intel Optane persistent memory.
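
The staging idea can be sketched as follows: decode each image once, store the raw pixels as fixed-size records in one flat region, then serve random reads by computing a byte offset. This is a minimal sketch under stated assumptions and is not Preferred Networks' library. A temporary file mapped with Python's standard `mmap` module stands in for a file on a persistent-memory-backed filesystem, `fake_decode` is a hypothetical placeholder for real JPEG decoding, and the 32x32 RGB record layout is invented for illustration.

```python
import mmap
import os
import random
import tempfile

# Hypothetical record layout: one 32x32 RGB image stored as raw bytes.
HEIGHT, WIDTH, CHANNELS = 32, 32, 3
RECORD_BYTES = HEIGHT * WIDTH * CHANNELS


def fake_decode(index):
    """Stand-in for JPEG decoding: returns deterministic raw pixel bytes."""
    return bytes([index % 256]) * RECORD_BYTES


def stage_dataset(path, num_images):
    """Write every decoded image into one flat file of fixed-size records."""
    with open(path, "wb") as f:
        for i in range(num_images):
            f.write(fake_decode(i))


def read_record(view, index):
    """Byte-addressable random read: slice the record out of the mapping."""
    start = index * RECORD_BYTES
    return view[start:start + RECORD_BYTES]


# A temp file stands in for a file on a persistent-memory-backed filesystem.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    stage_dataset(path, num_images=100)
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        order = list(range(100))
        random.Random(0).shuffle(order)   # random training order
        first = read_record(view, order[0])
        print(len(first), first[0] == order[0] % 256)
finally:
    os.remove(path)
```

Because every record has the same size, a random read needs no index lookup or decode step at training time. The trade-off is that raw pixels take more space than compressed JPEG files, which is where the large capacity of persistent memory matters.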

Result

The company is manufacturing its accelerator and launching MN-3 with it. Initially, MN-3 is a cluster of up to 48 nodes. The company will expand MN-3 into a half-precision exascale supercomputer. The Intel Xeon 8260M processors will allow MN-3 to optimize pre-processing performance to stage the data set and effectively handle the post-processing phase to manage the results.

Early benchmarking of the data pipeline with Preferred Networks’ accelerator MN-Core, Intel Xeon 8260M processors, and Intel Optane persistent memory is returning up to 3.5X faster data throughput compared to the system with NVMe SSDs.4 In addition to being fast, the system is highly energy efficient. MN-3 ranked #1 on the June 2020 Green 500 list.5 Preferred Networks expects to grow the system by as much as 20X over five years, to exascale performance for deep learning training.

Solution Summary

Preferred Networks has been using HPC clusters for deep learning training to support their customers. They needed more performance, so they built their own deep learning accelerator and the first stage of a new cluster around it named MN-3. Traditional SSDs could not meet the I/O throughput requirements of the new architecture, so Preferred Networks turned to Intel Xeon 8260M processors and Intel Optane persistent memory to enable a balanced node with fast access and high capacity for training data. The new system design is expected to deliver up to 3.5X faster performance, according to Preferred Networks.

Solution Ingredients

  • 48-node deep learning training cluster with custom accelerator
  • Two 24-core Intel Xeon 8260M processors per node
  • 3 TB of Intel Optane persistent memory per node (144 TB total across 48 nodes)
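
As a quick check, the cluster-wide totals follow from multiplying the per-node figures listed above by the node count. The short calculation below uses only those figures and assumes that all 48 nodes are populated identically, which the list implies but does not state.

```python
NODES = 48                 # nodes in the initial MN-3 cluster
CPUS_PER_NODE = 2          # Intel Xeon 8260M processors per node
CORES_PER_CPU = 24         # cores per processor
PMEM_TB_PER_NODE = 3       # TB of Intel Optane persistent memory per node

total_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
total_pmem_tb = NODES * PMEM_TB_PER_NODE

print(f"CPU cores: {total_cores}")
print(f"Persistent memory: {total_pmem_tb} TB")
```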

Spotlight on Supermicro

Supermicro’s SuperServer hardware was deployed at Preferred Networks. The SuperServer platform offers high levels of performance and efficiency, and it supports 2nd Gen Intel Xeon Scalable processors.

Supermicro (Nasdaq: SMCI), a leading innovator in high-performance, high-efficiency server technology, is a premier provider of advanced Server Building Block Solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, HPC, and Embedded Systems worldwide.
