Overview

Computing infrastructure powering deep learning’s problem-solving capabilities

The core technologies of Preferred Networks (PFN), especially deep learning, require enormous computing power. To perform these vast numbers of computations efficiently, we operate our own computer clusters, more commonly known as supercomputers. Our computer clusters are named MN followed by a series number: MN-1, MN-2, and MN-3.

Infrastructure

MN-3

MN-3 is PFN’s first computer cluster to use MN-Core, a highly efficient custom processor co-developed by PFN and Kobe University specifically for deep learning. Operating since May 2020, MN-3 achieved an energy efficiency of 21.11 Gflops/W and topped the June 2020 Green500 list of the world’s most energy-efficient supercomputers. PFN continued to improve MN-3’s software stack and reached 26.04 Gflops/W, 23.3% above the June record, in November 2020, when the system was ranked second overall on the Green500 and first among systems with the highest-quality Level 3 measurements. In June 2021, efficiency improved by another 14.0% to 29.70 Gflops/W, and MN-3 was once again recognized as the world’s most energy-efficient supercomputer.
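
As a quick check, the quoted improvement percentages follow from the Gflops/W figures above; small discrepancies are due to rounding of the published values. A minimal Python sketch:

    # Energy-efficiency figures for MN-3 quoted above (Gflops/W).
    june_2020, nov_2020, june_2021 = 21.11, 26.04, 29.70

    # Relative improvement between consecutive Green500 submissions.
    nov_gain = (nov_2020 - june_2020) / june_2020 * 100   # ~23.4%, quoted as 23.3%
    jun_gain = (june_2021 - nov_2020) / nov_2020 * 100    # ~14.1%, quoted as 14.0%

    print(f"Nov 2020 vs Jun 2020: +{nov_gain:.1f}%")
    print(f"Jun 2021 vs Nov 2020: +{jun_gain:.1f}%")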

Left: June 2020 Green500 no. 1 certificate
Center: November 2020 Green500 no. 1 certificate for Level-3 measured systems
Right: June 2021 Green500 no. 1 certificate

PFN plans to expand MN-Core-powered computer clusters in multiple phases. The first of these, MN-3a, was completed with the following configuration in May 2020.

MN-3a is made up of 1.5 “zones,” each of which consists of 32 compute nodes (MN-Core Servers) tightly coupled by two MN-Core DirectConnect Switches.

Configuration of the MN-3a cluster:

  • 48 compute nodes (MN-Core Servers)
  • 4 interconnect nodes (MN-Core DirectConnect Switches)
  • 5 100GbE Ethernet switches

Configuration of each MN-3a node:

MN-Core Server

  • MN-Core: MN-Core Board x 4
  • CPU: Intel Xeon 8260M two-way (48 physical cores)
  • Memory: 384GB DDR4
  • Storage Class Memory: 3TB Intel Optane DC Persistent Memory
  • Network:
    • MN-Core DirectConnect (112Gbps) x 2
    • Mellanox ConnectX-6 (100GbE) x 2
    • On-board (10GbE) x 2
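
Combining the zone layout with the per-node figures gives a rough picture of MN-3a’s aggregate size; the totals in the sketch below are derived from the numbers above rather than separately published:

    # Aggregate MN-3a resources derived from the configuration listed above.
    nodes = 48                       # 1.5 zones x 32 MN-Core Servers per zone
    mn_core_boards = nodes * 4       # 4 MN-Core Boards per node   -> 192 boards
    cpu_cores = nodes * 48           # two-way Xeon 8260M per node -> 2,304 physical cores
    dram_tb = nodes * 384 / 1024     # 384GB DDR4 per node         -> 18TB of DRAM

    print(mn_core_boards, cpu_cores, dram_tb)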

MN-2

MN-2 is the first GPU cluster built and managed solely by PFN. Operating since July 2019, MN-2 uses RoCEv2 (RDMA over Converged Ethernet) to interconnect GPU servers over standard Ethernet.

This makes it possible to carry both high-speed communication with the storage servers and the collective communications used in deep learning over the same network with low latency.
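
Collective communications of this kind are typically driven from a deep learning framework through a library such as NCCL, which can use RDMA transports like RoCE when the fabric supports them. The following is a minimal illustrative sketch of a multi-GPU all-reduce in PyTorch, assuming a torchrun-style launch; it is not PFN’s actual training code.

    # Minimal all-reduce sketch with PyTorch + NCCL (launched e.g. via `torchrun`).
    # NCCL can use RDMA transports such as RoCE when available; otherwise it
    # falls back to TCP sockets. Illustrative only.
    import os
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")      # reads RANK/WORLD_SIZE from the environment
        local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
        torch.cuda.set_device(local_rank)

        # Each rank contributes a tensor; all_reduce sums it across all GPUs.
        x = torch.ones(1024, device="cuda") * dist.get_rank()
        dist.all_reduce(x, op=dist.ReduceOp.SUM)

        if dist.get_rank() == 0:
            print("sum of ranks:", x[0].item())      # = 0 + 1 + ... + (world_size - 1)

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()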

Specifications of the MN-2 cluster:

  • 128 GPU servers (compute nodes)
  • 32 CPU servers (compute nodes)
  • 24 storage servers
  • 18 100GbE Ethernet switches

Specifications of each compute node on MN-2:

GPU server

  • GPU: NVIDIA V100 SXM x 8
  • CPU: Intel Xeon 6254 two-way (36 physical cores)
  • Memory: 384GB DDR4
  • Network:
    • Mellanox ConnectX-4 (100GbE) x 4
    • On-board (10GbE) x 2

CPU server

  • CPU: Intel Xeon 6254 two-way (36 physical cores)
  • Memory: 384GB DDR4
  • Network:
    • Mellanox ConnectX-4 (100GbE) x 2
    • On-board (10GbE) x 2

MN-1, MN-1b

MN-1 is a GPU computer cluster that NTT Communications operates exclusively for PFN. The MN-1 cluster has two generations: MN-1 operating since September 2017 and MN-1b operating since July 2018.

The configuration of each generation is as follows:

  • MN-1
    • 128 GPU servers (NVIDIA P100 x 8, FDR 56Gbps InfiniBand x 2)
  • MN-1b
    • 64 GPU servers (NVIDIA V100 x 8, EDR 100Gbps InfiniBand x 2)


Computing Sites

  • NTT Communications Datacenter in Tokyo
    • MN-1, MN-1b
    • Operating under service contract
  • Simulator Building at Yokohama Institute for Earth Sciences, Japan Agency for Marine-Earth Science and Technology (JAMSTEC)
    • MN-2, MN-3
    • Independently operated by PFN on tenanted premises

Middleware

PFN’s computer clusters use Kubernetes, open-source software for managing containerized applications, as a core technology. Combined with PFN’s own schedulers and front-end tools, this provides a computing platform for efficient machine learning and deep learning research.
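
As a rough illustration of how such a platform is typically used, a training workload can be submitted to Kubernetes as a batch Job that requests GPUs from the scheduler. The sketch below uses the official Kubernetes Python client against a generic cluster; the image, namespace, and resource names are placeholders and it does not describe PFN’s internal schedulers or tools.

    # Illustrative only: submit a GPU batch job via the Kubernetes Python client.
    # Assumes a cluster whose NVIDIA device plugin exposes the "nvidia.com/gpu" resource.
    from kubernetes import client, config

    def submit_training_job():
        config.load_kube_config()  # or config.load_incluster_config() inside a pod

        container = client.V1Container(
            name="trainer",
            image="example.registry/train:latest",       # placeholder image
            command=["python", "train.py"],
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "8"},           # request one full 8-GPU node
            ),
        )
        template = client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "training"}),
            spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
        )
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name="example-training-job"),
            spec=client.V1JobSpec(template=template, backoff_limit=0),
        )
        client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

    if __name__ == "__main__":
        submit_training_job()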