Overview

Computing infrastructure powering problem-solving capabilities of deep learning

Preferred Networks (PFN)’s core technologies, especially deep learning, require enormous computing power. To perform a vast number of computations efficiently, we currently operate our own computer clusters (more commonly known as supercomputers). Our computer clusters are named MN followed by a series number: MN-1, MN-2 and MN-3.

Infrastructure

MN-3

Operating since May 2020, MN-3 is PFN’s third-generation computer cluster that uses MN-Core, a highly efficient custom processor co-developed by PFN and Kobe University specifically for use in deep learning. PFN is currently working to increase MN-3's computational speed for practical deep learning workloads. MN-3 topped the Green500 list of the world's most energy-efficient supercomputers three times in June 2020, June 2021 and November 2021.

Systems used for Green500 measurement and their respective performance

November 2021 June 2021 November 2020 June 2020
Nodes 32 40
MN-Core processors 128 160
CPU (Intel Xeon) cores 1,536 1,920
Peak performance (theoretical, under given standard) 3.390 PFlops 3.138 PFlops 3.92 PFlops
HPL benchmark 2.181 PFlops 1.822 PFlops 1.653 PFlops 1.621 PFlops
Energy efficiency 39.38 GFlops/W 29.70 GFlops/W 26.04 GFlops/W 21.11 GFlops/W
Green500 ranking #1 #1 #2 #1

Green500 certificate for November 2021

PFN plans to expand MN-Core-powered computer clusters in multiple phases. The first of these, MN-3a, was completed with the following configuration in May 2020.

MN-3a is made up of 1.5 “zones,” each of which consists of 32 compute nodes (MN-Core Servers) tightly coupled by two MN-Core DirectConnect Switches.

Configuration of the MN-3a cluster:

  • 48 compute nodes (MN-Core Servers)
  • Network between MN-Core servers
    • MN-Core DirectConnect (interconnect developed specifically for MN-Core processors)
    • 5 100GbE Ethernet

Configuration of each MN-3a node:

MN-Core Server

MN-Core MN-Core Board x 4
CPU Intel Xeon 8260M two-way (48 physical cores)
Memory 384GB DDR4
Storage Class Memory 3TB Intel Optane DC Persistent Memory
Network MN-Core DirectConnect (112Gbps) x 2
Mellanox ConnectX-6 (100GbE) x 2
On board (10GbE) x 2

MN-2

MN-2 is the first GPU cluster built and managed solely by PFN, Operating since July 2019.

Specifications of the MN-2 cluster:

  • 128 GPU servers (compute nodes)
  • 32 CPU servers (compute nodes)
  • 24 storage servers
  • 18 100GBE Ethernet Switches

Specifications of each compute node on MN-2:

GPU server

GPU NVIDIA V100 SXM x 8
CPU Intel Xeon 6254 two-way (36 physical cores)
Memory 384GB DDR4
Network Mellanox ConnectX-4 (100GbE) x 4
On board (10GbE) x 2

CPU server

CPU Intel Xeon 6254 two-way (36 physical cores)
Memory 384GB DDR4
Network Mellanox ConnectX-4(100GbE) x 2
On board(10GbE) x 2

MN-1, MN-1b

MN-1 is a GPU computer cluster that NTT Communications operates exclusively for PFN. The MN-1 cluster has two generations: MN-1 (operating since September 2017) and MN-1b (operated between July 2018 and July 2021).

The configuration of each MN-1 cluster is as follows:

  • MN-1
  • 128 GPU servers (NVIDIA P100 x 8, FDR 56Gbps InfiniBand × 2)
  • MN-1b
  • 64 GPU servers (NVIDIA V100 x 8, EDR 100Gbps InfiniBand × 2)

MN-1 milestones:

MN-1b milestones:

Middleware

PFN’s computer clusters use Kubernetes, an OSS for managing containerised applications, as core technology. This, in combination with PFN’s own schedulers and front-end tools, provides a computing platform for efficient machine learning and deep learning research.