Documentation

Environment

1. The EC2 instance and the S3 bucket are located in the same region.

2. fio is installed:

# Red Hat/CentOS/etc.
$ sudo dnf install -y fio

# Ubuntu/Debian/etc.
$ sudo apt install -y fio

Note:

CPU usage is reported as a percentage of a single core, not as a share of total CPU capacity.

For example, c6id.8xlarge has 32 vCPUs, so 1600% CPU usage corresponds to 16 cores fully utilized.

That same 1600% CPU usage on c6id.8xlarge corresponds to 1600% / 32 = 50% of total CPU capacity.
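
The conversion in the note above can be scripted when reading the result tables. This is a minimal sketch, not part of mapfs or fio; the function name cpu_usage is hypothetical, and it assumes awk is available.

```shell
#!/usr/bin/env bash
# Convert a per-core CPU percentage (as reported in the result tables) into
# fully utilized cores and into a share of total CPU capacity.
# Usage: cpu_usage <cpu_percent> <vcpus>
cpu_usage() {
  awk -v pct="$1" -v vcpus="$2" 'BEGIN {
    printf "%.2f cores, %.2f%% of total capacity\n", pct / 100, pct / vcpus
  }'
}

# c6id.8xlarge example from the note: 1600% on 32 vCPUs
cpu_usage 1600 32   # prints "16.00 cores, 50.00% of total capacity"
```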

Create a 100GB test file

fio --name=create_100gb_file \
--filename=/mnt/fuse/100gb \
--ioengine=libaio \
--direct=1 \
--group_reporting \
--fallocate=none \
--create_on_open=1 \
--end_fsync=1 \
--size=100000M \
--rw=write \
--bs=10M \
--numjobs=1

Load the test file into local cache

$ lsblk

$ sudo mkfs.xfs /dev/nvme1n1

$ sudo mkdir /mnt/fuse /data

$ sudo chmod 0777 /mnt/fuse /data -R

$ sudo mount /dev/nvme1n1 /data

$ sudo mapfs add vol_benchmark aws <AWSAccessKey> <AWSSecretKey> <S3-BucketName> <Region> cache_dir=/data

$ sudo mapfs mount vol_benchmark /mnt/fuse

$ mapfs load /mnt/fuse/100gb

fio commands

fio: Read Performance with cache

fio --name=read_benchmark \
--ioengine=libaio \
--direct=1 \
--group_reporting \
--runtime=60 \
--size=100000M \
--readonly \
--filename=/mnt/fuse/100gb \
--rw=<read|randread> \
--bs=<4k|1M> \
--numjobs=<1|32|128>
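
The rw, bs, and numjobs choices in the command expand to 12 runs (2 access patterns x 2 block sizes x 3 job counts). The following is a minimal sketch that prints one fio command per combination as a dry run. The function name print_read_matrix is hypothetical, and the fio options are copied from the command above.

```shell
#!/usr/bin/env bash
# Print (dry run) one fio command per combination of access pattern,
# block size, and job count used in the read benchmark.
# Usage: print_read_matrix [test_file]
print_read_matrix() {
  local file="${1:-/mnt/fuse/100gb}"
  local rw bs jobs
  for rw in read randread; do
    for bs in 4k 1M; do
      for jobs in 1 32 128; do
        # echo joins its arguments with spaces, so each command is one line
        echo "fio --name=read_benchmark --ioengine=libaio --direct=1" \
             "--group_reporting --runtime=60 --size=100000M --readonly" \
             "--filename=${file} --rw=${rw} --bs=${bs} --numjobs=${jobs}"
      done
    done
  done
}

print_read_matrix
```

To execute the runs instead of only printing them, the output could be piped to bash (print_read_matrix | bash), assuming fio is installed and the test file is already cached.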

Note:

mapfs uses direct I/O throughout the I/O path and does not use the OS page cache.

Therefore, there is no need to drop the page cache after a fio test.

Note:

After each fio test, delete the fio-generated test file(s) from cloud storage:

$ rm -f /mnt/fuse/write_benchmark.*

EC2 c6id Series

Each instance type has one NVMe SSD instance store volume.

Instance Type	Baseline IOPS	Peak IOPS	Baseline Throughput (MB/s)	Peak Throughput (MB/s)	Baseline Bandwidth (Mbps)	Peak Bandwidth (Mbps)
c6id.large	3600	40000	81.25	1250	650	10000
c6id.xlarge	6000	40000	156.25	1250	1250	10000
c6id.2xlarge	12000	40000	312.5	1250	2500	10000
c6id.4xlarge	20000	40000	625	1250	5000	10000
c6id.8xlarge	40000	40000	1250	1250	10000	10000

Note:

c6id.large/xlarge/2xlarge/4xlarge: EBS and network performance have a baseline and a burstable upper limit.
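
The bandwidth and throughput columns in the table are the same limit in different units: dividing Mbps by 8 gives MB/s. A minimal sketch of that conversion, using a hypothetical helper name mbps_to_mbs and the baseline bandwidth values from the table:

```shell
#!/usr/bin/env bash
# Convert megabits per second to megabytes per second (8 bits per byte).
# Usage: mbps_to_mbs <mbps>
mbps_to_mbs() {
  awk -v mbps="$1" 'BEGIN { print mbps / 8 }'
}

# Baseline bandwidth values for c6id.large through c6id.8xlarge
for mbps in 650 1250 2500 5000 10000; do
  echo "${mbps} Mbps = $(mbps_to_mbs "$mbps") MB/s"
done
```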

c6id Series: benchmark

c6id.large

2 vCPU, 4GB RAM

On-Demand Linux pricing: 0.1155 USD per Hour

Network: Up to 12.5Gbps

NVME SSD: Instance Store (data is lost after EC2 Stop)

1 x 118 GiB NVMe SSD

Throughput (block size: 1MB)

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
Throughput	152 MiB/s	4839 MiB/s	6440 MiB/s	152 MiB/s	152 MiB/s	152 MiB/s
CPU (cores)	3.36%	75.22%	162.88%	3%	4%	4%
RSS	381 MB	399 MB	401 MB	219 MB	227 MB	228 MB
%MEM	10%	10.45%	10.51%	6%	6%	6%

IOPS (block size: 4KB)

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
IOPS	39k	182k	132k	8.2k	34.1k	34.2k
CPU (Cores)	31.68%	93.89%	80%	27.2%	97.96%	101.48%
RSS	388 MB	409 MB	394 MB	207 MB	208 MB	208 MB
%MEM	10.16%	10.72%	10.32%	5.43%	5.46%	5.46%

Network Read

Bandwidth	70.8 MiB/s
CPU(cores)	57.48%
RSS	395 MB
%MEM	10.36%

Network Write

result/numjobs	block size: 1MB		block size: 4MB
result/numjobs	1	32	1	32
Throughput	791 MiB/s	831 MiB/s	986 MiB/s	850 MiB/s
CPU (cores)	152.61%	140%	161.14%	145.84%
RSS	648 MB	659 MB	715 MB	681 MB
%MEM	16.98%	17.27%	18.73%	17.85%

c6id.xlarge

4 vCPU, 8GB RAM
On-Demand Linux pricing: 0.231 USD per Hour
Network: baseline 1.25Gbps (156MB/s), up to 12.5Gbps
NVME SSD: 1 x 237 GiB Instance Store

Throughput (block size: 1MB)

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
Throughput	305 MiB/s	9754 MiB/s	11.3 GiB/s	309 MiB/s	305 MiB/s	305 MiB/s
CPU (cores)	6.62%	177%	312%	8.64%	9.72%	8.90%
RSS	701 MB	701 MB	725 MB	309 MB	309 MB	309 MB
%MEM	9.01%	9.02%	9.32%	3.97%	3.97%	3.98%

IOPS (block size: 4KB)

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
IOPS	53.1k	319k	198k	8.2k	68.2k	68.3k
CPU (Cores)	41%	199%	172%	26%	223%	236%
RSS	702 MB	682 MB	463 MB	300 MB	301 MB	303 MB
%MEM	9.03%	8.76%	5.95%	3.86%	3.87%	3.89%

Network Read

Bandwidth	141.8 MiB/s
CPU(cores)	85%
RSS	616 MB
%MEM	7.92%

Network Write

result/numjobs	block size: 1MB		block size: 4MB
result/numjobs	1	32	1	32
Throughput	917 MiB/s	1200 MiB/s	1144 MiB/s	879 MiB/s
CPU (cores)	163%	180%	191%	165%
RSS	1094 MB	1048 MB	1036 MB	1041 MB
%MEM	14.06%	13.47%	13.32%	13.38%

c6id.2xlarge

8 vCPU, 16GB RAM
On-Demand Linux pricing: 0.4620 USD per Hour
Network: Up to 12.5Gbps
NVME SSD: 1 x 474 GiB Instance Store

Throughput (block size: 1MB)

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
Throughput	615 MiB/s	19.0 GiB/s	18.8 GiB/s	612 MiB/s	610 MiB/s	610 MiB/s
CPU (cores)	17%	427%	589%	18%	20%	22%
RSS	705 MB	760 MB	795 MB	253 MB	252 MB	254 MB

IOPS (block size: 4KB)

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
IOPS	51.2k	472k	282k	7.7k	116k	103k
CPU (Cores)	43%	384%	347%	30%	385%	391%
RSS	770 MB	758 MB	502 MB	237 MB	239 MB	239 MB

Network Read

Bandwidth	283.3 MiB/s
CPU(cores)	112%
RSS	736 MB
%MEM	4.68%

Network Write

result/numjobs	block size: 1MB		block size: 4MB
result/numjobs	1	32	1	32
Throughput	1319 MiB/s	1254 MiB/s	1255 MiB/s	1250 MiB/s
CPU (cores)	186%	183%	193%	182%
RSS	1952 MB	1985 MB	2028 MB	2004 MB
%MEM	12.43%	12.64%	12.91%	12.76%

c6id.4xlarge

16 vCPU, 32GB RAM
On-Demand Linux pricing: 0.9240 USD per Hour
Network: Up to 12.5Gbps
NVME SSD: 1 x 950 GiB Instance Store

Throughput (block size: 1MB)

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
Throughput	1202 MiB/s	38.0 GiB/s	45.8 GiB/s	1222 MiB/s	1221 MiB/s	1221 MiB/s
CPU (cores)	29%	645%	1236%	26%	31%	32%
RSS	742 MB	825 MB	910 MB	345 MB	344 MB	343 MB

IOPS (block size: 4KB), mounted with iouring

$ sudo mapfs umount vol_benchmark

$ sudo mapfs mount vol_benchmark /mnt/fuse iouring

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
IOPS	147k	1010k	1124k	9.5k	112k	124k
CPU(cores)	41%	591%	705%	17%	250%	285%
RSS	703 MB	726 MB	734 MB	203 MB	205 MB	205 MB

Network Read

Bandwidth	568.2 MiB/s
CPU(cores)	125%
RSS	778 MB
%MEM	2.47%

Network Write

result/numjobs	block size: 1MB		block size: 4MB		block size: 8MB
result/numjobs	1	32	1	32	1	32
Throughput	1429 MiB/s	1370 MiB/s	1375 MiB/s	1380 MiB/s	1387 MiB/s	1051 MiB/s
CPU (cores)	169%	174%	190%	180%	206%	162%
RSS	2390 MB	2473 MB	2556 MB	2516 MB	2556 MB	2552 MB
%MEM	7.57%	7.84%	8.10%	7.97%	8.10%	8.09%

c6id.8xlarge

32 vCPU, 64GB RAM
On-Demand Linux pricing: 1.8480 USD per Hour
Network: 12.5Gbps
NVME SSD: 1 x 1900 GiB Instance Store

Throughput (block size: 1MB)

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
Throughput	1671 MiB/s	51.8 GiB/s	80.4 GiB/s	1471 MiB/s	2487 MiB/s	2482 MiB/s
CPU (cores)	35%	820%	2666%	36%	68%	71%
RSS	885 MB	904 MB	868 MB	432 MB	435 MB	435 MB
%MEM	1.40%	1.43%	1.37%	0.68%	0.69%	0.69%

IOPS (block size: 4KB), mounted with iouring

$ sudo mapfs umount vol_benchmark

$ sudo mapfs mount vol_benchmark /mnt/fuse iouring

result/numjobs	Sequential			Random
result/numjobs	1	32	128	1	32	128
IOPS	147k	829k	818k	9.6k	165k	201k
CPU(cores)	40%	1893%	1889%	17%	345%	432%
RSS	704 MB	734 MB	714 MB	256 MB	256 MB	256 MB
%MEM	1.11%	1.16%	1.13%	0.41%	0.41%	0.41%

Network Read

Bandwidth	1124 MiB/s
CPU(cores)	150%
RSS	1166 MB
%MEM	1.84%

Network Write

result/numjobs	block size: 1MB		block size: 4MB		block size: 8MB
result/numjobs	1	32	1	32	1	32
Throughput	1460 MiB/s	1457 MiB/s	1453 MiB/s	1419 MiB/s	1399 MiB/s	1434 MiB/s
CPU (cores)	166%	167%	192%	185%	197%	186%
RSS	2386 MB	2524 MB	2658 MB	2611 MB	2663 MB	2668 MB
%MEM	3.77%	3.99%	4.20%	4.13%	4.21%	4.22%

Benchmark Overview

mapfs uses only a small amount of system memory, so memory usage is not shown in the charts.

Throughput

Sequential Read (bs=1MB)

Random Read (bs=1MB)

IOPS

Sequential Read (bs=4KB)

Random Read (bs=4KB)

Network Read

Network Write

Sequential Write (bs=1MB)

fuse over iouring

kernel support

By default, "mapfs mount <VolumeName> <MountPoint>" mounts in traditional mode.

To enable "fuse over iouring" mode, specify "iouring" during mount:

$ sudo mapfs mount <VolumeName> <MountPoint> iouring

Note:

Specifying iouring does not guarantee that "fuse over iouring" is enabled. It also requires Linux kernel version >= 6.18, which is typically available in these distributions:

Amazon Linux 2023, kernel 6.18
Ubuntu 26.04
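
Before mounting with iouring, the running kernel can be checked against the 6.18 requirement. This is a minimal sketch that compares only the major and minor numbers from uname -r. The function name kernel_at_least is hypothetical, and the check says nothing about whether mapfs itself enables the mode.

```shell
#!/usr/bin/env bash
# Return success if kernel release $1 (e.g. "6.18.0-1001-aws") is at least
# version $2, given as major.minor (e.g. "6.18").
kernel_at_least() {
  local release="$1" required="$2"
  local major minor req_major req_minor
  major="${release%%.*}"                 # text before the first dot
  minor="${release#*.}"                  # text after the first dot
  minor="${minor%%[!0-9]*}"              # keep only the leading digits
  req_major="${required%%.*}"
  req_minor="${required#*.}"
  (( major > req_major || (major == req_major && minor >= req_minor) ))
}

if kernel_at_least "$(uname -r)" 6.18; then
  echo "kernel $(uname -r): meets the 6.18 requirement"
else
  echo "kernel $(uname -r): below 6.18, fuse over iouring is not expected to be enabled"
fi
```

The comparison is numeric per component, so 6.8 is correctly treated as older than 6.18, which a plain string comparison would get wrong.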

Advantages and limitations of iouring mode

Traditional mount:
When the CPU count exceeds 16 and IOPS exceed 300K, contention on a single lock can cause CPU usage to spike while I/O performance stops improving or even degrades sharply.
Adding more CPUs makes the degradation worse, because of scheduling pressure, frequent L3 cache invalidations, and CPU time spent spinning on the single kernel spinlock.

iouring mode mount:
Removes the single-lock contention point and scales better when the CPU count exceeds 16 and IOPS exceed 300K.
Its limitation is that when IOPS exceed 1M and the CPU count reaches 32 or more, fuse uring threads may degrade into polling mode, causing high CPU usage without a corresponding increase in I/O performance.

Common benchmark bottlenecks

SSD Throughput and IOPS limitations

For example, if you test with a gp3 volume at its default settings (125 MiB/s throughput and 3000 IOPS), Random Read performance is limited to 125 MiB/s at a 1MB block size and to 3000 IOPS at a 4KB block size.
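
Which of the two limits applies depends on the block size: the IOPS limit multiplied by the block size gives the throughput reachable at that limit, and the lower of that value and the throughput limit wins. A minimal sketch of the arithmetic follows. It uses a hypothetical function name volume_ceiling and a simplified model that assumes one I/O per block, ignoring how EBS may split or merge requests.

```shell
#!/usr/bin/env bash
# Report the throughput ceiling of a volume for a given block size.
# Usage: volume_ceiling <iops_limit> <throughput_limit_mib_s> <block_size_kib>
volume_ceiling() {
  awk -v iops="$1" -v tput="$2" -v bs="$3" 'BEGIN {
    by_iops = iops * bs / 1024   # MiB/s reachable at the IOPS limit
    if (by_iops < tput)
      printf "%.2f MiB/s (IOPS-bound)\n", by_iops
    else
      printf "%.2f MiB/s (throughput-bound)\n", tput
  }'
}

# Default gp3 settings: 3000 IOPS, 125 MiB/s
volume_ceiling 3000 125 1024   # 1MB blocks: prints "125.00 MiB/s (throughput-bound)"
volume_ceiling 3000 125 4      # 4KB blocks: prints "11.72 MiB/s (IOPS-bound)"
```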

EC2 EBS Throughput and IOPS limitations

Suppose you test with a c6i.large EC2 instance and one gp3 SSD volume.

The baseline EBS performance of c6i.large is 3600 IOPS and 81.25 MiB/s throughput, and the default gp3 performance is 3000 IOPS and 125 MiB/s throughput.

Even if you raise the gp3 volume to 5000 IOPS and 500 MiB/s throughput, once the burst credits are exhausted (generally after 10 - 30 minutes) you are still limited by the instance's baseline EBS performance of 3600 IOPS and 81.25 MiB/s throughput.
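
The sustained limit in this scenario is the lower of the instance baseline and the volume setting, taken separately for IOPS and throughput. A minimal sketch with the numbers from this section; the function names min_of and sustained_limits are hypothetical.

```shell
#!/usr/bin/env bash
# Print the smaller of two numbers.
min_of() {
  awk -v a="$1" -v b="$2" 'BEGIN { print (a + 0 < b + 0) ? a : b }'
}

# Usage: sustained_limits <instance_iops> <instance_mib_s> <volume_iops> <volume_mib_s>
sustained_limits() {
  echo "$(min_of "$1" "$3") IOPS, $(min_of "$2" "$4") MiB/s"
}

# c6i.large baseline (3600 IOPS, 81.25 MiB/s) with two gp3 configurations
sustained_limits 3600 81.25 3000 125   # default gp3: prints "3000 IOPS, 81.25 MiB/s"
sustained_limits 3600 81.25 5000 500   # raised gp3:  prints "3600 IOPS, 81.25 MiB/s"
```

With the default gp3 settings the volume caps IOPS while the instance caps throughput; after raising the volume, the instance baseline caps both.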

Network bandwidth

Once the network bandwidth is saturated, even by a single fio job (numjobs=1), Network Read/Write performance does not increase when more fio jobs are added.
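
To judge whether a result is network-bound, it helps to express the link speed in the unit fio reports. A minimal sketch, assuming decimal gigabits (10^9 bits per second) and ignoring protocol overhead; the function name gbps_to_mibs is hypothetical.

```shell
#!/usr/bin/env bash
# Convert a link speed in Gbps to MiB/s, rounded to the nearest integer.
# Usage: gbps_to_mibs <gbps>
gbps_to_mibs() {
  awk -v gbps="$1" 'BEGIN { printf "%.0f\n", gbps * 1e9 / 8 / 1048576 }'
}

gbps_to_mibs 12.5   # prints "1490"
```

On c6id.8xlarge, the measured Network Write throughput of 1460 MiB/s with numjobs=1 is already close to this 1490 MiB/s ceiling, which is consistent with more jobs not helping.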

CPU Burstable, EBS Burstable, Network Burstable, NVME SSD Burstable, VPC Credits, etc.

For example, if you use burstable EC2 instances (such as t3.small), CPU performance may fluctuate between the baseline and the burst limit.
