Darrick Wong has been doing work on XFS online
repair for a number of years
and things are getting to the point where most of the filesystem-internal work
has been completed and is under review. The work remaining mostly concerns
the user-space side
to set up a periodic scan-and-repair cycle, so he remotely led a filesystem
session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF
Summit to discuss what user space needs from this kind of feature. The
session may
not have gone quite as he hoped, as it got somewhat derailed by topics that
spilled over from the earlier session on
unprivileged image mounts.
His current patch set for XFS online repair is “out for review on Dave
Chinner’s laptop right now”, so it is time to start talking about the
missing pieces. That means that he will be talking more about user space
than he would normally; there is a user-space driver program that controls
how often the online fsck mechanism runs. There is nothing yet
for notifying user space of problems that were found by an online fsck
pass, nor is there a daemon monitoring for notifications to do anything
about them, such as to issue repair requests. There is no good
infrastructure in the kernel for handling and dispatching such things, he
said.
He said that the earlier discussion in the unprivileged-mounts session
on using fsck to decide that an image was sound enough to mount
made him think that it was a good time to discuss these kinds of issues.
As he noted, there is a command-line program, xfs_scrub,
which opens the block device and root directory, then starts issuing the right
ioctl() commands, but the real use case is not for running a tool
in that fashion. Instead, the idea is that it would do a background check
and repair periodically from
a systemd service; he is struggling a bit with setting that up, but has
something working. It is not, however, much different from the age-old
periodic cron job that reports its results to the system log and hopes an
administrator is paying attention.
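
In concrete terms, each xfs_scrub check boils down to one ioctl() call per
piece of metadata. Purely as a rough sketch, not Wong's code, the hypothetical
snippet below asks the kernel to check, and if possible repair, one allocation
group's superblock using the XFS_IOC_SCRUB_METADATA interface; the /mnt/data
mount point is made up, the header name can vary between xfsprogs versions,
and repairs also require a kernel built with CONFIG_XFS_ONLINE_REPAIR plus
administrator privileges.

    /*
     * Hypothetical sketch, not Wong's code: check (and, if possible,
     * repair) the superblock of allocation group 0 via the online-scrub
     * ioctl.  "/mnt/data" is a made-up mount point.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <xfs/xfs.h>   /* XFS_IOC_SCRUB_METADATA etc. (xfsprogs headers) */

    int main(void)
    {
        struct xfs_scrub_metadata sm;
        int fd = open("/mnt/data", O_RDONLY | O_DIRECTORY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&sm, 0, sizeof(sm));
        sm.sm_type = XFS_SCRUB_TYPE_SB;        /* one AG's superblock */
        sm.sm_agno = 0;                        /* allocation group 0 */
        sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;  /* repair if the kernel can */

        if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
            perror("XFS_IOC_SCRUB_METADATA");
        else if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
            printf("AG 0 superblock is corrupt and was not repaired\n");
        else
            printf("AG 0 superblock checked\n");

        close(fd);
        return 0;
    }

In essence, xfs_scrub iterates calls like this over every allocation group and
inode in the filesystem; the systemd service simply runs that iteration on a
schedule.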
He would like to create a notification
system that would allow the system to respond dynamically to the events
that get reported by the periodic scrubbing. He would also like there to
be a way for programs to initiate scrubbing for various reasons, such as a
container manager that notices relatively low activity so it kicks off
scrubbing on the mounted filesystems. Maybe that could mesh with the
unprivileged-mounting use case in some fashion as well, Wong said.
So he wondered if any user-space developers had thoughts on how they might
want to use this facility. He could continue developing “with my
kernel colored glasses on”, but he fears that may not produce the best
results. There was an effort made to scare up Lennart Poettering, who might
have some thoughts on the matter, but who had not made it back to the
filesystem room after the coffee break.
Josef Bacik said that he generally relied on people from Fedora and other
distributions to give him feedback on features of this sort. The
distribution developers often have different ideas on how these things will
be used. So, for thoughts on policies that might be applied to the online
scrubber, he
recommended seeking out people from Linux distributions.
Ted Ts’o replayed some of the earlier discussion around using (offline)
fsck
to check image files before mounting them.
In order to be sure that image files are not modified by user
space while the fsck is being done, Ts’o had said that they
would need to be copied somewhere inaccessible to user space beforehand.
The in-kernel fsck equivalent that XFS is planning to add might not
need that copy/snapshot step, he suggested, but that does not really
change whether using fsck in that manner is sufficient.
By that point, Poettering had returned, so Wong repeated some of what he had
said earlier. He said that the work on the online scrubber had quite
recently become
more urgent because “a lot more distros than the zero I thought there were
will actually let you mount XFS filesystems without privilege”. There have
also been recent efforts in XFS to flag strange problems (“weird-looking
metadata or outright bad metadata”) that it sees, but that reporting is not
connected to fsnotify events (as ext4’s is) to notify user space of these
kinds of corruption. XFS generally knows exactly what the problem was,
which could be encoded in the notification somehow in the hopes that
someone is listening and can take appropriate action. For some filesystems
that might be to unmount and fsck the filesystem, while XFS could
use the online repair facility.
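
The ext4 notifications Wong referred to are presumably the FAN_FS_ERROR
fanotify events added in Linux 5.16 for exactly this purpose. Purely as a
hypothetical sketch of what a listener for that kind of corruption
notification might look like, independent of whatever XFS eventually grows,
the program below subscribes to FAN_FS_ERROR on a made-up mount point (it
needs root and reasonably new kernel headers) and prints each error record:

    /*
     * Hypothetical sketch of a corruption-notification listener:
     * subscribe to FAN_FS_ERROR on a made-up mount point and print
     * each error record.  Needs root and Linux 5.16+ headers.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/fanotify.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);

        if (fd < 0 ||
            fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                          FAN_FS_ERROR, AT_FDCWD, "/mnt/data") < 0) {
            perror("fanotify");
            return 1;
        }

        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            struct fanotify_event_metadata *md;

            if (len <= 0)
                break;
            for (md = (void *)buf; FAN_EVENT_OK(md, len);
                 md = FAN_EVENT_NEXT(md, len)) {
                /* Walk the info records attached to the event; the
                 * error record carries an errno-style code and a count
                 * of errors seen since the last event was read. */
                char *p = (char *)(md + 1);
                char *end = (char *)md + md->event_len;

                while (p < end) {
                    struct fanotify_event_info_header *hdr = (void *)p;

                    if (hdr->len == 0)
                        break;
                    if (hdr->info_type == FAN_EVENT_INFO_TYPE_ERROR) {
                        struct fanotify_event_info_error *err = (void *)hdr;

                        printf("filesystem error %d (%u so far)\n",
                               err->error, err->error_count);
                    }
                    p += hdr->len;
                }
            }
        }
        close(fd);
        return 0;
    }

A monitoring daemon of the sort Wong described could react to such an event by
scheduling an online-repair pass, or by asking systemd to stop the services
that depend on the affected filesystem.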
Poettering said that the current practice of having desktops mount
removable media automatically is
“stupid”; the approach that Chrome OS takes with only mounting certain specific
filesystem types (e.g. VFAT), and only through a user-space driver, is much
better and one that other desktops should adopt. The desktop use case is
generally for USB sticks, and people do not normally put XFS on that kind
of media, he said, so those should not be automatically mounted at all.
For mounting filesystem images in containers, though, he thinks
trust should come from dm-verity as described in his earlier talk. Ts’o had
said that fsck might be sufficient for establishing that an ext4
image would not compromise the kernel, so Poettering wondered if Wong would
say the same for XFS. That is difficult to answer, Wong said; “as
soon as I say ‘yes’, everybody in the world will watch their fuzzer rigs in
order to try to find all of the things that fsck doesn’t catch”.
That said, he generally agrees with Ts’o that fsck, either online
or offline, should be
robust enough to catch any bad filesystems, but it is not an absolute
guarantee since bugs happen.
Poettering noted that the online checking for XFS was not usable for
establishing trust, since the filesystem would need to be mounted first.
Wong agreed, but wondered about images that had been signed by the
distributor. Poettering and Christian Brauner said that signed images are
fully trustable or, at least, that it is a user-space problem if they are not.
Kent Overstreet said that fsck could not be used to establish
trust in any case because a malicious device could change the data out from
underneath the check. While that is true for, say, USB devices, the
snapshot/copy requirement for a local image file that is getting mounted in
a container removes that possibility, Ts’o said.
Overstreet argued that
requiring the copy was onerous and unenforceable for users.
Instead, he thinks “the responsible thing for us to be doing as
filesystem implementers is to start taking it a little bit more seriously to
just hardening our code at run time”. He said that XFS does
a lot of read- and write-time verification of metadata along with fuzzing,
as does
Overstreet’s bcachefs, so
“we might not be in as bad a shape as we assume”.
Brauner wanted to clarify that the copy and fsck being discussed
was not something that would be under the user’s control, but would be handled
by a mount daemon. Overstreet was adamant that it would still be
unacceptable to do the copy and “people are going to want to be able to
mount images in the
cloud untrusted very soon”.
Bacik said that the session was “getting off the rails” at that point.
He said that Wong wanted to know what kinds of notifications would be useful
to user space and how to handle the policy questions around those; Wong agreed
with that. Poettering
said that he is “not a storage guy”, so he does not know what kinds of
policies they might want, but he thinks that simply shutting down the affected
services when a filesystem they rely on has errors is the safest approach.
If systemd were to get a notification of that sort, it could easily be set
up to shut down affected services.
Ts’o said that those who are running these kinds of services should be
consulted about how to handle the events. For example: what do the
Kubernetes people actually want? They may want to shut down affected
services, but give the services a short time frame to send a “goodbye cruel
world”
message or similar. The ext4 notifications that Wong mentioned were
specifically added for Borg, Google's internal Kubernetes-like container
manager; the people
maintaining those systems wanted to be able to shut down services in the
face of filesystem corruption.
Wong said things are a little different working for a large database vendor
(Oracle); most of the use of XFS, beyond root filesystems, is for “really
large data partitions where we would like to be able to perform at least
simple
repairs on the 100TB data partition to try to keep the VM running”. At any
given time, the workload running in the VM or container is probably not
accessing the whole 100TB, so there is an opportunity to fix things before
the application even notices. “We would at least like to try to grow new
engines on the plane while it’s flying in order to avoid having to do an
emergency landing.” Restoring 100TB (or even more) can take a long time,
which is best avoided.
Poettering wondered if a mount option that simply instructed XFS to run its
online scrubber whenever it detected an anomaly might be a reasonable
approach. “Why involve user space to trigger the online filesystem check?”
User space is better for performing actions on other parts of the system,
such as shutting down relevant services, so it does not really make sense
for XFS to notify of a problem and have user space say “go fix yourself”.
Wong said that he was willing to write an XFS daemon that would receive
notifications and schedule scrubbing if need be.
He wrapped up by describing some of the fuzzing that is done for XFS, which
uses the XFS
debugger to “walk every single field of every metadata object in the entire
filesystem and fuzz them”. That is part of why the XFS QA test suite takes
almost a week to run; it spends a lot of time fuzzing and checking to see
that the repair tool notices the problems and can fix them, both online
and offline. He thinks he added some fuzzing of ext4 metadata blocks to
fstests along the way, but not with the same level of precision as the XFS
fuzz testing.