Compute node disk expansion resulted in mount.ocfs2: Bad magic number in superblock while opening device /dev/sda3
Before Christmas 2017 I had the honour of expanding the compute node storage for a virtualised Exadata, or better said, OVM on Exadata. This can be done for several reasons. The most logical one, and the reason this customer bought the disk expansion, is that you want to create more virtual machines but have run out of space on /EXAVMIMAGES.
The preparation is actually pretty simple. It is documented completely in the Exadata Database Machine Maintenance Guide, and look, section 2.9.11 Expanding /EXAVMIMAGES on Management Domain after Database Server Disk Expansion looks like exactly what we need. Its second step states:
Add the disk expansion kit to the database server.
The kit consists of 4 additional hard drives to be installed in the 4 available slots. Remove the filler panels and install the drives. The drives may be installed in any order.
Even that is documented, in section 2.4 Adding Disk Expansion Kit to Database Servers. So this is an easy task, right? Yes! But … would I blog about it if there weren't an "oops" in it? There are two little "gotchas".
Below I describe how I did it. Oracle Support also verified that this approach should be OK, so here we go.
Gotcha 1
Preparation
You need to set aside some important information about how the system looks.
First of all you need to ensure that reclaimdisks.sh was run correctly after installation. As I did this installation myself, I can confirm this was done correctly, so this step can be skipped.
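If you want to double-check, the script can report its own status. To my knowledge this is its standard location on the compute nodes, but verify it on your system:

# check that reclaimdisks.sh was run and completed (assumed standard location)
/opt/oracle.SupportTools/reclaimdisks.sh -check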
Then we go to the next step: physically adding the disks to the servers. This is really not that difficult, but it needs to be done with care and according to the safety measures Oracle describes; also watch out for ESD. The drives are FRUs anyway, but in this case you know when the engineer should do it.
Once the disks are in the server, the RAID rebuild starts automatically. In our case it took around 14 hours to finish. dbmcli records an entry in the alert history when the rebuild is done.
[root@demoexa01db01 ~]# dbmcli -e list alerthistory
         23_1  2017-12-18T13:54:04+00:00  warning  "A disk expansion kit was installed. The additional physical drives were automatically added to the existing RAID5 configuration, and reconstruction of the corresponding virtual drive was automatically started."
         23_2  2017-12-19T04:14:05+00:00  clear    "Virtual drive reconstruction due to disk expansion was completed."
[root@demoexa01db01 ~]#
It's also good to know what the partitions look like:
[root@demoexa01db01 ~]# cat /proc/partitions |grep sda
   8        0 4094720000 sda
   8        1     524288 sda1
   8        2  119541760 sda2
   8        3 1634813903 sda3
[root@demoexa01db01 ~]#
While preparing the steps, I saw that we would have to recreate the partition, not based on its start and end sectors but on a specified size. Personally I do not like this, so it's good to also gather the start and end sectors using parted.
[root@demoexa01db01 ~]# parted /dev/sda 'unit s print'
Model: LSI MR9361-8i (scsi)
Disk /dev/sda: 8189440000s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start       End          Size         File system  Name     Flags
 1      64s         1048639s     1048576s     ext3         primary  boot
 2      1048640s    240132159s   239083520s                primary  lvm
 3      240132160s  3509759965s  3269627806s               primary

[root@demoexa01db01 ~]#
Also check how parted sees the disk:
[root@demoexadb01 ~]# parted -s /dev/sda print
Model: LSI MR9361-8i (scsi)
Disk /dev/sda: 4193GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      32.8kB  537MB   537MB   ext3         primary  boot
 2      537MB   123GB   122GB                primary  lvm
 3      123GB   1797GB  1674GB               primary

[root@demoexadb01 ~]#
Also copy aside the output of df -h and the list of virtual machines from xm list.
[root@demoexa01db01 ~]# df -h /EXAVMIMAGES
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             1.6T  1.5T  117G  93% /EXAVMIMAGES
[root@demoexa01db01 ~]# xm list
Name                                  ID     Mem VCPUs      State    Time(s)
Domain-0                               0    7309     4     r----- 11323224.3
demoexaadm01db01vm01.mydomain.demo    32   16387     4     -b----   141175.1
demoexaadm01db01vm02.mydomain.demo    33   77827    12     r-----  2422336.9
demoexaadm01db01vm03.mydomain.demo    34  122883    20     -b----  4851298.7
demoexaadm01db01vm04.mydomain.demo    35   61443     8     r-----  1203861.5
demoexaadm01db01vm05.mydomain.demo    36   98307     8     r-----  1998874.0
demoexaadm01db01vm06.mydomain.demo    37   71683    10     r-----  2530870.0
demoexaadm01db01vm07.mydomain.demo    38   61443     4     r-----  1316172.4
demoexaadm01db01vm08.mydomain.demo    39  122883     8     r-----  2759130.6
demoexaadm01db01vm09.mydomain.demo    40   65639     4     r-----  1240107.5
[root@demoexa01db01 ~]#
Expansion
Partition enlargement
Then all domains can be shut down. Do this on only one node at a time so that your databases don't go down.
[root@demoexadb01 ~]# xm shutdown -a -w
Domain demoexaadm01vm01.mydomain.demo terminated
Domain demoexaadm01vm04.mydomain.demo terminated
Domain demoexaadm01vm07.mydomain.demo terminated
Domain demoexaadm01vm05.mydomain.demo terminated
Domain demoexaadm01vm09.mydomain.demo terminated
Domain demoexaadm01vm06.mydomain.demo terminated
Domain demoexaadm01vm03.mydomain.demo terminated
Domain demoexaadm01vm08.mydomain.demo terminated
Domain demoexaadm01vm02.mydomain.demo terminated
All domains terminated
[root@demoexadb01 ~]#
[root@demoexadb01 ~]# xm list
Name                                  ID     Mem VCPUs      State    Time(s)
Domain-0                               0    7309     4     r----- 11325897.9
[root@demoexadb01 ~]#
Now make sure to unmount the /EXAVMIMAGES filesystem.
[root@demoexadb01 ~]# umount /EXAVMIMAGES/
[root@demoexadb01 ~]#
It MIGHT be necessary to stop the Xen daemons and the ocfs2 service (one of my nodes needed it, another one didn't). You can do this by running:
service xend stop
service xendomains stop
service ocfs2 stop
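If the umount still complains that the device is busy, a quick look at what is holding the mount point open helps. These checks are my own addition, not part of the Oracle procedure:

# show processes with open files on the /EXAVMIMAGES filesystem (run while it is still mounted)
fuser -vm /EXAVMIMAGES
# alternative: list the open files themselves
lsof /EXAVMIMAGES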
After that the filesystem unmounted cleanly. This is necessary because we will remove the partition in the next step.
[root@demoexadb01 ~]# parted /dev/sda
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: LSI MR9361-8i (scsi)
Disk /dev/sda: 4193GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      32.8kB  537MB   537MB   ext3         primary  boot
 2      537MB   123GB   122GB                primary  lvm
 3      123GB   1797GB  1674GB               primary

(parted) rm 3
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) quit
Information: You may need to update /etc/fstab.

[root@demoexadb01 ~]#
We are aware of the warning. It appears because we are modifying a disk that is still in use and we cannot unmount everything on it; the change has been made, we just cannot see it (yet). Next we need to create the new partition. Here we follow the Oracle documentation:
[root@demoexadb01 ~]# parted /dev/sda
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mkpart primary 123gb 4193gb
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) print
Model: LSI MR9361-8i (scsi)
Disk /dev/sda: 4193GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      32.8kB  537MB   537MB   ext3         primary  boot
 2      537MB   123GB   122GB                primary  lvm
 3      123GB   4193GB  4070GB               primary

(parted) quit
[root@demoexadb01 ~]#
So far so good. We should now be able to mount the partition without errors, although the filesystem won't be bigger yet. Let's try.
[root@demoexadb01 ~]# mount /EXAVMIMAGES
mount.ocfs2: Bad magic number in superblock while opening device /dev/sda3
[root@demoexadb01 ~]#
And here is where the journey begins. Debugging this is not so difficult. For your interest, I already had a service request open, and the engineer suggested simply dropping everything, restoring the partition and attempting a recovery using -r. I'm not a fan of this, because it's a risky operation and it's always better to know why something happened. So join me in the reasoning that leads to the solution, which makes me feel more comfortable than just "restoring things".
First confirm the parted output. Remember this?
(parted) print
Model: LSI MR9361-8i (scsi)
Disk /dev/sda: 4193GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      32.8kB  537MB   537MB   ext3         primary  boot
 2      537MB   123GB   122GB                primary  lvm
 3      123GB   4193GB  4070GB               primary

(parted)
It matches the Oracle documentation exactly. So let's dig a bit deeper and check /proc/partitions.
[root@demoexadb01 ~]# cat /proc/partitions |grep sda
   8        0 4094720000 sda
   8        1     524288 sda1
   8        2  119541760 sda2
   8        3 3974651904 sda3   <<<<---------
[root@demoexadb01 ~]#
That doesn't match the documentation: the size should be 3974653903, ending in a 3, not 3974651904. So the partition does not start at the sector we expect, and it is exactly that starting area that holds the very interesting information: the ocfs2 superblock.
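A quick sanity check of my own (not from the Oracle docs): /proc/partitions reports sizes in 1 KiB blocks, while parted reported 512-byte sectors, so the saved sector numbers can be used to verify the size. For the original partition this works out exactly:

# original sda3: start 240132160s, end 3509759965s (from the parted output saved earlier)
echo $(( (3509759965 - 240132160 + 1) / 2 ))    # prints 1634813903, the original sda3 size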
The first step is to get rid of the wrongly created partition:
[root@demoexadb01 ~]# parted /dev/sda
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mkpart primary 123gb 4193gb
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) print
Model: LSI MR9361-8i (scsi)
Disk /dev/sda: 4193GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      32.8kB  537MB   537MB   ext3         primary  boot
 2      537MB   123GB   122GB                primary  lvm
 3      123GB   4193GB  4070GB               primary

(parted) rm 3
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) quit
[root@demoexadb01 ~]#
Once the partition is gone, we can recreate it the way we want: starting from the sector we recorded earlier. That way we know for sure it starts at the same spot, and it should turn out fine:
[root@demoexadb01 ~]# parted /dev/sda
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: LSI MR9361-8i (scsi)
Disk /dev/sda: 4193GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End    Size   File system  Name     Flags
 1      32.8kB  537MB  537MB  ext3         primary  boot
 2      537MB   123GB  122GB               primary  lvm

(parted) mkpart primary 240132160s 4193gb
Warning: The resulting partition is not properly aligned for best performance.
Ignore/Cancel? Ignore
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) print
Model: LSI MR9361-8i (scsi)
Disk /dev/sda: 4193GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      32.8kB  537MB   537MB   ext3         primary  boot
 2      537MB   123GB   122GB                primary  lvm
 3      123GB   4193GB  4070GB               primary

(parted) quit
[root@demoexadb01 ~]#
This way the partition table looks EXACTLY the same as the output in the documentation, but … it did before as well. The real test is mounting it:
[root@demoexadb01 ~]# mount /EXAVMIMAGES/
[root@demoexadb01 ~]#
That's good! Remember that we still have to reboot the server so that the kernel rereads the partition table, so do that reboot now.
When the system is back online, first verify if all went well:
[root@demoexadb01 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys3   30G   17G   12G  58% /
tmpfs                         7.8G     0  7.8G   0% /dev/shm
/dev/sda1                     480M   59M  396M  13% /boot
/dev/sda3                     1.6T  1.5T  117G  93% /EXAVMIMAGES
none                          3.9G   40K  3.9G   1% /var/lib/xenstored
[root@demoexadb01 ~]#
So far so good. The filesystem mounts, but it's still not expanded. The partition should be bigger now, let's check:
[root@demoexadb01 ~]# cat /proc/partitions |grep sda
   8        0 4094720000 sda
   8        1     524288 sda1
   8        2  119541760 sda2
   8        3 3974653903 sda3
[root@demoexadb01 ~]#
And it matches the documentation. This means the expansion of the partition was executed successfully. Now the filesystem has to be enlarged.
Expand filesystem
/EXAVMIMAGES is an ocfs2 filesystem. We can expand it using tunefs.ocfs2. The command should not give any output.
[root@demoexadb01 ~]# tunefs.ocfs2 -S /dev/sda3
[root@demoexadb01 ~]#
Looks fine. Now check df -h:
[root@demoexadb01 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys3   30G   17G   12G  58% /
tmpfs                         7.8G     0  7.8G   0% /dev/shm
/dev/sda1                     480M   59M  396M  13% /boot
/dev/sda3                     3.8T  1.5T  2.3T  39% /EXAVMIMAGES
none                          3.9G   40K  3.9G   1% /var/lib/xenstored
[root@demoexadb01 ~]#
Yay, this is OK. Thanks to the reboot the user domains are already back online, which is fine. If they weren't started automatically, it's time to start them now. After they are fully booted, you can repeat the actions on the other node.
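In case a domain does not come back by itself, starting it manually from its configuration file is straightforward. A sketch, assuming the usual /EXAVMIMAGES/GuestImages layout and using one of the domain names from above as an example:

# start a single user domain from its vm.cfg (typical location, verify on your own system)
xm create /EXAVMIMAGES/GuestImages/demoexaadm01db01vm01.mydomain.demo/vm.cfg
# confirm it is running
xm list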
Gotcha 2
This one worries me a bit, to be honest. This customer has a number of Exadatas. They stepped in at X2-2 and are currently in the X6-2 range. They also try to keep up with patch levels and upgrade regularly, which is, in my opinion, a good thing. Recently some of the racks were extended to an elastic configuration. I discovered that with the 12.2 image on the compute nodes there is something odd. By default, a virtualised Exadata dom0 filesystem layout looks like this:
[root@demoexa01db01 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys3   30G   17G   12G  58% /
tmpfs                         7.8G  4.0K  7.8G   1% /dev/shm
none                          3.9G  840K  3.9G   1% /var/lib/xenstored
/dev/sda1                     480M   59M  396M  13% /boot
/dev/sda3                     1.6T  1.5T  117G  93% /EXAVMIMAGES
[root@demoexa01db01 ~]#
On a newly imaged node (or one newly deployed with the Oracle-provided image that comes from the factory), however, it looks like this:
[root@demoexa02db01 ~]# df -h /EXAVMIMAGES
Filesystem                           Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbExaVMImages  1.6T   86G  1.5T   6% /EXAVMIMAGES
[root@demoexa02db01 ~]#
When I highlighted this to support, I got an amusing conversation. Apparently the disk layout is not (yet?) touched by the patching/upgrading of these systems. So I asked them: when you want to expand a compute node that is already on LVM, which naming should you use, and what are the standards? This was not covered in the latest EIS DVD (November 2017), nor in the Oracle documentation (December 2017). The answer I got was:
“From a patching perspective we don’t care about the pv names as the work is done on a much higher level. For pv names, we recommend you use the same approach as for the existing disks.”
Keep in mind that by default the PV used for VGExaDb is /dev/sda2. So this story is to be continued. If you read this and know the answer, please let me know.
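For what it is worth, on the LVM-based layout I would expect the expansion to look roughly like the sketch below. This is purely my own assumption, not taken from the Oracle documentation or from support, so verify it before relying on it:

# assumption: the extra space is exposed as a new partition, here called /dev/sda3
pvcreate /dev/sda3                                     # initialise the new partition as a physical volume
vgextend VGExaDb /dev/sda3                             # add it to the existing volume group
lvextend -l +100%FREE /dev/VGExaDb/LVDbExaVMImages     # grow the logical volume into the new space
tunefs.ocfs2 -S /dev/VGExaDb/LVDbExaVMImages           # grow the ocfs2 filesystem to the new LV size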
Lessons learned
- Copy aside more information from the system than Oracle tells you to, and do not forget to use common sense.
- Use sectors instead of other units when recreating partitions, or even better, use LVM.
- In case of doubt, open a service request to verify things so you can be sure before continuing.
I've included the full output from the commands, mainly for my own reference, but in case you end up in trouble, at least you know where the default partitions in a virtualised Exadata start and end.
As always, questions or remarks? Find me on Twitter: @vanpupi