In this blog post I am going to document the steps I took to install a CEPH storage cluster. A CEPH storage cluster installed this way can later be used in an OpenStack/OpenShift installation as storage for virtual machines/pods, or deployed with any other solution that needs object and/or block storage. CEPH FS (File System) exists, but it will not be discussed in this blog post.
CEPH is an open source storage solution which has become very popular due to its flexibility and the feature set it offers. The Ceph project was started by Sage Weil back in 2007, or so (more at: ceph wiki page). The current version of CEPH is Hammer (v0.94), and this version of ceph will be used in this blog post.
As the operating system for the CEPH cluster I am going to use Fedora 23, for the reasons below:
- it has a good set of features and many available packages. I guess the same process described here can be used with Debian, with small corrections to package/file names
- it is close to Red Hat Enterprise Linux (I know I could use CentOS 7, but I have Fedora 23 machines handy), and the information you get here can easily be transferred to RHEL (after first reading the Red Hat ceph documentation)
Note: In order to apply the notes from here to RHEL, you will need to work with Red Hat Sales / Support to get access to the proper Red Hat software entitlements which contain the CEPH packages
- it is free, and you do not need subscriptions to get the software and start working with CEPH
I am going to use a Fedora 23 KVM environment for this POC, for the reasons below:
- it is the most convenient and cheapest option: I do not need to invest in hardware
- I have access to it
Using KVM as the base for CEPH nodes is not supported in production, so be aware of this if you decide to use RHEL instead of Fedora and want support from the Red Hat global support services team for a CEPH cluster running on top of RHEL. Here are, in my opinion, some interesting links regarding CEPH cluster sizing and planning
The first step is to install Fedora 23. You can use this centos kickstart for this purpose, just adapt it to point to the proper Fedora repositories, or you can install the machines manually. ISO installation images are available from Fedora 23 server iso images
Once the system is installed, I recommend updating it to the latest packages
# dnf -y update
For a CEPH storage cluster we need at least 3 CEPH monitor (mon service) machines to preserve HA. From the excellent CEPH book Learning CEPH (I got it for free, but buying it, which I strongly advise, is worth the money) we can read: "A Ceph storage cluster requires at least one monitor to run. For high availability, a Ceph storage cluster relies on an odd number of monitors that’s more than one, for example, 3 or 5, to form a quorum." For this initial POC I find 3 monitors to be fine, as it gives an HA solution and serves the POC purpose. Later I am going to add more monitors.
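To make the "odd number of monitors" advice more tangible, here is a small sketch of the majority arithmetic behind a quorum. It is only my illustration of the general majority rule; the function names are made up for this post and are not part of any ceph tool.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitors < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority can still be formed."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Print monitors, quorum size and tolerated failures for 1..5 monitors.
    for n in range(1, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

With this arithmetic, 3 monitors tolerate one failure and 4 monitors still tolerate only one, which is why going from 3 to 4 buys nothing and 5 is the next useful step.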
In this test environment, instead of installing a machine with the same configuration three times, I installed it once and then cloned it with KVM tools to the desired number, so I had identical KVM guests / machines for the CEPH nodes.
After machine installation, we need to ensure the below on the ceph nodes before doing any further steps
- ensure all ceph nodes can properly resolve all other nodes, either by configuring a DNS server or via /etc/hosts. In my case I already had a DNS server in place, I added my ceph cluster nodes to the DNS configuration, and it worked fine
Important: if ceph nodes are not able to properly resolve the other nodes, the monitor hostnames / IPs written to ceph.conf later will be wrong and the following steps will run into problems
- ensure that ceph nodes have access to the internet
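A quick way to verify name resolution before going further is a few lines of Python run on each node. This is a minimal sketch using only the standard library; the function name check_resolution is my own, and the node names are the ones used in this post.

```python
import socket


def check_resolution(hosts, resolve=socket.gethostbyname):
    """Return {hostname: ip or None}; None marks a name that did not resolve."""
    results = {}
    for host in hosts:
        try:
            results[host] = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


# Example call (run on every ceph node):
#   check_resolution(["cephf23-node1", "cephf23-node2", "cephf23-node3"])
```

Any entry that comes back as None needs fixing in DNS or /etc/hosts before continuing. The resolve parameter exists only so the function can be tried out without a real DNS server.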
With physical hardware, it is expected / necessary to have a separate disk for CEPH OSDs. In my test case, as I am using KVM guests, I created a virtual disk for every machine using the commands below. I decided to use a 15 GB disk size, as this is just a test POC which can be turned into a real POC
# qemu-img create -f qcow2 cephf23-node1disk.qcow2 15G
# qemu-img create -f qcow2 cephf23-node2disk.qcow2 15G
# qemu-img create -f qcow2 cephf23-node3disk.qcow2 15G
After this, in order for the KVM guests to see / use these disks, it is necessary to attach them to the machines. I edited the /etc/libvirt/qemu/kvm_guest_file.xml files of the KVM guests and added a definition for the new block device there. There is already a disk definition in the KVM machine .xml file, so it is easy to add a new disk: just follow the same syntax, pointing to the desired disk and adapting the PCI numbers. If there is a mistake in the configuration, it will be reported during the virsh define step below
After this, it is necessary to (re)define the machine and restart it
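For reference, the disk definition for a qcow2 file on the virtio bus can be generated with a few lines of Python. This is only a sketch of what such an element typically looks like: the image path is hypothetical (use wherever qemu-img created your file), and I leave out the address element on the assumption that libvirt picks a free PCI slot itself when the domain is defined.

```python
import xml.etree.ElementTree as ET


def disk_element(image_path: str, target_dev: str = "vdb") -> str:
    """Build a libvirt <disk> element for a qcow2 image on the virtio bus."""
    disk = ET.Element("disk", {"type": "file", "device": "disk"})
    ET.SubElement(disk, "driver", {"name": "qemu", "type": "qcow2"})
    ET.SubElement(disk, "source", {"file": image_path})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    # Assumption: no <address> child is emitted, so libvirt is expected to
    # assign the PCI address on its own.
    return ET.tostring(disk, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical image location, adapt it to your environment.
    print(disk_element("/var/lib/libvirt/images/cephf23-node1disk.qcow2"))
```

The printed element goes inside the devices section of the guest .xml file, next to the existing disk definition.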
# virsh destroy kvm_machine
# virsh define /etc/libvirt/qemu/kvm_guest_file.xml
# virsh start kvm_machine
where kvm_machine is the name of your KVM domain/machine. Once the KVM guest is up, the new disk will be visible as /dev/vdb. It is necessary to repeat the above process for all guests.
Another way to add storage to virtual guests is described in the documentation Adding storage devices to guests, using for example
# virsh attach-disk Guest1 /var/lib/libvirt/images/FileName.img vdb --cache none
which is supposed to work too. The part related to disks is KVM specific; with physical hardware it is not necessary.
Further, it is necessary to have passwordless login between CEPH nodes, so ensure this is working (ssh-keygen, ssh-copy-id …)
Let’s now proceed and install packages
# dnf install -y ceph-deploy
More about ceph-deploy can be found in the ceph-deploy ceph documentation. In short, ceph-deploy is a tool which makes installing a ceph cluster easier than doing it by hand.
Assuming ceph cluster nodes can properly resolve the other nodes and passwordless access works fine, issuing the commands below
# mkdir /etc/ceph
# ceph-deploy new cephf23-node1 cephf23-node2 cephf23-node3
will write a ceph.conf file with some basic parameters. In this case:
[global]
fsid = b71a3eb1-e253-410a-bf11-84ae01bad654
mon_initial_members = cephf23-node1, cephf23-node2, cephf23-node3
mon_host = 192.168.122.101,192.168.122.102,192.168.122.103
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
I additionally added to ceph.conf
cluster_network = 192.168.122.0/24
to define the cluster network, the network for cluster traffic. In this case it is the same as the public network, as I have only one network card. After running ceph-deploy new as shown above, pay particular attention to mon_initial_members and mon_host in the created ceph.conf: they MUST contain the correct hostnames / IPs for these machines. If that is not the case, it is necessary to go back and check the DNS settings.
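Checking mon_initial_members and mon_host by eye is easy with three nodes, but a small script does not hurt. Below is a minimal sketch, assuming the file is readable by Python's configparser (true for the simple [global] section shown above); check_monitors is a name I made up, and the checks are only the obvious ones: same number of names and addresses, every address is a valid IP, and no duplicates.

```python
import configparser
import ipaddress

SAMPLE = """\
[global]
fsid = b71a3eb1-e253-410a-bf11-84ae01bad654
mon_initial_members = cephf23-node1, cephf23-node2, cephf23-node3
mon_host = 192.168.122.101,192.168.122.102,192.168.122.103
"""


def check_monitors(conf_text: str) -> list:
    """Return a list of problems found in the monitor settings of [global]."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["global"]
    names = [n.strip() for n in section["mon_initial_members"].split(",")]
    hosts = [h.strip() for h in section["mon_host"].split(",")]
    problems = []
    if len(names) != len(hosts):
        problems.append("mon_initial_members and mon_host differ in length")
    if len(set(hosts)) != len(hosts):
        problems.append("mon_host contains duplicate addresses")
    for host in hosts:
        try:
            ipaddress.ip_address(host)
        except ValueError:
            problems.append(f"{host} is not a valid IP address")
    return problems


if __name__ == "__main__":
    print(check_monitors(SAMPLE) or "monitor settings look consistent")
```

To check the real file, read its content (for example with open("ceph.conf").read()) and pass it in instead of SAMPLE.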
Now we are ready to install ceph packages
# ceph-deploy install cephf23-node1 cephf23-node2 cephf23-node3 --no-adjust-repos
The last command will install all ceph components on the specified nodes. If --no-adjust-repos is not specified, ceph-deploy will go to the ceph repositories and try to get packages from there, which is fine if the packages exist. At the time of writing, the ceph repositories have packages for Fedora 20 but not for Fedora 23, so ceph-deploy will fail.
Specifying --no-adjust-repos instructs ceph-deploy not to adjust repositories and to get packages from the OS repos, in this case Fedora 23.
The install step will take some time, depending on the network speed for downloading ceph packages. Once it finishes, it is necessary to create the monitors
# ceph-deploy --overwrite-conf mon create-initial
It is important to specify --overwrite-conf, as ceph.conf was edited, and --overwrite-conf ensures ceph-deploy does not fail.
Now ceph mon dump shows the status of the monitors
# ceph mon dump
dumped monmap epoch 1
epoch 1
fsid b71a3eb1-e253-410a-bf11-84ae01bad654
last_changed 0.000000
created 0.000000
0: 192.168.122.101:6789/0 mon.cephf23-node1
1: 192.168.122.102:6789/0 mon.cephf23-node2
2: 192.168.122.103:6789/0 mon.cephf23-node3
and after this ceph status shows
# ceph status
cluster b71a3eb1-e253-410a-bf11-84ae01bad654
health HEALTH_ERR
64 pgs stuck inactive
64 pgs stuck unclean
no osds
monmap e1: 3 mons at {cephf23-node1=192.168.122.101:6789/0,cephf23-node2=192.168.122.102:6789/0,cephf23-node3=192.168.122.103:6789/0}
election epoch 30, quorum 0,1,2 cephf23-node1,cephf23-node2,cephf23-node3
osdmap e1: 0 osds: 0 up, 0 in
pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
64 creating
OK, ceph has some health issues. If ceph health is checked, we see
# ceph health
HEALTH_ERR 64 pgs stuck inactive; 64 pgs stuck unclean; no osds
The most important part of the ceph health output is no osds. An OSD (Object Storage Device) is the building block of a CEPH cluster, and it is necessary to create them.
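If you want to pick the individual complaints out of the ceph health line in a script, splitting on the semicolons is enough for the output shown above. A minimal sketch, with parse_health being my own helper name and the format assumed to be exactly the one-line form printed here:

```python
def parse_health(line: str):
    """Split a `ceph health` line into (status, [issues])."""
    status, _, detail = line.strip().partition(" ")
    issues = [part.strip() for part in detail.split(";") if part.strip()]
    return status, issues


if __name__ == "__main__":
    status, issues = parse_health(
        "HEALTH_ERR 64 pgs stuck inactive; 64 pgs stuck unclean; no osds"
    )
    print(status)
    for issue in issues:
        print(" -", issue)
```

For the line above this gives the status HEALTH_ERR and three issues, the last of which is no osds.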
OSD creation is simple, however there are important points to take care of prior to OSD creation
- identify the block device for the OSD
- ensure it is really the device planned for the OSD: OSD preparations / actions are destructive and will cause trouble if pointed at the wrong block device
Once block devices are identified, run ceph-deploy disk zap:
# ceph-deploy disk zap cephf23-node1:vdb
# ceph-deploy disk zap cephf23-node2:vdb
# ceph-deploy disk zap cephf23-node3:vdb
ceph-deploy disk zap wipes the partition table and contents of the disk so that it can be prepared as an OSD.
After this step, prepare the OSDs
# ceph-deploy osd prepare cephf23-node1:vdb
# ceph-deploy osd prepare cephf23-node2:vdb
# ceph-deploy osd prepare cephf23-node3:vdb
I gave ceph-deploy the whole /dev/vdb, leaving it up to ceph-deploy to divide it into a data and a journal part. E.g., if we run
# blkid | grep vdb
/dev/vdb1: UUID="cdf2f55e-67ee-4077-808e-fcfa94f531ae" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="6cb675e2-ed7c-4757-861e-8090b4c3dda3"
/dev/vdb2: PARTLABEL="ceph journal" PARTUUID="b75b2e61-5664-4730-af52-a43a1ee99845"
From this it is visible that /dev/vdb1 is marked as ceph data and /dev/vdb2 as ceph journal. For better performance it is recommended to use a separate device for the ceph journal (even an SSD disk).
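The same data/journal mapping can be pulled out of the blkid output with a short script, which is handy when checking several nodes. This is a sketch that only handles the KEY="value" format shown above; partitions_by_label is a name I picked.

```python
import re

BLKID_OUTPUT = '''\
/dev/vdb1: UUID="cdf2f55e-67ee-4077-808e-fcfa94f531ae" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="6cb675e2-ed7c-4757-861e-8090b4c3dda3"
/dev/vdb2: PARTLABEL="ceph journal" PARTUUID="b75b2e61-5664-4730-af52-a43a1ee99845"
'''


def partitions_by_label(blkid_text: str) -> dict:
    """Map each PARTLABEL found in blkid output to its device path."""
    labels = {}
    for line in blkid_text.splitlines():
        device, sep, attrs = line.partition(": ")
        if not sep:
            continue  # not a "device: attributes" line
        fields = dict(re.findall(r'(\w+)="([^"]*)"', attrs))
        if "PARTLABEL" in fields:
            labels[fields["PARTLABEL"]] = device
    return labels


if __name__ == "__main__":
    for label, device in partitions_by_label(BLKID_OUTPUT).items():
        print(label, "->", device)
```

If two partitions carried the same label, the later one would win in this sketch, which is fine for one OSD disk per node as used here.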
In the above example, if there was a separate device for the journal, the ceph-deploy osd prepare step would be
# ceph-deploy osd prepare {node}:/dev/vdb:/dev/vdX
where vdX would be the separate device for the journal
Knowing that /dev/vdb now holds a data partition (/dev/vdb1) and a journal partition (/dev/vdb2), the commands to activate the OSDs are
# ceph-deploy osd activate cephf23-node1:/dev/vdb1:/dev/vdb2
# ceph-deploy osd activate cephf23-node2:/dev/vdb1:/dev/vdb2
# ceph-deploy osd activate cephf23-node3:/dev/vdb1:/dev/vdb2
If we now execute ceph osd tree, it will show the OSDs and their placement across CEPH nodes.
# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.02998 root default
-2 0.00999 host cephf23-node3
0 0.00999 osd.0 up 1.00000 1.00000
-3 0.00999 host cephf23-node2
1 0.00999 osd.1 up 1.00000 1.00000
-4 0.00999 host cephf23-node1
2 0.00999 osd.2 up 1.00000 1.00000
ceph osd dump will dump the osd parameters
# ceph osd dump
epoch 54
fsid b71a3eb1-e253-410a-bf11-84ae01bad654
created 2015-12-27 18:07:12.158247
modified 2015-12-31 16:05:27.526052
flags
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 3
osd.0 up in weight 1 up_from 51 up_thru 51 down_at 50 last_clean_interval [34,43) 192.168.122.103:6800/1081 192.168.122.103:6801/1081 192.168.122.103:6802/1081 192.168.122.103:6803/1081 exists,up bf703535-de09-4bde-b81e-805a9c85169b
osd.1 up in weight 1 up_from 48 up_thru 51 down_at 44 last_clean_interval [35,43) 192.168.122.102:6800/1090 192.168.122.102:6801/1090 192.168.122.102:6802/1090 192.168.122.102:6803/1090 exists,up 60d237f0-38c0-4dfd-9672-ef4076c57a7f
osd.2 up in weight 1 up_from 46 up_thru 51 down_at 45 last_clean_interval [42,43) 192.168.122.101:6800/1228 192.168.122.101:6801/1228 192.168.122.101:6802/1228 192.168.122.101:6803/1228 exists,up 6cb675e2-ed7c-4757-861e-8090b4c3dda3
And in ps aux | grep ceph-osd we can see there is an osd process running. This is for node1, but the same applies to the other nodes. One OSD process runs per OSD block device, which means that if we had more OSDs on a CEPH node, we would see more osd.X.pid processes in the ps aux output
# ps aux | grep ceph-osd
root 1226 0.0 0.1 119632 3180 ? Ss 10:23 0:00 /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root 1228 0.2 2.2 713224 46884 ? Sl 10:23 0:23 /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph -f
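To list which OSD ids run on a node without reading the ps output by eye, the -i argument of ceph-osd can be extracted with a regular expression. A minimal sketch, assuming the command line looks like the one printed above; osd_ids is my own helper name.

```python
import re


def osd_ids(ps_text: str) -> list:
    """Return the sorted, de-duplicated OSD ids found in `ps aux` output."""
    return sorted({int(m) for m in re.findall(r"ceph-osd -i (\d+)", ps_text)})


if __name__ == "__main__":
    sample = (
        "root 1226 ... /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 2 "
        "--pid-file /var/run/ceph/osd.2.pid\n"
        "root 1228 ... /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid\n"
    )
    print(osd_ids(sample))
```

The bash wrapper and the real ceph-osd process both mention the same id, which is why the ids are collected in a set first.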
If ceph status is run now, it will show the status of the ceph cluster
# ceph status
cluster b71a3eb1-e253-410a-bf11-84ae01bad654
health HEALTH_OK
monmap e1: 3 mons at {cephf23-node1=192.168.122.101:6789/0,cephf23-node2=192.168.122.102:6789/0,cephf23-node3=192.168.122.103:6789/0}
election epoch 32, quorum 0,1,2 cephf23-node1,cephf23-node2,cephf23-node3
osdmap e13: 3 osds: 3 up, 3 in
pgmap v54: 64 pgs, 1 pools, 0 bytes data, 0 objects
100 MB used, 30586 MB / 30686 MB avail
64 active+clean
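After activating OSDs it can take a moment until all placement groups are active+clean. If you script the installation, a small polling loop around ceph health saves some manual re-running. This is a sketch under my own assumptions: the helper names, the number of attempts and the delay are arbitrary, and it simply checks whether the output starts with HEALTH_OK.

```python
import subprocess
import time


def ceph_health() -> str:
    """Run `ceph health` on this node and return its output line."""
    result = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def wait_for_health_ok(get_health=ceph_health, attempts=30, delay=10.0) -> bool:
    """Poll until the cluster reports HEALTH_OK or the attempts run out."""
    for attempt in range(attempts):
        if get_health().startswith("HEALTH_OK"):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before asking again
    return False
```

Calling wait_for_health_ok() on a monitor node returns True as soon as the cluster is healthy and False if it never gets there within attempts * delay seconds. The get_health parameter is there so the loop can be exercised without a running cluster.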
Now we see the status of the ceph storage cluster, and one might ask why only 30 GB of space is available when three 15 GB OSD devices (45 GB in total) are built into the ceph cluster. Now we have
# fdisk -l /dev/vdb
Disk /dev/vdb: 15 GiB, 16106127360 bytes, 31457280 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 893E7CCC-C92A-47DD-9DAE-FC62DFA91314
Device Start End Sectors Size Type
/dev/vdb1 10487808 31457246 20969439 10G Ceph OSD
/dev/vdb2 2048 10485760 10483713 5G Ceph Journal
Partition table entries are not in disk order.
The reason for this is the osd journal size, which was left at its default of 5120 MB (more about OSD config parameters: osd config reference). This means that, if not specified otherwise, 5 GB of each OSD device is used for the ceph journal, leaving 10 GB per OSD for data. This is not an issue: if we specify a separate device for the journal, the whole OSD device is used for data.
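The numbers are easy to re-check. Below is a small sketch of the arithmetic with my own function names; it ignores filesystem overhead, and the second function assumes the replicated size of 3 reported for the rbd pool in the ceph osd dump output above.

```python
def raw_capacity_gib(osds: int, disk_gib: float, journal_gib: float = 5.0) -> float:
    """Space left for data when each OSD disk also hosts its own journal."""
    return osds * (disk_gib - journal_gib)


def usable_capacity_gib(raw_gib: float, replicas: int) -> float:
    """Rough amount of user data that fits when every object is stored `replicas` times."""
    return raw_gib / replicas


if __name__ == "__main__":
    raw = raw_capacity_gib(osds=3, disk_gib=15)  # 3 * (15 - 5) = 30
    print(raw, usable_capacity_gib(raw, replicas=3))
```

So the roughly 30 GB shown by ceph status is raw space across all OSDs, and with three replicas only about a third of it is available for actual data.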
From this point it is possible to create ceph pool(s) and use them.
In order to create a CEPH pool we can do the below
# ceph osd pool create poolA 128 128
this will create poolA with the desired number of placement groups (pg), which is visible in the output below
# ceph osd lspools
0 rbd,3 poolA
# rados lspools
rbd
poolA
The general command for creating a CEPH pool is
# ceph osd pool create {pool-name} {pg_number} {pgp_number}
where pg_number (the number of placement groups) and pgp_number (the number of placement groups for placement) have to be equal, otherwise CEPH will not start re-balancing
How many PGs do we need in a specific CEPH cluster? This number depends on the number of OSDs and the number of replicas, but the formula below can help
             (OSDs * 100)
Total PGs = --------------
               Replicas
which is stated on the CEPH web page How Data Is Stored In CEPH Cluster
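As a worked example of the formula, here is a small helper. The rounding up to the next power of two is my addition: as far as I know the Ceph placement group documentation recommends it, but treat that as an assumption and check the documentation for your version. The function name is made up for this post.

```python
def suggested_pg_count(osds: int, replicas: int) -> int:
    """(OSDs * 100) / replicas, rounded up to the next power of two."""
    if osds < 1 or replicas < 1:
        raise ValueError("osds and replicas must be positive")
    raw = (osds * 100) / replicas
    power = 1
    while power < raw:
        power *= 2
    return power


if __name__ == "__main__":
    # This cluster: 3 OSDs, 3 replicas -> 100 -> rounded up to 128
    print(suggested_pg_count(3, 3))
```

For this cluster the formula gives 100, and the next power of two is 128, which is the value used for poolA above.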
Also do not forget to enable the ceph service to ensure it starts during boot
# systemctl enable ceph
ceph.service is not a native service, redirecting to systemd-sysv-install
Executing /usr/lib/systemd/systemd-sysv-install enable ceph
This is a basic CEPH storage cluster setup. It can be turned into a useful POC if physical hardware (certified and supported HW) is used instead of KVM, and if it is adapted to other specific needs. For a production case it would also be necessary to tweak the points below
- In the above configuration I decided to run monitors and OSDs on the same machines. This works fine, however it consumes more memory / cpu on the machine, and for production environments it is recommended to run monitors and OSDs on separate machines
- separate networks: cluster and public networks
- configure the firewall: I disabled firewalld in this configuration. To open the ports which are necessary for CEPH, all that is necessary is to follow the ceph network reference manual and apply the recommendations from there
- CRUSH: what it is and how to configure it; read crush and crush paper
What if something is broken? It is possible to do the below
If you suspect that the issue might be some kind of PEBKAC, it is possible to start from zero and try to configure it again. ceph-deploy offers simple tools to remove ceph packages and reset ceph nodes to the state before ceph package installation
For example, the commands below will return the system to its pre-installation state: they will remove all ceph packages, and it is then possible to retry the installation using the steps from the beginning of this post
# ceph-deploy uninstall cephf23-node1 cephf23-node2 cephf23-node3
# ceph-deploy purgedata cephf23-node1 cephf23-node2 cephf23-node3
# ceph-deploy purge cephf23-node1 cephf23-node2 cephf23-node3
# ceph-deploy forgetkeys
and all is ready to start again. Doing it again will be a good learning experience anyway.
Happy CEPH hacking!