etcd error message “etcd failed to send out heartbeat on time”

etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines. etcd is very sensitive to network delays, and not only to network delays: any kind of sluggishness on the etcd cluster nodes can lead to problems with the functionality of the whole Kubernetes cluster.

By the time an OpenShift/Kubernetes cluster starts reporting error messages like the ones below, the cluster will already be misbehaving: pod scheduling and deletion will not work as expected and the problems will be more than visible.

Sep 27 00:04:01 dhcp7-237 etcd: failed to send out heartbeat on time (deadline exceeded for 1.766957688s)
Sep 27 00:04:01 dhcp7-237 etcd: server is likely overloaded
Sep 27 00:04:01 dhcp7-237 etcd: failed to send out heartbeat on time (deadline exceeded for 1.766976918s)
Sep 27 00:04:01 dhcp7-237 etcd: server is likely overloaded

systemctl status etcd output

 systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2016-10-01 09:18:37 EDT; 5h 20min ago
 Main PID: 11970 (etcd)
   Memory: 1.0G
   CGroup: /system.slice/etcd.service
           └─11970 /usr/bin/etcd --name=dhcp6-138.example.net --data-dir=/var/lib/etcd/ --listen-client-urls=https://172.16.6.138:2379

Oct 01 14:38:55 dhcp6-138.example.net etcd[11970]: server is likely overloaded
Oct 01 14:38:56 dhcp6-138.example.net etcd[11970]: failed to send out heartbeat on time (deadline exceeded for 377.70994ms)
Oct 01 14:38:56 dhcp6-138.example.net etcd[11970]: server is likely overloaded
Oct 01 14:38:56 dhcp6-138.example.net etcd[11970]: failed to send out heartbeat on time (deadline exceeded for 377.933298ms)
Oct 01 14:38:56 dhcp6-138.example.net etcd[11970]: server is likely overloaded
Oct 01 14:38:58 dhcp6-138.example.net etcd[11970]: failed to send out heartbeat on time (deadline exceeded for 1.226630142s)
Oct 01 14:38:58 dhcp6-138.example.net etcd[11970]: server is likely overloaded
Oct 01 14:38:58 dhcp6-138.example.net etcd[11970]: failed to send out heartbeat on time (deadline exceeded for 1.226803192s)
Oct 01 14:38:58 dhcp6-138.example.net etcd[11970]: server is likely overloaded
Oct 01 14:39:07 dhcp6-138.example.net etcd[11970]: the clock difference against peer f801f8148b694198 is too high [1.078081179s > 1s]

systemctl status etcd -l will also show similar messages; check those too.

The etcd configuration file is located at /etc/etcd/etcd.conf and has content similar to the one below. This example is from RHEL; other operating systems may have it slightly different.

ETCD_NAME=dhcp7-237.example.net
ETCD_LISTEN_PEER_URLS=https://172.16.7.237:2380
ETCD_DATA_DIR=/var/lib/etcd/
ETCD_HEARTBEAT_INTERVAL=6000
ETCD_ELECTION_TIMEOUT=30000
ETCD_LISTEN_CLIENT_URLS=https://172.16.7.237:2379

ETCD_INITIAL_ADVERTISE_PEER_URLS=https://172.16.7.237:2380
ETCD_INITIAL_CLUSTER=dhcp7-241.example.net=https://172.16.7.241:2380,dhcp7-237.example.net=https://172.16.7.237:2380,dhcp7-239.example.net=https://172.16.7.239:2380
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
ETCD_ADVERTISE_CLIENT_URLS=https://172.16.7.237:2379


ETCD_CA_FILE=/etc/etcd/ca.crt
ETCD_CERT_FILE=/etc/etcd/server.crt
ETCD_KEY_FILE=/etc/etcd/server.key
ETCD_PEER_CA_FILE=/etc/etcd/ca.crt
ETCD_PEER_CERT_FILE=/etc/etcd/peer.crt
ETCD_PEER_KEY_FILE=/etc/etcd/peer.key

The parameters we want to change in the above configuration file are ETCD_HEARTBEAT_INTERVAL and ETCD_ELECTION_TIMEOUT. There is no single value that fits every environment; it is necessary to play with different values and find out what works best. For most cases the defaults (500/2500) will be fine.

After changing /etc/etcd/etcd.conf do not forget to restart the etcd service

# systemctl restart etcd
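
A quick way to check whether the heartbeat warnings are gone after the restart is to watch the etcd journal for a while, for example

# journalctl -u etcd -f | grep -iE 'heartbeat|overloaded'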

The issues below, affecting etcd nodes, can lead to the problem described in this post:

  • network latency
  • storage latency
  • combination of network latency and storage latency

If network latency is low, then check the storage used by the Kubernetes/OpenShift etcd servers. The changes described in this post are a workaround: they mitigate the issue once the root cause is known and no other option is available. The first and better solution is to fix the problem at its root, in the problematic subsystem(s).
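
If storage is the suspect, one rough way to get a feeling for the write/sync latency of the device holding the etcd data directory is a small fio job. fio must be installed, the job should be pointed at a scratch directory on the same device rather than at live etcd data, and the parameters below are only an illustration of this kind of test

# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/path/on/etcd/device --size=22m --bs=2300 --name=etcd-disk-check

The fsync/fdatasync latency percentiles reported by fio can then be compared against the heartbeat interval configured above.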

In my particular case the storage subsystem was slow, and it was not possible to change that without spending a bunch of $$$.

References : etcd documentation

#etcd, #k8s, #kubernetes, #linux, #openshift, #redhat, #storage

AWS EBS as persistent storage for Openshift

Openshift supports different persistent storage backends: Ceph, Gluster, Amazon EBS, just to name some of them. In the using ceph as persistent storage for openshift pods blog post I described how to hook an Openshift installation up to a CEPH cluster as persistent storage for Openshift pods. Here I will describe the process of configuring Openshift to use AWS EBS based storage as persistent volumes for Openshift pods.

It is assumed that the starting point is a working Openshift environment. In order to use Amazon EBS storage with Openshift, it is necessary to make some small changes in the Openshift configuration

  • The Openshift master and all Openshift nodes need to be configured to support AWS EBS. The Openshift documentation gives a nice overview of the necessary changes
  • After configuring the Openshift master/nodes to be able to reach and allocate EBS, the next steps are straightforward.
    Examples of PersistentVolume, PersistentVolumeClaim, and pod json files are shown below

  • PersistentVolume
  • PersistentVolumeClaim
  • pod file
  • These .json files need to be adapted with the proper PV name, size, PVC name, pod name, etc. so that they correspond to the environment where they are used ( a minimal sketch of them follows this list ). After adapting them, create the PersistentVolume and PersistentVolumeClaim
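
    A minimal sketch, assuming the 1 GB EBS volume seen later in this post ( the EBS volume ID, container image, and all names are illustrative placeholders and must be adapted ):

    pv.json

    {
      "apiVersion": "v1",
      "kind": "PersistentVolume",
      "metadata": { "name": "pvolume1" },
      "spec": {
        "capacity": { "storage": "1Gi" },
        "accessModes": [ "ReadWriteOnce" ],
        "awsElasticBlockStore": {
          "volumeID": "vol-3f7210c9",
          "fsType": "ext4"
        }
      }
    }

    pvc.json

    {
      "apiVersion": "v1",
      "kind": "PersistentVolumeClaim",
      "metadata": { "name": "pvclaim1" },
      "spec": {
        "accessModes": [ "ReadWriteOnce" ],
        "resources": { "requests": { "storage": "1Gi" } }
      }
    }

    pod.json

    {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": { "name": "pod1" },
      "spec": {
        "containers": [
          {
            "name": "pod1",
            "image": "fedora",
            "command": [ "sleep", "infinity" ],
            "volumeMounts": [
              { "name": "pvolume1", "mountPath": "/mnt/persistentvolume" }
            ]
          }
        ],
        "volumes": [
          {
            "name": "pvolume1",
            "persistentVolumeClaim": { "claimName": "pvclaim1" }
          }
        ]
      }
    }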

    # oc create -f pv.json 
    # oc create -f pvc.json 
    
     # oc get pv
    NAME       LABELS    CAPACITY     ACCESSMODES     STATUS    CLAIM               REASON      AGE
    pvolume1               1          RWO             Bound     default/pvclaim1                3h
    # oc get pvc 
    NAME       LABELS     STATUS    VOLUME      CAPACITY      ACCESSMODES     AGE
    pvclaim1              Bound     pvolume1    1             RWO             3h
    

    Now we can create a pod which will use the PersistentVolume and PersistentVolumeClaim created above

    # oc create -f pod.json
    # oc get pods | grep pod1
    pod1                      1/1       Running   0          30s
    

    Now, on the node where the pod runs, we will see

    # fdisk -l | grep xvd
    WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
    Disk /dev/xvda: 10.7 GB, 10737418240 bytes, 20971520 sectors
    Disk /dev/xvdb: 10.7 GB, 10737418240 bytes, 20971520 sectors
    /dev/xvdb1            2048    20971519    10484736   8e  Linux LVM
    Disk /dev/xvdf: 1073 MB, 1073741824 bytes, 2097152 sectors ---- this is EBS device used inside pod 
    

    /dev/xvdf is the device mapped to the node; I know this because it is the only 1 GB device on my system. The device is also visible in the mount output

    # mount | grep xvdf
    /dev/xvdf on /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/aws-ebs/mounts/vol-3f7210c9 type ext4 (rw,relatime,seclabel,data=ordered)
    /dev/xvdf on /var/lib/origin/openshift.local.volumes/pods/3a467bfd-eeb7-11e5-9e13-02982d3b6367/volumes/kubernetes.io~aws-ebs/pvolume1 type ext4 (rw,relatime,seclabel,data=ordered)
    

    It is mounted at a mount point under /var/lib/origin/openshift.local.volumes …

    If we take a closer look, we can see the EBS volume id ( vol-3f7210c9 ) and the plugin used ( aws-ebs ) in the mount path. This can be useful when it is necessary to sort / search devices allocated on the same Openshift node that come from two different storage backends ( eg, EBS and CEPH ).
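
    For example, to quickly pick out only the EBS-backed mounts on a node, one can filter on that plugin path ( just an illustration )

    # mount | grep aws-ebs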
    Inside the pod we see

     # oc exec  pod1 -- mount  | grep xvdf
    /dev/xvdf on /mnt/persistentvolume type ext4 (rw,relatime,seclabel,data=ordered)
    

    For every pod using EBS as persistent storage, a pair of lines similar to the ones above will be visible on the Openshift node.

    #amazon-ebs, #linux, #openshift, #persistantvolume, #persistantvolumeclaim, #pod, #storage

    CEPH storage cluster installation – OS Fedora 23

    In this blog post I am going to document the steps I performed in order to install a CEPH storage cluster. Such a CEPH storage cluster can later be used in an openstack/openshift installation as storage for virtual machines/pods, or deployed with some other solution that needs object and/or block storage. CEPH FS ( File System ) exists, but it will not be discussed in this blog post.

    CEPH is an open source storage solution which has become very popular due to its flexibility and the feature set it offers. The Ceph project was started by Sage Weil back in 2007 or so, more at : ceph wiki page. The current version of CEPH is Hammer (v0.94) and this version of ceph will be used in this blog post.

    As the operating system for the CEPH cluster I am going to use Fedora 23, for the reasons below

    • it has a good feature set and many available packages. I guess the same process described here can be used with Debian – with small package/file name corrections
    • it is close to Red Hat Enterprise Linux ( I know I could use CentOS 7, but I have Fedora 23 machines handy ), and the information you get here can easily be carried over to RHEL ( after reading the Red Hat ceph documentation first )
      Note: In order to apply the notes from here to the RHEL case, you will need to work with Red Hat Sales / Support to get access to the proper Red Hat software entitlements which contain the CEPH packages
    • it is free, and in order to start working with CEPH you do not need subscriptions to get the software

    I am going to use a Fedora 23 KVM environment for this POC, for the reasons below

    • it is the most convenient and cheap option – I do not need to invest in hardware
    • I have access to it

    Using KVM as the base for CEPH nodes is not supported in production, so be aware of this in case you decide to use RHEL instead of Fedora and want to get support for a CEPH cluster running on top of RHEL from the Red Hat global support services team. Here are, in my opinion, some interesting links regarding CEPH cluster sizing and planning

    The first step is to install Fedora 23. You can use this centos kickstart for this purpose, just adapt it to point to the proper Fedora repositories, or you can install the machines manually. ISO installation images can be obtained from Fedora 23 server iso images

    Once the system is installed, I recommend updating it to the latest packages

    # dnf -y update

    For a CEPH storage cluster we need at least 3 CEPH monitor ( mon service ) machines to preserve HA. From the excellent CEPH book ( I got it for free, but if you buy it, which I strongly advise, it is worth the money ) Learning CEPH we can read: A Ceph storage cluster requires at least one monitor to run. For high availability, a Ceph storage cluster relies on an odd number of monitors that’s more than one, for example, 3 or 5, to form a quorum. For this initial POC I find 3 monitors to be fine, as it gives an HA setup and serves the POC purpose. Later I am going to add more monitors.

    In this test environment, instead of installing the same machine configuration three times, I installed it once and then cloned it with KVM tools to the desired number, so I had identical KVM guests / machines for the CEPH nodes.

    After machine installation, we need to ensure the points below on the ceph nodes before doing any further steps

    • ensure all ceph nodes can properly resolve all other nodes, either by configuring a DNS server or via /etc/hosts ( see the sketch after this list ). In my case I already had a DNS server in place, I added my ceph cluster nodes to the DNS configuration and it worked fine
      Important : if ceph nodes are not able to properly resolve the other nodes, there will be problems
    • ensure that the ceph nodes have access to the internet
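
    If there is no DNS server available, entries along these lines in /etc/hosts on every node are enough ( the hostnames / IPs here match the test cluster used later in this post )

    192.168.122.101 cephf23-node1
    192.168.122.102 cephf23-node2
    192.168.122.103 cephf23-node3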

    With physical hardware it is expected / necessary to have a separate disk for the CEPH OSDs. In my test case, as I am using KVM guests, I created a virtual disk for every machine using the commands below. I decided to use a 15 GB disk size, … this is just a test POC which can later be turned into a real POC

    # qemu-img create -f qcow2 cephf23-node1disk.qcow2 15G
    # qemu-img create -f qcow2 cephf23-node2disk.qcow2 15G
    # qemu-img create -f qcow2 cephf23-node3disk.qcow2 15G
    

    After this, in order for the KVM guests to see / use these disks, it is necessary to attach them to the machines. I edited the /etc/libvirt/qemu/kvm_guest_file.xml files for the KVM guests and added a definition for the new block device there. There is already a definition for a disk in the KVM machine .xml file, and it is easy to add a new one: just follow the same syntax while pointing to the desired disk and adapting the PCI numbers. If there is a mistake in the configuration, it will be reported during the virsh define step below.
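
    The disk stanza added inside the <devices> section looks roughly like this ( the source path here is just an example location and the target dev must be the next free device name; the PCI/address details can be copied and adapted from the existing disk or left out so libvirt generates them )

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/cephf23-node1disk.qcow2'/>
      <target dev='vdb' bus='virtio'/>
    </disk>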

    After this, it is necessary to (re)define the machine and restart it

    # virsh destroy kvm_machine
    # virsh define  /etc/libvirt/qemu/kvm_guest_file.xml  
    # virsh start kvm_machine
    

    where kvm_machine is the name of your KVM domain/machine. Once the KVM guest is up, the new disk will be visible as /dev/vdb. It is necessary to repeat the above process for all guests.

    Another option for adding storage to virtual guests is described in the documentation Adding storage devices to guests, using for example

     # virsh attach-disk Guest1 /var/lib/libvirt/images/FileName.img vdb --cache none

    which is supposed to work too. The part related to disks is KVM specific; with physical hardware this is not necessary.

    Further, it is necessary to have passwordless login between the CEPH nodes; ensure this is working ( ssh-keygen, ssh-copy-id … )
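
    Something along these lines, executed on the node from which ceph-deploy will be run ( this post uses the root user ), takes care of it; repeat ssh-copy-id for every node

    # ssh-keygen
    # ssh-copy-id root@cephf23-node1
    # ssh-copy-id root@cephf23-node2
    # ssh-copy-id root@cephf23-node3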

    Let’s now proceed and install packages

    # dnf install -y ceph-deploy 
    

    More about ceph-deploy can be found in the ceph-deploy ceph documentation. In short, ceph-deploy is a tool which enables users to install a ceph cluster more easily than doing it by hand.

    Assuming the ceph cluster nodes can properly resolve the other nodes and passwordless access works fine, issuing the commands below

    # mkdir /etc/ceph
    # ceph-deploy new cephf23-node1 cephf23-node2 cephf23-node3
    

    will write a ceph.conf file with some basic parameters. In this case

    [global]
    fsid = b71a3eb1-e253-410a-bf11-84ae01bad654
    mon_initial_members = cephf23-node1, cephf23-node2, cephf23-node3 
    mon_host = 192.168.122.101,192.168.122.102,192.168.122.103 
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    filestore_xattr_use_omap = true
    

    Additionally, I added to ceph.conf

    cluster_network = 192.168.122.0/24
    

    to define the cluster network – the network for cluster traffic – which in this case is the same as the public network, as I have only one network card.
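
    With two network cards one would typically define both networks explicitly, for example ( the subnets here are only illustrative )

    public_network = 192.168.122.0/24
    cluster_network = 192.168.124.0/24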

    In the ceph.conf created by ceph-deploy new as shown above, it is necessary to pay particular attention to mon_initial_members and mon_host – the hostnames / IPs written there for these machines MUST be correct. If they are not, go back and check the DNS settings.

    Now we are ready to install ceph packages

    # ceph-deploy install cephf23-node1 cephf23-node2 cephf23-node3  --no-adjust-repos
    

    The last command will install all ceph components on the specified nodes. If --no-adjust-repos is not specified, ceph-deploy will go to the ceph repositories and try to get packages from there – which is not bad if the packages exist. At the time I am writing this, only packages for Fedora 20 are available in the ceph repositories, so ceph-deploy will fail.
    By specifying --no-adjust-repos, ceph-deploy is instructed not to adjust the repositories and to get the packages from the OS repos – in this case Fedora 23.

    The install step will take some time, depending on the network speed for downloading the ceph packages. Once it finishes, it is necessary to create the monitors

    # ceph-deploy --overwrite-conf mon create-initial
    

    It is important to specify --overwrite-conf since ceph.conf was edited; --overwrite-conf ensures that ceph-deploy does not fail.

    Now ceph mon dump shows the monitor status

    # ceph mon dump
    dumped monmap epoch 1
    epoch 1
    fsid b71a3eb1-e253-410a-bf11-84ae01bad654
    last_changed 0.000000
    created 0.000000
    0: 192.168.122.101:6789/0 mon.cephf23-node1
    1: 192.168.122.102:6789/0 mon.cephf23-node2
    2: 192.168.122.103:6789/0 mon.cephf23-node3
    

    and after this ceph status shows

    # ceph status
        cluster b71a3eb1-e253-410a-bf11-84ae01bad654
         health HEALTH_ERR
                64 pgs stuck inactive
                64 pgs stuck unclean
                no osds
         monmap e1: 3 mons at {cephf23-node1=192.168.122.101:6789/0,cephf23-node2=192.168.122.102:6789/0,cephf23-node3=192.168.122.103:6789/0}
                election epoch 30, quorum 0,1,2 cephf23-node1,cephf23-node2,cephf23-node3
         osdmap e1: 0 osds: 0 up, 0 in
          pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
                0 kB used, 0 kB / 0 kB avail
                      64 creating
    

    OK, ceph has some health issues. If we check ceph health, we see

    # ceph health
    HEALTH_ERR 64 pgs stuck inactive; 64 pgs stuck unclean; no osds
    

    I highlighted the most important part of the ceph health output: it says no osds. An OSD – Object Storage Device – is the building block of a CEPH cluster, and it is necessary to create them.

    OSD creation is simple; however, there are some important points to take care of prior to OSD creation

    • identify the block device for the OSD
    • ensure it is really the device planned for the OSD – OSD preparation / actions are destructive and they can cause trouble if pointed at the wrong block device
    • once the block devices are identified, run ceph-deploy disk zap :
     # ceph-deploy disk zap cephf23-node1:vdb
     # ceph-deploy disk zap cephf23-node2:vdb
     # ceph-deploy disk zap cephf23-node3:vdb
    

    ceph-deploy disk zap will do a kind of disk formatting and prepare the disk to become an OSD.

    After this step, prepare the OSDs

    # ceph-deploy osd prepare cephf23-node1:vdb
    # ceph-deploy osd prepare cephf23-node2:vdb
    # ceph-deploy osd prepare cephf23-node3:vdb
    

    I created everything on /dev/vdb, leaving it up to ceph-deploy to divide it into a data and a journal part. E.g., if we run

    # blkid |  grep vdb
    /dev/vdb1: UUID="cdf2f55e-67ee-4077-808e-fcfa94f531ae" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="6cb675e2-ed7c-4757-861e-8090b4c3dda3"
    /dev/vdb2: PARTLABEL="ceph journal" PARTUUID="b75b2e61-5664-4730-af52-a43a1ee99845"
    

    from this it is visible that /dev/vdb1 is marked as ceph data and /dev/vdb2 as ceph journal. For maximum and better performance it is recommended to use a separate device for the ceph journal ( ideally an SSD disk ).

    In the above example, if there was a separate device for the journal, the ceph-deploy osd prepare step would be

    # ceph-deploy osd prepare {node}:/dev/vdb:/dev/vdX 
    

    where vdX would be the separate device for the journal

    Knowing that on /dev/vdb we now have

    • ceph data
    • ceph journal

    the command to activate the OSDs would be

    # ceph-deploy osd activate cephf23-node1:/dev/vdb1:/dev/vdb2
    # ceph-deploy osd activate cephf23-node2:/dev/vdb1:/dev/vdb2
    # ceph-deploy osd activate cephf23-node3:/dev/vdb1:/dev/vdb2
    

    If we now execute ceph osd tree, it will show the OSDs and their placement across the CEPH nodes.

    # ceph osd tree
    ID WEIGHT  TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY 
    -1 0.02998 root default                                             
    -2 0.00999     host cephf23-node3                                   
     0 0.00999         osd.0               up  1.00000          1.00000 
    -3 0.00999     host cephf23-node2                                   
     1 0.00999         osd.1               up  1.00000          1.00000 
    -4 0.00999     host cephf23-node1                                   
     2 0.00999         osd.2               up  1.00000          1.00000 
    

    ceph osd dump will dump osd parameters

    # ceph osd dump
    epoch 54
    fsid b71a3eb1-e253-410a-bf11-84ae01bad654
    created 2015-12-27 18:07:12.158247
    modified 2015-12-31 16:05:27.526052
    flags 
    pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
    max_osd 3
    osd.0 up   in  weight 1 up_from 51 up_thru 51 down_at 50 last_clean_interval [34,43) 192.168.122.103:6800/1081 192.168.122.103:6801/1081 192.168.122.103:6802/1081 192.168.122.103:6803/1081 exists,up bf703535-de09-4bde-b81e-805a9c85169b
    osd.1 up   in  weight 1 up_from 48 up_thru 51 down_at 44 last_clean_interval [35,43) 192.168.122.102:6800/1090 192.168.122.102:6801/1090 192.168.122.102:6802/1090 192.168.122.102:6803/1090 exists,up 60d237f0-38c0-4dfd-9672-ef4076c57a7f
    osd.2 up   in  weight 1 up_from 46 up_thru 51 down_at 45 last_clean_interval [42,43) 192.168.122.101:6800/1228 192.168.122.101:6801/1228 192.168.122.101:6802/1228 192.168.122.101:6803/1228 exists,up 6cb675e2-ed7c-4757-861e-8090b4c3dda3
    
    

    And in ps aux | grep ceph-osd we can see there is an osd process running – this is for node1, but the same applies to the other nodes. There is one OSD process per block device, which means that if we had more OSDs on a CEPH node, we would see more osd.X.pid processes in the ps aux output

    ps aux | grep ceph-osd
    root      1226  0.0  0.1 119632  3180 ?        Ss   10:23   0:00 /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    root      1228  0.2  2.2 713224 46884 ?        Sl   10:23   0:23 /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    

    If ceph status is run now, it will show the status of the ceph cluster

    # ceph status 
        cluster b71a3eb1-e253-410a-bf11-84ae01bad654
         health HEALTH_OK
         monmap e1: 3 mons at {cephf23-node1=192.168.122.101:6789/0,cephf23-node2=192.168.122.102:6789/0,cephf23-node3=192.168.122.103:6789/0}
                election epoch 32, quorum 0,1,2 cephf23-node1,cephf23-node2,cephf23-node3
         osdmap e13: 3 osds: 3 up, 3 in
          pgmap v54: 64 pgs, 1 pools, 0 bytes data, 0 objects
                100 MB used, 30586 MB / 30686 MB avail
                      64 active+clean
    

    Now we see the status of the ceph storage cluster, and one might ask why there are only about 30 GB available when three 15 GB OSD devices were built into the ceph cluster. If we look at one of the OSD disks, we now have

    # fdisk -l /dev/vdb
    Disk /dev/vdb: 15 GiB, 16106127360 bytes, 31457280 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: gpt
    Disk identifier: 893E7CCC-C92A-47DD-9DAE-FC62DFA91314
    
    Device        Start      End  Sectors Size Type
    /dev/vdb1  10487808 31457246 20969439  10G Ceph OSD
    /dev/vdb2      2048 10485760 10483713   5G Ceph Journal
    
    Partition table entries are not in disk order.
    

    the reason for this is the osd journal size, which is left at the default of 5120 MB. More about OSD config parameters is in the osd config reference; if not specified, a size of 5 GB will be used for the ceph journal device. This is not an issue; in case we had specified a separate device for the journal, the whole device would be used.
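
    If a different journal size is wanted, it can be set in ceph.conf before the OSDs are prepared, along these lines ( the value is in MB and is just an example )

    [osd]
    osd journal size = 10240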

    From this point it is possible to create ceph pool(s) and use them.

    In order to create a CEPH pool we can do the following

     
    # ceph osd pool create poolA 128 128 
    

    this will create poolA with the desired number of placement groups ( pg ), which is visible in the output below

    # ceph osd lspools
    0 rbd,3 poolA
    
    [root@cephf23-node1 ~]# rados lspools
    rbd
    poolA
    

    The general command for creating a CEPH pool is

    # ceph osd pool create pool_name pg_number pgp_number  
    

    where pg_number is the placement group number and pgp_number is the placement group for placement number; they have to be equal, otherwise CEPH will not start re-balancing
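
    The values a pool currently uses can be checked ( and, if needed, adjusted ) like this, for example for poolA created above

    # ceph osd pool get poolA pg_num
    # ceph osd pool get poolA pgp_num
    # ceph osd pool set poolA pgp_num 128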

    How many PGs do we need in a specific CEPH cluster? This number will vary with the number of OSDs and the number of replicas, but the formula below can help

               (OSDs * 100)
    Total PGs = ------------
                  Replicas
    

    which is stated on the CEPH web page How Data Is Stored In CEPH Cluster
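
    As a worked example for the small cluster in this post: ( 3 OSDs * 100 ) / 3 replicas = 100, which, rounded up to the next power of two, gives the 128 placement groups used for poolA above.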

    Also do not forget to enable the ceph service to ensure it starts during boot

    # systemctl enable ceph
    ceph.service is not a native service, redirecting to systemd-sysv-install
    Executing /usr/lib/systemd/systemd-sysv-install enable ceph
    

    This is a basic CEPH storage cluster setup. It can be turned into a useful POC if physical hardware ( certified and supported HW ) is used instead of KVM, and if it is adapted to other specific needs. For a production case it would also be necessary to address the points below

    • In the above configuration I decided to run monitors and OSDs on the same machines, which will work fine; however, it consumes more
      memory / cpu on the machine, and for production environments it is recommended to run monitors and OSDs on separate machines
    • separate networks : cluster and public networks

    • configure a firewall – I disabled firewalld in this configuration; to open the ports necessary for CEPH, all that is needed is to follow the ceph network reference manual and apply the recommendations from there

    • CRUSH – what it is and how to configure it: read crush and the crush paper

    What if something is broken? It is possible to do the following

    • try to debug it; /var/log/messages and /var/log/ceph/* on the CEPH nodes are the locations where to start
      A fellow Red Hatter gave an excellent presentation on this topic – Troubleshooting CEPH – I recommend checking it out.
    • if it is not possible to solve, then collect data and open a BZ for CEPH

    If you suspect that the issue might be some kind of PEBKAC, it is possible to start from zero and try to configure everything again. ceph-deploy offers simple tools to remove the ceph packages and reset the ceph nodes to the state they were in before the ceph packages were installed

    For example, the commands below will bring the systems back to the pre-installation state – they will remove all ceph packages, and it is then possible to try the installation again using the steps from the beginning of this post

     
    # ceph-deploy uninstall cephf23-node1 cephf23-node2 cephf23-node3
    # ceph-deploy purgedata cephf23-node1 cephf23-node2 cephf23-node3 
    # ceph-deploy purge cephf23-node1 cephf23-node2 cephf23-node3
    # ceph-deploy forgetkeys 
    

    and all is ready to start again. Doing it again will be a good learning experience anyway.

    Happy CEPH hacking!

    #ceph, #ceph-pg, #ceph-pgp, #ceph-pool, #fedora-2, #linux, #object-storage-device, #osd, #storage

    copy/edit partition table with sfdisk

    sfdisk is a nice tool for playing with disk partitions. It has many features and is very useful when it is necessary to make changes to disk partitions. Before doing anything with sfdisk I recommend reading the sfdisk man page to get a basic picture of what sfdisk is and what it can be used for. If not used carefully it can be a dangerous command, especially if pointed at the wrong device, so … think before running it.
    I needed it in a case where it was necessary to clone the partition table of one sdcard to another ( fdisk can do this too )

    To save the partition table, I did

     
    # sfdisk --dump /dev/sdb > 16gcard
    

    The following was now written into the 16gcard dump file

    # cat 16gcard
    label: dos
    label-id: 0x00000000
    device: /dev/sdb
    unit: sectors
    
    /dev/sdb1 : start=        8192, size=    31108096, type=c
    

    This is what I need; however, the new card is double the size, 32 GB, and writing the above to the new card would occupy just the first 16 GB. Luckily, sfdisk is a very versatile tool and allows editing the partition dump and then writing it back to disk. Open 16gcard in a text editor ( eg. Vim ) and edit the dump file. If the original size was 31108096 sectors ( of 512 B each ), the new size becomes 61399040 sectors, and the new dump file is

    # cat 16gcard 
    label: dos
    label-id: 0x00000000
    device: /dev/sdb
    unit: sectors
    
    /dev/sdb1 : start=        8192, size=    61399040, type=c
    

    Now I can write it to the new card

     
    # sfdisk /dev/sdb < 16gcard
    

    and fdisk -l shows

    #  fdisk -l /dev/sdb
    Disk /dev/sdb: 29.3 GiB, 31440502784 bytes, 61407232 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x00000000
    
    Device     Boot Start      End  Sectors  Size Id Type
    /dev/sdb1  *     2048 61407231 61405184 29.3G  c W95 FAT32 (LBA)
    

    This is the very same partition table as the one I had on the old card, except for the last sector, which is adapted to suit the size of the new card.

    #linux, #sfdisk, #storage