etcd error message “etcd failed to send out heartbeat on time”

… etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines, per 1 and 2. etcd is very sensitive to network delays, and not only to those: any kind of overall sluggishness on the etcd cluster nodes can cause functionality problems for the whole Kubernetes cluster.

By the time an OpenShift/Kubernetes cluster starts reporting error messages like the ones shown below, the cluster will already be misbehaving: pod scheduling and deletion will not work as expected, and the problems will be more than visible.

Sep 27 00:04:01 dhcp7-237 etcd: failed to send out heartbeat on time (deadline exceeded for 1.766957688s)
Sep 27 00:04:01 dhcp7-237 etcd: server is likely overloaded
Sep 27 00:04:01 dhcp7-237 etcd: failed to send out heartbeat on time (deadline exceeded for 1.766976918s)
Sep 27 00:04:01 dhcp7-237 etcd: server is likely overloaded

systemctl status etcd output

 systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2016-10-01 09:18:37 EDT; 5h 20min ago
 Main PID: 11970 (etcd)
   Memory: 1.0G
   CGroup: /system.slice/etcd.service
           └─11970 /usr/bin/etcd --data-dir=/var/lib/etcd/ --listen-client-urls=

Oct 01 14:38:55 etcd[11970]: server is likely overloaded
Oct 01 14:38:56 etcd[11970]: failed to send out heartbeat on time (deadline exceeded for 377.70994ms)
Oct 01 14:38:56 etcd[11970]: server is likely overloaded
Oct 01 14:38:56 etcd[11970]: failed to send out heartbeat on time (deadline exceeded for 377.933298ms)
Oct 01 14:38:56 etcd[11970]: server is likely overloaded
Oct 01 14:38:58 etcd[11970]: failed to send out heartbeat on time (deadline exceeded for 1.226630142s)
Oct 01 14:38:58 etcd[11970]: server is likely overloaded
Oct 01 14:38:58 etcd[11970]: failed to send out heartbeat on time (deadline exceeded for 1.226803192s)
Oct 01 14:38:58 etcd[11970]: server is likely overloaded
Oct 01 14:39:07 etcd[11970]: the clock difference against peer f801f8148b694198 is too high [1.078081179s > 1s]

# systemctl status etcd -l will show similar messages as well, so check these too
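To get a feel for how late the heartbeats are, the delay values can be pulled out of the log lines. The following is a minimal sketch, not part of the original setup: heartbeat_delays is a hypothetical helper name, and it assumes the message format shown in the logs above.

```shell
# heartbeat_delays: read etcd log lines on stdin and print only the
# delay reported in each "failed to send out heartbeat on time" message.
# Hypothetical helper; assumes the log format shown above.
heartbeat_delays() {
    sed -n 's/.*failed to send out heartbeat on time (deadline exceeded for \([0-9.]*m\{0,1\}s\)).*/\1/p'
}
```

For example, journalctl -u etcd | heartbeat_delays prints one delay per line, such as 1.766957688s or 377.70994ms for the messages above.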

The etcd configuration file is located at /etc/etcd/etcd.conf and has content similar to that shown below. This one is from RHEL; other OSes may have it slightly different.



The parameters we want to change in the above configuration file are ETCD_HEARTBEAT_INTERVAL and ETCD_ELECTION_TIMEOUT (both in milliseconds). There is no single value that fits all environments; it is necessary to experiment with different values and find out what works best. For most cases the default (500/2500) will be fine.
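As an illustration only, the two settings might look like this in /etc/etcd/etcd.conf. The values below are hypothetical, not a recommendation: they simply double the 500/2500 defaults mentioned above while keeping the same 1:5 ratio between heartbeat interval and election timeout.

```shell
# /etc/etcd/etcd.conf (fragment) - illustrative values only, in milliseconds
ETCD_HEARTBEAT_INTERVAL=1000
ETCD_ELECTION_TIMEOUT=5000
```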

After changing /etc/etcd/etcd.conf, do not forget to restart the etcd service

# systemctl restart etcd

The following issues affecting etcd nodes can lead to the problem described in this post

  • network latency
  • storage latency
  • combination of network latency and storage latency

If network latency is low, then check the storage used by the Kubernetes/OpenShift etcd servers. Changing the values described in this post is a workaround for the case where the root cause is known and no other option is possible. The first and better solution is to solve the issue at its root by fixing the problematic subsystem(s).

In my particular case the storage subsystem was slow, and it was not possible to change that without a bunch of $$$.

References : etcd documentation

#etcd, #k8s, #kubernetes, #linux, #openshift, #redhat, #storage


Git is an excellent tool, and it offers some amazing options and commands which can make daily work easier. Below I am going to describe some of the commands I find useful. Some of them are straightforward, some are not, but all of them can make the git experience better.

Among all git tools, I find git filter-branch the strongest and most awesome git command. It can do so many good things. As its name suggests, this git command will “filter” a branch. Where is this useful? In all cases where it is necessary to filter out data we do not want to be visible in history, as if it had never existed in the repository.

Let’s assume there is a local repository in which internal information such as hostnames and passwords is present, and at some point it is decided to make this internal repository public.
The logical question is how to remove the internal information from git history and files. Luckily, git filter-branch can be used for exactly that.

E.g. to rename, across the whole history, all files whose names contain a particular hostname, it is possible to use the command below (old_hostname and new_name are placeholders for the hostname to hide and its replacement)

$ git filter-branch -f --tree-filter 'find . -type f -name "*old_hostname*" | while read FNAME; do mv "$FNAME" "${FNAME//old_hostname/new_name}"; done' --tag-name-filter cat -- --all

This will rewrite every commit in which a file name containing old_hostname is present.

Important note: the above command will do this on all branches, so if it is run on some branch other than master, master will be filtered too.

If we first want to experiment with git filter-branch on a single branch, that is possible too: instead of --all, pass the name of the branch to change, e.g.

$ git filter-branch -f --tree-filter 'find . -type f -name "*old_hostname*" | while read FNAME; do mv "$FNAME" "${FNAME//old_hostname/new_name}"; done' --tag-name-filter cat -- mybranch

To remove a password from history and from files

$ git filter-branch --tree-filter "find . -type f -exec sed -i -e 's/origpass/newpassword/g' {} \;"

To remove remote branch

$ git push origin --delete branchname

push local branch to remote

$ git push -u origin local_branch 

clone remote branch

$ git clone -b branch_name remote_repo

Clone a remote branch and track it

$ git fetch origin
$ git checkout -b branch_name
$ git branch --set-upstream-to=origin/branch_name
$ git pull

After this, the latest updates from the remote branch will be reflected in the newly created local branch

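Check out a file from another branch into the current branch (branch_name is the branch that should receive the file, other_branch is the branch to take it from)

$ git checkout branch_name
$ git checkout other_branch -- file_to_check.txt
$ git add file_to_check.txt
$ git commit -s 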

rebase last X commits into one commit

$ git rebase -i HEAD~X 

and follow steps in editor. Read carefully!

Sometimes I edit a file and then get distracted before committing and pushing it to the repository, and hours or days later I do not remember exactly what changed. Git can help here too. If git status shows output like

$ git status 
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   file.txt

then running

$ git diff HEAD file.txt 

will show the changes between the last committed version and the version in the working directory.

To see which branches received the most recent commits,

$ git for-each-ref --sort=-committerdate refs/heads/

will list local branches sorted by commit date, with the most recently updated branch first.

Track a remote repository as upstream

$ git remote add upstream remote_repo
$ git remote -v 

The output of git remote -v will show what the upstream git repository is, and then it can be used to fetch changes which can be merged into local master

$ git fetch upstream 
$ git checkout master 
$ git merge upstream/master

#git, #git-branch, #git-filter-branch, #linux

Openshift v3 pod bulk creation with Amazon Elastic Block Store (EBS) – Block Storage as persistent storage

If you ever want to create many Openshift pods with one EBS volume attached to each pod, then the script described in this post can help you with that. Its help output is shown below.
There are a few steps which have to be taken in advance before running the script

  • It is necessary to configure the openshift nodes / masters to be able to talk with Amazon EC2. This blog post can give you some ideas on how to achieve this: aws-ebs-as-persistent-storage-for-openshift
  • boto3 has to be installed on the machine where the script is supposed to run; pip install boto3 will get it
  • Ensure to run
    # pip install awscli 
    # aws configure 
  • and follow the steps to configure the Amazon cli client. aws configure is going to ask questions regarding the aws credentials which need to be provided

    After all of the above is satisfied, running the command below

     ./ --volumesize=1 --image=fedora  --tagprefix=fedora_test --minpod=1 --maxpod=11 --pvfile=pv.json --pvcfile=pvc.json --podfile=pod.json 

    will create 10 pods based on the fedora image, with each pod having one EBS volume mounted at /mnt/persistentvolume. Each EBS volume will have the tag value fedora_test. The EBS tag is not used for anything except to make it easier to delete the EBS volumes later on, for example using the script from delete-ebs-volumes-based-on-ebs-volume-tag

    pv.json, pvc.json and pod.json are generic files, examples of which can be found here

    #python -h
    usage: [-h] [--volumesize VOLUMESIZE] [--vtype VTYPE]
                               [--region REGION] [--image IMAGE]
                               [--tagprefix TAGPREFIX] [--mountpoint MOUNTPOINT]
                               [--minpod MINPOD] [--action ACTION] --maxpod MAXPOD
                               --pvfile PVFILE --pvcfile PVCFILE --podfile PODFILE
    Script to create OSE pods and attach one EBS volume per pod as persistent
    optional arguments:
      -h, --help            show this help message and exit
      --volumesize VOLUMESIZE
                            size of EBS voluems - in GB
      --vtype VTYPE         EBS volume type, default is gp2
      --region REGION       Amazon region where to connect
      --image IMAGE         docker image to use
      --tagprefix TAGPREFIX
                            tag prefix for EBS volumes, default tag is openshift-
      --mountpoint MOUNTPOINT
                            mount point inside pod where EBS volume will be
                            mounted, default is /mnt/persistentvolume
      --minpod MINPOD       minimum number of pods to create - default is 1 - so
                            minimum one pod will be created
      --action ACTION       what to do - either to create pods or delete pods
      --maxpod MAXPOD       maximum number of pods to create - required parameter
      --pvfile PVFILE       persistent volume definition json file - required
      --pvcfile PVCFILE     persistent volume claim definition json file -
                            required parameter
      --podfile PODFILE     pod definition json file - required parameter

    #amazon, #aws, #ebs, #linux, #openshift, #python

    AWS EBS as persistent storage for Openshift

    Openshift supports different persistent storage backends: Ceph, gluster and Amazon EBS, just to name some of them. In the blog post using ceph as persistent storage for openshift pods I described how to hook an openshift installation up to a CEPH cluster as persistent storage for openshift pods. Here I will describe how to configure Openshift to use AWS EBS based storage as persistent volumes for openshift pods

    It is assumed here that the starting point is a working Openshift environment. In order to use Amazon EBS storage with openshift, it is necessary to make small changes in the openshift configuration

  • The Openshift master and all Openshift nodes need to be configured to support AWS EBS. The Openshift documentation gives a nice overview of the necessary changes
  • After configuring the openshift master/nodes to be able to reach / allocate EBS, the next steps are straightforward.
    Examples of PersistentVolume, PersistentVolumeClaim and Pod json files are shown below

  • PersistentVolume
  • PersistentVolumeClaim
  • pod file
  • It is necessary to adapt the above .json files with the proper pv name and size, pvc name and size, pod name, etc. to correspond to the environment where they are used. After adapting them, create the PersistentVolume and PersistentVolumeClaim

    # oc create -f pv.json 
    # oc create -f pvc.json 
    # oc get pv
    NAME       LABELS    CAPACITY     ACCESSMODES     STATUS    CLAIM               REASON      AGE
    pvolume1               1          RWO             Bound     default/pvclaim1                3h
    # oc get pvc 
    pvclaim1              Bound     pvolume1    1             RWO             3h

    Now we can create a pod which will use the PersistentVolume and PersistentVolumeClaim created above

    # oc create -f pod.json
    # oc get pods | grep pod1
    pod1                      1/1       Running   0          30s

    Now, on the node where the pod runs, we will see

    # fdisk -l | grep xvd
    WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
    Disk /dev/xvda: 10.7 GB, 10737418240 bytes, 20971520 sectors
    Disk /dev/xvdb: 10.7 GB, 10737418240 bytes, 20971520 sectors
    /dev/xvdb1            2048    20971519    10484736   8e  Linux LVM
    Disk /dev/xvdf: 1073 MB, 1073741824 bytes, 2097152 sectors ---- this is EBS device used inside pod 

    /dev/xvdf is the device mapped to the node; I know this as it is the only 1GB device on my system. Further, the device is visible in the mount output

    # mount | grep xvdf
    /dev/xvdf on /var/lib/origin/openshift.local.volumes/plugins/ type ext4 (rw,relatime,seclabel,data=ordered)
    /dev/xvdf on /var/lib/origin/openshift.local.volumes/pods/3a467bfd-eeb7-11e5-9e13-02982d3b6367/volumes/ type ext4 (rw,relatime,seclabel,data=ordered)

    It is mounted at a mount point under /var/lib/origin/openshift.local.volumes …

    If we take a closer look, we will see there the EBS volume id ( vol-3f7210c ) and the plugin used ( aws-ebs ). This can be useful if it is necessary to sort / search devices allocated on the same Openshift node and coming from two different storage backends ( e.g. EBS and CEPH )
    In the pod we see

     # oc exec  pod1 -- mount  | grep xvdf
    /dev/xvdf on /mnt/persistentvolume type ext4 (rw,relatime,seclabel,data=ordered)

    For every pod using EBS as persistent storage, a pair of lines similar to the ones above will be visible on the Openshift node

    #amazon-ebs, #linux, #openshift, #persistantvolume, #persistantvolumeclaim, #pod, #storage

    delete EBS volume(s) based on EBS volume tag

    After detaching an EBS volume from an AWS instance, it is necessary to clean up the environment to reduce costs and not leave unused EC2 resources behind. If you tag EBS volumes, as I regularly do to make it easier to distinguish what they are used for, then deleting EBS volumes once they are detached from an instance is an easy task with the simple script I use for it.

    Important: This is a destructive operation. It will delete an EBS volume if it is not attached to an instance, even if you pass the tag of the wrong EBS volume. Use it carefully and with caution

    Usage :

    $ python --tagname=EBS_TAG

    where EBS_TAG is tag of EBS volume you want to delete.

    #amazon, #ebs, #ec2, #linux, #python

    CEPH storage cluster installation – OS Fedora 23

    In this blog post I am going to document the steps I took in order to install a CEPH storage cluster. A CEPH storage cluster installed this way can be used later on in an openstack/openshift installation as storage for virtual machines/pods, or deployed with some other solution requiring object and/or block storage. CEPH FS ( File System ) exists, but it will not be discussed in this blog post.

    CEPH is an open source storage solution which has become very popular due to its flexibility and the feature set it offers. The Ceph project was started by Sage Weil back in 2007 or so ( more at: ceph wiki page ). The current version of CEPH is Hammer (v0.94), and this version of ceph will be used in this blog post.

    As the operating system for the CEPH cluster I am going to use Fedora 23, for the reasons below

    • it has a good set of features and many available packages. I guess the same process described here can be used with Debian, with small corrections to package / file names
    • it is close to Red Hat Enterprise Linux ( I know I could use Centos 7, but I have Fedora 23 machines handy ), and the information you get here can easily be carried over to RHEL ( after first reading the Red Hat ceph documentation )
      Note: In order to apply the notes from here to RHEL, you will need to work with Red Hat Sales / Support to get access to the proper Red Hat software entitlements which contain the CEPH packages
    • it is free, and in order to start working with CEPH you do not need subscriptions to get the software

    I am going to use a Fedora 23 KVM environment for this POC, for the reasons below

    • it is most convenient and cheap: I do not need to invest in hardware
    • I have access to it

    Using KVM as the base for CEPH nodes is not supported in production, so be aware of this in case you decide to use RHEL instead of Fedora and want to get support for a CEPH cluster running on top of RHEL from the Red Hat global support services team. Here are, in my opinion, some interesting links regarding CEPH cluster sizing and planning

    The first step is to install Fedora 23. You can use this centos kickstart for this purpose, just adapt it to point to the proper Fedora repositories, or you can install the machines manually. ISO installation images can be obtained from Fedora 23 server iso images

    Once the system is installed, I recommend updating it to the latest packages

    # dnf -y update

    For a CEPH storage cluster we need at least 3 CEPH monitor ( mon service ) machines to preserve HA. In the excellent book Learning CEPH ( I got it for free, but I strongly advise buying it, it is worth the money ) we can read: “A Ceph storage cluster requires at least one monitor to run. For high availability, a Ceph storage cluster relies on an odd number of monitors that’s more than one, for example, 3 or 5, to form a quorum.” For this initial POC I find 3 monitors to be fine, as this gives an HA solution and serves the POC purpose. Later I am going to add more monitors.
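The odd-number advice follows from how a majority quorum works: with n monitors, more than half of them must be up. A minimal sketch of that arithmetic (quorum_size and tolerated_failures are hypothetical helper names, not Ceph commands):

```shell
# quorum_size N: smallest majority of N monitors
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: monitors that can fail while a majority survives
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 4 5; do
    echo "monitors=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Going from 3 to 4 monitors does not allow any additional failure ( both tolerate one ), which is why odd counts are preferred.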

    In this test environment, instead of installing a machine with the same configuration three times, I installed it once and then used KVM tools to clone it the desired number of times, so I had identical KVM guests / machines for the CEPH nodes.

    After machine installation, we need to ensure the points below on the ceph nodes before doing any further steps

    • ensure all ceph nodes can properly resolve all other nodes, either by configuring a dns server or via /etc/hosts. In my case I have a dns server already in place; I added my ceph cluster nodes to the dns configuration and it worked fine
      Important : if ceph nodes are not able to properly resolve other nodes, there will be problems
    • ensure that the ceph nodes have access to the internet

    With physical hardware, it is expected / necessary to have a separate disk for CEPH OSDs. In my test case, as I am using KVM guests, I created a virtual disk for every machine using the commands below. I decided to use a 15 GB disk size,… this is just a test POC which can be turned into a real POC

    # qemu-img create -f qcow2 cephf23-node1disk.qcow2 15G
    # qemu-img create -f qcow2 cephf23-node2disk.qcow2 15G
    # qemu-img create -f qcow2 cephf23-node3disk.qcow2 15G

    After this, in order for the kvm guests to see / use these disks, it is necessary to attach them to the machines. I edited the /etc/libvirt/qemu/kvm_guest_file.xml files of the kvm guests and added a definition for the new block device there. There is already a disk definition in the kvm machine .xml file, so it is easy to add a new disk: just follow the same syntax, while pointing to the desired disk and adapting the PCI numbers. If there is a mistake in the configuration, it will be reported during the virsh define step below

    After this, it is necessary to (re)define the machine and restart it

    # virsh destroy kvm_machine
    # virsh define  /etc/libvirt/qemu/kvm_guest_file.xml  
    # virsh start kvm_machine

    where kvm_machine is the name of your KVM domain/machine. Once the kvm guest is up, the new disk will be visible as /dev/vdb. It is necessary to repeat the above process for all guests.

    Another option for adding storage to virtual guests is described in the documentation Adding storage devices to guests, using for example

     # virsh attach-disk Guest1 /var/lib/libvirt/images/FileName.img vdb --cache none

    which is supposed to work too. The part related to disks is kvm specific; with physical hardware this is not necessary.

    Further, it is necessary to have passwordless login between the CEPH nodes; ensure this is working ( ssh-keygen, ssh-copy-id … )

    Let’s now proceed and install the packages

    # dnf install -y ceph-deploy 

    More about ceph-deploy can be found in the ceph-deploy ceph documentation. In short, ceph-deploy is a tool which enables users to install a ceph cluster more easily than without it.

    Assuming the ceph cluster nodes can properly resolve the other nodes, and passwordless access works fine, issuing the commands below

    # mkdir /etc/ceph
    # ceph-deploy new cephf23-node1 cephf23-node2 cephf23-node3

    will write a ceph.conf file with some basic parameters. In this case

    fsid = b71a3eb1-e253-410a-bf11-84ae01bad654
    mon_initial_members = cephf23-node1, cephf23-node2, cephf23-node3 
    mon_host =,, 
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    filestore_xattr_use_omap = true

    I additionally added to ceph.conf

    cluster_network =

    to define the cluster network, i.e. the network for cluster traffic. It will be the same as the public network, as in this case I have only one network card. In ceph.conf we can see mon_initial_members and mon_host, which MUST be correct from a hostname / ip point of view. If that is not the case, it is necessary to go back and check the DNS settings.

    After running ceph-deploy new as shown above, ceph.conf will be created. It is necessary to pay particular attention to mon_initial_members and mon_host: the correct hostnames / IPs for these machines must be written in ceph.conf

    Now we are ready to install ceph packages

    # ceph-deploy install cephf23-node1 cephf23-node2 cephf23-node3  --no-adjust-repos

    The last command will install all ceph components on the specified nodes. If --no-adjust-repos is not specified, then ceph-deploy will go to the ceph repositories and try to get packages from there, which is fine if the packages exist. At the time of writing, only packages for Fedora 20 are available in the ceph repositories, so ceph-deploy will fail.
    By specifying --no-adjust-repos, ceph-deploy is instructed not to adjust repositories and to get packages from the OS repos, in this case Fedora 23.

    The install step will take some time, depending on the network speed when downloading ceph packages. Once it finishes, it is necessary to create the monitors

    # ceph-deploy --overwrite-conf mon create-initial

    It is important to specify --overwrite-conf, as ceph.conf was edited, and --overwrite-conf will ensure ceph-deploy does not fail.

    Now ceph mon dump shows the monitors’ status

    # ceph mon dump
    dumped monmap epoch 1
    epoch 1
    fsid b71a3eb1-e253-410a-bf11-84ae01bad654
    last_changed 0.000000
    created 0.000000
    0: mon.cephf23-node1
    1: mon.cephf23-node2
    2: mon.cephf23-node3

    and after this ceph status shows

    # ceph status
        cluster b71a3eb1-e253-410a-bf11-84ae01bad654
         health HEALTH_ERR
                64 pgs stuck inactive
                64 pgs stuck unclean
                no osds
         monmap e1: 3 mons at {cephf23-node1=,cephf23-node2=,cephf23-node3=}
                election epoch 30, quorum 0,1,2 cephf23-node1,cephf23-node2,cephf23-node3
         osdmap e1: 0 osds: 0 up, 0 in
          pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
                0 kB used, 0 kB / 0 kB avail
                      64 creating

    OK, ceph has some health issues. If ceph health is checked, we see

    # ceph health
    HEALTH_ERR 64 pgs stuck inactive; 64 pgs stuck unclean; no osds

    The most important part of the ceph health output is no osds. An OSD ( Object Storage Device ) is the building block of a CEPH cluster, and it is necessary to create them.

    OSD creation is simple, however there are important points to take care of prior to OSD creation

    • identify the block device for the OSD
    • ensure it really is the device planned for the OSD: OSD preparations / actions are destructive and can cause trouble if pointed at the wrong block device
    • once the block devices are identified, run ceph-deploy disk zap :
     # ceph-deploy disk zap cephf23-node1:vdb
     # ceph-deploy disk zap cephf23-node2:vdb
     # ceph-deploy disk zap cephf23-node3:vdb

    ceph-deploy disk zap will do some kind of disk formatting to prepare the disk to be an OSD.

    After this step, prepare the OSDs

    # ceph-deploy osd prepare cephf23-node1:vdb
    # ceph-deploy osd prepare cephf23-node2:vdb
    # ceph-deploy osd prepare cephf23-node3:vdb

    I gave the whole of /dev/vdb to ceph-deploy, leaving it up to ceph-deploy to divide it into data and journal parts. E.g. if we run

    # blkid |  grep vdb
    /dev/vdb1: UUID="cdf2f55e-67ee-4077-808e-fcfa94f531ae" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="6cb675e2-ed7c-4757-861e-8090b4c3dda3"
    /dev/vdb2: PARTLABEL="ceph journal" PARTUUID="b75b2e61-5664-4730-af52-a43a1ee99845"

    from this it is visible that /dev/vdb1 is marked as ceph data and /dev/vdb2 as ceph journal. For best performance it is recommended to use a separate device for the ceph journal ( even an SSD disk ).

    In the above example, if there were a separate device for the journal, then the ceph-deploy osd prepare step would be

    # ceph-deploy osd prepare {node}:/dev/vdb:/dev/vdX 

    where vdX would be the separate device for the journal

    Knowing that on /dev/vdb we now have

    • ceph data
    • ceph journal

    the commands to activate the OSDs would be

    # ceph-deploy osd activate cephf23-node1:/dev/vdb1:/dev/vdb2
    # ceph-deploy osd activate cephf23-node2:/dev/vdb1:/dev/vdb2
    # ceph-deploy osd activate cephf23-node3:/dev/vdb1:/dev/vdb2
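Since the same three steps are repeated for every node, they can be generated in a loop. The sketch below only prints the commands so they can be reviewed before being run ( the steps are destructive ); osd_commands is a hypothetical helper, and it assumes every node uses the same device name and the data / journal partition layout shown above.

```shell
# osd_commands DEVICE NODE...: print the zap / prepare / activate commands
# used above for each node, in the same order as in the post.
osd_commands() {
    dev=$1
    shift
    for node in "$@"; do
        echo "ceph-deploy disk zap ${node}:${dev}"
    done
    for node in "$@"; do
        echo "ceph-deploy osd prepare ${node}:${dev}"
    done
    for node in "$@"; do
        echo "ceph-deploy osd activate ${node}:/dev/${dev}1:/dev/${dev}2"
    done
}

osd_commands vdb cephf23-node1 cephf23-node2 cephf23-node3
```

Printing rather than executing keeps the destructive part a deliberate, separate step.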

    If we now execute ceph osd tree, it will show the OSDs and their placement across the CEPH nodes.

    # ceph osd tree
    -1 0.02998 root default                                             
    -2 0.00999     host cephf23-node3                                   
     0 0.00999         osd.0               up  1.00000          1.00000 
    -3 0.00999     host cephf23-node2                                   
     1 0.00999         osd.1               up  1.00000          1.00000 
    -4 0.00999     host cephf23-node1                                   
     2 0.00999         osd.2               up  1.00000          1.00000 

    ceph osd dump will dump osd parameters

    # ceph osd dump
    epoch 54
    fsid b71a3eb1-e253-410a-bf11-84ae01bad654
    created 2015-12-27 18:07:12.158247
    modified 2015-12-31 16:05:27.526052
    pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
    max_osd 3
    osd.0 up   in  weight 1 up_from 51 up_thru 51 down_at 50 last_clean_interval [34,43) exists,up bf703535-de09-4bde-b81e-805a9c85169b
    osd.1 up   in  weight 1 up_from 48 up_thru 51 down_at 44 last_clean_interval [35,43) exists,up 60d237f0-38c0-4dfd-9672-ef4076c57a7f
    osd.2 up   in  weight 1 up_from 46 up_thru 51 down_at 45 last_clean_interval [42,43) exists,up 6cb675e2-ed7c-4757-861e-8090b4c3dda3

    And in ps aux | grep ceph-osd we can see there is an osd process running. This is for node1, but the same applies to the other nodes. There is one OSD process per block device, which means that if we had more OSDs on a CEPH node, we would see more processes in the ps aux output

    ps aux | grep ceph-osd
    root      1226  0.0  0.1 119632  3180 ?        Ss   10:23   0:00 /bin/bash -c ulimit -n 32768; /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/ -c /etc/ceph/ceph.conf --cluster ceph -f
    root      1228  0.2  2.2 713224 46884 ?        Sl   10:23   0:23 /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/ -c /etc/ceph/ceph.conf --cluster ceph -f

    If ceph status is run now, it will show the status of the ceph cluster

    # ceph status 
        cluster b71a3eb1-e253-410a-bf11-84ae01bad654
         health HEALTH_OK
         monmap e1: 3 mons at {cephf23-node1=,cephf23-node2=,cephf23-node3=}
                election epoch 32, quorum 0,1,2 cephf23-node1,cephf23-node2,cephf23-node3
         osdmap e13: 3 osds: 3 up, 3 in
          pgmap v54: 64 pgs, 1 pools, 0 bytes data, 0 objects
                100 MB used, 30586 MB / 30686 MB avail
                      64 active+clean

    Now we see the status of the ceph storage cluster, and one might ask why there is only 30 GB of space available when three 15 GB OSD devices are built into the ceph cluster. Now we have

    # fdisk -l /dev/vdb
    Disk /dev/vdb: 15 GiB, 16106127360 bytes, 31457280 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: gpt
    Disk identifier: 893E7CCC-C92A-47DD-9DAE-FC62DFA91314
    Device        Start      End  Sectors Size Type
    /dev/vdb1  10487808 31457246 20969439  10G Ceph OSD
    /dev/vdb2      2048 10485760 10483713   5G Ceph Journal
    Partition table entries are not in disk order.

    The reason for this is the osd journal size, which is left at the default of 5120 MB ( more about OSD config parameters in the osd config reference ). This means that, if not specified otherwise, 5 GB will be used for the ceph journal, leaving 10 GB of each 15 GB disk for data. This is not an issue: in case we specify a separate device for the journal, the whole device will be used.
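The available space reported by ceph status can be checked with simple arithmetic. This is a rough sketch with a hypothetical helper name; it works in whole GB and ignores filesystem overhead and replication, so it only approximates the 30686 MB shown above.

```shell
# usable_gb NODES DISK_GB JOURNAL_GB: raw space left for data when the
# journal is carved out of each OSD disk (hypothetical helper name)
usable_gb() {
    echo $(( $1 * ($2 - $3) ))
}

usable_gb 3 15 5   # prints 30
```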

    From this point it is possible to create ceph pool(s) and use them.

    In order to create a CEPH pool we can do the below

    # ceph osd pool create poolA 128 128 

    this will create poolA with the desired number of placement groups ( pg ), which is visible in the output below

    # ceph osd lspools
    0 rbd,3 poolA
    # rados lspools

    The general command for creating a CEPH pool is

    # ceph osd pool create pool_name pg_number pgp_number  

    where pg_number ( number of placement groups ) and pgp_number ( number of placement groups for placement ) have to be equal, otherwise CEPH will not start re-balancing

    How many PGs do we need in a specific CEPH cluster? This number varies with the number of OSDs and the number of replicas, but the formula below can help

               (OSDs * 100)
    Total PGs = ------------
                 Replicas

    as stated on the CEPH web page How Data Is Stored In CEPH Cluster
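The formula can be turned into a small helper: OSDs times 100, divided by the number of replicas. Rounding the result up to the next power of two follows general Ceph guidance and is an assumption here, since the post does not mention it; pg_count is a hypothetical name.

```shell
# pg_count OSDS REPLICAS: (OSDs * 100) / replicas, rounded up to the
# next power of two (the rounding is an assumption, see the note above)
pg_count() {
    osds=$1
    replicas=$2
    target=$(( osds * 100 / replicas ))
    pgs=1
    while [ "$pgs" -lt "$target" ]; do
        pgs=$(( pgs * 2 ))
    done
    echo "$pgs"
}

pg_count 3 3   # prints 128
```

With the 3 OSDs of this cluster and a replicated size of 3 ( as shown for the rbd pool in ceph osd dump ), this gives 128, the value used when creating poolA.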

    Also, do not forget to enable the ceph service to ensure it starts during boot

    # systemctl enable ceph
    ceph.service is not a native service, redirecting to systemd-sysv-install
    Executing /usr/lib/systemd/systemd-sysv-install enable ceph

    This is a basic CEPH storage cluster setup. It can be turned into a useful POC if physical hardware ( certified and supported HW ) is used instead of KVM and if it is adapted to other specific needs. For a production case it would also be necessary to tweak the points below

    • In the above configuration I decided to run monitors and OSDs on the same machines, which will work fine; however, this consumes more
      memory / cpu on the machine, and for production environments it is recommended to have monitors and OSDs running on separate machines
    • separate networks : cluster and public networks

    • configure the firewall: I disabled firewalld in this configuration. To open the ports which are necessary for CEPH, all that is needed is to follow the ceph network reference manual and apply the recommendations from there

    • CRUSH: for what it is and how to configure it, read crush and crush paper

    What if something is broken? It is possible to do the below

    • try to debug it; /var/log/messages and /var/log/ceph/* on the CEPH nodes are the places to start
      A fellow Red Hatter gave an excellent presentation on this topic, Troubleshooting CEPH; I recommend checking it out.
    • if it is not possible to solve, then collect data and open a BZ for CEPH

    If you suspect that the issue might be some kind of PEBKAC, then it is possible to start from zero and try to configure it again. ceph-deploy offers simple tools to remove the ceph packages and reset the ceph nodes to the state before ceph package installation

    For example, the commands below will bring the system to its pre-installation state: they will remove all ceph packages, and it is then possible to try the installation again using the steps from the beginning of this post

    # ceph-deploy uninstall cephf23-node1 cephf23-node2 cephf23-node3
    # ceph-deploy purgedata cephf23-node1 cephf23-node2 cephf23-node3 
    # ceph-deploy purge cephf23-node1 cephf23-node2 cephf23-node3
    # ceph-deploy forgetkeys 

    and all is ready to start again. Doing it again will be a good learning experience anyway.

    Happy CEPH hacking!

    #ceph, #ceph-pg, #ceph-pgp, #ceph-pool, #fedora-2, #linux, #object-storage-device, #osd, #storage

    copy/edit partition table with sfdisk

    sfdisk is a nice tool for playing with disk partitions. It has many features, and is very useful when it is necessary to make changes to disk partitions. Before doing anything with sfdisk, I recommend reading the sfdisk man page to get a basic picture of what sfdisk is and what it can be used for. If not used carefully it can be a dangerous command, especially if pointed at the wrong device, so … think before running it
    I needed it when it was necessary to clone the partition table of one sdcard to another ( fdisk can do this too )

    To save the partition table, I did

    # sfdisk --dump /dev/sdb > 16gcard

    Now the 16gcard dump file contained

    # cat 16gcard
    label: dos
    label-id: 0x00000000
    device: /dev/sdb
    unit: sectors
    /dev/sdb1 : start=        8192, size=    31108096, type=c

    This is what I need; however, the new card is double the size, 32 GB, and writing the above to the new card would occupy just the first 16 GB. Luckily, sfdisk is a very versatile tool, and it allows editing the partition dump and then writing it back to disk. Open 16gcard in a text editor ( e.g. Vim ) and edit the dump file. If the original size is 31108096 sectors ( of 512 B each ), then the new size would be 61399040 sectors ( of 512 B each ), and the new dump file is

    # cat 16gcard 
    label: dos
    label-id: 0x00000000
    device: /dev/sdb
    unit: sectors
    /dev/sdb1 : start=        8192, size=    61399040, type=c
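The new size in the dump can be derived from the total sector count of the new card: total sectors minus the start sector of the partition. A minimal sketch, assuming the new card reports 61407232 sectors ( the figure that fdisk -l prints for it in this post ) and 512-byte sectors; the function names are hypothetical.

```shell
# max_partition_sectors TOTAL_SECTORS START_SECTOR: sectors available for
# a single partition that starts at START_SECTOR and runs to the end of
# the disk (hypothetical helper name)
max_partition_sectors() {
    echo $(( $1 - $2 ))
}

# sectors_to_bytes SECTORS: size in bytes, assuming 512-byte sectors
sectors_to_bytes() {
    echo $(( $1 * 512 ))
}

max_partition_sectors 61407232 8192   # prints 61399040
sectors_to_bytes 61399040             # prints 31436308480
```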

    Now I can write it to the new card

    # sfdisk /dev/sdb < 16gcard

    and fdisk -l shows

    #  fdisk -l /dev/sdb
    Disk /dev/sdb: 29.3 GiB, 31440502784 bytes, 61407232 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x00000000
    Device     Boot Start      End  Sectors  Size Id Type
    /dev/sdb1  *     2048 61407231 61405184 29.3G  c W95 FAT32 (LBA)

    This is the very same partition table as the one I had on the old card, except for the last sector, which is adapted to suit the size of the new card.

    #linux, #sfdisk, #storage