armchair dot tech
Ceph getting acquainted

January 20, 2016 | TAGS: #ceph

The two key components:

OSD Notes:

OSD Journal Location: Ceph stores each OSD daemon’s journal by default at /var/lib/ceph/osd/$cluster-$id/journal. On an ECS node this would sit on an SSD, which is what Ceph recommends. You could also point the journal at an SSD partition instead of a file for even faster performance.

OSD Journal Size: The Ceph documentation recommends a journal of at least twice the expected throughput multiplied by the filestore max sync interval. The expected throughput should account for both the expected disk throughput (i.e., the sustained data transfer rate) and the network throughput. For example, a 7200 RPM disk will likely sustain approximately 100 MB/s. Taking the min() of the disk and network throughput should give a reasonable estimate. Some users just start off with a 10 GB journal; the value is given in megabytes. For example: osd journal size = 10000
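As a rough sketch of that sizing arithmetic, the snippet below applies the rule of thumb from the Ceph documentation (journal size = 2 × expected throughput × filestore max sync interval). The function name, the 5-second sync interval (assumed here to be the filestore default) and the 1250 MB/s figure for a 10 GbE link are illustrative assumptions, not values measured on a real cluster.

```python
def journal_size_mb(disk_mb_s, network_mb_s, sync_interval_s=5.0):
    """Estimate a value for 'osd journal size' in megabytes.

    sync_interval_s stands in for 'filestore max sync interval';
    5 seconds is an assumed default, so check your own ceph.conf.
    """
    # The slower of disk and network bounds what the journal has to absorb.
    expected_throughput = min(disk_mb_s, network_mb_s)
    return int(2 * expected_throughput * sync_interval_s)


# Hypothetical node: one 7200 RPM disk (~100 MB/s) behind a 10 GbE link.
size = journal_size_mb(disk_mb_s=100, network_mb_s=1250)
print(f"osd journal size = {size}")
```

With these inputs the disk is the bottleneck, so the estimate comes out well under the 10 GB that many users start with.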

OSDs can be removed gracefully: http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual

Check Max Thread Count: If a node runs a lot of OSDs, it may hit the default maximum number of threads (usually 32k), especially during recovery. You can use sysctl to raise the limit to the maximum allowed value (4194303) and see whether that helps. For example: sysctl -w kernel.pid_max=4194303
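Before changing anything it can help to see what the current limit is. The following is a minimal sketch, assuming a Linux host where the setting is exposed at /proc/sys/kernel/pid_max; the helper names are made up for illustration.

```python
from pathlib import Path

PID_MAX_PATH = Path("/proc/sys/kernel/pid_max")  # Linux only
PID_MAX_CEILING = 4194303  # maximum value quoted above


def read_pid_max(path=PID_MAX_PATH):
    """Return the current kernel.pid_max, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def needs_raise(current, ceiling=PID_MAX_CEILING):
    """True when the limit is known and still below the ceiling."""
    return current is not None and current < ceiling


current = read_pid_max()
if needs_raise(current):
    print(f"kernel.pid_max is {current}; "
          f"consider: sysctl -w kernel.pid_max={PID_MAX_CEILING}")
else:
    print(f"kernel.pid_max: {current}")
```

The script only reads the value; raising it still goes through sysctl as root.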

CRUSH Map

The “location” of each Ceph OSD is maintained in a CRUSH map.

The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
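To make the idea of computing a location instead of looking it up more concrete, here is a toy sketch. It is not the CRUSH algorithm: it uses plain rendezvous (highest-random-weight) hashing over a flat list of OSDs and ignores placement groups, weights, and the hierarchy. The OSD names and object name are hypothetical.

```python
import hashlib


def score(object_name, osd_id):
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}:{osd_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name, osd_ids, replicas=3):
    """Pick the OSDs with the highest scores for this object."""
    ranked = sorted(osd_ids, key=lambda osd: score(object_name, osd),
                    reverse=True)
    return ranked[:replicas]


osds = [f"osd.{i}" for i in range(6)]
print(place("my-object", osds))
```

Any client that knows the same OSD list computes the same answer on its own, which is the property that lets Ceph clients skip a central broker.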

CRUSH maps contain a list of OSDs, a list of ‘buckets’ for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.

The short of this is that in ceph.conf you can define a host’s location, which in turn defines the location of every Ceph OSD running on that host. A location is a set of key=value pairs whose keys are Ceph’s predefined CRUSH bucket types, for example:

root=default row=a rack=a2 chassis=a2a host=a2a1

# types (from narrowest ascending to broadest grouping)
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

Each CRUSH type has a numeric value; the higher the value, the broader (less specific) the grouping. When deciding where to place data chunks or replicas of an object, Ceph OSDs consult the CRUSH map to find other Ceph OSDs in other hosts, chassis, and racks. The failure domain policies can be defined and tweaked.
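The type ordering can be used to reason about which failure domain two hosts share. The sketch below parses location strings like the one above and reports the narrowest bucket both hosts sit in. The second host (a2b1 in chassis a2b) is hypothetical, and the helper names are illustrative; bucket names are assumed to be unique across the map.

```python
# Type ids copied from the CRUSH type list above.
CRUSH_TYPES = {
    "osd": 0, "host": 1, "chassis": 2, "rack": 3, "row": 4, "pdu": 5,
    "pod": 6, "room": 7, "datacenter": 8, "region": 9, "root": 10,
}


def parse_location(text):
    """Parse 'key=value ...' into a dict, rejecting unknown bucket types."""
    location = {}
    for item in text.split():
        key, _, value = item.partition("=")
        if key not in CRUSH_TYPES or not value:
            raise ValueError(f"bad location item: {item!r}")
        location[key] = value
    return location


def narrowest_shared_bucket(loc_a, loc_b):
    """Return (type, name) of the most specific bucket both share, or None."""
    shared = [(CRUSH_TYPES[k], k, v)
              for k, v in loc_a.items() if loc_b.get(k) == v]
    if not shared:
        return None
    _, key, value = min(shared)  # lowest type id = narrowest grouping
    return key, value


a = parse_location("root=default row=a rack=a2 chassis=a2a host=a2a1")
b = parse_location("root=default row=a rack=a2 chassis=a2b host=a2b1")
print(narrowest_shared_bucket(a, b))
```

Here the two hosts are in different chassis but the same rack, so a rule that separates replicas by rack would not place two copies on them, while a rule that separates by chassis or host could.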

https://ceph.com/planet/zero-to-hero-guide-for-ceph-cluster-planning/
https://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf - High-performance Ceph builds.