Distributed Systems (CSE 60771)
University of Notre Dame, taught by Douglas Thain
Class notes uploaded by Mrs. Damaris Hyatt on November 1, 2015

Ceph: A Scalable, High-Performance Distributed File System
S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long

Presented by Philip Snowberger
Department of Computer Science and Engineering, University of Notre Dame
April 20, 2007


Outline
- Introduction
- Goals of Ceph
- Ceph Architecture
- Handling Failures
- Performance


Introduction: Problem
- Distributed filesystems allow aggregation of resources
  - Can increase fault tolerance
  - Can increase performance
  - Increases complexity
- Metadata is a bottleneck in many distributed filesystems
  - Centralized metadata
  - Distributed metadata
- Can we make a distributed filesystem that scales on both data and metadata operations?


Goals of Ceph
- Achieve scalability to petabyte workloads while maintaining:
  - Performance
  - Reliability
- Scalability, in terms of:
  - Storage capacity
  - Throughput
- Scale the above while maintaining useful performance for individuals


How Does Ceph Attempt to Accomplish This?
- Metadata storage: decoupling data and metadata
  - A Ceph cluster consists of servers responsible for storing objects and servers responsible for managing metadata
- Dynamic distributed metadata management
  - Robust against failures and workload changes
- Reliable Autonomic Distributed Object Storage
  - Leverage the intelligence available at each node in the cluster


Ceph Architecture: What Are Metadata?
- Metadata are information about data
  - Length, permissions, creator, modification time
  - File name
- Almost every filesystem access affects metadata
- Different types of metadata can have different consistency requirements


Traditional Block Storage
- In traditional block storage, one piece of metadata is the allocation list
  - The sequence of disk block ranges that comprise the data of the file
- Managing this list takes a significant amount of computrons
- In a distributed setting, disk blocks are too low-level an abstraction


Object Storage
- Objects consist of paired data and metadata
- An object storage device is responsible for keeping track of where the object's bytes are on disk
- Thus, object storage relies on intelligence at the storage nodes to relieve some of the management load


A Simple Object Storage System
- Consider a simple distributed filesystem:
  - A centralized directory server ("Where is /tmp/foo?")
  - Distributed file servers ("Give me bytes 9043-43880 of /tmp/foo")
- This design does not scale:
  - Can the directory server handle 10,000 requests/second? 1,000,000?
  - Can a single file server serve up a 10 MB file to 1,000 hosts?


Metadata Distribution in Ceph
- In the past, distributed filesystems have used static sub-tree partitioning to distribute filesystem load
  - This does not perform optimally for some cases, e.g. /tmp, /var/run/log
  - It also performs poorly when the workload changes
- To offset this lack of optimality, more recent distributed filesystems have opted to distribute metadata with a hashing function
  - This removes the performance gains from directory locality
- Ceph uses a new approach: dynamic sub-tree partitioning
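
To make the locality trade-off concrete, here is a minimal sketch (Python, not Ceph code) of hash-based metadata placement; the server count NUM_MDS, the helper mds_for(), and the example paths are illustrative assumptions.

```python
# Minimal sketch (not Ceph code): hash-based metadata placement maps each
# path to a metadata server independently, so sibling files scatter.
# NUM_MDS, mds_for(), and the paths below are illustrative assumptions.
import hashlib

NUM_MDS = 4  # hypothetical number of metadata servers

def mds_for(path: str) -> int:
    """Map a path to an MDS with a stable hash (no directory locality)."""
    digest = hashlib.sha1(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MDS

siblings = [f"/home/alice/project/file{i}.c" for i in range(6)]
for path in siblings:
    print(path, "-> MDS", mds_for(path))
# The six files in one directory land on several different MDSs, so a simple
# directory listing touches many servers; static or dynamic sub-tree
# partitioning would keep the whole directory on a single MDS.
```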

Dynamic Sub-Tree Partitioning
[Figure: the filesystem hierarchy divided among MDSs, with a busy directory hashed across many MDSs]
- Each MDS is responsible for some sub-tree of the filesystem hierarchy
- Whenever an operation visits an inode (directory or file), the MDS increments that inode's time-decay counter
- MDSs compare their counter values periodically
- When an imbalance is detected, the MDS cluster reassigns responsibility over some sub-trees to balance the counter values
- Extremely busy directories can be hashed across multiple MDSs


A Naïve Object Placement Method
- To find where to put a chunk of a file: hash(inode, chunkNum) mod NumServers
- But what happens when a server goes down or we add a server?
  - The hashing function needs a new modulus
- In a petascale distributed filesystem, failures and expansion must be regarded as the rule rather than the exception
- Is there a better way to place chunks?


Distributing Objects in Ceph
[Figure: files map to objects via (ino, ono) -> oid; objects map to PGs via hash(oid) & mask -> pgid; PGs map to OSDs via CRUSH(pgid), grouped by failure domain]
- Each object is mapped to a Placement Group (PG) by a simple hash function with an adjustable bitmask
  - This bitmask controls the number of PGs
- Placement Groups are mapped to Object Storage Devices (OSDs) by a special mapping function, CRUSH
- The number of PGs per OSD affects load balancing


CRUSH
- A special-purpose mapping function
- Given as input a PG identifier, a cluster map, and a set of placement rules, it deterministically maps each PG to a sequence of OSDs, R
- This sequence is pseudo-random but deterministic
- With high probability this achieves a good distribution of objects and metadata; this distribution is called "declustered"


Cluster Maps
- A cluster map is composed of devices and buckets
  - Devices are leaf nodes; buckets may contain devices or other buckets
- The OSDs that make up a cluster (or group of clusters) can be organized into a hierarchy
- This hierarchy can reflect the physical or logical layout of the cluster or network:
  - Room123 (root)
    - Row1, Row2, ..., Row8
      - Cabinet1, Cabinet2, ..., Cabinet16
        - Disk1, Disk2, ..., Disk256


Placement Rules
- Specify how the replicas of a PG should be placed on the cluster
- The following example distributes three replicas across single disks in each of three cabinets, all in the same row
- This pattern reduces or eliminates inter-row replication traffic

    Action               Resulting
    take(Room123)        Room123
    select(1, row)       Row2
    select(3, cabinet)   Cabinet4, Cabinet8, Cabinet9
    select(1, disk)      Disk44, Disk509, Disk612
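
The two-stage placement above is easy to prototype. The sketch below uses assumed names and parameters (PG_MASK, NUM_OSDS, REPLICAS, pg_to_osds), not Ceph's implementation: it hashes an object id into a placement group with an adjustable bitmask, then maps the PG to an ordered replica list with a deterministic pseudo-random function standing in for CRUSH.

```python
# Sketch of Ceph-style two-stage placement: object -> PG -> OSDs.
# PG_MASK, NUM_OSDS, REPLICAS, and pg_to_osds() are assumptions; pg_to_osds()
# is only a stand-in for CRUSH, capturing "pseudo-random but deterministic".
import hashlib
import random

PG_MASK = 0xFFF   # 4096 placement groups; widening the mask adds PGs
NUM_OSDS = 32     # hypothetical cluster size
REPLICAS = 3

def stable_hash(s: str) -> int:
    """Deterministic hash (unlike Python's per-process salted hash())."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

def object_to_pg(oid: str) -> int:
    """Stage 1: object id -> placement group id via hash & mask."""
    return stable_hash(oid) & PG_MASK

def pg_to_osds(pgid: int) -> list:
    """Stage 2: PG id -> ordered list of OSDs (the first acts as Primary).
    Seeding the RNG with pgid makes the choice repeatable for every client."""
    rng = random.Random(pgid)
    return rng.sample(range(NUM_OSDS), REPLICAS)

oid = "100000023.00000005"   # hypothetical object name derived from (ino, ono)
pgid = object_to_pg(oid)
print("object", oid, "-> PG", pgid, "-> OSDs", pg_to_osds(pgid))
# Real CRUSH walks the bucket hierarchy under placement rules (take/select),
# which limits data movement when the cluster map changes; this stand-in only
# reproduces the deterministic, pseudo-random mapping.
```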

Safety vs. Synchronization
[Figure: timeline of a write among the client, the Primary, and two Replicas: write, apply update, ack, commit to disk, commit]
- When a client writes data, it sends the write to the Primary
- The Primary forwards the data to the Secondaries, who ack that they've received the data and that it's been applied to their page caches
- When the Primary receives all the Secondary acks, it returns an ack back to the client
- At this point the client knows that any other client accessing the object will see a consistent view of it


Safety vs. Synchronization (continued)
- So how does Ceph treat data safety?
- Each OSD aggressively flushes its caches to secondary storage
- When the Secondaries have flushed each update, they send a commit message to the Primary
- After collecting all the Secondaries' commits, the Primary sends a commit to the client
- Clients keep their updates buffered until receipt of a commit message from the Primary
- Why is this necessary?


Handling Failures: Commonality of Failures
- Failures must be assumed in a petascale filesystem consisting of hundreds or thousands of disks
- Centralized failure monitoring:
  - Places a lot of load on the network
  - Cannot see through a network partition
- Can we distribute the monitoring of failures?


Monitors
- Monitors are processes that keep track of transient and systemic failures in the cluster
- They are responsible for providing access to a consistent cluster map
- When the monitors change the cluster map, they propagate that change to the affected OSDs
- The updated cluster map, since it is small, propagates via other inter-OSD communication to the whole cluster
- When an OSD receives an updated map, it determines whether the ownership of any of its PGs has changed
- If so, it directly connects to the other OSD and replicates its PG there


So What If a Rack of Servers Explodes?
- Each OSD keeps track of the last time it heard from each other server it shares a Placement Group with (replication traffic serves as heartbeats)
- When a node goes down, it isn't heard from within a short time and is marked "down" (but not "out") by the monitors
- If the node doesn't recover quickly, it is marked "out" and another OSD joins each of the PGs that was affected, in order to bring the replication level back up
- Replication of the data on the down/out node is prioritized by the other OSDs
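
As a rough illustration of the down/out distinction, here is a small sketch (assumed names and timeouts, nothing from Ceph's codebase) of how an OSD or monitor might classify a silent peer.

```python
# Sketch of down-vs-out failure classification driven by peer heartbeats.
# DOWN_AFTER, OUT_AFTER, and PeerTracker are assumptions for illustration.
import time

DOWN_AFTER = 20.0    # seconds of silence before a peer is considered "down"
OUT_AFTER = 300.0    # seconds of silence before it is "out" and re-replicated

class PeerTracker:
    """Remembers the last time each peer OSD (sharing a PG) was heard from."""

    def __init__(self):
        self.last_seen = {}   # osd id -> timestamp of last message

    def heard_from(self, osd_id, now=None):
        # Any replication traffic doubles as a heartbeat.
        self.last_seen[osd_id] = time.time() if now is None else now

    def status(self, osd_id, now=None):
        now = time.time() if now is None else now
        silent = now - self.last_seen.get(osd_id, now)
        if silent >= OUT_AFTER:
            return "out"    # its PGs get new members to restore replication
        if silent >= DOWN_AFTER:
            return "down"   # still owns its PGs; it may come back quickly
        return "up"

tracker = PeerTracker()
tracker.heard_from(7, now=1000.0)   # OSD 7 last answered 400 s ago
tracker.heard_from(9, now=1395.0)   # OSD 9 last answered 5 s ago
print(tracker.status(7, now=1400.0), tracker.status(9, now=1400.0))  # out up
```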

Performance: Throughput and Latency
[Figure: per-OSD write throughput (MB/s) vs. write size, for no replication, 2x replication, and 3x replication]
- 14-node OSD cluster
- Load is generated by 400 clients running on 20 other nodes
- The plateau indicates the physical limitation of disk throughput


Throughput and Latency (continued)
[Figure: write latency (ms) vs. write size for no replication, 2x, and 3x replication, comparing synchronous writes with sync-lock/async-write]
- There is no difference between two and three replicas for low write sizes
- At higher write sizes, network transmission times dominate latency


Throughput and Latency (continued)
[Figure: per-OSD throughput (MB/s) vs. OSD cluster size for CRUSH with 32k PGs, CRUSH with 4k PGs, hash with 32k PGs, hash with 4k PGs, and linear placement]
- OSD throughput scales linearly with the size of the OSD cluster until the network switch is saturated
- More PGs even out the load, giving better per-node throughput


Metadata Operation Scaling
[Figure: per-MDS throughput (ops/sec) vs. MDS cluster size (up to 128 nodes) for the makedirs, makefiles, openshared, openssh+include, and openssh+lib workloads]
- 430-node cluster, varying the number of MDSs
- Metadata-only workloads
- Only a 50% per-node throughput slowdown for large clusters


Metadata Operation Scaling (continued)
[Figure: latency (ms) vs. per-MDS throughput (ops/sec) for 4, 16, and 128 MDSs]
- From the makedirs workload
- Larger clusters have less optimal metadata distributions, resulting in lower throughput
- However, this is still very much adequate and performant for a large distributed filesystem


Summary
- Ceph is a distributed filesystem that scales to extremely high loads and storage capacities
- Latency of Ceph operations scales well with the number of nodes in the cluster, the size of reads/writes, and the replication factor
- By offering slightly non-POSIX semantics, they achieve big performance wins for scientific workloads
- Distributing load with a cluster-wide mapping function (CRUSH) is both effective and performant

