Many of the articles in this newsletter refer to “the cloud.” Feedback from several newsletter readers indicates that not everyone understands what a "cloud" is in the Internet world. I thought I would publish this introduction to cloud computing and also explain how cloud computing is used to provide digital images of census records to millions of online genealogists.
A number of companies provide cloud computing services, including Amazon Web Services (often referred to as "AWS"), Microsoft Azure, Google Cloud Platform, IBM Cloud, Rackspace, pCloud, Red Hat (later acquired by IBM), Backblaze B2 Cloud Storage, and several others. To keep things simple, I will describe Amazon simply because it is the largest cloud services provider and is the one that I use the most. However, I believe the other cloud service providers are all similar in operation.
Amazon Web Services and most of the other cloud providers offer a number of services, including file storage, bulk email services, and running programs on large cloud servers. Again, I will focus on file storage services because that is both the most popular cloud-based service and the one that genealogists use the most. If I receive enough requests, I will describe other cloud-based services in future articles.
Amazon? I thought that was an online retailer!
Yes, this is the same Amazon that is well known as a huge online retailer. Amazon started in business as an online bookstore but has since expanded into selling all sorts of retail products. The company had to build huge data centers in order to handle the workload of its own retail customers. In effect, Amazon first created its own “cloud” for internal company use. Any number of computers in their data centers could be brought into action to “serve” data, applications, or both to Amazon customers. Whenever there was a lot of activity, more of these “servers” could be added to accommodate the volume of applications and data being accessed, moved around, or stored. When the workload was lighter, some of those servers could be returned to their standby status until the next surge of activity or could be redirected to other uses. Systems administrators would monitor the needs and ensure the required servers were active at any given time. The customers accessing the servers never knew which computer in which data center was handling their work; they had no need to know. It was as though their activity moved from the computer in front of them, off to a cloud that would send their information to its destination via routes and patterns that nobody had to know or navigate. All this happened instantaneously and reliably, whether there were a handful of users, hundreds, or thousands.
Eventually, Amazon's senior management realized that the company had developed facilities and expertise that corporate customers and individuals could also use. A few years ago Amazon capitalized on this idea by creating a new division called Amazon Web Services. The company expanded its data centers and started offering services to corporations and even to private individuals around the world. In effect, Amazon Web Services is in the “rent a server” and “rent some disk space” business. Computing power and storage space are available for both short-term and long-term rentals.
Amazon now has data centers in many different locations, including Northern California, Northern Virginia, Oregon, Ohio, Ireland, Tokyo, Sao Paulo (Brazil), Singapore, Argentina, Australia, Belgium, Bulgaria, Finland, and numerous other locations. The list is expanding as Amazon continues building even more data centers. You can even view a slide presentation that describes the data centers in detail, created by Amazon Engineer James Hamilton, at http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_AmazonOpenHouse20110607.pdf. That slide show is several years old, but the information within it appears to still be correct except that today there are more of these AWS data centers than ever before.
Handling the storage and transfer of all that customer data requires Amazon to keep that data safe, as well as to allow users to access it on demand. This is more complex than simply keeping multiple copies in multiple locations; it also entails the maintenance of processes and applications around the world, a function commonly called “redundancy.” Redundancy is achieved as data at any one data center is copied to other data centers in other locations. In case of a major disaster (fire, flood, hurricane, tornado, earthquake, network outages, etc.), any Amazon data center that goes offline quickly has its workload assumed by other Amazon data centers in other parts of the world. In most cases, such outages are invisible to users.
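For readers who like to see an idea expressed in code, here is a minimal Python sketch of redundancy. It is only a toy model, not Amazon's actual software: the DataCenter class, the store and fetch functions, and the file name are all invented for illustration. It shows a file being copied to several data centers and still being retrievable after one of them goes offline.

```python
class DataCenter:
    """A toy stand-in for one data center holding copies of files (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.files = {}


def store(data_centers, filename, contents):
    """Write the same file to every data center that is currently online."""
    for dc in data_centers:
        if dc.online:
            dc.files[filename] = contents


def fetch(data_centers, filename):
    """Return (contents, name of the first online data center that has the file)."""
    for dc in data_centers:
        if dc.online and filename in dc.files:
            return dc.files[filename], dc.name
    raise LookupError(f"{filename} is not available in any online data center")


centers = [DataCenter("tokyo"), DataCenter("oregon"), DataCenter("ireland")]
store(centers, "census-page-001.jpg", b"image bytes")

centers[0].online = False  # simulate the first data center going offline
contents, served_by = fetch(centers, "census-page-001.jpg")
print(served_by)  # oregon
```

The person asking for the file never needs to know which data center answered, which is the point of the exercise.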
Nothing is ever 100% perfect, but Amazon's record of keeping services running has been very good, even when entire data centers have been shut down. (The often-quoted figure of 99.999999999% is the durability that Amazon designs its S3 storage service to provide, meaning the likelihood that a stored file will not be lost; it is not a measure of uptime.) One example is the Tokyo data center, when a magnitude 9 earthquake hit Japan on March 11, 2011. The Tokyo data center was new at the time and only partially in use. It was knocked out of operation within seconds when the earthquake occurred. However, all data, programs, and web services being supplied by the Tokyo data center were moved to data centers in other parts of the world within minutes. In most cases, end users did not encounter any outages or inconveniences and remained unaware of any problem in one Amazon data center.
Anyone can use the servers in Amazon's data centers and pay only for the amount of disk storage, high-speed connections, and processing power consumed. Amazon's customers include large corporations, small businesses, government agencies, and private individuals. In fact, even you can use Amazon's Web services. All you need is a credit card and a few minutes of time to create an account. I have an account and make backups several times a day from my laptop and desktop computers to Amazon's S3 service, as do hundreds of thousands of other computer users all over the world. All data is protected off-site and is available to the person who uploaded it wherever the person might be, at home, at work, or while traveling. You could do the same, should you wish to do so.
In many cases, using Amazon's hardware, data centers, and support personnel is more cost-effective than buying your own hardware and hiring people. The cost savings can be especially important when the need for such services exists for only a few days or weeks. It also works well when a person or a company has modest needs: if you need to store a limited amount of data or need only one or a few web servers, you probably will find cloud computing to be much cheaper than purchasing your own servers and building a data center.
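The arithmetic behind that claim can be sketched in a few lines of Python. The prices below are hypothetical numbers chosen only to make the math easy; they are not Amazon's prices, and the comparison ignores electricity, staff, and building costs, all of which would make owning hardware look even more expensive.

```python
# Hypothetical prices, in cents to keep the arithmetic exact.
HOURLY_RENT_CENTS = 10         # assumed cost to rent one server for one hour
SERVER_PRICE_CENTS = 200_000   # assumed cost to buy one server outright


def rental_cost_cents(servers, hours):
    """Cost of renting the given number of servers for the given hours."""
    return HOURLY_RENT_CENTS * servers * hours


def purchase_cost_cents(servers):
    """Cost of buying the servers, whether they are used for a day or a decade."""
    return SERVER_PRICE_CENTS * servers


two_weeks = 14 * 24  # hours in a two-week project
rent = rental_cost_cents(servers=5, hours=two_weeks)
buy = purchase_cost_cents(servers=5)
print(rent / 100, buy / 100)  # 168.0 10000.0

# Hours of continuous use before renting costs as much as buying.
break_even_hours = SERVER_PRICE_CENTS // HOURLY_RENT_CENTS
print(break_even_hours)  # 20000
```

With these made-up numbers, a two-week project costs a small fraction of the purchase price, and a rented server would have to run around the clock for more than two years before buying came out ahead.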
Amazon also provides very high security for all the data it stores. In fact, most security experts consider data stored on Amazon's cloud servers to be more secure than storing the same data in home computers, where data is vulnerable to attacks from the Internet as well as to visitors.
Amazon Web Services can be used for almost any Internet-based purpose: web servers, mail servers, disk storage space, backup of data and processes, running almost any sort of application, or even displaying images of a census. Many well-known services use Amazon's cloud-based services, including Netflix (with thousands of movies stored in Amazon's disk space and available for downloads; see https://aws.amazon.com/solutions/case-studies/netflix/ for more information). Several genealogy societies, including the New England Historic Genealogical Society, use Amazon's web servers and disk storage instead of buying and staffing their own servers and data centers. The financial savings often are significant.
Do you have one of Amazon’s Echo devices? If so, every time you start by saying, “Alexa…,” you are communicating with Amazon Web Services (AWS)! All the computing and all the data reside within Amazon Web Services. The device in your home or automobile simply serves as a remote terminal that is connected to Amazon Web Services.
Amazon Web Services (usually called AWS) actually is a collection of several related Internet services. The better-known services include those described below:
Elastic Compute Cloud (EC2)
EC2, in its simplest form, is a collection of virtual machines. Instead of running computers in a company's own data center or at an individual's home, the computers are physically located in Amazon's data centers with constant backups being made to servers in other data centers. Each Amazon server can run Linux or Windows, and servers can be linked together. In fact, additional servers can be brought online and made operational within minutes, if needed. When the need goes away, the extra servers can be taken offline, disk drives erased, and the no-longer-needed servers then become available to other Amazon customers.
The applications, the adding of servers, the reduction in servers, and other system administration tasks can all be controlled by the customers' own systems administrators, who may be located thousands of miles from the data centers. Physical access to the computers being used is not required. In fact, Amazon's servers typically run in huge rooms with the lights turned off to save electricity.
In most cases, Amazon employees are not involved in installing or running the application activities; the customers' systems administrators perform administrative chores from their own offices, from their own homes, or even while riding a commuter train. The physical location of people and the physical location of servers both are irrelevant in cloud computing.
The “elastic” in Elastic Compute Cloud simply means that any application can be “stretched” – like an elastic band – to run on more than one server, even on thousands of servers, as needed.
To be sure, adding web servers is not an instantaneous process. Data and programs do have to be copied to the new servers. However, the time required is measured in minutes, not in days or months as would happen with the old-fashioned method of ordering new servers from a manufacturer, waiting for delivery, and then mounting those servers into racks in a privately-owned data center. Using cloud computing, any company can add thousands of web servers within minutes to handle the load.
This is the elastic in Elastic Compute Cloud: systems administrators can “stretch” computing power to fit the need. As the load decreases in the future, servers can be removed from the application, thereby “shrinking” the required hardware. Servers are available with different levels of storage and computing power. To measure customers' usage, Amazon refers to each virtual server as an “instance.” Each Micro instance, for example, comes with only 613 megabytes of RAM, while Extra Large instances can go up to 15 gigabytes. There are also other configurations for various processing needs.
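The stretching and shrinking described above boils down to simple arithmetic, sketched here in Python. This is an illustration only and not how Amazon's own tools are written: the servers_needed function, the capacity figure, and the hourly loads are all assumptions made up for the example.

```python
import math


def servers_needed(requests_per_second, capacity_per_server, minimum=1):
    """Smallest number of servers that can carry the load, never below the minimum."""
    return max(minimum, math.ceil(requests_per_second / capacity_per_server))


CAPACITY = 200  # hypothetical: requests per second one server can handle

# Hypothetical load over six hours: quiet, busy, a huge surge, then quiet again.
hourly_load = [150, 900, 12000, 45000, 3000, 100]

fleet = [servers_needed(load, CAPACITY) for load in hourly_load]
print(fleet)  # [1, 5, 60, 225, 15, 1]
```

The number of servers grows from 1 to 225 during the surge and then shrinks back to 1, and the customer pays only for the servers in use during each hour.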
Finally, Elastic Compute Cloud (EC2) instances can be deployed across multiple geographic locations of Amazon's data centers. Deploying multiple servers in different locations around the world increases redundancy and reduces latency (the delay before the screen changes after you click the mouse).
Elastic Load Balancing (ELB)
Load balancing is simply a fancy term meaning to “share the load equally.” If you have 1,000 servers running one application, the systems need to have the load distributed equally amongst all those servers. After all, it wouldn't be productive to have 500 idle servers plus 500 overloaded servers!
All the larger data centers practice load balancing. Amazon uses its own load balancing, called ELB, to balance the load amongst all its servers being used together on one application, even if those servers are in different data centers around the world. Here again, the elasticity of the operation simply means that administrators can stretch or shrink the workload to keep the distribution in balance.
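One common way to share the load is to send each new request to whichever server is currently the least busy. The short Python sketch below illustrates that idea. It is not a description of how ELB works internally; the distribute function and the server names are hypothetical.

```python
def distribute(request_count, servers):
    """Send each request to the server with the fewest requests so far.

    Returns a dictionary mapping each server name to its request count.
    """
    load = {server: 0 for server in servers}
    for _ in range(request_count):
        least_busy = min(servers, key=lambda server: load[server])
        load[least_busy] += 1
    return load


servers = ["server-a", "server-b", "server-c", "server-d"]  # hypothetical names
load = distribute(10, servers)
print(load)  # {'server-a': 3, 'server-b': 3, 'server-c': 2, 'server-d': 2}
```

No server ends up with more than one request above any other, so there are never idle servers sitting next to overloaded ones.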
Elastic Block Store (EBS) and Simple Storage Service (S3)
Block storage is essentially the same thing as disk storage. Amazon offers two kinds of storage: Elastic Block Store (EBS) for operating systems and for storing applications, and Simple Storage Service (S3) for storing data. Strictly speaking, only EBS is block storage; S3 stores whole files as “objects.” For most purposes, though, you can think of either one as the equivalent of a hard drive in your computer. It operates in much the same manner.
Files uploaded to S3 are referred to as objects, which are then stored in buckets. S3 storage is scalable, which means that the only limit on storage is the amount of money you have to pay for it. Amazon has petabytes available (one petabyte is equal to 1,000,000,000 megabytes). S3 storage is automatically backed up within seconds to other data centers in other locations.
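The bucket-and-object vocabulary is easier to picture with a small model. The Python sketch below is a toy that lives entirely in your own computer's memory; it is not the AWS programming interface, and the Bucket class, its methods, and the bucket and file names are invented for illustration. It also double-checks the petabyte arithmetic.

```python
class Bucket:
    """A toy model of a storage bucket: a named container of objects (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._objects = {}  # object name (the "key") -> file contents

    def put(self, key, data):
        """Store a file's contents under the given key, replacing any earlier copy."""
        self._objects[key] = data

    def get(self, key):
        """Return the contents stored under the given key."""
        return self._objects[key]

    def size_in_bytes(self):
        """Total size of everything in the bucket; billing is based on usage like this."""
        return sum(len(data) for data in self._objects.values())


backups = Bucket("my-laptop-backups")  # hypothetical bucket name
backups.put("family-tree.ged", b"0 HEAD\n1 SOUR example\n")
backups.put("photos/grandma.jpg", b"\xff\xd8 fake image bytes")

print(backups.get("family-tree.ged"))
print(backups.size_in_bytes())

MEGABYTE = 10**6
PETABYTE = 10**15
print(PETABYTE // MEGABYTE)  # 1000000000
```

Each object is simply a file stored under a name, and the bucket's total size grows or shrinks as objects are added or removed.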
I use Amazon's Simple Storage Service (S3) for making backups of my computers' hard drives. So do millions of other Amazon customers. The 1940 census images are also stored as S3 objects, as is information from Netflix, Dropbox, and other applications. If you have the proper passwords and access, you can retrieve files from S3 storage at any time and at any location.
So how can you use Amazon Web Services (AWS)?
It is easy to sign up for disk storage space on AWS. In fact, AWS even offers free accounts for one year with a limited amount of storage space. Start at https://portal.aws.amazon.com/billing/signup#/start. The free “starter package” can be expanded to a paid service at any time without interruption to data already residing on AWS.
Another feature that I like is that you only pay for the disk space you use with AWS. Many other disk storage services require you to purchase “blocks” of storage space in advance of using it. For instance, if you wish to safely store 627 gigabytes of files, many services will require you to first purchase 1,000 gigabytes (one terabyte) in advance in order to have sufficient space. AWS is different: if you store 627 gigabytes of files, you only pay for 627 gigabytes of file space. No more.
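The 627-gigabyte example can be worked through in a few lines of Python. The price per gigabyte is a hypothetical figure picked only to illustrate the difference between the two billing methods; check the pricing pages for real numbers.

```python
import math

CENTS_PER_GB = 2  # hypothetical monthly price per gigabyte, in cents


def pay_per_use_cents(gigabytes_stored):
    """Bill for exactly the space used."""
    return gigabytes_stored * CENTS_PER_GB


def prepaid_block_cents(gigabytes_stored, block_size_gb=1000):
    """Bill for whole blocks, rounding up to the next block."""
    blocks = math.ceil(gigabytes_stored / block_size_gb)
    return blocks * block_size_gb * CENTS_PER_GB


stored = 627
per_use = pay_per_use_cents(stored)
prepaid = prepaid_block_cents(stored)
unused_gb = 1000 - stored

print(per_use / 100)   # 12.54
print(prepaid / 100)   # 20.0
print(unused_gb)       # 373
```

With block pricing, 373 of the 1,000 gigabytes are paid for but never used.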
You can learn more about AWS’ pricing at https://aws.amazon.com/pricing/ and especially with the AWS Pricing Calculator at https://calculator.aws/#/.
The major downside of using Amazon Web Services is that Amazon only provides the disk space. You will need to obtain additional software to install on your computer to send and retrieve files stored in AWS. In most cases, Amazon doesn’t provide that software. Luckily, many third-party products work with AWS. Prices range from free to very expensive for products designed for corporate use, but most of the products used by individuals are modestly priced.
Backup products to be installed on your computer that will communicate with Amazon Web Services include Arq for Windows and Macintosh (my favorite); CloudBerry Backup for AWS for Windows, macOS and Linux; Druva CloudRanger; Duplicati for Windows, Linux, and Macintosh (free but a bit complicated); CyberDuck for Windows and Macintosh; MountainDuck for Windows and Macintosh; Transmit; Forklift2; and many others. Perform a search for any of those products on your favorite search engine to learn more.
One product deserves special mention: MountainDuck for Windows and Macintosh is a $39 US product that configures AWS as a remote disk drive in Windows Explorer or Macintosh Finder. You then can use AWS as a multi-petabyte disk drive connected to your computer. (1 petabyte is one quadrillion bytes.) That should be sufficient for most home users! Again, with AWS you only pay for the actual amount of disk space you are using.
I use MountainDuck on my Macintosh and Windows computers and never worry about running out of disk space! You can learn more at: https://mountainduck.io/.
Summation
The various pieces of Amazon Web Services and other cloud computing providers all work together in harmony to provide “computing services on demand.” Whether computer power and storage are needed for a few hours or for a few years, cloud computing services are always available for the work. Pricing is based upon usage: the company or the individual pays only for the amount of computing power and storage space used. In many ways, this is the same operating philosophy as that of your local electric company. Indeed, cloud computing also is sometimes referred to as “utility grade” computing: available whenever you need it and billed only as you actually use it. Unlike companies that run their own data centers, cloud computing customers never pay for idle servers or for purchasing hardware to accommodate future growth.
If you want to back up a few files, or if you wish to serve census images to millions of genealogists, cloud computing may be the best solution.
All of this is a rather simplified explanation of cloud computing. Actually, there are more pieces and more buzzwords involved, such as DynamoDB, Route 53, Elastic Beanstalk, and other features that would take much longer to detail here. I suspect those details will interest only systems administrators. I believe I have covered the basics that will be relevant to most end users.