
Google Cloud Platform and Machine Learning Specialization (Coursera)

1. Products and Use Cases where ML is used are discussed in the course.
2. Compute and Storage -> Processing Data
3. Decision whether to move to GCP

5. MySQL databases are hosted on GCP as Cloud SQL.
6. Dataproc is the managed Hadoop/Spark service which can be used to process and manipulate data.

8. What is GCP (Google Cloud Platform)? To organize the world’s information, Google has created
the most powerful infrastructure on the planet.
9. Edge Locations – A cached version of a resource is maintained at the edge locations. The
data centres contain the main data resources, while the edge locations maintain cached
versions of the same resources.
10. Colossus is Google’s successor to GFS (the Google File System); it is the base file system used
in Google Cloud Platform. Colossus takes large datasets, breaks them into smaller pieces, and
distributes them across various storage resources.
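The split-and-distribute idea can be sketched in a few lines of Python. This is an illustrative toy, not Google's actual implementation: the chunk size, node count and round-robin placement are all invented for the example.

```python
# Toy sketch (assumed behaviour, not the real Colossus): break a large blob
# into fixed-size chunks, spread the chunks across storage nodes, and show
# that the original data is recoverable from the placement.
def distribute(data: bytes, chunk_size: int, num_nodes: int):
    """Split data into chunks and assign each chunk to a node round-robin."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    placement = {node: [] for node in range(num_nodes)}
    for idx, chunk in enumerate(chunks):
        placement[idx % num_nodes].append((idx, chunk))
    return placement

placement = distribute(b"abcdefghij", chunk_size=3, num_nodes=2)

# Reassemble by sorting the chunk indices gathered from all nodes.
reassembled = b"".join(
    chunk for _, chunk in sorted(p for node in placement.values() for p in node)
)
print(reassembled)  # b'abcdefghij'
```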
11. Dremel and Flume are Google’s internal counterparts of MapReduce. They perform the
complete tasks of MapReduce as it is done in Hadoop.
12. Flume is part of Dataflow
13. MapReduce is limited by the number of compute nodes that can be used.

14. Auto-scale the cluster, distribute the compute and distribute the storage
15. No-Ops means that Google Cloud itself knows how to execute a task: it identifies the
location of the data resources and processes them in the compute environment, without the
user managing operations.


18. (Rooms To Go) Customer purchase data and navigation history are used to identify
customer purchase patterns and to create new packages, which were then sold to customers.
19. The compute may not happen at the individual virtual-machine level; computing happens at a
much higher level, with the flexibility to use compute capacity as per our requirement.
20. A preemptible machine is a machine on which the customer gets a deep discount (of the
order of 80%), but in exchange the customer agrees to give up the machine if another
customer is willing to pay the full price for it.
21. There are Standard VMs and Preemptible VMs.
22. A machine might be part of a Hadoop cluster, which can manage a failover scenario: in case a machine goes down, the other machines in the cluster can take over the tasks that were performed by that machine. By taking preemptible machines along with standard machines when creating a Dataproc Hadoop cluster, a company drops its costs significantly.
23. “gsutil” is a command-line utility to copy data from the local machine onto the compute engine in the cloud.
24. In case of SSH tunnelling, the connection is reset every 60 minutes, so the connections will need to be re-established every 60 min.
25. Cloud Shell is an ephemeral machine which is used to access the Google Cloud Platform.
26. Cloud Storage is a key-value storage system with a wrapper of hierarchy which reflects the folder structures of file systems.
27. Cloud Storage acts as a staging area for data, from where the data can be used by various cloud applications such as Compute Engine: the data required for processing on the compute engine can be stored in and retrieved from Cloud Storage. It is cheaper than persistent disk storage. Other options, like the disks on compute engines, can be risky since, if the compute engine is taken away, so is the disk attached to it.
28. GCS is not useful for real-time data storage; frequent reads/writes are not a speciality of GCS, and it should not be used for anything that requires high throughput.
29. A REST API is available to transfer content to and from Cloud Storage.
30. Cloud Storage buckets can be mounted onto Compute Engine instances as persistent storage volumes. The instructions for the bucket mount and access from a compute engine can be found at the following: https://github. https://cloud.
31. Every object in a bucket has a URL which can be shared and accessed by users logged into their Google account.
32. Every bucket belongs to a project in GCP, and the billing is done at the project level in Google.
33. The location for caching the data in Cloud Storage is not controlled by the user; the system itself takes care of caching the data in edge locations, based on the locations from where the data is accessed frequently.
34. Region > Zone.

36. Google has a managed services facility which allows companies to request help from Google in sorting out the issues the company faces on the front end.
37. Google App Engine allows users to run their programs directly, with integration with Google Cloud. App Engine is where the program is deployed for the world to be able to reach the application.
38. In case the application must be run on Tomcat itself, Google suggests that the Tomcat application be deployed in Docker containers, which can be managed by App Engine in case of any service issues.
39. App Engine is ideal for greenfield development rather than legacy systems.
40. This was the basics of Google’s application hosting. We must understand Google Cloud as a service which provides a complete environment that can manage all the tasks of a company on its own infrastructure. The independent applications of each company can be deployed just on a compute engine and made to function from there, but Google Cloud offers a much larger perspective of a system working on cloud, using various components of Google Cloud to perform tasks without the requirement of a specific application server.
41. GCS is a geo-redundant storage with high availability and performance. It is ideal for low-latency and high-QPS content and is well suited for reading content from a single storage location at high availability; however, it is not suited for real-time data exchanges.
42. Nearline (access about once a month) and Coldline (about once a year) are cloud archival solutions. These are counterparts of Amazon Glacier in AWS.
43. Cloud Pub/Sub is a messaging service which doesn’t need a server to execute the messaging services.
44. Cloud Launcher is a platform that allows the user to use VM instances which already have the required software installed on them.
45. There is a price estimator for calculating the total monthly charges based on the usage specification of the company.

47. For ML, the recommendation engines identify signals such as the following:
   a. Ratings given by a user to a house
   b. The time spent by a user on a specific page
   c. The links on which the customer has clicked
48. There will be 2 users who rate a house at the same level, and there will be houses which are not so great and which all the users have rated 1 or 2. The rating is predicted based on such similarities, and the top 3-5 rated items are extracted from the model and suggested as recommendations.
49. Cloud SQL is a managed MySQL platform.
50. A few hundred GB of data is ideal for management using Cloud SQL.
51. Data management is done using Python Spark (PySpark) on Dataproc.
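The similarity idea above can be sketched as a toy nearest-neighbour recommender. This is a minimal illustration, not the course's actual model; the user names, house ids and ratings are all invented.

```python
# Hypothetical ratings {user: {house_id: rating}}: recommend unseen houses
# from the user whose past ratings agree most with the target user's.
ratings = {
    "alice": {"h1": 5, "h2": 1, "h3": 4},
    "bob":   {"h1": 5, "h2": 1, "h4": 5},
    "carol": {"h1": 1, "h2": 5},
}

def similarity(a, b):
    """Negative mean absolute rating difference over commonly rated houses."""
    common = set(a) & set(b)
    if not common:
        return float("-inf")
    return -sum(abs(a[h] - b[h]) for h in common) / len(common)

def recommend(user, k=3):
    """Suggest up to k unseen houses from the most similar other user."""
    others = [u for u in ratings if u != user]
    best = max(others, key=lambda u: similarity(ratings[user], ratings[u]))
    unseen = {h: r for h, r in ratings[best].items() if h not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)[:k]

print(recommend("alice"))  # ['h4'] — bob rates most like alice, so his unseen house is suggested
```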

53. Cloud SQL simplifies management of the database.
54. In Cloud SQL, the authentication section is checked every time some system tries to access Cloud SQL. The system’s IP address must be added to the authorized list every time a user tries to access through a new Cloud Shell (because the Cloud Shell IP is different every time).
55. Google Dataproc allows you to integrate Google Cloud Storage with Hadoop clusters for data processing using Spark, Hive, etc. The benefits are listed below:
   a. Instead of keeping compute machines up to store the data after processing, the data can be stored directly in Cloud Storage. This reduces the pricing, since machines are not required to run 24x7.

   b. The integration with Cloud Storage reduces the cost of maintaining the machines and the amount of I/O from the compute machine to various databases.
   c. The Dataproc instances can be started just for performing the compute and shut down once the jobs are completed, saving on the costs.
   d. The hassle of managing the infrastructure is reduced.
   e. The security offered by Dataproc on the data is the same as the security provided by Cloud Storage, since the data is ultimately stored there, reducing the cost of securing the compute engine environments.
58. Relational databases (Cloud SQL) have relatively slow writes, and thus high-throughput writes are not possible. They can work only on data at the level of a few hundred gigabytes.
59. To access Cloud SQL, the Dataproc machines must have permissions and authorization to access Cloud SQL. The systems must be updated so that they have authorization to access the other components in the cloud environment.

62. Cloud SQL persistence of data is a part of customization and is not a default feature of Cloud SQL.
63. Datastore allows transactional exchanges of structured data. It allows all the operational functions possible with an RDBMS (create, read, update, delete), but it is not a relational database; it is designed to be extremely efficient in managing real-time transactional data.
64. The Datastore can hold data up to terabytes. It allows you to store data as objects; each object is stored as a key-value pair and has a key. It uses indexing of information to store data in an easily retrievable manner. The data can be searched in 2 ways: using the key and using properties of the object.
65. While reading data from Datastore, the items (objects) are returned in the form of iterables, because a single query on TB of data can respond with GB of data, and returning a list of items would consume that many GB of space in memory.
66. For streaming data and high-throughput needs, Bigtable is used. The capacity of Bigtable is petabytes, to manage real-time stream data. There is no transactional support in Bigtable, and it cannot be efficiently used for aggregations and other batch operations on data.
67. The search in Bigtable is based on just the key; the data is stored in the form of key-value pairs.
68. Bigtable tables are recommended to be tall and narrow. For example, in case there are 5 Boolean columns, it is suggested that instead of storing the 5 variables as columns, the data be stored as the list of variables which are true. Instead of saving the data as

   key    Bool1  Bool2  Bool3  Bool4  Bool5
   1011   T      T      T      F      F

   save the data as

   key    Bool_Column
   1011   Bool1, Bool2, Bool3

69. In case of an update, a new row is added to the table, and Bigtable takes care of providing the latest data in case of any read requests. This is done by reading data from the bottom, such that the latest data is retrieved first.
70. The importing and functioning of data from Bigtable is managed using the HBase methods.
71. BigQuery is a superfast data processor. It can handle data of the size of petabytes.
72. BigQuery can access CSV and JSON data without ingesting them into BigQuery; however, for processing, the data needs to be ingested into the BigQuery warehouse.
73. Data can be streamed in using Cloud Dataflow, and while the streaming is going on, the processing can be done using BigQuery.
74. Notebook development: Datalabs inherently follow a server-and-client system, so the server must be hosted somewhere for others to be able to access the URLs. The 3 scenarios where the servers can be hosted are:
   a. Local machine (on your laptop) – if the laptop is shut, the access is stopped.
   b. On Compute Engine – the setup is not cost effective.
   c. Gateway VM – the servers are installed in a VM using the already existing Cloud Launcher VMs. In this case an SSH tunnel needs to be established from Cloud Shell to connect with the machine.
75. BigQuery can be run from Datalab as well, which allows the user to work with Python and BigQuery in parallel in code.
76. Each Datalab notebook has a web URL for sharing the draft code to collaborate with team members.

78. BigQuery is awesome!
79. Dataflow – serverless data pipelines using Apache Beam. Each step of a Dataflow pipeline is a class in Java. It is a No-Ops data pipeline.
80. The ParDo option helps execute a task in parallel on all rows of a table. Dataflow creates multiple engines to compute the data in parallel; these machines need not be created beforehand, as Dataflow manages their creation on the fly.
81. The information can be read from a file or any source, including Cloud Storage.
82. Dataflow is primarily made to process historical data and the real-time data coming in from different sources. It can massage the data in the same way for both and send it to BigQuery for scoring and testing the ML models.
83. Asynchronous processing is a way of managing the processing system separately from the data-receiving system.
84. Pub/Sub is the system which manages this message processing. It essentially introduces a messaging-based queueing service to manage the scaling and expansion of resource usage in case of escalated usage.
85. Pub/Sub has topics, where each topic can be held responsible for managing a specific set of functions. Modular management of different tasks can be easily handled by Pub/Sub.
86. Cloud Dataflow allows the system to keep reading the information from Pub/Sub for a specific amount of time.
87. TensorFlow is a C++-based framework to perform numeric processing. It is a way to perform machine learning modelling and is extremely efficient for machine learning on neural networks.
88. The Machine Learning APIs by Google are trained on Google’s data. They can be used directly instead of creating the complete ML models again.
89. ML Engine also allows jobs to be deployed from models. It requires the models to be created and trained, and then jobs can be deployed for real-time execution.
90. The charges apply when the data is leaving a zone, not when it is being added to a zone. This plays an important role, since the cloud storage and the compute engine must be in the same zone to nullify this cost. Hence it is advised that the data location is the same as the compute location.
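A ParDo-style step — one function applied independently to every element — can be loosely simulated with a thread pool standing in for Dataflow's parallel workers. This is only an analogy, not actual Apache Beam code; the rows and the transform are invented.

```python
# Each element is "massaged" by the same per-element function; because the
# elements are independent, the work can be spread across parallel workers.
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    """Per-element step: clean one row at a time."""
    return {"name": row["name"].strip().lower(), "score": row["score"] * 2}

rows = [{"name": " Ada ", "score": 3}, {"name": "Linus", "score": 5}]
with ThreadPoolExecutor(max_workers=4) as pool:
    out = list(pool.map(transform, rows))  # order of results is preserved
print(out)  # [{'name': 'ada', 'score': 6}, {'name': 'linus', 'score': 10}]
```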

92. A custom Dataproc cluster can be created using the CLI or the graphical interface. There is an option to select the versions of the software on Dataproc.
93. There are 3 ways of setting up a Dataproc cluster:
   a. Single machine cluster – all the compute is done in the single machine.
   b. Single master node – there is just 1 master node and other worker nodes. In this case, if the master node dies, the complete system is broken.
   c. High-availability system with 3 master nodes – there is a load balancer in this system which manages the information processing on the 3 masters and their respective worker nodes.
94. The minimum number of worker nodes in a cluster is 2.
95. YARN is the management platform that manages compute resources on the Hadoop cluster; HDFS is the Hadoop file system used for persistent storage.
96. Pig is used for transformation of data.
97. It is suggested to use preemptible nodes as additional worker nodes for clusters. A system can have 1 master, 3 standard worker and 7 preemptible nodes in a cluster. The preemptible nodes added to a cluster have the same configuration as the other worker nodes, but there is no primary disk assigned to these nodes, since they are not a part of HDFS. The master node in Dataproc manages the joining and leaving of preemptible machines.
98. A preemptible node is valid only for a maximum of 24 hours (it may go away in less time) and can shut off based on the demand for compute engines. Each preemptible machine, when leaving, has 30 seconds to complete the job it is assigned at that moment, report back to the master, and then shut down.
99. The pricing for preemptible machines starts after 10 min of running the machine. If the machine is lost within the 10 min, then there is no charge applicable on the machine. Thus the company benefits from such nodes, since they reduce costs drastically.
100. It is difficult to find and get the larger machines as preemptible, since there are only limited numbers and every other person on the cloud is looking for such a machine; hence the lifetime of such a preemptible machine may not be long.
101. Standard machines for Dataproc are more cost efficient than the custom machines. The CLI name for a custom machine is custom-6-30720 for 6 CPUs and 30 GB RAM (30 x 1024 = 30720).
102. The usage of the HDFS file system for Dataproc compute is not suggested, because it doesn’t allow Dataproc to dynamically create and delete clusters; every new cluster config would require reconfiguring the connections to enable input/output of information.
103. A shard is a horizontal partition of a database table, such that each shard can be stored in a different table in a different database, even on completely different hardware.
104. Usually the zones in the same region are present in different buildings in the same facility. The bandwidth between zones in different regions is not as good as the bandwidth between zones in the same region.

106. The benefit of sharding data is that it reduces the amount of indexed content in each table, thus further reducing the time taken for searching data from a table.
107. The Google Cloud Platform infrastructure tries to isolate compute from storage through the Dataproc component. Reasons:
   a. When a Dataproc setup needs to be scaled out, the data needs to be copied again onto all the new nodes added into the system.
   b. Copying data every time onto the new machines, and keeping them as persistent storage units, completely beats the point of using Dataproc services.
108. Dataproc fundamentally eases the process of creating new compute engines whenever needed to tackle high loads, and deleting the engines when the processing is finished. With Dataproc, the option to utilize persistent Cloud Storage buckets instead of the HDFS file system is available: the reference can be changed from hdfs:// to gs:// to point to the appropriate Cloud Storage buckets.
109. Colossus is a global file system of petabyte scale. It lays the foundation for Google Cloud Storage buckets.
110. All the installations and infrastructural configurations are done on both master and worker nodes in the Dataproc cluster. This is managed by GCP itself, so that the tasks need not be repeated on every new node.
111. There is a way of installing just on the master or just on the worker nodes, by writing an external script. The script recognizes the master or worker nodes using the node’s role attribute, which is received from the metadata servers (the role is reported as “master” on the master node).
112. Dataflow and BigQuery can be used to decouple the storage and compute components of a Dataproc system. Apache Beam can be configured to run processes in a stateless manner; this can help further automate the processing capabilities in a completely serverless environment.
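The sharding notes above can be sketched as hash-based routing of rows to shards. This is a minimal sketch with an invented shard count, key function and rows.

```python
# Route each row to one of N shards by hashing its key, so every shard holds
# (and indexes) only a fraction of the rows.
NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Stable shard assignment; a byte sum keeps the demo deterministic."""
    return sum(key.encode()) % NUM_SHARDS

shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["u1", "u2", "u3", "u4", "u5"]:
    shards[shard_for(user_id)].append(user_id)

# Every row landed in exactly one shard.
total = sum(len(rows) for rows in shards.values())
print(total)  # 5
```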

120. Thus. 129. Cost of storage in Bigquery is the same as cost of storage in Cloud Storage. the payment plans can be selected for different functions. A bigquery dataset can have tables and views. There can be a huge machine learning algorithm running in spark which might need to use bigquery data and thus it can communicate and interact with bigquery to generate output for the algorithm. 122. the file references must be separated by commas. The pricing is also related to the number of columns that are contained in the data. An aggregated information from BigQuery cannot be imported into Pandas data frame because the information processed by pandas must be in-memory and the aggregated output from big query is most unlikely to fit into the memory of the dataproc cluster machine. 128. 115. Every column of data in bigquery is stored in a separate file. This initiation command can be stored in the google cloud storage and can be triggered as soon as the cluster is up and running. 118. Thus. These properties are stored in a file called core-site. The metadata server keeps a note of all the existing master and worker nodes in dataproc. Bigquery data can be shared across the company. 123.xml which can be modified using commands from the Gcloud SDK. The initialization actions can be mentioned in the CLI and GUI while creating the dataproc. 127. The access is managed at the dataset level. There is google git repository with a set of commands to install open source software on newly created master nodes. In case of multiple files. The location of initialization scripts can be mentioned as a part of the command line or in the console to be executed as soon as the dataproc is created. GCP also allows you to change the properties of complete cluster and not just the nodes. 126. metadata server manages the awareness of new worker nodes at the master node level. Bigquery is near real-time data storage system which can be used for ad-hoc analysis. 
Hence collecting a subset of bigquery data makes more sense than an aggregated data from bigquery. Big query is efficient if the data is de-normalized 121. 125. 114. the tasks are distributed to the new worker nodes as well. Spark code can process bigquery data. one must reduce the number of columns in the data tables. Bigquery is meant for immutable large datasets. Partitions can be used but they are not a part of default establishment of Bigquery. Based on the amount of usage. Each file must be referenced during the execution. 117. There are both flat payment and pay-per-use payment plans for bigquery. In case of addition of preemptible nodes into the cluster. The storage of data in bigquery is columnar. 119. 124. Any other script can also be created and run as an initializing script. The output data can be exported into Google sheet or tables anywhere else. there are no indexes or keys to manage the data.113. Bigquery is not a transactional storage. . to reduce the pricing. Thus. 116. The payment for Bigquery is made on the bases of data that we process. Somebody who has access to the data can utilize data to perform ad-hoc analysis on the same.

130. bq command lines can be used from Cloud Shell to create a new dataset in BigQuery.
131. The LPAD command in BigQuery does a left padding for text content. For example, if the month is 2 and it is concatenated into a date, then the padding would result in YYYY-02-DD.
132. The tables in BigQuery can be joined based on functions. For example, STARTS_WITH(a, b) gives as output all the entities of a which start with b.
133. The end of a regular expression is marked by a $.
134. A query can be assigned as a named subquery and referred to in another query by using WITH as the keyword. For example: WITH xyz AS (SELECT * FROM abc) SELECT * FROM xyz WHERE pqr.
135. BigQuery has options like LEAD(), LAG(), NTH_VALUE(), and there are also functions like ROW_NUMBER(), RANK(), etc. With the RANK() function, the keyword OVER is used.
136. ARRAY/STRUCT create arrays of key-value pairs in BigQuery. The struct helps in untangling the arrays using the keyword UNNEST.
139. Optimization of a query can be managed by the following:
   a. To optimize the query, select only the required columns from the table.

   b. Join the biggest tables first, then the smaller ones, and so on.
   c. Built-in functions are faster than UDFs.
   d. The wildcard tables function helps union all the tables starting with a certain pattern of names.
   e. Order the outermost query instead of any inner queries.
   f. Partitioned tables (based on timestamp) can be processed faster in bq.
   g. APPROX functions are internal bq functions (which are extremely fast) which can be used in cases where an approximate number can work. The error value of the APPROX functions is about 1%.
140. Billing: loading data into bq and exporting data from bq is free, but there is a cost for streaming ingestion of data into bq. Data processed beyond 1 TB is charged. Any cached query is free of cost (the cache is maintained per user), and any query that results in an error is free.
141. Stackdriver allows you to monitor BigQuery from a GUI-based interface. The wait time shown is a relative value: if the wait value is 100%, then all the other functions are measured relative to the wait time.
142. Dataflow allows the management of data pipelines. For example, Dataflow can be used to transfer data from BigQuery to Cloud Storage through a series of transform steps. If the data that must be transferred is huge, Dataflow must be configured to manage the pipeline tasks; it has the capability to autoscale, run the transform logic on multiple parallel servers, and create the output in Cloud Storage.
143. Without sharding, the output content is stored on a single machine, which creates I/O issues.
144. The tasks of the pipeline must be executed just by the main script, which takes care of the subsequent flows. The main runs on the local server; to execute main in the cloud, Dataflow must be configured to manage the pipeline tasks.
145. Use “mvn” to execute a Java program. It downloads or includes all the dependencies for the code flow.
146. Parallel Collection – a PCollection is a list of items which is not bounded (it need not fit into memory). Data in a pipeline is represented as a parallel collection, and the transformations are applied to parallel collections (as input and output of the transformation). A PCollection can be used to pass a list of strings.
147. ParDo acts on 1 item at a time, like map in MapReduce, but ParDo allows parallel processing.
148. The Combine operation and GroupByKey perform similar operations; Combine is preferred as the first choice for any of the aggregation tasks.
149. While reading from Pub/Sub, the aggregate functions must be run by applying a window; thus you get a moving average in the case of mean.
150. A model is a mathematical function which takes an input and provides an output based on training.
151. An N-dimensional array is a tensor.
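The WITH and RANK() ... OVER constructs from the notes above are standard SQL, so they can be tried locally with Python's sqlite3 module (the dialect differs from BigQuery's, and window functions require SQLite 3.25+). The table and data here are invented.

```python
# A named subquery via WITH, plus RANK() with the OVER keyword, run locally
# in SQLite as a stand-in for BigQuery.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abc (name TEXT, score INT)")
conn.executemany("INSERT INTO abc VALUES (?, ?)",
                 [("a", 10), ("b", 30), ("c", 20)])

rows = conn.execute("""
    WITH xyz AS (SELECT * FROM abc)           -- named subquery via WITH
    SELECT name,
           RANK() OVER (ORDER BY score DESC)  -- RANK() needs the OVER keyword
    FROM xyz
""").fetchall()
print(rows)  # [('b', 1), ('c', 2), ('a', 3)]
```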

154. The machine learning model is only as good as the training data used to train the model.
155. In machine learning, as opposed to statistical modelling, if there are outliers then the user must go search for more instances of such outliers, to train the model to learn about the occurrence of such outliers and include them within the ML model.
156. The training data must also have a good number of negative examples, to train the model to identify if a certain type of input is negative.
157. A neuron is the sum of weighted inputs.
158. Gradient descent is the method used to identify w1, w2 and b. If the mean square error goes down, it means that we are closer to the finer model. This is more like fine-tuning the model to give better results.
159. By changing the value of the weights, new learning is achieved from the same set of input variables.
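Gradient descent finding w1, w2 and b can be shown on a toy linear model by repeatedly stepping against the gradient of the mean squared error. This is a minimal sketch; the data points, learning rate and iteration count are invented for the example.

```python
# Toy data generated from y = 2*x1 + 3*x2 + 1; gradient descent should
# recover weights close to w1=2, w2=3, b=1 by lowering the MSE.
data = [((1.0, 2.0), 9.0), ((2.0, 1.0), 8.0), ((3.0, 3.0), 16.0)]

w1 = w2 = b = 0.0
lr = 0.02
for _ in range(20000):
    g1 = g2 = gb = 0.0
    for (x1, x2), y in data:
        err = (w1 * x1 + w2 * x2 + b) - y      # prediction error
        g1 += 2 * err * x1 / len(data)          # dMSE/dw1
        g2 += 2 * err * x2 / len(data)          # dMSE/dw2
        gb += 2 * err / len(data)               # dMSE/db
    w1, w2, b = w1 - lr * g1, w2 - lr * g2, b - lr * gb

mse = sum(((w1 * x1 + w2 * x2 + b) - y) ** 2 for (x1, x2), y in data) / len(data)
print(round(mse, 6))  # MSE ends up near zero
```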

161. The error value from the model must be identified or measured with MSE, to iterate and create a better model.
162. Thus, the model that gives the lowest MSE is the best model for the problem.
163. For a classification problem, the error measure is called cross-entropy.
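The cross-entropy error for a binary classifier can be written in a few lines; it penalizes confident wrong predictions far more heavily than MSE does. The labels and probabilities below are invented.

```python
# Mean binary cross-entropy over paired labels (0/1) and predicted
# probabilities; probabilities are clamped to avoid log(0).
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

good = cross_entropy([1, 0], [0.9, 0.1])
bad = cross_entropy([1, 0], [0.1, 0.9])
print(good < bad)  # True: well-calibrated predictions give a lower error
```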

165. Model evaluation metrics:
   a. Accuracy – the number of times the model has identified the label correctly. Accuracy is used when the outcomes are balanced, i.e. there are roughly 50% chances of each outcome.
   b. Precision – if the model predicts a certain label to be positive, then how often is the model correct: TP/(TP+FP).
   c. Recall – out of all the correct objects, how many did the model recognize correctly: TP/(TP+FN).
166. Root Mean Square Error – the root of the MSE.
167. Hyperparameter evaluation.
168. TensorFlow – it’s a library which allows one to do numeric computations. It is a framework, like various Python libraries, which can perform complex machine learning modelling operations. It is an open-source framework which can be downloaded onto the local system and used for ML.
169. TensorFlow represents the code as dataflow graphs which can be run on different hardware platforms. It provides portability, like writing Java code which runs on a variety of hardware components: TensorFlow allows one to train the ML model on GPUs and run it on a mobile phone.
170. TensorFlow can be run at scale in Google Cloud using Cloud ML Engine.
171. There is a component called TensorBoard which is used to visualize ML models built using TensorFlow.
172. The operation of 2 tensors happens only in a TensorFlow session.
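The metric definitions above can be written directly as code; the confusion counts below are invented to exercise the formulas.

```python
# Accuracy, precision and recall from raw confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # fraction labelled correctly
    precision = tp / (tp + fp)                  # of predicted positives, how many are right
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    return accuracy, precision, recall

acc, prec, rec = metrics(tp=8, fp=2, tn=85, fn=5)
print(acc, prec, rec)  # 0.93 0.8 0.6153846153846154
```

Note how accuracy looks excellent here while recall is mediocre — exactly why accuracy alone is only trusted when the outcomes are balanced.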

175. Thus, the remote execution of a tensor graph is possible. This allows the user to create a graph in one place and run the graph in a different place.
176. To implement operations on tensors, one must set up placeholders for values and then insert values into the respective placeholders.

179. There is an Experiment class in TensorFlow which allows the user to create a robust modelling system. It allows the user to assign the metric on which to base the iteration trigger cycle.
180. Hyperparameter tuning.
181. A TensorFlow model can be packaged as a Python module for scaling out the execution of TensorFlow in real time.
182. Use a single-region bucket for ML training data. The ML modelling process will occur in a single region, and hence the region must be maintained consistently.
183. In machine learning, missing data is not imputed from the available data as is done in statistics. ML requires default values for the predictor variables input into the model; in the case of one-hot encoding, there is a column for cases where the data is not provided.
184. While modelling, it is a good practice to keep the data coming from multiple sources synced in time. For example, if the transaction data used for prediction runs up to 3 days ago, then the customer information data should also be used as it was 3 days ago. This maintains the consistency of the model.
185. Tight coupling between the sender and receiver of a certain message in stream processing can create issues in case a component (sender or receiver) crashes. Thus, to manage messages in a loosely coupled system, a message bus is used to buffer the messages.
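The one-hot-with-missing-column idea above can be sketched as follows. The categories are invented; the extra trailing indicator is the "data not provided" column the notes mention.

```python
# One indicator column per known category, plus a final column that acts as
# the default for missing or unseen values (no statistical imputation).
CATEGORIES = ["red", "green", "blue"]

def one_hot(value):
    """Return one indicator per category plus a trailing 'missing' indicator."""
    if value not in CATEGORIES:
        return [0] * len(CATEGORIES) + [1]   # default column for missing data
    return [1 if c == value else 0 for c in CATEGORIES] + [0]

print(one_hot("green"))  # [0, 1, 0, 0]
print(one_hot(None))     # [0, 0, 0, 1]
```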

189. Thus, Pub/Sub is the solution for buffering messages between loosely coupled components.
190. Smaller packets are going to be written faster than larger packets; the data will be jumbled up, and latency is involved in the data-ingestion part of the system.
191. Any kind of stream processing on stored data increases the latency.
192. BigQuery can process streaming data while it arrives into the warehouse.
193. For very high throughput and very low latency, Bigtable is a more appropriate solution.

195. Pub/Sub primarily resolves the variable input data speed. It guarantees delivery of a message at least once.
196. 7 days is the maximum time duration for which the data is saved in the message buffer in Pub/Sub.
197. A message published before the existence of a subscriber will not be delivered to that subscriber once it is created on the topic.
198. Pub/Sub supports both push and pull for subscribers. Pull has delays, because the subscriber function works periodically, while push is faster, with near-zero latency.
199. If each message is sent separately, the cost incurred for passing the messages is higher. By publishing the messages as batches, the network costs can be reduced.
200. The continuously arriving data can be out of order; to manage this shuffled data, the processing can be done based on windowing.
201. Windowing the data helps in organizing the data and distinguishing between the processing time and the event time. For example, all data arriving before 9 am can be processed together.
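Event-time windowing can be sketched as follows: messages arrive out of order, so they are grouped by the timestamp carried in the event (not by arrival order), and an aggregate such as a mean is computed per window. The events and the one-hour windows are invented for the example.

```python
# Group out-of-order events into event-time windows, then aggregate each
# window, echoing the windowed-mean (moving average) idea from the notes.
from collections import defaultdict

events = [  # (event_time_hour, value) — note the out-of-order arrival
    (9, 10), (8, 4), (9, 20), (8, 6),
]

windows = defaultdict(list)
for hour, value in events:
    windows[hour].append(value)   # window key = event time, not arrival order

means = {hour: sum(vals) / len(vals) for hour, vals in windows.items()}
print(means)  # {9: 15.0, 8: 5.0}
```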

204. Dataflow knows whether all the messages have been received from Pub/Sub for a window.
205. A watermark tracks the time difference between the processing time and the event time of a message.
206. Cloud Spanner must be used if one needs global consistency of data. It enables horizontal scalability with the addition of more and more nodes.
207. Data Studio is a dashboard tool from Google which can be connected to various GCP sources. It helps one build charts, maps, etc.

210. Bigtable separates processing and storage by storing the data in the Colossus filesystem (cloud storage) and processing the information on nodes. It uses pointers to the data to compute the information.
211. Bigtable information is stored in the form of column families.
212. While designing a Bigtable there can be a wide table or a narrow table. In the wide-table format, each column has a value for every row.
213. There is a single row key which can be indexed, and the data is stored in ascending order of the row key, in sections called tablets.
214. While reading the data, it is preferred that sections with similar row keys are stacked together in the same tablet, for low-latency access to the data.

216. The rows are sorted lexicographically by row key. The key must be designed such that the reverse timestamp is read (YMD), so that the latest data comes first.
217. The row key must not start with a domain name; if the user_id is sequentially assigned, then the user id must not be used to start the row key either.
218. Bigtable learns the access patterns and reorganizes data to balance the loads on all nodes across the cluster. To enable Bigtable to make this adjustment, it must be allowed to process > 300 GB of data and run for at least a few hours.
219. Since Bigtable only stores pointers to the data on its nodes, this kind of redistribution is not a heavy process.
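The reverse-timestamp row-key trick can be sketched as follows: keys sort lexicographically, so subtracting the timestamp from a large constant makes the newest rows sort first. The sensor id, the constant and the timestamps are invented for the example.

```python
# Build row keys whose lexicographic order puts the latest event first.
MAX_TS = 10**10  # constant larger than any expected epoch-seconds timestamp

def row_key(sensor_id: str, ts: int) -> str:
    """Prefix by a non-sequential field, suffix by zero-padded reverse timestamp."""
    return f"{sensor_id}#{MAX_TS - ts:010d}"

keys = sorted(row_key("sensor1", ts) for ts in [1000, 3000, 2000])
print(keys[0])  # sensor1#9999997000 — the key for ts=3000 (newest) sorts first
```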