
Question 1

The states of a thread are Ready, Running, Waiting, and Terminated.

Running: A thread that is currently being executed by a CPU core is in the Running state. A running thread enters the Waiting state when it issues a system call to wait for a kernel event, and it enters the Ready state if it is pre-empted and put back into the ready queue.

Waiting: A thread that is waiting for a kernel event (either in the form of an IPC construct or an I/O operation) is in the Waiting state. A waiting thread becomes Ready when the event it was waiting for occurs.

Ready: A CPU core can only execute one thread at a time. The other threads that are ready to be executed, but have to wait for lack of free CPUs, are in the Ready state. A Ready thread becomes Running when a CPU core becomes free to run it.

Terminated: When a thread has finished execution, it enters the Terminated state. From the Terminated state, the thread cannot move to any other state.

Question 2
The DT9816 has the following specifications:

Resolution: 16 bits

    Internal Gain    Voltage Range
    1                -10 V to +10 V
    2                -5 V to +5 V

With the +-10 V range, the step size (in millivolts) for the DT9816 is

Step size = Range / 2^(number of bits) = 20000 mV / 2^16 = 20000 / 65536 = 0.3051 mV

The accelerometer gives 2.5 volts per G of acceleration. Hence 1 G of acceleration is equivalent to 2500/0.3051 = 8194 steps of the ADC, and 1 step corresponds to 1G/8194, i.e., 0.0001220405 G. Hence the smallest acceleration reading is 0.0001220405 G.

But the best configuration would be to use a gain of 2 and a voltage range of +-5 V when using the DT9816 with this accelerometer, for the reason explained below.

The range of the accelerometer is +-2G, which means the acceleration can go from -2G to +2G, a range of 4G. Since 1G corresponds to 2.5 V, the output range of the accelerometer is 10 V, that is, -5 V to +5 V, as shown in the table below.

    Acceleration    -2G    -1G      0     +1G      +2G
    Voltage         -5V    -2.5V    0V    +2.5V    +5V

Since the output of the accelerometer always falls within the -5 V to +5 V range, we can use the DT9816 in the +-5 V configuration for interfacing with this accelerometer. The step size of the DT9816 in the +-5 V configuration would be 10000/2^16 = 0.15258 mV, which corresponds to 0.15258/2500 = 0.000061032 G, i.e., twice the resolution of the +-10 V configuration.

Question 3
The DT9816 has the following specifications:

Resolution: 16 bits

    Internal Gain    Voltage Range
    1                -10 V to +10 V
    2                -5 V to +5 V

If we use the -10 V to +10 V range for the DT9816, the step size (in mV) would be

Step size = Range / 2^(number of bits) = 20000 mV / 2^16 = 20000 / 65536 = 0.3051 mV

For the -5 V to +5 V configuration,

Step size = Range / 2^(number of bits) = 10000 mV / 2^16 = 10000 / 65536 = 0.15258 mV

The calculation for the -10 V to +10 V setting is as follows. The displacement sensor provides 10 mV per 1 mm of displacement. Writing this as a proportion, 10 mV : 1 mm = 0.3051 mV : X mm, where X is the displacement corresponding to the smallest step size (0.3051 mV). So X = 0.3051/10 mm = 0.03051 mm if we use the DT9816 in the -10 V to +10 V setting.

Similarly, for the -5 V to +5 V configuration, the resolution (in mm) would be 0.15258/10 = 0.015258 mm.

Question 4
The DT9816 has the following specifications:

Resolution: 16 bits

    Internal Gain    Voltage Range
    1                -10 V to +10 V
    2                -5 V to +5 V

This answer assumes that the range is -10 V to +10 V. The reading of 1F2 can be written as 0x01F2, which in decimal is 498. The step size for the ADC (in mV) is

Step size = Range / 2^(number of bits) = 20000 mV / 2^16 = 20000 / 65536 = 0.3051 mV

So 498 steps correspond to 498 * 0.3051 mV = 151.9398 mV. Since in offset binary mode 0x0000 corresponds to -10 V, the voltage reading is (-10000 mV) + 151.9398 mV = -9848.0602 mV.

Question 5
If the reading in problem 4 were in two's complement, then the reading would be +151.9398 mV, because in two's complement mode the mid-scale code 0x0000 corresponds to 0 V and 0x01F2 is 498 steps above it.

Question 6
Threads are usually created dynamically as the need arises. For example, a web server may add more threads when there are more requests; some applications even create a thread for each new request/task and kill the thread when the request is done. But thread creation is expensive, as the OS has to allocate resources for each thread, so such an application can end up spending much of its time in thread creation and deletion. This is where thread pools are a great pattern to use. The application programmer creates a pool of threads when starting the application; each thread then accepts a task from its queue and executes it in a loop. The advantage is that the thread creation overhead is moved to the initialization of the application rather than being paid at run time. With this kind of thread pool implementation, there remains the problem of determining the optimum number of threads. For this reason, Win32 provides an API named QueueUserWorkItem, which hands a function pointer to the system. Windows creates a thread pool when this API is called for the first time, and on each call of QueueUserWorkItem the function is passed to one of the pool's threads to execute. This way the programmer need not worry about how many threads to create at initialization.

Question 7
About lock-holder pre-emption: lock-holder pre-emption is related to the use of spinlocks by an operating system running on a virtual machine. A spinlock is a busy-wait implementation of locking. This mechanism is typically used in multicore systems, where busy-waiting does not block out the lock-releasing thread (since that thread typically runs on another processor core). Spinlocks are typically held for very short periods of time, such that the overhead of thread scheduling is greater than the CPU cycles lost in the busy wait. But this whole assumption goes wrong when the same applications and OS run as a guest inside a virtual machine (VM) and the host system runs multiple VMs. If the virtual CPU that is supposed to release the spinlock is swapped out to another VM by the virtual machine monitor, but the thread spinning on the lock is not swapped out, then a large number of CPU cycles are wasted by the spinning thread. This is the problem of lock-holder pre-emption, i.e., pre-emption of the thread holding the spinlock. As can be seen from the definition itself, this problem cannot happen in a typical real-time embedded multicore system, where a single Symmetric Multiprocessing (SMP) kernel runs the entire system: there is no external agent that could swap a whole CPU core out from under the kernel. For this reason, the lock-holder pre-emption problem does not affect real-time embedded systems.

Question 8
Task-level parallelism decomposes the problem into a set of parallel tasks (functions) that operate independently of each other. An example of task-level parallelism would be the following algorithm:

    int main() {
        thread1 = pthread_create(doTaskA);
        thread2 = pthread_create(doTaskB);
        /* wait for threads thread1 and thread2 */
    }

    int doTaskA() {
        /* Get the frame from the stream */
        /* Decode the frame */
        /* Put it on the output queue */
    }

    int doTaskB() {
        /* Get the decoded frame from TaskA's queue */
        /* Do the color balancing */
        /* Put it on the output queue */
    }

The geometric pattern in parallel programming is a type of data decomposition (data parallelism) in which the data set is divided among multiple threads and each thread handles part of the data set. It is called the geometric pattern because the data set is analogous to a geometric area which we divide into chunks, one chunk per thread (core). In the geometric pattern the functions performed by the individual threads are identical, i.e., they perform the same set of actions on different parts of the data. An example would be a matrix multiplication program: if we multiply an N*M matrix with an M*J matrix, we could divide the problem into N threads, each thread responsible for calculating one row of the resulting matrix. The problem given in question 13 of this test would also come under the geometric pattern. The basic difference is that the task pattern divides the problem into tasks, whereas the geometric pattern divides the data structures associated with the problem into chunks for each thread to process.

Question 9
The minimum sampling rate to be used is 6000 kilosamples per second. This is derived from Nyquist's sampling theorem, which states that the minimum sampling rate is 2 times the maximum frequency present in the signal. But the minimum sampling rate is not enough in most practical scenarios, for 3 reasons.

1. The minimum sampling rate is specified to avoid the non-existent low-frequency (aliased) signals shown in the figure below.

[Figure: the original signal; sampling at exactly twice the highest frequency; and a non-existent signal introduced as a result of a lower sampling rate.]

But that does not mean that this sampling rate is the optimum rate. For example, if we happen to keep sampling at the instants where the signal crosses 0 volts, the entire signal would appear to be 0 and we would never capture its highest amplitude. At exactly twice the highest frequency, the signal would have to be sampled exactly at its extreme points for the DSP to fully recover the amplitude.

2. Sampling at the minimum rate means that the analog filters have to remove the aliasing (higher-frequency) components almost completely, and the resulting circuit can be expensive. Instead, the signal can be oversampled (sampled at higher than the minimum rate) so that part of the filtering can be done with digital filters; this is possible only with oversampling.

3. Oversampling can also help in decreasing the required resolution of the ADC. A higher number of samples means that we can average out the quantization errors. For example, to implement a 24-bit converter, it is sufficient to use a 20-bit ADC running at 256 times the minimum sampling rate. This means that we need not have an expensive high-resolution ADC if we sample at a higher frequency.

Question 10
The CPU affinity of a thread is the thread's affinity towards a particular core, i.e., the likelihood that the thread will be run on that core. Normally a symmetric multiprocessing kernel follows a policy of soft affinity: it tries to schedule a thread onto the CPU that ran it previously. This policy is based on the principle of locality; the cache does not have to be repopulated if the memory used by the thread is still in that CPU's cache, whereas if the thread moves to a different CPU, the cached data has to be reloaded from RAM. But there is a flip side to fixing a thread to one core: the thread might have to wait for an indefinitely long time if the CPU core to which it has affinity does not become available, and another CPU core may remain idle even though there is a thread in the ready queue. To avoid this, the OS strikes a balance between this latency and the cost of repopulating the cache of a new CPU core. Another disadvantage concerns hyper-threaded CPUs: if 2 threads have affinity to the same core while another core is idle, the system may achieve better performance by moving one of the threads to the idle core, even though that core has to repopulate its cache, because the 2 threads then run on separate cores and do not compete for the execution units within one core.

Question 11
Mutex and Critical Section are two mechanisms provided in the Win32 library for mutual-exclusion synchronization. The major difference between the two is that a Mutex is implemented using system calls into the kernel, whereas a Critical Section is acquired at the user level (entering the kernel only when there is contention). This makes Critical Section faster than Mutex. As a result of the implementation difference, a Mutex can be used for inter-process synchronization, that is, synchronization between 2 different processes, whereas a Critical Section can be used only for synchronization between threads of the SAME process. Since a Critical Section involves less system overhead than a Mutex, it is highly recommended to use a Critical Section rather than a Mutex when the synchronization needed is between threads of the same process.

Question 12
Maximum sampling rate per channel: 50 kHz
Maximum number of analog channels: 6
Number of digital inputs: 8
Number of digital outputs: 8

Question 13
I would use data decomposition if I were to decompose the algorithm into threads, because each of the terms in the summation is independent of the others. Hence we can divide the data set (from k=0 to k=M-1) into n threads, where n is the number of cores in the CPU. For example, if there are 2 cores, thread 0 would process the data from k=0 to k=(M-1)/2 and thread 1 would process the data from k=((M-1)/2)+1 to k=M-1. The algorithm for the threading is as follows:

    int n;
    double sum;
    HANDLE mutexHandle;

    struct THREAD_PARAM {
        int from;
        int to;
    };

    int mainThread() {
        int pointsPerCore = (M - 1) / MAX_CORES;
        mutexHandle = CreateMutex(NULL, FALSE, NULL);
        for (coreCtr = 0; coreCtr < MAX_CORES; coreCtr++) {
            /* create a SummationThread for the data set from
               pointsPerCore*coreCtr to pointsPerCore*(coreCtr+1) - 1 */
        }
        /* Wait for all threads to complete.
           The summation is in the variable sum. */
    }

    int SummationThread(struct THREAD_PARAM dataset) {
        for (k = dataset.from; k <= dataset.to; k++) {
            tempSum = funcX(k) * funcY(k + n);
            /* CRITICAL SECTION */
            WaitForSingleObject(mutexHandle, INFINITE);
            sum += tempSum;
            ReleaseMutex(mutexHandle);
            /* END critical section */
        }
    }

Each thread updates the global variable sum inside its loop, so sum has to be protected from race conditions. We can implement this protection using a mutex. The following are the mutex APIs. For initializing the mutex:
    HANDLE WINAPI CreateMutex(
        __in_opt LPSECURITY_ATTRIBUTES lpMutexAttributes,
        __in     BOOL bInitialOwner,
        __in_opt LPCTSTR lpName
    );

This function is used to create the mutex. It returns a handle on success, and accepts an optional name for the mutex as a parameter. For waiting on the mutex:
    DWORD WINAPI WaitForSingleObject(
        __in HANDLE hHandle,
        __in DWORD dwMilliseconds
    );

This function accepts the handle to the mutex that was created using the CreateMutex function. The parameter dwMilliseconds is the wait time: if it is a finite value, the function waits for at most dwMilliseconds and returns with a timeout status if the lock is not released within that interval; if we pass INFINITE, the function does not return until the lock is acquired. For releasing the mutex:

    BOOL WINAPI ReleaseMutex(
        __in HANDLE hMutex
    );

This function releases the mutex so that other threads can enter the critical section.

Question 14
An accelerometer is connected to the embedded system for measuring acceleration and, from it, displacement. The accelerometer gives a continuous output while the embedded system runs, so the displacement has to be computed on the fly to obtain the distance travelled in each time step. There is an equation that can be used to compute displacement directly from acceleration (without an intermediate conversion to velocity):

    Di = 2*Di-1 - Di-2 + ((h*h)/4) * (Xi + 2*Xi-1 + Xi-2)

The first step would be to simplify this equation by introducing a new variable, factor = (h*h)/4. The value of factor can be calculated before the system starts, i.e., we need not recompute it in each loop at run time. Now the equation becomes

    Di = 2*Di-1 - Di-2 + factor * (Xi + 2*Xi-1 + Xi-2)

Now let's convert the equation to sum-of-products form. The idea is that products are more CPU intensive than additions. In a RISC processor (like early versions of MIPS), there might not be built-in support for multiplication, and even where it is present, multiply instructions take more cycles than add/subtract instructions.

    Di = 2*Di-1 - Di-2 + (factor * Xi) + (2*factor * Xi-1) + (factor * Xi-2)

In the above equation, the (factor * X) products are the most CPU intensive. The multiply-by-2 operations are just left shifts, and a shift takes few CPU cycles.

    Type of operation                Example                    CPU intensiveness
    Multiply-by-2 (= left shift)     2*Di-1, 2*factor*Xi-1      LOW
    Sum of the 4 terms               the full equation above    LOW
    Floating point multiplication    factor*Xi-1                HIGH
If we assume that the total cost of the other operations (adds and multiply-by-2 shifts) is roughly equal to that of the floating point multiplications, then we can divide the work across two threads: one thread (which I call the Multiply thread) only calculates the floating point products, and the other thread (which I call the DisplacementCalculator) does all the remaining operations. So we have a pipeline pattern here with 3 stages: ADC Reader -> Multiplier -> Displacement Calculator. The ADC Reader thread simply reads from the ADC and puts the reading into a queue; it can be the bottom half of the ADC interrupt service routine.

    ADCReader() {
        while (1) {
            if (ADC interrupt set) {  /* this waiting can be implemented with
                                         signaling between the ISR and this thread */
                Read from the ADC and put the sample into the ADC buffer
                if (buffer becomes full) {
                    Put the buffer on the output queue ADCQ
                    Signal the ADC Reader's semaphore
                    Obtain a new buffer from the pool
                }
            }
        }
    }

The algorithm for the Multiply thread would be:

    Multiply() {
        factor = (h*h)/4;
        while (1) {
            Wait on the ADC Reader's semaphore
            Get an ADC buffer from the ADC Reader's queue
            Obtain a new buffer from the pool to use as the factor buffer
            for (each accelerometer entry in the ADC buffer) {
                Ai = read the accelerometer entry from the buffer
                Fi = factor * Ai
                Put Fi into the factor buffer
            }
            Return the ADC Reader's buffer to the pool
            Put the factor buffer on the Multiply thread's output queue
            Signal the Multiply thread's semaphore
        }
    }

The algorithm for the DisplacementCalculator would be the following:

    Global variable Di;

    DisplacementCalculator() {
        Fi = Fi-1 = Fi-2 = 0;
        Di = Di-1 = Di-2 = 0;
        while (1) {
            Wait on the Multiply thread's semaphore
            Get a factor buffer from the Multiply thread's queue
            for (each entry in the factor buffer) {
                Fi = read entry from the factor buffer
                Di = 2*Di-1 - Di-2 + Fi + (2*Fi-1) + Fi-2
                Fi-2 = Fi-1; Fi-1 = Fi;
                Di-2 = Di-1; Di-1 = Di;
            }
            Return the factor buffer to the pool
        }
    }

    getDisplacement() {
        return Global variable Di;
    }

If there are only 2 processor cores instead of 3, then we can combine the ADC Reader and DisplacementCalculator threads into one (to avoid threading overhead on one core); the Multiply thread remains unchanged even in this case. Thread communication: the synchronization along ADC Reader -> Multiply thread -> DisplacementCalculator can be done using multi-valued (counting) semaphores. Logically, this can be viewed as follows:

    ADC Reader --[ADC buffer, signal ADC sem]--> Multiply thread --[factor buffer, signal Multiply sem]--> DisplacementCalculator
    (the Multiply thread frees the ADC buffer; the DisplacementCalculator frees the factor buffer)

Buffering: the data is passed by putting a buffer pointer on a queue. Buffer management can be done using the system's heap, i.e., malloc and free. The larger the buffer size, the lower the threading overhead (the overhead of semaphore wait and signal). Since we are using a queue (i.e., we access only the head and tail of the list), there need not be any lock for accessing the queue: a queue can be implemented in a thread-safe way without locking (a circular queue implemented using arrays).