Tuesday, May 5, 2020

Data Analysis for Decision Makers

Question: Discuss the inferential statistics of continuous random variables.

Answer:

Introduction

The price of a house is an important factor in daily life. Most people wish to purchase their own house at some point in time, and this purchase depends mainly on the price of the house (Gelman et al. 2014). The price of a house is influenced by various factors. Data were collected by surveying houses in Singapore. A sample of 50 houses was collected, and various factors of each house were also surveyed. These data are subjected to various statistical methods. Discrete variables, continuous variables and categorical variables are used in this data set (Ott and Longnecker 2015). Various statistical terms are explained in this assignment with the help of the given data set. Graphs and charts represent the data more clearly.

Discussion

Descriptive statistics

The data consist of both discrete and continuous variables. The price of the house, the distance of the house from its nearest railway station, the distance of the house from the nearest bus stop and the distance of the house from the nearest shop are all continuous variables. The average price of the selected houses was $788580. This is calculated by dividing the total price of all the houses by the number of houses chosen for the study. The median price was $778000 (Thomson and Emery 2014). The minimum price of the chosen houses was $310000 while the maximum price was $1354000. The range of the prices, the maximum minus the minimum, was $1044000. The standard deviation of the price is $280708, while the variance, its square, is approximately 7.88 × 10^10 squared dollars (Menke 2012). The standard deviation is calculated as the square root of the average squared deviation of the prices from the mean, that is, the sum of the squared deviations divided by the number of samples chosen. The variance is calculated by squaring the standard deviation.
Variance measures how far the sample values are spread around their mean. The averages of the other continuous variables are as follows. The average distance of the houses from the nearest railway station is 1.086 kilometres, while the average distance from the nearest bus stop is 1.186 kilometres and the average distance from the nearest shop is 0.99 kilometres (Bazeley and Jackson 2013). The median distance from the nearest railway station is 1.1 kilometres, from the nearest bus stop 1.25 kilometres and from the nearest shop 1.1 kilometres. The minimum distance in all three cases was 0.1 kilometres, while the maximum distance from the nearest railway station is 2.1 kilometres, from the nearest bus stop 2.7 kilometres and from the nearest shop 1.9 kilometres. The range of the distance from the nearest railway station, nearest bus stop and nearest shop is therefore 2 kilometres, 2.6 kilometres and 1.8 kilometres respectively (Dimaggio 2013). The standard deviation of the distance from the nearest railway station, nearest bus stop and nearest shop is 0.58554 kilometres, 0.76798 kilometres and 0.54332 kilometres respectively. The variance of the distance from the nearest railway station was found to be 0.3428 square kilometres, the variance of the distance from the nearest bus stop 0.5898 square kilometres and the variance of the distance from the nearest shop 0.2952 square kilometres (Twisk 2013).
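The measures described above can be reproduced with a few lines of Python. This is only a minimal sketch: the five prices are hypothetical values chosen for illustration, not the surveyed data, and the function name `describe` is an assumption of this example. Note that `statistics.stdev` and `statistics.variance` use the sample formulas (dividing by n − 1); `statistics.pstdev` and `statistics.pvariance` divide by n instead.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        # Sample standard deviation and variance (divide by n - 1)
        "std_dev": statistics.stdev(values),
        "variance": statistics.variance(values),
    }


# Hypothetical house prices in dollars (NOT the survey data)
prices = [310000, 540000, 778000, 961000, 1354000]

summary = describe(prices)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

The variance is expressed in squared units of the data (here squared dollars), which is why the standard deviation is usually easier to interpret.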
Figure 1: Graph of price distribution of the houses (Source: created by author)

Figure 2: Graph of the distribution of distance of house from station in kilometres (Source: created by author)

Figure 3: Graph of the distribution of distance of house from bus stop in kilometres (Source: created by author)

Figure 4: Graph of the distribution of distance of house from shops in kilometres (Source: created by author)

Discrete random variables and their probability distributions

The discrete random variables in the data set are the number of rooms, the age of the house, the area of the house in square metres and the number of bedrooms. These variables give the details of each of the randomly selected houses. The Poisson distribution can be used to describe these variables (Miles et al. 2013).

Poisson distribution and the interpretation of the data set

The Poisson distribution gives the probability that a given number of independent events occurs in a fixed interval. In this data set, the distances of the house from the nearest railway station, bus stop and shops can be divided into intervals, and the probability of a high price occurring within each interval can be estimated. Prices are said to be high if they exceed $1000000 (Woodward 2013). The probability of such prices occurring within each distance interval can be found from the data set. This gives an idea of how the distance of the house from its nearest railway station, nearest bus stop and nearest shop affects the price of the houses. The probability of a high-priced house occurring in each distance interval also indicates whether factors other than distance influence the price of the house.

Inferential statistics

Inferential statistics uses data collected from a sample to draw conclusions about the population from which the sample was taken. The collected data are used to compute various statistical measures. These measures include descriptive statistics, correlation, regression and hypothesis testing.
When the sample size is greater than thirty, the normal distribution is commonly used to approximate the sampling distribution of the mean.

Normal distribution

The normal distribution is a common continuous probability distribution in the theory of probability. It is widely used in the natural and social sciences to represent random variables whose distributions are unknown. The central limit theorem makes the normal distribution especially important. The theorem states that when a large number of independent observations are drawn from a distribution with finite variance, the distribution of their average approaches a normal distribution (Balakrishnan 2013). The curve of the normal distribution is bell shaped, and its probability density is given as follows:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean of the distribution and $\sigma^2$ is its variance.

Reasons for using the normal distribution in sampling distributions

The most commonly used probability distribution for continuous variables is the normal distribution. According to the central limit theorem, the distribution of the mean of a large sample tends to the normal distribution. This tendency of sample means to follow the normal distribution leads to the use of the normal distribution for sampling distributions (Kleinbaum et al. 2013). Moreover, using the normal distribution simplifies the calculations for sampling distributions, and it is usually treated as the standard distribution for them (Balakrishnan 2013).

The basis for inferential statistics

The basis of inferential statistics is the central limit theorem. Samples drawn from a large population are used to estimate the characteristics of that population. The probable value of the population mean can be estimated from the mean of the samples drawn from the population.
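The central limit theorem can be illustrated with a short simulation. The sketch below is an illustration under stated assumptions, not part of the original analysis: it draws repeated samples from an exponential distribution with mean 1 (a skewed distribution chosen only for the example), and the function name `sample_means`, the sample size of 50 and the 2000 repetitions are all hypothetical choices.

```python
import random
import statistics


def sample_means(sample_size, repetitions, seed=0):
    """Draw repeated samples from a skewed distribution and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(repetitions):
        # Exponential distribution with rate 1.0: mean 1, standard deviation 1
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


means = sample_means(sample_size=50, repetitions=2000)

print("average of the sample means:", round(statistics.fmean(means), 3))
print("standard deviation of the sample means:", round(statistics.stdev(means), 3))
```

Under these assumptions the average of the sample means should be close to the population mean of 1, and their standard deviation close to 1/√50 ≈ 0.141, even though the individual observations are far from normally distributed.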
The standard deviation and variance of the sample give the probable values of the standard deviation and variance of the population (Boy-Roura et al. 2015). Inferential statistics helps to draw conclusions about the population using one or more samples from it. It also indicates whether a difference between groups is real or has occurred by chance. The basis of inferential statistics is the set of assumptions that can be made about the population from the selected samples. Inferences can be drawn about larger groups by studying the variables of smaller groups (Ciarleglio et al. 2016). It is usually not feasible to study the whole population, as this would be logistically difficult, too expensive and time consuming. Sampling, together with the inferences drawn from the sample statistics, makes it possible to learn about the population at reduced cost, with good accuracy and with more scope to yield information (Pineda et al. 2015). More attention can be given to each sampled unit, so the sample statistics can be more accurate. The sample statistics then give a better inference about the population from which the sample is selected.

Confidence interval

Explanation of continuous random variable

Continuous random variables are variables that can take any value in a given interval. Their possible outcomes form an interval of real numbers (Zhai et al. 2013). A continuous random variable has uncountably many possible values; that is, there are too many to list as the possible outcomes of a problem. Continuous random variables can be measured with a higher level of precision than discrete random variables.

Explanation of a confidence interval

A confidence interval is a range of values surrounding an estimate that describes the uncertainty of that estimate. A confidence interval is indicated by its endpoints.
The lower limit and upper limit of the interval together define the confidence interval. A confidence interval also defines the range of values that most probably encompasses the true value (Altman et al. 2013). The confidence interval of a statistic is computed in such a way that the interval has a specified chance of containing the value of the corresponding population parameter.

Construction of the confidence interval

A confidence interval is constructed by first choosing a confidence level, which is one minus the level of significance. The confidence level indicates how likely an interval constructed in this way is to contain the true value. The confidence interval is then calculated by determining its lower limit and upper limit. The z value is found from the chosen level of significance, and this value is used to calculate both limits. The lower limit is computed by subtracting the product of the z value and the standard error (the standard deviation divided by the square root of the sample size) from the mean of the data set. The upper limit is computed by adding the same product to the mean of the data set. Thus, the confidence interval is constructed from the lower limit and the upper limit. From the result of the data set, it was seen that the upper limit of the prices of the houses of Singapore is 1707.2 and the lower limit is 186.97. The interval (186.97, 1701.2) gives the confidence interval of the price of the houses of Singapore at the 95% confidence level.

Interpretation of the confidence interval

A confidence interval helps to determine the probable range within which the true value would lie. In the given data set, the confidence interval is (186.97, 1701.2) at the 95% confidence level (Siegmund 2013). This indicates 95% confidence that the true value lies within the interval (186.97, 1701.2).
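The construction of a confidence interval can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the calculation used for the figures above: the prices are hypothetical values in thousands of dollars, the function name `mean_confidence_interval` is an assumption of this example, and the interval is built around the mean using the standard error (the standard deviation divided by the square root of the sample size). For small samples a t value would normally replace the z value.

```python
import math
import statistics


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    std_error = statistics.stdev(values) / math.sqrt(n)
    # Two-sided critical value of the standard normal (about 1.96 for 95%)
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * std_error
    return mean - margin, mean + margin


# Hypothetical prices in thousands of dollars (NOT the survey data)
prices = [310, 540, 778, 961, 1354]

lower, upper = mean_confidence_interval(prices)
print(f"95% confidence interval for the mean: ({lower:.2f}, {upper:.2f})")
```

Raising the confidence level, for example to 0.99, gives a larger z value and therefore a wider interval.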
The confidence interval thus helps to determine the expected value of the population and the interval in which it would lie.

Regression method

The method of regression is used to estimate the relationships between variables. There are two types of variables in a data set: the dependent variable and the independent variables. Regression methods help to determine the relationship between the dependent variable and the independent variables (Cumming 2013). The independent variables are assumed not to exhibit multicollinearity. Regression also shows how the value of the dependent variable changes as the values of the independent variables change. In this data set, the dependent variable is the price of the houses and the independent variables are the other factors. The regression equation is given as follows:

$$Y = 1751.3982 - 86.5995x_1 - 4.42955x_2 + 0.099x_3 - 119.1857x_4 + 27.601x_5 - 79.134x_6 - 2.073x_7 - 34.833x_8 + 3.1486x_9$$

This shows that the price of the houses is negatively related to the number of rooms, the age of the house, the distance of the house from the station (km), the distance of the house from shops (km), the number of bedrooms and the number of storeys (Altman et al. 2013). The slopes of these variables are negative, which indicates that an increase in the values of these variables leads to a decrease in the price of the houses of Singapore. The variables area of house (sq m), distance of house from bus stop (km) and type of kitchen have positive slopes, so the price of the house would increase as the values of these variables increase (Ciarleglio et al. 2016). The regression equation is used to estimate the value of the dependent variable from the values of the independent variables.

Conclusion

The prices of the houses of Singapore have been subjected to various statistical methods.
The factors that influence the prices of the houses of Singapore were also subjected to measures of central tendency and measures of variation. The confidence interval of the prices of the houses of Singapore was calculated as (186.97, 1701.2) at the 95% confidence level. Regression analysis was carried out to understand how the value of the dependent variable can be estimated from the independent variables. The assignment also gave an idea about the inferential statistics of continuous random variables.

References

Altman, D., Machin, D., Bryant, T. and Gardner, M. eds., 2013. Statistics with confidence: confidence intervals and statistical guidelines. John Wiley & Sons.

Balakrishnan, N., 2013. Handbook of the logistic distribution. CRC Press.

Bazeley, P. and Jackson, K. eds., 2013. Qualitative data analysis with NVivo. Sage Publications Limited.

Boy-Roura, M., Cameron, K.C. and Di, H.J., 2015. Identification of nitrate leaching loss indicators through regression methods based on a meta-analysis of lysimeter studies. Environmental Science and Pollution Research, pp.1-10.

Ciarleglio, A., Petkova, E., Tarpey, T. and Ogden, R.T., 2016. Flexible functional regression methods for estimating individualized treatment rules. Stat, 5(1), pp.185-199.

Cumming, G., 2013. Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge.

Dimaggio, C., 2013. Introduction (pp. 1-5). Springer New York.

Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B., 2014. Bayesian data analysis (Vol. 2). Boca Raton, FL, USA: Chapman & Hall/CRC.

Kleinbaum, D., Kupper, L., Nizam, A. and Rosenberg, E., 2013. Applied regression analysis and other multivariable methods. Nelson Education.

Menke, W., 2012. Geophysical data analysis: discrete inverse theory. Academic Press.

Miles, M.B., Huberman, A.M. and Saldana, J., 2013. Qualitative data analysis: A methods sourcebook. SAGE Publications, Incorporated.
Ott, R.L. and Longnecker, M., 2015. An introduction to statistical methods and data analysis. Nelson Education.

Pineda, S., Real, F.X., Kogevinas, M., Carrato, A., Chanock, S.J., Malats, N. and Van Steen, K., 2015. Integration analysis of three omics data using penalized regression methods: An application to bladder cancer. PLoS Genet, 11(12), p.e1005689.

Siegmund, D., 2013. Sequential analysis: tests and confidence intervals. Springer Science & Business Media.

Thomson, R.E. and Emery, W.J., 2014. Data analysis methods in physical oceanography. Newnes.

Twisk, J.W., 2013. Applied longitudinal data analysis for epidemiology: a practical guide. Cambridge University Press.

Woodward, M., 2013. Epidemiology: study design and data analysis. CRC Press.

Zhai, Y., Cui, L., Zhou, X., Gao, Y., Fei, T. and Gao, W., 2013. Estimation of nitrogen, phosphorus, and potassium contents in the leaves of different plants using laboratory-based visible and near-infrared reflectance spectroscopy: comparison of partial least-square regression and support vector machine regression methods. International Journal of Remote Sensing, 34(7), pp.2502-2518.
