Calculation of selected financial indicators.
Numerous methods exist for examining patterns in structured and unstructured financial data. Applications of these methods include fraud detection, risk management, credit allocation, assessment of the risk of default, customer analytics, trading prediction, and many others, creating a broad field of research named financial data science. A problem within the field that remains significantly under-researched, yet is very important, is that of differentiating between the three major types of business activities (merchandising, manufacturing, and service) based on the structured data available in financial reports. It can be argued that, due to the inherent idiosyncrasies of the three types of business activities, methods for assessment of the risk of default, methods for credit allocation, and methods for fraud detection would all perform better if reliable information were available on the percentage of an entity's business activities allocated to each of the three major activities. To this end, in this paper, we propose a clustering procedure that relies on Principal Component Analysis (PCA) for dimensionality reduction and feature selection. The procedure is presented using a large empirical data set comprising complete financial reports for various business entities operating in the Republic of Serbia that pertain to the 2019 reporting period.
- data science
- principal component analysis
- random forest algorithm
- financial data
- financial reporting
The established financial reporting system within an entity is the basic source of information on its financial position and results. The economic and financial globalization of the world market has emphasized the importance of high-quality financial reporting. For the business decision-making process, financial and audit reports are the main source of information, as they contain information on financial position, business results, changes in equity, cash flows and other reliable information. The development of the capital market and the increase in the number of interested parties (investors) have created even higher demand for reliable, timely and fair financial statements as the main output of financial reporting. The regulation of the relationships between the state and society, between owners of capital and management, and between various stakeholders and society, among others, has been further improved by high-quality financial reporting and audit processes. However, in order to fulfill their main purpose for all interested parties, financial statements must provide information that is true, objective, comprehensible, comparable and uniform. In the first place, financial statements have to be publicly available, which is usually regulated by law. For example, the Law on Accounting of the Republic of Serbia prescribes that all business entities have to submit their financial reports to the competent institution, which later publishes them on its official website. Information contained in financial statements can be used for numerous purposes. For example, other business entities can use it in the process of making business, financial, investment and other decisions. Likewise, banks and financial institutions can use it to approve loans or assess investment risks related to a particular business entity.
However, the financial information contained in financial statements is not processed; it is raw data that must be analyzed in order to assess the performance of a business entity. Aside from the notes to financial statements, which are a qualitative statement that business entities prepare and report, all other statements are quantitative in nature and offer hundreds of pieces of data. Therefore, it is of great importance to analyze the collected data in order to gain a solid basis for the business decision-making process. Analysis of financial statements is one of the most common methods of assessing business performance. Its main goal is to obtain information on the performance of the observed company, i.e., its liquidity, profitability and solvency. Measuring financial performance using compiled and disclosed financial statements is a quantitative analysis of the position of the observed company, including the way in which the company uses the capital invested in the business. High-quality analysis of the performance of the observed entity provides a comprehensive picture of the business and meets the information needs of stakeholders. The authors point out in their paper that the analysis of financial performance is crucial in determining how efficiently available resources are used. Likewise, an entity's owners are able to assess management skills and the decisions made in previous and current reporting periods, so that they can analyze the entity's strengths and weaknesses and thereby improve its overall performance [5, 6, 7].
Some pieces of data disclosed in financial statements are informative on their own, such as Total assets, Sales revenue, or Net result. However, the informational power of data increases when they are related to other pieces of data. Therefore, financial statement analysis using ratios has been one of the most commonly used methods of assessing business performance. A financial ratio is the relative magnitude of two (or more) selected numerical values taken from financial statements. For example, the ratio of Net result to Equity shows how many dollars of profit an entity earns for each dollar invested in equity. Results of financial statement analysis can be used to compare the performance of an entity over a period of time, or to compare it with other entities within the industry. However, since financial statement analysis takes time, analysts can choose from numerous financial ratios, and most of these ratios are correlated, the number of ratios that are calculated and assessed should be reduced so that an analyst can focus on several of them without losing data that could be relevant for the analysis. One of the methods that can be used is Principal Component Analysis (PCA), which reduces the number of observed variables for subsequent regression or any other type of analysis. PCA has found numerous applications in different fields, for example in image compression [9, 10, 11], as well as in biometrics or “bioimaging”, where the physical characteristics of a person are used for identification, with applications in communication devices and security systems.
The significance of PCA results is reflected in the fact that they can be used for more effective and efficient analysis of the performance of a single entity or of all business entities within an industry; if the analyzed financial data relate to the whole economy, then the results can be used for the analysis of all entities within it. The main advantages of PCA are the precision of results, the reduction of the time needed for the analysis and evaluation of results, and the reduction of the related costs and effort of the analyst.
With the development of technology, we have gained the ability to generate massive amounts of data. The use of correct methodologies for data analysis has become essential when dealing with complex financial challenges. In this paper, we discuss the theory underlying PCA. This type of analysis is one of the most widely used statistical tools in the field of financial data analysis. To ensure that the proper method is used for the analysis, theoretical knowledge and a sound understanding of statistical methods are essential.
1.1 General postulates of PCA
PCA is primarily designed as a statistical technique that reduces the dimensionality of complex data sets while preserving as much variance as possible. Since research in the financial sector involves both a large number of observations and a large number of variables, such data are difficult to analyze directly.
Visualization techniques are only useful in two- or three-dimensional spaces, and single-variable analysis does not provide precise results because the variables share overlapping variance. To achieve dimensionality reduction, it is necessary to generate principal components, i.e., a new set of variables, each of which is a linear combination of the original variables. PCA can be used for a variety of tasks. A very small number of components is often sufficient to capture most of the variability of a data set. Since the principal components are fewer than the original variables, the complexity of the analysis itself is also reduced, because a large number of output variables no longer needs to be analyzed.
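The standard PCA procedure takes as its starting point a data set in which $p$ numerical variables are observed for each of $n$ individuals. These data define $p$ $n$-dimensional vectors $x_1, \ldots, x_p$ or, equivalently, an $n \times p$ data matrix $X$ whose $j$th column is the vector $x_j$ of observations on the $j$th variable. We seek a linear combination of the columns of $X$ with maximum variance. Such linear combinations are given by $\sum_{j=1}^{p} a_j x_j = Xa$, where $a$ is a vector of constants $a_1, a_2, \ldots, a_p$. The variance of such a linear combination is $\mathrm{var}(Xa) = a'Sa$, where $S$ is the sample covariance matrix and $'$ denotes the transpose. Finding the linear combination with maximum variance is therefore the same as finding a $p$-dimensional vector $a$ that maximizes the quadratic form $a'Sa$. For this problem to have a well-defined solution, an additional constraint is needed, usually that the vectors have unit norm, i.e., $a'a = 1$. The problem is then equivalent to maximizing $a'Sa - \lambda(a'a - 1)$, where $\lambda$ is a Lagrange multiplier. Differentiating with respect to $a$ and equating the result to the zero vector gives the following equation:

$$Sa - \lambda a = 0 \iff Sa = \lambda a \tag{1}$$

Thus $a$ must be a unit-norm eigenvector, and $\lambda$ the corresponding eigenvalue, of the covariance matrix $S$. We need the largest eigenvalue, $\lambda_1$, and the corresponding eigenvector $a_1$, because the eigenvalues are the variances of the linear combinations defined by the corresponding eigenvectors: $\mathrm{var}(Xa) = a'Sa = \lambda a'a = \lambda$. Eq. (1) remains valid when the eigenvectors are multiplied by −1, so the signs of the loadings and scores are arbitrary. The covariance matrix $S$ is a $p \times p$ real symmetric matrix and therefore has exactly $p$ real eigenvalues $\lambda_k$ ($k = 1, \ldots, p$), whose corresponding eigenvectors can be chosen to form an orthonormal set, i.e., $a_k'a_{k'} = 1$ if $k = k'$ and zero otherwise. The eigenvectors of $S$ yield up to $p$ linear combinations $Xa_k = \sum_{j=1}^{p} a_{jk} x_j$ that successively maximize variance, subject to being uncorrelated with the previous linear combinations. Uncorrelatedness follows from the fact that the covariance between two such linear combinations, $Xa_k$ and $Xa_{k'}$, is $a_{k'}'Sa_k = \lambda_k a_{k'}'a_k = 0$ if $k' \neq k$. The linear combinations $Xa_k$ are the principal components of the data set. In standard PCA terminology, the elements of the linear combinations $Xa_k$ are called principal component scores (PC scores), and the elements of the eigenvectors $a_k$ are called principal component loadings (PC loadings). Principal components are commonly defined as linear combinations of the centered variables $x_j^*$, with generic element $x_{ij}^* = x_{ij} - \bar{x}_j$, where $x_{ij}$ is the observed value of variable $j$ for individual $i$ and $\bar{x}_j$ is the mean of the observations on variable $j$.

Denoting by $X^*$ the $n \times p$ matrix whose columns are the centered variables $x_j^*$, we obtain the following equation:

$$(n - 1)S = X^{*\prime}X^* \tag{2}$$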
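The derivation in this subsection can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data set (the data and all variable names are illustrative and not taken from the study): it centers the data, forms the sample covariance matrix, and obtains loadings and scores from its eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: n = 200 observations on p = 4 correlated variables.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

X_star = X - X.mean(axis=0)                 # centered variables
S = X_star.T @ X_star / (X.shape[0] - 1)    # sample covariance matrix

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]           # reorder from largest to smallest
lam, A = eigvals[order], eigvecs[:, order]  # eigenvalues and loadings (columns)

scores = X_star @ A                         # principal component scores
print("variances of the scores:", scores.var(axis=0, ddof=1))
print("eigenvalues:            ", lam)
```

The two printed vectors should agree, which illustrates that each eigenvalue is the variance of the corresponding principal component.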
1.2 Premises of PCA
For the final outcome of a PCA to be meaningful, several conditions must be met. First, the input data must be continuous, with variables measured on an interval or ratio scale. This condition must be met because PCA relies on the correlation patterns among these variables.
Another crucial requirement is that the relationships between the individual pairs of variables are linear. If there are nonlinear relationships between pairs of variables, appropriate data transformation techniques, such as logarithmic transformations, should be considered. Preprocessing for PCA includes replacing missing values with non-null values, handling outliers, and normalizing or scaling the data. Outliers should be filtered out prior to analysis, as they can bias the results by affecting the magnitude of the correlations.
To obtain more accurate estimates of the population correlation parameters, a large sample size is required. The data must exhibit linear structure for the components to be formed. The basic principle of PCA is that directions of high variance carry the relevant information, while directions of lower variance are treated as noise and are not taken into account. All variables must be measured at the same level of measurement.
1.3 Features extraction in PCA
Eq. (2) links the eigenvalue decomposition of the covariance matrix $S$ to the singular value decomposition of the column-centered data matrix $X^*$. Any matrix $Y$ of dimension $n \times p$ and rank $r$, where necessarily $r \le \min\{n, p\}$, can be written as follows:

$$Y = ULA' \tag{3}$$
Here $U$ and $A$ are $n \times r$ and $p \times r$ matrices with orthonormal columns ($U'U = I_r = A'A$, where $I_r$ is the $r \times r$ identity matrix), and $L$ is an $r \times r$ diagonal matrix. The columns of $A$ are called the right singular vectors of $Y$ and are the eigenvectors of the $p \times p$ matrix $Y'Y$ associated with its non-zero eigenvalues. The columns of $U$ are called the left singular vectors of $Y$ and are the eigenvectors of the $n \times n$ matrix $YY'$ associated with its non-zero eigenvalues. The diagonal elements of $L$ are the singular values of $Y$; they are the non-negative square roots of the (common) non-zero eigenvalues of the two matrices $Y'Y$ and $YY'$. We assume that the diagonal elements of $L$ are sorted from largest to smallest, which determines the order of the columns of $U$ and $A$, except when singular values are equal. If we take $Y = X^*$, then the right singular vectors of $X^*$ are the vectors $a_k$ of principal component loadings. Because the columns of $A$ are orthonormal, the columns of the product $X^*A = ULA'A = UL$ are the principal components of $X^*$. The variances of these principal components are obtained by squaring the singular values of $X^*$ and dividing by $n - 1$. This results in the following equation:

$$(n - 1)S = X^{*\prime}X^* = (ULA')'(ULA') = ALU'ULA' = AL^2A' \tag{4}$$
Here $L^2$ stands for the diagonal matrix containing the squared singular values. This equation gives the eigenvalue decomposition of the matrix $(n - 1)S$. The singular value decomposition of the column-centered data matrix $X^*$ is therefore equivalent to PCA. Given any matrix $Y$ of size $n \times p$ and rank $r$, the matrix $Y_q$ of the same size but of rank $q < r$ whose elements minimize the sum of squared differences with the corresponding elements of $Y$ is obtained as:

$$Y_q = U_qL_qA_q' \tag{5}$$
Here $L_q$ stands for the $q \times q$ diagonal matrix containing the first (largest) $q$ diagonal elements of $L$, and $U_q$ and $A_q$ stand for the $n \times q$ and $p \times q$ matrices obtained by keeping the corresponding $q$ columns of $U$ and $A$. The $n$ rows of a column-centered data matrix $X^*$ of rank $r$ define a scatter plot of $n$ points in an $r$-dimensional subspace of $\mathbb{R}^p$, with the origin at the center of gravity of the scatter plot. It follows that the best $n$-point approximation of this scatter plot in a $q$-dimensional subspace is given by the rows of $X_q^*$, defined as in this equation. Here “best” means that the sum of the squared distances between corresponding points in the two scatter plots is minimal, as in Pearson’s original approach. The system of $q$ axes given by the first $q$ principal components defines the principal subspace. It can be concluded that PCA is a dimensionality reduction method in which a set of $p$ original variables can be replaced by an optimal set of $q$ derived variables, the principal components. In the case of $q = 2$ or $q = 3$, a graphical approximation of the $n$-point scatter plot is possible, and it is very often used to visualize the whole data set. A very important aspect is that the results are incremental in their dimensions: the best subspace of dimension $q + 1$ is obtained by adding one further axis to the best subspace of dimension $q$.
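The equivalence between PCA and the singular value decomposition of the centered data, and the rank-reduced approximation, can be illustrated with a short sketch. It assumes NumPy and synthetic data; the names `U`, `sing`, `At` and `X_q` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: n = 100 observations on p = 5 correlated variables.
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X_star = X - X.mean(axis=0)
n = X_star.shape[0]

# Thin SVD of the centered data; singular values are returned in descending order.
U, sing, At = np.linalg.svd(X_star, full_matrices=False)
A = At.T                               # right singular vectors = PC loadings
pc_var = sing**2 / (n - 1)             # variances of the principal components

# Rank-q approximation built from the q largest singular values.
q = 2
X_q = U[:, :q] @ np.diag(sing[:q]) @ At[:q, :]
err = np.sum((X_star - X_q) ** 2)      # sum of squared differences

print("PC variances:", pc_var)
print("approximation error:", err, "= sum of discarded squared singular values:",
      np.sum(sing[q:] ** 2))
```

The approximation error equals the sum of the squared singular values that were dropped, which is why keeping the largest ones gives the smallest possible error for a given rank.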
The variability associated with the set of retained principal components can be used to measure the quality of any $q$-dimensional approximation. The trace, i.e., the sum of the diagonal elements, of the covariance matrix $S$ is equal to the sum of the variances of the $p$ original variables. Using basic results from matrix theory, it is easy to show that this value is also the sum of the variances of all $p$ principal components. Consequently, the proportion of the total variance accounted for by a given principal component is a standard measure of its quality, and it is equal to:

$$\pi_j = \frac{\lambda_j}{\sum_{k=1}^{p} \lambda_k} = \frac{\lambda_j}{\mathrm{tr}(S)} \tag{6}$$
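Here $\mathrm{tr}(S)$ denotes the trace of $S$. Due to the incremental behavior of principal components, we can speak of the proportion of the total variance explained by a set $\mathcal{S}$ of principal components (usually, but not necessarily, the first $q$), which is usually expressed as a percentage of the total variance accounted for:

$$\sum_{j \in \mathcal{S}} \pi_j \times 100\% \tag{7}$$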
It is a common approach to use a pre-specified percentage of the total variance to determine how many principal components to keep, but graphical constraints often lead to keeping only the first two or three principal components. The percentage of total variance is a basic tool for measuring the quality of these low-dimensional graphical representations of the data set.
The biggest problem is choosing the number of components needed to retain a sufficient share of the variance while still achieving a reduction in dimensionality. There are several ways to determine the number of components, and one of them is to set a threshold on the cumulative proportion of explained variance.
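The threshold rule can be sketched as follows. This is a minimal illustration, assuming NumPy; the function name `components_for_threshold`, the 80% default and the eigenvalues in the usage line are hypothetical.

```python
import numpy as np

def components_for_threshold(eigenvalues, threshold=0.80):
    """Return the smallest number of leading components whose cumulative
    share of total variance reaches `threshold`, plus the cumulative shares."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(lam / lam.sum())
    # searchsorted finds the first position where the cumulative share
    # is at least the threshold; +1 converts the index into a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, lam.size), cumulative

# Hypothetical eigenvalues of a covariance matrix with four variables.
k, cumulative = components_for_threshold([4.0, 3.0, 2.0, 1.0], threshold=0.80)
print(k, cumulative)
```

The threshold itself is a judgment call; values between 70% and 90% are commonly quoted, and the appropriate value depends on the purpose of the analysis.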
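The next very popular approach is the “scree plot”, where the components are arranged on the horizontal axis in order of decreasing eigenvalue and the corresponding eigenvalues are plotted on the vertical axis.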
The most popular method is parallel analysis, where PCA is performed on simulated random data sets with as many variables and observations as the original data set. The eigenvalues of the original data set are compared with the average eigenvalues of the simulated data sets. Any components of the original data whose eigenvalues are lower than the corresponding simulated eigenvalues are discarded.
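A minimal sketch of parallel analysis is given below. It assumes NumPy, uses normally distributed noise as the simulated reference data, and works on the correlation matrix; the function name, the number of simulations and the synthetic two-factor data set are illustrative assumptions.

```python
import numpy as np

def parallel_analysis(X, n_sim=200, seed=0):
    """Compare observed correlation-matrix eigenvalues with the average
    eigenvalues of random data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    simulated = np.empty((n_sim, p))
    for i in range(n_sim):
        noise = rng.normal(size=(n, p))
        eig = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))
        simulated[i] = np.sort(eig)[::-1]
    reference = simulated.mean(axis=0)
    keep = observed > reference
    # Retain only the leading run of components that beat the reference.
    n_keep = p if keep.all() else int(np.argmin(keep))
    return n_keep, observed, reference

# Hypothetical data: six variables driven by two underlying factors plus noise.
rng = np.random.default_rng(1)
factors = rng.normal(size=(300, 2))
weights = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
X = factors @ weights + 0.3 * rng.normal(size=(300, 6))

n_keep, observed, reference = parallel_analysis(X)
print("components retained:", n_keep)
```

Variants of the method compare against an upper percentile of the simulated eigenvalues instead of the mean, which makes the rule slightly more conservative.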
1.4 Sparse PCA
PCA has many advantages. In terms of maximizing variance, PCA provides the best possible representation of a $p$-dimensional data set in $q$ dimensions ($q < p$). However, the new variables it defines are usually linear functions of all $p$ original variables, which is a downside. For larger $p$, components commonly involve many variables with non-trivial coefficients, which makes them difficult to interpret. A number of PCA adjustments have been proposed to facilitate interpretation of the $q$ dimensions while limiting the loss of variance that results from not using the principal components themselves. There is a trade-off between interpretability and variance. Two types of adjustments are briefly outlined below.
Factor analysis is a method that is often combined with PCA, and it inspires the concept of rotating principal components. Assume that $A_q$ is the $p \times q$ matrix whose columns are the loadings of the first $q$ principal components. Then $XA_q$ is the $n \times q$ matrix whose columns are the scores of the first $q$ principal components for the $n$ observations. Let $T$ be an orthogonal $q \times q$ matrix. Multiplying $A_q$ by $T$ causes an orthogonal rotation of the axes within the space spanned by the first $q$ principal components, resulting in $B_q = A_qT$, a $p \times q$ matrix whose columns are the loadings of the $q$ rotated principal components. $XB_q$ is an $n \times q$ matrix containing the associated scores of the rotated principal components. Any orthogonal matrix $T$ can be used to rotate the components, but it is preferable to make the rotated components easy to interpret. For this reason, $T$ is chosen to optimize a simplicity criterion. A variety of such criteria have been proposed, some of which involve non-orthogonal rotation. The varimax criterion, in which an orthogonal matrix $T$ is chosen to maximize $Q = \sum_{k=1}^{q}\left[\sum_{j=1}^{p} b_{jk}^4 - \frac{1}{p}\left(\sum_{j=1}^{p} b_{jk}^2\right)^2\right]$, where $b_{jk}$ is the $(j, k)$th element of $B_q$, is probably the most commonly used. No variance is lost when considering the rotated $q$-dimensional space, since the sum of the variances of the $q$ rotated components is the same as the sum of the variances of the unrotated components. What is lost is the successive maximization of the unrotated principal components, which means that the total variance is distributed more evenly among the components after rotation. A disadvantage of rotation is the necessary choice between different rotation criteria, although this choice often makes less difference than the choice of the number of components to rotate. If $q$ is increased by 1, the rotated components may look substantially different, whereas this does not happen with unrotated principal components because they are defined successively.
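The rotation step can be sketched in a few lines. The code below is an illustrative implementation of a commonly used iterative SVD scheme for varimax rotation, assuming NumPy; it is a sketch rather than a reference implementation, and the function name, iteration limit, tolerance and synthetic data are assumptions.

```python
import numpy as np

def varimax(A_q, max_iter=100, tol=1e-8):
    """Rotate a p x q loading matrix with an orthogonal T found by an
    iterative SVD scheme aimed at the varimax criterion."""
    p, q = A_q.shape
    T = np.eye(q)
    objective = 0.0
    for _ in range(max_iter):
        B = A_q @ T
        # Gradient-like matrix of the varimax criterion with respect to T.
        G = A_q.T @ (B**3 - B @ np.diag(np.sum(B**2, axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                      # nearest orthogonal matrix to G
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return A_q @ T, T

# Hypothetical data and loadings of the first q = 3 principal components.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X_star = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_star, rowvar=False))
A_q = eigvecs[:, np.argsort(eigvals)[::-1][:3]]

B_q, T = varimax(A_q)
print("total variance before:", np.var(X_star @ A_q, axis=0, ddof=1).sum())
print("total variance after: ", np.var(X_star @ B_q, axis=0, ddof=1).sum())
```

The two totals agree because the rotation is orthogonal, while the individual component variances are generally spread more evenly after rotation.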
Another method of simplifying the principal components is to impose a constraint on the loadings of the new variables. There are several variants of this strategy, one of which borrows the idea of LASSO (least absolute shrinkage and selection operator) linear regression. In this approach, SCoTLASS components are obtained by solving the same optimization problem as PCA, but with the additional constraint $\sum_{j=1}^{p} |a_{jk}| \le \tau$, where $\tau$ is a tuning parameter. The constraint has no effect for $\tau \ge \sqrt{p}$, in which case the principal components are obtained; however, as $\tau$ decreases, more loadings are pushed to zero, which simplifies the interpretation. These simplified components necessarily account for less variance than the corresponding number of principal components, and multiple values of $\tau$ are often examined to find a reasonable compromise between added simplicity and loss of variance. One distinction between the rotation and constraint techniques is that the latter sets some loadings in the linear functions exactly to zero, which aids interpretation, whereas this is usually not the case with rotation. Sparse variants of PCA are adjustments in which many coefficients are zero, and numerous studies of such principal components have been conducted in recent years. Hastie et al. provide a good overview of this work.
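Written out in full, the constrained problem for the $k$th component can be sketched as below. The notation follows the PCA derivation above ($S$ is the sample covariance matrix and $a_k$ the $k$th loading vector); the orthogonality constraint on earlier components is part of the usual statement of SCoTLASS and is included here as an assumption about the intended formulation.

```latex
\begin{aligned}
\max_{a_k}\quad & a_k' S a_k \\
\text{subject to}\quad & a_k' a_k = 1, \\
& a_h' a_k = 0 \quad (h < k), \\
& \sum_{j=1}^{p} |a_{jk}| \le \tau .
\end{aligned}
```

Because a unit-norm vector satisfies $1 \le \sum_{j} |a_{jk}| \le \sqrt{p}$, only values of $\tau$ between 1 and $\sqrt{p}$ change the solution: at $\tau = \sqrt{p}$ the constraint is inactive, and at $\tau = 1$ each component reduces to a single original variable.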
1.5 Robust PCA
PCA is inherently sensitive to the occurrence of outliers and thus to large errors in data sets. As a result, efforts have been made to define robust variants of PCA, and the term RPCA has been used to refer to several approaches to this problem. Huber’s early work focused on robust alternatives to covariance or correlation matrices and how they could be used to generate robust principal components. The demand for methods to process very large data sets sparked renewed interest in robust PCA variants. This led to new lines of PCA research, especially in areas such as machine learning, image processing, web data analysis, and many others.
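Wright et al. defined RPCA as the decomposition of an $n \times p$ data matrix $X$ into the sum of two $n \times p$ components, a low-rank component $L$ and a sparse component $S$ (in this subsection, $L$ and $S$ denote these two components rather than the matrix of singular values and the covariance matrix). Identifying the components of $X$ that minimize a linear combination of two different norms of the components was formulated as a convex optimization task:

$$\min_{L,S}\ \|L\|_* + \lambda\|S\|_1 \quad \text{subject to } X = L + S$$

where $\|L\|_* = \sum_r \sigma_r(L)$ is the nuclear norm of $L$ (the sum of its singular values) and $\|S\|_1 = \sum_{i,j} |s_{ij}|$ is the $\ell_1$ norm of the matrix $S$.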
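One way to solve this convex problem numerically is an augmented Lagrangian (ADMM-style) iteration that alternates singular value thresholding for the low-rank part with elementwise soft thresholding for the sparse part. The sketch below assumes NumPy; the default values of `lam` and `mu` are commonly quoted heuristics rather than values taken from the text, and the function names and synthetic data are illustrative.

```python
import numpy as np

def shrink(M, tau):
    """Elementwise soft thresholding."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    """Soft thresholding applied to the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def robust_pca(X, max_iter=1000, tol=1e-7):
    n, p = X.shape
    lam = 1.0 / np.sqrt(max(n, p))          # assumed default weight on the sparse term
    mu = n * p / (4.0 * np.abs(X).sum())    # assumed default penalty parameter
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                    # Lagrange multiplier for X = L + S
    for _ in range(max_iter):
        L = svd_threshold(X - S + Y / mu, 1.0 / mu)
        S = shrink(X - L + Y / mu, lam / mu)
        residual = X - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(X):
            break
    return L, S

# Hypothetical data: a rank-2 matrix with about 5% of entries grossly corrupted.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 40))
sparse = np.zeros((60, 40))
mask = rng.random((60, 40)) < 0.05
sparse[mask] = 10.0 * rng.choice([-1.0, 1.0], size=mask.sum())
X = low_rank + sparse

L, S = robust_pca(X)
print("relative residual:", np.linalg.norm(X - L - S) / np.linalg.norm(X))
```

How well the recovered `L` matches the underlying low-rank matrix depends on the rank, the fraction of corrupted entries and the choice of `lam`, so the output should be inspected rather than assumed.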
2. Related work
PCA was first introduced by Pearson as an analogue of the principal axis theorem in mechanics, and it was later developed and named by Hotelling. The range of applications in finance and economics is extensive. One example is the use of PCA to document a three-factor structure. Stock and Watson used PCA to monitor economic development and activity, as well as the inflation index. Egloff et al. used PCA to analyze the dimensionality of volatility dynamics, which they describe using a two-factor volatility model that captures long-term and short-term fluctuations in the volatility structure. Baker and Wurgler used PCA to measure investor sentiment, i.e., investors' positive or negative view, extracting it from a number of sentiment proxies, and Baker et al. created the policy uncertainty index, which represents potential risks in the near future.
The most important step in PCA is the estimation of the eigenvalues of the sample covariance matrix. Anderson and Weeks and Anderson showed that the sample eigenvalues are consistent in the asymptotic setting. Waternaux proved that similar results hold for simple eigenvalues as long as the data have a finite fourth moment. Beyond the textbook treatments, the asymptotic distribution of the eigenvectors has also been established under more general assumptions.
However, this approach to estimating eigenvalues has some downsides. The first problem is dimensionality: when the cross-sectional dimension grows together with the sample size, the estimates become inconsistent. Another problem is that PCA captures only linear structure and does not account for nonlinear patterns. A third problem arises from the dependence of the asymptotic theory on fixed assumptions for the analysis. For these reasons, difficulties arise when PCA is applied to returns data. Most of the time, years of data are needed to make an inference, which in turn leads to other problems, such as the permanence and consistency of parameters that are not fixed over time. This type of data has backlogs, and volatility often varies over time.
These problems stimulate progress in this field and motivate the development of tools for PCA methods. An approach in which the number of observations grows within a fixed time period addresses all the listed downsides. Theoretically, it is known that as the sampling frequency increases, the precision of the estimated variances and covariances increases. This is true until market microstructure effects begin to matter. Incidentally, this is not a serious problem if we choose a sampling frequency of minutes, as opposed to the sub-second intervals most often available for liquid stocks. A high-frequency asymptotic analysis with the cross-sectional dimension is expected as the time interval increases sharply. This high-frequency asymptotic framework allows us to perform non-parametric analysis without the independence, stationarity and parametric assumptions required for low-frequency processes.
Asymptotic theory is very common in many contexts. Jacod et al. and Jacod and Podolskij also dealt with a problem that we address in this paper, where the cross-sectional dimension is fixed and the process is continuous. Mykland and Zhang designed an alternative theory that discusses inference for functions of volatility. It is based on the aggregation of local estimates and uses a finite number of blocks. Saha et al. considered the expected values of the integrated covariance matrix when there is measurement error and the matrix is large and built from high-frequency data. Tao et al. addressed the convergence rate. Jacod and Rosenbaum analyzed estimators composed of aggregated functions of local estimates, using integrated quarticity estimation. Heinrich and Podolskij discussed empirical covariance matrices of Brownian integrals. The measurement of the leverage effect and its evaluation by the integrated correlation method have also been discussed.
PCA can be used in the analysis of financial data for different purposes. For example, it has been used to identify the type of impact of grouped impact factors, such as assessing the quality of accounting information and facilitating the process of financial analysis conducted by different users. It has also been used to assess the impact of the evolution of Finnish standards toward IFRS (International Financial Reporting Standards). Finally, PCA has been used to determine the macroeconomic impact on the profitability of Romanian listed companies, using data from 1997 to 2007, which identified the following indicators: liquidity, solvency, and the firm’s dimension.
When it comes to the use of PCA in financial statement analysis, four papers are reviewed here, the first three of which focus on Romanian listed companies. All papers emphasize the importance of using PCA in the analysis of key financial ratios. In the first paper, the author analyzed data for 16 initial variables, which were grouped into 3 new variables: a general efficiency indicator, an indicator correlated with the historical debts of companies, and a development indicator (given long-term debt and deferred income). Those three variables were able to explain 96.72% of the initial variability. In the second paper, the authors analyzed data for 2010, initially including seven indicators of standard financial analysis, and reduced them to only two, which explain 94% of the initial variability. In the third paper, the authors used data from the stock exchange in the period 2006–2011 to identify the main components of financial statements, which explain 79.08% of the initial variability. The same group of indicators was used on a research sample that consisted of 111 companies from the Madrid stock exchange and 32 companies from the Eurostoxx50 for the reporting periods 2005–2007. The results showed that those six indicators explained 87% of the total variance, with the first two indicators accounting for approximately 44% of the total variance.
3. Case study—PCA and cluster analysis in financial accounting data
3.1 Research methodology
In order to answer the defined research question, 3,013 medium and large business entities were selected at random and used as the research sample. Financial statements for the 2019 reporting period were downloaded manually from the official website of the Business Registers Agency (BRA). The BRA is a state administrative body that collects financial statements and the corresponding audit reports of business entities that operate within the territory of the Republic of Serbia. Information published by the BRA is used for financial analysis of business entities and as a basis for the decision-making process. Afterwards, data from the PDF files containing the financial statements were copied and recorded in previously prepared tables in Excel files. Medium and large business entities in the Republic of Serbia have an obligation to prepare and disclose a full set of financial statements, consisting of the balance sheet, income statement, cash flow statement, statement of changes in equity and notes to financial statements. Since all of these statements, except the notes to financial statements, are quantitative in nature, they were used for this research. Values originally disclosed in RSD, the reporting currency, were converted into euros by using the average exchange rate of the euro on the balance sheet date (31 December). The value of each financial statement line item is disclosed in thousands, and the values are therefore presented as such in this research.
Line items in official financial statements are marked by a corresponding automatic data processing number (in Serbian: Automatska obrada podataka, AOP) that belongs to the national nomenclature system. These markings are used to check the mathematical calculations before each financial statement is accepted for publishing by the BRA. They also serve as an instrument for connecting data and information regarding the same item presented in different financial statements. Balance sheet items cover automatic data processing numbers from 0001 to 0465; the income statement from 1001 to 1071; the statement of cash flows from 3001 to 3047; and the statement of changes in equity from 4001 to 4252. Table 1 shows the formulas used for the calculation of the selected financial indicators that will be used in this research. Bearing in mind that these variables will be used to differentiate business entities by the three major types of business activities, they were selected on the basis of common sense.
|Indicator|Formula|
|---|---|
|Fixed assets in total assets|AOP2/AOP71|
|Percent sales of merchandise in total operating revenue|AOP1002/AOP1001|
|Percent sales of products and services in total operating revenue|AOP1009/AOP1018|
|Percent cost of merchandise sold in total operating expenses|AOP1019/AOP1018|
|Percent cost of material in total operating expenses|AOP1023/AOP1018|
|Percent fuel and energy cost in total operating expenses|AOP1024/AOP1018|
|Percent wage cost in total operating expenses|AOP1025/AOP1018|
|Percent productive service cost in total operating expenses|AOP1026/AOP1018|
|Percent depreciation cost in total operating expenses|AOP1027/AOP1018|
|Percent raw material in total assets|AOP45/AOP71|
|Percent WIP in total assets|AOP46/AOP71|
|Percent finished products in total assets|AOP47/AOP71|
|Percent WIP and finished products in total assets|(AOP46 + AOP47)/AOP71|
|Percent merchandise in total assets|AOP48/AOP71|
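The indicator formulas can be applied to tabulated statement data in a few lines. The sketch below assumes pandas and a hypothetical table with one row per entity and one column per AOP code; the column naming, the entity labels and all figures are invented for illustration, and only four of the indicators are shown.

```python
import pandas as pd

# Hypothetical extract of statement items (thousands of EUR), keyed by AOP code.
reports = pd.DataFrame(
    {
        "AOP2": [500.0, 120.0],      # fixed assets
        "AOP71": [1000.0, 400.0],    # total assets
        "AOP1001": [2000.0, 900.0],  # operating revenue
        "AOP1002": [1500.0, 90.0],   # sales of merchandise
        "AOP1018": [1800.0, 800.0],  # operating expenses
        "AOP1019": [1200.0, 40.0],   # cost of merchandise sold
        "AOP46": [0.0, 30.0],        # work in progress
        "AOP47": [0.0, 50.0],        # finished products
    },
    index=["entity_a", "entity_b"],
)

# Each indicator is a function of the AOP columns, mirroring the formulas in the table.
FORMULAS = {
    "fixed_assets_share": lambda d: d["AOP2"] / d["AOP71"],
    "merchandise_sales_share": lambda d: d["AOP1002"] / d["AOP1001"],
    "merchandise_cost_share": lambda d: d["AOP1019"] / d["AOP1018"],
    "wip_and_finished_share": lambda d: (d["AOP46"] + d["AOP47"]) / d["AOP71"],
}

indicators = pd.DataFrame({name: f(reports) for name, f in FORMULAS.items()})
print(indicators)
```

In this toy example the first entity looks like a merchandiser (high merchandise shares, no production inventory) and the second like a manufacturer, which is the kind of contrast the indicators are meant to expose.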
Data preparation is a key process in data analysis. The basic preparation and cleaning procedures are:
- Preparing a copy of the table
- Adding new attributes
- Conversion of column types
- General data cleaning and adjustment
Specifically, the cleaning includes the following items:
- Editing date variables (the most common source of formatting problems)
- Recoding of zeros and missing values
- Encoding categorical variables using label encoding and one-hot encoding
- Application of normalization, standardization, or log transformation
- Calculating descriptive statistics: mean, median, mode, standard deviation, variance, rank, etc.
- Calculating inferential statistics: distributions, t-value, p-value, frequencies, cross-tabulations, correlation, covariance, etc.
More advanced techniques include:
Categorical variables stored as character variables must be converted to a factor type for modeling purposes.
For numeric variables, deviations from a symmetric distribution can be identified numerically through the skewness value.
One technique for normalizing a skewed distribution is the logarithmic transformation. A new, log-transformed variable is created first, and the skewness of this new variable is then calculated and printed.
One standardization technique centers all features around zero and rescales them to approximately unit variance. After scaling, each variable has a mean of zero and a standard deviation of one.
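The last three techniques can be illustrated with a short NumPy sketch. The synthetic log-normal sample, the function names, and the use of log(1 + x) (chosen here so that zero values remain defined) are assumptions of this illustration and are not details taken from the study.

```python
import numpy as np


def skewness(values):
    """Third standardized moment (population form) of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


def standardize(values):
    """Center on zero and rescale to unit variance."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # synthetic right-skewed variable

logged = np.log1p(raw)        # logarithmic transformation, log(1 + x)
scaled = standardize(logged)  # mean 0, standard deviation 1

print("skewness before log transform:", skewness(raw))
print("skewness after log transform: ", skewness(logged))
print("mean and std after scaling:   ", scaled.mean(), scaled.std())
```

Comparing the two printed skewness values shows how strongly the transformation reduces the asymmetry of the variable.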
As part of the preparation for PCA, missing values in the dataset were first filled with zeros. The data was then scaled with a standard scaler, which standardizes features by removing the mean and scaling to unit variance. The preprocessed dataset was then used as input for three PCA methods: standard PCA, Sparse PCA, and Robust PCA.
All three PCA methods were instantiated with the number of components set to 7. After PCA, the transformed data was passed through several clustering methods in order to compare results. The clustering methods used for each PCA method were K-means, Agglomerative, Birch, Gaussian mixture, and Spectral clustering.
Furthermore, each clustering method was also executed on the preprocessed data alone, without PCA, again for the purpose of comparing results.
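A minimal sketch of this workflow with scikit-learn is given below. It assumes that scikit-learn is the library in use (the text does not name it), replaces the real data with synthetic values, and shows only standard PCA with two of the clustering methods; `SparsePCA` and the remaining clusterers could be swapped in the same way, while Robust PCA is not shipped with scikit-learn and is omitted here.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 14 financial indicators (NOT the chapter's data).
centers = rng.normal(scale=3.0, size=(3, 14))
X = np.vstack([center + rng.normal(size=(100, 14)) for center in centers])
X[rng.random(X.shape) < 0.02] = np.nan  # sprinkle some missing values

X = np.nan_to_num(X, nan=0.0)                       # fill missing values with zeros
X_scaled = StandardScaler().fit_transform(X)        # zero mean, unit variance
X_pca = PCA(n_components=7).fit_transform(X_scaled)

results = {}
for data_name, data in {"No PCA": X_scaled, "PCA": X_pca}.items():
    clusterers = {
        "K-means": KMeans(n_clusters=3, n_init=10, random_state=0),
        "Agglomerative": AgglomerativeClustering(n_clusters=3),
    }
    for algo_name, algo in clusterers.items():
        labels = algo.fit_predict(data)
        results[f"{algo_name}/{data_name}"] = (
            tuple(int(size) for size in np.bincount(labels)),  # cluster sizes
            silhouette_score(data, labels),
            davies_bouldin_score(data, labels),
        )

for name, (sizes, silhouette, davies_bouldin) in results.items():
    print(f"{name}: sizes={sizes}, silhouette={silhouette:.3f}, DB={davies_bouldin:.3f}")
```

Each entry of `results` holds the cluster sizes, the Silhouette index, and the Davies-Bouldin index for one clustering/PCA combination.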
Compute dot product matrix:
Keep first 7 components:
Compute 7 features:
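The three steps listed above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the data are synthetic, the variable names are hypothetical, and the eigen-decomposition of the dot product matrix is used as one standard way of obtaining the components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated data standing in for 14 standardized indicators.
X = rng.normal(size=(200, 14)) @ rng.normal(size=(14, 14))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 1: compute the dot product matrix of the standardized data.
dot_product_matrix = X.T @ X

# Step 2: eigen-decompose it and keep the first 7 components.
# eigh returns eigenvalues in ascending order, so they are re-sorted.
eigenvalues, eigenvectors = np.linalg.eigh(dot_product_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
components = eigenvectors[:, :7]

# Step 3: compute the 7 new features by projecting the data.
features = X @ components

retained = eigenvalues[:7].sum() / eigenvalues.sum()
print("shape of new features:", features.shape)
print("fraction of variance retained:", round(float(retained), 3))
```

The new features are mutually uncorrelated, and the sum of squares of each one equals the corresponding eigenvalue.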
4. Results and discussion
4.1 Comparative results—total variance explained
This section discusses the outcomes of PCA and cluster analysis. First, the initial variables that load on the principal components are studied. The loadings correspond to the correlations or covariances between the original variables and the principal components. They are collected in a loading matrix, which is obtained by multiplying the eigenvector matrix by a diagonal matrix containing the square root of each eigenvalue. The entries depend on the component extraction method used. Non-standardized loadings give the covariance between the mean-centered variables and the standardized component scores, regardless of whether the extraction is based on the singular value decomposition of the data matrix or the eigenvalue decomposition of the covariance matrix.
The eigenvalue decomposition of the correlation matrix yields the standardized loadings. These loadings represent the correlations between the original variables and the component scores. Because they always lie between −1 and 1 and are independent of the scale used, standardized loadings are easy to read. In most cases, a threshold is set and only variables whose loadings exceed this threshold in absolute value are examined.
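The relationship between standardized loadings and correlations can be checked numerically. The following sketch uses synthetic data and a hypothetical threshold of 0.5; both are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data standing in for the financial ratios.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized variables

# Eigenvalue decomposition of the correlation matrix, sorted in descending order.
R = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loading matrix: eigenvector matrix times diag(sqrt(eigenvalues)).
# Broadcasting scales column j by sqrt(eigenvalues[j]).
loadings = eigenvectors * np.sqrt(eigenvalues)

# Component scores and their correlations with the original variables.
scores = Z @ eigenvectors
n_vars = Z.shape[1]
correlations = np.array([
    [np.corrcoef(Z[:, i], scores[:, j])[0, 1] for j in range(n_vars)]
    for i in range(n_vars)
])

THRESHOLD = 0.5  # hypothetical cut-off for loadings worth examining
salient = np.abs(loadings) > THRESHOLD

print("loadings equal correlations:", np.allclose(loadings, correlations))
print("salient loadings per component:", salient.sum(axis=0))
```

The check confirms that, for standardized variables, each loading coincides with the correlation between a variable and a component score.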
The total variance is the sum of the variances of the principal components. The fraction of variance explained by a principal component is the ratio between the variance of that component and the total variance.
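As a small worked example of this definition, the sketch below computes the explained and cumulative fractions for four made-up component variances; the numbers and the function name are assumptions for illustration and do not come from the study.

```python
from itertools import accumulate


def explained_variance(component_variances):
    """Return (fraction explained by each component, cumulative fraction)."""
    total = sum(component_variances)
    fractions = [variance / total for variance in component_variances]
    cumulative = list(accumulate(fractions))
    return fractions, cumulative


# Made-up variances of four principal components, in descending order.
fractions, cumulative = explained_variance([4.0, 2.0, 1.0, 1.0])
print(fractions)   # [0.5, 0.25, 0.125, 0.125]
print(cumulative)  # [0.5, 0.75, 0.875, 1.0]
```

In this toy case, keeping the first three components would retain 87.5% of the total variance.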
Figure 1 shows the total variance explained by the three PCA methods. The steepest increase belongs to the standard PCA line, whose cumulative explained variance is approximately 87%. This line is almost parallel to the Sparse PCA line, whose cumulative explained variance is 83%. For the Robust PCA line, however, the cumulative explained variance is only approximately 26% and the increase in values is minimal.
PCA: Among the components, the highest fraction of explained variance is 32% and the lowest is 5%. The cumulative explained variance is 86.7% (see Table 2).
|Factors||Total||% of variance||Cumulative %|
Sparse PCA: Among the components, the highest fraction of explained variance is 21% and the lowest is 5%. Together, the components explain 83% of the total variance (see Table 3).
|Factors||Total||% of variance||Cumulative %|
Robust PCA: Among the components, the highest fraction of explained variance is 21% and the lowest is 0%. Together, the components explain 25% of the total variance (see Table 4).
|Factors||Total||% of variance||Cumulative %|
For the given number of features, standard PCA is the best approach for this kind of data, since it retains the largest share of the total variance.
The communalities represent the amount of variance accounted for in each variable considered. The initial communalities estimate the variance in each variable that is explained by all components or factors.
For standard PCA, the highest communalities belong to percent fuel and energy cost in total operating expenses (88%), percent productive service cost in total operating expenses (75%), and percent finished products in total assets (75%) (see Table 5).
|Variable||Communality|
|Percent merchandise in total assets||0.159427|
|Percent sales of merchandise in total operating revenue||0.222216|
|Percent cost of merchandise sold in total operating expenses||0.224299|
|Percent sales of products and services in total operating revenue||0.236318|
|Fixed assets in total assets||0.347415|
|Percent cost of material in total operating expenses||0.411423|
|Percent raw material in total assets||0.426201|
|Percent WIP and finished products in total assets||0.449704|
|Percent depreciation cost in total operating expenses||0.683213|
|Percent wage cost in total operating expenses||0.729997|
|Percent WIP in total assets||0.731771|
|Percent finished products in total assets||0.745349|
|Percent productive service cost in total operating expenses||0.752027|
|Percent fuel and energy cost in total operating expenses||0.880639|
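The printed values in Table 5 appear to equal the sum of the squared entries of each variable's row in the first factor table given at the end of the chapter (Factor 0 to Factor 6). This is an observation drawn from the published numbers, not a statement about the authors' code; the sketch below checks it for two rows, with the function name chosen for illustration.

```python
# Rows copied from the first factor table at the end of the chapter (Factor 0..6).
COMPONENT_WEIGHTS = {
    "Fixed assets in total assets": [
        0.178413, -0.326641, 0.415221, -0.102354, -0.025277, -0.072754, -0.141675,
    ],
    "Percent fuel and energy cost in total operating expenses": [
        0.150958, -0.217356, 0.283494, -0.008988, 0.096350, 0.822637, -0.210100,
    ],
}

# Values reported for the same variables in Table 5.
REPORTED = {
    "Fixed assets in total assets": 0.347415,
    "Percent fuel and energy cost in total operating expenses": 0.880639,
}


def communality(weights):
    """Sum of squared weights of one variable over the retained components."""
    return sum(weight * weight for weight in weights)


computed = {name: communality(row) for name, row in COMPONENT_WEIGHTS.items()}
for name, value in computed.items():
    print(f"{name}: computed {value:.6f}, reported {REPORTED[name]:.6f}")
```

The small remaining differences are consistent with the six-decimal rounding of the printed factor values.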
For Sparse PCA, the highest communalities belong to percent fuel and energy cost in total operating expenses (91%), percent finished products in total assets (80%), and percent productive service cost in total operating expenses (74%) (see Table 6).
|Variable||Communality|
|Percent merchandise in total assets||0.191833|
|Percent sales of products and services in total operating revenue||0.227810|
|Percent sales of merchandise in total operating revenue||0.260545|
|Percent cost of merchandise sold in total operating expenses||0.263888|
|Fixed assets in total assets||0.354743|
|Percent cost of material in total operating expenses||0.407825|
|Percent raw material in total assets||0.417451|
|Percent WIP and finished products in total assets||0.451553|
|Percent depreciation cost in total operating expenses||0.555661|
|Percent wage cost in total operating expenses||0.695148|
|Percent WIP in total assets||0.719447|
|Percent productive service cost in total operating expenses||0.742714|
|Percent finished products in total assets||0.800108|
|Percent fuel and energy cost in total operating expenses||0.911274|
For Robust PCA, the highest communalities belong to percent wage cost in total operating expenses (82%), percent sales of merchandise in total operating revenue (79%), and percent cost of merchandise sold in total operating expenses (75%) (see Table 7).
|Variable||Communality|
|Percent WIP in total assets||0.200472|
|Percent merchandise in total assets||0.317793|
|Percent finished products in total assets||0.333984|
|Percent depreciation cost in total operating expenses||0.345393|
|Percent fuel and energy cost in total operating expenses||0.349862|
|Percent sales of products and services in total operating revenue||0.365996|
|Percent raw material in total assets||0.433737|
|Percent WIP and finished products in total assets||0.444081|
|Percent cost of material in total operating expenses||0.519423|
|Fixed assets in total assets||0.651365|
|Percent productive service cost in total operating expenses||0.680299|
|Percent cost of merchandise sold in total operating expenses||0.745842|
|Percent sales of merchandise in total operating revenue||0.789024|
|Percent wage cost in total operating expenses||0.822730|
Figure 2 presents the communalities, that is, the amount of variance accounted for in each considered variable. For PCA and Sparse PCA, the variables Percent fuel and energy cost in total operating expenses and Percent finished products in total assets have high estimated variance. For Robust PCA, the highest value (82%) belongs to the variable Percent wage cost in total operating expenses. From an economic point of view, the first two variables could be used to distinguish between the three major types of business activities. The amount of fuel and energy cost is expected to differ between business activities. Production entities are expected to have higher fuel and energy costs because plant, machinery and equipment require energy to operate. Merchandising entities will probably also have higher fuel and energy costs than service entities, given the fuel spent on transporting merchandise and the energy needed to operate their facilities. The second variable, Percent finished products in total assets, is also expected to support differentiation, since only production entities report this balance sheet line in their financial statements. The main surprise is the third variable, Percent wage cost in total operating expenses, since most entities have a very similar share of wage costs in total operating expenses. Namely, although official state records show that average wages differ across industries, company management usually plans operating expenses and their structure.
Judged by a high Silhouette index together with balanced cluster sizes, the best clustering/PCA combinations are K-means/Robust PCA and Spectral/Robust PCA. For the Davies-Bouldin index, a smaller value indicates better clustering: the clusters are dissimilar from one another, and the objects inside each cluster are tightly grouped (see Table 8).
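To make the "smaller is better" reading of the Davies-Bouldin index concrete, the following NumPy sketch implements the usual definition (average, over clusters, of the worst ratio of within-cluster scatter to between-centroid distance) and applies it to two made-up configurations. The function name and the toy points are assumptions for illustration; the Silhouette index is not covered by this sketch.

```python
import numpy as np


def davies_bouldin(X, labels):
    """Davies-Bouldin index using Euclidean distances to cluster centroids."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in cluster_ids])
    # Scatter: mean distance of the members of a cluster to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == k] - centroid, axis=1).mean()
        for k, centroid in zip(cluster_ids, centroids)
    ])
    worst_ratios = []
    for i in range(len(cluster_ids)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(cluster_ids)) if j != i
        ]
        worst_ratios.append(max(ratios))  # most similar other cluster
    return float(np.mean(worst_ratios))


# Toy data: three clusters with the same centers but different spread.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
offsets = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
labels = np.repeat(np.arange(3), len(offsets))


def make_points(spread):
    return np.vstack([center + spread * offsets for center in centers])


db_tight = davies_bouldin(make_points(0.5), labels)
db_loose = davies_bouldin(make_points(3.0), labels)
print("tight clusters:", db_tight)
print("loose clusters:", db_loose)
```

With identical centers, the tighter configuration receives the lower index, which is the behavior relied on when comparing the rows of Table 8.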
|Clustering/PCA method||Cluster sizes||Silhouette index||Davies-Bouldin index|
|K-means/No PCA||(1345, 932, 733)||0.30208710358306756||1.5444364169813884|
|K-means/PCA||(1353, 934, 723)||0.3637346841903855||1.3405097768944103|
|K-means/Sparse PCA||(1356, 939, 715)||0.36307616530243575||1.3418713066940657|
|K-means/Robust PCA||(1209, 944, 857)||0.5193200382282146||0.7834359567299072|
|Agglomerative/No PCA||(1151, 935, 924)||0.27839422485839554||1.7150687814273013|
|Agglomerative/PCA||(1225, 962, 823)||0.31642069773357084||1.4995739243069988|
|Agglomerative/Sparse PCA||(1888, 893, 229)||0.31642069773357084||1.4995739243069988|
|Agglomerative/Robust PCA||(1311, 878, 821)||0.4593880561940543||0.9274868826361716|
|Birch/No PCA||(1151, 935, 924)||0.27839422485839554||1.7150687814273013|
|Birch/PCA||(1225, 962, 823)||0.31642069773357084||1.4995739243069988|
|Birch/Sparse PCA||(1225, 962, 823)||0.31642069773357084||1.4995739243069988|
|Birch/Robust PCA||(1317, 867, 826)||0.45631070311567473||0.9348852316431389|
|Gaussian mixture/No PCA||(1336, 992, 682)||0.17495781525891207||2.1078218204567496|
|Gaussian mixture/PCA||(1161, 1155, 694)||0.2539355374019169||1.6227017939395394|
|Gaussian mixture/Sparse PCA||(1161, 1155, 694)||0.2539355374019169||1.6227017939395394|
|Gaussian mixture/Robust PCA||(1467, 784, 759)||0.28455634384131373||1.1919962215015028|
|Spectral/No PCA||(2994, 8, 8)||0.460433642421337||0.9718901349784725|
|Spectral/PCA||(3001, 7, 2)||0.5399338738262545||0.6856986473871954|
|Spectral/Sparse PCA||(3001, 7, 2)||0.5399338738262545||0.6856986473871954|
|Spectral/Robust PCA||(1346, 920, 744)||0.5146721760042233||0.7917964357887189|
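The selection of the best combinations from Table 8 can be reproduced with a short script that first discards degenerate clusterings and then ranks the rest. The values below are the Table 8 entries rounded to four decimals with normalized labels; the 5% minimum cluster share is a hypothetical threshold introduced only for this illustration and is not stated in the study.

```python
# (method, cluster sizes, Silhouette index, Davies-Bouldin index), rounded from Table 8.
TABLE_8 = [
    ("K-means/No PCA", (1345, 932, 733), 0.3021, 1.5444),
    ("K-means/PCA", (1353, 934, 723), 0.3637, 1.3405),
    ("K-means/Sparse PCA", (1356, 939, 715), 0.3631, 1.3419),
    ("K-means/Robust PCA", (1209, 944, 857), 0.5193, 0.7834),
    ("Agglomerative/No PCA", (1151, 935, 924), 0.2784, 1.7151),
    ("Agglomerative/PCA", (1225, 962, 823), 0.3164, 1.4996),
    ("Agglomerative/Sparse PCA", (1888, 893, 229), 0.3164, 1.4996),
    ("Agglomerative/Robust PCA", (1311, 878, 821), 0.4594, 0.9275),
    ("Birch/No PCA", (1151, 935, 924), 0.2784, 1.7151),
    ("Birch/PCA", (1225, 962, 823), 0.3164, 1.4996),
    ("Birch/Sparse PCA", (1225, 962, 823), 0.3164, 1.4996),
    ("Birch/Robust PCA", (1317, 867, 826), 0.4563, 0.9349),
    ("Gaussian mixture/No PCA", (1336, 992, 682), 0.1750, 2.1078),
    ("Gaussian mixture/PCA", (1161, 1155, 694), 0.2539, 1.6227),
    ("Gaussian mixture/Sparse PCA", (1161, 1155, 694), 0.2539, 1.6227),
    ("Gaussian mixture/Robust PCA", (1467, 784, 759), 0.2846, 1.1920),
    ("Spectral/No PCA", (2994, 8, 8), 0.4604, 0.9719),
    ("Spectral/PCA", (3001, 7, 2), 0.5399, 0.6857),
    ("Spectral/Sparse PCA", (3001, 7, 2), 0.5399, 0.6857),
    ("Spectral/Robust PCA", (1346, 920, 744), 0.5147, 0.7918),
]

MIN_SHARE = 0.05  # hypothetical: every cluster must hold at least 5% of the entities


def is_balanced(sizes, min_share=MIN_SHARE):
    """True if the smallest cluster holds at least min_share of all entities."""
    return min(sizes) / sum(sizes) >= min_share


balanced = [row for row in TABLE_8 if is_balanced(row[1])]
by_silhouette = sorted(balanced, key=lambda row: row[2], reverse=True)
by_davies_bouldin = sorted(balanced, key=lambda row: row[3])

print("highest raw Silhouette index:", max(TABLE_8, key=lambda row: row[2])[0])
print("best balanced by Silhouette: ", [row[0] for row in by_silhouette[:2]])
print("best balanced by DB index:   ", [row[0] for row in by_davies_bouldin[:2]])
```

The filter matters because the highest raw Silhouette index belongs to a Spectral clustering that places almost all entities in a single cluster, which is of little use for separating three business activities.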
This chapter focused on the use of principal component analysis in financial data science. The research covered 3013 medium and large business entities and their financial statements for the 2019 reporting period. PCA was used to differentiate between the three major types of business activities: merchandising, manufacturing, and service. To that end, 14 financial ratios were selected by common sense and further analyzed according to their significance in dimensionality reduction. The analysis yielded 7 new variables, each grouping two ratios: (1) cost of merchandise sold in total operating expenses, and cost of material in total operating expenses; (2) fuel and energy cost in total operating expenses, and sales of products and services in total operating revenue; (3) wage costs in total operating expenses, and sales of merchandise in total operating revenue; (4) productive service cost in total operating expenses, and fixed assets in total assets; (5) depreciation cost in total operating expenses, and merchandise in total assets; (6) raw material in total assets, and WIP and finished products in total assets; (7) finished products in total assets, and WIP in total assets. These groups of variables explained 86.7% of the initial variability. Compared with the results of the authors mentioned in the literature review, this percentage is within the range of previously reported results. As for the initial communalities, which estimate the variance in each variable, the three financial ratios with the highest percentages were: fuel and energy cost in total operating expenses (original PCA 88%, sparse PCA 91%); productive service cost in total operating expenses (original PCA 75%, sparse PCA 74%); and finished products in total assets (original PCA 75%, sparse PCA 80%).
Although these ratios showed the best results, it has to be mentioned that all of the financial ratios used in the analysis are correlated, and the results would therefore differ if other ratios were used.
We would like to express our gratitude to Prof. Nemanja Stanišić, Ph.D., of Singidunum University for supporting this research through valuable suggestions and for providing the research database.
Conflicts of interest
Authors declare no conflict of interest.
|Columns/factors||Factor 0||Factor 1||Factor 2||Factor 3||Factor 4||Factor 5||Factor 6|
|Fixed assets in total assets||0.178413||−0.326641||0.415221||−0.102354||−0.025277||−0.072754||−0.141675|
|Percent sales of merchandise in total operating revenue||−0.436002||0.152729||0.080029||−0.025182||0.028728||0.016652||−0.025519|
|Percent sales of products and services in total operating revenue||0.398117||0.022315||−0.270570||−0.046006||0.031509||0.012930||−0.028959|
|Percent cost of merchandise sold in total operating expenses||−0.432559||0.162995||0.080296||−0.035352||0.022542||0.026778||−0.041260|
|Percent cost of material in total operating expenses||0.269688||0.303749||−0.078000||−0.386050||−0.243323||−0.066637||−0.166323|
|Percent fuel and energy cost in total operating expenses||0.150958||−0.217356||0.283494||−0.008988||0.096350||0.822637||−0.210100|
|Percent wage cost in total operating expenses||0.210317||−0.224488||−0.145697||0.058149||0.719422||−0.304476||−0.022063|
|Percent productive service cost in total operating expenses||0.137457||−0.048360||−0.397081||0.585499||−0.374661||0.172006||0.245674|
|Percent depreciation cost in total operating expenses||0.095815||−0.269993||0.484683||0.000289||−0.400868||−0.359862||0.275725|
|Percent raw material in total assets||0.190490||0.245296||−0.165694||−0.526444||−0.137683||0.071830||0.032101|
|Percent WIP in total assets||0.158335||0.359273||0.200936||0.383609||−0.059087||−0.175372||−0.596528|
|Percent finished products in total assets||0.174390||0.375283||0.278149||0.015943||0.252221||0.151402||0.640266|
|Percent WIP and finished products in total assets||0.214830||0.474174||0.309621||0.255892||0.126328||−0.013709||0.034896|
|Percent merchandise in total assets||−0.355975||0.151166||−0.014079||−0.013559||0.088827||0.039431||0.005508|
|Columns/factors||Factor 0||Factor 1||Factor 2||Factor 3||Factor 4||Factor 5||Factor 6|
|Fixed assets in total assets||0.000000||−0.020910||−0.504987||0.000000||−0.254639||−0.185601||−0.002365|
|Percent sales of merchandise in total operating revenue||0.435472||0.000000||0.246576||−0.085566||0.000000||0.052803||0.000000|
|Percent sales of products and services in total operating revenue||−0.433624||0.008108||0.000000||0.199284||0.000000||0.000000||0.000000|
|Percent cost of merchandise sold in total operating expenses||0.438993||0.000000||0.254341||−0.067395||0.000000||0.044065||0.000000|
|Percent cost of material in total operating expenses||−0.027509||0.085834||0.000000||0.630267||0.000000||0.045436||−0.019978|
|Percent fuel and energy cost in total operating expenses||0.000000||0.000000||0.000000||0.000000||0.000000||−0.954607||0.000000|
|Percent wage cost in total operating expenses||−0.453726||0.000000||0.000000||−0.325539||−0.594326||0.173439||0.000000|
|Percent productive service cost in total operating expenses||−0.333679||0.000000||0.000000||−0.222344||0.762847||0.000000||0.000000|
|Percent depreciation cost in total operating expenses||0.108694||0.000000||−0.726249||0.000000||0.000000||0.128099||0.000000|
|Percent raw material in total assets||0.000000||−0.007293||0.083372||0.624407||−0.000201||0.000000||0.143399|
|Percent WIP in total assets||0.000000||0.460374||0.000000||0.000000||0.000000||0.000000||−0.712393|
|Percent finished products in total assets||0.000000||0.573218||0.000000||0.000000||0.000000||0.000000||0.686680|
|Percent WIP and finished products in total assets||0.000000||0.671977||0.000000||0.000000||0.000000||0.000000||0.000000|
|Percent merchandise in total assets||0.315974||0.000000||0.291738||−0.076742||0.000000||0.031515||0.000000|
|Columns/factors||Factor 0||Factor 1||Factor 2||Factor 3||Factor 4||Factor 5||Factor 6|
|Fixed assets in total assets||−0.173467||0.275000||0.507855||−0.499215||0.156525||−0.088965||0.078114|
|Percent sales of merchandise in total operating revenue||0.525938||−0.128407||0.122190||−0.093645||−0.109748||−0.078270||0.673835|
|Percent sales of products and services in total operating revenue||−0.444479||−0.119035||−0.307493||−0.162452||−0.114015||−0.009416||−0.142249|
|Percent cost of merchandise sold in total operating expenses||0.510523||−0.126489||0.118665||−0.111772||−0.087492||0.088045||−0.653626|
|Percent cost of material in total operating expenses||−0.204519||−0.558377||0.030438||−0.282440||−0.265091||0.084341||0.087885|
|Percent fuel and energy cost in total operating expenses||−0.119620||0.103472||0.453140||0.056594||−0.245465||0.200560||−0.125820|
|Percent wage cost in total operating expenses||−0.204794||0.179552||0.159958||0.368300||−0.273464||−0.715834||−0.010912|
|Percent productive service cost in total operating expenses||−0.131802||0.032733||0.012123||0.533088||−0.234838||0.537258||0.183658|
|Percent depreciation cost in total operating expenses||−0.098392||0.135731||0.478665||0.038015||−0.137017||0.259630||−0.023300|
|Percent raw material in total assets||−0.120240||−0.430495||0.139923||−0.141176||−0.424059||−0.117921||0.026698|
|Percent WIP in total assets||−0.048543||−0.251299||0.164302||0.160983||0.282957||−0.044589||−0.000473|
|Percent finished products in total assets||−0.062609||−0.324119||0.211913||0.207631||0.364950||−0.057519||0.022279|
|Percent WIP and finished products in total assets||−0.071814||−0.371772||0.243069||0.238158||0.418607||−0.065970||−0.072968|
|Percent merchandise in total assets||0.289968||−0.108389||0.082710||0.230665||−0.308640||−0.196863||−0.167039|
Author Stefana Janićijević contributed to the design and implementation of the research and to the analysis of the results. Authors Vule Mizdraković and Maja Kljajić prepared the sections of the chapter that refer to financial data science and financial reporting: introduction, related work, research methodology, and the analysis and discussion of results. All authors provided critical feedback and helped shape the research, analysis, and manuscript.
number of numerical variables
number of columns
vector of constants
matrix with orthonormal columns (eigenvectors)
matrix with singular vectors
diagonal elements of the matrix
diagonal matrix with the squares of the singular values
rank of the matrix
trace of matrix