As languages for data computations, both esProc and R language have rich functions to process text files. They have many similarities in basic usage, as well as obvious differences, such as in the aspect of processing files with fixed column width and big text files, reading and writing designated columns, computational performance, etc. The article aims to compare their similarities and differences.
As languages for data computations, both esProc and R language have rich functions to process text files. They have many similarities in basic usage, as well as obvious differences, such as in the aspect of processing files with fixed column width and big text files, reading and writing designated columns, computational performance, etc. The article aims to compare their similarities and differences.
1.Comparison of basic functions
There are six columns in sales.txt, they are separated from each other by tab \t. Lines are separated from each other by line break\n. The first row contains column names. Read the file into the memory and write it anew. The first rows of the file are as follows:
2. Reading files with fixed column width
In some files, fixed width, instead of separators, is used to differentiate one column from another. For example, read file static.txt which contains three columns of data into the memory and modify column names respectively to col1, col2 and col3, among which the width of col1 is 1, that of col2 is 4 and that of col3 is 3.
3. Reading and writing designated columns
Sometimes only someof the data columns are needed in order to save memory and enhance performance. In this example, read columns ORDERID, CLIENT and AMOUNT into the memory and write ORDERID and AMOUNT to a new file.
Conclusion:R language can only read all columns into the memory, which will occupy a relatively large memory.
4.Processing big text files
Big text files are files whose sizes are bigger than memory size. Usually they are processed by reading and computing in batches. For example, in big text file sales.txt, filter data according to the condition Amount>2000 and sum up Amount of each SellerID.
5. Processing big text files in parallel
Parallel computing can make full use of the resource of multi-core CPU and significantly improve computational performance.
The example in the above part is still used here, but parallel computing is used. That is, divide sales.txt into four segments to give four CPU cores to perform computations, then filter data according to the condition Amount>2000 and compute the total sales amount of each seller.
6. Computational performance
Under the same test circumstance, use esProc and R language to read a file of 1G size, and summarize one of the fields.
Summary:
Leave a Review