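# Recommendations for processing exported data

## Data processing recommendations

We recommend processing the exported data with Hive or Spark. If you need to import the data into your own BI platform, you may also have to convert it from CSV into whatever format your pipeline requires. The sections below give recommendations for both cases.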

{% hint style="info" %}
Do not split lines on every comma. In the CSV format, only commas outside quotation marks act as delimiters.
{% endhint %}
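
To illustrate the warning above, here is a minimal sketch in plain Java of a splitter that treats only commas outside double quotes as delimiters. The class name `CsvLineSplitter` and the sample line are hypothetical, and the sketch assumes a field never spans multiple lines; for real data, a dedicated CSV library is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineSplitter {
    /** Splits one CSV line on commas that are outside double quotes. */
    public static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // a doubled quote inside quotes is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are ordinary characters
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample line: the second field contains a comma.
        String line = "u1,\"Beijing, China\",page";
        System.out.println(line.split(",").length);     // naive split: prints 4
        System.out.println(splitCsvLine(line).size());  // quote-aware split: prints 3
    }
}
```

The naive `split(",")` breaks the quoted field in two, which shifts every column after it.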

## Processing options

{% tabs %}
{% tab title="Spark" %}
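After downloading the data, we recommend placing the compressed files on HDFS in a directory structure organized by date, so that all data for the same hour or the same day sits in one directory. You can then use Spark Streaming's `fileStream` API to monitor the root directory and read the contents of files that change.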

```
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
```
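
The date-based layout described above could be produced by a small helper like the following sketch. The class name `ExportPaths`, the root path `/data/export`, and the `yyyyMMdd/HH` naming scheme are all assumptions made for illustration; any consistent per-day or per-hour scheme would serve the same purpose.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ExportPaths {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("HH");

    /** Directory holding all files for one day, e.g. root/20160801. */
    public static String dailyDir(String root, LocalDateTime ts) {
        return root + "/" + DAY.format(ts);
    }

    /** Directory holding all files for one hour, e.g. root/20160801/09. */
    public static String hourlyDir(String root, LocalDateTime ts) {
        return dailyDir(root, ts) + "/" + HOUR.format(ts);
    }

    public static void main(String[] args) {
        // Hypothetical timestamp of an export file.
        LocalDateTime ts = LocalDateTime.of(2016, 8, 1, 9, 30);
        System.out.println(hourlyDir("/data/export", ts)); // prints /data/export/20160801/09
    }
}
```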

Add the following dependency:

```
groupId: com.databricks
artifactId: spark-csv_2.10
version: 1.4.0
```

For details on working with the data, see spark-csv (<https://github.com/databricks/spark-csv>).
{% endtab %}

{% tab title="Hive" %}
See Hive's support for working with CSV data: <https://cwiki.apache.org/confluence/display/Hive/CSV+Serde>

This approach has not been tested yet.

```
create external table xxx
```

{% endtab %}
{% endtabs %}

## Converting the data format

Using Java as an example:

Create a new Maven project and add the following dependencies to `pom.xml`:

```
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-compress</artifactId>
      <version>1.12</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-csv -->
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-csv</artifactId>
      <version>1.4</version>
    </dependency>
```

Then, in the method that reads the data:

```
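    // Requires: java.io.*, java.nio.charset.StandardCharsets,
    // org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream,
    // org.apache.commons.csv.CSVFormat, org.apache.commons.csv.CSVRecord

    // Decompress the gzip file and parse it as CSV (UTF-8 encoding is assumed here).
    try (Reader reader = new InputStreamReader(
            new GzipCompressorInputStream(new BufferedInputStream(new FileInputStream("data/test.gz"))),
            StandardCharsets.UTF_8)) {
        Iterable<CSVRecord> records = CSVFormat.DEFAULT.parse(reader);
        for (CSVRecord record : records) {
            System.out.println(record);
        }
    }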
```

The example above relies on the commons-compress and commons-csv libraries to read the data. Python has similar libraries for this kind of processing.
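
Once records are parsed, converting them to another format is mostly a matter of re-serializing the fields. The following sketch writes one record as a tab-separated line using only the Java standard library. The class name `TsvConverter` and the backslash-escaping convention for tabs and line breaks are assumptions; adjust them to whatever your target system expects.

```java
import java.util.StringJoiner;

public class TsvConverter {
    /** Joins already-parsed fields into one tab-separated line. */
    public static String toTsvLine(Iterable<String> fields) {
        StringJoiner joiner = new StringJoiner("\t");
        for (String field : fields) {
            // Assumed convention: escape backslash, tab, and line breaks
            // so that each record stays on a single physical line.
            joiner.add(field
                    .replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Hypothetical record whose second field contains a comma.
        Iterable<String> record = java.util.Arrays.asList("u1", "Beijing, China", "page");
        System.out.println(toTsvLine(record));
    }
}
```

Because a commons-csv `CSVRecord` is an `Iterable<String>`, it can be passed to `toTsvLine` directly inside the loop that reads the records.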

