Ibis: Scaling the Python Data Experience
Published: 2019-06-28


Ibis 0.5 (September 10, 2015)

Ibis 0.5.0 is released.

Please also sign up for the .

What is Ibis?

Ibis is a new Python data analysis framework with the goal of enabling data scientists and data engineers to be as productive working with big data as they are working with small and medium data today. In doing so, we will enable Python to become a true first-class language for Apache Hadoop, without compromises in functionality, usability, or performance. Having spent much of the last decade improving the usability of the single-node Python experience (with pandas and other projects), we are looking to achieve:

  • 100% Python end-to-end user workflows

  • Native hardware speeds for a broad set of use cases

  • Full-fidelity data analysis without extractions or sampling

  • Scalability for big data

  • Integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, and so on)

Ibis Vision and Roadmap

Ibis is being designed to take advantage of architectural synergies with the that will enable high performance Python at massive scale without serialization or other interface bottlenecks. Specifically, we have on the roadmap:

  • Support for Impala’s forthcoming complex types: lists, maps, and structs as first-class value types.

  • Fast Python API for a canonical in-memory columnar data format being developed for Impala and to be standardized amongst software components.

  • Enabling interpreted Python user-defined functions to run on Impala nodes and perform computations directly on columnar data in shared memory, without any need for deserialization. This will enable users to leverage the existing Python data ecosystem, both tools and libraries, at performance and scale never seen before.

  • Expanding the useful set of Python that can be translated to LLVM IR to achieve true native performance at scale on complex data within Impala.

  • Exposing machine learning functionality already available in MADlib.

The current version of Ibis already includes a great deal of useful big data functionality, putting Impala, the open source interactive SQL-on-Hadoop engine, right at your fingertips in Python:

  • A pandas-like data expression system providing comprehensive coverage of the functionality already provided by Impala. It is composable and semantically complete; if you can write it with SQL, you can write it with Ibis, often with substantially less code. This even includes such tricky relational data concepts as:

    • Window functions

    • Correlated and uncorrelated subqueries

    • Self-joins

  • High-level analytics tools like bucketing, top-k, histogram, and value_counts.

  • Tools for performing computations directly on datasets in HDFS, hiding the low-level details of Impala for accessing such data.

  • Tools to simplify interactions with HDFS

  • Interoperability with pandas: executing expressions returns pandas objects, and pandas objects can be written back to HDFS (experimental).
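To make the analytics tools in the list above concrete, here is a minimal plain-Python sketch of what value_counts, top-k, and bucketing compute. It does not use Ibis or Impala; the helper names `value_counts`, `top_k`, and `bucket` and the sample data are hypothetical and exist only for illustration. In Ibis, the equivalent work would be expressed against a table and pushed down to the engine rather than run in local memory.

```python
from bisect import bisect_right
from collections import Counter


def value_counts(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def top_k(values, k):
    """Return the k most frequent values with their counts."""
    return Counter(values).most_common(k)


def bucket(value, edges):
    """Return the index of the half-open bucket [edges[i], edges[i+1])
    that contains value, or None if value falls outside all buckets."""
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


# Hypothetical sample data, not taken from the announcement.
cities = ["NYC", "SF", "NYC", "LA", "NYC", "SF"]
delays = [5, 10, 29, 30]

print(value_counts(cities))
print(top_k(cities, 2))
print([bucket(d, [0, 10, 20, 30]) for d in delays])
```

The point of the sketch is only the semantics: each of these is a small aggregation that a SQL engine can evaluate at scale, which is why they make natural one-line helpers on top of an expression system.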

Ibis can also support other compute engines, including SQL databases like PostgreSQL, because Ibis’s data expressions are decoupled from the Impala expression executor/compiler. We welcome community contributions to integrate Ibis with other backend systems. Keep in mind that it is a design goal of Ibis to hide as much backend complexity as possible.
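The decoupling described above can be illustrated with a toy sketch: an expression is built once as plain data, and separate backends decide how to evaluate it. This is not Ibis's actual implementation or API; the classes `Column`, `Compare`, and `Filter` and the functions `to_sql` and `run_local` are hypothetical names invented for this example, and only a single `>` comparison is supported.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str

    def __gt__(self, other):
        # Build an expression node instead of comparing immediately.
        return Compare(self, ">", other)


@dataclass(frozen=True)
class Compare:
    left: Column
    op: str
    right: object


@dataclass(frozen=True)
class Filter:
    table: str
    predicate: Compare


def to_sql(expr, quote='"'):
    """Backend 1: compile the expression to a SQL string.
    `quote` is the identifier quote character of the target dialect."""
    pred = expr.predicate
    col = f"{quote}{pred.left.name}{quote}"
    table = f"{quote}{expr.table}{quote}"
    return f"SELECT * FROM {table} WHERE {col} {pred.op} {pred.right!r}"


_OPS = {">": operator.gt}


def run_local(expr, rows):
    """Backend 2: evaluate the same expression over in-memory dict rows."""
    pred = expr.predicate
    return [r for r in rows if _OPS[pred.op](r[pred.left.name], pred.right)]


# One expression, built without reference to any backend.
expr = Filter("flights", Column("delay") > 15)

print(to_sql(expr, quote="`"))   # backtick-quoted identifiers, as Impala accepts
print(to_sql(expr, quote='"'))   # double-quoted identifiers, as PostgreSQL accepts
print(run_local(expr, [{"delay": 5}, {"delay": 30}]))
```

Because the expression tree carries no backend-specific detail, adding a new engine in this sketch means writing one more function that walks the same nodes, which is the property that makes community-contributed backends feasible.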

Copyright 2015, Cloudera, Inc.

Reposted from: https://my.oschina.net/u/2306127/blog/600302
